Lexical information is often regarded as a heterogeneous collection of idiosyncratic information about lexical items. An assumption such as this makes it hard to discuss lexical information systematically and, moreover, from the point of view of contemporary lexicography, it is wrong. For this reason, a simple unifying informal model of lexical signs, related to a view which is current in computational linguistics and computational lexicography, is used for the purpose of further discussion.
In general terms, a sign is a unit of communication with identifiable form and meaning. Lexical signs have specific ranks, such as word or phrase (for phrasal idioms), and include words, phrasal idioms and other items such as dialogue control particles (er, uhm, aha etc.). It may also be argued that even smaller units such as morphemes have sign structure. Lexical signs thus range, in principle, over fully inflected word forms, morphs (roots, affixes), stems (roots or stems to which affixation has applied), lemmas (or lemmata), often represented by an orthographic form, and phrasal items (idioms).
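Lexical signs are characterised by the following four basic types of information: surface properties (orthographic and phonetic form), semantic properties (meaning), distribution (grammatical category and subcategory), and composition (the internal, morphological make-up of the sign).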
The first two types are referred to as interpretative properties, since they interpret the basic sign representation in terms of the real world of phonetics (or writing) and the real world of meaning, while the second two types are referred to as structural (or syntactic, in a general sense of the term) properties. Complex signs are constructed compositionally from their constituent signs and derive their properties compositionally from these. Non-lexical signs include, for example, freely invented compound words, such as the example given above, or almost any sentence in this book.
The following sections will be devoted to the four main types of lexical information, referring to them as surface, content, grammatical and morphological information, respectively.
In the examples given below, a basic computer-readable attribute-value syntax is used, based on the kind of spoken language lexical representation in DATR used by [Andry et al. (1992)]. The name of the lexical sign (which is not necessarily its orthography) is written with an initial upper case letter and followed by a colon. Attribute names can be either word-like atoms or sequences of atoms (in the latter case, permitting an indirect representation of more complex attribute structures); they are enclosed in angle brackets and separated from their values by an equality sign. The lexical sign is terminated by a period. The SAMPA notation used below is defined in Appendix B; see also Chapter 5.
Table:
<surface orthography> = table
<surface phonetics sampa> = teIbl
<semantics> = artefactual horizontal surface
<distribution> = noun common countable
<composition> = simplex z_plural.
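The notation just described can be read mechanically. The following is a minimal Python sketch of such a reader, not part of the original DATR tooling: the function name parse_lexical_sign is hypothetical, and the sketch assumes that values never contain the character "<" and that the only entry-final punctuation is the terminating period.

```python
import re

TABLE_ENTRY = """
Table:
<surface orthography> = table
<surface phonetics sampa> = teIbl
<semantics> = artefactual horizontal surface
<distribution> = noun common countable
<composition> = simplex z_plural.
"""

def parse_lexical_sign(text):
    """Parse one entry 'Name: <path> = value ... <path> = value.'

    Returns the sign name and a dict mapping attribute paths
    (tuples of atoms) to value strings.
    """
    text = text.strip()
    if not text.endswith("."):
        raise ValueError("a lexical sign must be terminated by a period")
    # The name is everything before the first colon.
    name, _, body = text[:-1].partition(":")
    attributes = {}
    # Each equation is '<atom atom ...> = value'; a value runs up to the
    # next '<' or to the end of the entry (assumption of this sketch).
    for path, value in re.findall(r"<([^>]*)>\s*=\s*([^<]*)", body):
        # Normalise whitespace (values may wrap over lines), drop quotes.
        attributes[tuple(path.split())] = " ".join(value.split()).strip("'")
    return name.strip(), attributes

name, attributes = parse_lexical_sign(TABLE_ENTRY)
print(name)
print(attributes[("surface", "phonetics", "sampa")])
```

Representing an attribute name as a tuple of atoms keeps the "sequence of atoms" reading explicit, so that all attributes beginning with surface can be selected by their first element.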
In the case of complex signs, the meaning of the sign is a function of the meanings of its parts and the pronunciation of the sign is a function of the pronunciations of its parts. In lexical signs, these functions may be partly idiosyncratic; this is shown in the pronunciation and meaning of words like English "dustman":
Dustman:
<surface orthography> = dustman
<surface phonetics sampa> = dVsm@n
<semantics> = 'municipal garbage collector'.
The pronunciation and meaning of this complex lexical sign are not in all respects a general compositional function of its parts: for example, the pronunciation of dustman is not /dʌstmæn/ but /dʌsmən/, nor is a dustman necessarily only concerned with dust:
Dust:
<surface phonetics sampa> = dVst
<semantics> = 'just visible particles of solid matter'.
Man:
<surface phonetics sampa> = m{n
<semantics> = 'male adult human being'.
In contrast, the spelling and the distribution of the complex sign are perfectly regular functions of the spellings of the parts and the distribution of the head (i.e. Man) of the sign, respectively.
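The contrast between regular and idiosyncratic properties can be made concrete with a small Python sketch. It is illustrative only: the Sign class and the compose function are hypothetical names, plain concatenation is assumed as the "regular" function for spelling and pronunciation, and the distribution values for Dust and Man are assumptions, since the entries above do not list them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sign:
    orthography: str
    sampa: str
    distribution: str

def compose(modifier: Sign, head: Sign) -> Sign:
    """Predict a compound's properties purely compositionally."""
    return Sign(
        # Spelling: concatenation of the spellings of the parts.
        orthography=modifier.orthography + head.orthography,
        # Pronunciation: naive concatenation (the regular prediction).
        sampa=modifier.sampa + head.sampa,
        # Distribution: taken over from the head of the compound.
        distribution=head.distribution,
    )

DUST = Sign("dust", "dVst", "noun common uncountable")  # distribution assumed
MAN = Sign("man", "m{n", "noun common countable")       # distribution assumed
LISTED_DUSTMAN_SAMPA = "dVsm@n"                         # from the Dustman entry

predicted = compose(DUST, MAN)
print(predicted.orthography)                    # dustman
print(predicted.sampa)                          # dVstm{n
print(predicted.sampa == LISTED_DUSTMAN_SAMPA)  # False
```

The predicted spelling matches the listed one, while the predicted pronunciation does not, which is why the pronunciation has to be stored with the complex sign itself.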
In perfectly regular cases, there would therefore be no necessity to include complex words in the lexicon. Such cases are practically non-existent, however, since complex words are in general partially idiosyncratic; in a comprehensive spoken language lexicon, both complex words and their parts therefore need to be included. For most current practical purposes, in which potential words (unknown words or ad hoc word formations) do not need to be treated in addition to actual words (those contained in a lexicon), complex words can be listed in full as unanalysed forms.
Modern computational lexicographic practice attempts to reduce the redundancy in a lexicon as far as possible: fully regular information in compounds can be inherited from the parts of the compounds, while idiosyncratic information is specified locally. In a case like this, a lexical class is specified for defining the structure of compounds, and "inheritance pointers" are included. The result is a hierarchical lexicon structure, in which macro-like cross-references are made to other lexical signs (analogous to cross-references in conventional dictionaries), but also to whole classes of lexical signs (archi-signs).
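One way to picture such a hierarchical lexicon is the following Python sketch. It is an assumption-laden illustration rather than the DATR mechanism itself: the names LexicalSign and compound_class are hypothetical, the distribution value given for Man is assumed, and inheritance is reduced to a simple lookup order (local value first, then the lexical class, which in turn consults the parts).

```python
class LexicalSign:
    def __init__(self, name, local=None, lexical_class=None, parts=()):
        self.name = name
        self.local = dict(local or {})      # idiosyncratic, locally listed
        self.lexical_class = lexical_class  # pointer to a class (archi-sign)
        self.parts = parts                  # pointers to constituent signs

    def get(self, attribute):
        # Locally specified information always takes precedence.
        if attribute in self.local:
            return self.local[attribute]
        # Otherwise the lexical class supplies the regular value.
        if self.lexical_class is not None:
            return self.lexical_class(self, attribute)
        raise KeyError(f"{self.name} has no value for {attribute!r}")

def compound_class(sign, attribute):
    """Hypothetical class for modifier + head compounds."""
    modifier, head = sign.parts
    if attribute == "orthography":
        # Regular spelling: concatenation of the parts' spellings.
        return modifier.get(attribute) + head.get(attribute)
    if attribute == "distribution":
        # Regular distribution: inherited from the head.
        return head.get(attribute)
    raise KeyError(f"{sign.name} has no value for {attribute!r}")

dust = LexicalSign("Dust", {"orthography": "dust", "sampa": "dVst"})
man = LexicalSign("Man", {"orthography": "man", "sampa": "m{n",
                          "distribution": "noun common countable"})  # assumed
dustman = LexicalSign(
    "Dustman",
    # Only the idiosyncratic properties are listed locally.
    {"sampa": "dVsm@n", "semantics": "municipal garbage collector"},
    lexical_class=compound_class,
    parts=(dust, man),
)

print(dustman.get("orthography"))   # dustman (inherited via the class)
print(dustman.get("sampa"))         # dVsm@n (specified locally)
```

In this arrangement the entry for Dustman stores only what cannot be predicted, and any change to the entry for Man is automatically reflected in every compound that points to it.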