Intuitively, the prototypical lexical unit is a word. This definition has a number of catches, however, because the notion of word is not as simple as it seems, and because lexical phrases (idioms) also exist. The intuitive notion of word has ``fuzzy edges'', as in the following cases:
Pronunciation-oriented phonological word subunits include syllables and their parts; phonological subunits do not necessarily correspond closely with morphological subunits.
The central meaning for the purpose of spoken language lexica will be taken to be the morphological word.
Lexical units (entries, items) are assigned sets of properties; these identify the lexical units as signs, and determine the organisation of the lexicon. In practical contexts, the choice of lexical unit and the definition of priorities among its properties may be important for procedural reasons, i.e. in determining the ways in which a lexicon may be most easily accessed: through orthography, pronunciation, meaning, syntactic properties, or morphological properties (stem, inflection). The application-driven decision on the kind of lexical unit which is most suitable for a given purpose is a non-trivial one. However, for many practical purposes fairly straightforward guidelines can be given:
It has already been noted that fully inflected form lexica and lexical databases are fairly standard for speech recognition. Where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (as with most current applications in speech synthesis and recognition), fully inflected word forms are listed. This procedure is most convenient in languages with very small inflectional paradigms; for languages of the agglutinative type, in which large numbers of inflectional endings are concatenated, the procedure rapidly becomes intractable. In other applications, too, such as speech synthesis, it may be more tractable to generate fully inflected word forms from stems and endings.
An example of a language with few inflections is English, where (except for a few pronouns) only nouns and verbs are inflected, and even here only three forms exist for nouns (uninflected, genitive and plural) and four for verbs (uninflected, third person singular present, past, and present participle; irregular verbs in addition have a distinct past participle form, and the verb to be is, as always, an extreme case). English is therefore not a good example for illustrating inflectional morphology (in other areas of morphology, i.e. in word formation, languages appear to be equally complex).
French is much more complex, with inflections on adjectives, and large verb paradigms; note that orthographic inflection in French has more inflectional endings than are distinguished in phonological inflection.
German also has complex inflectional morphology, with significantly more endings on all articles, pronouns, nouns, adjectives and verbs, increasing the size of the vocabulary over the size of a stem-oriented lexicon by a factor of about 4.
In extremely highly inflecting languages such as Finnish, the number of endings and the length of sequences of endings multiply out to increase the vocabulary by a factor of over 1000. Special morphological techniques have been developed (e.g. two-level morphology) to permit efficient calculation of inflected forms and to avoid a finite but unmanageable explosion of lexicon size for highly inflecting languages [Koskenniemi (1983), Karttunen (1983)]. These techniques have so far not been applied to any significant extent in speech technology [but cf. Althoff et al. (1996)].
The figures cited refer only to the sets of forms. When the form-function mapping, i.e. the association of a given inflected form with a morphosyntactic category, is considered, the figures become much worse. A single inflected adjective form such as guten in German has 44 possible interpretations which are relevant for morphosyntactic agreement contexts [Gibbon (1995)], with 13 feminine readings, 17 masculine readings, and 14 neuter readings, depending on different cases (nominative, accusative, genitive and dative) and different determiner (article) categories (strong, weak and mixed). It is possible to reduce the size of these sets by means of default-logic abbreviations in a lexical database, but for efficient processing, they ultimately need to be multiplied out. Similar considerations apply to other word categories, and to other highly inflecting languages.
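The multiplying out of form-function mappings can be illustrated with a minimal Python sketch. It assumes the usual textbook paradigm of German adjective endings (the tables SINGULAR and PLURAL are supplied here, not taken from the cited work), and it assumes that each plural reading is counted once per gender, a counting convention chosen because it reproduces the figures cited above. The function name readings is hypothetical.

```python
from collections import Counter
from itertools import product

CASES = ("nom", "acc", "dat", "gen")
GENDERS = ("masc", "fem", "neut")

# Assumed textbook paradigm: singular endings per determiner category
# and gender, listed in CASES order.
SINGULAR = {
    "strong": {"masc": ("er", "en", "em", "en"),
               "fem":  ("e", "e", "er", "er"),
               "neut": ("es", "es", "em", "en")},
    "weak":   {"masc": ("e", "en", "en", "en"),
               "fem":  ("e", "e", "en", "en"),
               "neut": ("e", "e", "en", "en")},
    "mixed":  {"masc": ("er", "en", "en", "en"),
               "fem":  ("e", "e", "en", "en"),
               "neut": ("es", "es", "en", "en")},
}

# Plural endings do not vary with gender.
PLURAL = {
    "strong": ("e", "e", "en", "er"),
    "weak":   ("en", "en", "en", "en"),
    "mixed":  ("en", "en", "en", "en"),
}


def readings(stem, form):
    """List every (category, gender, number, case) reading of a form.

    Assumption: a plural reading is counted separately for each gender.
    """
    found = []
    for category, gender in product(SINGULAR, GENDERS):
        paradigms = (("sg", SINGULAR[category][gender]),
                     ("pl", PLURAL[category]))
        for number, endings in paradigms:
            for case, ending in zip(CASES, endings):
                if stem + ending == form:
                    found.append((category, gender, number, case))
    return found


guten = readings("gut", "guten")
by_gender = Counter(gender for _, gender, _, _ in guten)
print(len(guten), dict(by_gender))
```

Under these assumptions the enumeration yields 44 readings for guten, split 17 masculine, 13 feminine and 14 neuter, which matches the figures in the text.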
Complex inflectional properties in many languages other than English imply that, for these languages, large vocabulary systems with complex grammatical constructions require prohibitively large fully inflected form inventories. Although the sets of mappings involved can be very large, the inflectional systems of languages define a finite number of variants for each stem, and therefore it may make sense in complex applications in speech recognition to define a rule-based ``virtual lexical database'' or ``virtual lexicon'' which constructs or analyses each fully inflected word form on demand using a stem or morph lexicon with a morphological rule component [Althoff et al. (1996), Bleiching et al. (1996), Geutner (1995)].
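The idea of a virtual lexicon can be sketched in a few lines of Python. This is an illustration only, not the design of any of the cited systems: the stem lexicon STEMS, the rule table ENDINGS, the category labels and the functions generate and analyse are all hypothetical, and the rules use plain concatenation without any spelling or pronunciation adjustments.

```python
from typing import Dict, List, Tuple

# Hypothetical stem lexicon: stem -> inflection class.
STEMS: Dict[str, str] = {"walk": "verb", "talk": "verb", "book": "noun"}

# Hypothetical rule component: inflection class -> (category, ending) pairs.
ENDINGS: Dict[str, List[Tuple[str, str]]] = {
    "verb": [("base", ""), ("3sg", "s"), ("past", "ed"), ("prespart", "ing")],
    "noun": [("sg", ""), ("pl", "s")],
}


def generate(stem: str) -> Dict[str, str]:
    """Construct all inflected forms of a stem on demand."""
    return {cat: stem + ending for cat, ending in ENDINGS[STEMS[stem]]}


def analyse(form: str) -> List[Tuple[str, str]]:
    """Return every (stem, category) pair that yields the given form."""
    results = []
    for stem, infl_class in STEMS.items():
        for cat, ending in ENDINGS[infl_class]:
            if stem + ending == form:
                results.append((stem, cat))
    return results


print(generate("walk"))
print(analyse("books"))
```

No fully inflected form is stored; each one is either built from its stem or traced back to it when requested. A realistic rule component would also have to handle stem alternations and ambiguous analyses.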
Lexica based on the morphological parts of words, coupled with lexical rules for defining the composition of words from these parts, are not widely used in current speech recognition practice. They are useful, however, in expanding lexica of attested forms to include all fully inflected forms, for instance for word generation and speech synthesis, and in tools which verify the consistency of corpus transcriptions and lexica.
Terminology in this area is somewhat variable. In the most general usage, a stem is any uninflected item, whether morphologically simple or complex. However, intermediate stages in word formation by affixation, and in the inflection of highly inflected languages, are also called stems. The smallest stem is a phonological lexical morph or an orthographic lexical morph, i.e. the phonological or orthographic realisation of a lexical morpheme. Since stems may vary in different inflectional contexts, as affixes do, it is necessary to include information about the morphophonological (and morphographemic) alternations of such morphemes:
Knife:
<surface phonology singular> = naIf
<surface phonology plural> = naIv + z
<surface orthography singular> = knife
<surface orthography plural> = knive + s.
The use of morphological decomposition of the kind illustrated here has been demonstrated to bring some advantages in medium-size vocabulary speech recognition in German [Geutner (1995)]; for languages like English, with a low incidence of inflections, the advantage is minimal.
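The knife entry above can be held as a simple data structure, sketched here in Python. The attribute values are those of the entry; the representation as a dictionary keyed by (level, number) pairs and the helper names realise and stem_alternants are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Entry = Dict[Tuple[str, str], List[str]]

# The knife entry: each attribute path maps to a sequence of morphs.
KNIFE: Entry = {
    ("phonology", "singular"): ["naIf"],
    ("phonology", "plural"): ["naIv", "z"],
    ("orthography", "singular"): ["knife"],
    ("orthography", "plural"): ["knive", "s"],
}


def realise(entry: Entry, level: str, number: str) -> str:
    """Concatenate the morphs of one surface form."""
    return "".join(entry[(level, number)])


def stem_alternants(entry: Entry, level: str) -> List[str]:
    """Collect the distinct stem morphs used at one level of representation."""
    return sorted({morphs[0] for (lvl, _), morphs in entry.items() if lvl == level})


print(realise(KNIFE, "orthography", "plural"))
print(stem_alternants(KNIFE, "phonology"))
```

Keeping the morphs separate until a surface form is requested makes the stem alternation explicit in the entry itself.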
In a stem lexicon, the basic lexical key or lemma is the stem, which is represented in some kind of normalised notation. The most common kind of normalised or canonical notation has the following two properties:
For specific purposes, in which lexical entries need to be accessed on the basis of a specific property, indexing based, for instance, on the canonical phonemic representation, either of a fully inflected form or of the canonical inflected form, or even of the stem itself, may be required; for stochastic language models, for example, a tree-coded representation may be the optimal representation (see Chapter 7). Phonemic representation is dealt with in more detail below.
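One common reading of a tree-coded representation is a prefix tree (trie) over phoneme sequences, in which entries that share an initial phoneme sequence share nodes. The Python sketch below is written under that assumption and does not describe the representation in Chapter 7; the class TrieNode, the functions build_trie, lookup and count_nodes, and the segmentation of the two example pronunciations into phoneme symbols are all hypothetical.

```python
from typing import Dict, List, Sequence


class TrieNode:
    """One node of a phoneme prefix tree."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entries: List[str] = []  # lexical keys whose pronunciation ends here


def build_trie(lexicon: Dict[str, Sequence[str]]) -> TrieNode:
    """Insert each pronunciation, sharing nodes for common prefixes."""
    root = TrieNode()
    for key, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, TrieNode())
        node.entries.append(key)
    return root


def lookup(root: TrieNode, phonemes: Sequence[str]) -> List[str]:
    """Return the lexical keys stored under an exact phoneme sequence."""
    node = root
    for phoneme in phonemes:
        if phoneme not in node.children:
            return []
        node = node.children[phoneme]
    return list(node.entries)


def count_nodes(node: TrieNode) -> int:
    """Count all nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Assumed segmentation of the two forms of the knife example.
PRONUNCIATIONS = {"knife": ["n", "aI", "f"], "knives": ["n", "aI", "v", "z"]}
root = build_trie(PRONUNCIATIONS)
print(lookup(root, ["n", "aI", "f"]), count_nodes(root))
```

The two pronunciations contain seven phoneme tokens in total but need only five nodes, because the initial sequence n aI is stored once.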
As in the knife example, one particular form, for instance orthographic, of an entry is often used as a headword or lemma. From a technical lexicographic point of view, this form then has a dual function:
In spoken language lexicography this distinction is central, because the primary criterion of access by word form is phonological; ignoring it may lead to confusion.
When homographs occur (e.g. bank as a financial institution or as the side of a river), additional consecutive numbering is used, e.g. bank1, bank2, etc.
The concept of an abstract lemma, deriving from recent developments in computational linguistics and their application to phonology and prosody, may be used in order to clarify the distinction [Gibbon (1992a)]: an abstract lemma may have any convenient unique name or number (or indeed be labelled by the spelling of the canonical inflected form, as already noted); all properties have equal status, so that the abstract lemma is neutral with respect to different types of lexical access, through spelling, pronunciation, semantics, etc. The examples of lexical entries given so far are based on the concept of an abstract lemma. The neutrality of the abstract lemma with respect to particular properties and particular directions of lexical access makes it suitable as a basic concept for organising flexible lexical databases. A lexicon based on a neutral abstract lemma concept is the basic form of a declarative lexicon, in which the structure of the lexicon is not dictated by requirements of specific types of lexical access (the characteristic of a procedural lexicon), but by general logical principles. The distinction between declarative and procedural lexica is a relative one, however, which is taken up in the section on spoken language lexicon architectures. For practical applications, a lexical database will need to be procedurally optimised (= indexed) for fast access.
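The relation between a declarative lexicon and its procedural indexing can be sketched in Python as follows. The lemma identifiers (lex001 and so on), the property names and the function build_index are hypothetical; the entries reuse the bank and knife examples from the text.

```python
from collections import defaultdict
from typing import Dict, List

# Declarative lexicon: arbitrary unique lemma names, all properties equal in status.
LEXICON: Dict[str, Dict[str, str]] = {
    "lex001": {"orthography": "bank", "gloss": "financial institution"},
    "lex002": {"orthography": "bank", "gloss": "side of a river"},
    "lex003": {"orthography": "knife", "phonology": "naIf"},
}


def build_index(lexicon: Dict[str, Dict[str, str]], prop: str) -> Dict[str, List[str]]:
    """Derive an access index for one property (the procedural optimisation)."""
    index: Dict[str, List[str]] = defaultdict(list)
    for lemma_id, properties in lexicon.items():
        if prop in properties:
            index[properties[prop]].append(lemma_id)
    return dict(index)


by_spelling = build_index(LEXICON, "orthography")
by_pronunciation = build_index(LEXICON, "phonology")
print(by_spelling["bank"])
print(by_pronunciation)
```

The lexicon itself privileges no property; an index for access by spelling or by pronunciation is derived from it when needed. The two homographs remain distinct entries because their abstract lemma names differ.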