Lexical units


Intuitively, the prototypical lexical unit is a word. This definition has a number of catches, however, because the notion of word is not as simple as it seems, and because lexical phrases (idioms) also exist. The intuitive notion of word has "fuzzy edges", as in the following cases:

1. Words may contain other words (e.g. compound words such as database, Sprachtechnologie).

2. Words differ in status with respect to their phonetic realisations and their meanings: compare function words (e.g. to, for), which have reduced pronunciations and structural meanings, with content words (e.g. word, spell), which refer to real-world objects, properties, event types and abstract concepts.

3. Words may be merged with other words in informal speech (cliticisation). Examples of clitics are English 's in he's - /hi:z/, French l' in il l'a vu - /il la vy:/, and German 'm in auf'm Tisch - /aUfm tIS/.

4. Particular types of word formation such as spelling and acronym formation may require special attention: ecu - /i:kju:/, /i:si:ju:/.

5. Words may be regarded as inflected word forms, which makes sound (singular) and sounds (plural) different words. Alternatively, a word may be regarded as a class of inflectionally related forms (a paradigm), so that sound and sounds belong to the same word. The paradigm may then be identified by a canonical inflected form (e.g. nominative singular), by the stem shared by the forms and identified by linguistic analysis, or by a number or other abstract label. In speech technology, the inflected word form is the standard definition. Standard dictionaries use the paradigm definition of word, represented by a headword or lemma, generally the canonical inflected form (such as the nominative singular) in orthographic representation.

6. Lexical units may need to be larger than the word (e.g. phrasal idioms).

7. Lexical units may need to be smaller than the word. Semantically oriented morphological word subunits (word constituents) include:

word stems (words minus inflections; indivisible word stems are lexical morphemes);
constituent words in words formed by compounding (composition);
constituent prefixes, stems and suffixes in words formed by derivation.

Pronunciation oriented phonological word subunits include syllables and their parts; phonological subunits do not necessarily correspond closely with morphological subunits.

8. Linguistic textbooks distinguish between several different views of words as lexical units, depending on which kind of lexical sign information is regarded as primary:

The phonological word (based on its conformity to the phonotactic structure of a language).
The prosodic word (based on its conformity to the accentuation and rhythm patterns of the language).
The orthographic word (for instance, as delimited by spaces or punctuation marks).
The morphological word (based on the indivisibility and fixed internal structure of words).
The syntactic word (based on its distribution in sentences).

9. The lexical word may be considered as a type, as opposed to an occurrence of the type in a larger unit or a token of the type in a corpus of speech or writing.

The central meaning for the purpose of spoken language lexica will be taken to be the morphological word.
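The difference between counting tokens, inflected word-form types and paradigm-level words (items 5 and 9 above) can be made concrete with a minimal Python sketch. The toy corpus and the `LEMMA_OF` table are invented for illustration only; a real system would obtain lemmas from a morphological analyser or a lexical database.

```python
from collections import Counter

# Hypothetical mapping from inflected word forms to lemmas (paradigm names).
LEMMA_OF = {"sound": "sound", "sounds": "sound", "word": "word", "words": "word"}

def count_units(tokens):
    """Return (token count, word-form type count, lemma type count)."""
    form_types = Counter(tokens)
    # Forms missing from the table are treated as their own lemma.
    lemma_types = Counter(LEMMA_OF.get(t, t) for t in tokens)
    return len(tokens), len(form_types), len(lemma_types)

corpus = "sound sounds word sound words sounds".split()
n_tokens, n_forms, n_lemmas = count_units(corpus)
print(n_tokens, n_forms, n_lemmas)  # 6 4 2
```

The same six tokens therefore count as four words under the inflected-form definition and as two words under the paradigm definition.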

Lexical units (entries, items) are assigned sets of properties; these identify the lexical units as signs, and determine the organisation of the lexicon. In practical contexts, the choice of lexical unit and the definition of priorities among its properties may be important for procedural reasons, i.e. in determining how a lexicon may be most easily accessed: through orthography, pronunciation, meaning, syntactic properties, or morphological properties (stem, inflection). The application-driven decision on which kind of lexical unit is most suitable for a given purpose is a non-trivial one. However, for many practical purposes fairly straightforward guidelines can be given:

The form of a lexical item, in particular its orthography, is often used as the main identifying property for accessing the lexicon.
However, access on phonetic grounds, via the phonological form, is evidently the optimal procedure for speech recognition, and access on conceptual semantic or syntactic grounds is evidently the optimal procedure for speech synthesis.
The use of orthography as an intermediate stage in speech recognition is a useful and widespread heuristic which generally does not introduce significant numbers of artefacts into the mapping from speech signals to lexical items, but is not recommended for complex systems with large vocabularies, except as a means of visualisation in user interfaces.
For text-to-speech applications orthography is likely to be the optimal lexical access key.

Fully inflected form lexica

It has already been noted that fully inflected form lexica and lexical databases are fairly standard for speech recognition. Where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (as with most current applications in speech synthesis and recognition), fully inflected word forms are listed. This procedure is most convenient in languages with very small inflectional paradigms; for languages of the agglutinative type, in which large numbers of inflectional endings are concatenated, the procedure rapidly becomes intractable. In other applications, too, such as speech synthesis, it may be more tractable to generate fully inflected word forms from stems and endings.

An example of a language with few inflections is English, where (except for a few pronouns) only nouns and verbs are inflected, and even here only three forms exist for nouns (uninflected, genitive and plural) and four for verbs (uninflected, third person singular present, past, and present participle; irregular verbs may in addition have a distinct past participle form, and the verb to be is, as always, an extreme case). English is therefore not a good example for illustrating inflectional morphology (in other areas of morphology, i.e. in word formation, languages appear to be equally complex).

French is much more complex, with inflections on adjectives, and large verb paradigms; note that orthographic inflection in French has more inflectional endings than are distinguished in phonological inflection.

German also has complex inflectional morphology, with significantly more endings on all articles, pronouns, nouns, adjectives and verbs, increasing the size of the vocabulary over the size of a stem-oriented lexicon by a factor of about 4.

In extremely highly inflecting languages such as Finnish, the number of endings and the length of sequences of endings multiply out to increase the vocabulary by a factor of over 1000. Special morphological techniques (e.g. two-level morphology) have been developed to permit efficient calculation of inflected forms and to avoid a finite but unmanageable explosion of lexicon size for highly inflecting languages [Koskenniemi (1983), Karttunen (1983)]. These techniques have so far not been applied to any significant extent in speech technology [but cf. Althoff et al. (1996)].

The figures cited refer only to the sets of forms. When the form-function mapping, i.e. the association of a given inflected form with a morphosyntactic category, is considered, the figures become much worse. A single inflected adjective form such as guten in German has 44 possible interpretations which are relevant for morphosyntactic agreement contexts [Gibbon (1995)]: 13 feminine readings, 17 masculine readings, and 14 neuter readings, depending on case (nominative, accusative, genitive and dative) and on determiner (article) category (strong, weak and mixed). It is possible to reduce the size of these sets by means of default-logic abbreviations in a lexical database, but for efficient processing they ultimately need to be multiplied out. Similar considerations apply to other word categories, and to other highly inflecting languages.

Complex inflectional properties in many languages other than English imply that, for these languages, large vocabulary systems with complex grammatical constructions require prohibitively large fully inflected form inventories. Although the sets of mappings involved can be very large, the inflectional systems of languages define a finite number of variants for each stem. It may therefore make sense in complex speech recognition applications to define a rule-based "virtual lexical database" or "virtual lexicon", which constructs or analyses each fully inflected word form on demand using a stem or morph lexicon with a morphological rule component [Althoff et al. (1996), Bleiching et al. (1996), Geutner (1995)].
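As a rough illustration of the virtual lexicon idea, the following Python sketch stores only stems and ending rules and computes inflected forms on demand, in both the generation and the analysis direction. The stem list, the feature names and the functions `generate` and `analyse` are assumptions made for this example. It covers only regular English orthographic inflection with plain concatenation, and it is not the method of the systems cited above.

```python
# Toy stem lexicon: stem -> word class. Entries are illustrative only.
STEMS = {"sound": "noun", "word": "noun", "spell": "verb", "walk": "verb"}

# Regular orthographic endings per word class (irregular forms are ignored).
ENDINGS = {
    "noun": {"singular": "", "plural": "s", "genitive": "'s"},
    "verb": {"base": "", "3sg_present": "s", "past": "ed",
             "present_participle": "ing"},
}

def generate(stem):
    """Construct all inflected forms of a stem on demand."""
    word_class = STEMS[stem]
    return {feature: stem + ending
            for feature, ending in ENDINGS[word_class].items()}

def analyse(form):
    """Return every (stem, word class, feature) analysis of an inflected form."""
    analyses = []
    for stem, word_class in STEMS.items():
        for feature, ending in ENDINGS[word_class].items():
            if stem + ending == form:
                analyses.append((stem, word_class, feature))
    return analyses

print(generate("walk")["past"])  # walked
print(analyse("sounds"))         # [('sound', 'noun', 'plural')]
```

The fully inflected inventory is never stored. The cost is a search at access time, which a practical system would reduce with indexing or finite-state techniques.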

Stem and morph lexica

Lexica based on the morphological parts of words, coupled with lexical rules for defining the composition of words from these parts, are not widely used in current speech recognition practice. They are useful, however, in expanding lexica of attested forms to include all fully inflected forms, for instance for word generation and speech synthesis, and in tools which verify the consistency of corpus transcriptions and lexica.

Terminology in this area is somewhat variable. In the most general usage, a stem is any uninflected item, whether morphologically simple or complex. However, intermediate stages in word formation by affixation, and in the inflection of highly inflected languages, are also called stems. The smallest stem is a phonological lexical morph or an orthographic lexical morph, i.e. the phonological or orthographic realisation of a lexical morpheme. Since stems may vary in different inflectional contexts, as affixes do, it is necessary to include information about the morphophonological (and morphographemic) alternations of such morphemes:

Knife:
  <surface phonology singular>   = naIf
  <surface phonology plural>     = naIv + z
  <surface orthography singular> = knife
  <surface orthography plural>   = knive + s.

The use of morphological decomposition of the kind illustrated here has been demonstrated to bring some advantages in medium size vocabulary speech recognition in German [Geutner (1995)]; for languages like English, with a low incidence of inflections, the advantage is minimal.
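The knife entry above can be mirrored in a small Python structure, given here as a minimal sketch. The dictionary layout and the function `surface` are assumptions made for illustration. The transcriptions are copied from the entry.

```python
# Each level lists the stem alternant and ending for every inflectional context.
KNIFE = {
    "phonology":   {"singular": ("naIf", ""), "plural": ("naIv", "z")},
    "orthography": {"singular": ("knife", ""), "plural": ("knive", "s")},
}

def surface(entry, level, number):
    """Concatenate the stem alternant and the ending for one context."""
    stem, ending = entry[level][number]
    return stem + ending

print(surface(KNIFE, "phonology", "plural"))    # naIvz
print(surface(KNIFE, "orthography", "plural"))  # knives
```

Storing the alternants explicitly keeps the phonological and orthographic levels independent of each other.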

In a stem lexicon, the basic lexical key or lemma is the stem, which is represented in some kind of normalised notation. The most common kind of normalised or canonical notation has the following two properties:

Canonical inflected form: With morphologically inflected items, a "normal form" such as the infinitive for verbs or the nominative singular for nouns is used.
Canonical orthography: A standardised orthographic representation of the canonical inflected form is used.

For specific purposes, in which lexical entries need to be accessed on the basis of a specific property, a dedicated index may be required, based for instance on the canonical phonemic representation of a fully inflected form, of the canonical inflected form, or even of the stem itself. For stochastic language models, for example, a tree-coded representation may be the optimal representation (see Chapter 7). Phonemic representation is dealt with in more detail below.
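One common way to realise a tree-coded index is a prefix tree (trie) over phoneme sequences, in which pronunciations with a shared beginning share a path. The following Python sketch assumes that reading. The class and function names and the SAMPA-style example pronunciations are chosen for illustration, and the sketch does not reproduce the representation discussed in Chapter 7.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # phoneme symbol -> TrieNode
        self.entries = []    # lexical entries whose pronunciation ends here

def insert(root, phonemes, entry):
    """Add an entry under its phoneme sequence, creating nodes as needed."""
    node = root
    for p in phonemes:
        node = node.children.setdefault(p, TrieNode())
    node.entries.append(entry)

def lookup(root, phonemes):
    """Return all entries pronounced exactly as the given phoneme sequence."""
    node = root
    for p in phonemes:
        node = node.children.get(p)
        if node is None:
            return []
    return node.entries

root = TrieNode()
insert(root, ("n", "aI", "f"), "knife")
insert(root, ("n", "aI", "v", "z"), "knives")
insert(root, ("n", "aI", "t"), "night")
insert(root, ("n", "aI", "t"), "knight")

print(lookup(root, ("n", "aI", "t")))  # ['night', 'knight']
print(lookup(root, ("n", "aI")))       # []
```

Homophones end up at the same node, and the shared prefix /n aI/ is stored only once.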

The notion of "lexical lemma"

As in the knife example, one particular form of an entry, for instance the orthographic form, is often used as a headword or lemma. From a technical lexicographic point of view, this form then has a dual function:

It names the entry.
It also represents one of its properties, namely its spelling.

The distinction between these two functions is central, and ignoring it may lead to confusion. This applies particularly in spoken language lexicography, where the primary criterion of access by word form is phonological.

When homographs occur (e.g. bank as a financial institution or as the side of a river), an additional consecutive numbering is used, e.g. bank1, bank2, etc.

The concept of an abstract lemma, deriving from recent developments in computational linguistics and their application to phonology and prosody, may be used in order to clarify the distinction [Gibbon (1992a)]: an abstract lemma may have any convenient unique name or number (or indeed be labelled by the spelling of the canonical inflected form, as already noted); all properties have equal status, so that the abstract lemma is neutral with respect to different types of lexical access, through spelling, pronunciation, semantics, etc. The examples of lexical entries given so far are based on the concept of an abstract lemma. The neutrality of the abstract lemma with respect to particular properties and particular directions of lexical access makes it suitable as a basic concept for organising flexible lexical databases. A lexicon based on a neutral abstract lemma concept is the basic form of a declarative lexicon, in which the structure of the lexicon is not dictated by the requirements of specific types of lexical access (characteristic of a procedural lexicon), but by general logical principles. The distinction between declarative and procedural lexica is a relative one, however, which is taken up in the section on spoken language lexicon architectures. For practical applications, a lexical database will need to be procedurally optimised (= indexed) for fast access.
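The relation between a declarative lexicon built on abstract lemmas and its procedural indexes can be sketched in Python as follows. The `Entry` fields, the lemma identifiers, the glosses and the SAMPA-style transcriptions are assumptions made for this example only.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    lemma_id: str     # arbitrary unique name, not itself a property of the word
    orthography: str
    phonology: str    # SAMPA-style transcription
    gloss: str

# Declarative part: entries whose properties all have equal status.
ENTRIES = [
    Entry("bank1", "bank", "b{Nk", "financial institution"),
    Entry("bank2", "bank", "b{Nk", "side of a river"),
    Entry("knife1", "knife", "naIf", "cutting tool"),
]

def build_index(entries, key):
    """Procedural part: index the entries by one chosen property."""
    index = defaultdict(list)
    for e in entries:
        index[getattr(e, key)].append(e.lemma_id)
    return dict(index)

by_spelling = build_index(ENTRIES, "orthography")
by_sound = build_index(ENTRIES, "phonology")
print(by_spelling["bank"])  # ['bank1', 'bank2']
print(by_sound["naIf"])     # ['knife1']
```

Any property can serve as an access key without changing the entries themselves, which is the sense in which the abstract lemma is neutral with respect to the direction of lexical access.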


EAGLES SWLG SoftEdition, May 1997.
