
Lexical units

 

Kinds of lexical unit

Intuitively, the prototypic lexical unit is a word. This definition has a number of catches to it, however, because the notion of word is not as simple as it seems, and because lexical phrases (idioms) also exist. The intuitive notion of word has "fuzzy edges", as in the following cases:

  1. Words may contain other words (e.g. compound words such as database, Sprachtechnologie).

  2. Words have different status in respect of their phonetic realisations and their meaning; compare the difference between function words, e.g. to, for, with reduced pronunciations and structural meanings, and content words, e.g. word, spell, which refer to real world objects, properties, event types, abstract concepts.

  3. Words may be merged with other words in informal speech (cliticisation). Examples of clitics are English 's in he's - /hi:z/, French l' in il l'a vu - /il la vy:/, German 'm in auf'm Tisch - /aʊfm tɪʃ/.

  4. Particular types of word formation such as spelling and acronym formation may require special attention: ecu - /i:kju:/, /i:si:ju:/.

  5. Words may be inflected word forms, making sound (singular) and sounds (plural) into different words. On the other hand, words may be regarded as a class of inflectionally related forms (a paradigm), i.e. sound and sounds then belong to the same word, which may be characterised by a canonical inflected form (e.g. nominative singular), or by the stem shared by the forms and identified by linguistic analysis, or by a number or other abstract label. In speech technology, the inflected word form is the standard definition. In standard dictionaries, the paradigm definition of word is used, represented by a headword or lemma, generally the canonical inflectional form such as nominative singular, in orthographic representation.

  6. Lexical units may need to be larger than the word (e.g. phrasal idioms).

  7. Lexical units may need to be smaller than the word: Semantically oriented morphological word subunits (word constituents) include

     Pronunciation oriented phonological word subunits include syllables and their parts; phonological subunits do not necessarily correspond closely with morphological subunits.

  8. Linguistic textbooks distinguish between several different views of words as lexical units, depending on which kind of lexical sign information is regarded as primary:

  9. The lexical word as a type, as opposed to an occurrence of the type in larger units, and a token of the type in a corpus of speech or writing.

The central meaning for the purpose of spoken language lexica will be taken to be the morphological word. 
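The difference between counting inflected word forms and counting paradigms as lexical units can be made concrete with a small sketch. The token list and the form-to-lemma table below are invented for illustration only; they reuse the sound/sounds example from the list above.

```python
from collections import Counter

# Hypothetical mini corpus: five tokens of the forms "sound" and "sounds".
TOKENS = ["sound", "sounds", "sound", "sounds", "sounds"]

# Hypothetical table assigning each inflected form to its paradigm (lemma).
LEMMA_OF = {"sound": "sound", "sounds": "sound"}

# Inflected-form definition of word: every distinct form is a separate unit.
form_types = Counter(TOKENS)

# Paradigm definition of word: forms are grouped under a shared lemma.
lemma_types = Counter(LEMMA_OF[token] for token in TOKENS)

print(len(TOKENS), "tokens")
print(len(form_types), "lexical units under the inflected-form definition")
print(len(lemma_types), "lexical unit under the paradigm definition")
```

The same corpus therefore yields different vocabulary sizes depending on which definition of lexical unit is chosen.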

Lexical units (entries, items) are assigned sets of properties; these identify the lexical units as signs, and determine the organisation of the lexicon. In practical contexts, the choice of lexical unit and the definition of priorities among its properties may be important for procedural reasons, i.e. in determining ways in which a lexicon may be most easily accessed: through orthography, pronunciation, meaning, syntactic properties, or via its morphological properties (stem, inflection). The application-driven decision on the kind of lexical unit which is most suitable for a given purpose is a non-trivial one. However, for many practical purposes fairly straightforward guidelines can be given:

Fully inflected form lexica

   

It has already been noted that fully inflected form lexica and lexical databases are fairly standard for speech recognition. Where a small closed vocabulary is used, and new, unknown or ad hoc word formations are not required (as with most current applications in speech synthesis and recognition), fully inflected word forms are listed. This procedure is most convenient in languages with very small inflectional paradigms; for languages of the agglutinative type, in which large numbers of inflectional endings are concatenated, the procedure rapidly becomes intractable. In other applications, too, such as speech synthesis, it may be more tractable to generate fully inflected word forms from stems and endings.

An example of a language with few inflections is English, where (except for a few pronouns) only nouns and verbs are inflected, and even here only three forms exist for nouns (uninflected, genitive and plural) and four for regular verbs (uninflected, third person singular present, past, and present participle); irregular verbs may in addition have a distinct past participle form, and the verb to be is, as always, an extreme case. English is therefore not a good example for illustrating inflectional morphology (in other areas of morphology, i.e. in word formation, languages appear to be equally complex).

French is much more complex, with inflections on adjectives, and large verb paradigms; note that orthographic inflection in French has more inflectional endings than are distinguished in phonological inflection.

German also has complex inflectional morphology, with significantly more endings on all articles, pronouns, nouns, adjectives and verbs, increasing the size of the vocabulary over the size of a stem-oriented lexicon by a factor of about 4.

In extremely highly inflecting languages such as Finnish, the number of endings and the length of sequences of endings multiply out to increase the vocabulary by a factor of over 1000. Special morphological techniques have been developed (e.g. two-level morphology) to permit efficient calculation of inflected forms and to avoid a finite but unmanageable explosion of lexicon size for highly inflecting languages [Koskenniemi (1983), Karttunen (1983)]. These techniques have so far not been applied to any significant extent in speech technology [but cf. Althoff et al. (1996)].
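The way in which sequences of endings "multiply out" can be illustrated with a back-of-envelope calculation. The slot names and slot sizes below are placeholders chosen only to show the arithmetic; they are not measured figures for Finnish or any other language, and the function names are hypothetical.

```python
from math import prod

# Hypothetical ending slots: each value is the number of alternatives in that
# slot, counting "no ending" as one alternative where the slot is optional.
SLOTS = {"number": 2, "case": 15, "possessive": 7, "clitic": 8}

def forms_per_stem(slots: dict) -> int:
    """Number of distinct ending sequences when every slot combines freely."""
    return prod(slots.values())

def fully_inflected_size(n_stems: int, slots: dict) -> int:
    """Size of a fully inflected form lexicon built from n_stems stems."""
    return n_stems * forms_per_stem(slots)

print(forms_per_stem(SLOTS))                 # growth factor per stem
print(fully_inflected_size(10_000, SLOTS))   # entries for a 10,000-stem lexicon
```

Under these assumed slot sizes a stem lexicon grows by a factor of 1680, which is why listing all forms quickly becomes unmanageable even though the total remains finite.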

The figures cited refer only to the sets of forms. When the form-function mapping, i.e. the association of a given inflected form with a morphosyntactic category, is considered, the figures become much worse. A single inflected adjective form such as guten in German has 44 possible interpretations which are relevant for morphosyntactic agreement contexts [Gibbon (1995)], with 13 feminine readings, 17 masculine readings, and 14 neuter readings, depending on different cases (nominative, accusative, genitive and dative) and different determiner (article) categories (strong, weak and mixed). It is possible to reduce the size of these sets by means of default-logic abbreviations in a lexical database, but for efficient processing, they ultimately need to be multiplied out. Similar considerations apply to other word categories, and to other highly inflecting languages.

Complex inflectional properties in many languages other than English imply that, for these languages, large vocabulary systems with complex grammatical constructions require prohibitively large fully inflected form inventories. Although the sets of mappings involved can be very large, the inflectional systems of languages define a finite number of variants for each stem, and therefore it may make sense in complex applications in speech recognition to define a rule-based "virtual lexical database" or "virtual lexicon" which constructs or analyses each fully inflected word form on demand using a stem or morph lexicon with a morphological rule component [Althoff et al. (1996), Bleiching et al. (1996), Geutner (1995)].
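The idea of a virtual lexicon can be sketched as follows. This is a minimal illustration under strong assumptions, not the method of the cited systems: the stem table, the ending table and the function names generate and analyse are all hypothetical, and only regular English endings are covered, using the three noun forms and four verb forms mentioned earlier.

```python
from typing import Dict, List, Tuple

# Hypothetical toy stem lexicon: stem -> word class.
STEMS: Dict[str, str] = {"sound": "noun", "walk": "verb"}

# Hypothetical rule component: regular endings per word class,
# each paired with a morphosyntactic category label.
ENDINGS: Dict[str, List[Tuple[str, str]]] = {
    "noun": [("", "uninflected"), ("s", "plural"), ("'s", "genitive")],
    "verb": [("", "uninflected"), ("s", "3sg present"),
             ("ed", "past"), ("ing", "present participle")],
}

def generate(stem: str) -> List[Tuple[str, str]]:
    """Construct every inflected form of a stem on demand."""
    return [(stem + ending, label) for ending, label in ENDINGS[STEMS[stem]]]

def analyse(form: str) -> List[Tuple[str, str, str]]:
    """Return (stem, word class, category) for each possible analysis."""
    analyses = []
    for stem, word_class in STEMS.items():
        for ending, label in ENDINGS[word_class]:
            if form == stem + ending:
                analyses.append((stem, word_class, label))
    return analyses

print(generate("walk"))
print(analyse("sounds"))
```

No inflected form is stored; each is constructed or analysed when requested, so the stored lexicon stays at the size of the stem inventory.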

Stem and morph lexica

   

Lexica based on the morphological parts of words, coupled with lexical rules for defining the composition of words from these parts, are not widely used in current speech recognition practice. They are useful, however, in expanding lexica of attested forms to include all fully inflected forms, for instance for word generation and speech synthesis, and in tools which verify the consistency of corpus transcriptions and lexica.

Terminology in this area is somewhat variable. In the most general usage, a stem is any uninflected item, whether morphologically simple or complex. However, intermediate stages in word formation by affixation, and in the inflection of highly inflected languages, are also called stems. The smallest stem is a phonological lexical morph or an orthographic lexical morph, i.e. the phonological or orthographic realisation of a lexical morpheme. Since stems may vary in different inflectional contexts, as affixes do, it is necessary to include information about the morphophonological (and morphographemic) alternations of such morphemes:

Knife:
  <surface phonology singular>   = naIf
  <surface phonology plural>     = naIv + z
  <surface orthography singular> = knife
  <surface orthography plural>   = knive + s.

The use of morphological decomposition of the kind illustrated here has been demonstrated to bring some advantages in medium size vocabulary speech recognition in German [Geutner (1995)]; for languages like English, with a low incidence of inflections, the advantage is minimal.
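The Knife entry can be mirrored in a small data structure in which each attribute path maps to a sequence of morphs. This is a sketch only: the dictionary layout and the function name realise are assumptions made for illustration, while the morph values are taken from the entry above.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, str, str]

# Attribute paths and morph sequences copied from the Knife entry;
# the dictionary representation itself is a hypothetical encoding.
KNIFE: Dict[Path, List[str]] = {
    ("surface", "phonology", "singular"): ["naIf"],
    ("surface", "phonology", "plural"): ["naIv", "z"],
    ("surface", "orthography", "singular"): ["knife"],
    ("surface", "orthography", "plural"): ["knive", "s"],
}

def realise(entry: Dict[Path, List[str]], *path: str) -> str:
    """Concatenate the stem alternant and ending stored at a path."""
    return "".join(entry[tuple(path)])

print(realise(KNIFE, "surface", "orthography", "plural"))
print(realise(KNIFE, "surface", "phonology", "plural"))
```

Storing the stem alternants (naIf/naIv, knife/knive) separately from the endings keeps the alternation information explicit while still allowing the surface form to be composed on demand.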

In a stem lexicon, the basic lexical key or lemma is the stem, which is represented in some kind of normalised notation. The most common kind of normalised or canonical notation has the following two properties:

  1. Canonical inflected form: With morphologically inflected items, a "normal form" such as the infinitive for verbs or the nominative singular for nouns is used.
  2. Canonical orthography: A standardised orthographic representation of the canonical inflected form is used.

For specific purposes, in which lexical entries need to be accessed on the basis of a specific property, indexing based, for instance, on the canonical phonemic representation, either of a fully inflected form or of the canonical inflected form, or even of the stem itself, may be required; for stochastic language models, for example, a tree-coded representation may be the optimal representation (see Chapter 7). Phonemic representation is dealt with in more detail below.
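One possible reading of a tree-coded index over phonemic representations is a prefix tree in which entries sharing an initial phoneme sequence share a path. The sketch below is an assumption about how such an index might look, not the representation described in Chapter 7; the class and function names are hypothetical, and the segmentation of the knife transcriptions into symbols is chosen for illustration.

```python
from typing import Dict, List, Optional

class LexTreeNode:
    """One node of a prefix tree over phoneme symbols."""
    def __init__(self) -> None:
        self.children: Dict[str, "LexTreeNode"] = {}
        self.entries: List[str] = []  # entries whose transcription ends here

def insert(root: LexTreeNode, phonemes: List[str], entry: str) -> None:
    """Add an entry under its phoneme sequence, sharing common prefixes."""
    node = root
    for symbol in phonemes:
        node = node.children.setdefault(symbol, LexTreeNode())
    node.entries.append(entry)

def _descend(root: LexTreeNode, phonemes: List[str]) -> Optional[LexTreeNode]:
    node: Optional[LexTreeNode] = root
    for symbol in phonemes:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node

def lookup(root: LexTreeNode, phonemes: List[str]) -> List[str]:
    """Entries whose transcription is exactly the given sequence."""
    node = _descend(root, phonemes)
    return list(node.entries) if node is not None else []

def with_prefix(root: LexTreeNode, phonemes: List[str]) -> List[str]:
    """All entries whose transcription starts with the given sequence."""
    node = _descend(root, phonemes)
    if node is None:
        return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.extend(current.entries)
        stack.extend(current.children.values())
    return sorted(found)

ROOT = LexTreeNode()
insert(ROOT, ["n", "aI", "f"], "knife")
insert(ROOT, ["n", "aI", "v", "z"], "knives")
print(lookup(ROOT, ["n", "aI", "f"]))
print(with_prefix(ROOT, ["n", "aI"]))
```

Because knife and knives share the path n, aI, the common prefix is stored once, which is the property that makes tree coding attractive when many forms must be searched in parallel.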

The notion of "lexical lemma"

 

As in the knife example, one particular form, for instance orthographic, of an entry is often used as a headword or lemma. From a technical lexicographic point of view, this form then has a dual function:

  1. It names the entry.
  2. It also represents one of its properties, namely its spelling.

This distinction is central, and ignoring it may lead to confusion, particularly in spoken language lexicography, where the primary criterion of access by word form is phonological.

When homographs occur (e.g. bank as a financial institution or as the side of a river), an additional consecutive numbering is used, e.g. bank₁, bank₂, etc.

The concept of an abstract lemma, deriving from recent developments in computational linguistics and their application to phonology and prosody, may be used in order to clarify the distinction [Gibbon (1992a)]: an abstract lemma may have any convenient unique name or number (or indeed be labelled by the spelling of the canonical inflected form, as already noted); all properties have equal status, so that the abstract lemma is neutral with respect to different types of lexical access, through spelling, pronunciation, semantics, etc. The examples of lexical entries given so far are based on the concept of an abstract lemma. The neutrality of the abstract lemma with respect to particular properties and particular directions of lexical access makes it suitable as a basic concept for organising flexible lexical databases. A lexicon based on a neutral abstract lemma concept is the basic form of a declarative lexicon, in which the structure of the lexicon is not dictated by the requirements of specific types of lexical access (the characteristic of a procedural lexicon), but by general logical principles. The distinction between declarative and procedural lexica is a relative one, however, which is taken up in the section on spoken language lexicon architectures. For practical applications, a lexical database will need to be procedurally optimised (= indexed) for fast access.
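The relation between a declarative lexicon built on abstract lemmas and its procedural indexes can be sketched as follows. Everything in the sketch is an illustrative assumption: the identifiers lemma_001 to lemma_003, the property names, the glosses, and the SAMPA-style transcription of bank are invented for the example; only the knife transcription naIf comes from the entry shown earlier.

```python
from collections import defaultdict
from typing import Dict, List

# Declarative part: abstract lemmas with arbitrary unique names.
# Spelling is just one property among others; no property is privileged.
ENTRIES: Dict[str, Dict[str, str]] = {
    "lemma_001": {"orthography": "bank", "phonology": "b{Nk",
                  "gloss": "financial institution"},
    "lemma_002": {"orthography": "bank", "phonology": "b{Nk",
                  "gloss": "side of a river"},
    "lemma_003": {"orthography": "knife", "phonology": "naIf",
                  "gloss": "cutting tool"},
}

def build_index(entries: Dict[str, Dict[str, str]], prop: str) -> Dict[str, List[str]]:
    """Procedural part: index the same entries by any chosen property."""
    index: Dict[str, List[str]] = defaultdict(list)
    for lemma_id, properties in entries.items():
        index[properties[prop]].append(lemma_id)
    return dict(index)

by_spelling = build_index(ENTRIES, "orthography")
by_sound = build_index(ENTRIES, "phonology")

print(by_spelling["bank"])   # both homographs are reached via spelling
print(by_sound["naIf"])      # access via pronunciation uses the same entries
```

The homographs remain distinct because they have different abstract names, and an index for a further access direction can be added without changing the entries themselves.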




EAGLES SWLG SoftEdition, May 1997.