Spoken language and written language lexica


Spoken language lexica differ in coverage and content in many respects from lexica for written language, although the two also share much information. Written language lexica are generally based on a stem, a neutral or canonical morphological form (e.g. nominative singular; infinitive), or a headword concept, in which generalisations over morphologically related forms may be included. This principle leads to fairly compact representations. Spoken language lexica for speech recognition are generally based on fully inflected word forms, as in dictation systems with about 20000 entries. Depending on the complexity of the inflectional morphology and the typology of the language concerned, the number of fully inflected word form entries exceeds the number of regularly inflectable entries in a dictionary based on stems or neutral forms by a factor ranging from 2 or 3 to several thousand. Speech synthesis systems for text-to-speech applications do not rely exclusively on extensive lexica, but also use rule-based techniques for generating pronunciation forms and prosody (speech melody) from smaller basic units.
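The difference in size between a stem-based lexicon and a full-form lexicon can be illustrated with a minimal sketch. The paradigm names, suffix lists and the three stems below are hypothetical toy data, not taken from any real system; the point is only to show how the expansion factor arises.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StemEntry:
    stem: str
    paradigm: str  # name of a hypothetical inflection class

# Toy paradigms (assumption): suffixes that attach directly to the stem.
PARADIGMS = {
    "en_noun": ["", "s"],
    "en_verb": ["", "s", "ed", "ing"],
}

def expand(entries):
    """Return the full-form lexicon: one item per inflected word form."""
    forms = []
    for entry in entries:
        for suffix in PARADIGMS[entry.paradigm]:
            forms.append(entry.stem + suffix)
    return forms

stem_lexicon = [
    StemEntry("book", "en_noun"),
    StemEntry("walk", "en_verb"),
    StemEntry("talk", "en_verb"),
]
full_forms = expand(stem_lexicon)

# Ratio of full-form entries to stem entries.
expansion_factor = len(full_forms) / len(stem_lexicon)
print(len(stem_lexicon), len(full_forms), round(expansion_factor, 2))
```

For a weakly inflecting language such as English the factor stays small; with richer paradigms the same procedure yields far larger full-form lexica.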

An orthographically oriented lexicon generally includes a canonical phonemic transcription, based on the citation form of a word (the pronunciation of a word in isolation), which can be utilised, for example, in sophisticated tools for automatic spelling correction or ``phonetic search'' in name databases. However, this is not always adequate for the requirements of speech recognition systems, in which further details are required.
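A ``phonetic search'' of the kind mentioned above might be sketched as follows, assuming that each name in the database already carries a canonical phonemic transcription. The small name lexicon and the function names are illustrative assumptions; the transcriptions are simplified and omit stress.

```python
from collections import defaultdict

# Hypothetical name lexicon: spelling -> canonical phonemic transcription.
NAME_LEXICON = {
    "Meyer": "maɪɐ",
    "Meier": "maɪɐ",
    "Maier": "maɪɐ",
    "Müller": "mʏlɐ",
}

def build_phonemic_index(lexicon):
    """Invert the lexicon: transcription -> all spellings that share it."""
    index = defaultdict(list)
    for spelling, transcription in lexicon.items():
        index[transcription].append(spelling)
    return index

def phonetic_search(query, lexicon, index):
    """Return every spelling pronounced like the query spelling."""
    transcription = lexicon.get(query)
    if transcription is None:
        return []  # unknown spelling: no transcription to search with
    return sorted(index[transcription])

INDEX = build_phonemic_index(NAME_LEXICON)
matches = phonetic_search("Meier", NAME_LEXICON, INDEX)
print(matches)
```

The lookup works only for spellings already in the lexicon; a practical tool would need grapheme-to-phoneme rules for unseen queries.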

A spoken language lexicon may also contain information about pronunciation variants, and often includes prosodic information about syllable structure, stress, and (in tone and pitch accent languages) about lexical tone and pitch accent, with morphological information about division into stems and affixes. Spoken language lexica are in general much more heavily orientated towards properties of word forms than towards the distributional and semantic properties of words.
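One possible shape for such an entry is sketched below. The class and field names are assumptions made for illustration, and the sample transcriptions and syllabification for ``coming'' are simplified; a real lexicon would follow its own schema and transcription conventions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SpokenLexiconEntry:
    orthography: str
    pronunciations: List[str]  # canonical form first, then variants
    syllables: List[str]       # syllabification of the canonical form
    stress_index: int          # index of the primary-stressed syllable
    morphs: List[str] = field(default_factory=list)  # stems and affixes

    def canonical(self) -> str:
        """The canonical (citation form) pronunciation."""
        return self.pronunciations[0]

    def variants(self) -> List[str]:
        """All non-canonical pronunciation variants."""
        return self.pronunciations[1:]

    def stressed_syllable(self) -> str:
        return self.syllables[self.stress_index]

# Illustrative entry (assumed data).
entry = SpokenLexiconEntry(
    orthography="coming",
    pronunciations=["kʌmɪŋ", "kʌmɪn"],
    syllables=["kʌ", "mɪŋ"],
    stress_index=0,
    morphs=["come", "ing"],
)
print(entry.canonical(), entry.variants(), entry.stressed_syllable())
```

Lexical tone or pitch accent could be added as a further field for languages that require it.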

It may happen that a canonical morphological form or a canonical pronunciation does not actually occur in a given spoken language corpus; this would be of little consequence for a traditional dictionary, but in a spoken language dictionary it is necessary to adopt one of the following solutions (see also Chapter 7 for a discussion of solutions to the sparse data problem in language modelling):

  1. Use the canonical phonemic form, but mark it as non-occurring; additionally, incorporate the attested form.
  2. Adopt an attested form as canonical morphological form (e.g. nouns occurring only in the plural such as French ténèbres `darkness', English trousers, German Leute `people').
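The two solutions can be contrasted in a small sketch. It treats both cases at the level of word forms for simplicity; the `Entry` class, the function names and the toy corpus vocabulary are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Entry:
    canonical: str
    canonical_attested: bool
    attested_forms: List[str] = field(default_factory=list)

def keep_canonical(canonical: str, forms: List[str], corpus: Set[str]) -> Entry:
    """Solution 1: keep the canonical form, flag whether it occurs,
    and record the forms that are actually attested."""
    attested = [f for f in forms if f in corpus]
    return Entry(canonical, canonical in corpus, attested)

def adopt_attested(canonical: str, forms: List[str], corpus: Set[str]) -> Entry:
    """Solution 2: if the canonical form never occurs, promote the
    first attested form to canonical status."""
    attested = [f for f in forms if f in corpus]
    if canonical in corpus or not attested:
        return Entry(canonical, canonical in corpus, attested)
    return Entry(attested[0], True, attested)

corpus_vocab = {"trousers", "the", "are"}  # toy corpus vocabulary
forms = ["trouser", "trousers"]            # singular is not attested

e1 = keep_canonical("trouser", forms, corpus_vocab)
e2 = adopt_attested("trouser", forms, corpus_vocab)
print(e1)
print(e2)
```

Solution 1 preserves the link to conventional dictionaries, while solution 2 keeps every canonical form in the lexicon observable in the corpus.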

At a more detailed level, orthography (the division of word forms into standardised units of writing) and phonology (the division of word forms into units of pronunciation) are related in different ways in different languages both to each other and also to the morphology (the division of word forms into units of sense) of the language. The orthographic notion of ``syllable'' serves, in general, in written language lexica for defining hyphenation at line breaks and certain spelling rules (and may even refer to morphological prefixes and suffixes); for this purpose, morphological information about words is also generally required. In spoken language, however, the phonological notion of ``syllable'' is quite different; it refers to units of speech which are basic to the definition of the well-formed sound sequences of a language and to the rhythmic structure of speech, and forms the basis for the definition of variant pronunciations of speech sounds. Alphabetic orthography involves a close relation between characters and phonemes; in syllabic orthography (Japanese `Kana') characters are closely related to phonological syllables; in logographic orthography (Chinese), characters are closely related to simplex words (cf. numerals in European languages: the spelling ``7'' is pronounced /zi:bən/, /sɛt/, /sɛvən/, and so on).

When complex word forms are put together from combinations of smaller units, different alternations of orthographic units (letters) often occur at the boundaries of the parts of such words (telephone + y = telephony; lady + s = ladies). Similarly, morphophonemic alternations occur in such positions (wife - /waɪf/ singular vs. wives - /waɪvz/ plural). Furthermore, additional kinds of lexical unit are required in the lexicon of a spoken language dialogue system: discourse particles, hesitation phenomena, pragmatic idioms such as greetings, or so-called functional units (sequences of functional words which behave as a phonological unit: n'est-ce pas, /nɛspa/) and clitics (functional words which combine with lexical words to form a functional unit, cf. I'm coming, /aɪm kʌmɪŋ/).
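The orthographic alternations at morph boundaries can be sketched as rewrite rules applied when a suffix is attached. The two rules below cover only the examples given above and are deliberately incomplete (they would mishandle, for instance, see + ing); the function name `attach` is an assumption.

```python
VOWELS = "aeiou"

def attach(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying two toy English spelling rules."""
    # e-deletion: a final 'e' is dropped before a vowel-initial suffix or 'y'.
    if stem.endswith("e") and suffix and suffix[0] in VOWELS + "y":
        return stem[:-1] + suffix
    # y -> ie before 's' when the 'y' follows a consonant letter.
    if suffix == "s" and len(stem) >= 2 and stem.endswith("y") \
            and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"
    # Default: plain concatenation.
    return stem + suffix

for stem, suffix in [("telephone", "y"), ("lady", "s"), ("boy", "s")]:
    print(stem, "+", suffix, "=", attach(stem, suffix))
```

Morphophonemic alternations such as wife vs. wives could be handled analogously, with rules stated over phoneme strings rather than letters.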


