Spoken language and written language lexica


Spoken language lexica differ from lexica for written language in coverage and content in many respects, although the two also share much information. Written language lexica are generally based on a stem, a neutral or canonical morphological form (e.g. nominative singular; infinitive), or a headword concept, under which generalisations over morphologically related forms may be included. This principle leads to fairly compact representations. Spoken language lexica for speech recognition, by contrast, are generally based on fully inflected word forms, as in dictation systems with about 20,000 entries. The number of fully inflected word form entries is larger than the number of regularly inflectable entries in a dictionary based on stems or neutral forms by a factor ranging from 2 or 3 to several thousand, depending on the complexity of the inflectional morphology and the typology of the language concerned. Speech synthesis systems for text-to-speech applications do not rely exclusively on extensive lexica, but also use rule-based techniques for generating pronunciation forms and prosody (speech melody) from smaller basic units.
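The growth from a stem-based lexicon to a full-form lexicon can be illustrated with a minimal sketch. The stems, the four-cell suffix paradigm, and the function name `expand_full_forms` are assumptions made for illustration only; real inflectional paradigms are larger and less regular.

```python
# Toy illustration: expand a stem-based lexicon into fully inflected forms.
# The stems and the suffix paradigm are hypothetical examples.

def expand_full_forms(stems, suffixes):
    """Return the set of word forms obtained by attaching every suffix to every stem."""
    return {stem + suffix for stem in stems for suffix in suffixes}

stems = ["walk", "talk", "jump"]
suffixes = ["", "s", "ed", "ing"]  # assumed regular English verb paradigm

full_forms = expand_full_forms(stems, suffixes)

# Ratio of full-form entries to stem entries (the "factor" discussed above).
expansion_factor = len(full_forms) / len(stems)
```

With this toy paradigm the factor equals the number of paradigm cells; in a language with richer inflection the same computation would yield a correspondingly larger factor.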

An orthographically oriented lexicon generally includes a canonical phonemic transcription, based on the citation form of a word (the pronunciation of a word in isolation), which can be utilised, for example, in sophisticated tools for automatic spelling correction or "phonetic search" in name databases. However, this is not always adequate for the requirements of speech recognition systems, in which further details are required.
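As a rough sketch of how a canonical transcription might support "phonetic search" in a name database, names can be grouped by their transcription so that differently spelled names with the same pronunciation are retrieved together. The names, the SAMPA-like transcriptions, and the function names below are illustrative assumptions, not taken from any particular system.

```python
from collections import defaultdict

def build_phonetic_index(name_to_transcription):
    """Group names by their canonical phonemic transcription."""
    index = defaultdict(list)
    for name, transcription in name_to_transcription.items():
        index[transcription].append(name)
    return index

def phonetic_search(query, name_to_transcription, index):
    """Return all names sharing the query's transcription, sorted alphabetically."""
    key = name_to_transcription.get(query)
    if key is None:
        return []  # query name has no transcription in the lexicon
    return sorted(index[key])

# Hypothetical lexicon fragment with illustrative transcriptions.
names = {"Meier": "maI6", "Meyer": "maI6", "Maier": "maI6", "Müller": "mYl6"}
index = build_phonetic_index(names)
matches = phonetic_search("Meyer", names, index)
```

A search for one spelling then returns all homophonous spellings present in the database.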

A spoken language lexicon may also contain information about pronunciation variants, and often includes prosodic information about syllable structure, stress, and (in tone and pitch accent languages) about lexical tone and pitch accent, together with morphological information about division into stems and affixes. Spoken language lexica are in general much more heavily orientated towards properties of word forms than towards the distributional and semantic properties of words.
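One possible way to represent such an entry is sketched below. The field names, the use of "." as a syllable boundary marker, and the SAMPA-like transcriptions are assumptions for illustration; actual spoken language lexica use a variety of formats.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Pronunciation:
    phonemes: str            # transcription; "." marks syllable boundaries (assumed convention)
    stressed_syllable: int   # 0-based index of the syllable carrying primary stress
    canonical: bool = False  # True for the citation-form pronunciation

    def syllables(self) -> List[str]:
        """Split the transcription into its syllables."""
        return self.phonemes.split(".")

@dataclass
class SpokenLexiconEntry:
    orthography: str
    stem: str
    affixes: List[str]
    pronunciations: List[Pronunciation] = field(default_factory=list)

    def canonical_pronunciation(self) -> Optional[Pronunciation]:
        """Return the pronunciation marked canonical, if there is one."""
        for pron in self.pronunciations:
            if pron.canonical:
                return pron
        return None

# Hypothetical entry with one canonical form and one variant.
entry = SpokenLexiconEntry(
    orthography="walking",
    stem="walk",
    affixes=["ing"],
    pronunciations=[
        Pronunciation("wO:.kIN", stressed_syllable=0, canonical=True),
        Pronunciation("wO:.kIn", stressed_syllable=0),
    ],
)
```

Note that the entry is keyed on a word form and carries form-related properties only, in line with the orientation described above.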

It may happen that a canonical morphological form or a canonical pronunciation does not actually occur in a given spoken language corpus; this would be of little consequence for a traditional dictionary, but in a spoken language dictionary it is necessary to adopt one of the following solutions (see also Chapter 7 for a discussion of solutions to the sparse data problem in language modelling):

Use the canonical phonemic form, but mark it as non-occurring; additionally, incorporate the attested form.
Adopt an attested form as canonical morphological form (e.g. nouns occurring only in the plural such as French ténèbres 'darkness', English trousers, German Leute 'people').
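The two solutions above can be sketched as follows. The function name, the dictionary layout of the result, and the toy corpus are hypothetical; the sketch only shows the decision logic, not a full lexicon-building procedure.

```python
def reconcile_canonical(canonical, inflected_forms, corpus_tokens, adopt_attested=False):
    """Decide how to record a canonical form that may be missing from the corpus.

    adopt_attested=False: keep the canonical form, flag it as non-occurring,
                          and list the attested forms alongside it.
    adopt_attested=True:  promote the first attested form to canonical form.
    """
    attested = [form for form in inflected_forms if form in corpus_tokens]
    if canonical in corpus_tokens:
        return {"canonical": canonical, "canonical_attested": True,
                "attested_forms": attested}
    if adopt_attested and attested:
        return {"canonical": attested[0], "canonical_attested": True,
                "attested_forms": attested}
    return {"canonical": canonical, "canonical_attested": False,
            "attested_forms": attested}

# Toy corpus in which only the plural form occurs.
corpus = {"the", "trousers", "are", "new"}
forms = ["trouser", "trousers"]

marked = reconcile_canonical("trouser", forms, corpus)
adopted = reconcile_canonical("trouser", forms, corpus, adopt_attested=True)
```

Here `marked` corresponds to the first solution (canonical form kept but flagged) and `adopted` to the second (attested plural promoted to canonical form).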

At a more detailed level, orthography (the division of word forms into standardised units of writing) and phonology (the division of word forms into units of pronunciation) are related in different ways in different languages, both to each other and to the morphology (the division of word forms into units of sense) of the language. The orthographic notion of "syllable" serves, in general, in written language lexica for defining hyphenation at line breaks and certain spelling rules (and may even refer to morphological prefixes and suffixes); for this purpose, morphological information about words is also generally required. In spoken language, however, the phonological notion of "syllable" is quite different; it refers to units of speech which are basic to the definition of the well-formed sound sequences of a language and to the rhythmic structure of speech, and forms the basis for the definition of variant pronunciations of speech sounds. Alphabetic orthography involves a close relation between characters and phonemes; in syllabic orthography (Japanese "Kana") characters are closely related to phonological syllables; in logographic orthography (Chinese), characters are closely related to simplex words (cf. numerals in European languages: the spelling "7" is pronounced /zi:bn/ in German, /sɛt/ in French, /sɛvn/ in English, and so on).

When complex word forms are put together from combinations of smaller units, different alternations of orthographic units (letters) often occur at the boundaries of the parts of such words (telephone + y = telephony; lady + s = ladies). Similarly, morphophonemic alternations occur in such positions (wife - /waɪf/ singular vs. wives - /waɪvz/ plural). Furthermore, additional kinds of lexical unit are required in the lexicon of a spoken language dialogue system: discourse particles, hesitation phenomena, pragmatic idioms such as greetings, so-called functional units (sequences of functional words which behave as a phonological unit: n'est-ce pas, /nɛspa/), and clitics (functional words which combine with lexical words to form a functional unit, cf. I'm coming, /aɪm kʌmɪŋ/).
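The orthographic alternations at morph boundaries mentioned above can be sketched with two toy rewrite rules. The rules and the function name `join_morphs` are simplifying assumptions that cover only the two examples given; they over-apply in other cases (for instance, stems ending in a double e) and are not a general model of English spelling.

```python
VOWELS = "aeiou"

def join_morphs(stem, suffix):
    """Concatenate stem and suffix, applying two toy boundary alternations."""
    # Rule 1: consonant + y becomes "ie" before the suffix -s (lady + s -> ladies).
    if (suffix == "s" and len(stem) >= 2
            and stem.endswith("y") and stem[-2] not in VOWELS):
        return stem[:-1] + "ies"
    # Rule 2: final e is dropped before a suffix starting with a vowel or y
    # (telephone + y -> telephony).
    if stem.endswith("e") and suffix and suffix[0] in VOWELS + "y":
        return stem[:-1] + suffix
    # Default: plain concatenation.
    return stem + suffix
```

A morphophonemic counterpart (such as the f/v alternation in wife, wives) would need analogous rules stated over phonemic transcriptions rather than letters.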


EAGLES SWLG SoftEdition, May 1997.