Lexical databases and system lexica for spoken language


The distinction between lexical databases and system lexica is a useful one, though in practice more complex distinctions are required. The main characteristics of the two kinds of lexical object are outlined below.

Lexical database:

A spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, index into a signal annotation file database, stochastic word model, linguistic lexical database with syntactic and semantic information).

Purpose:
- Resource for system development (training, evaluation; construction of stochastic language models).
- Definition of vocabulary coverage.
- Basis for vocabulary consistency maintenance.
- Reference point for integrating different kinds of lexical information.
- Source of information for investigation of vocabulary structure.
Structure:
- Generally fixed record structures, with fields for different types of lexical information, and strings as values in fields.
- Often the lexical key (lexical identifier) is identified with the orthographic word form. A problem with orthographic keys, particularly with large vocabularies, is the existence of homographs, i.e. lexical items with the same spelling but different pronunciation (heterophonous homographs) and/or meaning, a potential source of "orthographic noise". Additional serial numbering may be used to distinguish between homographs.
- Alternative for larger databases with more complex linguistic information: unique identification of the word as a more abstract unit with a formal identifier, with specific properties (orthography, pronunciation, syntax (POS), semantics, etc.) on an equal footing.
- Implementation generally conforming to local laboratory standards as a database of ASCII strings, created and accessed by means of standard UNIX tools and UNIX shell scripts, or C programmes; in more complex environments with a commercial database such as ORACLE; occasionally as knowledge bases in higher-level languages such as Prolog or specialised languages such as DATR.
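
The serial numbering of homographs described above can be sketched in a few lines of Python. This is a minimal illustration, not a handbook recommendation: the record layout, the `#` separator in the keys, the function names and the SAMPA-like transcriptions are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fixed-record rows: (orthography, pronunciation, POS).
# The transcriptions are illustrative only.
ROWS = [
    ("lead", "li:d", "Verb"),
    ("lead", "lEd", "Noun"),
    ("table", "teIbl", "Noun"),
]

def build_database(rows):
    """Assign each record a key made of orthography plus a serial number."""
    counters = defaultdict(int)  # next serial number per orthographic form
    database = {}
    for orth, pron, pos in rows:
        counters[orth] += 1
        key = f"{orth}#{counters[orth]}"  # e.g. "lead#1", "lead#2"
        database[key] = {"orth": orth, "pron": pron, "pos": pos}
    return database

def heterophonous_homographs(database):
    """Return orthographic forms that carry more than one pronunciation."""
    prons = defaultdict(set)
    for record in database.values():
        prons[record["orth"]].add(record["pron"])
    return sorted(orth for orth, p in prons.items() if len(p) > 1)

db = build_database(ROWS)
print(sorted(db))
print(heterophonous_homographs(db))
```

Because the orthography is stored as an ordinary field alongside pronunciation and POS, the same records could equally be keyed by an abstract identifier, which is the alternative the list mentions for larger databases.
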
Content:
- Main lookup key (in general an orthographic representation, perhaps supplemented by numbering to distinguish homographs).
- Database entries may be fully inflected forms, uninflected stems, or morphemes (generally morphs, i.e. the phonemic forms of morphemes), or all of these; other inventories containing units such as phonemes, diphones or syllables may be included.
- Pronunciation (in canonical phonemic representation, perhaps including pronunciation variants).
- Subword boundaries between units such as syllables, morphs (phonemic forms of affixes, lexical roots), derived stems and constituents of compound words.
- Syntactic category (part of speech, POS, e.g. Noun, Adjective, Article, Pronoun, Verb, Adverb, Preposition, Conjunction, Interjection) or subcategory (e.g. Proper vs. Common Noun, Intransitive vs. Transitive vs. Ditransitive vs. Prepositional, etc., Verb).
- Semantic categories (in general scenario-specific, i.e. restricted to a given domain or application).
- Corpus information: frequency statistics (of varying complexity, up to sophisticated language models, cf. Chapter 7); concordance information (i.e. list of contexts of occurrence for each word, usually generated on demand); signal annotations.
- Further information: concordance (textual context), links to speech files.
- Implementation:
  - commercial relational or object-oriented database,
  - UNIX ASCII database core with access by UNIX script languages, C or C++ programmes,
  - in-house custom databases or knowledge bases.
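
The corpus information listed above (frequency statistics, and concordances generated on demand) can be illustrated with a short Python sketch. The toy corpus, the function names and the context width are assumptions chosen for the example; a real resource would work from transcribed and annotated speech data.

```python
from collections import Counter

# Hypothetical toy corpus of transcribed utterances (an assumption for
# illustration, not data from the handbook).
CORPUS = [
    "show me flights to boston",
    "show me the fare to denver",
    "flights to denver please",
]

def word_frequencies(utterances):
    """Count how often each orthographic word form occurs."""
    return Counter(word for utt in utterances for word in utt.split())

def concordance(utterances, target, width=2):
    """List each occurrence of `target` with up to `width` words of context."""
    lines = []
    for utt in utterances:
        words = utt.split()
        for i, word in enumerate(words):
            if word == target:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                lines.append((left, word, right))
    return lines

freq = word_frequencies(CORPUS)
print(freq["to"])
for left, word, right in concordance(CORPUS, "denver"):
    print(f"{left:>12} [{word}] {right}")
```

Computing the concordance at query time, rather than storing it in each record, matches the "generated on demand" practice mentioned in the list and keeps the database records small.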

System lexicon:

Lexical information (i.e. properties of words) referred to during the speech recognition or synthesis process may not be concentrated in one identifiable lexicon in a given system.

Purpose: Definition of those properties of words required for recognition, parsing and understanding, or for planning, formulation and synthesis.
Structure: In general separate modules for different properties of words with different functions within the system (which are often not regarded as having anything at all to do with a lexicon).
- In speech recognition: modules such as the word recogniser (typically based on Hidden Markov Model technology), which identifies word forms, i.e. recognition-oriented lexical access keys, often phoneme strings derived from orthographic keys and a pronunciation dictionary; the stochastic language model (which defines statistical properties of words in their immediate contexts as bigrams, trigrams, etc.); and the linguistic lexicon with syntactic and semantic information, linked to an application-specific database or knowledge base.
- In speech synthesis: modules which map orthographic forms (in text-to-speech systems) or conceptual or semantic representations (in concept-to-speech systems) to word structures in terms of morpheme sequences, word prosody (e.g. accentuation), and pronunciation (in terms of phonemes), supplemented by detailed rules for phoneme variants in different contexts and for timing and other relevant parametric information.
Content: Application specific; subsets of information defined in the lexical database resource, as outlined under "Structure".
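
As a rough illustration of how system lexicon modules might be derived as subsets of the database resource, the Python sketch below builds a recognition lexicon (phoneme strings as access keys) and raw bigram counts. The database contents, the transcriptions, the vocabulary and the function names are all hypothetical, and the counts are unsmoothed; this is a sketch under those assumptions, not a description of any particular system.

```python
from collections import Counter

# Hypothetical lexical database resource: orthographic key -> properties.
# The transcriptions are illustrative only.
DATABASE = {
    "flights": {"pron": "f l aI t s", "pos": "Noun"},
    "to": {"pron": "t u:", "pos": "Preposition"},
    "boston": {"pron": "b Q s t @ n", "pos": "Noun"},
    "denver": {"pron": "d E n v @", "pos": "Noun"},
}

def recognition_lexicon(database, vocabulary):
    """Map phoneme strings (recognition access keys) to orthographic keys."""
    return {database[w]["pron"]: w for w in vocabulary if w in database}

def bigram_counts(utterances):
    """Count adjacent word pairs, the raw material of a bigram model."""
    counts = Counter()
    for utt in utterances:
        words = utt.split()
        counts.update(zip(words, words[1:]))
    return counts

vocab = ["flights", "to", "denver"]  # application-specific subset
lexicon = recognition_lexicon(DATABASE, vocab)
bigrams = bigram_counts(["flights to denver", "flights to boston"])
print(lexicon["t u:"])
print(bigrams[("flights", "to")])
```

A plain dictionary keyed by phoneme string keeps only one entry per pronunciation, so homophones would overwrite each other; a fuller version would map each phoneme string to a list of orthographic keys.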


EAGLES SWLG SoftEdition, May 1997.