The distinction between lexical databases and system lexica is a useful one,
though in practice more complex distinctions are required. The main
characteristics of the two kinds of lexical object are outlined below.
- Lexical database:
- A spoken language lexical database is often a set of
loosely related simpler databases (e.g. pronunciation table, index into a
signal annotation file database,
stochastic word model,
linguistic lexical database with syntactic
and semantic information).
- Purpose:
- Resource for system development (training, evaluation; construction of
stochastic language models).
- Definition of vocabulary coverage.
- Basis for vocabulary consistency maintenance.
- Reference point for integrating different kinds of lexical information.
- Source of information for investigation of vocabulary
structure.
- Structure:
- Generally fixed record structures, with fields for different types of lexical information, and strings as values in fields.
- Often identification of lexical key (lexical identifier) with
orthographic word form.
A problem with orthographic keys,
particularly with large vocabularies,
is the existence of
homographs,
i.e. lexical items with the same spelling but different pronunciation
(heterophonous homographs)
and/or meaning, a potential source of ``orthographic noise''.
Additional serial numbering may be used to distinguish between
homographs.
- Alternative for larger databases with more complex linguistic
information: Unique identification of word as a more abstract unit with a
formal identifier and specific properties including
orthography, pronunciation, syntax (POS), semantics, etc.
on an equal footing.
- Implementation generally conforming to local laboratory standards as a
database of ASCII strings, created and accessed by means of standard UNIX
tools and UNIX shell scripts, or C programmes; in more complex
environments with a commercial database such as ORACLE;
occasionally as knowledge bases in higher-level languages such as Prolog
or specialised languages such as DATR.
- Content:
- Main lookup key (in general an orthographic representation,
perhaps supplemented by numbering to distinguish homographs).
- Database entries may be fully inflected forms, uninflected
stems, or morphemes (generally morphs, i.e. the phonemic
forms of morphemes), or all of these; other inventories containing units
such as phonemes, diphones or syllables may be included.
- Pronunciation (in canonical phonemic representation, perhaps including
pronunciation variants).
- Subword boundaries between units such as syllables, morphs (phonemic
forms of affixes, lexical roots), derived stems
and constituents of
compound words.
- Syntactic category
(part of speech, POS, e.g. Noun, Adjective, Article, Pronoun, Verb,
Adverb, Preposition, Conjunction, Interjection) or subcategory (e.g. Proper
vs. Common Noun; Intransitive vs. Transitive vs. Ditransitive
vs. Prepositional Verb, etc.).
- Semantic categories
(in general scenario-specific, i.e. restricted to a given domain or application).
- Corpus information: frequency statistics (of varying complexity, up to sophisticated language models, cf. Chapter 7);
concordance information (i.e. list of contexts of occurrence for each word,
usually generated on demand);
signal annotations.
- Further information: concordance (textual context), links to
speech files.
- Implementation:
- commercial relational or object-oriented database,
- UNIX ASCII database core with access by UNIX script languages,
C or C++ programmes,
- in-house custom databases or knowledge bases.
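The fixed record structure and the serial numbering of homographs described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `LexEntry` fields, the `orth#n` key format, the SAMPA-like pronunciation strings and the three sample entries are hypothetical and are not taken from any particular laboratory database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    """One fixed-structure record; every field value is a plain string."""
    orth: str  # orthographic word form
    pron: str  # canonical phonemic form (illustrative SAMPA-like notation)
    pos: str   # syntactic category (part of speech)


def build_keyed_table(entries):
    """Key each record by orthography plus a serial number (assumed format)."""
    seen = defaultdict(int)
    table = {}
    for entry in entries:
        seen[entry.orth] += 1
        table[f"{entry.orth}#{seen[entry.orth]}"] = entry
    return table


def homographs(table):
    """Return the orthographic forms shared by more than one record."""
    counts = defaultdict(int)
    for entry in table.values():
        counts[entry.orth] += 1
    return sorted(orth for orth, n in counts.items() if n > 1)


# Hypothetical sample entries, including one heterophonous homograph pair.
ENTRIES = [
    LexEntry("lead", "li:d", "Verb"),
    LexEntry("lead", "lEd", "Noun"),
    LexEntry("table", "teIb@l", "Noun"),
]

TABLE = build_keyed_table(ENTRIES)
```

With numbered keys, the two readings of the same spelling remain separate records, and `homographs(TABLE)` lists the spellings that would be ambiguous if orthography alone served as the key. The alternative mentioned above, an abstract formal identifier with orthography as one property among others, would amount to replacing the `orth#n` key with an arbitrary unique identifier.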
- System lexicon:
- Lexical information (i.e. properties of words) referred
to during the speech recognition or synthesis process may not be
concentrated in one identifiable lexicon in a given system.
- Purpose: Definition of those properties of words required for
recognition, parsing and understanding, or for planning, formulation and
synthesis.
- Structure: In general separate modules for different properties of
words with different functions within the system (which are often not
regarded as having anything at all to do with a lexicon).
- In speech recognition: Modules such as the word recogniser (typically
based on Hidden Markov Model technology), which identifies word
forms, i.e. recognition-oriented lexical access keys, often
phoneme strings derived from orthographic keys and a pronunciation dictionary,
the stochastic language model
(which defines
statistical properties of words in their immediate contexts as
bigrams, trigrams, etc.),
and the linguistic lexicon with syntactic and
semantic information, linked to an application-specific database
or knowledge base.
- In speech synthesis: Modules which map
orthographic forms (in text-to-speech systems) or conceptual or semantic
representations (in concept-to-speech systems) to word structures in terms of
morpheme sequences, word prosody (e.g. accentuation), and pronunciation
(in terms of phonemes), supplemented by detailed rules for phoneme
variants in different contexts and for timing and other relevant
parametric information.
- Content: Application-specific; subsets of information defined in the
lexical database resource, as outlined under ``Structure''.
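Two of the recognition-side components listed above, phoneme-string access keys derived from a pronunciation dictionary and a bigram language model, can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary `PRON_DICT`, its SAMPA-like transcriptions, the two-sentence corpus and all function names are hypothetical, and the probability is a plain maximum-likelihood estimate without the smoothing a real system would be expected to need.

```python
from collections import Counter

# Hypothetical pronunciation dictionary: orthographic key -> phoneme string.
PRON_DICT = {
    "show": "S @U",
    "me": "m i:",
    "the": "D @",
    "flights": "f l aI t s",
}


def recognition_keys(words, pron_dict):
    """Map orthographic keys to phoneme tuples; words not listed are skipped."""
    return {w: tuple(pron_dict[w].split()) for w in words if w in pron_dict}


def bigram_counts(sentences):
    """Count adjacent word pairs, with <s> and </s> marking sentence edges."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def bigram_probability(counts, first, second):
    """Maximum-likelihood estimate of P(second | first), without smoothing."""
    total = sum(n for (w1, _), n in counts.items() if w1 == first)
    return counts[(first, second)] / total if total else 0.0


# Hypothetical training corpus for the language model.
CORPUS = ["show me the flights", "show me the fares"]

KEYS = recognition_keys({"show", "me", "the", "flights"}, PRON_DICT)
COUNTS = bigram_counts(CORPUS)
```

Here `KEYS` plays the role of the recognition-oriented access keys, while `COUNTS` holds the statistics from which `bigram_probability` estimates how likely one word is to follow another. Both are subsets of the kind of information that a lexical database would hold as a development resource.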
EAGLES SWLG SoftEdition, May 1997. Get the book...