A spoken language lexicon may be a component of a system (a system lexicon) or a background resource for wider use (a lexical database). In each case it contains information about the pronunciation, the spelling, the syntactic usage, the meaning and specific pragmatic properties of words. Lexica containing subsets of this information may also be referred to as spoken language lexica, though the simpler cases are often simply called wordlists. Where there is little danger of confusion, the term spoken language lexicon will be used to refer indifferently to either a spoken language system lexicon or a spoken language lexical database. A lexical database may be general purpose, or it may be orientated towards specific tasks such as speech recognition or speech synthesis and restricted to a specific scenario. For system development and evaluation it is generally critical to define an agreed wordlist with a well-defined notion of word (e.g. a fully inflected word form), together with an associated complete and consistent pronunciation dictionary for grapheme-phoneme conversion and language model construction (see Chapter 7).
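The requirement of a complete and consistent pronunciation dictionary for an agreed wordlist can be illustrated with a minimal sketch. It assumes, purely for illustration, that "complete" means every word in the wordlist has at least one pronunciation and that "consistent" means every pronunciation uses only symbols from an agreed phoneme inventory; the function name, the SAMPA-like symbols and the toy data are hypothetical.

```python
def check_pronunciation_dictionary(wordlist, pron_dict, phoneme_inventory):
    """Return (missing, invalid) for an agreed wordlist.

    missing: words in the wordlist with no pronunciation at all.
    invalid: word -> list of unknown-symbol lists, one per faulty pronunciation.
    """
    # Completeness: every agreed word needs at least one pronunciation.
    missing = sorted(w for w in set(wordlist) if not pron_dict.get(w))

    # Consistency (assumed definition): only symbols from the inventory occur.
    invalid = {}
    for word, pronunciations in pron_dict.items():
        for pron in pronunciations:
            unknown = [p for p in pron.split() if p not in phoneme_inventory]
            if unknown:
                invalid.setdefault(word, []).append(unknown)
    return missing, invalid


# Toy data: fully inflected word forms, space-separated phoneme symbols.
wordlist = ["train", "trains", "ticket"]
inventory = {"t", "r", "eI", "n", "z", "I", "k"}
pron_dict = {
    "train": ["t r eI n"],
    "trains": ["t r eI n X9"],  # "X9" is a deliberate error
}

missing, invalid = check_pronunciation_dictionary(wordlist, pron_dict, inventory)
print(missing)
print(invalid)
```

In this toy case the word "ticket" is reported as missing and the entry for "trains" as inconsistent. A real project would apply whatever notion of word and whatever phoneme inventory its participants have agreed on.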
A spoken language lexicon is defined as a list of representations of lexical entries consisting of spoken word forms paired with their other lexical properties such as spelling, pronunciation, part of speech (POS), meaning and usage information, in such a way as to optimise lookup of any or all of these properties. This definition covers a wide range of specific types of spoken language lexicon. At one end of the spectrum are lists in which orthography provides a more or less indirect representation of the pronunciation of a spoken word form, augmented by tabular pronunciation dictionaries and conversion rules. At the other end are declarative knowledge bases with attribute-value matrix representation formalisms and inheritance hierarchies with associated inference machines, by means of which details of lexical information are inferred from specific premises (entries) about individual lexical items and general premises (rules) about the structure of lexical items. Between these extremes are optimised representations such as those discussed in Chapter 7, and other application-directed special lexicon types based, for instance, on the different requirements for pronunciation tables for speech recognisers and for speech synthesisers.
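The definition above, a list of entries organised so that any of their properties can be looked up efficiently, can be sketched at the simple, tabular end of the spectrum as follows. This is only an illustration under stated assumptions: the class names `LexicalEntry` and `SpokenLexicon`, the choice of indexed fields and the SAMPA-like transcriptions are hypothetical, and no inheritance or inference mechanism is modelled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    orthography: str
    pronunciation: str  # space-separated phoneme symbols (SAMPA-like, assumed)
    pos: str            # part of speech
    meaning: str = ""   # free-text gloss, for illustration only


class SpokenLexicon:
    """A list of entries plus one index per property, so that lookup
    by spelling, pronunciation or part of speech is a dictionary access."""

    INDEXED_FIELDS = ("orthography", "pronunciation", "pos")

    def __init__(self, entries):
        self.entries = list(entries)
        self._index = {f: defaultdict(list) for f in self.INDEXED_FIELDS}
        for entry in self.entries:
            for field, index in self._index.items():
                index[getattr(entry, field)].append(entry)

    def lookup(self, field, value):
        """Return all entries whose `field` equals `value` (possibly none)."""
        return list(self._index[field].get(value, []))


lexicon = SpokenLexicon([
    LexicalEntry("read", "r i: d", "V", "present tense"),
    LexicalEntry("read", "r E d", "V", "past tense"),
    LexicalEntry("red", "r E d", "ADJ", "colour"),
])

# Homographs: one spelling, two pronunciations.
print([e.pronunciation for e in lexicon.lookup("orthography", "read")])
# Homophones: one pronunciation, two spellings.
print([e.orthography for e in lexicon.lookup("pronunciation", "r E d")])
```

Keeping one index per property reflects the requirement that lookup be optimised for any of the properties rather than for spelling alone; a recogniser would typically enter through the pronunciation index and a synthesiser through the orthography index.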
Both in speech recognition and in speech synthesis, the different kinds of spoken language lexicon are generally orientated towards the forms of words rather than towards their distribution in larger text or utterance units, or their meaning and use in context. Furthermore, where possible, closed sets of fully inflected words which are actually attested in corpora are preferred to the construction of words on morphological principles. Rule-based word construction is nevertheless increasing in importance in projects concerned with highly inflecting languages or aimed at the recognition of spontaneous continuous speech, in which out-of-vocabulary words or ad hoc coinages (nonce forms) are encountered. In addition to out-of-vocabulary words, systematic noise events may also need to be inventoried in a lexical database.
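The distinction between a closed set of attested, fully inflected word forms, out-of-vocabulary words and inventoried noise events can be sketched as a simple token classification. The function name, the angle-bracket notation for noise events and the toy vocabulary are assumptions made for illustration, not a convention taken from any particular system.

```python
def classify_tokens(tokens, closed_vocabulary, noise_events):
    """Label each token as 'noise', 'known' or 'oov'.

    closed_vocabulary: fully inflected word forms attested in a corpus.
    noise_events: inventoried non-speech events (notation is hypothetical).
    """
    labelled = []
    for token in tokens:
        if token in noise_events:
            labelled.append((token, "noise"))
        elif token.lower() in closed_vocabulary:
            labelled.append((token, "known"))
        else:
            # No morphological analysis is attempted here: an unlisted
            # inflected form or nonce form simply counts as out of vocabulary.
            labelled.append((token, "oov"))
    return labelled


# "train" and "trains" are listed separately, as fully inflected forms.
vocabulary = {"the", "train", "trains", "leaves"}
noise = {"<breath>", "<cough>"}

result = classify_tokens(
    ["The", "trains", "<breath>", "unticketed"], vocabulary, noise
)
print(result)
```

With a closed vocabulary of this kind the nonce form "unticketed" can only be flagged as out of vocabulary; a rule-based component for word construction would be needed to analyse it further.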