
Lexical information as properties of words

At the present time, information about lexica for spoken language systems  is relatively hard to come by. One reason for this is that such information is largely contained in specifications of particular proprietary or prototype systems and in technical reports with restricted distribution. With the advent of organisations for coordinating the use of language resources, such as ELRA (the European Language Resources Association) and the LDC (the Linguistic Data Consortium), access to information on spoken language lexica is becoming more widely available.

Another reason for difficulties in obtaining information about spoken language lexica is that there is no close relation between concepts and terminology in the speech processing field on the one hand, and concepts and terminology in traditional lexicography, natural language processing  and computational linguistics on the other. Components such as Hidden Markov Models  for word recognition , stochastic  language models for word sequence patterns, grapheme-phoneme tables  and rules, and word-oriented knowledge bases for semantic interpretation or text construction are all concerned with the identity and properties of words, lexical access , lexical disambiguation , lexicon architecture  and lexical representation, but these relations are not immediately obvious within the specific context of speech technology. Stochastic word models , for instance, would not generally be regarded as a variety of lexicon, although they evidently do provide corpus-based lexical information about word collocations.
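As an illustration of the kind of corpus-based lexical information a stochastic word model provides, the following sketch derives bigram collocation statistics from a word sequence (the corpus and all words are invented for the example; a real model would be trained on a large transcribed speech corpus):

```python
from collections import Counter

# Toy corpus; in practice this would be a large transcribed speech corpus.
corpus = ("the train to hamburg leaves at ten "
          "the timetable shows the train to bonn").split()

# The simplest stochastic word model: bigram counts, i.e. corpus-based
# information about word collocations.
bigrams = Counter(zip(corpus, corpus[1:]))

# Relative frequency of "train" following "the":
total_the = sum(n for (w1, _), n in bigrams.items() if w1 == "the")
p_train_given_the = bigrams[("the", "train")] / total_the
print(p_train_given_the)
```

Even this trivial table is lexical information: it records, for each word, which words it collocates with and how often, which is exactly what a recogniser's language model consults.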

A terminological problem should be noted at the outset: in the spoken language technologies, the term linguistic is often used for representation and processing in sentence, text and dialogue level components, and acoustic for word models. With present-day systems, this terminology is misleading. The integration of prosody, for example, requires the interfacing of acoustic techniques at sentence, text and dialogue levels, and linguistic analysis is involved at the word level in the specification of morphological components in systems developed for highly inflecting languages, in the recognition of out-of-vocabulary  words, and in the use of phonological information in structured Hidden Markov Models (HMMs).

It is useful to distinguish between system lexica  and lexical databases . The distinction may, in specific cases, be blurred, and the link between the two may be rather loose if the system lexicon is highly modular, distributed among several system components, or based on several different lexical databases. Nevertheless, the distinction is a useful one, and is discussed further below. Since the kinds of information in both these types of lexical object overlap, the term ``spoken language lexicon''  will generally be used in this chapter to cover both types. The following overview is necessarily selective.

Types of application for spoken language lexica

Lexica for spoken language are used in a variety of systems, including the following:

Spoken language lexical databases as a general resource


Spoken language lexica may be components of systems such as those listed above, or reusable background resources. System lexica are generally only of local interest within institutes, companies or projects. Lexical databases as reusable background resources, which are intended to be more generally available, raise questions of standardised representation, storage and dissemination. In general, the same principles apply as for Spoken Language Corpora:  they are collated, stored and disseminated using a variety of media. In research and development contexts, magnetic media (disk or tape) were preferred until recently; nowadays, lexica are held on local magnetic storage, and wider informal dissemination within projects or other relevant communities is conducted via the Internet using standard file transfer protocols, electronic mail and World-Wide Web search and access. Large lexica, and the corpora on which large lexica are based, are also stored and disseminated in the form of ISO standard CD-ROMs.

The following brief overview can do no more than list a number of examples of current work on spoken language lexicography. At this stage, no claim to exhaustiveness is made, and no evaluation of cited or uncited work is intended.


Lexica in selected spoken language systems


The range of existing spoken language systems is large, so only a small selection can be outlined here, concentrating on well-known older or established systems whose lexicon requirements are representative of different approaches and convey the flavour of basic lexical problems and their treatment. The field is currently developing rapidly. Small vocabulary  systems  are also excluded, as their strong points are evidently not in the area of the lexicon. The concepts referred to in the descriptions are discussed in the relevant sections below. Reference should also be made to Chapters 5 and 7.

HARPY was a large-vocabulary  (1011 words) continuous speech  recognition system developed at Carnegie Mellon University; it was the best performing speech recognition system developed under the five-year ARPA  project launched in 1971. HARPY made use of various knowledge sources, including a highly constrained grammar  (a finite state grammar  in BNF [Backus Naur Form] notation) and lexical knowledge in the form of a pronunciation dictionary  that contains alternative pronunciations of each word. Initial attempts to derive within-word phonological variations with a set of phonological rules operating on a baseform failed. A set of juncture rules describes inter-word phonological phenomena such as /p/ deletion  at /pm/ junctures: /helpmi/ - /helmi/. The spectral characteristics of allophones   of a given phoneme , including their empirically determined durations , are stored in phone templates  . The HARPY system compiles all knowledge into a unified directed graph representation, a transition network of 15,000 states. Each state in the network corresponds to a spectral template . The spectra of the observed segments are compared with the spectral templates  in the network, and the system determines which sequence of spectra, that is, which path through the network, provides the best match with the acoustic input spectral sequence. (Cf. [Klatt (1977)]; see also [Lowerre & Reddy (1980)].)
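The best-path search through such a network can be illustrated schematically. The sketch below is not the actual HARPY implementation: the network, the templates (reduced to single toy values standing in for spectra) and the squared-error match cost are all invented for the example, but the dynamic-programming idea is the same.

```python
# network: state -> successor states; each state owns one spectral template.
network = {0: [1, 2], 1: [3], 2: [3], 3: []}
template = {0: 1.0, 1: 4.0, 2: 2.5, 3: 6.0}
observed = [1.1, 2.4, 5.9]  # observed "spectral" sequence (toy 1-D values)

def best_path(network, template, observed, start=0):
    """Dynamic-programming search for the network path whose templates
    best match the observation sequence (squared-error cost)."""
    # frontier: state -> (cumulative cost, path reaching that state)
    frontier = {start: ((observed[0] - template[start]) ** 2, [start])}
    for obs in observed[1:]:
        nxt = {}
        for state, (cost, path) in frontier.items():
            for succ in network[state]:
                c = cost + (obs - template[succ]) ** 2
                if succ not in nxt or c < nxt[succ][0]:
                    nxt[succ] = (c, path + [succ])
        frontier = nxt
    return min(frontier.values())  # (best cost, best path)

cost, path = best_path(network, template, observed)
print(path)  # state sequence that best matches the input
```

Because all knowledge is compiled into one network, the recogniser reduces to exactly this kind of search, with the lexicon implicit in the network topology.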

HEARSAY-II was based on the blackboard principle, in which knowledge sources contribute to the recognition process via a global database. In the recognition process, an utterance is segmented into categories of manner-of-articulation features, e.g. a stop -vowel-stop  pattern. All words with a syllable structure  corresponding to that of the input are proposed as hypotheses. However, words can also be hypothesised top-down by the syntactic component, so misses by the lexical hypothesiser, which are very likely, can be compensated for by the syntactic predictor. The lexicon for word verification has the same structure as in HARPY: it is defined in terms of spectral patterns. (Cf. [Klatt (1977)]; see also [Erman (1977)] and [Erman & Lesser (1980)].)
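The bottom-up lexical hypothesising described above can be sketched as follows. The lexicon, the pronunciations and the manner-of-articulation classification are all invented for the example; the point is only that every word whose manner pattern matches the input is proposed, which is why false alarms are frequent and a top-down syntactic predictor is needed.

```python
# Manner-of-articulation class for each (toy) phone symbol.
MANNER = {"p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop",
          "g": "stop", "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel",
          "u": "vowel", "s": "fricative", "f": "fricative",
          "m": "nasal", "n": "nasal"}

# Toy lexicon: word -> pronunciation (here identical to the spelling).
lexicon = {"bat": "bat", "pit": "pit", "mad": "mad", "sit": "sit"}

def pattern(phones):
    """Reduce a phone sequence to its manner-of-articulation pattern."""
    return tuple(MANNER[p] for p in phones)

def hypothesise(input_phones):
    """Propose every lexicon word whose manner pattern matches the input."""
    target = pattern(input_phones)
    return [w for w, pron in lexicon.items() if pattern(pron) == target]

print(hypothesise("bad"))  # stop-vowel-stop input
```

For a stop-vowel-stop input such as /bad/, both "bat" and "pit" are hypothesised, although only one (at most) can be right.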

SPHINX is a large-vocabulary  continuous speech  recognition system for speaker-independent     application. It was evaluated on the DARPA  naval resource management task. The baseline SPHINX system works with Hidden Markov Models (HMMs ), where each HMM  represents one of a total of 45 phones . The phone models  are concatenated to create word models , which in turn serve to create sentence models . The phonetic spelling  of a word was adopted from the ANGEL system [Rudnicky et al. (1987)]. The SPHINX baseline system has been improved by introducing multiple codebooks and adding information to the lexical-phonological component. The SPHINX system works with grammars  of different perplexity  (average branching factor; see Chapter 7); these grammars can, in principle, be regarded as specialised tabular, network-like or tree-structured lexica with probabilistic word-class information. In word recognition tests , the best results were obtained with the bigram grammar , the most restrictive of the grammars mentioned above (96% accuracy  compared with 71% for null grammars ).
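The relationship between phone models, word models and a bigram grammar can be sketched schematically. All pronunciations and grammar entries below are invented, and the phone labels merely stand in for trained HMMs; the sketch shows only the structural idea of concatenating phone models into word models and letting a bigram grammar restrict which word may follow which.

```python
# Toy pronunciation lexicon: word -> phone sequence.
pronunciations = {"ship": ["SH", "IH", "P"],
                  "sheep": ["SH", "IY", "P"],
                  "sails": ["S", "EY", "L", "Z"]}

def word_model(word):
    """A word model is the concatenation of its phone models."""
    return [f"HMM({p})" for p in pronunciations[word]]

# Bigram grammar: allowed successors of each word ("<s>" = sentence start).
bigram = {"<s>": {"ship", "sheep"}, "ship": {"sails"}, "sheep": {"sails"}}

def sentence_model(words):
    """Concatenate word models into a sentence model, subject to the
    bigram grammar's word-order constraints."""
    prev = "<s>"
    models = []
    for w in words:
        if w not in bigram.get(prev, set()):
            raise ValueError(f"bigram grammar forbids '{prev} {w}'")
        models.extend(word_model(w))
        prev = w
    return models

print(sentence_model(["ship", "sails"]))
```

The more restrictive the grammar, the fewer word sequences the search must consider, which is why the bigram grammar gave the best accuracy in the tests cited above.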

The SPHINX system has various levels of representation for linguistic units:

(Cf. [Lee et al. (1990)]; see also [Alleva et al. (1992)]).

EVAR (``Erkennen - Verstehen - Antworten - Rückfragen'', ``Recognition - Understanding - Answering - Clarification'') is a large-vocabulary continuous speech  recognition and dialogue system.    It is designed to understand standard German sentences and to react either in the form of an answer or of a question referring back to what has been said, within the specific discourse domain of enquiries concerning Intercity timetables. The EVAR lexicon has the following properties:   A lexicon administration system has been developed which uses tools for extracting words according to specified criteria, such as ``Look for nouns that express a location'' or ``Look for prepositions that express a direction''. (Cf. [Ehrlich (1986)], [Brietzmann et al. (1983)], [Niemann et al. (1985)], [Niemann et al. (1992)].)

  The VERBMOBIL speech-to-speech translation prototype uses lexical information in a wide variety of ways, and much effort went into the creation of standardised orthographic transcriptions , pronunciation dictionaries with integrated prosodic and morphological information, and lexica for syntactic, semantic, pragmatic and transfer (translation) information. The system lexicon is distributed between a large number of modules concerned with recognition, parsing, semantic construction and evaluation, transfer, language generation and synthesis, related by ``VERBMOBIL interface terms'', i.e. standardised lexical information vectors. The VERBMOBIL lexical database was made available to the consortium by means of an interactive World-Wide Web form interface, together with a concordance for linguistic analysis and additional special interactive tools for investigating the phonetic similarities which cause false analyses and misunderstandings and can be used to trigger clarification dialogues (see Chapter 13). The core of the VERBMOBIL lexical database is a knowledge base of 10,000 lexical stems, and a DATR/Prolog inference machine which generates 50,000 fully inflected forms and 300,000 mappings between inflected forms and morphological categories ([Bleiching et al. (1996)]).
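The generation step can be illustrated with a much-simplified sketch: plain Python rather than DATR/Prolog, with an invented two-stem knowledge base and a single simplified inflectional paradigm. The point is only the architecture, in which a small stem lexicon plus inference rules yields a much larger set of inflected forms and form-to-category mappings.

```python
# Toy stem knowledge base: stem -> inflectional class (German: ask, say).
stems = {"frag": "weak_verb", "sag": "weak_verb"}

# Simplified paradigm: morphological category -> ending.
paradigms = {
    "weak_verb": {"1sg.pres": "e", "2sg.pres": "st",
                  "3sg.pres": "t", "1pl.pres": "en"},
}

def generate(stems, paradigms):
    """Generate all inflected forms and map each form to its
    (stem, morphological category) analyses."""
    forms = {}
    for stem, cls in stems.items():
        for cat, ending in paradigms[cls].items():
            forms.setdefault(stem + ending, []).append((stem, cat))
    return forms

forms = generate(stems, paradigms)
print(forms["fragst"])
```

With realistic paradigms the fan-out is large, which is how 10,000 stems can yield 50,000 forms and 300,000 mappings (one form may have several analyses).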



EAGLES SWLG SoftEdition, May 1997. Get the book...