
Types of knowledge source

The types of lexical knowledge source for a spoken language system depend largely on the application. For any given language there are few general sources of spoken language lexical material (for instance with pronunciation and general frequency information). Constructing such a source is a major task, formidable for many theoretical and practical reasons, which requires concerted large-scale action by specialists from the whole language engineering community and will demand a great deal of effort in the coming years. The three major sources of lexical knowledge for spoken language lexical systems are:

  1. existing dictionaries (to some extent),
  2. application specific corpora (to a large extent),
  3. results of descriptive, theoretical and computational linguistics (to some extent).

There is still a definite lack of general resources in the area (cf. the introduction to this chapter), and the construction of application-derived, generalisable resources will be a major task for any project and for the entire spoken language community in the coming years.

General lexical material is required for the lexical knowledge in general coverage text-to-speech systems, as well as for broad application pronunciation tables for speech recognition.


Useful sources of information are generally available dictionaries, particularly pronouncing dictionaries, provided that they adhere to accepted standards of consistency and expressiveness of notation, and are available in electronic form. An overview of some sources was given at the beginning of this chapter, and reference should be made to the results of the EAGLES Working Group on Computational Lexica for further examples.
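As a rough illustration of how an electronic pronouncing dictionary can feed a pronunciation table, the following sketch looks up a list of word forms in a dictionary file. It is not taken from any particular dictionary or project: the function name pronlookup, the tab-separated "word, pronunciation" layout, the toy transcriptions and the <missing> marker are all assumptions made for the example.

```shell
# Hypothetical helper: look up each word form (one per line on standard
# input) in a pronouncing dictionary file given as the first argument.
# ASSUMPTION: the dictionary holds one entry per line in the form
#   word<TAB>pronunciation
pronlookup() {
    awk -F '\t' '
        # First file (the dictionary): store each pronunciation by word form.
        NR == FNR { pron[$1] = $2; next }
        # Standard input: print the entry, or flag the word form as missing.
        {
            if ($1 in pron) print $1 "\t" pron[$1]
            else            print $1 "\t" "<missing>"
        }
    ' "$1" -
}

# Usage with a toy dictionary (invented entries, for illustration only)
dict=$(mktemp)
printf 'bonn\tbOn\ntrain\ttreIn\n' > "$dict"
printf 'train\nkoeln\n' | pronlookup "$dict"
rm -f "$dict"
```

Word forms flagged as missing are the ones for which a pronunciation would still have to be supplied from another source.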


Spoken language lexica are application specific, and necessarily so when corpus-derived frequency information is needed. An example of a corpus-derived lexicon type for speech recognition was given above. Another type of corpus-derived lexicon is the diphone word list widely used in speech synthesis technology; for this, phoneme label alignment with the speech signal is required, with the aid of which diphones are defined in the signal for further processing. The chapter on Spoken Language Corpora contains detailed information on procedures of corpus treatment, and the results of the EAGLES Working Group on Text Corpora should also be consulted.
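The role of label alignment in defining diphones can be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure prescribed here: it assumes a hypothetical label file with one phoneme per line given as start time, end time and label, and it assumes the convention that a diphone runs from the midpoint of one phoneme to the midpoint of the next. The function name diphones and the toy labels are invented for the example.

```shell
# Hypothetical helper: read time-aligned phoneme labels, one per line, as
#   start_time end_time label
# and print one diphone per line as
#   left-right cut_start cut_end
# ASSUMPTION: each diphone runs from the midpoint of the left phoneme
# to the midpoint of the right phoneme.
diphones() {
    LC_ALL=C awk '
        # Midpoint of the current phoneme segment.
        { mid = ($1 + $2) / 2 }
        # From the second label on, emit the diphone formed with the previous one.
        NR > 1 { printf "%s-%s %.3f %.3f\n", prev, $3, prevmid, mid }
        # Remember the current label and midpoint for the next line.
        { prev = $3; prevmid = mid }
    '
}

# Usage with three toy labels (times in seconds)
printf '0.00 0.10 b\n0.10 0.30 O\n0.30 0.40 n\n' | diphones
```

Each output line names a diphone and the two cut points in the signal between which it would be excised for further processing.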

Acquisition tools

At the current state of the art, there are few generally available tools for constructing spoken language lexica, either by extraction from existing dictionaries or from corpora. Lexicon construction usually takes place "in house" in individual laboratories or project consortia; lexicon formats consequently vary greatly.

For information on general acquisition tools in the sense of lexicographers' work benches, reference should be made to the results of the EAGLES Working Group on Computational Lexica. It is not appropriate in this context to go into the vast domain of Machine Learning and its application to the (semi-)automatic acquisition of lexica from data.

In the area of word forms, the tools of greatest practical use for developing spoken language lexica are those for creating different kinds of word form list and word form table from corpora. The general parameters associated with acquiring syntactic, semantic and pragmatic information are not unique to spoken language lexica (though the details, for instance of spoken language dialogue, do differ greatly between spoken and written language).

It is common practice either to write custom-made programmes in C or, where speed of processing is not at a premium, to use standard UNIX script languages for processing orthographic transcriptions. Neither procedure is particularly difficult, because the procedures and associated algorithms are relatively straightforward and well understood.

The simplest approach for many applications where processing time is not critical, for instance with small lexica, or where batch-style processing is acceptable, is to use UNIX tools such as grep, tr, sed, uniq, cut, tail, spell and awk. For descriptions of these tools, a UNIX manual or textbook, or the on-line man pages on a UNIX system, should be consulted; techniques for specific database oriented UNIX tools are described by [Aho et al. (1987)], [Dougherty (1990)] and [Wall & Schwartz (1991)].
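A word form frequency list is perhaps the most basic product of such a tool chain. The following sketch is one possible pipeline, assuming plain ASCII orthographic transcriptions on standard input and treating every run of letters as a word form; the function name wordfreq and the toy input are invented for the example, and real transcriptions would need additional handling of apostrophes, hyphens and transcription mark-up.

```shell
# Hypothetical helper: build a word form frequency list from an ASCII
# orthographic transcription read on standard input.
# Output: one line per word form, "wordform<TAB>count", most frequent first.
wordfreq() {
    (
        LC_ALL=C; export LC_ALL      # byte-wise character ranges and sort order
        tr -cs 'A-Za-z' '\n' |       # turn each run of non-letters into one newline
        tr 'A-Z' 'a-z' |             # fold upper case to lower case
        grep -v '^$' |               # drop the empty line left by leading non-letters
        sort |                       # bring identical word forms together
        uniq -c |                    # count each group of identical lines
        sort -k1,1nr -k2,2 |         # highest count first, ties in alphabetical order
        awk '{ print $2 "\t" $1 }'   # normalise to "wordform<TAB>count"
    )
}

# Usage with a toy two-line transcription
printf 'the train to Bonn\nThe next train\n' | wordfreq
```

Because each stage reads and writes plain lines of text, individual stages can be replaced, for instance by a sed command that strips transcription mark-up before tokenisation, without touching the rest of the pipeline.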

An example of database formatting was given above. Simple examples of UNIX tool applications are illustrated in grossly simplified form below in order to convey an idea of the sort of corpus pre-processing required for ASCII-based spoken language lexicon acquisition.


EAGLES SWLG SoftEdition, May 1997.