Types of knowledge source

The types of lexical knowledge source for a spoken language system depend largely on the application. There are few general sources of lexical spoken language material (for instance with pronunciation and general frequency information) for any language. The construction of such a source is a major task which requires concerted action on a large scale by specialists from the whole language engineering community. It is a formidable task for many theoretical and practical reasons, and one which will require a great deal of effort in the coming years. The three major sources of lexical knowledge for spoken language lexical systems are:

existing dictionaries (to some extent),
application specific corpora (to a large extent),
results of descriptive, theoretical and computational linguistics (to some extent).

There is still a definite lack of general resources in the area (cf. the introduction to this chapter), and the construction of application-derived, generalisable resources will be a major task for any project and for the entire spoken language community in the coming years.

General lexical material is required for the lexical knowledge in general coverage text-to-speech systems, as well as for broad application pronunciation tables for speech recognition.

Dictionaries

Useful sources of information are generally available dictionaries, particularly pronouncing dictionaries, provided that they adhere to accepted standards of consistency and expressiveness of notation, and are available in electronic form. An overview of some sources was given at the beginning of this chapter, and reference should be made to the results of the EAGLES Working Group on Computational Lexica for further examples.

Corpora

Spoken language lexica are application specific, and necessarily so when corpus-derived frequency information is needed. An example of a corpus-derived lexicon type for speech recognition was given above. Another type of corpus-derived lexicon is the diphone word list widely used in speech synthesis technology; for this, phoneme label alignment with the speech signal is required, with the aid of which diphones are defined in the signal for further processing. The chapter on Spoken Language Corpora contains detailed information on procedures of corpus treatment, and the results of the EAGLES Working Group on Text Corpora should also be consulted.

Acquisition tools

At the current state of the art, there are few generally available tools for constructing spoken language lexica, either by extraction from existing dictionaries or from corpora. Lexicon construction usually takes place "in house" in individual laboratories or project consortia; lexicon formats consequently vary greatly.

For information on general acquisition tools in the sense of lexicographers' work benches, reference should be made to the results of the EAGLES Working Group on Computational Lexica. It is not appropriate in this context to go into the vast domain of Machine Learning and its application to the (semi-)automatic acquisition of lexica from data.

Of greatest practical use for the development of spoken language lexica in the area of word forms are the tools required for creating different kinds of word form list and word form table from corpora; the general parameters associated with acquiring syntactic, semantic and pragmatic information are not unique to spoken language lexica (though the details, for instance of spoken language dialogue, indeed differ greatly from spoken to written language).

It is common practice either to write custom-made programs in C, or, where speed of processing is not at a premium, to use standard UNIX script languages for processing orthographic transcriptions. Neither of these procedures is particularly difficult, because the procedures and associated algorithms are relatively straightforward and well understood.

The simplest approach for many applications where processing time is not critical, for instance with small lexica, or where batch-style processing is acceptable, is to use UNIX tools such as grep, tr, sed, uniq, cut, tail, spell and awk. For descriptions of these tools, a UNIX manual or textbook, or the man page on-line information on a UNIX system should be consulted; techniques for specific database oriented UNIX tools are described by [Aho et al. (1987)], [Dougherty (1990)], [Wall & Schwartz (1991)].

An example of database formatting was given above. Simple examples of UNIX tool applications are illustrated in grossly simplified form below in order to convey an idea of the sort of corpus pre-processing required for ASCII-based spoken language lexicon acquisition.

Orthographic transcription to word list:

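```
#!/bin/sh
# Simple wordlist generator: one word per line, sorted, duplicates removed
echo "Wordlist generator"
# Replace every run of non-letters by a single newline, then sort and deduplicate
tr -sc 'A-Za-z' '\012' < "$1" | sort | uniq > wordlist.srt
echo "Wordlist in file 'wordlist.srt'"
```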
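
The word list script above treats "The" and "the" as different word forms. Where case distinctions in the orthographic transcription are not needed, a case-folded variant may be preferable. The following is a minimal sketch under that assumption; the function name `fold_wordlist` is hypothetical and not part of the original scripts.

```sh
#!/bin/sh
# Case-folded wordlist sketch (illustrative variant of the wordlist generator).
# fold_wordlist is a hypothetical helper name.
fold_wordlist() {
    # 1. map upper case letters to lower case
    # 2. replace every run of non-letters by a single newline
    # 3. drop the empty line that appears if the text starts with a non-letter
    # 4. sort and remove duplicates
    tr 'A-Z' 'a-z' < "$1" |
        tr -sc 'a-z' '\012' |
        grep -v '^$' |
        sort -u
}

# Usage: sh foldlist.sh transcription.txt > wordlist.srt
if [ $# -ge 1 ]; then
    fold_wordlist "$1"
fi
```

Like the original script, this only handles unaccented ASCII letters; transcriptions containing other characters would need a different character set in the `tr` calls.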

Orthographic transcription to frequency list:

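```
#!/bin/sh
# Simple word frequency generator: counts occurrences of each distinct word
echo "Word frequency generator"
# uniq -c prefixes each distinct word with its number of occurrences
tr -sc 'A-Za-z' '\012' < "$1" | sort | uniq -c > wordlist.frq
echo "Wordlist in file 'wordlist.frq'."
```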
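
The frequency list produced in this way is ordered alphabetically. For many purposes a list ranked by frequency is more convenient. The following sketch shows one possible way of producing it with the same tools; the function name `rank_words` and the tab-separated output format (word, then count) are assumptions made for illustration.

```sh
#!/bin/sh
# Ranked frequency list sketch; rank_words is a hypothetical helper name.
rank_words() {
    # Tokenise as before, drop empty lines, count each word,
    # sort by descending count (ties in alphabetical order),
    # and print "word<TAB>count".
    tr -sc 'A-Za-z' '\012' < "$1" |
        grep -v '^$' |
        sort |
        uniq -c |
        sort -k1,1nr -k2,2 |
        awk '{ print $2 "\t" $1 }'
}

# Usage: sh rankwords.sh transcription.txt > wordlist.rnk
if [ $# -ge 1 ]; then
    rank_words "$1"
fi
```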

Orthographic transcription to digram frequency table:
```
#!/bin/sh
# Simple digram table generator
echo "Digram generator"
# One word per line
tr -sc 'A-Za-z' '\012' < "$1" > lines.txt
# The same list shifted up by one line, so that each word is paired with its successor
tail -n +2 lines.txt > tailed.txt
paste lines.txt tailed.txt | sort | uniq -c > digrams.tab
echo "Digram frequency table in file 'digrams.tab'."
```
Digram frequency information of this type is the basis for the construction of statistical language models; this simple illustration is, however, not to be compared with state-of-the-art technology (cf. Chapter 7).
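
To indicate how digram counts relate to a language model, the following sketch estimates the conditional probability of a word given its predecessor as a plain relative frequency, i.e. count(w1, w2) divided by the number of digrams that begin with w1. This is an assumption chosen for illustration only: it uses no smoothing, ignores utterance boundaries, and the function name `bigram_probs` is hypothetical.

```sh
#!/bin/sh
# Relative-frequency digram probabilities; bigram_probs is a hypothetical helper name.
bigram_probs() {
    tr -sc 'A-Za-z' '\012' < "$1" | grep -v '^$' |
    awk '
        # From the second word on, count the digram (prev, current)
        # and the number of digrams that start with prev.
        NR > 1 { pair[prev "\t" $1]++; first[prev]++ }
        { prev = $1 }
        END {
            for (p in pair) {
                split(p, w, "\t")
                # Output: w1 <TAB> w2 <TAB> estimate of P(w2 | w1)
                printf "%s\t%s\t%.3f\n", w[1], w[2], pair[p] / first[w[1]]
            }
        }' | sort
}

# Usage: sh bigramprobs.sh transcription.txt > digrams.prb
if [ $# -ge 1 ]; then
    bigram_probs "$1"
fi
```

The counting is done in a single awk pass rather than with `tail` and `paste`, which avoids the intermediate files but yields the same word pairs.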


EAGLES SWLG SoftEdition, May 1997.