Criteria for the coverage of lexica for spoken language processing systems are heavily corpus-determined, and differ considerably from the criteria for coverage of lexica in traditional computational linguistics and some areas of natural language processing. In theoretical computational linguistics, interests are determined by systematic fragments of natural languages which reveal interesting problems of representation and processing. In natural language processing, maximally broad coverage is often the goal. Spoken language lexica as currently used in speech technology are always oriented towards a particular well-defined corpus, which has often been specifically constructed for the task in hand. When speech technology and natural language specialists meet, for instance in comprehensive dialogue-oriented development projects, these differences of terminology and priorities are a potential source of misunderstanding and disagreement, and joint solutions need to be carefully negotiated.
The main coverage criteria for spoken language lexica may be summarised as follows.
The first four criteria define quantitative or extensional coverage; the fifth defines qualitative or intensional coverage of the lexicon.
These criteria pertain to words; if other units, such as idioms, are involved, the criteria apply analogously to these.
The first three extensional criteria are essential for the current state of speech technology. Conventional expectations in written language processing, i.e. in natural language processing and computational linguistics, are widely different, and are expressed in the fourth criterion. Clearly the second and fourth criteria clash; the relation to relevant corpora must therefore be carefully flagged in a spoken language lexicon. The degree of extensional coverage (which for a speech recognition system generally has to be 100%) is sometimes expressed in terms of the degree of static coverage (the ratio of the number of words in a corpus which are contained in a given dictionary to the number of words in the corpus) and the degree of dynamic coverage or saturation (the probability of encountering words which have previously been encountered); the latter value is generally higher than the former [Ferrané et al. (1992)]. On the basis of corpus statistics for typologically different languages such as English [Averbuch et al. (1987)] and French [Mérialdo (1988)], which differ widely in their inflectional structure (English with few verbal inflections, French with a rich verbal inflection system), interesting quantitative comparisons can be made (cf. Table 6.1).
Language | Vocabulary (no. of forms) | Static coverage | Dynamic coverage |
English | 5000 | 92.5% | |
English | 20000 | 97.6% | |
French | 20000 | 94.9% | 98.2% |
French | 200000 | 97.5% | 99.5% |
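
The two coverage measures defined above can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not settle: words are counted as running tokens rather than as distinct forms, and dynamic coverage is estimated as the proportion of tokens whose form has already occurred earlier in the same corpus. The function names, the toy corpus and the toy lexicon are hypothetical and serve only as illustration; they are not taken from the cited studies.

```python
from typing import Sequence, Set


def static_coverage(tokens: Sequence[str], lexicon: Set[str]) -> float:
    """Proportion of corpus tokens whose form is listed in the lexicon.

    Assumption: coverage is counted over tokens, not over distinct forms.
    """
    if not tokens:
        raise ValueError("corpus is empty")
    covered = sum(1 for tok in tokens if tok in lexicon)
    return covered / len(tokens)


def dynamic_coverage(tokens: Sequence[str]) -> float:
    """Proportion of tokens whose form has already occurred earlier.

    Assumption: this relative frequency stands in for the probability of
    encountering a previously encountered word.
    """
    if not tokens:
        raise ValueError("corpus is empty")
    seen: Set[str] = set()
    repeats = 0
    for tok in tokens:
        if tok in seen:
            repeats += 1
        else:
            seen.add(tok)
    return repeats / len(tokens)


# Toy data, invented for illustration only.
corpus = "the train to bonn leaves at ten and the train to ulm leaves at nine".split()
lexicon = {"the", "train", "to", "leaves", "at", "ten", "nine", "and"}

print(f"static coverage:  {static_coverage(corpus, lexicon):.3f}")
print(f"dynamic coverage: {dynamic_coverage(corpus):.3f}")
```

On a corpus this small the figures are not meaningful in themselves; the sketch only shows how the two ratios are formed. Counting distinct forms instead of tokens would be an equally defensible reading of the static definition and would give different values.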