At the present time, information about lexica for spoken language
systems
is relatively hard to come by. One reason for this is that such information
is largely contained in specifications of particular proprietary
or prototype systems
and in technical reports with restricted distribution.
With the advent of organisations for coordinating the use of language
resources, such as ELRA (the European Language Resources Association)
and the LDC (the Linguistic Data Consortium), access to information on
spoken language lexica is becoming more widely available.
Another reason for difficulties in obtaining information about
spoken language lexica is that there
is no close relation between concepts and terminology in the speech
processing field on the one hand,
and concepts and terminology in traditional lexicography,
natural language processing and computational linguistics on the other.
Components such as Hidden Markov Models for
word recognition,
stochastic language models for word
sequence patterns, grapheme-phoneme tables and rules, and word-oriented knowledge
bases for semantic interpretation or text construction are all concerned with
the identity and properties of words, lexical access,
lexical disambiguation, lexicon architecture
and lexical representation, but these relations are not
immediately obvious within the specific context of speech technology.
Stochastic word models, for instance, would not
generally be regarded as
a variety of lexicon, yet they evidently do provide corpus-based lexical
information about word collocations.
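The sense in which a stochastic word model carries lexical information can be illustrated with a minimal sketch. It estimates, from a toy corpus invented for this illustration, the relative frequency with which one word follows another; the function name `bigram_model` and the example sentences are assumptions and are not taken from any of the systems discussed here.

```python
from collections import Counter

def bigram_model(sentences):
    """Estimate P(word | previous word) by relative frequency in a corpus."""
    pair_counts = Counter()   # how often each adjacent word pair occurs
    prev_counts = Counter()   # how often each word occurs as a predecessor
    for sentence in sentences:
        words = sentence.split()
        for prev, word in zip(words, words[1:]):
            pair_counts[(prev, word)] += 1
            prev_counts[prev] += 1
    return {pair: count / prev_counts[pair[0]]
            for pair, count in pair_counts.items()}

# Toy corpus (illustrative only).
corpus = ["show the ships", "show the ports", "list the ships", "show all ships"]
model = bigram_model(corpus)

for (prev, word), p in sorted(model.items()):
    print(f"P({word} | {prev}) = {p:.2f}")
```

Each entry of the resulting table is a statement about a word and its typical neighbours, which is the kind of collocational information a lexicon could equally well record.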
A terminological problem should be noted at the outset:
in the spoken language technologies, the term linguistic is often
used for the representation and processing in sentence, text and dialogue
level components, and acoustic for word models. With present-day
systems, this terminology is misleading. The integration of prosody, for
example, requires the interfacing of acoustic techniques at sentence, text
and dialogue levels,
and linguistic analysis is involved at the word level for the specification
of morphological components in systems developed
for highly inflecting languages or for the recognition of
out-of-vocabulary words, or for using phonological
information in structured Hidden Markov Models (HMMs).
It is useful to distinguish between system lexica and
lexical databases. The distinction may,
in specific cases, be
blurred, and the unity of the two concepts may also be rather loose if the
system lexicon is highly modular,
or distributed among several system components,
or if several different lexical databases are used.
However, the distinction is a useful one. The distinction
between lexica and lexical databases will be discussed below. Since the kinds
of information in both these types of lexical object overlap, the term
``spoken language lexicon'' will generally be
used in this chapter to cover both types.
The following overview is necessarily selective.
Lexica for spoken language are used in a variety of systems, including the
following:
- Automatic spelling correctors (spelling is determined to a large
extent by phonological considerations).
- Medium and large-vocabulary automatic speech recognition
(ASR), as in
systems such as SPICOS
[Höge et al. (1985), Dreckschmidt (1987), Ney et al. (1988), Thurmair (1986)], HEARSAY-II
[Lesser et al. (1975), Erman (1977), Erman & Lesser (1980), Erman & Hayes-Roth (1981)], SPHINX
[Lee et al. (1990)], ISADORA [Schukat-Talamazzini (1993)], or, for example,
in automatic dictation machines such as IBM's
TANGORA [Averbuch et al. (1986), Averbuch et al. (1987), Jelinek (1985)] and
DragonDictate by Dragon Systems [Baker (1975a), Baker (1975b), Baker (1989), Baker et al. (1992)].
- Speech synthesis in text-to-speech systems, for example in reading
machines and speaking clocks. For further speech synthesis
applications, relevant studies such as [Allen et al. (1987)], [Bailly & Benoît (1992)],
[Bailly (1994)], [Van Coile (1989)], [Klatt (1982), Klatt (1987)], [Hertz et al. (1985)] and [Van Hemert et al. (1987)] can be
consulted (see also Chapter 12).
- Interactive dialogue systems, with speech front ends to databases and
enquiry systems and synthesised responses (see for instance [Brietzmann et al. (1983), Niemann et al. (1985), Niemann et al. (1992), Bunt et al. (1985)]); see also Chapter 13 on interactive
dialogue systems.
- Speech-to-speech translation systems as developed in the ATR and
VERBMOBIL projects, which use various speech recognition
techniques, including
continuous speech recognition, recognition of new words, and
word spotting in continuous speech. For speech translation systems see for
instance [Rayner et al. (1993)] and [Woszczyna et al. (1993)].
- Lexica and encyclopaedias on CD-ROM with multimedia (including acoustic)
output.
- Research and development of spoken language processing systems, in the
process of which broader based lexica for written language, coupled with
tools such as grapheme-phoneme converters, may be used as sources of
information.
Spoken language lexica may be components of systems such as those listed
above, or reusable background resources. System lexica are generally only
of local interest within institutes, companies or projects. Lexical databases as reusable
background resources which are intended to be more generally available raise
questions of standardised representation, storage and dissemination.
In general, the same
principles apply as for Spoken Language Corpora:
they are collated, stored and
disseminated using a variety of media. In research and development contexts,
magnetic media (disk or tape) were preferred until recently; in recent years,
storage has remained local and magnetic, while wider informal dissemination within projects or
other relevant communities has been conducted
via the Internet using standard file transfer protocols, electronic mail and
World-Wide Web search and access.
Large lexica, and corpora on which large lexica are based, are also
stored and disseminated in the form of ISO standard CD-ROMs.
The following brief overview can do no more than list a number of examples of
current work on spoken language lexicography. At this stage, no claim to
exhaustiveness is made, and no valuation of cited or uncited work is intended.
- A number of general lexica with information relevant to spoken
language have already been available on CD-ROM for quite some time, including
the Hachette and Robert (9 volume) dictionaries for French, the Oxford English
Dictionary, the Duden dictionary
for German, and the Franklin Computer
Corporation Master 4000 dictionary with acoustic output for 83000 words
[Goorfin (1989)].
- Several lexica with more restricted circulation have been developed
in the context of speech technology research and development. Companies such
as IBM, and telecom research and development institutes such as CNET in France
have developed large lexica (CNET, for instance, has a 55000 word and 12000
phrase lexicon).
- University and other research institutes have also
constructed large
lexica; in France, for example, such institutes as ENST in Paris, ICP in
Grenoble [Tubach & Bok (1985)], Paris ([Plenat (1991)], for a pronunciation dictionary of
abbreviations) and IRIT in Toulouse (the BDLEX project) have worked on large
spoken language lexica. The BDLEX-1 lexicon coordinated by IRIT [Pérennou & De Calmès (1987)] contains 23000
entries, and BDLEX-2 [Pérennou et al. (1991), Pérennou et al. (1992), Pérennou & Tihoni (1992)] contains
50000 entries; a set of linguistic software tools permits the construction of a
variety of daughter lexica for spelling correction and
lemmatisation, and
defines a total of 270000 fully inflected forms.
- The Belgian BRULEX psycholinguistic lexicon
contains information on
uniqueness points (the point in a letter tree where a word form is uniquely
identified), lexical fields, phonological patterns and mean digram frequencies
for 36000 words [Content et al. (1990)].
- In the United Kingdom, the Alvey project resulted in many tools and lexical
materials [Boguraev et al. (1988)].
- In the Netherlands, the Nijmegen lexical database CELEX [Baayen (1991)], also available
on CD-ROM, contains components with 400000 Dutch forms, 15000 English forms and
51000 German forms, together with an access tool FLEX.
- For German, lexical databases for spoken language lexica have been
constructed by companies such as Siemens, Daimler-Benz, IBM and Philips, as
well as in university speech technology departments (e.g. Munich, Erlangen,
Karlsruhe, Bielefeld), and in the VERBMOBIL project [Gibbon (1995), Gibbon & Ehrlich (1995)]; these have been made available on the
World-Wide Web with interactive form interfaces.
- Work in computational lexicology and computational phonology has led
to the development of structured lexicon concepts for spoken language such as
ILEX [Gibbon (1992a), Bleiching (1992)] based on the DATR lexical knowledge
representation language [Evans & Gazdar (1989), Evans & Gazdar (1990)]; the DATR language has been
applied to word form lexica in the multilingual SUNDIAL project [Andry et al. (1992)] by the German partner
Daimler-Benz and in the German VERBMOBIL
project [Gibbon (1993)].
- The European Commission has funded a number of projects, particularly
within the ESPRIT programme, in which questions of multilingual spoken
language system lexica have been addressed, albeit relatively indirectly
(POLYGLOT, SUNDIAL, SAM, SAM-A), as well as lexicography projects such as
MULTILEX in the ESPRIT programme [Heyer et al. (1991)],
GENELEX in the EUREKA
programme [Nossin (1991)] and ACQUILEX, which concentrate on multi-functional written
language lexica, though extension of the results to spoken language
information has been provided for by the adoption of general sign-based
lexicon architectures (see the results of the EAGLES Working Group on
Computational Lexica).
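The notion of a uniqueness point mentioned for BRULEX (the point in a letter tree where a word form is uniquely identified) can be made concrete with a small sketch. The word list and the function name `uniqueness_points` are assumptions made for this illustration; the sketch works on letters, and says nothing about how BRULEX itself computes or stores the values.

```python
def uniqueness_points(words):
    """Map each word to its uniqueness point: the 1-based position of the
    first letter at which no other word in the list shares the prefix.
    A word that is a prefix of another word has no such point (None)."""
    vocabulary = set(words)

    # Each prefix corresponds to a node of the letter tree; count how many
    # words pass through that node.
    prefix_counts = {}
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

    points = {}
    for word in vocabulary:
        points[word] = None
        for i in range(1, len(word) + 1):
            if prefix_counts[word[:i]] == 1:
                points[word] = i
                break
    return points

# Toy word list (illustrative only).
lexicon = ["spell", "speech", "speak", "lexicon", "lexical"]
points = uniqueness_points(lexicon)
for word in lexicon:
    print(word, points[word])
```

In this toy list, "spell", "speech" and "speak" share the prefix "spe" and diverge at the fourth letter, whereas "lexicon" and "lexical" are only distinguished at the sixth.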
The range of existing spoken language systems is large, so that only a small
selection can be outlined, concentrating on well-known
older or established systems whose lexicon requirements are representative
of different approaches and convey the flavour of basic lexical problems and
their treatment.
The field is currently developing rapidly.
Small vocabulary systems are also excluded, as their
strong points are evidently not in the area of the lexicon. The concepts referred
to in the descriptions are discussed in the relevant sections below.
Reference should also be made to Chapters 5 and 7.
- HARPY was a large-vocabulary (1011 words) continuous speech
recognition system developed at Carnegie Mellon University. It was the best performing speech
recognition system developed under the five-year
ARPA project launched in 1971. HARPY made use of various knowledge sources,
including a highly constrained grammar (a finite state
grammar in BNF
[Backus Naur Form] notation) and lexical knowledge in the form of a
pronunciation dictionary containing alternative pronunciations of each word.
Initial attempts to derive within-word phonological variations with a set of
phonological rules operating on a baseform failed. A set of juncture rules
described inter-word phonological phenomena such as /p/
deletion at /pm/
junctures: /helpmi/ becomes /helmi/. The spectral characteristics of the allophones
of a
given phoneme, including their empirically determined durations, were stored in
phone templates. The HARPY system compiled all knowledge into a unified
directed graph representation, a transition network of 15,000 states (the so-called blackboard model). Each state in
the network corresponded to a spectral template. The spectra of the observed segments were compared with the spectral
templates in the network. The system determined which sequence of spectra,
that is, which path through the network, provided the best match with the
acoustic input spectral sequence
(cf. [Klatt (1977)]; see also [Lowerre & Reddy (1980)]).
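The interplay of a pronunciation dictionary with alternative pronunciations and a juncture rule can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of HARPY: the dictionary entries, the use of one character per phoneme, the function names, and the treatment of /p/ deletion as optional are all assumptions.

```python
from itertools import product

# Hypothetical pronunciation dictionary: each word lists its alternative
# pronunciations, written with one character per phoneme.
PRONUNCIATIONS = {
    "help": ["help"],
    "me": ["mi"],
    "him": ["him", "im"],
}

def apply_juncture_rules(left, right):
    """Return the possible realisations of two adjacent pronunciations."""
    variants = [left + right]
    # /p/ deletion at a /pm/ juncture; assumed optional here, so the
    # undeleted variant is kept as well.
    if left.endswith("p") and right.startswith("m"):
        variants.append(left[:-1] + right)
    return variants

def utterance_variants(words):
    """Enumerate the phoneme strings a word sequence may be realised as."""
    results = set()
    for choice in product(*(PRONUNCIATIONS[w] for w in words)):
        partial = {choice[0]}
        for following in choice[1:]:
            partial = {variant
                       for realised in partial
                       for variant in apply_juncture_rules(realised, following)}
        results |= partial
    return results

print(sorted(utterance_variants(["help", "me"])))
print(sorted(utterance_variants(["help", "him"])))
```

In a compiled network of the kind described above, each of these variants would correspond to an alternative path rather than to an item in an enumerated set.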
- HEARSAY-II also used the blackboard principle (see HARPY),
where knowledge sources contribute to the recognition process via a global data
base. In the recognition process, an utterance is segmented into categories of
manner-of-articulation features, e.g. a stop-vowel-stop pattern. All words with
a syllable structure corresponding to that of the input are proposed as
hypotheses. However, words can also be hypothesised top-down by the syntactic
component, so that misses by the lexical hypothesiser, which are very likely, can
be made up for by the syntactic predictor. The lexicon for word verification has
the same structure as that of HARPY; it is defined in terms of spectral patterns
(cf. [Klatt (1977)]; see also [Erman (1977)] and [Erman & Lesser (1980)]).
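Bottom-up word hypothesisation from manner-of-articulation patterns, as described for HEARSAY-II, amounts to indexing the lexicon by broad-class pattern. The sketch below illustrates the idea only; the phoneme-to-class table, the toy lexicon, its simplified transcriptions and all function names are assumptions, not details of the HEARSAY-II implementation.

```python
# Hypothetical mapping from phoneme symbols to manner-of-articulation classes.
MANNER = {
    "p": "stop", "t": "stop", "k": "stop",
    "b": "stop", "d": "stop", "g": "stop",
    "s": "fricative", "f": "fricative",
    "m": "nasal", "n": "nasal",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
}

# Toy lexicon: orthographic word -> simplified phoneme string.
LEXICON = {"cat": "kat", "pit": "pit", "dog": "dog",
           "sun": "sun", "man": "man", "tea": "ti"}

def manner_pattern(phonemes):
    """Reduce a phoneme string to its sequence of manner classes."""
    return tuple(MANNER[p] for p in phonemes)

def build_index(lexicon):
    """Index the lexicon by manner pattern for fast hypothesis lookup."""
    index = {}
    for word, phonemes in lexicon.items():
        index.setdefault(manner_pattern(phonemes), []).append(word)
    return index

def hypothesise(pattern, index):
    """Propose all words whose manner pattern matches the input pattern."""
    return sorted(index.get(tuple(pattern), []))

INDEX = build_index(LEXICON)
print(hypothesise(["stop", "vowel", "stop"], INDEX))
```

Because many words share a pattern, such a lookup returns a set of candidates to be verified, and an erroneous segmentation returns the wrong set or none at all, which is where top-down prediction by the syntactic component becomes necessary.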
- SPHINX is a large-vocabulary
continuous speech recognition system for speaker-independent
application. It was evaluated on the DARPA naval resource
management task. The baseline SPHINX system works with Hidden Markov Models
(HMMs), where each
HMM represents a phone. There are 45 phones in total. The phone models are
concatenated to create word models, which in turn serve to create sentence
models. The phonetic spelling of a word was adopted from the ANGEL System
[Rudnicky et al. (1987)]. The SPHINX baseline system has been improved by
introducing multiple codebooks and adding information to the
lexical-phonological component:
- The most likely pronunciation was substituted for the baseform
pronunciation of a lexical item in the pronunciation dictionary,
retaining the
assumption that each lexical item has only one pronunciation.
- Different models were created for phones that typically have more than one
realisation, such as released and unreleased /d/
at the beginning of /ddma/ and before /m/, respectively.
- Two subword units were introduced:
function word dependent phone
models and generalised triphone models. Since function words are typically
unstressed, phones in function words are very often deleted or reduced, so that ordinary phone models do not
serve as proper models for them in recognition tasks; function words account for almost 50%
of the errors.
The SPHINX system works with grammars of different perplexity (average branching
factor; see Chapter 7); the grammars are of a type which can,
in principle, be regarded as a specialised tabular, network-like or
tree-structured lexicon with probabilistic word-class information:
- A null grammar with a perplexity of 997 (i.e. a vocabulary of 997
words was used); in a null grammar any word can succeed a given word.
- A word-pair grammar with a perplexity of
60; word-pair grammars are
lists of words that can follow a given word.
- A bigram grammar with a perplexity of 20;
this is a word-pair grammar
equipped with word-category transition probabilities.
In word recognition tests, the best results were obtained with the bigram
grammar, the most restrictive of the grammars mentioned above (96%
accuracy compared with 71% for null grammars).
The SPHINX system has various levels of representation for linguistic
units:
- phone models (generalised triphones and extra models for function
words),
- word models (stored in the pronunciation dictionary with one
representation for each word),
- sentence models (for final confirmation).
(Cf. [Lee et al. (1990)]; see also [Alleva et al. (1992)]).
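The three grammar types used with SPHINX, and the sense in which perplexity measures an average branching factor, can be illustrated with a toy example. The six-word vocabulary, the follower lists and the probabilities are invented for this sketch, and the bigram table uses word-to-word probabilities as a simplification of the word-category transition probabilities mentioned above.

```python
import math

VOCAB = ["show", "list", "all", "the", "ships", "ports"]

# Toy word-pair grammar: the words that may follow each word.
WORD_PAIRS = {
    "show": ["all", "the"],
    "list": ["all", "the"],
    "all": ["ships", "ports"],
    "the": ["ships", "ports"],
    "ships": [],
    "ports": [],
}

# Toy bigram grammar: the same word pairs with transition probabilities.
BIGRAMS = {
    ("show", "the"): 0.75, ("show", "all"): 0.25,
    ("list", "the"): 0.75, ("list", "all"): 0.25,
    ("all", "ships"): 0.5, ("all", "ports"): 0.5,
    ("the", "ships"): 0.75, ("the", "ports"): 0.25,
}

def null_prob(prev, word):
    """Null grammar: any word may follow any word with equal probability."""
    return 1.0 / len(VOCAB)

def word_pair_prob(prev, word):
    """Word-pair grammar: uniform probability over the permitted followers."""
    followers = WORD_PAIRS[prev]
    return 1.0 / len(followers) if word in followers else 0.0

def bigram_prob(prev, word):
    """Bigram grammar: permitted followers weighted by probability."""
    return BIGRAMS.get((prev, word), 0.0)

def perplexity(sentence, prob):
    """exp of the average negative log probability of each word transition;
    infinite if the grammar does not permit the sentence."""
    log_sum = 0.0
    for prev, word in zip(sentence, sentence[1:]):
        p = prob(prev, word)
        if p == 0.0:
            return math.inf
        log_sum += math.log(p)
    return math.exp(-log_sum / (len(sentence) - 1))

sentence = ["show", "the", "ships"]
for name, prob in [("null", null_prob),
                   ("word-pair", word_pair_prob),
                   ("bigram", bigram_prob)]:
    print(name, round(perplexity(sentence, prob), 2))
```

For the null grammar the perplexity equals the vocabulary size, which is why a 997-word vocabulary yields a perplexity of 997; each further constraint lowers the figure, mirroring the ordering 997, 60 and 20 reported for SPHINX.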
- EVAR (``Erkennen - Verstehen - Antworten - Rückfragen'',
``Recognition - Understanding - Answering - Clarification'') is a
large-vocabulary continuous speech recognition and
dialogue system. It is designed to understand
standard German sentences and to react either in the form of an answer or of a
question referring back to what has been said, within the specific discourse
domain of enquiries concerning Intercity timetables. The EVAR
lexicon has the following properties:
- The lexicon includes not only sublanguage-specific words but also many
words of the general vocabulary that may occur in a dialogue of
this kind.
- The lexicon contains fully inflected word forms.
- The baseforms, so-called Normalformen, e.g. infinitive for verbs,
nominative singular for nouns, contain information relevant for all grammatical
forms, thus reducing redundancy in the lexicon.
- The lexicon contains phonological, syntactic, semantic, and pragmatic
information.
- Since the system modules need access only to special lexical knowledge
(the articulation module makes use of
phonological information, while the module in charge of generating the surface
structure of an answer also needs syntactic information), access of individual
modules to the lexicon is restricted. Preprocessors extract the
subset of information relevant for each module.
- The lexical unit in the EVAR lexicon is the graphemic word
(graphematisches Wort); so-called phonetic words (standard pronunciation) and so-called grammatical words (syntactic categories
plus meanings) are assigned to the graphemic words.
- Lexical units are described in attribute-value notation. For example, the
attribute WORT takes a graphemic word as its value.
- Graphemic words in turn have the attributes
AUSSPRACHE (pronunciation) and SYNTAX-TEIL (syntactic part), for which values are defined in the form of a Duden
standard pronunciation and morpho-syntactic
properties such as the attribute-value pair WORTART-Verb. Numbers keep
track of the various entries for different meanings or syntactic variants (e.g.
reflexive vs. non-reflexive) of a lexical item.
- In the baseform entries,
information on the stem, the pronunciation of the stem (in ASCII symbols that replace
the standard IPA notation), and the inflection pattern is given under
SYNTAX-TEIL.
- Semantic information includes specifications of semantic features and
valence
properties as well as selectional restrictions. Fillmore's system of deep structure cases as suggested in [Fillmore (1968)] has been expanded to 28 cases.
A lexicon administration system has been developed which uses tools for
extracting
words according to specified criteria, such as ``Look for nouns that express a
location'' or ``Look for prepositions that express a direction'' (cf. [Ehrlich (1986), Brietzmann et al. (1983), Niemann et al. (1985), Niemann et al. (1992)]).
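The attribute-value organisation of the EVAR lexicon, the module-specific extraction of information and the criteria-based queries can be sketched with ordinary dictionaries. The attribute names WORT, AUSSPRACHE, SYNTAX-TEIL and WORTART come from the description above; the attribute name SEMANTIK, the values Substantiv and Praeposition, the semantic feature labels, the pronunciation strings and the function names are assumptions made for this illustration.

```python
# Toy entries in attribute-value form (illustrative only).
LEXICON = [
    {"WORT": "Bahnhof", "AUSSPRACHE": "ba:nho:f",
     "SYNTAX-TEIL": {"WORTART": "Substantiv"},
     "SEMANTIK": {"features": ["Location"]}},
    {"WORT": "Zug", "AUSSPRACHE": "tsu:k",
     "SYNTAX-TEIL": {"WORTART": "Substantiv"},
     "SEMANTIK": {"features": ["Transport"]}},
    {"WORT": "nach", "AUSSPRACHE": "na:x",
     "SYNTAX-TEIL": {"WORTART": "Praeposition"},
     "SEMANTIK": {"features": ["Direction"]}},
    {"WORT": "fahren", "AUSSPRACHE": "fa:r@n",
     "SYNTAX-TEIL": {"WORTART": "Verb"},
     "SEMANTIK": {"features": ["Movement"]}},
]

def select(lexicon, wortart, feature):
    """Return the graphemic words of a given word class that carry a
    given semantic feature."""
    return [entry["WORT"] for entry in lexicon
            if entry["SYNTAX-TEIL"]["WORTART"] == wortart
            and feature in entry["SEMANTIK"]["features"]]

def module_view(lexicon, attributes):
    """Extract only the attributes a particular module needs, in the
    manner of a preprocessor restricting access to the lexicon."""
    return [{name: entry[name] for name in attributes} for entry in lexicon]

# "Look for nouns that express a location."
print(select(LEXICON, "Substantiv", "Location"))
# "Look for prepositions that express a direction."
print(select(LEXICON, "Praeposition", "Direction"))
# View for a module that needs only phonological information.
print(module_view(LEXICON, ["WORT", "AUSSPRACHE"])[0])
```

Keeping one full entry per lexical unit and deriving restricted views from it is one way of reconciling a single lexical database with modules that each need only part of the information.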
- The VERBMOBIL speech-to-speech
translation prototype uses lexical information in a wide variety of ways,
and much effort went into the creation of standardised orthographic
transcriptions, pronouncing dictionaries
with integrated prosodic and morphological information, as well as lexica
for syntactic, semantic, pragmatic and transfer (translation) information.
The system lexicon is distributed between a large number of modules concerned
with recognition, parsing, semantic construction and evaluation, transfer,
language generation and synthesis, related by ``VERBMOBIL interface
terms'', i.e. standardised lexical information vectors.
The VERBMOBIL lexical database was made available to the consortium by
means of an interactive World-Wide Web form interface together with a
concordance for linguistic analysis, and additional special interactive tools
for investigating the phonetic similarities which cause false analyses and
misunderstandings and can be used to trigger clarification
dialogues (see Chapter 13).
The core of the VERBMOBIL lexical database
is a knowledge base of 10000 lexical stems, and a DATR/Prolog inference
machine which generates 50000 fully inflected forms and
300000 mappings between inflected forms and morphological categories
[Bleiching et al. (1996)].
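The relation between a stem knowledge base, the inflected forms generated from it and the larger number of mappings between forms and morphological categories can be illustrated with a deliberately simple sketch. It is plain Python rather than DATR or Prolog, and the two stems, the single inflection class and all names are assumptions chosen for illustration.

```python
# Hypothetical inflection class: suffix for each (person, number) category
# of the present tense of a regular German verb.
PATTERNS = {
    "weak_verb": {
        ("1", "sg"): "e",
        ("2", "sg"): "st",
        ("3", "sg"): "t",
        ("1", "pl"): "en",
        ("2", "pl"): "t",
        ("3", "pl"): "en",
    },
}

# Toy stem knowledge base: stem -> inflection class.
STEMS = {"sag": "weak_verb", "mach": "weak_verb"}

def generate(stems, patterns):
    """Return (inflected form, stem, category) for every stem and category."""
    mappings = []
    for stem, inflection_class in stems.items():
        for category, suffix in patterns[inflection_class].items():
            mappings.append((stem + suffix, stem, category))
    return mappings

mappings = generate(STEMS, PATTERNS)
forms = {form for form, _, _ in mappings}

print(len(STEMS), "stems")
print(len(forms), "distinct inflected forms")
print(len(mappings), "form-category mappings")
```

Because one form may realise several categories ("sagt" serves as both third person singular and second person plural), the number of mappings exceeds the number of distinct forms, which is the same relationship as that between the 50000 forms and 300000 mappings cited above.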
EAGLES SWLG SoftEdition, May 1997.