
Lexicon architecture and the structure of lexical databases


The architecture of a lexicon, in particular of a lexical database, is determined partly by the types of declarative knowledge it contains, partly by considerations of access and interaction with other databases or modules. The main features of spoken language lexical databases have already been discussed. In practice, a spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, and a main lexical database with syntactic and semantic information), and is part of a larger complex of databases involving speech signal files, transcription files (orthographic and phonemic), and annotation (labelling) files which define a function from the transcriptions into the digitised speech signal. However, in the interests of consistency it is helpful to take a more general lexicographic point of view, and to see a lexical database for spoken language development as a single database, in which relations between lexical items and their properties at all levels, from acoustics through word structure to syntax, semantics and pragmatics, are defined.

The major problem in deciding how to organise a lexical database is the ambiguity of word forms. In a spoken language system, the focus is on the pronunciation, i.e. on phonemic word forms (not the orthography, though this is often used as a conveniently familiar form of representation). The key issue here is homophony, i.e. a phonemic word form associated with at least two different sets of lexical information, and thus logically involving a disjunction in the database.

In a simple traditional database model based on fixed-length records, in which each field represents a specific attribute of the entity which the record stands for, there is a record for each lexical entry associated with a homophone, uniquely identified by a serial number. However, for specific applications such as the training of a speech recogniser it is convenient to have just one record for each word form. In a database which is optimised for this application, the disjunction required by the homophone is within a single record, rather than distributed over alternative records which share the same value for the pronunciation attribute. Structures of this type are typically used in pronunciation dictionaries (pronunciation lexica, pronunciation tables) for speech recognition. Disjunctive information of this kind within the lexical database corresponds to non-deterministic situations and the use of complex search algorithms in actual spoken language systems.
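
The difference between the two organisations can be sketched with a small awk filter which collapses records keyed by serial number into one record per phonemic word form. This is an illustration only: the three-column tab-separated layout, the use of ";" to separate the alternatives, and the sample entries are assumptions made for the example, not part of any of the formats described in this section.

```shell
#!/bin/sh
# Sketch: collapse per-entry records that share a pronunciation into one
# record per phonemic word form, with the disjunction inside the record.
# Assumed input layout (tab-separated): serial number, pronunciation, entry.
# The ";" separator between alternatives is likewise an assumption.

collapse_homophones() {
    awk -F '\t' '
        # First time a pronunciation is seen: remember its order of appearance.
        !($2 in entries) { order[++n] = $2; entries[$2] = $3; next }
        # Otherwise append the new lexical entry as a further alternative.
        { entries[$2] = entries[$2] ";" $3 }
        END {
            for (i = 1; i <= n; i++)
                print order[i] "\t" entries[order[i]]
        }
    '
}

# Illustrative input: one record per lexical entry, keyed by serial number.
printf '1\tme:6\tMeer,noun\n2\tme:6\tmehr,adverb\n3\tli:t\tLied,noun\n' |
    collapse_homophones
```

In the output, the two entries sharing the first pronunciation appear as alternatives within a single record, while the third entry remains a record of its own.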

A simple database type: Pronunciation tables

Pronunciation tables (pronunciation dictionaries) hardly correspond to the intuitive concept of a lexical database, which implies a fairly high degree of complexity, but they are nevertheless a useful source of practical examples of a simple lexical database structure.

Pronunciation tables define the relation between orthographic and phonemic representations of words. Often they are defined as functions which assign pronunciations (frequently a set of variant pronunciations) to orthographic representations. This is an obvious necessity for text-to-speech lexica, but a pronunciation table of this type is equally relevant in speech recognition applications, in which orthographic transcriptions (which are easier to make and check than phonemic transcriptions) are mapped to phonemic representations for the purpose of training speech recognisers.
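
The mapping from an orthographic transcription to phonemic representations can be sketched as a simple table lookup. The sketch assumes a whitespace-separated table with the orthographic key in the first column and one or more variant pronunciations in the following columns; the function name, the " | " separator in the output and the sample entries are hypothetical.

```shell
#!/bin/sh
# Sketch: map the words of an orthographic transcription to their phonemic
# forms by lookup in a pronunciation table.
# Assumed table layout: orthographic key, then one or more variant
# pronunciations, separated by whitespace.

lookup_pronunciations() {
    # $1: pronunciation table file; the transcription is read from
    # standard input, one word per line.
    awk '
        # First input (the table): store all variants under the key.
        NR == FNR {
            variants = $2
            for (i = 3; i <= NF; i++) variants = variants " | " $i
            table[$1] = variants
            next
        }
        # Second input (stdin): print each word with its variants, or flag it.
        { print $1 "\t" (($1 in table) ? table[$1] : "<not in table>") }
    ' "$1" -
}

# Illustrative table with one entry that has two variant pronunciations.
table=$(mktemp)
cat > "$table" <<'EOF'
April ?a.pr'Il
August ?aU.g'Ust ?'aU.gUst
EOF
printf 'August\nApril\nMai\n' | lookup_pronunciations "$table"
rm -f "$table"
```

Words which are missing from the table are flagged rather than silently dropped, since gaps of this kind usually have to be closed before the transcriptions can be used for training.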


morpheme:                 +
word:                     #
liaisonless group:        ##
phonological syntagma:    § (in phrasal entries)
Phonemes (in IPA or SAMPA notation), including a notation for the French archiphonemes.
Phonological diacritics:
  latency mark:             " (for consonants pronounced in liaison contexts or morphological linking)
  consonant deletion mark:  ' (e.g. for final consonants)
Table 6.3: Frequently used symbols


morpheme:                  +
stem-inflection boundary:  #+
word in compounds:         #
word in phrases:           ##
syllable:                  .
primary stress:            '
secondary stress:          '' (two single quotes)
Additional conventions:
  The boundaries # and ## are both coextensive with + and . boundaries.
  Where + and . boundaries are coextensive, . is written before +.
  The stress marks ' and '' are written immediately before the vowel, not before the syllable.
Table 6.4: VERBMOBIL diacritics

A pronunciation table which involves pronunciation variants (see below) provides a simple illustration of the orthographic noise problem, represented by disjunctions in the database.

Pronunciation tables have to fulfil a number of criteria, in particular the criterion of unambiguous notation, of consistency with orthographic transcriptions and other transcriptions of a particular corpus, and of simple and fast processing.

General proposals for the interchange of lexical information about word forms, including morphological, phonological and prosodic information, have been made for different languages. They do not have standard status at the current time, but they are sufficiently similar to justify recommendation. A standard for French has been described [Pérennou & De Calmès (1987), Autesserre et al. (1989)], containing the features listed in Table 6.3.

For the spoken language lexicon in the German VERBMOBIL project the same basic principle has been adopted [Bleiching & Gibbon (1994)], with extensions for incorporating prosodic information, as in Table 6.4.

Table 6.5 shows an extract from the VERBMOBIL pronunciation table in the VERBMOBIL WIF (Word form Interchange Format) convention; following current practice, it is organised according to orthographic keys.


ASCII orthography      Extended SAMPA transcription
Angst                  ?'aNst
Annahme                ?'an#n''a:.m+@
Apparat                ?
April                  ?a.pr'Il
Aprilwoche             ?a.pr'Il#v''O.x+@
Arzttermin             ?'a6tst#tE6.m''i:n
Aschermittwoch         ?''a.S6#m'It#v''Ox
Auf_Wiederh"oren       ?aUf##v'i:.d6#h''2:.r+@n
Auf_Wiederschauen      ?aUf##v'i:.d6#S''aU.+@n
Auf_Wiedersehen        ?aUf##v'i:.d6#z''e:.+@n
August                 ?aU.g'Ust ?'aU.gUst
Augustwoche            ?aU.g'Ust#v''O.x+@
Ausweichm"oglichkeit   ?'aUs#v''aIC#m''2:k.+lIC.+kaIt
Table 6.5: Extract from the VERBMOBIL pronunciation table

The convention has been designed to permit the removal of information which is not required, or the selection of useful subsets of the table using simple UNIX tool commands; the use of the same symbol, the single quote, in the marks for both primary stress (') and secondary stress ('') permits simple generalisation over both.
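
Reductions of this kind can be sketched with standard tools as follows. The function names are hypothetical, the symbol inventory is that of Table 6.4 with the single quote as stress mark, and the sample transcription is given only for illustration.

```shell
#!/bin/sh
# Sketch: derive reduced views of a transcription with standard UNIX tools.
# Symbols as in Table 6.4: + # ## . mark boundaries, single quotes mark stress.

# Remove every stress mark, primary or secondary, in one step: both marks
# are built from the same symbol, so one character class covers both.
strip_stress() { tr -d "'"; }

# Remove only secondary stress (two single quotes), keeping primary stress.
strip_secondary_stress() { sed "s/''//g"; }

# Remove morpheme, word and syllable boundaries, keeping stress marks.
strip_boundaries() { tr -d '+#.'; }

# Select phrasal entries, i.e. lines containing a ## boundary.
select_phrases() { grep '##'; }

echo "?a.pr'Il#v''O.x+@" | strip_stress            # ?a.prIl#vO.x+@
echo "?a.pr'Il#v''O.x+@" | strip_secondary_stress  # ?a.pr'Il#vO.x+@
echo "?a.pr'Il#v''O.x+@" | strip_boundaries        # ?apr'Ilv''Ox@
```

The filters are applied here to the transcription only; applied to whole table lines they would also affect any of these characters occurring in the orthographic key.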

More complex lexical databases

In a complex project, lexical information from several sources may need to be integrated in a fashion which permits flexible further development work even when the information cannot easily be reduced to a logically fully consistent and well-defined system. A situation such as this will arise when alternative modules, based on different principles, are to be made available for the same system. For instance, two different syntactic components will define different forms of syntactic ambiguity and be associated in different ways with semantic ambiguities. And morphological ambiguities arise with inflected forms in highly inflecting languages. In order to achieve any kind of integration, at least the word form representations will need to be consistent. The hybrid information sources will have to be represented as conjunctions of the values of independent attributes (i.e. fields within a record), with separate disjunctions, where relevant, within fields.
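
A record of this kind, with independent attribute fields and disjunctions inside individual fields, can be sketched as follows. The tab-separated layout, the ";" separator between alternatives, the attribute names SynA and SynB (standing for two alternative syntactic components) and the values are all assumptions made for the illustration.

```shell
#!/bin/sh
# Sketch: a record is a conjunction of attribute fields; a field may hold a
# disjunction of alternative values.
# Assumptions: tab-separated fields, ";" between alternatives, attribute
# names in the first record.

list_alternatives() {
    awk -F '\t' '
        # Record 1 supplies the attribute names.
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        {
            print "Entry: " $1
            for (i = 2; i <= NF; i++) {
                # Split the field into its alternatives and number them.
                n = split($i, alt, ";")
                for (k = 1; k <= n; k++)
                    print "  " name[i] "[" k "/" n "]: " alt[k]
            }
        }
    '
}

# Hypothetical record: one value from component A, two alternatives from B.
printf 'Orth\tSynA\tSynB\nMutter\tnoun,fem,sg\tN;NP\n' | list_alternatives
```

Because each source keeps its own field, an ambiguity defined by one component does not have to be reconciled with the analyses of the other; only the word form key is shared.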

In general, spoken language projects have been based on the idealised notion of a single, well-defined, consistent and complete lexical database; this situation might reasonably be expected to correspond to the reality of a system developed in a single laboratory at one specific location. However, larger scale projects need to be able to cope with hybrid lexical information of the kind just outlined. A project of this type is the VERBMOBIL project funded by the German government, with international participation.

A general product-oriented solution would obviously use a product standard database (see the Appendices and Chapter 5), but an illustration of the typical R&D style UNIX database is given here for the sake of simplicity, as an example of a database structure designed for hybrid lexical information.

Internal database structure (standard UNIX database format):

Example of record structure:

Example of human-readable formatting:

Entry 372: Mutter
  Orth: Mutter
  A3:   mU!t6
  B1:   nomen,akk,fem,sg,@empty@,@empty@,Raute,@empty@
  D1:   nom

On UNIX systems, laboratory-specific acquisition and access routines for ASCII lexical databases are frequently written with standard UNIX script languages for ASCII character stream processing. If the resources are available to produce fully specified C and C++ programmes, then of course this is to be preferred. The UNIX tools are useful for prototyping, ad hoc format conversion and informal exchange within the speech development community, but are not to be recommended for commercial use.

The following example illustrates simple UNIX script programming for human-readable format conversion (transformation of selected named attributes of a database record into the attribute format given above):

# dbviewr
# Prettyprint of single entries
# and attributes in lexicon database
# with regular expression matching
# Uses UNIX tools:
#  gawk (i.e. gnu awk), sed, tr
# (Note: sed and tr are used for illustration, and would
#        normally be emulated in gawk)
# Database structure:
# Header: Record 1: Fields containing attribute names.
#         Record 2: Other information.
# Body:   Records >2: Database relation.

if [ $# -lt 3 ]
then
 echo "Usage: dbview dbname attribute* regexp"
 exit 1
fi

# The GNU version of the awk script language is used:
gawk '
# Transfer the keyword from the command line to an awk variable:
BEGIN {keyword = ARGV[ARGC-1]}
# Identify the attributes in the first record whose values
# are to be queried.
NR == 1 {{for (i=2 ; i < ARGC ; i++)
        {for (j=1 ; j <= NF ; j++)
        if (ARGV[i] == $j) {attrib[j] = "yes"; attname[j] = $j}}}
# Assumed loop body: blank the non-file arguments so that
# gawk does not try to open them as input files.
        {for (i = 2 ; i < ARGC ; i++)
        ARGV[i] = ""}}
# Find required keyword entry/entries in body of database,
# print required values and set the found flag:
$1 ~ keyword && NR > 2  {print "\nEntry " NR-2 ":", $1
        {for (i=1 ; i <= NF ; i++)
        if (attrib[i] ~ "yes") {print "  " attname[i] ":\t" $i}}
        found = "yes"}
# Print message if no entry was found for the keyword.
END {if (found!="yes") {print "No entry found for",keyword}}
' "$@" |

# Pipe to sed script language,
# translate all sequences of two semicolons into a slash, all single
# semicolons into a single semicolon followed by eight spaces:
sed -e "s/;;/\//g
        s/;/&        /g" |

# Pipe to tr character translator,
# translate all single semicolons into a linefeed (newline):
tr ";" "\012"

For an overview of related format conversion techniques, see [Aho et al. (1987)], [Dougherty (1990)], [Wall & Schwartz (1991)].



EAGLES SWLG SoftEdition, May 1997.