The architecture of a lexicon, in particular of a lexical database, is determined partly by the types of declarative knowledge it contains, and partly by considerations of access and interaction with other databases or modules. The main features of spoken language lexical databases have already been discussed. In practice, a spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, and a main lexical database with syntactic and semantic information), and is part of a larger complex of databases involving speech signal files, transcription files (orthographic and phonemic), and annotation (labelling) files which define a function from the transcriptions into the digitised speech signal. However, in the interests of consistency it is helpful to take a more general lexicographic point of view, and to see a lexical database for spoken language development as a single database in which relations between lexical items and their properties at all levels, from acoustics through word structure to syntax, semantics and pragmatics, are defined.
The major problem in deciding how to organise a lexical database is the ambiguity of word forms. In a spoken language system, the focus is on the pronunciation, i.e. on phonemic word forms (not the orthography, though this is often used as a conveniently familiar form of representation). The key issue here is homophony, i.e. a phonemic word form associated with at least two different sets of lexical information, and thus logically involving a disjunction in the database.
In a simple traditional database model based on fixed-length records, in which each field represents a specific attribute of the entity which the record stands for, there is a record for each lexical entry associated with a homophone, uniquely identified by a serial number. However, for specific applications such as the training of a speech recogniser it is convenient to have just one record for each word form. In a database which is optimised for this application, the disjunction required by the homophone is within a single record, rather than distributed over alternative records which share the same value for the pronunciation attribute. Structures of this type are typically used in pronunciation dictionaries (pronunciation lexica, pronunciation tables) for speech recognition. Disjunctive information of this kind within the lexical database corresponds to non-deterministic situations, and to the use of complex search algorithms, in actual spoken language systems.
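As a hypothetical illustration (the word forms, phoneme symbols and field layout are invented for this sketch, with ";" as field separator), an English homophone pair such as read (past tense) and red could be stored in either of the two ways just described.

One record per lexical entry, with the pronunciation value shared:

    0231; read; r E d; verb, past
    0232; red;  r E d; adjective

One record per phonemic word form, with the disjunction inside the record:

    r E d; read: verb, past | red: adjective

The first structure suits lookup by lexical entry; the second suits training a recogniser, where each phonemic word form should occur exactly once.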
Pronunciation tables (pronunciation dictionaries) hardly correspond to the intuitive concept of a lexical database, which implies a fairly high degree of complexity, but they are nevertheless a useful source of practical examples of a simple lexical database structure.
Pronunciation tables define the relation between orthographic and phonemic representations of words. Often they are defined as functions which assign pronunciations (frequently a set of variant pronunciations) to orthographic representations. This is an obvious necessity for text-to-speech lexica, but it is equally relevant in speech recognition applications, where orthographic transcriptions (which are easier to make and check than phonemic transcriptions) are mapped to phonemic representations for the purpose of training speech recognisers.
A pronunciation table which involves pronunciation variants (see below) provides a simple illustration of the problem of orthographic noise, represented by disjunctions in the database.
Pronunciation tables have to fulfil a number of criteria, in particular the criteria of unambiguous notation, of consistency with the orthographic and other transcriptions of a particular corpus, and of simple and fast processing.
General proposals for the interchange of lexical information about word forms, including morphological, phonological and prosodic information, have been made for different languages. They do not currently have the status of standards, but they are sufficiently similar to justify recommendation. A standard for French has been described [Pérennou & De Calmès (1987), Autesserre et al. (1989)], containing the features listed in Figure 6.3.
For the spoken language lexicon in the German VERBMOBIL project the same basic principle has been adopted [Bleiching & Gibbon (1994)], with extensions for incorporating prosodic information, as in Table 6.4.
Table 6.5 shows an extract from the VERBMOBIL pronunciation table in the VERBMOBIL WIF (Word form Interchange Format) convention; following current practice, it is organised according to orthographic keys.
The convention has been designed to permit the removal of information which is not required, or the selection of useful subsets of the table using simple UNIX tool commands; the use of ``'' for primary and secondary stress permits simple generalisation over both.
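Such selections can be sketched with one-line UNIX tool commands. The fragment below is an illustration only: it assumes a simplified two-column layout (orthographic key, pronunciation) rather than the full WIF field set, with "!" as a stress mark, as in the entry example given later in this section.

```shell
# Minimal sketch of subset selection with UNIX tools.  The two-column
# layout and the file name wif.tmp are assumptions for illustration;
# the actual WIF table has further fields.
cat > wif.tmp <<'EOF'
mutter mU!t6
vater fa:!t6
EOF
entry=$(grep '^mutter ' wif.tmp)    # select one orthographic key
nostress=$(sed 's/!//g' wif.tmp)    # remove stress information throughout
echo "$entry"
echo "$nostress"
rm -f wif.tmp
```

Because the operations are plain stream edits, they can be combined in pipelines to derive whatever reduced table a particular training or test run requires.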
In a complex project, lexical information from several sources may need to be integrated in a fashion which permits flexible further development work even when the information cannot easily be reduced to a logically fully consistent and well-defined system. A situation such as this will arise when alternative modules, based on different principles, are to be made available for the same system. For instance, two different syntactic components will define different forms of syntactic ambiguity and be associated in different ways with semantic ambiguities. And morphological ambiguities arise with inflected forms in highly inflecting languages. In order to achieve any kind of integration, at least the word form representations will need to be consistent. The hybrid information sources will have to be represented as conjunctions of the values of independent attributes (i.e. fields within a record), with separate disjunctions, where relevant, within fields.
In general, spoken language projects have been based on the idealised notion of a single, well-defined, consistent and complete lexicon; this situation might reasonably be expected to correspond to the reality of a system developed in a single laboratory at one specific location. However, larger scale projects need to be able to cope with hybrid lexical information of the kind just outlined. A project of this type is the VERBMOBIL project funded by the German government, with international participation.
A general product-oriented solution would obviously use a standard database product (see the Appendices and Chapter 5), but an illustration of the typical R&D style UNIX database is given here for the sake of simplicity, as an example of a database structure designed for hybrid lexical information.
Entry 372: Mutter
Orth: Mutter
A3: mU!t6
B1: nomen,akk,fem,sg,@empty@,@empty@,Raute,@empty@
nomen,nom,fem,sg,@empty@,@empty@,Raute,@empty@
C1: Nom,OBJEKTTYP
D1: nom
On UNIX systems, laboratory-specific acquisition and access routines for ASCII lexical databases are frequently written with standard UNIX script languages for ASCII character stream processing. If the resources are available to produce fully specified C and C++ programmes, then of course this is to be preferred. The UNIX tools are useful for prototyping, ad hoc format conversion and informal exchange within the speech development community, but are not to be recommended for commercial use.
The following example illustrates simple UNIX script programming for human-readable format conversion (transformation of selected named attributes of a database record into the attribute format given above):
#!/bin/sh
# dbview
# Prettyprint of single entries
# and attributes in lexicon database
# with regular expression matching
# Uses UNIX tools:
# gawk (i.e. gnu awk), sed, tr
# (Note: sed and tr are used for illustration, and would
# normally be emulated in gawk)
# Database structure:
# Header: Record 1: Fields containing attribute names.
# Record 2: Other information.
# Body: Records >2: Database relation.
if [ $# -lt 3 ]
then
    echo "Usage: dbview dbname attribute* regexp"
    exit 1
fi
# The GNU version of the awk script language is used:
gawk '
# Transfer the keyword from the command line to an awk variable:
BEGIN {keyword = ARGV[ARGC-1]}
# Identify the attributes in the first record whose values
# are to be queried.
NR == 1 {{for (i=2 ; i < ARGC ; i++)
              {for (j=1 ; j <= NF ; j++)
                   if (ARGV[i] == $j) {attrib[j] = "yes"; attname[j] = $j}}}
          {for (i = 2 ; i < ARGC ; i++)
               ARGV[i]=""}}
# Find required keyword entry/entries in body of database,
# print required values and set 'found' flag:
$1 ~ keyword && NR > 2 {print "\nEntry " NR-2 ":", $1
                        {for (i=1 ; i <= NF ; i++)
                             if (attrib[i] ~ "yes") {print " " attname[i] ":\t" $i
                                                     found="yes"}}}
{last=NR}
# Print message if no entry was found for the keyword.
END {if (found!="yes") {print "No entry found for", keyword,
                        "in", ARGV[1]}}
' $* |
# Pipe to sed script language,
# translate all sequences of two semicolons into a slash, and each
# single semicolon into a semicolon followed by eight spaces:
sed -e "s/;;/\//g
s/;/&        /g" |
# Pipe to tr character translator,
# translate all semicolons into a linefeed (newline):
tr ";" "\012"
For an overview of related format conversion techniques, see [Aho et al. (1987)], [Dougherty (1990)], [Wall & Schwartz (1991)].