The architecture of a lexicon, in particular of a lexical database, is determined partly by the types of declarative knowledge it contains, and partly by considerations of access and interaction with other databases or modules. The main features of spoken language lexical databases have already been discussed. In practice, a spoken language lexical database is often a set of loosely related simpler databases (e.g. pronunciation table, signal annotation file, stochastic word model, and a main lexical database with syntactic and semantic information), and is part of a larger complex of databases involving speech signal files, transcription files (orthographic and phonemic), and annotation (labelling) files which define a function from the transcriptions into the digitised speech signal. However, in the interests of consistency it is helpful to take a more general lexicographic point of view, and to see a lexical database for spoken language development as a single database in which relations between lexical items and their properties at all levels, from acoustics through word structure to syntax, semantics and pragmatics, are defined.
The major problem in deciding how to organise a lexical database is the ambiguity of word forms. In a spoken language system, the focus is on the pronunciation, i.e. on phonemic word forms (not the orthography, though this is often used as a conveniently familiar form of representation). The key issue here is homophony, i.e. a phonemic word form associated with at least two different sets of lexical information, and thus logically involving a disjunction in the database.
In a simple traditional database model based on fixed-length records, in which each field represents a specific attribute of the entity which the record stands for, there is a record for each lexical entry associated with a homophone, uniquely identified by a serial number. However, for specific applications such as the training of a speech recogniser it is convenient to have just one record for each word form. In a database which is optimised for this application, the disjunction required by the homophone is within a single record, rather than distributed over alternative records which share the same value for the pronunciation attribute. Structures of this type are typically used in pronunciation dictionaries (pronunciation lexica, pronunciation tables) for speech recognition. Disjunctive information of this kind within the lexical database corresponds to non-deterministic situations, and to the use of complex search algorithms, in actual spoken language systems.
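As a hypothetical illustration (the word forms, phoneme symbols and field layout are invented for this sketch, with ";" as field separator), an English homophone pair such as read (past tense) and red could be stored in either of the two ways just described.

One record per lexical entry, with the pronunciation value shared:

    0231; read; r E d; verb, past
    0232; red;  r E d; adjective

One record per phonemic word form, with the disjunction inside the record:

    r E d; read: verb, past | red: adjective

The first structure suits lookup by lexical entry; the second suits training a recogniser, where each phonemic word form should occur exactly once.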
Pronunciation tables (pronunciation dictionaries) hardly correspond to the intuitive concept of a lexical database, which implies a fairly high degree of complexity, but they are nevertheless a useful source of practical examples of a simple lexical database structure.
Pronunciation tables define the relation between orthographic and phonemic representations of words. Often they are defined as functions which assign pronunciations (frequently a set of variant pronunciations) to orthographic representations. This is an obvious necessity for text-to-speech lexica, but it is equally relevant in speech recognition applications, where orthographic transcriptions (which are easier to make and check than phonemic transcriptions) are mapped to phonemic representations for the purpose of training speech recognisers.
A pronunciation table which involves pronunciation variants (see below) provides a simple illustration of the problem of orthographic noise, represented by disjunctions in the database.
Pronunciation tables have to fulfil a number of criteria, in particular the criteria of unambiguous notation, of consistency with the orthographic and other transcriptions of a particular corpus, and of simple and fast processing.
General proposals for the interchange of lexical information about word forms, including morphological, phonological and prosodic information, have been made for different languages. They do not currently have the status of standards, but they are sufficiently similar to justify recommendation. A standard for French has been described [Pérennou & De Calmès (1987), Autesserre et al. (1989)], containing the features listed in Figure 6.3.
For the spoken language lexicon in the German VERBMOBIL project the same basic principle has been adopted [Bleiching & Gibbon (1994)], with extensions for incorporating prosodic information, as in Table 6.4.
Table 6.5 shows an extract from the VERBMOBIL pronunciation table in the VERBMOBIL WIF (Word form Interchange Format) convention; following current practice, it is organised according to orthographic keys.
The convention has been designed to permit the removal of information which is not required, or the selection of useful subsets of the table using simple UNIX tool commands; the use of ``'' for primary and secondary stress permits simple generalisation over both.
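Such selections can be sketched with one-line UNIX tool commands. The fragment below is an illustration only: it assumes a simplified two-column layout (orthographic key, pronunciation) rather than the full WIF field set, with "!" as a stress mark, as in the entry example given later in this section.

```shell
# Minimal sketch of subset selection with UNIX tools.  The two-column
# layout and the file name wif.tmp are assumptions for illustration;
# the actual WIF table has further fields.
cat > wif.tmp <<'EOF'
mutter mU!t6
vater fa:!t6
EOF
entry=$(grep '^mutter ' wif.tmp)    # select one orthographic key
nostress=$(sed 's/!//g' wif.tmp)    # remove stress information throughout
echo "$entry"
echo "$nostress"
rm -f wif.tmp
```

Because the operations are plain stream edits, they can be combined in pipelines to derive whatever reduced table a particular training or test run requires.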
In a complex project, lexical information from several sources may need to be integrated in a fashion which permits flexible further development work even when the information cannot easily be reduced to a logically fully consistent and well-defined system. A situation such as this will arise when alternative modules, based on different principles, are to be made available for the same system. For instance, two different syntactic components will define different forms of syntactic ambiguity and be associated in different ways with semantic ambiguities. And morphological ambiguities arise with inflected forms in highly inflecting languages. In order to achieve any kind of integration, at least the word form representations will need to be consistent. The hybrid information sources will have to be represented as conjunctions of the values of independent attributes (i.e. fields within a record), with separate disjunctions, where relevant, within fields.
In general, spoken language projects have been based on the idealised notion of a single, well-defined, consistent and complete lexicon; this situation might reasonably be expected to correspond to the reality of a system developed in a single laboratory at one specific location. However, larger scale projects need to be able to cope with hybrid lexical information of the kind just outlined. A project of this type is the VERBMOBIL project funded by the German government, with international participation.
A general product-oriented solution would obviously use a standard database product (see the Appendices and Chapter 5), but an illustration of the typical R&D style UNIX database is given here for the sake of simplicity, as an example of a database structure designed for hybrid lexical information.
Entry 372: Mutter
Orth: Mutter
A3: mU!t6
B1: nomen,akk,fem,sg,@empty@,@empty@,Raute,@empty@
nomen,nom,fem,sg,@empty@,@empty@,Raute,@empty@
C1: Nom,OBJEKTTYP
D1: nom
On UNIX systems, laboratory-specific acquisition and access routines for ASCII lexical databases are frequently written with standard UNIX script languages for ASCII character stream processing. If the resources are available to produce fully specified C and C++ programmes, then of course this is to be preferred. The UNIX tools are useful for prototyping, ad hoc format conversion and informal exchange within the speech development community, but are not to be recommended for commercial use.
The following example illustrates simple UNIX script programming for human-readable format conversion (transformation of selected named attributes of a database record into the attribute format given above):
#!/bin/sh
# dbview
# Prettyprint of single entries
# and attributes in lexicon database
# with regular expression matching
# Uses UNIX tools:
# gawk (i.e. gnu awk), sed, tr
# (Note: sed and tr are used for illustration, and would
# normally be emulated in gawk)
# Database structure:
# Header: Record 1: Fields containing attribute names.
# Record 2: Other information.
# Body: Records >2: Database relation.
if [ $# -lt 3 ]
then
    echo "Usage: dbview dbname attribute* regexp"
    exit 1
fi
# The GNU version of the awk script language is used:
gawk '
# Transfer the keyword from the command line to an awk variable:
BEGIN {keyword = ARGV[ARGC-1]}
# Identify the attributes in the first record whose values
# are to be queried.
NR == 1 {{for (i=2 ; i < ARGC ; i++)
              {for (j=1 ; j <= NF ; j++)
                   if (ARGV[i] == $j) {attrib[j] = "yes"; attname[j] = $j}}}
          {for (i = 2 ; i < ARGC ; i++)
               ARGV[i]=""}}
# Find required keyword entry/entries in body of database,
# print required values and set 'found' flag:
$1 ~ keyword && NR > 2 {print "\nEntry " NR-2 ":", $1
                        {for (i=1 ; i <= NF ; i++)
                             if (attrib[i] ~ "yes") {print " " attname[i] ":\t" $i
                                                     found="yes"}}}
{last=NR}
# Print message if no entry was found for the keyword.
END {if (found!="yes") {print "No entry found for", keyword,
                        "in", ARGV[1]}}
' $* |
# Pipe to sed script language,
# translate all sequences of two semicolons into a slash, and each
# single semicolon into a semicolon followed by eight spaces:
sed -e "s/;;/\//g
s/;/&        /g" |
# Pipe to tr character translator,
# translate all semicolons into a linefeed (newline):
tr ";" "\012"
For an overview of related format conversion techniques, see [Aho et al. (1987)], [Dougherty (1990)], [Wall & Schwartz (1991)].