Spoken language lexicon formalisms

Spoken language lexicon formalisms (representation languages) may be broadly classified according to their use:

Linguistically and phonetically based working notations.
Implementation languages for the operational phase.
Algebraic and logical formalisms for formal definition.

Where an ad hoc solution is required for a very small lexicon with a simple structure, the lexicon may be written directly in a standard programming language suited to high-speed runtime applications (traditionally Fortran, more recently C), or in a higher level language such as LISP or Prolog. Recent developments are moving towards high level knowledge representation languages that are specifically designed to serve all three of the above uses equally well: they are useful working notations, have efficient implementations, and are formally well-defined.

Some of these are also used for general written language lexica. A more detailed classification of formal representation systems may be given as follows:

General data structures (lists, tables or matrices, tree structures designed for optimal lexical access).
Programming languages (C for efficiency; LISP or Prolog for flexibility).
Database systems.
General text markup languages such as SGML.
Knowledge representation languages (inheritance networks, semantic networks, frame systems).
Linguistic knowledge representation languages, commonly based on attribute-value logics.
Lexical knowledge representation languages (attribute based inheritance formalisms) such as DATR.

General data structure definitions for these representations are required for developers and for theoretical work on the complexity and efficiency of lexica and lexicon processing. Standard textbooks on data structures and algorithms should be consulted for this purpose.
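As an illustration of a tree structure designed for lexical access, the following is a minimal sketch of a trie keyed on phoneme sequences. The class names, the toy transcriptions and the entry fields are assumptions made for this example only; they are not taken from any particular system.

```python
class TrieNode:
    """One node of the trie; each child edge is labelled with one phoneme."""

    def __init__(self):
        self.children = {}   # phoneme symbol -> TrieNode
        self.entry = None    # lexical entry stored at the end of a word


class TrieLexicon:
    """Hypothetical lexicon whose access keys are phoneme sequences."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, phonemes, entry):
        node = self.root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, TrieNode())
        node.entry = entry

    def _walk(self, phonemes):
        """Follow the phoneme path; return the node reached, or None."""
        node = self.root
        for phoneme in phonemes:
            node = node.children.get(phoneme)
            if node is None:
                return None
        return node

    def lookup(self, phonemes):
        """Return the entry for an exact match, or None."""
        node = self._walk(phonemes)
        return None if node is None else node.entry

    def has_prefix(self, phonemes):
        """True if some stored word begins with this phoneme sequence."""
        return self._walk(phonemes) is not None


# Toy data: the transcriptions are illustrative, not a standard notation.
lexicon = TrieLexicon()
lexicon.insert(("k", "ae", "t"), {"orthography": "cat", "category": "noun"})
lexicon.insert(("k", "ae", "p"), {"orthography": "cap", "category": "noun"})

print(lexicon.lookup(("k", "ae", "t")))
print(lexicon.has_prefix(("k", "ae")))
```

Lookup time in such a structure grows with the length of the key rather than with the number of entries, and words that share an initial phoneme sequence share a path, which is why tree structures of this kind are attractive for incremental lexical access.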

Conventional programming languages are generally used for performance reasons in runtime systems. They may also be used to implement small or simple lexica directly, in particular for rapid prototyping. This is not optimal software development practice, however, and is not recommended for developing large or complex lexica, in particular those with highly structured linguistic information.

Database management systems (DBMSs) are widely used for general lexical resource management, including large-scale lexica with rich information which needs to be accessed flexibly and efficiently (see Appendix H). In the SAM project, an ORACLE database management concept for spoken language corpora and lexica was developed [Dolmazon et al. (1990)].
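The kind of flexible access a DBMS provides can be sketched with a small relational table. The example below uses SQLite through Python's standard sqlite3 module purely as a convenient stand-in; it is not the ORACLE design developed in the SAM project, and the table layout, column names and toy transcriptions are assumptions made for illustration.

```python
import sqlite3

# In-memory database; a real lexical resource would be stored persistently.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon ("
    " orthography TEXT NOT NULL,"
    " transcription TEXT NOT NULL,"
    " category TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?, ?)",
    [
        ("two", "tu:", "numeral"),
        ("too", "tu:", "adverb"),
        ("tea", "ti:", "noun"),
    ],
)
# An index on the transcription column supports access from the
# pronunciation side as well as from the orthography.
conn.execute("CREATE INDEX idx_transcription ON lexicon (transcription)")


def entries_for_transcription(connection, transcription):
    """Return (orthography, category) pairs sharing one transcription."""
    cursor = connection.execute(
        "SELECT orthography, category FROM lexicon"
        " WHERE transcription = ? ORDER BY orthography",
        (transcription,),
    )
    return cursor.fetchall()


homophones = entries_for_transcription(conn, "tu:")
print(homophones)
```

The same table can be queried by orthography, by category or by any combination of fields without changing the stored data, which is the main practical advantage over a lexicon hard-coded in a program.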

General text markup languages are used for integration with large, pre-analysed written corpora in the development of natural language processing systems and in statistical basic research in computational linguistics, but so far have not been used in spoken language system development (cf. the results of the EAGLES Working Group on Text Corpora). Implementations of SGML are readily available.

Knowledge representation languages (KRLs) are used for developing complex semantic and conceptual knowledge representations, and for integrating spoken language front ends with knowledge based systems; see [Schröder et al. (1987)], [Sagerer (1990)]; more generally, cf. [Bobrow & Winograd (1977)], [Brachman & Levesque (1985)], [Charniak & McDermott (1985)], [De Mori et al. (1984)], [Young et al. (1989)].

Linguistic formalisms in general are discussed in the results of the EAGLES Working Group on Grammar Formalisms, which should be referred to in this connection.

Lexical knowledge representation languages (LKRLs) are a relatively new development. They are coming to be used in knowledge acquisition for integrated lexica which contain a variety of complex lexical information, from phonology through morphology and syntax to semantics and pragmatics. Using sign-based lexicon models, they provide a means of bridging the gap between the complexity of lexical information and the need for easy-to-read representations. An LKRL which has been used in several natural language processing and speech projects is DATR [Evans & Gazdar (1989), Cahill (1993), Cahill & Evans (1990), Andry et al. (1992), Gibbon (1991), Gibbon (1993), Bleiching (1992), Langer & Gibbon (1992)]. This is the language which is used for basic attribute-value representations in this chapter. A number of public domain DATR implementations are available on the Web.
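The general idea behind attribute based inheritance can be sketched in a few lines of Python. This is only an analogy under simplifying assumptions: it does not reproduce DATR syntax or its full semantics, and the class, node and attribute names are hypothetical.

```python
class Node:
    """A lexical node with local attribute-value pairs and an optional parent."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = attributes

    def get(self, attribute):
        """Return the locally stated value, else inherit it from an ancestor."""
        node = self
        while node is not None:
            if attribute in node.attributes:
                return node.attributes[attribute]
            node = node.parent
        raise KeyError(f"{self.name} has no value for {attribute!r}")


# General information is stated once, on the class node.
noun = Node("Noun", category="noun", plural_suffix="s")

# Regular entries state only what is idiosyncratic (here, the stem).
dog = Node("Dog", parent=noun, stem="dog")

# Exceptional entries override the inherited default locally.
ox = Node("Ox", parent=noun, stem="ox", plural_suffix="en")


def plural(node):
    """Build a plural form from the stem and the (possibly inherited) suffix."""
    return node.get("stem") + node.get("plural_suffix")


print(plural(dog))  # dogs
print(plural(ox))   # oxen
```

Generalisations are stated once and exceptions are stated locally, so individual entries stay short and readable even when the total amount of lexical information is large.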

EAGLES SWLG SoftEdition, May 1997.