Spoken language lexicon formalisms (representation languages) may be broadly classified according to their use:
Where an ad hoc solution is required for a very small lexicon, and where lexicon structure is simple, a lexicon may be written directly in a standard programming language suitable for high-speed runtime applications, traditionally Fortran but more recently C, or in a higher level language such as LISP or Prolog. Recent developments are moving towards high level knowledge representation languages which are specifically designed to meet all three of the above criteria equally well, in that they are useful working notations, have efficient implementations, and are formally well-defined.
Some of these are also used for general written language lexica. A more detailed classification of formal representation systems may be given as follows:
General data structure definitions for these representations are required for developers and for theoretical work on the complexity and efficiency of lexica and lexicon processing. Standard textbooks on data structures and algorithms should be consulted for this purpose.
Conventional programming languages are generally used for performance reasons in runtime systems. They may also be used to implement small or simple lexica directly, in particular for rapid prototyping. This is not good software development practice, however, and is not recommended for developing large or complex lexica, in particular those with highly structured linguistic information.
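As an illustration of a small lexicon written directly in a programming language, the following is a minimal sketch in Python. The entry fields, the SAMPA-style transcriptions and the names LexicalEntry, LEXICON and lookup are assumptions made for this example and are not taken from any system described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexicalEntry:
    """One hard-coded lexicon entry (hypothetical field set)."""
    orthography: str
    pronunciation: str  # SAMPA-style transcription, illustrative values only
    category: str


# The whole lexicon is a literal in the program source. This is quick to
# write for a prototype, but every change to the lexicon is a code change.
LEXICON = {
    "cat": LexicalEntry("cat", "k{t", "noun"),
    "sit": LexicalEntry("sit", "sIt", "verb"),
    "the": LexicalEntry("the", "D@", "determiner"),
}


def lookup(word: str) -> Optional[LexicalEntry]:
    """Return the entry for a word form, or None if it is not listed."""
    return LEXICON.get(word.lower())
```

The sketch also shows why the approach does not scale: structured information such as morphological or semantic detail would have to be encoded in further ad hoc fields, and lexical data and program logic are not separated.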
Database management systems (DBMSs) are widely used for general lexical resource management, including large-scale lexica with rich information which needs to be accessed flexibly and efficiently (see Appendix H). In the SAM project, an ORACLE database management concept for spoken language corpora and lexica was developed [Dolmazon et al. (1990)].
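The kind of flexible access a DBMS provides can be sketched as follows. The example uses SQLite through Python's standard sqlite3 module purely for convenience; it is not the ORACLE-based SAM concept, and the table layout, column names and transcriptions are assumptions made for illustration.

```python
import sqlite3


def build_lexicon_db(entries):
    """Create an in-memory lexicon table from (orthography,
    pronunciation, category) tuples. The schema is hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lexicon ("
        " orthography TEXT NOT NULL,"
        " pronunciation TEXT NOT NULL,"
        " category TEXT NOT NULL)"
    )
    # An index on the lookup key keeps access efficient as the lexicon grows.
    conn.execute("CREATE INDEX idx_orthography ON lexicon (orthography)")
    conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def pronunciations(conn, orthography):
    """Return all pronunciation variants stored for one orthographic form."""
    rows = conn.execute(
        "SELECT pronunciation FROM lexicon"
        " WHERE orthography = ? ORDER BY pronunciation",
        (orthography,),
    )
    return [row[0] for row in rows]
```

Because the data are held separately from the program, the same table can be queried by any field (for instance, all entries of a given category) without changing the code that built it.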
General text markup languages such as SGML are used for integration with large, pre-analysed written corpora in the development of natural language processing systems and in basic statistical research in computational linguistics, but so far have not been used in spoken language system development (cf. the results of the EAGLES Working Group on Text Corpora). Implementations of SGML are readily available.
Knowledge representation languages (KRLs) are used for developing complex semantic and conceptual knowledge representations, and for integrating spoken language front ends with knowledge based systems; see [Schröder et al. (1987)], [Sagerer (1990)]; more generally, cf. [Bobrow & Winograd (1977)], [Brachman & Levesque (1985)], [Charniak & McDermott (1985)], [De Mori et al. (1984)], [Young et al. (1989)].
Linguistic formalisms in general are discussed in the results of the EAGLES Working Group on Grammar Formalisms, which should be referred to in this connection.
Lexical knowledge representation languages (LKRLs) are a relatively new development. They are coming to be used in knowledge acquisition for integrated lexica which contain a variety of complex lexical information, from phonology through morphology and syntax to semantics and pragmatics. They provide a means of bridging the gap between the complexity of lexical information and easy-to-read representations using sign-based lexicon models. An LKRL which has been used in several natural language processing and speech projects is DATR [Evans & Gazdar (1989), Cahill (1993), Cahill & Evans (1990), Andry et al. (1992), Gibbon (1991), Gibbon (1993), Bleiching (1992), Langer & Gibbon (1992)]. This is the language which is used for basic attribute-value representations in this chapter. A number of public domain DATR implementations are available and can be obtained from the Web sites.
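The idea behind attribute-value representation with default inheritance can be sketched in Python as follows. This is a strong simplification and not a DATR implementation: it models only atomic values, references to another node with the same path, and defaulting by longest defined path prefix. Global inheritance, value sequences and evaluable paths are omitted. The node names, attributes and the names Ref, THEORY and query are assumptions made for this example.

```python
class Ref:
    """Marks a value as inherited from another node (same path)."""

    def __init__(self, node):
        self.node = node


# Each node maps attribute paths (tuples) to values. The empty path ()
# acts as the default for every path that has no more specific entry.
THEORY = {
    "Noun": {
        ("category",): "noun",
        ("morph", "plural"): "s",
    },
    "Dog": {
        (): Ref("Noun"),          # inherit everything by default
        ("root",): "dog",
    },
    "Ox": {
        (): Ref("Noun"),
        ("root",): "ox",
        ("morph", "plural"): "en",  # overrides the inherited default
    },
}


def query(theory, node, path):
    """Evaluate node:path using the longest defined prefix of the path."""
    path = tuple(path)
    equations = theory[node]
    for length in range(len(path), -1, -1):
        prefix = path[:length]
        if prefix in equations:
            value = equations[prefix]
            if isinstance(value, Ref):
                # Follow the reference, asking for the full original path.
                return query(theory, value.node, path)
            return value
    raise KeyError(f"{node}:{'/'.join(path)} is undefined")
```

In this sketch the entry for Dog states only what is specific to it and obtains its category and plural suffix from Noun, while Ox overrides the plural suffix locally. This is the sense in which such languages keep entries readable despite complex lexical information.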