DOBES Technical Report n (Ega)
(Status: RFC draft, January 2001 -- printed January 7, 2001)
Revised version of a paper presented at the workshop on
Web-Based Language Documentation and Description
12-15 December 2000, Philadelphia, USA.
This paper is a contribution to the lexicographic component of language documentation and metadata specification, whereby the relation between language documentation and linguistic description (and metadescription) is understood as a continuum, not a sharp divide. We will term this specification the `MetaLex' specification. Furthermore, since documents are linguistic objects, they can themselves be the subject of linguistic descriptions; it would therefore seem to be rather foolish to ignore linguistic criteria in this area, and particularly foolish for the linguist to do so (as, alas, many linguists do).
But rather than plunging into the midst of detailed discussions of lexicon architecture, data structures, formats, acquisition and database tool implementation, types of lexical information for specific lexica, and so on, this contribution describes an attempt to step back for a moment from the grip of design and implementation issues in lexicographic development and to specify the linguistic foundations of a requirements specification for lexical development before going into further practical application details.
The term `requirements specification' is meant literally in the sense of software engineering: in this case, it means a specification of the requirements which lexical documentation should fulfil, derived from very general requirements such as the intensional and extensional coverage of lexical information, and the reusability, interoperability, portability and ergonomic utility of lexical software. There are, of course, other technical and logistic requirements which are not covered here.
This is not to say that the issues of lexicon design, implementation, evaluation, acquisition and application logistics are unimportant. On the contrary. But we claim that they must be derived from general specifications of requirements if they are not to risk being ignored by others. In discussing these points, this document draws to different extents on criteria from linguistic theory, descriptive lexicology, lexicography, terminography, computational linguistics and software engineering, and on extensive experience in the lexicography and terminography of spoken language in the EAGLES and Verbmobil language engineering projects, and the Bielefeld-Abidjan ``Encyclopédie des langues de Côte d'Ivoire'' documentation design project.
A lexicon is already a form of corpus metadata in the sense that it contains more or less generalised descriptive facts about a corpus or introspected data, and it was treated as such in [Gibbon & al. 1997], i.e. as "linguistic characterisation" of corpora.
But lexicographers often speak of "lexical data" in the sense of the information in the lexicon itself. In this sense, a lexicon itself needs description in terms of a higher level of lexical metadata.
There are numerous approaches to lexicon metadata characterisation and standardisation, from the accepted traditional norms used in typological linguistics (cf. [Coward & Grimes 1995]) to the technology-oriented de facto standards work of the EAGLES project series and the industry standard MARTIF (ISO 12200) for terminological databases, based on standard markup (ISO 8879 SGML). A modern XML version is under development (cf. ISO FDIS 12620); in this context, it is interesting to note that the traditionally autonomous practical engineering discipline of terminography is gradually accepting more general linguistically based lexicographic standards, though MARTIF extensions to general lexicography are not available.
Today the main foci in lexicography are often more on the standardisation of markup and implementation techniques than on conceptual harmonisation. But the more complex the issues -- and in lexicography they are very complex -- the harder it becomes even to think of documentation standards without looking at the broader picture of the other lexical sciences and the conceptual support they can provide to lexicography.
The present contribution takes a broader view of the position of the lexicon in this unsettled scene from the point of view of some small lexical objects and their properties. Section 4 is concerned with characterising large and small lexical objects; in Section 5, a model for characterising types of lexical information is proposed, based on contemporary linguistic and media theory; Section 6 is concerned with the complexity of lexical information and additional, particularly pragmatic and operational, dimensions of lexical information to be accounted for in metadata. Section 7 focusses on the status of hyperlexicon realisations of lexical documents in the context of the semiotic model, and, finally, Section 8 summarises the approach.
A number of aspects are excluded from consideration in this document, not because they are unimportant, but because they deserve separate, detailed treatment.
One aspect of standardisation which is only touched on in passing is the procedural or operational side of lexicography. There are two main aspects: first, the manual and/or computer-supported acquisition of lexical information (for example by conscious reflection on lexical objects or by corpus analysis), and second, access to lexical information in paper and electronic media, particularly in hypertext format. The question of the acquisition of lexical data and lexical knowledge is touched on again below (Section 7).
A further aspect which is excluded from present discussion is the ethics of lexicography, i.e. how lexical information is gathered, which culturally significant lexical information should be included, which lexical information should be disseminated publicly, how the commercial value of lexical information should feed back to the originators of the lexical information, and intellectual property rights (IPR) on lexical information, including different forms of the same document.
Lexicon theory, descriptive lexicology and operational lexicography -- the lexicon sciences -- are old sciences and technologies, and use of computational modelling and large-scale corpus processing is rapidly leading to a convergence of these three areas. A general outline of the interrelations between these disciplines is shown in Figure 1.
Underlying the approach presented in the present contribution is the idea that lexicographic documentation -- whether paper or electronic -- has become so complex that all the lexicon sciences need to provide input to the development process if an unproductive kind of chaos is not to ensue. A number of approaches in the area of the human language technologies where this principle is practised are discussed in the contributions to [van Eynde & Gibbon 2000].
It is a truism to state that archiving and documentation are inconceivable without standardisation -- standardisation at levels which do not prejudice creative scientific and technological innovation. De facto standards have arisen over the past 10 years with the development of the PC in the context of the World Wide Web into a mass Information and Communication Technology product. Some `standards' come and go, or develop too quickly to be regarded as standards except for a transitory period; examples of these are hardware configurations and software norms for media, text and multimedia documents. Other standards are more lasting, in particular those to do with design and quality control of archives, documents and systems (cf. [Gibbon & al. 1997], [Gibbon & al. 2000]). It is standards of this kind which fall into the area of metadata as opposed to being artefacts of specific archives or implementations.
Any inventoried form which may be abstracted from tokens of speech, inscriptions of text, or gestural events, including iconic and indexical signs as well as conventional symbols, is a lexical object. Because of its generality, this is not a very useful definition as it stands, except to distinguish lexical objects from completely compositional, transparently interpretable complex signs. The definition encompasses a vast spectrum of objects, from the regular phonetic realisations of phonemes and prosodies to the constituents of handwriting and printed or electronic text, through morphemes, words (simplex or complex), phrasal idioms to entire anthologised texts. And there are weird lexical objects, too, such as hums and haws, coughs and tut-tuts, as well as a wide range of conventional, stylised and codified visual gesture systems, all of which have communicative functions which are closely related to the more central aspects of language.
Before proceeding, four central structural concepts for lexicon design will be introduced. Two of these are traditional, though modified for present purposes; the third is new, the fourth is currently topical in the area of language resources in general.
The first (declarative) aspect of macrostructure classifies lexical objects according to two main criteria:
Figure 2 illustrates the core rank hierarchy of conventional lexical objects, with which other lexical objects may be related via notions such as the prosodic hierarchy in speech, layout hierarchies in printed matter, and gestural hierarchies. Conventional lexical objects thus vary in rank from the very small (e.g. font characters and their parts) to the very large (e.g. a standard religious text).
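The rank hierarchy can also be sketched operationally. The following fragment (in Python, with an invented inventory of ranks that need not match Figure 2 exactly) illustrates how rank comparisons between lexical objects might be computed:

```python
# Hypothetical sketch of a rank hierarchy of lexical objects, ordered
# from smallest to largest; the actual inventory in Figure 2 may differ.
RANKS = [
    "character",   # font characters and their parts
    "morpheme",
    "word",        # simplex or complex
    "phrase",      # phrasal idioms, phraseological units
    "text",        # e.g. an entire anthologised text
]

def rank_of(obj_type: str) -> int:
    """Return the position of a lexical object type in the hierarchy."""
    return RANKS.index(obj_type)

def is_higher_rank(a: str, b: str) -> bool:
    """True if lexical object type a outranks lexical object type b."""
    return rank_of(a) > rank_of(b)
```

A comparison such as `is_higher_rank("phrase", "word")` then expresses the claim, made below, that a phraseological unit is a lexical object at its own rank, not merely a collection of words.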
The model shown in Figure 2 is too general to be very helpful when it comes to describing types of lexical information, but it is a useful start. In particular, in the world of multimedia documentation, the idea that a lexicon is basically concerned with words needs to be scotched once and for all. A phraseological unit, for instance, is a lexical object in its own right, at its own rank, and not only by virtue of the words it contains; listing idioms by words is a matter of procedural convenience, not of conceptual clarity, and has led to much confusion in linguistics over the past 40 years. Likewise, an image or a sound may be a lexical object.
Lexicon macrostructure is determined not only by the rank hierarchy of large and small lexical objects and their interpretation, but also, and traditionally more typically, in terms of procedural orderings of lexical microstructure.
We will not discuss macrostructure or mesostructure further in this document; we stipulate that:
It will be sufficient at this stage to propose providing lexical metadata at two levels (cf. also [Coward & Grimes 1995] for a practical approach):
The conventional view of types of lexical information was formulated in a classic article ([Fillmore 1971]). Types of lexical information in this sense underlie the microstructure of a lexicon.
Figure 3 visualises a contemporary semiotic model of relations between levels of abstraction for the description of signs. Lexicon microstructures are typically represented in some kind of vector format, for example:
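A minimal illustration of such a vector format, with invented field names and values (the types of lexical information actually required are discussed below), might look as follows in Python:

```python
# Illustrative (invented) microstructure vector for a single entry:
# each position corresponds to one type of lexical information.
MICROSTRUCTURE_FIELDS = (
    "orthography", "pronunciation", "part_of_speech",
    "morphology", "syntax", "semantics", "pragmatics",
)

entry = ("house", "/haʊs/", "noun", "simplex",
         "count noun", "building for habitation", "neutral register")

# Pairing field names with values gives keyed access to the vector.
record = dict(zip(MICROSTRUCTURE_FIELDS, entry))
```

The point of the vector view is that every entry instantiates the same ordered tuple of information types, whatever concrete format is chosen.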
There are also other important issues to do with lexicon microstructure. One of these is the representation of lattice-structured or multilinear information which receives a media interpretation of simultaneity rather than sequentiality. This issue applies immediately to the representation of prosodic information.
A thorough discussion of lexical information which is interpreted as simultaneity relations at different levels is given in [Carson-Berndsen 1998], based on some principles of Event Phonology, as first formulated in [Bird & Klein 1989], and on Prosodic Time Types formulated in [Gibbon 1992]. At the level of resource implementation, the annotation lattice approach of [Bird & Liberman 1999] is clearly relevant as a partial solution to this problem.
The basic semiotic model has a theoretically sound basis, but also a heuristic value as a structured `checklist' for types of lexical information. On this view, the components of the semiotic model are projected on to a microstructure of types of lexical information, whether in the lexicological context of a linguistic description, or in the lexicographic context of a paper or electronic lexicon. This microstructural vector contains media-oriented types of information on orthography and pronunciation (perhaps also on other gestural properties) as well as other kinds of operational information concerned with lexical acquisition and lexical access keys.
Whatever formalism, abstract data structure or concrete format is selected, this idea of mapping a semiotic model into a microstructure vector, visualised in simplified form in Figure 4, is an essential defining characteristic of a well-defined lexicon, illustrating the theoretical linguistic basis for lexical metadata.
The conventional kinds of lexicon microstructure, even modelled at different rank levels as discussed above in connection with lexicon macrostructure, are only sufficient for creating lexical resources of a standard language -- the type of lexicon suited to current standard language oriented speech technology, or, in more jocular terms, to the Scrabble player.
Embedded in Figure 4 is a small diagram showing three additional dimensions to which the main types of lexical information need to be generalised: compositionality, variety, and procedurality. In principle, the types of lexical information need to be multiplied in order to cope with these additional types; traditional microstructures have an ad hoc combination of these.
Figure 4 elaborates on the theme of dimensionality: each of the dimensions described so far can be further analysed, fractal-like, into subdimensions; three dimensions are chosen to represent the higher dimensionality in each case more on associative than on principled grounds:
Terms such as `genre' and `style' are sometimes used to cover special aspects of register, sometimes to cover both the register and sociolect dimensions.
The historical dimension, which classifies genetically related diachronic forms of a language, is defined in terms of projections of these three dimensions onto a long-term temporal axis of language change, under the influence of internal change and external influence of other languages and varieties.
The fractionation of lexical dimensions does not stop here: lexical macrostructures may involve more complexity than just varietal mappings, lexical microstructures may involve more complexity than structural information (including any or all of the information types involved in the parameters discussed above), and the specification of an operational lexicon is a complex task indeed.
For any given task in lexical documentation not all of these dimensions and sub-dimensions will be relevant. However, from the point of view of the systematic definition of granularity levels for lexical metadata, the fractal-like characterisations given here, abstract though they may seem at first glance, provide a useful starting point.
In a formal specification for document technology purposes, the dimensions of lexical information may be represented as a recursive attribute-value structure (AVS), as shown in Figure 6.
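As a rough illustration of such a recursive AVS (with invented attribute names and values, not necessarily those of Figure 6), nested mappings together with a path-following access function suffice:

```python
# Hedged sketch of a recursive attribute-value structure (AVS):
# values are either atomic or themselves AVSs, so attribute paths
# of arbitrary depth can be followed. Attribute names are invented.
avs = {
    "variety": {
        "register": {"formality": "formal", "technicality": "high"},
        "sociolect": {"region": "standard"},
    },
    "compositionality": {"rank": "phrase", "transparency": "idiomatic"},
}

def get_path(structure, path):
    """Follow a sequence of attributes through a recursive AVS."""
    for attr in path:
        structure = structure[attr]
    return structure
```

The recursion mirrors the fractal-like subdivision of dimensions into subdimensions described above: each subdimension is simply a further embedded AVS.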
In this section, one type of document organisation which has consequences for operational lexica is discussed: hypertext (and related notions such as hypermedia, hyperdocument).
The concept of hypertext, and thus also of hyperlexicon, is a presentation level concept, derivable from a more fundamental lexical document structure by means of a media interpretation function.
The file split and hyperlinking functions of a hypertext are comparable to the procedure used for printers' make-up (pagination, line and page breaks, index and table of contents page references, footnoting and endnoting). Hypertexts are sometimes defined as `non-linear' in structure, in contrast to conventional texts; rarely, however, is the level of definition specified. Many texts are non-linear (hierarchical, tree or graph-structured) at the document syntax level and no doubt (if the meaning were sufficiently precisely specified) also at the document semantics level. Linearity here means only the relatively trivial property of a full sequential ordering of printed pages at the presentation (media interpretation) level. But we all know that books, especially reference books, are frequently neither written nor read sequentially. Conversely, there is nothing to prevent a hypertext from being organised sequentially, with each page having only one link to a successor page. And again, there is nothing to stop us from producing or accessing such a hypertext non-linearly. The semiotically based five-component document model is more complex than the usual `logical structure' vs. `rendering' dichotomy of document specifications in much of the hypertext literature, and helps us to avoid such simplistic characterisations.
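The linearity criterion just discussed, namely that each page has at most one successor link, can be sketched as a simple check over a hypothetical link table:

```python
# Sketch of the linearity criterion: a hypertext, modelled as a mapping
# from page to its outgoing links, counts as sequentially organised
# when every page has at most one successor. Page ids are invented.
def is_sequential(links):
    """links: dict mapping page id -> list of successor page ids."""
    return all(len(successors) <= 1 for successors in links.values())

linear_text = {"p1": ["p2"], "p2": ["p3"], "p3": []}
hyper_text = {"p1": ["p2", "p3"], "p2": [], "p3": ["p1"]}
```

Note that the check concerns only the link structure at the presentation level; as argued above, it says nothing about how a reader actually produces or accesses the document.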
Summarising: the five component semiotic model introduced in the present contribution locates hypertext at the level of MEDIA SEMANTICS; the distinction between hypertext description and graphical or textual hypertext rendering is captured by the additional MODALITY MODEL component.
In 1995, the concept of a hyperlexicon on the web was explicitly introduced by the author as a database integrity-preserving technique and further developed during the following years (cf. [Gibbon & Lüngen 2000]). The Verbmobil VM-HyprLex website was one of the first very large-scale CGI database applications on the World-Wide Web, and provided a single-token, simultaneous multiple-access shared database for the 30 or so laboratories around the world who were members of the Verbmobil consortium. The lexicographic task was to standardise and integrate approximately 25 types of lexical information which were made available by the partners in a variety of non-standardised formats.
The extensional coverage (number of entries) is 10,000, and the intensional coverage (number of types of lexical information) is 25 (varying with different applications); a number of different search strategies are provided, including regular expressions (restricted to prevent overloading the download channel), together with various formatting types. Further applications of the hyperlexicon principle have been developed outside the Verbmobil project. The HyprLex approach is multi-level:
Figure 7 shows a generalised perspective on the logistics of the VM-HyprLex lexicographic task, which is applicable to a wide range of lexicographic tasks in language documentation. The VM-HyprLex source code and databases are available via the BAS and ELRA dissemination agencies.
In language documentation, the most well-known hyperlexicon is Bird's Hyperlex (cf. [Bird 1997]). Hyperlex has been applied to a number of languages, in the area of endangered languages notably by Connell to Mambila and by Amith to Nahuatl.
Hyperlex bears some similarities to HyprLex, in that it has a CGI-based search concept and integrates other on-the-fly calculations into the lexicon via CGI routines; the degree of integration of these add-ons is higher than with the VM-HyprLex application. Both hyperlexicon interfaces offer search and display filters over the microstructure entries of their lexical databases.
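The kind of search and display filtering offered by both interfaces can be sketched as follows; the function names and field labels here are invented for illustration and do not reflect either implementation:

```python
import re

# Hedged sketch of hyperlexicon-style filtering: a regular-expression
# search over one field of the microstructure (cf. the restricted
# regex search of VM-HyprLex), and a display filter projecting each
# matching entry onto a chosen subset of fields.
def search(entries, field, pattern):
    """Return entries whose value for `field` matches the regex."""
    rx = re.compile(pattern)
    return [e for e in entries if rx.search(e.get(field, ""))]

def display(entries, fields):
    """Project each entry onto the requested fields, in order."""
    return [{f: e[f] for f in fields if f in e} for e in entries]

lexicon = [
    {"orthography": "house", "pos": "noun"},
    {"orthography": "run", "pos": "verb"},
]
```

For example, `display(search(lexicon, "orthography", "^h"), ["pos"])` retrieves only the part-of-speech field of entries whose orthography begins with "h".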
The HyprLex approach was further developed by Gibbon & Trippel in [Gibbon & Trippel 2000] in the domain of terminological lexicography, using a textual database for generating a variety of media interpretations. The same approach was used in generating the different media involved in the publication of [Gibbon & al. 1997] and [Gibbon & al. 2000].
So, in conclusion, why all this background discussion of the principles of lexical organisation?
The answer can be stated in terms of a few basic principles:
Faced with this plethora of possibilities we advocate an explicit return to a semiotic model of the lexicon which can be incrementally extended according to fundamental linguistic and operational principles until coherent design strategies for lexical database views can be clearly defined, and hypermedia lexica can be derived automatically. A start may be made by defining a rank hierarchy of lexical objects, and a procedurally neutral microstructure on accepted linguistic typological principles, with definitions of generic metadata as a well-structured mesostructure, in addition to traditional forms of metadata.
We suggest that the next steps are along two dimensions:
The first dimension is to formulate a system design document (including the human factors of lexicographers in the field, the office and the lab, and users of various kinds), an implementation document (with open source after the alpha evaluation stages), an evaluation document, and a maintenance strategy (also including the human factors), based on an overall documentation policy.
The second is to filter out requirements specifications for specific branches of lexicography (for example in the contexts of the scientific documentation of endangered languages, as well as its applications in terminography, language teaching, and other information services), and to derive specific designs from these specifications and the general design document, specific implementations and evaluation procedures.