
Orthographic information

Orthography has been used in several different roles in spoken language lexica, some of which have already been noted:

  1. Convenient general reference labels for words, due to the high level of awareness of, familiarity with and standardisation of orthography in literate societies.
  2. Convenient identifying names for lexical entries, for "normal lemma" forms, and for headwords in complex lexicon entries which group related words together.
  3. Convenient identifying names for word hypotheses in word lattices, as lexical lookup keys.
  4. Visualisation of word hypotheses in a development system.
  5. Representation of the orthographic properties of words (the main function).

Each of these functions is distinct and needs to be kept conceptually separate in order to avoid confusion. Functions (1) and (2) are not particularly problematic. Function (3) is traditionally a feature of speech recognition systems for relatively small vocabularies. The larger the vocabulary, however, the greater the danger of introducing unnecessary orthographic noise, i.e. intrusive artefacts due to homography (words with identical spelling and different pronunciation); for this reason, in new architectures, a phonological (e.g. phonemic or autosegmental) representation in word graphs may be preferred. Function (4) is unproblematic, though reservations similar to those for (3) apply. Function (5) is the main function and is essential for written output of any kind; however, it is often confused with functions (2) and (3). Consistent orthography is therefore essential.
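The homography problem with function (3) can be illustrated with a minimal sketch. The entry identifiers, the toy word list, and the SAMPA-like transcriptions below are assumptions made for illustration only; they are not taken from any particular lexicon.

```python
from collections import defaultdict

# Hypothetical toy entries: (entry_id, orthography, phonemic form).
# The transcriptions are SAMPA-like and purely illustrative.
ENTRIES = [
    ("lead_n", "lead", "l E d"),
    ("lead_v", "lead", "l i: d"),
    ("wind_n", "wind", "w I n d"),
    ("wind_v", "wind", "w aI n d"),
    ("cat_n", "cat", "k { t"),
]


def index_by_orthography(entries):
    """Group entries under their spelling, as an orthographic lookup key would."""
    index = defaultdict(list)
    for entry_id, orth, phon in entries:
        index[orth].append((entry_id, phon))
    return dict(index)


def homographs(index):
    """Return the spellings that map to more than one distinct pronunciation."""
    return {
        orth: items
        for orth, items in index.items()
        if len({phon for _, phon in items}) > 1
    }


if __name__ == "__main__":
    for orth, items in sorted(homographs(index_by_orthography(ENTRIES)).items()):
        print(orth, items)
```

In this sketch, a word hypothesis labelled only with the spelling `lead` cannot be resolved to a single pronunciation, whereas a key such as the hypothetical `lead_n`, or the phonemic form itself, remains unambiguous.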

Orthography has the advantage of being highly standardised, except for certain regional variants (British and American English; Federal, Swiss, and Austrian German) and variations in publishers' conventions. Examples of the latter are British English -ise/-ize, as in standardisation/standardization, and the capitalisation of adjectives in nominal function in German, as in die anderen / die Anderen. Variation is found particularly in the treatment of derived and compound words (e.g. separation and hyphenation) and in the use of typographic devices such as capitalisation. Orthography is given further attention in the section on lexical representation.
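A word list can be screened for such variants before it is used. The following is a rough sketch, assuming a crude folding heuristic (lower-casing, removal of hyphens, and mapping -ize endings to -ise); the function names are hypothetical, and the heuristic only proposes candidates for manual review, since it would also group genuinely different words such as prize and prise.

```python
import re
from collections import defaultdict


def variant_key(word):
    """Fold case, hyphenation and -ize/-ise endings into one comparison key."""
    key = word.lower().replace("-", "")
    # Map -ize, -izes, -ized, -izing, -ization(s) to the corresponding -ise forms.
    return re.sub(r"iz(e|es|ed|ing|ation|ations)$", r"is\1", key)


def find_inconsistencies(words):
    """Return groups of distinct spellings that share the same folded key."""
    groups = defaultdict(set)
    for word in words:
        groups[variant_key(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    sample = ["standardisation", "standardization", "word-list", "wordlist", "lexicon"]
    for key, forms in find_inconsistencies(sample).items():
        print(key, forms)
```

Which spelling is chosen as canonical is a project decision; the sketch only reports where a list is not yet consistent.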

A standard orthographic transcription is often used for convenience as a means of representing and accessing words in a spoken language lexicon. There are several reasons for this:

  1. Familiarity to all educated speakers of the language.
  2. High level of standardisation in comparison with theory-influenced phonological transcriptions.
  3. Sufficient proximity to phonological form, at least in European languages, to ensure a reasonably close mapping to pronunciation at the level of whole words (not necessarily in the details of grapheme to phoneme mapping) in small vocabularies in some languages (French and English are notorious exceptions).

Most European languages have highly regulated orthographies, the use of which is associated with social and political rewards and punishments. Official orthographic reforms, which typically generate much controversy among the general public, may necessitate some re-implementation of spelling checkers and grapheme-phoneme converters (cf. the ongoing reform of German orthography).

For use in spoken language lexica, particularly in word lists used for training and testing recognisers, consistency is essential, and additional conventions are often required in order to meet the criterion of general computer readability in the case of special letters and diacritics. Although it cannot be regarded as a standard, it is becoming common practice to use the ASCII codings or their LaTeX adaptations for specific countries. For example, a computer-readable orthography for German, which marks special characters, in particular those with an Umlaut diacritic, has become widely accepted for German speech recognition applications, as shown in Table 6.2.


Standard orthography    ASCII orthography
Äpfel                   "Apfel
ändern                  "andern
Öl                      "Ol
östlich                 "ostlich
Überzug                 "Uberzug
über                    "uber
heiß                    hei"s

Table 6.2: Computer readable ASCII orthography for German
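The convention in Table 6.2 amounts to a character-level substitution, which can be sketched as follows. The function names are hypothetical, and the sketch assumes that a double quote never occurs in the word list for any other purpose; Unicode normalisation is added as a precaution for input in which the Umlaut is stored as a separate combining character.

```python
import unicodedata

# Character mapping following the convention of Table 6.2.
TO_ASCII = {
    "Ä": '"A', "ä": '"a',
    "Ö": '"O', "ö": '"o',
    "Ü": '"U', "ü": '"u',
    "ß": '"s',
}
FROM_ASCII = {ascii_form: char for char, ascii_form in TO_ASCII.items()}


def to_ascii_orthography(word):
    """Convert standard German orthography to the ASCII convention."""
    # NFC composes base letter + combining diaeresis into a single character.
    word = unicodedata.normalize("NFC", word)
    return "".join(TO_ASCII.get(char, char) for char in word)


def from_ascii_orthography(word):
    """Convert the ASCII convention back to standard German orthography."""
    result = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in FROM_ASCII:
            result.append(FROM_ASCII[pair])
            i += 2
        else:
            result.append(word[i])
            i += 1
    return "".join(result)


if __name__ == "__main__":
    for word in ["Äpfel", "ändern", "Öl", "östlich", "Überzug", "über", "heiß"]:
        print(word, to_ascii_orthography(word))
```

Because the mapping is one-to-one, the conversion is reversible, so the ASCII form can serve as the stored lookup key while the standard form is restored for written output.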

The results of the EAGLES Working Groups on Text Corpora and Lexica should be consulted on orthographic and other matters pertaining to written texts.


EAGLES SWLG SoftEdition, May 1997.