Normal lexical items are represented by their standard spellings. Keep to the standard spelling as far as possible; this also means that hyphens are used as in standard orthography. One dictionary or word list should be chosen as the reference (e.g. Duden for German, Larousse for French, Van Dale for Dutch). Each site/language maintains a lexicon of the spellings of the words used in the SPEECHDAT corpus. This lexicon file will be included on the CD-ROMs.
In many languages there are words or expressions which can be spelled in two or more different ways. To maintain consistency, each site/language must compile a list of such items, with the normalised spelling. For instance, in American English the spelling forms ``all right'' and ``alright'' coexist; one of these forms must be established as the standard.
It is probably advantageous always to select the form yielding the fewest `words', because that should yield the most powerful language model. There is, however, a small technical advantage in taking the multiple-word variant as the norm: spelling checkers can identify the single-word forms very easily and convert them to the multiple-word form automatically.
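As an illustration of how such a list of normalised spellings might be applied, the following is a minimal Python sketch. The table contents, the choice of the multiple-word form as the norm, and the function name normalise_spelling are assumptions made for this example only; they are not prescribed by these conventions.

```python
import re

# Hypothetical normalisation list: each variant maps to the chosen standard form.
# The multiple-word variant is taken as the norm here purely for illustration.
NORMALISED = {
    "alright": "all right",
}


def normalise_spelling(transcription, table=NORMALISED):
    """Replace each listed variant, matched as a whole word, by its standard form."""
    if not table:
        return transcription
    # Longest variants first, so that a longer entry is preferred over a shorter
    # one that happens to be its prefix.
    variants = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b")
    # Matching is case-sensitive; a site that needs case-insensitive matching
    # would have to decide how capitalisation is carried over.
    return pattern.sub(lambda m: table[m.group(0)], transcription)
```

The whole-word match keeps longer words that merely contain a variant untouched. A site that prefers the single-word form as the norm would simply invert the table.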
Abbreviations should be represented by their full orthographic forms, unless they are spoken in their abbreviated form. Exceptions are normally occurring abbreviations such as Mr, Mrs, Messrs, some of which do not have non-abbreviated forms.
To support homogeneity in the spelling conventions used, it is strongly recommended to employ an electronic spelling checker. If one is used, the make and type of the checker should be reported.
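Alongside a spelling checker, the site lexicon itself can serve as a simple consistency check. The sketch below, in Python, lists the words of a transcription that do not occur in the lexicon. It assumes that the lexicon is available as a set of spellings and that words are separated by white space; the function name unknown_words and the punctuation handling are hypothetical and would need adapting to the actual file formats.

```python
def unknown_words(transcription, lexicon):
    """Return the words not found in the lexicon, in order of first appearance."""
    seen = set()
    missing = []
    for token in transcription.split():
        # Strip surrounding punctuation only; hyphens and apostrophes inside
        # a word are kept, since they are part of the spelling.
        word = token.strip(".,;:!?\"'")
        if word and word not in lexicon and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing
```

Words reported in this way are either misspellings to be corrected or new items to be added to the lexicon; the sketch does not decide between the two.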
An orthographic transcription means that the standard spelling of a given language is used for the symbolic representation of the speech. It is possible to include a very restricted number of markings for regular variations in pronunciation. These cases must be clearly documented! No more than two or three regular variations may be indicated.
For example, the absence of liaison in French may be indicated (see the final section on some language specific issues).