Orthographic transcription means transcribing speech using the standard spelling conventions of the language. Orthographic transcriptions are used in large-scale speech corpora and in corpora used for research in which details of the pronunciation of words are not important. In the first case, a detailed transcription of at least part of the corpus may be desirable (especially if the corpus is used for training speech recognition systems), but because of the huge amount of work involved this cannot always be done. In the latter case, precise transcriptions are simply not necessary.
The use of normal spelling necessarily implies a compromise between the sounds heard and what is written down. Particularly in the case of spontaneous speech, there may be a significant discrepancy between what is heard and the symbolic representation used to encode it (see Appendix M for the guidelines for orthographic transcription used in the SPEECHDAT corpora).
Because of the discrepancy between what is heard and what is written, many developers of spontaneous speech corpora have decided to indicate reduced word forms, using the reduced forms given in the standard dictionary for the language. However, in the interests of consistency, developers are sometimes forced to use forms not present in the dictionary. For example, in German, a preposition is often contracted with a following article, forming one word and thus reducing the number of syllables (e.g. ``zu der'' is pronounced and written as ``zur''). Such forms occur in the Duden dictionary (a standard German dictionary). However, in the VERBMOBIL corpus, the transcriber is also permitted to write ``fürn'' for ``für den'', although this form does not exist in the dictionary. Similarly, in the Dutch Speech Styles corpus it was decided to indicate reduced word forms. Criteria for indicating reduced forms in an orthographic transcription may be a) the frequency of occurrence of these forms and b) the reduction in the number of syllables. The reduced word forms used should be listed in the accompanying documentation.
Even in speech corpora covering the standard variety of a given language, speakers may have their own idiolect or may use words that have a dialect basis. These words have to be marked in the transcription. The developers of the VERBMOBIL corpus chose an orthographic means of indicating dialect words which are not in the Duden dictionary. It is also possible to add a comment explaining the meaning of such a word, as in the following example, which explains a dialect form of the greeting ``good morning'' used in the North of Germany: ``moin, moin <; norddeutsche Grußformel>''
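As an illustration of how such comments might be handled when a transcription is processed automatically, the following minimal Python sketch separates the comment from the transcribed words. It assumes that a comment always has the shape shown in the example above (an opening ``<;'', free text, a closing ``>''); the function name split_comments is hypothetical, and the full VERBMOBIL conventions may well contain further cases that are not covered here.

```python
import re

# Assumed comment shape, inferred from the single example in the text:
# "<;" followed by free text up to the next ">".
COMMENT = re.compile(r"\s*<;\s*([^>]*)>")


def split_comments(line):
    """Return (transcribed text, list of comments) for one transcription line."""
    comments = [c.strip() for c in COMMENT.findall(line)]
    text = COMMENT.sub("", line).strip()
    return text, comments


if __name__ == "__main__":
    print(split_comments("moin, moin <; norddeutsche Grußformel>"))
```

Keeping the comments in a separate list allows the plain word sequence to be passed on to later processing steps without losing the explanatory information.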
In orthographic transcriptions, numbers are usually spelled out in full rather than being written in digit form. In some cases, the decision is made to deviate from the standard spelling in order to avoid excessively long words. For example, in the VERBMOBIL corpus the numbers 13 to 99, as well as the hundreds from one hundred to nineteen hundred (as used in years), are written as a single word in accordance with German orthographic conventions. However, all other parts of a number are written as separate words, and thus do not conform to the normal rules. Examples follow:
1993: neunzehnhundert dreiundneunzig
3049614: drei Millionen neunundvierzig tausend sechshundert vierzehn
349: dreihundert neunundvierzig
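The convention illustrated by these examples can be sketched in Python as follows. This is only a sketch under stated assumptions, not the procedure used by the VERBMOBIL developers: the function names spell_cardinal and spell_year are hypothetical, the year reading has to be requested explicitly (the transcriber writes what was actually said), and the treatment of ``eins'' before ``tausend'' and ``Million'' is a simplification that is not taken from the text.

```python
UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _below_100(n):
    """Numbers below 100 are always written as a single word."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    return ("ein" if unit == 1 else UNITS[unit]) + "und" + TENS[tens]


def _hundreds(h):
    """Hundreds 1..19 are written as a single word, e.g. 'neunzehnhundert'."""
    return ("ein" if h == 1 else UNITS[h]) + "hundert"


def _group(n, one="eins"):
    """Tokens for 0..999; `one` replaces a bare final 'eins' (assumption)."""
    tokens = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        tokens.append(_hundreds(hundreds))
    if rest:
        tokens.append(one if rest == 1 else _below_100(rest))
    return tokens


def spell_cardinal(n):
    """Spell 0 <= n < 10**9 with the parts written as separate words."""
    if not 0 <= n < 10**9:
        raise ValueError("only 0 .. 999999999 are supported in this sketch")
    if n == 0:
        return UNITS[0]
    millions, remainder = divmod(n, 1_000_000)
    thousands, rest = divmod(remainder, 1000)
    tokens = []
    if millions:
        tokens += _group(millions, one="eine")
        tokens.append("Million" if millions == 1 else "Millionen")
    if thousands:
        tokens += _group(thousands, one="ein") + ["tausend"]
    tokens += _group(rest)
    return " ".join(tokens)


def spell_year(n):
    """Spell a year between 1100 and 1999 in the 'hundreds' reading."""
    if not 1100 <= n <= 1999:
        raise ValueError("year reading is only sketched for 1100 .. 1999")
    hundreds, rest = divmod(n, 100)
    tokens = [_hundreds(hundreds)]
    if rest:
        tokens.append(_below_100(rest))
    return " ".join(tokens)


if __name__ == "__main__":
    print(spell_year(1993))
    print(spell_cardinal(3049614))
    print(spell_cardinal(349))
```

Such a routine is mainly useful for checking the consistency of number spellings produced by transcribers; it cannot decide which reading a speaker actually used.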
In orthographic transcriptions, the full form of an abbreviation is usually written. Hence ``e.g.'' is written as ``for example'', and German ``usw.'' is written as ``und so weiter''. Abbreviations which are pronounced as words in their own right are spelled as words (e.g. Benelux, OPEC, NATO).
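Where an existing written text serves as the starting point, the expansion of abbreviations can be supported by a simple lookup table. The Python sketch below is a hypothetical illustration: the table holds only the two expansions mentioned above, the function name expand_abbreviations is an assumption, and abbreviations pronounced as words (Benelux, OPEC, NATO) are simply left unchanged because they are not in the table.

```python
# Illustrative table; a real project would list its own abbreviations
# in the documentation accompanying the corpus.
ABBREVIATIONS = {
    "e.g.": "for example",
    "usw.": "und so weiter",
}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace known abbreviations by their full form, token by token."""
    expanded = []
    for token in text.split():
        core = token.rstrip(",;:")        # keep trailing punctuation aside
        trailing = token[len(core):]
        expanded.append(table.get(core, core) + trailing)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_abbreviations("Benelux, NATO usw."))
```

A table of this kind only proposes an expansion; the transcriber still has to check that the full form is what the speaker actually said.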
In spoken language corpora such as POLYPHONE, speakers are asked to read out (among other things) spelled words. Words can be spelled out in different ways. The names of the letters can be pronounced (e.g. A, B, C), but one can also use words beginning with the letter concerned, like Alpha, Bravo, Charlie (as in the radio alphabets used by the military etc.). Spellings must be indicated in orthographic transcriptions, including the case when only part of the word is spelled out, as in (German) ``USA-trip'', ``Vitamin-C''. In the VERBMOBIL corpus, spelled letters are indicated by capitals preceded by $: $U-$S-$A-trip, Vitamin-$C.
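The $ notation can be generated and read back mechanically. The following Python sketch assumes that the transcriber has already decided which parts of a word were spelled out; the function names mark_spelling and spelled_letters, and the representation of a word as a list of (text, spelled) pairs, are hypothetical and serve only to illustrate the notation.

```python
import re


def mark_spelling(parts):
    """Join word parts with hyphens, writing spelled parts as $-marked capitals.

    `parts` is a list of (text, spelled) pairs, e.g. [("USA", True), ("trip", False)].
    """
    pieces = []
    for text, spelled in parts:
        if spelled:
            pieces.extend("$" + letter.upper() for letter in text)
        else:
            pieces.append(text)
    return "-".join(pieces)


def spelled_letters(word):
    """Return the letters that are marked as spelled in a transcribed word."""
    return re.findall(r"\$([A-ZÄÖÜ])", word)


if __name__ == "__main__":
    print(mark_spelling([("USA", True), ("trip", False)]))
    print(mark_spelling([("Vitamin", False), ("C", True)]))
    print(spelled_letters("$U-$S-$A-trip"))
```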
Interjections such as ``ah'', ``oh'', ``mm'', or the French ``hein'' must be shown according to the standard spelling of these forms in the given language. If there is no standard spelling for a certain interjection, it is necessary to decide on a spelling, and ensure it is included in the documentation associated with the corpus.
When orthographic transcriptions are used for corpora containing read speech, the original written text may function as the default transcription. The corrected transcription then indicates how faithfully the written text was read by the speaker. For single words or short sentences, speakers make relatively few mistakes, as has been found for the Dutch POLYPHONE Corpus and the GRONINGEN Corpus. However, in the case of read texts (even short texts), speakers often make mistakes. These mistakes are mostly deletions of words, false starts, or hesitations. Speakers may also add words not present in the text, or they may use a different word order. Furthermore, speakers may mispronounce words. For example, they may add, omit, or scramble syllables.
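Once a corrected transcription exists, the deviations from the prompt text can be listed automatically by a word-level comparison. The Python sketch below uses difflib.SequenceMatcher from the standard library for this purpose; it is a minimal illustration, the function name reading_deviations is hypothetical, and mispronunciations within a word are not detected because only whole orthographic words are compared.

```python
import difflib


def reading_deviations(prompt, transcript):
    """List word-level differences between prompt text and transcription.

    Each entry is (tag, prompt_words, transcript_words), where tag is
    'delete' (word not read), 'insert' (word added) or 'replace'.
    """
    prompt_words = prompt.split()
    spoken_words = transcript.split()
    matcher = difflib.SequenceMatcher(a=prompt_words, b=spoken_words,
                                      autojunk=False)
    deviations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deviations.append((tag, prompt_words[i1:i2], spoken_words[j1:j2]))
    return deviations


if __name__ == "__main__":
    for deviation in reading_deviations("the quick brown fox jumps",
                                        "the brown fox uh jumps"):
        print(deviation)
```

A changed word order will typically show up as a combination of deletions and insertions, so the output is best treated as a list of places for a human transcriber to inspect.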
Whether such dysfluencies must be indicated in the orthographic transcription depends on the intended application of the corpus, for read as well as for spontaneous speech. If the corpus is to be used for initialising speech recognition systems, every sound must be annotated, including hesitations, filled pauses etc. Research on reading errors will also require annotation of such dysfluencies. On the other hand, if a corpus is to be used to determine the type of syntactic structures typically used in a certain dialogue system, then it is not necessary to indicate all events occurring in the signal. For an overview of annotations which can be used in an orthographic transcription and subsequent phonetic labelling see [Kohler et al. (1995)]. These annotations concern non-verbal sounds made by the speaker as well as verbal ones (e.g. hesitations, coughing, laughing), and also background noises (slamming of doors, ringing of telephones).
Some projects may wish to produce the orthographic transcription as the first of
several linguistic levels of annotation. In this case, the next level will be
the citation-phonemic form of the speech. If an on-line pronouncing dictionary
is available, the citation-phonemic form may be derived automatically from the
orthography, thus saving time.
RECOMMENDATION 2
When transcribing a corpus orthographically, it is advisable to generate a
list of all unique word forms found in the transcription. This list will then
form the input to a grapheme-to-phoneme conversion module (which may involve
accessing a phonemic dictionary and/or running letter-to-sound rules). The
output of this module will be a table with the citation-phonemic forms
(canonical forms) of the speech, which can form a basis for later adaptation
to various accents of the same language. This procedure is followed, for
example, in the SPEECHDAT corpora.
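The procedure recommended above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SPEECHDAT implementation: the function names, the toy lexicon, and its SAMPA-like entries are hypothetical, and the letter-to-sound rules are represented only by an optional callable supplied by the user.

```python
def unique_word_forms(transcriptions):
    """Return the sorted list of unique word forms in the transcriptions."""
    forms = set()
    for line in transcriptions:
        for token in line.split():
            word = token.strip(".,;:!?")   # drop surrounding punctuation
            if word:
                forms.add(word)
    return sorted(forms)


def build_pronunciation_table(transcriptions, lexicon, letter_to_sound=None):
    """Map each unique word form to a citation-phonemic form.

    The lexicon is consulted first; `letter_to_sound` (an optional callable,
    assumed to implement letter-to-sound rules) is the fallback. Words that
    cannot be converted are returned separately for manual treatment.
    """
    table = {}
    missing = []
    for word in unique_word_forms(transcriptions):
        if word in lexicon:
            table[word] = lexicon[word]
        elif letter_to_sound is not None:
            table[word] = letter_to_sound(word)
        else:
            missing.append(word)
    return table, missing


if __name__ == "__main__":
    # Toy lexicon with illustrative SAMPA-like entries (assumption).
    toy_lexicon = {"guten": "g u: t @ n", "Morgen": "m O 6 g @ n"}
    table, missing = build_pronunciation_table(
        ["guten Morgen", "guten Tag, guten Morgen"], toy_lexicon)
    print(table)
    print(missing)
```

Because the table is built from unique word forms rather than from running text, each form is converted only once, and the list of unconverted forms shows directly which entries still have to be added to the dictionary.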