Orthographic transcription


An orthographic transcription uses the standard spelling conventions of the language. Orthographic transcriptions are used in large-scale speech corpora and in corpora used for research in which details of the pronunciation of words are not important. In the first case, a detailed transcription of at least part of the corpus may be desirable (especially if the corpus is used for training speech recognition systems), but the amount of work involved means that this cannot always be done. In the second case, precise transcriptions are simply not necessary.

The use of normal spelling necessarily implies a compromise between the sounds heard and what is written down. Particularly in the case of spontaneous speech, there may be a significant discrepancy between what is heard and the symbolic representation used to encode it (see Appendix M for the guidelines for orthographic transcription used in the SPEECHDAT corpora).

Reduced word forms


Because of the discrepancy between what is heard and what is written, many developers of spontaneous speech corpora have decided to indicate reduced word forms, using the reduced forms given in the standard dictionary for the language. However, in the interests of consistency, developers are sometimes forced to use forms not present in the dictionary. For example, in German, a preposition is often contracted with a following article, forming one word and thus reducing the number of syllables (e.g. ``zu der'' is pronounced and written as ``zur''). These forms occur in the Duden dictionary (a standard German dictionary). However, in the VERBMOBIL corpus, the developer is also permitted to write ``fürn'' for ``für den'', although this form does not exist in the dictionary. Similarly, in the Dutch Speech Styles corpus it was decided to indicate reduced word forms. Criteria for indicating reduced forms in an orthographic transcription may be a) the frequency of occurrence of these forms and b) the reduction in the number of syllables. The reduced word forms used should be listed in the accompanying documentation.
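The list of permitted reduced forms can also be kept in machine-readable form, so that transcriptions can be checked or expanded automatically. The following is a minimal sketch in Python, not a tool used by any of the projects mentioned. The table REDUCED_FORMS and the function expand_reduced_forms are hypothetical names, and only the two entries ``zur'' and ``fürn'' are taken from the text above.

```python
# Hypothetical table of reduced forms permitted in a transcription.
# Only these two entries come from the text; a real project would take
# the full list from its own documentation.
REDUCED_FORMS = {
    "zur": ("zu", "der"),    # listed in the Duden dictionary
    "fürn": ("für", "den"),  # permitted in VERBMOBIL, not in the dictionary
}


def expand_reduced_forms(tokens):
    """Replace each known reduced form by its full form; keep other tokens."""
    expanded = []
    for token in tokens:
        # Look the token up case-insensitively; unknown tokens pass through.
        expanded.extend(REDUCED_FORMS.get(token.lower(), (token,)))
    return expanded


print(expand_reduced_forms(["ich", "gehe", "zur", "Post"]))
```

Keeping the mapping in one table means that the documentation and any checking scripts can be generated from the same source.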

Dialect forms


Even in speech corpora covering the standard variety of a given language, speakers may have their own idiolect or may use words that have a dialect basis. These words have to be marked in the transcription. The developers of the VERBMOBIL corpus chose an orthographic means of indicating dialect words which are not in the Duden dictionary. It is also possible to add information about the meaning of these words, as in the following example, which explains a dialect form of the greeting ``good morning'' used in the north of Germany: ``moin, moin <; norddeutsche Grußformel>''



Numbers

In orthographic transcriptions, numbers are usually spelled out in full rather than being written in digit form. In some cases, the decision is made to deviate from the standard spelling in order to avoid excessively long words. For example, in the VERBMOBIL corpus the numbers 13 to 99 as well as the hundreds from 1 to 19 (the latter as used in years) are written as a single word in accordance with German orthographic conventions. However, all other parts of a number are written as separate words, and thus do not conform to the normal rules. Examples follow:

1993: neunzehnhundert dreiundneunzig
3049614: drei Millionen neunundvierzig tausend sechshundert vierzehn
349: dreihundert neunundvierzig
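
The convention illustrated by these examples can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in VERBMOBIL: the function names are hypothetical, the sketch covers only 1 to 9,999,999 and the years 1100 to 1999, and the forms ``ein tausend'' and ``eine Million'' are assumptions that are not confirmed by the examples above.

```python
# German number words for 0..19 (index = value).
UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
         "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
         "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def below_100(n):
    """Spell 1..99 as a single word, e.g. 49 -> 'neunundvierzig'."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    stem = "ein" if unit == 1 else UNITS[unit]
    return stem + "und" + TENS[tens]


def below_1000(n):
    """Spell 0..999 as a list: the hundreds as one word, the rest as one word."""
    hundreds, rest = divmod(n, 100)
    words = []
    if hundreds:
        words.append(("ein" if hundreds == 1 else UNITS[hundreds]) + "hundert")
    if rest:
        words.append(below_100(rest))
    return words


def spell_number(n):
    """Spell 1..9,999,999 with the groups written as separate words."""
    if not 0 < n < 10_000_000:
        raise ValueError("this sketch covers 1 to 9,999,999 only")
    millions, rest = divmod(n, 1_000_000)
    thousands, rest = divmod(rest, 1000)
    words = []
    if millions:
        # Assumption: 'eine Million' for one million, plural otherwise.
        words += ["eine", "Million"] if millions == 1 else [UNITS[millions], "Millionen"]
    if thousands:
        group = below_1000(thousands)
        if group[-1] == "eins":  # assumption: 'ein tausend', not 'eins tausend'
            group[-1] = "ein"
        words += group + ["tausend"]
    words += below_1000(rest)
    return " ".join(words)


def spell_year(year):
    """Spell a year 1100..1999 with the century as a single word."""
    if not 1100 <= year <= 1999:
        raise ValueError("this sketch covers the years 1100 to 1999 only")
    century, rest = divmod(year, 100)
    words = [UNITS[century] + "hundert"]
    if rest:
        words.append(below_100(rest))
    return " ".join(words)


print(spell_year(1993))
print(spell_number(3049614))
print(spell_number(349))
```

Years are handled by a separate function because the digit string alone does not show whether 1993 is to be read as a year or as a cardinal number.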

Abbreviations and spelled words

In orthographic transcriptions, the full form of an abbreviation is usually written. Hence ``e.g.'' is written as ``for example'', and German ``usw.'' is written as ``und so weiter''. Abbreviations which are pronounced as words in their own right are spelled as words (e.g. Benelux, OPEC, NATO).

In spoken language corpora such as POLYPHONE, speakers were asked to read out (among other things) spelled words. Words can be spelled out in different ways. The names of the letters can be pronounced (e.g. A, B, C), but one can also use words beginning with the letter concerned, such as Alpha, Bravo, Charlie (as in the radio alphabets used by the military etc.). Spellings must be indicated in orthographic transcriptions, including the case where only part of the word is spelled out, as in (German) ``USA-trip'', ``Vitamin-C''. In the VERBMOBIL corpus, spellings are indicated by capitals preceded by $: $U-$S-$A-trip, Vitamin-$C.
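
A marking of this kind is easy to process automatically. The sketch below, in Python, reads the $ notation from the two examples above. The function names are hypothetical, and the pattern assumes that only the unaccented capitals A to Z are marked, which the text does not state.

```python
import re

# Assumption: a spelled letter is "$" followed by one capital A-Z.
SPELLED_LETTER = re.compile(r"\$([A-Z])")


def spelled_letters(token):
    """Return the letters marked as spelled out in one transcription token."""
    return SPELLED_LETTER.findall(token)


def strip_spelling_marks(token):
    """Recover the plain spelling, e.g. '$U-$S-$A-trip' -> 'USA-trip'."""
    # Drop the hyphen that joins one spelled letter to the next one ...
    joined = re.sub(r"\$([A-Z])-(?=\$)", r"\1", token)
    # ... then drop the remaining "$" markers.
    return joined.replace("$", "")


print(spelled_letters("$U-$S-$A-trip"))
print(strip_spelling_marks("$U-$S-$A-trip"))
print(strip_spelling_marks("Vitamin-$C"))
```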


Interjections

Interjections such as ``ah'', ``oh'', ``mm'', or the French ``hein'' must be written according to the standard spelling of these forms in the given language. If there is no standard spelling for a certain interjection, it is necessary to decide on a spelling and to ensure that it is included in the documentation associated with the corpus.


Orthographic transcription of read speech 

When orthographic transcriptions are used for corpora containing read speech, the original written text may function as the default transcription. The transcription then indicates how well the written text was read by the speaker. For single words or short sentences, speakers make relatively few mistakes, as was found for the Dutch POLYPHONE Corpus and the GRONINGEN Corpus. However, in the case of read texts (even short texts), speakers often make mistakes. These mistakes mostly involve deletions of words, false starts, or hesitations. Speakers may also add words not present in the text, or they may use a different word order. Furthermore, speakers may mispronounce words. For example, they may add, omit, or scramble syllables.
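
Where the prompt text serves as the default transcription, word-level deviations can be listed by aligning it with the corrected transcription. The following Python sketch uses the standard library class difflib.SequenceMatcher. The function name and the English example sentences are hypothetical, and the comparison is case-sensitive and ignores punctuation handling.

```python
from difflib import SequenceMatcher


def reading_deviations(prompt, transcription):
    """List word-level differences between the prompt and what was read.

    Each entry is (kind, prompt_words, read_words), where kind is
    'delete', 'insert' or 'replace' as reported by SequenceMatcher.
    """
    prompt_words = prompt.split()
    read_words = transcription.split()
    matcher = SequenceMatcher(a=prompt_words, b=read_words, autojunk=False)
    deviations = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            deviations.append((kind, prompt_words[i1:i2], read_words[j1:j2]))
    return deviations


# Hypothetical example: the speaker skips one word of the prompt.
print(reading_deviations("the cat sat on the mat", "the cat sat on mat"))
```

An alignment of this kind only finds differences at the word level; mispronunciations within a word still have to be marked by the transcriber.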

Depending on the intended application of a corpus containing read as well as spontaneous speech, such dysfluencies may have to be indicated in the orthographic transcription. If the corpus is to be used for initialising speech recognition systems, every sound must be annotated, including hesitations, filled pauses etc. Research on reading errors will also require annotation of such dysfluencies. On the other hand, if a corpus is to be used to determine the type of syntactic structures typically used in a certain dialogue system, then it is not necessary to indicate all events occurring in the signal. For an overview of annotations which can be used in an orthographic transcription and subsequent phonetic labelling see [Kohler et al. (1995)]. These annotations concern sounds made by the speaker (e.g. hesitations, coughing, laughing), as well as background noises (slamming of doors, ringing of telephones).

Orthographic transcription as the first of many levels

Some projects may wish to produce the orthographic transcription as the first of several linguistic levels of annotation. In this case, the next level will be the citation-phonemic form of the speech. If an on-line pronouncing dictionary is available, the citation-phonemic form may be derived automatically from the orthography, thus saving time.

When transcribing a corpus orthographically, it is advisable to generate a list of all unique word forms found in the transcription. This list will then form the input to a grapheme-to-phoneme conversion module (which may involve accessing a phonemic dictionary and/or running letter-to-sound rules). The output of this module will be a table with the citation-phonemic forms (canonical forms) of the speech, which can form a basis for later adaptation to various accents of the same language. This procedure is followed, for example, in the SPEECHDAT corpora.
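
The first two steps of this procedure, collecting the unique word forms and looking them up in a phonemic dictionary, can be sketched in Python as follows. This is not the SPEECHDAT implementation: the function names and the small lexicon are hypothetical, the phonemic strings are illustrative SAMPA-style entries only, and the letter-to-sound rules for words missing from the dictionary are left out.

```python
def unique_word_forms(transcriptions):
    """Return the sorted list of unique word forms in the transcriptions."""
    return sorted({word for line in transcriptions for word in line.split()})


def citation_forms(word_forms, lexicon):
    """Look up each word form in a phonemic dictionary.

    Returns a table of the forms that were found and a list of the forms
    that still need letter-to-sound rules or manual treatment.
    """
    table = {word: lexicon[word] for word in word_forms if word in lexicon}
    missing = [word for word in word_forms if word not in lexicon]
    return table, missing


# Hypothetical lexicon with illustrative SAMPA-style entries.
lexicon = {"und": "U n t", "so": "z o:"}

forms = unique_word_forms(["und so weiter", "und so"])
table, missing = citation_forms(forms, lexicon)
print(forms)
print(table)
print(missing)
```

Listing the missing forms separately shows at a glance how much of the corpus vocabulary the dictionary covers before any rules are run.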

