
List of recommendations

For the transcription of dialogues between more than two speakers, use a "music score notation".
For orthographic transcriptions, use the standard spelling as much as possible.
Indicate reduced word forms in orthographic transcriptions a) if these forms occur frequently and b) if they involve syllable deletion.
Use at least two types of "filler" syllable: one vowel-like type (uh) and one nasal type (mm).
Non-speech acoustic events should be annotated at the correct location in the utterance, by first transcribing the words and then indicating which words are simultaneous with the acoustic events.
When orthographic transcription is used in a corpus, it is recommended that a list of unique words and word forms be generated from the transcription. The orthographic forms of the words can then be converted to phonemes by means of computerised grapheme-to-phoneme conversion. The result of this process is a list of citation forms, also called canonical forms or citation-phonemic forms. These forms represent the pronunciation of words when spoken in isolation, and do not cover variations in pronunciation found in running speech. However, this procedure will at least give a standard pronunciation as a starting-point. This is especially relevant if a corpus is to be used by people outside that language community. On the basis of these canonical forms, phonetic transcriptions can be made semi-automatically using large vocabulary speech recognisers.
If there is no compelling reason otherwise, do not start to transcribe a corpus phonetically, since the time spent on this will never be recovered. If very specific phonetic details are needed, one is advised to look for these on the basis of orthographic and/or phonemic transcriptions.
It is recommended that transcribers give information about the process of transcribing and about the speech that they have transcribed. Some speakers will be easier to transcribe than others. This will depend on the speech rate, the clarity of articulation, the amount of hesitation, and the number of dialect words used by the speakers. Some information about the difficulty of the transcription is very useful for later queries. The transcribers of the Switchboard (telephone) Corpus were asked to indicate on a scale ranging from 1 to 5 the following characteristics of a conversation: difficulty, topicality, naturalness, echo from B (when listening to A separately, B could hardly be heard (1) or was nearly as loud as A (5)), echo from A, static on A (no static noise (1) or a great deal of it (5)), static on B, background A, and background B.
In the case of transcriptions at more than one level (e.g. orthographic transcription with some prosodic marks and indications of hesitations etc.), the recommendation is to listen to one level at a time. In everyday life, listeners are accustomed to ignoring hesitations, false starts, and other imperfections, and also do not pay explicit attention to prosody. Transcribers must learn to hear all these events. It seems easiest to listen to the words first and transcribe these, and then to assign the prosodic marks and other annotations.
For orthographic transcriptions it is not necessary to find experienced transcribers. However, for phonemic and phonetic transcriptions it is necessary to use transcribers who are accustomed to listening to speech in a very precise, analytical way.
As a rough indication of the time necessary to transcribe speech: an orthographic transcription of spontaneous speech takes about ten times the duration of the speech itself, an orthographic transcription of read sentences takes about three times the duration of the speech, and an orthographic transcription of read texts takes about five times the duration of the speech.
Checking of transcription is always necessary. Checking can be done in different ways. An independent transcriber can transcribe the whole or a sample of the corpus. Another possibility is to allow someone else to check the transcription by reading the transcription and listening to the speech. This is less time-consuming. In the case of the latter procedure, it is recommended that the transcription be checked in the opposite order to that used by the first transcriber, since towards the end of the material the first transcriber will be more self-consistent than at the beginning. Inconsistencies may occur in the conventions used (spelling and annotation conventions such as brackets), as well as in what is heard by the two different persons.
For the label file format, use any format that can easily be converted to a WAVES label file, for the sake of portability across different systems.
Any accuracy measure based on inter-transcriber consistency must control for the factors "level of transcription", "segment type", and "task type" (whether segmentation or labelling).
If the corpus is confined to one language, and if the labelling is to use alphabetic characters rather than true IPA symbols, then it is advisable to use a language-specific set of characters. This avoids the notational complexity necessary when all symbols must be kept distinct across all languages, as is needed in the study of general phonetics.
When transcribing prosodically, the provisional recommendation is to use either the ToBI or the IPO system (and the MARSEC system if a purely auditory transcription is being carried out). If the language to be transcribed is not English, and especially if the projected application of the prosodic transcription is in the field of speech technology, then it is probably best to use the IPO system if possible (i.e. if the basic "grammar" of contours has already been researched for that language).
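
The recommendation to generate a list of unique words and word forms from an orthographic transcription can be illustrated with a minimal sketch. The function name `build_word_list`, the tokenisation rule (letters, with internal apostrophes and hyphens kept) and the sample lines are assumptions made for illustration only; a real corpus will need a tokeniser that matches its own transcription conventions. The resulting list is what would then be passed to a grapheme-to-phoneme converter, which is not shown here.

```python
import re
from collections import Counter

# Assumed tokenisation: runs of letters, optionally joined by an
# internal apostrophe or hyphen (e.g. "don't", "well-known").
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def build_word_list(transcript_lines):
    """Count each unique (lower-cased) word form in the transcript lines."""
    counts = Counter()
    for line in transcript_lines:
        for token in WORD_PATTERN.findall(line.lower()):
            counts[token] += 1
    return counts

# Hypothetical transcript lines, including a filler syllable.
lines = ["Well uh I don't know", "I know the well-known answer"]
word_counts = build_word_list(lines)

# Print the unique word forms in alphabetical order with their frequencies.
for word in sorted(word_counts):
    print(word, word_counts[word])
```

Keeping the frequencies alongside the word forms also helps with the recommendation on reduced word forms, since it shows which forms occur often enough to merit their own entry.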
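
The transcription-time figures given above (about ten, three and five times the speech duration) lend themselves to a simple planning calculation. The sketch below uses only those three factors; the dictionary keys, the function name `estimate_transcription_hours` and the example corpus sizes are hypothetical.

```python
# Approximate ratio of transcription time to speech duration,
# as quoted in the recommendation on transcription time.
REAL_TIME_FACTOR = {
    "spontaneous": 10,
    "read_sentences": 3,
    "read_texts": 5,
}

def estimate_transcription_hours(speech_hours_by_type):
    """Sum the estimated orthographic transcription time over speech types."""
    total = 0.0
    for speech_type, hours in speech_hours_by_type.items():
        total += REAL_TIME_FACTOR[speech_type] * hours
    return total

# Hypothetical corpus: hours of recorded speech per speech type.
corpus = {"spontaneous": 2.0, "read_sentences": 4.0, "read_texts": 1.0}
estimate = estimate_transcription_hours(corpus)
print(estimate)  # 2*10 + 4*3 + 1*5 = 37.0 hours
```

The estimate covers the first orthographic pass only; checking by a second person, which is always necessary, adds to it.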
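
One way to respect the requirement that consistency measures control for level of transcription, segment type and task type is to report agreement separately for each combination of the three factors rather than as one pooled figure. The sketch below uses simple percentage agreement, which is an assumption and not a measure prescribed here. The record layout, the function name `agreement_by_factor` and the sample data are likewise hypothetical. Each record carries a flag saying whether the two transcribers agreed on that item (for segmentation this might mean boundaries within some tolerance, for labelling the same symbol).

```python
from collections import defaultdict

def agreement_by_factor(records):
    """Return the agreement rate for each (level, segment type, task type).

    Each record is (level, segment_type, task_type, agreed), where
    `agreed` is True if the two transcribers agreed on that item.
    """
    tallies = defaultdict(lambda: [0, 0])  # key -> [agreements, total]
    for level, segment_type, task_type, agreed in records:
        key = (level, segment_type, task_type)
        tallies[key][1] += 1
        if agreed:
            tallies[key][0] += 1
    return {key: agree / total for key, (agree, total) in tallies.items()}

# Hypothetical comparison records for two transcribers.
records = [
    ("phonemic", "vowel", "labelling", True),
    ("phonemic", "vowel", "labelling", False),
    ("phonemic", "plosive", "labelling", True),
    ("phonemic", "plosive", "segmentation", True),
    ("phonetic", "vowel", "labelling", False),
]
rates = agreement_by_factor(records)
for key in sorted(rates):
    print(key, rates[key])
```

Reporting per-cell rates keeps, for example, easy plosive labelling from masking poor agreement on vowel labelling at a narrower level of transcription.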
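
The Switchboard-style ratings mentioned above can be stored as a small validated record, so that later queries can filter conversations by difficulty or noise. The field names follow the characteristics listed in that recommendation; the class name `ConversationRatings` and the example values are hypothetical, and this is only a sketch of one possible representation.

```python
from dataclasses import dataclass, fields

@dataclass
class ConversationRatings:
    """Transcriber ratings for one conversation, each on a 1-5 scale."""
    difficulty: int
    topicality: int
    naturalness: int
    echo_from_b: int
    echo_from_a: int
    static_on_a: int
    static_on_b: int
    background_a: int
    background_b: int

    def __post_init__(self):
        # Reject anything outside the 1-5 scale used by the transcribers.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(
                    f"{field.name} must be an integer from 1 to 5, got {value!r}"
                )

# Hypothetical ratings for a single conversation.
example = ConversationRatings(
    difficulty=2, topicality=3, naturalness=4,
    echo_from_b=1, echo_from_a=1,
    static_on_a=1, static_on_b=2,
    background_a=1, background_b=3,
)
print(example)
```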


EAGLES SWLG SoftEdition, May 1997.