The transcription of read speech versus the transcription of spontaneous speech

The point of departure for the transcription of read speech is the written text. This makes this type of transcription somewhat easier to perform than the transcription of spontaneous speech, where an orthographic transcription must first be made. In read speech, planning and word-seeking processes are not involved. In spontaneous speech production, these processes have a significant effect on the speech that is produced. It is well known that spontaneous speech is not fluent: speakers produce numerous filled pauses, mispronunciations, false starts, and repetitions. In addition, depending on the formality or informality of the setting, speakers will use colloquial speech and non-standard pronunciations. These properties make all types of transcription, global as well as detailed, more difficult to perform for spontaneous speech than for read speech. In read speech, the use of written texts ensures that there are fewer dysfluencies and a lower incidence of non-standard pronunciations.

Another important distinction between read and spontaneous speech in relation to transcription is that for read speech it is clear what an utterance is: the written sentence, usually starting with a capital letter and ending with a full stop. For spontaneous speech this is not necessarily the case. Depending on the type of spontaneous speech involved, it is often necessary to define the criteria for delimiting utterances. For dialogues and other forms of conversation in which more than one speaker is involved, it is usual to define utterances more or less in terms of speaker turns (see the Guidelines issued by the Text Encoding Initiative (TEI) in [Sperberg-McQueen & Burnard (1994)] and Switchboard). For monologues, utterances can be defined as stretches of speech that are mostly preceded and followed by a pause and that have a more or less consistent syntactic, semantic, pragmatic, and prosodic structure (see the criteria developed by the Network of European Reference Corpora (NERC) in [French (1991), French (1992)] and the Dutch Speech Styles Corpus [Den Os (1994)]; see also the results of the EAGLES Working Group on machine readable corpora).
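
The pause criterion for monologues can be illustrated with a minimal sketch. It assumes that each word already carries start and end times in seconds, and it uses a hypothetical threshold of 0.5 seconds; neither the input format nor the threshold comes from the guidelines cited above. A pause alone does not guarantee a consistent syntactic, semantic, pragmatic, and prosodic structure, so the output should be read as candidate utterances for a transcriber to check.

```python
def split_on_pauses(words, min_pause=0.5):
    """Group (word, start, end) tuples into candidate utterances.

    A new utterance starts whenever the silence between two consecutive
    words is at least min_pause seconds (an assumed, illustrative value).
    """
    utterances, current = [], []
    previous_end = None
    for word, start, end in words:
        if current and start - previous_end >= min_pause:
            utterances.append(current)
            current = []
        current.append(word)
        previous_end = end
    if current:
        utterances.append(current)
    return [" ".join(u) for u in utterances]


# Hypothetical time-aligned monologue fragment.
words = [
    ("the", 0.0, 0.2), ("train", 0.25, 0.6), ("leaves", 0.65, 1.0),
    ("at", 1.8, 1.9), ("nine", 1.95, 2.3),
]
candidates = split_on_pauses(words)
```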

Transcription of dialogues


When two (or more) persons are conversing, interruptions frequently occur (cf. Chapter 13). This is true for informal conversations between friends, for formal requests for information, for face-to-face situations, and for telephone conversations. These interruptions may be complete utterances, or they may be, for instance, an affirmative ``yes'' or ``mm''. These interruptions and the resulting simultaneous speech must be annotated in the transcription. In the case of a dialogue between two persons, it is possible to give a clear indication of simultaneous speech. For example, Switchboard uses the ``sharp'' symbol (``#'') at either side of each of the simultaneous segments to indicate that the two speakers in the telephone conversation speak at the same time:

A: # Right, bye #
B: # Bye bye #
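
Markers of this kind are easy to process automatically. The following is a minimal sketch, assuming one turn per line in the form ``speaker: text'' with overlapping stretches enclosed in ``#'' as in the example above; the function name and the returned tuple are illustrative and are not part of the Switchboard conventions.

```python
import re

# Matches one "#"-delimited segment, e.g. "# Right, bye #".
OVERLAP = re.compile(r"#\s*(.*?)\s*#")


def parse_turn(line):
    """Split 'A: text # overlap # text' into (speaker, plain_text, overlaps)."""
    speaker, _, text = line.partition(":")
    overlaps = OVERLAP.findall(text)
    # Remove the markers but keep the words they enclose.
    plain = re.sub(r"\s+", " ", text.replace("#", " ")).strip()
    return speaker.strip(), plain, overlaps


turns = [parse_turn(line) for line in ["A: # Right, bye #", "B: # Bye bye #"]]
```

Each parsed turn keeps the full wording of the turn and, separately, the stretches that were marked as simultaneous speech.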

When more than two speakers are conversing, however, it is not possible to indicate interruptions and simultaneous speech in a clear and simple way with such markers. For these cases a so-called ``score notation'' can be used. As in a musical score, each speaker is given a separate track, one above the other. The tracks must be synchronised with one another. A computer program known as ``syncwriter'', which runs on the Apple Macintosh, has been developed to handle this type of conversation.
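
The idea of synchronised tracks can be sketched in a few lines. The example below is not syncwriter and does not reproduce its format; it is a simplified illustration that assumes each segment is given as a speaker label, a start time in seconds, and the transcribed text, and that maps time to character columns with a hypothetical scale of ten columns per second.

```python
def render_score(segments, cols_per_second=10):
    """Lay out (speaker, start_seconds, text) segments, one track per speaker."""
    tracks = {}
    for speaker, start, text in sorted(segments, key=lambda s: s[1]):
        row = tracks.get(speaker, "")
        column = int(round(start * cols_per_second))
        if row and len(row) >= column:
            # Earlier text overran this start column: keep a single space,
            # at the cost of exact alignment for this segment.
            row += " "
        else:
            row = row.ljust(column)
        tracks[speaker] = row + text
    width = max(len(name) for name in tracks)
    return "\n".join(f"{name.ljust(width)} | {row}" for name, row in tracks.items())


# Hypothetical three-party fragment with overlapping speech.
segments = [
    ("A", 0.0, "so we meet on Monday"),
    ("B", 1.2, "mm"),
    ("C", 1.5, "Monday is fine"),
]
score = render_score(segments)
```

In the rendered score, text that starts at the same time in different tracks begins in the same column, so overlapping speech appears vertically aligned.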

It is also possible to collect dialogues in a way that avoids simultaneous speech. In part of the VERBMOBIL corpus, the dialogue partners are recorded separately. Each partner presses a button when about to speak, which activates the recording. The recordings are made in two rooms, separated by a glass screen so that the speakers can see each other. The speakers hear each other by means of headphones. Clearly this situation is not as natural as the case where both speakers are permitted to speak at the same time.

