Correcting errors in the production of text and speech

In spontaneous spoken language, the speaker's editing behaviour is audible and remains part of the recorded data. Interruptions, hesitations, repetitions of words (and parts of words), and especially self-repairs are characteristic features of naturally spoken language and must be represented in SL data collections of spontaneous speech. A writer, on the other hand, has even more options for correcting and editing while producing a text document, and will normally intend to produce a ``clean'' version. In the final version of the text, all corrections that may have been carried out have disappeared; this is especially true of text intended for print. In the recent past, SL data were often recorded as clean speech collections. A typical example is so-called laboratory speech, which is produced when a speaker sitting in a monitored recording room reads a list of prepared text material, and only the proper reproductions of the individual text items are accepted into the database. Examples of speech corpora collected in this way are EUROM-0 and EUROM-1 (see Appendix J). More recently, however, interest has shifted towards corpora comprising ``real-world'' speech, including hesitations, corrections, background noise, etc.

