Non-linguistic and other phenomena

An additional level of representation may be used for non-linguistic phenomena which occur while people are speaking. These include speaker noises such as coughing, laughter, and lip smacking, as well as extraneous noises such as dogs barking and doors slamming. This level can also be used to label information such as dysfluencies and filled pauses. The type of representation used for such annotations will depend on the purpose of the database. An annotation system such as that proposed by the Text Encoding Initiative (TEI) is very elaborate and makes heavy demands on a transcriber, but also makes it possible to derive all relevant information from a transcription. The TEI system makes use of SGML, which means that existing SGML software can be used; however, the transcriber faces a steep initial learning curve, which increases the likelihood of human error in the transcription. Other annotation systems (such as those used in ATIS and Switchboard) are less elaborate, but also easier for transcribers to learn. The conventions used in ATIS, Switchboard, POLYPHONE and the GRONINGEN corpus consist of different types of brackets with possible additional glosses. Retrieval software referring to these particular annotations must be designed in a more or less ad hoc way, which is less convenient than with the TEI system. However, it is possible to provide standard UNIX scripts for a speech corpus. It is important to find the correct balance between the sophistication of the annotation system and its practicality from the transcriber's point of view.
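As a rough illustration of the kind of ad hoc retrieval mentioned above, the following Python sketch pulls bracketed annotations out of transcription lines. The convention assumed here (any text in square brackets is an annotation) and the function names are hypothetical and chosen for illustration only; they are not the exact conventions of ATIS, Switchboard, POLYPHONE or the GRONINGEN corpus.

```python
import re
from collections import Counter

# Assumed convention (hypothetical): any text in square brackets is an annotation.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")

def bracket_annotations(line):
    """Return the labels of all square-bracketed annotations in one line."""
    return [label.strip() for label in BRACKETED.findall(line)]

def count_annotations(lines):
    """Count how often each annotation label occurs across many lines."""
    counts = Counter()
    for line in lines:
        counts.update(bracket_annotations(line))
    return counts

def strip_annotations(line):
    """Remove the annotations and normalise the remaining whitespace."""
    return " ".join(BRACKETED.sub(" ", line).split())
```

A corpus that uses several bracket types with glosses would need one such pattern per bracket type, which is exactly the ad hoc design effort described above.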

The types of phenomena which could conceivably be annotated on this level of representation are listed below.

    Words from the recording script which were omitted by the speaker may be indicated. In spontaneous speech, it is very difficult to know whether a speaker has omitted words that he or she actually intended to say, so omission is only relevant in the case of read speech.

    Words that are verbally deleted by the speaker may be indicated. Verbal deletions are words that are actually uttered, then (according to the transcriber) superseded by subsequent speech. The deletion can be explicit, as in ``Can you give me some information about the price, I mean, the place where I can find ...'' Alternatively, it can be implicit, as in ``Can you give me some information about the price, place where I can find ...''. Verbal deletions or self-repairs may be indicated in read as well as spontaneous speech.

    Word fragments comprise one or more sounds belonging to one word. For example, in ATIS word fragments are indicated by a hyphen, as in ``Please show fli-flights from Dallas''.

    Sometimes only part of a word is unintelligible, in which case only the intelligible part is transcribed orthographically. If a word is completely unintelligible, that fact is annotated on this level, for example by putting ``[unintelligible]'' in the text (ATIS) or by putting two stars ``**'' as in the SPEECHDAT corpora.

    Filled pauses (such as uh and mm) may be indicated. Some annotation conventions (e.g. POLYPHONE and Switchboard) annotate only one or two types of filled pause (uh and mm, or only uh). Other systems (e.g. ATIS and Speech Styles) annotate more than two types (e.g. uh, mm, um, er, ah). The types of filled pause vary across languages (for example, the British English er is not used in Dutch). The recommendation is to use at least two types: one vowel-like type uh, and one nasal type mm.

    Non-speech acoustic events can be made either by the speaker or by outside sources. The first category includes lip smacks, grunts, laughter, heavy breathing and coughing. The second category includes the noise of doors slamming, phones ringing, dogs barking, and all kinds of noises from other speakers. The Switchboard corpus uses a very extensive list of non-speech acoustic events, ranging from bird squawk to groaning and yawning. The recommendation is that these events are annotated at the correct location in the utterance, by first transcribing the words and then indicating which words are simultaneous with the acoustic events.

    For dialogues and interviews, words spoken simultaneously by two or more speakers may be indicated.

    Discourse analysis makes use of indications of different speaking turns and initiatives. While these are not generally used in speech technology, they can always be transcribed if required.
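Several of the conventions listed above can be recognised mechanically. The Python sketch below classifies the tokens of a transcribed utterance as words, word fragments, unintelligible markers or filled pauses. It is a minimal sketch under stated assumptions: the markers are the ones quoted in the text (a hyphen for fragments, ``[unintelligible]'' and ``**'' for unintelligible words, uh and mm for filled pauses), while the function names, the category labels and the heuristic for splitting a joined fragment such as fli-flights are hypothetical and belong to no particular corpus.

```python
# Marker sets taken from the conventions quoted in the text.
FILLED_PAUSES = {"uh", "mm"}                  # the two recommended types
UNINTELLIGIBLE = {"[unintelligible]", "**"}   # ATIS-style and SPEECHDAT-style

def classify(token):
    """Return (text, category) pairs for one whitespace-delimited token."""
    bare = token.strip(".,?!").lower()
    if not bare:
        return []  # the token consisted of punctuation only
    if bare in UNINTELLIGIBLE:
        return [(bare, "unintelligible")]
    if bare in FILLED_PAUSES:
        return [(bare, "filled_pause")]
    if bare.endswith("-"):
        return [(bare, "fragment")]
    # Assumed heuristic: in "fli-flights" the part before the hyphen is a
    # fragment if the following word restarts with the same letters.
    head, sep, rest = bare.partition("-")
    if sep and head and rest.startswith(head):
        return [(head + "-", "fragment"), (rest, "word")]
    return [(bare, "word")]

def annotate(utterance):
    """Classify every token of an utterance, in order."""
    result = []
    for token in utterance.split():
        result.extend(classify(token))
    return result
```

For instance, annotate("Please show fli-flights from Dallas") separates the fragment fli- from the word flights, while an ordinary hyphenated compound such as round-trip is left as a single word because its second part does not repeat the first.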

