
Non-linguistic and other phenomena

An additional level of representation may be used for non-linguistic phenomena which occur when people are speaking. This includes speaker noises such as coughing, laughter, and lip smacking, as well as extraneous noises such as the barking of dogs and the slamming of doors. In addition, this level can be used to label information such as dysfluencies and filled pauses. The type of representation used for such annotations will depend on the purpose of the database. An annotation system such as that proposed by the Text Encoding Initiative (TEI) is very elaborate and makes heavy demands on a transcriber, but it also makes it possible to derive all relevant information from a transcription. While the TEI system makes use of SGML, which guarantees that existing software can be used, there is a steep initial learning curve for the transcriber, which increases the possibility of human error in the transcription. Other annotation systems (such as those used in ATIS and Switchboard) are less elaborate, but also easier for transcribers to learn. The conventions used in ATIS, Switchboard, POLYPHONE and the GRONINGEN corpus consist of different types of brackets with optional additional glosses. Retrieval software referring to these particular annotations must be designed in a more or less ad hoc way, which is less convenient than the TEI system. However, it is possible to provide standard UNIX scripts for a speech corpus. It is important to strike the right balance between the sophistication of the annotation system and its practicality from the transcriber's point of view.
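As an illustration of the kind of ad hoc retrieval script mentioned above, the following Python sketch strips bracketed annotations from a transcription line to recover the plain orthographic word stream. The bracket characters and event labels used here are hypothetical, not the actual ATIS, Switchboard or POLYPHONE codes:

```python
import re

def strip_annotations(transcription: str) -> str:
    """Remove bracketed annotations and collapse whitespace,
    leaving only the plain orthographic words.

    Assumed (hypothetical) convention: acoustic events in
    [square brackets], transcriber glosses in {curly braces}.
    """
    text = re.sub(r"\[[^\]]*\]", " ", transcription)  # e.g. [door_slam]
    text = re.sub(r"\{[^}]*\}", " ", text)            # e.g. {laughing}
    return " ".join(text.split())

print(strip_annotations("show me [door_slam] flights {laughing} to Boston"))
# → show me flights to Boston
```

A corpus distribution could ship a handful of such scripts alongside the data, giving users TEI-like retrieval convenience without the TEI learning curve.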

The types of phenomena which could conceivably be annotated on this level of representation are listed below.

    Words from the recording script which were omitted by the speaker may be indicated. In spontaneous speech it is very difficult to know whether a speaker has omitted words which he actually intended to say, so omission is only relevant in the case of read speech.

    Words that are verbally deleted by the speaker may be indicated. Verbal deletions are words that are actually uttered, then (according to the transcriber) superseded by subsequent speech. This can be done explicitly, as in ``Can you give me some information about the price, I mean, the place where I can find ...''. Alternatively, it can be done implicitly, as in ``Can you give me some information about the price, place where I can find ...''. Verbal deletions or self-repairs may be indicated in read as well as spontaneous speech.

    Word fragments comprise one or more sounds belonging to one word. For example, in ATIS word fragments are indicated by a hyphen, as in ``Please show fli-flights from Dallas''.
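A hyphen marker of this kind is straightforward to process automatically. The sketch below collects the fragment portions from an utterance; it assumes hyphens are used only as fragment markers, so ordinary hyphenated words would misfire:

```python
def find_fragments(utterance: str) -> list:
    """Collect word fragments marked with a hyphen, ATIS-style
    (e.g. 'fli-flights' or a bare 'fli-').  Assumes hyphens occur
    only as fragment markers in the transcription."""
    fragments = []
    for token in utterance.split():
        head, hyphen, _rest = token.partition("-")
        if hyphen and head:
            fragments.append(head)
    return fragments

print(find_fragments("Please show fli-flights from Dallas"))
# → ['fli']
```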

    Sometimes only part of a word is unintelligible, in which case only the intelligible part is transcribed orthographically. If a word is completely unintelligible, that fact will be annotated on this level: for example, by putting ``[unintelligible]'' in the text (ATIS), or by putting two stars ``**'' as in SPEECHDAT corpora.

    Filled pauses (such as uh and mm) may be indicated. Some annotation conventions (e.g. POLYPHONE and Switchboard) annotate only one or two types of filled pause (uh and mm, or only uh). Other systems (e.g. ATIS and Speech Styles) annotate more than two types (e.g. uh, mm, um, er, ah). The types of filled pause vary across languages (for example, the British English er is not used in Dutch). The recommendation is to use at least two types: one vowel-like type uh, and one nasal type mm.
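The recommendation to reduce filled pauses to two canonical types can be applied as a post-processing step over an existing transcription. The variant-to-type mapping below is purely illustrative; the actual spellings depend on the language and on each corpus's conventions:

```python
# Hypothetical mapping of corpus-specific filled-pause spellings onto
# the two recommended types: a vowel-like "uh" and a nasal "mm".
FILLED_PAUSE_TYPES = {
    "uh": "uh", "um": "uh", "er": "uh", "ah": "uh",
    "mm": "mm", "hmm": "mm", "mhm": "mm",
}

def normalise_filled_pauses(tokens: list) -> list:
    """Replace known filled-pause variants by their canonical type;
    all other tokens pass through unchanged."""
    return [FILLED_PAUSE_TYPES.get(t.lower(), t) for t in tokens]

print(normalise_filled_pauses(["Er", "show", "um", "flights", "mm"]))
# → ['uh', 'show', 'uh', 'flights', 'mm']
```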

    Noises can be made either by the speaker or by outside sources. The first category includes lip smacks, grunts, laughter, heavy breathing and coughing. The second category includes the noise of doors slamming, phones ringing, dogs barking, and all kinds of noises from other speakers. The Switchboard corpus uses a very extensive list of non-speech acoustic events, ranging from bird squawk to groaning and yawning. The recommendation is that these events are annotated at the correct location in the utterance, by first transcribing the words and then indicating which words are simultaneous with the acoustic events.
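One way to follow this recommendation (transcribe the words first, then record which words each event overlaps) is to store events as labelled spans over word indices. The representation below is a hypothetical sketch, not a format used by any of the corpora named above:

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    """An orthographic word list plus acoustic events, each event
    tied to the inclusive span of word indices it overlaps
    (a hypothetical representation for illustration)."""
    words: list
    events: list = field(default_factory=list)  # (label, start, end)

    def words_during(self, label: str) -> list:
        """Return all words simultaneous with events of this label."""
        out = []
        for lab, start, end in self.events:
            if lab == label:
                out.extend(self.words[start:end + 1])
        return out

utt = Utterance(words="show me flights to Boston".split(),
                events=[("dog_bark", 2, 3)])
print(utt.words_during("dog_bark"))
# → ['flights', 'to']
```

Keeping events separate from the word stream means the plain orthography can always be recovered by ignoring the event list.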

    For dialogues and interviews, words spoken simultaneously by two or more speakers may be indicated.

    Discourse analysis makes use of indications of different speaking turns and initiatives. While these are not generally used in speech technology, it would always be possible to transcribe them.


EAGLES SWLG SoftEdition, May 1997.