Next: List of recommendations
Up: The levels and types
Previous: Prosodic transcription
An additional level of representation may be used for non-linguistic
phenomena that occur while people are speaking. These include
speaker noises such as coughing, laughter, and lip smacking, as well as
extraneous noises such as the barking of dogs and the slamming of doors. In
addition, this level can be used to label phenomena
such as dysfluencies and filled pauses. The type of representation used for
such annotations will depend on the purpose of the database. An annotation
system such as that proposed by the Text Encoding Initiative (TEI) is very
elaborate and makes heavy demands on the transcriber, but it also makes it
possible to derive all relevant information from a transcription. The TEI
system makes use of SGML, which guarantees that existing software can be used.
However, the transcriber faces a steep initial learning curve, which increases
the risk of human error in the transcription. Other annotation systems (such as
those used in ATIS and Switchboard) are less elaborate, but also easier for
transcribers to learn. The conventions used in ATIS, Switchboard, POLYPHONE and
the GRONINGEN corpus consist of different types of brackets, possibly with
additional glosses. Retrieval software that refers to these particular
annotations must be designed in a more or less ad hoc way, which is less
convenient than with the TEI system.
However, standard UNIX scripts for retrieving these annotations can be
provided with a speech corpus. It is important to find the correct balance
between the sophistication of the annotation system and the practicality of the
system from the transcriber's point of view.
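The kind of ad hoc retrieval mentioned above can be illustrated with a minimal sketch. It assumes that annotations are enclosed in square brackets, as with ``[unintelligible]'' in ATIS; the other labels in the sample data and the function names are hypothetical and do not reproduce any corpus's actual inventory.

```python
import re
from collections import Counter

# Assumption: every annotation is a label enclosed in square brackets.
BRACKET = re.compile(r"\[([^\[\]]+)\]")


def count_annotations(lines):
    """Count each bracketed label across an iterable of transcription lines."""
    counts = Counter()
    for line in lines:
        counts.update(label.strip().lower() for label in BRACKET.findall(line))
    return counts


def strip_annotations(line):
    """Remove bracketed labels and collapse the leftover whitespace."""
    return " ".join(BRACKET.sub(" ", line).split())


# Hypothetical sample transcriptions, for illustration only.
sample = [
    "please show [cough] flights from dallas",
    "i want [unintelligible] to boston [laughter]",
    "[laughter] yes",
]

if __name__ == "__main__":
    for label, n in count_annotations(sample).most_common():
        print(label, n)
    print(strip_annotations(sample[0]))
```

A script of this kind has to be adapted whenever the bracket conventions change, which is the practical cost of the simpler annotation systems.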
The types of phenomena which could conceivably be annotated on this level
of representation are listed below.
- OMISSIONS IN READ TEXT
Words from the recording script which were omitted by the speaker may be
indicated. In spontaneous speech, it is very difficult to know whether a
speaker has omitted words that they actually intended to say, and so
omission is only relevant in the case of read speech.
- VERBAL DELETIONS OR CORRECTIONS, IMPLICIT OR EXPLICIT
Words that are verbally deleted by the speaker may be indicated.
Verbal deletions are words that are actually uttered and then
(in the transcriber's judgement) superseded by subsequent speech. This can be
done explicitly, as in ``Can you give me some information about the price, I
mean, the place where I can find ...''. Alternatively, it can be done
implicitly, as in ``Can you give me some information about the price, place
where I can find ...''. Verbal deletions or self-repairs may be indicated in
read as well as spontaneous speech.
- WORD FRAGMENTS
Word fragments comprise one or more sounds belonging to one word. For example,
in ATIS word fragments are indicated by a hyphen, as in ``Please show
fli-flights from Dallas''.
- UNINTELLIGIBLE WORDS
Sometimes only part of a word is unintelligible, in which case only the
intelligible part is transcribed orthographically. If a word is completely
unintelligible, that fact is annotated on this level, for example by
putting ``[unintelligible]'' in the text (as in ATIS), or by putting
two stars ``**'' (as in the SPEECHDAT corpora).
- HESITATIONS AND FILLED PAUSES
Filled pauses (such as uh and mm) may be indicated. Some
annotation conventions (e.g. POLYPHONE and Switchboard) annotate only one or
two types of filled pause (uh and mm, or only uh). Other systems (e.g. ATIS
and Speech Styles) annotate more than two types (e.g. uh, mm, um, er, ah).
The types of filled pause vary across languages (for example, the British
English er is not used in Dutch). The recommendation is to use at least two
types: one vowel-like type (uh) and one nasal type (mm).
- NON-SPEECH ACOUSTIC EVENTS
These can be made either by the speaker or by outside sources. The first
category includes lip smacks, grunts, laughter, heavy breathing and coughing.
The second category includes the noise of doors slamming, phones
ringing, dogs barking, and all kinds of noises from other speakers.
The Switchboard corpus uses a very extensive list of non-speech acoustic events,
ranging from bird squawk to groaning and yawning. The recommendation is to
annotate these events at the correct location in the utterance, by first
transcribing the words and then indicating which words are simultaneous with the
acoustic events.
- SIMULTANEOUS SPEECH
For dialogues and interviews, words spoken simultaneously by two or more
speakers may be indicated.
- SPEAKING TURNS
Discourse analysis makes use of indications of different speaking turns and
initiatives. While these are not generally used in speech technology, they can
always be transcribed if required.
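Several of the conventions listed above can be recognised mechanically. The following minimal sketch labels whitespace-separated tokens using the markers cited in this section: a hyphen for word fragments (ATIS), ``[unintelligible]'' (ATIS) or ``**'' (SPEECHDAT) for unintelligible words, and the two recommended filled pause types uh and mm. The function names and label strings are hypothetical, and the rule for splitting a fragment that is attached to its completed word (as in fli-flights) is an assumed heuristic, not part of any of these conventions.

```python
# Markers taken from the conventions cited in this section.
FILLED_PAUSES = {"uh", "mm"}                 # the two recommended types
UNINTELLIGIBLE = {"[unintelligible]", "**"}  # ATIS and SPEECHDAT markers


def split_fragment(token):
    """Split 'fli-flights' into ['fli-', 'flights'].

    Assumed heuristic: the part before the first hyphen counts as a word
    fragment only if the part after it starts with the same letters, so
    ordinary hyphenated words such as 'twenty-one' are left intact.
    """
    head, sep, tail = token.partition("-")
    if sep and head and tail and tail.lower().startswith(head.lower()):
        return [head + "-", tail]
    return [token]


def classify(token):
    """Return a hypothetical label for a single token."""
    if token in UNINTELLIGIBLE:
        return "unintelligible"
    if token.lower() in FILLED_PAUSES:
        return "filled_pause"
    if len(token) > 1 and token.endswith("-"):
        return "fragment"
    return "word"


def label_utterance(utterance):
    """Label every whitespace-separated token of an utterance."""
    labelled = []
    for raw in utterance.split():
        for token in split_fragment(raw):
            labelled.append((token, classify(token)))
    return labelled


if __name__ == "__main__":
    for token, label in label_utterance("uh please show fli-flights from ** Dallas"):
        print(token, label)
```

Because the inventory of filled pauses varies across languages, the set of filled pause types would need to be adjusted for each corpus.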
EAGLES SWLG SoftEdition, May 1997.