An additional level of representation may be used for non-linguistic
phenomena which occur when people are speaking.  These include
speaker noises such as coughing, laughter, and lip smacking, as well as
extraneous noises such as the barking of dogs and the slamming of doors.  In
addition, this level can also be used to label information
such as dysfluencies and filled pauses.  The type of representation used for
such annotations will depend on the purpose of the database.  An annotation
system such as that proposed by the Text Encoding Initiative (TEI) is very elaborate
and makes heavy demands on a transcriber, but also makes it possible to derive
all relevant information from a transcription.  While the TEI system makes use
of SGML, which guarantees that existing software can be used, there is a steep
initial learning curve for the transcriber, which increases the possibility of
human error in the transcription.  Other annotation systems (such as those used
in ATIS and Switchboard) are less elaborate, but also easier for transcribers
to learn.  The conventions used in ATIS, Switchboard, POLYPHONE and the
GRONINGEN corpus consist of different types of brackets with possible
additional glosses.  Retrieval software referring to these particular
annotations must be designed in a more or less ad hoc way, which is less
convenient than the TEI system.
However, it is possible to provide standard UNIX
scripts for a speech corpus.  It is important to find the correct balance
between the sophistication of the annotation system and the practicality of the
system from the transcriber's point of view.
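
As a rough illustration of the kind of ad hoc retrieval software mentioned above, the following sketch pulls bracketed annotations out of a single transcription line.  The square-bracket syntax, the sample labels and the function name `extract_annotations` are assumptions made for illustration only; they do not reproduce the conventions of any particular corpus.

```python
import re

# Hypothetical convention: an annotation is any text enclosed in square brackets.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]")


def extract_annotations(line):
    """Return (annotations, plain_text) for one transcription line."""
    annotations = ANNOTATION.findall(line)
    # Remove the annotations and collapse the whitespace they leave behind.
    plain_text = " ".join(ANNOTATION.sub(" ", line).split())
    return annotations, plain_text


if __name__ == "__main__":
    labels, text = extract_annotations(
        "show me [cough] flights to [door_slam] Boston"
    )
    print(labels)
    print(text)
```

A script of this kind has to be adapted for each bracket convention, which is the practical cost of the simpler annotation systems.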
The types of phenomena which could conceivably be annotated on this level
of representation are listed below.
- OMISSIONS IN READ TEXT

Words from the recording script which were omitted by the speaker may be
indicated.  In spontaneous speech, it is very difficult to know whether a
speaker has omitted words which he or she actually intended to say, and so
omission is only relevant in the case of read speech.
- VERBAL DELETIONS OR CORRECTIONS, IMPLICIT OR EXPLICIT

Words that are verbally deleted by the speaker may be indicated.
Verbal deletions are words that are actually uttered, then
(according to the transcriber) superseded by subsequent speech.  This can be
done explicitly, as in ``Can you give me some information about the price, I
mean, the place where I can find ...''  Alternatively, it can be done implicitly,
as in ``Can you give me some information about the price, place where I can
find ...''  Verbal deletions or self-repairs may be indicated in
read as well as spontaneous speech.
- WORD FRAGMENTS

Word fragments comprise one or more sounds belonging to one word.  For example,
in ATIS word fragments are indicated by a hyphen, as in ``Please show
fli-flights from Dallas''.
- UNINTELLIGIBLE WORDS

Sometimes only part of a word is unintelligible, in which case only the
intelligible part is transcribed orthographically.  If a word is completely
unintelligible, that fact is annotated on this level, for example by
putting ``[unintelligible]'' in the text (as in ATIS), or by putting
two stars ``**'' (as in the SPEECHDAT corpora).
- HESITATIONS AND FILLED PAUSES

Filled pauses (such as ``uh'' and ``mm'') may be indicated.  Some
annotation conventions (e.g. POLYPHONE and Switchboard) annotate only one
or two types of filled pause (``uh'' and ``mm'', or only ``uh'').  Other
systems (e.g. ATIS and Speech Styles) annotate more than two types
(e.g. ``uh'', ``mm'', ``um'', ``er'', ``ah'').  The types of
filled pause vary across languages (for example, the British English ``er'' is
not used in Dutch).  The recommendation is to use at least two types: one
vowel-like type ``uh'', and one nasal type ``mm''.
- NON-SPEECH ACOUSTIC EVENTS

These can be made either by the speaker or by outside sources.  The first
category includes lip smacks, grunts, laughter, heavy breathing and coughing.
The second category includes the noise of doors slamming, phones
ringing, dogs barking, and all kinds of noises from other speakers.
The Switchboard corpus uses a very extensive list of non-speech acoustic events,
ranging from bird squawk to groaning and yawning.  The recommendation is that
these events be annotated at the correct location in the utterance, by first
transcribing the words and then indicating which words are simultaneous with the
acoustic events.
- SIMULTANEOUS SPEECH

For dialogues and interviews, words spoken simultaneously by two or more
speakers may be indicated.
- SPEAKING TURNS

Discourse analysis makes use of indications of different speaking turns and
initiatives.  Although these are not generally used in speech technology, they
can always be transcribed.
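
Several of the phenomena listed above can be recognised mechanically once a convention has been fixed.  The sketch below labels the tokens of one transcribed line using the markers mentioned in the text: a hyphen for word fragments, ``[unintelligible]'' or ``**'' for unintelligible words, and the two recommended filled-pause types.  The category names, the function names and the fragment heuristic are assumptions made for illustration; a real corpus would define these in its own transcription guidelines.

```python
# Markers taken from the text: "[unintelligible]" (ATIS) and "**" (SPEECHDAT).
UNINTELLIGIBLE = {"[unintelligible]", "**"}
# The two recommended filled-pause types.
FILLED_PAUSES = {"uh", "mm"}


def classify_token(token):
    """Label one whitespace-delimited token (category names are hypothetical)."""
    if token in UNINTELLIGIBLE:
        return "unintelligible"
    if token.lower() in FILLED_PAUSES:
        return "filled_pause"
    head, hyphen, tail = token.partition("-")
    # Heuristic (an assumption): a fragment either stops at the hyphen ("fli-")
    # or is restarted by the word that follows it ("fli-flights").
    if hyphen and head and (not tail or tail.lower().startswith(head.lower())):
        return "fragment"
    return "word"


def classify_line(line):
    """Return a list of (token, label) pairs for one transcription line."""
    return [(token, classify_token(token)) for token in line.split()]


if __name__ == "__main__":
    for token, label in classify_line("please show fli-flights from uh Dallas **"):
        print(f"{token}\t{label}")
```

The prefix test keeps ordinary hyphenated words such as "well-known" from being counted as fragments, but it is only a heuristic and will miss fragments that are not followed by the completed word.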
 
   
 
 
 
 
 
EAGLES SWLG SoftEdition, May 1997.