In correctly written texts, any morphologically inflected lexical item
generally has just one distinct orthographic form. The words of European
languages are therefore easily identified and well distinguished from each
other, and each contextual form of a given word usually has only one
orthographic version.
By contrast, the spoken versions of orthographically identical word forms
show great phonetic variation in their segmental and prosodic realisation.
In most European languages the phonetic form of a given word is in fact
extremely variable, depending on the context and on other well-defined
intervening variables such as speaking style and context of situation,
strong and weak Lombard effects (the influence of the physical environment
on speech production via acoustic feedback), etc. A given word can
disappear phonetically altogether, or can be reduced to, and signalled only
by, some reflection of segmental features in the prosody of the utterance.
Most of these inconspicuous variations appear only in a narrow phonetic
transcription of a given pronunciation.
It makes a great difference whether a word has been uttered in
isolation or in continuous speech. Only if a word is consciously and very
carefully produced in isolation can we observe the explicit version of its
segmental structure. These phonetically explicit forms produced in a
careful speaking style are called citation forms or canonical forms.
The segmental structure of a so-called citation form is modified as soon as
the word is integrated into connected speech (probably systematically,
although relatively little of the system is currently understood). This is
highly relevant to the design of spoken language corpora. It has also been
taken into account in the conventions of the IPA proposed for Computer
Representation of Individual Languages (CRIL, see Appendix A).
In dealing with SL data one must know which words the speaker
intended to express in a given utterance. This is reflected in the
CRIL convention of the IPA (see Section 5.2.4).
Here it should be mentioned that an SL data collection should ideally have at
least two and possibly three different symbolically specified levels which are related to the
acoustic speech signal:
Detailed phonetic transcriptions are subject to intra- and inter-transcriber
variability. Furthermore, they are extremely expensive, to the extent that
their cost is likely to be prohibitive for large corpora. However, recent
attempts using large-vocabulary speech recognisers for the acoustic decoding
of speech show some promise that the process can be automated, at least to
the extent that pronunciation variation can be predicted by means of general
phonological and phonetic rules.
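The idea of predicting pronunciation variation by rule can be illustrated
with a minimal sketch. It assumes that a citation form is written as a
string of single-character, SAMPA-like symbols and that each rule is an
optional, ordered rewrite. The three rules, the example word and the
function name `variants` are illustrative assumptions, not rules taken
from this text or from any particular recogniser.

```python
import re

# Illustrative, hypothetical rules: (description, regex pattern, replacement).
# Symbols are SAMPA-like: "{" stands for the vowel of English "cat",
# "@" for schwa. Real rule sets would be language-specific and far larger.
RULES = [
    ("vowel reduction to schwa", r"\{", "@"),
    ("word-final d deletion after n", r"nd$", "n"),
    ("schwa deletion leaving a syllabic nasal", r"^@n$", "n"),
]


def variants(citation, rules=RULES):
    """Return every form reachable from the citation form when each rule,
    taken in order, may either apply or not apply."""
    forms = {citation}
    for _description, pattern, replacement in rules:
        # Keep the unmodified forms and add the rewritten ones,
        # so that every rule is treated as optional.
        forms |= {re.sub(pattern, replacement, form) for form in forms}
    return sorted(forms)


if __name__ == "__main__":
    # Citation form of English "and" in the assumed notation.
    for form in variants("{nd"):
        print(form)
```

A candidate list of this kind could then be matched against the acoustic
signal by a recogniser, leaving the choice among variants to the decoder
rather than to a human transcriber.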
In addition to phonetic detail on the segmental level, several uses of spoken
language corpora may also require prosodic annotation. In this area much work
remains to be done to develop commonly agreed annotation systems. Once such
systems exist, one may attempt to support annotation by means of automatic
recognition procedures.