In correctly written texts, any morphologically inflected lexical item
generally has just one distinct orthographic form. The words of European
languages are therefore easily identified and well distinguished from each
other, and each contextual form of a given word usually has only one
orthographic version.
The spoken versions of orthographically identical word forms, by contrast,
show great phonetic variation in their segmental and prosodic realisation.
In most European languages the phonetic form of a given word is in fact
extremely variable, depending on the context and on other well-defined
intervening variables such as speaking style, context of situation, strong
and weak Lombard effects (the influence of the physical environment on
speech production via acoustic feedback), etc. A given word can disappear
phonetically altogether, or can be reduced to, and signalled only by, some
reflection of its segmental features in the prosody of the utterance. Most
of these inconspicuous variations appear only in a narrow phonetic
transcription of a given pronunciation.
It makes a great difference whether a word has been uttered in isolation or in continuous speech. Only if a word is consciously and very carefully produced in isolation can we observe the explicit version of its segmental structure. These phonetically explicit forms, produced in a careful speaking style, are called citation forms or canonical forms. The segmental structure of a citation form is modified as soon as the word is integrated into connected speech (probably systematically, although relatively little of the system is currently understood). This is highly relevant for the design of spoken language corpora. It has also been taken into account in the conventions of the IPA proposed for Computer Representation of Individual Languages (CRIL, see Appendix A).
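The relation between one orthographic form, its citation form and the several realisations found in connected speech can be pictured as a small data structure. The following Python sketch is only an illustration: the class name LexicalEntry, its fields and its methods are hypothetical, the phone symbols are written in a SAMPA-like notation, and the example realisations of English "and" are textbook weak forms chosen for illustration, not data from any corpus or part of the CRIL conventions.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Phones = Tuple[str, ...]  # a sequence of phone symbols (SAMPA-like, illustrative)


@dataclass
class LexicalEntry:
    """One orthographic word form with its citation form and observed variants."""
    orthography: str
    citation: Phones                      # careful pronunciation in isolation
    connected: List[Phones] = field(default_factory=list)  # forms seen in running speech

    def add_realisation(self, phones: List[str]) -> None:
        """Record one pronunciation observed in connected speech."""
        self.connected.append(tuple(phones))

    def distinct_forms(self) -> Set[Phones]:
        """All different phonetic forms attested for this single orthographic form."""
        return {self.citation, *self.connected}


# Illustrative entry: one spelling, several phonetic forms.
entry = LexicalEntry("and", ("{", "n", "d"))
entry.add_realisation(["@", "n", "d"])   # vowel reduced
entry.add_realisation(["@", "n"])        # final consonant dropped
entry.add_realisation(["n="])            # reduced to a syllabic nasal
entry.add_realisation(["@", "n"])        # the same variant observed again

print(entry.orthography, len(entry.distinct_forms()))
```

Keeping the citation form separate from the observed realisations makes it explicit that the orthographic level stays constant while the phonetic level varies.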
In dealing with SL data one must know which words the speaker intended to express in a given utterance. This is reflected in the CRIL convention of the IPA (see Section 5.2.4). Here it should be mentioned that an SL data collection should ideally have at least two, and possibly three, different symbolically specified levels which are related to the acoustic speech signal.
Detailed phonetic transcriptions are subject to intra- and inter-transcriber
variability. Furthermore, they are so expensive to produce that their cost
is likely to be prohibitive for large corpora. However, recent attempts to
use large-vocabulary speech recognisers for the acoustic decoding of speech
show some promise that the process can be automated, at least to the extent that
pronunciation variation can be predicted by means of general phonological and
In addition to phonetic detail on the segmental level, several uses of spoken language corpora may also require prosodic annotation. In this area much work remains to be done to develop commonly agreed annotation systems. Once such systems exist, one may attempt to support annotation by means of automatic recognition procedures.
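The intra- and inter-transcriber variability noted for detailed phonetic transcriptions can be given a rough numerical form. One common approach, used here as an assumption and not as a method prescribed by this text, is to compute the edit (Levenshtein) distance between two transcriptions of the same utterance and normalise it by the length of the longer one. The function names edit_distance and disagreement_rate are hypothetical, and the symbol sequences are illustrative only.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of symbol insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                        # deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def disagreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalised by the longer transcription (0.0 means identical)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Two hypothetical transcribers labelling the same token (SAMPA-like symbols).
transcriber_a = ["@", "n", "d"]
transcriber_b = ["@", "n"]
print(edit_distance(transcriber_a, transcriber_b))
print(disagreement_rate(transcriber_a, transcriber_b))
```

A measure of this kind treats every symbol mismatch as equally serious. Whether a substitution between two phonetically close symbols should count less is a design decision that this sketch leaves open.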