 
  
  
  
  
 
In correctly written texts, any morphologically inflected lexical item
generally has just one distinct orthographic form. The words of European
languages are therefore easily identified and well distinguished from each
other, and there is usually only one version of each possible orthographic
contextual form of any given word.
The spoken versions of orthographically identical word forms, by contrast,
show great phonetic variation in their segmental and prosodic realisation.
In most European languages the phonetic form of a given word is in fact
extremely variable, depending on the context and on other well-defined
intervening variables such as speaking style, context of situation, strong
and weak Lombard effects (the influence of the physical environment on
speech production via acoustic feedback), etc. A given word can disappear
phonetically altogether, or can be reduced to, and signalled only by, some
reflection of segmental features in the prosody of the utterance. Most of
these inconspicuous variations appear only in a narrow phonetic
transcription of a given pronunciation.
 
It makes a great difference whether a word has been uttered in isolation or
in continuous speech. Only if a word is consciously and very carefully
produced in isolation can we observe the explicit version of its segmental
structure. These phonetically explicit forms, produced in a careful
speaking style, are called citation forms or canonical forms.
The segmental structure of a citation form is modified as soon as the word
is integrated into connected speech (probably systematically, although
relatively little of the system is currently understood). This is highly
relevant for the design of spoken language corpora. It has also been taken
into account in the conventions of the IPA proposed for Computer
Representation of Individual Languages (CRIL, see Appendix A).
 
When dealing with SL data, one must know which words the speaker intended
to express in a given utterance. This is reflected in the CRIL convention
of the IPA (see Section 5.2.4). An SL data collection should ideally have
at least two, and possibly three, different symbolically specified levels
which are related to the acoustic speech signal:
Detailed phonetic transcriptions are subject to intra- and inter-transcriber
variability. Furthermore, they are extremely expensive, to the extent that
their cost is likely to be prohibitive for large corpora. However, recent
attempts to use large-vocabulary speech recognisers for the acoustic
decoding of speech show some promise that the process can be automated, at
least to the extent that pronunciation variation can be predicted by means
of general phonological and phonetic rules.
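
The idea of predicting pronunciation variation from a citation form by
general rules can be illustrated with a minimal sketch. Everything in it is
an assumption made for illustration only: the two rewrite rules, the
SAMPA-like space-separated phone notation, the example word and the function
name predict_variants are hypothetical and are not taken from the CRIL
conventions or from any particular recogniser.

```python
import re

# Hypothetical rules over space-separated, SAMPA-like phone strings.
# Each rule is (name, regex pattern, replacement). Illustrative only.
RULES = [
    # word-final schwa + /n/ is reduced to a syllabic nasal
    ("schwa_deletion", r"@ n$", "n="),
    # a syllabic /n/ after /b/ assimilates to the labial place
    ("place_assimilation", r"b n=$", "b m="),
]


def predict_variants(citation):
    """Return the citation form followed by every variant reachable
    by applying the rules above one or more times."""
    forms = [citation]
    index = 0
    while index < len(forms):
        current = forms[index]
        for _name, pattern, replacement in RULES:
            candidate = re.sub(pattern, replacement, current)
            # keep only forms that are new
            if candidate != current and candidate not in forms:
                forms.append(candidate)
        index += 1
    return forms


if __name__ == "__main__":
    # assumed citation form of a two-syllable word ending in schwa + /n/
    for form in predict_variants("h a: b @ n"):
        print(form)
```

In such a scheme the predicted variants could serve as candidate
pronunciations when a recogniser aligns an orthographic transcript with the
signal; a human transcriber would then only need to check the selected
variant rather than transcribe from scratch.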
 
In addition to phonetic detail on the segmental level, several uses of spoken
language corpora may also require prosodic annotation. In this area much work
remains to be done to develop commonly agreed annotation systems. Once such
systems exist, one may attempt to support annotation by means of automatic
recognition procedures.
 
 
 
  
  
  
 