It is possible to mark some prosodic information even in orthographic transcriptions, such as lengthening of sounds, pauses in words and utterances, emphatic stress, and intonational boundaries. Examples are the ATIS and Switchboard corpora, and the Dutch Speech Styles Corpus. If more detail than this is required, however, it is necessary to undertake a full prosodic transcription. For an overview of some existing prosodic transcription approaches, see [Llisterri (1994)]. The following will describe prosodic transcription in general terms.
The above discussion has been in terms of segmental labelling only. It is also possible to annotate a speech database at the prosodic (suprasegmental) level. This is less straightforward than segmental annotation, as there are far fewer clear acoustic cues to prosodic phenomena. The F0 curve will be the relevant acoustic display, possibly augmented by the intensity curve. The waveform is a useful guide to the current location in the speech and is usually displayed together with the F0 curve (as in the WAVES labelling software).
The units segmented will depend on the particular theoretical bias underlying the given research programme. A basic distinction may be drawn between a prosodic labelling system that annotates the boundaries of units (analogous to the method used in segmental annotation) and a system that annotates the occurrence of isolated prosodic events, such as F0 peaks.
The first type of method may possibly use the intonational categories proposed by [Nespor & Vogel (1986)], such as intonational phrase, phonological phrase, phonological word, foot, and syllable. Alternatively, it could mark the more traditional units of ``minor tone-unit'' and ``major tone-unit'', as in the MARSEC database [Roach et al. (1993)]. Or again, it could annotate the perceptual phonetic categories used in the ``Dutch school'' of intonation studies, such as rises and falls that are early, late or very late in their timing, fast or slow in their rate of change, and full or half sized ['t Hart et al. (1990)]. This type of annotation could be used in conjunction with annotation at the morphosyntactic level to yield information about the relationship between the syntactic and prosodic levels in terms of duration, pauses, etc.
The second type of method, though it may refer to the units mentioned above in its underlying theory, does not in fact annotate them but rather marks the occurrence of high and low tones of various kinds. The recently formulated ToBI transcription system [Silverman et al. (1992)] is the best-known system of this kind for English, where the prosodic units are annotated at the ``break index'' level rather than the ``tone'' level. (For an account of prosodic labelling for German see [Reyelt et al. (1996)].) Other systems, such as SAMPROSA (see Appendix B), have also been proposed.
Prosodic annotation has only recently come into favour in the field of speech and language technology research. Now that a basic level of competence has been achieved as regards the synthesis and recognition of speech segments, researchers have come to realise that much more work is required on the prosodic aspect of speech technology. This is the motivation for the growth in popularity of speech database research, and for the formulation of the ToBI prosodic transcription system. In order for the prosodic transcriptions of various different speech databases to be comparable, and in order to make the best use of existing resources, the originators of ToBI (Silverman et al., op. cit.) proposed a simple system that would be easy to learn and that would lead to good inter-transcriber consistency. To date it has largely been used for English, especially American English, but at least in principle it could be extended to other languages as well. The system has certain severe limitations (e.g. it has no way of representing pitch range) but its minimalist formulation was dictated by the need for learnability and consistency in use. The ``British school'' type of system used in the MARSEC database of British English [Roach et al. (1993)] contains more phonetic detail but may require more effort in teaching to novice transcribers. The ``IPO'' classification of F0 patterns ['t Hart et al. (1990)] has not yet been used systematically in the annotation of large-scale publicly-available speech corpora, but has been used successfully in the development of speech synthesisers.
Prosodic transcription also has obvious uses in basic linguistic research, especially since research into the suprasegmental aspects of language is not nearly as advanced as research into the segmental aspects. As indicated above, a database annotated at the prosodic and morphosyntactic levels can provide information on the relationship between them with respect to duration and pauses. If the segmental level is also annotated, then many possibilities open up for the study of segmental duration in prosodic contexts. This is especially true in the case of languages other than English, where these aspects have received comparatively little attention to date.
The concept of levels of prosodic labelling applies differently to the two different approaches to prosodic labelling outlined above. In the first case, the obvious categories would be those proposed by Nespor and Vogel (op. cit.), comprising levels of non-overlapping units each of which corresponds to one or more units on the level immediately below (e.g. phonological phrase, foot, syllable). In the second case, the separate levels have no such intrinsic relationship to one another, but merely deal with different types of phenomena. For example, in the ToBI system, there are separate levels for tones and inter-word ``break indices''. The ToBI system is described below in terms of its separate levels. The MARSEC system will be outlined after that. The ``Dutch school'' system of IPO will not be described in much detail, as it has not yet been used for annotation of publicly-available speech corpora: however, extensive references are available in 't Hart et al. (op. cit.).
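The layering property of the first approach (each unit corresponds to one or more units on the level immediately below) can be checked mechanically once the units are time-aligned. The following is a minimal sketch only: the `(start, end, label)` interval representation, the function name `is_strictly_layered` and the tolerance value are assumptions made for illustration and are not part of any of the systems discussed here.

```python
from typing import List, Tuple

# Hypothetical representation: (start time in s, end time in s, label)
Interval = Tuple[float, float, str]


def is_strictly_layered(upper: List[Interval],
                        lower: List[Interval],
                        tol: float = 1e-6) -> bool:
    """Return True if every upper-level unit is exactly tiled by one or
    more contiguous lower-level units (no gaps, no straddling units)."""
    for u_start, u_end, _ in upper:
        # Lower-level units whose start falls inside this upper unit.
        inside = sorted(iv for iv in lower
                        if u_start - tol <= iv[0] < u_end - tol)
        if not inside:
            return False
        # The first unit must begin, and the last must end, with the parent.
        if abs(inside[0][0] - u_start) > tol:
            return False
        if abs(inside[-1][1] - u_end) > tol:
            return False
        # Consecutive lower-level units must be contiguous.
        for (_, prev_end, _), (next_start, _, _) in zip(inside, inside[1:]):
            if abs(prev_end - next_start) > tol:
                return False
    return True


# Invented example: two feet, three syllables.
feet = [(0.0, 0.5, "F1"), (0.5, 1.0, "F2")]
syllables = [(0.0, 0.2, "s1"), (0.2, 0.5, "s2"), (0.5, 1.0, "s3")]
print(is_strictly_layered(feet, syllables))
```

A check of this kind is meaningful only for boundary-based annotation. In an event-based system such as ToBI the levels are independent of one another, so no such constraint holds between them.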
A recent experiment [Pitrelli et al. (1994)] used several prosodic transcribers working independently on the same speech data, comprising both read and spontaneous American English speech. The ToBI system was used, and a high level of consistency across transcribers was found, even given the fact that transcribers included both experts and newly-trained users of the system. This suggests that the system achieves its object of being easy to learn and to apply consistently, at least in the case of American English.
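One simple way to quantify consistency across several independent transcribers is to count, for every pair of transcribers and every word, whether the two labels are identical. The sketch below illustrates that idea only. It is not claimed to be the exact measure used in the experiment cited above, and the function name and the example labels are invented.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(labels: Dict[str, List[str]]) -> float:
    """Proportion of (transcriber pair, word) comparisons in which both
    transcribers assigned the same label."""
    agree = 0
    total = 0
    for a, b in combinations(sorted(labels), 2):
        if len(labels[a]) != len(labels[b]):
            raise ValueError("all transcribers must label the same words")
        for x, y in zip(labels[a], labels[b]):
            agree += int(x == y)
            total += 1
    if total == 0:
        raise ValueError("need at least two transcribers and one word")
    return agree / total


# Invented break-index labels for a four-word utterance.
example = {
    "t1": ["1", "1", "4", "1"],
    "t2": ["1", "1", "4", "1"],
    "t3": ["1", "3", "4", "1"],
}
print(round(pairwise_agreement(example), 3))
```

Raw agreement of this kind does not correct for agreement expected by chance. A chance-corrected statistic may be preferable when one label is far more frequent than the others.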
The ``orthographic'' level of the ToBI system contains the orthographic words of the utterance (sometimes only partial words in the case of spontaneous speech). It is also possible to represent filled pauses (e.g. ``um'', ``er'') at this level.
The ``miscellaneous'' level may be used to mark the duration of such phenomena as silence, audible breaths, laughter and dysfluencies. There is no exhaustive list of categories for this level, and different transcription projects may make their own decisions as to what to annotate.
The ``break index'' level is used to mark break indices, which are numbers representing the strength of the boundary between two orthographic words. The number 0 represents no boundary (with phonetic evidence of cliticisation, e.g. resyllabification of a consonant), and 4 represents a full intonation phrase boundary (usually ``end of sentence'' in read speech), defined by the occurrence of a final boundary tone after the last phrase tone. The number 3 represents an intermediate phrase boundary, defined by the occurrence of a phrase tone after the last pitch accent, while the number 1 represents most phrase-medial word boundaries. The number 2 represents either a strong disjuncture with pause but no tonal discontinuity, or a disjuncture that is weaker than expected at a tonally-signalled full intonation or intermediate phrase boundary.
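The five break index values can be held in a small lookup table, and a sequence of word-final break indices can then be used to group words into phrases. The following is an illustrative sketch: the descriptions paraphrase the definitions above, while the function name `split_into_phrases`, the convention of storing one index after each word, and the example utterance are assumptions made for this example.

```python
from typing import Dict, List

# Paraphrase of the break index definitions given in the text.
BREAK_INDEX: Dict[int, str] = {
    0: "no boundary (cliticisation)",
    1: "ordinary phrase-medial word boundary",
    2: "mismatch between disjuncture and tonal marking",
    3: "intermediate phrase boundary",
    4: "full intonation phrase boundary",
}


def split_into_phrases(words: List[str],
                       breaks: List[int],
                       min_index: int = 4) -> List[List[str]]:
    """Group words into phrases, closing a phrase after every word whose
    following break index is at least `min_index`."""
    if len(words) != len(breaks):
        raise ValueError("one break index is required after each word")
    phrases: List[List[str]] = []
    current: List[str] = []
    for word, index in zip(words, breaks):
        if index not in BREAK_INDEX:
            raise ValueError(f"invalid break index: {index}")
        current.append(word)
        if index >= min_index:
            phrases.append(current)
            current = []
    if current:  # words left over after the last strong boundary
        phrases.append(current)
    return phrases


# Invented utterance with invented break indices.
words = ["we", "could", "go", "on", "monday", "or", "tuesday"]
breaks = [1, 1, 1, 1, 3, 1, 4]
print(split_into_phrases(words, breaks))               # intonation phrases
print(split_into_phrases(words, breaks, min_index=3))  # intermediate phrases
```

With `min_index=4` the whole utterance forms one intonation phrase. Lowering the threshold to 3 also splits it at the intermediate phrase boundary after ``monday''.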
The ``tone'' level is used to mark the occurrence of phonological tones at appropriate points in the F0 contour. The basic tones are ``L'' or ``H'' (for ``low'' and ``high''), but these may function as pitch accents, phrase accents or boundary tones, depending on their location in the prosodic unit. In the case of pitch accents (which occur on accented syllables), there may be one or two tones, and the H tone may or may not be ``downstepped''.
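The three functions of a tone can be told apart from the written form of its label. The sketch below assumes the diacritics of the published ToBI conventions for American English, which are not given in the text above: ``*'' marks the tone aligned with the accented syllable, ``-'' a phrase accent, ``%'' a boundary tone, and ``!'' a downstepped H. The function name `classify_tone` and its return format are invented for illustration.

```python
import re
from typing import Tuple

# A single tone: H (optionally downstepped with "!") or L.
_T = r"(?:!?H|L)"

# Assumed label shapes, following the usual ToBI diacritics.
_PATTERNS = [
    ("pitch accent", re.compile(rf"{_T}\*|{_T}\+{_T}\*|{_T}\*\+{_T}")),
    ("phrase accent", re.compile(rf"{_T}-")),
    ("boundary tone", re.compile(r"[HL]%|%H")),
]


def classify_tone(label: str) -> Tuple[str, int, bool]:
    """Return (function, number of tones, downstepped?) for one label."""
    for kind, pattern in _PATTERNS:
        if pattern.fullmatch(label):
            n_tones = len(re.findall(r"[HL]", label))
            return kind, n_tones, "!" in label
    raise ValueError(f"unrecognised tone label: {label!r}")


for label in ["H*", "L+H*", "!H*", "L-", "H%"]:
    print(label, classify_tone(label))
```

Combined labels written as a single string (a phrase accent immediately followed by a boundary tone) are deliberately rejected here and would need to be split before classification.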
Information about the ToBI system and guidelines for transcribing are available on the Internet.
The MARSEC project [Roach et al. (1993)] is based on the Spoken English Corpus [Knowles et al. (1995)], a corpus of British English that at the time was not time-aligned. The MARSEC project time-aligns the prosodic annotations, the orthographic words, the grammatical tag of each word, and individual segments. The type of prosodic annotation used is the ``tonetic stress mark'' type of system. Several types of accent are recognised: low fall, high fall, low rise, high rise, low fall-rise, high fall-rise, low rise-fall, high rise-fall, low level, and high level. These may occur either on nuclear or on non-nuclear accented syllables. In addition, there is a distinction between major and minor tone-unit boundaries, and there is provision for ``markedly higher'' or ``markedly lower'' in perceived pitch. The tonetic stress mark type of system has been used for many years, and has been applied to many languages apart from English (the same is not true of the ToBI system). However, no extensive attempts have yet been made to apply it in the field of speech technology.
The Spoken English Corpus comprises over fifty thousand words of broadcast British English in various styles, mostly monologues. Two transcribers prosodically annotated it in an auditory fashion, with no access to the F0 curve. They each transcribed half the corpus, but each also independently transcribed certain passages known as ``overlap'' passages, the purpose of which was to check on inter-transcriber consistency. Analysis of the overlap passages reveals that the consistency is fairly good, certainly in the case of major aspects such as location of accents and direction of pitch movement [Knowles & Alderson (1995)]. This result is especially encouraging in view of the fact that the transcription system used contains far more phonetic detail than does the ToBI system.
The phonetically-based analysis of intonation used at IPO ['t Hart et al. (1990)] has the advantage of having proved its usefulness for more than one language, and of having been successfully applied in the field of speech synthesis (neither of these considerations applies to the ToBI system). The analysis proceeds by modelling F0 curves in terms of straight lines that have been experimentally proved to be perceptually indistinguishable from the original (``close-copy stylisations''). This type of representation is then further simplified into ``standardised stylisations'' in terms of a small set of available contours for a given language. This type of representation has been experimentally proved to be distinguishable from the original on close listening, yet not functionally any different from the original (i.e. the standardised stylisation is linguistically equivalent).
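The idea of replacing an F0 curve by a small number of straight lines can be illustrated with a purely geometric approximation. The sketch below is not the IPO procedure, in which the stylisation is verified perceptually by listeners. It substitutes a recursive line-fitting rule (in the style of the Douglas-Peucker algorithm) with an arbitrary tolerance of one semitone. All names, the 100 Hz reference and the tolerance value are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (time in s, F0 in Hz); times must increase


def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert a frequency to semitones relative to a reference."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def stylise(points: List[Point], tol_st: float = 1.0) -> List[Point]:
    """Approximate an F0 track by straight lines (on a semitone scale),
    keeping only the turning points needed to stay within `tol_st`."""
    if len(points) <= 2:
        return list(points)
    (t0, f0), (t1, f1) = points[0], points[-1]
    s0, s1 = hz_to_semitones(f0), hz_to_semitones(f1)
    worst_i, worst_dev = 0, 0.0
    for i in range(1, len(points) - 1):
        t, f = points[i]
        on_line = s0 + (s1 - s0) * (t - t0) / (t1 - t0)
        dev = abs(hz_to_semitones(f) - on_line)
        if dev > worst_dev:
            worst_i, worst_dev = i, dev
    if worst_dev <= tol_st:
        # One straight line is close enough for this stretch.
        return [points[0], points[-1]]
    # Otherwise split at the worst point and stylise each half.
    left = stylise(points[:worst_i + 1], tol_st)
    right = stylise(points[worst_i:], tol_st)
    return left[:-1] + right


# Invented rise-fall contour: 0 to 6 semitones and back, in 0.1 s steps.
contour = [(0.1 * i, 100.0 * 2 ** (st / 12.0))
           for i, st in enumerate([0, 2, 4, 6, 4, 2, 0])]
print(stylise(contour))
```

For the invented contour, seven measured points reduce to three: the start, the peak and the end. A perceptually validated stylisation would replace the fixed tolerance with listening tests.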
In the case of Dutch, there are ten basic pitch movements (the model has also been applied to British English, German and Russian). These are five falls and five rises, varying along the parameters of syllable position, rate of pitch change, and size of pitch excursion. These ten pitch movements are grouped into ``pitch configurations'' (of one or two pitch movements each). The pitch configurations are classified into prefixes, roots and suffixes. Sequences of pitch configurations are grouped into valid ``pitch contours'', which in turn are grouped into melodic families or ``intonation patterns'' (of which there are six in Dutch). These groupings are experimentally verified by listeners. The units of this analysis, at all levels, are based on speech corpora of spontaneous and semi-spontaneous utterances in Dutch. In contrast to the ToBI and MARSEC systems, comparatively little effort has been put into checking inter-transcriber consistency, possibly because the detection and labelling of this kind of phonetic unit is less problematic.
In the VERBMOBIL project, a large database of German spontaneous speech is being recorded at Munich, Bonn, Kiel and Karlsruhe. It covers a variety of different German speaking styles. Part of these data are being prosodically labelled at IPDS Kiel according to the VERBMOBIL prosodic conventions PROLAB [Kohler et al. (1995)]. Another section of the corpus is being processed at Braunschweig University according to an adapted ToBI system, along the following guidelines:
The tasks include not only the labelling of speech data, but also the development of a workstation for prosodic labelling, and methods and tools for increasing labelling speed and consistency, as follows:
The label inventory splits into three tiers, as follows:
The functional tier provides information about prosodic function like focus and modality.
This functional tier seems to be unique among labelling systems. The reasons for the introduction of an explicit functional tier are as follows:
The break index tier marks different types of word boundary, as follows:
The tone tier uses a ToBI-like inventory consisting of H and L tones [Reyelt et al. (1996)]. The pitch accents and boundary tones are intended as a phonologically distinctive minimal system, together with additional distinctions which proved to be necessary for labelling spontaneous speech. The accents are as follows:
The auditory impression within the accented syllable is ``high''.
The auditory impression within the accented syllable is ``high''.
The auditory impression within the accented syllable is ``rising, between low and high''.
The auditory impression within the accented syllable is ``low''.
The auditory impression within the accented syllable is ``low''.
Each intonational phrase boundary is marked by two tones, a phrase tone and a boundary tone. These are both labelled, even if there is no clear bitonal pitch movement (and especially at low boundaries).
It is reasonable to assume nowadays that a prosodic transcriber will have access to at least the waveform and the F0 curve for the speech to be transcribed. In that case, the recommendation is to use either the ToBI or the IPO system (and the MARSEC system if a purely auditory transcription is being carried out). If the language to be transcribed is not English, and especially if the projected application of the prosodic transcription is in the field of speech technology, then it is probably best to use the IPO system if possible (i.e. if the basic ``grammar'' of contours has already been researched for that language). However, these can only be provisional recommendations, as little work has been carried out on prosodic labelling in comparison with the great effort that has been expended on segmental labelling. In this situation, it may be that a different system entirely will prove more appropriate to the given language, and it is not possible to make absolute recommendations.