
Introduction

This chapter discusses the linguistic representation of spoken language corpora. As stated in Chapter 3, one of the factors that distinguish a speech corpus from a mere collection of speech is that a corpus is augmented with linguistic annotation (i.e. a symbolic representation of the speech). Since it is impossible to examine the sampled speech data directly, it is only by means of this symbolic representation that one is able to navigate through the corpus. It is important to note that all types of representation of speech are the result of an analysis or classification of the speech. The representations are not the speech itself, but an abstraction from it. However, they are sometimes used as if they were the speech itself.

In most cases, the symbolic representation of the speech implies that a transcription of the speech is made. Transcriptions are used in many fields of linguistics, including phonetics, phonology, dialectology, sociolinguistics, psycholinguistics, second language teaching, and speech pathology. They are also used in disciplines such as psychology, anthropology, and sociology. The type of transcription depends very much on its purpose. In particular, the purpose determines the degree of detail that is required. For example, if a speech corpus has been designed to investigate the amount of time during which several speakers speak simultaneously in a dialogue, a very global transcription will be sufficient. If a corpus has been collected to establish differences in the pronunciation of words, a very precise segmental transcription is needed.
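The simultaneous-speech example can be made concrete with a small sketch. It assumes, purely for illustration, that a global transcription is available as a list of turns, each with a speaker label and start and end times in seconds, and that turns of the same speaker do not overlap one another. The `Turn` format and the function name `overlap_duration` are hypothetical and do not come from any existing corpus or tool.

```python
from typing import List, Tuple

# Hypothetical turn format: (speaker label, start time in s, end time in s).
Turn = Tuple[str, float, float]


def overlap_duration(turns: List[Turn]) -> float:
    """Return the total time (s) during which two or more turns are active.

    Assumes that turns of one and the same speaker never overlap, so that
    two active turns always mean two different speakers.
    """
    events = []
    for _speaker, start, end in turns:
        events.append((start, 1))   # a turn begins
        events.append((end, -1))    # a turn ends
    # Sort by time; at equal times, ends (-1) are handled before starts (+1),
    # so turns that merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active = 0        # number of turns active just before the current event
    total = 0.0
    prev_time = None
    for time, delta in events:
        if prev_time is not None and active >= 2:
            total += time - prev_time
        active += delta
        prev_time = time
    return total


turns = [("A", 0.0, 2.0), ("B", 1.5, 3.0), ("A", 3.0, 4.0)]
print(overlap_duration(turns))
```

Nothing more than turn boundaries is required for this kind of measurement, which is why a global transcription suffices for such a research question.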

Detailed phonemic or phonetic transcriptions of large-scale spoken language corpora with many speakers and much (spontaneous) speech are not feasible in practice: they would be too time-consuming and expensive. Therefore most large speech corpora are provided with word-for-word transcriptions, i.e. word-level orthographic representations of what has been said (e.g. the ATIS and Switchboard corpora). However, a medium-sized corpus of read speech can be provided with a segmental transcription and even with labelling at the segmental level. Examples are the American English TIMIT corpus, which consists of 630 speakers each reading 10 sentences, the German PHONDAT corpora (1990 and 1992, both read speech), and the German VERBMOBIL corpus (from 1993, spontaneous speech). An orthographic transcription (sometimes referred to as a transliteration) may be converted into a canonical phonemic transcription by means of a grapheme-phoneme converter or a pronunciation table.
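The conversion by means of a pronunciation table can be sketched as a simple lookup. The example below is a minimal illustration under stated assumptions: the three-word table, its SAMPA-like symbols, the `<unk>` marker for words missing from the table, and the function name `canonical_transcription` are all hypothetical. A real system would need a full lexicon or a grapheme-phoneme converter for out-of-table words.

```python
import re
from typing import Dict, List

# Hypothetical toy pronunciation table (SAMPA-like symbols, illustrative only).
PRON_TABLE: Dict[str, str] = {
    "the": "D @",
    "cat": "k { t",
    "sat": "s { t",
}


def canonical_transcription(
    orthographic: str,
    table: Dict[str, str],
    unknown: str = "<unk>",
) -> List[str]:
    """Map each orthographic word to its canonical phonemic form.

    Words are lower-cased and punctuation is discarded before lookup.
    Words missing from the table are replaced by the `unknown` marker.
    """
    words = re.findall(r"[a-z']+", orthographic.lower())
    return [table.get(word, unknown) for word in words]


print(canonical_transcription("The cat sat.", PRON_TABLE))
print(canonical_transcription("The dog sat.", PRON_TABLE))
```

A transcription obtained in this way is canonical: it records the citation form of each word, not the way a particular speaker actually pronounced it.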

It has been found that providing reliable phonetic transcriptions for large corpora is hardly feasible [Cucchiarini (1993)]. However, detailed transcriptions of a small number of specific phenomena (e.g. presence/absence of diphthongisation, voiced/voiceless character of fricatives) can be made relatively quickly and reliably if the occurrences of these phenomena can be retrieved quickly with the aid of the annotation and the direct access to files offered by a computerised speech corpus [Van Hout (1989), Van Bezooijen & Van Hout (1985)].

During the International Conference on Spoken Language Processing (ICSLP) in Banff, Canada, in 1992, a workshop was held on "Orthographic and Phonetic Transcription". The goal of the workshop was to agree on areas where community-wide conventions were needed, to identify and document current work, and to establish a means of future communication and continued cooperation.

In the remainder of this section some general remarks will be made about transcriptions of read speech versus transcriptions of spontaneous speech. In addition, the levels and types of transcription will be introduced. The next section (Section 5.2) gives some background on the task of segmenting and labelling speech. The following section (Section 5.3) discusses the levels and types of representation in detail. For each level, reference will be made to existing corpora where possible, the symbols to be used will be presented, and recommendations will be given.






EAGLES SWLG SoftEdition, May 1997.