The representation of a text or utterance as a string of symbols, without any reference to its acoustic form, was the pattern followed by speech and text corpus work during the 1980s, for example the prosodically transcribed Spoken English Corpus [Knowles et al. (1995)]. These corpora did not link the symbolic representation to the physical acoustic waveform, and so were not fully machine-readable. A more recent project, MARSEC [Roach et al. (1993)], has generated these links for the Spoken English Corpus, so that it is now a segmented and labelled database.
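To make concrete what "linking" the symbolic representation to the waveform involves, the following sketch pairs each label with start and end times, from which the corresponding stretch of samples can be recovered. The segment labels, times and the 16 kHz sampling rate are illustrative assumptions, not values from any of the corpora mentioned.

```python
# A segmented and labelled utterance: each label is time-aligned to the
# waveform as a (start_s, end_s, label) triple. All values illustrative.
SAMPLE_RATE = 16_000  # Hz; assumed for this sketch

segments = [
    (0.00, 0.21, "sil"),
    (0.21, 0.35, "sh"),
    (0.35, 0.52, "iy"),
]

def samples_for(segment, rate=SAMPLE_RATE):
    """Map a time-aligned label to its sample index range in the waveform."""
    start_s, end_s, label = segment
    return label, int(start_s * rate), int(end_s * rate)

for seg in segments:
    label, lo, hi = samples_for(seg)
    print(f"{label}: samples {lo}..{hi}")
```

Because each label now indexes a definite stretch of the signal, the database can be searched symbolically while still giving access to the acoustic material, which is what a purely textual transcription cannot do.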
The types of segments that may be delimited vary according to the purpose for which the database is collected. The German PHONDAT and VERBMOBIL-PHONDAT corpora use the CRIL (Computer Representation of Individual Languages) conventions, which propose three levels of representation: orthographic, phonetic and narrow phonetic.
A more detailed system of labelling levels has been proposed in [Barry & Fourcin (1992)]; it includes the above three levels and grew out of the SAM project for the major European languages. A given speech corpus will choose one or more of these levels, which are described in detail in the following sections.
The format of label (transcription) files varies widely across research institutions. The WAVES format is becoming popular and has the advantage of being human-readable. The recommendation is to use a label file format that can easily be converted to a WAVES label file, for the sake of portability across different systems.
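As a sketch of the kind of conversion the recommendation has in mind, the function below renders simple (start, end, label) triples as WAVES/xlabel-style label-file text: a short header terminated by a `#` line, followed by one line per segment giving its end time, a colour code and the label. The header fields and the colour code 121 follow common xlabel usage, but treat the exact layout as an assumption rather than a specification.

```python
# Hypothetical converter from a simple in-house label representation
# ((start_s, end_s, label) triples) to WAVES/xlabel-style text.
# Layout details (header fields, colour code 121) are assumptions.

def to_waves(segments, signal_name="utt1"):
    """Render time-aligned label triples as a WAVES-style label file.

    WAVES label lines conventionally record the END time of each
    segment, a colour code and the label; the header ends with '#'.
    """
    lines = [f"signal {signal_name}", "nfields 1", "#"]
    for _start, end, label in segments:
        lines.append(f"    {end:.6f}  121 {label}")
    return "\n".join(lines) + "\n"

print(to_waves([(0.00, 0.21, "sil"), (0.21, 0.35, "sh")]))
```

Keeping the in-house format this close to a flat list of triples is what makes such a converter trivial to write, which is the practical content of the portability recommendation.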