Manual segmentation refers to the process whereby an expert transcriber segments and labels a speech file by hand, referring only to the spectrogram and/or waveform, with no automatic assistance. The manual method is generally believed to be more accurate than automatic methods. In addition, the use of a human transcriber ensures that the segment boundaries and labels (at least at the narrow phonetic level) are perceptually valid. However, explicit segmentation criteria are needed to ensure both inter- and intra-transcriber consistency, together with (ideally) some form of checking procedure. Sets of guidelines for manual segmentation have been developed by various projects. One such set is [Hieronymus et al. (1990)], which uses four levels: underlying phonemic, broad phonetic, narrow phonetic and acoustic. It also retains the same base phonemic symbol even at the acoustic level, so that boundaries at the phonetic level can be determined automatically once the boundaries at the acoustic level have been fixed. Much speech data (particularly in English) has been segmented and labelled entirely manually. This also applies to the spontaneous dialogue corpus of the VERBMOBIL project, part of which has been processed manually at IPDS Kiel (CD-ROMs 2, 3; [IPDS (1995), IPDS (1996)]) on the basis of transcription conventions (modified SAMPA) laid down in [Kohler et al. (1995)].
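The benefit of keeping the base phonemic symbol at the acoustic level can be illustrated with a minimal sketch. The tuple layout `(start_ms, end_ms, base_symbol, acoustic_event)`, the event names and the function name below are hypothetical choices made for illustration; they are not the notation of [Hieronymus et al. (1990)]. The sketch assumes that acoustic segments are given in time order and that contiguous segments sharing a base symbol belong to the same phonetic segment.

```python
def phonetic_from_acoustic(acoustic_segments):
    """Derive phonetic-level segments from acoustic-level ones.

    acoustic_segments: list of (start_ms, end_ms, base_symbol, acoustic_event)
    tuples in time order (hypothetical layout, for illustration only).
    Consecutive, contiguous segments with the same base symbol are merged.
    """
    phonetic = []
    for start, end, base, _event in acoustic_segments:
        last = phonetic[-1] if phonetic else None
        if last is not None and last[2] == base and last[1] == start:
            # Same base symbol, no gap: extend the previous phonetic segment.
            phonetic[-1] = (last[0], end, base)
        else:
            phonetic.append((start, end, base))
    return phonetic


# Toy, invented example: a plosive split into closure and burst, then a vowel.
acoustic = [
    (0, 40, "t", "closure"),
    (40, 55, "t", "burst"),
    (55, 150, "a", "voicing"),
]
print(phonetic_from_acoustic(acoustic))
```

Note that this simple merge would also collapse two genuinely distinct adjacent occurrences of the same phoneme into one segment, so a practical labelling scheme would need some additional marker to keep such cases apart.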
One possible measure of accuracy for segmentation and labelling is consistency
between transcribers. [Barry & Fourcin (1992)] quote [Cosi & Omologo (1991)] as
saying that one should not expect more than 90% agreement between
experts. [Eisen (1993)] investigates inter-transcriber consistency for the
separate tasks of segmentation and labelling at three different levels of
labelling, and concludes that consistency depends partly on the degree of
abstraction of the labelling level, and partly on the type of sound involved.
The best results in labelling were achieved at the broad phonetic level, for
fricatives, nasals and laterals, which showed greater than 90% agreement
across transcribers. The best results for segmentation were achieved at the
acoustic-phonetic level, for the acoustic features ``fricative'', ``voiced'' and
Any accuracy measure based on inter-transcriber consistency must control for the factors ``level of transcription'', ``segment type'', and ``task type'' (whether segmentation or labelling).
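One way to control for these factors is to report agreement separately for each task and for each combination of transcription level and segment type, rather than as a single pooled figure. The following is a minimal sketch under stated assumptions: the segments of the two transcribers are assumed to be already aligned one-to-one, the function names and tuple layouts are hypothetical, the 20 ms boundary tolerance is an arbitrary illustrative value, and the toy data are invented. It is not the procedure used in [Eisen (1993)].

```python
from collections import defaultdict


def _rates(counts):
    """Convert {key: [agreements, total]} into {key: agreement rate}."""
    return {key: agree / total for key, (agree, total) in counts.items()}


def label_agreement(pairs):
    """Labelling task: share of identical labels per (level, segment_type).

    pairs: iterable of (level, segment_type, label_a, label_b).
    """
    counts = defaultdict(lambda: [0, 0])
    for level, seg_type, label_a, label_b in pairs:
        cell = counts[(level, seg_type)]
        cell[0] += int(label_a == label_b)
        cell[1] += 1
    return _rates(counts)


def boundary_agreement(pairs, tolerance_ms=20.0):
    """Segmentation task: share of boundary pairs within tolerance_ms.

    pairs: iterable of (level, segment_type, time_a_ms, time_b_ms).
    tolerance_ms is an assumed threshold, not a value from the source.
    """
    counts = defaultdict(lambda: [0, 0])
    for level, seg_type, time_a, time_b in pairs:
        cell = counts[(level, seg_type)]
        cell[0] += int(abs(time_a - time_b) <= tolerance_ms)
        cell[1] += 1
    return _rates(counts)


# Invented toy data for two transcribers, A and B.
labels = [
    ("broad phonetic", "fricative", "s", "s"),
    ("broad phonetic", "fricative", "f", "f"),
    ("broad phonetic", "plosive", "t", "d"),
    ("broad phonetic", "plosive", "k", "k"),
]
boundaries = [
    ("acoustic-phonetic", "fricative", 100.0, 110.0),
    ("acoustic-phonetic", "fricative", 250.0, 290.0),
]
print(label_agreement(labels))
print(boundary_agreement(boundaries))
```

Keeping the two tasks in separate functions and keying the results by level and segment type means that a high score for one cell (for example, fricatives at the broad phonetic level) cannot mask a low score elsewhere.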