Automatic and semi-automatic segmentation

Automatic segmentation refers to the process whereby segment boundaries are assigned automatically by a program. The program will probably be an HMM-based speech recogniser that has been given the correct symbol string as input, so that it only has to locate the boundaries between known segments rather than identify the segments themselves. The output boundaries may not be entirely accurate, especially if the training data was sparse. Semi-automatic segmentation refers to the process whereby this automatic segmentation is followed by manual checking and editing of the segment boundaries.
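The idea of locating boundaries for a known symbol string can be illustrated with a much simplified alignment routine. The sketch below is not an HMM recogniser: it assumes one state per symbol, no transition probabilities, and a ready-made table of per-frame log-likelihoods (the function name `forced_align` and the table layout are assumptions made for illustration). It only shows how dynamic programming picks the best left-to-right assignment of frames to the given symbols.

```python
import math

def forced_align(loglik):
    """Align frames to a known symbol string.

    loglik[t][n] is the (assumed, precomputed) log-likelihood of frame t
    under the n-th symbol of the input string. Every symbol must receive
    at least one frame, and symbols are visited strictly in order.
    Returns the index of the first frame of each symbol.
    """
    n_frames = len(loglik)
    n_syms = len(loglik[0])
    if n_frames < n_syms:
        raise ValueError("need at least one frame per symbol")

    neg_inf = -math.inf
    score = [[neg_inf] * n_syms for _ in range(n_frames)]
    # entered[t][n] is True if the best path enters symbol n at frame t.
    entered = [[False] * n_syms for _ in range(n_frames)]
    score[0][0] = loglik[0][0]

    for t in range(1, n_frames):
        for n in range(n_syms):
            stay = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else neg_inf
            if advance > stay:
                score[t][n] = advance + loglik[t][n]
                entered[t][n] = True
            else:
                score[t][n] = stay + loglik[t][n]

    # Trace back from the last frame of the last symbol.
    starts = [0] * n_syms
    n = n_syms - 1
    for t in range(n_frames - 1, 0, -1):
        if entered[t][n]:
            starts[n] = t
            n -= 1
    return starts


# Illustrative values only: 6 frames, 3 symbols.
GOOD, BAD = -0.1, -5.0
example = [
    [GOOD, BAD, BAD],
    [GOOD, BAD, BAD],
    [BAD, GOOD, BAD],
    [BAD, GOOD, BAD],
    [BAD, GOOD, BAD],
    [BAD, BAD, GOOD],
]
print(forced_align(example))
```

Multiplying each returned frame index by the frame shift of the analysis (for example 10 ms, again an assumption) gives boundary times. A real autosegmenter would use trained acoustic models, which is why sparse training data degrades the boundaries.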

This form of segmenting is motivated by the need to segment very large databases for the purpose of training ever more comprehensive recognisers. Manual segmentation is extremely costly in time and effort, and automatic segmentation, if sufficiently accurate, could provide a short cut. However, it is still necessary for the researcher to derive the correct symbol string to input to the autosegmenter. This string may be derived automatically from an orthographic transcription, in which case it will not always correspond to what was actually said in the particular utterance unless it is manually checked and edited. The amount of inaccuracy that is acceptable will depend on the uses to which the database is to be put, and on its overall size.
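As a rough illustration of deriving a symbol string from an orthographic transcription, the sketch below looks each word up in a pronunciation lexicon. The lexicon contents, the SAMPA-like symbols and the function name `canonical_symbols` are all hypothetical. The result is a canonical (citation-form) string, which is exactly why it may differ from the utterance as spoken.

```python
# Hypothetical toy lexicon: orthographic word -> canonical symbol sequence.
# The symbols are SAMPA-like and chosen only for illustration.
LEXICON = {
    "and": ["{", "n", "d"],
    "the": ["D", "@"],
    "cat": ["k", "{", "t"],
}

def canonical_symbols(orthographic, lexicon):
    """Return (symbols, unknown_words) for an orthographic transcription.

    Words missing from the lexicon are reported rather than guessed,
    so that they can be transcribed by hand.
    """
    symbols = []
    unknown = []
    for token in orthographic.lower().split():
        word = token.strip(".,;:!?")
        if not word:
            continue
        pronunciation = lexicon.get(word)
        if pronunciation is None:
            unknown.append(word)
        else:
            symbols.extend(pronunciation)
    return symbols, unknown


symbols, unknown = canonical_symbols("The cat and the dog.", LEXICON)
print(symbols)
print(unknown)
```

A speaker who reduces "and" to a single nasal would not match the three-symbol canonical entry, so the autosegmenter would be forced to place boundaries for segments that were never produced. Manual checking of the symbol string catches such cases.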

There will always be a need to verify the accuracy of an autosegmented database, and the obvious accuracy measure is the consistency between manual and automatic segmentation over a given subset of the database. [Schmidt & Watson (1991)] carried out this evaluation over nearly 6000 phoneme-sized segments, and found that the discrepancy between manual and automatic boundaries varied across segment types. The absolute mean discrepancy was greatest for diphthongs (5.4 ms) and least for nasals (0.37 ms). For 50% of all segmentations, the discrepancy was less than 12 ms, while for 95% it was less than 40 ms. These values fall within the range of just-noticeable differences in duration for sounds of the durational order of speech sounds [Lehiste (1970), p. 13], and so one could conclude that the discrepancies are not perceptually relevant. Automatic segmentation of the given data, using the given autosegmenter, was therefore probably sufficiently accurate.
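A comparison of this kind can be sketched as follows. The data layout, the function name `evaluate` and the sample values are assumptions made for illustration and are not taken from the cited study. The sketch reads "absolute mean discrepancy" as the absolute value of the mean signed discrepancy per segment type, which is also an assumption, and reports the proportion of boundaries whose absolute discrepancy lies below each threshold.

```python
from collections import defaultdict

def evaluate(boundaries, thresholds_ms=(12, 40)):
    """Compare manual and automatic boundary times.

    boundaries: iterable of (segment_type, manual_ms, automatic_ms).
    Returns:
      abs_mean_by_type: |mean signed discrepancy| per segment type (ms)
      within: proportion of all boundaries with |discrepancy| below
              each threshold
    """
    signed_by_type = defaultdict(list)
    all_abs = []
    for seg_type, manual_ms, automatic_ms in boundaries:
        d = automatic_ms - manual_ms
        signed_by_type[seg_type].append(d)
        all_abs.append(abs(d))

    abs_mean_by_type = {
        seg_type: abs(sum(ds) / len(ds))
        for seg_type, ds in signed_by_type.items()
    }
    within = {
        th: sum(1 for d in all_abs if d < th) / len(all_abs)
        for th in thresholds_ms
    }
    return abs_mean_by_type, within


# Invented sample values, for illustration only.
sample = [
    ("nasal", 100, 101),
    ("nasal", 250, 249),
    ("diphthong", 400, 410),
    ("diphthong", 600, 570),
]
abs_mean_by_type, within = evaluate(sample)
print(abs_mean_by_type)
print(within)
```

Under this reading, early and late boundaries cancel in the per-type mean, so a small mean does not by itself imply small individual errors. The threshold proportions are the more direct indication of how many boundaries would need manual correction.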

EAGLES SWLG SoftEdition, May 1997.