If speech data are to be used for training phoneme-based recognisers, then the stored files must be accompanied by annotations, obtained from human judges, which indicate where each phoneme starts and ends (segmentation) and what its identity is (classification). These annotations are usually obtained from experienced judges. A section of contiguous speech data (possibly augmented by time-aligned spectrographic or other material) is displayed on a graphics screen. The judges have manually-controlled cursors which allow them to select and play sections of the speech.
In order to check the reliability of segmentation and classification judgments, a subset of the material has to be checked at least once. This checking can be performed by the same judge (intra-judge) or by different judges (inter-judge). Intra-judge reliability indicates how consistent a particular judge is, whilst inter-judge agreement indicates how consistent different judges are with one another. If different judges are used, it is necessary to ensure that they are sufficiently well instructed that they are making the same type of decision. In experiments, like has to be compared with like. Thus, if an automatic segmentation algorithm is available that works on acoustic patterns, then to check the accuracy of the algorithm the human judges should be required to do the same thing as the algorithm. Similar considerations apply when comparing manual judgments against those obtained by machine: there is no point, for instance, in the humans locating phonemes on the basis of global spectral properties when the machine is using local acoustic properties (such as the presence or absence of voicing) to make these decisions. For assessing recogniser performance, phoneme segmentations are required, not acoustic properties of subsegmental events, and this dictates what output is appropriate. Note that it is not a foregone conclusion that segmentations will be comparable between humans, nor between humans and algorithms. If, for instance, laryngeal vibration is used as a basis for segmentation, the point where it starts is often not clear cut.
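Once two sets of boundary times are available for the same material, the agreement checks described above can be quantified directly. The following sketch (the data and function name are invented for illustration; this is a minimal measure, not a standard tool) computes the mean and maximum absolute boundary displacement between two judges, which applies equally to the intra-judge case (the same judge on two occasions):

```python
# Sketch: inter-judge (or intra-judge) boundary agreement.
# Assumes both judges marked the same boundaries, in the same
# order, for one utterance. Times are in seconds; all values
# are invented for illustration.

def boundary_agreement(judge_a, judge_b):
    """Return mean and maximum absolute displacement (seconds)."""
    if len(judge_a) != len(judge_b):
        raise ValueError("judges must mark the same boundaries")
    diffs = [abs(a - b) for a, b in zip(judge_a, judge_b)]
    return sum(diffs) / len(diffs), max(diffs)

# Boundary times (s) for one utterance, marked independently.
judge_a = [0.12, 0.31, 0.55, 0.80]
judge_b = [0.10, 0.33, 0.54, 0.86]

mean_diff, max_diff = boundary_agreement(judge_a, judge_b)
print(f"mean displacement: {mean_diff * 1000:.1f} ms")
print(f"max displacement:  {max_diff * 1000:.1f} ms")
```

Per-boundary displacements of this kind are also the raw data that a later statistical analysis (such as the ANOVA discussed below in this section) would operate on.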
Returning to the procedures that are usually performed when obtaining phoneme labels, once the identity and extent of a phoneme have been ascertained, a permanent record of these parameters is stored in the computer file. Utilities are usually available which allow these parameters to be retrieved and the information about classifications and the start and end points of segments to be aligned against the original speech oscillogram, the spectrogram, etc. Typically, the judges work through the files from the beginning of the recording to the end.
Written transcription of the data is frequently performed. These transcriptions have to be aligned against the speech, for example with Dynamic Time Warping (DTW) algorithms. When this is done, it is difficult to establish whether performance problems are due to human errors of judgment or to limitations in the algorithm. In this section, attention is specifically focussed on human judges' performance, which excludes the procedure of obtaining transcriptions and then aligning them automatically.
Another important point about labelling procedures when the labels are entered directly into the computer is that two logically distinct processes are involved: segmentation and classification. This distinction is also implicit in the use of written transcriptions with DTW alignment, which will be used to illustrate the point. The DTW alignment locates the labels it is given in a statistically optimal way (i.e., it performs the segmentation). There is no necessary requirement that the transcriber who makes the written transcriptions indicate segment boundaries, only segment categories (i.e., the classification). In this case, humans perform one task and machines the other. Other possibilities are to reverse which decision is made by the machine and which by the human, or to have the human or the machine perform both tasks separately. Though these would be revealing about the quality of the data at the segmentation and classification levels, and about the influence these have on recogniser performance, they have not been performed to date. The basic precept about conducting experiments applies again: the subject should have a clear idea about what decision is being made. When segmentation and labelling are conducted together, the decisions are mixed (confounded). This makes it difficult during analysis to disambiguate whether, when an error occurs, the error is associated with one decision (say segmentation) or the other (classification).
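The division of labour in which the transcriber supplies only the segment categories and the machine places the boundaries can be illustrated with a toy dynamic-programming alignment. The sketch below is a deliberate simplification of DTW/Viterbi-style forced alignment, with invented per-frame mismatch costs standing in for real acoustic scores; it assigns contiguous runs of frames to a fixed label sequence so that the total cost is minimised:

```python
# Sketch: machine segmentation given a fixed label sequence.
# cost[j][i] is an invented mismatch score of frame i against
# label j; real systems would derive these from acoustic models.

def align(labels, cost):
    """Return the start frame of each label in the best alignment."""
    n_lab, n_frm = len(labels), len(cost[0])
    INF = float("inf")
    # D[i][j]: best cost of frames 0..i with frame i given label j.
    D = [[INF] * n_lab for _ in range(n_frm)]
    back = [[0] * n_lab for _ in range(n_frm)]
    D[0][0] = cost[0][0]
    for i in range(1, n_frm):
        for j in range(n_lab):
            stay = D[i - 1][j]                       # extend label j
            move = D[i - 1][j - 1] if j > 0 else INF  # enter label j
            if move < stay:
                D[i][j] = cost[j][i] + move
                back[i][j] = j - 1
            else:
                D[i][j] = cost[j][i] + stay
                back[i][j] = j
    # Backtrack from the final frame / final label.
    starts = [0] * n_lab
    j = n_lab - 1
    for i in range(n_frm - 1, 0, -1):
        if back[i][j] != j:
            starts[j] = i
            j = back[i][j]
    return starts

# Three labels over eight frames; low cost = good acoustic match.
labels = ["s", "i", "t"]
cost = [
    [1, 1, 9, 9, 9, 9, 9, 9],  # "s" matches frames 0-1
    [9, 9, 1, 1, 1, 9, 9, 9],  # "i" matches frames 2-4
    [9, 9, 9, 9, 9, 1, 1, 1],  # "t" matches frames 5-7
]
print(align(labels, cost))  # prints [0, 2, 5]
```

The point of the sketch is that the human contributes only the label sequence ["s", "i", "t"] (classification); the boundary positions emerge from the optimisation (segmentation), so errors in the two decisions arise in different places.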
The procedures of assessing segmentation and classification separately, outlined in the preceding paragraph, are not currently practised. Nevertheless, separate assessment of the segmentation and classification performance of human judges can be made, albeit to a limited extent. The data available for assessing the accuracy of segmentation are the locations of the segment boundaries in time, which will involve the bias brought about by the confounded decisions.
To perform a comparative test of human judges' performance, it is recommended that segmentation results be obtained from at least two judges as well as from the algorithm. A parameter that might be measured is the mean difference in boundary location between the two sets of human judgments. The data can then be analysed with ANOVA, where the null hypothesis would be that there is no difference between the judgments made by the same judge on two occasions (intra-judge) or by two separate judges (inter-judge).
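As an illustration of the suggested analysis, the sketch below computes the one-way ANOVA F statistic across judges in pure Python (the boundary times are invented; a statistics package would normally be used, and since boundary times are paired across judges a repeated-measures design would often be preferred in practice):

```python
# Sketch: one-way ANOVA across judges' boundary placements.
# Each row holds one judge's boundary times (s) for the same set
# of phoneme boundaries; the data are invented for illustration.

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA."""
    k = len(groups)                      # number of judges
    n = sum(len(g) for g in groups)      # total observations
    grand = sum(sum(g) for g in groups) / n
    # Between-judges and within-judges sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

judges = [
    [0.12, 0.31, 0.55, 0.80],   # judge 1
    [0.10, 0.33, 0.54, 0.86],   # judge 2
    [0.13, 0.30, 0.56, 0.79],   # judge 3
]
f = one_way_anova_f(judges)
print(f"F({len(judges) - 1}, {sum(map(len, judges)) - len(judges)})"
      f" = {f:.4f}")
```

With these invented data the F value is very small, consistent with the null hypothesis that the judges do not differ; a large F would indicate a systematic difference between judges (or between occasions, in the intra-judge case).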
The following factors need to be taken into account when selecting the subsample on which to perform the segmentation assessment: whether to choose sections from all speakers, in case judges or speakers show specific difficulties, or whether to do complete assessments on selected judges and speakers. Another factor to consider (which can be investigated with the statistical techniques outlined earlier) is the length of sample to take: the sample ought to be at least long enough to contain examples of all phones of interest if phones are going to be used in the recogniser. This will ensure identification of points where speakers have specific difficulties producing certain phones, or judges have difficulties in locating them.