Assessment carried out as part of the application will encounter speech that was not involved in setting up the material employed during training and testing.
It has been suggested that the performance of a newly developed recogniser be compared against that of a reference algorithm (e.g. [Chollet & Gagnoulet (1981)]). The procedures for comparing the reference and newly developed algorithms would be similar to those described in connection with human-human and human-algorithm performance, and would encounter the same problems.
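As an illustration only, one way such a comparison might be carried out is sketched here. It assumes that both recognisers have been scored on the same test items and that each item is marked simply as correct or incorrect. The use of an exact McNemar test on the items where the two systems disagree is an assumption made for this sketch; it is not a procedure prescribed by the text or by the cited work, and the function name `mcnemar_exact` and the sample data are hypothetical.

```python
from math import comb


def mcnemar_exact(ref_correct, new_correct):
    """Two-sided exact McNemar test on paired per-item outcomes.

    ref_correct, new_correct: sequences of booleans, one entry per test
    item, True where the respective recogniser got the item right.
    Returns (only_ref, only_new, p_value).
    """
    if len(ref_correct) != len(new_correct):
        raise ValueError("both systems must be scored on the same items")

    # Only discordant items (exactly one system correct) carry information
    # about a difference between the two systems.
    only_ref = sum(1 for r, n in zip(ref_correct, new_correct) if r and not n)
    only_new = sum(1 for r, n in zip(ref_correct, new_correct) if n and not r)
    discordant = only_ref + only_new
    if discordant == 0:
        return only_ref, only_new, 1.0

    # Under the null hypothesis the discordant items split 50/50,
    # so the smaller count follows Binomial(discordant, 0.5).
    k = min(only_ref, only_new)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return only_ref, only_new, min(1.0, 2 * tail)


# Hypothetical outcomes for 12 test items (not real results).
ref = [True] * 3 + [True] + [False] * 7 + [False]
new = [True] * 3 + [False] + [True] * 7 + [False]
print(mcnemar_exact(ref, new))
```

Because the test uses paired outcomes on identical items, it is only meaningful when the reference and the new recogniser are run on exactly the same material; differences in the test data would otherwise be confounded with differences between the algorithms.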
The procedures for calibrating databases rely in part on checking that the sampling of the corpus is satisfactory (see Section 9.2.3), or involve comparing performance against known answers (the problems involved in providing these have been described in Section 9.3).
Data may need to be specially constructed in order to test specific ideas about why the performance of the recogniser is poor. Poor performance may arise from difficulties in dealing with breathing noises, hesitations, etc., or from difficulty in recognising particular phonemes or phoneme types. The construction of special data for these purposes needs to bear in mind the concerns discussed above in connection with providing adequate samples of speech (Section 9.3.1).