In speech technology assessment, human performance on a well-defined task is one of the most popular references against which to evaluate an automatic system. For instance, the classical HENR method for assessing speech recognisers (cf. Chapter 10) and the well-defined and established set of listening tests (Diagnostic Rhyme Test (DRT), Mean Opinion Score (MOS), etc.) used in the assessment of speech output systems (cf. Chapter 12) are well-known techniques in speech assessment. Moreover, human calibration may offer an opportunity to investigate how humans approach the solution of the problem, so that the automatic system may take advantage of this knowledge.
In recent years, a significant amount of effort in the field of speaker recognition has been spent on answering the question of how the accuracy of automatic methods for speaker identification and verification compares to the performance of human listeners. These investigations raised the question of whether automatic speaker recognition is one area of speech processing where machines can exceed human performance. Unfortunately, as no common formalism has been established and as the experiments reported in the literature usually do not share the same experimental conditions, no clear conclusions can be drawn.
In fact, the large number of distinct factors that must be managed in a listening session, such as the number of speakers, the duration of the voice material, voice familiarity, the phonetic content of the speech material, the delay between sessions, etc., makes the definition of a standard listening test a very difficult goal. Moreover, a comparison between automatic methods and human listeners can easily end up with the selection of a task that is reasonable for the automatic system but unfair to the capabilities of the listener.
Nevertheless, for speaker verification the problem becomes simpler [Rosenberg (1973), Federico (1989)], as a human calibration of the task can be carried out through a series of pair tests, in which the human listener is asked to judge whether the two speech samples belong to the same speaker or not. A necessary next step is to define and test procedures for listening tests in automatic speaker verification, so that effort in this field does not vanish owing to a lack of reproducibility or to a multiplicity of test conditions.
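The scoring of such a pair test parallels that of an automatic verification system: genuine pairs judged "different" count as false rejections, and impostor pairs judged "same" count as false acceptances. The following sketch (hypothetical data and function names, not from the cited studies) illustrates how a listening session of this kind could be scored:

```python
def score_pair_test(trials):
    """Score a pair-test listening session for speaker verification.

    trials: list of (listener_says_same, truly_same) boolean pairs,
    one per presented pair of speech samples.
    Returns (false_rejection_rate, false_acceptance_rate).
    """
    genuine = [says for says, truth in trials if truth]
    impostor = [says for says, truth in trials if not truth]
    # False rejection: a genuine pair judged as different speakers.
    fr = sum(1 for says in genuine if not says) / len(genuine)
    # False acceptance: an impostor pair judged as the same speaker.
    fa = sum(1 for says in impostor if says) / len(impostor)
    return fr, fa

# Hypothetical session: 4 genuine pairs, 4 impostor pairs.
trials = [(True, True), (True, True), (False, True), (True, True),
          (False, False), (True, False), (False, False), (False, False)]
print(score_pair_test(trials))  # (0.25, 0.25)
```

The two error rates can then be compared directly with the false rejection and false acceptance rates of an automatic verifier run on the same pairs, which is what makes the pair-test formulation attractive for human calibration.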
Given that listening tests are very time-consuming and cost-intensive research activities, it is not realistic to envisage such a human calibration of every existing database. A good compromise would be to dedicate some effort to the human calibration of standard databases. Nevertheless, further research and experiments in the field of human testing are necessary in order to establish standards and recommendations supported by both theoretical models and experimental results, as listening methods would surely be helpful in both the development and the assessment of speaker recognition systems.