In the previous sections, the focus was on the performance evaluation of algorithms, with emphasis on scoring procedures. However, in most cases, absolute performance figures can be misleading, as they depend on many aspects of the experimental conditions. This is particularly true when the evaluation is based on training and test data that are not publicly available, in which case the test cannot be reproduced independently.
In fact, a common scientific practice consists in comparing a new approach to a previously existing one, and in evaluating the new method in terms of (relative) improvement. This practice can be understood as a way of calibrating the difficulty of the task under test against some kind of reference technique. In this section, we propose the use of common reference algorithms in order to make this calibration step even more meaningful.
Desirable properties of a reference system are relative efficiency and robustness, easy implementation, and absolute reproducibility (from the algorithmic point of view). It should not require sophisticated training procedures; in particular, it should be able to operate with very limited training data (such as one utterance) per speaker.
From a practical point of view, the calibration of a given database for a given task by a reference system is relatively easy to implement: within a given application, the reference voice recognition system is replaced by the new system under test, and the difference in the overall performance of the application gives an indirect figure of merit of the new system relative to the reference one, for this particular application. Commercial products can also be evaluated in parallel on the same database, but this may cause additional difficulties. In particular, the development of a specific ``harness'', i.e. an interface between the evaluation data and the system under test, may be necessary.
In general, a speaker recognition system can be decomposed into four modules: acquisition and pre-processing, feature extraction, pattern matching, and decision making.
It is certainly unrealistic to specify exhaustively all four modules of a reference system, especially the first and fourth ones, which may have a considerable impact on performance. We therefore restrict our proposal for a reference system to the generic algorithms involved in the second and third modules. However, if the system under test is a ``black box'' and its pre-processing module cannot be isolated, an arbitrary choice has to be made for the pre-processing module of the reference system.
In practice, it is necessary to distinguish between two reference systems: one for text-dependent applications and one for text-independent applications.
For text-dependent and text-prompted applications, a baseline system based on Dynamic Time Warping (DTW) offers a number of advantages in terms of the desirable properties listed above.
However, DTW is very sensitive to end-point detection, and becomes computationally expensive when the number of reference templates is high. Nevertheless, we believe that this family of algorithms offers a good compromise for obtaining a reference performance on most databases for text-dependent and text-prompted applications. The choice of DTW as a reference system was already proposed in the context of the SAM-A project [SAM-A (1993), Homayounpour et al. (1993)].
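As an illustration, a minimal DTW comparison between two feature sequences can be sketched as follows (in Python with NumPy; the Euclidean local distance, the symmetric path constraints, and the path-length normalisation are assumptions for the sketch, and an actual reference implementation may specify different choices):

```python
import numpy as np

def dtw_distance(ref, test):
    """Dynamic Time Warping distance between two feature sequences.

    ref, test: 2-D arrays of shape (n_frames, n_coeffs), e.g. frames of
    cepstral coefficients. Returns the accumulated frame-to-frame
    distance along the optimal alignment path, normalised by the sum of
    the two sequence lengths.
    """
    n, m = len(ref), len(test)
    # Local (Euclidean) distances between every pair of frames.
    local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    # Accumulated-cost matrix with an extra boundary row/column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # vertical step
                acc[i, j - 1],      # horizontal step
                acc[i - 1, j - 1],  # diagonal step
            )
    return acc[n, m] / (n + m)
```

For speaker verification, the test utterance would be compared against the claimed speaker's reference template(s), and the resulting distance thresholded to accept or reject the claim.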
For text-independent applications, a sphericity measure was also proposed in SAM-A as a reference method. In fact, the sphericity measure is one possibility among a large family of measures based on second-order statistics (SOS). SOS measures capture the correlations of a time-frequency representation of the signal, which turn out to be highly speaker-dependent [Grenier (1977), Gish et al. (1986), Bimbot & Mathan (1993)]. Here again, the advantages of SOS measures are similar to those of DTW.
Naturally, the correlation matrix of a speaker is better estimated over a relatively long period of speech (several seconds). Nevertheless, reasonable performance can be obtained on segments as short as three seconds. A detailed evaluation of these measures, as well as a further discussion on their possible use as reference methods can be found in [Bimbot et al. (1995)].
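To make the SOS family concrete, here is a hedged sketch of one member, the arithmetic-harmonic sphericity measure discussed in Bimbot & Mathan (1993); the function names are illustrative, and the exact variant specified as a reference method may differ:

```python
import numpy as np

def covariance(features):
    """Covariance matrix of a (n_frames, n_coeffs) feature sequence."""
    return np.cov(features, rowvar=False)

def ahs_measure(x_cov, y_cov):
    """Arithmetic-harmonic sphericity measure between two p x p
    covariance matrices X and Y:

        mu(X, Y) = log( tr(Y X^-1) * tr(X Y^-1) / p^2 )

    By the arithmetic-harmonic mean inequality applied to the
    eigenvalues of Y X^-1, mu is non-negative, and it is zero exactly
    when X == Y; a small value therefore indicates similar speakers.
    """
    p = x_cov.shape[0]
    t1 = np.trace(y_cov @ np.linalg.inv(x_cov))
    t2 = np.trace(x_cov @ np.linalg.inv(y_cov))
    return np.log(t1 * t2 / p**2)
```

In use, `x_cov` would be estimated from the training speech of the claimed speaker and `y_cov` from the test segment, with no alignment or sophisticated training required.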
Both DTW and SOS measures are based on fully reproducible algorithmic methods. They probably do not represent the ultimate technological solution to speaker recognition. However, their systematic use as calibration approaches should at least allow discrimination between relatively trivial tasks and those which are really challenging. In that sense, the issue of defining reference systems is a matter of efficiency, both for research and for evaluation methodology.