Reference systems

   

In the previous sections, the focus was on performance evaluation of the algorithms, with emphasis on scoring procedures. However, in most cases, absolute performance may be misleading, as it depends on many aspects of the experimental conditions. This is particularly true when the evaluation is based on training and test data which may not be publicly available. In such a case, it is impossible to reproduce the test independently.

In fact, a frequent scientific practice consists of comparing a new approach to a previously existing one, and evaluating the new method in terms of (relative) improvement. This practice can be understood as a way to calibrate the difficulty of the task under test by some kind of reference technique. In this section, we propose to use common reference algorithms in order to make this calibration step even more meaningful.

Desirable properties of a reference system are its relative efficiency and robustness, its ease of implementation, and its absolute reproducibility (from the algorithmic point of view). It should not require sophisticated training procedures. In particular, it should be able to operate with very limited training data (such as a single utterance per speaker).

From a practical point of view, the calibration of a given database for a given task by a reference system is relatively easy to implement: for any new system embedded in a given application, the reference speaker recognition system is replaced by the new one, and the differences in the overall performance of the application give an indirect figure of merit of the new system compared to the reference one, for this given application. Commercial products can also be evaluated in parallel on the same database, but this may cause additional difficulties. In particular, the development of a specific ``harness'', i.e. an interface between the evaluation data and the system under test, may be necessary.
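To make the harness idea concrete, the sketch below shows one way such an interface and the resulting indirect figure of merit could look in Python. It is purely illustrative and not prescribed by the text: the class name, method names and the choice of relative error-rate reduction as the figure of merit are all assumptions.

    from abc import ABC, abstractmethod

    class SpeakerVerifier(ABC):
        """Hypothetical harness interface: the reference system, a new
        algorithm, or a wrapped commercial product all expose the same
        calls to the evaluation data."""

        @abstractmethod
        def enroll(self, speaker_id: str, utterances: list) -> None:
            """Build a model for the given speaker from enrolment data."""

        @abstractmethod
        def verify(self, claimed_id: str, utterance) -> float:
            """Return a score; higher means more likely the claimed speaker."""

    def relative_error_reduction(err_new: float, err_ref: float) -> float:
        """Indirect figure of merit: relative error-rate reduction of the new
        system with respect to the reference system, both measured on the
        same database and protocol."""
        return (err_ref - err_new) / err_ref

With such an interface, swapping the reference system for the system under test amounts to plugging a different implementation into the same evaluation loop.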

In general, a speaker recognition system can be decomposed into:

  1. a pre-processing module, which extracts acoustic parameters from the speech signal (this module includes voice activity detection),
  2. a speaker modelling module, which computes some kind of speaker model,
  3. a scoring module, which computes a resemblance score between a test pattern and a reference pattern,
  4. a decision module, which outputs a diagnostic (identity assignment, acceptance, rejection, doubt, ...).
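The sketch below outlines these four modules in Python, purely as an illustration of the decomposition and not as a prescribed reference system: the function names, the crude log-spectrum features, the second-order speaker model and the fixed decision threshold are all assumptions chosen for brevity.

    import numpy as np

    # Illustrative decomposition into the four modules listed above;
    # every design choice here is a placeholder.

    def preprocess(signal, rate, frame_len=0.025, hop=0.010, n_coeff=12):
        """Module 1: frame the signal, take log FFT magnitudes as crude
        acoustic parameters, and drop low-energy frames (naive VAD)."""
        frame, step = int(frame_len * rate), int(hop * rate)
        feats = []
        for start in range(0, len(signal) - frame, step):
            window = signal[start:start + frame] * np.hamming(frame)
            feats.append(np.log(np.abs(np.fft.rfft(window))[:n_coeff] + 1e-10))
        feats = np.array(feats)
        energy = feats.sum(axis=1)
        return feats[energy > energy.mean()]

    def train_model(features):
        """Module 2: the speaker model is simply the mean vector and
        covariance matrix of the training features."""
        return features.mean(axis=0), np.cov(features, rowvar=False)

    def score(test_features, model):
        """Module 3: resemblance score, here a negated Mahalanobis-like
        distance between the test mean and the reference model."""
        mean, cov = model
        diff = test_features.mean(axis=0) - mean
        return -float(diff @ np.linalg.inv(cov) @ diff)

    def decide(score_value, threshold=-5.0):
        """Module 4: binary diagnostic against an arbitrary threshold."""
        return "acceptance" if score_value > threshold else "rejection"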

It is certainly unrealistic to specify exhaustively the four modules of a reference system, especially the first and fourth modules, which may have a considerable impact on performance. We therefore restrict our proposal for a reference system to generic algorithms involved in the second and third modules. However, if the system under test is a ``black box'', and if it is not possible to isolate its pre-processing module, an arbitrary choice has to be made for the pre-processing module of the reference system.

In practice, it is necessary to distinguish between two reference systems: one for text-dependent applications and one for text-independent applications.

For text-dependent and text-prompted applications, a baseline system based on Dynamic Time Warping (DTW) offers a number of advantages:

However, DTW is very sensitive to end-point detection, and relatively heavy in computation when the number of references is high. Nevertheless, we believe that this family of algorithms offers a good compromise for obtaining a reference performance on most databases for text-dependent and text-prompted applications. The choice of DTW as a reference system was already proposed in the context of the SAM-A project [SAM-A (1993), Homayounpour et al. (1993)].
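As an illustration (not taken from the SAM-A specification), the following sketch computes a DTW distance between two sequences of acoustic feature vectors. The Euclidean local distance, the basic step pattern and the path-length normalisation are assumptions; a verification decision would then be obtained by thresholding this distance against the claimed speaker's reference templates.

    import numpy as np

    def dtw_distance(ref, test):
        """Minimal DTW between two feature sequences (frames x dims) with
        Euclidean local distances and the basic symmetric step pattern;
        returns the accumulated distance normalised by the path length."""
        n, m = len(ref), len(test)
        local = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],       # insertion
                                                      acc[i, j - 1],       # deletion
                                                      acc[i - 1, j - 1])   # match
        return acc[n, m] / (n + m)

Lower distances indicate a closer match; the cost of the double loop also illustrates why DTW becomes heavy when many references must be compared.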

For text-independent applications, a sphericity measure was also proposed in SAM-A as a reference method. In fact, the sphericity measure is one possibility among a large family of measures based on second-order statistics (SOS). SOS measures capture the correlations of a time-frequency representation of the signal, which turn out to be highly speaker-dependent [Grenier (1977), Gish et al. (1986), Bimbot & Mathan (1993)]. Here again, the advantages of SOS measures are similar to those of DTW. In particular:

Naturally, the correlation matrix of a speaker is better estimated over a relatively long period of speech (several seconds). Nevertheless, reasonable performance can be obtained on segments as short as three seconds. A detailed evaluation of these measures, as well as a further discussion of their possible use as reference methods, can be found in [Bimbot et al. (1995)].
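For illustration, here is a minimal sketch of one member of this family, an arithmetic-harmonic sphericity measure computed between the covariance matrices of a test segment and a reference segment. The exact formulation and normalisation used in SAM-A or in the cited papers may differ, so this should be read as an assumption-laden example rather than as the reference method itself.

    import numpy as np

    def covariance(features):
        """Covariance matrix of a (frames x dims) feature sequence."""
        return np.cov(features, rowvar=False)

    def arithmetic_harmonic_sphericity(cov_test, cov_ref):
        """SOS measure between two p x p covariance matrices: the log of the
        product of tr(X Y^-1) and tr(Y X^-1) divided by p^2.  It is
        non-negative, equals zero when the matrices are proportional, and
        grows as the correlation structures diverge, so smaller values
        indicate a closer speaker match."""
        p = cov_ref.shape[0]
        m = cov_test @ np.linalg.inv(cov_ref)
        return float(np.log(np.trace(m) * np.trace(np.linalg.inv(m)) / p ** 2))

A closed-set identification decision, for instance, could simply pick the reference speaker whose covariance matrix minimises this measure.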

Both DTW and SOS measures are based on totally reproducible algorithmic methods. They probably do not represent the ultimate technological solutions to speaker recognition. However, their systematic use as calibration approaches should at least allow discrimination between relatively trivial tasks and those which are really challenging. In that sense, the issue of defining reference systems is a matter of efficiency for research and for evaluation methodology.


