A closed-set identification system can be viewed as a function which
assigns, to any test utterance *z*, an estimated speaker index
$\hat\imath(z) \in \{1, \dots, N\}$, corresponding to the identified speaker
$S_{\hat\imath(z)}$ in the set of registered speakers $\{S_1, \dots, S_N\}$.

In closed-set identification, all test utterances belong to one of the
registered speakers $\{S_1, \dots, S_N\}$. If we denote as $x_k^{(i)}$ the test
utterance number *k* produced by speaker $S_i$ (with $k = 1, \dots, n_i$), a
*misclassification error* occurs when:

$$\delta\!\left(\hat\imath\!\left(x_k^{(i)}\right),\, i\right) = 0$$

where $\delta$ denotes the Kronecker function, which is 1 if its two arguments are equal and 0 otherwise.

The most natural figure indicating the performance of a speaker identification system is the relative number of times the system fails to identify an applicant speaker correctly; in other words, how often a test utterance is assigned an erroneous identity. Whereas it is straightforward to calculate a performance figure on a speaker-by-speaker basis, care must be taken when deriving a global score.

With our notation, and assuming that $n_i \neq 0$, we define the
*misclassification rate* for speaker $S_i$ as:

$$\mu_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \left[\, 1 - \delta\!\left(\hat\imath\!\left(x_k^{(i)}\right),\, i\right) \right]$$

If we denote as $P(\,\hat\imath \neq i \mid i\,)$ the
probability that the system under test identifies another speaker
(with some index $j \neq i$) than the actual speaker $S_i$, the quantity
$\mu_i$ provides an estimate of this probability, whereas
$1 - \mu_i$ provides an estimate of $P(\,\hat\imath = i \mid i\,)$. However, it is preferable to report error
scores rather than success scores, and performance improvements should
be measured as relative error rate reduction. If $n_i = 0$, $\mu_i$ is undefined, but
we adopt the convention that $n_i\,\mu_i = 0$.
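As an illustration, the per-speaker misclassification rates can be read directly off a confusion matrix. The following sketch (the matrix values are invented for illustration) assumes a NumPy array `C` where `C[i, j]` counts the test utterances from speaker $S_i$ identified as $S_j$:

```python
import numpy as np

# Hypothetical confusion matrix C for N = 3 registered speakers:
# C[i, j] = number of test utterances from speaker i identified as speaker j.
C = np.array([
    [8, 1, 1],   # speaker 0: 10 utterances, 8 correct
    [0, 5, 0],   # speaker 1: 5 utterances, all correct
    [2, 2, 6],   # speaker 2: 10 utterances, 6 correct
])

n_i = C.sum(axis=1)              # number of test utterances per speaker
mu_i = 1.0 - np.diag(C) / n_i    # per-speaker misclassification rates

print(mu_i)   # [0.2 0.  0.4]
```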

We suggest using the term *dependable speaker* to qualify a
registered speaker with a low misclassification rate $\mu_i$, and the term
*unreliable speaker* for a speaker with a high misclassification
rate.

From speaker-by-speaker figures, the *average misclassification rate* can be derived as:

$$\bar\mu = \frac{1}{N} \sum_{i=1}^{N} \mu_i$$

and by computing separately the averages over the $N_f$ female and $N_m$ male registered speakers:

$$\bar\mu_{\rm female} = \frac{1}{N_f} \sum_{i \,\in\, {\rm females}} \mu_i \qquad\qquad \bar\mu_{\rm male} = \frac{1}{N_m} \sum_{i \,\in\, {\rm males}} \mu_i$$

the *gender-balanced misclassification rate* can be obtained as:

$$\mu_{gb} = \frac{1}{2}\left(\bar\mu_{\rm female} + \bar\mu_{\rm male}\right)$$

The previous scores are different from the *test set misclassification rate* $\mu_{ts}$, calculated over the $n = \sum_{i=1}^{N} n_i$ test utterances as:

$$\mu_{ts} = \frac{1}{n} \sum_{i=1}^{N} n_i\,\mu_i$$

Scores $\bar\mu$ and $\mu_{ts}$ are formally identical if and only if $n_i$
does not depend on *i*, i.e. when the test set contains an identical
number of test utterances for each speaker. As it is usually
observed that speaker recognition performance may vary with the speaker's
gender, the comparison of $\bar\mu$ and $\mu_{gb}$ can show significant differences if the registered population is not
gender-balanced. Therefore, we believe that an accurate description of the
identification performance requires the three numbers
$\bar\mu$, $\mu_{gb}$, and $\mu_{ts}$ to be
provided.
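A minimal sketch of the three global scores, assuming invented per-speaker rates, utterance counts and genders:

```python
import numpy as np

# Hypothetical per-speaker misclassification rates and test-utterance
# counts for N = 4 registered speakers (2 female, 2 male).
mu     = np.array([0.1, 0.3, 0.0, 0.2])    # per-speaker rates mu_i
n_i    = np.array([10, 10, 40, 20])        # utterances per speaker n_i
female = np.array([True, True, False, False])

mu_avg = mu.mean()                                        # average rate
mu_gb  = 0.5 * (mu[female].mean() + mu[~female].mean())   # gender-balanced rate
mu_ts  = np.sum(n_i * mu) / n_i.sum()                     # test set rate

print(mu_avg, mu_gb, mu_ts)
```

Note how the test set rate differs from the other two as soon as the $n_i$ are unequal, as stated above.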

Taking another point of view, performance scores can be designed to
measure how reliable the decision of the system is when it has
assigned a given identity; in other words, to provide an estimate of
the probability $P(\,\text{speaker} \neq S_j \mid \hat\imath = j\,)$, i.e. the
probability that the speaker is *not* really $S_j$ when the
system under test has output $j$ as the most likely identity.

To define the *mistrust rate*, we have to introduce the following
notation:

$$m_j = \sum_{i=1}^{N} \sum_{k=1}^{n_i} \delta\!\left(\hat\imath\!\left(x_k^{(i)}\right),\, j\right) \qquad p_j = \frac{m_j}{n} \qquad N^{*} = \#\left\{\, j : m_j > 0 \,\right\}$$

By definition, $m_j$ and $p_j$ are respectively the number and proportion of test utterances identified as $S_j$ over the whole test set, while $N^{*}$ is the number of registered speakers whose identity was assigned at least once to a test utterance.

The *mistrust rate* for speaker $S_j$ can then be computed (for $m_j \neq 0$) as:

$$\nu_j = \frac{1}{m_j} \sum_{i \neq j} \sum_{k=1}^{n_i} \delta\!\left(\hat\imath\!\left(x_k^{(i)}\right),\, j\right)$$

Here again, if $m_j = 0$, $\nu_j$ is undefined, but we adopt the convention that $m_j\,\nu_j = 0$.

We suggest that the term *resistant speaker* could be used to
qualify a registered speaker with a low mistrust rate, and the term *vulnerable speaker* for a
speaker with a high mistrust rate.

From this speaker-by-speaker score, the *average mistrust rate*
can be derived as:

$$\bar\nu = \frac{1}{N^{*}} \sum_{j \,:\, m_j > 0} \nu_j$$

and, with $\bar\nu_{\rm female}$ and $\bar\nu_{\rm male}$ computed separately as for misclassification rates, the *gender-balanced mistrust rate* is defined as:

$$\nu_{gb} = \frac{1}{2}\left(\bar\nu_{\rm female} + \bar\nu_{\rm male}\right)$$

By noticing now that each misclassified test utterance contributes exactly once to each side of the identity:

$$\sum_{j=1}^{N} m_j\,\nu_j = \sum_{i=1}^{N} n_i\,\mu_i$$

there appears to be no need to define a test set mistrust rate
$\nu_{ts}$: dividing both sides by $n$ shows that it would coincide with $\mu_{ts}$. In other words: *the test set mistrust rate is equal to
the test set misclassification rate*.

From a practical point of view, misclassification rates and mistrust rates can be obtained by the exact same scoring programs, operating successively on the confusion matrix and on its transpose.
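This duality can be illustrated with a small sketch (matrix values invented): the same scoring function applied to the confusion matrix and to its transpose yields misclassification rates and mistrust rates respectively.

```python
import numpy as np

def misclassification_rates(C):
    """Per-speaker misclassification rates from a confusion matrix C,
    where C[i, j] counts utterances from speaker i identified as j."""
    return 1.0 - np.diag(C) / C.sum(axis=1)

# Hypothetical confusion matrix for 3 registered speakers.
C = np.array([
    [8, 1, 1],
    [0, 5, 0],
    [2, 2, 6],
])

mu = misclassification_rates(C)      # misclassification rates
nu = misclassification_rates(C.T)    # mistrust rates, via the transpose

# The two test set scores coincide, as stated in the text:
n = C.sum()
assert np.isclose((C.sum(axis=1) * mu).sum() / n,
                  (C.T.sum(axis=1) * nu).sum() / n)
```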

Most speaker identification systems use a similarity measure between a
test utterance and all training patterns to decide, by a nearest
neighbour decision rule, which is the identity of the test speaker. In
this case, for a test utterance *x*, an ordered list of registered
speakers can be produced:

$$\left\{\, S_{(1)},\, S_{(2)},\, \dots,\, S_{(N)} \,\right\}$$

where, for every index *j*, $S_{(j)}$ is judged closer to the test
utterance than $S_{(j+1)}$ is.

The *identification rank* $r_k^{(i)}$ of the genuine speaker of utterance
$x_k^{(i)}$ can then be expressed as:

$$r_k^{(i)} = r \quad \text{such that} \quad S_{(r)} = S_i$$

In other words, $r_k^{(i)}$ is the position at which the correct speaker $S_i$ appears in the ordered list of neighbours of the test utterance $x_k^{(i)}$. Note that a correct identification of $x_k^{(i)}$ corresponds to $r_k^{(i)} = 1$.
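The nearest neighbour ordering and the identification rank can be sketched as follows, with invented similarity scores (higher meaning closer):

```python
import numpy as np

# Hypothetical similarity scores between one test utterance and the
# training patterns of N = 4 registered speakers (higher = closer).
scores = np.array([0.2, 0.9, 0.4, 0.7])
true_speaker = 2

# Ordered list of candidate indices, most similar first.
ordered = np.argsort(-scores)

# Identification rank: 1-based position of the genuine speaker.
rank = int(np.where(ordered == true_speaker)[0][0]) + 1
print(ordered, rank)   # [1 3 2 0] 3
```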

Under the assumption that $n_i \neq 0$, let us now denote as $\eta_i(r)$ the number of test utterances of speaker $S_i$ for which the correct speaker appears among the $r$ first candidates:

$$\eta_i(r) = \#\left\{\, k : r_k^{(i)} \leq r \,\right\}$$

The *$\gamma\%$ confidence rank* for speaker $S_i$, which we
will denote here as $r_i^{\gamma}$, can then be defined
as the smallest integer number $r$ for which $\gamma\%$ of the test
utterances belonging to speaker $S_i$ are part of the $r$
nearest neighbours in the ordered list of candidates. Hence the
formulation:

$$r_i^{\gamma} = \min\left\{\, r : \eta_i(r) \geq \frac{\gamma}{100}\, n_i \,\right\}$$

Then, the *average $\gamma\%$ confidence rank* can be computed as the average of $r_i^{\gamma}$ over all registered speakers (for which $n_i \neq 0$):

$$\bar r^{\gamma} = \frac{1}{N} \sum_{i=1}^{N} r_i^{\gamma}$$
Though a gender-balanced confidence rank could be defined analogously to gender-balanced misclassification and mistrust rates, the relevance of this figure is not clear.
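The per-speaker confidence rank computation can be sketched with a small helper (the helper and the rank values are invented for illustration):

```python
def confidence_rank(ranks, gamma):
    """Smallest r such that at least gamma% of the given identification
    ranks are <= r (illustrative helper, not from the original text)."""
    ranks = sorted(ranks)
    # ceil(gamma/100 * n_i) using integer arithmetic to avoid float error
    need = -(-gamma * len(ranks) // 100)
    return ranks[need - 1]

# Hypothetical identification ranks for the test utterances of one speaker.
ranks = [1, 1, 2, 1, 4, 1, 2, 9, 1, 3]
print(confidence_rank(ranks, 80))    # 3
print(confidence_rank(ranks, 100))   # 9
```

Since the ranks are sorted, the `need`-th smallest rank is exactly the smallest $r$ for which the required proportion of utterances falls within the first $r$ candidates.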

If finally we denote as $\eta(r)$ the corresponding count over the whole test set:

$$\eta(r) = \sum_{i=1}^{N} \eta_i(r)$$

the *test set $\gamma\%$ confidence rank* is defined as:

$$r_{ts}^{\gamma} = \min\left\{\, r : \eta(r) \geq \frac{\gamma}{100}\, n \,\right\}$$

Average scores, gender-balanced scores and
test set scores all fall under the
same formalism. If we denote as $\rho_i$ a certain quantity which we will
call the *relative representativity* of speaker $S_i$, and which
satisfies $\sum_{i=1}^{N} \rho_i = 1$, and if we now consider the linear combination:

$$\mu_{\rho} = \sum_{i=1}^{N} \rho_i\, \mu_i$$

it is clear that:

$$\mu_{\rho} = \bar\mu \quad \text{for} \quad \rho_i = \frac{1}{N}$$

$$\mu_{\rho} = \mu_{gb} \quad \text{for} \quad \rho_i = \begin{cases} \dfrac{1}{2 N_f} & \text{if } S_i \text{ is female} \\[4pt] \dfrac{1}{2 N_m} & \text{if } S_i \text{ is male} \end{cases}$$

$$\mu_{\rho} = \mu_{ts} \quad \text{for} \quad \rho_i = \frac{n_i}{n}$$

Therefore $\bar\mu$, $\mu_{gb}$ and $\mu_{ts}$ correspond to different estimates of a global score, under various assumptions on the relative representativity of each speaker. For average scores, each speaker is assumed to be equally representative, irrespective of their sex group; if the test population is strongly unbalanced, this hypothesis may not be relevant. For gender-balanced scores, each test speaker is supposed to be representative of their sex group, and each sex group is supposed to be equiprobable. Test set scores make the assumption that each test speaker has a representativity proportional to their number of test utterances $n_i$, which is certainly not always a meaningful hypothesis.
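The unifying formalism can be checked numerically; the sketch below (all values invented) reproduces the three global scores from the corresponding representativity weights $\rho_i$:

```python
import numpy as np

# Hypothetical per-speaker misclassification rates, utterance counts
# and genders for N = 4 registered speakers.
mu     = np.array([0.1, 0.3, 0.0, 0.2])
n_i    = np.array([10, 10, 40, 20])
female = np.array([True, True, False, False])
N, n = len(mu), n_i.sum()

def combined(rho):
    """Global score as a representativity-weighted combination of mu_i."""
    assert np.isclose(rho.sum(), 1.0)
    return float(np.sum(rho * mu))

rho_avg = np.full(N, 1.0 / N)                                  # average score
rho_gb  = np.where(female, 0.5 / female.sum(),                 # gender-balanced
                           0.5 / (~female).sum())
rho_ts  = n_i / n                                              # test set score

print(combined(rho_avg), combined(rho_gb), combined(rho_ts))
```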

Test set scores can therefore be used as an overall performance measure if the test data represent accurately the profile and behaviour of the user population, both in terms of population composition and individual frequency of use. If only the composition of the test set population is representative of the general user population, average scores allow neutralisation of the possible discrepancies in number of utterances per speaker. If finally the composition of the test set speaker population is not representative, gender-balanced scores provide a general purpose estimate.

If there is a way to estimate separately the relative representativity $\rho_i$ for
each registered speaker $S_i$, a *representative misclassification rate*
can be computed as in equation (11.30). Conversely, techniques
such as those used in public opinion polls can be used to
select a representative test population when setting up an evaluation
experiment.