A closed-set identification system can be viewed as a function which assigns, to any test utterance $x$, an estimated speaker index $\hat{\imath}(x)$, corresponding to the identified speaker $S_{\hat{\imath}(x)}$ in the set of registered speakers $\{S_1, \dots, S_I\}$. In closed-set identification, all test utterances belong to one of the registered speakers. Therefore, a misclassification error occurs for test utterance number $k$ produced by speaker $S_i$ (denoted $x_i^{(k)}$) when:

$$\delta\!\left(\hat{\imath}\!\left(x_i^{(k)}\right),\, i\right) = 0$$

where $\delta$ denotes the Kronecker function, which is 1 if the two arguments are the same and 0 otherwise.
The most natural figure of merit for a speaker identification system is the relative number of times the system fails to identify an applicant speaker correctly; in other words, how often a test utterance is assigned an erroneous identity. Whereas it is straightforward to calculate a performance figure on a speaker-by-speaker basis, care must be taken when deriving a global score.
With our notation, and assuming that $n_i \neq 0$, where $n_i$ denotes the number of test utterances produced by speaker $S_i$, we define the misclassification rate $\mu_i$ for speaker $S_i$ as:

$$\mu_i \;=\; \frac{1}{n_i} \sum_{k=1}^{n_i} \left[\, 1 - \delta\!\left(\hat{\imath}\!\left(x_i^{(k)}\right),\, i\right) \right]$$
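As an illustration, $\mu_i$ can be read directly off a confusion matrix. The sketch below is our own (the function name and the NumPy representation are assumptions, not part of the original method): it takes a matrix C whose entry C[i, j] counts the test utterances of speaker $S_i$ identified as $S_j$.

```python
import numpy as np

def misclassification_rates(C):
    """Per-speaker misclassification rates mu_i from a confusion matrix.

    C[i, j] = number of test utterances of speaker i identified as j.
    Returns mu with mu[i] undefined (NaN) when speaker i has no test
    utterance, i.e. when n_i = 0.
    """
    C = np.asarray(C, dtype=float)
    n = C.sum(axis=1)               # n_i: test utterances per speaker
    correct = np.diag(C)            # correctly identified utterances
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(n > 0, (n - correct) / n, np.nan)

C = np.array([[8, 1, 1],            # speaker 0: 2 errors out of 10
              [0, 9, 1],            # speaker 1: 1 error  out of 10
              [2, 0, 8]])           # speaker 2: 2 errors out of 10
print(misclassification_rates(C))   # [0.2 0.1 0.2]
```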
If we denote as $P(\hat{\imath} \neq i \mid i)$ the probability that the system under test identifies another speaker (with index $j \neq i$) than the actual speaker $S_i$, the quantity $\mu_i$ provides an estimate of this probability, whereas $1 - \mu_i$ provides an estimate of $P(\hat{\imath} = i \mid i)$. However, it is preferable to report error scores rather than success scores, and performance improvements should be measured as relative error rate reduction: for instance, lowering the error rate from 10% to 5% is a 50% relative error rate reduction, even though the success rate only moves from 90% to 95%.
If $n_i = 0$, $\mu_i$ is undefined, but the product $n_i\,\mu_i$ can consistently be set to 0, so that the global scores defined below remain meaningful.
We suggest using the term dependable speaker to qualify a registered speaker with a low misclassification rate $\mu_i$, and the term unreliable speaker for a speaker with a high misclassification rate.
From speaker-by-speaker figures, the average misclassification rate $\bar{\mu}$ can be derived as:

$$\bar{\mu} \;=\; \frac{1}{I} \sum_{i=1}^{I} \mu_i$$

and by computing separately the averages over the female and male registered speakers:

$$\bar{\mu}_f \;=\; \frac{1}{I_f} \sum_{i \in f} \mu_i \qquad\text{and}\qquad \bar{\mu}_m \;=\; \frac{1}{I_m} \sum_{i \in m} \mu_i$$

the gender-balanced misclassification rate $\tilde{\mu}$ can be obtained as:

$$\tilde{\mu} \;=\; \frac{\bar{\mu}_f + \bar{\mu}_m}{2}$$

where $I_f$ and $I_m$ denote the number of female and male registered speakers, and $f$ and $m$ the corresponding index sets. The previous scores are different from the test set misclassification rate $\mu^{*}$, calculated as:

$$\mu^{*} \;=\; \frac{\sum_{i=1}^{I} n_i\,\mu_i}{\sum_{i=1}^{I} n_i} \;=\; \frac{1}{N} \sum_{i=1}^{I} n_i\,\mu_i \qquad\text{with}\quad N = \sum_{i=1}^{I} n_i$$
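Under the same assumed confusion-matrix conventions, the three global scores differ only in how the per-speaker rates are pooled. A minimal sketch (the gender mask and all names are hypothetical):

```python
import numpy as np

def global_scores(mu, n, is_female):
    """Average, gender-balanced and test-set misclassification rates.

    mu        : per-speaker misclassification rates mu_i
    n         : per-speaker test utterance counts n_i
    is_female : boolean mask giving the gender of each registered speaker
    """
    mu = np.asarray(mu, dtype=float)
    n = np.asarray(n, dtype=float)
    is_female = np.asarray(is_female, dtype=bool)
    avg = mu.mean()                                  # (1/I) * sum_i mu_i
    gender_bal = 0.5 * (mu[is_female].mean() + mu[~is_female].mean())
    test_set = (n * mu).sum() / n.sum()              # weights n_i / N
    return avg, gender_bal, test_set
```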
Scores $\bar{\mu}$ and $\mu^{*}$ are formally identical if and only if $n_i$ does not depend on $i$, i.e. when the test set contains an identical number of test utterances for each speaker. As it is usually observed that speaker recognition performance may vary with the speaker's gender, the comparison of $\bar{\mu}$ and $\tilde{\mu}$ can show significant differences if the registered population is not gender-balanced. Therefore, we believe that an accurate description of the identification performance requires the three numbers $\bar{\mu}$, $\tilde{\mu}$, and $\mu^{*}$ to be provided.
Taking another point of view, performance scores can be designed to measure how reliable the decision of the system is when it has assigned a given identity; in other words, to provide an estimate of the probability $P(i \neq j \mid \hat{\imath} = j)$, i.e. the probability that the speaker is not really $S_j$ when the system under test has output $j$ as the most likely identity.
To define the mistrust rate, we have to introduce the following notation:

$$m_j \;=\; \sum_{i=1}^{I} \sum_{k=1}^{n_i} \delta\!\left(\hat{\imath}\!\left(x_i^{(k)}\right),\, j\right), \qquad p_j \;=\; \frac{m_j}{N}, \qquad J \;=\; \#\left\{\, j \;\middle|\; m_j > 0 \,\right\}$$

By definition, $m_j$ and $p_j$ are respectively the number and proportion of test utterances identified as $S_j$ over the whole test set, while $J$ is the number of registered speakers whose identity was assigned at least once to a test utterance.
The mistrust rate $\nu_j$ for speaker $S_j$ can then be computed (for $m_j \neq 0$) as:

$$\nu_j \;=\; \frac{1}{m_j} \sum_{i \neq j} \sum_{k=1}^{n_i} \delta\!\left(\hat{\imath}\!\left(x_i^{(k)}\right),\, j\right)$$
Here again, if $m_j = 0$, $\nu_j$ is undefined, but such speakers are simply left out of the averages below, which is why $J$ rather than $I$ appears in the denominators.
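The mistrust rates are obtained analogously from the columns of the confusion matrix. The sketch below reuses the assumed conventions of the earlier examples:

```python
import numpy as np

def mistrust_rates(C):
    """Per-speaker mistrust rates nu_j from a confusion matrix.

    C[i, j] = number of test utterances of speaker i identified as j.
    nu[j] is the proportion of utterances identified as speaker j that
    actually belong to another speaker (NaN when m_j = 0).
    """
    C = np.asarray(C, dtype=float)
    m = C.sum(axis=0)               # m_j: utterances identified as j
    correct = np.diag(C)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(m > 0, (m - correct) / m, np.nan)
```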
We suggest that the term resistant speaker could be used to
qualify a registered speaker with a low mistrust rate, and the term vulnerable speaker for a
speaker with a high mistrust rate.
From this speaker-by-speaker score, the average mistrust rate $\bar{\nu}$ can be derived as:

$$\bar{\nu} \;=\; \frac{1}{J} \sum_{j \,:\, m_j > 0} \nu_j$$

and the gender-balanced mistrust rate $\tilde{\nu}$ is defined as:

$$\tilde{\nu} \;=\; \frac{1}{2} \left( \frac{1}{J_f} \sum_{\substack{j \in f \\ m_j > 0}} \nu_j \;+\; \frac{1}{J_m} \sum_{\substack{j \in m \\ m_j > 0}} \nu_j \right)$$

where $J_f$ and $J_m$ count the female and male speakers with $m_j > 0$.
By noticing now that:

$$\sum_{j=1}^{I} m_j\,\nu_j \;=\; \sum_{i=1}^{I} n_i\,\mu_i \qquad\text{and}\qquad \sum_{j=1}^{I} m_j \;=\; \sum_{i=1}^{I} n_i \;=\; N$$

since each misclassified utterance is counted exactly once on either side, there appears to be no need to define a test set mistrust rate $\nu^{*}$. In other words: the test set mistrust rate is equal to the test set misclassification rate.
From a practical point of view, misclassification rates and mistrust rates can be obtained by the exact same scoring programs, operating successively on the confusion matrix and on its transpose.
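With the two hypothetical helpers sketched above, this observation amounts to a one-line identity:

```python
# Mistrust rates are the misclassification rates of the transposed
# confusion matrix: one scoring routine serves both purposes.
assert np.allclose(mistrust_rates(C),
                   misclassification_rates(C.T),
                   equal_nan=True)
```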
Most speaker identification systems use a similarity measure between a test utterance and all training patterns to decide, by a nearest neighbour decision rule, on the identity of the test speaker. In this case, for a test utterance $x$, an ordered list of registered speakers can be produced:

$$S_{(1)}(x),\; S_{(2)}(x),\; \dots,\; S_{(I)}(x)$$

where, for all indexes $j$, $S_{(j)}(x)$ is judged closer to the test utterance than $S_{(j+1)}(x)$ is.
The identification rank $r_i^{(k)}$ of the genuine speaker of utterance $x_i^{(k)}$ can then be expressed as:

$$r_i^{(k)} \;=\; j \quad\text{such that}\quad S_{(j)}\!\left(x_i^{(k)}\right) = S_i$$

In other words, $r_i^{(k)}$ is the position at which the correct speaker appears in the ordered list of neighbours of the test utterance. Note that a correct identification of $x_i^{(k)}$ corresponds to $r_i^{(k)} = 1$.
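For a system that outputs a similarity score for every (utterance, speaker) pair, the identification ranks can be computed by sorting each row of the score matrix. A minimal sketch under assumed conventions (larger score = closer match; all names are ours):

```python
import numpy as np

def identification_ranks(scores, true_ids):
    """Rank of the genuine speaker for each test utterance.

    scores   : (num_utterances, num_speakers) similarity matrix, where
               scores[k, i] measures how close utterance k is to speaker i
               (larger = closer, nearest neighbour rule)
    true_ids : index of the genuine speaker of each utterance
    Returns integer ranks, with rank 1 = correct identification.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), axis=1)
    ranks = np.empty(len(order), dtype=int)
    for k, i in enumerate(true_ids):
        # position (0-based) of the true speaker in the ordered list
        ranks[k] = int(np.nonzero(order[k] == i)[0][0]) + 1
    return ranks
```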
Under the assumption that $n_i \neq 0$, let us now denote:

$$C_i(R) \;=\; \frac{1}{n_i}\; \#\left\{\, k \;\middle|\; r_i^{(k)} \le R \,\right\}$$

i.e. the proportion of test utterances from speaker $S_i$ whose identification rank does not exceed $R$. The confidence rank for speaker $S_i$, which we will denote here as $R_i(\theta)$, can then be defined as the smallest integer number $R$ for which a proportion $\theta$ of the test utterances belonging to speaker $S_i$ are part of the $R$ nearest neighbours in the ordered list of candidates. Hence the formulation:

$$R_i(\theta) \;=\; \min\left\{\, R \;\middle|\; C_i(R) \ge \theta \,\right\}$$
Then, the average confidence rank $\bar{R}(\theta)$ can be computed as the average of $R_i(\theta)$ over all registered speakers (for which $n_i \neq 0$):

$$\bar{R}(\theta) \;=\; \frac{1}{I} \sum_{i=1}^{I} R_i(\theta)$$
Though a gender-balanced confidence rank could be defined analogously to gender-balanced misclassification and mistrust rates, the relevance of this figure is not clear.
If finally we denote:

$$C(R) \;=\; \frac{1}{N}\; \#\left\{\, (i, k) \;\middle|\; r_i^{(k)} \le R \,\right\}$$

the test set confidence rank $R^{*}(\theta)$ is defined as:

$$R^{*}(\theta) \;=\; \min\left\{\, R \;\middle|\; C(R) \ge \theta \,\right\}$$
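Both the per-speaker and the test set confidence ranks reduce to taking an empirical quantile of the identification ranks. A sketch under the same assumed conventions, with theta given as a proportion (e.g. 0.95):

```python
import numpy as np

def confidence_rank(ranks, theta):
    """Smallest R such that a proportion theta of the ranks are <= R."""
    ranks = np.sort(np.asarray(ranks))
    k = int(np.ceil(theta * len(ranks)))   # utterances that must be covered
    return int(ranks[k - 1])

def average_confidence_rank(ranks, true_ids, theta):
    """Mean of the per-speaker confidence ranks R_i(theta)."""
    true_ids = np.asarray(true_ids)
    return float(np.mean([confidence_rank(ranks[true_ids == i], theta)
                          for i in np.unique(true_ids)]))  # n_i > 0 only

# The test set confidence rank pools all utterances regardless of speaker:
# confidence_rank(ranks, theta)
```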
Average scores, gender-balanced scores and test set scores all fall under the same formalism. If we denote as $\rho_i$ a certain quantity which we will call the relative representativity of speaker $S_i$, and which satisfies

$$\sum_{i=1}^{I} \rho_i \;=\; 1$$

and if we now consider the linear combination:

$$\mu(\rho) \;=\; \sum_{i=1}^{I} \rho_i\,\mu_i \tag{11.30}$$
It is clear that:

$$\mu(\rho) = \bar{\mu} \quad\text{for}\quad \rho_i = \frac{1}{I}$$

$$\mu(\rho) = \tilde{\mu} \quad\text{for}\quad \rho_i = \begin{cases} \dfrac{1}{2 I_f} & \text{if } S_i \text{ is female} \\[2mm] \dfrac{1}{2 I_m} & \text{if } S_i \text{ is male} \end{cases}$$

$$\mu(\rho) = \mu^{*} \quad\text{for}\quad \rho_i = \frac{n_i}{N}$$
Therefore $\bar{\mu}$, $\tilde{\mu}$ and $\mu^{*}$ correspond to different estimates of a global score, under various assumptions on the relative representativity of each speaker.
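The following self-contained sketch (with invented per-speaker figures, purely for illustration) shows how each choice of $\rho_i$ recovers one of the three global scores:

```python
import numpy as np

# Invented per-speaker figures: error rates mu_i, test utterance counts
# n_i, and the gender of each registered speaker.
mu = np.array([0.20, 0.10, 0.20, 0.05])
n = np.array([10, 10, 10, 20])
is_female = np.array([True, True, False, False])

def weighted_score(mu, rho):
    """Representative score sum_i rho_i * mu_i, with sum(rho) == 1."""
    return float(np.dot(rho, mu))

I, N = len(mu), n.sum()
I_f, I_m = is_female.sum(), (~is_female).sum()

rho_avg = np.full(I, 1.0 / I)                    # equal representativity
rho_gb = np.where(is_female, 1 / (2 * I_f), 1 / (2 * I_m))
rho_ts = n / N                                   # proportional to n_i

# Each choice of rho recovers one of the three global scores.
assert np.isclose(weighted_score(mu, rho_avg), mu.mean())
assert np.isclose(weighted_score(mu, rho_gb),
                  0.5 * (mu[is_female].mean() + mu[~is_female].mean()))
assert np.isclose(weighted_score(mu, rho_ts), (n * mu).sum() / n.sum())
```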
For average scores, it is assumed that each speaker is equally representative, irrespective of gender, but if the test population is strongly unbalanced this hypothesis may not be relevant (unless there is a reason for it). For gender-balanced scores, each test speaker is supposed to be representative of its gender group, and each gender group is supposed to be equiprobable. Test set scores make the assumption that each test speaker has a representativity proportional to its number of test utterances, which is certainly not always a meaningful hypothesis.
Test set scores can therefore be used as an overall performance measure if the test data represent accurately the profile and behaviour of the user population, both in terms of population composition and individual frequency of use. If only the composition of the test set population is representative of the general user population, average scores allow neutralisation of the possible discrepancies in number of utterances per speaker. If finally the composition of the test set speaker population is not representative, gender-balanced scores provide a general purpose estimate.
If there is a way to estimate separately the relative representativity $\rho_i$ of each registered speaker, a representative misclassification rate can be computed as in equation (11.30). Conversely, techniques such as those used in public opinion polls can be resorted to in order to select a representative test population when setting up an evaluation experiment.