
Closed-set identification


A closed-set identification system can be viewed as a function which assigns, to any test utterance $z$, an estimated speaker index $\hat{\imath}(z)$, corresponding to the identified speaker $X_{\hat{\imath}(z)}$ in the set of registered speakers $\{X_1, \dots, X_I\}$.

In closed-set identification, all test utterances belong to one of the registered speakers. Therefore, a misclassification error occurs for test utterance number $k$ produced by speaker $X_i$ when:

$$\delta\!\left(\hat{\imath}\!\left(z_k^{(i)}\right),\, i\right) = 0$$

where $\delta$ denotes the Kronecker function, which is equal to 1 if its two arguments are identical and to 0 otherwise.
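As a concrete illustration, the error indicator above can be evaluated directly from paired lists of true and estimated speaker indices (a minimal Python sketch; the function name and the sample data are ours, not from the text):

```python
def misclassification_errors(true_ids, predicted_ids):
    """Return one 0/1 indicator per utterance: 1 - delta(i_hat, i)."""
    return [0 if i_hat == i else 1
            for i, i_hat in zip(true_ids, predicted_ids)]

# Four utterances; only the third one (speaker 2 identified as 3) is an error.
errors = misclassification_errors([1, 2, 2, 3], [1, 2, 3, 3])
```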

Misclassification rates


The most natural figure that indicates the performance of a speaker identification system is the relative number of times the system fails to correctly identify an applicant speaker; in other words, how often a test utterance is assigned an erroneous identity. Whereas it is straightforward to calculate a performance figure on a speaker-by-speaker basis, care should be taken when deriving a global score.

With our notation, and assuming that $n_i > 0$, we define the misclassification rate for speaker $X_i$ as:

$$\varepsilon_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \left( 1 - \delta\!\left(\hat{\imath}\!\left(z_k^{(i)}\right),\, i\right) \right)$$

If we denote as $P(j \ne i \mid i)$ the probability that the system under test identifies another speaker (with index $j$) than the actual speaker $X_i$, the quantity $\varepsilon_i$ provides an estimate of this probability, whereas $1 - \varepsilon_i$ provides an estimate of $P(i \mid i)$. However, it is preferable to report error scores rather than success scores, and performance improvements should be measured as relative error rate reduction. If $n_i = 0$, $\varepsilon_i$ is undefined, but $n_i\,\varepsilon_i = 0$.
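The per-speaker misclassification rate can be computed in one pass over the test results (a sketch under our own naming; speakers are keyed by arbitrary identifiers):

```python
from collections import defaultdict

def per_speaker_misclassification(true_ids, predicted_ids):
    """epsilon_i = (misclassified utterances of speaker i) / n_i,
    returned only for speakers with n_i > 0."""
    n = defaultdict(int)       # n_i: number of test utterances per speaker
    errors = defaultdict(int)  # misclassified utterances per speaker
    for i, i_hat in zip(true_ids, predicted_ids):
        n[i] += 1
        if i_hat != i:
            errors[i] += 1
    return {i: errors[i] / n[i] for i in n}
```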

We suggest using the term dependable speaker to qualify a registered speaker with a low misclassification rate, and the term unreliable speaker for a speaker with a high misclassification rate.

From speaker-by-speaker figures, the average misclassification rate can be derived as:

$$\bar{\varepsilon} = \frac{1}{I} \sum_{i=1}^{I} \varepsilon_i$$
and by computing separately:

$$\varepsilon_{\mathrm{female}} = \frac{1}{I_{\mathrm{female}}} \sum_{i \in \mathrm{female}} \varepsilon_i \qquad \text{and} \qquad \varepsilon_{\mathrm{male}} = \frac{1}{I_{\mathrm{male}}} \sum_{i \in \mathrm{male}} \varepsilon_i$$

the gender-balanced misclassification rate can be obtained as:

$$\tilde{\varepsilon} = \frac{1}{2} \left( \varepsilon_{\mathrm{female}} + \varepsilon_{\mathrm{male}} \right)$$
The previous scores are different from the test set misclassification rate, calculated as:

$$\varepsilon^{*} = \frac{1}{n} \sum_{i=1}^{I} n_i\, \varepsilon_i \qquad \text{with} \qquad n = \sum_{i=1}^{I} n_i$$
Scores $\bar{\varepsilon}$ and $\varepsilon^{*}$ are formally identical if and only if $n_i$ does not depend on $i$, i.e. when the test set contains an identical number $n/I$ of test utterances for each speaker. As it is usually observed that speaker recognition performance may vary with the speaker's gender, the comparison of $\bar{\varepsilon}$ and $\tilde{\varepsilon}$ can show significant differences if the registered population is not gender-balanced. Therefore, we believe that an accurate description of the identification performance requires the three numbers $\tilde{\varepsilon}$, $\bar{\varepsilon}$, and $\varepsilon^{*}$ to be provided.
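Given the per-speaker rates, the three global figures (average, gender-balanced, and test set) follow directly; a sketch, assuming the per-speaker rates, utterance counts, and gender labels are available as dictionaries and that both gender groups are non-empty (all names are ours):

```python
def global_rates(eps, n, gender):
    """Average, gender-balanced, and test-set misclassification rates.

    eps    : dict speaker -> epsilon_i (per-speaker misclassification rate)
    n      : dict speaker -> n_i (number of test utterances)
    gender : dict speaker -> 'f' or 'm' (both groups assumed non-empty)
    """
    speakers = list(eps)
    # Average: every speaker weighted equally.
    avg = sum(eps[i] for i in speakers) / len(speakers)
    # Gender-balanced: average within each group, then average the two groups.
    by_gender = {}
    for g in ('f', 'm'):
        group = [i for i in speakers if gender[i] == g]
        by_gender[g] = sum(eps[i] for i in group) / len(group)
    gender_balanced = (by_gender['f'] + by_gender['m']) / 2
    # Test set: every utterance weighted equally.
    total = sum(n[i] for i in speakers)
    test_set = sum(n[i] * eps[i] for i in speakers) / total
    return avg, gender_balanced, test_set
```

With an unbalanced test set the three numbers generally differ, which is exactly why the text recommends reporting all three.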

Mistrust rates


Taking another point of view, performance scores can be designed to measure how reliable the decision of the system is once it has assigned a given identity; in other words, to provide an estimate of the probability $P(j \ne i \mid \hat{\imath} = i)$, i.e. the probability that the speaker is not really $X_i$ when the system under test has output $X_i$ as the most likely identity.

To define the mistrust rate, we have to introduce the following notation:

$$m_i = \sum_{j=1}^{I} \sum_{k=1}^{n_j} \delta\!\left(\hat{\imath}\!\left(z_k^{(j)}\right),\, i\right) \qquad \mu_i = \frac{m_i}{n} \qquad I' = \#\{\, i : m_i > 0 \,\}$$

By definition, $m_i$ and $\mu_i$ are respectively the number and the proportion of test utterances identified as $X_i$ over the whole test set, while $I'$ is the number of registered speakers whose identity was assigned at least once to a test utterance.

The mistrust rate for speaker $X_i$ can then be computed (for $m_i > 0$) as:

$$\eta_i = \frac{1}{m_i} \sum_{j \ne i} \sum_{k=1}^{n_j} \delta\!\left(\hat{\imath}\!\left(z_k^{(j)}\right),\, i\right)$$

Here again, if $m_i = 0$, $\eta_i$ is undefined, but $m_i\,\eta_i = 0$.
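The mistrust rate mirrors the misclassification rate with the roles of true and predicted identities swapped; a sketch in the same style as before (names are ours):

```python
from collections import defaultdict

def per_speaker_mistrust(true_ids, predicted_ids):
    """eta_i = (utterances wrongly identified as i) / m_i,
    returned only for speakers with m_i > 0."""
    m = defaultdict(int)      # m_i: utterances identified as speaker i
    wrong = defaultdict(int)  # among those, utterances not produced by i
    for i, i_hat in zip(true_ids, predicted_ids):
        m[i_hat] += 1
        if i != i_hat:
            wrong[i_hat] += 1
    return {i: wrong[i] / m[i] for i in m}
```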

We suggest that the term resistant speaker could be used to qualify a registered speaker with a low mistrust rate, and the term vulnerable speaker for a speaker with a high mistrust rate.

From this speaker-by-speaker score, the average mistrust rate can be derived as:

$$\bar{\eta} = \frac{1}{I'} \sum_{i \,:\, m_i > 0} \eta_i$$

and by computing separately:

$$\eta_{\mathrm{female}} = \frac{1}{I'_{\mathrm{female}}} \sum_{\substack{i \in \mathrm{female} \\ m_i > 0}} \eta_i \qquad \text{and} \qquad \eta_{\mathrm{male}} = \frac{1}{I'_{\mathrm{male}}} \sum_{\substack{i \in \mathrm{male} \\ m_i > 0}} \eta_i$$

the gender-balanced mistrust rate is defined as:

$$\tilde{\eta} = \frac{1}{2} \left( \eta_{\mathrm{female}} + \eta_{\mathrm{male}} \right)$$
By noticing now that:

$$\sum_{i=1}^{I} m_i\, \eta_i = \sum_{i=1}^{I} n_i\, \varepsilon_i$$

there appears to be no need to define a test set mistrust rate $\eta^{*}$. In other words: the test set mistrust rate is equal to the test set misclassification rate.

From a practical point of view, misclassification rates and mistrust rates can be obtained with exactly the same scoring programs, operating successively on the confusion matrix and on its transpose.
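This remark can be checked on a toy confusion matrix $C$, whose entry in row $i$, column $j$ counts the utterances of speaker $i$ identified as speaker $j$: one row-wise error-rate routine yields misclassification rates on $C$ and mistrust rates on its transpose (a sketch; the 3-speaker matrix below is invented for illustration):

```python
def row_error_rates(C):
    """For each row i with a nonzero total: off-diagonal mass / row total."""
    rates = {}
    for i, row in enumerate(C):
        total = sum(row)
        if total > 0:
            rates[i] = (total - row[i]) / total
    return rates

def transpose(C):
    return [list(col) for col in zip(*C)]

C = [[8, 2, 0],   # speaker 0: 10 utterances, 2 identified as speaker 1
     [1, 9, 0],
     [1, 1, 8]]
misclassification = row_error_rates(C)     # per-speaker epsilon_i
mistrust = row_error_rates(transpose(C))   # per-speaker eta_i
```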


Confidence ranks


Most speaker identification systems use a similarity measure between a test utterance and all training patterns to decide, by a nearest neighbour decision rule, on the identity of the test speaker. In this case, for a test utterance $z$, an ordered list of registered speakers can be produced:

$$X_{(1)},\, X_{(2)},\, \dots,\, X_{(I)}$$

where, for every index $j$, $X_{(j)}$ is judged closer to the test utterance than $X_{(j+1)}$ is.

The identification rank of the genuine speaker of utterance $z_k^{(i)}$ can then be expressed as:

$$R\!\left(z_k^{(i)}\right) = r \quad \text{such that} \quad X_{(r)} = X_i$$

In other words, $R(z_k^{(i)})$ is the position at which the correct speaker appears in the ordered list of neighbours of the test utterance. Note that a correct identification of $z_k^{(i)}$ corresponds to $R(z_k^{(i)}) = 1$.

Under the assumption that $n_i > 0$, let us now denote:

$$\nu_i(r) = \#\left\{\, k : R\!\left(z_k^{(i)}\right) \le r \,\right\}$$

i.e. the number of test utterances of speaker $X_i$ whose identification rank is at most $r$. The $x\%$ confidence rank for speaker $X_i$, which we will denote here as $R_i^{x\%}$, can then be defined as the smallest integer for which at least $x\%$ of the test utterances belonging to speaker $X_i$ are part of the $R_i^{x\%}$ nearest neighbours in the ordered list of candidates. Hence the formulation:

$$R_i^{x\%} = \min \left\{\, r : \nu_i(r) \ge \frac{x}{100}\; n_i \,\right\}$$

Then, the average $x\%$ confidence rank can be computed as the average of $R_i^{x\%}$ over all registered speakers (for which $n_i > 0$):

$$\bar{R}^{x\%} = \frac{1}{I} \sum_{i=1}^{I} R_i^{x\%}$$
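The per-speaker confidence rank is a simple threshold search over the identification ranks; a sketch, assuming one speaker's ranks are collected in a list (names are ours):

```python
def confidence_rank(ranks, x):
    """Smallest rank r such that at least x% of the utterances
    have identification rank <= r.

    ranks : list of identification ranks R(z_k) for one speaker
    x     : confidence level in percent (0 < x <= 100)
    """
    needed = x / 100 * len(ranks)
    for r in range(1, max(ranks) + 1):
        if sum(1 for rk in ranks if rk <= r) >= needed:
            return r
```

For instance, with ranks [1, 1, 2, 5], half the utterances are already at rank 1, so the 50% confidence rank is 1, while the 100% confidence rank is 5.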

Though a gender-balanced confidence rank could be defined analogously to the gender-balanced misclassification and mistrust rates, the relevance of this figure is not clear.

If finally we denote:

$$\nu(r) = \sum_{i=1}^{I} \nu_i(r)$$

the test set $x\%$ confidence rank is defined as:

$$R^{*\,x\%} = \min \left\{\, r : \nu(r) \ge \frac{x}{100}\; n \,\right\}$$

Average scores, gender-balanced scores and test set scores all fall under the same formalism. If we denote as $\rho_i$ a certain quantity which we will call the relative representativity of speaker $X_i$, and which satisfies $\sum_{i=1}^{I} \rho_i = 1$, we can consider the linear combination:

$$\varepsilon(\rho) = \sum_{i=1}^{I} \rho_i\, \varepsilon_i$$

It is clear that:

$\varepsilon(\rho) = \bar{\varepsilon}$ for $\rho_i = \frac{1}{I}$
$\varepsilon(\rho) = \tilde{\varepsilon}$ for $\rho_i = \frac{1}{2\,I_{g(i)}}$, where $g(i)$ denotes the gender group of speaker $X_i$ and $I_{g(i)}$ its size
$\varepsilon(\rho) = \varepsilon^{*}$ for $\rho_i = \frac{n_i}{n}$

Therefore $\bar{\varepsilon}$, $\tilde{\varepsilon}$ and $\varepsilon^{*}$ correspond to different estimates of a global score, under various assumptions on the relative representativity of each speaker. For average scores, it is assumed that each speaker is equally representative, irrespective of his or her gender group; if the test population is strongly unbalanced, this hypothesis may not be relevant (unless there is a reason for it). For gender-balanced scores, each test speaker is supposed to be representative of his or her gender group, and each gender group is supposed to be equiprobable. Test set scores make the assumption that each test speaker has a representativity proportional to his or her number of test utterances, which is certainly not always a meaningful hypothesis.
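The unifying formalism amounts to one weighted sum with different weight vectors; a sketch showing that uniform weights recover the average score and utterance-proportional weights recover the test set score (data invented for illustration):

```python
def weighted_score(eps, rho):
    """Linear combination sum_i rho_i * eps_i; the weights must sum to 1."""
    assert abs(sum(rho.values()) - 1.0) < 1e-9
    return sum(rho[i] * eps[i] for i in eps)

eps = {1: 0.0, 2: 0.2, 3: 0.6}   # per-speaker misclassification rates
n = {1: 4, 2: 2, 3: 2}           # test utterances per speaker
I = len(eps)
total = sum(n.values())

average = weighted_score(eps, {i: 1 / I for i in eps})          # rho_i = 1/I
test_set = weighted_score(eps, {i: n[i] / total for i in eps})  # rho_i = n_i/n
```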

Test set scores can therefore be used as an overall performance measure if the test data accurately represent the profile and behaviour of the user population, both in terms of population composition and of individual frequency of use. If only the composition of the test set population is representative of the general user population, average scores neutralise the possible discrepancies in the number of utterances per speaker. If, finally, the composition of the test set speaker population is not representative, gender-balanced scores provide a general-purpose estimate.

If there is a way to estimate separately the relative representativity of each registered speaker, a representative misclassification rate can be computed as in equation (11.30). Conversely, techniques such as those used in public opinion polls can be employed to select a representative test population when setting up an evaluation experiment.