


A verification system can be viewed as a function which assigns, to a test utterance z and a claimed identity i, a boolean value v(z, i), which is equal to 1 if the utterance is accepted, and 0 if it is rejected.

Two types of error can then occur: either a genuine speaker is rejected, or an impostor is accepted. Hence, a false rejection corresponds to:

v(z, i) = 0 while z was actually produced by the registered speaker i,

and a false acceptance happens if:

v(z, i) = 1 while z was actually produced by an impostor claiming identity i.

In the rest of this section, we will denote a diagnostic of acceptance as A and a diagnostic of rejection as R; S_i denotes the event that the applicant speaker is the registered speaker i claiming his own identity, and X_j the event that the applicant speaker is the impostor j.

We first address aspects of static evaluation, that is, what meaningful figures can be computed to measure the performance of a system over which the experimenter has absolutely no control. Then, after discussing the role of decision thresholds, we review several approaches that allow a dynamic evaluation of the system to be obtained, i.e. in a relatively threshold-independent manner.

False rejection rates


If n_i > 0, where n_i denotes the number of genuine attempts by registered speaker i in the test set, the false rejection rate for speaker i is defined quite naturally as:

FR_i = (number of rejected genuine attempts by speaker i) / n_i

Rate FR_i provides an estimate of P(R | S_i), i.e. the probability that the system makes a diagnostic of rejection, given that the applicant speaker is the authorised speaker i (claiming his own identity). If n_i = 0, FR_i is undefined.

As for closed-set identification, the terms dependable speaker and unreliable speaker can be used to qualify speakers with a low or (respectively) high false rejection rate.

From the speaker-based figures, the average false rejection rate can be obtained by averaging with equal weight over the registered speakers for which FR_i is defined:

FR_avg = (1/N) * sum over registered speakers i of FR_i

while the gender-balanced false rejection rate is:

FR_gb = (FR_male + FR_female) / 2

where FR_male and FR_female denote the average false rejection rates computed separately over the male and the female registered speakers.

The test set false rejection rate is calculated by pooling all genuine attempts:

FR_test = (total number of false rejections) / (total number of genuine attempts)

Rates FR_gb, FR_avg and FR_test provide three different estimates of the overall probability that a genuine attempt is rejected. Rate FR_test is influenced by the test set distribution of genuine attempts, which may only be artefactual.
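As a concrete illustration, the three estimates can be computed from per-speaker counts. The sketch below uses illustrative counts (loosely echoing the worked example of Table 11.4; the sex assignment is an assumption, not taken from the table):

```python
# Illustrative sketch: the three false rejection estimates from per-speaker
# counts.  Each record: (speaker_id, sex, n_genuine_attempts, n_false_rejections).
records = [
    ("s1", "M", 9, 3),
    ("s2", "F", 2, 0),
    ("s3", "F", 7, 3),
]

# Per-speaker rate FR_i (defined only when the speaker has genuine attempts).
fr = {sid: rej / n for sid, _, n, rej in records if n > 0}

# Average false rejection rate: mean of the per-speaker rates.
fr_avg = sum(fr.values()) / len(fr)

# Gender-balanced rate: average the male and female averages with equal weight.
by_sex = {}
for sid, sex, n, rej in records:
    if n > 0:
        by_sex.setdefault(sex, []).append(rej / n)
fr_gb = sum(sum(v) / len(v) for v in by_sex.values()) / len(by_sex)

# Test-set rate: pooled over all genuine attempts, hence biased towards
# speakers who happen to have many trials in the test set.
fr_test = sum(rej for *_, rej in records) / sum(n for _, _, n, _ in records)
```

Note how the three figures differ on the same data: the pooled rate weights speakers by their number of trials, while the other two weight them (or the two gender classes) equally.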


False acceptance rates and imposture rates


In contrast with false rejection, there are several ways to score false acceptance, depending on whether it is the vulnerability of the registered speakers or the skill of the impostors which is considered. Moreover, the way to evaluate false acceptance rates and imposture rates depends on whether the identity of each impostor is known or not.

If the impostor identities are known, the false acceptance rate in favour of impostor j against registered speaker i can be defined, whenever the number n_ij of attempts by impostor j claiming identity i is non-zero, as:

FA_ij = (number of accepted attempts by impostor j claiming identity i) / n_ij

Here, FA_ij can be viewed as an estimate of P(A | X_j, i), i.e. the probability that the system makes a diagnostic of acceptance, given that the applicant speaker is the impostor j claiming identity i.

Then, the average false acceptance rate against speaker i can be obtained (if at least one impostor has attempted against speaker i) by averaging the false acceptances over all impostors:

FA_i = average of FA_ij over all impostors j who attacked identity i

and similarly the average imposture rate in favour of impostor j can be calculated (if impostor j has made at least one attempt) as:

IR_j = average of FA_ij over all identities i claimed by impostor j

Rates FA_i and IR_j provide (respectively) estimates of the probability that an impostor attempt against identity i is accepted, and of the probability that an attempt by impostor j is accepted, under the assumption that all impostors and all claimed identities are equiprobable. The number FA_i indicates the false acceptance rate obtained on average by each impostor claiming identity i, while IR_j indicates the success rate of impostor j, averaged over the claimed identities. A registered speaker can be more or less resistant (low FA_i) or vulnerable (high FA_i), whereas impostors with a high IR_j can be viewed as skilled impostors, as opposed to poor impostors for those with a low IR_j.

The average false acceptance rate, which is equal to the average imposture rate, is obtained as:

FA_avg = average of FA_ij over all couples (i, j)

i.e. as the average of the false acceptances over all couples (i, j), which provides an estimate of the overall probability of false acceptance under the assumption that all couples (i, j) are equally likely.
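These scores with known impostor identities can be sketched as follows; the trial counts below are illustrative assumptions, not figures from the tables:

```python
# Illustrative sketch: false acceptance scores when impostor identities are known.
trials = {
    # (claimed speaker, impostor): (attempts, acceptances) -- assumed data
    ("s1", "i1"): (3, 1),
    ("s1", "i2"): (5, 1),
    ("s2", "i1"): (6, 5),
    ("s2", "i2"): (3, 0),
}

# FA_ij: false acceptance rate in favour of impostor j against speaker i.
fa = {key: acc / n for key, (n, acc) in trials.items() if n > 0}

# Average false acceptance rate against each speaker (equal weight per impostor).
speakers = {i for i, _ in fa}
fa_spk = {i: sum(r for (s, _), r in fa.items() if s == i)
             / sum(1 for (s, _) in fa if s == i)
          for i in speakers}

# Average imposture rate of each impostor (equal weight per claimed identity).
impostors = {j for _, j in fa}
ir = {j: sum(r for (_, jj), r in fa.items() if jj == j)
         / sum(1 for (_, jj) in fa if jj == j)
      for j in impostors}

# Average false acceptance rate: mean over all couples (i, j).
fa_avg = sum(fa.values()) / len(fa)
```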

Here, separate estimates of the average false acceptance rate on the male and the female registered populations can be obtained as FA_male and FA_female, by averaging FA_i over the male and the female registered speakers respectively, and a gender-balanced false acceptance rate is provided by:

FA_gb = (FA_male + FA_female) / 2

The question could be raised of whether it is desirable to compute a score which would provide an estimate of the false acceptance rate for a gender-balanced impostor population. We propose not to go that far, as it would clearly lead to a duplication of scoring figures, but the influence of the impostors' gender can be partly neutralised by the experimental design:

It may also be interesting to calculate imposture rates regardless of the claimed identities. In this case, we define the imposture rate in favour of impostor j regardless of the claimed identity as:

IR'_j = (number of accepted attempts by impostor j) / (total number of attempts by impostor j)

and the average imposture rate regardless of the claimed identity as:

IR'_avg = average of IR'_j over all impostors j

However, none of the rates defined above (FA_ij, FA_i, IR_j, FA_avg, FA_gb, IR'_j and IR'_avg) can be evaluated when the identities of the impostors are not known. In this case, false acceptance rates and imposture rates can be calculated under the assumption that all impostor test utterances are produced by distinct impostors.

The false acceptance rate against speaker i assuming distinct impostors can be obtained (whenever at least one impostor attempt against speaker i is available) as:

FA'_i = (number of accepted impostor attempts against speaker i) / (number of impostor attempts against speaker i)

and the average false acceptance rate assuming distinct impostors is defined as:

FA'_avg = average of FA'_i over all registered speakers i

Here again, separate estimates of the average false acceptance rate assuming distinct impostors on the male and the female registered populations can be obtained as FA'_male and FA'_female, with the gender-balanced false acceptance rate assuming distinct impostors being:

FA'_gb = (FA'_male + FA'_female) / 2

Rate FA'_i provides a speaker-dependent estimate of the probability of false acceptance against identity i assuming distinct impostors. Rate FA'_avg can be viewed as an estimate of the overall probability of false acceptance under the assumptions that impostors are distinct and that all claimed identities are equally likely, while FA'_gb can be understood as another estimate of the same probability under the assumptions that impostors are distinct, that attempts against male speakers and against female speakers are equiprobable, and that within a gender class all claimed identities are equally likely.
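Under the distinct-impostor assumption only per-speaker totals matter, which keeps the bookkeeping simple. A sketch with illustrative counts (the per-speaker attempt totals loosely echo the worked example; the sex assignment and acceptance counts are assumptions):

```python
# Sketch under the distinct-impostor assumption: every impostor utterance is
# treated as coming from a different impostor.
# attacks[speaker] = (sex, n_impostor_attempts, n_false_acceptances) -- assumed
attacks = {
    "s1": ("M", 8, 2),
    "s2": ("F", 9, 6),
    "s3": ("F", 5, 1),
}

# FA'_i: false acceptance rate against speaker i assuming distinct impostors.
fa_d = {i: acc / n for i, (_, n, acc) in attacks.items() if n > 0}

# FA'_avg: average over registered speakers (all claimed identities equally likely).
fa_d_avg = sum(fa_d.values()) / len(fa_d)

# Gender-balanced version: average the per-sex averages with equal weight.
per_sex = {}
for i, (sex, n, acc) in attacks.items():
    if n > 0:
        per_sex.setdefault(sex, []).append(acc / n)
fa_d_gb = sum(sum(v) / len(v) for v in per_sex.values()) / len(per_sex)

# Test-set rate: pooled, hence sensitive to how the impostor attempts are
# spread over the registered speakers.
fa_test = (sum(acc for _, _, acc in attacks.values())
           / sum(n for _, n, _ in attacks.values()))
```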

If, finally, false acceptances are scored globally, regardless of either the impostor identity or the claimed identity, we obtain the test set false acceptance rate, which is identical to the test set imposture rate:

FA_test = (total number of false acceptances) / (total number of impostor attempts)

Here, FA_test provides a test set estimate of the overall probability of false acceptance, which is biased by the composition of the registered population and by a possible unevenness of the number of impostor trials for each speaker.

For scoring false acceptance rates, we believe that, besides FA_test, it is necessary to report on the average and gender-balanced false acceptance rates (when the impostors are known), or on their counterparts assuming distinct impostors (when they are not known), as the test set score may be significantly influenced by the test data distribution. The other scores described in this section are mainly useful for diagnostic analysis.


Relative unreliability, vulnerability and imitation ability


It can also be of major interest to estimate the contribution of a given registered speaker i to the overall false rejection rate, which can be denoted as P(S_i | R), i.e. the probability that the identity of the speaker was i, given that a (false) rejection diagnostic was made on a genuine speaker (claiming his own identity).

We can thus define the average relative unreliability for speaker i as his false rejection rate normalised by the sum of the false rejection rates over all registered speakers:

U_i = FR_i / (sum over registered speakers k of FR_k)

or his test set relative unreliability, as his share of the total number of false rejections observed on the test set. By construction, the relative unreliability scores sum to 1 over the registered population.

From a different angle, the relative vulnerability for a given registered speaker i (i.e. P(S_i | A)) can be measured as his contribution to the false acceptance rate.

Thus, the average relative vulnerability for speaker i can be defined as his average false acceptance rate normalised by the sum of the average false acceptance rates over all registered speakers; his relative vulnerability assuming distinct impostors, as the analogous ratio computed from the rates assuming distinct impostors; and his test set relative vulnerability, as his share of the total number of false acceptances observed on the test set.

Finally, by considering the relative success of impostor j, i.e. P(X_j | A), we define in a dual way, as above, the average imitation ability of impostor j, his imitation ability regardless of the claimed identity, and his test set relative imitation ability.

The relative unreliability and vulnerability can also be calculated relative to the male/female population.
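The difference between the test set and the average flavours of these relative scores can be made concrete. The sketch below uses the genuine-attempt counts of the worked example (3 false rejections out of 9, 0 out of 2, 3 out of 7); the normalised-rate reading of the average score is our interpretation of the definition above:

```python
# Illustrative sketch: two flavours of relative unreliability.
counts = {"s1": (9, 3), "s2": (2, 0), "s3": (7, 3)}  # (attempts, rejections)

# Test set relative unreliability: each speaker's share of the observed
# false rejections.
total_rej = sum(rej for _, rej in counts.values())
rel_test = {i: rej / total_rej for i, (_, rej) in counts.items()}

# Average relative unreliability: each speaker's share of the per-speaker
# rates, so that speakers with few trials are not under-represented.
rates = {i: rej / n for i, (n, rej) in counts.items()}
rate_sum = sum(rates.values())
rel_avg = {i: r / rate_sum for i, r in rates.items()}

# With these counts s1 and s3 tie on the test set measure, but s3 is the
# more unreliable speaker on the average measure (3/7 > 3/9).
```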


As for misclassification rates, the gender-balanced, average and test set false rejection rates, as well as the gender-balanced and average false acceptance rates assuming distinct impostors and the test set false acceptance rate, correspond to different estimates of a global score, under various assumptions on the relative representativity of each genuine test speaker. The discussion of Section 11.4.2 can be readily generalised.

Concerning the gender-balanced and average false acceptance rates with known impostors, a relative representativity w_ij can be defined for each couple (i, j) of registered speaker and impostor, with the weights w_ij summing to 1 over all couples; the average and gender-balanced rates then correspond to particular choices of these weights (uniform weights over all couples for the former, equal weights for the two gender classes for the latter).

In the case of casual impostors, choosing a selective attempt configuration towards same-sex speakers is equivalent to the assumption:

w_ij = 0 whenever registered speaker i and impostor j are of opposite sex,

i.e. that the representativity of a cross-sex attempt is zero.

Studies allowing better definition of the representativity of impostor attempts against registered speakers  would be of great help to increase the relevance of evaluation scores.


Tables 11.4, 11.5 and 11.6 give examples of false acceptance rates, false rejection rates and imposture rates, as well as of unreliability, vulnerability and imitation ability. As for the closed-set identification examples, the number of tests used to design these examples is too small to guarantee any statistical validity.


Table 11.4: Genuine attempts

Out of 18 genuine attempts, 6 false rejections are observed, hence a test set false rejection rate of 6/18 = 1/3. Nevertheless, the 3 false rejections out of 9 trials for the first speaker do not have the same impact on the average false rejection rate as the 3 false rejections out of 7 trials for the third speaker. In fact, while the second speaker seems to be the most reliable, the third speaker appears more unreliable than the first one on average, his relative unreliability score being the higher of the two.


Table 11.5: Impostor attempts against registered speakers

One out of the three impostor trials from the first impostor against the first registered speaker was successful, while none of those from the second impostor were; hence a false acceptance rate of 1/3 in favour of the first impostor against the first speaker. But if the identities of the impostors are not known, it can only be measured that, out of the 8 impostor attempts against the first speaker, 2 were successful, i.e. a rate of 1/4 assuming distinct impostors. As no impostor attempt from the first impostor against the third speaker was recorded, the average false acceptance rate against the third speaker can only be averaged over one impostor. The three ways of computing false acceptance rates, namely the average false acceptance rate, the average false acceptance rate assuming distinct impostors and the test set false acceptance rate, provide significantly different scores, as the number of test utterances is not balanced across all possible couples of registered speaker and impostor. In this example, the relative vulnerability scores indicate that the third speaker would appear as the most resistant, while the second speaker would seem to be the most vulnerable.


Table 11.6: Impostor attempts from impostors

Out of the 6 trials from the first impostor against the first speaker, 2 turned out to be successful, while out of 6 other trials against the second speaker, 5 led to a (false) acceptance. As no attempts from this impostor against the third speaker were recorded, his average imposture rate can be estimated as (2/6 + 5/6)/2 = 7/12; if we now ignore the actual identities of the violated speakers and summarise his success globally, we again obtain 7/12, since his attempts are evenly spread over the two attacked speakers. While the corresponding rates for the second impostor are much lower, the average imposture rate regardless of the claimed identity indicates that the ``average'' impostor is successful almost 2 times out of 5 in his attempts. All estimates of the relative imitation ability agree that the first impostor is a much more skilled impostor than the second one, who seems to be quite poor.

Expected benefit

From now on, we will denote as FR and FA the false rejection and false acceptance rates, whichever exact estimate is actually chosen.

Estimates of the following quantities are required:

With these estimates, the expected benefit B(FR, FA) of a verification system with false rejection rate FR and false acceptance rate FA can be computed as a combination of the rates of correct and erroneous decisions, weighted by their respective probabilities, costs and benefits.

In particular, when the two types of attempt are equiprobable and the two types of error equally costly, the resulting figure is the equal-risk equal-cost expected benefit.

The expected benefit is usually a meaningful static evaluation figure for potential clients of the technology. It must, however, be understood only as the average expected benefit per user attempt. It does not take into account external factors such as the psychological impact of the system, its maintenance costs, etc.
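Since the displayed formula is not reproduced above, one common form of such a benefit function can be sketched; the priors, gains and costs below are illustrative assumptions, not values from this handbook:

```python
# Hedged sketch of an expected-benefit computation per access attempt.
def expected_benefit(fr, fa, p_genuine=0.5, gain_accept=1.0, cost_fr=1.0,
                     gain_reject=1.0, cost_fa=1.0):
    """Average benefit per attempt: correct decisions earn a gain,
    errors incur a cost, weighted by the prior of each attempt type."""
    p_impostor = 1.0 - p_genuine
    genuine_term = (1 - fr) * gain_accept - fr * cost_fr
    impostor_term = (1 - fa) * gain_reject - fa * cost_fa
    return p_genuine * genuine_term + p_impostor * impostor_term
```

With the default equal priors and equal unit costs and gains, this reduces to 1 - (FR + FA), so at an operating point with FR = FA = E the equal-risk equal-cost benefit is 1 - 2E.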

Threshold setting

Speaker verification systems usually proceed in two steps. First, a matching score s(z, i) is computed between the test utterance z and the reference model corresponding to the claimed identity i. Then, the value of the matching score is compared to a threshold theta_i, and a decision is taken as follows: the claim is accepted if s(z, i) reaches the threshold theta_i, and rejected otherwise.

In other words, verification is positive only if the match between the test utterance and the reference model (for the claimed identity) is close enough.

A distinction can be made depending on whether each registered speaker has his individual threshold or whether a single threshold is common to all speakers. In other words, if theta_i depends on i, the system uses speaker-dependent thresholds, whereas if theta_i does not depend on i, the system uses a speaker-independent threshold. We will denote as Theta the threshold vector (theta_1, ..., theta_c), and as FR(Theta) and FA(Theta) the false rejection and acceptance rates corresponding to Theta.

The values of Theta have an inverse impact on the false rejection rate and on the false acceptance rate. Thus, with a low theta_i, fewer genuine attempts from speaker i will be rejected, but more impostors will be erroneously accepted as speaker i. Conversely, if theta_i is increased, the false acceptance rate against speaker i will generally decrease, at the expense of an increasing false rejection rate for that speaker. The goal of dynamic evaluation is to provide a description of the system performance which is as independent as possible of the threshold values.
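This inverse impact is easy to exhibit on match scores directly. The sketch below uses illustrative genuine and impostor score lists and the accept-if-score-reaches-threshold rule:

```python
# Sketch: the opposite effect of the threshold on the two error rates.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]   # illustrative values
impostor_scores = [0.7, 0.5, 0.45, 0.3, 0.2]

def rates(theta):
    """False rejection and false acceptance rates at threshold theta."""
    fr = sum(s < theta for s in genuine_scores) / len(genuine_scores)
    fa = sum(s >= theta for s in impostor_scores) / len(impostor_scores)
    return fr, fa

# A low threshold rejects no genuine attempt but accepts many impostors;
# raising it trades false acceptances for false rejections.
low = rates(0.35)
high = rates(0.65)
```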

The setting of thresholds is conditioned on the specification of an operating constraint which expresses the compromise that has to be reached between the two types of error. Among many possibilities, the most popular ones are:

Two procedures are classically used to set the thresholds: the a priori threshold setting procedure and the a posteriori threshold setting procedure.

When the a priori threshold setting procedure is implemented, the threshold vector is estimated from a set of tuning data, which can be either the training data themselves or a new set of unseen data. Then, the false rejection and acceptance rates are estimated on a disjoint test set. Naturally, there must be no intersection between the tuning data set and the test data set. Not only must the speech material of genuine attempts and impostor attempts differ between the two sets, but the bundle of pseudo-impostors used to tune the threshold for a registered speaker should also not contain any of the impostors who will be tested against this very speaker within the test set. Of course, the volume of any additional speech data used for threshold setting must be counted as training material when reporting on the training speech quantity.

When the a posteriori threshold setting procedure is adopted, the thresholds are set on the test data themselves. In this case, the resulting false rejection and acceptance rates must be understood as the performance of the system with ideal thresholds. Though this procedure does not lead to a fair measure of the system performance, it can be interesting, for diagnostic evaluation, to compare the error rates obtained with a priori thresholds to those obtained with a posteriori thresholds.
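The a priori procedure can be sketched as follows, with the equal-error-rate point as an example operating constraint and illustrative tuning scores:

```python
# Sketch of a priori threshold setting: tune theta on held-out tuning scores
# (disjoint from the test data), then report error rates on the test set.
def tune_threshold(gen_tune, imp_tune, candidates):
    """Pick the candidate threshold minimising |FR - FA| on the tuning data
    (one possible operating constraint: the equal error rate point)."""
    def gap(t):
        fr = sum(s < t for s in gen_tune) / len(gen_tune)
        fa = sum(s >= t for s in imp_tune) / len(imp_tune)
        return abs(fr - fa)
    return min(candidates, key=gap)
```

An a posteriori evaluation would instead run the same selection on the test scores themselves, yielding the ideal-threshold performance discussed above.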

System operating characteristic

Whichever operating constraint is chosen to tune the thresholds, it represents only one of an infinite number of possible trade-offs, and it is generally not possible to predict, from the false rejection and false acceptance rates obtained at one particular operating point, what the error rates would be at another operating point. In order to be able to estimate the performance of the system under any conditions, its behaviour has to be modelled, so that its performance can be characterised independently of any threshold setting.

In the case of a speaker-independent threshold, the false rejection and the false acceptance rates can be written as functions of a single parameter theta, namely FR(theta) and FA(theta). Then, a more compact way of summarising the system's behaviour consists in expressing FA directly as a function of FR (or the opposite), that is:

FA = f(FR)
Using terminology derived from Communication Theory, the function f is sometimes called the Receiver Operating Characteristic, and the corresponding curve FA = f(FR) the ROC curve. Generally, f is monotonically decreasing and satisfies the limit conditions f(0) = 1 and f(1) = 0. Figure 11.1 depicts a typical ROC curve.

Figure 11.1: A typical ROC curve and its equal error rate 

The point-by-point knowledge of the function f provides a threshold-independent description of all possible operating conditions of the system. In particular:

In practice, there are several ROC curves, depending on what type of false rejection and acceptance scores are used:

a gender-balanced ROC: FA_gb = f_gb(FR_gb) (or its counterpart assuming distinct impostors if the impostors are unknown),
an average ROC: FA_avg = f_avg(FR_avg) (or its counterpart assuming distinct impostors if the impostors are unknown),
a test set ROC: FA_test = f_test(FR_test).

However, exhaustively keeping a whole ROC curve lacks conciseness, and it is classically felt desirable to condense system performance into a single figure. Traditionally, the EER is chosen for this purpose. In this case, there is a distinct equal error rate for each ROC curve, which can be denoted by EER_gb, EER_avg and EER_test, respectively.
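An empirical ROC and its EER can be computed directly from scores by sweeping the threshold. The sketch below uses illustrative score lists and the crudest EER estimate (the sampled point where the two rates are closest); real evaluations would interpolate between points:

```python
# Sketch: empirical ROC points (FR, FA) and a crude equal error rate.
genuine = [0.9, 0.8, 0.7, 0.55, 0.4]   # illustrative match scores
impostor = [0.6, 0.5, 0.35, 0.3, 0.1]

def roc_points():
    """Sweep a speaker-independent threshold over all observed scores."""
    thresholds = sorted(set(genuine + impostor + [0.0, 1.0]))
    pts = []
    for t in thresholds:
        fr = sum(s < t for s in genuine) / len(genuine)
        fa = sum(s >= t for s in impostor) / len(impostor)
        pts.append((fr, fa))
    return pts

# Crude EER estimate: sampled point of the curve where FR and FA are closest.
eer_point = min(roc_points(), key=lambda p: abs(p[0] - p[1]))
eer = sum(eer_point) / 2
```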

In the case of speaker-dependent thresholds, the false rejection and the false acceptance rates for each speaker i depend on a different parameter theta_i. Therefore, each speaker has his own ROC curve:

FA_i = f_i(FR_i)

In this case, there is no simple way of deriving an ``average'' ROC curve that would represent the general behaviour of the system. Current practice consists in characterising each individual ROC curve by its equal error rate EER_i, and in summarising the performance of the system by the average equal error rate, computed as the mean of the EER_i over the registered speakers. Note that a gender-balanced equal error rate and a test set equal error rate can be defined along the same lines as for the error rates themselves.
Though we use the same terminology for denoting equal error rates with speaker-dependent and speaker-independent thresholds, it must be stressed that the scores are not comparable. Therefore it should always be specified in which framework they are computed.

System characteristic modelling

Equal error rates can be interpreted as a very local property of the ROC curve. In fact, as the ROC curve usually has its concavity turned towards the origin, the EER gives an idea of how close the ROC curve is to the axes. However, this is a very incomplete picture of the general system performance level, as it is virtually impossible to predict the performance of the system under a significantly different operating condition.

Recent work by [Oglesby (1994)] has addressed the question of how to encapsulate the entire system characteristic into a single number. Oglesby's suggestions, which we develop now, consist in finding a simple one-parameter model which describes as accurately as possible the ROC curve over most of its domain of definition. If the approximation is good enough, reasonable error rate estimates can be derived for any operating point. As in the last section, we first discuss the case of a system with a speaker-independent threshold, and then extend the approach to speaker-dependent thresholds.

For modelling the relation between FR and FA, the simplest approach is to assume a linear operating characteristic, i.e. a relation between FR and FA of the kind:

FR + FA = 2 E_lin

where E_lin is a constant which can be understood as the linear-model EER (the value of both rates at the point where they are equal). However, typical ROC curves do not have a linear shape at all, and this model is too poor to be effective over a large domain.

A second possibility is to assume that the ROC curve has the approximate shape of the positive branch of a hyperbola, which supposes the relation:

FR * FA = E_hyp^2

Here E_hyp is another constant, which can be interpreted as the hyperbolic-model EER. The hyperbolic model is equivalent to a linear model in the log-error domain, and it usually fits the ROC curve much better. However, it has the drawback of not fulfilling the limit conditions, as FA tends to infinity when FR tends to 0, and f(1) = E_hyp^2 rather than 0.

A third possibility, proposed by Oglesby, is to use another one-parameter model, whose single constant E_O will be referred to as Oglesby's model EER. Oglesby reports a good fit of the model with experimental data, and underlines the fact that this model does fulfil the limit conditions f(0) = 1 and f(1) = 0.
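Fitting such a one-parameter model to measured ROC points is straightforward. Since the hyperbolic model has a closed-form fit in the log-error domain, the sketch below fits that model rather than Oglesby's (whose exact formula is not reproduced here); the sample points are illustrative:

```python
# Sketch: least-squares fit of the hyperbolic model FR * FA = E_h**2 to
# sampled ROC points, performed in the log-log domain.
import math

points = [(0.01, 0.40), (0.05, 0.08), (0.10, 0.04), (0.20, 0.02)]  # illustrative

# In log coordinates the model reads log FA = 2*log(E_h) - log FR, so the
# best-fit log(E_h) is the mean of (log FR + log FA) / 2 over the points.
log_e = sum(math.log(fr) + math.log(fa) for fr, fa in points) / (2 * len(points))
e_h = math.exp(log_e)   # hyperbolic-model EER
```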

The parametric approach is certainly a very relevant way to give a broader system characterisation. Nevertheless, several issues remain open.

First, it is clear that none of the models proposed above accounts for a possible skewness of the ROC curve. As Oglesby notes, addressing skewed characteristics would require introducing an additional variable, which would give rise to a second, non-intuitive figure.

A second question is what criterion should be minimised to fit the model curve to the true ROC curve. If we denote as D the optimisation domain on which the best fit is to be found, the most natural criterion would be to minimise the mean square error between the model curve and the true curve over D. However, an absolute error difference does not have the same meaning when FR changes order of magnitude, and an alternative could be to minimise the mean square error between the two curves in a log-log representation.

A third and most crucial question is how the unavoidable deviations between the model and the actual ROC  curve should be quantified and reported.

Here is a possible answer to these questions. Though the approach that we are going to present has not been extensively tested so far, we believe that it is worth exploring in the near future, as it may prove useful for summarising concisely the performance of a speaker verification system in a relatively meaningful and exploitable manner.

The solution proposed starts by fixing an accuracy epsilon for the ROC curve modelling. Two constraints can then be defined, bounding the deviation between the modelled and the exact error rates in each coordinate. Hence, when both constraints are satisfied, both relative differences between the modelled and exact false rejection and acceptance rates are below epsilon.

Then, a model of the ROC curve must be chosen, for instance Oglesby's model. However, if another model fits the curve better, it can be used instead, but it should preferably depend on a single parameter, and the link between the value of this parameter and the model equal error rate should be specified.

For a given value E of the model parameter, the lower and upper bounds of the epsilon-accuracy false rejection rate validity domain, FR_min and FR_max, are obtained by decreasing (or increasing) FR, starting from the initial value FR = E, until one of the two constraints of equations (11.78) and (11.80) is no longer satisfied. This process can be repeated for several values of E, varying for instance in small steps within a plausible interval. Finally, the value of E corresponding to the widest validity domain can be chosen as the system performance measure, together with the validity domain of the approximation. Note that E does not need to lie inside the validity domain for its value to be meaningful.
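The domain-expansion step can be sketched as follows. As a simplification of the two constraints above, the check below bounds only the relative false acceptance difference between the model and the measured curve; the curves, accuracy and step size are illustrative:

```python
# Sketch of the epsilon-accuracy validity-domain search around FR = E.
def validity_domain(model_fa, true_fa, e, epsilon, step=0.01):
    """Expand [lo, hi] from FR = E while the model stays within relative
    accuracy epsilon of the measured curve (FA coordinate only, as a
    simplification of the two constraints in the text)."""
    def ok(fr):
        m, t = model_fa(fr), true_fa(fr)
        return t > 0 and abs(m - t) / t <= epsilon
    lo = hi = e
    while lo - step > 0 and ok(lo - step):
        lo -= step
    while hi + step < 1 and ok(hi + step):
        hi += step
    return lo, hi
```

For example, with a hyperbolic model that matches the measured curve except at very low FR, the search stops expanding precisely where the measured curve starts to deviate.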

If the validity domain turns out to be too small, the process can be repeated after setting the accuracy  tex2html_wrap_inline48575 to a higher value. Another possibility is to give several model equal error rates , corresponding to several adjacent validity domains (with the same accuracy  tex2html_wrap_inline48575), i.e. a piecewise representation of the ROC curve.  

A first advantage of the parametric description is that it allows the behaviour of a speaker verification  system to be predicted over a more or less extended set of operating conditions. Clear answers could then be given to a potential client of the technology, as long as this client is able to specify his constraints. A second advantage is that the model EER  is a number which relates well to the conventional EER ; the new description would therefore not require the scientific community to change radically its way of assessing the performance of a speaker verification  system. The main drawback of the proposed approach is that it lacks experimental validation for the time being. We therefore suggest adopting it as an experimental evaluation methodology until it has been proven effective.

When dealing with a system using speaker-dependent thresholds , we are brought back to the difficulty of averaging ROC  curve models across speakers. The ROC  curve for each speaker tex2html_wrap_inline46933 can be summarised by a model equal error rate  tex2html_wrap_inline48589 and a tex2html_wrap_inline48557-accuracy  false acceptance  rate validity domain tex2html_wrap_inline48593. Lacking a more relevant solution, we suggest characterising the average system performance by averaging, across speakers, the model EER  and the bounds of the validity interval. The global system performance could thus be given as an average model EER :
and an average tex2html_wrap_inline48557-accuracy  false acceptance  rate validity domain:


The same approach can be implemented, with different weights, to compute a gender-balanced model EER    tex2html_wrap_inline48597 and a test set model EER   tex2html_wrap_inline48599, and the corresponding validity domains.
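The speaker-averaging scheme above can be sketched as follows; the tuple layout and function name are illustrative, not from the text. Uniform weights give the speaker-averaged figures, while other weight choices yield the gender-balanced or test-set variants just mentioned.

```python
def average_model_performance(per_speaker, weights=None):
    """Average per-speaker model EERs and validity-domain bounds.

    per_speaker: list of (model_eer, (b_low, b_high)) tuples, one per
                 registered speaker (illustrative layout).
    weights:     optional per-speaker weights; uniform by default.
    """
    n = len(per_speaker)
    if weights is None:
        weights = [1.0 / n] * n
    # Weighted averages of the model EER and of each interval bound.
    eer = sum(w * e for w, (e, _) in zip(weights, per_speaker))
    lo = sum(w * d[0] for w, (_, d) in zip(weights, per_speaker))
    hi = sum(w * d[1] for w, (_, d) in zip(weights, per_speaker))
    return eer, (lo, hi)
```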

Another possibility would be to fix a speaker-independent   validity domain tex2html_wrap_inline48601 for each ROC   curve, and then compute the individual accuracy  tex2html_wrap_inline48603. To obtain a global score, all tex2html_wrap_inline48603 could then be averaged (using weights depending on the type of estimate), and the performance would be a global model equal error rate  together with a false acceptance  rate domain tex2html_wrap_inline48601 common to all speakers, but at an average accuracy. 
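This alternative fixes the domain and measures the accuracy instead. A minimal sketch, again with placeholder model and measurement functions, is to scan a grid of false rejection rates over the common domain and keep the worst relative error:

```python
def individual_accuracy(model_fa, true_fa, domain, n_grid=100):
    """Worst-case relative false acceptance error of the ROC model
    over a fixed, speaker-independent false rejection domain.

    The result is the smallest accuracy eps for which the model is
    eps-accurate on the whole domain [a_low, a_high].
    """
    a_low, a_high = domain
    worst = 0.0
    for i in range(n_grid + 1):
        a = a_low + (a_high - a_low) * i / n_grid
        rel_err = abs(model_fa(a) - true_fa(a)) / true_fa(a)
        worst = max(worst, rel_err)
    return worst
```

Averaging these per-speaker accuracies, with the appropriate weights, then gives the global figure discussed above.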


For example, consider a verification system with a speaker-independent threshold  that has a gender-balanced Oglesby equal error rate     of tex2html_wrap_inline48609, with a tex2html_wrap_inline48611-accuracy  false rejection  rate validity domain of tex2html_wrap_inline48613. The ROC  curve under consideration here is tex2html_wrap_inline48615. For simplicity, we will now write tex2html_wrap_inline48617 and tex2html_wrap_inline48619.

For any false rejection  rate a satisfying tex2html_wrap_inline48623, the difference between the actual false acceptance   rate b and the estimated false acceptance   rate tex2html_wrap_inline48627 predicted by Oglesby's model   with parameter tex2html_wrap_inline48629 satisfies tex2html_wrap_inline48631. It can then be computed (using equation (11.77)) that the tex2html_wrap_inline48611-accuracy  false acceptance  rate validity domain is tex2html_wrap_inline48635, and it is guaranteed that, for any value of b in this interval, the difference between the actual false rejection  rate a and the estimated false rejection  rate tex2html_wrap_inline48641 (predicted by Oglesby's model  with EER  0.047) satisfies tex2html_wrap_inline48643. In particular, the exact (gender-balanced) EER   of the system, tex2html_wrap_inline48491, is equal to 0.047, at a tex2html_wrap_inline48551 relative accuracy. 
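The conversion carried out in this example, from a false rejection rate validity domain to a false acceptance rate one, simply maps the interval bounds through the model. Since Oglesby's formula (equation (11.77)) is not reproduced here, the sketch below uses an arbitrary decreasing stand-in curve through the EER point:

```python
def fa_domain_from_fr_domain(model_fa, fr_domain):
    """Map a false rejection rate validity domain to the matching
    false acceptance rate domain through the ROC model.

    Because an ROC curve is decreasing, the upper false rejection
    bound maps to the lower false acceptance bound and vice versa.
    """
    a_low, a_high = fr_domain
    return model_fa(a_high), model_fa(a_low)

# Stand-in for the model curve: a hyperbola through the EER point
# (0.047, 0.047); the real mapping would use equation (11.77).
model_fa = lambda a: 0.047 ** 2 / a
b_domain = fa_domain_from_fr_domain(model_fa, (0.01, 0.5))
```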


EAGLES SWLG SoftEdition, May 1997.