Here is what the description of an experimental protocol for an evaluation could look like:
``The following protocol was designed to estimate the performance of a speaker verification system for the protection of personal portable telephones. The principle of the targeted security system is a personal-password text-dependent speaker verification system. Before a user can place a call on his portable phone, he is asked to utter his identity, i.e. his name and surname. The compatibility between the speaker and the authorised owner is checked locally, and in case of acceptance, the speaker is allowed to dial his number.
To simulate this application, the following experimental protocol was set up. A group of portable phone owners were provided with a (slightly modified) miniature tape recorder (the size of a dictaphone), and were asked to record their name and surname before they placed a call on their phone, except if they had already done so during the previous three hours. To make sure that some users would not record all their utterances consecutively, a timer was built into the tape recorder, so that at least three hours had to elapse between two activations of the record function. In return for a user's participation, his subscription to the portable phone service was paid for, for six months. In practice, the six-month subscription was refunded to a user when he brought back a recorded tape containing 100 recordings. This number corresponds approximately to one session every other day over six months. In reality, the average time after which a tape was returned was 4.2 months.
Once a tape and a tape recorder were returned, the tape's content was digitised at a sampling frequency of 16 kHz, and the data were segmented automatically (a beep had been internally recorded on the tape each time the ``stop'' button was pressed). The speech material was not verified exhaustively, but a speech activity detector was used to discard utterances that were composed of silence only. On the average, 97% of the utterances were kept. Silent signal portions lasting longer than 0.2 seconds were removed automatically. The typical bandwidth of the tape recorder's microphone is 150-6000 Hz, which is within the tape's bandwidth. All tapes were of the same trademark, and their noise level was judged negligible. Despite the fact that, for a given speaker, the microphone and the tape characteristics remained constant for all recording sessions, the data collection protocol can be considered as realistic for the targeted application.
The first five recordings for each speaker were used as training material, whereas the remaining ones were used as test material (92, on the average). The average registration timespan was estimated to be $(5/97) \times 4.2$ months $\approx 6.5$ days, which may be an overestimate of the actual timespan, as users probably recorded their voice more often at the beginning of the experiment. Accordingly, the average operation timespan was considered to cover approximately four months.
An initial population of 188 persons agreed to take part in the experiment, but 19 of them never returned the recording device, either because they lost it, or because they lost interest in the experiment. Additionally, seven tape recorders and three tapes deteriorated during the six-month timespan. Altogether, only 159 different speakers were thus taken into consideration as registered speakers, among which 92 were male speakers (i.e. 58%). All of them were adults over 18. Nothing else about their profile was studied, but they are likely to correspond to a relatively affluent fraction of the population, since they can afford a portable phone.
In this database, a speaker utters his name and surname in 0.8 seconds on the average, but this figure varies significantly from one person to another. The linguistic content of the speech material cannot be specified other than exhaustively.
For impostor modelling, we used all recordings corresponding to the registration phase for all speakers, which we pooled together to form a speaker-independent text-independent model. We then derived an impostor model for each registered speaker as the representation of the user's training pronunciations according to the speaker-independent model. In other words, all registered speakers were part of the pseudo-impostor bundle of a given speaker, including this very speaker.
Six professional imitators (4 male, 2 female) were then asked to simulate acquainted intentional test impostors. For each registered speaker of the same sex, they were provided with the tape recorder of the genuine user, and could listen as much as they wanted to the training material of this user. Then they were asked to produce five imitated utterances of the speaker saying his name and surname. These imitations were recorded on the user's tape recorder, at the end of the user's tape. Given the experimental protocol, it was not possible to provide the imitators with any feedback concerning their success or failure to break the system. Altogether, each male imitator recorded approximately $5 \times 92 = 460$ impostor tests against registered male speakers, and each female imitator produced about $5 \times 67 = 335$ impostor tests against registered female speakers. All imitators were paid for their work. The imitated speech followed the same processing as the genuine one.
For the evaluation of system performance, each authentic test utterance was tried with the genuine identity ($159 \times 92 = 14628$ authentic trials), and each imitated utterance was tried against the targeted identity ($4 \times 460 + 2 \times 335 = 2510$ impostor trials).''
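The figures quoted in this protocol can be re-derived from its stated inputs, which is a useful sanity check when reading or writing such a description. The following is only a minimal sketch: the function and variable names are hypothetical, the conversion of 30 days per month is an assumption (the protocol does not state one), and the per-speaker counts are the averages given in the text rather than exact values.

```python
# Sanity check of the counts quoted in the imaginary protocol.
# All names are illustrative; DAYS_PER_MONTH = 30 is an assumption.

DAYS_PER_MONTH = 30


def protocol_figures():
    # Population: volunteers minus losses reported in the text.
    registered = 188 - 19 - 7 - 3          # never returned, recorders, tapes
    male_speakers = 92
    female_speakers = registered - male_speakers

    # Recordings: 100 per tape, 97% kept on average, first 5 for training.
    kept_per_speaker = round(100 * 0.97)
    training_per_speaker = 5
    test_per_speaker = kept_per_speaker - training_per_speaker
    # Note: the average number of test utterances (92) happens to equal
    # the number of male speakers (92); the two quantities are unrelated.

    # Registration timespan: share of the 4.2-month average return delay
    # taken up by the 5 training recordings.
    registration_days = (
        training_per_speaker / kept_per_speaker * 4.2 * DAYS_PER_MONTH
    )

    # Impostor tests: 5 imitations per same-sex registered speaker.
    tests_per_male_imitator = 5 * male_speakers
    tests_per_female_imitator = 5 * female_speakers

    return {
        "registered": registered,
        "female_speakers": female_speakers,
        "test_per_speaker": test_per_speaker,
        "registration_days": round(registration_days, 1),
        "tests_per_male_imitator": tests_per_male_imitator,
        "tests_per_female_imitator": tests_per_female_imitator,
        "authentic_trials": registered * test_per_speaker,
        "impostor_trials": 4 * tests_per_male_imitator
        + 2 * tests_per_female_imitator,
    }


if __name__ == "__main__":
    for name, value in protocol_figures().items():
        print(f"{name}: {value}")
```

Under these assumptions the sketch reproduces the 159 registered speakers, 14628 authentic trials and 2510 impostor trials given above. Note that the authentic trial count relies on the average of 92 test utterances per speaker, so the true total would differ if the per-speaker counts were tallied exactly.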
We leave to the reader the pleasure of tracking the unavoidable experimental biases that remain in this imaginary experimental protocol, and how they could be circumvented.