In this section, we give examples of well-known speaker recognition systems which can be found in the literature, in order to illustrate the taxonomy described above.
Among text-dependent systems, the Bell Labs system, reported by [Rosenberg (1976)] and improved by [Furui (1981)], was tested by the latter under the following protocol: ``Several [six] kinds of utterance sets were used to evaluate [the] system ... Two all-voiced sentences were used in the recordings. The males used the sentence, `We were away a year ago' and the females used the sentence, `I know when my lawyer is due.''' [Furui (1981), p. 258]
The first five utterance sets are composed of male speakers, while the last one is composed of female speakers. Performance is reported for speaker verification experiments on each set. Following our terminology, these experiments simulate a common-password text-dependent speaker verification system. Here, the password is an entire sentence. As Rosenberg notes to justify the use of a text-dependent system in practical applications: ``For many applications, the speakers are expected to be cooperative so that a prescribed text is perfectly feasible.'' [Furui (1981), p. 259]
The use of a prescribed text also has the advantage that it does not need any prompting, but the drawback that the text may be forgotten by the user. As discussed in a later example, a second drawback of text-dependent systems is the possibility for impostors to use pre-recorded speech.
As an example of personal-password text-dependent speaker verification, one can mention a new service offered by the American telephone operator SPRINT. For this service, the user must speak his telephone card number over the phone, in order to have the call he wishes to make charged directly to his home bill. The system identifies the claimed customer by recognising the sequence of digits, and then verifies, on the same sequence of digits, the match between the actual user and the assumed customer. Here, the sequence of digits has a double function: it is a means of customer identification and a personal voice password for speaker verification.
Another very popular speaker verification system was developed by Doddington at Texas Instruments in the early 1970s. Here follow excerpts of the description given by the author [Doddington (1985), p. 1661]:
To use the system an entrant first opens the door to the entry booth and walks in, then he identifies himself by entering a user ID into a keypad, and then he repeats the verification phrase(s) that the system prompts him to say. If he is verified, the system [...] unlocks the inside door of the booth so that he may enter into the computer center. If he is not verified, the system notifies him by saying ``not verified, call for assistance''.
Verification utterances are constructed randomly to avoid the possibility of being able to defeat the system with a tape recording of a valid user. A simple four-word fixed phrase structure is used, with one of sixteen word alternatives filling each of the four word positions (see Table 11.1).
GOOD     BEN     SWAM     NEAR
PROUD    BRUCE   CALLED   HARD
STRONG   JEAN    SERVED   HIGH
YOUNG    JOYCE   CAME     NORTH

Table 11.1: Verification Phrase Construction for the TI Operational Voice Verification System (after Doddington)
An example verification utterance might be ``Proud Ben served hard''. These utterances are prompted by voice. This is thought to improve verification performance by stabilising the pronunciation of the user's utterance.
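The random phrase construction described in the excerpt can be sketched in a few lines. This is a minimal illustration, not the TI implementation: the function names and the choice of random source are assumptions, and only the words listed in Table 11.1 are used, one column per word position.

```python
import random

# Word alternatives for each of the four positions, transcribed from
# Table 11.1 (one inner list per column of the table).
WORD_TABLE = [
    ["GOOD", "PROUD", "STRONG", "YOUNG"],
    ["BEN", "BRUCE", "JEAN", "JOYCE"],
    ["SWAM", "CALLED", "SERVED", "CAME"],
    ["NEAR", "HARD", "HIGH", "NORTH"],
]


def make_verification_phrase(rng=None):
    """Draw one word per position to build a four-word prompt.

    `rng` is a hypothetical parameter: any object with a `choice` method.
    By default an OS-backed source is used so prompts are hard to predict.
    """
    if rng is None:
        rng = random.SystemRandom()
    return " ".join(rng.choice(column) for column in WORD_TABLE)


def count_possible_phrases():
    """Number of distinct prompts that the table above can produce."""
    total = 1
    for column in WORD_TABLE:
        total *= len(column)
    return total


if __name__ == "__main__":
    print(make_verification_phrase())
    print(count_possible_phrases())
```

With the words shown in Table 11.1, the example phrase ``Proud Ben served hard'' is one of the possible draws. An impostor holding a recording of a single prompt would have to be lucky enough to be asked for that same phrase again.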
The TI system is therefore a voice-prompted fixed-vocabulary speaker verification system, the claimed identity being input as a personal identification number on a keypad. Doddington's excerpt illustrates well the motivations behind the voice-prompted fixed-vocabulary approach: the relative randomness of the verification utterances protects against impostors using pre-recorded speech, while the use of voice prompts tends to control the reproducibility of the user's pronunciation. However, it must be noted that voice-prompting may also neutralise some of the speaker characteristics (in particular prosodic factors), owing to an unconscious mimicry of the prompt. Text-prompting, for its part, has the drawback of requiring a specific device, such as a screen, which is not always available.
The experiments reported by [Soong et al. (1987)], in which sequences of digits are used for speaker verification, are another example of a fixed-vocabulary system.
Unrestricted text-independent speaker recognition is usually considered desirable for several reasons. Although the user does not have to take the initiative in producing the text, prompted systems are less likely to be defeated by a recorded voice, as the linguistic material is virtually unpredictable. For unprompted systems, identification or verification can take place unobtrusively, during a telephone transaction, for instance. Moreover, unprompted approaches do not require the speaker to be actively cooperative.
Here is the general structure of a text- (or voice-) prompted unrestricted text-independent system, as described by [Furui (1994)], p. 7:
The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker has uttered the prompted sentence [...] This method not only can accurately recognise speakers but also can reject utterances whose text differs from the prompted text, even if it is uttered by the registered speaker.
[During registration], since the text of training utterances is known, these utterances can be modelled as the concatenation of [speaker-independent] phoneme models, and these models can be automatically adapted [to the new registered speaker]. In the recognition stage, the system concatenates phoneme models according to the prompted text [i.e. a speaker-specific model and a speaker-independent model]. If the likelihood of both speaker and text is high enough, the speaker is accepted as the claimed speaker.
Note here that the fundamental difference between the system described above and a fixed-vocabulary system is the use of subword speech units (here, phonemes), which make it possible to construct speaker-specific models of test words (or sentences) that were not pronounced during the registration phase. Note also the use of an explicit speech recognition step.
In contrast to prompted systems, here is one example of an experiment in unprompted speaker recognition, as reported by [Gish et al. (1986)], p. 865, concerning the ISIS system from BBN:
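The role of subword units and the double acceptance test can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the pronunciation lexicon, the phoneme symbols, the string placeholders standing in for real phoneme models, the scores and the thresholds. It is not Furui's implementation; it only shows that a word absent from registration can still be modelled by concatenation, and that acceptance requires both the speaker score and the text score to be high enough.

```python
# Hypothetical pronunciation lexicon (phoneme symbols are illustrative).
LEXICON = {
    "proud": ["p", "r", "aw", "d"],
    "ben": ["b", "eh", "n"],
}

# Placeholder "models": in a real system these would be statistical models
# of each phoneme, here they are just labels.
PHONEMES = {p for phones in LEXICON.values() for p in phones}
SPEAKER_MODELS = {p: "spk:" + p for p in PHONEMES}      # adapted to the user
INDEPENDENT_MODELS = {p: "si:" + p for p in PHONEMES}   # speaker-independent


def sentence_model(words, lexicon, model_set):
    """Concatenate phoneme models according to the prompted text."""
    return [model_set[p] for w in words for p in lexicon[w]]


def accept(speaker_loglik, text_loglik, speaker_threshold, text_threshold):
    """Accept only if both the speaker and the text scores are high enough."""
    return speaker_loglik >= speaker_threshold and text_loglik >= text_threshold


if __name__ == "__main__":
    prompt = ["proud", "ben"]
    print(sentence_model(prompt, LEXICON, SPEAKER_MODELS))
    print(sentence_model(prompt, LEXICON, INDEPENDENT_MODELS))
    # Scores and thresholds below are made-up numbers.
    print(accept(-10.0, -12.0, -15.0, -15.0))
```

Because the sentence model is assembled from phoneme models, the prompted sentence need not have been spoken at registration, as long as each of its phonemes has a model.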
We wish to identify an unknown speaker, from an utterance, [...] knowing that the utterance was made by one of a set of M possible speakers. We have available training data for each of the M speakers that consists of speech from one or more telephone calls, all distinct from the test telephone call. The text of all utterances is assumed to be unknown.
Here, the protocol described is unprompted unrestricted text-independent closed-set speaker identification. Note also the multi-session character of the experiment, i.e. that the training and test material have been recorded through different channels, probably on different days.
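Closed-set identification amounts to scoring the test utterance against each of the M speaker models and returning the best-scoring speaker, with no option to reject. The sketch below illustrates that decision rule only. The one-dimensional Gaussian speaker models, the feature values and the speaker names are assumptions chosen to keep the example small; they are not the models used in the ISIS system.

```python
import math


def gaussian_loglik(frames, mean, var):
    """Log-likelihood of scalar feature frames under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


def identify(frames, speaker_models):
    """Closed-set identification: return the best of the M known speakers.

    `speaker_models` maps a speaker name to a (mean, variance) pair, which
    is a toy stand-in for a model trained on that speaker's calls.
    """
    scores = {
        name: gaussian_loglik(frames, mean, var)
        for name, (mean, var) in speaker_models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up models for M = 3 speakers and a made-up test utterance.
    models = {"spk1": (0.0, 1.0), "spk2": (5.0, 1.0), "spk3": (10.0, 1.0)}
    test_frames = [4.8, 5.1, 5.3]
    print(identify(test_frames, models)[0])
```

Since one of the M speakers is always returned, an utterance from a speaker outside the set would still be attributed to somebody; handling that case is what distinguishes open-set identification and verification from the closed-set protocol described here.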