


In this section, we give examples of well-known speaker recognition systems which can be found in the literature, in order to illustrate the taxonomy described above.

Text-dependent systems


Among examples of text-dependent systems, the Bell Labs system, reported by [Rosenberg (1976)] and improved by [Furui (1981)], is tested by the latter under the following protocol: "Several [six] kinds of utterance sets were used to evaluate [the] system ... Two all-voiced sentences were used in the recordings. The males used the sentence, 'We were away a year ago' and the females used the sentence, 'I know when my lawyer is due.'" [Furui (1981), p. 258]

The first five utterance sets are composed of male speakers, while the last one is composed of female speakers. Performances are reported for speaker verification experiments on each set. Following our terminology, these experiments simulate a common-password text-dependent speaker verification system. Here, the password is an entire sentence. As Rosenberg notes to justify the use of a text-dependent system in practical applications: "For many applications, the speakers are expected to be cooperative so that a prescribed text is perfectly feasible." [Furui (1981), p. 259]

The use of a prescribed text also has the advantage that it does not need any prompting, but the drawback that it may be forgotten by the user. As discussed in a later example, a second drawback of text-dependent systems is the possibility for impostors to use pre-recorded speech.

As an example of personal-password text-dependent speaker verification, one can mention a new service offered by the American telephone operator SPRINT. For this service, the user must speak his telephone card number over the phone in order to have the call he wishes to make charged directly to his home bill. The system identifies the claimed customer by recognising the sequence of digits, and then verifies, on the same sequence of digits, the match between the actual user and the assumed customer. Here, the sequence of digits has a double function: a means of customer identification, and a personal voice password for speaker verification.
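The two-stage use of the digit sequence can be sketched as follows. This is a minimal illustration, not a description of the actual service: the function `handle_card_call`, the callables `recognise_digits` and `voice_score`, the `card_owners` table and the threshold are all hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, Optional


def handle_card_call(
    utterance: Any,
    recognise_digits: Callable[[Any], str],
    voice_score: Callable[[str, Any], float],
    card_owners: Dict[str, str],
    threshold: float,
) -> Optional[str]:
    """Return the customer to bill, or None if the call is rejected.

    All arguments are hypothetical stand-ins: `recognise_digits` plays the
    role of a digit recogniser, `voice_score` the role of a speaker
    verification score (higher means a better match).
    """
    # Function 1 of the digit string: identify the claimed customer.
    digits = recognise_digits(utterance)
    claimed_customer = card_owners.get(digits)
    if claimed_customer is None:
        return None  # the recognised card number is unknown

    # Function 2: the same utterance serves as a personal voice password.
    if voice_score(claimed_customer, utterance) >= threshold:
        return claimed_customer
    return None
```

The point of the sketch is that a single utterance is consumed twice: once by a recogniser to obtain the claimed identity, and once by a verifier that compares the voice with the model of that claimed customer.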


Fixed-vocabulary systems


Another very popular speaker verification system was developed by Doddington at Texas Instruments in the early 1970s. Here follow excerpts of the description given by the author [Doddington (1985), p. 1661]:

To use the system an entrant first opens the door to the entry booth and walks in, then he identifies himself by entering a user ID into a keypad, and then he repeats the verification phrase(s) that the system prompts him to say. If he is verified, the system [...] unlocks the inside door of the booth so that he may enter into the computer center. If he is not verified, the system notifies him by saying "not verified, call for assistance".

Verification utterances are constructed randomly to avoid the possibility of being able to defeat the system with a tape recording of a valid user. A simple four-word fixed phrase structure is used, with one of sixteen word alternatives filling each of the four word positions (see Table 11.1).


Table 11.1: Verification Phrase Construction for the TI Operational Voice Verification System (after Doddington) 

An example verification utterance might be "Proud Ben served hard". These utterances are prompted by voice. This is thought to improve verification performance by stabilising the pronunciation of the user's utterance.

The TI system is therefore a voice-prompted fixed-vocabulary speaker verification system, the claimed identity being input as a personal identification number on a keypad. Doddington's excerpt illustrates well the motivations behind the voice-prompted fixed-vocabulary approach: the relative randomness of the verification utterances protects against impostors using pre-recorded speech, while the use of voice prompts tends to control the reproducibility of the user's pronunciation. However, it must be noted that voice-prompting may also neutralise some of the speaker characteristics (in particular prosodic factors), owing to an unconscious mimicry of the prompt. At the same time, text-prompting has the drawback of requiring a specific device, such as a screen, which it is not always possible to provide.
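The random construction of a prompt from a fixed vocabulary can be sketched as follows. This is an illustration only: the contents of Table 11.1 are not reproduced here, so `WORD_TABLE` is a hypothetical placeholder in which only the words of the example phrase ("Proud", "Ben", "served", "hard") come from the text, and the function names are assumptions.

```python
import random
from typing import List, Optional, Sequence

# Hypothetical word table: one list of alternatives per word position.
# Only the first word of each list comes from the example phrase in the
# text; every other word is a placeholder, not the content of Table 11.1.
WORD_TABLE: List[List[str]] = [
    ["Proud", "Calm", "Brave", "Kind"],
    ["Ben", "Ann", "Tom", "Sue"],
    ["served", "walked", "sang", "drove"],
    ["hard", "well", "late", "fast"],
]


def make_prompt(
    table: Sequence[Sequence[str]], rng: Optional[random.Random] = None
) -> str:
    """Draw one alternative per position and join them into a phrase."""
    rng = rng or random.Random()
    return " ".join(rng.choice(alternatives) for alternatives in table)


def count_phrases(table: Sequence[Sequence[str]]) -> int:
    """Number of distinct phrases the table can generate."""
    total = 1
    for alternatives in table:
        total *= len(alternatives)
    return total
```

With the placeholder table above, `count_phrases(WORD_TABLE)` is 4 x 4 x 4 x 4 = 256. With sixteen alternatives at each of the four positions, as stated in the excerpt, the same count would be 16^4 = 65,536 phrases, which is what makes a single tape recording unlikely to match the prompted utterance.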

The experiments reported by [Soong et al. (1987)], where sequences of digits are used for speaker verification, are another example of a fixed-vocabulary system.


Unrestricted text-independent systems


Unrestricted text-independent speaker recognition is usually considered desirable for several reasons. Prompted systems, in which the user does not have to take the initiative in producing the text, are less likely to be defeated by a recorded voice, as the linguistic material is virtually unpredictable. For unprompted systems, identification or verification can take place unobtrusively, during a telephone transaction, for instance. Moreover, unprompted approaches do not require the speaker to be actively cooperative.

Here is the general structure of a text- (or voice-) prompted unrestricted text-independent system, as described by [Furui (1994)], p. 7:

The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker has uttered the prompted sentence [...] This method not only can accurately recognise speakers but also can reject utterances whose text differs from the prompted text, even if it is uttered by the registered speaker.

[During registration], since the text of training utterances is known, these utterances can be modelled as the concatenation of [speaker-independent] phoneme models, and these models can be automatically adapted [to the new registered speaker]. In the recognition stage, the system concatenates phoneme models according to the prompted text [i.e. a speaker-specific model and a speaker-independent model]. If the likelihood of both speaker and text is high enough, the speaker is accepted as the claimed speaker.

Note here that the fundamental difference between the system described above and a fixed-vocabulary system is the use of subword speech units (here, phonemes), which make it possible to construct speaker-specific models of test words (or sentences) that were not pronounced during the registration phase. Note also the use of an explicit speech recognition step.
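The role of subword units can be sketched as follows. This is a simplified illustration under stated assumptions, not Furui's implementation: phoneme models are treated as opaque objects, the pronunciation lexicon and all function names are hypothetical, and "high enough" is read here as two separate thresholds on log-likelihoods, which is only one possible interpretation of the excerpt.

```python
from typing import Any, Dict, List, Sequence


def concatenate_models(
    prompt_words: Sequence[str],
    lexicon: Dict[str, List[str]],
    phoneme_models: Dict[str, Any],
) -> List[Any]:
    """Build a sentence model as the ordered list of phoneme models.

    `lexicon` maps each word to its phoneme sequence (hypothetical
    pronunciation dictionary). A word never spoken at registration can
    still be modelled, provided each of its phonemes has a model;
    a missing word or phoneme raises KeyError.
    """
    sentence_model = []
    for word in prompt_words:
        for phoneme in lexicon[word]:
            sentence_model.append(phoneme_models[phoneme])
    return sentence_model


def accept(
    speaker_log_likelihood: float,
    text_log_likelihood: float,
    speaker_threshold: float,
    text_threshold: float,
) -> bool:
    """Accept only if both the speaker score and the text score are high enough.

    Using two independent thresholds is an assumption of this sketch.
    """
    return (
        speaker_log_likelihood >= speaker_threshold
        and text_log_likelihood >= text_threshold
    )
```

Calling `concatenate_models` once with speaker-adapted phoneme models and once with speaker-independent ones would give the two sentence models mentioned in the excerpt. The scoring of an utterance against them is deliberately left out, since the excerpt does not specify it.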

In contrast to prompted systems, here is one example of an experiment in unprompted speaker recognition, as reported by [Gish et al. (1986)], p. 865, concerning the ISIS system from BBN:

We wish to identify an unknown speaker, from an utterance, [...] knowing that the utterance was made by one of a set of M possible speakers. We have available training data for each of the M speakers that consists of speech from one or more telephone calls, all distinct from the test telephone call. The text of all utterances is assumed to be unknown.

Here, the protocol described is unprompted unrestricted text-independent closed-set speaker identification. Note also the multi-session character of the experiment, i.e. that the training and test material have been recorded through different channels, probably on different days.
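The closed-set decision in this protocol can be sketched as follows. This is a generic illustration and not the ISIS scoring method: the function name, the `score` callable and the representation of speaker models are all assumptions.

```python
from typing import Any, Callable, Dict


def identify_closed_set(
    utterance: Any,
    speaker_models: Dict[str, Any],
    score: Callable[[Any, Any], float],
) -> str:
    """Return the name of the best-scoring speaker among the M known ones.

    `score(model, utterance)` is a hypothetical similarity function
    (higher means a better match). Because the set is closed, there is
    no rejection option: one of the M speakers is always returned.
    """
    if not speaker_models:
        raise ValueError("closed-set identification needs at least one speaker")
    return max(
        speaker_models,
        key=lambda name: score(speaker_models[name], utterance),
    )
```

The absence of a threshold is what distinguishes this closed-set identification from verification, where the score of the claimed speaker is compared with a threshold and the utterance may be rejected.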



EAGLES SWLG SoftEdition, May 1997.