It is unrealistic to review all the applications that could benefit from the knowledge (recognition) of speaker characteristics. The overlap with the applications of speech technologies could become more and more important in the future. Techniques for speaker recognition can be used for speaker selection or the adaptation of multi-speaker or speaker-independent speech recognition systems, and can bring improvements to speech processing techniques in general, including synthesis and coding.
In most of these cases, it may be inappropriate to treat the linguistic and paralinguistic aspects of the speech signal separately. The paralinguistic aspects are certainly more important in the following tasks:
Speaker change detection can be used for automatic speaker labelling in recordings or for focussing on the current speaker in video conferences. For some multimedia applications, it may be very useful to access, speaker by speaker, the recording of a conversation or of a radio or television program. Here, the speech to be processed is reasonably sequential, which usually makes the algorithms more efficient, although it may happen that several speakers talk at the same time.
Most of the following discussion will focus on speaker verification. In this context, the most obvious dichotomy separates remote (telecom) applications from local (on-site) applications.
Remote applications are typically performed over the telephone.
A major problem with the telephone is the diversity of telephone sets and channel paths.
The microphone and the environment are much better controlled in local applications.
One of the most obvious uses of speaker recognition techniques is caller authentication over the telephone network. In the framework of such applications, the main task is therefore speaker verification. The user claims his identity, most of the time by dialling (or saying) a personal code number. Then, either a code word is required to authenticate the speaker, or his utterance of the code number is itself used for verification purposes.
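The claim-then-verify procedure described above can be illustrated by a minimal sketch. Everything in it is an assumption made for illustration: the function names, the representation of an utterance as a plain list of feature values, the use of cosine similarity as the matching score, and the threshold of 0.8. A real system would rely on a proper speaker model and on a threshold tuned on data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def verify_claim(claimed_code, utterance_features, enrolled, threshold=0.8):
    """Accept the claimed identity only if the utterance matches its reference.

    claimed_code       -- personal code number dialled or spoken by the user
    utterance_features -- hypothetical feature vector of the spoken code word
    enrolled           -- dict mapping code numbers to reference vectors
    threshold          -- illustrative acceptance threshold (assumption)
    """
    reference = enrolled.get(claimed_code)
    if reference is None:
        # Unknown code number: there is no identity to verify against.
        return False
    return cosine_similarity(utterance_features, reference) >= threshold


# Hypothetical enrolment data for one user.
enrolled = {"1234": [1.0, 0.0, 0.5]}
accepted = verify_claim("1234", [0.9, 0.1, 0.5], enrolled)
rejected = verify_claim("1234", [0.0, 1.0, 0.0], enrolled)
```

Note that the score is only ever compared with the reference of the claimed identity, which is what distinguishes verification from identification.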
Typical application areas are:
The main applications for speaker verification over the telephone network are of two kinds: the first kind is banking and remote transactions, the second kind is access to licensed databases. Obviously, the two fields do not require the same level of security: it is usually less costly to let an unauthorised person access a database than to allow somebody to carry out a transaction that may involve large amounts of money. In many countries, it is possible to pay by credit card over the phone without any verification of the customer other than the consistency between the customer's name, his credit card number and its expiry date (all of which appear on the credit card itself!). Some kind of basic speaker verification (for instance user-specific text-dependent verification), even with tolerant thresholds, could certainly bring more security to this type of transaction. Note that, in this case, the speaker characteristics should be centralised in a single place that delivers the transaction authorisations, and each supplier would need special equipment, but the verification could take place off-line because the confirmation of the transaction does not usually need to be immediate.
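The link between the required level of security and the choice of a more or less tolerant threshold can be made concrete with a small sketch. It assumes that verification produces a numerical score that is higher for better matches, and that a set of scores from genuine users and from impostors is available; the function names, the cost values and the prior probability of an impostor attempt are all hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold.

    A trial is accepted when its score is greater than or equal to the
    threshold (assumed convention: higher score means better match).
    """
    false_acceptances = sum(1 for s in impostor_scores if s >= threshold)
    false_rejections = sum(1 for s in genuine_scores if s < threshold)
    return (false_acceptances / len(impostor_scores),
            false_rejections / len(genuine_scores))


def expected_cost(far, frr, cost_fa, cost_fr, impostor_prior=0.5):
    """Average cost per trial, given the cost of each type of error.

    cost_fa, cost_fr and impostor_prior are application-dependent
    assumptions, not values taken from the text.
    """
    return (cost_fa * far * impostor_prior
            + cost_fr * frr * (1.0 - impostor_prior))


# Made-up scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.3, 0.5, 0.6]

tolerant = error_rates(genuine, impostor, threshold=0.5)
strict = error_rates(genuine, impostor, threshold=0.75)
```

With these made-up scores, the tolerant threshold rejects fewer genuine users but accepts more impostors than the strict one. An application where a false acceptance is expensive (a money transfer) would therefore favour the stricter setting, whereas access to a database could tolerate the more lenient one.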
With on-site applications, the person whose identity is to be verified must be either physically present in one particular location or in direct contact with a device under the control of the service provider. This offers some freedom for the product designer to choose from a variety of techniques. Keys, badges and codes are used most frequently. However, they may not ensure a sufficient level of security. In such cases, biometric verifiers offer an alternative (or a complement).
Typical applications are:
These applications are the equivalent of database access and remote transactions over the telephone. However, the differences from telecommunication applications come from four facts:
For example, a possible implementation of a voice verification system for cash dispensers could simply refuse the transaction (or limit its amount) if the voice characteristics do not sufficiently match the identity corresponding to the Personal Identification Number (PIN). Note that, even if a thief steals a credit card with the PIN attached to it, he may still not know the voice of the user, and may have difficulty in getting information about it. Even if he knows the voice of the owner, he may be unable to imitate it. Even if some false acceptances take place, the number of fraudulent transactions will necessarily be reduced. However, if too many false rejections occur, the bank may lose some of its clients. Since the probability of an impostor having a voice similar to that of the user is small, the system can be quite tolerant and still reduce the number of fraudulent withdrawals and be somewhat dissuasive to impostors, without offending a significant subset of the regular users. If additional procedures are put into action, such as taking a picture of the user in case of doubt about his identity, or performing some kind of face recognition, dissuasion is reinforced, since a possible impostor must then resort to a more elaborate and less casual strategy. For this kind of application, the voice characteristics of the speaker can be stored on the magnetic stripe or the chip of his card, which does not require centralised access. The verification must take place in nearly real-time to avoid undesirable delays in the transaction.
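The "refuse or limit the amount" policy can be sketched as a simple decision rule on the verification score. The two thresholds, the reduced limit and the function name are hypothetical choices made for this example; the text itself only states that the transaction could be refused or its amount limited when the match is insufficient.

```python
def decide_withdrawal(score, requested_amount,
                      accept_threshold=0.6,
                      limit_threshold=0.4,
                      reduced_limit=100):
    """Return the amount the dispenser agrees to deliver.

    score            -- verification score, higher means better match
    accept_threshold -- above this, the full request is honoured
    limit_threshold  -- between the two thresholds, the amount is capped
    reduced_limit    -- cap applied in case of doubt (illustrative value)
    """
    if score >= accept_threshold:
        return requested_amount
    if score >= limit_threshold:
        # Doubtful match: allow the withdrawal, but cap the amount.
        return min(requested_amount, reduced_limit)
    # Clear mismatch: refuse the transaction.
    return 0


full = decide_withdrawal(0.9, 500)
capped = decide_withdrawal(0.5, 500)
refused = decide_withdrawal(0.1, 500)
```

Keeping an intermediate zone between acceptance and refusal is one way to remain tolerant towards regular users whose voice matches poorly on a given day, while still bounding the loss caused by a false acceptance.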
In the context of access control, a reasonable system would be based on rather strict verification, with a "call for assistance" procedure in case of rejection.