

Speech recording, storage, and playback

The speech is simply converted from an analogue signal to a digital one and stored on the computer in a pre-defined format. Under application control, a speech file is selected and played back to the user (i.e. converted from digital codes back to an analogue waveform).

Speech acquisition may be done through an analogue telephone line, a digital line, or a local microphone. The application developer has to know whether such prompts can be recorded and what kind of tools are delivered for this purpose.

The converter uses a sampling frequency that is usually matched to the telephone bandwidth (8 kHz), although multimedia applications use 11 kHz or 16 kHz. Some technologies incorporate both a telephone interface and a microphone input.

Different coding techniques are used, in particular PCM (Pulse Code Modulation) and ADPCM (Adaptive Differential PCM). The most common PCM method samples speech at a rate of 8000 samples/second, leading to 64 kbits per second of speech (one byte per sample). There are two types of sample coding, called µ-law (very popular in the USA) and A-law (popular in Europe). The application developer therefore has to know that this requires 64 kbits per second of speech (64 kbps), which he may have to play back to the users. This technique is standardised by the CCITT under Recommendation G.711.
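The µ-law scheme mentioned above is a companding curve: it compresses the dynamic range of each sample before quantisation so that quiet speech keeps more resolution. The following is a minimal sketch of the continuous µ-law curve defined in G.711 (with µ = 255), not a bit-exact G.711 codec, which works on 8-bit segmented values.

```python
import math

MU = 255  # companding constant used by G.711 µ-law

def mulaw_encode(x: float) -> float:
    """Compress a linear sample x in [-1, 1] with the µ-law curve."""
    sign = -1.0 if x < 0 else 1.0
    return sign * math.log1p(MU * abs(x)) / math.log1p(MU)

def mulaw_decode(y: float) -> float:
    """Invert the µ-law curve back to a linear sample."""
    sign = -1.0 if y < 0 else 1.0
    return sign * math.expm1(abs(y) * math.log1p(MU)) / MU

# The curve expands quiet samples: a sample at 1% of full scale maps to
# roughly 17% of the companded range, so 8 bits suffice for telephone speech.
for x in (0.01, 0.1, 0.5, 1.0):
    assert abs(mulaw_decode(mulaw_encode(x)) - x) < 1e-9
```

A-law works the same way with a different (piecewise linear/logarithmic) curve; in a real G.711 implementation the companded value is then rounded to 8 bits, which is where the actual compression happens.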

ADPCM encodes the difference between two adjacent samples, leading to 32 kbits per second of speech. This has been standardised by the CCITT under Recommendation G.721.
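The differencing principle behind ADPCM can be sketched as follows. This toy version is plain (lossless, non-adaptive) differential coding; real ADPCM additionally quantises each difference to a few bits (4 bits in G.721) with an adaptively scaled step size, which is where the rate halving comes from.

```python
def dpcm_encode(samples):
    """Encode each sample as its difference from the previous sample."""
    prev = 0
    deltas = []
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas

def dpcm_decode(deltas):
    """Rebuild the waveform by accumulating the differences."""
    prev = 0
    samples = []
    for d in deltas:
        prev += d
        samples.append(prev)
    return samples

# Adjacent speech samples are strongly correlated, so the differences
# are small and need fewer bits than the samples themselves.
signal = [0, 3, 5, 6, 6, 4, 1]
assert dpcm_decode(dpcm_encode(signal)) == signal
```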

There are other coding techniques that allow compression of speech at rates below 32 kbps, such as CVSD (Continuously Variable Slope Delta modulation) at a rate of 24 kbits/s, ADPCM variants at a rate of 16 kbits/s, subband coders at rates of 16 to 24 kbps, etc.

The speech data may also be compressed to a low bit-rate before storage. Of course there is a quality degradation that the application developer should take into account. Compression algorithms allow reduction of the speech rate to 9.6 kbps, 7.2 kbps or 4.8 kbps. These techniques may be of interest to application developers who need to reduce the storage capacity of their system, so technology providers should inform them about the available techniques.
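The storage impact of these rates is simple arithmetic, sketched below for one minute of speech at the rates quoted in this section:

```python
def storage_bytes(duration_s: float, bitrate_kbps: float) -> float:
    """Bytes needed to store speech of the given duration and bit rate."""
    return duration_s * bitrate_kbps * 1000 / 8

# One minute of G.711 PCM takes 480 kB; at 4.8 kbps the same minute
# takes 36 kB, a 13x reduction in storage (at reduced quality).
for rate in (64, 32, 9.6, 4.8):
    kb = storage_bytes(60, rate) / 1000
    print(f"{rate:>5} kbit/s -> {kb:6.1f} kB per minute")
```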

For example, a subtle combination of different compression rates may be required in voice mail applications. The application developer may want to store new and recent messages at the highest available quality and archive obsolete ones using low bit-rate coding. If such a combination of techniques is available, the technology provider has to explain how to exploit it.
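Such a policy might be sketched as below. The codec names, rates, and the 30-day threshold are illustrative assumptions, not part of any particular product.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: (codec label, bit rate in kbit/s)
FRESH_CODEC = ("PCM A-law", 64)
ARCHIVE_CODEC = ("low-bit-rate coder", 9.6)

def pick_codec(received: datetime, now: datetime,
               archive_after_days: int = 30):
    """Choose a storage codec for a voice-mail message: recent messages
    keep full quality, older ones are re-coded to save disk space."""
    if now - received > timedelta(days=archive_after_days):
        return ARCHIVE_CODEC
    return FRESH_CODEC
```

A background task would then periodically re-encode messages that have crossed the threshold.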

Another crucial problem is the possible misinterpretation of messages when they are mixed with music and DTMF tones. The application developer has to know how to handle this problem (e.g. modify the message or lower the music segment by some octaves).
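The risk of misinterpretation arises because DTMF detectors simply measure the energy near the eight DTMF frequencies, and music can accidentally excite those bands. A common measurement technique is the Goertzel algorithm, sketched here as an illustration (not part of any particular product's toolkit):

```python
import math

def goertzel_power(samples, freq_hz, rate_hz=8000):
    """Relative power of one frequency in a sample block (Goertzel)."""
    n = len(samples)
    k = round(n * freq_hz / rate_hz)   # nearest frequency bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

# DTMF digit "1" is the pair 697 Hz + 1209 Hz; a detector would
# compare the energy in that row/column against the other bands.
n, rate = 205, 8000
tone = [math.sin(2 * math.pi * 697 * i / rate)
        + math.sin(2 * math.pi * 1209 * i / rate) for i in range(n)]
assert goertzel_power(tone, 697) > 10 * goertzel_power(tone, 941)
```

This is why a music segment with strong components near those frequencies can falsely trigger digit detection, and why shifting or attenuating the music is a workable remedy.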

In order to achieve the process of recording, storing, and playing back the system prompts, a balance has to be found among various parameters.

Such parameters have to be clearly stated by the technology provider.


EAGLES SWLG SoftEdition, May 1997. Get the book...