
Speech input and speech signal acquisition

As noted above, the spread of information databases will lead to widespread use of telecommunication systems to access remote databases. In the meantime, speech control of workstations and of office and working environments (the desktop arena) opens new market sectors. It is believed that the ubiquity of telephone handsets will permit the adoption of speech technologies despite many limitations. Speech input can be acquired either through a telephone or through a microphone.

The two media yield different levels of performance and are suited to different types of application.


Microphone interfaces


Using a microphone is an important means of improving performance, since it avoids the technical limitations of the public telephone network, such as restricted bandwidth. Degradation may still occur, however, when the electrical or acoustic characteristics of the microphone change.

The application developers have to be aware of the most important characteristics that are relevant to the speech signal representation and thus to the speech recogniser: 

The application developer has to be aware that some users are not comfortable talking to a machine and will not readily accept a headset microphone. Moreover, if speech control is used, it is not efficient to encumber users with a handheld microphone. For real applications, a remote microphone attached to the monitor, or mounted on a stalk that can actively track the speaker, can be suggested. In all cases the choice of microphone is the responsibility of the application developer (who may consider other ``human factors''), but it has to be clearly validated by the technology provider in order to ensure that a signal of high quality and high signal-to-noise ratio is passed to the speech recognition module.

In general the microphone used during the exploitation phase has to be similar to the one used during the training phase. The application developer should therefore be told its characteristics and the most ``influential'' factors in order to select an equivalent microphone.

Some technologies are well adapted to the use of microphone arrays, although this is still too expensive a solution for low-cost applications. In other cases the speech input is acquired through a telephone handset, but without the telephone line and its environment. If such possibilities are offered, they have to be clearly mentioned to the application developer.

Some systems allow the use of other microphones, provided the system is adapted to the new microphone characteristics through an adaptation procedure. This procedure modifies either the speech references (trained with another microphone) or the speech input so that they match the new conditions. Without such a procedure, the systems generally have to be trained for each new speaker and for each new microphone.


Telephone interfaces

The telephone is becoming a new ``computer terminal'' and allows access to different services using DTMF (Dual Tone Multi-Frequency, or touch tone), pulse detection or speech recognition. In almost all applications speech recognition has to be telephone-independent. The telephone channel includes the telephone handset, the private switch (PABX - Private Automatic Branch Exchange), the public network (PSTN - Public Switched Telephone Network) and the interface between the speech-based system and the public network (directly or through a PABX).
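
As an illustration of the DTMF access mode mentioned above, the following minimal sketch detects a touch-tone key with the Goertzel recurrence, a common way of measuring signal power at a few known frequencies. It is not taken from any particular product: the function names, the 8000 Hz sampling rate and the 50 ms tone length are assumptions made for the example, and a real detector would also need thresholds to reject speech and silence.

```python
import math

# Standard DTMF row and column frequencies in Hz, and the keypad layout.
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq_hz, sample_rate_hz=8000):
    """Return the signal power at freq_hz using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate_hz)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate_hz=8000):
    """Pick the strongest row tone and column tone and map them to a key.

    Hypothetical helper: it always returns a key, even for non-DTMF input.
    """
    row = max(range(4),
              key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate_hz))
    col = max(range(4),
              key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate_hz))
    return KEYPAD[row][col]


def make_tone(key, duration_s=0.05, sample_rate_hz=8000):
    """Synthesise the two-tone signal for a key (helper for trying the detector)."""
    for r, line in enumerate(KEYPAD):
        if key in line:
            f_row, f_col = ROW_FREQS[r], COL_FREQS[line.index(key)]
            break
    else:
        raise ValueError("not a DTMF key: %r" % key)
    n = int(duration_s * sample_rate_hz)
    return [math.sin(2.0 * math.pi * f_row * t / sample_rate_hz)
            + math.sin(2.0 * math.pi * f_col * t / sample_rate_hz)
            for t in range(n)]
```

For instance, `detect_dtmf(make_tone("5"))` recovers the key from a clean synthetic tone; on real telephone lines the channel effects discussed in this section make the decision harder.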

The use of telephones induces many phenomena that are not observed when the training is carried out with high quality microphones. Such phenomena include local acoustic echo, electrical echo, line noise, non-linearities, spectral distortions, etc. All these features are considered as non-linguistic sources of variability and have to be taken into account.

The system therefore has to process, as its input, the speech signal uttered by the caller and sent through the telephone handset, the public telephone network and the local interfaces. The system may have to provide the service, using the speech recogniser, to callers wherever the call originates. Hence the system has to support all types of telephone handsets and lines, unless the callers are a selected set of users who are asked to use specific handsets that are provided. In the general case the system has to accommodate telephone handsets with electret and carbon button microphones, cellular phones, etc. This is usually managed through the database collected and used for training. If the system is dedicated to a particular type of call (domestic, inter-city, long distance by satellite or undersea cable), then the technology provider should indicate whether he supplies the appropriate speech models or whether he requires the application developer to collect such data. The system may also tackle the telephone line characteristics through an adaptation procedure, and the application developer should be instructed about how to use it (see the channel/environment adaptation section).

Interface between the network and the speech system

The connection between the public switch (PTT network) and the speech recognition system can be either analogue or digital. The application has to deal with telephone signalling as well as speech signals. For systems that simultaneously manage speech signals, DTMF and/or pulse detection, telephone signalling is analysed as part of the speech analysis module.

Analogue connection

The interface between the system and the network is through the classical tip and ring wires. The signal is analogue and has to be converted to digital form. For this purpose one needs a telephone interface that answers incoming phone calls (inbound calls) by analysing a loop-current drop event (going off-hook) and that manages the different telephone tones (dial tone, busy tone, etc.) in accordance with the PTT regulations, since the interface must fulfil the requirements for standard switching compliance (timing and control signal management).

Although the speech recogniser does not deal with telephone control signals, these have to be considered at the application level: to go off-hook, to know that the caller is still on-line or that he has hung up (gone on-hook), to detect an event related to the loop-current drop and terminate signal processing, to process call transfer functions, to terminate a call, etc.
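
The application-level handling of control signals described above can be pictured as a small state machine. The sketch below is purely illustrative: the state names, the event names and the `CallSession` class are hypothetical and do not correspond to any telephone board's programming interface.

```python
from enum import Enum, auto


class CallState(Enum):
    ON_HOOK = auto()    # idle, waiting for an inbound call
    OFF_HOOK = auto()   # call answered, caller on-line


class LineEvent(Enum):
    INCOMING_CALL = auto()       # the telephone interface reports an inbound call
    LOOP_CURRENT_DROP = auto()   # the caller has hung up
    APPLICATION_HANGUP = auto()  # the application decides to terminate the call


class CallSession:
    """Hypothetical application-level view of one telephone line."""

    def __init__(self):
        self.state = CallState.ON_HOOK
        self.recognising = False  # whether speech frames go to the recogniser

    def handle(self, event):
        """Update the line state for one event and return the new state."""
        if self.state is CallState.ON_HOOK and event is LineEvent.INCOMING_CALL:
            self.state = CallState.OFF_HOOK   # go off-hook
            self.recognising = True           # start signal processing
        elif self.state is CallState.OFF_HOOK and event in (
                LineEvent.LOOP_CURRENT_DROP, LineEvent.APPLICATION_HANGUP):
            self.recognising = False          # terminate signal processing
            self.state = CallState.ON_HOOK    # go on-hook
        return self.state
```

The point of the design is that the recogniser itself never sees these events; the application simply switches recognition on and off as the line state changes.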

In some cases the systems host the telephone interface and the speech recogniser on the same board, and the speech processing functions are a set of programming functions at the same level as the telephone functions.

In many other cases the telephone interface is handled separately on another board, and speech data is provided through specific expansion buses such as PEB, AEB, MVIP or SCSA, described in Section 2.6.2. In that case speech recognition deals with the speech frames forwarded by the telephone interface and is not concerned with the line signalling.

Analogue to digital conversion

The speech waveform is converted to digital samples before being passed to the recogniser module. The converter characteristics are of paramount importance and have to be clearly identified: the sampling frequency (how many samples per second) and the coding rate (how many bits are used to represent each sample). The standard figures are 8000 Hz and 8 bits, but some systems may offer 8000 Hz and 13 bits. These parameters can be set once and for all at the application level, but this has to be done in accordance with the speech recognition specifications. If they can be configured, the application developer has to be made aware of it.
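
To make the two converter parameters concrete, the short sketch below computes the resulting bit rate and quantises a sample to a given number of bits. The function names are illustrative, and the uniform quantiser is an assumption chosen for simplicity; public telephone networks typically obtain their 8-bit samples by logarithmic companding (A-law or mu-law) rather than by uniform quantisation.

```python
import math


def bit_rate(sample_rate_hz, bits_per_sample):
    """Bits per second produced by the converter for one channel."""
    return sample_rate_hz * bits_per_sample


def quantise(x, bits):
    """Uniformly quantise x, nominally in [-1.0, 1.0), to a signed integer code.

    Values outside the nominal range are clipped to the extreme codes.
    """
    levels = 1 << (bits - 1)              # e.g. 128 for 8 bits
    code = int(math.floor(x * levels))
    return max(-levels, min(levels - 1, code))


# The two configurations quoted in the text.
rate_8_bits = bit_rate(8000, 8)     # 64000 bit/s
rate_13_bits = bit_rate(8000, 13)   # 104000 bit/s
```

With 8 bits a sample of 0.5 maps to code 64, whereas with 13 bits it maps to code 2048, which shows how much finer the amplitude resolution becomes at the same sampling frequency.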

Digital interfaces

The telephone network is increasingly based on digital switches. The speech signal uttered by a caller is converted to a digital code, and transmission between the public switches is digital. If the link between the application and the local switch is an analogue line, the signal is converted from digital back to analogue. If the connection is digital, the switch provides a digital signal at 64 kbit/s for speech, plus the telephone signalling. The connection can be a single ISDN connection (called S0), an equivalent of one line, or it can be a T1 (USA) or E1 (Europe) group of 24 or 30 lines respectively.

The speech signal is acquired by a telephone interface and has to be passed to the speech recogniser through an expansion bus like the ones mentioned above (AEB, PEB, MVIP or SCSA, described in Section 2.6.2).

Noise conditions

Telephone applications have to operate despite the considerable amount of noise that comes with the calls. The collected speech includes other signals (TV, radio, computer and printer noises, other speakers, car noise for cellular calls, etc.). The application developers have to know the best way to take all this into consideration.

Some systems tackle such noises through rejection capabilities and others through an adaptation of the signal-to-noise ratio. Some systems offer a gain control which can be set up automatically and dynamically; others offer a static gain parameter adapted to the general operating conditions. Instructions about how to optimise it have to be given.
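
The quantities involved in tuning a static gain can be sketched as follows. This is a minimal illustration, not a description of any particular system: the function names are made up for the example, the signal-to-noise ratio is estimated crudely from one segment containing speech and one containing noise only, and the gain is a single factor chosen to reach a target RMS level.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty segment."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Rough SNR estimate in dB.

    Assumes speech_samples was taken while the caller was talking and
    noise_samples during a pause; the speech segment still contains noise,
    so this is strictly a (speech + noise) to noise ratio.
    """
    return 20.0 * math.log10(rms(speech_samples) / rms(noise_samples))


def static_gain(samples, target_rms):
    """Single gain factor that brings the segment to target_rms."""
    return target_rms / rms(samples)


def apply_gain(samples, gain, limit=1.0):
    """Scale the samples and clip them to [-limit, limit]."""
    return [max(-limit, min(limit, x * gain)) for x in samples]
```

A dynamic gain control would recompute the factor continuously over short windows, whereas the static variant above is computed once from representative recordings of the operating conditions.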

Some systems that offer echo cancelling functions require an adjustment of the input and output gains. Hence the speech prompt level has to be carefully adjusted in order to minimise the echo and so allow efficient cut-through or voice-stop capabilities. Tools should be provided to set such a level even if the prompts are recorded at a professional studio.


In order to be connected to the public network, equipment has to fulfil the requirements for switching compliance and local regulations. These requirements cover signalling as well as electrical characteristics and electromagnetic radiation. After the equipment has passed the tests, it is given a registration number that is requested by the local government agency before system deployment. This does not indicate any assessed performance but simply that the connection of the equipment to the network is authorised. In several countries the PTT approval also concerns equipment that is behind a private switch (PABX), and in many others it concerns the application as a packaged system (software and hardware components). For example, if the application is implemented on a PC with a telephone interface board and a speech recogniser board, then the whole package has to pass the regulation tests.

The application developer has to know the approval status of the system he is planning to deploy; otherwise he may have to withdraw it from the network.


EAGLES SWLG SoftEdition, May 1997.