As noted above, the wide spread of information databases will lead to widespread use of telecommunication systems to access remote databases. At the same time, speech control of workstations and of office and working environments (the desktop arena) opens new market sectors. It is believed that the ubiquity of telephone handsets will permit the adoption of speech technologies despite many limitations. Speech input can be acquired either through telephones or through microphones.
The two media yield different levels of performance and are suited to different types of application.
Using a microphone is an important means of improving performance, as it avoids the drawbacks caused by the technical limitations of the public telephone network, such as restricted bandwidth - although degradation may still occur owing to changes in microphone characteristics (electrical and acoustic).
Application developers have to be aware of the most important characteristics that affect the speech signal representation and thus the speech recogniser.
The application developer has to be aware that some users are not comfortable talking to a machine and will not easily accept a headset microphone. Moreover, if speech control is used, it is clearly inefficient to encumber users with a handheld microphone. For real applications, a remote microphone attached to the monitor, or mounted on a stalk that can actively track the speaker, may be suggested. In all cases the choice of microphone is the responsibility of the application developer (who may consider other ``human factors''), but it has to be clearly validated by the technology provider in order to ensure that a high-quality signal with a high signal-to-noise ratio is passed to the speech recognition module.
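As an illustration of such a validation step, the following sketch gives a rough signal-to-noise estimate for a test recording made with a candidate microphone. It is a minimal example only: it assumes a mono recording that contains both speech and pauses, and the percentile-based separation of speech and noise frames is a simplification rather than a calibrated measurement procedure.
\begin{verbatim}
# Rough signal-to-noise estimate for a candidate microphone recording.
import numpy as np


def estimate_snr_db(samples: np.ndarray, rate: int, frame_ms: float = 20.0) -> float:
    """Return a crude SNR estimate in dB from a mono speech recording."""
    frame_len = int(rate * frame_ms / 1000.0)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(rms, 10)    # quietest frames ~ background noise
    speech_level = np.percentile(rms, 90)   # loudest frames ~ active speech
    return 20.0 * np.log10(speech_level / noise_floor)


if __name__ == "__main__":
    rate = 16000
    t = np.arange(rate) / rate
    speech_like = 0.3 * np.sin(2 * np.pi * 220 * t) * (t > 0.5)  # toy "speech" burst
    noise = 0.01 * np.random.randn(rate)
    print(f"Estimated SNR: {estimate_snr_db(speech_like + noise, rate):.1f} dB")
\end{verbatim}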
In general, the microphone used during the exploitation phase has to be similar to the one used during the training phase. The application developer should therefore be instructed about its characteristics and the most ``influential'' factors in order to select an equivalent microphone.
Some technologies are well adapted to the use of microphone arrays, although this is still too expensive a solution for low-cost and cost-effective applications. In other cases the speech input is acquired through a telephone handset without passing through the telephone line and network environment. If such possibilities are offered, they have to be clearly stated to the application developer.
Some systems allow the use of other microphones, provided the system is adapted to the new microphone characteristics through an adaptation procedure that modifies either the speech references - trained with another microphone - or the speech input, so that the new conditions are matched. In general, systems have to be trained for each new speaker and for each new microphone.
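One widely used form of such channel compensation is cepstral mean normalisation, sketched below. It is given here only as a generic illustration, not as the adaptation procedure of any particular product; it exploits the fact that a stationary microphone or channel mismatch appears as an approximately constant offset in the cepstral feature domain.
\begin{verbatim}
# Cepstral mean normalisation (CMN): compensate for a stationary
# channel/microphone mismatch by removing the long-term average of the
# cepstral features.
import numpy as np


def cepstral_mean_normalise(cepstra: np.ndarray) -> np.ndarray:
    """cepstra: (n_frames, n_coeffs) array of cepstral feature vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)


# A fixed convolutional channel adds a constant offset in the cepstral
# domain, so subtracting the mean cancels most of its effect:
clean = np.random.randn(200, 13)          # toy cepstral features
channel_offset = np.full(13, 0.7)         # stationary microphone/channel term
observed = clean + channel_offset
assert np.allclose(cepstral_mean_normalise(observed),
                   cepstral_mean_normalise(clean))
\end{verbatim}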
The telephone is becoming a new ``computer terminal'' and allows access to different services using DTMF (Dual Tone Multi-Frequency, or touch tone), pulse detection or speech recognition. In almost all applications speech recognition has to be telephone-independent. The telephone channel includes the telephone handset, the private switch (PABX - Private Automatic Branch Exchange), the public network (PSTN - Public Switched Telephone Network) and the interface of the speech-based system with the public network (directly or through a PABX).
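For illustration, DTMF digits are commonly detected with the Goertzel algorithm, which measures the energy at the eight DTMF tone frequencies. The sketch below is simplified (it omits the twist and level checks a deployed detector would apply), and the function names are chosen for this example only.
\begin{verbatim}
# DTMF detection with the Goertzel algorithm: estimate the energy at the
# eight DTMF frequencies and map the strongest row/column pair to a key.
import numpy as np

ROW_FREQS = [697, 770, 852, 941]
COL_FREQS = [1209, 1336, 1477, 1633]
KEYS = ["123A", "456B", "789C", "*0#D"]


def goertzel_power(samples: np.ndarray, rate: int, freq: float) -> float:
    coeff = 2.0 * np.cos(2.0 * np.pi * freq / rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def decode_dtmf(samples: np.ndarray, rate: int) -> str:
    row = max(ROW_FREQS, key=lambda f: goertzel_power(samples, rate, f))
    col = max(COL_FREQS, key=lambda f: goertzel_power(samples, rate, f))
    return KEYS[ROW_FREQS.index(row)][COL_FREQS.index(col)]


if __name__ == "__main__":
    rate = 8000
    t = np.arange(int(0.05 * rate)) / rate
    tone = np.sin(2 * np.pi * 770 * t) + np.sin(2 * np.pi * 1336 * t)  # key '5'
    print(decode_dtmf(tone, rate))
\end{verbatim}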
The use of telephones induces many phenomena that are not observed when training is carried out with high-quality microphones. Such phenomena include local acoustic echo, electrical echo, line noise, non-linearities, spectral distortions, etc. All these are non-linguistic sources of variability and have to be taken into account.
The system thus has to process, as its input, the speech signal uttered by the caller and transmitted through the telephone handset, the public telephone network and the local interfaces. The system may have to provide the service, using the speech recogniser, to callers wherever the call originates. Hence the system has to support all types of telephone handsets and lines, unless the callers are a selected set of users who are asked to use specific handsets provided to them. In the general case the system has to accommodate telephone handsets with electret and carbon-button microphones, cellular phones, etc. This is usually managed through the database collected and used for training. If the system is dedicated to a particular type of call (domestic, inter-city, long-distance by satellite or undersea cable), the technology provider should indicate whether he supplies the corresponding speech models or whether he requires the application developer to collect such data. The system may also tackle the telephone line characteristics through an adaptation procedure, and the application developer should be instructed about how to use it (see the channel/environment adaptation section).
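Where only wideband microphone recordings are available, one possible way of taking such channel variability into account is to simulate the telephone channel on the training data, as in the following sketch; the filter order, noise level and the absence of handset and echo effects are simplifying assumptions rather than recommended settings.
\begin{verbatim}
# Simulate a telephone channel on a wideband microphone recording:
# band-limit to the telephone band, resample to 8 kHz, add line noise.
import numpy as np
from scipy.signal import butter, lfilter, resample_poly


def simulate_telephone_channel(samples: np.ndarray, rate: int = 16000,
                               noise_db: float = -40.0) -> np.ndarray:
    # Band-limit to roughly the 300-3400 Hz telephone band.
    b, a = butter(4, [300.0 / (rate / 2), 3400.0 / (rate / 2)], btype="bandpass")
    narrowband = lfilter(b, a, samples)
    # Resample to the 8 kHz rate used on the public network.
    narrowband = resample_poly(narrowband, 8000, rate)
    # Add stationary line noise at the requested level relative to full scale.
    noise = 10.0 ** (noise_db / 20.0) * np.random.randn(len(narrowband))
    return narrowband + noise


if __name__ == "__main__":
    rate = 16000
    wideband = np.random.randn(rate)          # stand-in for a studio recording
    print(simulate_telephone_channel(wideband, rate).shape)
\end{verbatim}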
Although the speech recogniser does not deal with telephone control signals, these have to be handled at the application level: going off-hook, knowing that the caller is still on line or has hung up (gone on-hook), detecting a loop-current drop and terminating signal processing accordingly, processing call transfer functions, terminating a call, etc.
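The following sketch outlines the kind of call-control handling an application typically layers on top of the recogniser; the event names and classes used here are hypothetical, since real telephone boards expose vendor-specific interfaces.
\begin{verbatim}
# Application-level call control: the application, not the recogniser,
# reacts to line events such as ring, loop-current drop and hang-up.
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    ACTIVE = auto()
    TERMINATED = auto()


class CallController:
    def __init__(self):
        self.state = CallState.IDLE

    def on_event(self, event: str) -> None:
        if event == "ring" and self.state is CallState.IDLE:
            self.state = CallState.ACTIVE        # go off-hook, start the dialogue
        elif event in ("loop_current_drop", "hangup") and self.state is CallState.ACTIVE:
            self.state = CallState.TERMINATED    # caller went on-hook: stop recognition
        elif event == "transfer_request" and self.state is CallState.ACTIVE:
            pass                                 # hand the call over to another service


controller = CallController()
for ev in ("ring", "loop_current_drop"):
    controller.on_event(ev)
assert controller.state is CallState.TERMINATED
\end{verbatim}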
In some cases the telephone interface and the speech recogniser are hosted on the same board, and the speech processing functions are offered as a set of programming functions at the same level as the telephone functions.
In many other cases the telephone interface is handled separately on another board, and speech data is provided through specific expansion buses such as PEB, AEB, MVIP or SCSA, described in Section 2.6.2. In that case the speech recogniser deals with speech frames forwarded by the telephone interface and is not concerned with the line signalling.
The speech signal is acquired by the telephone interface and has to be passed to the speech recogniser through one of the expansion buses mentioned above (AEB, PEB, MVIP or SCSA, described in Section 2.6.2).
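The division of labour described above can be pictured schematically as follows: the telephone interface delivers fixed-size speech frames, while the recogniser consumes frames only and never sees the line signalling. Both components in this sketch are hypothetical stand-ins for a board-level implementation.
\begin{verbatim}
# Schematic frame forwarding from a telephone interface to a recogniser.
import queue

FRAME_SAMPLES = 160          # 20 ms of 8 kHz speech


def telephone_interface(audio: bytes, frames: "queue.Queue[bytes]") -> None:
    """Chops the incoming call audio into frames and forwards them."""
    for i in range(0, len(audio), FRAME_SAMPLES * 2):     # 16-bit samples
        frames.put(audio[i:i + FRAME_SAMPLES * 2])
    frames.put(None)                                      # end of call


def recogniser(frames: "queue.Queue[bytes]") -> int:
    """Consumes speech frames; knows nothing about hooks or loop current."""
    count = 0
    while (frame := frames.get()) is not None:
        count += 1                                        # feature extraction would go here
    return count


frame_queue: "queue.Queue[bytes]" = queue.Queue()
telephone_interface(b"\x00" * FRAME_SAMPLES * 2 * 50, frame_queue)  # 1 s of silence
print(recogniser(frame_queue), "frames received")
\end{verbatim}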
Some systems tackle such noises through rejection capabilities, others through an adaptation of the signal-to-noise ratio. Some systems offer a gain control which can be set automatically and dynamically; others offer a static gain parameter adapted to the general operating conditions. Instructions on how to optimise this parameter have to be given.
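The difference between a static gain and an automatic, dynamically adapted gain can be outlined as follows; the target level and smoothing constant are illustrative values, not recommendations.
\begin{verbatim}
# Static gain versus a simple automatic gain control (AGC).
import numpy as np


def static_gain(frame: np.ndarray, gain: float = 2.0) -> np.ndarray:
    # One fixed factor, chosen for the expected operating conditions.
    return gain * frame


class AutomaticGainControl:
    def __init__(self, target_rms: float = 0.1, smoothing: float = 0.9):
        self.target_rms = target_rms
        self.smoothing = smoothing
        self.level = target_rms          # running estimate of the input level

    def process(self, frame: np.ndarray) -> np.ndarray:
        rms = float(np.sqrt(np.mean(frame ** 2)) + 1e-9)
        self.level = self.smoothing * self.level + (1.0 - self.smoothing) * rms
        return frame * (self.target_rms / self.level)


agc = AutomaticGainControl()
quiet = 0.01 * np.random.randn(160)
for _ in range(50):                      # after a few frames the level converges
    out = agc.process(quiet)
print(f"output RMS ~ {np.sqrt(np.mean(out ** 2)):.3f}")   # close to the 0.1 target
\end{verbatim}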
Some systems that offer echo cancelling functions require an adjustment of the input and output gains. The speech prompt level therefore has to be carefully adjusted in order to minimise the echo and allow efficient cut-through or voice-stop capabilities. Tools should be provided to set this level, even if the prompts are recorded in a professional studio.
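A rough way of checking whether the prompt level leaves enough headroom for cut-through (barge-in) is to play a prompt into the line with no caller speaking, record the residual echo and compare its level with the expected caller speech level. The 10 dB margin and the caller level used below are illustrative assumptions only.
\begin{verbatim}
# Compare the residual prompt echo with the expected caller level.
import numpy as np


def rms_db(x: np.ndarray) -> float:
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)


def echo_margin_db(echo_recording: np.ndarray, expected_caller_db: float = -26.0) -> float:
    """Positive margin: caller speech should dominate the prompt echo."""
    return expected_caller_db - rms_db(echo_recording)


residual_echo = 0.005 * np.random.randn(8000)       # what the line returns during a prompt
margin = echo_margin_db(residual_echo)
if margin < 10.0:
    print(f"margin {margin:.1f} dB: lower the prompt output gain")
else:
    print(f"margin {margin:.1f} dB: cut-through should be reliable")
\end{verbatim}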
The application developer has to know the status of the system he is planning to deploy; otherwise he may have to take it out of the network.