Once the recognition system has been trained, it can be tested under a set of specified conditions. These conditions may involve adding noise to speech from a database or applying other manipulations to the speech. The training often has to be adapted to these conditions, e.g. by training with noise.
If the test conditions involve added noise, some special precautions have to be taken. First, it is good to realise that, for a test with added noise, the system can have been trained either with or without noise. Because retraining for several noise levels takes additional time, a system is generally trained under one noise condition and tested under various conditions.
Secondly, it is very important to add the noise continuously and
independently of the speech. This means that the moment at
which the noise starts must be independent of the beginning and
ending of the words. If the start of the noise were linked to the
start of a word, it would give the recognition system a clue to where
the word begins. This is a major design consideration, especially for
the assessment of connected word recognition systems. In practice,
the noise should start a few seconds before the test utterance starts
and end only after the recogniser has output the recognised words. The safest
method for stationary types of noise, however, is to have the noise
continuously available at the recogniser input.
RECOMMENDATION 7
If you test a recognition system under noise conditions,
make sure that the noise signal is continuously available, or at least
during the period extending from some time before the recognition
starts until some time after the recognition has ended.
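The timing requirement can be sketched in a few lines of Python. This is only an illustration, not a tool described in this handbook: the function name `embed_in_noise`, the 2 to 4 second lead-in and the 2 second tail are assumptions chosen to match "a few seconds", and the signals are assumed to be plain lists of float samples at the same sample rate. Level adjustment is left out here.

```python
import random


def embed_in_noise(speech, noise, sample_rate,
                   min_lead_s=2.0, max_lead_s=4.0, tail_s=2.0, rng=random):
    """Add `speech` to a stretch of `noise` that starts before the utterance
    and ends after it. Returns (mixed_samples, lead_in_samples)."""
    # Random lead-in (assumed range), so that the noise onset carries no
    # information about the onset of the first word.
    lead = int(rng.uniform(min_lead_s, max_lead_s) * sample_rate)
    tail = int(tail_s * sample_rate)
    total = lead + len(speech) + tail
    if len(noise) < total:
        raise ValueError("noise recording is too short for this utterance")
    # Random starting point in the noise file, chosen independently of the speech.
    start = rng.randrange(len(noise) - total + 1)
    mixed = list(noise[start:start + total])
    for i, sample in enumerate(speech):
        mixed[lead + i] += sample
    return mixed, lead
```

The returned lead-in length is only needed for scoring or debugging; it should not be passed to the recogniser under test.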
One way to achieve this is to add analogue noise ``outside'' the playback device. For simplicity, a playback device can be used to produce the noise during the test, although a disadvantage is that the signal-to-noise ratio (SNR) has to be adjusted carefully. There is less control over the noise level (and thus over the SNR), so one must ensure that the levels are correct by measuring the electrical signals (see Chapter 8 for instructions on how to do this). The software tool Speech Level Meter (SLM), developed under SAM, can be used for this purpose (see Appendix E).
The more modern approach is to add the noise and the test speech signal digitally. This makes it easy to vary the noise within a series of tests, and signal-to-noise measurements can be performed in the digital domain before the mixed signal is fed to the recogniser.
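As a rough illustration of digital mixing at a chosen signal-to-noise ratio, the following Python sketch scales the noise before adding it to the speech. The function names are hypothetical, and the level is taken as the plain RMS over the whole segment, which is a simplifying assumption: a dedicated speech level measurement may treat pauses differently.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB, from plain RMS levels (an assumption)."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that the speech-to-noise RMS ratio equals
    `target_snr_db`, then add it to the speech.
    Returns (mixed_samples, scaled_noise)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as the speech")
    segment = noise[:len(speech)]
    gain = rms(speech) / (rms(segment) * 10.0 ** (target_snr_db / 20.0))
    scaled = [gain * v for v in segment]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled
```

Because the scaled noise is returned separately, the SNR can be checked with `snr_db(speech, scaled)` before the mixed signal is passed on.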
If two signals of comparable level are to be added within a limited dynamic range (set by the number of bits used to represent a sample, often 16 bits), the sample values of both signals have to be divided by 2 first to avoid overflow. This is a level reduction of 6 dB, while the addition of two uncorrelated signals of equal level increases the signal level by only 3 dB. Thus, the level after the addition is 3 dB below that of either original signal. This level reduction is typical of digital addition.
In the case of analogue mixing of equal-level signals, the
individual signals have to be attenuated by only 3 dB before they are
added, leading to no net level change. If the speech level
and the noise level are very different, one could in principle use a
smaller reduction of the signals, but the reduction of 6 dB is very
convenient in practice, because it can be implemented as a simple
bit-shift in digital addition.
RECOMMENDATION 8
If you add two signals of comparable level
digitally, reduce both signals by 6 dB (a factor of
2 in amplitude) before the addition. Be aware that this
will lead to a reduction of level.
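The level arithmetic behind this recommendation can be checked numerically. The sketch below is an illustration under stated assumptions: the two inputs are hypothetical, independently generated Gaussian noise signals of equal level, stored as 16-bit integers, and the function names are made up for this example.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def add_with_headroom(a, b):
    """Add two sequences of 16-bit integer samples after shifting each right
    by one bit (a factor of 2, i.e. about 6 dB), so the sum cannot overflow."""
    return [(p >> 1) + (q >> 1) for p, q in zip(a, b)]


def random_16bit_signal(n, sigma, rng):
    """Hypothetical test signal: Gaussian noise clipped to the 16-bit range."""
    return [max(-32768, min(32767, int(rng.gauss(0.0, sigma)))) for _ in range(n)]


rng = random.Random(1)
a = random_16bit_signal(20000, 3000.0, rng)
b = random_16bit_signal(20000, 3000.0, rng)
mixed = add_with_headroom(a, b)

# Each input drops by 6 dB, and the uncorrelated sum regains about 3 dB,
# so the result should be close to 3 dB below one original input.
level_change_db = 20.0 * math.log10(rms(mixed) / rms(a))
print(f"level change relative to one input: {level_change_db:.2f} dB")
```

With correlated inputs the result would differ: two identical signals, for instance, would come out at the original level, since the halved copies add coherently.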
If digital addition of noise is chosen, another point is important: the sample rates of the signals must be the same. This may seem a trivial remark, but in practice mismatched sample rates occur more often than you would expect. Most signal processing software can re-sample a signal digitally, but this may demand resources such as special digital filters. In the SAM project, the program ``RESAM'' was developed to solve the problem of the dual sample rate standards of 16 kHz and 20 kHz (see Appendix E). This software tool comes with a utility to add digital signals.
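To give an idea of what re-sampling involves, here is a minimal Python sketch that converts a signal from one sample rate to another by Hann-windowed sinc interpolation. It is not the algorithm used by RESAM, which is not described here; the function name, the window type and the filter half-width of 32 input samples are all assumptions made for illustration. A production tool would normally use a more carefully designed filter.

```python
import math


def resample(x, fs_in, fs_out, half_width=32):
    """Resample the list `x` from `fs_in` to `fs_out` (both in Hz) using a
    Hann-windowed sinc kernel spanning `half_width` input samples per side."""
    step = fs_in / fs_out                  # input samples per output sample
    # Low-pass cutoff relative to the input Nyquist frequency; below 1 when
    # downsampling, so that content above the new Nyquist is attenuated.
    cutoff = min(1.0, fs_out / fs_in)
    n_out = int(len(x) * fs_out / fs_in)
    y = []
    for m in range(n_out):
        t = m * step                       # output instant in input-sample units
        centre = int(math.floor(t))
        first = max(0, centre - half_width)
        last = min(len(x) - 1, centre + half_width)
        acc = 0.0
        for n in range(first, last + 1):
            u = t - n
            if abs(u) >= half_width:
                continue                   # outside the window support
            arg = math.pi * cutoff * u
            sinc = 1.0 if arg == 0.0 else math.sin(arg) / arg
            window = 0.5 * (1.0 + math.cos(math.pi * u / half_width))
            acc += x[n] * cutoff * sinc * window
        y.append(acc)
    return y
```

For example, `resample(noise, 20000, 16000)` would bring a 20 kHz noise file to the rate of 16 kHz speech before the two are added. The first and last `half_width` samples are less accurate, because the kernel is truncated at the signal edges.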
In the SAM project and within the NATO research study group RSG 10, several efforts have been undertaken to produce standardised digital noise files. The first product is a CD-ROM entitled ``Noise-ROM-0'', which contains 24 different noise-like signals, each 4 minutes in length, at a 20 kHz sampling rate. The noises vary from wideband reference noises (5 types) to noise from shotguns, cars, aircraft, armed vehicles, etc. The CD-ROM is produced by TNO Human Factors Research Institute in the Netherlands and RSRE Speech Research Unit in England.
The second noise database is called ``Noisex'', and is distributed on two CD-ROMs. It contains English digits under various calibrated noise conditions. The database has been produced by DRA Speech Research Unit in England.
Some recognition systems are known to be sensitive to slowly varying level fluctuations. These can result from a varying mouth-microphone distance, differences in the speaker's emotional state, etc. A solution to this problem is to insert an automatic gain control (AGC) in the signal path. However, AGCs tend to have the annoying habit of increasing the gain continuously in periods when there is no speech, until the background noise has reached the required level. When the speaker starts talking again, an overload occurs and the gain has to be brought back to a lower level immediately. The solution to this problem is to have the AGC detect silences and not increase the gain during these silences. Although this is in principle possible with an analogue circuit, nowadays a digital AGC is more convenient. An AGC can be implemented with a digital signal processor board in a personal computer.
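The idea of holding the gain during silences can be sketched as a frame-based digital AGC in Python. This is a simplified illustration, not a description of any particular product: the function name, the fixed RMS threshold used as a silence detector, and the adaptation rate of 0.2 per frame are assumptions.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def agc_with_silence_hold(samples, frame_len, target_rms, silence_rms, rate=0.2):
    """Frame-based AGC that adapts the gain only while speech is present.
    Frames whose RMS level is at or below `silence_rms` are treated as
    silence, and the gain is held at its last value.
    Returns (output_samples, gain_per_frame)."""
    gain = 1.0
    output, gains = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = rms(frame)
        if level > silence_rms:
            # Speech present: move the gain a fraction of the way towards
            # the value that would bring this frame to the target level.
            gain += rate * (target_rms / level - gain)
        # Otherwise: silence, so the gain is left unchanged.
        output.extend(gain * v for v in frame)
        gains.append(gain)
    return output, gains
```

Because the gain is frozen during pauses, the background noise is not amplified up to the target level, and the gain is still appropriate when the speaker resumes at a similar level.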