The number of words that the system can recognise has to be stated. The vocabulary may be small (about ten words), medium-sized (from 10 to 100 words), large (from 100 to 1000 words) or very large (over 1000 words). It may be treated as a single dictionary or divided into several sublexicons, each downloaded to the application for the dialogue phases in which it is needed.
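As an illustration only, the size classes and the idea of per-phase sublexicons can be sketched in Python. The function name, the phase names and the word lists are hypothetical, and the class boundaries are treated as inclusive at their upper end, which is an assumption because the ranges given above overlap at 10, 100 and 1000.

```python
# Hypothetical sublexicons, one per dialogue phase (names and contents are
# illustrative assumptions, not taken from any particular system).
SUBLEXICONS = {
    "confirmation": ["yes", "no"],
    "digits": ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"],
}


def vocabulary_class(n_words: int) -> str:
    """Map a vocabulary size to one of the four size classes.

    Upper bounds are treated as inclusive (assumption).
    """
    if n_words < 1:
        raise ValueError("a vocabulary must contain at least one word")
    if n_words <= 10:
        return "small"
    if n_words <= 100:
        return "medium"
    if n_words <= 1000:
        return "large"
    return "very large"


def active_vocabulary(phase: str) -> list:
    """Return the sublexicon that is active during a given dialogue phase."""
    return SUBLEXICONS[phase]
```

With this layout, only the words of the current dialogue phase need to be active at any one time, for example `vocabulary_class(len(active_vocabulary("digits")))` reports the size class of the digit sublexicon.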
The training phase is a crucial stage and involves several types of data and acquisition conditions. One may distinguish the amount of data characterising the environment and the acquisition channels, the amount of data per speaker that depends on the current application and the amount that is independent of it, the time between speech acquisition for training and system use, and so forth.
The data may consist of acoustic speech waveforms of isolated words, or orthographically and/or phonetically labelled speech sentences (arbitrary sentences, application dependent, phonetically balanced). The corpus size is also important in terms of the number of words/sentences and the number of repetitions. The data may also include any extra-linguistic phenomena that occur.
The training phase can be carried out by the application developer, or by the technology provider if it requires in-house know-how and hints. If the training is not done once and for all it is important to point this out. The duration of the training process also has to be reported.
As noted above, the training phase uses speech corpora that consist of a collection of recorded speech samples with corresponding labels. The technology provider and/or the application developer can select the list of words and set up a collection platform for this purpose. The speaker may have a sheet of paper with the list of the words/sentences to be read, or he may be requested to repeat the sentences played back by the system, or he may be asked questions, with his answers recorded to produce the speech data.
This process can be undertaken with or without supervision. The problems to resolve are how to judge the quality of the corpora (in terms of acoustic, phonetic, and linguistic coverage) and how to label the data (the labels can be orthographic, phonetic, acoustic segments with end-point marks, etc.). Selection of appropriate acquisition conditions is a crucial matter and has to be done carefully (with respect to how and by whom the data are produced).
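One simple, partial way to look at phonetic coverage is to check which units of a given inventory actually occur in the phonetic labels of the corpus. The sketch below assumes that each utterance is already labelled as a list of unit symbols; the function name and the toy symbols are hypothetical, and acoustic or linguistic coverage is not addressed.

```python
from collections import Counter


def phonetic_coverage(transcriptions, inventory):
    """Return (fraction of inventory units seen, sorted list of unseen units).

    transcriptions: iterable of utterances, each a list of unit symbols.
    inventory: iterable of the unit symbols the corpus is expected to cover.
    """
    units = set(inventory)
    if not units:
        raise ValueError("the unit inventory must not be empty")
    # Count every occurrence of every unit over the whole corpus.
    counts = Counter(u for utterance in transcriptions for u in utterance)
    missing = sorted(u for u in units if counts[u] == 0)
    return (len(units) - len(missing)) / len(units), missing
```

A corpus that leaves units unseen, or that covers some units with very few occurrences, would normally need additional sentences before training.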
Data acquisition necessitates a platform with particular requirements regarding memory storage, CPU capabilities per word/sentence and speaker, and adequate user-friendly interfaces. These parameters are listed within Section 2.6.2.
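The storage requirement can be estimated with simple arithmetic before the collection starts. The following is a rough sketch for uncompressed, single-channel recordings; the default sampling rate of 8 kHz and the 2 bytes per sample are assumptions chosen for illustration, as are the function and parameter names.

```python
def corpus_storage_bytes(n_items, n_repetitions, n_speakers,
                         seconds_per_item,
                         sample_rate_hz=8000, bytes_per_sample=2):
    """Estimate raw storage for a corpus of uncompressed mono recordings.

    n_items: number of distinct words or sentences to record.
    n_repetitions: repetitions of each item per speaker.
    n_speakers: number of speakers.
    seconds_per_item: average recording length, including silence margins.
    """
    n_recordings = n_items * n_repetitions * n_speakers
    return round(n_recordings * seconds_per_item
                 * sample_rate_hz * bytes_per_sample)
```

For example, 100 words, 3 repetitions, 50 speakers and 1 second per recording give 15000 recordings of 16000 bytes each, that is 240000000 bytes (about 240 MB) under these assumptions. Label files and any derived acoustic parameters come on top of this figure.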
The recognition process may use a set of words (global approach) or a set of subword units to identify the user utterances. If the system uses whole word models (global approach), these have to be learned beforehand. The vocabulary to recognise has to be recorded and used for training for each different application (fixed vocabulary). In the case of subword units (analytic approach), the speech units are learned once and for all and the vocabulary lexicon is generated as a concatenation of such units (flexible vocabulary).
In both cases the system may be optimised for a particular language or a class of languages. The
multi-linguality aspect is of paramount importance in the era of open economic marketplaces.
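The analytic approach, in which each lexicon entry is a concatenation of subword units that were trained once, can be sketched as follows. This is a minimal illustration: the unit symbols, the pronunciations and the function names are hypothetical, and a real system would match units acoustically rather than compare symbol sequences exactly.

```python
# Hypothetical inventory of subword units already trained by the provider.
UNIT_INVENTORY = {"y", "eh", "s", "n", "ow", "t", "aa", "p"}


def build_lexicon(pronunciations, inventory):
    """Build an application lexicon: word -> tuple of subword units.

    Raises ValueError if a word needs a unit that was never trained.
    """
    lexicon = {}
    for word, units in pronunciations.items():
        unknown = [u for u in units if u not in inventory]
        if unknown:
            raise ValueError(f"{word!r} uses untrained units: {unknown}")
        lexicon[word] = tuple(units)
    return lexicon


def lookup(unit_sequence, lexicon):
    """Return the words whose unit sequence equals the given sequence."""
    target = tuple(unit_sequence)
    return [word for word, units in lexicon.items() if units == target]
```

Adding a word to the application then only requires a new pronunciation entry, for example `build_lexicon({"stop": ["s", "t", "aa", "p"]}, UNIT_INVENTORY)`, with no new recordings of that word.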
FLEXIBLE VOCABULARY:
The subword units are acoustic units that may be based on linguistic or phonetic entities. The
technology provider has to describe this as a system feature. The application developer has to know
how to use the units to generate his own application vocabulary. During the recognition process a
``parser'' accomplishes the labelling of speech units and provides the lexicon entries.
For example, if the system deals with single words, the application developer has to know the task he will have to carry out and the skills needed for that purpose, such as:
FIXED VOCABULARY:
In this case the system is task dependent. For each task there is a need to acquire specific and tuned
corpora. The technology provider has to describe what kind of data is needed. This could be
speech waveforms of isolated words, or labelled sentences, or labelled and phonetically balanced
sentences, etc. He has to state what makes the speech database valid in terms of the number of words,
sentences, speakers, different acquisition conditions, etc. There may be a need for end-point
detection (speech segmentation and labelling) to be done manually or semi-automatically. The
database characteristics may differ depending on whether a speaker dependent or speaker
independent system is required. If some extra-linguistic phenomena are likely to occur, the technology
provider has to instruct the application developer how to take them into account.
The training phase can be carried out by the application developer, or by the technology provider if it requires in-house know-how and hints. Training is a time-consuming process and its duration has to be estimated by the technology developer.
For both fixed and flexible vocabulary approaches there are several possible procedures, which can be itemised as follows:
Some systems offer the possibility of defining a particular syntax to recognise connected digits or sequences of words, or to spot keywords within a sentence. This can be carried out at the algorithm level or at the modelling stage. The application developer should know what the options are and which tools will allow him to use them (model generation or rule definition).
For example, some systems offer a word spotting functionality through speech modelling and a syntactic rule formalism. Spotting words within a stream of speech is used to cope with para-linguistic factors such as hesitation (Er yes instead of Yes) and polite styles (Yes, please instead of Yes). This approach may be implemented using ``syntactic'' rules such as:
which means that the expected word (WORD) may be embedded in an optional stream of speech (so-called out-of-vocabulary speech). ``+'' indicates an alternative and ``()'' indicates options. This production rule allows the recognised word to be extracted from the sentence (SENTENCE).
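The behaviour described above, an expected word optionally surrounded by out-of-vocabulary speech, can be imitated on text as a minimal sketch. It is not the rule formalism of any particular system: the function name is hypothetical, the input is assumed to be an already transcribed sentence, and a real recogniser would apply the constraint to acoustic models rather than to tokens.

```python
import re


def spot_word(sentence, vocabulary):
    """Return the single in-vocabulary word found in the sentence, or None.

    Every other token is treated as optional out-of-vocabulary material.
    Sentences with no keyword, or with several, are rejected.
    """
    # Lower-case the input and split it into word tokens, dropping punctuation.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = [t for t in tokens if t in vocabulary]
    return hits[0] if len(hits) == 1 else None
```

With the vocabulary `{"yes", "no"}`, both `spot_word("Er yes", ...)` and `spot_word("Yes, please", ...)` return `"yes"`, while a sentence containing both keywords or neither returns `None`, so that the dialogue can ask again.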
The tools may allow a model to be set up for fixed-length sequences of words (if the application expects 8 digits, then this piece of information should be used to increase the recognition rate). Alternatively, the length may be left to the user, in which case the application may analyse the number of words recognised and then manage the dialogue accordingly. This is a way to account for likely user responses.
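One way to exploit a known sequence length at the application level is to filter the recogniser's candidate results. The sketch below assumes that the recogniser returns several scored hypotheses (an N-best list, which is an assumption and not something every system provides); the function name and the data layout are hypothetical.

```python
def best_with_length(hypotheses, expected_length):
    """Pick the highest-scoring hypothesis that has the expected word count.

    hypotheses: list of (words, score) pairs, where words is a list of
    recognised words and a higher score means a more likely hypothesis.
    Returns None when no hypothesis has the expected length, so that the
    dialogue manager can decide to re-prompt the user.
    """
    candidates = [h for h in hypotheses if len(h[0]) == expected_length]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h[1])
```

If the constraint is instead built into the recognition model itself, hypotheses of the wrong length are never produced, which is generally the more effective option when the tools support it.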
Many reported evaluation experiments have confirmed that the speech uttered by users during an exploitation session is more representative of field conditions than the speech acquired during recording sessions. Such speech data may be stored, labelled and incorporated in the training database to account for the field characteristics, and can be used to provide new releases of upgraded speech dictionaries for the application developer. It is important to know whether this is possible or not and how to handle such material.