The question of what material to employ applies to speech synthesis as well as to recognition. Before concentrating on issues relevant to recognition, a brief comment is in order about selecting a speaker on whom to base a synthesis system, since the requirements differ from those of recognition. In synthesis, the aim is to select one speaker, or a small number of speakers, whose speech conforms to certain criteria. One reasonable criterion is that the speaker should be ``highly intelligible''. Professional groups such as radio announcers or people with voice training would seem to be reasonable candidates, but from a scientific point of view some metric ought to be applied so that the decision is not subjective. Such a metric could be developed from Likert scales (see Section 9.4.3).
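As a rough illustration of how Likert ratings could make the choice of speaker less subjective, the sketch below ranks candidate speakers by the median rating given by a panel of listeners. The 5-point scale, the function name rank_speakers and the use of the median (chosen because Likert data are ordinal) are assumptions made for this example; they are not prescribed by the text.

```python
from statistics import median

SCALE_MIN, SCALE_MAX = 1, 5  # assumed 5-point scale; 5 = most intelligible


def rank_speakers(ratings):
    """Rank candidate speakers by median listener rating.

    ratings maps a speaker label to the list of integer ratings given by
    a panel of listeners. Returns (speaker, median) pairs, best first.
    """
    summary = []
    for speaker, scores in ratings.items():
        if not scores:
            raise ValueError(f"no ratings for speaker {speaker!r}")
        if any(not SCALE_MIN <= s <= SCALE_MAX for s in scores):
            raise ValueError(f"rating out of range for speaker {speaker!r}")
        summary.append((speaker, median(scores)))
    # Sort by median, highest first; ties keep their input order.
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Hypothetical ratings from four listeners for two candidate speakers.
ranking = rank_speakers({"B": [3, 4, 2, 3], "A": [5, 4, 4, 5]})
```

In practice the ranking would be only a first step; whether the difference between two candidates is reliable depends on the number of listeners and on the spread of their ratings.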
If an ANN-based recogniser is to be trained and tested, different requirements are imposed. If the goal is to recognise unrestricted speech by the speakers of a language (as in the example), a statistically representative sample of exemplars is needed; that is, the sample should conform in its statistical properties to the population it is supposed to represent. There are two major facets of representativeness here. First, an adequate sample of different speakers is needed (see Section 9.2.5 for some estimation methods appropriate for this). Second, the structure of the materials produced by the selected speakers (whether spontaneous or specially constructed for reading) needs to be checked to see whether it conforms to unrestricted speech in the language. An explicit formulation is needed of what would constitute an adequate check on whether the samples are representative. A weak formulation might be to check whether the sample contains all the language's phonemes. A stronger formulation would be to establish whether the sample contains all the language's diphones. An even stronger formulation would be to check whether all the diphones occur with the same frequency as they do in spontaneous samples of speech drawn from the language. The stronger the formulation, the more likely it is that the sample is representative of the language; on the other hand, the more work is required in obtaining the sample and comparing it against the language.
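The three formulations can be sketched as checks on phoneme-transcribed utterances. This is a minimal illustration under several assumptions: utterances are lists of phoneme symbols, the language's diphones are approximated by those observed in a reference corpus of spontaneous speech, diphones spanning utterance boundaries are ignored, and the frequency comparison uses total variation distance (0 for identical relative frequencies, 1 for no overlap), which is one possible measure among many. The function names are hypothetical.

```python
from collections import Counter


def diphones(utterance):
    """Return the adjacent phoneme pairs within one utterance."""
    return list(zip(utterance, utterance[1:]))


def representativeness(sample, reference, inventory):
    """Apply the weak, stronger and strongest checks to a sample.

    sample, reference: lists of utterances (lists of phoneme symbols).
    inventory: set of all phonemes of the language.
    Returns (missing_phonemes, missing_diphones, frequency_distance).
    """
    sample_counts = Counter(d for utt in sample for d in diphones(utt))
    ref_counts = Counter(d for utt in reference for d in diphones(utt))
    n_sample = sum(sample_counts.values())
    n_ref = sum(ref_counts.values())
    if n_sample == 0 or n_ref == 0:
        raise ValueError("both corpora must contain at least one diphone")

    # Weak check: phonemes of the language absent from the sample.
    missing_phonemes = set(inventory) - {p for utt in sample for p in utt}

    # Stronger check: diphones seen in the reference but not in the sample.
    missing_diphones = set(ref_counts) - set(sample_counts)

    # Strongest check: total variation distance between the relative
    # diphone frequencies of the two corpora.
    all_diphones = set(sample_counts) | set(ref_counts)
    distance = 0.5 * sum(
        abs(sample_counts[d] / n_sample - ref_counts[d] / n_ref)
        for d in all_diphones
    )
    return missing_phonemes, missing_diphones, distance


# Toy data with a hypothetical four-phoneme inventory.
inventory = {"k", "a", "t", "s"}
reference = [["k", "a", "t"], ["t", "a", "k"], ["s", "a", "t"]]
sample = [["k", "a", "t"], ["s", "a", "k"]]
missing_p, missing_d, dist = representativeness(sample, reference, inventory)
```

In the toy data the sample passes the weak check, since every phoneme occurs, but fails the stronger one, since the diphone t-a never appears. This shows why the formulations are not interchangeable.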
Considering phonemes first, there are reasons to suppose that checking for these units alone would not be satisfactory. The main problem is that checking only whether all phonemes occur implicitly assumes that some salient identifying property of a phoneme can be extracted whatever phonemic context it occurs in. This assumption is controversial.
Diphones are the observed two-phoneme combinations for the language. The advantage of this and related units (such as the demisyllable) is that they include some measure of context over adjacent phonemes. The disadvantage is that there are more of these units than there are phonemes and, consequently, larger samples of speech are needed to ensure representativeness. Ideally, if these units are to be the basis of a recogniser, a check is required that all diphones that occur in spontaneous speech also occur in the sample. It cannot be assumed that phonetically balanced passages control for all diphone contexts. For instance, a short phonetically balanced text does not contain examples of phonemes in all the 900 or so diphone contexts that occur in samples of spontaneous speech.
The question of whether passages should be generated in which, say, diphones have the same frequency as in the language itself is tricky for two reasons. First, baseline data about diphone frequency are not universally available. Second, there are alternative viewpoints about what the best frequency structure for diphones in a sample would be. On the one hand, it can be argued that the rare diphones are highly informative segments of speech; it might then be advisable to generate specific passages in which they occur as often as the commonest diphones. On the other hand, artificially manipulating the diphone frequencies in this way makes the sample unrepresentative of the language the recogniser would need to work with. This can be illustrated by considering materials that have been used for training Hidden Markov Model recognisers. Sentences in projects such as SPAR were sometimes developed to obtain an instance of each phoneme of English in a small amount of material.
However, inspection of some of the sentences shows that they might be difficult to speak (tongue-twisters) and consequently may lead to pronunciation problems (in particular, abnormally timed speech):
A speaker is likely to experience difficulty with phonemes in this type of material that would not be encountered when the same phonemes occur in other sentences. To the extent that these sentences behave like tongue-twisters, the difficulty would be more acute for certain classes of phonemes (consonants, and plosives in particular) than for others (the vowel sounds). A person who still wants to use this material might reply that these sentences could conceivably have been uttered, which is true. However, the discussion of sampling (above) shows that they cannot be considered a simple random sample. If it is necessary to use the phonemes in them as instances for training particular phone models, their acoustic properties should be checked statistically against other groups of sentences that also contain these phonemes. This analysis would establish whether the acoustic properties differ between the wider samples and these compressed versions. To our knowledge, these tests have not been conducted. They are essential checks that should be made before a compressed sample is used for a system that will finally be applied to less restricted materials.
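One way such a check might be carried out is sketched below. It compares a single acoustic measure of one phoneme (here, hypothetically, duration in milliseconds) between tokens taken from the compressed sentences and tokens taken from a wider sample, using a two-sided permutation test on the difference in means. The text does not specify a particular test; the permutation test is chosen for this example because it makes no distributional assumptions, and the function name and data are invented for illustration.

```python
import random


def permutation_test(compressed, wider, n_resamples=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    compressed, wider: lists of measurements of the same phoneme from the
    phonetically dense sentences and from the wider sample respectively.
    Returns (observed_difference, p_value).
    """
    if not compressed or not wider:
        raise ValueError("both groups need at least one measurement")

    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    observed = abs(mean(compressed) - mean(wider))
    pooled = list(compressed) + list(wider)
    n = len(compressed)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical plosive durations (ms) in the two kinds of material.
dense = [62, 58, 65, 60, 59, 63, 61, 64]
wide = [80, 85, 78, 82, 79, 84, 81, 83]
difference, p_value = permutation_test(dense, wide)
```

A small p-value would suggest that the phoneme is realised differently in the compressed material. A full analysis would repeat the comparison for each phoneme and each acoustic property of interest, with a correction for multiple comparisons.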
If unrepresentative material is used for training and testing a recogniser, misleading conclusions may well be drawn about its performance. This is a major topic for investigation, and the problem arises as follows. Suppose the recogniser is trained on phones marked in passages that are produced atypically because of their phonetic density (as is the case with the SPAR sentences). When the recogniser fails to recognise a ``typical'' instance of the phone that differs from the atypical ones it has been trained on, hasn't it behaved correctly with respect to its training? Conversely, when the recogniser recognises a ``typical'' instance of the phone that differs from the atypical ones it has been trained on, hasn't it made an error? What this shows is that, if simple random samples are not employed, conclusions about what constitutes an error and what constitutes a correct outcome may both be misleading.