
 

Experimental selection of material

The issue of what material to employ applies to speech synthesis as well as recognition. Before concentrating on some issues relevant to recognition, a brief comment is in order about selecting a sample speaker on whom to base a synthesis system, as the requirements differ from those of recognition. In synthesis, you are trying to select one speaker, or a small number of speakers, whose speech conforms to certain criteria. One reasonable criterion might be that the speaker should be ``highly intelligible''. Though professional groups such as radio announcers or people with voice training would seem reasonable candidates, from a scientific point of view some metric ought to be applied to check this, so that the decision is not subjective. A metric could be developed based on Likert scales (see Section 9.4.3).
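As a rough illustration of how Likert ratings might be turned into a selection metric, the sketch below summarises listener ratings per candidate speaker and ranks the speakers. The 1 to 5 scale, the function names and the choice of the median as the primary statistic (Likert data are ordinal) are assumptions made for this example, not a procedure prescribed here.

```python
from statistics import mean, median

def summarise_ratings(ratings, low=1, high=5):
    """Summarise Likert intelligibility ratings for each candidate speaker.

    ratings: dict mapping a speaker label to a list of listener ratings.
    Returns a dict mapping each speaker to (median, mean).
    The scale bounds low/high are illustrative assumptions.
    """
    summary = {}
    for speaker, values in ratings.items():
        if not values:
            raise ValueError(f"no ratings for speaker {speaker!r}")
        if any(v < low or v > high for v in values):
            raise ValueError(f"rating out of range for speaker {speaker!r}")
        summary[speaker] = (median(values), mean(values))
    return summary

def rank_speakers(ratings):
    """Order speakers from most to least intelligible.

    Sorts by median rating, using the mean only to break ties.
    """
    summary = summarise_ratings(ratings)
    return sorted(summary, key=lambda s: summary[s], reverse=True)

if __name__ == "__main__":
    # Hypothetical ratings from four listeners per speaker.
    example = {"speaker_A": [5, 4, 5, 4], "speaker_B": [3, 3, 4, 2]}
    print(rank_speakers(example))
```

In practice the ratings would come from a listening test with enough listeners and items to make the ranking stable; the sketch only shows the bookkeeping.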

If an ANN-based recogniser is to be trained and tested, different requirements are imposed. If the goal is to recognise unrestricted speech by the speakers of a language (as in the example), a statistically representative sample of exemplars is needed. This means that the sample should conform in its statistical properties to whatever population it is supposed to represent. There are two major facets of representativeness here. First, an adequate sample of different speakers is needed (see Section 9.2.5 for some estimation methods appropriate for this). Second, the structure of the materials produced by the selected group of speakers (whether spontaneous or specially constructed for reading) needs checking to see whether it conforms to unrestricted speech in the language. An explicit formulation needs to be made of what would constitute an adequate check on whether the samples are representative. A weak formulation might be to check whether the sample contains all the language's phonemes. A stronger formulation would be to establish whether the sample contains all the language's diphones. An even stronger formulation would be to check whether all the diphones occur with the same frequency as they do in spontaneous samples of speech drawn from the language. The stronger the formulation, the more likely it is that the sample is representative of the language but, on the other hand, the more work is required in obtaining the sample and comparing it against the language.
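The three formulations can be made concrete with a small sketch. It assumes that both the candidate sample and a reference corpus of spontaneous speech are available as phoneme-level transcriptions (lists of phoneme symbols per utterance); the function names are hypothetical, and total variation distance is just one possible way of comparing diphone frequencies, chosen here for simplicity.

```python
from collections import Counter

def diphones(utterance):
    """Return the adjacent phoneme pairs within one utterance."""
    return list(zip(utterance, utterance[1:]))

def missing_phonemes(sample, inventory):
    """Weak check: phonemes of the inventory that never occur in the sample."""
    seen = {p for utt in sample for p in utt}
    return set(inventory) - seen

def missing_diphones(sample, reference):
    """Stronger check: diphones attested in the reference but not in the sample."""
    seen = {d for utt in sample for d in diphones(utt)}
    wanted = {d for utt in reference for d in diphones(utt)}
    return wanted - seen

def diphone_frequencies(utterances):
    """Relative frequency of each diphone across a set of utterances."""
    counts = Counter(d for utt in utterances for d in diphones(utt))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}

def total_variation(p, q):
    """Strongest check: distance between two diphone distributions.

    0.0 means identical relative frequencies, 1.0 means no overlap.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

if __name__ == "__main__":
    # Toy transcriptions; real data would be far larger.
    sample = [["k", "a", "t"], ["t", "a", "k"]]
    reference = [["k", "a", "t"], ["a", "t", "a"]]
    print(missing_phonemes(sample, {"k", "a", "t", "s"}))
    print(missing_diphones(sample, reference))
    print(total_variation(diphone_frequencies(sample),
                          diphone_frequencies(reference)))
```

The sketch ignores diphones that span utterance boundaries or silence, and it says nothing about how large a distance should count as unrepresentative; both would need to be decided for a real corpus.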

Considering phonemes first, there are reasons to suppose that checking for these units alone would not be satisfactory. The main problem is that if you only check whether all phonemes occur, then it is being implicitly assumed that some salient identifying property of a phoneme can be extracted whatever phonemic context it occurs in. This assumption is controversial.

Diphones are the observed two-phoneme combinations for the language. The advantage of this and related units (such as the demisyllable) is that they include some measure of context over adjacent phonemes. The disadvantage is that there are more of these units than there are phonemes and, consequently, larger samples of speech are needed to ensure representativeness. Ideally, if these units are to be the basis of a recogniser, what would be required is a check that all diphones that occur in spontaneous speech occur in the sample. It cannot be assumed that phonetically balanced passages control for all diphone contexts. For instance, a short phonetically balanced text does not contain examples of phonemes in all the 900 or so diphone contexts that occur in samples of spontaneous speech. The issue of whether passages should be generated that have, say, diphones with the same frequency as occurs in the language itself is tricky for two reasons. First, the baseline data about diphone frequency are not universally available. Second, there are alternative viewpoints about what would be the best structure for the frequency of diphones in a sample. On the one hand, it can be argued that the rare diphones are highly informative segments of speech. It might then be advisable to ensure that these occur with the same frequency as the commonest diphones by generating specific passages. The other point of view is that by artificially manipulating the diphone frequencies in this way, the sample is no longer representative of the language the recogniser would need to work with. This can be illustrated by considering materials that have been used for training Hidden Markov Model recognisers. Sentences in projects such as SPAR were sometimes developed to obtain an instance of each phoneme of English in a small amount of material.
However, inspection of some of the sentences shows that they might be difficult to speak (tongue-twisters) and consequently may lead to pronunciation problems (in particular, abnormally timed speech):

A speaker is likely to experience difficulty on phonemes in this type of material that he or she would not encounter when these same phonemes occur in other sentences. To the extent that these sentences behave like tongue-twisters, the difficulty encountered would be more acute for certain classes of phonemes (consonants, and particularly plosives) than others (the vowel sounds). A person who still wants to use this material might reply that it is conceivable that these sentences could have been uttered, which is true. However, the discussion of sampling (above) illustrates that they cannot be considered a simple random sample. If it is necessary to use the phonemes in them as instances for training particular phone models, their acoustic properties should be checked statistically against other groups of sentences that also contain these phonemes. This analysis would establish whether there are differences between the acoustic properties in the wider samples and these compressed versions. To our knowledge, these tests have not been conducted. They are essential checks that should be made before a compressed sample is used when the final system is to be applied to less restricted materials.
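One way such a statistical check could be carried out is sketched below. It assumes that a single acoustic measurement per phoneme token (for example, segment duration in milliseconds) has already been extracted from both the compressed sentences and a wider set of sentences, and it compares the two groups with a two-sided permutation test on the difference in means. The choice of test, the function name and the example values are assumptions for illustration; a real analysis would probably examine several acoustic properties and take speaker effects into account.

```python
import random
from statistics import mean

def permutation_test(compressed, wider, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    compressed, wider: lists of one acoustic measurement per token
    (e.g. duration in ms) for the same phoneme in the two kinds of material.
    Returns (observed absolute difference in means, estimated p-value).
    """
    if not compressed or not wider:
        raise ValueError("both groups need at least one measurement")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(compressed) - mean(wider))
    pooled = list(compressed) + list(wider)
    n = len(compressed)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign tokens to the two groups at random
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

if __name__ == "__main__":
    # Hypothetical plosive durations (ms) in the two kinds of material.
    dense = [50, 52, 49, 51, 50, 53]
    ordinary = [70, 72, 69, 71, 73, 70]
    print(permutation_test(dense, ordinary, n_permutations=2000))
```

A small p-value would suggest that tokens from the compressed material differ acoustically from those in the wider sample on the measured property, which is the situation the text warns about.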

If unrepresentative material is used for training and testing a recogniser, misleading conclusions may well be drawn about its performance. This constitutes a major topic of investigation and arises as follows. Suppose the recogniser is trained on phones marked in passages that are produced atypically because of their phonetic density (as is the case with the SPAR sentences). When the recogniser fails to recognise a ``typical'' instance of the phone that differs from the atypical ones that it has been trained on, hasn't it behaved correctly with respect to its training? Conversely, when the recogniser recognises a ``typical'' instance of the phone that differs from the atypical ones that it has been trained on, hasn't it made an error? What this shows is that if simple random samples are not employed, then the conclusions about what constitutes both errors and correct outcomes may be misleading.
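To make the notion of a simple random sample concrete, the sketch below draws training and test sets from a single pool of utterances so that every utterance has the same chance of ending up in either set. The function name, the 20% test fraction and the fixed seed are assumptions for illustration. Note that this only guarantees that the two sets are comparable with each other; whether the pool itself represents the language is the separate question discussed above.

```python
import random

def simple_random_split(items, test_fraction=0.2, seed=0):
    """Split a pool of utterances into training and test sets at random.

    Every item is equally likely to be selected for the test set,
    which is what makes the test set a simple random sample of the pool.
    Returns (train, test), each preserving the original order of items.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    n_test = round(len(items) * test_fraction)
    test_indices = set(rng.sample(range(len(items)), n_test))
    train = [x for i, x in enumerate(items) if i not in test_indices]
    test = [x for i, x in enumerate(items) if i in test_indices]
    return train, test

if __name__ == "__main__":
    # Hypothetical utterance identifiers.
    pool = [f"utt_{i:03d}" for i in range(100)]
    train, test = simple_random_split(pool, test_fraction=0.2, seed=1)
    print(len(train), len(test))
```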




EAGLES SWLG SoftEdition, May 1997.