Speaker characteristics

How are the speakers for a speech corpus selected? Again, this depends strongly on the intended application. For the development of a speech synthesis system, experienced speakers, such as news readers or actors, are most appropriate. For the training and testing of recognition systems, on the other hand, the population of interest must be suitably sampled. There is no general agreement on the exact meaning of "suitable" in this context. One definition would amount to random sampling of the population of interest. This operationalisation usually yields different numbers of samples from the subpopulations within the population of interest. For example, when the total population of army personnel is sampled, the subpopulation of women is likely to be poorly represented. For the training and testing of a recognition system for the army, this under-representation of women might seem acceptable, because the recogniser would have to deal mainly with male speakers. However, it may turn out that some of the influential heavy-duty users are women, in which case the recogniser should be designed to handle the few but important women with the same performance as the men.
In general, random sampling has the potential drawback that extremely large numbers of samples are needed to ensure that rare but nevertheless important phenomena are included. When, where, and why rare phenomena may still be important depends on the application for which the corpus is collected. In fundamental research, on the other hand, the aim is often to compare subpopulations in some respect, and then it is more appropriate to draw an equal number of samples from all subpopulations of interest. Uniform sampling of all subpopulations of interest ensures that all relevant variation is included in the corpus with the smallest possible number of speakers.
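The contrast between random sampling and uniform sampling of subpopulations can be made concrete with a small sketch. Everything in it is illustrative: the population figures (5% women), the helper names `min_random_sample_size` and `uniform_sample`, and the 0.95 probability threshold are assumptions, not part of any recommended procedure.

```python
import math
import random
from collections import Counter


def min_random_sample_size(proportion, confidence=0.95):
    """Smallest simple random sample that contains at least one member of a
    subpopulation with the given proportion, with the given probability.

    Solves 1 - (1 - proportion) ** n >= confidence for n, treating draws as
    independent (a reasonable approximation for a large population).
    """
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - proportion))


def uniform_sample(speakers, key, per_group, rng):
    """Draw the same number of speakers from every subpopulation."""
    groups = {}
    for speaker in speakers:
        groups.setdefault(speaker[key], []).append(speaker)
    chosen = []
    for label in sorted(groups):
        members = groups[label]
        if len(members) < per_group:
            raise ValueError(f"only {len(members)} speakers in group {label!r}")
        chosen.extend(rng.sample(members, per_group))
    return chosen


# Hypothetical population: 5% women, 95% men (illustrative figures only).
population = [{"id": i, "sex": "female" if i < 50 else "male"} for i in range(1000)]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

random_draw = rng.sample(population, 40)
uniform_draw = uniform_sample(population, "sex", 20, rng)

print("random: ", Counter(s["sex"] for s in random_draw))
print("uniform:", Counter(s["sex"] for s in uniform_draw))
print("n for at least one speaker from a 5% group:", min_random_sample_size(0.05))
```

Under these assumptions, a random sample needs 59 speakers before a group that makes up 5% of the population is likely (probability at least 0.95) to contribute even one speaker, whereas the uniform draw contains 20 speakers per group by construction.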
The application for which the speech corpus is collected not only determines the best sampling strategy, but also influences the choice of speakers. For example, speech processing often involves spectral analysis of the recorded speech. Several analysis techniques, such as pitch extraction or formant extraction, are less accurate for high-pitched voices (women and children) than for low-pitched voices (men). If such analysis techniques are used and the sex of the speakers is of no concern for the research goal, it would thus be sensible to select only men for the speech corpus. In general, however, it is recommended to include all possible types of speakers in a speech corpus, unless there are compelling arguments for excluding specific speaker groups. Specifically, it is strongly recommended to include equal numbers of females and males in each corpus. Speaker characteristics that are potentially important, and should therefore be considered when selecting the speaker population, are described and discussed below.

Stable / transient speaker characteristics

The many speaker characteristics that may influence the speech signal can be divided into two main classes: relatively stable characteristics and transient (temporary) characteristics. Stable speaker characteristics comprise on the one hand physiological and anatomical factors such as sex, age, weight, height, smoking/drinking habits, and possible pathologies, and on the other hand geographical and sociolinguistic factors. Transient speaker characteristics cover factors such as a cold or other mild afflictions of the speech organs, general physical condition (dependent on, for instance, the number of hours of sleep during the previous night), stress, and emotional state. Whereas transient speaker characteristics are very difficult to control, stable speaker characteristics are easier to take into account in the design of the speech corpus. For an overview of several important stable speaker characteristics, we refer to [Scherer & Giles (1979)]. The most important stable speaker characteristics are discussed below.

Demographic coverage

Demographic factors form a very important set of relatively stable speaker characteristics which must be considered when designing sampling procedures for a corpus collection project. Each corpus should have sufficient demographic coverage. However, it is not always possible to determine all potentially relevant demographic factors a priori. Nor is the distribution of all factors in the total population always known. It is likely that the availability of detailed and reliable demographic data differs between the European countries. The availability of such data in less developed countries is even more questionable. When speakers are selected for inclusion in a corpus, the possibility of assessing certain characteristics depends on the recording protocol. If randomly selected speakers are recorded over the telephone, many personal characteristics cannot be collected reliably: self-report from the speaker is the only means of gathering the data.

Male / female speakers

Sex (gender) distinctions are known to have an enormous impact on speech quality. It is not well known at what age sex-related speech characteristics become prevalent. There is some evidence that sex-related speech characteristics are only partly due to physiological and anatomical differences between the sexes; cultural factors and sex role stereotypes also play an important role. It is therefore possible that the age at which sex-related differences become apparent differs between cultures, and hence between languages. For general information on sex-related speech characteristics, see [Smith (1979)], [Coates (1986)], [Philips et al. (1987)], and [Brouwer & De Haan (1987)]. For the time being, no definitive recommendations can be given on the age above which the sexes should be distinguished and sampled separately. Unless the specific application for which the corpus is collected motivates otherwise, each corpus should comprise approximately equal numbers of speakers of both sexes. For some applications, recordings of young children may also be required. Children should be treated as a "third sex", independent of adolescent or adult females and males. Speaker sex is known or suspected to affect at least four aspects of speech behaviour.

PITCH AND INTENSITY
Women are known to have higher average pitch than men. There are also indications that average intensity in female voices is somewhat lower than in male voices. In particular, higher pitch may affect spectral analysis techniques: pitch and formant extraction may be less accurate for high-pitched female voices than for low-pitched male voices. When a corpus is recorded to develop and test parameter extraction techniques, a realistic proportion of high-pitched female voices should be present. It should be realised that there is an interaction between sampling rate and the accuracy with which pitch frequency can be determined. In female and child speech even a 20 kHz sampling frequency may not be high enough to obtain sufficient accuracy, as pitch frequencies, especially in child speech, may be as high as 500 to 750 Hz. Fortunately, the sampling frequency can be increased using straightforward signal processing procedures whenever the need arises.
OVERALL SPECTRAL SLOPE
Women are reported to tend more towards a breathy voice quality than men. It is not known whether this tendency is related to physiological and anatomical factors or whether it is mainly due to culturally determined sex role stereotypes. Breathy voice is associated with an overall steeper spectral slope, which causes problems for some parametric signal processing techniques (e.g. formant extraction).
ACCURACY OF PRONUNCIATION
Women are reported to adhere more closely to standard pronunciation than men [Labov (1972)]. It is not known whether this finding generalises to all languages. It remains to be seen whether sex-related pronunciation variation is best modelled and described on the level of phonemic representations of words or on the level of the phonetic implementation of what is essentially the same phonemic form. Pending the results of further research in additional languages and cultures, this factor probably does not warrant great emphasis. Moreover, this aspect is very difficult, if not impossible, to separate from other sex-related factors, and will therefore be duly represented as long as the sexes are adequately represented in the corpus. Variation in pronunciation accuracy may also be caused by factors related to age and social status.
VOCABULARY AND SYNTAX
Sex-related differences in vocabulary and syntax are certainly culturally determined. Here, the factor sex interacts with factors like age and social status. Differences on the level of vocabulary and syntax are only relevant when spontaneous speech is being recorded. If all speech material consists of read utterances, vocabulary and syntax are completely determined by the prompting material. However, the ability to pronounce the prompted material can depend on socio-economic status or education.
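
The interaction between sampling rate and pitch accuracy mentioned under PITCH AND INTENSITY can be illustrated with a little arithmetic. The sketch below assumes a simple time-domain pitch estimator that measures the pitch period in whole samples, without interpolation; the function name `f0_step_hz` and the example frequencies are hypothetical. Under that assumption, only pitch values of the form (sampling rate / whole number) can be reported, and the gap between neighbouring values grows roughly with the square of the pitch frequency.

```python
def f0_step_hz(f0_hz, sample_rate_hz):
    """Gap in Hz between the F0 values whose periods are N and N + 1 samples,
    where N is the true pitch period rounded down to a whole sample."""
    if f0_hz <= 0 or sample_rate_hz <= 0:
        raise ValueError("frequencies must be positive")
    shorter = int(sample_rate_hz // f0_hz)  # period in whole samples
    if shorter < 1:
        raise ValueError("F0 must not exceed the sampling rate")
    return sample_rate_hz / shorter - sample_rate_hz / (shorter + 1)


# Compare a 20 kHz recording with the same signal resampled to 80 kHz.
for sample_rate in (20_000, 80_000):
    for f0 in (100, 250, 500, 750):
        step = f0_step_hz(f0, sample_rate)
        print(f"fs = {sample_rate:6d} Hz, F0 = {f0:3d} Hz: step = {step:6.2f} Hz")
```

Under this assumption the step at 20 kHz is about 0.5 Hz near 100 Hz, but about 12 Hz near 500 Hz and about 28 Hz near 750 Hz. Raising the sampling rate fourfold shrinks each step by roughly the same factor. Estimators that interpolate between samples are less affected.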

Age

Although the impact of speaker age on speech behaviour has not received much attention in previous research, there are indications that age influences at least two aspects of speech behaviour [Helfrich (1979)].

Voice quality

There has been some research on the relation between age and voice quality. Most studies were concerned with the question whether speaker age can reliably be estimated from the speech signal alone. It seems that people are moderately good at guessing age from speech signal characteristics, although reported correlation coefficients may be mainly determined by the ability to discriminate between very young, very old, and adult but non-senior groups. The exact signal characteristics which enable people to guess the speaker's age are not well understood; neither is it possible to estimate their impact on the performance of automatic speech and speaker recognition. Until the questions about the importance and the exact nature of the impact of age on speech signals have been answered, it is recommended that attempts be made to sample the relevant age groups. In doing so, a distinction should be made between the group under 20, the group between 20 and 60, and the group over 60. If relevant, the group under 20 should, of course, be subdivided into toddlers, children, adolescents, and young adults. However, the exact ages separating these subgroups are a subject of discussion. Moreover, in many respects mental and physiological maturation may be more important than calendar age.
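
As a minimal sketch of how the three recommended age groups might be tracked during speaker recruitment, the following assigns each speaker to a group and reports how many speakers each group contains. The treatment of the boundary ages (20 and 60 both counted in the middle group) is an assumption, since the text leaves it open; the names `age_group` and `age_coverage` and the example ages are hypothetical.

```python
from collections import Counter

AGE_GROUPS = ("under 20", "20 to 60", "over 60")


def age_group(age_years):
    """Map a calendar age to one of the three coarse groups.

    Assumption: ages 20 and 60 themselves fall in the middle group.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 20:
        return AGE_GROUPS[0]
    if age_years <= 60:
        return AGE_GROUPS[1]
    return AGE_GROUPS[2]


def age_coverage(ages):
    """Count speakers per group, listing empty groups explicitly."""
    counts = Counter(age_group(age) for age in ages)
    return {group: counts.get(group, 0) for group in AGE_GROUPS}


# Hypothetical recruitment list with no speaker over 60.
coverage = age_coverage([8, 17, 23, 35, 41, 58, 60])
print(coverage)
print("groups still missing:", [g for g, n in coverage.items() if n == 0])
```

Listing empty groups explicitly makes gaps in coverage visible before recording starts. A finer subdivision of the youngest group would need its own boundaries, which the text notes are still under discussion.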

Vocabulary and syntax

Here the considerations described above in the paragraph on the impact of sex on speech behaviour apply in exactly the same way. There is some literature suggesting that the vocabulary and syntax of the older generation differ from those of younger speakers. However, apart from the obvious observation that the subjects spontaneously discussed by senior citizens tend to differ, there is little hard data to support the claim that age is a more important factor than, for instance, social group and education level.

Weight and height

As with speaker age, most research in the past has concentrated on the question whether people can estimate speaker weight or speaker height from speech recordings alone [Van Dommelen (1993)]. It appears that people are moderately successful in this task. It will be clear that weight and height of speakers are highly correlated. The exact signal characteristics that enable people to guess the speaker's weight and height are not known. In a sufficiently large sample of speakers, most weight/height groups will probably be represented.

Smoking and drinking habits

Several investigations have shown that voice quality can change under the influence of smoking or the use of alcohol [Gilbert & Weismer (1974)]. One of the most common consequences of smoking and drinking is premature ageing of the mucous membrane covering the vocalis muscle, resulting in a hoarse voice quality. Excessive drinking may eventually result in brain damage, which may in turn lead to severe speech disorders. The use of drugs can have a similar effect. In those cases it would be more appropriate to speak of pathological speech.

Pathological speech

The boundary that divides pathological speech from non-pathological speech is very difficult to draw. Hoarseness due to smoking can be regarded as a very mild speech disorder, whereas more severe speech disorders include, for instance, paralysis of the vocal cords and aphasia. Speech disorders can be divided into two main classes: those where there is a clear organic (anatomical, physiological, neurological) cause, and those where there is not. The latter category is usually referred to as functional disorder. However, in many cases there is no clear-cut distinction between organic and functional speech disorders; often both types are involved, or it is unclear which of the two types is involved. Speech disorders can be described at five different levels:

ARTICULATION DISORDERS
This involves the distortion, deletion, or substitution of sounds or sound combinations. Usually such disorders are functional, but they may also result from lesions of the lips (e.g., a cleft lip), the palate (a cleft palate), the teeth, the tongue, the jaw, or the nose. Another possible cause of articulatory disorders is dysarthria, damage to the central or peripheral nervous system that manifests itself as neuromuscular disability.
RESONANCE DISORDERS
This involves lesions of the oral, nasal, or laryngeal cavities. Apart from functional causes, resonance disorders can result from, for instance, surgical removal of the tonsils, a cleft palate, or nasal polyps.
VOICE DISORDERS
This involves lesions of the vocal cords, referred to as dysphonia. The voice may emerge as a whisper (no vocal-cord vibration), for instance due to paralysis; or vocal-cord vibration may be present to some degree, but accompanied by excessive air flow (a "breathy" voice); or there may be irregular and therefore aperiodic vocal fold vibration, for instance due to the growth of abnormal tissue (nodules) on the vocal folds, resulting in a "hoarse" voice quality. Dysphonia may be caused by psychological and emotional factors, such as a severe shock, or by organic factors. A serious voice disorder is cancer of the vocal cords, which may lead to the surgical removal of the larynx (laryngectomy). Although the patients can learn alternative voicing mechanisms, their speech is usually severely degraded.
LANGUAGE DISORDERS
This involves disorders that do not affect the production of the speech message, but rather its content. These disorders are usually classified under the name aphasia. Patients suffering from aphasia may, for instance, use a reduced and incomplete sentence structure, have difficulty in finding words, use an inappropriate intonation, or make erratic pauses. The cause of aphasia is brain damage due to, for instance, a stroke, thrombosis, a tumour, an accident, or excessive drinking.
RHYTHM DISORDERS
The usual terms to describe the main rhythm disorders are stuttering (or stammering) and cluttering. Stuttering is a very complex phenomenon that is characterised by, for instance, a repetition of speech segments, abnormal prolongations of sound segments, words being unfinished, or circumlocutions to avoid types of sound that cause problems. Stuttering varies enormously from person to person and from situation to situation. It is, for instance, well known that stutterers almost never stutter when they are singing. Both organic (genetic) causes and functional (environmental) causes are assumed to underlie the stuttering phenomenon. Another major category of nonfluency is cluttering. The primary characteristic here is that the patient tries to talk too quickly, and as a result introduces distortions into his rhythm and articulation. The description and theoretical study of cluttering is less advanced than that of stuttering. In addition, there is a considerable overlap between the categories of stuttering and cluttering.

For many purposes it is most appropriate to build speech corpora with a large variety of speakers. However, the speaker variability should be kept within reasonable bounds. Severely pathological speech will, in general, deviate substantially from "normal" speech, and thus it is usually not desirable to include this type of speech in a normal speech corpus. On the other hand, speakers with mild pathological disorders, such as hoarseness, can be included in, for instance, speech corpora designed for recognition.
Of course, research might focus specifically on pathological speech, for instance when a recogniser is developed for use as an environmental control device for handicapped persons. In that case pathological speech should be amply represented in the speech corpus. Pathological speech should also be present in a corpus designed to cover as much speaker variation as possible (a kind of "all-purpose" speech corpus). A more elaborate discussion of pathological speech can be found in [Perkins (1977)] and [Crystal (1980)].

Professional vs. untrained speakers

Professional speakers should be selected when recording very large corpora with very few speakers, for instance to develop text-to-speech systems. The major reason to prefer professional speakers for this purpose is their ability to keep pitch, intensity, and speech rate constant, not only during one recording session, but also over several sessions, which may have to be scheduled on different days, perhaps spread over several weeks or even months. One possibly important drawback of using professional speakers must be emphasised: more often than not, professional speakers are not really representative of the "normal" speech behaviour in the community. If the corpus is collected for the development of a text-to-speech system this may not be a problem. However, linguistic and phonetic findings based on a corpus comprising only speech of a small number of highly trained professional speakers should not be generalised without extreme caution.

Geographical and sociolinguistic factors

It is well known that both the regional and the sociolinguistic background of speakers can have a large effect on their speech. People speak differently depending on the specific region(s) in which they were brought up, and on factors such as the linguistic background of the parents, social status, and education level. It is widely assumed that the high-school period is most decisive for the regional or dialectal colouring in one's speech. It is therefore strongly recommended to obtain information about the high-school period when collecting data about the speaker's background.
Dialectal speech, or regional/dialectal colouring of the prestige variant of a language, like Received Pronunciation (RP) in British English or Hochdeutsch in Germany, is known to be perhaps the most important source of speaker-related variation. Not all languages have a widely accepted and well documented pronunciation standard, like RP in English. Given the enormous amount of literature on dialectology one would assume that the impact of dialects on standard speech is well understood. Unfortunately, this does not appear to be the case. Linguists and dialectologists appear to disagree about the number of major dialects in a language area, and about the boundaries between the areas where a specific dialect is spoken. Moreover, the majority of the dialect studies were based on written questionnaires. Although there are large amounts of recorded dialectal speech stored in the national dialectology institutes, these recordings do not qualify as corpora, because they exist only on analogue tapes, with little or no detailed annotation. In collecting new corpora the factor regional/dialectal colouring should be properly accounted for. However, since the basic data to determine the number of dialects and the dialect boundaries are difficult to obtain and probably not always reliable, it is recommended that dialect is operationalised by geographic region. If necessary, processing of the corpus data can yield post hoc data on dialect differences. However, it has turned out that post hoc determination of the dialect background of a speaker as part of the transliteration/transcription process poses considerable difficulties. There is one additional factor which complicates the procedures for sampling dialectal influence, viz. the increasing mobility of the population. It is acknowledged that the impact of mobility is different between language areas and between countries. However, in sampling for a number of large telephone speech corpora in the U.S.A. (POLYPHONE; Voice Across America) a special variety called Army Brat was defined, for those speakers who had lived for short periods of time in many different parts of the country.
It should be noted that the factor dialect does not only affect pronunciation. More often than not, its impact on vocabulary and perhaps also on syntax is at least as important. Of course, the impact on vocabulary etc. can only come to light in corpus collection paradigms which allow the speaker to select his own words. In corpora comprising only read speech this factor should have no effect.
Sociolects can be regarded as varieties spoken by a particular social class. A clear distinction between different social classes exists, for instance, in India, where each member of the society belongs to a specific caste.
However, in most cultures it is very difficult to distinguish between social classes. The division into three categories (lower class, middle class, upper class) seems to be most widely accepted for Western cultures. Elaborate schemes have been designed to determine a person's social class using factors such as education, profession, and income. In addition to social class membership a person's sociolect is, of course, also influenced by the linguistic background of the parents and the dialect regions in which he grew up. As is the case for dialects, sociolects may influence not only pronunciation, but also syntax and vocabulary. It is recommended that sociolects should be properly accounted for when collecting new corpora. It has been found that the impact of sociolects on speech behaviour strongly interacts with speaking style. Thus, the speech of a pipe-fitter who speaks in a formal way may resemble the speech of a salesman who speaks in a casual way [Labov (1972)]. This phenomenon probably also applies to regional dialects. Occupation-oriented varieties are often termed registers.
There is considerable uncertainty on how to treat dialects and sociolects in corpora collected for developing speech technology, e.g. for developing connected speech recognition systems for use in telephone information systems. There may be large differences between countries and cultures in what is most appropriate in this respect. Of course, each operational recognition system should be able to handle the range of dialectal and sociolectal influences present in the speech of upper and middle class speakers produced in somewhat formal situations. The extent to which dialectal influences occurring in less formal speech, or in formal speech of lower class speakers, must also be covered will depend very much on the application for which the recogniser is being developed. Another extremely important factor is the social acceptability of strongly dialectal speech in a given situation. Acceptance is likely to differ strongly between regions in a given country.
If telephone applications are designed in such a way that all calls originating from a specific part of the country are handled in a local centre, one may envisage recognition systems which are adapted to the local dialect, provided that suitable training corpora can be collected. When collecting speech corpora over the telephone by soliciting input from randomly selected subjects, one should specify strict guidelines for deciding whether or not a specific speaker deviates too much from the "standard" language for him to be included in the corpus.
The speech of non-native speakers can be regarded as a special "sociolect". Some non-native speakers may speak the standard language of the country they reside in with only a slight accent, whereas others may speak the standard language with a very marked accent or a poor control over vocabulary and grammar. There seems to be no reason to exclude the former group of non-native speakers from a common speech corpus, whereas the latter group of non-native speakers would preferably be excluded, unless the research is specifically aimed at non-native speech or one wants to build an "all-purpose" speech corpus.

EAGLES SWLG SoftEdition, May 1997.