Speaker selection


Prospective callers received a personalised letter. Originally, we aimed to recruit 5000 speakers, distributed uniformly over a large number of cells defined by four criteria, viz. (1) geographical region, (2) socioeconomic status, (3) sex, and (4) age. It should be emphasised that the uniform sampling of the cells was motivated mainly by scientific arguments: to find the funds for creating the corpus, it was necessary to make it attractive for a wide range of linguistic research, including sociolinguistics and dialectology. Some of the speakers in our corpus may not be heavy users of the automated services that can be developed by means of the Dutch POLYPHONE corpus. However, we trust that a wide coverage of language and speech behaviour will lead to applications that are more robust than those that could have been obtained with recognisers trained on much more restricted speech material.
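To make the idea of uniform sampling over cells concrete, the following minimal sketch enumerates the cells and computes the number of speakers per cell that a perfectly uniform design would imply. It uses the three education levels and four age classes described later in this section. The list of twelve provinces is an assumption (the text only says that region is operationalised as province), and the names `PROVINCES`, `cells` and `target_per_cell` are purely illustrative.

```python
from itertools import product

# Assumption: the twelve present-day Dutch provinces; the text does not list them.
PROVINCES = [
    "Groningen", "Friesland", "Drenthe", "Overijssel", "Flevoland", "Gelderland",
    "Utrecht", "Noord-Holland", "Zuid-Holland", "Zeeland", "Noord-Brabant", "Limburg",
]
# Education level is the stand-in for socioeconomic status (three levels in the text).
EDUCATION = ["primary", "secondary", "college/university"]
SEX = ["female", "male"]
# Four age classes, labelled as in the text.
AGE_CLASSES = ["under 20", "21-40", "41-60", "61 and older"]

TARGET_SPEAKERS = 5000  # original recruitment target stated in the text

# Each cell is one combination of the four criteria.
cells = list(product(PROVINCES, EDUCATION, SEX, AGE_CLASSES))
n_cells = len(cells)

# Speakers per cell if the target were spread perfectly evenly.
target_per_cell = TARGET_SPEAKERS / n_cells

print(f"{n_cells} cells, about {target_per_cell:.1f} speakers per cell")
```

Under these assumptions a uniform design leaves fewer than twenty speakers per cell, which illustrates why sparsely populated categories are hard to fill.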

Geographical region, operationalised as the province in which the speaker lives, is the best practically feasible approximation to regional accent and dialect background. By sampling provinces, we sidestep the unsolved problems of how many different regional accents should be distinguished and how these should be defined. Because the population is very unevenly distributed over the provinces, it proved practically impossible to obtain equal numbers of speakers from each province [3,4].

Socioeconomic status is difficult to define, and even more difficult to assess reliably from what respondents are willing to say. We decided to approximate status on the basis of the education level of the respondents. We distinguished three levels, viz. (1) only primary school, (2) secondary school and (3) college/university. In hindsight, this division was somewhat unfortunate: in formal terms, almost everyone younger than about 60 has attended school until at least the age of 16, so only a very small proportion of the population falls into the first category. It is therefore not surprising that we were able to recruit very few speakers who reported having no more than primary school. The numbers in the remaining two classes are approximately equal.

We distinguish four age classes, i.e., under 20, between 21 and 40, between 41 and 60, and 61 and older. Information about age was obtained by asking the respondents for their year of birth. Since we set a minimum age of 16 for participation, the under-20 group is much smaller than the other groups. The group of 61 and older is also underrepresented. The 21 to 40 group is about 50% larger than the 41 to 60 group.
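As an illustration of how a reported year of birth could be mapped onto these age classes, here is a minimal sketch. It is not the procedure used for the corpus. The function name `age_class` and the `reference_year` parameter are hypothetical, age is approximated as the difference between the two years, and a speaker aged exactly 20 (not covered by the boundaries as listed) is placed in the 21-40 class by assumption.

```python
from typing import Optional

MIN_AGE = 16  # minimum age for participation, as stated in the text


def age_class(year_of_birth: int, reference_year: int) -> Optional[str]:
    """Return the age class label, or None if the speaker is below MIN_AGE.

    reference_year is a hypothetical parameter standing for the year of
    recording; the text does not say how age was derived from year of birth.
    """
    # Approximation: ignores whether the birthday has already passed.
    age = reference_year - year_of_birth
    if age < MIN_AGE:
        return None  # too young to participate
    if age < 20:
        return "under 20"
    if age <= 40:
        # Assumption: age 20 is assigned here, since the listed boundaries skip it.
        return "21-40"
    if age <= 60:
        return "41-60"
    return "61 and older"


# Example with an arbitrary reference year chosen only for illustration.
for born in (1980, 1976, 1954, 1934, 1933):
    print(born, age_class(born, reference_year=1994))
```

Because only speakers aged 16 to 19 can fall into the first class, it spans four years while the two middle classes span about twenty each, which is consistent with the small size of the youngest group.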


EAGLES SWLG SoftEdition, May 1997.