Requirements for future corpora

It is still too early to give very specific recommendations for future speech corpora. Yet a number of things can be said. Future corpus projects should attempt to obtain spontaneous expressions of time and date, in addition to the read expressions in POLYPHONE.

Speaker selection and recruitment remain a difficult issue. In the Dutch POLYPHONE project much time, effort and money were spent in order to obtain a maximally uniform sampling of a large number of cells. To a considerable extent, these efforts were of little avail. The major reason to strive for uniform sampling was scientific: we wanted the corpus to be as attractive as possible for linguists and dialectologists, without, of course, interfering with the requirements of speech technology. The latter requirements are ill-defined. It is quite likely that applications like Train Time Table Information must deal with the public at large, including low-income groups whose speech may differ from the general standard. More research is needed to clarify this issue.

This research was supported by the Foundation for Speech Technology, which is funded by the Dutch National Program for the Advancement of Information Technology (SPIN).

