
EUROM-1 database overview

The first SAM database, EUROM-0, was the precursor to the much more substantial EUROM-1 recording undertaken in SAM. EUROM-0 has been widely distributed on a single CD-ROM and contains five hours of speech material, recorded with 16 kHz, 16-bit sampling using a condenser microphone in anechoic rooms, from four single-accent speakers in each of five languages (English, French, Dutch, Italian and Danish). NATO single and triple digit sequences are recorded using only the speech signal; a continuous speech passage, with a common numeric theme across languages, is recorded using two channels, carrying both speech and laryngographic inputs. This CD-ROM has been used extensively in the SAM project and is specified as a reference for calibration of the speech input assessment tools.

A subsequent major data collection activity produced a very substantial body of data, unique in its size and in the breadth of its coverage of European languages. The EUROM-1 database contains more than twelve hours of data for each of the eleven European languages covered: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish and Swedish. The material is of high acoustic quality, and was selected specifically for use in the assessment of speech technology devices.

The control software used in making the recordings provided for orthographic labelling of the data and for alignment of the text and signal portions at the level of the prompt units. Phonotypical transcriptions have been made separately for all languages, and broad phonetic labelling using SAMPA (SAM Phonetic Alphabet) has been applied to some parts of the database. Language subsets of EUROM-1 are now available on CD-ROM, and EU funding is planned to ensure the availability of all recorded material of this important reference resource. A minimum of three CD-ROMs is planned for each language.

Very careful consideration was given to the homogeneity of the data across languages. This was achieved by the use of identical recording protocols, which were specified earlier in the project and applied using standard software tools, and by a careful definition of the speech content, such that each language was represented in the same way wherever possible. The speech (and calibration) recordings were made in acoustically treated rooms using calibrated condenser microphones and, in addition to the acoustic signal, larynx activity was recorded simultaneously, using a laryngograph, for samples of the speech in each language. The use of anechoic condenser microphone recordings permits the subsequent imposition of post-production effects. The recordings were made using the SAM agreed standard of 20 kHz, 16-bit sampling to ensure optimal signal representation; inter-utterance acoustic background signals were also preserved. The protocols defined for collection of database materials have been developed to provide guidelines on recording procedures and quality criteria for use in the wider European Speech Community. These are provided in full in Appendices B and C.
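As a rough plausibility check on the figures quoted in this section, the storage needed for uncompressed recordings at the stated sampling format can be estimated as follows. This is a minimal sketch, not part of the SAM specification: the function name `pcm_bytes`, the single-channel assumption, and the nominal 650 MB CD-ROM capacity are all assumptions introduced here for illustration.

```python
import math

def pcm_bytes(hours, sample_rate_hz=20_000, bits_per_sample=16, channels=1):
    """Size in bytes of uncompressed linear PCM audio of the given duration."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)

# Assumption: nominal data capacity of one CD-ROM (not stated in the text).
CD_ROM_BYTES = 650 * 10**6

# Twelve hours is the per-language lower bound quoted for EUROM-1;
# one acoustic channel only, laryngograph channel ignored.
per_language = pcm_bytes(12)
discs_needed = math.ceil(per_language / CD_ROM_BYTES)

print(per_language)   # 1728000000
print(discs_needed)   # 3
```

Under these assumptions, twelve hours of single-channel speech occupy about 1.7 GB, which is consistent with a minimum of three CD-ROMs per language. The two-channel recordings (speech plus laryngograph) and any additional material would raise the figure.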

The database material specification for each language is as follows:

C(C)VC(V) material in isolation and in context, in the range 60-100 items per language
100 selected numbers from 0-9999, providing complete coverage of the phonotactic possibilities of the language number system
40 short passages, each composed of five thematically linked sentences
50 sentences composed to compensate for phonemic frequency imbalance in the passages
5 pairs of context words for use with C(C)VC(V) material

The database has been designed with a hierarchical structure to maximise its usefulness both for training and testing different types of speech technology device and for more basic research, including inter-language comparisons. In each language, material was recorded by 60 subjects, 30 female and 30 male, each of whom recorded 100 numbers, 3 passages and 5 sentences. Of these, a "few talker" subset of 5 females and 5 males made extended recordings: isolated C(C)VC(V) items, 500 numbers, 15 passages and 25 sentences. A further "very few talker" subset of one female and one male, selected from the 10, additionally recorded the contextualised C(C)VC(V)s and the 5 context words, using both acoustic and laryngographic signals. A total of 660 speakers thus recorded over 130 hours of data, making this a very substantial multilingual resource with many different applications.
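The totals quoted above follow directly from the per-language figures. The sketch below reproduces the arithmetic and records the three nested speaker tiers; the `Tier` class and the constant names are hypothetical and used only for illustration, and twelve hours is taken as the per-language lower bound stated earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    """One level of the speaker hierarchy within a single language."""
    name: str
    female: int
    male: int

    @property
    def speakers(self) -> int:
        return self.female + self.male

# Each tier is a subset of the one before it.
TIERS = [
    Tier("all talkers", 30, 30),
    Tier("few talker", 5, 5),
    Tier("very few talker", 1, 1),
]

LANGUAGES = [
    "Danish", "Dutch", "English", "French", "German", "Greek",
    "Italian", "Norwegian", "Portuguese", "Spanish", "Swedish",
]
MIN_HOURS_PER_LANGUAGE = 12  # "more than twelve hours" per language

total_speakers = len(LANGUAGES) * TIERS[0].speakers
min_total_hours = len(LANGUAGES) * MIN_HOURS_PER_LANGUAGE

print(total_speakers)   # 660
print(min_total_hours)  # 132
```

Eleven languages with 60 speakers each give the 660 speakers quoted, and at least twelve hours per language gives at least 132 hours, in line with "over 130 hours".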


EAGLES SWLG SoftEdition, May 1997.