Spoken language is central to human communication and has significant links to both national identity and individual existence. With the increasing availability and capability of computing resources, there has been, and will continue to be, a large expansion in computer-based language technologies. These technologies include speech recognition and synthesis, vocal access to information retrieval systems, speech understanding (or spoken language) systems, and spoken language translation. Central to the progress made in spoken language technologies are large corpora of speech with associated text, transcriptions, and lexica.
The structure of spoken language is shaped by many factors, including the phonological, syntactic, and prosodic structure of the language being spoken, the acoustic environment in which it is produced, and the communication channel. The speech signal is produced differently by each speaker, whose unique vocal tract assigns its own signature to the signal. Speakers have different dialects, accents, and speaking rates, and their speech patterns are influenced by their emotional and physical state, the context in which they are speaking (e.g., reading aloud, conversing, giving a lecture), and the acoustic environment. Because of the many sources of variability in the speech signal, a great deal of speech data are needed to model different speech characteristics, in particular, different dialects and accents.
Recent activities, such as the creation of the Linguistic Data Consortium (LDC) and the Center for Spoken Language Understanding at the Oregon Graduate Institute (OGI) in the U.S., the LRE RELATOR project in Europe, national efforts in Japan, Australia, and China, and the international Coordinating Committee for Speech Databases and Assessment (COCOSDA), point to a growing worldwide awareness of the need for, and importance of, large, publicly available common corpora for the development and evaluation of language technologies, particularly speech recognition and spoken language understanding, as well as for the development and assessment of speech synthesisers. Such corpora allow scientists to study, understand, and model the different sources of variability, and to develop, evaluate, and compare speech technologies on a common basis.
Corpus collection in Europe is the result of both national efforts and efforts sponsored by the European Community. Several ESPRIT projects have attempted to create comparable multilingual speech corpora in some or all of the official European languages. The first multilingual speech collection action in Europe took place in 1989 and consisted of comparable speech material recorded in five languages: Danish, Dutch, English, French, and Italian. The entire corpus, now known as EUROM-0, includes 8 languages: Danish, Dutch, English, French, German, Italian, Norwegian, and Swedish. Other corpora resulting from CEC projects include: SAM/SAM-A EUROM-1 (11 languages: Danish, Dutch, English, French, German, Greek, Italian, Norwegian, Portuguese, Spanish, and Swedish), ARS (Adverse Recognition System: Italian, English?), POLYGLOT (a 7-language IWSR database: Dutch, English, French, German, Greek, and Spanish; a 5-language TTS database: Dutch, English, French, German, and Greek), ROARS (Robust Analytical Recognition System: Spanish, ??), SPELL (Interactive System for Spoken European Language Training: French, Italian, and English), SUNDIAL (spoken language queries in the travel domain: English, French, Italian, and German), SUNSTAR (Integration and Design of Speech Understanding Interfaces: English, German, Danish, Spanish, and Italian), and ACCOR (cross-language acoustic-articulatory correlations: Catalan, English, French, German, Irish Gaelic, Italian, and Swedish).
What follows is a brief overview of the status of linguistic resources in the European countries, as well as a summary of some of the corpora resulting from European Community projects.