- The usability and reusability of existing and prospective speech
databases produced through EEC projects should be enhanced. These
resources often remain unavailable several years after the end of the
projects, because the time-intensive process leading from raw data to
marketable data was not initially included (or funded) in the
projects: the marketable product is not a project deliverable. The
distribution of these databases could be stimulated by a European
Center for the Distribution of Language Resources that can provide
funding for the reorganisation and documentation of already existing
data and its production on CD-ROM.
- Existing databases whose distribution is restricted should be
highlighted. These databases are not well known because they remain
within the scope of the project for which they were designed, because
their designers have kept them confidential and do not want to
disseminate them, or both. Many industrial players clearly do not yet
regard linguistic resources as precompetitive resources, but as
strategic ones. Any future European Center for LR production and
distribution will have to manage the gap between the views and
methods of industry and academia: long-term, widely available
resources for academics vs. short-term strategic resources for
industry.
- There is a clear lack of speech databases suitable for
multilingual evaluation. At present, only EUROM has a
Europe-wide dimension. Large multilingual corpora are needed, and
added value (full annotation and labelling) must be included in the
design of new databases. It should be noted that the most widely used
multilingual corpus is currently being recorded by OGI in the U.S.
This corpus is being used for the development and testing of automatic
language identification techniques, which have a wide variety of
practical applications.
- Pronouncing dictionaries and spoken language lexicons are very limited
and should be developed for the assessment of speech output systems. Large
phonetic lexicons are also needed for speech recognition. One
source that should be exploited is the outcome of the Onomastic project;
in particular, the legal details for distribution of its lexica need
to be worked out.
- A corpus of at least 100 hours of high quality speech should exist
in each language, providing coverage of many words and phonetic
contexts, with data from a reasonably large number of speakers
(200-500). Most languages are far from that. To give an idea of scale,
at a 20 kHz sampling rate, 100 hours of speech correspond to about
15 Gbytes, or about 25 to 30 CD-ROMs. In addition, this basic corpus
will need to be supplemented by application-oriented corpora, for
example for dictation (medical, legal, insurance domains), office
systems, speaker verification/identification, topic spotting, and
information retrieval. Newspaper-based corpora are a potential
multilingual source that should be encouraged, as they provide both
speech and language modeling data that can be used for dictation as
well as for other applications such as topic spotting. Multi-sensor
and multimodal corpora are also needed for more basic research.
- For short-term commercial use there is a need for large
telephone-based corpora, with speech from many (several thousand)
speakers covering a wide range of dialects, ages, and socioeconomic
backgrounds. These corpora should be extensive enough to permit
the design of speaker-independent, vocabulary-independent speech
recognisers that can serve as the basis for a variety of applications.
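The storage figure quoted above for a 100-hour corpus can be checked
with a short calculation. The sketch below is only an illustration: the
text gives the 20 kHz sampling rate, but the sample width (16-bit), the
single channel, the lack of compression, and the disc capacities are
all assumptions, and the function names are hypothetical.

```python
import math


def corpus_size_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio in bytes.

    bytes_per_sample=2 (16-bit) and channels=1 (mono) are assumptions;
    the source only states the 20 kHz sampling rate.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


def discs_needed(total_bytes, disc_capacity_bytes):
    """Number of discs required, rounding up to whole discs."""
    return math.ceil(total_bytes / disc_capacity_bytes)


if __name__ == "__main__":
    size = corpus_size_bytes(hours=100, sample_rate_hz=20_000)
    print(f"100 hours at 20 kHz: {size / 1e9:.1f} Gbytes")

    # Disc capacities are assumed values, not taken from the source.
    for capacity_mb in (650, 500):
        n = discs_needed(size, capacity_mb * 1_000_000)
        print(f"{capacity_mb} Mbytes of audio per disc: {n} discs")
```

Under these assumptions the corpus comes to 14.4 Gbytes, in line with
the rounded 15 Gbytes in the text. The 25 to 30 CD-ROM estimate
corresponds to roughly 480 to 580 Mbytes of audio per disc; the source
does not say what per-disc allowance was used.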
Our conclusions are in agreement with those in the recent EAGLES
report on Spoken Language Systems. It is clear that the need for
adequate resources is a prime concern for many actors in the field of
language technology. The following excerpts come from a draft
EAGLES report on Spoken Language Systems (For EAGLES Restricted use).
We highlight three areas of need: speech corpora, lexicons, and
assessment of speech output.
- Speech corpora size (in terms of speakers), pp. 10-11
  - few (< 5 speakers): development of speech synthesis systems
    (dictionaries of phonetic elements). Advanced research. Multi-channel
    recordings (electroglottogram, subglottal pressure, etc.)
  - medium (5-50 speakers): experimental research. Number of speakers and
    repetitions large enough for statistical processing or for broad
    coverage of phenomena.
  - large (> 50 speakers): training and testing of speaker-independent
    recognition systems.
- Spoken language lexicons (p. 67)
``Large-scale spoken language lexical resources, from reference sources of
standard, stylistic and regional pronunciations through vocabularies which
are characteristic of spoken language, are required for current research and
development with both statistical and knowledge-based technologies
...
These spoken language lexical resources in the form of actual lexical
databases and tools for constructing them, are sadly lacking.''
- Assessment of speech output systems (p. 102)
``A short-term recommendation is to develop multilingual machine readable
pronouncing dictionaries at the single word level which list permissible
variations...''
EAGLES SWLG SoftEdition, May 1997.