
Production costs

Speech databases are the only type of speech resource for which production costs can be estimated from concrete production cases, and even then only for the raw speech database. The cost of any added value (annotation, transcription, phonetic or prosodic labeling ...) depends on whether the work is done manually, semi-automatically or automatically, and on the corresponding tools, whether already available or still to be developed. This in turn concerns tools in general, whose development costs are those of the software industry (essentially a matter of specifications and the corresponding man-months). The production of spoken language lexicons and pronunciation dictionaries should be comparable to that of speech databases in terms of production costs. An estimate of this cost can hopefully be provided by the ongoing ONOMASTICA project, and perhaps by SPEX from their experience with CLEX.

A current estimate of the minimum cost of database production is 1 Ecu per utterance. This is the lowest case, which may apply to large telephone corpora, for which there is no need to move speakers (or even to pay them) and for which the initial equipment investment is relatively minor. However, these corpora are by nature of telephone speech quality and may not be useful for basic research and technology development, and many other factors can increase the cost. At the other extreme, multi-channel recordings in a specific, controlled environment, with representative speakers selected from all over a country, may be required as advanced research material. The cost in this case, estimated from completed and available databases, may reach 10 Ecus per utterance (i.e. ten times more than for the telephone corpus). More specific corpora, for instance those involving a variety of articulatory sensors, can be substantially more expensive.
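As a rough illustration, the two per-utterance figures above can be turned into a budget bracket for a planned corpus. The sketch below is only an arithmetic aid: the function name and the corpus size of 50,000 utterances are hypothetical, and the bounds are simply the 1 and 10 Ecu estimates quoted in the text, not a costing model.

```python
# Per-utterance bounds quoted in the text (Ecu); everything else is illustrative.
MIN_COST_PER_UTTERANCE_ECU = 1.0   # large telephone corpora
MAX_COST_PER_UTTERANCE_ECU = 10.0  # multi-channel, controlled-environment recordings


def cost_range_ecu(n_utterances):
    """Return (lowest, highest) estimated production cost in Ecu."""
    if n_utterances < 0:
        raise ValueError("n_utterances must be non-negative")
    return (
        n_utterances * MIN_COST_PER_UTTERANCE_ECU,
        n_utterances * MAX_COST_PER_UTTERANCE_ECU,
    )


# Hypothetical corpus of 50,000 utterances.
low, high = cost_range_ecu(50_000)
print(f"Estimated cost: {low:.0f} to {high:.0f} Ecu")
```

The bracket covers raw speech data only. Added value such as transcription or labeling, and corpora recorded with articulatory sensors, fall outside it.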

We can therefore argue that the cost of widely applicable, general-purpose speech data ranges from 1 to 10 Ecu per utterance, depending on various factors such as:

Type of utterance: word, sentence, passage, dialogue ... The cost rises along this scale in terms of recording time, error recovery and storage capacity.
Speech quality: sampling frequency and quantization (number of bits per sample) are relevant factors. From telephone quality to high-quality standard audio files, the same utterance will require 5 times the storage capacity (i.e. more CD-ROMs for the same database). The use of lossless compression techniques (T. Robinson, CUED) can reduce the storage needed by a factor of 2. The RELATOR project has supported the adaptation of the UNIX compression algorithm to DOS for use on PCs.
Recording protocol: multi-channel recordings require n times the storage space, where n is the number of channels. For example, if both the speech signal and the Lx (laryngograph) signal are recorded together, the volume for one utterance is doubled. The more complex the protocol (multiple sensors, specific equipment), the more time-consuming the production process and the more space- and hence time-consuming the pressing process (number of CDs, effort to prepare them, etc.). Because certain channels have limited bandwidth, the actual storage requirements may be lower, as lower sampling rates can be used for them.
Speaker selection: from people calling a toll-free number from their own phone to speakers who must be present at a specific recording site (with transport, hotel and food expenses and possibly payment), there is a wide range of situations and corresponding costs.
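The storage effects of speech quality and recording protocol listed above can be sketched with simple uncompressed-PCM arithmetic. The sampling rates, bit depths and the 3-second utterance below are assumptions chosen for illustration (8 kHz, 8-bit for telephone speech; 20 kHz, 16-bit for high-quality speech), not figures from the text. With these assumed values the ratio happens to be the factor of 5 mentioned above. The function name is hypothetical.

```python
def storage_bytes(duration_s, sample_rate_hz, bits_per_sample,
                  channels=1, compression_factor=1.0):
    """Approximate storage for one utterance, in bytes.

    Assumes uncompressed PCM with every channel at the same sampling rate;
    compression_factor divides the raw size (2.0 means "halved").
    """
    raw = duration_s * sample_rate_hz * (bits_per_sample / 8) * channels
    return raw / compression_factor


DURATION_S = 3.0  # assumed utterance length

# Assumed formats, for illustration only.
telephone = storage_bytes(DURATION_S, 8_000, 8)
high_quality = storage_bytes(DURATION_S, 20_000, 16)

# Speech plus a second channel (e.g. Lx) at the same rate doubles the volume.
two_channel = storage_bytes(DURATION_S, 20_000, 16, channels=2)

# Lossless compression by the factor of 2 mentioned in the text.
compressed = storage_bytes(DURATION_S, 20_000, 16, compression_factor=2.0)

print(f"telephone:    {telephone:.0f} bytes")
print(f"high quality: {high_quality:.0f} bytes ({high_quality / telephone:.0f}x)")
print(f"two channels: {two_channel:.0f} bytes")
print(f"compressed:   {compressed:.0f} bytes")
```

A band-limited channel recorded at a lower sampling rate would need its own call to `storage_bytes`, with the results summed, since this sketch applies a single rate to all channels.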


EAGLES SWLG SoftEdition, May 1997.