Speech research

Inside the speech research community itself, formats used for speech databases differ widely, depending on their purpose and the applications they are designed for. The major differences concern the file format and the sampling frequency.

SAMPLING FREQUENCY:
There is a variety of sampling frequencies used in existing speech databases. Table I.1 shows a few examples:

**Table I.1:** Sampling frequencies
Sampling rate	Speech databases
8000 Hz	Telephone DBs such as POLYPHONE, CRICUBE (Canada)
10000 Hz	Read texts corpus (Netherlands)
12000 Hz	ATR DB (Japan)
12800 Hz	Telephone DBs such as COLLECT (Italy)
16000 Hz	ATR DB, ASJ (Japan), BDSONS (France), GRONINGEN (Netherlands), PHONDAT & ERBA (Germany), TIMIT (USA)
20000 Hz	ATR DB (Japan), EUROM-1 (Europe), BDBRUIT (France)
48000 Hz	Jeida (Japan)

For a given amount of acoustic data, the higher the sampling rate, the more space the corresponding file takes. Ten seconds of speech sampled at 10 kHz with 16-bit quantisation occupy 200 kB, but 800 kB when sampled at 40 kHz. Although typical sampling rates tend to rise as high-quality technology becomes available and disk storage becomes cheaper, many fields of speech research stick to intermediate rates for various reasons: because they do not need such high quality for their purpose (for instance, speech synthesis), because they are tied to technology standards (European telephony standard: PCM A-law (POLYPHONE)), or because their purpose is to deal with low-quality speech (speech recognition) in real applications.
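The arithmetic behind these figures can be checked with a short sketch. The function name and parameters below are illustrative, and the calculation assumes uncompressed linear PCM with no header.

```python
def pcm_size_bytes(duration_s: float, sample_rate_hz: int,
                   bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM data, excluding any header."""
    n_samples = round(duration_s * sample_rate_hz)   # samples per channel
    bytes_per_sample = (bits_per_sample + 7) // 8    # round up to whole bytes
    return n_samples * channels * bytes_per_sample


# Ten seconds of 16-bit mono speech at the two rates quoted in the text.
size_10k = pcm_size_bytes(10, 10_000)   # 200000 bytes
size_40k = pcm_size_bytes(10, 40_000)   # 800000 bytes
```

Quadrupling the sampling rate quadruples the storage needed, everything else being equal.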

Given only a speech file, one can hardly guess its sampling rate, its coding or its recording conditions, let alone the age of the speaker. It is therefore crucial that information about the speech signal file be available in some form so that the file can be used properly. The minimum information required is of course how to access the file (byte order, quantisation, sampling rate). But information about the recording conditions, the speaker characteristics, the text of the utterances, and various other parameters is more than useful in real speech studies. Two main philosophies are in use: storing the information within the speech file, or outside it, i.e. in an external file. Both approaches have pros and cons, and they are well represented by the NIST/SPHERE format and the SAM format respectively.

NIST/SPHERE:
This format is provided by the National Institute of Standards and Technology (NIST) in the USA, and takes the within-file approach by means of a SPHERE header. The header is an ``object-oriented, 1024-byte blocked, ASCII structure which is prepended to the waveform data. The header is composed of a fixed-format portion followed by an object-oriented variable portion.'' ``The fixed portion is as follows:
NIST_1A

The first line specifies the header type and the second line specifies the header length.'' The remaining object-oriented variable portion is composed of object-type-value ``triple'' lines which have the following format:

The currently defined objects cover database identification and version, utterance identification, channel count, sample count, sampling rate, minimum and maximum level, and A/D settings. ``The list may be expanded for future databases, since the grammar does not impose any limit on the number of objects. The file is simply a repository for ``standard'' object definitions. The single object ``end_head'' marks the end of the active header and the remaining unused header space is undefined'' (but within the 1024-byte limit).
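As an illustration of how such a header might be read, here is a minimal sketch. It assumes that each triple line is written as a field name, a type flag (`-i` for integer, `-r` for real, `-sN` for a string of N characters) and a value, as in commonly distributed SPHERE files. The flags, the field names in the example and the function name are assumptions made for this sketch, not taken from the text above.

```python
def parse_sphere_header(data: bytes):
    """Parse a SPHERE-style ASCII header found at the start of `data`.

    Returns (fields, header_len). Sketch only: assumes "name type value" lines.
    """
    first, second, _ = data.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("not a NIST_1A header")
    header_len = int(second.strip())          # second line: header length in bytes
    text = data[:header_len].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:         # skip the two fixed-format lines
        line = line.strip()
        if line == "end_head":                # end of the active header
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue                          # ignore blank or malformed lines
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:                                 # assumed "-sN": keep as string
            fields[name] = value
    return fields, header_len


# Hypothetical header padded to 1024 bytes, followed by four dummy samples.
example = (b"NIST_1A\n   1024\n"
           b"database_id -s5 TIMIT\n"
           b"channel_count -i 1\n"
           b"sample_rate -i 16000\n"
           b"end_head\n").ljust(1024) + b"\x00\x01" * 4
fields, header_len = parse_sphere_header(example)
waveform_bytes = example[header_len:]         # data starts right after the header
```

Because the header length is stated in the header itself, a reader can skip straight to the waveform data without understanding every object.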

The NIST/SPHERE format is widely used in the US and elsewhere, including for the US and Dutch POLYPHONE databases and the French BREF. It is supported by NIST, a maintenance path exists, and it comes with a set of tools to handle the header (access, update, remove, replace ...). The header approach minimises the risk of losing track of data identity. The header can hold both prompt and transliteration texts, but data files must be modified after collection whenever annotations are added or an upgrade or correction is issued. The header is fixed-length and cannot conveniently be edited with a text editor.

SAM:
This format is a European `standard', defined by the SAM consortium (ESPRIT Project ``SAM'': Speech Assessment Methods) (see Appendix C). SAM advocates an outside (headerless) approach using an associated description file. A recording thus consists of two files:
A speech file, which contains only speech waveforms.
An associated description file (ASCII), which is linked to the speech file.

The files come in pairs; in SAM terminology, their names are identical except for the last letter of the extension. The associated description file is a standard label file, made of a header and one or several bodies. It contains all the information usually required by people working on the files without the database management system. Each line consists of a specific mnemonic followed by the corresponding value:

In a current annotation file the header contains database identification, file localisation, file production, A/D settings, sampling rate, start and end samples, number of channels, speaker information, and pointers to the prompt text file, recording conditions and protocol. Since the format can store several items in one file, the body contains on-the-field labels for each item recorded in the speech file: sequence beginning (in samples), sequence end, input gain on recording, minimum sample value, maximum sample value, and orthographic text prompt. Discontinuities between the items are indicated, if any. Both the header and the body can be extended to store new relevant descriptors or labels, provided that adequate mnemonics are created and no contradiction occurs with existing ones.
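A label file of this kind lends itself to a very simple reader. The sketch below assumes that each line has the form `MNEMONIC: value`, that one mnemonic opens the body and another closes the file. The specific mnemonics used here (`LHD`, `DBN`, `SAM`, `NCH`, `LBD`, `LBR`, `ELF`), the colon separator and the function name are illustrative assumptions, not a statement of the SAM specification.

```python
def parse_label_file(text: str, body_start: str = "LBD", end_mark: str = "ELF"):
    """Split a mnemonic/value label file into a header dict and a body list.

    Sketch only: the line syntax and the two marker mnemonics are assumptions.
    """
    header, body = {}, []
    in_body = False
    for raw in text.splitlines():
        mnemonic, sep, value = raw.strip().partition(":")
        if not sep:
            continue                      # not a "mnemonic: value" line
        mnemonic, value = mnemonic.strip(), value.strip()
        if mnemonic == body_start:
            in_body = True                # everything after this is item labels
        elif mnemonic == end_mark:
            break
        elif in_body:
            body.append((mnemonic, value))
        else:
            header[mnemonic] = value
    return header, body


# Hypothetical label file describing one recorded item.
sample = """LHD: SAM, 6.0
DBN: EXAMPLE_DB
SAM: 16000
NCH: 1
LBD:
LBR: 0,15999,,,,one two three
ELF:
"""
header, body = parse_label_file(sample)
sampling_rate = int(header["SAM"])
```

Since unknown mnemonics are simply stored, a reader written this way keeps working when new descriptors are added to the header or body.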

The SAM format is widely used in Europe for multilingual databases (EUROM-1) and for national ones (French BDSONS, English SCRIBE, Italian COLLECT, Spanish ALBAYZIN). The current SPEECHDAT consortium adopted the SAM format for its telephone recordings. (SAM provided a conversion routine from NIST to SAM format on the DARPA/TIMIT CD-ROMs.) Using an associated description file means that files must be kept together in pairs, which increases the risk of losing files. On the other hand, the headerless system keeps data files unchanged after collection when the database transcription is corrected or upgraded. It supports multiple annotation levels. The length of the description file is not limited, and the information is accessible with a plain text editor. ELRA (European Language Resources Association) should take care of the maintenance and upgrades of this format.

OTHER FORMATS:
The VERBMOBIL project in Germany has developed its own format, especially for handling dialogue. Some databases in Japan (such as JEIDA and ASJ) have no header.

So far we have seen that a correct description of a sound data file includes many mandatory fields. The first (and minimal) set contains the information on how to use the file, namely the byte order, quantisation and sampling rate mentioned earlier.
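The importance of this minimal access information can be illustrated with a headerless file. The sketch below decodes raw 16-bit linear PCM given an externally supplied description. The class and function names are hypothetical, and only 16-bit samples are handled.

```python
import struct
from dataclasses import dataclass


@dataclass
class AccessInfo:
    """Minimal description needed to read a headerless speech file."""
    sample_rate_hz: int
    bits_per_sample: int = 16
    byte_order: str = "little"    # "little" or "big"
    channels: int = 1


def decode_pcm(raw: bytes, info: AccessInfo) -> list[int]:
    """Decode raw 16-bit linear PCM bytes into signed integer samples."""
    if info.bits_per_sample != 16:
        raise ValueError("this sketch only handles 16-bit samples")
    n = len(raw) // 2
    prefix = "<" if info.byte_order == "little" else ">"
    return list(struct.unpack(f"{prefix}{n}h", raw[: n * 2]))


def duration_s(samples: list[int], info: AccessInfo) -> float:
    """Duration implied by the sample count and the declared sampling rate."""
    return len(samples) / (info.sample_rate_hz * info.channels)


raw = b"\x01\x00\xff\xff"
as_little = decode_pcm(raw, AccessInfo(16000, byte_order="little"))
as_big = decode_pcm(raw, AccessInfo(16000, byte_order="big"))
```

The same four bytes yield different sample values under the two byte orders, and the computed duration depends entirely on the declared sampling rate, which is why none of these fields can be recovered reliably from the data alone.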

The development of speech applications in new domains requires many other descriptors to be available. Descriptions of new data types (multisensor, multimodal, dialogue) are needed, as well as more complex and complete descriptions of data (dialogue, e.g. in WOZ techniques; multimodal synchronisation; timing notations; additional descriptions such as dialogue flow, emotional state, man-machine situation). Furthermore, the forthcoming development of database distribution and networking will require information about the sources of the data, such as how to obtain it and the right to use it.

The standardisation carried out in previous large collaborative projects must be clearly enhanced; efforts must be devoted to the representation of more complex information on speech data, with associated description files and pointers to various descriptors (including location of the data, source of the data, transformations applied to the sources, country of provenance acknowledgement, restrictions on use, derived information ...).