Next: Conclusion Up: Speech standards Previous: Speech research

Computer hardware and software

Current hardware uses a variety of in-house or more widespread standards. For example, coding is 8 or 16 bits on Apple (Mac) machines and PCs; U-LAW, 8 or 16 bits, on SUN (Sparc), NEXT, VAX and DEC machines; and U-LAW or A-LAW on HP machines. Available sampling rates are often limited to 8kHz in the UNIX world, but higher rates may be available in the PC world (DOS/Windows) and on the Mac, depending on whether a standard or a professional I/O board is installed.
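
For reference, U-LAW coding compresses the signal amplitude logarithmically before quantisation, so that quiet passages keep more resolution than they would with linear coding at the same word length. The following is a minimal sketch of the continuous companding curve, assuming mu = 255 as in ITU-T G.711. The function names are illustrative, and real G.711 codecs use a segmented approximation of this curve with a specific bit layout that is not reproduced here.

```python
import math

MU = 255.0  # assumption: the mu value used by G.711 U-LAW


def mulaw_compress(x):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress: recover the linear sample."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def quantise_8bit(y):
    """Quantise a companded value to a signed 8-bit code.

    Illustrative only: this is not the G.711 bit layout.
    """
    return int(round(y * 127))
```

A sample at 1% of full scale is mapped to roughly 23% of the companded range, which is why 8 companded bits can cover a dynamic range that linear coding needs more bits for.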

File formats are often identified by the filename extension they bear. Computer manufacturers such as NEXT and SUN deal with .au (AU) or .snd (SND) files, Apple and Silicon Graphics with .aif (AIF); I/O board manufacturers may promote their own format (such as .voc for SoundBlaster boards), and the developer of the Windows operating system, MICROSOFT, is trying to impose its .wav (WAVE) format. The situation is complicated by the encoding method (linear, compressed, data and information intermingled, etc.), and even for the same filename extension the implementation may vary slightly between operating systems (WAVE in Windows or UNIX environments, SND in NEXT or PC/Mac environments). A standardisation initiative has come from the development of the Internet, which promotes an interchange format called MIME.
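
Because the same extension can hide different implementations, conversion tools usually look at the first bytes of a file rather than at its name. The sketch below is a hedged illustration of that idea; `sniff_audio_format` is a hypothetical helper, and it only checks the well-known signatures of the AU/SND, RIFF WAVE, AIFF and Creative VOC containers. A file with none of these signatures may simply be headerless sample data.

```python
def sniff_audio_format(header):
    """Guess the container format from the first 20 bytes of a file."""
    if header[:4] == b".snd":
        # Sun .au and NeXT .snd files share this signature
        return "AU/SND"
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAVE"
    if header[:4] == b"FORM" and header[8:12] in (b"AIFF", b"AIFC"):
        return "AIFF"
    if header.startswith(b"Creative Voice File"):
        return "VOC"
    return "unknown (possibly headerless raw samples)"
```

In use, one would read the first 20 bytes of the file in binary mode and pass them to the function; the extension then serves only as a hint.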

The development of the multimedia standard in the PC world is a major example of the constraints that the market imposes on the speech research community.

MULTIMEDIA STANDARD

The PC world has evolved considerably over the past few years along two relevant dimensions:

Operating system: the Windows operating system is now used worldwide and provides a suitable graphical interface.
I/O boards: the development of multimedia functionality has made available low-cost I/O boards that are easily included in a low-end PC configuration (SoundBlaster, Pro Audio Spectrum ...).

The question now is whether these current boards, primarily dedicated to audio output, can satisfy the needs of speech research and applications in terms of:

signal quality (signal-to-noise ratio ...);
sampling frequency: the multimedia standard is essentially derived from the CD Audio standard (44.1kHz) or the DAT standard (48kHz), so most multimedia-compatible I/O boards offer only sampling rates obtained by dividing this base frequency by successive integers (22.05kHz, 11.025kHz, etc.). The sampling rates used in our current speech databases, however, are 16kHz, 20kHz, etc. To satisfy the requirements of the speech research community, these boards should offer a continuum of sampling frequencies (let's say from 5 to 50kHz), and it is foreseen that not all current low-cost boards will be suitable. Otherwise, on-line resampling techniques would be required (*) to maintain compatibility with existing databases, and for future databases the speech community would have to adopt a standard ``audio'' sampling rate.
file format: the same applies to file formats. Most boards use ``standard'' (or board-specific) file format definitions, the main one being the WAVE format (.wav). These boards therefore cannot directly play the files from our existing databases (SAM or national), which are in a 16-bit linear format, whereas a WAVE file consists of chunks of data intermingled with chunks of encoding information. The files of these databases would have to be converted from one format to the other in order to be played. Future databases should either adopt a new ``market'' standard or have their files converted on input and output.
number of channels available (two or more channels may be required for recordings from several microphones or sensors).
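
The sampling-frequency mismatch can be made concrete with a little arithmetic. The sketch below lists the rates obtained by dividing a base frequency and computes the exact rational ratio a resampler would need in order to move a database rate onto a board rate. The divisor set (1, 2, 4, 8) and the function names are assumptions chosen for illustration, not properties of any particular board.

```python
from fractions import Fraction


def derived_rates(base_hz, divisors=(1, 2, 4, 8)):
    """Rates obtained by dividing the base frequency (assumed divisor set)."""
    return [base_hz / d for d in divisors]


def resample_ratio(source_hz, target_hz):
    """Exact ratio target/source as interpolate-by / decimate-by factors."""
    return Fraction(target_hz, source_hz)


print(derived_rates(44100)[:3])       # [44100.0, 22050.0, 11025.0]
print(resample_ratio(16000, 44100))   # 441/160
print(resample_ratio(20000, 48000))   # 12/5
```

A 16kHz database played on a 44.1kHz board needs a ratio of 441/160, which is considerably more awkward than the factor of 3 needed for a 48kHz board; none of the divided rates coincides with 16kHz or 20kHz.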

(*)(**) Using I/O boards without a DSP implies that some signal processing will be offloaded to the PC (speech level detection, min/max measurement, possible over- or undersampling). These on-line procedures, together with on-line format conversion routines, could increase the CPU load to the point where low-end SESAM workstations might be unable to run at a high speech sampling rate, for example, or with two channels.
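
To give an idea of the per-sample work involved, the following sketch performs min/max measurement and a simple level measurement on blocks of 16-bit samples. It is an illustration only: the block size of 160 samples, the use of RMS in dB relative to full scale as the level measure, and the name `block_stats` are all assumptions, not a description of what SESAM workstations actually do.

```python
import math

FULL_SCALE = 32768.0  # magnitude of the largest 16-bit sample


def block_stats(samples, block_size=160):
    """Yield (min, max, rms_level_db) for consecutive blocks of samples."""
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        mean_square = sum(s * s for s in block) / len(block)
        rms = math.sqrt(mean_square)
        # Silence has no finite level in dB, so report minus infinity
        level_db = 20.0 * math.log10(rms / FULL_SCALE) if rms > 0 else float("-inf")
        yield min(block), max(block), level_db
```

Each sample is visited several times per block, and this cost scales linearly with both the sampling rate and the number of channels, which is the source of the CPU-load concern.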

One issue is backward compatibility with existing databases; another is which format is going to become ``the standard'', i.e. the worldwide audio/computer/speech standard. This topic is to be considered during the SPEECHDAT project, but it is foreseen that no single standard will emerge and that conversion routines will remain a major issue. Many tools are available but, as an example, even for the RIFF WAVE format, conversion between the Windows and UNIX worlds is anything but trivial. At the moment, it is unclear whether the standard interchangeable I/O boards currently on the market will satisfactorily meet the needs of speech research; this will depend on the target application.
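
As an illustration of what such a conversion routine involves, the sketch below wraps 16-bit linear samples in a minimal RIFF WAVE container. It assumes that the input is headerless (a real database file may carry its own header, which would have to be stripped first) and that the caller knows the byte order of the samples. RIFF stores all values in little-endian order, so data recorded on a big-endian workstation must be byte-swapped, which is one reason the Windows/UNIX conversion is not trivial. The function name and parameters are hypothetical.

```python
import struct


def raw_pcm16_to_wave(raw, sample_rate, channels=1, big_endian=False):
    """Return the bytes of a WAVE file holding the given 16-bit linear PCM."""
    if len(raw) % 2:
        raise ValueError("16-bit PCM data must contain an even number of bytes")
    if big_endian:
        # Swap the two bytes of every sample to obtain little-endian order
        swapped = bytearray(len(raw))
        swapped[0::2] = raw[1::2]
        swapped[1::2] = raw[0::2]
        raw = bytes(swapped)
    block_align = channels * 2  # bytes per sample frame
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(raw), b"WAVE",       # RIFF chunk descriptor
        b"fmt ", 16, 1, channels, sample_rate,  # format chunk, 1 = linear PCM
        sample_rate * block_align, block_align, 16,
        b"data", len(raw),                      # data chunk header
    )
    return header + raw
```

The result can be written to disk in binary mode. The 44-byte header built here is the simplest common layout; WAVE files found in practice may contain additional chunks, so a reader cannot assume the samples always start at byte 44.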


EAGLES SWLG SoftEdition, May 1997.