For general orientation, readers are advised to consult the chapters on system design and specification (Chapter 2) and on assessment methodologies and experimental design (Chapter 9).
The chapter on ``system design'' is specifically aimed at potential users of spoken language technology (such as system designers or technology procurers) who need to know how to relate the technical features of the technology to the operational benefits they are seeking to achieve. The chapter is intended to help such users communicate effectively with technologists and technology suppliers, to suggest what questions they should ask, and to provide a means of specifying their requirements in a way that is meaningful both to themselves and to the technologists.
The chapter starts with an introduction to the difference between a system's ``capability profile'' and the requirements of a given application. This is followed by an enumeration of the many and varied factors which influence the performance of the types of spoken language systems covered by the rest of the handbook. Automatic speech recognition systems are treated first, and over twenty factors are presented which range from aspects such as variability in the fluency of the speaker through to variability in the characteristics of telephone handsets. This is followed by a discussion of the different configurational possibilities for speaker verification/identification systems and a brief description of the key facts of speech synthesis systems. Interactive voice systems are introduced and the importance of error recovery strategies is identified.
The chapter goes on to outline key issues associated with the software and/or hardware aspects of the system platform, and highlights the possibilities for system simulation and prototyping, as well as a variety of practical matters ranging from the physical interface between a spoken language system and the host application to dealing with multilinguality.
The chapter on ``spoken language corpus design'' is targeted not only at users of speech corpora within the domain of spoken language technology but also at those working in other areas such as sociolinguistics, language learning and pathology. It starts with a discussion of the most important differences between written and spoken language data, and then presents examples of the many application areas which require access to spoken language corpora.
The second half of the chapter describes how to specify a spoken language corpus, first in terms of the required linguistic content and, second, in terms of the number and types of speakers involved. The latter issue is dealt with in some detail. The speaker characteristics covered include, among many others, the age and sex of each speaker, their smoking and drinking habits, and whether or not they have received any professional speech training.
The chapter on ``spoken language corpus collection'' concentrates on the practical aspects of collecting spoken language material. The first part describes the dimensions of data collection, which cover different recording scenarios such as studio versus location recording, or interviews versus read material. It is also pointed out that important data about spoken language may be collected from sensors other than a microphone, for example by means of multi-channel recordings of signals derived from laryngography, electropalatography or NMR (Nuclear Magnetic Resonance) imaging.
The second part of this chapter contains recommendations for the actual collection of spoken language data covering the necessary equipment and the data management protocols needed. The legal aspects of recording arbitrary spoken language material are also discussed and appropriate recommendations given. It is the intention that the recommendations contained within this chapter should enable any reasonably competent person to establish a suitable recording environment that will deliver data in a controlled manner and to an acceptable level of technical quality.
The chapter on ``spoken language corpus representation'' describes how, to be of value, a set of ``raw'' speech recordings needs to be augmented with symbolic annotation covering a range of phonetic and linguistic levels of description. The transcription of spoken language data is discussed (including problems which arise with spontaneous speech or overlapping speech in dialogues), and mechanisms for segmenting and labelling the data are described. This is followed by an extensive presentation of the many possible representational structures, ranging from simple orthography, through detailed low-level acoustic-phonetic analysis, to prosodic transcription and the annotation of non-linguistic phenomena (such as hesitations or acoustic non-speech events).
The chapter on ``spoken language lexica'' provides a framework for relating concepts such as the creation of lexica for specific applications, the transfer of lexical resources from one application to another and the automation of these processes. The chapter covers topics such as the basic features of spoken language lexica, the types of information contained within a spoken language lexicon (such as surface, morphological, grammatical, semantic and pragmatic information), lexicon structure (including appropriate formalisms), lexical access and lexical knowledge acquisition (from dictionaries, for example).
The chapter on ``language models'' is different from the other chapters in that it is more concerned with details of techniques and algorithms. This is because of the central role language modelling plays in spoken language systems and in characterising a spoken language corpus. The chapter covers the different formalisms involved, the definition of the key concept of ``perplexity'' and a range of practical schemes for developing high quality language models.
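As a rough illustration of the ``perplexity'' concept mentioned above, the following Python sketch computes the perplexity of a unigram model on a short word sequence. It assumes the common definition (two raised to the negative average log2 probability per word); the function names, the add-alpha smoothing and the toy vocabulary are illustrative assumptions and are not taken from the chapter itself.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Estimate add-alpha smoothed unigram probabilities over a fixed vocabulary.

    Hypothetical helper: the smoothing scheme is an assumption made for
    illustration, not a scheme recommended by the handbook.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def perplexity(model, tokens):
    """Return 2 ** (-average log2 probability per word) for the test tokens.

    Every test token must be in the model's vocabulary; an unknown word
    raises KeyError in this simplified sketch.
    """
    if not tokens:
        raise ValueError("perplexity is undefined for an empty sequence")
    log_prob = sum(math.log2(model[word]) for word in tokens)
    return 2 ** (-log_prob / len(tokens))


vocab = {"yes", "no", "stop", "go"}

# A model that treats all four words as equally likely has perplexity 4:
# on average it is as uncertain as a uniform choice among four words.
uniform = {word: 1 / len(vocab) for word in vocab}
uniform_pp = perplexity(uniform, ["yes", "no", "go"])

# A model trained on skewed data is less "surprised" by similar test data.
trained = train_unigram(["yes", "yes", "yes", "no", "go", "yes"], vocab)
trained_pp = perplexity(trained, ["yes", "yes", "no"])
```

Under this definition, lower perplexity means the model predicts the test material better; the value can be read informally as an average branching factor.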
The chapter on ``physical characterisation and description'' is essentially concerned with the non-linguistic aspects of a spoken language corpus. This includes such features as the characteristics of talkers and listeners, the recording environment, the transducer(s) and any communications channel. It also deals with ``reproducibility assurance procedures''; that is, recommendations for ensuring the integrity of the data (for example, calibration techniques and the use of reference signals).
The chapter on ``assessment methodologies and experimental design'' is intended to provide general guidance to all practitioners in the field in matters relating to formal methods for designing and executing statistically significant experiments and for the meaningful interpretation of experimental results. This relates both to the design of representative spoken language corpora and to the evaluation of spoken language systems.
The chapter on the ``assessment of recognition systems'' presents information on the substantial amount of work that has been done in this area in recent years. The chapter starts with a classification of different recognition systems and then introduces various performance measures. A taxonomy of different assessment methodologies is described, ranging from the straightforward use of spoken language corpora to more diagnostic methods and artificial test signals. This is followed by a discussion of the parameters which affect performance, including those which affect the speaker (such as workload stress or noise) and those which affect the recogniser (such as noise).
The second half of the chapter provides recommendations on testing procedures for two main classes of speech recognition system: the small-vocabulary isolated-word recogniser and the large-vocabulary continuous speech recogniser. In both cases, attention is given to the training of the system, the test procedures, and the scoring and analysis of the results.
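To give a flavour of what scoring a continuous speech recogniser can involve, the sketch below computes a word error rate by aligning a recognised word string against a reference transcription with a standard edit-distance recursion. The choice of word error rate as the measure, the function names and the example sentences are assumptions made for illustration; the chapter's own recommended scoring procedures may differ.

```python
def word_errors(reference, hypothesis):
    """Minimum number of substitutions, deletions and insertions needed to
    turn the reference word sequence into the hypothesis word sequence."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # i deletions align i reference words with no hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of words in the reference."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcription must not be empty")
    return word_errors(reference, hypothesis) / n_ref


# One deleted word ("the") and one inserted word ("please"): 2 errors in 5.
wer = word_error_rate("call the main office now",
                      "call main office please now")
```

Because insertions are counted, a word error rate defined this way can exceed 1.0 when the recogniser outputs many spurious words.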
The chapter on the ``assessment of speaker verification systems'' opens by presenting a taxonomy of system types in which the difference between identification and verification is made clear, and issues such as text-dependency are illuminated. This is followed by an analysis of the factors which influence the performance of speaker recognition systems and the set of recommended scoring procedures which should be used. The chapter concludes with some specific points concerning the forensic use of speaker recognition systems.
The chapter on the ``assessment of synthesis systems'' starts with a taxonomy of assessment tasks and techniques, distinguishing, for example, between laboratory and field assessment, and between human judgements and automatic testing. A methodology is then presented covering the choice of subjects for listening experiments, the required test procedures and suitable benchmarks and reference conditions. Recommendations are made for ``black box'' testing of overall output quality, and for ``glass box'' testing at many detailed levels of analysis. A taste is also given of future developments in synthesis evaluation.
The chapter on the ``assessment of interactive systems'' presents recommendations for the specification, design and assessment of interactive systems in which spoken language dialogue plays a major part. After defining different types of dialogue system, the chapter describes in some detail the ``Wizard of Oz'' paradigm for system simulation and the central role it plays in the design and assessment of interactive systems. The chapter goes on to address methods for characterising dialogue systems, tasks and users, and presents an assessment framework which includes high-level metrics such as correction rate and transaction success.
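The two high-level metrics named above can be sketched in Python as follows. The definitions used here are assumptions made for illustration: transaction success is taken as the fraction of dialogues in which the user's task was completed, and correction rate as the fraction of user turns spent correcting the system. The `Dialogue` record and its field names are hypothetical, and the chapter's own definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Dialogue:
    """Hypothetical summary of one logged dialogue."""
    user_turns: int        # total number of user turns
    correction_turns: int  # user turns spent correcting the system
    task_completed: bool   # whether the user's task was achieved


def transaction_success(dialogues):
    """Fraction of dialogues in which the task was completed (assumed definition)."""
    if not dialogues:
        raise ValueError("at least one dialogue is required")
    return sum(d.task_completed for d in dialogues) / len(dialogues)


def correction_rate(dialogues):
    """Fraction of all user turns that were corrections (assumed definition)."""
    total_turns = sum(d.user_turns for d in dialogues)
    if total_turns == 0:
        raise ValueError("no user turns recorded")
    return sum(d.correction_turns for d in dialogues) / total_turns


# Toy log of three dialogues, invented for this example.
log = [
    Dialogue(user_turns=10, correction_turns=2, task_completed=True),
    Dialogue(user_turns=5, correction_turns=0, task_completed=True),
    Dialogue(user_turns=5, correction_turns=3, task_completed=False),
]
success = transaction_success(log)  # 2 of 3 dialogues completed
corrections = correction_rate(log)  # 5 of 20 user turns were corrections
```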