Interaction and control

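In any speech recording, some control over the recording process is needed. The type and amount of control can be characterised by the type of recording: random recordings, spontaneous dialogue recordings, interview recordings, and read speech recordings.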

The degree of interaction and control is clearly determined by the communication situation of the recording. In a face-to-face communication situation, the speaker can be instructed directly. On the telephone, interaction and control are restricted to oral communication.

Random recordings

In random recordings, all or some of the parameters of a recording, namely the recording time, its duration, the type of speech recorded, etc., are determined by an arbitrary and randomised procedure. In general, the speaker does not know that he or she is being recorded, and neither does the experimenter know who or what is being recorded. Other than the original setup of the experiment, there is no control over the recording, and no interaction between the experimenter and the speaker.

Typical random recordings are: "record speaker X from the speaker population for Y minutes every Z hours" (as in the BNC), or "record the news on channel X every Y minutes", or "record the speech of microphone number X in the airport tower control room". In the first example, X, Y, and Z are appropriate random numbers; in the third example, the speaker using this particular microphone is randomly selected. In all three examples, there is no control over the contents of the speech that is being recorded. Examples of random recordings are the ATIS recordings of air traffic control and the Switchboard recordings of telephone services.
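The first scheme above ("record speaker X for Y minutes every Z hours") can be sketched as follows. This is only an illustrative sketch: the function name `draw_recording_plan`, its parameters, and the upper bounds on Y and Z are assumptions made for the example and do not describe how any particular corpus was collected.

```python
import random

def draw_recording_plan(speaker_ids, max_minutes, max_interval_hours,
                        total_hours, seed=None):
    """Draw X, Y and Z at random and list the resulting recording start hours.

    All names and bounds are hypothetical; they only illustrate the idea of
    fixing the recording parameters by a randomised procedure.
    """
    rng = random.Random(seed)
    speaker = rng.choice(speaker_ids)              # X: who is recorded
    minutes = rng.randint(1, max_minutes)          # Y: duration of each recording
    interval = rng.randint(1, max_interval_hours)  # Z: hours between recordings
    offset = rng.randrange(interval)               # random start within the first interval
    start_hours = list(range(offset, total_hours, interval))
    return {
        "speaker": speaker,
        "minutes": minutes,
        "interval_hours": interval,
        "start_hours": start_hours,
    }

plan = draw_recording_plan(["spk01", "spk02", "spk03"],
                           max_minutes=10, max_interval_hours=6,
                           total_hours=48, seed=1)
print(plan)
```

Passing a fixed seed makes the plan reproducible, which may help when the setup of the experiment has to be documented afterwards.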

Random recordings are used a) to gather huge amounts of task-oriented speech, b) to reduce the subjective bias of the experimenter concerning the selection of speech to be recorded, and c) to eliminate the speaker stress that results from the knowledge of being recorded.

Spontaneous dialogue recordings

In spontaneous dialogue recordings, once the recording equipment is set up, the speakers have been briefed, and the dialogue has begun, the experimenter exercises no further control (except perhaps to terminate a dialogue). In a spontaneous dialogue, the participating speakers control each other by taking turns (exclusively or overlapping).

The experimenter may or may not be a speaker in the dialogue. The dialogue may be focussed on a given topic or restricted to a maximum duration, or it may be completely unrestricted.

In the VERBMOBIL project, negotiation dialogues are recorded. The task to be solved is to find a date for a business trip for two partners. The dialogues are recorded in a studio with two separate rooms that have visual contact through a window, and turn taking is controlled via a button [Hess et al. (1995)].

Spontaneous dialogue recordings are used a) to collect speech for the analysis of dialogue structures, b) to obtain natural speech with the full set of prosodic phenomena, and c) in role-play where task-specific speech is simulated.

Interview recordings

In an interview, an interviewer prompts a speaker to produce speech and leads the speaker through the interview. The interviewer can be the experimenter, a trained human interviewer, or a speech computer.

The interaction between interviewer and speaker begins with a briefing before the interview. During the interview, the interviewer prompts the speaker to respond to questions, to repeat words or sentences, or to discuss a given topic.

The interviewer has various control instruments at his or her disposal: the interviewer can interrupt the speaker, ask for repetition, change the order of topics in an interview, skip or insert new topics, and deal with topics in various degrees of detail.

The interviewer has a strong influence on the course of an interview and on the resulting speech. A good interviewer can establish a relation with a speaker in such a way that the original goals of the recording can be achieved. In many cases, a few interviews by a trained interviewer can produce the same amount and quality of speech material as very large random or spontaneous speech recordings. Furthermore, because of the strong interaction control by the interviewer, the speech recordings are focused. Finally, the technical quality of the speech recording is much easier to control in an interview than in random or spontaneous dialogue recordings.

Computers can also be used as interviewers. The advantages here are that the influence of the interviewer is the same for all speakers, which allows the direct comparison of interview recordings, and that measurements, e.g. timing measurements, can already be made during the recording. The disadvantage is that the computer can only follow a predetermined script and is thus not able to adapt to situations not foreseen during the design of the speech collection or the script. Furthermore, many speakers feel uncomfortable when they know they are talking to a machine.

Interview recordings are used a) to elicit speech in a rather controlled way, not necessarily leading to unnatural speech, b) to collect speech via the telephone, and c) to perform fully automated speech collections of large speaker populations.

Read speech recordings

In read speech recordings, the speaker is asked to read exactly what is presented to him or her. The text to be read may either be printed on paper or presented on a computer screen. In the first case, speakers tend to change their articulation according to the perceived structure of the text, e.g. they lower the voice at the end of an enumerated list. Presenting text on a computer screen avoids this problem, but many speakers are intimidated when facing a computer screen.

In read speech recordings, the degree of control is very high. The text to be read can be generated according to predetermined criteria, e.g. distribution of phonemes, vocabulary, etc. During the recording, each utterance of the speaker can be checked directly for errors, and, if an error is found, the speaker is asked to re-read the text.

Read speech is not spontaneous speech. Nevertheless, it is close to natural speech in some specific speech styles, e.g. dictation.

Read speech recordings are used a) to guarantee that the speech material has a certain content, b) to record the phenomena specific to read speech, and c) to monitor the recording very closely, e.g. in multi-channel recordings.

Speaker prompting


Speaker prompting is used to elicit a certain type of speech data directly from a speaker, e.g. numbers, dates, times, etc. Such data is much more difficult to obtain in dialogues or role-play. The major problems with speaker prompting are a decrease in the spontaneous quality of the speech, ambiguous prompts that lead to unexpected responses, and the rigid structure of a prompting script, i.e. a sequence of prompts. Furthermore, speakers tend to imitate the original prosody when they are asked to repeat an utterance (thus there should be different prompts for the same text so that various prosodic patterns for the same text are recorded).

However, in many applications prompting a user for input is a natural situation, and thus the decrease in spontaneity in the speech is highly welcome.

Four types of prompts can be distinguished:

The possible responses to prompts may vary greatly. It is thus advisable to instruct the speaker which responses are expected, e.g. "Please answer the following questions with yes or no". However, restricting the set of allowed responses too strongly will lead to unnatural speech.

In face-to-face communication situations, the interviewer influences the speaker even when the catalog of prompts is fixed, e.g. in a prompt sheet. Visual communication and deictic references (such as pointing with a finger to an item to be read) play a significant role. The interviewer guides the speaker through the script and may immediately correct any errors.

In telephone recordings, prompts may be output by a computer or spoken by a human interviewer. The advantage of computer prompting is that all speakers hear identical prompts. One disadvantage is that a computer-based prompting system strictly follows a predetermined prompting script and may not notice that the speaker is not responding correctly. Computer prompting scripts should thus not take longer than 15 minutes, and the script should be divided into several small units. Between the units, feedback should be given to the speaker on the status of the recording.
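A computer prompting script that respects the 15-minute limit and gives feedback between units could be organised roughly as below. This is a minimal sketch: the function names, the fixed `seconds_per_prompt` estimate, and the wording of the feedback message are all assumptions made for illustration.

```python
def split_into_units(prompts, n_units):
    """Split the prompts into n_units contiguous units of nearly equal size."""
    size, extra = divmod(len(prompts), n_units)
    units, start = [], 0
    for i in range(n_units):
        end = start + size + (1 if i < extra else 0)
        units.append(prompts[start:end])
        start = end
    return units

def build_session(prompts, n_units, seconds_per_prompt, limit_minutes=15):
    """Return the sequence of prompt and feedback events for one speaker.

    seconds_per_prompt is a rough, assumed estimate of prompt plus response time.
    """
    total_minutes = len(prompts) * seconds_per_prompt / 60
    if total_minutes > limit_minutes:
        raise ValueError(
            f"script takes about {total_minutes:.0f} min, limit is {limit_minutes} min"
        )
    events, done = [], 0
    units = split_into_units(prompts, n_units)
    for i, unit in enumerate(units):
        events.extend(("prompt", p) for p in unit)
        done += len(unit)
        if i < len(units) - 1:  # feedback between units, not after the last one
            events.append(("feedback", f"{done} of {len(prompts)} items completed"))
    return events

for kind, text in build_session([f"item {i}" for i in range(10)],
                                n_units=3, seconds_per_prompt=30):
    print(kind, text)
```

A real system would also need some way to detect missing or wrong responses, which this sketch deliberately leaves out.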

Human interviewers immediately realise whether a response from a speaker is correct, and they are able to correct wrong responses immediately. However, each prompt is an individual utterance, so variations among responses may also result from variations among the prompts.

Each prompting script should be thoroughly tested before the actual recording of data. The test participants should be candidate speakers, and test conditions must be as similar as possible to those of the actual recording. In the case of computer prompting, it is useful to have a prompting simulator which can be adapted to new prompting scripts easily.


  1. Formulate prompts so as to avoid ambiguities.
  2. Restrict the number of allowed responses.
  3. Test the prompting script on 1% of the candidate speakers (minimum 10 speakers) under realistic conditions.
  4. Give sufficient feedback to the speaker (for instance, after a third, a half, and three quarters of the script).
  5. Add dummy items at the beginning and at the end of text pages. Also, each speaker should have a different random ordering of test items so that positional effects on the pronunciation can be levelled out.
  6. Add one or more repetitions of the test items. 
  7. Use a VDU for presenting the prompting material when recording in a studio or on the experimenter's premises. (When a VDU is used, care should be taken to avoid reflections from the screen, and it should be checked that the monitor does not produce audible noise of its own.)
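Items 3, 5 and 6 of the list above lend themselves to a small worked example. The sketch below computes the size of the test group and builds one prompt page per speaker; the function names, the use of the speaker identifier as random seed, and the default of two repetitions are assumptions chosen for illustration only.

```python
import random

def pilot_size(n_candidates, percent=1, minimum=10):
    """Number of test speakers: percent of the candidates, but at least minimum."""
    share = (n_candidates * percent + 99) // 100  # integer ceiling of the percentage
    return max(minimum, share)

def prompt_page(test_items, dummy_first, dummy_last, speaker_id, repetitions=2):
    """Build one page: a dummy item, shuffled repeated test items, a dummy item.

    Seeding with the (hypothetical) speaker_id gives each speaker a different
    but reproducible ordering.
    """
    rng = random.Random(speaker_id)
    body = list(test_items) * repetitions
    rng.shuffle(body)
    return [dummy_first] + body + [dummy_last]

print(pilot_size(500))   # 1% would be 5 speakers, so the minimum of 10 applies
print(pilot_size(5000))  # 1% of 5000 candidates
print(prompt_page(["seven", "twelve", "forty"], "hello", "goodbye", "spk01"))
```

Because the dummy items occupy the first and last positions, the test items never fall at the edges of the page, and the per-speaker shuffle spreads any remaining positional effects across the items.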

