The management of a speech data collection involves
Speech data collections are very expensive, both in terms of money and resources. It is thus recommended to collect, besides the speech signal itself, as much administrative data as possible, and to design procedures that can be reused in other data collections.
Speakers should be thought of as a primary and very valuable resource in speech recordings. It is therefore advisable to build a speaker database which contains for each speaker
Preferably, such a database is implemented using a database management system on a computer (see Appendix H for details on DBMSs).
This way, data can be entered easily during the preparation of a speech recording. Such a database can also be held on forms in a folder, but then the extraction of speakers according to specific criteria other than the primary ordering criterion is difficult and error-prone.
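The advantage of a DBMS over paper forms, namely selecting speakers by arbitrary criteria, can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, the column names (name, sex, year_of_birth, region) and the helper functions are hypothetical and chosen only for illustration; they are not the field list recommended here.

```python
import sqlite3


def create_speaker_db(path=":memory:"):
    """Open (or create) a speaker database with one hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS speaker (
               speaker_id    INTEGER PRIMARY KEY,
               name          TEXT NOT NULL,
               sex           TEXT,
               year_of_birth INTEGER,
               region        TEXT
           )"""
    )
    return conn


def add_speaker(conn, name, sex, year_of_birth, region):
    """Insert one speaker and return the identifier assigned to it."""
    cur = conn.execute(
        "INSERT INTO speaker (name, sex, year_of_birth, region) "
        "VALUES (?, ?, ?, ?)",
        (name, sex, year_of_birth, region),
    )
    conn.commit()
    return cur.lastrowid


def select_speakers(conn, sex=None, born_after=None, region=None):
    """Return (speaker_id, name) pairs matching all criteria that are given."""
    clauses, params = [], []
    if sex is not None:
        clauses.append("sex = ?")
        params.append(sex)
    if born_after is not None:
        clauses.append("year_of_birth > ?")
        params.append(born_after)
    if region is not None:
        clauses.append("region = ?")
        params.append(region)
    sql = "SELECT speaker_id, name FROM speaker"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY speaker_id"
    return conn.execute(sql, params).fetchall()
```

A call such as select_speakers(conn, sex="f", region="south") extracts a subset that would require a manual search through every form in a paper folder.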
The recruitment of speakers should have two goals: provide a sufficient number of speakers for a given speech data collection, and provide sufficient information about the speakers which can be used to build or extend a speaker database.
Speaker recruitment can be characterised along the following dimensions:
Recruiting a small (i.e. 1 to 5) or medium (5 to 50) number of speakers is no problem. Depending on the requirements, colleagues, friends, and relatives can be asked to participate. However, one cannot expect any demographic balance in small sets of five or fewer speakers. The advantage of using friends and relatives is that they may be available for a long period of time, and that they can in general be used for more than one recording.
The recruitment of a large to very large number of speakers is completely different from that of a small number of speakers. Accessing the speakers, scheduling their recordings, evaluating the recordings, and storing the data become such large tasks that they cannot easily be performed by a single person. Accessing a large number of speakers requires either
Contact addresses, i.e. telephone numbers or postal addresses, are expensive: address brokers charge for each address bought, with the risk of the address being useless (the address is wrong, the person is not willing to cooperate, etc.) falling entirely upon the buyer. Market research institutes have large address databases from which they can select subsets according to specific criteria, but they in general do not give away these addresses. Although addresses allow persons to be contacted directly, e.g. through mail, telephone, or interviewer visits, the rate of return is rather low, typically in the range of less than 5% for mail, 25% for telephone, and 50% for interviewer visits.
Public calls for participation, e.g. newspaper advertisement or article, Internet posting, radio or TV announcement, may reach a very large audience. In many cases, a public call can be arranged at little expense: newspapers, especially the science editors, are willing to cooperate, Internet postings are virtually free, and radio or TV announcements are affordable. The rate of return is usually very low (less than 1%) but this is compensated for by the sheer size of the audience reached. However, the means to determine the response to a call for participation are limited. Also, the number of callers will not be evenly distributed over time (most people will call immediately after having received the call), which may cause capacity problems. People responding to a public call for participation are highly motivated; however, this does not hold for the population as a whole and thus introduces a bias.
In hierarchical recruitment the task of recruiting m speakers is divided into n tasks of recruiting m / n speakers. Hierarchical recruitment works well if the burden of recruiting speakers can be mapped to some real-world hierarchy, e.g. the employee hierarchy in a company. The rate of return strongly depends on the success of a person persuading others to participate.
In all three recruitment strategies, incentives may help to increase the motivation to participate and thus the rate of return. Incentives can either be gifts (e.g. telephone cards) or the participation in a lottery with a grand prize. However, such incentives clearly make the recruitment of speakers even more expensive.
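The rates of return quoted above translate directly into the number of contacts that must be made to obtain a given number of speakers. The following is a minimal sketch of that arithmetic; the function names and the dictionary are illustrative assumptions, and the percentages are the typical upper bounds stated in the text, so the results are best read as lower bounds on the contacts needed. The second function covers the hierarchical case of splitting m speakers over n recruiters.

```python
# Typical rates of return in percent, as quoted in the text (upper bounds).
RETURN_RATE_PERCENT = {
    "mail": 5,
    "telephone": 25,
    "interviewer_visit": 50,
    "public_call": 1,
}


def contacts_needed(target_speakers, rate_percent):
    """Smallest number of contacts yielding target_speakers at the given rate.

    Integer arithmetic is used so that the rounding up is exact.
    """
    if target_speakers < 0:
        raise ValueError("target_speakers must not be negative")
    if not 0 < rate_percent <= 100:
        raise ValueError("rate_percent must be in the range 1..100")
    # Ceiling division: -(-a // b) rounds a / b up to the next integer.
    return -(-target_speakers * 100 // rate_percent)


def speakers_per_recruiter(m, n):
    """Hierarchical recruitment: speakers each of n recruiters must find
    so that at least m speakers are recruited in total (m / n, rounded up)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return -(-m // n)
```

For example, contacts_needed(1000, RETURN_RATE_PERCENT["mail"]) gives the minimum number of letters to send for 1000 speakers under the stated rate; a lower actual rate increases the figure accordingly.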
Scheduling speakers is important to make optimal use of the recording capacities within a given period of time. Proper scheduling avoids speaker frustration (caused by having to wait, ever-busy telephone lines, etc.) and allows a maximum number of recordings within the given recording capacity.
If speakers are recorded in a studio, a time slot is reserved for each speaker. This time slot must be sufficient for
In general, five minutes for each of the side-tasks should be sufficient.
If speakers have to travel far then it is almost inevitable that some of them come late or do not appear at all. In such cases it is advisable to have some speakers available upon short notice. In any case there must be a person responsible for the scheduling, and this person must be reachable directly by telephone.
For telephone recordings, the number of speakers calling at any one time must be matched to the capacity of the telephone equipment. If potential speakers do not get through because of busy lines, they are likely not to retry. Furthermore, telephone recordings should be possible 24 hours a day, or it must be clear to callers that the service is operational only for a specific period during the day. Note that recording 24 hours a day requires that the recordings be performed automatically because only in rare cases will human operators be available for 24 hours. Again, speakers must be able to reach an operator via telephone, e.g. to report problems or make suggestions.
The cost of a speech recording is determined by the cost for personnel and equipment and by the period of time. The total cost estimate is defined in a budget, and at given times the actual expenditures are compared to the budget plan.
A speech recording project is usually defined by scientifically trained experts, e.g. speech engineers, phoneticians, etc. Only rarely are there people with expert finance and budget knowledge in a project. Hence budgets are often rather broad estimates, and many hidden costs are easily overlooked.
The minimum personnel requirements for speech recordings (of a large number of speakers) are a project administrator and supervisor, and a system operator; both should be available for the whole recording period. Depending on the speech recording setup and the processing of the signal data, interviewers, scientific personnel, and temporary collaborators are necessary.
The administrator is responsible for the budget and the supervision of the project as a whole, the recruitment of speakers, the scheduling of recordings, and the organisation of the data evaluation. The system operator is responsible for the technical and data processing aspects of the speech recording, i.e. the setup of equipment, storage and backup of data, etc. Interviewers are needed for speech recordings in face-to-face communication situations. A first evaluation of the technical quality of recordings can be performed by rather unskilled personnel, whereas further processing, e.g. the transliteration or a phonetic segmentation and labelling of utterances, requires trained experts or scientific personnel.
The cost of personnel is the sum of salary and related infrastructure (room, desk, computer, telephone) and working materials costs. In many cases existing resources can be reused, but it should be clear that they have to be accounted for in the budget.
The cost for equipment consists of the acquisition and maintenance costs. Again, in many cases existing equipment can be reused, but it must still be accounted for in the budget. Maintenance costs are significant cost factors which often exceed the original acquisition costs. For time-critical projects, maintenance contracts with a guaranteed repair time should be considered.
A speech recording can be divided into the following phases:
All phases are strictly sequential except for recording and evaluation which can be executed in parallel.
The initialisation and test, preparation, and cleanup phases take roughly constant time. The initialisation and test phase must be considered very important because wrong decisions here will affect the rest of the project. Preparation can be short if the initialisation and test phase results in a good procedural setup.
The duration of the recording and evaluation phases depends directly on the number of recordings. As a rough estimate, double the speaking time (prompts and responses) to get an estimate of the time needed to perform an individual recording (speaker instruction, cleanup, etc.). Depending on the quality of the evaluation, the time needed for evaluation may be double (technical evaluation) to ten times (phonetic evaluation and transliteration) the speaking time.
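The rule of thumb above can be written down as a short calculation. This is a minimal sketch: the function and constant names are illustrative assumptions, while the factors 2 (recording), 2 (technical evaluation) and 10 (phonetic evaluation and transliteration) are the rough estimates given in the text.

```python
RECORDING_FACTOR = 2             # recording session takes twice the speaking time
TECHNICAL_EVALUATION_FACTOR = 2  # lower end of the evaluation effort
PHONETIC_EVALUATION_FACTOR = 10  # upper end of the evaluation effort


def estimate_minutes(num_recordings, speaking_minutes, evaluation_factor):
    """Estimate total recording and evaluation time in minutes.

    speaking_minutes is the speaking time (prompts and responses) of one
    recording; evaluation_factor should lie between the technical and the
    phonetic factor, depending on the kind of evaluation performed.
    """
    if num_recordings < 0 or speaking_minutes < 0:
        raise ValueError("counts and durations must not be negative")
    total_speaking = num_recordings * speaking_minutes
    return {
        "recording": RECORDING_FACTOR * total_speaking,
        "evaluation": evaluation_factor * total_speaking,
    }
```

Because recording and evaluation can be executed in parallel, the two figures are returned separately rather than added; how far they overlap depends on the personnel available.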