The DKISALA software tool for semi-automatic labelling of
speech corpora runs on the SESAM workstation. It is possible to
listen to the speech signal during annotation if a
non-SAM-standard DSP32C speech processing board is installed in
the workstation (either the AT&T development board or the
Loughborough PC system board, both based on the AT&T DSP32C
signal processor; details of configuration and board addresses
are available from the developing laboratory).
The DKISALA system uses a trained and calibrated
Self-Organising Neural Network (SONN) to convert speech-frame
cepstrum coefficients into a set of continuously valued
Acoustic-Phonetic Features, which are further transformed into a
smaller set of Principal Components. The Principal Components
are used to model each individual allophone as a multivariate
Gaussian probability density function, and these allophone
models are processed by a Viterbi Search and Level-Building
algorithm, constrained by an independently supplied string of
phoneme symbols corresponding to the speech signal being
labelled.
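The report gives no implementation details for this stage; the
following Python sketch shows one way the allophone modelling and
constrained alignment could look, assuming per-allophone Gaussian
parameters (means and covariances) have already been estimated
from Principal-Component vectors. The names log_likelihoods and
level_building_segment, and the model representation, are
hypothetical, and the recursion below is a generic level-building
formulation rather than the exact DKISALA algorithm.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_likelihoods(frames, models):
        """Per-frame log-likelihood under each allophone model.

        frames : (T, D) array of per-frame Principal Components
        models : list of (mean, cov) pairs, one per allophone in the
                 phoneme string given for the utterance (hypothetical)
        """
        return np.column_stack([
            multivariate_normal.logpdf(frames, mean=m, cov=c)
            for m, c in models
        ])  # shape (T, K)

    def level_building_segment(loglik):
        """Viterbi-style level building: frames are assigned to levels
        0..K-1 in order, each level covering at least one frame, so the
        segmentation is constrained by the given allophone sequence.
        Assumes T >= K. Returns the K-1 boundary frame indices."""
        T, K = loglik.shape
        score = np.full((T, K), -np.inf)
        back = np.zeros((T, K), dtype=int)
        score[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for k in range(min(t + 1, K)):
                stay = score[t - 1, k]                         # remain in level k
                enter = score[t - 1, k - 1] if k else -np.inf  # advance a level
                if stay >= enter:
                    score[t, k], back[t, k] = stay + loglik[t, k], k
                else:
                    score[t, k], back[t, k] = enter + loglik[t, k], k - 1
        path = np.empty(T, dtype=int)                          # trace back
        path[-1] = K - 1
        for t in range(T - 1, 0, -1):
            path[t - 1] = back[t, path[t]]
        return np.where(np.diff(path) == 1)[0] + 1             # boundary frames

With frames of Principal Components and the Gaussian models for
the utterance's phoneme string,
level_building_segment(log_likelihoods(frames, models)) returns
the proposed boundary frames.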
The SONN is trained on large speech corpora that have been
manually labelled prior to training. The training data currently
in use comprises EUROM-0 speech material from three speakers for
the language under analysis.
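The report does not specify the SONN architecture or training
rule. As one plausible reading of "trained and calibrated", the
sketch below uses a standard Kohonen-style self-organising map,
trained unsupervised on cepstrum frames and then calibrated
against the manually labelled material by assigning each map node
the mean acoustic-phonetic feature vector of the frames it wins.
All function names and parameter values are assumptions for
illustration only.

    import numpy as np

    def train_som(frames, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
        """Kohonen-style SOM trained on cepstrum frames (T, D).
        Returns the codebook of node weight vectors, shape (G, D)."""
        rng = np.random.default_rng(seed)
        T, D = frames.shape
        rows, cols = grid
        coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                      indexing="ij"), axis=-1).reshape(-1, 2)
        W = rng.normal(size=(rows * cols, D)) * frames.std()
        step, n_steps = 0, epochs * T
        for _ in range(epochs):
            for x in frames[rng.permutation(T)]:
                frac = step / n_steps
                lr = lr0 * (1.0 - frac)                       # decaying learning rate
                sigma = sigma0 * (1.0 - frac) + 0.5           # shrinking neighbourhood
                bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
                d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
                h = np.exp(-d2 / (2.0 * sigma ** 2))          # neighbourhood weights
                W += lr * h[:, None] * (x - W)
                step += 1
        return W

    def calibrate(W, frames, features):
        """Calibration sketch: give each node the mean acoustic-phonetic
        feature vector of the manually labelled frames it wins."""
        bmu = np.argmin(((frames[:, None, :] - W[None]) ** 2).sum(axis=-1), axis=1)
        out = np.zeros((W.shape[0], features.shape[1]))
        for g in range(W.shape[0]):
            hits = bmu == g
            if hits.any():
                out[g] = features[hits].mean(axis=0)
        return out    # node index -> continuous feature values

At run time, each incoming cepstrum frame would then be mapped to
its best-matching node and read off as that node's calibrated
feature vector, ahead of the Principal-Component transform.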
The DKISALA system is presently running as a preliminary version
of an interactive system. Results have shown that certain
sound-class transitions are positioned very accurately and
reliably, while the boundaries of certain other classes are
regularly placed inaccurately.
The interactive component of the DKISALA system has been
introduced to prevent cumulative errors from accruing when
several such problematic sounds occur within a short stretch of
speech. The automatic process stops when such a sound occurs.
The speech signal, the corresponding acoustic-phonetic features
and the spectrogram are then displayed on the graphics screen,
and by using the mouse, the information on the screen and a
listening facility, the user can propose the position of the
specific boundary transition to the system, which then takes
back control and reruns the labelling procedure. This procedure
is repeated until the entire speech corpus is labelled.
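As an illustration of the control flow just described, here is a
minimal Python sketch of the interactive loop, with the automatic
labeller, the reliability flag and the user dialogue abstracted
behind callbacks; Boundary, run_labeller and ask_user are all
invented for the example and do not name actual DKISALA
components.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Boundary:
        frame: int       # frame index of the transition
        label: str       # allophone starting at the boundary
        reliable: bool   # flagged by the automatic labeller

    def label_interactively(
        run_labeller: Callable[[int, List[Boundary]], List[Boundary]],
        ask_user: Callable[[Boundary], int],
        n_frames: int,
    ) -> List[Boundary]:
        """Rerun the automatic labeller from the last confirmed boundary
        whenever an unreliable transition is met, after the user has
        positioned it on screen with mouse and listening facility."""
        confirmed: List[Boundary] = []
        start = 0
        while start < n_frames:
            proposed = run_labeller(start, confirmed)
            bad = next((b for b in proposed if not b.reliable), None)
            if bad is None:
                confirmed += proposed      # utterance fully labelled
                break
            confirmed += [b for b in proposed if b.frame < bad.frame]
            bad.frame = ask_user(bad)      # user proposes the position
            bad.reliable = True
            confirmed.append(bad)
            start = bad.frame              # system takes back control
        return confirmed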
Developing lab:
Speech Technology Centre
Institute of Electronic Systems
Aalborg University
Fredrik Bajers Vej 7
DK-9220 Aalborg, Denmark