The last difference, and the most important one, must be looked at from two
different angles. The first point to understand is that the relevant category
of the data (which determines its collection) is already inherently given in the
case of NL, but entirely unknown in the case of physically recorded speech.
The ASCII symbols of a given text are elementary categories in themselves,
and they are used directly to form syntactically analysable expressions
representing all the different linguistically relevant categories. Relevant
categorical information can thus be inferred directly from categorically
given data and their ASCII representations. In contrast to this NL situation,
the data of a digital speech signal do not signal any such categories,
because they represent only a measured time function without any inherent
categorical interpretation. At the present stage in the development of SLP it
is not yet possible even to decide automatically whether a given digital
signal is a speech signal or not. The necessary categorical annotations for
SL data must therefore still be produced by human annotators (with the
increasing support of semi-automatic procedures).
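The contrast can be made concrete with a small sketch. The following Python
fragment is only illustrative: the file name utterance.wav and the label tier
are invented, and the point is merely that a text arrives as directly
interpretable categories, while a speech recording arrives as bare amplitude
samples whose categories must be added by annotation.

```python
import wave

# NL data: each ASCII character of a text is already an elementary
# category (a letter, a digit, a punctuation mark) and can be
# interpreted directly.
text = "The cat sat."
for ch in text:
    print(repr(ch), "alphabetic" if ch.isalpha() else "non-alphabetic")

# SL data: a digital speech signal is only a sampled time function;
# the raw amplitude values carry no inherent categorical interpretation.
# ("utterance.wav" is a hypothetical recording.)
with wave.open("utterance.wav", "rb") as f:
    samples = f.readframes(f.getnframes())  # amplitude values, nothing more

# The categorical structure must therefore be supplied by human
# annotators, e.g. as a time-aligned label tier (all values invented):
annotation = [
    (0.00, 0.12, "sil"),  # leading silence
    (0.12, 0.21, "dh"),   # first segment of "the"
    (0.21, 0.29, "ax"),
]
```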
The second point to consider in judging the different roles of categories and
time functions in speech technology is that speech signals contain relevant
prosodic and paralinguistic information that is not represented in the pure
text of what was pronounced in a given utterance. As long as NLP is restricted
to the processing of written language, the restriction to NL data poses no
severe problems. But as soon as real speech utterances are to be processed in
an information technology application, these other, non-linguistic but
communicatively highly relevant categories cannot be ignored. They must be
represented in future SL data collections, and much effort still has to be
invested by the international scientific community to deal with all these
information-bearing aspects of any given speech utterance.
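One kind of such signal-borne information can be illustrated with a short
sketch. The following Python fragment is a deliberate simplification: it uses
a synthetic two-tone signal instead of real speech and a crude autocorrelation
estimator, but it shows how a fundamental-frequency (pitch) contour, a basic
prosodic category, is recoverable only from the time function and not from any
transcript of the words spoken.

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Crude autocorrelation F0 estimate for one voiced frame (sketch only)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))  # strongest periodicity in range
    return sr / lag

sr = 16000
t = np.arange(0, 0.3, 1 / sr)
# Synthetic "utterance": a 150 Hz stretch followed by a 220 Hz stretch,
# mimicking a pitch rise across two words.
signal = np.concatenate([np.sin(2 * np.pi * 150 * t),
                         np.sin(2 * np.pi * 220 * t)])

# Frame-wise F0 contour: prosodic information recoverable from the time
# function, but absent from the text of the utterance.
frame_len = 480  # 30 ms at 16 kHz
for start in range(0, len(signal) - frame_len, frame_len):
    f0 = estimate_f0(signal[start:start + frame_len], sr)
    print(f"{start / sr:5.2f} s  F0 = {f0:6.1f} Hz")
```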