
Eight main differences between collections of written and spoken language data

Traditionally, linguists and natural language processing (NLP) researchers understood language corpora to consist of written material collected from text sources that already exist and are often available in published form (novels, stage and screen plays, newspapers, manuals, etc.). In this context the term "spoken language text corpora" was used to indicate that the data are not taken from existing texts but that speech had to be written down in some orthographic or non-orthographic form in order to become part of a data collection. However, the differences (and relations) between text and speech data are far more complex than this usage suggests. There are at least eight important differences, which must not be ignored because they determine relevant properties of the resulting data collections. Future (technological) developments in Spoken Language Processing (SLP) should take them into account very seriously.

These eight differences have to do with:

the durability of text as opposed to the volatility of speech,
the different time it takes to produce text and speech,
the different roles errors play in written and spoken language,
the differences in written and spoken words,
the different data structures of ASCII strings and sampled speech signals,
the two reasons that cause the great difference in the size of NL and SL data collections,
the different legal status of written text and spoken utterances, and
the most fundamental distinction (as well as relation) between symbolically specified categories and physically measured time functions.

A closer look at these eight differences between written and spoken data will reveal why the traditional term "natural language processing" (NLP) could just as well be read as standing for "Non-spoken Language Processing". Since our goal is to call special attention to the relevant differences, we will refer to written language data as NL data, meaning non-spoken language data, and set it in opposition to SL data, the acronym for spoken language data.
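Two of the differences listed above, the different data structures and the difference in size, can be made concrete with a small sketch. It is only an illustration under stated assumptions: the sampling rate (16 kHz), the sample width (16 bit), the speaking time of 1.5 seconds, and the helper `synth_signal` are all hypothetical choices made here, and the synthetic tone merely stands in for a real recording.

```python
import math

# NL data: a symbolic record, one byte per ASCII character.
text = "the durability of text"
nl_bytes = text.encode("ascii")
nl_size = len(nl_bytes)

# SL data: a physically measured time function, stored as samples.
# All three parameters below are assumptions chosen for illustration.
SAMPLE_RATE_HZ = 16_000   # assumed sampling rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit linear samples
DURATION_S = 1.5          # assumed time needed to speak the phrase


def synth_signal(duration_s, rate_hz, freq_hz=220.0):
    """Return a sine tone as integer samples (a placeholder for recorded speech)."""
    n = int(duration_s * rate_hz)
    return [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate_hz))
        for i in range(n)
    ]


samples = synth_signal(DURATION_S, SAMPLE_RATE_HZ)
sl_size = len(samples) * BYTES_PER_SAMPLE
ratio = sl_size / nl_size

print(f"NL data: {nl_size} bytes of ASCII characters")
print(f"SL data: {len(samples)} samples, {sl_size} bytes")
print(f"size ratio SL/NL: about {ratio:.0f} to 1")
```

The NL record is a sequence of symbols drawn from a fixed character set, while the SL record is a sequence of measured amplitudes indexed by time. Under these assumptions the same phrase occupies several orders of magnitude more storage as a signal than as text.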


EAGLES SWLG SoftEdition, May 1997.