Traditionally, linguists and natural language processing (NLP) researchers understood language corpora to consist of written material collected from text sources that already exist and are often available in published form (novels, stage and screen plays, newspapers, manuals, etc.). In this context, the term ``spoken language text corpora'' was used to indicate that the data are not taken from existing texts but that speech had to be written down in some orthographic or non-orthographic form in order to become part of a data collection. However, the differences (and relations) between text and speech data are far more complex. There are at least eight important differences, which must not be ignored because they determine relevant properties of the resulting data collections. Future (technological) developments in Spoken Language Processing (SLP) should take them into account very seriously.
These eight differences have to do with:
A closer look at these eight differences between written and spoken data will reveal why the traditional term ``natural language processing'' (NLP) could equally well be read as standing for ``Non-spoken Language Processing''. As our goal is to call special attention to the relevant differences, we will refer to written language data as NL data, meaning non-spoken language data, and set it in opposition to SL data, the acronym for spoken language data.