The treatment of speech as a sequence of segments to be delimited is to some extent a convenient fiction, made necessary by the requirements of speech technology. For example, it is notoriously difficult to define the boundaries between vowels and glides, or between a vowel and a following vowel. In addition, information about the place of articulation of a consonant is usually carried by its neighbouring vowels rather than by the consonant itself. In the case of place assimilation, electropalatographic studies have shown that there is often a residual gesture towards the underlying segment [Nolan (1987)]. Hence the speech signal cannot be described, in absolute terms, as a simple string of discrete phones.
Notwithstanding the above, [Roach et al. (1990)] argue that the attempt to segment speech is valid, as many segments (especially some consonants) have very clear acoustic boundaries. Where clear acoustic boundaries do not exist in the speech signal, selecting a fairly arbitrary point is better, from the viewpoint of speech technology research, than doing no segmenting at all. Where segmented corpora are used to train HMM-based recognisers, the effect of such arbitrary decisions could be cancelled out by including a great deal of data of the problematic kind, so that the statistical models are not skewed by a single view of the boundary location.
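
The averaging argument can be illustrated with a toy simulation. The sketch below is not taken from the cited work: it assumes a hypothetical transition region of 100 to 140 ms within which the boundary has no clear acoustic location, and it assumes each labelled token receives a boundary drawn uniformly from that region. The function names and all numeric values are illustrative assumptions.

```python
import random
import statistics


def simulate_boundaries(n_tokens, region_start_ms, region_end_ms, seed=0):
    """Return one arbitrary boundary per token, drawn uniformly
    from the transition region (an assumed labelling behaviour)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.uniform(region_start_ms, region_end_ms) for _ in range(n_tokens)]


def mean_boundary(boundaries_ms):
    """Average boundary location, standing in for what a statistical
    model would estimate from the labelled tokens."""
    return statistics.fmean(boundaries_ms)


# Hypothetical vowel-glide transition with no clear acoustic boundary.
REGION_START_MS = 100.0
REGION_END_MS = 140.0
MIDPOINT_MS = (REGION_START_MS + REGION_END_MS) / 2

few = simulate_boundaries(5, REGION_START_MS, REGION_END_MS)
many = simulate_boundaries(5000, REGION_START_MS, REGION_END_MS)

# Distance of each estimate from the centre of the transition region.
print("5 tokens:   ", abs(mean_boundary(few) - MIDPOINT_MS))
print("5000 tokens:", abs(mean_boundary(many) - MIDPOINT_MS))
```

With a large number of tokens the mean boundary settles close to the centre of the region, whereas an estimate based on a handful of tokens depends heavily on where those few boundaries happened to be placed. A real corpus would not necessarily follow a uniform distribution, so this shows only the general tendency.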