Sentence syntax information

Syntactic information is required not only for parsing into syntactic structures for further semantic processing in a speech understanding system, but also in order to control the assignment of prosodic information to sentences in prosodic parsing and prosodic synthesis.

Syntactic information is defined as information about the distribution of a word in syntactic structures. This is a very common, indeed "classical", but specialised use of the words "syntax" and "syntactic" to pertain to phrasal syntax, i.e. the structure of sentences. Other, more general uses of the terms for linguistic units which are larger or smaller than sentences are increasingly encountered, such as "dialogue syntax" and "word syntax" (for morphotactics within morphology).

Within this classical usage, the term syntax is sometimes opposed to the term lexicon; the term grammar is sometimes used to mean syntax, but sometimes includes both phrasal syntax and the lexicon.

Strictly speaking, a stochastic language model is a probabilistic sentence syntax, since it defines the distribution of words in syntactic structures. However, the notion of syntactic structure used is often rather elementary, consisting of a short fixed-length substring or window over word strings, with length two (bigram) or three (trigram). Such a model is also used with quite a different function from the classical combination of sentence syntax and sentence parser.
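The fixed-length window can be illustrated with a minimal bigram sketch. The toy corpus, the boundary markers `<s>` and `</s>`, and the function names are assumptions made for this illustration; the estimate is plain relative frequency without smoothing, which a practical model would not use as it stands.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(w2 | w1) by relative frequency over adjacent word pairs."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Boundary markers (an assumption of this sketch) let the model
        # cover sentence-initial and sentence-final positions.
        words = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


def sentence_probability(model, sentence):
    """Multiply bigram probabilities; unseen bigrams count as 0 (no smoothing)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for pair in zip(words, words[1:]):
        probability *= model.get(pair, 0.0)
    return probability


# Hypothetical toy corpus, for illustration only.
toy_corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
bigram_model = train_bigram_model(toy_corpus)
```

In this sketch the only "structure" the model sees is the pair of adjacent words: a word string containing a pair that never occurred in the training data receives probability zero, however well-formed it would be for a classical sentence syntax.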

Sentence syntax defines the structure of a (generally unlimited) set of sentences. Syntactic lexical information is traditionally divided into information about paradigmatic (classificatory; disjunctive; element-class, subclass-superclass) and syntagmatic (compositional; conjunctive; part-whole) relations.

The informal definitions of these terms in linguistics textbooks are often unclear, metaphorical and inconsistent. For instance, temporally parallel information about the constitution of phonemes in terms of distinctive features is sometimes regarded as paradigmatic (since features may be seen as intensional characterisations of a class of phonemes) and sometimes as syntagmatic (since the phonetic events corresponding to features occur together to constitute a phoneme as a larger whole). The relation here is analogous to the relation between intonation and sentences, which are also temporally parallel, and which are in fact treated in an identical fashion in contemporary computational phonology. From a formal point of view, this is purely a matter of perspective: the internal structure of a unit (syntagmatic relations between parts of the unit) may be seen as a property of the unit (paradigmatic relation of similarity between the whole unit and other units). In lexical knowledge bases for spoken language systems it is crucial to keep questions of syntagmatic distribution and questions of paradigmatic similarity apart as two distinct and complementary aspects of structure.
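The two perspectives on distinctive features can be sketched over a single table. The three-phoneme inventory and the function names below are assumptions chosen for illustration, not a proposed lexicon format.

```python
# Hypothetical three-phoneme inventory with standard articulatory features.
PHONEME_FEATURES = {
    "p": frozenset({"bilabial", "plosive", "voiceless"}),
    "b": frozenset({"bilabial", "plosive", "voiced"}),
    "m": frozenset({"bilabial", "nasal", "voiced"}),
}


def parts_of(phoneme):
    """Syntagmatic view: the features that co-occur to make up one phoneme."""
    return PHONEME_FEATURES[phoneme]


def class_of(feature):
    """Paradigmatic view: the class of phonemes that a feature characterises."""
    return {p for p, features in PHONEME_FEATURES.items() if feature in features}
```

Both functions read the same data: `parts_of("p")` lists the parts of a whole, while `class_of("voiced")` returns the set of units that are similar in one respect. Keeping them as separate queries is one way of keeping the two aspects of structure apart.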

The part of speech (POS, word class, or category) is the most elementary type of syntactic information. One traditional set of word classes consists of the following: Noun or Substantive, Pronoun, Verb, Adverb, Adjective, Article, Preposition, Conjunction, Interjection. POS classifications are used for tagging written corpora (texts or transcriptions), for the purpose of information retrieval or for the training of class-based statistical language models (Chapter 7); fairly standard POS tagsets have been defined for a number of taggers (automatic tagging software; see the results of the EAGLES Working Group on machine readable corpora).

Two main groups of POS category are generally identified:

Lexical categories are the open classes which may be extended by word formation: Noun, Verb, Adjective, Adverb.
Grammatical categories are the closed classes which express syntactic and indexical relations: Pronoun and Article (anaphoric and deictic relations), Preposition (spatial, temporal, personal relations etc.), Conjunction (propositional relations), Interjection (dialogue relations).
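The two groups can be sketched as a simple lookup. The category sets follow the lists above; the toy lexicon and the function names are hypothetical and serve only as an illustration.

```python
# Category groups as listed in the text.
LEXICAL_CATEGORIES = {"Noun", "Verb", "Adjective", "Adverb"}
GRAMMATICAL_CATEGORIES = {
    "Pronoun", "Article", "Preposition", "Conjunction", "Interjection",
}

# Hypothetical toy lexicon mapping word forms to a single POS each.
TOY_LEXICON = {
    "the": "Article",
    "dog": "Noun",
    "barks": "Verb",
    "at": "Preposition",
    "me": "Pronoun",
}


def category_group(pos):
    """Return 'lexical' for open classes and 'grammatical' for closed classes."""
    if pos in LEXICAL_CATEGORIES:
        return "lexical"
    if pos in GRAMMATICAL_CATEGORIES:
        return "grammatical"
    raise ValueError(f"unknown part of speech: {pos}")


def group_words(sentence):
    """Pair each word of a sentence with the group of its POS category."""
    return [(word, category_group(TOY_LEXICON[word])) for word in sentence.split()]
```

For example, `group_words("the dog barks at me")` marks "dog" and "barks" as lexical and the remaining words as grammatical. A real lexicon would have to allow several POS values per word form, which this sketch deliberately ignores.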

The granularity of classification can be reduced by grouping classes together (this particular binary division is relevant for defining stress patterns, for example) or increased by defining subcategories based on the complements (object, indirect object, prepositional or sentential object, etc.) of words (in various terminologies: their valency or subcategorisation frames, case frames, transitivity properties). For further information, introductory texts on syntax, e.g. [Sells (1985)] or [Radford (1988)], may be consulted.
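Subcategorisation by complements can be sketched as a table of frames per verb. The frame inventory, the complement labels (`NP`, `PP`, `S`) and the function name are illustrative assumptions, not a recommended tagset.

```python
# Hypothetical subcategorisation frames: each verb lists the complement
# sequences it allows. Labels are assumptions of this sketch:
# NP = noun phrase, PP = prepositional phrase, S = sentential complement.
SUBCAT_FRAMES = {
    "sleep": [()],                         # no complement (intransitive)
    "see": [("NP",)],                      # direct object (transitive)
    "give": [("NP", "NP"), ("NP", "PP")],  # two objects, or object plus PP
    "say": [("S",)],                       # sentential object
}


def licenses(verb, complements):
    """Check whether the verb has a frame matching the complement sequence."""
    return tuple(complements) in SUBCAT_FRAMES.get(verb, [])
```

Under these assumptions `licenses("give", ["NP", "PP"])` holds while `licenses("sleep", ["NP"])` does not, so the single class Verb is split into finer subclasses by the frames its members accept.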

In theoretical and computational linguistics, grammars are classified in terms of the Chomsky hierarchy of formal languages which they generate (i.e. define), and are often represented as equivalent automata. Some aspects of this classification are discussed in connection with stochastic language models in Chapter 7. For further information, the standard computer science literature on compiler construction can be consulted.
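The correspondence between grammar classes and automata can be illustrated with two small recognisers. Both example languages, the state names and the function names are assumptions of this sketch: the first language (an article, any number of adjectives, then a noun) is regular and needs only a finite-state automaton, while the second (n occurrences of "a" followed by n occurrences of "b") is context-free but not regular and needs unbounded memory, here a counter standing in for a stack.

```python
def accepts_noun_phrase(tags):
    """Finite-state recogniser for the tag pattern: Article Adjective* Noun."""
    transitions = {
        ("start", "Article"): "after_article",
        ("after_article", "Adjective"): "after_article",
        ("after_article", "Noun"): "accept",
    }
    state = "start"
    for tag in tags:
        state = transitions.get((state, tag))
        if state is None:  # no transition defined: reject
            return False
    return state == "accept"


def accepts_anbn(symbols):
    """Recogniser for a^n b^n (n >= 1); the counter acts as a one-symbol stack."""
    depth = 0
    seen_b = False
    for symbol in symbols:
        if symbol == "a":
            if seen_b:  # an "a" after a "b" breaks the pattern
                return False
            depth += 1
        elif symbol == "b":
            seen_b = True
            depth -= 1
            if depth < 0:  # more "b" than "a" so far
                return False
        else:
            return False
    return seen_b and depth == 0
```

The first recogniser uses a fixed number of states regardless of input length; the second must count arbitrarily high, which is what places its language above the regular languages in the hierarchy.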

EAGLES SWLG SoftEdition, May 1997.