Experimental results

Next: Recommendations: m-gram language models Up: Multilevel smoothing for trigram Previous: Cache

Experimental results

To illustrate some of the issues in language modelling, we discuss some experimental results [Rosenfeld (1994), Generet et al. (1995)]. The results were obtained for a subset of the Wall Street Journal (WSJ) corpus. The vocabulary consisted of the (approximately) 20000 most frequent words. In addition, there were two non-speech words. First, each out-of-vocabulary word was replaced by a symbol for unknown word. Second, to mark the sentence end, a symbol for sentence boundary was added. There were three different training sets with 1, 5 and 39 million words and a separate test set of 0.325 million words as shown in Table 7.2.

words sentences

train-1 972 868 41 156

train-5 4 513 716 189 678

train-39 38 532 517 1 611 572

test 324 655 13 542

Table 7.2: Number of words and sentences in training and test data (vocabulary: about 20000 words)

**Table 7.2:** Number of words and sentences in training and test data (vocabulary: about 20000 words)
	words	sentences
train-1	972 868	41 156
train-5	4 513 716	189 678
train-39	38 532 517	1 611 572
test	324 655	13 542

distinct bigrams bigram singletons

train-1 303 858 211 105

train-5 881 263 566 093

train-39 3 500 636 2 046 462

distinct trigrams trigram singletons

train-1 648 482 556 185

train-5 2 420 168 1 990 507

train-39 14 096 109 10 907 373

Table 7.3: Number of distinct and of singleton events for bigrams and trigrams

**Table 7.3:** Number of distinct and of singleton events for bigrams and trigrams
	distinct bigrams	bigram singletons
train-1	303 858	211 105
train-5	881 263	566 093
train-39	3 500 636	2 046 462
	distinct trigrams	trigram singletons
train-1	648 482	556 185
train-5	2 420 168	1 990 507
train-39	14 096 109	10 907 373

Table 7.3 summarises some numbers from which we can estimate the coverage for the three training sets . This table gives the number of different bigrams and the number of singleton bigrams in training . The same numbers are also given for trigrams. As mentioned in the context of linear discounting , we can use these numbers to estimate the probability for new unseen trigrams. We obtain a probability of 0.57, 0.44 and 0.28 for the training sets of 1, 5 and 39 million words, respectively.

Size of training corpus 1 Mio 5 Mio 39 Mio

A) absolute discounting and interpolation [Generet et al. (1995)]

bigram model (with singletons) 288 217 168

trigram model 250 167 105

+ singleton 222 150 97

+ unigram cache 191 133 90

+ bi-/unigram cache 181 128 87

+ singleton + bi-/unigram cache 173 124 85

B) Katz' backing-off [Rosenfeld (1994)]

trigram model 269 173 105

+ bi-/unigram cache 193 133 88

+ bi-/unigram cache + maximum entropy 163 108 71

Table 7.4: Perplexities for different language models

**Table 7.4:** Perplexities for different language models
Size of training corpus	1 Mio	5 Mio	39 Mio
A) absolute discounting and interpolation [Generet et al. (1995)]
bigram model (with singletons)	288	217	168
trigram model	250	167	105
+ singleton	222	150	97
+ unigram cache	191	133	90
+ bi-/unigram cache	181	128	87
+ singleton + bi-/unigram cache	173	124	85
B) Katz' backing-off [Rosenfeld (1994)]
trigram model	269	173	105
+ bi-/unigram cache	193	133	88
+ bi-/unigram cache + maximum entropy	163	108	71

The perplexities for different conditions are summarised in Table 7.4. Table 7.4 consists of two parts, namely A) and B), for which the results are reported in [Generet et al. (1995)] and [Rosenfeld (1994)], respectively. In either part, for each language model test, there are three perplexities, namely for the three training sets of 1, 5 and 39 million words, so that the influence of the size of the training set on the perplexity can be seen immediately. Unfortunately, a direct comparison of the perplexities reported for the two parts of Table 7.4 is difficult for two reasons. First, due to small differences in selecting the articles from the Wall Street Journal corpus , the corpora used in the two parts of Table 7.4 are not completely identical. Second, the unknown word may be handled differently in the perplexity calculations.

For part A) of Table 7.4, the methods have been described in this chapter. The baseline method was absolute discounting with interpolation ; the discounting parameters were history independent. The baseline trigram model was combined with extensions like the singleton backing-off distribution, and the cache model, which was tested in two variants, namely at the unigram level and at the combined unigram /bigram level. For comparison purposes, the baseline trigram language model was also compared with a bigram language model.

Considering part A) of Table 7.4, we can see:

The improvement obtained by a trigram model over a bigram model is as follows (for singleton backing-off ): the perplexity goes down from 168 to 97 for the 39-million training corpus whereas for the 1-million corpus the improvement in perplexity is only from 288 to 222. So in both cases there is an absolute improvement of the perplexity of about 60.
The singleton backing-off distributions help to improve the perplexity ; the improvement is the higher the smaller the size of the training corpus is.
In all tests, the cache model results in a significant improvement, and the addition of the bigram cache always results in an improvement over the pure unigram cache.
The lowest perplexities were obtained for a combination of singleton distributions and the unigram /bigram cache. There is a significant improvement by just increasing the amount of training data : the perplexity is halved by increasing the training corpus from 1 million to 39 million words.

For the CMU results obtained by [Rosenfeld (1994)], the trigram model was based on Katz' backing-off [Katz (1987)], which uses the Turing-Good formula [Good (1953)]. To reduce the memory requirements, the trigram singletons in the training data were omitted. The trigram model was combined with the two cache variants (unigram cache and bigram cache) and the maximum entropy model by linear interpolation .

When looking at the perplexities in part B) of Table 7.4, we see that they are better than those shown in part A) only in the case that the maximum entropy model is added. The maximum entropy model has two characteristic aspects [Rosenfeld (1994)]:

The training is based on a criterion other than maximum likelihood, namely on the so-called maximum entropy principle.
In addition to the usual bigram and trigram dependencies, the language model is designed to model distant bigrams and trigrams and word trigger pairs. A distant m-gram model allows for an additional insertion before the word under consideration. For example, the word tuple (v,x,w) with an arbitrary word x defines a distant bigram (v,*,w). Word trigger pairs are used to capture the dependence between selected word pairs over a longer distance [Rosenfeld (1994)]. Examples of such trigger word pairs are government - parliament and Europe - Brussels.

In comparison with the conventional trigram language model, the maximum entropy requires a much higher cost in terms of programming and CPU time for training . There is an improvement in the order of 20%. This observation is in agreement with the general experimental experience with other language model techniques: To achieve an improvement over a baseline model like the trigram model in combination with the cache, a lot of effort is required, and even then the improvement may be small. For more details on word triggers and maximum entropy, see [Bahl et al. (1984)], [Rosenfeld (1994)], [Lau et al. (1993)], respectively.

To study the dependence of perplexity on the discounting parameters, experimental tests were carried out by [Generet et al. (1995)]. Figure 7.3 and Figure 7.4 show the perplexity as a function of the discounting parameters.

Figure 7.3: Perplexity as a function of b for absolute discounting with backing-off

Figure 7.4: Perplexity as a function of for linear discounting with backing-off

In all cases, the training was based on the 5-million corpus. For both figures, there were three conditions under which the perplexity was measured:

training data without leaving-one-out : Obviously, this condition results in a much too optimistic estimate of the perplexity as can be seen from the comparison with the perplexities of the other two conditions.
training data with leaving-one-out (LOO): This is the typical training condition, which we have described in this chapter. The perplexities are much higher than without leaving-one-out and already close to the perplexities of the test data .
test data: This condition is the default condition for measuring a corpus perplexity , i.e. the test data are not used to train the language model. It can be seen that the perplexities are somewhat higher than the leaving-one-out perplexities. But still, the difference is not too big, which demonstrates the viability of the leaving-one-out concept.

In other words, the last two conditions correspond to the cross-validation concept: either we create a test set from the training data by leaving-one-out or we are given a completely separate set of test data .

When comparing the two types of smoothing, we can see that the perplexity curve for absolute discounting has a very flat minimum for the two cross-validation conditions. This shows that the choice of the parameter b for absolute discounting is not critical. The perplexity curve for linear discounting shows a different behaviour in that the minimum is more distinct. The optimal perplexity for linear discounting is significantly higher than the optimal perplexity for absolute discounting. However, we have to remember that the linear discounting model here is based on history independent discounting parameters, and it is a well known fact that history dependence is important in the case of linear discounting [Jelinek & Mercer (1980)].

Next: Recommendations: m-gram language models Up: Multilevel smoothing for trigram Previous: Cache

EAGLES SWLG SoftEdition, May 1997. Get the book...