
Language Modeling

Michael Collins, Columbia University

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods

The Language Modeling Problem


We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, ...}

We have an (infinite) set of strings, V†:

    the STOP
    a STOP
    the fan STOP
    the fan saw Beckham STOP
    the fan saw saw STOP
    the fan saw Beckham play for Real Madrid STOP
    ...

The Language Modeling Problem (Continued)


We have a training sample of example sentences in English.

We need to learn a probability distribution p, i.e., p is a function that satisfies

\[
\sum_{x \in V^\dagger} p(x) = 1, \qquad p(x) \geq 0 \text{ for all } x \in V^\dagger
\]

For example:

    p(the STOP) = 10^{-12}
    p(the fan STOP) = 10^{-8}
    p(the fan saw Beckham STOP) = 2 × 10^{-8}
    p(the fan saw saw STOP) = 10^{-15}
    ...
    p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^{-9}
    ...
Why on earth would we want to do this?!


- Speech recognition was the original motivation. (Related problems are optical character recognition and handwriting recognition.)
- The estimation techniques developed for this problem will be VERY useful for other problems in NLP.

A Naive Method

- We have N training sentences.
- For any sentence x_1 ... x_n, c(x_1 ... x_n) is the number of times the sentence is seen in our training data.
- A naive estimate:

\[
p(x_1 \ldots x_n) = \frac{c(x_1 \ldots x_n)}{N}
\]
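
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this naive estimator. Its obvious weakness is that any sentence not seen verbatim in training receives probability zero:

```python
from collections import Counter

def naive_model(training_sentences):
    """Estimate p(sentence) = c(sentence) / N over whole sentences."""
    counts = Counter(tuple(s) for s in training_sentences)
    n = len(training_sentences)
    return lambda sentence: counts[tuple(sentence)] / n

train = [["the", "fan", "saw", "Beckham", "STOP"],
         ["the", "fan", "saw", "Beckham", "STOP"],
         ["the", "man", "saw", "Beckham", "STOP"]]
p = naive_model(train)
print(p(["the", "fan", "saw", "Beckham", "STOP"]))  # 2/3
print(p(["the", "fan", "saw", "saw", "STOP"]))      # 0.0 -- unseen sentence
```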

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods

Markov Processes

- Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
- Our goal: model

\[
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
\]

First-Order Markov Processes

\[
\begin{aligned}
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
&= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) \\
&= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})
\end{aligned}
\]

The first-order Markov assumption: for any i ∈ {2 ... n}, for any x_1 ... x_i,

\[
P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_i = x_i \mid X_{i-1} = x_{i-1})
\]

Second-Order Markov Processes

\[
\begin{aligned}
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
&= P(X_1 = x_1) \times P(X_2 = x_2 \mid X_1 = x_1) \times \prod_{i=3}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) \\
&= \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
\end{aligned}
\]

(For convenience we assume x_0 = x_{-1} = *, where * is a special start symbol.)

Modeling Variable Length Sequences


- We would like the length of the sequence, n, to also be a random variable.
- A simple solution: always define X_n = STOP, where STOP is a special symbol.
- Then use a Markov process as before:

\[
P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
\]

(For convenience we assume x_0 = x_{-1} = *, where * is a special start symbol.)
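
To make the generative story concrete, here is a small Python sketch (not from the slides) that samples a sequence from a second-order Markov process until STOP is generated; sample_next is a hypothetical stand-in for whatever conditional distribution P(X_i | X_{i-2}, X_{i-1}) one has:

```python
import random

def generate_sequence(sample_next, max_len=100):
    """Sample x_1, x_2, ... from a second-order Markov process, stopping at STOP.
    sample_next(u, v) should draw a word from P(X_i = . | X_{i-2} = u, X_{i-1} = v)."""
    sequence = []
    u, v = "*", "*"              # x_{-1} = x_0 = *, the start symbols
    for _ in range(max_len):
        w = sample_next(u, v)
        sequence.append(w)
        if w == "STOP":
            break
        u, v = v, w
    return sequence

# A toy conditional distribution, purely for illustration.
def toy_sample_next(u, v):
    if v == "*":
        return "the"
    return random.choice(["fan", "saw", "Beckham", "STOP"])

print(generate_sequence(toy_sample_next))
```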

Trigram Language Models


A trigram language model consists of:

1. A finite set V
2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}.

For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... (n-1), and x_n = STOP, the probability of the sentence under the trigram language model is

\[
p(x_1 \ldots x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})
\]

where we define x_0 = x_{-1} = *.

An Example

For the sentence

    the dog barks STOP

we would have

\[
p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \times q(\text{dog} \mid *, \text{the}) \times q(\text{barks} \mid \text{the}, \text{dog}) \times q(\text{STOP} \mid \text{dog}, \text{barks})
\]
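
A minimal Python sketch (added here, not from the slides) that computes a sentence probability under a trigram model; the q values below are made-up numbers purely for illustration:

```python
def sentence_probability(q, sentence):
    """p(x_1 ... x_n) = product of q(x_i | x_{i-2}, x_{i-1}), with x_{-1} = x_0 = *.
    q(w, u, v) is assumed to return the trigram parameter q(w | u, v)."""
    prob = 1.0
    u, v = "*", "*"
    for w in sentence:            # the sentence must end in "STOP"
        prob *= q(w, u, v)
        u, v = v, w
    return prob

# Toy parameters matching the example above (hypothetical values).
toy_q = {
    ("the", "*", "*"): 0.1,
    ("dog", "*", "the"): 0.05,
    ("barks", "the", "dog"): 0.2,
    ("STOP", "dog", "barks"): 0.5,
}
q = lambda w, u, v: toy_q.get((w, u, v), 0.0)
print(sentence_probability(q, ["the", "dog", "barks", "STOP"]))  # 0.1*0.05*0.2*0.5
```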

The Trigram Estimation Problem


- Remaining estimation problem:

\[
q(w_i \mid w_{i-2}, w_{i-1})
\]

  For example: q(laughs | the, dog)

- A natural estimate (the maximum likelihood estimate):

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})},
\qquad
q(\text{laughs} \mid \text{the, dog}) = \frac{\text{Count}(\text{the, dog, laughs})}{\text{Count}(\text{the, dog})}
\]
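
To make the counting concrete, a short Python sketch (my addition, not from the slides) of the maximum-likelihood trigram estimator; sentences are assumed to be word lists ending in STOP, with histories padded by *:

```python
from collections import defaultdict

def trigram_mle(sentences):
    """Maximum-likelihood trigram estimates q(w | u, v) = Count(u, v, w) / Count(u, v)."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sentence in sentences:
        words = ["*", "*"] + sentence
        for i in range(2, len(words)):
            u, v, w = words[i - 2], words[i - 1], words[i]
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1
    def q(w, u, v):
        if bigram_counts.get((u, v), 0) == 0:
            return 0.0
        return trigram_counts.get((u, v, w), 0) / bigram_counts[(u, v)]
    return q

q = trigram_mle([["the", "dog", "barks", "STOP"], ["the", "dog", "laughs", "STOP"]])
print(q("laughs", "the", "dog"))  # Count(the, dog, laughs) / Count(the, dog) = 1/2
```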

Sparse Data Problems


- A natural estimate (the maximum likelihood estimate):

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})},
\qquad
q(\text{laughs} \mid \text{the, dog}) = \frac{\text{Count}(\text{the, dog, laughs})}{\text{Count}(\text{the, dog})}
\]

- Say our vocabulary size is N = |V|; then there are N^3 parameters in the model. E.g., N = 20,000 gives 20,000^3 = 8 × 10^{12} parameters.

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods

Evaluating a Language Model: Perplexity


- We have some test data, m sentences s_1, s_2, s_3, ..., s_m.
- We could look at the probability under our model, \prod_{i=1}^{m} p(s_i). Or more conveniently, the log probability:

\[
\log \prod_{i=1}^{m} p(s_i) = \sum_{i=1}^{m} \log p(s_i)
\]

- In fact the usual evaluation measure is perplexity:

\[
\text{Perplexity} = 2^{-l} \quad \text{where} \quad l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)
\]

  and M is the total number of words in the test data.
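
A small Python sketch (added for illustration) of the perplexity computation; log2_prob_fn is a hypothetical callable returning log base 2 of a sentence's probability under the model:

```python
import math

def perplexity(log2_prob_fn, test_sentences):
    """Perplexity = 2^{-l}, with l = (1/M) * sum_i log2 p(s_i) and M the total word count."""
    total_log2_prob = sum(log2_prob_fn(s) for s in test_sentences)
    M = sum(len(s) for s in test_sentences)   # counts the STOP token of each sentence
    l = total_log2_prob / M
    return 2 ** (-l)

# Toy example: a model that assigns each word probability 1/4,
# so a sentence of length n has probability (1/4)^n.
uniform_log2 = lambda s: len(s) * math.log2(1.0 / 4)
print(perplexity(uniform_log2, [["the", "dog", "barks", "STOP"]]))  # 4.0
```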

Some Intuition about Perplexity


- Say we have a vocabulary V, with N = |V| + 1, and a model that predicts

\[
q(w \mid u, v) = \frac{1}{N}
\]

  for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}.

- It is easy to calculate the perplexity in this case:

\[
\text{Perplexity} = 2^{-l} \quad \text{where} \quad l = \log_2 \frac{1}{N}
\]

  so Perplexity = N.

- Perplexity is a measure of the effective "branching factor" of the model.

Typical Values of Perplexity


Results from Goodman ("A bit of progress in language modeling"), where |V| = 50,000:

- A trigram model, p(x_1 ... x_n) = \prod_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}): Perplexity = 74
- A bigram model, p(x_1 ... x_n) = \prod_{i=1}^{n} q(x_i | x_{i-1}): Perplexity = 137
- A unigram model, p(x_1 ... x_n) = \prod_{i=1}^{n} q(x_i): Perplexity = 955
Some History

Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?

    C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50-64, 1951.

Some History
Chomsky (in Syntactic Structures (1957)):
"Second, the notion 'grammatical' cannot be identified with 'meaningful' or 'significant' in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.

    (1) Colorless green ideas sleep furiously.
    (2) Furiously sleep ideas green colorless.

. . . Third, the notion 'grammatical in English' cannot be identified in any way with the notion 'high order of statistical approximation to English'. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . ."

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods

Sparse Data Problems


- A natural estimate (the maximum likelihood estimate):

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})},
\qquad
q(\text{laughs} \mid \text{the, dog}) = \frac{\text{Count}(\text{the, dog, laughs})}{\text{Count}(\text{the, dog})}
\]

- Say our vocabulary size is N = |V|; then there are N^3 parameters in the model. E.g., N = 20,000 gives 20,000^3 = 8 × 10^{12} parameters.

The Bias-Variance Trade-Off


- Trigram maximum-likelihood estimate:

\[
q_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}
\]

- Bigram maximum-likelihood estimate:

\[
q_{ML}(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
\]

- Unigram maximum-likelihood estimate:

\[
q_{ML}(w_i) = \frac{\text{Count}(w_i)}{\text{Count}()}
\]

Linear Interpolation
Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \, q_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \, q_{ML}(w_i \mid w_{i-1}) + \lambda_3 \, q_{ML}(w_i)
\]

where \lambda_1 + \lambda_2 + \lambda_3 = 1, and \lambda_i \geq 0 for all i.
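
A minimal Python sketch (not from the slides) of the interpolated estimate, assuming the three maximum-likelihood estimators are already available as functions:

```python
def interpolated_q(q_ml_trigram, q_ml_bigram, q_ml_unigram, lambdas):
    """Linear interpolation of trigram, bigram, and unigram ML estimates.
    lambdas = (l1, l2, l3) must be non-negative and sum to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    def q(w, u, v):
        return (l1 * q_ml_trigram(w, u, v)
                + l2 * q_ml_bigram(w, v)
                + l3 * q_ml_unigram(w))
    return q

# Example usage with toy ML estimators (stand-ins for count-based estimates):
q = interpolated_q(lambda w, u, v: 0.0,      # the trigram was never seen
                   lambda w, v: 0.2,
                   lambda w: 0.01,
                   (0.5, 0.3, 0.2))
print(q("laughs", "the", "dog"))  # 0.5*0.0 + 0.3*0.2 + 0.2*0.01 = 0.062
```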

Linear Interpolation (continued)


Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

\[
\begin{aligned}
\sum_{w \in V'} q(w \mid u, v)
&= \sum_{w \in V'} \left[ \lambda_1 \, q_{ML}(w \mid u, v) + \lambda_2 \, q_{ML}(w \mid v) + \lambda_3 \, q_{ML}(w) \right] \\
&= \lambda_1 \sum_{w \in V'} q_{ML}(w \mid u, v) + \lambda_2 \sum_{w \in V'} q_{ML}(w \mid v) + \lambda_3 \sum_{w \in V'} q_{ML}(w) \\
&= \lambda_1 + \lambda_2 + \lambda_3 \\
&= 1
\end{aligned}
\]

(Can also show that q(w | u, v) ≥ 0 for all w ∈ V′.)

How to estimate the λ values?

- Hold out part of the training set as "validation" data.
- Define c'(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen in the validation set.
- Choose \lambda_1, \lambda_2, \lambda_3 to maximize

\[
L(\lambda_1, \lambda_2, \lambda_3) = \sum_{w_1, w_2, w_3} c'(w_1, w_2, w_3) \log q(w_3 \mid w_1, w_2)
\]

  such that \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \geq 0 for all i, where

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \, q_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \, q_{ML}(w_i \mid w_{i-1}) + \lambda_3 \, q_{ML}(w_i)
\]
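The slides do not prescribe a particular optimization procedure for the λ's. One simple (if crude) possibility is a grid search over the probability simplex, sketched below in Python; in practice an EM-style or numerical optimizer would typically be used instead:

```python
import math
from itertools import product

def choose_lambdas(validation_trigram_counts, q_ml_trigram, q_ml_bigram, q_ml_unigram, step=0.1):
    """Grid search over (l1, l2, l3) on the simplex, maximizing
    L = sum over validation trigrams of c'(w1, w2, w3) * log q(w3 | w1, w2)."""
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for (w1, w2, w3), count in validation_trigram_counts.items():
            q = (l1 * q_ml_trigram(w3, w1, w2)
                 + l2 * q_ml_bigram(w3, w2)
                 + l3 * q_ml_unigram(w3))
            if q <= 0.0:
                ll = -math.inf    # this lambda setting assigns zero probability somewhere
                break
            ll += count * math.log(q)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```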

Allowing the λ's to vary

- Take a function Π(w_{i-2}, w_{i-1}) that partitions histories, e.g.,

\[
\Pi(w_{i-2}, w_{i-1}) =
\begin{cases}
1 & \text{if Count}(w_{i-1}, w_{i-2}) = 0 \\
2 & \text{if } 1 \leq \text{Count}(w_{i-1}, w_{i-2}) \leq 2 \\
3 & \text{if } 3 \leq \text{Count}(w_{i-1}, w_{i-2}) \leq 5 \\
4 & \text{otherwise}
\end{cases}
\]

- Introduce a dependence of the λ's on the partition:

\[
q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} \, q_{ML}(w_i \mid w_{i-2}, w_{i-1})
+ \lambda_2^{\Pi(w_{i-2}, w_{i-1})} \, q_{ML}(w_i \mid w_{i-1})
+ \lambda_3^{\Pi(w_{i-2}, w_{i-1})} \, q_{ML}(w_i)
\]

  where \lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1, and \lambda_i^{\Pi(w_{i-2}, w_{i-1})} \geq 0 for all i.
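
As an illustration (my sketch, not from the slides), the bucketing function and the bucket-dependent interpolation can be written as follows; the estimator argument names are assumptions matching the earlier sketches:

```python
def history_bucket(bigram_count):
    """Partition histories (w_{i-2}, w_{i-1}) by how often the history bigram was seen."""
    if bigram_count == 0:
        return 1
    if 1 <= bigram_count <= 2:
        return 2
    if 3 <= bigram_count <= 5:
        return 3
    return 4

def bucketed_q(bigram_counts, lambdas_by_bucket, q_ml_trigram, q_ml_bigram, q_ml_unigram):
    """Interpolated estimate whose lambdas depend on the bucket of the history (u, v).
    lambdas_by_bucket maps bucket id -> (l1, l2, l3) with l1 + l2 + l3 = 1."""
    def q(w, u, v):
        l1, l2, l3 = lambdas_by_bucket[history_bucket(bigram_counts.get((u, v), 0))]
        return l1 * q_ml_trigram(w, u, v) + l2 * q_ml_bigram(w, v) + l3 * q_ml_unigram(w)
    return q
```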

Overview

- The language modeling problem
- Trigram models
- Evaluating language models: perplexity
- Estimation techniques:
  - Linear interpolation
  - Discounting methods

Discounting Methods
Say we've seen the following counts:

    x                 Count(x)    q_ML(w_i | w_{i-1})
    the               48
    the, dog          15          15/48
    the, woman        11          11/48
    the, man          10          10/48
    the, park          5           5/48
    the, job           2           2/48
    the, telescope     1           1/48
    the, manual        1           1/48
    the, afternoon     1           1/48
    the, country       1           1/48
    the, street        1           1/48

The maximum-likelihood estimates are high (particularly for low-count items).

Discounting Methods
Now define "discounted" counts, Count*(x) = Count(x) - 0.5. New estimates:

    x                 Count(x)    Count*(x)    Count*(x) / Count(the)
    the               48
    the, dog          15          14.5         14.5/48
    the, woman        11          10.5         10.5/48
    the, man          10           9.5          9.5/48
    the, park          5           4.5          4.5/48
    the, job           2           1.5          1.5/48
    the, telescope     1           0.5          0.5/48
    the, manual        1           0.5          0.5/48
    the, afternoon     1           0.5          0.5/48
    the, country       1           0.5          0.5/48
    the, street        1           0.5          0.5/48

Discounting Methods (Continued)


We now have some "missing" probability mass:

\[
\alpha(w_{i-1}) = 1 - \sum_{w} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}
\]

e.g., in our example, α(the) = 10 × 0.5 / 48 = 5/48.
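
A short Python sketch (added for illustration) of discounted bigram estimates and the missing mass α, reproducing the 5/48 figure from the example above:

```python
def discounted_estimates(bigram_counts, unigram_counts, discount=0.5):
    """Discounted bigram estimates Count*(v, w) / Count(v) and the left-over
    ('missing') mass alpha(v) = 1 - sum_w Count*(v, w) / Count(v)."""
    q_discounted = {}
    alpha = {}
    for v, context_count in unigram_counts.items():
        mass_used = 0.0
        for (v2, w), count in bigram_counts.items():
            if v2 != v:
                continue
            q_discounted[(v, w)] = (count - discount) / context_count
            mass_used += q_discounted[(v, w)]
        alpha[v] = 1.0 - mass_used
    return q_discounted, alpha

bigrams = {("the", "dog"): 15, ("the", "woman"): 11, ("the", "man"): 10,
           ("the", "park"): 5, ("the", "job"): 2, ("the", "telescope"): 1,
           ("the", "manual"): 1, ("the", "afternoon"): 1, ("the", "country"): 1,
           ("the", "street"): 1}
q_star, alpha = discounted_estimates(bigrams, {"the": 48})
print(alpha["the"])  # 10 * 0.5 / 48 = 5/48 ≈ 0.104
```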

Katz Back-Off Models (Bigrams)


For a bigram model, define two sets:

\[
A(w_{i-1}) = \{ w : \text{Count}(w_{i-1}, w) > 0 \}, \qquad
B(w_{i-1}) = \{ w : \text{Count}(w_{i-1}, w) = 0 \}
\]

A bigram model:

\[
q_{BO}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{\text{Count}^*(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\[2ex]
\alpha(w_{i-1}) \, \dfrac{q_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} q_{ML}(w)} & \text{if } w_i \in B(w_{i-1})
\end{cases}
\]

where

\[
\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}
\]
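
A rough Python sketch (my addition, not from the slides) of the bigram back-off estimate; it assumes the context word v has been seen at least once and that at least one word with nonzero unigram count is unseen after v:

```python
def katz_bigram(bigram_counts, unigram_counts, total_words, discount=0.5):
    """Katz back-off bigram estimate q_BO(w | v): discounted ML estimates for seen
    bigrams; the missing mass alpha(v) is spread over unseen words in proportion
    to their unigram ML estimates."""
    def q_ml_unigram(w):
        return unigram_counts.get(w, 0) / total_words

    def q_bo(w, v):
        count_v = unigram_counts[v]
        seen = {w2 for (v2, w2) in bigram_counts if v2 == v}   # the set A(v)
        if w in seen:
            return (bigram_counts[(v, w)] - discount) / count_v
        alpha = 1.0 - sum((bigram_counts[(v, w2)] - discount) / count_v for w2 in seen)
        unseen_mass = sum(q_ml_unigram(w2) for w2 in unigram_counts if w2 not in seen)
        return alpha * q_ml_unigram(w) / unseen_mass

    return q_bo
```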

Katz Back-Off Models (Trigrams)


For a trigram model, first define two sets:

\[
A(w_{i-2}, w_{i-1}) = \{ w : \text{Count}(w_{i-2}, w_{i-1}, w) > 0 \}, \qquad
B(w_{i-2}, w_{i-1}) = \{ w : \text{Count}(w_{i-2}, w_{i-1}, w) = 0 \}
\]

A trigram model is defined in terms of the bigram model:

\[
q_{BO}(w_i \mid w_{i-2}, w_{i-1}) =
\begin{cases}
\dfrac{\text{Count}^*(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})} & \text{if } w_i \in A(w_{i-2}, w_{i-1}) \\[2ex]
\alpha(w_{i-2}, w_{i-1}) \, \dfrac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & \text{if } w_i \in B(w_{i-2}, w_{i-1})
\end{cases}
\]

where

\[
\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{\text{Count}^*(w_{i-2}, w_{i-1}, w)}{\text{Count}(w_{i-2}, w_{i-1})}
\]

Summary
Three steps in deriving the language model probabilities:

1. Expand p(w_1, w_2 ... w_n) using the chain rule.
2. Make Markov independence assumptions:
   p(w_i | w_1, w_2 ... w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1})
3. Smooth the estimates using lower-order counts.

Other methods used to improve language models:

- Topic or long-range features.
- Syntactic models.

It's generally hard to improve on trigram models, though!
