
Frequency estimation for rare events

Sergei Winitzki

March 29, 2013

Sources: Baayen, Word frequency distributions; papers by Stefan Evert, as inspiration.
Warning: This text is not yet in final form and may contain serious errors. The purpose of this text is to document some calculations. It is certainly possible to obtain most results in a much simpler way. The goal is to present rigorous derivations through the powerful method of generating functions.

1 Word frequencies in texts

It would appear that the word green is more likely to be encountered in an English text than the word avunculocal. Can this be made precise? Is there a mathematically precise way of computing the frequency of the words green and avunculocal in English texts?
This question can be motivated, for example, by the desire to build a computer program for discovering mistakes in spelling. Standard spellcheckers use a large list of known words and highlight any word that is not in the list. Suppose we want to produce a more intelligent spellchecker: it will also highlight a word that, although written correctly, is unlikely to be seen in a text. To implement this, we would like to have a table of English word frequencies.

1.1 The problem of rare words


What if we simply take a sufficiently large sample of text, split it into individual words, and count how many distinct words we find, and how many times each distinct word is repeated? In linguistics, a large and representative sample of text is called a corpus. Let us denote by $N$ the total number of all words in the corpus and by $\{w_1, \ldots, w_R\}$ the set of all used words; here we denoted by $R$ the number of distinct words. By counting the words, we obtain the empirical word counts $c_1, \ldots, c_R$. A straightforward estimate then gives the probability
$$
\Pr(w_i) \approx \frac{c_i}{N}
$$
for the word $w_i$.
Is the problem solved? No! The trouble is that most languages have a lot of words that seem to be quite rare. In fact, words like avunculocal are so rare that they may yield an empirical word count of 0, even in a corpus of, say, a million words. It would be certainly incorrect to assume that any words not seen in the corpus have exactly zero probability.
Here is another indication of trouble: any realistic corpus will have quite a few words that are seen exactly once ($c_i = 1$). The naive estimate of the probability for all such words is $1/N$, and this is likely to be quite different from the true probability.
We can also compute the total number $R$ of different words. It turns out that $R$ grows with $N$: the larger the text sample, the more new words we find.[1] Baayen's book says somewhere that $R$ grows roughly proportionally to $\sqrt{N}$.
Theoretically speaking, the number of different words should become constant, and the estimates $c_i/N$ should become precise, once our corpus is large enough. However, experiments show that even text samples having $10^7$, $10^8$, or $10^9$ words are not large enough, because $R$ keeps growing and new rare words keep showing up!

2 Non-parametric results

The non-parametric results depend only on the assumption that the corpus can be described by some probability distribution $\Pr(w)$. There are no assumptions about the particular mathematical shape of this function. (One such assumption would be, say, that the probability of the $n$-th word in the dictionary decreases proportionally to $1/n^\alpha$; the value of $\alpha$ would then be the parameter in a parametric description.)

2.1 The bag of words model


It is a fact that any feasible corpus is not large enough. Natural languages have so many words (and new words are being constantly invented) that it is not feasible to gather enough text samples for a complete dictionary.
Instead, let us formulate a probabilistic mathematical model that imitates the behavior of the word counts in samples of text:

- The language has a finite but extremely large set of words $\{w_1, \ldots, w_S\}$. The limit $S \to \infty$ can be taken if this simplifies calculations.
- Each word $w_i$ has a fixed true probability $p_i$, and
$$
\sum_{i=1}^{S} p_i = 1 .
$$
- Words in a text sample are randomly and independently chosen according to their true probabilities.

[1] This is found by a lot of empirical work in corpus linguistics.

This is the "bag of words" model: all words are contained in a large bag. When we draw a random word from the bag, there is a fixed probability $p_i$ for the word $w_i$ to be selected. The text sample is created by drawing the first word randomly from the bag (and returning it to the bag), then drawing the second word, etc.
Given the true probability $\Pr(w_i) = p_i$ of the word $w_i$, we can write down the probability of seeing the word $w_i$ exactly $n$ times in a text sample of length $N$:
$$
\Pr(c_i = n) = \binom{N}{n} (1 - p_i)^{N-n} p_i^n .
$$
The expectation value of the word count is
$$
E[c_i] = \sum_{n=0}^{N} n \Pr(c_i = n)
= \sum_{n=0}^{N} \frac{n\, N!}{n!\, (N-n)!}\, (1-p_i)^{N-n} p_i^n
= N p_i \sum_{n=1}^{N} \frac{(N-1)!}{(n-1)!\, (N-n)!}\, (1-p_i)^{N-n} p_i^{n-1}
= N p_i .
$$

The variance of the word count is
$$
E[c_i^2] - (E[c_i])^2 = \sum_{n=0}^{N} n^2 \Pr(c_i = n) - (N p_i)^2
= \sum_{n=2}^{N} n(n-1) \Pr(c_i = n) + N p_i - (N p_i)^2
$$
$$
= N p_i - (N p_i)^2 + p_i^2\, N(N-1) \sum_{n=2}^{N} \frac{(N-2)!}{(n-2)!\, (N-n)!}\, (1-p_i)^{N-n} p_i^{n-2}
= N p_i (1 - p_i) .
$$

Therefore, the estimator $c_i/N$ has the variance
$$
\sigma^2[c_i/N] = \frac{p_i (1-p_i)}{N} .
$$
The relative precision of this estimator is
$$
\frac{\sqrt{\sigma^2[c_i]}}{E[c_i]} = \sqrt{\frac{1-p_i}{N p_i}} .
$$
This precision is much less than 1 only when $N p_i \gg 1$. So we expect that the estimator $c_i/N$ gives grossly incorrect results for words whose true probability $p_i$ is smaller than $1/N$.
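The short simulation below illustrates these formulas numerically. It is only a sketch under assumed inputs: the vocabulary size, the Dirichlet-generated "true" probabilities, and the random seed are made up for illustration, and NumPy is assumed to be available.

```python
# Minimal sketch: empirical check of E[c_i/N] = p_i and Var[c_i/N] = p_i(1-p_i)/N.
# The vocabulary, probabilities and seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, N, trials = 1000, 10_000, 500
p = rng.dirichlet(np.ones(S))              # "true" word probabilities, sum to 1

# Each row is one corpus of N words (bag of words = one multinomial draw);
# we track the count of word 0 across independent corpora.
c0 = rng.multinomial(N, p, size=trials)[:, 0]
est = c0 / N                               # the estimator c_i/N for word 0

print("p_0                =", p[0])
print("mean of c_0/N      =", est.mean())
print("var of c_0/N       =", est.var())
print("theory p(1-p)/N    =", p[0] * (1 - p[0]) / N)
print("relative precision =", est.std() / p[0],
      "~ sqrt((1-p)/(N p)) =", np.sqrt((1 - p[0]) / (N * p[0])))
```

For words with $N p_i$ not much larger than 1, the printed relative precision becomes of order one or larger, illustrating the remark above.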

2.2 The number of words by count
In an attempt to find a better estimate of the word probabilities, we turn to other statistics. Let us count how many different words have the word count equal to a given number $n$.
Each word $w_i$ (where $i = 1, \ldots, S$) has probability $\Pr(c_i = n)$ of having the word count $n$. Let us denote this probability by $r_{(n)i}$. That is, we define
$$
r_{(n)i} := \binom{N}{n} (1-p_i)^{N-n} p_i^n .
$$
Now consider the event $V_{(n)} = 1$, that we have exactly one word with the word count $n$. This event is realized if there exists some $i$ such that the event $r_{(n)i}$ has been realized for the word $w_i$, and if all the other words have a different word count. Thus, here is the probability of having exactly one word with the word count $n$:
$$
\Pr\left(V_{(n)} = 1\right) = r_{(n)1}\left(1 - r_{(n)2}\right)\left(1 - r_{(n)3}\right)\cdots
+ \left(1 - r_{(n)1}\right) r_{(n)2}\left(1 - r_{(n)3}\right)\cdots + \ldots
$$
(This calculation is, strictly speaking, incorrect: the events $r_{(n)i}$ are not independent! However, they can be considered approximately independent when word counts $n$ are much smaller than the total number $N$ of words in the text.)
The probability of having exactly two words with the word count $n$ is obtained in a similar way, but now we enumerate all pairs of words:
$$
\Pr\left(V_{(n)} = 2\right) = r_{(n)1} r_{(n)2}\left(1 - r_{(n)3}\right)\left(1 - r_{(n)4}\right)\cdots
+ r_{(n)1}\left(1 - r_{(n)2}\right) r_{(n)3}\left(1 - r_{(n)4}\right)\cdots + \ldots \qquad (1)
$$
In order to handle such expressions more easily, let us introduce a generating function,
$$
G_{(n)}(t) := \left(1 - r_{(n)1} t\right)\left(1 - r_{(n)2} t\right)\left(1 - r_{(n)3} t\right)\cdots
$$
We note that $G_{(n)}(0) = 1$. Now we can express the probabilities of the events $V_{(n)} = s$ as
$$
\Pr\left(V_{(n)} = s\right) = \frac{(-1)^s}{s!} \left.\frac{\partial^s}{\partial t^s} G_{(n)}(t)\right|_{t=1} .
$$
The average number of words having the word count $n$ can be computed like this (here for convenience we assume $S = \infty$):
$$
E\left[V_{(n)}\right] = \sum_{s=0}^{\infty} \Pr\left(V_{(n)} = s\right) s
= \sum_{s=1}^{\infty} \frac{(-1)^s}{(s-1)!} \left.\frac{\partial^{s-1}}{\partial t^{s-1}} \frac{\partial G_{(n)}}{\partial t}\right|_{t=1}
= -\left.\frac{\partial G_{(n)}}{\partial t}\right|_{t=0} = \sum_{i=1}^{S} r_{(n)i} .
$$

Similarly, we can compute the variance of the word count:
$$
E\left[V_{(n)}^2\right] - E\left[V_{(n)}\right]^2 = \ldots = \sum_{i=1}^{S} r_{(n)i} \left(1 - r_{(n)i}\right) .
$$
The variance can be expressed through the expected word count for a twice larger text sample: we note that
$$
\left[r_{(n)i}(N)\right]^2 = \binom{N}{n}^{2} (1-p_i)^{2N-2n} p_i^{2n} = \frac{\binom{N}{n}^2}{\binom{2N}{2n}}\; r_{(2n)i}(2N) .
$$
Using a slightly modified version of Stirling's formula,
$$
x! \approx e^{x \ln x - x} \sqrt{\pi\left(2x + \tfrac{1}{3}\right)},
$$
which is precise to 1% already for $x = 1$, we can simplify (assuming $N \gg 1$ and also $N - n \gg 1$)
$$
\frac{\binom{N}{n}^2}{\binom{2N}{2n}} = \frac{N!^2\, (2n)!\, (2N-2n)!}{n!^2\, (N-n)!^2\, (2N)!}
\approx \frac{1}{\sqrt{\pi}}\, \frac{\sqrt{4n + \tfrac13}}{2n + \tfrac13} \approx \frac{1}{\sqrt{\pi n}} .
$$
Therefore we can approximately write
$$
E\left[V_{(n)}(N)^2\right] - E\left[V_{(n)}(N)\right]^2 \approx E\left[V_{(n)}(N)\right] - \frac{1}{\sqrt{\pi n}}\, E\left[V_{(2n)}(2N)\right] .
$$
It is interesting to consider also the word count $V_{(0)}$, i.e. the count of words missing from the text sample. For instance, the event $V_{(0)} = s$ means that there are exactly $s$ words that are not present in the text sample. The expectation value of $V_{(0)}$ is
$$
E\left[V_{(0)}\right] = \sum_{i=1}^{S} r_{(0)i} = \sum_{i=1}^{S} (1-p_i)^N .
$$
This expectation value is approximately equal to the total number of words whose probability is smaller than $N^{-1}$.
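As an illustration of the frequency spectrum $V_{(n)}$, the sketch below draws one bag-of-words sample and compares the observed $V_{(n)}$ with $E[V_{(n)}] = \sum_i r_{(n)i}$. The skewed synthetic distribution $p_i$, the sizes, and the use of NumPy/SciPy are illustrative assumptions, not part of the original text.

```python
# Sketch: observed frequency spectrum V_(n) of one sample vs. the expectation
# E[V_(n)] = sum_i C(N,n) (1-p_i)^(N-n) p_i^n.  All inputs are illustrative.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
S, N = 5000, 20_000
p = rng.dirichlet(np.full(S, 0.2))      # a skewed synthetic "language"

c = rng.multinomial(N, p)               # word counts in one sample
V = np.bincount(c)                      # V[n] = number of words seen exactly n times

for n in range(4):
    observed = V[n] if n < len(V) else 0
    expected = binom.pmf(n, N, p).sum() # sum over words of r_(n)i
    print(f"n = {n}:  observed V_(n) = {observed},  E[V_(n)] = {expected:.1f}")
```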

2.3 Covariance of word counts


Word counts are not statistically independent. Let us compute the covariance of two different word counts,
$$
E\left[V_{(m)} V_{(n)}\right] - E\left[V_{(m)}\right] E\left[V_{(n)}\right] .
$$
In order to compute this quantity, we need to consider the joint event $\{V_{(m)} = q,\; V_{(n)} = s\}$ that there are exactly $q$ words with count $m$ and exactly $s$ words with count $n$. A simple example of such an event is $\{V_{(m)} = 1,\; V_{(n)} = 1\}$. Its probability is
$$
\Pr\left(V_{(m)} = 1,\, V_{(n)} = 1\right) = \sum_{i \neq j} r_{(m)i}\, r_{(n)j} \prod_{k \neq i;\, k \neq j} \left(1 - r_{(m)k} - r_{(n)k}\right) .
$$

This expression is similar to Eq. (1), where the sum goes over all the ordered pairs of words. We introduce a generating function
$$
G_{(m,n)}(t, u) := \prod_{i=1}^{S} \left(1 - r_{(m)i}\, t - r_{(n)i}\, u\right) .
$$
Then we can express
$$
\Pr\left(V_{(m)} = 1,\, V_{(n)} = 1\right) = \left.\frac{\partial}{\partial t}\right|_{t=1} \left.\frac{\partial}{\partial u}\right|_{u=1} G(t, u) .
$$
Similarly, we obtain
$$
\Pr\left(V_{(m)} = q,\, V_{(n)} = s\right) = \frac{(-1)^{q+s}}{q!\, s!} \left.\frac{\partial^q}{\partial t^q}\right|_{t=1} \left.\frac{\partial^s}{\partial u^s}\right|_{u=1} G(t, u) .
$$
Now the covariance of $V_{(m)}$ and $V_{(n)}$ can be computed. We find
$$
E\left[V_{(m)} V_{(n)}\right] = \sum_{q,s=0}^{N} \Pr\left(V_{(m)} = q,\, V_{(n)} = s\right) q s
= \left.\frac{\partial^2 G}{\partial t\, \partial u}\right|_{t=u=0} ;
$$
$$
E\left[V_{(m)} V_{(n)}\right] - E\left[V_{(m)}\right] E\left[V_{(n)}\right]
= \left.\left(\frac{\partial^2 G}{\partial t\, \partial u} - \frac{\partial G}{\partial t}\frac{\partial G}{\partial u}\right)\right|_{t=u=0}
= \sum_{i \neq j} r_{(m)i}\, r_{(n)j} - \left(\sum_i r_{(m)i}\right)\left(\sum_j r_{(n)j}\right)
= -\sum_i r_{(m)i}\, r_{(n)i} .
$$

We find that the word counts are slightly (negatively) correlated. The last expression can be rewritten as
$$
\sum_i r_{(m)i}\, r_{(n)i} = \binom{N}{m}\binom{N}{n} \sum_i (1-p_i)^{2N-m-n} p_i^{m+n}
= \frac{\binom{N}{m}\binom{N}{n}}{\binom{2N}{m+n}}\; E\left[V_{(m+n)}(2N)\right] .
$$
The combinatorial factor can be simplified for large $N$ (and also assuming $N - m \gg 1$, $N - n \gg 1$) using Stirling's formula:
$$
\frac{\binom{N}{m}\binom{N}{n}}{\binom{2N}{m+n}} = \frac{N!^2\, (m+n)!\, (2N-m-n)!}{m!\, n!\, (N-m)!\, (N-n)!\, (2N)!}
\approx \binom{m+n}{n} 2^{-(m+n)} .
$$

2.4 The total number of distinct words


Earlier we denoted by $R$ the number of distinct words in a text sample. We can compute the expectation value and the variance of $R$ by considering the event that $R = n$. This event occurs when exactly $n$ words have word counts not equal to zero. This is the same as the event $V_{(0)} = S - n$. So we can use the formulas derived in the previous section for this event. We just need to substitute $R = S - V_{(0)}$.
The expectation value of $R$ is computed by
$$
E[R] = E\left[S - V_{(0)}\right] = S - E\left[V_{(0)}\right] = \sum_{i=1}^{S} \left[1 - (1-p_i)^N\right] .
$$
The variance of $R$ is
$$
E[R^2] - E[R]^2 = E\left[V_{(0)}^2\right] - E\left[V_{(0)}\right]^2 = \sum_{i=1}^{S} (1-p_i)^N \left[1 - (1-p_i)^N\right] .
$$
It is interesting that this quantity can be expressed through the values of $E[R]$ for $N$ and $2N$:
$$
E\left[R(N)^2\right] - E[R(N)]^2 = E[R(2N)] - E[R(N)] .
$$

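A quick Monte Carlo check of the last identity is sketched below; the synthetic distribution $p_i$, the sample sizes, and the number of trials are arbitrary choices made for illustration.

```python
# Sketch: Monte Carlo check of Var[R(N)] = E[R(2N)] - E[R(N)] under the
# bag-of-words model, with an illustrative synthetic distribution p_i.
import numpy as np

rng = np.random.default_rng(2)
S, N, trials = 2000, 5000, 400
p = rng.dirichlet(np.full(S, 0.3))

def distinct_words(sample_size):
    """Number of distinct words R in one random sample."""
    return np.count_nonzero(rng.multinomial(sample_size, p))

R_N  = np.array([distinct_words(N)     for _ in range(trials)])
R_2N = np.array([distinct_words(2 * N) for _ in range(trials)])

print("Var[R(N)]          =", R_N.var())
print("E[R(2N)] - E[R(N)] =", R_2N.mean() - R_N.mean())
```

Both printed numbers are Monte Carlo estimates, so they agree only within sampling error.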
2.5 Invariant quantities


Baayen mentions the following invariant quantity as due to Simpson (Nature 163, p. 168, 1949),
$$
D := \sum_{n=2}^{N} V_{(n)}\, \frac{n}{N}\, \frac{n-1}{N-1} .
$$
We can compute the expectation value of this quantity by the following method:
$$
E[D] = \sum_{n=2}^{N} E\left[V_{(n)}\right] \frac{n}{N}\frac{n-1}{N-1}
= \sum_{n=2}^{N} \sum_{i=1}^{S} \binom{N}{n} (1-p_i)^{N-n} p_i^n\, \frac{n}{N}\frac{n-1}{N-1}
= \sum_{i=1}^{S} p_i^2 \left[\sum_{n=2}^{N} \binom{N-2}{n-2} (1-p_i)^{N-n} p_i^{n-2}\right]
= \sum_{i=1}^{S} p_i^2 .
$$
This quantity $D$ is therefore expected to be invariant with respect to the choice of the text sample! (There is no dependence on $N$.)
By analogy we may construct a family of invariant quantities,
$$
D^{(k)} := \sum_{n=k}^{N} V_{(n)}\, \frac{n}{N}\, \frac{n-1}{N-1} \cdots \frac{n-k+1}{N-k+1}, \qquad k \geq 1 .
$$
The expectation value of $D^{(k)}$ is again independent of the sample size $N$ and depends only on the distribution of the true probabilities, i.e. only on the language:
$$
E\left[D^{(k)}\right] = \sum_{i=1}^{S} p_i^k .
$$

If actual experiments with large texts show that the quantities $D^{(k)}$ change systematically with $N$ beyond statistical uncertainties, the hypothesis of the bag of words will be disproved! (But it is perhaps not easy to estimate the statistical uncertainty in $D^{(k)}$?)
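The invariance of $D$ is easy to probe numerically, as in the sketch below; the synthetic probabilities and the sample sizes are assumptions chosen for illustration only.

```python
# Sketch: Simpson's D computed from samples of different sizes should scatter
# around sum_i p_i^2, independently of N, under the bag-of-words assumption.
import numpy as np

rng = np.random.default_rng(3)
S = 3000
p = rng.dirichlet(np.full(S, 0.2))
target = np.sum(p ** 2)

def simpson_D(N):
    """D = sum_n V_(n) n(n-1) / (N(N-1)) for one random sample of size N."""
    c = rng.multinomial(N, p)
    V = np.bincount(c)
    n = np.arange(len(V))
    return np.sum(V * n * (n - 1)) / (N * (N - 1))

for N in (2_000, 10_000, 50_000):
    print(f"N = {N:6d}:  D = {simpson_D(N):.6f}   (sum p_i^2 = {target:.6f})")
```

Within sampling fluctuations, the printed values of $D$ should not drift systematically with $N$; a systematic drift on real corpora would argue against the bag-of-words hypothesis, as noted above.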

2.6 Dependence on sample size


There exist some general relationships that express the dependence on $N$ of the various word counts. These relationships are all based on the assumptions of the probabilistic model; should they not hold, the probabilistic model would be disproved.
We consider the derivatives of quantities such as $R$ and $V_{(n)}$ with respect to $N$. Since $N$ is actually an integer, we could just as well consider a discrete analog of the derivative, for instance:
$$
E[R(N)] - E[R(N-1)] = \sum_{i=1}^{S} \left[(1-p_i)^{N-1} - (1-p_i)^N\right] = \sum_{i=1}^{S} (1-p_i)^{N-1} p_i .
$$
This expression is similar to the expectation value for $V_{(1)}$:
$$
E\left[V_{(1)}(N)\right] = N \sum_{i=1}^{S} (1-p_i)^{N-1} p_i .
$$
By inspection we obtain the (exact!) relationship
$$
E[R(N) - R(N-1)] = \frac{1}{N}\, E\left[V_{(1)}(N)\right] .
$$
Typically, $V_{(1)}$ is much smaller than $N$, so adding a single word to the sample does not predict an appreciable change in $R$.
Analogous relationships hold for the quantities $V_{(n)}$. For instance,
$$
E\left[V_{(n)}(N+1)\right] = \binom{N+1}{n} \sum_{i=1}^{S} (1-p_i)^{N+1-n} p_i^n
= \binom{N+1}{n} \sum_{i=1}^{S} (1-p_i)^{N-n} (1-p_i)\, p_i^n
$$
$$
= \frac{N+1}{N+1-n}\, E\left[V_{(n)}(N)\right] - \frac{n+1}{N+1-n}\, E\left[V_{(n+1)}(N+1)\right] .
$$
In practice, $N$ is a large number, much larger than $n$, but not necessarily much larger than $V_{(n)}(N)$. So we may approximate this relationship by
$$
E\left[V_{(n)}(N+1)\right] \approx \left(1 + \frac{n}{N}\right) E\left[V_{(n)}(N)\right] - \frac{n+1}{N}\, E\left[V_{(n+1)}(N+1)\right] . \qquad (2)
$$

These relationships allow us to predict (in principle!) how much the word counts will change if we add more text to our text sample. If we only have a sample of $N$ words, not a sample of $N+1$ words, we cannot directly estimate $E[V_{(n)}(N+1)]$. Nevertheless, the quantities change very little between $N$ and $N+1$, so we can always estimate $V_{(n)}(N+1)$ through known values of $V_{(n)}(N)$. The relationship (2) requires $V_{(n+1)}(N+1)$, which again needs to be expressed using the same relationship. So we get the (asymptotic) series in $1/N$:
$$
E\left[V_{(n)}(N+1)\right] \approx \left(1 + \frac{n}{N}\right) E\left[V_{(n)}(N)\right]
- \frac{n+1}{N}\left(1 + \frac{n+1}{N}\right) E\left[V_{(n+1)}(N)\right]
+ \frac{(n+1)(n+2)}{N^2}\left(1 + \frac{n+2}{N}\right) E\left[V_{(n+2)}(N)\right] + \ldots
$$
Since $E[V_{(n)}(N)]$ decreases with growing $n$, while $N$ is large, only the first few terms of this series will give sufficient precision in practice.
We see that the change due to adding one word is of order $N^{-1}$. An appreciable change can be expected only if the size of the text sample is increased by about $N$ words.
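The exact relation $E[R(N)] - E[R(N-1)] = E[V_{(1)}(N)]/N$ can also be verified directly from the closed-form expressions, without any sampling; the sketch below does this for a made-up distribution $p_i$.

```python
# Sketch: direct check of E[R(N)] - E[R(N-1)] = E[V_(1)(N)] / N, evaluated from
# the closed-form expressions for an illustrative distribution p_i.
import numpy as np

rng = np.random.default_rng(4)
S, N = 1000, 5000
p = rng.dirichlet(np.ones(S))

def E_R(M):
    """E[R(M)] = sum_i [1 - (1-p_i)^M]."""
    return np.sum(1.0 - (1.0 - p) ** M)

def E_V1(M):
    """E[V_(1)(M)] = M * sum_i p_i (1-p_i)^(M-1)."""
    return M * np.sum(p * (1.0 - p) ** (M - 1))

print("E[R(N)] - E[R(N-1)] =", E_R(N) - E_R(N - 1))
print("E[V_(1)(N)] / N     =", E_V1(N) / N)
```

The two printed numbers coincide up to floating-point rounding, since the relation is exact.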

2.7 Probabilities for unknown words


We can now answer some interesting questions about word probabilities. Suppose we have a text sample of $N$ words, and we add one word to the sample. This word could be one of the words we have already seen, or a word we have not yet seen: an unknown word.

- What is the probability that the new word is unknown?
- What is the expected value of $p(w)$ for this word if it is unknown?
- What is the probability that this word is known but is one of the rare words that were seen so far only $n$ times in the text (where $n$ is small)?

Let us consider the first question. According to the bag of words model, the new word can be any of the words $w_i$ independently, with probability $p_i$. The probability that this word is not in the already known text sample (of length $N$) is
$$
(1-p_i)^N .
$$
Therefore, the new word is unknown with probability
$$
\Pr(w \text{ unknown}) = \sum_{i=1}^{S} p_i (1-p_i)^N .
$$
We have seen this expression before; it is equal to
$$
\frac{1}{N+1}\, E\left[V_{(1)}(N+1)\right] .
$$

If we only have the text sample of length $N$, how can we obtain quantities such as $V_{(1)}(N+1)$? By using relationships such as
$$
V_{(1)}(N+1) \approx \left(1 + \frac{1}{N}\right) V_{(1)}(N) - \frac{2}{N}\left(1 + \frac{2}{N}\right) V_{(2)}(N) + \ldots
$$
The expected value of $p$ for the new unknown word is
$$
E\left[p_w \mid \text{unknown}\right] = \frac{\sum_{i=1}^{S} p_i^2 (1-p_i)^N}{\sum_{i=1}^{S} p_i (1-p_i)^N} .
$$
Compare this with the previously derived expressions:
$$
E\left[V_{(2)}(N+2)\right] = \binom{N+2}{2} \sum_{i=1}^{S} p_i^2 (1-p_i)^N .
$$
By inspection, we find
$$
E\left[p_w \mid \text{unknown}\right] = \frac{2}{N+2}\, \frac{E\left[V_{(2)}(N+2)\right]}{E\left[V_{(1)}(N+1)\right]} .
$$
Similarly, there is the probability $r_{(n)i}$ of the new word $w = w_i$ being one of the words that already occurred $n$ times in the existing text sample. Therefore, the answer to the third question is
$$
\Pr(w \text{ occurred } n \text{ times}) = \sum_{i=1}^{S} r_{(n)i}\, p_i = \binom{N}{n} \sum_{i=1}^{S} (1-p_i)^{N-n} p_i^{n+1}
= \frac{n+1}{N+1}\, E\left[V_{(n+1)}(N+1)\right] .
$$
It seems useful to record the following calculation:
$$
\sum_{i=1}^{S} r_{(n)i}\, p_i^k = \binom{N}{n} \sum_{i=1}^{S} (1-p_i)^{N-n} p_i^{n+k}
= \frac{(n+k)!\, N!}{n!\, (N+k)!}\, E\left[V_{(n+k)}(N+k)\right] .
$$
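The answers above are exact identities and can be checked numerically; the sketch below evaluates both sides of each identity for an assumed distribution $p_i$, using SciPy's binomial pmf for the terms $r_{(n)i}$. All inputs are illustrative.

```python
# Sketch: numerical check of the three identities of this subsection for an
# assumed distribution p_i (exact formulas, no sampling).
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(5)
S, N, n = 2000, 10_000, 2
p = rng.dirichlet(np.full(S, 0.2))

def E_V(k, M):
    """E[V_(k)(M)] = sum_i C(M,k) (1-p_i)^(M-k) p_i^k."""
    return binom.pmf(k, M, p).sum()

pr_unknown = np.sum(p * (1 - p) ** N)
print("Pr(new word unknown)                    =", pr_unknown)
print("E[V_(1)(N+1)] / (N+1)                   =", E_V(1, N + 1) / (N + 1))

E_p_unknown = np.sum(p**2 * (1 - p)**N) / pr_unknown
print("E[p | unknown]                          =", E_p_unknown)
print("2 E[V_(2)(N+2)] / ((N+2) E[V_(1)(N+1)]) =",
      2 * E_V(2, N + 2) / ((N + 2) * E_V(1, N + 1)))

pr_seen_n = np.sum(binom.pmf(n, N, p) * p)
print(f"Pr(new word seen {n} times)              =", pr_seen_n)
print("(n+1)/(N+1) * E[V_(n+1)(N+1)]           =",
      (n + 1) / (N + 1) * E_V(n + 1, N + 1))
```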

2.8 Good-Turing estimators


What is our estimate of $p(w)$ for some word $w$ in the text sample? All we know within the bag of words model is that the word $w$ occurred a certain number of times, say $n$, in the text sample. Actually, the word $w$ is not the only one that occurred $n$ times; there are $V_{(n)}$ such words in the text sample. We know only one thing about each of these words: namely, that they occurred $n$ times. So our estimate of the probability $p(w)$ must be the same for all these words.
It makes sense to estimate $p(w)$ by computing the expectation value of the probability $p_i$ among all words $w_i$ that occur $n$ times. How can we compute this expectation value? The event that the word $w_i$ occurs $n$ times has probability $r_{(n)i}$. For different $i$, these events are approximately independent (at least when $n$ is much smaller than $N$; however, it is quite difficult to estimate the precision of the approximation we are making!). So we can compute the expectation value of $p_i$ approximately as
$$
E\left[p_w \mid \text{occurred } n \text{ times}\right] = \frac{\sum_{i=1}^{S} p_i\, r_{(n)i}}{\sum_{i=1}^{S} r_{(n)i}}
= \frac{n+1}{N+1}\, \frac{E\left[V_{(n+1)}(N+1)\right]}{E\left[V_{(n)}(N)\right]} .
$$
The expectation value of $p(w)$ for unknown words is also obtained if we set $n = 0$ in this formula.***
A difficulty in using the Good-Turing estimator in practice is that we need to know the expectation values $E[V_{(n)}]$, while we only know the empirically observed values $V_{(n)}$. These values are estimators of $E[V_{(n)}]$ with standard deviations of order $\sqrt{E[V_{(n)}]}$. Thus, these empirically observed values cannot be used unless they are quite large. This is typically the case for the values $V_{(1)}$ and $V_{(2)}$ (there are many rare words). However, for many higher values of $n$ we will have $V_{(n)} = 1$ or $V_{(n)} = 0$. The Good-Turing estimator needs to be complemented by a "smoothing" of the values of $V_{(n)}$. This smoothing does not have a good theoretical basis, it seems.
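A minimal implementation of the Good-Turing estimate from an observed count table is sketched below. It follows the formula $(n+1)V_{(n+1)}/(N V_{(n)})$ and deliberately leaves out any serious smoothing; the tiny example sentence is, of course, only a toy.

```python
# Sketch of the Good-Turing estimate p ~ (n+1) V_(n+1) / (N V_(n)) from raw
# counts.  No smoothing of V_(n) is attempted, so the estimate is only usable
# for n with a nonzero (ideally large) V_(n+1), as discussed in the text.
from collections import Counter

def good_turing(counts):
    """counts: mapping word -> count.  Returns {n: estimated p of a word seen n times}."""
    N = sum(counts.values())
    spectrum = Counter(counts.values())        # spectrum[n] = V_(n)
    estimates = {}
    for n, V_n in spectrum.items():
        V_next = spectrum.get(n + 1, 0)
        if V_next > 0:                         # skip n where V_(n+1) is not observed
            estimates[n] = (n + 1) * V_next / (N * V_n)
    # Total probability mass assigned to all unseen words together: V_(1)/N.
    estimates["unseen_total"] = spectrum.get(1, 0) / N
    return estimates

# toy usage on a hypothetical miniature corpus
counts = Counter("the cat sat on the mat the cat ran".split())
print(good_turing(counts))
```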

2.9 The spring of words (Poisson) model


The bag of words model produces a text with exactly $N$ words. This has the disadvantage that the numbers of word occurrences in the text are not independent events for different words. To take an extreme example: if one word occurs (by chance) $N$ times, no other words may occur at all. Our calculations of the covariances of $V_{(n)}$ are not quite correct because of this; in the derivations, we have assumed that the events $r_{(n)i}$ are all independent for different words $i$.
An alternative model is that each word is emitted independently by a "spring of words." Each word $w_i$ has the fixed probability $p_i\,\Delta t$ of being emitted during a time interval $\Delta t$, independently of all other words, and independently of all other times. We let the spring emit words for a total duration of time equal to $N$. So on the average we expect to have $N$ words emitted into the corpus.
This model predicts the Poisson distribution for the number of times the word $w_i$ is emitted; the mean of this distribution is $N p_i$. Now, by construction, the number of occurrences of each word $w_i$ is an event independent of the number of occurrences of all other words.
The distribution of the total length of the corpus is also a Poisson distribution with mean $N$. This is not a significant problem if the corpus is very large: the standard deviation of the corpus size is $\sqrt{N}$, so the corpus size is determined with high precision (the relative error is of order $1/\sqrt{N}$).
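The sketch below simulates the spring-of-words model directly: each word count is drawn as an independent Poisson variable with mean $N p_i$, and the total corpus length fluctuates around $N$ with standard deviation of order $\sqrt{N}$. The vocabulary, probabilities, and sample sizes are illustrative assumptions.

```python
# Minimal simulation of the "spring of words" (Poisson) model: each word count
# is an independent Poisson(N p_i) variable, so the total corpus length is
# itself Poisson-distributed with mean N.
import numpy as np

rng = np.random.default_rng(6)
S, N, trials = 2000, 10_000, 300
p = rng.dirichlet(np.full(S, 0.2))

lengths = np.array([rng.poisson(N * p).sum() for _ in range(trials)])
print("mean corpus length :", lengths.mean(), "(target N =", N, ")")
print("std of length      :", lengths.std(), "(sqrt(N) =", round(np.sqrt(N), 1), ")")
```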

2.10 Calculations with the Poisson model


The Poisson model gives some advantages for analytic calculations.

The Poisson distribution with mean $\lambda$ is defined by
$$
p_\lambda(k) = \frac{\lambda^k}{k!}\, e^{-\lambda} .
$$
It is easy to see that $\sum_{k=0}^{\infty} p_\lambda(k) = 1$.
The generating function of the Poisson distribution is
$$
g(t) = \sum_{k=0}^{\infty} p_\lambda(k)\, t^k = e^{\lambda (t-1)} .
$$
Now we are interested in the generating function of all word occurrences. We introduce formal variables $q_1, \ldots, q_S$ for all words. If the word counts $c_i$ of the words $w_i$ in the corpus occur with probability $p(\{c_i\})$, the generating function is
$$
G(\{q_i\}) = \sum_{\{c_i\}} p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S} . \qquad (3)
$$
Here all word counts $c_i$ are independently summed over, each $c_i$ going from 0 to $\infty$. Each word count has an independent probability distribution. Therefore the generating function factorizes into the product of the Poisson generating functions for each word count. The distribution of the word count for the word $w_i$ is the Poisson distribution with mean $N p_i$; therefore we find the total generating function as
$$
G(\{q_i\}) = \exp\left(\sum_{i=1}^{S} N p_i (q_i - 1)\right) = e^{-N} \exp\left(\sum_{i=1}^{S} N p_i q_i\right) . \qquad (4)
$$

Now we might want to compute different quantities using this generating function $G$. The function $G$ is the sum of terms as shown in Eq. (3). So the way to use $G$ is to convert each term $q_1^{c_1} \cdots q_S^{c_S}$ to a value, eliminating the formal variables $q_i$ in some way. For instance, if we replace each term $q_1^{c_1} \cdots q_S^{c_S}$ in Eq. (3) by the value $c_1 + \ldots + c_S$, the resulting value will be the sum of the values $(c_1 + \ldots + c_S)$ with weight $p(\{c_i\})$; that is, the result will be the mean total number of words in the text.
More generally, the way to use the function $G$ is by transforming the expression $G(\{q_i\})$ in such a way that the expression
$$
G(\{q_i\}) = \sum_{\{c_i\}} p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}
$$
is replaced by an expression not containing the formal parameters $q_i$,
$$
\sum_{\{c_i\}} p(\{c_i\})\, h(c_1, \ldots, c_S),
$$
where $h$ is some chosen function. Different choices of the function $h$ will yield different interesting quantities. The trick is to find an algebraic substitution of $q_i$ in the function $G$ that performs the desired transformation. This algebraic substitution must be a linear map, so that the sum of terms $q_1^{c_1} \cdots q_S^{c_S}$ is mapped again to a sum of terms.
Let us consider two possible choices of $h$ that may be useful. First,
$$
h(c_1, \ldots, c_S) = f(c_1) + \ldots + f(c_S) = \sum_{i=1}^{S} f(c_i),
$$
where $f$ is an arbitrary, chosen function. The second choice is
$$
h(c_1, \ldots, c_S) = f(c_1) \cdots f(c_S) = \prod_{i=1}^{S} f(c_i),
$$
where $f$ is again an arbitrary function. We will now find a way to compute such substitutions, and use the generating function (4) as a particular example.
The first substitution requires us to construct a linear map of $q_1^{c_1} \cdots q_S^{c_S}$ into $f(c_1) + \ldots + f(c_S)$. We will use the following trick from linear algebra. There exist a linear operator $T$ acting in some vector space $V$, a vector $v \in V$, and a covector $v^* \in V^*$ such that
$$
f(c) = v^* T^c v, \qquad c = 0, 1, \ldots
$$
(Here we use the notation in which $v^*$ acts as a map $V \to \mathbb{C}$, so that the application of a covector $v^*$ to a vector $x$ is written as $v^* x$.) Then we consider the vector space $W := V \oplus \ldots \oplus V$ that is a direct sum of $S$ copies of $V$. We define the operators $\hat q_i$ ($i = 1, \ldots, S$) acting in $W$ by
$$
\hat q_i = 1_V \oplus \ldots \oplus T \oplus \ldots \oplus 1_V,
$$
where $T$ acts on the $i$-th copy of $V$ in the direct sum (and $1_V$ is the identity operator acting in the space $V$). In other words, the operator $\hat q_i$ acts in the space $W$ by applying $T$ only to the $i$-th component of a vector
$$
w := v_1 \oplus \ldots \oplus v_S \in W .
$$
The matrix representation of $\hat q_i$ is
$$
\hat q_i = \begin{pmatrix} 1_V & & & & \\ & \ddots & & & \\ & & T & & \\ & & & \ddots & \\ & & & & 1_V \end{pmatrix},
$$
where the operator $T$ appears in the $i$-th row and $i$-th column. This definition of $\hat q_i$ is designed so that any two $\hat q_i$ commute with each other. We can also express $\hat q_i$ as
$$
\hat q_i = 1_W + 0_V \oplus \ldots \oplus \left(T - 1_V\right) \oplus \ldots \oplus 0_V . \qquad (5)
$$

Now, we define the vector $w$ and the covector $w^*$ through the previously defined vector $v$ and covector $v^*$ as
$$
w := (v \oplus \ldots \oplus v) \in W, \qquad w^* := (v^* \oplus \ldots \oplus v^*) \in W^* .
$$
By construction, we then have
$$
w^* \hat q_1^{c_1} \cdots \hat q_S^{c_S} w = (v^* \oplus \ldots \oplus v^*) \left(T^{c_1} \oplus \ldots \oplus T^{c_S}\right) (v \oplus \ldots \oplus v)
= v^* T^{c_1} v + \ldots + v^* T^{c_S} v = f(c_1) + \ldots + f(c_S) .
$$
Therefore, the desired transformation of $G(\{q_i\})$ is implemented by substituting $\hat q_i$ for $q_i$, applying the resulting operator to the vector $w$, and applying the covector $w^*$ to the result (all these applications are linear operations).
In the present case, we do not actually need to find an explicit form of the operators $\hat q_i$ and the vector spaces $V$ and $W$. It is sufficient that the necessary operators and vectors exist. We can then substitute $\hat q_i$ into Eq. (4) and obtain
$$
w^* G(\{\hat q_i\}) w = w^* \left[e^{-N} \exp\left(\sum_{i=1}^{S} N p_i \hat q_i\right)\right] w .
$$

The operator under the exponential can be computed more explicitly using Eq. (5):
$$
\sum_{i=1}^{S} p_i \hat q_i = \sum_{i=1}^{S} p_i \left[1_W + 0_V \oplus \ldots \oplus \left(T - 1_V\right) \oplus \ldots \oplus 0_V\right]
= 1_W + p_1 \left(T - 1_V\right) \oplus \ldots \oplus p_S \left(T - 1_V\right) .
$$
The exponential of this operator can be expressed as
$$
\exp\left(N \sum_{i=1}^{S} p_i \hat q_i\right) = e^{N} \left[\exp\left(N p_1 (T - 1_V)\right) \oplus \ldots \oplus \exp\left(N p_S (T - 1_V)\right)\right] .
$$
Now we apply the covector $w^*$ and the vector $w$ to this and obtain the result,
$$
w^* G(\{\hat q_i\}) w = \sum_{i=1}^{S} v^* \exp\left[N p_i \left(T - 1_V\right)\right] v
= \sum_{i=1}^{S} e^{-N p_i}\, v^* \exp\left(N p_i T\right) v
$$
$$
= \sum_{i=1}^{S} e^{-N p_i} \sum_{c=0}^{\infty} \frac{(N p_i)^c}{c!}\, v^* T^c v
= \sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f(c)
= \sum_{c=0}^{\infty} \sum_{i=1}^{S} p_{\text{Poisson}}[N p_i;\, c]\, f(c) . \qquad (6)
$$

This is equal to the average of $f(c)$ with the Poisson distribution of word counts $c$, added independently for each word ($i = 1, \ldots, S$). Perhaps we could have obtained this result faster if we had thought right away about the independence of word counts! However, the trick with the generating function can be used also for distributions where there is no independence (e.g. with the bag of words model).
Now consider the second possibility: we need to replace $q_1^{c_1} \cdots q_S^{c_S}$ by the product $f(c_1) \cdots f(c_S)$. Now $q_i$ must be replaced by the operators
$$
\hat q_i = \frac{\partial}{\partial x_i},
$$
where $x_i$ ($i = 1, \ldots, S$) are new formal parameters. Define the auxiliary function $F$ by
$$
F(x) = f(0) + x f(1) + \frac{x^2}{2!} f(2) + \ldots = \sum_{c=0}^{\infty} f(c)\, \frac{x^c}{c!} ; \qquad (7)
$$
we assume that this function is analytic at least for some $x$. Then by construction we will have
$$
\left.\frac{\partial^c}{\partial x^c} F(x)\right|_{x=0} = f(c)
$$
and so
$$
\left.\frac{\partial^{c_1}}{\partial x_1^{c_1}}\right|_{x_1=0} \cdots \left.\frac{\partial^{c_S}}{\partial x_S^{c_S}}\right|_{x_S=0} \left[F(x_1) F(x_2) \cdots F(x_S)\right] = f(c_1) \cdots f(c_S) .
$$
Now let us perform this substitution in the generating function (4). Using the fact that
$$
\left.\exp\left(a \frac{\partial}{\partial x}\right) F(x)\right|_{x=0} = F(a)
$$
for analytic functions $F$, we obtain:
$$
\left.G\left(\left\{\frac{\partial}{\partial x_i}\right\}\right) \left[F(x_1) F(x_2) \cdots F(x_S)\right]\right|_{x_i=0}
= e^{-N} \left.\exp\left(\sum_{i=1}^{S} N p_i \frac{\partial}{\partial x_i}\right) \left[F(x_1) F(x_2) \cdots F(x_S)\right]\right|_{x_i=0}
= e^{-N} F(N p_1) \cdots F(N p_S) . \qquad (8)
$$

As examples of the application of these techniques to the Poisson model, let us consider the following calculations:

- The mean number of words that occur $n$ times in the text; this is the quantity we previously denoted by $E[V_{(n)}]$.
- The probability of having exactly $k$ words that (each) occur $n$ times in the text. This is $\mathrm{Prob}(V_{(n)} = k)$.

To compute $E[V_{(n)}]$, we look at one term, $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$, in the generating function $G$. This term describes the event that each word $i$ has the count $c_i$ in the text. When this event occurs, the number of words that occur $n$ times is equal to the number of $q$'s whose power is equal to $n$. This quantity (the number of words that occur exactly $n$ times) can be expressed as $f(c_1) + \ldots + f(c_S)$, where the function $f$ is defined by
$$
f(c) = \begin{cases} 1, & c = n; \\ 0, & \text{otherwise.} \end{cases}
$$
Therefore, we apply the method leading to Eq. (6). The average is then computed as
$$
E\left[V_{(n)}\right] = \sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f(c) = \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^n}{n!} .
$$

Let us compute the expected total probability $p_{\text{total}}(n)$ contained in all the words that occurred exactly $n$ times. This is obtained if we transform each term $q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$ into a term
$$
p_{i_1} + \ldots + p_{i_k},
$$
where $i_1, \ldots, i_k$ are the words that occurred exactly $n$ times. This kind of replacement is possible if we use a different $f_i(c)$ for each word $i$. The formula (6) will then be modified to
$$
\sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f_i(c) .
$$
In our case, we need to define
$$
f_i(c) = \begin{cases} p_i, & c = n; \\ 0, & \text{otherwise.} \end{cases}
$$
Therefore
$$
E\left[p_{\text{total}}(n)\right] = \sum_{i=1}^{S} p_i\, e^{-N p_i}\, \frac{(N p_i)^n}{n!} .
$$
We can express this through the formula for $E[V_{(n)}]$ that we computed just previously:
$$
E\left[p_{\text{total}}(n)\right] = \frac{n+1}{N} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^{n+1}}{(n+1)!} = \frac{n+1}{N}\, E\left[V_{(n+1)}\right] .
$$
Note that the total probability in unknown words ($n = 0$) is
$$
E\left[p_{\text{total}}(0)\right] = \frac{1}{N}\, E\left[V_{(1)}\right] .
$$
The Good-Turing estimator is motivated by these expressions: if we found $V_{(n)}$ words that occur $n$ times in our sample, and if we expect that the total probability in these words is $E[p_{\text{total}}(n)]$, then we can estimate the probability for each of these words as
$$
p \approx \frac{1}{V_{(n)}}\, E\left[p_{\text{total}}(n)\right] = \frac{n+1}{N}\, \frac{E\left[V_{(n+1)}\right]}{V_{(n)}} \approx \frac{n+1}{N}\, \frac{V_{(n+1)}}{V_{(n)}} .
$$
In this estimate, we replaced the expectation value by the actually observed value $V_{(n+1)}$; this will be a reasonable approximation only when $V_{(n+1)} \gg 1$.
Now we turn to the probability $\mathrm{Prob}(V_{(n)} = k)$ that exactly $k$ words occur $n$ times. It will be convenient to compute the generating function for this probability,
$$
g(t) := \sum_{k=0}^{\infty} t^k\, \mathrm{Prob}\left(V_{(n)} = k\right) .
$$
Consider again the event $\{c_i\}$ that is represented by the term $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$. The words that occur $n$ times are those $i$ for which $c_i = n$. The term $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$ must be replaced by $t^k$, where $k$ is the number of words that occur $n$ times. This replacement will be achieved if we replace $q_i^{c_i}$ by $t$ when $c_i = n$ and by 1 otherwise. Thus, we can use the method leading to Eq. (8) with the function $f(c)$ defined by
$$
f(c) := \begin{cases} t, & c = n; \\ 1, & \text{otherwise.} \end{cases}
$$
We need the function $F(x)$ defined by Eq. (7); we get
$$
F(x) = e^x + (t-1)\, \frac{x^n}{n!} .
$$
Then Eq. (8) yields
$$
g(t) = e^{-N} \prod_{i=1}^{S} \left[e^{N p_i} + (t-1)\, \frac{(N p_i)^n}{n!}\right] = \prod_{i=1}^{S} \left[1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right] . \qquad (9)
$$
This generating function gives, in principle, the complete information about the probability distribution of $V_{(n)}$, the number of words that occur $n$ times in the text:
$$
\mathrm{Prob}\left(V_{(n)} = k\right) = \frac{1}{k!} \left.\frac{\partial^k}{\partial t^k} g(t)\right|_{t=0} .
$$
It is, however, quite difficult to obtain numerical results from this generating function; we would need to compute a derivative of very high order $k$ if we wanted to compute, say, the probability of having exactly 1000 words that occur twice. As an example of easier calculations, let us find $\mathrm{Prob}(V_{(n)} = k)$ for $k = 0$, 1, 2.
$$
\mathrm{Prob}\left(V_{(n)} = 0\right) = g(0) = \prod_{i=1}^{S} \left[1 - \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right] .
$$
$$
\mathrm{Prob}\left(V_{(n)} = 1\right) = g'(0) = g(0) \sum_{i=1}^{S} \frac{1}{\dfrac{n!\, e^{N p_i}}{(N p_i)^n} - 1} .
$$
$$
\mathrm{Prob}\left(V_{(n)} = 2\right) = \frac{1}{2}\, g''(0) = \frac{1}{2}\, g(0) \sum_{i=1}^{S} \sum_{j=1,\, j \neq i}^{S} \frac{1}{\dfrac{n!\, e^{N p_i}}{(N p_i)^n} - 1}\; \frac{1}{\dfrac{n!\, e^{N p_j}}{(N p_j)^n} - 1} .
$$
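Although high-order derivatives of $g(t)$ are awkward analytically, numerically the whole distribution of $V_{(n)}$ is easy to obtain: $g(t)$ is a product of degree-one polynomials in $t$, so its coefficients, which are the probabilities $\mathrm{Prob}(V_{(n)} = k)$, follow from repeated convolution. The sketch below does this for an assumed distribution $p_i$; all inputs are illustrative.

```python
# Sketch: expand g(t) = prod_i (1 + (t-1) pi_i) as a polynomial in t; its
# coefficients are Prob(V_(n) = k).  Here pi_i is the Poisson probability
# that word i occurs exactly n times.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(7)
S, N, n = 500, 5000, 2
p = rng.dirichlet(np.full(S, 0.3))
pi = poisson.pmf(n, N * p)              # pi_i = (N p_i)^n e^{-N p_i} / n!

coeffs = np.array([1.0])                # coefficients of the polynomial, in powers of t
for x in pi:                            # multiply by the factor (1 - x) + x * t
    coeffs = np.convolve(coeffs, [1.0 - x, x])

print("Prob(V_(n) = k) for k = 0..5 :", coeffs[:6])
print("E[V_(n)] from distribution   :", np.sum(np.arange(len(coeffs)) * coeffs))
print("E[V_(n)] = sum_i pi_i        :", pi.sum())
```

The last two printed lines check the identity $E[V_{(n)}] = \sum_i (N p_i)^n e^{-N p_i}/n!$ derived in this section.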

Let us also compute the mean and the variance of $V_{(n)}$ using this generating function (note that by construction $g(1) = 1$):
$$
E\left[V_{(n)}\right] = \sum_{k=0}^{\infty} k\, \mathrm{Prob}\left(V_{(n)} = k\right)
= \sum_{k=1}^{\infty} \frac{1}{(k-1)!} \left.\frac{\partial^k}{\partial t^k} g(t)\right|_{t=0}
= \sum_{j=0}^{\infty} \frac{1}{j!} \left.\frac{\partial^j}{\partial t^j} g'(t)\right|_{t=0} = g'(1)
= \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^n}{n!} .
$$

The variance takes a bit more work: it is convenient to compute first the quantity
$$
E\left[V_{(n)}^2\right] - E\left[V_{(n)}\right] = \sum_{k=0}^{\infty} \left(k^2 - k\right) \mathrm{Prob}\left(V_{(n)} = k\right)
= \sum_{k=2}^{\infty} \frac{1}{(k-2)!} \left.\frac{\partial^k}{\partial t^k} g(t)\right|_{t=0} = g''(1)
$$
$$
= \sum_{i=1}^{S} \sum_{j=1,\, j \neq i}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i}\, \frac{(N p_j)^n}{n!}\, e^{-N p_j}
= \left[\sum_{i=1}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2 - \sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2 .
$$
Hence the variance of $V_{(n)}$ is
$$
E\left[V_{(n)}^2\right] - E\left[V_{(n)}\right]^2 = \sum_{i=1}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i} - \sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2,
$$
in agreement with the general result $\sum_i r_{(n)i}(1 - r_{(n)i})$ obtained earlier, with the binomial probabilities replaced by Poisson ones.

Let us also compute the Good-Turing estimator directly, i.e. as the average probability $p_i$ among the words $i$ that occur exactly $n$ times in the text. This can be computed if we replace the term $q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$ by the expression
$$
\frac{1}{k}\left(p_{i_1} + \ldots + p_{i_k}\right),
$$
where the indices $i_1, \ldots, i_k$ correspond to the words that occur exactly $n$ times, and $k$ is the number of such words in the event $\{c_i\}$. It is not easy to obtain this replacement in one step.
Let us replace $q_1^{c_1} \cdots q_S^{c_S}$ by an expression of the form
$$
e^{u\left(p_{i_1} + \ldots + p_{i_k}\right)}\, t^k,
$$
where $t$, $u$ are new formal parameters. This replacement is of the second type, with the function
$$
f_i(c) = \begin{cases} t\, e^{u p_i}, & c = n; \\ 1, & \text{otherwise.} \end{cases}
$$
The function $F_i(x_i)$ is then
$$
F_i(x_i) = e^{x_i} + \left(t\, e^{u p_i} - 1\right) \frac{x_i^n}{n!},
$$
where we now have to introduce a different $F_i$ for every $x_i$. The substitution in the generating function gives, according to Eq. (8),
$$
g(t, u) = e^{-N} F_1(N p_1) \cdots F_S(N p_S) = \prod_{i=1}^{S} \left[1 + \left(t\, e^{u p_i} - 1\right) \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right] .
$$

Now, by construction this generating function is the sum of terms
$$
p(\{c_i\})\, e^{u\left(p_{i_1} + \ldots + p_{i_k}\right)}\, t^k,
$$
and so we need to transform this function further. Computing the $k$-th derivative with respect to $t$ at $t = 0$, we select only the events where $V_{(n)} = k$. Among these events we need to compute the expectation value of
$$
\frac{1}{k}\left(p_{i_1} + \ldots + p_{i_k}\right)
$$
by taking the derivative with respect to $u$ at $u = 0$. Then we have to sum this value over all $k \geq 1$. Therefore, the computation needs to proceed as follows:
$$
E\left[p_{\text{GT}}\right] = \sum_{k=1}^{\infty} \frac{1}{k} \left.\frac{\partial}{\partial u}\right|_{u=0} \frac{1}{k!} \left.\frac{\partial^k}{\partial t^k}\right|_{t=0} g(t, u) .
$$
We get
$$
\left.\frac{\partial}{\partial u}\right|_{u=0} g(t, u) = g(t) \sum_{i=1}^{S} \frac{t\, p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}},
$$
where $g(t)$ is the generating function for $V_{(n)}$ obtained previously in Eq. (9). Thus
$$
E\left[p_{\text{GT}}\right] = \sum_{k=1}^{\infty} \frac{1}{k!}\, \frac{1}{k} \left.\frac{\partial^k}{\partial t^k}\right|_{t=0} \left[t\, g(t) \sum_{i=1}^{S} \frac{p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}\right] .
$$

The extra factor $1/k$ is inconvenient! We can remove it by noticing that, for any analytic function $h$,
$$
\frac{\partial^k}{\partial t^k}\left[t\, h(t)\right] = k\, \frac{\partial^{k-1}}{\partial t^{k-1}} h(t) + t\, \frac{\partial^k}{\partial t^k} h(t) .
$$
Therefore
$$
\left.\frac{\partial^k}{\partial t^k}\left[t\, h(t)\right]\right|_{t=0} = k \left.\frac{\partial^{k-1}}{\partial t^{k-1}} h(t)\right|_{t=0} .
$$
Also we note that
$$
\int_0^1 h(t)\, dt = \int_0^1 dt \sum_{k=0}^{\infty} \frac{t^k}{k!} \left.\frac{\partial^k h}{\partial t^k}\right|_{t=0}
= \sum_{k=0}^{\infty} \frac{1}{(k+1)!} \left.\frac{\partial^k h}{\partial t^k}\right|_{t=0}
= \sum_{k=1}^{\infty} \frac{1}{k!} \left.\frac{\partial^{k-1} h}{\partial t^{k-1}}\right|_{t=0} .
$$
In this way we can make progress with the expression:
$$
E\left[p_{\text{GT}}\right] = \sum_{k=1}^{\infty} \frac{1}{k!} \left.\frac{\partial^{k-1}}{\partial t^{k-1}}\right|_{t=0} \left[g(t) \sum_{i=1}^{S} \frac{p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}\right]
= \sum_{i=1}^{S} p_i \int_0^1 dt\; g(t)\, \frac{\frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}} .
$$
This expression seems to be difficult to transform further.


We can obtain the following approximations.
1. Note that
$$
\max_x \frac{x^n}{n!}\, e^{-x} = \frac{n^n e^{-n}}{n!} \approx \frac{1}{\sqrt{\pi\left(2n + \tfrac13\right)}} \approx \frac{1}{\sqrt{2\pi n}}, \qquad \text{at } x_{\max} = n .
$$
Therefore the denominator can be expanded as a Taylor series in $(t-1)$, and the first terms can be taken as an approximation.
2. Only some words $i$ will contribute significantly, because their values of $(N p_i)^n e^{-N p_i}/n!$ will be close to the maximum. Other words will have a negligible contribution. (This assumes $n \gg 1$; for $n = 0$ the dominant contribution is from very rare words.) We can estimate the function $x^n e^{-x}/n!$ by a Gaussian,
$$
\frac{x^n}{n!}\, e^{-x} \approx \frac{1}{\sqrt{2\pi n}} \exp\left[-\frac{(x-n)^2}{2n}\right] .
$$

3. If the distribution of word frequencies is smooth and there are many words near the frequency $n/N$, then we can introduce the parameterization
$$
p_i = \frac{n}{N} + \frac{1}{N}\, \frac{i_n - i}{E\left[V_{(n)}\right]}
$$
for $i$ near $i_n$, where $i_n$ is such that $p_{i_n} = n/N$.
4. We can then use this parameterization and the Gaussian approximation near $p_i \approx n/N$ to estimate
$$
\sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^k \approx \frac{1}{\sqrt{k}\, (2\pi n)^{(k-1)/2}}\; E\left[V_{(n)}\right], \qquad k \geq 1 .
$$
5. It follows that
$$
\ln g(t) = \sum_{i=1}^{S} \ln\left[1 - (1-t)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]
= -\sum_{i=1}^{S} \sum_{k=1}^{\infty} \frac{1}{k} \left[(1-t)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^k
$$
$$
\approx -(1-t)\, E\left[V_{(n)}\right] \sum_{k=1}^{\infty} \frac{1}{k^{3/2}} \left(\frac{1-t}{\sqrt{2\pi n}}\right)^{k-1}
= -\sqrt{2\pi n}\; E\left[V_{(n)}\right] \mathrm{Li}_{3/2}\left(\frac{1-t}{\sqrt{2\pi n}}\right) .
$$
For large $n$ or for large $E[V_{(n)}]$, we have $g(0) \ll 1$.
6. Useful analytic approximations are possible only for large $n$ or for large $E[V_{(n)}]$. The parameter $|\ln g(0)|$, which is of order $E[V_{(n)}]$, seems to be important; if this parameter is large, we can approximately evaluate $\int g(t)\, dt$ in the following way. If $g(0) \ll 1$ then
$$
\int_0^1 \frac{dg}{dt}\, dt = g(1) - g(0) \approx g(1) = 1 .
$$
We can write
$$
\frac{dg}{dt} = g(t) \sum_{i=1}^{S} \frac{\frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t-1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}
\approx g(t) \sum_{i=1}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i} = g(t)\, E\left[V_{(n)}\right],
$$
where we disregarded the denominators since their contributions are small under the assumptions (the correction due to the denominators can be estimated). So
$$
1 \approx \int_0^1 \frac{dg}{dt}\, dt \approx E\left[V_{(n)}\right] \int_0^1 g(t)\, dt
$$
and
$$
\int_0^1 g(t)\, dt \approx \frac{1}{E\left[V_{(n)}\right]} .
$$
In a similar way we can estimate
$$
\int_0^1 (1-t)\, g(t)\, dt \approx \left(\frac{1}{E\left[V_{(n)}\right]}\right)^{2} .
$$
Then we can obtain the Good-Turing estimator as in the standard formula, up to corrections of order $1/|\ln g(0)|$.
***

3 Parametric results

Within the probabilistic model, we may reorder the words so that the true probabilities decrease with the word index ($p_{i+1} \leq p_i$). Then we may assume a particular asymptotic form of the true probabilities $p_i$ for very large values of the word index $i$; for instance,
$$
p_i \approx A e^{-B i}
$$
or
$$
p_i \approx A i^{-B} .
$$
Here $A$ and $B$ are the parameters of the chosen model. Consequences of these assumptions will then be immediately checkable against experimental data. We may compute the best values of the parameters and estimate the goodness of fit.***
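As a sketch of how such a parametric check might look in practice, the code below fits the power-law form $p_i \approx A\, i^{-B}$ to a rank-frequency table by a simple least-squares fit in log-log coordinates. This is only an illustration with synthetic data; a more careful analysis would use maximum likelihood or a weighted fit rather than a plain log-log regression.

```python
# Sketch: fit p_i ~ A * i^(-B) to an observed rank-frequency table by least
# squares in log-log space.  Purely illustrative, not a recommended estimator.
import numpy as np

def fit_zipf(counts):
    """counts: iterable of word counts.  Returns (A, B) for p_i ~ A * i**(-B)."""
    counts = np.sort(np.asarray(counts, dtype=float))[::-1]   # decreasing order
    N = counts.sum()
    ranks = np.arange(1, len(counts) + 1)
    mask = counts > 0                                          # log of zero is undefined
    slope, intercept = np.polyfit(np.log(ranks[mask]),
                                  np.log(counts[mask] / N), 1)
    return np.exp(intercept), -slope

# toy usage on synthetic Zipf-like counts
rng = np.random.default_rng(8)
true_B = 1.1
p = np.arange(1, 3001, dtype=float) ** (-true_B)
p /= p.sum()
c = rng.multinomial(50_000, p)
A, B = fit_zipf(c)
print("fitted A, B:", A, B, " (true exponent:", true_B, ")")
```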

