Sergei Winitzki

Given a text corpus of $N$ words in which the word $w_i$ occurs $c_i$ times, the naive estimate of the probability is
$$\Pr(w_i) \approx \frac{c_i}{N}$$
for the word $w_i$.
Is the problem solved? No! The trouble is that most languages have a lot of words that seem to be quite rare. In fact, words like avunculocal are so rare that they may yield an empirical word count of 0 even in a corpus of, say, a million words. It would certainly be incorrect to assume that any words not seen in the corpus have exactly zero probability.
Here is another indication of trouble: any realistic corpus will have quite a few words that are seen exactly once ($c_i = 1$). The naive estimate of the probability for all such words is $1/N$, and this is likely to be quite different from the true probability.
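As a concrete illustration, here is the naive estimator on a toy corpus. This is a minimal sketch; the corpus is just a stand-in for any tokenized text.

```python
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the log".split()
N = len(corpus)                  # total number of word tokens
counts = Counter(corpus)         # empirical word counts c_i

naive = {w: c / N for w, c in counts.items()}
print(naive["the"])              # frequent word: the estimate is usable

# Every word seen exactly once receives the same estimate 1/N, and every
# word absent from the corpus (e.g. "avunculocal") receives exactly 0:
# both values are likely to be far from the true probabilities.
singletons = [w for w, c in counts.items() if c == 1]
print(len(singletons), "words seen exactly once, each estimated as", 1 / N)
```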
We can also compute the total number $R$ of different words. It turns out that $R$ grows with $N$: the larger the text sample, the more new words we find. (Baayen's book says somewhere that $R$ grows roughly in proportion to $N$.)
Theoretically speaking, the number of different words should become constant, and the estimates $c_i/N$ should become precise, once our corpus is large enough. However, experiments show that even text samples of $10^7$, $10^8$, or $10^9$ words are not large enough: $R$ keeps growing, and new rare words keep showing up!
2 Non-parametric results
The non-parametric results depend only on the assumption that the corpus can be described by some probability distribution $\Pr(w)$. There are no assumptions about the particular mathematical shape of this function. (One such assumption would be, say, that the probability of the $n$-th word in the dictionary decreases in proportion to $1/n^\alpha$; the value of $\alpha$ would then be the parameter in a parametric description.)
The language has a finite but extremely large set of words $\{w_1, \ldots, w_S\}$. The limit $S \to \infty$ can be taken if this simplifies calculations.
This is the bag of words model: all words are contained in a large bag. When we draw a random word from the bag, there is a fixed probability $p_i$ for the word $w_i$ to be selected. The text sample is created by drawing the first word randomly from the bag (and returning it to the bag), then drawing the second word, etc.
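The bag-of-words model is easy to simulate. The sketch below draws a sample of $N$ words from an assumed Zipf-like distribution; the shape $p_i \propto 1/i$ is an illustrative assumption, not part of the non-parametric setup.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 10_000                             # vocabulary size (assumed)
p = 1.0 / np.arange(1, S + 1)          # assumed shape: p_i proportional to 1/i
p /= p.sum()

N = 100_000                            # length of the text sample
sample = rng.choice(S, size=N, p=p)    # N independent draws with replacement
c = np.bincount(sample, minlength=S)   # word counts c_i
print("distinct words R =", np.count_nonzero(c))
```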
Given the true probability $\Pr(w_i) = p_i$ of the word $w_i$, we can write down the probability of seeing the word $w_i$ exactly $n$ times in a text sample of length $N$:
$$\Pr(c_i = n) = \binom{N}{n} (1 - p_i)^{N-n}\, p_i^n.$$
The expectation value of the word count is
$$E[c_i] = \sum_{n=0}^{N} n \Pr(c_i = n) = \sum_{n=0}^{N} \frac{n\, N!}{n!\, (N-n)!}\, (1-p_i)^{N-n}\, p_i^n = N p_i \sum_{n=1}^{N} \frac{(N-1)!}{(n-1)!\, (N-n)!}\, (1-p_i)^{N-n}\, p_i^{n-1} = N p_i.$$
The variance is computed similarly:
$$E[c_i^2] - (E[c_i])^2 = \sum_{n=0}^{N} n^2 \Pr(c_i = n) - (N p_i)^2 = \sum_{n=2}^{N} n(n-1) \Pr(c_i = n) + N p_i - (N p_i)^2$$
$$= N p_i - (N p_i)^2 + p_i^2\, N(N-1) \sum_{n=2}^{N} \frac{(N-2)!}{(n-2)!\, (N-n)!}\, (1-p_i)^{N-n}\, p_i^{n-2} = N p_i (1 - p_i).$$
Hence the variance of the estimator $c_i/N$ is
$$\sigma^2[c_i/N] = \frac{p_i (1 - p_i)}{N}.$$
The relative precision of this estimator is
$$\frac{\sigma[c_i]}{E[c_i]} = \sqrt{\frac{1 - p_i}{N p_i}}.$$
This precision is much less than 1 only when $N p_i \gg 1$. So we expect that the estimator $c_i/N$ gives grossly incorrect results for words whose true probability $p_i$ is smaller than $1/N$.
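A quick simulation confirms these moments and shows how the relative precision degrades for a rare word with $N p_i$ of order 1 (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p_i, trials = 10_000, 2e-4, 5_000     # N * p_i = 2: a "rare" word
c = rng.binomial(N, p_i, size=trials)    # word count c_i over many corpora

print(c.mean(), "vs E[c_i] =", N * p_i)
print(c.var(), "vs var[c_i] =", N * p_i * (1 - p_i))
print("relative precision:", np.sqrt((1 - p_i) / (N * p_i)))   # ~ 0.7
```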
2.2 The number of words by count
In an attempt to find a better estimate of the word probabilities, we turn to other statistics. Let us count how many different words have the word count equal to a given number $n$.

Each word $w_i$ (where $i = 1, \ldots, S$) has probability $\Pr(c_i = n)$ of having the word count $n$. Let us denote this probability by $r_{(n)i}$. That is, we define
$$r_{(n)i} \doteq \binom{N}{n} (1 - p_i)^{N-n}\, p_i^n.$$
Now consider the event $V_{(n)} = 1$, i.e. that we have exactly one word with the word count $n$. This event is realized if some word $w_i$ has the count $n$ (this has probability $r_{(n)i}$) while all the other words have different word counts. Thus, here is the probability of having exactly one word with the word count $n$:
$$\Pr(V_{(n)} = 1) = r_{(n)1} (1 - r_{(n)2})(1 - r_{(n)3}) \cdots + (1 - r_{(n)1})\, r_{(n)2}\, (1 - r_{(n)3}) \cdots + \ldots$$
(This calculation is, strictly speaking, incorrect: these events are not independent! However, they can be considered approximately independent when the word counts $n$ are much smaller than the total number $N$ of words in the text.)
The probability of having exactly two words with the word count $n$ is obtained in a similar way, but now we enumerate all pairs of words:
$$\Pr(V_{(n)} = 2) = r_{(n)1}\, r_{(n)2}\, (1 - r_{(n)3})(1 - r_{(n)4}) \cdots + r_{(n)1} (1 - r_{(n)2})\, r_{(n)3}\, (1 - r_{(n)4}) \cdots + \ldots \tag{1}$$
Similarly, we can compute the variance of the word count:
$$E[V_{(n)}^2] - E[V_{(n)}]^2 = \ldots = \sum_{i=1}^{S} r_{(n)i}\, (1 - r_{(n)i}).$$
The variance can be expressed through the expected word count for a twice larger text sample: we note that
$$r_{(n)i}^2(N) = \binom{N}{n}^2 (1 - p_i)^{2N - 2n}\, p_i^{2n} = \frac{\binom{N}{n}^2}{\binom{2N}{2n}}\, r_{(2n)i}(2N),$$
and therefore
$$E[V_{(n)}^2(N)] - E[V_{(n)}(N)]^2 \approx E[V_{(n)}(N)] - \frac{\binom{N}{n}^2}{\binom{2N}{2n}}\, E[V_{(2n)}(2N)]$$
(for $n \ll N$ the ratio of binomial coefficients tends to $\binom{2n}{n}/4^n \approx 1/\sqrt{\pi n}$).
It is interesting to consider also the word count $V_{(0)}$, i.e. the count of words missing from the text sample. For instance, the event $V_{(0)} = s$ means that there are exactly $s$ words that are not present in the text sample. The expectation value of $V_{(0)}$ is
$$E[V_{(0)}] = \sum_{i=1}^{S} r_{(0)i} = \sum_{i=1}^{S} (1 - p_i)^N.$$
This expectation value is approximately equal to the total number of words whose probability is smaller than $N^{-1}$.
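Under an assumed Zipf-like distribution, this expectation is easy to evaluate numerically and to compare with the number of words below the $1/N$ threshold (a sketch; the distribution shape is an assumption):

```python
import numpy as np

S, N = 100_000, 1_000_000
p = 1.0 / np.arange(1, S + 1)       # assumed Zipf-like distribution
p /= p.sum()

EV0 = np.sum((1 - p) ** N)          # E[V_(0)]: expected number of unseen words
print(EV0)
print(np.sum(p < 1 / N))            # number of words with p_i < 1/N: comparable
```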
In a similar way we can compute the joint probability of the counts for two different values $m \ne n$:
$$\Pr(V_{(m)} = 1, V_{(n)} = 1) = \sum_{i \ne j} r_{(m)i}\, r_{(n)j} \prod_{k \ne i;\, k \ne j} \left(1 - r_{(m)k} - r_{(n)k}\right).$$
This expression is similar to Eq. (1), where the sum goes over all the ordered pairs of words. We introduce a generating function
$$G_{(m,n)}(t, u) \doteq \prod_{i=1}^{S} \left(1 - r_{(m)i}\, t - r_{(n)i}\, u\right),$$
so that
$$\Pr(V_{(m)} = q, V_{(n)} = s) = \frac{(-1)^{q+s}}{q!\, s!}\, \frac{\partial^q}{\partial t^q}\bigg|_{t=1} \frac{\partial^s}{\partial u^s}\bigg|_{u=1} G(t, u).$$
Now the covariance of $V_{(m)}$ and $V_{(n)}$ can be computed. We find
$$E[V_{(m)} V_{(n)}] = \sum_{q,s=0}^{N} \Pr(V_{(m)} = q, V_{(n)} = s)\, q s = \frac{\partial^2 G}{\partial t\, \partial u}\bigg|_{t=u=0};$$
$$E[V_{(m)} V_{(n)}] - E[V_{(m)}]\, E[V_{(n)}] = \left(\frac{\partial^2 G}{\partial t\, \partial u} - \frac{\partial G}{\partial t}\, \frac{\partial G}{\partial u}\right)\bigg|_{t=u=0} = \sum_{i \ne j} r_{(m)i}\, r_{(n)j} - \left(\sum_i r_{(m)i}\right)\left(\sum_j r_{(n)j}\right) = -\sum_i r_{(m)i}\, r_{(n)i}.$$
We find that the word counts are slightly (negatively) correlated. The last expression can be rewritten as
$$\sum_i r_{(m)i}\, r_{(n)i} = \binom{N}{m} \binom{N}{n} \sum_i (1 - p_i)^{2N - m - n}\, p_i^{m+n} = \frac{\binom{N}{m} \binom{N}{n}}{\binom{2N}{m+n}}\, E[V_{(m+n)}(2N)].$$
Let us now return to the total number $R$ of different words in the sample, and consider the event that $R = n$. This event occurs when exactly $n$ words have word counts not equal to zero, which is the same as the event $V_{(0)} = S - n$. So we can use the formulas derived in the previous section for this event. We just need to substitute $R = S - V_{(0)}$.

The expectation value of $R$ is computed by
$$E[R] = E[S - V_{(0)}] = S - E[V_{(0)}] = \sum_{i=1}^{S} \left[1 - (1 - p_i)^N\right].$$
The variance of $R$ is
$$E[R^2] - E[R]^2 = E[V_{(0)}^2] - E[V_{(0)}]^2 = \sum_{i=1}^{S} (1 - p_i)^N \left[1 - (1 - p_i)^N\right].$$
It is interesting that this quantity can be expressed through the values of $E[R]$ for $N$ and $2N$:
$$E[R(N)^2] - E[R(N)]^2 = E[R(2N)] - E[R(N)].$$
Consider now the quantity $D \doteq \sum_{n=2}^{N} V_{(n)}\, \frac{n}{N} \frac{n-1}{N-1}$. Its expectation value is
$$E[D] = \sum_{n=2}^{N} E[V_{(n)}]\, \frac{n}{N} \frac{n-1}{N-1} = \sum_{n=2}^{N} \sum_{i=1}^{S} \binom{N}{n} (1 - p_i)^{N-n}\, p_i^n\, \frac{n}{N} \frac{n-1}{N-1}$$
$$= \sum_{i=1}^{S} p_i^2 \left[\sum_{n=2}^{N} \binom{N-2}{n-2} (1 - p_i)^{N-n}\, p_i^{n-2}\right] = \sum_{i=1}^{S} p_i^2.$$
More generally, we may define
$$D^{(k)} \doteq \sum_{n=k+1}^{N} V_{(n)}\, \frac{n}{N}\, \frac{n-1}{N-1} \cdots \frac{n-k}{N-k}, \qquad k \ge 1.$$
The expectation value of $D^{(k)}$ is again independent of the sample size $N$ and depends only on the distribution of the true probabilities, i.e. only on the language:
$$E[D^{(k)}] = \sum_{i=1}^{S} p_i^{k+1}.$$
If actual experiments with large texts show that the quantities $D^{(k)}$ change systematically with $N$ beyond statistical uncertainties, the bag-of-words hypothesis will be disproved! (But it is perhaps not easy to estimate the statistical uncertainty in $D^{(k)}$.)
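This is straightforward to test in simulation. The sketch below computes $D^{(k)}$ directly from the word counts (grouping words by their count reproduces the sum over $V_{(n)}$) and checks that its value stays near $\sum_i p_i^{k+1}$ as $N$ grows; the Zipf-like distribution is again an illustrative assumption.

```python
import numpy as np

def D(counts, N, k):
    # sum over words of c(c-1)...(c-k) / (N(N-1)...(N-k)); grouping the words
    # by their count c reproduces the sum over V_(n) in the definition above
    num = np.ones_like(counts, dtype=float)
    den = 1.0
    for j in range(k + 1):
        num *= counts - j
        den *= N - j
    return num[counts > k].sum() / den

rng = np.random.default_rng(2)
S = 10_000
p = 1.0 / np.arange(1, S + 1)
p /= p.sum()
for N in (10_000, 100_000, 1_000_000):
    c = np.bincount(rng.choice(S, size=N, p=p), minlength=S)
    print(N, D(c, N, 1), "vs", np.sum(p ** 2))   # k = 1 estimates sum p_i^2
```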
Let us compute the expected change in $R$ when a single word is added to the sample:
$$E[R(N)] - E[R(N-1)] = \sum_{i=1}^{S} \left[(1 - p_i)^{N-1} - (1 - p_i)^N\right] = \sum_{i=1}^{S} (1 - p_i)^{N-1}\, p_i.$$
Since
$$E[V_{(1)}(N)] = N \sum_{i=1}^{S} (1 - p_i)^{N-1}\, p_i,$$
we find
$$E[R(N) - R(N-1)] = \frac{1}{N}\, E[V_{(1)}(N)].$$
Typically, V(1) is much smaller than N, so adding a single word to the sample
does not predict an appreciable change in R.
Analogous relationships hold for the quantities $V_{(n)}$. For instance,
$$E[V_{(n)}(N+1)] = \binom{N+1}{n} \sum_{i=1}^{S} (1 - p_i)^{N+1-n}\, p_i^n = \binom{N+1}{n} \sum_{i=1}^{S} (1 - p_i)\, (1 - p_i)^{N-n}\, p_i^n$$
$$= \frac{N+1}{N+1-n}\, E[V_{(n)}(N)] - \frac{n+1}{N+1-n}\, E[V_{(n+1)}(N+1)].$$
In practice, $N$ is a large number, much larger than $n$, but not necessarily much larger than $V_{(n)}(N)$. So we may approximate this relationship by
$$E[V_{(n)}(N+1)] \approx \left(1 + \frac{n}{N}\right) E[V_{(n)}(N)] - \frac{n+1}{N}\, E[V_{(n+1)}(N+1)]. \tag{2}$$
These relationships allow us to predict (in principle!) how much the word counts will change if we add more text to our text sample. If we only have a sample of $N$ words, not a sample of $N+1$ words, we cannot directly estimate $E[V_{(n)}(N+1)]$. Nevertheless, these quantities change very little between $N$ and $N+1$, so we can always estimate $V_{(n)}(N+1)$ through the known values of $V_{(n)}(N)$. The relationship (2) requires $V_{(n+1)}(N+1)$, which again needs to be expressed using the same relationship. So we get an (asymptotic) series in $1/N$:
$$E[V_{(n)}(N+1)] \approx \left(1 + \frac{n}{N}\right) E[V_{(n)}(N)] - \frac{n+1}{N} \left(1 + \frac{n+1}{N}\right) E[V_{(n+1)}(N)] + \frac{(n+1)(n+2)}{N^2} \left(1 + \frac{n+2}{N}\right) E[V_{(n+2)}(N)] + \ldots$$
Since $E[V_{(n)}(N)]$ decreases with growing $n$, while $N$ is large, only the first few terms of this series will give sufficient precision in practice.
We see that the change due to adding one word is something of order $N^{-1}$. An appreciable change can be expected only if the size of the text sample is increased by about $N$ words.
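A minimal sketch of this extrapolation, truncating the series after a few terms (the counts-of-counts dictionary below is a placeholder for values measured on an actual sample):

```python
def extrapolate(V, N, n, terms=3):
    """Approximate E[V_(n)(N+1)] from V_(m)(N), m = n, n+1, ..., using the
    truncated asymptotic series derived above."""
    total, sign, coeff = 0.0, 1.0, 1.0
    for j in range(terms):
        m = n + j
        total += sign * coeff * (1 + m / N) * V.get(m, 0)
        sign = -sign
        coeff *= (m + 1) / N
    return total

V = {1: 5000, 2: 1800, 3: 900}       # placeholder counts-of-counts V_(n)(N)
print(extrapolate(V, N=100_000, n=1))
```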
Suppose we draw one more word $w$ after the text sample of length $N$. What is the probability that this word is unknown, i.e. has not been seen in the sample? What is the expected probability of such an unknown word? And what is the probability that this word is known but is one of the rare words that were seen so far only $n$ times in the text (where $n$ is small)?

The probability that a given word $w_i$ is missing from the text sample is
$$(1 - p_i)^N.$$
Therefore, the probability that the newly drawn word is unknown is
$$\Pr(w\ \text{unknown}) = \sum_{i=1}^{S} p_i\, (1 - p_i)^N = \frac{1}{N+1}\, E[V_{(1)}(N+1)].$$
If we only have the text sample of length $N$, how can we obtain quantities such as $V_{(1)}(N+1)$? By using relationships such as
$$V_{(1)}(N+1) \approx \left(1 + \frac{1}{N}\right) V_{(1)}(N) - \frac{2}{N} \left(1 + \frac{2}{N}\right) V_{(2)}(N) + \ldots$$
The expected value of $p$ for the new unknown word is
$$E[p_w \mid w\ \text{unknown}] = \frac{\sum_{i=1}^{S} p_i^2\, (1 - p_i)^N}{\sum_{i=1}^{S} p_i\, (1 - p_i)^N}.$$
Compare this with the previously derived expressions:
$$E[V_{(2)}(N+2)] = \binom{N+2}{2} \sum_{i=1}^{S} p_i^2\, (1 - p_i)^N.$$
By inspection, we find
$$E[p_w \mid w\ \text{unknown}] = \frac{2}{N+2}\, \frac{E[V_{(2)}(N+2)]}{E[V_{(1)}(N+1)]}.$$
Similarly, there is the probability $r_{(n)i}$ of the new word $w = w_i$ being one of the words that already occurred $n$ times in the existing text sample. Therefore, the answer to the third question is
$$\Pr(w\ \text{occurred}\ n\ \text{times}) = \sum_{i=1}^{S} r_{(n)i}\, p_i = \binom{N}{n} \sum_{i=1}^{S} (1 - p_i)^{N-n}\, p_i^{n+1} = \frac{n+1}{N+1}\, E[V_{(n+1)}(N+1)].$$
It seems useful to record the following calculation:
$$\sum_{i=1}^{S} r_{(n)i}\, p_i^k = \binom{N}{n} \sum_{i=1}^{S} (1 - p_i)^{N-n}\, p_i^{n+k} = \frac{(n+k)!}{n!}\, \frac{N!}{(N+k)!}\, E[V_{(n+k)}(N+k)].$$
What is the average probability of a word that occurred exactly $n$ times in the text, i.e. its conditional expectation value? The event that the word $w_i$ occurs $n$ times has probability $r_{(n)i}$. For different $i$, these events are approximately independent (at least when $n$ is much smaller than $N$; however, it is quite difficult to estimate the precision of the approximation we are making!). So we can compute the expectation value of $p_i$ approximately as
$$E[p_w \mid w\ \text{occurred}\ n\ \text{times}] = \frac{\sum_{i=1}^{S} p_i\, r_{(n)i}}{\sum_{i=1}^{S} r_{(n)i}} = \frac{n+1}{N+1}\, \frac{E[V_{(n+1)}(N+1)]}{E[V_{(n)}(N)]}.$$
The expectation value of $p(w)$ for unknown words is also obtained if we set $n = 0$ in this formula.
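A minimal Good-Turing sketch following this formula, with the expectation values replaced by the observed counts-of-counts (the input file name is hypothetical; any tokenized corpus will do):

```python
from collections import Counter

corpus = open("corpus.txt").read().split()   # hypothetical input file
N = len(corpus)
counts = Counter(corpus)
V = Counter(counts.values())                 # V[n] = number of words seen n times

def good_turing(n):
    # average probability of a word seen n times: (n+1)/(N+1) * V_(n+1)/V_(n)
    if V[n] == 0 or V[n + 1] == 0:
        raise ValueError("counts-of-counts too sparse; smoothing is needed")
    return (n + 1) / (N + 1) * V[n + 1] / V[n]

print("P(word | seen once) ~", good_turing(1))
print("P(w unknown)        ~", V[1] / N)     # total probability of unseen words
```

***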
A difficulty in using the Good-Turing estimator in practice is that we need to know the expectation values $E[V_{(n)}]$, while we only know the empirically observed values $V_{(n)}$. These values are estimators of $E[V_{(n)}]$ with standard deviations of order $\sqrt{E[V_{(n)}]}$. Thus, these empirically observed values cannot be used unless they are quite large. This is typically the case for the values $V_{(1)}$ and $V_{(2)}$ (there are many rare words). However, for many higher values of $n$ we will have $V_{(n)} = 1$ or $V_{(n)} = 0$. The Good-Turing estimator needs to be complemented by a smoothing of the values of $V_{(n)}$. This smoothing does not have a good theoretical basis, it seems.
The Poisson distribution with mean $\lambda$ is defined by
$$p(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}.$$
It is easy to see that $\sum_{k=0}^{\infty} p(k) = 1$.

The generating function of the Poisson distribution is
$$g(t) = \sum_{k=0}^{\infty} p(k)\, t^k = e^{\lambda (t-1)}.$$
Consider the generating function of the joint distribution of all the word counts,
$$G(\{q_i\}) \doteq \sum_{\{c_i\}} p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}. \tag{3}$$
Here all word counts $c_i$ are independently summed over, each $c_i$ going from 0 to $\infty$. Each word count has an independent probability distribution. Therefore the generating function factorizes into the product of the Poisson generating functions for each word count. The distribution of the word count for the word $w_i$ is the Poisson distribution with mean $N p_i$; therefore we find the total generating function as
$$G(\{q_i\}) = \exp\left(N \sum_{i=1}^{S} p_i\, (q_i - 1)\right) = e^{-N} \exp\left(N \sum_{i=1}^{S} p_i\, q_i\right). \tag{4}$$
We would like to transform the generating function
$$G(\{q_i\}) = \sum_{\{c_i\}} p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$$
into the expectation value
$$\sum_{\{c_i\}} p(\{c_i\})\, h(c_1, \ldots, c_S),$$
where $h$ is some chosen function. Different choices of the function $h$ will yield different interesting quantities. The trick is to find an algebraic substitution of $q_i$ in the function $G$ that performs the desired transformation. This algebraic substitution must be a linear map, so that a sum of terms $q_1^{c_1} \cdots q_S^{c_S}$ is mapped again to a sum of terms.
Let us consider two possible choices of $h$ that may be useful. First,
$$h(c_1, \ldots, c_S) = f(c_1) + \ldots + f(c_S) = \sum_{i=1}^{S} f(c_i),$$
where $f$ is an arbitrary, chosen function. The second choice is
$$h(c_1, \ldots, c_S) = f(c_1) \cdots f(c_S) = \prod_{i=1}^{S} f(c_i),$$
where $f$ is again an arbitrary function. We will now find a way to compute such substitutions, and use the generating function (4) as a particular example.
The first substitution requires us to construct a linear map of $q_1^{c_1} \cdots q_S^{c_S}$ into $f(c_1) + \ldots + f(c_S)$. We will use the following trick from linear algebra. There exist a linear operator $T$ acting in some vector space $V$, a vector $v \in V$, and a covector $v^* \in V^*$ such that
$$f(c) = v^*\, T^c\, v, \qquad c = 0, 1, \ldots$$
We then replace each $q_i$ by the operator
$$\hat q_i = 1_V \oplus \cdots \oplus T \oplus \cdots \oplus 1_V,$$
where $T$ acts on the $i$-th copy of $V$ in the direct sum $W \doteq V \oplus \cdots \oplus V$ (and $1_V$ is the identity operator acting in the space $V$). In other words, the operator $\hat q_i$ acts in the space $W$ by applying $T$ only to the $i$-th component of a vector
$$w = v_1 \oplus \cdots \oplus v_S \in W.$$
In matrix form,
$$\hat q_i = \begin{pmatrix} 1_V & & & & \\ & \ddots & & & \\ & & T & & \\ & & & \ddots & \\ & & & & 1_V \end{pmatrix},$$
where the operator $T$ appears in the $i$-th row and $i$-th column. This definition of $\hat q_i$ is designed so that any two $\hat q_i$ commute with each other. We can also express $\hat q_i$ as
$$\hat q_i = 1_W + 0_V \oplus \cdots \oplus (T - 1_V) \oplus \cdots \oplus 0_V. \tag{5}$$
Now, we define the vector $w$ and the covector $w^*$ through the previously defined vector $v$ and covector $v^*$ as
$$w \doteq (v \oplus \cdots \oplus v) \in W, \qquad w^* \doteq (v^* \oplus \cdots \oplus v^*) \in W^*.$$
By construction, we then have
$$w^*\, \hat q_1^{c_1} \cdots \hat q_S^{c_S}\, w = (v^* \oplus \cdots \oplus v^*)\, (T^{c_1} \oplus \cdots \oplus T^{c_S})\, (v \oplus \cdots \oplus v) = v^* T^{c_1} v + \ldots + v^* T^{c_S} v = f(c_1) + \ldots + f(c_S).$$
Therefore, the desired transformation of $G(\{q_i\})$ is implemented by substituting $\hat q_i$ for $q_i$, applying the resulting operator to the vector $w$, and applying the covector $w^*$ to the result (all these applications are linear operations).
In the present case, we do not actually need to find an explicit form of the operators $\hat q_i$ and the vector spaces $V$ and $W$. It is sufficient that the necessary operators and vectors exist. We can then substitute $\hat q_i$ into Eq. (4) and obtain
$$w^*\, G(\{\hat q_i\})\, w = w^*\, e^{-N} \exp\left(N \sum_{i=1}^{S} p_i\, \hat q_i\right) w.$$
The operator under the exponential can be computed more explicitly using Eq. (5):
$$\sum_{i=1}^{S} p_i\, \hat q_i = \sum_{i=1}^{S} \left[p_i\, 1_W + 0_V \oplus \cdots \oplus p_i (T - 1_V) \oplus \cdots \oplus 0_V\right] = 1_W + p_1 (T - 1_V) \oplus \cdots \oplus p_S (T - 1_V),$$
and therefore
$$\exp\left(N \sum_{i=1}^{S} p_i\, \hat q_i\right) = e^{N}\, \exp\left[N p_1 (T - 1_V)\right] \oplus \cdots \oplus \exp\left[N p_S (T - 1_V)\right].$$
Now we apply the covector $w^*$ and the vector $w$ to this and obtain the result,
$$w^*\, G(\{\hat q_i\})\, w = \sum_{i=1}^{S} v^* \exp\left[N p_i (T - 1_V)\right] v = \sum_{i=1}^{S} e^{-N p_i}\, v^* \exp(N p_i T)\, v$$
$$= \sum_{i=1}^{S} e^{-N p_i} \sum_{c=0}^{\infty} \frac{(N p_i)^c}{c!}\, v^* T^c v = \sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f(c) = \sum_{c=0}^{\infty} \sum_{i=1}^{S} p_{\text{Poisson}}[N p_i; c]\, f(c). \tag{6}$$
This is equal to the average of $f(c)$ with the Poisson distribution of word counts $c$, added independently for each word ($i = 1, \ldots, S$). Perhaps we could have obtained this result faster if we had thought right away about the independence of the word counts! However, the trick with the generating function can be used also for distributions where there is no independence (e.g. with the bag-of-words model).
Now consider the second possibility: we need to replace $q_1^{c_1} \cdots q_S^{c_S}$ by the product $f(c_1) \cdots f(c_S)$. Now $q_i$ must be replaced by the operators
$$\hat q_i = \frac{\partial}{\partial x_i},$$
where $x_i$ ($i = 1, \ldots, S$) are new formal parameters. Define the auxiliary function $F$ by
$$F(x) \doteq f(0) + x f(1) + \frac{x^2}{2!} f(2) + \ldots = \sum_{c=0}^{\infty} f(c)\, \frac{x^c}{c!}; \tag{7}$$
we assume that this function is analytic at least for some $x$. Then by construction we will have
$$\frac{\partial^c}{\partial x^c}\bigg|_{x=0} F(x) = f(c)$$
and so
$$\frac{\partial^{c_1}}{\partial x_1^{c_1}}\bigg|_{x_1=0} \cdots \frac{\partial^{c_S}}{\partial x_S^{c_S}}\bigg|_{x_S=0} \left[F(x_1) F(x_2) \cdots F(x_S)\right] = f(c_1) \cdots f(c_S).$$
Now let us perform this substitution in the generating function (4). Using the fact that
$$\exp\left(a \frac{\partial}{\partial x}\right) F(x)\bigg|_{x=0} = F(a)$$
for analytic functions $F$, we obtain
$$G\left(\left\{\frac{\partial}{\partial x_i}\right\}\right) \left[F(x_1) F(x_2) \cdots F(x_S)\right]\bigg|_{x_i=0} = e^{-N} \exp\left(N \sum_{i=1}^{S} p_i \frac{\partial}{\partial x_i}\right) \left[F(x_1) F(x_2) \cdots F(x_S)\right]\bigg|_{x_i=0} = e^{-N}\, F(N p_1) \cdots F(N p_S). \tag{8}$$
To compute $E[V_{(n)}]$, we look at one term, $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$, in the generating function $G$. This term describes the event that each word $i$ has the count $c_i$ in the text. When this event occurs, the number of words that occur $n$ times is equal to the number of $q$'s whose power is equal to $n$. This quantity (the number of words that occur exactly $n$ times) can be expressed as $f(c_1) + \ldots + f(c_S)$, where the function $f$ is defined by
$$f(c) = \begin{cases} 1, & c = n; \\ 0, & \text{otherwise.} \end{cases}$$
Therefore, we apply the method leading to Eq. (6). The average is then computed as
$$E[V_{(n)}] = \sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f(c) = \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^n}{n!}.$$
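Numerically, this formula agrees well with the counts-of-counts measured on a simulated sample (the Poisson model approximates the fixed-$N$ sample; the Zipf-like $p_i$ are an assumption):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
S, N = 10_000, 100_000
p = 1.0 / np.arange(1, S + 1)
p /= p.sum()

c = np.bincount(rng.choice(S, size=N, p=p), minlength=S)
for n in (1, 2, 3):
    expected = poisson.pmf(n, N * p).sum()   # E[V_(n)] = sum_i e^{-Np_i}(Np_i)^n/n!
    observed = np.count_nonzero(c == n)      # V_(n) in the simulated sample
    print(n, round(expected, 1), observed)
```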
Let us compute the expected total probability $p_{\text{total}}(n)$ contained in all the words that occurred exactly $n$ times. This is obtained if we transform each term $q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$ into the term $p_{i_1} + \ldots + p_{i_k}$, where $i_1, \ldots, i_k$ are the words that occurred exactly $n$ times. This kind of replacement is possible if we use a different $f_i(c)$ for each word $i$. The formula (6) will then be modified to
$$\sum_{c=0}^{\infty} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^c}{c!}\, f_i(c).$$
Therefore, taking $f_i(c) = p_i$ for $c = n$ and $f_i(c) = 0$ otherwise,
$$E[p_{\text{total}}(n)] = \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^n}{n!}\, p_i.$$
We can express this through the $E[V_{(n)}]$ that we computed just previously:
$$E[p_{\text{total}}(n)] = \frac{n+1}{N} \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^{n+1}}{(n+1)!} = \frac{n+1}{N}\, E[V_{(n+1)}].$$
In particular,
$$E[p_{\text{total}}(0)] = \frac{1}{N}\, E[V_{(1)}].$$
The Good-Turing estimator is motivated by these expressions: if we found $V_{(n)}$ words that occur $n$ times in our sample, and if we expect that the total probability in these words is $E[p_{\text{total}}(n)]$, then we can estimate the probability for these words as
$$p \approx \frac{1}{V_{(n)}}\, E[p_{\text{total}}(n)] = \frac{n+1}{N}\, \frac{E[V_{(n+1)}]}{V_{(n)}} \approx \frac{n+1}{N}\, \frac{V_{(n+1)}}{V_{(n)}}.$$
Consider again the event $\{c_i\}$ that is represented by the term $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$. The words that occur $n$ times are those $i$ for which $c_i = n$. The term $p(\{c_i\})\, q_1^{c_1} \cdots q_S^{c_S}$ must be replaced by $t^k$, where $k$ is the number of words that occur $n$ times. This replacement will be achieved if we replace $q_i^{c_i}$ by $t$ when $c_i = n$ and by 1 otherwise. Thus, we can use the method leading to Eq. (8) with the function $f(c)$ defined by
$$f(c) \doteq \begin{cases} t, & c = n; \\ 1, & \text{otherwise,} \end{cases}$$
for which
$$F(x) = e^x + (t - 1)\, \frac{x^n}{n!}.$$
Then Eq. (8) yields
$$g(t) = e^{-N} \prod_{i=1}^{S} \left[e^{N p_i} + (t - 1)\, \frac{(N p_i)^n}{n!}\right] = \prod_{i=1}^{S} \left[1 + (t - 1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]. \tag{9}$$
This generating function gives, in principle, the complete information about the probability distribution of $V_{(n)}$, the number of words that occur $n$ times in the text:
$$\text{Prob}\left[V_{(n)} = k\right] = \frac{1}{k!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} g(t).$$
It is, however, quite difficult to obtain numerical results from this generating function; we would need to compute a derivative of very high order $k$ if we wanted to compute, say, the probability of having exactly 1000 words that occur twice. As an example of easier calculations, let us find $\text{Prob}[V_{(n)} = k]$ for $k = 0$, 1, 2.
$$\text{Prob}\left[V_{(n)} = 0\right] = g(0) = \prod_{i=1}^{S} \left[1 - \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right].$$
$$\text{Prob}\left[V_{(n)} = 1\right] = g'(0) = g(0) \sum_{i=1}^{S} \frac{1}{\dfrac{n!\, e^{N p_i}}{(N p_i)^n} - 1}.$$
$$\text{Prob}\left[V_{(n)} = 2\right] = \frac{1}{2}\, g''(0) = \frac{g(0)}{2} \sum_{i=1}^{S} \sum_{j=1,\, j \ne i}^{S} \frac{1}{\dfrac{n!\, e^{N p_i}}{(N p_i)^n} - 1}\; \frac{1}{\dfrac{n!\, e^{N p_j}}{(N p_j)^n} - 1}.$$
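These three probabilities are easy to evaluate numerically. The sketch below does so for an assumed Zipf-like distribution, choosing $n$ large enough that $E[V_{(n)}]$ is of order one and the probabilities are non-trivial:

```python
import numpy as np
from scipy.stats import poisson

S, N, n = 10_000, 100_000, 500
p = 1.0 / np.arange(1, S + 1)
p /= p.sum()
a = poisson.pmf(n, N * p)          # a_i = (N p_i)^n e^{-N p_i} / n!

P0 = np.prod(1 - a)                # Prob[V_(n) = 0] = g(0)
b = a / (1 - a)                    # terms 1 / (n! e^{N p_i} / (N p_i)^n - 1)
P1 = P0 * b.sum()                  # Prob[V_(n) = 1] = g'(0)
P2 = P0 * (b.sum() ** 2 - (b ** 2).sum()) / 2   # Prob[V_(n) = 2] = g''(0)/2
print(P0, P1, P2)
```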
Let us also compute the mean and the variance of $V_{(n)}$ using this generating function (note that by construction $g(1) = 1$):
$$E[V_{(n)}] = \sum_{k=0}^{\infty} k\, \text{Prob}\left[V_{(n)} = k\right] = \sum_{k=1}^{\infty} \frac{1}{(k-1)!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} g(t) = \sum_{j=0}^{\infty} \frac{1}{j!}\, \frac{\partial^j}{\partial t^j}\bigg|_{t=0} g'(t) = g'(1) = \sum_{i=1}^{S} e^{-N p_i}\, \frac{(N p_i)^n}{n!}.$$
The variance takes a bit more work: it is convenient to compute first the quantity
$$E[V_{(n)}^2] - E[V_{(n)}] = \sum_{k=0}^{\infty} (k^2 - k)\, \text{Prob}\left[V_{(n)} = k\right] = \sum_{k=2}^{\infty} \frac{1}{(k-2)!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} g(t) = g''(1)$$
$$= \sum_{i=1}^{S} \sum_{j=1,\, j \ne i}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i}\; \frac{(N p_j)^n}{n!}\, e^{-N p_j} = \left[\sum_{i=1}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2 - \sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2.$$
Therefore
$$E[V_{(n)}^2] - E[V_{(n)}]^2 = \sum_{i=1}^{S} \frac{(N p_i)^n}{n!}\, e^{-N p_i} - \sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^2.$$
Let us also compute the Good-Turing estimator directly, i.e. as the average probability $p_i$ among the words $i$ that occur exactly $n$ times in the text. This can be computed if we replace the term $q_1^{c_1} \cdots q_S^{c_S}$ in the generating function $G$ by the expression
$$\frac{1}{k}\, (p_{i_1} + \ldots + p_{i_k}),$$
where the indices $i_1, \ldots, i_k$ correspond to the words that occur exactly $n$ times, and $k$ is the number of such words in the event $\{c_i\}$. It is not easy to obtain this replacement in one step.
Let us first replace $q_1^{c_1} \cdots q_S^{c_S}$ by an expression of the form
$$e^{u (p_{i_1} + \ldots + p_{i_k})}\, t^k,$$
where $t, u$ are new formal parameters. This replacement is of the second type, with the functions
$$f_i(c) = \begin{cases} t\, e^{u p_i}, & c = n; \\ 1, & \text{otherwise,} \end{cases}$$
$$F_i(x_i) = e^{x_i} + \left(t\, e^{u p_i} - 1\right) \frac{x_i^n}{n!},$$
where we now have to introduce a different $F_i$ for every $x_i$. The substitution in the generating function gives, according to Eq. (8),
$$g(t, u) = e^{-N}\, F_1(N p_1) \cdots F_S(N p_S) = \prod_{i=1}^{S} \left[1 + \left(t\, e^{u p_i} - 1\right) \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right].$$
We can now extract the desired expression
$$\frac{1}{k}\, (p_{i_1} + \ldots + p_{i_k})$$
by taking the derivative with respect to $u$ at $u = 0$ and dividing by $k$. Then we have to sum this value over all $k \ge 1$. Therefore, the computation needs to proceed as follows:
$$E[p_{GT}] = \sum_{k=1}^{\infty} \frac{1}{k}\, \frac{\partial}{\partial u}\bigg|_{u=0}\, \frac{1}{k!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0}\, g(t, u).$$
We get
$$\frac{\partial}{\partial u}\bigg|_{u=0} g(t, u) = g(t) \sum_{i=1}^{S} \frac{t\, p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t - 1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}},$$
where $g(t)$ is the generating function for $V_{(n)}$ obtained previously in Eq. (9). Thus
$$E[p_{GT}] = \sum_{k=1}^{\infty} \frac{1}{k!\, k}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} \left[t\, g(t) \sum_{i=1}^{S} \frac{p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t - 1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}\right].$$
The extra factor $1/k$ is inconvenient! We can remove it by noticing that, for any analytic function $h$,
$$\frac{\partial^k}{\partial t^k}\, [t\, h(t)] = k\, \frac{\partial^{k-1}}{\partial t^{k-1}}\, h(t) + t\, \frac{\partial^k}{\partial t^k}\, h(t),$$
and therefore
$$\frac{\partial^k}{\partial t^k}\bigg|_{t=0} [t\, h(t)] = k\, \frac{\partial^{k-1}}{\partial t^{k-1}}\bigg|_{t=0} h(t).$$
Also we note that
$$\int_0^1 h(t)\, dt = \int_0^1 dt \sum_{k=0}^{\infty} \frac{t^k}{k!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} h = \sum_{k=0}^{\infty} \frac{1}{(k+1)!}\, \frac{\partial^k}{\partial t^k}\bigg|_{t=0} h = \sum_{k=1}^{\infty} \frac{1}{k!}\, \frac{\partial^{k-1}}{\partial t^{k-1}}\bigg|_{t=0} h.$$
Therefore
$$E[p_{GT}] = \sum_{k=1}^{\infty} \frac{1}{k!}\, \frac{\partial^{k-1}}{\partial t^{k-1}}\bigg|_{t=0} \left[g(t) \sum_{i=1}^{S} \frac{p_i\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t - 1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}\right] = \sum_{i=1}^{S} p_i \int_0^1 dt\; g(t)\, \frac{\frac{(N p_i)^n}{n!}\, e^{-N p_i}}{1 + (t - 1)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}}.$$
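This integral representation can be evaluated by straightforward numerical quadrature. A sketch, using the same assumed Zipf-like distribution as before, and comparing with the simpler estimate $(n+1)/N \cdot E[V_{(n+1)}]/E[V_{(n)}]$:

```python
import numpy as np
from scipy.stats import poisson

S, N, n = 2_000, 100_000, 2
p = 1.0 / np.arange(1, S + 1)
p /= p.sum()
a = poisson.pmf(n, N * p)                 # a_i = (N p_i)^n e^{-N p_i} / n!

t = np.linspace(0.0, 1.0, 1001)
denom = 1.0 + np.outer(t - 1.0, a)        # 1 + (t - 1) a_i, shape (t, i)
g = denom.prod(axis=1)                    # g(t) of Eq. (9)

w = np.full_like(t, t[1] - t[0])          # trapezoidal quadrature weights
w[0] = w[-1] = w[0] / 2
integral = (w[:, None] * g[:, None] * a / denom).sum(axis=0)
print("integral form:", (p * integral).sum())
print("simple form:  ", (n + 1) / N * poisson.pmf(n + 1, N * p).sum() / a.sum())
```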
This integral can be evaluated approximately; the following observations are useful.

1. The function $\frac{x^n}{n!}\, e^{-x}$ has a sharp maximum:
$$\max_x \frac{x^n}{n!}\, e^{-x} \approx \frac{1}{\sqrt{2\pi n}}, \qquad \text{attained at } x_{\max} = n.$$

2. Near this maximum, the function is well approximated by the Gaussian
$$\frac{1}{\sqrt{2\pi n}}\, \exp\left[-\frac{(x - n)^2}{2n}\right].$$

3. The true probabilities near $p_i \approx n/N$ can be parameterized as
$$p_i = \frac{n}{N} + \frac{1}{N}\, \frac{i_n - i}{E[V_{(n)}]}$$
for $i$ near $i_n$, where $i_n$ is such that $p_{i_n} = n/N$.
4. We can then use this parameterization and the Gaussian approximation near $p_i \approx n/N$ to estimate
$$\sum_{i=1}^{S} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^k \approx \frac{E[V_{(n)}]}{\sqrt{k}\, (2\pi n)^{(k-1)/2}}, \qquad k \ge 1.$$
5. It follows that
$$\ln g(t) = \sum_{i=1}^{S} \ln\left[1 - (1 - t)\, \frac{(N p_i)^n}{n!}\, e^{-N p_i}\right] = -\sum_{i=1}^{S} \sum_{k=1}^{\infty} \frac{(1 - t)^k}{k} \left[\frac{(N p_i)^n}{n!}\, e^{-N p_i}\right]^k$$
$$\approx -\sum_{k=1}^{\infty} \frac{(1 - t)^k}{k}\, \frac{E[V_{(n)}]}{\sqrt{k}\, (2\pi n)^{(k-1)/2}} = -\sqrt{2\pi n}\; E[V_{(n)}]\; \mathrm{Li}_{3/2}\!\left(\frac{1 - t}{\sqrt{2\pi n}}\right).$$
For large $n$ or for large $E[V_{(n)}]$, we have $g(0) \ll 1$.

6. Useful analytic approximations are possible only for large $n$ or for large $E[V_{(n)}]$. The parameter
$$|\ln g(0)| \approx E[V_{(n)}]$$
seems to be important; if this parameter is large, we can approximately evaluate
$$1 - g(0) = \int_0^1 \frac{dg}{dt}\, dt \approx E[V_{(n)}] \int_0^1 g(t)\, dt,$$
where we disregarded the denominators, since their contributions are small under these assumptions (the correction due to the denominators can be estimated). So
$$\int_0^1 g(t)\, dt \approx \frac{1}{E[V_{(n)}]}.$$
In a similar way we can estimate
$$\int_0^1 (1 - t)\, g(t)\, dt \approx \left(\frac{1}{E[V_{(n)}]}\right)^2.$$
Then we can obtain the Good-Turing estimator as in the standard formula, up to corrections of order $1 / |\ln g(0)|$.
***
3 Parametric results
Within the probabilistic model, we may reorder the words so that the true probabilities decrease with the word index ($p_{i+1} \le p_i$). Then we may assume a particular asymptotic form of the true probabilities $p_i$ for very large values of the word index $i$; for instance,
$$p_i \approx A\, e^{-B i}$$
or
$$p_i \approx A\, i^{-B}.$$
Here $A$ and $B$ are the parameters of the chosen model. Consequences of these assumptions will then be immediately checkable against experimental data. We may compute the best values of the parameters and estimate the goodness of fit.
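For the power-law model, a simple (if crude) way to obtain $A$ and $B$ is linear regression of log-frequency against log-rank. A sketch follows; the input file is hypothetical, and ordinary least squares on log-log data is only one of many possible fitting procedures:

```python
import numpy as np
from collections import Counter

corpus = open("corpus.txt").read().split()        # hypothetical input file
freqs = np.array(sorted(Counter(corpus).values(), reverse=True), dtype=float)
N = freqs.sum()
ranks = np.arange(1, len(freqs) + 1)

# model: log p_i = log A - B log i, fitted by least squares
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs / N), 1)
print("A =", np.exp(intercept), "B =", -slope)
```

***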