Latent Data, Complete Likelihoods and the EM Algorithm
(TA Session, Econ 319, 4/15/02)
1 Motivation: Maximum Likelihood
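Definition 1 A statistical model is a pair $(\mathcal{Y}, \mathcal{P})$ where $\mathcal{Y}$ is the set of possible observations and $\mathcal{P}$ is a family of probability distributions on $\mathcal{Y}$.
We will consider models where $\mathcal{P}$ is indexed by a parameter vector $\theta \in \Theta$:
$$\mathcal{P} = \{P(\cdot \mid \theta),\ \theta \in \Theta\} \tag{1}$$
With the usual abuse of notation, we denote by $y$ both a given element of $\mathcal{Y}$ and a random variable with a distribution on $\mathcal{Y}$.
Definition 2 The sample distribution of $y$ is the distribution of $y$ conditional on a given value of $\theta$,
$$y \sim P(y \mid \theta), \quad \theta \in \Theta. \tag{2}$$
Notation: $P(y \mid \theta)$ denotes the c.d.f. of $y$ given $\theta$. Let $p(y \mid \theta)$ denote the p.d.f. of $y$ given $\theta$, that is, the function that integrates to $P(y \mid \theta)$.
Definition 3 The likelihood function $L(\theta)$ is equal to the sample distribution $p(y \mid \theta)$ seen as a function of $\theta$. The loglikelihood function is defined as
$$\ell(\theta) = \log L(\theta) \tag{3}$$
Definition 4 The maximum likelihood estimator $\hat\theta_{ML}$ of $\theta$ is
$$\hat\theta_{ML} = \arg\max_{\theta \in \Theta} \ell(\theta) \tag{4}$$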
Suppose we assume that there is a true parameter vector $\theta_0 \in \Theta$ and that the observed data has been generated by $P(y \mid \theta_0)$. Denote by $\hat\theta_{n,ML}$ the ML estimator for a sample of size $n$ from $P(y \mid \theta_0)$. We know that under suitable regularity conditions (typically identification of $\theta_0$, continuity of $\ell(\theta)$, and uniform convergence of the average loglikelihood function to a limiting function which has its maximum at $\theta_0$), the ML estimator is consistent, that is:
$$\operatorname*{plim}_{n \to \infty} \hat\theta_{n,ML} = \theta_0. \tag{5}$$
Under a few added regularity conditions we have:
$$\sqrt{n}\,(\hat\theta_{n,ML} - \theta_0) \xrightarrow{d} N\left(0,\ I(\theta_0)^{-1}\right), \tag{6}$$
where
$$I(\theta_0) = -E\left[\frac{\partial^2 \log p(y \mid \theta_0)}{\partial\theta\,\partial\theta'}\right] \tag{7}$$
is the Fisher information matrix. Recall that the asymptotic covariance matrix of the ML estimator attains the Cramér-Rao lower bound, so the ML estimator is asymptotically efficient; it is also asymptotically unbiased, and MLE is invariant to reparameterization of the model. (See Newey-McFadden (1994) for asymptotic theory of extremum estimators, including maximum likelihood.)
For most econometric models finding the ML estimator analytically is not feasible and one has to resort to numerical optimization. The classical numerical procedure to optimize likelihood functions is the Newton-Raphson algorithm, which you probably have already encountered (if not, see Amemiya p. 137-139 or Greene p. 188-191). When $\ell(\theta)$ is known to be globally concave Newton-Raphson will typically converge very fast. For quadratic loglikelihoods it converges in one iteration and for loglikelihoods which are approximately quadratic it often converges in fewer than 10 iterations.
However, in complex high-dimensional models Newton-Raphson is less attractive. First, it requires the score and inverse Hessian at each iteration, which can be very cumbersome to compute. Second, verifying global concavity of the loglikelihood can be difficult, and if concavity fails at some points in the parameter space the algorithm may step to a value which decreases the likelihood function. Third, there is no guarantee that we converge to a global maximum. Of course this is so for any optimization algorithm.
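To make the remark about quadratic loglikelihoods concrete, here is a minimal sketch of a scalar Newton-Raphson iteration in Python. It is not part of the original session: the function name `newton_raphson`, the toy data and the two models (a normal mean with unit variance, and an exponential rate) are our own illustrative assumptions.

```python
def newton_raphson(score, hessian, theta, tol=1e-10, max_iter=50):
    """Scalar Newton-Raphson: theta <- theta - score(theta) / hessian(theta).

    Stops as soon as the score at the updated value is numerically zero and
    returns the final value together with the number of updates performed.
    """
    for it in range(1, max_iter + 1):
        theta = theta - score(theta) / hessian(theta)
        if abs(score(theta)) < tol:
            return theta, it
    return theta, max_iter

y = [1.2, 0.7, 2.9, 1.6, 2.1]   # made-up data, for illustration only
n = len(y)

# Quadratic case: y_i ~ N(mu, 1), so l(mu) = -0.5 * sum((y_i - mu)^2) + const.
mu_hat, iters_quadratic = newton_raphson(
    score=lambda mu: sum(yi - mu for yi in y),
    hessian=lambda mu: -float(n),
    theta=0.0,
)

# Globally concave but not quadratic: y_i ~ Exponential(rate),
# l(rate) = n * log(rate) - rate * sum(y), maximized at rate = n / sum(y).
rate_hat, iters_exponential = newton_raphson(
    score=lambda r: n / r - sum(y),
    hessian=lambda r: -n / r ** 2,
    theta=0.3,
)

print(mu_hat, iters_quadratic)
print(rate_hat, iters_exponential)
```

In the quadratic case a single update lands on the sample mean; in the exponential case a handful of updates is needed, and a poor starting value (here, one above 2n/sum(y)) would send the iteration outside the parameter space.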
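Example 5 The Tobit model. Suppose
$$y_i^* = \beta' x_i + \varepsilon_i, \quad i = 1, \dots, n, \tag{8}$$
where $\beta \in \mathbb{R}^K$, $x_i$ is a $K$-vector of exogenous covariates and $\varepsilon_i$ is i.i.d. normal with mean zero and variance $\sigma^2$, but we observe $y_i$ given by
$$y_i = \begin{cases} y_i^* & \text{if } y_i^* > 0, \\ 0 & \text{if } y_i^* \le 0. \end{cases} \tag{9}$$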
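The p.d.f. of $y_i$ conditional on $\theta = (\beta, \sigma^2)$ is
$$p(y_i \mid \theta) = \Phi(-\beta' x_i/\sigma)\, I(y_i = 0) + \phi(y_i - \beta' x_i, \sigma)\, I(y_i > 0), \tag{10}$$
where $\Phi$ is the standard normal c.d.f. and $\phi(\cdot, \sigma)$ is the p.d.f. of a normal distribution with mean zero and standard deviation $\sigma$, and by independence
$$p(y \mid \theta) = \prod_{i=1}^n \left\{ \Phi(-\beta' x_i/\sigma)\, I(y_i = 0) + \phi(y_i - \beta' x_i, \sigma)\, I(y_i > 0) \right\}. \tag{11}$$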
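As an illustration of (11), the following sketch evaluates the Tobit loglikelihood in Python using only the standard library. The function name `tobit_loglik`, the list-based data layout and the small data set are assumptions made for this example, not something taken from the session.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: .cdf is Phi, .pdf is its density

def tobit_loglik(beta, sigma, y, x):
    """Observed-data loglikelihood, i.e. the log of (11).

    beta: list of K coefficients; x: list of K-vectors; y: censored outcomes.
    """
    total = 0.0
    for yi, xi in zip(y, x):
        mean_i = sum(b * v for b, v in zip(beta, xi))   # beta' x_i
        if yi > 0:
            # Uncensored: normal density of y_i with mean beta' x_i, s.d. sigma.
            z = (yi - mean_i) / sigma
            total += math.log(STD_NORMAL.pdf(z) / sigma)
        else:
            # Censored at zero: P(y_i* <= 0) = Phi(-beta' x_i / sigma).
            total += math.log(STD_NORMAL.cdf(-mean_i / sigma))
    return total

# Made-up data: an intercept plus one covariate.
x = [[1.0, -1.0], [1.0, 0.0], [1.0, 0.5], [1.0, 2.0]]
y = [0.0, 0.3, 0.0, 2.4]
value = tobit_loglik([0.2, 1.0], 1.0, y, x)
print(value)
```

Maximizing this function directly over $\beta$ (and $\sigma$) is the problem that Newton-Raphson would be applied to.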
2 Models with Latent Data
Many econometric models are formulated using latent variables and the implied likelihood function is found by integrating out the latent variables from the model.
Denote by $y^*$ the vector of latent data. The latent data will be related to the observed data $y$ by an observation rule:
$$y = \tau(y^*), \tag{12}$$
where $\tau$ is a non-invertible known function. As an example we have for the Tobit model $\tau(y_i^*) = I(y_i^* > 0)\, y_i^*$.
Definition 6 The complete-data likelihood function $L^*(\theta)$ is equal to the sample distribution of the latent data $p(y^* \mid \theta)$ seen as a function of $\theta$. The complete-data loglikelihood function is $\ell^*(\theta) = \log L^*(\theta)$.
The complete-data likelihood is not of much direct use since the latent data are not observed. Where the complete-data likelihood comes into use is in the design of algorithms to estimate $\theta$.
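Note that the distribution of $y$ given $y^*$ is degenerate:
$$p(y \mid y^*) = \begin{cases} 1 & \text{if } y = \tau(y^*) \\ 0 & \text{if } y \ne \tau(y^*) \end{cases} \tag{13}$$
Let $p(y, y^* \mid \theta)$ be the joint distribution of observed and latent data. This is
$$p(y, y^* \mid \theta) = p(y \mid y^*)\, p(y^* \mid \theta) = \begin{cases} p(y^* \mid \theta) & \text{if } y = \tau(y^*) \\ 0 & \text{if } y \ne \tau(y^*) \end{cases} \tag{14}$$
So we can write
$$p(y, y^* \mid \theta) = p(y^* \mid \theta)\, I(y = \tau(y^*)) \tag{15}$$
and the observed data likelihood is
$$p(y \mid \theta) = \int p(y, y^* \mid \theta)\, dy^* = \int p(y^* \mid \theta)\, I(y = \tau(y^*))\, dy^* \tag{16}$$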
To do ML estimation we need to maximize (the log of) $p(y \mid \theta)$. This is often too hard to do directly. The EM algorithm is based on replacing one difficult maximization with a sequence of usually easier maximization problems using the latent data structure.
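The step from latent to observed data can be checked by simulation in the Tobit case. The sketch below (our own illustration; the helper `observe` and the numerical values are assumptions) draws latent data, applies the observation rule, and compares the share of censored observations with the probability $\Phi(-\mu/\sigma)$ obtained by integrating the latent density over $y^* \le 0$.

```python
import random
from statistics import NormalDist

def observe(y_star):
    """Observation rule of the Tobit model: y = I(y* > 0) * y*."""
    return y_star if y_star > 0 else 0.0

random.seed(0)
mu, sigma, draws = 0.5, 1.0, 200_000   # mu plays the role of beta' x_i

# Draw latent data y* ~ N(mu, sigma^2) and map each draw to observed data.
y = [observe(random.gauss(mu, sigma)) for _ in range(draws)]

share_censored = sum(1 for v in y if v == 0.0) / draws
prob_censored = NormalDist().cdf(-mu / sigma)   # integral of p(y*) over y* <= 0

print(share_censored, prob_censored)
```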
3 The EM Algorithm
In order to derive the EM (Expectation-Maximization) algorithm, first note that by definition
$$p(y^* \mid y, \theta) = \frac{p(y, y^* \mid \theta)}{p(y \mid \theta)} = \frac{p(y^* \mid \theta)\, I(y = \tau(y^*))}{p(y \mid \theta)} \tag{17}$$
This is the p.d.f. of the distribution of the latent data $y^*$ conditional on $\theta$ and the observed data.
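Now define
Definition 7
$$Q(\theta \mid \theta_0, y) \equiv \int \log p(y^* \mid \theta)\, p(y^* \mid y, \theta_0)\, dy^* = \int \ell^*(\theta)\, p(y^* \mid y, \theta_0)\, dy^*,$$
the expected value of the complete-data loglikelihood function where the expectation is with respect to the distribution in (17) using $\theta = \theta_0$. Here $\theta_0$ denotes an arbitrary reference value of the parameter (in the algorithm, the current iterate), not necessarily the true parameter of Section 1.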
The basic result underlying the EM algorithm is the following lemma:
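Lemma 8
$$Q(\theta_1 \mid \theta_0, y) \ge Q(\theta_2 \mid \theta_0, y) \implies \ell(\theta_1) \ge \ell(\theta_2)$$
This lemma says that any $\theta$ which increases the Q function also increases the original likelihood function. The final step of the proof relies on $\theta_2 = \theta_0$, which is the case the EM algorithm uses: the comparison point is the value at which the expectation in $Q$ is taken.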
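Proof. From (17) we have
$$\log p(y \mid \theta) = \log p(y, y^* \mid \theta) - \log p(y^* \mid \theta, y) \tag{18}$$
Taking expectations on both sides with respect to the distribution $p(y^* \mid y, \theta_0)$ defined in (17), and noting that $\log p(y \mid \theta) = \ell(\theta)$ does not depend on $y^*$, we have
$$\begin{aligned} \ell(\theta) &= \int \log p(y, y^* \mid \theta)\, p(y^* \mid \theta_0, y)\, dy^* - \int \log p(y^* \mid \theta, y)\, p(y^* \mid \theta_0, y)\, dy^* \\ &= Q(\theta \mid \theta_0, y) - \int \log p(y^* \mid \theta, y)\, p(y^* \mid \theta_0, y)\, dy^* \end{aligned}$$
where the second line uses (15): on the support of $p(y^* \mid \theta_0, y)$ the indicator $I(y = \tau(y^*))$ equals one, so $\log p(y, y^* \mid \theta) = \log p(y^* \mid \theta)$ there.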
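Then $\ell(\theta_1) \ge \ell(\theta_2)$ is equivalent to
$$Q(\theta_1 \mid \theta_0, y) - \int \log p(y^* \mid \theta_1, y)\, p(y^* \mid \theta_0, y)\, dy^* \ \ge\ Q(\theta_2 \mid \theta_0, y) - \int \log p(y^* \mid \theta_2, y)\, p(y^* \mid \theta_0, y)\, dy^*$$
or
$$Q(\theta_1 \mid \theta_0, y) - Q(\theta_2 \mid \theta_0, y) \ \ge\ \int \log p(y^* \mid \theta_1, y)\, p(y^* \mid \theta_0, y)\, dy^* - \int \log p(y^* \mid \theta_2, y)\, p(y^* \mid \theta_0, y)\, dy^* \tag{19}$$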
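Now the RHS of this inequality may be written as
$$\int \log p(y^* \mid \theta_1, y)\, p(y^* \mid \theta_0, y)\, dy^* - \int \log p(y^* \mid \theta_2, y)\, p(y^* \mid \theta_0, y)\, dy^* = \int \log\left[\frac{p(y^* \mid \theta_1, y)}{p(y^* \mid \theta_2, y)}\right] p(y^* \mid \theta_0, y)\, dy^* \tag{20}$$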
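Now from Jensen's inequality we get
$$\int \log\left[\frac{p(y^* \mid \theta_1, y)}{p(y^* \mid \theta_2, y)}\right] p(y^* \mid \theta_0, y)\, dy^* \ \le\ \log \int \frac{p(y^* \mid \theta_1, y)}{p(y^* \mid \theta_2, y)}\, p(y^* \mid \theta_0, y)\, dy^* = 0,$$
where the last equality holds for $\theta_2 = \theta_0$, since the integrand then reduces to $p(y^* \mid \theta_1, y)$, which integrates to one. So the RHS of (19) is always zero or negative. So if $Q(\theta_1 \mid \theta_0, y) - Q(\theta_2 \mid \theta_0, y) \ge 0$ then $\ell(\theta_1) \ge \ell(\theta_2)$.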
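The Jensen step is easy to check numerically in a discrete analogue, with $\theta_2 = \theta_0$. In the sketch below (an illustration of our own; the probability vectors are arbitrary and `expected_log_ratio` is a hypothetical helper), `p0` stands for $p(y^* \mid \theta_0, y)$ and each candidate `p1` for $p(y^* \mid \theta_1, y)$.

```python
import math

def expected_log_ratio(p1, p0):
    """sum_k p0[k] * log(p1[k] / p0[k]).

    By Jensen's inequality this is at most log(sum_k p1[k]) = log(1) = 0.
    """
    return sum(q * math.log(p / q) for p, q in zip(p1, p0))

p0 = [0.2, 0.5, 0.3]
candidates = [
    [0.2, 0.5, 0.3],        # identical to p0: the bound holds with equality
    [0.1, 0.6, 0.3],
    [0.7, 0.2, 0.1],
    [1 / 3, 1 / 3, 1 / 3],
]
values = [expected_log_ratio(p1, p0) for p1 in candidates]
print(values)
```

Every value is zero or negative, which is the sign of the RHS of (19) that the proof needs.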
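The EM algorithm is based on maximizing Q given the previous value of $\theta$ and then recalculating Q with the new $\theta$. Starting with some initial value $\hat\theta_1$, then for $j = 1, 2, \dots$:
1. (E-step) Compute $Q(\theta \mid \hat\theta_j, y)$ as
$$Q(\theta \mid \hat\theta_j, y) = \int \ell^*(\theta)\, p(y^* \mid y, \hat\theta_j)\, dy^*,$$
2. (M-step) Compute $\hat\theta_{j+1}$ as
$$\hat\theta_{j+1} = \arg\max_{\theta \in \Theta} Q(\theta \mid \hat\theta_j, y)$$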
The EM algorithm is not guaranteed to converge to a global maximum. However, one can show that if the EM sequence $\hat\theta_1, \hat\theta_2, \dots$ converges to some value $\bar\theta$ and the $Q(\theta \mid \theta_0, y)$ function is continuous in $\theta$ and $\theta_0$ then $\bar\theta$ is a solution to the likelihood equations,
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0.$$
Let us demonstrate the EM algorithm using the Tobit model from our
earlier example.
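Example 9 The Tobit model continued. The complete-data likelihood is
$$\begin{aligned} p(y^* \mid \theta) &= \prod_{i=1}^n p(y_i^* \mid \theta) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, \sigma^{-1} \exp\left\{ -\frac{1}{2\sigma^2} (y_i^* - \beta' x_i)^2 \right\} \\ &= (2\pi)^{-n/2} \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i^* - \beta' x_i)^2 \right\}. \end{aligned}$$
Then
$$\log p(y^* \mid \theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i^* - \beta' x_i)^2. \tag{21}$$
To reduce the number of derivations suppose for simplicity that $\sigma^2$ is known so $\theta = \beta$.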
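Furthermore,
$$\begin{aligned} p(y^* \mid y, \theta_0) &= \frac{\prod_{i=1}^n p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{\prod_{i=1}^n p(y_i \mid \theta_0)} \\ &= \prod_{i=1}^n \frac{p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{p(y_i \mid \theta_0)} \\ &= \prod_{\{i:\, y_i = 0\}} \frac{p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{p(y_i \mid \theta_0)} \times \prod_{\{i:\, y_i > 0\}} \frac{p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{p(y_i \mid \theta_0)}, \end{aligned}$$
where the first product corresponds to the $y_i = 0$ observations and the second to the $y_i > 0$ observations.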
Consider a $y_i = 0$ observation. Since $\tau(y_i^*) = I(y_i^* > 0)\, y_i^*$ we have for $y_i = 0$,
$$I(y_i = \tau(y_i^*)) = I(0 = I(y_i^* > 0)\, y_i^*) = I(y_i^* \le 0).$$
Furthermore, from Example 5 we have
$$p(y_i = 0 \mid \theta) = \Phi(-\beta' x_i/\sigma)$$
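So for $y_i = 0$ we get
$$\frac{p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{p(y_i \mid \theta_0)} = \frac{p(y_i^* \mid \theta_0)\, I(y_i^* \le 0)}{\Phi(-\beta_0' x_i/\sigma)}, \tag{22}$$
which is the p.d.f. for a truncated normal distribution, $N_{(-\infty, 0]}(y_i^* \mid \beta_0' x_i, \sigma^2)$.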
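For a $y_i > 0$ observation we have
$$I(y_i = \tau(y_i^*)) = I(y_i = I(y_i^* > 0)\, y_i^*) = I(y_i = y_i^*). \tag{23}$$
Then
$$\frac{p(y_i^* \mid \theta_0)\, I(y_i = \tau(y_i^*))}{p(y_i \mid \theta_0)} = \frac{p(y_i^* \mid \theta_0)\, I(y_i = y_i^*)}{p(y_i \mid \theta_0)} = I(y_i = y_i^*) \tag{24}$$
because the numerator and denominator coincide when $y_i^* = y_i$.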
We have shown that
$$p(y_i^* \mid y_i, \theta) = \begin{cases} N_{(-\infty, 0]}(y_i^* \mid \beta' x_i, \sigma^2) & \text{if } y_i = 0 \\ I(y_i = y_i^*) & \text{if } y_i > 0 \end{cases}, \quad i = 1, \dots, n. \tag{25}$$
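To compute the Q-function note first that
$$\log p(y^* \mid \theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n y_i^{*2} - \frac{1}{2\sigma^2} \sum_{i=1}^n (\beta' x_i)^2 + \frac{1}{\sigma^2} \sum_{i=1}^n y_i^* (\beta' x_i) \tag{26}$$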
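Then
$$Q(\beta \mid \beta_0, y) = c(\beta_0, \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (\beta' x_i)^2 + \frac{1}{\sigma^2} \sum_{i=1}^n E[y_i^* \mid \beta_0, y_i]\, (\beta' x_i), \tag{27}$$
where $c(\beta_0, \sigma^2)$ collects the terms that do not depend on $\beta$ and the expectation is with respect to $p(y_i^* \mid \beta_0, y_i)$ given in (25). This is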
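$$E[y_i^* \mid y_i, \beta_0] = \begin{cases} \beta_0' x_i - \sigma\, \dfrac{\phi(\beta_0' x_i/\sigma)}{\Phi(-\beta_0' x_i/\sigma)} & \text{if } y_i = 0 \\ y_i & \text{if } y_i > 0 \end{cases} \ \equiv\ \hat y_i(\beta_0), \quad i = 1, \dots, n, \tag{28}$$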
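using the properties of the truncated normal distribution, with $\phi$ and $\Phi$ the standard normal p.d.f. and c.d.f. The first order condition for a maximum of Q is
$$\frac{\partial Q(\beta \mid \beta_0, y)}{\partial \beta} = -\frac{1}{\sigma^2} \sum_{i=1}^n x_i x_i' \beta + \frac{1}{\sigma^2} \sum_{i=1}^n x_i\, \hat y_i(\beta_0) = 0$$
which has solution
$$\hat\beta = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i\, \hat y_i(\beta_0). \tag{29}$$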
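So the EM algorithm may be summarized as follows. Start with an initial value $\hat\beta_1$. Then for $j = 1, 2, \dots$:
1. Generate $\{\hat y_i(\hat\beta_j)\}_{i=1}^n$ as
$$\hat y_i(\hat\beta_j) = \begin{cases} \hat\beta_j' x_i - \sigma\, \dfrac{\phi(\hat\beta_j' x_i/\sigma)}{\Phi(-\hat\beta_j' x_i/\sigma)} & \text{if } y_i = 0 \\ y_i & \text{if } y_i > 0 \end{cases}, \quad i = 1, \dots, n.$$
2. Compute $\hat\theta_{j+1} = \hat\beta_{j+1}$ as
$$\hat\beta_{j+1} = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i\, \hat y_i(\hat\beta_j).$$
Each new value $\hat\beta_{j+1}$ weakly increases the likelihood function and the sequence $\hat\beta_1, \hat\beta_2, \dots$ will converge to the ML estimate $\hat\beta_{ML}$. Note how easy the computations are. All we are doing is prediction and running regressions. Easy.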
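The two steps above translate almost line by line into code. The following is a minimal sketch in plain Python, assuming $\sigma$ is known as in the example; the helper names (`dot`, `solve`, `ols`, `e_step`, `tobit_em`), the simulated data and the fixed number of iterations are our own choices for illustration, not part of the original session.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(k):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][k] / m[i][i] for i in range(k)]

def ols(x, target):
    """(sum x_i x_i')^{-1} sum x_i target_i, as in (29)."""
    k = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * ti for xi, ti in zip(x, target)) for a in range(k)]
    return solve(xtx, xty)

def e_step(beta, sigma, y, x):
    """Predicted latent data y_hat_i(beta) from (28)."""
    y_hat = []
    for yi, xi in zip(y, x):
        if yi > 0:
            y_hat.append(yi)
        else:
            m = dot(beta, xi)
            y_hat.append(m - sigma * PHI.pdf(m / sigma) / PHI.cdf(-m / sigma))
    return y_hat

def loglik(beta, sigma, y, x):
    """Observed-data Tobit loglikelihood, used only to monitor the iterations."""
    total = 0.0
    for yi, xi in zip(y, x):
        m = dot(beta, xi)
        if yi > 0:
            total += math.log(PHI.pdf((yi - m) / sigma) / sigma)
        else:
            total += math.log(PHI.cdf(-m / sigma))
    return total

def tobit_em(y, x, sigma, beta, iters=100):
    path = [loglik(beta, sigma, y, x)]
    for _ in range(iters):
        beta = ols(x, e_step(beta, sigma, y, x))   # predict, then regress
        path.append(loglik(beta, sigma, y, x))
    return beta, path

# Simulated data (assumed values): intercept 0.5, slope 1.0, sigma = 1.
random.seed(1)
n, sigma, beta_true = 1000, 1.0, [0.5, 1.0]
x = [[1.0, random.gauss(0.0, 1.0)] for _ in range(n)]
y = [max(0.0, dot(beta_true, xi) + random.gauss(0.0, sigma)) for xi in x]

beta_hat, path = tobit_em(y, x, sigma, beta=[0.0, 0.0])
print(beta_hat, path[0], path[-1])
```

The list `path` records the observed-data loglikelihood after each iteration, so the ascent property can be inspected directly. A fixed iteration count keeps the sketch short; in practice one would stop when the change in $\hat\beta_j$ or in the loglikelihood falls below a tolerance.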
The EM algorithm can also be used in models where the latent data really are parameters. Here is an example.
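Example 10 The t-regression model. Suppose we have
$$y_i = \beta' x_i + \varepsilon_i, \quad i = 1, \dots, n, \tag{30}$$
where $\beta \in \mathbb{R}^K$, $x_i$ is a $K$-vector of exogenous covariates and
$$\varepsilon_i \mid \sigma^2, \nu \sim t_\nu(0, \sigma^2),$$
a t distribution with $\nu$ degrees of freedom and scale $\sigma$. Then
$$p(y \mid \beta, \sigma, \nu) = \left[ \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)\, \sqrt{\nu\pi}\, \sigma} \right]^n \prod_{i=1}^n \left[ 1 + \frac{1}{\nu} \left( \frac{y_i - \beta' x_i}{\sigma} \right)^2 \right]^{-(\nu+1)/2} \tag{31}$$
The loglikelihood function is highly nonlinear and difficult to maximize.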
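To use the EM algorithm here note that we can write the t-distribution as a continuous mixture of normals:
$$p(y_i \mid \beta, \sigma, \lambda_i) = N(y_i \mid \beta' x_i, \lambda_i \sigma^2), \tag{32}$$
$$p(\lambda_i \mid \nu) = IG(\lambda_i \mid \nu/2, \nu/2), \tag{33}$$
where $IG(\lambda_i \mid \nu/2, \nu/2)$ is the inverse gamma distribution which has density
$$p(\lambda_i \mid \nu) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, \lambda_i^{-(\nu/2 + 1)} \exp\{-\nu/(2\lambda_i)\}. \tag{34}$$
Then we can recover the t-distribution as
$$p(y_i \mid \beta, \sigma, \nu) = \int p(y_i \mid \beta, \sigma, \lambda_i)\, p(\lambda_i \mid \nu)\, d\lambda_i \tag{35}$$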
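The mixture representation can be checked by simulation. The sketch below (our own illustration; `draw_t_error` and the numerical values are assumptions) draws $\lambda_i$ from the inverse gamma distribution in (33) by inverting a gamma draw, then draws the error from the conditional normal in (32), and compares the sample variance with the variance $\sigma^2 \nu/(\nu - 2)$ of a $t_\nu(0, \sigma^2)$ variable, which exists for $\nu > 2$.

```python
import math
import random

def draw_t_error(nu, sigma, rng):
    """Draw eps_i through (32)-(33) rather than from the t density directly."""
    # 1/lambda_i ~ Gamma(shape nu/2, rate nu/2); gammavariate takes a scale,
    # so the second argument is 2/nu.
    precision = rng.gammavariate(nu / 2.0, 2.0 / nu)
    lam = 1.0 / precision                      # lambda_i ~ IG(nu/2, nu/2)
    return rng.gauss(0.0, sigma * math.sqrt(lam))

rng = random.Random(0)
nu, sigma, draws = 10.0, 1.0, 100_000
eps = [draw_t_error(nu, sigma, rng) for _ in range(draws)]

sample_var = sum(e * e for e in eps) / draws
t_var = sigma ** 2 * nu / (nu - 2.0)           # theoretical variance, nu > 2

print(sample_var, t_var)
```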
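We can now use the EM algorithm treating the $\lambda_i$'s as the unobserved data $y_i^*$. Suppose $\nu$ is known. Then we need to estimate $\theta = (\beta, \sigma^2)$. The complete-data loglikelihood is the loglikelihood we get if we knew the $\lambda_i$'s. From (32) this is
$$\ell^*(\theta) = c - \sum_{i=1}^n \frac{1}{2} \log \lambda_i - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n \frac{(y_i - \beta' x_i)^2}{\lambda_i} \tag{36}$$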
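The distribution $p(y_i^* \mid y_i, \theta) = p(\lambda_i \mid y_i, \theta)$ is
$$p(\lambda_i \mid y_i, \theta) = \frac{p(\lambda_i, y_i \mid \theta)}{p(y_i \mid \theta)},$$
$$\begin{aligned} p(\lambda_i, y_i \mid \theta) &= p(y_i \mid \lambda_i, \theta)\, p(\lambda_i \mid \nu) \\ &\propto \lambda_i^{-1/2} \exp\left\{ -\frac{1}{2\lambda_i} \frac{(y_i - \beta' x_i)^2}{\sigma^2} \right\} \lambda_i^{-(\nu/2 + 1)} \exp\{-\nu/(2\lambda_i)\} \\ &= \lambda_i^{-((\nu+1)/2 + 1)} \exp\left\{ -\frac{(y_i - \beta' x_i)^2/\sigma^2 + \nu}{2\lambda_i} \right\} \\ &\propto IG\left( \lambda_i \,\Big|\, (\nu + 1)/2,\ \frac{(y_i - \beta' x_i)^2/\sigma^2 + \nu}{2} \right). \end{aligned}$$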
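To compute the Q-function we need to take the expectation of $\ell^*(\theta)$ with respect to an inverse gamma distribution. We get
$$\begin{aligned} Q(\theta \mid \theta_0, y) &= c_1(\theta_0) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n E_{\theta_0}[1/\lambda_i \mid y_i]\, (y_i - \beta' x_i)^2 \\ &= c_1(\theta_0) - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^n w_i(\theta_0)\, (y_i - \beta' x_i)^2, \end{aligned}$$
where, writing $\theta_0 = (\beta_0, \sigma_0^2)$,
$$w_i(\theta_0) = \frac{\nu + 1}{\nu + (y_i - \beta_0' x_i)^2/\sigma_0^2}, \quad i = 1, \dots, n, \tag{37}$$
since $1/\lambda_i$ has a gamma distribution if $\lambda_i$ has an inverse gamma distribution: if $\lambda_i \sim IG(a, b)$ then $E[1/\lambda_i] = a/b$.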
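The maximum of $Q(\theta \mid \theta_0, y)$ is easily found as
$$\hat\beta = \left( \sum_{i=1}^n w_i(\theta_0)\, x_i x_i' \right)^{-1} \sum_{i=1}^n w_i(\theta_0)\, x_i y_i, \qquad \hat\sigma^2 = n^{-1} \sum_{i=1}^n w_i(\theta_0)\, (y_i - \hat\beta' x_i)^2.$$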
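Starting with some $\hat\theta_1$ the EM algorithm is then
1. Generate $\{w_i(\hat\theta_j)\}_{i=1}^n$ as
$$w_i(\hat\theta_j) = \frac{\nu + 1}{\nu + (y_i - \hat\beta_j' x_i)^2/\hat\sigma_j^2}, \quad i = 1, \dots, n,$$
2. Compute $\hat\theta_{j+1} = (\hat\beta_{j+1}, \hat\sigma_{j+1}^2)$ as
$$\hat\beta_{j+1} = \left( \sum_{i=1}^n w_i(\hat\theta_j)\, x_i x_i' \right)^{-1} \sum_{i=1}^n w_i(\hat\theta_j)\, x_i y_i, \qquad \hat\sigma_{j+1}^2 = n^{-1} \sum_{i=1}^n w_i(\hat\theta_j)\, (y_i - \hat\beta_{j+1}' x_i)^2.$$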
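In code the algorithm is an iteratively reweighted least squares loop. The sketch below is a minimal plain-Python version with $\nu$ treated as known; the helper names, the simulated data and the crude starting values are assumptions for illustration. The scale update uses the freshly computed $\hat\beta_{j+1}$, which is the joint maximizer of $Q$ over $(\beta, \sigma^2)$.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(k):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][k] / m[i][i] for i in range(k)]

def e_step(beta, sigma2, y, x, nu):
    """Weights w_i from (37): the conditional mean of 1/lambda_i given y_i."""
    return [(nu + 1.0) / (nu + (yi - dot(beta, xi)) ** 2 / sigma2)
            for yi, xi in zip(y, x)]

def m_step(w, y, x):
    """Weighted least squares for beta, then the weighted residual variance."""
    k = len(x[0])
    xwx = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, x)) for b in range(k)]
           for a in range(k)]
    xwy = [sum(wi * xi[a] * yi for wi, xi, yi in zip(w, x, y)) for a in range(k)]
    beta = solve(xwx, xwy)
    sigma2 = sum(wi * (yi - dot(beta, xi)) ** 2
                 for wi, xi, yi in zip(w, x, y)) / len(y)
    return beta, sigma2

def t_loglik(beta, sigma2, y, x, nu):
    """Observed-data loglikelihood, the log of (31); used to monitor progress."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * sigma2))
    return sum(const - 0.5 * (nu + 1)
               * math.log1p((yi - dot(beta, xi)) ** 2 / (nu * sigma2))
               for yi, xi in zip(y, x))

def t_regression_em(y, x, nu, beta, sigma2, iters=100):
    path = [t_loglik(beta, sigma2, y, x, nu)]
    for _ in range(iters):
        w = e_step(beta, sigma2, y, x, nu)
        beta, sigma2 = m_step(w, y, x)
        path.append(t_loglik(beta, sigma2, y, x, nu))
    return beta, sigma2, path

# Simulated data (assumed values): nu = 4, beta = (1, 2), sigma = 1.
rng = random.Random(2)
n, nu, sigma, beta_true = 1000, 4.0, 1.0, [1.0, 2.0]
x = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(n)]
y = []
for xi in x:
    lam = 1.0 / rng.gammavariate(nu / 2.0, 2.0 / nu)   # lambda_i ~ IG(nu/2, nu/2)
    y.append(dot(beta_true, xi) + rng.gauss(0.0, sigma * math.sqrt(lam)))

start_sigma2 = sum(v * v for v in y) / n               # crude initial scale
beta_hat, sigma2_hat, path = t_regression_em(y, x, nu, [0.0, 0.0], start_sigma2)
print(beta_hat, sigma2_hat)
```

Observations with large residuals receive small weights $w_i$, which is why the t-regression is less sensitive to outliers than least squares.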
The EM algorithm works well for some models. Disadvantages with the EM algorithm include:
• The EM algorithm can in some cases exhibit extremely slow convergence. This is often the case when the fraction of missing information is large (e.g. many $y_i = 0$ observations in the Tobit model).
• It can be hard to do the required integrations in the E-step analytically. An example is integration using a multivariate truncated normal distribution.
• It can be hard to do the maximization in the M-step. The main idea behind the EM algorithm is that this should be an easy maximization and for some models this is not so.
There are a number of extensions to the EM algorithm designed to overcome the above problems: e.g. Monte Carlo EM (see Wei and Tanner (1990), Guo and Thompson (1991)) can be used when expectations are not known in closed form; there is also a generalized EM algorithm (see Gelman et al. (1998), p. 277-280) which is designed to speed up convergence.
4 Disadvantages with MLE
• Asymptotically $\hat\theta_{ML}$ is a sufficient statistic for $\theta$. However, in finite samples $\hat\theta_{ML}$ is not, in general, sufficient.
• Inference is based on asymptotic approximations to the exact repeated sample distribution of $\hat\theta_{ML}$. This may be misleading.
• Can be computationally difficult.
• The regularity conditions may fail. In this case $\hat\theta_{ML}$ may be inconsistent or even asymptotically the repeated sample distribution may not be normal. This can happen when the number of parameters increases with sample size. Here is a famous example.
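Example 11 Fixed effects panel model. Suppose we observe $N$ individuals over $T$ time periods and
$$p(y_{it} \mid \alpha_i, \sigma^2) = N(y_{it} \mid \alpha_i, \sigma^2), \quad i = 1, \dots, N;\ t = 1, \dots, T.$$
The joint distribution of $\{y_{i1}, \dots, y_{iT}\}_{i=1}^N$ given $\alpha = \{\alpha_i\}_{i=1}^N$ and $\sigma^2$ is
$$p(y \mid \alpha, \sigma^2) \propto \sigma^{-NT} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^N \sum_{t=1}^T (y_{it} - \alpha_i)^2 \right\}. \tag{38}$$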
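Maximizing the loglikelihood function we find
$$\hat\alpha_i = \bar y_i, \qquad \hat\sigma^2 = (NT)^{-1} \sum_{i=1}^N \sum_{t=1}^T (y_{it} - \bar y_i)^2,$$
where $\bar y_i = T^{-1} \sum_{t=1}^T y_{it}$.
It is easily shown that
$$E[\hat\sigma^2] = \frac{T - 1}{T}\, \sigma^2,$$
for any value of $N$. So for large $N$ and small fixed $T$ the ML estimator of $\sigma^2$ has a bias of order $1/T$ that does not vanish as $N$ grows.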
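A short simulation makes the bias visible. The sketch below (our own illustration; the function `fe_sigma2_mle`, the distribution of the individual effects and the sample sizes are assumptions) uses $T = 2$ and a large $N$, so the ML estimate of $\sigma^2$ should be close to $\sigma^2/2$ rather than $\sigma^2$.

```python
import random

def fe_sigma2_mle(panel):
    """ML estimate of sigma^2 when every individual has its own mean alpha_i."""
    total, count = 0.0, 0
    for row in panel:
        ybar_i = sum(row) / len(row)              # alpha_i_hat = individual mean
        total += sum((v - ybar_i) ** 2 for v in row)
        count += len(row)
    return total / count

rng = random.Random(0)
N, T, sigma = 20_000, 2, 1.0

panel = []
for _ in range(N):
    alpha_i = rng.gauss(0.0, 3.0)                 # arbitrary individual effect
    panel.append([alpha_i + rng.gauss(0.0, sigma) for _ in range(T)])

sigma2_hat = fe_sigma2_mle(panel)
expected = (T - 1) / T * sigma ** 2               # 0.5 when T = 2

print(sigma2_hat, expected)
```

Increasing `N` tightens the estimate around 0.5 without moving it toward the true value of 1; only a larger `T` reduces the bias.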