
The EM algorithm

The goal is to estimate the parameter θ for a family p(x, y; θ) of distributions on a product
space S_1 × S_2. Assume that if we observed the full data (X^{(n)}, Y^{(n)}), n = 1, \dots, N, finding the
maximum likelihood estimate would be a simple computation. However, we only get to observe
an i.i.d. random sample X^{(1)}, \dots, X^{(N)} (the Y^{(n)}'s remain hidden). The best we can do is
maximize the marginal log-likelihood:

\[
\max_\theta \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_\theta \sum_{n=1}^{N} \log p_X(X^{(n)}; \theta). \tag{1}
\]

We develop an iterative algorithm which can be used to find local maxima of this marginal
log-likelihood.

The Kullback-Leibler distance and entropy

The Kullback-Leibler distance between two distributions q and p on a discrete set S is defined as

\[
K(q, p) = \sum_{y \in S} q(y) \log \frac{q(y)}{p(y)}.
\]

Since the log function is concave, by Jensen's inequality we have


\[
-K(q, p) = \sum_{y} q(y) \log \frac{p(y)}{q(y)} \le \log \sum_{y} q(y) \frac{p(y)}{q(y)} = 0.
\]

Since log is strictly concave, equality requires p(y)/q(y) to be constant; as both distributions
sum to 1, this forces p(y) = q(y) for all y. The entropy of a distribution q, denoted H(q), is given by

\[
H(q) = -\sum_{y} q(y) \log q(y).
\]

We return to the properties of this quantity later. Note that

\[
K(q, p) = -H(q) - E_q \log p.
\]
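As a quick numerical check (not part of the derivation), the following minimal Python snippet, with two made-up distributions q and p on a three-point space, computes H(q) and K(q, p) and verifies both the identity above and the non-negativity of the KL distance.

import math

q = [0.2, 0.5, 0.3]     # an arbitrary made-up distribution q on a 3-point space
p = [0.4, 0.4, 0.2]     # an arbitrary made-up distribution p on the same space

H_q = -sum(qi * math.log(qi) for qi in q)                    # entropy H(q)
K_qp = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))   # KL distance K(q, p)
E_q_log_p = sum(qi * math.log(pi) for qi, pi in zip(q, p))   # E_q log p

assert K_qp >= 0.0                                 # non-negativity of K(q, p)
assert abs(K_qp - (-H_q - E_q_log_p)) < 1e-12      # K(q, p) = -H(q) - E_q log p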

The Gibbs Variational formula

Let ψ(y) be a function on a discrete state space S and define the probability p on S as

\[
p(y) = e^{\psi(y)}/Z = e^{\psi(y) - B}, \tag{2}
\]


where Z = \sum_{y \in S} e^{\psi(y)} is the normalizing constant and B = \log Z. Then

\[
B = \max_q \big[ H(q) + E_q \psi \big], \tag{3}
\]

where the unique maximizer is p. This follows directly from the properties of the KL
distance. For any q,

\[
0 \le K(q, p) = -H(q) - E_q \psi + B,
\]

so that

\[
B \ge H(q) + E_q \psi,
\]

and equality is achieved only when q = p.
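To make this concrete, here is a minimal Python sketch with an arbitrary made-up function ψ on a four-point space (not part of the notes): it checks that H(q) + E_q ψ never exceeds B for random distributions q, and that the Gibbs distribution p attains the maximum.

import math
import random

psi = [0.3, -1.2, 0.7, 2.0]              # an arbitrary made-up function psi on a 4-point space S
Z = sum(math.exp(v) for v in psi)        # normalizing constant Z
B = math.log(Z)
p = [math.exp(v) / Z for v in psi]       # the Gibbs distribution p(y) = exp(psi(y) - B)

def objective(q):
    # H(q) + E_q psi, the quantity maximized in (3)
    H = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return H + sum(qi * v for qi, v in zip(q, psi))

# random distributions q never exceed B ...
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in psi]
    s = sum(w)
    q = [wi / s for wi in w]
    assert objective(q) <= B + 1e-9

# ... and the Gibbs distribution p attains the maximum
assert abs(objective(p) - B) < 1e-12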

A variational formula for the log-marginal

Consider a joint distribution p(x, y) on a product space of two sets S1 × S2 with x ∈ S1 and
y ∈ S2 . Write
\[
p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{e^{\log p(x, y)}}{Z},
\]

where the normalizing constant Z is given by the marginal p_X(x). For fixed x, set
ψ(y) = \log p(x, y); then using the Gibbs variational formula we have

\[
\log p_X(x) = \max_q \big[ H(q) + E_q \log p(x, \cdot) \big],
\]

where q runs over probability distributions on S_2. The maximum is achieved at q(y) = p_{Y|X}(y \mid x).
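A minimal Python check of this identity, using a small made-up joint table p(x, y) (not from the notes): for a fixed x, evaluating H(q) + E_q log p(x, ·) at the posterior q(y) = p_{Y|X}(y | x) recovers log p_X(x).

import math

# an arbitrary made-up joint distribution p(x, y) on a 2 x 3 product space S1 x S2
joint = [[0.10, 0.25, 0.05],
         [0.20, 0.10, 0.30]]

x = 0                                              # fix an observed value x in S1
px = sum(joint[x])                                 # marginal p_X(x)
q = [pxy / px for pxy in joint[x]]                 # posterior q(y) = p_{Y|X}(y | x)

H = -sum(qi * math.log(qi) for qi in q)            # entropy of the posterior
E_q_log_joint = sum(qi * math.log(pxy) for qi, pxy in zip(q, joint[x]))

# H(q) + E_q log p(x, .) equals log p_X(x) at the maximizing q
assert abs((H + E_q_log_joint) - math.log(px)) < 1e-12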

Derivation of the EM iteration

Given observations X (1) , . . . , X (N ) , define


\[
\mathcal{J}(q_1, \dots, q_N, \theta) = \sum_{n=1}^{N} \Big[ H(q_n) + E_{q_n} \log p(X^{(n)}, \cdot\,; \theta) \Big]. \tag{4}
\]

Since

\[
\log p(X^{(n)}; \theta) = \max_q \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta) \big],
\]

it follows that

\[
\max_\theta \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_\theta \max_{q_1, \dots, q_N} \mathcal{J}(q_1, \dots, q_N, \theta). \tag{5}
\]

This suggests an iterative scheme to maximize (1): maximize over the q_n's for fixed θ, and
then maximize over θ for fixed q_n's.

Initialize. Set t = 0 and pick θ^{(0)}.


E. q_n^{(t)} = \operatorname{argmax}_q \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^{(t)}) \big].

M. \theta^{(t+1)} = \operatorname{argmax}_\theta \sum_{n=1}^{N} \big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \big].

For the E step we know already that

\[
q_n^{(t)}(y) = p(y \mid X^{(n)}; \theta^{(t)}) = \frac{p(X^{(n)}, y; \theta^{(t)})}{p(X^{(n)}; \theta^{(t)})} = \frac{p(X^{(n)}, y; \theta^{(t)})}{\sum_{y' \in S_2} p(X^{(n)}, y'; \theta^{(t)})}. \tag{6}
\]

In the M step we compute

\[
\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_{n=1}^{N} \Big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \Big]. \tag{7}
\]

The first term is a constant with respect to varying θ and can be ignored in the maximiza-
tion. Writing out the expectation we get
\[
\theta^{(t+1)} = \operatorname{argmax}_\theta \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \log p(X^{(n)}, y; \theta).
\]

To avoid confusion, we reiterate that θ(t) in the above expression is a constant value differ-
ent from the variable θ being maximized.
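The E and M steps translate directly into code. Below is a minimal Python sketch of one generic EM iteration for a finite hidden space S_2 = {0, …, K−1}; the functions joint_logp (computing log p(x, y; θ)) and maximize_expected_loglik (solving the weighted maximization over θ) are hypothetical placeholders that depend on the particular model.

import math

def em_step(observations, theta, K, joint_logp, maximize_expected_loglik):
    """One EM iteration for a model with hidden variable y in {0, ..., K-1}.

    joint_logp(x, y, theta)           -> log p(x, y; theta)        (model-specific, assumed given)
    maximize_expected_loglik(obs, Q)  -> argmax_theta of the weighted sum in the M step
    """
    # E step: posterior q_n(y) = p(y | x_n; theta), as in equation (6)
    Q = []
    for x in observations:
        log_joint = [joint_logp(x, y, theta) for y in range(K)]
        m = max(log_joint)                              # log-sum-exp for numerical stability
        log_px = m + math.log(sum(math.exp(l - m) for l in log_joint))
        Q.append([math.exp(l - log_px) for l in log_joint])

    # M step: maximize the expected complete-data log-likelihood, as in equation (7)
    return maximize_expected_loglik(observations, Q)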
The above maximization has the following interpretation: the value of θ^{(t+1)} is what
would be obtained by maximum likelihood estimation if, in addition to the observed data
\{X^{(n)}\}_{n=1}^{N}, we had access to a large number of Y observations for each X^{(n)}, drawn from
the conditional p(y \mid X^{(n)}; θ^{(t)}). Now, suppose we could easily perform maximum likelihood
estimation given full data by computing a sufficient statistic such as an empirical
average of some function T over the full data, i.e.,
\[
\theta_{ML} = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}).
\]

Then, in the present context of unobserved data, θ^{(t+1)} reduces to the average of T
over our hypothetical pool of full data, with the Y observations for each X^{(n)} drawn from the
conditional. Thus,

\[
\theta^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \, T(X^{(n)}, y). \tag{8}
\]
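As a concrete instance (anticipating the mixture model section below), suppose the complete-data ML estimate of a mixing weight is the empirical average of T_k(x, y) = 1_{[y = k]}. Then (8) gives

\[
\pi_k^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \, 1_{[y = k]} = \frac{1}{N} \sum_{n=1}^{N} p(k \mid X^{(n)}; \theta^{(t)}),
\]

which is exactly the mixing-weight update (14) derived below.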

More generally, if the joint distribution belongs to an exponential family and θ_{ML} is the solution
to an equation

\[
E_\theta T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}),
\]

then in the M step θ^{(t+1)} is obtained as a solution to

\[
E_\theta T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \, T(X^{(n)}, y). \tag{9}
\]

Some properties of the iterated maximization

Suppose the iterations reach a fixed point (q^*, θ^*) = (q_1^*, \dots, q_N^*, θ^*). At a fixed point, by
definition, the E and M steps look like

\[
q_n^* = \operatorname{argmax}_q \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^*) \big], \tag{10}
\]
\[
\theta^* = \operatorname{argmax}_\theta \sum_{n=1}^{N} \big[ H(q_n^*) + E_{q_n^*} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{11}
\]

Assume also that (q^*, θ^*) is a local maximum of \mathcal{J} on some neighborhood U of q^* and V of θ^*.
Let θ be in V and let q_θ = (q_{1,θ}, \dots, q_{N,θ}) be the maximizing distributions in (10) with
θ^* replaced by θ. Assume θ is close enough to θ^* that q_θ lies in U; this is possible if p(x, y; θ) is
smooth in θ. Then using (5) we have

\[
\ell(X^{(1)}, \dots, X^{(N)}; \theta^*) = \mathcal{J}(q_{1,\theta^*}, \dots, q_{N,\theta^*}, \theta^*)
\ge \mathcal{J}(q_{1,\theta}, \dots, q_{N,\theta}, \theta)
= \ell(X^{(1)}, \dots, X^{(N)}; \theta),
\]

so θ^* is a local maximum of the marginal log-likelihood.

EM for Mixture models

In this problem, the hidden variable Y takes values in S_2 = \{1, \dots, K\}. Let us set up
some notation:

\[
\pi_k = P(Y = k), \qquad p_k(x; \theta_k) = P(x \mid Y = k; \theta_k).
\]

So the full parameter is θ = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K). The marginal distribution over S_1 is
given by the mixture

\[
p(x; \theta) = \sum_{k=1}^{K} P(x, Y = k; \theta) = \sum_{k=1}^{K} \pi_k p_k(x; \theta_k).
\]

If we have a fully observed sample (X (1) , Y (1) ), . . . , (X (N ) , Y (N ) ), then it is easy to see that
maximum likelihood leads to separating the sample into the K groups according to the
value of Y (n) and computing
\[
\hat\theta_k = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]} \log p_k(X^{(n)}; \theta_k),
\]
\[
\hat\pi_k = \frac{1}{N} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]}. \tag{12}
\]
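With fully observed labels, the estimates in (12) can be computed directly. A minimal Python sketch with made-up labeled data, taking the components p_k to be univariate Gaussians purely as an illustrative choice:

import numpy as np

# made-up fully observed sample: X real-valued, Y in {0, ..., K-1}
X = np.array([1.2, -0.3, 2.5, 0.1, 3.0, -1.1])
Y = np.array([1, 0, 1, 0, 1, 0])
K = 2

pi_hat = np.array([(Y == k).mean() for k in range(K)])     # pi_hat_k in (12)
# with Gaussian p_k, the group-wise ML estimates are per-group means and variances
mu_hat = np.array([X[Y == k].mean() for k in range(K)])
var_hat = np.array([X[Y == k].var() for k in range(K)])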

Using Bayes' rule, the E step, which involves computing p(k \mid X^{(n)}; θ^{(t)}) for each k and n,
becomes

\[
q_n^{(t)}(k) = \frac{\pi_k^{(t)} p_k(X^{(n)}; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)} p_{k'}(X^{(n)}; \theta_{k'}^{(t)})}.
\]

In the M step, the full-data estimates (12) are replaced by their weighted analogues:

\[
\theta_k^{(t+1)} = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} q_n^{(t)}(k) \log p_k(X^{(n)}; \theta_k), \tag{13}
\]
\[
\pi_k^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} q_n^{(t)}(k). \tag{14}
\]
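Putting the E and M steps together, here is a minimal Python sketch of EM for a mixture of two univariate Gaussians; the Gaussian components and the synthetic data are illustrative assumptions, not part of the notes. Since each E and M step can only increase \mathcal{J} in (5), the marginal log-likelihood tracked below should never decrease across iterations.

import numpy as np

rng = np.random.default_rng(0)
# made-up data drawn from two Gaussian components
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])
N, K = len(X), 2

# initialize theta = (pi, means, variances)
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

prev_ll = -np.inf
for t in range(100):
    # E step: responsibilities q_n(k) = p(k | X_n; theta)
    dens = np.exp(-(X[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    joint = pi * dens                              # shape (N, K): pi_k p_k(X_n; theta_k)
    q = joint / joint.sum(axis=1, keepdims=True)

    # M step: weighted ML estimates, equations (13) and (14)
    Nk = q.sum(axis=0)
    pi = Nk / N
    mu = (q * X[:, None]).sum(axis=0) / Nk
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

    # marginal log-likelihood at the parameters used in this E step; non-decreasing
    ll = np.log(joint.sum(axis=1)).sum()
    assert ll >= prev_ll - 1e-8
    prev_ll = ll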
