
Tutorial of EM Algorithm: A Top-down Approach

Hao Wu
Department of Computer Science and Technology, Tsinghua University
Haidian, Beijing, 100084
haowu06@mails.tsinghua.edu.cn

Abstract— The EM algorithm is an important tool for estimating the parameters of models with hidden variables. In this document we give a brief tutorial of the algorithm. We first investigate why the EM algorithm is needed, and then we apply it to the task of estimating the parameters of Gaussian mixture models.

I. INTRODUCTION

The EM algorithm is a powerful tool for estimating the parameters of graphical models with hidden variables. A deep understanding of this algorithm can greatly help one understand other useful machine learning methods. In this document we give a brief tutorial of the EM algorithm in a top-down manner, which should make it easier to learn. In Section II we investigate why the EM algorithm is needed. In Section III we apply the algorithm to the task of estimating the parameters of Gaussian mixture models. Finally, we draw our conclusion in Section IV.

II. WHY EM ALGORITHM?

First of all, we have a set of observed data points x = (x_1, x_2, . . . , x_{M_1}) and a set of unobserved data points y = (y_1, y_2, . . . , y_{M_2}). Both x and y are generated by a graphical model with parameter θ. Given x, we want to estimate the best parameter of the model. In other words, we want to find the θ∗ which maximizes the log-likelihood function L(θ) := log p(x|θ) = log Σ_y p(x, y|θ), i.e.

    \theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log \sum_y p(x, y \mid \theta).

But this is an intractable task, since the summation sits inside the logarithm. Can we make it a little bit easier? Here are some intuitive considerations:

• It would be better to exchange the summation and the logarithm.
• If we can find a lower bound of the log-likelihood L(θ), say F(θ), then we can maximize this lower bound instead.

For any distribution q(y) we have Σ_y q(y) = 1. Thus, according to Jensen's inequality [1],

    \varphi\left( \frac{\sum_i a_i x_i}{\sum_i a_i} \right) \ge \frac{\sum_i a_i \varphi(x_i)}{\sum_i a_i},

where φ is a concave function. Taking φ = log, a_i = q(y), and x_i = p(x, y|θ)/q(y), we have

    L(\theta) = \log p(x \mid \theta)
              = \log \sum_y p(x, y \mid \theta)
              = \log \frac{ \sum_y q(y) \frac{p(x, y \mid \theta)}{q(y)} }{ \sum_y q(y) }
              \ge \frac{ \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)} }{ \sum_y q(y) }
              = \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}.

Let L(q, θ) := Σ_y q(y) log (p(x, y|θ)/q(y)). It can easily be observed that L(q, θ) is a family of lower bounds of L(θ) with respect to the distribution q. In addition, we also have

    L(q, \theta) = \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}
                 = \sum_y q(y) \log \frac{p(y \mid x, \theta)\, p(x \mid \theta)}{q(y)}
                 = \sum_y q(y) \left( \log p(y \mid x, \theta) + \log p(x \mid \theta) - \log q(y) \right)
                 = \sum_y q(y) \log p(x \mid \theta) + \sum_y q(y) \left( \log p(y \mid x, \theta) - \log q(y) \right)
                 = \log p(x \mid \theta) + \sum_y q(y) \log \frac{p(y \mid x, \theta)}{q(y)}.
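The last sum above is the negative Kullback–Leibler divergence between q(y) and the posterior p(y|x, θ). This identity is not spelled out in the original derivation, but writing it down makes the equality condition explicit:

    L(q, \theta) = L(\theta) - \mathrm{KL}\bigl( q(y) \,\|\, p(y \mid x, \theta) \bigr),
    \qquad \mathrm{KL}\bigl( q(y) \,\|\, p(y \mid x, \theta) \bigr) = \sum_y q(y) \log \frac{q(y)}{p(y \mid x, \theta)} \ge 0,

so L(q, θ) ≤ L(θ), with equality exactly when q(y) = p(y|x, θ).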
Obviously, if we set q(y) = p(y|x, θ) for some fixed value of θ, then L(q, θ) is equal to L(θ). Thus the task of maximizing L(θ) with respect to θ is equivalent to the task of maximizing L(q, θ) with respect to both q and θ. This is guaranteed by the following lemma.

Lemma 1: If L(q∗, θ∗) is the global maximum of L(q, θ), then q∗(y) = p(y|x, θ∗), and L(θ∗) = L(q∗, θ∗) is the global maximum of L(θ).

Proof: Assume that there exists a θ′ with L(θ′) > L(θ∗) = L(q∗, θ∗). Setting q′(y) = p(y|x, θ′), we have L(q′, θ′) = L(θ′) > L(θ∗) = L(q∗, θ∗), which contradicts the fact that L(q∗, θ∗) is the global maximum of L(q, θ).

Generally, finding the global maximum of L(q, θ) is still intractable, so in practice we only look for a local maximum. If we can find a local maximum L(q∗, θ∗), then L(θ∗) is also a local maximum of L(θ). This is guaranteed by the following lemma.

Lemma 2: If L(q∗, θ∗) is a local maximum of L(q, θ), then q∗(y) = p(y|x, θ∗), and L(θ∗) = L(q∗, θ∗) is a local maximum of L(θ).

Proof: The proof is omitted here. An intuitive explanation can be found in [2].

We can introduce an iterative procedure to find the local maximum by estimating q and θ separately:

ALGORITHM 1: LocalMaximum(L(q, θ))
INPUT: L(q, θ): the family of lower bounds of the log-likelihood.
OUTPUT: L(q∗, θ∗): the local maximum of the lower bound and the corresponding parameters, q∗ and θ∗.
1. θ[0] ← initial value; t ← 0;
2. do
3.   t ← t + 1;
4.   q[t] ← argmax_q L(q, θ[t−1]);   (t-th E-step)
5.   θ[t] ← argmax_θ L(q[t], θ);    (t-th M-step)
6. until convergence;
7. return L(q[t], θ[t]).

In line 4 we can easily find q[t] by setting q[t](y) = p(y|x, θ[t−1]). And most of the time, maximizing L(q[t], θ) in line 5 is relatively easy because the logarithm is inside the summation (see the definition of L(q, θ)). The iteration stops when the change of L(q, θ) is smaller than a given threshold.
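Algorithm 1 translates almost line for line into code. The sketch below is only a hypothetical skeleton, not an implementation from the paper: e_step, m_step and lower_bound are placeholder callables assumed to return the posterior q[t], the maximizing θ[t], and the value of L(q, θ), respectively, and convergence is checked on the change of the lower bound as described above.

# A minimal skeleton of Algorithm 1 (LocalMaximum); e_step, m_step and
# lower_bound are hypothetical callables supplied by the user.
def local_maximum(theta_init, e_step, m_step, lower_bound, tol=1e-6, max_iter=1000):
    theta = theta_init
    prev = -float("inf")
    for t in range(1, max_iter + 1):
        q = e_step(theta)             # line 4: q[t] <- p(y | x, theta[t-1])
        theta = m_step(q)             # line 5: theta[t] <- argmax_theta L(q[t], theta)
        value = lower_bound(q, theta)
        if abs(value - prev) < tol:   # line 6: stop when L(q, theta) barely changes
            break
        prev = value
    return q, theta, value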
Now we have an easier way to approximately estimate the parameter θ. In a word, this method, which is the EM algorithm, is simply a procedure for finding a local maximum of a lower bound of the log-likelihood. Much of the time the resulting model parameter θ∗ is very close to the true global maximum, so this method is widely used today.

III. AN EXAMPLE OF EM ALGORITHM

Now consider a very simple setting. We have M numbers X = (X_1, X_2, . . . , X_M), each of which is generated by one of N Gaussian mixture components C = (C_1, C_2, . . . , C_N). We assume that all of the components share the same variance σ², but their means µ = (µ_1, µ_2, . . . , µ_N) are different. Each component C_i has a weight α_i; we use α = (α_1, α_2, . . . , α_N) to denote them. The log-likelihood of generating all the numbers X takes the form

    L(\mu) = \log P(X \mid \mu) = \sum_{i=1}^{M} \log \sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2)

with respect to the parameter µ of the model. Our task is to estimate µ given the data X and the other parameters, σ² and α.

First of all, let us try to calculate µ∗ by using L(µ) directly. The first-order derivative of L(µ) takes the form

    \frac{\partial L(\mu)}{\partial \mu_k} = \sum_{i=1}^{M} \left( \frac{\alpha_k \mathcal{N}(X_i \mid \mu_k, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2)} \cdot \frac{1}{\sigma^2} (X_i - \mu_k) \right).

Thus the solution of ∂L(µ)/∂µ_k = 0 is hard to calculate directly.

Now we use the EM algorithm. First we introduce a set of random variables {Z_ij}: Z_ij equals 1 if X_i is generated by C_j, and equals 0 otherwise. Each Z_ij is a hidden variable and we have little idea about its distribution, so it is convenient to use the EM algorithm here. Let

    L(Q, \mu) = \sum_{Z} Q(Z) \log \frac{P(X, Z \mid \mu)}{Q(Z)}.

E-step (maximization of L(Q, µ[t−1]) with respect to Q):

    Q^{[t]}(Z) \leftarrow P(Z \mid X, \mu^{[t-1]}).

M-step (maximization of L(Q[t], µ) with respect to µ):

    \mu^{[t]} \leftarrow \arg\max_{\mu} L(Q^{[t]}, \mu)
              = \arg\max_{\mu} \left( \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu) - \sum_{Z} Q^{[t]}(Z) \log Q^{[t]}(Z) \right)
              = \arg\max_{\mu} \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu)
              = \arg\max_{\mu} \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu)
              = \arg\max_{\mu} E_Z[\log P(X, Z \mid \mu) \mid X, \mu^{[t-1]}]
              = \arg\max_{\mu} Q(\mu \mid \mu^{[t-1]}).

So the key task is to compute the µ which maximizes Q(µ|µ[t−1]). Using P(X, Z|µ) = P(X|Z, µ) P(Z|µ), we have

    \log P(X, Z \mid \mu) = \log \prod_{i=1}^{M} P(X_i, Z_i \mid \mu)
                          = \log \prod_{i=1}^{M} P(X_i \mid Z_i, \mu) P(Z_i \mid \mu)
                          = \log \prod_{i=1}^{M} \left( \prod_{j=1}^{N} \mathcal{N}(X_i \mid \mu_j, \sigma^2)^{Z_{ij}} \prod_{j=1}^{N} \alpha_j^{Z_{ij}} \right)
                          = \log \prod_{i=1}^{M} \prod_{j=1}^{N} \left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)^{Z_{ij}}
                          = \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

Accordingly, we have

    Q(\mu \mid \mu^{[t-1]}) = \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu)
                            = \sum_{Z_{11}} \sum_{Z_{12}} \cdots \sum_{Z_{MN}} P(Z \mid X, \mu^{[t-1]}) \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{Z_{ij}} P(Z_{ij} \mid X, \mu^{[t-1]}) \, Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

Note that we have omitted the conditioning of the expectation for simplicity. Generally, for a linear function f(x) the following property holds: E[f(x)] = f(E[x]). Using this property we can also rewrite Q(µ|µ[t−1]) directly as

    Q(\mu \mid \mu^{[t-1]}) = E_Z[\log P(X, Z \mid \mu)]
                            = E_Z\left[ \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right) \right]
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

To maximize Q(µ|µ[t−1]) we have to solve the equation

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = 0.

The first-order derivative of Q(µ|µ[t−1]) is

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
        = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( E[Z_{ij}] \log \frac{\alpha_j}{\sqrt{2\pi\sigma^2}} - E[Z_{ij}] \frac{(X_i - \mu_j)^2}{2\sigma^2} \right)
        = -\frac{1}{2\sigma^2} \sum_{i=1}^{M} \frac{\partial}{\partial \mu_k} \sum_{j=1}^{N} E[Z_{ij}] (X_i - \mu_j)^2
        = -\frac{1}{2\sigma^2} \sum_{i=1}^{M} \frac{d}{d\mu_k} E[Z_{ik}] (X_i - \mu_k)^2
        = \frac{1}{\sigma^2} \sum_{i=1}^{M} E[Z_{ik}] (X_i - \mu_k).

Thus

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = 0
        \iff \frac{1}{\sigma^2} \sum_{i=1}^{M} E[Z_{ik}] (X_i - \mu_k) = 0
        \Longleftarrow \mu_k = \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]}.

So in the M-step we should update µ[t] = (µ_1[t], µ_2[t], . . . , µ_N[t]) as

    \mu_k^{[t]} \leftarrow \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]},

where k = 1, 2, . . . , N.

It is worth noting that E[Z_ik] is the probability that the i-th data point is generated by the k-th component, which takes the form

    E[Z_{ik}] = \sum_{Z_{ik}} Z_{ik} P(Z_{ik} \mid X_i, \mu^{[t-1]})
              = 0 \cdot P(Z_{ik} = 0 \mid X_i, \mu^{[t-1]}) + 1 \cdot P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]})
              = P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]})
              = \frac{P(Z_{ik} = 1, X_i \mid \mu^{[t-1]})}{\sum_{j=1}^{N} P(Z_{ij} = 1, X_i \mid \mu^{[t-1]})}
              = \frac{P(X_i \mid Z_{ik} = 1, \mu^{[t-1]}) P(Z_{ik} = 1 \mid \mu^{[t-1]})}{\sum_{j=1}^{N} P(X_i \mid Z_{ij} = 1, \mu^{[t-1]}) P(Z_{ij} = 1 \mid \mu^{[t-1]})}
              = \frac{\alpha_k \mathcal{N}(X_i \mid \mu_k^{[t-1]}, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j^{[t-1]}, \sigma^2)}.
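Putting the E-step (the responsibilities E[Z_ik]) and the M-step (the weighted-mean update of µ_k) together gives the complete procedure for this example. The following Python sketch is an illustrative implementation under the assumptions of this section (shared, known variance σ² and fixed weights α; only the means are updated); the function and variable names are our own, not from the paper.

# EM for the example of Section III: a 1-D Gaussian mixture with shared,
# known variance sigma^2 and fixed weights alpha; only the means are learned.
import numpy as np

def em_means(X, alpha, sigma2, mu_init, n_iter=100, tol=1e-8):
    X = np.asarray(X, dtype=float)           # shape (M,)
    mu = np.asarray(mu_init, dtype=float)    # shape (N,)
    alpha = np.asarray(alpha, dtype=float)   # shape (N,)
    for _ in range(n_iter):
        # E-step: E[Z_ik] = alpha_k N(X_i | mu_k, sigma2) / sum_j alpha_j N(X_i | mu_j, sigma2)
        diff = X[:, None] - mu[None, :]                                  # (M, N)
        dens = np.exp(-0.5 * diff**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        resp = alpha * dens
        resp /= resp.sum(axis=1, keepdims=True)                          # normalize over components
        # M-step: mu_k <- sum_i E[Z_ik] X_i / sum_i E[Z_ik]
        mu_new = (resp * X[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.max(np.abs(mu_new - mu)) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu

# Illustrative usage on synthetic data drawn from two components:
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
print(em_means(X, alpha=[0.4, 0.6], sigma2=1.0, mu_init=[-1.0, 1.0]))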

IV. CONCLUSIONS

The motivation of the EM algorithm is pretty straightforward and the whole framework is not very hard to understand. But in fact it is not as easy as I thought to use it correctly. I wrote this document to make myself understand this algorithm more deeply. I hope it will help.

ACKNOWLEDGMENT

Thanks to Ju Fan, who shared with me the slides of Prof. Zhang's famous course, Pattern Recognition. Thanks to Socrates Li, who taught me a lot about LaTeX and shared some beautiful templates with me. Finally, thanks to THUTV, MTV, and the incoming CCTV.

REFERENCES

[1] Jensen's inequality, Wikipedia, http://en.wikipedia.org/wiki/Jensen%27s_inequality.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006, pp. 450-453.
