
Tutorial of EM Algorithm: A Top-down Approach

Hao Wu
Department of Computer Science and Technology, Tsinghua University
Haidian, Beijing, 100084
haowu06@mails.tsinghua.edu.cn

Abstract— The EM algorithm is an important tool for estimating the parameters of models with hidden variables. In this document we give a brief tutorial of the algorithm. We first investigate why the EM algorithm is needed, and then we apply it to the task of estimating the parameters of Gaussian mixture models.

I. INTRODUCTION

The EM algorithm is a powerful tool for estimating the parameters of graphical models with hidden variables. A deep understanding of this algorithm can greatly help one understand other useful machine learning methods. In this document we give a brief tutorial of the EM algorithm in a top-down manner, which should make it easier to learn. In Section II we investigate why the EM algorithm is needed. In Section III we apply the algorithm to the task of estimating the parameters of Gaussian mixture models. Finally, we draw our conclusion in Section IV.

II. WHY EM ALGORITHM?

First of all, we have a set of observed data points x = (x_1, x_2, . . . , x_{M_1}) and a set of unobserved data points y = (y_1, y_2, . . . , y_{M_2}). Both x and y are generated by a graphical model with parameter θ. Given x, we want to estimate the best parameter of the model. In other words, we want to find the θ∗ which maximizes the log-likelihood function L(θ) := log p(x|θ) = log Σ_y p(x, y|θ), i.e.

    \theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log \sum_y p(x, y \mid \theta).

But this is an intractable task, since the summation sits inside the logarithm. Can we make it a little bit easier? Here are some intuitive considerations:

• It would be better to exchange the summation and the logarithm.
• If we can find a lower bound of the log-likelihood L(θ), say F(θ), then we can maximize this lower bound instead.

For any distribution q(y) we have Σ_y q(y) = 1. Thus, according to Jensen's inequality [1],

    \varphi\left( \frac{\sum_i a_i x_i}{\sum_i a_i} \right) \ge \frac{\sum_i a_i \varphi(x_i)}{\sum_i a_i},

where φ is a concave function. Taking φ = log, a_i = q(y), and x_i = p(x, y|θ)/q(y), we have

    L(\theta) = \log p(x \mid \theta)
              = \log \sum_y p(x, y \mid \theta)
              = \log \frac{ \sum_y q(y) \frac{p(x, y \mid \theta)}{q(y)} }{ \sum_y q(y) }
              \ge \frac{ \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)} }{ \sum_y q(y) }
              = \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}.

Let L(q, θ) := Σ_y q(y) log (p(x, y|θ)/q(y)). It can easily be observed that L(q, θ) is a family of lower bounds of L(θ) with respect to the distribution q. In addition, we also have

    L(q, \theta) = \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}
                 = \sum_y q(y) \log \frac{p(y \mid x, \theta)\, p(x \mid \theta)}{q(y)}
                 = \sum_y q(y) \left( \log p(y \mid x, \theta) + \log p(x \mid \theta) - \log q(y) \right)
                 = \sum_y q(y) \log p(x \mid \theta) + \sum_y q(y) \left( \log p(y \mid x, \theta) - \log q(y) \right)
                 = \log p(x \mid \theta) + \sum_y q(y) \log \frac{p(y \mid x, \theta)}{q(y)}.
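The last sum above is the negative Kullback–Leibler divergence between q(y) and the posterior p(y|x, θ). This identity is not spelled out in the original derivation, but writing it down makes the equality condition explicit:

    L(q, \theta) = L(\theta) - \mathrm{KL}\bigl( q(y) \,\|\, p(y \mid x, \theta) \bigr),
    \qquad \mathrm{KL}\bigl( q(y) \,\|\, p(y \mid x, \theta) \bigr) = \sum_y q(y) \log \frac{q(y)}{p(y \mid x, \theta)} \ge 0,

so L(q, θ) ≤ L(θ), with equality exactly when q(y) = p(y|x, θ).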
Obviously, if we set q(y) = p(y|x, θ) for some fixed value of θ, then L(q, θ) is equal to L(θ). Thus the task of maximizing L(θ) with respect to θ is equivalent to the task of maximizing L(q, θ) with respect to both q and θ. This is guaranteed by the following lemma.

Lemma 1: If L(q∗, θ∗) is the global maximum of L(q, θ), then q∗(y) = p(y|x, θ∗), and L(θ∗) = L(q∗, θ∗) is the global maximum of L(θ).

Proof: Assume that there exists a θ′ with L(θ′) > L(θ∗) = L(q∗, θ∗). Setting q′(y) = p(y|x, θ′), we have L(q′, θ′) = L(θ′) > L(θ∗) = L(q∗, θ∗), which contradicts the fact that L(q∗, θ∗) is the global maximum of L(q, θ).

Generally, finding the global maximum of L(q, θ) is still intractable, so in practice we only look for a local maximum. If we can find a local maximum L(q∗, θ∗), then L(θ∗) is also a local maximum of L(θ). This is guaranteed by the following lemma.

Lemma 2: If L(q∗, θ∗) is a local maximum of L(q, θ), then q∗(y) = p(y|x, θ∗), and L(θ∗) = L(q∗, θ∗) is a local maximum of L(θ).

Proof: The proof is omitted here. An intuitive explanation can be found in [2].

We can introduce an iterative procedure to find the local maximum by estimating q and θ separately:

ALGORITHM 1: LocalMaximum(L(q, θ))
INPUT: L(q, θ): the family of lower bounds of the log-likelihood.
OUTPUT: L(q∗, θ∗): the local maximum of the lower bound and the corresponding parameters, q∗ and θ∗.
1. θ[0] ← initial value; t ← 0;
2. do
3.   t ← t + 1;
4.   q[t] ← argmax_q L(q, θ[t−1]);   (t-th E-step)
5.   θ[t] ← argmax_θ L(q[t], θ);    (t-th M-step)
6. until convergence;
7. return L(q[t], θ[t]).

In line 4 we can easily find q[t] by setting q[t](y) = p(y|x, θ[t−1]). And most of the time, maximizing L(q[t], θ) in line 5 is relatively easy because the logarithm is inside the summation (see the definition of L(q, θ)). The iteration stops when the change of L(q, θ) is smaller than a given threshold.
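Algorithm 1 translates almost line for line into code. The sketch below is only a hypothetical skeleton, not an implementation from the paper: e_step, m_step and lower_bound are placeholder callables assumed to return the posterior q[t], the maximizing θ[t], and the value of L(q, θ), respectively, and convergence is checked on the change of the lower bound as described above.

# A minimal skeleton of Algorithm 1 (LocalMaximum); e_step, m_step and
# lower_bound are hypothetical callables supplied by the user.
def local_maximum(theta_init, e_step, m_step, lower_bound, tol=1e-6, max_iter=1000):
    theta = theta_init
    prev = -float("inf")
    for t in range(1, max_iter + 1):
        q = e_step(theta)             # line 4: q[t] <- p(y | x, theta[t-1])
        theta = m_step(q)             # line 5: theta[t] <- argmax_theta L(q[t], theta)
        value = lower_bound(q, theta)
        if abs(value - prev) < tol:   # line 6: stop when L(q, theta) barely changes
            break
        prev = value
    return q, theta, value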
Now we have an easier way to approximately estimate the parameter θ. In a word, this method, which is the EM algorithm, is simply a procedure for finding a local maximum of a lower bound of the log-likelihood. Much of the time the resulting model parameter θ∗ is very close to the true global maximum, so this method is widely used today.

III. AN EXAMPLE OF EM ALGORITHM

Now consider a very simple setting. We have M numbers X = (X_1, X_2, . . . , X_M), each of which is generated by one of N Gaussian mixture components C = (C_1, C_2, . . . , C_N). We assume that all of the components share the same variance σ², but their means µ = (µ_1, µ_2, . . . , µ_N) are different. Each component C_i has a weight α_i; we use α = (α_1, α_2, . . . , α_N) to denote them. The log-likelihood of generating all the numbers X takes the form

    L(\mu) = \log P(X \mid \mu) = \sum_{i=1}^{M} \log \sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2)

with respect to the parameter µ of the model. Our task is to estimate µ given the data X and the other parameters, σ² and α.

First of all, let us try to calculate µ∗ by using L(µ) directly. The first-order derivative of L(µ) takes the form

    \frac{\partial L(\mu)}{\partial \mu_k} = \sum_{i=1}^{M} \left( \frac{\alpha_k \mathcal{N}(X_i \mid \mu_k, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2)} \cdot \frac{1}{\sigma^2} (X_i - \mu_k) \right).

Thus the solution of ∂L(µ)/∂µ_k = 0 is hard to calculate directly.

Now we use the EM algorithm. First we introduce a set of random variables {Z_ij}: Z_ij equals 1 if X_i is generated by C_j, and equals 0 otherwise. Each Z_ij is a hidden variable and we have little idea about its distribution, so it is convenient to use the EM algorithm here. Let

    L(Q, \mu) = \sum_{Z} Q(Z) \log \frac{P(X, Z \mid \mu)}{Q(Z)}.

E-step (maximization of L(Q, µ[t−1]) with respect to Q):

    Q^{[t]}(Z) \leftarrow P(Z \mid X, \mu^{[t-1]}).

M-step (maximization of L(Q[t], µ) with respect to µ):

    \mu^{[t]} \leftarrow \arg\max_{\mu} L(Q^{[t]}, \mu)
              = \arg\max_{\mu} \left( \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu) - \sum_{Z} Q^{[t]}(Z) \log Q^{[t]}(Z) \right)
              = \arg\max_{\mu} \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu)
              = \arg\max_{\mu} \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu)
              = \arg\max_{\mu} E_Z[\log P(X, Z \mid \mu) \mid X, \mu^{[t-1]}]
              = \arg\max_{\mu} Q(\mu \mid \mu^{[t-1]}).

So the key task is to compute the µ which maximizes Q(µ|µ[t−1]). Using P(X, Z|µ) = P(X|Z, µ) P(Z|µ), we have

    \log P(X, Z \mid \mu) = \log \prod_{i=1}^{M} P(X_i, Z_i \mid \mu)
                          = \log \prod_{i=1}^{M} P(X_i \mid Z_i, \mu) P(Z_i \mid \mu)
                          = \log \prod_{i=1}^{M} \left( \prod_{j=1}^{N} \mathcal{N}(X_i \mid \mu_j, \sigma^2)^{Z_{ij}} \prod_{j=1}^{N} \alpha_j^{Z_{ij}} \right)
                          = \log \prod_{i=1}^{M} \prod_{j=1}^{N} \left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)^{Z_{ij}}
                          = \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

Accordingly, we have

    Q(\mu \mid \mu^{[t-1]}) = \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu)
                            = \sum_{Z_{11}} \sum_{Z_{12}} \cdots \sum_{Z_{MN}} P(Z \mid X, \mu^{[t-1]}) \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{Z_{ij}} P(Z_{ij} \mid X, \mu^{[t-1]}) \, Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

Note that we have omitted the conditioning of the expectation for simplicity. Generally, for a linear function f(x) the following property holds: E[f(x)] = f(E[x]). Using this property we can also rewrite Q(µ|µ[t−1]) directly as

    Q(\mu \mid \mu^{[t-1]}) = E_Z[\log P(X, Z \mid \mu)]
                            = E_Z\left[ \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right) \right]
                            = \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right).

To maximize Q(µ|µ[t−1]) we have to solve the equation

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = 0.

The first-order derivative of Q(µ|µ[t−1]) is

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) \right)
        = \frac{\partial}{\partial \mu_k} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( E[Z_{ij}] \log \frac{\alpha_j}{\sqrt{2\pi\sigma^2}} - E[Z_{ij}] \frac{(X_i - \mu_j)^2}{2\sigma^2} \right)
        = -\frac{1}{2\sigma^2} \sum_{i=1}^{M} \frac{\partial}{\partial \mu_k} \sum_{j=1}^{N} E[Z_{ij}] (X_i - \mu_j)^2
        = -\frac{1}{2\sigma^2} \sum_{i=1}^{M} \frac{d}{d\mu_k} E[Z_{ik}] (X_i - \mu_k)^2
        = \frac{1}{\sigma^2} \sum_{i=1}^{M} E[Z_{ik}] (X_i - \mu_k).

Thus

    \frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = 0
        \iff \frac{1}{\sigma^2} \sum_{i=1}^{M} E[Z_{ik}] (X_i - \mu_k) = 0
        \Longleftarrow \mu_k = \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]}.

So in the M-step we should update µ[t] = (µ_1[t], µ_2[t], . . . , µ_N[t]) as

    \mu_k^{[t]} \leftarrow \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]},

where k = 1, 2, . . . , N.

It is worth noting that E[Z_ik] is the probability that the i-th data point is generated by the k-th component, which takes the form

    E[Z_{ik}] = \sum_{Z_{ik}} Z_{ik} P(Z_{ik} \mid X_i, \mu^{[t-1]})
              = 0 \cdot P(Z_{ik} = 0 \mid X_i, \mu^{[t-1]}) + 1 \cdot P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]})
              = P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]})
              = \frac{P(Z_{ik} = 1, X_i \mid \mu^{[t-1]})}{\sum_{j=1}^{N} P(Z_{ij} = 1, X_i \mid \mu^{[t-1]})}
              = \frac{P(X_i \mid Z_{ik} = 1, \mu^{[t-1]}) P(Z_{ik} = 1 \mid \mu^{[t-1]})}{\sum_{j=1}^{N} P(X_i \mid Z_{ij} = 1, \mu^{[t-1]}) P(Z_{ij} = 1 \mid \mu^{[t-1]})}
              = \frac{\alpha_k \mathcal{N}(X_i \mid \mu_k^{[t-1]}, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \mathcal{N}(X_i \mid \mu_j^{[t-1]}, \sigma^2)}.
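Putting the E-step (the responsibilities E[Z_ik]) and the M-step (the weighted-mean update of µ_k) together gives the complete procedure for this example. The following Python sketch is an illustrative implementation under the assumptions of this section (shared, known variance σ² and fixed weights α; only the means are updated); the function and variable names are our own, not from the paper.

# EM for the example of Section III: a 1-D Gaussian mixture with shared,
# known variance sigma^2 and fixed weights alpha; only the means are learned.
import numpy as np

def em_means(X, alpha, sigma2, mu_init, n_iter=100, tol=1e-8):
    X = np.asarray(X, dtype=float)           # shape (M,)
    mu = np.asarray(mu_init, dtype=float)    # shape (N,)
    alpha = np.asarray(alpha, dtype=float)   # shape (N,)
    for _ in range(n_iter):
        # E-step: E[Z_ik] = alpha_k N(X_i | mu_k, sigma2) / sum_j alpha_j N(X_i | mu_j, sigma2)
        diff = X[:, None] - mu[None, :]                                  # (M, N)
        dens = np.exp(-0.5 * diff**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        resp = alpha * dens
        resp /= resp.sum(axis=1, keepdims=True)                          # normalize over components
        # M-step: mu_k <- sum_i E[Z_ik] X_i / sum_i E[Z_ik]
        mu_new = (resp * X[:, None]).sum(axis=0) / resp.sum(axis=0)
        if np.max(np.abs(mu_new - mu)) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu

# Illustrative usage on synthetic data drawn from two components:
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
print(em_means(X, alpha=[0.4, 0.6], sigma2=1.0, mu_init=[-1.0, 1.0]))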

IV. CONCLUSIONS

The motivation of the EM algorithm is pretty straightforward and the whole framework is not very hard to understand. But in fact it is not as easy as I thought to use it correctly. I wrote this document to make myself understand this algorithm more deeply. I hope it will help.

ACKNOWLEDGMENT

Thanks to Ju Fan, who shared with me the slides of Prof. Zhang's famous course, Pattern Recognition. Thanks to Socrates Li, who taught me a lot about LaTeX and shared some beautiful templates with me. Finally, thanks to THUTV, MTV, and the incoming CCTV.

REFERENCES

[1] Jensen's inequality, Wikipedia, http://en.wikipedia.org/wiki/Jensen%27s_inequality.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006, pp. 450-453.
