Hao Wu
Department of Computer Science and Technology, Tsinghua University
Haidian, Beijing, 100084
haowu06@mails.tsinghua.edu.cn
Abstract— The EM algorithm is an important tool for estimating the parameters of models with hidden variables. In this document we give a brief tutorial on the algorithm. We first investigate why the EM algorithm is needed, and then apply it to the task of estimating the parameters of Gaussian mixture models.
I. INTRODUCTION

The EM algorithm is a powerful tool for estimating the parameters of graphical models with hidden variables. Deeply understanding it starts from two observations:
• It will be better to exchange the summation and the logarithm.
• If we can find a lower bound of the log-likelihood L(θ), say F(θ), then we can maximize this lower bound instead.

For any distribution q(y), we have Σ_y q(y) = 1. Thus, according to Jensen's inequality [1],

\varphi\!\left( \frac{\sum_i a_i x_i}{\sum_i a_i} \right) \ge \frac{\sum_i a_i \varphi(x_i)}{\sum_i a_i},

where ϕ is a concave function, we have

\begin{aligned}
L(\theta) = \log p(x \mid \theta)
&= \log \sum_y p(x, y \mid \theta) \\
&= \log \sum_y q(y) \, \frac{p(x, y \mid \theta)}{q(y)} \\
&\ge \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)} \triangleq L(q, \theta).
\end{aligned}

Obviously, if we set q(y) = p(y|x, θ) given some fixed value of θ, L(q, θ) will be equal to L(θ), since

\sum_y p(y \mid x, \theta) \log \frac{p(x, y \mid \theta)}{p(y \mid x, \theta)} = \sum_y p(y \mid x, \theta) \log p(x \mid \theta) = \log p(x \mid \theta).

Thus the task of maximizing L(θ) with respect to θ is equal to the task of maximizing L(q, θ) with respect to both q and θ. This is guaranteed by the following lemma.

Lemma 1: If L(q*, θ*) is the global maximum of L(q, θ), then q*(y) = p(y|x, θ*), and L(θ*) = L(q*, θ*) is the global maximum of L(θ).
Proof: Assume that there exists θ′ such that L(θ′) > L(θ*) = L(q*, θ*). Setting q′(y) = p(y|x, θ′), we have L(q′, θ′) = L(θ′) > L(θ*) = L(q*, θ*), which contradicts the fact that L(q*, θ*) is the global maximum of L(q, θ).
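In other words, L(q, θ) can be maximized by coordinate ascent, alternating between its two arguments; in the standard formulation, iteration t performs

\begin{aligned}
\text{E-step:}\quad & q^{[t]}(y) = \arg\max_{q} L(q, \theta^{[t-1]}) = p(y \mid x, \theta^{[t-1]}), \\
\text{M-step:}\quad & \theta^{[t]} = \arg\max_{\theta} L(q^{[t]}, \theta) = \arg\max_{\theta} \sum_y q^{[t]}(y) \log p(x, y \mid \theta).
\end{aligned}

The second form of the M-step holds because the entropy term of q^{[t]} in L(q^{[t]}, θ) does not depend on θ; what remains is the expected complete-data log-likelihood, which is exactly the function Q(µ|µ^{[t-1]}) used in the next section.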
II. GAUSSIAN MIXTURE MODELS

Suppose we observe M numbers generated from a mixture of N Gaussian components, where component j has mean µj, variance σ², and mixing weight αj; we use X = {X1, X2, ..., XM} to denote them. The log-likelihood of generating all the numbers X takes the following form as a function of the parameter µ of the model:

L(\mu) = \log P(X \mid \mu) = \sum_{i=1}^{M} \log \sum_{j=1}^{N} \alpha_j \, N(X_i \mid \mu_j, \sigma^2).

Our task is to estimate µ given the data X and the other parameters, σ² and α.

First of all, let us calculate µ* by using L(µ) directly. The first-order derivative of L(µ) takes the form

\frac{\partial L(\mu)}{\partial \mu_k} = \sum_{i=1}^{M} \frac{\alpha_k \, N(X_i \mid \mu_k, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \, N(X_i \mid \mu_j, \sigma^2)} \cdot \frac{X_i - \mu_k}{\sigma^2}.

Setting this derivative to zero yields no closed-form solution for µk, because µk also appears inside the weighting fractions. Instead, let Zij be hidden indicator variables, with Zij = 1 if Xi is generated by the j-th component and Zij = 0 otherwise. The complete-data log-likelihood then takes the form

\log P(X, Z \mid \mu) = \log \prod_{i=1}^{M} \prod_{j=1}^{N} \left( \alpha_j \, N(X_i \mid \mu_j, \sigma^2) \right)^{Z_{ij}} = \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \, N(X_i \mid \mu_j, \sigma^2) \right).
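As an aside, L(µ) itself is easy to evaluate numerically with a log-sum-exp over components, which is useful for checking that each EM iteration increases it. A minimal sketch in Python; the name gmm_log_likelihood and the array layout are illustrative choices, not part of the original text:

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, mu, sigma2, alpha):
    """L(mu) = sum_i log sum_j alpha_j * N(X_i | mu_j, sigma^2)."""
    X = np.asarray(X, dtype=float)[:, None]    # shape (M, 1)
    mu = np.asarray(mu, dtype=float)[None, :]  # shape (1, N)
    # log alpha_j + log N(X_i | mu_j, sigma^2), shape (M, N)
    log_comp = (np.log(np.asarray(alpha, dtype=float))[None, :]
                - 0.5 * np.log(2.0 * np.pi * sigma2)
                - (X - mu) ** 2 / (2.0 * sigma2))
    return logsumexp(log_comp, axis=1).sum()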
Accordingly, we have

\begin{aligned}
Q(\mu \mid \mu^{[t-1]}) &= \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu) \\
&= \sum_{Z_{11}} \sum_{Z_{12}} \cdots \sum_{Z_{MN}} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu) \\
&= \sum_{Z_{11}} \sum_{Z_{12}} \cdots \sum_{Z_{MN}} P(Z \mid X, \mu^{[t-1]}) \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log\left( \alpha_j \, N(X_i \mid \mu_j, \sigma^2) \right) \\
&= \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{Z_{ij}} P(Z_{ij} \mid X, \mu^{[t-1]}) \, Z_{ij} \log\left( \alpha_j \, N(X_i \mid \mu_j, \sigma^2) \right) \\
&= \sum_{i=1}^{M} \sum_{j=1}^{N} E[Z_{ij}] \log\left( \alpha_j \, N(X_i \mid \mu_j, \sigma^2) \right).
\end{aligned}

Note that we have omitted the conditioning of the expectations for simplicity here. Generally, for a linear function f(x) the following property holds: E[f(x)] = f(E[x]).

Thus, since ∂ log N(Xi|µk, σ²)/∂µk = (Xi − µk)/σ²,

\frac{\partial}{\partial \mu_k} Q(\mu \mid \mu^{[t-1]}) = 0
\iff \frac{1}{\sigma^2} \sum_{i=1}^{M} E[Z_{ik}] (X_i - \mu_k) = 0
\Longleftarrow \mu_k = \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]}.

So in the M-step, we should update µ^{[t]} = (µ1^{[t]}, µ2^{[t]}, ..., µN^{[t]}) as

\mu_k^{[t]} \leftarrow \frac{\sum_{i=1}^{M} E[Z_{ik}] X_i}{\sum_{i=1}^{M} E[Z_{ik}]},

where k = 1, 2, ..., N.
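The update is simply a responsibility-weighted average of the data. A minimal sketch in Python, assuming the responsibilities E[Zik] are stored in an array R of shape (M, N); the name m_step and the array layout are illustrative choices:

import numpy as np

def m_step(X, R):
    """M-step: set each mean mu_k to the E[Z_ik]-weighted average of the data."""
    X = np.asarray(X, dtype=float)  # shape (M,)
    R = np.asarray(R, dtype=float)  # shape (M, N), R[i, k] = E[Z_ik]
    # mu_k = (sum_i R[i, k] * X_i) / (sum_i R[i, k]), for k = 1, ..., N
    return (R * X[:, None]).sum(axis=0) / R.sum(axis=0)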
It is worth noting that E[Zik] is the probability that the i-th data point is generated by the k-th component, which takes the form

\begin{aligned}
E[Z_{ik}] &= \sum_{Z_{ik}} Z_{ik} \, P(Z_{ik} \mid X, \mu^{[t-1]}) \\
&= 0 \cdot P(Z_{ik} = 0 \mid X_i, \mu^{[t-1]}) + 1 \cdot P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]}) \\
&= P(Z_{ik} = 1 \mid X_i, \mu^{[t-1]}).
\end{aligned}
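By Bayes' rule, this posterior equals αk N(Xi|µk^{[t-1]}, σ²) / Σ_{j=1}^{N} αj N(Xi|µj^{[t-1]}, σ²). A minimal Python sketch of the resulting EM iteration for this model follows; the names e_step and em, the array layout, and the fixed iteration count are our own choices, not part of the original text:

import numpy as np

def e_step(X, mu, sigma2, alpha):
    """E-step: R[i, k] = E[Z_ik] = P(Z_ik = 1 | X_i, mu)."""
    X = np.asarray(X, dtype=float)[:, None]    # shape (M, 1)
    mu = np.asarray(mu, dtype=float)[None, :]  # shape (1, N)
    # alpha_k * N(X_i | mu_k, sigma^2); the Gaussian normalizing
    # constant cancels in the ratio because sigma^2 is shared.
    w = np.asarray(alpha, dtype=float)[None, :] * np.exp(-(X - mu) ** 2 / (2.0 * sigma2))
    return w / w.sum(axis=1, keepdims=True)    # normalize over components k

def em(X, mu0, sigma2, alpha, n_iter=100):
    """Alternate E- and M-steps to estimate the component means mu."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu0, dtype=float)
    for _ in range(n_iter):
        R = e_step(X, mu, sigma2, alpha)                   # E-step: E[Z_ik]
        mu = (R * X[:, None]).sum(axis=0) / R.sum(axis=0)  # M-step update of mu_k
    return mu

For instance, on data drawn half from N(−2, 1) and half from N(2, 1), em(X, mu0=[-1.0, 1.0], sigma2=1.0, alpha=[0.5, 0.5]) should recover means close to −2 and 2.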