The goal is to estimate the parameter θ for a family p(x, y; θ) of distributions on a product
space S1 ×S2 . Assume that if we observe the full data (X (n) , Y (n) ), n = 1, . . . , N , finding the
maximum likelihood estimate is a simple computation. However we only get to observe
an i.i.d. random sample X^(1), . . . , X^(N) (the Y^(n)'s remain hidden). The best we can do is
maximize the marginal log-likelihood:
\[
\max_{\theta}\,\ell(X^{(1)},\dots,X^{(N)};\theta) \;=\; \max_{\theta}\,\sum_{n=1}^{N}\log p_X(X^{(n)};\theta). \tag{1}
\]
We develop an iterative algorithm which can be used to find local maxima of this marginal
log-likelihood.
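As a toy illustration of (1), here is a minimal sketch in Python, assuming the joint p(x, y; θ) is available as a table `joint[x, y]` over a small finite S1 × S2 (the array names and the toy numbers are hypothetical, used only for illustration):

```python
import numpy as np

def marginal_log_likelihood(X, joint):
    """Evaluate (1): sum over n of log p_X(X^(n)), with the hidden y summed out.

    X     : length-N integer array of observed states in S1 = {0, ..., |S1|-1}
    joint : |S1| x |S2| array with joint[x, y] = p(x, y; theta)
    """
    p_x = joint.sum(axis=1)          # marginal p_X(x; theta) = sum_y p(x, y; theta)
    return np.log(p_x[X]).sum()      # sum of log-marginals over the sample

# toy joint distribution on {0,1} x {0,1,2} and a small observed sample
joint = np.array([[0.10, 0.20, 0.15],
                  [0.25, 0.05, 0.25]])
X = np.array([0, 1, 1, 0, 1])
print(marginal_log_likelihood(X, joint))
```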
Recall the Kullback–Leibler (KL) distance between two distributions q and p on a discrete set,
\[
K(q,p) = \sum_{y} q(y)\log\frac{q(y)}{p(y)} \;\ge\; 0.
\]
Since p and q are distributions that add up to 1, equality can only be achieved if p(y) = q(y) for all y. The entropy of a distribution q, denoted H(q), is given by
\[
H(q) = -\sum_{y} q(y)\log q(y).
\]
Let ψ(y) be a function on a discrete state space S and define the probability p on S as
\[
p(y) = \frac{1}{Z}\,e^{\psi(y)}, \qquad Z = \sum_{y\in S} e^{\psi(y)}, \qquad B = \log Z.
\]
The Gibbs variational formula states that
\[
B = \max_{q}\big[H(q) + E_q\,\psi\big],
\]
where the unique maximizer is p. This follows directly from the properties of the KL distance. For any q,
\[
0 \le K(q,p) = -H(q) - E_q\,\psi + B,
\]
so that
\[
B \ge H(q) + E_q\,\psi,
\]
and equality is achieved only when q = p.
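The bound is easy to check numerically. A small sketch, assuming an arbitrary ψ on a four-point state space (all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
psi = rng.normal(size=4)                  # arbitrary function psi on S = {0, 1, 2, 3}
B = np.log(np.exp(psi).sum())             # B = log Z
p = np.exp(psi - B)                       # Gibbs distribution p(y) = e^psi(y) / Z

def bound(q):
    """H(q) + E_q psi, which by the Gibbs variational formula is at most B."""
    return -(q * np.log(q)).sum() + (q * psi).sum()

q = rng.dirichlet(np.ones(4))             # some other distribution q on S
print(bound(q) <= B)                      # True for any q
print(np.isclose(bound(p), B))            # equality exactly at q = p
```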
Consider a joint distribution p(x, y) on a product space of two sets S1 × S2 with x ∈ S1 and
y ∈ S2 . Write
\[
p_{Y|X}(y\,|\,x) = \frac{p_{X,Y}(x,y)}{p_X(x)} = \frac{e^{\log p(x,y)}}{Z},
\]
where the normalizing constant Z is given by the marginal p_X(x). For fixed x set ψ(y) = log p(x, y); then using the Gibbs variational formula we have
\[
\log p_X(x) = \max_{q}\big[H(q) + E_q \log p(x,\cdot\,)\big],
\]
with the unique maximizer q = p_{Y|X}(·|x).
Since
\[
\log p(X^{(n)};\theta) = \max_{q}\big[H(q) + E_q \log p(X^{(n)},\cdot\,;\theta)\big],
\]
the marginal log-likelihood (1) can be written as a joint maximization over θ and distributions q_1, . . . , q_N on S_2:
\[
\max_{\theta}\,\ell(X^{(1)},\dots,X^{(N)};\theta) = \max_{\theta}\;\max_{q_1,\dots,q_N}\;\sum_{n=1}^{N}\big[H(q_n) + E_{q_n} \log p(X^{(n)},\cdot\,;\theta)\big].
\]
This suggests an iterative scheme to maximize (1): maximize in the q_n's for fixed θ, and then maximize in θ for fixed q_n's.
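In code the alternation is a short loop. The sketch below assumes the same finite table representation as before, plus hypothetical user-supplied routines `joint_from_theta` (the model) and `m_step` (the easy full-data maximization, weighted by the q_n's); the E update uses the fact, noted above, that the maximizing q is the conditional.

```python
import numpy as np

def em(X, joint_from_theta, m_step, theta0, n_iters=100):
    """Alternate: maximize over the q_n's for fixed theta, then over theta for fixed q_n's.

    X                : length-N integer array of observations
    joint_from_theta : theta -> |S1| x |S2| array of p(x, y; theta)   (model, user-supplied)
    m_step           : (X, Q) -> new theta, where Q[n, y] = q_n(y)    (weighted full-data ML)
    """
    theta = theta0
    for _ in range(n_iters):
        joint = joint_from_theta(theta)
        cond = joint / joint.sum(axis=1, keepdims=True)   # p(y | x; theta), one row per x
        Q = cond[X]                                       # E step: q_n = p(. | X^(n); theta)
        theta = m_step(X, Q)                              # M step: maximize for fixed q_n's
    return theta
```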
For the E step we know already that the maximizing distribution is the conditional,
\[
q_n^{(t)}(y) = p(y\,|\,X^{(n)};\theta^{(t)}).
\]
For the M step we maximize
\[
\sum_{n=1}^{N}\big[H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)},\cdot\,;\theta)\big]
\]
over θ. The first term is a constant with respect to varying θ and can be ignored in the maximization. Writing out the expectation we get
\[
\theta^{(t+1)} = \arg\max_{\theta}\;\sum_{n=1}^{N}\sum_{y\in S_2} p(y\,|\,X^{(n)};\theta^{(t)})\,\log p(X^{(n)},y;\theta).
\]
To avoid confusion, we reiterate that θ^(t) in the above expression is a constant value, different from the variable θ being maximized.
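For completeness, a sketch of evaluating the objective being maximized, in the same hypothetical table representation as above; θ^(t+1) is whatever value of θ maximizes this function (e.g., by a grid search in a toy model):

```python
import numpy as np

def m_step_objective(theta, X, cond_old, joint_from_theta):
    """The double sum above: sum_n sum_y p(y | X^(n); theta^(t)) log p(X^(n), y; theta).

    cond_old         : |S1| x |S2| array of p(y | x; theta^(t)), held fixed during the M step
    joint_from_theta : theta -> |S1| x |S2| array of p(x, y; theta)   (hypothetical model)
    """
    log_joint = np.log(joint_from_theta(theta))
    return sum((cond_old[x] * log_joint[x]).sum() for x in X)
```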
The above maximization has the following interpretation: The value of θ(t+1) is what
would be obtained by maximum likelihood estimation if in addition to the observed data
{X^(n)}_{n=1}^N, we had access to a large number of Y observations for each X^(n), drawn from the conditional p(y | X^(n); θ^(t)). Now, suppose we could easily perform maximum likelihood estimation given full data by computing a sufficient statistic such as an empirical average of some function T over the full data, i.e.,
\[
\theta_{ML} = \frac{1}{N}\sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}).
\]
Then, in the present context of unobserved data, θ^(t+1) reduces to the average of T
over our hypothetical pool of full data: Y observations for each X (n) , drawn from the
conditional. Thus,
\[
\theta^{(t+1)} = \frac{1}{N}\sum_{n=1}^{N}\sum_{y\in S_2} p(y\,|\,X^{(n)};\theta^{(t)})\,T(X^{(n)}, y). \tag{8}
\]
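A sketch of (8) in the same table setting, with a caller-supplied statistic T (hypothetical; T must be the same statistic that the full-data maximum likelihood estimate averages):

```python
def em_update_from_statistic(X, cond, T):
    """Equation (8): average T(X^(n), y) over n and y, weighting each y by the
    current conditional cond[x, y] = p(y | x; theta^(t))."""
    total = 0.0
    for x in X:
        for y in range(cond.shape[1]):
            total += cond[x, y] * T(x, y)
    return total / len(X)
```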
More generally, if the joint distribution is of an exponential family and θ_ML is the solution to the equation
\[
E_\theta\,T(X,Y) = \frac{1}{N}\sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}),
\]
then θ^(t+1) is obtained by solving the same equation with each unobserved statistic replaced by its conditional expectation:
\[
E_\theta\,T(X,Y) = \frac{1}{N}\sum_{n=1}^{N}\sum_{y\in S_2} p(y\,|\,X^{(n)};\theta^{(t)})\,T(X^{(n)}, y).
\]
Some properties of the iterated maximization
Suppose the iterations reach a fixed point (q^*, θ^*) = (q_1^*, . . . , q_N^*, θ^*). At a fixed point, by definition, the E and M steps look like
\[
q_n^* = \arg\max_{q}\big[H(q) + E_q \log p(X^{(n)},\cdot\,;\theta^*)\big] \tag{10}
\]
\[
\theta^* = \arg\max_{\theta}\;\sum_{n=1}^{N}\big[H(q_n^*) + E_{q_n^*} \log p(X^{(n)},\cdot\,;\theta)\big]. \tag{11}
\]
In this problem, the hidden variable Y takes values in S_2 = {1, . . . , K}. Let us set up some notation:
\[
\pi_k = P(Y = k), \qquad p_k(x;\theta_k) = P(x\,|\,Y = k;\theta_k).
\]
If we have a fully observed sample (X (1) , Y (1) ), . . . , (X (N ) , Y (N ) ), then it is easy to see that
maximum likelihood leads to separating the sample into the K groups according to the
value of Y (n) and computing
\[
\hat\theta_k = \arg\max_{\theta_k}\;\sum_{n=1}^{N} 1_{[Y^{(n)}=k]}\,\log p_k(X^{(n)};\theta_k)
\]
\[
\hat\pi_k = \frac{1}{N}\sum_{n=1}^{N} 1_{[Y^{(n)}=k]}. \tag{12}
\]
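With the labels observed, (12) is just a per-group fit. A sketch for the special case where p_k is a unit-variance Gaussian, so that θ_k is the component mean (this choice of p_k is an assumption made only for illustration):

```python
import numpy as np

def fit_complete_data(X, Y, K):
    """Full-data ML for the mixture: split the sample by its label and fit each group.
    Here p_k(x; theta_k) is a unit-variance Gaussian, so theta_k is just the group mean."""
    pi_hat = np.array([(Y == k).mean() for k in range(K)])        # equation (12)
    theta_hat = np.array([X[Y == k].mean() for k in range(K)])    # per-group ML estimate
    return pi_hat, theta_hat
```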
Using Bayes' rule, the E step, which involves computing p(k | X^(n); θ^(t)) for each k and n, becomes
\[
q_n^{(t)}(k) = \frac{\pi_k^{(t)}\,p_k(X^{(n)};\theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}\,p_{k'}(X^{(n)};\theta_{k'}^{(t)})}.
\]
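As a concrete sketch (continuing the illustrative unit-variance Gaussian choice of p_k from above), the E step is one row-normalized N × K table of weights:

```python
import numpy as np

def e_step(X, pi, mu):
    """q_n^(t)(k): posterior probability of component k for each observation X^(n)."""
    # p_k(X^(n); mu_k) for unit-variance Gaussian components, shape (N, K)
    lik = np.exp(-0.5 * (X[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    unnorm = pi[None, :] * lik                        # numerator pi_k * p_k(X^(n); mu_k)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```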
For the M step, Equation (12) reduces to its weighted analogue, with the indicator 1_{[Y^{(n)}=k]} replaced by q_n^{(t)}(k):
\[
\theta_k^{(t+1)} = \arg\max_{\theta_k}\;\sum_{n=1}^{N} q_n^{(t)}(k)\,\log p_k(X^{(n)};\theta_k) \tag{13}
\]
\[
\pi_k^{(t+1)} = \frac{1}{N}\sum_{n=1}^{N} q_n^{(t)}(k). \tag{14}
\]
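Combining (13)–(14) with the `e_step` sketch above (still with the hypothetical unit-variance Gaussian components, for which (13) is a weighted mean) gives a complete toy EM loop:

```python
import numpy as np

def m_step(X, Q):
    """Equations (13)-(14): weighted updates from the table Q[n, k] = q_n^(t)(k)."""
    pi = Q.mean(axis=0)                                  # (14)
    mu = (Q * X[:, None]).sum(axis=0) / Q.sum(axis=0)    # (13): weighted mean per component
    return pi, mu

# a short run on synthetic data with two components
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])
for _ in range(100):
    Q = e_step(X, pi, mu)          # E step from the sketch above
    pi, mu = m_step(X, Q)
print(pi, mu)                      # should approach roughly [0.4, 0.6] and [-2, 3]
```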