X^n.
If it is known that the data sequence x^n was generated according to a probability distribution P(X^n), there exists a prefix coding such that x^n can be encoded with code-length L(x^n) = -\log P(x^n). However, the probability distribution generating the data sequence is unknown in general. When we only know the class M = \{P(X^n \mid \theta) : \theta \in \Theta\} (n = 1, 2, \dots), where \Theta is a k-dimensional compact parameter space and \theta \in \Theta, we consider the min-max criterion proposed by Shtarkov [12]:

\min_Q \max_{x^n} \left\{ -\log Q(x^n) - \min_{\theta} \left( -\log P(x^n \mid \theta) \right) \right\}.
Here the minimum is attained by the normalized maximum likelihood (NML) distribution defined by

P_{NML}(x^n \mid M) = \frac{P(x^n \mid \hat{\theta}(x^n, M))}{C(M, n)},

where \hat{\theta}(x^n, M) is the maximum likelihood estimator (MLE) of \theta from x^n, and C(M, n) is the normalization constant defined as follows:

C(M, n) = \sum_{y^n \in X^n} P(y^n \mid \hat{\theta}(y^n, M)),   (1)

where the sum in (1) is taken over all possible data sequences of length n.
The stochastic complexity (SC) [9] of x^n relative to M is defined as the code-length of x^n under the NML distribution, which we call the NML code-length:

SC(x^n : M) \stackrel{\mathrm{def}}{=} -\log P_{NML}(x^n \mid M)
           = -\log P(x^n \mid \hat{\theta}(x^n, M)) + \log C(M, n).   (2)

The minimum description length (MDL) principle [10] asserts that for given x^n, the best model is the one which attains the minimum of the SC of x^n relative to M. We employ the SC, equivalently the NML code-length, as a criterion for model selection. The problem is that in general the computational cost of the normalization term C(M, n) in Eq. (2) is exponential in the sample size n, or C(M, n) may even diverge.
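For intuition (outside this paper's Gaussian setting), when X is a finite alphabet the sum in Eq. (1) can be evaluated directly by grouping sequences with the same sufficient statistics. A minimal Python sketch for the Bernoulli class:

```python
from math import comb

def bernoulli_C(n):
    """C(M, n) for the Bernoulli class: the sum over all y^n of
    P(y^n | theta_hat(y^n)), grouped by the count k of ones, so that
    C(M, n) = sum_k C(n, k) (k/n)^k ((n-k)/n)^(n-k)."""
    total = 0.0
    for k in range(n + 1):
        # Python evaluates 0.0 ** 0 as 1.0, matching the 0^0 = 1 convention
        total += comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
    return total
```

Even here the direct sum costs O(n) per evaluation; for richer model classes such as the GMMs below, the analogous sum ranges over exponentially many terms, which is what the rest of the paper addresses.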
B. Approximation of NML for Gaussian Distributions
This section gives a method for approximating the NML for Gaussian distributions. When X is a countable set, Kontkanen and Myllymäki [5],[6] developed algorithms for efficiently computing the normalization term C(M, n). In the discussion to follow we consider the case where X is continuous. In this case, the problem is that the normalization term may diverge [4],[9]. We propose a method for approximately computing the normalization term for Gaussian distributions so that it does not diverge. The key idea is to appropriately restrict the range of x^n over which the sum is taken in the normalization term.
Let an observed data sequence be x^n = (x_1, \dots, x_n), where x_i = (x_{i1}, \dots, x_{im})^T (i = 1, \dots, n). We use the class of Gaussian distributions N(\mu, \Sigma), where \mu \in \mathbb{R}^m is a mean vector, \Sigma \in \mathbb{R}^{m \times m} is a covariance matrix, and m is the dimension of x_i. The probability density function of x^n for the Gaussian distribution is given by
f(x^n; \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{mn}{2}} |\Sigma|^{\frac{n}{2}}} \exp\left( -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right),   (3)
and the NML distribution based on the Gaussian distribution is defined as follows:

f_{NML}(x^n) \stackrel{\mathrm{def}}{=} \frac{f(x^n; \hat{\mu}(x^n), \hat{\Sigma}(x^n))}{\int_{Y(\lambda_{\min}, R)} f(y^n; \hat{\mu}(y^n), \hat{\Sigma}(y^n))\, dy^n},   (4)

where \hat{\mu}(x^n) and \hat{\Sigma}(x^n) are the MLEs of \mu and \Sigma, respectively:
\hat{\mu}(x^n) = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \hat{\Sigma}(x^n) = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu}(x^n))(x_i - \hat{\mu}(x^n))^T.
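The MLEs above are straightforward to compute; a minimal sketch with NumPy (note the 1/n factor in the covariance, rather than the unbiased 1/(n-1)):

```python
import numpy as np

def gaussian_mle(x):
    """Maximum likelihood estimates for a multivariate Gaussian.

    x: (n, m) array of n observations in R^m.
    Returns (mu_hat, sigma_hat) with the 1/n (biased) covariance,
    matching the MLE definition in the text.
    """
    n = x.shape[0]
    mu_hat = x.mean(axis=0)
    centered = x - mu_hat
    sigma_hat = centered.T @ centered / n  # note 1/n, not 1/(n-1)
    return mu_hat, sigma_hat
```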
For given constants \lambda_{\min} (> 0) and R (> 0), we set a restricted domain as follows:

Y(\lambda_{\min}, R) \stackrel{\mathrm{def}}{=} \left\{ y^n \;\middle|\; \lambda_{\min} \le \lambda_j(\hat{\Sigma}(y^n)) \ (j = 1, \dots, m),\ \|\hat{\mu}(y^n)\|^2 \le R,\ y^n \in X^n \right\},

where \lambda_j(\hat{\Sigma}(y^n)) (j = 1, \dots, m) are the eigenvalues of \hat{\Sigma}(y^n). This restriction makes the calculation of the normalization term C(M, n) easier, as shown below.
First, by substituting the MLEs \hat{\mu}(x^n), \hat{\Sigma}(x^n) into formula (3), the numerator of Eq. (4) can be expressed as follows:

f(x^n; \hat{\mu}(x^n), \hat{\Sigma}(x^n)) = \prod_{i=1}^n \frac{1}{(2\pi)^{\frac{m}{2}} |\hat{\Sigma}(x^n)|^{\frac{1}{2}}} \exp\left( -\frac{1}{2} (x_i - \hat{\mu}(x^n))^T \hat{\Sigma}(x^n)^{-1} (x_i - \hat{\mu}(x^n)) \right).
Next, we calculate the denominator of formula (4). Using the fact that \hat{\mu}(x^n) and \hat{\Sigma}(x^n) are sufficient statistics, we can calculate the normalization term as an integral with respect to \hat{\mu}(x^n) and \hat{\Sigma}(x^n). As the MLEs are sufficient statistics, f(x^n; \mu, \Sigma) is decomposed as follows:

f(x^n; \mu, \Sigma) = f(x^n \mid \hat{\mu}(x^n), \hat{\Sigma}(x^n)) \cdot g_1(\hat{\mu}(x^n); \mu, \Sigma) \cdot g_2(\hat{\Sigma}(x^n); \Sigma),
where

g_1(\hat{\mu}(x^n); \mu, \Sigma) \stackrel{\mathrm{def}}{=} \frac{1}{(2\pi/n)^{\frac{m}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left( -\frac{1}{2/n} (\hat{\mu} - \mu)^T \Sigma^{-1} (\hat{\mu} - \mu) \right),

g_2(\hat{\Sigma}(x^n); \Sigma) \stackrel{\mathrm{def}}{=} \frac{|\hat{\Sigma}|^{\frac{n-m-2}{2}}}{2^{\frac{m(n-1)}{2}} |\Sigma/n|^{\frac{n-1}{2}} \Gamma_m\!\left(\frac{n-1}{2}\right)} \exp\left( -\frac{1}{2} \mathrm{Tr}\!\left(n \Sigma^{-1} \hat{\Sigma}\right) \right).
Here f(x^n \mid \hat{\mu}(x^n), \hat{\Sigma}(x^n)) denotes the conditional density of x^n given the values of the sufficient statistics \hat{\mu}(x^n) and \hat{\Sigma}(x^n). We fix the values \hat{\mu}(x^n) = \bar{\mu}, \hat{\Sigma}(x^n) = \bar{\Sigma}, and let

g(\bar{\mu}, \bar{\Sigma}) \stackrel{\mathrm{def}}{=} g_1(\bar{\mu}; \bar{\mu}, \bar{\Sigma}) \cdot g_2(\bar{\Sigma}; \bar{\Sigma}) = \frac{n^{\frac{mn}{2}}}{2^{\frac{mn}{2}} \pi^{\frac{m}{2}} e^{\frac{mn}{2}} \Gamma_m\!\left(\frac{n-1}{2}\right)} |\bar{\Sigma}|^{-\frac{m}{2}-1}.
Letting the normalization term of formula (4) be C(M, n), we can calculate it by integrating g(\bar{\mu}, \bar{\Sigma}) over the restricted domain:

C(M, n) = \int\!\!\!\int_{Y(\lambda_{\min}, R)} g(\bar{\mu}, \bar{\Sigma})\, d\bar{\mu}\, d\bar{\Sigma}
        = \frac{2^{m+1} R^{\frac{m}{2}}}{\lambda_{\min}^{\frac{m^2}{2}} m^{m+1} \Gamma\!\left(\frac{m}{2}\right)} \left(\frac{n}{2e}\right)^{\frac{mn}{2}} \frac{1}{\Gamma_m\!\left(\frac{n-1}{2}\right)}
        = B(m, \lambda_{\min}, R) \left(\frac{n}{2e}\right)^{\frac{mn}{2}} \frac{1}{\Gamma_m\!\left(\frac{n-1}{2}\right)},   (5)

where we define B(m, \lambda_{\min}, R) by

B(m, \lambda_{\min}, R) \stackrel{\mathrm{def}}{=} \frac{2^{m+1} R^{\frac{m}{2}}}{\lambda_{\min}^{\frac{m^2}{2}} m^{m+1} \Gamma\!\left(\frac{m}{2}\right)}.

B(m, \lambda_{\min}, R) does not depend on the sample size n. Since (5) is finite, the normalization term C(M, n) does not diverge.
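Since every factor in (5) is available in closed form, \log C(M, n) can be evaluated in the log domain with the standard library alone. The sketch below is an illustration, not the authors' code; it assumes n > m (so that \Gamma_m((n-1)/2) is defined), and log_multigamma implements the multivariate gamma function \Gamma_m(a) = \pi^{m(m-1)/4} \prod_{j=1}^m \Gamma(a + (1-j)/2):

```python
from math import lgamma, log, pi

def log_multigamma(a, m):
    # log of the multivariate gamma function Gamma_m(a)
    return m * (m - 1) / 4 * log(pi) + sum(
        lgamma(a + (1 - j) / 2) for j in range(1, m + 1))

def log_B(m, lam_min, R):
    # log B(m, lambda_min, R); constant in the sample size n
    return ((m + 1) * log(2) + m / 2 * log(R)
            - m * m / 2 * log(lam_min) - (m + 1) * log(m) - lgamma(m / 2))

def log_C_gaussian(m, n, lam_min, R):
    # log C(M, n) = log B + (mn/2) log(n / 2e) - log Gamma_m((n-1)/2);
    # requires n > m so that Gamma_m((n-1)/2) is defined
    return (log_B(m, lam_min, R)
            + m * n / 2 * (log(n) - log(2) - 1)
            - log_multigamma((n - 1) / 2, m))
```

Working in the log domain avoids the overflow that the raw factor (n/2e)^{mn/2} would cause for even moderate n.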
III. EFFICIENT COMPUTATION OF NML FOR GMMS
This section gives a method for efficiently computing the NML for GMMs. A Gaussian mixture model (GMM) is a typical representation for clustering. It takes the form of a linear combination of K Gaussian distributions with means \mu_k and covariance matrices \Sigma_k (k = 1, \dots, K):

P(X) = \sum_{k=1}^K \pi_k f(X \mid Z = k; \mu_k, \Sigma_k),

where \pi_k = P(Z = k) and Z is a hidden variable indicating the cluster to which X belongs. We denote the class of Gaussian mixture distributions with K clusters by M(K).
We consider the case where the cluster indexes z^n are given in addition to x^n. Each z_i corresponds to x_i, and we write z^n = z_1 \cdots z_n. For example, cluster indexes z^n can be obtained from a given x^n using the EM algorithm [2].
The joint probability density function of x^n and z^n is given by

f(x^n, z^n; \mu, \Sigma) = \prod_{k=1}^K \pi_k^{h_k} \frac{1}{(2\pi)^{\frac{m h_k}{2}} |\Sigma_k|^{\frac{h_k}{2}}} \exp\left( -\frac{1}{2} \sum_{i: z_i = k} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right),

where \pi_k = P(Z = k), K is the number of clusters, and h_k is the number of data points belonging to cluster k. The NML distribution for x^n relative to the class M(K) of GMMs is
f_{NML}(x^n, z^n) = \frac{f(x^n, z^n; \hat{\mu}(x^n, z^n), \hat{\Sigma}(x^n, z^n))}{C(M(K), n)}.   (6)
The MLE of \pi_k is \hat{\pi}_k = h_k / n. We use Eq. (5) to compute the normalization term of the NML distribution for the Gaussian mixture model M(K) as follows:

C(M(K), n) = \sum_{z^n} \int_{Y(\lambda_{\min}, R)} f(y^n, z^n; \hat{\mu}(y^n, z^n), \hat{\Sigma}(y^n, z^n))\, dy^n
           = \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{k=1}^K \left(\frac{h_k}{n}\right)^{h_k} I(h_k),   (7)

where

I(h_k) = B(m, \lambda_{\min}, R) \left(\frac{h_k}{2e}\right)^{\frac{m h_k}{2}} \frac{1}{\Gamma_m\!\left(\frac{h_k - 1}{2}\right)}.
A straightforward computation of C(M(K), n) takes O(n^K) time. We give the following theorem showing that formula (7) can be computed recursively in O(n^2 K) time.
Theorem 3.1: The normalization term in (7) satisfies the following recursion formula:

C(M(K+1), n) = \sum_{r_1 + r_2 = n} \binom{n}{r_1} \left(\frac{r_1}{n}\right)^{r_1} \left(\frac{r_2}{n}\right)^{r_2} C(M(K), r_1)\, I(r_2).   (8)

This enables us to calculate formula (7) in O(n^2 K) time.
Proof: In this proof, we use the technique of generating functions, which was proposed by Kontkanen and Myllymäki [5],[6], to derive the recursion formula. First, note that the normalization term is expressed as

C(M(K), n) = \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{k=1}^K \left(\frac{h_k}{n}\right)^{h_k} I(h_k).
Algorithm 1 Calculation of Stochastic Complexity for GMM
STEP 1. Calculate h_k, \hat{\mu}_k, \hat{\Sigma}_k (k = 1, \dots, K).
STEP 2. Calculate the numerator of Eq. (6).
STEP 3. Define the normalization term as C(M(K), 0) = 1.
STEP 4. Calculate C(M(1), j) = I(j) (j = 1, \dots, n).
STEP 5. Calculate C(M(k), j) as follows:
for k = 2 to K do
  for j = 1 to n do
    C(M(k), j) = \sum_{r_1 + r_2 = j} \binom{j}{r_1} \left(\frac{r_1}{j}\right)^{r_1} \left(\frac{r_2}{j}\right)^{r_2} C(M(k-1), r_1)\, I(r_2)
  end for
end for
STEP 6. Calculate the stochastic complexity (NML code-length).
We define a generating function J(z) by

J(z) \stackrel{\mathrm{def}}{=} \sum_{n \ge 0} \frac{n^n}{n!} I(n)\, z^n.
The Kth power of J(z) is calculated as follows:

(J(z))^K = \sum_{n \ge 0} \left( \sum_{h_1 + \cdots + h_K = n} \prod_{k=1}^K \frac{h_k^{h_k}}{h_k!} I(h_k) \right) z^n
         = \sum_{n \ge 0} \frac{n^n}{n!} \left( \sum_{h_1 + \cdots + h_K = n} \frac{n!}{h_1! \cdots h_K!} \prod_{k=1}^K \left(\frac{h_k}{n}\right)^{h_k} I(h_k) \right) z^n
         = \sum_{n \ge 0} \frac{n^n}{n!} C(M(K), n)\, z^n,
and the (K+1)th power of J(z) is

(J(z))^{K+1} = J(z) \cdot (J(z))^K
  = \sum_{n \ge 0} \left( \sum_{r_1 + r_2 = n} \frac{r_1^{r_1}}{r_1!} C(M(K), r_1) \cdot \frac{r_2^{r_2}}{r_2!} I(r_2) \right) z^n
  = \sum_{n \ge 0} \frac{n^n}{n!} \left( \sum_{r_1 + r_2 = n} \binom{n}{r_1} \left(\frac{r_1}{n}\right)^{r_1} \left(\frac{r_2}{n}\right)^{r_2} C(M(K), r_1)\, I(r_2) \right) z^n
  = \sum_{n \ge 0} \frac{n^n}{n!} C(M(K+1), n)\, z^n.
Thus a recursion formula for the normalization term C(M(K), n) is given by

C(M(K+1), n) = \sum_{r_1 + r_2 = n} \binom{n}{r_1} \left(\frac{r_1}{n}\right)^{r_1} \left(\frac{r_2}{n}\right)^{r_2} C(M(K), r_1)\, I(r_2).
We use this recursion formula to calculate the normalization term (7) as in Algorithm 1. The computational complexity is O(n^2 K): formula (8) computes C(M(K), n) from C(M(K-1), r_1) (r_1 = 1, \dots, n) in O(n) time, and since all C(M(k), j) (k = 1, \dots, K; j = 1, \dots, n) are calculated at STEP 5 of Algorithm 1, we can compute C(M(K), n) in O(n^2 K) time.
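A log-domain sketch of Algorithm 1's normalization step follows. It is an illustration under one extra assumption not stated in the paper: for h \le m the MLE covariance of a cluster with h points is singular, so the sketch sets I(h) = 0 there; the constant \log B(m, \lambda_{\min}, R) is passed in, since it only shifts the result.

```python
import math

NEG_INF = float("-inf")

def log_multigamma(a, m):
    # log of the multivariate gamma function Gamma_m(a)
    return m * (m - 1) / 4 * math.log(math.pi) + sum(
        math.lgamma(a + (1 - j) / 2) for j in range(1, m + 1))

def log_I(h, m, log_B_val):
    # log I(h) = log B + (mh/2) log(h / 2e) - log Gamma_m((h-1)/2).
    # For h <= m the MLE covariance is singular, so we treat I(h) as 0
    # (an assumption of this sketch, not stated in the paper).
    if h <= m:
        return NEG_INF
    return (log_B_val + m * h / 2 * (math.log(h) - math.log(2) - 1)
            - log_multigamma((h - 1) / 2, m))

def logsumexp(vals):
    mx = max(vals)
    if mx == NEG_INF:
        return NEG_INF
    return mx + math.log(sum(math.exp(v - mx) for v in vals))

def log_C_gmm(K, n, m, log_B_val):
    """log C(M(K), n) via the recursion (8), in O(n^2 K) time."""
    # logI[0] = 0 encodes the convention I(0) = 1 (an empty cluster)
    logI = [0.0] + [log_I(h, m, log_B_val) for h in range(1, n + 1)]
    prev = logI[:]               # C(M(1), j) = I(j); C(M(k), 0) = 1
    for _k in range(2, K + 1):
        cur = [0.0]
        for j in range(1, n + 1):
            terms = []
            for r1 in range(j + 1):
                r2 = j - r1
                terms.append(
                    math.lgamma(j + 1) - math.lgamma(r1 + 1)
                    - math.lgamma(r2 + 1)
                    + (r1 * math.log(r1 / j) if r1 else 0.0)
                    + (r2 * math.log(r2 / j) if r2 else 0.0)
                    + prev[r1] + logI[r2])
            cur.append(logsumexp(terms))
        prev = cur
    return prev[n]
```

The log-sum-exp accumulation keeps the O(n^2 K) double loop numerically stable even when the individual terms of (7) overflow a double.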
Then, the stochastic complexity of x^n relative to M(K) is calculated as follows:

SC(x^n, z^n : M(K)) = -\log f_{NML}(x^n, z^n \mid M(K))
  = -\log f(x^n, z^n \mid \hat{\mu}(x^n, z^n, M(K)), \hat{\Sigma}(x^n, z^n, M(K))) + \log C(M(K), n).
On the basis of the MDL principle, we select the number K of clusters which minimizes SC(x^n, z^n : M(K)) for a given x^n.
IV. EXPERIMENTAL RESULTS
A. Empirical Results Using Artificial Data Sets
This section gives empirical results showing the validity of the NML for GMMs. We generated a number of data sequences of size n according to a true Gaussian mixture model M of mixture size K. Each mixture component is a Gaussian distribution with mean \mu_k and covariance matrix \Sigma_k (k = 1, \dots, K). For each data sequence x^n generated according to the true model M, we also generated the corresponding cluster indexes z^n using the EM algorithm [2], where z_i indicates which cluster x_i comes from (i = 1, \dots, n). In our experiment, we repeated the cluster generation using the EM algorithm 100 times, changing the initial values of the algorithm. We compared three criteria for the choice of the number of clusters: NML, Akaike's Information Criterion (AIC) [1], and the Bayesian Information Criterion (BIC) [11]. NML is calculated according to the method proposed in the previous sections. AIC and BIC are calculated as follows:
AIC(x^n, z^n \mid M(K)) = -2 \log f(x^n, z^n \mid \hat{\mu}(x^n, z^n, M(K)), \hat{\Sigma}(x^n, z^n, M(K))) + m(m+3)K + K,

BIC(x^n, z^n \mid M(K)) = -2 \log f(x^n, z^n \mid \hat{\mu}(x^n, z^n, M(K)), \hat{\Sigma}(x^n, z^n, M(K))) + \frac{m(m+3)}{2} \sum_{k=1}^K \log h_k + K \log n.
We measured their performance in terms of the identification probability P(K) and the benefit B(K), defined as follows. Letting K be the true number of clusters and K^* the one chosen by a given criterion,

P(K) = \mathrm{Prob}(K^* = K), \qquad B(K) = \max\left\{ 0,\ 1 - \frac{|K^* - K|}{T} \right\},

where T is a given constant. The identification probability P(K) is the probability that the true number of clusters is chosen by the algorithm. The benefit is a score assigned to K^* that decreases linearly from 1 to 0 as |K^* - K| increases to T. The benefit is calculated as the average of the benefits taken over all random generations. We compared NML, AIC, and BIC in terms of how fast the identification probability and the benefit converge as the sample size n increases.
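These two scores are simple to compute from the trial outcomes; a small sketch (the function names are ours, not from the paper):

```python
def benefit(K_true, K_est, T):
    # B(K) = max{0, 1 - |K* - K| / T}
    return max(0.0, 1.0 - abs(K_est - K_true) / T)

def identification_probability(K_true, K_estimates):
    # empirical P(K): fraction of trials in which the chosen K equals the truth
    return sum(k == K_true for k in K_estimates) / len(K_estimates)
```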
Figure IV.1 shows the results in the case of K = 3 and m = 3, where K is the true number of clusters and m is the dimension of a datum. We observe from Figure IV.1 that the number of clusters chosen by NML converges significantly faster than those chosen by AIC and BIC. For example, the least sample size required for P(K) \ge 0.9 is approximately 200 for NML, while it is approximately 1500 for BIC. We further observe from Figure IV.2 that the number of clusters chosen by NML becomes very close to the true number of clusters with high probability.
Tables I and II show the results for a variety of K and m. We observe that the identification probability for NML is higher than that of BIC across this variety of K and m. We see from these results that NML gives the best strategy for the choice of the number of clusters.
[Figure] Fig. IV.1. P(K) of the artificial data with K = 3, m = 3 (identification probability vs. sample size; curves for NML, AIC, and BIC).
[Figure] Fig. IV.2. B(K) of the artificial data with K = 3, m = 3 (benefit vs. sample size; curves for NML, AIC, and BIC).
TABLE I
THE SAMPLE SIZE AT WHICH P(K) IS OVER 80% (NML)

        m = 3   m = 5   m = 10
K = 3     140     240      700
K = 4     180     360     1000
K = 5     240     500     1800
K = 6     360     600     3000

TABLE II
THE SAMPLE SIZE AT WHICH P(K) IS OVER 80% (BIC)

        m = 3   m = 5   m = 10
K = 3     800     700      240
K = 4     900    1400      500
K = 5    1200    1200     1400
K = 6    1600    2000     3000
B. Empirical Results Using Benchmark Data Sets
We utilized two benchmark data sets: the Blood Transfusion Service Center Data Set and the Number Image Data Set, which we describe in detail below. Both were prepared as benchmark data sets for classification problems. For each data set, a label is assigned to each datum. Without knowing these labels in advance, we estimated the label for each datum through clustering.

Blood Transfusion Service Center Data Set [14]: This data set has 4 attributes and 2 clusters. It was extracted from the donor database of the Blood Transfusion Service Center in Hsin-Chu City in Taiwan.

The Number Image Data Set: This data set has 10 attributes and 3 clusters. It contains images which represent the numbers from 0 to 2.

Tables III and IV show the average of the numbers of clusters estimated by NML, AIC, and BIC for the first data set and the second one, respectively, where the average is taken over 100 trials of cluster generation using the EM algorithm for the first data set and over 10 trials for the second. In both cases, we see that only NML successfully identifies the true number of clusters. This implies that our method of NML computation works significantly better than the other criteria, AIC and BIC.
TABLE III
BLOOD TRANSFUSION SERVICE CENTER DATA SET

             NML    AIC    BIC
Average K   2.00   4.89   4.84

TABLE IV
THE NUMBER IMAGE DATA SET

             NML    AIC    BIC
Average K   3.00   7.00   7.00
V. CONCLUSION
We have proposed a new method for efficiently computing approximate NML code-lengths for Gaussian mixture models. We have derived an approximation of the NML for Gaussian distributions and a recursion formula for computing the NML for GMMs, which enables us to compute the NML for a GMM in O(n^2 K) time. In the experiments using artificial data sets, the number of clusters chosen by NML converged significantly faster than those chosen by AIC and BIC. In the experiments using benchmark data sets, it turned out that NML is able to identify the true number of clusters with high probability. It remains for future study how to choose R and \lambda_{\min} depending on the sample size.
VI. ACKNOWLEDGEMENTS
This work was supported by the Microsoft CORE6 project, NTT Corporation, Hakuhodo Corporation, and a Grant-in-Aid for Scientific Research (A).
REFERENCES
[1] H. Akaike: Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory, Budapest, Akademiai Kiado, pp. 267-281, 1973.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statist. Soc. B, Vol. 39, pp. 1-38, 1977.
[3] C. Fraley and A. E. Raftery: How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, 41(8), 578-588, 1998.
[4] P. D. Grünwald: The Minimum Description Length Principle, MIT Press, Cambridge, June 2007.
[5] P. Kontkanen and P. Myllymäki: A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, Elsevier, 103, pp. 227-233, 2007.
[6] P. Kontkanen and P. Myllymäki: An empirical comparison of NML clustering algorithms. Proceedings of the 2008 International Conference on Information Theory and Statistical Learning (ITSL-08), M. Dehmer, M. Drmota, and F. Emmert-Streib, Eds., pp. 125-131, CSREA Press, 2008.
[7] P. Luosto: Code lengths for model classes with continuous uniform distributions. Proceedings of the Workshop on Information-Theoretic Methods for Science and Engineering (WITMSE 2010), 2010.
[8] G. McLachlan: Finite Mixture Models. John Wiley & Sons, 2000.
[9] J. Rissanen: Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40-47, January 1996.
[10] J. Rissanen: Information and Complexity in Statistical Modeling, Springer, 2007.
[11] G. Schwarz: Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464, 1978.
[12] Yu. M. Shtarkov: Universal sequential coding of single messages. Problems of Information Transmission, Vol. 23, No. 3, pp. 3-17, July-September 1987.
[13] P. Smyth: Probabilistic model-based clustering of multivariate and sequential data. Proceedings of the Seventh International Conference on Artificial Intelligence and Statistics, pp. 299-304, Morgan Kaufmann, 1999.
[14] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html