1. INTRODUCTION
Hidden Markov models (HMMs) describe the relationship
between two stochastic processes: an observed process and an
underlying hidden (unobserved) process. The hidden process
is assumed to follow a Markov chain, and the observed data are
modeled as independent conditional on the sequence of hidden
states. The separability of the model for the hidden process and
the conditional model for the observed data leads to great flexibility in the overall model structure.
In particular, the observed data {Y_t}_{t=1}^n follow a HMM if
1. The hidden states, {Z_t}_{t=1}^n, follow a Markov chain.
2. Given Z_t, Y_t is independent of Y_1, ..., Y_{t-1}, Y_{t+1}, ..., Y_n
and Z_1, ..., Z_{t-1}, Z_{t+1}, ..., Z_n.
The HMM is fully specified by the initial and transition probabilities of the hidden Markov chain and by the distribution of
Y_t given Z_t. Typically, the latter would be chosen from a family
of distributions with mean depending on Z_t.
Under the preceding conditions, the observed data may be
autocorrelated, but do not have the Markov property. Furthermore, the marginal distribution of Y_t is a finite mixture. For
example, if Y_t is a count, one might choose the distribution of
Y_t given Z_t = z to be Poisson with mean μ_z. In this case, the
marginal distribution of Y_t would be a finite mixture of Poisson
distributions. Thus, HMMs are one way of describing overdispersion in count (or binary) data.
HMMs have many areas of application, including speech
recognition (e.g., Levinson, Rabiner, and Sondhi 1983), gene
profiling and recognition (e.g., Krogh 1998), and the modeling
of fetal lamb movements (Leroux and Puterman 1992). Albert,
McFarland, Smith, and Frank (1994) used HMMs to model lesion counts observed on a multiple sclerosis (MS) patient.
Rachel Altman is Assistant Professor, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
V5A 1S6 (E-mail: raltman@stat.sfu.ca). This article includes work from the
author's Ph.D. dissertation and was supported in part by a research grant from
the Natural Sciences and Engineering Research Council of Canada. The author
gratefully acknowledges her advisor, John Petkau, for his assistance, both academic and financial. In addition, the author thanks Farouk Nathoo for helpful
conversations, and the editors and referees whose comments led to substantial
improvement of the original manuscript. The author also expresses her appreciation to Paul Albert, Henry McFarland, and the Joseph Frank Experimental
Neuroimaging Section, Laboratory of Diagnostic Radiology Research, Clinical
Center, NIH, for sharing their MS/MRI data. Finally, she thanks Drs. Don Paty
and David Li of the UBC MS/MRI Research Group and Serono International
S.A. for providing the PRISMS data.
In this article, I propose the class of mixed hidden Markov models (MHMMs), which extends HMMs by allowing the incorporation of covariates and random effects in both the conditional and the hidden parts of the
model. The advantages of MHMMs are numerous. First, modeling multiple processes simultaneously permits the estimation
of population-level effects, as well as more efficient estimation
of parameters that are common to all processes. Second, these
models are relatively easy to interpret. Finally, MHMMs permit greater flexibility in modeling correlation structure because
they relax the assumption that the observations are independent
given the hidden states.
This article is organized as follows. In Section 2 I describe
the MS study that provided the practical motivation for this
work. In Section 3 I define the class of MHMMs and give some
insight into their interpretation. I discuss estimation of the parameters of these models in Section 4, and in Section 5, for
some simple models, conduct a simulation study to investigate
small sample properties. I illustrate the application of MHMMs
to the MS data in Section 6 and conclude with a discussion in
Section 7.
2. MULTIPLE SCLEROSIS DATA
Magnetic resonance imaging (MRI) scans of relapsing-remitting multiple sclerosis (RRMS) patients are one source
of data that may be appropriately modeled using the HMM
of data that may be appropriately modeled using the HMM
framework. Patients afflicted with this disease have symptoms
that worsen and then improve in alternating periods of relapse
and remission. One such symptom is lesions in the brain; it is
now believed that exacerbations are associated with increased
numbers of lesions (e.g., Sormani, Bruzzi, Comi, and Filippi
2002). Interferon β-1a (Rebif) is a common treatment and has
been shown to reduce MRI activity, relapse rate, and clinical
progression according to the Expanded Disability Status Scale
(PRISMS Study Group 1998; Li and Paty 1999; PRISMS Study
Group and the University of British Columbia MS/MRI Analysis Group 2001).
Because the number of lesions depends on whether the patient is in relapse or remission, one would expect the distribution of the lesion counts to be a finite mixture. In addition,
one would expect counts observed on the same patient to be
autocorrelated. Thus, a two-state HMM may be reasonable for
describing such data. Indeed, Albert et al. (1994) used these
ideas in the development of a model for lesion counts observed
monthly on three RRMS patients over a period of approximately 2 years. They used a HMM to analyze each patient's
data individually, treating the data as three unrelated processes.
In particular, for a given patient, they let {Z_t} be an unobserved,
stationary Markov chain with P(Z_t = 1 | Z_{t-1} = 1) = P(Z_t = -1 | Z_{t-1} = -1) = α and P(Z_t = 1) = P(Z_t = -1) = 1/2. They
then modeled the observed lesion count at time t, Y_t, conditional
on Z_t as Poisson(λ_t), with

$$\lambda_t = \begin{cases} \gamma \lambda_{t-1}, & Z_t = 1, \\ (1/\gamma)\lambda_{t-1}, & Z_t = -1. \end{cases}$$

Here γ > 0 is an unknown parameter. At first glance, it may appear that this model is not a HMM, because λ_t depends on
Z_1, ..., Z_{t-1} as well as on Z_t. However, Altman and Petkau
(2005), by reparameterizing the model, demonstrated that it is,
in fact, a nonhomogeneous HMM. These authors also pointed
Table 1. AIC Values for the Three Models Fit to Each Patient's Data

Patient | HMM    | Poisson | Mixture
1       | 138.54 | 134.54  | 138.94
2       | 151.40 | 163.66  | 150.66
3       | 137.20 | 139.08  | 132.36
out that, under this model, λ_t is restricted to a discrete set of values (evenly spaced on the log scale). Therefore, if the
process {Y_t} is stationary, one might expect the model of Albert
et al. (1994) to be similar to a stationary Poisson HMM with
K < ∞ hidden states.
There are a number of limitations to the approach of Albert
et al. (1994). First, for each patient, the authors compared the fit
of their HMM to the model that assumes independent Poisson
counts. Based on the Akaike information criterion (AIC) values (reproduced in columns 2 and 3 of Table 1), they claimed
evidence of autocorrelation in the data from patients 2 and 3.
However, the HMM differs from the Poisson model in that it allows not only for autocorrelation but for overdispersion as well.
When one fits a mixture of two Poisson distributions (a model
that allows for overdispersion but not autocorrelation), one sees
that, of the three models, the mixture model actually yields the
lowest AIC values for patients 2 and 3 (column 4 of Table 1). In
other words, although one expects autocorrelation to be present,
it is not detectable using individual HMMs.
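The comparison above rests on the AIC, -2 log L + 2p, where p is the number of free parameters and smaller values are better. A minimal sketch (the log-likelihoods below are hypothetical, not the values underlying Table 1):

```python
def aic(loglik, n_params):
    # Akaike information criterion: penalized measure of fit,
    # smaller is better.
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical maximized log-likelihoods and parameter counts for one
# patient: independent Poisson (1 param), two-component Poisson
# mixture (3 params), two-state stationary Poisson HMM (4 params).
models = {"Poisson": (-80.0, 1), "Mixture": (-72.0, 3), "HMM": (-71.5, 4)}
for name, (ll, p) in models.items():
    print(name, aic(ll, p))

best = min(models, key=lambda m: aic(*models[m]))
print(best)  # → Mixture
```

This mirrors the situation in the text: the HMM may fit slightly better than the mixture, yet the mixture's smaller parameter count can still give it the lowest AIC.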
Second, as Albert et al. (1994) noted, MS is a very heterogeneous disease, and its degree and behavior are expected to vary
considerably among patients. Although modeling each patient's
data separately certainly allows for interpatient differences, individual models require a large number of parameters, which
leads to increased uncertainty in all parameter estimates. In
addition, this approach prevents the assessment of population-level effects. For example, in an MS clinical trial, models for
individual patients would typically not be sufficient given the
usual sample size per patient and heterogeneity among patients.
Third, despite its simple form, the model of Albert et al.
(1994) is hard to interpret. Specifically, consider the implied
marginal moment structure:

$$E[Y_1] = \frac{\lambda_0}{2}\left(\gamma + \frac{1}{\gamma}\right),$$
$$E[Y_2] = \frac{\lambda_0}{2}\left[\alpha\gamma^2 + 2(1-\alpha) + \frac{\alpha}{\gamma^2}\right],$$
$$E[Y_3] = \frac{\lambda_0}{2}\left\{\alpha^2\gamma^3 + \left[2\alpha(1-\alpha) + (1-\alpha)^2\right]\left(\gamma + \frac{1}{\gamma}\right) + \frac{\alpha^2}{\gamma^3}\right\},$$
$$\vdots$$
The model assumes that the mean lesion counts follow a very
complicated trend. It would be difficult to justify this assumption in practice.
Finally, little is known about the theoretical properties of
nonhomogeneous HMMs. For instance, there are no tools for
model assessment in this setting.
$$f(y_{it} \mid Z_{it} = k, u, \Phi) = \exp\{[\,y_{it}\eta_{itk} - c(\eta_{itk})]/a(\phi) + d(y_{it}, \phi)\} \qquad (1)$$

and

$$\eta_{itk} = \alpha_k + x_{it}'\beta_k + w_{itk}'u. \qquad (2)$$

Here α_k is the fixed effect when Z_{it} = k, x_{it} are the covariates for
patient i at time t, and w_{itk} is the row of the model matrix for
the random effects for patient i at time t in state k. I denote
the distribution of the random effects by f(u; θ) and assume
that the random effects are independent of the hidden states.
I take E[u] ≡ 0 to avoid problems with model identifiability.
The notation u (as opposed to u_i) indicates that a random effect
could be common to more than one patient, a generalization
that would be helpful if, for example, data were collected on
patients from multiple centers. The form of (1) assumes that
the link function is canonical, an assumption that is not strictly
necessary.

The likelihood of the observed data is then

$$L(\Phi; y) = \int_u \sum_z f(y \mid z, u, \Phi)\, f(z; \Phi)\, f(u; \theta)\, du$$
$$= \int_u \sum_z \prod_{i=1}^N \prod_{t=1}^{n_i} f(y_{it} \mid z_{it}, u, \Phi) \prod_{i=1}^N \left[\pi_{z_{i1}} \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}}\right] f(u; \theta)\, du$$
$$= \int_u \prod_{i=1}^N \left[\sum_{z_i} \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u, \Phi) \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}} f(y_{it} \mid z_{it}, u, \Phi)\right] f(u; \theta)\, du \qquad (3)$$

or, written as a product of matrices,

$$L(\Phi; y) = \int_u \prod_{i=1}^N (A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\, f(u; \theta)\, du, \qquad (4)$$

where A^{i1} is the vector with elements A^{i1}_k = π_k f(y_{i1} | Z_{i1} = k, u, Φ), A^{it} is the matrix with elements
A^{it}_{kℓ} = P_{kℓ} f(y_{it} | Z_{it} = ℓ, u, Φ) for t > 1, and 1 is a column vector of 1s.
Thus, the integrand is simple to compute, and it is only the complexity of the integral that can make evaluating and maximizing
(4) challenging. In most applications, however, one would expect (4) to reduce to a much simpler form.

Example 1 (Single, patient-specific random effect). Let u_i be
a random effect associated with the ith patient, i = 1, ..., N,
and let {u_i} be iid. Then observations on different patients are
independent, and, given u_i and the sequences of hidden states,
the observations on patient i are independent. In this case, (3)
and (4) simplify to a one-dimensional integral:

$$L(\Phi; y) = \prod_{i=1}^N \int_{u_i} \left[\sum_{z_i} \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u_i, \Phi) \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}} f(y_{it} \mid z_{it}, u_i, \Phi)\right] f(u_i; \theta)\, du_i$$
$$= \prod_{i=1}^N \int_{u_i} (A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\, f(u_i; \theta)\, du_i. \qquad (5)$$
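The matrix-product form of the likelihood is cheap to evaluate. The sketch below computes it for a single sequence from a hypothetical two-state Poisson HMM with no random effects; all parameter values are illustrative.

```python
import math
import numpy as np

def poisson_pmf(y, mu):
    # Elementwise Poisson probability mass function.
    return np.exp(-mu) * mu ** y / math.factorial(y)

def hmm_likelihood(y, pi, P, mu):
    # pi' diag(f(y_1)) [prod_t P diag(f(y_t))] 1, accumulated as a
    # row vector so the diagonal matrices are never formed explicitly.
    v = pi * poisson_pmf(y[0], mu)
    for yt in y[1:]:
        v = (v @ P) * poisson_pmf(yt, mu)
    return v.sum()

pi = np.array([0.5, 0.5])               # initial probabilities
P = np.array([[0.9, 0.1], [0.2, 0.8]])  # transition matrix
mu = np.array([1.0, 8.0])               # state-dependent Poisson means
print(hmm_likelihood([0, 1, 7, 9, 2], pi, P, mu))
```

Each pass through the loop costs O(K²), so the full likelihood for one sequence is O(nK²) rather than the O(K^n) of naive enumeration over hidden state paths.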
of time in relapse and remission. To explore this class of models, I again specify the conditional model for the observed data
by (1) and (2), but I now allow the model for the hidden process
to vary among patients.
In particular, I assume that {Z_{it} | u}_{t=1}^{n_i} is a Markov chain and
that Z_{it} | u is independent of Z_{js} | u for i ≠ j. Any model for these
Markov chains must satisfy the constraints that the transition
probabilities lie between 0 and 1 and that the rows of the transition probability matrices sum to 1. Thus, I propose modeling
the transition probabilities as

$$P(Z_{it} = \ell \mid Z_{i,t-1} = k, u, \Phi) = \frac{\exp\{\alpha_{k\ell} + x_{it}'\beta_{k\ell} + w_{itk\ell}'u\}}{\sum_{h=1}^K \exp\{\alpha_{kh} + x_{it}'\beta_{kh} + w_{itkh}'u\}}. \qquad (6)$$

The likelihood retains the form

$$L(\Phi; y) = \int_u \prod_{i=1}^N (A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\, f(u; \theta)\, du, \qquad (7)$$

where now I define the quantities A^{i1}_k and A^{it}_{kℓ} as A^{i1}_k =
π_k f(y_{i1} | Z_{i1} = k, u, Φ) and A^{it}_{kℓ} = P(Z_{it} = ℓ | Z_{i,t-1} = k, u, Φ)
f(y_{it} | Z_{it} = ℓ, u, Φ), t > 1.
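By construction, the multinomial-logit form (6) yields valid transition matrices. A small sketch with a scalar covariate and a scalar random effect; all coefficient values are hypothetical:

```python
import numpy as np

def transition_matrix(alpha, beta, w, x, u):
    # Row-wise multinomial logit as in (6):
    # P[k, h] proportional to exp(alpha[k, h] + x * beta[k, h] + u * w[k, h]).
    eta = alpha + x * beta + u * w                    # K x K linear predictors
    e = np.exp(eta - eta.max(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)

alpha = np.array([[0.0, -2.0], [-1.5, 0.0]])  # hypothetical intercepts
beta = np.array([[0.0, 0.5], [0.3, 0.0]])     # hypothetical covariate effects
w = np.array([[0.0, 1.0], [1.0, 0.0]])        # random-effect loadings
P = transition_matrix(alpha, beta, w, x=1.0, u=-0.2)
print(P)
print(P.sum(axis=1))  # each row sums to 1
```

The constraints in the text (entries in (0, 1), rows summing to 1) hold automatically for any real-valued coefficients, which is exactly what makes this parameterization convenient for covariates and random effects.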
points with the number of iterations of the quasi-Newton routine. To choose the various numbers of points, the likelihood
can be evaluated (at the starting values) for different numbers
of quadrature points, q_1 < q_2 < ⋯. I select q_k as the number
of quadrature points if the use of q_{k+1}, q_{k+2}, ... points does not
lead to a substantial change in the value of the integral, where
the definition of "substantial" depends on the number of iterations executed by the quasi-Newton algorithm. (I accept less
accuracy for early iterations, but demand high accuracy for the
final iterations.)
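This point-selection strategy can be sketched with Gauss-Hermite quadrature against a standard normal random effect. The integrand below is a toy stand-in for a per-patient likelihood contribution, and in practice the tolerance would be loosened for early quasi-Newton iterations and tightened later:

```python
import numpy as np

def gh_integral(g, q):
    # Gauss-Hermite approximation of int g(u) phi(u) du for a N(0, 1)
    # density phi: sum_j w_j g(sqrt(2) x_j) / sqrt(pi).
    x, w = np.polynomial.hermite.hermgauss(q)
    return np.sum(w * g(np.sqrt(2.0) * x)) / np.sqrt(np.pi)

# Toy integrand standing in for a per-patient likelihood contribution.
g = lambda u: np.exp(-0.5 * (1.0 - u) ** 2)

# Increase q until successive approximations stabilize.
q, tol = 2, 1e-10
while abs(gh_integral(g, q + 2) - gh_integral(g, q)) > tol:
    q += 2
print(q, gh_integral(g, q))
```

For this integrand the exact value is e^{-1/4}/√2, and the approximation converges after a modest number of points; a rougher integrand, or one sharply peaked away from 0, would demand more.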
I have used these techniques to estimate a variety of MHMMs
in the RRMS context. Using a dual 1.2-GHz Athlon processor,
the fitting of a model with no random effects takes less than
1 second, that with one random effect takes approximately 12
seconds, that with two or three correlated random effects takes
several hours, and that with four random effects (two correlated
random effects in the conditional model, independent of two
additional correlated random effects in the hidden model) takes
several days.
4.2 Monte Carlo Methods
For larger numbers of random effects, numerical integration
methods are no longer appropriate. For such complex models,
estimation is significantly more difficult. Of existing estimation
methods, the Monte Carlo expectation-maximization (MCEM)
algorithm (McCulloch 1997) seems to be the most feasible. In
this setting, both the states of the hidden Markov chain and the
random effects are treated as missing data. The complete log-likelihood is then given by

$$\log L_c(\Phi; y, z, u) = \log f(y \mid z, u, \Phi) + \log f(z \mid u, \Phi) + \log f(u; \theta)$$
$$= \sum_{i=1}^N \sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \Phi) + \sum_{i=1}^N \log \pi_{z_{i1}} + \sum_{i=1}^N \sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} + \log f(u; \theta).$$

For the E-step, given simulated values u^1, ..., u^B of the random effects, the conditional distribution of the missing data is approximated by h_j(z), where

$$f(z, u^j \mid y, \Phi^p) \approx h_j(z) = \frac{f(y \mid z, u^j, \Phi^p)\, f(z \mid u^j, \Phi^p)}{\sum_{k=1}^B \sum_z f(y \mid z, u^k, \Phi^p)\, f(z \mid u^k, \Phi^p)}$$

and Φ^p denotes the parameter value at the pth iteration.
The values of h_j(z) are easy to compute using the fact that the
standard HMM likelihood can be written as a product of matrices (see Sec. 3).
For the M-step, note that, typically, the parameters involved in f(y_{it} | z_{it}, u, Φ), the initial probabilities, the transition probabilities, and f(u; θ) form disjoint sets. (One notable exception occurs when the hidden Markov chains are
stationary conditional on u, in which case the initial probabilities are functions of the transition probabilities.) In other
words, the expressions $E[\sum_{i=1}^N \sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \Phi) \mid y, \Phi^p]$,
$E[\sum_{i=1}^N \log \pi_{z_{i1}} \mid y, \Phi^p]$, $E[\sum_{i=1}^N \sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} \mid y, \Phi^p]$, and
$E[\log f(u; \theta) \mid y, \Phi^p]$ can usually be maximized separately, improving the efficiency of the procedure.
Assuming that the π_ℓ's are treated as unknown parameters, it
is straightforward to show that

$$\pi_\ell^{p+1} = \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^B \sum_z h_j(z)\, 1(z_{i1} = \ell),$$

where 1(z_{i1} = ℓ) = 1 if z_{i1} = ℓ and 0 otherwise. Numerical
maximization (e.g., via the quasi-Newton routine) would ordinarily be required to obtain updates for the other parameter estimates.
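The update for the initial probabilities is just a weighted average. The sketch below assumes the weights h[j, i, k] ≈ P(z_i1 = k, u^j | y, Φ^p) have already been computed; the values here are randomly generated stand-ins, normalized so that each patient's weights sum to 1 across samples and states.

```python
import numpy as np

rng = np.random.default_rng(1)
B, N, K = 5, 30, 2  # Monte Carlo samples, patients, hidden states

# Stand-in posterior weights for the initial state of each patient
# under each simulated random effect (hypothetical values).
h = rng.random((B, N, K))
h /= h.sum(axis=(0, 2), keepdims=True)  # normalize per patient

# Update: average the weights over samples and patients.
pi_new = h.sum(axis=0).sum(axis=0) / N
print(pi_new, pi_new.sum())  # a valid probability vector
```

Because each patient contributes total weight 1, the updated vector automatically sums to 1, so no renormalization step is needed.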
Although this method is flexible and theoretically sound, the
number of samples, B, required to approximate the E-step accurately is an important practical consideration. For integrals of
high dimension, B may be very large, resulting in a prohibitive
computational burden. I recommend the same approach as for
the quadrature method discussed in Section 4.1, namely, to increase the values of B with the number of EM iterations. To
choose these different values, I suggest, given B and starting
values, estimating

$$E\left[\sum_{i=1}^N \log \pi_{z_{i1}} \,\middle|\, y, \Phi^p\right], \qquad E\left[\sum_{i=1}^N \sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \Phi) \,\middle|\, y, \Phi^p\right],$$
$$\text{and} \qquad E\left[\sum_{i=1}^N \sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} \,\middle|\, y, \Phi^p\right].$$
Table 2. Sample Mean (sample standard deviation, average asymptotic standard error) of the Parameter
Estimates Under Models Cf, Cfr, and Cfr.Hf
sample standard deviations. Histograms of the parameter estimates (not shown) do not deviate substantially from the normal
distribution. Therefore, it would seem that the usual asymptotic
properties apply reasonably well when there is no random effect or one random effect in the conditional model, even for
such a modest sample size.
For models Cf.Hr and Cf.Hfr, with the exception of the estimates of log σ, the sample means are close to the true values,
and the histograms are well behaved. However, the asymptotic
standard errors are, in general, larger than the sample standard
deviations of the parameter estimates. This suggests that more
data may be required in order to obtain accurate standard errors
when there is a random effect in the hidden model. To confirm this theory, I ran 200 additional simulations from Cf.Hr
and Cf.Hfr but with 60 patients instead of 30. The resulting asymptotic standard errors were very close to the sample standard
deviations (agreeing to two decimal places, in most cases).
In the case of both Cf.Hr and Cf.Hfr, the histograms of the
estimates of log σ are distinctly bimodal, with the bulk (78%) of the
estimates clustering around the true value of .5, and the rest
clustering around -9 (implying that the variance of the random
effect, σ², is approximately 0). This behavior suggests that the
estimated variance of the random effect will be close to 0 unless
the data provide strong evidence otherwise. The simulations using 60 patients support this claim: These histograms are indeed
much closer to unimodal, with less than 5% of the estimates
clustered around 9. Clearly, a considerable amount of data is
necessary for estimating the variance of the random effect in the
hidden model. Not surprisingly, the asymptotic standard errors
Table 3. Sample Mean (sample standard deviation, average asymptotic standard error) of the Parameter
Estimates Under Models Cf.Hf, Cf.Hr, and Cf.Hfr
(Table of parameter estimates and standard errors under the mixed and fixed models; the maximized log-likelihoods are -505.2 for the mixed model and -547.0 for the fixed model.)
presumably) allow one to detect the autocorrelation among observations on the same patient. In particular, if one assumes
that {Z_{hit}}_{t=1}^{n_{hi}} are iid rather than Markovian, then, conditional on
u_{hi}, Y_{hit} has a mixture distribution but is uncorrelated with Y_{his},
s ≠ t. The log-likelihood based on this model has a maximum
value of -528.3. The associated AIC value is thus 1,076.6,
compared with 1,036.5 in the case of the MHMM.
Note that, in my preliminary analysis of these data, I attempted to fit models with a patient-specific random effect in
the hidden as well as conditional model. Under these models,
the likelihoods were very flat, and convergence was difficult to
achieve. In both the model where I allowed P(Z_{hit} = 1 | Z_{hi,t-1} = 1) to vary according to a random effect and that where I allowed P(Z_{hit} = 1 | Z_{hi,t-1} = 2) to vary, the maximum value of
the likelihood was -505.2 (essentially identical to the maximum achieved by the model with a random effect in the conditional model only). In addition, the estimates of the variance of
the random effects in the hidden model were approximately 0,
and the estimates of the other parameters were very similar (relative to their standard errors) across all models. Thus, for these
data, it appears that the inclusion of random effects in the hidden model is unnecessary.
To conclude, by borrowing strength across patients using a
MHMM with a random effect in the conditional model, I am
able to detect a treatment effect as well as important features
of the data such as autocorrelation. Moreover, this model provides a significantly improved fit over the fixed model with the
addition of only one extra unknown parameter (σ).
7. DISCUSSION
In summary, I have shown that the addition of one or two
random effects to the conditional model for the observed data
results in a model that can be readily interpreted and estimated.
I have applied such a model to the PRISMS data and have
demonstrated the advantages of such an approach in this context.
It is also possible to include random effects in the model for
the hidden process, but such models are more difficult to interpret and may involve high-dimensional integrals. In general,
one has less information about the parameters of the hidden
process than about the parameters of the conditional model. In
this case, extending the model to allow patient-to-patient differences on this level may explain very little additional variation
in the observed data and, hence, may not be worthwhile from
a statistical standpoint. Moreover, capturing interpatient heterogeneity in the hidden processes is still possible by incorporating
covariates in this part of the model. Therefore, in practice, one
would be more likely to use Model I than Model II, unless the
data provided strong indication that the complexity of Model II
was warranted.
Estimation of models with more than three or four random
effects is an area of ongoing research. The MCEM algorithm
can be used, in theory, to estimate any MHMM, but the computational burden required for models with large numbers of correlated random effects can be prohibitive. However, as computing power increases, so will the feasibility of using the MCEM
method to estimate increasingly complex models.
[Received September 2005. Revised August 2006.]
REFERENCES
Albert, P. S., McFarland, H. F., Smith, M. E., and Frank, J. A. (1994), "Time Series for Modelling Counts From a Relapsing-Remitting Disease: Application to Modelling Disease Activity in Multiple Sclerosis," Statistics in Medicine, 13, 453-466.
Altman, R. M., and Petkau, A. J. (2005), "Application of Hidden Markov Models to Multiple Sclerosis Lesion Count Data," Statistics in Medicine, 24, 2335-2344.
Breslow, N. E., and Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9-25.
Chan, J. S. K., and Kuk, A. Y. C. (1997), "Maximum Likelihood Estimation for Probit-Linear Mixed Models With Correlated Random Effects," Biometrics, 53, 86-97.
Hughes, J. P., and Guttorp, P. (1994), "A Class of Stochastic Models for Relating Synoptic Atmospheric Patterns to Regional Hydrologic Phenomena," Water Resources Research, 30, 1535-1546.
Humphreys, K. (1997), "Classification Error Adjustments for Female Labour Force Transitions Using a Latent Markov Chain With Random Effects," in Applications of Latent Trait and Latent Class Models in the Social Sciences, eds. J. Rost and R. Langeheine, New York: Waxmann Munster, pp. 370-380.
——— (1998), "The Latent Markov Chain With Multivariate Random Effects," Sociological Methods & Research, 26, 269-299.
Krogh, A. (1998), "An Introduction to Hidden Markov Models for Biological Sequences," in Computational Methods in Molecular Biology, eds. S. L. Salzberg, D. B. Searls, and S. Kasif, Amsterdam: Elsevier, pp. 45-63.
Kuk, A. Y. C., and Cheng, Y. W. (1999), "Pointwise and Functional Approximations in Monte Carlo Maximum Likelihood Estimation," Statistics and Computing, 9, 91-99.
Lee, Y., and Nelder, J. A. (1996), "Hierarchical Generalized Linear Models," Journal of the Royal Statistical Society, Ser. B, 58, 619-678.
Leroux, B. G., and Puterman, M. L. (1992), "Maximum-Penalized Likelihood Estimation for Independent and Markov-Dependent Mixture Models," Biometrics, 48, 545-558.
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. (1983), "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, 62, 1035-1074.
Li, D. K., and Paty, D. W. (1999), "Magnetic Resonance Imaging Results of the PRISMS Trial: A Randomized, Double-Blind, Placebo-Controlled Study of Interferon-Beta1a in Relapsing-Remitting Multiple Sclerosis," Annals of Neurology, 46, 197-206.
MacDonald, I. L., and Zucchini, W. (1997), Hidden Markov Models and Other Models for Discrete-Valued Time Series, London: Chapman & Hall.
MacKay, R. J. (2002), "Estimating the Order of a Hidden Markov Model," The Canadian Journal of Statistics, 30, 573-589.
——— (2003), "Hidden Markov Models: Multiple Processes and Model Selection," unpublished Ph.D. dissertation, University of British Columbia, Dept. of Statistics.
McCulloch, C. E. (1997), "Maximum Likelihood Algorithms for Generalized Linear Mixed Models," Journal of the American Statistical Association, 92, 162-170.
McCulloch, C. E., and Searle, S. R. (2001), Generalized, Linear, and Mixed Models, New York: Wiley.
Nash, J. C. (1979), Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, New York: Wiley.
PRISMS Study Group (1998), "Randomised Double-Blind Placebo-Controlled Study of Interferon Beta-1a in Relapsing/Remitting Multiple Sclerosis," The Lancet, 352, 1498-1504.
PRISMS Study Group and the University of British Columbia MS/MRI Analysis Group (2001), "PRISMS-4: Long-Term Efficacy of Interferon-Beta-1a in Relapsing MS," Neurology, 56, 1628-1636.
Seltman, H. J. (2002), "Hidden Markov Models for Analysis of Biological Rhythm Data," in Case Studies in Bayesian Statistics, Vol. 5, New York: Springer-Verlag, pp. 397-405.
Sormani, M. P., Bruzzi, P., Comi, G., and Filippi, M. (2002), "MRI Metrics as Surrogate Markers for Clinical Relapse Rate in Relapsing-Remitting MS Patients," Neurology, 58, 417-421.
Turner, T. R., Cameron, M. A., and Thomson, P. J. (1998), "Hidden Markov Chains in Generalized Linear Models," The Canadian Journal of Statistics, 26, 107-125.
Wang, P., and Puterman, M. L. (1999), "Markov Poisson Regression Models for Discrete Time Series," Journal of Applied Statistics, 26, 855-882.