Abstract— Tracking human motion from monocular video sequences has attracted significantly increased interest in recent years. A key to accomplishing this task is to efficiently explore a high-dimensional state space. However, the traditional particle filter method and many of its variants have not been able to meet expectations, as they lack a strategy for efficient sampling or stochastic search. We present a novel approach, namely differential evolution - Markov chain (DE-MC) particle filtering. By taking advantage of the DE-MC algorithm's ability to approximate complicated distributions, substantial improvement can be made to the traditional structure of the particle filter. As a result, an efficient stochastic search can be performed to locate the modes of likelihoods. Furthermore, we apply the proposed algorithm to solve the 3D articulated model-based human motion tracking problem. A reliable image likelihood function is built for visual tracker design. Based on the proposed DE-MC particle filter and the image likelihood function, we perform a variety of monocular human motion tracking experiments. Experimental results, including a comparison with the performance of other particle filtering methods, demonstrate the reliable tracking performance of the proposed approach.

Index Terms— Articulated human motion tracking, importance sampling, particle filtering, DE-MC.

I. INTRODUCTION

in the state space. In the sense of importance sampling, if the proposal distribution is biased and no corrections are made, the tracker will fail quickly. In addition, the relationship between human motion and image appearance is too complex to be formulated mathematically. Due to motion blur, depth ambiguity and self-occlusion, the extracted low-level visual features cannot describe the human motion effectively and robustly. Although a variety of approaches have been proposed in recent years, the high-dimensional human motion tracking problem is still inherently difficult.

In this paper, we propose a novel tracking algorithm, namely the Differential Evolution - Markov Chain (DE-MC) particle filter. It is based on the particle filter framework but makes substantial changes to the core structure of the CONDENSATION algorithm. Guided by the DE-MC algorithm, samples can move progressively to the important regions of the conditional distribution as defined by the image likelihood. Furthermore, we apply the DE-MC particle filter to the 3D articulated model-based human motion tracking problem. A 3D articulated human body model is proposed in this paper, and we also design a robust multi-cue based measurement function which describes the resemblance between hypothesis
motion tracking. Section III first introduces the background of Markov Chain Monte Carlo (MCMC), the Differential Evolution (DE) algorithm and the DE-MC algorithm, and then proposes the DE-MC particle filter algorithm. The 3D articulated human body model and image likelihoods are built in Section IV. Experimental results are shown in Section V and the conclusions are drawn in Section VI.

II. RELATED WORK

The core of the tracking algorithm is its mechanism of searching for configurations which interpret observations best. For this reason, a rich body of technical literature has been devoted to designing an efficient sampling and search strategy. Basically the two tasks are towards the same goal and are closely linked: a good sampling strategy will substantially increase the efficiency of searching, and a proper search strategy will increase the possibility of finding extrema. Most human motion trackers are based on particle filters. Application of particle filters (or Sequential Monte Carlo Sampling) in the computer vision community can be traced back to the CONDENSATION algorithm [1]. Though paving a path to solving many visual tracking problems, the CONDENSATION algorithm and most of its variants turn out to be inadequate when the dimensionality of the state space increases by a certain degree. In such a space, the samples are distributed extremely sparsely. Unless enough of them fall into the neighboring region of a solution, there is no guarantee of reaching this solution. Unfortunately, the simplification which differentiates the CONDENSATION (or generic particle filter) from the Sequential Monte Carlo sampling (SMC) makes a nontrivial move away from the objective.

A possible compensation is to reduce the dimensionality of the search space and hence make the samples relatively more dense in their distribution. The well-known dimensionality reduction algorithms such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA) are sometimes very useful. In [2], for example, PCA is used to explore the 3D head motion and pose estimation problem. However, the linear nature of PCA and ICA usually makes them inadequate to handle the complicated relations between joint angles. There are many non-linear dimensionality reduction techniques available, but some of them, such as Isomap, Laplacian Eigenmaps and Locally Linear Embedding, are non-invertible. This shortcoming prevents them from being qualified candidates in visual tracking applications, since we must return to the state space to retrieve the tracking result. In addition, non-linear dimensionality reduction methods with inverse mapping ability such as the Locally Linear Coordination (LLC) [3]–[5] algorithm and the Gaussian Process Latent Variable Model (GPLVM) [6]–[10] have been used for tracking human motion. In [5], the LLC preserves the clustering behavior of similar high-dimensional data points and separates the different clusters in the global coordinate system, and the models learned from the LLC are then used for tracking. Authors in [10] employ GPLVM for pose synthesis given kinematics constraints. Some researchers also try to partition the search space by looking for the conditional dependencies between different state variables, as has been done through using the Rao-Blackwellised Particle Filter in [11]. Generally speaking, most of these works rely on training sets and are successful when the types of motion to be tracked are close to those in the training set.

From a different perspective, many researchers focus their attention on refining dynamical models. Given the posterior history, they try to accurately predict the region that covers the solution at the current time step by deriving process noise from an uncertainty description matrix [12], [13], by building an adaptive velocity model [14]–[16], and especially, by learning motion templates. Authors in [17]–[19] present a real-time full body tracker. In their work, the parameter space is first partitioned into Gaussian clusters each representing an elementary motion. A prior dynamics model is then learnt from this low dimensional representation by using an unsupervised EM clustering algorithm. The temporal dependencies of high-level behaviors are captured by a variable length Markov model (VLMM). By using the learnt dynamics model, the propagation of candidate poses is biased in the low dimensional space. Authors in [17]–[19] apply their 3D human motion tracker to multi-view videos. They propose a hierarchical algorithm to merge silhouette extraction and volumetric reconstruction together, which can efficiently evaluate candidate poses against evidence captured from multiple views. The hierarchical volumetric reconstruction algorithm can effectively fuse information from different views and resolve spatial ambiguities. A rich body of literature [20]–[23] has discussed how to utilize motion templates to guide the tracking even though it can only be applied to limited motion types. All of these efforts rely heavily on the knowledge about the dynamical prior. They usually make assumptions that are too rigid to be applied to general tracking scenes.

Because of the aforementioned limitations of those schemes, we realize that the improvement should be made on the core of particle filtering, for instance, the sampling strategy. Importance sampling is first introduced in the form of ICONDENSATION [24], [25], in which the samples are drawn not only according to the dynamical priors but also according to an auxiliary importance function. Poon and Fleet's Hybrid Monte Carlo Filtering (HMCF) [26] bears some similarity to our work in that it uses MCMC to generate samples. But unlike our work, the HMCF follows the gradient of the posterior distribution to find promising samples. It requires an analytical form of the image likelihood gradient, which is not available for most image cues. Therefore it cannot maximize the use of abundant visual information contained in a video for tracking. Sminchisescu and Triggs [27] use kinematic jump sampling to handle the forwards/backwards ambiguities in the estimation of a limb's pose. While this sampling strategy targets a specific problem structure, our work is concerned with the more general case of sampling. In [28], Sminchisescu and Triggs develop a proposal density based on uncertainty of local parameter estimation. Along the eigen-directions which account for the most uncertain parameters (usually the ones that describe the movement in depth), sampling covariances are inflated by a large scale. So the generated samples can focus on the modes that cause ambiguities. The unscented
3854 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 10, OCTOBER 2013
particle filter (UPF) [29]–[31] draws samples from a proposal distribution which is determined by the calculation result of the unscented Kalman filter (UKF). Alongside approaches that try to “generate good samples”, there are also research projects devoted to “turning bad samples into good ones”, i.e. driving the existing biased samples to a global or quasi-global extremum. The Kernel Particle Filter (KPF) [32]–[34] is based on kernel density estimation and moves the samples along the gradient ascent direction of the posterior distribution density. Wu et al. [35] model the 2D tracking problem using a dynamic Markov Random Field, in which the node potential at each time instance is determined by the dynamical model and the edge potentials are location constraints for body part pairs. They use the mean field algorithm for inference. In [25], [36], [37], the simulated annealing algorithm and the genetic algorithm (GA) together constitute the foundation of the annealed particle filter (APF). In this case samples are pushed gradually toward the global maximum of a weighting function by progressively adjusting the sensitivity of the weighting function. In [38], the articulated human body motion is decomposed into limb motions. The coupling between limb motions is captured by Sequential Ancestral Simulation (SAS) of Bayes Networks and by a Markov Random Field model in the sample weights. Although this work relies on a cyclic motion prior and is thus not applicable to general motions, iterative sampling is found to improve the quality of sampling, which resembles the idea of the APF.

The preceding works have demonstrated that the decisive factors in a high-dimensional visual tracking problem are how to allocate the samples and how to guide their movement in the state space. Motivated by this perspective, we present a new algorithm which makes substantial changes to the core structure of the CONDENSATION algorithm. An information exchange scheme is built between the sampling and weighting steps to guide the moves of samples in the state space. This adjustment also has the effect of stochastic global optimization. In addition, the proposed algorithm does not require any training for dynamical priors. As our work is based on the DE-MC, a method originally developed for approximating target density functions in statistics, we name the proposed algorithm the DE-MC particle filter. We will demonstrate its effectiveness with application in monocular 3D human motion tracking. The generality of the algorithm, however, allows it to be readily used in other visual tracking contexts.

III. DE-MC PARTICLE FILTER

In this section, we will first give a brief introduction to the particle filter, the Markov Chain Monte Carlo (MCMC) algorithm, the Differential Evolution (DE) algorithm and the DE-MC algorithm. Then we propose the DE-MC particle filter.

A. Particle Filter

In an articulated model-based human motion tracking problem, joint angles together with global translation and rotation parameters constitute a state vector. This vector gives a complete description of the human pose. Therefore, the visual tracking problem can be formulated as recursively estimating the state at each time step according to a posterior distribution $p(X_{0:k} \mid Y_{1:k})$, where $X_{0:k} = \{X_0, X_1, \ldots, X_k\}$ are the state vectors up to and including time $k$ and $Y_{1:k} = \{Y_1, \ldots, Y_k\}$ are the observations in the same period of time. By Bayesian inference [1]:

$$
\begin{aligned}
p(X_k \mid Y_{1:k}) &= \lambda_k P(Y_k \mid X_k)\, p(X_k \mid Y_{1:k-1}) \\
&= \lambda_k P(Y_k \mid X_k) \int p(X_k \mid X_{0:k-1}, Y_{1:k-1})\, p(X_{0:k-1} \mid Y_{1:k-1})\, dX_{0:k-1} \\
&= \lambda_k P(Y_k \mid X_k) \int p(X_k \mid X_{k-1})\, p(X_{k-1} \mid Y_{1:k-1})\, dX_{k-1},
\end{aligned}
\tag{1}
$$

where $\lambda_k$ is a normalization constant that is independent of $X_k$.

By applying the Monte Carlo principle, the continuous distribution can be represented by discrete samples through particle filtering. Since direct sampling is difficult for the complicated posterior distribution, the importance sampling principle is adopted. We summarize a typical particle filter step as follows:

At time step $k$, starting with a sample set $\{X_{k-1}^{(i)}, \omega(X_{k-1}^{(i)})\}_{i=1}^{N}$:

1) Selection: Select a new set of samples $\{\hat{X}_k^{(i)}\}_{i=1}^{N}$ from $\{X_{k-1}^{(i)}\}_{i=1}^{N}$ according to $\omega(X_{k-1}^{(i)})$. The samples with a larger weight should be selected with a higher probability. The detailed selection strategy can be found in [1] and [39].

2) Prediction: Sample from the proposal function,

$$
\{X_k^{(i)}\} \sim g(X_k^{(i)} \mid \hat{X}_k^{(i)}, Y_k), \quad i = 1, 2, \ldots, N. \tag{2}
$$

3) Measurement: Evaluate the weight for each sample,

$$
\omega(X_k^{(i)}) = \frac{p(Y_k \mid X_k^{(i)})\, p(X_k^{(i)} \mid \hat{X}_k^{(i)})}{g(X_k^{(i)} \mid \hat{X}_k^{(i)}, Y_k)}, \quad i = 1, 2, \ldots, N, \tag{3}
$$

where $p(Y_k \mid X_k^{(i)})$ is the image likelihood and $p(X_k^{(i)} \mid \hat{X}_k^{(i)})$ is the dynamical model. Then normalize the weights so that $\sum_{i=1}^{N} \omega(X_k^{(i)}) = 1$.

4) Representation: Estimate the state at time step $k$ as

$$
\tilde{X}_k = \arg\max_{X_k^{(i)}} \omega(X_k^{(i)}), \quad i = 1, \ldots, N, \tag{4}
$$

or

$$
\tilde{X}_k = E[X_k] = \sum_{i=1}^{N} \omega(X_k^{(i)})\, X_k^{(i)}. \tag{5}
$$

Particle filters often suffer from the degeneracy problem, referring to the cases in which all but a few particles have negligible weights after some iterations. An indicator of the degree of degeneracy is the effective sample size, or survival diagnostic [40]:

$$
N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} \omega(X_k^{(i)})^2}. \tag{6}
$$
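The four steps above and the survival diagnostic of Equation (6) can be sketched in a few lines. This is only a minimal illustration on a toy one-dimensional state, not the tracker described in this paper: the Gaussian random-walk dynamics, the Gaussian likelihood, the choice of the dynamical model as the proposal, and the names `sir_step` and `effective_sample_size` are all assumptions made for the example.

```python
import math
import random


def effective_sample_size(weights):
    """Equation (6): N_eff = 1 / sum_i w_i^2 for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def sir_step(particles, weights, observation, rng,
             process_std=0.5, obs_std=1.0):
    """One selection/prediction/measurement/representation cycle (toy 1-D state)."""
    n = len(particles)
    # 1) Selection: resample with probability proportional to the previous weights.
    selected = rng.choices(particles, weights=weights, k=n)
    # 2) Prediction: the proposal is taken to be the dynamical model itself,
    #    here an assumed Gaussian random walk.
    predicted = [x + rng.gauss(0.0, process_std) for x in selected]
    # 3) Measurement: with proposal == dynamical model, Equation (3) reduces
    #    to the likelihood; an assumed Gaussian likelihood is used.
    raw = [math.exp(-0.5 * ((observation - x) / obs_std) ** 2) for x in predicted]
    total = sum(raw)
    if total == 0.0:
        new_weights = [1.0 / n] * n  # all likelihoods underflowed: fall back to uniform
    else:
        new_weights = [w / total for w in raw]
    # 4) Representation: weighted mean, as in Equation (5).
    estimate = sum(w * x for w, x in zip(new_weights, predicted))
    return predicted, new_weights, estimate


rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(200)]
weights = [1.0 / 200] * 200
particles, weights, estimate = sir_step(particles, weights, observation=1.0, rng=rng)
print(f"estimate={estimate:.3f}, N_eff={effective_sample_size(weights):.1f}")
```

With uniform weights the diagnostic equals the number of particles, and it drops towards 1 as the weight mass concentrates on a few samples.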
DU et al.: MONOCULAR HUMAN MOTION TRACKING BY USING DE-MC PARTICLE FILTER 3855
A small effective sample size indicates severe degeneracy. In this paper, we try to use a better proposal distribution to tackle the degeneracy problem. The optimal proposal distribution is given by

$$
g(X_k^{(i)} \mid \hat{X}_k^{(i)}, Y_k) = p(X_k^{(i)} \mid \hat{X}_k^{(i)}, Y_k). \tag{7}
$$

However, in practice it is very difficult to either obtain the analytical form of the optimal proposal distribution or draw samples from it. Generic sequential importance resampling (SIR) particle filters use the dynamical prior as a substitute for the proposal distribution,

$$
g(X_k^{(i)} \mid \hat{X}_k^{(i)}, Y_k) = p(X_k^{(i)} \mid \hat{X}_k^{(i)}). \tag{8}
$$

In this case, the sample weight in Equation (3) is nothing but the image likelihood,

$$
\omega(X_k^{(i)}) = p(Y_k \mid X_k^{(i)}). \tag{9}
$$

However, a state space of 20 or so dimensions appears to be so vast that the dynamical model alone cannot project the samples into the most probable locations of the solution. Indeed, the sparse samples thus generated are almost certain to miss many local or global modes. Therefore, finding a new strategy for efficient stochastic search in such a high-dimensional space has become an urgent issue. This is our motivation in developing the DE-MC particle filter.

B. Markov Chain Monte Carlo

A Markov Chain (MC) can be described by a transition matrix $T$ in which

$$
T_{mn} = p(X_k = S_n \mid X_{k-1} = S_m), \tag{10}
$$

where $S_n$ and $S_m$ are two of the probable states. Regardless of which initial state it starts from, the Markov chain will always reach a steady state distribution $p(X)$ if the transition matrix $T$ possesses the irreducibility and aperiodicity properties [41]. These two properties guarantee a finite path from each state to every other state with non-zero transition probability, which is the so-called ergodicity.

The Markov Chain Monte Carlo (MCMC) algorithm aims at constructing an MC which has the given target distribution as its invariant distribution [41]. Normally we ensure the stationarity by making the chain satisfy the reversibility property,

$$
p(X_k)\, T(X_{k-1} \mid X_k) = p(X_{k-1})\, T(X_k \mid X_{k-1}). \tag{11}
$$

The most frequently adopted MCMC method is the Metropolis-Hastings (MH) algorithm [41]. According to this algorithm the transition probability is given by

$$
T(X_k \mid X_{k-1}) = \alpha(X_{k-1}, X_k)\, g(X_k \mid X_{k-1}), \tag{12}
$$

where $g(X_k \mid X_{k-1})$ is the proposal distribution we can directly sample from and

$$
\alpha(X_{k-1}, X_k) = \min\left\{1, \frac{p(X_k)\, g(X_{k-1} \mid X_k)}{p(X_{k-1})\, g(X_k \mid X_{k-1})}\right\} \tag{13}
$$

is the acceptance rate.

Now we verify that the acceptance rate in Equation (13) gives the reversibility property. Firstly, we verify the case when $(p(X_k)\, g(X_{k-1} \mid X_k)) / (p(X_{k-1})\, g(X_k \mid X_{k-1})) > 1$. Given this inequality condition, we have:

$$
\alpha(X_{k-1}, X_k) = 1
$$

and

$$
\alpha(X_k, X_{k-1}) = \frac{p(X_{k-1})\, g(X_k \mid X_{k-1})}{p(X_k)\, g(X_{k-1} \mid X_k)},
$$

so:

$$
\begin{aligned}
p(X_{k-1})\, T(X_k \mid X_{k-1}) &= p(X_{k-1})\, g(X_k \mid X_{k-1}) \\
&= p(X_k)\, \frac{p(X_{k-1})\, g(X_k \mid X_{k-1})}{p(X_k)\, g(X_{k-1} \mid X_k)} \times g(X_{k-1} \mid X_k) \\
&= p(X_k)\, T(X_{k-1} \mid X_k).
\end{aligned}
$$

The cases of $(p(X_k)\, g(X_{k-1} \mid X_k)) / (p(X_{k-1})\, g(X_k \mid X_{k-1})) < 1$ and $(p(X_k)\, g(X_{k-1} \mid X_k)) / (p(X_{k-1})\, g(X_k \mid X_{k-1})) = 1$ can be proved the same way.

One condition that ensures the quality of convergence is that $g(X)/p(X) > 0$ everywhere, so $g(X)$ is usually chosen such that it is similar in shape to $p(X)$, the target distribution. $g(X_k \mid X_{k-1})$ determines how the state space is explored. This is especially important to a high-dimensional problem such as human tracking. A popular choice is to generate new samples from a symmetric random walk sampler, as the Metropolis algorithm does. It means the sampling proposal is determined only by the samples' separation from $X_{k-1}$:

$$
g(X_k \mid X_{k-1}) = g(|X_k - X_{k-1}|). \tag{14}
$$

Thus the acceptance rate reduces to

$$
\alpha(X_{k-1}, X_k) = \min\left\{1, \frac{p(X_k)}{p(X_{k-1})}\right\}. \tag{15}
$$

The calculations of the MH algorithm and the Metropolis algorithm are especially convenient because they can be applied to situations in which we cannot directly draw samples from the target distribution, but know how to roughly evaluate its value everywhere. This is precisely the case we encounter in human motion tracking.

C. Differential Evolution Algorithm

The Differential Evolution algorithm (DE) is an algorithm dealing with the parallel search for a global maximum through a high-dimensional state space [42]. Similar to other evolutionary programming methods such as the Genetic algorithm, it is also based on evolution theory and the competition mechanism: stronger members can more easily survive to the next generation so as to guarantee that the new generation is better than the previous one as a whole. Compared with the Genetic algorithm, the Differential Evolution algorithm is defined in real parameter spaces instead of binary code parameter spaces. So it is much simpler to implement. The DE algorithm can explore non-isotropic structures such as ridges in the target function because the vector differences are usually aligned with the direction of the ridges.
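Returning to Section III-B, the symmetric random-walk proposal of Equation (14) and the acceptance rate of Equation (15) can be illustrated with a short sampler. This is a hedged sketch rather than code from the paper: the two-mode target density, the Gaussian step, the step size and the function names `log_target` and `metropolis` are assumptions chosen for the example, and the acceptance test is done on log-densities for numerical safety.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of an assumed two-mode mixture (modes at -2 and +2)."""
    a = -0.5 * (x + 2.0) ** 2
    b = -0.5 * (x - 2.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis(log_p, x0, n_steps, step_std, rng):
    """Random-walk Metropolis: symmetric proposal, acceptance min(1, p(new)/p(old))."""
    x, log_px = x0, log_p(x0)
    chain, accepted = [], 0
    for _ in range(n_steps):
        # Symmetric proposal: depends only on the separation from the current state.
        candidate = x + rng.gauss(0.0, step_std)
        log_pc = log_p(candidate)
        # Accept with probability min(1, p(candidate) / p(x)), evaluated in log form.
        if log_pc >= log_px or rng.random() < math.exp(log_pc - log_px):
            x, log_px = candidate, log_pc
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps


chain, rate = metropolis(log_target, x0=0.0, n_steps=5000, step_std=1.0,
                         rng=random.Random(0))
print(f"acceptance rate={rate:.2f}, sample mean={sum(chain) / len(chain):.2f}")
```

Only ratios of the target density are needed, which is why the sampler works with an unnormalized density that can merely be evaluated pointwise.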
Fig. 2. The DE-MC algorithm simulation result: Muller potential surface (200 samples, 30 iterations).
Fig. 3. The structure comparison of the DE-MC particle filter and the generic particle filter.
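As a rough illustration of how differential evolution and the Metropolis rule can be combined in a population sampler, the sketch below assumes a commonly described form of the DE-MC proposal: each member jumps along the difference of two other randomly chosen members plus a small random perturbation, and the jump is accepted with the ratio in Equation (15). The function name `demc_iteration`, the Gaussian toy target, the uniform perturbation and the scale factor 2.38/√(2D) are assumptions for this example and may differ from the exact update used in this paper.

```python
import math
import random


def log_gauss(x):
    """Assumed toy target: standard Gaussian log-density (unnormalized)."""
    return -0.5 * sum(v * v for v in x)


def demc_iteration(population, log_p, gamma, noise_halfwidth, rng):
    """One DE-MC sweep over the population (a sketch, not the paper's exact update)."""
    n, dim = len(population), len(population[0])
    pop = [list(x) for x in population]
    for i in range(n):
        # Two distinct members other than i define the difference vector.
        r1, r2 = rng.sample([j for j in range(n) if j != i], 2)
        proposal = [
            pop[i][d] + gamma * (pop[r1][d] - pop[r2][d])
            + rng.uniform(-noise_halfwidth, noise_halfwidth)
            for d in range(dim)
        ]
        # Metropolis acceptance with the symmetric difference-vector proposal.
        log_ratio = log_p(proposal) - log_p(pop[i])
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            pop[i] = proposal
    return pop


rng = random.Random(0)
dim = 2
population = [[rng.uniform(-10.0, 10.0) for _ in range(dim)] for _ in range(30)]
gamma = 2.38 / math.sqrt(2 * dim)  # assumed scale factor
for _ in range(100):
    population = demc_iteration(population, log_gauss, gamma, 0.05, rng)
mean_log_p = sum(log_gauss(x) for x in population) / len(population)
print(f"mean log-density after 100 sweeps: {mean_log_p:.2f}")
```

Because the jump directions come from the population itself, the proposal scale adapts as the members contract around the modes, so no separate proposal covariance is tuned in this sketch.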
E. DE-MC Particle Filter

Based on the DE-MC algorithm, we propose a novel sequential Monte Carlo sampling approach, namely the DE-MC particle filter. The DE-MC particle filtering iteration at time step $k$ can be summarized as follows.

DE-MC Particle Filter Algorithm

Starting from the set of particles which are the filtering result of time step $k-1$, $\{X_{k-1}^{(i)}, \omega(X_{k-1}^{(i)})\}_{i=1}^{N}$:

1) Selection: Select a new set of samples $\{\hat{X}_k^{(i)}\}_{i=1}^{N}$ from $\{X_{k-1}^{(i)}\}_{i=1}^{N}$ with probability proportional to $\omega(X_{k-1}^{(i)})$.

2) Prediction and Measurement: Apply a constant velocity dynamical model to the samples:

$$
X_k^{(i)-} = \hat{X}_k^{(i)} + V_{k-1}, \tag{21}
$$

where $V_{k-1}$ is the velocity vector computed in time step $k-1$. The particle set $\{X_k^{(i)-}\}_{i=1}^{N}$ then acts as the initial population for a $T$-iteration DE-MC process. The process follows the DE-MC algorithm we listed in Section III-D. The fitness function is the image likelihood in the case of visual tracking. For Equation (17), we choose $g \sim U(-c\sigma, c\sigma)$, wherein $\sigma = [\sigma_0, \sigma_1, \ldots, \sigma_{D-1}]^T$ is a vector whose elements are the standard deviations of the elements in $X$. A normal distribution can be used here instead of the uniform distribution. $c$ is a small number which can be flexibly chosen. Also in the same equation, the value of $\lambda$ is determined by

$$
\lambda = (1 - c) \times \frac{2.38}{\sqrt{2D}}. \tag{22}
$$

At the end of this step, we take the output population as the particle set of the current time step: $\{X_k^{(i)}, \omega(X_k^{(i)})\}_{i=1}^{N}$.

3) Representation and Velocity Updating: Estimate the state at time step $k$ as

$$
X_k = \arg\max_{X_k^{(i)}} \omega(X_k^{(i)}), \quad i = 1, \ldots, N, \tag{23}
$$

and calculate the velocity vector of the current time step

$$
V_k = X_k - X_{k-1}. \tag{24}
$$

We adopt a strategy inspired by [36] to help the filter adapt to changing situations: the value of $\sigma$ in step 2 is calculated as proportional to the survival diagnostic as defined in Equation (6). The step size of random jumping for the current DE-MC iteration is reduced if the survival rate of the last DE-MC iteration is high and is inflated otherwise.

As previously mentioned, current observations are invisible to a generic SIR particle filter prior to the measurement stage. As such, there is a high risk that these sparse samples are not placed near the modes of the conditional distribution as defined by the likelihood. The DE-MC particle filter, on the other hand, can iteratively use the current observation to adjust the distribution of particles (see Fig. 3). The success of the DE-MC particle filter relies on two facts. First, as an importance sampling algorithm, MCMC guarantees that the samples are in accordance with the target conditional distribution. As a result, samples will be concentrated around the modes of the observation likelihood. Second, as a stochastic optimization algorithm, the DE algorithm reduces the chance that the samples are trapped in some of the local modes.

IV. 3D HUMAN BODY MODEL AND IMAGE LIKELIHOODS

In Section III, we presented the proposed DE-MC particle filter, which performs an efficient stochastic search to locate the modes of likelihoods. In this section, we will discuss some details and application-level issues of the proposed DE-MC particle filter when it is used for tracking monocular human motion video sequences.

A. 3D Articulated Human Body Model

Articulated human body models are consistent with the natural mechanism of human motion. Therefore, we are able to directly apply our knowledge about human motion to them. The model usually has a hierarchical structure, so the motion of a parent node will constrain that of its child or grandchild nodes. This relationship is reflected by the rigid geometric transformations between the local coordinate systems of the body parts:

$$
P' = R P + T, \tag{25}
$$

where the same point is represented in two different body part coordinate systems by $P = [x, y, z]^T$ and $P' = [x', y', z']^T$. $R$ is a $3 \times 3$ rotation matrix and $T = [T_x, T_y, T_z]^T$ is the translation vector.

The model we built has 14 body segments, and 21 DOFs (Degrees of Freedom) are associated with them. The segments
1) Silhouette: Assuming a static camera and video background, a ground-truth silhouette $S$ is extracted through background subtraction. We compare it with $S'$, which is generated by a hypothesis pose $X_k^{(i)}$. The pixels are categorized into 2 groups, $R_1$ and $R_2$, with $R_1 = S \cap S'$ and $R_2 = S \cup S'$. Let the number of pixels in $R_1$ and $R_2$ be $N_1$ and $N_2$, respectively; then the silhouette area measurement density can be represented by

$$
p_1(Y_k \mid X_k^{(i)}) = \frac{N_1}{N_2}. \tag{26}
$$

where $\alpha_m$ denotes the weights. They are proportional to the area of the image patches projected by different body parts. We will show in Section V how this change improves the performance of the tracker. The color measurement distribution can then be formulated as

$$
p_2(Y_k \mid X_k^{(i)}) = e^{-\beta D^2(H_r, H_k^{(i)})}, \tag{31}
$$

where $\beta$ is a scalar that helps the calculation result be more reasonably distributed in the range $(0, 1)$. To make the reference appearance model adaptive to the variation of lighting
conditions throughout the whole sequence, an update process can be applied:

$$
H_{r,k}^{+} = \lambda H_{r,k}^{-} + (1 - \lambda) H_{r,k-1}^{+}, \tag{32}
$$

where the signs $+$ and $-$ distinguish the reference appearance model after and before the update has occurred.

3) Boundary: Boundaries are often confused with edges and contours. Here we define the boundary as the outer border of an object that does not enclose any holes. Consequently, we cannot use typical edge and contour extraction methods for boundary extraction. Instead, a morphology operator is applied to the silhouette $S$:

$$
B = S - S \ominus M, \tag{33}
$$

where $M$ is a $3 \times 3$ uniform structuring element and $\ominus$ signifies erosion. We choose the Fourier Descriptor (FD) as the feature to represent a boundary $B$:

$$
B(f) = \frac{1}{N} \sum_{n=0}^{N-1} b(n)\, e^{-j 2\pi f \frac{n}{N}}, \tag{34}
$$

where

$$
b(n) = x(n) + j\, y(n), \tag{35}
$$

and $[x(n), y(n)]$ $(n = 0, 1, \ldots, N-1)$ are the image coordinates of the pixels on $B$. $N$ is the total number of pixels on $B$. The boundary information-based measurement density is then formulated as

$$
p_3(Y_k \mid X_k^{(i)}) = e^{-\rho D(B(f)_k,\, B(f)_k^{(i)})}, \tag{36}
$$

where $\rho$ has a similar function to the $\beta$ in Equation (31) and $D(B(f)_k, B(f)_k^{(i)})$ is the Euclidean distance between the FD of the ground-truth boundary and that of the boundary generated by hypothesis pose $X_k^{(i)}$.

The first 50 or 100 coefficients in $B(f)$ are usually sufficient to reconstruct a boundary of thousands of pixels without losing much fidelity, as we can see from Fig. 5(a). Using only low-frequency components of the FD allows a strong emphasis to be laid on the gross essence of the boundary, since high-frequency components of the FD correspond to noise or trivial details. Moreover, we can utilize the Fast Fourier Transform (FFT) to accelerate the computation to a large extent. The FD can also be directly integrated into the human tracking framework. We do not need to make any modifications to render the FD insensitive to translation, rotation and scaling, which we would normally have to worry about in many shape analysis scenarios, such as image retrieval. On the contrary, the FD needs to be sensitive to those transformations to reflect the human motion. However, we do wish to avoid the FD's sensitivity to the starting point, so we set a fixed corner of the boundary as the starting point. Fig. 5(b) demonstrates the effectiveness of the FD as a measurement feature for tracking. The images on the left and the top of the checkerboard are the ground-truth observations and the hypothesis ones, respectively. The gray-level values of those blocks are proportional to the Euclidean distance between the FDs of the ground-truths and those of the hypotheses. A dark block indicates a strong resemblance and a bright one indicates otherwise. As we expected, the blocks along the diagonal axis are the darkest among the row and the column they are located in.

4) Fusion: We fuse the three image cues for an overall image likelihood density function

$$
p(Y_k \mid X_k^{(i)}) = p_1(Y_k \mid X_k^{(i)}) \cdot p_2(Y_k \mid X_k^{(i)}) \cdot p_3(Y_k \mid X_k^{(i)}). \tag{37}
$$

In our experiments, we assume the three cues are of the same importance. Fig. 6 shows the surfaces of the proposed image likelihood function for two different frames. For visualization, the likelihood function is shown with regard only to 2 DOFs of the human body model and all the other DOFs are held constant. The significant peak suggests the validity of our image likelihood design scheme.

V. EXPERIMENTAL RESULTS

We performed experiments with the proposed DE-MC particle filter and the measurement function. We used seven monocular human motion video sequences in the experiments. Sequences 1 to 4 are walking, hopping, running and jumping sequences, respectively, in a regular setting. Sequence 5 is a public test sequence downloaded from www.csc.kth.se/hedvig/movies.html, which is a walking sequence in a circular trace with a complex background. Sequences 6 and 7 are public test sequences from the HumanEva Database [48], with the subject performing boxing actions in Sequence 6, and a combination of walking and jogging actions in an elliptical path in Sequence 7. The lengths of
Fig. 7. Snapshots of the DE-MC particle filter's tracking results for: (a) walking, (b) hopping, (c) running, and (d) jumping sequences, respectively.
video sequences vary from 1.2 seconds to 5.5 seconds and the frame size is 720 × 480 for Sequences 1 to 4, 320 × 240 for Sequence 5, and 640 × 480 for Sequences 6 and 7. For Sequences 1 to 4 we do not ask the subjects to wear shorts and T-shirts or other clothes that can label the body parts by using different colors or patterns as clues. The human subject wears loose-fitting clothing with uniform color and pattern, which makes the task even more difficult. Although Sequences 1–4 are shot in side view, the hopping and jumping videos also contain many limb motions in the depth direction. Sequence 5 is more challenging in that not only is its background more complicated, but there are also considerably more motions perpendicular to the image plane in it. Due to serious self-occlusions and depth ambiguity, monocular image sequences pose a huge challenge for any human motion tracking algorithm. Moreover, there are both cyclic and non-cyclic motions in those test sequences. To test the performance of the proposed algorithm, we carry out different kinds of experiments. In these experiments, model parameters are manually initialized for each test video. However, this stage can be replaced with an automatic initialization module which is driven by any state-of-the-art bottom-up detectors. Please refer to Yang and Ramanan's “Mixture of Deformable Parts” model [49] and Viola and Jones' object detection work [50] for examples of such detectors.

A. Full-Body Human Motion Tracking Experiments

In our work, the captured human motion video sequence is the ground-truth observation, while an image generated by the articulated human model is called a hypothesis observation. Since the 3D joint angles can only be measured by asking the subject to wear motion capture devices, and there is always an unsolvable ambiguity in the depth direction based on only 2D observation, we currently use the overlap between the two as the basis for qualitative analysis. This method is adopted by many related works [9], [28], and [36].

Figs. 7 and 8 show part of the tracking results for the test sequences. For the walking sequence, a 7-layer DE-MC particle filter is used (here we use “layer” to represent a DE-MC iteration). The number of particles is 500, which can lead to a satisfactory balance between the reliability and the
DU et al.: MONOCULAR HUMAN MOTION TRACKING BY USING DE-MC PARTICLE FILTER 3861
Fig. 8. Tracking results for the walking in circle sequence by the DE-MC particle filter.
computational cost of the tracker. For the hopping, running and jumping sequences, a 9-layer DE-MC particle filter with 600 particles is used, in order to handle faster motions and more depth ambiguities than those present in Sequence 1. In Fig. 8, a 12-layer, 500-sample DE-MC particle filter is used for the walking in circle sequence. As the results show, motion blur does affect the performance of the tracker, and so do self-occlusions. Misplacement of limbs is inevitable in these cases. However, when the blur or self-occlusions disappear, the tracker is able to correct the errors in time.

Fig. 9 shows tracking results for several representative frames of Sequence 6 and Sequence 7 from the HumanEva Database [48]. In the first sequence, the subject performs boxing actions, and, in the second, the subject moves in a circle and performs a combination of walking and jogging actions, followed by certain free limb motions. Both sequences contain complex limb motions. We used a DE-MC particle filter with 5 layers and 500 samples/layer for the first sequence, and a filter with 12 layers and 500 samples/layer for the second. Even though more samples were used for the second sequence to offset the difficulty caused by view variations, some tracking errors were still observed in this case, especially when the limbs of the subject were bending and projected only to small areas on the image plane. However, overall, our algorithm produces satisfactory results on both sequences.

B. Comparison Experiment 1

In Fig. 10, we compare the tracking results obtained by implementing a 7-layer DE-MC particle filter with those obtained by implementing other popular particle filtering-based algorithms. For a fair comparison, the experiment is based on almost the same number of measurement function evaluations, because this is the most time-consuming part of particle filtering. The experiment settings are: CONDENSATION: 5000 samples; annealed particle filter: 10 layers, 500 samples per layer; DE-MC particle filter: 7 layers, 500 samples per layer. The other factors, such as the initialization result, initial standard deviation of the state vector, constant velocity model, adaptive strategy and measurement function, are all given the same settings for each algorithm. Despite consuming only 70% of the computations spent by the other two algorithms, the DE-MC particle filter still
3862 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 10, OCTOBER 2013
Fig. 9. Tracking results for Sequences 6 and 7 from the HumanEva Database [48]. The first two rows are the 50th, 100th, 150th, and 200th frames of the
Box-1(C1) video and the rest are the 500th, 600th, 700th, 800th, 900th, 1600th, 1800th, and 1900th frames of the Combo-2(C3) video.
outperforms them. It estimates every human body part with relatively high accuracy. The annealed particle filter, which is based on the simulated annealing algorithm, is shown in [36] to successfully track human walking from multi-view video sequences. In our monocular experiment, however, it only roughly captures the global location of the human subject and makes invalid estimates of the limb positions in many cases. The classic CONDENSATION algorithm cannot even find the human body accurately after 20 or so frames. This experiment clearly demonstrates the performance of the proposed approach. Quantitative analysis of this experiment is conducted by measuring the average of the sample weights $\omega(X_k^{(i)})$ at each frame. The average sample weight $\bar{\omega}(X_k^{(i)})$ can be formulated as $\bar{\omega}(X_k^{(i)}) = \sum_{i=1}^{N} \omega(X_k^{(i)})/N$ at each time step $k$. It is regarded as a “score” of a tracker’s efficiency since it reflects, to a certain degree, how many of the samples are considered “valid”. The results are plotted in Fig. 11. As we have pointed out, DE-MC particle filter’s
Fig. 12. Comparison of the performance of the DE-MC particle filters with
two layers, five layers, and seven layers (from left to right, respectively). The
original image is the 48th frame of Sequence 1.
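The bookkeeping behind Comparison Experiment 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `evaluation_budget` and `mean_sample_weight` are hypothetical, the weights are assumed to be unnormalized measurement-function values (normalized weights would always average to 1/N), and the toy weights are invented for the example rather than taken from the experiments.

```python
from typing import Dict, List, Sequence


def evaluation_budget(layers: int, samples_per_layer: int) -> int:
    """Measurement-function evaluations per frame, assuming one per sample per layer."""
    return layers * samples_per_layer


def mean_sample_weight(frame_weights: Sequence[float]) -> float:
    """Average sample weight at one time step k: sum_i w(X_k^(i)) / N."""
    if not frame_weights:
        raise ValueError("at least one sample weight is required")
    return sum(frame_weights) / len(frame_weights)


# Settings quoted in the text (CONDENSATION is treated as a single layer).
budgets: Dict[str, int] = {
    "CONDENSATION": evaluation_budget(1, 5000),
    "annealed particle filter": evaluation_budget(10, 500),
    "DE-MC particle filter": evaluation_budget(7, 500),
}

# Fraction of the baseline budget used by the DE-MC particle filter.
demc_fraction = budgets["DE-MC particle filter"] / budgets["CONDENSATION"]

# Toy unnormalized weights for two frames with N = 4 samples each (made up).
toy_weights: List[List[float]] = [
    [0.2, 0.4, 0.6, 0.8],
    [0.0, 0.1, 0.1, 0.2],
]
scores = [mean_sample_weight(w) for w in toy_weights]

if __name__ == "__main__":
    for name, budget in budgets.items():
        print(f"{name}: {budget} evaluations per frame")
    print(f"DE-MC fraction of baseline budget: {demc_fraction:.0%}")
    print("per-frame scores:", scores)
```

With the quoted settings, the two baseline trackers each use 5000 evaluations per frame and the 7-layer DE-MC filter uses 3500, which matches the 70% figure stated in the text.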
[29] R. Van Der Merwe, A. Doucet, N. De Freitas, and E. Wan, “The unscented particle filter,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2001, pp. 584–590.
[30] K. Okuma, A. Taleghani, N. Freitas, J. Little, and D. Lowe, “A boosted particle filter: Multitarget detection and tracking,” in Proc. 8th Eur. Conf. Comput. Vis., 2004, pp. 28–39.
[31] X. Wang, S. Wang, and J. Ma, “An improved particle filter for target tracking in sensor systems,” J. Sensors, vol. 7, no. 1, pp. 144–156, 2007.
[32] C. Chang and R. Ansari, “Kernel particle filter for visual tracking,” IEEE Signal Process. Lett., vol. 12, no. 3, pp. 242–245, Mar. 2005.
[33] C. Chang, R. Ansari, and A. Khokhar, “Multiple object tracking with kernel particle filter,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1. Jun. 2005, pp. 566–573.
[34] J. Schmidt, J. Fritsch, and B. Kwolek, “Kernel particle filter for real-time 3D body tracking in monocular color images,” in Proc. 7th Int. Conf. Autom. Face Gesture Recognit., Apr. 2006, pp. 567–572.
[35] Y. Wu, G. Hua, and T. Yu, “Tracking articulated body by dynamic Markov network,” in Proc. 9th IEEE Int. Conf. Comput. Vis., Oct. 2003, pp. 1094–1101.
[36] J. Deutscher, A. Blake, and I. Reid, “Articulated body motion capture by annealed particle filtering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun. 2000, pp. 126–133.
[37] G. Klein and D. Murray, “Full-3D edge tracking with a particle filter,” in Proc. 17th Brit. Mach. Vis. Conf., 2006, pp. 256–260.
[38] C. Chang, R. Ansari, and A. Khokhar, “Cyclic articulated human motion tracking by sequential ancestral simulation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2. Jun.–Jul. 2004, pp. 45–52.
[39] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[40] J. MacCormick and M. Isard, “Partitioned sampling, articulated objects, and interface-quality hand tracking,” in Proc. 6th Eur. Conf. Comput. Vis., 2000, pp. 3–19.
[41] C. Andrieu, N. De Freitas, A. Doucet, and M. Jordan, “An introduction to MCMC for machine learning,” Mach. Learn., vol. 50, no. 1, pp. 5–43, 2003.
[42] R. Storn, “On the usage of differential evolution for function optimization,” in Proc. Biennial Conf. North Amer. Fuzzy Inf. Process. Soc., Jun. 1996, pp. 519–523.
[43] C. Ter Braak, “Genetic algorithms and Markov chain Monte Carlo: Differential evolution Markov chain makes Bayesian computing easy,” Biometris, Wageningen UR, Wageningen, The Netherlands, Tech. Rep. 010404, 2004.
[44] C. Sminchisescu and B. Triggs, “Hyperdynamics importance sampling,” in Proc. 7th Eur. Conf. Comput. Vis., 2002, pp. 769–783.
[45] C. Sminchisescu and B. Triggs, “Building roadmaps of minima and transitions in visual models,” Int. J. Comput. Vis., vol. 61, no. 1, pp. 81–101, 2005.
[46] P. Perez, J. Vermaak, and A. Blake, “Data fusion for visual tracking with particles,” Proc. IEEE, vol. 92, no. 3, pp. 495–513, Mar. 2004.
[47] T. Roberts, S. McKenna, and I. Ricketts, “Online appearance learning for 3D articulated human tracking,” in Proc. 16th IEEE Int. Conf. Pattern Recognit., vol. 1. Aug. 2002, pp. 425–428.
[48] L. Sigal, A. O. Balan, and M. J. Black, “Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion,” Int. J. Comput. Vis., vol. 87, nos. 1–2, pp. 4–27, 2010.
[49] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1385–1392.
[50] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, 2001.

Ming Du received the B.S. degree in electrical engineering from the Beijing Institute of Technology, Beijing, China, in 2002, and the M.S. degree from the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, in 2005. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA. His current research interests include computer vision and statistical pattern recognition, especially video-based face detection, and tracking and recognition. He is working on video analysis based on machine learning algorithms.

Xiaoming Nan received the M.S. degree in telecommunication engineering from the Beijing University of Posts & Telecommunications, Beijing, China, in 2010. He is currently pursuing the Ph.D. degree with the Ryerson Multimedia Research Laboratory, Ryerson University, Toronto, ON, Canada. His current research interests include multimedia cloud computing, content based video retrieval, immersive communication, and human body tracking. He received the Meritorious Prize in Mathematical Contest in Modeling in 2006.

Ling Guan (S’88–M’90–SM’96–F’08) received the Ph.D. degree in electrical engineering from the University of British Columbia, Vancouver, BC, Canada, in 1989. He is currently a Professor and a Tier I Canada Research Chair with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada. He held visiting positions with British Telecom, London, U.K., in 1994, Tokyo Institute of Technology, Tokyo, Japan in 1999, Princeton University, Princeton, NJ, USA, in 2000, National ICT Australia, Sydney, Australia, in 2007, Hong Kong Polytechnic University, Hong Kong, from 2008 to 2009, and Microsoft Research Asia, Cambridge, U.K., in 2002 and 2009. He has published extensively in multimedia processing and communications, human-centered computing, machine learning, and adaptive image and signal processing. He is a fellow of the Engineering Institute of Canada and an Elected member of the Canadian Academy of Engineering. He is the IEEE Circuits and System Society Distinguished Lecturer from 2010 to 2011 and he is a recipient of the 2005 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS Best Paper Award.