Image

1762 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO.
11, NOVEMBER 2005
Dynamic Trees for Unsupervised Segmentation

and Matching of Image Regions
Sinisa Todorovic, Student Member, IEEE, and Michael C. Nechyba, Member, IEEE
Abstract—We present a probabilistic framework—namely, multiscale generative models known as Dynamic Trees (DT)—for
unsupervised image segmentation and subsequent matching of segmented regions in a given set of images. Beyond these novel
applications of DTs, we propose important additions for this modeling paradigm. First, we introduce a novel DT architecture, where
multilayered observable data are incorporated at all scales of the model. Second, we derive a novel probabilistic inference algorithm for
DTs—Structured Variational Approximation (SVA)—which explicitly accounts for the statistical dependence of node positions and model
structure in the approximate posterior distribution, thereby relaxing poorly justified independence assumptions in previous work. Finally,
we propose a similarity measure for matching dynamic-tree models, representing segmented image regions, across images. Our results
for several data sets show that DTs are capable of capturing important component-subcomponent relationships among objects and their
parts, and that DTs perform well in segmenting images into plausible pixel clusters. We demonstrate the significantly improved properties
of the SVA algorithm—both in terms of substantially faster convergence rates and larger approximate posteriors for the inferred
models—when compared with competing inference algorithms. Furthermore, results on unsupervised object recognition demonstrate the
viability of the proposed similarity measure for matching dynamic-structure statistical models.
Index Terms—Generative models, Bayesian networks, dynamic trees, variational inference, image segmentation, image matching,
object recognition.
1 INTRODUCTION
W E present a probabilistic framework for image segmen-

tation and subsequent matching of segmented regions
in a given set of images, when only weak or no prior
them model causal (Markovian) dependence assumptions
through scales, as illustrated in Fig. 1. DTs provide a
distribution over image-class labels associated with each
knowledge is available. Thus, we formulate the image- node, as well as a distribution over network connectivity.
segmentation and matching problems as inference of poster- Therefore, for a given image, posteriors of both network
ior distributions over pixel random fields given the observed structure and image-class labels need to be inferred in the
image. Our principal challenge, therefore, lies in choosing a inference algorithm. After inference, the model structure can
suitable and numerically manageable statistical model, be determined through Bayesian (MAP) estimation, which
which provides the means for 1) clustering pixels into image gives a topology solution comprising a forest of subtrees, each
regions, which we interpret as objects and 2) detection of of which segments the image, as depicted in Fig. 1. Since each
instantiated objects across a given set of images. The solution root determines a subtree, whose leaf nodes form a detected
to these two problems can be viewed as critical core object, we can assign physical meaning to roots as represent-
components of an integrated computer vision system that is ing whole objects. Also, each descendant of the root down the
capable of first registering unknown/known objects over an subtree can be interpreted as the root of another subtree whose
image set and then updating its knowledge base accordingly. leaf nodes cover only a part of the object. Thus, roots’
While considerations of such a system are beyond the scope of descendants can be viewed as object parts at various scales.
this paper, we point out that the core components introduced The experimental results over several types of image data sets,
here form the basis of many prospective vision systems. Our which we report herein, show that DTs are capable of
focus herein is the formulation of a statistical modeling capturing important component-subcomponent relation-
paradigm with the specified capabilities. ships among objects and their parts, and that DTs perform
Given the assumption that dependencies among image well in segmenting images into plausible pixel clusters.
regions occur only through component-subcomponent rela- Hence, the generative (Markovian) property of DTs provides
tionships, multiscale generative models known as dynamic a solution to our first problem of unsupervised image
trees (DTs) appear very suitable for our goals [1], [2], [3], [4]. In segmentation.
DTs, nodes represent random variables, and arcs between With respect to our second problem, we stress that
traditional approaches to object recognition based on
. S. Todorovic is with the Department of Electrical and Computer
statistical modeling of object appearances and subsequent
Engineering, University of Florida, Gainesville, FL 32611. probability-based classification as, for example, in [5], [6], [7],
E-mail: sinisha@ufl.edu. are ill-posed for unsupervised settings. Here, ill-posed
. M.C. Nechyba is with Pittsburgh Pattern Recognition, Inc., 40 24th Street, indicates two difficulties. First, the problem is not uniquely
Suite 240, Pittsburgh, PA 15222. E-mail: michael@pittpatt.com.
solvable since the absence of prior knowledge on possible
Manuscript received 13 Oct. 2003; revised 17 Mar. 2005; accepted 21 Mar. image classes renders the design of a classifier ambiguous
2005; published online 14 Sept. 2005.
Recommended for acceptance by J. Buhmann.
and, second, the solution does not depend on the data in a
For information on obtaining reprints of this article, please send e-mail to: continuous way; that is, insufficiently large training data sets
tpami@computer.org, and reference IEEECS Log Number TPAMI-0316-1003. lead to very unreliable statistical models. Consequently, in
0162-8828/05/$20.00 ß 2005 IEEE Published by the IEEE Computer Society
TODOROVIC AND NECHYBA: DYNAMIC TREES FOR UNSUPERVISED SEGMENTATION AND MATCHING OF IMAGE REGIONS 1763
may generate pixel values that are inconsistent with

any possible image. Nevertheless, this problem can be
alleviated by appropriate normalization or by span-
ning image data over a set of orthonormal functions
(e.g., in the wavelet transform).
2. We develop a novel probabilistic inference algorithm
for the proposed model. As is the case for many
complex-structure models, exact inference for DTs is
intractable. Therefore, we assume that there are
averaging phenomena in DTs that may render a given
Fig. 1. Two types of DTs: (a) Observable variables present at the leaf
level only. (b) Observable variables present at all levels, round and
set of variables approximately independent of the rest
square-shaped nodes indicate hidden and observable random variables, of the network, and thereby derive a Structured
triangles indicate roots, and unconnected nodes in this example belong Variational Approximation (SVA) [17], [18], [19].
to other subtrees; each subtree segments the image into regions SVA provides for principled solutions, while redu-
marked by distinct shading. cing computational complexity. Unlike the variational
approximation discussed in [4], we explicitly account
unsupervised settings, it is not possible to reliably estimate for the dependence of node positions on the model’s
the variability of data within a class and across various structure, which results in the algorithm’s faster
classes, and thereby to set thresholds for subsequent convergence, when compared to existing approaches.
classification. In contrast, it is widely reported in the literature 3. We propose a similarity measure for comparing DTs,
that image matching based on a suitably defined similarity which we employ for matching segmented image
measure is resilient to the outlined problems [8], [9], [10], [11], regions across a given set of images. Standard
[12], [13]. As such, herewith, we formulate the object- probabilistic approaches to matching use the log-ratio
recognition problem in unsupervised settings as one of of model distributions representing the examined
computing a similarity measure between DTs representing image regions [8], [9], [10], [11]. However, as ex-
instantiated objects across images. For this purpose, we plained before, in unsupervised settings, these dis-
define a novel similarity measure between two dynamic- tributions may have large variations across images,
structure models. From the given data set, we first choose an due to the uninformed estimation of the model
arbitrary image to be the reference, and then search for parameters. To alleviate this problem, we measure
segmented image regions in the rest of the images that are correlation between the cross-likelihoods of the two
similar to the ones in the reference image. Note that the image regions, normalized by the likelihoods of each
unsupervised setting precludes the term reference signifying individual region.
known image classes (i.e., objects). Therefore, we are not in The remainder of the paper is organized as follows: In
position to successfully recognize every appearance of a Section 2, we first discuss prior research related to tree-
reference object across the data set. Since the appearances of
structured statistical modeling. Next, in Section 3, we define
the examined object may differ by many factors including
our dynamic-tree framework. In Section 4, we derive our
pose, occlusion, lighting, and scale, by matching we can only
search for the particular reference appearance of the object in structured-variational approximation inference algorithm
the rest of the images. This limitation of the outlined approach for DTs, and discuss various implementation issues inherent
is, however, inherent to unsupervised settings in general, not to the algorithm, as well as to unsupervised settings. Then, in
to our approach in particular. Nevertheless, results of image Section 5, we consider the problem of comparing DT models
matching in unsupervised settings, which we report herein, and define a similarity measure for that purpose. Finally, in
demonstrate the viability of the proposed approach over Section 6, we present experimental results on segmentation
different image sets. and matching of image regions for several classes of images.
This paper makes a number of contributions:
1. We introduce multilayered data into the model, as 2 RELATED WORK

illustrated in Fig. 1b—an approach that has been Recently, there has been a flurry of research in the field of tree-
extensively investigated in fixed-structure quad-trees structured belief networks (TSBNs) [5], [14], [15], [16], [20].
(e.g., [14], [15], [16]). Throughout, we assume that TSBNs are characterized by a fixed, balancedtree-structure of
multiscale data forms a quad-tree structure of dyadic nodes, representing random variables. The edges of TSBNs
squares, as is the case, for example, in the wavelet represent parent-child (Markovian) dependencies between
transform. The proposed models of data quad-trees neighboring layers of nodes, while random variables belong-
have proved rather successful for various applications
ing to the same layer are conditionally independent. TSBNs
including image denoising, classification, and seg-
have efficient linear-time inference algorithms, of which, in
mentation. Hence, it is important to develop a similar
formulation for DTs. To our best knowledge, the the graphical-models literature, the best-known is Pearl’s
literature reports only research on DTs whose ob- - message passing scheme, also known as belief propaga-
servables exist at the leaf level, as depicted in Fig. 1a, tion [5], [21]. Cheng and Bouman have used TSBNs for
[1], [2], [3], [4]. This may be so because of a multiscale document segmentation [14]; Schneider et al. have
fundamental problem with propagating observables used TSBNs to replace the initial Markovian random field
to higher levels in a generative model. Since over- prior and have achieved efficient simultaneous image
lapping image parts are being reused for deriving denoising and segmentation [20]. The aforementioned
observables at different levels, a generative model examples demonstrate the powerful expressiveness of TSBNs
1764 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 11, NOVEMBER 2005
and the efficiency of their inference algorithms, which is Further, each round-shaped node i (see Fig. 1) is
critically important for our purposes. characterized by random position r i in the image plane.
Despite these attractive properties, TSBNs give rise to The distribution of r i is conditioned on the position of its
blocky segmentations, due to their fixed structure. In the parent r j as
literature, there are several approaches to alleviate this
problem. Irving’s research group has proposed an over- 4 expð 12 ðrr i rr j ddij ÞT 1
ij ðr
ri rr j ddij ÞÞ
lapping tree model, where distinct nodes correspond to P ðrr i jrr j ; zij ¼1Þ¼ 1 ; ð2Þ
2jij j2
overlapping parts in the image [22]. In [23], the authors
have discussed two-dimensional hierarchical models, where ij is a diagonal matrix that represents the order of
where nodes are dependent both at any particular layer magnitude of object size and parameter d ij is the mean of
through a Markov-mesh and across resolutions. In both relative displacement ðrr i rr j Þ. In [4], the authors, for
approaches, segmentation results are superior to those simplicity, set d ij to zero, which favors undesirable position-
when standard TSBNs are used because the descriptive ing of children and parent nodes at the same locations. From
component of the models is improved at some increased our experiments, this may seriously degrade the image-
computational cost. Ultimately, however, these approaches modeling capabilities of DTs, and as such some nonzero
do not deal with the source of the “blockiness”—namely, relative displacement d ij needs to be accounted for. The joint
the fixed tree structure of TSBNs. probability of R ¼4 frr i j8i2V g, is given by
Aside from the work of Williams et al. [1], [2], [3], [4], to
Y z
which we will refer throughout the paper, we point out other P ðRjZÞ ¼
4
P ðrr i jrr j ; zij Þ ij ; ð3Þ
research concerning dynamic-tree structures. Konen et al. i;j2V
have proposed a flexible neural mechanism for invariant
pattern recognition based on correlated neuronal activity and where for roots i 1we have P ðrri jzi0 ¼1Þ ¼4 expð 12 ðrr i ddi ÞT
the self-organization of dynamic links in neural networks 1 r i dd i ÞÞ=ð2ji j2 Þ. At the leaf level, V 0 , we fix node
i ðr
[24]. Also, Montanvert et al. [25] have explored irregular positions R0 to the locations of the finest-scale observables,
multiscale tessellations that adapt to image content. and then use P ðZ; R0 jR0 Þ as the prior over positions and
connectivity, where R0 ¼4 frr i j8i2V nV 0 g and R0 ¼4 frri j8i2V nV 0 g.
Next, each node i is characterized by an image-class label xi
3 DYNAMIC TREES and an image-class indicator random variable xki , such that
The model formulation discussed herein is similar to that of xki ¼1 if xi ¼k, where k is a label taking values in the finite set M.
the position-encoding dynamic trees in [4], where obser- Thus, we assume that the set M of unknown image classes is
vables are present only at the lowest model level. However, finite. The label k of node i is conditioned on image class l of its
since we also consider multilayered observable data in DTs, parent j and is given by conditional probability tables Pijkl . The
for completeness, we introduce both types of models, joint probability of X ¼4 fxki ji2V ; k2Mg is given by
emphasizing, where appropriate, the differences. Y Y h ixki xlj zij
DTs are directed, acyclic graphs with two disjoint sets of P ðXjZÞ ¼ Pijkl ; ð4Þ
nodes representing hidden and observable random vectors. i;j2V k;l2M
Graphically, we represent all hidden variables as round-
shaped nodes, connected via directed edges indicating where for roots i (zi0 ¼1) we use priors P ðxki Þ.
Markovian dependencies, while observables are denoted as Finally, we introduce nodes that are characterized by
rectangular-shaped nodes, connected only to their corre- observable random vectors representing image texture and
sponding hidden variables, as depicted in Fig. 1. Below, we color cues. Here, we make a distinction between two types
first introduce nodes characterized by hidden variables. of DTs. The model where observables are present only at
There are V round-shaped nodes, organized in hierarchical the leaf-level is referred to as DTV 0 ; the model where
levels, V ‘ , ‘ ¼ f0; 1; . . . ; L1g, where V 0 denotes the leaf level observables are present at all levels is referred to as DTV . To
and V 0 ¼4 V =V 0 . The number of round-shaped nodes is identical clarify the difference between the two types of nodes in
to that of the corresponding quad-tree with L levels, such that DTs, we index observables with respect to their locations in
jV ‘ j¼jV ‘1 j=4¼ . . . ¼jV 0 j=4‘ . Connections are established un- the data structure (e.g., wavelet dyadic squares), while
der the constraint that a node at level ‘ can become a root, or it hidden variables are indexed with respect to a node-index
can connect only to the nodes at the next ‘þ1 level. The in the graph. This generalizes correspondence between
network connectivity is represented by random matrix Z, hidden and observable random variables of the position-
where entry zij is an indicator random variable, such that zij ¼1 encoding dynamic trees in [4]. We define position ðiÞ to be
if i2V ‘ and j2f0; V ‘þ1 g are connected. Z contains an equal to the center of mass of the ith dyadic square at level ‘
in the corresponding quad-tree with L levels:
additional zero (“root”) column, where entries zi0 ¼1 if i is a
root. Since each node can have only one parent, a realization of 4
ðiÞ ¼ ½ðnþ0:5Þ2‘ ðmþ0:5Þ2‘ T ; n; m ¼ 1; 2; . . . ; ð5Þ
Z can have at most one entry equal to 1 in each row. We define
the distribution over connectivity as where n and m denote the row and column in the dyadic
square at scale ‘ (e.g., for wavelet coefficients). Clearly,
4
Y
L1 Y zij other application-dependent definitions of ðiÞ are possible.
P ðZÞ ¼ ij ; ð1Þ
‘¼0 ði;jÞ2V ‘ f0;V ‘þ1 g
Note that while the r s are random vectors, the s are
deterministic values fixed at locations where the corre-
P ij is the probability of i being the child of j, subject
where sponding observables are recorded in the image. Also, after
to j2f0;V ‘þ1 g ij ¼1. fixing R0 to the locations of the finest-scale observables, we
4
have 8i2V 0 , ðiÞ¼rri . The definition, given by (5), holds for QðZ; X; R0 Þ ¼ QðZÞQðXjZÞQðR0 jZÞ; ð8Þ
DTV 0 , as well, for ‘¼0.
which enforces that both class-indicator variables X and
For both types of DTs, we assume that observables
Y ¼4 fyy ðiÞ j8i2V g at locations R0 and 0 ¼4 fðiÞj8i2V 0 g are position variables R0 are statistically dependent on the tree
conditionally independent given the corresponding xki : connectivity Z. Since these dependencies are significant in the
prior, one should expect them to remain so in the posterior.
Y Yh ixki Therefore, our formulation appears to be more appropriate
P ðY jX; R0 ; 0 Þ ¼ P ðyy ðiÞ jxki ; ðiÞÞ ; ð6Þ for approximating the true posterior than the mean-field
i2V k2M
variational approximation QðZ; X; R0 Þ¼QðZÞQðXÞQðR0 Þ dis-
where for DTV 0 , V 0 should be substituted for V . The cussed in [1] and the form QðZ; X; R0 Þ¼QðZÞQðXjZÞQðR0 Þ
likelihoods P ðyy ðiÞ jxki ¼1; ðiÞÞ arePmodeled as mixtures proposed in [4]. We define the approximating distributions as
of Gaussians: P ðyy ðiÞ jxki ¼1; ðiÞÞ ¼4 Gg¼1 k ðgÞN ðy
k
y ðiÞ ; k ðgÞ; follows:
k ðgÞÞ. For large Gk , a Gaussian-mixture density can
approximate any probability density [26]. In order to 4
Y
L1 Y zij
QðZÞ ¼ ij ; ð9Þ
avoid the risk of overfitting the model, we assume that ‘¼0 ði;jÞ2V ‘ f0;V ‘þ1 g
the parameters of the Gaussian-mixture are equal for all
4
Y Y h ixki xlj zij
nodes. The Gaussian-mixture parameters can be grouped QðXjZÞ ¼ Qkl ; ð10Þ
ij
in the set ¼4 fGk ; fk ðgÞ; k ðgÞ; k ðgÞgG
g¼1 j 8k2Mg.
k
i;j2V k;l2M
The joint prior of the model can be written as 4
Y
QðR0 jZÞ ¼ ½Qðrri jzij Þzij ð11Þ
0 0 0 0 0 i;j2V 0
P ðZ; X; R; Y Þ¼P ðY jX; R ; ÞP ðXjZÞP ðZ; R jR ÞP ðR Þ: ð7Þ
All the parameters of the joint prior can be grouped in the 4
ij ÞT 1
Y exp 12 ðrri ij ðr
r i
ij Þ
set ¼4 fij ; d ij ; ij ; Pijkl ; g, 8i; j2V , 8k; l2M. QðR0 jZÞ ¼ 1 ; ð12Þ
i;j2V 0 2jij j2
where parameters ij correspond to the ij connection

4 PROBABILISTIC INFERENCE
probabilities, and the Qkl kl
ij are analogous to the Pij conditional
In order to conduct Bayesian estimation of the model for a probability tables. For the parameters of QðR0 jZÞ, note that
given image, as required in our formulation of the image covariances ij and mean values ij form the set of Gaussian
segmentation problem, we need to infer the posterior parameters for a given node i2V ‘ over its candidate
distributions of Z, X, and R0 , given Y and R0 . However,
parents j2V ‘þ1 . Which pair of parameters ð ij ; ij Þ, is used
due to the complexity of DTs, the exact computation of the
to generate r i is conditioned on the given connection between i
posterior P ðZ; X; R0 jY ; R0 Þ is intractable. Therefore, we resort
and j—that is, the current realization of Z. We assume that the
to an approximate inference method, of which two broad
classes exist: deterministic approximations [17], [18], [19] and s are diagonal matrices, such that node positions along the
Monte-Carlo methods [27], [28], [29]. Generally, in MCMC “x” and “y” image axes are uncorrelated. Also, for roots,
approaches, with increasing model complexity, the choice of suitable forms of Q functions are used, similar to the
proposals in the Markov chain becomes hard, so that the specifications given in Section 3. All the parameters of
equilibrium distribution is reached very slowly [27], [28]. To QðZ; X; R0 Þ can be grouped in the set ¼4 fij ; Qkl
ij ; ij ; ij g.
0 0 0
achieve faster inference, we consider variational approxima- To find QðZ; X; R Þ closest to P ðZ; X; R jY ; R Þ we resort to
tion, a specific type of deterministic approximation [17], [18], a standard optimization method, where Kullback-Leibler
[19]. Variational approximation methods have been demon- (KL) divergence between QðZ; X; R0 Þ and P ðZ; X; R0 jY ; R0 Þ is
strated to give good and significantly faster results, when minimized ([30], chapter 2, pp. 12-49, and chapter 16, pp. 482-
compared to Gibbs sampling ([2], chapter 3). The proposed 509). The KL divergence is given by
variational approaches range from a factorized approximat- Z
ing distribution over hidden variables [1] (also known as 4
X QðZ; X; R0 Þ
KLðQkP Þ ¼ dR0 QðZ; X; R0 Þ log : ð13Þ
mean field variational approximation) to more structured R0 Z;X
P ðZ; X; R0 jY ; R0 Þ
solutions [4], where dependencies among hidden variables
are enforced. The underlying assumption in those methods is It is well-known that KLðQkP Þ is nonnegative for any two
that there are averaging phenomena in DTs that may render a distributions Q and P , and KLðQkP Þ¼0 if and only if Q¼P ;
given set of variables approximately independent of the rest these properties are a direct corollary of Jensen’s inequality
of the network. Therefore, the resulting variational optimiza- ([30], chapter 2, pp. 12-49). As such, KLðQkP Þ guarantees a
tion of DTs provides for principled solutions, while reducing global minimum—that is, a unique solution to QðZ; X; R0 Þ. In
computational complexity. In the following section, we the following section, we show how to compute QðZ; X; R0 Þ.
derive a novel Structured Variational Approximation (SVA)
algorithm for DTs. 4.2 Update Equations for Computing QðZ; X; R0 Þ
By minimizing the KL divergence, we derive the update
4.1 Structured Variational Approximation equations for estimating the parameters of the variational
In variational approximation, the intractable distribution distribution QðZ; X; R0 Þ. Below, we summarize the final
P ðZ; X; R0 jY ; R0 Þ is approximated by a simpler distribution derivation results. Detailed derivation steps are reported in
QðZ; X; R0 jY ; R0 Þ closest to P ðZ; X; R0 jY ; R0 Þ. To simplify the Appendix, where we also provide the list of nomen-
notation, below, we omit the conditioning on Y and R0 , and clature. In the following equations, we use to denote an
write QðZ; X; R0 Þ. The novelty of our approach is that we arbitrary normalization constant, the definition of which
constrain the variational distribution to the form may change from equation to equation. Parameters on the
right-hand side of the update equations are assumed updated summing over children and grandparents of i and,
known, as learned in the previous iteration step. therefore, should be iterated until convergence.
4.2.1 Optimization of QðXjZÞ 4.2.3 Optimization of QðZÞ

QðXjZÞ is fully characterized by parameters Qkl
ij , which are QðZÞ is fully characterized by connectivity probabilities ij ,
updated as which are computed as
Qkl kl k
ij ¼ Pij i ; 8i; j 2 V ; 8k; l 2 M; ð14Þ ij ¼ ij expðAij Bij Þ; 8‘; 8ði; jÞ 2 V ‘ f0; v‘þ1 g; ð19Þ
where the auxiliary parameters ki are computed as where Aij represents the influence of observables Y , while
( Bij represents the contribution of the geometric properties
k
P ðyy ðiÞ jxki ; ðiÞÞ; i 2 V 0; of the network to the connectivity distribution. These are
i ¼ Q P
ak ak ci
ð15aÞ defined in the Appendix.
c2V a2M Pci ci ; i 2 V 0;
4.3 Inference Algorithm and Bayesian Estimation
" #ci
Y X For the given set of parameters characterizing the joint
ki ¼ P ðyy ðiÞ jxki ; ðiÞÞ Pciak ac ; 8i 2 V ; ð15bÞ prior, observables Y , and leaf-level node positions R0 , the
c2V a2M standard Bayesian estimation of optimal Z^, X ^ , and R ^0
requires minimizing the expectation of a cost function C:
where (15a) is derived for DTV 0 and (15b) for DTV . Since the
ci are nonzero only for child-parent pairs, from (15), we ðZ^; X
^; R
^0 Þ ¼
note that s are computed for both models by propagating ð20Þ
the messages of the corresponding children nodes arg min 0 IEfCððZ; X; R0 Þ; ðZ ; X ; R0 ÞÞjY ; R0 ; g;
Z;X;R
upward. Thus, Qs, given by (14), can be updated by making
a single pass up the tree. Also, note that for leaf nodes, where CðÞ penalizes the discrepancy between the estimated
i2V 0 , the ci parameters are equal to 0 by definition, configuration ðZ; X; R0 Þ and the true one ðZ ; X ; R0 Þ. We
yielding ki ¼P ðyy ðiÞ jxki ; ðiÞÞ in (15b). propose the following cost function:
Further, from (9) and (10), we derive the update equation 4
for the approximate posterior probability mki that class k2M CððZ; X; R0 Þ; ðZ ; X ; R0 ÞÞ¼
X XX
is assigned to node i2V , given Y and R0 , as ½1
ðzij zij Þ þ ½1
ðxki xk
i Þ
Z X X X i;j2V i2V k2M ð21Þ
mki ¼ dR0 xki QðZ; X; R0 Þ; ¼ ij Qkl l
ð16Þ X
ij mj : þ ½1
ðrr i rr i Þ;
R0 Z;X j2V 0 l2M
i2V 0
Note that the mki
can be computed by propagating image-
where indicates true values, and
ðÞ is the Kronecker delta
class probabilities in a single pass downward. This upward- function. Using the variational approximation P ðZ; X; R0 jY ;
downward propagation, specified by (15) and (16), is very R0 Þ QðZÞQðXjZÞQðR0 jZÞ, from (20) and (21), we derive:
reminiscent of belief propagation for TSBNs [5], [21]. For the X X
special case when ij ¼1 only for one parent j, we obtain the Z^ ¼ arg min QðZÞ ½1
ðzij zij Þ; ð22Þ
standard - rules of Pearl’s message passing scheme for Z
Z i;j
TSBNs. X X
^ ¼ arg min ½1
ðxki xk
X QðZÞQðXjZÞ i ÞÞ; ð23Þ
X
Z;X
4.2.2 Optimization of QðR0 jZÞ i;k
QðR0 jZÞ is fully characterized by parameters ij and ij . The Z X X

update equations for ij and ij , 8ði; jÞ2V ‘ f0; V ‘þ1 g, ‘>0, ^0 ¼ arg min
R dR0 QðZÞQðR0 jZÞ ½1
ðrr i rri Þ: ð24Þ
0
where ij 6¼ 0, are R R0 Z i
" #1 Given the constraints on connections, discussed in Section 3,
X X
1 1 minimization in (22) is equivalent to finding parents:
ij ¼ jp ij þ ci ci
p2V 0 c2V 0
" # ð17Þ ð8‘Þð8i2V ‘ ÞðZi 6¼0Þj^¼ arg max ij ; for DTV 0 ; ð25aÞ
X X j2f0;V ‘þ1 g
jp 1
ij ð
jp þddjp Þþ ci 1
ci ð
ci ddij Þ ;
p2V 0 c2V 0
ð8‘Þð8i2V ‘ Þj^¼ arg max ij ; for DTV ; ð25bÞ
j2f0;V ‘þ1 g
" #12 !
X Trf1
ij jp g
Trf1 1 where ij is given by (19); Zi denotes the ith column of Z and
ij g ¼ Trfij g 1 þ jp
Trf1
p2V 0 ij ij g Zi 6¼0 indicates that there is at least one nonzero element in
" #12 ! ð18Þ column Zi ; that is, i has children, and thereby is included in
X Trf1
1 ci ci g the tree structure. Note that due to the distribution over
þ ci Trfci g 1 þ ;
Trf1
ci ij g
connections, after estimation of Z, for a given image, some
c2V 0
nodes may remain without children. To preserve the
where c and p denote children and grandparents of node i, generative property in DTV 0 , we impose an additional
respectively. Since the s and s are assumed diagonal, it is constraint on Z that nodes above the leaf level must have
straightforward to derive the expressions for the diagonal children in order to be able to connect to upper levels. On the
elements of the s from (18). Note that both ij and ij are other hand, in DTV , due to multilayered observables, all
nodes V must be included in the tree structure, even if they

do not have children. The global solution to (25a) is an open
problem in many research areas. Therefore, for DTV 0 , we
propose a stage-wise optimization, where, as we move
upward, starting from the leaf level ‘¼f0; 1; . . . ; L1g, we
include in the tree structure optimal parents at V ‘þ1
according to
ð8i2V ‘ ÞðZî 6¼0Þ j^¼ arg max ij ; ð26Þ

j2f0;V ‘þ1 g
where Zî denotes ith column of the estimated Z^ and Zî 6¼0
indicates that i has already been included in the tree
structure when optimizing the previous level V ‘ .
Next, from (23), the resulting Bayesian estimator of
image-class labels, denoted as xî , is
ð8i2V Þ xî ¼ arg max mki ; ð27Þ

k2M
where the approximate posterior probability mki that image

class k is assigned to node i is given by (16).
Finally, from (24), optimal node positions are estimated
8‘>0, and 8i2V ‘ as
X X
rî ¼ arg max Qðrri jZÞQðZÞ ¼ ij ij ; ð28Þ
ri
Z j2f0;V ‘þ1 g
where ij and ij are given by (17) and (19), respectively.

The inference algorithm for DTs is summarized in Fig. 2.
The specified ordering of parameter updates for QðZÞ,
QðXjZÞ, and QðR0 jZÞ in Fig. 2, Steps 4-10, is arbitrary;
theoretically, other orderings are equally valid.
4.4 Specification of Model Parameters

Variational inference presumes that parameters V , L, M, and
¼fij ; d ij ; ij ; Pijkl ; g, 8i; j2V , 8k; l2M, are available. Due
to the lack of example images in unsupervised settings, we are
not in a position to learn these parameters on a training image
set. This problem has been addressed in the literature with
indecisive results (e.g., [31], [32], [33]). In the absence of prior
application knowledge, multiple solutions are equally
Fig. 2. Inference of the DT given Y , R0 , and ; t and tin are counters in
reasonable, as even human interpreters arrive at different the outer and inner loops, respectively; N" , ", and " control the
answers [33]. convergence criteria for the two loops.
First, for the given number of leaf-level nodes jV 0 j, we set
L¼ log2 ðjV 0 jÞ. Next, due to a huge diversity of possible Once estimated, these values can be used to optimize jMj.
configurations of objects in images, for each node i2V ‘ , we Then, for the new jMj value, we again conduct the EM
set ij to be uniform over is candidate parents 8j2f0; V ‘þ1 g. algorithm on the quad-tree, and so forth. To optimize jMj, we
Then, for all pairs ði; jÞ2V ‘ V ‘þ1 at level ‘, we set
assume that P ðjMjÞ is the Poisson distribution, with the mean
d ij ¼ðiÞðjÞ—namely, the d ij are initialized to the relative
IEfjMjg¼2—the assumption stemming from our initial guess
displacement of the centers of mass of the ith and jth dyadic
that each image contains at least two image regions, and that
squares in the corresponding quad-tree with L levels,
specified in (5). For roots i, we have d i ¼ðiÞ. Also, we set large values of jMj should be penalized due to computational
diagonal elements of ij to the diagonal elements of a complexity. We optimize jMj by maximizing the function
matrix d ij d T fðjMjÞ¼P ðY jX; ; jMjÞP ðjMjÞ for jMj¼2; 3; 4; , where like-
ij . The number of components Gk in a Gaussian Q
mixture for each class k is set to Gk ¼3, which is empirically lihood P ðY jX; ; jMjÞ¼ i P ðyy ðiÞ jxi ; ðiÞ; jMjÞ is computed
validated to be appropriate. from the results of the EM algorithm on the quad-tree. Since
Now, the most critical parameters that remain to be larger jMj values give larger P ðY jX; ; jMjÞ, fðjMjÞ increases
specified are the number of image classes jMj, conditional until some maximum jMj , when the Poisson distribution of
probability tables Pijkl , and the parameters of a Gaussian- P ðjMjÞ starts to dominate decreasing fðjMjÞ for jMj > jMj .
mixture density . For this purpose, we conduct an iterative Note that P ðjMjjX; Y Þ / fðjMjÞ, so that the maximum of
learning procedure using the EM algorithm on the quad-tree fðjMjÞ, jMj , gives the MAP solution to our parameter
thoroughly discussed in ([16], pp. 399-401). Given V , L, Y , estimation problem. This iterative learning algorithm stops
and jMj of the quad-tree, the algorithm readily estimates Pijkl when jMj is reached, yielding also Pijkl and parameters. Our
and , for a given image. Here, Pijkl and are equal for all levels. experiments show that jMj is on average a conservative
estimate of the true number of classes. Since DTs are Since the size of objects determines the number of
generalized quad-trees, our experimentation suggests that random variables in DTs, the log-ratios may turn out to be
this optimization with respect to the quad-tree is justified. equal for two different scenarios: When subtrees represent
different objects of similar size and when subtrees represent
4.5 Implementation Issues the same object appearing at different size across the
In this section, we list algorithm-related details that are images. The same problem has also been discussed in [12],
necessary for the experimental results, presented in Section 6, [13], where hidden Markov models of different-length
to be reproducible. Other specifications, such as, for example, observation sequences have been compared. Therefore, for
feature extraction, will be detailed in Section 6. each pair of image regions, it is necessary to normalize the
First, direct implementation of (14) would result in probability log-ratios. Usually, this is done by multiplying
numerical underflow. Therefore, we introduce the follow- the log-ratio with a suitable constant >0. Since the s are
ing P scaling procedure: ~ki ¼ ki =Si , 8i2V , 8k2M, where different for every pair of compared regions, a decision
Si ¼4 k2M ki . Substituting the scaled ~s into (14), we obtain criterion based on the normalized log-ratios becomes
nonprobabilistic.1 Our experiments on image matching
Pijkl ki Pijkl ~ki show that this normalization improves performance over
Qkl
ij ¼ P ¼P : matching when probability log-ratios are not normalized.
al a
a2M Pij i a2M Pijal ~ai
There are several disadvantages of the outlined approach
In other words, computation of Qkl ij does not change when to image matching. We have observed great sensitivity of
the scaled ~0 s are used. h log P1 =P2 i to the specification of the optimal . Further-
Second, to reduce computational complexity, as is done in more, matching approaches based on computing h log P1 =P2 i
[4], we consider, for each node i, only the 77 box implicitly assume that distributions P1 and P2 are reliable
encompassing parent nodes j that neighbor the parent of representations of underlying image processes, which is
the corresponding quad-tree. Consequently, the number of justifiable for supervised settings. However, this is not the case
possible children nodes c of i is also limited. Our experiments for unsupervised settings, where P1 and P2 may have huge
show that the omitted nodes, either children or parents, variations over the examined images, due to the uninformed
contribute negligibly to the update equations. Thus, we limit estimation of the prior distribution, that is, in our case, model
overall computational cost as the number of nodes increases. parameters .
Finally, the convergence criterion of the inner loop, where To mitigate the sensitivity to , as well as to variations in
ij and ij are computed, is controlled by parameter " . When across images, we propose a novel similarity measure,
" ¼0:01, the average number of iteration steps, tin , in the inner thereby departing from the outlined approaches where
loop, is from 3 to 5, depending on the image size, where the probability log-ratios are used. Thus, in our approach, the
latter is obtained for 128128 images. The convergence impact of unreliable is neutralized by measuring correla-
criterion of the outer loop is controlled by parameters N" and tion between the cross-likelihoods of the two image regions,
". The aforementioned simplifications, which we employ in which are normalized by the likelihoods of each individual
practice, may lead to suboptimal solutions of SVA. From our region. Below, we mathematically formulate this idea.
experience, though, the algorithm recovers from unstable Let Yt and Yr denote observables of two DT subtrees T t
stationary points for sufficiently large N" . In our experiments, and T r , respectively, where t in the subscript refers to an
we set N" ¼10 and "¼0:01. image region in the test image, and r, to a region in the
After the inference algorithm converged, we then reference image, as defined in Section 1. Here, T t refers to
estimate the DT structure for a given image, which consists the estimated configuration ðZ^t ; X ^t ; R ^0 Þ and the parameter
t
of DT subtrees representing distinct objects in that image. set t for the test image. Similarly, T r refers to the
Having found the DT representation of segmented image estimated configuration ðZ^r ; X ^r ; R ^0 Þ and the parameter
r
regions, we are then in a position to measure the similarity set r for the reference image. As discussed above, we
of the detected objects across a given set of images. normalize the likelihood P ðY jX; ; Þ, given by (6), as
^ r ; r ; r Þ1=Cr , where Cr denotes the cardinality of
Ptr ¼4 P ðYt jX
the model T r . Since the Y s at coarser resolutions affect more
5 STOCHASTIC SIMILARITY MEASURE pixels than at finer PL1scales,
P for DTV , we compute the
‘ ‘
Recently, similarity measures between two statistical models cardinality as C ¼4 ‘¼0 i2T K i , where Ki denotes the size
have been given considerable attention in the literature [8], of the kernel used to compute observables at level ‘; for
[9], [10], [11], [12], [13]. To compare a pair of image regions example, K‘i ¼2‘ for wavelet coefficients. For DTV 0 , C is
represented by statistical models, in standard probabilistic equal to the number of leaf-level nodes. We now define the
approaches, one examines the log-ratio log P1 =P2 or the similarity measure between two models as
expected value of the log-ratio hlog P1 =P2 i, where P1 and P2 rffiffiffiffiffiffiffiffiffiffiffiffi
are some distributions of the two models (e.g., likelihoods, 4 Ptr Prt
tr ¼ : ð29Þ
posteriors, cross probabilities). For instance, Moghaddam [8], Ptt Prr
for the purposes of face recognition, proposes a similarity The defined similarity measure exhibits the following
measure expressed in terms of probabilities of intrapersonal properties: 1) by definition tr ¼rt , 2) from 0<Ptr Ptt and
and extrapersonal facial image variations. Hermosillo et al. [9] 0<Prt Prr it follows that 0<tr 1 and, finally, 3) if T t T r
perform matching of images by computing the variational than tr ¼1. Note that, for property 2), we assume that the
gradient of a hierarchy of statistical similarity measures. inference algorithm guarantees that Ptt and Prr are global
These approaches can be viewed as a form of ML or MAP
principles. Also, a symmetric function of the KL divergence, 1. Scaled P1 and P2 do not satisfy the three axioms of probability over the
hlog P1 =P2 iP1 þhlog P2 =P1 iP2 , has been proposed in [10]. total set of events if the s vary for different events in that set.
Fig. 3. Alignment tests: (a) and (b) 128128 test and reference images, (c) segmented region under T t using DTV 0 , (d) segmented region under T r
using DTV 0 , (e) image regions in the reference image used for substitution y t ðiÞ y r ðiÞ for different t , and (f) image regions in the test image used
for substitution y r ðiÞ y t ðiÞ for different r . The crosses mark the estimated roots’ positions r^t and r^r .
maxima for the test and reference images, respectively. In and its corresponding node characterized by hidden vari-
practice, from our experience, this is not a significant ables (round-shaped node). This deletion of nodes gives rise
concern, as the algorithm converges to near-optimal to a number of tree-pruning strategies. In our approach, for
solutions, as discussed in Section 4.5. DTV , we simply delete outlying rectangular-shaped nodes
In computation of the cross probabilities, say, Ptr , it is and their corresponding round-shaped nodes; other nodes
necessary to substitute observables Yr with Yt in the are kept intact. For DTV 0 , the deletion of nodes at the leaf level,
estimated subtree structure ðZ^r ; X ^r ; R^0 Þ according to a
r V 0 , may leave some higher-level nodes without children. To
specified mapping. While a complete treatment of possible preserve the generative property, as discussed in Section 4.3,
mappings is beyond the scope of this paper, below we from T r , we also prune, in the bottom-up sweep, those nodes
consider one plausible approach. For this purpose, it is that happen to lose all their children. A similar pruning
convenient to index observables in terms of their locations procedure is necessary when computing Prt ðt Þ. In Fig. 3, we
in the image. Recall, in Section 3, we define locations of illustrate alignment tests pictorially for two sample images.
observables ðiÞ, given by (5). Thus, the mapping can be Having defined our similarity measure, we are now in
conducted as follows: For each observable node i in T r that position to conduct matching of segmented image regions
is on location r ðiÞ in the reference image, we first find the across a given set of images.
corresponding location in the test image t ðiÞ and then
substitute y r ðiÞ y t ðiÞ . We define the correspondence
between the locations r ðiÞ and t ðiÞ, as follows: 6 EXPERIMENTS
We report experiments on segmentation and matching of
cosðr Þ sinðr Þ image regions for five sets of images. Data Set I contains 50,
t ðiÞ¼ r^t þ ðr ðiÞ^
rrÞ ; ð30Þ
sinðr Þ cosðr Þ 2‘ 44, binary images with a total of 50 single object appear-
where r is a rotation angle; r^t and r^r are estimated positions ances. A sample of Data Set I images is depicted in Fig. 4a
of the roots of T t and T r in the test and reference images; bc2‘ (top). Data Set II consists of 50, 88, binary images with a total
finds integer, multiples-of-2 values of the form given by (5).
Pictorially, computation of Ptr can be viewed as alignment
of T r with the location of T t in the test image. Thus, according
to (30), we first translate T r until the root of T r coincides with
the root of T t in the reference image.2 After the roots are
aligned, we then rotate T r for several angles r about the
vertical axis containing the roots. A similar expression to (30)
holds for translation and rotation of T t in the reference image,
when computing Prt . Note that, because of the rotation, we
compute two arrays of cross probabilities, Prt ðt Þ and Ptr ðr Þ,
for each finite rotation increment of t and r . Although we
could eliminate either t or r , we do not because of finite
rotation increments that may differ for the two parameters.
We emphasize that the outlined translation/rotation is just a
visual interpretation of the mapping, in which one set of
observables is substituted by the other; however, this
mapping should not be misunderstood as transformation of
already estimated dynamic trees.
Due to the mapping, given by (30), when computing
Ptr ðr Þ, locations of observables t ðiÞ may fall outside the Fig. 4. Image segmentation using DTV 0 : (a) sample 44 and 88 binary
images, (b) clustered leaf-level pixels that have the same parent at level
boundaries of the test image. In this case, it is necessary to ‘¼1, (c) clustered leaf-level pixels that have the same grandparent at
prune that observable node i in T r (rectangular-shaped node) level ‘¼2; clusters are indicated by different shades of gray; the point in
each region marks the position of the parent node, and (d) estimated DT
2. As we demonstrate in Section 6, roots’ positions give a good estimate structure for the 44 image in (a); nodes are depicted inline representing
of the center of mass of true object appearances in the image; therefore, we 4, 2, and 1 actual rows of the levels 0, 1, and 2, respectively; nodes are
align the roots—not the centers of mass of segmented image regions under drawn as pie-charts representing P ðxki ¼1Þ, k 2 f0; 1g; note that there
T r and T t . are two roots representing two distinct objects.
Fig. 7. Image segmentation using DTV 0 : (top) Data Set III images and
(bottom) pixel clusters with the same parent at level 3. DT structure is
preserved over rotations.
Fig. 5. Twenty image classes in Type III and IV Data Sets. at angles 15
, 45
, and 75
. For texture extraction in DTV 0 ,
we compute the difference-of-Gaussian function convolved
with the image: Dðx; y; k; Þ¼ðGðx; y; kÞGðx; y; ÞÞIðx; yÞ,
where x and y represent pixel coordinates, Gðx; y; Þ¼ expð
ðx2 þ y2 Þ=22 Þ=22 and Iðx; yÞ is the intensity image. In
addition to reduced computational complexity, as compared
to the CWT, the function D provides a close approximation to
the scale-normalized Laplacian of Gaussian, 2 r2 G, which
has been shown to produce the most stable image features
across scales when compared to a range of other possible
image functions, such as the gradient and the Hessian pffiffiffi [35],
pffiffiffi
[36]. We compute Dðx; y; k; Þ for three scales k¼ 2; 2; 8,
and ¼ 2. For color features in both DTV and DTV 0 , we choose
the generalized RGB color space: r¼R=ðRþGþBÞ and
g¼G=ðRþ GþBÞ, which effectively normalizes variations in
Fig. 6. Image segmentation using DTV 0 : (a) Data Set III images, (b) pixel brightness; the Y s of higher-level nodes are computed as the
clusters with the same parent at level ‘¼3, (c) pixel clusters with the mean of the rs and gs of their children nodes of the initial quad-
same parent at level ‘¼4, points mark the position of parent nodes. DT
structure is preserved through scales. tree structure. Each color observable is normalized to have
zero mean and unit variance over the data set. Thus, the y ðiÞ s
of 78 multiple object appearances. A sample of Data Set II are eight and five-dimensional vectors for DTV and DTV 0 ,
images is shown in Fig. 4a(bottom). Data Set III comprises 50, respectively.
6464, simple indoor-scene, color images with a total of 105 In the following experiments, we compare our SVA
object appearances of 20 distinct objects shown in Fig. 5. inference algorithm with three other inference algorithms:
Samples of Data Set III images are given in Figs. 6, 7, and 9. 1) Gibbs sampling discussed in [29], 2) mean-field variational
Data Set IV contains 50, 128128, challenging indoor-scene, approximation (MFVA) proposed in [1], and 3) variational
color images with a total of 223 partially occluded object approximation (VA)3 discussed in [4]. All the figures in this
appearances of the same 20 distinct objects as for Data Set III section illustrate segmentation and matching performance
images. Examples of Data Set IV images are shown in Figs. 3 when DTs are inferred using our SVA algorithm.
and 10. Note that objects appearing in Data Sets III and IV are 6.1 Image Segmentation Tests
carefully chosen to test if DTs are expressive enough to
DT-based image segmentation is tested on all five data sets.
capture very small variations in appearances of some classes
Results presented in Figs. 4, 5, 6, 7, and 8 suggest that DTs are
(e.g., two different types of cans in Fig. 5), as well as to encode
able to encode component-subcomponent relationships
large differences among some other classes (e.g., wiry-
among objects and their parts in the image. From Fig. 8, we
featured robot and books in Fig. 5). Finally, Data Set V
observe that nodes at different levels of the dynamic tree can
contains 50, 128128, natural-scene, color images with a total
be interpreted as object parts at various scales. Moreover,
of 297 object appearances, samples of which are shown in
Figs. 8, 11, and 13. Ground truth in images is obtained through from Figs. 6 and 7, we also observe that DTs, inferred through
hand-labeling of pixels. SVA, preserve structure for objects across images subject to
For Data Sets I and II, we experiment only with DTV 0 translation, rotation and scaling. In Fig. 6, note that the level-4
models, with observables Y given by binary pixel values. For clustering for the larger-object scale in Fig. 6c (top) corre-
the other data sets, we test both DTV 0 and DTV . To compute the sponds to the level-3 clustering for the smaller-object scale in
Y s, we account for both color and texture cues. For texture Fig. 6b. In other words, as the object transitions through
analysis in DTV , we choose the complex wavelet transform scales, the tree structure changes by eliminating the lowest-
(CWT) applied to the intensity (gray-scale) image, due to its level layer, while the higher-order structure remains intact.
shift-invariant representation of texture at different scales,
3. Although the algorithm proposed in [4] is also structured variational
orientations, and locations [34]. The CWT’s directional approximation, to differentiate that method from ours, we slightly abuse the
selectivity is encoded in six subimages of coefficients oriented notation.
Fig. 8. Image segmentation using DTV : (a) a Data Set V image, (b), (c), and (d) pixel clusters with the same parent at levels ‘¼3; 4; 5, respectively,
white regions represent pixels already grouped by roots at the previous scale; points mark the position of parent nodes; nodes at different levels of
DTV can be interpreted as object parts at various scales.
We also note that the estimated positions of higher-level object into subregions that are not actual object parts. On the
hidden variables are very close to the center of mass of object other hand, if an object is segmented into several “mean-
parts, as well as of whole objects. We compute the error of ingful” subregions, verified by visual inspection, this type of
estimated root-node positions r^ as the distance from the actual error is not included. The averaged pixel error for Gibbs
center of mass r CM of hand-labeled objects, derr ¼jj^ r rr CM jj. sampling is 6 percent for both type I and II images, while for
The averaged error values over the given test images for VA MFVA, it is 18 percent and 12 percent for the Data Sets I and II,
and SVA are reported in Table 1. We observe that the error respectively. With regard to object detection error, Gibbs
significantly decreases as the image size increases because in sampling yields no error in the Data Set I, and wrongly
summing node positions over parent and children nodes, as in
segments seven objects in the Data Set II (8 percent). The object
(17) and (18), more statistically significant information
detection error for MFVAis 1 undetected object in the Data Set I
contributes to the position estimates. For example, dIV err ¼6:18
(2 percent), and 13 merged/undetected objects in the Data
for SVA is only 4.8 percent of the Data Set IV image size,
whereas dIII Set II (16 percent). With the increase in image size, Gibbs
err ¼4:23 for SVA is 6.6 percent of the Data Set III
image size. sampling becomes infeasible and MFVA exhibits very poor
Typical results of the DT-based image segmentation for
Data Sets III, IV, and V are shown in Figs. 9, 10, and 11. In
Table 2, we report the percentage of erroneously grouped
pixels, and, in Table 3, we report the object detection error,
when compared to ground truth, averaged over each data set.
For estimating the object detection error, the following
instances are counted as error: 1) merging two distinct objects
into one (i.e., failure to detect an object) and 2) segmenting an
Fig. 10. Image segmentation by DTV 0 learned using SVA for Data Set IV
images, (b) negative example, where, due to challenging similarity in
appearance and occlusion, the DT merges two distinct objects into one;
all pixels labeled with the same color are descendants of a unique root.
Fig. 9. Image segmentation by DTV 0 learned using SVA for Data Set III
images; all pixels labeled with the same color are descendants of a
unique root.
TABLE 1
Root-Node Distance Error
Fig. 11. Image segmentation by DTs learned using SVA for Data Set V:
(a) DTV 0 , (b) and (c) DTV , all pixels labeled with the same color are
descendants of a unique root.
TABLE 2
Percent of Erroneously Grouped Pixels
TABLE 3
Object Detection Error
performance; therefore, Tables 2 and 3 report results only for

VA and SVA. Overall, we observe that SVA outperforms other
inference algorithms for image segmentation using DTs.
Interestingly, the segmentation results for DTV models are
only slightly better than for DTV 0 models.
It should be emphasized that our experiments are carried
out in an unsupervised setting, and, as such, cannot be
equitably evaluated against supervised object recognition
results reported in the literature. Take, for instance, the
segmentation in Fig. 10b, where two overlapping, similar-
looking objects are merged into one DT subtree. Given the
absence of prior knowledge, the ground-truth segmentation
for this image is arbitrary, and the resulting segmentation
ambiguous; nevertheless, we still count it toward the object-
detection error percentages in Table 3.
Next, in Figs. 12a and 12b, we illustrate the convergence
rate of computing P ðZ; X; R0 jY ; R0 ÞQðZ; X; R0 Þ for the four
inference algorithms, averaged over the given data sets.
Numbers above bars represent the mean number of iteration
steps it takes for a given algorithm to converge. For all the
approaches, we consider the algorithm converged when
jQðZ; X; R0 ; tþ1ÞQðZ; X; R0 ; tÞj=QðZ; X; R0 ; tÞ<" for N" con-
secutive iteration steps t, where N" ¼10 and "¼0:01 (see Fig. 2,
Step (11)). Overall, SVA converges in the fewest number of
iterations. The average number of iterations for SVA on Data
Set V is 28 and 24 for DTV 0 and DTV , respectively, which takes
Fig. 12. Comparison of inference algorithms: (a) and (b) convergence
approximately 6s and 5s on a Dual 2 GHz PowerPC G5. Here, rate averaged over the given data sets. (c) and (d) Percentage increase
the processing time also includes image-feature extraction. in log QðZ; X; R0 Þ computed in SVA over log QðZ; X; R0 Þ computed in
For the same experiments, in Figs. 12c and 12d, we report Gibbs sampling, MFVA, VA, respectively.
the percentage increase in log QðZ; X; R0 Þ computed using
our SVA over log QðZ; X; R0 Þ obtained by Gibbs sampling, 6.2 Tests of Model Matching
MFVA, and VA, respectively. We note that SVA results in We test our approach to model matching for object
larger approximate posteriors than MFVA and VA. The larger recognition in unsupervised settings for Data Sets III, IV,
log QðZ; X; R0 Þ means that the assumed form of the approx- and V. As explained in Section 1, we consider a constrained
imate posterior distribution QðZ; X; R0 Þ¼QðZÞQðXjZÞQ type of object recognition, where we detect a particular
ðR0 jZÞ more accurately represents underlying stochastic appearance of a reference object in a test image. Given the
processes in the image than VA and MFVA approximations. assumption that each test image cannot contain multiple
We note that SVA yields approximately the same QðZ; X; R0 Þ appearances of the same object, we conduct unsupervised
as Gibbs sampling. object recognition as follows. One at a time, every image in
Fig. 13. Image matching for Data Set V images in (a) and (b). (c) Computation of Prr and Ptt for a sample of two segmented image regions in the reference
and test images, respectively, (d) and (e) computation of Ptr ðr Þ and Prt ðt Þ when T r and T t represent the same object, (f) computation of Ptr ðr Þ and
Prt ðt Þ when T r and T t represent different objects, (g) and (h) 3D plots of tr ðt ; r Þ for t ; r 2 f=4; =4g, where ðr ; t ; Þ marks the maximum.
(a) Reference image. (b) Test image. (c) Prr (left), Ptt (right). (d) Ptr ðr Þ (top) Prt ðt Þ (bottom). (e) Ptr ðr Þ (top) Prt ðt Þ (bottom). (f) Ptr ðr Þ (top) Prt ðt Þ
(bottom). (g) tr ðt ; r Þ plot for the ðT t ; T r Þ pair in (d). (h) tr ðt ; r Þ plot for the ðT t ; T r Þ pair in (e). (i) tr ðt ; r Þ plot for the ðT t ; T r Þ pair in (f).
a given data set is chosen as the reference, while the rest of and reference images;and xit and xir are image-class
the images are then marked as test images. After DT-based indicator random variables in T t and T r , respectively. In
image segmentation of the reference and test images, for a light of the discussion in Section 5 on the necessity to
given T r , we search for the maximum tr ðt ; r Þ over all normalize differently sized models, we also carry out the
possible image regions under T t and rotational alignments object recognition experiments using the normalized sym-
ðt ; r Þ, as illustrated in Fig. 13. Note that t and r should metric KL distance, d~tr , given by
be related by t ¼r , provided the compared objects are
identical. Thus, the test image region under T t , for which 1 Cr X P ðyy t ðiÞ jxit Þ 1 Ct X P ðyy r ðiÞ jxir Þ
d~tr log þ log ;
tr ðt ; r Þ is maximum and jt þr j ", where "¼=16 is a Nt Ct i2T P ðyy t ðiÞ jxir Þ Nr Cr i2T P ðyy r ðiÞ jxit Þ
t r
rotation increment in the alignment tests, is recognized as
the reference image region under T r . From Figs. 13g and where Ct and Cr are the cardinality of T t and T r ,
13h, we observe that tr is a “peaky” function, reaching its respectively, as defined in Section 5. For both distance
maximum when the same objects are matched. measures, the image region under T t , for which dtr , or d~tr , is
To compare our approach with methods which use closest to zero compared to the rest of the segmented
probability log-ratios for image matching, we repeat the regions in the test image, is recognized as T r .
aforementioned set of experiments, but now using the In Table 4, we summarize our object recognition results
symmetric KL distance dtr , specified in [10] as using both DTV 0 and DTV , inferred using SVA and VA, for tr ,
dtr , and d~tr . We define the recognition rate as the percentage of
1 X P ðyy t ðiÞ jxit Þ 1 X P ðyy r ðiÞ jxir Þ correctly detected appearances of an object in the total
dtr log þ log ; number of actual appearances of that object in the test
Nt i2T P ðyy t ðiÞ jxir Þ Nr i2T P ðyy r ðiÞ jxit Þ
t r
images. The false detection rate is defined as the percentage of
where Nt and Nr are the number of observables in T t and incorrectly detected appearances of an object in the total
T r , respectively; y t ðiÞ and y r ðiÞ are observables in the test number of the detected appearances of that object in the test
TABLE 4
their parts, and, hence, that DTs perform well in segmenting
Recognition Rate (R) and False Detection Rate (F) images into plausible pixel clusters. Furthermore, we
reported results on unsupervised object recognition, demon-
strating the viability of the proposed similarity measure for
matching statistical models.
This paper opens a number of research issues that need
further investigation. First among these is the optimal
alignment procedure required for comparing dynamic-
structure models. Possible choices of the alignment proce-
dure, ultimately, should look to enhance the discriminative
power of the similarity measure—that is, how well the
similarity measure distinguishes like objects (i.e., DT
models) from dissimilar objects. Second, our experiments
show (see Figs. 13g and 13h) that the proposed similarity
measure, , is a “peaky” function, which suggests that can
be successfully used in supervised settings. It is likely that
using a suitable (learned) threshold for the classifier could
improve object recognition results beyond those reported in
Table 4. Next, we currently assume that node positions in
DTs are uncorrelated (i.e. diagonal covariances) along “x”
and “y” image coordinates, in order to facilitate the
computation of (18). Often, this may not be an appropriate
images. Here, ground truth is established by visual inspec- assumption, and we will further examine how to modify
tion. Recognition and false detection rates are averaged over our inference algorithm to accommodate dependencies
all segmented regions and all images. Overall, we observe between coordinates. Finally, although DTV type of models
significantly better object recognition performance when tr outperforms DTV 0 in every reported experiment, this may
is used as a model-matching measure compared to dtr and d~tr .
have been the result of more expressive texture extraction
Again, DTV models outperform DTV 0 models.
(i.e., the complex wavelet transform) used in DTV than that
used in DTV 0 . Further research is necessary for establishing
7 CONCLUSION when and why one model is better than the other.
In this paper, we presented a probabilistic framework for
image segmentation and subsequent matching of segmented APPENDIX
regions, when only weak or no prior knowledge is available. DERIVATION OF STRUCTURED VARIATIONAL
We proposed and demonstrated the use of Dynamic Trees
APPROXIMATION
(DTs) to address these problems. More precisely, we
formulated image segmentation as inference of model poster- A.1 Notation
ior distributions, given an image, and subsequent Bayesian
. V ¼fV 0 ; V 0 g: set of all nodes; V 0 : set of leaf-level
estimation of DT structure. Beyond this novel application of nodes;
DTs, we built on previous DT work to formulate a novel DT . y ðiÞ : observable random vector at location ðiÞ;
architecture that introduces multilayered observable data into Y ¼4 fyy ðiÞ j8i2V g;
the model. For the proposed model, we derived a novel . zij : indicator random variable (RV) denoting a
Structured Variational Approximation (SVA) inference algo- connection between nodes i and j; Z ¼4 fzij j8i; j2V g;
rithm that removes independence assumptions between node ij : true probability of i being the child of j; ij :
positions and model structure, as was done in prior work. approximate probability of i being the child of j given
Furthermore, we formulated image matching as similarity Y and R0 ;
analysis between two DTs representing examined image . M: set of image classes; xki : indicator RV denoting i is
regions. To conduct this analysis, we specified a novel labeled as class k2M; X ¼4 fxki j8i2V ; k2Mg; Pijkl : true
similarity measure between two statistical models, which conditional probability tables; Qkl ij : approximate
we find more suitable for unsupervised settings than conditional probability tables given Y and R0 ; mki :
measures based on probability log-ratios. We proposed one approximate posterior that node i is labeled as image
possible alignment procedure for comparing two DTs, and class k given Y and R0 ;
developed criteria based on the resulting similarity measure . r i : position of node i; R0 ¼4 frr i j8i2V 0 g; R0 ¼4 frr i j8i2V 0 g;
for ultimate unsupervised object recognition. ij and d ij : true diagonal covariance and mean of a
Through a set of detailed experiments, we demonstrated relative child-parent displacement ðrr i r j Þ; ij and
the significantly improved properties of the SVA algor- ij : approximate diagonal covariance and mean of r i ,
ithm—both in terms of substantially faster convergence rates, given that j is the parent of i and given Y and R0 ;
and larger approximate posteriors for the inferred models— . hi: expectation value with respect to QðZ; X; R0 Þ; :
when compared with competing inference algorithms. Our normalization constant; paðiÞ: candidate parents of i;
results show that DTs are capable of capturing important cðiÞ: children of i; dðiÞ: all descendants down the
component-subcomponent relationships among objects and subtree of i.
A.2 Preliminaries hðrri rr j dd ij Þðrri rr j ddij ÞT iQ1 ¼

Computation of KLðQkP Þ, given by (13), is intractable, ¼ hðrr i
ij þ
ij rr j ddij Þðrr i ij rrj ddij ÞT iQ1 ¼
ij þ
0 0
because it depends on P ðZ; X; R jY ; R Þ. Note, though, that ¼ ij þ2hðrr i ij ÞT iQ1
ij Þðrr j ddij þ
0 0 0
QðZ; X; R Þ does not depend on P ðY jR Þ and P ðR Þ.
þ hðrr j þddij ij ÞT iQ1 ¼
ij Þðrr j þddij
0 0
Consequently, by subtracting log P ðY jR Þ and log P ðR Þ
¼ ij þ2hðrr i
ij Þð ij ÞT iQ1
jp rrj ddij þ
from KLðQkP Þ, we obtain a tractable criterion JðQ; P Þ,
þ hð jp þrr j þddij
ij Þð jp þrr j þddij ij ÞT iQ1 ¼
whose minimization with respect to QðZ; X; R0 Þ yields the X ð36Þ
¼ ij þ2 jp hðrr i
ij Þð jp rr j ÞT iQðrr i ;rrj jzij ;zjp Þ
same solution as minimization of KLðQkP Þ: p2V 0
X
4
JðQ; P Þ ¼ KLðQkP Þ log P ðY jR0 Þ log P ðR0 Þ; þ jp hðrr j jp ÞT iQðrrj jzjp ¼1Þ
jp Þðrr j
Z X p2V 0
QðZ; X; R0 Þ ð31Þ X
¼ dR0 QðZ; X; R0 Þ log : þ jp ð
ij
jp ddij Þð jp dd ij ÞT ¼
ij
R0 Z;X
P ðZ; X; R; Y Þ
p2V 0
X
JðQ; P Þ is known alternatively as Helmholtz free energy, ¼ ij þ jp 2ijp þ jp þ Mijp ;
p2V 0
Gibbs free energy, or free energy [17], [19]. By minimizing
JðQ; P Þ, we seek to compute parameters of approximate where the definitions of auxiliary matrices ijp and Mijp are
given in the second to the last derivation step above, and
distributions QðZÞ, QðXjZÞ, and QðR0 jZÞ. It is convenient,
i-j-p is a child-parent-grandparent triad. It follows from (35)
first, to reformulate (31) as JðQ; P Þ¼LZ þ LX þ LR , where and (36) that
P
LZ ¼4 Z QðZÞ log QðZÞ
P ðZÞ ,
1X jij j
X LR ¼ ij log 2 þ Trf1
ij ij g
4 QðXjZÞ 2 i;j2V 0 jij j
LX ¼ QðZÞQðXjZÞ log ;
P ðXjZÞP ðY jX; Þ ! ð37Þ
Z;X X
1
and þ jp Trfij ð2ijp þjp þMijp Þg :
p2V 0
Z X 0
QðR jZÞ
LR ¼4 dR00 QðZÞQðR0 jZÞ log : In (37), the last expression left to compute is Trf1
ij ijp g.
R0 P ðRjZÞ
Z For this purpose, we apply the Cauchy-Schwartz inequality
To derive expressions for LZ , LX , and LR , we first observe: as follows:

D E 1 1
zij ¼ ij ; xki ¼ mki ; xki xlj ¼ Qkl l
ij mj ; Trf1
ij ijp g ¼ Trfij ij hðr
2 2
r i jp rr j ÞT ig;
ij Þð
X X ð32Þ 1 1
) mki ¼ ij Qkl l
ij mj ; 8i 2 V ; 8k 2 M; ¼ Trfhij2 ðrr i jp rrj ÞT ij2 ig;
ij Þð ð38Þ
j2V l2M 1 1
Trf1 1
ij ij g Trfij jp g ;
2 2
where hi denotes expectation with respect to QðZ; X; R0 Þ.

where we used the fact that the s and s are diagonal
Consequently, from (1), (9), and (32), we have
matrices. Although the Cauchy-Schwartz inequality, in
X
LZ ¼ ij log½ij =ij : ð33Þ general, does not yield a tight upper bound, in our case it
ij2V appears reasonable to assume that variables r i and r j (i.e.,
Next, from (4), (10), and (32), we derive positions of object parts at different scales) are uncorrelated.
X X Substituting (38) into (37), we finally derive the upper bound
LX ¼ ij Qkl l kl kl
ij mj log½Qij =Pij for LR as
i;j2V k;l2M
XX ð34Þ
mki log P ðyy ðiÞ jxki ; ðiÞÞ: 1 X jij j
LR ij log 2 þ Trf1
ij ij g
i2V k2M 2 i;j2V 0 jij j
X
Note that for DTV 0 , V in the second term is substituted with þ jp Trf1ij jp þ Mijp g ð39Þ
p2V 0
V 0 . Finally, from (3), (11), and (32), we get !
X
1X n o þ2 jp Trf1
1
1 1
jij j ij ij g Trfij jp g :
2 2
LR ¼ ij log Tr 1 ij ij p2V 0

2 i;j2V 0 jij j
n o ð35Þ
1 T A.3 Optimization of QðXjZÞ
þ Tr ij hðrri rr j dd ij Þðrri rr j ddij Þ i ;
QðXjZÞ is fully characterized by parameters Qkl ij . From the
definition of LX , we have @JðQ; P Þ=@Qkl
ij ¼@LX =@Q kl
ij . Due to
where hi denotes expectation with respect to Q1 ¼4 Qðrri jzij ¼1Þ
parent-child dependencies in (32), it is necessary to
Qðrr j jzjp ¼1ÞQðzjp ¼1Þ. Let us now consider the expectation in iteratively differentiate LX with respect to Qklij down the
the last term: subtree of node i. For this purpose, we introduce three
auxiliary terms Fij , Gi , and ki , which facilitate computation Then, from @LR =@ij ¼0, it is straightforward to compute
of @LX =@Qklij , as shown below: the update equation for ij given by (17).
4
X
Fij ¼ ij Qkl l kl kl
ij mj log½Qij =Pij ; A.5 Optimization of QðZÞ
k;l2M
( ) QðZÞ is fully characterized by the parameters ij . From the
4
X X definitions of LZ , LX , and LR we see that @JðQÞ=@ij ¼
Gi ¼ Fdc mki log P ðyy ðiÞ jxki ; ðiÞÞ ;
d;c2dðiÞ k2M V0 ð40Þ @ðLX þLR þLZ Þ=@ij . Similar to the optimization of Qkl
ij , we
4 need to iteratively differentiate LX as follows:
ki ¼ expð@Gi =@mki Þ;
@LX @Fij @Gi @mki @LX @Fij X @Gi @mki
) ¼ þ ; ¼ þ ; ð44Þ
@Qij @Qkl
kl
@mki @Qkl @ij @ij k2M @mki @ij
ij ij
where fgV 0 denotes that the term is included in the where Fij and Gi are defined as in (40). By substituting the
expression for Gi if i is a leaf node for DTV 0 . For DTV , the P
derivatives @Gi =@mki ¼ log ki , and @Fij =@ij ¼ k;l2M
term in braces fg is always included. This allows us to derive P
Qkl l kl kl k
ij mj log½Qij =Pij , and @mi =@ij ¼
kl l
l2M Qij mj into (44),
update equations for both models simultaneously. After
finding the derivatives @Fij =@Qkl l kl kl we obtain
ij ¼ ij mj ðlog½Qij =Pij þ1Þ
and @mki =@Qkl ¼ m
ij j
l
, we arrive at !
ij
@LX X Qkl
ij
kl l k
¼ Qij mj log kl log i ;
@LX =@Qkl l kl kl k
ij ¼ ij mj ðlog½Qij =Pij þ 1 log i Þ: ð41Þ @ij k;l2M
Pij
!
Finally, optimizing (41) with the Lagrange multiplier that X X ð45Þ
P ¼ kl l
Qij mj log al a
Pij i ;
accounts for the constraint k2M Qkl ij ¼1 yields the desired k;l2M a2M
update equation: Qkl kl k
ij ¼ Pij i , introduced in (14).
¼ Aij :
To compute ki ¼ expð@Gi =@mki Þ, we first find
! Next, we differentiate LR , given by (39), with respect to
@Gi X @Fci X @Gc @ma
c
¼ þ ij as
@mki c2cðiÞ @mki a2M @mac @mki
@LR 1 jij j 1
flog P ðyy ðiÞ jxki ; ðiÞÞgV 0 ; ¼ log 1 þ Trf1 ij ij g
ð42Þ @ij 2 jij j 2
X X Qak @Gc
¼ ci Qak log ci
þ 1X
ci
Pciak @mac þ jp Trf1 ij ðjp þMijp Þg
c2cðiÞ a2M 2 p2V 0
flog P ðyy ðiÞ jxki ; ðiÞÞgV 0 ; 1 1

þ2Trf1 2 1
ij ij g Trfij tu g
2
ð46Þ
and then substitute Qkl
ij , given by (14), into (42), which 1X
gives (15). þ ci Trf1 ci ij þMcij g
2 c2V 0

A.4 Optimization of QðR0 jZÞ þ2Trf1
1
1 1
ci ci g Trfci ij g ;

2 2
QðR0 jZÞ is fully characterized by parameters ij and ij .

From the definition of LR , we observe that @JðQÞ=@ij ¼ ¼ Bij 1;
@LR =@ij and @JðQÞ=@ ij ¼@LR =@ i . Since the s are
where indexes c, j, and p denote children, parents, and
positive definite, from (39), it follows that
grandparents of node i, respectively. Further, from (33),
@LR 1 X we get
¼ ij Trf1 1
ij g þ Trfij g þ ci Trf1
ci g
@ij 2 c2V 0 @LZ =@ij ¼ 1 þ log ij =ij : ð47Þ
X 12 1
þ jp Trf1 1 1
ij gTrfij ij g Trfij jp g
2
Finally, substituting (45), (46), and (47) into @JðQÞ=@ij ¼0
p2V 0
! and adding the Lagrange multiplier to account for the
X 2 1 1 P
þ ci Trf1 1 1
ci gTrfci ij g Trfci ci g
2 : constraint j2V 0 ij ¼1, we solve for the update equation of
c2V 0 ij given by (19).
From @LR =@ij ¼ 0, it is straightforward to derive the
update equation for ij given by (18). ACKNOWLEDGMENTS
Next, to optimize the ij parameters, from (36) and (39),
This work was supported in part by grants from NASA
we compute
Langley, the US Air Force Office of Sponsored Research, the
@LR Xh
¼ ij jp 1 US Air Force, and the US Special Operations Command.
ij ð
ij jp ddjp Þ
@
ij c;p2V 0 The authors would like to thank anonymous reviewers
i ð43Þ
ci ij 1 whose thoughtful comments and suggestions improved the
ci ð ci
ij ddij Þ :
quality of the paper.
REFERENCES [25] A. Montanvert, P. Meer, and A. Rosenfield, “Hierarchical Image

Analysis Using Irregular Tessellations,” IEEE Trans. Pattern
[1] N.J. Adams, A.J. Storkey, Z. Ghahramani, and C.K. I. Williams, Analysis and Machine Intelligence, vol. 13, no. 4, pp. 307-316, Apr.
“MFDTs: Mean Field Dynamic Trees,” Proc. 15th Int’l Conf. Pattern 1991.
Recognition, vol. 3, pp. 147-150, 2002. [26] M. Aitkin and D.B. Rubin, “Estimation and Hypothesis Testing in
[2] N.J. Adams, “Dynamic Trees: A Hierarchical Probabilistic Finite Mixture Models,” J. Royal Statisitcal Soc. B, vol. 47, no. 1,
Approach to Image Modeling,” PhD dissertation, Division of pp. 67-75, 1985.
Informatics, Univ. of Edinburgh, Edinburgh, U.K., 2001. [27] D.J.C. MacKay, “Introduction to Monte Carlo Methods,” Learning
[3] A.J. Storkey, “Dynamic Trees: A Structured Variational Method in Graphical Models (Adaptive Computation and Machine Learning),
Giving Efficient Propagation Rules,” Proc. 16th Conf. Uncertainty in M.I. Jordan, ed., pp. 175-204, Cambridge, Mass.: MIT Press, 1999.
Artificial Intelligence, pp. 566-573, 2000. [28] R.M. Neal, “Probabilistic Inference Using Markov Chain Monte
[4] A.J. Storkey and C.K.I. Williams, “Image Modeling with Position- Carlo Methods,” Technical Report CRG-TR-93-1, Connectionist
Encoding Dynamic Trees,” IEEE Trans. Pattern Analysis and Research Group, Univ. of Toronto, 1993.
Machine Intelligence, vol. 25, no. 7, pp. 859-871, July 2003. [29] D.J.C. MacKay, Information Theory, Inference, and Learning Algo-
[5] X. Feng, C.K.I. Williams, and S.N. Felderhof, “Combining Belief rithms, chapter 29, pp. 357-386, Cambridge, U.K.: Cambridge
Networks and Neural Networks for Scene Segmentation,” IEEE Univ. Press, 2003.
Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 4, [30] T.M. Cover and J.A. Thomas, Elements of Information Theory. New
pp. 467-483, Apr. 2002. York: Wiley Interscience Press, 1991.
[6] Z. Ying and D. Castanon, “Partially Occluded Object Recognition [31] M. Mignotte, C. Collet, P. Perez, and P. Bouthemy, “Sonar Image
Using Statistical Models,” Int’l J. Computer Vision, vol. 49, no. 1, Segmentation Using an Unsupervised Hierarchical MRF Model,”
pp. 57-78, 2002. IEEE Trans. Image Processing, vol. 9, no. 7, pp. 1216-1231, 2000.
[7] H. Schneiderman and T. Kanade, “Object Detection Using the [32] C. D’Elia, G. Poggi, and G. Scarpa, “A Tree-Structured Markov
Statistics of Parts,” Int’l J. Computer Vision, vol. 56, no. 3, pp. 151- Random Field Model for Bayesian Image Segmentation,” IEEE
177, 2004. Trans. Image Processing, vol. 12, no. 10, pp. 1259-1273, 2003.
[8] B. Moghaddam, “Principal Manifolds and Probabilistic Subspaces [33] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A Database of
for Visual Recognition,” IEEE Trans. Pattern Analysis and Machine Human Segmented Natural Images and Its Application to
Intelligence, vol. 24, no. 6, pp. 780-788, June 2002. Evaluating Segmentation Algorithms and Measuring Ecological
[9] G. Hermosillo, C. Chefd’Hotel, and O. Faugeras, “Variational Statistics,” Proc. Eighth Int’l Conf. Computer Vision, vol. 2, pp. 416-
Methods for Multimodal Image Matching,” Int’l J. Computer 423, 2001.
Vision, vol. 50, no. 3, pp. 329-343, 2002. [34] N.G. Kingsbury, “Complex Wavelets for Shift Invariant Analysis
[10] H. Greenspan, J. Goldberger, and L. Ridel, “A Continuous and Filtering of Signals,” J. Applied Computer Harmonic Analysis,
Probabilistic Framework for Image Matching,” Computer Vision vol. 10, no. 3, pp. 234-253, 2001.
and Image Understanding, vol. 84, no. 3, pp. 384-406, 2001. [35] T. Lindeberg, “Scale-Space Theory: A Basic Tool for Analysing
[11] D. DeMenthon, D. Doermann, and M.V. Stuckelberg, “Image Structures at Different Scales,” J. Applied Statistics, vol. 21, no. 2,
Distance Using Hidden Markov Models,” Proc. 15th Int’l Conf. pp. 224-270, 1994.
Pattern Recognition, vol. 3, pp. 143-146, 2000. [36] D.G. Lowe, “Distinctive Image Features from Scale-Invariant
[12] B.H. Juang and L.R. Rabiner, “A Probabilistic Distance Measure Keypoints,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
for Hidden Markov Models,” AT&T Technical J., vol. 64, no. 2,
pp. 391-408, 1985. Sinisa Todorovic received the BS degree in
[13] M.C. Nechyba and Y. Xu, “Stochastic Similarity for Validating electrical engineering from the University of
Human Control Strategy Models,” IEEE Trans. Robotic Automation, Belgrade, Serbia, in 1994. He received the MS
vol. 14, no. 3, pp. 437-451, 1998. and PhD degrees at the University of Florida, in
[14] H. Cheng and C.A. Bouman, “Multiscale Bayesian Segmentation 2002, and 2005, respectively. From 1994 to
Using a Trainable Context Model,” IEEE Trans. Image Processing, 2001, he worked as a software engineer in the
vol. 10, no. 4, pp. 511-525, 2001. communications industry. In 2001, he enrolled in
[15] H. Choi and R.G. Baraniuk, “Multiscale Image Segmentation the electrical and computer engineering gradu-
Using Wavelet-Domain Hidden Markov Models,” IEEE Trans. ate program at the University of Florida, Gaines-
Image Processing, vol. 10, no. 9, pp. 1309-1321, 2001. ville. There, as a member of the Center for Micro
[16] J.-M. Laferté, P. Pérez, and F. Heitz, “Discrete Markov Image Air Vehicle Research, he conducted research aimed at enabling
Modeling and Inference on the Quadtree,” IEEE Trans. Image intelligent mission profiles for small aircraft. His primary research
Processing, vol. 9, no. 3, pp. 390-404, 2000. interests encompass statistical image modeling, machine learning, and
[17] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul, “An multiresolution signal processing. He is a student member of the IEEE.
Introduction to Variational Methods for Graphical Models,”
Machine Learning, vol. 37, no. 2, pp. 183-233, 1999. Michael C. Nechyba received the BS degree in
[18] T.S. Jaakkola, “Tutorial on Variational Approximation Methods,” electrical engineering from the University of
Advanced Mean Field Methods, M. Opper and D. Saad, eds., pp. 129- Florida in 1992, and the PhD degree in robotics
161, Cambridge, Mass.: MIT Press, 2000. from Carnegie Mellon University in 1998. Upon
[19] B.J. Frey and N. Jojic, “Advances in Algorithms for Inference and completion of his thesis, he joined the Depart-
Learning in Complex Probability Models for Vision,” IEEE Trans. ment of Electrical and Computer Engineering at
Pattern Analysis and Machine Intelligence, 2004. the University of Florida as assistant professor in
[20] M.K. Schneider, P.W. Fieguth, W.C. Karl, and A.S. Willsky, August 1998. There, he served as an associate
“Multiscale Methods for the Segmentation and Reconstruction of director of the Machine Intelligence Laboratory,
Signals and Images,” IEEE Trans. Image Processing, vol. 9, no. 3, conducting research in two primary areas:
pp. 456-468, 2000. 1) vision-based and sensor-based autonomy for Micro Air Vehicles
[21] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of (MAVs) and 2) direct brain-machine interfaces (BMIs). In October 2004,
Plausible Inference, chapter 4. pp. 143-236, San Mateo, Calif.: he resigned his position at the University of Florida to cofound Pittsburgh
Morgan Kaufamnn, 1988. Pattern Recognition. Pittsburgh Pattern Recognition seeks to commer-
[22] W.W. Irving, P.W. Fieguth, and A.S. Willsky, “An Overlapping cialize face/object recognition technology for various applications in
Tree Approach to Multiscale Stochastic Modeling and Estima- digital photography, surveillance, and homeland security, and is
tion,” IEEE Trans. Image Processing, vol. 6, no. 11, pp. 1517-1529, currently funded through the US intelligence community. He has
1997. published approximately 30 journal and refereed conference papers.
[23] J. Li, R.M. Gray, and R.A. Olshen, “Multiresolution Image He is a member of the IEEE and the IEEE Computer Society.
Classification by Hierarchical Modeling with Two-Dimensional
Hidden Markov Models,” IEEE Trans. Information Theory, vol. 46,
no. 5, pp. 1826-1841, 2000.
[24] W.K. Konen, T. Maurer, and C. von der Malsburg, “A Fast
Dynamic Link Matching Algorithm for Invariant Pattern Recogni-
tion,” Neural Networks, vol. 7, no. 6-7, pp. 1019-1030, 1994.

Image

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Image

Diunggah oleh

Hak Cipta:

Format Tersedia

1762 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO.

11, NOVEMBER 2005

Dynamic Trees for Unsupervised Segmentation

W E present a probabilistic framework for image segmen-

may generate pixel values that are inconsistent with

1. We introduce multilayered data into the model, as 2 RELATED WORK

where parameters ij correspond to the ij connection

4.2.1 Optimization of QðXjZÞ 4.2.3 Optimization of QðZÞ

QðR0 jZÞ is fully characterized by parameters ij and ij . The Z X X

nodes V must be included in the tree structure, even if they

ð8i2V ‘ ÞðZ^i 6¼0Þ j^¼ arg max ij ; ð26Þ

ð8i2V Þ x^i ¼ arg max mki ; ð27Þ

where the approximate posterior probability mki that image

where ij and ij are given by (17) and (19), respectively.

4.4 Specification of Model Parameters

performance; therefore, Tables 2 and 3 report results only for

A.2 Preliminaries hðrri rr j dd ij Þðrri rr j ddij ÞT iQ1 ¼

where hi denotes expectation with respect to QðZ; X; R0 Þ.

LR ¼ ij log Tr 1 ij ij p2V 0

ci ci g Trfci ij g ;

QðR0 jZÞ is fully characterized by parameters ij and ij .

REFERENCES [25] A. Montanvert, P. Meer, and A. Rosenfield, “Hierarchical Image

Anda mungkin juga menyukai

Image

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Image

Diunggah oleh

Hak Cipta:

Format Tersedia

1762 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO.

11, NOVEMBER 2005

Dynamic Trees for Unsupervised Segmentation

W E present a probabilistic framework for image segmen-

may generate pixel values that are inconsistent with

1. We introduce multilayered data into the model, as 2 RELATED WORK

where parameters ij correspond to the ij connection

4.2.1 Optimization of QðXjZÞ 4.2.3 Optimization of QðZÞ

QðR0 jZÞ is fully characterized by parameters  ij and ij . The Z X X

nodes V must be included in the tree structure, even if they

ð8i2V ‘ ÞðZ^i 6¼0Þ j^¼ arg max ij ; ð26Þ

ð8i2V Þ x^i ¼ arg max mki ; ð27Þ

where the approximate posterior probability mki that image

where  ij and ij are given by (17) and (19), respectively.

4.4 Specification of Model Parameters

performance; therefore, Tables 2 and 3 report results only for

A.2 Preliminaries hðrri rr j dd ij Þðrri rr j ddij ÞT iQ1 ¼

where hi denotes expectation with respect to QðZ; X; R0 Þ.

LR ¼ ij log Tr 1 ij ij p2V 0

ci ci g Trfci ij g ;

QðR0 jZÞ is fully characterized by parameters  ij and ij .

REFERENCES [25] A. Montanvert, P. Meer, and A. Rosenfield, “Hierarchical Image

Anda mungkin juga menyukai

where parameters ij correspond to the ij connection

QðR0 jZÞ is fully characterized by parameters ij and ij . The Z X X

ð8i2V ‘ ÞðZ^i 6¼0Þ j^¼ arg max ij ; ð26Þ

where ij and ij are given by (17) and (19), respectively.

LR ¼ ij log Tr 1 ij ij p2V 0

QðR0 jZÞ is fully characterized by parameters ij and ij .