Abstract—We present a probabilistic framework—namely, multiscale generative models known as Dynamic Trees (DTs)—for
unsupervised image segmentation and subsequent matching of segmented regions in a given set of images. Beyond these novel
applications of DTs, we propose important additions for this modeling paradigm. First, we introduce a novel DT architecture, where
multilayered observable data are incorporated at all scales of the model. Second, we derive a novel probabilistic inference algorithm for
DTs—Structured Variational Approximation (SVA)—which explicitly accounts for the statistical dependence of node positions and model
structure in the approximate posterior distribution, thereby relaxing poorly justified independence assumptions in previous work. Finally,
we propose a similarity measure for matching dynamic-tree models, representing segmented image regions, across images. Our results
for several data sets show that DTs are capable of capturing important component-subcomponent relationships among objects and their
parts, and that DTs perform well in segmenting images into plausible pixel clusters. We demonstrate the significantly improved properties
of the SVA algorithm—both in terms of substantially faster convergence rates and larger approximate posteriors for the inferred
models—when compared with competing inference algorithms. Furthermore, results on unsupervised object recognition demonstrate the
viability of the proposed similarity measure for matching dynamic-structure statistical models.
Index Terms—Generative models, Bayesian networks, dynamic trees, variational inference, image segmentation, image matching,
object recognition.
1 INTRODUCTION
and the efficiency of their inference algorithms, which is critically important for our purposes.

Despite these attractive properties, TSBNs give rise to blocky segmentations, due to their fixed structure. In the literature, there are several approaches to alleviate this problem. Irving's research group has proposed an overlapping tree model, where distinct nodes correspond to overlapping parts in the image [22]. In [23], the authors have discussed two-dimensional hierarchical models, where nodes are dependent both at any particular layer through a Markov-mesh and across resolutions. In both approaches, segmentation results are superior to those when standard TSBNs are used because the descriptive component of the models is improved at some increased computational cost. Ultimately, however, these approaches do not deal with the source of the "blockiness"—namely, the fixed tree structure of TSBNs.

Aside from the work of Williams et al. [1], [2], [3], [4], to which we will refer throughout the paper, we point out other research concerning dynamic-tree structures. Konen et al. have proposed a flexible neural mechanism for invariant pattern recognition based on correlated neuronal activity and the self-organization of dynamic links in neural networks [24]. Also, Montanvert et al. [25] have explored irregular multiscale tessellations that adapt to image content.

3 DYNAMIC TREES

The model formulation discussed herein is similar to that of the position-encoding dynamic trees in [4], where observables are present only at the lowest model level. However, since we also consider multilayered observable data in DTs, for completeness, we introduce both types of models, emphasizing, where appropriate, the differences.

DTs are directed, acyclic graphs with two disjoint sets of nodes representing hidden and observable random vectors. Graphically, we represent all hidden variables as round-shaped nodes, connected via directed edges indicating Markovian dependencies, while observables are denoted as rectangular-shaped nodes, connected only to their corresponding hidden variables, as depicted in Fig. 1. Below, we first introduce nodes characterized by hidden variables.

There are |V| round-shaped nodes, organized in hierarchical levels, V^ℓ, ℓ = {0, 1, ..., L−1}, where V^0 denotes the leaf level and V̄^0 ≜ V∖V^0. The number of round-shaped nodes is identical to that of the corresponding quad-tree with L levels, such that |V^ℓ| = |V^{ℓ−1}|/4 = ... = |V^0|/4^ℓ. Connections are established under the constraint that a node at level ℓ can become a root, or it can connect only to the nodes at the next level, ℓ+1. The network connectivity is represented by random matrix Z, where entry z_{ij} is an indicator random variable, such that z_{ij} = 1 if i ∈ V^ℓ and j ∈ {0, V^{ℓ+1}} are connected. Z contains an additional zero ("root") column, where entries z_{i0} = 1 if i is a root. Since each node can have only one parent, a realization of Z can have at most one entry equal to 1 in each row. We define the distribution over connectivity as

P(Z) ≜ ∏_{ℓ=0}^{L−1} ∏_{(i,j)∈V^ℓ×{0,V^{ℓ+1}}} γ_{ij}^{z_{ij}},   (1)

where γ_{ij} is the probability of i being the child of j, subject to Σ_{j∈{0,V^{ℓ+1}}} γ_{ij} = 1.

Further, each round-shaped node i (see Fig. 1) is characterized by random position r_i in the image plane. The distribution of r_i is conditioned on the position of its parent r_j as

P(r_i | r_j, z_{ij}=1) ≜ exp(−½ (r_i − r_j − d_{ij})^T Σ_{ij}^{−1} (r_i − r_j − d_{ij})) / (2π |Σ_{ij}|^{1/2}),   (2)

where Σ_{ij} is a diagonal matrix that represents the order of magnitude of object size and parameter d_{ij} is the mean of relative displacement (r_i − r_j). In [4], the authors, for simplicity, set d_{ij} to zero, which favors undesirable positioning of children and parent nodes at the same locations. From our experiments, this may seriously degrade the image-modeling capabilities of DTs, and as such some nonzero relative displacement d_{ij} needs to be accounted for. The joint probability of R ≜ {r_i | ∀i ∈ V} is given by

P(R|Z) ≜ ∏_{i,j∈V} P(r_i | r_j, z_{ij})^{z_{ij}},   (3)

where for roots i we have P(r_i | z_{i0}=1) ≜ exp(−½ (r_i − d_i)^T Σ_i^{−1} (r_i − d_i)) / (2π |Σ_i|^{1/2}). At the leaf level, V^0, we fix node positions R^0 to the locations of the finest-scale observables, and then use P(Z, R_0 | R^0) as the prior over positions and connectivity, where R^0 ≜ {r_i | ∀i ∈ V^0} and R_0 ≜ {r_i | ∀i ∈ V∖V^0}.

Next, each node i is characterized by an image-class label x_i and an image-class indicator random variable x_i^k, such that x_i^k = 1 if x_i = k, where k is a label taking values in the finite set M. Thus, we assume that the set M of unknown image classes is finite. The label k of node i is conditioned on image class l of its parent j and is given by conditional probability tables P_{ij}^{kl}. The joint probability of X ≜ {x_i^k | i ∈ V, k ∈ M} is given by

P(X|Z) = ∏_{i,j∈V} ∏_{k,l∈M} [P_{ij}^{kl}]^{x_i^k x_j^l z_{ij}},   (4)

where for roots i (z_{i0}=1) we use priors P(x_i^k).

Finally, we introduce nodes that are characterized by observable random vectors representing image texture and color cues. Here, we make a distinction between two types of DTs. The model where observables are present only at the leaf level is referred to as DT_{V^0}; the model where observables are present at all levels is referred to as DT_V. To clarify the difference between the two types of nodes in DTs, we index observables with respect to their locations in the data structure (e.g., wavelet dyadic squares), while hidden variables are indexed with respect to a node index in the graph. This generalizes correspondence between hidden and observable random variables of the position-encoding dynamic trees in [4]. We define position μ(i) to be equal to the center of mass of the ith dyadic square at level ℓ in the corresponding quad-tree with L levels:

μ(i) ≜ [(n+0.5)·2^ℓ  (m+0.5)·2^ℓ]^T,  n, m = 1, 2, ...,   (5)

where n and m denote the row and column in the dyadic square at scale ℓ (e.g., for wavelet coefficients). Clearly, other application-dependent definitions of μ(i) are possible. Note that while the r's are random vectors, the μ's are deterministic values fixed at locations where the corresponding observables are recorded in the image. Also, after fixing R^0 to the locations of the finest-scale observables, we have ∀i ∈ V^0, μ(i) = r_i. The definition, given by (5), holds for DT_{V^0}, as well, for ℓ = 0.

TODOROVIC AND NECHYBA: DYNAMIC TREES FOR UNSUPERVISED SEGMENTATION AND MATCHING OF IMAGE REGIONS 1765

For both types of DTs, we assume that observables Y ≜ {y_{μ(i)} | ∀i ∈ V} at locations R^0 and μ_0 ≜ {μ(i) | ∀i ∈ V∖V^0} are conditionally independent given the corresponding x_i^k:

P(Y | X, R^0, μ_0) = ∏_{i∈V} ∏_{k∈M} [P(y_{μ(i)} | x_i^k, μ(i))]^{x_i^k},   (6)

where for DT_{V^0}, V^0 should be substituted for V. The likelihoods P(y_{μ(i)} | x_i^k = 1, μ(i)) are modeled as mixtures of Gaussians: P(y_{μ(i)} | x_i^k = 1, μ(i)) ≜ Σ_{g=1}^{G_k} π_k(g) N(y_{μ(i)}; m_k(g), Σ_k(g)). For large G_k, a Gaussian-mixture density can approximate any probability density [26]. In order to avoid the risk of overfitting the model, we assume that the parameters of the Gaussian mixture are equal for all nodes. The Gaussian-mixture parameters can be grouped in the set Θ_Y ≜ {G_k, {π_k(g), m_k(g), Σ_k(g)}_{g=1}^{G_k} | ∀k ∈ M}.

The joint prior of the model can be written as

P(Z, X, R, Y) = P(Y | X, R^0, μ_0) P(X|Z) P(Z, R_0 | R^0) P(R^0).   (7)

All the parameters of the joint prior can be grouped in the set Θ ≜ {γ_{ij}, d_{ij}, Σ_{ij}, P_{ij}^{kl}, Θ_Y}, ∀i, j ∈ V, ∀k, l ∈ M.

Q(Z, X, R_0) = Q(Z) Q(X|Z) Q(R_0|Z),   (8)

which enforces that both class-indicator variables X and position variables R_0 are statistically dependent on the tree connectivity Z. Since these dependencies are significant in the prior, one should expect them to remain so in the posterior. Therefore, our formulation appears to be more appropriate for approximating the true posterior than the mean-field variational approximation Q(Z, X, R_0) = Q(Z)Q(X)Q(R_0) discussed in [1] and the form Q(Z, X, R_0) = Q(Z)Q(X|Z)Q(R_0) proposed in [4]. We define the approximating distributions as follows:

Q(Z) ≜ ∏_{ℓ=0}^{L−1} ∏_{(i,j)∈V^ℓ×{0,V^{ℓ+1}}} β_{ij}^{z_{ij}},   (9)

Q(X|Z) ≜ ∏_{i,j∈V} ∏_{k,l∈M} [Q_{ij}^{kl}]^{x_i^k x_j^l z_{ij}},   (10)

Q(R_0|Z) ≜ ∏_{i,j∈V∖V^0} [Q(r_i | z_{ij})]^{z_{ij}},   (11)

Q(R_0|Z) ≜ ∏_{i,j∈V∖V^0} [exp(−½ (r_i − ω_{ij})^T Φ_{ij}^{−1} (r_i − ω_{ij})) / (2π |Φ_{ij}|^{1/2})]^{z_{ij}},   (12)
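To make the generative process defined by (1)-(7) concrete, the sketch below samples a toy dynamic tree: a structure Z, class indicators X, and node positions R. Every number in it (the level sizes, the uniform γ_ij, the class CPT, the shared displacement d_ij, and the isotropic covariances) is an invented illustration value, not a learned quantity from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scaffold (hypothetical values throughout): 3 levels, |V^l| = |V^0| / 4**l.
sizes = [16, 4, 1]                       # leaves, level 1, level 2
M = 2                                    # number of image classes
P_cpt = np.array([[0.9, 0.1],            # P^kl: row k = child class, column l = parent class
                  [0.1, 0.9]])

# Sample connectivity Z (eq. (1)): each node picks the root option or a parent
# at the next level up, here with uniform probabilities gamma_ij.
parent = []                              # parent index per node; -1 encodes "root"
for ell in range(len(sizes) - 1):
    n_par = sizes[ell + 1]
    gamma = np.full(n_par + 1, 1.0 / (n_par + 1))
    parent.append(rng.choice(n_par + 1, size=sizes[ell], p=gamma) - 1)
parent.append(np.full(sizes[-1], -1))    # top level: all roots

# Sample class indicators X top-down (eq. (4)); roots use a uniform prior P(x_i^k).
labels = [np.zeros(n, dtype=int) for n in sizes]
labels[-1] = rng.integers(0, M, size=sizes[-1])
for ell in range(len(sizes) - 2, -1, -1):
    for i, j in enumerate(parent[ell]):
        if j < 0:
            labels[ell][i] = rng.integers(0, M)
        else:
            labels[ell][i] = rng.choice(M, p=P_cpt[:, labels[ell + 1][j]])

# Sample positions R top-down (eqs. (2)-(3)); a nonzero displacement d_ij keeps
# children from collapsing onto their parents, as argued in the text.
pos = [np.zeros((n, 2)) for n in sizes]
pos[-1] = rng.normal(loc=[8.0, 8.0], scale=4.0, size=(sizes[-1], 2))
for ell in range(len(sizes) - 2, -1, -1):
    for i, j in enumerate(parent[ell]):
        mean = [8.0, 8.0] if j < 0 else pos[ell + 1][j] + np.array([1.0, 1.0])
        pos[ell][i] = rng.normal(loc=mean, scale=2.0 ** ell)
```

Note that in the model proper the leaf positions R^0 are subsequently clamped to the finest-scale observable locations; here they are sampled only to illustrate (3).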
right-hand side of the update equations are assumed known, as learned in the previous iteration step.

Q_{ij}^{kl} = P_{ij}^{kl} λ_i^k / Σ_{a∈M} P_{ij}^{al} λ_i^a,  ∀i, j ∈ V, ∀k, l ∈ M,   (14)

where the auxiliary parameters λ_i^k are computed as

λ_i^k = { P(y_{μ(i)} | x_i^k, μ(i)),                      i ∈ V^0,
        { ∏_{c∈V} [Σ_{a∈M} P_{ci}^{ak} λ_c^a]^{β_{ci}},   i ∈ V∖V^0,   (15a)

λ_i^k = P(y_{μ(i)} | x_i^k, μ(i)) ∏_{c∈V} [Σ_{a∈M} P_{ci}^{ak} λ_c^a]^{β_{ci}},  ∀i ∈ V,   (15b)

where (15a) is derived for DT_{V^0} and (15b) for DT_V. Since the β_{ci} are nonzero only for child-parent pairs, from (15), we note that the λ's are computed for both models by propagating the messages of the corresponding children nodes upward. Thus, the Q's, given by (14), can be updated by making a single pass up the tree. Also, note that for leaf nodes, i ∈ V^0, the β_{ci} parameters are equal to 0 by definition, yielding λ_i^k = P(y_{μ(i)} | x_i^k, μ(i)) in (15b).

Further, from (9) and (10), we derive the update equation for the approximate posterior probability m_i^k that class k ∈ M is assigned to node i ∈ V, given Y and R^0, as

m_i^k = ∫_{R_0} dR_0 Σ_{Z,X} x_i^k Q(Z, X, R_0) = Σ_{j∈V} Σ_{l∈M} β_{ij} Q_{ij}^{kl} m_j^l.   (16)

Note that the m_i^k can be computed by propagating image-class probabilities in a single pass downward. This upward-downward propagation, specified by (15) and (16), is very reminiscent of belief propagation for TSBNs [5], [21]. For the special case when β_{ij} = 1 only for one parent j, we obtain the standard λ-π rules of Pearl's message passing scheme for TSBNs.

4.2.2 Optimization of Q(R_0|Z)

updated summing over children and grandparents of i and, therefore, should be iterated until convergence.

β_{ij} = γ_{ij} exp(A_{ij} − B_{ij}),  ∀ℓ, ∀(i,j) ∈ V^ℓ×{0, V^{ℓ+1}},   (19)

where A_{ij} represents the influence of observables Y, while B_{ij} represents the contribution of the geometric properties of the network to the connectivity distribution. These are defined in the Appendix.

4.3 Inference Algorithm and Bayesian Estimation

For the given set of parameters Θ characterizing the joint prior, observables Y, and leaf-level node positions R^0, the standard Bayesian estimation of optimal Ẑ, X̂, and R̂_0 requires minimizing the expectation of a cost function C:

(Ẑ, X̂, R̂_0) = arg min_{Z,X,R_0} E{C((Z, X, R_0), (Z*, X*, R*_0)) | Y, R^0, Θ},   (20)

where C(·) penalizes the discrepancy between the estimated configuration (Z, X, R_0) and the true one (Z*, X*, R*_0). We propose the following cost function:

C((Z, X, R_0), (Z*, X*, R*_0)) ≜ Σ_{i,j∈V} [1 − δ(z_{ij} − z*_{ij})] + Σ_{i∈V} Σ_{k∈M} [1 − δ(x_i^k − x_i^{k*})] + Σ_{i∈V∖V^0} [1 − δ(r_i − r*_i)],   (21)

where * indicates true values, and δ(·) is the Kronecker delta function. Using the variational approximation P(Z, X, R_0 | Y, R^0) ≈ Q(Z)Q(X|Z)Q(R_0|Z), from (20) and (21), we derive:

Ẑ = arg min_Z Σ_{Z*} Q(Z*) Σ_{i,j} [1 − δ(z_{ij} − z*_{ij})],   (22)

X̂ = arg min_X Σ_{Z*,X*} Q(Z*) Q(X*|Z*) Σ_{i,k} [1 − δ(x_i^k − x_i^{k*})],   (23)

where Ẑ_i denotes the ith column of the estimated Ẑ and Ẑ_i ≠ 0 indicates that i has already been included in the tree structure when optimizing the previous level V^ℓ.

Next, from (23), the resulting Bayesian estimator of image-class labels, denoted as x̂_i, is
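The upward-downward pass of (14)-(16) can be sketched numerically for a single level of four leaves and two candidate parents. The CPT, the connectivity weights β_ij, and the leaf likelihoods below are invented values for illustration; the CPT is shared across node pairs, so Q in (14) is the same for both candidate parents of a leaf.

```python
import numpy as np

M = 2
P_cpt = np.array([[0.8, 0.2],                 # P^kl_ij: row k = child class, col l = parent class
                  [0.2, 0.8]])
beta = np.array([[0.9, 0.1], [0.9, 0.1],      # beta_ij: leaf i's weight on each candidate parent j
                 [0.1, 0.9], [0.1, 0.9]])
lik = np.array([[0.7, 0.3], [0.6, 0.4],       # leaf likelihoods P(y_mu(i) | x_i^k): lambda at leaves
                [0.2, 0.8], [0.1, 0.9]])

# Upward pass (eq. (15)): each parent multiplies beta-weighted child messages,
# lambda_j^k = prod_c [sum_a P^{ak} lambda_c^a] ** beta_cj.
lam_parent = np.ones((2, M))
for c in range(4):
    msg = P_cpt.T @ lik[c]                    # sum_a P^{ak} lambda_c^a, for each parent class k
    for j in range(2):
        lam_parent[j] *= msg ** beta[c, j]

# Downward pass (eq. (16)): roots normalize their lambda; each child then mixes
# over candidate parents, m_i^k = sum_j sum_l beta_ij Q^{kl}_ij m_j^l.
m_parent = lam_parent / lam_parent.sum(axis=1, keepdims=True)
m_leaf = np.zeros((4, M))
for i in range(4):
    Q = P_cpt * lik[i][:, None]               # numerator of eq. (14)
    Q /= Q.sum(axis=0, keepdims=True)         # normalize over child class k, per parent class l
    for j in range(2):
        m_leaf[i] += beta[i, j] * (Q @ m_parent[j])
```

Each row of m_leaf sums to one, and a leaf whose likelihood favors class 0 and whose dominant parent aggregates like-minded siblings ends up with m_i^0 > m_i^1, mirroring the single up-and-down sweep described above.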
estimate of the true number of classes. Since DTs are generalized quad-trees, our experimentation suggests that this optimization with respect to the quad-tree is justified.

4.5 Implementation Issues

In this section, we list algorithm-related details that are necessary for the experimental results, presented in Section 6, to be reproducible. Other specifications, such as, for example, feature extraction, will be detailed in Section 6.

First, direct implementation of (14) would result in numerical underflow. Therefore, we introduce the following scaling procedure: λ̃_i^k = λ_i^k / S_i, ∀i ∈ V, ∀k ∈ M, where S_i ≜ Σ_{k∈M} λ_i^k. Substituting the scaled λ̃'s into (14), we obtain

Q_{ij}^{kl} = P_{ij}^{kl} λ_i^k / Σ_{a∈M} P_{ij}^{al} λ_i^a = P_{ij}^{kl} λ̃_i^k / Σ_{a∈M} P_{ij}^{al} λ̃_i^a.

In other words, computation of Q_{ij}^{kl} does not change when the scaled λ̃'s are used.

Second, to reduce computational complexity, as is done in [4], we consider, for each node i, only the 7×7 box encompassing parent nodes j that neighbor the parent of the corresponding quad-tree. Consequently, the number of possible children nodes c of i is also limited. Our experiments show that the omitted nodes, either children or parents, contribute negligibly to the update equations. Thus, we limit overall computational cost as the number of nodes increases.

Finally, the convergence criterion of the inner loop, where the β_{ij} and the parameters of Q(R_0|Z) are computed, is controlled by parameter ε. When ε = 0.01, the average number of iteration steps, t_in, in the inner loop, is from 3 to 5, depending on the image size, where the latter is obtained for 128×128 images. The convergence criterion of the outer loop is controlled by parameters N_ε and ε. The aforementioned simplifications, which we employ in practice, may lead to suboptimal solutions of SVA. From our experience, though, the algorithm recovers from unstable stationary points for sufficiently large N_ε. In our experiments, we set N_ε = 10 and ε = 0.01.

After the inference algorithm has converged, we estimate the DT structure for a given image, which consists of DT subtrees representing distinct objects in that image. Having found the DT representation of segmented image regions, we are then in a position to measure the similarity of the detected objects across a given set of images.

5 STOCHASTIC SIMILARITY MEASURE

Recently, similarity measures between two statistical models have been given considerable attention in the literature [8], [9], [10], [11], [12], [13]. To compare a pair of image regions represented by statistical models, in standard probabilistic approaches, one examines the log-ratio log P_1/P_2 or the expected value of the log-ratio ⟨log P_1/P_2⟩, where P_1 and P_2 are some distributions of the two models (e.g., likelihoods, posteriors, cross probabilities). For instance, Moghaddam [8], for the purposes of face recognition, proposes a similarity measure expressed in terms of probabilities of intrapersonal and extrapersonal facial image variations. Hermosillo et al. [9] perform matching of images by computing the variational gradient of a hierarchy of statistical similarity measures. These approaches can be viewed as a form of ML or MAP principles. Also, a symmetric function of the KL divergence, ⟨log P_1/P_2⟩_{P_1} + ⟨log P_2/P_1⟩_{P_2}, has been proposed in [10].

Since the size of objects determines the number of random variables in DTs, the log-ratios may turn out to be equal for two different scenarios: when subtrees represent different objects of similar size and when subtrees represent the same object appearing at different size across the images. The same problem has also been discussed in [12], [13], where hidden Markov models of different-length observation sequences have been compared. Therefore, for each pair of image regions, it is necessary to normalize the probability log-ratios. Usually, this is done by multiplying the log-ratio with a suitable constant α > 0. Since the α's are different for every pair of compared regions, a decision criterion based on the normalized log-ratios becomes nonprobabilistic.¹ Our experiments on image matching show that this normalization improves performance over matching when probability log-ratios are not normalized.

There are several disadvantages of the outlined approach to image matching. We have observed great sensitivity of ⟨α log P_1/P_2⟩ to the specification of the optimal α. Furthermore, matching approaches based on computing ⟨log P_1/P_2⟩ implicitly assume that distributions P_1 and P_2 are reliable representations of underlying image processes, which is justifiable for supervised settings. However, this is not the case for unsupervised settings, where P_1 and P_2 may have huge variations over the examined images, due to the uninformed estimation of the prior distribution, that is, in our case, model parameters Θ.

To mitigate the sensitivity to α, as well as to variations in Θ across images, we propose a novel similarity measure, thereby departing from the outlined approaches where probability log-ratios are used. Thus, in our approach, the impact of unreliable Θ is neutralized by measuring correlation between the cross-likelihoods of the two image regions, which are normalized by the likelihoods of each individual region. Below, we mathematically formulate this idea.

Let Y_t and Y_r denote observables of two DT subtrees T_t and T_r, respectively, where t in the subscript refers to an image region in the test image, and r, to a region in the reference image, as defined in Section 1. Here, T_t refers to the estimated configuration (Ẑ_t, X̂_t, R̂_{0t}) and the parameter set Θ_t for the test image. Similarly, T_r refers to the estimated configuration (Ẑ_r, X̂_r, R̂_{0r}) and the parameter set Θ_r for the reference image. As discussed above, we normalize the likelihood P(Y | X, μ, Θ), given by (6), as P_tr ≜ P(Y_t | X̂_r, Θ_r, μ_r)^{1/C_r}, where C_r denotes the cardinality of the model T_r. Since the Y's at coarser resolutions affect more pixels than at finer scales, for DT_V, we compute the cardinality as C ≜ Σ_{ℓ=0}^{L−1} Σ_{i∈T} K_i^ℓ, where K_i^ℓ denotes the size of the kernel used to compute observables at level ℓ; for example, K_i^ℓ = 2^ℓ for wavelet coefficients. For DT_{V^0}, C is equal to the number of leaf-level nodes. We now define the similarity measure between two models as

ρ_tr ≜ √( P_tr P_rt / (P_tt P_rr) ).   (29)

The defined similarity measure exhibits the following properties: 1) by definition, ρ_tr = ρ_rt; 2) from 0 < P_tr ≤ P_tt and 0 < P_rt ≤ P_rr, it follows that 0 < ρ_tr ≤ 1; and, finally, 3) if T_t ≡ T_r, then ρ_tr = 1. Note that, for property 2), we assume that the inference algorithm guarantees that P_tt and P_rr are global

1. Scaled P_1 and P_2 do not satisfy the three axioms of probability over the total set of events if the α's vary for different events in that set.
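In the log domain, the normalized likelihoods and the similarity (29) reduce to a few lines. The per-node log-likelihoods below are invented numbers standing in for the terms of (6), and cardinality is taken to be the node count, as for DT_{V^0}:

```python
import numpy as np

# Hypothetical per-node log-likelihoods of each region's observables under its own
# model and under the other region's model (made-up values for illustration).
ll_t_under_t = np.array([-1.1, -0.9, -1.3, -1.0])   # Y_t under T_t
ll_t_under_r = np.array([-1.6, -1.4, -1.9, -1.5])   # Y_t under T_r
ll_r_under_r = np.array([-1.2, -1.0, -0.8])         # Y_r under T_r
ll_r_under_t = np.array([-1.7, -1.6, -1.4])         # Y_r under T_t

def log_norm_lik(ll):
    # P(.)^(1/C) in the log domain is the mean per-node log-likelihood;
    # working with logs also avoids the underflow noted in Section 4.5.
    return ll.sum() / len(ll)

log_Ptt, log_Ptr = log_norm_lik(ll_t_under_t), log_norm_lik(ll_t_under_r)
log_Prr, log_Prt = log_norm_lik(ll_r_under_r), log_norm_lik(ll_r_under_t)

# Eq. (29): rho_tr = sqrt(Ptr * Prt / (Ptt * Prr)), symmetric by construction.
rho = np.exp(0.5 * (log_Ptr + log_Prt - log_Ptt - log_Prr))
```

With the self-likelihoods dominating the cross-likelihoods, as assumed in property 2), rho falls in (0, 1]; for these numbers it comes out near 0.58.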
Fig. 3. Alignment tests: (a) and (b) 128×128 test and reference images, (c) segmented region under T_t using DT_{V^0}, (d) segmented region under T_r using DT_{V^0}, (e) image regions in the reference image used for substitution y_{μ_t(i)} ← y_{μ_r(i)} for different θ_t, and (f) image regions in the test image used for substitution y_{μ_r(i)} ← y_{μ_t(i)} for different θ_r. The crosses mark the estimated roots' positions r̂_t and r̂_r.

maxima for the test and reference images, respectively. In practice, from our experience, this is not a significant concern, as the algorithm converges to near-optimal solutions, as discussed in Section 4.5.

In computation of the cross probabilities, say, P_tr, it is necessary to substitute observables Y_r with Y_t in the estimated subtree structure (Ẑ_r, X̂_r, R̂_{0r}) according to a specified mapping. While a complete treatment of possible mappings is beyond the scope of this paper, below we consider one plausible approach. For this purpose, it is convenient to index observables in terms of their locations in the image. Recall, in Section 3, we define locations of observables μ(i), given by (5). Thus, the mapping can be conducted as follows: For each observable node i in T_r that is on location μ_r(i) in the reference image, we first find the corresponding location in the test image μ_t(i) and then substitute y_{μ_r(i)} ← y_{μ_t(i)}. We define the correspondence between the locations μ_r(i) and μ_t(i) as follows:

μ_t(i) = r̂_t + ⌊ [cos θ_r  −sin θ_r; sin θ_r  cos θ_r] (μ_r(i) − r̂_r) ⌋_{2^ℓ},   (30)

where θ_r is a rotation angle; r̂_t and r̂_r are estimated positions of the roots of T_t and T_r in the test and reference images; and ⌊·⌋_{2^ℓ} finds integer, multiples-of-2^ℓ values of the form given by (5).

Pictorially, computation of P_tr can be viewed as alignment of T_r with the location of T_t in the test image. Thus, according to (30), we first translate T_r until the root of T_r coincides with the root of T_t in the reference image.² After the roots are aligned, we then rotate T_r for several angles θ_r about the vertical axis containing the roots. A similar expression to (30) holds for translation and rotation of T_t in the reference image, when computing P_rt. Note that, because of the rotation, we compute two arrays of cross probabilities, P_rt(θ_t) and P_tr(θ_r), for each finite rotation increment of θ_t and θ_r. Although we could eliminate either θ_t or θ_r, we do not because of finite rotation increments that may differ for the two parameters. We emphasize that the outlined translation/rotation is just a visual interpretation of the mapping, in which one set of observables is substituted by the other; however, this mapping should not be misunderstood as transformation of the already estimated dynamic trees.

Due to the mapping, given by (30), when computing P_tr(θ_r), locations of observables μ_t(i) may fall outside the boundaries of the test image. In this case, it is necessary to prune that observable node i in T_r (rectangular-shaped node) and its corresponding node characterized by hidden variables (round-shaped node). This deletion of nodes gives rise to a number of tree-pruning strategies. In our approach, for DT_V, we simply delete outlying rectangular-shaped nodes and their corresponding round-shaped nodes; other nodes are kept intact. For DT_{V^0}, the deletion of nodes at the leaf level, V^0, may leave some higher-level nodes without children. To preserve the generative property, as discussed in Section 4.3, from T_r, we also prune, in the bottom-up sweep, those nodes that happen to lose all their children. A similar pruning procedure is necessary when computing P_rt(θ_t). In Fig. 3, we illustrate alignment tests pictorially for two sample images.

Having defined our similarity measure, we are now in a position to conduct matching of segmented image regions across a given set of images.

2. As we demonstrate in Section 6, roots' positions give a good estimate of the center of mass of true object appearances in the image; therefore, we align the roots—not the centers of mass of segmented image regions under T_r and T_t.

6 EXPERIMENTS

We report experiments on segmentation and matching of image regions for five sets of images. Data Set I contains 50, 4×4, binary images with a total of 50 single object appearances. A sample of Data Set I images is depicted in Fig. 4a (top). Data Set II consists of 50, 8×8, binary images with a total

Fig. 4. Image segmentation using DT_{V^0}: (a) sample 4×4 and 8×8 binary images, (b) clustered leaf-level pixels that have the same parent at level ℓ=1, (c) clustered leaf-level pixels that have the same grandparent at level ℓ=2; clusters are indicated by different shades of gray; the point in each region marks the position of the parent node, and (d) estimated DT structure for the 4×4 image in (a); nodes are depicted inline representing 4, 2, and 1 actual rows of the levels 0, 1, and 2, respectively; nodes are drawn as pie-charts representing P(x_i^k = 1), k ∈ {0, 1}; note that there are two roots representing two distinct objects.
1770 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 11, NOVEMBER 2005
of 78 multiple object appearances. A sample of Data Set II images is shown in Fig. 4a (bottom). Data Set III comprises 50, 64×64, simple indoor-scene, color images with a total of 105 object appearances of 20 distinct objects shown in Fig. 5. Samples of Data Set III images are given in Figs. 6, 7, and 9. Data Set IV contains 50, 128×128, challenging indoor-scene, color images with a total of 223 partially occluded object appearances of the same 20 distinct objects as for Data Set III images. Examples of Data Set IV images are shown in Figs. 3 and 10. Note that objects appearing in Data Sets III and IV are carefully chosen to test if DTs are expressive enough to capture very small variations in appearances of some classes (e.g., two different types of cans in Fig. 5), as well as to encode large differences among some other classes (e.g., the wiry-featured robot and books in Fig. 5). Finally, Data Set V contains 50, 128×128, natural-scene, color images with a total of 297 object appearances, samples of which are shown in Figs. 8, 11, and 13. Ground truth in images is obtained through hand-labeling of pixels.

Fig. 5. Twenty image classes in Type III and IV Data Sets.

For Data Sets I and II, we experiment only with DT_{V^0} models, with observables Y given by binary pixel values. For the other data sets, we test both DT_{V^0} and DT_V. To compute the Y's, we account for both color and texture cues. For texture analysis in DT_V, we choose the complex wavelet transform (CWT) applied to the intensity (gray-scale) image, due to its shift-invariant representation of texture at different scales, orientations, and locations [34]. The CWT's directional selectivity is encoded in six subimages of coefficients oriented at angles ±15°, ±45°, and ±75°. For texture extraction in DT_{V^0}, we compute the difference-of-Gaussian function convolved with the image: D(x, y, k, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y), where x and y represent pixel coordinates, G(x, y, σ) = exp(−(x² + y²)/2σ²)/2πσ², and I(x, y) is the intensity image. In addition to reduced computational complexity, as compared to the CWT, the function D provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G, which has been shown to produce the most stable image features across scales when compared to a range of other possible image functions, such as the gradient and the Hessian [35], [36]. We compute D(x, y, k, σ) for three scales, k = √2, 2, √8, and σ = 2. For color features in both DT_V and DT_{V^0}, we choose the generalized RGB color space: r = R/(R+G+B) and g = G/(R+G+B), which effectively normalizes variations in brightness; the Y's of higher-level nodes are computed as the mean of the r's and g's of their children nodes of the initial quad-tree structure. Each color observable is normalized to have zero mean and unit variance over the data set. Thus, the y_{μ(i)}'s are eight- and five-dimensional vectors for DT_V and DT_{V^0}, respectively.

In the following experiments, we compare our SVA inference algorithm with three other inference algorithms: 1) Gibbs sampling discussed in [29], 2) mean-field variational approximation (MFVA) proposed in [1], and 3) variational approximation (VA)³ discussed in [4]. All the figures in this section illustrate segmentation and matching performance when DTs are inferred using our SVA algorithm.

Fig. 6. Image segmentation using DT_{V^0}: (a) Data Set III images, (b) pixel clusters with the same parent at level ℓ=3, (c) pixel clusters with the same parent at level ℓ=4; points mark the position of parent nodes. DT structure is preserved through scales.

Fig. 7. Image segmentation using DT_{V^0}: (top) Data Set III images and (bottom) pixel clusters with the same parent at level 3. DT structure is preserved over rotations.

3. Although the algorithm proposed in [4] is also a structured variational approximation, to differentiate that method from ours, we slightly abuse the notation.

6.1 Image Segmentation Tests

DT-based image segmentation is tested on all five data sets. Results presented in Figs. 4, 5, 6, 7, and 8 suggest that DTs are able to encode component-subcomponent relationships among objects and their parts in the image. From Fig. 8, we observe that nodes at different levels of the dynamic tree can be interpreted as object parts at various scales. Moreover, from Figs. 6 and 7, we also observe that DTs, inferred through SVA, preserve structure for objects across images subject to translation, rotation, and scaling. In Fig. 6, note that the level-4 clustering for the larger-object scale in Fig. 6c (top) corresponds to the level-3 clustering for the smaller-object scale in Fig. 6b. In other words, as the object transitions through scales, the tree structure changes by eliminating the lowest-level layer, while the higher-order structure remains intact.
Fig. 8. Image segmentation using DTV : (a) a Data Set V image, (b), (c), and (d) pixel clusters with the same parent at levels ‘¼3; 4; 5, respectively,
white regions represent pixels already grouped by roots at the previous scale; points mark the position of parent nodes; nodes at different levels of
DTV can be interpreted as object parts at various scales.
We also note that the estimated positions of higher-level hidden variables are very close to the centers of mass of object parts, as well as of whole objects. We compute the error of an estimated root-node position $\hat{r}$ as its distance from the actual center of mass $r_{CM}$ of the hand-labeled object, $d_{err} = \|\hat{r} - r_{CM}\|$. The error values, averaged over the given test images, for VA and SVA are reported in Table 1. We observe that the error decreases significantly as the image size increases because, in summing node positions over parent and children nodes, as in (17) and (18), more statistically significant information contributes to the position estimates. For example, $d_{err}^{IV} = 6.18$ for SVA is only 4.8 percent of the Data Set IV image size, whereas $d_{err}^{III} = 4.23$ for SVA is 6.6 percent of the Data Set III image size.

Typical results of DT-based image segmentation for Data Sets III, IV, and V are shown in Figs. 9, 10, and 11. In Table 2, we report the percentage of erroneously grouped pixels and, in Table 3, the object detection error, both measured against ground truth and averaged over each data set. For estimating the object detection error, the following instances are counted as errors: 1) merging two distinct objects into one (i.e., failure to detect an object) and 2) segmenting an object into subregions that are not actual object parts. On the other hand, if an object is segmented into several "meaningful" subregions, verified by visual inspection, this type of error is not counted. The averaged pixel error for Gibbs sampling is 6 percent for both type I and type II images, while for MFVA it is 18 percent and 12 percent for Data Sets I and II, respectively. With regard to the object detection error, Gibbs sampling yields no errors in Data Set I and wrongly segments seven objects in Data Set II (8 percent). The object detection error for MFVA is 1 undetected object in Data Set I (2 percent) and 13 merged/undetected objects in Data Set II (16 percent). As the image size increases, Gibbs sampling becomes infeasible and MFVA exhibits very poor performance.
Fig. 9. Image segmentation by DTV0 learned using SVA for Data Set III images; all pixels labeled with the same color are descendants of a unique root.

Fig. 10. Image segmentation by DTV0 learned using SVA for Data Set IV images; (b) negative example, where, due to challenging similarity in appearance and occlusion, the DT merges two distinct objects into one; all pixels labeled with the same color are descendants of a unique root.

Fig. 11. Image segmentation by DTs learned using SVA for Data Set V: (a) DTV0, (b) and (c) DTV; all pixels labeled with the same color are descendants of a unique root.

TABLE 1
Root-Node Distance Error
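The caption phrase "all pixels labeled with the same color are descendants of a unique root" amounts to coloring each leaf by its root ancestor in the inferred tree. A minimal sketch, where the parent map and node names are hypothetical stand-ins for an inferred DT structure:

```python
def color_by_root(parent, leaves):
    """Assign each leaf pixel the label of its root ancestor.

    parent: dict mapping node -> parent node (roots map to None).
    leaves: iterable of leaf-level nodes (pixels).
    Returns a dict leaf -> root, so leaves sharing a root share a label.
    """
    def root_of(node):
        # Follow parent pointers up to a root.
        while parent[node] is not None:
            node = parent[node]
        return node

    return {leaf: root_of(leaf) for leaf in leaves}


# Tiny hypothetical tree: two roots r1, r2; middle node m; pixels p1..p3.
parent = {"r1": None, "r2": None, "m": "r1",
          "p1": "m", "p2": "m", "p3": "r2"}
labels = color_by_root(parent, ["p1", "p2", "p3"])
# p1 and p2 share root r1, so they receive the same color; p3 belongs to r2.
```
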
1772 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 27, NO. 11, NOVEMBER 2005
TABLE 2
Percent of Erroneously Grouped Pixels
TABLE 3
Object Detection Error
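The percentage of erroneously grouped pixels reported in Table 2 can be sketched as below. Mapping each segmented cluster to its best-overlapping ground-truth label is our assumption for illustration, not necessarily the exact protocol used in the experiments:

```python
from collections import Counter

def pixel_error(pred, truth):
    """Percent of erroneously grouped pixels.

    pred, truth: equal-length sequences of per-pixel labels (predicted
    cluster and ground-truth object, respectively). Each predicted cluster
    is mapped to the ground-truth label it overlaps most; pixels of that
    cluster outside the majority label count as errors.
    """
    clusters = {}
    for p, t in zip(pred, truth):
        clusters.setdefault(p, Counter())[t] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in clusters.values())
    return 100.0 * errors / len(pred)


pred = [0, 0, 0, 1, 1, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b", "b", "c"]
# cluster 0 -> majority "a" (1 stray pixel); cluster 1 -> majority "b"
# (1 stray pixel): 2 errors out of 8 pixels, i.e., 25 percent.
```
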
Fig. 13. Image matching for Data Set V images in (a) and (b). (c) Computation of $P_{rr}$ and $P_{tt}$ for a sample of two segmented image regions in the reference and test images, respectively; (d) and (e) computation of $P_{tr}(\theta_r)$ and $P_{rt}(\theta_t)$ when $T_r$ and $T_t$ represent the same object; (f) computation of $P_{tr}(\theta_r)$ and $P_{rt}(\theta_t)$ when $T_r$ and $T_t$ represent different objects; (g) and (h) 3D plots of $\rho_{tr}(\theta_t, \theta_r)$ for $\theta_t, \theta_r \in [-\pi/4, \pi/4]$, where $(\theta_r, \theta_t, \rho)$ marks the maximum. (a) Reference image. (b) Test image. (c) $P_{rr}$ (left), $P_{tt}$ (right). (d) $P_{tr}(\theta_r)$ (top), $P_{rt}(\theta_t)$ (bottom). (e) $P_{tr}(\theta_r)$ (top), $P_{rt}(\theta_t)$ (bottom). (f) $P_{tr}(\theta_r)$ (top), $P_{rt}(\theta_t)$ (bottom). (g) $\rho_{tr}(\theta_t, \theta_r)$ plot for the $(T_t, T_r)$ pair in (d). (h) $\rho_{tr}(\theta_t, \theta_r)$ plot for the $(T_t, T_r)$ pair in (e). (i) $\rho_{tr}(\theta_t, \theta_r)$ plot for the $(T_t, T_r)$ pair in (f).
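The alignment search illustrated in Fig. 13, maximizing the similarity over a grid of rotational alignments and accepting a match only when the maximizing angles are (near) opposite, can be sketched as follows. The function `rho` below is a hypothetical stand-in for the DT-based similarity measure; the grid step and span mirror the values stated in the text ($\pi/16$ increment, $\pm\pi/4$ range):

```python
import math

def best_alignment(rho, step=math.pi / 16, span=math.pi / 4):
    """Grid-search rotational alignments (theta_t, theta_r) in [-span, span].

    rho: callable rho(theta_t, theta_r) returning a similarity score.
    Returns (score, theta_t, theta_r, accepted), where accepted is True only
    if the maximizing angles satisfy |theta_t + theta_r| <= step, i.e., the
    two rotations are (near) opposite, as expected for identical objects.
    """
    n = int(round(span / step))
    grid = [i * step for i in range(-n, n + 1)]
    score, tt, tr = max((rho(a, b), a, b) for a in grid for b in grid)
    return score, tt, tr, abs(tt + tr) <= step + 1e-12


# Hypothetical "peaky" similarity, maximal at theta_t = -theta_r = pi/8.
rho = lambda a, b: -((a - math.pi / 8) ** 2 + (b + math.pi / 8) ** 2)
score, tt, tr, ok = best_alignment(rho)
# ok is True here: the best angles satisfy theta_t = -theta_r.
```
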
a given data set is chosen as the reference, while the rest of the images are then marked as test images. After DT-based image segmentation of the reference and test images, for a given $T_r$, we search for the maximum of $\rho_{tr}(\theta_t, \theta_r)$ over all possible image regions under $T_t$ and rotational alignments $(\theta_t, \theta_r)$, as illustrated in Fig. 13. Note that $\theta_t$ and $\theta_r$ should be related by $\theta_t = -\theta_r$, provided the compared objects are identical. Thus, the test image region under $T_t$, for which $\rho_{tr}(\theta_t, \theta_r)$ is maximum and $|\theta_t + \theta_r| \le \varepsilon$, where $\varepsilon = \pi/16$ is the rotation increment in the alignment tests, is recognized as the reference image region under $T_r$. From Figs. 13g and 13h, we observe that $\rho_{tr}$ is a "peaky" function, reaching its maximum when the same objects are matched.

To compare our approach with methods that use probability log-ratios for image matching, we repeat the aforementioned set of experiments, but now using the symmetric KL distance $d_{tr}$, specified in [10] as

$$ d_{tr} \triangleq \frac{1}{N_t}\sum_{i\in T_t}\log\frac{P(y_{t(i)}\,|\,x_i^t)}{P(y_{t(i)}\,|\,x_i^r)} + \frac{1}{N_r}\sum_{i\in T_r}\log\frac{P(y_{r(i)}\,|\,x_i^r)}{P(y_{r(i)}\,|\,x_i^t)}, $$

where $N_t$ and $N_r$ are the numbers of observables in $T_t$ and $T_r$, respectively; $y_{t(i)}$ and $y_{r(i)}$ are observables in the test and reference images; and $x_i^t$ and $x_i^r$ are image-class indicator random variables in $T_t$ and $T_r$, respectively. In light of the discussion in Section 5 on the necessity to normalize differently sized models, we also carry out the object recognition experiments using the normalized symmetric KL distance, $\tilde{d}_{tr}$, given by

$$ \tilde{d}_{tr} \triangleq \frac{1}{N_t}\frac{C_r}{C_t}\sum_{i\in T_t}\log\frac{P(y_{t(i)}\,|\,x_i^t)}{P(y_{t(i)}\,|\,x_i^r)} + \frac{1}{N_r}\frac{C_t}{C_r}\sum_{i\in T_r}\log\frac{P(y_{r(i)}\,|\,x_i^r)}{P(y_{r(i)}\,|\,x_i^t)}, $$

where $C_t$ and $C_r$ are the cardinalities of $T_t$ and $T_r$, respectively, as defined in Section 5. For both distance measures, the image region under $T_t$ for which $d_{tr}$, or $\tilde{d}_{tr}$, is closest to zero, compared to the rest of the segmented regions in the test image, is recognized as $T_r$.

In Table 4, we summarize our object recognition results using both DTV0 and DTV, inferred using SVA and VA, for $\rho_{tr}$, $d_{tr}$, and $\tilde{d}_{tr}$. We define the recognition rate as the percentage of correctly detected appearances of an object in the total number of actual appearances of that object in the test images. The false detection rate is defined as the percentage of incorrectly detected appearances of an object in the total number of detected appearances of that object in the test
images. Here, ground truth is established by visual inspection. Recognition and false detection rates are averaged over all segmented regions and all images. Overall, we observe significantly better object recognition performance when $\rho_{tr}$ is used as the model-matching measure than when $d_{tr}$ or $\tilde{d}_{tr}$ is used. Again, DTV models outperform DTV0 models.

TABLE 4
Recognition Rate (R) and False Detection Rate (F)

7 CONCLUSION

In this paper, we presented a probabilistic framework for image segmentation and subsequent matching of segmented regions, when only weak or no prior knowledge is available. We proposed and demonstrated the use of Dynamic Trees (DTs) to address these problems. More precisely, we formulated image segmentation as inference of model posterior distributions, given an image, and subsequent Bayesian estimation of DT structure. Beyond this novel application of DTs, we built on previous DT work to formulate a novel DT architecture that introduces multilayered observable data into the model. For the proposed model, we derived a novel Structured Variational Approximation (SVA) inference algorithm that removes the independence assumptions between node positions and model structure made in prior work. Furthermore, we formulated image matching as similarity analysis between two DTs representing the examined image regions. To conduct this analysis, we specified a novel similarity measure between two statistical models, which we find more suitable for unsupervised settings than measures based on probability log-ratios. We proposed one possible alignment procedure for comparing two DTs and developed criteria, based on the resulting similarity measure, for ultimate unsupervised object recognition.

Through a set of detailed experiments, we demonstrated the significantly improved properties of the SVA algorithm, both in terms of substantially faster convergence rates and larger approximate posteriors of the inferred models, when compared with competing inference algorithms. Our results show that DTs are capable of capturing important component-subcomponent relationships among objects and their parts and, hence, that DTs perform well in segmenting images into plausible pixel clusters. Furthermore, we reported results on unsupervised object recognition, demonstrating the viability of the proposed similarity measure for matching statistical models.

This paper opens a number of research issues that need further investigation. First among these is the optimal alignment procedure required for comparing dynamic-structure models. Possible choices of the alignment procedure should ultimately seek to enhance the discriminative power of the similarity measure, that is, how well the similarity measure distinguishes like objects (i.e., DT models) from dissimilar ones. Second, our experiments show (see Figs. 13g and 13h) that the proposed similarity measure, $\rho$, is a "peaky" function, which suggests that $\rho$ can be successfully used in supervised settings. It is likely that using a suitable (learned) threshold for the classifier could improve object recognition results beyond those reported in Table 4. Next, we currently assume that node positions in DTs are uncorrelated (i.e., have diagonal covariances) along the "x" and "y" image coordinates, in order to facilitate the computation of (18). Often, this may not be an appropriate assumption, and we will further examine how to modify our inference algorithm to accommodate dependencies between coordinates. Finally, although the DTV type of model outperforms DTV0 in every reported experiment, this may have been the result of the more expressive texture extraction (i.e., the complex wavelet transform) used in DTV, compared to that used in DTV0. Further research is necessary for establishing when and why one model is better than the other.

APPENDIX
DERIVATION OF STRUCTURED VARIATIONAL APPROXIMATION

A.1 Notation

- $V = \{V^0, \bar{V}^0\}$: set of all nodes; $V^0$: set of leaf-level nodes;
- $y^{(i)}$: observable random vector at location $(i)$; $Y \triangleq \{y^{(i)} \,|\, \forall i \in V\}$;
- $z_{ij}$: indicator random variable (RV) denoting a connection between nodes $i$ and $j$; $Z \triangleq \{z_{ij} \,|\, \forall i, j \in V\}$; $\gamma_{ij}$: true probability of $i$ being the child of $j$; $\zeta_{ij}$: approximate probability of $i$ being the child of $j$ given $Y$ and $R^0$;
- $M$: set of image classes; $x_i^k$: indicator RV denoting that $i$ is labeled as class $k \in M$; $X \triangleq \{x_i^k \,|\, \forall i \in V, k \in M\}$; $P_{ij}^{kl}$: true conditional probability tables; $Q_{ij}^{kl}$: approximate conditional probability tables given $Y$ and $R^0$; $m_i^k$: approximate posterior that node $i$ is labeled as image class $k$, given $Y$ and $R^0$;
- $r_i$: position of node $i$; $R^0 \triangleq \{r_i \,|\, \forall i \in V^0\}$; $\bar{R}^0 \triangleq \{r_i \,|\, \forall i \in \bar{V}^0\}$; $\Sigma_{ij}$ and $d_{ij}$: true diagonal covariance and mean of the relative child-parent displacement $(r_i - r_j)$; $\Omega_{ij}$ and $\mu_{ij}$: approximate diagonal covariance and mean of $r_i$, given that $j$ is the parent of $i$ and given $Y$ and $R^0$;
- $\langle\cdot\rangle$: expectation with respect to $Q(Z, X, \bar{R}^0)$; $\lambda$: normalization constant; $pa(i)$: candidate parents of $i$; $c(i)$: children of $i$; $d(i)$: all descendants down the subtree of $i$.
TODOROVIC AND NECHYBA: DYNAMIC TREES FOR UNSUPERVISED SEGMENTATION AND MATCHING OF IMAGE REGIONS 1775
auxiliary terms $F_{ij}$, $G_i$, and $\beta_i^k$, which facilitate computation of $\partial L_X/\partial Q_{ij}^{kl}$, as shown below:

$$ F_{ij} \triangleq \zeta_{ij} \sum_{k,l\in M} Q_{ij}^{kl}\, m_j^l \log\big[Q_{ij}^{kl}/P_{ij}^{kl}\big], $$
$$ G_i \triangleq \sum_{d,c\,\in\, d(i)} F_{dc} - \Big\{ \sum_{k\in M} m_i^k \log P(y^{(i)}\,|\,x_i^k, \theta^{(i)}) \Big\}_{V^0}, \qquad (40) $$
$$ \beta_i^k \triangleq \exp\big(-\partial G_i/\partial m_i^k\big), $$
$$ \Rightarrow\quad \frac{\partial L_X}{\partial Q_{ij}^{kl}} = \frac{\partial F_{ij}}{\partial Q_{ij}^{kl}} + \frac{\partial G_i}{\partial m_i^k}\,\frac{\partial m_i^k}{\partial Q_{ij}^{kl}}, $$

where $\{\cdot\}_{V^0}$ denotes that the term in braces is included in the expression for $G_i$ only if $i$ is a leaf node for DTV0; for DTV, the term in braces is always included. This allows us to derive update equations for both models simultaneously. After finding the derivatives $\partial F_{ij}/\partial Q_{ij}^{kl} = \zeta_{ij} m_j^l \big(\log[Q_{ij}^{kl}/P_{ij}^{kl}] + 1\big)$ and $\partial m_i^k/\partial Q_{ij}^{kl} = \zeta_{ij} m_j^l$, we arrive at

$$ \partial L_X/\partial Q_{ij}^{kl} = \zeta_{ij}\, m_j^l \big(\log[Q_{ij}^{kl}/P_{ij}^{kl}] + 1 - \log\beta_i^k\big). \qquad (41) $$

Finally, optimizing (41) with the Lagrange multiplier that accounts for the constraint $\sum_{k\in M} Q_{ij}^{kl} = 1$ yields the desired update equation, $Q_{ij}^{kl} = \lambda P_{ij}^{kl}\beta_i^k$, introduced in (14).

To compute $\beta_i^k = \exp(-\partial G_i/\partial m_i^k)$, we first find

$$ \frac{\partial G_i}{\partial m_i^k} = \sum_{c\in c(i)} \Big( \frac{\partial F_{ci}}{\partial m_i^k} + \sum_{a\in M} \frac{\partial G_c}{\partial m_c^a}\,\frac{\partial m_c^a}{\partial m_i^k} \Big) - \big\{ \log P(y^{(i)}\,|\,x_i^k, \theta^{(i)}) \big\}_{V^0} $$
$$ = \sum_{c\in c(i)} \zeta_{ci} \sum_{a\in M} Q_{ci}^{ak} \Big( \log\frac{Q_{ci}^{ak}}{P_{ci}^{ak}} + \frac{\partial G_c}{\partial m_c^a} \Big) - \big\{ \log P(y^{(i)}\,|\,x_i^k, \theta^{(i)}) \big\}_{V^0}, \qquad (42) $$

and then substitute $Q_{ij}^{kl}$, given by (14), into (42), which gives (15).

A.4 Optimization of $Q(\bar{R}^0|Z)$

Then, from $\partial L_R/\partial \mu_{ij} = 0$, it is straightforward to compute the update equation for $\mu_{ij}$ given by (17).

A.5 Optimization of $Q(Z)$

$Q(Z)$ is fully characterized by the parameters $\zeta_{ij}$. From the definitions of $L_Z$, $L_X$, and $L_R$, we see that $\partial J(Q)/\partial \zeta_{ij} = \partial(L_X + L_R + L_Z)/\partial \zeta_{ij}$. Similar to the optimization of $Q_{ij}^{kl}$, we need to iteratively differentiate $L_X$ as follows:

$$ \frac{\partial L_X}{\partial \zeta_{ij}} = \frac{\partial F_{ij}}{\partial \zeta_{ij}} + \sum_{k\in M} \frac{\partial G_i}{\partial m_i^k}\,\frac{\partial m_i^k}{\partial \zeta_{ij}}, \qquad (44) $$

where $F_{ij}$ and $G_i$ are defined as in (40). By substituting the derivatives $\partial G_i/\partial m_i^k = -\log\beta_i^k$, $\partial F_{ij}/\partial \zeta_{ij} = \sum_{k,l\in M} Q_{ij}^{kl} m_j^l \log[Q_{ij}^{kl}/P_{ij}^{kl}]$, and $\partial m_i^k/\partial \zeta_{ij} = \sum_{l\in M} Q_{ij}^{kl} m_j^l$ into (44), we obtain

$$ \frac{\partial L_X}{\partial \zeta_{ij}} = \sum_{k,l\in M} Q_{ij}^{kl}\, m_j^l \Big( \log\frac{Q_{ij}^{kl}}{P_{ij}^{kl}} - \log\beta_i^k \Big) = -\sum_{k,l\in M} Q_{ij}^{kl}\, m_j^l \log\Big( \sum_{a\in M} P_{ij}^{al}\beta_i^a \Big) \triangleq -A_{ij}. \qquad (45) $$

Next, we differentiate $L_R$, given by (39), with respect to $\zeta_{ij}$ as

$$ \frac{\partial L_R}{\partial \zeta_{ij}} = \frac{1}{2}\log\frac{|\Sigma_{ij}|}{|\Omega_{ij}|} - 1 + \frac{1}{2}\mathrm{Tr}\{\Sigma_{ij}^{-1}\Omega_{ij}\} + \frac{1}{2}\sum_{p\in \bar{V}^0} \zeta_{jp}\,\mathrm{Tr}\{\Sigma_{ij}^{-1}(\Omega_{jp} + M_{ijp})\} + \frac{1}{2}\sum_{c\in \bar{V}^0} \zeta_{ci}\,\mathrm{Tr}\{\Sigma_{ci}^{-1}(\Omega_{ij} + M_{cij})\} + \cdots \qquad (46) $$