
Similarity search using Multi-space KL

Raffaele Cappelli¹, Dario Maio² and Davide Maltoni²

¹ Corso di Laurea in Scienze dell'Informazione, Università di Bologna, via Sacchi 3,
47023 Cesena - Italy. E-mail: cappelli@csr.unibo.it
² DEIS, CSITE - CNR, Università di Bologna, viale Risorgimento 2,
40136 Bologna - Italy. E-mail: {dmaio,dmaltoni}@deis.unibo.it

Abstract - The Karhunen-Loève transform is probably the most widely used statistical framework for
dimensionality reduction in a broad range of scientific fields. Given a set of points in an n-
dimensional space (the points can be derived from images, sounds, or other multimedia objects),
KL provides a mapping which reduces the dimensionality of the input patterns to k (k << n) without
altering their structure too much; this is obtained by removing the components of minor relevance.
Unfortunately, KL suffers from some scalability problems: in fact, as the size of the database
increases, the efficacy and efficiency of the transform progressively vanish. In this work we
introduce the basics of a new generalization of KL (named Multi-space KL or MKL) which allows
the scalability problems to be solved, and we show how MKL can be used for similarity searches in
multimedia databases. In particular, it is possible to build an index on the database which can be
accessed through a distance function which is a metric of the working multi-space. The paper
reports some preliminary experiments where MKL outperforms KL as the size of the database
increases. In its current form, the index described here is a flat, non-incremental structure which
cannot be efficiently used on large multimedia databases; the paper therefore also introduces our
recent efforts devoted to the definition of a hierarchical structure based on the nesting of MKL
subspaces and to the development of techniques for rearranging the subspaces when new objects are inserted.

Keywords - Karhunen-Loève transform, Similarity search, Object retrieval, Scalability.

1. INTRODUCTION

Efficiently retrieving objects by similarity in a large multimedia database requires the objects to
be compactly represented by using a semantics-preserving transformation. Until now, for a large
spectrum of applications, the best-performing approaches have been designed by using specific
knowledge and by manually selecting the best-suited features or the most appropriate representation
structure. This has caused the proliferation of a huge number of application-dependent techniques
which cannot be successfully applied outside their typical domain.
Usually a multimedia object can be directly represented, without requiring any ad-hoc feature
extraction stage, as a point in a very high-dimensional space. For example, an image can be
vectorized by concatenating its rows and by associating a dimension to each pixel (i.e., a 256×256
image is then a point in a 65536-dimensional space). On the other hand, it is extremely inefficient
to store and retrieve such high-dimensional vectors by using conventional spatial data structures
such as the R-tree [Gutt84] or the Grid File [Niev84]; therefore an a priori dimensionality reduction is mandatory.
The best-known and most widely used dimensionality reduction technique is principal component analysis
[Fuku90] (which in the field of pattern recognition is usually referred to as the Karhunen-Loève
transform or simply KL); the projection of a vector into the KL space is performed through the
multiplication of the vector by a rectangular matrix calculated during an initial learning phase
carried out on a representative training set. KL was initially used by researchers for image
compression and reconstruction [Kirb90] and more recently for image recognition [Pent91] and
retrieval in image databases [Pent94] [Swet95] [Swet96].
Generally, when the number of objects and object classes increases, a larger training set is
necessary to fulfill the representativeness requirement, and the efficacy of KL decreases (scalability
problem): in fact, on the one hand the discriminant power of the features progressively vanishes,
since the features tend to become very smoothed; on the other hand, the training time can become
daunting. One possible solution to the scalability problem consists in splitting a hard problem into
several easier sub-problems. For example, in the context of pattern classification based on KL,
linear separability is a necessary requirement in order to find an optimal solution; a large set
of complex patterns is rarely linearly separable when considered as an ensemble, whereas it is more
likely that optimal separating hyperplanes can be found if the initial set is partitioned into several
subsets and each of these is treated independently.
In this work we introduce a multi-space generalization of the KL transform (which we call
MKL) where several subspaces are created to arrange the different objects. Each subspace is used to
represent a subset of objects having common characteristics, thus allowing more selective features
to be employed; furthermore, each subspace is built starting from a reduced set of objects whose
cardinality is independent of the total number of elements.
Given a database of m objects, an MKL solution can be initially determined by using, as the training
set, the whole database or a representative portion of it. Then, each database object is associated with a
multidimensional point and projected into the best-suited MKL subspace, thus obtaining a compact
representation which constitutes a signature of the object. Searching by similarity involves
comparing signatures by means of an ad-hoc distance function. A flat index is proposed here, where
a pair (objectID, signature) is created for each database instance. Obviously, retrieving the closest
object(s) with respect to a given example requires O(m) distance evaluations between the searched
object and the signatures in the index. However, since the adopted distance function essentially requires
computing Euclidean distances in low-dimensional subspaces, the computational complexity is
reasonably low even in the case of medium-large databases.
On the other hand, a flat implementation does not allow the MKL solution to be effectively adapted in
case a certain amount of objects (not adequately represented in the initial training set) has to be
inserted into the database. The paper briefly introduces our recent efforts devoted to the definition of
a hierarchical structure based on the nesting of MKL subspaces and to the development of techniques
for adapting the subspaces, without recalculating them from scratch, when new objects are inserted.
The rest of this paper is organized as follows: in section 2, the KL transform and some related
results are summarized; section 3 defines the MKL and its operators. Section 4 explains how to
measure the distance between a searched object and an object signature and describes a flat
indexing technique. Section 5 reports our experimentation carried out on databases of randomly
generated multidimensional points. Finally, in section 6 we draw our conclusions and outline the
future work we plan on this topic.

2. KL TRANSFORM

Let $P = \{x_i \in \mathbb{R}^n \mid i = 1, \ldots, m\}$ be a set of m n-dimensional points (or vectors) derived from the objects of
interest, and let:

- $\bar{x} = \frac{1}{m} \sum_{x \in P} x$ be their mean vector,
- $C = \frac{1}{m} \sum_{x \in P} (x - \bar{x})(x - \bar{x})^T$ be their covariance matrix,
- $\Phi \in \mathbb{R}^{n \times n}$ be the orthonormal matrix which diagonalizes C, that is $\Phi^T C \Phi = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, $\Phi = [\varphi_1, \varphi_2, \ldots, \varphi_n]$,

where $\lambda_i$ and $\varphi_i$, $i = 1, \ldots, n$, are the eigenvalues and the eigenvectors of C, respectively.

Then, for a given k ($k < n$, $k < m$, $k > 0$), the KL k-dimensional space $S_{\bar{x}, \Phi_k}$ is uniquely identified by
the mean vector $\bar{x}$ and by the projection matrix $\Phi_k \in \mathbb{R}^{n \times k}$ whose columns are the columns of $\Phi$
corresponding to the k largest eigenvalues:

$\Phi_k = [\varphi_{i_1}, \varphi_{i_2}, \ldots, \varphi_{i_k}]$ with $\lambda_{i_1} \geq \lambda_{i_2} \geq \ldots \geq \lambda_{i_k} \geq \ldots \geq \lambda_{i_n}$

The eigenvectors $\varphi_{i_1}, \varphi_{i_2}, \ldots, \varphi_{i_k}$ indicate the directions of largest variance in the training set, hence
$\Phi_k$ is a good basis for the object representation. Furthermore, it has been proved [Jain89] [Joll86]
that the KL transform guarantees the best Euclidean distance preservation among all the unitary
transformations for dimensionality reduction.

The projection of a vector $x \in \mathbb{R}^n$ into the space $S_{\bar{x}, \Phi_k}$ is:

$KL(x, S_{\bar{x}, \Phi_k}) = \Phi_k^T (x - \bar{x})$   (1)

The back-projection, into the original space, of a vector $y \in \mathbb{R}^k$ belonging to $S_{\bar{x}, \Phi_k}$ is:

$\overline{KL}(y, S_{\bar{x}, \Phi_k}) = \Phi_k y + \bar{x}$   (2)

The choice of the best dimensionality k for the KL target space is not obvious and strictly
depends on the application requirements. In fact, if KL is employed for pattern compression,
using higher values of k requires more space but yields better accuracy and minimizes the
pattern reconstruction error [Fuku90]; on the other hand, in the context of pattern classification it
has been shown in practice that increasing k beyond a certain limit can even deteriorate the
performance, since the components of small variance usually do not carry significant information
and are largely affected by perturbations and noise. Furthermore, working within large spaces is
computationally expensive and requires a lot of memory for storing the pattern projections.
Search by similarity can be performed in a KL space by using the standard Euclidean distance: if
$y_1, y_2, \ldots, y_m$ are the k-dimensional vectors defining the signatures of the objects stored in a database
and x is an n-dimensional vector to be searched, then the projection y of x is computed by (1) and
the Euclidean distances between y and $y_1, y_2, \ldots, y_m$ are calculated in $\mathbb{R}^k$ (figure 1).
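
As an illustration (this sketch is ours, not part of the original paper), the KL learning phase, the operators (1) and (2), and the Euclidean search of figure 1 can be written in Python/NumPy as follows; the function names and the toy data are merely assumptions for the example.

    import numpy as np

    def fit_kl(P, k):
        # Learn a KL space from the rows of P (an m x n matrix): the mean
        # vector x_bar and the k eigenvectors of the covariance matrix C
        # associated with the largest eigenvalues (the columns of Phi_k).
        x_bar = P.mean(axis=0)
        C = (P - x_bar).T @ (P - x_bar) / len(P)
        eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
        Phi_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
        return x_bar, Phi_k

    def kl_project(x, x_bar, Phi_k):                  # formula (1)
        return Phi_k.T @ (x - x_bar)

    def kl_backproject(y, x_bar, Phi_k):              # formula (2)
        return Phi_k @ y + x_bar

    # Search by similarity as in figure 1: distances are evaluated in R^k.
    rng = np.random.default_rng(0)
    P = rng.normal(size=(100, 3))                     # toy database: m=100, n=3
    x_bar, Phi_k = fit_kl(P, k=2)
    signatures = (P - x_bar) @ Phi_k                  # project all the objects
    y = kl_project(rng.normal(size=3), x_bar, Phi_k)  # searched object
    nearest = np.argsort(np.linalg.norm(signatures - y, axis=1))[:5]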

Fig. 1. A spherical query (with radius r) on a database indexed through KL (n=3, k=2); the objects having
signatures y2 and y4 are returned by the query.

3. MKL (MULTI-SPACE KL)

Let $P = \{x_i \in \mathbb{R}^n \mid i = 1, \ldots, m\}$ be a set of m n-dimensional vectors; then, for each partitioning
$\Pi = \{P_1, P_2, \ldots, P_s\}$ of P and for each set $K = \{k_1, k_2, \ldots, k_s\}$ of scalars, such that:

a) $\bigcup_{i=1..s} P_i = P$, $P_i \cap P_j = \emptyset$, $i, j = 1..s$, $i \neq j$

b) $m_i = \mathrm{card}(P_i) \geq \left\lfloor \frac{m}{s+1} \right\rfloor$, $i = 1..s$

c) $k_i < m_i$, $k_i > 0$, $k_i < n$, $i = 1..s$

the MKL transform is defined by the set of subspaces $S = \{S_i \mid S_i \equiv S_{\bar{x}_i, \Phi_{i,k_i}}, i = 1..s\}$, where:

- $\bar{x}_i = \frac{1}{m_i} \sum_{x \in P_i} x$;
- $\Phi_{i,k_i}$ is a matrix whose columns are the $k_i$ eigenvectors of $C_i = \frac{1}{m_i} \sum_{x \in P_i} (x - \bar{x}_i)(x - \bar{x}_i)^T$
  corresponding to the $k_i$ largest eigenvalues.

Each subset $P_i$ then determines a KL subspace $S_i$ of dimension $k_i$; constraint c) limits the
possible values of $k_i$ with respect to the number of vectors in $P_i$ and to the dimension of the original
space. Constraint b) requires the subsets $P_1, \ldots, P_s$ not to be too unbalanced. For example,
if m = 100 and s = 2, each subset must contain at least 33 elements. Furthermore, from $k_i < m_i$, $k_i > 0$ it
follows that $m_i > 1$: hence, each subset must include at least 2 elements. Finally, the maximum number of
subspaces is $s_{max} = \lfloor m/2 \rfloor$, obtained for $k_1 = k_2 = \ldots = k_s = 1$.
It should be noted that KL represents a particular case of MKL where $s = 1$, $\Pi = \{P\}$ and $K = \{k\}$.
In figure 2, a dimensionality reduction from a 2-dimensional space to 1-dimensional space(s) is shown
both for KL and MKL.


Fig. 2. KL and MKL dimensionality reduction (2 → 1) applied to the same initial set P; the resulting
subspaces are denoted by straight lines. For MKL, 3 subspaces are used (s=3) and the patterns within
different subsets are differently colored.

A huge number of MKL transforms can be derived from the same initial set P by varying s, $\Pi$ and
K; in the following we will call MKL solution a triplet $(s, \Pi, K)$. In [Capp99] we discuss a
criterion defining the optimality of an MKL solution and we report some heuristic algorithms for
calculating optimal MKL solutions. These algorithms require the set K, defining the dimensionality
of the subspaces, to be given as input and generate both s and $\Pi$ as output. In the present work we adopt the
simplifying assumption $k_1 = k_2 = \ldots = k_s = k$ (that is, all the subspaces have the same dimensionality);
hence, just the parameter k is required. As for KL, the choice of the best k is not obvious; our
experimental results demonstrate that very good results can be obtained with very low values of k in
search-by-similarity applications.

In practice, MKL can be employed in all the contexts where KL is successfully used: to this
purpose, a generalization of the projection and back-projection operators (formulae (1) and (2)) is
necessary; then:

- the projection of a vector $x \in \mathbb{R}^n$ into the set of subspaces S defining an MKL solution is:

$MKL(x, S) = \langle t, y \rangle$   (3)

where $t = \arg\min_{i=1..s} d_{FS}(x, S_i)$, with $d_{FS}(x, S_i) = \| x - \overline{KL}(KL(x, S_i), S_i) \|_2$, and $y = KL(x, S_t)$;

the scalar t, $1 \leq t \leq s$, denotes the subspace $S_t$ which is best suited to represent x; $d_{FS}$ is
called distance of a pattern from a space, since it geometrically corresponds to the Euclidean
distance of the point x from the hyperplane $S_i$ (for example, in figure 1, $d_{FS}(x, S_{\bar{x}, \Phi_2})$ is denoted
by the dashed line connecting x to y).

- the back-projection, into the original space, of a vector $y \in S_t$, $1 \leq t \leq s$, is:

$\overline{MKL}(\langle t, y \rangle, S) = \overline{KL}(y, S_t)$   (4)

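As a concrete sketch (ours, not from the paper), operators (3) and (4) translate directly into the following Python/NumPy code, assuming each subspace is stored as a pair (x_bar_i, Phi_i) as produced by a learning function like fit_kl above:

    import numpy as np

    def dist_from_space(x, x_bar, Phi):
        # d_FS: Euclidean distance of x from the hyperplane (x_bar, Phi),
        # i.e. the norm of the residual between x and its back-projection.
        y = Phi.T @ (x - x_bar)               # KL projection into the subspace
        residual = x - (Phi @ y + x_bar)      # x minus its back-projection
        return np.linalg.norm(residual), y

    def mkl_project(x, subspaces):
        # Formula (3): choose the subspace S_t minimizing d_FS, then project.
        results = [dist_from_space(x, xb, Phi) for xb, Phi in subspaces]
        t = int(np.argmin([d for d, _ in results]))
        return t, results[t][1]               # the signature <t, y>

    def mkl_backproject(t, y, subspaces):
        # Formula (4): re-map a signature <t, y> into the original space R^n.
        xb, Phi = subspaces[t]
        return Phi @ y + xb
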
4. DATABASE INDEXING AND SIMILARITY SEARCH WITH MKL

Since in MKL the vectors are projected by (3) into different subspaces, it is not possible to
define a searching strategy which evaluates distances in just one subspace (as we did for KL, see
figure 1).
Let $S = \{S_1, \ldots, S_s\}$ be the set of subspaces identifying an MKL solution obtained on a
representative training set extracted from the database. Then, for each multidimensional point x
corresponding to an object $o_x$ in the database, the pair $(ID_{o_x}, \langle t, y \rangle)$, where $\langle t, y \rangle = MKL(x, S)$ is the
signature of $o_x$, is added to an index I.
We define the external distance between a multidimensional point z (corresponding to a searched
object $o_z$) and a generic database object $o_x$ having signature $\langle t, y \rangle$ in I as the Euclidean distance
between z and the back-projection of y into the original space:

$d_E(z, \langle t, y \rangle) = \| z - \overline{KL}(y, S_t) \|_2$   (5)

The external distance (5) operates in the n-dimensional space where the signatures are re-mapped,
and allows all the common queries usually involved in similarity searches to be implemented:
nearest neighbor, spherical range query, etc. (see figure 3 for an example).

Fig. 3. A spherical query (with center z and radius r) on a database indexed through MKL (n=3, s=3, k1=2,
k2=1, k3=2); the objects o3, o6 and o8 are retrieved since the external distances between z and $\langle 1, y_3 \rangle$, $\langle 2, y_6 \rangle$
and $\langle 3, y_8 \rangle$ are less than or equal to r.
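
Building the flat index I and answering the spherical query of figure 3 can then be sketched as follows (again illustrative code, reusing the mkl_project and mkl_backproject helpers sketched in section 3; the database is assumed to be a list of (objectID, vector) pairs):

    import numpy as np

    def build_index(database, subspaces):
        # Flat index I: one pair (objectID, <t, y>) per database object.
        return [(obj_id, mkl_project(x, subspaces)) for obj_id, x in database]

    def range_query(z, r, index, subspaces):
        # Spherical query: return the objects whose external distance (5)
        # from the searched point z is less than or equal to r.
        hits = []
        for obj_id, (t, y) in index:
            d_e = np.linalg.norm(z - mkl_backproject(t, y, subspaces))
            if d_e <= r:
                hits.append(obj_id)
        return hits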

5. EXPERIMENTAL RESULTS

In this section we report some experimental results regarding the MKL vs. KL comparison in
the context of a retrieval-by-similarity application. All the objects are randomly generated (directly
as multidimensional points) in order to have a large number of samples at our disposal and to be able to
repeat each experiment several times to reduce the random component.
The generated objects belong to c distinct classes; each class i, i=1..c, has exactly mc elements
(mc is an even number) and is created according to a multivariate Gaussian distribution $N(c_i, W_i)$
defined by the centroid (or mean vector) $c_i$ and by the covariance matrix $W_i$. The class centroids $c_i$
are randomly generated according to another multivariate Gaussian distribution $N(0, B)$ having
mean vector 0 and covariance matrix B. Hence, the matrix B regulates the distribution of the
classes (between-class distribution), whereas $W_i$ indicates how the patterns are spread around the class
centroids (within-class distribution). In practice, it is reasonable to assume that the clouds of points
associated with the different classes are more compact than the whole ensemble of points and that a
strong correlation exists among the different dimensions. The covariance matrices $W_i$, i=1..c, and
B have been defined according to the above criteria (see [Capp99] for the details). Figure 4 shows
an example for n=2, c=12, mc=20.

Fig. 4. n = 2, c = 12, mc = 20
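
As a concrete (and simplified) sketch of this generation process, the following code (ours) draws the centroids and the class points from the two-level Gaussian model; the isotropic covariance matrices B and W_i below are mere placeholders, not the correlated matrices actually defined in [Capp99]:

    import numpy as np

    def generate_data(n, c, mc, B, W, rng):
        # Two-level Gaussian model: centroids c_i ~ N(0, B) (between-class
        # distribution), then mc points per class ~ N(c_i, W_i) (within-class).
        points, labels = [], []
        for i in range(c):
            c_i = rng.multivariate_normal(np.zeros(n), B)
            points.append(rng.multivariate_normal(c_i, W[i], size=mc))
            labels += [i] * mc
        return np.vstack(points), np.array(labels)

    rng = np.random.default_rng(1)
    n, c, mc = 2, 12, 20                   # the setting of figure 4
    B = 25.0 * np.eye(n)                   # placeholder between-class covariance
    W = [np.eye(n) for _ in range(c)]      # placeholder within-class covariances
    X, labels = generate_data(n, c, mc, B, W, rng)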

The whole set of objects obtained at each random generation is split into two parts: a learning set
LS and a test set TS; each part contains exactly half of the elements of each class (mc/2). The
elements in LS are initially used (as the set P) for the calculation of the KL or MKL solution; each of
them is then stored in the database and indexed by adding the corresponding signature to I. The TS
elements are used for simulating search-by-similarity queries. In particular, nearest-neighbor queries
are performed: for each element $x \in TS$ belonging to the i-th class, the query retrieves from the
database the mc/2 elements which are closest to x (according to the Euclidean distance in the
k-dimensional space for KL, and to the external distance $d_E$ for MKL); a score is associated with the
query according to the fraction of retrieved elements belonging to the i-th class (that is, the same class
as x). The average score $a_s$ is computed over all the queries.
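
The scoring protocol can be summarized by the following sketch (ours), where retrieve(x, j) stands for any function returning the class labels of the j database elements closest to x (according to the Euclidean distance in the KL case, or to the external distance $d_E$ in the MKL case):

    import numpy as np

    def average_score(TS, ts_labels, retrieve, mc):
        # For each test element, retrieve the labels of the mc/2 nearest
        # database objects and score the query by the fraction of them
        # belonging to the same class as the query element.
        scores = [np.mean(retrieve(x, mc // 2) == label)
                  for x, label in zip(TS, ts_labels)]
        return float(np.mean(scores))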

In the case of MKL, using formula (5) for the external distance computation involves operating in
the original space $\mathbb{R}^n$, which can potentially be a very high-dimensional space, thus requiring a lot of
computation. Actually, since the residual $z - \overline{KL}(KL(z, S_t), S_t)$ is orthogonal to the hyperplane $S_t$,
formula (5) can be rewritten (by the Pythagorean theorem) as:

$d_E(z, \langle t, y \rangle) = \sqrt{ d_{FS}(z, S_t)^2 + \| KL(z, S_t) - y \|_2^2 }$   (6)

Using (6) to search a pattern z over the whole database allows the terms $d_{FS}(z, S_t)$ and $KL(z, S_t)$,
which involve operating in $\mathbb{R}^n$, to be computed just once for each subspace $S_t$. Hence, only a small
overhead is introduced with respect to the corresponding KL search.
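
A possible implementation of the search based on (6) is sketched below (ours): the expensive terms involving $\mathbb{R}^n$ are computed once per subspace, after which ranking each signature only costs a distance in a low-dimensional subspace:

    import numpy as np

    def nearest_by_external_distance(z, index, subspaces):
        # Precompute, once per subspace S_t, the terms d_FS(z, S_t) and
        # KL(z, S_t) of formula (6): the only ones involving R^n.
        pre = []
        for xb, Phi in subspaces:
            y_z = Phi.T @ (z - xb)                       # KL(z, S_t)
            d_fs = np.linalg.norm(z - (Phi @ y_z + xb))  # d_FS(z, S_t)
            pre.append((d_fs, y_z))
        # Rank every signature using only cheap low-dimensional distances.
        best_id, best_d = None, np.inf
        for obj_id, (t, y) in index:
            d_fs, y_z = pre[t]
            d_e = np.hypot(d_fs, np.linalg.norm(y_z - y))  # formula (6)
            if d_e < best_d:
                best_id, best_d = obj_id, d_e
        return best_id, best_d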

Two different kinds of experiments are reported in this paper: in the first, we compare the
retrieval accuracy of KL and MKL with the aim of figuring out the role played by the key
parameters k and s; in the second, we progressively increase the database size to understand how the
search-by-similarity performance changes.

Accuracy
Given the problem n=100, c=10, mc=120 (i.e. a database of 600 objects), the following results
have been generated by averaging the output of twenty independent retrieval sessions; in each
session the data are regenerated and 60·10 = 600 60-nearest-neighbor queries are executed. In the left
graph of figure 5, $a_s$ is plotted as a function of the dimensionality k of a single KL space; the best
performance ($a_s$=0.83) is obtained for k=10, thus demonstrating that an increase of k does not
necessarily imply a performance improvement. The graph on the right compares MKL and KL: a
distinct MKL curve is plotted for each s in the range [2..7]. MKL outperforms KL for low values of
k (e.g. for k=1 and s=5, $a_s$=0.98), whereas for k greater than 10-12 the two methods behave similarly.
The very good performance obtained for low values of k makes MKL particularly attractive for
practical applications.

Fig. 5. KL performance (on the left) and MKL performance varying the number of subspaces s (on the right).
In both cases $a_s$ is plotted as a function of k.

Scalability
On the basis of our experimentation carried out on different sets of random data with n=100, we
found k=10 and k=2 to be good choices for KL and MKL, respectively. Therefore, in this
experiment we kept these values constant and analyzed the average score as a function of the
database size.
Given n=100 and mc=120, we initially calculated $a_s$ for c=2 and then progressively added new
classes; at each step, before recomputing the new $a_s$, both the KL and MKL solutions were
recalculated from scratch. Each session, requiring 60·c 60-nearest-neighbor queries, was
executed twenty times; the average values obtained are reported in figure 6. As to MKL, the optimal
number s of subspaces has been automatically determined by fixing in advance a desired reconstruction
error (see [Capp99] for more details).

Fig. 6. KL vs MKL performance as a function of the number of classes c (i.e. of the database size c·mc/2).
The labels on the MKL curve approximately indicate the number of subspaces used.

The graph shows how MKL, by progressively employing a larger number of subspaces, allows
the higher complexity to be kept under control. In fact, whereas the KL performance rapidly decreases
(e.g. for c=10 $a_s$=0.85, for c=35 $a_s$=0.44), MKL maintains a near-constant retrieval efficiency.

6. CONCLUSIONS AND FUTURE RESEARCH

Our preliminary experimentation demonstrates the superiority of MKL (with respect to KL) for
database indexing and similarity searching. In particular, a better retrieval accuracy can be obtained
by using MKL and, as the size of the database increases, KL shows a performance degradation
which is avoided by MKL, where a larger number of subspaces can be created to deal with the
higher complexity.
However, the flat indexing approach presented in this paper suffers from some problems: in
particular, the MKL solution cannot be dynamically adapted in case new objects (not represented in
the initial learning set) must be added to the database. Furthermore, the size of the learning set
cannot be too high, because a very large number of subspaces would have to be created, thus requiring a
very high computational time.
We are designing a new hierarchical (tree-like) structure based on the nesting of MKL subspaces.
Each node corresponds to a subspace; the child nodes of a given node constitute the set S of
subspaces identifying an MKL solution. The lower levels of the tree correspond to MKL solutions
which are progressively more specific for particular families of objects. Inserting new objects which
are not adequately represented by the existing subspaces can cause the creation, rearrangement or
splitting of subspaces. A search by similarity starts from the root node and moves toward the leaves
(where the object signatures are stored). At each level, the search continues only in the nodes
which are closer (with respect to the distance from space $d_{FS}$) than a prefixed tolerance to the
searched object, that is, in the nodes which are the most suitable to represent the searched object.

References

[Capp99] R. Cappelli, D. Maio and D. Maltoni, Multi-space KL for pattern representation and
classification, DEIS internal report, University of Bologna, March 1999.
[Fuku90] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, San
Diego, 1990.
[Gutt84] A. Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching, in Proc.
ACM SIGMOD, pp. 47-57, 1984.
[Jain89] A. K. Jain, Fundamentals of Digital Image Processing. pp.163-174. Prentice Hall,
1989.
[Joll86] I. T. Jolliffe, Principal Component Analysis. Springer Verlag, New York, 1986.
[Kirb90] M. Kirby and L. Sirovich, Application of the Karhunen-Loève procedure for the
characterization of human faces, IEEE Trans. on Pattern Analysis and Machine
Intelligence, vol.12, pp.103-108, January 1990.
[Niev84] J. Nievergelt, H. Hinterberger and K. C. Sevcik, The grid file: an adaptable,
symmetric multikey file structure, ACM Trans. on Database Systems, vol.9, no.1,
1984.
[Pent91] M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive
Neuroscience, vol.3, no.1, pp.71-86, 1991.
[Pent94] A. Pentland, R. W. Picard and S. Sclaroff, Photobook: Tools for content-based
manipulation of image databases, in SPIE Storage and Retrieval of Image and Video
Databases II, no.2185, San Jose, February 1994.
[Swet95] L. Swets, B. Punch and J. Weng, SHOSLIF-O: SHOSLIF for object recognition and
image retrieval (phase II), Tech. Rep. CPS 95-39, Michigan State University,
Department of Computer Science, October 1995.
[Swet96] D. L. Swets and J. Weng, Using discriminant eigenfeatures for image retrieval,
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.18, no.8,
pp.831-836, August 1996.
