Robust ImageGraph: Rank-Level Feature Fusion for Image Search

Ziqiong Liu, Shengjin Wang, Member, IEEE, Liang Zheng, and Qi Tian, Fellow, IEEE
Abstract—Recently, feature fusion has demonstrated its effectiveness in image search. However, bad features and inappropriate parameters usually bring about false positive images, i.e., outliers, leading to inferior performance. Therefore, a major challenge for a fusion scheme is how to be robust to outliers. Towards this goal, this paper proposes a rank-level framework for robust feature fusion. First, we define Rank Distance to measure the relevance of images at rank level. Based on it, Bayes similarity is introduced to evaluate the retrieval quality of individual features, through which true matches tend to obtain higher weight than outliers. Then, we construct the directed ImageGraph to encode the relationship of images. Each image is connected to its K nearest neighbors with an edge, and the edge is weighted by Bayes similarity. Multiple rank lists resulting from different methods are merged via ImageGraph. Furthermore, on the fused ImageGraph, local ranking is performed to re-order the initial rank lists. It aims at local optimization, and thus is more robust to global outliers. Extensive experiments on four benchmark datasets validate the effectiveness of our method. Besides, the proposed method outperforms two popular fusion schemes, and the results are competitive to the state-of-the-art.

Fig. 1. Examples of a good feature and a bad feature. For each query, the top-5 ranked images in the search results of the good feature (the first row) and the bad feature (the second row) are demonstrated. Relevant images are marked with a green dot, and irrelevant ones red. The good features work well in that true match images are retrieved, but bad features rank outliers ahead of true matches.
Index Terms—Image search, feature fusion, ImageGraph.

Z. Liu and S. Wang are with the State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (E-mail: ziqiongliu@gmail.com, wgsgj@tsinghua.edu.cn). Liang Zheng is with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Ultimo, NSW 2007, Australia (E-mail: liangzheng06@gmail.com). Q. Tian is with the University of Texas at San Antonio, 78256, USA (E-mail: qitian@cs.utsa.edu). Corresponding authors: Shengjin Wang and Qi Tian.

I. INTRODUCTION

This paper considers the task of content-based image search. Given a query image, our goal is to retrieve all the images in a database with a similar appearance. Recently, multiple features have been employed to boost the overall performance. To take advantage of the complementary properties of distinct features, various fusion methods have been investigated, ranging from straightforward combination at feature level [30] to integration at indexing level [11, 14, 29] and merging graphs of different rank results [15, 17]. It has been demonstrated that fusion of multiple features keeps pushing the state-of-the-art forward. However, false positive images, i.e., outliers, are inevitably introduced in the fusion, leading to inferior accuracy.

On one hand, outliers are often brought in by bad features. For a specific query, a good feature means its search accuracy is high by itself. By comparison, a feature yielding low search quality is called a bad feature (see Fig. 1). When the adopted feature is a good feature and also complementary to existing ones, a higher performance is expected. Nevertheless, many irrelevant images obtain high ranks due to the low discriminability of bad features. If the to-be-fused feature is a bad feature, the fusion performance may not be guaranteed, and accuracy may get even lower after fusion. In essence, failure in predicting a feature's effectiveness results in undesirable search quality [16]. Multiple cues are directly integrated without considering their effectiveness in [11, 14, 29, 30]; once outliers are introduced by bad features, it is difficult to filter them out. To evaluate the retrieval quality of an individual method, the consensus degree among the top candidates, i.e., Jaccard similarity, is utilized at rank level in [17]. However, when a bad feature is adopted, outliers may be included in the graph. Usually there are many edges linked between the outliers, which is called the Tightly-Knit Community Effect. In this scenario, outliers may obtain a higher consensus degree among neighbors than true matches, yielding unsatisfactory performance.

On the other hand, inappropriate parameters also introduce outliers. In [17, 29], K-reciprocal nearest images are treated as pseudo positive instances, and thus K should equal the number of ground truths. However, it is hard to pre-define K because database images commonly have various numbers of ground truths. If K is inappropriate, the performance may be affected, especially in [17]: the retrieval quality measurement, Jaccard similarity, always varies with K, and gradually loses its effectiveness when K gets larger than the number of ground truths. Therefore, choosing an effective measurement to evaluate the retrieval quality of individual features is the key issue in the robust
Fig. 2. Toy example of the proposed method. The query image is marked with a yellow bounding box, and relevant ones green. Given a query image, two features are used to obtain search results. Then, for each feature, the corresponding ImageGraph is built. In the ImageGraph, each vertex points to its 3 nearest neighbors and the graph is expanded to the second layer. Each edge is weighted by Bayes similarity, reflecting the retrieval quality. In ImageGraph 1, we observe that only one relevant image is directly connected to the query, which means there is one true match in the top-3 ranked images of the initial rank list of Feature 1. Through this relevant image, two other true matches are connected at the second layer of the graph. In ImageGraph 2, the query points to two relevant images directly. ImageGraph 1 and ImageGraph 2 are fused by appending new nodes or re-calculating the edge weights of existing nodes. Based on the fused graph, local ranking is conducted and the images are reranked. Although there are many outliers in the graph, all the true match images are retrieved.
fusion task. Different features may produce scores diverse in numerical values, so the evaluation scheme should measure the importance of images on a unified scale. Besides, a good evaluation should measure a feature's effectiveness correctly, assigning higher weight to relevant images under good features and lower weight to highly-ranked outliers under bad features.

In light of the above analysis, this paper first proposes the Rank Distance to measure the relevance of two images at rank level, based on their ranks when each one is used as query to search for the other. Through this measurement, similarity scores of different features are mapped to a unified scale, thus becoming comparable. Besides, it is illustrated in [12, 54] that the reciprocal neighborhood relationship is a stronger indicator of similarity than the unidirectional nearest neighborhood relationship. Since Rank Distance considers the reciprocal ranks of two images, i.e., the local densities of the vectors, it represents the relevance of images more reliably than the similarity score. Then, to evaluate the retrieval quality of individual features effectively, we introduce the Bayes similarity. It is defined as the posterior probability of two images being a true match. Built on the Rank Distance, we estimate the Bayes similarity through an empirical study.

Our approach adopts the graph-based framework of [17]. Since not only the top-ranked images in the initial search results but also their neighborhoods are included in the graph, similarity can be propagated through the graph. Consequently, true matched images not directly connected to the query can be retrieved. Nevertheless, the undirected graph proposed in [17] builds on K-reciprocal neighbors, which may result in low search recall. In contrast with [17], we construct a directed graph, denoted as ImageGraph. Our method uses the top-K ranked images, so that more candidates (higher recall) can be included in the graph. In addition, we define the edge weight of ImageGraph as Bayes similarity, a better discriminator between relevant and irrelevant images than Jaccard similarity [17], and one that is insensitive to parameter changes. Besides, to avoid being affected by outliers in reranking, local ranking is proposed to re-order the initial result. It aims at local optimization, and thus is more robust to global outliers. Extensive experiments on four image retrieval datasets confirm that the proposed method significantly improves baseline performance. Moreover, it is robust to outliers. A toy example of our fusion system is illustrated in Fig. 2.

The main contributions of this paper are summarized as follows:
• We propose an effective measurement for robust fusion. Rank Distance is first introduced to measure the relevance of images at rank level. Based on it, Bayes similarity is proposed to evaluate the retrieval quality of individual features; it is a good discriminator between relevant and irrelevant images and insensitive to parameter changes.
• We propose the directed ImageGraph structure to encode image-level relationships. ImageGraph builds on the K nearest neighbors, so more candidates can be included in the graph, improving the recall. Besides, the edge weight of ImageGraph is measured by Bayes similarity.
• We propose local ranking to rerank the initial search result, further enhancing the robustness of our method. The proposed ranking algorithm aims at local optimization, so that it is more robust to global outliers.

This paper is an extension of our previous conference publication [51]. Beyond the conference paper, we propose Rank Distance and Bayes similarity for robust evaluation, and reformulate the edge weight of ImageGraph. We also conduct more experiments to better validate the effectiveness of our method, and give more detailed discussions. The rest
of the paper is organized as follows. After a brief review of related work in Section II, we introduce the proposed robust ImageGraph in Section III. Section IV describes the datasets and baselines used in the experiments. Section V presents the experimental results. Finally, conclusions are given in Section VI.

II. RELATED WORK

A. Image Search Pipeline

In image search, a myriad of methods have been proposed in the last decade. Among them, the Bag-of-Words model [23] based on local descriptors is the most popular one. A number of salient local regions are detected from an image with operators such as DoG [19] and Hessian Affine [20]. Subsequently, the extracted regions are represented as high-dimensional feature vectors using SIFT [19] or its variants [21]. Each descriptor is quantized to its nearest visual word in a pre-trained codebook. The codebook is obtained through an unsupervised clustering method, e.g., approximate k-means (AKM) [22] or hierarchical k-means (HKM) [18], and the cluster centers are treated as the visual words of the codebook. Through quantization, each image is represented as a sparse histogram of visual words. Then, fast search is achieved using an inverted file [35] and TF-IDF weights [23, 24, 33].

It has been verified in many works that post-processing can further enhance the quality of search results. Quite a few works refine the initial results using spatial cues, such as [25, 27]. Besides, query expansion [41] uses highly ranked images to learn a latent feature model to expand the original query, improving the recall. Recent studies of reranking adopt image-level cues. For example, K-NN reranking [8] refines the initial rank list automatically using the K nearest neighbors. Alternatively, Qin et al. [12] take advantage of K-reciprocal nearest neighbors to identify the image set. In addition, many works conduct reranking based on complementary cues [15-17], which have shown promising performance. By combining the rank lists or scores of multiple features, the recall is significantly improved and the system is able to find quite challenging occurrences of the query. To some extent, our method belongs to this post-processing category.

Besides, there are also efforts to represent images by their global properties, such as GIST [36, 37], visual attributes [2, 4, 7] and deep learning features [31, 32, 34, 57, 58]. Such holistic features demonstrate their advantages in image search, and they serve as good complements to local ones. Additionally, global features are effective for encoding images with relatively few bits, and are usually combined with dimensionality reduction and approximate nearest neighbor search [38-40].

B. Graph-based Ranking

Graph-based visual reranking has been proven effective for refining text-based video and image search results, integrating both the initial ranking and the visual consistency between images. It constructs a graph in which pairs of visually similar images are connected by an edge, and the initial rank information is propagated through the graph until convergence. Jing and Baluja [5] proposed the VisualRank framework to efficiently model the similarity of Google image search results with a graph. It uses a random walk on an affinity graph, and re-orders images according to the visual hyperlinks. In [46], video search reranking is also formulated as a random walk problem along a context graph. The edge between videos is weighted by a linear combination of the text score and the visual duplicate score, where the visual duplicate score is the similarity calculated with visual features. To handle errors in the initial labeled set, graph-based semi-supervised learning [47] has been applied to web image search. Furthermore, a graph-theoretical framework amenable to noise-resistant ranking is proposed in [45]; in this method, outliers can be removed from the graph by spectral filtering.

Graph-based methods have also received increased attention recently in content-based image search. Xie et al. [42] employ the ImageWeb to discover the nature of image relationships for refining similar-image search results. A directed graph is constructed, and each edge weight is computed as the count of matched features between the pairwise images. Then, HITS [50] is employed to rank images using the affinity values. From the graph-based perspective, incremental query expansion and image-feature voting are developed in [43]. Specifically, Zhang et al. [17] propose an undirected graph-based query specific fusion approach, through which multiple retrieval sets are merged. In this approach, images satisfying the reciprocal neighbor relation are connected, and the edge weight is measured by the consistency of their neighborhoods, i.e., Jaccard similarity. Images are then re-ordered through a link analysis method. Based on this framework, weakly supervised multi-graph learning is proposed in [15] to enhance the reranking performance. Instead, we adopt a directed graph model, in which an image is connected to its top-K ranked images. Specifically, the edge is weighted by Bayes similarity. Further, to be robust to outliers, a safe strategy is used for ranking.

C. Feature Fusion

It has been shown that the combination of multiple features obtains superior performance in image search. In [30], an attribute vector and a Fisher vector are combined at feature level. The fused feature is compressed into small codes by product quantization. This method improves performance for particular object retrieval as well as for categories. Another promising strategy performs feature fusion at indexing level. In [28], a color signature is embedded in the inverted index to filter out false positive SIFT matches. To model the correlation between features, a multi-IDF scheme is introduced in [11], through which different binary features are coupled into the inverted file. Zheng et al. [14] propose a multi-dimensional inverted index, in which each dimension corresponds to one kind of feature. With the multi-index, the retrieval process votes for images in both the SIFT and the other feature spaces. In addition, the semantic-aware co-indexing algorithm [29] leverages global semantic attributes to update the inverted indexes of local features, encouraging semantic consensus among locally similar images.

For late fusion, Zhang et al. [17] propose a graph-based query specific fusion approach at rank level.
TABLE I
NOTATIONS AND DEFINITIONS

Notation              | Definition
----------------------|-----------------------------------------------------------
I = {I1, I2, ..., IN} | I indicates the image set, and Ii indicates the i-th image.
N                     | Total number of images in the dataset.
R(Im, In)             | Rank of In in the rank list obtained with Im as query.
NK(Im)                | K nearest neighbors of Im.
G = (V, E, w)         | G indicates a graph; V, E, and w indicate the set of vertices, the set of edges, and the corresponding edge weights, respectively.
Gs = (Vs, Es, w)      | Subgraph of ImageGraph G induced by the vertex set Vs ⊆ V; Es contains every edge between the vertices in Vs.
d(Im, In)             | Rank Distance between images Im and In.
K                     | The breadth of ImageGraph.
P                     | The depth of ImageGraph.
T(Im)                 | True match image set of Im.
F(Im)                 | False match image set of Im.
In this method, images that are reciprocal K-nearest neighbors are connected with an edge. The edge is weighted by Jaccard similarity to evaluate the retrieval quality of the individual features. Multiple rank lists are merged through the graph, and reranking is achieved by PageRank or by maximizing weighted density. However, the effectiveness of this method varies dramatically with the parameter K. Moreover, it also suffers from bad features. To be resistant to the noise, a Co-Regularized Multi-Graph Learning framework [15] is proposed, incorporating intra-graph and inter-graph constraints in a supervised way. Furthermore, a simple and effective fusion method at score level is proposed in [16]: through a reference codebook constructed off-line, the features' effectiveness is estimated on-the-fly in a query-adaptive manner.

Differently, to be robust in the fusion, we propose a rank-level fusion method without supervision. We adopt the framework of [17], and our work departs from the prior art as follows. Firstly, in contrast with the undirected graph used in [17], we construct a directed graph, denoted as ImageGraph. The former [17] builds on K-reciprocal neighbors, which may result in low search recall; instead, our method uses the K nearest neighbors, so more candidates can be included. Secondly, instead of Jaccard similarity [17], in our approach the edge weight between pairwise images is defined as the Bayes similarity built on Rank Distance, a more effective measurement for evaluating the retrieval quality. Thirdly, local ranking is performed on the fused ImageGraph, which aims at local optimization and improves the robustness of our method.

III. OUR METHOD

Before describing our approach in detail, we formulate our problem here. Our target is to obtain a new rank list according to multiple search results, which can be defined as:

    r = h(R; D),    (1)

where R = {r_1, r_2, ..., r_M} denotes the set of rank lists resulting from M different methods. In the offline process, we take each image in the database as query and get its search result. Then, for each image, we find its K nearest neighbors in the database. The pre-computed search result of the database images is denoted as D; D represents the relevance among the database images.

For method i, where i = 1, 2, ..., M, its ImageGraph G_i is constructed based on the rank result r_i and the pre-computed relevance among the database images D_i, which is written as:

    G_i = g_c(r_i; D_i),    (2)

where g_c denotes the graph construction of Section III-C. Specifically, G can be written as the combination of the multiple individual graphs G_1, G_2, ..., G_M:

    G = g_f(G_1, G_2, ..., G_M),    (3)

where g_f denotes the graph fusion of Section III-D. Finally, the new rank list is calculated by ranking on the ImageGraph G:

    r = g(G).    (4)

In this section, we first present Rank Distance and Bayes similarity in Section III-A and Section III-B. Then we elaborate the construction of ImageGraph in Section III-C, and introduce fusion via ImageGraph in Section III-D. Finally, the ranking algorithm is described in Section III-E. For clarity, we list several important notations and their definitions used throughout the paper in Table I.

A. Rank Distance

Since different features may produce scores diverse in numerical values, it is difficult to compare or weight their importance. Moreover, the initial search list usually contains false positive images, especially when the retrieval quality is bad, and thus the similarity score is not reliable for representing the relevance between images. To address this issue, we propose the Rank Distance to serve as a rank-level measurement. Let I = {I1, I2, ..., IN} denote the image dataset, and NK(Im) denote the K nearest neighbors of Im, where m = 1, 2, ..., N and N is the number of dataset images. Since the local densities of the vectors around Im and In are different, In ∈ NK(Im) does not imply Im ∈ NK(In). It is demonstrated in [12, 54] that the reciprocal neighborhood, i.e., In ∈ NK(Im) and Im ∈ NK(In), is a much stronger indicator of two images being relevant than the unidirectional neighborhood. In this paper, we do not require the reciprocal neighbor relation. Instead, we calculate the distance between two images based on the ranks obtained when each one is used as query to search for the other. Rank Distance is defined as

    d(Im, In) = (R(Im, In) + R(In, Im)) / (2N),    (5)
where R(Im, In) denotes the rank of In in the search result of Im. Here, we normalize R(Im, In) + R(In, Im) to [0, 1] using N. In this scenario, a smaller Rank Distance denotes that the two images are more visually similar, and vice versa. If and only if In is ranked among the top search results of Im and Im is also ranked high in the rank list of In, i.e., both R(Im, In) and R(In, Im) are small, are the two images considered to be a true match [8]. Fig. 3 illustrates the effectiveness of Rank Distance, which helps find true match images while down-ranking outliers. Namely, this measurement is robust to outliers.

Fig. 3. Examples of Rank Distance. For each query, the top-5 ranked images under the baseline Cosine distance are illustrated, where many outliers are introduced. The numbers under the images denote their Rank Distances (in units of 10^-5) to the query. True match images are marked with a green dot, outliers with a red one. It is clear that Cosine distance pushes outliers into the top ranks, but Rank Distance corrects this artifact by increasing the distance between the outliers and the query. It demonstrates that Rank Distance can evaluate the similarity effectively.

Fig. 4. Sample images in the Paris dataset for the empirical study. The top and second rows demonstrate true matches of "Eiffel". We can observe that viewpoint and illumination vary a lot among the true matches. The third and bottom rows show false matches of "Eiffel".
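To make Eq. 5 concrete, the following is a minimal Python sketch of Rank Distance; it assumes the all-pairs rank lists have been precomputed offline by querying every database image, and the function names are ours rather than the paper's.

```python
import numpy as np

def rank_matrix(sim):
    """From an N x N similarity matrix, build R with R[m, n] = rank
    (1-based) of image n in the rank list obtained with image m as query."""
    order = np.argsort(-sim, axis=1)            # descending similarity
    ranks = np.empty_like(order)
    n = sim.shape[0]
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks

def rank_distance(ranks, m, n):
    """Eq. 5: d(I_m, I_n) = (R(I_m, I_n) + R(I_n, I_m)) / (2N)."""
    N = ranks.shape[0]
    return (ranks[m, n] + ranks[n, m]) / (2.0 * N)
```

Because both directions of the ranking enter Eq. 5, an outlier that happens to sit high in one rank list is still penalized by its low rank from the other side, which is exactly the behavior illustrated in Fig. 3.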
B. Bayes Similarity

To evaluate the retrieval quality effectively, we propose the Bayes similarity, which is defined as the probability of In being a true match of Im.

We denote the true matches and false matches of Im as T(Im) and F(Im), respectively. Based on the Rank Distance, the Bayes similarity between images Im and In can be formulated as p(In ∈ T(Im) | d(Im, In)). For simplicity, we define dn = d(Im, In), Tm = T(Im) and Fm = F(Im). According to Bayes' theorem, p(In ∈ Tm | dn) can be rewritten as:

    p(In ∈ Tm | dn) = p(dn | In ∈ Tm) p(In ∈ Tm) / p(dn).    (6)

As In belongs to either Tm or Fm, we have

    p(dn) = p(dn | In ∈ Tm) p(In ∈ Tm) + p(dn | In ∈ Fm) p(In ∈ Fm).    (7)

By combining Eq. 6 and Eq. 7, we get

    p(In ∈ Tm | dn) = ( 1 + [p(In ∈ Fm) / p(In ∈ Tm)] · [p(dn | In ∈ Fm) / p(dn | In ∈ Tm)] )^(-1).    (8)

In the Bayes similarity, p(dn | In ∈ Tm) and p(dn | In ∈ Fm) are the prior probability distributions of dn. These distributions can be estimated through an empirical study. Moreover, the number of true matches is far smaller than that of false matches; hence, the ratio of p(In ∈ Fm) to p(In ∈ Tm) is generally a very large term. Typically, we use an independent dataset, Paris 6K, to perform the empirical study. This dataset contains 6,385 images collected from Flickr by searching for particular Paris landmarks. It features 55 queries of 12 different landmarks. Some sample images are shown in Fig. 4.

To obtain the rank lists, each image is taken as query using the BoW feature. In experiments, we found that different features follow similar distributions, and the choice of feature does not have an obvious impact on the estimation; thus we use BoW here. Then, we compute the Rank Distance of each query to its true matches and false matches, based on which the two prior probability density distributions are drawn in Fig. 5. From Fig. 5, we can easily see that the two distributions have a clear separation. For the Rank Distance distribution of true matches, the percentage decreases rapidly with Rank Distance. In contrast, false matches follow a normal distribution N(u, σ), in which u is about 0.5. As the variance is relatively large, its percentage value is usually below 3%. Specifically, more than 50% of true matches have a Rank Distance smaller than 0.04, compared to 4% of false matches. This also implies that Rank Distance can reflect the relevance between images to some extent, distinguishing true matches from outliers.

Consequently, according to Eq. 8, the estimated Bayes similarity is shown in Fig. 6. Here, we set the ratio of p(In ∈ Fm) to p(In ∈ Tm) to 500; we find in our preliminary experiments that this ratio does not have a significant impact on search accuracy. It can be observed that the distribution is well approximated by

    p(In ∈ Tm | dn) = α / dn.    (9)

As α is a constant, we set α to 1 for convenience.
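The empirical estimation described above can be reproduced with simple histograms. The sketch below is ours: the binning is an arbitrary choice, and the prior ratio p(In ∈ Fm)/p(In ∈ Tm) is fixed to 500 as in the text.

```python
import numpy as np

def estimate_bayes_similarity(d_true, d_false, prior_ratio=500.0, bins=100):
    """Estimate p(I_n in T_m | d_n) of Eq. 8 from Rank Distance samples
    observed for true matches (d_true) and false matches (d_false)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p_d_true, _ = np.histogram(d_true, bins=edges, density=True)    # p(d | true)
    p_d_false, _ = np.histogram(d_false, bins=edges, density=True)  # p(d | false)
    likelihood_ratio = p_d_false / np.maximum(p_d_true, 1e-12)
    return edges, 1.0 / (1.0 + prior_ratio * likelihood_ratio)      # Eq. 8
```

As noted above, the resulting curve is well approximated by the closed form of Eq. 9 (p ≈ α/dn), so at query time only that closed form is needed.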

C. Construction of ImageGraph

In [17], images satisfying the K-reciprocal neighborhood relation are connected with an undirected edge, and the edge weight is measured by neighborhood consistency. However, the K-reciprocal neighborhood relation may filter out some potential candidates during the construction of the graph; as a result, recall may not be guaranteed. Moreover, the neighborhood consistency is determined by the parameter K, and thus the fusion performance is sensitive to K. Therefore, we propose the ImageGraph structure to encode the image-level relationships. In our approach, we take into account the K nearest neighbors and build a directed graph, where K denotes the breadth of the ImageGraph. The ImageGraph centered at query q can be represented as G = (V, E, w). V = {v1, v2, ..., vN} indicates the set of vertices, where vm is the vertex corresponding to image Im. E is the set of edges: if In belongs to NK(Im), there is a directed edge (Im, In) ∈ E linking vertex vm to vn. The edge weight w is defined as the Bayes similarity of the connected images. Substituting Eq. 5 into Eq. 9, we have

    w(vm, vn) = 2N / (R(Im, In) + R(In, Im))   if In ∈ NK(Im),
    w(vm, vn) = 0                              otherwise.    (10)

For query q, its top-K ranked images are connected by q, forming the first layer of the ImageGraph. The vertices in the first layer then continue to link to their K nearest neighbors as child vertices. Each linking edge is weighted by Bayes similarity, through which the retrieval quality is evaluated. The ImageGraph is expanded in this manner until the depth of the graph reaches a threshold P. Here, the depth of the ImageGraph means the shortest path between the starting vertex q and a terminal vertex. The algorithm of ImageGraph construction using one feature is illustrated in Algorithm 1.

Fig. 5. Rank Distance distributions of (a) true match images and (b) false match images.

Fig. 6. Probability distribution of Bayes similarity.
Algorithm 1 Construction of ImageGraph
Off-line:
1. Given a dataset I = {I1, I2, ..., IN}, take each image as query and get its search result.
2. For each image Im, where m = 1, 2, ..., N, add a directed edge from its vertex vm to each vertex corresponding to a top-K ranked image in the pre-computed search result D. Each edge is weighted by Bayes similarity according to Eq. 10.
On-line:
1. For query q, compute its search result with the given feature.
2. Add a directed edge from the query's vertex to each vertex corresponding to its top-K ranked images. Calculate the edge weights according to Eq. 10. The newly added vertices form the first layer of the graph.
3. For each newly added vertex, add directed edges from it to its top-K ranked images in the pre-computed search result D. The edge weights are computed as Bayes similarity according to Eq. 10.
4. Repeat step 3 until the depth of the graph reaches P.
5. Output the ImageGraph G = (V, E, w).
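A compact Python sketch of Algorithm 1 and the edge weight of Eq. 10 follows. It is a sketch under our own conventions: the offline N x N rank matrix from Section III-A is reused, the query is represented by the node -1, and for the query's own out-edges the reciprocal rank is approximated by the forward rank, since the query is not part of the precomputed matrix.

```python
import numpy as np
from collections import deque

def edge_weight(ranks, m, n):
    """Eq. 10 with alpha = 1: w = 2N / (R(I_m, I_n) + R(I_n, I_m))."""
    N = ranks.shape[0]
    return 2.0 * N / (ranks[m, n] + ranks[n, m])

def build_image_graph(ranks, query_topk, K, P):
    """Directed ImageGraph around one query (Algorithm 1, a sketch).
    ranks      : precomputed N x N rank matrix of the database (offline).
    query_topk : database images returned for the query, best first (online).
    Returns a dict {(u, v): weight}."""
    N = ranks.shape[0]
    edges, frontier, seen = {}, deque(), set(query_topk[:K])
    for r, n in enumerate(query_topk[:K], start=1):   # first layer
        edges[(-1, n)] = 2.0 * N / (r + r)            # approximation, see above
        frontier.append((n, 1))
    while frontier:                                   # expand layer by layer to depth P
        m, depth = frontier.popleft()
        if depth >= P:
            continue
        for n in np.argsort(ranks[m])[1:K + 1]:       # top-K of m, skipping m itself
            n = int(n)
            edges[(m, n)] = edge_weight(ranks, m, n)
            if n not in seen:
                seen.add(n)
                frontier.append((n, depth + 1))
    return edges
```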

D. Fusion of Multiple ImageGraphs

As the rank result is encoded in the ImageGraph, we can fuse multiple rank results efficiently via graph fusion. To this end, we combine the multiple graphs Gi = (Vi, Ei, wi) obtained with different features, without supervision [17]. The fused graph is denoted as G = (V, E, w), which can be written as:

    V = ∪_i V_i,  E = ∪_i E_i,    (11)

    w(vm, vn) = Σ_i w_i(vm, vn).    (12)

The vertices and edges of the fused ImageGraph are the union of those of the individual graphs, and the weight is the sum of the w_i. Since each graph describes the image-level relevance in a different feature space, more comprehensive relationships between images are represented in the fused ImageGraph. On one hand, due to the complementary nature of multiple features, candidates which are challenging to find using one feature may be easier to retrieve in another feature space; thus, more candidates are included, improving the recall. On the other hand, positive images with high similarity are easily found in multiple feature spaces and are assigned larger weights after fusion. On the contrary, as negative images cannot be retrieved in all feature spaces, their edge weights will be relatively smaller. In this way, the precision is improved.
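Fusing the per-feature ImageGraphs according to Eqs. 11 and 12 then amounts to a dictionary merge; a minimal sketch:

```python
from collections import defaultdict

def fuse_image_graphs(graphs):
    """Eqs. 11-12: union of vertices/edges across the individual
    ImageGraphs; weights of shared edges are summed."""
    fused = defaultdict(float)
    for g in graphs:                      # each g is {(u, v): weight}
        for edge, w in g.items():
            fused[edge] += w
    return dict(fused)
```

Edges supported by several features accumulate weight, which is how the fused graph favors consistently retrieved candidates over feature-specific outliers.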
E. Local Ranking

PageRank [49] is a query-independent link analysis method that ranks on the whole graph. Ranking by maximizing weighted density [17] starts from the query and ranks a subset of the graph related to the specific query; it sorts the nodes by their degrees, i.e., the sum of the weights of the connected edges. However, these ranking methods suffer from outliers. A large K or bad features may bring a lot of outliers into the graph. Usually, there are many edges linked among these irrelevant images,
which is called the Tightly-Knit Community Effect. In this situation, ranking by maximizing weighted density [17] and the link analysis method [17] may deviate from the query and up-rank the noise.

To tackle this problem, we adopt a safe strategy and perform local-based ranking. The proposed ranking only considers a local optimum instead of the global maximum, avoiding being confused by the tightly connected outliers. Since a higher edge weight reflects a higher relevance to the query, we naturally aim to find the maximum-weighted subgraph Gs starting from q. The subgraph Gs = (Vs, Es, w) is induced by the vertex set Vs ⊆ V; Es contains every edge between the vertices in Vs. We also define the candidate set C as the vertices that Vs points to. Specifically, we initialize the subgraph as G_s^0 = ({q}, ∅, w), and C^0 contains the vertices connected by q. At the (i+1)-th iteration, the vertex in C^i which introduces the maximum weighted edges is included into G_s^(i+1), denoted as v_s^(i+1):

    v_s^(i+1) = argmax_{v_s^(i+1) ∈ C^i} [ Σ_{(vm,vn) ∈ E_s^(i+1)} w(vm, vn) − Σ_{(vm,vn) ∈ E_s^(i)} w(vm, vn) ].    (13)

This procedure continues until the number of nodes in Gs satisfies the user's requirement. The nodes are ranked according to the order in which they are incorporated into Gs. The algorithm of local ranking is illustrated in Algorithm 2.
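Before the formal listing in Algorithm 2 below, here is a short Python sketch of this greedy selection (our own naming, reusing the edge-dict representation from the construction sketch): at each step the candidate absorbing the largest total weight of edges into the growing subgraph is selected.

```python
def local_ranking(edges, query=-1, num_results=100):
    """Greedy local ranking (Eq. 13 / Algorithm 2, a sketch).
    edges: dict {(u, v): weight} of the fused ImageGraph."""
    selected = {query}
    ranked = []
    candidates = {v for (u, v) in edges if u == query}
    while candidates and len(ranked) < num_results:
        def gain(v):
            # Eq. 13: total weight of edges between v and the current subgraph.
            return sum(w for (a, b), w in edges.items()
                       if (a == v and b in selected) or (b == v and a in selected))
        best = max(candidates, key=gain)
        ranked.append(best)
        selected.add(best)
        candidates.discard(best)
        candidates |= {v for (u, v) in edges if u == best and v not in selected}
    return ranked
```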
Algorithm 2 Local Ranking
1. Initialize the subgraph G_s^0 as ({q}, ∅, w) and C^0 as the vertices that q points to.
2. At the (i+1)-th iteration, the vertex in C^i which introduces the maximum weighted edges is added into G_s^(i+1) according to Eq. 13.
3. Update G_s^(i+1) and C^(i+1).
4. Repeat steps 2 and 3 until the number of nodes in G_s satisfies the user's requirement.
5. Output G_s. The vertices are ranked according to the order in which they are incorporated into G_s.

IV. DATASETS AND BASELINES

A. Datasets

To evaluate the effectiveness of our approach, we conduct experiments on Holidays [25], UKBench [18], Oxford [27] and Flickr 1M [25].

Holidays: The Holidays dataset consists of 1,491 personal holiday images, 500 of which are queries. Most queries have fewer than 4 ground truth images, which undergo various changes. The Average Precision (AP) is used to evaluate the retrieval performance of each query; it is calculated as the area under the Precision-Recall curve. The APs of all query images are averaged, yielding the mean Average Precision (mAP), which is employed to measure the retrieval accuracy on this dataset.

UKBench: The UKBench dataset contains 10,200 images of 2,550 objects. Each object has 4 images with different viewpoints and illuminations. In this dataset, each image serves as a query. The performance is measured by the N-S score (maximum 4), which is the recall of the top-4 candidate images.

Oxford: The Oxford Buildings dataset consists of 5,062 images collected by searching for particular Oxford landmarks on Flickr. This dataset has a comprehensive ground truth for 11 different landmarks, each with 5 possible queries. Each query has many true match images taken from different viewpoints, some with partial occlusion or distortion. Retrieval accuracy is measured by mean Average Precision (mAP).

Flickr 1M: The Flickr 1M dataset includes 1 million images arbitrarily collected from Flickr. This dataset can be added to the above datasets as distractors for large-scale experiments.

B. Features and Baselines

In this paper, we exploit four features: GIST [36], HSV histogram, Convolutional Neural Network (CNN) features, and Bag-of-Words (BoW).

GIST: To compute the GIST descriptor, we resize the images to 256×256 following [16]. An l2-normalized 512-dim GIST descriptor is extracted for each image using 4 scales and 8 orientations. Nearest neighbor search is performed based on cosine distance.

HSV: For each image, we compute a 1000-dim HSV color histogram using 20×10×5 bins for the H, S, V components, respectively. The l2-normalized histogram is used for nearest neighbor search with cosine distance.

CNN: For an input image, we extract the l2-normalized 4096-dim CNN descriptor from the 6-th layer of the Caffe network [48]. Similarly, cosine distance is defined as the similarity function between images. Besides, we also fine-tune the CNN feature following [53]; the re-trained feature is denoted as CNN*.

BoW: For Holidays and UKBench, a 200K codebook is trained on the Flickr60K [25] dataset. The 128-bit Hamming signature [25] of each SIFT descriptor is embedded in the inverted file to filter out false matches. The Hamming threshold and weighting parameter are set to 52 and 26, respectively. For Oxford 5K, a 1M codebook is trained on the Paris 6K dataset [26]. Moreover, rootSIFT [21], the burstiness strategy [10], multiple assignment [9] and pIDF [24] are employed on both datasets to enhance the performance.

Search results on the three datasets are presented in Table II. BoW achieves good performance, obtaining 80.05% in mAP, 3.582 in N-S score, and 75.31% in mAP on Holidays, UKBench and Oxford, respectively. By contrast, GIST leads to poor performance on these datasets, yielding 34.14% in mAP, 1.856 in N-S score, and 12.96% in mAP on the three datasets, respectively. Moreover, HSV and CNN result in moderate accuracy on Holidays and UKBench. Note that the global features, i.e., HSV, GIST and CNN, do not work well on Oxford. This is because most images in Oxford contain buildings, which are difficult to describe using global features. After fine-tuning, the performance of CNN is improved consistently on the three datasets. Specifically, on Oxford, the re-trained feature CNN* enhances the original performance by about 10% in mAP.
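For reference, the nearest-neighbor baseline used for the global features (GIST, HSV, CNN) reduces to a dot product between l2-normalized descriptors; a small sketch with our own names:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """l2 normalization, as applied to the GIST/HSV/CNN descriptors."""
    return x / max(np.linalg.norm(x), eps)

def cosine_search(query_desc, database_desc, topk=10):
    """Rank database images by cosine similarity; both inputs are assumed
    to be l2-normalized, so cosine similarity is a dot product."""
    scores = database_desc @ query_desc      # (N,) similarity scores
    order = np.argsort(-scores)[:topk]
    return order, scores[order]
```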

TABLE II
PERFORMANCE OF THE BASELINES ON THREE DATASETS

Dataset           | GIST  | HSV   | CNN   | CNN*  | BoW
------------------|-------|-------|-------|-------|------
Holidays, mAP (%) | 34.14 | 61.21 | 69.22 | 72.34 | 80.05
UKBench, N-S      | 1.856 | 3.195 | 3.397 | 3.502 | 3.582
Oxford, mAP (%)   | 12.96 | 13.29 | 44.56 | 54.14 | 75.31

Fig. 7. mAP results against different values of the breadth K and depth P. BoW and GIST are fused on (a) Holidays, and BoW and CNN are fused on (b) Oxford.
V. EXPERIMENTS

A. Parameter Tuning

Two parameters are involved in the construction of the ImageGraph: the breadth K and the depth P. In our method, K determines the number of candidates connected to a vertex, and P determines the distance over which affinity is propagated on the graph. We test different combinations of K and P on Holidays and Oxford; the experimental results are shown in Fig. 7. As the number of ground truths per query is very small on Holidays and relatively large on Oxford, we test different ranges of K for them.

It is notable that increasing K and P jointly helps to improve the performance. The mAP first increases with P, then generally stays stable when P becomes large. The reason is that the true matches which are not directly retrieved by the query can be sufficiently exploited with a large P, boosting the recall; once all potential candidates are retrieved, the performance no longer increases. In addition, the mAP is enhanced when K gets large, since more top-ranked candidates are included in the ImageGraph. The performance then reaches saturation at K = 10 on Holidays and K = 40 on Oxford. Notably, on Holidays, the performance for K = 4 does not improve even when P increases to 16. We speculate that the breadth K limits the effect of the depth P: the true match images which are filtered out by a small K are difficult to retrieve even with a large P. From these results, we set K = 10 for Holidays and UKBench, and K = 40 for Oxford. Moreover, P is set to 8 in our experiments.

B. Fusion Results

To verify the effectiveness of our method, we first evaluate the fusion of two features. The fusion results on the three datasets are shown in Fig. 8. It is evident that the fusion brings consistent benefit to the various feature combinations.

On Holidays, by combining GIST, HSV, and CNN, the BoW performance is boosted to 85.51%, 87.71%, and 88.48% in mAP, respectively. Note that fusion of two global features also boosts the overall performance. For HSV and CNN, which have moderate performance, their combination achieves an mAP of 79.09%, improving the individual baselines of HSV and CNN by 17.88% and 9.87%, respectively. When the bad feature GIST is merged, the fusion still yields stable improvement: after being fused with GIST, the performance of BoW, HSV, and CNN is increased by 5.46%, 7.01% and 1.96% in mAP, respectively. Similar results can be observed on UKBench. The N-S score of BoW is enhanced from 3.582 to 3.703 and 3.834 through fusion with GIST and HSV, respectively. In particular, the BoW performance is further improved to 3.907 in N-S score when fused with CNN, approaching the maximum N-S score of 4. On Oxford, although the global features have poor discrimination for building images, the performance of BoW is not harmed by fusion with these features. Instead, the combination of BoW with GIST, HSV, and CNN obtains mAPs of 79.08%, 80.06%, and 81.27%, respectively.

In addition, the results of multiple-feature fusion are demonstrated in Table III. When multiple features are fused, the performance is further boosted. The fusion of the four features achieves 90.28% in mAP, 3.916 in N-S score and 82.05% in mAP on Holidays, UKBench and Oxford, respectively. Moreover, with the re-trained CNN feature, our results are further enhanced to 90.89% in mAP, 3.920 in N-S score and 84.92% in mAP, respectively.

C. Comparison with Other Fusion Approaches

To further illustrate the strength of our method, we compare our results with two state-of-the-art fusion approaches: graph fusion [17] and score fusion [16]. We use their released code and default parameters in the following experiments. Multiple features are fused with BoW on Holidays and UKBench, and the comparisons are presented in Fig. 9, which shows that on both datasets our method outperforms graph fusion and score fusion.

On Holidays, score fusion is superior to graph fusion for each feature combination. Notice that graph fusion suffers from bad features on this dataset: when BoW is fused with GIST, its performance decreases by 5% in mAP. In the same scenario, score fusion improves the BoW baseline by 0.83%, while our method enhances the performance by 5.46%, a significant improvement over graph fusion and score fusion. A similar phenomenon can be observed when BoW is combined with HSV. When fused with CNN, graph fusion, score fusion and our method boost the BoW baseline by 4.26%, 6.22%, and 8.48% in mAP, respectively. Furthermore, by making use of all four features, graph fusion, score fusion, and our method attain mAPs of 83.55%, 87.98%, and 90.28%, respectively.

On UKBench, the performance of graph fusion is better than that of score fusion. The gain of score fusion with GIST, 0.008 in N-S score, is only slightly above the BoW baseline. By comparison,

Fig. 8. Fusion results of two features on (a) Holidays, (b) UKBench and (c) Oxford. Six feature combinations are presented, i.e., BoW+GIST, BoW+HSV, BoW+CNN, HSV+CNN, HSV+GIST and CNN+GIST. The green bar and the blue bar represent the results of the first and the second feature, respectively, while the yellow bar shows the fusion result.

graph fusion and our approach enhance the BoW baseline by 0.089 and 0.121 in N-S score, respectively. Good features bring further benefit in the fusion. When combined with HSV, BoW is increased by 0.228, 0.173, and 0.252 in N-S score using graph fusion, score fusion, and our method, respectively. Similarly, fusion with CNN brings benefits of 0.301, 0.22, and 0.325 in N-S score through graph fusion, score fusion, and our method, respectively. When all features are fused together, the three methods attain N-S scores of 3.894, 3.841, and 3.916, respectively.

In summary, compared to graph fusion, our method is not only resistant to bad features, but also brings superior improvement. In addition, compared to score fusion, our method yields better performance, owing to the higher recall brought by the ImageGraph.

Fig. 9. Comparison with graph fusion [17] and score fusion [16]. Five feature combinations are presented on (a) Holidays and (b) UKBench. The yellow bar represents the BoW baseline, while the blue, orange and gray bars show the results of graph fusion, score fusion and our method, respectively.

D. Evaluation of Robustness

In this section, we demonstrate the robustness of our approach to outliers. It is shown in [52] that the graph fusion approach is robust to random noise: in the experiments of [52], random noise is added to the rank results of the features, i.e., the retrieved results are replaced with randomly assigned values. In our method, the outliers refer to natural noise, which exists in the original rank result and is usually caused by the feature itself. Compared to random noise, natural noise is more difficult to tackle.

The outliers in the ImageGraph are introduced in two ways. On one hand, when K is larger than the number of ground truths, a lot of outliers are included in the graph. Thus, we first evaluate the fusion results when K varies, as illustrated in Fig. 10. To validate our method, we compare our results with graph fusion [17].

Fig. 10 shows that graph fusion is very sensitive to the parameter K. On Holidays, its performance decreases as K gets large. On UKBench, the N-S score first rises with K and then drops after reaching a peak at K = 4. This implies that the graph fusion method achieves its best performance when K is about the number of ground truths of the dataset; when K becomes large and more outliers are introduced, the performance drops significantly. Additionally, fusion with the bad feature, i.e., "B+G", leads to a more rapid descent, compared to the combinations "B+G+H" and "B+G+H+C".

In comparison, the performance of our method increases with K and then stays stable. On Holidays, at K = 20 our method yields mAPs of 84.69%, 88.04% and 90.18% using the combinations "B+G", "B+G+H" and "B+G+H+C", respectively, while graph fusion decreases to 48.96%, 67.10% and 74.58%, respectively. On UKBench, for the three combinations, our method keeps the performance at 3.678, 3.836 and 3.904 in N-S score at K = 20, compared to 2.746, 3.328 and 3.581 for graph fusion. This illustrates the robustness of our
approach, which can still promote true match images when there are a lot of outliers in the graph.

On the other hand, bad features also result in outliers. From Fig. 8 and Table III, we can see that the performance of our method is resistant to bad features. Furthermore, when a bad feature is fused, our method can still exploit its complementary cues, improving the performance. On Holidays, BoW is enhanced by 5% in mAP through the combination with GIST, and when GIST is fused with the other three features, it still improves the performance from 90.02% to 90.28% in mAP. The same behavior is observed on UKBench and Oxford. This reveals the robustness of our method from another perspective.

Fig. 10. Fusion results with various K on (a) Holidays and (b) UKBench. Three feature combinations are tested and the performance is compared with graph fusion. The abbreviations "B", "G", "H", and "C" represent BoW, GIST, HSV, and CNN, respectively. The solid lines represent our method and the dashed lines graph fusion ([17]).

E. Large Scale Experiments

To test the scalability of the proposed method, we perform large scale experiments on Holidays. The Flickr 1M images are added to the Holidays dataset as distractors. In this experiment, the dimension of the global features is reduced to 128-D by Principal Component Analysis (PCA). The results are shown in Table III. They show that fusion of multiple features consistently improves the search accuracy on the large scale dataset. HSV and CNN boost the performance of BoW + GIST from 69.91% to 75.26% and 76.24% in mAP, respectively. Moreover, the combination of BoW, HSV and CNN obtains 77.21% in mAP. When the four features are fused, the performance achieves an mAP of 77.22% on the Holidays + Flickr 1M dataset. With the re-trained CNN feature, the performance is boosted to 77.82%.

Fig. 11. Reranking results for single features on (a) Holidays and (b) UKBench.

F. Reranking with an Individual Feature

For an individual feature, our approach can be applied to refine the initial rank result. There are two kinds of popular reranking methods: one category is based on feature-level cues, such as RANSAC [27], query expansion [41], etc.; the other category employs image-level cues for reranking. Our method belongs to the latter. By making use of the image-level relationships reflected by the ImageGraph, true match images can be promoted and outliers lowered, so the precision is enhanced. Besides, through the graph structure, affinity values can be propagated to images which are not directly found by the query, so challenging candidates can be retrieved. The reranking results on Holidays and UKBench are shown in Fig. 11. Notice that our method refines the baselines greatly and consistently. On Holidays, the baselines of GIST, HSV, CNN, and BoW are improved by 4.3%, 2.65%, 4.14%, and 4.37% in mAP, respectively. Similarly, on UKBench, the performance of BoW, HSV, and CNN is improved by 0.138, 0.177, and 0.224 in N-S score, respectively. Moreover, the bad feature, GIST, is also enhanced from 1.856 to 2.066 in N-S score. The reranking performance reflects the effectiveness of the ImageGraph.

G. Comparison with the State-of-the-art

We compare our results with the state-of-the-art in Table IV. Our method achieves mAP = 90.89%, N-S = 3.920, and mAP = 84.92% on Holidays, UKBench, and Oxford, respectively. For the large scale dataset, Holidays + Flickr 1M, we obtain an mAP of 77.82%. Our method outperforms the state-of-the-art approaches on Holidays, exceeding the result reported in [16] by 2.91%. On UKBench, we also achieve the best N-S score of 3.920; our result is slightly higher than [16], by 0.08 in N-S score. On Oxford, due to the effectiveness of CNN*, our results are comparable to [13, 15]. For the large scale experiment, our result is also competitive with the other methods. Examples of retrieval results on the three datasets are shown in Fig. 12.

H. Time and Cost

In our method, each image is used as a query with the given feature, and we compute and store the relevance relationships. Because this operation is offline and required only once, the time complexity is affordable. The space complexity is O(KN) for storing the connectivity. The online computational cost is small since it only considers the top candidates. Specifically, the time complexity for constructing the ImageGraph of each query is O(K^P). Moreover, ranking on the graph requires O(L^2 K), where L is the expected number of retrieved images. Our experiments are performed on a server with a 3.46 GHz CPU and 128 GB memory. On Holidays + Flickr 1M, the average query times of BoW and the global features are 396 ms and 189 ms,
TABLE V
P OST- PROCESSING TIME ON H OLIDAYS + 1M D ATASET VI. C ONCLUSIONS
This paper proposes a graph-based method for robust feature
Methods Ours [17] [52] [16] [15] [12] fusion at rank level. We first define Rank Distance to measure
Time (ms) 5.36 1 1 10 2210 30 the relevance of images on rank level. Then, based on it, we
H. Time and Cost

In our method, each image is used as a query with each given feature, and we compute and store its relevant relationships. Because this operation is offline and required only once, its time complexity is affordable. The space complexity is O(KN) for storing the connectivity. The online computational cost is small, since only the top candidates are considered. Specifically, the time complexity for constructing the ImageGraph of each query is O(KP), and ranking on the graph requires O(L²K), where L is the expected number of retrieved images. Our experiments are performed on a server with a 3.46 GHz CPU and 128 GB memory. On Holidays + Flickr 1M, the average query times of BoW and the global features are 396 ms and 189 ms, respectively. Each image ID costs about 21 bits. In large-scale image search, we store the 4 nearest neighbors of each image, so 105 bits are needed per image per feature; the memory cost of 1 million images for a single feature is about 0.09 GB. ImageGraph construction usually takes 5.2 ms and ranking 0.16 ms, which is relatively small compared with the query time. The behavior is similar on UKBench and Oxford.
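As an illustration of this storage layout, the sketch below keeps, for each feature, the K nearest neighbors of every image together with their edge weights. The container types, function names, and example weights are assumptions made for exposition; a real implementation would bit-pack the roughly 21-bit image IDs rather than use full-width integers.

# Sketch of the offline connectivity store: for each feature, every image
# keeps its top-K neighbors and the corresponding edge weights. All names
# and containers here are illustrative assumptions, not the paper's code.
from typing import Dict, List, Tuple

K = 4  # neighbors stored per image in the large-scale setting

# feature name -> image id -> [(neighbor id, Bayes-similarity weight), ...]
ImageGraphStore = Dict[str, Dict[int, List[Tuple[int, float]]]]

def add_image(store: ImageGraphStore, feature: str, image_id: int,
              ranked_neighbors: List[Tuple[int, float]]) -> None:
    """Keep only the top-K neighbors of image_id under the given feature.
    Space grows as O(KN) over N database images, as noted above."""
    store.setdefault(feature, {})[image_id] = ranked_neighbors[:K]

store: ImageGraphStore = {}
add_image(store, "BoW", 7, [(12, 0.93), (45, 0.71), (3, 0.52),
                            (88, 0.40), (19, 0.22)])
print(store["BoW"][7])  # the 4 retained (neighbor, weight) pairs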
Table IV compares the average query time and memory cost on the 1 million dataset with the state-of-the-art methods. Since we use four features in our experiments, the average query time is about 0.868 s. Note that query time depends on many factors, such as the machine used and the number of features per dataset, so it is not directly comparable across methods; still, it roughly indicates the time efficiency of the proposed approach. Moreover, our method is a post-processing algorithm that operates on a given rank list. We therefore compare the time of the post-processing steps of the proposed method with the other post-processing methods considered in Table IV; Table V shows the comparison. The post-processing steps of our method, i.e., ImageGraph construction and ranking, cost 5.36 ms. Most of the post-processing methods in Table V cost a few milliseconds, except [15], because [15] uses a supervised framework that spends considerable time building the anchors.
The memory cost of the proposed method is about 0.36 GB. The approach of [16] evaluates the retrieval quality online with a score curve rather than the neighborhood relationship; it only stores the reference book, which costs 0.076 GB of extra memory. Since both [14] and [11] store binary signatures of features in the inverted file, their memory costs are 6.1 GB. A large amount of image-level information is stored in [12], whose cost is 22.35 GB. Besides, our method adopts the same framework as [15, 17, 52], so the memory costs of these methods are in theory similar to ours.
I. Concept Detection

To further validate our method, we perform concept detection experiments on the Flickr25000 dataset [56]. We randomly select 2,000 images from the dataset as queries; the remaining images serve as the database. For each query, we calculate its distance to each concept class using the image-to-category distance [55]. After obtaining the rank lists of the different features, we fuse them with our proposed method, and again use mAP to measure the performance. BoW, CNN, HSV, and GIST obtain 32.1%, 42.9%, 14.6%, and 9.4% in mAP, respectively; the CNN feature achieves the best performance on the concept detection task. Fused with BoW, HSV, and GIST, the CNN result is improved to 49.2%, 42.6%, and 42.8% in mAP, respectively. The fusion of all four features obtains an mAP of 50.5%.

Fig. 12. Examples of retrieval results from the Holidays (top), UKBench (middle), and Oxford (bottom) datasets. For each query, the top-10 ranked images returned by GIST (first row), HSV (second row), CNN (third row), BoW (fourth row), and ImageGraph feature fusion (fifth row) are shown. True matches are marked with a green dot, and false matches with a red dot.

TABLE III
FUSION RESULTS OF DIFFERENT FEATURE COMBINATIONS ON BENCHMARKS.

Feature Combinations       Holidays, mAP (%)   UKBench, N-S   Oxford, mAP (%)   Holidays+Flickr1M, mAP (%)
BoW + GIST                 85.51               3.703          79.08             69.91
BoW + HSV                  87.71               3.843          80.06             75.25
BoW + CNN                  88.48               3.907          81.27             76.17
BoW + CNN*                 89.31               3.913          84.80             77.08
BoW + GIST + HSV           88.52               3.855          79.97             75.26
BoW + GIST + CNN           88.76               3.905          81.96             76.24
BoW + GIST + CNN*          89.40               3.914          84.80             77.09
BoW + HSV + CNN            90.02               3.916          82.01             77.21
BoW + HSV + CNN*           90.89               3.920          84.91             77.82
BoW + GIST + HSV + CNN     90.28               3.916          82.05             77.22
BoW + GIST + HSV + CNN*    90.89               3.920          84.92             77.82

TABLE IV
PERFORMANCE COMPARISON WITH THE STATE-OF-THE-ART.

Methods                  Ours    [17]    [52]    [16]    [15]    [14]    [13]    [11]    [12]    [10]    [9]
Holidays, mAP (%)        90.89   84.64   84.64   87.98   84.7    85.8    80.1    85.2    -       84.8    84.8
UKBench, N-S             3.920   3.77    3.83    3.841   3.75    3.85    -       3.79    3.67    3.64    3.55
Oxford, mAP (%)          84.92   -       -       -       84.3    -       85.0    -       81.4    68.5    74.7
Holidays + 1M, mAP (%)   77.82   -       -       75.06   79.4    69.0    -       -       -       77.0    42.3
Query time (s)           0.868   0.749   0.749   -       -       1.413   -       0.145   -       -       0.65
Memory cost (GB)         0.36    -       -       0.076   -       6.1     -       6.1     22.35   -       -

VI. CONCLUSIONS

This paper proposes a graph-based method for robust feature fusion at the rank level. We first define Rank Distance to measure the relevance of images at the rank level. Based on it, we then introduce the Bayes similarity to evaluate the retrieval quality of individual features. For each feature, an ImageGraph is constructed to model the relationship among images, in which an image is connected to its K nearest neighbors with edges, and the edges are weighted with Bayes similarity. Multiple rank lists resulting from different methods are fused via the ImageGraph. On the fused ImageGraph, images are re-ordered by local ranking, which further protects the fusion from outliers. Through extensive experiments on three benchmark datasets, we show that significant improvement can be achieved when multiple features are fused. Moreover, we demonstrate that our method is robust to outliers, which are usually brought in by bad features or inappropriate parameters. Our method obtains an mAP of 90.89%, an N-S score of 3.920, and an mAP of 84.92% on the Holidays, UKBench, and Oxford datasets, respectively. In the large-scale experiments, we achieve an mAP of 77.82% on Holidays + Flickr 1M. These results show that our method outperforms two popular fusion schemes, i.e., graph fusion [17] and [16], and is competitive with the state-of-the-art.

In future work, we will investigate how to efficiently update the ImageGraph structure when new images are added to the database or old ones are deleted from it. In addition, more effort will be made to explore feature selection strategies in the fusion.

Acknowledgements. This work was supported by the Initiative Scientific Research Program of Ministry of Education under Grant No. 20141081253. This work was supported in part to Dr. Qi Tian by ARO grant W911NF-15-1-0290 and Faculty Research Gift Awards by NEC Laboratories of America and Blippar. This work was supported in part by National Science Foundation of China (NSFC) 61429201.

REFERENCES

[1] Y. Wang and G. Mori. A discriminative latent model of object classes and attributes. In Proceedings of the IEEE European Conference on Computer Vision, 2010.
[2] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient object category recognition using classemes. In Proceedings of the IEEE European Conference on Computer Vision, 2010.
[3] F. Yu, R. Ji, M.-H. Tsai, G. Ye, and S.-F. Chang. Weak attributes for large-scale image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[4] D. Parikh and K. Grauman. Relative attributes. In Proceedings of the IEEE International Conference on Computer Vision, 2011.
[5] F. Jing and S. Baluja. VisualRank: Applying PageRank to large-scale image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, no.7, pp.1877-1890, 2008.
[6] J. Wang, Y.-G. Jiang, and S.-F. Chang. Label diagnosis through self tuning for web image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[7] A. Kovashka, D. Parikh, and K. Grauman. WhittleSearch: Image search with relative attribute feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[8] X. Shen, Z. Lin, J. Brandt, S. Avidan, and Y. Wu. Object retrieval and localization with spatially-constrained similarity measure and k-NN re-ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[9] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, vol.87, no.3, pp.316-336, 2008.
[10] H. Jegou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[11] L. Zheng, S. Wang, and Q. Tian. Coupled Binary Embedding for Large-Scale Image Retrieval. IEEE Transactions on Image Processing, vol.23, no.8, pp.3368-3380, 2014.
[12] D. Qin, S. Gammeter, L. Bossard, T. Quack, and L. Van Gool. Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[13] D. Qin, C. Wengert, and L. Van Gool. Query adaptive similarity for large scale object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[14] L. Zheng, S. Wang, Z. Liu, and Q. Tian. Packing and padding: Coupled multi-index for accurate image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[15] C. Deng, R. Ji, W. Liu, D. Tao, and X. Gao. Visual Reranking through Weakly Supervised Multi-Graph Learning. In Proceedings of the IEEE International Conference on Computer Vision, 2013.
[16] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian. Query-Adaptive Late Fusion for Image Search and Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[17] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query specific fusion for image retrieval. In Proceedings of the IEEE European Conference on Computer Vision, 2012.
[18] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, vol.60, no.2, pp.91-110, 2004.
[20] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, vol.60, no.1, pp.63-86, 2004.
[21] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[22] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[23] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision, 2003.
[24] L. Zheng, S. Wang, Z. Liu, and Q. Tian. Lp-norm IDF for Large Scale Image Search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[25] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proceedings of the IEEE European Conference on Computer Vision, 2008.
[26] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[27] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[28] C. Wengert, M. Douze, and H. Jegou. Bag-of-colors for improved image search. In Proceedings of ACM Multimedia, 2011.
[29] S. Zhang, M. Yang, X. Wang, Y. Lin, and Q. Tian. Semantic-aware Co-indexing for Near-duplicate Image Retrieval. In Proceedings of the IEEE International Conference on Computer Vision, 2013.
[30] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and Fisher vectors for efficient image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[31] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014.
[32] L. Zheng, S. Wang, J. Wang, and Q. Tian. Accurate image search with multi-scale contextual evidences. International Journal of Computer Vision, vol.120, no.1, pp.1-13, 2016.
[33] L. Zheng, S. Wang, and Q. Tian. Lp-Norm IDF for Scalable Image Retrieval. IEEE Transactions on Image Processing, vol.23, no.8, pp.3604-3617, 2014.

[34] L. Zheng, Y. Yang, and Q. Tian. SIFT Meets CNN: A Decade Survey of Instance Retrieval. arXiv:1608.01807, 2016.
[35] L. Zheng, S. Wang, Z. Liu, and Q. Tian. Fast image retrieval: query pruning and early termination. IEEE Transactions on Multimedia, vol.17, no.5, pp.648-659, 2015.
[36] A. Oliva and A. Torralba. A holistic representation of the spatial envelope. International Journal of Computer Vision, vol.42, no.3, pp.145-175, 2001.
[37] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, and C. Schmid. Evaluation of gist descriptors for web-scale image search. In Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[38] Y. Weiss, A. B. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[39] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.33, no.1, pp.117-128, 2011.
[40] L. Pauleve, H. Jegou, and L. Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, vol.31, no.11, pp.1348-1358, 2010.
[41] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proceedings of the IEEE International Conference on Computer Vision, 2007.
[42] L. Xie, Q. Tian, W. Zhou, and B. Zhang. Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb. Computer Vision and Image Understanding, vol.124, pp.31-41, 2014.
[43] L. Xie, Q. Tian, W. Zhou, and B. Zhang. Heterogeneous Graph Propagation for Large-Scale Web Image Search. IEEE Transactions on Image Processing, vol.24, no.11, pp.4287-4298, 2015.
[44] C. Huang, Y. Dong, H. Bai, L. Wang, N. Zhao, S. Cen, and J. Zhao. An efficient graph-based visual reranking. In IEEE ICASSP, 2013.
[45] W. Liu, Y. G. Jiang, J. Luo, and S. F. Chang. Noise resistant graph ranking for improved web image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[46] W. H. Hsu, L. S. Kennedy, and S. F. Chang. Video search reranking through random walk over document-level context graph. In Proceedings of the ACM International Conference on Multimedia, 2007.
[47] S. C. Hoi, W. Liu, and S. F. Chang. Semi-supervised distance metric learning for collaborative image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[48] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
[49] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web, 1999.
[50] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, vol.46, no.5, pp.604-632, 1999.
[51] Z. Liu, S. Wang, L. Zheng, and Q. Tian. Visual reranking with improved image graph. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[52] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas. Query Specific Rank Fusion for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.37, no.4, pp.803-815, 2015.
[53] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In Proceedings of the IEEE European Conference on Computer Vision, 2014.
[54] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, no.1, pp.2-11, 2010.
[55] L. Xie, R. Hong, B. Zhang, and Q. Tian. Image classification and retrieval are one. In ACM International Conference on Multimedia Retrieval, 2015.
[56] M. J. Huiskes and M. S. Lew. The MIR Flickr Retrieval Evaluation. In ACM International Conference on Multimedia Information Retrieval, 2008.
[57] D. Li, W. C. Hung, J. B. Huang, S. Wang, N. Ahuja, and M. H. Yang. Unsupervised Visual Representation Learning by Graph-based Consistent Constraints. In European Conference on Computer Vision, 2016.
[58] D. Li, J. B. Huang, Y. L. Li, S. Wang, and M. H. Yang. Weakly Supervised Object Localization with Progressive Domain Adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Ziqiong Liu received the bachelor degree in Information Engineering from Southeast University, Nanjing, China, in 2011. She is currently pursuing the Ph.D. degree in Electronic Engineering at Tsinghua University, Beijing, China. Her current research interests include image/video processing and large-scale multimedia retrieval.

Shengjin Wang received the B.E. degree from Tsinghua University, China, in 1985 and the Ph.D. degree from the Tokyo Institute of Technology, Tokyo, Japan, in 1997. From May 1997 to August 2003, he was a member of the research staff at the Internet System Research Laboratories, NEC Corporation, Japan. Since September 2003, he has been a Professor with the Department of Electronic Engineering, Tsinghua University. He has published more than 80 papers on image processing, computer vision, and pattern recognition, and is the holder of ten patents. His current research interests include image processing, computer vision, video surveillance, and pattern recognition.

Liang Zheng received the Ph.D. degree in Electronic Engineering from Tsinghua University, China, in 2015, and the B.E. degree in Life Science from Tsinghua University, China, in 2010. He was a postdoc researcher at the University of Texas at San Antonio, USA. He is currently a postdoc researcher in Quantum Computation and Intelligent Systems, University of Technology Sydney, Australia. His research interests include image retrieval, classification, and person re-identification.

Qi Tian (SM'04) received the Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). Dr. Tian's research interests include multimedia information retrieval and computer vision. He has served as Program Chair, Session Chair, Organization Committee Member, and TPC member for over 120 IEEE and ACM conferences, including ACM Multimedia, SIGIR, ICCV, and ICASSP. He is a Guest Co-Editor of IEEE Transactions on Multimedia, Computer Vision and Image Understanding, ACM Transactions on Intelligent Systems and Technology, and the EURASIP Journal on Advances in Signal Processing, and is an Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology and a member of the Editorial Board of the Journal of Multimedia.
