BUNDELKHAND INSTITUTE OF ENGINEERING &


TECHNOLOGY JHANSI (U.P)


SESSION: 2012-2013
A
Seminar Report
On
Augmented Reality with Visual Search

.::: UNDER THE GUIDANCE OF:::.

HOD
Head of Department
INFORMATION TECHNOLOGY

.:: SUBMITTED BY::.

Name
(Roll No )
B.TECH 3rd Year, 6th Semester
DEPARTMENT OF INFORMATION TECHNOLOGY


BUNDELKHAND INSTITUTE OF ENGINEERING &
TECHNOLOGY JHANSI (U.P)
DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE
This is to certify that the seminar titled AUGMENTED REALITY WITH VISUAL SEARCH has been successfully delivered by name (B.Tech 6th semester, Information Technology) in partial fulfillment of the B.Tech degree in Information Technology from Bundelkhand Institute of Engineering & Technology during the academic year 2012-13.


HEAD OF DEPARTMENT
DR.
Department Of Information Technology
B.I.E.T. Jhansi








ACKNOWLEDGEMENT

I feel great pleasure in expressing my deep sense of gratitude and heartiest respect to Dr. Yashpal Singh, H.O.D., Information Technology, Bundelkhand Institute of Engineering and Technology, Jhansi, for his persevering guidance and inspiration throughout the preparation of this seminar. I am also thankful to my teachers for their guidance and help.

I gratefully acknowledge the blessing, useful guidance and help that I have received.


name
Roll No-
B.Tech 3rd Year
Information Technology








ABSTRACT
Augmented reality is a direct or indirect view of a physical, real-world environment whose
elements are augmented by computer-generated sensory input such as sound, video, graphics or
GPS data. With the help of advanced AR technology, information about the real world surrounding the
user becomes interactive and can be digitally manipulated.
Visual capture capability on mobile devices can be used for linking the real world and the
digital world. Mobile phones have evolved into powerful image and video processing devices,
equipped with high-resolution cameras, color displays, and hardware-accelerated graphics. They
are also equipped with GPS, and connected to broadband wireless networks. All this enables a
new class of applications which use the camera phone to initiate search queries about objects in
visual proximity to the user. Such applications can be used, e.g., for identifying products,
comparison-shopping, finding information about movies, CDs, buildings, shops, real estate, print
media or artworks.

For the implementation of this system, two architectures are possible: an integrated system which
runs solely on the phone and a networked system which recognizes a submitted image on a
server. Although the networked system offers a larger database capacity, we argue for the
integrated system because it enables real-time recognition on the device with smooth user
interaction. We have implemented a highly efficient feature extraction and matching algorithm
targeting resource-constrained mobile devices. The advantage of the system is the complete
integral solution on the phone, including a language-independent feature extraction and an
efficient database lookup, which provides instant response.

Project Glass is a research and development program to develop an augmented reality head-
mounted display. It is a newly unveiled concept headgear that would superimpose graphics on
your view of the world. It has a small transparent device just over the right eye which serves as a
means of displaying information in an overlay manner.





TABLE OF CONTENTS

Title                                                                  Page No.
1. INTRODUCTION ........................................................... 1-2
2. AUGMENTED REALITY ...................................................... 3-4
   2.1 HISTORICAL OVERVIEW
3. AUGMENTED REALITY WITH VISUAL SEARCH ................................... 5-9
   3.1 CHALLENGING ISSUES
   3.2 POTENTIAL SOLUTION
4. IMAGE RECOGNITION FOR AUGMENTED REALITY .............................. 10-11
   4.1 IMAGE RETRIEVAL PIPELINE
5. FEATURE EXTRACTION ................................................... 12-17
   5.1 INTEREST POINT DETECTION
   5.2 FEATURE DESCRIPTOR COMPUTATION
       5.2.1 CHoG: A LOW BITRATE DESCRIPTOR
       5.2.2 LOCATION HISTOGRAM CODING
6. FEATURE INDEXING AND MATCHING ........................................ 17-22
   6.1 VOCABULARY TREE AND INVERTED INDEX
   6.2 INVERTED INDEX COMPRESSION
7. GEOMETRIC VERIFICATION ............................................... 23-25
   7.1 FAST GEOMETRIC RE-RANKING
8. SYSTEM PERFORMANCE ................................................... 26-32
   8.1 RETRIEVAL ACCURACY
   8.2 SYSTEM LATENCY
   8.3 TRANSMISSION DELAY
   8.4 END-TO-END SYSTEM LATENCY
   8.5 ENERGY CONSUMPTION
9. PROJECT GLASS ........................................................ 33-35
10. FUTURE SCOPE ........................................................... 36
11. CONCLUSION ............................................................. 37
12. REFERENCES ............................................................. 38
TABLE OF FIGURES

FIGURE 2.1...4
FIGURE 4.1...10
FIGURE 5.1...13
FIGURE 5.2...15
FIGURE 5.3...16
FIGURE 5.4...17
FIGURE 6.1...19
FIGURE 6.2...21
FIGURE 7.1...23
FIGURE 7.2...23
FIGURE 7.3...24
FIGURE 8.1...27
FIGURE 8.2...28
FIGURE 8.3...30
FIGURE 8.4...30
FIGURE 8.5...31
FIGURE 8.6...31
FIGURE 8.7...32
FIGURE 9.1...34













1. Introduction

As computers increase in power and decrease in size, new mobile, wearable, and
pervasive computing applications are rapidly becoming feasible, providing people access to
online resources always and everywhere. This new flexibility makes possible new kinds of
applications that exploit the person's surrounding context. Augmented reality (AR) presents
a particularly powerful user interface to context-aware computing environments. AR
systems integrate virtual information into a person's physical environment so that he or she
will perceive that information as existing in their surroundings. Augmented reality systems
with visual search provide this service without constraining the individual's whereabouts to
a specially equipped area. Ideally, they work virtually anywhere, adding a palpable layer of
information to any environment whenever desired. By doing so, they hold the potential to
revolutionize the way in which information is presented to people. Computer-presented
material is directly integrated with the real world surrounding the freely roaming person,
who can interact with it to display related information, to pose and resolve queries, and to
collaborate with other people. The world becomes the user interface.

Mobile phones have evolved into powerful image and video processing devices
equipped with high-resolution cameras, color displays, and hardware-accelerated graphics.
They are also increasingly equipped with a global positioning system and connected to
broadband wireless networks. All this enables a new class of applications, namely augmented
reality with visual search, that uses the camera phone to initiate search queries about objects
in visual proximity to the user. Such applications can be used, e.g., for identifying products,
comparison shopping, finding information about movies, compact disks (CDs), real estate,
print media, or artworks. First deployments of such systems include Google Goggles, Nokia
Point and Find, Kooaba, Ricoh iCandy, and Amazon Snaptell. Mobile image-retrieval
applications pose a unique set of challenges. What part of the processing should be
performed on the mobile client, and what part is better carried out at the server? On the one
hand, transmitting a Joint Photographic Experts Group (JPEG) image could take a few
seconds over a slow wireless link. On the other hand, extraction of salient image features is
now possible on mobile devices in seconds.
There are several possible client-server architectures:
1) The mobile client transmits a query image to the server. The image-retrieval algorithms run entirely on the server, including an analysis of the query image.
2) The mobile client processes the query image, extracts features, and transmits feature data. The image-retrieval algorithms run on the server using the feature data as the query.
3) The mobile client downloads data from the server, and all image matching is performed on the device.
One could also imagine a hybrid of the approaches mentioned above. When the
database is small, it can be stored on the phone, and image-retrieval algorithms can be run
locally. When the database is large, it has to be placed on a remote server and the retrieval
algorithms are run remotely. In each case, the retrieval framework has to work within
stringent memory, computation, power, and bandwidth constraints of the mobile device. The
size of the data transmitted over the network needs to be as small as possible to reduce
network latency and improve user experience. The server latency has to be low as we scale to
large databases.

















2. Augmented Reality

Augmented reality is related to the concept of virtual reality (VR). VR attempts to
create an artificial world that a person can experience and explore interactively
predominantly through his or her sense of vision, but also via audio, tactile, and other forms
of feedback. AR also brings about an interactive experience, but aims to supplement the real
world, rather than creating an entirely artificial environment. The physical objects in the
individual's surroundings become the backdrop and target items for computer-generated
annotations. Different researchers subscribe to narrower or wider definitions of exactly what
constitutes AR. While the research community largely agrees on most of the elements of
AR systems, helped along by the exchange and discussions at several international
conferences in the field, there are still small differences in opinion and nomenclature.

We will define an AR system as one that combines real and computer-generated
information in a real environment, interactively and in real time, and aligns virtual objects
with physical ones. At the same time, AR is a subfield of the broader concept of mixed
reality (MR) which also includes simulations predominantly taking place in the virtual
domain and not in the real world. Mobile AR applies this concept in truly mobile settings;
that is, away from the carefully conditioned environments of research laboratories and
special-purpose work areas.


2.1 Historical Overview

While the term augmented reality was coined in the early 1990s, the first fully
functional AR system dates back to the late 1960s, when Ivan Sutherland and colleagues
(1968) built a mechanically tracked 3D see-through head-worn display, through which the
wearer could see computer-generated information mixed with physical objects, such as signs
on a laboratory wall. For the next few decades much research was done on getting
computers to generate graphical information, and the emerging field of interactive computer
graphics began to flourish. Photorealistic computer-generated images became an area of
research in the late 1970s, and progress in tracking technology furthered the hopes to create
the ultimate simulation machine. The field of augmented reality began to emerge.
It was not until the early 1990s, with research at the Boeing Corporation, that the
notion of overlaying computer graphics on top of the real world received its current name.
Caudell and Mizell (1992) worked at Boeing on simplifying the process of conveying
wiring instructions for aircraft assembly to construction workers, and they referred to their
proposed solution of overlaying computer-presented material on top of the real world as
augmented reality. Even though this application was conceived with the goal of mobility in
mind, true mobile graphical AR was out of reach for the available technology until a few
years later.



Figure 2.1 Traditional AR restaurant guide. (a) User with MARS backpack, looking at a
restaurant. (b) Annotated view of restaurant, imaged through the head-worn display.






3. Augmented Reality with Visual Search

Augmented reality with visual search is also known as mobile augmented reality.
Revisiting our definition of AR, we can identify the components needed for a Mobile
Augmented Reality System (MARS).

Computational Platform-A computational platform that can generate and manage the
virtual material to be layered on top of the physical environment, process the tracker
information, and control the AR display(s).

Displays-Displays to present the virtual material in the context of the physical world. In
the case of augmenting the visual sense, these can be head-worn displays, mobile hand-held
displays, or displays integrated into the physical world.
Registration-Registration must also be addressed: aligning the virtual elements with the
physical objects they annotate. For visual and auditory registration, this can be done by
tracking the position and orientation of the user's head and relating that measurement to a
model of the environment and/or by making the computer see and potentially interpret the
environment by means of cameras and computer vision.
Wearable input and interaction technology-Wearable input and interaction
technologies enable a mobile person to work with the augmented world (e.g., to make
selections or access and visualize databases containing relevant material) and to further
augment the world around them.
Wireless Networking-Wireless networking is needed to communicate with other people
and computers while on the move. Dynamic and flexible mobile AR will rely on up-to-the-second
information that cannot possibly be stored on the computing device before
application run-time.
Data Storage and Access Technology-If a MARS is to provide information about a
roaming individual's current environment, it needs to get the data about that environment
from somewhere. Data repositories must provide information suited to the roaming
individual's current context.

3.1 Challenging Issues in Augmented Reality with Visual Search

Mobile devices differ from general computing environments in several aspects. The
design of a mobile image search system must take into account the following inherent
challenges and limitations of mobile devices:

Low Processing Power of CPU-Modern mobile embedded CPUs are designed with
much more than pure speed in mind. Priority is often given to factors which address
requirements of a mobile operating environment such as low heat dissipation, minimal
power consumption, and small form factor. Although technologically advanced,
mobile CPUs are still not fast enough to perform computationally intensive image-processing
operations such as feature extraction. Graphics processing units (GPUs), which
are built into most mobile devices, can help to speed up processing via parallel computing,
but most feature extraction algorithms are designed to be executed sequentially and cannot
fully utilize GPU capabilities.
Less Memory Capacity-Mobile devices have less memory capacity than desktop
systems. Smart phones such as the top-tier Google Nexus One come with 512MB of built-
in RAM. While the Nexus One employs one of the largest memory capacities currently
available, limitations caused by memory become a large issue when extracting features for
an image search. This is because feature extraction often requires large sets of intermediate
data to be stored in memory since analysis is performed sequentially. For example, SURF,
a popular feature extraction algorithm generates results by analyzing data in a lock-step
fashion where data generated in previous stages is referenced by the current stage and by
future stages as well. Furthermore, the total amount of memory usage of each stage grows
linearly with the size of the original image. For moderate- to high-resolution images, this
process could easily exhaust memory resources.
Small Screen size- Modern high-end smart phones boast displays which measure slightly
less than four inches diagonally. However, this size is still much smaller than that of a
common desktop/laptop. Smaller screens greatly limit the amount of information that can be
presented to a user at any given time. This creates a much greater requirement for an
efficient, effective display of search results and also increases the need for higher search
accuracy.
Limited Connectivity-Wi-Fi is a built-in feature for most mobile devices. However, Wi-Fi
is still only available at sparse locations even in most urban areas. For the majority of their
network connectivity, mobile devices must rely on a combination of mobile broadband
networks such as 3G, 3.5G, and 4G. These networks provide acceptable network access
speeds, but can become a design limitation when a large amount of data must be transferred
in real time. Moreover, mobile broadband networks are limited in their availability outside of
large cities.


To summarize, hard constraints imposed on mobile device platforms distinguish them
from conventional computing platforms and create new challenges for applications that work
within their realm of limitations. However, despite their shortcomings, mobile devices possess
inherent characteristics that have the potential to increase the accuracy and efficiency of image
search.

3.2 Potential Solution
Some of the challenges that face image search can be addressed by applying solutions
which have been devised to solve problems in related areas.
CPU-Mobile systems-on-chip (SoCs) often come with embedded graphics processing unit
(GPU) cores in addition to the CPU. GPUs allow for large quantities of instructions to be
executed in parallel. While originally intended for rendering 2D and 3D graphics, GPUs
have been at the core of a branch of study known as general-purpose computation on
graphics processing units (GPGPU). GPGPU technology extends the programmability of
GPUs to enable non-graphics applications with high parallelizability to run more efficiently
than on a CPU. In the context of mobile image search, where sequential feature extraction
algorithms are often used, GPGPU technology can allow for feature extraction algorithms to
be broken up into smaller subtasks and executed in parallel. Efforts have been made to
improve the parallelization of feature extraction in recent years. In, a number of stages in the
SIFT algorithm are parallelized to run on consumer desktop GPUs, decreasing runtime by a
factor of 10. To fully utilize the GPU, new feature extraction algorithms must be devised
with the aim to be executed concurrently.
Memory-Conservative use of memory in feature extraction algorithms is another area in
which mobile search benefits from other studies. In one study, the SURF algorithm is ported to
mobile phones for use in an augmented reality experiment. To limit memory usage, only the
smaller of the original image and integral image is saved in memory and conversions from
one to the other are performed as needed. This results in a large reduction in memory usage.
Other engineering approaches include scaling down the original image to a smaller
resolution before performing feature extraction. Smaller images require much less memory
to analyze, but come at the cost of fewer detected features. Another approach is to keep the

original size of an image, but introduce an additional step where a user can crop a section of
an image which includes the object of interest. This acts to reduce the image dimensions
while preserving the features which are most relevant to the image search. Another proposed
approach is to divide an image into smaller sub-images and perform analysis on each sub-
image sequentially before merging the results as a final step. This method can be used in
case an algorithm must produce large amounts of intermediate data during execution. The
idea is that after analyzing each sub-image, the intermediate data can be freed and reused
for the processing of the subsequent sub-image.
Screen/Interface-Touch screens provide an interface that allows users to express their
intentions more freely and intuitively. However, smaller screen size greatly limits the
number of result images that can be displayed on the screen at any given time.
Improvements in search accuracy can minimize the number of results that must be returned
to the user before query-relevant content is produced. Another possibility is to perform post-
search pruning on a set of search results based on attributes that can be computed on the
server side. Only the most relevant content is returned by examining the search context and
the user's interests. This process can make efficient use of the limited screen space and
enhance the search experience.
Network-Networking challenges in mobile image search can be overcome in several ways
which address the different instances in which a mobile search application makes use of its
network. First, there is the transmission of the extracted feature vectors. In this step, the
features which are obtained from an image are sent to a search server which compares the
extracted features with stored features extracted from a large image database. This challenge
is characterized by a large set of data that must be sent to the search server. A typical image
of a landmark produces hundreds of SURF features. Each feature is expressed by a
descriptor vector holding 64 floating point numbers. By converting the floating point
numbers to bytes, the size of each feature vector is reduced, resulting in significantly
fewer bytes transferred over the network. The next major network usage is when image results
are returned to the user. In this phase, the returned images must be transferred and displayed
for the user to choose from. This challenge can be met by sending only the most relevant
images back to user. To improve the search relevance, we suggest a multimodal query
scheme and a dynamic, post-search pruning method. Moreover, pre-scaling images to produce
small preview images can further reduce payload size when transferring the search results to
the mobile device.
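
As a rough illustration of the byte-conversion idea described above, the sketch below quantizes 64-dimensional floating-point SURF-style descriptors to one byte per dimension before transmission. The function names and the assumed value range are illustrative, not the exact scheme of any particular system.

```python
import numpy as np

def quantize_descriptors(descriptors: np.ndarray) -> np.ndarray:
    """Map float32 descriptors (values assumed roughly in [-1, 1]) to uint8.

    descriptors: array of shape (num_features, 64), e.g. SURF output.
    The result uses one byte per dimension, a 4x reduction over 32-bit floats.
    """
    clipped = np.clip(descriptors, -1.0, 1.0)   # guard against outliers
    scaled = (clipped + 1.0) * 127.5            # [-1, 1] -> [0, 255]
    return np.round(scaled).astype(np.uint8)

def dequantize_descriptors(payload: np.ndarray) -> np.ndarray:
    """Approximate inverse mapping applied on the search server before matching."""
    return payload.astype(np.float32) / 127.5 - 1.0

# Example: 300 features of 64 floats (76.8 KB) shrink to 19.2 KB before upload.
features = np.random.uniform(-1, 1, size=(300, 64)).astype(np.float32)
payload = quantize_descriptors(features)
print(features.nbytes, "->", payload.nbytes, "bytes")
```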


















4. Image Recognition for Augmented Reality

The most successful algorithms for content-based image retrieval use an approach that is
referred to as bag of features (BoFs) or bag of words (BoWs). The BoW idea is borrowed from
text retrieval. To find a particular text document, such as a Web page, it is sufficient to use a few
well-chosen words. In the database, the document itself can be likewise represented by a bag of
salient words, regardless of where these words appear in the text. For images, robust local
features take the analogous role of visual words. Like text retrieval, BoF image retrieval does not
consider where in the image the features occur, at least in the initial stages of the retrieval
pipeline. However, the variability of features extracted from different images of the same object
makes the problem much more challenging.


4.1 Image Retrieval Pipeline
The typical image retrieval pipeline is as follows:


Figure 4.1: A Pipeline for image retrieval.

1. First, the local features are extracted from the query image. The set of image features is used
to assess the similarity between query and database images. For mobile applications, individual
features must be robust against the geometric and photometric distortions encountered when the
user takes the query photo from a different viewpoint and with different lighting compared to
the corresponding database image.
2. Next, the query features are quantized. The partitioning into quantization cells is precomputed
for the database, and each quantization cell is associated with a list of database
images in which the quantized feature vector appears somewhere. This inverted file
circumvents a pairwise comparison of each query feature vector with all the feature vectors
in the database and is the key to very fast retrieval. Based on the number of features they
have in common with the query image, a short list of potentially similar images is selected
from the database.
3. Finally, a geometric verification (GV) step is applied to the most similar matches in the
database. The GV finds a coherent spatial pattern between features of the query image and
the candidate database image to ensure that the match is plausible.
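
The following Python sketch shows the data flow of this pipeline in miniature. The function names are hypothetical, and quantization and geometric verification are only stubbed; it is meant to illustrate the role of the inverted file, not any production implementation.

```python
from collections import defaultdict

# Inverted file: visual word id -> list of database image ids containing that word.
inverted_index = defaultdict(list)

def index_database(database_words):
    """database_words: dict mapping image_id -> iterable of quantized visual word ids."""
    for image_id, words in database_words.items():
        for word in set(words):
            inverted_index[word].append(image_id)

def retrieve_shortlist(query_words, top_k=50):
    """Score database images by the number of visual words they share with the query."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in inverted_index.get(word, []):
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)[:top_k]

# The shortlist is then passed to geometric verification (Section 7),
# which checks spatial consistency before the final match is reported.
index_database({"cd_001": [3, 17, 42], "book_007": [3, 8, 99]})
print(retrieve_shortlist([3, 42, 55]))
```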
























5. Feature Extraction
Feature extraction consists of the following steps:
5.1 Interest Point Detection
Feature extraction typically starts by finding the salient interest points in the image. For
robust image matching, we desire interest points to be repeatable under perspective
transformations (or, at least, scale changes, rotation, and translation) and real-world lighting
variations. An example of interest point detection is illustrated in Figure 5.1. To achieve scale
invariance, interest points are typically computed at multiple scales using an image pyramid. To
achieve rotation invariance, the patch around each interest point is canonically oriented in the
direction of the dominant gradient. Illumination changes are compensated by normalizing the
mean and standard deviation of the gray values of the pixels within each patch. Numerous
interest-point detectors have been proposed in the literature. Some of them are:
Corner Detectors-Corners are among the first low-level features used
for image analysis and, in particular, tracking. Building on Moravec's work, Harris and Stephens
developed the algorithm that became known as the Harris Corner Detector. They derive a
corner score from the second-order moment matrix of the image gradient, which also forms
the basis for the detectors proposed by Förstner (1994) and Shi and Tomasi (1994).
Mikolajczyk and Schmid (2001) proposed an approach to make the Harris detector scale
invariant. Other intensity-based corner detectors include the algorithms of Beaudet
(1978), which uses the determinant of the Hessian matrix, and Kitchen and Rosenfeld
(1982), which measures the change of direction in the local gradient field.
Blob Detectors-Instead of trying to detect corners, one may use local extrema of the
responses of certain filters as interest points. In particular, many approaches aim at
approximating the Laplacian of a Gaussian which, given an appropriate normalization,
yields a scale-invariant blob response. Lowe (1999, 2004) proposed to select the local extrema of an image filtered with
differences of Gaussians, which are separable and hence faster to compute than the
Laplacian. The Fast Hessian detector (Bay et al. 2008) is based on efficient-to-compute
approximations to the Hessian matrix at different scales. Agrawal et al. (2008) proposed
to approximate the Laplacian even further, down to bi-level octagons and boxes. Using
slanted integral images, the result can be computed very efficiently despite a fine scale
quantization.
SIFT-The original SIFT descriptor (Lowe 1999, 2004) was computed from the image
intensities around interesting locations in the image domain, which can be referred to as
interest points or, alternatively, key points. These interest points are obtained from scale-space
extrema of differences-of-Gaussians (DoG) within a difference-of-Gaussians
pyramid, as originally proposed by Burt and Adelson (1983) and by Crowley and Stern
(1984). A Gaussian pyramid is constructed from the input image by repeated smoothing
and subsampling, and a difference-of-Gaussians pyramid is computed from the
differences between the adjacent levels in the Gaussian pyramid. Then, interest points are
obtained from the points at which the difference-of-Gaussians values assume extrema
with respect to both the spatial coordinates in the image domain and the scale level in the
pyramid.
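
The sketch below illustrates the difference-of-Gaussians idea behind the blob and SIFT detectors described above. It is a simplified single-octave version (assumed scales and threshold, no sub-pixel refinement or edge rejection), not a full SIFT implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_interest_points(image, sigmas=(1.0, 1.6, 2.56, 4.1), threshold=0.03):
    """Detect blob-like interest points as local extrema of a DoG stack.

    image: 2-D float array with values in [0, 1].
    Returns a list of (row, col, scale_index) triples.
    """
    blurred = [gaussian_filter(image, s) for s in sigmas]
    dog = [blurred[i + 1] - blurred[i] for i in range(len(sigmas) - 1)]

    keypoints = []
    for s in range(1, len(dog) - 1):                    # need a scale above and below
        for r in range(1, image.shape[0] - 1):
            for c in range(1, image.shape[1] - 1):
                value = dog[s][r, c]
                if abs(value) < threshold:
                    continue
                # 3x3x3 neighborhood across space and scale.
                cube = np.stack([d[r - 1:r + 2, c - 1:c + 2] for d in dog[s - 1:s + 2]])
                if value == cube.max() or value == cube.min():
                    keypoints.append((r, c, s))
    return keypoints
```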



Figure 5.1: Interest point detection.

5.2 Feature Descriptor Computation
After interest point detection, we compute a visual word descriptor on the
normalized patch. We would like descriptors to be robust to small distortions in scale,
orientation and lighting conditions. Also, we require descriptors to be discriminative, i.e.,
characteristic of an image or a small set of images. Descriptors that occur in almost every
image (the equivalent of the word "and" in text documents) would not be useful for
retrieval. Since Lowe's paper in 1999, the highly discriminative SIFT descriptor remains the
most popular descriptor in computer vision. Other examples of feature descriptors are the
Gradient Location and Orientation Histogram (GLOH) by Mikolajczyk and Schmid,
Speeded Up Robust Features (SURF) by Bay et al., and our own Compressed Histogram of
Gradients (CHoG). Winder and Brown, and Mikolajczyk et al., evaluate the performance of
different descriptors.
As a 128-dimensional descriptor, SIFT is conventionally stored as 1024
bits (8 bits/dimension). Alas, the size of SIFT descriptor data from an image is typically
larger than the size of the JPEG compressed image itself. Several compression schemes
have been proposed to reduce the bit rate of SIFT descriptors. In our recent work, we survey
different SIFT compression schemes. They can be broadly categorized into schemes based
on hashing, transform coding and vector quantization. We note that hashing schemes like
Locality Sensitive Hashing (LSH), Similarity Sensitive Coding (SSC) or Spectral Hashing
(SH) do not perform well at low bitrates. Conventional transform coding schemes based on
Principal Component Analysis (PCA) do not work well due to the highly non-Gaussian
statistics of the SIFT descriptor. Vector quantization schemes based on the Product
Quantizer or a Tree Structured Vector Quantizer are complex and require storage of large
codebooks on the mobile device.
We came to realize that simply compressing an off-the-shelf descriptor does not
lead to the best rate-constrained image retrieval performance. One can do better by
designing a descriptor with compression in mind. Of course, such a descriptor still has to be
robust and highly discriminative. Ideally, it would permit descriptor comparisons in the
compressed domain for speedy feature matching. To meet all these requirements
simultaneously, we designed the Compressed Histogram of Gradients (CHoG) descriptor.
The CHoG descriptor is designed to work well at low bitrates. CHoG achieves the
performance of 1024-bit SIFT at less than 60 bits/descriptor. Since CHoG descriptor data
are an order of magnitude smaller than SIFT or JPEG compressed images, it can be
transmitted much faster over slow wireless links. A small descriptor also helps if the
database is stored in the mobile device. The smaller the descriptor, the more features can be
stored in limited memory.


Figure 5.2: Feature Descriptor Computation


5.2.1 CHoG: A Low Bitrate Descriptor
CHoG builds upon the principles of HoG descriptors with the goal of being highly
discriminative at low bitrates. CHoG descriptors are computed as follows:
The patch is divided into spatial bins, which provides robustness to interest point localization
error. We divide the patch around each interest point into soft log-polar spatial bins using
DAISY configurations. The log-polar configuration is more effective than the square grid
configuration used in SIFT.
The joint (dx, dy) gradient histogram in each spatial bin is captured directly into the
descriptor. CHoG histogram binning exploits the skew in gradient statistics that is observed
for patches extracted around interest points.
CHoG retains the information in each spatial bin as a distribution. This allows the use of
more effective distance measures like KL divergence and, more importantly, allows us to
apply quantization and compression schemes that work well for distributions, to produce
compact descriptors.
Typically, 9 to 13 spatial bins and 3 to 9 gradient bins are chosen, resulting in 27- to 117-dimensional
descriptors. For compressing the descriptor, we quantize the gradient histogram in
each spatial bin individually.
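
The sketch below captures the per-spatial-bin gradient histogram idea on which CHoG is built. For simplicity it uses a square grid of spatial bins rather than the DAISY log-polar layout, and the bin counts and gradient range are illustrative assumptions.

```python
import numpy as np

def joint_gradient_histograms(patch, spatial_grid=3, bins_per_axis=3):
    """Capture a joint (dx, dy) gradient histogram in each spatial bin of a patch.

    patch: square 2-D float array (a canonically oriented patch, values in [0, 1]).
    Returns an array of shape (spatial_grid**2, bins_per_axis**2): one flattened
    gradient distribution per spatial bin, each normalized to sum to 1, ready to
    be quantized and compressed bin by bin.
    """
    dy, dx = np.gradient(patch.astype(np.float32))
    cell = patch.shape[0] // spatial_grid
    histograms = []
    for i in range(spatial_grid):
        for j in range(spatial_grid):
            rows = slice(i * cell, (i + 1) * cell)
            cols = slice(j * cell, (j + 1) * cell)
            hist, _, _ = np.histogram2d(dx[rows, cols].ravel(),
                                        dy[rows, cols].ravel(),
                                        bins=bins_per_axis,
                                        range=[[-1, 1], [-1, 1]])
            hist = hist.ravel()
            total = hist.sum()
            histograms.append(hist / total if total > 0 else hist)
    return np.vstack(histograms)   # e.g. 9 spatial bins x 9 gradient bins = 81 values
```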



Figure 5.3: The joint (dx, dy) gradient distribution (a) over a large number of cells and (b) its
contour plot. The greater variance along the y axis results from aligning the patches along the most
dominant gradient after interest-point detection. The quantization bin constellations (c) VQ-3,
(d) VQ-5, (e) VQ-7, and (f) VQ-9 and their associated Voronoi cells are shown.

Each interest point has a location, scale and orientation associated with it. Interest point
locations are needed in the geometric verification step to validate potential candidate matches.
The location of each interest point is typically stored as two numbers: x and y co-ordinates in
the image at sub-pixel accuracy. In a floating point representation, each feature location would
require 64 bits, 32 bits each for x and y. This is comparable in size to the CHoG descriptor
itself. We have developed a novel histogram coding scheme to encode the x, y coordinates of
feature descriptors. With location histogram coding, we can reduce location data by an order of
magnitude compared to their floating point representation, without loss in matching accuracy.

5.2.2 Location Histogram Coding

Location Histogram Coding is used to compress feature location data efficiently. We note
that the interest points in images are spatially clustered. To encode their locations, we first
generate a 2-D histogram from the locations of the descriptors. Location histogram coding
provides two key benefits. First, encoding the locations of a set of N features as a histogram
reduces the bit rate by log2(N!) bits compared to encoding each feature location in sequence. This
gain arises because ordering information (N! unique orderings) is discarded when a histogram is
computed. Second, we exploit the spatial correlation between the locations of different
descriptors. We divide the image into spatial bins and count the number of features within each
spatial bin. We compress the binary map, indicating which spatial bins contain features, and a
sequence of feature counts, representing the number of features in occupied bins. We encode the
binary map using a trained context-based arithmetic coder, with neighboring bins being used as
the context for each spatial bin. Using location histogram coding, we can transmit each location
with 5 bits/descriptor with little loss in matching accuracy - a 12.5× reduction in data.
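
A simplified sketch of the location-histogram idea is given below. It builds the binary occupancy map and the per-bin feature counts described above; the trained context-based arithmetic coder is not shown, and the block size is an illustrative assumption.

```python
import numpy as np

def location_histogram(locations, image_size, block=8):
    """Summarize (x, y) feature locations as a 2-D histogram of block occupancy.

    locations: array of shape (N, 2) holding x, y pixel coordinates.
    image_size: (width, height) of the query image.
    Returns (binary_map, counts): which spatial bins contain features, and how
    many features fall into each occupied bin. Ordering information is discarded,
    which is exactly where the log2(N!) saving comes from.
    """
    width, height = image_size
    bins_x = int(np.ceil(width / block))
    bins_y = int(np.ceil(height / block))
    hist = np.zeros((bins_y, bins_x), dtype=np.uint16)
    for x, y in locations:
        hist[int(y) // block, int(x) // block] += 1
    binary_map = hist > 0
    counts = hist[binary_map]     # feature counts of occupied bins, in raster order
    return binary_map, counts

# The binary map would then be compressed with a context-based arithmetic coder
# and the counts with a short variable-length code, as described above.
```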

Figure 5.4: Location Histogram Coding



6. Feature Indexing and matching
For a large database of images, comparing the query image against every database image
using pair wise feature matching is infeasible. A database with millions of images might contain
billions of features. A linear scan through the database would be too time-consuming for
interactive mobile visual search applications. Instead, we must use a data structure that can
quickly return a shortlist of the database candidates most likely to match the query image. The
shortlist may contain false positives, as long as the correct match is included. Slower pairwise
comparisons can subsequently be performed on just the shortlist of candidates rather than the
entire database.
Many data structures have been proposed for efficiently indexing all the local features in
a large image database. Lowe proposes approximate nearest neighbor (ANN) search of SIFT
descriptors with a best-bin-first strategy. One of the most popular methods is Sivic and
Zisserman's Bag-of-Features (BoF) approach. The BoF codebook is trained by k-means
clustering of many training descriptors. During a query, scoring the database images can be made
fast by using an inverted file index associated with the BoF codebook. To generate a much larger
codebook, Nister and Stewenius utilize hierarchical k-means clustering to create a Vocabulary
Tree (VT). Alternatively, Philbin et al. use randomized k-d trees to partition the feature
descriptor space. Subsequent improvements in tree-based quantization and ANN search include
greedy N-best paths, query expansion, efficient updates over time, soft binning, and Hamming
embedding. As database size increases, the amount of memory used to index the database
features can become very large. Thus, developing a memory-efficient indexing structure is a
problem of increasing interest. Chum et al. use a set of compact min-hashes to perform near-duplicate
image retrieval. Zhang et al. decompose each image's set of features into a coarse
signature and a refinement signature. The refinement signature is subsequently indexed by a
locality-sensitive hash (LSH). To support the popular VT scoring framework, inverted index
compression methods for both hard-binned and soft-binned VTs have been developed by us, as
explained in Section 6.2 (Inverted Index Compression). The memory for BoF image signatures can
alternatively be reduced using the mini-BoF approach. Very recently, visual word residuals on a
small BoF codebook have shown promising retrieval results with low memory usage. The
residuals are indexed either with PCA and product quantizers or with LSH.

6.1 Vocabulary Tree and Inverted Index
A Vocabulary Tree (VT) with an inverted index can be used to quickly compare images
in a large database against a query image. If the VT has L levels excluding the root node and
each interior node has C children, then a fully balanced VT contains K = C^L leaf nodes. Fig. 6.1
shows a VT with L = 2, C = 3, and K = 9. The VT for a particular database is constructed by
performing hierarchical k-means clustering on a set of training feature descriptors representative
of the database. Initially, C large clusters are generated from all the training descriptors by
ordinary k-means with an appropriate distance function like the L2-norm or symmetric KL
divergence. Then, for each large cluster, k-means clustering is applied to the training descriptors
assigned to that cluster, to generate C smaller clusters. This recursive division of the descriptor
space is repeated until there are enough bins to ensure good classification performance.
Typically, L = 6 and C = 10 are selected, in which case the VT has K = 10^6 leaf nodes.
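
A toy version of this hierarchical k-means construction is sketched below, using scikit-learn's KMeans for the per-level clustering. Parameters and the greedy quantization routine are illustrative; a production vocabulary tree would be trained offline on millions of descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(descriptors, branching=10, levels=6):
    """Recursively cluster training descriptors into a vocabulary tree.

    descriptors: array of shape (N, D). A fully balanced tree would have
    branching**levels leaf nodes (visual words).
    """
    def split(points, depth):
        node = {"center": points.mean(axis=0), "children": []}
        if depth == levels or len(points) < branching:
            return node                                   # leaf = visual word
        km = KMeans(n_clusters=branching, n_init=3).fit(points)
        for c in range(branching):
            members = points[km.labels_ == c]
            if len(members) > 0:
                node["children"].append(split(members, depth + 1))
        return node
    return split(descriptors, 0)

def quantize(tree, descriptor):
    """Greedily walk the tree to find the leaf (visual word) for one descriptor."""
    node = tree
    while node["children"]:
        node = min(node["children"],
                   key=lambda child: np.linalg.norm(child["center"] - descriptor))
    return id(node)          # the leaf object's identity serves as the word id
```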

Figure 6.1: (a) Construction of a Vocabulary Tree by hierarchical k-means clustering of training
feature descriptors. (b) Vocabulary Tree and the associated inverted index.

The inverted index associated with the VT maintains two lists per leaf node. For node k,
there is a sorted array of image IDs {ik1, ik2, ..., ikNk} indicating which Nk database images
have visited that node. Similarly, there is a corresponding array of counts {ck1, ck2, ..., ckNk}
indicating the frequency of visits. During a query, a database of N total images can be quickly
scored by traversing only the nodes visited by the query descriptors. Let s(i) be the similarity
score for the ith database image. Initially, prior to visiting any node, s(i) is set to 0. Suppose node
k is visited by the query descriptors a total of qk times. Then, all the images in the inverted list
{ik1, ..., ikNk} for node k will have their scores incremented according to
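an update of the standard TF-IDF weighted form used in vocabulary-tree scoring (the exact expression is reconstructed here from the surrounding description):

$$ s(i_{kj}) \;\leftarrow\; s(i_{kj}) + w_k \cdot \frac{q_k}{\alpha_q} \cdot \frac{c_{kj}}{\alpha_{i_{kj}}}, \qquad j = 1, \dots, N_k, $$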

where wk is an inverse document frequency (IDF) weight used to penalize often-visited nodes,
α_ikj is a normalization factor for database image ikj, and α_q is a normalization factor for the query
image.

Scores for images at the other nodes visited by the query image are updated similarly.
The database images attaining the highest scores s(i) are judged to be the best matching
candidates and kept in a shortlist for further verification.
Soft binning can be used to mitigate the effect of quantization errors for a large VT.
Some descriptors lie very close to the boundary between two bins. When soft binning is
employed, the visit counts are then no longer integers but rather fractional values. For each
feature descriptor, the m nearest leaf nodes in the VT are assigned fractional counts
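of a Gaussian soft-assignment form (reconstructed here from the surrounding description; the exact weighting of the original system may differ):

$$ c_i = \frac{\exp\!\left(-d_i^2 / 2\sigma^2\right)}{\sum_{j=1}^{m} \exp\!\left(-d_j^2 / 2\sigma^2\right)}, \qquad i = 1, \dots, m, $$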


where di is the distance between the ith closest leaf node and the feature descriptor, and σ is
appropriately chosen to maximize classification accuracy.
6.2 Inverted Index Compression
For a database containing one million images and a VT that uses soft binning, each
image ID can be stored in a 32-bit unsigned integer and each fractional count can be stored in a
32-bit float in the inverted index. The memory usage of the entire inverted index is then
$\sum_{k=1}^{K} N_k \times 64$ bits, where Nk is the length of the inverted list at the kth leaf node. For a database of one
million product images, this amount of memory reaches 10 GB, a huge amount for even a
modern server. Such a large memory footprint limits the ability to run other concurrent processes
on the same server, such as recognition systems for other databases. When the inverted index's
memory usage exceeds the server's available random access memory (RAM), swapping between
main and virtual memory occurs, which significantly slows down all processes.

Figure 6.2: (a) Memory usage for the inverted index with and without compression. A 5× savings in
memory is achieved with compression. (b) Server-side query latency (per image) with and
without compression.

A compressed inverted index can significantly reduce memory usage without affecting
recognition accuracy. First, because each list of IDs {ik1, ik2, ..., ikNk} is sorted, it is more
efficient to store the consecutive ID differences dk1 = ik1, dk2 = ik2 - ik1, ..., dkNk = ikNk -
ik(Nk-1) in place of the IDs. This practice is also commonly used in text retrieval. Second, the
fractional visit counts can be quantized to a few representative values using Lloyd-Max
quantization. Third, the distributions of the ID differences and visit counts are far from uniform,
so variable-length coding can be much more rate-efficient than fixed-length coding. Using the
distributions of the ID differences and visit counts, each inverted list can be encoded using an
arithmetic code (AC). Since keeping the decoding delay low is very important for interactive
mobile visual search applications, a scheme that allows ultra-fast decoding is often preferred
over AC. The carryover code [50] and recursive bottom-up complete (RBUC) code have been
shown to be at least 10× faster in decoding than AC, while achieving comparable compression
gains as AC. The carryover and RBUC codes attain these speed-ups by enforcing word-aligned
memory accesses.
Fig. 6.2(a) compares the memory usage of the inverted index with and without compression,
using the RBUC code. Index compression reduces memory usage from nearly 10 GB to 2 GB.
This 5× reduction leads to a substantial speed-up in server-side processing, as shown in Fig. 6.2(b).
Without compression, the large inverted index causes swapping between main and virtual
memory and slows down the retrieval engine. After compression, memory swapping is avoided
and memory congestion delays no longer contribute to the query latency.
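
The sketch below shows the two lossless steps applied to each inverted list: delta-encoding the sorted image IDs and packing small integers with a variable-byte code. The variable-byte code stands in for the carryover/RBUC codes, and the count quantizer is a crude stand-in for Lloyd-Max quantization; both are illustrative choices.

```python
def variable_byte_encode(values):
    """Encode non-negative integers with 7 data bits per byte (MSB = continuation)."""
    out = bytearray()
    for v in values:
        while v >= 128:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)

def compress_posting_list(image_ids, counts, levels=(0.25, 0.5, 1.0, 2.0)):
    """Compress one inverted list: sorted image IDs plus fractional visit counts.

    IDs are replaced by their consecutive differences (small numbers pack well);
    each fractional count is mapped to the index of the nearest representative
    level; both streams are then variable-byte packed.
    """
    deltas, previous = [], 0
    for image_id in image_ids:              # IDs are sorted, so all deltas are >= 0
        deltas.append(image_id - previous)
        previous = image_id
    quantized = [min(range(len(levels)), key=lambda i: abs(levels[i] - c))
                 for c in counts]
    return variable_byte_encode(deltas), variable_byte_encode(quantized)

# Example: one posting list of three images with soft-binned counts.
ids_blob, counts_blob = compress_posting_list([12, 97, 1034], [0.4, 1.1, 0.2])
print(len(ids_blob), len(counts_blob), "bytes")
```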








7. Geometric Verification
Geometric Verification (GV) typically follows the feature matching step. In this stage,
we use location information of query and database features to confirm that the feature matches
are consistent with a change in viewpoint between the two images. We perform pairwise
matching of feature descriptors and evaluate the geometric consistency of the correspondences, as shown
in Figure 7.1. The geometric transform between query and database image is estimated using robust
regression techniques like RANSAC [52] or the Hough transform [13]. The transformation can
be represented by the fundamental matrix, which incorporates 3-D geometry, or by simpler
homography or affine models. Geometric verification tends to be computationally expensive,
which is why it is applied only to a small list of candidate images.

Figure 7.1: Geometric Verification
A number of groups have investigated different ways to speed up the GV process.
Chum et al. investigate how to optimize individual steps to speed up RANSAC. Jegou et al. use weak
geometric consistency checks based on feature orientation information. Some authors have also
proposed to incorporate geometric information into the VT matching step.

Figure 7.2: An image retrieval pipeline can be greatly sped up by incorporating a geometric
re-ranking stage.

To speed up geometric verification, one can add a geometric re-ranking step before the
RANSAC GV step, as illustrated in Fig. 7.2. We propose a re-ranking step that incorporates
geometric information directly into the fast index lookup stage, and use it to re-order the list of
top matching images. The main advantage of the scheme is that it only requires x, y feature
location data, and does not use scale or orientation information. As scale and orientation data are
not used, they need not be transmitted by the client, which reduces the amount of data
transferred. We typically run fast geometric re-ranking on a large set of candidate database
images, and reduce the list of images that we run RANSAC on.
7.1 Fast Geometric Re-ranking
We have proposed a fast geometric re-ranking algorithm that uses the x, y locations of
features to re-rank a shortlist of candidate images. First, we generate a set of potential feature
matches between each query and database image based on the VT quantization results. After
generating a set of feature correspondences, we calculate a geometric score between them. The
process used to compute the geometric similarity score is illustrated in Figure 7.3.

Figure 7.3: The location geometric score is computed as follows: (a) features of two images are
matched based on VT quantization, (b) distances between pairs of features within an image are
calculated, (c) log distance ratios of the corresponding pairs (denoted by color) are calculated,
and (d) a histogram of log distance ratios is computed. The maximum value of the histogram is the
geometric similarity score. A peak in the histogram indicates a similarity transform between the
query and database image.

We find the distance between two features in the query image and the distance between
the corresponding matching features in the database image. The ratio of these distances corresponds
to the scale difference between the two images. We repeat the ratio calculation for features in the
query image that have matching database features. If there exists a consistent set of ratios (as
indicated by a peak in the histogram of distance ratios), it is more likely that the query image and
the database image match. The geometric re-ranking is fast because we use the vocabulary tree
quantization results directly to find potential feature matches and use a very simple similarity
scoring scheme. The time required to calculate a geometric similarity score is 1-2 orders of
magnitude less than using RANSAC.
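
A compact sketch of this location-only re-ranking score is shown below. It assumes the matched feature locations have already been paired up via the vocabulary tree; the number of histogram bins is an illustrative choice.

```python
import numpy as np
from itertools import combinations

def geometric_score(query_xy, database_xy, num_bins=20):
    """Fast geometric re-ranking score from matched feature locations only.

    query_xy, database_xy: arrays of shape (M, 2); row i of each array is a
    feature pair matched via VT quantization. The score is the peak of the
    histogram of log distance ratios between corresponding feature pairs.
    """
    log_ratios = []
    for i, j in combinations(range(len(query_xy)), 2):
        dq = np.linalg.norm(query_xy[i] - query_xy[j])
        dd = np.linalg.norm(database_xy[i] - database_xy[j])
        if dq > 0 and dd > 0:
            log_ratios.append(np.log(dq / dd))
    if not log_ratios:
        return 0
    hist, _ = np.histogram(log_ratios, bins=num_bins)
    return int(hist.max())   # a strong peak suggests a consistent similarity transform
```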
















8. System Performance
What performance can we expect for a mobile visual search system that incorporates all
the ideas discussed so far? To answer this question, we have a closer look at the experimental
Stanford Product Search System. For evaluation, we use a database of one million CD, DVD and
book cover images, and a set of 1000 query images (500×500 pixel resolution) exhibiting
challenging photometric and geometric distortions. For the client, we use a Nokia 5800 mobile
phone with a 300MHz CPU. For the recognition server, we use a Linux server with a Xeon
E5410 2.33GHz CPU and 32GB of RAM. We report results for both 3G and WLAN networks.
For 3G, experiments are conducted in an AT&T 3G wireless network, averaged over several
days, with a total of more than 5000 transmissions at indoor locations where such an image-
based retrieval system would be typically used.
We evaluate two different modes of operation. In Send Features mode, we process the
query image on the phone and transmit compressed query features to the server. In Send Image
mode, we transmit the query image to the server and all operations are performed on the server.
We discuss results of three key aspects that are critical for mobile visual search applications:
retrieval accuracy, system latency and power. A recurring theme throughout this section will be
the benefits of performing feature extraction on the mobile device compared to performing all
processing on a remote server.
8.1 Retrieval Accuracy
It is relatively easy to achieve high precision (low false positives) for mobile visual
search applications. By requiring a minimum number of feature matches after RANSAC
geometric verification, we can avoid false positives entirely. We define Recall as the percentage
of query images correctly retrieved. Our goal is to then maximize Recall at a negligibly low false
positive rate.
We compare three schemes: Send Features (CHoG), Send Features (SIFT) and Send Image (JPEG). For the JPEG
scheme, the bitrate is varied by changing the quality of compression. For the SIFT scheme, we
extract SIFT descriptors on the mobile device and transmit each descriptor uncompressed as
1024 bits. For the CHoG scheme, we need to transmit about 60 bits per descriptor across the
network. For the SIFT and CHoG schemes, we sweep the Recall-bitrate curve by varying the
number of descriptors transmitted.
First, we observe that a Recall of 96% is achieved at the highest bitrate for challenging
query images even with a million images in the database. Second, we observe that the
performance of the JPEG scheme rapidly deteriorates at low bitrates. The performance suffers at
low bitrates as the interest point detection fails due to JPEG compression artifacts. Third, we
note that transmitting uncompressed SIFT data is almost always more expensive than
transmitting JPEG compressed images. Finally, we observe that the amount of data for CHoG
descriptors are an order of magnitude smaller than JPEG images or SIFT descriptors, at the same
retrieval accuracy.

Figure 8.1: Bit-rate comparisons of different schemes. CHoG descriptor data are an order of
magnitude smaller compared to the JPEG images or uncompressed SIFT descriptors.


8.2 System Latency
The system latency can be broken down into 3 components: processing delay on client,
transmission delay, and processing delay on server.
Client and Server Processing Delay-We show the time for the different operations on the
client and server in Table II. The Send Features mode requires 1 second for feature extraction on
the client. However, this increase in client processing time is more than compensated by the
decrease in transmission latency, compared to Send Image, as illustrated in Fig. 8.6. On the server,
using VT matching with a compressed inverted index, we can search through a million-image
database in 100 milliseconds. We perform GV on a short list of 10 candidates after fast
geometric re-ranking of the top 500 candidate images. We can achieve <1 second server
processing latency while maintaining high recall.

Figure 8.2: Measured transmission latency (a) and time-out percentage (b) for transmitting
queries of different size over a 3G network. In-door (I) is tested in-doors with poor
connectivity. In-door (II) is tested in-doors with good reception. Out-door is tested outside
of buildings.



8.3 Transmission Delay
The transmission delay depends on the type of network used. We observe that
data transmission time is insignificant for a WLAN network due to the high bandwidth available.
However, transmission time turns out to be a bottleneck for 3G networks. In Fig. 8.2, we present
experimental results for sending data over a 3G wireless network. We vary query data sizes from
that of typical compressed query features (3-4 KB) to typical JPEG query images (50 KB) to
learn how query size affects transmission time.

The communication time-out was set to 60 seconds. We have conducted the experiment
continuously over several days. We tested at three different locations, typical locations where a
user might use the visual search application. The median and average transmission latency of our
experiments are shown. Sending the compressed query features typically takes 3-4 seconds. The
time required to send the compressed query image is several times longer and varies significantly
at different locations. However, the transmission delay does not include the cases in which
communication fails entirely; the frequency of such failures increases with query size. We show the percentage of
transmissions that experience a time-out in Fig. 8.2(b). The time-out percentage of transmitting
compressed query features is much lower than that of transmitting compressed query images
because of their smaller query size.

Figure 8.3: Measured transmission latency for transmitting queries of different size over a 3G network.

Figure 8.4: Time-out percentage for transmitting queries of different size over a 3G network.



Figure 8.5: Stanford Image Search System
8.4 End-to-End System Latency
We compare end-to-end latency for different schemes in Fig. 8.6. For WLAN, we observe
that < 1 second query latency is achieved for Send Image mode. Send Features mode is slower
due to the processing delay on the client. With such fast response times over WLAN, we are able
to operate our system in a continuous Mobile Augmented Reality mode. For 3G networks,
network latency remains the bottleneck, as seen in Fig. 8.6. In this scenario, there is significant benefit
in sending compressed features. Send Features reduces system latency by 2× compared to Send
Image mode.


Figure 8.6: End-to-end latency for different schemes.


8.5 Energy Consumption
On a mobile device, we are constrained by the energy of the battery, and hence,
conserving energy is critical. We measure the average energy consumption associated with a
single query using the Nokia Energy Profiler 1 on the Nokia 5800 phone. We show the average
energy consumption for a single query using Send Features and Send Image for WLAN and 3G
network connections in Fi. For 3G connections, the energy consumed in Send Image mode is
almost 3 as much as Send Features. The additional time needed to transmit image data
compared to feature data results in a greater amount of energy being consumed. For WLAN
transmission, Send Image consumes less energy, since feature extraction on the mobile client is
not required. Finally, we compute the number of image queries the mobile can send before the
battery runs out of power. A typical phone battery has voltage of 3.7 V and a capacity of 1000
mAH(or 13.3K Joules). Hence, for 3G connections, the maximum number of images that the
mobile can send is 13.3K Joules / 70 Joules = 190 total queries. For Send Features, we would be
able to perform 13.3 K joules / 21 Joules = 630 total queries, which is 3 as many queries as
Send Image can perform. This difference becomes even more important as we move towards
streaming augmented reality applications.

Figure 8.7: Average energy consumption of a single query using Send Image and Send Features
mode for various types of transmission.


9. Project Glass
Augmented-reality applications - software that overlays a level of digital information on
top of the physical world around us - have brought us more data. With an augmented-reality app
on your smartphone, you might be able to hold your phone's camera up to capture the image of
a city street. Looking at the screen, you can see information about your surroundings. The
augmented reality app maps digital information to your real-world surroundings.While these
apps can be informative and entertaining, the form factor is still a little clunky. We have to hold
up the smartphone and look at the screen -- it's like you're on a "Star Trek" away team, and
you're the one with the tricorder.

Google's answer to the problem comes in the form of a wearable device. It looks like a
pair of sunglasses with one side of the frames thicker than the other. It's called Project Glass,
and it might turn your world into endless amounts of information.

Project Glass is a research and development program by Google to develop an augmented
reality head-mounted display (HMD). Project Glass products would display information in
smartphone-like format hands-free and could interact with the Internet via natural language
voice commands. The prototype's functionality and minimalist appearance (aluminium strip
with 2 nose pads) has been compared to Steve Mann's EyeTap. The operating system software
used in the glasses will be Google's Android. Project Glass is being developed by Google X
Lab, which has worked on other futuristic technologies such as self-driving cars. Project Glass
can perform the following activities:

Remind you of appointments and calendar events
Alert you to social networking activity or text messages
Give turn-by-turn directions
Alert you to travel options like public transportation
Give you information like weather and traffic
Take and share photos and video
Use voice-recognition software to send messages or activate apps

Perform Google searches
Participate in video chats on Google Plus
Overlay information on top of physical locations

That last category is a big one. Imagine looking at a building and seeing the names of the
businesses inside it or glancing at a restaurant and being able to take a peek at the menu. With
the right application, you could apply dozens of filters to provide different types of information.

Looking even further into the future, you might be able to use Project Glass to help you
keep track of the people in your life, or learn more about the people you meet. With facial
recognition software and social networking, it's possible you could take a look at someone
you've just met and see their public profiles on any number of social platforms.


Figure 9.1: Project Glass

9.1 What makes it work?
Within the glasses is a microprocessor chip. Considering the size of the device and the
need to manage heat output, it's likely that the chip inside the glasses is an advanced RISC
(reduced-instruction-set computing) machine (ARM)-based microprocessor. These chips are
less powerful than the ones you'll find in a standard desktop computer, but they're also more
efficient and smaller.


The glasses have a lot of onboard memory. This allows the processor to work faster -- it has access to
the information it needs when executing operations. The Google team also revealed that the glasses they
were demonstrating had a touch-sensitive surface along the right side of the frame. The frames
also had a button on the top edge of the right eye for taking photos.

The glasses also have a microphone incorporated into the frame and a speaker. According
to CNET's Rafe Needleman, who attended Google I/O and got to try on a pair of glasses, the
frames only have a speaker for the right ear.

Other data-gathering devices within the frame are gyroscopes, an accelerometer and a
compass. These components feed information to the processor, which can then interpret the
position and attitude of the glasses at any given time. The team from Google also revealed that
the glasses have several data-communication radios, including WiFi and Bluetooth antennas.
What it doesn't include - at least in the prototype stage - is a cellular antenna.

One other element that must be part of the frames is the power source, along with the camera, but the
exact specifications of both of these elements have not yet been revealed by Google.















10. Future Scope

Numerous open problems remain. Accurate and near instantaneous Web-scale visual
search with billions of images will likely remain as one of the grand challenges of multimedia
technology for the years to come. Also, we would like to perform mobile visual search at video
rates without ever pressing a button. Although faster processors and networks will get us closer
to this goal, lower-complexity image-analysis algorithms are urgently needed. Hardware
support on mobile devices should also help. It is envisioned that an ongoing standardization
activity in MPEG on compact descriptors for visual search will enable interoperability between
databases and applications, enable hardware support on mobile devices, and reduce load on
wireless networks carrying visual search related data. Ultimately, we may expect to see
ubiquitous mobile augmented reality systems that continuously superimpose information and
links on everything the camera of a mobile device sees, thus seamlessly linking the virtual
world and the physical world.


















11. Conclusion

We humans can directly perceive only three spatial dimensions, but beyond our perception
our world contains a wealth of information of which most of us are not aware and which forms the
fourth dimension of the world. This fourth dimension can be explored with a new, mobile and
pervasive computing research field, i.e., Augmented Reality.

An Augmented Reality system integrates virtual information into a person's physical
environment so that he or she perceives the information as existing in their surroundings. This
augmented reality is achieved with visual search, which follows a systematic pipeline: capturing
the image, extracting features with techniques like SIFT and CHoG, matching these features
against a central database, and providing relevant information which is integrated into the
environment. Nowadays, systems achieve more than 95% recall at a negligible false positive rate
for databases with more than one million classes, a recognition performance that is sufficient for
many applications.















12. References

[1] Google. (2009). Google Goggles. Available: http://www.google.com/mobile/goggles/
[2] Nokia. (2006). Nokia Point and Find. Available: http://www.pointandfind.nokia.com
[3] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large
vocabularies and fast spatial matching," Proc. of the IEEE Conference on Computer Vision
and Pattern Recognition, 2007.
[4] V. Chandrasekhar, D. M. Chen, S. S. Tsai, N. M. Cheung, H. Chen, G. Takacs, Y. Reznik,
R. Vedantham, R. Grzeszczuk, J. Bach, and B. Girod, Stanford Mobile Visual Search Data
Set, 2010.
[5] B. Girod, V. Chandrasekhar, D. M. Chen, et al., IEEE Signal Processing Magazine, Special
Issue on Mobile Media Search, 2012.
[6] http://en.wikipedia.org/wiki/Google_Glass
[7] http://electronics.howstuffworks.com/gadgets/other-gadgets/project-glass.htm
