
Department of Informatics

Robotics and Perception Group

Tim Oberhauser

Camera Localization in a
3D Map

Semester Thesis
Robotics and Perception Lab
University of Zurich

Supervision
Christian Forster
Davide Scaramuzza

February 2013

Contents

Abstract

1 Introduction
  1.1 Related Work

2 Approach
  2.1 Map Building
    2.1.1 RGBDSLAM
    2.1.2 Filtering of the Map
  2.2 Camera Localization
    2.2.1 Feature Extraction
    2.2.2 Matching
    2.2.3 Pose Estimation

3 Results
  3.1 Intermediate Results
    3.1.1 Kinect and AR.Drone Camera Compared
    3.1.2 Conclusion from the Intermediate Results
  3.2 Final Results
    3.2.1 Feature Map
    3.2.2 Kinect Camera
    3.2.3 AR.Drone Camera
    3.2.4 Performance Analysis

4 Discussion
  4.1 Future Work

Abstract
An approach has been implemented to estimate the pose of a camera with respect to a 3D map. RGBDSLAM has been used to build the 3D map and has been modified to export the information needed to localize the camera: the 3D locations and descriptors of all features. Features are extracted from the camera image and matched with the features of the 3D map using their descriptors. Finally, the pose of the camera is computed using a RanSaC implementation of the perspective-three-point (P3P) algorithm. If a current pose estimate is available, the camera model can be used to select the features of the map that are currently believed to be visible to the camera. This way, the number of features in the matching part of the algorithm, which is computationally the most expensive step, can be reduced substantially. Owing to the implementation of RGBDSLAM, features exist multiple times in the feature map; to speed up the matching step, an option has been added to filter out these duplicate features. The approach has been evaluated in practical experiments using the RGB camera of the Kinect without its depth measurement and, finally, the front camera of the AR.Drone.


Chapter 1

Introduction
The broader scope of this project is to use a ground robot paired with a micro unmanned aerial vehicle (MAV) for indoor exploration. The ground robot could be equipped with an RGB+depth camera such as the Kinect, which can be used to build a 3D model of the environment. To prevent the MAV from colliding with walls, it needs to be localized with respect to this map. This could be done by using the onboard camera of the MAV.
The problem has been simplified to localizing a camera in a 3D map that is assumed to be built beforehand. An approach has been implemented; it is presented and evaluated in this report.

1.1 Related Work

For a similar application, autonomous indoor exploration with a ground robot and a MAV, Rudol et al. (2008) [1] came up with a method to estimate the pose of the MAV with respect to the ground robot. Light emitting diodes (LEDs) of different colors are mounted on the MAV in a certain pattern. A monocular camera on the ground robot is used to detect the LEDs, whose image positions can be translated into the MAV's relative pose using the known pattern.
Many have tackled the simultaneous localization and mapping (SLAM) problem using monocular cameras; two famous examples are MonoSLAM [2] and PTAM [3]. The difference to this project is that here the map can be built with an RGB+depth camera, which, thanks to the additional depth measurement, could be more precise. The monocular camera is only used for localization. However, the approach presented here only measures the pose and would still need to be combined with a motion model to compete with the mentioned approaches in terms of precision.

Chapter 2

Approach
The goal of this work is to find the camera pose with respect to a 3D map by only using the camera's visual information. The approach described in this report uses image features from the camera image and matches them with the 3D map. Therefore, only the feature descriptors and their 3D locations need to be passed to the routine; from here on, such a structure will be called a feature map.

2.1 Map Building

There are different approaches to building a 3D map. RGBDSLAM [4] has been chosen; it uses an RGB-D (RGB + depth) camera such as the Microsoft Kinect. Intentionally, this algorithm was only minimally modified, as it might later be replaced by another routine that suits the problem at hand better. But since it was not fundamentally changed, the algorithm needs to be well understood to identify potential drawbacks. In the following section, the algorithm is summarized briefly.
The only modification that was necessary was to efficiently pass the feature map to the proposed routine. This was achieved with a custom ROS message type [5], which contains all descriptors and feature locations in a single message and is sent when requested by the user (by pressing CTRL+K in the RGBDSLAM window).
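To make the interface concrete, the exported feature map can be thought of as a descriptor matrix paired with one 3D position per feature. The following minimal Python sketch shows such a container; the class and the commented-out usage are illustrative assumptions, not the actual ROS message definition used here.

from dataclasses import dataclass
import numpy as np

@dataclass
class FeatureMap:
    """Container mirroring the information exported from RGBDSLAM.

    descriptors: one row per feature (e.g. 64 values for SURF, 128 for SIFT).
    positions:   the corresponding 3D locations in map coordinates.
    """
    descriptors: np.ndarray  # shape (N, D), float32
    positions: np.ndarray    # shape (N, 3), float32

    def __post_init__(self):
        # Sanity check: exactly one 3D point per descriptor row.
        assert self.descriptors.shape[0] == self.positions.shape[0]

# Hypothetical usage inside the ROS callback that receives the custom message:
# feature_map = FeatureMap(descriptors_from_msg, positions_from_msg)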

2.1.1 RGBDSLAM

RGBDSLAM consists of a front-end and a back-end. The front-end tries to find relative poses between pairs of key frames and the back-end optimizes these poses to form a globally consistent pose graph.
For every new key frame, features are extracted and matched to the features of a subset of previous key frames using the feature descriptors. From these correspondences, and taking into account the depth information at the key points, the relative transformation between two key frames can be found. However, there can be false matches, and the depth can be incorrect because the shutters of the color and infrared cameras are not synchronized and because of interpolation errors. Therefore, a Random Sample Consensus (RanSaC) step is executed, which randomly chooses three matched feature pairs. At this stage, outliers can already be identified by comparing the pairwise Euclidean distances. If the chosen samples fulfill this criterion, the rigid transformation is calculated and applied to all matched features. A matched pair is considered an inlier if the distance between the two features after applying the newly found transformation is smaller than 3 cm. The transformation can be refined using all inliers. This process is repeated and the transformation with the most inliers is finally used. The key frame is added if it could be matched with one of the previous key frames. If a valid transformation could be found between two key frames, an edge is added to the pose graph. It is the task of the back-end to optimize the pose graph to make it globally consistent. This is of special importance in the case of loop closures.

Figure 2.1: 3D map built by RGBDSLAM


For every key frame, among other data, the feature descriptors and the 3D feature locations are stored. However, it is not saved which features match which features in other frames. If all feature descriptors and their locations are exported, every feature that has been observed in multiple key frames also exists multiple times in the exported feature map. This can slow down the matching process, which is why filtering the feature map can be advantageous. One approach to filtering the feature map has been implemented and is described in the next section.

2.1.2 Filtering of the Map

For every feature of the feature map, the best k matches are found with OpenCV's knnMatch from the DescriptorMatcher class [6]. If matches have a descriptor distance smaller than a certain threshold, they are considered to represent the exact same feature. The 3D locations of the features that are considered to belong together are averaged to reduce the influence of noise in the depth measurements and of errors from the map building process. Figure 2.2 shows the feature map of the map in figure 2.1 before and after filtering, using a distance threshold of 0.5 m.
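A minimal Python sketch of this filtering step is given below. It uses OpenCV's brute-force matcher to find, for every map feature, its k nearest descriptors within the map itself, and greedily merges features whose descriptor distance lies below the threshold by averaging their 3D positions. The function name and the greedy merging strategy are illustrative assumptions rather than the exact implementation.

import numpy as np
import cv2

def filter_feature_map(descriptors, positions, dist_thresh=0.5, k=5):
    """Merge map features that likely represent the same physical point.

    descriptors: (N, D) float32 array, positions: (N, 3) float32 array.
    Returns the filtered (descriptors, positions) with duplicates averaged.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For every feature, its k nearest neighbours within the map (including itself).
    knn = matcher.knnMatch(descriptors, descriptors, k=k)

    merged = np.zeros(len(descriptors), dtype=bool)
    out_desc, out_pos = [], []
    for i, matches in enumerate(knn):
        if merged[i]:
            continue
        # Collect all neighbours closer than the descriptor distance threshold.
        group = [m.trainIdx for m in matches
                 if m.distance < dist_thresh and not merged[m.trainIdx]]
        merged[group] = True
        # Keep one representative descriptor and average the 3D positions to reduce
        # the influence of depth noise and map building errors.
        out_desc.append(descriptors[i])
        out_pos.append(positions[group].mean(axis=0))
    return np.array(out_desc), np.array(out_pos)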

Figure 2.2: Feature map (a) before and (b) after filtering, displayed by rviz [7]

2.2 Camera Localization

The working principle of the proposed algorithm is shown in figure 2.3. In order to compute the camera pose with respect to the feature map, points of the current image need to be registered to points with known 3D positions, i.e. the points of the feature map. This is done by matching the feature descriptors of the current image with the descriptors of the feature map. Clearly, the descriptors of the feature map have to be of the same kind as the descriptors of the features of the current image.

Figure 2.3: Working principle of the proposed algorithm

2.2.1 Feature Extraction

RGBDSLAM works with SIFT [8], SURF [9] and ORB [10] features. In this
project, only SIFT and SURF have been tested because they lead to more accurate results. However, the approach presented here is not intended to work
exclusively with certain descriptors but should be adaptable to other map building approaches that use different descriptors too.

2.2.2 Matching

OpenCV's BFMatcher [6] has been used to match the descriptors of the current image to the descriptors of the feature map. For SURF and SIFT features, the L2 norm can be used as an error measure between two descriptors. BFMatcher also has an option for cross-checking: with this turned on, feature i from set A and feature j from set B are only considered a match if j is the best match for i and vice versa. This method tends to produce the best matches with the fewest outliers. However, if a feature exists multiple times in one set, as in a feature map built by RGBDSLAM, the probability of satisfying the cross-checking criterion decreases substantially. I therefore recommend not using cross-checking if the map contains the same features multiple times. Note that with this option turned off, BFMatcher will find a match for every feature.
Unfortunately, matching feature descriptors is computationally quite expensive, as it involves calculating the L2 norm of every possible feature pair from sets A and B. One way to accelerate this step is to match the features of the current image with only a subset of the features of the map. If an estimate of the current camera pose is available, the features of the map can be projected onto the image plane using the camera model. If a projected feature lies within the boundaries of the image sensor, it is considered to be visible. It then suggests itself to match the features of the current frame only to the visible features of the feature map. This can substantially reduce the number of features to match and accordingly the computational cost. I distinguish between two cases: global pose estimation, where the features of the current image are matched with all features of the map, and tracking, where they are matched with only the visible features of the map; a sketch of both cases is given below. Tracking is discussed further in the next section.
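The sketch below illustrates the two cases under a few stated assumptions: the camera matrix K, the image size and, for tracking, the previous pose (R, t, mapping map coordinates into the camera frame) are taken as given, and OpenCV's BFMatcher is used as described above. It is a sketch of the selection and matching logic, not the exact implementation.

import numpy as np
import cv2

def match_to_map(curr_desc, map_desc, map_pts, K, R=None, t=None, img_size=(640, 360)):
    """Match current-image descriptors to the feature map.

    Global pose estimation: R, t unknown -> match against the whole map.
    Tracking: R, t from the previous frame -> match only the visible subset.
    img_size is (width, height); the default corresponds to the low AR.Drone resolution.
    Returns the matches and the indices of the map features that were used.
    """
    if R is None or t is None:
        subset = np.arange(len(map_pts))              # global pose estimation
    else:
        cam_pts = map_pts @ R.T + t                   # map points in the camera frame
        uv = cam_pts @ K.T
        with np.errstate(divide="ignore", invalid="ignore"):
            uv = uv[:, :2] / uv[:, 2:3]               # perspective division
        w, h = img_size
        visible = (cam_pts[:, 2] > 0.1) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                                        & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        subset = np.flatnonzero(visible)              # tracking: only the visible features

    matcher = cv2.BFMatcher(cv2.NORM_L2)              # cross-checking off, see above
    matches = matcher.match(np.asarray(curr_desc, np.float32),
                            np.asarray(map_desc, np.float32)[subset])
    return matches, subset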
Tracking
Obviously, the pose must be known in order to project world points into the image plane. This is why the first pose always has to be calculated by global localization. After that, the pose computed from the previous frame is always used to project the features for matching with the current frame. If the proposed approach is fused with a more advanced velocity model (and other measurement techniques), the current estimate of the pose should of course be used for the projection.
Matching the features of the current image with only the visible features of the feature map clearly speeds up the matching step of the algorithm. Also, the probability of wrong matches should decrease.
If the pose estimation yields a wrong pose, the wrong features are reprojected and used for matching. To alleviate the consequences of this, a rule has been introduced: if the resulting pose is not trusted for two subsequent frames, the following frames are processed by global pose estimation. As soon as a valid pose has been found again, the routine switches back to tracking. When a pose is considered valid is discussed in section 2.2.3.

2.2.3 Pose Estimation

After finding point correspondences, the camera pose can be calculated using the perspective-three-point (P3P) algorithm [11], more specifically a RanSaC [12] implementation thereof.
The algorithm randomly chooses a set of 3 matched feature pairs. First, the set has to be checked for the degenerate case in which the three points lie close to a line. In addition, another rule has been implemented: the pairwise distances of the 3 points in the image must be greater than 10 pixels. Since the 3D locations of the points are known, these 3 pairs generally restrict the possible camera poses to four candidates. Using the four possible camera poses, a fourth world point can be projected into the camera's image plane; the ambiguity can then be resolved by comparing the location of the projected point with the location of its match.
Similarly, the other matched world points can be projected onto the image plane. If the distance between a projected world point and its match is lower than a certain reprojection threshold, it is considered an inlier. Finally, the pose with the most inliers and a list of all the inliers are returned.
The resulting pose will always have at least 4 inliers, namely the ones used for computing it. If the pose has only these 4 inliers, not a single additional feature pair supports the model found by the algorithm; such a result is very likely to be wrong and should therefore not be trusted. A minimum inlier threshold therefore suggests itself: if the number of inliers does not satisfy this condition, the pose is not considered valid. The choice of the threshold and the effect of such a measure are discussed in section 3.1.
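The following sketch shows this step using OpenCV's generic solvePnPRansac with the P3P flag in place of the RanSaC implementation of [11] used in this work, followed by the minimum-inlier check. The default thresholds mirror the standard setup of chapter 3; the helper itself is an illustrative assumption, not the actual code.

import numpy as np
import cv2

def estimate_pose(pts_3d, pts_2d, K, dist_coeffs=None,
                  reproj_thresh=1.5, min_inliers=8, max_iters=1000):
    """P3P RanSaC pose estimation with a minimum-inlier validity check.

    pts_3d: (N, 3) matched map points, pts_2d: (N, 2) matched image points.
    Returns (R, t, inlier_indices), or None if the pose should not be trusted.
    """
    if len(pts_3d) < 4:
        return None                                   # OpenCV's P3P solver needs 4 points
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, dist_coeffs,
        iterationsCount=max_iters,
        reprojectionError=reproj_thresh,
        confidence=0.9999,
        flags=cv2.SOLVEPNP_P3P)
    if not ok or inliers is None or len(inliers) < min_inliers:
        return None                                   # too few inliers: pose not valid
    R, _ = cv2.Rodrigues(rvec)                        # rotation vector -> rotation matrix
    return R, tvec.reshape(3), inliers.ravel()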

Chapter 3

Results
After building a map using RGBDSLAM with the RGB and depth data of the Microsoft Kinect, every calibrated camera should be localizable using only its grayscale images. The proposed approach has been tested with two different devices: the Microsoft Kinect, using its grayscale image without the depth information, and the AR.Drone 2 [13], using its front camera.

Figure 3.1: The flying room of the AI Lab of the University of Zurich, also showing 6 of the 10 cameras of the OptiTrack tracking system [14]
Matching the features of a Kinect image with a map that has also been built using the Kinect worked well without much parameter tuning. The AR.Drone brought about some issues, which could only be partly solved. For this camera, it made sense to first consider a simplified case instead of directly trying to match its images to the map. This experiment is introduced and presented in the next section. Its goal was to explore some of the problems of the AR.Drone camera and to come up with a reasonable choice for some of the parameters. These were then used for localizing both cameras in the 3D map; the results are presented in section 3.2.

3.1 Intermediate Results

The localization procedure involves matching features of the map with features extracted from the to-be-localized camera. The map in our case will always be built by the Kinect. Hence, it makes sense to investigate how matching features from a Kinect frame to a Kinect frame compares to matching features from an AR.Drone frame to a Kinect frame. This has been investigated by a very simple experiment, which is described in the following.
A single frame with depth information captured by the Kinect serves as a replacement for the map. The AR.Drone has been placed such that it had the same view point as the Kinect when the single frame was taken. The pose estimation process described in chapter 2 is run on each frame of the video. Since the camera has not been moved, the correct solution is close to the identity for the rotational part and close to zero for the translational part. To allow a comparison with the results from the AR.Drone, the same experiment has also been conducted using the Kinect.
This experiment leaves aside many issues that arise during localization with respect to the feature map, for instance duplicate features, noisy feature positions due to the map building algorithm, and very different view angles and scales. There is no guarantee that the routine, with the parameters chosen from this simple experiment, will behave well in a much more complicated environment, but tuning all the parameters by rerunning the actual localization over and over again would have been very tiresome if not impossible.
Another simplification made for these experiments is that only SURF features have been used, with the detector threshold held constant. The detector threshold essentially determines the number of features.
In the following, the two cameras are compared using the described simple experiment.

3.1.1 Kinect and AR.Drone Camera Compared

Figure 3.2: A frame from (a) the AR.Drone at its higher resolution and (b) the Kinect from the test experiment

Feature Extraction
The AR.Drone camera has a resolution of 720 by 1280 pixels at a frame rate of 30 frames per second; the resolution can also be switched to 360 by 640 pixels. First, the higher resolution is considered.


The features of the AR.Drone image should be matched with the features of the map, or in this case, of a single frame coming from the Kinect. The features of the map have also been extracted from a Kinect image, which has a resolution of 480 by 640. The feature detection process involves downsampling the image a certain number of times; this number is commonly called the number of octaves. If features from cameras with different resolutions are to be matched, the number of octaves has to be adjusted accordingly.
By default, the number of octaves is set to 4; this setting is used to build the map and, in this experiment, to extract features from the Kinect image. If the same features are to be detected in an image with double the resolution (the higher resolution AR.Drone image), one more downsampling step has to be added, i.e. the number of octaves for this image should be set to 5.
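As an illustration, this adjustment amounts to changing a single detector parameter. The sketch below assumes the SURF implementation from OpenCV's contrib modules; the threshold of 200 corresponds to the detector threshold held constant in these experiments.

import cv2

# SURF lives in the opencv-contrib package (xfeatures2d); the parameter values below
# mirror the setup described in the text and are otherwise assumptions.
surf_kinect   = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=4)  # 640x480 Kinect image
surf_drone_hi = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=5)  # 1280x720: one extra octave
surf_drone_lo = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=4)  # 640x360: same scale range

# keypoints, descriptors = surf_kinect.detectAndCompute(gray_image, None)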
Whether this adjustment leads to better matches is difficult to state, but what matters in the end is whether or not it leads to a more accurate pose estimate. Hence, the precision resulting from 4 and 5 octaves for the higher resolution image and for the lower resolution image is shown in table 3.1. The table also lists the precision resulting from matching a Kinect image to a single Kinect frame. The reprojection threshold has been set to 4.0 pixels for the higher resolution test and to 2.0 pixels for the lower resolution. Since all images have been recorded from the same position, the translational error is simply the distance from zero. The error caused by imprecise placement of the AR.Drone camera is constant and, compared to the precision of the algorithm, small. Using the lower resolution yields the best results among the AR.Drone images and also brings a computational advantage.

Table 3.1: Comparison of the localization with respect to a single Kinect frame, using an image of the AR.Drone at different resolutions and numbers of octaves, and using a Kinect image

                                 number of   translational error     number of inliers   number of
                                 octaves     (mean ± std.dev.) [m]   (mean ± std.dev.)   measurements
Kinect to Kinect                             0.0110 ± 0.0083         369.18 ± 13.59      68
AR.Drone to Kinect (720×1280)    4           0.5113 ± 0.4412         39.79 ± 21.41       110
AR.Drone to Kinect (720×1280)    5           0.4834 ± 0.3809         44.25 ± 24.06       109
AR.Drone to Kinect (360×640)     4           0.4564 ± 0.1541         49.57 ± 10.99       95

Feature Matching
In figure 3.3, typical matching results from both cameras with a Kinect frame are shown. Obviously, the matches from the AR.Drone to the Kinect frame contain many more false matches than those from the Kinect to the Kinect. Reasons for this will be discussed in chapter 4.


Figure 3.3: Typical matching result of (a) two Kinect frames and an AR.Drone frame with a Kinect frame, using (b) the higher resolution image and (c) the lower resolution image

P3P RanSaC
Having seen the quality of the matches from the AR.Drone to the Kinect frame, it is not surprising that the output of P3P is also much worse in this case. Figure 3.4 confirms this.
A measure that has been mentioned before is the introduction of a minimum number of inliers to decide whether or not a measurement should be trusted. This has a notable impact only with very noisy data. Such a situation has been simulated by using a low feature detector threshold of 50 instead of 200. For this result, the higher resolution image has been used. Introducing an inlier threshold entails that some of the measurements are lost. Figure 3.5 illustrates the impact of an inlier threshold on the precision of the pose estimate.

Figure 3.4: Inliers after the P3P RanSaC step of (a) two Kinect frames and an AR.Drone frame with a Kinect frame, using (b) the higher resolution and (c) the lower resolution image

3.1.2 Conclusion from the Intermediate Results

Since the algorithm works at least as well with the lower resolution as with the higher one, and taking into account the computational advantage, only the lower resolution images are used from here on. Also, a minimum inlier threshold has been introduced; through many experiments, 8 proved to be a reasonable choice.

3.2 Final Results

In this section, the performance of the algorithm is tested in the actual application: localizing a camera in a feature map. The two cameras are considered as two separate cases. In both cases, the ground truth trajectory has been recorded using the OptiTrack motion capture system [14]. The motion of an object can be recorded by attaching reflective spheres to it; these are made visible to the infrared cameras of the tracking system, which emit infrared light.

Figure 3.5: AR.Drone: Translational error of P3P's pose estimation over the iterations, illustrating the effect of an inlier threshold of 8 (curves: original output of P3P, average error without threshold, outliers with fewer than 8 inliers, average error with threshold)
First, the map building is discussed in the next section. The map building algorithm also involves a trajectory estimation, which, in terms of precision, is the bottleneck of the localization. Therefore, it makes sense to compare the precision of the localization with that of RGBDSLAM. This has been done by using the visual data (without depth) of the same dataset that has been used for building the map. This is again a simplified case: a part of the frames of the video have been used as key frames by RGBDSLAM. Hence, all frames lie close to key frames, and the features contained in the map have been observed from view points very similar to those of the features extracted during the localization procedure. This simplified case is considered first in the following sections; after that, the proposed approach is used to reconstruct another trajectory with respect to the same map as before.
Each dataset consists of an RGB video with a depth measurement for every pixel. Unfortunately, the infrared projector of the Kinect, which is used for the depth measurement, was detected by the tracking system as a reflective sphere and behaved unpredictably. When the reflective balls were mounted directly on the camera, this extra point caused great ambiguity for the tracking system. The problem could be solved by extending the Kinect with a rigid wooden construction on which the reflective spheres could be mounted; the trackable Kinect is shown in figure 3.6.
The positions measured by the OptiTrack had to be transformed to RGBDSLAM coordinates. The measurements with the smallest time differences to the time stamps of the RGBDSLAM key frames have been used as the corresponding ground truth points. Finally, the transformation between the two coordinate systems could be computed by aligning the corresponding points, minimizing their distances in a least squares sense. Four approaches to this problem are compared in [15]; the one based on the singular value decomposition of a matrix has been used.
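A minimal sketch of this alignment is given below. It implements the SVD-based least-squares estimation of a rigid transformation between two sets of corresponding points, in the spirit of the algorithms compared in [15]; the timestamp association and any outlier handling are left out, and the function name is only illustrative.

import numpy as np

def align_rigid(src, dst):
    """Least-squares rigid transformation (R, t) such that R @ src_i + t ~ dst_i.

    src, dst: (N, 3) arrays of corresponding points, e.g. OptiTrack positions and
    the matching RGBDSLAM key frame positions. Based on the SVD of the
    cross-covariance matrix of the centred point sets.
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Usage: R, t = align_rigid(optitrack_positions, rgbdslam_positions)
# ground_truth_in_map_frame = optitrack_positions @ R.T + t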

3.2.1 Feature Map

The parameter choice used for building the map in which the cameras are localized can be seen in table 3.2. It was only changed to compare localization based on SURF features to localization based on SIFT features. To achieve more precise map building and to simulate the slow movement of the ground robot, the dataset was played back at a fifth of the original speed. This way, more key frames were added.

Figure 3.6: Trackable Kinect
Table 3.2: RGBDSLAM: standard setup

option                        standard choice
use SIFTGPU                   off
feature type                  SURF
minimum number of features    600
maximum number of features    1600
feature matching              FLANN [16]

3.2.2 Kinect Camera

To examine the influence of certain parameters on the precision of the proposed approach, a standard setup has been chosen. In each of the following sections, one option of the standard setup is varied to investigate the impact on precision and, where meaningful, on computational time. The chosen standard setup is shown in table 3.3.
Reproducing the RGBDSLAM Trajectory

The dataset has been recorded by moving the trackable Kinect by hand at a relatively low height to simulate a possible future application with the Kinect mounted on a ground robot. Figure 3.7 shows the trajectory estimated by RGBDSLAM and by the proposed approach, as well as the ground truth from OptiTrack. The precision of the two trajectory estimates can be found in table 3.4. Note that the ground truth trajectory exhibits one clear disturbance, most probably because the person holding the camera blocked the view of the tracking system.

Table 3.3: Kinect localization: standard setup

part                  option                    standard choice
map building          filter map                on
                      distance threshold        0.5 m
feature extraction    feature type              SURF
                      detector threshold        200
                      number of octaves         4
feature matching      tracking                  on
P3P                   minimum inliers           8
                      reprojection threshold    1.5 pixels
                      maximum iterations        1000
                      maximum data trials       40
                      p-value                   0.9999
Figure 3.7: Resulting trajectory estimate when trying to reproduce the RGBDSLAM trajectory with only the visual data, using the proposed approach in standard configuration (curves: proposed approach, RGBDSLAM, ground truth; axes x [m], y [m])

Table 3.4: Comparison between RGBDSLAM and the proposed approach

                     translational error (mean ± std.dev.) [m]
RGBDSLAM             0.1281 ± 0.0800
proposed approach    0.1856 ± 0.2915

Reproducing a Free Trajectory

The same feature map is now used to estimate another trajectory. Figure 3.8 shows this trajectory and the trajectory estimated with the standard configuration. The trajectory that has been used for building the map with RGBDSLAM is also shown to illustrate their spatial relation; it will not be shown in the following plots.

Figure 3.8: Resulting pose estimation of the Kinect with the standard parameter set (curves: proposed approach, ground truth, and the ground truth of the RGBDSLAM map building trajectory)

Filtering of the Map As already explained in section 2.1.1, the feature map exported from RGBDSLAM contains all features of all key frames. This way, a huge number of features, containing much redundant information, is accumulated during the map building process. In section 2.1.2, an approach has been introduced to reduce the number of duplicate features in the feature map. This step is computationally quite expensive, but it has to be executed only once before localization starts, and it can reduce the computational time of the localization.
The trajectory estimate obtained without filtering the feature map beforehand is shown in figure 3.9. The precision of the localization and the computational time of the matching step, the step that is mainly affected by a larger number of features, are compared in table 3.5. A detailed discussion of the processing times of all parts of the algorithm follows in section 3.2.4. Interestingly, filtering the map beforehand yields more accurate results during the localization and reduces the computational time of the matching step.


Figure 3.9: Resulting pose estimation of the Kinect without filtering the map beforehand
Table 3.5: Kinect: comparison of precision and processing time of the matching step between localization with and without filtering of the map

filter map    translational error (mean ± std.dev.) [m]    average computation time of matching [s]
on            0.3440 ± 0.3091                              0.3311
off           0.3697 ± 0.5851                              1.1111

P3P: Reprojection Threshold Tuning the reprojection threshold of the P3P RanSaC trades precision of the pose estimate against robustness to noise. Both are reflected in the average error of the localization. Values from 1.0 to 3.0 pixels have been tested; the resulting translational errors are listed in table 3.6 and plotted in figure 3.10. A reprojection threshold of 1.5 pixels results in the error with the lowest mean value and standard deviation.
Table 3.6: Kinect: influence of P3P's reprojection threshold on the precision

reprojection threshold [pixels]    translational error (mean ± std.dev.) [m]
1.0                                0.3781 ± 0.5130
1.5                                0.3440 ± 0.3091
2.0                                0.4370 ± 0.4593
2.5                                0.5014 ± 0.5218
3.0                                0.4544 ± 0.4631

Figure 3.10: Influence of the reprojection threshold on the error of the position estimate (mean and standard deviation for thresholds from 1.0 to 3.0 pixels)

Tracking vs. Global Localization As discussed in section 2.2.2, matching the features of the current image only with the ones that are believed to be visible can speed up the matching part. But it depends on a good estimate of the pose from the previous frame. The precision and computational time of the matching step of the localization with and without tracking are compared in table 3.7. Not only is the tracking approach faster, it is also more precise. This could be caused by the fact that the probability of getting wrong matches is higher when matching the features of the current image with all the features of the feature map.

Figure 3.11: Resulting pose estimation of the Kinect without tracking


Table 3.7: Kinect: comparison of precision and processing time of the matching step between tracking and global localization

tracking    translational error (mean ± std.dev.) [m]    average computation time of matching [s]
on          0.3440 ± 0.3091                              0.3311
off         0.4020 ± 0.4067                              0.5890

SURF vs. SIFT SURF features have been used because of their advantage in computational time. However, it is interesting to investigate whether SIFT features lead to a more precise localization. Of course, for this case the map has been built using SIFT features as well. The precision of the trajectory estimate of RGBDSLAM with SIFT features is comparable to the one with SURF: a mean error of 0.1244 m with a standard deviation of 0.0764 m. The trajectory estimate from the proposed algorithm is shown in figure 3.12, and its precision is compared to the one using SURF features in table 3.8. From the plot, it seems that the trajectory estimate with SIFT features is more precise, but two outliers raise the mean and standard deviation of the error.
Figure 3.12: Trajectory estimate using SIFT features

Table 3.8: Kinect: precision of the localization using SURF and SIFT features

feature type    translational error (mean ± std.dev.) [m]
SURF            0.3440 ± 0.3091
SIFT            0.3252 ± 0.6625

3.2.3 AR.Drone Camera

In section 3.1.1, it has been shown that the matching part of the algorithm does not work as well for the AR.Drone camera as for the Kinect. Because of this, the pose estimation is more ambiguous with this camera. The lower image resolution of 360 by 640 has been used. To account for the imprecise camera calibration, the reprojection threshold has been raised from 1.5 to 2.0 pixels.
The ground truth and the estimated trajectory of the AR.Drone are depicted in figure 3.13; the trajectory of the Kinect that has been used to build the map is also shown again. The quality of the trajectory estimate is not as good as that of the Kinect, most probably due to wrong matching. The mean and standard deviation of the position error can be taken from table 3.9. Worth noting is that out of 497 frames, 244 led to a valid pose with SIFT but only 48 with SURF.
Figure 3.13: AR.Drone trajectory estimate using (a) SURF and (b) SIFT features

Table 3.9: AR.Drone: precision of the localization using SURF and SIFT features

feature type    translational error (mean ± std.dev.) [m]
SURF            0.7923 ± 0.5604
SIFT            0.5363 ± 0.4607

3.2.4 Performance Analysis

The average processing times of the different steps for a Kinect and for an AR.Drone frame are shown in figure 3.14. In order to make the results comparable, all results shown have been measured with SIFT features. All experiments have been run on a virtual machine with 2 GB of memory and 4 CPUs at 2.6 GHz.
The most expensive parts are feature extraction and matching, leading to a total processing time of 217.8 ms for a single Kinect frame. This means that in a real-time application, only about 4 frames per second can be processed. For an image from the AR.Drone, this rate increases to almost 7 frames per second. Note that the difference is this large mainly because of the difference in resolution.

Figure 3.14: Average processing times of (a) a Kinect frame (total: 217.8 ms) and (b) an AR.Drone frame (total: 147.4 ms), all using SIFT features and tracking

step                (a) Kinect frame    (b) AR.Drone frame
Extract Features    74.9 ms             49.5 ms
Reproject           0.6 ms              0.5 ms
Matching            105.4 ms            75.5 ms
P3P RanSaC          12.9 ms             8.2 ms
Displaying          24.1 ms             13.7 ms

Chapter 4

Discussion
The approach presented in this report has been tested with two cameras, the Kinect and the front camera of the AR.Drone. With the Kinect, it could be shown that the algorithm works in principle. Precision can be improved by fusing the measurements with a physical model. However, using just the visual data of the Kinect is not an application that would make sense in practice; the approach only makes sense for cameras with no depth information, such as the front camera of the AR.Drone. For this camera, the algorithm did not work as well, especially with SURF features. An important reason for this is certainly that matching features from the AR.Drone image to features from the Kinect image works poorly (see section 3.1.1). In order to make the algorithm work with this camera (and hopefully all other cameras too), the source of the wrong matching has to be found. Therefore, the differences between the two cameras have to be investigated:

- radial distortion
- different contrast and illumination due to the different sensor
- lens quality?

These are factors that could be responsible for the poor matching in the frame-to-frame experiment (see section 3.1.1). In the actual localization, some other difficulties are added:

- motion blur
- different viewing angles and scales
Nothing can be done about the sensor differences, the lens quality, the motion blur and the viewing angles. Something could, however, be done about radial distortion: in [17], the influence of radial distortion on SIFT features has been studied, and it is stated that the repeatability of SIFT features can be improved by rectifying the image before extracting the features. Rectifying the image is a computationally expensive step; hence, a modified SIFT algorithm that takes the radial distortion into account (sRD-SIFT) has been proposed and further improved in [18]. Unfortunately, an implementation of sRD-SIFT is currently only available for MATLAB. Since SURF features are detected and extracted in a similar manner as SIFT features, these statements probably apply to SURF features as well. However, the influence of radial distortion in this case is not expected to be huge.
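For completeness, a minimal sketch of the plain rectify-then-extract alternative discussed above is given below, using the SIFT interface of recent OpenCV releases; the camera matrix and distortion coefficients are assumed to come from a standard calibration of the AR.Drone camera.

import cv2

def extract_on_rectified(gray, K, dist_coeffs):
    """Undistort the image before feature extraction to improve repeatability under
    radial distortion, at the cost of an extra per-frame undistortion step."""
    rectified = cv2.undistort(gray, K, dist_coeffs)
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(rectified, None)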
4.1 Future Work

Most importantly, the presented approach should be combined with a motion model to improve precision.
For the results of this report, timing issues have been ignored completely. Frames that have been recorded at 30 Hz have been processed at 5 to 7 Hz without skipping frames. If this approach were to run in a real-time application, frames would consequently be skipped. To avoid skipping frames, i.e. not to lose track in quick turns, the algorithm needs to be extended, for example by tracking features with optical flow using the pyramidal implementation of the Lucas-Kanade feature tracker [19]: the features that have been found to be inliers in the P3P RanSaC step could be tracked. Since they are inliers, they have a corresponding 3D location in the map and can therefore be fed to the P3P RanSaC again. This way, the computation can be distributed to two threads running in parallel. Note that by tracking the P3P inliers of one frame, the feature extraction and matching can be left out, which enables the algorithm to run in real time. The principle is depicted in figure 4.1.

Figure 4.1: Proposed extension of the algorithm


A function that uses a previous image and the feature positions of its inliers to
estimate the pose using optical flow has already been implemented and initial
tests seemed promising.
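A sketch of such a function is given below, assuming that the inlier pixel positions and their associated 3D map points from the previous frame are available. It tracks the inliers with OpenCV's pyramidal Lucas-Kanade implementation and feeds the surviving correspondences back to the pose estimation (again approximated here by solvePnPRansac); it is an illustration of the idea, not the implementation that was tested.

import numpy as np
import cv2

def track_and_estimate(prev_gray, curr_gray, prev_px, pts_3d, K):
    """Propagate the P3P inliers by optical flow and re-estimate the pose.

    prev_px: (N, 2) float32 inlier pixel positions in the previous frame.
    pts_3d:  (N, 3) float32 corresponding 3D map points.
    """
    next_px, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_px.reshape(-1, 1, 2), None)
    ok = status.ravel() == 1                      # keep only successfully tracked points
    tracked_px = next_px.reshape(-1, 2)[ok]
    tracked_3d = pts_3d[ok]
    if len(tracked_3d) < 4:
        return None                               # too few points: fall back to matching
    found, rvec, tvec, inliers = cv2.solvePnPRansac(
        tracked_3d, tracked_px, K, None,
        reprojectionError=1.5, flags=cv2.SOLVEPNP_P3P)
    if not found or inliers is None:
        return None
    return rvec, tvec, inliers.ravel()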

Bibliography
[1] P. Rudol, M. Wzorek, G. Conte, and P. Doherty, "Micro unmanned aerial vehicle visual servoing for cooperative indoor exploration," in Aerospace Conference, 2008 IEEE, pp. 1–10, March 2008.
[2] A. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, pp. 1052–1067, June 2007.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225–234, Nov. 2007.
[4] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, "An evaluation of the RGB-D SLAM system," in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 1691–1696, May 2012.
[5] http://www.ros.org/wiki/rosmsg. Accessed: 29/01/2013.
[6] http://opencv.willowgarage.com/documentation/cpp/features2d_common_interfaces_of_descriptor_matchers.html. Accessed: 29/01/2013.
[7] http://www.ros.org/wiki/rviz. Accessed: 23/02/2013.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, pp. 91–110, Nov. 2004.
[9] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[10] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2564–2571, Nov. 2011.
[11] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 2969–2976, June 2011.
[12] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381–395, June 1981.
[13] http://ardrone2.parrot.com/usa/. Accessed: 11/02/2013.
[14] http://www.naturalpoint.com/optitrack/. Accessed: 20/02/2013.
[15] D. Eggert, A. Lorusso, and R. Fisher, "Estimating 3-D rigid body transformations: a comparison of four major algorithms," Machine Vision and Applications, vol. 9, pp. 272–290, 1997.
[16] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in VISAPP International Conference on Computer Vision Theory and Applications, pp. 331–340, 2009.
[17] M. Lourenco, J. Barreto, and A. Malti, "Feature detection and matching in images with radial distortion," in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 1028–1034, May 2010.
[18] M. Lourenco, J. Barreto, and F. Vasconcelos, "sRD-SIFT: Keypoint detection and matching in images with radial distortion," Robotics, IEEE Transactions on, vol. 28, pp. 752–760, June 2012.
[19] J.-Y. Bouguet, "Pyramidal implementation of the Lucas-Kanade feature tracker: Description of the algorithm," 2000.

Robotics and Perception Group

Title of work:

Camera Localization in a 3D Map


Thesis type and date:
Semester Thesis, February 2013
Supervision:
Christian Forster
Davide Scaramuzza

Student:
Name: Tim Oberhauser
E-mail: otim@ethz.ch
Legi-Nr.: 07-919-806

Statement regarding plagiarism:


By signing this statement, I affirm that I have read the information notice
on plagiarism, independently produced this paper, and adhered to the general
practice of source citation in this subject-area.
Information notice on plagiarism:
http://www.ethz.ch/students/semester/plagiarism_s_en.pdf

Zurich, 6. 3. 2013:
