• “Homographic Adaptation”
• Synthetic pre-training
Collage courtesy: Andrew Davison’s ICCV 2015 Future of Real-time SLAM workshop talk
• Deep Image Homography Estimation using ConvNets
• PoseNet: A Convolutional Network for Real-time 6 DOF Localization

Fig. 1: Deep Image Homography Estimation. HomographyNet is a Deep Convolutional Neural Network which directly produces the Homography relating two images. Our method does not require separate corner detection and homography estimation steps.

Excerpt (Deep Image Homography Estimation, data generation pipeline): This random crop is Ip. Then, the four corners of Patch A are randomly perturbed by values within the range [-ρ, ρ]. The four correspondences define H^AB. Next, H^BA = (H^AB)^-1 is applied to the large image to produce image I'. A second patch I'p is cropped from I' at position p.

For example, to make our method more robust to motion blur, we can apply such blurs to the images in our training set. If we want the method to be robust to occlusions, we can insert random occluding shapes into our training images. We experimented with in-painting random occluding rectangles into our training images, as a simple mechanism to simulate real occlusions. Managing the training image generation pipeline gives us full control over the kinds of visual effects we want to model.

[HomographyNet architecture figure: 3x3 convolutions (Conv1–Conv8) with max pooling, feature maps 128x128x2 → 128x128x64 → 64x64x64 → 32x32x128, followed by fully-connected layers and a softmax output.]

Excerpt (PoseNet): The dataset was generated using structure from motion techniques [28], which we use as ground truth measurements. A Google LG Nexus 5 smartphone was used. Images were captured under different lighting and weather conditions; train and test images are taken from distinct walking paths and not sampled from the same trajectory, making the regression challenging. [Figure 3: Magnified view of a sequence of training (green) and testing (red) cameras, showing a test image (top), the predicted view from our convnet (middle), and the nearest neighbour training image (bottom).]
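The data generation recipe above can be sketched in a few lines. A minimal NumPy sketch, assuming illustrative values for the patch size, ρ, and crop position (warping the large image with H^BA would be done with something like OpenCV's `warpPerspective`, omitted here):

```python
import numpy as np

def make_training_label(patch_size=128, rho=32, top_left=(64, 64), rng=None):
    """Sketch of the HomographyNet data-generation step: perturb the four
    corners of a square patch by values in [-rho, rho]. The 8 offsets are
    exactly the 4-point ground-truth label."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = top_left
    corners = np.array([[x, y],
                        [x + patch_size, y],
                        [x + patch_size, y + patch_size],
                        [x, y + patch_size]], dtype=np.float64)
    offsets = rng.uniform(-rho, rho, size=(4, 2))   # the H_4point label
    perturbed = corners + offsets
    return corners, perturbed, offsets.ravel()      # label has 8 numbers

corners, perturbed, label = make_training_label()
```

The two grayscale patches Ip and I'p would then be stacked channel-wise, e.g. `np.stack([patch_a, patch_b], axis=-1)`, giving the 128x128x2 network input.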
The two grayscale patches, Ip and I'p, are then stacked channel-wise to create the 2-channel image which is fed directly into our ConvNet. The 4-point parameterization of H^AB is then used as the associated ground-truth training label. The process is illustrated in Figure 3.

The homography relates pixel coordinates in the two images in homogeneous form:

$$
\begin{pmatrix} u'_1 \\ v'_1 \\ 1 \end{pmatrix} \sim H_{matrix}\begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix},\qquad H_{matrix}=\begin{pmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{pmatrix} \quad (1)
$$

Balancing the rotational and translational terms as part of an optimization problem is difficult. We found that an alternate parameterization, one based on a single kind of location variable, namely the corner location, is more suitable for our deep homography estimation task. This is the modern deep manifestation of the homography estimation problem (see Figure 2). Letting $\Delta u_1 = u'_1 - u_1$ be the u-offset for the first corner, the 4-point parameterization represents the homography as follows:

$$
H_{4point} = \begin{pmatrix} \Delta u_1 & \Delta v_1 \\ \Delta u_2 & \Delta v_2 \\ \Delta u_3 & \Delta v_3 \\ \Delta u_4 & \Delta v_4 \end{pmatrix} \quad (2)
$$

Equivalently to the matrix formulation of the homography, the 4-point parameterization is represented by eight numbers. There exists a 1-to-1 mapping between the 8-dof "corner offset" matrix and the representation of the homography as a 3x3 matrix. Once the displacement of the four corners is known, only a single closed form transformation is needed for the 8-dof homography. This can be accomplished in a number of ways, for example one can use the normalized Direct Linear Transform (DLT) algorithm [9], or the function getPerspectiveTransform() in OpenCV.

Fig. 2: 4-point parameterization. We use the 4-point parameterization of the homography. There exists a 1-to-1 mapping between the 8-dof "corner offset" matrix and the representation of the homography as a 3x3 matrix.

1 In our experiments, we used cropped MS-COCO [15] images, although any large-enough dataset could be used for training.
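The 1-to-1 mapping from the 4-point label back to the 3x3 matrix can be sketched with a plain (unnormalized, for brevity) DLT in NumPy. In practice one would call `cv2.getPerspectiveTransform(corners, corners + offsets)` as the text suggests; the version below is an illustrative stand-in:

```python
import numpy as np

def four_point_to_homography(corners, offsets):
    """Map the 8-dof 'corner offset' label back to a 3x3 homography via the
    Direct Linear Transform. corners: (4,2) pixel positions; offsets: (4,2)
    corner displacements. (cv2.getPerspectiveTransform does the same job.)"""
    src = np.asarray(corners, dtype=np.float64)
    dst = src + np.asarray(offsets, dtype=np.float64)
    rows = []
    for (u, v), (up, vp) in zip(src, dst):
        rows.append([u, v, 1, 0, 0, 0, -up * u, -up * v, -up])
        rows.append([0, 0, 0, u, v, 1, -vp * u, -vp * v, -vp])
    A = np.asarray(rows)                    # (8, 9) constraint matrix
    # The homography is the null vector of A (smallest singular value).
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                      # fix the scale ambiguity
```

For example, zero offsets recover the identity, and a constant offset for all four corners recovers a pure translation.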
• Very few heuristics / little hand-tuning
• Maybe due to lack of large-scale data

III. DATA GENERATION FOR HOMOGRAPHY ESTIMATION

Training deep convolutional networks from scratch requires a large amount of data. To meet this requirement, we generate a nearly unlimited number of labeled training examples.

Excerpt (PoseNet): The PoseNet was implemented in Caffe, with a learning rate of 10^-5, reduced during training; training took an hour using a batch size of 75. For reasons of time, we train each scene separately, as we found this to improve experimental performance. Results are reported on the Cambridge Landmarks dataset in table 6. This shows that context and field of view are more important than resolution for relocalization, and we also compare against finding the nearest neighbour in the training data from the feature vectors.

Deep learning performs extremely well on large datasets, however producing these datasets is often expensive or very labour intensive. We overcome this by leveraging structure from motion to autonomously generate training labels (camera poses). This reduces the human labour to just…

We also compare our algorithm to the RGB-D SCoRe forest approach (see fig. 3). Fig. 7 shows cumulative histograms of localization error for two indoor and two outdoor scenes. We note that although the SCoRe forest is generally more accurate, it requires depth information, and uses higher-resolution imagery. The indoor dataset contains many ambiguous and textureless features which make relocalization without this depth modality extremely difficult. We note our method often localizes the most difficult testing frames, above the 95th percentile, more accurately than SCoRe across all the scenes. We also observe that dense cropping only gives a…

Excerpt (Relative Camera Pose Estimation Using Convolutional Neural Networks) — Fig. 2: Model structure (cnnBspp). Both network branches (representation part) have identical structure with shared weights. The pre-trained Hybrid-CNN [15] neural network was utilized to initialize the proposed architecture. The representation part maps an image pair to a low dimensional feature vector which is processed by the regression part of the network. The regression part consists of 2 fully-connected layers (FC1 and FC2) and estimates the relative camera pose.
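The cnnBspp structure described in the caption, two branches with shared weights feeding a small regression head, can be sketched in plain NumPy. The real representation part is the pre-trained Hybrid-CNN; here a random linear map stands in for it, and all sizes (64-d toy "images", 7-dof output) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "representation part": the SAME weights serve both branches
# (a stand-in for the pre-trained Hybrid-CNN backbone).
W_repr = 0.01 * rng.normal(size=(128, 64))
# "Regression part": two fully-connected layers, FC1 and FC2.
W_fc1 = 0.01 * rng.normal(size=(128, 256))
W_fc2 = 0.01 * rng.normal(size=(7, 128))   # 3-dof translation + 4-dof quaternion

def branch(x):
    """Representation part: maps one image to a low-dimensional vector."""
    return np.maximum(W_repr @ x, 0.0)      # ReLU

def relative_pose(img_a, img_b):
    """Siamese forward pass: weight sharing means the same `branch`
    (same W_repr) processes both images before the regression part."""
    feat = np.concatenate([branch(img_a), branch(img_b)])  # (256,)
    h = np.maximum(W_fc1 @ feat, 0.0)                      # FC1
    return W_fc2 @ h                                       # FC2: relative pose
```

The design point is the weight sharing: both images pass through identical filters, so the regression head sees features expressed in the same learned "coordinate system".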
2017-2018: Splitting Up the Problem
• Most current work "deep-ifies" the Frontend -> Focus of this talk
• 2018: Early deep learning work -> Focus of other oral at 12:05pm
Photo Credit: Cadena et al 2016
2017-2018 Deep Frontends: Dense or Semi-Dense Descriptors

Image → Deep Network → Dense Descriptors

Universal Correspondence Network

Convolutional neural network architecture for geometric matching — Ignacio Rocco, Relja Arandjelović, Josef Sivic
[Abstract excerpt, left edge clipped in the source: determining correspondences between images in agreement with a geometric model such as an affine or thin-plate spline transformation; the contributions include a convolutional neural network architecture for geometric matching, with components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation.]
Deep frontends

LIFT: Learned Invariant Feature Transform
[Paper screenshot excerpt, left edge clipped in the source: it is only meaningful to compare methods producing the same number of interest points — otherwise some method might report too many points and unfairly outperform others (e.g., if we take all points as "interest points", repeatability will be very high) — so a range of top/bottom quantiles is used and all methods are compared at those numbers of points. Training samples a pair of corresponding images and then randomly samples 10000 quadruples from this pair per epoch; by the time training stops the models have seen 20 million sampled quadruples, using the DTU Robot Image Dataset [1]. Section 5.1, "RGB detector from ground-truth correspondences", shows how to use existing 3D data to establish correspondences for training a detector.]

• Extract points -> "Backend Ready"
• …learnable framework?
SuperPoint: A Deep SLAM Front-end
Image → ConvNet → 2D Keypoint Locations + Keypoint Descriptors

Interest Point Decoder
• No deconvolution layers
• Each output cell responsible for local 8x8 region
• Per-cell probability over the 8x8 pixels plus a "dustbin" (no-keypoint) bin

Descriptor Decoder
• Shared representation (H/8 x W/8) → Convolution, 256 channels
• Interpolate + L2 Normalize → descriptors at the 2D (x,y) keypoints
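Reading out such heads without deconvolution layers can be sketched as a per-cell softmax followed by a depth-to-space reshape. The channel counts follow the slide (64 pixel bins + 1 dustbin, 256-d descriptors); the nearest-cell descriptor sampling below is a simplification of the interpolation the slide mentions:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_keypoint_heatmap(logits):
    """logits: (65, H/8, W/8) — one cell per 8x8 pixel region; channel 64
    is the 'dustbin' (no keypoint in this cell). Softmax per cell, drop the
    dustbin, then depth-to-space to a full-resolution (H, W) heatmap."""
    prob = softmax(logits, axis=0)[:64]
    _, hc, wc = prob.shape
    # (64, H/8, W/8) -> (H, W): no deconvolution, just a reshuffle.
    return prob.reshape(8, 8, hc, wc).transpose(2, 0, 3, 1).reshape(hc * 8, wc * 8)

def sample_descriptors(coarse_desc, pts):
    """coarse_desc: (D, H/8, W/8) shared-representation descriptors;
    pts: (N, 2) pixel (x, y) keypoints. Nearest-cell sampling + L2
    normalization (the real decoder interpolates instead of snapping)."""
    xi = (pts[:, 0] // 8).astype(int)
    yi = (pts[:, 1] // 8).astype(int)
    d = coarse_desc[:, yi, xi].T
    return d / np.linalg.norm(d, axis=1, keepdims=True)
```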
How To Train SuperPoint?

Image → ConvNet → 2D Keypoint Locations + Keypoint Descriptors
Setting up the Training
• First train on Synthetic Shapes → MagicPoint (DeTone et al. 2017)
• Use the resulting detector to label MS-COCO (no interest point labels) via "Homographic Adaptation"

Synthetic Training
• Non-photorealistic shapes
• Heavy noise
• Effective and easy
Homographic Adaptation
• Simulate planar camera motion with homographies
• Point Set #1, #2, #3, … → Point Aggregation → Detected Point Superset
• Self-labelling technique
• Label, train, repeat, …

Homographic Adaptation
• Resulting points:
  • Higher coverage
  • More repeatable
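The aggregation step can be sketched as: run the detector under several random homographies, map each point set back into the base frame, and union the results into the superset. The toy identity "detector" and the way homographies are sampled below are stand-ins, not the paper's exact procedure:

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to (N, 2) points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def homographic_adaptation(image_pts, n_homographies=3, rng=None):
    """Toy self-labelling sketch: 'detect' under random small homographies
    (the detector here is an identity stand-in that returns the warped
    points), warp detections back with H^-1, and aggregate the superset."""
    rng = np.random.default_rng() if rng is None else rng
    superset = [image_pts]                               # unwarped detection
    for _ in range(n_homographies):
        H = np.eye(3) + 0.01 * rng.normal(size=(3, 3))   # small random homography
        detected = warp_points(H, image_pts)             # stand-in detector
        superset.append(warp_points(np.linalg.inv(H), detected))  # back to base
    return np.vstack(superset)
```

With a real detector, points found only under some warps are what raise coverage and repeatability of the aggregated superset.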
HPatches Evaluation
• Homography estimation task
• Dataset of 116 scenes each with 6 images = 696 images
• Indoor and outdoor planar scenes
• Compared against LIFT, SIFT and ORB
• 50% of dataset: Illumination Change
• 50% of dataset: Viewpoint Change
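Homography estimation on HPatches-style benchmarks is commonly scored by corner error: warp the image corners with the estimated and ground-truth homographies and compare. A sketch, where the ε threshold and image size are illustrative assumptions:

```python
import numpy as np

def homography_corner_correct(H_est, H_gt, wh=(640, 480), eps=3.0):
    """Mean distance between the four image corners warped by H_est vs.
    H_gt; the estimate counts as correct if the mean error is below eps
    pixels."""
    w, h = wh
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float)

    def warp(H):
        p = corners @ H.T
        return p[:, :2] / p[:, 2:3]

    err = np.linalg.norm(warp(H_est) - warp(H_gt), axis=1).mean()
    return err, err < eps
```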
Qualitative Illumination Example
SuperPoint LIFT
SIFT ORB
Qualitative Viewpoint Example #1
• Similar story
SuperPoint LIFT
SIFT ORB
Qualitative Viewpoint Example #2
SuperPoint LIFT
SIFT ORB
HPatches Evaluation
• Core task + sub-metrics
• Real-time deployability
• Synthetic pre-training

Image → ConvNet → 2D Keypoint Locations + Keypoint Descriptors
Extra Slides
Failure Mode: Extreme Rotation
SuperPoint vs. ORB
• Extreme in-plane rotations
Iterative Homographic Adaptation
MagicPoint → Homographic Adaptation → Training → … (further Homographic Adaptation) → SuperPoint
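The label → train → repeat loop on the slide can be written as a control-flow skeleton. The `label_fn` and `train_fn` below are trivial stand-ins for Homographic Adaptation labelling and SuperPoint training, just to show the iteration structure, not the real pipeline:

```python
import numpy as np

def iterative_adaptation(images, label_fn, train_fn, rounds=2):
    """Self-labelling loop: start from the synthetic-data detector
    (MagicPoint), label the unlabelled images with the current model,
    retrain, and repeat."""
    model = "MagicPoint"
    for _ in range(rounds):
        labels = [label_fn(model, img) for img in images]  # self-label
        model = train_fn(model, images, labels)            # retrain
    return model

# toy stand-ins (purely illustrative)
label_fn = lambda model, img: img > 0.5
train_fn = lambda model, images, labels: model + "+HA"
```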