
SuperPoint: Self-Supervised Interest Point Detection and Description

Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich

Research @ Magic Leap

CVPR 2018 Deep Learning for Visual SLAM Workshop


Main Ideas
• “SuperPoint”

• A Deep SLAM Frontend

• Multi-task fully convolutional network

• Designed for Real-time

• “Homographic Adaptation”

• Self-supervised recipe to train keypoints

• Synthetic pre-training

• Homography-inspired domain adaptation


2000-2015 Visual SLAM

• Great Visual SLAM Research

• Real-time systems emerge

• Very few learned components

Collage of systems: MonoSLAM, PTAM, DTAM, LSD-SLAM, KinectFusion, ElasticFusion, DynamicFusion, Event-camera SLAM

Collage courtesy: Andrew Davison's ICCV 2015 Future of Real-time SLAM workshop talk
2015-2016: Simple End-to-End Deep SLAM?

Image(s) -> Deep Network -> Camera Pose / Map

• Deep Learning excitement is very high

• Simple end-to-end setups work across many computer vision tasks

• Purely data-driven, powerful

• Very few heuristics / little hand-tuning

• Accuracy not yet competitive

• Maybe due to lack of large-scale data

Example figures: "Deep Image Homography Estimation using ConvNets" (HomographyNet, a deep convolutional network trained end-to-end with the 4-point corner-offset homography parameterization), "PoseNet: A Convolutional Network for Real-time 6 DOF Camera Relocalization", and "Relative Camera Pose Estimation Using Convolutional Neural Networks".
2017-2018: Splitting Up the Problem

• Frontend: Image inputs

• Deep Learning success: Images + ConvNets

• Most current work "deep-ifies" the Frontend -> Focus of this talk

• Backend: Optimization over pose and map quantities

• 2018: Early deep learning work -> Focus of other oral at 12:05pm
Photo Credit: Cadena et al 2016
2017-2018 Deep Frontends: Dense

Image -> Deep Network -> Dense or Semi-Dense Descriptors

• Dense output approaches

• Powerful matchability

• Not practical in low-compute SLAM systems

• Too expensive for real-time BA

Example figures: "Universal Correspondence Network", "Self-supervised Visual Descriptor Learning for Dense Correspondence", and "Convolutional Neural Network Architecture for Geometric Matching" (Rocco, Arandjelović, Sivic).
2017-2018 Deep Frontends: Sparse

Existing patch-based systems: Image -> Deep Network A (sliding-window interest points) -> Input Patches -> Deep Network B -> Descriptors (d1, d2, d3, ...)

• Most low-compute Visual SLAM built on sparse frontends

• Extract points -> "Backend Ready"

• Most learned systems patch-based

• Two separate networks

• Lack powerful matchability of dense methods

Example figures: "LIFT: Learned Invariant Feature Transform", "Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection", and "LF-Net: Learning Local Features from Images".
Question

• How can we get the power of dense matchability and the practicality of sparse output in a learnable framework?
SuperPoint: A Deep SLAM Front-end

Image -> ConvNet -> Keypoint 2D Locations + Keypoint Descriptors

• Powerful fully convolutional design


• Points + descriptors computed jointly
• Share VGG-like backbone
• Designed for real-time
• Tasks share ~90% of compute
• Two learning-free decoders: no deconvolution layers (a sketch of the shared architecture follows below)
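To make the shared-backbone design concrete, here is a minimal PyTorch-style sketch of a SuperPoint-like network: a VGG-style encoder shared by both tasks, a detector head producing 65 channels per 8x8 cell, and a descriptor head producing a 256-D coarse descriptor map. Layer counts and names here are illustrative (chosen to match the description above), not the released implementation.

```python
import torch
import torch.nn as nn

class SuperPointLike(nn.Module):
    """Sketch of a SuperPoint-style net: shared encoder + two heads."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
        # VGG-like shared encoder: three 2x2 max-pools -> H/8 x W/8 feature map
        self.encoder = nn.Sequential(
            block(1, 64), block(64, 64), nn.MaxPool2d(2),
            block(64, 64), block(64, 64), nn.MaxPool2d(2),
            block(64, 128), block(128, 128), nn.MaxPool2d(2),
            block(128, 128), block(128, 128),
        )
        # Detector head: 65 = 64 pixel positions per 8x8 cell + 1 "dustbin"
        self.detector = nn.Sequential(block(128, 256), nn.Conv2d(256, 65, 1))
        # Descriptor head: 256-D coarse descriptors, one per 8x8 cell
        self.descriptor = nn.Sequential(block(128, 256), nn.Conv2d(256, 256, 1))

    def forward(self, image):                       # image: (B, 1, H, W), grayscale
        features = self.encoder(image)              # (B, 128, H/8, W/8)
        logits = self.detector(features)            # (B, 65, H/8, W/8)
        coarse_desc = self.descriptor(features)     # (B, 256, H/8, W/8)
        return logits, coarse_desc
```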
Keypoint / Interest Point Decoder

Pipeline: shared representation (H/8 x W/8) -> Convolution -> 65-channel logits -> per-cell Softmax -> Reshape to full-resolution H x W probability map -> NMS -> 2D (x,y) keypoints

• Per cell: an 8x8 2D location classifier with 8x8 = 64 possible locations + 1 dustbin ("no interest point") channel

• No deconvolution layers

• Each output cell responsible for a local 8x8 region (see the sketch below)
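A minimal sketch of this learning-free decoder, assuming the 65-channel logits from a detector head like the one sketched above; the confidence threshold is an illustrative default, and NMS is left as a separate step.

```python
import torch
import torch.nn.functional as F

def decode_keypoints(logits, conf_thresh=0.015):
    """logits: (B, 65, H/8, W/8) detector output -> (B, H, W) probability map + keypoints."""
    probs = F.softmax(logits, dim=1)[:, :64]                 # drop the dustbin channel
    b, _, hc, wc = probs.shape
    probs = probs.reshape(b, 8, 8, hc, wc)                   # 64 channels -> 8x8 offsets within each cell
    heatmap = probs.permute(0, 3, 1, 4, 2).reshape(b, hc * 8, wc * 8)   # tile cells back to H x W
    # Simple thresholding; a real pipeline would also apply non-maximum suppression here
    keypoints = [torch.nonzero(h > conf_thresh) for h in heatmap]       # (y, x) indices per image
    return heatmap, keypoints
```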
Descriptor Decoder

• Also no deconvolution layers

• Interpolate into the coarse descriptor map using the 2D keypoint locations

Pipeline: shared representation (H/8 x W/8) -> Convolution -> 256-channel coarse descriptor map -> bilinear interpolation at the 2D (x,y) keypoints -> L2 normalize -> keypoint descriptors (see the sketch below)
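A minimal sketch of the descriptor decoder, assuming a (1, 256, H/8, W/8) coarse descriptor map and keypoints given as (x, y) pixel coordinates; `grid_sample` performs the bilinear interpolation and the result is L2-normalized.

```python
import torch
import torch.nn.functional as F

def sample_descriptors(coarse_desc, keypoints, image_hw):
    """coarse_desc: (1, 256, H/8, W/8); keypoints: (N, 2) tensor of (x, y) pixels."""
    h, w = image_hw
    # Map pixel coordinates into the [-1, 1] range expected by grid_sample
    grid = keypoints.clone().float()
    grid[:, 0] = grid[:, 0] / (w - 1) * 2 - 1
    grid[:, 1] = grid[:, 1] / (h - 1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                                     # (1, 1, N, 2)
    desc = F.grid_sample(coarse_desc, grid, mode='bilinear', align_corners=True)
    desc = desc.squeeze(0).squeeze(1).t()                             # (N, 256)
    return F.normalize(desc, p=2, dim=1)                              # unit-length descriptors
```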
How To Train SuperPoint?

Image -> ConvNet -> Keypoint 2D Locations + Keypoint Descriptors
Setting up the Training

• Siamese training -> pairs of images


• Descriptor trained via metric learning
• Keypoints trained via supervised keypoint labels (see the loss sketch below)
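As a rough illustration of how these two objectives can be combined for a Siamese pair related by a known homography, here is a hedged sketch: a per-cell cross-entropy for the keypoints and a hinge-style metric-learning loss for the descriptors. The margins, weighting, and correspondence-mask construction are placeholders, not the exact formulation from the paper.

```python
import torch
import torch.nn.functional as F

def superpoint_style_loss(logits_a, labels_a, logits_b, labels_b,
                          desc_a, desc_b, correspondence_mask,
                          pos_margin=1.0, neg_margin=0.2, desc_weight=0.0001):
    """Sketch of a Siamese loss over an image pair related by a known homography.

    logits_*: (B, 65, Hc, Wc) detector outputs; labels_*: (B, Hc, Wc) cell class ids in [0, 64].
    desc_*:   (B, 256, Hc, Wc) coarse descriptors.
    correspondence_mask: (B, Hc*Wc, Hc*Wc), 1 where cell i in A maps to cell j in B.
    """
    # Keypoint detection: per-cell 65-way cross-entropy against pseudo-ground-truth labels
    det_loss = F.cross_entropy(logits_a, labels_a) + F.cross_entropy(logits_b, labels_b)

    # Descriptors: hinge loss pulling corresponding cells together, pushing others apart
    b, c, hc, wc = desc_a.shape
    da = F.normalize(desc_a.view(b, c, -1), dim=1)            # (B, 256, Hc*Wc)
    db = F.normalize(desc_b.view(b, c, -1), dim=1)
    sim = torch.einsum('bci,bcj->bij', da, db)                 # cosine similarity of all cell pairs
    pos = correspondence_mask * F.relu(pos_margin - sim)
    neg = (1 - correspondence_mask) * F.relu(sim - neg_margin)
    desc_loss = (pos + neg).mean()

    return det_loss + desc_weight * desc_loss
```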
How to get Keypoint Labels for Natural Images?

• Need large-scale dataset of annotated images


• Too hard for humans to label
Self-Supervised Approach

• Synthetic Shapes (has interest point labels): first train on this

• MS-COCO (no interest point labels): use the resulting detector to label this via "Homographic Adaptation"
Synthetic Training
• Non-photorealistic shapes
• Heavy noise
• Effective and easy

Synthetic Shapes categories: Quads/Tris, Quads/Tris/Ellipses, Cubes, Checkerboards, Lines, Stars, Quads (see the sketch below)
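As an illustration of how labeled synthetic data of this kind can be produced (a hypothetical generator, not the authors' pipeline): render a simple shape with OpenCV, record its corners as keypoint labels, and add heavy noise.

```python
import numpy as np
import cv2

def random_quad_example(size=160, noise_sigma=12):
    """Render one noisy synthetic quad and return (image, corner keypoint labels)."""
    img = np.full((size, size), np.random.randint(0, 256), dtype=np.uint8)
    # Start from an axis-aligned box and jitter each corner to get a random quad
    m = size // 4
    box = np.array([[m, m], [size - m, m], [size - m, size - m], [m, size - m]])
    corners = box + np.random.randint(-m + 2, m - 2, size=(4, 2))
    cv2.fillPoly(img, [corners.reshape(-1, 1, 2).astype(np.int32)], int(np.random.randint(0, 256)))
    # The ground-truth interest points are simply the quad corners
    noisy = img.astype(np.float32) + np.random.randn(size, size) * noise_sigma
    return np.clip(noisy, 0, 255).astype(np.uint8), corners.astype(np.float32)
```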


Early Version of SuperPoint (MagicPoint)

"Toward Geometric Deep SLAM", DeTone et al. 2017

Figures: detector performance on the Synthetic Shapes categories with and without noise (the "Random" input sequences are especially difficult for the classical detectors), and the effect of noise magnitude, shown by linearly interpolating between the image and noise (s = 0, 1, 2).
Generalizing to Real Data

• Synthetically trained detector

• Works! Despite large domain gap

• Worked well on geometric structures

• Underperformed on certain textures unseen during training
Homographic Adaptation

Unlabeled Input Image -> Synthetic Warp + Run Detector -> Point Set #1, #2, #3, ... -> Point Aggregation -> Detected Point Superset

• Simulate planar camera motion with homographies

• Self-labelling technique

• Suppress spurious detections

• Enhance repeatable points (a sketch of the aggregation step follows below)
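A minimal sketch of the aggregation step, assuming a `detect` function that returns a keypoint heatmap for a grayscale image; the random-homography sampler (perturbed image corners) is simplified relative to the paper's transformation distribution.

```python
import numpy as np
import cv2

def random_homography(h, w, jitter=0.15):
    """Sample a homography by randomly perturbing the image corners (simplified)."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + (np.random.uniform(-jitter, jitter, (4, 2)) * [w, h]).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def homographic_adaptation(image, detect, num_warps=100):
    """Aggregate detections over many random warps into one heatmap label."""
    h, w = image.shape[:2]
    accum = detect(image).astype(np.float32)                 # unwarped detection pass
    counts = np.ones((h, w), np.float32)
    for _ in range(num_warps):
        H = random_homography(h, w)
        warped = cv2.warpPerspective(image, H, (w, h))
        heat = detect(warped).astype(np.float32)
        # Warp the heatmap (and a validity mask) back into the original frame
        accum += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
        counts += cv2.warpPerspective(np.ones((h, w), np.float32), np.linalg.inv(H), (w, h))
    return accum / np.maximum(counts, 1e-6)                  # averaged point-superset heatmap
```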


Iterative Homographic Adaptation

• Label, train, repeat, … (schematic loop sketched below)

• Resulting points: higher coverage, more repeatable
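The recipe amounts to alternating self-labelling and retraining. A schematic loop, where `homographic_adaptation` is the aggregation sketched earlier and `train_detector` is a placeholder for the supervised training step:

```python
def iterative_homographic_adaptation(detector, unlabeled_images, num_rounds=2):
    """Schematic label -> train -> repeat loop (train_detector is a placeholder)."""
    for _ in range(num_rounds):
        # 1. Self-label every unlabeled image with the current detector
        pseudo_labels = [homographic_adaptation(img, detector) for img in unlabeled_images]
        # 2. Retrain the detector on the newly generated pseudo-ground-truth
        detector = train_detector(unlabeled_images, pseudo_labels)
    return detector
```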
HPatches Evaluation
• Homography estimation task
• Dataset of 116 scenes each with 6 images = 696 images
• Indoor and outdoor planar scenes
• Compared against LIFT, SIFT and ORB

• 50% of dataset: illumination change

• 50% of dataset: viewpoint change
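To make the core task concrete, one plausible way to score homography estimation for an image pair (standard OpenCV calls; the exact matching scheme and correctness threshold used in the paper may differ):

```python
import numpy as np
import cv2

def homography_correct(kpts_a, desc_a, kpts_b, desc_b, H_gt, image_shape, eps=3.0):
    """kpts_*: (N, 2) arrays of (x, y); desc_*: (N, D) float32 arrays; H_gt: 3x3 ground truth."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)          # nearest-neighbor matching
    matches = matcher.match(desc_a, desc_b)
    pts_a = np.float32([kpts_a[m.queryIdx] for m in matches]).reshape(-1, 1, 2)
    pts_b = np.float32([kpts_b[m.trainIdx] for m in matches]).reshape(-1, 1, 2)
    H_est, _ = cv2.findHomography(pts_a, pts_b, cv2.RANSAC)
    if H_est is None:
        return False
    h, w = image_shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    # Compare where the estimated and ground-truth homographies send the image corners
    err = np.linalg.norm(cv2.perspectiveTransform(corners, H_est)
                         - cv2.perspectiveTransform(corners, H_gt), axis=2)
    return float(err.mean()) <= eps
```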
Qualitative Illumination Example

• SuperPoint -> denser set of correct matches

• ORB -> highly clustered matches

SuperPoint LIFT

SIFT ORB
Qualitative Viewpoint Example #1

• Similar story

SuperPoint LIFT

SIFT ORB
Qualitative Viewpoint Example #2

• In-plane rotation of ~35 degrees

SuperPoint LIFT

SIFT ORB
HPatches Evaluation

              Core Task               Sub-metrics
              Homography   Descriptor Metrics       Detector Metrics
              Estimation   NN mAP     M. Score      Rep.      MLE

SuperPoint      0.684       0.821      0.470        0.581     1.158
LIFT            0.598       0.664      0.315        0.449     1.102
SIFT            0.676       0.694      0.313        0.495     0.833
ORB             0.395       0.735      0.266        0.641     1.157
Timing SuperPoint vs LIFT

• Speed important for low-compute Visual SLAM

• SuperPoint total 640x480 time: ~ 33 ms

• LIFT total 640x480 time: ~2 minutes


3D Generalizability of SuperPoint
• Trained+evaluated on planar, does it generalize to 3D?
• “Connect-the-dots” using nearest neighbor matches
• Works across many datasets / input modalities / resolutions!

Freiburg (Kinect) NYU (Kinect) MonoVO (fisheye) ICL-NUIM (synth)

MS7 (Kinect) KITTI (stereo)


New Announcement, Research @ MagicLeap
Public Release of Pre-trained Net:
github.com/MagicLeapResearch/SuperPointPretrainedNetwork

• Sparse Optical Flow Tracker Demo

• Implemented in Python + PyTorch

• Two files, minimal dependencies

• Easy to get up and running


Take-Aways
• “SuperPoint”: A Modern Deep SLAM Frontend

• Non-patch based fully convolutional network

• Real-time deployability

• Self-supervised recipe to train keypoints

• Synthetic pre-training

• Homography-inspired domain adaptation

• Public code available to run SuperPoint


Thank You
Questions?
SuperPoint: A Modern Deep SLAM Front-end

Image -> ConvNet -> Keypoint 2D Locations + Keypoint Descriptors

Extra Slides
Extra Slides
Failure Mode: Extreme Rotation

• Extreme in-plane rotations

• Trained for ~30 deg rotations

• Optimized for tracking scenarios

• LIFT also struggles, despite learned orientation estimation

Figure rows: SuperPoint, LIFT, SIFT, ORB
Iterative Homographic Adaptation

MagicPoint -> Homographic Adaptation + Training -> further rounds -> SuperPoint
