M. Soleymani
Sharif University of Technology
Fall 2017
Most slides have been adapted from Fei-Fei Li and colleagues' lectures (cs231n, Stanford, 2016),
and some from John Canny (cs294-129, Berkeley, 2016).
Attention
Attention instead of a simple encoder-decoder
• Encoder-decoder models
– need to compress all the necessary information of a source sentence into a single fixed-length vector
– their performance deteriorates rapidly as the length of the input sentence increases.
Source: https://distill.pub/2016/augmented-rnns/
Soft Attention for Voice Recognition
Source: https://distill.pub/2016/augmented-rnns/
Simple soft attention mechanism
Source: https://distill.pub/2016/augmented-rnns/
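A minimal numpy sketch of the mechanism (all names are illustrative, not from the slides): score every entry of the attended-over memory against the current state, squash the scores with a softmax, and return the weighted average. Because the output is smooth in the weights, the whole block trains with ordinary backpropagation.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def soft_attention(query, memory):
    """memory: (N, D) array of vectors to attend over; query: (D,) state.
    Returns a (D,) summary: the softmax-weighted average of memory."""
    scores = memory @ query             # (N,) dot-product relevance scores
    weights = softmax(scores)           # (N,) nonnegative, sums to 1
    return weights @ memory             # (D,) weighted combination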
Soft Attention for Translation
[Figure: a bidirectional encoder RNN reads the source sentence; an attention model computes mixture weights that combine the encoder states into a context vector 𝑐𝑖, which the decoder RNN uses to go from state 𝑠𝑖−1 to 𝑠𝑖 and emit the output word 𝑦.]
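In the figure's notation, this is the standard formulation introduced by Bahdanau et al. (2015): alignment scores between the previous decoder state and each encoder annotation h_j pass through a softmax to give the mixture weights, and the context vector is their weighted sum:

e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
c_i = \sum_j \alpha_{ij} h_j, \qquad
s_i = f(s_{i-1}, y_{i-1}, c_i)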
Recall: RNN for Captioning
[Figure: a CNN summarizes the image (HxWx3) into a single feature vector (D) that initializes h0; the RNN then unrolls through h1, h2, ..., consuming the first word, second word, ..., and emitting distributions d1, d2, ... over the vocabulary.]
• The RNN only looks at the whole image, once.
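A minimal numpy sketch of this vanilla pipeline (sizes and weight names are illustrative): the image enters exactly once, to initialize h0.

import numpy as np

rng = np.random.default_rng(0)
D, H, V = 512, 256, 1000                     # feature, hidden, vocab sizes
W_init = rng.normal(0, 0.01, (H, D))         # projects the CNN feature -> h0
Wxh = rng.normal(0, 0.01, (H, H))            # word embedding -> hidden
Whh = rng.normal(0, 0.01, (H, H))            # hidden -> hidden recurrence
Why = rng.normal(0, 0.01, (V, H))            # hidden -> vocab logits
embed = rng.normal(0, 0.01, (V, H))          # word embedding table

def step(h, x):
    # One vanilla RNN step: new hidden state, then a softmax over the vocab.
    h = np.tanh(Wxh @ x + Whh @ h)
    logits = Why @ h
    e = np.exp(logits - logits.max())
    return h, e / e.sum()

cnn_feature = rng.normal(size=D)             # Features: D (the whole image)
h0 = np.tanh(W_init @ cnn_feature)           # the image enters once, here
h1, d1 = step(h0, embed[0])                  # first word  -> distribution d1
h2, d2 = step(h1, embed[int(d1.argmax())])   # second word -> distribution d2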
Soft Attention for Captioning
[Figure: the CNN now keeps a grid of features (LxD) from the image (HxWx3). At each step, the hidden state h0, h1, h2, ... produces a distribution a1, a2, ... over the L locations; the weighted combination of features gives a D-dimensional vector z1, z2, ..., which enters the next RNN step together with the previous word y1, y2, .... Each step also emits a distribution d1, d2, ... over the vocabulary.]
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
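A minimal numpy sketch of one decoding step (the weight shapes are illustrative, not the exact parameterization of Xu et al.): the current state scores the L feature vectors, the softmax gives the location distribution a, and the weighted combination z feeds the next step together with the previous word.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def soft_attention_step(h, features, y_prev, Wa, Wyh, Wzh, Whh, Why):
    """h: (H,) state; features: (L, D) CNN grid; y_prev: (E,) word embedding."""
    a = softmax(features @ (Wa @ h))                # (L,) distribution over locations
    z = a @ features                                # (D,) weighted feature combination
    h = np.tanh(Wyh @ y_prev + Wzh @ z + Whh @ h)   # z and the word enter the RNN
    d = softmax(Why @ h)                            # (V,) distribution over the vocab
    return h, a, d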
Soft vs Hard Attention
[Figure: a CNN produces a grid of features a, b, c, d (each D-dimensional) from the image (HxWx3); from the RNN comes a distribution over grid locations pa, pb, pc, pd, with pa + pb + pc + pd = 1, which is turned into a context vector z (D-dimensional).]
Soft attention:
• Summarize ALL locations: z = pa·a + pb·b + pc·c + pd·d
• The derivative dz/dp is nice! Train with gradient descent.
Hard attention:
• Sample ONE location according to p; z = that vector.
• With argmax, dz/dp is zero almost everywhere … can’t use gradient descent; need reinforcement learning.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
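The contrast in one place, as a hedged numpy sketch (here features holds the four grid vectors a, b, c, d and p is the distribution from the RNN):

import numpy as np

features = np.random.randn(4, 512)     # grid vectors a, b, c, d, each D-dim
p = np.array([0.5, 0.2, 0.2, 0.1])     # pa + pb + pc + pd = 1, from the RNN

# Soft: z is a smooth function of p, so dz/dp exists and plain backprop works.
z_soft = p @ features

# Hard: sample ONE location; the sampling step has no useful gradient
# (with argmax, dz/dp is zero almost everywhere), so training falls back
# on reinforcement learning (a REINFORCE-style estimator).
idx = np.random.choice(4, p=p)
z_hard = features[idx]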
Soft Attention for Captioning
[Figure: example captions with the corresponding soft-attention and hard-attention maps.]
The model learns to attend to the salient parts of an image while generating its caption.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Visual Question Answering: RNNs with Attention
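The slides give only the headline; as a hedged sketch of one common instantiation (all names and shapes are assumptions, not the specific model in the lecture), the question encoding decides where to look in the image grid:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def vqa_attention(q, img_feats, Wq, Wv):
    """q: (Q,) question encoding from an RNN; img_feats: (L, D) CNN grid.
    Scores each image location against the question, then summarizes."""
    scores = (img_feats @ Wv) @ (Wq @ q)   # (L,) question-conditioned scores
    a = softmax(scores)                    # where to look, given the question
    return a @ img_feats                   # (D,) attended image summary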
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
[Figure: the attention model attends over CNN features (LxD) extracted per frame (each frame HxWx3).]
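Following the same pattern, a hedged sketch of temporal attention over per-frame features (shapes illustrative, not the exact model of Yao et al.):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def temporal_attention(h, frame_feats, Wa):
    """frame_feats: (T, D), one CNN feature per frame; h: (H,) decoder state.
    Identical machinery to spatial attention, but re-weighting frames over time."""
    scores = frame_feats @ (Wa @ h)        # (T,) relevance of each frame
    w = softmax(scores)                    # softmax over time steps
    return w @ frame_feats                 # (D,) temporal context vector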
Spatial Transformer Networks
[Figure: box coordinates (xc, yc, w, h) select a region of the input image (HxWx3), which is cropped and rescaled to XxYx3.]
Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input. Can we make this function differentiable?
A small localization network predicts the transform 𝜃.
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
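A hedged numpy sketch of the differentiable answer, for the paper's affine case (grayscale input and all names are my simplifications): 𝜃 maps each output pixel (xt, yt), on a normalized grid, back to a source location (xs, ys), and bilinear sampling makes every output pixel a piecewise-smooth function of 𝜃, so gradients flow into the localization network.

import numpy as np

def spatial_transformer(U, theta, out_h, out_w):
    """U: (H, W) grayscale input; theta: 2x3 affine matrix mapping output
    coords (xt, yt) in [-1, 1] to source coords (xs, ys). Differentiable
    crop-and-rescale: each output pixel is bilinear in the input."""
    H, W = U.shape
    xt, yt = np.meshgrid(np.linspace(-1, 1, out_w), np.linspace(-1, 1, out_h))
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])  # (3, N)
    xs, ys = theta @ grid                      # (2, N) sampling locations
    xs = (xs + 1) * (W - 1) / 2                # normalized -> pixel coords
    ys = (ys + 1) * (H - 1) / 2
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    V = np.zeros(out_h * out_w)
    for dx in (0, 1):                          # bilinear kernel: 4 neighbors,
        for dy in (0, 1):                      # weights piecewise linear in theta
            xn = np.clip(x0 + dx, 0, W - 1)    # out-of-bounds clamps to the edge
            yn = np.clip(y0 + dy, 0, H - 1)
            w = np.maximum(0, 1 - np.abs(xs - (x0 + dx))) \
              * np.maximum(0, 1 - np.abs(ys - (y0 + dy)))
            V += w * U[yn, xn]
    return V.reshape(out_h, out_w)

# Example: crop the centered half of the image and rescale it to 32x32.
U = np.random.rand(64, 64)
theta = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])            # scale by 0.5, no translation
V = spatial_transformer(U, theta, 32, 32)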
Attention Takeaways
Complexity:
• There are many design choices.
• Those choices have a big effect on performance.
• Ensembling has unusually large benefits.
• Simplify where possible!
Explainability:
• Attention models encode explanations.
• Both the locus and the trajectory of attention help us understand what’s going on.