M. Soleymani
Sharif University of Technology
Fall 2017
Most slides have been adapted from Fei-Fei Li and colleagues' lectures (cs231n, Stanford, 2016),
and some from John Canny (cs294-129, Berkeley, 2016).
Attention
Attention instead of a simple encoder-decoder
• Encoder-decoder models
– need to compress all the necessary information of a source sentence into a single fixed-length vector
– their performance deteriorates rapidly as the length of the input sentence increases.
Source: https://distill.pub/2016/augmented-rnns/
Soft Attention for Voice Recognition
Source: https://distill.pub/2016/augmented-rnns/
Simple soft attention mechanism
Source: https://distill.pub/2016/augmented-rnns/
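A minimal numpy sketch of the mechanism (all names are illustrative, not from the slides): score every entry of the attended-over memory against the current state, squash the scores with a softmax, and return the weighted average. Because the output is smooth in the weights, the whole block trains with ordinary backpropagation.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def soft_attention(query, memory):
    """memory: (N, D) array of vectors to attend over; query: (D,) state.
    Returns a (D,) summary: the softmax-weighted average of memory."""
    scores = memory @ query             # (N,) dot-product relevance scores
    weights = softmax(scores)           # (N,) nonnegative, sums to 1
    return weights @ memory             # (D,) weighted combination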
Soft Attention for Translation
[Figure: a bidirectional encoder RNN reads the source sentence; an attention model computes mixture weights that combine the encoder states into a context vector 𝑐𝑖, which the decoder RNN uses to go from state 𝑠𝑖−1 to 𝑠𝑖 and emit the output word 𝑦.]
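In the figure's notation, this is the standard formulation introduced by Bahdanau et al. (2015): alignment scores between the previous decoder state and each encoder annotation h_j pass through a softmax to give the mixture weights, and the context vector is their weighted sum:

e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}, \qquad
c_i = \sum_j \alpha_{ij} h_j, \qquad
s_i = f(s_{i-1}, y_{i-1}, c_i)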
Recall: RNN for Captioning
[Figure: a CNN summarizes the image (HxWx3) into a single feature vector (D) that initializes h0; the RNN then unrolls through h1, h2, ..., consuming the first word, second word, ..., and emitting distributions d1, d2, ... over the vocabulary.]
• The RNN only looks at the whole image, once.
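A minimal numpy sketch of this vanilla pipeline (sizes and weight names are illustrative): the image enters exactly once, to initialize h0.

import numpy as np

rng = np.random.default_rng(0)
D, H, V = 512, 256, 1000                     # feature, hidden, vocab sizes
W_init = rng.normal(0, 0.01, (H, D))         # projects the CNN feature -> h0
Wxh = rng.normal(0, 0.01, (H, H))            # word embedding -> hidden
Whh = rng.normal(0, 0.01, (H, H))            # hidden -> hidden recurrence
Why = rng.normal(0, 0.01, (V, H))            # hidden -> vocab logits
embed = rng.normal(0, 0.01, (V, H))          # word embedding table

def step(h, x):
    # One vanilla RNN step: new hidden state, then a softmax over the vocab.
    h = np.tanh(Wxh @ x + Whh @ h)
    logits = Why @ h
    e = np.exp(logits - logits.max())
    return h, e / e.sum()

cnn_feature = rng.normal(size=D)             # Features: D (the whole image)
h0 = np.tanh(W_init @ cnn_feature)           # the image enters once, here
h1, d1 = step(h0, embed[0])                  # first word  -> distribution d1
h2, d2 = step(h1, embed[int(d1.argmax())])   # second word -> distribution d2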
Soft Attention for Captioning
[Figure: the CNN now keeps a grid of features (LxD) from the image (HxWx3). At each step, the hidden state h0, h1, h2, ... produces a distribution a1, a2, ... over the L locations; the weighted combination of features gives a D-dimensional vector z1, z2, ..., which enters the next RNN step together with the previous word y1, y2, .... Each step also emits a distribution d1, d2, ... over the vocabulary.]
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
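A minimal numpy sketch of one decoding step (the weight shapes are illustrative, not the exact parameterization of Xu et al.): the current state scores the L feature vectors, the softmax gives the location distribution a, and the weighted combination z feeds the next step together with the previous word.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def soft_attention_step(h, features, y_prev, Wa, Wyh, Wzh, Whh, Why):
    """h: (H,) state; features: (L, D) CNN grid; y_prev: (E,) word embedding."""
    a = softmax(features @ (Wa @ h))                # (L,) distribution over locations
    z = a @ features                                # (D,) weighted feature combination
    h = np.tanh(Wyh @ y_prev + Wzh @ z + Whh @ h)   # z and the word enter the RNN
    d = softmax(Why @ h)                            # (V,) distribution over the vocab
    return h, a, d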
Soft vs Hard Attention
[Figure: a CNN produces a grid of features a, b, c, d (each D-dimensional) from the image (HxWx3); from the RNN comes a distribution over grid locations pa, pb, pc, pd, with pa + pb + pc + pd = 1, which is turned into a context vector z (D-dimensional).]
Soft attention:
• Summarize ALL locations: z = pa·a + pb·b + pc·c + pd·d
• The derivative dz/dp is nice! Train with gradient descent.
Hard attention:
• Sample ONE location according to p; z = that vector.
• With argmax, dz/dp is zero almost everywhere … can’t use gradient descent; need reinforcement learning.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
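The contrast in one place, as a hedged numpy sketch (here features holds the four grid vectors a, b, c, d and p is the distribution from the RNN):

import numpy as np

features = np.random.randn(4, 512)     # grid vectors a, b, c, d, each D-dim
p = np.array([0.5, 0.2, 0.2, 0.1])     # pa + pb + pc + pd = 1, from the RNN

# Soft: z is a smooth function of p, so dz/dp exists and plain backprop works.
z_soft = p @ features

# Hard: sample ONE location; the sampling step has no useful gradient
# (with argmax, dz/dp is zero almost everywhere), so training falls back
# on reinforcement learning (a REINFORCE-style estimator).
idx = np.random.choice(4, p=p)
z_hard = features[idx]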
Soft Attention for Captioning
[Figure: example captions with the corresponding soft-attention and hard-attention maps.]
The model learns to attend to the salient parts of an image while generating its caption.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Visual Question Answering: RNNs with Attention
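The slides give only the headline; as a hedged sketch of one common instantiation (all names and shapes are assumptions, not the specific model in the lecture), the question encoding decides where to look in the image grid:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def vqa_attention(q, img_feats, Wq, Wv):
    """q: (Q,) question encoding from an RNN; img_feats: (L, D) CNN grid.
    Scores each image location against the question, then summarizes."""
    scores = (img_feats @ Wv) @ (Wq @ q)   # (L,) question-conditioned scores
    a = softmax(scores)                    # where to look, given the question
    return a @ img_feats                   # (D,) attended image summary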
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
[Figure: the attention model attends over CNN features (LxD) extracted per frame (each frame HxWx3).]
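Following the same pattern, a hedged sketch of temporal attention over per-frame features (shapes illustrative, not the exact model of Yao et al.):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def temporal_attention(h, frame_feats, Wa):
    """frame_feats: (T, D), one CNN feature per frame; h: (H,) decoder state.
    Identical machinery to spatial attention, but re-weighting frames over time."""
    scores = frame_feats @ (Wa @ h)        # (T,) relevance of each frame
    w = softmax(scores)                    # softmax over time steps
    return w @ frame_feats                 # (D,) temporal context vector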
Spatial Transformer Networks
[Figure: box coordinates (xc, yc, w, h) select a region of the input image (HxWx3), which is cropped and rescaled to XxYx3.]
Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input. Can we make this function differentiable?
A small localization network predicts the transform 𝜃.
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
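A hedged numpy sketch of the differentiable answer, for the paper's affine case (grayscale input and all names are my simplifications): 𝜃 maps each output pixel (xt, yt), on a normalized grid, back to a source location (xs, ys), and bilinear sampling makes every output pixel a piecewise-smooth function of 𝜃, so gradients flow into the localization network.

import numpy as np

def spatial_transformer(U, theta, out_h, out_w):
    """U: (H, W) grayscale input; theta: 2x3 affine matrix mapping output
    coords (xt, yt) in [-1, 1] to source coords (xs, ys). Differentiable
    crop-and-rescale: each output pixel is bilinear in the input."""
    H, W = U.shape
    xt, yt = np.meshgrid(np.linspace(-1, 1, out_w), np.linspace(-1, 1, out_h))
    grid = np.stack([xt.ravel(), yt.ravel(), np.ones(out_h * out_w)])  # (3, N)
    xs, ys = theta @ grid                      # (2, N) sampling locations
    xs = (xs + 1) * (W - 1) / 2                # normalized -> pixel coords
    ys = (ys + 1) * (H - 1) / 2
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    V = np.zeros(out_h * out_w)
    for dx in (0, 1):                          # bilinear kernel: 4 neighbors,
        for dy in (0, 1):                      # weights piecewise linear in theta
            xn = np.clip(x0 + dx, 0, W - 1)    # out-of-bounds clamps to the edge
            yn = np.clip(y0 + dy, 0, H - 1)
            w = np.maximum(0, 1 - np.abs(xs - (x0 + dx))) \
              * np.maximum(0, 1 - np.abs(ys - (y0 + dy)))
            V += w * U[yn, xn]
    return V.reshape(out_h, out_w)

# Example: crop the centered half of the image and rescale it to 32x32.
U = np.random.rand(64, 64)
theta = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])            # scale by 0.5, no translation
V = spatial_transformer(U, theta, 32, 32)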
Attention Takeaways
Complexity:
• There are many design choices.
• Those choices have a big effect on performance.
• Ensembling has unusually large benefits.
• Simplify where possible!
Explainability:
• Attention models encode explanations.
• Both the locus and the trajectory of attention help us understand what’s going on.