Describing images with natural language
Recurrent Neural Network
image (32x32 numbers) -> differentiable function -> class probabilities (10 numbers)
VGGNet or OxfordNet
Very simple and homogeneous. (And available in Caffe.)
[figure: the VGGNet stack built up layer by layer: CONV and POOL layers followed by FULLY-CONNECTED layers, from the input image [224x224x3] to the output class scores [1000]]
image [224x224x3] -> ... -> probabilities [1x1x1000]
The first fully-connected layer maps the [7x7x512] volume to [1x1x4096] neurons. Every neuron in the output:
1. computes a dot product between the input and its weights
2. thresholds it at zero
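For concreteness, a minimal numpy sketch of what one such output neuron computes (the random values are placeholders; the shapes follow the slide):

    import numpy as np

    x = np.random.randn(7 * 7 * 512)     # input volume [7x7x512], flattened
    w = np.random.randn(7 * 7 * 512)     # this neuron's weights (same size)
    out = max(0.0, float(np.dot(w, x)))  # 1. dot product  2. threshold at zero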
Convolutional Layer
Can be implemented efficiently with convolutions.
[figure: input volume 224x224 (D=3) mapped to an output volume 224x224x64]
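A naive numpy sketch of what one filter of such a layer computes (real implementations use optimized convolution routines; padding and stride are omitted for brevity):

    import numpy as np

    def conv_one_filter(x, w, b):
        # x: [H, W, D] input volume, w: [k, k, D] filter, b: bias
        H, W, D = x.shape
        k = w.shape[0]
        out = np.zeros((H - k + 1, W - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # dot product between the filter and a local patch of the input
                out[i, j] = np.sum(x[i:i+k, j:j+k, :] * w) + b
        return np.maximum(out, 0)  # threshold at zero

    x = np.random.randn(8, 8, 3)   # toy input (the slide's is 224x224 with D=3)
    print(conv_one_filter(x, np.random.randn(3, 3, 3), 0.0).shape)  # (6, 6)

The slide's layer applies 64 such filters, which is what produces the 64-deep output volume.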
Pooling Layer
Downsampling: [224x224x64] -> [112x112x64]
[figure: max pool downsampling each 224x224 activation map to 112x112]
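A minimal numpy sketch of 2x2 max pooling producing exactly that shape change:

    import numpy as np

    def max_pool_2x2(x):
        # x: [H, W, D]; take the max over each non-overlapping 2x2 window
        H, W, D = x.shape
        return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

    x = np.random.randn(224, 224, 64)
    print(max_pool_2x2(x).shape)   # (112, 112, 64)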
FC (Fully-connected)
image [224x224x3] -> differentiable function -> class probabilities [1000]
[figure: example output distribution over the 1000 classes (cat, dog, ...), with probabilities such as 0.4, 0.3, 0.2, 0.09, 0.01]
Training
Loop until tired:
1. Sample a batch of data
2. Forward it through the network to get predictions
3. Backprop the errors
4. Update the weights
[image credit: Karen Simonyan]
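A runnable toy version of that training loop, using a single linear layer with a softmax loss so the whole thing fits in numpy (sizes and hyperparameters here are illustrative, not from the talk):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(1000, 3072)           # fake images
    y = np.random.randint(10, size=1000)      # fake labels
    W = 0.01 * np.random.randn(3072, 10)      # weights of a 1-layer net
    lr, batch_size = 1e-3, 64

    for step in range(100):                   # "loop until tired"
        idx = np.random.choice(len(X), batch_size)    # 1. sample a batch
        xb, yb = X[idx], y[idx]
        scores = xb.dot(W)                            # 2. forward -> predictions
        p = np.exp(scores - scores.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)             # softmax probabilities
        loss = -np.log(p[np.arange(batch_size), yb]).mean()
        dscores = p.copy()
        dscores[np.arange(batch_size), yb] -= 1       # 3. backprop the errors
        dW = xb.T.dot(dscores) / batch_size
        W -= lr * dW                                  # 4. update the weights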
Summary so far:
Convolutional Networks express a single
differentiable function from raw image pixel
values to class probabilities.
Recurrent Neural Network
Plug
- Fei-Fei and I are teaching CS231n (a Convolutional Neural Networks class) at Stanford this quarter: cs231n.stanford.edu
- All the notes are online: cs231n.github.io
- Assignments are on terminal.com
English words
RecurrentJS
train recurrent networks in Javascript!*
*if you have a lot of time :)
2-layer LSTM
[figure: RNN language model unrolled over "cat sat on mat". Inputs x0..x4 are <START>, cat, sat, on, mat; hidden states h0..h4; each output is a distribution over the next word, starting with y1 = P(word | [<S>])]
[figure sequence: sampling from the trained RNN one word at a time. Feed x0 = <START>, compute h0 and the output distribution, sample! a word ("cat"), feed it back in as x1, and repeat to sample "sat", "on", "mat"]
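A self-contained numpy sketch of this sampling loop (the tiny vocabulary and the untrained random weights are placeholders; a real model would use learned parameters):

    import numpy as np

    vocab = ['<START>', '<END>', 'cat', 'sat', 'on', 'mat']
    V, D, H = len(vocab), 10, 20
    Wex = 0.1 * np.random.randn(V, D)   # word embeddings
    Wxh = 0.1 * np.random.randn(D, H)
    Whh = 0.1 * np.random.randn(H, H)
    Why = 0.1 * np.random.randn(H, V)

    h = np.zeros(H)
    word = vocab.index('<START>')
    sentence = []
    while len(sentence) < 20:
        x = Wex[word]
        h = np.maximum(0, x.dot(Wxh) + h.dot(Whh))  # h = max(0, Wxh*x + Whh*h)
        p = np.exp(h.dot(Why))
        p /= p.sum()                                # softmax -> P(word | words so far)
        word = np.random.choice(V, p=p)             # sample!
        if vocab[word] == '<END>':
            break
        sentence.append(vocab[word])
    print(' '.join(sentence))

With the random weights above the output is gibberish, but the control flow is exactly the sampling procedure in the figure.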
Training example: "straw hat"
[figure: the captioning RNN unrolled on this example. Inputs x0 = <START>, x1 = straw, x2 = hat; hidden states h0..h2; outputs y0..y2 predict the next word at each step]
before: h0 = max(0, Wxh * x0)
now:    h0 = max(0, Wxh * x0 + Wih * v)    (v: the image's CNN representation)
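In code the change is one extra term on the first step. A hedged sketch (Wih is the new image-to-hidden matrix; v stands for the CNN's 4096-d code for the image; all values here are random placeholders):

    import numpy as np

    D, H = 10, 20
    Wxh = 0.1 * np.random.randn(D, H)
    Wih = 0.1 * np.random.randn(4096, H)   # new parameters for the image term
    x0 = 0.1 * np.random.randn(D)          # embedding of <START>
    v = np.random.randn(4096)              # CNN representation of the image

    h0_before = np.maximum(0, x0.dot(Wxh))              # before
    h0_now = np.maximum(0, x0.dot(Wxh) + v.dot(Wih))    # now: biased by the image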
test image
[figure sequence: captioning a test image. Feed the image through the CNN, feed x0 = <START> into the RNN, sample! "straw" from y0, feed it back in, sample! "hat" from y1, then sample! the <END> token => finish]
hidden representation (e.g. 200 numbers)
h1 = max(0, Wxh * x1 + Whh * h0)
[figure: one recurrence step, from x1 = cat and the previous state h0 to h1 and the output y1]
currently: ~120K images, ~5 sentences each
Training an RNN/LSTM...
- Clip the gradients (important!). 5 worked ok
- RMSprop adaptive learning rate worked nicely
- Initialize softmax biases with the log word-frequency distribution
- Train for a long time
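Hedged numpy sketches of those tricks, applied to one parameter matrix (the elementwise clip-to-[-5, 5] form matches "5 worked ok"; the decay and epsilon are typical values, not from the slide):

    import numpy as np

    W = 0.01 * np.random.randn(100, 50)
    dW = np.random.randn(100, 50) * 10   # pretend gradient from backprop
    cache = np.zeros_like(W)
    learning_rate, decay = 1e-3, 0.99

    # clip the gradients (important!): elementwise clamp to [-5, 5]
    dW = np.clip(dW, -5.0, 5.0)

    # RMSprop: per-parameter adaptive learning rate
    cache = decay * cache + (1 - decay) * dW ** 2
    W -= learning_rate * dW / (np.sqrt(cache) + 1e-8)

    # initialize softmax biases with the log word-frequency distribution
    word_counts = np.array([8.0, 5.0, 2.0, 1.0])   # toy counts
    b_softmax = np.log(word_counts / word_counts.sum())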
+ Transfer Learning
use weights pretrained on ImageNet
[figure: the "straw hat" training example again, with the CNN initialized from ImageNet-pretrained weights]
See predictions on 1000 COCO images: http://bit.ly/neuraltalkdemo
NeuralTalk
- Code on Github
- Both RNN/LSTM
- Python+numpy (CPU)
- Matlab+Caffe if you want to run on new images (for now)
Ranking model
web demo: http://bit.ly/rankingdemo
Summary
[figure: Convolutional Neural Network]
Neural Networks:
- input -> output end-to-end optimization
- stackable / composable like Lego
- easily support Transfer Learning
- work very well.
Thank you!