Presented By
Abhishek Niranjan (SAU/CS(P)/2018/05)
Contd..
[Figure: a single neuron with inputs m1 = 2 and m2 = 1, weights w1 = 0.5 and w2 = 0.2, bias b = 0.3, and output 0.82]
Sigmoid(X) = 1/(1+exp(-X))
NN(m1,m2) = Sigmoid(w1*m1 + w2*m2 + b)
NN(2,1) = Sigmoid(0.5*2 + 0.2*1 + 0.3) = Sigmoid(1.5) = 0.82
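The computation above can be sketched in Python (weights and inputs taken from the slide):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def nn(m1, m2, w1=0.5, w2=0.2, b=0.3):
    # weighted sum of the inputs plus bias, squashed by the sigmoid
    return sigmoid(w1 * m1 + w2 * m2 + b)

print(round(nn(2, 1), 2))  # 0.82
```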
Contd..
Cost function = (target − prediction)²
With target 0 and prediction 0.82, the cost value is (0 − 0.82)² ≈ 0.672.
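As a small sketch, assuming the slide's target value is 0 (so that (0 − 0.82)² ≈ 0.672):

```python
def cost(target, prediction):
    # squared-error cost, as defined on the slide
    return (target - prediction) ** 2

print(round(cost(0, 0.82), 3))  # 0.672
```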
[Figure: a "beak" detector filter applied to a bird image]
The same pattern appears in different places, so the detectors can be compressed: an "upper-left beak" detector and a "middle beak" detector can share the same parameters.
Convolutional Neural Network
A CNN is a neural network with some convolutional layers (and some other layers).
A convolutional layer has a number of filters that perform the convolution operation.
Convolution
The filter values are the network parameters to be learned.

6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

Each filter detects a small pattern (3 × 3).
Contd..
Sliding Filter 1 over the 6 × 6 image, each output value is the dot product of the 3 × 3 filter with the corresponding 3 × 3 patch of the image.
Each output neuron therefore connects to only 9 inputs; it is not fully connected.
Contd..
Repeat this for each filter. Convolving the image with Filter 1 and Filter 2 (stride 1) gives two 4 × 4 feature maps, forming a 2 × 4 × 4 matrix.

Feature map of Filter 1:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Feature map of Filter 2:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
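The convolution that produces these feature maps can be sketched in plain Python (image and filter values taken from the slides; stride 1, no padding):

```python
def convolve(image, kernel):
    """Valid 2-D convolution (no kernel flip, stride 1), as on the slides."""
    n, k = len(image), len(kernel)
    out = []
    for r in range(n - k + 1):
        row = []
        for c in range(n - k + 1):
            # dot product of the kernel with the 3x3 patch at (r, c)
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]

filter1 = [[ 1, -1, -1],
           [-1,  1, -1],
           [-1, -1,  1]]

print(convolve(image, filter1)[0])  # [3, -1, -3, -1]
```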
Convolutional Layers
[Figure: the 6 × 6 input image is convolved with Filter 1 and Filter 2 to produce the feature maps]
A CNN reduces the number of connections: the filter weights are shared across the edges.
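A rough parameter count for the slides' 6 × 6 → 4 × 4 example illustrates the saving (the comparison itself is mine, not from the slides):

```python
# Fully connected: every one of the 16 output units sees all 36 input pixels.
fully_connected = 36 * 16   # 576 weights
# Convolutional: one shared 3x3 filter is reused at every position.
convolutional = 3 * 3       # 9 weights
print(fully_connected, convolutional)  # 576 9
```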
Problem with Feed-Forward Neural Networks
It needs a fixed-size input and gives a fixed-size output.
It is not designed for sequence/time-series data.
It does not model memory.
In a feed-forward neural network, information only moves in one direction: from the input layer, through the hidden layers, to the output layer.
Recurrent Neural Networks
Recurrent Neural Networks are neural networks designed for capturing information from sequence/time-series data.
They can take variable-size inputs and give variable-size outputs.
It is the first algorithm that remembers its input, due to an internal memory.
It is perfectly suited for machine learning problems that involve sequential data.
Contd..
When it makes a decision, it takes into consideration the
current input and also what it has learned from the inputs it
received previously.
Example
Imagine you have a normal feed-forward neural network, give it the word "neuron" as input, and let it process the word character by character.
By the time it reaches the character "r", it has already forgotten about "n", "e" and "u", which makes it almost impossible for this type of neural network to predict which character comes next.
A recurrent neural network is able to remember exactly that, because of its internal memory.
Therefore a recurrent neural network has two inputs: the present and the recent past.
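A minimal sketch of this idea, with hypothetical scalar weights standing in for the learned matrices:

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    # the new hidden state mixes the current input (the present)
    # with the previous hidden state (the recent past)
    return math.tanh(w_x * x + w_h * h_prev + b)

h = 0.0                    # initial "memory"
for x in [1.0, 0.0, 1.0]:  # a toy input sequence
    h = rnn_step(x, h)     # each step carries context forward
```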
Recurrent Neural Networks(Contd..)
Feed-forward neural networks map one input to one output; RNNs can map one to many, many to many (e.g. translation), and many to one.
Backpropagation Through Time
Backpropagation Through Time (BPTT) is basically just doing
Backpropagation on an unrolled Recurrent Neural Network.
The conceptualization of unrolling is required, since the error at a given timestep depends on the previous timestep.
Within BPTT the error is back-propagated from the last to the
first timestep, while unrolling all the timesteps.
It can be computationally expensive when you have a high
number of timesteps.
Long Short-Term Memory
Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks that basically extends their memory.
LSTMs enable RNNs to remember their inputs over a long period of time.
LSTMs keep their information in a memory; they can read, write and delete information from it.
This memory can be seen as a gated cell, where gated
means that the cell decides whether or not to store or delete
information.
Contd..
An LSTM has three gates: the input, forget and output gates.
Input gate: decides whether to let new input in.
Forget gate: decides whether to delete information because it isn't important.
Output gate: decides whether to let the cell state impact the output at the current timestep.
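A minimal sketch of one LSTM step, with hypothetical fixed scalar weights in place of the learned per-gate matrices:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev):
    # hypothetical scalar weights; a real LSTM learns a matrix per gate
    i = sigmoid(0.6 * x + 0.4 * h_prev)      # input gate: admit new info?
    f = sigmoid(0.5 * x + 0.3 * h_prev)      # forget gate: keep old memory?
    o = sigmoid(0.7 * x + 0.2 * h_prev)      # output gate: expose the memory?
    g = math.tanh(0.8 * x + 0.1 * h_prev)    # candidate cell content
    c = f * c_prev + i * g                   # gated update of the cell state
    h = o * math.tanh(c)                     # gated output
    return h, c

h, c = lstm_step(1.0, 0.0, 0.0)
```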
Introduction
This paper is concerned with a classic problem in computer vision: image-based sequence recognition.
In the real world, visual objects such as scene text, handwriting and musical scores tend to occur in the form of sequences, not in isolation.
Recognizing such sequence-like objects often requires the
system to predict a series of object labels.
The most popular deep models like DCNN cannot be directly
applied to sequence prediction.
Contd..
DCNN models often operate on inputs and outputs with fixed
dimensions.
Recurrent neural networks (RNN) models, another important
branch of the deep neural networks family, were mainly
designed for handling sequences.
However, RNN-based methods usually require a preprocessing step to convert the input image into a sequence of features, and this step is independent of the subsequent components in the pipeline.
So such RNN systems cannot be trained and optimized in an end-to-end fashion.
Proposed Network Architecture
The proposed neural network model is named as
Convolutional Recurrent Neural Network (CRNN).
CRNN possesses several distinctive advantages over
conventional neural network models:
i. It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations.
ii. It achieves better or highly competitive performance on scene text.
iii. It contains far fewer parameters than a standard DCNN model, consuming less storage space.
Feature Sequence Extraction
A DCNN is used to extract a sequential feature representation from an input image.
Before being fed into the network, all images need to be scaled to the same height.
Then a sequence of feature vectors is extracted from the feature maps produced by the convolutional layers.
This means the i-th feature vector is the concatenation of the i-th columns of all the maps.
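A sketch of this column-wise extraction, with hypothetical small feature maps (the real CRNN maps have many channels):

```python
# feature maps from the conv stack, shape (channels, height, width);
# here two hypothetical 1 x 4 maps, one per filter
maps = [
    [[3, -1, -3, -1]],
    [[2,  0,  1, -2]],
]

def to_sequence(feature_maps):
    """One feature vector per column: concatenate the i-th column of every map."""
    channels = len(feature_maps)
    width = len(feature_maps[0][0])
    seq = []
    for col in range(width):  # traverse the maps left to right
        vec = [feature_maps[ch][row][col]
               for ch in range(channels)
               for row in range(len(feature_maps[ch]))]
        seq.append(vec)
    return seq

print(to_sequence(maps))  # [[3, 2], [-1, 0], [-3, 1], [-1, -2]]
```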
Sequence Labeling
A deep bidirectional Recurrent Neural Network is built on the
top of the convolutional layers, as the recurrent layers.
The advantages of the recurrent layers are three-fold.
First, RNN has a strong capability of capturing contextual
information within a sequence.
Second, the recurrent layers can back-propagate error differentials to the convolutional layers, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network.
Third, RNN is able to operate on sequences of arbitrary
lengths, traversing from starts to ends.
Contd..
An LSTM is directional: it only uses past context. However, in image-based sequences, contexts from both directions are useful and complementary to each other.
Therefore, two LSTMs, one forward and one backward, are combined into a bidirectional LSTM.
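A toy sketch of combining the two directions (the recurrence here is a simple stand-in, not a real LSTM):

```python
def run_direction(xs):
    # stand-in for one recurrent pass: accumulates the context seen so far
    h, out = 0.0, []
    for x in xs:
        h = 0.5 * h + x
        out.append(h)
    return out

def bidirectional(xs):
    fwd = run_direction(xs)              # left-to-right context
    bwd = run_direction(xs[::-1])[::-1]  # right-to-left context, re-aligned
    # each timestep gets both views, concatenated
    return [(f, b) for f, b in zip(fwd, bwd)]
```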
Transcription
Transcription is the process of converting the per-frame predictions made by the RNN into a final label sequence.
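In the CRNN paper this is done with a CTC (Connectionist Temporal Classification) layer; the collapsing step of best-path decoding can be sketched as follows (the blank symbol '-' and the frame labels are illustrative):

```python
def collapse(frames, blank="-"):
    # CTC-style best-path decoding: merge repeated labels, then drop blanks
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:
            out.append(f)
        prev = f
    return "".join(out)

print(collapse(list("hh-e-ll-lloo")))  # hello
```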
Thank You