
An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition
Authors
Baoguang Shi, Xiang Bai and Cong Yao

Presented By
Abhishek Niranjan (SAU/CS(P)/2018/05)

Department of Computer Science


South Asian University, New Delhi
26th November, 2018
Organization
• Abstract
• Index Terms
• Neural Network
• Convolutional Neural Network
• Recurrent Neural Networks
• Long Short-Term Memory
• Introduction
• Proposed Network Architecture
Abstract
• Image-based sequence recognition is a trending research topic in computer vision.
• Scene text recognition is one of the most important and challenging tasks in image-based sequence recognition.
• The problem is to recognize the character sequence in a cropped text image.
• The proposed architecture possesses several distinctive properties:
i. It is end-to-end trainable.
ii. It naturally handles sequences of arbitrary lengths.
iii. It produces an effective and much smaller model.
Neural Network

• A neural network is a series of algorithms that recognizes underlying relationships in a set of data through a process that mimics the way the human brain operates.
• It is a powerful technique for solving many real-world problems.
• Applications of neural networks:
  • Pattern recognition
  • Mobile computing
  • Marketing and financial applications, etc.
Contd..
Example: given an object's length and width, predict its color (class).

Length: 3    2    4    3    3.5  2    5.5  1    4.5
Width:  1.5  1    1.5  1    0.5  0.5  1    1    1
Color:  ???

(Scatter plot of the data points: width on the vertical axis, length on the horizontal axis, points colored by class.)
Contd..

(Diagram: a single neuron with inputs m1 = 2 and m2 = 1, weights w1 = 0.5 and w2 = 0.2, bias b = 0.3, producing output o = 0.82.)

• Sigmoid(x) = 1 / (1 + exp(-x))
• NN(m1, m2) = Sigmoid(w1*m1 + w2*m2 + b)
• NN(2, 1) = Sigmoid(0.5*2 + 0.2*1 + 0.3) = Sigmoid(1.5) ≈ 0.82
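A minimal sketch of this single-neuron computation in Python (NumPy only); the weights, bias, and inputs are the example values from the slide above.

import numpy as np

def sigmoid(x):
    # Sigmoid(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron(m1, m2, w1=0.5, w2=0.2, b=0.3):
    # Weighted sum of the inputs plus the bias, squashed by the sigmoid.
    return sigmoid(w1 * m1 + w2 * m2 + b)

print(neuron(2, 1))  # -> 0.8175..., i.e. about 0.82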
Contd..
• Cost function = (target - prediction)^2
• Example: with target = 0 and prediction = 0.82, the cost is (0 - 0.82)^2 ≈ 0.672.

Convolutional Neural Network
• We know it is good to learn a small model.
• In a fully connected model, do we really need all the edges?
• Can some of the parameters be shared?
Consider learning from an image:
• Some patterns are much smaller than the whole image, so a small region can be represented with fewer parameters (e.g., a "beak" detector).
• The same pattern appears in different places, so the detectors can be compressed: an "upper-left beak" detector and a "middle beak" detector can share the same parameters.
Convolutional neural network
• A CNN is a neural network with some convolutional layers (and some other layers).
• A convolutional layer has a number of filters that perform the convolution operation.
• The filter values are the network parameters to be learned.

6 x 6 input image:        Filter 1:        Filter 2:
1 0 0 0 0 1                1 -1 -1          -1  1 -1
0 1 0 0 1 0               -1  1 -1          -1  1 -1
0 0 1 1 0 0               -1 -1  1          -1  1 -1
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Each filter detects a small (3 x 3) pattern.
Contd..
• Convolution as a sparsely connected layer: flatten the 6 x 6 image into 36 inputs. The first output of Filter 1 is the dot product of the filter with the top-left 3 x 3 patch of the image (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), which equals 3.
• Each output is connected to only 9 inputs, not fully connected.
Contd..
Repeat this for each filter. Sliding Filter 1 and Filter 2 over the 6 x 6 image (stride 1) gives two 4 x 4 feature maps:

Filter 1 feature map:     Filter 2 feature map:
 3 -1 -3 -1               -1 -1 -1 -1
-3  1  0 -3               -1 -1 -2  1
-3 -3  0  1               -1 -1 -2  1
 3 -2 -2 -1               -1  0 -4  3

The two 4 x 4 images form a 2 x 4 x 4 feature map.
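A minimal NumPy sketch that reproduces these feature maps by sliding each 3 x 3 filter over the 6 x 6 image with stride 1 and no padding.

import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])

filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

filter2 = np.array([[-1,  1, -1],
                    [-1,  1, -1],
                    [-1,  1, -1]])

def convolve2d(img, flt, stride=1):
    # "Valid" convolution (really cross-correlation, as in most CNN libraries):
    # slide the filter over the image and take a dot product at each position.
    fh, fw = flt.shape
    oh = (img.shape[0] - fh) // stride + 1
    ow = (img.shape[1] - fw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(patch * flt)
    return out

print(convolve2d(image, filter1))  # first 4 x 4 feature map (top-left value is 3)
print(convolve2d(image, filter2))  # second 4 x 4 feature map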
Convolutional Layers

(Figure: the 6 x 6 input image is convolved with the two 3 x 3 filters to produce the feature maps.)
A CNN reduces the number of connections.
It shares weights across edges.
Problems with Feed-Forward Neural Networks
• They need a fixed-size input and give a fixed-size output.
• They are not designed for sequence / time-series data.
• They do not model memory.
• In a feed-forward neural network, information only moves in one direction: from the input layer, through the hidden layers, to the output layer.
Recurrent Neural Networks
• Recurrent Neural Networks are neural networks designed for capturing information from sequence / time-series data.
• They can take variable-size inputs and give variable-size outputs.
• An RNN remembers its inputs by means of an internal memory (the hidden state).
• RNNs are well suited for machine learning problems that involve sequential data.
Contd..
• When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously.
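A minimal sketch of a single recurrent step, assuming a vanilla (Elman) RNN cell; the weight matrices and sizes are illustrative, not from the paper.

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

# Illustrative, randomly initialized parameters.
W_xh = rng.normal(size=(hidden_size, input_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous state,
    # which is how the network "remembers" earlier inputs.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a sequence of 3 input vectors, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):
    h = rnn_step(x_t, h)
print(h.shape)  # (8,)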
Example
• Imagine you have a normal feed-forward neural network, give it the word "neuron" as input, and it processes the word character by character.
• By the time it reaches the character "r", it has already forgotten about "n", "e" and "u", which makes it almost impossible for this type of network to predict which character will come next.
• A Recurrent Neural Network is able to remember exactly that, because of its internal memory.
• Therefore a Recurrent Neural Network has two inputs: the present and the recent past.
Recurrent Neural Networks(Contd..)
• Feed-forward neural networks map one input to one output; RNNs can map one to many, many to many (e.g., translation) and many to one.
Backpropagation Through Time
• Backpropagation Through Time (BPTT) is basically just doing backpropagation on an unrolled Recurrent Neural Network.
• The unrolling is required conceptually because the error at a given timestep depends on the previous timesteps.
• Within BPTT the error is back-propagated from the last timestep to the first, while unrolling all the timesteps.
• It can be computationally expensive when the number of timesteps is large.
Long Short-Term Memory
• Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks that basically extends their memory.
• LSTMs enable RNNs to remember their inputs over a long period of time.
• LSTMs keep their information in a memory cell; the network can read, write and delete information from this memory.
• The memory can be seen as a gated cell, where "gated" means that the cell decides whether or not to store or delete information.
Contd..
• An LSTM has three gates: an input gate, a forget gate and an output gate (see the sketch below).
• Input gate: decides whether to let new input in.
• Forget gate: decides whether to delete information that is no longer important.
• Output gate: decides how much the cell state affects the output at the current timestep.
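A minimal NumPy sketch of a single LSTM step showing the three gates; the parameter shapes and initialization are illustrative, not from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
z = input_size + hidden_size  # each gate sees the input and the previous hidden state

# Illustrative, randomly initialized parameters for the three gates and the cell candidate.
W_i, b_i = rng.normal(size=(hidden_size, z)) * 0.1, np.zeros(hidden_size)  # input gate
W_f, b_f = rng.normal(size=(hidden_size, z)) * 0.1, np.zeros(hidden_size)  # forget gate
W_o, b_o = rng.normal(size=(hidden_size, z)) * 0.1, np.zeros(hidden_size)  # output gate
W_c, b_c = rng.normal(size=(hidden_size, z)) * 0.1, np.zeros(hidden_size)  # candidate

def lstm_step(x_t, h_prev, c_prev):
    v = np.concatenate([x_t, h_prev])
    i = sigmoid(W_i @ v + b_i)        # input gate: let new information in?
    f = sigmoid(W_f @ v + b_f)        # forget gate: keep or delete old memory?
    o = sigmoid(W_o @ v + b_o)        # output gate: how much memory to expose?
    c_tilde = np.tanh(W_c @ v + b_c)  # candidate new memory content
    c = f * c_prev + i * c_tilde      # update the cell state (the "memory")
    h = o * np.tanh(c)                # hidden state / output at this timestep
    return h, c

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):
    h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)  # (8,) (8,)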
Introduction
• This paper is concerned with a classic problem in computer vision: image-based sequence recognition.
• In the real world, visual objects such as scene text, handwriting and musical scores tend to occur in the form of sequences, not in isolation.
• Recognizing such sequence-like objects often requires the system to predict a series of object labels.
• The most popular deep models, such as DCNNs, cannot be directly applied to sequence prediction.
Contd..
• DCNN models often operate on inputs and outputs with fixed dimensions.
• Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences.
• However, they usually require a preprocessing step to convert the input image into a sequence of features, and this step is independent of the subsequent components in the pipeline.
• Consequently, such RNN-based systems cannot be trained and optimized in an end-to-end fashion.
Proposed Network Architecture
• The proposed neural network model is named the Convolutional Recurrent Neural Network (CRNN).
• CRNN possesses several distinctive advantages over conventional neural network models:
i. It can be directly learned from sequence labels (for instance, words), requiring no detailed per-character annotations.
ii. It achieves better or highly competitive performance on scene text recognition.
iii. It contains far fewer parameters than a standard DCNN model, consuming less storage space.
Feature Sequence Extraction
• A DCNN is used to extract a sequential feature representation from the input image.
• Before being fed into the network, all images are scaled to the same height.
• A sequence of feature vectors is then extracted from the feature maps produced by the convolutional layers: the i-th feature vector is the concatenation of the i-th columns of all the maps (see the sketch below).
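A minimal PyTorch-style sketch of this map-to-sequence step, assuming the convolutional feature maps have shape (batch, channels, height, width); the sizes are illustrative, not the exact configuration from the paper.

import torch

# Illustrative feature maps from the convolutional layers:
# batch of 2 images, 512 channels, height 1, width 26.
feature_maps = torch.randn(2, 512, 1, 26)

def map_to_sequence(maps):
    # Each column of the feature maps becomes one feature vector,
    # so the width axis plays the role of "time" for the recurrent layers.
    b, c, h, w = maps.shape
    return maps.permute(3, 0, 1, 2).reshape(w, b, c * h)

seq = map_to_sequence(feature_maps)
print(seq.shape)  # torch.Size([26, 2, 512]) -> (sequence_length, batch, feature_size)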
Sequence Labeling
• A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers.
• The advantages of the recurrent layers are three-fold:
• First, RNNs have a strong capability of capturing contextual information within a sequence.
• Second, an RNN can back-propagate error differentials to the convolutional layers, allowing the recurrent layers and the convolutional layers to be trained jointly in a unified network.
• Third, RNNs are able to operate on sequences of arbitrary lengths, traversing from start to end.
Contd..
• An LSTM is directional: it only uses past context. However, in image-based sequences, context from both directions is useful and complementary.
• Therefore, two LSTMs, one forward and one backward, are combined into a bidirectional LSTM (see the sketch below).
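A minimal PyTorch sketch of a bidirectional LSTM layer over the feature sequence from the previous sketch; the hidden size is illustrative.

import torch
import torch.nn as nn

feature_size, hidden_size = 512, 256

# bidirectional=True runs one LSTM forward and one backward over the sequence
# and concatenates their outputs at every timestep.
bilstm = nn.LSTM(input_size=feature_size, hidden_size=hidden_size, bidirectional=True)

seq = torch.randn(26, 2, feature_size)   # (sequence_length, batch, feature_size)
outputs, _ = bilstm(seq)
print(outputs.shape)  # torch.Size([26, 2, 512]) -> 2 * hidden_size per timestep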
Transcription
• Transcription is the process of converting the per-frame predictions made by the RNN into a final label sequence (a greedy decoding sketch is shown below).
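The CRNN paper adopts the Connectionist Temporal Classification (CTC) layer for transcription; below is a minimal sketch of greedy (best-path) CTC decoding, where the per-frame probabilities and the character set are made up for illustration.

import numpy as np

# Illustrative character set; index 0 is the CTC "blank" label.
chars = ["-", "s", "t", "a", "e"]

# Made-up per-frame class probabilities from the recurrent layers: (frames, classes).
probs = np.array([
    [0.1, 0.7, 0.1, 0.05, 0.05],    # 's'
    [0.8, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.6, 0.1, 0.1],      # 't'
    [0.1, 0.1, 0.6, 0.1, 0.1],      # 't' again (repeat collapses)
    [0.1, 0.05, 0.05, 0.7, 0.1],    # 'a'
])

def greedy_ctc_decode(probs, blank=0):
    # Best path: take the argmax at each frame, collapse repeated labels,
    # then remove blanks to obtain the final label sequence.
    best_path = probs.argmax(axis=1)
    decoded, prev = [], blank
    for p in best_path:
        if p != blank and p != prev:
            decoded.append(p)
        prev = p
    return "".join(chars[i] for i in decoded)

print(greedy_ctc_decode(probs))  # -> "sta"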
Thank You