
2016 23rd International Conference on Pattern Recognition (ICPR)

Cancún Center, Cancún, México, December 4-8, 2016

Fully Convolutional Recurrent Network for


Handwritten Chinese Text Recognition
Zecheng Xie, Zenghui Sun, Lianwen Jin, Ziyong Feng, Shuye Zhang
College of Electronic and Information Engineering
South China University of Technology
Guangzhou, China
xiezcheng@foxmail.com, lianwen.jin@gmail.com, sunfreding@gmail.com,
f.ziyong@mail.scut.edu.cn, shuye.cheung@gmail.com

Abstract—This paper proposes an end-to-end framework, namely the fully convolutional recurrent network (FCRN), for handwritten Chinese text recognition (HCTR). Unlike traditional methods that rely heavily on segmentation, our FCRN is trained with online text data directly and learns to associate the pen-tip trajectory with a sequence of characters. FCRN consists of four parts: a path-signature layer to extract signature features from the input pen-tip trajectory, a fully convolutional network to learn informative representation, a sequence modeling layer to make per-frame predictions on the input sequence, and a transcription layer to translate the predictions into a label sequence. We also present a refined beam search method that efficiently integrates the language model to decode the FCRN and significantly improve the recognition results.

We evaluate the performance of the proposed method on the test sets from the databases CASIA-OLHWDB and ICDAR 2013 Chinese handwriting recognition competition, and both achieve state-of-the-art performance with correct rates of 96.40% and 95.00%, respectively.

I. INTRODUCTION

Handwritten Chinese text recognition (HCTR) is a challenging problem that has attracted intensive attention from numerous researchers. The large character set, the diversity of writing styles and the character-touching problem are the main difficulties of HCTR. Traditional methods [1] [2] overcome these difficulties by integrating segmentation and recognition. Generally, a segmentation-recognition candidate lattice [1] is first derived from the input pen-tip trajectory through operations of over-segmentation, component combination and character recognition. Based on the lattice, the optimal path can be searched by simultaneously considering the character recognition score, in addition to the geometric and linguistic contexts. Zhou et al. [1] proposed a method based on semi-Markov conditional random fields, which combined candidate character recognition scores with geometric and linguistic contexts. Zhou et al. [2] described an alternative parameter learning method, which aimed at minimizing the character error rate rather than the string error rate. Vision Objects Ltd., France, whose system yielded the best performance in the ICDAR 2013 Chinese handwriting recognition competition [3], introduced three experts that were responsible for segmentation, recognition and interpretation; they employed a global discriminant training scheme on the text level to learn the classifier parameters and meta-parameters of the recognizer. However, traditional methods based on over-segmentation can barely overcome their own limitations to rectify mis-segmentations when characters are not correctly separated.

Segmentation-free models [4]-[8] have been studied and have proved to be useful in different areas. Liwicki et al. [4] and Graves et al. [5] combined bidirectional long short-term memory (LSTM) and the connectionist temporal classifier (CTC) to build a speech recognizer. Messina et al. [6] applied multi-dimensional LSTM with CTC to offline HCTR. Recently, Shi et al. [7] proposed a network architecture called the convolutional recurrent neural network (CRNN), which consists of convolutional layers, recurrent layers and a transcription layer, for image-based sequence recognition.

Similar to the aforementioned tasks, variable-length input is also the fundamental difficulty when solving HCTR problems. In this paper, we propose a fully convolutional recurrent network (FCRN), which is a novel framework for HCTR problems that possesses the following advantages: (1) It applies a path-signature layer to generate signature feature maps for online data, which uniquely characterize the pen-tip trajectory. (2) It takes an input sequence of arbitrary length and outputs a corresponding label sequence without pre-segmentation. (3) It is end-to-end trainable: all its components can be jointly trained to fit each other and improve the overall function and reliability. Language models are of great importance for speech recognition and online text recognition, and have been proved to be effective by Wang et al. [9] and Wu et al. [10]. In this paper, we adopted a refined beam search method to integrate a language model to decode our FCRN. Experiments showed that by incorporating lexical constraints and prior knowledge about a certain language, the language model can further decrease the error rate by 2%-5%.

The remaining parts of the paper are organized as follows. In Section 2, we illustrate the framework of FCRN in detail. In Section 3, we describe language modeling, and in Section 4, we present the experimental results. In Section 5, we conclude the paper.

II. FULLY CONVOLUTIONAL RECURRENT NETWORK



Fig. 1: Architecture of the proposed fully convolutional recurrent network. Given the input pen-tip trajectory, the path-signature layer extracts 2^(n+1) - 1 signature feature maps with informative dynamics. Then a fully convolutional network produces a length-T feature sequence whose frames correspond to receptive fields with height 126 pixels and width 62 pixels on the signature feature maps. After that, a multi-layer BLSTM predicts a probability distribution for each frame in the feature sequence. Finally, the transcription layer derives a label sequence from the per-frame predictions.
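The 126 x 62 receptive field quoted in the caption can be checked with the layer-wise recursion of Eq. (5) given later in Section II-B, applied to the kernel and stride settings listed in Table II. The following is a minimal sketch in Python under my reading of those settings (padding does not affect receptive-field size); the layer lists are an assumption reconstructed from Table II, not code from the authors.

```python
def receptive_field(layers_bottom_up):
    """Backward recursion of Eq. (5): r_i = (r_{i+1} - 1) * s_i + k_i.

    layers_bottom_up: list of (kernel, stride) pairs for one spatial axis,
    ordered from the input side to the output side of the FCN.
    """
    r = 1  # a single unit in the topmost response map
    for k, s in reversed(layers_bottom_up):
        r = (r - 1) * s + k
    return r

# Per-axis (kernel, stride) settings, bottom-up: four (3x3 conv, 2x2 pool)
# blocks, then the 3x1 (stride 3x1) convolution, then the final 2x2 convolution.
height = [(3, 1), (2, 2)] * 4 + [(3, 3), (2, 1)]
width  = [(3, 1), (2, 2)] * 4 + [(1, 1), (2, 1)]
print(receptive_field(height), receptive_field(width))  # prints: 126 62
```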

Given the training set Q and a training instance (x, z), in which x represents the pen-tip trajectory and z is the corresponding label sequence, the FCRN aims to minimize the loss function L(Q), defined as the negative log probability of correctly labelling all the training examples in Q:

L(Q) = -\ln \prod_{(x,z) \in Q} p(z|x) = -\sum_{(x,z) \in Q} \ln p(z|x).   (1)

Fig. 1 describes the network architecture of the proposed FCRN. The FCRN consists of four components. First, the path-signature layer outputs 2^(n+1) - 1 feature maps that are used to characterize the pen-tip trajectory from the online handwritten text data. Second, a fully convolutional network (FCN) produces a feature sequence in which each frame represents the feature vector of a 126 x 62 receptive field on the signature feature maps. Third, a multi-layer bidirectional LSTM (BLSTM) predicts a probability distribution for each frame in the feature sequence. Finally, the transcription layer derives a label sequence from the per-frame predictions.

A. Path signature layer

The path signature, pioneered by Chen [11] in the form of iterated integrals and developed by Terry Lyons and his colleagues to play a fundamental role in rough path theory [12]-[14], can extract sufficient information that uniquely characterizes paths (e.g., in online handwriting) of finite length.

Assume a time interval [T_1, T_2] and the writing plane U \subset R^2. Then a pen stroke can be expressed as S : [T_1, T_2] \to U. For intervals [t_1, t_2] \subset [T_1, T_2], the k-th iterated integral of S is the 2^k-dimensional vector defined by

S^k_{t_1,t_2} = \int_{t_1 < q_1 < \cdots < q_k < t_2} 1 \, dS_{q_1} \otimes \cdots \otimes dS_{q_k}.   (2)

By convention, the k = 0 iterated integral is simply the number one (i.e., the offline map of the character), the k = 1 iterated integral represents the path displacement, and the k = 2 iterated integral represents the curvature of the path.

Note that the k-th iterated integral of S increases rapidly in dimension as k increases while carrying very little information. Hence, a truncated signature is preferred. If truncated at level n, the path signature can be expressed by

P(S)^n_{t_1,t_2} = (1, S^1_{t_1,t_2}, \ldots, S^n_{t_1,t_2}).   (3)

The dimension of the truncated path signature is 2^(n+1) - 1 (i.e., the number of feature maps). When S is a straight line, the iterated integrals S^k_{t_1,t_2} can be calculated using

S^0_{t_1,t_2} = 1,
S^1_{t_1,t_2} = \Delta_{t_1,t_2},
S^2_{t_1,t_2} = (\Delta_{t_1,t_2} \otimes \Delta_{t_1,t_2}) / 2!,   (4)
S^3_{t_1,t_2} = (\Delta_{t_1,t_2} \otimes \Delta_{t_1,t_2} \otimes \Delta_{t_1,t_2}) / 3!, \ldots,

where \Delta_{t_1,t_2} := S_{t_2} - S_{t_1} denotes the path displacement. Fig. 1 shows the signature feature maps of online text data to better illustrate the idea of the path signature.
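To make Eq. (4) concrete, the sketch below is a minimal Python/NumPy illustration (not the paper's implementation) that accumulates the level-2 truncated signature of a piecewise-linear pen stroke by treating each segment as a straight line and concatenating with Chen's identity. With n = 2 this yields the 2^(2+1) - 1 = 7 values that would fill the Sig2 feature maps; the function name is mine.

```python
import numpy as np

def truncated_signature_level2(points):
    """Level-2 truncated path signature of a piecewise-linear pen stroke.

    points: (N, 2) array of pen-tip coordinates.
    Returns a 7-dim vector [1, S^1 (2 values), S^2 (4 values)],
    i.e. the Sig2 case with 2^(2+1) - 1 = 7 feature channels.
    """
    points = np.asarray(points, dtype=np.float64)
    s1 = np.zeros(2)           # level 1: accumulated displacement
    s2 = np.zeros((2, 2))      # level 2: accumulated iterated integral
    for a, b in zip(points[:-1], points[1:]):
        d = b - a              # straight-line segment, per Eq. (4)
        # Chen's identity for appending the segment to the running path:
        # S^2(X*Y) = S^2(X) + S^1(X) (tensor) S^1(Y) + S^2(Y)
        s2 += np.outer(s1, d) + np.outer(d, d) / 2.0
        s1 += d
    return np.concatenate(([1.0], s1, s2.ravel()))

# Toy stroke with three pen-tip samples.
print(truncated_signature_level2([[0, 0], [1, 0], [1, 1]]))
```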

B. Fully convolutional network

A convolutional network is a powerful visual model that extracts high-level abstract features from an image. Inheriting this property, an FCN [15] takes an input image of arbitrary size and outputs a correspondingly sized dense response map. Unlike image cropping or sliding-window-based approaches, an FCN eliminates redundant computations by sharing the convolutional response maps layer by layer, which makes inference and backpropagation efficient.

Basic operations in a convolutional network, such as convolution, pooling and the element-wise activation function, are translation invariant. Therefore, locations in the last response map correspond to rectangular regions of the original image, called receptive fields, with which they are associated. Layer-wise formulations to calculate the exact location and size of the receptive field are provided below:

r_i = (r_{i+1} - 1) \cdot s_i + k_i,   (5)

p_i = s_i \cdot p_{i+1} + \left( \frac{k_i - 1}{2} - d_i \right),   (6)

where r_i is the local region size of the i-th layer, k is the kernel size, s is the stride size, p denotes the position and d is the padding size of a particular layer.

The FCN takes the signature feature maps as input and outputs a length-T feature sequence. As shown in Fig. 2, successive frames in the output feature sequence correspond to overlapped receptive fields on the original data.

Fig. 2: The receptive field. Successive frames in the output feature sequence of the FCN correspond to overlapped receptive fields on the original data.

C. Multi-layer BLSTM

The traditional recurrent neural network (RNN) is well known for its self-connected hidden layer that recurrently transfers information from output to input. However, the traditional RNN suffers from the gradient vanishing and exploding problem. Long short-term memory (LSTM) [16], whose core is the memory cell and three gates (as illustrated in Fig. 3), is used here for its strong ability to capture complex and long-term temporal dynamics. In particular, the three sigmoidal nonlinear gates, namely the input gate, forget gate and output gate, control the information flow in and out of the cell unit. The input gate protects the cell unit from the influence of the current input along with past hidden states, the forget gate allows the memory cell to forget or maintain its previous states, and the output gate decides how much memory is to be sent out as hidden states.

Fig. 3: Long short-term memory (LSTM) cell.

To capture complex long-term dependencies, we adopted LSTM for modeling the input feature sequence produced by the FCN. Each time it receives a frame from the input feature sequence, the LSTM updates its hidden states and predicts a distribution for further transcription. We note that LSTM has the following properties for the HCTR problem. Shi et al. [7] showed that LSTM naturally captures the contextual information from a sequence, which makes the text recognition process more efficient and reliable than processing each character independently. Moreover, LSTM is not limited to fixed-length inputs or outputs, which allows for modeling sequential data of arbitrary length. Furthermore, LSTM can be jointly trained with an FCN in a unified network (e.g., FCRN); joint training can benefit both the convolutional layers and the LSTM, and improve overall text recognition performance.

Standard LSTM can only use past contextual information in one direction. This is far from sufficient for HCTR, in which bidirectional contextual knowledge is accessible. Bidirectional LSTM (BLSTM) can learn long-range context dynamics in both input directions and significantly outperform unidirectional networks. Furthermore, as suggested by Pascanu et al. [17], we stack multiple BLSTMs in our framework to capture higher-level abstract information for further transcription. Finally, fully connected layers were incorporated between the BLSTM and the transcription layer to enhance classification.

D. Transcription

Traditional approaches for HCTR are confronted with the paradox of a circular dependency between segmentation and recognition. To avoid the difficulty of segmentation, we adopted connectionist temporal classification (CTC) as the transcription layer in our framework. CTC allows an FCN and LSTM to perform sequential training without requiring any prior alignment between input images and their corresponding label sequences.

We denote the character set as C' = C \cup {blank}, where C contains all characters used in this task and blank represents the null emission. Given a length-T input sequence s = s_1, s_2, \ldots, s_T, where s_t \in R^{|C'|}, we can obtain an exponentially large number of length-T label sequences, known as alignments, by assigning each time step a label and concatenating the labels to form a label sequence. The alignments are denoted by \pi and their probability is given below:

Pr(\pi|s) = \prod_{t=1}^{T} Pr(\pi_t, t|s).   (7)

By applying a sequence-to-sequence operation B, alignments can be mapped onto a transcription (denoted by l) by first removing the repeated labels and then the blanks. For example, "apple" can be obtained by applying B to "-aa-p-p-ll-e" or "-a-pp-p-l-ee-" (where "-" denotes blank). The total probability of a transcription can be calculated by summing the probabilities of all alignments that correspond to it:

Pr(l|s) = \sum_{\pi : B(\pi) = l} Pr(\pi|s).   (8)

As described by Graves and Jaitly [18], because we do not know the exact position of the labels within a particular transcription, we consider all locations where they could occur; that is what allows CTC to train a network without pre-segmented data. A detailed forward-backward algorithm to efficiently calculate the probability in Eq. (8) was described by Graves [19].

III. LANGUAGE MODELING

The statistical language model plays a significant role in many technological applications, including online and offline handwritten text recognition, speech recognition and language translation. The statistical model of a language (e.g., a length-T sequence of words) is represented as follows:

Pr(w_1^T) = \prod_{t=1}^{T} Pr(w_t | w_1^{t-1}),   (9)

where w_t is the t-th word in the sequence and w_i^j denotes the sequence (w_i, w_{i+1}, \ldots, w_{j-1}, w_j). In fact, closer words in a word sequence tend to be more dependent. Therefore, the n-gram model, which is constructed from the conditional probability of the next word given the last n - 1 words, is more often used in practice:

Pr(w_t | w_1^{t-1}) \approx Pr(w_t | w_{t-n+1}^{t-1}).   (10)

In this paper, we only considered character bigram and trigram language models in the experiments.

Decoding a CTC network can be easily accomplished through naive decoding [19], which takes the label with the highest probability at each frame and obtains the transcription by applying the operation B to the alignment. However, naive decoding is not sufficient and can be improved by language modeling. By incorporating lexical constraints and prior knowledge about the language, language modeling can rectify some obvious semantic errors, and thus improves the recognition result. To integrate the language model and overcome the difficulty that the operation B creates, we adopted a refined beam search method to decode the FCRN. Specifically, given the per-frame prediction distribution from the multi-layer BLSTM, the beam search method first selects the candidates with confidence scores higher than a probability threshold for each time step, and then it determines the time steps whose one and only candidate is blank, using them to separate the alignments into regions. Finally, it enumerates the candidate paths in every region and sequentially concatenates the candidate paths from the first to the last region. In each concatenating step, both the confidence score and the language model score of the paths are considered, and only the top N paths remain for the subsequent concatenating steps.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Online handwritten text data

CASIA-OLHWDB [20] is a Chinese handwriting database that is often used for online Chinese handwriting recognition. It contains both isolated characters and unconstrained text lines. The training set of CASIA-OLHWDB for online handwritten text recognition contains 4072 pages of handwritten texts, which incorporate 41,710 text lines, including 1,082,220 characters of 2650 classes, whereas the test set (denoted as D-Casia) contains 1020 text pages, including 269,674 characters of 2631 classes. We randomly split the training set into two groups, with approximately 90% for training and the remainder for gauging the convergence of the training process and further parameter learning for the beam search. Furthermore, we assessed our proposed method on the test data (denoted as D-Com) of the online handwritten text recognition task of the ICDAR 2013 Chinese handwriting recognition competition [21], which contains 3432 text lines, including 91,576 characters of 1375 classes. However, our actual evaluation dataset is smaller than the reported one because we removed outlier characters that are never seen in the training data; it actually contains 89,723 characters of 1258 classes.

B. Textual data

The experiments were conducted on three corpora: the PFR corpus [22], which is news text from the 1998 People's Daily corpus; the PH corpus [23], which is news text from the People's Republic of China's Xinhua news written between January 1990 and March 1991; and the SLD corpus [24], which contains news text from the 2006 Sogou Lab Data. Because the total amount of Sogou Lab Data was too large, we only used an extract in our experiments. Detailed information about these corpora is given in Table I.

We constructed our language models using the SRILM toolkit [25]. We built three language models based on these three corpora, and compared their roles in decoding the FCRN with the beam search method.

TABLE I: Character information in the corpora

Corpora   #characters   #class
PFR       2,199,492     4,689
PH        3,697,028     4,722
SLD       56,279,692    6,882

C. Experimental setting

The detailed architecture of our FCRN for HCTR is listed in Table II. The kernel number of each convolutional layer in our FCN, from bottom to top, is 64, 128, 256, 256, 512 and 512. We also applied batch normalization [26] to the last four convolutional layers to enable them to converge faster and avoid over-fitting. To accelerate the training process, we trained our network with shorter texts segmented from text lines in the training data, which could be normalized to the same height of 128 pixels while retaining a width of fewer than 576 pixels. In the test phase, we maintained the same height but increased the width to 2400 pixels to contain the text lines from the test set.

TABLE II: Detailed settings of our system. In this table, k, s and p represent the kernel, stride and padding size, respectively; the pooling and 3x3 convolution layers form a block that is stacked four times.

Layer type       Settings                               Stack times
transcription    sequence labeling                      1
inner product    n: 2048                                2
BLSTM            c: 1024                                3
convolution      k: 2x2, s: 1x1, p: 0x0                 1
convolution      k: 3x1, s: 3x1, p: 0x0                 1
pooling          k: 2x2, s: 2x2                         4
convolution      k: 3x3, s: 1x1, p: 0x1
path-signature   128x576 (train), 128x2400 (test)       1
input            pen-tip trajectory                     1

We constructed our FCRN network within the CAFFE [27] deep learning framework, in which the LSTM is implemented following Venugopalan et al. [28] and the other components were contributed by ourselves. The optimization algorithm was AdaDelta with rho = 0.9. We trained our FCRN on GeForce Titan-X GPUs, and it took approximately four days to reach convergence.

We used the correct rate (CR) and accuracy rate (AR) performance measurements discussed in the ICDAR 2013 Chinese handwriting recognition competition [21] to assess our framework.

D. Experimental results

We compared the path signatures in different truncated versions (Sig0, Sig1, Sig2 and Sig3) on our network. Table III presents the results of our system with naive decoding (i.e., without language modeling). We observed that Sig2 outperformed the other signatures in both CR and AR, which suggests that Sig2 already extracts sufficient information for characterizing the pen-tip trajectory. Moreover, as the path signature increased from Sig0 to Sig2, system performance improved monotonically from 90.94% to 94.52%, because the path signature captured more informative features from the pen-tip trajectory with higher iterated integrals. However, Sig3 performed worse than Sig2 in the experiment, because Sig3 captures only slightly more information than Sig2 but may bring in many more useless features. Experiments also showed that FCRN performed much better on dataset D-Casia than on D-Com, because the per-character sample distribution of the training set was more similar to dataset D-Casia than to D-Com.

TABLE III: Correct rate and accuracy rate (%) on datasets D-Casia and D-Com with the path signature in different truncated versions (without language modeling).

Path signatures   Feature maps   D-Casia CR   D-Casia AR   D-Com CR   D-Com AR
Sig0              1              90.94        89.86        84.91      83.55
Sig1              3              93.56        93.04        87.05      86.32
Sig2              7              94.52        93.22        89.86      88.28
Sig3              15             93.92        93.02        88.46      87.36

Because Sig2 performed the best among all iterated integrals, we adopted it in our FCRN for the following experiments.

We investigated the performance of decoding the FCRN with different language models using the beam search method. Table IV shows that by integrating the PFR corpus with the bigram language model, the CRs on datasets D-Casia and D-Com increased by 1.14% and 3.11%, respectively, and the ARs increased by 1.13% and 3.08%, respectively, proving the effectiveness of the language model. Experiments also showed that with a higher-order language model (e.g., trigram), our system still improved performance. Using PH for decoding achieved a similar effect. However, when we used a much larger corpus, SLD (about 56 million characters), for decoding, performance significantly improved. A larger and richer corpus made the language model more general and objective and, most importantly, helped to overcome the curse of dimensionality problem [29].

TABLE IV: Correct rate and accuracy rate (%) on datasets D-Casia and D-Com based on FCRN with Sig2, integrating language models built from different corpora.

Corpora        n-gram order   D-Casia CR   D-Casia AR   D-Com CR   D-Com AR
FCRN (no LM)   -              94.52        93.22        89.86      88.28
PFR            2              95.66        94.35        92.97      91.36
PFR            3              95.88        94.66        93.10      91.55
PH             2              95.70        94.40        92.76      91.17
PH             3              95.88        94.66        93.10      91.55
SLD            2              95.94        94.79        93.41      92.01
SLD            3              96.40        95.34        95.00      92.88

The CRNN architecture proposed by Shi et al. [7] is a special case of our FCRN with the path signature truncated at level zero (i.e., Sig0). As presented in Table V, FCRN outperformed CRNN on both D-Casia and D-Com, which suggests that FCRN captured more essential online information from the pen-tip trajectory and was a better choice for the HCTR problem.

TABLE V: Comparison with state-of-the-art methods based on correct rate and accuracy rate (%) on datasets D-Casia and D-Com.

Dataset   Method                   CR      AR
D-Casia   Zhou et al., 2013 [1]    94.34   93.75
D-Casia   Zhou et al., 2014 [2]    95.32   94.69
D-Casia   CRNN [7]                 90.94   89.86
D-Casia   FCRN                     94.52   93.22
D-Casia   FCRN with SLD corpus     96.40   95.34
D-Com     Zhou et al., 2013 [1]    94.62   94.06
D-Com     Zhou et al., 2014 [2]    94.76   94.22
D-Com     VO-3 [3]                 95.03   94.49
D-Com     CRNN [7]                 84.91   83.55
D-Com     FCRN                     89.86   88.28
D-Com     FCRN with SLD corpus     95.00   92.88

We also observed that our FCRN with naive decoding already achieved results comparable to those of Zhou et al. [1] [2] on dataset D-Casia. Furthermore, when decoded with the trigram language model based on the SLD corpus, our system outperformed the other methods, with a CR of 96.40% and an AR of 95.34% on dataset D-Casia. On dataset D-Com, although the result cannot be strictly compared because of the removal of outlier characters, it may be safe to say that our system achieved state-of-the-art performance.

V. CONCLUSION

This paper presented the fully convolutional recurrent network (FCRN), a novel method for handwritten Chinese text recognition. The proposed FCRN is an end-to-end architecture that directly uses online text data during the training process to solve the HCTR problem, completely avoiding the difficulty of segmentation. In the experiments, we discovered that the path signature truncated at level two could effectively capture the pen-tip trajectory of the online text data without significantly increasing the computation during the training process. At the post-processing stage, we presented a refined beam search method that effectively integrates an explicit language model to perform decoding and significantly improve the recognition result. On the test set of CASIA-OLHWDB for online handwritten text recognition, our system outperformed all other methods. On the test set of the ICDAR 2013 Chinese handwriting recognition competition, our system achieved state-of-the-art performance. In the experiments, our system performed much better on dataset D-Casia than on D-Com because of the unbalanced per-character sample distribution of the datasets. Our future work will focus on enriching the training dataset to improve the performance on dataset D-Com using data augmentation approaches such as sample synthesis or sample distortion.

ACKNOWLEDGMENT

This research is supported in part by NSFC (Grant No. 61472144, 61673182), the National Key Research & Development Plan of China (No. 2016YFB1001405), GDSTP (Grant No. 2014A010103012, 2015B010101004, 2015B010130003, 2015B010131004), the Guangdong University Pearl Scholar foundation (GDUPS 2011), and the Fundamental Research Funds for the Central Universities (No. D2157060).

REFERENCES

[1] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, "Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 10, pp. 2413-2426, 2013.
[2] X.-D. Zhou, Y.-M. Zhang, F. Tian, H.-A. Wang, and C.-L. Liu, "Minimum-risk training for semi-Markov conditional random fields with application to handwritten Chinese/Japanese text recognition," Pattern Recognition, vol. 47, no. 5, pp. 1904-1916, 2014.
[3] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese handwriting recognition competition," 12th International Conference on Document Analysis and Recognition (ICDAR), pp. 1464-1470, 2013.
[4] M. Liwicki, A. Graves, H. Bunke, and J. Schmidhuber, "A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks," Proc. 9th Int. Conf. on Document Analysis and Recognition, vol. 1, pp. 367-371, 2007.
[5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855-868, 2009.
[6] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 171-175, 2015.
[7] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," CoRR, vol. abs/1507.05717, 2015.
[8] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang, "Reading scene text in deep convolutional sequences," CoRR, vol. abs/1506.04395, 2015.
[9] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese text recognition by integrating multiple contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1469-1481, 2012.
[10] Y.-C. Wu, F. Yin, and C.-L. Liu, "Evaluation of neural network language models in handwritten Chinese text recognition," 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 166-170, 2015.
[11] K.-T. Chen, "Integration of paths - a faithful representation of paths by noncommutative formal power series," Transactions of the American Mathematical Society, vol. 89, no. 2, pp. 395-407, 1958.
[12] B. Hambly and T. Lyons, "Uniqueness for the signature of a path of bounded variation and the reduced path group," Annals of Mathematics, pp. 109-167, 2010.
[13] T. Lyons and Z. Qian, System Control and Rough Paths, 2002.
[14] T. Lyons, "Rough paths, signatures and the modelling of functions on streams," CoRR, vol. abs/1405.4537, 2014.
[15] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[17] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," CoRR, vol. abs/1312.6026, 2013.
[18] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764-1772, 2014.
[19] A. Graves, Supervised Sequence Labelling. Springer, 2012.
[20] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA online and offline Chinese handwriting databases," 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 37-41, 2011.
[21] C.-L. Liu, F. Yin, Q.-F. Wang, and D.-H. Wang, "ICDAR 2011 Chinese handwriting recognition competition," 2011.
[22] "The People's Daily corpus," http://icl.pku.edu.cn/icl_groups/corpus/dwldform1.asp, [Online] the People's Daily News and Information Center, the Peking University Institute of Computational Linguistics and Fujitsu Research and Development Center Limited. Accessed March 25, 2016.
[23] G. Jin, "The PH corpus," ftp://ftp.cogsci.ed.ac.uk/pub/chinese, [Online] Accessed March 25, 2016.
[24] "Sogou Lab Data," http://www.sogou.com/labs/dl/c.html, [Online] R&D Center of SOHU. Accessed March 25, 2016.
[25] A. Stolcke et al., "SRILM - an extensible language modeling toolkit," INTERSPEECH, vol. 2002, p. 2002, 2002.
[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," Proceedings of the ACM International Conference on Multimedia, pp. 675-678, 2014.
[28] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," Proceedings of the IEEE International Conference on Computer Vision, pp. 4534-4542, 2015.
[29] Y. Bengio, H. Schwenk, J.-S. Senecal, F. Morin, and J.-L. Gauvain, "Neural probabilistic language models," Innovations in Machine Learning, pp. 137-186, 2006.

