h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)   (1)

y_t = W_{hy} h_t + b_y   (2)

where H is the hidden layer function. For the LSTM architecture used here, H is implemented by the following composite function:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)   (3)

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)   (4)

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)   (5)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)   (6)

h_t = o_t tanh(c_t)   (7)

where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all the same size as the hidden vector h.

Fig. 1. Long Short-term Memory cell.
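As a concrete illustration, here is a minimal NumPy sketch of a single LSTM step implementing equations (3)-(7). The peephole weights w_ci, w_cf, w_co are diagonal and so act elementwise on the cell state; all parameter names and the dict-based packing are illustrative, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step, eqs. (3)-(7). p maps names to parameters;
    w_ci, w_cf, w_co are diagonal peephole weights stored as vectors."""
    i = sigmoid(p['W_xi'] @ x_t + p['W_hi'] @ h_prev + p['w_ci'] * c_prev + p['b_i'])  # (3) input gate
    f = sigmoid(p['W_xf'] @ x_t + p['W_hf'] @ h_prev + p['w_cf'] * c_prev + p['b_f'])  # (4) forget gate
    c = f * c_prev + i * np.tanh(p['W_xc'] @ x_t + p['W_hc'] @ h_prev + p['b_c'])      # (5) cell state
    o = sigmoid(p['W_xo'] @ x_t + p['W_ho'] @ h_prev + p['w_co'] * c + p['b_o'])       # (6) output gate peeps at c_t
    h = o * np.tanh(c)                                                                 # (7) hidden vector
    return h, c
```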
A crucial element of the recent success of hybrid HMM-neural network systems is the use of deep architectures, which are able to build up progressively higher-level representations of acoustic data. Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function is used for all N layers in the stack, the hidden vector sequences h^n are iteratively computed from n = 1 to N and t = 1 to T:
h^n_t = H(W_{h^{n-1} h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)   (11)

where we define h^0 = x. The network outputs y_t are

y_t = W_{h^N y} h^N_t + b_y   (12)
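To make the stacked recurrence concrete, the following is a minimal sketch of equations (11)-(12), using tanh in place of the hidden function H for brevity (in the paper H is the LSTM composite function above). All parameter names are illustrative.

```python
import numpy as np

def deep_rnn_forward(x, layers, out):
    """Eqs. (11)-(12). x: T x D input sequence (h^0 = x); layers[n] holds
    W_in (weights from layer n-1), W_rec (recurrent weights) and b."""
    T = x.shape[0]
    h = x                                      # h^0 = x
    for layer in layers:                       # n = 1 .. N
        H = np.zeros((T, layer['b'].shape[0]))
        h_prev = np.zeros_like(layer['b'])
        for t in range(T):                     # t = 1 .. T
            h_prev = np.tanh(layer['W_in'] @ h[t] + layer['W_rec'] @ h_prev + layer['b'])  # (11)
            H[t] = h_prev
        h = H
    return h @ out['W_hy'].T + out['b_y']      # (12): y_t = W_{h^N y} h^N_t + b_y
```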
A bidirectional RNN computes the forward hidden sequence \overrightarrow{h} by iterating from t = 1 to T, the backward hidden sequence \overleftarrow{h} by iterating from t = T to 1, and then updates the output layer:

\overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})   (8)

\overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})   (9)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y   (10)
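A minimal sketch of the bidirectional pass (8)-(10), again with tanh standing in for H and illustrative parameter names:

```python
import numpy as np

def brnn_forward(x, p):
    """Eqs. (8)-(10): the forward layer runs t = 1..T, the backward
    layer t = T..1, and both hidden sequences feed the output layer."""
    T = x.shape[0]
    n_h = p['b_fw'].shape[0]
    h_fw = np.zeros((T, n_h))                  # forward hidden sequence
    h_bw = np.zeros((T, n_h))                  # backward hidden sequence
    prev = np.zeros(n_h)
    for t in range(T):                         # (8)
        prev = np.tanh(p['W_x_fw'] @ x[t] + p['W_fw_fw'] @ prev + p['b_fw'])
        h_fw[t] = prev
    prev = np.zeros(n_h)
    for t in reversed(range(T)):               # (9)
        prev = np.tanh(p['W_x_bw'] @ x[t] + p['W_bw_bw'] @ prev + p['b_bw'])
        h_bw[t] = prev
    return h_fw @ p['W_fw_y'].T + h_bw @ p['W_bw_y'].T + p['b_y']   # (10)
```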
Combining BRNNs with LSTM gives bidirectional LSTM [16],
which can access long-range context in both input directions.
y_t = W_{\overrightarrow{h}^N y} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N y} \overleftarrow{h}^N_t + b_y   (13)

Pr(k|t) = exp(y_t[k]) / \sum_{k'=1}^{K+1} exp(y_t[k'])   (14)
where y_t[k] is the k-th element of the length K+1 unnormalised output vector y_t, and N is the number of bidirectional levels.
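Equations (13)-(14) amount to a linear layer on the topmost forward and backward hidden vectors followed by a softmax over the K phonemes plus blank. A minimal sketch (names illustrative):

```python
import numpy as np

def ctc_output(hf_N, hb_N, W_fy, W_by, b_y):
    """Eqs. (13)-(14). hf_N, hb_N: T x H hidden sequences from the top
    bidirectional level. Returns a T x (K+1) array of Pr(k|t)."""
    y = hf_N @ W_fy.T + hb_N @ W_by.T + b_y    # (13) unnormalised outputs
    y -= y.max(axis=1, keepdims=True)          # stabilise the exponentials
    e = np.exp(y)
    return e / e.sum(axis=1, keepdims=True)    # (14) softmax over K+1 classes
```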
3.2. RNN Transducer
CTC defines a distribution over phoneme sequences that depends only on the acoustic input sequence x. It is therefore
an acoustic-only model. A recent augmentation, known as an
RNN transducer [10], combines a CTC-like network with a
separate RNN that predicts each phoneme given the previous
ones, thereby yielding a jointly trained acoustic and language
model. Joint LM-acoustic training has proved beneficial in
the past for speech recognition [20, 21].
Whereas CTC determines an output distribution at every
input timestep, an RNN transducer determines a separate distribution Pr(k|t, u) for every combination of input timestep t
and output timestep u. As with CTC, each distribution covers the K phonemes plus ∅. Intuitively the network decides what to output depending both on where it is in the
input sequence and the outputs it has already emitted. For a
length U target sequence z, the complete set of T × U decisions
jointly determines a distribution over all possible alignments
between x and z, which can then be integrated out with a
forward-backward algorithm to determine log Pr(z|x) [10].
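The integration over alignments can be carried out with a simple dynamic program. The following is a sketch of the transducer forward algorithm over the T × U lattice, assuming the grid of distributions Pr(k|t, u) has already been computed; the conventions (blank index, final blank emission) follow [10], while the function and argument names are illustrative:

```python
import numpy as np

def transducer_log_prob(probs, z, blank=0):
    """Forward algorithm for log Pr(z|x) [10]. probs[t, u, k] = Pr(k|t, u)
    for t = 0..T-1, u = 0..U; z is the length-U target label sequence.
    Probability space for clarity; real implementations use log space."""
    T, U_plus_1, _ = probs.shape
    U = U_plus_1 - 1
    alpha = np.zeros((T, U + 1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            stay = alpha[t - 1, u] * probs[t - 1, u, blank] if t > 0 else 0.0      # emit blank, advance in t
            emit = alpha[t, u - 1] * probs[t, u - 1, z[u - 1]] if u > 0 else 0.0   # emit label z_u, advance in u
            alpha[t, u] = stay + emit
    return np.log(alpha[T - 1, U] * probs[T - 1, U, blank])  # alignment ends with a blank
```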
In the original formulation Pr(k|t, u) was defined by taking an acoustic distribution Pr(k|t) from the CTC network,
a linguistic distribution Pr(k|u) from the prediction network, then multiplying the two together and renormalising.
An improvement introduced in this paper is to instead feed
the hidden activations of both networks into a separate feedforward output network, whose outputs are then normalised
with a softmax function to yield Pr(k|t, u). This allows a
richer set of possibilities for combining linguistic and acoustic information, and appears to lead to better generalisation.
In particular we have found that the number of deletion errors
encountered during decoding is reduced.
l_t = W_{\overrightarrow{h}^N l} \overrightarrow{h}^N_t + W_{\overleftarrow{h}^N l} \overleftarrow{h}^N_t + b_l   (15)

h_{t,u} = tanh(W_{lh} l_{t,u} + W_{pb} p_u + b_h)   (16)

y_{t,u} = W_{hy} h_{t,u} + b_y   (17)

Pr(k|t,u) = exp(y_{t,u}[k]) / \sum_{k'=1}^{K+1} exp(y_{t,u}[k'])   (18)
where y_{t,u}[k] is the k-th element of the length K+1 unnormalised output vector. For simplicity we constrained all non-output layers to be the same size.
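A minimal sketch of the output network (15)-(18) for a single (t, u) pair, combining the transcription network's top hidden vectors with the prediction network's hidden vector; parameter names are illustrative:

```python
import numpy as np

def transducer_output(hf_t, hb_t, p_u, p):
    """Eqs. (15)-(18): feedforward combination of acoustic (hf_t, hb_t)
    and linguistic (p_u) hidden vectors, followed by a softmax."""
    l = p['W_fl'] @ hf_t + p['W_bl'] @ hb_t + p['b_l']        # (15) linear layer
    h = np.tanh(p['W_lh'] @ l + p['W_ph'] @ p_u + p['b_h'])   # (16) tanh hidden layer
    y = p['W_hy'] @ h + p['b_y']                              # (17) unnormalised outputs
    e = np.exp(y - y.max())                                   # stable softmax
    return e / e.sum()                                        # (18) Pr(k|t, u)
```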
WEIGHTS    EPOCHS    PER
3.7M       107       37.6%
0.8M       82        23.9%
3.8M       87        23.0%
2.3M       55        21.0%
3.8M       115       19.6%
3.8M       124       18.6%
6.8M       150       18.4%
4.3M       112       18.3%
4.3M       144       17.7%
6. REFERENCES
[1] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.

[2] Qifeng Zhu, Barry Chen, Nelson Morgan, and Andreas Stolcke, "Tandem connectionist feature extraction for conversational speech recognition," in International Conference on Machine Learning for Multimodal Interaction, Berlin, Heidelberg, 2005, MLMI'04, pp. 223-231, Springer-Verlag.

[3] A. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14-22, Jan. 2012.

[4] G. Hinton, Li Deng, Dong Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, Nov. 2012.

[5] A. J. Robinson, "An application of recurrent nets to phone probability estimation," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298-305, 1994.

[6] Oriol Vinyals, Suman Ravuri, and Daniel Povey, "Revisiting recurrent neural networks for robust ASR," in ICASSP, 2012.

[7] A. Maas, Q. Le, T. O'Neil, O. Vinyals, P. Nguyen, and A. Ng, "Recurrent neural networks for noise reduction in robust ASR," in INTERSPEECH, 2012.

[8] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML, Pittsburgh, USA, 2006.

[9] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, vol. 385, Springer, 2012.

[10] A. Graves, "Sequence transduction with recurrent neural networks," in ICML Representation Learning Workshop, 2012.

[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[12] A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, "Unconstrained online handwriting recognition with recurrent neural networks," in NIPS, 2008.

[13] Alex Graves and Juergen Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in NIPS, 2009.