
SPEECH RECOGNITION USING NEURAL NETWORK

K.M. Peshan Sampath, P.W.D.C Jayathilake, R. Ramanan, S. Fernando, Suthrjan


Dr. Chatura De Silva

Department of Computer Science and Engineering,


University of Moratuwa, Sri Lanka

Abstract
All speech recognition systems currently on the market are based on statistical techniques. The work presented in this paper is an alternative approach that attempts to recognize speech the way the human senses do, using Neural Networks. Since the recognizing Neural Network must have a fixed number of inputs, the paper addresses the problem of mapping the variable-size feature vector of an isolated word onto a constant size. The system consists of three distinct blocks: a feature extractor, a constant trajectory mapping stage, and a recognizer. The feature extractor uses a standard LPC Cepstrum coder, which converts the incoming speech signal, captured through the DirectSoundCapture COM interface, into the LPC Cepstrum feature space. A SOM Neural Network maps each variable-length LPC trajectory of an isolated word onto a fixed-length trajectory, thereby producing the fixed-length feature vector fed into the recognizer. The recognizer design uses two types of Neural Networks. Different Multi Layer Perceptron structures were tested, with three-, four-, and five-layer architectures (one to three hidden layers), Tanh Sigmoid and Sigmoid transfer functions, and both multiple-output and single-output configurations, for the recognition of feature vectors of isolated words. The performance of a Radial Basis Function Neural Network was also tested for isolated word recognition. The comparison among the different Neural Network structures conducted here gives a better understanding of the problem and its possible solutions. The feature vector was normalized and decorrelated, and a fast training method was implemented by applying pruning techniques to the Neural Network. The training process uses momentum to reach the global minimum of the error surface while avoiding oscillation in local minima. The main contribution of this paper is the use of a Multi Layer Perceptron for isolated word recognition, a completely new idea being implemented.

1 Introduction

Speech is produced when air is forced from the lungs through the vocal cords (glottis) and along the vocal tract. The speech signal can be split into a rapidly varying excitation signal and a slowly varying filter; the envelope of the power spectrum contains the vocal tract information.
Analytically, produced speech is the time-domain convolution of two waves: one generated by the formant structure of the vocal tract, and the excitation of the vocal tract, which is called the pitch of the sound. To recognize the word being uttered, we need to focus only on the shape of the vocal tract, which is unique for each word uttered. To recognize the speaker, on the other hand, the focus has to be on the excitation of the vocal tract.
Figure 1: The source-filter model of speech production (white noise or an impulse train at F0 exciting the vocal tract filter)

2 Speech Analysis

A better understanding of speech production is gained by analyzing different algorithms for determining the formant distribution and the pitch contour, and by studying how the vocal tract excitation and the wave due to the shape of the vocal tract are combined in different domains such as time, frequency, and quefrency. It is also necessary to know the different coding methods used for speech representation.

2.1 Formant analysis

Formant analysis helps to identify the word being uttered, since it is heavily based on the resonances of the vocal tract and the shape the vocal tract takes. A peak-picking algorithm on the LP spectrum was implemented in Matlab.

2.2 Pitch analysis

Pitch analysis makes it possible to recognize the speaker and the expressive way in which the speaker is speaking. The SIFT and AMDF algorithms were implemented in Matlab.

Figure 2: Pitch and formant (formant 1, formant 2) distribution


2.3 Frequency domain analysis

Frequency domain analysis is done in order to extract the information present in the frequency domain. The well-known FFT and IFFT algorithms were implemented in C++.

Figure 3: Plots in the frequency domain: (a) spectrum, (b) spectrogram

2.4 Cepstrum analysis

Cepstrum analysis works in the quefrency domain, where the two waves that were convolved in the time domain become combined in a linearly separable manner; liftering was applied in order to separate them. All of this analysis was done in Matlab.

2.5 Linear Predictive coding analysis

Linear predictive coding analysis is another representation of the uttered word, in terms of vocal tract coefficients. A set of routines (autocorrelation, Durbin's recursion) was implemented in C++ to calculate the LPC coefficients.

2.6 Different coding schemes

Coding methods are necessary to represent the speech wave in a steady-state manner. Some coding methods widely used for speech representation are listed here:

LPC Analysis
LPCC coefficients
Cepstrum analysis
Mel Cepstral Coefficients
Rasta Processing
Rasta PLP Coefficients

2.7 Framing and windowing


Since speech is dynamic, the coding of speech cannot be done over the entire utterance at once. The tactic for tackling this problem is to calculate the speech coefficients on small frames that overlap by 2/3 of the frame size. Windowing is applied in order to smooth the edge effects that arise from framing; a sketch of this step follows the figure below.

Figure 4: Overlapping of frames
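As an illustration of the framing and windowing step, the sketch below splits a sampled signal into frames that overlap by 2/3 of the frame size and applies a Hamming window to each frame. The frame length of 330 samples and the hop of 110 samples are assumptions borrowed from the parameters given later in Section 5.1.3 (N = 330, frames roughly 21 ms long and 7 ms apart at 16 kHz).

```cpp
#include <cmath>
#include <vector>

// Split a signal into overlapping frames and apply a Hamming window to each.
// frameLen = 330 and hop = 110 (2/3 overlap) follow the parameters reported
// later in the paper and should be treated as assumptions.
std::vector<std::vector<double>> frameAndWindow(const std::vector<double>& signal,
                                                std::size_t frameLen = 330,
                                                std::size_t hop = 110) {
    std::vector<std::vector<double>> frames;
    if (signal.size() < frameLen || hop == 0) return frames;

    // Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)).
    const double pi = 3.14159265358979323846;
    std::vector<double> window(frameLen);
    for (std::size_t n = 0; n < frameLen; ++n)
        window[n] = 0.54 - 0.46 * std::cos(2.0 * pi * n / (frameLen - 1));

    for (std::size_t start = 0; start + frameLen <= signal.size(); start += hop) {
        std::vector<double> frame(frameLen);
        for (std::size_t n = 0; n < frameLen; ++n)
            frame[n] = signal[start + n] * window[n];   // smooth the frame edges
        frames.push_back(frame);
    }
    return frames;
}
```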

3 Speech Recognition and Neural Networks

The use of Neural Networks for speech recognition is less mature than the existing template methods, such as Dynamic Time Warping, and statistical modeling methods, such as the Hidden Markov Model. Considering the history of speech recognition attempts, the Sphinx-3.2 system (1999) uses HMMs for continuous speech recognition.

3.1 Biological Neural Networks and speech recognition

In human speech recognition, the ear picks up the sound and converts it into an electrical representation. That signal then travels to the brain through the spiral structure of the cochlea and propagates through billions of biological neurons that stimulate each other along previously created paths, so that if a previously learned path exists, the word is recognized.
3.2 Artificial Neural Networks and speech recognition
In computer speech recognition, the computer reads the signal from the sound card and converts it into a discrete representation limited by the sampling rate and the number of bits per sample. In the same fashion as the biological neural system, the artificial network has different paths created for previously trained words, and so it recognizes the word. When a new word is presented, the network reports the word whose path the signal propagates along, together with a probability.

Figure 5: Artificial Neuron

In an artificial neuron, numerical values are used as inputs to the "dendrites." Each input is multiplied by a value called a weight, which simulates the response of a real dendrite. All the results from the "dendrites" are summed and thresholded in the "soma." Finally, the thresholded result is sent to the "dendrites" of other neurons through an "axon." This sequence of events can be expressed in mathematical terms as
y = f\left( \sum_{i=1}^{n} w_i x_i \right)    (1)
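As a minimal C++ sketch of Equation (1), the function below computes the weighted sum of the inputs and passes it through an activation; the logistic sigmoid used for f here is only an illustrative choice, not something specified at this point in the paper.

```cpp
#include <cmath>
#include <vector>

// One artificial neuron (Equation 1): weighted sum of the inputs followed by
// an activation f. A logistic sigmoid is used here purely as an example of f.
double neuronOutput(const std::vector<double>& x, const std::vector<double>& w) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size() && i < w.size(); ++i)
        sum += w[i] * x[i];                  // contribution of each "dendrite"
    return 1.0 / (1.0 + std::exp(-sum));     // "soma" activation sent down the "axon"
}
```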

3.3 Multi Layer Perceptron (MLP)

This is how the biological neuron system is represented by an artificial neuron system. The network has several layers: an input layer, hidden layers, and an output layer. Each layer consists of several neurons, each an individual processing unit. Adjacent layers are connected by weights, just as in the biological system there is a loss of signal while propagating from the dendrite of one neuron to another. The learning algorithm here is back propagation. The learning strategy of this type of neural network is called supervised learning, since the network is told what to learn; it is up to the network to work out how to learn it.

Input Hidden Output


layer layer layer
Figure 6: Multi Layer Perceptron

3.4 Self Organizing Map (SOM)

This neural network mainly transforms an n-dimensional input vector space into a discretized m-dimensional space while preserving the topology of the input data. Its structure has two layers, an input space and an output space. The training procedure is unsupervised; it is called competitive learning and is summarized as "winner takes all." Compared to a biological neural network, this is a purely statistical approach.

Figure 7: Self Organizing Map

3.5 Radial Basis Function (RBF)

This neural network is a very powerful pattern classifier, considered capable of separating any patterns by constructing hyperplanes among the different classes. In an RBF network, initialization of the centers takes place in an unsupervised manner by looking at the data; the method used here is a modified k-means algorithm. Once the centers are spread over the data set, the network is trained in a supervised manner to mimic the human brain, much like back propagation; this is known as the extended back propagation variation of the LMS algorithm. So, while trying to mimic the human brain, it also uses a statistical approach in the initialization process.
4 Algorithms Enhanced

This section describes the various algorithms found in earlier attempts reported in research papers, and the modifications made to them in order to arrive at algorithms that suit the problem at hand. All the algorithms described here were implemented in C++ and heavily tested, leading to the expected results.

4.1 Enhanced Back Propagation algorithm

For training the Multi Layer Perceptron, the well-known back propagation algorithm was used. To avoid oscillation at local minima, a momentum constant was applied so that training moves toward the global minimum of the error surface, leading to the best solution. The back propagation algorithm was also extended to handle multiple outputs.
1) Initialization
The weights of each layer are initialized to random numbers.
2) Forward computation
The induced local field v_j^{(l)}(n) for neuron j in layer l is

v_j^{(l)}(n) = \sum_{i=0}^{m_0} w_{ji}^{(l)}(n) \, y_i^{(l-1)}(n)    (2)

The output signal of neuron j in layer l is

y_j^{(l)}(n) = \psi_j\big(v_j^{(l)}(n)\big)    (3)

If neuron j is in the first hidden layer,

y_j^{(0)}(n) = x_j(n)    (4)

If neuron j is in the output layer and L is the depth of the network,

y_j^{(L)}(n) = o_j(n)    (5)

The error signal is

e_j(n) = d_j(n) - o_j(n)    (6)

where d_j(n) is the desired output for the j-th element.

3) Backward computation
Compute the local gradients (the δ's) of the network as follows:

\delta_j^{(L)}(n) = e_j(n) \, \psi'\big(v_j^{(L)}(n)\big)   for neuron j in output layer L    (7)

\delta_j^{(l)}(n) = \psi'\big(v_j^{(l)}(n)\big) \sum_k \delta_k^{(l+1)}(n) \, w_{kj}^{(l+1)}(n)   for neuron j in hidden layer l

The weights are updated according to the following formula:

w_{ji}^{(l)}(n+1) = w_{ji}^{(l)}(n) + \alpha \, \Delta w_{ji}^{(l)}(n-1) + \eta \, \delta_j^{(l)}(n) \, y_i^{(l-1)}(n)    (8)

where η is the learning rate, α is the momentum constant, and \Delta w_{ji}^{(l)}(n-1) is the previous weight change.
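A small C++ sketch of the momentum update of Equation (8) applied to a single layer; the container layout and the names weights, prevDelta, deltas, and prevOutputs are illustrative, not taken from the paper's implementation.

```cpp
#include <vector>

// Momentum weight update for one layer (Equation 8):
// w_ji(n+1) = w_ji(n) + alpha * dw_ji(n-1) + eta * delta_j(n) * y_i(n).
// weights[j][i] connects output i of the previous layer to neuron j;
// prevDelta[j][i] holds the previous weight change dw_ji(n-1).
void updateLayerWeights(std::vector<std::vector<double>>& weights,
                        std::vector<std::vector<double>>& prevDelta,
                        const std::vector<double>& deltas,       // local gradients delta_j
                        const std::vector<double>& prevOutputs,  // y_i of the layer below
                        double eta, double alpha) {
    for (std::size_t j = 0; j < weights.size(); ++j) {
        for (std::size_t i = 0; i < weights[j].size(); ++i) {
            double dw = alpha * prevDelta[j][i] + eta * deltas[j] * prevOutputs[i];
            weights[j][i] += dw;     // apply the update
            prevDelta[j][i] = dw;    // remember it for the momentum term next time
        }
    }
}
```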
4.2 Enhanced LMS algorithm

For training the Radial Basis Function network, the LMS algorithm was used earlier. To suit our problem, which has multiple outputs, the algorithm was enhanced and extended to handle multiple outputs. The algorithm used earlier also applies only to networks with no hidden layer, so the learning procedure was made a combination of LMS and back propagation: the clusters are located with LMS, while the weights are updated using back propagation.

The adaptation algorithm for the linear weights and for the positions and spreads of the centers of the RBF network is as follows.

1) Linear weights (output layer)

\frac{\partial \varepsilon(n)}{\partial w_i(n)} = \sum_{j=1}^{N} e_j(n) \, G\big(\lVert x_j - t_i(n) \rVert_{C_i}\big)    (9)

w_i(n+1) = w_i(n) - \eta_1 \frac{\partial \varepsilon(n)}{\partial w_i(n)}, \quad i = 1, 2, \ldots, m_1    (10)

2) Positions of centers

\frac{\partial \varepsilon(n)}{\partial t_i(n)} = 2 w_i(n) \sum_{j=1}^{N} e_j(n) \, G'\big(\lVert x_j - t_i(n) \rVert_{C_i}\big) \, \Sigma_i^{-1} \big[ x_j - t_i(n) \big]    (11)

t_i(n+1) = t_i(n) - \eta_2 \frac{\partial \varepsilon(n)}{\partial t_i(n)}, \quad i = 1, 2, \ldots, m_1    (12)

3) Spreads of centers

\frac{\partial \varepsilon(n)}{\partial \Sigma_i^{-1}(n)} = -w_i(n) \sum_{j=1}^{N} e_j(n) \, G'\big(\lVert x_j - t_i(n) \rVert_{C_i}\big) \, Q_{ji}(n)    (13)

Q_{ji}(n) = \big[ x_j - t_i(n) \big] \big[ x_j - t_i(n) \big]^T    (14)

\Sigma_i^{-1}(n+1) = \Sigma_i^{-1}(n) - \eta_3 \frac{\partial \varepsilon(n)}{\partial \Sigma_i^{-1}(n)}    (15)

4.3 Clustering algorithms

Clustering determines how related data are categorized into different classes. The term Code Book refers to the whole data set (in set terms, the universal set); the code book consists of a set of Code Words, each of which represents a different category. Clustering is mostly done in order to reduce the data set by removing repeated data.

4.3.1 K-means clustering algorithm


This is a time-independent data clustering method. The K-means clustering algorithm (Duda and Hart, 1973) was implemented in the RBF neural network to pre-initialize the data set into the code words of the RBF code book.
1) Initialization
Random values are chosen for the centers t_k(0).
2) Sampling
A sample vector x is drawn from the input space and presented to the algorithm at iteration n.
3) Similarity matching
Let k(x) denote the index of the best-matching (winning) center for input vector x:

k(x) = \arg\min_k \lVert x(n) - t_k(n) \rVert, \quad k = 1, 2, \ldots, m_1    (16)

where t_k(n) is the center of the k-th radial basis function at iteration n.
4) Updating
Adjust the centers of the radial basis functions using the rule

t_k(n+1) = t_k(n) + \eta \big[ x(n) - t_k(n) \big]   if k = k(x)
t_k(n+1) = t_k(n)   otherwise    (17)

where η is the learning rate and 0 < η < 1.
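A minimal C++ sketch of one iteration of this procedure (Equations 16 and 17); the number of centers and the learning rate passed in are illustrative choices rather than values reported in the paper.

```cpp
#include <vector>

using Vec = std::vector<double>;

// Squared Euclidean distance between two vectors of equal length.
static double dist2(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// One online K-means step: find the winning center for sample x (Eq. 16)
// and move only that center toward x by a fraction eta (Eq. 17).
void kmeansStep(std::vector<Vec>& centers, const Vec& x, double eta) {
    if (centers.empty()) return;
    std::size_t winner = 0;
    double best = dist2(x, centers[0]);
    for (std::size_t k = 1; k < centers.size(); ++k) {
        double d = dist2(x, centers[k]);
        if (d < best) { best = d; winner = k; }   // similarity matching
    }
    for (std::size_t i = 0; i < x.size(); ++i)    // updating rule
        centers[winner][i] += eta * (x[i] - centers[winner][i]);
}
```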

4.3.2 SOM neural network as clustering algorithm

A modified version of the K-means clustering algorithm was implemented in order to cluster the data without changing the time trajectories. The algorithm is as follows:
1) Initialization
Random values are chosen for the initial weight vectors w_j(0).
2) Sampling
A vector x is drawn from the input space with a certain probability; it represents the activation pattern applied to the lattice.
3) Similarity matching
The best-matching (winning) neuron i(x) at step n is found as

i(x) = \arg\min_j \lVert x(n) - w_j \rVert, \quad j = 1, 2, \ldots, l    (18)

4) Updating
The weights are updated with the formula

w_j(n+1) = w_j(n) + \eta(n) \, h_{j,i(x)}(n) \big[ x(n) - w_j(n) \big]    (19)

where η(n) is the learning rate and h_{j,i(x)}(n) is the neighborhood function around the winning neuron i(x) at the n-th iteration.

5 Implementation of the Neural Speech Recognizer

The whole system was implemented in C++. The following diagram shows an abstract view of the system modules; the algorithms implemented in each module are described afterwards.
Figure 8: Speech recognition system (speech signal → sound recording → FIR filtering → framing → windowing → LPC analysis → Cepstrum analysis → SOM → recognizer → text)

5.1 Design of Feature Extractor

In a speech recognition problem, the feature extraction (FE) block has to process the incoming signal, i.e. the speech signal, such that its output eases the work of the classification stage. It consists of the following modules.

5.1.1 Sound Recording

Capturing the sound to be recognized was implemented via DirectSoundCapture, because DirectSound has major features such as low latency and hardware acceleration compared to other mechanisms such as OCX/ActiveX. DirectSound accesses the hardware through the DirectSound Hardware Abstraction Layer, which is an interface to the sound hardware. The sound was recorded with a sampling rate of 16000 Hz and 16 bits per sample.
The capture buffer had to be handled carefully, because the locked region sometimes has to be accessed through two pointers.
Figure 9: DirectSoundCapture buffer: (a) locked data region handled with a single pointer, (b) locked data region handled with two pointers

5.1.2 Pre-emphasis Filter

A pre-emphasis filter was applied to the digitized speech to spectrally flatten the signal and to diminish the effects of finite numerical precision in further calculations.
The transfer function of the pre-emphasis filter corresponds to a first-order FIR filter and is defined as

H(\omega) = 1 - a e^{-j\omega}    (20)

The value of a was obtained experimentally and is equal to 0.85. The time-domain coefficients of the filter were calculated and applied to the time-domain waveform.
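In the time domain, the filter of Equation (20) amounts to y[n] = x[n] − a·x[n−1]; a minimal C++ sketch using the reported value a = 0.85:

```cpp
#include <vector>

// First-order FIR pre-emphasis (Equation 20): y[n] = x[n] - a * x[n-1].
// a = 0.85 is the experimentally chosen value reported in the paper.
std::vector<double> preEmphasize(const std::vector<double>& x, double a = 0.85) {
    std::vector<double> y(x.size());
    double prev = 0.0;                 // x[-1] is taken as zero
    for (std::size_t n = 0; n < x.size(); ++n) {
        y[n] = x[n] - a * prev;        // spectrally flattens the signal
        prev = x[n];
    }
    return y;
}
```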

5.1.3 Speech coding

After the captured speech signal was sampled, the utterance isolated, and the spectrum flattened, each signal was divided into a sequence of frames, each frame 21 ms long and 7 ms apart. Each frame was then multiplied by a Hamming window in order to remove leakage effects and to smooth the edges:

w(n) = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N-1} \right), \quad n \in [0, N-1], \; N = 330    (21)

A vector of 12 Linear Predictive Coding Cepstrum coefficients was calculated from each data block using Durbin's method and the recursive expressions developed by Furui. The procedure is as follows:
1. The autocorrelation coefficients were obtained as

r(m) = \sum_{n=0}^{N-1-m} x(n) \, x(n+m), \quad m = 0, 1, \ldots, 12, \; N = 330    (22)

where x(i) is the speech sample located at the i-th position of the frame.
2. The LPC values were then obtained using the following recursion:

E(0) = r(0)

k(l) = \frac{1}{E(l-1)} \left( r(l) - \sum_{i=1}^{l-1} a_i^{(l-1)} r(l-i) \right)

a_l^{(l)} = k(l), \qquad a_m^{(l)} = a_m^{(l-1)} - k(l) \, a_{l-m}^{(l-1)}, \quad m \in [1, l-1]

E(l) = E(l-1) \big( 1 - k(l)^2 \big)    (23)

for l = 1, 2, \ldots, p (the sum is empty when l = 1), where a_n = a_n^{(p)} is the n-th LPC coefficient and p = 12.

3. The LPC Cepstrum was then obtained from

c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \quad m \in [1, p]

c_m = \sum_{k=1}^{m-1} \frac{k}{m} \, c_k \, a_{m-k}, \quad m > p    (24)

where m = 12 was used in order to obtain 12 LPC Cepstrum coefficients per frame.
Since the sample rate is 16000 Hz, the frame size is 21 ms, and each frame is 7 ms apart, an LPC Cepstrum feature vector of dimension 12 was calculated for each frame.
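The sketch below puts steps 1–3 together for one windowed frame in C++. It follows the textbook form of the autocorrelation method, Durbin's recursion, and the LPC-to-cepstrum recursion of Equations (22)–(24), with p = 12, rather than the paper's exact implementation.

```cpp
#include <vector>

// LPC Cepstrum coefficients for one windowed frame (Equations 22-24), p = 12.
// Returns c_1..c_p, or an empty vector for a degenerate (all-zero) frame.
std::vector<double> lpcCepstrum(const std::vector<double>& frame, int p = 12) {
    const int N = static_cast<int>(frame.size());

    // 1. Autocorrelation r(0)..r(p)  (Eq. 22)
    std::vector<double> r(p + 1, 0.0);
    for (int m = 0; m <= p; ++m)
        for (int n = 0; n + m < N; ++n)
            r[m] += frame[n] * frame[n + m];
    if (r[0] == 0.0) return {};

    // 2. Durbin's recursion for the LPC coefficients a_1..a_p  (Eq. 23)
    std::vector<double> a(p + 1, 0.0), prev(p + 1, 0.0);
    double E = r[0];
    for (int l = 1; l <= p; ++l) {
        double k = r[l];
        for (int i = 1; i < l; ++i) k -= prev[i] * r[l - i];
        k /= E;
        a[l] = k;
        for (int m = 1; m < l; ++m) a[m] = prev[m] - k * prev[l - m];
        E *= (1.0 - k * k);
        prev = a;
    }

    // 3. LPC Cepstrum c_1..c_p  (Eq. 24, case m <= p)
    std::vector<double> c(p + 1, 0.0);
    for (int m = 1; m <= p; ++m) {
        c[m] = a[m];
        for (int k = 1; k < m; ++k)
            c[m] += (static_cast<double>(k) / m) * c[k] * a[m - k];
    }
    return std::vector<double>(c.begin() + 1, c.end());   // drop the unused c_0 slot
}
```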

5.2 Design of constant trajectory mapping module

Using a Self Organizing Map, each variable-length LPC trajectory is mapped onto a constant trajectory of 6 clusters while preserving the structure of the input space. The implemented algorithm consists of three parts.

5.2.1 Competitive process

Let x be the m-dimensional input vector,

x = [x_1, x_2, \ldots, x_m]^T

and let w_j be the synaptic weight vector of neuron j,

w_j = [w_{j1}, w_{j2}, \ldots, w_{jm}]^T, \quad j = 1, 2, \ldots, l

The index of the best-matching neuron i(x) is

i(x) = \arg\min_j \lVert x - w_j \rVert, \quad j = 1, 2, \ldots, l    (25)

where l is the total number of neurons in the network.

5.2.2 Cooperative process


The lateral distance d_{j,i}^2 between the excited neuron j and the winning neuron i is

d_{j,i}^2 = \lVert r_j - r_i \rVert^2    (26)

where r_j is the position of neuron j and r_i is the position of neuron i.
The width σ of the topological neighborhood shrinks with time as

\sigma(n) = \sigma_0 \exp\!\left( -\frac{n}{\tau_1} \right), \quad n = 0, 1, 2, \ldots    (27)

The topological neighborhood h_{j,i(x)}(n) varies as

h_{j,i(x)}(n) = \exp\!\left( -\frac{d_{j,i}^2}{2\sigma^2(n)} \right), \quad n = 0, 1, 2, \ldots    (28)
5.2.3 Adaptation process

The learning rate changes as

\eta(n) = \eta_0 \exp\!\left( -\frac{n}{\tau_2} \right), \quad n = 0, 1, 2, \ldots    (29)

The weights are adapted as

w_j(n+1) = w_j(n) + \eta(n) \, h_{j,i(x)}(n) \big( x - w_j(n) \big)    (30)

where \eta_0 = 0.1, \tau_2 = 1000, and \tau_1 = 1000 / \log \sigma_0.
The number of input frames varies from utterance to utterance, while the map always has 6 clustered centers. All centers are initialized to random weights, and the variable-length trajectory of each LPC coefficient then arranges itself into a sequence of six centers that preserves the shape of the trajectory in time.
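A compact C++ sketch of a one-dimensional SOM with 6 units applied to the trajectory of a single LPC coefficient, following Equations (25)–(30). The values eta0 = 0.1, tau2 = 1000, and tau1 = 1000/log(sigma0) come from the paper; the choice of sigma0 and the number of training iterations are illustrative assumptions.

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Map a variable-length trajectory of one LPC coefficient onto a fixed
// trajectory of 6 SOM units (Equations 25-30).
std::vector<double> somReduceTrajectory(const std::vector<double>& traj,
                                        int units = 6, int iterations = 1000) {
    std::vector<double> w(units);
    if (traj.empty()) return w;
    for (int j = 0; j < units; ++j)                        // random initialization
        w[j] = traj[std::rand() % traj.size()];

    const double eta0 = 0.1, tau2 = 1000.0;                // values from the paper
    const double sigma0 = units / 2.0;                     // assumed initial width
    const double tau1 = 1000.0 / std::log(sigma0);

    for (int n = 0; n < iterations; ++n) {
        double x = traj[std::rand() % traj.size()];        // sample the trajectory
        int win = 0;                                       // competitive process (25)
        for (int j = 1; j < units; ++j)
            if (std::fabs(x - w[j]) < std::fabs(x - w[win])) win = j;

        double sigma = sigma0 * std::exp(-n / tau1);       // cooperative process (27)
        double eta   = eta0  * std::exp(-n / tau2);        // adaptation (29)
        for (int j = 0; j < units; ++j) {
            double d2 = static_cast<double>((j - win) * (j - win));   // lattice distance (26)
            double h  = std::exp(-d2 / (2.0 * sigma * sigma));        // neighborhood (28)
            w[j] += eta * h * (x - w[j]);                  // weight update (30)
        }
    }
    return w;   // fixed-length (6-point) representation of the trajectory
}
```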

5.3 Design of Recognizer

The recognizer was designed to recognize the 10 digits; the input to the recognizer for each digit is a feature vector of 72 features.
5.3.1 Multi Layer Perceptron Approach

A new layered approach was used in designing the Neural Network, so that the structure of the neural node is independent of both the training process and the recognition process. The network consists of 72 input nodes, a variable number of hidden nodes, and 10 output nodes.
It can switch its transfer function between the Sigmoid 1/(1 + exp(−A·value)) and the Tanh Sigmoid a·tanh(b·value), with A = 1, a = 1.7159, and b = 0.6667.
It was designed with the enhanced back propagation algorithm of Section 4.1, extended to deal with this problem. Sequential training was used, and the network state is stored internally every time the weights are adjusted, in order to avoid inconsistent weights diverging to infinity. The training process was automated so that both the test set and the training set are presented to the layered network, and training stops when the test set satisfies a condition checked at the end of each epoch of the training set. At the end of training, the whole state of the layered network is stored so that it can be restored for recognition.

In addition to the multiple-output design, another network was designed that is similar to the one above except that it has only one output; the outputs for the different digits were assigned to sub-ranges of the overall range of the transfer function used.

The two networks above were then extended to two and three hidden layers by extending both the layered structure and the enhanced back propagation algorithm. In the end, MLP networks with one, two, and three hidden layers were designed.

For each digit there are 60 training examples, and they were presented to the neural network interleaved across the ten digits (i.e. in a 0, 1, 2, ... sequence). The test set consists of 10 samples for each digit.

The heuristic techniques applied to the network are as follows (a sketch of the normalization step appears after this list):

1) The learning rate is reduced with the epoch number:

\eta(\text{epochNumber}) = \eta_0 \exp\!\left( -\frac{\text{epochNumber}}{100} \right)    (31)

2) Momentum is used to avoid oscillation at local minima.
3) The input vector is normalized: mean removal, followed by decorrelation and covariance equalization (Figure 10).

Figure 10: Normalization of input data (mean removal → decorrelation → covariance equalization)
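A minimal C++ sketch of the first stages of this preprocessing, assuming per-dimension mean removal and variance equalization computed over the training set; the decorrelation step (rotating the data onto the eigenvectors of its covariance matrix) is not shown.

```cpp
#include <cmath>
#include <vector>

// Per-dimension mean removal and variance equalization over a set of feature
// vectors (the first stages of Figure 10). Decorrelation would additionally
// require an eigen-decomposition of the covariance matrix (omitted here).
void normalizeDataset(std::vector<std::vector<double>>& data) {
    if (data.empty()) return;
    const std::size_t dim = data[0].size();

    for (std::size_t d = 0; d < dim; ++d) {
        double mean = 0.0;
        for (const auto& v : data) mean += v[d];
        mean /= data.size();

        double var = 0.0;
        for (const auto& v : data) var += (v[d] - mean) * (v[d] - mean);
        var /= data.size();
        double sd = std::sqrt(var);
        if (sd == 0.0) sd = 1.0;             // guard against a constant feature

        for (auto& v : data) v[d] = (v[d] - mean) / sd;   // zero mean, unit variance
    }
}
```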

5.3.2 Radial Basis Function Approach

An RBF network was implemented consisting of 72 input nodes, 20 hidden nodes (20 clusters), and 10 output nodes. The input vectors were prepared from the SOM trajectories, and the training examples were presented in the same manner as for the MLP design. Training of the RBF network has two phases:

1) Initialization of the centers, implemented over the whole training set in the same way as the K-means clustering algorithm explained in Section 4.3.1.

2) Learning and adaptation of the weights, centers, and covariance matrices, implemented with the enhanced LMS algorithm of Section 4.2, extended to multiple outputs.

Heuristic techniques such as normalization of the input vectors and adjustment of the learning rate, as above, were also applied.
6 Results of Research

The experiments were carried out on an IBM machine with a 2.5 GHz processor and 250 MB of memory.

6.1 Success

Capturing speech from the sound card via DirectX succeeded. The graph below shows the recorded sound for "one" pronounced in Sinhala ("eka").

Figure 11: Time domain representation of captured “eka” sound

The feature extractor module also gives 100% correct results, as it implements the well-known Durbin recursion. The variation of the LPC coefficients with time is shown in Figure 12.
Figure 12: Variation of the LPC coefficient trajectories (LPC 1–12, amplitude vs. frame number) for the sound "eka"
The attempt to make the variable-length feature vector a constant size while preserving the input feature space succeeded: the length of each LPC trajectory was made constant while preserving the input features over time.
Figure 13: LPC coefficient trajectories (LPC 1–12) after applying the SOM algorithm to obtain constant-length trajectories for the sound "eka". For each coefficient, the left panel shows the reduced trajectory of length 6 over the clustered centers, and the right panel shows the SOM trajectory, i.e. the spread of the centers at convergence.
One tool used earlier for pattern recognition was the Matlab Neural Network toolbox. Since it has internal limitations, such as its speed and its lack of control when going for more powerful classification of patterns, it was decided to use our own Neural Network implemented in C++, which performs better in terms of speed.

A design was therefore developed for implementing the neural network in which the structure, the training procedure, and the transfer function are cleanly separated. The first design was the MLP with one hidden layer; it was then extended to two and three hidden layers, together with the extended back propagation algorithm described earlier. All three classify separable patterns with 100% accuracy. The parameters that gave optimum accuracy for recognizing the uttered digit are presented below.
The results obtained with various neural network structures are presented here. Experimenting with the internal parameters of the Multi Layer Perceptron (learning rate and momentum constant), it was found that

learning rate = 0.1
momentum constant = 0.9

The optimum values for the sigmoid transfer function 1/(1 + exp(−a·value)) are

a = 1
maximum probability = 0.9
minimum probability = 0.1

The optimum values for the transfer function a·tanh(b·value) are

a = 1.7159
b = 0.6667
minimum value = −1
maximum value = 1

The recognition accuracy of the different Multi Layer Perceptron networks is presented below. In all cases, 60 training examples per digit were used (600 training examples altogether), and each network was trained for 10000 epochs of the training set with sequential training.

Config1: MLP with one hidden layer; transfer function 1/(1 + exp(−value)); nodes 72–15–10
Config2: MLP with one hidden layer; transfer function 1/(1 + exp(−value)); nodes 72–120–10
Config3: MLP with two hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–60–25–10
Config4: MLP with two hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–100–50–10
Config5: MLP with three hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–200–100–50–10
Config6: MLP with three hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–200–40–20–10

Digit    Config1    Config2    Config3    Config4    Config5    Config6
         (recognition accuracy, %)

0 71 98 71 98 90 89
1 99 8 15 73 13 96
2 96 8 31 6 20 90
3 84 32 92 82 74 75
4 86 17 3 7 28 89
5 92 2 5 6 95 90
6 95 14 4 23 10 91
7 0 3 1 16 78 22
8 1 32 98 85 52 90
9 87 35 29 84 30 90
Table 1: Recognition accuracy (%) over 100 utterances of each digit for each configuration

Figure 14: Learning curve (aggregate error vs. epoch number) for the 3-layer MLP (config1)


Figure 15: Learning curve (aggregate error vs. epoch number) for the 5-layer MLP (config6)

The results of identifying zero ("binduwa") with different configurations are presented below; here the network classifies whether the uttered sound is zero or not zero.

Config7: MLP with one hidden layer; transfer function 1/(1 + exp(−value)); nodes 72–5–2
Config8: MLP with one hidden layer; transfer function 1/(1 + exp(−value)); nodes 72–32–2
Config9: MLP with two hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–30–5–2
Config10: MLP with two hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–100–30–2
Config11: MLP with three hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–35–20–5–2
Config12: MLP with three hidden layers; transfer function 1/(1 + exp(−value)); nodes 72–100–50–20–2
Digit            Config7    Config8    Config9    Config10    Config11    Config12
                 (classification accuracy, %)
0 (being 0) 93 99 89 98 97 65
1 (not being 0) 21 92 78 83 83 91
2 (not being 0) 16 61 22 69 80 90
3 (not being 0) 12 55 77 42 79 94
4 (not being 0) 18 21 61 32 96 83
5 (not being 0) 93 71 83 76 95 74
6 (not being 0) 94 68 11 65 10 83
7 (not being 0) 81 65 89 40 94 82
8 (not being 0) 96 87 89 96 96 85
9 (not being 0) 75 65 57 37 12 88
Table 2: Accuracy (%) of classifying each digit as being 0 or not being 0, over 100 utterances of each digit, for each configuration

Below are the learning curves of the networks while training with the wave data. They show the convergence, i.e. the progressively better separation of the hyperplanes for the relevant data sets. Each point is the aggregate error at the end of a whole epoch of the training set.

Figure 16: Learning curve for the 5-layer MLP (config11)

It is not always possible to obtain a learning curve that minimizes the error; it is also possible to obtain a learning curve in which the error grows. Figure 17 shows a Multi Layer Perceptron for which this happened.
Figure 17: Learning curve (error vs. epoch number) for the 3-layer MLP (72, 5, 10), showing the error increasing with the epoch number

6.2 Unsuccessful Attempts

An MLP using the Tanh Sigmoid with a single output was tried, with the output range [−1, 1] divided into intervals for the different digits: −0.9 for 0, −0.7 for 1, −0.5 for 2, −0.3 for 3, −0.1 for 4, 0.1 for 5, 0.3 for 6, 0.5 for 7, 0.7 for 8, and 0.9 for 9.
Although different configurations were tried, changing the number of layers and the number of neurons within each layer, this network could not be trained: its weights diverged to infinity every time.

In the case of the RBF neural network, there is a covariance matrix of size n × n, and during training this matrix has to be inverted for every training example. In our problem it is 72 × 72, since the feature vector has 72 elements. A module that calculates the inverse of any matrix using recursion was written, but it takes a long time because it involves a lot of floating-point calculation, so recognition results for the RBF network could not be obtained.

7 Conclusion & Future Work

The attempt to mimic a human being by focusing on the single sense of hearing succeeded, within the limitations of today's digital computers and of things still undiscovered in science. It was therefore possible to use a completely new approach for recognizing isolated words; this paper has presented the recognition accuracy obtained, which is high for individual digits.
Apart from the speech technology, this paper presents different neural network architecture designs for the pattern at hand. It is concluded that neural networks with more hidden layers are able to solve the problem more easily. Comparing the error curves and the digit recognition accuracy, the 5-layer Multi Layer Perceptron is a more general approach than the 3-layer Multi Layer Perceptron.
Speech is an analog waveform, while the processing done by an artificial human being must take place on a digital computer. The conversion from analog to digital representation is limited by the sampling rate and the number of bits per sample, and the limitation seen here is the quantization error that can occur in this process. The next generation of computers, quantum computers, may address this problem remarkably well.
The capture process in the sound card and the environment introduce a lot of noise. The problem faced here is that while trying to reduce one type of noise, another type is introduced, which poses a serious problem for recognition accuracy.
For the RBF neural network, calculating the inverse of the 72 × 72 matrix was limited by the available floating-point computation speed.
Although the use of Neural Networks for speech recognition is not a mature technique compared to the Hidden Markov Model, a new approach to isolated word recognition, comparable to HMM-based approaches, was developed here using a combination of the Multi Layer Perceptron and the Self Organizing Map. The recognition accuracy is good and is acceptable as a basis for moving on to continuous speech recognition. The suggestion for future work on continuous speech recognition is again to use a pattern-separating Neural Network to break the speech into words, feeding it the energy variation pattern together with the zero-crossing pattern.

8 References

[1] C. M. Bishop, "Neural Networks for Pattern Recognition," Oxford University Press,
1995.
[2] L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition," Prentice-Hall,
1993.
[3] N. Negroponte, “Being Digital,” Vintage Books, 1995.
[4] C. R. Jankowski Jr., H. H. Vo, and R. P. Lippmann, “A Comparison of Signal
Processing Front Ends for Automatic Word Recognition,” IEEE Transactions on Speech
and Audio processing, vol. 3, no. 4, July 1995.
[5] J. Tebelskis, “Speech Recognition Using Neural Networks,” PhD Dissertation,
Carnegie Mellon University, 1995.
[6] S. Furui, “Digital Speech Processing, Synthesis and Recognition,” Marcel Dekker
Inc., 1989.
[7] K. Torkkola and M. Kokkonen, “Using the Topology-Preserving Properties of SOMs
in Speech Recognition,” Proceedings of the IEEE ICASSP, 1991.
[8] K-F Lee, H-W Hon, and R. Reddy, “An Overview of the SPHINX Speech
Recognition System,” IEEE Transactions on Acoustic, Speech, and Signal Processing,
vol. 38, no. 1, January 1990.
[9] H. Hasegawa, M. Inazumi, “Speech Recognition by Dynamic Recurrent Neural
Networks,” Proceedings of 1993 International Joint Conference on Neural
Networks.
[10] M. Jamshidi, “Large-Scale Systems: Modelling and Control,” North-Holland, 1983.
[11] P. Zegers, “Reconocimiento de Voz Utilizando Redes Neuronales,” Engineer Thesis,
Pontificia Universidad Católica de Chile, 1992.
[12] M. Woszczyna et al, “JANUS 93: Towards Spontaneous Speech Translation,” IEEE
Proceedings Conference on Neural Networks, 1994.
[13] T. Zeppenfeld and A. Waibel, “A Hybrid Neural Network, Dynamic Programming
Word Spotter,” IEEE Proceedings ICASSP, 1992.
[14] Y. Gong, “Stochastic Trajectory Modeling and Sentence Searching for Continuous
Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 1,
January 1997.
[15] D. Richard, C. Miall, and G. Mitchison, “The Computing Neuron,” Addison-
Wesley, 1989.
[16] S. Haykin, “Neural Networks: A Comprehensive Foundation,” Macmillan College
Publishing Company, 1994.
[17] T. Kohonen, “Self-Organization and Associative Memory,” Springer-Verlag, 1984.
[18] T. Kohonen, “Self-Organizing Maps,” Springer-Verlag, 1995.
[19] T. Kohonen et al, “Engineering Applications of the Self-Organizing Map,”
Proceedings of the IEEE, 1996.
[20] H. Sagan, “Space-Filling Curves,” Springer-Verlag, 1994.
[21] G. Cybenko, "Approximation by Superpositions of a Sigmoidal Function,"
Mathematics of Control, Signals and Systems, vol. 2, 1989.
[22] K. Funahashi, “On the Approximate Realization of Continuous Mappings by Neural
Networks,” Neural Networks, vol. 2, 1989.
[23] K-I Funahashi and Y. Nakamura, “Approximation of Dynamical Systems by
Continuous Time Recurrent Neural Networks,” Neural Networks, vol. 6, 1993.
[24] T. Kohonen, “The Neural Phonetic Typewriter,” Computer, vol. 21, no. 3, 1988.
[25] K. J. Lang and A. H. Waibel, “A Time-Delay Neural network Architecture for
Isolated Word Recognition,” Neural networks, vol. 3, 1990.
[26] E. Singer and R. P. Lippmann, “A Speech Recognizer using Radial Basis Function
Neural Networks in an HMM Framework,” IEEE Proceedings of the ICASSP, 1992.
[27] H. Hild and A. Waibel, “Multi-Speaker/Speaker-Independent Architectures for the
Multi-State Time Delay Neural Network,” IEEE Proceedings of the ICNN, 1993.
[28] R. M. Gray, “Vector Quantization,” IEEE ASSP Magazine, April 1984.
[29] G. Z. Sun et al, “Time Warping Recurrent Neural Networks,” IEEE Proceedings of
the ICNN, 1992.
[30] A. Papoulis, “Probability, Random Variables, and Stochastic Processes,” McGraw-
Hill, 1991.
[31] S. I. Sudharsanan and M. K. Sundareshan, “Supervised Training of Dynamical
Neural Networks for Associative Memory Design and Identification of Nonlinear Maps,”
International Journal of Neural Systems, vol. 5, no. 3, September 1994.
[32] B. A. Pearlmutter, “Gradient Calculations for Dynamic Recurrent Neural Networks,”
IEEE Transactions on Neural Networks, vol. 6, no. 5, September 1995.
[33] M. K. Sundareshan and T. A. Condarcure, “Recurrent Neural-Network Training by a
learning Automaton Approach for Trajectory Learning and Control SystemDesign,”
IEEE Transactions on Neural Networks, vol. 9, no. 3, May 1998.
[34] D. B. Fogel, “An Introduction to Simulated Evolutionary Optimization,” IEEE
Transactions on Neural Networks, vol. 5, no. 1, January 1994.
[35] S. Haykin, "Neural Networks: A Comprehensive Foundation," Pearson Education, 1999.
