are respectively the center frequency and the envelope function of $\tilde{\psi}(t)$. The latter is chosen to be the Morlet wavelet. In this case the envelope function is expressed as follows:

$$\tilde{\psi}(t) = e^{-\left(\frac{t}{T_0}\right)^2} \qquad (4)$$
where $T_0$ is the initial support of the unscaled mother wavelet. Using a time-varying function $T$, the mother function of the BWT is expressed as follows [10]:

$$\psi_T(t) = \frac{1}{\sqrt{T}}\,\tilde{\psi}\!\left(\frac{t}{T}\right) e^{j 2\pi f_0 t} \qquad (5)$$
The BWT of a given signal $x(t)$ is defined as follows [10, 11]:

$$BWT_x(a,\tau) = \frac{1}{\sqrt{a}} \int x(t)\,\psi_T^{*}\!\left(\frac{t-\tau}{a}\right) dt = \frac{1}{\sqrt{aT}} \int x(t)\,\tilde{\psi}^{*}\!\left(\frac{t-\tau}{aT}\right) e^{-j 2\pi f_0 \frac{t-\tau}{a}}\, dt \qquad (6)$$
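The mother wavelet of Eqs. (4)–(6) can be sketched numerically as follows. This is a minimal illustration only: the function names, the default constants, and the direct-integration evaluation of one coefficient are assumptions of this sketch, not the implementation used in our experiments.

```python
import numpy as np

def bwt_mother(t, T, f0=15165.4, T0=0.0005):
    """Scaled BWT mother wavelet of Eq. (5): (1/sqrt(T)) * psi~(t/T) * exp(j*2*pi*f0*t),
    with the Gaussian Morlet envelope psi~(u) = exp(-(u/T0)^2) of Eq. (4)."""
    envelope = np.exp(-(((t / T) / T0) ** 2))
    return (1.0 / np.sqrt(T)) * envelope * np.exp(1j * 2 * np.pi * f0 * t)

def bwt_coefficient(x, t, a, tau, T):
    """One BWT coefficient of Eq. (6), evaluated by direct numerical integration:
    BWT_x(a, tau) = (1/sqrt(a)) * integral of x(t) * conj(psi_T((t - tau)/a)) dt."""
    dt = t[1] - t[0]
    kernel = np.conj(bwt_mother((t - tau) / a, T))
    return (1.0 / np.sqrt(a)) * np.sum(x * kernel) * dt
```

In practice the coefficients are not computed by this brute-force integral; the CWT-based shortcut of Eq. (9) below is used instead.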
Hence, the adaptive nature of the BWT is captured by the time-varying factor $T$. This factor represents the scaling of the cochlear filter bank quality at each scale over time [12]. For the human auditory system, Yao and Zhang [12] have taken $f_0 = 15165.4$ Hz. The discretisation of the scale variable $a$ is accomplished using a pre-defined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is expressed as follows [11]:
$$f_m = \frac{f_0}{1.1623^{m}}, \qquad m = 0, 1, 2, \ldots \qquad (7)$$
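The logarithmic frequency ladder of Eq. (7) is straightforward to compute; a short sketch, taking $f_0 \approx 15165.4$ Hz as in [12] (the function name is ours):

```python
# Logarithmically spaced center frequencies of Eq. (7): f_m = f0 / 1.1623**m.
def center_frequency(m, f0=15165.4):
    """Center frequency (Hz) of the m-th scale of the discretised BWT."""
    return f0 / (1.1623 ** m)

# First few center frequencies of the discretised scale axis, in Hz:
freqs = [center_frequency(m) for m in range(24)]
```

Successive frequencies are separated by a constant ratio of 1.1623, so the ladder covers the audio band with a small number of scales.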
For each time and scale, the adapting function $T(a,\tau)$ is calculated using the following equation [10, 12]:

$$T(a,\tau+\Delta\tau) = \left[1 - \tilde{G}_1 \frac{BWT_s}{BWT_s + \left|BWT_x(a,\tau)\right|}\right]^{-1} \left[1 + \tilde{G}_2 \left|\frac{\partial}{\partial \tau} BWT_x(a,\tau)\right|\right]^{-1} \qquad (8)$$
where $\tilde{G}_1$ designates the active gain factor representing the outer hair cell resistance function, $\tilde{G}_2$ is the active gain factor representing the time-varying compliance of the basilar membrane, $BWT_s$ is a saturation constant, $BWT_x(a,\tau)$ is the BWT at scale $a$ and time $\tau$, and $\Delta\tau$ is the time computation step [11].
By adjusting $\tilde{G}_1$ and $\tilde{G}_2$, the resolutions in the time domain and the frequency domain can be increased respectively [11]. In implementation, BWT coefficients can be easily computed from the corresponding coefficients of the Continuous Wavelet Transform (CWT) by:

$$BWT_x(a,\tau) = K(a,\tau)\,CWT_x(a,\tau) \qquad (9)$$
where $K(a,\tau)$ is a factor that depends on $T(a,\tau)$ [10]. For the Morlet wavelet, whose envelope is $\tilde{\psi}(t) = e^{-(t/T_0)^2}$ and which is also employed as the mother function in our experiments, $K(a,\tau)$ is expressed by:

$$K(a,\tau) = \int_{-\infty}^{+\infty} e^{-\left(1 + \frac{1}{T^2(a,\tau)}\right)\left(\frac{t}{T_0}\right)^2} dt \qquad (10)$$

which is roughly equal to $\dfrac{1.7725\,T_0}{\sqrt{1 + 1/T^{2}(a,\tau)}}$.
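The CWT-to-BWT conversion of Eqs. (9)–(10) reduces to a per-coefficient multiplication. A minimal sketch under the assumption that the Gaussian integral of Eq. (10) is replaced by its closed form (the function names are ours):

```python
import numpy as np

def k_factor(T, T0=0.0005):
    """Closed-form value of the K(a, tau) factor of Eq. (10) for the
    Gaussian-envelope Morlet wavelet:
    K = sqrt(pi) * T0 / sqrt(1 + 1/T^2), with sqrt(pi) ~= 1.7725."""
    return 1.7725 * T0 / np.sqrt(1.0 + 1.0 / T ** 2)

def bwt_from_cwt(cwt_coeff, T, T0=0.0005):
    """Eq. (9): BWT_x(a, tau) = K(a, tau) * CWT_x(a, tau)."""
    return k_factor(T, T0) * cwt_coeff
```

Since $\int e^{-\alpha t^2}\,dt = \sqrt{\pi/\alpha}$, the closed form follows directly from Eq. (10) with $\alpha = (1 + 1/T^2)/T_0^2$, which is why the BWT inherits the cost of a single CWT pass.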
In this paper, we have used the same values as in references [10, 11, 12]: $\tilde{G}_1 = 0.87$, $\tilde{G}_2 = 45$, $BWT_s = 0.8$, $T_0 = 0.0005$, and the computation step $\Delta\tau$ is chosen to be equal to $1/f_s$, where $f_s$ represents the sampling frequency.
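One step of the adaptive update of $T(a,\tau)$ in Eq. (8), with the constants above, can be sketched as follows. This is illustrative only: the saturating form of the first factor and the first-difference approximation of the time derivative are assumptions of this sketch, and $f_s = 16$ kHz matches our recordings.

```python
G1, G2 = 0.87, 45.0       # active gain factors G~1 and G~2
BWT_S = 0.8               # saturation constant BWT_s
F_S = 16000.0             # sampling frequency of the recordings (Hz)
DELTA_TAU = 1.0 / F_S     # computation step, delta_tau = 1/f_s

def update_T(bwt_now, bwt_prev):
    """One step of Eq. (8): the new T(a, tau + delta_tau) from the current BWT
    coefficient magnitude and a first-difference estimate of d/dtau BWT_x(a, tau)."""
    derivative = abs(bwt_now - bwt_prev) / DELTA_TAU
    factor1 = 1.0 / (1.0 - G1 * BWT_S / (BWT_S + abs(bwt_now)))
    factor2 = 1.0 / (1.0 + G2 * derivative)
    return factor1 * factor2
```

Large coefficient magnitudes or fast temporal variation both shrink $T$, which sharpens the time resolution exactly where the signal is active.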
III. CLASSIFIERS
For classification, we have used in this work the Multi Layer Perceptron (MLP), which is the most popular network architecture in use today, due originally to Rumelhart and McClelland (1986). Each unit performs a biased weighted sum of its inputs and passes this activation level through a transfer function to produce its output, and the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) as the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity. Important issues in MLP design include the specification of the number of hidden layers and the number of units in these layers. The number of input and output units is defined by the problem (there may be some uncertainty about precisely which inputs to use [13]; however, for the moment we will assume that the input variables are intuitively selected and are all meaningful). The number of hidden units to use is far from clear. As good a starting point as any is to use one hidden layer with a varying number of units: in this work, we vary the number of hidden units from 20 to 200. Figure 2 illustrates the used MLP architecture.
For training this MLP, we have used the backpropagation algorithm.
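A one-hidden-layer perceptron trained by plain backpropagation can be sketched as below. This is a didactic stand-in, not our actual implementation: the class name, the sigmoid/softmax choice, and the learning rate are assumptions of this sketch; the hidden size would be varied from 20 to 200 as described above.

```python
import numpy as np

class TinyMLP:
    """One-hidden-layer perceptron trained with plain backpropagation (sketch)."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)
        self.lr = lr

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, X):
        self.h = self._sigmoid(X @ self.W1 + self.b1)       # hidden activations
        z = self.h @ self.W2 + self.b2
        e = np.exp(z - z.max(axis=1, keepdims=True))        # stable softmax
        self.p = e / e.sum(axis=1, keepdims=True)
        return self.p

    def train_step(self, X, Y):
        """One full-batch backpropagation step on a one-hot target matrix Y."""
        p = self.forward(X)
        d2 = (p - Y) / len(X)                               # softmax/cross-entropy grad
        d1 = (d2 @ self.W2.T) * self.h * (1 - self.h)       # backprop through sigmoid
        self.W2 -= self.lr * self.h.T @ d2
        self.b2 -= self.lr * d2.sum(axis=0)
        self.W1 -= self.lr * X.T @ d1
        self.b1 -= self.lr * d1.sum(axis=0)
```

In our setting the output layer would have eleven units, one per vocabulary word, and the input dimension would equal the length of the concatenated bionic wavelet coefficient vector.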
IV. THE PROPOSED SPEECH RECOGNITION TECHNIQUE
This technique is applied to our own speech database containing Arabic words recorded by a single speaker for a voice command application. Those Arabic words are recorded at 16 kHz and pronounced by a male voice. The first step of the proposed technique consists in extracting features from those recorded words, and the second step consists in classifying the extracted features. The feature extraction is performed by first computing the Mel Frequency Cepstral Coefficients (MFCCs) of each recorded word; the Bionic Wavelet Transform (BWT) is then applied to the vector obtained from the concatenation of the obtained MFCCs. The resulting bionic wavelet coefficients are then concatenated to construct one input of a Multi Layer Perceptron (MLP) used for feature classification. In the MLP learning and test phases, we have used eleven Arabic words, each of them repeated twenty-five times by the same speaker. Figure 3 shows the different steps of the proposed technique.
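The MFCC front end of the feature extraction step can be sketched as follows. This is a minimal, self-contained illustration under common defaults (25 ms frames, 10 ms hop at 16 kHz, 26 mel filters, 13 coefficients); the exact analysis parameters of our experiments are not claimed here, and real front ends add pre-emphasis and liftering.

```python
import numpy as np

def mfcc(signal, fs=16000, n_mfcc=13, frame_len=400, hop=160, n_filt=26, n_fft=512):
    """Minimal MFCC computation (sketch): framing, Hamming window, power
    spectrum, mel filter bank, log, and DCT-II. Returns (n_frames, n_mfcc)."""
    # frame the signal with a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filter bank between 0 Hz and fs/2
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feat = np.log(power @ fbank.T + 1e-10)
    # DCT-II of the log filter-bank energies gives the cepstral coefficients
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_filt))
    return feat @ dct.T
```

In the pipeline described above, the per-word MFCC matrix is flattened into one vector, passed through the BWT, and the resulting coefficients are concatenated into a single MLP input.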
V. EXPERIMENTS AND RESULTS
For evaluating the proposed technique, we have tested it on some Arabic words, which are reported in TABLE I. Each of those words is recorded twenty-five times by the same speaker in order to be used for learning and testing the used MLP: ten occurrences for learning and the rest for testing. For recording those words, we have used the Microsoft Windows Sound Recorder. Each element of the constructed vocabulary is stored and labeled with the corresponding word. For evaluating the proposed technique, we have compared it with other techniques such as the method based on MFCC only and the one based on wavelets. TABLE II reports the results obtained from the different techniques.
TABLE II. RECOGNITION RATES

Feature extraction | Recognition rate
MFCC : Mel Frequency Cepstral Coefficients | 88.48 %
Δ²MFCC : second-order differential of the MFCC | 93.93 %
BWT : Bionic Wavelet Transform | 09.09 %
MFCC + BWT | 89.09 %
Δ²MFCC + BWT (proposed technique) | 99.39 %
BWT + MFCC | 55.75 %
BWT + Δ²MFCC | 36.36 %
TABLE I. THE USED VOCABULARY
Pronunciation Arabic Writing
Khalfa
Amam
Asraa
Sir
Istader
Takadam
Tarajaa
Tawakaf
Yamine
Yassare
Waraa
Figure 3. The general architecture of the proposed system (speech signal → feature extraction with MFCC and BWT → classification with MLP).
Figure 2. The used MLP architecture.
The speech recognition techniques used in our evaluation are: the technique based on MFCC, which uses the MFCC alone for feature extraction; the technique based on Δ²MFCC, which uses the second-order differential of the MFCC alone; the technique based on BWT, which uses the Bionic Wavelet Transform alone; the technique MFCC + BWT, which first computes the MFCC of the recorded words and then applies the Bionic Wavelet Transform (BWT) to the obtained coefficients; the technique Δ²MFCC + BWT, which first computes the second-order differential of the MFCC and then applies the BWT to the obtained coefficients; the technique BWT + MFCC, which first applies the BWT to the speech signals and then applies the MFCC to the obtained bionic wavelet coefficients; and the technique BWT + Δ²MFCC, which first applies the BWT to the speech signals and then applies the second-order differential of the MFCC to the bionic wavelet coefficients. The obtained recognition rates show clearly that the proposed technique outperforms the other techniques used in our evaluation.
In terms of execution time of the proposed technique (Δ²MFCC + BWT), we have found that the learning phase of the used MLP requires two hours and five minutes. For the recognition of one word, just one second is required in the test phase.
Generally, the performance of speech recognition systems is influenced by the presence of noise; that is why we have added a denoising module to the proposed system in order to make it more robust. In this denoising module, we have used Wiener filtering [14]. Figure 4 represents the modified speech recognition system in which we have introduced the Wiener filtering as a speech denoising module.
TABLE III gives all the parameter values used in the Wiener filtering algorithm implementation, where DFT designates the discrete Fourier transform and VAD is the voice activity detection [15].
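The frame-wise Wiener filtering can be sketched with the TABLE III settings (Hamming window, 512-sample frames, 50% overlap, smoothing factor 0.98). This is a simplified sketch: it assumes a fixed noise power spectrum estimate rather than the VAD-driven online update, and the decision-directed a priori SNR form is our assumption.

```python
import numpy as np

def wiener_denoise(noisy, noise_est, frame_len=512, dft_len=512, alpha=0.98):
    """Frame-wise Wiener filtering (sketch): Hamming window, 50% overlap,
    decision-directed smoothing of the a priori SNR with factor alpha.
    noise_est is a fixed noise power spectrum of length dft_len//2 + 1."""
    hop = frame_len // 2                      # 50% overlap
    win = np.hamming(frame_len)
    out = np.zeros(len(noisy))
    snr_prev = np.ones(dft_len // 2 + 1)      # previous-frame a priori SNR
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * win
        spec = np.fft.rfft(frame, dft_len)
        snr_post = np.maximum(np.abs(spec) ** 2 / (noise_est + 1e-12) - 1.0, 0.0)
        snr_prio = alpha * snr_prev + (1 - alpha) * snr_post
        gain = snr_prio / (snr_prio + 1.0)    # Wiener gain per frequency bin
        snr_prev = gain ** 2 * np.abs(spec) ** 2 / (noise_est + 1e-12)
        # overlap-add the filtered frame back into the output signal
        out[start:start + frame_len] += np.fft.irfft(gain * spec, dft_len)[:frame_len]
    return out
```

The denoised signal is then fed to the same feature extraction and classification chain as the clean recordings.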
TABLE IV reports the recognition rates obtained from the application of the proposed speech recognition technique to our recorded speech database corrupted by Gaussian white noise with different values of SNR (Signal to Noise Ratio).
This table shows clearly that the denoising module significantly improves the recognition rates. However, those rates are naturally lower than the recognition rate obtained when the proposed technique is applied to the clean recorded words.
VI. CONCLUSION
In this paper, we have proposed a new technique for Arabic speech recognition with a single speaker and a reduced vocabulary. This technique consists, at the first step, in using our own speech database containing Arabic words recorded by a single speaker. The second step consists in extracting features from those recorded words. The third step is to classify those extracted features. The feature extraction
TABLE IV. RECOGNITION RATES COMPUTED IN NOISY CONDITIONS

SNR (dB)                                   | 0     | 5    | 10   | 12   | 15    | 25
Recognition rates (%) with denoising       | 60.6  | 78.7 | 93.9 | 96.3 | 98.7  | 99.3
Recognition rates (%) without denoising    | 23.03 | 24.8 | 37.5 | 41.2 | 46.06 | 81.8
Recognition rate (%) without introduction of noise: 99.39
TABLE III. THE USED PARAMETER VALUES FOR WIENER FILTERING

Parameter | Value
Window type | Hamming
Frame length | 512
Frame overlap | 50%
DFT length | 512
Smoothing factor in noise spectrum update | 0.98
Smoothing factor in a priori SNR update | 0.98
VAD threshold | 0.15
Figure 4. The modified speech recognition system.
is performed by first computing the Mel Frequency Cepstral Coefficients (MFCCs) of each recorded word; the Bionic Wavelet Transform (BWT) is then applied to the vector obtained from the concatenation of the computed MFCCs. The obtained bionic wavelet coefficients are then concatenated to construct one input of a Multi-Layer Perceptron (MLP) used for feature classification. In the MLP learning and test phases, we have used eleven Arabic words, each of them repeated twenty-five times by the same speaker. The obtained recognition rates show clearly that the proposed technique outperforms the other techniques used in our evaluation; it gives a recognition rate of 99.39%. To make the proposed speech recognition system more robust to noise, we have introduced a denoising module as a preprocessing step. The obtained results show clearly that this module significantly improves the performance of the proposed speech recognition system.
REFERENCES
[1] B.K. Zahira and B. Ali, Utilisation des Algorithmes Génétiques pour la Reconnaissance de la Parole [Use of Genetic Algorithms for Speech Recognition], SETIT 2009.
[2] A. M. Othman and May H. Riadh, Speech Recognition Using Scaly
Neural Networks, World Academy of Science, Engineering and
Technology 38 2008.
[3] A.M. Alimi and M. Ben Jemaa, Beta Fuzzy Neural Network Application in Recognition of Spoken Isolated Arabic Words, International Journal of Control and Intelligent Systems, Special Issue on Speech Processing Techniques and Applications, vol. 30, no. 2, 2002.
[4] M. Alghamdi, M. Elshafie and H. Al-Muhtaseb, Arabic broadcast news
transcription system, Journal of Speech Technology, April, 2009.
[5] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, Training
and Adapting MLP Features for Arabic Speech Recognition, Proc. Of
IEEE Conf. Acoust. Speech Signal Process. (ICASSP), 2009.
[6] D. Kewley-Port and Y. Zheng, Auditory models of formant frequency
discrimination for isolated vowels, Journal of the Acoustical Society of
America, 103(3), 1998, pp. 1654-1666.
[7] Md. Rabiul Islam, Md. Fayzur Rahman and Muhammad Abdul Goffar Khan, Improvement of Speech Enhancement Techniques for Robust Speaker Identification in Noise, Proceedings of the 2009 12th International Conference on Computer and Information Technology (ICCIT 2009), 21-23 December 2009, Dhaka, Bangladesh.
[8] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, Training
and Adapting MLP Features for Arabic Speech Recognition, Proc. Of
IEEE Conf. Acoust. Speech Signal Process. (ICASSP), 2009.
[9] A. Zabidi, et al., Mel-Frequency Cepstrum Coefficient Analysis of
Infant Cry with Hypothyroidism, presented at the 2009 5th Int.
Colloquium on Signal Processing & Its Applications, Kuala Lumpur,
Malaysia, 2009.
[10] O. Sayadi and M.B. Shamsollahi, Multiadaptive Bionic Wavelet
Transform: Application to ECG Denoising and Baseline Wandering
Reduction, EURASIP Journal of Applied Signal Processing,
2007(Article ID 41274):11 pages, 2007.
[11] T. Mourad, S. Lotfi, A. Sabeur and C. Adnane, Recurrent Neural
Network and Bionic Wavelet Transform for speech enhancement, Int.
J. Signal and Imaging Systems Engineering, vol.3, no.2, pp.93-101,
2010.
[12] J. Yao and Y. T. Zhang, The application of bionic wavelet transform to
speech signal processing in cochlear implants using neural network
simulations, IEEE Transactions on Biomedical Engineering, vol.49,
no.11, pp.1299-1309, 2002.
[13] R.L.K. Venkateswarlu, R. Vasantha Kumari and G. Vani Jayasri, Speech Recognition using Radial Basis Function Neural Network, Electronics Computer Technology (ICECT), 2011 3rd International Conference, 2011.
[14] P. C. Loizou, Speech Enhancement Theory and Practice, Taylor &
Francis, 2007.
[15] Urmila Shrawankar, Voice Activity Detector and Noise Trackers for Speech Recognition System in Noisy Environment, International Journal of Advancements in Computing Technology, vol. 2, no. 4, 2010.