
Arabic Speech Recognition by Bionic Wavelet

Transform and MFCC using a Multi Layer Perceptron



Mohammed BEN NASR
Department of Electronics
Faculty of Sciences of Tunis
Tunis, Tunisia
bennasr.mouhamed@gmail.com

Mourad TALBI, Adnane CHERIF
Department of Electronics
Faculty of Sciences of Tunis
Tunis, Tunisia
Mouradtalbi196@yahoo.fr, adnane.cher@fst.rnu.tn



Abstract—In this paper, we propose a new technique for Arabic Speech Recognition (ASR) with a single speaker (mono-locutor) and a reduced vocabulary. The first step of this technique consists in using our own speech database containing Arabic words recorded by a mono-locutor. The second step consists in extracting features from those recorded words. The third step is to classify those extracted features. The extraction is performed by first computing the Mel Frequency Cepstral Coefficients (MFCCs) from each recorded word; the Bionic Wavelet Transform (BWT) is then applied to the vector obtained from the concatenation of the computed MFCCs. The obtained bionic wavelet coefficients are then concatenated to construct one input of a Multi-Layer Perceptron (MLP) used for feature classification. In the MLP learning and test phases, we have used eleven Arabic words, each of them repeated twenty-five times by the same speaker. A simulation program testing the performance of the proposed technique shows a classification rate equal to 99.39%. We have also introduced a denoising module as a preprocessing phase. In this denoising module, we have treated the case of white noise using Wiener filtering. For SNR = 5 dB the obtained recognition rate is equal to 78.7%, and for SNR = 10 dB it is equal to 93.9%.

Keywords—Speech Recognition, Feature Extraction, Bionic Wavelet Transform (BWT), Mel-Frequency Cepstral Coefficients (MFCCs), Multi-Layer Perceptron (MLP).
I. INTRODUCTION
Speech recognition is a process used to recognize speech uttered by a speaker, and it has been a field of research for more than five decades, since the 1950s [1]. Speech recognition is an important and emerging technology with great potential. The significance of speech recognition lies in its simplicity. This simplicity, together with the ease of operating a device using speech, has many advantages. It can be used in many applications such as security devices, household appliances, cellular phones and voice command. The latter is the subject of this paper. With the advancement of automated systems, the complexity of the integration and recognition problem is increasing. The problem is even more complex when processing randomly varying analog signals such as speech signals. Although various methods have been proposed for efficient extraction of speech parameters for recognition, the MFCC method combined with an advanced recognition method such as HMM [2] is the most dominant. Research and development on speaker recognition methods and techniques has been undertaken for well over four decades and continues to be an active area. Approaches have spanned from human auditory [2] and spectrogram comparisons, to simple template matching, to dynamic time-warping approaches, to more modern statistical pattern recognition such as neural networks [2]. The development of an automatic Arabic speech recognition system (Arabic ASR) has become an attractive domain of research. Many efforts at Arabic ASR construction have been made and they have attained promising results [3, 4]. However, the majority of these works use a reduced vocabulary. Multilayer perceptron (MLP) classifiers are being extensively used for acoustic modeling in automatic speech recognition (ASR) [5]. The MLP is trained using acoustic features such as perceptual linear predictive (PLP) cepstral coefficients, and its output classes represent the subword units of speech, such as phonemes.
In this work, we propose a new technique for Arabic speech recognition with a mono-locutor and a small vocabulary. This technique is applied to our speech database, which contains a number of isolated words recorded by a mono-locutor for a voice command application. The first step consists in extracting features from those words. The second step consists in classifying those extracted features. The feature extraction is performed by first computing the Mel Frequency Cepstral Coefficients (MFCC) of the recorded words; we then apply the Bionic Wavelet Transform (BWT) to the vector obtained from the concatenation of the obtained MFCC coefficients; finally, a feature vector is obtained by concatenation of the obtained bionic wavelet coefficients. This vector constitutes one input of the Multi Layer Perceptron used for feature classification.
In this paper, we first deal with the Mel Frequency Cepstral Coefficients (MFCC) and feature extraction; we then deal with the Bionic Wavelet Transform; next we are interested in classification using a Multi Layer Perceptron (MLP); we then detail the proposed speech recognition technique; and finally we give some results and a conclusion.
2012 6th International Conference on Sciences of Electronics,
Technologies of Information and Telecommunications (SETIT)
978-1-4673-1658-3/12/$31.00 2012 IEEE 803


Figure 1. Workflow to generate Mel Frequency Cepstrum Coefficients

II. THE FEATURES EXTRACTION
This stage is very important in a robust speaker identification system, because the quality of pattern matching and speaker modeling strongly depends on the quality of the feature extraction technique. Different types of speech feature extraction techniques [6], such as LPC, LPCC, RCC, MFCC, its differentials (ΔMFCC, Δ²MFCC) and Wavelets [7], have been applied to extract features from the speech signal. In this work, we first extract the Mel Frequency Cepstral Coefficients (MFCC) from the employed speech signals, then we apply the Bionic Wavelet Transform to the obtained MFCCs.
A. MFCC Extraction

MFCC computation is a feature extraction algorithm for human-audible audio signals. It takes into consideration the human perception sensitivity with respect to frequencies, and is therefore well suited to voice recognition. The coefficients comprise a good representation of the dominant features in the acoustic information for a selected window of time [8].
For a signal with N frames, n = 0, 1, ..., N−1, the MFCC coefficients are defined as:

c(n) = DCT( log |FFT(s(n))| )    (1)

where c(n) is the vector of MFCC coefficients at frame n and s(n) is the original signal at frame n, after application of pre-filtering and some windowing method. In extracting features using MFC analysis, five important steps are required, as shown in Figure 1.


The first step is to decompose the input signal into windows of time, with some overlap, in order to comprehensively capture the signal's temporal features and changes. As a result of frame blocking, a high frequency component is produced at the end of every signal block, known as the leakage effect in the spectrum. To minimize this leakage effect and maintain the continuity of the first and last points of the frame, a Hamming window is used. After windowing, Fourier analysis is performed on each frame, resulting in a short-time Discrete Fourier Transform (DFT). The values derived here are then grouped together in critical bands and weighted by a triangular filter bank called the mel-spaced filter bank. The mel-spaced filter bank is designed on mel-scale frequencies, which mimic the human auditory system. Mel-scale frequencies are distributed linearly in the low range but logarithmically in the high range: the human auditory system resolves tones below 1 kHz on a linear scale but frequencies above 1 kHz on a logarithmic scale.
The number of mel-filter banks can be adjusted depending on the sampling frequency of the signal. The mel-scale frequency is given by:

mel = 2595 · log10( 1 + freq / 700 )    (2)
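The mapping of Eq. (2) and its inverse can be written directly; the function names below are illustrative, not from the paper:

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Eq. (2): convert a frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used when placing triangular filter centers."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

By construction, `hz_to_mel(1000.0)` is approximately 1000, the point where the scale transitions from roughly linear to logarithmic.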


The mel-frequency cepstral coefficients are then derived by taking the logarithm of the band-passed frequency response and calculating the Discrete Cosine Transform (DCT) of each intermediate signal [9].
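The five steps above can be sketched end-to-end in Python with NumPy and SciPy; all parameter values (frame length, hop, filter count, FFT size) are illustrative defaults, not the paper's settings:

```python
import numpy as np
from scipy.fft import dct

def mfcc_sketch(signal, fs=16000, frame_len=400, hop=160,
                n_fft=512, n_filters=26, n_ceps=13):
    # 1) Frame blocking with overlap.
    n_frames = (len(signal) - frame_len) // hop + 1
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2) Hamming window against spectral leakage at the block edges.
    frames = frames * np.hamming(frame_len)
    # 3) Short-time magnitude spectrum via the DFT.
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # 4) Triangular mel-spaced filter bank (Eq. (2) maps Hz to mel).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        for k in range(lo, c):
            fbank[j, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[j, k] = (hi - k) / max(hi - c, 1)
    # 5) Log filter-bank energies, then DCT to get cepstral coefficients.
    energies = np.maximum(spec @ fbank.T, 1e-10)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Keeping only the first few DCT outputs (here 13 per frame) retains the smooth spectral envelope and discards fine detail, which is why the coefficients compactly summarize each analysis window.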
B. Bionic wavelet transform
The bionic wavelet transform (BWT) was initially introduced as an adaptive wavelet transform, conceived especially to model the human auditory system [10]. The adaptive nature of the BWT is ensured by replacing the constant quality factor of the wavelet transform with a variable one. The mother wavelet can be expressed as follows:

ψ(t) = ψ̃(t) · e^(j2πf0·t)    (3)

where f0 and ψ̃(t) are respectively the center frequency and the envelope function of ψ(t). The latter is chosen to be the Morlet wavelet. In this case the envelope function is expressed as follows:

ψ̃(t) = e^(−(t/T0)²)    (4)

where T0 is the initial support of the unscaled mother wavelet. Using a time-varying function T, the mother function of the BWT is expressed as follows [10]:

ψ_T(t) = (1/√T) · ψ̃(t/T) · e^(j2πf0·t)    (5)
The BWT of a given signal x(t) is defined as follows [10, 11]:

BWT_x(a, τ) = (1/√a) ∫ x(t) · ψ_T*((t − τ)/a) dt
            = (1/√(aT)) ∫ x(t) · ψ̃*((t − τ)/(aT)) · e^(−j2πf0(t − τ)/a) dt    (6)
Hence, the adaptive nature of the BWT is captured by a time-varying factor T. This factor represents the scaling of the cochlear filter bank quality factor at each scale over time [12]. For the human auditory system, Yao and Zhang [12] have taken f0 = 15165.4 Hz. The discretization of the scale variable a is accomplished using a pre-defined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is expressed as follows [11]:

f_m = f0 / (1.1623)^m ,   m = 0, 1, 2, ...    (7)
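Eq. (7) amounts to a geometric sequence of center frequencies, one per scale; a minimal sketch (function name illustrative):

```python
import numpy as np

# Eq. (7): logarithmically spaced center frequencies, one per scale,
# using f0 = 15165.4 Hz (the value of Yao and Zhang [12]) and ratio 1.1623.
def bwt_center_frequencies(n_scales, f0=15165.4, ratio=1.1623):
    m = np.arange(n_scales)
    return f0 / ratio ** m
```

For example, the first three scales fall at approximately 15165.4, 13047.8 and 11225.8 Hz.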

For each time and scale, the adapting function T(a, τ) is calculated using the following equation [10, 12]:

T(a, τ + Δτ) = [ 1 − G̃1 · BWT_s / (BWT_s + |BWT_x(a, τ)|) ]⁻¹ · [ 1 + G̃2 · |∂BWT_x(a, τ)/∂τ| ]⁻¹    (8)
where G̃1 designates the active gain factor representing the outer hair cell resistance function, G̃2 is the active gain factor representing the time-varying compliance of the basilar membrane, BWT_s is a constant, BWT_x(a, τ) is the BWT at scale a and time τ, and Δτ is the time computation step [11]. By adjusting G̃1 and G̃2, the resolutions in the time domain and the frequency domain can be increased respectively [11]. In implementation, BWT coefficients can be easily computed from the corresponding coefficients of the Continuous Wavelet Transform (CWT) by:

BWT_x(a, τ) = K(a) · CWT_x(a, τ)    (9)
where K(a) is a factor that depends on T [10]. For the Morlet wavelet ψ̃(t) = e^(−(t/T0)²), which is also employed as the mother function in our experiments, K is expressed by:

K(a, τ) = ∫_{−∞}^{+∞} e^(−t² / (1 + T(a, τ)/T0)²) dt    (10)

which is roughly equal to 1.7725 · (1 + T(a, τ)/T0).

In this paper, we have used the same values as in the references [10, 11, 12]: G̃1 = 0.87, G̃2 = 45, BWT_s = 0.8, T0 = 0.0005, and the computation step Δτ is chosen to be equal to 1/fs, where fs represents the sampling frequency.
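The adaptive loop behind Eqs. (8) and (9) can be sketched for one scale as follows. This is a minimal illustration under stated assumptions: the two-factor update form, the function name, and passing K as a callable (rather than hard-coding Eq. (10)) are all choices made for this sketch, not a verified transcription of the paper's implementation:

```python
import numpy as np

def bwt_from_cwt(cwt_row, K, G1=0.87, G2=45.0, BWTs=0.8, dtau=1.0 / 16000):
    """cwt_row: complex CWT coefficients of one scale over time; K: callable K(T)."""
    n = len(cwt_row)
    T = 1.0                                   # quality-factor scaling, start at 1
    bwt = np.zeros(n, dtype=complex)
    for k in range(n):
        bwt[k] = K(T) * cwt_row[k]            # Eq. (9): BWT = K * CWT
        amp = abs(bwt[k])
        deriv = abs(bwt[k] - bwt[k - 1]) / dtau if k > 0 else 0.0
        # Saturation term (gain G1, outer hair cells) shrinks T for loud input;
        # the derivative term (gain G2) shrinks it for rapidly varying input.
        T = 1.0 / ((1.0 - G1 * BWTs / (BWTs + amp)) * (1.0 + G2 * deriv))
    return bwt
```

The appeal of this formulation is practical: a standard CWT routine can be reused unchanged, with the auditory adaptation confined to the per-sample scaling factor.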

III. CLASSIFIERS
For classification, we have used in this work the Multi Layer Perceptron (MLP), which is the most popular network architecture in use today, due originally to Rumelhart and McClelland (1986). Each unit performs a biased weighted sum of its inputs and passes this activation level through a transfer function to produce its output, and the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) as the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity. Important issues in MLP design include the specification of the number of hidden layers and the number of units in these layers. The number of input and output units is defined by the problem (there may be some uncertainty about precisely which inputs to use [13]; however, for the moment we will assume that the input variables are intuitively selected and are all meaningful). The number of hidden units to use is far from clear. As good a starting point as any is to use one hidden layer with a varying number of units: in this work, we vary the number of hidden units from 20 to 200. Figure 2 illustrates the used MLP architecture.

To train this MLP, we have used the backpropagation algorithm.
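The classifier stage described above can be sketched as a one-hidden-layer MLP trained with backpropagation. Layer sizes, the tanh/softmax choice and the learning rate below are illustrative assumptions, not the paper's configuration (the paper varies the hidden units from 20 to 200):

```python
import numpy as np

class MLP:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X):
        # Biased weighted sum passed through a transfer function (tanh),
        # then a softmax output layer for classification.
        self.h = np.tanh(X @ self.W1 + self.b1)
        z = self.h @ self.W2 + self.b2
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def train_step(self, X, Y_onehot, lr=0.1):
        # One backpropagation step on the cross-entropy loss.
        P = self.forward(X)
        dz = (P - Y_onehot) / len(X)
        dW2, db2 = self.h.T @ dz, dz.sum(axis=0)
        dh = (dz @ self.W2.T) * (1.0 - self.h ** 2)
        self.W1 -= lr * (X.T @ dh)
        self.b1 -= lr * dh.sum(axis=0)
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
```

In the system described here, each input row would be a concatenated bionic-wavelet feature vector and each output class one of the eleven vocabulary words.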
IV. THE PROPOSED SPEECH RECOGNITION TECHNIQUE
The first step of this technique consists in using our own speech database containing Arabic words recorded by a mono-locutor for a voice command application. Those Arabic words are recorded at 16 kHz and pronounced by a male voice. The second step consists in extracting features from those recorded words. The third step consists in classifying those extracted features. The feature extraction is performed by first computing the Mel Frequency Cepstral Coefficients (MFCCs) from each recorded word; the Bionic Wavelet Transform (BWT) is then applied to the vector obtained from the concatenation of the obtained MFCCs. The obtained bionic wavelet coefficients are then concatenated to construct one input of a Multi Layer Perceptron (MLP) used for feature classification. In the MLP learning and test phases, we have used eleven Arabic words, each repeated twenty-five times by the same speaker. Figure 3 shows the different steps of the proposed technique.

V. EXPERIMENTS AND RESULTS
To evaluate the proposed technique, we have tested it on the Arabic words reported in TABLE I.

Each of those words is recorded twenty-five times by the same speaker, to be used for learning and testing the MLP: ten occurrences for learning and the rest for testing. For recording those words, we have used the Microsoft Windows Sound Recorder. Each element of the constructed vocabulary is stored and labeled with the corresponding word. For comparison, we have also evaluated other techniques, such as the method based on MFCC only and the one based on wavelets. TABLE II reports the results obtained with the different techniques.

TABLE II. RECOGNITION RATES
Feature extraction                                  Recognition rate
MFCC: Mel Frequency Cepstral Coefficients           88.48 %
Δ²MFCC: second-order differential of the MFCC       93.93 %
BWT: Bionic Wavelet Transform                        9.09 %
MFCC + BWT                                          89.09 %
Δ²MFCC + BWT                                        99.39 %
BWT + MFCC                                          55.75 %
BWT + Δ²MFCC                                        36.36 %

TABLE I. THE USED VOCABULARY
Pronunciation    Arabic Writing
Khalfa           خلفا
Amam             أمام
Asraa            أسرع
Sir              سر
Istader          استدر
Takadam          تقدم
Tarajaa          تراجع
Tawakaf          توقف
Yamine           يمين
Yassare          يسار
Waraa            وراء


Figure 3. The general architecture of the proposed system (speech signal → feature extraction by MFCC and BWT → classification by MLP)

Figure 2. The used MLP architecture
The speech recognition techniques used in our evaluation are: the technique based on the MFCC alone for feature extraction; the technique based on the second-order differential of the MFCC (Δ²MFCC) alone; the technique based on the Bionic Wavelet Transform (BWT) alone; the technique MFCC + BWT, which first computes the MFCC of the used recorded words and then applies the BWT to the obtained coefficients; the technique Δ²MFCC + BWT, which first computes the Δ²MFCC of the used recorded words and then applies the BWT to the obtained coefficients; the technique BWT + MFCC, which first applies the BWT to the employed speech signals and then applies the MFCC to the obtained bionic wavelet coefficients; and the technique BWT + Δ²MFCC, which first applies the BWT to the used speech signals and then applies the Δ²MFCC to the bionic wavelet coefficients. The recognition rates obtained show clearly that the proposed technique outperforms the other techniques used in our evaluation.
In terms of execution time of the proposed technique (Δ²MFCC + BWT), we have found that the learning phase of the used MLP requires two hours and five minutes. For the recognition of one word, just one second is required in the test phase.
Generally, the performance of speech recognition systems is influenced by the presence of noise; that is why we have added a denoising module to the proposed system in order to make it more robust. In this denoising module, we have used Wiener filtering [14]. Figure 4 represents the modified speech recognition system, in which we have introduced the Wiener filtering as a speech denoising module.

TABLE III gives all the parameter values used in the Wiener filtering algorithm implementation, where DFT designates the discrete Fourier transform and VAD is the voice activity detection [15].
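A spectral-subtraction-style Wiener gain using the parameter values of TABLE III can be sketched as follows. The first-frame noise bootstrap and the crude energy-based VAD below are illustrative stand-ins for the noise tracker of [14] and the VAD of [15], not their actual algorithms:

```python
import numpy as np

def wiener_denoise(noisy, frame_len=512, alpha=0.98, vad_threshold=0.15):
    hop = frame_len // 2                      # 50% frame overlap (TABLE III)
    win = np.hamming(frame_len)               # Hamming window (TABLE III)
    n_frames = (len(noisy) - frame_len) // hop + 1
    out = np.zeros(len(noisy))
    noise_psd = None
    for i in range(n_frames):
        frame = noisy[i * hop : i * hop + frame_len] * win
        spec = np.fft.rfft(frame, n=frame_len)  # 512-point DFT (TABLE III)
        psd = np.abs(spec) ** 2
        if noise_psd is None:
            noise_psd = psd.copy()            # bootstrap the noise estimate
        elif psd.mean() <= (1.0 + vad_threshold) * noise_psd.mean():
            # Frame judged noise-only: smooth the noise PSD (factor 0.98).
            noise_psd = alpha * noise_psd + (1.0 - alpha) * psd
        # Wiener gain from the estimated SNR per bin, floored at 0.
        snr = np.maximum(psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
        gain = snr / (snr + 1.0)
        out[i * hop : i * hop + frame_len] += np.fft.irfft(gain * spec, n=frame_len) * win
    return out
```

Bins dominated by noise receive a gain near zero while speech-dominated bins pass almost unchanged, which is the behavior that lifts the noisy-condition recognition rates reported in TABLE IV.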
TABLE IV reports the recognition rates obtained by applying the proposed speech recognition technique to our recorded speech database corrupted by Gaussian white noise with different values of SNR (Signal to Noise Ratio).

This table shows clearly that the denoising module significantly improves the recognition rates. However, those rates are naturally lower than the recognition rate obtained when the proposed technique is applied to the clean recorded words.
VI. CONCLUSION
In this paper, we have proposed a new technique for Arabic speech recognition with a mono-locutor and a reduced vocabulary. The first step of this technique consists in using our own speech database containing Arabic words recorded by a mono-locutor. The second step consists in extracting features from those recorded words. The third step is to classify those extracted features. The feature extraction
TABLE IV. RECOGNITION RATES COMPUTED IN NOISY CONDITIONS
SNR (dB)                                  0      5      10     12     15     25
Recognition rate (%) with denoising       60.6   78.7   93.9   96.3   98.7   99.3
Recognition rate (%) without denoising    23.03  24.8   37.5   41.2   46.06  81.8
Recognition rate (%) without introduction of noise: 99.39

TABLE III. THE USED PARAMETER VALUES FOR WIENER FILTERING
Parameter                                  Value
Window type                                Hamming
Frame length                               512
Frame overlap                              50%
DFT length                                 512
Smoothing factor in noise spectrum update  0.98
Smoothing factor in a priori update        0.98
VAD threshold                              0.15




Figure 4. The modified speech recognition system.
is performed by first computing the Mel Frequency Cepstral Coefficients (MFCCs) from each recorded word; the Bionic Wavelet Transform (BWT) is then applied to the vector obtained from the concatenation of the computed MFCCs. The obtained bionic wavelet coefficients are then concatenated to construct one input of a Multi-Layer Perceptron (MLP) used for feature classification. In the MLP learning and test phases, we have used eleven Arabic words, each of them repeated twenty-five times by the same speaker. The recognition rates obtained show clearly that the proposed technique outperforms the other techniques used in our evaluation, achieving a recognition rate of 99.39%. To make the proposed speech recognition system more robust to noise, we have introduced a denoising module as a preprocessing stage. The obtained recognition rates show clearly that this module significantly improves the performance of the proposed speech recognition system.
REFERENCES
[1] B.K. Zahira and B. Ali, Utilisation des Algorithmes Génétiques pour la Reconnaissance de la Parole [Use of Genetic Algorithms for Speech Recognition], SETIT 2009.
[2] A. M. Othman and May H. Riadh, Speech Recognition Using Scaly Neural Networks, World Academy of Science, Engineering and Technology, vol. 38, 2008.
[3] A.M. Alimi and M. Ben Jemaa, Beta Fuzzy Neural Network Application in Recognition of Spoken Isolated Arabic Words, International Journal of Control and Intelligent Systems, Special Issue on Speech Processing Techniques and Applications, vol. 30, no. 2, 2002.
[4] M. Alghamdi, M. Elshafie and H. Al-Muhtaseb, Arabic broadcast news
transcription system, Journal of Speech Technology, April, 2009.
[5] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, Training
and Adapting MLP Features for Arabic Speech Recognition, Proc. Of
IEEE Conf. Acoust. Speech Signal Process. (ICASSP), 2009.
[6] D. Kewley-Port and Y. Zheng, Auditory models of formant frequency
discrimination for isolated vowels, Journal of the Acoustical Society of
America, 103(3), 1998, pp. 1654-1666.
[7] Md. Rabiul Islam, Md. Fayzur Rahman and Muhammad Abdul Goffar Khan, Improvement of Speech Enhancement Techniques for Robust Speaker Identification in Noise, Proceedings of the 2009 12th International Conference on Computer and Information Technology (ICCIT 2009), 21-23 December 2009, Dhaka, Bangladesh.
[8] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, Training
and Adapting MLP Features for Arabic Speech Recognition, Proc. Of
IEEE Conf. Acoust. Speech Signal Process. (ICASSP), 2009.
[9] A. Zabidi, et al., Mel-Frequency Cepstrum Coefficient Analysis of
Infant Cry with Hypothyroidism, presented at the 2009 5th Int.
Colloquium on Signal Processing & Its Applications, Kuala Lumpur,
Malaysia, 2009.
[10] O. Sayadi and M.B. Shamsollahi, Multiadaptive Bionic Wavelet
Transform: Application to ECG Denoising and Baseline Wandering
Reduction, EURASIP Journal of Applied Signal Processing,
2007(Article ID 41274):11 pages, 2007.
[11] T. Mourad, S. Lotfi, A.Sabeur and C. Adnane, Recurrent Neural
Network and Bionic Wavelet Transform for speech enhancement, Int.
J. Signal and Imaging Systems Engineering, vol.3, no.2, pp.93-101,
2010.
[12] J. Yao and Y. T. Zhang, The application of bionic wavelet transform to
speech signal processing in cochlear implants using neural network
simulations, IEEE Transactions on Biomedical Engineering, vol.49,
no.11, pp.1299-1309, 2002.
[13] R.L.K. Venkateswarlu, R. Vasantha Kumari and G. Vani Jayasri, Speech Recognition using Radial Basis Function Neural Network, Electronics Computer Technology (ICECT), 2011 3rd International Conference on, 2011.
[14] P. C. Loizou, Speech Enhancement Theory and Practice, Taylor &
Francis, 2007.
[15] Urmila Shrawankar, Voice Activity Detector and Noise Trackers for Speech Recognition System in Noisy Environment, International Journal of Advancements in Computing Technology, vol. 2, no. 4, 2010.
