HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
Index Terms: Feature Extraction, Feature Matching, Mel Frequency Cepstral Coefficient (MFCC), Dynamic Time Warping (DTW)
1 INTRODUCTION
Voice signal identification consists of the process of converting a speech waveform into features that are useful for further processing. Many algorithms and techniques are in use; the choice depends on the features' capability to capture time, frequency, and energy into a set of coefficients for cepstrum analysis [1].
Generally, the human voice conveys much information, such as the gender, emotion, and identity of the speaker. The objective of voice recognition is to determine which speaker is present based on the individual's utterance [2]. Several techniques have been proposed for reducing the mismatch between the testing and training environments. Many of these methods operate either in the spectral [3, 4] or in the cepstral domain [5]. First, the human voice is converted into digital form, producing digital data that represents the level of the signal at every discrete time step. The digitized speech samples are then processed using MFCC to produce the voice features. After that, the coefficients of the voice features can go through DTW to select the pattern that matches the database and the input frame, in order to minimize the resulting error between them.
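As a concrete illustration of this pipeline, the sketch below uses per-frame energy as a simplified stand-in for the MFCC features and a plain dynamic-programming DTW. The function names (`frame_energies`, `dtw_distance`, `recognize`) are illustrative assumptions, not the paper's MATLAB implementation:

```python
# Minimal sketch of the recognition pipeline: digitize -> features -> DTW match.
# NOTE: frame_energies() is a simplified stand-in for MFCC feature extraction.

def frame_energies(signal, frame_len=4):
    """Split a digitized signal into frames and return per-frame energy."""
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def dtw_distance(a, b):
    """Plain DTW alignment cost between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(input_signal, templates):
    """Return the label of the stored template with the smallest DTW error."""
    feats = frame_energies(input_signal)
    return min(templates,
               key=lambda label: dtw_distance(feats, templates[label]))
```

Here `templates` maps each reference label to its stored feature sequence, mirroring the database-matching step described above.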
The popularly used cepstrum-based methods for comparing patterns to find their similarity are MFCC and DTW. The MFCC and DTW feature techniques can be implemented using MATLAB [6]. This paper reports the findings of a voice recognition study using the MFCC and DTW techniques.
The rest of the paper is organized as follows: the principles of voice recognition are given in Section 2, the methodology of the study is provided in Section 3, followed by results and discussion in Section 4; finally, concluding remarks are given in Section 5.
As shown in Figure 3, MFCC consists of seven computational steps. Each step has its function and mathematical approach, as discussed briefly in the following:

Step 1: Pre-emphasis
This step passes the signal through a filter which emphasizes higher frequencies. The process increases the energy of the signal at higher frequency:

Y[n] = X[n] - 0.95 X[n-1]  (1)

Let us consider a = 0.95, which makes 95% of any one sample presumed to originate from the previous sample.

Step 2: Framing
This step segments the speech samples obtained from analog-to-digital conversion (ADC) into small frames with a length within the range of 20 to 40 ms. The voice signal is divided into frames of N samples; adjacent frames are separated by M samples (M < N). Typical values used are M = 100 and N = 256.

Step 3: Hamming windowing
A Hamming window is used as the window shape by considering the next block in the feature extraction processing chain, and it integrates all the closest frequency lines. If the window is defined as W(n), 0 <= n <= N-1, where
N = number of samples in each frame,
Y[n] = output signal,
X(n) = input signal, and
W(n) = Hamming window,
then the result of windowing the signal is shown below:

Y[n] = X[n] * W[n]  (2)

W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1  (3)

The remaining steps rely on the Mel-frequency conversion

F(Mel) = 2595 * log10(1 + f/700)  (5)

the energy within each frame

Energy = sum over the frame of X^2  (6)

and the delta (time-derivative) cepstral features

d(t) = (c(t+1) - c(t-1)) / 2  (7)
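The equations above can be sketched in plain Python. This is an illustrative reimplementation, not the paper's MATLAB code; the function names are assumptions, and Eq. (7) is reconstructed as the standard delta-cepstrum formula:

```python
import math

def preemphasis(x, a=0.95):
    """Step 1, Eq. (1): Y[n] = X[n] - a*X[n-1]."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, N=256, M=100):
    """Step 2: frames of N samples, adjacent frames M samples apart (M < N)."""
    return [x[i:i + N] for i in range(0, len(x) - N + 1, M)]

def hamming(N):
    """Eq. (3): W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window(frame):
    """Step 3, Eq. (2): Y[n] = X[n] * W[n]."""
    return [xi * wi for xi, wi in zip(frame, hamming(len(frame)))]

def hz_to_mel(f):
    """Eq. (5): F(Mel) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def frame_energy(frame):
    """Eq. (6): sum of squared samples within a frame."""
    return sum(xi * xi for xi in frame)

def deltas(c):
    """Eq. (7): d(t) = (c(t+1) - c(t-1)) / 2, with endpoints replicated."""
    padded = [c[0]] + list(c) + [c[-1]]
    return [(padded[t + 2] - padded[t]) / 2.0 for t in range(len(c))]
```

Note that the Hamming window peaks at 1.0 in the middle of the frame and tapers toward 0.08 at the edges, which smooths the frame boundaries before the spectral analysis of the later steps.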
where N is the length of the sequence and V is the number of templates to be considered.
Theoretically, the major optimizations to the DTW algorithm arise from observations on the nature of good paths through the grid. These are outlined in Sakoe and Chiba [16] and can be summarized as:
Monotonic condition: the path will not turn back on itself; both the i and j indexes either stay the same or increase, they never decrease.
Continuity condition: both i and j can only increase by 1 on each step along the path.
Boundary condition: the path starts at the bottom left and ends at the top right of the grid.
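These path conditions can be made concrete in code. Below is a minimal sketch of DTW that realizes them: the recurrence only allows steps from (i-1, j), (i, j-1) and (i-1, j-1) (the monotonic and continuity conditions), and the backtracking is anchored at the bottom-left and top-right corners (the boundary condition). The function name and the frame distance (absolute difference of scalar features) are assumptions for illustration:

```python
def dtw_path(a, b):
    """DTW over two feature sequences, returning (cost, optimal warping path)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * m for _ in range(n)]
    d[0][0] = abs(a[0] - b[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Monotonic + continuity: predecessors are (i-1,j), (i,j-1), (i-1,j-1).
            best = min(d[i - 1][j] if i > 0 else INF,
                       d[i][j - 1] if j > 0 else INF,
                       d[i - 1][j - 1] if i > 0 and j > 0 else INF)
            d[i][j] = abs(a[i] - b[j]) + best
    # Boundary: backtrack the path from the top-right (n-1, m-1)
    # down to the bottom-left (0, 0).
    i, j, path = n - 1, m - 1, []
    while True:
        path.append((i, j))
        if i == 0 and j == 0:
            break
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in moves if p[0] >= 0 and p[1] >= 0),
                   key=lambda p: d[p[0]][p[1]])
    return d[n - 1][m - 1], path[::-1]
```

Every step of the returned path moves i and j forward by at most 1 and never backward, which is exactly what the three conditions above require.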
3 METHODOLOGY

The experimental setup was as follows:
1) Speaker: one female, one male
2) Tools: mono microphone, GoldWave software
3) Environment: laboratory
4) Utterance
5) Sampling frequency, fs: 16,000 Hz
6) Feature computation

Fig. 6. Voice Algorithm Flow Chart
Fig. 7. Example voice signal input of two different speakers
Figure 7 is used for carrying out the voice analysis performance evaluation using MFCC. The MFCC output is a matrix of cepstral coefficients; the problem with this approach is that if constant window spacing is used, the lengths of the input and stored sequences are unlikely to be the same. Moreover, within a word there will be variation in the length of individual phonemes, as discussed before. For example, the word "Volume Up" might be uttered with a long /O/ and short final /U/, or with a short /O/ and long /U/.
Figure 8 shows the MFCC output of two different speakers. The matching process needs to compensate for length differences and take account of the non-linear nature of the length differences within the words.
Fig. 9. Optimal warping path of test input, female speaker, "Volume Up"
Fig. 10. Optimal warping path of test input, female speaker, "Volume Down"
Fig. 11. Optimal warping path of test input, female speaker, channel one
CONCLUSION
ACKNOWLEDGMENT
The authors would like to thank Universiti Teknologi PETRONAS for supporting this work.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]