RECOGNITION METHODOLOGY
Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command
A Simple Model of Speech Production
[Figure: voiced excitation (pulse train, P(f)) and unvoiced excitation (white noise, N(f)) pass through the vocal tract spectral shaping H(f) and the lips emission R(f).]
$$S(f) = \left(v \cdot P(f) + u \cdot N(f)\right) \cdot H(f) \cdot R(f)$$
$$S(f) = X(f) \cdot H(f) \cdot R(f)$$
Spectral Shaping 𝑯(𝒇)
• Changing the shape of the vocal tract changes the spectral
shape of the speech signal, thus articulating different speech
sounds
• The most valuable information for a speech recognizer is contained in
the way the spectral shape of the speech signal changes over time.
• Direct computation of the power spectrum from the speech signal
results in a spectrum containing “ripples” caused by the
excitation spectrum X(f).
• A smooth spectral shape without these ripples, representing
H(f), has to be estimated.
Cepstral Transformation
$$S(f) = X(f) \cdot H(f) \cdot R(f) = H(f) \cdot U(f)$$
$$\log S(f) = \log\left(H(f) \cdot U(f)\right) = \log H(f) + \log U(f)$$
• Interpret this log-spectrum as a time signal.
• The “ripples” caused by U(f) then appear as a “high-frequency” component, while the smooth shape H(f) varies slowly.
• Hence a kind of low-pass filtering recovers the smooth spectral shape.
• The inverse Fourier transform of the log spectrum brings us back to the
time domain, giving the so-called cepstrum.
• Low-pass filtering is done by setting the higher-order cepstral
coefficients to zero and then transforming back to the frequency
domain.
• The process of filtering in the cepstral domain is called “liftering”.
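The liftering steps above can be sketched in Python with NumPy. This is only an illustration, not the LabVIEW implementation used in the project: the function name `smooth_log_spectrum`, the cutoff `n_keep` and the small offset inside the logarithm are assumptions made for the example.

```python
import numpy as np

def smooth_log_spectrum(frame, n_keep=14):
    """Estimate a smooth log-magnitude spectrum by low-pass liftering.

    frame  : 1-D array holding one frame of speech samples
    n_keep : number of low-order cepstral coefficients to keep (assumed value)
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Log-magnitude spectrum; the tiny offset avoids log(0)
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    # Inverse transform of the log spectrum gives the (real) cepstrum
    cepstrum = np.fft.ifft(log_mag).real
    # Lifter: keep the low-order coefficients and their symmetric mirror
    lifter = np.zeros(n)
    lifter[:n_keep] = 1.0
    lifter[n - n_keep + 1:] = 1.0
    # Transform back to the frequency domain to obtain the smooth shape
    return np.fft.fft(cepstrum * lifter).real
```

Keeping the mirrored coefficients preserves the symmetry of the cepstrum, so the smoothed log spectrum stays real-valued.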
Mel frequency Cepstral Coefficients
• The human ear does not have a linear frequency resolution; it
builds several groups of frequencies and integrates the spectral
energies within each group.
• The mid-frequencies and bandwidths of these groups are
non-linearly distributed.
• This non-linear warping of the frequency axis is modeled by
the mel scale, on which the frequency groups are assumed to
be linearly distributed.
$$f_{mel}(f) = 2595 \cdot \log_{10}\left(1 + \frac{f}{700\,\mathrm{Hz}}\right)$$
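As a quick check of the mel formula, a minimal Python sketch follows. The function names `hz_to_mel` and `mel_to_hz` are illustrative, and the inverse mapping is derived from the formula rather than taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz onto the mel scale (base-10 logarithm)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, obtained by solving the mel formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With these constants, 1000 Hz maps to roughly 1000 mel, and equal steps in mel correspond to increasingly wide steps in Hz at higher frequencies.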
• A common way to do mel frequency warping is to use triangle-shaped filters
in the spectral domain to build a weighted sum over the power spectrum
coefficients which lie within each window.
• This gives a new set of coefficients known as the mel spectral coefficients.
• The cepstral transformation is performed on them to extract the Mel Frequency
Cepstral Coefficients (MFCC).
• The MFCC are used directly for recognition instead of being transformed
back to the frequency domain.
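The triangular filtering and the cepstral transformation can be sketched as follows. This is a rough illustration under stated assumptions: the function names, the number of filters, the sampling rate and the use of an explicit DCT-II for the cepstral step are choices made for the example and are not given on the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for k in range(n_filters):
        lo, mid, hi = hz_edges[k], hz_edges[k + 1], hz_edges[k + 2]
        rising = (bin_freqs - lo) / (mid - lo)    # left slope of the triangle
        falling = (hi - bin_freqs) / (hi - mid)   # right slope of the triangle
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_from_power_spectrum(power, bank, n_coeffs=13):
    """Weighted sums per filter, then log and DCT-II to get cepstral coefficients."""
    log_mel = np.log(bank @ power + 1e-12)        # log mel spectral coefficients
    n = len(log_mel)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n)     # unnormalised DCT-II basis
    return basis @ log_mel
```

For example, `mel_filterbank(20, 256, 8000)` gives a 20 x 129 weight matrix that is applied to the first half of a 256-point power spectrum.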
Feature Matching
Dynamic Time Warping (DTW)
• Distance calculation using Dynamic Time
Warping
• Each utterance is divided into frames of 20 ms.
• The MFCC of each frame are computed and represented by a
vector.
• Hence each utterance is represented by a vector sequence:
$X = \{x_0, x_1, \ldots, x_{T_x - 1}\}$
• The distance between individual vectors is found using the
Euclidean distance formula.
DTW Algorithm
• Finding the optimal alignment path
Key points to find the optimal path
• A grid point (i, j) on the optimal path can have the predecessors
(i-1, j), (i-1, j-1) and (i, j-1).
• Bellman’s Principle: if P_opt is the optimal path through the
matrix of grid points beginning at (0, 0) and ending at (T_w-1, T_x-1),
and grid point (i, j) is part of P_opt, then the partial path
from (0, 0) to (i, j) is also part of P_opt.
• The accumulated distance matrix is then built up recursively from
these predecessors.
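A minimal sketch of the accumulated distance computation, assuming the common unweighted recursion D(i, j) = d(i, j) + min(D(i-1, j), D(i-1, j-1), D(i, j-1)) over the three predecessors listed above, with d(i, j) the Euclidean distance between frame i of one utterance and frame j of the other. The project's exact formula (for example, a different weight on the diagonal step) may differ; the function name `dtw_distance` is illustrative.

```python
import numpy as np

def dtw_distance(W, X):
    """Accumulated DTW distance between two vector sequences (one row per frame)."""
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    # Local Euclidean distance between every pair of frames
    local = np.linalg.norm(W[:, None, :] - X[None, :, :], axis=2)
    Tw, Tx = local.shape
    acc = np.full((Tw, Tx), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(Tw):
        for j in range(Tx):
            if i == 0 and j == 0:
                continue
            # Best of the three allowed predecessors (assumed equal weights)
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
            )
            acc[i, j] = local[i, j] + best_prev
    # Accumulated distance at the end point (Tw-1, Tx-1)
    return acc[-1, -1]
```

A command would then be recognized by computing this distance between the input utterance and each stored template and picking the template with the smallest value.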
Hamming window (N is the number of samples in the window):
$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right)$$
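The window formula can be checked numerically. A minimal sketch, assuming the denominator is N - 1 (the window length minus one, as in the standard Hamming window); the function name `hamming_window` is illustrative.

```python
import numpy as np

def hamming_window(N):
    """Hamming window of N samples: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))
```

Under this assumption the result matches NumPy's built-in `np.hamming(N)`. For a 20 ms frame at an assumed sampling rate of 8 kHz (the slides do not state the rate), N would be 160.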
2.4 Noise threshold detection
• To detect the start of the utterance in the 3 s long input speech
signal, the energy of each frame is calculated and stored in an
array whose size equals the total number of frames. The energy
array is sorted in ascending order, and the mean of its first 15
elements is taken as the noise energy. The threshold was set to
10 times the noise energy.
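The threshold computation can be sketched as below. The function names and the non-overlapping framing are assumptions for illustration; the values 15 and 10 come from the text.

```python
import numpy as np

def frame_energies(signal, frame_len):
    """Energy (sum of squares) of each non-overlapping frame of the signal."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

def noise_threshold(energies, n_quietest=15, factor=10.0):
    """Mean of the quietest frames gives the noise energy; scale it to get the threshold."""
    quietest = np.sort(energies)[:n_quietest]
    return factor * quietest.mean()
```

Taking the quietest frames as the noise estimate relies on the recording containing at least 15 frames of silence, which a 3 s recording of a short command normally does.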
2.5 Utterance detection
• Once the threshold has been calculated, all elements of the energy
array that are greater than the threshold are replaced by 1 and the rest
by 0, giving a Boolean array.
• Sometimes spikes due to external noise cross the threshold and
contribute a 1 to the Boolean array. To remove these spikes, a Median filter
VI in LabVIEW with left and right rank of 3 is used. The median filter
replaces the i-th element of the Boolean array with the median of the
elements {i-3, i-2, i-1, i, i+1, i+2, i+3}.
Hence the median filter smooths the Boolean array.
• The Peak detector VI in LabVIEW is then used to find the indices of the start and
end of the utterance. Using these indices, the corresponding frames
containing the utterance are extracted.
• N.B.: In my project, all commands were shorter than 0.6 s. Sometimes
spikes due to noise remained even after the median filter, so the
end index was not detected accurately. The start index, however, was detected
accurately most of the time, so I extracted 0.6 s of sound after the start
index.
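The utterance detection steps can be approximated outside LabVIEW as follows. This is a sketch, not a port of the Median filter or Peak detector VIs: the function names are illustrative, the edge padding is an assumption about how the array boundaries are handled, and the start index is simply taken as the first frame still flagged after smoothing.

```python
import numpy as np

def median_smooth(flags, rank=3):
    """Median filter with `rank` neighbours on each side (window of 2*rank + 1)."""
    padded = np.pad(np.asarray(flags, dtype=int), rank, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * rank + 1)
    return np.median(windows, axis=1).astype(int)

def utterance_start(energies, threshold, rank=3):
    """Index of the first frame above threshold after removing short spikes."""
    flags = (np.asarray(energies) > threshold).astype(int)
    active = np.flatnonzero(median_smooth(flags, rank))
    return int(active[0]) if active.size else None

def extract_utterance(frames, start, frame_ms=20, duration_ms=600):
    """Fixed-length extraction: 0.6 s of frames following the start index."""
    n_frames = duration_ms // frame_ms
    return frames[start:start + n_frames]
```

With 20 ms frames, 0.6 s corresponds to 30 frames, so a fixed-length extraction avoids depending on the less reliable end index.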
Step 3: Feature Extraction
Block Diagram of Feature Extraction VI
• An FFT is computed for each frame of the utterance and only the first
half of the result is kept, since the spectrum of a real signal is symmetric.
• The spectrum of each frame is warped onto the Mel scale and
thus Mel spectral coefficients are obtained.
• Discrete cosine transform is done on Mel spectral coefficients of
each frame, hence obtaining MFCC.
• The first 2 coefficients of the obtained MFCC are removed as
they varied significantly between different utterances of the
same word.
• Liftering is done by replacing all MFCC except the first 14 by
zero.
• The first coefficient of MFCC of each frame was replaced by the
log energy of that frame.
• Delta and Acceleration coefficients are found from the MFCC so
as to increase the dimension of the feature vector of the
frames, thereby increasing the accuracy.
• Delta coefficients are found from the following equation. The value
of p chosen was 1.
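The delta equation itself is not reproduced here, so the sketch below assumes the commonly used regression formula d_t = Σ_{k=1..p} k (c_{t+k} − c_{t−k}) / (2 Σ_{k=1..p} k²), which for p = 1 reduces to d_t = (c_{t+1} − c_{t−1}) / 2. The function name `delta` and the repetition of the first and last frames at the edges are also assumptions.

```python
import numpy as np

def delta(features, p=1):
    """Regression-based delta coefficients; features has shape (frames, coeffs)."""
    features = np.asarray(features, dtype=float)
    T = len(features)
    # Repeat the first and last frame so every frame has p neighbours on each side
    padded = np.pad(features, ((p, p), (0, 0)), mode="edge")
    num = np.zeros_like(features)
    for k in range(1, p + 1):
        num += k * (padded[p + k:p + k + T] - padded[p - k:p - k + T])
    denom = 2.0 * sum(k * k for k in range(1, p + 1))
    return num / denom
```

Acceleration coefficients can be obtained by applying the same function to the delta coefficients, and the three sets are then stacked per frame, for example with `np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])`.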