
VOICE COMMAND RECOGNITION

METHODOLOGY
Block diagram: Acquire Speech Signal → Preprocessing → Feature Extraction → Feature Matching → Recognized Command
A Simple Model of Speech Production

Source-filter model of speech production:
• Voiced excitation: pulse train P(f)
• Unvoiced excitation: white noise N(f)
• Vocal tract: spectral shaping H(f)
• Lips emission: R(f)

S(f) = (v · P(f) + u · N(f)) · H(f) · R(f)

S(f) = X(f) · H(f) · R(f)
Spectral Shaping H(f)
• Changing the shape of the vocal tract changes the spectral shape of the speech signal, thus articulating different speech sounds.
• The most valuable information for a speech recognizer is contained in the way the spectral shape of the speech signal changes over time.
• Direct computation of the power spectrum from the speech signal results in a spectrum containing "ripples" caused by the excitation spectrum X(f).
• A smooth spectral shape without the ripples, representing H(f), has to be estimated.
Cepstral Transformation

S(f) = X(f) · H(f) · R(f) = H(f) · U(f)

log S(f) = log(H(f) · U(f)) = log H(f) + log U(f)
• Interpret this log spectrum as a time signal.
• The "ripples" caused by U(f) then appear as "high-frequency" components.
• Hence, by applying a kind of low-pass filtering we can obtain the smooth spectral shape.
• The inverse Fourier transform of the log spectrum brings us back to the time domain, giving the so-called cepstrum.
• Low-pass filtering is done by setting the higher-order cepstral coefficients to zero and then transforming back to the frequency domain.
• The process of filtering in the cepstral domain is called "liftering".
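As a rough illustration (not the project's LabVIEW code), a NumPy sketch of cepstral smoothing of a single frame could look like this; the frame length and the number of retained coefficients are assumptions for the example:

```python
import numpy as np

def smoothed_log_spectrum(frame, n_keep=14):
    """Cepstral smoothing: low-pass 'liftering' of the log spectrum of one frame."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # log |S(f)|
    cepstrum = np.fft.irfft(log_mag)                       # to the quefrency ("time") domain
    lifter = np.zeros_like(cepstrum)
    lifter[:n_keep] = 1.0                                  # keep low quefrencies (envelope ~ log H(f))
    lifter[-(n_keep - 1):] = 1.0                           # and their symmetric counterparts
    return np.fft.rfft(cepstrum * lifter).real             # smooth log spectral envelope
```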
Mel frequency Cepstral Coefficients
• The human ear does not show a linear frequency resolution but builds several groups of frequencies and integrates the spectral energies within a given group.
• The mid-frequencies and bandwidths of these groups are non-linearly distributed.
• This non-linear warping of the frequency axis is modeled by the mel scale, on which the frequency groups are assumed to be linearly distributed.
f_mel(f) = 2595 · log10(1 + f / 700 Hz)
• A common way to do mel-frequency warping is to use triangle-shaped filters in the spectral domain to build a weighted sum over the power spectrum coefficients which lie within each window.
• This gives a new set of coefficients known as the mel spectral coefficients.
• A cepstral transformation is performed on them to extract the Mel Frequency Cepstral Coefficients (MFCC).
• The MFCC are used directly for recognition instead of being transformed back to the frequency domain.
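A sketch of how the triangular mel filterbank and the cepstral transformation (here a DCT) could be combined into MFCC-like coefficients, assuming NumPy and SciPy; the filter count and coefficient count are illustrative, not the exact values used in the VI:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spectrum, fs=11025, n_filters=26, n_ceps=14):
    """power_spectrum: half spectrum of one windowed frame."""
    n_bins = len(power_spectrum)
    # Filter centres equally spaced on the mel scale, mapped back to FFT bins
    mel_points = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_points) / (fs / 2.0) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        if centre > left:
            fbank[i - 1, left:centre] = np.linspace(0.0, 1.0, centre - left, endpoint=False)
        if right > centre:
            fbank[i - 1, centre:right] = np.linspace(1.0, 0.0, right - centre, endpoint=False)
    mel_energies = fbank @ power_spectrum            # weighted sums = mel spectral coefficients
    return dct(np.log(mel_energies + 1e-10), norm='ortho')[:n_ceps]
```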
Feature Matching
Dynamic Time Warping (DTW)
• Distance calculation using Dynamic Time Warping
• Each utterance is divided into frames of 20 ms.
• The MFCC of each frame are computed and represented by a vector.
• Hence each utterance is represented by a vector sequence X = {x0, x1, …, xTx−1}.
• Distances between individual vectors are found using the Euclidean distance formula.
DTW Algorithm
• Finding the optimal alignment path
Key points to find the optimal path
• A grid point (i, j) in the optimal path can have the predecessors (i−1, j), (i−1, j−1) and (i, j−1).
• Bellman's principle: if Popt is the optimal path through the matrix of grid points beginning at (0, 0) and ending at (Tw−1, Tx−1), and grid point (i, j) is part of Popt, then the partial path from (0, 0) to (i, j) is also part of Popt.
• An accumulated distance matrix is created according to the formula
  D(i, j) = d(wi, xj) + min{ D(i−1, j), D(i−1, j−1), D(i, j−1) }
• The accumulated distance at the point (Tw−1, Tx−1) is the distance between the vector sequences W and X.
Iteration steps in finding the optimal path
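A compact NumPy sketch of this iteration, using Euclidean local distances and the three allowed predecessors (a reference implementation in Python, not the DTW sub VI itself):

```python
import numpy as np

def dtw_distance(W, X):
    """W, X: vector sequences with shapes (Tw, D) and (Tx, D)."""
    W, X = np.asarray(W), np.asarray(X)
    Tw, Tx = len(W), len(X)
    d = np.linalg.norm(W[:, None, :] - X[None, :, :], axis=-1)  # local Euclidean distances d(i, j)
    D = np.full((Tw, Tx), np.inf)                               # accumulated distance matrix
    D[0, 0] = d[0, 0]
    for i in range(Tw):
        for j in range(Tx):
            if i == 0 and j == 0:
                continue
            preds = []
            if i > 0:
                preds.append(D[i - 1, j])
            if j > 0:
                preds.append(D[i, j - 1])
            if i > 0 and j > 0:
                preds.append(D[i - 1, j - 1])
            D[i, j] = d[i, j] + min(preds)                      # Bellman recursion
    return D[Tw - 1, Tx - 1]                                    # distance between W and X
```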
VOICE COMMAND RECOGNITION VI
Front Panel
Block Diagram
Step 1: Acquiring the Speech Signal
• The input speech signal is acquired using the LabVIEW "Acquire Sound Express VI" for 3 s at a sampling rate of 11025 Hz.
• An array of LEDs on the front panel indicates the progress of acquisition.
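The acquisition itself is done by the Express VI; for readers outside LabVIEW, a roughly equivalent sketch in Python, assuming the third-party sounddevice package, would be:

```python
import sounddevice as sd

FS = 11025        # sampling rate used in the VI
DURATION = 3.0    # seconds of audio to record

# Record 3 s of mono audio and block until the recording finishes
signal = sd.rec(int(DURATION * FS), samplerate=FS, channels=1, dtype='float64')
sd.wait()
signal = signal.ravel()
```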
Step 2: Pre-processing
• Preprocessing of the input speech signal consists of the following steps.
Block Diagram of the Preprocessing sub VI
2.1 Pre-Emphasis

• The goal of pre-emphasis is to compensate for the high-frequency part that was suppressed during the human sound production mechanism. The speech signal is therefore passed through an FIR high-pass filter, which increases the magnitude of the higher frequencies with respect to the other frequencies and hence improves the overall signal-to-noise ratio.

Y[n] = X[n] − 0.95 · X[n−1]
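As a minimal sketch, the same FIR high-pass filter in NumPy:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """Y[n] = X[n] - 0.95 * X[n-1]; the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```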
2.2 Framing

• The input speech signal is segmented into small frames of 20 ms length with 50% overlap with the adjoining frames to create continuity.
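A sketch of the framing step for a 1-D NumPy signal, using the 20 ms frame length and 50% overlap stated above:

```python
import numpy as np

def make_frames(signal, fs=11025, frame_ms=20, overlap=0.5):
    frame_len = int(fs * frame_ms / 1000)    # 20 ms -> 220 samples at 11025 Hz
    hop = int(frame_len * (1 - overlap))     # 50% overlap -> advance by half a frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
```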
2.3 Windowing

• Each frame is multiplied by the Hamming window in the time domain. This helps to reduce the discontinuity at the start and end of each frame.

w(n) = 0.54 − 0.46 · cos(2πn / (N−1)),  0 ≤ n ≤ N−1
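Applying the window to every frame is then a single multiplication; a sketch using NumPy's built-in Hamming window:

```python
import numpy as np

def apply_hamming(frames):
    """Multiply each 20 ms frame by a Hamming window to soften its edges."""
    window = np.hamming(frames.shape[1])   # 0.54 - 0.46*cos(2*pi*n/(N-1))
    return frames * window
```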
2.4 Noise threshold detection
• To detect the start of the utterance within the 3 s input speech signal, the energy of each frame of the input signal is calculated and stored in an array. The size of the energy array equals the total number of frames. The energy array is sorted in ascending order, and the mean of its first 15 elements gives the noise energy. The threshold is set to 10 times the noise energy.
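A sketch of that threshold estimation, assuming the windowed frames are stacked in a 2-D array:

```python
import numpy as np

def noise_threshold(frames, n_quiet=15, factor=10.0):
    """Estimate the noise energy from the quietest frames and derive the threshold."""
    energy = np.sum(frames ** 2, axis=1)                 # energy of each frame
    noise_energy = np.mean(np.sort(energy)[:n_quiet])    # mean of the 15 lowest energies
    return energy, factor * noise_energy                 # threshold = 10 x noise energy
```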
2.5 Utterance detection

• Once the threshold has been calculated, all the elements in the energy array which are greater than the threshold are replaced by 1 and the rest by 0, giving a Boolean array.
• Sometimes spikes due to external noise cross the threshold and contribute 1s to the Boolean array. To remove these spikes, a Median Filter VI in LabVIEW with left and right rank 3 is used. The median filter replaces the i-th element of the Boolean array with the median of the elements {i−3, i−2, i−1, i, i+1, i+2, i+3}, thereby smoothing the Boolean array.
• The Peak Detector VI in LabVIEW is then used to find the indices of the start and end of the utterance. Using these indices, the frames containing the utterance are extracted.
• N.B.: In this project, all commands were less than 0.6 s long. Sometimes noise spikes remained even after the median filter, so the ending index was not always detected accurately. The start index, however, was detected accurately most of the time, so 0.6 s of sound after the start index was extracted instead.
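A sketch of this detection logic, with SciPy's median filter standing in for the LabVIEW Median Filter VI (left and right rank 3 ≈ kernel size 7); the hop size is an assumption matching the framing sketch above:

```python
import numpy as np
from scipy.signal import medfilt

def find_utterance(frames, energy, threshold, fs=11025, hop=110, max_dur=0.6):
    active = (energy > threshold).astype(float)   # Boolean array: 1 where the frame exceeds the threshold
    active = medfilt(active, kernel_size=7)       # smooth out isolated noise spikes
    start = int(np.argmax(active))                # index of the first active frame
    n_frames = int(max_dur * fs / hop)            # keep 0.6 s worth of frames after the start
    return frames[start : start + n_frames]
```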
Step 3: Feature Extraction
Block Diagram of Feature Extraction VI
• An FFT is performed on each frame of the utterance and half of the spectrum is taken.
• The spectrum of each frame is warped onto the mel scale to obtain the mel spectral coefficients.
• A discrete cosine transform is performed on the mel spectral coefficients of each frame, giving the MFCC.
• The first 2 MFCC are removed, as they varied significantly between different utterances of the same word.
• Liftering is done by replacing all MFCC except the first 14 by zero.
• The first MFCC of each frame is replaced by the log energy of that frame.
• Delta and acceleration coefficients are computed from the MFCC to increase the dimension of the feature vector of each frame, thereby increasing the accuracy.
• Delta coefficients are found from the following equation, with p chosen as 1:
  Δc(t) = (c(t+p) − c(t−p)) / 2p
• Acceleration coefficients are found by replacing the MFCC in the above equation with the delta coefficients.
• The feature vector is normalized by subtracting its mean from each element.
• Thus each frame of the utterance is converted into a feature vector of dimension 35.
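A sketch of the delta, acceleration and normalization steps, assuming the MFCC of an utterance are stacked as a (frames × coefficients) array and using the simple p = 1 difference; whether the mean is removed per coefficient or per vector is an assumption here:

```python
import numpy as np

def add_deltas(mfcc, p=1):
    """Append delta and acceleration coefficients, then remove the mean."""
    padded = np.pad(mfcc, ((p, p), (0, 0)), mode='edge')
    delta = (padded[2 * p:] - padded[:-2 * p]) / (2.0 * p)       # delta coefficients
    padded_d = np.pad(delta, ((p, p), (0, 0)), mode='edge')
    accel = (padded_d[2 * p:] - padded_d[:-2 * p]) / (2.0 * p)   # acceleration coefficients
    features = np.hstack([mfcc, delta, accel])                   # stacked feature vectors (35-dim in the VI)
    return features - features.mean(axis=0)                      # subtract the mean per coefficient
```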
Step 4: Feature Matching
• A dictionary with six sets has been created.
• In each set, the feature vector sequences of the words to be recognized are stored.
• The feature sequence of the test utterance is compared with each word in the sets using DTW, and the best match in each set is output.
• The mode of all six sets is taken as the recognized command.
• A threshold is set so that a random speech signal does not result in a match with the commands.
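A sketch of this matching stage; the dictionary layout (six sets, each a word-to-feature-sequence mapping) and the dtw_distance helper from the DTW sketch above are assumptions for the example:

```python
from collections import Counter

def recognize(test_features, dictionary, reject_threshold):
    """dictionary: list of six sets, each mapping word -> stored feature vector sequence."""
    votes = []
    for word_set in dictionary:
        scores = {word: dtw_distance(ref, test_features) for word, ref in word_set.items()}
        best_word = min(scores, key=scores.get)
        if scores[best_word] < reject_threshold:       # threshold rejects random speech
            votes.append(best_word)
    if not votes:
        return None                                    # nothing matched any command
    return Counter(votes).most_common(1)[0][0]         # mode of the per-set best matches
```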
Front Panel of the Dictionary sub VI
Block Diagram of the Dictionary sub VI
Block Diagram of the DTW sub VI
Video Demonstration of the VI
http://www.youtube.com/watch?v=aEqa-t_TWiY
Limitations
• Environment Dependent
The input speech feature vector is compared with a set of feature vectors in the dictionary which were recorded in a particular environment. When used in a different environment, the efficiency decreases unless the threshold and the dictionary are updated accordingly.
• Speaker Dependent
As the dictionary is trained by a particular user, the VI outputs consistent results only when used by that trainer.
Questions..?
Thank You
