INTRODUCTION
Automatic Speech Recognition
Automatic speech recognition is the process by which a computer maps an acoustic
speech signal to text. Automatic speech understanding is the process by which a computer
maps an acoustic speech signal to some form of abstract meaning of the speech.
What does speaker dependent / adaptive / independent mean?
A speaker-dependent system is developed to operate for a single speaker. These systems are usually easier to develop, cheaper to buy, and more accurate than, but not as flexible as, speaker-adaptive or speaker-independent systems.
A speaker-independent system is developed to operate for any speaker of a particular type (e.g. American English). These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker-dependent systems. However, they are more flexible.
A speaker-adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between that of speaker-independent and speaker-dependent systems.
What does continuous speech or isolated-word mean?
An isolated-word system operates on single words at a time, requiring a pause between each spoken word. This is the simplest form of recognition to perform because the end points are easier to find and the pronunciation of a word tends not to affect that of others. Thus, because the occurrences of words are more consistent, they are easier to recognize.
A continuous speech system operates on speech in which words are connected together, i.e. not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is coarticulation: the production of each phoneme is affected by the production of surrounding phonemes, and similarly the start and end of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder).
How is speech recognition performed?
A wide variety of techniques are used to perform speech recognition. There are many types of speech recognition systems, operating at many levels of recognition, analysis, and understanding.
Typically speech recognition starts with the digital sampling of speech. The next stage is
acoustic signal processing. Most techniques include spectral analysis; e.g. LPC analysis
(Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea
modelling and many more.
The next stage is recognition of phonemes, groups of phonemes and words. This stage
can be achieved by many processes such as DTW (Dynamic Time Warping), HMM
(hidden Markov modelling), NNs (Neural Networks), expert systems and combinations
of techniques. HMM-based systems are currently the most commonly used and most
successful approach. Most systems utilize some knowledge of the language to aid the
recognition process.
Some systems try to "understand" speech. That is, they try to convert the words into a
representation of what the speaker intended to mean or achieve by what they said.
This is a simple recognizer that should give you 85%+ recognition accuracy. The accuracy is a function of the words in your vocabulary. Long, distinct words are easy; short, similar words are hard. You can get 98%+ on the digits with this recognizer.
Overview:
Recognize an unknown word by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown. Many variations upon this theme can be made to improve performance. Try different filterings of the raw signal and different processing methods.
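The overview above is nearest-neighbor template matching. A minimal sketch of the idea (in Python/NumPy for illustration rather than the Matlab used later in this document; the feature vectors and labels are hypothetical stand-ins for processed speech patterns):

```python
import numpy as np

def recognize(unknown, templates):
    """Return the label of the training template closest to the unknown
    pattern (nearest-neighbor classification by Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label, pattern in templates.items():
        dist = np.linalg.norm(unknown - pattern)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy "training set": one processed feature vector per word.
templates = {"yes": np.array([1.0, 0.2]), "no": np.array([0.1, 0.9])}
print(recognize(np.array([0.9, 0.3]), templates))  # "yes"
```

Real systems compare whole sequences of feature vectors (e.g. with DTW) rather than single fixed-length vectors, but the decision rule is the same: return the closest training pattern.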
Automatic speech recognition and speaker verification are among the most challenging problems of modern man-machine interaction. Among their numerous useful applications is a future checkless society in which financial transactions are executed over the telephone and signed by voice; access to confidential data can be made secure by speaker verification. Other applications include voice information and reservation systems covering a wide spectrum of human activities, from travel and study to purchasing and partner matching. In these applications, spoken requests (over the telephone, say) are understood by machines and answered by synthesized voice. Voice control of computers and spacecraft (and of machines in general whose operators have limited use of their hands) is a long-standing aspiration. Activation by voice could be particularly beneficial for the severely handicapped who have lost one or several limbs. The surgeon in the middle of an operation, needing the latest medical information, is another instance where only the acoustic channel is still fully available for requesting and receiving urgently required advice. The entering of manuscripts by voice may supplant much present paper pushing and mouse play at graphics terminals.
The potential applications of speech and speaker recognition are boundless. As early as 1944, speaker identification was used successfully by the Allies to trace the movements of German combat units by analyzing spectrograms of enemy voice traffic. Remarkably, the human ear is often able to identify a telephone caller on the basis of a simple "hello" or just the clearing of a throat. But the difficulties of recognition by machine can be staggering. Even if we forgo automatic accent classification, and even if we persuade the banks to live with less than perfection in voice signatures (which they really do not need, considering the large number of unsigned or falsely signed checks that clear the system every day), reliable voice recognition from a large pool of potential speakers on the basis of their speech alone will remain problematic for years to come. And, as is widely appreciated by now, the automatic recognition of anything but isolated words from a limited vocabulary, spoken by known speakers, presents formidable difficulties. Decades of painstaking (and painful) research have shown that purely technical advances will yield, at best, limited improvements, far short of what is child's play for the human mind.
All speaker recognition systems serve two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operation or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.
Speaker recognition is a difficult task and is still an active research area. Automatic speaker recognition is based on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speakers themselves: speech signals in the training and testing sessions can differ greatly due to many factors, such as voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments.
A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. The sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.
4.2 Windowing
The next step in the processing is to window each individual frame so as to minimize the
signal discontinuities at the beginning and end of each frame. The concept here is to
minimize the spectral distortion by using the window to taper the signal to zero at the
beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, then the result of windowing is the signal
y_l(n) = x_l(n) w(n),  0 <= n <= N-1
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:
X_n = sum_{k=0}^{N-1} x_k exp(-j*2*pi*k*n/N),  n = 0, 1, 2, ..., N-1
Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1. Here, Fs denotes the sampling frequency. The result obtained after this step is often referred to as the signal's spectrum or periodogram.
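The chain described so far (frame blocking, windowing, FFT) can be sketched as follows. This is a Python/NumPy illustration with assumed values for the sampling rate, frame length N, and frame increment M, not the Matlab code the project itself uses:

```python
import numpy as np

fs = 16000                      # assumed sampling rate (Hz)
signal = np.random.randn(fs)    # stand-in for one second of speech
N, M = 256, 100                 # assumed frame length and frame increment

# Frame blocking: slice the signal into overlapping frames (one per row).
starts = range(0, len(signal) - N + 1, M)
frames = np.stack([signal[s:s + N] for s in starts])

# Windowing: taper each frame with a Hamming window to reduce the
# discontinuities at the frame edges, then FFT each windowed frame.
windowed = frames * np.hamming(N)
spectrum = np.fft.fft(windowed, axis=1)   # complex, one row per frame
power = np.abs(spectrum) ** 2             # periodogram per frame

print(frames.shape, power.shape)
```

Each row of `power` is the periodogram of one short-time frame, which is exactly the input the mel filter bank below operates on.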
4.3 Mel-frequency Wrapping
One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component (see Figure 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.
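The constant mel-frequency interval mentioned here is usually derived from the standard approximate mapping from frequency f in Hz to pitch in mels (this formula is not given in the text above, but it is the one commonly used in the MFCC literature):

```latex
\mathrm{mel}(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this mapping, filters spaced at a constant mel interval are approximately linearly spaced below 1 kHz and logarithmically spaced above it, which matches the triangular filter bank described above.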
Note that this filter bank is applied in the frequency domain; therefore it simply amounts to taking those triangle-shaped windows in Figure 4 on the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
4.4 Cepstrum
In this final step, we convert the log Mel spectrum back to time. The result is called the
Mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech
spectrum provides a good representation of the local spectral properties of the signal for
the given frame analysis. Because the Mel spectrum coefficients (and so their logarithm)
are real numbers, we can convert them to the time domain using the Discrete Cosine
Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCCs, c_n, as
c_n = sum_{k=1}^{K} (log S_k) cos[ n (k - 1/2) pi / K ],  n = 1, 2, ..., K
Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
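Converting one frame's mel power spectrum to cepstral coefficients with the DCT can be sketched as follows (a Python/NumPy illustration; the function name and the number of retained coefficients are arbitrary choices, not part of the project code):

```python
import numpy as np

def mel_cepstrum(mel_power, n_coeffs=12):
    """Convert one frame of mel power-spectrum coefficients S_k,
    k = 1..K, to mel-frequency cepstral coefficients via the DCT;
    c_0 is dropped since it only carries the frame's mean level."""
    K = len(mel_power)
    k = np.arange(K)            # k = 0..K-1 stands for the 1..K of the text
    log_s = np.log(mel_power)
    # c_n = sum_k log(S_k) * cos(n * (k + 1/2) * pi / K), n = 1..n_coeffs
    n = np.arange(1, n_coeffs + 1)[:, None]
    return (log_s * np.cos(n * (k + 0.5) * np.pi / K)).sum(axis=1)

coeffs = mel_cepstrum(np.ones(20))   # flat spectrum: log S_k = 0 everywhere
print(np.allclose(coeffs, 0.0))      # True
```

A flat mel spectrum gives all-zero cepstral coefficients, which is a quick sanity check that only the shape of the log spectrum, not its overall level, survives in c_1..c_n.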
5. Feature Matching
The problem of speaker recognition belongs to a much broader topic in science and engineering known as pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case are sequences of acoustic vectors that are extracted from an input speech signal using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching.
Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly
our case since during the training session, we label each input speech with the ID of the
speaker (S1 to S8). These patterns comprise the training set and are used to derive a
classification algorithm. The remaining patterns are then used to test the classification
algorithm; these patterns are collectively referred to as the test set. If the correct classes
of the individual patterns in the test set are also known, then one can evaluate the
performance of the algorithm.
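Evaluating the classifier on a labeled test set, as described above, reduces to counting matches between predicted and known classes (a small Python sketch with hypothetical speaker labels):

```python
def accuracy(predicted, actual):
    """Fraction of test patterns whose predicted class matches the known class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical results for four test utterances from speakers S1..S4.
print(accuracy(["S1", "S2", "S3", "S3"], ["S1", "S2", "S3", "S4"]))  # 0.75
```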
The state-of-the-art in feature matching techniques used in speaker recognition includes
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector
Quantization (VQ). In this project, the VQ approach will be used, due to ease of
implementation and high accuracy. VQ is a process of mapping vectors from a large
vector space to a finite number of regions in that space. Each region is called a cluster
and can be represented by its center called a codeword. The collection of all codewords is
called a codebook.
Figure 5 shows a conceptual diagram to illustrate this recognition process. In the figure,
only two speakers and two dimensions of the acoustic space are shown. The circles refer
to the acoustic vectors from the speaker 1 while the triangles are from the speaker 2. In
the training phase, a speaker-specific VQ codebook is generated for each known speaker
by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 5 by black circles and black triangles for speakers 1 and 2, respectively.
The distance from a vector to the closest codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of an unknown voice is vector-quantized using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.
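The decision rule just described, vector-quantizing the utterance with each speaker's codebook and picking the smallest total distortion, can be sketched as follows (a Python/NumPy illustration; the codebooks and utterance are toy values, not project data):

```python
import numpy as np

def total_distortion(vectors, codebook):
    """Sum, over all input vectors, of the distance to the nearest codeword."""
    # dists[i, j] = Euclidean distance from vector i to codeword j
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def identify(vectors, codebooks):
    """Return the speaker whose codebook gives the smallest total distortion."""
    return min(codebooks, key=lambda s: total_distortion(vectors, codebooks[s]))

# Two toy speaker codebooks (rows are codewords) and a short utterance.
codebooks = {"S1": np.array([[0.0, 0.0], [1.0, 1.0]]),
             "S2": np.array([[5.0, 5.0], [6.0, 6.0]])}
utterance = np.array([[0.1, 0.1], [0.9, 1.1]])
print(identify(utterance, codebooks))   # "S1"
```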
5.1 Clustering the Training Vectors
After the enrollment session, the acoustic vectors extracted from the input speech of a speaker
provide a set of training vectors. As described above, the next important step is to build a
speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors
(hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codebook entry y_n according to the rule
y_n+ = y_n (1 + ε)
y_n- = y_n (1 - ε)
where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current
codebook that is closest (in terms of similarity measurement), and assign that vector to
the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training
vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by
designing a 1-vector codebook, then uses a splitting technique on the codewords to
initialize the search for a 2-vector codebook, and continues the splitting process until the
desired M-vector codebook is obtained.
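The numbered steps above can be sketched as follows. This is a Python/NumPy illustration of the LBG procedure, not the Matlab vqlbg function the project asks you to write, and the splitting parameter and stopping tolerance are assumed values:

```python
import numpy as np

def lbg(data, M, eps=0.01, tol=1e-3):
    """Cluster the rows of `data` into an M-codeword codebook by
    repeated splitting (step 2) and nearest-neighbor / centroid
    refinement (steps 3-5), as in the LBG algorithm."""
    codebook = data.mean(axis=0, keepdims=True)        # step 1: 1-vector codebook
    while len(codebook) < M:
        codebook = np.vstack([codebook * (1 + eps),    # step 2: split each
                              codebook * (1 - eps)])   # codeword y_n in two
        prev = np.inf
        while True:
            # step 3: assign each training vector to its nearest codeword
            d = np.linalg.norm(data[:, None] - codebook[None], axis=2)
            assign = d.argmin(axis=1)
            # step 4: move each codeword to the centroid of its cell
            codebook = np.stack([data[assign == i].mean(axis=0)
                                 if np.any(assign == i) else codebook[i]
                                 for i in range(len(codebook))])
            # step 5: stop once the average distortion stops improving
            dist = d.min(axis=1).mean()
            if prev - dist < tol:
                break
            prev = dist
    return codebook

# Toy data: two well-separated clusters should yield two centroids.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, .1, (50, 2)), rng.normal(3, .1, (50, 2))])
print(lbg(data, 2).round(1))
```

With two tight clusters near (0,0) and (3,3), the 2-codeword codebook should land near those cluster centers.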
Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. Cluster
vectors is the nearest-neighbor search procedure which assigns each training vector to a
cluster associated with the closest codeword. Find centroids is the centroid update procedure. Compute D (distortion) sums the distances of all training vectors in the
nearest-neighbor search so as to determine whether the procedure has converged.
Figure 6. Flow diagram of the LBG algorithm (Adapted from Rabiner and Juang, 1993)
6. Implementation
As stated above, in this project we will experiment with building and testing an automatic speaker recognition system. In order to implement such a system, one must go through the several steps described in detail in the previous sections. Note that many of these tasks are already implemented in Matlab. Furthermore, to ease the development process, we have supplied you with two utility functions, melfb and disteu, and two main functions, train and test. Download all of those files into your working folder. The first two files can be treated as black boxes, but the latter two need to be thoroughly understood. In fact, your tasks are to write two missing functions, mfcc and vqlbg, which will be called from the given main functions. In order to accomplish that, follow each step in this section carefully and answer all the questions.
Speech Data
Click here to download the ZIP file of the speech database. After unzipping the file, you will find two folders, TRAIN and TEST, each containing 8 files named S1.WAV, S2.WAV, ..., S8.WAV; each is labeled after the ID of the speaker. These files were recorded in Microsoft WAV format. On Windows systems, you can listen to the recorded sounds by double-clicking the files.
Our goal is to train a voice model (or more specifically, a VQ codebook in the MFCC vector
space) for each speaker S1 -S8 using the corresponding sound file in the TRAIN folder.
After this training step, the system would have knowledge of the voice characteristic of
each (known) speaker. Next, in the testing phase, the system will be able to identify the
(assumed unknown) speaker of each sound file in the TEST folder.
Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of those eight speakers? Now play each sound in the TEST folder in a random order without looking at the file names (pretending that you do not know the speakers) and try to identify each speaker using your knowledge of their voices that you just gained from the TRAIN folder. This is exactly what the computer will do in our system. What is your (human) recognition rate? Record this result so that it can later be compared against the computer performance of our system.
Speech Processing
In this phase you are required to write a Matlab function that reads a sound file and turns it into a sequence of MFCCs (acoustic vectors) using the speech processing steps described previously. Many of those tasks are already provided by either standard or our supplied Matlab functions. The Matlab functions that you will need are: wavread, hamming, fft, dct and melfb (supplied function). Type help <function name> at the Matlab prompt for more information about a function.
Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab
using the function: sound. What is the sampling rate? What is the highest frequency that
the recorded sound can capture with fidelity? With that sampling rate, how many msecs
of actual speech are contained in a block of 256 samples?
Plot the signal to view it in the time domain. It should be obvious that the raw data in the time domain contains a very large amount of data and is difficult to analyze directly for voice characteristics. So the motivation for this step (speech feature extraction) should be clear now!
Now cut the speech signal (a vector) into frames with overlap (refer to the frame section in the theory part). The result is a matrix where each column is a frame of N samples from the original speech signal. Apply the steps Windowing and FFT to transform the signal into the frequency domain. This process is used in many different applications and is referred to in the literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT). The result is often called the spectrum or periodogram.
Question 3: After successfully running the preceding process, what is the interpretation of
the result? Compute the power spectrum and plot it out using the imagesc command.
Note that it is better to view the power spectrum on the log scale. Locate the region in the
plot that contains most of the energy. Translate this location into the actual ranges in time
(msec) and frequency (in Hz) of the input speech signal.
Question 4: Compute and plot the power spectrum of a speech file using different frame
size: for example N = 128, 256 and 512. In each case, set the frame increment M to be
about N/3. Can you describe and explain the differences among those spectra?
The last step in speech processing is converting the power spectrum into mel-frequency
cepstrum coefficients. The supplied function melfb facilitates this task.
Question 5: Type help melfb at the Matlab prompt for more information about this
function. Follow the guidelines to plot out the mel-spaced filter bank. What is the
behavior of this filter bank? Compare it with the theoretical part.
Finally, put all the pieces together into a single Matlab function, mfcc, which performs
the MFCC processing.
Question 6: Compute and plot the spectrum of a speech file before and after the mel-frequency wrapping step. Describe and explain the impact of the melfb program.
Vector Quantization
The result of the last section is that we have transformed speech signals into vectors in an acoustic space. In this section, we will apply the VQ-based pattern recognition technique to build speaker reference models from those vectors in the training phase, and then identify any sequence of acoustic vectors uttered by an unknown speaker.
Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two
dimensions (say the 5th and the 6th) and plot the data points in a 2D plane. Use acoustic
vectors of two different speakers and plot data points in two different colors. Do the data
regions from the two speakers overlap each other? Are they in clusters?
Now write a Matlab function, vqlbg, that trains a VQ codebook using the LBG algorithm described before. Use the supplied utility function disteu to compute the pairwise Euclidean distances between the codewords and training vectors in the iterative process.
Question 8: Plot the data points of the trained VQ codewords using the same two dimensions over the plot from the last question. Compare this with Figure 5.
9. APPENDIX
INTRODUCTION TO MATLAB
MATLAB (short for Matrix Laboratory) is a special-purpose computer program optimized to perform engineering and scientific calculation. It started life as a program designed to perform matrix mathematics, but over the years it has grown into a flexible computing system capable of solving essentially any technical problem. The MATLAB program implements the MATLAB programming language and provides an extensive library of predefined functions that make technical programming tasks easier and more efficient.
MATLAB is a huge program, with an incredibly rich variety of functions. Even the basic version of MATLAB without any toolkits is much richer than many other technical programming languages. There are more than 1000 functions in the basic MATLAB product alone, and the toolkits extend this capability with many more functions in various specialties.
Advantages of MATLAB:
MATLAB has many advantages compared with conventional computer languages for technical problem solving. Among them are the following:
Ease of use: MATLAB is an interpreted language, like many versions of BASIC. The program can be used as a scratch pad to evaluate expressions typed at the command line, or it can be used to execute large prewritten programs. Programs may be easily written and modified with the built-in integrated development environment and debugged with the MATLAB debugger.
Platform independence: MATLAB is supported on many different computer systems, providing a large measure of platform independence.
Predefined functions: MATLAB comes with an extensive library of predefined functions that provide tested and prepackaged solutions to many basic technical tasks. In addition to the basic MATLAB language, many special-purpose toolboxes are available to help solve complex problems in specific areas.
Device-independent plotting: MATLAB has many integral plotting and imaging commands. The plots and images can be displayed on any graphical output device supported by the computer on which MATLAB is running. This capability makes MATLAB an outstanding tool for visualizing technical data.
FFT(X) is the discrete Fourier transform (DFT) of vector X. For matrices, the FFT
operation is applied to each column. For N-D arrays, the FFT operation operates on the
first non-singleton dimension.
FFT(X,N) is the N-point FFT, padded with zeros if X has less
than N points and truncated if it has more.
FFT(X,[],DIM) or FFT(X,N,DIM) applies the FFT operation across the dimension
DIM.
For a length N input vector x, the DFT is a length N vector X, with elements
X(k) = sum_{n=1}^{N} x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), 1 <= k <= N.
The inverse DFT (computed by IFFT) is given by
x(n) = (1/N) sum_{k=1}^{N} X(k)*exp( j*2*pi*(k-1)*(n-1)/N), 1 <= n <= N.
Y=WAVREAD(FILE) reads a WAVE file specified by the string FILE, returning the
sampled data in Y. The ".wav" extension is appended if no extension is given.
Amplitude values are in the range [-1,+1].
[Y,FS,NBITS]=WAVREAD(FILE) returns the sample rate (FS) in Hertz and the
number of bits per sample (NBITS) used to encode the data in the file.
[...]=WAVREAD(FILE,N) returns only the first N samples from each channel in the
file.
[...]=WAVREAD(FILE,[N1 N2]) returns only samples N1 through N2 from each
channel in the file.
SIZ=WAVREAD(FILE,'size') returns the size of the audio data contained in the file in
place of the actual audio data, returning the vector SIZ=[samples channels].
[Y,FS,NBITS,OPTS]=WAVREAD(...) returns a structure OPTS of additional
information contained in the WAV file. The content of this structure differs from file to
file. Typical structure fields include '.fmt' (audio format information) and '.info' (text which may describe subject title, copyright, etc.)
Supports multi-channel data, with up to 32 bits per sample.
DISP(X) displays the array, without printing the array name. In all other ways it's the
same as leaving the semicolon off an expression except that empty arrays don't display.
If X is a string, the text is displayed.
AXIS([XMIN XMAX YMIN YMAX]) sets scaling for the x- and y-axes on the current
plot.
AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX]) sets the scaling for the x-, y- and
z-axes on the current 3-D plot.
AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX CMIN CMAX]) sets the scaling
for the x-, y-, z-axes and color scaling limits on the current axis (see CAXIS).
V = AXIS returns a row vector containing the scaling for the current plot. If the current
view is 2-D, V has four components; if it is 3-D, V has six components.
AXIS AUTO returns the axis scaling to its default, automatic
mode where, for each dimension, 'nice' limits are chosen based on the extents of all
line, surface, patch, and image children.
AXIS MANUAL freezes the scaling at the current limits, so that if HOLD is turned on,
subsequent plots will use the same limits.
AXIS TIGHT sets the axis limits to the range of the data.
AXIS IJ puts MATLAB into its "matrix" axes mode. The coordinate system origin is
at the upper left corner. The i axis is vertical and is numbered from top to bottom. The
j axis is horizontal and is numbered from left to right.
AXIS XY puts MATLAB into its default "Cartesian" axes mode. The coordinate
system origin is at the lower left corner. The x axis is horizontal and is numbered from
left to right. The y axis is vertical and is numbered from bottom to top.
AXIS EQUAL sets the aspect ratio so that equal tick mark increments on the x-,y- and
z-axis are equal in size. This makes SPHERE(25) look like a sphere, instead of an
ellipsoid.
AXIS IMAGE is the same as AXIS EQUAL except that the plot box fits tightly around
the data.
AXIS SQUARE makes the current axis box square in size.
AXIS NORMAL restores the current axis box to full size and
removes any restrictions on the scaling of the units.
This undoes the effects of AXIS SQUARE and AXIS EQUAL.
AXIS VIS3D freezes aspect ratio properties to enable rotation of 3-D objects and
overrides stretch-to-fill.
AXIS OFF turns off all axis labeling, tick marks and background.
AXIS ON turns axis labeling, tick marks and background back on.
IMAGESC(...) is the same as IMAGE(...) except the data is scaled to use the full
colormap.
IMAGESC(...,CLIM) where CLIM = [CLOW CHIGH] can specify the scaling.
COLORBAR('vert') appends a vertical color scale to the current axes.
V = GET(H, 'Default<ObjectType>')
V = GET(H, 'Default<ObjectType><PropertyName>')
returns information about default property values (H must be scalar). 'Default' returns a list of all default property values currently set on H. 'Default<ObjectType>' returns only the defaults for properties of <ObjectType> set on H.
Defaults can not be queried on a descendant of the object, or on the object itself; for example, a value for 'DefaultAxesColor' can not be queried on an axes or an axes child, but can be queried on a figure or on the root. When using the 'Factory' or 'Default' GET, if PropertyName is omitted then the return value will take the form of a structure in which each field name is a property name and the corresponding value is the value of that property. If PropertyName is specified then a matrix or string value will be returned.
SET(H,'PropertyName',PropertyValue,...) sets multiple property values with a single statement. Note that it is permissible to use property/value string pairs, structures, and property/value cell array pairs in the same call to SET.
A = SET(H, 'PropertyName')
SET(H,'PropertyName')
returns or displays the possible values for the specified property of the object with
handle H. The returned array is a cell array of possible value strings or an empty cell
array if the property does not have a finite set of possible string values.
A = SET(H)
SET(H)
returns or displays all property names and their possible values for the object with
handle H. The return value is a structure whose field names are the property names of
H, and whose values are cell arrays of possible property values or empty cell arrays.
The default value for an object property can be set on any of an object's ancestors by
setting the PropertyName formed by concatenating the string 'Default', the object type,
and the property name. For example, to set the default color of text objects to red in
the current figure window:
set(gcf,'DefaultTextColor','red')
Defaults can not be set on a descendant of the object, or on the
object itself - for example, a value for 'DefaultAxesColor' can not be set on an axes or
an axes child, but can be set on a figure or on the root.
Three strings have special meaning for PropertyValues:
'default' - use default value (from nearest ancestor) 'factory' - use factory default value
'remove' - remove default value.
D = SIZE(X), for an M-by-N matrix X, returns the two-element row vector D = [M, N] containing the number of rows and columns in the matrix. For N-D arrays, SIZE(X) returns a 1-by-N vector of dimension lengths. Trailing singleton dimensions are ignored.
[M,N] = SIZE(X) for matrix X, returns the number of rows and columns in X as
separate output variables.
[M1,M2,M3,...,MN] = SIZE(X) returns the sizes of the first N dimensions of array X.
Use an integer conversion specifier such as %d only if the values are integral; otherwise, non-integral MATLAB values will be output using the %e format. The following non-standard subtype specifiers are supported for the conversion characters o, u, x, and X.
Examples:
sprintf('%0.5g',(1+sqrt(5))/2)   % 1.618
sprintf('%0.5g',1/eps)           % 4.5036e+15
sprintf('%15.5f',1/eps)          % 4503599627370496.00000
sprintf('%d',round(pi))          % 3
sprintf('%s','hello')            % hello
ABS(X) is the absolute value of the elements of X. When X is complex, ABS(X) is the
complex modulus (magnitude) of the elements of X.
PLOT: Linear plot. PLOT(X,Y) plots vector Y versus vector X. PLOT(Y) plots the columns of Y versus their index. If Y is complex, PLOT(Y) is equivalent to PLOT(real(Y),imag(Y)). Various line types, plot symbols and colors may be obtained with PLOT(X,Y,S) where S is a character string made from one element from any or all of the following three columns:
b  blue      .  point            -   solid
g  green     o  circle           :   dotted
r  red       x  x-mark           -.  dashdot
c  cyan      +  plus             --  dashed
m  magenta   *  star
y  yellow    s  square
k  black     d  diamond
             v  triangle (down)
             ^  triangle (up)
             <  triangle (left)
             >  triangle (right)
             p  pentagram
             h  hexagram
For example, PLOT(X,Y,'c+:') plots a cyan dotted line with a plus at each data point; PLOT(X,Y,'bd') plots a blue diamond at each data point but does not draw any line.
PLOT(X1,Y1,S1,X2,Y2,S2,X3,Y3,S3,...) combines the plots defined by the (X,Y,S)
triples, where the X's and Y's are vectors or matrices and the S's are strings.
For example, PLOT(X,Y,'y-',X,Y,'go') plots the data twice, with a solid yellow line
interpolating green circles at the data points. The PLOT command, if no color is
specified, makes automatic use of the colors specified by the axes ColorOrder property.
The default ColorOrder is listed in the table above for color systems where the default
is blue for one line, and for multiple lines, to cycle through the first six colors in the
table. For monochrome systems, PLOT cycles over the axes LineStyleOrder property.
PLOT returns a column vector of handles to LINE objects, one handle per line.
SUBPLOT(111) sets up the figure so that the next graphics command executes CLF RESET in the figure
(deleting all children of the figure), and creates a new axes in the default position. This
syntax does not return a handle, so it is an error to specify a return argument. The
delayed CLF RESET is accomplished by setting the figure's NextPlot to 'replace'.
HOLD ON holds the current plot and all axis properties so that subsequent graphing
commands add to the existing graph.
HOLD OFF returns to the default mode whereby PLOT commands erase the previous
plots and reset all axis properties before drawing new plots.
HOLD, by itself, toggles the hold state.
HOLD does not affect axis autoranging properties.
Algorithm note:
HOLD ON sets the NextPlot property of the current figure and axes to "add".
HOLD OFF sets the NextPlot property of the current axes to "replace".
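For example, HOLD can be used to overlay two curves on the same axes:

```matlab
x = 0:0.1:2*pi;
plot(x, sin(x))    % first curve; a second PLOT would normally erase it
hold on
plot(x, cos(x))    % added to the same axes because hold is on
hold off           % restore the default replace behavior
```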
Inputs: d contains the training data vectors (one per column); k is the number of
centroids required.
Output: r contains the resulting VQ codebook (k columns, one for each centroid).
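Given those inputs and outputs, a standard LBG binary-split training loop would look roughly as follows. This is a hypothetical re-implementation, not the project's actual routine; the split factor, iteration count and function name are assumptions.

```matlab
function r = vqlbg_sketch(d, k)
% Hypothetical LBG codebook training: d = training vectors (one per
% column), k = number of centroids required.
e = 0.01;                          % splitting perturbation (assumed)
r = mean(d, 2);                    % start from the global centroid
while size(r, 2) < k
    r = [r*(1+e), r*(1-e)];        % split every centroid in two
    for iter = 1:10                % a few refinement passes (assumed)
        % assign each training vector to its nearest centroid
        best = zeros(1, size(d, 2));
        for t = 1:size(d, 2)
            diff = r - d(:, t) * ones(1, size(r, 2));
            [junk, best(t)] = min(sum(diff.^2, 1));
        end
        % move each centroid to the mean of its assigned vectors
        for j = 1:size(r, 2)
            if any(best == j)
                r(:, j) = mean(d(:, best == j), 2);
            end
        end
    end
end
```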
LEGEND will try to install a ResizeFcn on the figure if it hasn't been defined before.
This resize function will try to keep the legend the same size.
Examples:
x = 0:.2:12;
plot(x,bessel(1,x),x,bessel(2,x),x,bessel(3,x));
legend('First','Second','Third');
legend('First','Second','Third',-1)
b = bar(rand(10,5),'stacked'); colormap(summer);
hold on
x = plot(1:10,5*rand(10,1),'marker','square','markersize',12,...
    'markeredgecolor','y','markerfacecolor',[.6 0 .6],...
    'linestyle','-','color','r','linewidth',2);
hold off
legend([b,x],'Carrots','Peas','Peppers','Green Beans',...
    'Cucumbers','Eggplant')
Speaker Recognition: Testing Stage
Input:
testdir : string name of the directory containing all test sound files
n       : number of test files in testdir
code    : codebooks of all trained speakers
Note:
Sound files in testdir are expected to be named: s1.wav, s2.wav, ..., sn.wav
Example:
>> test('C:\data\amintest\', 8, code);
MFCC
Inputs: s contains the signal to analyze; fs is the sampling rate of the signal.
Output: r contains the transformed signal.
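MFCC analysis conventionally proceeds as framing, windowing, FFT, mel-scale filterbank, log, and DCT. The sketch below is a hypothetical version of those steps, not the project's actual function; the frame sizes, filter count, coefficient range and the melfb_sketch helper (a mel-spaced filterbank matrix generator) are all assumptions.

```matlab
function r = mfcc_sketch(s, fs)
% Hypothetical MFCC computation; melfb_sketch is assumed, not shown.
s = s(:);                                 % work on a column vector
N = 256;  M = 100;                        % frame length and hop (assumed)
nframes = floor((length(s) - N) / M) + 1;
r = [];
for i = 1:nframes
    frame = s((i-1)*M + (1:N)) .* hamming(N);   % window one frame
    spec  = abs(fft(frame)).^2;                 % power spectrum
    melE  = melfb_sketch(20, N, fs) * spec(1:N/2+1);  % filterbank energies
    c     = dct(log(melE));                     % cepstral coefficients
    r     = [r, c(2:13)];                       % typical coefficient subset
end
```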
Speaker Recognition: Training Stage
Input:
traindir : string name of the directory containing all train sound files
n        : number of train files in traindir
Output:
code     : trained VQ codebooks, one per speaker
Example:
>> code = train('C:\data\amintrain\', 8);
Input:
x, y : two matrices in which each column is a data vector
Output:
d : element d(i,j) is the Euclidean distance between the two
column vectors x(:,i) and y(:,j)
Note:
The Euclidean distance D between two vectors X and Y is:
D = sum((x-y).^2).^0.5
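For instance, computing the distance matrix directly from the formula above:

```matlab
x = [0 3; 0 4];      % two 2-D vectors as columns: (0,0) and (3,4)
y = [0; 0];          % one column vector: the origin
d = zeros(size(x,2), size(y,2));
for i = 1:size(x,2)
    for j = 1:size(y,2)
        d(i,j) = sum((x(:,i) - y(:,j)).^2).^0.5;
    end
end
% d = [0; 5]: the distances from (0,0) and (3,4) to the origin
```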
ZEROS Zeros array.
ZEROS(N) is an N-by-N matrix of zeros.
ZEROS(M,N) or ZEROS([M,N]) is an M-by-N matrix of zeros.
ZEROS(M,N,P,...) or ZEROS([M N P ...]) is an M-by-N-by-P-by-... array of zeros.
ZEROS(SIZE(A)) is the same size as A and all zeros.
For vectors, MIN(X) is the smallest element in X. For matrices, MIN(X) is a row
vector containing the minimum element from each column. For N-D arrays, MIN(X)
operates along the first non-singleton dimension. [Y,I] = MIN(X) returns the indices of
the minimum values in vector I. If the values along the first non-singleton dimension
contain more than one minimal element, the index of the first one is
returned. MIN(X,Y) returns an array the same size as X and Y with the smallest
elements taken from X or Y.
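A short illustration of MIN:

```matlab
X = [3 1 2];
[y, i] = min(X)    % y = 1, i = 2
min(X, 2)          % [2 1 2], elementwise minimum against the scalar 2
A = [4 2; 1 5];
min(A)             % [1 2], the minimum of each column
```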
S = SPARSE(X) converts a sparse or full matrix to sparse form by squeezing out any
zero elements.
S = SPARSE(i,j,s,m,n,nzmax) uses the rows of [i,j,s] to generate an m-by-n sparse
matrix with space allocated for nzmax nonzeros. The two integer index vectors, i and j,
and the real or complex entries vector, s, all have the same length, nnz, which is the
number of nonzeros in the resulting sparse matrix S. Any elements of s which have
duplicate values of i and j are added together.
There are several simplifications of this six argument call.
S = SPARSE(i,j,s,m,n) uses nzmax = length(s).
S = SPARSE(i,j,s) uses m = max(i) and n = max(j).
S = SPARSE(m,n) abbreviates SPARSE([],[],[],m,n,0). This generates the ultimate
sparse matrix, an m-by-n all-zero matrix. The argument s and one of the arguments i or j
may be scalars, in which case they are expanded so that the first three arguments all
have the same length. For example, this dissects and then reassembles a sparse matrix:
[i,j,s] = find(S);
[m,n] = size(S);
S = sparse(i,j,s,m,n);
So does this, if the last row and column have nonzero entries:
[i,j,s] = find(S);
S = sparse(i,j,s);
All of MATLAB's built-in arithmetic, logical and indexing operations can be applied to
sparse matrices, or to mixtures of sparse and full matrices. Operations on sparse
matrices return sparse matrices and operations on full matrices return full matrices. In
most cases, operations on mixtures of sparse and full matrices return full matrices. The
exceptions include situations where the result of a mixed operation is structurally
sparse, e.g. A .* S is at least as sparse as S. Some operations, such as S >= 0, generate
"Big Sparse", or "BS", matrices -- matrices with sparse storage organization but few
zero elements.
MFCC:
Inputs: s contains the signal to analyze; fs is the sampling rate of the signal.
Output: r contains the transformed signal.
Y = DCT(X) returns the discrete cosine transform of X. The vector Y is the same size as
X and contains the discrete cosine transform coefficients. Y = DCT(X,N) pads or
truncates the vector X to length N before transforming. If X is a matrix, the DCT
operation is applied to each column. This transform can be inverted using IDCT.
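For example (DCT and IDCT are Signal Processing Toolbox functions):

```matlab
x = [1 2 3 4];
y = dct(x);        % four DCT coefficients
xr = idct(y);      % recovers [1 2 3 4] up to round-off
```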
For vectors, MEAN(X) is the mean value of the elements in X. For matrices,
MEAN(X) is a row vector containing the mean value of each column. For N-D arrays,
MEAN(X) is the mean value of the elements along the first non-singleton dimension of
X. MEAN(X,DIM) takes the mean along the dimension DIM of X.
Example: If X = [0 1 2; 3 4 5]
then mean(X,1) is [1.5 2.5 3.5] and mean(X,2) is [1 4]'
Disadvantages of MATLAB:
MATLAB has two principal disadvantages. The first is that it is an interpreted language
and can therefore execute more slowly than compiled languages. This problem can be
mitigated by properly structuring the MATLAB program and by using the MATLAB
compiler to compile the final MATLAB program before distribution and general use.
The second disadvantage is cost: a full copy of MATLAB is five to ten times more
expensive than a conventional C or Fortran compiler.
CONCLUSION
BIBLIOGRAPHY
L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall.
F.K. Soong, A.E. Rosenberg and B.H. Juang, "A vector quantisation approach to
speaker recognition", AT&T Technical Journal, Vol. 66-2, pp. 14-26, March 1987.
comp.speech Frequently Asked Questions WWW site,
http://svr-www.eng.cam.ac.uk/comp.speech