
1. INTRODUCTION
Automatic Speech Recognition
Automatic speech recognition is the process by which a computer maps an acoustic
speech signal to text. Automatic speech understanding is the process by which a computer
maps an acoustic speech signal to some form of abstract meaning of the speech.
What does speaker dependent / adaptive / independent mean?
A speaker dependent system is developed to operate for a single speaker. These systems
are usually easier to develop, cheaper to buy and more accurate than, but not as flexible
as, speaker adaptive or speaker independent systems.
A speaker independent system is developed to operate for any speaker of a particular type
(e.g. American English). These systems are the most difficult to develop, most expensive
and accuracy is lower than speaker dependent systems. However, they are more flexible.
A speaker adaptive system is developed to adapt its operation to the characteristics of
new speakers. Its difficulty lies somewhere between speaker independent and speaker
dependent systems.
What does continuous speech or isolated-word mean?
An isolated-word system operates on single words at a time - requiring a pause between
saying each word. This is the simplest form of recognition to perform because the end
points are easier to find and the pronunciation of a word tends not to affect others. Thus,
because the occurrences of words are more consistent, they are easier to recognize.
A continuous speech system operates on speech in which words are connected together,
i.e. not separated by pauses. Continuous speech is more difficult to handle because of a
variety of effects. First, it is difficult to find the start and end points of words. Another
problem is "co articulation". The production of each phoneme is affected by the
1

production of surrounding phonemes, and similarly the the start and end of words are
affected by the preceding and following words. The recognition of continuous speech is
also affected by the rate of speech (fast speech tends to be harder).
How is speech recognition performed?
A wide variety of techniques are used to perform speech recognition, at many different
levels of recognition, analysis and understanding.
Typically speech recognition starts with the digital sampling of speech. The next stage is
acoustic signal processing. Most techniques include spectral analysis; e.g. LPC analysis
(Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea
modelling and many more.
The next stage is recognition of phonemes, groups of phonemes and words. This stage
can be achieved by many processes such as DTW (Dynamic Time Warping), HMM
(hidden Markov modelling), NNs (Neural Networks), expert systems and combinations
of techniques. HMM-based systems are currently the most commonly used and most
successful approach. Most systems utilize some knowledge of the language to aid the
recognition process.
Some systems try to "understand" speech. That is, they try to convert the words into a
representation of what the speaker intended to mean or achieve by what they said.
This is a simple recognizer that should give 85%+ recognition accuracy. The
accuracy is a function of the words in the vocabulary: long, distinct words are
easy, while short, similar words are hard. A recognizer of this kind can reach 98%+ on the digits.
Overview:
- Find the beginning and end of the utterance.
- Filter the raw signal into frequency bands.
- Cut the utterance into a fixed number of segments.
- Average the data for each band in each segment.
- Store this pattern with its name.
- Collect a training set of about 3 repetitions of each pattern (word).
- Recognize an unknown word by comparing its pattern against all patterns in the training set and returning the name of the pattern closest to the unknown.
Many variations upon this theme can be made to improve performance. Try different filtering of the raw signal and different processing methods; a minimal sketch of the whole pipeline is given below.
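As an illustration, here is a minimal Matlab sketch of this template recognizer. All names, thresholds and sizes are assumptions chosen for illustration, not part of the original recipe; each function would live in its own .m file, and the pattern-building step assumes the utterance yields at least nSegs frames.

function p = word_pattern(x, nBands, nSegs)
% Build a band/segment template for one utterance x (a sample vector).
% Crude endpoint detection: trim leading and trailing low-energy samples.
active = abs(x) > 0.05 * max(abs(x));
x = x(find(active, 1, 'first') : find(active, 1, 'last'));
% Short-time power spectrum: 256-sample frames shifted by 100 samples.
N = 256; M = 100;
nFrames = 1 + floor((length(x) - N) / M);   % assumes nFrames >= nSegs
S = zeros(N/2, nFrames);
for i = 1:nFrames
    frame = x((i-1)*M + 1 : (i-1)*M + N);
    F = abs(fft(frame(:) .* hamming(N))).^2;
    S(:, i) = F(1:N/2);
end
% Average the bins of each frequency band within each time segment.
p = zeros(nBands, nSegs);
for b = 1:nBands
    rows = floor((b-1)*(N/2)/nBands)+1 : floor(b*(N/2)/nBands);
    for t = 1:nSegs
        cols = floor((t-1)*nFrames/nSegs)+1 : floor(t*nFrames/nSegs);
        p(b, t) = mean(mean(S(rows, cols)));
    end
end

function name = nearest_pattern(p, templates, names)
% Return the label of the stored pattern closest to the unknown p.
% templates is a cell array of patterns, names a cell array of labels.
best = Inf;
for i = 1:length(templates)
    d = norm(p(:) - templates{i}(:));
    if d < best, best = d; name = names{i}; end
end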
Automatic speech recognition and speaker verification are among the most challenging
problems of modern man-machine interaction. Among their numerous useful
applications is a future "checkless" society in which all financial transactions are
executed over the telephone and signed by voice. Access to confidential data can be
made secure by speaker verification. Other applications include voice information and
reservation systems covering a wide spectrum of human activities, from travel and study to
purchasing and partner matching. In these applications, spoken requests (over the
telephone, say) are understood by machines and answered by synthesized voice. Voice
control of computers and spacecraft (and machines in general whose operators have
limited use of their hands) is an aspiration of long standing. Activation by voice could be
particularly beneficial for the severely handicapped who have lost one or several limbs.
The surgeon in the middle of an operation, needing the latest medical information, is another
instance where only the acoustic channel is still fully available for requesting and
receiving the urgently required advice. The editing of manuscripts by voice may
supplant much of the present paper pushing and mouse play at graphics terminals.
The potential applications of speech and speaker recognition are boundless. As early as
1944, speaker identification was used successfully by the Allies to trace the movements of
German combat units by analyzing spectrograms of enemy voice traffic. Remarkably, the
human ear is often able to identify a telephone caller on the basis of a simple hello or just
the clearing of his throat. But the difficulties of recognition by machine can be staggering.
Even if we forego automatic accent classification, and especially if we persuade the banks
to live with less than perfection in voice signatures (which they really do not need,
considering the large number of unsigned or falsely signed checks that clear the system
every day), reliable voice recognition from large pools of potential speakers on the basis of
their speech alone will remain problematic for years to come. And, as widely appreciated
by now, the automatic recognition of anything but isolated words from a limited vocabulary
spoken by known speakers presents formidable difficulties. Decades of painstaking (and
often painful) research have shown that purely technical advances will result, at best, in
limited improvements, far short of what is child's play for the human mind.

2. Principles of Speaker Recognition


Speaker recognition can be classified into identification and verification. Speaker
identification is the process of determining which registered speaker provides a given
utterance. Speaker verification, on the other hand, is the process of accepting or rejecting
the identity claim of a speaker. Figure 1 shows the basic structures of speaker
identification and verification systems.
Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics
of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on
his or her speaking one or more specific phrases, like passwords, card numbers, PIN
codes, etc. All technologies of speaker recognition, identification and verification, text-independent and text-dependent, have their own advantages and disadvantages and
may require different treatments and techniques. The choice of which technology to use
is application-specific.
At the highest level, all speaker recognition systems contain two main modules (refer to
Figure 1): feature extraction and feature matching. Feature extraction is the process that
extracts a small amount of data from the voice signal that can later be used to represent
each speaker. Feature matching involves the actual procedure to identify the unknown
speaker by comparing extracted features from his/her voice input with the ones from a set
of known speakers. We will discuss each module in detail in later sections.

All speaker recognition systems have to serve two distinct phases. The first is
referred to as the enrollment or training phase, while the second is referred to as
the operational or testing phase. In the training phase, each registered speaker has
to provide samples of their speech so that the system can build or train a reference model
for that speaker. In case of speaker verification systems, in addition, a speaker-specific
threshold is also computed from the training samples. During the testing (operational)
phase (see Figure 1), the input speech is matched with the stored reference model(s) and
a recognition decision is made.
Speaker recognition is a difficult task and it is still an active research area. Automatic
speaker recognition works on the premise that a person's speech exhibits
characteristics that are unique to the speaker. However, this task is challenged by
the high variability of input speech signals. The principal source of variance is the
speakers themselves: speech signals in training and testing sessions can differ greatly
due to many factors, such as voices changing with time, health conditions (e.g.
the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker
variability, that present a challenge to speaker recognition technology. Examples of these
are acoustical noise and variations in recording environments.

3. Speech Feature Extraction


The purpose of this module is to convert the speech waveform to some type of parametric
representation (at a considerably lower information rate) for further analysis and
processing. This is often referred to as the signal-processing front end.
The speech signal is a slowly time-varying signal (it is called quasi-stationary). An
example of a speech signal is shown in Figure 2. When examined over a sufficiently short
period of time (between 5 and 100 msec), its characteristics are fairly stationary.
However, over longer periods of time (on the order of 1/5 second or more) the signal
characteristics change to reflect the different speech sounds being spoken. Therefore,
short-time spectral analysis is the most common way to characterize the speech signal.
A wide range of possibilities exist for parametrically representing the speech signal for
the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency
Cepstrum Coefficients (MFCC), and others. MFCCs are perhaps the best known and most
popular, and will be used in this project.

Figure 2. An example of speech signal


MFCCs are based on the known variation of the human ear's critical bandwidths with
frequency: filters spaced linearly at low frequencies and logarithmically at high
frequencies are used to capture the phonetically important characteristics of
speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing
below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing
MFCCs is described in more detail next.

4. Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech
input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency
is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such
sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of
sounds generated by humans. As discussed previously, the main purpose of
the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are
shown to be less susceptible to the variations mentioned above than the speech waveforms
themselves.

Figure 3. Block diagram of the MFCC processor


4.1 Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with
adjacent frames separated by M samples (M < N). The first frame consists of the first N
samples. The second frame begins M samples after the first frame, and overlaps it by
N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M
samples after the second frame) and overlaps it by N - 2M samples. This process
continues until all the speech is accounted for within one or more frames. Typical values
are N = 256 (equivalent to ~30 msec of windowing, and facilitating the fast radix-2 FFT)
and M = 100. A short code sketch follows.
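A minimal Matlab sketch of this frame blocking step, assuming s is the sampled speech signal (variable names are illustrative):

N = 256; M = 100;                           % frame size and frame shift
nFrames = 1 + floor((length(s) - N) / M);   % frames that fit entirely
frames = zeros(N, nFrames);                 % one frame per column
for i = 1:nFrames
    frames(:, i) = s((i-1)*M + 1 : (i-1)*M + N);
end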

4.2 Windowing
The next step in the processing is to window each individual frame so as to minimize the
signal discontinuities at the beginning and end of each frame. The concept here is to
minimize the spectral distortion by using the window to taper the signal to zero at the
beginning and end of each frame. If we define the window as $w(n)$, $0 \le n \le N-1$, where
$N$ is the number of samples in each frame, then the result of windowing is the signal

$$ y_l(n) = x_l(n)\, w(n), \qquad 0 \le n \le N-1 $$

Typically the Hamming window is used, which has the form:

$$ w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 $$

The next processing step is the Fast Fourier Transform, which converts each frame of $N$
samples from the time domain into the frequency domain. The FFT is a fast algorithm to
implement the Discrete Fourier Transform (DFT), which is defined on the set of $N$
samples $\{x_n\}$ as follows:

$$ X_k = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1 $$

Note that we use $j$ here to denote the imaginary unit, i.e. $j = \sqrt{-1}$. In general the $X_k$ are
complex numbers. The resulting sequence $\{X_k\}$ is interpreted as follows: the zero
frequency corresponds to $k = 0$, positive frequencies $0 < f < F_s/2$ correspond to values
$1 \le k \le N/2 - 1$, while negative frequencies $-F_s/2 < f < 0$ correspond to $N/2 + 1 \le k \le N-1$.
Here, $F_s$ denotes the sampling frequency. The result obtained after this step is often
referred to as the signal's spectrum or periodogram.
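Continuing the frame-blocking sketch above, windowing and the FFT can be applied per column; the following is a sketch, keeping the one-sided power spectrum for later steps:

w = hamming(N);                          % Hamming window, as above
windowed = frames .* repmat(w, 1, nFrames);
spectra = abs(fft(windowed)).^2;         % power spectrum, one frame per column
spectra = spectra(1:N/2+1, :);           % keep non-negative frequencies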

4.3 Mel-frequency wrapping


As mentioned above, psychophysical studies have shown that human perception of the
frequency content of sounds for speech signals does not follow a linear scale. Thus for
each tone with an actual frequency $f$, measured in Hz, a subjective pitch is measured on a
scale called the mel scale. The mel-frequency scale is a linear frequency spacing below
1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1
kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.
Therefore we can use the following approximate formula to compute the mels for a given
frequency $f$ in Hz:

$$ \mathrm{mel}(f) = 2595 \, \log_{10}\left(1 + \frac{f}{700}\right) $$
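In Matlab this formula is a one-liner, for example:

hz2mel = @(f) 2595 * log10(1 + f/700);   % approximate mel scale
hz2mel(1000)                             % about 1000 mels at 1 kHz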

One approach to simulating the subjective spectrum is to use a filter bank, with one filter
for each desired mel-frequency component (see Figure 4). The filter bank has a triangular
bandpass frequency response, and the spacing as well as the bandwidth is determined by a
constant mel-frequency interval. The modified spectrum of $S(\omega)$ thus consists of the
output power of these filters when $S(\omega)$ is the input. The number of mel spectrum
coefficients, $K$, is typically chosen as 20.
Note that this filter bank is applied in the frequency domain; therefore it simply amounts
to applying the triangle-shaped windows of Figure 4 to the spectrum. A useful way of
thinking about this mel-wrapping filter bank is to view each filter as a histogram bin
(where bins have overlap) in the frequency domain.


Figure 4. An example of Mel-spaced filter bank

4.4 Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called the
mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech
spectrum provides a good representation of the local spectral properties of the signal for
the given frame analysis. Because the mel spectrum coefficients (and so their logarithm)
are real numbers, we can convert them to the time domain using the Discrete Cosine
Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are
the result of the last step as $\tilde{S}_k$, $k = 1, 2, \ldots, K$, we can calculate the MFCCs, $\tilde{c}_n$, as

$$ \tilde{c}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\left[n \left(k - \frac{1}{2}\right) \frac{\pi}{K}\right], \qquad n = 1, 2, \ldots, K $$

Note that we exclude the first component, $\tilde{c}_0$, from the DCT since it represents the mean
value of the input signal, which carries little speaker-specific information.
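Continuing the earlier sketches, the whole mel-wrapping and cepstrum step reduces to a few lines, assuming the supplied melfb function described in the implementation section and the one-sided power spectra computed above:

m = melfb(20, N, fs);              % mel filterbank, K = 20 filters
melspec = m * spectra;             % K mel spectrum values per frame
c = dct(log(melspec));             % DCT of the log mel spectrum
c(1, :) = [];                      % exclude c0, as discussed above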

5. Feature Matching
The problem of speaker recognition belongs to a much broader topic in science and
engineering known as pattern recognition. The goal of pattern recognition is to classify
objects of interest into one of a number of categories or classes. The objects of interest
are generically called patterns, and in our case are sequences of acoustic vectors that are
extracted from an input speech signal using the techniques described in the previous section.
The classes here refer to individual speakers. Since the classification procedure in our
case is applied to extracted features, it can also be referred to as feature matching.
Furthermore, if there exists a set of patterns whose individual classes are
already known, then one has a problem in supervised pattern recognition. This is exactly
our case, since during the training session we label each input speech signal with the ID of the
speaker (S1 to S8). These patterns comprise the training set and are used to derive a
classification algorithm. The remaining patterns are then used to test the classification
algorithm; these patterns are collectively referred to as the test set. If the correct classes
of the individual patterns in the test set are also known, then one can evaluate the
performance of the algorithm.
The state-of-the-art in feature matching techniques used in speaker recognition includes
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector
Quantization (VQ). In this project, the VQ approach will be used, due to ease of
implementation and high accuracy. VQ is a process of mapping vectors from a large
vector space to a finite number of regions in that space. Each region is called a cluster
and can be represented by its center called a codeword. The collection of all codewords is
called a codebook.


Figure 5 shows a conceptual diagram illustrating this recognition process. In the figure,
only two speakers and two dimensions of the acoustic space are shown. The circles are
the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the
training phase, a speaker-specific VQ codebook is generated for each known speaker
by clustering his/her training acoustic vectors. The resulting codewords (centroids) are
shown in Figure 5 as black circles and black triangles for speakers 1 and 2, respectively.
The distance from a vector to the closest codeword of a codebook is called a VQ
distortion. In the recognition phase, an input utterance from an unknown voice is
vector-quantized using each trained codebook and the total VQ distortion is computed. The
speaker corresponding to the VQ codebook with the smallest total distortion is identified;
a short sketch of this decision rule follows the figure.

Figure 5. Conceptual diagram illustrating vector quantization codebook formation. One
speaker can be discriminated from another based on the location of centroids.
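A sketch of this decision rule, assuming MFCC vectors v (one per column) from the unknown utterance, trained codebooks code{1}..code{S}, and the supplied disteu distance function described later in the implementation section:

best = Inf; speakerID = 0;
for s = 1:length(code)
    z = disteu(v, code{s});        % distances to every codeword
    dist = sum(min(z, [], 2));     % total VQ distortion for this codebook
    if dist < best
        best = dist; speakerID = s;
    end
end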
5.1 Clustering the Training Vectors
After the enrollment session, the acoustic vectors extracted from the input speech of a speaker
provide a set of training vectors. As described above, the next important step is to build a
speaker-specific VQ codebook for this speaker using those training vectors. There is a
well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for
clustering a set of L training vectors into a set of M codebook vectors. The algorithm is
formally implemented by the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors
(hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword $y_n$ according to the
rule

$$ y_n^+ = y_n (1 + \epsilon), \qquad y_n^- = y_n (1 - \epsilon) $$

where $n$ varies from 1 to the current size of the codebook, and $\epsilon$ is a splitting parameter
(we choose $\epsilon = 0.01$).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current
codebook that is closest (in terms of similarity measurement), and assign that vector to
the corresponding cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training
vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts first by
designing a 1-vector codebook, then uses a splitting technique on the codewords to
initialize the search for a 2-vector codebook, and continues the splitting process until the
desired M-vector codebook is obtained.
Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster
vectors" is the nearest-neighbor search procedure which assigns each training vector to the
cluster associated with the closest codeword. "Find centroids" is the centroid update
procedure. "Compute D (distortion)" sums the distances of all training vectors in the
nearest-neighbor search so as to determine whether the procedure has converged. A code
sketch of the algorithm follows the figure.

Figure 6. Flow diagram of the LBG algorithm (Adapted from Rabiner and Juang, 1993)
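The following self-contained Matlab sketch implements the six steps above; the convergence test and the empty-cluster guard are illustrative assumptions, not prescribed by the text:

function r = vqlbg(d, k)
% d: training vectors (one per column); k: required codebook size.
epsilon = 0.01;                            % splitting parameter
r = mean(d, 2);                            % step 1: 1-vector codebook
while size(r, 2) < k
    r = [r*(1+epsilon), r*(1-epsilon)];    % step 2: split codewords
    Dprev = Inf;
    while true
        % step 3: nearest-neighbor search (squared Euclidean distance)
        z = zeros(size(d, 2), size(r, 2));
        for j = 1:size(r, 2)
            df = d - repmat(r(:, j), 1, size(d, 2));
            z(:, j) = sum(df.^2, 1)';
        end
        [dmin, ind] = min(z, [], 2);
        for j = 1:size(r, 2)               % step 4: centroid update
            if any(ind == j)
                r(:, j) = mean(d(:, ind == j), 2);
            end
        end
        D = sum(dmin);                     % total distortion
        if (Dprev - D) / D < epsilon       % step 5: convergence test
            break
        end
        Dprev = D;
    end
end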

6. Implementation

As stated above, in this project we will build and test an automatic speaker recognition
system. In order to implement such a system, one must go through the several steps
described in detail in the previous sections. Note that many of the above tasks are already
implemented in Matlab. Furthermore, to ease the development process, we supply you
with two utility functions, melfb and disteu, and two main functions, train and test.
Download all of those files into your working folder. The first two files can be treated as
black boxes, but the latter two need to be thoroughly understood. In fact, your task is to
write the two missing functions, mfcc and vqlbg, which will be called from the given main
functions. In order to accomplish that, follow each step in this section carefully and
answer all the questions.
Speech Data
Click here to download the ZIP file of the speech database. After unzipping the file
correctly, you will find two folders, TRAIN and TEST, each containing 8 files, named
S1.WAV, S2.WAV, ..., S8.WAV; each is labeled with the ID of the speaker. These files
were recorded in Microsoft WAV format. On Windows systems, you can listen to the
recorded sounds by double-clicking the files.
Our goal is to train a voice model (or, more specifically, a VQ codebook in the MFCC vector
space) for each speaker S1-S8 using the corresponding sound file in the TRAIN folder.
After this training step, the system has knowledge of the voice characteristics of
each (known) speaker. Next, in the testing phase, the system should be able to identify the
(assumed unknown) speaker of each sound file in the TEST folder.
Question 1: Play each sound file in the TRAIN folder. Can you distinguish the voices of
the eight speakers? Now play each sound in the TEST folder in a random order without
looking at the file names (pretending that you do not know the speaker) and try to
identify the speaker using your knowledge of their voices that you just learned from the
TRAIN folder. This is exactly what the computer will do in our system. What is your
(human) recognition rate? Record this result so that it can later be compared against the
computer performance of our system.

Speech Processing

In this phase you are required to write a Matlab function that reads a sound file and turns
it into a sequence of MFCC vectors (acoustic vectors) using the speech processing steps
described previously. Many of those tasks are already provided by either standard or
supplied Matlab functions. The Matlab functions that you will need are:
wavread, hamming, fft, dct and melfb (supplied function). Type help <function name> at the
Matlab prompt for more information about a function.
Question 2: Read a sound file into Matlab. Check it by playing the sound file in Matlab
using the function: sound. What is the sampling rate? What is the highest frequency that
the recorded sound can capture with fidelity? With that sampling rate, how many msecs
of actual speech are contained in a block of 256 samples?
Plot the signal to view it in the time domain. It should be obvious that the raw data in the
time domain contains a very large amount of data and is difficult to analyze directly for
voice characteristics. So the motivation for this step (speech feature extraction) should be
clear now!
Now cut the speech signal (a vector) into frames with overlap (refer to the frame blocking
section in the theory part). The result is a matrix where each column is a frame of N samples
from the original speech signal. Then apply windowing and the FFT to transform the
signal into the frequency domain. This process is used in many different applications and
is referred to in the literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier
Transform (STFT). The result is often called the spectrum or periodogram; a short
plotting sketch follows.
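For example, continuing the earlier sketches (spectra, N, M, fs as defined there), the log power spectrum for Question 3 can be displayed as follows; the axis construction is an illustrative assumption:

t = (0:size(spectra,2)-1) * M / fs * 1000;   % frame times in msec
f = (0:N/2) * fs / N;                        % frequency axis in Hz
imagesc(t, f, 10*log10(spectra + eps));      % log scale (dB)
axis xy; xlabel('Time (ms)'); ylabel('Frequency (Hz)'); colorbar;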
Question 3: After successfully running the preceding process, what is the interpretation of
the result? Compute the power spectrum and plot it out using the imagesc command.
Note that it is better to view the power spectrum on the log scale. Locate the region in the
plot that contains most of the energy. Translate this location into the actual ranges in time
(msec) and frequency (in Hz) of the input speech signal.
Question 4: Compute and plot the power spectrum of a speech file using different frame
size: for example N = 128, 256 and 512. In each case, set the frame increment M to be
about N/3. Can you describe and explain the differences among those spectra?


The last step in speech processing is converting the power spectrum into mel-frequency
cepstrum coefficients. The supplied function melfb facilitates this task.
Question 5: Type help melfb at the Matlab prompt for more information about this
function. Follow the guidelines to plot out the mel-spaced filter bank. What is the
behavior of this filter bank? Compare it with the theoretical part.
Finally, put all the pieces together into a single Matlab function, mfcc, which performs
the MFCC processing. One possible assembly is sketched after this question.
Question 6: Compute and plot the spectrum of a speech file before and after the
mel-frequency wrapping step. Describe and explain the impact of the melfb program.
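One possible assembly is the following sketch, which simply chains the steps described in the theory part; the frame sizes and number of filters are the typical values quoted there, not the only valid choices:

function r = mfcc(s, fs)
% Convert signal s (sampled at fs) into MFCC vectors, one per column.
N = 256; M = 100;
nFrames = 1 + floor((length(s) - N) / M);
frames = zeros(N, nFrames);
for i = 1:nFrames
    frames(:, i) = s((i-1)*M + 1 : (i-1)*M + N);
end
frames = frames .* repmat(hamming(N), 1, nFrames);   % windowing
spec = abs(fft(frames)).^2;                          % power spectrum
m = melfb(20, N, fs);                                % supplied filterbank
r = dct(log(m * spec(1:N/2+1, :)));                  % log mel + DCT
r(1, :) = [];                                        % drop c0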
Vector Quantization
The result of the last section is that we have transformed speech signals into vectors in an
acoustic space. In this section, we will apply the VQ-based pattern recognition technique
to build speaker reference models from those vectors in the training phase, and then use
them to identify any sequence of acoustic vectors uttered by an unknown speaker.
Question 7: To inspect the acoustic space (MFCC vectors) we can pick any two
dimensions (say the 5th and the 6th) and plot the data points in a 2D plane. Use acoustic
vectors of two different speakers and plot data points in two different colors. Do the data
regions from the two speakers overlap each other? Are they in clusters?
Now write a Matlab function, vqlbg, that trains a VQ codebook using the LBG algorithm
described before. Use the supplied utility function disteu to compute the pairwise
Euclidean distances between the codewords and training vectors in the iterative process.
A usage sketch is given below.
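For instance, training codebooks for all eight speakers might look like this sketch; the codebook size of 16 is an illustrative choice:

code = cell(1, 8);
for i = 1:8
    [s, fs] = wavread(sprintf('S%d.WAV', i));   % file from TRAIN folder
    v = mfcc(s, fs);                            % acoustic vectors
    code{i} = vqlbg(v, 16);                     % 16-codeword codebook
end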
Question 8: Plot the data points of the trained VQ codewords using the same two
dimensions over the plot from the last question. Compare this with Figure 5.


Simulation and Evaluation


Now for the final part! Use the two supplied programs, train and test (which require the
two functions mfcc and vqlbg that you just wrote), to simulate the training and testing
procedures of a speaker recognition system, respectively.
Question 9: What recognition rate can our system achieve? Compare this with the
human performance. For the cases in which the system makes errors, re-listen to the
speech files and try to come up with some explanations.
Question 10 (optional): You can also test the system with your own speech files. Use the
Windows program Sound Recorder to record more voices from yourself and your
friends. Each new speaker needs to provide one speech file for training and one for
testing. Can the system recognize your voice? Enjoy!

Figure 1: Plot of signal s1.wav

Figure 2.a: Power spectrum (M=100, N=256)

Figure 2.b: Logarithmic power spectrum (M=100, N=256)

Figure 3.a: Power spectrum (M=43, N=128, frames=767)

Figure 3.b: Power spectrum (M=85, N=256, frames=387)

Figure 3.c: Power spectrum (M=171, N=512, frames=191)

Figure 4: Mel-spaced filterbank

Figure 5.a: Power spectrum, unmodified

Figure 5.b: Power spectrum modified through the mel cepstrum filter

Figure 6: 2D plot of acoustic vectors

Figure 7: 2D plot of acoustic vectors

Figure 8.a: 2D plot of acoustic vectors

Figure 8.b: 2D plot of acoustic vectors

9. APPENDIX
INTRODUCTION TO MATLAB
MATLAB (short for Matrix Laboratory) is a special-purpose computer program
optimized to perform engineering and scientific calculations. It started life as a program
designed to perform matrix mathematics, but over the years it has grown into a flexible
computing system capable of solving essentially any technical problem. The MATLAB
program implements the MATLAB programming language and provides an extensive
library of predefined functions that make technical programming tasks easier and more
efficient.
MATLAB is a huge program with an incredibly rich variety of functions. Even the basic
version of MATLAB without any toolkits is much richer than other technical
programming languages. There are more than 1000 functions in the basic MATLAB
product alone, and the toolkits extend this capability with many more functions in various
specialties.
Advantages of MATLAB:
MATLAB has many advantages compared with conventional computer
languages for technical problem solving. Among them are the following:

Ease of use: MATLAB is an interpreted language, like many versions of
BASIC. The program can be used as a scratch pad to evaluate expressions typed at the
command line, or it can be used to execute large prewritten programs. Programs may be
easily written and modified with the built-in integrated development environment and
debugged with the MATLAB debugger.

Platform independence: MATLAB is supported on many different computer systems,
providing a large measure of platform independence.

Predefined functions: MATLAB comes complete with an extensive library of
predefined functions that provide tested and prepackaged solutions to many basic
technical tasks. In addition to the functions built into the basic MATLAB language,
many special-purpose toolboxes are available to help solve complex
problems in specific areas.

Device-independent plotting: unlike most other computer languages,
MATLAB has many integral plotting and imaging commands. The plots and images can
be displayed on any graphical output device supported by the computer on which
MATLAB is running. This capability makes MATLAB an outstanding tool for visualizing
technical data.

Graphical user interface: MATLAB includes tools that allow a programmer to
interactively construct a GUI for his/her program.

MATLAB compiler: MATLAB's flexibility and platform independence is
achieved by compiling MATLAB programs into device-independent p-code and then
interpreting the p-code instructions at run time. A separate MATLAB compiler is
available. This compiler can compile a MATLAB program into an executable program that
runs faster than the interpreted code.
MATLAB Commands used in source code:

LENGTH: Length of vector.
LENGTH(X) returns the length of vector X. It is equivalent to MAX(SIZE(X)) for
non-empty arrays and 0 for empty ones.

FLOOR: Round towards minus infinity.
FLOOR(X) rounds the elements of X to the nearest integers towards minus infinity.


HAMMING: Hamming window.
HAMMING(N) returns the N-point symmetric Hamming window in a column vector.
HAMMING(N,SFLAG) generates the N-point Hamming window using SFLAG window
sampling. SFLAG may be either 'symmetric' or 'periodic'. By default, a symmetric
window is returned.

DIAG: Diagonal matrices and diagonals of a matrix.
DIAG(V,K), when V is a vector with N components, is a square matrix of order
N+ABS(K) with the elements of V on the K-th diagonal. K = 0 is the main diagonal, K
> 0 is above the main diagonal and K < 0 is below the main diagonal.
DIAG(V) is the same as DIAG(V,0) and puts V on the main diagonal. DIAG(X,K),
when X is a matrix, is a column vector formed from the elements of the K-th diagonal
of X.
DIAG(X) is the main diagonal of X. DIAG(DIAG(X)) is a diagonal matrix.
Example
m = 5;
diag(-m:m) + diag(ones(2*m,1),1) + diag(ones(2*m,1),-1)
produces a tridiagonal matrix of order 2*m+1.

FFT: Discrete Fourier transform.

FFT(X) is the discrete Fourier transform (DFT) of vector X. For matrices, the FFT
operation is applied to each column. For N-D arrays, the FFT operation operates on the
first non-singleton dimension.
FFT(X,N) is the N-point FFT, padded with zeros if X has less
than N points and truncated if it has more.
FFT(X,[],DIM) or FFT(X,N,DIM) applies the FFT operation across the dimension
DIM.
For a length-N input vector x, the DFT is a length-N vector X, with elements
X(k) = sum over n = 1..N of x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), 1 <= k <= N.
The inverse DFT (computed by IFFT) is given by
x(n) = (1/N) * sum over k = 1..N of X(k)*exp( j*2*pi*(k-1)*(n-1)/N), 1 <= n <= N.

WAVREAD: Read Microsoft WAVE (".wav") sound file.

Y=WAVREAD(FILE) reads a WAVE file specified by the string FILE, returning the
sampled data in Y. The ".wav" extension is appended if no extension is given.
Amplitude values are in the range [-1,+1].
[Y,FS,NBITS]=WAVREAD(FILE) returns the sample rate (FS) in Hertz and the
number of bits per sample (NBITS) used to encode the data in the file.
[...]=WAVREAD(FILE,N) returns only the first N samples from each channel in the
file.
[...]=WAVREAD(FILE,[N1 N2]) returns only samples N1 through N2 from each
channel in the file.
SIZ=WAVREAD(FILE,'size') returns the size of the audio data contained in the file in
place of the actual audio data, returning the vector SIZ=[samples channels].
[Y,FS,NBITS,OPTS]=WAVREAD(...) returns a structure OPTS of additional
information contained in the WAV file. The content of this structure differs from file to
file. Typical structure fields include '.fmt' (audio format information) and '.info' (text
which may describe subject, title, copyright, etc.)
Supports multi-channel data, with up to 32 bits per sample.

DISP: Display array.

DISP(X) displays the array, without printing the array name. In all other ways it's the
same as leaving the semicolon off an expression except that empty arrays don't display.
If X is a string, the text is displayed.

AXIS: Control axis scaling and appearance.

AXIS([XMIN XMAX YMIN YMAX]) sets scaling for the x- and y-axes on the current
plot.

AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX]) sets the scaling for the x-, y- and
z-axes on the current 3-D plot.
AXIS([XMIN XMAX YMIN YMAX ZMIN ZMAX CMIN CMAX]) sets the scaling
for the x-, y-, z-axes and color scaling limits on the current axis (see CAXIS).
V = AXIS returns a row vector containing the scaling for the current plot. If the current
view is 2-D, V has four components; if it is 3-D, V has six components.
AXIS AUTO returns the axis scaling to its default, automatic
mode where, for each dimension, 'nice' limits are chosen based on the extents of all
line, surface, patch, and image children.
AXIS MANUAL freezes the scaling at the current limits, so that if HOLD is turned on,
subsequent plots will use the same limits.
AXIS TIGHT sets the axis limits to the range of the data.
AXIS IJ puts MATLAB into its "matrix" axes mode. The coordinate system origin is
at the upper left corner. The i axis is vertical and is numbered from top to bottom. The
j axis is horizontal and is numbered from left to right.
AXIS XY puts MATLAB into its default "Cartesian" axes mode. The coordinate
system origin is at the lower left corner. The x axis is horizontal and is numbered from
left to right. The y axis is vertical and is numbered from bottom to top.
AXIS EQUAL sets the aspect ratio so that equal tick mark increments on the x-,y- and
z-axis are equal in size. This makes SPHERE(25) look like a sphere, instead of an
ellipsoid.
AXIS IMAGE is the same as AXIS EQUAL except that the plot box fits tightly around
the data.
AXIS SQUARE makes the current axis box square in size.
AXIS NORMAL restores the current axis box to full size and
removes any restrictions on the scaling of the units.
This undoes the effects of AXIS SQUARE and AXIS EQUAL.
AXIS VIS3D freezes aspect ratio properties to enable rotation of 3-D objects and
overrides stretch-to-fill.
AXIS OFF turns off all axis labeling, tick marks and background.
AXIS ON turns axis labeling, tick marks and background back on.


IMAGESC: Scale data and display as image.

IMAGESC(...) is the same as IMAGE(...) except the data is scaled to use the full
colormap.
IMAGESC(...,CLIM) where CLIM = [CLOW CHIGH] can specify the scaling.

COLORBAR: Display color bar (color scale).
COLORBAR('vert') appends a vertical color scale to the current axes.
COLORBAR('horiz') appends a horizontal color scale.


COLORBAR(H) places the colorbar in the axes H. The colorbar will be horizontal if
the axes H width > height (in pixels).
COLORBAR without arguments either adds a new vertical color scale or updates an
existing colorbar.
H = COLORBAR(...) returns a handle to the colorbar axes.
COLORBAR(...,'peer',AX) creates a colorbar associated with axes AX instead of the
current axes.

GET: Get object properties.
V = GET(H,'PropertyName') returns the value of the specified property for the graphics
object with handle H. If H is a vector of handles, then GET will return an M-by-1 cell
array of values where M is equal to length(H). If 'PropertyName' is replaced by a
1-by-N or N-by-1 cell array of strings containing property names, then GET will return
an M-by-N cell array of values.
GET(H) displays all property names and their current values for the graphics object
with handle H.
V = GET(H) where H is a scalar, returns a structure where each field name is the name
of a property of H and each field contains the value of that property.
V = GET(0, 'Factory')
V = GET(0, 'Factory<ObjectType>')
V = GET(0, 'Factory<ObjectType><PropertyName>')
returns for all object types the factory values of all properties which have user-settable
default values.
V = GET(H, 'Default')
V = GET(H, 'Default<ObjectType>')
V = GET(H, 'Default<ObjectType><PropertyName>')
returns information about default property values (H must be scalar). 'Default' returns a
list of all default property values currently set on H. 'Default<ObjectType>' returns
only the defaults for properties of <ObjectType> set on H.
'Default<ObjectType><PropertyName>' returns the default value for the specific
property, by searching the defaults set on H and its ancestors, until that default is found.
If no default value for this property has been set on H or any ancestor of H up through
the root, then the factory value for that property is returned.
Defaults can not be queried on a descendant of the object, or on the object itself - for
example, a value for 'DefaultAxesColor' can not be queried on an axes or an axes
child, but can be queried on a figure or on the root. When using the 'Factory' or
'Default' GET, if PropertyName is omitted then the return value will take the form of a
structure in which each field name is a property name and the corresponding value is
the value of that property. If PropertyName is specified then a matrix or string value
will be returned.

SET: Set object properties.
SET(H,'PropertyName',PropertyValue) sets the value of the specified property for the
graphics object with handle H. H can be a vector of handles, in which case SET sets the
properties' values for all the objects. SET(H,a), where a is a structure whose field names
are object property names, sets the properties named in each field name with the values
contained in the structure.
SET(H,pn,pv) sets the named properties specified in the cell array of strings pn to the
corresponding values in the cell array pv for all objects specified in H. The cell array
pn must be 1-by-N, but the cell array pv can be M-by-N where M is equal to length(H)
so that each object will be updated with a different set of values for the list of property
names contained in pn.
SET(H,'PropertyName1',PropertyValue1,'PropertyName2',PropertyValue2,...) sets
multiple property values with a single statement. Note that it is permissible to use
property/value string pairs, structures, and property/value cell array pairs in the same
call to SET.
A = SET(H, 'PropertyName')

SET(H,'PropertyName')
returns or displays the possible values for the specified property of the object with
handle H. The returned array is a cell array of possible value strings or an empty cell
array if the property does not have a finite set of possible string values.
A = SET(H)
SET(H)
returns or displays all property names and their possible values for the object with
handle H. The return value is a structure whose field names are the property names of
H, and whose values are cell arrays of possible property values or empty cell arrays.
The default value for an object property can be set on any of an object's ancestors by
setting the PropertyName formed by concatenating the string 'Default', the object type,
and the property name. For example, to set the default color of text objects to red in
the current figure window:
set(gcf,'DefaultTextColor','red')
Defaults can not be set on a descendant of the object, or on the
object itself - for example, a value for 'DefaultAxesColor' can not be set on an axes or
an axes child, but can be set on a figure or on the root.
Three strings have special meaning for PropertyValues:
'default' - use default value (from nearest ancestor) 'factory' - use factory default value
'remove' - remove default value.

ROUND: Round towards nearest integer.

ROUND(X) rounds the elements of X to the nearest integers.

SIZE: Size of array.
D = SIZE(X), for an M-by-N matrix X, returns the two-element row vector D = [M, N]
containing the number of rows and columns in the matrix. For N-D arrays, SIZE(X)
returns a 1-by-N vector of dimension lengths. Trailing singleton dimensions are
ignored.
[M,N] = SIZE(X) for matrix X, returns the number of rows and columns in X as
separate output variables.
[M1,M2,M3,...,MN] = SIZE(X) returns the sizes of the first N dimensions of array X.
If the number of output arguments N does not equal NDIMS(X), then for:
N > NDIMS(X), SIZE returns ones in the "extra" variables,
i.e., outputs NDIMS(X)+1 through N.
N < NDIMS(X), MN contains the product of the sizes of the remaining dimensions,
i.e., dimensions N+1 through NDIMS(X).
M = SIZE(X,DIM) returns the length of the dimension specified by the scalar DIM.
For example, SIZE(X,1) returns the number of rows. When SIZE is applied to a Java
array, the number of rows returned is the length of the Java array and the number of
columns is always 1. When SIZE is applied to a Java array of arrays, the result
describes only the top level array in the array of arrays.

SPRINTF: Write formatted data to string.
[S,ERRMSG] = SPRINTF(FORMAT,A,...) formats the data in the real part of matrix A
(and in any additional matrix arguments), under control of the specified FORMAT
string, and returns it in the MATLAB string variable S. ERRMSG is an optional output
argument that returns an error message string if an error occurred or an empty matrix if
an error did not occur. SPRINTF is the same as FPRINTF except that it returns the data
in a MATLAB string variable rather than writing it to a file.
FORMAT is a string containing C language conversion specifications. Conversion
specifications involve the character %, optional flags, optional width and precision
fields, optional subtype specifier, and conversion characters d, i, o, u, x, X, f, e, E, g, G,
c, and s.
The special formats \n,\r,\t,\b,\f can be used to produce linefeed, carriage return, tab,
backspace, and formfeed characters respectively.
Use \\ to produce a backslash character and %% to produce the percent character.
SPRINTF behaves like ANSI C with certain exceptions and extensions. These include:
ANSI C requires an integer cast of a double argument to correctly use an integer
conversion specifier like d. A similar conversion is required when using such a
specifier with non-integral MATLAB values. Use FIX, FLOOR, CEIL or ROUND on a
double argument to explicitly convert non-integral MATLAB values to integral values
if you plan to use an integer conversion specifier like d. Otherwise, any non-integral
MATLAB values will be outputted using the format where the integer conversion
specifier letter has been replaced by e. The following non-standard subtype specifiers
are supported for conversion characters o, u, x, and X:
t - The underlying C datatype is a float rather than an unsigned integer.
b - The underlying C datatype is a double rather than an unsigned integer.
For example, to print out in hex a double value use a format like '%bx'.
SPRINTF is "vectorized" for the case when A is nonscalar. The format string is
recycled through the elements of A (columnwise) until all the elements are used up. It
is then recycled in a similar manner through any additional matrix arguments.
Examples:
sprintf('%0.5g',(1+sqrt(5))/2)        % 1.618
sprintf('%0.5g',1/eps)                % 4.5036e+15
sprintf('%15.5f',1/eps)               % 4503599627370496.00000
sprintf('%d',round(pi))               % 3
sprintf('%s','hello')                 % hello
sprintf('The array is %dx%d.',2,3)    % The array is 2x3.
sprintf('\n') is the line termination character on all platforms.

MELFB: Determine matrix for a mel-spaced filterbank.
Inputs: p, the number of filters in the filterbank; n, the length of the FFT; fs, the sample
rate in Hz.
Outputs: x, a (sparse) matrix containing the filterbank amplitudes;
size(x) = [p, 1+floor(n/2)].
Usage: for example, to compute the mel-scale spectrum of a column-vector signal s,
with length n and sample rate fs:
f = fft(s);
m = melfb(p, n, fs);
n2 = 1 + floor(n/2);
z = m * abs(f(1:n2)).^2;
z would contain p samples of the desired mel-scale spectrum.
To plot filterbanks, e.g.:
plot(linspace(0, (12500/2), 129), melfb(20, 256, 12500)'),
title('Mel-spaced filterbank'), xlabel('Frequency (Hz)');


ABS: Absolute value.

ABS(X) is the absolute value of the elements of X. When X is complex, ABS(X) is the
complex modulus (magnitude) of the elements of X.
PLOT: Linear plot.
PLOT(X,Y) plots vector Y versus vector X. If X or Y is a matrix, then the vector is
plotted versus the rows or columns of the matrix, whichever line up. If X is a scalar and
Y is a vector, length(Y) disconnected points are plotted. PLOT(Y) plots the columns of
Y versus their index. If Y is complex, PLOT(Y) is equivalent to
PLOT(real(Y),imag(Y)). In all other uses of PLOT, the imaginary part is ignored.
Various line types, plot symbols and colors may be obtained with PLOT(X,Y,S) where
S is a character string made from one element from any or all the following 3 columns:
b  blue       .  point             -   solid
g  green      o  circle            :   dotted
r  red        x  x-mark            -.  dashdot
c  cyan       +  plus              --  dashed
m  magenta    *  star
y  yellow     s  square
k  black      d  diamond
              v  triangle (down)
              ^  triangle (up)
              <  triangle (left)
              >  triangle (right)
              p  pentagram
              h  hexagram

For example, PLOT(X,Y,'c+:') plots a cyan dotted line with a plus at each data point;
PLOT(X,Y,'bd') plots blue diamond at each data point but does not draw any line.
PLOT(X1,Y1,S1,X2,Y2,S2,X3,Y3,S3,...) combines the plots defined by the (X,Y,S)
triples, where the X's and Y's are vectors or matrices and the S's are strings.
For example, PLOT(X,Y,'y-',X,Y,'go') plots the data twice, with a solid yellow line
interpolating green circles at the data points. The PLOT command, if no color is
specified, makes automatic use of the colors specified by the axes ColorOrder property.
The default ColorOrder is listed in the table above for color systems where the default
is blue for one line, and for multiple lines, to cycle through the first six colors in the
table. For monochrome systems, PLOT cycles over the axes LineStyleOrder property.
PLOT returns a column vector of handles to LINE objects, one handle per line.

SUBPLOT: Create axes in tiled positions.

H = SUBPLOT(m,n,p), or SUBPLOT(mnp), breaks the Figure window into an m-by-n


matrix of small axes, selects the p-th axes for the current plot, and returns the axis
handle. The axes are counted along the top row of the Figure window, then the second
row, etc. For example,
SUBPLOT(2,1,1), PLOT(income)
SUBPLOT(2,1,2), PLOT(outgo)
plots income on the top half of the window and outgo on the bottom half.
SUBPLOT(m,n,p), if the axis already exists, makes it current.
SUBPLOT(m,n,p,'replace'), if the axis already exists, deletes it and creates a new axis.
SUBPLOT(m,n,P), where P is a vector, specifies an axes position that covers all the
subplot positions listed in P.
SUBPLOT(H), where H is an axis handle, is another way of making an axis current for
subsequent plotting commands.
SUBPLOT('position',[left bottom width height]) creates an axis at the specified
position in normalized coordinates (in the range from 0.0 to 1.0).
If a SUBPLOT specification causes a new axis to overlap an existing axis, the existing
axis is deleted - unless the position of the new and existing axis are identical. For
example, the statement SUBPLOT(1,2,1) deletes all existing axes overlapping the left
side of the Figure window and creates a new axis on that side - unless there is an axes
there with a position that exactly matches the position of the new axes (and 'replace'
was not specified), in which case all other overlapping axes will be deleted and the
matching axes will become the current axes.
SUBPLOT(111) is an exception to the rules above, and is not
identical in behavior to SUBPLOT(1,1,1). For reasons of backwards compatibility, it is
a special case of subplot which does not immediately create an axes, but instead sets up
the figure so that the next graphics command executes CLF RESET in the figure
(deleting all children of the figure), and creates a new axes in the default position. This
syntax does not return a handle, so it is an error to specify a return argument. The
delayed CLF RESET is accomplished by setting the figure's NextPlot to 'replace'.

HOLD: Hold current graph.

HOLD ON holds the current plot and all axis properties so that subsequent graphing
commands add to the existing graph.
HOLD OFF returns to the default mode whereby PLOT commands erase the previous
plots and reset all axis properties before drawing new plots.
HOLD, by itself, toggles the hold state.
HOLD does not affect axis autoranging properties.
Algorithm note:
HOLD ON sets the NextPlot property of the current figure and axes to "add".
HOLD OFF sets the NextPlot property of the current axes to "replace".

VQLBG: Vector quantization using the Linde-Buzo-Gray algorithm.
Inputs: d contains the training data vectors (one per column); k is the number of
centroids required.
Output: r contains the resulting VQ codebook (k columns, one for each centroid).

LEGEND: Graph legend.

LEGEND(string1,string2,string3, ...) puts a legend on the current plot using the


specified strings as labels. LEGEND works on line graphs, bar graphs, pie graphs,
ribbon plots, etc. You can label any solid-colored patch or surface object. The fontsize
and fontname for the legend strings matches the axes fontsize and fontname.
LEGEND(H,string1,string2,string3, ...) puts a legend on the plot containing the handles
in the vector H using the specified strings as labels for the corresponding handles.
LEGEND(M), where M is a string matrix or cell array of strings, and LEGEND(H,M)
where H is a vector of handles to lines and patches also works.
LEGEND(AX,...) puts a legend on the axes with handle AX.

LEGEND OFF removes the legend from the current axes.


LEGEND(AX,'off') removes the legend from the axis AX.
LEGEND HIDE makes legend invisible.
LEGEND(AX,'hide') makes legend on axis AX invisible.
LEGEND SHOW makes legend visible.
LEGEND(AX,'show') makes legend on axis AX visible.
LEGEND BOXOFF sets appdata property legendboxon to 'off', making the legend
background box invisible when the legend is visible.
LEGEND(AX,'boxoff') sets appdata property legendboxon to 'off' for axis AX, making
the legend background box invisible when the legend is visible.
LEGEND BOXON sets appdata property legendboxon to 'on', making the legend
background box visible when the legend is visible.
LEGEND(AX,'boxon') sets appdata property legendboxon to 'on' for axis AX, making
the legend background box visible when the legend is visible.
LEGH = LEGEND returns the handle to legend on the current axes or empty if none
exists.
LEGEND with no arguments refreshes all the legends in the current figure (if any).
LEGEND(LEGH) refreshes the specified legend.

LEGEND(...,Pos) places the legend in the specified location:


0 = Automatic "best" placement (least conflict with data)
1 = Upper right-hand corner (default)
2 = Upper left-hand corner
3 = Lower left-hand corner
4 = Lower right-hand corner
-1 = To the right of the plot
To move the legend, press the left mouse button on the legend and drag to the desired
location. Double clicking on a label allows you to edit the label.
[LEGH,OBJH,OUTH,OUTM] = LEGEND(...) returns a handle LEGH to the legend
axes; a vector OBJH containing handles for the text, lines, and patches in the legend; a
vector OUTH of handles to the lines and patches in the plot; and a cell array OUTM
containing the text in the legend.


LEGEND will try to install a ResizeFcn on the figure if it hasn't been defined before.
This resize function will try to keep the legend the same size.
Examples:
x = 0:.2:12;
plot(x,bessel(1,x),x,bessel(2,x),x,bessel(3,x));
legend('First','Second','Third');
legend('First','Second','Third',-1)
b = bar(rand(10,5),'stacked');
colormap(summer);
hold on
x = plot(1:10,5*rand(10,1),'marker','square','markersize',12,...
'markeredgecolor','y','markerfacecolor',[.6 0 .6],...
'linestyle','-','color','r','linewidth',2);
hold off
legend([b,x],'Carrots','Peas','Peppers','Green Beans',...
'Cucumbers','Eggplant')
Speaker Recognition: Testing Stage
Input:
testdir : string name of the directory containing all test sound files
n       : number of test files in testdir
code    : codebooks of all trained speakers
Note:
Sound files in testdir are supposed to be named s1.wav, s2.wav, ..., sn.wav
Example:
>> test('C:\data\amintest\', 8, code);

MFCC
Inputs: s contains the signal to analyze; fs is the sampling rate of the signal.
Output: r contains the transformed signal.
Speaker Recognition: Training Stage
Input:
traindir : string name of the directory containing all training sound files
n        : number of training files in traindir
Output:
code     : trained VQ codebooks, code{i} for the i-th speaker
Example:
>> code = train('C:\data\amintrain\', 8);

DISTEU: Pairwise Euclidean distances between columns of two matrices.
Input:
x, y: two matrices, each column of which is a data vector.
Output:
d: element d(i,j) is the Euclidean distance between the two column vectors
X(:,i) and Y(:,j).
Note:
The Euclidean distance D between two vectors X and Y is:
D = sum((x-y).^2).^0.5
ZEROS: Zeros array.
ZEROS(N) is an N-by-N matrix of zeros.
ZEROS(M,N) or ZEROS([M,N]) is an M-by-N matrix of zeros.
ZEROS(M,N,P,...) or ZEROS([M N P ...]) is an M-by-N-by-P-by-... array of zeros.
ZEROS(SIZE(A)) is the same size as A and all zeros.

CEIL: Round towards plus infinity.

CEIL(X) rounds the elements of X to the nearest integers towards infinity.

MIN: Smallest component.
For vectors, MIN(X) is the smallest element in X. For matrices, MIN(X) is a row
vector containing the minimum element from each column. For N-D arrays, MIN(X)
operates along the first non-singleton dimension. [Y,I] = MIN(X) returns the indices of
the minimum values in vector I. If the values along the first non-singleton dimension
contain more than one minimal element, the index of the first one is returned.
MIN(X,Y) returns an array the same size as X and Y with the smallest elements taken
from X or Y. Either one can be a scalar. [Y,I] = MIN(X,[],DIM) operates along the
dimension DIM.
When complex, the magnitude MIN(ABS(X)) is used, and the angle ANGLE(X) is
ignored. NaN's are ignored when computing the minimum.

SPARSE: Create sparse matrix.

S = SPARSE(X) converts a sparse or full matrix to sparse form by squeezing out any
zero elements.
S = SPARSE(i,j,s,m,n,nzmax) uses the rows of [i,j,s] to generate an m-by-n sparse
matrix with space allocated for nzmax nonzeros. The two integer index vectors, i and j,
and the real or complex entries vector, s, all have the same length, nnz, which is the
number of nonzeros in the resulting sparse matrix S. Any elements of s which have
duplicate values of i and j are added together.
There are several simplifications of this six argument call.
S = SPARSE(i,j,s,m,n) uses nzmax = length(s).
S = SPARSE(i,j,s) uses m = max(i) and n = max(j).
S = SPARSE(m,n) abbreviates SPARSE([],[],[],m,n,0). This generates the ultimate
sparse matrix, an m-by-n all-zero matrix. The argument s and one of the arguments i or j
may be scalars, in which case they are expanded so that the first three arguments all
have the same length. For example, this dissects and then reassembles a sparse matrix:
[i,j,s] = find(S);
[m,n] = size(S);
S = sparse(i,j,s,m,n);
So does this, if the last row and column have nonzero entries:
[i,j,s] = find(S);
S = sparse(i,j,s);
All of MATLAB's built-in arithmetic, logical and indexing operations can be applied to
sparse matrices, or to mixtures of sparse and full matrices. Operations on sparse
matrices return sparse matrices and operations on full matrices return full matrices. In
most cases, operations on mixtures of sparse and full matrices return full matrices. The
exceptions include situations where the result of a mixed operation is structurally
sparse, e.g. A .* S is at least as sparse as S. Some operations, such as S >= 0, generate
"Big Sparse", or "BS", matrices -- matrices with sparse storage organization but few
zero elements.


DCT: Discrete cosine transform.
Y = DCT(X) returns the discrete cosine transform of X. The vector Y is the same size as
X and contains the discrete cosine transform coefficients. Y = DCT(X,N) pads or
truncates the vector X to length N before transforming. If X is a matrix, the DCT
operation is applied to each column. This transform can be inverted using IDCT.

MEAN: Average or mean value.
For vectors, MEAN(X) is the mean value of the elements in X. For matrices,
MEAN(X) is a row vector containing the mean value of each column. For N-D arrays,
MEAN(X) is the mean value of the elements along the first non-singleton dimension of
X. MEAN(X,DIM) takes the mean along the dimension DIM of X.
Example: If X = [0 1 2; 3 4 5]
then mean(X,1) is [1.5 2.5 3.5] and mean(X,2) is [1; 4]
Disadvantages of MATLAB:
MATLAB has two principal disadvantages. The first is that it is an interpreted language
and can therefore execute more slowly than compiled languages. This problem can be
mitigated by properly structuring the MATLAB program and by using the MATLAB
compiler to compile the final MATLAB program before distribution and general use.
The second disadvantage is cost: a full copy of MATLAB is 5 to 10 times more
expensive than a conventional C or Fortran compiler.

CONCLUSION

Speaker recognition is the process of automatically recognizing who is speaking on the
basis of individual information included in speech waves. This technique makes it
possible to use a speaker's voice to verify their identity and control access to services
such as voice dialing, banking by telephone, telephone shopping, etc.
This project dealt with a speaker independent system, which is developed to operate for any
speaker of a particular type (e.g. American English). Such systems are the most
difficult to develop and most expensive, and their accuracy is lower than that of speaker
dependent systems; however, they are more flexible.
By applying the procedure described above, for each speech frame of around 30 msec
with overlap, a set of mel-frequency cepstrum coefficients is computed.
These are the result of a cosine transform of the logarithm of the short-term power
spectrum expressed on a mel-frequency scale. This set of coefficients is called an
acoustic vector. Each input utterance is therefore transformed into a sequence of acoustic
vectors. This project also discussed how those acoustic vectors can be used to represent
and recognize the voice characteristics of the speaker.

BIBLIOGRAPHY

L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall,
Englewood Cliffs, N.J., 1993.

L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall,
Englewood Cliffs, N.J., 1978.

Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design", IEEE
Transactions on Communications, Vol. 28, pp. 84-95, 1980.

S. Furui, "Speaker independent isolated word recognition using dynamic features of
speech spectrum", IEEE Transactions on Acoustics, Speech, and Signal Processing,
Vol. ASSP-34, No. 1, pp. 52-59, February 1986.

S. Furui, "An overview of speaker recognition technology", ESCA Workshop on
Automatic Speaker Recognition, Identification and Verification, pp. 1-9, 1994.

F.K. Soong, A.E. Rosenberg and B.H. Juang, "A vector quantization approach to
speaker recognition", AT&T Technical Journal, Vol. 66-2, pp. 14-26, March 1987.

comp.speech Frequently Asked Questions WWW site,
http://svr-www.eng.cam.ac.uk/comp.speech

