
Project Report on:

SPEECH RECOGNITION USING MATLAB

By

SHAIVAL SHAH EC-106

Under Supervision of

Prof. R.K.Dana

Department of Electronics and Communication,


Faculty of Technology, Nadiad


CERTIFICATE
This is to certify that the project titled SPEECH RECOGNITION by
SHAIVAL SHAH (EC-106) is a bona fide work carried out during semester
VII of his undergraduate course in Electronics & Communication
Engineering under my guidance and supervision.

Prof. R.K.Dana
Project Supervisor,
Department of Electronics & Communication,
D. D. University,
Nadiad

Dr. N. J. Kothari
Head,
Department of Electronics & Communication,
D. D. University,
Nadiad


INDEX

LIST OF FIGURES
1. OVERVIEW
2. PREAMBLE
   2.1 INTRODUCTION
   2.2 BLOCK DIAGRAM
   2.3 PROBLEM DEFINITION
   2.4 OBJECTIVES OF THE STUDY
   2.5 REVIEW OF LITERATURE
   2.6 APPLICATIONS
   2.7 METHODOLOGY
3. THEORY
   3.1 Voice Commands
   3.2 Microphone
   3.3 Sampler
   3.4 Band Pass Filter
   3.5 Processing and Decision Making Unit
4. DESIGN AND IMPLEMENTATION
   4.1 CODE
5. RESULT AND OBSERVATION
6. CONCLUSION
7. REFERENCE


LIST OF FIGURES

FIGURE   TITLE
2        BLOCK DIAGRAM
3        SPEECH RECOGNITION
5.1      MAIN PROGRAM
5.2      GET POWER
5.3      GET AVG POWER
5.4      AVG POWER OF 1ST SIGNAL
5.5      AVG POWER OF 2ND SIGNAL
5.6      AVG POWER OF 3RD SIGNAL
5.7      FFT OF SIGNALS
5.8      FFT OF MATCHED SIGNALS WITH SPECIFIC SAMPLINGS
5.9      FINAL OUTPUT

Chapter 1
OVERVIEW
Speech recognition is a technology that enables a computer to capture the words spoken by a
human with the help of a microphone. These words are then recognized by a speech recognizer,
and finally the system outputs the recognized words. The process of speech recognition
consists of several steps that will be discussed in the following sections one by one. Ideally,
a speech recognition engine recognizes every word uttered by a human, but in practice its
performance depends on a number of factors. Vocabulary size, multiple users and noisy
environments are the major factors on which the performance of a speech recognition engine
depends.


Chapter 2
PREAMBLE
2.1 INTRODUCTION
In this report we concentrate on speech recognition programs that are human-computer
interactive. When software evaluators observe humans testing such software programs, they
gain valuable insights into technological problems and barriers that they may never witness
otherwise. Testing speech recognition products for universal usability is an important step
before considering the product a viable solution for its customers. This document concerns
speech recognition accuracy in the automobile, which is a critical factor in the development
of hands-free human-machine interactive devices. There are two separate issues that we want
to test: word recognition accuracy and software friendliness. Major factors that impede
recognition accuracy in the automobile include noise sources such as tire and wind noise
while the vehicle is in motion, engine noise, noise produced by the car radio/entertainment
system, fans, windshield wipers, horn, turn signals, heater, A/C, temperature settings, cruise
control speed setting, headlights, emergency flashers, and others listed below.

But what is speech recognition?

Speech recognition works like this. You speak into a microphone and the computer
transforms the sound of your words into text to be used by your word processor or other
applications available on your computer. The computer may repeat what you just said or it
may give you a prompt for what you are expected to say next. This is the central promise of
interactive speech recognition. Early speech recognition programs made you speak in
staccato fashion, insisting that you leave a gap between every two words. You also had to
correct any errors virtually as soon as they happened, which meant that you had to
concentrate so hard on the software that you often forgot what you were trying to say.


The new voice recognition systems are certainly much easier to use. You can speak
at a normal pace without leaving distinct pauses between words. However, you cannot really
use natural speech as claimed by the manufacturers. You must speak clearly, as you do
when you speak to a Dictaphone or when you leave someone a telephone message.
Remember, the computer is relying solely on your spoken words. It cannot interpret your tone
or inflection, and it cannot interpret your gestures and facial expressions, which are part of
everyday human communication. Some of the systems also look at whole phrases, not just the
individual words you speak. They try to get information from the context of your speech to
help work out the correct interpretation. This is how they can (sometimes) work out what you
mean when there are several words that sound similar (such as 'to', 'too' and 'two').


2.2 BLOCK DIAGRAM

Audio Detection → Amplifiers and Filters → Analog to Digital Converter → Digital Filters → Fingerprint Generation → Fingerprint Matching → Output


2.3 PROBLEM DEFINITION

This document defines a set of evaluation criteria and test methods for speech
recognition systems used in vehicles. The evaluation covers in-vehicle use, noisy
environments, and the accuracy and suitability of control under various conditions. The
effects of engine noise, interference from turbulent air outside the car, interference from the
car's radio/entertainment system, and interference from the car's windshield wipers are all
considered separately. Recognition accuracy was compared using a variety of different road
routines, noisy environments and languages. Testing in the ideal, non-noisy environment of a
quiet room has also been performed for comparison.

2.4 OBJECTIVES OF THE STUDY

The purpose of this project report is to give a detailed description of the algorithms used for
speech recognition, the hardware for speech recognition, and the software programs used for the
development of the security system, which first authenticates the person and, after proper
authentication, follows the predefined commands given by him.

2.5 REVIEW OF LITERATURE

The training process improves the accuracy of speech recognition systems quite
quickly if the end-users or the software developers take the time to correct mistakes. This is
important. The voice model created after your enrollment is constantly changed and
updated as you correct misinterpretations made by the system. This is how the programs
learn. If you do this properly then the accuracy you obtain will improve. If you don't, the
accuracy will deteriorate.


There are three types of corrections you need to make when you are modifying
text. The first is when you cough or get tongue-tied, and the word comes out nothing
like what you intended. The speech recognition systems make an honest (if sometimes
humorous) attempt to translate your jumble. The systems always repeat the word you
just said; this is therefore also the way to detect errors from the systems. The solution
is to select the word and then say the word again properly in place of the mistake, or
just delete the word and start over.

The second circumstance is when you simply change your mind. You said 'this'
but you now want to say 'therefore'. You can make these changes at any time
because the speech recognition systems have not made a mistake. You simply
change the word (by typing or by voice).

In both of these first two cases, the speech recognition software has not made
a mistake. In the third type of correction the software gets it wrong. If you say 'this'
and the system interprets the word as 'dish', then you need to go through a
correction procedure to enable the system to learn from its mistake. You cannot just
backspace and try again, tempting though that might be.

Why is this? The reason is that modern speech recognition is not based on
individual words, but on small sounds within the words. Information gained about the
way you say 'th' will be generalized to other words with a 'th'. Thus if the system
misinterprets the word 'this', it does not mean the error is restricted to that one word.
If you do not correct the error, other 'th' words might be affected. If you do correct
the error, then the new information is also spread to other words, improving accuracy.
Your voice model will always be getting better or worse. This is something to think
through thoroughly.


2.6 APPLICATIONS
1) Mistakes made by the human and by the system
2) Baseline (windows up; fans, radio/entertainment system and windshield wipers off)
3) Driver's window down (noise from streets, tires and other vehicles)
4) Environmental controls (fans, heater, A/C and windshield wipers on)
5) Critical controls (cruise control speed setting, emergency lights, headlights, a variety of different speeds and speaker locations)

2.7 METHODOLOGY
The first idea was to use optical imaging (CCD cameras) to see the line. This was later given
up for various reasons, including complexity and the unavailability of components. Later a
choice was made to use an array of sensors, which solved most of the problems pertaining to
complexity.

The resistor values used in the sensor array were determined experimentally rather than by
theoretical design calculations. This was done because the data sheets of the proximity sensor
were not available anywhere and most of the parameters had to be determined experimentally.

The L293D chip is used as it was a much better option than forming an H-bridge out of
discrete transistors, which would make the design unstable and prone to damage. The PIC
microcontroller was used as it is the only device I have full practical knowledge of, and above
all it is a RISC processor, which is better suited for real-time operations. Thus the midrange
devices were chosen. The part 16F873 was used as it has 2 CCP modules which I could use in
PWM mode, thus simplifying the software routines which I'd otherwise have had to write to
generate the PWM control for the motors.
A priority encoder was used to reduce the number of I/O lines to 5, which would otherwise
require 7 and a lot of additional complexity in software, which only results in sluggish
operation and inefficiency.
Extra hardware was added to let the robot know whether it is on a surface or not. This keeps it
from running off a table and preserves the battery if it is manually lifted off the floor. Software
was coded day and night, deciding on a few algorithms and a few tiny details which gradually
got the robot to do what was required. Then extra code was added to find a line if it is not on one.

Chapter 3

THEORY
To recognize voice commands efficiently, different parameters of speech such as pitch, amplitude
pattern or power/energy can be used. Here, the power of the speech signal is used to recognize the
voice commands. First the voice commands are captured with the help of a microphone that is
directly connected to the PC. The analog voice signals are then sampled using MATLAB 7.0.1. As
speech signals generally lie in the range of 300 Hz - 4000 Hz, according to the Nyquist sampling
theorem the minimum sampling rate required is 8000 samples/second. After sampling, the discrete
data obtained is passed through a band pass filter with a pass band of 300 Hz - 4000 Hz. The basic
purpose of using a band pass filter is to eliminate the noise that lies at low frequencies (below 300 Hz);
above 4000 Hz there is generally no speech signal. The algorithm for voice recognition is based on
speech templates. The templates basically consist of the power of the discrete signals. To create the
templates, the power of each sample is calculated and the accumulated power of 250 subsequent
samples is represented by one value. For example, in the implemented algorithm 16000 samples are
taken, so the power of the discrete data is represented by 64 discrete values, as the power of each
group of 250 subsequent samples (i.e. 1-250, 251-500, ..., 15750-16000) is accumulated and
represented by one value. The number of samples taken and grouped is entirely flexible and can be
changed keeping in mind the required accuracy, the memory space available and the processing time.

Figure 3: Block diagram of the speech recognition system.
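As an illustration of the template idea described above, the following MATLAB sketch groups the per-sample power of a 16000-sample recording into 64 values of 250 samples each. This is only an illustrative sketch (the variable x is assumed to already hold the sampled command); the code actually implemented in Chapter 4 works on grouped FFT magnitudes rather than time-domain samples.

groupsize = 250;                              % samples per template value
numgroups = floor(length(x)/groupsize);       % 64 groups for 16000 samples
template  = zeros(numgroups,1);
for k = 1:numgroups
    seg = x((k-1)*groupsize+1 : k*groupsize); % k-th block of 250 samples
    template(k) = sum(seg.^2);                % accumulated power of the block
end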


3.1 Voice Commands


The voice commands are given by the person who has been authenticated by the algorithm. In
the proposed algorithm a limit on the number of voice commands has been imposed to make
the system useful for real world applications.

3.2 Microphone
The microphone takes the commands from the authenticated person. It is directly connected
to the personal computer. Commands given by the person are acquired as analog inputs using
the Data Acquisition Toolbox of MATLAB.

3.3 Sampler
The speech signal obtained is sampled to convert it into discrete form. The sampling is done
in MATLAB. As speech signals lie in the range of 300 Hz - 4000 Hz, according to the Nyquist
sampling theorem the minimum sampling rate required is 8000 samples/second. However, in
order to obtain the required accuracy the sampling rate is set to 16000 samples/second.
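For example, a one-second command can be captured at this rate with MATLAB's audiorecorder object; this is only a sketch, and in an older release such as 7.0.1 the Data Acquisition Toolbox (or wavrecord) would be used instead.

fs  = 16000;                     % 16000 samples/second
rec = audiorecorder(fs, 16, 1);  % 16-bit, single channel
recordblocking(rec, 1);          % record for 1 second
x   = getaudiodata(rec);         % column vector of speech samples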

3.4 Band Pass Filter

After sampling, the discrete signal obtained is passed through a band pass filter. Here, a
fourth-order Chebyshev band pass filter with a pass band of 300 Hz - 4000 Hz is used. The
band pass filter removes the noise that lies outside the pass band.
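Such a filter can be designed, for instance, with the Signal Processing Toolbox function cheby1. The sketch below assumes a sampling rate of 16000 samples/second and a 1 dB pass band ripple (the ripple value is an assumption, not taken from this report); note that a bandpass design of order 2 yields a fourth-order filter.

fs = 16000;
wn = [300 4000]/(fs/2);                 % pass band edges normalized to Nyquist
[b, a] = cheby1(2, 1, wn, 'bandpass');  % fourth-order Chebyshev type I bandpass
y = filter(b, a, x);                    % x is the sampled speech signal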


3.5 Processing and Decision Making Unit

The processing unit performs all the processing of the speech signals required for voice
command recognition. Here, a personal computer with MATLAB 7.0.1 is used as the
processing and decision making unit.


Chapter 4
DESIGN AND IMPLEMENTATION
4.1 CODE

The code is divided into 3 modules.

mainprogram.m = the main script. It loads the reference recordings of each word, builds an
average power template per word, computes the power of the test sample and decides which
word matches best.
getavgpower.m = a function that takes four recordings of one word and returns their average
five-band spectral power, which serves as that word's template.
getpower.m = a function that computes the five-band spectral power of a single recording.
4.1.1 mainprogram.m
close all;
clear all;
clc;
n=0;
s=0;
p=0;
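% Load four reference recordings of each command word ('nokia', 'sony', 'apple') from the database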
[w1,fs1]=wavread('nokia (1).wav');
[w2,fs2]=wavread('nokia (2).wav');
[w3,fs3]=wavread('nokia (3).wav');
[w4,fs4]=wavread('nokia (4).wav');
[w5,fs1]=wavread('sony (1).wav');
[w6,fs2]=wavread('sony (2).wav');
[w7,fs3]=wavread('sony (3).wav');
[w8,fs4]=wavread('sony (4).wav');
[w9,fs1]=wavread('apple (1).wav');
[w10,fs2]=wavread('apple (2).wav');
[w11,fs3]=wavread('apple (3).wav');
[w12,fs4]=wavread('apple (4).wav');
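% Build the average five-band power template of each word from its four recordings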
nokia_power=getavgpower(w1,w2,w3,w4);
sony_power=getavgpower(w5,w6,w7,w8);
apple_power=getavgpower(w9,w10,w11,w12);
figure();
stem(nokia_power);
title('nokia power');
figure();
stem(sony_power);
title('sony power');
figure();
stem(apple_power);
title('apple power');
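% Load the test sample, play it back and compute its five-band power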
[newsound,fs2]=wavread('sample_3.wav'); % sample to be detected

wavplay(newsound,fs2);
new_power=getpower(newsound);
figure();
stem(new_power);
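% Absolute difference between the test sample's band powers and each word template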
sonydiff=zeros(5,1);
nokiadiff=zeros(5,1);
applediff=zeros(5,1);
for i=1:5
sonydiff(i,1)=abs(new_power(i,1)-sony_power(i,1));
nokiadiff(i,1)=abs(new_power(i,1)-nokia_power(i,1));
applediff(i,1)=abs(new_power(i,1)-apple_power(i,1));
end
sonysum=sum(sonydiff);
applesum=sum(applediff);
nokiasum=sum(nokiadiff);
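% The word whose template differs least from the test sample is declared the match and its logo is displayed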

if sonysum<applesum
if sonysum<nokiasum
figure();
title('sony detected');
a=imread('sony.jpg');
imshow(a);
else
figure();
title('nokia detected');
a=imread('nokia.jpg');
imshow(a);
end
else
if applesum<nokiasum
figure();
title('apple detected');
a=imread('apple.jpg');
imshow(a);
else
figure();
title('nokia detected');
a=imread('nokia.jpg');
imshow(a);
end
end
4.1.2 getavgpower.m
function avgpower=getavgpower(s1,s2,s3,s4)

count=0;
s1final=zeros(16000,1);
s3final=zeros(16000,1);
s2final=zeros(16000,1);
s4final=zeros(16000,1);
fre =16000/2*linspace(0,1,4096*2+1);
s1len=length(s1);
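% Trim leading silence: keep roughly 16000 samples starting from the first sample whose
% magnitude exceeds 0.2 (the same is done for s2, s3 and s4 below)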
for i=1:s1len
if (s1(i,1)>.2 | s1(i,1)<-.2)
for j=i:s1len
count=count+1;
s1final(count,1)=s1(j,1);
if count>16000
break;
end
end
break ;
end
end
count=0;
%wavplay(s1final,fs);
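% 16384-point FFT magnitude of the trimmed recording, normalized to its peak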
s1f=abs(fft(s1final,4096*4));
m=max(s1f);
s1f=s1f/m;

s2len=length(s2);
for i=1:s2len
if (s2(i,1)>.2 | s2(i,1)<-.2)
for j=i:s2len
count=count+1;
s2final(count,1)=s2(j,1);
if count>16000
break;
end
end
break ;
end
end
count=0;
%wavplay(s2final,fs);
s2f=abs(fft(s2final,4096*4));
m=max(s2f);

s2f=s2f/m;

s3len=length(s3);
for i=1:s3len
if (s3(i,1)>.2 | s3(i,1)<-.2)
for j=i:s3len
count=count+1;
s3final(count,1)=s3(j,1);
if count>16000
break;
end
end
break ;
end
end
count=0;
%wavplay(s3final,fs);
s3f=abs(fft(s3final,4096*4));
m=max(s3f);
s3f=s3f/m;

s4len=length(s4);
for i=1:s4len
if (s4(i,1)>.2 | s4(i,1)<-.2)
for j=i:s4len
count=count+1;
s4final(count,1)=s4(j,1);
if count>16000
break;
end
end
break ;
end
end
count=0;
%wavplay(s4final,fs);
s4f=abs(fft(s4final,4096*4));
m=max(s4f);
s4f=s4f/m;
sound_sum1=zeros(5,1);
sound_sum2=zeros(5,1);
sound_sum3=zeros(5,1);
sound_sum4=zeros(5,1);
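% Accumulate each normalized spectrum into five bands of about 400 FFT bins (bins 1-399, 401-799, ...)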

for i=0:4
for j=1:399
sound_sum1((1+i),1)=sound_sum1((1+i),1)+s1f(i*400+j,1);
sound_sum2((1+i),1)=sound_sum2((1+i),1)+s2f(i*400+j,1);
sound_sum3((1+i),1)=sound_sum3((1+i),1)+s3f(i*400+j,1);
sound_sum4((1+i),1)=sound_sum4((1+i),1)+s4f(i*400+j,1);
end
end
figure();
subplot(2,2,1);
stem(sound_sum1);
title('sound_sum1');
subplot(2,2,2);
stem(sound_sum2);
title('sound_sum2');
subplot(2,2,3);
stem(sound_sum3);
title('sound_sum3');
subplot(2,2,4);
stem(sound_sum4);
title('sound_sum4');
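% Average the four recordings' band powers to obtain the word's template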

avgpower=zeros(5,1);

avgpower(1,1)=(sound_sum1(1,1)+sound_sum2(1,1)+sound_sum3(1,1)+sound_sum4(1,1
))/4;
avgpower(2,1)=(sound_sum1(2,1)+sound_sum2(2,1)+sound_sum3(2,1)+sound_sum4(2,1
))/4;
avgpower(3,1)=(sound_sum1(3,1)+sound_sum2(3,1)+sound_sum3(3,1)+sound_sum4(3,1
))/4;
avgpower(4,1)=(sound_sum1(4,1)+sound_sum2(4,1)+sound_sum3(4,1)+sound_sum4(4,1
))/4;
avgpower(5,1)=(sound_sum1(5,1)+sound_sum2(5,1)+sound_sum3(5,1)+sound_sum4(5,1
))/4;

4.1.3 getpower.m
function soundpower=getpower(sound)
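% Computes the five-band normalized spectral power of a single recording, using the same
% trimming, FFT and band accumulation as getavgpower.m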
fs1=16000;
fre1 =8000*linspace(0,1,4096*2+1); %
soundfinal=zeros(16000,1);
count=0;
soundlen=length(sound);
for i=1:soundlen

if (sound(i,1)>.2 | sound(i,1)<-.2)
for j=i:soundlen
count=count+1;
soundfinal(count,1)=sound(j,1);
if count>16000
break;
end
end
break ;
end
end
count=0;
soundf=abs(fft(soundfinal,4096*4));
m=max(soundf);
soundf=soundf/m;
figure();
stem(fre1,soundf(1:4096*2+1));
title('soundf');
sound_sum=zeros(5,1);
for i=0:4
for j=1:399
sound_sum((1+i),1)=sound_sum((1+i),1)+soundf(i*400+j,1);
end
end

soundpower=sound_sum;


Figure 5.1: mainprogram.m


Figure 5.2: getpower.m


Figure 5.3: getavgpower.m


CHAPTER 5
RESULT AND OBSERVATION

Figure 5.4: Average power of the first signal. This is the average power of the NOKIA signal, generated from the four sound wave files stored in the database.


Figure 5.5: Average power of the second signal. This is the average power of the SONY signal, generated from the four sound wave files stored in the database.

Figure 5.6: Average power of the third signal. This is the average power of the APPLE signal, generated from the four sound wave files stored in the database.


Figure 5.7: FFT of the signals. By taking the FFT of the recorded wave file and correlating it with the database, the fingerprints are compared and the best-matching fingerprint is selected.


Figure 5.8: FFT of the matched signal with specific samplings. This is the correlated output of the given signal.


Figure 5.9: Final output image. This is the output when WAVEFILE_1 is selected as the input source; corresponding outputs are obtained when WAVEFILE_2 and WAVEFILE_3 are selected.


CHAPTER 6
CONCLUSION
The task of speech recognition has been seen to involve many techniques. Some
modern improvements on older processes have been shown to be of great help in such tasks as
formant identification, word separation, and phone separation. Accurate speech recognition
is possible, as this project has demonstrated, but as the size of the vocabulary grows, so does
the number of techniques required to differentiate between words. As our techniques improve,
so will the accuracy of our speech recognition programs.
On a personal note, the project was quite informative and enjoyable. It would be interesting
to see it extended into its later stages.


REFERENCE
1. Appleton, E. (1993). Put usability to the test. Datamation, 39(14), 61-62.
2. Jacobsen, N., Hertzum, M., & John, B. (1998). The evaluator effect in usability tests. SIGCHI: ACM Special Interest Group on Computer-Human Interaction, 255-256.
3. Lecerof, A., & Paterno, F. (1998). Automatic support for usability evaluation. IEEE Transactions on Software Engineering, 24(10), 863-888.
4. Gales, M. J. F., & Young, S. (1992). An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise. ICASSP-92, pp. I-233 - I-236.
5. Acero, A., & Stern, R. M. (1990). Environmental Robustness in Automatic Speech Recognition. ICASSP-90, pp. 849-852.
6. Dal Degan, N., & Prati, C. (1988). Acoustic Noise Analysis and Speech Enhancement Techniques for Mobile Radio Applications. Signal Processing, 15, 43-56.
7. http://www.scansoft.com/naturallyspreaking/
8. http://www-3.ibm.com/software/speech/
9. http://www.speech.philips.com/freespeech2000/
10. http://www.fonix.com/
11. http://www.fonix.com/products/embedded/mobilegt/
12. http://www.fonix.com/support/resources.php#whitepapers
13. http://translate.google.com/translate?hl=en&sl=fr&u=http://handisurflionel.chez.tiscali.fr/viavoice_millenium/&prev=/search%3Fq%3DIBM%2BViaVoice%2BMillenium%2BPro%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8
14. http://www.andreaelectronics.com/faq/da-400.htm

