
International Journal of Computer Science Trends and Technology (IJCST), Volume 4, Issue 2, Mar-Apr 2016

RESEARCH ARTICLE

OPEN ACCESS

An Analysis on Types of Speech Recognition and Algorithms


Dr. V. Ajantha Devi [1], Ms. V. Suganya [2]

[1] Assistant Professor, [2] M.Phil Research Scholar
Department of Computer Science
Sri Adi Chunchanagiri Women's College
Cumbum, Theni Dt.
Tamil Nadu - India

ABSTRACT
Speech recognition has of late become a practical technology. It is used in real-world human language applications, such as information retrieval. Speech is the most common means of communication because it plays a fundamental role in conversation. A speech recognition system converts an acoustic signal, captured by a microphone or a telephone, into a set of words. This cluster of words can either be the final result (speech-to-text) or serve as the input to further linguistic processing to achieve speech understanding. This paper analyses the types and algorithms of speech recognition.
Keywords:- Speech Recognition; Feature Extraction; MFCC; LPC; Hidden Markov Model; Neural Network; Dynamic Time Warping.

I. INTRODUCTION

Speech recognition is the process of taking the spoken word as an input to a computer program. It is the technology by which sounds, words, or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned. Speech recognition systems that use training are called "speaker-dependent" systems. The speech signal is captured through a microphone and processed by software running on a PC. Speech recognition technology is used in robotics, automation, and human-computer interface applications.

Finding a linguistic interpretation typically means finding the sequence of characters that forms the words. Speech recognition offers a friendly human interface for computer control, and it is often confused with natural language understanding. The recognition process is complicated because the production of phonemes, and the transitions between them, vary from person to person. Different people speak differently, and even a word or a phrase spoken by the same individual may differ from moment to moment.

II. BASICS OF SPEECH RECOGNITION

The following definitions are the basics needed for understanding speech recognition technology.
Utterance
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence, or even multiple sentences.

Figure 1: Structure of Speech Recognition System


Speech recognition is a complex, dynamic classification task: the process of finding a linguistic interpretation of a spoken utterance.
ISSN: 2347-8578

Speaker Dependence
Speaker-dependent systems are designed around a specific speaker. They are generally more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker-independent systems and use training techniques to adapt to the speaker, increasing their recognition accuracy.
Vocabularies

www.ijcstjournal.org

Page 350

Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry does not have to be a single word; entries can be as long as a sentence or two.
Accuracy
The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying whether a spoken utterance lies outside its vocabulary. Good ASR systems have an accuracy of 98% or more; the acceptable accuracy of a system really depends on the application.
Training
Some speech recognizers have the ability to adapt to a speaker; when a system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases while it adjusts its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty speaking or pronouncing certain words: as long as the speaker can consistently repeat an utterance, an ASR system with training should be able to adapt.

III. TYPES OF SPEECH RECOGNITION SYSTEMS

Speaker dependent system - The voice recognition must be trained before it can be used. This often requires that a user read a series of words and phrases so the computer can learn the user's voice.
Speaker independent system - The voice recognition software recognizes most users' voices with no training.
Discrete speech recognition - The user must pause between each word so that the speech recognition can identify each separate word.
Continuous speech recognition - The voice recognition can understand a normal rate of speaking.
Natural language - The speech recognition not only understands the voice but can also return answers to questions or other queries.
Automatic Speech Recognition (ASR) is the process by which a computer maps an acoustic speech signal to text. ASR offers a solution directed at converting speech into text; the technology used for this process is known as ASR technology.


Automatic Speech Emotion Recognition (ASER) aims at the automatic identification of different human emotions or physical states through a human's voice. It acts as a feedback system for real-life applications in the field of robotics, where a robot follows human commands while understanding the emotional state of the human. It is also used in security, learning, medicine, entertainment, etc. In entertainment, the idea is incorporated into the development of interesting games with virtual reality; in the field of medicine, it can be used for analysis and diagnosis of a human's cognitive state.

IV. HISTORY OF SPEECH RECOGNITION

Speech recognition research has been ongoing for more than 80 years. Over that period there have been at least four generations of approaches, and a fifth generation is being formulated based on current research themes. By 2001, computer speech recognition had reached 80% accuracy, and no further progress was reported until 2010. Speech recognition technology then began to edge back into the forefront with one major event: the arrival of the Google Voice Search app for the iPhone. In 2010, Google added personalized recognition to Voice Search on Android phones, so that the software could record users' voice searches and produce a more accurate speech model. The company also added Voice Search to its Chrome browser in mid-2011. Like Google's Voice Search, Siri relies on cloud-based processing; it draws on its knowledge about the speaker to generate a contextual reply and responds to voice input.

V. MODELS, METHODS AND ALGORITHMS OF SPEECH RECOGNITION

Acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in natural language processing applications such as document classification and statistical machine translation.
i) HIDDEN MARKOV MODEL:

A Hidden Markov Model (HMM) is a statistical model that outputs a sequence of symbols or quantities. HMMs can be trained automatically and are simple and feasible to use. In speech recognition, the hidden Markov model outputs a sequence of n-dimensional real-valued vectors. The vectors consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform.
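The evaluation step of an HMM, i.e. how likely a given model is to have produced an observed feature sequence, can be sketched with the forward algorithm. The sketch below is a minimal illustration with a hypothetical two-state model and discrete symbols standing in for quantized cepstral vectors; real recognizers use continuous (e.g. Gaussian-mixture) emission densities.

```python
# Forward algorithm: total probability that an HMM generated an
# observation sequence. Discrete symbols stand in for quantized
# cepstral vectors; A, B, and pi here are made-up toy parameters.

def forward(A, B, pi, obs):
    """A[i][j]: transition prob i->j, B[i][k]: prob state i emits symbol k,
    pi[i]: initial state prob. Returns P(obs | model)."""
    n = len(pi)
    # Initialization: alpha[i] = pi[i] * B[i][obs[0]]
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction over the remaining observations
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    # Termination: sum over all possible final states
    return sum(alpha)

A  = [[0.7, 0.3], [0.4, 0.6]]   # transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]   # emission matrix over symbols {0, 1}
pi = [0.5, 0.5]                 # initial distribution
print(forward(A, B, pi, [0, 1, 0]))  # likelihood of observing 0, 1, 0
```

Because the forward probabilities sum over every hidden state path, the likelihoods of all possible observation sequences of a fixed length sum to one, which is a convenient sanity check on an implementation.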


ii) DYNAMIC TIME WARPING (DTW):


Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) subject to certain restrictions: the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
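The non-linear warping can be sketched with the standard dynamic-programming recurrence, shown here as a minimal illustration using absolute difference as the local cost (real systems compare feature vectors such as MFCC frames):

```python
# Dynamic time warping: minimal total alignment cost between two
# sequences that may differ in length or speaking rate.

def dtw(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    # D[i][j]: cost of the best warp aligning x[:i] with y[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # local distance
            # extend the best of: diagonal match, or stretching either axis
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated 2
```

A slow utterance that repeats a frame costs nothing extra to align, which is exactly the tolerance to tempo variation the text describes.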
iii) NEURAL NETWORKS:
Neural networks have emerged as an attractive acoustic modeling approach. They have been used in many aspects of speech recognition, such as phoneme classification, isolated word recognition, and speaker adaptation. In contrast to HMMs, neural networks make no assumptions about the statistical properties of features, and they have several qualities that make them attractive recognition models for speech. When estimating the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient way.
iv) SAMPLING THEORY:
When a microphone records a person's analog speech signal through the computer, the data quality of the sampled signal directly determines the quality of the speech recognition, and the sampling frequency is one of the decisive factors for that quality. An analog signal actually consists of many different frequency components; for illustration, assume there is only one frequency component in the analog signal and that it has no phase shift. Sampling converts the analog signal x(t) into the discrete-time signal x(n), which the computer can process. Generally, the discrete signal x(n) is regarded as a signal sequence or a vector.

Figure 2: Sampling the analog signal


The time period of the analog signal x(t) is T, and the sampling period of the discrete-time signal is Ts; the relation between a signal's frequency and its time period is reciprocal. When the sampling frequency is greater than or equal to twice the maximum frequency present in the analog signal, the discrete-time signal can be used to reconstruct the original analog signal, and a higher sampling frequency results in better sampled signals for analysis.
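The sampling step, and what goes wrong when the twice-the-maximum-frequency condition is violated, can be illustrated with a single sinusoid (the rate of 8 kHz below is a common choice for telephone-bandwidth speech, used here purely as an example):

```python
import math

def sample(freq_hz, fs_hz, n_samples):
    """Sample the analog signal x(t) = sin(2*pi*f*t) at rate fs,
    giving the discrete-time sequence x(n) = x(n / fs)."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz)
            for n in range(n_samples)]

fs = 8000                      # 8 kHz sampling rate
a = sample(440, fs, 5)         # 440 Hz tone, well below fs / 2
b = sample(440 + fs, fs, 5)    # 8440 Hz tone, far above fs / 2
# The two sample sequences are identical: the 8440 Hz component
# aliases onto 440 Hz and the original signal cannot be recovered.
print(all(abs(u - v) < 1e-6 for u, v in zip(a, b)))  # True
```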
v) DISCRETE FOURIER TRANSFORM (DFT):

The DFT is a Fourier transform for the discrete-time signal x(n) instead of the continuous analog signal x(t). The main function of the Fourier transform is to change the independent variable from the time index n to the frequency variable ω, which means transforming the signal from the time domain into the frequency domain.
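The transform itself is a direct sum, sketched below for illustration; a cosine completing exactly one cycle over the analysis window concentrates its energy in the corresponding frequency bins:

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N))
            for k in range(N)]

# One full cosine cycle over N samples: energy lands in bins 1 and N-1
N = 8
tone = [math.cos(2 * math.pi * n / N) for n in range(N)]
spectrum = dft(tone)
print([round(abs(X), 6) for X in spectrum])
```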
vi) FAST FOURIER TRANSFORM (FFT):
The FFT is still the DFT, transforming the discrete-time signal from the time domain into the frequency domain; the difference is that the FFT is faster and more efficient to compute. There are many ways to increase the computational efficiency of the DFT, but the most widely used FFT algorithm is the radix-2 FFT algorithm.
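The radix-2 idea splits the transform into half-size transforms of the even- and odd-indexed samples, recombined with "twiddle factors". A minimal recursive sketch (input length must be a power of two; production code uses an iterative in-place form):

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of even-indexed samples
    odd = fft(x[1::2])    # DFT of odd-indexed samples
    # Combine the two half-size transforms with twiddle factors exp(-2j*pi*k/N)
    t = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return ([even[k] + t[k] for k in range(N // 2)] +
            [even[k] - t[k] for k in range(N // 2)])

print(fft([1, 1, 1, 1]))  # constant signal: all energy in bin 0
```

The recursion reduces the DFT's O(N^2) multiplications to O(N log N), which is why frame-by-frame spectral analysis of speech is practical in real time.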
vii) CROSS-CORRELATION ALGORITHM:
For the same speaker, different words occupy different frequency bands, owing to the different vibrations of the vocal cords, and the shapes of their spectra also differ. To realize speech recognition, the spectrum of the third recorded signal is compared against the first two recorded reference signals. By checking which of the two reference signals better matches the third recording, the system judges which reference word was recorded the third time. The cross-correlation algorithm is used to check the correlation of two signals.
The basic idea of the algorithm is:
1. Fix one of the two signals, x(n), and shift the other signal y(n) left or right by some number of time units.
2. Multiply the value of x(n) with the shifted signal y(n+m), position by position.
3. Take the summation of all the multiplication results x(n)y(n+m).
For example, take two sequences x(n) = [0 0 0 1 0] and y(n) = [0 1 0 0 0], each of length N = 5. The cross-correlation of x(n) and y(n) is shown in the following figures:

Figure 5: The results of the cross-correlation (summation of multiplications)
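The three steps above can be sketched directly, using the example sequences from the text (note that lag-sign conventions vary between references; this sketch uses r[m] = sum_n x[n]y[n+m]):

```python
def xcorr(x, y):
    """Cross-correlation r[m] = sum_n x[n] * y[n+m],
    for lags m = -(N-1) .. (N-1); out-of-range samples count as zero."""
    N = len(x)
    lags = list(range(-(N - 1), N))
    r = [sum(x[n] * y[n + m] for n in range(N) if 0 <= n + m < N)
         for m in lags]
    return lags, r

# The example sequences from the text
x = [0, 0, 0, 1, 0]
y = [0, 1, 0, 0, 0]
lags, r = xcorr(x, y)
peak = lags[r.index(max(r))]
print(peak)  # the single non-zero product occurs at lag m = -2
```

The peak at m = -2 reflects that y(n) = x(n+2) under this convention: the lag of the cross-correlation peak reveals how far one signal is shifted relative to the other, which is the basis of the matching judgment described above.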
viii) AUTO-CORRELATION ALGORITHM:
Auto-correlation is the algorithm that measures how strongly a signal is correlated with itself.
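Auto-correlation is cross-correlation of a signal with itself; a short sketch shows the characteristic property that a periodic signal produces peaks at lags that are multiples of its period (useful, for instance, in pitch estimation):

```python
def autocorr(x):
    """Auto-correlation r[m] = sum_n x[n] * x[n+m] for lags m = 0 .. N-1."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(N)]

# A signal with period 4: peaks at lag 0 (total energy) and lag 4
x = [1, 0, -1, 0, 1, 0, -1, 0]
r = autocorr(x)
print(r)  # [4, 0, -3, 0, 2, 0, -1, 0]
```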
The FIR Wiener Filter:
The FIR Wiener filter is used to estimate the desired signal d(n) from the observation process x(n), producing the estimate d^(n). It is assumed that d(n) and x(n) are correlated and jointly wide-sense stationary, and the estimation error is e(n) = d(n) - d^(n). The purpose of the Wiener filter is to choose a suitable filter order and find the filter coefficients with which the system obtains the best estimate; in other words, with the proper coefficients the system minimizes the mean-square error.
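As a deliberately minimal illustration, consider a single-coefficient (zeroth-order) filter d^(n) = w·x(n): the mean-square error E[(d(n) - w·x(n))^2] is minimized by w = E[d(n)x(n)] / E[x(n)^2]. Higher-order FIR designs solve the analogous Wiener-Hopf normal equations; the data below are made up for the example.

```python
def wiener_1tap(d, x):
    """Single-coefficient FIR Wiener filter: the w minimizing the
    mean-square error E[(d(n) - w*x(n))^2] is E[d*x] / E[x^2],
    estimated here by sample averages."""
    rdx = sum(dn * xn for dn, xn in zip(d, x)) / len(x)  # cross-correlation at lag 0
    rxx = sum(xn * xn for xn in x) / len(x)              # auto-correlation at lag 0
    return rdx / rxx

# d(n) = 2 * x(n) exactly, so the optimal coefficient is 2 and the error is zero
x = [1.0, 2.0, 3.0, 4.0]
d = [2.0, 4.0, 6.0, 8.0]
w = wiener_1tap(d, x)
print(w)  # 2.0
```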

Figure 3: The signal sequence x(n)

VI. PERFORMANCE EVALUATION OF ASR TECHNIQUES

The performance of a speech recognition system is measurable. Perhaps the most widely used measurements are accuracy and speed. Accuracy is measured with the Word Error Rate (WER), whereas speed is measured with the real-time factor. WER can be computed by equation (1).
Figure 4: The signal sequence y(n), shifted left or right by m units


WER = (S + D + I) / N                (1)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference.
The speed of a speech recognition system is commonly measured in terms of the Real-Time Factor (RTF): if it takes time P to process an input of duration I, the RTF is defined by formula (2).

RTF = P / I                (2)

A comparison of various speech recognition research efforts, based on the dataset, feature vectors, and speech recognition technique adopted for each particular language, is given in Table 1.
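Equation (1) can be evaluated by aligning the hypothesis against the reference with a minimum edit distance, which yields the smallest total S + D + I; a sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate (S + D + I) / N, where the error counts come from
    the minimum edit distance between the word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # i deletions
    for j in range(m + 1):
        d[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[n][m] / n

print(wer("the cat sat", "the cat sat down"))  # one insertion over three words
```

The real-time factor of formula (2) is simply the ratio of processing time to audio duration, so RTF below 1 means the recognizer keeps up with live speech.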


VII. CONCLUSION

Speech recognition has been in development for more than 50 years, and has been entertained as an alternative access method for individuals with disabilities for almost as long. In this paper, the fundamentals of speech recognition were discussed and its recent progress investigated. The various approaches available for developing an ASR system were explained along with their merits and demerits, and the performance of ASR systems was compared on the basis of the adopted feature extraction technique and the speech recognition approach for each particular language. In recent years, the need for speech recognition research based on large-vocabulary, speaker-independent continuous speech has greatly increased. Based on this review, the HMM approach is the most suitable for these requirements and offers good recognition results. These techniques will enable us to
create increasingly powerful systems, deployable on a worldwide basis in the future.
