
A Project Report On

Speaker Recognition System


Implemented in MATLAB

Abstract

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves.

This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers. Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. Each of these speaker recognition technologies, whether identification or verification, text-independent or text-dependent, has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that

we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.

Overview

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

Principles of Speaker Recognition


Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems. Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on

his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. Each of these technologies, whether identification or verification, text-independent or text-dependent, has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said. At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing the features extracted from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.

Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment sessions or training phase, while the second is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, in addition, a speaker-specific

threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made. Speaker recognition is a difficult task and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself: speech signals in training and testing sessions can be greatly different due to many factors, such as people's voices changing with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).

General Idea of Speech Recognition

Human speech presents a formidable pattern classification task for a speech recognition system. Numerous speech recognition techniques have been formulated, yet the very best techniques used today have recognition capabilities well below those of a child. This is because human speech is highly dynamic and complex. There are generally several disciplines involved in the study of human speech, and a basic understanding of them is needed in order to create an effective system. The following provides a brief description of the disciplines that have been applied to speech recognition problems.

Signal Processing

This process extracts the important information from the speech signal in a well-organised manner. In signal processing, spectral analysis is used to characterize the time-varying properties of the speech signal. Several other types of processing are also needed prior to the spectral analysis stage to make the speech signal more accurate and robust.

Acoustics

The science of understanding the relationship between the physical speech signal and the human vocal tract mechanisms that produce the speech and with which the speech is distinguished.

Pattern Recognition

A set of coding algorithms used to compute data to create prototypical patterns of a data ensemble. It is used to compare a pair of patterns based on the features extracted from the speech signal.

"ommunication and Information # eory

The procedures for estimating the parameters of statistical models and the methods for recognizing the presence of speech patterns.

Linguistics

This refers to the relationships between sounds, words in a sentence, and the meaning and logic of spoken words.

Physiology

This refers to the comprehension of the higher-order mechanisms within the human central nervous system. It is responsible for the production and perception of speech in human beings.

Computer Science

The study of effective algorithms for application in software and hardware; for example, the various methods used in a speech recognition system.

Psychology

The science of understanding the aspects that enable the technology to be used by human beings.

Speech Production

Speech is the acoustic product of the voluntary and well-controlled movement of the human vocal mechanism. During the generation of speech, air is inhaled into the lungs by expanding the rib cage and drawing it in via the nasal cavity, velum and trachea. It is then expelled back into the air by contracting the rib cage and increasing the lung pressure. During the expulsion of air, the air travels from the lungs and passes through the vocal cords, which are two symmetric pieces of ligament and muscle located in the larynx on the trachea. Speech is produced by the vibration of the vocal cords. Before the expulsion of air, the larynx is initially closed. When the pressure produced by the

expelled air is sufficient, the vocal cords are pushed apart, allowing air to pass through. The vocal cords close again when the air flow decreases. This relaxation cycle repeats with generation frequencies in the range of 80 Hz to 300 Hz. The generated frequency depends on the speaker's age, sex, stress and emotions. This succession of glottal openings and closures generates quasi-periodic pulses of air beyond the vocal cords. The figure below shows a schematic view of the human speech apparatus.

The speech signal is a time-varying signal whose characteristics represent the different speech sounds produced. There are three ways of labelling events in speech. First is the silence state, in which no speech is produced. The second is the unvoiced state, in which the vocal cords are not vibrating, so the output speech waveform is aperiodic and random in nature. The last is the voiced state, in which the vocal cords vibrate periodically as air is expelled from the lungs, resulting in an output speech waveform that is quasi-periodic. The figure below shows a speech waveform with unvoiced and voiced states.

Speech is produced as a sequence of sounds. The type of sound produced depends on the shape of the vocal tract, which extends from the opening of the vocal cords to the end of the lips. Its cross-sectional area depends on the position of the tongue, lips, jaw and velum; these articulators therefore play an important part in the production of speech.

Block Diagram (Engineering Model) of the Human Speech Production System

Factors associated with speech:

Formants:


It has "een kno!n from research that #ocal tract and nasal tract are tu"es !ith non)uniform cross)sectional area$ As sound generated propagates through these the tu"es' the fre%uency spectrum is shaped "y the fre%uency selecti#ity of the tu"e$ This effect is #ery similar to the resonance effects o"ser#ed in organ pipes and !ind instruments$ In the conte(t of speech production' the resonance fre%uencies of #ocal tract are called formant

frequencies, or simply formants. In our engineered model, the poles of the transfer function correspond to the formants. The human auditory system is much more sensitive to poles than to zeros.

Phonemes:
Phonemes can be defined as the "symbols from which every sound can be classified or produced". Every language has its own particular phonemes, ranging from about 30 to 50. English has 42 phonemes. A crude estimate of the information rate of speech, considering physical limitations on articulatory motion, is about 10 phonemes per second.

Types of Phonemes:
Speech sounds can be classified into three distinct classes according to the mode of excitation:
1. Plosive Sounds
2. Voiced Sounds
3. Unvoiced Sounds

1. Plosive Sounds:


Plosive sounds result from making a complete closure (again toward the front end of the vocal tract), building up pressure behind the closure, and abruptly releasing it.

2. Voiced Sounds:


Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. Voiced sounds are characterized by:
• High energy levels

• Very distinct resonant and formant frequencies

The rate at which the vocal cords vibrate determines the pitch. These vibrations are periodic in time, so voiced sounds are approximated by an impulse train. The spacing between impulses is the pitch period, the reciprocal of the fundamental frequency F0.

3. Unvoiced Sounds:


Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at a high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract. Unvoiced sounds are characterized by:
• Lower energy levels than voiced sounds
• Higher frequencies than voiced sounds

In other words, unvoiced sounds (e.g. /sh/, /s/, /p/) are generated without vocal cord vibration. The excitation is modeled by a white Gaussian noise source. Unvoiced sounds have no pitch since they are excited by a non-periodic signal.

Spectra of Typical Voiced and Unvoiced Speech

By passing the speech through a prediction filter A(z), the spectrum is flattened (whitened) considerably, but it still contains some fine detail.

Special Types of Voiced and Unvoiced Sounds:


There are, however, some special types of voiced and unvoiced sounds which are briefly discussed here. The purpose of this discussion is only to give the reader an idea of the further types of voiced and unvoiced speech.

Vowels:
Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (formants) and thus the sound that is produced. The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The area function of a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent. Examples: a, e, i, o, u.

Diphthongs:
Although there is some ambiguity and disagreement as to what is and what is not a diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. According to this definition, there are six diphthongs in American English. Diphthongs are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong. Based on these data, the diphthongs can be

characteri ed "y a time #arying #ocal tract area function !hich #aries "et!een t!o #o!el configurations$ 5(amples0 5i1 .as in "ay/ ' o@1 .as in "oat/ ' aI1 .as in "uy/ ' a@1 .as in ho!/

Semivowels:
The group of sounds consisting of /w/, /l/, /r/ and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. They are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes. Thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. For our purposes they are simply considered transitional vowel-like sounds and hence are similar in nature to vowels and diphthongs.

Voiced Fricatives:
The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/, /θ/, /s/ and /sh/ respectively, in that the place of constriction for each of the corresponding phonemes is essentially identical. However, the voiced fricatives differ from their unvoiced counterparts in that two excitation sources are involved in their production. The spectra of voiced fricatives can therefore be expected to display two distinct components.

Voiced Stops:
The voiced stops /b/, /d/ and /g/ are transient, non-continuant sounds which are produced by building up pressure behind a total constriction somewhere in the oral tract and suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the

constriction is at the "ack of the teethE and for 1g1 it is near the #elum$ 3uring the period there is a total constriction in the tract there is no sound radiated from the lips$ Since the stop sounds are dynamical in nature' there properties are highly influenced "y the #o!el !hich follo!s the stop consonant$

Unvoiced Stops:
The unvoiced stop consonants /p/, /t/ and /k/ are similar to their voiced counterparts /b/, /d/ and /g/ with one major exception. During the period of total closure of the tract, as the pressure builds up, the vocal cords do not vibrate. Thus, following the period of closure, as the air pressure is released, there is a brief interval of friction (due to the sudden turbulence of the escaping air) followed by a period of aspiration (steady flow of air from the glottis exciting the resonances of the vocal tract) before voiced excitation begins.

Hearing and Perception

Audible sounds are transmitted to the human ear through the vibration of particles in the air. The human ear consists of three parts: the outer ear, the middle ear and the inner ear. The function of the outer ear is to direct speech pressure variations toward the eardrum, where the middle ear converts the pressure variations into mechanical motion. The mechanical motion is then transmitted to the inner ear, which transforms this motion into electrical potentials that pass through the auditory nerve and cortex to the brain. The figure below shows a schematic diagram of the human ear.

Schematic Diagram of the Human Ear

The Engineered Model:


The speech mechanism can be modeled as a time-varying filter (the vocal tract) excited by an oscillator (the vocal folds), with different outputs. When a voiced sound is produced, the filter is excited by an impulse train, with fundamental frequencies in the range of roughly 60-400 Hz. When an unvoiced sound is produced, the filter is excited by random white noise, without any observed periodicity. These attributes can be observed when the speech signal is examined in the time domain.

Figure (a): The Human Speech Production; Figure (b): Speech Production by a Machine
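As an illustration of this engineered model, the MATLAB sketch below excites one fixed all-pole (vocal-tract) filter with either a quasi-periodic impulse train or white noise. The sampling rate, pitch and filter coefficients are assumed demonstration values, not parameters taken from this report.

fs = 8000; F0 = 120;                         % assumed sampling rate and fundamental frequency
a  = [1 -1.3 0.9];                           % assumed all-pole vocal-tract coefficients
excLen = round(0.25*fs);                     % 250 ms of excitation
voicedExc = zeros(1, excLen);
voicedExc(1:round(fs/F0):end) = 1;           % quasi-periodic impulse train (voiced excitation)
unvoicedExc = randn(1, excLen);              % white-noise excitation (unvoiced)
voiced   = filter(1, a, voicedExc);          % synthetic voiced output of the source-filter model
unvoiced = filter(1, a, unvoicedExc);        % synthetic unvoiced output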
Why Encode Speech?

Speech coding has been, and still is, a major issue in the area of digital speech processing. Speech coding is the act of transforming the speech signal at hand into a more compact form, which can then be transmitted or stored using considerably less memory. The motivation behind this is the fact that access to an unlimited amount of bandwidth is not possible; therefore, there is a need to code and compress speech signals. Speech compression is required in long-distance communication, high-quality speech storage, and message encryption. For example, in digital cellular technology many users need to share the same frequency bandwidth; utilizing speech compression makes it possible for more users to share the available system. Another example where speech compression is

needed is in digital voice storage: for a fixed amount of available memory, compression makes it possible to store longer messages. Speech coding is a lossy type of coding, which means that the output signal does not sound exactly like the input; the input and output signals can be distinguished as different. Coding of audio, however, is a different kind of problem than speech coding. Audio coding tries to code the audio in a perceptually lossless way: even though the input and output signals are not mathematically equivalent, the sound at the output is perceived to be the same as the input. This type of coding is used in applications for audio storage, broadcasting, and Internet streaming. Several techniques of speech coding exist, such as Linear Predictive Coding (LPC), waveform coding and sub-band coding. The problem at hand is to use LPC to code given speech sentences. The speech signals that need to be coded are wideband signals with frequencies up to 8 kHz, so by the Nyquist criterion the sampling frequency must be at least 16 kHz. Different types of applications have different time-delay constraints; for example, in network telephony only a delay of 1 ms is acceptable, whereas a delay of 500 ms is permissible in video telephony. Another constraint at hand is not to exceed an overall bit rate of 8 kbps. The speech coder that will be developed is going to be analyzed using both subjective and objective analysis. Subjective analysis will consist of listening to the encoded speech signal and judging its quality; the quality of the played-back speech will be based solely on the opinion of the listener, who may rate the speech as impossible to understand, intelligible, or natural-sounding. Even though this is a valid measure of quality, an objective analysis will be introduced to technically

assess the speech quality and to minimize human bias. Furthermore, an analysis of the effects of bit rate, complexity and end-to-end delay on the output speech quality will be made. The report concludes with a summary of results and some ideas for future work.

Speech Processing

The speech waveform needs to be converted into digital format before it is suitable for processing in the speech recognition system. The raw speech waveform is in analog format before conversion. The conversion of an analog signal to a digital signal involves three phases: sampling, quantisation and coding. In the sampling phase, the analog signal is transformed from a waveform that is continuous in time into a discrete signal, i.e. a sequence of samples that are discrete in time. In the quantisation phase, each approximate sampled value is converted into one of the finite values contained in a code set. These two stages allow the speech waveform to be represented by a sequence of values, each belonging to a finite set. After passing through the sampling and quantisation stages, the signal is coded in the coding phase, where it is usually represented by a binary code. These three phases need to be carried out with caution, as any miscalculation, improper sampling or quantisation noise will result in loss of information. Below are the problems faced in these phases.

Sampling

According to the Nyquist theorem, the minimum sampling rate required is two times the bandwidth of the signal. This minimum sampling frequency is needed for the

reconstruction of a "and limited !a#eform !ithout error$ Aliasing distortion !ill occur if the minimum sampling rate is not met$ ,igure "elo! sho!s the comparison "et!een a properly sampled case and an improperly sampled case$

Aliasing Distortion Caused by Improper Sampling

Quantization

Speech signals are more likely to have amplitude values near zero than at the extreme peak values allowed. For example, in digitizing voice, if the peak value allowed is 1 V, weak passages may have voltage levels on the order of 0.1 V. Speech signals with such a non-uniform amplitude distribution are likely to suffer quantizing noise if the step size is not reduced for amplitude values near zero and increased for extremely large values. This quantizing noise is known as granular and slope-overload noise. Granular noise occurs when the step size is too large for amplitude values near zero. Slope-overload noise occurs when the step size is too small and cannot keep up with extremely large amplitude changes. To solve this quantizing noise problem, Delta Modulation (DM) is used. Delta Modulation works by reducing the step size for amplitude values near zero and increasing the step size for extremely large amplitude values. The figure below shows a diagram of the two types of noise.
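A minimal fixed-step delta-modulation sketch is given below; the test signal and step size are assumed values used only for illustration. With a fixed step, a small delta produces slope-overload noise on steep segments while a large delta produces granular noise near zero, which is the trade-off the step-size adaptation described above is meant to resolve.

fs = 8000;                         % assumed sampling rate
t  = (0:1/fs:0.02)';               % 20 ms of a toy test signal
x  = 0.5*sin(2*pi*200*t);          % stands in for a recorded speech segment
delta = 0.05;                      % fixed step size (assumed)
acc   = 0;                         % accumulator (staircase approximation)
bits  = zeros(size(x));
xhat  = zeros(size(x));
for n = 1:length(x)
    if x(n) >= acc, bits(n) = 1; else bits(n) = -1; end   % 1-bit decision
    acc = acc + bits(n)*delta;                            % update the staircase
    xhat(n) = acc;                                        % decoder output
end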

Analog Input and Accumulator Output Waveforms

Approaches to Speech Recognition

Human beings are the best "machine" for recognizing and understanding speech. We are able to combine a wide variety of linguistic knowledge concerned with syntax and semantics and to use this knowledge adaptively according to the difficulty and characteristics of the sentences. Speech recognition systems are built with the aim of matching or exceeding human performance. There are generally three approaches to speech recognition, namely the acoustic-phonetic, pattern recognition and artificial intelligence approaches. These three approaches are explained in greater detail in the following sections.

Speec "oding A digital speech coder can "e classified into t!o main categories' mainly !a#eform coders and #ocoders$ 2a#eform coders employ algorithms to encode and decode speech signals so that the system output is an appro(imation to the input !a#eform$ ?ocoders encode speech signals "y e(tracting a set of parameters that are digiti ed and transmitted to the recei#er$ This set of digiti ed parameters is used to set #alues for parameters in function generators and filters' !hich in turn synthesi e the output speech signals$ The

vocoder output waveform does not approximate the input waveform and may produce an unnatural sound.

Speech Feature Extraction

Introduction


The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 of a second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Figure 2. An example of a speech signal

A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and will be used in this project. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency; filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

Pattern Recognition

This direct approach involves manipulating the speech signals directly, without explicit feature extraction. There are two stages in this approach: the training of speech patterns, and the recognition of patterns via pattern comparison. Several identical speech signals are collected and presented to the system via the training procedure. With adequate training, the system is able to characterize the acoustic properties of the pattern; this type of classification is known as pattern classification. The recognition stage performs a direct comparison between the unknown speech signal and the speech signal patterns learned in the training phase, and generates an "accept" or "reject" decision based on the similarity of the two patterns. The main strengths of this approach are:
1. It is simple to use and the method is fairly easy to understand.
2. It is robust to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.
3. It has been shown that this method generates the most accurate results.

Acoustic-Phonetic Approach

The acoustic-phonetic approach has been studied in depth for more than 40 years. It is based on the theory of acoustic phonetics, which suggests that there exist finite, distinctive phonetic units of spoken language and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time. The first step in this approach is to segment the speech signal into discrete time regions in which the acoustic properties of the speech signal are represented by one phonetic unit. The next step is to attach one or more phonetic labels to each segmented region according to its acoustic properties. Finally, the last step attempts to determine a valid word from the

phonetic la"els generated from the first step$ This is consistent !ith the constraints of the speech recognition task

Artificial Intelligence

This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach; it uses the concepts and ideas of both. The artificial intelligence approach attempts to mechanize the speech recognition process according to the way a person applies intelligence in visualizing and analyzing speech. In particular, among the techniques used within this class of methods is the use of an expert system for segmentation and labeling, so that this crucial and most complicated step can be performed with more than just the acoustic information used by pure acoustic-phonetic methods. Neural networks are often used in this approach to learn the relationships between phonetic events and all the known inputs; they can also be used to differentiate similar sound classes.

Dynamic Time Warping

Dynamic Time Warping is one of the pioneering approaches to speech recognition. It operates by first storing a prototypical version of each word in the vocabulary in a database, then comparing incoming speech signals with each word and taking the closest match. This poses a problem, because it is unlikely that the incoming signals will fall into the constant window spacing defined by the host. For example, suppose the password to a verification system is "Queensland". When a user utters "Queeeeensland", simple linear squeezing of this longer password will not match the one in the database. This is

due to the longer constant window spacing of the utterance "Queeeeensland". Dynamic Time Warping solves this problem by computing a non-linear mapping of one signal onto another, minimizing the distance between the two. Thus Dynamic Time Warping (DTW) is a much more robust distance measure for time series, allowing similar shapes to match even if they are out of phase in the time domain.

Dynamic Time Warping

The figure above shows a Dynamic Time Warping graph, where the horizontal axis represents the time sequence of the input stream and the vertical axis represents the time sequence of the template stream. The path shown results in the minimum distance between the input and template streams. The shaded area represents the search space for the input-time to template-time mapping function.
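The following MATLAB function is a minimal sketch of the DTW accumulated-cost computation described above (it would be saved as dtw_distance.m); A and B are assumed to be feature sequences with one frame per row, and only the final alignment cost is returned.

function d = dtw_distance(A, B)
    % A: n x D matrix of input frames, B: m x D matrix of template frames
    n = size(A, 1); m = size(B, 1);
    D = inf(n + 1, m + 1);                       % accumulated-cost matrix
    D(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            cost = norm(A(i, :) - B(j, :));      % local Euclidean distance
            D(i+1, j+1) = cost + min([D(i, j), D(i, j+1), D(i+1, j)]);
        end
    end
    d = D(n+1, m+1);                             % cost of the best non-linear alignment
end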

Hidden Markov Model

Speech recognition is one of the daunting challenges facing researchers throughout the world. A complete solution is still far off, and enormous effort has been spent by companies to reach the ultimate goal. One of the techniques that has gained acceptance from researchers is the state-of-the-art Hidden Markov Model (HMM) technique. This model can also be combined with other techniques, such as neural networks, to form an even more formidable technique. The Hidden Markov Model approach is widely used in sequence processing and speech recognition. The key feature of the Hidden Markov Model lies in its ability to model the temporal statistics of data by introducing a discrete hidden variable that makes a transition from one time step to the next according to a stochastic transition matrix. The distribution of the emission symbols is embodied in the assumed emission probability density. A Hidden Markov Model may be viewed as a finite state machine where the transitions between the states depend upon the occurrence of some symbol. Each state transition is associated with an output probability distribution, which determines the probability that a symbol will occur during the transition, and a transition probability indicating the likelihood of that transition. Several analytical techniques have been developed for estimating these probabilities, and they have enabled HMMs to become more computationally efficient, robust and flexible. In speech recognition, the HMM optimizes the probability of the training set in order to detect a particular speech pattern. The probability computation is performed by the Viterbi algorithm, a procedure used to determine an optimal state sequence from a given observation sequence.
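As an illustration of the Viterbi algorithm mentioned above, the sketch below finds the most likely state sequence of a discrete-observation HMM; the parameter names (pi0, A, B) are assumed conventions, not taken from this report, and the function would live in its own file.

function path = viterbi_path(pi0, A, B, obs)
    % pi0: 1 x S initial state probabilities, A: S x S transition matrix,
    % B: S x V emission matrix, obs: 1 x T sequence of observation indices
    S = numel(pi0); T = numel(obs);
    logd = -inf(S, T); back = zeros(S, T);
    logd(:, 1) = log(pi0(:)) + log(B(:, obs(1)));
    for t = 2:T
        for s = 1:S
            [best, prev] = max(logd(:, t-1) + log(A(:, s)));
            logd(s, t) = best + log(B(s, obs(t)));
            back(s, t) = prev;                   % remember the best predecessor
        end
    end
    path = zeros(1, T);
    [~, path(T)] = max(logd(:, T));
    for t = T-1:-1:1
        path(t) = back(path(t+1), t+1);          % backtrack the optimal state sequence
    end
end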

Neural Networks

As you read these words, your brain is using its complex network of billions of neurons to facilitate your reading. Each of these neurons has the blazing processing speed of a microprocessor, which allows us to read, think and write simultaneously. Scientists have found that all biological neural functions, including memory, are stored in the neurons and in the connections between them. As we learn new things every day, new connections are made or modified; some of these neural structures are defined at birth, while others are created every day and others waste away. In this thesis, the Neural Network algorithm actually refers to Artificial Neural Networks and not the actual neurons in our brain. A picture illustrating biological neurons is shown in the figure below.

Schematic Drawing of Biological Neurons

Neural Model

,igure "elo! sho!s a single input neuron$ The scaler input p is multiplied "y the scaler !eight ! to form !p !hich is then sent to the summer$ In the summer' the product scalar !p is added to the "ias " and passed through the summer $ The summer output n' goes into the transfer function f !hich generates the scaler neuron output a$ The #alue of output neuron ;a< depends on the type of transfer function used$ This !hole idea of the artificial neuron is similar to "iological neurons sho!n in ,igure sho!n a"o#e $ The !eight ! corresponds to the strength of the synapse' the cell "ody is e%ui#alent to the summation and the transfer function and finally the neuron output ;a< corresponds to the signal tra#elling in the a(on Summer output n H !p I" +euron output a H f.!pI"/

Single Input Neuron
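A few lines of MATLAB make the single-input neuron concrete; the input, weight and bias are assumed example values, and the hard-limit transfer function described in the next section is used for f.

p = 1.5;                      % scalar input (assumed)
w = 2;  b = -1.5;             % scalar weight and bias (assumed)
n = w*p + b;                  % summer output
f = @(n) double(n >= 0);      % hard-limit transfer function
a = f(n);                     % neuron output: 1 here, since n = 1.5 >= 0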

Hard Limit Transfer Function

The hard limit transfer function, shown on the left side of the figure below, sets the neuron output a to 0 if the summer output n is less than 0. If the summer output n is greater than or equal to 0, it sets the neuron output a to 1. This transfer function is useful in classifying

inputs into two categories, and in this thesis it is used to determine true or false detection of the speech signal. The figure on the right shows the effect of the weight and the bias combined together.

Hard Limit Transfer Function

Decision Boundary

A single layer perceptron consists of the input neuron, the weight, bias, summer and transfer function. Figure 1 below shows a diagram of a single layer perceptron. A single layer perceptron can be used to classify input vectors into two categories. The weight vector is always orthogonal to the decision boundary. For example, in Figure 2 below the weight w is set to [-2 3], and the corresponding decision boundary is indicated. We can use any point on the decision boundary to find the bias, as follows: wp + b = 0. Once the bias is set, any point in the plane can be classified as lying inside the shaded region (wp + b > 0) or outside the shaded region (wp + b < 0).

Figure 1: Multiple Input Neuron. Figure 2: Perceptron Decision Boundary
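The sketch below reproduces this reasoning for the quoted weight w = [-2 3]: a point assumed to lie on the decision boundary fixes the bias, after which any input can be classified with the hard-limit rule.

w  = [-2 3];                  % weight vector, orthogonal to the decision boundary
p0 = [0; 2];                  % an assumed point taken to lie on the boundary
b  = -w*p0;                   % solve w*p0 + b = 0 for the bias
p  = [1; 1];                  % an arbitrary test input
a  = double(w*p + b >= 0);    % hard limit: 1 inside the shaded region, 0 outside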

Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Such sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ears. In addition, rather than the speech waveforms themselves, MFCCs have been shown to be less susceptible to the variations mentioned above.

Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to ~30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
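A minimal MATLAB sketch of frame blocking with the typical values quoted above follows; x is assumed to be the sampled speech signal as a column vector.

N = 256; M = 100;                               % frame length and frame shift
numFrames = floor((length(x) - N)/M) + 1;
frames = zeros(N, numFrames);
for k = 1:numFrames
    frames(:, k) = x((k-1)*M + (1:N));          % adjacent frames overlap by N - M samples
end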

Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. The concept here is to

minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1,

!here N is the num"er of samples in each frame' then the result of !indo!ing is the signal

Typically the Hamming window is used, which has the form w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)), 0 <= n <= N-1.
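Continuing the sketch, each column of the frame matrix is multiplied by the Hamming window defined above (equivalent to MATLAB's hamming(N)):

n = (0:N-1)';
w = 0.54 - 0.46*cos(2*pi*n/(N-1));              % Hamming window
windowed = frames .* repmat(w, 1, size(frames, 2));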

Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

X_n = sum_{k=0}^{N-1} x_k exp(-2*pi*j*k*n/N),   n = 0, 1, 2, ..., N-1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1, where Fs denotes the sampling frequency. The result after this step is often referred to as the spectrum or periodogram.
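In MATLAB the per-frame spectrum follows directly from fft applied to each windowed column; only the non-negative frequency bins are kept, since the rest are redundant for a real signal.

N = size(windowed, 1);
spectra = abs(fft(windowed));                   % N-point FFT of every frame (column)
spectra = spectra(1:N/2 + 1, :);                % bins for 0 ... Fs/2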

Mel-Frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f/700)

One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Figure 4). The filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain; it simply amounts to applying the triangle-shaped windows of Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.

Figure 4. An example of a mel-spaced filter bank
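A sketch of the mel-wrapping step is shown below. The matrix H of K = 20 triangular filters is assumed to have been built beforehand (one row per filter, one column per FFT bin up to Fs/2); constructing it only requires placing the triangle centres uniformly on the mel scale using the formula above.

mel = @(f) 2595*log10(1 + f/700);               % Hz -> mel (approximate formula above)
K = 20;                                         % number of mel spectrum coefficients
% H: assumed K x (N/2 + 1) matrix of triangular mel filters
melSpectrum = H * (spectra.^2);                 % output power of each filter, per frame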

"epstrum In this final step' !e con#ert the log mel spectrum "ack to time$ The result is called the mel fre%uency cepstrum coefficients .M,FF/$ The cepstral representation of the speech spectrum pro#ides a good representation of the local spectral properties of the signal for the gi#en frame analysis$ Because the mel spectrum coefficients .and so their logarithm/ are real num"ers' !e can con#ert them to the time domain using the 3iscrete Fosine Transform .3FT/$ Therefore if !e denote those mel po!er spectrum coefficients that are the result of the last step are !e can calculate the M,FF&s' as

Note that we exclude the first component, c_0, from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
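The final step is a direct transcription of the DCT sum above (note that n starts at 1, so the mean term c_0 is never computed, matching the remark about excluding it):

K = size(melSpectrum, 1);
n = (1:K)'; k = 1:K;
dctMatrix = cos(n * (k - 0.5) * pi / K);        % cos[n (k - 1/2) pi / K]
mfcc = dctMatrix * log(melSpectrum);            % K mel-frequency cepstrum coefficients per frame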

Modifications in the MFCC-Based Technique


Purpose of Modification
The purpose of the modifications to the MFCC-based algorithms generally in use was to improve performance by making the algorithm more robust, faster and more computationally efficient, i.e. an effort to reduce the MIPS count so that the algorithm can be considered for real-time applications. A weakness of speaker recognition systems is that they are not robust enough to be considered for commercial applications, so the intent was also to take a step forward in making them more robust.

Modifications
Windowing
Instead of the Hamming window, a more efficient Kaiser window is used, based on the concept of minimizing the mean square error rather than the maximum error. An explanation of the Kaiser window is given below. The Kaiser window has an adjustable parameter α which controls how quickly it approaches zero at the edges. It is defined by

w(n) = I_0( α sqrt(1 - (2n/(N-1) - 1)^2) ) / I_0(α),   0 <= n <= N-1

where I_0(x) is the zeroth-order modified Bessel function. The higher α is, the narrower the window becomes and, because the truncation is then less severe, the smaller are the side ripples. Goss used this window to obtain an adjustable gradient filter, but he used it only on sample points, so that in between sample points some kind of interpolation has to be performed, which he does not state explicitly. In this work, the Kaiser-windowed cosc function is used to reconstruct gradients at arbitrary positions, which is, of course, more costly.

Absolute Value of the DFT
Before applying the mel filter banks, the absolute value (magnitude) of the DFT of each frame is taken rather than its square. This not only avoids the cost of computing the square but is also an attempt at making the algorithm more robust.
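In MATLAB the windowing modification amounts to swapping the Hamming window for the built-in kaiser window (Signal Processing Toolbox) and using the magnitude rather than the power spectrum; the shape-parameter value below is an assumed choice, since the report does not state one.

beta = 5;                                       % assumed Kaiser shape parameter (alpha)
wk = kaiser(N, beta);                           % Kaiser window of length N
windowed = frames .* repmat(wk, 1, size(frames, 2));
magSpec = abs(fft(windowed));                   % magnitude (not squared) spectrum
magSpec = magSpec(1:N/2 + 1, :);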

Feature Matching

Introduction
The problem of speaker recognition belongs to a much broader topic in science and engineering known as pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case are sequences of acoustic vectors that are extracted from input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. Furthermore, if there exists some set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. This is exactly our case, since during the training session we label each input speech signal with the ID of the speaker. These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. If the correct classes of the individual patterns in the test set are also known, then one can evaluate the performance of the algorithm. The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). In this project, the VQ approach is used, due to its ease of implementation and high accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook.

Figure 5 shows a conceptual diagram illustrating this recognition process. In the figure, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 5 by black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.

Figure 5. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids.
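The recognition rule can be sketched as follows; codebooks is an assumed cell array holding one trained codebook (rows = codewords) per enrolled speaker, and testVectors holds the MFCC vectors of the unknown utterance, one per row.

numSpeakers = numel(codebooks);
distortion = zeros(1, numSpeakers);
for s = 1:numSpeakers
    C = codebooks{s};
    for t = 1:size(testVectors, 1)
        d = sum((C - repmat(testVectors(t, :), size(C, 1), 1)).^2, 2);
        distortion(s) = distortion(s) + min(d);   % distance to the closest codeword
    end
end
[~, identifiedSpeaker] = min(distortion);         % smallest total VQ distortion wins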

"lustering t e #raining (ectors

After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [Linde, Buzo and Gray, 1980], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).

2. Double the size of the codebook by splitting each current codeword y_n according to the rule

y_n+ = y_n (1 + ε),   y_n- = y_n (1 - ε)

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).

3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook that is closest (in terms of a similarity measurement), and assign that vector to the corresponding cell (associated with the closest codeword).

4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.

5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.

6. Iteration 2: repeat steps 2, 3 and 4 until a codebook of size M is designed.

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged. A MATLAB-style sketch of this procedure is given below.

Obtaining the Speech Waveform

The first task is to record the speech waveform from the speaker and load it into the program. The Sound Recorder program in Microsoft Windows was chosen to record the speech waveform. The recorded speech is automatically filtered, sampled at a rate of 22.05 kHz and then saved as a wave file. The wave file format was chosen because it is highly compatible with MATLAB, as it can be retrieved easily via a single command.

Pre-Extraction Process

The speech wave file is loaded into the MATLAB program using the "wavread" function, which limits the amplitude of the speech signal to a magnitude of 1. The signal is then stored as an M x 1 vector, where M refers to the total number of samples in the speech signal. Each element of the vector contains the amplitude of the speech signal at a

particular sampling instant. The speech signal is now ready to go through the compatibility and quality process.
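Loading the recorded wave file can be sketched as below; the file name is assumed, and wavread is the function available in the MATLAB releases this report targets (newer releases use audioread instead).

[x, fs] = wavread('speech.wav');     % samples scaled to the range [-1, 1]; fs = 22050 here
x = x(:, 1);                         % keep one channel: an M x 1 vector of amplitudes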

Quality Process

Before the actual extraction process takes place, the wave file is subjected to a series of processes to ensure the compatibility and quality of the signal. When the speech signal is loaded into the MATLAB program, the signal is not centered on the y = 0 axis. In order to bring the whole signal to be centred on the zero line, special program code was written. This code finds the mean of the signal and then subtracts this mean from each of the sample values of the signal. This is shown in Figure 1 below. The reason for shifting the whole signal to the y = 0 axis is explained in a later section.

Speech Waveform

The next process is to suppress the noise present in the speech waveform. Although the Sound Recorder program performs the initial filtering, some noise is still present in the speech waveform. Another section of the MATLAB program code is used to set a threshold value on the speech signal; any value of the speech signal that falls below this threshold is set to zero. This greatly suppresses the unwanted noise while preserving the content of the main speech signal, as illustrated in Figures 2 and 3 below. After some testing, it was found that a threshold value of 0.02 is most suitable.

Fig 2: Speech waveform before filtering

Fig 3: Speech waveform after filtering

The final compatibility and quality process is to determine the area of interest of the speech signal. This is done by detecting the first rise point and the final drop point of the

speech waveform. This can be done easily since the speech signal has been cleared of unwanted noise. The area of interest of the speech therefore lies between the first rise and the final drop point of the speech waveform. This area of interest is later used for the extraction and coding processes.
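The three quality-process steps described above can be sketched in a few MATLAB lines (x is the loaded speech vector from the previous sketch):

x = x - mean(x);                     % centre the signal on the y = 0 axis
threshold = 0.02;                    % noise threshold found by testing
x(abs(x) < threshold) = 0;           % suppress low-level noise
active = find(x ~= 0);
x = x(active(1):active(end));        % keep the region between first rise and final drop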

Database

Before classification takes place, speech samples from each speaker have to be collected, coded, converted to MFCC codes and stored in the database. The utterances of these speech samples have to be of the same phrase. The speech samples for each speaker are then averaged to produce a mean reference matrix. This averaging process is necessary as it reduces the inconsistency of the speaker's speech. The speech samples must also go through a process to find the standard deviation of the samples.

Neural Network Process

After the mean reference matrix of each individual speaker is obtained, the elements in the matrix are passed through the neural network to construct the weight and bias for each individual speaker. Once the weight and bias of each individual speaker are generated, the program is ready to classify any unknown user.

"onclusion

With the positive results collected, speech recognition using MFCC and a neural network has proven to be excellent at classifying speech signals. Unlike traditional speech recognition techniques which involve complex Fourier transformations, the method used by the mel-frequency cepstrum to code the signal is simple and accurate. The acoustic characteristics of the speaker's speech can easily be detected from visual inspection of the MFCC code. The neural network classification method used is also reliable and uncomplicated to implement. From the results, it is clear that single-syllable words are more reliable in terms of training. This is probably because humans' pronunciations of single-syllable words are more consistent. Despite the positive results collected, a few "false" acceptances and "false" rejections are still detected. This may be considered a serious issue when the system is applied to a high-security area. The main reason behind these errors is the inconsistency in human speech. Although this system is a formidable combination, the single-layer perceptron technique is unable to reduce the inconsistency of the speech signals; therefore more robust and powerful methods have to be employed to reduce it. This will be further explained in the following chapter. Finally, to conclude, mel-frequency cepstrum processing has the ability to discriminate signals that remain indistinguishable in the frequency domain. Furthermore, due to their economy, robustness and flexibility, these two combined techniques can be easily

implemented on cost-effective machines which require speech verification or identification.

Other Methods in Neural Networks

Many methods of classifying speech signals can be found in the neural network literature. One of them is based on the autoassociative neural network model, in which the distribution-capturing ability of the network is exploited to model the speaker's speech signal. Another high-performance neural-network-based approach uses a state transition matrix; this method has the ability to handle inconsistent speech signals. An unsupervised learning method such as the Kohonen Self-Organising Map can also be employed.

Speaker Identification

The current system is focused more on speaker verification, which tests an unknown speaker against a known speaker. The method presented in this thesis is still not reliable enough to be used in speaker identification applications. In speaker identification, the aim is to determine whether the utterance of an unknown speaker belongs to any of the speakers from amongst a known group. Of these two applications, speaker identification is generally more difficult to achieve due to the larger speaker populations, which

produce more errors. Future work should concentrate on speaker identification, as this will increase the commercial value of the system.
