
Abstract

Speech processing involves recognition, synthesis, language identification, speaker recognition, and a
host of subsidiary problems regarding variations in speaker and speaking conditions. Notwithstanding
the difficulty of the problems, and the fact that speech processing spans two major areas, acoustic
engineering and computational linguistics, great progress has been made in the past fifteen years, to the
point that commercial speech recognizers are increasingly available in the late 1990s. Still, problems
remain both at the sound level, especially dealing with noise and variation, and at the dialogue and
conceptual level, where speech blends with natural language analysis and generation.

5.1 Definition of Area


Speech processing comprises several areas, primarily speech recognition and
speech synthesis, but also speaker recognition, language recognition, speech
understanding and vocal dialog, speech coding, enhancement, and transmission. A
panorama of techniques and methods may be found in Cole et al. (1998) or Juang
et al. (1998).
Speech recognition is the conversion of acoustic information into linguistic
information that may result in a written transcription, or that has to be understood.
Speech synthesis is the conversion of linguistic information for human auditory
consumption. The starting point may be a text or a concept that has to be
expressed.

5.2 The Past: Where We Come From


5.2.1 Major Problems in Speech Processing
Many of the problems that have been addressed in the history of speech processing
concern variability:

- Acoustic variability, due to the fact that the same phonemes pronounced in different contexts (that is, surrounded by different phonemes) will have different acoustic realizations (this is called the coarticulation effect). Additional factors also play a role: the general prosody of a sentence modifies the corresponding signal, and the signal differs when speech is uttered in various environments, in noise, with reverberation, or with different microphones or types of microphones.
- Speaking variability, when the same speaker speaks normally, shouts, whispers, uses a creaky voice, or has a cold.
- Speaker variability, since different speakers have different timbres and different speaking habits.
- Linguistic variability, in which the same sentence can be pronounced in many different ways, using many different words, synonyms, and many different syntactic structures and prosodic schemes.
- Phonetic variability, due to the different possible pronunciations of the same words by speakers having different regional or socio-linguistic accents.

Noise and channel distortions are difficult to handle, especially when there is no a
priori knowledge of the noise or of the distortion. These phenomena directly affect
the acoustics of the signal, but may also indirectly modify the voice at the source.
This is known as the Lombard effect, where noise modifies the utterance of the
words (as people tend to speak louder), but may also be reflected in voice changes
due to the psychological awareness of speaking to a machine.
The fact that, contrary to written texts, speech is continuous and has no silences to separate words adds extra difficulty. Continuous speech is also difficult to handle because linguistic phenomena of various kinds may occur at the junctions between words, and because frequently used words are usually short and therefore much affected by coarticulation.
5.2.2 History of Major Methods, Techniques, and Approaches
Regarding speech synthesis, the origins may be placed very early in time. The first
result in that field may be placed in 1791, when W. von Kempelen demonstrated
his speaking machine, which was built with a mechanical apparatus mimicking the
human vocal apparatus. The next major successful attempt may be placed at the
New York World Fair in 1939, when H. Dudley presented the Voder, based on
electrical devices. In this case, the approach was rather based on analysis-synthesis: the sounds were first analyzed and then replayed. In both
cases, it was necessary to learn how to play those very special musical instruments
(one week in the case of the Voder), and the human demonstrating the systems
probably used the now well-known trick of announcing to the audience what they
would hear, and thus inducing the understanding of the corresponding sentence.
Since then, major progress may be reported in that field, with basically two
approaches still reflecting the Von Kempelen/Dudley dichotomy of "Knowledge-Based" vs. "Template-Based" approaches. The first approach is based on the
functioning of the vocal tract, which often goes together with formant synthesis
(the formants are the resonances of the vocal tract). The second is based on the
synthesis of pre-analyzed signals, which leads to diphone synthesizers, and more
generally to signal segment concatenation. A speech synthesizer for American
English was designed based on the first approach at MIT (Klatt, 1980), and
resulted in the best synthesizer available at that time. Several works may also be
reported in the field of articulatory synthesis, which aims at mimicking more
closely the functioning of the vocal apparatus. However, the best quality is
presently obtained by diphone-based approaches or the like, using simply PCM-encoded signals, as especially illustrated by the PSOLA system designed at CNET (Moulines and Charpentier, 1990).
In addition to the phoneme-to-sound levels, Text-to-Speech synthesis systems also
contain a Grapheme-to-Phoneme conversion level. This operation initially used a
large set of rules, including morpho-syntactic tagging and even syntactic parsing to
solve some difficult cases. Several attempts to perform this operation by automatic training on large amounts of text, or directly on the lexicon, using stochastic approaches or even neural networks, have produced encouraging results, and even claims that machines "able to learn to read" had been invented. However, rule-based approaches still produce the best results. Specific attention has recently been
devoted to the grapheme-to-phoneme conversion of proper names, including
acronyms. Prosodic markers are generated from the texts using rules and partial
parsing.
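The rule-based conversion described above can be sketched as an ordered list of rewrite rules applied greedily from left to right, longest pattern first. The rules and the loosely ARPAbet-like phoneme symbols below are invented for illustration and cover only a few English patterns; real systems use hundreds of context-sensitive rules plus an exceptions lexicon.

```python
# Ordered rewrite rules: longer, more specific grapheme patterns first.
# The rule set and phoneme symbols are illustrative only.
RULES = [
    ("tion", "SH AH N"),
    ("ph", "F"),
    ("ch", "CH"),
    ("th", "TH"),
    ("ee", "IY"),
    ("a", "AE"), ("e", "EH"), ("i", "IH"), ("o", "AA"), ("u", "AH"),
    ("b", "B"), ("c", "K"), ("d", "D"), ("f", "F"), ("g", "G"),
    ("h", "HH"), ("j", "JH"), ("k", "K"), ("l", "L"), ("m", "M"),
    ("n", "N"), ("p", "P"), ("r", "R"), ("s", "S"), ("t", "T"),
    ("v", "V"), ("w", "W"), ("y", "Y"), ("z", "Z"),
]

def graphemes_to_phonemes(word):
    """Greedy left-to-right application of the first matching rule."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        for pattern, phones in RULES:
            if word.startswith(pattern, i):
                phonemes.extend(phones.split())
                i += len(pattern)
                break
        else:
            i += 1  # skip characters no rule covers
    return phonemes

print(graphemes_to_phonemes("nation"))  # ['N', 'AE', 'SH', 'AH', 'N']
```

The ordering of the rules does the disambiguation: "tion" is matched as a unit before the single letters "t", "i", "o", "n" can fire, which is why proper names and acronyms, which violate such regularities, need special treatment.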
Regarding speech recognition, various techniques were used in the 60s and 70s.
Researchers here also found their way between knowledge based approaches for
"analytic recognition" and template matching approaches for "global recognition".
In the first case, the phonemes were first recognized and then linguistic knowledge
and AI techniques helped reconstruct the utterance and understand the sentence,
despite the phoneme recognition errors. An expert systems methodology was
specifically used for phoneme decoding in that approach. In the second approach,
the units to be recognized were the words. Template matching systems include a
training phase, in which each word of the vocabulary is pronounced by the user
and the corresponding acoustic signal is stored in memory. During the recognition
phase, the same speaker pronounces a word of the vocabulary and the
corresponding signal is compared with all the signals that are stored in memory.
This comparison employs a pattern matching technique called Dynamic Time Warping (DTW), which accommodates differences between the signals for two pronunciations of the same word (since even the same speaker never pronounces words exactly the same way, with differences in the duration of the phonemes, the energy, and the timbre). This approach was first successfully used for speaker-dependent isolated word recognition for small vocabularies (up to 100 words). It was then extended to connected speech, to speaker-independent isolated words, and to larger vocabularies, but independently along each of those three dimensions, by improving the basic technique.
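The DTW alignment described above can be sketched as a simple dynamic program. This is a minimal illustration: the features are scalar values rather than the spectral vectors a real recognizer would compare, and the warping constraints are the most basic ones.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences.

    a, b: lists of feature values (scalars here for simplicity; real
    systems compare vectors of spectral coefficients per frame).
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame distance
            # allow stretching or compressing either sequence in time
            cost[i][j] = d + min(cost[i - 1][j],      # a advances
                                 cost[i][j - 1],      # b advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Two "pronunciations" of the same word, one spoken more slowly:
template = [1.0, 3.0, 4.0, 3.0, 1.0]
utterance = [1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]
print(dtw_distance(template, utterance))  # 0.0: durations differ, shape matches
```

The warping path absorbs the duration differences mentioned in the text: the slower utterance aligns perfectly with the template even though it has two extra frames.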
The next major progress came with the introduction of a statistical approach based on Hidden Markov Models (HMMs) (Baker, 1975; Jelinek, 1976). In this case, instead of storing in memory the signal corresponding to a word, the system stores an abstract model of the units to be recognized, represented as finite state automata made up of states and links between states. The parameters of the model are the probability of traversing a link between two states, and the probability of observing a speech spectrum (acoustic vector) while traversing that link. Algorithms were proposed in the late 60s and early 70s that find those parameters (that is, train the model) (Baum, 1972), and that match a model with a signal in an optimal way (Viterbi, 1967), similarly to DTW. The
interesting feature of this approach is that it is possible to include in a given model parameters which represent different ways of pronouncing a word, for different speaking styles of the same speaker, for different speakers, or for different pronunciations of the word, with different probabilities; even more interestingly, it is possible to train phoneme models instead of word models.
The recognition process may then be expressed as finding the word sequence
which maximizes the probability that the word sequence produced the signal. This
can be simply rewritten as the product of the probability that the signal was
produced by the word sequence (Acoustic Model) and the probability of the word
sequence (Language Model). This latter probability can be obtained by computing
the frequency of the succession of two (bigrams) or three (trigrams) words in texts
or speech transcriptions corresponding to the kind of utterances which will be
considered in the application. It is also possible to consider the probabilities of
grammatical category sequences (biclass and triclass models).
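The recognition criterion just described can be written out explicitly. With X the observed acoustic signal and W a candidate word sequence, Bayes' rule gives:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\,P(W)
% P(X) does not depend on W, so it can be dropped from the maximization.
% A bigram language model then factors P(W), for W = w_1 ... w_n, as:
P(W) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) \approx \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}
```

where P(X | W) is the acoustic model, P(W) the language model, and C(.) counts occurrences in the training texts or transcriptions; a trigram model conditions on the two preceding words instead of one.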
The HMM approach requires very large amounts of data for training, both in terms
of signal and in terms of textual data, and the availability of such data is crucial for
developing technologies and applications, and evaluating systems.
Various techniques have been proposed for the decoding process (depth-first,
breadth-first, beam search, A* algorithm, stack algorithm, Tree Trellis, etc.). This
process is very time consuming, and one research goal is to accelerate the process
without losing quality.
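Among the breadth-first strategies listed above, the core dynamic-programming step is Viterbi decoding over the HMM states. A minimal sketch follows; the two-state model, its transition probabilities, and its emission probabilities are invented for illustration (real decoders work in the log domain over millions of states, which is why beam pruning matters).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Find the most probable HMM state sequence for an observation sequence."""
    # best[s] = (probability, path) of the best partial path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # extend the best-scoring predecessor path into state s
            p, prev = max((best[r][0] * trans_p[r][s], r) for r in states)
            new_best[s] = (p * emit_p[s][obs], best[prev][1] + [s])
        best = new_best
    prob, path = max(best.values())
    return path, prob

# Toy two-state model with invented probabilities:
states = ["A", "B"]
start_p = {"A": 0.8, "B": 0.2}
trans_p = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.1, "B": 0.9}}
emit_p = {"A": {"x": 0.7, "y": 0.3}, "B": {"x": 0.2, "y": 0.8}}
path, prob = viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p)
print(path)  # ['A', 'A', 'B']
```

Full breadth-first search keeps every state hypothesis alive at each frame; beam search accelerates this by discarding hypotheses whose score falls too far below the current best, trading a small risk of search errors for speed.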
This statistical approach was proposed in the early 70s. It was developed throughout the early 80s in parallel with other approaches, as there was no quantitative way of comparing approaches on a given task. The US Department of Defense DARPA Human Language Technology program, which started in 1984, fostered an evaluation-driven comparative research paradigm, which clearly demonstrated the advantages of the statistical approach (DARPA, 1989–1998). Gradually, the HMM approach became more popular, both in the US and abroad.
In parallel, the connectionist, or neural network (NN), approach was explored in various fields, including speech processing. This approach is also based on training, but is considered to be more discriminative than the HMM one. However, it is less adequate than HMMs for modeling time information. Hybrid systems that
combine HMMs and NNs have therefore been proposed. Though they provide
interesting results, and, in some limited cases, even surpass the pure HMM
approach, they have not proven their superiority.
This history illustrates how problems were attacked and in some cases partly solved by different techniques: acoustic variability through template matching using DTW in the 70s, followed by stochastic modeling in the 80s; speaker and speaking variability through clustering techniques, followed by stochastic modeling, differential features, and more data in the 80s; linguistic variability through N-grams and more data in the 70s and 80s. It is an example of the classic paradigm of development and hybridization for Language Processing, as discussed in Chapter 6. Currently, the largest efforts are devoted to improved language modeling, phonetic pronunciation variability, noise and channel distortion (through signal processing techniques and more data), multilinguality (through more data and better standards), and multimodality (through multimodal data, integration, and better standards and platforms; see also Chapter 9).
5.3 The Present: Major Bottlenecks and Problems
In speech recognition, basic research is still needed in the statistical modeling
approach. Some basic statements are still very crude, such as considering the
speech signal to be stationary, or the acoustic vectors to be uncorrelated. How can
HMM capabilities be pushed? Progress continues, using HMMs with more training data, or considering different aspects of the data for different uses (understanding or dialog handling), through the use of corpora containing semantically labeled words or phrases. At the same time, the availability of large quantities of data for a
given application is not always possible, and the adaptation of a system to a new
application is often very costly. Techniques have been proposed, such as tied
mixtures for building acoustic models or backing off techniques for building
language models, but progress is still required. It is therefore important to develop
methods that enable easy application adaptation, even if little or no data is
available beforehand.
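The backing-off idea mentioned above can be sketched with a simple bigram model: when a bigram was never seen in the (possibly scarce) training data, the estimate falls back to a weighted unigram frequency. The corpus and the back-off weight below are invented for illustration; real systems use principled discounting schemes such as Katz or Kneser-Ney smoothing.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams in a whitespace-tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def backoff_prob(w_prev, w, unigrams, bigrams, alpha=0.4):
    """P(w | w_prev): use the bigram estimate if the bigram was seen,
    otherwise back off to a down-weighted unigram estimate."""
    if bigrams[(w_prev, w)] > 0:
        return bigrams[(w_prev, w)] / unigrams[w_prev]
    total = sum(unigrams.values())
    return alpha * unigrams[w] / total  # back off to the unigram

corpus = ["show me the flights", "show me the fares", "list the flights"]
uni, bi = train(corpus)
print(backoff_prob("show", "me", uni, bi))       # seen bigram: 2/2 = 1.0
print(backoff_prob("show", "flights", uni, bi))  # unseen: backed-off unigram
```

Backing off keeps the model from assigning zero probability to word pairs absent from a small application corpus, which is exactly the data-scarcity problem described above.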
Using prosody in recognition is still an open issue. Even today, very few operational systems consider prosodic information, as there is no clear evidence that taking prosody into account results in better performance, given the nature of the applications being addressed at present. Some positive results have however been obtained for German within the Verbmobil program (Niemann et al., 1997).
Addressing spontaneous speech is still an open problem, and difficult tasks such as DARPA's Switchboard and CallHome projects still achieve poor results, despite the efforts devoted to the development of systems in this area.
Recognizing voice in noisy conditions is also important. Two approaches are
conducted in parallel, either using noise robust front-ends or using a model based
approach. The second will probably provide the best results in the long run.
Systems are now getting more speaker-independent, but commercial systems are
still "speaker adaptive": they may recognize a new user with low performance, and
improve during additional spoken interaction with the user. Speaker adaptation will

stay as a research topic for the future, with the goal to make it more natural and
invisible. The systems will thus become more speaker-independent, but will still
have a speaker adaptation component. This adaptation can also be necessary for the same speaker, if his or her voice changes, due to illness for example.
In speech synthesis, the quality of text-to-speech synthesis is better, but still not
good enough for replacing "canned speech" (constructed by concatenating phrases
and words). The generalization of the use of Text-to-Speech synthesis for applications such as reading email messages aloud will however probably help make this imperfect voice familiar and acceptable. Further improvement should
therefore be obtained on phoneme synthesis itself, but attention should be placed
on improving the naturalness of the voice. This involves prosody, as it is very
difficult to generate a natural and acceptable prosody from the text, and it may be
somehow easier to do it in the speech generation module of an oral dialogue
system. This also involves voice quality, allowing the TTS synthesis system to
change its voice to convey the intended meaning of a sentence. Voice conversion
(allowing a TTS synthesis system to speak with the voice of the user, after analysis
of this voice) is another area of R&D interest (Abe et al., 1990).
Generally speaking, the research program for the next years should be "to put back
Language into Language Modeling", as proposed by F. Jelinek during the MLIM
workshop. It requires taking into account that the data which has to be modeled
is language, not just sounds, and that it therefore has some specifics, including an
internal structure which involves more than a window of two or three words. This
would suggest going beyond Bigrams and Trigrams, to consider parsing complete
sentences.
In the same way, as suggested by R. Rosenfeld during the MLIM workshop, it may
be proposed "to put Speech back in Speech Recognition", since the data to be
modeled is speech, with its own specifics, such as having been produced by a
human brain through the vocal apparatus. In that direction, it may be mentioned
that the signal processing techniques for signal acquisition were mostly based on MFCC (Mel Frequency Cepstral Coefficients) in the 80s (Davis and Mermelstein, 1980), and are getting closer to perceptual findings with PLP (Perceptually weighted Linear Prediction) in the 90s (Hermansky, 1990).
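The perceptual grounding of MFCC features comes from the mel scale, which warps frequency toward the ear's resolution. One commonly used formula (attributed to O'Shaughnessy) is sketched below, together with the way filterbank centers are placed evenly in mels; the cutoff frequency and filter count are illustrative choices.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (a commonly used formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, used to place triangular filter edges in Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Filter center frequencies spaced evenly on the mel scale end up
# increasingly far apart in Hz, mirroring the ear's coarser frequency
# resolution at high frequencies:
low, high, n = hz_to_mel(0.0), hz_to_mel(8000.0), 10
centers = [mel_to_hz(low + i * (high - low) / (n + 1)) for i in range(1, n + 1)]
print([round(c) for c in centers])
```

An MFCC front end applies such a filterbank to the short-term spectrum, takes logarithms, and decorrelates the filter outputs with a cosine transform; PLP replaces parts of this chain with further perceptually motivated steps.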
Several application areas are now developing, including consumer electronics (mobile phones, hand-held organizers), desktop applications (dictation, OS navigation, computer games, language learning), and telecommunications (auto-attendant, home banking, call centers). These applications require several technological advances, including consistent accuracy, speaker independence and quick adaptation, consistent handling of out-of-vocabulary words, easy addition of new words and names, automatic updating of vocabularies, robustness to noise and channel, barge-in (allowing a human to speak over the system's voice and interrupt it), as well as standard software and hardware compatibility and low cost.

5.4 The Future: Major Breakthroughs Expected


Breakthroughs will probably continue to be obtained through sustained incremental
improvements based on the use of statistical techniques on ever larger amounts of
data and differently annotated data. Every year since the mid-80s we can identify progress and better performance on more difficult tasks. Significantly, results obtained within DARPA's ATIS task (Dahl et al., 1994) showed that the understanding performance obtained on written data transcribed from speech was achieved on actual speech data only one year later.
Better pronunciation modeling will probably enlarge the population that can get
acceptable results on a recognition system, and therefore strengthen the
acceptability of the system.
Better language models are presently a major issue, and could be obtained by
looking beyond N-Grams. This could be achieved by identifying useful linguistic
information, and incorporating more Information Theory in Spoken Language
Processing systems.
In five years, we will probably have considerably more robust speech recognition
for well defined applications, more memory-efficient and faster recognizers to
support integration with multi-media applications, speech I/O embedded in client
server architecture, distributed recognition to allow mass telephony applications,
efficient and stable multilingual applications, better integration of NLP in well-defined areas, and much more extensible modular toolkits to reduce the lifecycle of
application development. While speech is nowadays considered a means of communication, with continued research progress it will come to be considered a material comparable to text, one that can easily be indexed, accessed randomly, sorted, summarized, translated, and retrieved. This view will drastically change our relationship with the vocal medium.
Multimodality is an important area for the future, as discussed in Chapter 9. It can
intervene for the processing of a single media, such as speech recognition using
both the audio signal and the visual signal of the lips, which results in improved
accuracy, especially in noisy conditions. But it can also address different media,
such as integrating speech, vision and gesture in multimodal multimedia
communication, which includes the open issue of sharing a common reference for
the human and the machine. Multimodal training is another dimension, based on
the assumption that humans learn to use one modality by getting simultaneous
stimuli coming from different modalities. In the long run, modeling speech will
have to be considered in tandem with other modalities.
Transmodality is another area of interest. It addresses the problem of delivering information through different media, depending on which medium is most appropriate to the context the user is in when requesting the information (sitting in front of a computer, in which case a text-plus-graphics output may be appropriate, or driving a car, in which case a spoken summary of the textual information may be more appropriate, for example).
5.5 Juxtaposition of this Area with Other Areas
Over the years, speech processing has been getting closer to natural language processing,
as speech recognition is shifting to speech understanding and dialogue, and as
speech synthesis becomes increasingly natural and approaches language generation
from concepts in dialogue systems. Speech recognition would benefit from better
language parsing, and speech synthesis would benefit from better morpho-syntactic
tagging and language parsing.
Speech recognition and speech synthesis are used in Machine Translation (Chapter
4) for spoken language translation (Chapter 7).
Speech processing meets Natural Language Processing, but also computer vision,
computer graphics, gestural communication in multimodal communication
systems, with open research issues on the relationship between image, language
and gesture for example (see Chapter 9).
Even imperfect speech recognition meets Information Retrieval (Chapter 2) in
order to allow for multimedia document indexing through speech, and retrieval of
multimedia documents (such as in the US Informedia (Wactlar et al., 1999) and the
EU Thistle or Olive projects). This information retrieval may even be multilingual,
extending the capability of the system to index and retrieve the requested
information, whatever the language spoken by the user, or present in the data.
Information Extraction (Chapter 3) from spoken material is a similar area of interest, and work has already been initiated in that domain within DARPA's Topic Detection and Tracking program. Here also, it will benefit from cooperation
between speech and NL specialists and from a multilingual approach, as data is
available on multiple sources in multiple languages worldwide.
Speech recognition, speech synthesis, speech understanding and speech generation
meet in order to allow for oral dialogue. Vocal dialogue will get closer to research
in the area of dialogue modeling (indirect speech acts, beliefs, planning, user
models, etc.). Adding a multilingual dimension empowers individuals and gives
them a universal access to the information world.
5.6 The Treatment of Multiple Languages in Speech Processing
Addressing multilinguality is important in speech processing. A system that
handles several languages is much easier to put on the market than a system that
can only address one language. In terms of research, the structural differences
across languages are interesting for studying any one of them. Rapid deployment of a system to a large market, which necessitates the handling of several languages, is challenging, and several companies offer speech recognition or speech synthesis systems that handle different languages in their different versions, less frequently different languages within a single version. Addressing multilinguality not only
includes getting knowledge on the structures and elements of a different language,
but also requires accommodating speakers who speak that language with accents
that may differ and who use words and sentence structures that may be far away
from the canonical rules of the language.
As discussed in Chapter 7, language identification is part of multilingual speech
processing. Detecting the language spoken enables selecting the right Acoustic and
Language Models. An alternative could be to use language-independent Acoustic
Models (and less probably even language-independent Language Models).
However, present systems will get into trouble if someone shifts from one
language to another within one sentence, or one discourse, as humans sometimes
do.
Similarly, a speech synthesis system will have to be able to identify the language spoken in order to pronounce it correctly, and systems aiming at the pronunciation of email will often have to shift between the user's language and English, which is used for many international exchanges. Here also, some sentences may
contain foreign words or phrases that must be pronounced correctly. Large efforts
may be required to gather enough expertise and knowledge on the pronunciation of
proper names in various countries speaking different languages, as in the European
project Onomastica (Schmidt et al., 1993). Also, successful attempts to quickly
train a speech synthesis system by using a large enough speech corpus in that
language have been reported (Black and Campbell, 1995). In this framework, the
synthesis is achieved by finding in the speech corpus the longest speech units
corresponding to parts of the input sentence. This approach requires no extended
understanding of the language to be synthesized. Another aspect of multilingual
speech synthesis is the possibility of using voice conversion in spoken language
translation. In this case, the goal is to translate the speech uttered by the speaker into the target language and to synthesize the corresponding sentence for the listener, using the voice that the speaker would have if he or she were speaking that language.
Such attempts were conducted in the Interpreting Telephony project at ATR (Abe
et al., 1990).
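The corpus-based synthesis mentioned above, which finds in the speech corpus the longest units matching parts of the input sentence, can be sketched as a greedy longest-match search over a unit inventory. The inventory of phoneme strings below is invented for illustration; real unit-selection systems index actual recorded segments and additionally score join and prosodic costs when choosing among candidates.

```python
def select_units(target, inventory):
    """Greedily cover the target phoneme sequence with the longest
    units available in the inventory."""
    units = []
    i = 0
    while i < len(target):
        # try the longest remaining span first, then shorter ones
        for j in range(len(target), i, -1):
            span = tuple(target[i:j])
            if span in inventory:
                units.append(span)
                i = j
                break
        else:
            raise ValueError(f"no unit covers {target[i]!r}")
    return units

# Invented inventory: longer stored units are preferred when available,
# since they contain natural coarticulation across their internal joins.
inventory = {
    ("h", "e", "l", "ou"), ("h", "e"), ("l", "ou"), ("w", "er"),
    ("er", "l", "d"), ("h",), ("e",), ("l",), ("ou",), ("w",), ("er",), ("d",),
}
print(select_units(["h", "e", "l", "ou", "w", "er", "l", "d"], inventory))
# → [('h', 'e', 'l', 'ou'), ('w', 'er'), ('l',), ('d',)]
```

Because longer units carry their own natural coarticulation and prosody, the fewer the concatenation points, the better the output quality, which is why this approach needs a large corpus but little linguistic knowledge of the target language.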
Complete multilingual systems therefore require language identification,
multilingual speech recognition and speech synthesis, and machine translation for
written and spoken language. Some spoken translation systems already exist and
work in laboratory conditions for well-defined tasks, including conference
registration (Morimoto et al., 1993) and meeting appointment scheduling (Wahlster, 1993).

With respect to multilinguality, there are two important questions. First, can data be shared across languages? If a system is able to recognize one language, will the same effort be necessary to address another one, or is it possible to reuse, for example, the acoustic models of the phonemes that are similar in two different languages? Second, can knowledge be shared across languages? Could the scientific results obtained in studying one language be used for studying another? Since the semantic meaning of a sentence remains the same when it is pronounced in two different languages, it should be possible to model semantic knowledge independently of the languages used.
5.7 Conclusion
Notwithstanding the difficulty of the problems facing speech processing, and
despite the fact that speech processing spans two major areas, acoustic engineering
and computational linguistics, great progress has been made in the past fifteen
years. Commercial speech recognizers are increasingly available today,
complementing machine translation and information retrieval systems in a trio of
Language Processing applications. Still, problems remain both at the sound level,
especially dealing with noise and variations in speaker and speaking condition, and
at the dialogue and conceptual level, where speech blends with natural language
analysis and generation.

5.8 References
Abe, M., S. Nakamura, K. Shikano, and H. Kuwabara. 1990. Voice conversion through vector quantization. Journal of the Acoustical Society of Japan, E-11 (71–76).
Baker, J.K. 1975. Stochastic Modeling for Automatic Speech Understanding. In R. Reddy (ed.), Speech Recognition (521–542). Academic Press.
Baum, L.E. 1972. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities 3 (1–8).
Black, A.W. and N. Campbell. 1995. Optimising selection of units from speech databases for concatenative synthesis. Proceedings of the Fourth European Conference on Speech Communication and Technology (581–584). Madrid, Spain.
Cole, R., J. Mariani, H. Uszkoreit, N. Varile, A. Zaenen, A. Zampolli, and V. Zue. 1998. Survey of the State of the Art in Human Language Technology. Cambridge: Cambridge University Press (or see http://www.cse.ogi.edu/CSLU/HLTsurvey/HLTsurvey.html).
Dahl, D.A., M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the Scope of the ATIS Task: the ATIS-3 Corpus. Proceedings of the DARPA Conference on Human Language Technology (43–49). San Francisco: Morgan Kaufmann.
DARPA. 1989–1998. Proceedings of conference series initially called Workshops on Speech and Natural Language and later Conferences on Human Language Technology. San Francisco: Morgan Kaufmann.
Davis, S.B. and P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-28 (357–366).
Hermansky, H. 1990. Perceptual linear predictive (PLP) analysis for speech. Journal of the Acoustical Society of America, 87(4) (1738–1752).
Jelinek, F. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64 (532–556).
Juang, B.H., D. Childers, R.V. Cox, R. De Mori, S. Furui, J. Mariani, P. Price, S. Sagayama, M.M. Sondhi, and R. Weischedel. 1998. Speech Processing: Past, Present and Outlook. IEEE Signal Processing Magazine, May 1998.
Klatt, D.H. 1980. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67 (971–995).
Morimoto, T., T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata, and A. Kurematsu. 1993. ATR's speech translation system: ASURA. Proceedings of the Third European Conference on Speech Communication and Technology (1295–1298). Berlin, Germany.
Moulines, E. and F. Charpentier. 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9 (453–467).
Niemann, H., E. Noeth, A. Kiessling, R. Kompe, and A. Batliner. 1997. Prosodic Processing and its Use in Verbmobil. Proceedings of ICASSP-97 (75–78). Munich, Germany.
Schmidt, M.S., S. Fitt, C. Scott, and M.A. Jack. 1993. Phonetic transcription standards for European names (ONOMASTICA). Proceedings of the Third European Conference on Speech Communication and Technology (279–282). Berlin, Germany.
Viterbi, A.J. 1967. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, IT-13(2) (260–269).
Wactlar, H.D., M.G. Christel, Y. Gong, and A.G. Hauptmann. 1999. Lessons learned from building a Terabyte Digital Video Library. IEEE Computer, 32(2) (66–73).
Wahlster, W. 1993. Verbmobil, translation of face-to-face dialogs. Proceedings of the Fourth Machine Translation Summit (127–135). Kobe, Japan.

Speaker recognition
Sadaoki Furui (2008), Scholarpedia, 3(4):3715.

doi:10.4249/scholarpedia.3715



Dr. Sadaoki Furui, Tokyo Institute of Technology

Speaker recognition is the process of automatically recognizing who is speaking by using the
speaker-specific information included in speech waves to verify identities being claimed by
people accessing systems; that is, it enables access control of various services by voice (Furui,
1991, 1997, 2000). Applicable services include voice dialing, banking over a telephone network,
telephone shopping, database access services, information and reservation services, voice mail,
security control for confidential information, and remote access to computers. Another important
application of speaker recognition technology is as a forensics tool.
Contents

1 Principles of Speaker Recognition
  1.1 General Principles and Applications
  1.2 Speaker Identification and Verification
  1.3 Text-Dependent, Text-Independent and Text-Prompted Methods
2 Text-Dependent Speaker Recognition Methods
  2.1 DTW-Based Methods
  2.2 HMM-Based Methods
3 Text-Independent Speaker Recognition Methods
  3.1 Long-Term-Statistics-Based Methods
  3.2 VQ-Based Methods
  3.3 Ergodic-HMM-Based Methods
  3.4 Speech-Recognition-Based Methods
4 Text-Prompted Speaker Recognition
5 High-level Speaker Recognition
6 Normalization and Adaptation Techniques
  6.1 Parameter-Domain Normalization
  6.2 Likelihood Normalization
  6.3 Updating Models and A Priori Threshold for Speaker Verification
  6.4 Model-based Compensation Techniques
7 References
8 See Also

Principles of Speaker Recognition

General Principles and Applications
Speaker identity is correlated with physiological and behavioral characteristics of the speech
production system of an individual speaker. These characteristics derive from both the spectral
envelope (vocal tract characteristics) and the supra-segmental features (voice source
characteristics) of speech. The most commonly used short-term spectral measurements are
cepstral coefficients and their regression coefficients. As for the regression coefficients, typically,
the first- and second-order coefficients, that is, derivatives of the time functions of cepstral
coefficients, are extracted at every frame period to represent spectral dynamics. These regression
coefficients are respectively referred to as the delta-cepstral and delta-delta-cepstral coefficients.
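As an illustration, the regression (delta) computation can be sketched in a few lines. The least-squares slope formula over a window of ±2 frames and the numpy array layout are illustrative assumptions, not prescriptions from the article:

```python
import numpy as np

def delta(cep, width=2):
    """Regression (delta) coefficients of a (frames x coeffs) cepstral matrix.

    Each frame's delta is the least-squares slope of the cepstral time
    function over a window of +/- `width` frames; edges are handled by
    repeating the first/last frame.
    """
    cep = np.asarray(cep, dtype=float)
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    padded = np.pad(cep, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(cep)
    for n in range(1, width + 1):
        num += n * (padded[width + n : width + n + len(cep)]
                    - padded[width - n : width - n + len(cep)])
    return num / denom
```

Applying the function twice, `delta(delta(cep))`, yields the delta-delta (second-order) coefficients mentioned above.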

Speaker Identification and Verification


Speaker recognition can be classified into speaker identification and speaker verification. Speaker
identification is the process of determining from which of the registered speakers a given
utterance comes. Speaker verification is the process of accepting or rejecting the identity claimed
by a speaker. Most of the applications in which voice is used to confirm the identity of a speaker
are classified as speaker verification.
In the speaker identification task, a speech utterance from an unknown speaker is analyzed and
compared with speech models of known speakers. The unknown speaker is identified as the
speaker whose model best matches the input utterance. In speaker verification, an identity is
claimed by an unknown speaker, and an utterance of this unknown speaker is compared with a
model for the speaker whose identity is being claimed. If the match is good enough, that is, above
a threshold, the identity claim is accepted. A high threshold makes it difficult for impostors to be
accepted by the system, but with the risk of falsely rejecting valid users. Conversely, a low
threshold enables valid users to be accepted consistently, but with the risk of accepting
impostors. To set the threshold at the desired level of customer rejection (false rejection) and
impostor acceptance (false acceptance), data showing distributions of customer and impostor
scores are necessary.

The fundamental difference between identification and verification is the number of decision
alternatives. In identification, the number of decision alternatives is equal to the size of the
population, whereas in verification there are only two choices, acceptance or rejection, regardless
of the population size. Therefore, speaker identification performance decreases as the size of the
population increases, whereas speaker verification performance approaches a constant
independent of the size of the population, unless the distribution of physical characteristics of
speakers is extremely biased.
There is also a case called open-set identification, in which a reference model for an unknown
speaker may not exist. In this case, an additional decision alternative, "the unknown does not
match any of the models," is required. Verification can be considered a special case of the
open-set identification mode in which the known population size is one. In either verification or
identification, an additional threshold test can be applied to determine whether the match is
sufficiently close to accept the decision, or if not, to ask for a new trial.
The effectiveness of speaker verification systems can be evaluated by using the receiver operating
characteristics (ROC) curve adopted from psychophysics. The ROC curve is obtained by assigning
two probabilities, the probability of correct acceptance (1 − false rejection rate) and the
probability of incorrect acceptance (false acceptance rate), to the vertical and horizontal axes
respectively, and varying the decision threshold. The detection error trade-off (DET) curve is also
used, in which false rejection and false acceptance rates are assigned to the vertical and
horizontal axes respectively. The error curve is usually plotted on a normal deviate scale. With
this scale, a speaker recognition system whose true speaker and impostor scores are Gaussians
with the same variance will result in a linear curve with a slope equal to −1. The DET curve
representation is therefore more easily readable than the ROC curve and allows for a comparison
of the system's performance over a wide range of operating conditions.
The equal-error rate (EER) is a commonly accepted overall measure of system performance. It
corresponds to the threshold at which the false acceptance rate is equal to the false rejection rate.
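The EER can be read off empirically by sweeping the decision threshold over pooled genuine and impostor scores, as in the following sketch (the convention that higher scores are more speaker-like is an assumption):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Empirical EER: sweep the decision threshold over all observed
    scores and return the operating point where the false rejection
    rate (FRR) and false acceptance rate (FAR) are closest, plus the
    threshold that achieves it."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    gap, eer, eer_th = np.inf, 1.0, 0.0
    for th in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < th)    # valid users falsely rejected
        far = np.mean(impostor >= th)  # impostors falsely accepted
        if abs(frr - far) < gap:
            gap, eer, eer_th = abs(frr - far), (frr + far) / 2.0, th
    return eer, eer_th
```

The same FRR/FAR sweep provides the points of the ROC and DET curves described above.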

Text-Dependent, Text-Independent and Text-Prompted Methods

Speaker recognition methods can also be divided into text-dependent (fixed passwords) and
text-independent (no specified passwords) methods. The former require the speaker to provide
utterances of key words or sentences, the same text being used for both training and recognition,
whereas the latter do not rely on a specific text being spoken. The text-dependent methods are
usually based on template/model-sequence-matching techniques in which the time axes of an
input speech sample and reference templates or reference models of the registered speakers are
aligned, and the similarities between them are accumulated from the beginning to the end of the
utterance. Since this method can directly exploit voice individuality associated with each
phoneme or syllable, it generally achieves higher recognition performance than the
text-independent method.
There are several applications, such as forensics and surveillance applications, in which
predetermined key words cannot be used. Moreover, human beings can recognize speakers
irrespective of the content of the utterance. Therefore, text-independent methods have attracted
more attention. Another advantage of text-independent recognition is that it can be done
sequentially, until a desired significance level is reached, without the annoyance of the speaker
having to repeat key words again and again.
Both text-dependent and independent methods have a serious weakness. That is, these security
systems can easily be circumvented, because someone can play back the recorded voice of a
registered speaker uttering key words or sentences into the microphone and be accepted as the
registered speaker. Another problem is that people often do not like text-dependent systems
because they do not like to utter their identification number, such as their social security number,
within the hearing of other people. To cope with these problems, some methods use a small set of
words, such as digits as key words, and each user is prompted to utter a given sequence of key
words which is randomly chosen every time the system is used. Yet even this method is not
reliable enough, since it can be circumvented with advanced electronic recording equipment that
can reproduce key words in a requested order. Therefore, a text-prompted speaker recognition
method has been proposed in which password sentences are completely changed every time.

Text-Dependent Speaker Recognition Methods


Text-dependent speaker recognition methods can be classified into DTW (dynamic time warping)
or HMM (hidden Markov model) based methods.

DTW-Based Methods
In this approach, each utterance is represented by a sequence of feature vectors, generally short-term spectral feature vectors, and the trial-to-trial timing variation of utterances of the same text
is normalized by aligning the analyzed feature vector sequence of a test utterance to the template
feature vector sequence using a DTW algorithm. The overall distance between the test utterance
and the template is used for the recognition decision. When multiple templates are used to
represent spectral variation, distances between the test utterance and the templates are averaged
and then used to make the decision. The DTW approach has trouble modeling the statistical
variation in spectral features.
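A minimal DTW alignment of a test utterance against a reference template might look like the following (the Euclidean frame distance and the classic three-step recursion are illustrative assumptions; practical systems add slope constraints and path normalization):

```python
import numpy as np

def dtw_distance(test, template):
    """Accumulated DTW distance between two (frames x dims) sequences.

    Uses Euclidean frame distances and the standard step pattern
    (i-1,j), (i,j-1), (i-1,j-1); the overall distance at the end of
    both sequences is the recognition score."""
    n, m = len(test), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(test[i - 1] - template[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]
```

With multiple templates per speaker, the distances returned by this function would be averaged before the decision, as described above.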

HMM-Based Methods
An HMM can efficiently model the statistical variation in spectral features. Therefore, HMM-based methods have achieved significantly better recognition accuracies than DTW-based
methods.

Text-Independent Speaker Recognition Methods

In text-independent speaker recognition, generally the words or sentences used in recognition
trials cannot be predicted. Since it is impossible to model or match speech events at the word or
sentence level, the following four kinds of methods have been investigated.

Long-Term-Statistics-Based Methods
Long-term sample statistics of various spectral features, such as the mean and variance of
spectral features over a series of utterances, have been used. Long-term spectral averages are
extreme condensations of the spectral characteristics of a speaker's utterances and, as such, lack
the discriminating power of the sequences of short-term spectral features used as models in text-dependent methods.

VQ-Based Methods
A set of short-term training feature vectors of a speaker can be used directly to represent the
essential characteristics of that speaker. However, such a direct representation is impractical
when the number of training vectors is large, since the memory and amount of computation
required become prohibitively large. Therefore, attempts have been made to find efficient ways of
compressing the training data using vector quantization (VQ) techniques.
In this method, VQ codebooks, consisting of a small number of representative feature vectors, are
used as an efficient means of characterizing speaker-specific features. In the recognition stage, an
input utterance is vector-quantized by using the codebook of each reference speaker; the VQ
distortion accumulated over the entire input utterance is used for making the recognition
determination.
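A toy sketch of the VQ-based method follows; the codebook size, the basic k-means training, and the Euclidean distortion measure are illustrative choices rather than details from the article:

```python
import numpy as np

def train_codebook(frames, k, iters=20, seed=0):
    """K-means codebook of k representative feature vectors for one speaker."""
    frames = np.asarray(frames, dtype=float)
    rng = np.random.default_rng(seed)
    code = frames[rng.choice(len(frames), k, replace=False)].copy()
    for _ in range(iters):
        # assign each training frame to its nearest codeword
        d = np.linalg.norm(frames[:, None, :] - code[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                code[j] = frames[labels == j].mean(axis=0)
    return code

def vq_distortion(frames, code):
    """Average VQ distortion of an utterance against a speaker codebook.

    The speaker whose codebook yields the smallest accumulated
    distortion is selected (identification) or accepted (verification)."""
    d = np.linalg.norm(frames[:, None, :] - code[None, :, :], axis=2)
    return d.min(axis=1).mean()
```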
In contrast with the memoryless (frame-by-frame) VQ-based method, non-memoryless source
coding algorithms have also been studied using a segment (matrix) quantization technique. The
advantage of a segment quantization codebook over a VQ codebook representation is its
characterization of the sequential nature of speech events. A segment modeling procedure for
constructing a set of representative time normalized segments called filler templates has been
proposed. The procedure, a combination of K-means clustering and dynamic programming time
alignment, provides a means for handling temporal variation.

Ergodic-HMM-Based Methods
The basic structure is the same as the VQ-based method, but in this method an ergodic HMM is
used instead of a VQ codebook. Over a long timescale, the temporal variation in speech signal
parameters is represented by stochastic Markovian transitions between states. This method uses
a multiple-state ergodic HMM (i.e., all possible transitions between states are allowed) to classify
speech segments into one of the broad phonetic categories corresponding to the HMM states. The
automatically obtained categories are often characterized as strong voicing, silence, nasal/liquid,
stop burst/post silence, frication, etc.

The VQ-based method has been compared with the discrete/continuous ergodic HMM-based
method, particularly from the viewpoint of robustness against utterance variations. It was found
that the continuous ergodic HMM method is far superior to the discrete ergodic HMM method
and that the continuous ergodic HMM method is as robust as the VQ-based method when
enough training data is available. However, when little data is available, the VQ-based method is
more robust than the continuous HMM method. Speaker identification rates using the
continuous HMM were investigated as a function of the number of states and mixtures. It was
shown that the speaker recognition rates were strongly correlated with the total number of
mixtures, irrespective of the number of states. This means that using information on transitions
between different states is ineffective for text-independent speaker recognition.
A technique based on maximum likelihood estimation of a Gaussian mixture model (GMM)
representation of speaker identity is one of the most popular methods. This method corresponds
to the single-state continuous ergodic HMM. Gaussian mixtures are noted for their robustness as
a parametric model and for their ability to form smooth estimates of rather arbitrary underlying
densities.
The VQ-based method can be regarded as a special (degenerate) case of a single-state HMM with
a distortion measure being used as the observation probability.
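The GMM approach can be sketched with scikit-learn as below; the library choice, the diagonal covariances, and the mixture count are assumptions made for illustration, since the article does not prescribe a toolkit:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(train_frames_by_speaker, n_components=4, seed=0):
    """Fit one Gaussian mixture model per registered speaker on that
    speaker's training feature frames."""
    return {spk: GaussianMixture(n_components, covariance_type="diag",
                                 random_state=seed).fit(frames)
            for spk, frames in train_frames_by_speaker.items()}

def identify(frames, gmms):
    """Identify the speaker whose GMM gives the highest average
    per-frame log-likelihood for the test utterance."""
    return max(gmms, key=lambda spk: gmms[spk].score(frames))
```

Setting `n_components=1` reduces this to the single-state continuous ergodic HMM case mentioned above.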

Speech-Recognition-Based Methods
The VQ- and HMM-based methods can be regarded as methods that use phoneme-class-dependent speaker characteristics contained in short-term spectral features through implicit
phoneme-class recognition. In other words, phoneme-classes and speakers are simultaneously
recognized in these methods. On the other hand, in the speech-recognition-based methods,
phonemes or phoneme-classes are explicitly recognized, and then each phoneme/phoneme-class
segment in the input speech is compared with speaker models or templates corresponding to that
phoneme/phoneme-class.
A five-state ergodic linear predictive HMM for broad phonetic categorization has been
investigated. In this method, after frames that belong to particular phonetic categories have been
identified, feature selection is performed. In the training phase, reference templates are
generated and verification thresholds are computed for each phonetic category. In the
verification phase, after phonetic categorization, a comparison with the reference template for
each particular category provides a verification score for that category. The final verification
score is a weighted linear combination of the scores for each category. The weights are chosen to
reflect the effectiveness of particular categories of phonemes in discriminating between speakers
and are adjusted to maximize the verification performance. Experimental results showed that
verification accuracy can be considerably improved by this category-dependent weighted linear
combination method.

A speaker verification system using 4-digit phrases has also been tested in actual field conditions
with a banking application, where input speech was segmented into individual digits using a
speaker-independent HMM. The frames within the word boundaries for a digit were compared
with the corresponding speaker-specific HMM digit model and the Viterbi likelihood score was
computed. This was done for each of the digits making up the input utterance. The verification
score was defined to be the average normalized log-likelihood score over all the digits in the
utterance.
A large vocabulary speech recognition system has also been used for speaker verification. With
this approach a set of speaker-independent phoneme models were adapted to each speaker.
Speaker verification consisted of two stages. First, speaker-independent speech recognition was
run on each of the test utterances to obtain phoneme segmentation. In the second stage, the
segments were scored against the adapted models for a particular target speaker. The scores were
normalized by those with speaker-independent models. The system was evaluated using the 1995
NIST-administered speaker verification database, which consists of data taken from the
Switchboard corpus. The results showed that this method did not out-perform Gaussian mixture
models.

Text-Prompted Speaker Recognition


In this method, key sentences are completely changed every time. The system accepts the input
utterance only when it determines that the registered speaker uttered the prompted sentence.
Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence
they will be prompted to say. This method not only accurately recognizes speakers, but can also
reject an utterance whose text differs from the prompted text, even if it is uttered by a registered
speaker. Thus, a recorded and played back voice can be correctly rejected.
This method uses speaker-specific phoneme models as basic acoustic units. One of the major
issues in this method is how to properly create these speaker-specific phoneme models when
using training utterances of a limited size. The phoneme models are represented by Gaussianmixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speakerindependent phoneme models to each speaker's voice.
In the recognition stage, the system concatenates the phoneme models of each registered speaker
to create a sentence HMM, according to the prompted text. Then the likelihood of the input
speech against the sentence model is calculated and used for speaker verification.

High-level Speaker Recognition


High-level features such as word idiolect, pronunciation, phone usage, prosody, etc. have also
been successfully used in text-independent speaker verification. Typically, high-level-feature
recognition systems produce a sequence of symbols from the acoustic signal and then perform
recognition using the frequency and co-occurrence of symbols. In an idiolect approach, word

unigrams and bigrams from manually transcribed conversations are used to characterize a
particular speaker in a traditional target/background likelihood ratio framework. The use of
support vector machines for performing the speaker verification task based on phone and word
sequences obtained using phone recognizers has been proposed. The benefit of these features was
demonstrated in the NIST extended data task for speaker verification; with enough
conversational data, a recognition system can become familiar with a speaker and achieve
excellent accuracy. The corpus was a combination of phases 2 and 3 of the Switchboard-2
corpora. Each training utterance in the corpus consisted of a conversation side that was
nominally of length 5 minutes (approximately 2.5 minutes of speech) recorded over a land-line
telephone. Speaker models were trained using 1 to 16 conversation sides. These methods need
utterances at least several minutes long, much longer than those used in conventional speaker
recognition methods.

Normalization and Adaptation Techniques


How can we normalize intra-speaker variation of likelihood (similarity) values in speaker
verification? The most significant factor affecting automatic speaker recognition performance is
variation in signal characteristics from trial to trial (inter-session variability, or variability over
time). Variations arise from the speaker him/herself, from differences in recording and
transmission conditions, and from noise. Speakers cannot repeat an utterance precisely the same
way from trial to trial. It is well known that samples of the same utterance recorded in one
session are much more highly correlated than tokens recorded in separate sessions. There are
also long term trends in voices.
It is important for speaker recognition systems to accommodate these variations. Adaptation of
the reference model as well as the verification threshold for each speaker is indispensable to
maintaining a high recognition accuracy over a long period. In order to compensate for the
variations, two types of normalization techniques have been tried: one in the parameter domain,
and the other in the distance/similarity domain. The latter technique uses the likelihood ratio or
a posteriori probability. To adapt HMMs for noisy conditions, various techniques including the
HMM composition (PMC: parallel model combination) method, have proved successful.

Parameter-Domain Normalization
As one typical normalization technique in the parameter domain, spectral equalization, the so-called blind equalization method, has been confirmed to be effective in reducing linear channel
effects and long-term spectral variation. This method is especially effective for text-dependent
speaker recognition applications using sufficiently long utterances. In this method, cepstral
coefficients are averaged over the duration of an entire utterance, and the averaged values are
subtracted from the cepstral coefficients of each frame (CMS; cepstral mean subtraction). This
method can compensate fairly well for additive variation in the log spectral domain. However, it
unavoidably removes some text-dependent and speaker-specific features, so it is inappropriate
for short utterances in speaker recognition applications. It has also been shown that time
derivatives of cepstral coefficients (delta-cepstral coefficients) are resistant to linear channel
mismatches between training and testing.
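CMS itself is a one-line operation over the utterance; the sketch below makes the channel-invariance property explicit (the numpy layout is an assumption):

```python
import numpy as np

def cepstral_mean_subtraction(cep):
    """CMS: subtract the utterance-average cepstrum from every frame.

    A time-invariant linear channel adds a constant vector to each
    frame's cepstrum (additive in the log-spectral domain), so that
    constant is cancelled exactly by this subtraction."""
    return cep - cep.mean(axis=0, keepdims=True)
```

This is why a convolutive channel, which becomes additive in the cepstral domain, is compensated by mean subtraction.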

Likelihood Normalization
A normalization method for likelihood (similarity or distance) values that uses a likelihood ratio
has been proposed. The likelihood ratio is the ratio of the conditional probability of the observed
measurements of the utterance given the claimed identity is correct, to the conditional
probability of the observed measurements given the speaker is an impostor (normalization term).
Generally, a positive log-likelihood ratio indicates a valid claim, whereas a negative value
indicates an impostor. Likelihood ratio normalization approximates optimal scoring in the
Bayes sense.
This normalization method is, however, unrealistic because conditional probabilities must be
calculated for all the reference speakers, which incurs a large computational cost. Therefore, a set
of speakers, cohort speakers, who are representative of the population distribution near the
claimed speaker has been chosen for calculating the normalization term. Another way of
choosing the cohort speaker set is to use speakers who are typical of the general population. It
was reported that a randomly selected, gender-balanced background speaker population
outperformed a population near the claimed speaker.
A normalization method based on a posteriori probability has also been proposed. The difference
between the normalization method based on the likelihood ratio and that based on a posteriori
probability is whether or not the claimed speaker is included in the impostor speaker set for
normalization; the cohort speaker set in the likelihood-ratio-based method does not include the
claimed speaker, whereas the normalization term for the a posteriori-probability-based method
is calculated by using a set of speakers including the claimed speaker. Experimental results
indicate that both normalization methods almost equally improve speaker separability and
reduce the need for speaker-dependent or text-dependent thresholding, compared with scoring
using only the model of the claimed speaker.
A method in which the normalization term is approximated by the likelihood for a world model
representing the population in general has also been proposed. This method has an advantage in
that the computational cost for calculating the normalization term is much smaller than the
original method since it does not need to sum the likelihood values for cohort speakers. A method
based on tied-mixture HMMs in which the world model is made as a pooled mixture model
representing the parameter distribution for all the registered speakers has been proposed. The
use of a single background model for calculating the normalization term has become the
predominant approach used in speaker verification systems.

Since these normalization methods neglect absolute deviation between the claimed speaker's
model and the input speech, they cannot differentiate highly dissimilar speakers. It has been
reported that a multilayer network decision algorithm makes effective use of the relative and
absolute scores obtained from the matching algorithm.
A family of normalization techniques has been proposed, in which the scores are normalized by
subtracting the mean and then dividing by standard deviation, both terms having been estimated
from the (pseudo-)impostor score distribution. Different possibilities are available for computing
the impostor score distribution: Znorm, Hnorm, Tnorm, Htnorm, Cnorm and Dnorm (Bimbot et
al., 2004). The state-of-the-art text-independent speaker verification techniques associate one or
more parameterization level normalization approaches (CMS, feature variance normalization,
feature warping, etc.) with world model normalization and one or more score normalizations.
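The mean/standard-deviation family can be illustrated with Znorm, where the statistics are estimated offline by scoring pseudo-impostor data against the claimed speaker's model (a minimal sketch; Tnorm differs mainly in estimating the statistics at test time from other speakers' models):

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Zero normalization (Znorm): centre and scale a raw verification
    score by the mean and standard deviation of an impostor score
    distribution estimated for the claimed speaker's model."""
    imp = np.asarray(impostor_scores, dtype=float)
    return (raw_score - imp.mean()) / imp.std()
```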

Updating Models and A Priori Threshold for Speaker Verification

How to update speaker models to cope with the gradual changes in people's voices is an
important issue. Since we cannot ask every user to utter many utterances across many different
sessions in real situations, it is necessary to build each speaker model based on a small amount of
data collected in a few sessions, and then the model must be updated using speech data collected
when the system is used.
How to set the a priori decision threshold for speaker verification is another important issue. In
most laboratory speaker recognition experiments, the threshold is set a posteriori at the system's
equal error rate (EER). Since the threshold cannot be set a posteriori in real situations, we have
to have practical ways to set the threshold before verification. It must be set according to the
relative importance of the two errors, which depends on the application.
These two problems are intrinsically related to each other. Methods for updating reference
templates and the threshold in DTW-based speaker verification were proposed. An optimum
threshold was estimated based on the distribution of overall distances between each speaker's
reference template and a set of utterances of other speakers (interspeaker distances). The
interspeaker distance distribution was approximated by a normal distribution, and the threshold
was calculated by the linear combination of its mean value and standard deviation. The
intraspeaker distance distribution was not taken into account in the calculation, mainly because
it is difficult to obtain stable estimates of the intraspeaker distance distribution from small
numbers of training utterances. The reference template for each speaker was updated
by averaging new utterances and the present template after time registration. These methods
have been extended and applied to text-independent and text-prompted speaker verification
using HMMs.
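The a priori threshold described above, a linear combination of the mean and standard deviation of the interspeaker (impostor) distance distribution, can be sketched as follows; the weight value is an illustrative choice, not a figure from the article:

```python
import numpy as np

def a_priori_threshold(interspeaker_distances, a=1.5):
    """Threshold = mean - a * std of the interspeaker (impostor)
    distance distribution. An input utterance whose distance to the
    claimed speaker's template falls below the threshold is accepted;
    the weight `a` trades false acceptance against false rejection."""
    d = np.asarray(interspeaker_distances, dtype=float)
    return d.mean() - a * d.std()
```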

Model-based Compensation Techniques

Various model-based compensation techniques for mismatch factors, including channel,
additive noise, linguistic content and intra-speaker variation, have recently been proposed (e.g.,
Fauve et al., 2007; Yin et al., 2007). Key developments include support vector machines (SVMs),
the associated nuisance attribute projection (NAP) compensation, and factor analysis (FA). They have
been shown to provide significant improvements in GMM-based text-independent speaker
verification. These approaches involve estimating the variability from a large database in which
each speaker is recorded across multiple sessions. The underlying hypothesis is that a
low-dimensional session variability subspace exists with only limited overlap with speaker-specific
information.
The goal of NAP is to project out a subspace from the original expanded space, where information
has been affected by nuisance effects. This is performed by learning on a background set of
recordings from many different speakers, without explicit labeling. The most
straightforward approach is to use the difference between a given session and the mean across
sessions for each speaker. This information is pooled across speakers to form a combined matrix.
An eigenvalue problem is solved on the corresponding covariance matrix to find the dimensions of
high variability for the pooled set. The resulting vectors are used in an SVM framework. FA shares
similar characteristics to that of NAP, and it operates on generative models with traditional
statistical approaches, such as EM, to model intersession variability.
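A toy sketch of the NAP construction described above follows; the supervector layout, the plain Euclidean geometry, and the rank-k eigendecomposition are assumptions made for illustration:

```python
import numpy as np

def nap_projection(session_vecs_by_speaker, k):
    """NAP sketch: learn the top-k within-speaker (session variability)
    directions from a background set and return P = I - U U^T, which
    projects those nuisance directions out of any supervector."""
    # per-speaker session vectors minus that speaker's mean, pooled
    diffs = np.vstack([v - v.mean(axis=0) for v in session_vecs_by_speaker])
    cov = diffs.T @ diffs / len(diffs)
    eigval, eigvec = np.linalg.eigh(cov)          # ascending eigenvalues
    U = eigvec[:, np.argsort(eigval)[::-1][:k]]   # top-k nuisance basis
    return np.eye(U.shape[0]) - U @ U.T
```

In use, the projection would be applied to background, enrollment, and test supervectors alike before SVM training and scoring.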

References

Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S.,
Merlin, T., Ortega-Garcia, J., Petrovska-Delacretaz, D. and Reynolds, D. A. (2004) A Tutorial
on Text-Independent Speaker Verification, EURASIP Journal on Applied Signal Processing,
pp. 430-451.

Fauve, B. G. B., Matrouf, D., Scheffer, N. and Bonastre, J.-F. (2007) State-of-the-Art
Performance in Text-Independent Speaker Verification through Open-Source Software,
IEEE Trans. on Audio, Speech, and Language Process., 15, 7, pp. 1960-1968.

Furui, S. (1991) Speaker-Independent and Speaker-Adaptive Recognition Techniques, in
Furui, S. and Sondhi, M. M. (Eds.) Advances in Speech Signal Processing, New York: Marcel
Dekker, pp. 597-622.

Furui, S. (1997) Recent Advances in Speaker Recognition, Proc. First Int. Conf. Audio- and
Video-based Biometric Person Authentication, Crans-Montana, Switzerland, pp. 237-252.

Furui, S. (2000) Digital Speech Processing, Synthesis, and Recognition, 2nd Edition, New
York: Marcel Dekker.

Yin, S.-C., Rose, R. and Kenny, P. (2007) A Joint Factor Analysis Approach to Progressive
Model Adaptation in Text-Independent Speaker Verification, IEEE Trans. on Audio,
Speech, and Language Process., 15, 7, pp. 1999-2010.


ALL COURSE CATALOGS

2014 UNDERGRADUATE CATALOG

2014 GRADUATE CATALOG

UT Dallas 2014 Graduate Catalog

Erik Jonsson School of Engineering &


Computer Science
Department of Electrical Engineering
Department Faculty
Professors: Naofal Al-Dhahir, Poras T. Balsara, Dinesh Bhatia, Andrew J. Blanchard, Ann
(Catrina) Coleman, James J. Coleman, David E. Daniel, Babak Fahimi, John P. Fonseka, William
R. Frensley, Andrea Fumagalli, John H. L. Hansen, C. Robert Helms, Nasser
Kehtarnavaz,Kamran Kiasaleh, Gil S. Lee, Jeong-Bong Lee, Jin Liu, Duncan L.
MacFarlane, Yiorgos Makris,Hlaing Minn, Won Namgoong, Aria Nosratinia, Mehrdad
Nourani, Kenneth K. O, Raimund J. Ober,Lawrence J. Overzet, Kaushik
Rajashekara, Mohammad Saquib, Carl Sechen, Mark W. Spong,Lakshman Tamil, Murat
Torlak, Dian Zhou
Professors Emeritus: Louis R. Hunt, William J. Pervin, Don Shaw
Research Professors: Walter Duncan, Andrew Marshall, Hisashi (Sam) Shichijo
Associate Professors: Gerald O. Burnham, Yun Chiu, Rashaunda Henderson, Wenchuang
(Walter) Hu, Roozbeh Jafari, Hoi Lee, Dongsheng Brian Ma, Issa M. S. Panahi, Siavash
Pourkamali
Assistant Professors: Bilal Akin, Taylor Barton, Carlos A. Busso-Recabarren, Joseph CallenesSloan, Nicholas Gans, Myoungsoo Jung, Chadwin D. Young
Research Assistant Professors: Hynek Boril, Abhijeet Sangwan

Senior Lecturers: Charles (Pete) Bernardin, Peter A. Blakey, Paul Deignan, Nathan B.
Dodge,James Florence, Jung Lee, Randall E. Lehmann, P. K. Rajasekaran, Ricardo E.
Saad, William (Bill) Swartz, Marco Tacca
UT Dallas Affiliated Faculty: Larry P. Ammann, Leonidas Bleris, Yves J. Chabal, Bruce E.
Gnade, Matthew J. Goeckner, Robert D. Gregg, Jiyoung Kim, Moon J. Kim, David J. Lary, Yang
Liu, Robert L. Rennaker II, Mario A. Rotea, Mathukumalli Vidyasagar, Robert M. Wallace, Steve
Yurkovich

Objectives
The program leading to the MSEE degree provides intensive preparation for professional
practice in a broad spectrum of high-technology areas of electrical engineering. It is designed to
serve the needs of engineers who wish to continue their education. Courses are offered at a time
and location convenient for the student who is employed on a full-time basis.
The objective of the doctoral program in electrical engineering is to prepare individuals to perform original, leading-edge research in the broad areas of communications and signal processing; mixed-signal IC design; digital systems; microelectronics and nanoelectronics; optics and photonics; optical communication devices and systems; power electronics and energy systems; and wireless communications. Because of our strong collaborative programs with
Dallas-area high-technology companies, special emphasis is placed on preparation for research
and development positions in these high-technology industries.

Facilities
The Erik Jonsson School of Engineering and Computer Science has developed a state-of-the-art
information infrastructure consisting of a wireless network in all buildings and an extensive fiber-optic and copper Ethernet. Through the Texas Higher Education Network, students and faculty
have direct access to most major national and international networks. UT Dallas has an Internet
2 connection. In addition, many personal computers and UNIX workstations are available for
student use.
The Engineering and Computer Science Building and the new Natural Science and Engineering
Research Laboratory provide extensive facilities for research in microelectronics,
telecommunications, and computer science. A Class 10000 microelectronics clean room facility,
including e-beam lithography, sputter deposition, PECVD, LPCVD, etch, ash and evaporation, is
available for student projects and research. The Plasma Applications and Science Laboratories
have state-of-the-art facilities for mass spectrometry, microwave interferometry, optical
spectroscopy, optical detection, in situ ellipsometry and FTIR spectroscopy. In addition, a
modified Gaseous Electronics Conference Reference Reactor has been installed for plasma
processing and particulate generation studies. Research in characterization and fabrication of
nanoscale materials and devices is performed in the Nanoelectronics Laboratory. The Optical
Communications Laboratory includes attenuators, optical power meters, lasers, APD/p-i-n
photodetectors, optical tables, and couplers and is available to support system level research in
optical communications. Tissue optics research is also supported in this laboratory. The Photonic
Testbed Laboratory supports research in photonics and optical communications with current-generation optical networking test equipment. The Electronic Materials Processing Laboratory
has extensive facilities for fabricating and characterizing semiconductor and optical devices. The

Photonic Devices and Systems Laboratory houses graduate research projects centered on
optical instrumentation and photonic integrated circuits.
The Renewable Energy and Vehicular Technology Laboratory (REVT-Lab) is equipped with
various sources of renewable energy such as wind and solar, a micro-grid formed by a network
of multi-port power electronic converters, a stationary plug in hybrid vehicle testbed, a stationary
DFIG-based wind energy emulator, a series of adjustable speed motor drive technologies
including PMSM, SRM and induction motor drives. All of the testbeds are equipped with digital
control, state-of-the-art measurement and protection devices. REVT laboratory is also equipped
with a cold plasma chamber for hydrogen harvesting and battery testing facilities. The main focus
of the REVT Lab is to improve reliability and security of the power electronic-driven technologies
as applied to utility and vehicular industries.
The Texas Analog Center of Excellence (TxACE) at The University of Texas at Dallas (UT
Dallas) has the mission of leading the country in analog research and education. TxACE
research seeks to create fundamental analog, mixed signal and RF design innovations in
integrated circuits and systems that improve energy efficiency, healthcare, and public safety and
security. The center is supported by Semiconductor Research Corporation, Texas Emerging
Technology Fund, Texas Instruments Inc., the UT System, and UT Dallas. TxACE is the largest
analog technology center in the world on the basis of funding and the number of principal
investigators. The center funds ~70 directed research projects led by ~65 principal and co-principal investigators from 31 academic institutions, including three international institutions.
The Multimedia Communications Laboratory has a dedicated network of PCs, Linux stations, and multi-processor, high-performance workstations for analysis, design and simulation of image and video processing systems. The Signal and Image Processing (SIP) Laboratory has a dedicated network of PCs equipped with digital camera and signal processing hardware
platforms allowing the implementation of advanced image processing algorithms. The Statistical
Signal Processing Laboratory is dedicated to research in statistical and acoustic signal
processing for biomedical and non-biomedical applications. It is equipped with high-performance
computers and powerful textual and graphical software platforms to analyze advanced signal
processing methods, develop new algorithms, and perform system designs and simulations. The
Acoustic Research Laboratory provides a number of test-beds and associated equipment for signal
measurements, system modeling, real-time implementation and testing of algorithms related to
audio/acoustic/speech signal processing applications such as active noise control, speech
enhancement, dereverberation, echo cancellation, sensor arrays, psychoacoustic signal
processing, etc.
The Center for Robust Speech Systems (CRSS) is focused on a wide range of research in the
area of speech signal processing, speech and speaker recognition, speech/language technology,
and multi-modal signal processing involving facial/speech modalities. CRSS is affiliated with
HLTRI in the Erik Jonsson School, and collaborates extensively with faculty and programs across
UT Dallas on speech and language research. CRSS supports an extensive network of
workstations, as well as a High-Performance Compute Cluster with over 30 TB of disk space and a 420-CPU ROCS multi-processor cluster. The center is also equipped with several Texas
Instruments processors for real-time processing of speech signals, and two ASHA certified sound
booths for perceptual/listening based studies and for speech data collection. CRSS supports

mobile speech interactive systems through the UT Drive program for in-vehicle driver-behavior
systems, and multi-modal based interaction systems via image-video-speech research.
The Sensing, Robotics, Vision, Control and Estimation (SeRViCE) Lab focuses on topics of
control and estimation with applications in robotics, autonomous vehicles and sensor
management. Primary expertise is in vision-based control and estimation and nonlinear control,
that is, using cameras as the primary sensor to control robots or other complex systems.
Robotics resources in the lab currently include two Pioneer 3-DX mobile robots from Mobile
Robots Inc. and a Stäubli TX90 robot manipulator with six degrees of freedom, 7 kg nominal
payload and capable of torque level control. Camera resources include multiple web cameras,
three high-quality, firewire, color, digital video cameras, and an 18Mp digital SLR camera. The
SeRViCE Lab also features general support equipment, including desktop and mobile workstations, DLP projectors, power tools, hand tools, oscilloscopes, and other electronic
measurement equipment.
The Laboratory for Autonomous Robotics and Systems (LARS) focuses on the development of
novel control theory to support autonomous and teleoperation of general robotic systems. Active
research projects include: (a) human-in-the-loop multi-robot telemanipulation, (b) autonomous
networked robotics, and (c) control of bipedal walking robots. The LARS is equipped with a high-speed, high-resolution 8-camera Vicon motion capture system for general-purpose motion tracking. The LARS possesses various mobile robots to support multi-robot research, including
six gumstix controlled iRobot Creates and a Quanser QBall quadrotor UAV. The LARS also
possesses various force-feedback user interface devices, including a Logitech force-feedback joystick and driving wheel, and a Novint Falcon, a 3-translational-degree-of-freedom delta-structure desktop haptic device.
The Broadband Communication Laboratory has design and modeling tools for fiber and wireless
transmission systems and networks, and all-optical packet routing and switching. The Advanced
Communications Technologies (ACT) Laboratory provides a design and evaluation environment
for the study of telecommunication systems and wireless and optical networks. ACT has facilities
for designing network hardware, software, components, and applications.
The Center for Systems, Communications, and Signal Processing, with the purpose of promoting
research and education in general communications, signal processing, control systems, medical
and biological systems, circuits and systems and related software, is located in the Erik Jonsson
School.
The Wireless Information Systems (WISLAB) and Antenna Measurement Laboratories have
wireless experimental equipment with a unique multiple antenna testbed to integrate and to
demonstrate radio functions (e.g., WiFi and WiMAX) under different frequency usage
characteristics. With the aid of the Antenna Measurement Lab located in the Waterview Science
and Technology Center (WSTC), the researchers can design, build, and test many types of
antennas.
The Quality of Life Technology Laboratory is a multidisciplinary engineering education, research
and developmental laboratory aimed at improving Quality of Life of people through technological
advancements, innovations, and intelligent system designs. It has design, modeling and
simulation tools for medical devices and systems.

The faculty of the Erik Jonsson School's Photonic Technology and Engineering Center (PhoTEC)
carry out research in enabling technologies for microelectronics and telecommunications. Current
research areas include nonlinear optics, Raman amplification in fibers, optical switching,
applications of optical lattice filters, microarrays, integrated optics, and optical networking.
In addition to the facilities on campus, cooperative arrangements have been established with
many local industries to make their facilities available to UT Dallas graduate engineering
students.

Master of Science in Electrical Engineering


33 semester credit hours minimum

Admission Requirements
The university's general admission requirements are discussed on the Graduate Admission page (catalog.utdallas.edu/2014/graduate/admission).
A student lacking undergraduate prerequisites for graduate courses in electrical engineering
must complete these prerequisites or receive approval from the graduate advisor and the course
instructor.
A diagnostic exam may be required. Specific admission requirements follow.
The student entering the MSEE program should meet the following guidelines:
An undergraduate preparation equivalent to a baccalaureate in electrical engineering from an accredited engineering program.
A grade point average in upper-division quantitative coursework of 3.0 or better on a 4.0 point scale.
GRE revised scores of 154, 156, and 4 for the verbal, quantitative, and analytical writing components, respectively, are advisable based on our experience with student success in the program.
Applicants must submit three letters of recommendation from individuals who are able to judge
the candidate's probability of success in pursuing a program of study leading to the master's
degree. Applicants must also submit an essay outlining the candidate's background, education,
and professional goals. Students from other engineering disciplines or from other science and
math areas may be considered for admission to the program; however, some additional
coursework may be necessary before starting the master's program.

Degree Requirements
The university's general degree requirements are discussed on the Graduate Policies and
Procedures page (catalog.utdallas.edu/2014/graduate/policies/policy).
The MSEE requires a minimum of 33 semester credit hours.
All students must have an academic advisor and an approved degree plan. These are based
upon the student's choice of concentration (Biomedical Applications of Electrical Engineering; Circuits and Systems; Communications; Control Systems; Digital Systems; Photonic Devices and Systems; Power Electronics and Energy Systems; RF and Microwave Engineering; Signal Processing; Solid State Devices and Micro Systems Fabrication). Courses taken without advisor
approval will not count toward the 33 semester credit hour requirement. Successful completion of
the approved course of studies leads to the MSEE degree.
The MSEE program has both a thesis and a non-thesis option. All part-time MSEE students will
be assigned initially to the non-thesis option. Those wishing to elect the thesis option may do so
by obtaining the approval of a faculty thesis supervisor. With the prior approval of an academic
advisor, non-thesis students may count no more than 3 semester credit hours of research or
individual instruction courses towards the 33 semester credit hour degree requirement.
All full-time, supported students are required to participate in the thesis option. The thesis option
requires nine semester credit hours of research (of which three must be thesis semester credit
hours), a written thesis submitted to the graduate school, and a formal public defense of the
thesis. The supervising committee administers this defense and is chosen in consultation with
the student's thesis advisor prior to enrolling for thesis credit. Research and thesis semester
credit hours cannot be counted in an MSEE degree plan unless a thesis is written and
successfully defended.

Concentrations
One of the ten concentrations listed below, subject to approval by a graduate advisor, must be
used to fulfill the requirements of the MSEE program. Students must achieve an overall GPA
(grade point average) of 3.0 or better, a GPA of 3.0 or better in their core MSEE classes, and a
grade of B- or better in all their core MSEE classes in order to satisfy their degree requirements.
One 5000 level electrical engineering course can be counted towards the graduate semester
credit hours.

Biomedical Applications of Electrical Engineering


This curriculum provides a graduate-level introduction to advanced methods and biomedical
applications of electrical engineering.
Each student electing this concentration must take 15 semester credit hours:
EEBM 6373 Anatomy and Human Physiology for Engineers
EEBM 6374 Genes, Proteins and Cell Biology for Engineers
EEBM 6376 Lecture Course in Biomedical Applications of Electrical Engineering
and two core courses from any one other concentration.
Approved electives must be taken to make a total of 33 semester credit hours.
Depending on the specific orientation of the course program it can be very beneficial to the
student to take courses from other departments (e.g. Biology, Chemistry, Brain and Behavioral
Sciences, Computer Science-Bioinformatics). Typically, not more than three approved courses
can be taken outside the electrical engineering (EE) department. Additional courses can be taken
only with the explicit approval by the department head.

It is highly recommended that students take an independent study course with an EE faculty
member that will be counted as one of the EE electives. The independent study course is
intended to gear the coursework towards one of the following research areas in the department:
biosensors, biomedical signal processing, bioinstrumentation, medical imaging, biomaterials, and
bio-applications in RF.

Circuits and Systems


The courses in this curriculum emphasize the design and test of circuits and systems, and the
analysis and modeling of integrated circuits.
Each student electing this concentration must take five required courses (15 semester credit
hours).
Two of the courses are:
EECT 6325 VLSI Design
EECT 6326 Analog Integrated Circuit Design
The remaining three courses must be selected from:
EECT 6378 Power Management Circuits
EECT 6379 Energy Harvesting, Storage, and Powering for Microsystems
EECT 7325 Advanced VLSI Design
EECT 7326 Advanced Analog Integrated Circuit Design
EECT 7327 Data Converters
EEDG 6301 Advanced Digital Logic
EEDG 6303 Testing and Testable Design
EEDG 6306 Application Specific Integrated Circuit Design
EEDG 6375 Design Automation of VLSI Systems
EERF 6330 RF Integrated Circuit Design
Approved electives must be taken to make a total of 33 semester credit hours.

Communications
This curriculum emphasizes the application and theory of all phases of modern communications.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EESC 6349 Random Processes

EESC 6352 Digital Communication Systems


The remaining two must be selected from:
EEOP 6310 Optical Communication Systems
EERF 5305 Radio Frequency Engineering
EESC 6340 Introduction to Telecommunications Networks
EESC 6341 Information Theory I
EESC 6343 Detection and Estimation Theory
EESC 6344 Coding Theory
EESC 6353 Broadband Digital Communication
EESC 6360 Digital Signal Processing I
EESC 6390 Introduction to Wireless Communication Systems
Approved electives must be taken to make a total of 33 semester credit hours.

Control Systems
This curriculum emphasizes methods to predict, estimate, and regulate the behavior of electrical,
mechanical, or other systems including robotics.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EECS 6331 Linear Systems
EESC 6349 Random Processes
The remaining two must be selected from:
EECS 6336 Nonlinear Systems
EEGR 6381 Computational Methods in Engineering
EESC 6343 Detection and Estimation Theory
EESC 6360 Digital Signal Processing I
EESC 6364 Pattern Recognition
EESC 7V85 Special Topics in Signal Processing
Approved electives must be taken to make a total of 33 semester credit hours.

Digital Systems
The goal of the curriculum is to educate students about issues arising in the design and analysis
of digital systems, an area relevant to a variety of high-technology industries. Because the
emphasis is on systems, coursework focuses on three areas: hardware design, software design,
and analysis and modeling.
Each student electing this concentration must take four required courses (12 semester credit
hours):
Two of the courses are:
EEDG 6301 Advanced Digital Logic
EEDG 6304 Computer Architecture
The remaining two must be selected from:
EECT 6325 VLSI Design
EEDG 6302 Microprocessor Systems
EEDG 6345 Engineering of Packet-Switched Networks
Approved electives must be taken to make a total of 33 semester credit hours.

Photonic Devices and Systems


This curriculum is focused on the application and theory of modern optical devices, materials,
and systems.
Each student electing this concentration must take four required courses (12 semester credit
hours).
EEGR 6316 Fields and Waves
EEOP 6310 Optical Communication Systems
EEOP 6311 Photonic Devices and Integration
EEOP 6314 Principles of Fiber and Integrated Optics
Approved electives must be taken to make a total of 33 semester credit hours.

Power Electronics and Energy Systems


The goal of the curriculum is to prepare students to address growing needs in contemporary
power electronics and energy related areas. The coursework focuses on fundamentals of power
electronics, design and control of motor drives, power management, and energy systems.
Each student electing this concentration must take four required courses (12 semester credit
hours):
Two of the courses are:

EEPE 6354 Power Electronics


EEPE 6356 Adjustable Speed Motor Drives
The remaining two must be selected from:
EECT 6378 Power Management Circuits
EECT 6379 Energy Harvesting, Storage and Powering for Microsystems
EEPE 6357 Control, Modeling and Simulation in Power Electronics
EEPE 6358 Electrification of Transportation
EEPE 6359 Renewable Energy Systems and Distributed Power Generation Systems
EEPE 7356 Computer Aided Design of Electric Machines
EEPE 7V91 Special Topics in Power Electronics
Approved electives must be taken to make a total of 33 semester credit hours

RF and Microwave Engineering


This curriculum is focused on the application and theory of modern electronic devices, circuits,
and systems in the radiofrequency and microwave regime.
Each student electing this concentration must take the following four required courses (12 semester credit hours):
EEGR 6316 Fields and Waves
EERF 6311 RF and Microwave Circuits
EERF 6355 RF and Microwave Amplifier Design
EERF 6395 RF and Microwave Systems Engineering
Approved electives must be taken to make a total of 33 semester credit hours.

Signal Processing
This curriculum emphasizes the application and theory of signal processing.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EESC 6349 Random Processes
EESC 6360 Digital Signal Processing I
The remaining two must be selected from:

EESC 6343 Detection and Estimation Theory


EESC 6350 Signal Theory
EESC 6361 Digital Signal Processing II
EESC 6362 Introduction to Speech Processing
EESC 6363 Digital Image Processing
EESC 6364 Pattern Recognition
EESC 6365 Adaptive Signal Processing
EESC 6366 Speech and Speaker Recognition
EESC 6367 Applied Digital Signal Processing
EESC 7V85 Special Topics in Signal Processing
Approved electives must be taken to make a total of 33 semester credit hours.

Solid State Devices and Micro Systems Fabrication


This concentration is focused on the fundamental principles, design, fabrication and analysis of
solid-state devices and associated micro systems.
Each student electing this concentration must take four required courses (12 semester credit
hours).
Two of the courses are:
EEGR 6316 Fields and Waves
EEMF 6319 Quantum Physical Electronics
and at least two of the following four courses:
EEMF 6320 Fundamentals of Semiconductor Devices
EEMF 6321 Active Semiconductor Devices
EEMF 6322 Semiconductor Processing Technology
EEMF 6382 Introduction to MEMS
Approved electives must be taken to make a total of 33 semester credit hours.

Doctor of Philosophy in Electrical Engineering


75 semester credit hours minimum beyond the baccalaureate degree

Admission Requirements

The university's general admission requirements are discussed on the Graduate Admission page (catalog.utdallas.edu/2014/graduate/admission).
The PhD in Electrical Engineering is awarded primarily to acknowledge the student's success in
an original research project, the description of which is a significant contribution to the literature
of the discipline. Applicants for the doctoral program are therefore selected by the Electrical
Engineering Program Graduate Committee on the basis of research aptitude, as well as
academic record. Applications for the doctoral program are considered on an individual basis.
The following are guidelines for admission to the PhD program in Electrical Engineering:
A master's degree in electrical engineering or a closely associated discipline from an institution of higher education in the U.S. or from an acceptable foreign university. Consideration will be given to highly qualified students wishing to pursue the doctorate without satisfying all of the requirements for a master's degree.
A grade point average (GPA) in graduate coursework of 3.5 or better on a 4.0 point scale.
GRE revised scores of 154, 156, and 4 for the verbal, quantitative, and analytical writing components, respectively, are advisable based on our experience with student success in the program.
Applicants must submit three letters of recommendation on official school or business letterhead
or the UT Dallas Letter of Recommendation Form from individuals who are familiar with the
student's record and able to judge the candidate's probability of success in pursuing doctoral
study in electrical engineering.
Applicants must also submit a narrative describing their motivation for doctoral study and how it
relates to their professional goals.
For students who are interested in a PhD but are unable to attend school full-time, there is a part-time option. The guidelines for admission to the program and the degree requirements are the
same as for full-time PhD students. All students must have an academic advisor and an
approved plan of study.

Degree Requirements
The university's general degree requirements are discussed on the Graduate Policies and
Procedures page (catalog.utdallas.edu/2014/graduate/policies/policy).
Each program for doctoral study is individually tailored to the student's background and research
objectives by the student's supervisory committee. The program will require a minimum of 75
semester credit hours beyond the baccalaureate degree. These credits must include at least 30
semester credit hours of graduate level courses beyond the baccalaureate level in the major
concentration. All PhD students must demonstrate competence in the master's level core
courses in their research area. All students must have an academic advisor and an approved
plan of study.
Also required are:
A research-oriented oral qualifying examination (QE) demonstrating competence in the PhD candidate's research area. A student must make an oral presentation based on a review of 2 to 4 papers, followed by a question-and-answer session. Admission to PhD candidacy is based on two criteria: graded performance in the QE and GPA in graduate-level organized courses. A student entering the PhD program with an MSEE must pass this exam within 3 long semesters, and a student entering without an MSEE must pass this exam within 4 long semesters. A student has at most two attempts at this qualifying exam. The exam will be given during the fall and spring semesters.
A comprehensive exam consisting of a written dissertation proposal, a public seminar, and a private oral examination conducted by the PhD candidate's supervising committee. At least half of the supervising committee must consist of core EE faculty members, and it must be chaired or co-chaired by an EE faculty member.
Completion of a major research project culminating in a dissertation demonstrating an
original contribution to scientific knowledge and engineering practice. The dissertation
will be defended publicly. The rules for this defense are specified by the Office of the
Dean of Graduate Studies. Neither a foreign language nor a minor is required for the
PhD. However, the student's supervisory committee may impose these or other
requirements that it feels are necessary and appropriate to the student's degree
program.

Research
The principal concentration areas for the MSEE program are: Biomedical Applications of Electrical Engineering; Circuits and Systems; Communications; Control Systems; Digital Systems; Photonic Devices and Systems; Power Electronics and Energy Systems; RF and Microwave Engineering; Signal Processing; Solid State Devices and Micro Systems Fabrication.
Besides courses required for each concentration, a comprehensive set of electives is available in
each area.
Doctoral level research opportunities include: VLSI design and test, analog and mixed-signal
circuits and systems, RF and microwave engineering, biomedical applications of electrical
engineering, power electronics, renewable energy, motors and drives, vehicular technology,
computer architecture, embedded systems, computer aided design (CAD), ASIC design
methodologies, high speed system-on chip design and test, reconfigurable computing, network
processor design, interconnection networks, nonlinear signal-processing, smart antennas and
array processing, statistical and adaptive signal processing, multimedia signal processing, image
processing, real-time imaging, medical image analysis, pattern recognition, speech processing
and recognition, control theory, robotics, digital communications, modulation and coding,
electromagnetic-wave propagation, diffractive structures, fiber and integrated photonics,
nonlinear optics, optical transmission systems, all-optical networks, optical investigation of
material properties (reflectometry and ellipsometry), optical instrumentation, lasers, quantum-well
optical devices, theory and experiments in semiconductor-heterostructure devices, plasma
deposition and etching, nanoelectronics, wireless communication, network protocols and
evaluation, mobile computing and networking, and optical networking.
Interdisciplinary Opportunities: Continuing with the established tradition of research at UT Dallas,
the Electrical Engineering Program encourages students to interact with researchers in the
strong basic sciences and mathematics. Cross disciplinary collaborations have been established
with the Chemistry, Mathematics, and Physics programs of the School of Natural Sciences and
with faculty in the School of Brain and Behavioral Science.

How speaker recognition works


There are two types of speaker recognition:

Text-dependent (constrained): the subject has to say a fixed phrase (a password) that is the same for enrollment and for verification, or the subject is prompted by the system to repeat a randomly generated phrase.
Text-independent (unconstrained): recognition is based on whatever words the subject says.

Text-dependent recognition performs better for cooperating subjects, but text-independent recognition is more flexible in that it can also be used with non-cooperating individuals.
Identification or authentication using speaker recognition basically consists of four steps:
1. voice recording
2. feature extraction
3. pattern matching
4. decision (accept / reject)
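The four steps can be sketched as a minimal pipeline. Everything below is an illustrative assumption, not a real system: the toy feature (mean absolute amplitude per frame), the matching score (negative mean absolute difference), and the acceptance threshold are placeholders.

```python
# Minimal, illustrative sketch of the four steps: record, extract
# features, match against a speaker model, decide. The feature, the
# score, and the threshold are toy assumptions, not a real algorithm.

def record_voice(samples):
    """Step 1: stand-in for audio capture; just copies the samples."""
    return list(samples)

def extract_features(recording, frame_len=4):
    """Step 2: cut the recording into fixed-length frames and summarize
    each frame by its mean absolute amplitude (a toy feature)."""
    frames = [recording[i:i + frame_len]
              for i in range(0, len(recording) - frame_len + 1, frame_len)]
    return [sum(abs(x) for x in f) / frame_len for f in frames]

def pattern_match(features, model):
    """Step 3: score similarity as the negative mean absolute difference
    to a stored model of the same length (higher = more similar)."""
    diffs = [abs(a - b) for a, b in zip(features, model)]
    return -sum(diffs) / len(diffs)

def decide(score, threshold=-0.5):
    """Step 4: accept the identity claim if the score clears a threshold."""
    return "accept" if score >= threshold else "reject"

# Usage: verify a new recording against an enrolled model (toy data).
enrolled = extract_features([0.1, 0.2, 0.1, 0.1, 0.3, 0.2, 0.1, 0.1])
probe = extract_features(record_voice([0.1, 0.2, 0.1, 0.0, 0.3, 0.2, 0.1, 0.2]))
print(decide(pattern_match(probe, enrolled)))  # prints "accept"
```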

Figure: visualization of the acoustic pattern of the voice (loudness of the input vs. time).

Depending on the application, a voice recording is made using a local, dedicated system or remotely (e.g., by telephone). The acoustic patterns of speech can be visualized as loudness or frequency vs. time. Speaker recognition systems analyze the frequency content as well as attributes such as dynamics, pitch, duration, and loudness of the signal.
During feature extraction, the voice recording is cut into windows of equal length; these cut-out samples are called frames and are often 10 to 30 ms long.
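The framing step can be sketched as follows. The 25 ms frame length and 10 ms hop are example values chosen within the 10-30 ms range mentioned above, not values prescribed by any standard.

```python
# Sketch of the framing step: cut a sampled signal into equal-length,
# overlapping windows ("frames"). Frame length and hop are example values.

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Return a list of overlapping frames, each frame_ms long."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop_len)]

# Usage: one second of audio at 16 kHz yields 400-sample (25 ms) frames.
frames = frame_signal([0.0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]))  # prints "98 400"
```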
Pattern matching is the actual comparison of the extracted frames with known speaker models (or templates); this results in a matching score that quantifies the similarity between the voice recording and a known speaker model. Pattern matching is often based on Hidden Markov Models (HMMs), a statistical model that takes into account the underlying variations and temporal changes of the acoustic pattern.
Alternatively, Dynamic Time Warping (DTW) is used. This algorithm measures the similarity between two sequences that vary in speed or time, even if this variation is non-linear, such as when the speaking speed changes during the utterance.
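The DTW idea can be sketched in its textbook form: a dynamic program that finds the minimum cumulative distance between two sequences while allowing non-linear time stretching. Real systems align per-frame feature vectors; scalar values are used here for brevity.

```python
# Sketch of Dynamic Time Warping with absolute-difference local cost
# and the three standard alignment moves.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two sequences."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

# Usage: the same contour spoken at half speed still aligns perfectly,
# which is exactly the speaking-rate variation DTW is meant to absorb.
template = [1.0, 2.0, 3.0, 2.0]
slower = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0]
print(dtw_distance(template, slower))  # prints 0.0
```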
Some systems use "anti-speaker" techniques such as cohort models.
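One common way to use cohort models, sketched below, is to normalize the claimed speaker's raw score by the mean and standard deviation of the scores obtained against a cohort of other speakers (a T-norm-style scheme). The exact normalization varies between systems; this is an illustrative version.

```python
# Sketch of cohort ("anti-speaker") score normalization: express the
# claimed speaker's raw score relative to scores against a cohort of
# other speaker models, so recording conditions that inflate every
# score do not trigger false accepts. Mean/std normalization is one
# common (T-norm-style) choice; details differ between systems.

def cohort_normalize(claimed_score, cohort_scores):
    """Return how many standard deviations the claimed speaker's score
    lies above the cohort average."""
    n = len(cohort_scores)
    mean = sum(cohort_scores) / n
    std = (sum((s - mean) ** 2 for s in cohort_scores) / n) ** 0.5
    return (claimed_score - mean) / (std or 1.0)  # guard zero-variance cohort

# Usage: a raw score of 9 against a cohort averaging 5 with std 2.
print(cohort_normalize(9.0, [3.0, 7.0, 3.0, 7.0]))  # prints 2.0
```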

Application of speaker recognition


Voice recognition is mostly used for telephone-based applications, such as telephone banking and hotel or flight bookings.

Nuance is a US-based company and a major player in speech recognition; it has also developed a speaker recognition product called Nuance Verifier.
Voice Trust is a German company specialized in speaker recognition solutions.

Suitability of speaker recognition


How suitable is speaker recognition as a biometric solution? We use the following 7 criteria to evaluate the suitability of
speaker recognition:
Universality

Obviously, for people who are mute or who have problems with their voice due to severe
illness, this biometric solution is not usable.

Uniqueness

Because of the combination of physiological and behavioral factors the voice is a
unique feature of an individual; the voice has more unique features than a fingerprint.

Permanence

An issue with speaker recognition is that the voice changes with ageing, and is also
influenced by factors such as sickness, tiredness, stress, etc.

Collectability

Voice recordings are easy to obtain and do not require expensive hardware. The real
advantage of voice recognition is that it can be done over telephone lines or using
computer microphones, with variable recording and transmission quality. Pattern
matching algorithms must be able to handle ambient noise and differing quality of the
recordings.

Acceptability

Speaker recognition is unobtrusive; speaking is a natural process, so no unusual
actions are required. When speaker recognition is used for surveillance applications, or
in general when the subject is not aware of it, the common privacy concerns of
identifying unaware subjects apply.

Circumvention

A major issue with speaker recognition is spoofing using voice recordings. The risk of
spoofing with voice recordings can be mitigated if the system requests a randomly
generated phrase to be repeated; an impostor cannot anticipate the random phrase
that will be required and therefore cannot attempt a playback spoofing attack.
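The random-phrase countermeasure amounts to a challenge-response protocol. A minimal sketch, in which the digit prompt and the boolean voice-match stand-in are illustrative assumptions:

```python
import secrets

def random_digit_prompt(length=6):
    """A fresh random digit sequence for every attempt, so a pre-recorded
    playback of an earlier session cannot match the new prompt."""
    return " ".join(secrets.choice("0123456789") for _ in range(length))

def verify_attempt(prompted, recognized_text, voice_match):
    # Accept only if the prompted words were actually spoken AND the
    # voice itself matches the claimed speaker's model.
    return prompted == recognized_text and voice_match

prompt = random_digit_prompt()
live = verify_attempt(prompt, prompt, voice_match=True)        # genuine attempt
replay = verify_attempt(prompt, "old recorded phrase", True)   # playback fails
```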

Performance

Robustness is very dependent on the setup; when telephone lines or computer
microphones are used, the algorithms have to compensate for noise and issues
with room acoustics. Furthermore, because the voice is a behavioral biometric,
speaker recognition is affected by errors of the individual such as misreadings and
mispronunciations.
http://www.biometric-solutions.com/solutions/index.php?story=speaker_recognition

1.7: Speaker Recognition


Sadaoki Furui
NTT Human Interface Laboratories, Tokyo, Japan

1.7.1: Principles of Speaker Recognition


Speaker recognition, which can be classified into identification and verification, is
the process of automatically recognizing who is speaking on the basis of individual
information included in speech waves. This technique makes it possible to use the
speaker's voice to verify their identity and control access to services such as voice
dialing, banking by telephone, telephone shopping, database access services,
information services, voice mail, security control for confidential information
areas, and remote access to computers. AT&T and TI (with Sprint) have started
field tests and actual application of speaker recognition technology; Sprint's Voice
Phone Card is already being used by many customers. In this way, speaker
recognition technology is expected to create new services that will make our daily
lives more convenient. Another important application of speaker recognition
technology is for forensic purposes.
The figure below shows the basic structures of speaker identification and verification
systems. Speaker identification is the process of determining which registered
speaker provides a given utterance. Speaker verification, on the other hand, is the
process of accepting or rejecting the identity claim of a speaker. Most applications
in which a voice is used as the key to confirm the identity of a speaker are
classified as speaker verification.

Figure: Basic structures of speaker recognition systems.


There is also the case called open set identification, in which a
reference model for an unknown speaker may not exist. This is usually the case in
forensic applications. In this situation, an additional decision alternative, the
unknown does not match any of the models, is required. In both verification and
identification processes, an additional threshold test can be used to determine if the
match is close enough to accept the decision or if more speech data are needed.
Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to say key words or
sentences having the same text for both training and recognition trials, whereas the
latter do not rely on a specific text being spoken.
Both text-dependent and text-independent methods share a problem, however. These
systems can be easily deceived because someone who plays back the recorded
voice of a registered speaker saying the key words or sentences can be accepted as
the registered speaker. To cope with this problem, there are methods in which a
small set of words, such as digits, are used as key words and each user is prompted
to utter a given sequence of key words that is randomly chosen every time the
system is used. Yet even this method is not completely reliable, since it can be
deceived with advanced electronic recording equipment that can reproduce key
words in a requested order. Therefore, a text-prompted (machine-driven text-dependent) speaker recognition method has recently been proposed by [MF93b].

1.7.2: Feature Parameters


Speaker identity is correlated with the physiological and behavioral characteristics
of the speaker. These characteristics exist both in the spectral envelope (vocal tract
characteristics) and in the supra-segmental features (voice source characteristics
and dynamic features spanning several segments).
The most common short-term spectral measurements currently used are Linear
Predictive Coding (LPC)-derived cepstral coefficients and their regression
coefficients. A spectral envelope reconstructed from a truncated set of cepstral
coefficients is much smoother than one reconstructed from LPC coefficients.
Therefore it provides a more stable representation from one repetition to another of a
particular speaker's utterances. As for the regression coefficients, typically the
first- and second-order coefficients are extracted at every frame period to represent
the spectral dynamics. These coefficients are derivatives of the time functions of
the cepstral coefficients and are respectively called the delta- and delta-delta-cepstral coefficients.
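The regression-based delta coefficients described above can be sketched as follows; the regression width of two frames and the edge padding are assumptions chosen for illustration:

```python
import numpy as np

def delta(features, N=2):
    """Regression-based delta coefficients over +/- N surrounding frames.

    features: array of shape (n_frames, n_coeffs) of cepstral coefficients.
    Returns an array of the same shape; edge frames reuse the end frames.
    """
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:len(padded) - N + n]
                    - padded[N - n:len(padded) - N - n])
    return out / denom

# A linearly rising coefficient has a delta of 1 per frame in the interior,
# and its delta-delta is zero there.
cep = np.arange(10.0).reshape(-1, 1)   # one cepstral coefficient over 10 frames
d1 = delta(cep)        # delta-cepstrum
d2 = delta(d1)         # delta-delta-cepstrum
```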

1.7.3: Normalization Techniques


The most significant factor affecting automatic speaker recognition performance is
variation in the signal characteristics from trial to trial (intersession variability and
variability over time). Variations arise from the speaker themselves, from
differences in recording and transmission conditions, and from background noise.
Speakers cannot repeat an utterance precisely the same way from trial to trial. It is
well known that samples of the same utterance recorded in one session are much
more highly correlated than samples recorded in separate sessions. There are also
long-term changes in voices.
It is important for speaker recognition systems to accommodate these variations.
Two types of normalization techniques have been tried; one in the parameter
domain, and the other in the distance/similarity domain.
Parameter-Domain Normalization

Spectral equalization, the so-called blind equalization method, is a typical
normalization technique in the parameter domain that has been confirmed to be
effective in reducing linear channel effects and long-term spectral variation
[Ata74,Fur81]. This method is especially effective for text-dependent speaker
recognition applications that use sufficiently long utterances. Cepstral coefficients
are averaged over the duration of an entire utterance and the averaged values
subtracted from the cepstral coefficients of each frame. Additive variation in the
log spectral domain can be compensated for fairly well by this method. However, it
unavoidably removes some text-dependent and speaker specific features; therefore
it is inappropriate for short utterances in speaker recognition applications.
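This cepstral mean subtraction can be sketched in a few lines. The synthetic "channel" below is an assumption used only to demonstrate that a constant log-spectral offset cancels:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the utterance-level mean from every frame's cepstrum.

    A stationary linear channel adds a constant offset to each cepstral
    coefficient, so removing the per-utterance mean cancels that offset.
    cepstra: array of shape (n_frames, n_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 12))   # cepstra of a clean utterance
channel = np.full(12, 3.0)               # constant channel offset (assumed)
observed = clean + channel               # the same utterance through the channel
normalized = cepstral_mean_subtraction(observed)  # channel offset removed
```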
Distance/Similarity-Domain Normalization
A normalization method for distance (similarity, likelihood) values using a
likelihood ratio has been proposed by [HBP91]. The likelihood ratio is defined as
the ratio of two conditional probabilities of the observed measurements of the
utterance: the first probability is the likelihood of the acoustic data given the
claimed identity of the speaker, and the second is the likelihood given that the
speaker is an imposter. The likelihood ratio normalization approximates optimal
scoring in the Bayes sense.
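In log form, the likelihood ratio above becomes a difference of log-likelihoods. A minimal sketch with single 1-D Gaussians standing in for the real claimed-speaker and impostor models:

```python
import math

def log_gaussian(x, mean, var):
    # Log density of a one-dimensional Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def likelihood_ratio_score(frames, claimed, impostor):
    """Average per-frame log-likelihood ratio:
    log p(X | claimed speaker) - log p(X | impostor model)."""
    score = 0.0
    for x in frames:
        score += log_gaussian(x, *claimed) - log_gaussian(x, *impostor)
    return score / len(frames)

claimed_model = (0.0, 1.0)    # (mean, variance) of the claimed speaker
impostor_model = (5.0, 1.0)   # background / impostor model
frames = [0.1, -0.2, 0.3]     # features close to the claimed speaker
score = likelihood_ratio_score(frames, claimed_model, impostor_model)
# score > 0: the claimed speaker explains the data better than an impostor.
```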
A normalization method based on a posteriori probability has also been proposed
by [MF94a]. The difference between the normalization method based on the
likelihood ratio and the method based on a posteriori probability is whether or not
the claimed speaker is included in the speaker set for normalization; the speaker set
used in the method based on the likelihood ratio does not include the claimed
speaker, whereas the normalization term for the method based on a
posteriori probability is calculated by using all the reference speakers, including
the claimed speaker.
Experimental results indicate that the two normalization methods are almost
equally effective [MF94a]. They both improve speaker separability and reduce the
need for speaker-dependent or text-dependent thresholding, as compared with
scoring using only a model of the claimed speaker.
A new method in which the normalization term is approximated by the likelihood
of a single mixture model representing the parameter distribution for all the
reference speakers has recently been proposed. An advantage of this method is that
the computational cost of calculating the normalization term is very small, and this
method has been confirmed to give much better results than either of the above-mentioned normalization methods [MF94a].

1.7.4: Text-Dependent Speaker Recognition Methods


Text-dependent methods are usually based on template-matching techniques. In
this approach, the input utterance is represented by a sequence of feature vectors,
generally short-term spectral feature vectors. The time axes of the input utterance
and each reference template or reference model of the registered speakers are
aligned using a dynamic time warping (DTW) algorithm and the degree of
similarity between them, accumulated from the beginning to the end of the
utterance, is calculated.
The hidden Markov model (HMM) can efficiently model statistical variation in
spectral features. Therefore, HMM-based methods were introduced as extensions
of the DTW-based methods, and have achieved significantly better recognition
accuracies [NND89].

1.7.5: Text-Independent Speaker Recognition Methods


One of the most successful text-independent recognition methods is based on
vector quantization (VQ). In this method, VQ codebooks consisting of a small
number of representative feature vectors are used as an efficient means of
characterizing speaker-specific features. A speaker-specific codebook is generated
by clustering the training feature vectors of each speaker. In the recognition stage,
an input utterance is vector-quantized using the codebook of each reference
speaker and the VQ distortion accumulated over the entire input utterance is used
to make the recognition decision.
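The VQ method can be sketched as follows; the two synthetic Gaussian "speakers" and the simple k-means training loop are assumptions for illustration:

```python
import numpy as np

def train_codebook(train_vectors, codebook_size=4, iters=10, seed=0):
    """Build a speaker-specific codebook by simple k-means clustering."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_vectors), codebook_size, replace=False)
    codebook = train_vectors[idx].astype(float)
    for _ in range(iters):
        # Assign each training vector to its nearest codeword...
        dists = np.linalg.norm(train_vectors[:, None] - codebook[None], axis=2)
        nearest = dists.argmin(axis=1)
        # ...and move each codeword to the mean of its assigned vectors.
        for k in range(codebook_size):
            if np.any(nearest == k):
                codebook[k] = train_vectors[nearest == k].mean(axis=0)
    return codebook

def vq_distortion(utterance, codebook):
    """Accumulated distance of each input frame to its nearest codeword."""
    dists = np.linalg.norm(utterance[:, None] - codebook[None], axis=2)
    return float(dists.min(axis=1).sum())

rng = np.random.default_rng(1)
speaker_a = rng.normal(0.0, 0.3, (200, 2))   # training features, speaker A
speaker_b = rng.normal(3.0, 0.3, (200, 2))   # training features, speaker B
books = {"A": train_codebook(speaker_a), "B": train_codebook(speaker_b)}
test_utt = rng.normal(0.0, 0.3, (50, 2))     # unknown utterance (really A)
identified = min(books, key=lambda s: vq_distortion(test_utt, books[s]))
```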
Temporal variation in speech signal parameters over the long term can be
represented by stochastic Markovian transitions between states. Therefore,
methods using an ergodic HMM, where all possible transitions between states are
allowed, have been proposed. Speech segments are classified into one of the broad
phonetic categories corresponding to the HMM states. After the classification,
appropriate features are selected.
In the training phase, reference templates are generated and verification thresholds
are computed for each phonetic category. In the verification phase, after the
phonetic categorization, a comparison with the reference template for each
particular category provides a verification score for that category. The final
verification score is a weighted linear combination of the scores from each
category.
This method was extended to the richer class of mixture autoregressive (AR)
HMMs. In these models, the states are described as a linear combination (mixture)
of AR sources. It can be shown that mixture models are equivalent to a larger
HMM with simple states, with additional constraints on the possible transitions
between states.
It has been shown that a continuous ergodic HMM method is far superior to a
discrete ergodic HMM method and that a continuous ergodic HMM method is as
robust as a VQ-based method when enough training data is available. However,
when little data is available, the VQ-based method is more robust than a
continuous HMM method [MF93a].
A method using statistical dynamic features has recently been proposed. In this
method, a multivariate auto-regression (MAR) model is applied to the time series
of cepstral vectors and used to characterize speakers. It was reported that
identification and verification rates were almost the same as obtained by an HMM-based method [GMF94].

1.7.6: Text-Prompted Speaker Recognition Method

In the text-prompted speaker recognition method, the recognition system prompts
each user with a new key sentence every time the system is used and accepts the
input utterance only when it decides that it was the registered speaker who repeated
the prompted sentence. The sentence can be displayed as characters or spoken by a
synthesized voice. Because the vocabulary is unlimited, prospective impostors
cannot know in advance what sentence will be requested. Not only can this method
accurately recognize speakers, but it can also reject utterances whose text differs
from the prompted text, even if it is spoken by the registered speaker. A recorded
voice can thus be correctly rejected.
This method is facilitated by using speaker-specific phoneme models as basic
acoustic units. One of the major issues in applying this method is how to properly
create these speaker-specific phoneme models from training utterances of a limited
size. The phoneme models are represented by Gaussian-mixture continuous HMMs
or tied-mixture HMMs, and they are made by adapting speaker-independent
phoneme models to each speaker's voice. In order to properly adapt the models of
phonemes that are not included in the training utterances, a new adaptation method
based on tied-mixture HMMs was recently proposed by [MF94b].
In the recognition stage, the system concatenates the phoneme models of each
registered speaker to create a sentence HMM, according to the prompted text. Then
the likelihood of the input speech matching the sentence model is calculated and
used for the speaker recognition decision. If the likelihood is high enough, the
speaker is accepted as the claimed speaker.
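The recognition stage above reduces to concatenating per-phoneme scores and applying a threshold. A minimal sketch, in which a dictionary of hypothetical per-segment log-likelihoods stands in for the actual HMM scoring:

```python
def sentence_score(prompted_phonemes, segment_scores):
    """Score of the sentence HMM formed by concatenating the claimed
    speaker's phoneme models: per-segment log-likelihoods add up."""
    return sum(segment_scores[p] for p in prompted_phonemes)

def decide(prompted_phonemes, segment_scores, threshold):
    # Accept only if the input matches the sentence model well enough.
    s = sentence_score(prompted_phonemes, segment_scores)
    return "accept" if s >= threshold else "reject"

# Hypothetical log-likelihoods of one input utterance's segments against
# the claimed speaker's adapted phoneme models:
scores = {"h": -4.0, "e": -3.5, "l": -5.0, "o": -4.5}
decision = decide(["h", "e", "l", "o"], scores, threshold=-20.0)
```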

1.7.7: Future Directions


Although many recent advances and successes in speaker recognition have been
achieved, there are still many problems for which good solutions remain to be
found. Most of these problems arise from variability, including speaker-generated
variability and variability in channel and recording conditions. It is very important
to investigate feature parameters that are stable over time, insensitive to the
variation of speaking manner, including the speaking rate and level, and robust
against variations in voice quality due to causes such as voice disguise or colds. It
is also important to develop a method to cope with the problem of distortion due to
telephone sets and channels, and background and channel noises.
From the human-interface point of view, it is important to consider how the users
should be prompted, and how recognition errors should be handled. Studies on
ways to automatically extract the speech periods of each person separately from a
dialogue involving more than two people have recently appeared as an extension of
speaker recognition technology.
This section was not intended to be a comprehensive review of speaker recognition
technology. Rather, it was intended to give an overview of recent advances and the
problems which must be solved in the future. The reader is referred to the
following papers for more general reviews:
[Fur86a,Fur89,Fur91,Fur94,O'S86,RS91].