Figure 1(a): Speaker identification

Over long periods of time (of the order of 1/5 seconds or more), the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
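Short-time spectral analysis operates on successive overlapping frames of the signal. The following is an illustrative sketch of frame blocking; the frame length of 256 samples, step of 100 samples, and the use of a Hamming window are assumed values for illustration, not parameters stated in the text:

```python
import numpy as np

def frame_signal(signal, frame_len=256, frame_step=100):
    """Split a signal into overlapping frames for short-time analysis.

    Framing continues until all the speech is accounted for within
    one or more frames; the last frame is zero-padded if needed.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    if n <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + int(np.ceil((n - frame_len) / frame_step))
    # Pad so every sample falls inside at least one frame.
    padded = np.zeros(frame_len + (num_frames - 1) * frame_step)
    padded[:n] = signal
    frames = np.stack([padded[i * frame_step : i * frame_step + frame_len]
                       for i in range(num_frames)])
    # A Hamming window is commonly applied before the FFT to reduce
    # spectral leakage at the frame edges.
    return frames * np.hamming(frame_len)
```

Each windowed frame would then be passed to the spectral analysis stage.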
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task. Mel-Frequency Cepstrum Coefficients (MFCC) are perhaps the best known and most popular, and they will be used in this project.

Figure 1(b): Speaker verification
Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance comes from the speakers themselves: speech signals recorded in training and testing sessions can differ greatly for many reasons, such as changes in a person's voice over time.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.

…continues until all the speech is accounted for within one or more frames.
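The mel scale just described is commonly approximated by an analytic formula; the constants 2595 and 700 below are a conventional curve fit and are an assumption here, not values given in the text:

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common analytic approximation to the mel scale:
    # roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of the approximation above.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With these constants, a 1 kHz tone maps to approximately 1000 mels, matching the reference point defined below.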
4. Mel-frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus each tone with an actual frequency, measured in Hz, is assigned a subjective pitch measured on the mel scale. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels.

…into the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step as S_k, k = 1, 2, …, K, we can calculate the MFCCs c_n as

    c_n = Σ_{k=1}^{K} (log S_k) cos[n (k − 1/2) π / K],   n = 1, 2, …, K.

Note that the first component is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information.
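The DCT step over the log mel power spectrum coefficients S_k can be sketched directly. The number of retained coefficients (12 here) is an illustrative choice, and the function name is ours:

```python
import numpy as np

def mfcc_from_mel_spectrum(S, n_coeffs=12):
    """Compute cepstral coefficients from mel power spectrum
    coefficients S_k (k = 1..K) via the DCT:

        c_n = sum_{k=1}^{K} log(S_k) * cos(n * (k - 1/2) * pi / K)

    The index starts at n = 1, excluding c_0, which only reflects
    the mean log energy of the input.
    """
    S = np.asarray(S, dtype=float)
    K = len(S)
    k = np.arange(1, K + 1)
    log_S = np.log(S)
    return np.array([np.sum(log_S * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_coeffs + 1)])
```

For a flat mel spectrum, every retained coefficient is (numerically) zero, since all the information sits in the excluded mean term c_0.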
Figure 13: The centroid of the entire set.

…between different speakers' codebooks, is quite capable of discriminating impostors from a true speaker because of this distinctive feature.
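The codebook training illustrated in Figures 13 to 15, and the codebook-based matching used for discrimination, can be sketched as follows. This is a minimal sketch assuming average Euclidean distortion as the matching score; the splitting parameter eps = 0.01, the iteration count, and the function names are our illustrative choices, not the paper's:

```python
import numpy as np

def lbg(data, codebook_size=16, eps=0.01, iters=10):
    """Train a vector-quantization codebook with the LBG algorithm:
    start from the centroid of the entire set (Figure 13), double the
    codebook by splitting each codeword into (1+eps)c and (1-eps)c
    (Figure 14), and refine with nearest-neighbour / centroid updates
    until the target size is reached (Figure 15)."""
    codebook = data.mean(axis=0, keepdims=True)   # centroid of the set
    while len(codebook) < codebook_size:
        codebook = np.vstack([codebook * (1 + eps),   # split each
                              codebook * (1 - eps)])  # codeword in two
        for _ in range(iters):
            # Assign every training vector to its nearest codeword.
            d = np.linalg.norm(data[:, None, :] - codebook[None, :, :],
                               axis=2)
            nearest = d.argmin(axis=1)
            # Move each codeword to the centroid of its cell.
            for j in range(len(codebook)):
                members = data[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
    return codebook

def avg_distortion(features, codebook):
    """Mean Euclidean distance from each feature vector to its
    nearest codeword in a speaker's codebook."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks):
    """Pick the speaker whose codebook yields the lowest distortion.
    `codebooks` maps speaker name -> (M, D) codeword array."""
    return min(codebooks,
               key=lambda spk: avg_distortion(features, codebooks[spk]))
```

A true speaker's features should yield low distortion against their own codebook and markedly higher distortion against others, which is what makes the method capable of rejecting impostors.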
The mel-frequency cepstra possess a significant advantage over linear-frequency cepstra. Specifically, MFCCs allow better suppression of insignificant spectral variation in the higher frequency bands. Another obvious advantage is that mel-frequency cepstrum coefficients form a particularly compact representation.
Figure 14: The centroid is split into two using the LBG algorithm.

Figure 15: Finally, a 16-vector codebook is generated using the LBG algorithm.

CONCLUSIONS & DISCUSSIONS

It is useful to examine the lack of commercial success of Automatic Speaker Recognition compared with that of speech recognition. Both speech and speaker recognition analyze speech signals to extract spectral parameters such as cepstral coefficients. Furthermore, both often employ similar template-matching methods, the same distance measures, and similar decision procedures. Speech and speaker recognition, however, have different objectives: selecting which of M words was spoken vs. which of N speakers spoke. Speech analysis techniques have primarily been developed for phonemic analysis, e.g., to preserve phonemic content during speech coding or to aid phoneme identification in speech recognition. Our understanding of how listeners exploit spectral cues to identify human sounds exceeds our knowledge of how we distinguish speakers. For text-dependent Automatic Speaker Recognition, using template-matching methods borrowed directly from speech recognition yields good results in limited tests, but performance decreases under adverse conditions that might be found in practical applications. For example, telephone distortions, uncooperative speakers, and speaker variability over time often lead to accuracy levels unacceptable for many applications.

REFERENCES

[1] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
[3] S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, August 1980.
[4] Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design", IEEE Transactions on Communications, 1980.
[5] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum", IEEE Transactions on Acoustics, Speech, and Signal Processing, February 1986.
[6] S. Furui, "An overview of speaker recognition technology", ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, 1994.
[7] F.K. Soong, A.E. Rosenberg and B.H. Juang, "A vector quantization approach to speaker recognition", AT&T Technical Journal, March 1987.
[8] D. O'Shaughnessy, "Speaker Recognition", IEEE Acoustics, Speech, and Signal Processing Magazine, October 1986.
[9] S. Furui, "A training procedure for isolated word recognition systems", IEEE Transactions on Acoustics, Speech, and Signal Processing, April 1980.