Abstract- One of the most challenging problems in speech processing is the demarcation of utterances from background noise. This problem is often referred to as the end-point alignment problem. The accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum. Consider the speech recognition technique based on template matching. The exact timing of an utterance will generally not be the same as that of the template, and the two will also have different durations. In many cases the accuracy of alignment depends on the accuracy of the endpoint detection. The acoustic properties of different sounds (phonemes) also pose a challenge of robustness, since the algorithm must be able to cater to all the acoustical variations possible in human speech. Most methods proposed over the years use Short-Time Energy (STE) based analysis involving empirical measures of the speech signal, such as the Zero Crossing Rate and Energy Thresholds, in order to estimate whether the signal region under consideration belongs to a speech utterance or to silence/background noise. In this paper we propose the implementation of utterance demarcation based on the Mahalanobis Distance Function.

Keywords: Utterance Demarcation, End-point alignment, Mahalanobis Distance Function, Normal Distribution, 3-σ rule.

I. INTRODUCTION
The process of Speech and/or Speaker Recognition essentially involves a signal preprocessing stage, tasked with efficient and robust extraction of acoustic properties from the speech signal. One of the facets of the Feature Extraction process is the demarcation of utterances from background noise. This problem is often referred to as the end-point alignment/detection problem. In conventional endpoint detection algorithms, the short-time energy or spectral energy is usually employed as the primary feature parameter, augmented with zero crossing rate, pitch and duration information. But features thus obtained become less reliable in the presence of non-stationary noise and various types of sound artifacts. In general, such algorithms involve empirical estimation of energy thresholds, and the basic premise is that the energy content in the voiced region of speech is greater than that in the silence/unvoiced region. However, the use of such ad hoc threshold values usually means that any change in the ambient noise levels degrades accuracy. Another method involves computation of the Zero-Crossing Rate of the speech signal and the estimation of a Zero-Crossing Threshold (ZCT). A combination of these two methods has been shown to yield better results than when either of these is employed individually.
In this paper we discuss the implementation of an end-point detection algorithm employing a uni-dimensional Mahalanobis Distance function governed by the 3-σ rule, as a sample clustering mechanism to determine whether particular frames of the speech signal belong to silence or to voiced regions. The organization of the paper is as follows: Section 2 contains a brief introduction to the Mahalanobis Distance Function. Section 3 describes the end-point segmentation algorithm. Section 4 presents the inferences drawn from the exercise.

II. NORMAL DISTRIBUTION AND MAHALANOBIS DISTANCE FUNCTION
For most practical purposes the background/ambient noise present in recorded speech signals may be treated as Gaussian noise. Gaussian noise is statistical noise that has a probability density function (pdf) of the normal distribution (also known as the Gaussian distribution).

A. Normal Distribution
In probability theory and statistics, the normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that often gives a good description of data that cluster around the mean. The graph of the associated probability density function is bell-shaped, with a peak at the mean, and is known as the Gaussian function or bell curve. The normal distribution is often used to describe, at least approximately, any variable that tends to cluster around the mean.
The simplest case of a normal distribution is known as the standard normal distribution, described by the probability density function

$$ \Phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^{2}} \tag{1} $$
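As a small numerical illustration of the standard normal density of equation (1) and of the 3-σ rule named in the keywords, the following sketch integrates the density with a simple trapezoidal rule. The function names, the integration limits and the step count are assumptions made for this illustration only; they are not taken from the paper.

```python
import math


def standard_normal_pdf(x: float) -> float:
    """Standard normal density, exp(-x^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def integrate(f, lo: float, hi: float, steps: int = 20000) -> float:
    """Composite trapezoidal rule (step count is an arbitrary choice)."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


# Area under the whole curve; [-10, 10] stands in for the real line.
area_total = integrate(standard_normal_pdf, -10.0, 10.0)

# Probability mass within three standard deviations of the mean,
# computed numerically and via the closed form erf(3 / sqrt(2)).
mass_within_3 = integrate(standard_normal_pdf, -3.0, 3.0)
mass_within_3_exact = math.erf(3.0 / math.sqrt(2.0))

print(f"total area      : {area_total:.6f}")
print(f"mass within 3 σ : {mass_within_3:.6f}")
```

The second figure is why a sample lying more than three standard deviations from the mean of a Gaussian noise model can reasonably be treated as not belonging to that noise.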
ERTAI-2010 Demarcation of Utterance Boundaries in Speech Signals Using Mahalanobis Distance Function as a Sample Clustering Technique
The constant $1/\sqrt{2\pi}$ in expression (1) ensures that the total area under the curve Φ(x) is equal to one, and the 1⁄2 in the exponent makes the “width” of the curve (measured as half of the distance between the inflection points of the curve) also equal to one. More generally, a normal distribution results from exponentiation of a quadratic function (just as an exponential distribution results from exponentiation of a linear function):

$$ f(x) = e^{a x^{2} + b x + c} \tag{2} $$

This yields the classic “bell curve” shape (provided that a < 0 so that the quadratic function is concave). One can adjust a to control the “width” of the bell, then adjust b to move the central peak of the bell along the x-axis, and finally adjust c to control the “height” of the bell. For f(x) to be a true probability density function over R, one must choose c such that $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (which is only possible when a < 0).
Rather than using coefficients a, b, and c, it is far more common to describe a normal distribution by its mean µ = −b/(2a) and variance σ² = −1/(2a). Changing to these new parameters allows us to rewrite the probability density function in a convenient standard form,

$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \tag{3} $$

The simplest and by far the most intuitive method of measuring distance is that of calculating the Euclidean distance between two items:

$$ d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^{2}} \tag{4} $$

where d(p,q) is the Euclidean distance between two n-dimensional variables p and q.
This distance measure has a straightforward geometric interpretation, is easy to code and is fast to calculate, but it has two basic drawbacks:
First, the Euclidean distance is extremely sensitive to the scales of the variables involved. In geometric situations, all variables are measured in the same units of length. With other data, though, this is likely not the case. In most practical cases, modeling problems deal with variables that have very different scales and are not comparable.
Second, the Euclidean distance is blind to correlated variables. Consider a hypothetical data set containing 5 variables, where one variable is an exact duplicate of one of the others. The copied variable and its twin are thus completely correlated. Yet the Euclidean distance has no means of taking into account that the copy brings no new information, and will essentially weight the copied variable more heavily in its calculations than the other variables. This is explained in the following example: consider two points A and B that are equally distant from the center µ of the distribution, as shown in Figure 1.
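The two drawbacks of the Euclidean distance can be made concrete with a short numerical sketch. The data points below (a height and weight record, and a two-variable record with one variable duplicated) are hypothetical values chosen only for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Drawback 1: sensitivity to scale.
# Hypothetical records of (height, weight); only the height unit changes.
a_m, b_m = (1.60, 60.0), (1.90, 61.0)        # height in metres
a_cm, b_cm = (160.0, 60.0), (190.0, 61.0)    # height in centimetres
d_metres = euclidean(a_m, b_m)
d_centimetres = euclidean(a_cm, b_cm)

# Drawback 2: blindness to correlated variables.
# Duplicating the first variable adds no information, yet the
# distance grows because that variable is now counted twice.
p, q = (1.0, 2.0), (4.0, 2.0)
p_dup, q_dup = (1.0, 1.0, 2.0), (4.0, 4.0, 2.0)
d_plain = euclidean(p, q)
d_dup = euclidean(p_dup, q_dup)

print(f"metres: {d_metres:.3f}, centimetres: {d_centimetres:.3f}")
print(f"without duplicate: {d_plain:.3f}, with duplicate: {d_dup:.3f}")
```

The same pair of records is far apart or close together depending only on the unit chosen for one variable, and the duplicated variable inflates the distance by a factor of √2 in this case.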
calculating for each sample s[i], and the Mahalanobis distance

$$ d = \frac{|s[i] - \mu|}{\sigma} \tag{12} $$

Fig. 5 and 6 denote the utterance demarcation achieved by using Energy and ZCR Thresholds, and by using the Mahalanobis Distance Technique, respectively.
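A minimal sketch of the sample-clustering idea is given below. It assumes that µ and σ are estimated from a leading stretch of the recording that contains only background noise, and that a sample is treated as speech when its uni-dimensional Mahalanobis distance exceeds 3 (the 3-σ rule). The function names, the `noise_len` parameter and the toy signal are hypothetical, and the sketch is not necessarily the exact procedure used in the paper.

```python
import statistics


def mahalanobis_1d(x: float, mu: float, sigma: float) -> float:
    """Uni-dimensional Mahalanobis distance |x - mu| / sigma."""
    return abs(x - mu) / sigma


def label_samples(signal, noise_len: int, threshold: float = 3.0):
    """Return True for samples judged to lie outside the noise cluster.

    Assumption: the first `noise_len` samples contain background noise
    only, so their mean and standard deviation model the noise.
    """
    noise = signal[:noise_len]
    mu = statistics.fmean(noise)
    sigma = statistics.pstdev(noise)
    if sigma == 0.0:
        raise ValueError("noise segment has zero variance")
    return [mahalanobis_1d(s, mu, sigma) > threshold for s in signal]


# Toy signal: five low-amplitude noise samples, three large-amplitude
# samples standing in for speech, then two more noise samples.
signal = [0.01, -0.02, 0.00, 0.02, -0.01, 0.90, -0.85, 0.70, 0.01, -0.01]
labels = label_samples(signal, noise_len=5)
print(labels)
```

Because the decision depends only on the noise statistics of the recording itself, no fixed energy or zero-crossing threshold has to be tuned by hand in this sketch.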
Accuracy Comparison
V. CONCLUSION
This paper described the implementation of utterance demarcation based on the Mahalanobis Distance Function in order to overcome the problem of separating utterances from background noise. This is achieved through an end-point detection algorithm employing a uni-dimensional Mahalanobis Distance function governed by the 3-σ rule, as a sample clustering mechanism to determine whether particular frames of the speech signal belong to silence or to voiced regions. Finally, a comparison of the accuracy achieved with Energy and ZCR Thresholds and with the Mahalanobis Distance technique is presented, which shows the superior performance of the Mahalanobis Distance based technique for utterance demarcation.