
Demarcation of Utterance Boundaries in Speech Signals Using Mahalanobis Distance Function as a Sample Clustering Technique

Srinivasa Rao Mutcha
Applied Artificial Intelligence Group, C-DAC, Pune, India
e-mail: srinivasam@cdac.in

Rahul Sharma
Applied Artificial Intelligence Group, C-DAC, Pune, India
e-mail: rahul.div@gmail.com

Abstract- One of the most challenging problems in speech processing is the demarcation of utterances from background noise. This problem is often referred to as the end-point alignment problem. Accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum. Consider a speech recognition technique based on template matching: the exact timing of an utterance will generally not be the same as that of the template, and the two will also have different durations. In many cases the accuracy of alignment depends on the accuracy of the endpoint detection. The acoustic properties of different sounds (phonemes) also pose the challenge of robustness, so the algorithm must be able to cater to all the acoustical variations possible in human speech. Most methods proposed over the years use Short-Time Energy (STE) based analysis involving empirical measures of the speech signal, such as the zero-crossing rate and energy thresholds, to estimate whether the signal region under consideration belongs to a speech utterance or to silence/background noise. In this paper we propose an implementation of utterance demarcation based on the Mahalanobis distance function.

Keywords: Utterance Demarcation, End-point alignment, Mahalanobis Distance Function, Normal Distribution, 3-σ rule.

I. INTRODUCTION

The process of speech and/or speaker recognition essentially involves a signal preprocessing stage, tasked with efficient and robust extraction of acoustic properties from the speech signal. One facet of the feature extraction process is the demarcation of utterances from background noise, a problem often referred to as the end-point alignment/detection problem. In conventional endpoint detection algorithms, the short-time energy or spectral energy is usually employed as the primary feature parameter, augmented with zero-crossing rate, pitch and duration information. Features thus obtained, however, become less reliable in the presence of non-stationary noise and various types of sound artifacts. In general, such algorithms involve empirical estimation of energy thresholds, and their basic premise is that the energy content in the voiced regions of speech is greater than that in silence/unvoiced regions. However, the use of such ad hoc threshold values usually means that any change in the ambient noise level degrades accuracy. Another method involves computation of the zero-crossing rate of the speech signal and the estimation of a zero-crossing threshold (ZCT). A combination of these two methods has been shown to yield better results than either of them employed individually.

In this paper we discuss the implementation of an end-point detection algorithm employing the uni-dimensional Mahalanobis distance function, governed by the 3-σ rule, as a sample clustering mechanism to determine whether particular frames of the speech signal belong to silence or to voiced regions. The organization of the paper is as follows: Section II contains a brief introduction to the Mahalanobis distance function. Section III describes the end-point segmentation algorithm. Section IV presents the inferences drawn from the exercise.

II. NORMAL DISTRIBUTION AND MAHALANOBIS DISTANCE FUNCTION

For most practical purposes the background/ambient noise present in recorded speech signals may be treated as Gaussian noise. Gaussian noise is statistical noise that has a probability density function (pdf) of the normal distribution (also known as the Gaussian distribution).

A. Normal Distribution

In probability theory and statistics, the normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that often gives a good description of data that cluster around the mean. The graph of the associated probability density function is bell-shaped, with a peak at the mean, and is known as the Gaussian function or bell curve. The normal distribution is often used to describe, at least approximately, any variable that tends to cluster around the mean.

The simplest case of a normal distribution is known as the standard normal distribution, described by the probability density function

    Φ(x) = (1/√(2π)) e^(−x²/2)    (1)

The constant 1/√(2π) in this expression ensures that the total area under the curve Φ(x) is equal to one, and the 1/2 in the exponent makes the "width" of the curve (measured as half of the distance between the inflection points of the curve) also equal to one. More generally, a normal distribution results from exponentiation of a quadratic function (just as an exponential distribution results from exponentiation of a linear function):

    f(x) = e^(ax² + bx + c)    (2)

This yields the classic "bell curve" shape (provided that a < 0, so that the quadratic function is concave). One can adjust a to control the "width" of the bell, then adjust b to move the central peak of the bell along the x-axis, and finally adjust c to control the "height" of the bell. For f(x) to be a true probability density function over R, one must choose c such that ∫ f(x) dx = 1 (which is only possible when a < 0).

Rather than using the coefficients a, b and c, it is far more common to describe a normal distribution by its mean µ = −b/(2a) and variance σ² = −1/(2a). Changing to these new parameters allows us to rewrite the probability density function in a convenient standard form,

    f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²)) = (1/σ) Φ((x − µ)/σ)    (3)

Notice that for a standard normal distribution, µ = 0 and σ² = 1. The last part of the equation above shows that any other normal distribution can be regarded as a version of the standard normal distribution that has been stretched horizontally by a factor σ and then translated rightward by a distance µ. Thus, µ specifies the position of the bell curve's central peak, and σ specifies the "width" of the bell curve.

Figure 1. "Bell Curve"
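As a quick numerical illustration of equation (3), the following minimal sketch (assuming Python with NumPy, which the paper itself does not use) evaluates the density on a grid, checks that its area is approximately one, and confirms that µ locates the peak while σ controls the spread.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density function of N(mu, sigma^2), equation (3)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

# Area under the curve is ~1 for any choice of mu and sigma.
print(np.sum(normal_pdf(x)) * dx)                      # ~1.0 (standard normal)
print(np.sum(normal_pdf(x, mu=2.0, sigma=1.5)) * dx)   # ~1.0

# The peak sits at x = mu; sigma controls how broad the bell is.
print(x[np.argmax(normal_pdf(x, mu=2.0, sigma=1.5))])  # ~2.0
```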

B. Mahalanobis Distance Function

Many data mining and pattern clustering tasks involve calculating abstract "distances" between items or collections of items. Some modeling algorithms, such as k-nearest neighbors or radial basis function neural networks, make direct use of multivariate distances. The simplest and by far the most intuitive method is that of calculating the Euclidean distance between two items,

    d(p, q) = √( ∑ᵢ (pᵢ − qᵢ)² )    (4)

where d(p, q) is the Euclidean distance between two n-dimensional variables p and q. This distance measure has a straightforward geometric interpretation, is easy to code and is fast to calculate, but it has two basic drawbacks.

First, the Euclidean distance is extremely sensitive to the scales of the variables involved. In geometric situations, all variables are measured in the same units of length. With other data, though, this is likely not the case; most practical modeling problems deal with variables that have very different, non-comparable scales.

Second, the Euclidean distance is blind to correlated variables. Consider a hypothetical data set containing five variables, where one variable is an exact duplicate of one of the others. The copied variable and its twin are thus completely correlated, yet the Euclidean distance has no means of taking into account that the copy brings no new information, and will essentially weight the copied variable more heavily in its calculations than the other variables. This is explained in the following example: consider two points A and B that are equally distant from the center µ of the distribution, as shown in Figure 2.

Figure 2. Euclidean Distance Measure

It nevertheless seems inappropriate to say that they occupy "equivalent" positions with respect to the center, since A is in a low-density region while B is in a high-density region. So, in a situation like this one, the usual Euclidean distance

    d²(A, µ) = ∑ᵢ (aᵢ − µᵢ)²    (5)

does not seem to be the right tool for measuring the "distance" of a point to the center of the distribution. We could instead consider as "equally distant from the mean" two points with the same probability density: this would make them equally probable when drawing observations from the distribution. Suppose now that the distribution is multi-normal. Because of the functional form of the multivariate normal distribution, these two points would lead to the same value of the quantity:

    D² = (x − µ)′ ∑⁻¹ (x − µ)    (6)

where ∑ is the covariance matrix of the distribution. D is called the Mahalanobis distance of the point x to the mean µ of the distribution. The Mahalanobis distance takes the covariance among the variables into account when calculating distances; with this measure, the problems of scale and correlation inherent in the Euclidean distance are no longer an issue. For the case of a uni-dimensional variable, the Mahalanobis distance reduces to

    D = |x − µ| / σ    (7)

In the following section we discuss the use of the Mahalanobis distance function as the basis of end-point detection/alignment of semi-continuous speech signals.
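To make the contrast with the Euclidean distance concrete, here is a minimal sketch (assuming Python with NumPy; the data and the points A and B are hypothetical, not taken from the paper) that computes equations (4), (6) and (7) on strongly correlated two-dimensional data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data with very different scales and strong correlation,
# illustrating the two drawbacks of the Euclidean distance noted above.
x1 = rng.normal(0.0, 1.0, 1000)                    # scale ~ 1
x2 = 100.0 * x1 + rng.normal(0.0, 5.0, 1000)       # scale ~ 100, correlated with x1
data = np.column_stack([x1, x2])

mu = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

def euclidean(p, q):
    """Equation (4): ordinary Euclidean distance."""
    return np.sqrt(np.sum((p - q) ** 2))

def mahalanobis(x, mu, cov_inv):
    """Equation (6): D = sqrt((x - mu)' Sigma^-1 (x - mu))."""
    diff = x - mu
    return np.sqrt(diff @ cov_inv @ diff)

def mahalanobis_1d(x, mu, sigma):
    """Equation (7): uni-dimensional case, D = |x - mu| / sigma."""
    return abs(x - mu) / sigma

# A follows the correlation pattern of the data, B violates it, yet both are
# equally far from the mean in the Euclidean sense.
a = mu + np.array([1.0, 100.0])
b = mu + np.array([1.0, -100.0])
print(euclidean(a, mu), euclidean(b, mu))                        # identical
print(mahalanobis(a, mu, cov_inv), mahalanobis(b, mu, cov_inv))  # B is far larger
```

Under these assumptions the Euclidean distances of A and B from µ coincide, while their Mahalanobis distances differ greatly, which is exactly the behaviour equation (6) is designed to capture.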
III. END-POINT ALIGNMENT USING MAHALANOBIS DISTANCE & 3-σ RULE

Assuming that the ambient noise present in the captured signal is Gaussian in nature, the uni-dimensional Mahalanobis distance function can be used to extract the voiced part of the signal. The algorithm is divided into two parts: background noise calibration, and sample clustering based on the 3-σ decision rule.

A. Calibration of Background Noise

The initial step is the calibration of the ambient noise parameters, namely the mean and standard deviation of the noise samples. First, a 200~300 msec window is chosen to estimate the parameters of the Gaussian distribution; this duration is chosen on the premise that the speaker will take more than 200~300 msec to start speaking after recording begins. For a speech signal recorded at a sampling frequency of 16 kHz this amounts to 3200 samples worth of data. If µ_sil and σ_sil are the mean and the standard deviation, respectively, of the initial silence, then

    µ_sil = (1/N) ∑ᵢ s[i]    (8)

    σ_sil = √( (1/N) ∑ᵢ (s[i] − µ_sil)² )    (9)

where s[i] is the instantaneous sample value and N = 3200 samples. These values, viz. µ_sil and σ_sil, characterize the background noise.
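A minimal sketch of this calibration step (assuming Python with NumPy and a 1-D array of 16 kHz samples; the function name is illustrative, not from the paper):

```python
import numpy as np

def calibrate_background(signal, fs=16000, window_ms=200):
    """Estimate the background-noise parameters of equations (8) and (9)
    from the leading silence of the recording.  At fs = 16 kHz a 200 ms
    window corresponds to N = 3200 samples."""
    n = int(fs * window_ms / 1000)
    silence = signal[:n]
    mu_sil = np.mean(silence)       # equation (8)
    sigma_sil = np.std(silence)     # equation (9)
    return mu_sil, sigma_sil
```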
B. Sample Clustering Based on the 3-σ Decision Rule

The central limit theorem states that the distribution of a sum of many independent, identically distributed random variables tends towards the famous "bell-shaped" normal distribution, with a pdf of

    f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))    (10)

where µ is the arithmetic mean of the sample. The standard deviation is therefore simply a scaling variable that adjusts how broad the curve will be, though it also appears in the normalizing constant that keeps the distribution normalized for different widths. In statistics, the three-sigma rule (also called the empirical rule or the 68-95-99.7 rule) states that for a normal distribution nearly all values lie within 3 standard deviations of the mean: about 68% of the values lie within one standard deviation of the mean (mathematically, µ ± σ, where µ is the arithmetic mean), about 95% within two standard deviations (µ ± 2σ), and about 99.7% within three standard deviations (µ ± 3σ).

Figure 3. Normal Distribution

Having defined the uni-dimensional Mahalanobis distance of a sample s[i] from the calibrated background noise as

    d = |s[i] − µ_sil| / σ_sil    (11)

it is evident from the aforementioned 3-σ rule that, for samples belonging to the background noise, there is a probability of 99.7% that the distance d will be less than 3.
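A quick numerical check of this claim (a minimal sketch with synthetic Gaussian noise, assuming Python with NumPy; no data from the paper is used):

```python
import numpy as np

rng = np.random.default_rng(1)
noise = rng.normal(0.0, 0.01, 200_000)     # synthetic Gaussian background noise

mu_sil = np.mean(noise[:3200])             # calibration, equations (8) and (9)
sigma_sil = np.std(noise[:3200])
d = np.abs(noise - mu_sil) / sigma_sil     # equation (11) for every sample

print(np.mean(d < 3))   # ~0.997: pure noise rarely crosses the 3-sigma boundary
```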

We may now proceed with sample clustering as follows. The entire speech recording is divided into non-overlapping frames of 25 msec. This is done on the premise that all speech signals are quasi-stationary in nature, meaning that when a speech signal is examined over a sufficiently short period of time (between 10 and 100 msec) its characteristics are fairly stationary. For each frame thus created, beginning from the first sample and proceeding to the last sample of the speech recording, calculate for each sample s[i] the Mahalanobis distance

    d = |s[i] − µ_sil| / σ_sil    (12)

If d > 3, the sample is said to belong to the voiced segment of the signal; otherwise it is treated as silence/noise. The process is repeated for each frame, and the last frame, if incomplete, is zero-padded. An intuitive annotation method is to mark each silence/noise sample thus obtained as S and each voiced sample as V. In place of the original speech signal we then have an annotated stream of samples marked S or V. Given that the frames are non-overlapping, it is safe to assume that a frame in which the number of V-marked samples is greater than the number of S-marked samples belongs, in its entirety, to the voiced segment; the converse holds for frames with a majority of S-marked samples. All frames in which the V-marked samples outnumber the S-marked samples are retained. Such frames constitute the voiced/speech segments of the recording, and by retrieving the start and end indices of these frames we obtain the end-points of each voiced segment of speech. The aforementioned method is fairly intuitive and easy to implement.
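The following is a minimal end-to-end sketch of the procedure just described (assuming Python with NumPy, a 1-D array of 16 kHz samples, and the calibrate_background helper sketched in Section III-A; function and parameter names are illustrative, not part of the paper):

```python
import numpy as np

def detect_endpoints(signal, fs=16000, frame_ms=25, threshold=3.0):
    """Frame-wise voiced/silence clustering: a sample is marked V when its
    uni-dimensional Mahalanobis distance to the calibrated background noise
    exceeds 3 (equation (12) and the 3-sigma rule); a frame is kept as voiced
    when its V-marked samples outnumber its S-marked samples."""
    mu_sil, sigma_sil = calibrate_background(signal, fs)   # equations (8), (9)

    frame_len = int(fs * frame_ms / 1000)                  # 400 samples at 16 kHz
    pad = (-len(signal)) % frame_len                       # zero-pad the last frame
    frames = np.pad(signal, (0, pad)).reshape(-1, frame_len)

    d = np.abs(frames - mu_sil) / sigma_sil                # equation (12), per sample
    voiced = (d > threshold).sum(axis=1) > frame_len // 2  # majority vote per frame

    # Convert runs of voiced frames into (start, end) sample indices.
    endpoints, start = [], None
    for i, is_voiced in enumerate(voiced):
        if is_voiced and start is None:
            start = i * frame_len
        elif not is_voiced and start is not None:
            endpoints.append((start, i * frame_len - 1))
            start = None
    if start is not None:
        endpoints.append((start, len(signal) - 1))
    return endpoints
```

Calling detect_endpoints(signal) on a 16 kHz mono recording then returns, under these assumptions, the (start, end) sample indices of each voiced segment.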
IV. EXPERIMENT RESULTS

In this section we discuss the performance of two different approaches to the problem of utterance demarcation in speech. First we implemented the algorithm described in [2], which makes use of STE-based energy and zero-crossing-rate thresholds. The system thus implemented was tested for accuracy using speech samples of a single male speaker recorded in two different speaking styles. One set of recordings contained semi-continuous speech, in which the words of each recording were separated temporally by on the order of 50-60 milliseconds; the other set contained recordings of the same text spoken in a continuous style. Figure 4 shows the waveform of the utterance "I work in CDAC" in semi-continuous mode.

Figure 4. Recording of "I work in CDAC"
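For reference, the two per-frame measures on which the STE-based baseline of [2] relies can be computed as follows (a simplified sketch assuming Python with NumPy; it does not reproduce the full threshold-setting logic of [2]):

```python
import numpy as np

def short_time_energy_and_zcr(signal, fs=16000, frame_ms=25):
    """Per-frame short-time energy and zero-crossing rate, the raw measures
    used by STE-based end-point detection.  Threshold selection, which [2]
    derives from the initial silence, is not reproduced here."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len).astype(float)

    energy = np.sum(frames ** 2, axis=1)                       # short-time energy
    sign_changes = np.abs(np.diff(np.sign(frames), axis=1)) > 0
    zcr = sign_changes.mean(axis=1)                            # zero-crossing rate
    return energy, zcr
```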

Figures 5 and 6 show the utterance demarcation achieved using the energy and ZCR thresholds, and using the Mahalanobis distance technique, respectively.

Figure 5. Segmentation boundaries given by the STE-based method.

Figure 6. Segmentation boundaries given by the Mahalanobis distance measure based method.

It should be noted that Fig. 6 depicts more regions than Fig. 5, owing to the fact that the Mahalanobis distance technique aims at separating the silence regions and the utterance regions into distinct clusters. The experiment data consisted of 10 recordings for each speaking style. Table I gives the accuracy obtained with each of the aforementioned techniques. A further observation of interest is that the choice of frame length was found to have a significant impact on the performance of the Mahalanobis distance based technique.

TABLE I. EXPERIMENT RESULTS: ACCURACY COMPARISON

Speaking Rate       STE       Mahalanobis
Semi-continuous     75.4 %    87.2 %
Continuous          37.2 %    44.5 %


V. CONCLUSION

This paper described an implementation of utterance demarcation based on the Mahalanobis distance function, addressing the problem of separating utterances from background noise. This is achieved through an end-point detection algorithm that employs the uni-dimensional Mahalanobis distance function, governed by the 3-σ rule, as a sample clustering mechanism to determine whether particular frames of the speech signal belong to silence or to voiced regions. Finally, a comparison of the accuracy achieved in both cases was presented, showing the superior performance of the Mahalanobis distance measure based technique for utterance demarcation.

REFERENCES
[1] A. Keerio, B. K. Mitra, P. Birch, R. Young, and C. Chatwin, "On Preprocessing of Speech Signals," International Journal of Signal Processing, vol. 5, no. 3, pp. 216-222, February 2009.
[2] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances," Bell System Technical Journal, vol. 54, pp. 297-315, February 1975.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[4] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, 1st ed. Prentice-Hall.
[5] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley and Sons, 2001.
[6] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints for Isolated Utterances," The Bell System Technical Journal, vol. 54, no. 2.
