
An Improved Speech Feature Extraction Algorithm Using DWT

Xiang Wu, Feng Tian, Jingao Liu


Department of Electronic Science & Technology, East China Normal University, 200241, Email: knightwx@gmail.com

Abstract

This paper proposes a novel approach for speech feature extraction. The proposed noise-robust feature extraction algorithm is found to be robust to different noise characteristics, SNRs, and various matched or unmatched training-testing conditions. The algorithm this paper presents also exhibits a low computational cost. We further give a practical weighting technique. The experiment compares the traditional MFCC algorithm with our methods; the results show an overall improvement with the application of the proposed wavelet-based denoising algorithm with a modified thresholding procedure.

1. Introduction

In pattern recognition, speech recognition is a very important aspect. Particularly, with computers widely used, it appears even more significant, as in voice-driven service portals, speech interfaces in automotive navigation and guidance systems, or speech-driven applications in modern offices [1]. MFCCs (Mel-Frequency Cepstral Coefficients) [2] are important feature parameters for speech recognition, but the performance of a speech recognition system drops dramatically when there is a mismatch between training and testing conditions. Many different algorithms have been studied to decrease the effect of noise on recognition performance [3]; [4, 5, 6] give several useful speech enhancement techniques.

Wavelet analysis is a rapidly developing new application in mathematics and has become a popular tool in many research areas. The WT decomposes data into a sparse, multiscale representation. The wavelet transform, with its flexible time-frequency resolution, is therefore an appropriate tool for the analysis of signals consisting of short high-frequency bursts and long quasi-stationary segments [7, 8]. This paper uses the WT in speech feature extraction, trying to resolve the primary problem of acoustical mismatch causing the degradation of automatic speech recognition performance. Thanks to multiresolution analysis, we can obtain the lowest possible computational cost in real time together with good recognition performance.

The remainder of this paper is organized as follows. In Section 2, the wavelet transform in speech recognition is briefly reviewed. In Section 3, we describe how the features are extracted; the algorithms for all processing steps and the weighting are introduced in detail. The experiments and the comparison of the orthodox approach with our algorithm are given in Section 4. Finally, we draw a conclusion regarding our approach and suggest some areas of further research in Section 5.

2. WT processing for speech signal

Feature extraction should execute after voice activity detection; amplification, A/D conversion, and denoising are needed. VAD based on higher-order statistics is presented in [9]. The discrete-time implementation of the wavelet transform is based on the iteration of a two-channel filter bank in which each level is followed by a decimation-by-two unit [10]. In contrast, the basis vectors of the DCT have approximately the same resolution in time and frequency, since same-length windows are used in calculating the cepstral coefficients. The DWT is especially suitable for noise robustness issues. Wavelet coefficient sequence thresholding combines simplicity and efficiency and is therefore an extensively investigated noise reduction method. Fig. 1 illustrates the block diagram of the proposed algorithm.

Fig.1. Block diagram of proposed noise robust feature extraction algorithm based on DWT

978-1-4244-1724-7/08/$25.00 ©2008 IEEE 1086 ICALIP2008

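The iterated two-channel filter bank with decimation by two, combined with coefficient thresholding, can be sketched as below. This is a hedged illustration using the Haar filter pair; the paper does not commit to a specific wavelet, and the signal and threshold values here are illustrative only.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_analysis(x):
    """One level of the two-channel filter bank: low-pass and high-pass
    filtering, each branch followed by a decimation-by-two unit."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / SQRT2   # low-pass branch (c coefficients)
    detail = (x[0::2] - x[1::2]) / SQRT2   # high-pass branch (d coefficients)
    return approx, detail

def haar_synthesis(approx, detail):
    """Inverse of the analysis step: upsample by two and merge the branches."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / SQRT2
    x[1::2] = (approx - detail) / SQRT2
    return x

def soft_threshold(d, thr):
    """Soft thresholding: shrink each coefficient toward zero by thr."""
    return np.sign(d) * np.maximum(np.abs(d) - thr, 0.0)

# Denoising sketch: decompose, shrink the detail coefficients, rebuild.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(256) / 32)
noisy = clean + 0.1 * rng.standard_normal(256)
a, d = haar_analysis(noisy)
denoised = haar_synthesis(a, soft_threshold(d, 0.1))
```

Without thresholding, analysis followed by synthesis reconstructs the input exactly, which is the perfect-reconstruction property the filter-bank iteration relies on.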

The DWT algorithm is given by Mallat [11]. Given a speech signal function $f(x)$, the wavelet basis can be built as:

$$\varphi_{jk}(x) = 2^{-j/2}\,\varphi(2^{-j}x - k) \qquad (1)$$

$$\psi_{jk}(x) = 2^{-j/2}\,\psi(2^{-j}x - k) \qquad (2)$$

where $\{\varphi_{jk}(x)\}$ and $\{\psi_{jk}(x)\}$ are orthogonal function sets. If $P_0 f = f$, then at level $j$, $P_{j-1}f$ can be decomposed by the one-dimensional DWT into the orthogonal projections $P_j f$ and $Q_j f$ as:

$$P_{j-1}f = P_j f + Q_j f = \sum_k c_k^j \varphi_{jk} + \sum_k d_k^j \psi_{jk} \qquad (3)$$

where:

$$c_k^j = \sum_{n=0}^{p-1} h(n)\, c_{2k+n}^{j-1}, \qquad d_k^j = \sum_{n=0}^{p-1} g(n)\, c_{2k+n}^{j-1} \qquad (j = 1, 2, \ldots, L;\; k = 0, 1, \ldots, N/2^j - 1) \qquad (4)$$

Here $\{h(n)\}$ and $\{g(n)\}$ are the low-pass and high-pass weight coefficients, respectively. They are determined by the wavelet bases $\{\varphi_{jk}(x)\}$ and $\{\psi_{jk}(x)\}$; $p$ denotes the filter length, $\{c_n^0\}$ are the input data, $N$ denotes the signal length, and $L$ denotes the number of levels. Hence the 1D DWT at each level is similar to a 1D convolution except for the sampling distance.

After the operation, a 2D matrix over $j$ and $k$ is obtained, where $j = 1$ denotes the finest approximation level and the speech spectrum from high to low frequency is represented by $j$ running from low to high; $k$ denotes the position in the time domain. Since the range of $k$ differs at each level, the feature parameters have a pyramid structure, which yields a lower computational cost. The spreads of several levels are shown in Fig. 2.

Fig.2. Speech signal and each level wavelet decomposition

In fact, Vaseghi has suggested multi-resolution features to overcome the DCT problem in the MFCC operation in [12]. According to the mel-frequency characteristic of human hearing, the DWT should be based on mel-frequency scaled bands. The parameter matrix is obtained by applying the DWT to the mel-scaled log-filterbank energies of a speech frame. The WT uses short basis functions to measure the high-frequency content of the signal and long basis functions to measure its low-frequency content.

3. Feature extracting and matching

The following requirements on feature parameters need to be considered with the aim of improving speech recognition performance:
1) The parameters can effectively represent speech features.
2) Each processing step is independent.
3) Low computational cost is needed for feature extraction and matching; effective feature post-processing algorithms should be applied to transform feature vectors into a lower-dimensional space, to decorrelate the elements of the feature vectors, and to enhance the classification process.

Mel-frequency warping uses the following approximate formula to compute the mels for a given frequency $f$ in Hz:

$$f_{\text{mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (5)$$

MFCCs [13], derived on the basis of the short-time Fourier transform and power spectrum estimation, have been used predominantly to date as the fundamental speech features in state-of-the-art speech recognition systems. The STFT-based approach has, due to its constant analysis window length, a fixed time-frequency resolution and is therefore not optimized to analyze the non-stationary parts of a speech signal with uniform precision. The conventional algorithm uses many filters in the frequency domain to extract each spectral feature: it simply divides the frequency band into subbands and extracts features for each subband, so the speech signal is split into short time blocks and each block is transformed. Since computation in both the time and frequency domains is involved, the WT is clearly more appropriate. The localization property of the transform can be explained by the feature parameter matrix given in (6).

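The mel warping of Eq. (5) and the recursion of Eq. (4) together define the feature pipeline: mel-scaled log-filterbank energies are computed and then decomposed level by level. A minimal sketch follows, assuming the Haar filter pair for $\{h(n)\}, \{g(n)\}$ and a dyadic band count of 32 so that five levels fit; the paper itself uses 24 triangular mel filters, and the energies here are stand-in values.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (5): mel-frequency warping of a frequency f in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def dwt_step(c):
    """One application of Eq. (4) with the Haar pair (filter length p = 2):
    c^j_k = sum_n h(n) c^{j-1}_{2k+n},  d^j_k = sum_n g(n) c^{j-1}_{2k+n}."""
    h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass {h(n)}
    g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass {g(n)}
    pairs = np.asarray(c, dtype=float).reshape(-1, 2)  # rows (c_2k, c_2k+1)
    return pairs @ h, pairs @ g

def dwt_pyramid(c0, L):
    """Iterate Eq. (4) L times and collect the rows d^1 ... d^L, c^L,
    i.e. the pyramid-shaped parameter structure of the matrix in (6)."""
    rows, c = [], np.asarray(c0, dtype=float)
    for _ in range(L):
        c, d = dwt_step(c)
        rows.append(d)
    rows.append(c)
    return rows

# Hypothetical usage: warp band edges per Eq. (5), then decompose a
# vector of mel-scaled log-filterbank energies over 5 levels.
edges_hz = np.linspace(0.0, 4000.0, 33)       # assumed band edges in Hz
edges_mel = hz_to_mel(edges_hz)               # warped per Eq. (5)
log_energies = np.log(1.0 + np.arange(32.0))  # stand-in log energies
features = dwt_pyramid(log_energies, 5)
```

Each successive row is half the length of the previous one, which is exactly the pyramid structure noted above and the reason for the reduced computational cost.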
$$\begin{bmatrix}
d_0^0 & d_1^0 & \cdots & d_{N/2}^0 & \cdots & d_N^0 \\
d_0^1 & d_1^1 & \cdots & d_{N/2}^1 & & \\
\vdots & & \ddots & & & \\
d_1^L & \cdots & d_{N/2^L-1}^L & & & \\
c_1^L & \cdots & c_{N/2^L-1}^L & & &
\end{bmatrix} \qquad (6)$$

Conventional feature extraction methods use the entire frequency band to extract speech features for speech recognition. The human speech recognition system, however, seems to utilize partial recognition information across frequencies; it is local in frequency. According to previous studies, voiced phonemes have their maximum energy mainly at the lower frequencies of the speech spectrum; therefore, it is sufficient to reconstruct the signal from the denoised subbands in the range 0-2 kHz only [14]. Here we design the WT decomposition mode according to the human inner ear, which determines how much energy is contained at the different frequencies that make up a specific sound scene and when these energies occur in time [15]. As (5) shows, the usual processing in MFCC calculation is to apply log compression to the mel-filterbank outputs as well as to the full frame energy. We likewise weight the contribution of each coefficient to the total score based on a log function. The weighting coefficients are shown in Fig. 3.

Fig.3. Log compression weighting coefficients

Parameter matching is the same as in the conventional algorithm; we use DTW (Dynamic Time Warping) [16]. Because noise affects each coefficient differently, when one frequency band is corrupted by noise, only the matching of a few coefficients is affected if the speech features represent local information, as in our method. Even if the entire frequency band is corrupted, the noise level can differ for each frequency interval in real applications, so the various coefficients are affected differently. This significantly improves the performance compared to the STFT.

4. Experiment

In this section, we describe the recognition experiments and their performance. Connected-digit recognition experiments were performed using the Aurora 2 [17] and Aurora 3 [18] databases, which were designed to evaluate the performance of automatic speech algorithms under different noisy and acoustical-mismatch conditions. The training and test files include the 10 digits, spoken 10 times each with a duration of approximately 1 s; background noises include speech and car. The same test and training files were used for all experiments.

The speech signal was sampled at 8 kHz and analyzed with 24 ms Hamming windows stepped by 8 ms. For the computation of the mel-scaled log filterbank energies, 24 triangular mel-scaled band-pass filters were designed and implemented. With the proposed method, the log mel-filterbank outputs were replaced with wavelet packet outputs (level 5). The feature parameter comparison is shown in Fig. 4.

Fig.4. Feature parameters comparison

The BER versus the SNR in different noise backgrounds for MFCCs and our feature parameters is shown in Fig. 5.

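The DTW matching described in Section 3 can be sketched with the classic dynamic-programming recursion. This is a minimal version; practical recognizers typically add path constraints and slope weighting, which are omitted here.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two feature sequences
    (one row per frame); the local cost is the Euclidean frame distance."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)  # cumulative-cost table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allowed predecessors: insertion, deletion, match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the local cost is computed per frame vector, a noise-corrupted subband perturbs only the coordinates it touches, which is why local features limit the damage to the matching score.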
Fig. 5. BER performance of speech recognition by different feature parameters

The significant performance differences between the feature parameters experimentally demonstrate that the use of local features in the frequency domain improves the performance for noisy speech. The weighting algorithm also significantly improves the performance. These results suggest that weighting the contribution of each coefficient to the recognition score based on the noise level improves the performance.

5. Conclusion and further research

A new method for a noise-robust speech parameterization procedure based on the DWT has been presented. This algorithm gives a compact and accurate representation of speech signals in the time-frequency plane and allows efficient feature vector post-processing. Compared to conventional MFCCs, the denoising algorithm using time-frequency adaptive threshold estimation and a modified soft-thresholding procedure, together with the proposed voice activity detection algorithm, is more efficient under highly discriminant speech-background noise conditions. Finally, from the experimental results it has been concluded that the proposed denoising algorithm introduces a lower level of additional signal distortion, thanks to the proposed time-frequency noise-level adaptation, and is therefore more appropriate for effective noise reduction of noisy speech signals.

6. Acknowledgment

This work was supported in part by the key project of the Shanghai Science and Technology Committee under Grant 075115002. The authors wish to thank Prof. Liu Jin-gao, whose comments and suggestions have largely contributed to improving this paper.

7. References

[1] J.-C. Junqua, J.-P. Haton, Robustness in Automatic Speech Recognition, Kluwer, Norwell, MA, USA, 1996.

[2] B. Nasersharif, A. Akbari, SNR-dependent compression of enhanced Mel sub-band energies for compensation of noise effects on MFCC features, Pattern Recognition Letters, vol. 28, no. 11, pp. 1320-1326, Aug. 2007.

[3] Y. Gong, Speech recognition in noisy environments: a survey, Speech Comm. 16, 261-291, 1995.

[4] H.M. Cung, Y. Normandin, Noise adaptation algorithms for robust speech recognition, in: Proc. ESCA Workshop on Speech Processing in Adverse Conditions, pp. 171-174, 1992.

[5] D. Mansour, B.H. Juang, The short-time modified coherence representation and noisy speech recognition, IEEE Trans. Acoust. Speech Signal Process. 37, 795-804, 1989.

[6] M.J.F. Gales, S.J. Young, Robust continuous speech recognition using parallel model compensation, IEEE Trans. Acoust. Speech Signal Process. 4 (5), 352-359, 1996.

[7] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, USA, 1997.

[8] G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, USA, 1997.

[9] E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech Audio Process. 9 (2001) 217-231.
[10] B. Kotnik, Z. Kacic, B. Horvat, Noise robust speech parameterization based on joint wavelet packet decomposition and autoregressive modeling, in: Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.

[11] D.F. Mix, Wavelets for Engineers, Wiley-Interscience, 2006, pp. 234-305.

[12] S. Vaseghi, N. Harte, B. Milner, Multi-resolution phonetic/segmental features and models for HMM-based speech recognition, in: Proc. ICASSP, pp. 1263-1266, 1997.

[13] M. Bahoura, J. Rouat, Wavelet speech enhancement based on the Teager energy operator, IEEE Signal Process. Lett. 8 (1) (2001).

[14] R. Sarikaya, B.L. Pellom, J.H.L. Hansen, Wavelet packet transform features with application to speaker identification, in: Proceedings of the IEEE Nordic Signal Processing Symposium, Vigsø, Denmark, June 1998, pp. 81-84.

[15] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, USA, 1997.

[16] C. Furlanello, S. Merler, G. Jurman, Combining feature selection and DTW for time-varying functional genomics, IEEE Transactions on Signal Processing, vol. 54, no. 6, part 2, pp. 2436-2443, June 2006.

[17] H.-G. Hirsch, D. Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in: Proceedings of the ISCA ITRW ASR 2000, Paris, France, Sept. 2000, pp. 181-188.

[18] AU/225/00, AU/271/00, AU/273/00, AU/378/00, Finnish, Spanish, German, Danish databases for ETSI STQ Aurora WI008 advanced DSR front-end evaluation: description and baseline results, 2000.
