\begin{equation}
\begin{bmatrix}
d_0^0 & d_1^0 & \cdots & d_{N/2}^0 & \cdots & d_N^0 \\
d_0^1 & d_1^1 & \cdots & d_{N/2}^1 & & \\
\vdots & & \ddots & & & \\
d_1^L & \cdots & d_{N/2^L-1}^L & & & \\
c_1^L & \cdots & c_{N/2^L-1}^L & & &
\end{bmatrix}
\tag{6}
\end{equation}

Conventional feature extraction methods use the entire frequency band to extract speech features for speech recognition. The human speech recognition system, by contrast, appears to exploit partial recognition information across frequencies; it is local in frequency. According to previous studies, voiced phonemes concentrate their maximum energy at the lower frequencies of the speech spectrum, so it was sufficient to reconstruct the signal from the denoised subbands in the range 0-2 kHz only [14]. Here we design the WT decomposition to depend on the human inner ear, which determines how much energy is contained at the different frequencies that make up a specific sound scene and when these energies occur in time [15]. As (5) shows, the usual processing in MFCC calculation applies Log compression to the mel filter-bank outputs as well as to the full-frame energy. We likewise weight the contribution of the coefficients to the total score with a Log function. The weighting coefficients are shown in Fig. 3.

4. Experiment

In this section, we describe the recognition experiments and their performance. Connected-digit recognition experiments were performed using the Aurora 2 [17] and Aurora 3 [18] databases, which were designed to evaluate the performance of automatic speech recognition algorithms under different noisy and acoustically mismatched conditions. The training and test files include the 10 digits, spoken 10 times each, with a duration of approximately 1 s; background noises include speech and car noise. The same test and training files were used for all experiments.

The speech signal was sampled at 8 kHz and analyzed with 24 ms Hamming windows stepped by 8 ms. For the computation of mel-scaled log filter-bank energies, 24 triangular mel-scaled band-pass filters were designed and implemented. With the proposed method, the log mel filter-bank outputs were replaced with wavelet packet outputs (level 5). The feature parameter comparison is shown in Fig. 4.
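A front end of the kind described here (8 kHz sampling, 24 ms Hamming windows stepped by 8 ms, log subband energies from a level-5 wavelet packet decomposition in place of the log mel filter-bank outputs) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a Haar filter pair and a full binary decomposition tree, whereas the paper's wavelet basis and its perceptually motivated tree layout are not specified in this excerpt; all function names are illustrative.

```python
import math

FS = 8000                       # sampling rate (Hz), as in the experiments
FRAME = int(0.024 * FS)         # 24 ms window -> 192 samples
STEP = int(0.008 * FS)          # 8 ms hop -> 64 samples

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal):
    """Slice the signal into overlapping Hamming-windowed frames."""
    win = hamming(FRAME)
    for start in range(0, len(signal) - FRAME + 1, STEP):
        yield [s * w for s, w in zip(signal[start:start + FRAME], win)]

def haar_split(x):
    """One orthonormal Haar analysis step: low-pass (a) and high-pass (d) halves."""
    h = 1 / math.sqrt(2)
    a = [(x[2 * i] + x[2 * i + 1]) * h for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) * h for i in range(len(x) // 2)]
    return a, d

def wavelet_packet(x, level):
    """Full wavelet packet tree: split every node at every level."""
    nodes = [x]
    for _ in range(level):
        nxt = []
        for node in nodes:
            a, d = haar_split(node)
            nxt.extend([a, d])
        nodes = nxt
    return nodes                # 2**level subbands, ordered low to high

def log_energies(frame, level=5, floor=1e-10):
    """Log-compressed subband energies, analogous to log filter-bank outputs."""
    return [math.log(sum(c * c for c in band) + floor)
            for band in wavelet_packet(frame, level)]
```

With a 192-sample frame, five levels yield 32 subbands of 6 coefficients each, so every frame maps to a 32-dimensional log-energy vector. Because the Haar pair is orthonormal, each split preserves the frame's total energy, which is a convenient sanity check on the decomposition.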
Grant 075115002. The authors wish to thank Prof. Liu Jin-gao, whose comments and suggestions have largely contributed to improving this paper.
7. References
[10] B. Kotnik, Z. Kacic, B. Horvat, Noise robust speech parameterization based on joint wavelet packet decomposition and autoregressive modeling, in: Proceedings of Eurospeech 2003, Geneva, Switzerland, 2003.

[11] D.F. Mix, Wavelets for Engineers, Wiley-Interscience, 2006, pp. 234-305.

[12] S. Vaseghi, N. Harte, B. Milner, Multi-resolution phonetic/segmental features and models for HMM-based speech recognition, in: Proc. ICASSP, 1997, pp. 1263-1266.

[13] M. Bahoura, J. Rouat, Wavelet speech enhancement based on the Teager energy operator, IEEE Signal Processing Letters 8 (1) (2001).

[14] R. Sarikaya, B.L. Pellom, J.H.L. Hansen, Wavelet packet transform features with application to speaker identification, in: Proceedings of the IEEE Nordic Signal Processing Symposium, Vigsø, Denmark, June 1998, pp. 81-84.

[15] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, USA, 1997.

[16] C. Furlanello, S. Merler, G. Jurman, Combining feature selection and DTW for time-varying functional genomics, IEEE Transactions on Signal Processing 54 (6, Part 2) (2006) 2436-2443.

[17] H.-G. Hirsch, D. Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, in: Proceedings of the ISCA ITRW ASR 2000, Paris, France, Sept. 2000, pp. 181-188.

[18] AU/225/00, AU/271/00, AU/273/00, AU/378/00, Finnish, Spanish, German, Danish databases for ETSI STQ Aurora WI008 advanced DSR front-end evaluation: description and baseline results, 2000.