
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011, P. 775

Robust Speech Dereverberation Based on Blind Adaptive Estimation of Acoustic Channels


Mohammad Ariful Haque, Toufiqul Islam, and Md. Kamrul Hasan
Abstract: This paper addresses the problem of speech dereverberation in a noisy and slowly time-varying environment. The proposed multimicrophone speech dereverberation model utilizes the estimated acoustic impulse responses (AIRs) to dereverberate the speech as well as to improve the signal-to-noise ratio (SNR), without a priori information about the AIRs, the locations of the source and microphones, or the statistical properties of the speech/noise, which are common assumptions in the related literature. The received noisy signals are filtered through an eigenfilter that raises the power of the speech signal relative to that of the additive noise. The eigenfilter is computed efficiently, solely from the estimates of the AIRs, avoiding the tedious Cholesky decomposition. The design of the eigenfilter also incorporates a frequency-domain constraint that improves the quality of the speech signal and resists spectral nulls, in addition to improving the SNR. A zero-forcing equalizer (ZFE) is used to dereverberate the speech signal by eliminating the distortion caused by the AIRs as well as the eigenfilter. The ZFE is implemented in block-adaptive form, which makes the proposed technique suitable for speech dereverberation under time-varying conditions. The simulation results verify the superior performance of the proposed method as compared to state-of-the-art dereverberation techniques in terms of the log-likelihood ratio (LLR), segmental SNR (segSNR), weighted spectral slope (WSS), and perceptual evaluation of speech quality (PESQ).

Index Terms: Additive noise, constrained optimization, speech dereverberation, time-varying channels, zero-forcing equalization (ZFE).

I. INTRODUCTION

Dereverberation is an acoustic signal processing technique that aims to extract the original speech signal from one or more observations of the reverberant signal. Efficient dereverberation can enhance the performance of many speech communication systems such as hands-free telephony, voice-controlled systems, hearing aids, and cochlear implants

Manuscript received August 25, 2009; revised December 26, 2009, April 12, 2010, May 04, 2010; accepted July 17, 2010. Date of publication August 09, 2010; date of current version February 14, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sharon Gannot. M. A. Haque and T. Islam are with the Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh (e-mail: arifulhoque@eee.buet.ac.bd; touq56@eee.buet.ac.bd). Md. K. Hasan is with the Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh, and also with the Department of Biomedical Engineering, Kyung Hee University, Kyungki 446-701, Korea (corresponding author e-mail: khasan@eee.buet.ac.bd). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2010.2064306

[1]. Dereverberation of speech in a practical acoustic scenario is a difficult task. The reasons are as follows: 1) dereverberation is a blind problem, i.e., the source signal is unknown to the receiver and hence no training pulses can be sent to estimate the long AIRs, typically consisting of a few thousand coefficients; 2) the degree of complexity of the dereverberation problem increases manyfold in a noisy environment, as the channel vector does not remain collinear with the minimum mean squared error solution [2], [3]; moreover, the frequency responses of the AIRs have nulls that cause significant noise amplification during the channel equalization process; 3) Radlovic et al. demonstrated that even a small movement of the speaker, on the order of a tenth of the acoustic wavelength, can cause significant changes in the AIRs [4]. Although the literature is rich in a variety of dereverberation techniques, only a few of them consider noise and time-varying channels simultaneously. The existing dereverberation techniques can be divided into two major groups based on whether or not they equalize the AIRs. Some common techniques that do not equalize the AIRs but practically reduce the effect of reverberation are beamforming [5], linear prediction residual processing [6], and spectral enhancement [7]. These methods, however, always show suboptimal performance, as they cannot eradicate the cause of reverberation. Therefore, from a theoretical perspective, it is desirable that the AIRs be equalized using a suitable inverse filter. The AIRs can be equalized in two different ways: first, using an inverse filter directly obtained from the received microphone signals, e.g., the reduced mutually referenced equalizer (RMRE) [8], the correlation-based multichannel inverse filter with spectral subtraction (ISS) [9], and the harmonicity-based dereverberation (HERB) [10] methods.
However, the RMRE filter is very sensitive to additive noise and channel-order mismatch, and is therefore impractical for real situations. The HERB method requires around 60 minutes of speech data to acquire the inverse filter under the assumption that the system is time-invariant. The ISS method takes the time-varying situation into account but designs the inverse filter under a noise-free assumption. Dereverberation in a noisy environment has been proposed in [11] based on linear predictive multi-input equalization (LIME). The method is, however, vulnerable to large channel lengths, sensitive to speech duration, and not suitable for incoherent noise. The second approach to equalizing the AIRs is based on the blind estimation of the acoustic channels. In this approach, the channels are first obtained from the noisy received signals, and then a proper inverse filter is designed utilizing the estimates. The pioneering work in blind channel equalization utilizing the AIRs was done by Miyoshi and Kaneda in 1988 [12]. They

1558-7916/$26.00 © 2010 IEEE


proposed an elegant method commonly known as the multiple-input/output inverse theorem (MINT) for multichannel inversion of room acoustics. The MINT method is effective for speech dereverberation provided that the AIRs are known in advance; even a moderate channel estimation error causes significant spectral distortion in the dereverberated speech. Single-channel inversion of the AIRs is not as sensitive as the MINT method; however, it causes severe noise amplification near the frequencies close to the zeros of the channel. Moreover, the AIR is a non-minimum-phase system, which does not have a causal and stable inverse filter. The channel shortening technique, which is aptly used in communication, can be an alternative approach for speech dereverberation [13]. The algorithm is based on infinity-norm optimization that attempts to maximize the channel coefficient with the highest absolute value in the early portion; at the same time, it minimizes the channel coefficient with the highest absolute value in the late reflections. However, the method does not consider channel estimation error, is computationally intensive, and exhibits a very slow convergence rate. In this paper, we present a multimicrophone dereverberation model that relaxes the time-invariant channel and noise-free assumptions to suit a practical noisy environment. The technique does not require a priori information about the AIRs or the positions of the source and microphones. The AIRs are estimated using the robust normalized multichannel frequency-domain LMS (NMCFLMS) algorithm proposed in our previous work [14]. Here, we consider the complete dereverberation problem utilizing the adaptive estimates of the AIRs. The proposed model, therefore, includes channel estimation, signal-to-noise ratio (SNR) improvement, and channel equalization stages. The multimicrophone received signals are first filtered through a bank of eigenfilters that enhance the power of the speech signal as compared to that of the additive noise.
Unlike conventional techniques, the eigenfilters are designed using the AIRs, which enables us to relax the wide-sense-stationarity assumption for the speech signal. Moreover, an efficient eigensolver technique is proposed that reduces the cost of computation by avoiding the Cholesky decomposition. However, the equivalent channel, which results from the convolution of the AIRs and the eigenfilters, becomes extremely narrowband if the eigenfilters are designed considering SNR maximization only. As a result, the perceptual quality of the speech falls even though the SNR is increased. Moreover, the equivalent channel contains spectral nulls that amplify noise during the equalization process. Therefore, a frequency-domain constraint is incorporated in the design of the eigenfilters to annihilate these nulls by enforcing spectral flatness. The constrained optimization thus reduces speech distortion and mitigates the noise amplification problem during the equalization stage. Dereverberation is finally achieved using a zero-forcing equalizer (ZFE) that eliminates the speech distortion caused by the AIRs and eigenfilters in the preceding stages. In this work, the ZFE is implemented in block-adaptive mode, so that different blocks of data can be equalized with the corresponding estimates of the AIRs; it thus becomes suitable for time-varying conditions. Moreover, the block ZFE is effective for processing a speech signal that does not have a fixed duration. Extensive experiments are conducted, using

both simulated and real reverberant acoustic channels, which demonstrate that the combination of the proposed methods can offer better speech quality and SNR improvement as compared to the state-of-the-art dereverberation techniques.

The notation used throughout the paper distinguishes time-domain scalars, frequency-domain scalars, time-domain vectors, frequency-domain vectors, and estimates (denoted by a hat). The rest of the paper is organized as follows. The problem scenario is presented in Section II. Section III presents the proposed dereverberation technique in detail; this section comprises three subsections covering blind channel estimation, eigenfilter-based signal power enhancement, and block-adaptive zero-forcing equalization to produce the dereverberated signal. The simulation results are given in Section IV. The paper concludes with some remarks in Section V.

II. PROBLEM FORMULATION

Consider a speech signal recorded inside an echoic room using a linear array of M microphones. The received signals at the microphones can be modeled as convolutional mixtures of the speech signal and the impulse responses of the acoustic paths between the source and the microphones. The channel outputs and observed signals are then given by

x_k(n) = sum_{l=0}^{L-1} h_k(l) s(n - l) (1)

y_k(n) = x_k(n) + v_k(n), k = 1, 2, ..., M (2)

where n represents the time-domain sample index, M is the number of microphones, and s(n), x_k(n), y_k(n), v_k(n), and h_k(l) denote, respectively, the clean speech, the reverberant speech, the reverberant speech corrupted by background noise, the observation noise, and the impulse response from the source to the kth microphone. Using vector notation, (1) can be written as

x_k(n) = h_k^T s(n) (3)

where h_k = [h_k(0), h_k(1), ..., h_k(L-1)]^T represents the channel impulse response vector of length L, the superscript T denotes transpose, and s(n) = [s(n), s(n-1), ..., s(n-L+1)]^T. Our objective is to obtain a dereverberated speech signal from the received noisy signals y_k(n), n = 0, 1, ..., N-1, where N denotes the data length.

III. SPEECH DEREVERBERATION IN NOISE

As we have stated in the Introduction, the proposed dereverberation model, which is depicted in Fig. 1, is composed of three blocks: blind channel estimation, SNR improvement, and zero-forcing equalization. In this section, we present these blocks in detail.

A. Blind Channel Estimation

Estimation of long AIRs in a practical noisy environment is a difficult task. Therefore, channel estimation-based speech dereverberation models for noisy environments are not common in the literature. However, the robust multichannel frequency-domain LMS (MCFLMS) algorithm [14] proposed in our previous work


block error signal obtained using linear convolution. Now, the circulant matrix in (7) can be decomposed as

C(m) = F^{-1} D(m) F (8)

where F is the discrete Fourier transform (DFT) matrix and D(m) is a diagonal matrix whose diagonal elements are given by the DFT of the first column of C(m). Substituting (8) into (7), the block error sequence in the frequency domain can be expressed as

(9)

where an underline denotes a frequency-domain quantity.

Fig. 1. Block diagram model of the proposed dereverberation technique.

can estimate the AIRs in noise with reasonable accuracy. The algorithm exploits the cross-relation (CR) between different observations, x_i * h_j = x_j * h_i, that are driven by the same input. Based on this CR, an error function may be defined as

e_ij(n) = y_i(n) * ĥ_j - y_j(n) * ĥ_i, i, j = 1, ..., M, i ≠ j (4)

where * denotes linear convolution. Now, a block of the error signal can be determined as

(5)

where m is the block time index. Here, the corresponding data block is defined as

Now, a frequency-domain cost function can be obtained as

J(m) = sum_{i=1}^{M-1} sum_{j=i+1}^{M} e_ij^H(m) e_ij(m) (10)

where the superscript H denotes the Hermitian transpose. The MCFLMS-type adaptive algorithms estimate the AIRs by minimizing the cost function in (10). The update equation of the MCFLMS algorithm is given by

ĥ_k(m+1) = ĥ_k(m) - μ ∇J(m) (11)
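The cross-relation identity underlying (4) is easy to verify numerically. The following sketch (hypothetical 16-tap channels, noise-free; the names s, h1, h2, and cr_error are illustrative, not from the paper) shows that the CR error essentially vanishes for the true channel pair but not for a wrong one.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(4000)                      # unknown source signal
h1 = rng.standard_normal(16)                       # channel to microphone 1
h2 = rng.standard_normal(16)                       # channel to microphone 2
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)    # noise-free observations

def cr_error(x1, x2, g1, g2):
    # cross-relation: x1 * g2 == x2 * g1 when (g1, g2) are the true channels
    e = np.convolve(x1, g2) - np.convolve(x2, g1)
    return float(np.mean(e ** 2))

true_err = cr_error(x1, x2, h1, h2)    # ~0, since s*h1*h2 == s*h2*h1
wrong_err = cr_error(x1, x2, h2, h1)   # large: channels swapped
```

In the noisy case, this same error, accumulated over blocks as in (5)-(10), is the quantity that the adaptive algorithm minimizes.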

(6)

where ĥ_i(m) is the estimate of the ith channel at the mth block. In order to avoid the computationally intensive time-domain convolutions needed to calculate the block error, a frequency-domain filtering technique is preferred. The block error signal e_ij(m) in (5) can be rewritten using circular convolution as

(7)

where C(m) is a circulant matrix whose first column is the corresponding data block, and

where

(12)

(13)

It is to be mentioned here that the estimate is normalized after each iteration to avoid the all-zero trivial estimate. In (11), μ is called the step-size parameter, and the gradient vector can be obtained as [15]

(14)

where the superscript * denotes the complex conjugate and

Here, I is the identity matrix and 0 is a null matrix. It is worthwhile to mention that this windowing matrix is used to discard the first part of the circular convolution results and retain only the last part, so that the right-hand side of (7) becomes identical to the

The MCFLMS-type algorithms give a good initial estimate of the channels, followed by a rapid divergence from this better estimate in the presence of additive noise. This misconvergence


TABLE I SPECTRALLY CONSTRAINED NORMALIZED MULTICHANNEL FREQUENCY-DOMAIN LMS ALGORITHM

where the associated quantities are given by

(20)

(21)

(22)

Concatenating the M impulse response vectors into a single longer one, we can write the update equation as

(23)

where the concatenated quantities are obtained by stacking the individual channel terms similarly to (12) and (13), respectively, and the normalization term is a diagonal matrix with the corresponding diagonal entries in sequential order. Now, the update equation for the robust NMCFLMS algorithm can be written as

is associated with nonuniform spectral attenuation of the estimated AIRs [2]. Therefore, a penalty function that ensures spectral flatness can be defined as [14]

maximize (15)

(24)

where

(25)

(26)

and the maximization of (15) is subject to

(16)

where (16) is ensured by the unit-norm constraint imposed on the update equation. The above penalty function is maximized when the estimated channel coefficients have a uniform magnitude spectrum in the frequency domain. Therefore, the requirement of spectral flatness can be attached as a constraint to the original cost function via the penalty term using a Lagrange multiplier. The adaptive update rule for this constrained minimization with fixed step size can be readily obtained as

(17)

where

(27)

Here, a constant parameter is required for proper coupling between the cost function and the penalty term. In this paper, the AIRs are estimated using (24) for speech dereverberation. The implementation of our blind channel estimation algorithm is shown in Table I.

B. Signal Power Enhancement

If the ZFE is cascaded just after the channel estimation stage, the SNR deteriorates at the output of the equalizer due to severe noise amplification near the spectral nulls. Therefore, an SNR improvement scheme is essential before the channel equalization stage. To this end, we propose a modified eigenfilter in this section. The modification is made in two ways. First, the conventional design of eigenfilters is computationally expensive; therefore, an efficient eigensolver technique is proposed that reduces the cost of computation by avoiding the Cholesky decomposition. Second, the frequency spectrum of the conventional eigenfilter is extremely narrowband; as a result, the speech signal is severely distorted at the output of the eigenfilters. To overcome this limitation, as well as to remove the spectral nulls, a frequency-domain constraint is attached to the eigenfilters that improves the quality of the dereverberated speech. It is reported in the literature that an eigenfilter that maximizes the SNR can be obtained from the eigenvector of the data correlation matrix corresponding to the largest eigenvalue [16]. However, speech dereverberation is a blind problem and we do

(18)

where the matrix in (18) is diagonal and || · || denotes the norm. Similarly, the spectral constraint can also be attached to the update equation of the normalized MCFLMS (NMCFLMS) algorithm in order to improve noise robustness. The update equation of the original NMCFLMS algorithm can be expressed as [15]

(19)


Fig. 2. Block diagram of the signal path and noise path for the kth channel.

not have access to the speech signal. One may use the correlation matrix obtained from the received microphone signals under the assumption that the desired signal (here, the speech signal) is a wide-sense stationary random process; however, the speech signal is highly nonstationary, and this assumption does not hold at all. Therefore, the eigenfilter (the eigenvector corresponding to the largest eigenvalue) estimated from the data correlation matrix is not a proper choice. We propose an improved eigenfilter technique, utilizing the estimates of the AIRs, that enhances the energy in the signal path as compared to that of the noise path. Let g_k of length L_g represent the eigenfilter in the kth channel. Now, we can consider the separate signal and noise paths as shown in Fig. 2, where the speech and noise components appear separately at the output of the filter g_k. If we design g_k such that the energy of the signal path is maximized with respect to that of the noise path, the SNR will increase at the output of the eigenfilter. Let h_eq of length L + L_g - 1 represent the equivalent channel impulse response at the output of the eigenfilter block. Then, we can write

h_eq = H g (28)

where g = [g_1^T, ..., g_M^T]^T is the composite eigenfilter and H = [H_1, ..., H_M] is the composite convolution matrix, H_k being the convolution matrix of h_k. Now, the desired objective function can be written as

J(g) = (g^T H^T H g) / (g^T g) (29)

The optimal method for maximizing the signal-path energy finds g so as to maximize J(g) while satisfying the unit-norm constraint. We see, therefore, that this problem may be viewed as an eigenvalue problem, and the optimum FIR filter that maximizes J(g) can be obtained as the eigenvector associated with the largest eigenvalue of H^T H (30). We can easily estimate H^T H from the estimates of the AIRs. Since the AIRs may vary with time, a fixed g obtained from the AIR estimates at a particular instant would not work for the entire speech waveform. Therefore, we need to update the matrix H^T H with a new set of estimates at regular intervals. Moreover, a sharp change in the AIRs usually gives an abrupt rise in the cost function, and this fluctuation may also be used for updating the matrix H^T H with a new set of estimates whenever the AIRs change.
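As a concrete illustration, the unconstrained problem of (28)-(30) can be solved with a plain normalized power iteration on the matrix built from the estimated AIRs, avoiding any explicit Cholesky factorization. This is only a sketch of the unconstrained case, not the paper's constrained eigensolver: the composite matrix is assumed to be the horizontal stack of per-channel convolution matrices, and the names conv_matrix and eigenfilter are illustrative.

```python
import numpy as np

def conv_matrix(h, Lg):
    # (len(h)+Lg-1) x Lg Toeplitz matrix: conv_matrix(h, Lg) @ g == np.convolve(h, g)
    H = np.zeros((len(h) + Lg - 1, Lg))
    for j in range(Lg):
        H[j:j + len(h), j] = h
    return H

def eigenfilter(h_hats, Lg, iters=2000):
    # composite convolution matrix: h_eq = H @ g with g = [g_1; ...; g_M]
    H = np.hstack([conv_matrix(h, Lg) for h in h_hats])
    R = H.T @ H                                    # signal-path energy matrix
    g = np.ones(R.shape[0]) / np.sqrt(R.shape[0])
    for _ in range(iters):                         # power iteration
        g = R @ g
        g /= np.linalg.norm(g)                     # unit-norm constraint
    return g                                       # principal eigenvector of R
```

The constrained eigensolver of (31)-(36) adds a spectral-flatness penalty to each iteration; the plain version above only maximizes the Rayleigh quotient in (29).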

An iterative algorithm is proposed here for finding the eigenfilter, which gives a number of advantages over the one-pass solution. First, we can avoid the computationally intensive Cholesky decomposition [17], which may become unstable for large matrix dimensions. Second, the optimum eigenfilter is extremely narrowband in the frequency domain, causing speech distortion in the output. Moreover, spectral nulls are present in the equivalent channels, which cause significant noise amplification at the output of the equalization process. Therefore, we enforce spectral flatness in the eigenfilter, in addition to the signal-path energy maximization, by incorporating a spectral constraint. The objective function for the iterative solution can be written as

(31)

where the hat denotes the quantity built from the estimated AIRs. We propose an efficient eigensolver for finding the eigenvector corresponding to the maximum eigenvalue. The update equation at each iteration can be expressed as

(32)

where a step size is used and tr(·) represents the trace of a matrix. The proof of (32) is provided in the Appendix, which shows that the algorithm converges in the mean to the eigenvector corresponding to the largest eigenvalue. In order to enforce spectral flatness of the eigenfilter in the frequency domain, we formulate a penalty function similar to (15), which is

maximize (33)

where the terms are the elements of the discrete Fourier transform (DFT) of the equivalent channel estimate. Maximizing the penalty function tries to make each and every frequency component of the equivalent channel uniform. Thus, it resists spectral nulls in the equivalent channel impulse response. Now, the gradient of the penalty function is obtained as

where the required derivative can be expressed as [16]

(34)

Therefore, we can write the gradient as

(35)

where the matrix in (35) is diagonal. Finally, the update rule for the proposed spectrally constrained eigensolver algorithm is given by

(36)


TABLE II FREQUENCY-DOMAIN CONSTRAINED EIGENSOLVER ALGORITHM

where the tuning parameter in (36) controls the frequency-domain spectral constraint; the lower its value, the stronger the spectral constraint. The implementation of our signal power enhancement scheme is shown in Table II.

C. Zero-Forcing Equalization

Dereverberation of speech requires blind equalization of the AIRs. Among the various linear equalization techniques proposed in the literature, the ZFE and the minimum mean-square error (MMSE) equalizer are the most common [18]. Although the MMSE technique is more noise robust than the ZFE, the advantage is obtained at the expense of computational complexity, as the minimum of the error function must be searched over a wide range of delays. The ZFE is computationally efficient and gives direct equalization of the AIRs in the frequency domain. However, the ZFE can lead to considerable noise amplification, which makes it unsuitable for practical applications. In the proposed dereverberation technique, the eigenfilter along with the frequency-domain constraint can effectively compensate for such SNR degradation by providing adequate signal power enhancement in the previous stage. Moreover, the ZFE is implemented in block-adaptive mode for the reasons stated after (37). We can obtain an estimate of the equivalent channel from the speaker to the output of the eigenfilter using the estimates of the AIRs and the impulse response of the eigenfilter. Let H_eq denote the equivalent channel vector in the frequency domain. Then the lth frequency component of the required ZFE can be expressed as

G(l) = 1 / (H_eq(l) + δ) (37)
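A minimal sketch of the regularized inverse in (37): the equivalent-channel spectrum is inverted with a small δ added in the denominator, as the text describes, and applied by frequency-domain filtering. The FFT size, the value of δ, and the function name zf_equalize are illustrative choices, not the paper's implementation.

```python
import numpy as np

def zf_equalize(y, h_eq, delta=1e-6, nfft=4096):
    # (37): G(l) = 1 / (H_eq(l) + delta); delta guards against spectral nulls
    G = 1.0 / (np.fft.rfft(h_eq, nfft) + delta)
    return np.fft.irfft(np.fft.rfft(y, nfft) * G, nfft)

# toy check: a well-conditioned 2-tap channel is inverted almost exactly
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
y = np.convolve(s, [1.0, 0.4])
s_hat = zf_equalize(y, np.array([1.0, 0.4]))[:1000]
```

For a channel with deep nulls, a larger δ trades residual reverberation against noise amplification, which is exactly the trade-off discussed in the text.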

and hence the FIR approximation of the inverse filter becomes of very high order. In other words, the noncausal part of the inverse filter is prohibitively large for causal implementation; for example, the length of the noncausal part of a typical inverse filter requires a delay of around 31.25 s for causal implementation. The second approach is the block-adaptive ZFE in the frequency domain. Although a block delay is unavoidably introduced in this case, such a delay is smaller than that of the causal implementation in the time domain; for example, with the block size used in this paper, the block delay is 4.95 s. Here, we propose a block ZFE utilizing the overlap-save method [19]. The time-domain signal at the output of the eigenfilters, as shown in Fig. 1, can be expressed as

(38)

where the two terms are the filtered speech and noise components. The power of the signal term in (38) is significantly enhanced compared to that of the noise term, due to the eigenfiltering in the previous stage. Therefore, the noise term can be ignored in the derivation of the ZFE for simplification. Now, we formulate a suitable transformation that converts a block of data at the output of the eigenfilter into a direct product of the source signal vector and the equivalent channel vector. It then becomes easy to dereverberate that block of data by canceling out the effect of the equivalent channel using its estimate. Let a vector (of a length that is an integer multiple of the block length) represent the result of the circular convolution of the source signal vector and the equivalent channel:

(39)

where m represents the block-time index

and

is a circulant matrix with the first column as

where H_eq(l) is the lth frequency component of the equivalent channel and δ is a small positive number added in the denominator to avoid division by zero. The more nulls the channel has, the higher the value of δ that should be used. It is to be mentioned here that the ZFE is identical to the Wiener filter for channel equalization in the MMSE sense. However, this simple ZFE is not implementable in practice. The main reason is that the DFT size should be at least the sum of the lengths of the signal vector and the channel impulse response vector minus one, but the length of the speech signal is usually undefined. It is not practically possible to store the entire speech waveform and then perform zero-forcing equalization. Moreover, the AIRs are slowly time-varying; we cannot assume the same AIRs for the entire speech signal. The above-mentioned problems can be resolved in two different ways. The first is a time-domain filter obtained from the IDFT of the ZFE frequency response. However, the zeros of the AIRs are very close to the unit circle in the z-plane

We can find from inspection that the last points of the circular convolution are identical to the linear convolution between the source block and the equivalent channel, which can be represented as

(40) where

and


TABLE III BLOCK ADAPTIVE ZERO FORCING EQUALIZER ALGORITHM

effect can be neglected in the derivation. Now, the block-adaptive ZFE that can compensate for the equivalent channel in each block can easily be obtained as

The circulant matrix in (40) can be decomposed as

(41)

where the diagonal matrix in (41) has elements obtained from the DFT coefficients of the first column of the circulant matrix. Substituting (41) into (40) and taking the DFT of both sides, we obtain the frequency-domain block vector as

(42)

where

(46)

where the denominator quantity is the lth component of the equivalent channel vector. An estimate of the equivalent channel can be obtained from the estimated AIRs and the eigenfilter impulse response. Therefore, the source signal block can be extracted from

(47)

Now, we can obtain the dereverberated speech block at the output of the equalizer, corresponding to each element of the block, as

(48)

by discarding the first samples of the inverse transform. Finally, the entire speech signal can be reconstructed by concatenating the successive dereverberated blocks as

Multiplying both sides of (42) with

, we get

(43)

The successive blocks are concatenated as in (49), and the implementation of the ZFE is shown in Table III.

IV. SIMULATION RESULTS

In this section, we present simulation results to evaluate the performance of the proposed multimicrophone speech dereverberation technique (consisting of the channel estimation, SNR enhancement, and zero-forcing equalization blocks) for both simulated and real acoustic channels under time-invariant and time-varying conditions with noise. The simulated channels were generated using the well-known image model [20], and the real reverberant channels were obtained from the multichannel acoustic reverberation database at York (MARDY) [21]. We also present comparative dereverberation performance for the correlation-based multichannel inversion with spectral subtraction (ISS) [9], multichannel linear prediction (MLP) [11], and infinity-norm minimization [13] algorithms. For the speech input, a number of female and male utterances, sampled at 8 kHz, were used. The objective measures used to evaluate the quality of speech are the log-likelihood ratio (LLR), average segmental SNR (segSNR), weighted spectral slope (WSS), and perceptual evaluation of speech quality (PESQ) [22]. The higher the values of segSNR and PESQ, and the lower the values of LLR and WSS, the better the speech quality.
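Among these measures, segSNR has the simplest closed form. A common frame-based definition, with the usual clamping of per-frame values to [-10, 35] dB (the frame length and clamp limits here are conventional choices, not taken from the paper), can be sketched as:

```python
import numpy as np

def seg_snr(clean, processed, frame=256, eps=1e-12):
    # frame-wise SNR between the clean reference and the processed signal
    n = (min(len(clean), len(processed)) // frame) * frame
    c = clean[:n].reshape(-1, frame)
    d = c - processed[:n].reshape(-1, frame)              # per-frame error
    snr = 10.0 * np.log10((c ** 2).sum(1) / ((d ** 2).sum(1) + eps) + eps)
    return float(np.clip(snr, -10.0, 35.0).mean())        # clamp, then average
```

The clamping keeps silent frames (very low SNR) and perfect frames (unbounded SNR) from dominating the average.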

where the matrices are as defined above. For acoustic channels, the channel length is usually very large, and hence the windowing-related matrix can be approximated by the identity matrix as [15]

(44)

Substituting (44) into (42) gives

(45)

Now, the right-hand side of (45) is the product of the source data matrix and the equivalent channel vector in the frequency domain. The remaining term is simply a scalar quantity. Therefore, its


Fig. 3. Convergence profile of the robust NMCFLMS algorithm for the identification of M = 5 acoustic channels with L = 4400 coefficients using speech input at SNR = 25 dB.

A. Simulated Acoustic Channels

The dimensions of the room were taken to be m. A linear array consisting of microphones with uniform separation of m was used in the experiment. The first microphone was positioned at (1.0, 1.5, 1.6) m, and the locations of the other microphones can be obtained by successively adding the separation to the coordinate of the first microphone. The initial position of the speaker was fixed at (2.0, 1.2, 1.6) m. The wall reflection coefficients were 0.9 for all walls, the ceiling, and the floor. The length of each impulse response in samples and the corresponding reverberation time were considered fixed. The additive noise was white zero-mean Gaussian. The same parameter settings were used in all cases; the eigenfilter length was determined using the stated formula, and the block length for the ZFE gives a block delay of 4.95 s.

In Fig. 3, we present channel estimation results of the robust NMCFLMS algorithm at 25-dB SNR with speech input. The accuracy of channel estimation is indicated by the normalized projection misalignment (NPM) [23]; the lower the value of the NPM, the better the channel estimation. The NMCFLMS algorithm without the spectral constraint shows good initial convergence, as revealed by the lower NPM values in the early iterations, but following this apparent convergence, the NPM starts to increase until complete misconvergence. On the contrary, the spectrally constrained algorithm converges to the desired solution with almost no sacrifice in the speed of convergence. The accuracy of the final estimate is, however, dictated by the noise level of the observation data. Fig. 4(a) depicts the original channel and Fig. 4(b) the estimated channel using the robust NMCFLMS algorithm at 20-dB SNR. The NPM between the original and estimated channels is -7.8 dB. Direct inversion using the MINT method fails to equalize the AIRs with such an estimate. Fig. 4(c) shows the IDFT sequence of the equalized channel at the output of the ZFE using the proposed method.
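The NPM of [23] measures channel-estimate accuracy after projecting out the inherent gain ambiguity of blind identification; a direct sketch (the function name is illustrative):

```python
import numpy as np

def npm_db(h, h_hat):
    # remove the best scalar fit of h_hat to h, measure the residual in dB
    e = h - ((h @ h_hat) / (h_hat @ h_hat)) * h_hat
    return float(20.0 * np.log10(np.linalg.norm(e) / np.linalg.norm(h)))
```

Because of the projection, npm_db is invariant to any nonzero scaling of the estimate, matching the scale ambiguity of blind channel identification.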
We see that the equalized channel is nearly impulse-like, and both the early and late reflections are significantly attenuated. As a result, we can say that dereverberation of the speech signal is achieved at the output of the ZFE. The effectiveness of the proposed spectral constraint in the eigenfilter is demonstrated in Fig. 5. It is observed in Fig. 5(a)
Fig. 4. Channel equalization at SNR = 20 dB using the proposed dereverberation method. (a) Original. (b) Estimated. (c) Equalized.

Fig. 5. Frequency spectrum of the equivalent channel from the speaker to the output of the beamformer: (a) without spectral constraint; (b) with spectral constraint (tuning parameter 0.5).

that, without the constraint, the channel frequency response is severely attenuated in most of the frequency components. On the contrary, the proposed spectral constraint can successfully resist such spectral attenuation, as shown in Fig. 5(b), and thus improves the perceptual quality of the dereverberated speech. Table IV shows the effect of the proposed eigenfilter, with and without the constraint, on the quality of the dereverberated speech. The results were obtained by averaging the performance over eight different male and female utterances. For case 1, the SNR of the dereverberated speech deteriorates, although the perceptual quality of the speech improves, as the received noisy signals are not enhanced by the eigenfilter before zero-forcing equalization. For case 2, the eigenfilter stage is used, and hence there is a substantial improvement in the output SNR; however, the quality of the speech deteriorates, which is evident from the LLR, segSNR, and WSS measures, due to the narrowband nature of the eigenfilter. In case 3, the constrained optimization of the eigenfilter, which enforces spectral flatness in the equalized channel, improves the LLR, segSNR, WSS, and PESQ of the dereverberated speech while maintaining a good output SNR. The dereverberation performance of the proposed algorithm is demonstrated through spectrograms in Fig. 6. It is observed


Fig. 6. Spectrogram of the (a) clean speech, (b) noisy reverberated speech at 30-dB SNR, (c) denoised speech, and (d) dereverberated speech using the proposed method.

TABLE IV EFFECT OF THE EIGENFILTER STAGE ON THE DEREVERBERATION PERFORMANCE

TABLE V PERFORMANCE OF THE PROPOSED ALGORITHM AT DIFFERENT SNRS

Fig. 7. Quality of the dereverberated speech at different block-lengths of the proposed zero-forcing equalization.

that the clean speech has a vivid harmonic structure in the short-time Fourier transform (STFT) domain [Fig. 6(a)]. The spectrogram of the noisy (SNR = 30 dB) reverberant speech depicted in Fig. 6(b) shows smearing of the frequency spectrum due to reverberation. The presence of noise can be recognized from the yellowish surface of the figure (see the PDF file). The output of the eigenfilter, shown in Fig. 6(c), confirms the denoising effect by the appearance of the bluish background (see the PDF file) that represents low energy. Finally, the spectrogram of the dereverberated speech using the proposed method is shown in Fig. 6(d). As can be seen, the smearing is lessened and the harmonic structure has reappeared. Fig. 7 shows the effect of the zero-forcing equalizer block-length on the dereverberation performance of our algorithm. The results reveal that the proposed technique is only slightly dependent on this parameter, giving almost identical performance when the block-length is varied.
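The core operation behind a zero-forcing-style equalizer of the kind discussed above can be illustrated as a least-squares inverse filter that drives the equalized channel toward a delayed unit impulse. This is only a sketch under simplifying assumptions (one channel, a known impulse response, and a hypothetical filter length and delay); the paper's ZFE is multichannel and block-adaptive:

```python
import numpy as np
from scipy.linalg import toeplitz

def ls_inverse_filter(h, Lg, delay):
    """Least-squares inverse filter g of length Lg for impulse response h,
    such that (h * g) approximates a unit impulse at the given delay."""
    Lh = len(h)
    # Convolution matrix H of size (Lh+Lg-1) x Lg, so that H @ g = h * g.
    col = np.r_[h, np.zeros(Lg - 1)]
    row = np.r_[h[0], np.zeros(Lg - 1)]
    H = toeplitz(col, row)
    d = np.zeros(Lh + Lg - 1)
    d[delay] = 1.0                      # target: delayed impulse
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    return g
```

For a minimum-phase toy channel such as `[1, 0.5]`, the equalized response `np.convolve(h, g)` is very close to an impulse; for non-minimum-phase AIRs a nonzero delay (and, in the multichannel case, joint inversion) is needed.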

The performance of the proposed algorithm at different SNRs is presented in Table V. The input and output SNRs are estimated from the ratio of the energy of the speech component to that of the noise component in the received microphone signal and the dereverberated signal, respectively. The NPM of the estimated channels at input SNRs of 25, 20, 15, and 10 dB is -8.23, -7.76, -7.60, and -5.33 dB, respectively. It is observed that the proposed eigenfilter can improve the overall SNR of the speech signal despite some noise amplification by the ZFE. The LLR, segSNR, WSS, and PESQ measures show that the technique is effective for dereverberation over a wide range of SNRs. Now, we compare the performance of the proposed method with other state-of-the-art dereverberation techniques (the infinity-norm, ISS, and MLP methods) using four female (f1-f4) and four male (m1-m4) utterances. The results are presented in Tables VI-IX. For the infinity-norm method, the length of the shortening filter was taken as 1100, the step-size in the update equation was selected as 0.00001, and the iteration was continued until the cost function reached a steady-state minimum value. The ISS and MLP methods were implemented with the same parameters as in the respective papers. SNR = 20 dB was considered for evaluating all the techniques. The results show that the proposed technique performs better than the competing methods for most settings in terms of the LLR, segSNR, WSS, and PESQ. The average improvement in the LLR is 0.372 point, which is 0.160, 0.196, and 0.106 points better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in the segSNR is 4.76 dB, which is 5.02, 2.55, and 3.36 dB better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in the WSS score is 21.26, which is 15.85, 22.73, and 11.93 points better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in PESQ is 0.743 point, which is
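The NPM [23] quoted above measures channel-estimate accuracy up to an arbitrary gain by first projecting the true AIR onto the estimate and then measuring the residual. A small sketch (the vector `h` here is synthetic; in practice the AIRs of all channels are stacked into one vector):

```python
import numpy as np

def npm_db(h, h_hat):
    """Normalized projection misalignment (Morgan et al.) in dB.
    Invariant to any nonzero scaling of the estimate h_hat."""
    h = np.asarray(h, float)
    h_hat = np.asarray(h_hat, float)
    proj = (h @ h_hat) / (h_hat @ h_hat) * h_hat   # optimally scaled estimate
    e = h - proj                                   # projection misalignment
    return 20.0 * np.log10(np.linalg.norm(e) / np.linalg.norm(h))
```

Because the estimate is rescaled before the error is taken, a blind algorithm is not penalized for the inherent gain ambiguity of blind channel identification.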

TABLE VI QUALITY OF THE DEREVERBERATED SPEECH IN TERMS OF LLR FOR THE PROPOSED AND OTHER STATE-OF-THE-ART TECHNIQUES
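For reference, the LLR measure reported in this and the following tables compares LPC fits of the clean and processed speech on a per-frame basis [22]. A sketch of one common definition (the LPC order and framing here are our assumptions, not necessarily the exact configuration used for the tables):

```python
import numpy as np

def _lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin; returns (a, r)
    with a[0] = 1 and r the autocorrelation sequence."""
    x = np.asarray(x, float)
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1); a[0] = 1.0
    E = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / E
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        E *= 1.0 - k * k
    return a, r

def llr(clean_frame, proc_frame, order=10):
    """Log-likelihood ratio between the LPC fits of a clean and a processed
    frame, evaluated on the clean frame's autocorrelation matrix."""
    a_c, r = _lpc(clean_frame, order)
    a_p, _ = _lpc(proc_frame, order)
    # Toeplitz autocorrelation matrix of the clean frame.
    R = np.array([[r[abs(i - j)] for j in range(order + 1)]
                  for i in range(order + 1)])
    return float(np.log((a_p @ R @ a_p) / (a_c @ R @ a_c)))
```

Because the clean frame's own LPC vector minimizes the quadratic form a^T R a under a[0] = 1, the LLR is exactly zero for identical frames and nonnegative otherwise; per-utterance scores average this quantity over frames.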

Fig. 8. Impulse responses of real reverberant acoustic channels. The length of each impulse response is L = 4400.

TABLE VII QUALITY OF THE DEREVERBERATED SPEECH IN TERMS OF segSNR FOR THE PROPOSED AND OTHER STATE-OF-THE-ART TECHNIQUES

B. Real Reverberant Channels

In this section, we present the speech dereverberation performance of our proposed technique using the real reverberant acoustic channels obtained from MARDY. The eight AIRs that were used for generating the reverberated sound are shown in Fig. 8. The length of each impulse response was truncated to L = 4400. The additive noise in each channel was a low-pass filtered zero-mean white Gaussian sequence. The filter was designed using the Parks-McClellan algorithm [19] with a 3500-Hz passband cutoff frequency, a 3600-Hz stopband cutoff frequency, a sampling frequency of 8000 Hz, 40-dB attenuation in the stopband, and 3 dB of ripple in the passband. The real reverberant channels are spectrally flatter than the simulated acoustic channels. Therefore, the values of the corresponding constraint parameters are different, and fixed values were used in all simulations with real channels. The length of the eigenfilter was chosen according to the formula given in Section IV-A. The block length chosen for the ZFE corresponds to a block-delay of 4.95 s. The LLR, segSNR, WSS, and PESQ scores of the proposed method along with other state-of-the-art dereverberation techniques are presented in Tables X-XIII. The same input SNR was considered for evaluating all the techniques. It is observed that the proposed technique surpasses all the competing methods by a significant margin. The average improvement in the LLR is 0.200 point, which is 0.190, 0.266, and 0.144 points better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in the segSNR is 4.81 dB, which is 4.45, 3.42, and 6.84 dB better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in the WSS score is 5.74, which is 7.09, 12.41, and 14.96 points better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in the PESQ is 0.235 point, which is 0.107, 0.342, and 0.719 points better than the infinity-norm, ISS, and MLP methods, respectively. The average improvement in SNR for these utterances using the real acoustic channels was 2.43 dB.
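The noise-shaping lowpass described above can be reproduced with SciPy's Parks-McClellan implementation. A sketch with the stated band edges and sampling rate; the tap count and band weights are our own choices to reach roughly the quoted 40-dB stopband attenuation:

```python
import numpy as np
from scipy.signal import remez, freqz

fs = 8000.0
# Equiripple lowpass: passband edge 3500 Hz, stopband edge 3600 Hz.
# 301 taps is a guess at an order that comfortably meets ~40 dB stopband
# attenuation over the narrow 100-Hz transition band; the stopband weight
# trades passband ripple for stopband rejection.
taps = remez(301, [0, 3500, 3600, fs / 2], [1, 0], weight=[1, 10], fs=fs)

# Inspect the magnitude response on a dense frequency grid.
w, H = freqz(taps, worN=4096, fs=fs)
mag = np.abs(H)
```

The filtered noise is then obtained by convolving a white Gaussian sequence with `taps` (e.g., via `scipy.signal.lfilter`).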
The inferior performance of the competing methods can be explained as follows. The ISS assumes that the source signal is white, which does not hold for a speech input. Therefore, the received signal is prewhitened before calculating the coefficients of the inverse filter. The technique proposed in [9] for estimating the whitening filter is based on the magnitude spectrum of the autoregressive (AR) system of the speech signal. Since the phase spectrum of the AR system function is ignored,

TABLE VIII QUALITY OF THE DEREVERBERATED SPEECH IN TERMS OF WSS FOR THE PROPOSED AND OTHER STATE-OF-THE-ART TECHNIQUES

TABLE IX QUALITY OF THE DEREVERBERATED SPEECH IN TERMS OF PESQ FOR THE PROPOSED AND OTHER STATE-OF-THE-ART TECHNIQUES

0.536, 0.622, and 0.684 points better than the infinity-norm, ISS, and MLP methods, respectively.
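The segSNR figures compared above average frame-level SNRs rather than forming a single global ratio, which tracks the perceived quality of speech better. A sketch of a common definition (the frame length and the [-10, 35] dB clipping range are conventional choices and may differ from the paper's exact settings):

```python
import numpy as np

def seg_snr(clean, processed, frame=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clipped to [lo, hi]."""
    n = min(len(clean), len(processed)) // frame * frame
    c = np.asarray(clean[:n], float).reshape(-1, frame)
    p = np.asarray(processed[:n], float).reshape(-1, frame)
    num = np.sum(c ** 2, axis=1)                  # per-frame signal energy
    den = np.sum((c - p) ** 2, axis=1) + 1e-12    # guard against /0
    snr = 10.0 * np.log10(num / den)
    return float(np.mean(np.clip(snr, lo, hi)))
```

The clipping prevents a few silent or perfectly reconstructed frames from dominating the average.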

TABLE X QUALITY OF THE DEREVERBERATED SPEECH FOR THE REAL ACOUSTIC CHANNELS IN TERMS OF LLR

TABLE XI QUALITY OF THE DEREVERBERATED SPEECH FOR THE REAL ACOUSTIC CHANNELS IN TERMS OF segSNR

the prewhitening performance becomes erroneous, which causes improper inversion of the AIRs. Moreover, the presence of noise, which is not considered in the ISS, further deteriorates its performance. The MLP method estimates the AR parameters of the speech from the characteristic polynomials of the prediction matrix, calculated using the correlation between the current samples and a one-sample-delayed version of the multichannel received signals. The prediction matrix was estimated from 2 s of speech samples. However, the AR parameters cannot be assumed stationary over such a long duration. As a result, the estimated AR parameters are an average of the actual variables, which deteriorates the perceptual quality of the dereverberated speech. Moreover, in our implementation of the MLP method, the characteristic polynomials of the prediction matrix tend to diverge from the actual values when the AIRs exceed a few hundred taps. For comparison purposes, we have used known AR parameters for simulating the MLP method. The method further assumes that at least one microphone is closer to the speaker than to the noise source in order to obtain the source LP residual from the noisy received signal. However, this assumption does not hold for incoherent noise, and the dereverberated output was found to be severely noise corrupted. The shortening of AIRs based on the infinity-norm optimization proposed in [13] does not consider channel estimation inaccuracy, which is a significant source of error in a practical implementation. Moreover, the method does not compensate early reflections, which causes the different objective measures except the PESQ to show poor speech quality.

TABLE XII QUALITY OF THE DEREVERBERATED SPEECH FOR THE REAL ACOUSTIC CHANNELS IN TERMS OF WSS

TABLE XIII QUALITY OF THE DEREVERBERATED SPEECH FOR THE REAL ACOUSTIC CHANNELS IN TERMS OF PESQ

C. Time-Varying Condition

In a realistic environment, the acoustic channels are time-varying. A slight movement of the speaker's head, which is very natural during conversation, causes the AIRs to change. An adaptive channel estimation algorithm can track the time-varying channels; therefore, the proposed dereverberation technique is suitable for a changing acoustic condition. In order to simulate the time-varying condition, the length of each impulse response was taken to be the number of samples corresponding to the reverberation time. Fig. 9 shows the NPM of channel estimation in which the source position was shifted six times from the original position. In the first four cases, the speaker moved to the left by 1 cm at each step, and in the last two cases, the speaker moved to the right by 1 cm at each step. The notches in the NPM curve show the instants when the speaker moved. We find that the algorithm remains at a good NPM level despite frequent changes in the AIRs and quickly converges to the previous level after the movement of the speaker. In order to visualize how fast the algorithm can track the time-varying AIRs, the speaker was moved at faster paces in subsequent experiments. The convergence profile of the adaptive channel estimation algorithm is shown in Fig. 9(a)-(c). Here, we see that the algorithm requires around 20 blocks of data to converge to the previous NPM level after the speaker has moved. Since each block of data requires a block-length of new speech samples, the algorithm requires around 6 s (for an 8000-Hz sampling frequency) to converge in the time-varying condition. In other words, if the AIRs remain the same for around 20 blocks of data, we can obtain an estimate of the AIRs from the noisy received signal. Since the estimated channels show a good NPM value, the

dereverberation performance using these estimates would be reasonable.

Fig. 9. Convergence profile of the robust NMCFLMS algorithm for time-varying channels.

V. CONCLUSION

We have proposed a complete speech dereverberation technique suitable for a practical noisy environment with slowly varying acoustic channels. The proposed technique adaptively estimates the AIRs, then filters the multichannel speech signal in order to enhance the signal power, and finally equalizes the AIRs using a ZFE. The AIRs are estimated using the robust NMCFLMS algorithm. A spectral constraint is attached to the conventional NMCFLMS algorithm to stop its misconvergence to a narrowband solution. The signal enhancement filter has been obtained from the eigenfilter of the channel convolution matrix using an efficient iterative technique that avoids computationally intensive Cholesky decomposition. Moreover, we have attached a frequency-domain constraint to this filter in order to mitigate speech distortion and spectral nulls in the equivalent channel. The proposed ZFE is implemented in block-adaptive form so that it can process the received signal when the AIRs are time-varying and/or the duration of the signal is not defined. The elegant feature of our technique is that it does not require a priori information about the source or microphone locations or assumptions on the statistical properties of the speech or noise signals. The proposed method has shown better performance as compared to the state-of-the-art dereverberation techniques, which can be clearly observed from the LLR, segSNR, WSS, PESQ, and SNR measures for both female and male utterances.

APPENDIX

The iterative update equation for finding the eigenvector of $\mathbf{R}$ corresponding to the largest eigenvalue can be formulated as

$$\mathbf{h}(n+1) = \mathbf{R}\,\mathbf{h}(n) \qquad (50)$$

The matrix $\mathbf{R}$ can be diagonalized as

$$\mathbf{R} = \mathbf{Q}\,\boldsymbol{\Lambda}\,\mathbf{Q}^{H} \qquad (51)$$

where $\mathbf{Q}$ is the unitary matrix whose columns are the eigenvectors of $\mathbf{R}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements $\lambda_1, \lambda_2, \ldots$ are the eigenvalues of $\mathbf{R}$. Substituting (51) into (50) and premultiplying by $\mathbf{Q}^{H}$, we obtain

$$\mathbf{v}(n+1) = \boldsymbol{\Lambda}\,\mathbf{v}(n) \qquad (52)$$

where $\mathbf{v}(n) = \mathbf{Q}^{H}\mathbf{h}(n)$. The set of first-order difference equations in (50) is now decoupled. Therefore, the solution of the $i$th equation can be obtained as [24]

$$v_i(n) = C_i\,\lambda_i^{\,n}\,u(n) \qquad (53)$$

where $v_i(n)$ is the $i$th component of $\mathbf{v}(n)$, $C_i$ is an arbitrary constant that depends on the initial value of $v_i(n)$, and $u(n)$ is the unit step function. Now $\mathbf{h}(n)$ can be obtained as

$$\mathbf{h}(n) = \mathbf{Q}\,\mathbf{v}(n) = \sum_i C_i\,\lambda_i^{\,n}\,\mathbf{q}_i\,u(n) \qquad (54)$$

where $\mathbf{q}_i$ is the eigenvector corresponding to the eigenvalue $\lambda_i$. Since $\lambda_i < 1$, we have $\lambda_i^{\,n} \to 0$ for all $i$. As a result, $v_i(n)$ decays with each iteration, where the rate of decay is dependent on the value of $\lambda_i$: the larger the value of $\lambda_i$, the smaller the rate of decay. Therefore, after a large number of iterations, the final value of $v_i(n)$ can be expressed as

$$v_i(n) = \begin{cases} \epsilon, & \text{when } i = 1 \\ 0, & \text{when } i \neq 1 \end{cases} \qquad (55)$$

where $\epsilon$ represents a small number. Substituting (55) into (54), the final estimate of the channel can be approximated as

$$\mathbf{h}(n) \approx \epsilon\,\mathbf{q}_1$$

where $\mathbf{q}_1$ is the eigenvector corresponding to the largest eigenvalue $\lambda_1$.

ACKNOWLEDGMENT

The work presented in this paper was carried out in the Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology.

REFERENCES
[1] B. W. Gillespie and L. Atlas, Acoustic diversity for improved speech recognition in reverberant environments, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2002, vol. 1, pp. 557560.

[2] M. A. Haque, M. Bashar, P. Naylor, K. Hirose, and M. K. Hasan, "Energy constrained frequency-domain normalized LMS algorithm for blind channel identification," J. Signal, Image, Video Process. (SIViP). London, U.K.: Springer, Apr. 2007, pp. 203-213.
[3] M. A. Haque and M. K. Hasan, "Robust multichannel LMS-type algorithms with fast decaying transient for blind identification of acoustic channels," IET Signal Process., vol. 2, no. 4, pp. 431-441, 2008.
[4] B. D. Radlovic, R. Williamson, and R. Kennedy, "On the poor robustness of sound equalization in reverberant environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1999, vol. 2, pp. 881-884.
[5] S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1614-1626, Nov. 2001.
[6] B. Yegnanarayana and P. Satyanarayana, "Enhancement of reverberant speech using LP residual signal," IEEE Trans. Acoust., Speech, Signal Process., vol. 8, no. 3, pp. 267-281, May 2000.
[7] M. Wu and D. L. Wang, "A two-stage algorithm for one-microphone reverberant speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 3, pp. 774-784, May 2006.
[8] T. S. Bakir and R. M. Mersereau, "Blind adaptive dereverberation of speech signals using a microphone array," in Proc. Int. Conf. Application-Specific Syst., Architectures, Process., Jun. 2003.
[9] K. Furuya and A. Kataoka, "Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1579-1591, Jul. 2007.
[10] T. Nakatani, T. Hikichi, K. Kinoshita, T. Yoshioka, M. Delcroix, M. Miyoshi, and B. Juang, "Robust blind dereverberation of speech signals based on characteristics of short-time speech segments," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS'07), 2007, pp. 2986-2989.
[11] M. Delcroix, T. Hikichi, and M. Miyoshi, "Dereverberation and denoising using multichannel linear prediction," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1791-1801, Aug. 2007.
[12] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145-152, Feb. 1988.
[13] T. Mei, A. Mertins, and M. Kallinger, "Room impulse response shortening with infinity-norm optimization," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2009, pp. 3745-3748.
[14] M. A. Haque and M. K. Hasan, "Noise robust multichannel frequency-domain LMS algorithms for blind channel identification," IEEE Signal Process. Lett., vol. 15, pp. 305-308, 2008.
[15] Y. Huang and J. Benesty, "A class of frequency-domain adaptive approaches to blind multichannel identification," IEEE Trans. Signal Process., vol. 51, no. 1, pp. 11-24, Jan. 2003.
[16] S. Haykin, Adaptive Filter Theory, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.
[17] M. Ding, B. L. Evans, R. K. Martin, and C. R. Johnson, "Minimum intersymbol interference methods for time domain equalizer design," in Proc. Global Telecommun. Conf., 2003, vol. 4, pp. 2146-2150.
[18] A. Goldsmith, Wireless Communication. New York: Cambridge Univ. Press, 2005.
[19] J. G. Proakis and D. G. Manolakis, Introduction to Digital Signal Processing. New York: Macmillan, 1989.
[20] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943-950, Apr. 1979.
[21] J. Y. C. Wen, N. D. Gaubitch, E. A. P. Habets, T. Myatt, and P. A. Naylor, Multichannel Acoustic Reverberation Database at York. [Online]. Available: http://www.commsp.ee.ic.ac.uk/sap
[22] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229-238, Jan. 2008.
[23] D. R. Morgan, J. Benesty, and M. M. Sondhi, "On the evaluation of estimated impulse responses," IEEE Signal Process. Lett., vol. 5, no. 7, pp. 174-176, Jul. 1998.
[24] J. G. Proakis, C. M. Rader, F. Ling, C. L. Nikias, M. Moonen, and I. K. Proudler, Algorithms for Statistical Signal Processing. Upper Saddle River, NJ: Pearson Education, 2002.

Mohammad Ariful Haque, photograph and biography not available at the time of publication.

Touqul Islam, photograph and biography not available at the time of publication.

Md. Kamrul Hasan, photograph and biography not available at the time of publication.
