
2016 IEEE International Conference on Consumer Electronics (ICCE)

Multi-Stage Speech Enhancement for Automatic Speech Recognition

Seungyeol Lee*, Youngwoo Lee, and Namgook Cho
Samsung Electronics, Suwon, Korea
{sy1031.lee, ywbme.lee, namgook.cho}@samsung.com

Abstract: In this paper, we propose a multi-stage speech enhancement technique for speech recognition. First, a multi-channel speech enhancement method takes advantage of the spatial information of the speech source. Then, in the second stage, single-channel speech enhancement based on a data-driven approach is adopted to improve the performance of speech recognition on the server side. This method improves the quality of the speech signal while maximizing the advantage of each speech enhancement technique. The experimental results show that the proposed technique is superior to conventional multi-stage speech enhancement algorithms.

Fig. 1. Proposed speech enhancement algorithm for speech recognition.
I. INTRODUCTION

Speech enhancement, commonly applied as pre-processing, is very important for speech recognition. The real acoustic environment around users contains non-stationary background noise whose frequency characteristics vary very rapidly. Consequently, pre-processing techniques that minimize signal distortion are critical for improving automatic speech recognition (ASR) performance. To improve the quality of the captured speech, single-channel speech enhancement techniques have been adopted as pre-processing. However, single-channel techniques have difficulty enhancing signals, especially in estimating the noise power spectrum. To overcome this limitation, multi-channel speech enhancement techniques have been studied for a long time. A multi-channel speech enhancement algorithm based on beamforming was developed in [1]; in that method, a generalized sidelobe cancellation (GSC) beamforming technique is used to enhance signals, and the direction of the beamformer is steered using the time difference of arrival (TDOA) [2]. However, this method has several disadvantages for robust pre-processing: 1) steering errors due to inaccurate TDOA estimation, and 2) latency introduced by TDOA estimation. To avoid these problems, we use a parametric multi-channel weighting filter (PMWF) method [3] in this paper. The parametric filter works well without TDOA because the multi-channel speech presence probability (MC-SPP) both finds the source direction and controls the noise spectrum update.

However, it is very difficult to run multi-channel speech enhancement on devices such as mobile phones, because reducing power consumption may be more important than a performance improvement. For that reason, we propose a distributed structure for speech enhancement on a client-server ASR architecture: a simple multi-channel algorithm runs on the device, and more complicated techniques are applied on the server side. In this paper, we apply a data-driven approach [4] on the server. The data-driven method works well in low-SNR environments because there is no need to estimate the noise spectrum using voice activity detection. Fig. 1 shows the overall structure of the proposed multi-stage speech enhancement algorithm.

II. MULTI-CHANNEL PARAMETRIC WEIGHTING FILTER

Let $s(t)$ denote a speech signal acquired by an $N$-microphone array. The signal $x_n(t)$ is obtained by convolving $s(t)$ with the $n$-th transfer function $g_n(t)$, and $y_n(t)$ is the sum of $x_n(t)$ and a noise signal $v_n(t)$. The short-time Fourier transforms (STFT) of these signals are denoted $Y_n(k,l)$, $X_n(k,l)$, and $V_n(k,l)$, with frequency index $k$ and frame index $l$. To formulate the algorithm, we use the vector notation $\mathbf{y}(k,l) = [Y_1(k,l) \cdots Y_N(k,l)]^T$, $\mathbf{v}(k,l) = [V_1(k,l) \cdots V_N(k,l)]^T$, and $\mathbf{x}(k,l) = [X_1(k,l) \cdots X_N(k,l)]^T$. The power spectral density (PSD) matrices of $y(t)$, $v(t)$, and $x(t)$ are $\Phi_{yy}(k) = E\{\mathbf{y}(k,l)\mathbf{y}^H(k,l)\}$, $\Phi_{vv}(k) = E\{\mathbf{v}(k,l)\mathbf{v}^H(k,l)\}$, and $\Phi_{xx}(k) = E\{\mathbf{x}(k,l)\mathbf{x}^H(k,l)\} = \Phi_{yy}(k) - \Phi_{vv}(k)$, respectively. We define the two terms $\hat{\beta}(k,l) \triangleq \mathbf{y}^H(k,l)\Phi_{vv}^{-1}(k,l)\mathbf{y}(k,l)$ and $\hat{\psi}(k,l) \triangleq \mathrm{tr}\{\Phi_{vv}^{-1}(k,l)\Phi_{yy}(k,l)\}$, the latter serving as the multi-channel a posteriori SNR.

We estimate the a priori speech absence probability (SAP) $q(k,l)$, which plays the role of prior knowledge in the statistical model, by combining three terms: a local, a frequency-smoothed, and a frame-smoothed value. The local SAP is obtained as

$q_{\mathrm{local}}(k,l) = \dfrac{\hat{\psi}(k,l)}{1 + \hat{\psi}(k,l)}.$   (1)

The frequency-smoothed SAP is defined through a weighted average of $\hat{\psi}$ over nearby frequency bins,

$q_{\mathrm{freq}}(k,l) = \sum_{i=-K_1}^{K_1} w_{\mathrm{freq}}(i)\,\hat{\psi}(k-i,l).$   (2)

383 978-1-4673-8364-6/16/$31.00 2016 IEEE
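To make the statistics above concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the PSD matrices, the two terms beta-hat and psi-hat, and the local SAP of (1) for a single frequency bin; the toy signals, array sizes, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                   # number of microphones (assumed)
frames = 200                            # STFT frames for this frequency bin

# Toy STFT data for one bin: noise v, speech component x, observation y
v = rng.standard_normal((frames, N)) + 1j * rng.standard_normal((frames, N))
x = 0.5 * (rng.standard_normal((frames, N)) + 1j * rng.standard_normal((frames, N)))
y = x + v

def psd(z_frames):
    # Sample estimate of E{z z^H} over the available frames
    return sum(np.outer(z, z.conj()) for z in z_frames) / len(z_frames)

phi_yy = psd(y)
phi_vv = psd(v)
phi_xx = phi_yy - phi_vv                # Phi_xx = Phi_yy - Phi_vv

phi_vv_inv = np.linalg.inv(phi_vv)
y0 = y[0]                               # one frame (k, l)

beta_hat = np.real(y0.conj() @ phi_vv_inv @ y0)    # y^H Phi_vv^{-1} y
psi_hat = np.real(np.trace(phi_vv_inv @ phi_yy))   # multi-channel a posteriori SNR

q_local = psi_hat / (1.0 + psi_hat)     # Eq. (1), local a priori SAP
```

In a real front end the PSD matrices would of course be tracked recursively per bin rather than batch-averaged as here.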



Here, $w_{\mathrm{freq}}$ in (2) is a Hamming window of size $2K_1+1$, and the inter-frame smoothed SAP is given by

$q_{\mathrm{frame}}(k,l) = \sum_{i=-K_1}^{K_1} w_{\mathrm{frame}}(i)\,\hat{\psi}(k,l-i).$   (3)

Finally, the a priori SAP is obtained as

$q(k,l) = q_{\mathrm{local}}(k,l)\, q_{\mathrm{freq}}(k,l)\, q_{\mathrm{frame}}(k,l).$   (4)

Now the multi-channel speech presence probability proposed in [5] is

$p(k,l) = \left\{ 1 + \dfrac{q(k,l)}{1-q(k,l)}\left[1+\xi(k,l)\right] \exp\!\left(-\dfrac{\beta(k,l)}{1+\xi(k,l)}\right) \right\}^{-1}$   (5)

where $\xi(k,l) \triangleq \mathrm{tr}\{\Phi_{vv}^{-1}(k,l)\,\Phi_{xx}(k,l)\}$ and $\beta(k,l) \triangleq \mathbf{y}^H(k,l)\,\Phi_{vv}^{-1}(k,l)\,\Phi_{xx}(k,l)\,\Phi_{vv}^{-1}(k,l)\,\mathbf{y}(k,l)$. Using this probability, the noise-update smoothing parameter is calculated as

$\tilde{\alpha}_v(k,l) = \alpha_v + (1-\alpha_v)\,p(k,l),$   (6)

and is used to obtain a noise PSD matrix estimate at frame $l$ as follows:

$\hat{\Phi}_{vv}(k,l) = \tilde{\alpha}_v(k,l)\,\hat{\Phi}_{vv}(k,l-1) + \left[1-\tilde{\alpha}_v(k,l)\right]\mathbf{y}(k,l)\,\mathbf{y}^H(k,l).$   (7)

Lastly, we obtain the PMWF gain function $\mathbf{h}_W(k,l)$ from the estimated noise PSD matrix $\hat{\Phi}_{vv}(k,l)$ as

$\mathbf{h}_W(k,l) = \dfrac{\hat{\Phi}_{vv}^{-1}(k,l)\,\Phi_{yy}(k,l) - \mathbf{I}_{N \times N}}{\mu + \lambda(k,l)}$   (8)

where $\lambda(k,l) = \mathrm{tr}\!\left[\hat{\Phi}_{vv}^{-1}(k,l)\,\Phi_{yy}(k,l)\right] - N$.

III. SINGLE-CHANNEL DATA-DRIVEN APPROACH

Among single-channel speech enhancement algorithms, the data-driven scheme trains the noise reduction gain function as a function of the a priori SNR and the a posteriori SNR [4]. In detail, the noise suppression gain is found by means of a training procedure using speech from a training database. In each frame, for each frequency bin, we have a pair of a priori and a posteriori SNR values that falls into one of the parameter cells. A priori and a posteriori SNR pairs from different frequency bins and different frames can fall into the same parameter cell during the course of the training. To each of those pairs corresponds a clean amplitude $A$ and a noisy amplitude $R$. These are collected, and after all the training signals are processed, the optimal value of $G_{ij}$ for parameter cell $(i,j)$ is found by minimizing a distortion measure of interest. We consider the weighted-Euclidean distortion measure [6], which yields

$G_{ij} = \left( \sum_{m=1}^{M_{ij}} A_{ij}^{p+1}(m)\,R_{ij}(m) \right) \Big/ \left( \sum_{m=1}^{M_{ij}} A_{ij}^{p}(m)\,R_{ij}^{2}(m) \right)$   (9)

where $R_{ij}(m)$ is the $m$-th noisy amplitude that fell into parameter cell $(i,j)$, $A_{ij}(m)$ is the corresponding clean amplitude, and $M_{ij}$ is the number of pairs collected in that cell.

IV. EXPERIMENTAL RESULT

For the performance evaluation of the proposed method, experiments were conducted with audio samples recorded over 1200 spoken sentences uttered in Korean by 12 speakers (5 female, 7 male). The spoken words were captured by a mobile phone located 50 cm away from the speaker. Noise signals were captured in various noisy environments such as street, cafe, and car, and mixed with the clean signals. In our task, an ASR system based on a deep-neural-network decoder (KALDI) is used. The experimental results are shown in Table I.

TABLE I
EXPERIMENTAL RESULTS: WORD RECOGNITION RATE

  Environment        car     stationary   non-stationary   overall
  conventional [1]   64.3%   69.5%        63.2%            65.7%
  proposed           81.6%   80.3%        71.7%            77.9%

According to these results, the proposed method is superior to the conventional method. The main reason is the robustness of direction finding: in the proposed algorithm, the roles of TDOA estimation and of the noise-update smoothing parameter are both absorbed into the integrated speech presence probability calculation, which helps prevent false gain estimation. Furthermore, the data-driven single-channel enhancement is likely to outperform the conventional method. We can therefore conclude that the proposed algorithm works well across a wide range of acoustic environments.

V. CONCLUSION

In this paper, we proposed a new speech enhancement method for a client-server distributed ASR architecture. The principal contribution of this paper is the combination of the advantages of a multi-microphone noise reduction technique and a single-microphone noise reduction technique. The parametric multi-channel speech enhancement algorithm helps improve performance by using the multi-channel speech presence probability. In the single-channel stage, the speech enhancement gain was trained on the output of the multi-channel speech enhancement. The performance of the proposed approach has been found to be superior to that of the conventional method through the ASR test.

REFERENCES

[1] J. Cho, S. Lee, and I. Hwang, "A practical approach to robust speech recognition using two microphones in driving environments," in Proc. Audio Engineering Society Convention 137, Los Angeles, USA, 2014.
[2] W. Cui, J. Cho, and S. Lee, "A robust TDOA estimation method for in-car-noise environments," in Proc. Interspeech 2014, Singapore, 2014.
[3] M. Souden, J. Chen, J. Benesty, and S. Affes, "On optimal frequency-domain multichannel linear filtering for noise reduction," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 260-276, Feb. 2010.
[4] J. Erkelens, J. Jensen, and R. Heusdens, "A data-driven approach to optimizing spectral speech enhancement methods for various error criteria," Speech Commun., vol. 49, pp. 530-541, 2007.
[5] M. Souden, J. Chen, J. Benesty, and S. Affes, "Gaussian model-based multichannel speech presence probability," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1072-1077, Jul. 2010.
[6] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 857-869, Sep. 2005.
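As a supplement, the Section II update loop of Eqs. (5)-(8) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the constants alpha_v and mu, the fixed a priori SAP q, the instantaneous rank-one estimate of the observed PSD, and the toy signals are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, frames = 4, 100
alpha_v = 0.92                          # base noise smoothing constant (assumed)
mu = 1.0                                # PMWF trade-off parameter (assumed)
q = 0.5                                 # a priori SAP, Eq. (4), held fixed here

# Toy STFT data for one frequency bin: noise-dominated observation
v = rng.standard_normal((frames, N)) + 1j * rng.standard_normal((frames, N))
y = v + 0.5 * (rng.standard_normal((frames, N)) + 1j * rng.standard_normal((frames, N)))

phi_vv = np.eye(N, dtype=complex)       # initial noise PSD estimate

for l in range(frames):
    yl = y[l]
    phi_yy = np.outer(yl, yl.conj())    # instantaneous estimate of Phi_yy
    phi_vv_inv = np.linalg.inv(phi_vv)

    phi_xx = phi_yy - phi_vv            # Phi_xx = Phi_yy - Phi_vv
    xi = max(np.real(np.trace(phi_vv_inv @ phi_xx)), 1e-6)
    beta = np.real(yl.conj() @ phi_vv_inv @ phi_xx @ phi_vv_inv @ yl)

    # Eq. (5): multi-channel speech presence probability [5]
    p = 1.0 / (1.0 + (q / (1.0 - q)) * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))

    # Eqs. (6)-(7): probability-controlled recursive noise PSD update
    alpha_tilde = alpha_v + (1.0 - alpha_v) * p
    phi_vv = alpha_tilde * phi_vv + (1.0 - alpha_tilde) * phi_yy

# Eq. (8): PMWF gain from the final noise estimate and the average Phi_yy
phi_yy_avg = sum(np.outer(z, z.conj()) for z in y) / frames
lam = np.real(np.trace(np.linalg.inv(phi_vv) @ phi_yy_avg)) - N
h_w = (np.linalg.inv(phi_vv) @ phi_yy_avg - np.eye(N)) / (mu + lam)
```

The key property the sketch shows is that the noise PSD is updated strongly only when the speech presence probability is low, so no explicit voice activity detector or TDOA estimate is needed.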

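Similarly, the Section III training of Eq. (9) can be sketched with toy data. The SNR-cell grid, the weighting exponent p, and the synthetic (A, R) pairs below are illustrative assumptions; only the per-cell gain formula follows Eq. (9).

```python
import numpy as np

rng = np.random.default_rng(2)
p = -1.0                                # weighting exponent of the WE measure (assumed)
n_cells = 8                             # cells per SNR axis (assumed)

# Toy training data: clean amplitudes A and corresponding noisy amplitudes R
A = np.abs(rng.standard_normal(5000)) + 1e-3
noise = 0.5 * rng.standard_normal(5000)
R = np.abs(A + noise) + 1e-3

# Quantize a priori / a posteriori SNR proxies into parameter cells (i, j)
snr_prio = np.clip(20 * np.log10(A / 0.5), -20, 20)
snr_post = np.clip(20 * np.log10(R / 0.5), -20, 20)
edges = np.linspace(-20, 20, n_cells - 1)
i_idx = np.digitize(snr_prio, edges)
j_idx = np.digitize(snr_post, edges)

# Eq. (9): G_ij = sum(A^{p+1} R) / sum(A^p R^2) over the pairs in cell (i, j)
G = np.ones((n_cells, n_cells))
for i in range(n_cells):
    for j in range(n_cells):
        m = (i_idx == i) & (j_idx == j)
        if m.any():
            G[i, j] = np.sum(A[m] ** (p + 1) * R[m]) / np.sum(A[m] ** p * R[m] ** 2)

# At enhancement time, each bin's (a priori, a posteriori) SNR pair selects a
# cell and the stored gain is applied: A_hat = G[i, j] * R
```

In the proposed system this training would be run on the output of the multi-channel stage, so the learned gains match the residual noise statistics seen at the server.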
