
1680 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 10, OCTOBER 2015

Reduction of Gaussian, Supergaussian, and Impulsive Noise by Interpolation of the Binary Mask Residual
Marco Ruhland, Joerg Bitzer, Member, IEEE, Matthias Brandt, and Stefan Goetze, Member, IEEE

Abstract: In this paper, we present a new approach for noise reduction. A binary time-frequency (T-F) masking threshold criterion is proposed and analyzed with respect to the average spectra of music and noise disturbances. Modified autoregressive (AR) detection and AR interpolation are then applied to the residual signal of the binary masking process. The proposed method is able to reduce supergaussian and impulsive noise while ensuring preservation of the desired signal, which is crucial for professional high-quality audio restoration, and it is also suitable for Gaussian noise to a certain extent. The approach is compared to a state-of-the-art restoration algorithm by means of the objective measures signal-to-noise ratio (SNR) improvement and perceptual quality, and by subjective listening tests. The objective results as well as the listening tests show that the proposed algorithm is especially suited for supergaussian, grainy-sounding noise types, e.g., optical soundtrack noise of celluloid movie footage, or rain noise.

Index Terms: Interpolation, noise reduction, optical soundtrack noise, time-frequency masking.

Manuscript received October 14, 2014; revised February 18, 2015; accepted May 29, 2015. Date of publication June 11, 2015; date of current version June 19, 2015. This work was supported in part by the German Federal Ministry of Education and Research (BMBF) under Grants 17N3008 and 03FH030PX2 and in part by the EU-FP7 project EcoShopping under Grant 609180. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. DeLiang Wang.
M. Ruhland and S. Goetze are with the Project Group Hearing, Speech and Audio Technology, Fraunhofer Institute for Digital Media Technology (IDMT), D-26129 Oldenburg, Germany (e-mail: marco.ruhland@idmt.fraunhofer.de).
J. Bitzer and M. Brandt are with the Institute for Hearing Technology and Audiology, Jade University of Applied Sciences, 26121 Oldenburg, Germany.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2015.2444664

I. INTRODUCTION

The term noise reduction is often associated with the removal or reduction of Gaussian noise disturbances. The assumption of a Gaussian random process is common for popular denoising algorithms, like the well-known Wiener filter [1] or the Ephraim-Malah method [2] developed in the 1980s. However, at the same time, Porter and Boll [3] showed that speech signals are rather characterized by leptokurtic, respectively supergaussian amplitude distributions, and that the error introduced by the deficient Gaussian assumption may be significant. This finding led to the development of more sophisticated approaches, e.g., [4] by Cohen, featuring two-sided Gamma- and Laplace densities for the speech amplitude distributions, however still assuming a Gaussian distribution for the noise. Only a few months later Martin [5] showed that car noise does rather have a Laplacian distribution instead of a Gaussian distribution, and he showed how to employ this prior information within the minimum mean square error (MMSE) estimator for a denoising algorithm. This approach works particularly well at low signal-to-noise ratios (SNR) below 0 dB since the Laplacian noise assumption yields less musical noise in that case. Further refinement of the estimators is presented in [6], [7]. Besides these methods, several techniques exist to reduce the musical noise phenomenon, e.g., spectral peak elimination [8], spectral weighting [9], spectral domain smoothing [10], or smoothing of the spectral gain function [11]. For speech signals, it has been shown that cepstral smoothing is superior to spectral domain smoothing, since speech-relevant information like pitch and formant structure can be protected against smoothing within the cepstral domain [12]–[14].

The previously described denoising methods usually work in the frequency-domain. However, for the removal of impulsive disturbances, e.g., clicks caused by dust and scratches on a gramophone disc, time-domain interpolation methods are preferable. Since click disturbances are mostly single, sparse events in time, lasting only a few milliseconds, it is usually sufficient to replace the corrupted samples by interpolated values of the surrounding unaffected samples [15]. While frequency-based methods usually affect all samples of a signal block, better preservation of the desired signal is achieved by time-domain interpolation, since only a few samples are changed. Pioneering work on time-domain interpolation has been done independently by Vaseghi [16] and Veldhuis [17] for different applications.

However, in some cases, a strict distinction between impulse disturbance and hiss cannot be made easily. Imagine the noise of a heavy rainfall or applause which is the result of a vast number of small impulsive events per time instant, and cannot be regarded as single and sparse any more. Though the audible sensation is stationary, a certain granularity is perceived that allows the listener to identify the impulsive origin of the noise. Similar noise can be observed when listening to the sound of old celluloid movies of the optical soundtrack era. The footage decomposes with time and suffers from dust and mould, badly affecting the audio signal which is encoded in an optical soundtrack next to the picture information. The resulting noise is grainy and of supergaussian amplitude distribution. Broadcast media archives request for new restoration techniques to cope with such kinds of degradation. For an overview of restoration of optical soundtracks please cf. [18]. This kind of noise is problematic for both frequency- as well as time-domain approaches.

In this contribution we propose a new hybrid algorithm, performing frequency-domain binary masking and time-domain

2329-9290 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
RUHLAND et al.: REDUCTION OF GAUSSIAN, SUPERGAUSSIAN AND IMPULSIVE NOISE 1681

interpolation to tackle the described problem. Comparing our method with a state-of-the-art restoration algorithm will give new ideas on how to cope with supergaussian noise.

The remainder of this contribution is organized as follows. Section II introduces our approach, being a combination of time-frequency (T-F) masking and AR detection and interpolation. In Section III, a brief summary of the theoretical analysis of the proposed algorithm is given based on a former contribution of the authors [19]. Then, the proposed algorithm is tested versus a state-of-the-art algorithm by means of objective quality measures and subjective listening tests in Sections IV and V, respectively. Finally, conclusions are drawn in Section VI.

II. THEORY

A. Binary Masking

A degraded audio signal at discrete time index is given as a set of sinusoidal basis functions plus a random process (cf. [20], [21]):

(1)

For declicking applications, the task is to estimate the complex coefficients in Eq. (1) and the number of sinusoids, that minimize the energy of the residual , e.g., in a least-squares sense (cf. [21], [20] or [22]). The residual is then treated as an autoregressive (AR) process, and AR detection and interpolation is performed within to eliminate the clicks. Afterwards, the interpolated residual is added back to the split-apart sinusoids, to result in the restored audio signal .

An easier way to split the sinusoids from the residual can be achieved by T-F masking. The use of T-F masking for separation tasks is quite young, though the idea behind it is quite simple and of low complexity. It has been successfully employed e.g., in computational auditory scene analysis (CASA), independent component analysis (ICA), improvement of automatic speech recognition (ASR), and cochlear implant (CI) signal enhancement, amongst other fields of use. A list of contributions on the use of T-F masking within these fields of research is given in [23], pg. 342. For a two-source separation task, like in our case, the separation of a desired target signal from a noisy residual signal, the T-F masking process is also called binary masking (BM).

Let be the noisy music signal to be restored, as the sum of the undisturbed signal , and the noise disturbance . The aim of the process is to find a target signal and a residual signal that match the true unknown desired signal and the noise signal as closely as possible. Although and are different from and , Eq. (2) shall be satisfied,

(2)

expressing Eq. (1) in a simpler fashion. In Eq. (2), represents the so-called target signal, as the sum of sinusoids, and the noisy residual signal. Several steps have to be performed to obtain these signals. First, the noisy signal is transformed into the frequency-domain, using a DFT of length samples at a sampling frequency Hz, with samples block shift, corresponding to 50% overlap, by

(3)

In Eq. (3), denotes the frequency index and the discrete frame index. The frequency-domain target signal and the residual signal are initialized with zeros. The absolute squared magnitude is compared to the binary masking threshold estimate for each frequency bin . The threshold estimate can be initialized with zeros before processing the first block , or, for faster initial response, with the squared magnitude of the first signal block, . If the squared magnitude for a certain frequency bin is above , the respective frequency bin is copied into the target STFT signal . Otherwise, it is copied into the residual STFT signal block . Finally, the inverse discrete Fourier transform (IDFT) is used to obtain the time-domain signal blocks and of the target signal

(4)

and the residual signal

(5)

1) Proposed BM threshold estimate: The above splitting method has already been used in [24], however, with a single-value threshold for all frequency bins . In order to obtain a frequency dependent threshold for the binary mask, we use

(6)

where the power spectral density (PSD) estimate of the input signal is calculated by a recursively smoothed periodogram

(7)

In Eq. (7), the smoothing vector is defined as

(8)

with

(9)

and

(10)

In Eqs. (9) and (10), we use time constants of s, and s. The relatively short release time ensures that the threshold estimate follows the noise floor quickly, whereas the high attack time helps to preserve the threshold. This ensures that short transients, e.g., drum sounds, and other

Fig. 1. Block diagram of the BM process. The noisy mixture is split into a target signal holding the desired components of the signal, and a residual
signal containing mainly the noise. By comparing the target and residual spectrograms on the right side with the clean and noise spectrograms on the left
side, it is obvious that the BM process does not separate the signals perfectly. Further processing is required for high quality audio restoration results.

short-time instrumental events will not change the threshold too fast. The result is, that, after some seconds of adaptation time, the threshold signal gives a robust estimate of the noise PSD of the disturbed signal. By applying the binary masking procedure, the target signal frames will contain the main content of the music, and the residual will contain the noise as visualized in Fig. 1. However, some low-level parts of the music, that fall below the threshold , e.g., reverb tails, will remain in the residual signal. Finally, in Eq. (6) is needed to improve the threshold estimation and serves as a frequency dependent overestimation factor. This fixed real valued parameter compensates for the fact that music signals follow a 1/f-characteristic [25], i.e., that the long-term power spectrum of music shows a decaying slope of 3 dB per octave, or 10 dB per frequency decade. Since most impulsive noise disturbances are characterized by a flat spectrum with high energy also in the upper frequency range, there is an increasing offset in the estimate over frequency. The lack of energy of music in the high frequency range will cause the threshold estimate to be too low in that range. For compensation, the decay parameter is needed, being

(11)

This invokes a slope of dB/dec. to be added to the threshold estimate. Intuitively, if music entails a slope of 10 dB/dec., a value of about 5 dB/dec. for will help to adjust the threshold properly (assuming 0 dB/dec. for the noise disturbance).

Fig. 1 shows an example of a BM process. By comparing the target and residual spectrograms on the right side with the clean and noise spectrograms on the left side, it is obvious that the BM process does not separate the signals perfectly. Musical noise is present in the target signal, and the residual might still contain some components of the desired signal. Please note, that listening examples are available at [26]. Hence, for high quality audio restoration, it is mandatory to apply further processing, preferably within the residual signal, and finally recombining target and residual to form the restoration result.

2) Ideal Binary Mask: Besides other BM estimation methods that exist in the literature (cf. e.g. [27]), the ideal binary mask (IBM) was proposed in [28] and is formulated in [23] in detail. As described in [27], it is obtained by comparing the true instantaneous SNR with a fixed threshold (typically 0 dB) for each T-F bin. According to [29] and others, the IBM offers the highest SNR improvement considering the target signal versus the degraded signal. As a matter of course, the true instantaneous SNR is unknown for real-world restoration tasks. For comparison we will show IBM results as well in Section IV as a quality level for the objective measurements.

B. Autoregressive Detection

As already stated in Section I, several noise types with supergaussian distribution, such as e.g. rain noise, can be seen as a vast number of small impulses per time instant. The task of an impulse detection algorithm is to localize impulse disturbances in time with high accuracy since a low false alarm rate of such an algorithm helps to preserve the unaffected parts within the desired signal. As shown in [30], the AR method works best in terms of missing detection rate and lowest false alarm rate compared to other methods. This detection method originally has been introduced in [31], [32] and is recommended in standard audio restoration literature like [15], [20]. Here, we use AR detection and interpolation within the residual signal. An AR coefficient estimate can be obtained from a signal block using e.g. the Yule-Walker or the Burg method [33]. It is shown in [30] that the AR model order usually can be quite low (as long as is fulfilled) for reliable detection results. We choose . An error signal is then calculated by applying the well-known AR prediction error formula

(12)

on the residual block , with being the estimated AR coefficients. For the case that no impulsive event is present in , the prediction error will be low. But as soon as an impulse occurs within , the misfit in prediction leads to an error signal of high magnitude. Thus, as suggested in the literature mentioned above, impulsive disturbances can be detected by thresholding the squared error , leading to a binary detection sequence . However, instead of thresholding the squared error signal we introduce a fixed detection percentage to select a fixed number of samples in each processed block. If , no samples of the signal block shall be detected as impulsive and, thus, no interpolation will follow. On the contrary, if , all of the samples within the block shall be treated as impulsive (this of course makes no sense, since the subsequent AR interpolator would not work without any non-impulsive samples remaining in the block). If is set to a value between 0 and 1, e.g., , then only 10% of the samples of the block shall be detected as impulsive, namely those 10% that have the highest squared error . This is done by sorting the samples of within the current block descending by energy, whilst memorizing the sample indices, taking the first percent of the sorted index table, and assigning logical 1 to the detection output sequence at the cropped index table positions. The remaining values for are set to 0, indicating the non-impulsive samples.

Fig. 2. Block diagram showing the partitioning of signal vector into known samples and unknown samples , and the reassembling of and the interpolation result that replaces the discarded samples , to obtain the interpolated audio signal .
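As a rough illustration of Sections II-A and II-B, the sketch below splits one STFT frame by a per-bin power threshold and then marks a fixed fraction of the samples in a time-domain block as impulsive, ranked by squared AR prediction error. It is not the authors' implementation: the function names, the Yule-Walker estimator written out with NumPy, and the argument `fraction` (standing in for the detection percentage) are assumptions made for this example, and the threshold is simply passed in rather than estimated as in Eqs. (6)-(10).

```python
import numpy as np


def binary_mask_split(spectrum, threshold):
    """Split one STFT frame into target and residual (hypothetical helper).

    Bins whose squared magnitude exceeds the per-bin threshold go to the
    target; all other bins go to the residual, so target + residual == spectrum.
    """
    above = np.abs(spectrum) ** 2 > threshold
    target = np.where(above, spectrum, 0.0)
    residual = np.where(above, 0.0, spectrum)
    return target, residual


def yule_walker(block, order):
    """Estimate AR coefficients a_1..a_P with x[n] ~ sum_k a_k * x[n-k]."""
    x = np.asarray(block, dtype=float)
    n = len(x)
    # Biased autocorrelation estimate for lags 0..order.
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)]) / n
    # Toeplitz autocorrelation matrix of the Yule-Walker equations.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])


def prediction_error(block, coeffs):
    """AR prediction error; the first len(coeffs) samples are left at zero."""
    x = np.asarray(block, dtype=float)
    order = len(coeffs)
    err = np.zeros_like(x)
    for n in range(order, len(x)):
        past = x[n - order : n][::-1]  # x[n-1], x[n-2], ..., x[n-order]
        err[n] = x[n] - np.dot(coeffs, past)
    return err


def detect_fixed_percentage(block, order, fraction):
    """Flag the `fraction` of samples with the highest squared AR error."""
    coeffs = yule_walker(block, order)
    squared_error = prediction_error(block, coeffs) ** 2
    count = int(round(fraction * len(block)))
    detected = np.zeros(len(block), dtype=bool)
    if count > 0:
        # Sort descending by squared error and keep the first `count` indices.
        strongest = np.argsort(squared_error)[::-1][:count]
        detected[strongest] = True
    return detected
```

With `fraction=0.0` nothing is flagged, and with `fraction=0.1` exactly one tenth of the block (rounded to the nearest integer) is flagged, regardless of how large the prediction errors actually are. That fixed count is the property that distinguishes this rule from a conventional error threshold.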
C. Autoregressive Interpolation

After detection, the impulsive samples have to be interpolated. In the literature, several methods have been proposed for this task, e.g., median filtering [34], neural network approaches like in [24], or frequency-domain based approaches as described in [35]. The latter is well-suited for the restoration of clipped audio signals. However, according to [22] and others, the time-domain least-squares AR-based interpolator, proposed in [36] and [17], is a very effective tool for audio restoration. To obtain the interpolation rule, Eq. (12) is written in a matrix/vector form, and minimized in the least-squares sense [17], [20].

(13)

Please note, that the dependency of the block index is omitted in Eq. (13), for reasons of readability. In Eq. (13),

(14)

is a vector consisting of the non-impulsive or known samples within the audio signal (the operator calculates the nearest integer), i.e., is a subset of the vector

(15)

of length that the detection algorithm identified as clean, non-impulsive samples, in order of their appearance in . is the interpolator solution for the impulsive or unknown samples within in the least-squares sense. The residual signal block is interpolated, using the block AR interpolator given by Eq. (13). The detected unknown samples in are replaced by the interpolation result , to form a processed residual signal , resp. . The separation and reassembling procedure is shown in Fig. 2. The matrices and from Eq. (13) are obtained by column-wise partitioning of a matrix of size , consisting of the AR coefficients, according to the positions of known and unknown samples and (indicated by ) within a signal block, as visualized in Fig. 3. The AR order is chosen to be , according to the findings in [21]. Multiple iterations of detection and interpolation could be calculated to improve the AR estimation. However, the overall improvement for the restored signal would be rather small (cf. [21]).

Fig. 3. Partitioning of the AR coefficient matrix into matrices and , according to the positions of known and unknown samples in a signal block.

D. Spectral Correction and Recombining

The penultimate step is a spectral correction of the interpolated residual. After the time-domain interpolation of , it may happen that some frequency bins in the spectral representation , which have been set to 0 by

Fig. 4. Zoomed sections of the squared magnitude spectrum, (a) of the binary
masking residual , (b) after interpolation, and (c) after interpolation and
spectral correction . Arrows indicate audible energy regain caused by
interpolation.

the BM process, regain energy (as visualized in Fig. 4(b), by arrows). This phenomenon is perceivable, especially in the low-frequency range, as rumbling. Therefore, we suggest to set such bins back to zero again in , the spectral representation of . Although this procedure has slight influence on the whole time-domain signal block again, the interpolated sections remain reduced in power. The final step of the proposed algorithm is to recombine target and residual, i.e., adding back to the target signal , to obtain the restored audio block

(16)

Finally, applying a Hann-window and overlap-add procedure [37] results in the full-length restored audio signal . Fig. 5 shows a block diagram of the whole algorithm.

The following section summarizes briefly the findings of [19] about the analytic prediction of the noise reduction level, i.e. about the choice of the parameter , that can be obtained by binary masking in combination with AR interpolation. For more details, the interested reader is referred to [19].

Fig. 5. Block diagram of the algorithm. Grey circles indicate the algorithm's fundamental parameters (decay), and detection percentage .

III. ANALYTIC PREDICTION OF NOISE REDUCTION

Some of the theoretical aspects of this paper were addressed in an earlier contribution by the authors [19]. Assuming a BM scheme as introduced in the previous section, and assuming the split-apart residual signal having a white spectrum (which in fact is a patchy spectrum, since multiple T-F bins are set to zero during the masking process), it was shown that the proposed AR detection and interpolation process equals to setting the highest energy time domain samples of the residual noise signal to zero. This, furthermore, means that the time sample histogram of the residual noise gets truncated from both sides, depending on the detection percentage . Knowing the inverse cumulative density function (ICDF) of the noise, which is given in Table I, the cropping borders and of the histogram can be calculated from as (cf. [19])

(17)

Then, the variance , and thereby the power of the noisy residual is defined by

(18)

with being the statistical probability density function (PDF) of the noise process (also listed in Table I). Finally, the interpolated noise level in dB is expressed as

(19)

Summarized, the noise power after AR interpolation can be calculated by using the well-known statistic variance calculation integral with the PDF, and taking the concatenation borders as the variance integral's delimiters from the ICDF. This theory was verified by simulations in [19], using Gaussian, Laplacian, and modified Cauchy noise with a variable density parameter . These noises offer distinct sound characteristics and different values of the sample kurtosis , which can be seen as a measure for the peakiness or impulsiveness of the noise (cf. Table I). The simulation results of [19] showed that the level of noise reduction increases with growing detection percentage , and

that the amount of noise reduction is also dependent on the sample kurtosis of the noise signal. The higher the sample kurtosis, i.e., the more peaky or impulsive the noise is, the more noise reduction can be obtained at a fixed detection percentage . Fig. 6 shows the analytic prediction of the achievable SNR improvement over , for white Gaussian and white Cauchy noise , derived from Eqs. (17)–(19). Please note, that although Fig. 6 depicts the whole range of from 0 to 100%, only values below 50% should be used, since for proper AR interpolation the number of unknown samples should not exceed the number of known samples. Additionally, using high values can reduce the residual signal level too much, so that musical noise might become audible in the restoration result.

TABLE I
THREE DIFFERENT NOISE TYPES AND THEIR PDFS/ICDFS, SAMPLE KURTOSIS , AND PERCEIVED SOUND IMPRESSION. GAUSSIAN AND LAPLACE DENSITY FORMULAE ARE OF ZERO MEAN VALUE AND VARIANCE ONE. THE MODIFIED CAUCHY DISTRIBUTION OFFERS A DENSITY PARAMETER . DENOTES THE INVERSE ERROR FUNCTION

Fig. 6. Analytic prediction of the achievable SNR improvement over the detection percentage for white Cauchy noise (solid) and white Gaussian noise (dashed).

In the following section, these findings will be verified by objective measures on degraded audio signals, that are processed with the proposed implementation of the algorithm described in Section II. Furthermore, a state-of-the-art algorithm will be tested versus the proposed algorithm. The analytic prediction will help to adjust the performance of the proposed algorithm to the performance of the state-of-the-art algorithm.

IV. OBJECTIVE EVALUATION

Three musical genres were used for the following evaluations: Classical music, pop music, and speech. From each genre, ten audio tracks of different artists were selected. All pieces are under Creative Commons licence or otherwise allowed for research. Some examples are located on the website to this article [26], namely:
• Symphony No. 5 by Gustav Mahler (genre classic)
• An industrial synth-pop song (genre pop)
• A spoken-word recording of the fable The North Wind and the Sun [38] in the German version (genre speech).

Each clean audio signal is of length 30 seconds and mixed to mono. Two types of noise disturbances are added to the clean audio signals. The first noise signal is Gaussian white noise having a kurtosis of . The second noise signal is a supergaussian white noise generated by the modified Cauchy noise generator published in [39], having a kurtosis of (density parameter ). Although the proposed BMRI algorithm also works for colored noise, only white noise signals will be tested here to be in accordance to the analytic predictions of Section III. Both white noise signals are mixed to the clean audio signals to form a disturbed signal at different input SNRs. Two restoration algorithms will be compared:
• The least squares autoregressive (LSAR) method, proposed by Vaseghi [15]. This algorithm is a well-known standard in professional audio restoration. We use the source code provided on the webpage of Nuzman [21].
• The proposed algorithm, in the following termed as binary mask residual interpolation method (BMRI).

For both algorithms, the detection AR order and AR interpolation order were set to 32 samples, according to the findings in [21]. It is hardly possible for the LSAR algorithm to predict the SNR improvement in dB a priori, so we decided to use default parameters (threshold , fatness , one iteration only, cf. [21] for further information). The audio material was processed using the given setting, and the SNR improvement was measured afterwards. Based on this, the BMRI algorithm's detection percentage was set accordingly, using the analytic prediction results from Fig. 6, to achieve similar SNR improvement as the LSAR algorithm.

A. Objective Quality Measures

Two objective quality measures are applied to evaluate the performance of the algorithms. After introducing the measures in the following, the measurement results will be presented. From the great variety of objective measures [40] that, in general, can be used to assess the performance of noise reduction algorithms, we chose the SNR improvement measure and the overall Perceptual Similarity Measure (PSMt) of the Oldenburg Perception Model (PEMO) [41] based on the auditory model by

Dau et al. [42]. The SNR measure is a frequently used quality measure that is easy to calculate, but as stated in [43], not always shows high correlation with subjective results. Perceptual measures are better suited to determine the perceived signal quality [44]. Since the commonly used perceptual evaluation of speech quality scores (PESQ) [45] is limited to 16 kHz sampling frequency and speech only, we decided to supplement the evaluation by the PEMO-based perceptual measure PSMt. Overall trends between results for PESQ and PSMt are similar. However, PEMO allows the use of audio material of higher sampling frequencies and is not limited to speech signals. Furthermore, it has been shown that PEMO is well-suited for the evaluation of noise reduction tasks [44].

1) SNR measure: The input and output signal-to-noise ratios ( and ) can be determined in case the clean audio signal is available. The can be set by mixing the normalized clean audio signal and the normalized noise disturbance signal with adjusted levels. After processing by the denoising algorithm, the (in dB) can be calculated as the average SNR over all discrete time frames by [40]

(20)

where is the clean audio signal without noise, is the restored audio signal after processing, is the set of frames with speech activity and its cardinality [46]. The SNR improvement is then obtained by

(21)

It is common practice to plot the over the to visualize the input-output behavior of the investigated system.

2) Overall Perceptual Similarity Measure (PSMt) using PEMO: Since the SNR measures do not incorporate any knowledge about the human auditory system [43] and it has been shown that taking such information into account in objective quality assessment [44] is of importance, we furthermore calculate the Perceptual Similarity Measure (PSM) which was originally developed to predict quality degradations of broadband audio signals and which is based on the linear cross correlation coefficient of internal representations of signal pairs in blocks of 10 ms length [47]. The measure PSMt is the 5th percentile of the PSM output, calculating the perceptual distance between a test signal and a reference signal in a range between zero and one. A PSMt value of zero means no similarity, whereas a value of one stands for both test and reference signal being perceptually identical. PSMt showed high correlation with subjective ratings [41], [44]. For the input PSMt measure , the clean signal serves as reference, and a degraded signal at a given SNR is used as test signal. The output PSMt measure is calculated using the clean audio signal as reference again, and the restored audio signal as test signal. As before for the SNR measures, the PSMt improvement is calculated by subtracting PSMt from , and finally, plotted over .

B. Measurement Results

The top row of Fig. 7 shows the SNR improvement over the input SNR, , for the LSAR algorithm (left panel), the proposed BMRI algorithm with a matched setting (middle panel), and again the BMRI with a high setting (right panel). The bottom row of the plots shows the perceptual measure over the input SNR. Each panel contains three curves for the genres pop, classical music, and speech in white Cauchy noise (black lines), and three for the genres in Gaussian noise (dark-grey lines), plus the neutral line (dashed, no improvement). Furthermore, two dashed light-grey lines indicate the performance of the IBM (diamond markers) and the proposed BM (triangle markers) on classical music and white Cauchy noise as best performing genre, both without interpolation and recombination of target and residual, i.e. measured on the extracted target signals alone. They define an example for floor and ceiling of performance for standalone binary masking, highlighted as a light-grey corridor in Fig. 7.

The matched setting (middle panels) means that the SNR improvement of the BMRI is set equal to the measured SNR improvement of the LSAR algorithm at its highest point (speech at 0 dB input SNR, white Cauchy noise). The reason for choosing that point is, that speech, with its inherent pauses, at a low input SNR, comes closest to the condition of a theoretically white BM residual signal with no tonal components. This is the preliminary assumption that has to be satisfied in order to predict the SNR improvement of the BMRI algorithm in dependence of the detection percentage parameter and vice versa (cf. Section III). The LSAR algorithm offers a dB at that point, so according to Fig. 6, the BMRI detection percentage was set to , resulting in a very close match of BMRI and LSAR for speech at 0 dB SNR in white Cauchy noise. For the high setting of the BMRI in the right panel, a detection percentage with a theoretical dB was chosen, to investigate possible degradations of the desired signal by interpolation artefacts at higher values.

C. Discussion

The results for the LSAR algorithm (left panels in Fig. 7) show a good SNR improvement for the white Cauchy noise at low input SNR. Towards higher input SNR, the improvement drops, and even reaches negative values for pop music at 20 dB input SNR, as indicated by the standard deviation bars. Here, the desired signal gets affected by the LSAR algorithm. Since at 20 dB input SNR the white Cauchy noise is quite low compared to the desired signal, it is obvious that transients like e.g. drum sounds cause high AR error signals that trigger the interpolation threshold and thus get degraded. The speech and classical music signals achieve the best SNR improvement, since their inherent transient sounds (like e.g. plosives) are lower in energy than transients in pop music. For the Gaussian noise there is no SNR improvement at all for low SNR, and finally, some negative improvement at 15 and 20 dB input SNR, for the same reasons mentioned above. Compared to the standalone performance of the proposed BM, the LSAR seems to be worse at low input SNR. However, this is an erroneous belief, as the corridor in the lower panel confirms. The proposed BM's target signal is practically free of white Cauchy noise, but yet

Fig. 7. SNR improvement (top row) and PSMt improvement (bottom row) over input SNR, for LSAR (left column) and for BMRI (matched setting, middle column, and high setting, right column). Speech (circles), classic (crosses) and pop (squares) are degraded by the supergaussian white Cauchy noise (black lines) and Gaussian noise (dark-grey lines). The dashed, light-grey lines show the performance of the IBM (diamonds) and the proposed BM (triangles) on classical music and white Cauchy noise (best performer) obtained from the respective target signals (no recombination of target and residual), and thereby define an area of performance for standalone binary masking without interpolation (light-grey area). Remark: Marker positions are shifted on the x-axis for better visibility.

dominated by a lot of musical noise. This perceptual misfit is not reflected within the SNR measure. Therefore, the PSMt curves in the bottom left panel of Fig. 7 show a different behavior. The highest PSMt improvement for white Cauchy noise at low input SNR is now obtained for pop and classical music and not for speech, as it was for the SNR measures. By listening to the degraded input audio files [26], it can be noticed that the white Cauchy noise is more annoying within speech than it is within the wide-band pop and classical music, although the mixing of clean audio and noise at distinct SNRs was carried out carefully. This is an effect of auditory masking [48]. If a single sine tone and a stationary narrow-band noise masker of the same center frequency are presented to a human listener at the same time, the listener will not be able to detect the tone within the noise at a certain signal-to-masker ratio (SMR) [1]. This works vice versa, meaning that noise can be masked by a tone, although this requires a higher SMR [49]. The auditory masking effects are taken into account within modern audio codecs, and they are incorporated as well in the auditory model [42] of PEMO-Q. Now, considering wide-band classical or pop music, more noise is masked within the auditory system than it is by the sparsely filled spectrum of speech. This finally leads to the lower PSMt improvement for speech in the PSMt plots, and might also explain the decline in PSMt improvement towards higher SNRs for pop and classical music. The grey curves for Gaussian noise in the bottom left panel show no PSMt improvement for that noise type, neither positive nor negative, although there is some negative SNR improvement at 15 and 20 dB input SNR, mainly for pop music, in the upper left panel. Here as well, auditory masking effects might hide the fact that the desired signal gets affected by the restoration algorithm.

The middle panels of Fig. 7 show the results for the BMRI algorithm, configured with the matched setting. Looking at the upper middle panel, the 0 dB input SNR point for speech in white Cauchy noise offers the same amount of SNR improvement as the LSAR algorithm in the upper left panel. However, the proposed BMRI algorithm yields higher SNR improvement for the upper input SNRs, giving no negative SNR improvement any more, and overall higher improvement values for all SNR conditions. Even for Gaussian noise, SNR improvement is visible. The ability of the BMRI algorithm to reduce Gaussian noise disturbances has already been shown in [19]. The best improvement is obtained for speech and classical music, both for Gaussian and supergaussian disturbance. The considerably better results for BMRI, especially at high input SNR, are due to the carefully elaborated BM threshold criterion in combination with the constraint to only interpolate a fixed amount of samples per signal block.

In the right column of Fig. 7, the results for the BMRI algorithm with the high setting are plotted. Even at this fairly high level of noise reduction, there are almost no negative SNR improvement values at input SNRs of 15 and 20 dB for either noise type, in contrast to the LSAR algorithm. This means that the BMRI affects the desired signal less at high input SNR than LSAR does, although the amount of noise reduction is higher. In the bottom right panel, the black curves for white Cauchy noise show the same tendency over input SNR as in the middle panel for the matched setting, but at a clearly higher PSMt improvement. Finally, the dark-grey curves also show more perceptual improvement now, confirming the ability of the proposed BMRI algorithm to reduce Gaussian noise as well.

It is worth mentioning that both LSAR and BMRI outperform the ideal binary mask for high input SNR in terms of perceptual measurement. Informal listening confirms that the IBM produces musical noise and artefacts even at an SNR of 15 and 20 dB and therefore gives poor perceptual performance.
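To make the SNR improvement values discussed above more concrete, the following Python sketch computes a frame-averaged SNR before and after restoration and takes their difference. It is only an illustration under stated assumptions: the per-frame SNR definition, the function names, the frame length and the toy signals are hypothetical and are not taken from the paper, whose Eq. (20) may differ in detail.

```python
import math

def frame_snr_db(clean, test):
    """SNR of one frame in dB: clean energy over the energy of (clean - test).

    Assumes the frame is neither silent nor perfectly restored, otherwise the
    logarithm or the division is undefined.
    """
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - t) ** 2 for s, t in zip(clean, test))
    return 10.0 * math.log10(signal_energy / error_energy)

def averaged_snr_db(clean, test, frame_len, active_frames):
    """Average the per-frame SNR over a given set of active frame indices."""
    values = []
    for k in active_frames:
        lo, hi = k * frame_len, (k + 1) * frame_len
        values.append(frame_snr_db(clean[lo:hi], test[lo:hi]))
    return sum(values) / len(values)

def snr_improvement_db(clean, noisy, restored, frame_len, active_frames):
    """SNR improvement: output SNR minus input SNR, both against the clean signal."""
    snr_in = averaged_snr_db(clean, noisy, frame_len, active_frames)
    snr_out = averaged_snr_db(clean, restored, frame_len, active_frames)
    return snr_out - snr_in

# Toy example (made-up data): the restoration shrinks the error amplitude
# by a factor of ten, which corresponds to roughly 20 dB of improvement.
clean = [1.0, -1.0, 1.0, -1.0] * 2
noisy = [s + 0.1 for s in clean]
restored = [s + 0.01 for s in clean]
improvement = snr_improvement_db(clean, noisy, restored, frame_len=4, active_frames=[0, 1])
print(round(improvement, 6))
```

A negative return value corresponds to the case described in the discussion where the restoration damages the desired signal more than it removes noise.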

V. SUBJECTIVE EVALUATION

The previously described objective quality measures already reveal valuable quantitative properties of the examined audio algorithms. However, objective methods for predicting the overall quality perceived by human listeners may not fully correlate with the subjective opinion of test listeners [44], [50]. Therefore, it is often advantageous to measure the perceived quality also by means of subjective listening tests. In this contribution, we assessed the subjectively perceived quality of the processed audio samples by means of subjective ratings of ten normal-hearing persons with musical background. A subset of the audio samples from Section IV was used (pop music, classical music, and speech), again degraded by adding white Cauchy noise at 10 dB and 20 dB SNR. LSAR and the proposed BMRI algorithm were applied for restoration. For the LSAR algorithm, the default setting was used, and for BMRI, the matched setting (cf. Section IV for the parameters of the settings).

The objective quality measures calculated in Section IV-B, as well as informal listening tests, reveal just minor improvement for the standard Gaussian noise. Higher settings of the detection percentage beyond 12.5% would be necessary to achieve similar noise reduction performance as in the white Cauchy case. If the detection percentage is raised, more and more samples of the residual get interpolated and the residual signal loses energy, and thus the musical noise of the target signal would become audible in the restoration result. In contrast to speech enhancement for, e.g., hands-free systems in cars or as pre-processing for automatic speech recognition (ASR), here it is crucial for high-quality audio restoration not to introduce any artefacts, even at the cost of obtaining less noise reduction than possible. Besides that, the objective measures showed that the common LSAR approach hardly improves Gaussian-disturbed signals at all. Informal listening to the example signals on the website [26] confirms this. For these reasons, we will focus on the white Cauchy noise only and omit the Gaussian noise type in the following subjective evaluations.

A. Evaluation Procedure
Two different subjective evaluation methods are used to analyze different aspects of the processed signals:

1) Pairwise comparison by a two-alternative forced choice test (2-AFC) [51]: The noisy test signals (classic, pop, speech) have either been processed by BMRI or LSAR or left unprocessed. For both noise conditions (10 and 20 dB SNR) and for each genre, three possible algorithm combinations had to be tested, namely BMRI vs. LSAR, BMRI vs. unprocessed, and LSAR vs. unprocessed, resulting in a total of 18 pairs. These pairs were presented in six runs at 36 random pair trials per run. The subjects were instructed to choose the audio sample with the preferred overall quality in each trial. To determine the rank order of the corresponding algorithms, the Bradley-Terry-Luce (BTL) model [52], [53] has been applied with a minimum required significance of 99% to declare a test valid.

2) A Mean Opinion Score (MOS) test procedure was used to evaluate the effect of the denoising algorithms on the residual noise and the target signal individually. It has been adopted from the ITU-T recommendation P.835 [54], commonly used to evaluate the performance of noise reduction algorithms in speech communication systems. For this test, the subjects were asked to assess the quality dimensions degradation of desired signal, annoyance of residual noise, and overall quality on a quasi-continuous MOS scale, ranging from one to five [44], [55] (1=bad or very annoying, 2=poor or annoying, 3=fair or slightly annoying, 4=good or perceptible but not annoying, 5=excellent or imperceptible), while they could switch seamlessly between a restored audio signal and a clean, undisturbed version of the signal. Besides the restored signals from the LSAR and BMRI algorithms, a third set of test signals was generated by mixing the clean signals with noise at a 15 dB better SNR than the SNR of the LSAR and BMRI input signals. This additional signal set serves as the imitation of an imaginary artefact-free denoising algorithm with 15 dB noise reduction, as quality reference (REF). Six runs were performed over all test files and subjects, where runs one and two were discarded (they served as training phase for the subjects). Results from runs three to six were averaged, as in [43].

B. Results of Listening Tests
1) 2-AFC test: The results of the AFC test evaluation are shown in Table II for an input SNR of 10 dB and in Table III for an input SNR of 20 dB, respectively. The proposed BMRI algorithm is preferred for both SNR conditions and all three musical genres. For all test scenarios, the unprocessed signal is rated worst in terms of overall signal quality. The BTL-model scale differences towards the first rank are given in parentheses for each algorithm [53]. Mostly, the distances to the third rank (unprocessed signal) are rather large, whereas the distances from the second rank (LSAR) to the BMRI tend to be smaller. This shows that both BMRI and LSAR achieve major improvement in quality over the unprocessed signal, but nevertheless, BMRI outperforms the common LSAR approach. For pop music at 10 dB SNR, a large BTL distance of the LSAR algorithm towards the first rank can be observed. This indicates the advantage of the proposed BMRI algorithm over the common approaches, e.g., in terms of being able to preserve desired transients like snare drums, as already explained in Section II-A. The significance level for all AFC tests is 99%. The consistency in Tables II and III is a measure for the amount of inconsistent subject ratings [56]. A consistency of zero means that the subjects' answers are completely contradictory, whereas a value of one means that there are no contradicting statements. All consistency values show that there were no contradictions amongst the subjects' individual ratings except for classical music at SNR 10 dB (consistency 0.9). This small deviation could arise from the fact that classical music offers a high dynamic range, and therefore, it might be more difficult to distinguish the restoration results of very silent parts in the music at that rather low SNR.

2) MOS test: Figs. 8 to 10 show the results of the conducted MOS rating regarding target signal degradation (cf. Fig. 8),

residual noise annoyance (cf. Fig. 9), and overall quality (cf. Fig. 10). It can be seen that the preferred choice of algorithm depends on the type of target signal that is processed. The medians reveal that the BMRI algorithm is preferred in comparison to the LSAR algorithm for all test signals over all three quality dimensions, as already indicated by the results of the 2-AFC test (cf. Tables II and III). Especially in terms of the annoyance of the residual noise signal (Fig. 9), the proposed BMRI algorithm outperforms the LSAR algorithm clearly. However, the ratings for the simulated artefact-free denoising (REF) indicate that there is still room for improvement of the BMRI algorithm.

TABLE II
RESULTS OF THE 2-AFC LISTENING TESTS AT 10 dB SNR

TABLE III
RESULTS OF THE 2-AFC LISTENING TESTS AT 20 dB SNR

Fig. 8. Results of the MOS test: Individual rating of the degradation of the target signal. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = the target signal is not audibly degraded, 1 = unacceptable degradation.
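The BTL ranking applied to the 2-AFC data in Section V can be illustrated with a short Python sketch. It fits Bradley-Terry-Luce strengths with a simple fixed-point iteration and reports log-scale distances to the first rank. This is a minimal sketch under stated assumptions: the iteration scheme, the function names `fit_btl` and `scale_differences`, and the preference counts are hypothetical and do not reproduce the procedure of [53] or the data of Tables II and III.

```python
import math

def fit_btl(wins, iterations=200):
    """Estimate BTL strengths from a matrix of pairwise preference counts.

    wins[i][j] is how often item i was preferred over item j. Every item
    must have been preferred at least once, otherwise its strength tends
    to zero and the log scale is undefined.
    """
    n = len(wins)
    strength = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strength = [s / norm for s in updated]  # keep strengths summing to one
    return strength

def scale_differences(strength):
    """Log-scale distance of each item to the best-ranked item (zero for the best)."""
    best = max(strength)
    return [math.log(s) - math.log(best) for s in strength]

# Made-up preference counts for illustration only (20 trials per pair).
labels = ["BMRI", "LSAR", "unprocessed"]
wins = [
    [0, 14, 19],  # BMRI preferred over LSAR 14 times, over unprocessed 19 times
    [6, 0, 17],   # LSAR preferred over BMRI 6 times, over unprocessed 17 times
    [1, 3, 0],    # unprocessed preferred once over BMRI, three times over LSAR
]
strength = fit_btl(wins)
diffs = scale_differences(strength)
for name, d in zip(labels, diffs):
    print(name, round(d, 3))
```

In this toy setting, a small distance between the first and second rank and a large distance to the third rank would mirror the qualitative pattern described for the listening test, but the numbers printed here carry no meaning beyond the example.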

Fig. 9. Results of the MOS test: Individual rating of the annoyance of the residual noise signal. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = the residual noise is not annoying / not audible at all, 1 = unacceptable annoyance by the residual noise.

Fig. 10. Results of the MOS test: Individual rating of the overall output signal quality. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = best overall quality, 1 = unacceptable overall quality of the restored signal.

VI. CONCLUSION
In this paper we presented an approach for the reduction of noise disturbances of Gaussian, supergaussian and impulsive characteristics. Although the approach is especially suited to reduce supergaussian and impulsive noise, Gaussian disturbances can also be reduced to a certain extent. High preservation of the desired signal is achieved by a carefully elaborated BM threshold criterion in combination with a modified AR detection and interpolation stage in the time domain. We showed by analytical treatment that introducing an AR detection percentage parameter allows for precise prediction of the noise reduction level for white disturbances. We tied in with these findings and verified the suitability of the approach for high-quality audio restoration by means of objective and subjective quality measures. The objective measures confirm the analytic prediction, and, together with the subjective listening tests, it is shown that the proposed approach is superior to the common LSAR detection and interpolation methods in audio restoration, especially for supergaussian noise types.

REFERENCES
[1] P. Vary and R. Martin, Digital Speech Transmission. Enhancement, Coding and Error Concealment, 1st ed. Chichester, U.K.: Wiley, 2006.
[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 1984.
[3] J. Porter and S. Boll, "Optimal estimators for spectral restoration of noisy speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1984, pp. 53-56.

[4] I. Cohen, "Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation," Speech Commun., vol. 47, no. 3, pp. 336-350, 2005.
[5] R. Martin, "Speech enhancement based on minimum mean-square error estimation and supergaussian priors," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 845-856, Sep. 2005.
[6] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741-1752, Aug. 2007.
[7] I. Andrianakis and P. R. White, "Speech spectral amplitude estimators using optimally shaped gamma and chi priors," ELSEVIER Speech Commun., vol. 51, no. 1, pp. 1-14, 2009.
[8] Z. Goh, K.-C. Tan, and B. Tan, "Postprocessing method for suppressing musical noise generated by spectral subtraction," IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 287-292, May 1998.
[9] D. Malah, R. V. Cox, and A. J. Accardi, "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1999, vol. 2, pp. 789-792.
[10] M. Brandt and J. Bitzer, "Optimal spectral smoothing in short-time spectral attenuation (STSA) algorithms: Results of objective measures and listening tests," in Proc. 17th Eur. Signal Process. Conf. (EUSIPCO'09), Aug. 2009, pp. 199-203.
[11] H. Gustafsson, S. E. Nordholm, and I. Claesson, "Spectral subtraction using reduced delay convolution and adaptive averaging," IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 799-807, Nov. 2001.
[12] C. Breithaupt, T. Gerkmann, and R. Martin, "Cepstral smoothing of spectral filter gains for speech enhancement without musical noise," IEEE Signal Process. Lett., vol. 14, no. 12, pp. 1036-1039, 2007.
[13] C. Breithaupt, T. Gerkmann, and R. Martin, "A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 4897-4900.
[14] T. Gerkmann and R. Martin, "On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling," IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4165-4174, 2009.
[15] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, 1st ed. Leipzig, Germany: Teubner, 1996.
[16] S. V. Vaseghi, "Algorithms for restoration of archived gramophone recordings," Ph.D. dissertation, Univ. of Cambridge, Cambridge, U.K., 1988.
[17] R. Veldhuis, Restoration of Lost Samples in Digital Signals. Englewood Cliffs, NJ, USA: Prentice-Hall, 1990.
[18] D. Richter, I. Kurreck, and D. Poetsch, "Restoration of optical variable density sound tracks on motion picture films by digital image processing," in Proc. Int. Conf. Optimiz. Elect. Electron. Equipment, 2000, vol. 3, pp. 793-798.
[19] M. Ruhland, S. Goetze, M. Brandt, S. Doclo, and J. Bitzer, "A new approach for reduction of supergaussian noise using autoregressive interpolation and time-frequency masking," in Proc. 13th Int. Workshop Acoust. Echo Noise Control, Aachen, Germany, Sep. 2012, pp. 1-4.
[20] S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration. London, U.K.: Springer, 1998.
[21] J. Nuzman, Audio Restoration: An Investigation of Digital Methods for Click Removal and Hiss Reduction, 2004 [Online]. Available: www.umiacs.umd.edu/jnuzman/audio/audio.pdf, last seen in March 2012
[22] M. Kahrs and K. Brandenburg, Applications of Digital Signal Processing to Audio and Acoustics, 1st ed. London, U.K.: Springer, 1998.
[23] D. L. Wang, "Time-frequency masking for speech separation and its potential for hearing aid design," Trends Amplificat., vol. 12, no. 4, pp. 332-353, 2008.
[24] A. Czyzewski, "Learning algorithms for audio signal enhancement: Part 1: Neural network implementation for the removal of impulse distortions," J. Audio Eng. Soc., vol. 45, no. 10, pp. 815-831, 1997.
[25] D. J. Levitin, P. Chordia, and V. Menon, "Musical rhythm spectra from Bach to Joplin obey a 1/f power law," in Proc. Nat. Acad. Sci., 2012 [Online]. Available: http://www.pnas.org/content/early/2012/02/14/1113828109.abstract, last seen in March 2012
[26] M. Ruhland, Website with Audio Examples to this Paper, 2014 [Online]. Available: http://tgm.jade-hs.de/Ruhland_2014_Reduction
[27] Y. Hu and P. C. Loizou, "Techniques for estimating the ideal binary mask," in Proc. 11th Int. Workshop Acoust. Echo Noise Control, 2008.
[28] G. Hu and D. L. Wang, "Speech segregation based on pitch tracking and amplitude modulation," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., 2001, pp. 79-82.
[29] D. P. Ellis, "Model-based scene analysis," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Piscataway, NJ, USA: Wiley/IEEE Press, 2006, pp. 115-147.
[30] I. Kauppinen, "Methods for detecting impulsive noise in speech and audio signals," in Proc. Int. Conf. Digital Signal Process., 2002, vol. 2, pp. 967-970.
[31] S. V. Vaseghi and P. J. W. Rayner, "A new application of adaptive filters for restoration of archived gramophone recordings," in Proc. Int. Conf. Acoust., Speech, Signal Process., 1988, pp. 2548-2551.
[32] S. V. Vaseghi and P. J. W. Rayner, "Detection and suppression of impulsive noise in speech communication systems," in Proc. IEEE Commun., Speech, Vis., 1990, vol. 137, pp. 38-46.
[33] J. G. Proakis and D. K. Manolakis, Digital Signal Processing, 4th ed. Upper Saddle River, NJ, USA: Prentice-Hall, Apr. 2006.
[34] N. Jayant, "Average- and median-based smoothing techniques for improving digital speech quality in the presence of transmission errors," IEEE Trans. Commun., vol. COM-24, no. 9, pp. 1043-1045, Sep. 1976.
[35] S. J. Godsill and P. J. W. Rayner, "Frequency-based interpolation of sampled signals with applications in audio restoration," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1993, vol. 1, pp. 209-212.
[36] A. J. E. M. Janssen, R. Veldhuis, and L. B. Vries, "Adaptive interpolation of discrete-time signals that can be modelled as AR processes," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1986, vol. 34, no. 2, pp. 317-330.
[37] K.-D. Kammeyer and K. Kroschel, Digital Signal Processing: Filtering and Spectral Analysis with MATLAB Exercises (Digitale Signalverarbeitung: Filterung und Spektralanalyse mit MATLAB-Übungen), 8th ed. Wiesbaden, Germany: Vieweg+Teubner-Verlag, 2012.
[38] Int. Phonetic Association, Handbook International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge, U.K.: Cambridge Univ. Press, Jun. 1999.
[39] J. H. McCulloch, Alpha-Stable Distributions in MATLAB, 1996 [Online]. Available: www.mathworks.com/matlabcentral/fileexchange/13619-toolbox-non-local-means/content/toolbox_nlmeans/toolbox/stabrnd.m, last seen in March 2012
[40] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC Press, 2007.
[41] R. Huber and B. Kollmeier, "PEMO-Q - a new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 1902-1911, Nov. 2006.
[42] T. Dau, D. Pueschel, and A. Kohlrausch, "A quantitative model of the effective signal processing in the auditory system," J. Acoust. Soc., vol. 99, no. 6, pp. 3615-3622, 1996.
[43] I. Kauppinen and K. Roth, "Improved noise reduction in audio signals using spectral resolution enhancement with time-domain signal extrapolation," IEEE Trans. Speech Audio Process., vol. 13, no. 6, pp. 1210-1216, 2005.
[44] T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Objective perceptual quality measures for the evaluation of noise reduction schemes," in Proc. 9th Int. Workshop Acoust. Echo Noise Control, 2005, pp. 169-172.
[45] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," Speech Commun., vol. 2, pp. 749-752, 2001.
[46] S. Goetze, V. Mildner, and K.-D. Kammeyer, "A psychoacoustic noise reduction approach for stereo hands-free systems," in Proc. 120th Conv. Audio Eng. Soc. (AES), 2006.
[47] HörTech, PEMO-Q (AudioQual and SpeechQual) Audio and Speech Quality Prediction Based on the Oldenburg Perception Model (PEMO) - Manual. Oldenburg, Germany: HörTech gGmbH, Kompetenzzentrum für Hörgeraete-Systemtechnik, 2010.
[48] B. C. J. Moore, An Introduction to the Psychology of Hearing. Leiden, The Netherlands: Brill, 2012.
[49] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, no. 4, pp. 451-515, Apr. 2000.
[50] S. Goetze, E. Albertin, J. Rennies, E. A. P. Habets, and K.-D. Kammeyer, "Speech quality assessment for listening-room compensation," J. Audio Eng. Soc., vol. 62, no. 6, pp. 386-399, Jun. 2014.
[51] L. L. Thurstone, "A law of comparative judgment," Psychol. Rev., vol. 34, no. 4, pp. 273-286, 1927.
[52] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, p. 324, Dec. 1952.
[53] K. Tsukida and M. R. Gupta, "How to analyze paired comparison data," May 2011.

[54] ITU-T, "Methods for objective and subjective assessment of quality (P.835)," Nov. 2003.
[55] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ, USA: Prentice-Hall, 1988.
[56] M. G. Kendall and B. B. Smith, "On the method of paired comparisons," Biometrika, vol. 31, no. 3/4, pp. 324-345, 1940.

Marco Ruhland studied electrical engineering at the Cooperative State University of Mosbach, Germany. He received his Dipl.-Ing. (BA) degree in 2001. After five years in the electrical construction group of the Michael Weinig AG, Tauberbischofsheim, he studied hearing technology and audiology at the Jade University of Applied Sciences of Oldenburg, Germany, graduating with the B.Eng. in 2010 and continuing with the master studies at the Carl-von-Ossietzky University of Oldenburg, Germany. He received his M.Sc. in 2012 and is now with the Fraunhofer Institute for Digital Media Technology (IDMT), project group Audio Technology for Assistive Systems, in Oldenburg, Germany. His main research interests are speech enhancement, speech recognition, event detection, and audio restoration algorithms.

Joerg Bitzer received his diploma in 1995 and his doctorate in electrical engineering in 2002 from the University of Bremen, where he also was a Research Assistant until 1999. From 2000 to 2003 he was Head of the algorithm development team at Houpert Digital Audio, a company specialized in audio signal processing. Since September 2003, he has been a Professor for audio signal processing at the Jade University of Applied Sciences Wilhelmshaven/Oldenburg/Elsfleth. Additionally, in 2010, he joined the Fraunhofer project group for hearing, speech, and audio technology in Oldenburg as a Scientific Supervisor and has been the Deputy Head of the Transfer Center for User-Oriented Assistance Systems since 2013. His current research interests include all forms of single- and multichannel speech enhancement, audio restoration, audio effects for musical applications, and information retrieval for large media archives.

Matthias Brandt was born in Bremen in 1980. He received his diploma in electrical engineering in 2008 from the University of Bremen, Germany. From 2009 to 2012 he was employed at the Jade University of Applied Sciences Oldenburg, Germany. Since 2012, he has been a Ph.D. student at the University of Oldenburg, Germany, in the field of audio restoration. His research focus is on the processing of music signals, from developing methods to extract parameters required for automatic denoising to creating electronic music in his spare time.

Stefan Goetze is Head of Audio System Technology for Assistive Systems at the Fraunhofer Institute for Digital Media Technology (IDMT), Project group Hearing, Speech and Audio (HSA) in Oldenburg, Germany. He received his Dipl.-Ing. in 2004 and Dr.-Ing. in 2013 at the University of Bremen, Germany, where he was a Research Engineer from 2004 to 2008. His research interests are sound pick-up, processing and enhancement, such as noise reduction, acoustic echo cancellation and dereverberation, as well as assistive technologies, human-machine interaction, detection and classification of acoustic events, and automatic speech recognition. He is a Lecturer at the University of Bremen and Project Leader of national and international research projects in the field of acoustics for ambient assisted living (AAL) technologies. He is a member of IEEE.
