Abstract: In this paper, we present a new approach for noise reduction. A binary time-frequency (T-F) masking threshold criterion is proposed and analyzed with respect to the average spectra of music and noise disturbances. Modified autoregressive (AR) detection and AR interpolation are then applied to the residual signal of the binary masking process. The proposed method is able to reduce supergaussian and impulsive noise while ensuring preservation of the desired signal, which is crucial for professional high-quality audio restoration, and it is also suitable for Gaussian noise to a certain extent. The approach is compared to a state-of-the-art restoration algorithm by means of the objective measures signal-to-noise ratio (SNR) improvement and perceptual quality, and by subjective listening tests. The objective results as well as the listening tests show that the proposed algorithm is especially suited for supergaussian, grainy-sounding noise types, e.g., optical soundtrack noise of celluloid movie footage, or rain noise.
Index Terms: Interpolation, noise reduction, optical soundtrack noise, time-frequency masking.
Manuscript received October 14, 2014; revised February 18, 2015; accepted May 29, 2015. Date of publication June 11, 2015; date of current version June 19, 2015. This work was supported in part by the German Federal Ministry of Education and Research (BMBF) under Grants 17N3008 and 03FH030PX2 and in part by the EU-FP7 project EcoShopping under Grant 609180. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. DeLiang Wang.
M. Ruhland and S. Goetze are with the Project Group Hearing, Speech and Audio Technology, Fraunhofer Institute for Digital Media Technology (IDMT), D-26129 Oldenburg, Germany (e-mail: marco.ruhland@idmt.fraunhofer.de).
J. Bitzer and M. Brandt are with the Institute for Hearing Technology and Audiology, Jade University of Applied Sciences, 26121 Oldenburg, Germany.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2015.2444664
I. INTRODUCTION
The term noise reduction is often associated with the removal or reduction of Gaussian noise disturbances. The assumption of a Gaussian random process is common for popular denoising algorithms, like the well-known Wiener filter [1] or the Ephraim-Malah method [2] developed in the 1980s. However, at the same time, Porter and Boll [3] showed that speech signals are rather characterized by leptokurtic, i.e., supergaussian, amplitude distributions, and that the error introduced by the deficient Gaussian assumption may be significant. This finding led to the development of more sophisticated approaches, e.g., [4] by Cohen, featuring two-sided Gamma and Laplace densities for the speech amplitude distributions, while still assuming a Gaussian distribution for the noise. Only a few months later, Martin [5] showed that car noise follows a Laplacian rather than a Gaussian distribution, and he showed how to employ this prior information within the minimum mean square error (MMSE) estimator for a denoising algorithm. This approach works particularly well at low signal-to-noise ratios (SNR) below 0 dB since the Laplacian noise assumption yields less musical noise in that case. Further refinement of the estimators is presented in [6], [7]. Besides these methods, several techniques exist to reduce the musical noise phenomenon, e.g., spectral peak elimination [8], spectral weighting [9], spectral domain smoothing [10], or smoothing of the spectral gain function [11]. For speech signals, it has been shown that cepstral smoothing is superior to spectral domain smoothing, since speech-relevant information like pitch and formant structure can be protected against smoothing within the cepstral domain [12]–[14].
The previously described denoising methods usually work in the frequency-domain. However, for the removal of impulsive disturbances, e.g., clicks caused by dust and scratches on a gramophone disc, time-domain interpolation methods are preferable. Since click disturbances are mostly single, sparse events in time, lasting only a few milliseconds, it is usually sufficient to replace the corrupted samples by interpolated values of the surrounding unaffected samples [15]. While frequency-based methods usually affect all samples of a signal block, better preservation of the desired signal is achieved by time-domain interpolation, since only a few samples are changed. Pioneering work on time-domain interpolation has been done independently by Vaseghi [16] and Veldhuis [17] for different applications.
However, in some cases, a strict distinction between impulse disturbance and hiss cannot be made easily. Imagine the noise of a heavy rainfall or applause, which is the result of a vast number of small impulsive events per time instant and cannot be regarded as single and sparse any more. Though the audible sensation is stationary, a certain granularity is perceived that allows the listener to identify the impulsive origin of the noise. Similar noise can be observed when listening to the sound of old celluloid movies of the optical soundtrack era. The footage decomposes with time and suffers from dust and mould, badly affecting the audio signal which is encoded in an optical soundtrack next to the picture information. The resulting noise is grainy and of supergaussian amplitude distribution. Broadcast media archives are asking for new restoration techniques to cope with such kinds of degradation. For an overview of restoration of optical soundtracks, see [18]. This kind of noise is problematic for frequency-domain as well as time-domain approaches.
In this contribution we propose a new hybrid algorithm, performing frequency-domain binary masking and time-domain
2329-9290 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
RUHLAND et al.: REDUCTION OF GAUSSIAN, SUPERGAUSSIAN AND IMPULSIVE NOISE 1681
interpolation to tackle the described problem. Comparing our method with a state-of-the-art restoration algorithm will give new ideas on how to cope with supergaussian noise.
The remainder of this contribution is organized as follows. Section II introduces our approach, being a combination of time-frequency (T-F) masking and AR detection and interpolation. In Section III, a brief summary of the theoretical analysis of the proposed algorithm is given based on a former contribution of the authors [19]. Then, the proposed algorithm is tested versus a state-of-the-art algorithm by means of objective quality measures and subjective listening tests in Sections IV and V, respectively. Finally, conclusions are drawn in Section VI.
II. THEORY
A. Binary Masking
A degraded audio signal at discrete time index is given as a set of sinusoidal basis functions plus a random process (cf. [20], [21]):
(1)
(2)
expressing Eq. (1) in a simpler fashion. In Eq. (2), represents the so-called target signal, as the sum of sinusoids, and the noisy residual signal. Several steps have to be performed to obtain these signals. First, the noisy signal is transformed into the frequency-domain, using a DFT of length samples at a sampling frequency Hz, with samples block shift, corresponding to 50% overlap, by
(3)
In Eq. (3), denotes the frequency index and the discrete frame index. The frequency-domain target signal and the residual signal are initialized with zeros. The absolute squared magnitude is compared to the binary masking threshold estimate for each frequency bin . The threshold estimate can be initialized with zeros before processing the first block , or, for faster initial response, with the squared magnitude of the first signal block, . If the squared magnitude for a certain frequency bin is above , the respective frequency bin is copied into the target STFT signal . Otherwise, it is copied into the residual STFT signal block . Finally, the inverse discrete Fourier transform (IDFT) is used to obtain the time-domain signal blocks and of the target signal
(10)
In Eqs. (9) and (10), we use time constants of s, and s. The relatively short release time ensures that the threshold estimate follows the noise floor quickly, whereas the high attack time helps to preserve the threshold. This ensures that short transients, e.g., drum sounds, and other
1682 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 10, OCTOBER 2015
Fig. 1. Block diagram of the BM process. The noisy mixture is split into a target signal holding the desired components of the signal, and a residual
signal containing mainly the noise. By comparing the target and residual spectrograms on the right side with the clean and noise spectrograms on the left
side, it is obvious that the BM process does not separate the signals perfectly. Further processing is required for high quality audio restoration results.
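The split shown in Fig. 1 can be sketched in a few lines. The snippet below is a minimal illustration and not the authors' implementation: `binary_mask_split` and its argument names are hypothetical, NumPy is assumed to be available, and the per-bin threshold is passed in as a given array rather than estimated with the attack/release smoothing described in the text. It compares the squared magnitude of each frequency bin of one STFT frame with the threshold and copies the bin either into the target or into the residual spectrum.

```python
import numpy as np

def binary_mask_split(frame_spectrum, threshold):
    """Split one STFT frame into target and residual spectra.

    frame_spectrum: complex array, one value per frequency bin.
    threshold: real array of the same shape (per-bin power threshold).
    Both names are illustrative; the threshold estimator is not shown.
    """
    frame_spectrum = np.asarray(frame_spectrum, dtype=complex)
    power = np.abs(frame_spectrum) ** 2
    above = power > threshold                         # binary mask: True -> target
    target = np.where(above, frame_spectrum, 0.0)     # bins above the threshold
    residual = np.where(above, 0.0, frame_spectrum)   # all remaining bins
    return target, residual

# Toy frame: one strong "tonal" bin among weak noise-like bins.
spectrum = np.array([0.1 + 0.0j, 2.0 + 1.0j, 0.2 - 0.1j, 0.05j])
thr = np.full(spectrum.shape, 0.5)
target, residual = binary_mask_split(spectrum, thr)
```

Because every bin ends up in exactly one of the two outputs, adding target and residual reproduces the input frame, which is what makes the later recombination of both branches possible.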
short-time instrumental events will not change the threshold too fast. The result is that, after some seconds of adaptation time, the threshold signal gives a robust estimate of the noise PSD of the disturbed signal. By applying the binary masking procedure, the target signal frames will contain the main content of the music, and the residual will contain the noise as visualized in Fig. 1. However, some low-level parts of the music that fall below the threshold , e.g., reverb tails, will remain in the residual signal. Finally, in Eq. (6) is needed to improve the threshold estimation and serves as a frequency dependent overestimation factor. This fixed real valued parameter compensates for the fact that music signals follow a 1/f-characteristic [25], i.e., that the long-term power spectrum of music shows a decaying slope of 3 dB per octave, or 10 dB per frequency decade. Since most impulsive noise disturbances are characterized by a flat spectrum with high energy also in the upper frequency range, there is an increasing offset in the estimate over frequency. The lack of energy of music in the high frequency range will cause the threshold estimate to be too low in that range. For compensation, the decay parameter is needed, being
(11)
This invokes a slope of dB/dec. to be added to the threshold estimate. Intuitively, if music entails a slope of 10 dB/dec., a value of about 5 dB/dec. for will help to adjust the threshold properly (assuming 0 dB/dec. for the noise disturbance).
Fig. 1 shows an example of a BM process. By comparing the target and residual spectrograms on the right side with the clean and noise spectrograms on the left side, it is obvious that the BM process does not separate the signals perfectly. Musical noise is present in the target signal, and the residual might still contain some components of the desired signal. Note that listening examples are available at [26]. Hence, for high quality audio restoration, it is mandatory to apply further processing, preferably within the residual signal, and finally recombining target and residual to form the restoration result.
2) Ideal Binary Mask: Besides other BM estimation methods that exist in the literature (cf. e.g. [27]), the ideal binary mask (IBM) was proposed in [28] and is formulated in [23] in detail. As described in [27], it is obtained by comparing the true instantaneous SNR with a fixed threshold (typically 0 dB) for each T-F bin. According to [29] and others, the IBM offers the highest SNR improvement considering the target signal versus the degraded signal. Naturally, the true instantaneous SNR is unknown for real-world restoration tasks. For comparison we will show IBM results as well in Section IV as a quality level for the objective measurements.
B. Autoregressive Detection
As already stated in Section I, several noise types with supergaussian distribution, such as rain noise, can be seen as a vast number of small impulses per time instant. The task of an impulse detection algorithm is to localize impulse disturbances in time with high accuracy, since a low false alarm rate of such an algorithm helps to preserve the unaffected parts within the desired signal. As shown in [30], the AR method works best in terms of missing detection rate and lowest false alarm rate compared to other methods. This detection method was originally introduced in [31], [32] and is recommended in standard audio restoration literature like [15], [20]. Here, we use AR detection and interpolation within the residual signal. An AR coefficient estimate can be obtained from a signal block using e.g. the Yule-Walker or the Burg method [33]. It is shown in [30] that the AR model order usually can be quite low (as long as is fulfilled) for reliable detection results. We choose . An error signal is then calculated by applying the well-known AR prediction error formula
(12)
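AR-based impulse detection as described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact detector: the function names are hypothetical, NumPy is assumed, the coefficients are estimated with the Yule-Walker (autocorrelation) method, the model order of 4 is an arbitrary choice for the demo, and the decision rule simply flags a fixed fraction of samples with the largest absolute prediction error, in the spirit of the detection percentage discussed in Section III.

```python
import numpy as np

def yule_walker(x, order):
    """Estimate AR coefficients a_1..a_p with the autocorrelation method."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order].
    r = np.array([np.dot(x[: n - lag], x[lag:]) / n for lag in range(order + 1)])
    # Solve the Toeplitz system R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def prediction_error(x, a):
    """e[k] = x[k] - sum_i a[i-1] * x[k-i]; the first len(a) samples stay 0."""
    x = np.asarray(x, dtype=float)
    p = len(a)
    e = np.zeros_like(x)
    for k in range(p, len(x)):
        e[k] = x[k] - np.dot(a, x[k - p:k][::-1])
    return e

def detect_impulses(x, order, fraction):
    """Flag the given fraction of samples with the largest |prediction error|."""
    e = np.abs(prediction_error(x, yule_walker(x, order)))
    count = int(round(fraction * len(x)))
    flagged = np.zeros(len(x), dtype=bool)
    if count > 0:
        flagged[np.argsort(e)[-count:]] = True
    return flagged

rng = np.random.default_rng(0)
block = 0.1 * rng.standard_normal(512)
block[100] += 5.0   # synthetic clicks
block[300] -= 5.0
mask = detect_impulses(block, order=4, fraction=0.02)
```

Flagging a fixed fraction per block, instead of comparing the error with an absolute threshold, bounds how many samples the subsequent interpolation is allowed to change.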
Please note, that the dependency of the block index is omitted in Eq. (13), for reasons of readability. In Eq. (13),
(14)
is a vector consisting of the non-impulsive or known samples within the audio signal (the operator calculates the nearest integer), i.e., is a subset of the vector
(15)
of length that the detection algorithm identified as clean, non-impulsive samples, in order of their appearance in . is the interpolator solution for the impulsive or unknown samples within in the least-squares sense. The residual signal block is interpolated, using the block AR interpolator given by Eq. (13). The detected unknown samples in are replaced by the interpolation result , to form a processed residual signal , resp. . The separation and reassembling procedure is shown in Fig. 2. The matrices and from Eq. (13) are obtained by column-wise partitioning of a matrix of size , consisting of the AR coefficients, according to the positions of known and unknown samples and (indicated by ) within a signal block, as visualized in Fig. 3. The AR order is chosen to be , according to the findings in [21]. Multiple iterations of detection and interpolation could be calculated to improve the AR estimation. However, the overall improvement for the restored signal would be rather small (cf. [21]).
D. Spectral Correction and Recombining
The penultimate step is a spectral correction of the interpolated residual. After the time-domain interpolation of , it may happen that some frequency bins in the spectral representation , which have been set to 0 by
Fig. 4. Zoomed sections of the squared magnitude spectrum, (a) of the binary
masking residual , (b) after interpolation, and (c) after interpolation and
spectral correction . Arrows indicate audible energy regain caused by
interpolation.
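Fig. 4 suggests that time-domain interpolation can put energy back into frequency bins that the binary mask had emptied, and that the spectral correction removes this regained energy. The exact rule of Eq. (16) is not reproduced here, so the following is only a sketch under the assumption that the correction re-zeroes exactly those bins; `spectral_correction` and its argument names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def spectral_correction(interpolated_spectrum, residual_spectrum):
    """Zero every bin of the interpolated residual that was empty before interpolation.

    Assumption: a bin counts as "masked out" when the residual spectrum
    produced by the binary mask is exactly zero in that bin.
    """
    interpolated_spectrum = np.asarray(interpolated_spectrum, dtype=complex)
    was_empty = np.asarray(residual_spectrum) == 0
    return np.where(was_empty, 0.0, interpolated_spectrum)

# Bins 0 and 2 were masked out; interpolation leaked some energy into them.
residual = np.array([0.0, 0.3 + 0.1j, 0.0, 0.2j])
after_interp = np.array([0.05j, 0.25 + 0.1j, 0.02, 0.15j])
corrected = spectral_correction(after_interp, residual)
```

Bins that carried residual energy before interpolation pass through unchanged, so only the leakage into formerly empty bins is removed.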
(16)
Finally, applying a Hann-window and overlap-add procedure [37] results in the full-length restored audio signal . Fig. 5 shows a block diagram of the whole algorithm.
The following section briefly summarizes the findings of [19] about the analytic prediction of the noise reduction level, i.e. about the choice of the parameter , that can be obtained by binary masking in combination with AR interpolation. For more details, the interested reader is referred to [19].
III. ANALYTIC PREDICTION OF NOISE REDUCTION
Some of the theoretical aspects of this paper were addressed in an earlier contribution by the authors [19]. Assuming a BM scheme as introduced in the previous section, and assuming the split-apart residual signal having a white spectrum (which in fact is a patchy spectrum, since multiple T-F bins are set to zero during the masking process), it was shown that the proposed AR detection and interpolation process is equivalent to setting the highest energy time domain samples of the residual noise signal to zero. This, furthermore, means that the time sample histogram of the residual noise gets truncated from both sides, depending on the detection percentage . Knowing the inverse cumulative density
Then, the variance , and thereby the power of the noisy residual is defined by
(18)
with being the probability density function (PDF) of the noise process (also listed in Table I). Finally, the interpolated noise level in dB is expressed as
(19)
In summary, the noise power after AR interpolation can be calculated by using the well-known statistical variance integral with the PDF, and taking the concatenation borders as the variance integral's delimiters from the ICDF. This theory was verified by simulations in [19], using Gaussian, Laplacian, and modified Cauchy noise with a variable density parameter . These noises offer distinct sound characteristics and different values of the sample kurtosis , which can be seen as a measure for the peakiness or impulsiveness of the noise (cf. Table I). The simulation results of [19] showed that the level of noise reduction increases with growing detection percentage , and
TABLE I
THREE DIFFERENT NOISE TYPES AND THEIR PDFS/ICDFS, SAMPLE KURTOSIS , AND PERCEIVED SOUND IMPRESSION.
GAUSSIAN AND LAPLACE DENSITY FORMULAE ARE OF ZERO MEAN VALUE AND VARIANCE ONE. THE MODIFIED
CAUCHY DISTRIBUTION OFFERS A DENSITY PARAMETER . DENOTES THE INVERSE ERROR FUNCTION
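For the Gaussian case of Table I, the prediction idea of Section III (truncate the amplitude histogram from both sides at borders taken from the ICDF, then integrate the variance between those borders) can be sketched with the Python standard library alone. This is an illustration, not the paper's formula: the function name is hypothetical, and it is assumed that the detected samples are set to zero and that the remaining power is compared with the unit variance of the untruncated noise.

```python
import math
from statistics import NormalDist

def gaussian_noise_reduction_db(detection_fraction):
    """Predicted level change (dB) when the largest-magnitude fraction of
    unit-variance Gaussian samples is set to zero."""
    if not 0.0 < detection_fraction < 1.0:
        raise ValueError("detection_fraction must lie strictly between 0 and 1")
    # Truncation border from the inverse CDF: P(|x| > b) = detection_fraction.
    b = NormalDist().inv_cdf(1.0 - detection_fraction / 2.0)
    # Integral of x^2 * pdf(x) over [-b, b] for the standard normal PDF:
    # erf(b / sqrt(2)) - 2 * b * pdf(b).
    pdf_b = math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)
    remaining_power = math.erf(b / math.sqrt(2.0)) - 2.0 * b * pdf_b
    return 10.0 * math.log10(remaining_power)

reduction_10 = gaussian_noise_reduction_db(0.10)
reduction_20 = gaussian_noise_reduction_db(0.20)
```

The predicted level drops further as the detection fraction grows, which matches the qualitative trend reported for the simulations in [19]. Heavier-tailed densities concentrate more power in the truncated tails, so the same fraction would be expected to remove more of their power.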
Dau et al. [42]. The SNR measure is a frequently used quality measure that is easy to calculate, but as stated in [43], it does not always show high correlation with subjective results. Perceptual measures are better suited to determine the perceived signal quality [44]. Since the commonly used perceptual evaluation of speech quality scores (PESQ) [45] is limited to 16 kHz sampling frequency and speech only, we decided to supplement the evaluation by the PEMO-based perceptual measure PSMt. Overall trends between results for PESQ and PSMt are similar. However, PEMO allows the use of audio material of higher sampling frequencies and is not limited to speech signals. Furthermore, it has been shown that PEMO is well-suited for the evaluation of noise reduction tasks [44].
1) SNR measure: The input and output signal-to-noise ratios can be determined in case the clean audio signal is available. The input SNR can be set by mixing the normalized clean audio signal and the normalized noise disturbance signal with adjusted levels. After processing by the denoising algorithm, the output SNR (in dB) can be calculated as the average SNR over all discrete time frames by [40]
(20)
where is the clean audio signal without noise, is the restored audio signal after processing, is the set of frames with speech activity and its cardinality [46]. The SNR improvement is then obtained by
(21)
It is common practice to plot the SNR improvement over the input SNR to visualize the input-output behavior of the investigated system.
2) Overall Perceptual Similarity Measure (PSMt) using PEMO: Since the SNR measures do not incorporate any knowledge about the human auditory system [43] and it has been shown that taking such information into account in objective quality assessment [44] is of importance, we furthermore calculate the Perceptual Similarity Measure (PSM), which was originally developed to predict quality degradations of broadband audio signals and which is based on the linear cross correlation coefficient of internal representations of signal pairs in blocks of 10 ms length [47]. The measure PSMt is the 5th percentile of the PSM output, calculating the perceptual distance between a test signal and a reference signal in a range between zero and one. A PSMt value of zero means no similarity, whereas a value of one stands for both test and reference signal being perceptually identical. PSMt showed high correlation with subjective ratings [41], [44]. For the input PSMt measure, the clean signal serves as reference, and a degraded signal at a given SNR is used as test signal. The output PSMt measure is calculated using the clean audio signal as reference again, and the restored audio signal as test signal. As before for the SNR measures, the PSMt improvement is calculated by subtracting the input PSMt from the output PSMt, and finally plotted over the input SNR.
B. Measurement Results
The top row of Fig. 7 shows the SNR improvement over the input SNR for the LSAR algorithm (left panel), the proposed BMRI algorithm with a matched setting (middle panel), and again the BMRI with a high setting (right panel). The bottom row of the plots shows the perceptual measure over the input SNR. Each panel contains three curves for the genres pop, classical music, and speech in white Cauchy noise (black lines), and three for the genres in Gaussian noise (dark-grey lines), plus the neutral line (dashed, no improvement). Furthermore, two dashed light-grey lines indicate the performance of the IBM (diamond markers) and the proposed BM (triangle markers) on classical music and white Cauchy noise as best performing genre, both without interpolation and recombination of target and residual, i.e. measured on the extracted target signals alone. They define an example for floor and ceiling of performance for standalone binary masking, highlighted as a light-grey corridor in Fig. 7.
The matched setting (middle panels) means that the SNR improvement of the BMRI is set equal to the measured SNR improvement of the LSAR algorithm at its highest point (speech at 0 dB input SNR, white Cauchy noise). The reason for choosing that point is that speech, with its inherent pauses, at a low input SNR, comes closest to the condition of a theoretically white BM residual signal with no tonal components. This is the preliminary assumption that has to be satisfied in order to predict the SNR improvement of the BMRI algorithm as a function of the detection percentage parameter and vice versa (cf. Section III). The LSAR algorithm offers a dB at that point, so according to Fig. 6, the BMRI detection percentage was set to , resulting in a very close match of BMRI and LSAR for speech at 0 dB SNR in white Cauchy noise. For the high setting of the BMRI in the right panel, a detection percentage with a theoretical dB was chosen, to investigate possible degradations of the desired signal by interpolation artefacts at higher values.
C. Discussion
The results for the LSAR algorithm (left panels in Fig. 7) show a good SNR improvement for the white Cauchy noise at low input SNR. Towards higher input SNR, the improvement drops, and even reaches negative values for pop music at 20 dB input SNR, as indicated by the standard deviation bars. Here, the desired signal gets affected by the LSAR algorithm. Since at 20 dB input SNR the white Cauchy noise is quite low compared to the desired signal, it is obvious that transients like drum sounds cause high AR error signals that trigger the interpolation threshold and thus get degraded. The speech and classical music signals achieve the best SNR improvement, since their inherent transient sounds (e.g., plosives) are lower in energy than transients in pop music. For the Gaussian noise there is no SNR improvement at all for low SNR, and finally, some negative improvement at 15 and 20 dB input SNR, for the same reasons mentioned above. Compared to the standalone performance of the proposed BM, the LSAR seems to be worse at low input SNR. However, this is an erroneous belief, as the PSMt corridor in the lower panel confirms. The proposed BM's target signal is practically free of white Cauchy noise, but yet
Fig. 7. SNR improvement (top row) and PSMt improvement (bottom row) over the input SNR, for LSAR (left column) and for BMRI (matched setting, middle column, and high setting, right column). Speech (circles), classic (crosses) and pop (squares) are degraded by the supergaussian white Cauchy noise (black lines) and Gaussian noise (dark-grey lines). The dashed, light-grey lines show the performance of the IBM (diamonds) and the proposed BM (triangles) on classical music and white Cauchy noise (best performer) obtained from the respective target signals (no recombination of target and residual), and thereby define an area of performance for standalone binary masking without interpolation (light-grey area). Remark: Marker positions are shifted on the x-axis, for better visibility.
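The SNR improvement plotted in the top row of Fig. 7 is the difference between an output SNR and an input SNR, where the output SNR is an average over time frames. The sketch below shows one way to compute such a value. It is an assumption-laden illustration: the function names and the frame length are hypothetical, NumPy is assumed, the per-frame SNR is taken as the ratio of clean-signal energy to error energy, and the input SNR is measured with the same frame average for symmetry, whereas the paper sets it by mixing at adjusted levels.

```python
import numpy as np

def segmental_snr_db(clean, processed, frame_len, active_frames=None):
    """Average per-frame SNR in dB between a clean signal and a processed one."""
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n_frames = len(clean) // frame_len
    # Optionally restrict the average to frames with signal activity.
    frames = range(n_frames) if active_frames is None else active_frames
    values = []
    for k in frames:
        s = clean[k * frame_len:(k + 1) * frame_len]
        err = s - processed[k * frame_len:(k + 1) * frame_len]
        values.append(10.0 * np.log10(np.sum(s ** 2) / np.sum(err ** 2)))
    return float(np.mean(values))

def snr_improvement_db(clean, noisy, restored, frame_len, active_frames=None):
    """Output SNR minus input SNR, both measured against the clean signal."""
    snr_in = segmental_snr_db(clean, noisy, frame_len, active_frames)
    snr_out = segmental_snr_db(clean, restored, frame_len, active_frames)
    return snr_out - snr_in

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 0.01 * np.arange(4096))
noise = 0.1 * rng.standard_normal(4096)
noisy = clean + noise
restored = clean + 0.5 * noise   # stand-in for a denoiser that halves the noise
delta = snr_improvement_db(clean, noisy, restored, frame_len=512)
```

In this toy case the stand-in denoiser halves the error amplitude in every frame, so the improvement equals 20·log10(2), about 6 dB. A real restoration algorithm also alters the desired signal, which is why negative improvements can occur at high input SNR.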
dominated by a lot of musical noise. This perceptual misfit is not reflected within the SNR measure. Therefore, the PSMt curves in the bottom left panel of Fig. 7 show a different behavior. The highest PSMt improvement for white Cauchy noise at low input SNR is now obtained for pop and classical music and not for speech, as for the SNR measures. By listening to the degraded input audio files [26], it can be noticed that the white Cauchy noise is more annoying within speech than it is within the wide-band pop and classical music, although the mixing of clean audio and noise at distinct SNRs was carried out carefully. This is an effect of auditory masking [48]. If a single sine tone and a stationary narrow-band noise masker of the same center frequency are presented to a human listener at the same time, the listener will not be able to detect the tone within the noise at a certain signal-to-masker ratio (SMR) [1]. This works vice versa, meaning that noise can be masked by a tone, however, needing a higher SMR [49]. The auditory masking effects are taken into account within modern audio codecs, and they are incorporated as well in the auditory model [42] of PEMO-Q. Now, considering wide-band classical or pop music, more noise is masked within the auditory system than it is by the sparsely filled spectrum of speech. This finally leads to the lower PSMt improvement for speech in the PSMt plots, and might also explain the decline in PSMt improvement towards higher SNRs for pop and classical music. The grey curves for Gaussian noise in the bottom left panel show no PSMt improvement for that noise type, neither positive nor negative, although there is some negative SNR improvement at 15 and 20 dB input SNR for mainly pop music in the upper left panel. Here as well, auditory masking effects might hide that the desired signal gets affected by the restoration algorithm.
The middle panels of Fig. 7 show the results for the BMRI algorithm, configured with the matched setting. Looking at the upper middle panel, the 0 dB input SNR point for speech in white Cauchy noise offers the same amount of SNR improvement as the LSAR algorithm in the upper left panel. However, the proposed BMRI algorithm yields higher SNR improvement for the upper input SNRs, giving no negative SNR improvement any more, and overall higher values for all SNR conditions. Even for Gaussian noise, SNR improvement is visible. The ability of the BMRI algorithm to reduce Gaussian noise disturbances has already been shown in [19]. Best is obtained for speech and classical music, both for Gaussian and supergaussian disturbance. The considerably better results for BMRI especially at high input SNR are due to the carefully elaborated BM threshold criterion in combination with the constraint to only interpolate a fixed amount of samples per signal block.
In the right column of Fig. 7 the results for the BMRI algorithm with the high setting at are plotted. Even at this fairly high level of noise reduction, there are almost no negative values at input SNRs of 15 and 20 dB for both noise types, unlike for the LSAR algorithm, meaning that the BMRI affects the desired signal less at high input SNR than LSAR does, although the amount of noise reduction is higher. In the bottom right panel, the black curves for white Cauchy noise show the same tendency over input SNR as in the middle panel for the matched setting, but at a clearly higher . Finally, also the dark-grey curves show more perceptual improvement now, confirming the ability of the proposed BMRI algorithm to reduce Gaussian noise as well.
It is worth mentioning that both LSAR and BMRI outperform the ideal binary mask for high input SNR, in terms of perceptual measurement. Informal listening confirms that the IBM produces musical noise and artefacts even at an SNR of 15 and 20 dB and therefore gives poor perceptual performance.
TABLE II
RESULTS OF THE 2-AFC LISTENING TESTS AT 10 dB SNR
TABLE III
RESULTS OF THE 2-AFC LISTENING TESTS AT 20 dB SNR
Fig. 8. Results of the MOS test: Individual rating of the degradation of the target signal. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = target signal is not audibly degraded, 1 = unacceptable degradation.
Fig. 9. Results of the MOS test: Individual rating of the annoyance of the residual noise signal. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = the residual noise is not annoying / not audible at all, 1 = unacceptable annoyance by the residual noise.
Fig. 10. Results of the MOS test: Individual rating of the overall output signal quality. The input SNR of the test signals was 20 dB. The REF signal imitates an artefact-free denoising algorithm with 15 dB reduction. Rating 5 = best overall quality, 1 = unacceptable overall quality of the restored signal.
the residual noise annoyance (cf. Fig. 9), and overall quality (cf. Fig. 10). It can be seen that the preferred choice of algorithm depends on the type of target signal that is processed. The medians reveal that the BMRI algorithm is preferred in comparison to the LSAR algorithm for all test signals over all three quality dimensions, as already indicated by the results of the 2-AFC test (cf. Tables II and III). Especially in terms of the annoyance of the residual noise signal (Fig. 9), the proposed BMRI algorithm outperforms the LSAR algorithm clearly. However, the ratings for the simulated artefact-free denoising (REF) indicate that there is still room for improvement of the BMRI algorithm.
VI. CONCLUSION
In this paper we presented an approach for the reduction of noise disturbances of Gaussian, supergaussian and impulsive characteristics. Although the approach is especially suited to reduce supergaussian and impulsive noise, also Gaussian disturbances can be reduced to a certain extent. High preservation of the desired signal is achieved by a carefully elaborated BM threshold criterion in combination with a modified AR detection and interpolation stage in the time-domain. We showed by analytical treatment that introducing an AR detection percentage parameter allows for precise prediction of the noise reduction level for white disturbances. We tied in with these findings and verified the suitability of the approach for high quality audio restoration by means of objective and subjective quality measures. The objective measures confirm the analytic prediction, and, together with the subjective listening tests, it is shown that the proposed approach is superior to the common LSAR detection and interpolation methods in audio restoration, especially for supergaussian noise types.
REFERENCES
[1] P. Vary and R. Martin, Digital Speech Transmission. Enhancement, Coding and Error Concealment, 1st ed. Chichester, U.K.: Wiley, 2006.
[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, Dec. 1984.
[3] J. Porter and S. Boll, "Optimal estimators for spectral restoration of noisy speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1984, pp. 53–56.
[4] I. Cohen, "Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation," Speech Commun., vol. 47, no. 3, pp. 336-350, 2005.
[5] R. Martin, "Speech enhancement based on minimum mean-square error estimation and supergaussian priors," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 845-856, Sep. 2005.
[6] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741-1752, Aug. 2007.
[7] I. Andrianakis and P. R. White, "Speech spectral amplitude estimators using optimally shaped gamma and chi priors," ELSEVIER Speech Commun., vol. 51, no. 1, pp. 1-14, 2009.
[8] Z. Goh, K.-C. Tan, and B. Tan, "Postprocessing method for suppressing musical noise generated by spectral subtraction," IEEE Trans. Speech Audio Process., vol. 6, no. 3, pp. 287-292, May 1998.
[9] D. Malah, R. V. Cox, and A. J. Accardi, "Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1999, vol. 2, pp. 789-792.
[10] M. Brandt and J. Bitzer, "Optimal spectral smoothing in short-time spectral attenuation (STSA) algorithms: Results of objective measures and listening tests," in Proc. 17th Eur. Signal Process. Conf. (EUSIPCO'09), Aug. 2009, pp. 199-203.
[11] H. Gustafsson, S. E. Nordholm, and I. Claesson, "Spectral subtraction using reduced delay convolution and adaptive averaging," IEEE Trans. Speech Audio Process., vol. 9, no. 8, pp. 799-807, Nov. 2001.
[12] C. Breithaupt, T. Gerkmann, and R. Martin, "Cepstral smoothing of spectral filter gains for speech enhancement without musical noise," IEEE Signal Process. Lett., vol. 14, no. 12, pp. 1036-1039, 2007.
[13] C. Breithaupt, T. Gerkmann, and R. Martin, "A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2008, pp. 4897-4900.
[14] T. Gerkmann and R. Martin, "On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling," IEEE Trans. Signal Process., vol. 57, no. 11, pp. 4165-4174, 2009.
[15] S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction, 1st ed. Leipzig, Germany: Teubner, 1996.
[16] S. V. Vaseghi, "Algorithms for restoration of archived gramophone recordings," Ph.D. dissertation, Univ. of Cambridge, Cambridge, U.K., 1988.
[17] R. Veldhuis, Restoration of Lost Samples in Digital Signals. Englewood Cliffs, NJ, USA: Prentice-Hall, 1990.
[18] D. Richter, I. Kurreck, and D. Poetsch, "Restoration of optical variable density sound tracks on motion picture films by digital image processing," in Proc. Int. Conf. Optimiz. Elect. Electron. Equipment, 2000, vol. 3, pp. 793-798.
[19] M. Ruhland, S. Goetze, M. Brandt, S. Doclo, and J. Bitzer, "A new approach for reduction of supergaussian noise using autoregressive interpolation and time-frequency masking," in Proc. 13th Int. Workshop Acoust. Echo Noise Control, Aachen, Germany, Sep. 2012, pp. 1-4.
[20] S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration. London, U.K.: Springer, 1998.
[21] J. Nuzman, Audio Restoration: An Investigation of Digital Methods for Click Removal and Hiss Reduction, 2004 [Online]. Available: www.umiacs.umd.edu/jnuzman/audio/audio.pdf, last seen in March 2012
[22] M. Kahrs and K. Brandenburg, Applications of Digital Signal Processing to Audio and Acoustics, 1st ed. London, U.K.: Springer, 1998.
[23] D. L. Wang, "Time-frequency masking for speech separation and its potential for hearing aid design," Trends Amplificat., vol. 12, no. 4, pp. 332-353, 2008.
[24] A. Czyzewski, "Learning algorithms for audio signal enhancement: Part 1: neural network implementation for the removal of impulse distortions," J. Audio Eng. Soc., vol. 45, no. 10, pp. 815-831, 1997.
[25] D. J. Levitin, P. Chordia, and V. Menon, "Musical rhythm spectra from Bach to Joplin obey a 1/f power law," in Proc. Nat. Acad. Sci., 2012 [Online]. Available: http://www.pnas.org/content/early/2012/02/14/1113828109.abstract, last seen in March 2012
[26] M. Ruhland, Website with Audio Examples to this Paper, 2014 [Online]. Available: http://tgm.jade-hs.de/Ruhland_2014_Reduction
[27] Y. Hu and P. C. Loizou, "Techniques for estimating the ideal binary mask," in Proc. 11th Int. Workshop Acoust. Echo Noise Control, 2008.
[28] G. Hu and D. L. Wang, "Speech segregation based on pitch tracking and amplitude modulation," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust., 2001, pp. 79-82.
[29] D. P. Ellis, "Model-based scene analysis," in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Piscataway, NJ, USA: Wiley/IEEE Press, 2006, pp. 115-147.
[30] I. Kauppinen, "Methods for detecting impulsive noise in speech and audio signals," in Proc. Int. Conf. Digital Signal Process., 2002, vol. 2, pp. 967-970.
[31] S. V. Vaseghi and P. J. W. Rayner, "A new application of adaptive filters for restoration of archived gramophone recordings," in Proc. Int. Conf. Acoust., Speech, Signal Process., 1988, pp. 2548-2551.
[32] S. V. Vaseghi and P. J. W. Rayner, "Detection and suppression of impulsive noise in speech communication systems," in Proc. IEEE Commun., Speech, Vis., 1990, vol. 137, pp. 38-46.
[33] J. G. Proakis and D. K. Manolakis, Digital Signal Processing, 4th ed. Upper Saddle River, NJ, USA: Prentice-Hall, Apr. 2006.
[34] N. Jayant, "Average- and median-based smoothing techniques for improving digital speech quality in the presence of transmission errors," IEEE Trans. Commun., vol. COM-24, no. 9, pp. 1043-1045, Sep. 1976.
[35] S. J. Godsill and P. J. W. Rayner, "Frequency-based interpolation of sampled signals with applications in audio restoration," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1993, vol. 1, pp. 209-212.
[36] A. J. E. M. Janssen, R. Veldhuis, and L. B. Vries, "Adaptive interpolation of discrete-time signals that can be modelled as AR processes," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1986, vol. 34, no. 2, pp. 317-330.
[37] K.-D. Kammeyer and K. Kroschel, Digital signal processing - filtering and spectral analysis with MATLAB exercises, digitale signalverarbeitung - filterung und spektralanalyse mit MATLAB-Übungen, 8th ed. Wiesbaden, Germany: Vieweg+Teubner-Verlag, 2012.
[38] Int. Phonetic Association, Handbook International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge, U.K.: Cambridge Univ. Press, Jun. 1999.
[39] J. H. McCulloch, Alpha-Stable Distributions in MATLAB, 1996 [Online]. Available: www.mathworks.com/matlabcentral/fileexchange/13619-toolbox-non-local-means/content/toolbox_nlmeans/toolbox/stabrnd.m, last seen in March 2012
[40] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC Press, 2007.
[41] R. Huber and B. Kollmeier, "PEMO-Q - a new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 1902-1911, Nov. 2006.
[42] T. Dau, D. Pueschel, and A. Kohlrausch, "A quantitative model of the effective signal processing in the auditory system," J. Acoust. Soc., vol. 99, no. 6, pp. 3615-3622, 1996.
[43] I. Kauppinen and K. Roth, "Improved noise reduction in audio signals using spectral resolution enhancement with time-domain signal extrapolation," IEEE Trans. Speech Audio Process., vol. 13, no. 6, pp. 1210-1216, 2005.
[44] T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Objective perceptual quality measures for the evaluation of noise reduction schemes," in Proc. 9th Int. Workshop Acoust. Echo Noise Control, 2005, pp. 169-172.
[45] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," Speech Commun., vol. 2, pp. 749-752, 2001.
[46] S. Goetze, V. Mildner, and K.-D. Kammeyer, "A psychoacoustic noise reduction approach for stereo hands-free systems," in Proc. 120th Conv. Audio Eng. Soc. (AES), 2006.
[47] HörTech, PEMO-Q (AudioQual and SpeechQual) Audio and Speech Quality Prediction Based on the Oldenburg Perception Model (PEMO) - Manual. Oldenburg, Germany: HörTech gGmbH, Kompetenzzentrum für Hörgeraete-Systemtechnik, 2010.
[48] B. C. J. Moore, An Introduction to the Psychology of Hearing. Leiden, The Netherlands: Brill, 2012.
[49] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, no. 4, pp. 451-515, Apr. 2000.
[50] S. Goetze, E. Albertin, J. Rennies, E. A. P. Habets, and K.-D. Kammeyer, "Speech quality assessment for listening-room compensation," J. Audio Eng. Soc., vol. 62, no. 6, pp. 386-399, Jun. 2014.
[51] L. L. Thurstone, "A law of comparative judgment," Psychol. Rev., vol. 34, no. 4, pp. 273-286, 1927.
[52] R. A. Bradley and M. E. Terry, "Rank analysis of incomplete block designs: I. The method of paired comparisons," Biometrika, vol. 39, no. 3/4, p. 324, Dec. 1952.
[53] K. Tsukida and M. R. Gupta, "How to analyze paired comparison data," May 2011.
[54] ITU-T, "Methods for objective and subjective assessment of quality (P.835)," Nov. 2003.
[55] S. R. Quackenbush, T. P. Barnwell III, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ, USA: Prentice-Hall, 1988.
[56] M. G. Kendall and B. B. Smith, "On the method of paired comparisons," Biometrika, vol. 31, no. 3/4, pp. 324-345, 1940.
audio technology in Oldenburg as a Scientific Supervisor and has been the Deputy Head of the Transfer Center for User-Oriented Assistance Systems since 2013. His current research interests include all forms of single- and multichannel speech enhancement, audio restoration, audio effects for musical applications, and information retrieval for large media archives.