We define a time series of probability outputs by the DNN as P(E) = {P_1(E), P_2(E), ..., P_{1100}(E)} for an event E for each clip consisting of 1100 frames. We use time series filtering followed by masking to obtain our final probability predictions for a given clip. Time series smoothing is aimed at smoothing out the probability signal and using temporal context to make a prediction. Time series masking is used to reduce the frequent false alarms introduced by the DNN classifier. We describe each individual unit in the following subsections.

3.1. Time series smoothing

In this scheme, we use the raw probability time series P(E) as output by the DNN system. This scheme initially removes the inherent noise in the probability signal due to frame-wise classification. We filter the probability signal using a low pass filter (block 1 in figure 1). We use a low pass filter G(z) of order n cascaded with itself p times (equation 1). n determines the length of the filter and p the sharpness of the filter response around the poles. These parameters are tuned on the development set.

low pass filter: G(z) = \frac{1}{\left(\sum_{i=0}^{n} z^{i}\right)^{p}}    (1)

We next capture the information from the surrounding frames to obtain smoothed probability estimates. These events happen over several frames, so the decision on one frame should be dependent on its neighbors. The filtered output probabilities obtained above are shifted as shown in block 2 of figure 1 to obtain probabilities for the surrounding frames. {a_1, a_2, ..., a_k} ∈ Z represent the shifts introduced by each of the shift units. The outputs from these advance/delay blocks are then fed to a logistic regression model to obtain the final smoothed probabilities. We chose 3 shift blocks with {a_1 = -2, a_2 = 0, a_3 = 2} to make a decision for each frame. Using closer frames led to highly correlated features while training the logistic regression, leading to numerical problems; using a higher number of frames did not improve the performance. We obtain the smoothed probability time series P^s = {P_1^s(E), P_2^s(E), ..., P_{1100}^s(E)} after this operation. Figure 2 shows the plot of P, the intermediate filtered probability, and P^s for one clip in the training set containing laughter. As can be seen, the initial filtering removes the noise in the time series. The subsequent application of logistic regression further increases the probability values in regions corresponding to the event while attenuating the event probability for frames not corresponding to the event.

Figure 2: Time series of probabilities: (a) baseline (b) after filtering (c) smoothed

3.2. Time series masking

We use several techniques to reduce the false alarm rate during detection. The baseline model is more prone to false alarms than to missed detections because the majority class is downsampled. We define masking of a probability as multiplying the frame probability P_t(E) by a constant θ_t, 0 ≤ θ_t ≤ 1, based on certain criteria. We define these criteria based on the smoothed probability time series as well as the state occupancy probabilities from an ASR system. We discuss these masking schemes below.

3.2.1. Masking based on smoothed probability time series

This masking scheme is based on our empirical observations from P^s. We observe that in the regions of laughter and filler in a clip, at least one of the frames has a high probability value. This suggests that at least one frame during an event is confidently classified by the DNN system. We utilize this information by designing a mask as shown in equation 2. This scheme preserves the probability time series around the high-probability frames and sets the ones with P_t(E) less than a threshold T_1 to 0. We obtain the modified time series P_t^{m1}(E) as shown in equation 3. N_1 determines the length of the window to be preserved and θ_{t1} is the multiplication factor for the t-th frame. S_H is the set of frames {t_{h1}, t_{h2}, ..., t_{hi}, ...} with P_{t_{hi}}(E) > T_2. We tune the parameters T_1, T_2 and N_1 on the development set.

\theta_{t1} = \begin{cases} 0, & \text{if } \min_{t_h \in S_H} |t - t_h| > N_1 \ \text{and}\ P_t^s(E) < T_1 \\ 1, & \text{otherwise} \end{cases}    (2)

P_t^{m1}(E) = \theta_{t1} \times P_t^s(E)    (3)

We further mask frames that correspond to an event lasting for an extremely short time. If two zero-valued frames are very close to each other after the previous operation, we set all the event probabilities between those frames to zero. It is unlikely that these events last only a few frames, and hence all such frames can be disregarded as not belonging to the event. The mask is shown in equation 4 and the modified time series in equation 5. S_0 is the set of frames {t_{01}, t_{02}, ..., t_{0i}, ...} with P_{t_{0i}}^{m1}(E) = 0. N_2 determines the maximum length between two frames in set
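Taken together, the smoothing and first masking stages described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: a moving-average filter stands in for G(z), the learned logistic regression over shifted frames is replaced by fixed illustrative weights, and the thresholds (t1, t2, n1) are placeholders rather than the values tuned on the development set.

```python
import numpy as np

def low_pass(p, n=5, reps=2):
    # Moving-average low-pass filter of length n+1, cascaded reps times
    # (a simple stand-in for the paper's G(z) with parameters n and p).
    kernel = np.ones(n + 1) / (n + 1)
    for _ in range(reps):
        p = np.convolve(p, kernel, mode="same")
    return p

def smooth(p_filt, shifts=(-2, 0, 2), weights=(0.3, 0.4, 0.3)):
    # Combine advanced/delayed copies of the filtered series. The paper
    # feeds these shifted streams to a logistic regression model; fixed
    # weights summing to one are used here purely for illustration.
    out = np.zeros_like(p_filt)
    for a, w in zip(shifts, weights):
        out += w * np.roll(p_filt, -a)  # note: np.roll wraps at the edges
    return out

def mask_weak_frames(ps, t1=0.2, t2=0.8, n1=5):
    # Equations 2-3: keep frames near a confident frame (P^s > T2);
    # zero frames that are far from all confident frames and below T1.
    high = np.flatnonzero(ps > t2)            # the set S_H
    theta = np.ones_like(ps)
    for t in range(len(ps)):
        far = high.size == 0 or np.min(np.abs(high - t)) > n1
        if far and ps[t] < t1:
            theta[t] = 0.0
    return theta * ps
```

With a synthetic probability trace containing one sustained event and an isolated weak spike, the mask leaves the event region intact and zeroes the spike.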
more or less acoustically homogeneous. Thus, while decoding, only a few states are likely to be occupied during decoding and finding the n-best paths. This leads to a lower entropy for filler frames. We define different masks for the laughter and filler time series in equations 8 and 10 respectively, utilizing the above properties. A plot for the same clip as in figure 2 after the masking schemes is shown in figure 3.

\theta_t^l = \begin{cases} 1, & \text{if } \min_{t_h \in S_H} |t - t_h| < N \ \text{and}\ H_t > T_4 \\ k_l, & \text{otherwise} \end{cases} \quad (k_l < 1)    (8)
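As a rough illustration of the laughter mask in equation 8, the sketch below computes a per-frame entropy H_t from state occupancy posteriors and scales each frame by k_l unless it is both near a confident frame in S_H and high-entropy. The posterior matrix and the parameter values (n, t4, k_l) are illustrative assumptions, not values from the system described here.

```python
import numpy as np

def laughter_entropy_mask(ps, state_post, sh_frames, n=8, t4=2.0, k_l=0.5):
    # Equation 8 (sketch): theta = 1 when the frame is within N frames of a
    # confident frame in S_H AND its state-occupancy entropy H_t > T4;
    # otherwise theta = k_l < 1. Parameter values are illustrative.
    h = -np.sum(state_post * np.log(state_post + 1e-12), axis=1)  # H_t
    sh = np.asarray(sh_frames)
    theta = np.full(len(ps), k_l)
    for t in range(len(ps)):
        if sh.size and np.min(np.abs(sh - t)) < n and h[t] > t4:
            theta[t] = 1.0
    return theta * ps
```

Since filler frames tend to occupy only a few ASR states, their entropy H_t stays low and this laughter mask attenuates them; the filler mask of equation 10 (not shown in this excerpt) presumably relies on the complementary low-entropy criterion.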
Figure 5: ROC curves for laughter and filler detection