Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008

ABSTRACT

An overview of studies dealing with onset detection and tempo extraction is reformulated under the new conceptual and computational framework defined by MIRtoolbox [1]. Each approach can be specified as a flowchart of general high-level signal-processing operators that can be tuned along diverse options. This framework encourages more advanced combinations of the different approaches and offers the possibility of comparing multiple approaches under a single optimized flowchart. Besides, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions. To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings was carried out via regressions.

* This work has been supported by the European Commission (NEST project "Tuning the Brain for Music", code 028570) and by the Academy of Finland (project number 119959).

1. INTRODUCTION

MIRtoolbox is a Matlab toolbox offering an extensive set of signal-processing operators and musical feature extractors [1]. The objective is to design a tool capable of analyzing a large range of musical dimensions from extensive sets of audio files. The first public version, released last year, contains the core of the framework, enabling a broad overview of the musical dimensions investigated in computational music analysis. The aim of current research is mainly to improve the set of tools by integrating a large range of approaches currently discussed in the research community.

This paper focuses on the joint questions of onset extraction and tempo estimation. A synthetic overview of studies in this domain is reformulated using the operators defined in MIRtoolbox. Section 2 shows various methods to compute the onset detection curve, and section 3 deals with the description of that curve, in particular the detection of the onsets themselves. The estimation of tempo from the onset curve is dealt with in section 4. Throughout this review, each approach is modeled as a flowchart of general high-level signal-processing operators available in MIRtoolbox, with multiple options and parameters to be tuned accordingly. This framework encourages more advanced combinations of the different approaches and offers the possibility of comparing multiple approaches under a single optimized flowchart.

In section 5, a composite model explaining pulse clarity judgments is decomposed into a set of independent factors related to various musical dimensions. To evaluate the pulse clarity model, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings, discussed in section 6, was carried out via regressions.

2. COMPUTING THE ONSET DETECTION FUNCTION

2.1. Preprocessing

First of all, the audio signal is loaded from a file:

a = miraudio('myfile.wav') (1)

The audio signal can be segmented into characteristic and similar regions based on novelty [2] by calling the mirsegment operator [1]:

a = mirsegment(a) (2)

When the tempo is supposed to remain stable within each segment [3], command (2) automatically ensures that the tempo will be computed for each segment separately.

2.2. Filterbank decomposition

The estimation of the onset positions generally requires a decomposition of the audio waveform along particular frequency regions. The simplest method consists in discarding the high-frequency components by filtering the signal with a narrow bandpass filter [4]:

a = mirfilter(a,'Scale',50,20000) (3)

More subtle models require a multi-channel decomposition of the signal mimicking auditory processes. This can be done through filterbank decomposition [5, 6]:

b = mirfilterbank(a,'CriticalBand','Scale',44,18000) (4)

where more precise specifications can optionally be indicated. Alternatively, the decomposition can be performed via a time-frequency representation computed through the STFT [7, 3]:

s = mirspectrum(a,'Frame','FFT','WinLength',.023,'s','Hop',50,'%') (5)
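The framing arithmetic behind command (5) can be illustrated in plain Python. This is only a sketch of the windowing and a naive DFT, not MIRtoolbox code; the helper names `frame_signal` and `magnitude_spectrum` are assumptions of this sketch.

```python
import cmath, math

def frame_signal(x, sr, win_len_s=0.023, hop_ratio=0.5):
    """Split x into 23 ms frames by default, hopping by half a frame (50% hop)."""
    n = max(1, int(round(win_len_s * sr)))
    hop = max(1, int(round(n * hop_ratio)))
    return [x[i:i + n] for i in range(0, len(x) - n + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes (fine for short illustrative frames)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
```

Applying `magnitude_spectrum` to each frame returned by `frame_signal` yields the time-frequency representation that the onset models below operate on.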
2.3.1. Envelope extraction

The description of this temporal evolution of energy results from an envelope extraction, basically through rectification (or squaring) of the signal, low-pass filtering, and finally downsampling, using the following command:

od = mirenvelope(x) (10)

where x can be either the undecomposed audio signal a, the filterbank-decomposed b (see footnote 2), or the middle-frequency band m.

Further refinement enables an improvement of the peak picking: first the logarithm of the signal is computed [5] (see footnote 3):

od = mirenvelope(od,'Log') (11)

and the result is differentiated and half-wave rectified (see footnote 4):

od = mirenvelope(od,'Diff','HWR') (12)

Some approaches advocate the use of a smoothed FIR differentiator instead, based on exponential weighting (see footnote 5) [8] (available as a parameter 'Smooth' of the 'Diff' option).

The contrast between successive frames is observed through differentiation, leading to a spectral flux:

od = mirflux(s) (17)

where diverse distances can be specified using the 'Dist' parameter, such as the L1-norm [12] or the L2-norm [8]. Components contributing to a decrease of energy can be ignored [8] using the 'Inc' option. Instead of simple differentiation, an FIR filter differentiator [13] can be specified. Each distance between successive frames can be normalized by the total energy of the first frame ('Norm' option) in order to ensure a better adaptiveness to volume variability [8]. Besides, the computation can be performed in the complex domain ('Complex' option) in order to include the phase information [14].

The novelty curve designed for musical segmentation, as mentioned in section 2.1, can actually be considered as a more refined way of evaluating the distance between frames [15]. We notice in particular that the use of novelty on multi-pitch extraction results [16] leads to particularly good results when estimating onsets from violin soli (see Figures 1-4).

f = mirpitch(a,'Frame') (18)
od = mirnovelty(f) (19)

2.5. Post-processing

If necessary, the onset detection function can be smoothed through low-pass filtering [15]:

od = smooth(od) (20)

In order to adapt further computations (such as peak picking or periodicity estimation) to the local context, the onset detection curve can be detrended by removing the median [17, 13, 15]:

od = detrend(od,'Median') (21)

1. This second call of the command mirspectrum does not mean that a second FFT is computed. It just indicates a further operation on a mirspectrum object already computed.
2. It should be mentioned that if x is a mirspectrum object, command (10) should include the 'Band' keyword in order to specify that the envelope should be computed along bands, and not from the spectrum decomposition in each frame.
3. A µ-law compression [7] can be specified as well, using the 'Mu' option.
4. A weighted average of the original envelope and its differentiated version [7] can be obtained using the 'Weight' option.
5. The logarithmic transformation might prevent this loss of information, though.
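The envelope-based chain of commands (10)-(12) can be sketched in plain Python. This is a rough illustrative analogue, not MIRtoolbox code; the helper names, the one-pole smoothing constant, and the decimation factor are assumptions of this sketch.

```python
import math

def envelope(x, alpha=0.99, decim=10):
    """Rectify, low-pass with a one-pole filter, keep every decim-th sample."""
    env, out = 0.0, []
    for i, v in enumerate(x):
        env = alpha * env + (1 - alpha) * abs(v)  # rectification + low-pass, cf. (10)
        if i % decim == 0:
            out.append(env)                       # downsampling
    return out

def onset_curve(env, eps=1e-6):
    """Log compression, differentiation, half-wave rectification."""
    logged = [math.log(e + eps) for e in env]     # cf. command (11)
    diffs = [b - a for a, b in zip(logged, logged[1:])]
    return [max(d, 0.0) for d in diffs]           # 'Diff' + 'HWR', cf. command (12)
```

On a signal that jumps from silence to full amplitude, the resulting curve peaks at the transition, which is exactly what the peak picking of section 3 exploits.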
Figure 2: Frame-decomposed generalized and enhanced autocorrelation function [16] used for the multi-pitch extraction (pitch vs. temporal location of events, in s).

Figure 3: Similarity matrix (temporal location of frame centers vs. temporal location of frame centers, in s).

Figure 4: Novelty curve estimated along the diagonal of the similarity matrix [2], and onset detection (circles) featuring one false positive (the second onset) and one false negative (around time t = 12.5 s).
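The novelty computation along the diagonal of the similarity matrix can be sketched as follows. This is an illustrative plain-Python reduction in the spirit of [2] (cosine similarity between feature frames, a checkerboard kernel slid along the diagonal), not the mirnovelty implementation; the kernel half-width is an assumption of this sketch.

```python
def novelty(frames, half=2):
    """Novelty score per frame: high where past and future frames differ."""
    def sim(a, b):  # cosine similarity between two feature vectors
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den if den else 0.0
    n = len(frames)
    s = [[sim(frames[i], frames[j]) for j in range(n)] for i in range(n)]
    out = []
    for t in range(n):
        score = 0.0
        for i in range(-half, half):
            for j in range(-half, half):
                if 0 <= t + i < n and 0 <= t + j < n:
                    # checkerboard kernel: same-side quadrants count positively,
                    # cross (past vs. future) quadrants negatively
                    sign = 1.0 if (i < 0) == (j < 0) else -1.0
                    score += sign * s[t + i][t + j]
        out.append(score)
    return out
```

On two homogeneous segments back to back, the curve peaks exactly at the boundary, which is where the onset detection circles of Figure 4 would fall.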
3.3. Attack characterization

If the note onset temporal position is estimated using an energy-based strategy (section 2.3), some characteristics related to the attack phase can be assessed as well.

If the note onset positions are found at local maxima of the energy curve (the amplitude envelope or RMS in particular), they can be considered as the ending positions of the related attack phases. A complete determination of the attack therefore requires an estimation of its starting position, through the extraction of the preceding local minimum using an appropriately smoothed version of the energy curve. Figure 5 shows the output of the command:

at = mironsets('ragtime.wav','Attacks') (28)

The characteristics of the attack phase can then be its duration or its mean slope [20]. Figure 6 shows the output of the command:

as = mirattackslope(at) (29)

p = entropy(ph,'Lag') (34)

An emphasis towards the best perceived periodicities can be obtained by multiplying the autocorrelation function (or the spectrum) with a resonance curve [23, 10] ('Resonance' option).

4.1.3. Comb filters

Another strategy commonly used for periodicity estimation is based on a bank of comb filters [6, 7]:

ph = mirspectrum(od,'Comb') (35)

6. More subtle combination processes have been proposed [5], based on detailed auditory modeling, but they are not integrated in the toolbox yet.
7. Following our discussion initiated in footnote 2, the 'Band' option is explicitly mentioned here, as the fluctuation pattern is usually computed from a time-frequency representation. The 'Band' keyword will not be mentioned in the following commands for clarity's sake.
8. In MIRtoolbox 1.0, mirspectrum was strictly related to the FFT, whereas mirautocor was related to the autocorrelation function. In the new version, mirspectrum should be understood as a general representation of the energy distribution along frequencies, implemented by various methods.
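The comb-filter idea behind command (35) can be sketched in plain Python: each filter y[t] = x[t] + g * y[t - L] resonates when its lag L matches the period of the onset curve, so the lag with the highest output energy estimates the pulse period. The helper names, gain, and lag grid are assumptions of this illustrative sketch, not MIRtoolbox internals.

```python
def comb_energy(x, lag, gain=0.8):
    """Output energy of one comb filter of the bank, applied to curve x."""
    y = [0.0] * len(x)
    for t in range(len(x)):
        y[t] = x[t] + (gain * y[t - lag] if t >= lag else 0.0)
    return sum(v * v for v in y)

def best_lag(x, lags):
    """Lag whose comb filter resonates most strongly with x."""
    return max(lags, key=lambda L: comb_energy(x, L))
```

Sub- and super-multiples of the true period also resonate, but more weakly, which is why the harmonic-relation tests discussed in section 4.2 remain useful after this stage.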
Figure 5: amplitude as a function of time (s).

Figure 6: Attack Slope: coefficient value at the temporal location of the events (in s).
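The attack characterization of section 3.3 can be sketched in plain Python: a local maximum of the energy curve marks the end of an attack, the preceding local minimum its start, and the mean slope is the amplitude rise over the attack duration. The helper name and the treatment of plateaus are assumptions of this sketch, not the mirattackslope implementation.

```python
def attack_slope(env, dt=1.0):
    """Return (start, end, mean slope) for the strongest attack in env."""
    # the attack end is the highest local maximum of the energy curve
    peak = max(range(1, len(env) - 1),
               key=lambda i: env[i] if env[i - 1] <= env[i] >= env[i + 1]
               else float('-inf'))
    start = peak
    while start > 0 and env[start - 1] <= env[start]:
        start -= 1  # walk back to the preceding local minimum (plateaus included)
    dur = (peak - start) * dt
    return start, peak, (env[peak] - env[start]) / dur if dur else 0.0
```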
4.1.4. Event-wise

Periodicities can also be estimated from actual onset positions, either detected from the onset detection curve (o, computed in section 3.2), or from onset dates read from MIDI files:

o = mironsets('myfile.mid') (36)

The periodicities can be displayed with a histogram showing the distribution of all the possible inter-onset intervals [24]:

h = mirhisto(mirioi(o,'All')) (37)

which can be represented in the frequency domain:

p = mirspectrum(h) (38)

Alternatively, the MIDI file can be transformed into an onset detection curve by summing Gaussian kernels located at the onset points of each note [23]. The onset detection curve can then be fed to the same analyses as for audio files, as presented at the beginning of this section.

4.2. Peak picking

The previous paragraphs gave an overview of diverse methods for the estimation of rhythmic periodicity: FFT, autocorrelation function, comb filter output, histogram. Following the unifying view encouraged in the MIRtoolbox framework, all these diverse representations can be considered as one single periodicity spectrum p, which can be further analyzed as follows.

The periodicity estimations on separate bands can be summed before the peak picking:

p = sum(p,'Band') (39)

The main pulse can be estimated by extracting the global maximum in the spectrum:

mp = peak(p) (40)

The summation of the peaks across bands can also be performed after the peak picking [25], with a clustering of the close peaks and summation of the clusters:

mp = sum(mp,'Band','Tolerance',.02,'s') (41)

More refined tempo estimations are available as well. For instance, three peaks can be collected for each periodicity spectrum, and if a multiplicity is found between their lags, the fundamental is selected [13]. Similarly, harmonics of a series of candidate lag values can be searched in the autocorrelation function [10].

Finally, the peaks in the autocorrelation can be converted into BPM using the mirtempo operator:

t = mirtempo(mp) (42)

5. MODELING PULSE CLARITY

The computations developed in the previous sections help to offer a description of the metrical content of a musical work in terms of tempo. But further analyses may produce additional important information related to rhythm. In particular, one important way of describing musical genres and particular works relates to the amount of pulsation, more precisely to the clarity of its expression. The understanding of pulse clarity may yield new ways to improve automated genre classification in particular.

5.1. Previous work

At least one previous work has studied this dimension [26], termed beat strength. The proposed solution is based on the computation
of the autocorrelation function of the onset detection curve decomposed into frames:

p = mirspectrum(o,'Autocor','Frame') (43)

The three best periodicities are then extracted [11]. These periodicities, or more precisely their related autocorrelation coefficients, are collected into a histogram:

h = mirhisto(t) (45)

From the histogram, two estimations of beat strength are proposed; the SUM measure sums all the bins of the histogram.

5.2. Statistical description of the autocorrelation curve

For that purpose, the analysis focuses on the autocorrelation function p itself, as defined in equation (43), and tries to extract from it any information related to the dominance of the pulsation. The most evident descriptor is the amplitude of the main peak, hence the global maximum of the curve:

MAX = max(p) (48)

It seems that the global minimum is usually (inversely) related to the importance of the main pulsation.

5.3. Harmonic relations between pulsations

The clarity of a pulse seems to decrease if pulsations with no harmonic relations coexist. We propose to formalize this idea as follows. First, a certain number of peaks are selected from the autocorrelation curve p. Let the list of peak lags be P = {l_i}, i in [0, N], and let the first peak l_0 be the one considered as the main pulsation, as determined in paragraph 4.2. The list of peak amplitudes is {p(l_i)}, i in [0, N].

A peak is considered inharmonic if the remainder of the Euclidean division of its lag by the lag of the main peak (and of the inverted division as well) is significantly high. This defines the set of inharmonic peaks H.

Other descriptors have been added that do not relate directly to the periodicity of the pulses, but indicate factors of energy variability that could contribute to the perception of a clear pulsation. Some factors defined in section 3 have been included:

• the articulation ARTI, based on the Average Silence Ratio (24),
• the attack slope ATAK (3.3).

Finally, a variability factor VAR sums the amplitude differences between successive local extrema of the onset detection curve.

The whole flowchart of operators required for the estimation of the pulse clarity factors is indicated in Figure 7.
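The harmonic-relation test of section 5.3 can be sketched in plain Python: a peak lag is flagged inharmonic when neither the remainder of dividing it by the main lag, nor that of the inverted division, is close to zero (or to a full period). The helper name and the tolerance value are assumptions of this sketch; the paper does not specify a threshold.

```python
def inharmonic_peaks(lags, tol=0.15):
    """Return the lags not in near-integer ratio with the main (first) lag."""
    def near_integer_ratio(a, b):
        r = (a % b) / b          # normalized remainder of the Euclidean division
        return min(r, 1.0 - r) < tol
    l0 = lags[0]
    return [l for l in lags[1:]
            if not (near_integer_ratio(l, l0) or near_integer_ratio(l0, l))]
```

For example, with a main lag of 0.5 s, a peak at 1.0 s is harmonic (double period) whereas a peak at 0.75 s is flagged as inharmonic, lowering the pulse clarity.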
Figure 7: Flowchart of operators for the estimation of the pulse clarity factors:
od = mironsets(a,'Detect','No')
o = mironsets(od,'Detect','Yes')
VAR = mirpulseclarity(o,'Variability')
AS = mirattackslope(od)
ART = mirlowenergy(od,'Threshold',.5)
p = mirspectrum(o,'Autocor')
mp = mirpeak(p); TEMP = mirtempo(mp); KURT = kurtosis(mp)
pp = mirpeaks(p); HARM = mirpulseclarity(pp,'Harmony')
MAX = max(p); MIN = min(p); ENTR = entropy(p)
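The statistical descriptors gathered at the bottom of the flowchart (MAX, MIN, ENTR) can be sketched in plain Python. The normalization of the autocorrelation curve into a probability distribution before taking the entropy is an assumption of this sketch; a flat curve (no dominant pulse) yields maximal entropy, a single sharp peak (clear pulse) yields zero.

```python
import math

def pulse_clarity_stats(p):
    """MAX, MIN and entropy descriptors of an autocorrelation curve p."""
    mx, mn = max(p), min(p)                        # MAX = max(p), MIN = min(p)
    total = sum(v for v in p if v > 0)
    probs = [v / total for v in p if v > 0]        # normalize positive part
    entr = -sum(q * math.log(q) for q in probs)    # ENTR = entropy(p)
    return {'MAX': mx, 'MIN': mn, 'ENTR': entr}
```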
ter of acoustic musical signals," IEEE Trans. Audio Speech Language Proc., vol. 14, no. 1, pp. 342–355, 2006.

[8] C. Duxbury, M. Sandler, and M. Davies, "A hybrid approach to musical note onset detection," in Proc. Digital Audio Effects (DAFx-02), Hamburg, Germany, Sep. 26-28, 2002, pp. 33–38.

[9] A. Friberg, E. Schoonderwaldt, and P. N. Juslin, "Cuex: An algorithm for extracting expressive tone variables from audio recordings," Acustica / Acta Acustica, no. 93, pp. 411–420, 2007.

[10] D. Eck and N. Casagrande, "Finding meter in music using an autocorrelation phase matrix and Shannon entropy," in Proc. Intl. Conf. on Music Information Retrieval, London, UK, Sep. 11-15 2005, pp. 504–509.

[11] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Trans. Speech Audio Proc., vol. 10, no. 5, pp. 293–302, 2002.

[12] P. Masri, Computer Modeling of Sound for Transformation and Synthesis of Musical Signal, Ph.D. thesis, University of Bristol, 1996.

[13] M. Alonso, B. David, and G. Richard, "Tempo and beat estimation of musical signals," in Proc. Intl. Conf. on Music Information Retrieval, Barcelona, Spain, Oct. 10-14 2004, pp. 158–163.

[14] J. P. Bello, C. Duxbury, M. Davies, and M. Sandler, "On the use of phase and energy for musical onset detection in the complex domain," IEEE Sig. Proc. Letters, vol. 11, no. 6, pp. 553–556, 2004.

[15] J. P. Bello, S. Abdallah, L. Daudet, C. Duxbury, M. Davies, and M. Sandler, "A tutorial on onset detection in music signals," IEEE Trans. Speech Audio Proc., vol. 13, no. 5, pp. 1035–1047, 2005.

[16] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Proc., vol. 8, no. 6, pp. 708–716, 2000.

[17] M. Davies and M. Plumbley, "Comparing mid-level representations for audio based beat tracking," in Proc. Digital Music Res. Network Summer Conf., Glasgow, July 23-24 2005.

[18] J. J. Burred and A. Lerch, "A hierarchical approach to automatic musical genre classification," in Proc. Digital Audio Effects (DAFx-03), London, UK, Sep. 8-11 2003, pp. 344–349.

[19] Y. Feng, Y. Zhuang, and Y. Pan, "Popular music retrieval by detecting mood," in Proc. Intl. ACM SIGIR Conf. on Res. Dev. Information Retrieval, Toronto, Canada, Jul. 28-Aug. 1 2003, pp. 375–376.

[20] G. Peeters, "A large set of audio features for sound description (similarity and classification) in the CUIDADO project (version 1.0)," Tech. Rep., Ircam, 2004.

[21] E. Pampalk, A. Rauber, and D. Merkl, "Content-based organization and visualization of music archives," in Proc. Intl. ACM Conf. on Multimedia, 2002, pp. 570–579.

[22] J. C. Brown, "Determination of the meter of musical scores by autocorrelation," J. Acoust. Soc. Am., vol. 94, no. 4, pp. 1953–1957, 1993.

[23] P. Toiviainen and J. S. Snyder, "Tapping to Bach: Resonance-based modeling of pulse," Music Perception, vol. 21, no. 1, pp. 43–80, 2003.

[24] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, "Evaluating rhythmic descriptors for musical genre classification," in Proc. AES Intl. Conf., London, UK, June 17-19 2004, pp. 196–204.

[25] S. Dixon, E. Pampalk, and G. Widmer, "Classification of dance music by periodicity patterns," in Proc. Intl. Conf. on Music Information Retrieval, London, UK, Oct. 26-30 2003, pp. 504–509.

[26] G. Tzanetakis, G. Essl, and P. Cook, "Human perception and computer extraction of musical beat strength," in Proc. Digital Audio Effects (DAFx-02), Hamburg, Germany, Sep. 26-28, 2002, pp. 257–261.

[27] G. E. P. Box and D. R. Cox, "An analysis of transformations," J. Roy. Stat. Soc., no. 26, pp. 211–246, 1964.