
Multi-Feature Modeling of Pulse Clarity from Audio

Olivier Lartillot,1 Tuomas Eerola,2 Petri Toiviainen,3 Jose Fornari


Finnish Centre of Excellence in Interdisciplinary Music Research, University of Jyväskylä, Finland
1olivier.lartillot@campus.jyu.fi, 2tuomas.eerola@campus.jyu.fi, 3petri.toiviainen@campus.jyu.fi

ABSTRACT

This paper defines pulse clarity as an introspective understanding of the quality of the rhythmic structures, not merely reduced to questions related to synchronization capabilities. The objective of this study is to establish a composite model explaining pulse clarity judgments from the analysis of audio recordings, and to ground the validity of the model through listening tests. The models used in this study are based on a range of musical features usually regarded as important in the perception of pulse clarity. Rhythmic periodicity is estimated via the autocorrelation of the amplitude envelope of the audio waveform. Statistical characterization of the autocorrelation function indicates the prominence of the main pulsation. Harmonic relations between the main pulsation and the secondary periodicities may also contribute to the rhythmic clarity. Besides periodicity, descriptors related to the amplitude envelope itself are also considered. The models have been written in Matlab using MIRtoolbox. To evaluate the models, 25 participants rated the pulse clarity of one hundred excerpts from movie soundtracks. The mapping between the model predictions and the ratings is carried out via regressions. Almost half of the listeners’ rating variance can be explained with a combination of periodicity-based and non-periodicity-based factors.

I. INTRODUCTION

A. Motivations

Analysing music recordings using computer models enables a detailed understanding of the form and the substance of the creative process developed by the performers and emerging from the compositions. Besides, computational analysis of music can contribute to the understanding of music cognition processes, as it makes it possible to draw more precise relations between musical characteristics as they are heard, and their effects as can be measured by experimental psychology in particular. In addition, systematic extraction of musical features facilitates the development of practical computational tools for the management of musical databases, such as the automated classification of musical databases along genres.

Clearly, the quality of the computational system for music analysis highly depends on the type of musical dimensions and characteristics that are formalized and integrated. Usually, a large range of low-level features is proposed, mostly focused on diverse statistical descriptions of the spectral content of the acoustic recording. Various higher-level musical components that can be related to traditional musical dimensions, such as tonality, rhythm and tempo, chords, etc., might be proposed as well. Additional musical descriptions, not necessarily fully documented in traditional musicology, are necessary in order to enrich the qualitative and quantitative depiction of the audio recordings. For instance, concerning the understanding of the emotional impact of music, the existing low-level features commonly used for the understanding of emotional effects of music – such as loudness, roughness, etc. – only account for a moderate amount of the variance in listeners’ responses (e.g., Leman et al., 2005). Hence novel high-level musical features are needed.

This study is focused on one particular high-level dimension that may contribute to the subjective appreciation of music: namely pulse clarity, which conveys how easily listeners can perceive the underlying pulsation in music.

The objective of this study is to establish a composite model explaining pulse clarity judgments from the analysis of audio recordings, and to evaluate the validity of the model through listening tests. In this project, the development and evaluation of novel features is based on the comparison with empirical data collected from listeners’ ratings.

B. Relation to pulse prediction studies

Research on musical pulse perception generally focuses on listeners’ capacity to actively infer and produce a musical pulse synchronized with the music being heard (cf. review in Toiviainen & Snyder, 2003). The main behavioural method for studying pulse finding requires subjects to tap the perceived pulse of musical excerpts. One disadvantage of this method is that it can be difficult to disentangle perceptual and motor processes, a problem shared with other tapping paradigms (Wing & Kristofferson, 1973).

At first sight, rating tasks may seem less informative than tapping exercises, which usually offer a detailed description of listeners’ active synchronization with music, including period and phase of tapping, time to find a pulse, variability of the intertap interval, and deviations from the beat. It seems however that a rating task focused on pulse clarity incites subjects to focus on an introspective understanding of the quality of the rhythmic structure that is complementary to the tapping paradigm. One hypothesis founding this study is that similarity of tapping behaviour in reaction to various musical excerpts does not necessarily imply that listeners qualitatively infer the same degree of pulse clarity for the different examples, since their subjective impression is not systematically documented. If two musical pieces significantly differ in terms of their pulse clarity, their corresponding tapping responses will probably indicate similar differences. The repercussions of more subtle variations in pulse clarity might be less easily detected in the tapping responses.

For these reasons, the notion of pulse clarity is considered in this paper as a subjective measure that participants were asked to rate whilst listening to a given set of musical excerpts. The aim is to model these behavioural responses using signal processing and statistical methods.

C. Relation to beat strength

This concept of pulse clarity shows some similarities with the notion of beat strength (Tzanetakis, Essl & Cook, 2002), “loosely defined as the rhythmic characteristic(s) that allows us to discriminate between two pieces of music having the
same tempo,” with the example that a piece of Hard Rock has (usually) a higher beat strength than a piece of Classical Music at the same tempo. However, the beat strength model of Tzanetakis et al., as explained in more detail in section V, is based on a global analysis of the beat distribution along a given musical piece, whereas in our approach, we conjecture the possibility of defining the clarity of a pulsation using as little as five seconds of a musical excerpt. Indeed, even musical excerpts less than a few seconds long can easily convey to the listeners various degrees of rhythmicity. From the analysis of the successive local contexts, global analyses can be performed through usual statistical techniques.

II. MODELLING PULSE CLARITY

The models used in this study are based on a range of musical features usually regarded as important in the perception of pulse clarity (see for instance Tzanetakis, Essl & Cook, 2002).

A. Describing the temporal evolution of the musical surface

Various factors of energy variability might contribute to the perception of clear pulsation. The analysis of energy variability requires an explicit description of the temporal evolution of the energy, which can be performed in several ways.

1) Amplitude envelope. One method consists in the extraction of the amplitude envelope from the waveform of the audio signal, showing the global outer shape of the signal. It is particularly useful for revealing the long-term evolution of the signal, and has application in particular to the detection of musical events such as notes.

Figure 1. Amplitude envelope of a piano ragtime excerpt, with detected note onsets highlighted by red circles.

2) Frame-decomposed energy. Another method is based on a decomposition of the audio recordings into a succession of very small and overlapping excerpts (called “frames”), and the computation of the average root-mean-square (RMS) energy related to each of these successive frames. Similarly to the amplitude envelope, frame-decomposed RMS curves focus on the dynamic evolution of the global intensity of sound.

Figure 2. RMS energy curve of the same piano ragtime excerpt, where crosses indicate the energy related to the successive frames. Detected note onsets are highlighted by red circles.

3) Spectral flux. Contrary to the amplitude envelope and the RMS energy curve, spectral flux focuses on a decomposition of the energy along the frequencies, and detects discontinuities throughout the temporal evolution of this spectral distribution. This makes it possible to highlight transitions in the musical surface that could not be perceived by simply looking at the global shape of the energy of the sound, as produced by amplitude envelopes and energy curves. One limitation of the spectral flux representations is that the comparisons are made locally, frame by frame, without carefully distinguishing the contribution of the different frequencies or considering longer temporal contexts.

Figure 3. Spectral flux of the piano ragtime excerpt, with detected note onsets highlighted by red circles.

4) Novelty curve. A method more developed than the previous ones consists in computing distances not only between strictly successive frames, but also between all frames in a temporal neighbourhood of pre-specified width. Detection of transitional points is performed using signal-processing operations: namely, inter-frame distances are stored into a similarity matrix, and a so-called "novelty curve" is computed through a convolution along the main diagonal of the similarity matrix with a Gaussian checkerboard kernel (Foote & Cooper, 2003). Intuitively, the novelty curve indicates the positions of transitions in the temporal evolution of pitch. The following example (Figures 4-6) shows that a violin solo can be segmented into successive notes using the novelty curve strategy.
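As a rough illustration of the frame-decomposed energy and spectral flux curves of section II.A, the following Python sketch computes both on a toy signal. It is not the MIRtoolbox implementation: the function names, the frame and hop sizes, the naive DFT and the Euclidean distance between spectra are all assumptions made for the example.

```python
import cmath
import math

def split_frames(signal, size, hop):
    """Cut a sample sequence into overlapping frames of `size` samples, `hop` apart."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def rms_curve(signal, size, hop):
    """Root-mean-square energy of each successive frame."""
    return [math.sqrt(sum(x * x for x in f) / size) for f in split_frames(signal, size, hop)]

def magnitude_spectrum(frame):
    """Magnitudes of a naive DFT (bins 0..N/2); adequate for short illustrative frames."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectral_flux(signal, size, hop):
    """Euclidean distance between magnitude spectra of successive frames (assumed metric)."""
    spectra = [magnitude_spectrum(f) for f in split_frames(signal, size, hop)]
    return [math.sqrt(sum((b - a) ** 2 for a, b in zip(s0, s1)))
            for s0, s1 in zip(spectra, spectra[1:])]

# Toy input (hypothetical): 0.2 s of silence, then a 100 Hz sine burst, sampled at 800 Hz.
SAMPLE_RATE = 800
toy_signal = [0.0] * 160 + [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(160)]

rms = rms_curve(toy_signal, size=32, hop=16)
flux = spectral_flux(toy_signal, size=32, hop=16)
# Index of the frame pair where the spectrum changes most, i.e. around the burst onset.
onset_frame = max(range(len(flux)), key=flux.__getitem__)
```

On this toy signal the RMS curve rises from zero once the burst starts, while the flux is largest around the frames that straddle the onset and close to zero elsewhere, which is the behaviour the text attributes to each curve.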

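The novelty curve strategy can be sketched in the same spirit. The sketch below builds a similarity matrix from per-frame feature vectors and slides a Gaussian-tapered checkerboard kernel along its main diagonal. The similarity measure exp(-distance), the kernel half-width and sigma, and the zero padding at the edges are assumptions chosen for illustration, not parameters reported in this paper.

```python
import math

def similarity_matrix(features):
    """Pairwise similarity between frame feature vectors, here exp(-Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [[math.exp(-distance(a, b)) for b in features] for a in features]

def checkerboard_kernel(half_width, sigma):
    """Gaussian-tapered checkerboard: +1 where both offsets lie on the same side
    of the centre frame, -1 where they lie on opposite sides."""
    offsets = [o for o in range(-half_width, half_width + 1) if o != 0]
    return {(i, j): math.copysign(1.0, i * j) * math.exp(-(i * i + j * j) / (2 * sigma ** 2))
            for i in offsets for j in offsets}

def novelty_curve(similarity, half_width=2, sigma=2.0):
    """Slide the kernel along the main diagonal; frames too close to the edges get 0."""
    n = len(similarity)
    kernel = checkerboard_kernel(half_width, sigma)
    curve = [0.0] * n
    for t in range(half_width, n - half_width):
        curve[t] = sum(w * similarity[t + i][t + j] for (i, j), w in kernel.items())
    return curve

# Toy input (hypothetical): five frames of one "note", then five frames of another.
toy_features = [[0.0, 0.0]] * 5 + [[3.0, 3.0]] * 5
novelty = novelty_curve(similarity_matrix(toy_features))
transition = novelty.index(max(novelty))  # frame adjacent to the change of note
```

Inside a homogeneous region the positive and negative quadrants of the kernel cancel, so the curve stays near zero; it peaks where the frames before and after the centre are each self-similar but mutually dissimilar.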
Figure 4. Frame-decomposed spectrum (here, autocorrelation function) of the beginning of a violin solo.

Figure 5. Similarity matrix showing the similarity (in warm colors) between all the successive frames in the autocorrelation curve of the previous figure (ordered along the x-axis), with all the same frames (ordered along the y-axis).

Figure 6. Novelty curve obtained by analyzing the local discontinuities along the main diagonal of the similarity matrix. Red circles indicate temporal positions of tone transitions.

B. Non-periodic accounts of pulse clarity

Pulse clarity characterizations are drawn from various descriptions of the dynamic curves presented in paragraph A. Most of these descriptions concern the periodicity of particular events, as detailed in paragraphs C and D. This paragraph, however, does not focus on periodic pulsation, but rather on the characterization of individual pulses. These pulses are extracted from the dynamic curves through a detection of the major peaks, i.e. local maxima that show significant contrast.

1) Attack Characterization. The clarity of each pulse might be correlated with the sharpness of the corresponding note attack. The attack phase of each note can be tentatively detected using the amplitude envelope: the peaks, as shown in Figure 7, actually correspond to the end of each attack phase. The starting point of each phase can be estimated as well, by extracting the nearest local minimum preceding each peak. To be more precise, the local minimum is actually computed from a smoothed version of the amplitude envelope, in order to ignore any spurious burst of energy not directly related to the attack phase itself.

Figure 7 shows the onset detection of the beginning of a performance of Schumann's Kinderszene II, with determination of the attack phases.

Figure 7. Onset detection (red circles) related to the beginning of a piano performance of Schumann’s Kinderszene II, and determination of the attack phases, indicated by ascending red lines. The starting points of the attack phases are indicated by red diamonds, and the ending points correspond to the note onset times (red circles).

Each attack phase can be characterized by its mean slope (Peeters, 2004). Mean slope seems to provide an indicator of attack that is more robust than attack duration.

Figure 8 shows the slope of the attack phases extracted from the previous example. Each musical excerpt has therefore been associated with a variable AS, corresponding to the average of the mean slope of all the attack phases. See Figure 18 for an overview of the variables defined in this paper.

Figure 8. Slope of each attack phase determined in the previous example. The temporal positions of attack phases are indicated along the x-axis.

2) Articulation. Another possible non-periodic account of pulse clarity is related to articulation (ART variable), i.e., the performance characterization oriented along the staccato/legato polarity. The Average Silence Ratio (ASR) is defined for the RMS curve, and indicates the percentage of frames that have an RMS energy significantly lower than the mean RMS energy of all frames (Feng, Zhuang & Pan, 2003). The ASR is similar to the low-energy rate (Burred & Lerch, 2003), except for the use of a different energy threshold: the ASR is meant to characterize significantly silent frames.

3) Variability. Finally, a variability factor (VAR) sums the amplitude difference between successive local extrema of the onset detection curve, and another factor (PEAK) stores the average magnitude of the peaks in the amplitude envelope.

C. Statistical description of the autocorrelation function

Compared to non-periodic descriptors of individual pulses, global characterization of periodic pulsations may be supposed to contribute significantly to the perception of pulse clarity. Rhythmic periodicity is commonly estimated via the computation of the autocorrelation function of the amplitude envelope (Brown, 1993), or preferably of the time derivative of the amplitude envelope (Figure 9). This autocorrelation function, as shown in Figure 10, indicates the presence of periodicities for a given range of possible periods. The peaks indicate the most probable periods.
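A minimal Python sketch of this periodicity estimation is given below. It normalises the autocorrelation by its value at the origin and reads the MAX, MIN and TEMP descriptors from the lags corresponding to 40-200 BPM, as this section describes. Several points are assumptions of the sketch rather than statements about the authors' implementation: the mean is removed before correlating, MIN is searched in the same tempo range as MAX, and the resonance weighting and harmonic enhancement steps are left out.

```python
def normalised_autocorrelation(curve, max_lag):
    """Autocorrelation of the mean-removed curve, divided by its value at lag 0."""
    mean = sum(curve) / len(curve)
    centred = [v - mean for v in curve]
    energy = sum(v * v for v in centred)
    return [sum(centred[t] * centred[t + lag] for t in range(len(centred) - lag)) / energy
            for lag in range(max_lag + 1)]

def periodicity_descriptors(onset_curve, frame_rate, bpm_range=(40, 200)):
    """MAX, MIN and TEMP read from the lags that correspond to the given tempo range."""
    shortest = round(frame_rate * 60 / bpm_range[1])  # lag (in frames) of the fastest tempo
    longest = round(frame_rate * 60 / bpm_range[0])   # lag (in frames) of the slowest tempo
    r = normalised_autocorrelation(onset_curve, longest)
    lags = range(shortest, longest + 1)
    main_lag = max(lags, key=lambda lag: r[lag])
    return {
        "MAX": r[main_lag],                  # amplitude of the main peak
        "MIN": min(r[lag] for lag in lags),  # assumed: same lag range as MAX
        "TEMP": 60 * frame_rate / main_lag,  # tempo of the main peak, in BPM
    }

# Toy onset curve (hypothetical): one impulse every 0.5 s at 100 frames per second.
FRAME_RATE = 100
toy_onsets = [1.0 if t % 50 == 0 else 0.0 for t in range(600)]
descriptors = periodicity_descriptors(toy_onsets, FRAME_RATE)
```

For this strictly periodic toy curve the main peak falls at a lag of 50 frames, MAX is close to 1 and MIN is negative, in line with the remark that periodic zero-mean stimuli produce minima with negative values.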
Figure 9. Time derivative of the amplitude envelope of the ragtime excerpt (cf. Figure 1).

Figure 10. Autocorrelation function related to the previous onset curve.

The autocorrelation function performs a purely mathematical analysis of the dynamic curve. In order to model the perception of musical pulses, the most salient periodicities are emphasized by multiplying the autocorrelation function by a resonance function (Toiviainen & Snyder, 2003).

Figure 11. Autocorrelation function of the onset curve, after application of the resonance curve, in order to emphasize the most salient periodicities.

Various signal processing techniques can be applied in order to improve the results. For instance, in autocorrelation functions, periodicities are shown by peaks at the corresponding periods, but also by peaks related to all possible multiples of these periods, leading to a highly redundant structure. These harmonic series can be simplified by using an enhancement method (Tolonen & Karjalainen, 2000).

Figure 12. Enhanced autocorrelation function of the same onset curve, where redundant harmonics have been filtered out.

From the resulting autocorrelation function, various statistical characterizations have been chosen in order to indicate the prominence of the pulsations.

1) Global Maximum. The most evident descriptor is the amplitude of the main peak (MAX variable), i.e., the global maximum of the curve. By property, the autocorrelation function reaches its peak at the origin, i.e. for a period of 0 s, meaning that the correlation due to periodic repetitions in the signal is necessarily inferior to the correlation of the signal with itself. This has no use in our study, except that the maximum at the origin can be used as a reference in order to normalize the autocorrelation function. In this way, the actual values shown in the autocorrelation function correspond uniquely to periodic repetitions, and are not influenced by the global intensity of the total signal. The global maximum is extracted within the part of the autocorrelation function corresponding to perceptible rhythmic periodicities, i.e. for the range of tempi between 40 and 200 BPM.

Figure 13. Global maximum of the enhanced autocorrelation function, highlighted with a red circle. Its amplitude corresponds to factor MAX, and its corresponding tempo leads to factor TEMP. This half-wave rectified curve is used for the estimation of entropy ENTR as well.

2) Global Minimum. The global minimum (MIN) gives another aspect of the importance of the main pulsation. The motivation for including this measure lies in the fact that for periodic stimuli with a mean of zero the autocorrelation function shows minima with negative values, whereas for non-periodic stimuli this does not hold true. Note that this factor cannot be computed from the enhanced autocorrelation function, as the enhancement implicitly removes the negative values in the curve.

Figure 14. Global minimum of the non-enhanced autocorrelation function, highlighted with a red circle. Its amplitude corresponds to factor MIN.

3) Peak Kurtosis. Another way of describing the clarity of a rhythmic pulsation consists in assessing whether the main pulsation is related to a very precise and stable periodicity, or if on the contrary the pulsation slightly oscillates around a range of possible periodicities. We propose to evaluate this characteristic through a direct observation of the autocorrelation function. In the first case, if the periodicity remains clear and stable, the autocorrelation function should display a clear peak
at the corresponding periodicity, with significantly sharp slopes. In the second and opposite case, if the periodicity fluctuates, the peak should present far less sharpness and the slopes should be more gradual. This characteristic can be estimated by computing the kurtosis of the lobe of the autocorrelation function containing the major peak (cf. Figure 15). The kurtosis, or more precisely the excess kurtosis of the main peak (KURT), returns a value close to zero if the peak resembles a Gaussian. Higher values of excess kurtosis correspond to higher sharpness of the peak.

Figure 15. Lobe of the autocorrelation function containing the main peak. The kurtosis of this lobe is used for factor KURT.

4) Entropy. The entropy of the autocorrelation function (ENTR) characterizes the simplicity of the function and provides in particular a measure of the peakiness of the function. This measure can be used to discriminate periodic and non-periodic signals. In particular, signals exhibiting periodic behaviour tend to have autocorrelation functions with clearer peaks and thus lower entropy than non-periodic ones.

5) Tempo. Another hypothesis is that the faster a tempo (TEMP) is, the more clearly it is perceived by the listeners. This conjecture is based on the fact that fast tempi imply a higher density of beats, guiding the rhythmic understanding of the listeners more tenaciously.

D. Rhythmic harmonicity

Harmonic relations between the main pulsation and the secondary periodicities may also contribute to the rhythmic clarity. The clarity of a pulse seems to decrease if pulsations with no harmonic relations coexist. In the case of a simple metrical decomposition, periods are all in integer ratio with respect to the shortest period. These pulsations simply correspond to the common subdivisions of the metrical hierarchy. If on the contrary several pulsations are not detected as harmonics of the same fundamental pulsation, the resulting rhythmic pulsation might appear more complex and less intelligible.

Based on these observations, we postulate that pulse clarity is low if the pulsation structure contains heterogeneous – i.e., non-harmonic – elements, thereby failing to exhibit a single multi-layered metrical hierarchy. The estimation of the hierarchic quality of the pulsation structure can be formalized in different ways. One solution for instance would consist in testing whether all the periods can be expressed as multiples of the shortest period. The variant of the previous method used in our model focuses on the main period instead of the shortest one. The main period, as shown by the global maximum in the autocorrelation function (Figure 13), corresponds to the pulsation that is the most extensively developed in the given musical piece or excerpt. The strategy thus consists in testing whether all the periods can be expressed as multiples or subdivisions of the main period.

This strategy can be formalized as follows. First a certain number of peaks are selected from the autocorrelation function r. By default all local maxima showing sufficient contrast with respect to their adjacent local minima are selected. More precisely, a local maximum is selected as a peak if its distance from the previous and successive local minima (if any) is higher than a pre-specified threshold.

Figure 16. From the autocorrelation function r are extracted peaks: the main peak $l_0$ related to the period .46 s, and two lower peaks $l_1$ and $l_2$, considered as non-harmonic since their periods, respectively .61 s and 1.08 s, are not multiples of the period of the main peak $l_0$.

Let the list of peak lags be $P = \{l_i\}_{i \in [0, N]}$, and let the first peak $l_0$ be related to the main pulsation. The list of peak amplitudes is $\{r(l_i)\}_{i \in [0, N]}$. A peak will be inharmonic if the remainder of the Euclidean division of its lag by the lag of the main peak (and the inverted division as well) is significantly high. This defines the set of inharmonic peaks IH:

$$IH = \{\, i \in [0, N] \mid \text{either } l_i \in [\alpha l_0, (1-\alpha) l_0] \pmod{l_0} \text{ or } l_0 \in [\alpha l_i, (1-\alpha) l_i] \pmod{l_i} \,\}$$

where $\alpha$ is a constant tuned to .15 in our implementation.

The degree of harmonicity is thus decreased by the cumulation of the autocorrelation coefficients of the inharmonic peaks:

$$\mathrm{HARM} = \exp\left( -\frac{1}{\beta} \sum_{i \in IH} \frac{r(l_i)}{r(l_0)} \right)$$

where $\beta$ is another constant tuned to 4.

III. EVALUATION OF THE MODELS

In order to assess the validity of the models predicting pulse clarity judgments presented in the previous section, a listening experiment was carried out.

The stimulus material was initially based on a database of 360 short excerpts of movie soundtracks, 15 to 30 seconds long, which is also used in a related project studying the effect of various musical features on emotions induced in listeners. From the initial database, 100 five-second excerpts were selected, so that the chosen samples qualitatively cover a large range of pulse clarity (and also tonal clarity, another high-level feature studied in our research project). For instance, pulsation might be absent, ambiguous, or on the contrary clear or even excessively steady. The selection has been performed intuitively, by ear, but also with the support of a computational analysis of the database based on a first version of the harmonicity-based pulse clarity model.

A. Listening experiments

Twenty-five participants rated the pulse clarity of one hundred excerpts from movie soundtracks. Each participant rated the excerpts alone, on a 9-value scale using a computer
interface developed in the PD environment, in which the order of presentation was randomised. During the same experiment, the subjects were asked to rate tonal clarity, brightness and articulation as well.

For the modelling, the mean response of the participants’ ratings was taken, as there was considerable consensus in the pulse clarity ratings as measured by the mean inter-subject correlation (r = .581, p < 0.001). In parallel, the same set of musical stimuli was fed to the pulse clarity models presented in the previous section. The mapping between the model predictions and the listeners' ratings was finally carried out via regressions.

B. Model evaluation

A flowchart of all operators used in the present study for the estimation of pulse clarity is indicated in Figure 18. The models have been implemented in Matlab using the MIRtoolbox (Lartillot & Toiviainen, 2007).

One problem raised by the computational framework presented in this paper is related to the high number of degrees of freedom that have to be specified when choosing the proper onset detection curve and periodicity detection and characterization methods. The MIRtoolbox environment has been designed with the objective of enabling the design and evaluation of complex flowcharts. Series of alternative values can be specified for each option and parameter, and all the resulting models are computed and evaluated one after the other. Due to the combinatorial complexity of possible configurations, optimisation tools are being designed that systematically sample the set of possible solutions and produce a large number of flowcharts progressively loaded with musical databases and compared with the behavioural data.

C. Pre-processing of the statistical variables

As a prerequisite to the statistical mapping, listeners’ ratings and model predictions need to exhibit Gaussian distributions. The mapping routine includes an optimization algorithm that detects data that are not sufficiently normal, based on the Lilliefors (1967) test using a significance level of .5. The non-normally distributed data are progressively transformed into normally distributed data through optimization based on power transformation (Box and Cox, 1964). In our implementation of the optimization process, the normalization does not always succeed completely, as 8 distributions, out of 74, remained non-normalized according to the Lilliefors test even after the transformation.

IV. RESULTS

A. Analysis of non-normalized variables

Before the normalization of the distributions based on the Box-Cox optimisation process, the best correlations with the pulse clarity ratings are achieved by the following variables, in decreasing order of absolute correlation:

1. MIN, based on amplitude envelope, with a correlation r = .51,
2. MAX, using the amplitude envelope, r = .48,
3. MIN, based on spectral flux, r = .45,
4. MAX, using the spectral flux, r = .38,
5. AS, r = .34,
6. KURT, using the amplitude envelope, r = .31,
7. HARM, using the spectral flux, r = .23,
8. ENTR, using the novelty curve, r = .22,
9. ENTR, using the amplitude envelope, r = .20,
10. TEMP, r = .19.

Variables showing a cross-correlation higher than .75 with other variables offering better correlation r with the ratings have been removed from this list. The cross-correlations between the selected variables are indicated in Table 1.

Table 1. Cross-correlation between the ten most correlated variables (before normalization of the distribution).

        2.    3.    4.    5.    6.    7.    8.    9.   10.
1.     .61   .63   .37   .30   .51   .27   .18   .19   .01
2.       1   .59   .65   .29   .64   .31   .24   .33   .04
3.             1   .37   .44   .54   .05   .12   .17   .22
4.                   1   .27   .47   .46   .25   .18   .00
5.                         1   .16   .16   .26   .07   .04
6.                               1   .32   .17   .37   .11
7.                                     1   .23   .22   .05
8.                                           1   .11   .04
9.                                                 1   .20

A stepwise regression leads to the following model:
1. MIN, with an adjusted r² = .25,
2. ENTR, leading to a cumulative adjusted r² = .35,
3. HARM, cumulative adjusted r² = .39,
4. temporal slope of AS, cumulative adjusted r² = .41,
5. PEAK, cumulative adjusted r² = .43,
6. KURT, cumulative adjusted r² = .46.

The fourth variable integrated in the model represents the temporal slope formed by the series of attack slopes related to the successive note onsets.

B. Analysis of normalized variables

After the normalization, the results show significant differences. The best correlations with the pulse clarity ratings are achieved by the following variables, in decreasing order of absolute correlation:

1. MIN, based on amplitude envelope, with a correlation r = .49,
2. MAX, based on amplitude envelope, with a correlation r = .40,
3. MIN, based on spectral flux, r = .39,
4. AS, r = .36,
5. MAX, based on spectral flux, r = .36,
6. PEAK, r = .30,
7. KURT, using the amplitude envelope, r = .28,
8. HARM, using the spectral flux, r = .23,
9. temporal slope of AS, r = .22,
10. ENTR, using either amplitude envelope or spectral flux, r = .21.

The cross-correlations between the variables are indicated in Table 2.
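The power transformation behind this normalisation step can be illustrated with a short Python sketch. Note the substitution: the paper selects the transformation with the help of the Lilliefors test, whereas this sketch simply picks the Box-Cox parameter on a grid by maximising the usual Gaussian log-likelihood, and it assumes strictly positive input values. The function names, grid and toy data are hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox power transform of strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def log_likelihood(values, lam):
    """Profile log-likelihood of lambda under a Gaussian model of the transformed data."""
    transformed = box_cox(values, lam)
    n = len(values)
    mean = sum(transformed) / n
    variance = sum((y - mean) ** 2 for y in transformed) / n
    return -0.5 * n * math.log(variance) + (lam - 1) * sum(math.log(v) for v in values)

def best_lambda(values, candidates=None):
    """Grid search for the lambda that maximises the log-likelihood (assumed criterion)."""
    if candidates is None:
        candidates = [step / 10 for step in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return max(candidates, key=lambda lam: log_likelihood(values, lam))

# Toy data (hypothetical): skewed, strictly positive values that become symmetric after a log.
toy_values = [math.exp(v) for v in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
lam = best_lambda(toy_values)
normalised = box_cox(toy_values, lam)
```

A normality test such as Lilliefors would then be run on the transformed values to decide whether the variable can be treated as normalised; that test is not implemented here.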
Table 2. Cross-correlation between the ten most correlated variables (after normalization of the distribution).

        2.    3.    4.    5.    6.    7.    8.    9.   10.
1.      .5   .49   .32   .33   .18   .47   .29   .11   .20
2.       1   .46   .30   .53   .01   .66   .26   .14   .28
3.             1   .37   .29   .16   .42   .02   .20   .14
4.                   1   .26   .73   .24   .18   .14   .10
5.                         1   .02   .42   .48   .11   .17
6.                               1   .05   .04   .20   .10
7.                                     1   .31   .01   .38
8.                                           1   .19   .22
9.                                                 1   .21

A stepwise regression leads to the following model:
1. MIN, adjusted r² = .24,
2. ENTR, cumulative adjusted r² = .35,
3. HARM, cumulative adjusted r² = .38,
4. temporal slope of AS, cumulative adjusted r² = .41,
5. PEAK, cumulative adjusted r² = .45,
6. KURT, cumulative adjusted r² = .47.

C. Comments

The correlations between individual variables and listeners’ ratings are not particularly high, as they do not exceed .5 here, once the data has been properly scaled in order to exhibit a Gaussian distribution. This would therefore indicate that pulse clarity judgments result from a complex appreciation of the musical surface related to various characterizations of the dynamic evolution, and in particular related to the periodic configurations. One of the most salient and expected factors for pulse clarity is the global maximum of the autocorrelation curve, indicating the relative amplitude of the main periodicity (MAX). The opposite description, i.e. the global minimum of the autocorrelation curve (MIN), might represent at first sight a less intuitive characteristic of the periodicities. Yet the MIN factor offers the most significant contribution to the explanation of the variance in listeners’ ratings, which proves to be dissociated from the MAX factor. A more comprehensive interpretation of the MIN variable is planned for further research.

The large set of features modelled in the proposed methodology has made it possible to demonstrate that rhythmic periodicities can result from several dimensions of the musical surface, such as those related to the global evolution of the sound dynamics (MAX and MIN based on amplitude envelope) or to the qualitative evolution of the spectral content (MAX and MIN based on spectral flux): the two paradigms are correlated to a certain extent (up to .53), but not totally.

The sharpness of the main peak in the autocorrelation function – described by its kurtosis (KURT) – is a characterization of periodicity whose significance within the final statistical validation is quite congruent with our expectations. The model we developed in order to formalize the possible harmonicity effects between multiple layers of pulsations (HARM) contributes modestly to the general picture, by

entropy (ENTR) gives also another explanation partially dissociated from the inharmonicity factor.

Non-periodic factors present a significant impact on pulse clarity judgments: attack qualities related to individual tone envelopes play a role, both in terms of dynamic evolution of the attack phases (characterized by the attack slope AS) and their relative maximal energy (PEAK). Hence both periodic and non-periodic descriptions contribute to the qualitative descriptions of rhythmic strength.

The temporal slope of AS is a problematic feature that was incidentally added to the picture due to underlying automated processes in the MIRtoolbox environment, and that nonetheless proves to play a significant role in the explanation of listeners’ variance. The difficulty here is to offer a rational explanation of the contribution of this feature, which in theory does not particularly relate to the concept of pulse clarity. Either this feature emerged by pure coincidence, indicating a possible risk of model overfitting, or the discovery indicates the existence of a particular phenomenon that would require further explanation.

Finally an integration of the various factors can be achieved through stepwise regression, resulting in a model that can explain nearly one half of the variance in the listeners’ ratings. This first investigation, although promising, needs to be further improved, in order to reveal whether or not the remaining part of the variability can be explained with systematic descriptions.

V. DISCUSSION

A. Related work

The modelling of beat strength by Tzanetakis, Essl and Cook (2002) is based on the computation of the autocorrelation function of the onset detection curve decomposed into frames and the extraction of the three best periodicities. These periodicities – or more precisely, their related autocorrelation coefficients – are collected into a histogram. Two methods are proposed in order to estimate beat strength from the histogram: the SUM measure sums all the bins of the histogram, whereas the PEAK measure divides the maximum value by the main amplitude. As explained previously, the modelling of beat strength is aimed at understanding the global metrical aspect of an extensive musical piece. Our study, on the contrary, is focused on an understanding of the short-term characteristics of rhythmical pulse.

B. Future works

Improvement of the current models and addition of new features may enable the establishment of an extensive computational modelling that would offer more significant correlation with the listeners’ ratings.

More advanced articulations between the various features already developed may be attempted as well. Besides additive regressions, multiplicative regressions may be of relevance, for particular variables such as rhythmic inharmonicity for instance. Other articulations between the variables need to be integrated directly in the design of the feature extraction strategies. Let’s consider for instance the dichotomy between
explaining a small part of the ratings variability on its own.
the features based on amplitude envelope and on spectral
The simpler characterization of periodic simplicity in terms of
descriptions: the behaviour of the two respective paradigms
may explain disjunctive categories of musical examples: those where the rhythm is based on global energy or, respectively, on spectral and pitch content. A more robust model should be able to adapt to any of these representations, by combining amplitude and spectral information and extracting periodicities in both dimensions.

C. A computational framework for feature design and evaluation

The study developed in this paper attempted an experimental modelling of a high-level musical dimension based on a combination of features extracted from audio recordings, and their mapping with listeners’ ratings. The resulting multi-feature model has been integrated in the latest version of MIRtoolbox (Lartillot & Toiviainen, 2007), based on the values returned by the stepwise regressions presented in the previous section. One tunable option of the mirpulseclarity operator concerns the level of detail of the model, which corresponds to the number of variables to be integrated into the multi-feature model. Alternatively, users can choose by themselves the variables to be integrated into the framework and specify the respective weights.

From a methodological point of view, this project also led to the design of a new computational tool for the establishment of such high-level features, which has been integrated into the MIRtoolbox environment. The various models can be designed using signal processing operators, high-level musical feature extractors and statistical operators available in MIRtoolbox. Complex flowcharts featuring all the models and all their possible variants can be designed, and progressively evaluated by automated routines. The resulting predictions are then stored in a text file (‘my_prediction.txt’ in figure 17). In parallel, the listeners’ ratings are stored in another text file (‘listeners_ratings.txt’). The statistical mapping can be automated using the mirmap routine, which implicitly performs the normalization of the distribution and the linear regressions. The new version of MIRtoolbox featuring the pulse clarity predictor and the statistical routines can be downloaded for free (Lartillot et al., 2008).

ACKNOWLEDGMENT

This work has been supported by the European Commission (BrainTuning FP6-2004-NEST-PATH-028570), the Academy of Finland (project 119959) and the Center for Advanced Study in the Behavioral Sciences, Stanford University. We are grateful to Tuukka Tervo for running the listening experiment.

REFERENCES

Box, G.E.P., & Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B 26, 211-246.
Brown, J. C. (1993). Determination of the meter of musical scores by autocorrelation. Journal of the Acoustical Society of America, 94-4, 1953-1957.
Burred, J.J., & Lerch, A. (2003). A hierarchical approach to automatic musical genre classification. Proceedings of the Sixth International Conference on Digital Audio Effects, UK: London.
Feng, Y., Zhuang, Y., & Pan, Y. (2003). Popular music retrieval by detecting mood. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 375-376), Canada: Toronto.
Foote, J.T., & Cooper, M.L. (2003). Media Segmentation using Self-Similarity Decomposition. Proceedings of SPIE Storage and Retrieval for Multimedia Databases (pp. 167-75), US: San Jose, California.
Lartillot, O., & Toiviainen, P. (2007). A Matlab toolbox for musical feature extraction from audio. Proceedings of the Tenth International Conference on Digital Audio Effects, France: Bordeaux.
Lartillot, O., Toiviainen, P., & Eerola, T. (2008). MIRtoolbox. http://www.jyu.fi/music/coe/materials/mirtoolbox
Leman, M., Vermeulen, V., De Voogdt, L., Moelants, D., & Lesaffre, M. (2005). Prediction of musical affect using a combination of acoustic structural cues. Journal of New Music Research, 34(1), 39-67.
Lilliefors, H.W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.
Peeters, G. (2004). A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Technical report, version 1.0.
Toiviainen, P., & Snyder, J. S. (2003). Tapping to Bach: Resonance-based modeling of pulse. Music Perception, 21-1, 43-80.
Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing, 8-6, 708-716.
Tzanetakis, G., Essl, G., & Cook, P. (2002). Human perception and computer extraction of musical beat strength. In Udo Zölzer (Ed.), Proceedings of the Fifth International Conference on Digital Audio Effects (pp. 257-261). Germany: Hamburg.
Wing, A. M., & Kristofferson, A. B. (1973). Response delays and the timing of discrete motor responses. Perception and Psychophysics, 14, 5-12.

Figure 17. General overview of the feature design and evaluation mechanisms offered in MIRtoolbox.
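
The evaluation loop summarized in Figure 17 (model predictions on one side, listeners’ ratings on the other, then a statistical mapping between them) can be illustrated with a minimal Python sketch. It does not reproduce the mirmap routine: the function names, the fixed Box-Cox exponent `lam` and the toy numbers are all assumptions chosen for illustration. The sketch applies a Box-Cox transform (Box & Cox, 1964) to the predictions, then reports the Pearson correlation with the ratings and the adjusted r² of a one-variable linear fit.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values.

    lam is supplied by the caller here (an assumption); it is not fitted.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted r2: penalizes r2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical toy data standing in for the two text files of Figure 17.
predictions = [0.2, 0.5, 0.9, 1.4, 2.0, 3.1]
ratings = [1.0, 2.0, 2.5, 3.5, 4.0, 4.5]

normalized = box_cox(predictions, lam=0.5)  # lam=0.5 is an arbitrary choice
r = pearson_r(normalized, ratings)
adj = adjusted_r2(r * r, n_samples=len(ratings), n_predictors=1)
print(f"r = {r:.3f}, adjusted r2 = {adj:.3f}")
```

With a single predictor, the squared Pearson correlation equals the r² of the least-squares line, which is why no explicit regression fit is needed in this simplified case.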
Figure 18. Pulse clarity variables, and their localization in the feature extraction flowchart. Options and parameters are indicated in italics. The whole set of possible alternatives is automatically attempted, evaluated and compared.
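
Two of the variables in Figure 18, MAX and MIN, are the global maximum and minimum of the autocorrelation curve. A minimal Python sketch on a synthetic envelope follows. The normalization by the zero-lag energy, the lag range, the exclusion of lag 0 and the toy pulse train are assumptions made for illustration; the sketch does not reproduce the MIRtoolbox envelope extraction or its parameters.

```python
def autocorrelation(envelope, max_lag):
    """Autocorrelation for lags 0..max_lag, normalized so that lag 0 equals 1."""
    energy = sum(v * v for v in envelope)
    if energy == 0:
        raise ValueError("envelope has no energy")
    n = len(envelope)
    return [
        sum(envelope[i] * envelope[i + lag] for i in range(n - lag)) / energy
        for lag in range(max_lag + 1)
    ]


def max_min_descriptors(envelope, max_lag):
    """Return (MAX, lag of MAX, MIN), ignoring the trivial peak at lag 0."""
    curve = autocorrelation(envelope, max_lag)
    lags = range(1, max_lag + 1)
    best_lag = max(lags, key=lambda lag: curve[lag])
    return curve[best_lag], best_lag, min(curve[lag] for lag in lags)


# Hypothetical envelope: one impulse every 4 samples, i.e. a very clear pulse.
clear_pulse = [1.0, 0.0, 0.0, 0.0] * 8
max_value, max_lag_found, min_value = max_min_descriptors(clear_pulse, max_lag=16)
print(max_value, max_lag_found, min_value)
```

On this toy input the strongest periodicity is found at the impulse spacing; a real amplitude envelope would first have to be extracted from the audio, which is outside the scope of the sketch.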
