
Cross-Correlation as a Measure for Cross-Modal Analysis of Music and Floor Data

Naveen Kulkarni

Abstract

With the advent of multimodal sensing, complex human behaviors are being understood
better, from feature extraction through cognitive analysis. This project uses concepts from
multimodal sensing and data processing to explore applications that could potentially be
developed from this knowledge. Here we study the feature sets of audio and floor data using
spectral-band-based analysis. Potential applications include automatic genre identification in
virtual-reality entertainment systems, and objective performance analysis of a dancer in a
reality show, where the generated evaluation can be interpreted much like an objective score
or the mean opinion score used in the case of music-only data. The basic idea is to use
correlation as a metric for finding the similarity between music and floor data.

1 Introduction

Music is a representation of pressure variations over time and can therefore be seen as a wave
whose amplitude is proportional to those variations. Intuitively, it can be hypothesized that
the floor data obtained from a dancer's performance is similar in nature to the music it
accompanies, and hence there is a high probability of correlation between the two modalities.
Assuming this to be true, the computed correlation values may be used to grade the dancer's
performance for the given music piece. This project aims at extracting a few features and
finding the best feature among them. The important question to be answered is: "Can we
obtain an objective evaluation of a dancer's performance for a given piece of music using
correlation as a metric?" As part of answering this question, the following action items are
covered:

• Represent sound and floor pressure data in a feature space

• Experiment and find the best set of features to represent the datasets

• Identify and establish a relationship between the datasets captured from the two modalities

Section 2 covers the feature extraction mechanisms for music and floor data and identifies the
features that work best in the present case. Section 3 establishes the relationship between the
two sensing modalities. Results are included in Section 4 and conclusions based on the
inferences are provided in Section 5.

2 Method

2.1 Data Partitioning

The inputs to the system are:

• Music data with a sampling rate of 44100 Hz.


• Floor data with a frame rate of 40 frames per second.

Only the left channel of the given audio wav files is considered for all audio processing. The
relevant attributes of the floor pressure data, namely frame, size, x, y and val, were extracted
from the data captured (at 40 frames per second) by the pressure sensing floor [2]. The
complete floor dataset is partitioned into smaller units, with the appropriate floor data
associated with the corresponding music data using a MATLAB script; this makes the data
simple and efficient to use. Three different dance segments have been used in the
experiments, with the corresponding floor data extracted from the complete floor dataset. To
obtain the same rate for the floor and music data, 1102 audio samples are grouped into one
frame, so that 44100 samples of audio correspond to approximately 40 frames per second of
music data. The window used to segment the input audio is 37.5 ms (25 + 12.5 ms) long,
corresponding to 1653 (1102 + 551) samples of overlapped audio per windowed frame; each
1102-sample frame is extended by 551 samples, i.e. 50% of the frame, into the next. To arrive
at one signal value per frame, a Hanning window

w(n) = 0.5(1 − cos(2πn/(N − 1))), 0 ≤ n ≤ N − 1,

where N is the window's width, was constructed over the frame and used to compute the
weighted mean across it. This ensures that the resampled data is smooth and approximates
the original signal well. Both the audio and pressure datasets were then normalized to zero
mean and unit variance.
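
A minimal MATLAB sketch of this resampling step is given below, following the numbers
above; the input file name, variable names and loop structure are illustrative assumptions
rather than the project's original script.

```matlab
% Sketch of the frame-level resampling described above.
[x, fs] = audioread('dance_segment.wav');   % hypothetical input file
x = x(:, 1);                                % left channel only

hop    = 1102;                              % 25 ms at 44100 Hz -> 40 fps
winLen = 1653;                              % 1102 + 551 samples (37.5 ms)
n = (0:winLen-1)';
w = 0.5 * (1 - cos(2*pi*n / (winLen-1)));   % Hanning window

nFrames  = floor((length(x) - winLen) / hop) + 1;
frameVal = zeros(nFrames, 1);
for k = 1:nFrames
    seg = x((k-1)*hop + (1:winLen));
    frameVal(k) = sum(w .* seg) / sum(w);   % Hanning-weighted mean per frame
end

% Normalize to zero mean and unit variance, as done for both modalities.
frameVal = (frameVal - mean(frameVal)) / std(frameVal);
```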

2.2 Audio Feature Extraction

In order to capture and represent the knowledge inherent in the audio data, it is necessary to
first understand the nature of audio features. Audio features can typically be classified into
the following categories: timbre features, temporal and energy features, and rhythmic
features [1]. Table 1 places some popular audio features into these categories.

Table 1: Audio Feature Categories

Feature Type                 | Description                                                | Audio Features
Timbre features              | Help differentiate the quality of a musical note or sound that distinguishes different textures of sound | MFCC, spectral flux, spectral centroid, zero crossings
Temporal and energy features | Compute the amount of silence and the average intensity   | Amplitude, energy, entropy
Rhythmic features            | Symbolically represent the patterns of "beats" and "pulses" | Onsets, meter

Several features are extracted from the music data; spectral flux, short-time energy, zero
crossings, cepstrum and beat onset detection are the ones used in this analysis. After
segmentation of the audio data, the features are extracted from each windowed frame.
Spectral flux is defined as the overall change in the spectrum between two adjacent frames; it
estimates the amount of spectral variation in the signal and indicates how "texturally stable"
a song is. Short-time energy measures the intensity or amplitude of the segmented audio and
is used to gauge the power of the windowed data. Zero crossings are extracted to determine
the noisiness of the given data and are generally used to distinguish speech from music. The
cepstrum is another important feature, providing information about the rate of change in
different spectral bands; it can be used to detect variations across bands and hence to
evaluate the presence of beats in a music signal.
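
Continuing from the framing sketch in Section 2.1 (reusing x, w, hop, winLen and nFrames), a
few of these per-frame features can be computed as follows; the FFT-based flux computation
and the normalizations are illustrative assumptions, not details taken from this report.

```matlab
% Per-frame spectral flux, short-time energy and zero-crossing rate.
specFlux = zeros(nFrames, 1);
energy   = zeros(nFrames, 1);
zcr      = zeros(nFrames, 1);
prevMag  = [];
for k = 1:nFrames
    seg = x((k-1)*hop + (1:winLen)) .* w;            % windowed frame
    mag = abs(fft(seg));
    mag = mag(1:floor(winLen/2));                    % positive-frequency half
    if ~isempty(prevMag)
        specFlux(k) = sum((mag - prevMag).^2);       % change vs. previous frame
    end
    prevMag   = mag;
    energy(k) = sum(seg.^2) / winLen;                % short-time energy
    zcr(k)    = sum(abs(diff(sign(seg)))) / (2*winLen); % zero-crossing rate
end
```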
The feature primarily used for the correlation measurements, however, is the output of the
beat detection algorithm: the other features, although relevant to the floor data, may not
establish a one-to-one relationship with it in the way beat detection does. The fundamental
principle of beat detection is that a noticeable change in intensity, pitch or timbre results in
an onset being perceived [6]. Based on the characteristics of the human psychoacoustic
system, onsets are computed using an algorithm proposed by Scheirer [4]. In this algorithm
the segmented audio data is passed through a bank of six filters and the envelope of each
resulting band is extracted through full-wave rectification. Each band of the filterbank has
sharp cutoffs and covers roughly a one-octave range. The lowest band is a low-pass filter with
cutoff at 200 Hz; the next four bands are bandpass, with cutoffs at 200 and 400 Hz, 400 and
800 Hz, 800 and 1600 Hz, and 1600 and 3200 Hz; the highest band is high-pass with cutoff at
3200 Hz. The envelope is extracted from each band through a rectify-and-smooth method:
the rectified filterbank outputs are convolved with a 200-ms half-Hanning (raised cosine)
window. Onsets are then detected as transitions, by differentiating the smoothed envelope
and keeping the positive slopes through half-wave rectification.
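
A compact sketch of this filterbank-based onset detection is shown below. It assumes the
Signal Processing Toolbox (butter, filter) and the audio signal x and rate fs from the earlier
sketch; the fourth-order Butterworth filters and the orientation of the half-Hanning window
are illustrative choices, not Scheirer's exact implementation.

```matlab
% Six-band filterbank with rectify-and-smooth envelope and onset slopes.
edges = [200 400 800 1600 3200];                 % band edges in Hz
nyq   = fs / 2;
M = round(0.2 * fs);                             % 200-ms smoothing window
halfHann = 0.5 * (1 - cos(pi * (0:M-1)' / (M-1))); % rising half of a Hanning

onsetEnv = cell(6, 1);
for b = 1:6
    if b == 1
        [B, A] = butter(4, edges(1)/nyq, 'low');        % lowest band: <200 Hz
    elseif b == 6
        [B, A] = butter(4, edges(5)/nyq, 'high');       % highest band: >3200 Hz
    else
        [B, A] = butter(4, [edges(b-1) edges(b)]/nyq);  % one-octave bandpass
    end
    band = filter(B, A, x);
    env  = conv(abs(band), halfHann, 'same');    % full-wave rectify and smooth
    d    = diff(env);                            % transition detection
    onsetEnv{b} = max(d, 0);                     % half-wave rectification
end
```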

2.3 Floor Pressure Extraction

The partitioned floor data represents the pressure, or intensity, variations produced by the
dancer. A different set of sensors is activated in each frame, so different frames have different
sizes. To reduce the set of values in a frame to a single metric, histogram analysis is therefore
used: the histogram of each frame is computed with ten bins, and the centroid of these ten
bins gives the weighted average of all pressure values in the frame. The pressure variations of
the floor data can furthermore be treated as a signal containing only low frequencies. This
was evident from empirical observations in the frequency-domain analysis of the floor data,
and is also supported by the cepstrum of the floor data, which has a reasonably flat envelope
across all sample values. Hence the same analysis applied to the music data can be applied to
the floor data: it is passed through a bank of six filters, exactly the same filterbank used in
the audio data analysis.
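
The per-frame reduction can be sketched as follows, assuming framePressures is a cell array
holding the pressure values of the sensors active in each frame (this data structure is an
assumption, not part of the report).

```matlab
% Reduce each floor frame to one value via a ten-bin histogram centroid.
nFloorFrames = numel(framePressures);
floorSig = zeros(nFloorFrames, 1);
for k = 1:nFloorFrames
    [counts, centers] = hist(framePressures{k}, 10);    % ten-bin histogram
    floorSig(k) = sum(counts .* centers) / sum(counts); % histogram centroid
end
floorSig = (floorSig - mean(floorSig)) / std(floorSig); % normalize as in 2.1
% The normalized floorSig is then passed through the same six-band
% filterbank used for the audio data.
```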

2.4 Computing Relationship Between Audio and Floor Pressure Data

Spectral-band-filtered floor and music data are now available for correlation analysis, with
each frame of music and floor data divided into six subbands. For each subband, a correlation
analysis is performed for frame sizes ranging from 20 up to a maximum of 80. The peak value
of each cross-correlation and the corresponding lag are computed. Based on these lags and
peak correlations, the peak-correlation-weighted lags that fall below half the frame-size range
are grouped as positive correlations, while the weighted lags in the upper half of the range are
grouped as negative correlations. The correlation measure of a subband is defined as the ratio
of the number of positive correlations to the total number of correlations for that subband.
The weighted average of the correlation measures over the subbands provides an overall
correlation metric that can be used to grade a dancer. Three different segments of music and
floor data have been used to grade three different dancers. Since this is an objective
evaluation, it may or may not differ starkly from a subjective one.
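
One plausible reading of this grading procedure is sketched below. Here musicBands and
floorBands are assumed six-column matrices of per-frame subband envelopes, the
peak-correlation-weighted grouping is simplified to thresholding the peak lag, and equal
subband weights are used; none of these details are confirmed by the report. xcorr requires
the Signal Processing Toolbox.

```matlab
% Per-subband correlation measure and overall grade.
frameSizes = 20:80;                      % analysis frame sizes used above
subbandScore = zeros(6, 1);
for b = 1:6
    nPos = 0;
    for L = frameSizes
        [r, lags] = xcorr(musicBands(1:L, b), floorBands(1:L, b), 'coeff');
        [~, i] = max(r);                 % peak correlation and its lag
        if abs(lags(i)) < L/2            % lag in the lower half of the range
            nPos = nPos + 1;             % counted as a positive correlation
        end
    end
    subbandScore(b) = nPos / numel(frameSizes);  % ratio of positives
end
grade = mean(subbandScore);  % overall metric (equal subband weights assumed)
```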

3 Results

4 Conclusion

Correlation measures can be used to find the relationship between floor and music data. A
large window frame results in non-uniform spreading of the correlation peaks and hence may
not be an accurate measure of this relationship: for window sizes greater than 100 frames, the
correlation peaks were far from zero lag and the weighting of negative correlations exceeded
that of positive correlations, across all six frequency bands. Hence a small window frame is
used for the correlation measure, from which an approximately equal correlation metric is
obtained for each band. The weighted average of this correlation metric gives the overall
consistency of the music data with the floor data. The 'Hopak' segment yielded the best
correlation result and hence receives the highest grade among the three segments chosen; the
'Angry' wave segment gave results inferior to 'Hopak' but nearly identical to those of the
'Angryens' wave segment. Analyzing a single pressure dataset against different pieces of
music, and vice versa, would provide more information on how to choose an appropriate
frame size. Correlation analysis based on cross-modal integration of different audio features
and the corresponding floor data features may lead to a better understanding of the
relationship between floor and music data.

5 Acknowledgements and Statement of Contribution

I would like to thank Prof. Gang Qian and Prof. Ellen Campana for their wonderful support
and cooperation throughout the project. I would also like to thank Ashok Venkatesan for all
the discussions and analysis involved in this project; the spectral subband analysis, which
played a significant role in the analysis of the floor and audio data, was the first step of the
project, and I would really like to thank Ashok for this contribution.

References

[1] B. Thoshkahna, "Algorithms for Music Information Retrieval", MS Thesis, IIS-Bangalore,
pp. 29-33.

[2] P. Srinivasan et al., "A pressure sensing floor for interactive media applications", in
Proceedings of the 2005 ACM SIGCHI International Conference on Advances in Computer
Entertainment Technology.

[3] B.D. Storey, "Computing Fourier Series and Power Spectrum with MATLAB".

[4] E.D. Scheirer, "Tempo and beat analysis of acoustic musical signals", The Journal of the
Acoustical Society of America, 1998.

[5] A. Klapuri, "Sound onset detection by applying psychoacoustic knowledge", ICASSP, 1999.

[6] D. Moelants et al., "A Computer System for the Automatic Detection of Perceptual
Onsets in a Musical Signal", in A. Camurri (Ed.), KANSEI: The Technology of Emotion,
pp. 140-146, Genova, 1997.
