Salient Feature Extraction of Musical Instrument Signals

A Thesis

Submitted to the Faculty

in partial fulfillment of the requirements for the

degree of

Master of Arts

in

Electro-Acoustic Music

by Tae Hong Park

DARTMOUTH COLLEGE

Hanover, New Hampshire

June 2, 2000

Examining Committee:

_________________________
Larry Polansky (Chair)

_________________________
Jon Appleton

_________________________
Charlie Sullivan
_________________________
Roger Sloboda
Dean of Graduate Studies
Abstract

Musical timbre is inherently multidimensional and extremely complex in structure.

For those reasons, timbre is an on-going research topic in areas such as

computer science, psychology, music and engineering. Humans have the

natural ability to segregate, identify and recognize sounds in a variety of

situations - separated by a wall from the sound source, in a concert hall, within a

noisy traffic environment or at a cocktail party. Although computer systems have

been realized to determine ways of recognizing and identifying sounds with

extracted features, none have come close to performing nearly as well as

humans do. The robustness of computer systems degrades especially in

uncontrolled or natural settings, where more than one sound source, distortions

and aural distractions exist. Many questions remain unanswered: how and what

kind of information does the brain actually receive from our auditory sensory

organs; which features are critically important and which are redundant or even

cause confusion in the recognition and identification process?

The timbre recognition process for computer systems may be divided into two basic parts: first, the feature extraction part, which extracts salient characteristics of an acoustic signal; second, the recognition part, which uses the extracted data for categorization, prediction and taxonomy. In this thesis I

will concentrate on the feature extraction part. I have implemented and

developed a number of algorithms that are useful in picking out acoustical

characteristics of musical instruments. The purpose of the software is to give

musicians and researchers a usable tool for exploration of timbral characteristics.

The signal processing algorithms implemented in software were all realized in

the Java programming environment. The system can be regarded as a GUI

(graphical user interface) based frequency and time domain signal processing

system, where timbral features are extracted and displayed visually for better

understanding.

Acknowledgments

First and foremost, I thank Jon Appleton for giving me the opportunity to come to

the Electro-Acoustic Music Program at Dartmouth and making my two years a

challenging, memorable and exciting time. Many thanks to Larry Polansky for

his invaluable guidance, advice and the countless discussions. I am very

grateful to Charlie Sullivan for his thoughtful and insightful critiques which have

helped me tremendously in finishing the thesis. Thanks to Charles Dodge for

introducing me to so many different facets of music I did not know existed,

Douglas Repetto, Mary Roberts and Eric Lyon for their support. I would also like

to thank Dee Copley for keeping it all together, Andrew, Iroro, Jonathan and Paul

for letting me "multi-computer" all the time (well, most of the time).

I thank my parents and brothers for their unending encouragement and being the

best teachers I have had. They have been there for me throughout the years,

guided me and supported me with all the different (and sometimes strange)

paths I had chosen.

Finally, I wish to thank Kyoung Hyun for her continual and unwavering support,

understanding and love. It would have been very hard to reach this milestone

without her - thank you.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Feature Extraction and Timbre

2 Signal Processing Modules
2.1 Introduction
2.2 Frequency Domain Analysis
2.2.1 DFT and STFT
2.2.2 Spectral Peak Detection and Tracking
2.2.2.1 Step 1: Rough Peak Detection
2.2.2.2 Step 2: Prominent Peak Search
2.2.2.3 Step 3: Harmonic Break Search
2.2.2.4 Step 4: Harmonicity Analysis
2.2.2.5 Partial Tracking Between Frames
2.2.3 Spectral Centroid
2.2.4 Spectral Smoothness
2.3 Time Domain Analysis
2.3.1 Noise Content Analysis: Linear Prediction
2.3.2 Pitch Detection
2.3.2.1 Autocorrelation
2.3.2.2 Detection of Periods
2.3.2.3 Natural Cubic Spline Interpolation
2.3.2.4 Period Averaging
2.3.3 Amplitude Envelope
2.3.4 Amplitude Modulation
2.3.5 Attack Time

3 Software Implementation
3.1 Introduction
3.2 Why Java?
3.3 Main Software Structure
3.4 Software Features

4 Conclusion and Further Work

Appendix

References

List of Illustrations

Figure 2.1 Short time Fourier transform and Spectral Peak Detection
Figure 2.2 Plucked string spectrum
Figure 2.3 Peak detection algorithm
Figure 2.4 Rough search for peaks
Figure 2.5 Actual peak assessment
Figure 2.6 Transitional peaks (noise)
Figure 2.7 Prominent peak search
Figure 2.8 Harmonic break search
Figure 2.9 Partial tracking between frames
Figure 2.10 Spectral centroid of french horn and electric bass at 44.1 kHz
Figure 2.11 White noise, sine wave and electric bass spectral smoothness
Figure 2.12 Vocal tract model
Figure 2.13 Noise content analysis of flute and electric bass
Figure 2.14 Noise content analysis
Figure 2.15 Noise content analysis of electric bass
Figure 2.16 Pitch computation
Figure 2.17 Error plot: interp., interp. with period averaging and DFT
Figure 2.18 Autocorrelation signal, sine wave at 100 Hz
Figure 2.19 Autocorrelation signal, sine wave at 1010 Hz
Figure 2.20 Peak detection through zero crossing and interpolation
Figure 2.21 Natural cubic spline
Figure 2.22 Peak averaging (with number of peaks)
Figure 2.23 Electric bass envelope
Figure 2.24 Amplitude envelope
Figure 2.25 Amplitude modulation analysis
Figure 2.26 Amplitude modulation, alto saxophone
Figure 3.1 Main software architecture
Figure 3.2 Snapshot of software
Figure A.1 Rough peak detection
Figure A.2 Prominent peak search
Figure A.3 Harmonic break search, general flowchart
Figure A.4 Harmonic break search, sub-module flowcharts
Figure A.5 Harmonic analysis, general flowchart
Figure A.6 Harmonic analysis, detailed flowchart

List of Tables

Table 3.1 Software features

Chapter 1 Introduction

1.1 Motivation

I am looking at an acoustical waveform, horizontal axis denoting discrete time

and vertical axis denoting discrete magnitude values. What information does

this signal contain, and how much information is there? Is there anything hidden

behind the waveform? Is it just two dimensional in nature? How can I better

understand why it makes this unique sound? Questions like these have aroused

my curiosity about timbre, which has led to this thesis - feature extraction

of musical signals.

Feature extraction is an integral part of understanding musical instrument

signals. These signals contain a wealth of information and feature extraction is a

method for obtaining specific characteristics through signal processing

techniques. Hence, it is partly a process of reducing the overwhelming

acoustical information and focusing on specific areas that may give clues for

describing the signal under investigation. In a computer system, digital signal

processing techniques are used for analysis. The techniques of data analysis

are divided into frequency and time domain analyses. With these techniques,

numerous approaches from different angles are employed to extract salient

information, ultimately to help understand timbral characteristics.

Various signal processing software systems exist for extracting specific

acoustical features. However, very few systems exist that are tailored for the

purpose of analysis and extraction of timbral qualities of musical instrument

signals. In this thesis I have developed and implemented various algorithms for

extracting salient features into one software application which can be readily

used by musicians, composers, engineers or anyone interested in analyzing

musical signals from a signal processing point of view.

It is also interesting to note that although numerous signal processing algorithms

have been devised to accomplish feature extraction tasks, it is still unclear as to

which aspects of timbre are essential and which are less or more meaningful

than others. To my knowledge, there exists no theory nor rule that

unambiguously defines a hierarchical description of timbral features. It is my

hope that this software system will provide users the means to explore,

investigate and experiment with audio signals and help answer some of the

many questions regarding timbre that are yet to be discovered. However, I also

plan to continue research in timbre to encompass a recognition module which

would be able to take the extracted features and recognize the sound source

being analyzed.

The software was rendered in Java, chosen for its platform independence

and graphical user interface (GUI) capabilities. The Java Swing GUI was used

to facilitate the interpretation of extracted features through graphical displays and

parametric controls of various signal processing coefficients.

1.2 Feature Extraction and Timbre

In this thesis spectral analysis is based on the Fourier transform. The theory

behind the Fourier transform was first published in "Analytical Theory of Heat" by

Fourier. Fourier claimed in his writing that any periodic continuous signal could

be represented by the sum of an infinite number of sine and cosine waves. This

elegant description of periodic signals was later exploited by the 19th century

physicist Hermann Helmholtz (Helmholtz 1877). His view of the ear was that of a

"frequency analyzer" based primarily on Fourier's mathematical theorem, Ohm's

physical definition of a simple tone and the existence of a resonator in the

cochlea, capable of accomplishing sound analysis. According to Helmholtz's

theory, the cochlea behaved like a spectral analyzer analogous to the Fourier

transform. He believed that the cochlea resonated at specific locations along the

basilar membrane (Carterette and Friedman 1978), each tuned to specific

frequencies. Helmholtz also claimed that the spectral magnitude components,

and not the phase components, were the sole factors contributing to the

perception of musical tones. However, this over-generalization of the human ear

performing a strict Fourier transform on the incoming sound waves was

disproved by Békésy (Békésy 1943), who demonstrated the impossibility of such

precise and acute tuning resonators in the cochlea as described by Helmholtz.

In fact, the hair cells in the basilar membrane (comparable to frequency bins in

the Fourier transform) are stimulated in an overlapping manner. That is, a sine

tone at 100 Hz will not just trigger one hair cell at precisely that frequency, rather

a group of hair cells will be excited leading to the perception of its pitch.

Furthermore, the importance of phase in perceiving musical sounds was

demonstrated by Clark (Clark, Luce, Abrams, Schlossberg and Rome 1963),

who clearly showed that in the absence of phase information, acoustic

waveforms sounded unrealistic. This may be partly attributed to the fact that the

highly transient onset part of a signal stores a great deal of phase information.

Helmholtz's theory works well in ideal situations when a signal is periodic.

However, real-life sounds are only quasi-periodic and vary considerably. The

significance of spectral fluctuation as well as inharmonicity (Fletcher, Blackman

and Stratton 1962) and spectral fusion (McAdams 1984) has also been studied

as potential features in describing musical tones.

Although the Fourier transform has been known for quite some time, it was not

widely applied by the music community until after 1965, with the introduction of

the fast Fourier transform (Cooley and Tukey 1965). The advent of the FFT

stimulated research in music partly due to the cost effectiveness in processing

the discrete Fourier transform. One such line of research in timbre used multidimensional scaling (MDS) methods (Grey 1976). The

structure of musical signals was mapped to a three dimensional timbre space.

The listener determined the similarity or dissimilarity between sounds when

salient features were changed. The three dimensions incorporated were

brightness, spectral flux and attack time. Instead of natural sounds, additive

synthesis methods were employed for easy control of timbral parameters in

conducting the experiments. Noise content of musical signals on the other hand

has not been investigated in as much detail by researchers compared to the

"periodic" aspects of musical sounds (some work has been done in modeling

non-periodic signals by Serra 1997). However, voice coding research has been

adapting noise analysis techniques enthusiastically, where speech is divided into

a periodic and a noisy part. The use of an LPC (linear predictive coding) method

has been the primary backbone in current and past speech analysis by synthesis

(AbS) systems.

During the past decade a number of research topics in timbre have been

pursued in the area of so called Computational Auditory Scene Analysis (CASA).

It may be thought of as a research area in psychophysical disciplines to describe

and explain how the listener perceives sounds. Sound, in this context, may be regarded as a multiplexed signal - an aggregate of a number of sound sources.

The approach is to find the underlying reasons as to why we hear what we hear

and not merely be content with the results of a computer system that finds a

matching answer to a stimulus. The proliferation of CASA can be largely

attributed to Bregman, who published his book Auditory Scene Analysis in 1990

(Bregman 1990). The book describes in detail highly intuitive and clever

experiments that attempt to explain psychoacoustic phenomena and to model such features robustly. However, as is the case with most if not all

psychoacoustic experiments, the stimuli or test tones used in Bregman's book

are also static, synthesized, sine-tones or simply impractical sound examples

which are often only remotely related to real-life sounds. Nevertheless, a significant

and impressive amount of work has been done in this field. Work by Ellis (Ellis

1996) used a prediction-based model of the auditory system with good results in

grouping sounds in noisy environments such as car horns, door slams and

squeals in a "city street environment". He used a re-synthesis approach to

assess its robustness and performance. Another is a statistically based pattern-recognition approach (Martin 1999), where the "listening" system classifies

musical instruments as one of 25 possibilities based on Ellis's PDCASA

(Prediction Driven Computational Auditory Scene Analysis) architecture.

Chapter 2 Signal Processing Modules

2.1 Introduction

Musical instrument signals generally consist of a transient portion and steady

state or quasi-periodic portion. The transient part is usually the attack of the

signal and the steady state the portion that follows the attack part. When

investigating time-variant signals it is critical to make use of both time and

frequency domain analysis techniques. Some important features in musical

signals include duration, amplitude modulation, pitch, spectral harmonicity,

spectral envelope, spectral centroid and the like. Attack time is especially

considered a salient feature of musical timbre (Eagleson and Eagleson 1947;

Saldanha and Corso 1964; Elliot 1975) and has been thought to be a dominant

feature of musical instruments. However, it has also been discovered that the

attack time and also note-to-note transients of a signal are neither sufficient nor

necessary for recognizing musical instruments (Kendall 1986). This

controversial discovery supports the importance of the steady state portion of a

signal.

This chapter mainly describes the implementation of the signal processing

algorithms used in the software system for extracting features that depict these

transient and stationary characteristics in the frequency and time domain. The

frequency domain analysis section of this chapter is primarily based on the

discrete Fourier transform (DFT). DFT based spectral analysis algorithms

discussed include the short time Fourier transform, spectral centroid, spectral

smoothness and tracking of partials over time. In the time domain analysis

section I will mainly describe the implementation of algorithms including pitch

detection with interpolation and period averaging based on the autocorrelation

function. Other modules discussed are amplitude envelope, amplitude

modulation, attack time computation and noise content analysis.

2.2 Frequency Domain Analysis

2.2.1 DFT and STFT

The spectral analysis part of feature extraction is primarily based on the discrete

Fourier transform (DFT). Below the continuous time and discrete time versions

of the Fourier transform are shown.


$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt \qquad (2.1)$$

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le n \le N-1,\ 0 \le k \le N-1 \qquad (2.2)$$

To extract transitory spectral characteristics the short time Fourier transform

(STFT) was used (Allen 1977; Allen and Rabiner 1977). The basic algorithm is

as follows.
$$X_m[k] = \sum_{n=0}^{N-1} w[n - mD]\, x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le n \le N-1,\ 0 \le k \le N-1 \qquad (2.3)$$

where $w[n]$ is the window function, $m$ the frame index and $D$ the hop size.

As seen in figure 2.1 the STFT can be simply described as windowing and taking

the FFT of the signal. There are various window types available in the program

Figure 2.1 Short time Fourier transform and Spectral Peak Detection

with different side-lobe and main lobe characteristics. The Hamming window has

been shown to work particularly well with musical signals (De Poli, Piccialli and

Roads 1991). See the appendix for details regarding windowing and its side-lobe and main-lobe characteristics.
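To make the procedure concrete, the following is a minimal Java sketch of the STFT of equation 2.3 combined with a Hamming window. It uses a direct DFT for brevity where the actual software would use an FFT, and the class and method names (FeatureSketches, stft and so on, here and in the sketches that follow) are hypothetical, not the thesis code.

```java
class FeatureSketches {
    /** Hamming window of length n. */
    static double[] hamming(int n) {
        double[] w = new double[n];
        for (int i = 0; i < n; i++)
            w[i] = 0.54 - 0.46 * Math.cos(2 * Math.PI * i / (n - 1));
        return w;
    }

    /** STFT magnitudes (equation 2.3): one row per frame, bins DC..Nyquist; hop = D. */
    static double[][] stft(double[] x, int n, int hop) {
        int frames = (x.length - n) / hop + 1;
        double[] w = hamming(n);
        double[][] mag = new double[frames][n / 2 + 1];
        for (int m = 0; m < frames; m++) {
            for (int k = 0; k <= n / 2; k++) {
                double re = 0, im = 0;
                for (int i = 0; i < n; i++) {
                    double s = w[i] * x[m * hop + i];          // windowed sample
                    re += s * Math.cos(2 * Math.PI * k * i / n);
                    im -= s * Math.sin(2 * Math.PI * k * i / n);
                }
                mag[m][k] = Math.sqrt(re * re + im * im);      // magnitude spectrum
            }
        }
        return mag;
    }
}
```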

2.2.2 Spectral Peak Detection and Tracking

Pitched musical instruments display a high degree of harmonic spectral quality

when analyzed for frequency content. Most tend to have quasi-integer harmonic

relationships between spectral peaks and the fundamental frequency. In the voice,

the spectral envelope displays mountain-like contours or valleys known as

formants. The locations of the formants distinctively describe vowels. This is

also evident in violins, but the number of valleys is greater and the formant

locations change very little with time unlike the voice, which varies substantially

for each vowel. Woodwinds such as the bassoon and oboe on the other hand

have fewer formants than the voice, but tend to have stronger and clearer

spectral contours that perceptually characterize the woodwind family (Cook

1999). Generally, musical instruments like the plucked string (figure 2.2) exhibit

lower energy in the high frequency bins. The higher partials normally have less

energy and also die out faster than lower ones over time.

Figure 2.2 Plucked string spectrum

Using the short time Fourier transform, I have implemented a spectral peak

detection and tracking method, extracting quasi-integer related harmonics from

the spectrum. The peak picking algorithm takes into consideration magnitude

and frequency information to select the most prominent and harmonically

behaving peaks. To help in the search for spectral peaks, various threshold

values are used as described below.

The spectral peak detection algorithm is divided into four main steps. The first

pass roughly locates possible peaks, where the roughness factor for searching

peaks is controlled via a threshold value. The threshold value basically dictates

the degree of "peakiness" that is allowed for a local maximum to be considered a

possible peak. The second pass filters out peaks that may have been

erroneously selected in step 1. The third pass looks for any broken harmonic

sequence, analyzing harmonic relationships of the currently selected peaks. In

this pass, peaks that may have been deleted or missed in the previous two

passes are inserted. The final pass looks at the selected peaks and further does

a harmonic analysis ultimately leaving a set of peaks that are most probably

harmonics. A mean and scalable standard deviation error method is applied for

control of inharmonicity.

Figure 2.3 Peak detection algorithm

2.2.2.1 Step 1: Rough Peak Detection

In the rough peak detection algorithm possible peaks are picked using negative

and positive slope threshold values to guide in the selection process. As shown

in figure 2.4 the polarity of the slope of the spectrum is computed from bin to bin

(DC to Nyquist) using the basic assumption that a transition from positive to

negative slope calls for the possibility of a peak.

Figure 2.4 Rough search for peaks

The following conditions help in the selection of a peak:

1. The slope must change polarity, positive to negative.

2. The magnitude difference between the peak candidate and the current bin's

magnitude component (X[k]-X[k+4]) must be greater than a threshold value -

see example (figure 2.5).

3. A new peak candidate search occurs only after there is a slope change from

negative to positive and when a threshold value as shown in figure 2.6 is

exceeded.

Refer to flowcharts in the appendix for details.
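As an illustration, here is a simplified Java sketch of this first pass (continuing the hypothetical FeatureSketches class above). It folds conditions 1-3 into a single scan; threshold is a stand-in for the threshold values described above, and the bin offset of 4 mirrors the X[k]-X[k+4] test of figure 2.5.

```java
/** Rough peak detection over one magnitude spectrum; a simplified sketch. */
static java.util.List<Integer> roughPeaks(double[] mag, double threshold) {
    java.util.List<Integer> peaks = new java.util.ArrayList<>();
    int candidate = -1;
    for (int k = 1; k < mag.length - 1; k++) {
        // condition 1: slope changes polarity from positive to negative
        if (mag[k] > mag[k - 1] && mag[k] >= mag[k + 1]) candidate = k;
        // condition 2: the candidate must drop by more than the threshold
        if (candidate >= 0 && k >= candidate + 4
                && mag[candidate] - mag[k] > threshold) {
            peaks.add(candidate);
            candidate = -1;   // condition 3 (simplified): restart the candidate search
        }
    }
    return peaks;
}
```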
Figure 2.5 Actual peak assessment

Figure 2.6 Transitional peaks (noise)

2.2.2.2 Step 2: Prominent Peak Search

In step 2, prominent peaks are located from a set of potential peaks found in

step 1. The purpose is to filter out local peaks which may be present between

stronger partial candidates as shown in figure 2.7. The search for prominent

peaks is done in the following way:

1. The bin with the maximum magnitude is found.

2. Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards DC.

3. Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards the Nyquist frequency.

Figure 2.7 Prominent peak search

Local maxima or peaks are picked out using an adaptive threshold value that is reflective of a prominent peak (possible partial) and its neighboring peaks, as shown in figure 2.7. For example, a 50% threshold value requires neighboring peaks to be greater than at least half the magnitude of the prominent peak (possible partial). Refer to the appendix for details on the algorithm.
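A sketch of this step for the hypothetical FeatureSketches class: starting from the strongest rough peak, the scan moves towards DC and then towards the Nyquist frequency, keeping only peaks that exceed an adaptive fraction of the last accepted peak (ratio would be 0.5 for the 50% example above). The adaptive-update rule is a simplification of the flowchart in figure A.2.

```java
/** Prominent peaks among rough peaks (bin indices in ascending order). */
static java.util.List<Integer> prominentPeaks(double[] mag,
        java.util.List<Integer> roughPeaks, double ratio) {
    int maxPos = 0;                              // index of the strongest rough peak
    for (int i = 1; i < roughPeaks.size(); i++)
        if (mag[roughPeaks.get(i)] > mag[roughPeaks.get(maxPos)]) maxPos = i;
    java.util.List<Integer> out = new java.util.ArrayList<>();
    for (int dir : new int[] {-1, 1}) {          // towards DC, then towards Nyquist
        double tempMax = mag[roughPeaks.get(maxPos)];
        for (int i = maxPos + (dir < 0 ? 0 : 1);
                i >= 0 && i < roughPeaks.size(); i += dir) {
            int bin = roughPeaks.get(i);
            if (mag[bin] > tempMax * ratio) {    // exceeds the adaptive threshold level
                out.add(bin);
                tempMax = mag[bin];              // threshold adapts to the new peak
            }
        }
    }
    out.sort(null);
    return out;
}
```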

2.2.2.3 Step 3: Harmonic Break Search

The third step is called the harmonic break search. Here, I have tried to analyze

if some "potential partials" were deleted or missed in the previous steps. This

may occur when potentially harmonically related peaks temporarily have little

energy or are simply much weaker than the stronger ones, but are nevertheless

harmonic. The harmonic break search is divided into the following sub-routines:

1. Analyze the harmonic relationship between the current partial candidates by computing the mean bin spacing between all prominent peaks:

$$\Delta_F = \frac{1}{N-1} \sum_{k=0}^{N-2} \left( F[k+1] - F[k] \right) \qquad (2.4)$$

2. Detect any harmonic breaks, or discontinuities, between prominent peaks.

3. If discontinuities are found, go back to steps 1 and 2 and do a refined search for possible peaks between pairs of prominent peaks.

Figure 2.8 Harmonic break search

In the harmonic break search's second step, harmonic discontinuities are

detected using a pair of threshold values limiting the range of harmonic

deviation. Hence, the algorithm expects the possibility of a peak within the

threshold bounds computed in sub-step 2 (figure 2.8). Refer to the appendix for more details on the algorithm.
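A minimal sketch of sub-steps 1 and 2, assuming bins[] holds at least two prominent-peak bin indices in ascending order; regions flagged here would then be re-searched with the rough peak detector of step 1.

```java
/** Gaps between prominent peaks exceeding the tolerated deviation from the mean spacing. */
static java.util.List<int[]> harmonicBreaks(int[] bins, double thresh) {
    double mean = 0;
    for (int k = 0; k < bins.length - 1; k++) mean += bins[k + 1] - bins[k];
    mean /= (bins.length - 1);                             // mean bin spacing, equation 2.4
    java.util.List<int[]> breaks = new java.util.ArrayList<>();
    for (int k = 0; k < bins.length - 1; k++)
        if (bins[k + 1] - bins[k] > (1 + thresh) * mean)   // harmonic discontinuity
            breaks.add(new int[] {bins[k], bins[k + 1]});  // region for a refined search
    return breaks;
}
```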

2.2.2.4 Step 4: Harmonicity Analysis

Finally in step 4 an overall harmonicity verification is performed. In this last step,

the first few peaks (selectable in software) are used as a guide to determine the

final set of partials. The reason for choosing the first few peaks of the spectrum
is due to the fact that in highly pitch salient signals, lower harmonics usually are

stronger and more stable than higher ones.

The idea is to use the Gaussian normal distribution function, employing mean,

variance and standard deviation for eliminating inharmonic or misbehaving

partials. A peak that is outside a right and left threshold bound is considered

inharmonic and misbehaving. A mean bin spacing value denoting the bin

distances between neighboring peak candidates is computed to render the

variance and standard deviation. As the lower partials generally tend to be more

stable and have more energy, the first K (K: integer > 0) peaks are used for the

computation of the standard deviation. A scaled version of the standard

deviation is then used as a criterion for evaluating inharmonicity of each partial

candidate. The scaled standard deviation is increased or decreased to control

the permitted spread of each peak. In other words, the scaled standard

deviation directly controls the amount of inharmonicity tolerated when selecting

the final set of peaks. The scalar that controls the scaled standard deviation is a

value between 0 and 1, where 1 is equivalent to limiting the peaks to the original

un-scaled standard deviation. This method is implemented by computing an

ideal sequence of harmonics using the above acquired data. Hence the ideal

harmonic series is a sequence of partials as shown below.

$$bin_{ideal}[0],\ bin_{ideal}[1],\ \ldots,\ bin_{ideal}[m] \qquad (2.5)$$

where $bin_{ideal}[0]$ = mean of the first $K$ peaks, $K$ an integer $> 0$

The ideal set of harmonics and the actual set of harmonics are compared and

the error (equation 2.6) for each peak is computed and verified against the

scaled standard deviation for final assessment. Peaks that have excessive error

values are deleted from the final set of peaks and the remaining ones are finally

considered harmonics. See the appendix for more details on the algorithm.

$$\epsilon = bin_{ideal}[k] - bin_{actual}[l], \qquad k = 0 \ldots M,\ l = 0 \ldots N \qquad (2.6)$$

Equation 2.6 shows the error between the ideal and actual bins where M is the

number of ideal peaks and N is the number of actual peaks in the spectrum. M

and N have different values as missing partials may exist in the actual set of

peaks.
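The following sketch condenses this step, assuming bins[] is the candidate peak array with more than K entries; the ideal harmonic for each peak is taken as the nearest integer multiple of the mean spacing of the first K gaps, and peaks whose error (equation 2.6) exceeds the scaled standard deviation are dropped.

```java
/** Keep only peaks that behave harmonically; a simplified sketch of step 4. */
static java.util.List<Integer> harmonicityFilter(int[] bins, int K, double scale) {
    double mean = 0, var = 0;
    for (int k = 0; k < K; k++) mean += bins[k + 1] - bins[k];
    mean /= K;                                       // mean spacing of first K gaps
    for (int k = 0; k < K; k++) {
        double d = (bins[k + 1] - bins[k]) - mean;
        var += d * d / K;
    }
    double scaledStdDev = scale * Math.sqrt(var);    // scale in (0, 1]
    java.util.List<Integer> harmonics = new java.util.ArrayList<>();
    for (int bin : bins) {
        long h = Math.max(1, Math.round(bin / mean));   // nearest ideal harmonic index
        double error = Math.abs(bin - h * mean);        // error against ideal bin (eq. 2.6)
        if (error <= scaledStdDev) harmonics.add(bin);  // otherwise deemed inharmonic
    }
    return harmonics;
}
```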

2.2.2.5 Partial Tracking between Frames

Once harmonics have been evaluated in each frame (a frame is equal to the

length of the FFT), they are combined to render a spectrogram. Frame to frame

partial movement is determined using a harmonic continuity criterion as shown in

figure 2.9.

Figure 2.9 Partial tracking between frames

The harmonic continuity criterion is explained as follows: Each harmonic in a

frame is allowed to sway in frequency within a set of error margin values. Hence,

as shown in figure 2.9, four of the harmonics make a continuous harmonic path

(k, k+1, k+2, k+3). However, the harmonic in frame k+4 exceeds the allowed

error margin and breaks the previous harmonic path. At frame k+4 a new path is

created and the path which started at frame k is discontinued. The harmonic

continuity criterion is helpful in observing movements of the harmonics over time

and frequency.
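A one-method sketch of the continuity criterion: a harmonic from the previous frame continues only if some harmonic in the current frame lies within the error margin.

```java
/** Returns the bin of the continuing harmonic in the current frame, or -1 if the path breaks. */
static int matchPartial(int prevBin, int[] currentBins, int margin) {
    for (int bin : currentBins)
        if (Math.abs(bin - prevBin) <= margin) return bin;   // within the error margin
    return -1;                                               // a new path must begin
}
```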

2.2.3 Spectral Centroid

The spectral centroid (Beauchamp 1982) is commonly associated with the

measure of the brightness of a sound (Grey and Gordon 1978). This measure is

obtained by evaluating the "center of gravity" using the Fourier transform's

frequency and magnitude information (Equation 2.7). Generally speaking, it has

been found that increased loudness also increases the amount of high-frequency content in a signal, thus making the sound brighter.


$$sc = \frac{\sum_{k=1}^{N-1} k\, X[k]}{\sum_{k=1}^{N-1} X[k]} \qquad (2.7)$$

X[k] is the magnitude corresponding to bin k and N is the length of the DFT. This

measure has also been used in MDS (multidimensional scaling) based systems.
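Equation 2.7 translates almost directly into code; the following sketch returns the centroid in bins for one frame.

```java
/** Spectral centroid of one frame (equation 2.7), in bins. */
static double spectralCentroid(double[] mag) {
    double num = 0, den = 0;
    for (int k = 1; k < mag.length; k++) {
        num += k * mag[k];   // frequency-weighted magnitude
        den += mag[k];
    }
    return den > 0 ? num / den : 0;
}
```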

Figure 2.10 shows examples of the spectral centroid for the french horn and the electric bass guitar.

Figure 2.10 Spectral centroid of french horn and electric bass at 44.1 kHz

2.2.4 Spectral Smoothness

The spectral smoothness (McAdams 1999) measures the smoothness of the

frame to frame spectral envelope obtained via the short time Fourier transform.

The algorithm basically takes the average of adjacent amplitudes of the spectral

bins and compares them to the current amplitude at bin k as shown in equation

2.8.
$$ss = \sum_{k=1}^{N-1} \left| 20\log X[k] - \frac{20\log X[k-1] + 20\log X[k] + 20\log X[k+1]}{3} \right| \qquad (2.8)$$
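A direct sketch of equation 2.8 for one frame; the small offset added before taking logarithms is an assumption to avoid log(0) on empty bins.

```java
/** Spectral smoothness of one frame (equation 2.8). */
static double spectralSmoothness(double[] mag) {
    double ss = 0;
    for (int k = 1; k < mag.length - 1; k++) {
        double a = 20 * Math.log10(mag[k - 1] + 1e-12);  // offset avoids log(0)
        double b = 20 * Math.log10(mag[k] + 1e-12);
        double c = 20 * Math.log10(mag[k + 1] + 1e-12);
        ss += Math.abs(b - (a + b + c) / 3);             // deviation from the local mean
    }
    return ss;
}
```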

Figure 2.11 shows examples of the spectral smoothness for white noise, sine

wave and electric bass sampled at 44,100 Hz.

Figure 2.11 White noise, sine wave and electric bass spectral smoothness

2.3 Time Domain Analysis

In this section I will discuss the time domain signal processing modules

implemented in the software system.

2.3.1 Noise Content Analysis: Linear Prediction

I have used linear prediction as the basis for extracting the degree of "noisiness"

of a signal. The motivation behind using the LPC method for musical signals lies

in its robust performance in modeling the voice. In the "LPC vocal tract model",

the resultant acoustical signal is represented via a noise signal and a sequence

of pulses passed through a resonant all-pole filter, shaping the spectral envelope

of the voice as shown in figure 2.12. In essence, the linear prediction filter

coefficients are used to predict the current sample with a finite number of

weighted past samples. Figure 2.13 shows examples of noise content analysis

for the flute and electric bass.

Figure 2.12 Vocal tract model

Figure 2.13 Noise content analysis of flute and electric bass


The linear prediction model is simply defined as in equation 2.9. It assumes that

the current sample may be represented with past samples weighted

"appropriately" (Atal and Hanauer 1971).


$$\hat{s}[k] = \sum_{i=1}^{N} a_i\, s[k-i] \qquad (2.9)$$

The coefficients in the difference equation are selected so that the error

between the current sample s[k] and the predicted sample from equation 2.9 is

minimized as shown in equation 2.10 using the least square method.


$$\epsilon = \frac{1}{M} \sum_{k=1}^{N} \{ s[k] - \hat{s}[k] \}^2 \qquad (2.10)$$

The noise content analysis algorithm is shown in figure 2.14. Before submitting

the signal to the short term prediction filter block, a pre-emphasis filter (equation

2.11) is used to flatten the spectrum for enhanced performance. The pre-

emphasis filter (high pass filter) coefficients range from 0.95 to 0.98. The

residual signal ds[n] is passed through a "spike damping filter" which damps

spikes that are present in ds[n] and ultimately renders the noise content of the

signal (figure 2.15).

$$y[n] = x[n] - a\,x[n-1] \qquad (2.11)$$
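The following sketch shows the pre-emphasis stage (equation 2.11) and the residual computation ds[n] of figure 2.14, assuming the LPC coefficients a[] have already been solved for (for instance via the least square method of equation 2.10, not shown here).

```java
/** Pre-emphasis high-pass filter of equation 2.11; a typically in [0.95, 0.98]. */
static double[] preEmphasis(double[] x, double a) {
    double[] y = new double[x.length];
    y[0] = x[0];
    for (int n = 1; n < x.length; n++) y[n] = x[n] - a * x[n - 1];
    return y;
}

/** Prediction residual ds[n] of figure 2.14, given solved LPC coefficients a[]. */
static double[] residual(double[] s, double[] a) {
    double[] ds = new double[s.length];
    for (int n = 0; n < s.length; n++) {
        double pred = 0;
        for (int i = 1; i <= a.length && n - i >= 0; i++)
            pred += a[i - 1] * s[n - i];   // weighted past samples (equation 2.9)
        ds[n] = s[n] - pred;               // prediction error
    }
    return ds;
}
```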

Figure 2.14 Noise content analysis

Figure 2.15 Noise content analysis of electric bass

2.3.2 Pitch Detection

The pitch detection algorithm uses autocorrelation, natural cubic spline

interpolation and period averaging to accurately compute the pitch of the signal.

The range of operation is from approximately 26 Hz to 5000 Hz (A0 = 27.50 Hz,


C8 = 4186 Hz). Figure 2.16 shows the basic procedure for computing pitch. As

seen in figure 2.17, the error for the period-averaging method is smallest

compared to the autocorrelation method without interpolation and the FFT. The

period averaging method is discussed in section 2.3.2.4.

Figure 2.16 Pitch computation

Figure 2.17 Error plot: interp., interp. with period averaging and DFT

2.3.2.1 Autocorrelation

Autocorrelation is a standard way of determining signal periodicity and is defined

as:

$$acf_{xx}(\tau) = \int_{-\infty}^{\infty} x(t)\, x(t+\tau)\, dt \qquad (2.12)$$

The discrete time equivalent is:


$$acf_{xx}[\ell] = \sum_{n=0}^{N-1-\ell} x[n]\, x[n+\ell], \qquad 0 \le \ell \qquad (2.13)$$
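A direct rendering of equation 2.13, computing autocorrelation values for lags 0 through maxLag:

```java
/** Autocorrelation of equation 2.13 for lags 0..maxLag. */
static double[] autocorrelation(double[] x, int maxLag) {
    double[] acf = new double[maxLag + 1];
    for (int lag = 0; lag <= maxLag; lag++)
        for (int n = 0; n + lag < x.length; n++)
            acf[lag] += x[n] * x[n + lag];
    return acf;
}
```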

A typical autocorrelation vector with increasing integer lag values is shown in

figure 2.18. In the current implementation zero crossings of the autocorrelation

signal are determined. More precisely, a local maximum bounded by a pair of zero crossings is considered a peak if it satisfies specific magnitude threshold values.

Figure 2.18 Autocorrelation signal, sine wave at 100 Hz

Comparing figure 2.18 and figure 2.19 it is clear that the time resolution for

higher frequencies in the autocorrelation vector decreases substantially, causing

greater error. In other words, the samples that are present between the

autocorrelation peaks (period of the signal) decrease with the increase in

frequency (approximately 440 samples vs. 50 samples, figure 2.18 and figure

2.19). One way to improve performance is to use interpolation.

Figure 2.19 Autocorrelation signal, sine wave at 1010 Hz

2.3.2.2 Detection of Periods

Periods in the autocorrelation signal are detected through peaks that correspond

to the frequency of the audio signal (figure 2.20). Peaks are extracted using two

zero crossing pairs for each peak. These pairs define the range where a peak
that corresponds to the period could actually be found. The fact that the

autocorrelation vector's magnitude decreases with the increase of its lag is

exploited in determining if a peak is really a period or just a local peak

corresponding to strong harmonics. The first period value is used as the basis to

look for and compute consecutive peaks in the autocorrelation vector. Hence, an

error margin dictated by the first period found is used to guide in searching the

remaining peaks. All peaks that are considered periods are subjected to

interpolation.
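A sketch of the zero-crossing based peak search on the autocorrelation vector; the magnitude gate magThresh stands in for the threshold values mentioned above, and the refinement against the first period found is omitted for brevity.

```java
/** Peaks of the autocorrelation bounded by zero-crossing pairs; period candidates. */
static java.util.List<Integer> periodPeaks(double[] acf, double magThresh) {
    java.util.List<Integer> peaks = new java.util.ArrayList<>();
    int rise = -1;
    for (int n = 1; n < acf.length; n++) {
        if (acf[n - 1] <= 0 && acf[n] > 0) rise = n;        // upward zero crossing
        if (rise >= 0 && acf[n - 1] > 0 && acf[n] <= 0) {   // downward zero crossing
            int best = rise;                                // maximum between the pair
            for (int k = rise; k < n; k++) if (acf[k] > acf[best]) best = k;
            if (acf[best] > magThresh) peaks.add(best);     // magnitude threshold gate
            rise = -1;
        }
    }
    return peaks;
}
```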
Figure 2.20 Peak detection through zero crossing and interpolation

2.3.2.3 Natural Cubic Spline Interpolation

The natural cubic spline interpolation method is used in pinpointing the "actual"

peak and hence its period, in the autocorrelation function. The basic idea behind
the natural cubic spline method is shown in figure 2.21. Each "curvature"

connecting the knots is represented by a cubic polynomial equation denoted by $S_i(x)$.
Figure 2.21 Natural cubic spline

This essentially is a problem of solving each polynomial bounded by knots for its

roots (Cheney and Kincaid 1994).


$$S_i(x) = a_i x^3 + b_i x^2 + c_i x + d_i \qquad (2.14)$$
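For illustration, the following sketch solves the natural cubic spline in the manner of Cheney and Kincaid (a tridiagonal system for the second derivatives z[i], with z at the end knots set to zero) and evaluates S(x); it assumes at least three strictly increasing knots t[] with values y[].

```java
/** Natural cubic spline: solve for second derivatives, then evaluate S(x). */
static double splineEval(double[] t, double[] y, double x) {
    int n = t.length;
    double[] h = new double[n - 1], z = new double[n];
    double[] u = new double[n], v = new double[n];
    for (int i = 0; i < n - 1; i++) h[i] = t[i + 1] - t[i];
    // forward elimination of the tridiagonal system (natural: z[0] = z[n-1] = 0)
    u[1] = 2 * (h[0] + h[1]);
    v[1] = 6 * ((y[2] - y[1]) / h[1] - (y[1] - y[0]) / h[0]);
    for (int i = 2; i < n - 1; i++) {
        u[i] = 2 * (h[i - 1] + h[i]) - h[i - 1] * h[i - 1] / u[i - 1];
        v[i] = 6 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
             - h[i - 1] * v[i - 1] / u[i - 1];
    }
    for (int i = n - 2; i >= 1; i--) z[i] = (v[i] - h[i] * z[i + 1]) / u[i];
    int i = n - 2;                         // locate the interval containing x
    while (i > 0 && x < t[i]) i--;
    double a = (z[i + 1] - z[i]) / (6 * h[i]);     // cubic coefficient
    double b = z[i] / 2;                           // quadratic coefficient
    double c = (y[i + 1] - y[i]) / h[i] - h[i] * (2 * z[i] + z[i + 1]) / 6;
    double dx = x - t[i];
    return y[i] + dx * (c + dx * (b + dx * a));    // nested evaluation of S_i(x)
}
```

A peak of the autocorrelation can then be pinpointed by evaluating the spline densely around the integer-lag maximum.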

2.3.2.4 Period Averaging

As mentioned before, an increase in pitch decreases the period length. Since

the number of samples per period in the autocorrelation vector decreases accordingly, interpolation can

only help so much. To get better performance both for low frequency and high

frequency pitch detection, I have developed a period averaging method which

simply uses M number of periods found in the autocorrelation vector to compute

the mean after interpolation. In a given frame, if we find M peaks/periods, the

respective period can be represented as:

$$T_m = \{T_0, T_1, T_2, T_3, \ldots, T_M\}, \qquad 0 \le m \le M \qquad (2.15)$$

The number of autocorrelation peaks may vary from frame to frame. The

maximum number of periods for averaging is controlled by variable M. The

average period then becomes:


$$\bar{T} = \frac{1}{M_f} \sum_{m=0}^{M_f - 1} T_m \qquad (2.16)$$

where $M_f$ is the maximum number of peaks found in frame f. As seen in

figure 2.17, this decreases the error margin considerably.
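The averaging itself is a one-liner once the interpolated periods of a frame are collected; a sketch:

```java
/** Mean period of one frame (equation 2.16) converted to pitch in Hz. */
static double pitchHz(double[] periods, double sampleRate) {
    double sum = 0;
    for (double t : periods) sum += t;           // interpolated period lags, in samples
    return sampleRate / (sum / periods.length);  // mean period -> frequency
}
```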

Figure 2.22 Peak averaging (with number of peaks)

2.3.3 Amplitude Envelope

The amplitude envelope describes the energy change of the signal in the time

domain and is generally equivalent to the so called ADSR (attack, decay,


sustain, release) of a musical sound.

Figure 2.23 Electric bass envelope

The envelope of the signal is computed with a frame-by-frame RMS (root mean square) computation and a 3rd-order Butterworth low-pass filter.

RMS (equation 2.17) is related to the average power of a signal and differs fundamentally from the average or peak level. The average changes very little even if the signal contains numerous transient peaks, while the peak level can vary greatly in a small amount of time without much affecting the average value. RMS is a more perceptually relevant measurement and has been shown to correspond more closely to the way we hear loudness.


$$rms = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} x[k]^2} \qquad (2.17)$$

The frame-by-frame RMS is applied in a manner quite similar to the short time Fourier transform. The length of the RMS frame determines the time resolution of the envelope: a large frame length yields less transient information and a small frame length more. The window length M (equation 2.18) divides the total signal length N evenly (N = Mp), with p a positive integer.


$$rms_{frame} = \sqrt{\frac{1}{M} \sum_{k=L}^{L+M-1} x[k]^2}, \qquad \text{where } Mp = N \qquad (2.18)$$

Figure 2.24 Amplitude envelope

The window size is selectable in the software, a longer window resulting in less transitional information and a shorter window in more. The

cutoff frequencies have been determined empirically at 350 Hz (fs = 8000), 1200

Hz (fs = 22050) and 1700 Hz (fs = 44100).
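A compact sketch of the envelope computation (equation 2.18); as an assumption, the 3rd-order Butterworth stage is replaced by a simple one-pole smoother to keep the example self-contained.

```java
/** RMS envelope (equation 2.18), smoothed; 'smooth' in [0,1) stands in for the LPF. */
static double[] rmsEnvelope(double[] x, int frameLen, double smooth) {
    int frames = x.length / frameLen;
    double[] env = new double[frames];
    for (int f = 0; f < frames; f++) {
        double sum = 0;
        for (int k = f * frameLen; k < (f + 1) * frameLen; k++) sum += x[k] * x[k];
        env[f] = Math.sqrt(sum / frameLen);        // frame-by-frame RMS
    }
    for (int f = 1; f < frames; f++)               // crude one-pole smoothing
        env[f] = smooth * env[f - 1] + (1 - smooth) * env[f];
    return env;
}
```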

2.3.4 Amplitude Modulation

Detecting amplitude modulation is similar to the amplitude envelope detection

algorithm with a few steps added. Figure 2.25 shows a summary of amplitude

modulation analysis. The steady state portion of the signal is extracted and

analyzed for peaks which correspond to the frequency of amplitude modulation.

For accurate location of peaks, the cubic spline interpolation method is again

used.

Figure 2.25 Amplitude modulation analysis

Amplitude modulation is frequently observed in musical instruments such as the

violin, flute and saxophone (figure 2.26). The frequency in Hertz is computed

using the following formula:

$$frequency = \frac{f_s}{w\,T} \qquad (2.19)$$

where fs is the sampling rate, w the RMS frame length and T the period in

samples of the RMS signal.
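Equation 2.19 in code, assuming the modulation period T has been measured in RMS frames:

```java
/** Amplitude modulation frequency in Hz (equation 2.19). */
static double amFrequencyHz(double sampleRate, int rmsFrameLen, double periodInFrames) {
    return sampleRate / (rmsFrameLen * periodInFrames);
}
```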

Figure 2.26 Amplitude modulation, alto saxophone

2.3.5 Attack Time

Attack time (Saldanha and Corso 1964; Elliot 1975) is an important feature of

timbre. It is defined as the time it takes to reach the maximum amplitude of a

signal from a minimum threshold magnitude (McAdams 1999). The minimum

threshold value is necessary as it acts as a gating function, only starting

measurement of the attack time when this threshold level is exceeded. Although

the attack portion embodies a great deal of transitional information of the signal

leading to a steady state, it is difficult to say where the attack portion ends and

where the steady state begins. As a matter of fact, it is even difficult to say how

much information the attack portion actually represents and no concrete

measurement technique has been published to date.

$$attack\ time = t_{xMax} - t_{xThresh} \qquad (2.20)$$
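A sketch of equation 2.20 on the RMS envelope, with the gating threshold described above; the result is in envelope frames and would be scaled by the frame duration to obtain seconds.

```java
/** Attack time in envelope frames (equation 2.20), gated at thresh. */
static double attackTimeFrames(double[] env, double thresh) {
    int start = -1, maxAt = 0;
    for (int f = 0; f < env.length; f++) {
        if (start < 0 && env[f] >= thresh) start = f;  // gate opens, measurement begins
        if (env[f] > env[maxAt]) maxAt = f;            // envelope maximum
    }
    return (start < 0 || maxAt < start) ? 0 : maxAt - start;
}
```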

However, this attribute of timbre has been indirectly applied successfully in

wavetable synthesis. The basic idea is to take an auditory snapshot of the signal

- the attack and first few milliseconds of the steady state portion, then loop the

steady state portion. Hence, this gives the listener the illusion that the whole

signal is being played back, although only a fractional length of the signal has

actually been used to render such an illusion. Today's popular music genres and

"electronic" jazz music are very much dominated by this technology. However,

many contemporary composers (Appleton 1991) also have used this technology

mainly via alternative MIDI controllers such as the Radio Baton.

Chapter 3 Software Implementation

3.1 Introduction

The various signal processing modules discussed in chapter 2 were assembled

and implemented in the Java programming environment (the software is

available at http://eamusic.dartmouth.edu/~taehong). The system uses various

GUI (Graphical User Interface) capabilities for easy visualization of extracted

features. The following sections describe the software and the motive behind

using Java.

3.2 Why Java?

The Java programming environment was used for the following reasons.

1. Platform independence

2. Non-real time requirement for this thesis

3. Good GUI design capabilities

4. Java Sound

5. Syntax similarity to C/C++

Java was designed to be architecture-neutral, which can be achieved if pure Java code is written. For this reason, the program I have developed has no native coding methods (system-dependent code) and runs on

pure Java alone. Although native coding methods improve efficiency, this

program does not require real-time processing or time-critical computation. One

of the goals of this thesis was to write an intuitive GUI (graphical user interface)

- easy control of parameters, coefficients and visual representation of extracted

features. This was done with the Java Swing GUI environment. Another reason

for choosing Java lies in sound playback and recording. The ability to play and

record sounds, without developing interrupt service routines and memory

accessing procedures pertinent to each platform was a major plus. Finally, the

syntax similarity of Java to C/C++ greatly facilitated the move from C/C++ to

Java without having to learn a completely new "language".

3.3 Main Software Structure

The software's main modules are the GUI, command center and sound object.

The GUI is responsible for responding to requests performed by the user via button clicks, data entry, menu selections and so on. The command center takes the job of directing the commands requested by the user and notifying the appropriate sound object. The sound object, which is "instantiated" whenever a new sound file is loaded into memory, supervises and keeps track of all its child frames (internal frames, each corresponding to a DSP process) using linked lists, handling updates and avoiding unnecessary re-computation. Hence, unused objects and commands are removed from or added to linked lists, making command processing efficient both in memory management and data management.

Figure 3.1 Main software architecture

Figure 3.2 Snapshot of software

3.4 Software Features

This section lists features of the program.

File IO Features

File Open: reads aiff, wav, au
File Save: writes aiff, wav, au, raw data (float)

DSP Command Features

DC Offset Removal: removes the DC component
AM Analysis: amplitude modulation analysis
Attack Time: attack time computation
Amplitude Envelope: amplitude envelope rendering
FFT: discrete Fourier transform
Spectrogram: display of peaks vs. frame/time
Pitch Detection: detection of pitch (vs. time)
Pitch Modulation: detection of low freq. pitch modulation
Spectral Centroid: spectral centroid computation (vs. time)
Spectral Smoothness: spectral smoothness computation (vs. time)
Noise Content Analysis: noise content analysis (forward LPC, inverse LPC)

Sound Features

Play: plays loaded waveform and residue signal if selected
Stop: stops play and record
Rec: records signal from mic/line
Pause: pauses record and play

Plot Commands

Zoom: in/out
Scroll: x, y
Quick View: "summarized" view of signal, improves screen update latency
Log View/Linear View: log view of magnitude component in FFT

Table 3.1 Software features

Chapter 4 Conclusion and Further Work

I began this thesis with the intent to gain further understanding about timbre.

Through implementation and development of signal processing algorithms I have

conceived a software system that may be used by musicians and researchers in

the analysis of audio signals. While realizing these algorithms with the aim of

making the system a usable software tool, I have gained insight and knowledge

pertinent to timbre and signal processing techniques for extraction of timbral

features. The software is capable of robustly extracting features such as

accurate pitch, harmonic movement, amplitude modulation and all the algorithms

explained in chapter 2. This software by no means encompasses every feature

extraction algorithm available nor does it perform without flaw for all audio

signals. Rather, while limiting its scope down to musical instrument signals such

as the voice, horns, stringed instruments and synthesized sounds, it serves as a

continuously evolving project ultimately leading to a system for recognition of

sound sources.

The next step is to continue investigating in detail existing timbre recognition

systems and models, determine weaknesses and strengths, develop new

algorithms and apply known algorithms that may help in the recognition process.

A number of such methods are neural networks, MDS, CASA systems, gestalt

theory based approaches and pattern recognition techniques developed by the

machine vision community. It may be that some techniques work better in

certain situations and other methods in different situations. Perhaps a

combination of methods will render a robust recognition system. A myriad of

questions will arise; nonetheless, it will be exciting and interesting to participate in and observe advances in the discovery of the mysteries of perception.

Appendix

A.1 Windowing

To understand why windowing is used in the short time Fourier transform, I will show

the spectral characteristics when a signal is windowed by a simple rectangular

window. Let us consider the case where x[n] is the signal and w[n] is the

windowing function.

$$x[n] = \begin{cases} a^n, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (a.1)$$

$$w[n] = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (a.2)$$

then

$$x_w[n] = x[n]\, w[n] \qquad (a.3)$$

However, we know that a multiplication in the time domain corresponds to a

convolution in the frequency domain or more precisely:

$$X_w(f) = \frac{1}{2\pi} \{ X(f) * W(f) \} \qquad (a.4)$$

We also observe that for a rectangular window the Fourier transform is


$$W(\omega) = \sum_{n=0}^{N-1} e^{-j\omega n} = \frac{1 - e^{-j\omega N}}{1 - e^{-j\omega}} = \frac{\sin(0.5\,\omega N)}{\sin(0.5\,\omega)}\, e^{-j 0.5\,\omega (N-1)} \qquad (a.5)$$
Hence, we see that the convolution involves the Dirichlet kernel. The Dirichlet kernel causes distortions characterized by the main-lobe and side-lobe widths. Each sample in the time domain will render a sinc function in the frequency domain, causing side effects in the form of spectral smearing due to the finite main-lobe width and side-lobe interference produced by neighboring samples in the signal.

As described above, the choice of windowing functions plays a vital role in short

time Fourier analysis. The main idea in selecting the windowing function is to

taper off the abrupt end points of the rectangular window achieving gradual

transition. This results in reduced side-lobe magnitudes at the expense of a wider main lobe. The behavior of windowing functions can be found in many signal

processing textbooks (Porat 1997).

Rectangular window

The main-lobe width is 4π/N with 13 dB side-lobe attenuation, where N is the number of samples.

Hann window

The Hann window, also known as the Hanning window, achieves a side-lobe reduction by superposition. Three Dirichlet kernels are shifted and added together, resulting in partial cancellation of the side-lobes. The amount of shift is 2π/(N-1) from the center. The resulting Hanning window, which is sometimes called the cosine window, has a side-lobe attenuation of 32 dB and a main-lobe width of 8π/N, where N is the number of samples.

Hamming window

The Hamming window is similar to the Hanning window, with modifications in the weighting of the Dirichlet kernels. The main lobe is 8π/N wide with 43 dB side-lobe attenuation. One characteristic is the non-zero values at both end points; the window is therefore sometimes referred to as the half-raised cosine window. N is the number of samples.

Blackman window

The Blackman window has 57 dB side-lobe attenuation and a main-lobe width of 12π/N, where N is the number of samples.

A.2 Spectral Peak Detection and Tracking
Figure A.1 Rough peak detection

Figure A.2 Prominent peak search

Symbols: difference[ ] holds the bin spacing distances; thresh is the error boundary for a peak to be considered a possible peak.

Figure A.3 Harmonic break search, general flowchart

Figure A.4 Harmonic break search, sub-module flowcharts

Figure A.5 Harmonic analysis, general flowchart

Variables used in the flowchart: idealCount (ideal bin spacing counter), realCount (real bin index counter), numOfPeaks (total number of peaks not yet analyzed), harmError (harmonicity error between real and ideal bin), scaledStdDev (scaled standard deviation), upperBound (used to determine overshoot/undershoot/missing partials), tolerance (error tolerance scalar), bin (bin array containing the peaks for analysis).

Figure A.6 Harmonic analysis, detailed flowchart

References

Allen, J. B. 1977. "Short Time Spectral Analysis, Synthesis and Modification by Discrete Fourier Transform." IEEE Transactions on Acoustics, Speech and Signal Processing 25(3): 235-238.

Allen, J. B., and L. Rabiner 1977. "A Unified Approach to Short-Time Fourier Analysis and Synthesis." Proceedings of the IEEE 65(11): 1558-1564.

Appleton, J. 1991. "Pacific Rimbombo." The Virtuoso in the Computer Age. CDCM Computer Music Series, vol. 5.

Atal, B. S., and S. Hanauer 1971. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 50(2): 637-655.

Békésy, G. v. 1943. "Über die Resonanzkurve und die Abklingzeit der verschiedenen Stellen der Schneckentrennwand." Akustische Zeitschrift.

Beauchamp, J. W. 1982. "Synthesis by Spectral Amplitude and 'Brightness' Matching of Analyzed Musical Sounds." Journal of the Audio Engineering Society 30(6): 396-406.

Bregman, A. 1990. Auditory Scene Analysis. Cambridge: The MIT Press.

Carterette, E., and M. Friedman 1978. Handbook of Perception. Academic Press.

Cheney, W., and D. Kincaid 1994. Numerical Mathematics and Computing, 3rd Edition. Brooks/Cole Pub Co.

Clark, M., D. Luce, R. Abrams, H. Schlossberg, and J. Rome 1963. "Preliminary Experiments on the Aural Significance of Parts of Tones of Orchestral Instruments and on Choral Tones." Journal of the Audio Engineering Society 11(1): 45-54.
Cooley, J., and J. Tukey 1965. "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation 19: 297-301.

Eagleson, H., and W. Eagleson 1947. "Identification of Musical Instruments when Heard Directly and Over a Public Address System." Journal of the Acoustical Society of America 19(2): 338-342.

Elliot, C. 1975. "Attacks and Releases as Factors in Instrument Identification." Journal of Research in Music Education 23: 35-40.

Ellis, D. P. W. 1996. "Prediction-Driven Computational Auditory Scene Analysis." Ph.D. Dissertation, MIT.

Fletcher, H., and W. A. Munson 1933. "Loudness, its Definition, Measurement and Calculation." Journal of the Acoustical Society of America.

Fletcher, H., E. Blackman and R. Stratton 1962. "Quality of Piano Tones." Journal of the Acoustical Society of America 34(6): 1534-1544.

Grey, J. 1976. "Multidimensional Scaling of Musical Timbres." Journal of the Acoustical Society of America 61(5): 1270-1277.

Grey, J. M., and J. W. Gordon 1978. "Perceptual Effects of Spectral Modifications on Musical Timbres." Journal of the Acoustical Society of America 63(5): 1493-1500.

Helmholtz, H. 1877. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Translation. New York: Dover.

Kendall, R. A. 1986. "The Role of Acoustic Signal Partitions in Listener Categorization of Musical Phrase." Music Perception 4(2): 185-214.

McAdams, S. 1984. "Spectral Fusion, Spectral Parsing, and the Formation of Auditory Images." Technical Report STAN-M-22. Stanford University, Dept. of Music (CCRMA).

De Poli, G., A. Piccialli and C. Roads, eds. 1991. Representations of Musical Signals. Cambridge: The MIT Press.

McAdams, S. 1999. "Perspectives on the Contribution of Timbre to Musical Structure." Computer Music Journal 23(3): 85-103.

Martin, K. 1999. "Sound-Source Recognition: A Theory and Computational Model." Ph.D. Dissertation, MIT.

Cook, P. 1999. Music, Cognition and Computerized Sound: An Introduction to Psychoacoustics. Cambridge: The MIT Press.

Porat, B. 1997. A Course in Digital Signal Processing. John Wiley & Sons, Inc.

Shepard, R. 1964. "Circularity in Judgments of Relative Pitch." Journal of the Acoustical Society of America 36: 2346-2353.

Serra, X. 1997. "Musical Modeling with Sinusoids plus Noise." In G. D. Poli et al. (eds.), Musical Signal Processing. Swets & Zeitlinger Publishers.

Saldanha, E., and J. Corso 1964. "Timbre Cues and the Identification of Musical Instruments." Journal of the Acoustical Society of America 36: 2021-2026.
