Tae Hong Park
Salient Feature Extraction of Musical Instrument Signals
A Thesis
for the degree of
Master of Arts
in
Electro-Acoustic Music

DARTMOUTH COLLEGE
June 2, 2000
Examining Committee:
_________________________
Larry Polansky (Chair)
_________________________
Jon Appleton
_________________________
Charlie Sullivan
_________________________
Roger Sloboda
Dean of Graduate Studies
Abstract
Humans recognize sounds in a wide variety of situations - separated by a wall from the sound source, in a concert hall, or in uncontrolled and natural settings where more than one sound source, distortions and aural distractions exist. Many questions remain unanswered: how and what kind of information does the brain actually receive from our auditory sensory organs, and which features are critically important and which are redundant or even unnecessary?
The timbre recognition process for computer systems may be basically divided into two parts: a feature extraction part, which extracts salient features from the signal, and a recognition part, which uses the extracted data for categorization, prediction and taxonomy. In this thesis I concentrate on the first part, with the aim of giving musicians and researchers a usable tool for exploration of timbral characteristics. The result is a GUI (graphical user interface) based frequency and time domain signal processing system, where timbral features are extracted and displayed visually for better understanding.
Acknowledgments
First and foremost, I thank Jon Appleton for giving me the opportunity to come to
challenging, memorable and exciting time. Many thanks to Larry Polansky for
grateful to Charlie Sullivan for his thoughtful and insightful critiques which have
Douglas Repetto, Mary Roberts and Eric Lyon for their support. I would also like
to thank Dee Copley for keeping it all together, Andrew, Iroro, Jonathan and Paul
for letting me "multi-computer" all the time (well, most of the time).
I thank my parents and brothers for their unending encouragement and being the
best teachers I have had. They have been there for me throughout the years,
guided me and supported me with all the different (and sometimes strange)
Finally, I wish to thank Kyoung Hyun for her continual and unwavering support, understanding and love. It would have been very hard to reach this milestone without her.
Table of Contents
1 Introduction
1.1 Motivation
2 Signal Processing Modules
2.1 Introduction
2.3.2.1 Autocorrelation
2.3.2.4 Period Averaging
3 Software Implementation
3.1 Introduction
Appendix
References
List of Illustrations
Figure 2.1 Short time Fourier transform and Spectral Peak Detection
Figure 2.10 Spectral centroid of French horn and electric bass at 44.1 kHz
Figure 2.11 White noise, sine wave and electric bass spectral smoothness
Figure 2.17 Error plot: interp., interp. with period averaging and DFT
Figure 2.23 Electric bass envelope
List of Tables
Chapter 1 Introduction
1.1 Motivation
Consider the waveform of a musical instrument signal, plotted with the horizontal axis denoting time and the vertical axis denoting discrete magnitude values. What information does this signal contain, and how much information is there? Is there anything hidden behind the waveform? Is it just two dimensional in nature? How can I better understand why it makes this unique sound? Questions like these have aroused my curiosity in timbre, which has led to conceiving this thesis - feature extraction of musical signals.
This thesis approaches the problem by analyzing acoustical information and focusing on specific areas that may give clues for understanding timbre. Standard signal processing techniques are used for analysis, divided into frequency and time domain analyses. With these techniques it is possible to extract acoustical features. However, very few systems exist that are tailored for the purpose of analysis and extraction of timbral qualities of musical instrument signals. In this thesis I have developed and implemented various algorithms for extracting salient features into one software application which can be readily used. I hope that this software system will provide users the means to explore, investigate and experiment with audio signals, help determine which aspects of timbre are essential and which are less meaningful, and help answer some of the many questions regarding timbre that remain open. I also hope that a future system would be able to take the extracted features and recognize the sound source being analyzed.
The software is written in Java, chosen for its platform independence and graphical user interface (GUI) capabilities. The Java Swing toolkit was used to build the interface.
1.2 Feature Extraction and Timbre
In this thesis spectral analysis is based on the Fourier transform. The theory
behind the Fourier transform was first published in "Analytical Theory of Heat" by
Fourier. Fourier claimed in his writing that any periodic continuous signal could
be represented by the sum of an infinite number of sine and cosine waves. This
elegant description of periodic signals was later exploited by the 19th century physicist Hermann von Helmholtz (Helmholtz 1877). In his resonance theory, the cochlea behaved like a spectral analyzer analogous to the Fourier transform. He believed that the cochlea resonated at specific locations along the basilar membrane, and that the magnitudes of the spectral components, and not the phase components, were the sole factors contributing to the perception of timbre.
In fact, the hair cells in the basilar membrane (comparable to frequency bins in
the Fourier transform) are stimulated in an overlapping manner. That is, a sine
tone at 100 Hz will not just trigger one hair cell at precisely that frequency, rather
a group of hair cells will be excited leading to the perception of its pitch.
Furthermore, the importance of phase in perceiving musical sounds was later recognized: synthesized waveforms that discarded phase information sounded unrealistic. This may be partly attributed to the fact that the highly transient onset part of a signal stores a great deal of phase information. However, real-life sounds are only quasi-periodic and vary considerably. Further phenomena (… and Stratton 1962) and spectral fusion (McAdams 1984) have also been studied.
Although the Fourier transform has been known for quite some time, it was not widely applied by the music community until after 1965, with the introduction of the fast Fourier transform (Cooley and Tukey 1965). The advent of the FFT made efficient computation of the discrete Fourier transform practical. One such area of research in timbre concerned perceptual dimensions such as brightness, spectral flux and attack time. Instead of natural sounds, additively synthesized tones were often used in conducting the experiments. Noise content of musical signals, on the other hand, has not been investigated in as much detail by researchers compared to the "periodic" aspects of musical sounds (some work has been done in modeling non-periodic signals by Serra 1997). Voice coding research, however, has long modeled the voice signal as the combination of a periodic and a noisy part. The LPC (Linear Predictive Coding) method has been the primary backbone of current and past speech analysis by synthesis (AbS) systems.
During the past decade a number of research topics in timbre have been aimed at modeling and explaining how the listener perceives sounds. The approach is to find the underlying reasons as to why we hear what we hear, and not merely be content with the results of a computer system that finds a solution.
Much of this work can be attributed to Bregman, who published his book Auditory Scene Analysis in 1990 (Bregman 1990). The book describes in detail highly intuitive and clever experiments in auditory perception and grouping, which have motivated computational attempts at robust modeling of such features. An impressive amount of work has been done in this field. Work by Ellis (Ellis 1996) used a prediction-based model of the auditory system with good results in grouping sounds in noisy environments such as car horns and door slams.
Chapter 2 Signal Processing Modules
2.1 Introduction
A musical instrument signal may be broadly divided into a transient portion and a steady state or quasi-periodic portion. The transient part is usually the attack of the signal, and the steady state the portion that follows the attack part. When investigating time variant signals it is critical to make use of both time and frequency domain features such as the spectral envelope, spectral centroid and the like. Attack time is especially well studied (Saldanha and Corso 1964; Elliot 1975) and has been thought to be a dominant feature of musical instruments. However, it has also been discovered that the attack time and also note-to-note transients of a signal are neither sufficient nor necessary cues for identifying the signal.
This chapter describes the algorithms used in the software system for extracting features that depict these transient and stationary characteristics in the frequency and time domain. The frequency domain analysis discussed includes the short time Fourier transform, spectral centroid, spectral smoothness and tracking of partials over time. The time domain analysis includes noise content, pitch, amplitude envelope, amplitude modulation and attack time.
The spectral analysis part of feature extraction is primarily based on the discrete
Fourier transform (DFT). Below the continuous time and discrete time versions
X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt \qquad (2.1)
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \qquad (2.2)
To analyze the time varying spectra of musical signals, the short time Fourier transform (STFT) was used (Allen 1977; Allen and Rabiner 1977). The basic algorithm is as follows.
X[m, k] = \sum_{n=0}^{N-1} w[n - mD]\, x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \qquad (2.3)

where w is the window function, D the hop size and m the frame index.
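To make equations 2.2 and 2.3 concrete, here is a minimal pure-Python sketch of the windowed analysis. The function names are mine, and a direct O(N^2) DFT stands in for the FFT used in practice:

```python
import cmath
import math

def hamming(N):
    # Hamming window (the window type noted to work well for musical signals)
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def dft(frame):
    # Direct O(N^2) DFT of equation 2.2; a real implementation uses an FFT
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def stft(x, N=64, D=32):
    # Equation 2.3: hop a length-N window by D samples and transform each frame
    w = hamming(N)
    return [dft([w[n] * x[start + n] for n in range(N)])
            for start in range(0, len(x) - N + 1, D)]
```

For a sine located exactly on bin 8 of a 64-point frame, the magnitude spectrum of each frame peaks at bin 8, with some leakage into neighboring bins from the window's side-lobes.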
As seen in figure 2.1 the STFT can be simply described as windowing and taking
the FFT of the signal. There are various window types available in the program
Figure 2.1 Short time Fourier transform and Spectral Peak Detection (signal → windowing → FFT → peak detection → accumulated partials)
with different side-lobe and main lobe characteristics. The Hamming window has
been shown to work particularly well with musical signals (De Poli, Piccialli and
Roads 1991). See the appendix for details regarding windowing and its side-lobe characteristics. Musical instrument signals differ considerably when analyzed for frequency content. Most tend to have quasi-integer harmonic partials. Formant structures are also evident in violins, but the number of valleys is greater and the formant locations change very little with time, unlike the voice, which varies substantially for each vowel. Woodwinds such as the bassoon and oboe on the other hand have fewer formants than the voice, but tend to have stronger and clearer harmonics (… 1999). Generally, musical instruments like the plucked string (figure 2.2) exhibit
lower energy in the high frequency bins. The higher partials normally have less
energy and also die out faster than lower ones over time.
Using the short time Fourier transform, I have implemented a spectral peak detection algorithm that locates the prominent peaks of the spectrum. The peak picking algorithm takes into consideration magnitude relationships between neighboring peaks in order to reject misbehaving peaks. To help in the search for spectral peaks, various threshold values are employed.
The spectral peak detection algorithm is divided into four main steps. The first pass roughly locates possible peaks, where the roughness factor for searching peaks is controlled via a threshold value. The threshold value basically dictates the minimum magnitude rise required for a possible peak. The second pass filters out peaks that may have been erroneously selected in step 1. The third pass looks for any broken harmonic sequences; during this pass, peaks that may have been deleted or missed in the previous two passes are inserted. The final pass looks at the selected peaks and further does a harmonic analysis, ultimately leaving a set of peaks that are most probably harmonics. A mean and scalable standard deviation error method is applied for control of inharmonicity.
In the rough peak detection algorithm possible peaks are picked using negative
and positive slope threshold values to guide in the selection process. As shown
in figure 2.4 the polarity of the slope of the spectrum is computed from bin to bin
(DC to Nyquist), using the basic assumption that a transition from positive to negative slope signals the possibility of a peak.

[Figure 2.4: peak candidates vs. actual peaks along the spectrum]

The following conditions help in guiding the search:
2. The magnitude difference between the peak candidate and the current bin's magnitude must exceed a threshold value.
3. A new peak candidate search occurs only after there is a slope change from negative to positive and the threshold value is exceeded.
Refer to flowcharts in the appendix for details.
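The first pass described above - slope polarity plus a magnitude threshold - can be sketched as follows. This is a simplified reading of the algorithm; the variable names and the single running-minimum threshold are my own:

```python
def rough_peaks(mag, threshold):
    # Scan the magnitude spectrum from DC to Nyquist; a positive-to-negative
    # slope change marks a candidate, kept only if it rises at least
    # `threshold` above the running minimum seen since the last peak.
    peaks = []
    temp_min = min(mag)
    prev_slope = 0.0
    for k in range(1, len(mag)):
        slope = mag[k] - mag[k - 1]
        if prev_slope > 0 and slope < 0 and mag[k - 1] > temp_min + threshold:
            peaks.append(k - 1)        # local maximum at bin k-1
            temp_min = mag[k]          # reset the search floor
        temp_min = min(temp_min, mag[k])
        if slope != 0:
            prev_slope = slope
    return peaks
```

Small fluctuations that never rise far above the running minimum are disregarded as noise, matching the behavior sketched in the flowcharts.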
[Figures 2.5/2.6: rough peak detection - a peak candidate p0 is detected when the magnitude exceeds a threshold value (e.g. X[k-14] + constant or X[k-7] + constant); after a detection the search resets for a new candidate, and small fluctuations are disregarded as noise]
2.2.2.2 Step 2: Prominent Peak Search
In step 2, prominent peaks are located from a set of potential peaks found in
step 1. The purpose is to filter out local peaks which may be present between
stronger partial candidates as shown in figure 2.7. The search for prominent peaks proceeds as follows:

[Figure 2.7: prominent peaks vs. probable local peaks, separated by an adaptive threshold level, spanning DC to fs/2]
1. The peak with maximum amplitude is located.
2. Relative to the position of the peak with maximum amplitude, peaks are analyzed towards DC.
3. Relative to the position of the peak with maximum amplitude, peaks are analyzed towards fs/2.
Local maxima or peaks are picked out using an adaptive threshold value that is a percentage of the prominent peak's magnitude, as shown in figure 2.7. For example, a 50% threshold value will require neighboring peaks to be greater than at least half the magnitude of the prominent peak in order to survive.
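A simplified version of this step might look like the following sketch. The names are mine, a single global threshold stands in for the fully adaptive per-neighborhood one, and `ratio` plays the role of the 50% figure above:

```python
def prominent_peaks(mag, peak_bins, ratio=0.5):
    # Keep only candidate peaks whose magnitude reaches `ratio` times the
    # strongest candidate's magnitude, discarding weak local peaks.
    # (The thesis adapts the threshold per neighborhood; a global
    # threshold is a simplification.)
    strongest = max(mag[k] for k in peak_bins)
    return [k for k in peak_bins if mag[k] >= ratio * strongest]
```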
The third step is called the harmonic break search. Here, I have tried to analyze
if some "potential partials" were deleted or missed in the previous steps. This
may occur when potentially harmonically related peaks temporarily have little
energy or are simply much weaker than the stronger ones, but are nevertheless
harmonic. The harmonic break search is divided into the following sub-routines:
1. mean bin spacing computation
2. harmonic break region determination
3. zoomed in peak search
4. prominent peak evaluation

[Figure 2.8: a harmonic break between prominent peaks - an expected harmonic location F is bounded by right and left threshold bounds (F ± threshold)]
deviation. Hence, the algorithm expects the possibility of a peak within the threshold bounds. The first few peaks (selectable in software) are used as a guide to determine the
final set of partials. The reason for choosing the first few peaks of the spectrum
is due to the fact that in highly pitch salient signals, the lower harmonics usually are stronger and more stable. The idea is to use the gaussian normal distribution function, employing the mean, variance and standard deviation of the bin spacing between partials. A peak that is outside a right and left threshold bound is considered inharmonic and misbehaving. A mean bin spacing value denoting the bin distance between adjacent harmonics is computed, along with its variance and standard deviation. As the lower partials generally tend to be more stable and have more energy, the first K (K: integer > 0) peaks are used for the computation. A scaled standard deviation determines the permitted spread of each peak; in other words, the scaled standard deviation bounds the final set of peaks. The scalar that controls the scaled standard deviation is a value between 0 and 1, where 1 is equivalent to limiting the peaks to the original standard deviation. An ideal sequence of harmonics is then constructed using the above acquired data. The ideal set of harmonics and the actual set of harmonics are compared and the error (equation 2.6) for each peak is computed and verified against the
scaled standard deviation for final assessment. Peaks that have excessive error values are deleted from the final set of peaks and the remaining ones are finally accepted as harmonics.
Equation 2.6 shows the error between the ideal and actual bins where M is the
number of ideal peaks and N is the number of actual peaks in the spectrum. M
and N have different values as missing partials may exist in the actual set of
peaks.
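The mean/scaled-deviation idea can be sketched as follows. This is an illustrative reduction (the names and the exact bound are mine): the mean bin spacing estimated from the first K peaks defines an ideal harmonic series, and peaks too far from their nearest ideal harmonic are dropped.

```python
def harmonic_filter(peak_bins, K=3, tolerance=1.0):
    # Estimate the fundamental's bin spacing from the first K peaks,
    # build the ideal harmonic series, and reject peaks whose deviation
    # from the nearest ideal harmonic exceeds a scaled bound.
    spacings = [peak_bins[i + 1] - peak_bins[i] for i in range(K - 1)]
    mean_spacing = sum(spacings) / len(spacings)
    bound = tolerance * mean_spacing / 2.0   # permitted spread per peak
    kept = []
    for b in peak_bins:
        harmonic_number = max(1, round(b / mean_spacing))
        ideal = harmonic_number * mean_spacing
        if abs(b - ideal) <= bound:          # close enough to be harmonic
            kept.append(b)
    return kept
```

Note that missing partials are tolerated naturally here: each peak is judged against its own nearest ideal harmonic, so M and N need not agree.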
Once harmonics have been evaluated in each frame (a frame is equal to the length of the FFT), they are combined to render a spectrogram. Frame to frame partial tracking is illustrated in figure 2.9.
Figure 2.9 Partial tracking between frames
Each harmonic from frame to frame is allowed to sway in frequency within a set of error margin values. Hence, as shown in figure 2.9, four of the harmonics make a continuous harmonic path (k, k+1, k+2, k+3). However, the harmonic in frame k+4 exceeds the allowed error margin and breaks the previous harmonic path. At frame k+4 a new path is created and the path which started at frame k is discontinued. The harmonic paths are thus recorded as functions of time and frequency.
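The path logic above can be sketched as a small matching loop. This is an illustrative reduction with names of my own: each peak is matched to a surviving path whose previous-frame bin lies within `margin`, and unmatched peaks start new paths.

```python
def track_partials(frames, margin):
    # frames: per-frame lists of harmonic bin locations.
    # A path continues while a bin stays within `margin` of its bin in
    # the previous frame; otherwise the old path ends and a new one begins.
    paths = []       # each path is a list of (frame_index, bin)
    active = {}      # last frame's bin -> index into paths
    for t, bins in enumerate(frames):
        new_active = {}
        for b in bins:
            match = None
            for prev_b, pi in active.items():
                if abs(b - prev_b) <= margin:
                    match = pi
                    break
            if match is None:                 # margin exceeded: new path
                paths.append([])
                match = len(paths) - 1
            paths[match].append((t, b))
            new_active[b] = match
        active = new_active
    return paths
```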
The spectral centroid is a measure of the brightness of a sound (Grey and Gordon 1978). This measure is computed as the magnitude-weighted mean of the spectral bins (equation 2.7). It has been found that increased loudness also increases the amount of high spectrum energy.
sc = \frac{\sum_{k=1}^{N} k\, X[k]}{\sum_{k=1}^{N} X[k]} \qquad (2.7)
X[k] is the magnitude corresponding to bin k and N is the length of the DFT. This
measure has also been used in MDS (multidimensional scaling) based systems.
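Equation 2.7 amounts to a magnitude-weighted mean bin index; a direct sketch:

```python
def spectral_centroid(mag):
    # Equation 2.7: magnitude-weighted mean of the bin indices,
    # a simple correlate of perceived brightness
    num = sum(k * mag[k] for k in range(1, len(mag)))
    den = sum(mag[k] for k in range(1, len(mag)))
    return num / den
```

A spectrum with equal energy in bins 2 and 4, for instance, has its centroid at bin 3.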
Figure 2.10 Spectral centroid of french horn and electric bass at 44.1 kHz
Figure 2.10 shows examples of the spectral centroid for the French horn and the electric bass. The spectral smoothness measure operates on the frame to frame spectral envelope obtained via the short time Fourier transform. The algorithm basically takes the average of adjacent amplitudes of the spectral bins and compares them to the current amplitude at bin k, as shown in equation 2.8.
ss = \sum_{k=1}^{N-1} \left( 20 \log X[k] - \frac{20 \log X[k-1] + 20 \log X[k] + 20 \log X[k+1]}{3} \right) \qquad (2.8)
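A direct reading of equation 2.8 can be sketched as follows (the absolute value is my addition, so that positive and negative deviations from the local mean do not cancel):

```python
import math

def spectral_smoothness(mag):
    # Equation 2.8: deviation of each bin's log magnitude from the mean
    # of its three-bin neighborhood, accumulated over the spectrum
    db = [20.0 * math.log10(m) for m in mag]
    total = 0.0
    for k in range(1, len(mag) - 1):
        local_mean = (db[k - 1] + db[k] + db[k + 1]) / 3.0
        total += abs(db[k] - local_mean)
    return total
```

A perfectly flat spectrum scores 0, while a jagged one scores high - consistent with the white noise vs. sine wave comparison in figure 2.11.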
Figure 2.11 shows examples of the spectral smoothness for white noise, a sine wave and the electric bass.
Figure 2.11 White noise, sine wave and electric bass spectral smoothness
In this section I will discuss the time domain signal processing modules implemented in the software.
I have used linear prediction as the basis for extracting the degree of "noisiness"
of a signal. The motivation behind using the LPC method for musical signals lies
in its robust performance in modeling the voice. In the "LPC vocal tract model",
the resultant acoustical signal is represented via a noise signal and a sequence
of pulses passed through a resonant all-pole filter, shaping the spectral envelope
of the voice as shown in figure 2.12. In essence, the linear prediction filter
coefficients are used to predict the current sample with a finite number of
weighted past samples. Figure 2.13 shows examples of noise content analysis
[Figure 2.12: LPC vocal tract model - a pulse generator (driven by pitch) and a noise generator feed an all-pole filter whose LPC coefficients shape the spectral envelope of the synthesized speech]

[Figure 2.13: noise content analysis examples - magnitude vs. samples (time)]
\hat{s}[k] = \sum_{i=1}^{p} a_i\, s[k-i] \qquad (2.9)

where p is the prediction order and a_i are the prediction coefficients.
The coefficients in the difference equation are selected so that the error between the current sample s[k] and the predicted sample from equation 2.9 is minimized.
The noise content analysis algorithm is shown in figure 2.14. Before submitting
the signal to the short term prediction filter block, a pre-emphasis filter (equation
2.11) is used to flatten the spectrum for enhanced performance. The pre-
emphasis filter (high pass filter) coefficients range from 0.95 to 0.98. The
residual signal ds[n] is passed through a "spike damping filter" which damps spikes that are present in ds[n] and ultimately renders the noise content of the signal.
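A compact sketch of the forward/inverse LPC stages: pre-emphasis, a Levinson-Durbin solution for the prediction coefficients of equation 2.9, and inverse filtering to obtain the residual ds[n]. The spike damping stage is omitted, and all names are mine:

```python
def pre_emphasis(x, c=0.97):
    # High-pass pre-emphasis (coefficient in the 0.95-0.98 range per the text)
    return [x[0]] + [x[n] - c * x[n - 1] for n in range(1, len(x))]

def autocorr(x, maxlag):
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(maxlag + 1)]

def lpc(x, order):
    # Levinson-Durbin recursion: returns A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def residual(x, a):
    # Inverse LPC filter: ds[n] = x[n] + sum_i a[i] x[n-i]; small for a
    # well-predicted (periodic) signal, large for the noisy part
    p = len(a) - 1
    return [x[n] + sum(a[i] * x[n - i] for i in range(1, p + 1))
            for n in range(p, len(x))]
```

For a signal that is exactly an order-1 autoregression, the residual collapses to nearly zero, which is what makes its magnitude a usable "noisiness" measure.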
[Figure 2.14: noise content analysis - the signal passes through a pre-emphasis filter, forward LPC and inverse LPC (linear prediction), and a spike damping filter, yielding the residual/noise signal ds[n]]
The pitch detection module uses autocorrelation together with cubic spline interpolation and period averaging to accurately compute the pitch of the signal. As seen in figure 2.17, the error for the period-averaging method is smallest compared to the autocorrelation method without interpolation and the FFT.
Figure 2.17 Error plot: interp., interp. with period averaging and DFT
2.3.2.1 Autocorrelation
The autocorrelation function is defined as:

acf_{xx}(\tau) = \int_{-\infty}^{\infty} x(t)\, x(t+\tau)\, dt \qquad (2.12)
From the autocorrelation function, the periods of the signal are determined - more precisely, from the location of a peak bounded by two threshold values.
Comparing figure 2.18 and figure 2.19 it is clear that the time resolution for high frequency signals is poorer, leading to greater error. In other words, far fewer samples are present between the periods of a high frequency signal than a low frequency one (approximately 440 samples vs. 50 samples, figures 2.18 and 2.19). Periods in the autocorrelation signal are detected through peaks that correspond to the frequency of the audio signal (figure 2.20). Peaks are extracted using two zero crossing pairs for each peak. These pairs define the range where a peak
that corresponds to the period could actually be found. The first period value is used as the basis to look for and compute consecutive peaks in the autocorrelation vector. Hence, an error margin dictated by the first period found is used to guide the search for the remaining peaks. All peaks that are considered periods are subjected to interpolation.
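The discrete counterpart of equation 2.12 and a bare-bones period search (without the zero-crossing bookkeeping or the interpolation refinement) can be sketched as:

```python
def autocorrelation(x):
    # Discrete version of equation 2.12
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(len(x))]

def detect_period(x, min_lag=2):
    # Lag of the strongest autocorrelation peak past lag 0; the thesis
    # additionally brackets peaks with zero-crossing pairs and refines
    # them by cubic spline interpolation
    acf = autocorrelation(x)
    return max(range(min_lag, len(acf) - 1), key=lambda lag: acf[lag])
```

The fundamental frequency estimate is then fs divided by the detected period in samples.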
[Figure 2.20: autocorrelation peaks and periods T0-T5 - actual peaks/periods are distinguished from local peaks using zero crossing pairs]
The natural cubic spline interpolation method is used in pinpointing the "actual" peak, and hence its period, in the autocorrelation function. The basic idea behind
the natural cubic spline method is shown in figure 2.21: each segment between adjacent knots is described by a cubic polynomial S_i(x). This essentially is a problem of solving each polynomial, bounded by knots, for its coefficients. Interpolation alone, however, can only help so much. To get better performance both for low frequency and high frequency signals, period averaging is also applied.
T_m = \{T_0, T_1, T_2, T_3, \ldots, T_M\}, \qquad 0 \le m \le M \qquad (2.15)
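Averaging the period set of equation 2.15 before converting to frequency reduces the error of any single period estimate; a minimal sketch:

```python
def period_average_frequency(periods, fs):
    # Average the detected periods T_0..T_M (in samples), then convert
    # the mean period to a frequency estimate in Hz
    mean_period = sum(periods) / len(periods)
    return fs / mean_period
```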
The number of autocorrelation peaks may vary from frame to frame.

[Figure 2.22: a frame with M = 6 detected periods T0-T5]
The amplitude envelope describes the energy change of the signal in the time domain. The envelope of the signal is computed with a frame by frame RMS (root mean square) algorithm.
RMS (equation 2.17) is related to the average power of a signal and is different from the simple average or the peak level. The average changes very little even if the signal contains numerous transient peaks. The peak level on the other hand can vary greatly in a small amount of time, but without much affecting the average value. RMS is a more perceptually relevant measurement and has
been shown to correspond more closely to the way we hear loudness.
rms = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} x[k]^2} \qquad (2.17)
The frame by frame RMS is used quite similarly to the short time Fourier transform method. The length of the RMS frame determines the time resolution of the envelope: a large frame length yields lower transient information and a small frame length greater transient detail. The window length M (equation 2.18) divides the total signal length N evenly, with p a positive integer.
rms_{frame} = \sqrt{\frac{1}{M} \sum_{k=L}^{L+M-1} x[k]^2}, \qquad \text{where } Mp = N \qquad (2.18)
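Equation 2.18 as a loop over consecutive length-M frames (the function name is mine):

```python
import math

def frame_rms(x, M):
    # Equation 2.18: RMS over consecutive length-M frames; M trades
    # time resolution against smoothness of the resulting envelope
    return [math.sqrt(sum(x[k] ** 2 for k in range(L, L + M)) / M)
            for L in range(0, len(x) - M + 1, M)]
```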
[Figure 2.24: envelope computation - the signal passes through a frame by frame RMS stage and a low-pass filter (cutoff fc depending on sampling rate fs = 8k, 22.05k, 44.1k) to produce the amplitude envelope]
The window size is selectable in the software, a longer window resulting in less transient detail. The low-pass filter cutoff frequencies have been determined empirically at 350 Hz (fs = 8000) and 1200
2.3.4 Amplitude Modulation

Amplitude modulation analysis reuses the amplitude envelope algorithm with a few steps added. Figure 2.25 shows a summary of amplitude modulation analysis. The steady state portion of the signal is extracted and its envelope computed; the envelope's periodicity then yields the modulation frequency. For accurate location of peaks, the cubic spline interpolation method is again used.
[Figure 2.25: AM analysis - steady state extraction → frame by frame RMS → low-pass filter (fc vs. fs = 8k, 22.05k, 44.1k) → 20log → AM peak detection → interpolation of peaks → frequency computation]
Examples of amplitude modulation analysis are shown for the violin, flute and saxophone (figure 2.26). The frequency in Hertz is computed as
frequency = \frac{f_s}{wT} \qquad (2.19)
where fs is the sampling rate, w the RMS frame length and T the period in
frames
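As a worked instance of equation 2.19: with fs = 44100 Hz and an RMS frame length of w = 441 samples, an envelope period of T = 20 frames corresponds to a 5 Hz modulation.

```python
def am_frequency(fs, w, T):
    # Equation 2.19: modulation frequency in Hz from the sampling rate fs,
    # the RMS frame length w (samples) and the envelope period T (frames)
    return fs / (w * T)
```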
Attack time (Saldanha and Corso 1964; Elliot 1975) is an important feature of musical instrument identification. A threshold level is set, and the attack time is measured from the signal onset until this threshold level is exceeded. Although the attack portion embodies a great deal of transitional information of the signal leading to a steady state, it is difficult to say where the attack portion ends and
where the steady state begins. As a matter of fact, it is even difficult to say how the attack itself should be defined. One practical application of the attack/steady state division is wavetable synthesis. The basic idea is to take an auditory snapshot of the signal - the attack and first few milliseconds of the steady state portion - then loop the steady state portion. Hence, this gives the listener the illusion that the whole signal is being played back, although only a fractional length of the signal has actually been used to render such an illusion. Today's popular music genres and
"electronic" jazz music are very much dominated by this technology. However, many contemporary composers (Appleton 1991) also have used this technology in their works.
Chapter 3 Software Implementation
3.1 Introduction
This chapter describes the software system that implements the feature extraction modules of chapter 2. The following sections describe the software and the motivation behind using Java.
The Java programming environment was used for the following reasons:

1. Platform independence
2. GUI capabilities (Swing)
3. Syntax similarity to C/C++
4. Java Sound
Java has been designed to be architecture neutral. This portability can be achieved if pure Java code is written. For this reason, the program I have developed has no native coding methods (system dependent code) and runs on pure Java alone. Although native coding methods improve efficiency, this program does not require real-time processing or time-critical computation. One of the goals of this thesis was to write an intuitive GUI (graphical user interface) for exploring timbral features; this was done with the Java Swing GUI environment. Another reason for choosing Java lies in sound playback and recording: the ability to play and record sound through a common API, without accessing procedures pertinent to each platform, was a major plus. Finally, the syntax similarity of Java to C/C++ greatly facilitated the move from C/C++ to Java.
The software's main modules are the GUI, the command center and the sound object. The GUI is responsible for responding to requests performed by the user via button clicks, data entry, menu selections etc. The command center takes the job of directing the commands requested by the user and notifying the appropriate modules. The sound object, created whenever a new sound file is loaded into memory, supervises and keeps track of all its child frames (internal frames - each corresponds to a DSP process) for updates and repaints.
[Figure 3.1: software architecture - the GUI communicates with the command center, which dispatches DSP commands 0…M to the signal processing library]
Figure 3.2 Snapshot of software
File IO Features
  File Open                reads aiff, wav, au
  File Save                writes aiff, wav, au, raw data (float)

Analysis Features
  AM Analysis              amplitude modulation analysis
  Attack Time              attack time computation
  Amplitude Envelope       amplitude envelope rendering
  FFT                      discrete Fourier transform
  Spectrogram              display of peaks vs. frame/time
  Pitch Detection          detection of pitch (vs. time)
  Pitch Modulation         detection of low freq. pitch modulation
  Spectral Centroid        spectral centroid computation (vs. time)
  Spectral Smoothness      spectral smoothness computation (vs. time)
  Noise Content Analysis   noise content analysis (forward LPC, inverse LPC)

Sound Features
  Play                     plays loaded waveform and residual signal if selected
  Stop                     stops play and record
  Rec                      records signal from mic/line
  Pause                    pauses record and play

Plot Commands
  Zoom in/out
  Scroll x,y
  Quick view               "summarized" view of signal, improves FFT
Chapter 4 Conclusion and Further Work
I began this thesis with the intent to gain further understanding about timbre. To that end I implemented a number of algorithms for the analysis of audio signals. While realizing these algorithms with the aim of making the system a useable software tool, I have gained insight and knowledge about features such as accurate pitch, harmonic movement and amplitude modulation. The system is neither the most complete feature extraction package available nor does it perform without flaw for all audio signals. Rather, by limiting its scope to musical instrument signals, it performs reasonably well for its intended sound sources.
Future work will be to develop new algorithms and apply known algorithms that may help in the recognition process. A number of such methods are neural networks, MDS, CASA systems and gestalt based approaches. Some methods may perform better in certain situations and other methods in different situations. Perhaps a combination of such methods would prove most effective.
Appendix
A.1 Windowing
To understand why windowing is used in the short time Fourier transform, I will show the effect of the implicit rectangular window. Let us consider the case where x[n] is the signal and w[n] is the windowing function:
x[n] = a^n, \quad 0 \le n \le N-1; \quad 0 \text{ otherwise} \qquad (a.1)

w[n] = 1, \quad 0 \le n \le N-1; \quad 0 \text{ otherwise} \qquad (a.2)

then

X_w(f) = \frac{1}{2\pi} \{ X(f) * W(f) \} \qquad (a.4)

W(\omega) = \sum_{n=0}^{N-1} e^{-j\omega n} = \frac{1 - e^{-j\omega N}}{1 - e^{-j\omega}} = \frac{\sin(0.5\,\omega N)}{\sin(0.5\,\omega)}\, e^{-j 0.5\,\omega (N-1)} \qquad (a.5)
Hence, we see that the spectrum of the windowed signal is the convolution of the signal's spectrum with the Dirichlet kernel, whose main and side lobes have characteristic widths. A finite-length segment in the time domain thus renders a sinc-like function in the frequency domain, causing side effects in the form of spectral smearing due to the side-lobes. As described above, the choice of windowing function plays a vital role in short time Fourier analysis. The main idea in selecting the windowing function is to taper off the abrupt end points of the rectangular window, achieving gradual transitions at the cost of wider main lobe widths. The behavior of windowing functions can be found in many signal processing texts.
Rectangular window
The main lobe width is 4pi/N and the side-lobe attenuation is 13 dB, where N is the window length.
Hann window
The Hann window, also known as the Hanning window, achieves a higher side-lobe attenuation than the rectangular window by placing raised cosine weights 2pi/(N-1) apart from the center. The resulting Hanning window, which is sometimes called the cosine window, has a side-lobe level of 32 dB and
Hamming window
weighting the Dirichlet kernels. The main lobe is 8pi/N wide with 43 dB side-lobe attenuation. One characteristic is the non-zero values at both end points.
Blackman window
The Blackman window has 57 dB side-lobe attenuation and a main lobe width of 12pi/N.
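The window families discussed above can be generated directly from their raised-cosine definitions. A sketch using the standard coefficient sets, which may differ slightly from the exact variants the text describes:

```python
import math

def hann(N):
    # Tapers to zero at both end points
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def hamming(N):
    # 0.54/0.46 weighting; note the non-zero end points
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def blackman(N):
    # Three-term cosine window with deeper side-lobe attenuation
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / (N - 1))
            + 0.08 * math.cos(4 * math.pi * n / (N - 1)) for n in range(N)]
```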
A.2 Spectral Peak Detection and Tracking
[Flowchart: rough peak detection - initialize peakFound = false, set threshold values and tempMin = min(spectrum's magnitude); compute the slope from bin to bin; when the slope turns from positive to negative and the current magnitude exceeds tempMin + threshold1, a peak is registered; the search then resets for a new candidate and continues while samples remain]
[Flowchart: prominent peak search - find the maximum peak's bin index from the selected candidates and set tempMax to the maximum peak magnitude; scan the magnitude array of rough peaks, decrementing towards DC and then towards fs/2, updating tempMax to the current peak's magnitude whenever a sufficiently strong peak is found]
[Flowchart: harmonic break search - difference[] holds the bin spacing distances and thresh the error boundary for a peak to be considered possible; the mean bin spacing is computed (mean(difference[])), harmonic break regions are determined and added to a list for a zoomed in peak search, followed by prominent peak evaluation]
[Flowchart: harmonicity analysis]
[Flowchart: determine harmonicity. Variables: idealCount - ideal bin spacing counter; realCount - real bin index counter; numOfPeaks - total number of peaks (not analyzed); harmError - harmonicity error between real and ideal bin; scaledStdDev - scaled standard deviation; upperBound - used to determine overshoot/undershoot/missing partials; tolerance - error tolerance scalar; bin - bin array containing peaks for analysis. For each peak, harmError is computed from (1 + idealCount) * binSpacingMean and compared against upperBound = binSpacingMean * tolerance; peaks with abs(harmError) < upperBound are accepted (realCount++, idealCount++), overshoots and undershoots advance idealCount or move the peak to badPeakBin[badBinCount++], and neighboring bins are checked for smaller error]
References
Allen, J. B., and L. Rabiner. 1977. "A Unified Approach to Short-Time Fourier Analysis and Synthesis." Proceedings of the IEEE 65(11): 1558-1564.
Atal, B. S., and S. L. Hanauer. 1971. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 50(2): 637-655.
Cheney, W., and D. Kincaid. 1994. Numerical Mathematics and Computing, 3rd Edition. Brooks/Cole.
Cooley, J., and J. Tukey. 1965. "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation 19: 297-301.
Helmholtz, H. 1877. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Translation: New York: Dover.
De Poli, G., A. Piccialli, and C. Roads, eds. 1991. Representations of Musical Signals. Cambridge, MA: The MIT Press.
Porat, B. 1997. A Course in Digital Signal Processing. John Wiley & Sons, Inc.
Serra, X. 1997. "Musical Sound Modeling with Sinusoids plus Noise." In G. De Poli et al., eds. Musical Signal Processing. Swets & Zeitlinger.
Saldanha, E., and J. Corso. 1964. "Timbre Cues and the Identification of Musical Instruments." Journal of the Acoustical Society of America 36: 2021-2026.