Salient Feature Extraction of Musical Instrument Signals

A Thesis

Submitted to the Faculty

in partial fulfillment of the requirements for the

degree of

Master of Arts

in

Electro-Acoustic Music

by Tae Hong Park

DARTMOUTH COLLEGE

Hanover, New Hampshire

June 2, 2000

Examining Committee:

_________________________
Larry Polansky (Chair)

_________________________
Jon Appleton

_________________________
Charlie Sullivan
_________________________
Roger Sloboda
Dean of Graduate Studies
Abstract

Musical timbre is inherently multidimensional and extremely complex in structure.

For those reasons, timbre is an on-going research topic in areas such as

computer science, psychology, music and engineering. Humans have the

natural ability to segregate, identify and recognize sounds in a variety of

situations - separated by a wall from the sound source, in a concert hall, within a

noisy traffic environment or at a cocktail party. Although computer systems have

been realized to determine ways of recognizing and identifying sounds with

extracted features, none have come close to performing nearly as well as

humans do. The robustness of computer systems degrades especially in

uncontrolled or natural settings, where more than one sound source, distortions

and aural distractions exist. Many questions remain unanswered: how and what

kind of information does the brain actually receive from our auditory sensory

organs; which features are critically important and which are redundant or even

cause confusion in the recognition and identification process?

The timbre recognition process for computer systems may be divided into two basic parts: first, the feature extraction part, which extracts salient characteristics of an acoustic signal; second, the recognition part, which uses the extracted data for categorization, prediction and taxonomy. In this thesis I

will concentrate on the feature extraction part. I have implemented and

developed a number of algorithms that are useful in picking out acoustical

characteristics of musical instruments. The purpose of the software is to give

musicians and researchers a usable tool for exploration of timbral characteristics.

The signal processing algorithms implemented in software were all realized in

the Java programming environment. The system can be regarded as a GUI

(graphical user interface) based frequency and time domain signal processing

system, where timbral features are extracted and displayed visually for better

understanding.

Acknowledgments

First and foremost, I thank Jon Appleton for giving me the opportunity to come to

the Electro-Acoustic Music Program at Dartmouth and making my two years a

challenging, memorable and exciting time. Many thanks to Larry Polansky for

his invaluable guidance, advice and the countless discussions. I am very

grateful to Charlie Sullivan for his thoughtful and insightful critiques which have

helped me tremendously in finishing the thesis. Thanks to Charles Dodge for

introducing me to so many different facets of music I did not know existed,

Douglas Repetto, Mary Roberts and Eric Lyon for their support. I would also like

to thank Dee Copley for keeping it all together, Andrew, Iroro, Jonathan and Paul

for letting me "multi-computer" all the time (well, most of the time).

I thank my parents and brothers for their unending encouragement and being the

best teachers I have had. They have been there for me throughout the years,

guided me and supported me with all the different (and sometimes strange)

paths I had chosen.

Finally, I wish to thank Kyoung Hyun for her continual and unwavering support,

understanding and love. It would have been very hard to reach this milestone

without her - thank you.

Table of Contents

1 Introduction
1.1 Motivation
1.2 Feature Extraction and Timbre

2 Signal Processing Modules
2.1 Introduction
2.2 Frequency Domain Analysis
2.2.1 DFT and STFT
2.2.2 Spectral Peak Detection and Tracking
2.2.2.1 Step 1: Rough Peak Detection
2.2.2.2 Step 2: Prominent Peak Search
2.2.2.3 Step 3: Harmonic Break Search
2.2.2.4 Step 4: Harmonicity Analysis
2.2.2.5 Partial Tracking Between Frames
2.2.3 Spectral Centroid
2.2.4 Spectral Smoothness
2.3 Time Domain Analysis
2.3.1 Noise Content Analysis: Linear Prediction
2.3.2 Pitch Detection
2.3.2.1 Autocorrelation
2.3.2.2 Detection of Periods
2.3.2.3 Natural Cubic Spline Interpolation
2.3.2.4 Period Averaging
2.3.3 Amplitude Envelope
2.3.4 Amplitude Modulation
2.3.5 Attack Time

3 Software Implementation
3.1 Introduction
3.2 Why Java?
3.3 Main Software Structure
3.4 Software Features

4 Conclusion and Further Work

Appendix

References

List of Illustrations

Figure 2.1 Short time Fourier transform and Spectral Peak Detection
Figure 2.2 Plucked string spectrum
Figure 2.3 Peak detection algorithm
Figure 2.4 Rough search for peaks
Figure 2.5 Actual peak assessment
Figure 2.6 Transitional peaks (noise)
Figure 2.7 Prominent peak search
Figure 2.8 Harmonic break search
Figure 2.9 Partial tracking between frames
Figure 2.10 Spectral centroid of french horn and electric bass at 44.1 kHz
Figure 2.11 White noise, sine wave and electric bass spectral smoothness
Figure 2.12 Vocal tract model
Figure 2.13 Noise content analysis of flute and electric bass
Figure 2.14 Noise content analysis
Figure 2.15 Noise content analysis of electric bass
Figure 2.16 Pitch computation
Figure 2.17 Error plot: interp., interp. with period averaging and DFT
Figure 2.18 Autocorrelation signal, sine wave at 100 Hz
Figure 2.19 Autocorrelation signal, sine wave at 1010 Hz
Figure 2.20 Peak detection through zero crossing and interpolation
Figure 2.21 Natural cubic spline
Figure 2.22 Peak averaging (with number of peaks)
Figure 2.23 Electric bass envelope
Figure 2.24 Amplitude envelope
Figure 2.25 Amplitude modulation analysis
Figure 2.26 Amplitude modulation, alto saxophone
Figure 3.1 Main software architecture
Figure 3.2 Snapshot of software
Figure A.1 Rough peak detection
Figure A.2 Prominent peak search
Figure A.3 Harmonic break search, general flowchart
Figure A.4 Harmonic break search, sub-module flowcharts
Figure A.5 Harmonic analysis, general flowchart
Figure A.6 Harmonic analysis, detailed flowchart

List of Tables

Table 3.1 Software features

Chapter 1 Introduction

1.1 Motivation

I am looking at an acoustical waveform, horizontal axis denoting discrete time

and vertical axis denoting discrete magnitude values. What information does

this signal contain, and how much information is there? Is there anything hidden

behind the waveform? Is it just two dimensional in nature? How can I better

understand why it makes this unique sound? Questions like these have aroused

my curiosity about timbre, which has led to this thesis - feature extraction

of musical signals.

Feature extraction is an integral part of understanding musical instrument

signals. These signals contain a wealth of information and feature extraction is a

method for obtaining specific characteristics through signal processing

techniques. Hence, it is partly a process of reducing the overwhelming

acoustical information and focusing on specific areas that may give clues for

describing the signal under investigation. In a computer system, digital signal

processing techniques are used for analysis. The techniques of data analysis

are divided into frequency and time domain analyses. With these techniques,

numerous approaches from different angles are employed to extract salient

information, ultimately to help understand timbral characteristics.

Various signal processing software systems exist for extracting specific

acoustical features. However, very few systems exist that are tailored for the

purpose of analysis and extraction of timbral qualities of musical instrument

signals. In this thesis I have developed and implemented various algorithms for

extracting salient features into one software application which can be readily

used by musicians, composers, engineers or anyone interested in analyzing

musical signals from a signal processing point of view.

It is also interesting to note that although numerous signal processing algorithms

have been devised to accomplish feature extraction tasks, it is still unclear as to

which aspects of timbre are essential and which are less or more meaningful

than others. To my knowledge, there exists no theory nor rule that

unambiguously defines a hierarchical description of timbral features. It is my

hope that this software system will provide users the means to explore,

investigate and experiment with audio signals and help answer some of the

many questions regarding timbre that are yet to be discovered. However, I also

plan to continue research in timbre to encompass a recognition module which

would be able to take the extracted features and recognize the sound source

being analyzed.

The software was rendered in Java, chosen for its platform independence

and graphical user interface (GUI) capabilities. The Java Swing GUI was used

to facilitate the interpretation of extracted features through graphical displays and

parametric controls of various signal processing coefficients.

1.2 Feature Extraction and Timbre

In this thesis spectral analysis is based on the Fourier transform. The theory

behind the Fourier transform was first published in "Analytical Theory of Heat" by

Fourier. Fourier claimed in his writing that any periodic continuous signal could

be represented by the sum of an infinite number of sine and cosine waves. This

elegant description of periodic signals was later exploited by the 19th century

physicist Hermann Helmholtz (Helmholtz 1877). His view of the ear was that of a

"frequency analyzer" based primarily on Fourier's mathematical theorem, Ohm's

physical definition of a simple tone and the existence of a resonator in the

cochlea, capable of accomplishing sound analysis. According to Helmholtz's

theory, the cochlea behaved like a spectral analyzer analogous to the Fourier

transform. He believed that the cochlea resonated at specific locations along the

basilar membrane (Carterette and Friedman 1978), each tuned to specific

frequencies. Helmholtz also claimed that the spectral magnitude components,

and not the phase components, were the sole factors contributing to the

perception of musical tones. However, this over-generalization of the human ear

performing a strict Fourier transform on the incoming sound waves was

disproved by Békésy (Békésy 1943), who demonstrated the impossibility of such

precise and acute tuning resonators in the cochlea as described by Helmholtz.

In fact, the hair cells in the basilar membrane (comparable to frequency bins in

the Fourier transform) are stimulated in an overlapping manner. That is, a sine

tone at 100 Hz will not just trigger one hair cell at precisely that frequency, rather

a group of hair cells will be excited leading to the perception of its pitch.

Furthermore, the importance of phase in perceiving musical sounds was

demonstrated by Clark (Clark, Luce, Abrams, Schlossberg and Rome 1963),

who clearly showed that in the absence of phase information, acoustic

waveforms sounded unrealistic. This may be partly attributed to the fact that the

highly transient onset part of a signal stores a great deal of phase information.

Helmholtz's theory works well in ideal situations when a signal is periodic.

However, real-life sounds are only quasi-periodic and vary considerably. The

significance of spectral fluctuation as well as inharmonicity (Fletcher, Blackman

and Stratton 1962) and spectral fusion (McAdams 1984) has also been studied

as potential features in describing musical tones.

Although the Fourier transform has been known for quite some time, it was not

widely applied by the music community until after 1965, with the introduction of

the fast Fourier transform (Cooley and Tukey 1965). The advent of the FFT

stimulated research in music partly due to the cost effectiveness in processing

the discrete Fourier transform. One such line of research in timbre used multidimensional scaling (MDS) methods (Grey 1976). The

structure of musical signals was mapped to a three dimensional timbre space.

The listener determined the similarity or dissimilarity between sounds when

salient features were changed. The three dimensions incorporated were

brightness, spectral flux and attack time. Instead of natural sounds, additive

synthesis methods were employed for easy control of timbral parameters in

conducting the experiments. Noise content of musical signals on the other hand

has not been investigated in as much detail by researchers compared to the

"periodic" aspects of musical sounds (some work has been done in modeling

non-periodic signals by Serra 1997). However, voice coding research has been

adapting noise analysis techniques enthusiastically, where speech is divided into

a periodic and a noisy part. The use of an LPC (linear predictive coding) method

has been the primary backbone in current and past speech analysis by synthesis

(AbS) systems.

During the past decade a number of research topics in timbre have been

pursued in the area of so called Computational Auditory Scene Analysis (CASA).

It may be thought of as a research area in psychophysical disciplines to describe

and explain how the listener perceives sounds. Sound, in this context, may be regarded as a multiplexed signal - an aggregate of a number of sound sources.

The approach is to find the underlying reasons as to why we hear what we hear

and not merely be content with the results of a computer system that finds a

matching answer to a stimulus. The proliferation of CASA can be largely

attributed to Bregman, who published his book Auditory Scene Analysis in 1990

(Bregman 1990). The book describes in detail highly intuitive and clever

experiments that attempt to explain psychoacoustic phenomena and to model such features robustly. However, as is the case with most if not all

psychoacoustic experiments, the stimuli or test tones used in Bregman's book

are also static, synthesized, sine-tones or simply impractical sound examples

which are often only remotely related to real-life sounds. Nevertheless, a significant

and impressive amount of work has been done in this field. Work by Ellis (Ellis

1996) used a prediction-based model of the auditory system with good results in

grouping sounds in noisy environments such as car horns, door slams and

squeals in a "city street environment". He used a re-synthesis approach to

assess its robustness and performance. Another is a statistically based pattern-recognition approach (Martin 1999), where the "listening" system classifies

musical instruments as one of 25 possibilities based on Ellis's PDCASA

(Prediction Driven Computational Auditory Scene Analysis) architecture.

Chapter 2 Signal Processing Modules

2.1 Introduction

Musical instrument signals generally consist of a transient portion and steady

state or quasi-periodic portion. The transient part is usually the attack of the

signal and the steady state the portion that follows the attack part. When

investigating time-variant signals it is critical to make use of both time and

frequency domain analysis techniques. Some important features in musical

signals include duration, amplitude modulation, pitch, spectral harmonicity,

spectral envelope, spectral centroid and the like. Attack time is especially

considered a salient feature of musical timbre (Eagleson and Eagleson 1947;

Saldanha and Corso 1964; Elliot 1975) and has been thought to be a dominant

feature of musical instruments. However, it has also been discovered that the

attack time and also note-to-note transients of a signal are neither sufficient nor

necessary for recognizing musical instruments (Kendall 1986). This

controversial discovery supports the importance of the steady state portion of a

signal.

This chapter mainly describes the implementation of the signal processing

algorithms used in the software system for extracting features that depict these

transient and stationary characteristics in the frequency and time domain. The

frequency domain analysis section of this chapter is primarily based on the

discrete Fourier transform (DFT). DFT based spectral analysis algorithms

discussed include the short time Fourier transform, spectral centroid, spectral

smoothness and tracking of partials over time. In the time domain analysis

section I will mainly describe the implementation of algorithms including pitch

detection with interpolation and period averaging based on the autocorrelation

function. Other modules discussed are amplitude envelope, amplitude

modulation, attack time computation and noise content analysis.

2.2 Frequency Domain Analysis

2.2.1 DFT and STFT

The spectral analysis part of feature extraction is primarily based on the discrete

Fourier transform (DFT). Below the continuous time and discrete time versions

of the Fourier transform are shown.


$$X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt \qquad (2.1)$$

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le n \le N-1,\ 0 \le k \le N-1 \qquad (2.2)$$

To extract transitory spectral characteristics the short time Fourier transform

(STFT) was used (Allen 1977; Allen and Rabiner 1977). The basic algorithm is

as follows.
$$X_m[k] = \sum_{n=0}^{N-1} w[n - mD]\, x[n]\, e^{-j 2\pi k n / N}, \qquad 0 \le n \le N-1,\ 0 \le k \le N-1 \qquad (2.3)$$

where $w[n]$ is the window function, $m$ the frame index and $D$ the hop size.

As seen in figure 2.1 the STFT can be simply described as windowing and taking

the FFT of the signal. There are various window types available in the program

Figure 2.1 Short time Fourier transform and Spectral Peak Detection

with different side-lobe and main lobe characteristics. The Hamming window has

been shown to work particularly well with musical signals (De Poli, Piccialli and

Roads 1991). See the appendix for details regarding windowing and its side-lobe and main-lobe characteristics.
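To make the procedure concrete, the following is a minimal Java sketch of the STFT of equation 2.3 combined with a Hamming window. It uses a direct DFT for brevity where the actual software would use an FFT, and the class and method names (FeatureSketches, stft and so on, here and in the sketches that follow) are hypothetical, not the thesis code.

```java
class FeatureSketches {
    /** Hamming window of length n. */
    static double[] hamming(int n) {
        double[] w = new double[n];
        for (int i = 0; i < n; i++)
            w[i] = 0.54 - 0.46 * Math.cos(2 * Math.PI * i / (n - 1));
        return w;
    }

    /** STFT magnitudes (equation 2.3): one row per frame, bins DC..Nyquist; hop = D. */
    static double[][] stft(double[] x, int n, int hop) {
        int frames = (x.length - n) / hop + 1;
        double[] w = hamming(n);
        double[][] mag = new double[frames][n / 2 + 1];
        for (int m = 0; m < frames; m++) {
            for (int k = 0; k <= n / 2; k++) {
                double re = 0, im = 0;
                for (int i = 0; i < n; i++) {
                    double s = w[i] * x[m * hop + i];          // windowed sample
                    re += s * Math.cos(2 * Math.PI * k * i / n);
                    im -= s * Math.sin(2 * Math.PI * k * i / n);
                }
                mag[m][k] = Math.sqrt(re * re + im * im);      // magnitude spectrum
            }
        }
        return mag;
    }
}
```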

2.2.2 Spectral Peak Detection and Tracking

Pitched musical instruments display a high degree of harmonic spectral quality

when analyzed for frequency content. Most tend to have quasi-integer harmonic

relationships between spectral peaks and the fundamental frequency. In the voice,

the spectral envelope displays mountain-like contours or valleys known as

formants. The locations of the formants distinctively describe vowels. This is

also evident in violins, but the number of valleys is greater and the formant

locations change very little with time unlike the voice, which varies substantially

for each vowel. Woodwinds such as the bassoon and oboe on the other hand

have fewer formants than the voice, but tend to have stronger and clearer

spectral contours that perceptually characterize the woodwind family (Cook

1999). Generally, musical instruments like the plucked string (figure 2.2) exhibit

lower energy in the high frequency bins. The higher partials normally have less

energy and also die out faster than lower ones over time.

Figure 2.2 Plucked string spectrum

Using the short time Fourier transform, I have implemented a spectral peak

detection and tracking method, extracting quasi-integer related harmonics from

the spectrum. The peak picking algorithm takes into consideration magnitude

and frequency information to select the most prominent and harmonically

behaving peaks. To help in the search for spectral peaks, various threshold

values are used as described below.

The spectral peak detection algorithm is divided into four main steps. The first

pass roughly locates possible peaks, where the roughness factor for searching

peaks is controlled via a threshold value. The threshold value basically dictates

the degree of "peakiness" that is allowed for a local maximum to be considered a

possible peak. The second pass filters out peaks that may have been

erroneously selected in step 1. The third pass looks for any broken harmonic

sequence, analyzing harmonic relationships of the currently selected peaks. In

this pass, peaks that may have been deleted or missed in the previous two

passes are inserted. The final pass looks at the selected peaks and further does

a harmonic analysis ultimately leaving a set of peaks that are most probably

harmonics. A mean and scalable standard deviation error method is applied for

control of inharmonicity.

Figure 2.3 Peak detection algorithm

2.2.2.1 Step 1: Rough Peak Detection

In the rough peak detection algorithm possible peaks are picked using negative

and positive slope threshold values to guide in the selection process. As shown

in figure 2.4 the polarity of the slope of the spectrum is computed from bin to bin

(DC to Nyquist) using the basic assumption that a transition from positive to

negative slope calls for the possibility of a peak.

Figure 2.4 Rough search for peaks

The following conditions help in the selection of a peak:

1. The slope must change polarity, positive to negative.

2. The magnitude difference between the peak candidate and the current bin's

magnitude component (X[k]-X[k+4]) must be greater than a threshold value -

see example (figure 2.5).

3. A new peak candidate search occurs only after there is a slope change from

negative to positive and when a threshold value as shown in figure 2.6 is

exceeded.

Refer to flowcharts in the appendix for details.
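As an illustration, here is a simplified Java sketch of this first pass (continuing the hypothetical FeatureSketches class above). It folds conditions 1-3 into a single scan; threshold is a stand-in for the threshold values described above, and the bin offset of 4 mirrors the X[k]-X[k+4] test of figure 2.5.

```java
/** Rough peak detection over one magnitude spectrum; a simplified sketch. */
static java.util.List<Integer> roughPeaks(double[] mag, double threshold) {
    java.util.List<Integer> peaks = new java.util.ArrayList<>();
    int candidate = -1;
    for (int k = 1; k < mag.length - 1; k++) {
        // condition 1: slope changes polarity from positive to negative
        if (mag[k] > mag[k - 1] && mag[k] >= mag[k + 1]) candidate = k;
        // condition 2: the candidate must drop by more than the threshold
        if (candidate >= 0 && k >= candidate + 4
                && mag[candidate] - mag[k] > threshold) {
            peaks.add(candidate);
            candidate = -1;   // condition 3 (simplified): restart the candidate search
        }
    }
    return peaks;
}
```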
Figure 2.5 Actual peak assessment

Figure 2.6 Transitional peaks (noise)

2.2.2.2 Step 2: Prominent Peak Search

In step 2, prominent peaks are located from a set of potential peaks found in

step 1. The purpose is to filter out local peaks which may be present between

stronger partial candidates as shown in figure 2.7. The search for prominent

peaks is done in the following way:

1. The bin with the maximum magnitude is found.

2. Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards DC.

3. Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards the Nyquist frequency.

Figure 2.7 Prominent peak search

Local maxima or peaks are picked out using an adaptive threshold value that is reflective of a prominent peak (possible partial) and its neighboring peaks, as shown in figure 2.7. For example, a 50% threshold value requires neighboring peaks to be greater than at least half the magnitude of the prominent peak (possible partial). Refer to the appendix for details on the algorithm.
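A sketch of this step for the hypothetical FeatureSketches class: starting from the strongest rough peak, the scan moves towards DC and then towards the Nyquist frequency, keeping only peaks that exceed an adaptive fraction of the last accepted peak (ratio would be 0.5 for the 50% example above). The adaptive-update rule is a simplification of the flowchart in figure A.2.

```java
/** Prominent peaks among rough peaks (bin indices in ascending order). */
static java.util.List<Integer> prominentPeaks(double[] mag,
        java.util.List<Integer> roughPeaks, double ratio) {
    int maxPos = 0;                              // index of the strongest rough peak
    for (int i = 1; i < roughPeaks.size(); i++)
        if (mag[roughPeaks.get(i)] > mag[roughPeaks.get(maxPos)]) maxPos = i;
    java.util.List<Integer> out = new java.util.ArrayList<>();
    for (int dir : new int[] {-1, 1}) {          // towards DC, then towards Nyquist
        double tempMax = mag[roughPeaks.get(maxPos)];
        for (int i = maxPos + (dir < 0 ? 0 : 1);
                i >= 0 && i < roughPeaks.size(); i += dir) {
            int bin = roughPeaks.get(i);
            if (mag[bin] > tempMax * ratio) {    // exceeds the adaptive threshold level
                out.add(bin);
                tempMax = mag[bin];              // threshold adapts to the new peak
            }
        }
    }
    out.sort(null);
    return out;
}
```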

2.2.2.3 Step 3: Harmonic Break Search

The third step is called the harmonic break search. Here, I have tried to analyze

if some "potential partials" were deleted or missed in the previous steps. This

may occur when potentially harmonically related peaks temporarily have little

energy or are simply much weaker than the stronger ones, but are nevertheless

harmonic. The harmonic break search is divided into the following sub-routines:

1. Analyze the harmonic relationship between the current partial candidates by computing the mean bin spacing between all prominent peaks:

$$\Delta_F = \frac{1}{N-1} \sum_{k=0}^{N-2} \left( F[k+1] - F[k] \right) \qquad (2.4)$$

2. Detect any harmonic breaks, or discontinuities, between prominent peaks.

3. If discontinuities are found, go back to steps 1 and 2 and do a refined search for possible peaks between pairs of prominent peaks.

Figure 2.8 Harmonic break search

In the harmonic break search's second step, harmonic discontinuities are

detected using a pair of threshold values limiting the range of harmonic

deviation. Hence, the algorithm expects the possibility of a peak within the

threshold bounds computed in sub-step 2 (figure 2.8). Refer to the appendix for more details on the algorithm.
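A minimal sketch of sub-steps 1 and 2, assuming bins[] holds at least two prominent-peak bin indices in ascending order; regions flagged here would then be re-searched with the rough peak detector of step 1.

```java
/** Gaps between prominent peaks exceeding the tolerated deviation from the mean spacing. */
static java.util.List<int[]> harmonicBreaks(int[] bins, double thresh) {
    double mean = 0;
    for (int k = 0; k < bins.length - 1; k++) mean += bins[k + 1] - bins[k];
    mean /= (bins.length - 1);                             // mean bin spacing, equation 2.4
    java.util.List<int[]> breaks = new java.util.ArrayList<>();
    for (int k = 0; k < bins.length - 1; k++)
        if (bins[k + 1] - bins[k] > (1 + thresh) * mean)   // harmonic discontinuity
            breaks.add(new int[] {bins[k], bins[k + 1]});  // region for a refined search
    return breaks;
}
```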

2.2.2.4 Step 4: Harmonicity Analysis

Finally in step 4 an overall harmonicity verification is performed. In this last step,

the first few peaks (selectable in software) are used as a guide to determine the

final set of partials. The reason for choosing the first few peaks of the spectrum
is due to the fact that in highly pitch salient signals, lower harmonics usually are

stronger and more stable than higher ones.

The idea is to use the Gaussian normal distribution function, employing mean,

variance and standard deviation for eliminating inharmonic or misbehaving

partials. A peak that is outside a right and left threshold bound is considered

inharmonic and misbehaving. A mean bin spacing value denoting the bin

distances between neighboring peak candidates is computed to render the

variance and standard deviation. As the lower partials generally tend to be more

stable and have more energy, the first K (K: integer > 0) peaks are used for the

computation of the standard deviation. A scaled version of the standard

deviation is then used as a criterion for evaluating inharmonicity of each partial

candidate. The scaled standard deviation is increased or decreased to control

the permitted spread of each peak. In other words, the scaled standard

deviation directly controls the amount of inharmonicity tolerated when selecting

the final set of peaks. The scalar that controls the scaled standard deviation is a

value between 0 and 1, where 1 is equivalent to limiting the peaks to the original

un-scaled standard deviation. This method is implemented by computing an

ideal sequence of harmonics using the above acquired data. Hence the ideal

harmonic series is a sequence of partials as shown below.

$$bin_{ideal}[0],\ bin_{ideal}[1],\ \ldots,\ bin_{ideal}[m] \qquad (2.5)$$

where $bin_{ideal}[0]$ = mean of the first $K$ peaks, $K$ an integer $> 0$

The ideal set of harmonics and the actual set of harmonics are compared and

the error (equation 2.6) for each peak is computed and verified against the

scaled standard deviation for final assessment. Peaks that have excessive error

values are deleted from the final set of peaks and the remaining ones are finally

considered harmonics. See the appendix for more details on the algorithm.

$$\epsilon = bin_{ideal}[k] - bin_{actual}[l], \qquad k = 0 \ldots M,\ l = 0 \ldots N \qquad (2.6)$$

Equation 2.6 shows the error between the ideal and actual bins where M is the

number of ideal peaks and N is the number of actual peaks in the spectrum. M

and N have different values as missing partials may exist in the actual set of

peaks.
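The following sketch condenses this step, assuming bins[] is the candidate peak array with more than K entries; the ideal harmonic for each peak is taken as the nearest integer multiple of the mean spacing of the first K gaps, and peaks whose error (equation 2.6) exceeds the scaled standard deviation are dropped.

```java
/** Keep only peaks that behave harmonically; a simplified sketch of step 4. */
static java.util.List<Integer> harmonicityFilter(int[] bins, int K, double scale) {
    double mean = 0, var = 0;
    for (int k = 0; k < K; k++) mean += bins[k + 1] - bins[k];
    mean /= K;                                       // mean spacing of first K gaps
    for (int k = 0; k < K; k++) {
        double d = (bins[k + 1] - bins[k]) - mean;
        var += d * d / K;
    }
    double scaledStdDev = scale * Math.sqrt(var);    // scale in (0, 1]
    java.util.List<Integer> harmonics = new java.util.ArrayList<>();
    for (int bin : bins) {
        long h = Math.max(1, Math.round(bin / mean));   // nearest ideal harmonic index
        double error = Math.abs(bin - h * mean);        // error against ideal bin (eq. 2.6)
        if (error <= scaledStdDev) harmonics.add(bin);  // otherwise deemed inharmonic
    }
    return harmonics;
}
```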

2.2.2.5 Partial Tracking between Frames

Once harmonics have been evaluated in each frame (a frame is equal to the

length of the FFT), they are combined to render a spectrogram. Frame to frame

partial movement is determined using a harmonic continuity criterion as shown in

figure 2.9.

Figure 2.9 Partial tracking between frames

The harmonic continuity criterion is explained as follows: Each harmonic in a

frame is allowed to sway in frequency within a set of error margin values. Hence,

as shown in figure 2.9, four of the harmonics make a continuous harmonic path

(k, k+1, k+2, k+3). However, the harmonic in frame k+4 exceeds the allowed

error margin and breaks the previous harmonic path. At frame k+4 a new path is

created and the path which started at frame k is discontinued. The harmonic

continuity criterion is helpful in observing movements of the harmonics over time

and frequency.
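A one-method sketch of the continuity criterion: a harmonic from the previous frame continues only if some harmonic in the current frame lies within the error margin.

```java
/** Returns the bin of the continuing harmonic in the current frame, or -1 if the path breaks. */
static int matchPartial(int prevBin, int[] currentBins, int margin) {
    for (int bin : currentBins)
        if (Math.abs(bin - prevBin) <= margin) return bin;   // within the error margin
    return -1;                                               // a new path must begin
}
```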

2.2.3 Spectral Centroid

The spectral centroid (Beauchamp 1982) is commonly associated with the

measure of the brightness of a sound (Grey and Gordon 1978). This measure is

obtained by evaluating the "center of gravity" using the Fourier transform's

frequency and magnitude information (Equation 2.7). Generally speaking, it has

been found that increased loudness also increases the amount of high-frequency content in a signal, thus making the sound brighter.


$$sc = \frac{\sum_{k=1}^{N-1} k\, X[k]}{\sum_{k=1}^{N-1} X[k]} \qquad (2.7)$$

X[k] is the magnitude corresponding to bin k and N is the length of the DFT. This

measure has also been used in MDS (multidimensional scaling) based systems.
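Equation 2.7 translates almost directly into code; the following sketch returns the centroid in bins for one frame.

```java
/** Spectral centroid of one frame (equation 2.7), in bins. */
static double spectralCentroid(double[] mag) {
    double num = 0, den = 0;
    for (int k = 1; k < mag.length; k++) {
        num += k * mag[k];   // frequency-weighted magnitude
        den += mag[k];
    }
    return den > 0 ? num / den : 0;
}
```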

Figure 2.10 shows examples of the spectral centroid for the french horn and the electric bass guitar.

Figure 2.10 Spectral centroid of french horn and electric bass at 44.1 kHz

2.2.4 Spectral Smoothness

The spectral smoothness (McAdams 1999) measures the smoothness of the

frame to frame spectral envelope obtained via the short time Fourier transform.

The algorithm basically takes the average of adjacent amplitudes of the spectral

bins and compares them to the current amplitude at bin k as shown in equation

2.8.
$$ss = \sum_{k=1}^{N-1} \left| 20\log X[k] - \frac{20\log X[k-1] + 20\log X[k] + 20\log X[k+1]}{3} \right| \qquad (2.8)$$
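A direct sketch of equation 2.8 for one frame; the small offset added before taking logarithms is an assumption to avoid log(0) on empty bins.

```java
/** Spectral smoothness of one frame (equation 2.8). */
static double spectralSmoothness(double[] mag) {
    double ss = 0;
    for (int k = 1; k < mag.length - 1; k++) {
        double a = 20 * Math.log10(mag[k - 1] + 1e-12);  // offset avoids log(0)
        double b = 20 * Math.log10(mag[k] + 1e-12);
        double c = 20 * Math.log10(mag[k + 1] + 1e-12);
        ss += Math.abs(b - (a + b + c) / 3);             // deviation from the local mean
    }
    return ss;
}
```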

Figure 2.11 shows examples of the spectral smoothness for white noise, sine

wave and electric bass sampled at 44,100 Hz.

Figure 2.11 White noise, sine wave and electric bass spectral smoothness

2.3 Time Domain Analysis

In this section I will discuss the time domain signal processing modules

implemented in the software system.

2.3.1 Noise Content Analysis: Linear Prediction

I have used linear prediction as the basis for extracting the degree of "noisiness"

of a signal. The motivation behind using the LPC method for musical signals lies

in its robust performance in modeling the voice. In the "LPC vocal tract model",

the resultant acoustical signal is represented via a noise signal and a sequence

of pulses passed through a resonant all-pole filter, shaping the spectral envelope

of the voice as shown in figure 2.12. In essence, the linear prediction filter

coefficients are used to predict the current sample with a finite number of

weighted past samples. Figure 2.13 shows examples of noise content analysis

for the flute and electric bass.

Figure 2.12 Vocal tract model

Figure 2.13 Noise content analysis of flute and electric bass


The linear prediction model is simply defined as in equation 2.9. It assumes that

the current sample may be represented with past samples weighted

"appropriately" (Atal and Hanauer 1971).


$$\hat{s}[k] = \sum_{i=1}^{N} a_i\, s[k-i] \qquad (2.9)$$

The coefficients in the difference equation are selected so that the error

between the current sample s[k] and the predicted sample from equation 2.9 is

minimized as shown in equation 2.10 using the least square method.


$$\epsilon = \frac{1}{M} \sum_{k=1}^{N} \{ s[k] - \hat{s}[k] \}^2 \qquad (2.10)$$

The noise content analysis algorithm is shown in figure 2.14. Before submitting

the signal to the short term prediction filter block, a pre-emphasis filter (equation

2.11) is used to flatten the spectrum for enhanced performance. The pre-

emphasis filter (high pass filter) coefficients range from 0.95 to 0.98. The

residual signal ds[n] is passed through a "spike damping filter" which damps

spikes that are present in ds[n] and ultimately renders the noise content of the

signal (figure 2.15).

$$y[n] = x[n] - a\,x[n-1] \qquad (2.11)$$
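The following sketch shows the pre-emphasis stage (equation 2.11) and the residual computation ds[n] of figure 2.14, assuming the LPC coefficients a[] have already been solved for (for instance via the least square method of equation 2.10, not shown here).

```java
/** Pre-emphasis high-pass filter of equation 2.11; a typically in [0.95, 0.98]. */
static double[] preEmphasis(double[] x, double a) {
    double[] y = new double[x.length];
    y[0] = x[0];
    for (int n = 1; n < x.length; n++) y[n] = x[n] - a * x[n - 1];
    return y;
}

/** Prediction residual ds[n] of figure 2.14, given solved LPC coefficients a[]. */
static double[] residual(double[] s, double[] a) {
    double[] ds = new double[s.length];
    for (int n = 0; n < s.length; n++) {
        double pred = 0;
        for (int i = 1; i <= a.length && n - i >= 0; i++)
            pred += a[i - 1] * s[n - i];   // weighted past samples (equation 2.9)
        ds[n] = s[n] - pred;               // prediction error
    }
    return ds;
}
```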

Figure 2.14 Noise content analysis

Figure 2.15 Noise content analysis of electric bass

2.3.2 Pitch Detection

The pitch detection algorithm uses autocorrelation, natural cubic spline

interpolation and period averaging to accurately compute the pitch of the signal.

The range of operation is from approximately 26 Hz to 5000 Hz (A0 = 27.50 Hz,


C8 = 4186 Hz). Figure 2.16 shows the basic procedure for computing pitch. As

seen in figure 2.17, the error for the period-averaging method is smallest

compared to the autocorrelation method without interpolation and the FFT. The

period averaging method is discussed in section 2.3.2.4.

Figure 2.16 Pitch computation

Figure 2.17 Error plot: interp., interp. with period averaging and DFT

2.3.2.1 Autocorrelation

Autocorrelation is a standard way of determining signal periodicity and is defined

as:

$$acf_{xx}(\tau) = \int_{-\infty}^{\infty} x(t)\, x(t+\tau)\, dt \qquad (2.12)$$

The discrete time equivalent is:


$$acf_{xx}[\ell] = \sum_{n=0}^{N-1-\ell} x[n]\, x[n+\ell], \qquad 0 \le \ell \qquad (2.13)$$
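A direct rendering of equation 2.13, computing autocorrelation values for lags 0 through maxLag:

```java
/** Autocorrelation of equation 2.13 for lags 0..maxLag. */
static double[] autocorrelation(double[] x, int maxLag) {
    double[] acf = new double[maxLag + 1];
    for (int lag = 0; lag <= maxLag; lag++)
        for (int n = 0; n + lag < x.length; n++)
            acf[lag] += x[n] * x[n + lag];
    return acf;
}
```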

A typical autocorrelation vector with increasing integer lag values is shown in

figure 2.18. In the current implementation zero crossings of the autocorrelation

signal are determined. More precisely, a local maximum bounded by a pair of zero crossings is considered a peak if it satisfies specific magnitude threshold values.

Figure 2.18 Autocorrelation signal, sine wave at 100 Hz

Comparing figure 2.18 and figure 2.19 it is clear that the time resolution for

higher frequencies in the autocorrelation vector decreases substantially, causing

greater error. In other words, the samples that are present between the

autocorrelation peaks (period of the signal) decrease with the increase in

frequency (approximately 440 samples vs. 50 samples, figure 2.18 and figure

2.19). One way to improve performance is to use interpolation.

Figure 2.19 Autocorrelation signal, sine wave at 1010 Hz

2.3.2.2 Detection of Periods

Periods in the autocorrelation signal are detected through peaks that correspond

to the frequency of the audio signal (figure 2.20). Peaks are extracted using two

zero crossing pairs for each peak. These pairs define the range where a peak
that corresponds to the period could actually be found. The fact that the

autocorrelation vector's magnitude decreases with the increase of its lag is

exploited in determining if a peak is really a period or just a local peak

corresponding to strong harmonics. The first period value is used as the basis to

look for and compute consecutive peaks in the autocorrelation vector. Hence, an

error margin dictated by the first period found is used to guide in searching the

remaining peaks. All peaks that are considered periods are subjected to

interpolation.
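A sketch of the zero-crossing based peak search on the autocorrelation vector; the magnitude gate magThresh stands in for the threshold values mentioned above, and the refinement against the first period found is omitted for brevity.

```java
/** Peaks of the autocorrelation bounded by zero-crossing pairs; period candidates. */
static java.util.List<Integer> periodPeaks(double[] acf, double magThresh) {
    java.util.List<Integer> peaks = new java.util.ArrayList<>();
    int rise = -1;
    for (int n = 1; n < acf.length; n++) {
        if (acf[n - 1] <= 0 && acf[n] > 0) rise = n;        // upward zero crossing
        if (rise >= 0 && acf[n - 1] > 0 && acf[n] <= 0) {   // downward zero crossing
            int best = rise;                                // maximum between the pair
            for (int k = rise; k < n; k++) if (acf[k] > acf[best]) best = k;
            if (acf[best] > magThresh) peaks.add(best);     // magnitude threshold gate
            rise = -1;
        }
    }
    return peaks;
}
```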
Figure 2.20 Peak detection through zero crossing and interpolation

2.3.2.3 Natural Cubic Spline Interpolation

The natural cubic spline interpolation method is used in pinpointing the "actual"

peak and hence its period, in the autocorrelation function. The basic idea behind
the natural cubic spline method is shown in figure 2.21. Each "curvature"

connecting the knots is represented by a cubic polynomial equation denoted by $S_i(x)$.
Figure 2.21 Natural cubic spline

This essentially is a problem of solving each polynomial bounded by knots for its

roots (Cheney and Kincaid 1994).


$$S_i(x) = a_i x^3 + b_i x^2 + c_i x + d_i \qquad (2.14)$$
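For illustration, the following sketch solves the natural cubic spline in the manner of Cheney and Kincaid (a tridiagonal system for the second derivatives z[i], with z at the end knots set to zero) and evaluates S(x); it assumes at least three strictly increasing knots t[] with values y[].

```java
/** Natural cubic spline: solve for second derivatives, then evaluate S(x). */
static double splineEval(double[] t, double[] y, double x) {
    int n = t.length;
    double[] h = new double[n - 1], z = new double[n];
    double[] u = new double[n], v = new double[n];
    for (int i = 0; i < n - 1; i++) h[i] = t[i + 1] - t[i];
    // forward elimination of the tridiagonal system (natural: z[0] = z[n-1] = 0)
    u[1] = 2 * (h[0] + h[1]);
    v[1] = 6 * ((y[2] - y[1]) / h[1] - (y[1] - y[0]) / h[0]);
    for (int i = 2; i < n - 1; i++) {
        u[i] = 2 * (h[i - 1] + h[i]) - h[i - 1] * h[i - 1] / u[i - 1];
        v[i] = 6 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
             - h[i - 1] * v[i - 1] / u[i - 1];
    }
    for (int i = n - 2; i >= 1; i--) z[i] = (v[i] - h[i] * z[i + 1]) / u[i];
    int i = n - 2;                         // locate the interval containing x
    while (i > 0 && x < t[i]) i--;
    double a = (z[i + 1] - z[i]) / (6 * h[i]);     // cubic coefficient
    double b = z[i] / 2;                           // quadratic coefficient
    double c = (y[i + 1] - y[i]) / h[i] - h[i] * (2 * z[i] + z[i + 1]) / 6;
    double dx = x - t[i];
    return y[i] + dx * (c + dx * (b + dx * a));    // nested evaluation of S_i(x)
}
```

A peak of the autocorrelation can then be pinpointed by evaluating the spline densely around the integer-lag maximum.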

2.3.2.4 Period Averaging

As mentioned before, an increase in pitch decreases the period length. Since

the number of samples per period in the autocorrelation vector decreases accordingly, interpolation can

only help so much. To get better performance both for low frequency and high

frequency pitch detection, I have developed a period averaging method which

simply uses M number of periods found in the autocorrelation vector to compute

the mean after interpolation. In a given frame, if we find M peaks/periods, the

respective period can be represented as:

$$T_m = \{T_0, T_1, T_2, T_3, \ldots, T_M\}, \qquad 0 \le m \le M \qquad (2.15)$$

The number of autocorrelation peaks may vary from frame to frame. The

maximum number of periods for averaging is controlled by variable M. The

average period then becomes:


$$\bar{T} = \frac{1}{M_f} \sum_{m=0}^{M_f - 1} T_m \qquad (2.16)$$

where $M_f$ is the maximum number of peaks found in frame f. As seen in

figure 2.17, this decreases the error margin considerably.
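The averaging itself is a one-liner once the interpolated periods of a frame are collected; a sketch:

```java
/** Mean period of one frame (equation 2.16) converted to pitch in Hz. */
static double pitchHz(double[] periods, double sampleRate) {
    double sum = 0;
    for (double t : periods) sum += t;           // interpolated period lags, in samples
    return sampleRate / (sum / periods.length);  // mean period -> frequency
}
```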

Figure 2.22 Peak averaging (with number of peaks)

2.3.3 Amplitude Envelope

The amplitude envelope describes the energy change of the signal in the time

domain and is generally equivalent to the so called ADSR (attack, decay,


sustain, release) of a musical sound.

Figure 2.23 Electric bass envelope

The envelope of the signal is computed with a frame-by-frame RMS (root mean square) computation and a 3rd-order Butterworth low-pass filter.

RMS (equation 2.17) is related to the average power of a signal and differs fundamentally from the average or peak level. The average changes very little even if the signal contains numerous transient peaks, while the peak level can vary greatly in a small amount of time without much affecting the average value. RMS is a more perceptually relevant measurement and has been shown to correspond more closely to the way we hear loudness.


$$rms = \sqrt{\frac{1}{N} \sum_{k=0}^{N-1} x[k]^2} \qquad (2.17)$$

The frame-by-frame RMS is applied in a manner quite similar to the short time Fourier transform. The length of the RMS frame determines the time resolution of the envelope: a large frame length yields less transient information and a small frame length more. The window length M (equation 2.18) divides the total signal length N evenly (N = Mp), with p a positive integer.


$$rms_{frame} = \sqrt{\frac{1}{M} \sum_{k=L}^{L+M-1} x[k]^2}, \qquad \text{where } Mp = N \qquad (2.18)$$

Figure 2.24 Amplitude envelope

The window size is selectable in the software, a longer window resulting in less transitional information and a shorter window in more. The

cutoff frequencies have been determined empirically at 350 Hz (fs = 8000), 1200

Hz (fs = 22050) and 1700 Hz (fs = 44100).
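A compact sketch of the envelope computation (equation 2.18); as an assumption, the 3rd-order Butterworth stage is replaced by a simple one-pole smoother to keep the example self-contained.

```java
/** RMS envelope (equation 2.18), smoothed; 'smooth' in [0,1) stands in for the LPF. */
static double[] rmsEnvelope(double[] x, int frameLen, double smooth) {
    int frames = x.length / frameLen;
    double[] env = new double[frames];
    for (int f = 0; f < frames; f++) {
        double sum = 0;
        for (int k = f * frameLen; k < (f + 1) * frameLen; k++) sum += x[k] * x[k];
        env[f] = Math.sqrt(sum / frameLen);        // frame-by-frame RMS
    }
    for (int f = 1; f < frames; f++)               // crude one-pole smoothing
        env[f] = smooth * env[f - 1] + (1 - smooth) * env[f];
    return env;
}
```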

2.3.4 Amplitude Modulation

Detecting amplitude modulation is similar to the amplitude envelope detection

algorithm with a few steps added. Figure 2.25 shows a summary of amplitude

modulation analysis. The steady state portion of the signal is extracted and

analyzed for peaks which correspond to the frequency of amplitude modulation.

For accurate location of peaks, the cubic spline interpolation method is again

used.

Figure 2.25 Amplitude modulation analysis

Amplitude modulation is frequently observed in musical instruments such as the

violin, flute and saxophone (figure 2.26). The frequency in Hertz is computed

using the following formula:

$$frequency = \frac{f_s}{w\,T} \qquad (2.19)$$

where fs is the sampling rate, w the RMS frame length and T the period in

samples of the RMS signal.
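Equation 2.19 in code, assuming the modulation period T has been measured in RMS frames:

```java
/** Amplitude modulation frequency in Hz (equation 2.19). */
static double amFrequencyHz(double sampleRate, int rmsFrameLen, double periodInFrames) {
    return sampleRate / (rmsFrameLen * periodInFrames);
}
```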

Figure 2.26 Amplitude modulation, alto saxophone

2.3.5 Attack Time

Attack time (Saldanha and Corso 1964; Elliot 1975) is an important feature of

timbre. It is defined as the time it takes to reach the maximum amplitude of a

signal from a minimum threshold magnitude (McAdams 1999). The minimum

threshold value is necessary as it acts as a gating function, only starting

measurement of the attack time when this threshold level is exceeded. Although

the attack portion embodies a great deal of transitional information of the signal

leading to a steady state, it is difficult to say where the attack portion ends and

where the steady state begins. As a matter of fact, it is even difficult to say how

much information the attack portion actually represents and no concrete

measurement technique has been published to date.

$$attack\ time = t_{xMax} - t_{xThresh} \qquad (2.20)$$
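A sketch of equation 2.20 on the RMS envelope, with the gating threshold described above; the result is in envelope frames and would be scaled by the frame duration to obtain seconds.

```java
/** Attack time in envelope frames (equation 2.20), gated at thresh. */
static double attackTimeFrames(double[] env, double thresh) {
    int start = -1, maxAt = 0;
    for (int f = 0; f < env.length; f++) {
        if (start < 0 && env[f] >= thresh) start = f;  // gate opens, measurement begins
        if (env[f] > env[maxAt]) maxAt = f;            // envelope maximum
    }
    return (start < 0 || maxAt < start) ? 0 : maxAt - start;
}
```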

However, this attribute of timbre has been indirectly applied successfully in

wavetable synthesis. The basic idea is to take an auditory snapshot of the signal

- the attack and first few milliseconds of the steady state portion, then loop the

steady state portion. Hence, this gives the listener the illusion that the whole

signal is being played back, although only a fractional length of the signal has

actually been used to render such an illusion. Today's popular music genres and

"electronic" jazz music are very much dominated by this technology. However,

many contemporary composers (Appleton 1991) also have used this technology

mainly via alternative MIDI controllers such as the Radio Baton.

Chapter 3 Software Implementation

3.1 Introduction

The various signal processing modules discussed in chapter 2 were assembled

and implemented in the Java programming environment (the software is

available at http://eamusic.dartmouth.edu/~taehong). The system uses various

GUI (Graphical User Interface) capabilities for easy visualization of extracted

features. The following sections describe the software and the motive behind

using Java.

3.2 Why Java?

The Java programming environment was used for the following reasons.

1. Platform independence

2. Non-real time requirement for this thesis

3. Good GUI design capabilities

4. Java Sound

5. Syntax similarity to C/C++

Java was designed to be architecture-neutral, which can be achieved if pure Java code is written. For this reason, the program I have developed has no native coding methods (system-dependent code) and runs on

pure Java alone. Although native coding methods improve efficiency, this

program does not require real-time processing or time-critical computation. One

of the goals of this thesis was to write an intuitive GUI (graphical user interface)

- easy control of parameters, coefficients and visual representation of extracted

features. This was done with the Java Swing GUI environment. Another reason

for choosing Java lies in sound playback and recording. The ability to play and

record sounds, without developing interrupt service routines and memory

accessing procedures pertinent to each platform was a major plus. Finally, the

syntax similarity of Java to C/C++ greatly facilitated the move from C/C++ to

Java without having to learn a completely new "language".

3.3 Main Software Structure

The software's main modules are the GUI, command center and sound object.

The GUI is responsible for responding to requests performed by the user via button clicks, data entry, menu selections and so on. The command center takes the job of directing the commands requested by the user and notifying the appropriate sound object. The sound object, which is "instantiated" whenever a new sound file is loaded into memory, supervises and keeps track of all its child frames (internal frames, each corresponding to a DSP process) using linked lists, handling updates and avoiding unnecessary re-computation. Hence, unused objects and commands are removed from or added to linked lists, making command processing efficient both in memory management and data management.

Figure 3.1 Main software architecture

Figure 3.2 Snapshot of software

3.4 Software Features

This section lists features of the program.

File IO Features

File Open: reads aiff, wav, au
File Save: writes aiff, wav, au, raw data (float)

DSP Command Features

DC Offset Removal: removes the DC component
AM Analysis: amplitude modulation analysis
Attack Time: attack time computation
Amplitude Envelope: amplitude envelope rendering
FFT: discrete Fourier transform
Spectrogram: display of peaks vs. frame/time
Pitch Detection: detection of pitch (vs. time)
Pitch Modulation: detection of low freq. pitch modulation
Spectral Centroid: spectral centroid computation (vs. time)
Spectral Smoothness: spectral smoothness computation (vs. time)
Noise Content Analysis: noise content analysis (forward LPC, inverse LPC)

Sound Features

Play: plays loaded waveform and residue signal if selected
Stop: stops play and record
Rec: records signal from mic/line
Pause: pauses record and play

Plot Commands

Zoom: in/out
Scroll: x, y
Quick View: "summarized" view of signal, improves screen update latency
Log View/Linear View: log view of magnitude component in FFT

Table 3.1 Software features

Chapter 4 Conclusion and Further Work

I began this thesis with the intent to gain further understanding about timbre.

Through implementation and development of signal processing algorithms I have

conceived a software system that may be used by musicians and researchers in

the analysis of audio signals. While realizing these algorithms with the aim of

making the system a usable software tool, I have gained insight and knowledge

pertinent to timbre and signal processing techniques for extraction of timbral

features. The software is capable of robustly extracting features such as

accurate pitch, harmonic movement, amplitude modulation and all the algorithms

explained in chapter 2. This software by no means encompasses every feature

extraction algorithm available nor does it perform without flaw for all audio

signals. Rather, while limiting its scope down to musical instrument signals such

as the voice, horns, stringed instruments and synthesized sounds, it serves as a

continuously evolving project ultimately leading to a system for recognition of

sound sources.

The next step is to continue investigating in detail existing timbre recognition

systems and models, determine weaknesses and strengths, develop new

algorithms and apply known algorithms that may help in the recognition process.

A number of such methods are neural networks, MDS, CASA systems, gestalt

theory based approaches and pattern recognition techniques developed by the

machine vision community. It may be that some techniques work better in

certain situations and other methods in different situations. Perhaps a

combination of methods will render a robust recognition system. A myriad of

questions will arise; nonetheless, it will be exciting and interesting to participate in and observe advances in the discovery of the mysteries of perception.

Appendix

A.1 Windowing

To understand why windowing is used in the short time Fourier transform, I will show

the spectral characteristics when a signal is windowed by a simple rectangular

window. Let us consider the case where x[n] is the signal and w[n] is the

windowing function.

$$x[n] = \begin{cases} a^n, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (a.1)$$

$$w[n] = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases} \qquad (a.2)$$

then

$$x_w[n] = x[n]\, w[n] \qquad (a.3)$$

However, we know that a multiplication in the time domain corresponds to a

convolution in the frequency domain or more precisely:

$$X_w(f) = \frac{1}{2\pi} \{ X(f) * W(f) \} \qquad (a.4)$$

We also observe that for a rectangular window the Fourier transform is


$$W(\omega) = \sum_{n=0}^{N-1} e^{-j\omega n} = \frac{1 - e^{-j\omega N}}{1 - e^{-j\omega}} = \frac{\sin(0.5\,\omega N)}{\sin(0.5\,\omega)}\, e^{-j 0.5\,\omega (N-1)} \qquad (a.5)$$
Hence, we see that the convolution involves the Dirichlet kernel. The Dirichlet kernel causes distortions characterized by the main-lobe and side-lobe widths. Each sample in the time domain will render a sinc function in the frequency domain, causing side effects in the form of spectral smearing due to the finite main-lobe width and side-lobe interference produced by neighboring samples in the signal.

As described above, the choice of windowing functions plays a vital role in short

time Fourier analysis. The main idea in selecting the windowing function is to

taper off the abrupt end points of the rectangular window achieving gradual

transition. This results in reduced side-lobe magnitudes at the expense of a wider main lobe. The behavior of windowing functions can be found in many signal

processing textbooks (Porat 1997).

Rectangular window

The main-lobe width is 4π/N with 13 dB side-lobe attenuation, where N is the number of samples.

Hann window

The Hann window, also known as the Hanning window, achieves a side-lobe reduction by superposition. Three Dirichlet kernels are shifted and added together, resulting in partial cancellation of the side-lobes. The amount of shift is 2π/(N-1) from the center. The resulting Hanning window, which is sometimes called the cosine window, has a side-lobe attenuation of 32 dB and a main-lobe width of 8π/N, where N is the number of samples.

Hamming window

The Hamming window is similar to the Hanning window, with modifications in the weighting of the Dirichlet kernels. The main lobe is 8π/N wide with 43 dB side-lobe attenuation. One characteristic is the non-zero values at both end points; the window is therefore sometimes referred to as the half-raised cosine window. N is the number of samples.

Blackman window

The Blackman window has 57 dB side-lobe attenuation and a main-lobe width of 12π/N, where N is the number of samples.

A.2 Spectral Peak Detection and Tracking
Figure A.1 Rough peak detection

Figure A.2 Prominent peak search

Symbols: difference[ ] holds the bin spacing distances; thresh is the error boundary for a peak to be considered a possible peak.

Figure A.3 Harmonic break search, general flowchart

Figure A.4 Harmonic break search, sub-module flowcharts

Figure A.5 Harmonic analysis, general flowchart

Variables used in the flowchart: idealCount (ideal bin spacing counter), realCount (real bin index counter), numOfPeaks (total number of peaks not yet analyzed), harmError (harmonicity error between real and ideal bin), scaledStdDev (scaled standard deviation), upperBound (used to determine overshoot/undershoot/missing partials), tolerance (error tolerance scalar), bin (bin array containing the peaks for analysis).

Figure A.6 Harmonic analysis, detailed flowchart

References

Allen, J. B. 1977. "Short Time Spectral Analysis, Synthesis and Modification by Discrete Fourier Transform." IEEE Transactions on Acoustics, Speech and Signal Processing 25(3): 235-238.

Allen, J. B., and L. Rabiner 1977. "A Unified Approach to Short-Time Fourier Analysis and Synthesis." Proceedings of the IEEE 65(11): 1558-1564.

Appleton, J. 1991. "Pacific Rimbombo." The Virtuoso in the Computer Age. CDCM Computer Music Series, vol. 5.

Atal, B. S., and S. Hanauer 1971. "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave." Journal of the Acoustical Society of America 50(2): 637-655.

Békésy, G. v. 1943. "Über die Resonanzkurve und die Abklingzeit der verschiedenen Stellen der Schneckentrennwand." Akustische Zeitschrift.

Beauchamp, J. W. 1982. "Synthesis by Spectral Amplitude and 'Brightness' Matching of Analyzed Musical Sounds." Journal of the Audio Engineering Society 30(6): 396-406.

Bregman, A. 1990. Auditory Scene Analysis. Cambridge: The MIT Press.

Carterette, E., and M. Friedman 1978. Handbook of Perception. Academic Press.

Cheney, W., and D. Kincaid 1994. Numerical Mathematics and Computing, 3rd Edition. Brooks/Cole Pub Co.

Clark, M., D. Luce, R. Abrams, H. Schlossberg, and J. Rome 1963. "Preliminary Experiments on the Aural Significance of Parts of Tones of Orchestral Instruments and on Choral Tones." Journal of the Audio Engineering Society 11(1): 45-54.
Cooley, J., and J. Tukey 1965. "An Algorithm for the Machine Calculation of Complex Fourier Series." Mathematics of Computation 19: 297-301.

Eagleson, H., and W. Eagleson 1947. "Identification of Musical Instruments when Heard Directly and Over a Public Address System." Journal of the Acoustical Society of America 19(2): 338-342.

Elliot, C. 1975. "Attacks and Releases as Factors in Instrument Identification." Journal of Research in Music Education 23: 35-40.

Ellis, D. P. W. 1996. "Prediction-Driven Computational Auditory Scene Analysis." Ph.D. Dissertation, MIT.

Fletcher, H., and W. A. Munson 1933. "Loudness, its Definition, Measurement and Calculation." Journal of the Acoustical Society of America.

Fletcher, H., E. Blackman and R. Stratton 1962. "Quality of Piano Tones." Journal of the Acoustical Society of America 34(6): 1534-1544.

Grey, J. 1976. "Multidimensional Scaling of Musical Timbres." Journal of the Acoustical Society of America 61(5): 1270-1277.

Grey, J. M., and J. W. Gordon 1978. "Perceptual Effects of Spectral Modifications on Musical Timbres." Journal of the Acoustical Society of America 63(5): 1493-1500.

Helmholtz, H. 1877. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Translation. New York: Dover.

Kendall, R. A. 1986. "The Role of Acoustic Signal Partitions in Listener Categorization of Musical Phrase." Music Perception 4(2): 185-214.

McAdams, S. 1984. "Spectral Fusion, Spectral Parsing, and the Formation of Auditory Images." Technical Report STAN-M-22. Stanford University, Dept. of Music (CCRMA).

De Poli, G., A. Piccialli and C. Roads, eds. 1991. Representations of Musical Signals. Cambridge: The MIT Press.

McAdams, S. 1999. "Perspectives on the Contribution of Timbre to Musical Structure." Computer Music Journal 23(3): 85-103.

Martin, K. 1999. "Sound-Source Recognition: A Theory and Computational Model." Ph.D. Dissertation, MIT.

Cook, P. 1999. Music, Cognition and Computerized Sound: An Introduction to Psychoacoustics. Cambridge: The MIT Press.

Porat, B. 1997. A Course in Digital Signal Processing. John Wiley & Sons, Inc.

Shepard, R. 1964. "Circularity in Judgments of Relative Pitch." Journal of the Acoustical Society of America 36: 2346-2353.

Serra, X. 1997. "Musical Modeling with Sinusoids plus Noise." In G. D. Poli et al. (eds.), Musical Signal Processing. Swets & Zeitlinger Publishers.

Saldanha, E., and J. Corso 1964. "Timbre Cues and the Identification of Musical Instruments." Journal of the Acoustical Society of America 36: 2021-2026.
