STATISTICS in MUSICOLOGY
Jan Beran
Contents
Preface

1 Some mathematical foundations of music
  1.1 General background
  1.2 Some elements of algebra
  1.3 Specific applications in music

2 Exploratory data mining in musical spaces
  2.1 Musical motivation
  2.2 Some descriptive statistics and plots for univariate data
  2.3 Specific applications in music – univariate
  2.4 Some descriptive statistics and plots for bivariate data
  2.5 Specific applications in music – bivariate
  2.6 Some multivariate descriptive displays
  2.7 Specific applications in music – multivariate

3 Global measures of structure and randomness
  3.1 Musical motivation
  3.2 Basic principles
  3.3 Specific applications in music

4 Time series analysis
  4.1 Musical motivation
  4.2 Basic principles
  4.3 Specific applications in music

5 Hierarchical methods
  5.1 Musical motivation
  5.2 Basic principles
  5.3 Specific applications in music

6 Markov chains and hidden Markov models
  6.1 Musical motivation
  6.2 Basic principles
  6.3 Specific applications in music

7 Circular statistics
  7.1 Musical motivation
  7.2 Basic principles
  7.3 Specific applications in music

8 Principal component analysis
  8.1 Musical motivation
  8.2 Basic principles
  8.3 Specific applications in music

9 Discriminant analysis
  9.1 Musical motivation
  9.2 Basic principles
  9.3 Specific applications in music

10 Cluster analysis
  10.1 Musical motivation
  10.2 Basic principles
  10.3 Specific applications in music

11 Multidimensional scaling
  11.1 Musical motivation
  11.2 Basic principles
  11.3 Specific applications in music

List of figures
References
Preface
An essential aspect of music is structure. It is therefore not surprising that a connection between music and mathematics was recognized long before our time. Perhaps best known among the ancient quantitative musicologists are the Pythagoreans, who found fundamental connections between musical intervals and mathematical ratios. An obvious reason why mathematics comes into play is that a musical performance results in sound waves that can be described by physical equations. Perhaps more interesting, however, is the intrinsic organization of these waves that distinguishes music from ordinary noise. Also, since music is intrinsically linked with human perception, emotion, and reflection as well as the human body, the scientific study of music goes far beyond physics. For a deeper understanding of music, a number of different sciences, such as psychology, physiology, history, physics, mathematics, statistics, computer science, semiotics, and of course musicology, to name only a few, need to be combined. This, together with the lack of available data, prevented, until recently, a systematic development of quantitative methods in musicology. In the last few years, the situation has changed dramatically. Collection of quantitative data is no longer a serious problem, and a number of mathematical and statistical methods have been developed that are suitable for analyzing such data. Statistics is likely to play an essential role in future developments of musicology, mainly for the following reasons: a) statistics is concerned with finding structure in data; b) statistical methods and structures are mathematical, and can often be carried over to various types of data; statistics is therefore an ideal interdisciplinary science that can link different scientific disciplines; and c) musical data are massive and complex and therefore basically useless, unless suitable tools are applied to extract essential features. This book is addressed to anybody who is curious about how one may analyze music in a quantitative manner. Clearly, the question of how such an analysis may be done is very complex, and no ultimate answer can be given here. Instead, the book summarizes various ideas that have proven useful in musical analysis and may provide the reader with food for thought or inspiration to do his or her own analysis. Specifically, the methods and applications discussed here may be of interest to students and researchers in music, statistics, mathematics, computer science, communication, and
engineering. There is a large variety of statistical methods that can be applied in music. Selected topics are discussed in this book, ranging from simple descriptive statistics to formal modeling by parametric and nonparametric processes. The theoretical foundations of each method are discussed briefly, with references to more detailed literature. The emphasis is on examples that illustrate how to use the results in musical analysis. The methods can be divided into two groups: general classical methods and specific new methods developed to solve particular questions in music. Examples illustrate on one hand how standard statistical methods can be used to obtain quantitative answers to musicological questions. On the other hand, the development of more specific methodology illustrates how one may design new statistical models to answer specific questions. The data examples are kept simple in order to be understandable without extended musicological terminology. This implies many simplifications from the point of view of music theory and leaves scope for more sophisticated analysis that may be carried out in future research. Perhaps this book will inspire the reader to join the effort. Chapters are essentially independent to allow selective reading. Since the book describes a large variety of statistical methods in a nutshell, it can be used as a quick reference for applied statistics, with examples from musicology. I would like to thank the following libraries, institutes, and museums for their permission to print various pictures, manuscripts, facsimiles, and photographs: Zentralbibliothek Zürich (Ruth Häusler, Handschriftenabteilung; Anikó Ladányi and Michael Kotrba, Graphische Sammlung); Belmont Music Publishers (Anne Wirth); Philippe Gontier, Paris; Österreichische Post AG; Deutsche Post AG; Elisabeth von Janota-Bzowski, Düsseldorf; University Library Heidelberg; Galerie Neue Meister, Dresden; Robert-Sterl-Haus (K.M. Mieth); Béla Bartók Memorial House (János Szirányi); Frank Martin Society (Maria Martin); Karadar-Bertoldi Ensemble (Prof. Francesco Bertoldi); col legno (Wulf Weinmann). Thanks also to B. Repp for providing us with the tempo data for Schumann's Träumerei. I would also like to thank numerous colleagues from mathematics, statistics, and musicology who encouraged me to write this book. Finally, I would like to thank my wife and my daughter for their encouragement and support, without which this book could not have been written.

Jan Beran
Konstanz, March 2003
CHAPTER 1

Some mathematical foundations of music
Figure 1.1 Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and Jim by J.B.)
Figure 1.2 J.S. Bach (1685–1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
documented in Beethoven's famous sketchbooks. Similarly, the art of counterpoint that culminated in J.S. Bach's (Figure 1.2) work relies to a high degree on intrinsically mathematical principles. A rather peculiar early account of explicit applications of mathematics is the use of permutations in change ringing in English churches since the 10th century (Fletcher 1956, Price 1969, Stewart 1992, White 1983, 1985, 1987, Wilson 1965). More standard are simple symmetries, such as retrograde (e.g. Crab fugue, or Canon cancricans), inversion, arpeggio, or augmentation. A curious example of this sort is Mozart's Spiegel Duett (or mirror duet, Figures 1.6 and 1.7; the attribution to Mozart is actually uncertain). In the 20th century, composers such as Messiaen or Xenakis (Xenakis 1971; Figure 1.15) attempted to develop mathematical theories that would lead to new techniques of composition. From a strictly mathematical point of view, their derivations are not always exact. Nevertheless, their artistic contributions were very innovative and inspiring. More recent, mathematically stringent approaches to music theory, or certain aspects of it, are based on modern tools of abstract mathematics, such as algebra, algebraic geometry, and mathematical statistics (see e.g. Reiner 1985, Mazzola 1985, 1990a, 2002, Lewin 1987, Fripertinger 1991, 1999, 2001, Beran and Mazzola 1992, 1999a,b, 2000, Read 1997, Fleischer et al. 2000, Fleischer 2003). The most obvious connection between music and mathematics is due to the fact that music is communicated in the form of sound waves. Musical sounds can therefore be studied by means of physical equations. Already in ancient Greece (around the 5th century BC), the Pythagoreans found the relationship between certain musical intervals and numeric proportions, and calculated intervals of selected scales. These results were probably obtained by studying the vibration of strings. Similar studies were done in other cultures, but are mostly not well documented. In practical terms, these studies led to singling out specific frequencies (or frequency proportions) as musically useful and to the development of various scales and harmonic systems. A more systematic approach to the physics of musical sounds, music perception, and acoustics was initiated in the second half of the 19th century by path-breaking contributions by Helmholtz (1863) and other physicists (see e.g. Rayleigh 1896). Since then, a vast amount of knowledge has been accumulated in this field (see e.g. Backus 1969, 1977, Morse and Ingard 1968, 1986, Benade 1976, 1990, Rigden 1977, Yost 1977, Hall 1980, Berg and Stork 1995, Pierce 1983, Cremer 1984, Rossing 1984, 1990, 2000, Johnston 1989, Fletcher and Rossing 1991, Graff 1975, 1991, Roederer 1995, Rossing et al. 1995, Howard and Angus 1996, Beament 1997, Crocker 1998, Nederveen 1998, Orbach 1999, Kinsler et al. 2000, Raichel 2000). For a historic account on musical acoustics see e.g. Bailhache (2001). It may appear at first that once we have mastered modeling musical sounds by physical equations, music is understood. This is, however, not so. Music is not just an arbitrary collection of sounds; music is organized sound.
Figure 1.3 Ludwig van Beethoven (1770–1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)
Figure 1.5 Gottfried Wilhelm Leibniz (1646–1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)
Physical equations for sound waves only describe the propagation of air pressure. They do not provide, by themselves, an understanding of how and why certain sounds are connected, nor do they tell us anything (at least not directly) about the effect on the audience. As far as structure is concerned, one may even argue, for the sake of argument, that music does not necessarily need physical realization in the form of a sound. Musicians are able to hear music just by looking at a score. Beethoven (Figures 1.3, 1.16) composed his ultimate masterpieces after he lost his hearing. Thus, on an abstract level, music can be considered as an organized structure that follows certain laws. This structure may or may not express feelings of the composer. Usually, the structure is communicated to the audience by means of physical sounds, which in turn trigger an emotional experience of the audience (not necessarily identical with the one intended by the composer). The structure itself can be analyzed, at least partially, using suitable mathematical structures. Note, however, that understanding the mathematical structure does not necessarily tell us anything about the effect on the audience. Moreover, any mathematical structure used for analyzing music describes certain selected aspects only. For instance, studying symmetries of motifs in a composition by purely algebraic means ignores psychological, historical, perceptual, and other important issues. Ideally, all relevant scientific disciplines would need to interact to gain a broad understanding. A further complication is that the existence of a unique truth is by no means certain (and is in fact rather unlikely). For instance, a composition may contain certain structures that are important for some listeners but are ignored by others. This problem became apparent in the early 20th century with the introduction of 12-tone music. The general public was not ready to perceive the complex structures of dodecaphonic music and was rather appalled by the seemingly chaotic noise, whereas a minority of specialized listeners was enthusiastic. Another example is the
comparison of performances. Which pianist is the best? This question has no unique answer, if any. There is no fixed gold standard and no unique solution that would represent the ultimate unchangeable truth. What one may hope for at most is a classification into types of performances that are characterized by certain quantifiable properties, without attaching a subjective judgment of quality. The main focus of this book is statistics. Statistics is essential for connecting theoretical mathematical concepts with observed reality, to find and explore structures empirically, and to develop models that can be applied and tested in practice. Until recently, traditional musical analysis was mostly carried out in a purely qualitative, and at least partially subjective, manner. Applications of statistical methods to questions in musicology and performance research are very rare (for examples see Yaglom and Yaglom 1967, Repp 1992, de la Motte-Haber 1996, Steinberg 1995, Waugh 1996, Nettheim 1997, Widmer 2001, Stamatatos and Widmer 2002) and mostly consist of simple applications of standard statistical tools to confirm results or conjectures that had been known or derived before by musicological, historic, or psychological reasoning. An interesting overview of statistical applications in music, and many references, can be found in Nettheim (1997). The lack of quantitative analysis may be explained, in part, by the impossibility of collecting objective data. Meanwhile, however, due to modern computer technology, an increasing number of musical data are becoming available. An in-depth statistical analysis of music is therefore no longer unrealistic. On the theoretical side, the development of sophisticated mathematical tools such as algebra, algebraic geometry, and mathematical statistics, and their adaptation to the specific needs of music theory, made it possible to pursue a more quantitative path. Because of the complex, highly organized nature of music, existing, mostly qualitative, knowledge about music must be incorporated into the process of mathematical and statistical modeling. The statistical methods that will be discussed in the subsequent chapters can be divided into two categories:

1. Classical methods of mathematical statistics and exploratory data analysis: many classical methods can be applied to analyze musical structures, provided that suitable data are available. A number of examples will be discussed. The examples are relatively simple from the point of view of musicology, the purpose being to illustrate how the appropriate use of statistics can yield interesting results, and to stimulate the reader to invent his or her own statistical methods that are appropriate for answering specific musicological questions.

2. New methods developed specifically to answer concrete questions in musicology: in the last few years, questions in music composition and performance have led to the development of new statistical methods that are specifically designed to solve questions such as classification of
performance styles, identification and modeling of metric, melodic, and harmonic structures, quantification of similarities and differences between compositions and performance styles, automatic identification of musical events and structures from audio signals, etc. Some of these methods will be discussed in detail.

A mathematical discipline that is concerned specifically with abstract definitions of structures is algebra. Some elements of basic algebra are therefore discussed in the next section. Naturally, depending on the context, other mathematical disciplines also play an equally important role in musical analysis, and will be discussed later where necessary. Readers who are familiar with modern algebra may skip the following section. A few examples that illustrate applications of algebraic structures to music are presented in Section 1.3. An extended account of mathematical approaches to music based on algebra and algebraic geometry is given, for instance, in Mazzola (1990a, 2002) (also see Lewin 1987 and Benson 1995–2002).

1.2 Some elements of algebra

1.2.1 Motivation

Algebraic considerations in music theory have gained increasing popularity in recent years. The reason is that there are striking similarities between musical and algebraic structures. Why this is so can be illustrated by a simple example: notes (or rather pitches) that differ by an octave can be considered equivalent with respect to their harmonic meaning. If an instrument is tuned according to equal temperament, then, from the harmonic perspective, there are only 12 different notes. These can be represented as integers modulo 12. Similarly, there are only 12 different intervals. This means that we are dealing with the set Z12 = {0, 1, ..., 11}. The sum z = x + y of two elements x, y ∈ Z12 is interpreted as the note/interval resulting from increasing the note/interval x by the interval y. The set Z12 of notes (intervals) is then an additive group (see definition below).

1.2.2 Definitions and results

We discuss some important concepts of algebra that are useful to describe musical structures. A more comprehensive overview of modern algebra can be found in standard text books such as those by Albert (1956), Herstein (1975), Zassenhaus (1999), Gilbert (2002), and Rotman (2002). The most fundamental structures in algebra are group, ring, field, module, and vector space.

Definition 1 Let G be a nonempty set with a binary operation + such that a + b ∈ G for all a, b ∈ G and the following holds:
1. (a + b) + c = a + (b + c) (Associativity)
2. There exists a zero element 0 ∈ G such that 0 + a = a + 0 = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element (−a) ∈ G such that (−a) + a = a + (−a) = 0
Then (G, +) is called a group. The group (G, +) is called commutative (or abelian), if for each a, b ∈ G, a + b = b + a. The number of elements in G is called the order of the group and is denoted by o(G). If the order is finite, then G is called a finite group. In a multiplicative way this can be written as

Definition 2 Let G be a nonempty set with a binary operation ∗ such that a ∗ b ∈ G for all a, b ∈ G and the following holds:
1. (a ∗ b) ∗ c = a ∗ (b ∗ c) (Associativity)
2. There exists an identity element e ∈ G such that e ∗ a = a ∗ e = a for all a ∈ G
3. For each a ∈ G, there exists an inverse element a^(-1) ∈ G such that a^(-1) ∗ a = a ∗ a^(-1) = e
Then (G, ∗) is called a group. The group (G, ∗) is called commutative (or abelian), if for each a, b ∈ G, a ∗ b = b ∗ a.

For subsets we have

Definition 3 Let (G, ∗) and (H, ∗) be groups and H ⊂ G. Then H is called a subgroup of G.

Some groups can be generated by a single element of the group:

Definition 4 Let (G, ∗) be a group with n < ∞ elements denoted by a^i (i = 0, 1, ..., n − 1) and such that
1. a^0 = a^n = e
2. a^i ∗ a^j = a^(i+j) if i + j ≤ n and a^i ∗ a^j = a^(i+j−n) if i + j > n
Then G is called a cyclic group. Furthermore, if G = (a) = {a^i : i ∈ Z}, where a^i denotes the product with all i terms equal to a, then a is called a generator of G.

An important notion is given in the following

Definition 5 Let G be a group that acts on a set X by assigning to each x ∈ X and g ∈ G an element g(x) ∈ X. Then, for each x ∈ X, the set G(x) = {y : y = g(x), g ∈ G} is called the orbit of x. Note that, given a group G that acts on X, the set X is partitioned into disjoint orbits.

If there are two operations + and ·, then a ring is defined by

Definition 6 Let R be a nonempty set with two binary operations + and · such that the following holds:
1. (R, +) is an abelian group
2. a · b ∈ R for all a, b ∈ R
3. (a · b) · c = a · (b · c) (Associativity)
4. a · (b + c) = a · b + a · c and (b + c) · a = b · a + c · a (distributive law)
Then (R, +, ·) is called an (associative) ring. If also a · b = b · a for all a, b ∈ R, then R is called a commutative ring.

Further useful definitions are:

Definition 7 Let R be a commutative ring and a ∈ R, a ≠ 0, such that there exists an element b ∈ R, b ≠ 0, with a · b = 0. Then a is called a zero divisor. If R has no zero divisors, then it is called an integral domain.

Definition 8 Let R be a ring such that (R \ {0}, ·) is a group. Then R is called a division ring. A commutative division ring is called a field.

A module is defined as follows:

Definition 9 Let (R, +, ·) be a ring and M a nonempty set with a binary operation +. Assume that
1. (M, +) is an abelian group
2. For every r ∈ R, m ∈ M, there exists an element r · m ∈ M
3. r · (a + b) = r · a + r · b for every r ∈ R and a, b ∈ M
4. r · (s · a) = (r · s) · a for every r, s ∈ R, a ∈ M
5. (r + s) · a = r · a + s · a for every r, s ∈ R, a ∈ M
Then M is called an R-module or module over R. If R has a unit element e and if e · a = a for all a ∈ M, then M is called a unital R-module. A unital R-module where R is a field is called a vector space over R.

There is an enormous amount of literature on groups, rings, modules, etc. Some of the standard results are summarized, for instance, in text books such as those given above. Here, we cite only a few theorems that are especially useful in music. We start with a few more definitions.

Definition 10 Let H ⊂ G be a subgroup of G such that for every a ∈ G, a ∗ H ∗ a^(-1) ⊂ H. Then H is called a normal subgroup of G.

Definition 11 Let G be such that the only normal subgroups are H = G and H = {e}. Then G is called a simple group.

Definition 12 Let G be a group and H1, ..., Hn normal subgroups such that

    G = H1 ∗ H2 ∗ ... ∗ Hn    (1.1)

and any a ∈ G can be written uniquely as a product

    a = b1 ∗ b2 ∗ ... ∗ bn    (1.2)

with bi ∈ Hi. Then G is said to be the (internal) direct product of H1, ..., Hn.
Definition 13 Let G1 and G2 be two groups, define G = G1 × G2 = {(a, b) : a ∈ G1, b ∈ G2} and the operation ∗ by (a1, b1) ∗ (a2, b2) = (a1 ∗ a2, b1 ∗ b2). Then the group (G, ∗) is called the (external) direct product of G1 and G2.

Definition 14 Let M be an R-module and M1, ..., Mn submodules such that every a ∈ M can be written uniquely as a sum a = a1 + a2 + ... + an with ai ∈ Mi. Then M is said to be the direct sum of M1, ..., Mn.

We now turn to the question which subgroups of finite groups exist.

Theorem 1 Let H be a subgroup of a finite group G. Then o(H) is a divisor of o(G).

Theorem 2 (Sylow) Let G be a group and p a prime number such that p^m is a divisor of o(G). Then G has a subgroup H with o(H) = p^m.

Definition 15 A subgroup H ⊂ G such that p^m is a divisor of o(G) but p^(m+1) is not a divisor, is called a p-Sylow subgroup.

The next theorems help to decide whether a ring is a field.

Theorem 3 Let R be a finite integral domain. Then R is a field.

Corollary 1 Let p be a prime number and R = Zp = {x mod p : x ∈ N} be the set of integers modulo p (with the operations + and · defined accordingly). Then R is a field.

An essential way to compare algebraic structures is in terms of operation-preserving mappings. The following definitions are needed:

Definition 16 Let (G1, ∗) and (G2, •) be two groups. A mapping g : G1 → G2 such that

    g(a ∗ b) = g(a) • g(b)    (1.4)

is called a (group-)homomorphism. If g is a one-to-one (group-)homomorphism, then it is called an isomorphism (or group-isomorphism). Moreover, if G1 = G2, then g is called an automorphism (or group-automorphism).

Definition 17 Two groups G1, G2 are called isomorphic, if there is an isomorphism g : G1 → G2.

Analogous definitions can be given for rings and modules:

Definition 18 Let R1 and R2 be two rings. A mapping g : R1 → R2 such that

    g(a + b) = g(a) + g(b)    (1.5)

and

    g(a · b) = g(a) · g(b)    (1.6)

is called a (ring-)homomorphism. If g is a one-to-one (ring-)homomorphism, then it is called an isomorphism (or ring-isomorphism). Furthermore, if R1 = R2, then g is called an automorphism (or ring-automorphism).
Definition 19 Two rings R1, R2 are called isomorphic, if there is an isomorphism g : R1 → R2.

Definition 20 Let M1 and M2 be two modules over R. A mapping g : M1 → M2 such that for every a, b ∈ M1, r ∈ R,

    g(a + b) = g(a) + g(b)    (1.7)

and

    g(r · a) = r · g(a)    (1.8)

is called a (module-)homomorphism (or a linear transformation). If g is a one-to-one (module-)homomorphism, then it is called an isomorphism (or module-isomorphism). Furthermore, if M1 = M2, then g is called an automorphism (or module-automorphism).

Definition 21 Two modules M1, M2 are called isomorphic, if there is an isomorphism g : M1 → M2.

Finally, a general family of transformations is defined by

Definition 22 Let g : M1 → M2 be a (module-)homomorphism. Then a mapping h : M1 → M2 defined by

    h(a) = c + g(a)    (1.9)

with c ∈ M2 is called an affine transformation. If M1 = M2 = M, then h is called a symmetry of M. Moreover, if h is invertible, then it is called an invertible symmetry of M.

Studying properties of groups is equivalent to studying groups of automorphisms:

Theorem 4 (Cayley's theorem) Let G be a group. Then there is a set S such that G is isomorphic to a subgroup of A(S), where A(S) is the set of all one-to-one mappings of S onto itself.

Definition 23 Let S be a finite set of n elements. Then the group (A(S), ∗) (where a ∗ b denotes successive application of the functions a and b) is called the symmetric group of order n, and is denoted by Sn. Note that Sn is isomorphic to the group of permutations of the numbers 1, 2, ..., n, and has n! elements.

Another important concept is motivated by representation in coordinates as we are used to from Euclidean geometry. The representation follows since, in terms of isomorphy, the inner and outer product can be shown to be equivalent:

Theorem 5 Let G = H1 ∗ H2 ∗ ... ∗ Hn be the internal direct product of H1, ..., Hn and G' = H1 × H2 × ... × Hn the external direct product. Then G and G' are isomorphic, through the isomorphism g : G' → G defined by g(a1, ..., an) = a1 ∗ a2 ∗ ... ∗ an.

This theorem implies that one does not need to distinguish between the internal and external direct product. The analogous result holds for modules:
Theorem 6 Let M be a direct sum of M1, ..., Mn. Then M is isomorphic to the module M' = {(a1, a2, ..., an) : ai ∈ Mi} with the operations (a1, a2, ...) + (b1, b2, ...) = (a1 + b1, a2 + b2, ...) and r · (a1, a2, ...) = (r · a1, r · a2, ...).

Thus, a module M = M1 + M2 + ... + Mn can be described in terms of its coordinates with respect to Mi (i = 1, ..., n), and the structure of M is known as soon as we know the structure of Mi (i = 1, ..., n). Direct products can be used, in particular, to characterize the structure of finite abelian groups:

Theorem 7 Let (G, ∗) be a finite commutative group. Then G is isomorphic to the direct product of its Sylow subgroups.

Theorem 8 Let (G, ∗) be a finite commutative group. Then G is the direct product of cyclic groups.

Similar, but slightly more involved, results can be shown for modules, but will not be needed here.

1.3 Specific applications in music

In the following, the usefulness of algebraic structures in music is illustrated by a few selected examples. This is only a small selection from the extended literature on this topic. For further reading see e.g. Graeser (1924), Schönberg (1950), Perle (1955), Fletcher (1956), Babbitt (1960, 1961), Price (1969), Archibald (1972), Halsey and Hewitt (1978), Balzano (1980), Rahn (1980), Götze and Wille (1985), Reiner (1985), Berry (1987), Mazzola (1990a, 2002 and references therein), Vuza (1991, 1992a,b, 1993), Fripertinger (1991), Lendvai (1993), Benson (1995–2002), Read (1997), Noll (1997), Andreatta (1997), Stange-Elbe (2000), among others.

1.3.1 The Mathieu group

It can be shown that finite simple groups fall into families that can be described explicitly, except for 26 so-called sporadic groups. One such group is the so-called Mathieu group M12, which was discovered by the French mathematician Mathieu in the 19th century (Mathieu 1861, 1873, also see e.g. Conway and Sloane 1988). In their study of probabilistic properties of (card) shuffling, Diaconis et al. (1983) show that M12 can be generated by two permutations (which they call Mongean shuffles), namely

    π1 = ( 1  2  3  4  5  6  7  8  9 10 11 12
           7  6  8  5  9  4 10  3 11  2 12  1 )    (1.10)

and

    π2 = ( 1  2  3  4  5  6  7  8  9 10 11 12
           6  7  5  8  4  9  3 10  2 11  1 12 ),   (1.11)

where the lower rows denote the images of the numbers 1, ..., 12. The order of this group is o(M12) = 95040 (!)
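The order of the generated group can also be checked numerically. The following is a minimal sketch using SymPy's permutation groups (assuming SymPy is available); the 0-based index lists simply restate (1.10) and (1.11):

    from sympy.combinatorics import Permutation, PermutationGroup

    # images of 0,...,11 under the two Mongean shuffles (Eqs. 1.10 and 1.11, shifted to 0-based indexing)
    pi1 = Permutation([6, 5, 7, 4, 8, 3, 9, 2, 10, 1, 11, 0])
    pi2 = Permutation([5, 6, 4, 7, 3, 8, 2, 9, 1, 10, 0, 11])

    G = PermutationGroup([pi1, pi2])
    print(G.order())  # 95040, the order of the Mathieu group M12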
An interesting application of these permutations can be found in Île de feu 2 by Olivier Messiaen (Berry 1987), where π1 and π2 are used to generate sequences of tones and durations.

1.3.2 Campanology

A rather peculiar example of group theory in action (though perhaps rather trivial mathematically) is campanology or change ringing (Fletcher 1956, Wilson 1965, Price 1969, White 1983, 1985, 1987, Stewart 1992). The art of change ringing started in England in the 10th century and is still performed today. The problem that is to be solved is as follows: there are k swinging bells in the church tower. One starts playing a melody that consists of a certain sequence in which the bells are played, each bell being played only once. Thus, the initial sequence is a permutation of the numbers 1, ..., k. Since it is not interesting to repeat the same melody over and over, the initial melody has to be varied. However, the bells are very heavy, so that it is not easy to change the timing of the bells. Each variation is therefore restricted, in that from one round to the next only bells in adjacent positions may exchange their position (each bell moves by at most one place, and several disjoint adjacent pairs may swap simultaneously). Thus, for instance, if k = 4 and the previous sequence was (1, 2, 3, 4), then the permissible successors are (2, 1, 3, 4), (1, 3, 2, 4), (1, 2, 4, 3), and (2, 1, 4, 3). A further, mainly aesthetic restriction is that no sequence should be repeated, except that the last one is identical with the initial sequence. A typical solution to this problem is, for instance, the Plain Bob that starts with (1, 2, 3, 4), (2, 1, 4, 3), (2, 4, 1, 3), ... and continues until all permutations in S4 are visited.
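For k = 4 bells, the complete Plain Bob sequence can be generated mechanically from three elementary changes. The following sketch is one illustrative way to do it (the function names are mine, not standard campanological terminology):

    def plain_bob_minimus():
        """Generate the 24 rows of Plain Bob on four bells."""
        def cross(r):      # all adjacent pairs swap
            return (r[1], r[0], r[3], r[2])
        def places_14(r):  # bells in places 1 and 4 stay, the middle pair swaps
            return (r[0], r[2], r[1], r[3])
        def places_12(r):  # bells in places 1 and 2 stay, the last pair swaps
            return (r[0], r[1], r[3], r[2])

        row, rows = (1, 2, 3, 4), [(1, 2, 3, 4)]
        for _ in range(3):  # three leads of eight changes each
            for change in [cross, places_14] * 3 + [cross, places_12]:
                row = change(row)
                rows.append(row)
        return rows

    rows = plain_bob_minimus()
    print(rows[:3])                      # (1,2,3,4), (2,1,4,3), (2,4,1,3), as in the text
    print(len(set(rows[:-1])), rows[-1]) # 24 distinct rows, ending back at rounds (1,2,3,4)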
1.3.3 Representation of music

Many aspects of music can be embedded in a suitable algebraic module (see e.g. Mazzola 1990a). Here are some examples:

1. Apart from glissando effects, the essential frequencies in most types of music are of the form

    ω = ω0 · ∏_{i=1}^{K} p_i^{x_i},    (1.12)

where K < ∞, ω0 is a fixed basic frequency, the p_i are certain fixed prime numbers, and x_i ∈ Q. Thus,

    β = log ω = β0 + Σ_{i=1}^{K} x_i β_i,    (1.13)

where β0 = log ω0 and β_i = log p_i (i ≥ 1). Let Ω = {β : β = Σ_{i=1}^{K} x_i β_i, x_i ∈ Q} be the set of all log-frequencies generated this way. Then Ω is a module over Q. Two typical examples are:
(a) ω0 = 440 Hz, K = 3, p1 = 2, p2 = 3, p3 = 5: This is the so-called Euler module in which most Western music operates. An important subset consists of frequencies of the just intonation with the pure intervals octave (ratio of frequencies 2), fifth (ratio of frequencies 3/2), and major third (ratio of frequencies 5/4):

    β = log ω = log 440 + x1 log 2 + x2 log 3 + x3 log 5    (1.14)

(xi ∈ Z). The notes (frequencies) can then be represented by points in a three-dimensional space of integers Z^3. Note that, using the notation a = (a1, a2, a3) and b = (b1, b2, b3), the pitch obtained by the addition c = a + b corresponds to the frequency ω0 · 2^(a1+b1) · 3^(a2+b2) · 5^(a3+b3).

(b) ω0 = 440 Hz, K = 1, p1 = 2, and x1 = p/12, where p ∈ Z: This corresponds to the well-tempered tuning where an octave is divided into 12 equal intervals. Thus, the ratio 2 is decomposed into 12 ratios of 2^(1/12), so that

    β = log 440 + (p/12) · log 2.    (1.15)

If notes that differ by one or several octaves are considered equivalent, then we can identify the set of notes with the Z-module Z12 = {0, 1, ..., 11}.
2. Consider a finite module of notes (frequencies), such as for instance the well-tempered module M = Z12. Then a scale is an element of S = {(x1, ..., xk) : k ≤ o(M), xi ∈ M, xi ≠ xj (i ≠ j)}, the set of all finite vectors with different components.

1.3.4 Classification of circular chords and other musical objects

A central element of the classical theory of harmony is the triad. An algebraic property that distinguishes harmonically important triads from other chords can be described as follows: let x1, x2, x3 ∈ Z12, such that (a) xi ≠ xj (i ≠ j) and (b) there is an inner symmetry g : Z12 → Z12 such that {y : y = g^k(x1), k ∈ N} = {x1, x2, x3}. It can be shown that all chords (x1, x2, x3) for which (a) and (b) hold are standard chords that are harmonically important in the traditional theory of harmony. Consider for instance the major triad (c, e, g) = (0, 4, 7) and the minor triad (c, e♭, g) = (0, 3, 7). For the first triad, the symmetry g(x) = 3x + 7 yields the desired result: g(0) = 7 = g, g(7) = 4 = e, and g(4) = 7 = g. For the minor triad the only inner symmetry is g(x) = 3x + 3, with g(7) = 0 = c, g(0) = 3 = e♭, and g(3) = 0 = c. This type of classification of chords can be carried over to more complicated configurations of notes (see e.g. Mazzola 1990a, 2002, Straub 1989). In particular, musical scales can be classified by comparing their inner symmetries.
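Condition (b) can be checked by brute force over all affine maps g(x) = ax + b on Z12. The sketch below does this, reading (b) as "the orbit of some chord tone is exactly the chord", which is how the worked examples above behave; the helper names are mine and the search is purely illustrative:

    from itertools import product

    def orbit(a, b, x, mod=12):
        """Iterate g(y) = a*y + b (mod 12) starting from x until the orbit closes."""
        seen, y = [], x
        while y not in seen:
            seen.append(y)
            y = (a * y + b) % mod
        return set(seen)

    def inner_symmetries(chord, mod=12):
        """Affine maps whose iterates of some chord tone produce exactly the chord."""
        notes = set(chord)
        return [(a, b) for a, b in product(range(mod), repeat=2)
                if any(orbit(a, b, x, mod) == notes for x in chord)]

    print(inner_symmetries((0, 4, 7)))  # contains (3, 7), i.e. g(x) = 3x + 7 as in the text
    print(inner_symmetries((0, 3, 7)))  # contains (3, 3), i.e. g(x) = 3x + 3 as in the text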
1.3.5 Torus of thirds

Consider the group G = (Z12, +) of pitches modulo octave. Then G is isomorphic to the direct sum of the Sylow groups Z3 and Z4 by applying the isomorphism

    g : Z12 → Z3 ⊕ Z4,    (1.16)
    x ↦ y = (y1, y2) = (x mod 3, x mod 4).    (1.17)

Geometrically, the elements of Z3 ⊕ Z4 can be represented as points on a torus, y1 representing the position on the vertical meridian and y2 the position on the horizontal equatorial circle (Figure 1.8). This representation has a musical meaning: a movement along a meridian corresponds to a major third, whereas a movement along a horizontal circle corresponds to a minor third. One can then define the torus distance dtorus(x, y) by equating it to the minimal number of steps needed to move from x to y. The value of dtorus(x, y) expresses in how far there is a third-relationship between x and y. The possible values of dtorus are 0 (if x = y), 1, 2, and 3 (smallest third-relationship). Note that dtorus can be decomposed into d3 + d4, where d3 counts the number of meridian steps and d4 the number of equatorial steps.
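The coordinates (1.16)–(1.17) and the torus distance are easy to compute; a minimal sketch (function names are mine):

    def torus_coords(x):
        """Map a pitch class x in Z12 to its (Z3, Z4) coordinates, Eqs. (1.16)-(1.17)."""
        return (x % 3, x % 4)

    def d_torus(x, y):
        """Minimal number of third-steps between pitch classes x and y."""
        (x3, x4), (y3, y4) = torus_coords(x), torus_coords(y)
        d3 = min((x3 - y3) % 3, (y3 - x3) % 3)  # meridian steps (major thirds)
        d4 = min((x4 - y4) % 4, (y4 - x4) % 4)  # equatorial steps (minor thirds)
        return d3 + d4

    print(d_torus(0, 4))  # 1: C and E differ by one major third
    print(d_torus(0, 7))  # 2: C and G are one major plus one minor third apart
    print(d_torus(0, 2))  # 3: a whole tone has the weakest third-relationship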
1.3.6 Transformations

For suitably chosen integers p1, p2, p3, p4, consider the four-dimensional module M = Z_p1 × Z_p2 × Z_p3 × Z_p4 over Z, where the coordinates represent onset time, pitch (well-tempered tuning if p2 = 12), duration, and volume. Transformations in this space play an essential role in music. A selection of historically relevant transformations used by classical composers is summarized in Table 1.1 (also see Figure 1.13). Generally, one may say that affine transformations are most important, and among these the invertible ones. In particular, it can be shown that each symmetry of Z12 can be written as a product (in the group of symmetries Symm(Z12)) of the following musically meaningful transformations: multiplication by −1 (inversion); multiplication by 5 (reordering of the notes according to the circle of fourths); addition of 3 (transposition by a minor third); addition of 4 (transposition by a major third). All these transformations have been used by composers for many centuries. Some examples of apparent similarities between groups of notes (or motifs) are shown in Figures 1.10 through 1.12. In order not to clutter the pictures, only a small selection of similar motifs is marked. In dodecaphonic and serial music, transformation groups have been applied systematically (see e.g. Figure 1.9).
Table 1.1 Some affine transformations used in classical music

  Shift: f(x) = x + a
    Musical meaning: transposition, repetition, change of duration, change of loudness
  Shear, e.g. of x = (x1, ..., x4)^t with respect to the line y = o + t·(0, 1, 0, 0): f(x) = x + a·(0, 1, 0, 0) for x not on the line, f(x) = x for x on the line
    Musical meaning: arpeggio
  Reflection, e.g. with respect to v = (a, 0, 0, 0): f(x) = (a − (x1 − a), x2, x3, x4)
    Musical meaning: retrograde, inversion
  Dilatation, e.g. with respect to pitch: f(x) = (x1, a·x2, x3, x4)
    Musical meaning: augmentation
  Exchange of coordinates: f(x) = (x2, x1, x3, x4)
For instance, in Schönberg's Orchestervariationen op. 31, the full orbit generated by inversion, retrograde, and transposition is used. Webern used 12-tone series that are diagonally symmetric in the two-dimensional space spanned by pitch and onset time. Other famous examples include Eimert's rotation by 45 degrees together with a dilatation by √2 (Eimert 1964) and serial compositions such as Boulez's Structures and Stockhausen's Kontra-Punkte. With advanced computer technology (e.g. composition soft- and hardware such as Xenakis' UPIC graphics/computer system or the recently developed Presto software by Mazzola 1989/1994), the application of affine transformations in musical spaces of arbitrary dimension is no longer the tedious work of the early dodecaphonic era. On the contrary, the practical ease and enormous artistic flexibility lead to an increasing popularity of computer-aided transformations among contemporary composers (see e.g. Iannis Xenakis, Kurt Dahlke, Wilfried Jentzsch, Guerino Mazzola 1990b, Dieter Salbert, Karl-Heinz Schöppner, Tamas Ungvary, Jan Beran 1987, 1991, 1992, 2000; cf. Figure 1.14).
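To make the transformations of Table 1.1 concrete, the following sketch applies a few of them to a toy motif represented as (onset, pitch, duration, volume) tuples; the representation, the numerical values, and the function names are illustrative assumptions, not notation from the text:

    # a motif as (onset, pitch, duration, volume) tuples, e.g. C-E-G in MIDI-like pitch numbers
    motif = [(0, 60, 1, 80), (1, 64, 1, 80), (2, 67, 2, 90)]

    def shift(motif, a):
        """Shift: add the vector a coordinatewise, e.g. a = (0, 3, 0, 0) transposes by a minor third."""
        return [tuple(c + s for c, s in zip(note, a)) for note in motif]

    def inversion(motif, a):
        """Reflection of the pitch coordinate about a."""
        return [(t, a - (p - a), d, v) for (t, p, d, v) in motif]

    def retrograde(motif, a):
        """Reflection of onset time about a: the motif played backwards."""
        return sorted((a - (t - a), p, d, v) for (t, p, d, v) in motif)

    def augmentation(motif, factor):
        """Dilatation of the time coordinates (onset and duration): classical augmentation."""
        return [(t * factor, p, d * factor, v) for (t, p, d, v) in motif]

    print(shift(motif, (0, 3, 0, 0)))  # transposition by a minor third
    print(inversion(motif, 60))        # inversion about middle C
    print(retrograde(motif, 1))        # retrograde about the middle onset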
Figure 1.6 Spiegel-Duett for violin, attributed to W.A. Mozart (Allegro, quarter note = 120).
Figure 1.7 Wolfgang Amadeus Mozart (1756–1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)
Figure 1.9 Arnold Schönberg: sketch for the piano concerto op. 42, with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)
Figure 1.10 Notes of Air by Henry Purcell. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.11 Notes of Fugue No. 1 (first half) from Das Wohltemperierte Klavier by J.S. Bach. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.12 Notes of op. 68, No. 2 from Album für die Jugend by Robert Schumann. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.13 A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)
Figure 1.14 Graphical representation of pitch and onset time in Z71 × Z71, together with instrumentation of polygonal areas. (Excerpt from Śānti – Piano Concerto No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
CHAPTER 2

Exploratory data mining in musical spaces
Figure 2.1 Träumerei op. 15, No. 7, by Robert Schumann (piano score).
Figure 2.2 Log(tempo) as a function of onset time for three recordings (1947, 1963, 1965).
2.2 Some descriptive statistics and plots for univariate data

2.2.1 Definitions

We give a brief summary of univariate descriptive statistics. For a comprehensive discussion we refer the reader to standard text books such as Tukey (1977), Mosteller and Tukey (1977), Hoaglin (1977), Tufte (1977), Velleman and Hoaglin (1981), Chambers et al. (1983), Cleveland (1985). Suppose that we observe univariate data x1, x2, ..., xn. To summarize general characteristics of the data, various numerical summary statistics can be calculated. Essential features are in particular center (location), variability, asymmetry, shape of distribution, and location of unusual values (outliers). The most frequently used statistics are listed in Table 2.1. We recall a few well known properties of these statistics: Sample mean: The sample mean can be understood as the center of gravity of the data, whereas the median divides the sample into two halves with an (approximately) equal number of observations.
Table 2.1 Simple descriptive statistics

  Empirical distribution function: Fn(x) = n^(-1) Σ_{i=1}^{n} 1{xi ≤ x}
    Feature measured: proportion of observations ≤ x
  Minimum: min(x1, ..., xn)
    Feature measured: smallest value
  Maximum: max(x1, ..., xn)
    Feature measured: largest value
  Range: max(x1, ..., xn) − min(x1, ..., xn)
    Feature measured: total spread
  Sample mean: x̄ = n^(-1) Σ_{i=1}^{n} xi
    Feature measured: center
  Sample median: M = inf{x : Fn(x) ≥ 1/2}
    Feature measured: center
  Sample α-quantile: q_α = inf{x : Fn(x) ≥ α}
  Lower and upper quartile: Q1 = q_{1/4}, Q2 = q_{3/4}
  Sample variance: s² = (n − 1)^(-1) Σ_{i=1}^{n} (xi − x̄)²
    Feature measured: variability
  Sample standard deviation: s = +√(s²)
    Feature measured: variability
  Interquartile range: IQR = Q2 − Q1
    Feature measured: variability
  Sample skewness: m3 = n^(-1) Σ_{i=1}^{n} [(xi − x̄)/s]³
    Feature measured: asymmetry
  Sample kurtosis: m4 = n^(-1) Σ_{i=1}^{n} [(xi − x̄)/s]⁴ − 3
    Feature measured: shape of distribution (peakedness)
In contrast to the median, the mean is sensitive to outliers, since observations that are far from the majority of the data have a strong influence on its value. Sample standard deviation: The sample standard deviation is a measure of variability. In contrast to the variance, s is directly comparable with the data, since it is measured in the same unit. If observations are drawn independently from the same normal probability distribution (or a distribution that is similar to a normal distribution), then the following rule of thumb applies: (a) approximately 68% of the data are in the interval x̄ ± s; (b) approximately 95% of the data are in the interval x̄ ± 2s; (c) almost all data are in the interval x̄ ± 3s. For a sufficiently large sample size, these conclusions can be carried over to the population from which the data were drawn.
Interquartile range: The interquartile range also measures variability. Its advantage, compared to s, is that it is much less sensitive to outliers. If the observations are drawn from the same normal probability distribution, then IQR/1.35 (or more precisely IQR/[Φ^(-1)(0.75) − Φ^(-1)(0.25)], where Φ^(-1) is the quantile function of the standard normal distribution) estimates the same quantity as s, namely the population standard deviation.
Quantiles: For α = i/n (i = 1, ..., n), q_α coincides with at least one observation. For other values of α, q_α can be defined as in Table 2.1 or, alternatively, by interpolating neighboring observed values as follows: let λ = i/n < α < μ = (i + 1)/n. Then the interpolated quantile q̃_α is defined by

    q̃_α = q_λ + [(α − λ)/(1/n)] · (q_μ − q_λ).    (2.1)
Note that a slightly different convention used by some statisticians is to call inf{x : Fn(x) ≥ α} the (α − 0.5/n)-quantile (see e.g. Chambers et al. 1983). Skewness: Skewness measures symmetry/asymmetry. For exactly symmetric data, m3 = 0, for data with a long right tail m3 > 0, for data with a long left tail m3 < 0. Kurtosis: The kurtosis is mainly meaningful for unimodal distributions, i.e. distributions with one peak. For a sample from a normal distribution, m4 ≈ 0. The reason is that then E[(X − μ)⁴] = 3σ⁴, where μ = E(X) and σ² = Var(X). For samples from unimodal distributions with a sharper or flatter peak than the normal distribution, we then tend to have m4 > 0 and m4 < 0 respectively. Simple, but very useful graphical displays are: Histogram: 1. Divide an interval (a, b] that includes all observations into disjoint intervals I1 = (a1, b1], ..., Ik = (ak, bk]. 2. Let n1, ..., nk be the number of observations in the intervals I1, ..., Ik respectively. 3. Above each interval Ij, plot a rectangle of width wj = bj − aj and height hj = nj/wj. Instead of the absolute frequencies, one can also use relative frequencies nj/n where n = n1 + ... + nk. The essential point is that the area is proportional to nj. If the data are drawn from a probability distribution with density function f, then the histogram is an estimate of f. Kernel estimate of a density function: The histogram is a step function, and in that sense does not resemble most density functions. This can be improved as follows. If the data are realizations of a continuous random variable X with distribution F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du, then a smooth estimate of the probability density function f can be defined by a kernel estimate (Rosenblatt 1956, Parzen 1962, Silverman 1986) of the
form

    f̂(x) = (1/(nb)) Σ_{i=1}^{n} K((xi − x)/b),    (2.2)
where K(u) = K(−u) ≥ 0 and ∫K(u) du = 1. Most kernels used in practice also satisfy the condition K(u) = 0 for |u| > 1. The bandwidth b then specifies which data in the neighborhood of x are used to estimate f(x). In situations where one has partial knowledge of the shape of f, one may incorporate this into the estimation procedure. For instance, Hjort and Glad (2002) combine parametric estimation based on a preliminary density function f(x; θ̂) with kernel smoothing of the remaining density f/f(x; θ̂). They show that major efficiency gains can be achieved if the preliminary model is close to the truth. Barchart: If data can assume only a few different values, or if data are qualitative (i.e. we only record which category an item belongs to), then one can plot the possible values or names of categories on the x-axis and on the vertical axis the corresponding (relative) frequencies. Boxplot (simple version): 1. Calculate Q1, M, Q2 and IQR = Q2 − Q1. 2. Draw parallel lines (in principle of arbitrary length) at the levels Q1, M, Q2, A1 = Q1 − (3/2)·IQR, A2 = Q2 + (3/2)·IQR, B1 = Q1 − 3·IQR and B2 = Q2 + 3·IQR. The points A1, A2 are called inner fence, and B1, B2 are called outer fence. 3. Identify the observation(s) between Q1 and A1 that is closest to A1 and draw a line connecting Q1 with this point. Do the same for Q2 and A2. 4. Identify observation(s) between A1 and B1 and draw points (or other symbols) at those places. Do the same for A2 and B2. 5. Draw points (or other symbols) for observations beyond B1 and B2 respectively. The boxplot can be interpreted as follows: the relative positions of Q1, M, Q2 and the inner and outer fences indicate symmetry or asymmetry. Moreover, the distance between Q1 and Q2 is the IQR and thus measures variability. The inner and outer fences help to identify outliers, i.e. values lying unusually far from most of the other observations. Qqplot for comparing two data sets x1, ..., xn and y1, ..., ym: 1. Define a certain number of points 0 < p1 < ... < pk ≤ 1 (the standard choice is pi = (i − 0.5)/N, where N = min(n, m)). 2. Plot the pi-quantiles (i = 1, ..., N) of the y observations versus those of the x observations. Alternative plots for comparing distributions are discussed e.g. in Ghosh and Beran (2000) and Ghosh (1996, 1999).
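As a concrete illustration of the kernel estimate (2.2), the following minimal sketch implements it directly with the Epanechnikov kernel; the data are simulated stand-ins, and the function name is mine rather than anything from the text:

    import numpy as np

    def kernel_density(x_grid, data, b):
        """Kernel density estimate, Eq. (2.2), with K(u) = 0.75*(1 - u^2) on [-1, 1]."""
        data = np.asarray(data, dtype=float)
        est = np.empty_like(x_grid, dtype=float)
        for m, x in enumerate(x_grid):
            u = (data - x) / b
            est[m] = np.mean(0.75 * (1.0 - u**2) * (np.abs(u) <= 1)) / b
        return est

    # hypothetical data: e.g. standardized log-tempo values of one performance
    data = np.random.default_rng(0).normal(size=200)
    grid = np.linspace(-4, 4, 81)
    f_hat = kernel_density(grid, data, b=0.5)
    print(round(np.trapz(f_hat, grid), 3))  # the estimate integrates to roughly 1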
2.3 Specific applications in music – univariate

2.3.1 Tempo curves

Figure 2.3 displays 28 tempo curves for performances of Schumann's Träumerei op. 15, No. 7, by 24 pianists. The names of the pianists and dates of the recordings (in brackets) are Martha Argerich (before 1983), Claudio Arrau (1974), Vladimir Ashkenazy (1987), Alfred Brendel (before 1980), Stanislav Bunin (1988), Sylvia Capova (before 1987), Alfred Cortot (1935, 1947 and 1953), Clifford Curzon (about 1955), Fanny Davies (1929), Jörg Demus (about 1960), Christoph Eschenbach (before 1966), Reine Gianoli (1974), Vladimir Horowitz (1947, before 1963 and 1965), Cyprien Katsaris (1980), Walter Klien (date unknown), André Krust (about 1960), Antonin Kubalek (1988), Benno Moiseiwitsch (about 1950), Elly Ney (about 1935), Guiomar Novaes (before 1954), Cristina Ortiz (before 1988), Artur Schnabel (1947), Howard Shelley (before 1990), Yakov Zak (about 1960). Tempo is more likely to be varied in a relative rather than absolute way. For instance, a musician plays a certain passage twice as fast as the previous one, but may care less about the exact absolute tempo. This suggests consideration of the logarithm of tempo. Moreover, the main interest lies in comparing the shapes of the curves. Therefore, the plotted curves consist of standardized logarithmic tempo (each curve has sample mean zero and variance one). Schumann's Träumerei is divided into four main parts, each consisting of about eight bars, the first two and the last one being almost identical (see Figure 2.1). Thus, the structure is: A, A′, B, and A″. Already a very simple exploratory analysis reveals interesting features. For each pianist, we calculate the following statistics for the four parts respectively: x̄, M, s, Q1, Q2, m3 and m4. Figures 2.4a through e show a distinct pattern that corresponds to the division into A, A′, B, and A″. Tempo is much lower in A and generally highest in B. Also, A′ seems to be played at a slightly slower tempo than A, though this distinction is not quite so clear (Figures 2.4a,b). Tempo is varied most towards the end and considerably less in the first half of the piece (Figure 2.4c). Skewness is generally negative, which is due to occasional extreme ritardandi. This is most extreme in part B and, again, least pronounced in the first half of the piece (A, A′). A mirror image of this pattern, with most extreme positive values in B, is observed for kurtosis. This indicates that in B (and also in A″), most tempo values vary little around an average value, but occasionally extreme tempo changes occur. Also, for A, there are two outliers with an extremely negative skewness; these turn out to be Fanny Davies and Jörg Demus. Figures 2.4f through h show another interesting comparison of boxplots. In Figure 2.4f, the differences between the lower quartiles in A and A′ for performances before 1965 are compared with those from performances recorded in 1965 or later. The clear difference indicates that, at least for the
Figure 2.3 Twenty-eight tempo curves of Schumann's Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.)
sample considered here, pianists of the modern era tend to make a much stronger distinction between A and A′ in terms of slow tempi. The only exceptions (outliers in the left boxplot) are Moiseiwitsch and Horowitz's first performance, and Ashkenazy (outlier in the right boxplot). The comparison of skewness and kurtosis in Figures 2.4g and h also indicates that modern pianists seem to prefer occasional extreme ritardandi. The only exception in the early 20th century group is Artur Schnabel, with an extreme skewness of −2.47 and a kurtosis of 7.04. Direct comparisons of tempo distributions are shown in Figures 2.5a
Figure 2.4 Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.
through f. The following observations can be made: a) compared to Demus (quantiles on the horizontal axis), Ortiz has a few relatively extreme slow tempi (Figure 2.5a); b) similarly, but in a less extreme way, Cortot's interpretation includes occasional extremely slow tempo values (Figure 2.5b); c) Ortiz and Argerich have practically the same (marginal) distribution (Figure 2.5c); d) Figure 2.5d is similar to 2.5a and b, though less extreme; e) the tempo distribution of Cortot's performance (Figure 2.5e) did not change much in 1947 compared to 1935; f) similarly, Horowitz's tempo
distributions in 1947 and 1963 are almost the same, except for slight changes for very low tempi (Figure 2.5f).
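The comparisons above rest on standardized log-tempo curves and the summary statistics of Table 2.1. A minimal sketch of this kind of preprocessing and summary (with hypothetical tempo values; the function names are mine) could look as follows:

    import numpy as np

    def standardize_log_tempo(tempo):
        """Standardized log-tempo: take logs, then center to mean 0 and scale to variance 1."""
        x = np.log(np.asarray(tempo, dtype=float))
        return (x - x.mean()) / x.std(ddof=1)

    def summary(x):
        """Location, spread, skewness, and excess kurtosis as defined in Table 2.1."""
        x = np.asarray(x, dtype=float)
        s = x.std(ddof=1)
        z = (x - x.mean()) / s
        q1, med, q2 = np.percentile(x, [25, 50, 75])
        return {"mean": x.mean(), "median": med, "s": s, "IQR": q2 - q1,
                "skewness": np.mean(z**3), "kurtosis": np.mean(z**4) - 3}

    # hypothetical tempo measurements (beats per minute) within one section of a performance
    tempo = [72, 70, 68, 71, 74, 60, 55, 73, 75, 76, 74, 40]
    print(summary(standardize_log_tempo(tempo)))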
Figure 2.5 Quantile-quantile plots comparing the tempo distributions of selected performances (Demus, Ortiz, Argerich, Krust, Cortot 1935 and 1947, Horowitz 1947 and 1963); panels (a) through (f) are discussed in the text.
2.3.2 Notes modulo 12

In most classical music, a central tone around which notes fluctuate can be identified, and a small selected number of additional notes or chords (often triads) play a special role. For instance, from about 400 to 1500 A.D., music was mostly written using so-called modes. The main notes
were the first one (finalis, the final note) and the fifth note of the scale (dominant). The system of 12 major and 12 minor scales was developed later, adding more flexibility with respect to modulation and scales. The main representatives of a major/minor scale are three triads, obtained by adding thirds, starting at the basic note corresponding to the first (tonic), fourth (subdominant), and fifth (dominant) note of the scale, respectively. Other triads are also, but to a lesser degree, associated with the properties tonic, subdominant, and/or dominant. In the 20th century, and partially already in the late 19th century, other systems of scales as well as systems that do not rely on any specific scales were proposed (in particular 12-tone music).
Figure 2.6 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16 (notes number 1+j to 16+j, j = 0, ..., 64); horizontal axis: (note − tonic) mod 12. Panels: (a) J.S. Bach, Fugue 1; (b) W.A. Mozart, KV 545; (c) R. Schumann, op. 15/2; (d) R. Schumann, op. 15/3.
A very simple illustration of this development can be obtained by counting the frequencies of notes (pitches) in the following way: consider a score in equal temperament. Ignoring transposition by octaves, we can represent all notes x(t1), ..., x(tn) by the integers 0, 1, ..., 11. Here, t1 ≤ t2 ≤ ... ≤ tn
Figure 2.7 Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16 (notes number 1+j to 16+j, j = 0, ..., 64); horizontal axis: (note − tonic) mod 12. Panels: (a) A. Scriabin, op. 51/2; (b) A. Scriabin, op. 51/4; (c) F. Martin, Prélude 6; (d) F. Martin, Prélude 7.
To make different compositions comparable, the notes are centered by subtracting the central note, which is defined to be the most frequent note. Given a prespecified integer k (in our case k = 16), we calculate the relative frequencies

pj(x) = (2k + 1)⁻¹ Σ_{i=j}^{j+2k} 1{x(ti) = x}

where 1{x(ti) = x} = 1 if x(ti) = x and zero otherwise, and j = 1, 2, ..., n − 2k − 1. This means that we calculate the distribution of notes for a moving window of 2k + 1 notes. Figures 2.6a through d and 2.7a through d display the distributions pj(x) (j = 4, 8, ..., 64) for the following compositions: Fugue 1 from Das Wohltemperierte Klavier I by J.S. Bach (1685-1750), Sonata KV 545 (first movement) by W.A. Mozart (1756-1791; Figure 2.8), Kinderszenen No. 2 and 3 by R. Schumann (1810-1856; Figure 2.9), Préludes op. 51, No. 2 and 4 by A. Scriabin (1872-1915) and Préludes No.
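The computation behind Figures 2.6 and 2.7 is easy to reproduce. The following minimal Python sketch (an illustration, not code from the book; the note list and window length are hypothetical placeholders) computes the moving-window relative frequencies pj(x) for the centered pitch classes 0, ..., 11.

import numpy as np

def moving_pitch_class_frequencies(pitches, k=16):
    """Relative frequencies of pitch classes 0..11 in moving windows of 2k+1 notes.

    pitches: pitch numbers in score-onset order (assumed input).
    Returns an array p with p[j, x] approximating p_{j+1}(x)."""
    pc = np.asarray(pitches) % 12                    # ignore octave transpositions
    center = np.argmax(np.bincount(pc, minlength=12))
    pc = (pc - center) % 12                          # center: most frequent pitch class becomes 0
    n = len(pc)
    windows = []
    for j in range(n - 2 * k):                       # window i = j, ..., j + 2k
        w = pc[j:j + 2 * k + 1]
        windows.append(np.bincount(w, minlength=12) / (2 * k + 1))
    return np.array(windows)

# hypothetical toy example: an ascending C major scale repeated a few times
example = [60, 62, 64, 65, 67, 69, 71, 72] * 8
p = moving_pitch_class_frequencies(example, k=16)
print(p.shape, p[0])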
Figure 2.8 Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)
6 and 7 by F. Martin (1890-1971). For each j = 4, 8, ..., 64, the frequencies pj(0), ..., pj(11) are joined by lines respectively. The obvious common feature for Bach, Mozart and Schumann is a distinct preference (local maximum) for the notes 5 and 7 (apart from 0). Note that if 0 is the root of the tonic triad, then 5 corresponds to the root of the subdominant triad. Similarly, 7 is the root of the dominant triad. Also relatively frequent are the notes 3 = minor third (second note of the tonic triad in minor) and 10 = minor seventh, which is the fourth note of the dominant seventh chord to the subtonic. Also note that, for Schumann, the local maxima are somewhat less pronounced. A different pattern can be observed for Scriabin and even more so for Martin. In Scriabin's Prélude op. 51/2, the perfect fifth almost never occurs, but instead the major sixth is very frequent. In Scriabin's Prélude op. 51/4, the tonal system is dissolved even further, as the clearly dominating note is 6, which builds together with 0 the augmented fourth (or diminished fifth), an interval that is considered highly dissonant in tonal music. Nevertheless, even in Scriabin's compositions, the distribution of notes does not change very rapidly, since the sixteen overlayed curves are almost identical. This may indicate that the notion of scales or a slow harmonic development still plays a role. In contrast, in Frank Martin's Prélude No. 6, the distribution changes very quickly. This is hardly surprising, since Martin's style incorporates, among other influences, dodecaphonism (12-tone music), a compositional technique that does not impose traditional restrictions on the harmonic structure.

2.4 Some descriptive statistics and plots for bivariate data

2.4.1 Definitions

We give a short overview of important descriptive concepts for bivariate data. For a comprehensive treatment we refer the reader to the standard text books given above (also see e.g. Plackett 1960, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998, and Rao 1973 for basic theoretical results).

Correlation

If each observation consists of a pair of measurements (xi, yi), then the main objective is to investigate the relationship between x and y. Consider, for example, the case where both variables are quantitative. The data can then be displayed in a scatter plot (y versus x). Useful statistics are Pearson's sample correlation

r = (1/n) Σ_{i=1}^n ((xi − x̄)/sx)((yi − ȳ)/sy) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )    (2.3)

where s²x = n⁻¹ Σ_{i=1}^n (xi − x̄)² and s²y = n⁻¹ Σ_{i=1}^n (yi − ȳ)²,
and Spearman's

rSp = (1/n) Σ_{i=1}^n ((ui − ū)/su)((vi − v̄)/sv) = Σ_{i=1}^n (ui − ū)(vi − v̄) / √( Σ_{i=1}^n (ui − ū)² · Σ_{i=1}^n (vi − v̄)² )    (2.4)
where ui denotes the rank of xi among the x-values and vi is the rank of yi among the y-values. In (2.3) and (2.4) it is assumed that sx, sy, su and sv are not zero. Recall that these definitions imply the following properties: a) −1 ≤ r, rSp ≤ 1; b) r = 1 if and only if yi = βo + β1 xi and β1 > 0 (exact linear relationship with positive slope); c) r = −1 if and only if yi = βo + β1 xi and β1 < 0 (exact linear relationship with negative slope); d) rSp = 1 if and only if xi > xj implies yi > yj (strictly monotonically increasing relationship); e) rSp = −1 if and only if xi > xj implies yi < yj (strictly monotonically decreasing relationship); f) r measures the strength (and sign) of the linear relationship; g) rSp measures the strength (and sign) of monotonicity; h) if the data are realizations of a bivariate random variable (X, Y), then r is an estimate of the population correlation ρ = cov(X, Y)/√(var(X) var(Y)), where cov(X, Y) = E[XY] − E[X]E[Y], var(X) = cov(X, X) and var(Y) = cov(Y, Y). When using these measures of dependence one should bear in mind that each of them measures a specific type of dependence only, namely linear and monotonic dependence respectively. Thus, a Pearson or Spearman correlation near or equal to zero does not necessarily mean independence. Note also that correlation can be interpreted in a geometric way as follows: defining the n-dimensional vectors x = (x1, ..., xn)^t and y = (y1, ..., yn)^t, r is equal to the standardized scalar product between x and y, and is therefore equal to the cosine of the angle between these two vectors.

A special type of correlation is interesting for time series. Time series are data that are taken in a specific ordered (usually temporal) sequence. If Y1, Y2, ..., Yn are random variables observed at time points i = 1, ..., n, then one would like to know whether there is any linear dependence between observations Yi and Yi−k, i.e. between observations that are k time units apart. If this dependence is the same for all time points i, and the expected value of Yi is constant, then the corresponding population correlation can be written as a function of k only (see Chapter 4),

ρ(k) = cov(Yi, Yi+k) / √( var(Yi) var(Yi+k) )    (2.5)

and it can be estimated by the lag-k sample autocorrelation

ρ̂(k) = (1/n) Σ_{i=1}^{n−k} ((yi − ȳ)/s)((yi+k − ȳ)/s).    (2.6)
The sum in (2.6) extends only to n − k, because no data are available beyond (n − k) + k = n. For large lags (large compared to n), ρ̂(k) is not a very precise estimate, since there are only very few pairs that are k time units apart. The definition of ρ(k) and ρ̂(k) can be extended to multivariate time series, taking into account that dependence between different components of the series may be delayed. For instance, for a bivariate time series (Xi, Yi) (i = 1, 2, ...), one considers lag-k sample cross-correlations

ρ̂XY(k) = (1/n) Σ_{i=1}^{n−k} ((xi − x̄)/sX)((yi+k − ȳ)/sY)    (2.7)

as estimates of the population cross-correlations

ρXY(k) = cov(Xi, Yi+k) / √( var(Xi) var(Yi+k) )    (2.8)

where s²X = n⁻¹ Σ (xi − x̄)(xi+k − x̄) and s²Y = n⁻¹ Σ (yi − ȳ)(yi+k − ȳ). If ρ̂XY(k) is high, then there is a strong linear dependence between Xi and Yi+k.
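As a concrete illustration of (2.3)-(2.8), the following short Python sketch (illustrative only, not code from the book) computes Pearson and Spearman correlations and lag-k sample auto- and cross-correlations. Ordinary standard deviations are used for the scaling in the cross-correlation, which may differ slightly from the exact normalization used in the text; ties are not averaged in the rank computation.

import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))   # eq. (2.3)

def spearman(x, y):
    rank = lambda z: np.argsort(np.argsort(z)) + 1    # ranks 1..n (ties not averaged)
    return pearson(rank(x), rank(y))                  # eq. (2.4): Pearson correlation of ranks

def autocorr(y, k):
    y = np.asarray(y, float)
    n, ybar, s2 = len(y), y.mean(), y.var()
    return np.sum((y[:n - k] - ybar) * (y[k:] - ybar)) / (n * s2)     # eq. (2.6)

def crosscorr(x, y, k):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    return np.sum((x[:n - k] - x.mean()) * (y[k:] - y.mean())) / (n * x.std() * y.std())  # eq. (2.7)

# toy data (hypothetical): y follows x with a delay of 2 time units
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.roll(x, 2) + 0.3 * rng.normal(size=200)
print(pearson(x, y), spearman(x, y), autocorr(x, 1), crosscorr(x, y, 2))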
Regression

In addition to measuring the strength of dependence between two variables, one is often interested in finding an explicit functional relationship. For instance, it may be possible to express the response variable y in terms of an explanatory variable x by y = g(x, ε), where ε is a variable representing the part of y that is unexplained. More specifically, we may have, for example, an additive relationship y = g(x) + ε or a multiplicative equation y = g(x)e^ε. The simplest relationship is given by the simple linear regression equation

y = βo + β1 x + ε    (2.9)

where ε is assumed to be a random variable with E(ε) = 0 (and usually finite variance σ² = var(ε) < ∞). Thus, the data are yi = βo + β1 xi + εi (i = 1, ..., n), where the εi's are generated by the same zero-mean distribution. Often the εi's are also assumed to be uncorrelated or even independent; this is, however, not a necessary assumption. An obvious estimate of the unknown parameters βo and β1 is obtained by minimizing the total sum of squared errors

SSE = SSE(bo, b1) = Σ_{i=1}^n (yi − bo − b1 xi)² = Σ_{i=1}^n r_i²(bo, b1)    (2.10)
with respect to bo, b1. The solution is found by setting the partial derivatives with respect to bo and b1 equal to zero. A more elegant way to find the solution is obtained by interpreting the problem geometrically: defining the n-dimensional vectors 1 = (1, ..., 1)^t, b = (bo, b1)^t and the n × 2 matrix X with columns 1 and x, we have

SSE = ‖y − bo·1 − b1·x‖² = ‖y − Xb‖²
where ‖·‖² denotes the squared Euclidean norm, i.e. the squared length of the vector. It is then clear that SSE is minimized by the orthogonal projection of y on the plane spanned by 1 and x. The estimate of β = (βo, β1)^t is therefore

β̂ = (β̂o, β̂1)^t = (X^t X)⁻¹ X^t y    (2.11)
and the projection, which is the vector of estimated values ŷi, is given by

ŷ = (ŷ1, ..., ŷn)^t = X (X^t X)⁻¹ X^t y    (2.12)

Defining the measure of the total variability of y, SST = ‖y − ȳ·1‖² (total sum of squares), and the quantities SSR = ‖ŷ − ȳ·1‖² (regression sum of squares = variability due to the fact that the fitted line is not horizontal) and SSE = ‖y − ŷ‖² (error sum of squares, variability unexplained by the regression line), we have by Pythagoras

SST = SSR + SSE    (2.13)

The proportion of variability explained by the regression line ŷ = β̂o + β̂1 x is therefore

R² = SSR/SST = ‖ŷ − ȳ·1‖² / ‖y − ȳ·1‖² = 1 − SSE/SST = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)².    (2.14)

By definition, 0 ≤ R² ≤ 1, and R² = 1 if and only if ŷi = yi (i.e. all points are on the regression line). Moreover, for simple regression we also have R² = r². The advantage of defining R² as above (instead of via r²) is that the definition remains valid for the multiple regression model (see below), i.e. when several explanatory variables are available. Finally, note that an estimate of σ² is obtained by σ̂² = (n − 2)⁻¹ Σ r_i²(β̂o, β̂1).

In analogy to the sample mean and the sample variance, the least squares estimates of the regression parameters are sensitive to the presence of outliers. Outliers in regression can occur in the y-variable as well as in the x-variable. The latter are also called influential points. Outliers may often be correct and in fact very interesting observations (e.g. telling us that the assumed model may not be correct). However, since least squares estimates are highly influenced by outliers, it is often difficult to notice that there may be a problem, since the fitted curve tends to lie close to the outliers. Alternative, robust estimates can be helpful in such situations (see Huber 1981, Hampel et al. 1986). For instance, instead of minimizing the residual sum of squares we may minimize Σ ρ(ri), where ρ is a bounded function. If ρ is differentiable, then the solution can usually also be found by solving the equations

Σ_{i=1}^n ψ(ri/σ̂) ∂ri(b)/∂bj = 0    (j = 0, ..., p)    (2.15)

where σ̂² is a robust estimate of σ² obtained from an additional equation and p is the number of explanatory variables. This leads to estimates that
are (up to a certain degree) robust with respect to outliers in y, not, however, with respect to influential points (outliers in x). To control the effect of influential points one can, for instance, solve a set of equations

Σ_{i=1}^n ψj(ri/σ̂, xi) = 0    (j = 0, ..., p)    (2.16)
where ψj is such that it downweighs outliers in x as well. For a comprehensive theory of robustness see e.g. Huber (1981), Hampel et al. (1986). For more recent, efficient and highly robust methods see Yohai (1987), Rousseeuw and Yohai (1984), Gervini and Yohai (2002), and references therein.

The results for simple linear regression can be extended easily to the case where more than one explanatory variable is available. The multiple linear regression model with p explanatory variables is defined by y = βo + β1 x1 + ... + βp xp + ε. For data we write yi = βo + β1 xi1 + ... + βp xip + εi (i = 1, ..., n). Note that the word linear refers to linearity in the parameters βo, ..., βp. The function itself can be nonlinear. For instance, we may have polynomial regression with y = βo + β1 x + ... + βp x^p + ε. The same geometric arguments as above apply, so that (2.11) and (2.12) hold with β = (βo, ..., βp)^t and the n × (p + 1) matrix X = (x(1), ..., x(p+1)) with columns x(1) = 1 and x(j+1) = xj = (x1j, ..., xnj)^t (j = 1, ..., p).

Regression smoothing

A more general, but more difficult, approach to modeling a functional relationship is to impose less restrictive assumptions on the function g. For instance, we may assume

y = g(x) + ε    (2.17)

with g being a twice continuously differentiable function. Under suitable additional conditions on x and ε it is then possible to estimate g from observed data by nonparametric smoothing. As a special example consider observations yi taken at time points i = 1, 2, ..., n. A standard model is

yi = g(ti) + εi    (2.18)
where ti = i/n and εi are independent identically distributed (iid) random variables with E(εi) = 0 and σ² = var(εi) < ∞. The reason for using standardized time ti ∈ [0, 1] is that this way g is observed on an increasingly fine grid. This makes it possible to ultimately estimate g(t) for all values of t by using neighboring values ti, provided that g is not too wild. A simple estimate of g can be obtained, for instance, by a weighted average (kernel smoothing)

ĝ(t) = Σ_{i=1}^n wi yi    (2.19)

with weights

wi = wi(t; b, n) = K((t − ti)/b) / Σ_{j=1}^n K((t − tj)/b)    (2.20)
with b > 0 and a kernel function K ≥ 0 such that K(u) = K(−u), K(u) = 0 (|u| > 1) and ∫_{−1}^{1} K(u) du = 1. The role of b is to restrict observations that influence the estimate to a small window of neighboring time points. For instance, the rectangular kernel K(u) = (1/2)·1{|u| ≤ 1} yields the sample mean of the observations yi in the window n(t − b) ≤ i ≤ n(t + b). An even more elegant formula can be obtained by approximating the Riemann sum (nb)⁻¹ Σ_{j=1}^n K((t − tj)/b) by the integral ∫_{−1}^{1} K(u) du = 1:

ĝ(t) = Σ_{i=1}^n wi yi = (nb)⁻¹ Σ_{i=1}^n K((t − ti)/b) yi    (2.21)
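A minimal Python sketch of the kernel smoother (2.21) is given below (illustrative only; the rectangular kernel and bandwidth are example choices, not prescriptions from the text).

import numpy as np

def kernel_smooth(y, t_eval, b):
    """Kernel estimate ĝ(t) = (nb)^{-1} Σ K((t - t_i)/b) y_i with a rectangular kernel,
    for standardized design points t_i = i/n."""
    n = len(y)
    t_i = np.arange(1, n + 1) / n
    K = lambda u: 0.5 * (np.abs(u) <= 1)          # rectangular kernel, integrates to 1
    g_hat = []
    for t in np.atleast_1d(t_eval):
        w = K((t - t_i) / b)
        g_hat.append(np.sum(w * y) / (n * b))     # eq. (2.21); weights need not sum exactly to one
    return np.array(g_hat)

# hypothetical example: noisy observations of a smooth trend
rng = np.random.default_rng(1)
n = 200
t = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * t) + 0.2 * rng.normal(size=n)
print(kernel_smooth(y, [0.25, 0.5, 0.75], b=0.05))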
In this case, the sum of the weights is not exactly equal to one, but asymptotically (as n → ∞ and b → 0 such that nb³ → ∞) this error is negligible. It can be shown that, under fairly general conditions on g and ε, ĝ converges to g, in a certain sense that depends on the specific assumptions (see e.g. Gasser and Müller 1979, Gasser and Müller 1984, Härdle 1991, Beran and Feng 2002, Wand and Jones 1995, and references therein). An alternative to kernel smoothing is local polynomial fitting (Fan and Gijbels 1995, 1996; also see Feng 1999). The idea is to fit a polynomial locally, i.e. to data in a small neighborhood of the point of interest. This can be formulated as a weighted least squares problem as follows:

ĝ(t) = âo    (2.22)
where â = (âo, â1, ..., âp)^t solves a local least squares problem defined by

â = arg min_a Σ_{i=1}^n K((ti − t)/b) r_i²(a).    (2.23)

Here ri = yi − [ao + a1(ti − t) + ... + ap(ti − t)^p], K is a kernel as above and b > 0 is the bandwidth defining the window of neighboring observations. It can be shown that asymptotically, a local polynomial smoother can be written as a kernel estimator (Ruppert and Wand 1994). A difference only occurs at the borders (t close to 0 or 1) where, in contrast to the local polynomial estimate, the kernel smoother has to be modified. The reason is that observations are no longer symmetrically spaced in the window t ± b. A major advantage of local polynomials is that they automatically provide estimates of derivatives, namely ĝ'(t) = â1, ĝ''(t) = 2â2 etc. Kernel smoothing can also be used for estimation of derivatives; however, different (and rather complicated) kernels have to be used for each derivative (Gasser and Müller 1984, Gasser et al. 1985). A third alternative, so-called wavelet
thresholding, will not be discussed here (see e.g. Daubechies 1992, Donoho and Johnston 1995, 1998, Donoho et al. 1995, 1996, Vidakovic 1999, and Percival and Walden 2000 and references therein). A related method based on wavelets is discussed in Chapter 5.

Smoothing of two-dimensional distributions, sharpening

Estimating a relationship between x and y (where x and y are realizations of random variables X and Y respectively) amounts to estimating the joint two-dimensional distribution function F(x, y) = P(X ≤ x, Y ≤ y). For continuous variables with F(x, y) = ∫_{u≤x} ∫_{v≤y} f(u, v) du dv, the density function f can be estimated, for instance, by a two-dimensional histogram. For visual and theoretical reasons, a better estimate is obtained by kernel estimation (see e.g. Silverman 1986), defined by

f̂(x, y) = (n b1 b2)⁻¹ Σ_{i=1}^n K(xi − x, yi − y; b1, b2)    (2.24)

where the kernel K is such that K(u, v) = K(−u, v) = K(u, −v) ≥ 0 and ∫∫ K(u, v) du dv = 1. Usually, b1 = b2 = b and K(u, v) has compact support. Examples of kernels are K(u, v) = (1/4)·1{|u| ≤ 1}·1{|v| ≤ 1} (rectangular kernel with rectangular support), K(u, v) = π⁻¹·1{u² + v² ≤ 1} (rectangular kernel with circular support), K(u, v) = 2π⁻¹[1 − u² − v²] for u² + v² ≤ 1 (Epanechnikov kernel with circular support) or K(u, v) = (2π)⁻¹ exp[−(1/2)(u² + v²)] (normal density kernel with infinite support). In analogy to one-dimensional density estimation, it can be shown that, under mild regularity conditions, f̂(x, y) is a consistent estimate of f(x, y), provided that b1, b2 → 0 and nb1, nb2 → ∞. Graphical representations of two-dimensional distribution functions are: 3-dimensional perspective plot: z = f(x, y) (or f̂(x, y)) is plotted against x and y; contour plot: like in a geographic map, curves corresponding to equal levels of f are drawn in the xy-plane; image plot: coloring of the xy-plane with the color at point (x, y) corresponding to the value of f. A simple way of enhancing the visual understanding of scatterplots is so-called sharpening (Tukey and Tukey 1981; also see Chambers et al. 1983): for given numbers a and b, only points with a ≤ f̂(x, y) ≤ b are drawn in the scatterplot. Alternatively, one may plot all points and highlight points with a ≤ f̂(x, y) ≤ b.

Interpolation

Often a process may be generated in continuous time, but is observed at discrete time points. One may then wish to guess the values of the points
in between. Kernel and local polynomial smoothing provide this possibility, since ĝ(t) can be calculated for any t ∈ (0, 1). Alternatively, if the observations are assumed to be completely without error, i.e. yi = g(ti), then deterministic interpolation can be used. The most popular method is spline interpolation. For instance, cubic splines connect neighboring observed values yi−1, yi by cubic polynomials such that the first and second derivatives at the endpoints ti−1, ti are equal. For observations y1, ..., yn at equidistant time points ti with ti − ti−1 = tj − tj−1 = Δt (i, j = 1, ..., n), we have n − 1 polynomials

pi(t) = ai + bi(t − ti) + ci(t − ti)² + di(t − ti)³    (i = 1, ..., n − 1)    (2.25)
To achieve smoothness at the points ti where two polynomials pi−1, pi meet, one imposes the condition that the polynomials and their first two derivatives are equal at ti. This, together with the conditions pi(ti) = yi, leads to a system of 3(n − 2) + n = 4(n − 1) − 2 equations for the 4(n − 1) parameters ai, bi, ci, di (i = 1, ..., n − 1). To specify a unique solution one therefore needs two additional conditions at the border. A typical assumption is p''(t1) = p''(tn) = 0, which defines so-called natural splines. Cubic splines have a physical meaning, since these are the curves that form when a thin rod is forced to pass through n knots (in our case the knots are t1, ..., tn), corresponding to minimum strain energy. The term spline refers to the thin flexible rods that were used in the past by draftsmen to draw smooth curves in ship design. In spite of their natural meaning, interpolation splines (and similarly other methods of interpolation) can be problematic, since the interpolated values may be highly dependent on the specific method of interpolation and are therefore purely hypothetical, unless the aim is indeed to build a ship. Splines can also be used for smoothing purposes by removing the restriction that the curve has to go through all observed points. More specifically, one looks for a function g(t) such that
V(λ) = Σ_{i=1}^n [yi − g(ti)]² + λ ∫ [g''(t)]² dt    (2.26)
is minimized. The parameter λ > 0 controls the smoothness of the resulting curve. For small values of λ, the fitted curve will be rather rough but close to the data; for large values more smoothness is achieved but the curve is, in general, not as close to the data. The question of which λ to choose reflects a standard dilemma in statistical smoothing: one needs to balance the aim of achieving a small bias (λ small) against the aim of a small variance (λ large). For a given value of λ, the solution to the minimization problem above turns out to be a natural cubic spline (see Reinsch 1967; also see Wahba 1990 and references therein). The solution can also be written as a kernel smoother with a kernel function K(u) proportional
to exp(−|u|/√2) sin(π/4 + |u|/√2) and a bandwidth b proportional to λ^{1/4} (Silverman 1986). If ti = i/n, then the bandwidth is exactly equal to λ^{1/4}.

Statistical inference

In this section, correlation, linear regression, nonparametric smoothing, and interpolation were introduced in an informal way, without exact discussion of probabilistic assumptions and statistical inference. All these techniques can be used in an informal way to explore possible structures without specific model assumptions. Sometimes, however, one wishes to obtain more solid conclusions by statistical tests and confidence intervals. There is an enormous literature on statistical inference in regression, including nonparametric approaches. For selected results see the references given above. For nonparametric methods also see Wand and Jones (1995), Simonoff (1996), Bowman and Azzalini (1997), Eubank (1999) and references therein.

2.5 Specific applications in music: bivariate

2.5.1 Empirical tempo-acceleration

Consider the tempo curves in Figure 2.3. An approximate measure of tempo-acceleration may be defined by

a(ti) = [y(ti) − y(ti−1)]/[ti − ti−1] − [y(ti−1) − y(ti−2)]/[ti−1 − ti−2] ≈ ∂²y(t)/∂t²    (2.27)
where y(t) is the tempo (or log-tempo) at time t. Figures 2.10a through f show a(t) for the three performances each by Cortot and by Horowitz. From the pictures it is not easy to see to what extent there are similarities or differences. Consider now the pairs (aj(ti), al(ti)), where aj, al are acceleration measurements of performance j and l respectively. We calculate the sample correlations for each pair (j, l) ∈ {1, ..., 28} × {1, ..., 28}, j ≠ l. Figure 2.11a shows the correlations between Cortot 1 (1947) and the other performances. As expected, Cortot correlates best with Cortot: the correlation between Cortot 1 and Cortot's other two performances (1947, 1953) is clearly highest. The analogous observation can be made for Horowitz 1 (1947) (Figure 2.11b). Also interesting is to compare how much overall resemblance there is between a selected performance and the other performances. For each of the 28 performances, the average and the maximal correlation with the other performances were calculated. Figures 2.11c and d indicate that, in terms of acceleration, Cortot's style appears to be quite unique among the pianists considered here. The overall (average and maximal) similarity between each of his three acceleration curves and the other performances is much smaller than for any other pianist.
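A minimal Python sketch of this comparison (illustrative only; the data are hypothetical placeholders) computes the discrete acceleration (2.27) for each tempo curve and the pairwise correlations between performances.

import numpy as np

def acceleration(t, y):
    """Discrete tempo-acceleration a(t_i) as in eq. (2.27): difference of successive slopes."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    slopes = np.diff(y) / np.diff(t)
    return np.diff(slopes)

def pairwise_correlations(acc_curves):
    """Correlation matrix between the acceleration curves of several performances."""
    return np.corrcoef(np.vstack(acc_curves))

# hypothetical toy data: three 'performances' of the same piece
rng = np.random.default_rng(2)
onsets = np.arange(0, 32, 0.125)                       # score-onset times in eighths
curves = [np.exp(-0.02 * onsets) + 0.05 * rng.normal(size=len(onsets)) for _ in range(3)]
acc = [acceleration(onsets, y) for y in curves]
print(pairwise_correlations(acc).round(2))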
[Figure 2.10a-f: acceleration a(t) plotted against onset time t; plot panels omitted.]
2.5.2 Interpolated and smoothed tempo curves: velocity and acceleration

Conceptually it is plausible to assume that musicians control tempo in continuous time. The measure of acceleration given above is therefore a rather crude estimate of the actual acceleration curve. Interpolation splines provide a simple possibility to guess the tempo and its derivatives between the observed time points. One should bear in mind, however, that interpolation is always based on specific assumptions. For instance, cubic splines assume that the curve between two consecutive time points where observations are available is, or can be well approximated by, a third-degree polynomial. This assumption can hardly be checked experimentally and can lead to undesirable effects. Figure 2.12 shows the observed and interpolated tempo for Martha Argerich. While most of the interpolated values seem plausible, there are a few rather doubtful interpolations (marked with arrows) where the cubic polynomial by far exceeds each of the two observed values at the neighboring knots.
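A natural cubic interpolation spline of this kind can be computed, for example, with SciPy; the following sketch (an illustration, not the book's code; the tempo data are hypothetical) interpolates a tempo curve on a finer grid and also returns first and second derivatives.

import numpy as np
from scipy.interpolate import CubicSpline

# hypothetical tempo observations y(t_i) at score-onset times t_i (in eighths)
rng = np.random.default_rng(3)
t_obs = np.arange(1, 33, 1.0)
y_obs = 60 + 10 * np.sin(2 * np.pi * t_obs / 32) + rng.normal(0, 1, len(t_obs))

spline = CubicSpline(t_obs, y_obs, bc_type='natural')   # natural spline: second derivative 0 at the borders
t_fine = np.linspace(t_obs[0], t_obs[-1], 500)
tempo_interp = spline(t_fine)        # interpolated tempo
velocity = spline(t_fine, 1)         # first derivative
accel = spline(t_fine, 2)            # second derivative
print(tempo_interp[:3], velocity[:3], accel[:3])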
[Figure 2.11a-d: correlation and mean correlation of acceleration plotted against performance (28 recordings); plot panels omitted.]
2.5.3 Tempo: hierarchical decomposition by smoothing

The tempo curve may be thought of as an aggregation of mostly smooth tempo curves at different onset-time scales. This corresponds to the general structure of music as a mixture of global and local structures at various scales. It is therefore interesting to look at smoothed tempo curves, and their derivatives, at different scales. Reasonable smoothing bandwidths may be guessed from the general structure of the composition, such as time signature(s), rhythmic, metric, melodic, and harmonic structure, and so on. For tempo curves of Schumann's Träumerei (Figure 2.3), even multiples of 1/8th are plausible. Figures 2.13 through 2.16 show the following kernel-smoothed tempo curves with b1 = 8, b2 = 1, and b3 = 1/8 respectively:

ĝ1(t) = (nb1)⁻¹ Σ K((t − ti)/b1) yi    (2.28)
ĝ2(t) = (nb2)⁻¹ Σ K((t − ti)/b2) [yi − ĝ1(ti)]    (2.29)
ĝ3(t) = (nb3)⁻¹ Σ K((t − ti)/b3) [yi − ĝ1(ti) − ĝ2(ti)]    (2.30)

and the residuals

e(ti) = yi − ĝ1(ti) − ĝ2(ti) − ĝ3(ti).    (2.31)
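A minimal sketch of this hierarchical decomposition (illustrative; the Epanechnikov kernel, the normalized weights of (2.19)-(2.20), and the toy data are choices made here, not prescriptions from the text): each level smooths what the previous levels left over.

import numpy as np

def ksmooth(t_eval, t_obs, values, b):
    """Weighted-average kernel smoother with normalized weights (eqs. 2.19-2.20),
    using an Epanechnikov kernel."""
    K = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)
    out = []
    for t in t_eval:
        w = K((t - t_obs) / b)
        out.append(np.sum(w * values) / np.sum(w))
    return np.array(out)

def hierarchical_decomposition(t, y, bandwidths=(8.0, 1.0, 0.125)):
    """Smooth components ĝ1, ĝ2, ĝ3 at decreasing bandwidths and the residual e,
    in the spirit of (2.28)-(2.31)."""
    components, resid = [], np.asarray(y, float)
    for b in bandwidths:
        g = ksmooth(t, t, resid, b)
        components.append(g)
        resid = resid - g
    return components, resid

# hypothetical tempo curve sampled at eighth-note score onsets
rng = np.random.default_rng(4)
t = np.arange(1, 33, 0.125)
y = 60 - 0.3 * t + 3 * np.sin(2 * np.pi * t / 4) + rng.normal(0, 0.5, len(t))
(g1, g2, g3), e = hierarchical_decomposition(t, y)
print(g1[:3], e[:3])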
[Figures 2.13 and 2.14: smoothed tempo curves, one panel per pianist, plotted against t; plot panels omitted.]
[Plot panels omitted (one panel per pianist, plotted against t).]
Figure 2.15 Smoothed tempo curves ĝ3(t) = (nb3)⁻¹ Σ K((t − ti)/b3)[yi − ĝ1(ti) − ĝ2(ti)] (b3 = 1/8).
[Plot panels omitted (one panel per pianist, plotted against t).]
Figure 2.16 Smoothed tempo curves: residuals e(ti) = yi − ĝ1(ti) − ĝ2(ti) − ĝ3(ti).
The tempo curves are thus decomposed into curves corresponding to a hierarchy of bandwidths. Each component reveals specific features. The first component reflects the overall tendency of the tempo. Most pianists have an essentially monotonically decreasing curve corresponding to a gradual, and towards the end emphasized, ritardando. For some performances (in particular Bunin, Capova, Gianoli, Horowitz 1, Kubalek, and Moiseiwitsch) there is a distinct initial acceleration with a local maximum in the middle of the piece. The second component ĝ2(t) reveals tempo fluctuations that correspond to a natural division of the piece into 8 times 4 bars. Some pianists, like Cortot, greatly emphasize this 8 × 4 structure. For other pianists, such as Horowitz, the 8 × 4 structure is less evident: the smoothed tempo curve is mostly quite flat, though the main, but smaller, tempo changes do take place at the junctions of the eight parts. Striking is also the distinction between part B (bars 17 to 24) and the other parts (A, A′, A″) of the composition, in particular in Argerich's performance. The third component characterizes fluctuations at the resolution level of 2/8th. At this very local level, tempo changes frequently for pianists like Horowitz, whereas there is less local movement in Cortot's performances. Finally, the residuals e(t) consist of the remaining fluctuations at the finest resolution of 1/8th. The similarity between the three residual curves by Horowitz illustrates that even at this very fine level, the seismic variation of tempo is a highly controlled process that is far from random.
In Chapter 3, the so-called melodic indicator will be introduced. One of the aims will be to explain some of the variability in tempo curves by melodic structures in the score. Consider a simple melodic indicator m(t) = wmelod(t) (see Section 3.3.4) that is essentially obtained by adding all indicators corresponding to individual motifs. Figures 2.17a and d display smoothed curves obtained by local polynomial smoothing of m(t) using a large and a small bandwidth respectively. Figures 2.17b and e show the first derivatives of the two curves in 2.17a and d. Similarly, the second derivatives are given in Figures 2.17c and f. For the tempo curves, the first and second derivatives of local polynomial fits with b = 4 are given in Figures 2.18 and 2.19 respectively. A resemblance can be found in particular between the second derivative of m(t) in Figure 2.17f and the second derivatives of tempo curves in Figure 2.19. Also, there are interesting similarities and differences between the performances with respect to the local variability of the first two derivatives. Many pianists start with a very small second derivative, with strongly increased values in part B.
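Local polynomial fits and their derivatives, as used for Figures 2.17 through 2.19, can be sketched as follows (illustrative only; the quadratic local fit, the kernel, the bandwidth and the toy melodic-indicator values are example choices).

import numpy as np

def local_poly_fit(t_eval, t_obs, y, b, degree=2):
    """Local polynomial smoothing in the spirit of (2.22)-(2.23): a weighted least
    squares polynomial fit around each evaluation point; returns ĝ, ĝ' and ĝ''."""
    K = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)   # Epanechnikov kernel
    g, g1, g2 = [], [], []
    for t in t_eval:
        u = t_obs - t
        sw = np.sqrt(K(u / b))
        X = np.vander(u, degree + 1, increasing=True)     # columns 1, (t_i - t), (t_i - t)^2
        a = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        g.append(a[0])            # ĝ(t)  = â_0
        g1.append(a[1])           # ĝ'(t) = â_1
        g2.append(2 * a[2])       # ĝ''(t) = 2 â_2
    return np.array(g), np.array(g1), np.array(g2)

# hypothetical melodic-indicator values m(t) at eighth-note onsets
rng = np.random.default_rng(5)
t = np.arange(1, 33, 0.125)
m = 80 + 4 * np.cos(2 * np.pi * t / 8) + rng.normal(0, 0.5, len(t))
g, d1, d2 = local_poly_fit(t, t, m, b=3.0)
print(g[:3], d1[:3], d2[:3])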
[Plot panels omitted. Panel titles: a) m(t) (span = 24/32), b) m(t) (span = 24/32), d) m(t) (span = 8/32), e) m(t) (span = 8/32), f) m(t) (span = 8/32); vertical axes: mel. Ind., 1st der., 2nd der.; horizontal axis t.]
Figure 2.17 Melodic indicator: local polynomial fits together with first and second derivatives.
2.5.5 Tempo and loudness

By invitation of Prince Charles, Vladimir Horowitz gave a benefit recital at London's Royal Festival Hall on May 22, 1982. It was his first European appearance in 31 years. One of the pieces played at the concert was Schumann's Kinderszene op. 15, No. 4. Figure 2.20 displays the (approximate) sound wave of Horowitz's performance, sampled from the CD recording. Two variables that can be extracted quite easily by visual inspection are: a) on the horizontal axis, the time when notes are played (and, derived from this quantity, the tempo) and b) on the vertical axis, loudness. More specifically, let t1, ..., tn be the score onset times and u(t1), ..., u(tn) the corresponding performance times. Then an approximate tempo at score-onset time ti can be defined by y(ti) = (ti+1 − ti)/(u(ti+1) − u(ti)). A complication with loudness is that the amplitude level of piano sounds decreases gradually in a complex manner, so that loudness as such is not defined exactly. For simplicity, we therefore define loudness as the initial amplitude level (or rather its logarithm). Moreover, we consider only events where the score-onset time is a multiple of 1/8. For illustration, the first four events (score onset times 1/8, 2/8, 3/8, 4/8) are marked with arrows in Figure 2.20. An interesting question is what kind of relationship there may be between tempo y and loudness level x.
[Plot panels omitted (one panel per pianist; vertical axis: 1st der., horizontal axis t).]
Figure 2.18 Tempo curves (Figure 2.3): first derivatives obtained from local polynomial fits (span 24/32).
[Plot panels omitted (one panel per pianist; vertical axis: 2nd der., horizontal axis t).]
Figure 2.19 Tempo curves (Figure 2.3): second derivatives obtained from local polynomial fits (span 8/32).
Figure 2.20 Kinderszene No. 4: sound wave of the performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.
The autocorrelations of x(ti) = log(Amplitude) and of y(ti), as well as the cross-autocorrelations between the two time series, are shown in Figure 2.21a. The main remarkable cross-autocorrelation occurs at lag 8. This can also be seen visually when plotting y(ti+8) against x(ti) (Figure 2.21b). There appears to be a strong relationship between the two variables, with the exception of four outliers. The three fitted lines correspond to a) a least squares linear regression fit using all data; b) a robust high breakdown point and high efficiency regression (Yohai et al. 1991); and c) a least squares fit excluding the outliers. It should be noted that the outliers all occur together in a temporal cluster (see Figure 2.21c) and correspond to a phase where tempo is at its extreme (lowest for the first three outliers and fastest for the last outlier). This indicates that these are informative outliers (in contrast to wrong measurements) that should not be dismissed, since they may tell us something about the intention of the performer. Finally, Figure 2.21d displays a sharpened version of the scatterplot in Figure 2.21b: points with high estimated joint density f̂(x, y) are marked with "O". In contrast to what one would expect from a regression model with random errors εi that are independent of x, the points with highest density gather around a horizontal line rather than the regression line(s) fitted in Figure 2.21b. Thus, a linear regression model is hardly applicable. Instead, the data may possibly be divided into three clusters: a) a cluster with low loudness and low tempo; b) a second cluster with medium loudness and low to medium tempo; and c) a third cluster with a high level of loudness and medium to high tempo.
Figure 2.21 log(Amplitude) and tempo for Kinderszene No. 4: auto- and cross-correlations (Figure 2.21a), scatter plot with fitted least squares and robust lines (Figure 2.21b), time series plots (Figure 2.21c), and sharpened scatter plot (Figure 2.21d).
2.5.6 Loudness and tempo: two-dimensional distribution function

In the example above, the correlation between loudness and tempo, when measured at the same time, turned out to be relatively small, whereas there appeared to be quite a clear lagged relationship. Does this mean that there is indeed no immediate relationship between these two variables? Consider x(ti) = log(Amplitude) and the logarithm of tempo. The scatterplot and the boxplot in Figures 2.22a and b rather suggest that there may be a relationship, but the dependence is nonlinear. This is further supported by the two-dimensional histogram (Figure 2.23a), the smoothed density (Figure 2.24a) and the corresponding image plots (Figures 2.23b and 2.24b; the actual observations are plotted as stars). The density was estimated by a kernel estimate with the Epanechnikov kernel. Since correlation only measures linear dependence, it cannot detect this kind of highly nonlinear relationship.
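A two-dimensional kernel density estimate of the kind used for Figures 2.21d and 2.24, together with the sharpening rule, can be sketched as follows (illustrative only; a product Epanechnikov kernel is used here for brevity, whereas the book's examples use kernels with circular support, and bandwidths, threshold and data are hypothetical).

import numpy as np

def kde2d(points, x, y, b1, b2):
    """Two-dimensional kernel density estimate (eq. 2.24) with a product
    Epanechnikov kernel of compact support."""
    K = lambda u: 0.75 * np.clip(1.0 - u**2, 0.0, None)
    pts = np.asarray(points, float)
    u = (pts[:, 0] - x) / b1
    v = (pts[:, 1] - y) / b2
    return np.sum(K(u) * K(v)) / (len(pts) * b1 * b2)

# hypothetical (log(Amplitude), log(tempo)) pairs
rng = np.random.default_rng(6)
xy = np.column_stack([rng.normal(0, 1, 300), 0.5 * rng.normal(0, 1, 300)])

# density at each observation, then "sharpening": keep only high-density points
dens = np.array([kde2d(xy, px, py, b1=0.5, b2=0.5) for px, py in xy])
sharpened = xy[dens >= np.quantile(dens, 0.75)]
print(len(sharpened))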
Figure 2.22 Horowitz performance of Kinderszene No. 4: log(tempo) versus log(Amplitude), and boxplots of log(tempo) for three ranges of amplitude.
2.5.7 Melodic tempo-sharpening

Sharpening can also be applied by using an external variable. This is illustrated in Figures 2.25 through 2.27. Figure 2.25a displays the estimated density function of log(m + 1), where m(t) is the value of a melodic indicator at onset time t. The marked region corresponds to very high values of the density function f̂ (namely f̂(x) > 0.793). This defines a set Isharp of corresponding sharpening onset times. The series m(t) is shown in Figure 2.25b, with sharpening onset times t ∈ Isharp highlighted by vertical lines.
Figure 2.23 Horowitz performance of Kinderszene No. 4: two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)), displayed in a perspective and an image plot respectively.
Figure 2.24 Horowitz performance of Kinderszene No. 4: kernel estimate of the two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)), displayed in a perspective and an image plot respectively.
Figure 2.25 R. Schumann, Träumerei op. 15, No. 7: density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).
Figure 2.26 R. Schumann, Träumerei op. 15, No. 7: tempo by Cortot and Horowitz at sharpening onset times.
Figure 2.27 R. Schumann, Träumerei op. 15, No. 7: tempo derivatives for Cortot and Horowitz at sharpening onset times.
Figures 2.26 and 2.27 show the tempo y and its discrete derivative v(ti) = [y(ti+1) − y(ti)]/(ti+1 − ti) for ti ∈ Isharp and the performances by Cortot and Horowitz. The pictures indicate a systematic difference between Cortot and Horowitz. A common feature is the negative derivative at the fifth and sixth sharpening onset time.

2.6 Some multivariate descriptive displays

2.6.1 Definitions

Suppose that we observe multivariate data x1, x2, ..., xn, where each xi is a p-dimensional vector (xi1, ..., xip)^t ∈ R^p. Obvious numerical summary statistics are the sample mean x̄ = (x̄1, x̄2, ..., x̄p)^t, where x̄j = n⁻¹ Σ_{i=1}^n xij, and the sample covariances

Sjl = (n − 1)⁻¹ Σ_{i=1}^n (xij − x̄j)(xil − x̄l).
Most methods for analyzing multivariate data are based on these two statistics. One of the main tools consists of dimension reduction by suitable projections, since it is easier to find and visualize structure in low dimensions. These techniques go far beyond descriptive statistics. We therefore postpone the discussion of these methods to Chapters 8 to 11. Another set of methods consists of visualizing individual multivariate observations. The main purpose is a simple visual identification of similarities and differences between observations, as well as the search for clusters and other patterns. Typical examples are: Faces: xi = (xi1, ..., xip)^t is represented by a face with features depending on the values of the corresponding coordinates. For instance, the face function in S-Plus has the following correspondence between coordinates and feature parameters: xi,1 = area of face; xi,2 = shape of face; xi,3 = length of nose; xi,4 = location of mouth; xi,5 = curve of smile; xi,6 = width of mouth; xi,7 = location of eyes; xi,8 = separation of eyes; xi,9 = angle of eyes; xi,10 = shape of eyes; xi,11 = width of eyes; xi,12 = location of pupil; xi,13 = location of eyebrow; xi,14 = angle of eyebrow; xi,15 = width of eyebrows. Stars: Each coordinate is represented by a ray in a star, the length of each ray corresponding to the value of the coordinate. More specifically, a star for a data vector xi = (xi1, ..., xip)^t is constructed as follows: 1. Scale xi to the range [0, r]: 0 ≤ x1j, ..., xnj ≤ r; 2. Draw p rays at angles θj = 2π(j − 1)/p (j = 1, ..., p); for a star with
origin 0 representing observation xi, the end point of the j-th ray has the coordinates r(xij cos θj, xij sin θj); 3. For visual reasons, the end points of the rays may be connected by straight lines. Profiles: An observation xi = (xi1, ..., xip)^t is represented by a plot of xij versus j, where neighboring points xij−1 and xij (j = 1, ..., p) are connected. Symbol plot: The horizontal and vertical positions represent xi1 and xi2 respectively (or any other two coordinates of xi). The other coordinates xi3, ..., xip determine p − 2 characteristic shape parameters of a geometric object that is plotted at point (xi1, xi2). Typical symbols are circles (one additional dimension), rectangles (two additional dimensions), stars (arbitrary number of additional dimensions), and faces (arbitrary number of additional dimensions).

2.7 Specific applications in music: multivariate

2.7.1 Distribution of notes: Chernoff faces

In music that is based on scales, pitch (modulo 12) is usually not equally distributed. Notes that belong to the main scale are more likely to occur, and within these, there are certain preferred notes as well (e.g. the roots of the tonic, subtonic and supertonic triads). To illustrate this, we consider the following compositions: 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951; Figure 2.28); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996). For each composition, the distribution of notes (pitches) modulo 12 is calculated and centered around the central pitch (defined as the most frequent pitch modulo 12). Thus, the central pitch is defined as zero. We then obtain five vectors of relative frequencies pj = (pj0, ..., pj11)^t (j = 1, ..., 5) characterizing the five compositions. In addition, for each of these vectors the number nj of local peaks in pj is calculated. We say that a local peak occurs at i ∈ {1, ..., 10} if pji > max(pj,i−1, pj,i+1); for i = 11, we say that a local peak occurs if pji > pj,i−1. Figure 2.29a displays Chernoff faces of the 12-dimensional vectors vj = (nj, pj1, ..., pj11)^t. In Figure 2.29b, the coordinates of vj (and thus the assignment of feature variables) were permuted. The two plots illustrate the usefulness of Chernoff faces, and at the same time the difficulties in finding an objective interpretation. On the one hand, the method discovers a plausible division into two groups: both pictures show a clear distinction between classical tonal music (first three faces) and the three representatives of avant-garde music of the 20th century. On the other hand, the
exact nature of the distinction cannot be seen. In Figure 2.29a, the classical faces look much more friendly than the rather miserable avant-garde fellows. The judgment of conservative music lovers that avant-garde music is unbearable, depressing, or even bad for health, seems to be confirmed! Yet, bad temper is the response of the classical masters to a simple permutation of the variables (Figure 2.29b), whereas the grim avant-garde seems to be much more at ease. The difficulty in interpreting Chernoff faces is that the result depends on the order of the variables, whereas, due to their psychological effect, most feature variables are not interchangeable.
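The input vectors for these displays are easy to compute. The following sketch (illustrative; the pitch list is a hypothetical placeholder) builds the centered pitch-class distribution pj and counts its local peaks nj as described above.

import numpy as np

def pitch_class_profile(pitches):
    """Relative frequencies of pitch classes 0..11, centered so that the most
    frequent class becomes 0, plus the number of local peaks."""
    pc = np.asarray(pitches) % 12
    counts = np.bincount(pc, minlength=12)
    counts = np.roll(counts, -int(np.argmax(counts)))   # central pitch moved to position 0
    p = counts / counts.sum()
    # local peaks at i = 1, ..., 10 (both neighbors), boundary case at i = 11
    n_peaks = sum(p[i] > max(p[i - 1], p[i + 1]) for i in range(1, 11))
    n_peaks += int(p[11] > p[10])
    return p, n_peaks

# hypothetical toy example: a C major melody fragment
p, n = pitch_class_profile([60, 64, 67, 72, 62, 65, 69, 60, 67, 64])
print(np.round(p, 2), n)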
Figure 2.28 Arnold Schönberg (1874-1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)
2.7.2 Distribution of notes: star plots

We consider once more the distribution vectors pj = (pj0, ..., pj11)^t of pitch modulo 12, where 0 is the tonal center. In contrast to Chernoff faces, permutation of coordinates in star plots is much less likely to have a subjective influence on the interpretation of the picture. Nevertheless, certain patterns can become more visible when using an appropriate ordering of the variables. From the point of view of tonal music, a natural ordering of pitch can be obtained, for instance, from the ascending circle of fourths. This leads to the following permutation: p̃j = (p5, p10, p3, p8, p1, p6, p11, p4, p9, p2, p7)^t. (p0 is omitted, since it is maximal by definition for all compositions.) Since stars are easy to look at, it is possible to compare a large number of observations simultaneously. We consider the following set of compositions:
[Face panels: ANONYMUS, BACH, SCHUMANN, WEBERN, SCHOENBERG, TAKEMITSU; plots omitted.]
Figure 2.29 a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 1685-1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810-1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874-1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930-1996).
Figure 2.29 b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.
A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!; J. de Ockeghem (1425-1495): Canon epidiatesseron; J. Arcadelt (1505-1568): a) Ave Maria, b) La ingratitud, c) Io dico fra noi; W. Byrd (1543-1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman; J.P. Rameau (1683-1764): a) La Poplinière, b) Le Tambourin, c) La Triomphante; J.S. Bach (1685-1750): Das Wohltemperierte Klavier, Preludes and Fugues No. 5, 6 and 7; D. Scarlatti (1660-1725): Sonatas K 222, K 345 and K 381; J. Haydn (1732-1809): Sonata op. 34, No. 2; W.A. Mozart (1756-1791): 2nd movements of Sonatas KV 332, KV 545 and KV 333; M. Clementi (1752-1832): Gradus ad Parnassum, Studies 2 and 9 (Figure 11.4); R. Schumann (1810-1856): Kinderszenen op. 15, No. 1, 2, and 3; F. Chopin (1810-1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6; R. Wagner (1813-1883): a) Bridal Choir from Lohengrin, b) Ouverture to Act 3 of Die Meistersinger; C. Debussy (1862-1918): a) Claire de lune, b) Arabesque No. 1, c) Reflets dans l'eau; A. Scriabin (1872-1915): Preludes op. 2/2, op. 11/14 and op. 13/2; B. Bartók (1881-1945): a) Bagatelle op. 11, No. 2 and 3, b) Sonata for Piano; O. Messiaen (1908-1992): Vingt regards sur l'enfant de Jésus, No. 3; S. Prokofieff (1891-1953): Visions fugitives No. 11, 12 and 13; A. Schönberg (1874-1951): Piano piece op. 19, No. 2; T. Takemitsu (1930-1996): Rain Tree Sketch No. 1; A. Webern (1883-1945): Orchesterstück op. 6, No. 6; J. Beran (*1959): Šánti, piano concert No. 2 (beginning of 2nd Mov.). The star plots of p̃j are given in Figure 2.31. From Halle (cf. Figure 2.30) up to about the early Scriabin, the long beams form more or less a half-circle. This means that the most frequent notes are neighbors in the circle of fourths and are much more frequent than all other notes. This is indeed what one would expect in music composed in the tonal system. The picture starts changing in the neighborhood of Scriabin, where long beams are either
isolated (most extremely for Bartók's Bagatelle No. 3) or tend to cover more or less the whole range of notes (e.g. Bartók, Prokofieff, Takemitsu, Beran). Due to the variety of styles in the 20th century, the specific shape of each of the stars would need to be discussed in detail individually. For instance, Messiaen's shape may be explained by the specific scales (Messiaen scales) he used. Generally speaking, the difference between star plots of the 20th century and earlier music reflects the replacement of the traditional tonal system with major/minor scales by other principles.
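The star coordinates themselves are straightforward to compute from the reordered frequency vector; the following sketch (illustrative only) follows the star construction described in Section 2.6.1 and the circle-of-fourths ordering used here. Note that the vector is scaled within a single observation for simplicity, whereas the construction in the text scales across all observations.

import numpy as np

# circle-of-fourths ordering of the pitch-class frequencies (p0 omitted, as in the text)
FOURTHS_ORDER = [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7]

def star_coordinates(p, r=1.0):
    """End points of the star rays for one composition.

    p: length-12 vector of pitch-class frequencies (index 0 = tonal center).
    Returns an array of (x, y) end points, one ray per reordered coordinate."""
    v = np.array([p[i] for i in FOURTHS_ORDER], float)
    v = r * v / v.max()                              # scale to [0, r] within this vector
    angles = 2 * np.pi * np.arange(len(v)) / len(v)  # θ_j = 2π(j-1)/p
    return np.column_stack([v * np.cos(angles), v * np.sin(angles)])

# hypothetical frequency vector of a strongly tonal piece
p_example = np.array([0.22, 0.02, 0.12, 0.03, 0.11, 0.13, 0.02, 0.15, 0.04, 0.08, 0.05, 0.03])
print(star_coordinates(p_example).round(2))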
Figure 2.30 The minnesinger Burchard von Wengen (1229-1280), contemporary of Adam de la Halle (1235?-1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 152.)
[Star plot panels, one per composition (HALLE, OCKEGHEM, ARCADELT, BYRD, RAMEAU, BACH, SCARLATTI, HAYDN, MOZART, CLEMENTI, SCHUMANN, CHOPIN, WAGNER, DEBUSSY, SCRIABIN, BARTOK, PROKOFFIEFF, MESSIAEN, SCHOENBERG, WEBERN, TAKEMITSU, BERAN); plots omitted.]
Figure 2.31 Star plots of p̃j = (p6, p11, p4, p9, p2, p7, p12, p5, p10, p3, p8)^t for compositions from the 13th to the 20th century.
2.7.3 Joint distribution of interval steps of envelopes

Consider a composition consisting of onset times ti and pitch values x(ti). In a polyphonic score, several notes may be played simultaneously. To simplify analysis, we define a simplified score by considering the lower and upper envelope:

Definition 24 Let C ⊂ A × B with C = ∪_{j=1}^n Cj, where A = {t1, ..., tn} ⊂ Z+ (t1 < t2 < ... < tn), B ⊂ R or Z, and Cj = {(t, x(t)) ∈ C : t = tj}. Then the lower and upper envelope of C are

{(tj, min_{(t,x(t))∈Cj} x(t)), j = 1, ..., n}  and  {(tj, max_{(t,x(t))∈Cj} x(t)), j = 1, ..., n},

respectively.
In other words, for each onset time, the lowest and highest note are selected to define the lower and upper envelope respectively. In the example below, we consider interval steps Δy(ti) = [y(ti+1) − y(ti)] mod 12 for the upper envelope of a composition with onset times t1, ..., tn and pitches y(t1), ..., y(tn). A simple aspect of melodic and harmonic structure is the question in which sequence intervals are likely to occur. Here, we look at the empirical two-dimensional distribution of (Δy(ti), Δy(ti+1)). For each pair (i, j), (−11 ≤ i, j ≤ 11, i, j ≠ 0), we count the number nij of occurrences and define Nij = log(nij + 1). (The value 0 is excluded here, since repetitions of a note or transposition by an octave are less interesting.) If only the type of interval and not its direction is of interest, then i, j assume the values 1 to 11 only. A useful representation of Nij can be obtained by a symbol plot. In Figures 2.32 and 2.33, the x- and y-coordinates correspond to i and j respectively. The radius of a circle with center (i, j) is proportional to Nij. The compositions considered here are: a) J.S. Bach: Präludium No. 1 from Das Wohltemperierte Klavier; b) W.A. Mozart: Sonata KV 545 (beginning of 2nd movement); c) A. Scriabin: Prélude op. 51, No. 4; and d) F. Martin: Prélude No. 6. For Bach's piece, there is a clear clustering in three main groups in the first plot (there are almost never two successive interval steps downwards) and a horseshoe-like pattern for absolute intervals. Remarkable is the clear negative correlation in Mozart's first plot and the concentration on a few selected interval sequences. A negative correlation in the plots of interval steps with sign can also be found for Scriabin and Martin. However, considering only the types of intervals without their sign, the number and variety of interval sequences that are used relatively frequently is much higher for Scriabin and even more so for Martin. For Martin, the plane of absolute intervals (Figure 2.33d) is filled almost uniformly.

2.7.4 Pitch distribution: symbol plots with circles

Consider once more the distribution vectors pj = (pj0, ..., pj11)^t of pitch modulo 12 as in the star-plot example above. The star plots show a clear distinction between modern compositions and classical tonal compositions. Symbol plots can be used to see more clearly which composers (or compositions) are close with respect to pj. In Figure 2.34 the x- and y-axis correspond to pj5 and pj7. Recall that if 0 is the root of the tonic triad, then 5 is the root of the subtonic and 7 the root of the dominant
Figure 2.32 Symbol plot of the distribution of successive interval pairs (Δy(ti), Δy(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Bach's Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart's Sonata KV 545 (beginning of 2nd movement).
Figure 2.33 Symbol plot of the distribution of successive interval pairs (Δy(ti), Δy(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Scriabin's Prélude op. 51, No. 4 and F. Martin's Prélude No. 6.
triad. The radius of the circles in Figure 2.34 is proportional to pj1, the frequency of the dissonant minor second. In color Figure 2.35, the radius represents pj6, i.e. the augmented fourth. Both plots show a clear positive relationship between pj5 and pj7. Moreover, the circles tend to be larger for small values of x and y. The positioning in the plane, together with the size of the circles, separates (apart from a few exceptions) classical tonal compositions from more recent ones. To visualize this, four different colors are chosen for early music (black), baroque and classical (green), romantic (blue) and 20th/21st century (red). The clustering of the four colors indicates that there is indeed an approximate clustering according to the four time periods. Interesting exceptions can be observed for early music, with two extreme outliers (Halle and Arcadelt). Also, one piece by Rameau is somewhat far from the rest.
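A sketch of the counting step behind the interval symbol plots (Figures 2.32 and 2.33) is given below; it is illustrative only, the upper-envelope pitch sequence is a hypothetical placeholder, and the reduction of intervals to within an octave (keeping the direction) is an assumption about the exact convention used in the text.

import numpy as np

def interval_pair_counts(pitches, signed=True):
    """N_ij = log(n_ij + 1), where n_ij counts successive interval pairs
    (Δy(t_i), Δy(t_{i+1})) of an (upper-envelope) pitch sequence.

    signed=True: intervals -11..11 (0 excluded); signed=False: absolute intervals 1..11."""
    steps = np.diff(np.asarray(pitches))
    steps = np.sign(steps) * (np.abs(steps) % 12)      # reduce to within an octave, keep direction
    if not signed:
        steps = np.abs(steps)
    lo, size = (-11, 23) if signed else (1, 11)
    N = np.zeros((size, size))
    for a, b in zip(steps[:-1], steps[1:]):
        if a == 0 or b == 0:                           # note repetitions / octaves excluded
            continue
        N[int(a) - lo, int(b) - lo] += 1
    return np.log(N + 1)

# hypothetical upper-envelope melody
melody = [60, 64, 62, 67, 65, 69, 67, 72, 71, 67, 65, 64, 62, 60]
print(interval_pair_counts(melody, signed=False).max())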
[Figure 2.34: plot omitted (symbol plot of pj7 versus pj5 with composer names at the data points; both axes range from 0.0 to 0.20).]
2.7.5 Pitch distribution symbol plots with rectangles
By using rectangles, four dimensions can be represented. Color Figure 2.36 shows a symbol plot with (x, y)-coordinates (p_{j5}, p_{j7}) and rectangles with width
Figure 2.35 Symbol plot with x = p_{j5}, y = p_{j7} and radius of circles proportional to p_{j6}. (Color figures follow page 152.)
p_{j1} (diminished second) and height p_{j6} (augmented fourth). Using the same colors for the names as above, a similar clustering as in the circle plot can be observed. The picture not only visualizes a clear four-dimensional relationship between p_{j1}, p_{j5}, p_{j6}, and p_{j7}, but also shows that these quantities are related to the time period.
2.7.6 Pitch distribution symbol plots with stars
Five dimensions are visualized in color Figure 2.37 with (x, y) = (p_{j5}, p_{j7}) and the variables p_{j1}, p_{j6}, and p_{j10} (diminished seventh) defining a star plot for each observation, the first variable starting on the right and the subsequent variables winding counterclockwise around the star (in this case a triangle). The shape of the triangle is obviously a characteristic of the time period. For tonal music composed mostly before about 1900, the stars are very narrow with a relatively long beam in the direction of the diminished seventh. The diminished seventh is indeed an important pitch in tonal music, since it is the fourth note in the dominant seventh chord to the subdominant. In contrast, notes that are a diminished second and an
Figure 2.36 Symbol plot with x = p_{j5}, y = p_{j7}. The rectangles have width p_{j1} (diminished second) and height p_{j6} (augmented fourth). (Color figures follow page 152.)
augmented fourth above the root of the tonic triad build, together with the tonic root, highly dissonant intervals and are therefore less frequent in tonal music. Color Figure 2.37 shows the triangles; the names without the triangles are plotted in color Figure 2.38.
2.7.7 Pitch distribution profile plots
Finally, as an alternative to star plots, Figure 2.39 displays profile plots of p̃_j = (p_{j5}, p_{j10}, p_{j3}, p_{j8}, p_{j1}, p_{j6}, p_{j11}, p_{j4}, p_{j9}, p_{j2}, p_{j7})^t. For compositions up to about 1900, the profiles are essentially U-shaped. This corresponds to stars with clustered long and short beams respectively, as seen previously. For modern compositions, there is a large variety of shapes different from a U-shape.
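Computing the distribution vector p_j and the reordered vector p̃_j used for the profile plots is straightforward; a minimal sketch (variable and function names are illustrative):

```python
import numpy as np

def pitch_class_distribution(pitches):
    """Relative frequencies p_j = (p_j0, ..., p_j11) of pitch modulo 12."""
    counts = np.bincount(np.asarray(pitches) % 12, minlength=12)
    return counts / counts.sum()

# Reordering used for the profile plots: 5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7
PROFILE_ORDER = [5, 10, 3, 8, 1, 6, 11, 4, 9, 2, 7]

def profile_vector(pitches):
    """p-tilde_j: the pitch-class distribution in profile-plot order."""
    return pitch_class_distribution(pitches)[PROFILE_ORDER]
```

Plotting profile_vector against position 1, ..., 11 reproduces the U-shape test described above.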
Figure 2.37 Symbol plot with x = p_{j5}, y = p_{j7}, and triangles defined by p_{j1} (diminished second), p_{j6} (augmented fourth), and p_{j10} (diminished seventh). (Color figures follow page 152.)
Figure 2.38 Names plotted at locations (x, y) = (p_{j5}, p_{j7}). (Color figures follow page 152.)
[Figure 2.39: profile plots of p̃_j, one panel per composition, for Halle, Ockeghem, Arcadelt, Byrd, Rameau, Bach, Scarlatti, Haydn, Mozart, Clementi, Schumann, Chopin, Wagner, Debussy, Scriabin, Bartok, Prokoffieff, Messiaen, Schoenberg, Webern, Takemitsu, and Beran.]
CHAPTER 3
not and 11 = very. Thus, for a vocabulary V of |V| = N = 2² words, n = 2 digits would be sufficient. More generally, suppose that we have a set V with N = 2ⁿ elements. Then we need n = log₂ N digits for encoding the elements in the binary system. The number n is then called the information of a message from vocabulary V. Note that in the special case where V consists of one element only, n = 0, i.e. the information content of a message is zero, because we know which element of V will be contained in the message even before receiving it. An extension of this definition to integers N that are not necessarily powers of 2 can be justified as follows: consider a sequence of k elements from V. The number of sequences v₁, ..., v_k (v_i ∈ V) is N^k. (Note that one element is allowed to occur more than once.) The number of binary digits needed to express a sequence v₁, ..., v_k is n_k, where 2^{n_k − 1} < N^k ≤ 2^{n_k}. The average number of digits needed to express an element in this sequence is n_k/k, where k log₂ N ≤ n_k < k log₂ N + 1. We then have lim_{k→∞} n_k/k = log₂ N. The following definition is therefore meaningful:
Definition 25 Let V_N be a finite set with N elements. Then the information necessary to characterize the elements of V_N is defined by I(V_N) = log₂ N. (3.1)
This definition can also be derived by postulating the following properties a measure of information should have:
1. Additivity: If |V_K| = N · M, then I(V_K) = I(V_N) + I(V_M);
2. Monotonicity: I(V_N) ≤ I(V_{N+1});
3. Definition of unit: I(V₂) = 1.
The only function that satisfies these conditions is I(V_N) = log₂ N. Consider now a more complex situation where V_N = ∪_{j=1}^k V_j, V_j ∩ V_l = ∅ (j ≠ l) and |V_j| = N_j (and hence N = N₁ + ... + N_k), and define p_j = N_j/N. Suppose that we select an element from V randomly, each element having the same probability of being chosen. If an element v ∈ V is known to belong to a specific V_j, then the additional information needed to identify it within V_j is equal to I(V_j) = log₂ N_j. The expected value of this additional information is therefore
I₂ = ∑_{j=1}^{k} p_j log₂ N_j = ∑_{j=1}^{k} p_j log₂(N p_j)  (3.2)
Let I₁ be the information needed to identify the set V_j which v belongs to. Then the total information needed for identifying (encoding) elements of V is
log₂ N = I₁ + I₂.  (3.3)
On the other hand, ∑_{j=1}^{k} p_j log₂ N = log₂ N, so that we obtain Shannon's famous formula
I₁ = −∑_{j=1}^{k} p_j log₂(p_j)  (3.4)
I₁ is also called Shannon information. Shannon information is thus the expected information about the occurrence of the sets V₁, ..., V_k contained in a randomly chosen element from V. Note that the term information can be used synonymously for uncertainty: the information obtained from a random experiment diminishes uncertainty by the same amount. The derivation of Shannon information is credited to Shannon (1948) and, independently, Wiener (1948). In physics, an analogous formula is known as entropy and is a measure of the disorder of a system (see Boltzmann 1896, Figure 3.1). Shannon's formula can also be derived by postulating the following properties for a measure of information of the outcome of a random experiment: let V₁, ..., V_k be the possible outcomes of a random experiment and denote by p_j = P(V_j) the corresponding probabilities. Then a measure of information, say I, obtained by the outcome of the random experiment should have the following properties:
1. Function of probabilities: I = I(p₁, ..., p_k), i.e. I depends on the probabilities p_j only;
2. Symmetry: I(p₁, ..., p_k) = I(p_{π(1)}, ..., p_{π(k)}) for any permutation π;
3. Continuity: I(p, 1 − p) is a continuous function of p (0 ≤ p ≤ 1);
4. Definition of unit: I(1/2, 1/2) = 1;
5. Additivity and weighting by probabilities:
I(p₁, ..., p_k) = I(p₁ + p₂, p₃, ..., p_k) + (p₁ + p₂) I(p₁/(p₁ + p₂), p₂/(p₁ + p₂))  (3.5)
The meaning of the first four properties is obvious. The last property can be interpreted as follows: suppose the outcome of an experiment does not distinguish between V₁ and V₂, i.e. if v turns out to be in one of these two sets, we only know that v ∈ V₁ ∪ V₂. Then the information provided by the experiment is I(p₁ + p₂, p₃, ..., p_k). If the experiment did distinguish between V₁ and V₂, then it is reasonable to assume that the information would be larger by the amount (p₁ + p₂) I(p₁/(p₁ + p₂), p₂/(p₁ + p₂)). Equation (3.5) tells us exactly that: the complete information I(p₁, ..., p_k) can be obtained by adding the partial and the additional information. It turns out that the only function for which the postulates hold is Shannon's information:
Theorem 9 Let I be a functional that assigns to each finite discrete distribution P (defined by probabilities p₁, ..., p_k, k ≥ 1) a real number I(P), such that the properties above hold. Then
I(P) = I(p₁, ..., p_k) = −∑_{j=1}^{k} p_j log₂ p_j  (3.6)
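A direct numerical check of (3.6) and of the extreme cases discussed below is easy; a minimal sketch (illustrative only):

```python
import math

def shannon_information(probs):
    """Shannon information I(P) = -sum p_j log2 p_j, in bits; zero terms are skipped."""
    assert abs(sum(probs) - 1.0) < 1e-9
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_information([0.5, 0.5]))    # 1.0: the unit of property 4
print(shannon_information([0.25] * 4))    # 2.0 = log2(4): the uniform upper bound
print(shannon_information([1.0, 0.0]))    # 0.0: a certain outcome carries no information
```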
Shannon information has an obvious upper bound that follows from Jensen's inequality: recall that Jensen's inequality states that for a convex function g and weights w_j ≥ 0 with ∑ w_j = 1 we have g(∑ w_j x_j) ≤ ∑ w_j g(x_j). In particular, for g(x) = x log₂ x,
(1/k) ∑_{j=1}^{k} p_j log₂ p_j ≥ g((1/k) ∑_{j=1}^{k} p_j) = −(1/k) log₂ k.
Hence,
I(P) ≤ log₂ k.  (3.7)
This bound is achieved by the uniform distribution p_j = 1/k. The other extreme case is p_j = 1 for some j. This means that event V_j occurs with certainty and I(p₁, ..., p_k) = I(p_j) = I(1) = I(1, 0) = I(1, 0, 0) etc. Then from the fifth property we have I(1, 0) = I(1) + I(1, 0), so that I(1) = 0. The interpretation is that, if it is clear a priori which event will occur, then a random experiment does not provide any information. The notion of information can be extended in an obvious way to the case where one has an infinite but countable number of possible outcomes.
The information contained in the realization of a random variable X with possible outcomes x₁, x₂, ... is defined by I(X) = −∑ p_j log₂ p_j,
where p_j = P(X = x_j). More subtle is the extension to continuous distributions and random variables. A nice illumination of the problem is given in Renyi (1970): for a random variable with uniform distribution on (0, 1), the digits in the binary expansion of X are infinitely many independent 0-1 random variables where 0 and 1 occur with probability 1/2 each. The information furnished by a realization of X would therefore be infinite. Nevertheless, a meaningful measure of information can be defined as a limit of discrete approximations:
Theorem 10 Let X be a random variable with density function f. Define X_N = [N X]/N, where [x] denotes the integer part of x. If I(X₁) < ∞, then the following holds:
lim_{N→∞} I(X_N)/log₂ N = 1  (3.8)
and
lim_{N→∞} [I(X_N) − log₂ N] = −∫ f(x) log₂ f(x) dx.  (3.9)
We thus have
Definition 26 Let X be a random variable with density function f. Then
I(X) = −∫ f(x) log₂ f(x) dx  (3.10)
is called the information (or entropy) of X. Note that, in contrast to discrete distributions, information can be negative. This is due to the fact that I(X) is in fact the limit of a difference of informations. The notion of entropy can also be carried over to measuring randomness in stationary time series in the sense of correlations. (For the definition of stationarity and time series in general see Chapter 4.)
Definition 27 Let X_t (t ∈ Z) be a stationary process with var(X_t) = 1 and spectral density f. Then the spectral entropy of X_t is defined by
I(X_t, t ∈ Z) = −∫_{−π}^{π} f(λ) log f(λ) dλ.  (3.11)
This definition is plausible, because for a process with unit variance, f has the same properties as a probability distribution and can be interpreted as a distribution on frequencies. The process X_t is uncorrelated if and only if f is constant, i.e. if f is the uniform distribution on [−π, π]. Exactly in this case entropy is maximal, and knowledge of past observations does not help to predict future observations. On the other hand, if f has one or more extreme peaks, then entropy is very low (and in the limit minus infinity). This corresponds to the fact that in this case future observations can be predicted with high accuracy from past values. Thus, future observations do not contain as much new information as in the case of independence.
3.2.2 Measuring metric, melodic, and harmonic importance
General idea
Western classical music is usually structured in at least three aspects: melody, metric structure, and harmony. With respect to representing the essential melodic, metric, and harmonic structures, not all notes are equally important. For a given composition K, we may therefore try to find metric, melodic, and harmonic structures and quantify them in a weight function w : K → R³ (which we will also call an indicator). For each note event x ∈ K, the three components of w(x) = (w_melodic(x), w_metric(x), w_harmonic(x)) quantify the importance of x with respect to the melodic, metric, and harmonic structure of the composition respectively.
Omnibus metric, melodic, and harmonic indicators
Specific definitions of structural indicators (or weight functions) are discussed for instance in Mazzola et al. (1995), Fleischer et al. (2000), and Beran and Mazzola (2001). To illustrate the general approach, we give a full definition of metric weights. Melodic and harmonic weights are defined in a similar fashion, taking into account the specific nature of melodic and harmonic structures respectively. Metric structures characterize local periodic patterns in symbolic onset times. This can be formalized as follows: let K ⊂ Z⁴ be a composition (with coordinates onset time, pitch, loudness, and duration), T ⊂ Z its set of onset times (i.e. the projection of K on the first axis), and let t_max = max{t : t ∈ T}. Without loss of generality the smallest onset time in T is equal to one.
Definition 28 For each triple (t, l, p) ∈ Z × N × N the set B(t, l, p) = {t + kp : 0 ≤ k ≤ l} is called a meter with starting point t, length l and period p. The meter is called admissible if B(t, l, p) ⊆ T. The nonnegative length l of a local meter M = B(t, l, p) is uniquely determined by the set M and is denoted by l(M).
Note that by definition, t ∈ B(t, l, p) for any (t, l, p) ∈ Z × N × N. The importance of events at onset time s is now measured by the number of meters this onset is contained in. For a given triple (t, l, p), three situations can occur:
1. B(t, l, p) is admissible and there is no other admissible local meter B′ = B(t′, l′, p′) such that B ⊂ B′;
2. B(t, l, p) is not admissible;
3. B(t, l, p) is admissible, but there is another admissible local meter B′ = B(t′, l′, p′) such that B ⊂ B′.
We count only case 1. This leads to the following definition:
Definition 29 An admissible meter B(t, l, p) for a composition K ⊂ Z⁴ is called a maximal local meter if and only if it is not a proper subset of another admissible local meter B(t′, l′, p′) of K. Denote by M(K) the set of maximal local meters of K and by M(K, t) the set of maximal local meters of K containing onset t.
Note that the set M(K) is always a covering of T. Metric weights can now be defined, for instance, by
Definition 30 Let x ∈ K be a note event at onset time t(x) ∈ T, M = M(K, t) the set of maximal local meters of K containing t(x), and h a nondecreasing real function on Z. Specify a minimal length l_min. Then the metric indicator (or metric weight) of x, associated with the minimal length l_min, is given by
w_metric(x) = ∑_{M ∈ M, l(M) ≥ l_min} h(l(M))  (3.12)
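A brute-force sketch of Definitions 28 through 30 may make the construction concrete. It is quadratic in the number of onsets and purely illustrative (names and the restriction to meters with at least one repetition are assumptions, not the book's implementation):

```python
def maximal_local_meters(onsets):
    """All admissible meters B(t, l, p) = {t + k*p : 0 <= k <= l} inside the onset
    set, keeping only those not properly contained in another admissible meter."""
    T = set(onsets)
    admissible = set()
    for t in sorted(T):
        for p in range(1, max(T) - t + 1):
            l = 0
            while t + (l + 1) * p in T:
                l += 1
            if l > 0:  # require at least one repetition
                admissible.add(frozenset(t + k * p for k in range(l + 1)))
    return [M for M in admissible
            if not any(M < M2 for M2 in admissible)]   # maximality

def metric_weight(onset, meters, l_min=2, h=lambda l: l):
    """w_metric = sum of h(l(M)) over maximal local meters M containing the onset,
    with length l(M) = |M| - 1 >= l_min; h(l) = l is an arbitrary example choice."""
    return sum(h(len(M) - 1) for M in meters
               if onset in M and len(M) - 1 >= l_min)

meters = maximal_local_meters([1, 2, 3, 4, 5, 7, 9, 11])
print(metric_weight(3, meters))
```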
In a similar fashion, melodic indicators w_melodic and harmonic indicators w_harmonic can be derived from a melodic and harmonic analysis respectively.
Specific indicators
A possible objection to weight functions as defined above is that only information about pitch and onset time is used. A score, however, usually contains much more symbolic information that helps musicians to read it correctly. For instance, melodic phrases are often connected by a phrasing slur, notes are grouped by beams, separate voices are made visible by suitable orientation of note stems, etc. Ideally, structural indicators should take into account such additional information. An improved indicator that takes into account knowledge about musical motifs can be defined for example as follows:
Definition 31 Let M = {(τ₁, y₁), ..., (τ_k, y_k)}, τ₁ < τ₂ < ... < τ_k, be a motif, where τ denotes onset time and y pitch. Given a composition K ⊂ T × Z ⊂ Z², define for each score-onset time t_i ∈ T (i = 1, ..., n) and u ∈ {1, ..., k} the shifted motif
M(t_i, u) = {(t_i + τ₁ − τ_u, y₁), ..., (t_i + τ_k − τ_u, y_k)}
and denote by T_u(t_i) = {t_i + τ₁ − τ_u, ..., t_i + τ_k − τ_u} = {s₁, ..., s_k} the corresponding onset times. Moreover, let X_u(t_i) = {x = (x(s₁), ..., x(s_k)) : (s_i, x(s_i)) ∈ K} be the set of all pitch vectors with onset set T_u(t_i). Then we define the distance
d_u(t_i) = min_{x ∈ X_u(t_i)} ∑_{i=1}^{k} (x(s_i) − y_i)²  (3.13)
If X_u is empty, then d_u(t_i) is not defined or set equal to an arbitrary upper bound D < ∞. In this definition, it is assumed that the motif is identified beforehand by other means (e.g. by hand, using traditional musical analysis). The distance d_u(t_i) thus measures to what extent there are notes that are similar to those in M, if t_i is at the u-th place of the rhythmic pattern of motif M. Note that the euclidean distance ∑(x(s_i) − y_i)² could be replaced by any other reasonable distance. Analogously, distance or similarity can be measured by correlation:
Definition 32 Using the same definitions as above, let
x* = arg min_{x ∈ X_u(t_i)} ∑_{i=1}^{k} (x(s_i) − y_i)²,
and define r_u(t_i) to be the sample correlation between x* and y = (y₁, ..., y_k). If M(t_i, u) ⊄ K, then set r_u(t_i) = 0. Disregarding the position within a motif, we can now define overall motivic indicators (or weights), for instance by
w_{d,mean}(t_i) = g(∑_{u=1}^{k} d_u(t_i))  (3.14)
together with analogous definitions (3.15) and (3.16) based on min_u d_u(t_i) and on the correlations r_u(t_i), the latter denoted by w_corr(t_i).
Finally, given weights for p different motifs, we may combine these into one overall indicator. For instance, an overall melodic indicator based on correlations can be defined by
w_melod(t_i) = ∑_{j=1}^{p} h(w_{corr,j}(t_i), L_j)  (3.17)
where w_{corr,j} is the weight function for motif number j and L_j is the number of elements in motif j. Including L_j has the purpose of attributing higher weights to the presence of longer motifs. The advantage of the motif-based definition is that one can first search for possible motifs in the score, making full use of the available information in the score as well as musicological and historical knowledge, and then incorporate these in the definition of melodic weights. Similar definitions may be obtained for metric and harmonic indicators.
3.2.3 Measuring dimension
There are many different definitions of dimension, each measuring a specific aspect of objects. Best known is the topological dimension. In the usual euclidean space R^k with scalar product ⟨x, y⟩ = ∑_{i=1}^{k} x_i y_i and distances ‖x − y‖ = √⟨x − y, x − y⟩, the topological dimension of the space is equal to k. The dimension of an object in this space is equal to the dimension of the subspace it is contained in. The euclidean space is, however, rather special since it is metric with a scalar product. More generally, one can define a topological dimension in any topological (not necessarily metric) space in terms of coverings. We start with the definition of a topological space: a topological space is a nonempty set X together with a family O of so-called open subsets of X satisfying the following conditions:
1. X ∈ O and ∅ ∈ O (∅ denotes the empty set);
2. If U₁, U₂ ∈ O, then U₁ ∪ U₂ ∈ O;
3. If U₁, U₂ ∈ O, then U₁ ∩ U₂ ∈ O.
A covering of a set S ⊆ X is a collection U ⊆ O of open sets such that S ⊆ ∪_{U ∈ U} U. A refinement of a covering U is a covering U′ such that for each U′ ∈ U′ there exists a U ∈ U with U′ ⊆ U. The definition of topological dimension is now as follows:
Definition 33 A topological space X has topological dimension m if every covering U of X has a refinement U′ in which every point of X occurs in at most m + 1 sets of U′, and m is the smallest such integer. The topological dimension of a subset S ⊆ X is defined analogously.
For instance, a straight line in a euclidean space can be divided into open intervals such that at most two intervals intersect, so that d_T = 1. Similarly, a simple geometric figure in the plane, such as a disk or a rectangle (including the inner area), can be covered with arbitrarily small circles or rectangles such that at most three such sets intersect; this number can, however, not be made smaller. Thus, the topological dimension of such an object is d_T = 3 − 1 = 2.
The topological dimension is a relatively rough measure of dimension, since it can assume integer values only and thus classifies sets (in a topological space) into a finite or countable number of categories. On the other hand, d_T is defined for very general spaces where a metric (i.e. distances) need not exist. A finer definition of dimension, which is however confined to metric spaces, is the Hausdorff-Besicovitch dimension. Suppose we have a set A in a metric space X. In a metric space, we can define open balls of radius r around each point x ∈ X by U(r) = {y ∈ X : d_X(x, y) < r}, where d_X is the metric in X. The idea is now to measure the size of A by covering it with a finite number of balls U_r = {U₁(r), ..., U_k(r)} of radius r and to calculate an approximate measure of A by
μ_{U_r,r,h}(A) = ∑ h(r)  (3.18)
where the sum is taken over all balls and h is some positive function. This measure depends on r, the specific covering U_r, and h. To obtain a measure that is independent of a specific covering, we define the measure
μ_{r,h}(A) = inf_{U_ρ : ρ < r} μ_{U_ρ,ρ,h}(A)  (3.19)
This measure is still only an approximation of A. The question is now whether we can get a measure that corresponds exactly to the set A. This is done by taking the limit r → 0:
μ_h(A) = lim_{r→0} μ_{r,h}(A)  (3.20)
Clearly, as r tends to zero, μ_{r,h} can only become larger and therefore has a limit. The limit can be either zero (if μ_{r,h} = 0 already), infinity, or a finite number. This leads to the following definition:
Definition 34 A function h for which 0 < μ_h(A) < ∞ is called an intrinsic function of A.
Consider, for example, a simple shape in the plane such as a circle with radius R. The area of the circle A can be measured by covering it by small circles of radius r and evaluating μ_h(A) using the function h(r) = πr². It is well known that lim_{r→0} μ_{r,h}(A) exists and is equal to μ_h(A) = πR². On the other hand, if we took h(r) = r^α with α < 2, then μ_h(A) = ∞, whereas for α > 2, μ_h(A) = 0. For standard sets, such as circles, rectangles, triangles, cylinders, etc., it is generally true that the intrinsic function for a set A with topological dimension d_T = d is given by (Hausdorff 1919)
h(r) = h_d(r) = [{Γ(1/2)}^d / Γ(1 + d/2)] r^d.  (3.21)
Many other more complicated sets, including randomly generated sets, have intrinsic functions of the form h(r) = L(r) r^d for some d > 0 which is not always equal to d_T, and L a function that is slowly varying at the origin (see e.g. Hausdorff 1919, Besicovitch 1935, Besicovitch and Ursell 1937, Mandelbrot 1977, 1983, Falconer 1985, 1986, Kono 1986, Telcs 1990, Devaney 1990). Here, L is called slowly varying at zero if for any u > 0, lim_{r→0}[L(ur)/L(r)] = 1. This leads to the following definition of dimension:
Definition 35 Let A be a subset of a metric space and h(r) = L(r) r^d an intrinsic function of A, where L(r) is slowly varying. Then d_H = d is called the Hausdorff-Besicovitch dimension (or Hausdorff dimension) of A.
The definition of Hausdorff dimension leads to the definition of fractals (see e.g. Mandelbrot 1977):
Definition 36 Let A be a subset of a metric space. Suppose that A has topological dimension d_T and Hausdorff dimension d_H such that d_H > d_T. Then A is called a fractal.
Figure 3.2 Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 152.)
Intuitively, d_H > d_T means that the set A is more complicated than a standard set with topological dimension d_T. An alternative definition of Hausdorff dimension is the fractal dimension:
Definition 37 Let A be a compact subset of a metric space. For each ε > 0, denote by N(ε) the smallest number of balls of radius ε necessary to cover A. If
d_F = lim_{ε→0} log N(ε) / log(1/ε)  (3.22)
exists, then d_F is called the fractal dimension of A.
It can be shown that d_F ≥ d_T. Moreover, in R^k one has d_F ≤ k = d_T. Beautiful examples of fractal curves and surfaces (cf. Figure 3.2) can be found in
Mandelbrot (1977) and other related books. Many phenomena, not only in nature but also in art, appear to be fractal. For instance, fractal shapes can be found in Jackson Pollock's (1912-1956) abstract drip paintings (Taylor 1999a,b,c, 2000). In music, the idea of fractals was used by some contemporary composers, though mainly as a conceptual inspiration rather than an exact algorithm (e.g. Harri Vuori, György Ligeti; Figure 3.3).
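Definition 37 suggests a simple numerical procedure, usually implemented with grid boxes standing in for balls: count occupied boxes at a sequence of scales and regress log N(ε) on log(1/ε). A minimal sketch, with scales and sample sizes chosen only for illustration:

```python
import numpy as np

def box_counting_dimension(points, scales=(1/4, 1/8, 1/16, 1/32, 1/64)):
    """Estimate d_F of a point cloud in [0,1]^2 from the slope of
    log N(eps) against log(1/eps)."""
    pts = np.asarray(points)
    counts = []
    for eps in scales:
        boxes = {tuple(ix) for ix in np.floor(pts / eps).astype(int)}
        counts.append(len(boxes))
    slope, _ = np.polyfit(np.log(1 / np.array(scales)), np.log(counts), 1)
    return slope

# Middle-thirds Cantor set on the x-axis: d_F = log 2 / log 3, about 0.63
x = np.random.choice([0, 2], size=(20000, 12)) @ (3.0 ** -np.arange(1, 13))
pts = np.column_stack([x, np.zeros_like(x)])
print(box_counting_dimension(pts))
```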
The notion of fractals is closely related to self-similarity (see Mandelbrot 1977 and references therein). Self-similar geometric objects have the property that the same shapes are repeated at infinitely many scales. By drawing recursively m smaller copies of the same shape, rescaling them by a factor s, one can construct fractals. For self-similar objects, the fractal dimension can be calculated directly from the scaling factor s and the number m of repetitions of the rescaled objects by
d_F = log m / log s  (3.23)
For many purposes more realistic are random fractals, where instead of the shape itself the distribution remains the same after rescaling. More specifically, we have
Definition 38 Let X_t (t ∈ R) be a stochastic process. The process is called self-similar with self-similarity parameter H if for any c > 0, X_t =_d c^{−H} X_{ct}, where =_d means equality of the two processes in distribution. The parameter H is also called Hurst exponent.
Self-similar processes are (like their deterministic counterparts) very special models. However, they play a central role for stochastic processes, just like the normal distribution for random variables. The reason is that, under very general conditions, the limit of partial sum processes (see Lamperti 1962, 1972) is always a self-similar process:
Theorem 11 Suppose that Z_t (t ∈ R₊) is a stochastic process such that Z₁ ≠ 0 with positive probability, and Z_t is the limit in distribution of the sequence of normalized partial sums
a_n^{−1} S_{nt} = a_n^{−1} ∑_{s=1}^{[nt]} X_s  (n = 1, 2, ...)  (3.24)
where X₁, X₂, ... is a stationary discrete time process with zero mean and a₁, a₂, ... a sequence of positive normalizing constants such that log a_n → ∞. Then there exists an H > 0 such that for any u > 0, lim_{n→∞}(a_{nu}/a_n) = u^H, Z_t is self-similar with self-similarity parameter H, and Z_t has stationary increments.
The self-similarity parameter therefore also makes sense for processes that are not exactly self-similar themselves, since it is defined by the rate n^H needed to standardize partial sums. Moreover, H is related to the fractal dimension; the exact relationship between H and the fractal dimension, however, depends on some other properties of the process as well. For instance, sample paths of (univariate) Gaussian self-similar processes, so-called fractional Brownian motion (see Chapter 4), have, with probability one, a fractal dimension of 2 − H, with possible values of H in the interval (0, 1). Thus, the closer H is to 1, the more a sample path is similar to a simple geometric line with dimension one. On the other hand, as H approaches zero, a typical sample path fills up most of the plane, so that the dimension approaches two. Practically, H can be determined from an observed series X₁, ..., X_n, for example by maximum likelihood estimation. For a thorough discussion of self-similar and related processes and statistical methods see e.g. Beran (1994). Further references on fractals apart from those given above are, for instance, Edgar (1990), Falconer (1990), Peitgen and Saupe (1988), Stoyan and Stoyan (1994), and Tricot (1995). A cautionary remark should be made at this point: in view of Theorem 11, the fact that we do find self-similarity in aggregated time series is hardly surprising and can therefore not be interpreted as something very special that would distinguish the particular series from other data. What may be special at most is which particular value of H is obtained and which particular self-similar process the normalized aggregated series converges to.
3.3 Specific applications in music
3.3.1 Entropy of melodic shapes
Let x(t_i) be the upper and y(t_i) the lower envelope of a composition at score-onset times t_i (i = 1, ..., n). To investigate the shape of the melodic
movement, we consider the first and second discrete derivatives
x^(1)(t_i) = Δx(t_i)/Δt_i = [x(t_{i+1}) − x(t_i)] / [t_{i+1} − t_i]  (3.25)
and
x^(2)(t_i) = Δ²x(t_i)/Δ²t_i = ([x(t_{i+2}) − x(t_{i+1})] − [x(t_{i+1}) − x(t_i)]) / ([t_{i+2} − t_{i+1}][t_{i+1} − t_i])  (3.26)
Alternatively, if octaves do not count, we define
x^(1;12)(t_i) = [x(t_{i+1}) − x(t_i)]₁₂ / (t_{i+1} − t_i)  (3.27)
and
x^(2;12)(t_i) = ([x(t_{i+2}) − x(t_{i+1})]₁₂ − [x(t_{i+1}) − x(t_i)]₁₂) / ([t_{i+2} − t_{i+1}][t_{i+1} − t_i])  (3.28)
where [x]_k = x mod k. Thus, in this definition, intervals between successive notes x(t_i), x(t_{i+1}) and x(t_j), x(t_{j+1}) respectively are considered identical if they differ by octaves only. The number of possible values of x^(2) and x^(2;12) is finite, however potentially very large. In a first approximation we may therefore consider both variables to be continuous. In the following, the distribution of x^(2) and x^(2;12) is approximated by a continuous kernel density estimate f̂ (see Chapter 2). For illustration, we define the following measures of entropy:
1. E₁ = −∫ f̂(x) log₂ f̂(x) dx  (3.29)
where f̂ is obtained from the observed data x^(2;12)(t₁), ..., x^(2;12)(t_n) by kernel estimation.
2. E₂: Same as E₁, but using x^(2)(t₁), ..., x^(2)(t_n) instead.
3. E₃ = −∫∫ f̂(x, y) log₂ f̂(x, y) dx dy  (3.30)
where f̂(x, y) is a kernel estimate based on observations (a_i, b_i) with a_i = x^(2)(t_{i−1}) and b_i = x^(2)(t_i). Thus, E₃ is the (empirical) entropy of the joint distribution of two successive values of x^(2).
4. E₄: Same as E₃, but using (x^(2;12)(t_{i−1}), x^(2;12)(t_i)) instead.
5. E₅: Same as E₃, but using (x(t_i) − y(t_i))^(1) instead.
6. E₆: Same as E₃, but using (x(t_i) − y(t_i))^(1;12) instead.
7. E₇: Same as E₁, but using (x(t_i) − y(t_i))^(1) instead.
8. E₈: Same as E₁, but using (x(t_i) − y(t_i))^(1;12) instead.
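Entropies of the type E₁ can be approximated numerically from a kernel density estimate. A sketch using scipy, where the grid integration, the default bandwidth, and the placeholder data are illustrative choices rather than the book's:

```python
import numpy as np
from scipy.stats import gaussian_kde

def entropy_E1(samples, grid_points=512):
    """E1 = -integral of f_hat(x) log2 f_hat(x) dx, with f_hat a Gaussian KDE
    and the integral approximated on an equally spaced grid."""
    samples = np.asarray(samples, dtype=float)
    kde = gaussian_kde(samples)
    lo = samples.min() - 3 * samples.std()
    hi = samples.max() + 3 * samples.std()
    x = np.linspace(lo, hi, grid_points)
    f = np.clip(kde(x), 1e-12, None)         # avoid log(0)
    return -np.trapz(f * np.log2(f), x)

# x2 would hold the second discrete derivatives x^(2;12)(t_i) of the envelope
x2 = np.random.default_rng(0).normal(size=200)   # placeholder data
print(entropy_E1(x2))
```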
Figure 3.4 Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite No. I and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.
Each of these entropies characterizes the information content (or randomness) of certain aspects of melodic patterns in the upper and lower envelope. Figures 3.4a through d show boxplots of entropies 1 through 4 for Bach and Schumann (Figure 3.8). The pieces considered here are: J.S. Bach, Cello Suite No. I (each of the six movements separately), Präludium und Fuge No. 1 and 8 from Das Wohltemperierte Klavier I (each piece separately); R. Schumann, op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16. Obviously there is a difference between Bach and Schumann in all four entropy measures. In Bach's pieces, entropy is higher, indicating a more uniform mixture of local melodic shapes.
3.3.2 Spectral entropy of local interval variability
Consider the local variability of intervals y_i = x(t_{i+1}) − x(t_i) between successive notes. Specifically, we consider a moving nearest neighbor window [t_i, t_{i+4}] (i = 1, ..., n − 4) and define local variances
v_i = (1/(4 − 1)) ∑_{j=0}^{3} (y_{i+j} − ȳ_i)²  (3.31)
where ȳ_i = 4^{−1} ∑_{j=0}^{3} y_{i+j}. Based on this, a SEMIFAR model is fitted to the time series z_i = log(v_i + 1/2) (see Chapter 4 for the definition of SEMIFAR models). The fitted spectral density f̂(λ; θ̂) is then used to define the spectral entropy
E₉ = −∫ f̂(λ; θ̂) log f̂(λ; θ̂) dλ  (3.32)
If octaves do not count, then intervals are circular, so that an estimate of variability for circular data should be used. Here, we use ΔR = 2(1 − R̄) as defined in Chapter 7. To transform the range [0, 2] of ΔR to the real line, the logistic transformation is applied, defining
z_i = log((ΔR + ε) / (2 − ΔR + ε))
where ε is a small positive number that is needed in order that −∞ < z_i < ∞ even if ΔR = 0 or 2 respectively. Fitting a SEMIFAR model to z_i, we then define E₁₀ the same way as E₉ above. Figure 3.6 shows a comparison of E₉ and E₁₀ for the same compositions as in 3.3.1. In contrast to the previous measures of entropy, Bach is consistently lower than Schumann. With respect to E₁₀ this is also the case in comparison with Scriabin (Figure 3.5) and Martin. Thus, for Bach there appears to be a high degree of nonrandomness (i.e. organization) in the way variability of interval steps changes sequentially.
Figure 3.5 Alexander Scriabin (1871-1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neuer Meister, Dresden, and Robert-Sterl-House.)
Figure 3.6 Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.
3.3.3 Omnibus metric, melodic, and harmonic indicators for compositions by Bach, Schumann, and Webern
Figures 3.7 and 3.9 through 3.11 show the omnibus metric, melodic, and harmonic weight functions for Bach's Canon cancricans, Schumann's op. 15, No. 2 and 7, and for Webern's Variations op. 27. For Bach's composition, the almost perfect symmetry around the middle of the composition can be seen. Moreover, the metric curve exhibits a very regular up and down. Schumann's curves, in particular the melodic one, show clear periodicities. This appears to be quite typical for Schumann and becomes even clearer when plotting a kernel-smoothed version of the curves (here a bandwidth of 8/8 was used). Interestingly, this type of pattern can also be observed for Webern. In view of the historic development of 12-tone music as a logical continuation of the harmonic freedom and romantic gesture achieved in the 19th and early 20th centuries, this similarity is not completely unexpected. Finally, note that a relationship between metric,
Figure 3.7 Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.
melodic, and harmonic structure cannot be seen directly from the raw curves. However, smoothed weights as shown in the figures above reveal clear connections between the three weight functions. This is even the case for Webern, in spite of the absence of tonality.
3.3.4 Specific melodic indicators for Schumann's Träumerei
Schumann's Träumerei is rich in local motifs. Here, we consider eight of these, as indicated in Figure 3.12. Figure 3.13 displays the individual indicator functions obtained from (3.16). The overall indicator function m(t) = w_melod(t) displayed in Figure 3.15 is defined by (3.17) with h(w, L) = [2 max(w, 0.5)]^L and L_j = number of notes in motif j. The contributions h(w_{corr,j}(t_i), L_j) of w_{corr,j} (j = 1, ..., 8) are given in Figure 3.14.
Figure 3.9 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10 Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11 Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12 R. Schumann, Träumerei: motifs used for specific melodic indicators.
Figure 3.14 R. Schumann, Träumerei: contributions of individual motifs to overall melodic indicator.
[Figure 3.15: overall melodic indicator w (vertical axis) plotted against onset time (horizontal axis, 0 to 30).]
CHAPTER 4
1. The probability law has to be defined on an infinite dimensional space of vectors (X₁, X₂, ...). This difficulty is even more serious for continuous time series where a sample path is a function on R;
2. The finite sample vector X(n) = (X₁, ..., X_n)^t has an arbitrary n-dimensional distribution, so that it cannot be estimated from observed values x₁, ..., x_n consistently, unless some minimal assumptions are made.
Difficulty 1 can be solved by applying appropriate mathematical techniques and is described in detail in standard books on stochastic processes and time series analysis (see e.g. Billingsley 1986 and the references above). Difficulty 2 cannot be solved by mathematical arguments only. It is of course possible to give necessary or sufficient conditions such that the probability distribution can be estimated with arbitrary accuracy (measured in an appropriate sense) as n tends to infinity. However, which concrete assumptions should be used depends on the specific application. Assumptions should neither be too general (otherwise population quantities cannot be estimated) nor too restrictive (otherwise results are unrealistic). A standard, and almost necessary, assumption is that X_t can be reduced to a stationary process U_t by applying a suitable transformation. For instance, we may have a deterministic trend μ(i) plus stationary noise U_i,
X_i = μ(i) + U_i,  (4.1)
or an integrated process of order m for which the m-th difference is stationary, i.e.
(1 − B)^m X_i = U_i  (4.2)
where (1 − B)X_i = X_i − X_{i−1}. In the latter case, X_t is called m-difference stationary. Stationarity is defined as follows:
Definition 39 A time series X_i is called strictly stationary if for any k, i₁, ..., i_n ∈ N,
P(X_{i₁} ≤ x₁, ..., X_{i_n} ≤ x_n) = P(X_{i₁+k} ≤ x₁, ..., X_{i_n+k} ≤ x_n).  (4.3)
The time series is called weakly (or second order) stationary if
μ(i) = E(X_i) = μ = const  (4.4)
and, for any i, j ∈ N, the autocovariance depends on the lag k = i − j only, i.e.
cov(X_i, X_{i+k}) = γ(k) = γ(−k).  (4.5)
A second order stationary process can be decomposed into uncorrelated random components that correspond to periodic signals, via the so-called spectral representation
X_t = μ + ∫_{−π}^{π} e^{itλ} dZ_X(λ)  (4.6)
where Z_X(λ) is a complex-valued stochastic process (in λ) with the following properties: Z_X(0) = 0, E[Z_X(λ)] = 0 and, for λ₁ > λ₂ ≥ μ₁ > μ₂,
E[conj(ΔZ_X(μ₂, μ₁)) ΔZ_X(λ₂, λ₁)] = 0  (4.7)
where ΔZ_X(u, v) = Z_X(u) − Z_X(v). The integral in (4.6) is defined as a limit in mean square. It can be constructed by approximating the function e^{itλ} by step functions g_n(λ) and setting
I_n = ∫ g_n(λ) dZ_X(λ).
As g_n → e^{itλ}, the integrals I_n converge to a random variable I, in the sense that lim_{n→∞} E[(I − I_n)²] = 0. The random variable I is then denoted by ∫ exp(itλ) dZ(λ). The spectral representation is especially useful when one needs to identify (random) periodicities. For this purpose one defines the spectral distribution function
F_X(λ) = E[|Z_X(λ) − Z_X(0)|²] = E[|Z_X(λ)|²]  (4.8)
The variance is then decomposed into frequency contributions by
var(X_t) = ∫_{−π}^{π} E[|dZ_X(λ)|²] = ∫_{−π}^{π} dF_X(λ)  (4.9)
This means that the expected contribution (expected squared amplitude) of components with frequencies in the interval (λ, λ + Δ] to the variance of X_t is equal to F(λ + Δ) − F(λ). Two interesting special cases can be distinguished:
Case 1 (F differentiable): In this case,
F(λ + Δ) − F(λ) = (d/dλ)F(λ) Δ + o(Δ) = f(λ)Δ + o(Δ).
The function f is called spectral density and can also be defined directly by
f(λ) = (2π)^{−1} ∑_{k=−∞}^{∞} γ_X(k) e^{−ikλ}  (4.10)
and, conversely,
γ_X(k) = ∫_{−π}^{π} e^{ikλ} f(λ) dλ.  (4.11)
A high peak of f at a frequency λ₀ means that the component(s) at (or in the neighborhood of) λ₀ contribute largely to the variability of X_t. Note that the period of exp(iλt), as a function of t, is T = 2π/λ (sometimes one therefore defines ν = λ/(2π) as frequency, in order that the period T is directly the inverse of the frequency). Thus, a peak of f at λ₀ implies that a sample path of X_t is likely to exhibit a strong periodic component with frequency λ₀. Periodicity is, however, random: the observed series is not a periodic function. The meaning of random periodicity can be explained best in the simplest case where T is an integer: if f has a peak at frequency λ₀ = 2π/T, then the correlation between X_t and X_{t+jT} (j ∈ Z) is relatively high compared to other correlations with similar lags. A further complication that blurs periodicity is that, if f is continuous around a peak at λ₀, then the observed signal is a weighted sum of infinitely (in fact uncountably) many relatively large components with frequencies that are similar to λ₀. The sharper the peak, the less this blurring takes place and a distinct periodicity (though still random) can be seen. In the other extreme case where f is constant, there is no preference for any frequency, and γ_X(k) = 0 (k ≠ 0), i.e. observations are uncorrelated.
Case 2 (F is a step function with a finite or countable number of jumps): this corresponds to processes of the form
X_t = ∑_{j=1}^{k} A_j e^{iλ_j t}  (4.12)
where the A_j are random amplitudes with E[|A_j|²] < ∞, and
var(X_t) = ∑_{j=1}^{k} E[|A_j|²].  (4.13)
This means that the variance is a sum of contributions that are due to the frequencies λ_j (1 ≤ j ≤ k). A sample path of X_t cannot be distinguished from a deterministic periodic function, because the randomly selected amplitudes A_j are then fixed. Finally, it should be noted that not all frequencies are observable when observations are taken at discrete time points t = 1, 2, ..., n. The smallest identifiable period is 2, which corresponds to a highest observable frequency of 2π/2 = π. The largest identifiable period is n/2, which corresponds to the smallest frequency 4π/n. As n increases, the lowest frequency tends to zero; the highest, however, does not. In other words, the highest frequency resolution does not improve with increasing sample size. To obtain more general models, one may wish to relax the condition of stationarity. An asymptotic concept of local stationarity is defined in Dahlhaus (1996a,b, 1997): a sequence of stochastic processes X_{t,n} is called locally stationary if there is a representation
X_{t,n} = μ(t/n) + ∫_{−π}^{π} e^{iλt} A_{t,n}(λ) dZ_X(λ)  (4.14)
with the representation holding almost surely (a.s.), μ(u) continuous, and there exists a 2π-periodic function A : [0, 1] × R → C such that A(u, −λ) = conj(A(u, λ)), A(u, λ) is continuous in u, and
sup_{t,λ} |A(t/n, λ) − A_{t,n}(λ)| ≤ c n^{−1}  (4.15)
(a.s.) for some constant c < ∞. Intuitively, this means that for n large enough, the observed process can be approximated locally, in a small time window around t, by the stationary process ∫ exp(iλt) A(t/n, λ) dZ_X(λ). The order n^{−1} of the approximation is chosen such that most standard estimation procedures, such as maximum likelihood estimation, can be applied locally and their usual properties (e.g. consistency, asymptotic normality) still hold. Under smoothness conditions on A one can prove that a meaningful evolving spectral density f_X(u, λ) (u ∈ (0, 1)) exists such that
f_X(u, λ) = lim_{n→∞} (2π)^{−1} ∑_{k=−∞}^{∞} cov(X_{[un−k/2],n}, X_{[un+k/2],n}) e^{−iλk}.  (4.16)
The function f_X(u, λ) is called evolutionary spectral density. Note that, for fixed u, lim_{n→∞} cov(X_{[un−k/2],n}, X_{[un+k/2],n}) = γ_X(u, k), where the γ_X(u, k) are the autocovariances corresponding to f_X(u, ·), i.e. f_X(u, λ) = (2π)^{−1} ∑_k γ_X(u, k) e^{−iλk}.
Thumfart (1995) carries this concept over to series with discrete spectra. A simplified definition can be given as follows: a sequence of stochastic processes X_{t,n} (n ∈ N) is said to have a discrete evolutionary spectrum F_X(u, λ) if
X_{t,n} = μ(t/n) + ∑_{j ∈ M} A_j(t/n) e^{iλ_j(t/n) t}  (4.17)
where M ⊂ Z and λ_j(u) is twice continuously differentiable. The discrete evolutionary spectrum can be defined in analogy to the continuous case. For other definitions of nonstationary processes see e.g. Priestley (1965, 1981), Ghosh et al. (1997), and Ghosh and Draghicescu (2002a,b).
4.2.2 Sampling of continuous-time time series
Often time series observed at discrete time points t = jΔ (j = 1, 2, 3, ...) actually happen in continuous time τ ∈ R. Sampling in discrete time
leads to information loss in the following way: let Y_τ be a second order stationary time series with τ ∈ R. (Stationarity in continuous time is defined in exact analogy to Definition 39.) Then Y has a spectral representation
Y_τ = μ + ∫_{−∞}^{∞} e^{iτλ} dZ_Y(λ),  (4.18)
a spectral distribution function
F_Y(λ) = E[|ΔZ(λ)|²],  (4.19)
and, if F′ exists, a spectral density function f_Y(λ) = F′(λ) = (2π)^{−1} ∫_{−∞}^{∞} e^{−iλτ} γ_Y(τ) dτ. We also have
γ_Y(τ) = cov(Y_t, Y_{t+τ}) = ∫_{−∞}^{∞} e^{iτλ} f(λ) dλ.  (4.20)
(4.20)
The reason why the frequency range extends to (, ), instead of [, ], is that in continuous time, by denition, arbitrarily small frequencies are observable. Suppose now that Y is observed at discrete time points t = j , i.e. we observe (4.21) Xt = Yj Then we can write
Xt =
eij ( ) dZY () =
u=
/ +(2/ )u / +(2/ )u
/ /
=
u=
/ /
where dZX () =
(4.24)
fX () =
u=
fY ( + (2/ )u)
(4.25)
for [ , ]. This result can be interpreted as follows: a frequency > / can be written as = o (2/ )j for some j N where o is in the interval [/, / ]. The contributions of the two frequencies and
o to the observed function Xt (in discrete time) are confounded, i.e. they cannot be distinguished. Thus, if we observe a peak of fX at a frequency (0, / ], then this may be due to any of the periodic components with periods 2/( + (2/ )u), u = 0, 1, 2, ..., or a combination of these. This has, for instance, direct implications for sampling of sound signals. Suppose that 22050Hz (i.e. = 22050 2 138544.2) is the highest frequency that we want to identify (and later reproduce) correctly, instead of attributing it to a lower frequency. This would cover the range perceivable by the human ear. Then must be so small that / 22050 2. Thus the time gap between successive measurements of the sound wave must not exceed 1/44100. 4.2.3 Linear lters Suppose we need to extract or eliminate frequency components from a signal Xt with spectral density fX . The aim is thus, for instance, to produce an output signal Yt whose spectral density fY is zero for a frequency interval a b. The simplest, though not necessarily best, way to do this is linear ltering. A linear lter maps an input series Xt to an output series Yt by
Yt =
j =
aj Xtj
(4.26)
The coecients must fulll certain conditions in order that the sum is a2 dened. If Xt is second order stationary, then we need j < . The resulting spectral density of Yt is fY () = A()2 fX () where A() =
j =
(4.27)
aj eij .
(4.28)
To eliminate a certain frequency band [a, b] one thus needs a linear lter such that A() 0 in this interval. Equation (4.27) also helps to construct and simulate time series models with desired spectral densities: a series with spectral density fY () = (2 )1 A()2 can be simulated by passing a series of independent observations Xt through the lter A(). Note that, in reality, one can use only a nite number of terms in the lter so that only an approximation can be achieved. 4.2.4 Special models When modeling time series statistically, one may use one of the following approaches: a) parametric modeling; b) nonparametric modeling; and c)
semiparametric modeling. In parametric modeling, the probability distribution of the time series is completely specied a priori, except for a nite dimensional parameter = (1 , ..., p )t . In contrast, for nonparametric models, an innite dimensional parameter is unknown and must be estimated from the data. Finally, semiparametric models have parametric and nonparametric components. A link between parametric and nonparametric models can also be established by databased choice of the length p of the unknown parameter vector , with p tending to innity with the sample size. Some typical parametric models are: 1. White noise: Xt second order stationary, var(Xt ) = 2 , fX () = 2 /(2 ), and X (k ) = 0 (k = 0) 2. Moving average process of order q, MA(q ):
q
Xt = + t +
k=1
k tk
(4.29)
with R, t independent identically distributed (iid) r.v., E (t ) = 0 and 2 = var(t ) < . This can also be written as Xt = (B )t
q k=0
(4.30)
k B k . where B is the backshift operator with BXt = Xt1 , (B ) = q k If k=0 k z = 0 implies z  > 1, then Xt is invertible in the sense that it can also be written as
Xt =
k=1
k (Xtk ) + t .
(Xt )
k=1
k (Xtk ) = t
p k=1
(4.31) k z k = 0
k or (B )(Xt ) = t where (B ) = 1 p k=1 k B . If 1 implies z  > 1, then Xt is stationary. 4. Autoregressive moving average process, ARMA(p, q ):
(4.32)
(4.33)
5. Linear process: Xt = +
(4.34)
j =
where j depend on a nite dimensional parameter vector . The spectral density is 2 fX () =  (ei )2 . 6. Integrated ARIMA process, ARIMA(p, d, q ) (Box and Jenkins 1970): (B )((1 B )d Xt ) = (B )t (4.35) with d = 0, 1, 2, ..., where (z ) and (z ) are not zero for z  1. This means that the dth dierence (1 B )d Xt is a stationary ARMA process. 7. Fractional ARIMA process, FARIMA(p, d, q ) (Granger and Joyeux 1980, Hosking 1981, Beran 1995): (1 B ) (B ){(1 B )m Xt } = (B )t with d = m + ,
1 2
(4.36)
<< 1 2 , m = 0, 1. Here,
(1 − B)^δ = ∑_{k=0}^{∞} (−1)^k (δ choose k) B^k  (4.37)
with (δ choose k) denoting the (generalized) binomial coefficients.
The fractional differencing parameter δ plays an important role. If δ = 0, then (1 − B)^m X_t is an ordinary ARIMA(p, 0, q) process, with spectral density such that f_X(λ) converges to a finite value f_X(0) as λ → 0, and the covariances decay exponentially, i.e. |γ_X(k)| ≤ C a^k for some 0 < C < ∞, 0 < a < 1. The process is therefore said to have short memory. For δ > 0, f_X has a pole at the origin of the form f_X(λ) ≈ λ^{−2δ} as λ → 0, and γ_X(k) ≈ k^{2d−1}, so that
∑_{k=−∞}^{∞} γ_X(k) = ∞.
This case is also known as long memory, since autocorrelations decay very slowly (see Beran 1994). On the other hand, if δ < 0, then f_X(λ) ≈ λ^{−2δ} converges to zero at the origin and
∑_{k=−∞}^{∞} γ_X(k) = 0.
This is called antipersistence, since for large lags there is a negative correlation. The fractional differencing parameter δ, or d = δ + m, is also called long-memory parameter, and is related to the fractal or Hausdorff dimension d_H (see Chapter 3). For an extended discussion of long-memory and antipersistent processes see e.g. Beran (1994) and references therein.
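The expansion (4.37) is easy to evaluate through the standard recursion for its coefficients; a sketch (function names are illustrative), shown here before turning to the next model:

```python
import numpy as np

def frac_diff_coefficients(delta, n):
    """Coefficients pi_k of (1-B)^delta = sum_k pi_k B^k, via the recursion
    pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - delta) / k."""
    pi = np.empty(n)
    pi[0] = 1.0
    for k in range(1, n):
        pi[k] = pi[k - 1] * (k - 1 - delta) / k
    return pi

def frac_difference(x, delta):
    """Apply (1-B)^delta to a finite series (expansion truncated at the series length)."""
    pi = frac_diff_coefficients(delta, len(x))
    return np.array([np.dot(pi[:t + 1], x[t::-1]) for t in range(len(x))])
```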
8. Fractional Gaussian noise (Mandelbrot and van Ness 1968, Mandelbrot and Wallis 1969): recall that a stochastic process Y_t (t ∈ R) is called self-similar with self-similarity parameter H if for any c > 0, Y_t =_d c^{−H} Y_{ct}. This definition implies that the covariances of Y_t are equal to
cov(Y_t, Y_{t+s}) = (σ²/2)(|t + s|^{2H} + |t|^{2H} − |s|^{2H})
where σ² > 0. If Y_t is Gaussian (i.e. all joint distributions are normal), then the process is fully determined by its expected value and the covariance function. Therefore, there is only one self-similar Gaussian process. This process is called fractional Brownian motion B_H(t) with self-similarity parameter 0 < H < 1. The discrete time increment process
X_t = B_H(t) − B_H(t − 1)  (t ∈ N)  (4.38)
is called fractional Gaussian noise (FGN). FGN is stationary with autocovariances
γ(k) = (σ²/2)(|k + 1|^{2H} + |k − 1|^{2H} − 2|k|^{2H});  (4.39)
the spectral density is equal to (Sinai 1976)
f(λ) = 2 c_f (1 − cos λ) ∑_{j=−∞}^{∞} |2πj + λ|^{−2H−1},  λ ∈ [−π, π]  (4.40)
with c_f = c_f(H, σ²) = σ²(2π)^{−1} sin(πH) Γ(2H + 1) and σ² = var(X_i). For further discussion see e.g. Beran (1994).
9. Polynomial or trigonometric trend model:
X_t = ∑_{j=0}^{p} β_j t^j + U_t  (4.41)
or
X_t = ∑_{j=0}^{p} α_j cos λ_j t + ∑_{j=0}^{p} β_j sin λ_j t + U_t  (4.42)
with U_t stationary.
10. Nonparametric trend model:
X_{t,n} = g(t/n) + U_t  (4.43)
with g : [0, 1] → R a smooth function (e.g. twice continuously differentiable) and U_t stationary.
11. Semiparametric fractional autoregressive model, SEMIFAR(p, d, q) (Beran 1998, Beran and Ocker 1999, 2001, Beran and Feng 2002a,b):
(1 − B)^δ φ(B){(1 − B)^m X_t − g(s_t)} = U_t  (4.44)
where d, δ, U_t, and g are as above and m = 0, 1. In this case, the centered differenced process Y_t = (1 − B)^m X_t − g(s_t) is a fractional ARIMA(p, δ, 0) model. The SEMIFAR model incorporates stationarity, difference stationarity, antipersistence, short memory and long memory, as well as an unspecified trend. Incorporating all these components enables us to distinguish statistically which of the components are present in an observed time series (see Beran and Feng 2002a,b). A software implementation by Beran is included in the S-Plus package FinMetrics and described in Zivot and Wang (2002).
4.2.5 Fitting parametric models
If X_t is a second order stationary model with a distribution function that is known except for a finite dimensional parameter θ° = (θ°₁, ..., θ°_k)^t ∈ R^k, then the standard estimation technique is the maximum likelihood method: given an observed time series x₁, ..., x_n, estimate θ by
θ̂ = arg max_θ h(x₁, ..., x_n; θ)  (4.45)
where h is the joint density function of (X₁, ..., X_n). If observations are discrete, then h is the joint probability P(X₁ = x₁, ..., X_n = x_n). Equivalently, we may maximize the log-likelihood L(x₁, ..., x_n; θ) = log h(x₁, ..., x_n; θ). Under fairly general regularity conditions, θ̂ is asymptotically consistent, in the sense that it converges in probability to θ°. In other words, lim_{n→∞} P(‖θ̂ − θ°‖ > ε) = 0 for all ε > 0. In the case of a Gaussian time series with spectral density f_X(λ; θ), we have
L(x₁, ..., x_n; θ) = −(1/2)[n log 2π + log |Σ_n| + (x − μ_x)^t Σ_n^{−1} (x − μ_x)]  (4.46)
where x = (x₁, ..., x_n)^t, μ_x = μ · (1, 1, ..., 1)^t, and |Σ_n| is the determinant of the covariance matrix Σ_n of (X₁, ..., X_n)^t with elements [Σ_n]_{ij} = cov(X_i, X_j). Since under general conditions n^{−1} log |Σ_n| converges to (2π)^{−1} times the integral of log f_X (Grenander and Szegö 1958), and the (j, l)-th element of Σ_n^{−1} can be approximated by (2π)^{−2} ∫ f_X^{−1}(λ) exp{iλ(j − l)} dλ, an approximation can be obtained by the so-called Whittle estimator (Whittle 1953; see also e.g. Fox and Taqqu 1986, Dahlhaus 1987) that minimizes
L_n(θ) = (4π)^{−1} ∫_{−π}^{π} [log f_X(λ; θ) + I(λ)/f_X(λ; θ)] dλ  (4.47)
An alternative approximation for Gaussian processes is obtained by using an autoregressive representation of the type X_t = ∑_{j=1}^{∞} b_j X_{t−j} + ε_t, where the ε_t are independent identically distributed zero mean normal variables with variance σ². This leads to minimizing the sum of the squared residuals, as explained below in Equation (4.50) (see e.g. Box and Jenkins 1970, Beran 1995).
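For a finite autoregressive approximation, minimizing the sum of squared residuals is an ordinary least-squares problem. A sketch of this idea (names and the simple mean-centering are illustrative assumptions, not the book's implementation):

```python
import numpy as np

def fit_ar_least_squares(x, p):
    """Estimate AR(p) coefficients by minimizing sum_t e_t^2 with
    e_t = x_t - sum_k phi_k x_{t-k}, cf. equation (4.50) below."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    X = np.column_stack([x[p - k:len(x) - k] for k in range(1, p + 1)])  # lagged regressors
    y = x[p:]
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ phi) ** 2)       # residual variance estimate
    return phi, sigma2
```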
In general, the actual mathematical and practical difficulty lies in defining a computationally feasible estimation procedure and also in obtaining the asymptotic distribution of θ̂. There is a large variety of models for which this has been achieved. Most results are known for linear models X_t = ∑ ψ_j ε_{t−j} with iid ε_t. (All examples given in the previous section are linear.) The reason is that, if the distribution of ε_t is known, then the distribution of the process can be recovered by looking at the autocovariances, or equivalently the spectral density, only. Furthermore, if X_t is invertible, i.e. if X_t can be written as X_t = ∑_{k=1}^{∞} φ_k X_{t−k} + ε_t, then θ° can be estimated by maximizing the log-likelihood of the independent variables ε_t:
θ̂ = arg max_θ ∑_{t=1}^{n} log h_ε(e_t(θ))  (4.48)
where h_ε is the probability density of ε_t and e_t(θ) = x_t − ∑_{k=1}^{∞} φ_k x_{t−k}. For a finite sample, e_t(θ) is approximated by ẽ_t(θ) = x_t − ∑_{k=1}^{t−1} φ_k x_{t−k}. In the simplest case where the ε_t are normally distributed with h_ε(x) = (2πσ_ε²)^{−1/2} exp{−x²/(2σ_ε²)} and θ = (σ_ε², φ₁, φ₂, ..., φ_p) = (σ_ε², φ), we have e_t(θ) = e_t(φ) and
θ̂ = arg min_θ [∑_{t=1}^{n} log σ_ε² + ∑_{t=1}^{n} e_t²(φ)/σ_ε²]  (4.49)
Differentiating with respect to σ_ε² leads to
φ̂ = arg min_φ ∑_{t=1}^{n} e_t²(φ)  (4.50)
and σ̂_ε² = n^{−1} ∑ e_t²(φ̂). Under mild regularity conditions, as n tends to infinity, the distribution of √n(φ̂ − φ) tends to a normal distribution N(0, V) with covariance matrix V = 2B^{−1}, where B is a p × p matrix with elements
B_ij = (2π)^{−1} ∫ [∂ log f(λ; φ)/∂φ_i] [∂ log f(λ; φ)/∂φ_j] dλ
(see e.g. Box and Jenkins 1970, Beran 1995). The estimation method above assumes that the order of the model, i.e. the length p of the parameter vector θ, is known. This is not the case in general, so that p has to be estimated from the data. Information theoretic considerations (based on definitions discussed in Section 3.1) lead to Akaike's famous criterion (AIC; Akaike 1973a,b):
p̂ = arg min_p {−2 log likelihood + 2p}  (4.51)
More generally, we may minimize AIC_α = −2 log likelihood + αp with respect to p. This includes the AIC (α = 2), the BIC (Bayesian information criterion, Schwarz 1978, Akaike 1979) with α = log n, and the HIC (Hannan and Quinn 1979) with α = 2c log log n (c > 1). It can be shown that, if the observed process is indeed generated by a process from the postulated class of models, and if its order is p°, then for α ≥ O(2c log log n) the estimated order is asymptotically correct with probability one. In contrast, if α/(2c log log n) → 0 as n → ∞, then the criterion tends to choose too many parameters, in the sense that P(p̂ > p°) converges to a positive probability. This is, for instance, the case for Akaike's criterion. Thus, if identification of a correct model is the aim, and the observed process is indeed likely to be at least very close to the postulated model class, then α ≥ O(2c log log n) should be used. On the other hand, one may argue that no model is ever correct, so that increasing the number of parameters with increasing sample size may be the right approach. In this case, the original AIC is a good candidate. It should be noted, however, that if p → ∞ as n → ∞, then the asymptotic distribution and even the rate of convergence of θ̂ changes, since this is a kind of nonparametric modeling with an ultimately infinite dimensional parameter.
4.2.6 Fitting non- and semiparametric models
Most techniques for fitting nonparametric models rely on smoothing, combined with additional estimation of parameters needed for fine tuning of the smoothing procedure. To illustrate this, consider for instance
(1 − B)^m X_t = g(s_t) + U_t  (4.52)
as defined above, where U_t is second order stationary and s_t = t/n. If m is known, then g may be estimated, for instance, by a kernel smoother
ĝ(t₀) = (nb)^{−1} ∑_{t=1}^{n} K((s_t − s_{t₀})/b) y_t  (4.53)
as defined in Chapter 2, with y_t = (1 − B)^m x_t. However, results may differ considerably depending on the choice of the bandwidth b (see e.g. Gasser and Müller 1979, Beran and Feng 2002a,b). The optimal bandwidth depends on the nature of the residual process U_t. A criterion for optimality is, for instance, the integrated mean squared error

  IMSE = ∫ E{[ĝ(s) − g(s)]²} ds.

The IMSE can be written as

  IMSE = ∫ {E[ĝ(s)] − g(s)}² ds + ∫ var(ĝ(s)) ds = ∫ {Bias² + variance} ds.

The bias only depends on the function g, and is thus independent of the error process. The variance, on the other hand, is a function of the covariances γ_U(k) = cov(U_t, U_{t+k}), or equivalently of the spectral density f_U.
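A short numerical illustration of (4.53) may be helpful. The following Python sketch smooths a simulated trend-plus-noise series with two bandwidths; the Epanechnikov kernel, the normalized (Nadaraya-Watson) form of the weights, and the test signal are illustrative assumptions, not taken from the text.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, an illustrative choice for K."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

def kernel_smooth(y, b):
    """Kernel estimate of g in (4.53) at every rescaled time point s_t = t/n.
    The weights are normalized (Nadaraya-Watson form), which keeps the
    estimate well defined near the boundaries."""
    n = len(y)
    s = np.arange(1, n + 1) / n
    g_hat = np.empty(n)
    for i in range(n):
        w = epanechnikov((s - s[i]) / b)
        g_hat[i] = np.sum(w * y) / np.sum(w)
    return g_hat

# illustrative trend-plus-noise series, smoothed with two bandwidths
rng = np.random.default_rng(0)
n = 500
s = np.arange(1, n + 1) / n
g_true = np.sin(2 * np.pi * s)
y = g_true + rng.normal(scale=0.5, size=n)
for b in (0.02, 0.30):
    mse = np.mean((kernel_smooth(y, b) - g_true) ** 2)
    print(f"b = {b:.2f}: mean squared error {mse:.4f}")
```

The small bandwidth tracks the noise (low bias, high variance), while the large one oversmooths (high bias, low variance); the IMSE balances exactly this trade-off.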
The bandwidth that minimizes the IMSE thus depends on the unknown quantities g and f_U. Both g and f_U, therefore, have to be estimated simultaneously in an iterative fashion. For instance, in a SEMIFAR model, the asymptotically optimal bandwidth can be shown to be equal to b_opt = C_opt n^{(2d−1)/(5−2d)}, where C_opt is a constant that depends on the unknown parameter vector θ = (σ², d, φ_1, ..., φ_p)^t. Note that in this case, m is also part of the unknown vector. An algorithm for estimating g as well as θ can be defined by starting with an initial estimate of θ, calculating the corresponding optimal bandwidth, subtracting ĝ from x_t, reestimating θ, estimating the new optimal bandwidth, and so on. Note that in addition the order p is unknown, so that a model choice criterion has to be used at some stage. This complicates matters considerably, and special care has to be taken to define a reliable algorithm. Algorithms that work theoretically as well as practically for reasonably small sample sizes are discussed in Beran and Feng (2002a,b).

4.2.7 Spectral estimation

Sometimes one is only interested in the spectral density f_X of a stationary process or, equivalently, the autocovariances γ_X(k), without modeling the whole distribution of the time series. The reason can be, for instance, that, as discussed above, one may be mainly interested in (random) periodicities which are identifiable as peaks in the spectral density. A natural nonparametric estimate of γ_X(k) is the sample autocovariance

  γ̂(k) = n^{−1} Σ_{t=1}^{n−k} (x_t − x̄)(x_{t+k} − x̄)   (4.54)
The periodogram is then defined as

  I(λ) = (2π)^{−1} Σ_{k=−(n−1)}^{n−1} γ̂(k) e^{−ikλ}   (4.55)

It can be shown that E[I(λ)] → f_X(λ) as n → ∞. However, for lags close to n − 1, γ̂(k) is very inaccurate, because one averages over n − k observed pairs only. For instance, for k = n − 1, there is only one observed pair, namely (x_1, x_n), with this lag! As a result, I(λ) does not converge to f_X(λ). Instead, the following holds, under mild regularity conditions: if 0 < λ_1 < ... < λ_k < π and n → ∞, then the distribution of

  [2I(λ_1)/f_X(λ_1), ..., 2I(λ_k)/f_X(λ_k)]

converges to the distribution of (Z_1, ..., Z_k), where the Z_i are independent χ²_2 distributed random variables. This result is also true for sequences of frequencies 0 < λ_{1,n} < ... < λ_{k,n} < π, as long as the smallest distance between the frequencies, min |λ_{i,n} − λ_{j,n}|, does not converge to zero faster than n^{−1}. Because of the latter condition, and also for computational reasons (fast Fourier transform, FFT; see Cooley and Tukey 1965, Bringham 1988), one usually calculates I(λ) at the so-called Fourier frequencies λ_j = 2πj/n (j = 1, ..., m, with m = [(n − 1)/2]) only. Note that for Fourier frequencies Σ_{t=1}^n e^{itλ_j} = 0, so that

  I(λ_j) = (2πn)^{−1} |Σ_{t=1}^n x_t e^{−itλ_j}|².

Thus, the sample mean actually does not need to be subtracted. The periodogram at Fourier frequencies can also be understood as a decomposition of the variance into orthogonal components, analogous to classical analysis of variance (Scheffé 1959): for n odd,

  Σ_{t=1}^n (x_t − x̄)² = 4π Σ_{j=1}^m I(λ_j)   (4.56)

and, for n even,

  Σ_{t=1}^n (x_t − x̄)² = 4π Σ_{j=1}^m I(λ_j) + 2π I(π).   (4.57)
This means that I(λ_j) corresponds to the (empirically observed) contribution of periodic components with frequency λ_j to the overall variability of x_1, ..., x_n. A consistent estimate of f_X can be obtained by eliminating or downweighing sample autocovariances with too large lags:

  f̂(λ) = (2π)^{−1} Σ_{k=−(n−1)}^{n−1} w_n(k) γ̂(k) e^{−ikλ}   (4.58)

where w_n(k) = 0 (or becomes negligible) for |k| > M_n, with M_n/n → 0 and M_n → ∞. Equivalently, one can define a smoothed periodogram

  f̃(λ) = ∫ W_n(λ − μ) I(μ) dμ   (4.59)
for a suitable sequence of window functions W_n such that ∫ W_n(λ − μ) f(μ) dμ converges to f(λ) as n → ∞. See e.g. Priestley (1981) for a detailed discussion. Finally, it should be noted that, in spite of its inconsistency, the raw periodogram is very useful for finding periodicities. In particular, in the case of deterministic periodicities with frequencies λ_j, I(λ) diverges to infinity for λ = λ_j and remains finite (proportional to a χ²_2 variable) elsewhere.

4.2.8 The harmonic regression model

An important approach to analyzing musical sounds is the harmonic regression model

  X_t = Σ_{j=1}^p [α_j cos λ_j t + β_j sin λ_j t] + U_t   (4.60)
with U_t stationary. Note that, theoretically, this model can also be understood as a stationary process with jumps in the spectral distribution F_X (see Section 4.2.1). Given λ = (λ_1, ..., λ_p)^t, the parameter vector β = (α_1, ..., α_p, β_1, ..., β_p)^t can be estimated by the least squares or, more generally, weighted least squares method,

  β̂ = arg min_β Σ_{t=1}^n w(t/n) (x_t − Σ_{j=1}^p [α_j cos λ_j t + β_j sin λ_j t])²   (4.61)

where w is a weight function. The solution is obtained from the usual linear regression formulas. In many applications the situation is more complex, since the frequencies λ_1, ..., λ_p are also unknown. This leads to a nonlinear regression problem. A simple approximate solution can be given by (Walker 1971, Hannan 1973, Hassan 1982, Brown 1990, Quinn and Thomson 1991)

  λ̂ = arg max_{0<λ_1<...<λ_p<π} Σ_{j=1}^p |Σ_{t=1}^n w(t/n) x_t e^{−iλ_j t}|²   (4.63)

and
  α̂_j = 2 [Σ_{t=1}^n w(t/n) x_t cos λ̂_j t] / [Σ_{t=1}^n w(t/n)],  β̂_j = 2 [Σ_{t=1}^n w(t/n) x_t sin λ̂_j t] / [Σ_{t=1}^n w(t/n)].   (4.64)

Note that (4.63) means that we look for the p largest peaks in the (w-tapered) periodogram. Under quite general assumptions, the asymptotic distribution of the estimates can be shown to be as follows: the vectors

  Z_{n,j} = [√n (α̂_j − α_j), √n (β̂_j − β_j), n^{3/2} (λ̂_j − λ_j)]^t

(j = 1, ..., p) are asymptotically mutually independent, each having a 3-dimensional normal distribution with expected value zero and covariance matrix C(λ_j) that depends on f_U(λ_j) and the weight function w. The formulas for C are as follows (Irizarry 1998, 2000, 2001, 2002):

  C(λ_j) = 4π f_U(λ_j) (α_j² + β_j²)^{−1} V(λ_j)   (4.65)
where

  V(λ_j) = [ c_1 α_j² + c_2 β_j²   c_3 α_j β_j          −c_4 β_j
             c_3 α_j β_j           c_2 α_j² + c_1 β_j²   c_4 α_j
             −c_4 β_j              c_4 α_j               c_o ]   (4.66)

with

  c_o = a_o b_o,  c_1 = U_o W_o^{−2},  c_2 = a_o b_1,   (4.67)

  c_3 = −a_o W_o^{−2} W_1 (W_o² W_1 U_2 − W_1³ U_o − 2W_o² W_2 U_1 + 2W_o W_1 W_2 U_o),   (4.68)

  c_4 = a_o (W_o W_1 U_2 − W_1² U_1 − W_o W_2 U_1 + W_1 W_2 U_o),

  a_o = (W_o W_2 − W_1²)^{−2},

  b_n = W_n² U_2 + W_{n+1} (W_{n+1} U_o − 2W_n U_1)  (n = 0, 1),

and

  U_n = ∫_0^1 s^n w²(s) ds,  W_n = ∫_0^1 s^n w(s) ds.
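A rough Python sketch may make the estimation scheme (4.63)-(4.64) concrete: the p largest peaks of the (optionally tapered) periodogram are located on a fine zero-padded frequency grid, and the cosine and sine coefficients are then obtained by tapered projections. The flat default taper, the grid size, the naive peak search, and the test signal are illustrative assumptions, not taken from the text.

```python
import numpy as np

def estimate_harmonics(x, p, taper=None, grid_size=2**16):
    """Sketch of (4.63)-(4.64): find the p largest periodogram peaks,
    then recover the cos/sin coefficients at the estimated frequencies."""
    n = len(x)
    t = np.arange(1, n + 1)
    w = np.ones(n) if taper is None else taper(t / n)   # w(t/n), flat by default
    # zero-padded FFT approximates |sum_t w(t/n) x_t e^{-i lambda t}|^2 on a fine grid
    X = np.fft.rfft(w * x, n=grid_size)
    lam = 2 * np.pi * np.arange(len(X)) / grid_size
    power = np.abs(X) ** 2
    # local maxima of the tapered periodogram, sorted by height
    is_peak = (power[1:-1] > power[:-2]) & (power[1:-1] > power[2:])
    peaks = 1 + np.where(is_peak)[0]
    peaks = peaks[np.argsort(power[peaks])[::-1][:p]]
    lam_hat = np.sort(lam[peaks])
    # (4.64): tapered projections on cos and sin at the estimated frequencies
    W0 = w.sum()
    alpha = np.array([2 * np.sum(w * x * np.cos(l * t)) / W0 for l in lam_hat])
    beta = np.array([2 * np.sum(w * x * np.sin(l * t)) / W0 for l in lam_hat])
    return lam_hat, alpha, beta

# illustrative check with two sinusoids in white noise
rng = np.random.default_rng(1)
n = 2048
t = np.arange(1, n + 1)
x = 1.5 * np.cos(0.7 * t) + 0.8 * np.sin(1.9 * t) + rng.normal(size=n)
print(estimate_harmonics(x, p=2)[0])   # should be close to 0.7 and 1.9
```

In line with the n^{3/2} rate above, the frequency estimates in such experiments are typically far more accurate than the amplitude estimates.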
This result can be used to obtain tests and confidence intervals for λ_j, α_j, and β_j (j = 1, 2, ..., p), with the unknown quantities α_j, β_j, and f_U(λ_j) then replaced by estimates. Note that this involves, in particular, estimation of the spectral density of the residual process U_t. A quantity that is of particular interest is the difference between the partials λ_j and multiples of the fundamental frequency λ_1,

  Δ_j = λ_j − jλ_1.   (4.74)

For many musical instruments, this difference is exactly or approximately equal to zero. The asymptotic distribution given above can be used to test the null hypothesis H_o: Δ_j = 0 or to construct confidence intervals for Δ_j. More specifically, n^{3/2} (Δ̂_j − Δ_j) is asymptotically normal with zero mean and variance

  v² = 4π c_o [ j² f_U(λ_1)/(α_1² + β_1²) + f_U(λ_j)/(α_j² + β_j²) ].   (4.75)

This can be generalized to any hypothesized relationship λ_j = g(j) λ_1, with Δ_j = λ_j − g(j) λ_1 (see the example of a guitar mentioned in the next section).

4.2.9 Dominating frequencies in random series

In the harmonic regression model, the main signal consists of deterministic periodic functions. For less harmonic noisy signals, a weaker form
of periodicity may be observed. This can be modeled by a purely random process whose mth difference Y_t = (1 − B)^m X_t is stationary (m = 0, 1, ...) with a spectral density f that has distinct local maxima. Estimation of local maxima and identification of the corresponding frequencies is considered, for instance, in Newton and Pagano (1983) and Beran and Ghosh (2000). Beran and Ghosh (2000) consider the case where Y_t is a fractional ARIMA(p, δ, 0) process of unknown order p. Suppose we want to estimate the frequency λ_max where f assumes its largest local maximum. In a first step, the parameter vector θ = (σ², d, φ_1, ..., φ_p) (with d = δ + m) is estimated by maximum likelihood, and p is chosen by the BIC. Let η = (η_1, η_2, ..., η_{p+2}) = (σ², ξ) and

  f(λ; θ) = σ²/(2π) |φ(e^{iλ})|^{−2} |1 − e^{iλ}|^{−2δ} = σ² g(λ; ξ)   (4.76)

be the spectral density of Ŷ_t = (1 − B)^m X_t. Then λ̂_max is set equal to the frequency where the estimated spectral density f(λ; θ̂) assumes its maximum. Define

  V_p(ξ) = 2W^{−1}   (4.77)

where

  W_ij = (2π)^{−1} [∫_{−π}^{π} (∂ log g(x; u)/∂u_i)(∂ log g(x; u)/∂u_j) dx]_{u=ξ}  (i, j = 1, ..., p + 1).   (4.78)

Then, as n → ∞,

  √n (λ̂_max − λ_max) →_d N(0, v_p²),  v_p² = [ġ′(λ_max; ξ)]^t V_p(ξ) [ġ′(λ_max; ξ)] / [g″(λ_max; ξ)]²   (4.79)

where →_d denotes convergence in distribution, g′ and g″ denote derivatives with respect to the frequency λ, and ġ derivatives with respect to the parameter vector. Note in particular that the order of var(λ̂_max) is n^{−1}, whereas in the harmonic regression model the frequency estimates have variances of the order n^{−3}. The reason is that a deterministic periodic signal is a much stronger form of periodicity and is therefore easier to identify.

4.3 Specific applications in music

4.3.1 Analysis and modeling of musical instruments

There is an abundance of literature on mathematical modeling of sound signals produced by musical instruments. Since a musical instrument is a very complex physical system, even if conditions are kept fixed, not only deterministic but also statistical models are important. In addition to that,
various factors can play a role. For instance, the sound of a violin depends on the wood it is made of, which manufacturing procedure was used, current atmospheric conditions (temperature, humidity, air pressure), who plays the violin, which particular notes are played in which context, etc. The standard approach that makes modeling feasible is to think of a sound as the result of harmonic components that may change slowly in time, plus noise components that may be described by random models. It should be noted, however, that sound is not only produced by an instrument but also perceived by the human ear and brain. Thus, when dealing with the significance or effect of sounds, physiology, psychology, and related scientific disciplines come into play. Here, we are first concerned with the actual objective modeling of the physical sound wave. This is a formidable task on its own, and far from being solved in a satisfactory manner. The scientific study of musical sound signals by physical equations goes back to the 19th century. Helmholtz (1863) proved experimentally that musical sound signals are mainly composed of frequency components that are multiples of a fundamental frequency (also see Rayleigh 1894). Ohm conjectured that the human ear perceives sounds by analyzing the power spectrum (i.e. essentially the periodogram), without taking into account relative phases of the sounds. These conjectures have been mostly confirmed by psychological and physiological experiments (see e.g. Grey 1977, Pierce 1983/1992). Recent mathematical models of instrumental sound waves (see e.g. Fletcher and Rossing 1991) lead to the assumption that, for short time segments, a musical sound signal is stationary and can be written as a harmonic regression model with λ_1 < λ_2 < ... < λ_p. To analyze a musical sound wave, one therefore can divide time into small blocks and fit the harmonic regression model as described above. The lowest frequency λ_1 is called the fundamental frequency and corresponds to what one calls pitch in music. The higher frequencies λ_j (j ≥ 2) are called partials, overtones, or harmonics. The amplitudes of the partials, and how they change gradually, are main factors in determining the timbre of a sound. For illustration, Figure 4.1 shows the sound wave (air pressure amplitudes) of a piano during 1.9 seconds where first a c and then an f are played. The signal was sampled in 16-bit format at a sampling rate of 44100 Hz. This corresponds to CD quality and means that every second, 44100 measurements of the sound wave were taken, each of the measurements taking an integer value between −32768 and 32767 (32768 + 32767 + 1 = 2^16). Figure 4.2 shows an enlarged picture of the shaded area in Figure 4.1 (2050 measurements, corresponding to 0.046 seconds). The periodogram (in log-coordinates) of this subseries is plotted in Figure 4.3. The largest peak occurs approximately at the fundamental frequency λ_1 = 441 · 2^{−9/12} ≈ 262.22 Hz of the c. Note that, since the periodogram is calculated at Fourier frequencies only, λ_1 cannot be identified exactly (see also the remarks below). A small number of partials λ_j (j ≥ 2) can also be seen in Figure 4.3; the contribution of
higher partials is, however, relatively small. In contrast, the periodogram of an e played on a harpsichord shows a large number of distinctly important partials (Figures 4.4, 4.5). There is obviously a clear difference between piano and harpsichord in terms of the amplitudes of higher partials. A comprehensive study of instrumental or vocal sounds also needs to take into account different techniques in which an instrument is played, and other factors such as the particular pitch λ_1 that is played. This would, however, be beyond the scope of this introductory chapter. A specific component that is important for timbre is the way in which the coefficients α_j, β_j change in time (see e.g. Risset and Mathews 1969). Readers familiar with synthesizers may recall envelopes that are controlled by parameters such as attack and delay. The development of α_j, β_j can be studied by calculating the periodogram for a moving time window and plotting its values against time and frequency in a three-dimensional or image plot. Thus, we plot the local periodogram (in this context also called
Figure 4.2 Zoomed piano sound wave: shaded area in Figure 4.1.
spectrogram)

  I(t, λ) = [2π Σ_{j=1}^n W²((t − j)/(nb))]^{−1} |Σ_{j=1}^n W((t − j)/(nb)) x_j e^{−ijλ}|²   (4.81)
where W: R → R_+ is a weight function such that W(u) = 0 for |u| > 1, and b > 0 is a bandwidth that determines how large the window (block) is, i.e. how many consecutive observations are considered to correspond approximately to a harmonic regression model with fixed coefficients α_j, β_j and stationary noise U_t. This is illustrated in color Figure 4.7 for a harpsichord sound, with W(u) = 1{|u| ≤ 1}. Intense pink corresponds to high values of I(t, λ). Figures 4.6a through d show explicitly the change in I(t, λ) between four different blocks. Since the note was played staccato, the sound wave is very short, namely about 0.1 seconds. Nevertheless, there is a change in the spectrum of the sound, with some of the higher harmonics fading away.
[Figures 4.4 and 4.5: sound wave (amplitude against time in seconds) and periodogram of an e played on a harpsichord.]
Figure 4.6 Harpsichord sound: periodogram plots for different time frames (moving windows of time points).
Figure 4.7 A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ). (Color figures follow page 152.)
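For the rectangular window W(u) = 1{|u| ≤ 1} used in Figure 4.7, the spectrogram (4.81) reduces to a plain periodogram computed over a moving block. The following Python sketch implements this special case; the synthetic signal with one fading partial, the block half-width, and the step size are illustrative assumptions, not taken from the text.

```python
import numpy as np

def spectrogram(x, nb, step=None):
    """Local periodogram I(t, lambda) of (4.81) with the rectangular window
    W(u) = 1{|u| <= 1}: a plain periodogram over a moving block of
    half-width nb around each evaluation time."""
    n = len(x)
    step = nb if step is None else step
    times, blocks = [], []
    for center in range(nb, n - nb, step):
        seg = x[center - nb:center + nb]          # window of 2*nb observations
        I = np.abs(np.fft.rfft(seg)) ** 2 / (2 * np.pi * len(seg))
        times.append(center)
        blocks.append(I)
    freqs = 2 * np.pi * np.arange(len(blocks[0])) / (2 * nb)
    return np.array(times), freqs, np.array(blocks)

# illustrative sound-like signal: a partial that fades away over time
rng = np.random.default_rng(2)
n = 8000
t = np.arange(n)
x = np.cos(0.3 * t) + np.exp(-t / 2000) * np.cos(1.2 * t) + 0.1 * rng.normal(size=n)
times, freqs, I = spectrogram(x, nb=256)
band = (freqs > 1.1) & (freqs < 1.3)
print(I[0, band].sum(), I[-1, band].sum())   # energy near 1.2 decays over time
```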
Apart from the relative amplitudes of partials, most musical sounds include a characteristic nonperiodic noise component. This is a further justification, apart from possible measurement errors, to include a random deviation part in the harmonic regression equation. The properties of the stochastic process U_t are believed to be characteristic for specific instruments (see e.g. Serra and Smith 1991, Rodet 1997). Typical noise components are, for instance, transient noise in percussive instruments, breath noise in wind instruments, or bow noise of string instruments. For a discussion of statistical issues in this context see e.g. Irizarry (2001). For most instruments, not only the harmonic amplitudes but also the characteristics of the noise component change gradually. This may be modeled by smoothly changing processes as defined, for instance, in Ghosh et al. (1997). Other approaches are discussed in Priestley (1965) and Dahlhaus (1996a,b, 1997) (see Section 4.2.1 above). Some interesting applications of the asymptotic results in Section 4.2.8 to questions arising in the analysis of musical sounds are discussed in Irizarry (2001). In particular, the following experiment is described: recordings of a professional clarinet player trying to play concert pitch A (λ_1 = 441 Hz) and a professional guitar player playing D (λ_1 = 146.8 Hz) were made. For the analysis of the clarinet sound, a one-second segment was divided into nonoverlapping blocks consisting of 1025 measurements (23 milliseconds), and the harmonic regression model was fitted to each block separately. For the guitar, the same was done with 60 nonoverlapping intervals with 3000 observations each. Two types of results were obtained:

1. The clarinet player turned out to be always out of tune in the sense that the estimated fundamental frequency λ̂_1 was always outside the 95% acceptance region 441 Hz ± 1.96 √(C_33(λ_1^o)) n^{−3/2}, where the null hypothesis is H_o: λ_1 = λ_1^o = 441 Hz. On the other hand, from the point of view of musical perception, the clarinet player was not out of tune, because the deviation from 441 Hz was less than 0.76 Hz, which corresponds to 0.03 semitones. According to experimental studies, the human ear cannot distinguish notes that are 0.03 semitones apart (Pierce 1983/1992).

2. Physical models (see e.g. Fletcher and Rossing 1991) postulate the following relationships between the fundamental frequency and the partials: for a harmonic instrument such as the clarinet, one expects λ_j = jλ_1, whereas for a plucked string instrument, such as the guitar, one should have λ_j ≈ cj²λ_1, where c is a constant determined by properties of the strings. The experiment described in Irizarry (2001) supports the assumption for the clarinet in the sense that, in general, the 95% confidence intervals for the difference λ_j − jλ_1 contained 0. For the guitar, his findings suggest a relationship of the form λ_j ≈ c(a + j)²λ_1 with a ≠ 0.

4.3.2 Licklider's theory of pitch perception

Thumfart (1995) uses the theory of discrete evolutionary spectra to derive a simple linear model for pitch perception as proposed by Licklider (1951). The general biological background is as follows (see e.g. Kelly 1991): vibrations of the ear drum caused by sound waves are transferred to the inner ear (cochlea) by three ossicles in the middle ear. The inner ear is a spiral structure that is partitioned along its length by the basilar membrane. The sound wave causes a traveling wave on the basilar membrane, which in turn causes hair cells positioned at different locations to release a chemical transmitter. The chemical transmitter generates nerve impulses to the auditory nerve. At which location on the membrane the highest amplitude occurs, and thus which groups of hair cells are activated, depends on the frequency
of the sound wave. This means that certain frequency regions correspond to certain hair groups. Frequency bands with high spectral density f (or high increments dF of the spectral distribution) activate the associated hair groups. To obtain a simple model for the effect of a sound on the basilar membrane movement, Slaney and Lyon (1991) partition the cochlea into 86 sections, each section corresponding to a particular group of cells. Thumfart (1995) assumes that each group of cells acts like a separate linear filter ψ_j (j = 1, ..., 86). (This is a simplification compared to Slaney and Lyon, who use nonlinear models.) The wave entering the inner ear is assumed to be the original sound wave X_t, filtered by the outer ear by a linear filter A_1, and by the middle ear by a linear filter A_2. Thus, the output of the inner ear that generates the final nerve impulses consists of 86 time series

  Y_{t,j} = ψ_j(B) A_2(B) A_1(B) X_t  (j = 1, ..., 86).   (4.82)
Calculating tapered local periodograms I_j(u, λ) of Y_{t,j} for each of the 86 sections (j = 1, ..., 86), one can then define the quantity

  c(k, j, u) = ∫ I_j(u, λ) e^{ikλ} dλ   (4.83)

which Slaney and Lyon call a correlogram. This is in fact an estimated local autocovariance at lag k for section j and the time segment with midpoint u. The Slaney-Lyon correlogram thus essentially characterizes the local autocovariance structure of the resulting nerve impulse series. Thumfart (1995) shows formally how, and under which conditions, this model can be defined within the framework of processes with a discrete evolutionary spectrum. He also suggests a simple method for estimating pitch (the fundamental frequency) at local time u by setting λ̂_1(u) = 2π/k_max(u), where k_max(u) = arg max_k C(k, u) and C(k, u) = Σ_{j=1}^{86} c(k, j, u).

4.3.3 Identification of pitch, tone separation, and purity of intonation

In a recent study, Weihs et al. (2001) investigate objective criteria for judging the quality of singing (also see Ligges et al. 2002). The main question asked in their analysis is how to assess purity of intonation. In an experimental setting, with standardized playback piano accompaniment in a recording studio, 17 singers were asked to sing Händel's Tochter Zion and Beethoven's Ehre Gottes aus der Natur. The audio signal of the vocal performance was recorded in CD quality in 16-bit format at a sampling rate of 44100 Hz. For the actual statistical analysis, the data is reduced to 11000 Hz, for computational reasons, and standardized to the interval [−1, 1]. The first question is how to identify the fundamental frequency (pitch) λ_1. In the harmonic regression model above, the estimates of λ_1 and the partials λ_j (2 ≤ j ≤ k) are identical with the k frequencies where the
periodogram assumes its k largest values. Weihs et al. suggest a simplified (though clearly suboptimal) version of this, in that they consider the periodogram at Fourier frequencies λ_j = 2πj/n (j = 1, 2, ..., m = [(n − 1)/2]) only and set

  λ̂_1 = min{λ_j : 2 ≤ j ≤ m − 1, I(λ_j) ≥ I(λ_{j−1}) and I(λ_j) ≥ I(λ_{j+1})}.   (4.84)

In other words, λ̂_1 corresponds to the Fourier frequency where the first peak of the periodogram occurs.
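A minimal Python sketch of the first-peak rule (4.84) follows. The crude height threshold, which guards against spurious local peaks caused by noise, and the synthetic tone are illustrative assumptions; the empirical interpolation step of Weihs et al. is not implemented here.

```python
import numpy as np

def first_peak_pitch(x, fs):
    """Sketch of (4.84): the pitch estimate is the lowest Fourier frequency
    at which the periodogram has a local peak; fs is the sampling rate."""
    n = len(x)
    I = np.abs(np.fft.rfft(x - x.mean())) ** 2 / (2 * np.pi * n)
    # local peaks over the interior Fourier frequencies
    peaks = 1 + np.where((I[1:-1] >= I[:-2]) & (I[1:-1] >= I[2:]))[0]
    peaks = peaks[I[peaks] > 0.05 * I.max()]   # crude guard against noise bumps
    j = peaks.min()                            # first peak = lowest frequency
    return j * fs / n                          # lambda_j = 2*pi*j/n in Hz

# illustrative tone: fundamental 262 Hz with two overtones, sampled at 11000 Hz
rng = np.random.default_rng(3)
fs, dur = 11000, 0.5
t = np.arange(int(fs * dur)) / fs
x = (np.sin(2 * np.pi * 262 * t) + 0.5 * np.sin(2 * np.pi * 524 * t)
     + 0.3 * np.sin(2 * np.pi * 786 * t) + 0.05 * rng.normal(size=t.size))
print(first_peak_pitch(x, fs))   # close to 262, up to Fourier-grid accuracy
```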
Because of the restriction to Fourier frequencies, the periodogram may have two adjacent peaks, and the estimate is too inaccurate in general. An empirical interpolation formula is suggested by the authors to obtain an improved estimate of λ_1. A comparison with harmonic regression is not made, however, so that it is not clear how well the interpolation works in comparison. Given a procedure for pitch identification, an automatic note separation procedure can be defined. This is a procedure that identifies time points in a sound signal where a new note starts. The interesting result in Weihs et al. is that automatic note separation works better for amateur singers than for professionals. The reason may be the absence of vibrato in amateur voices. In a third step, Weihs et al. address the question of how to assess computationally the purity of intonation based on a vocal time series. This is done using discriminant analysis. The discussion of these results is therefore postponed to Chapter 9.

4.3.4 Music as 1/f noise?

In the 1970s, Voss and Clarke (1975, 1978) discovered a seemingly universal law according to which music has a 1/f spectrum. By a 1/f spectrum one means that the observed process has a spectral density f such that f(λ) ∝ |λ|^{−1} as λ → 0. In the sense of definition (4.10), such a density actually does not exist; however, a generalized version of the spectral density exists in the sense that the expected value of the periodogram converges to this function (see Matheron 1973, Solo 1992, Hurvich and Ray 1995). Specifically, Voss and Clarke analyzed acoustic music signals by first transforming the recorded signal X_t in the following way: a) X_t is filtered by a lowpass filter (frequencies outside the interval [10 Hz, 10000 Hz] are eliminated); and b) the instantaneous power Y_t = X_t² is filtered by another lowpass filter (frequencies above 20 Hz are eliminated). This filtering technique essentially removes higher frequencies but retains the overall shape (or envelope) of each sound wave corresponding to a note, and the relative position on the onset axis. In this sense, Voss and Clarke actually analyzed rhythmic structures. A recent, statistically more sophisticated study along this line is described in Brillinger and Irizarry (1998). One objection to this approach can be that in acoustic signals, structural
Figure 4.8 A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c), and its periodogram on log scale (d) together with the fitted SEMIFAR spectrum.
properties of the composition may be confounded with those of the instruments. Consider, for instance, the harpsichord sound wave in Figure 4.8a. The square of the wave is displayed in Figure 4.8b on a logarithmic scale. The picture illustrates that, apart from the obvious oscillation, the (envelope of the) signal changes slowly. Fitting a SEMIFAR model (with order p ≤ 8 chosen by the BIC) yields a good fit to the periodogram. The estimated fractional differencing parameter is d̂ = 0.51, with a 95% confidence interval of [0.29, 0.72]. This corresponds to a spectral density (defined in the generalized sense above) that is proportional to |λ|^{−1.02}, or approximately |λ|^{−1}. Thus, even in a composition consisting of one single note one would detect 1/f noise in the resulting sound wave. Instead of recorded sound waves, we therefore consider the score itself, independently of which instrument is supposed to play. This is similar, but not identical, to considering zero crossings of a sound signal (see Voss and Clarke 1975, 1978, Voss 1988; Brillinger and Irizarry 1998). Figures 4.9a and c show the log-frequencies plotted against onset time for the first movement of Bach's first Cello Suite and for Paganini's Capriccio No. 24. For Bach, the SEMIFAR fit yields d̂ = 0.7 with a 95% confidence interval of [0.46, 0.93]. This corresponds to a 1/|λ|^{1.4} spectrum; however, 1/|λ| (d = 1/2) is included in the confidence interval. Thus, there is not enough evidence against the 1/f hypothesis. In contrast, for Paganini (Figure 4.11) we obtain d̂ = 0.21 with a 95% confidence interval of [0.07, 0.35], which excludes 1/f noise. This indicates that there is a larger variety of fractal behavior than the 1/f law would suggest. Note also that in both cases there is also a trend in the data, which is in fact an even stronger type of long memory than the stochastic one. Moreover, Bach's (and also, to a lesser degree, Paganini's) spectrum has local maxima in the spectral density, indicating periodicities (see Section 4.2.9). Thus, there is no pure 1/f behavior but instead a mixture of long-range dependence, expressed by the power law near the origin, and short-range periodicities.
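A crude numerical check of power-law behavior near the origin can be obtained by regressing the log-periodogram on log-frequency at the lowest Fourier frequencies; the slope is then approximately −2d. The Python sketch below does this for a simulated FARIMA(0, d, 0) series. It is only a rough diagnostic, deliberately simpler than the SEMIFAR fits used in the text; all numerical settings are illustrative assumptions.

```python
import numpy as np

def loglog_slope(x, frac=0.1):
    """Regress log I(lambda_j) on log lambda_j over the lowest Fourier
    frequencies; under f(lambda) ~ |lambda|^{-2d} the slope is about -2d."""
    n = len(x)
    I = np.abs(np.fft.rfft(x - x.mean())) ** 2 / (2 * np.pi * n)
    j = np.arange(1, int(frac * (n // 2)))
    lam = 2 * np.pi * j / n
    slope = np.polyfit(np.log(lam), np.log(I[j]), 1)[0]
    return slope, -slope / 2          # (slope, rough estimate of d)

# simulate FARIMA(0, d, 0) via the MA(infinity) weights of (1-B)^{-d}
rng = np.random.default_rng(4)
d, n = 0.4, 4096
psi = np.ones(n)
for k in range(1, n):
    psi[k] = psi[k - 1] * (k - 1 + d) / k
eps = rng.normal(size=2 * n)
x = np.convolve(eps, psi)[n:2 * n]    # discard burn-in
print(loglog_slope(x))                # d-estimate roughly near 0.4
```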
Figure 4.9 Log-frequencies with fitted SEMIFAR trend, and log-log periodogram together with SEMIFAR fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d) respectively.
Finally, consider an alternative quantity, namely the local variability of notes modulo octave. Since we are in Z_12, a measure of variability for circular data should be used. Here, we use the measure V = (1 − R̄) as defined in Chapter 7, or rather the transformed variable log[(V + 0.05)/(1.05 − V)]. The resulting standardized time series are displayed in Figures 4.10a and c. The log-log plots of the periodograms and fitted SEMIFAR spectra are given in Figures 4.10b and d respectively. The estimated long-memory parameters are similar to before, namely d̂ = 0.51 ([0.20, 0.81]) for Bach and d̂ = 0.33 ([0.24, 0.42]) for Paganini.

Figure 4.10 Local variability with fitted SEMIFAR trend, and log-log periodogram together with SEMIFAR fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d) respectively.
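The local circular variability used above is easy to compute. The following Python sketch evaluates V = 1 − R̄ (with the transform log[(V + 0.05)/(1.05 − V)]) over a moving window of notes modulo octave; the window width and the toy note sequence are illustrative assumptions, not taken from the text.

```python
import numpy as np

def circular_variability(pitches, width=16):
    """Local circular variability V = 1 - R_bar of notes modulo octave.
    Pitch classes in Z_12 are mapped to angles 2*pi*k/12; R_bar is the mean
    resultant length over a moving window of `width` notes."""
    angles = 2 * np.pi * (np.asarray(pitches) % 12) / 12
    z = np.exp(1j * angles)
    V = np.array([1 - np.abs(z[i:i + width].mean())
                  for i in range(len(z) - width + 1)])
    return np.log((V + 0.05) / (1.05 - V))   # transform used in the text

# illustrative example: a repeated note (low variability) followed by a
# chromatic run (high variability)
notes = [60] * 32 + list(range(60, 92))
print(circular_variability(notes)[[0, -1]])
```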
CHAPTER 5
Hierarchical methods
5.1 Musical motivation

Musical structures are typically generated in a hierarchical manner. Most compositions can be divided approximately into natural segments (e.g. movements of a sonata); these are again divided into smaller units (e.g. exposition, development, and coda of a sonata movement). These can again be divided into smaller parts (e.g. melodic phrases), and so on. Different parts even at the same hierarchical level need not be disjoint. For instance, different melodic lines may overlap. Moreover, different parts are usually closely related within and across levels. A general mathematical approach to understanding the vast variety of possibilities can be obtained, for instance, by considering a hierarchy of maps defined in terms of a manifold (see e.g. Mazzola 1990a). The concept of hierarchical relationships and similarities is also related to self-similarity and fractals as defined in Mandelbrot (1977) (see Chapter 3). To obtain more concrete results, hierarchical regression models have been developed in the last few years (Beran and Mazzola 1999a,b, 2000, 2001).

5.2 Basic principles

5.2.1 Hierarchical aggregation and decomposition

Suppose that we have two time series Y_t, X_t and we wish to model the relationship between Y_t and X_t. The simplest model is simple linear regression

  Y_t = β_o + β_1 X_t + ε_t   (5.1)
where ε_t is a stationary zero mean process independent of X_t. If Y_t and X_t are expected to be hierarchical, then we may hope to find a more realistic model by first decomposing X_t (and possibly also Y_t) and searching for dependence structures between Y_t (or its components) and the components of X_t. Thus, given a decomposition X_t = X_{t,1} + ... + X_{t,M}, we consider the multiple regression model

  Y_t = β_o + Σ_{j=1}^M β_j X_{t,j} + ε_t   (5.2)
with ε_t second order stationary and E(ε_t) = 0. Alternatively, if Y_t = Y_{t,1} + ... + Y_{t,L}, we may consider a system of L regressions

  Y_{t,1} = β_{01} + Σ_{j=1}^M β_{j1} X_{t,j} + ε_{t,1}
  Y_{t,2} = β_{02} + Σ_{j=1}^M β_{j2} X_{t,j} + ε_{t,2}
  ...
  Y_{t,L} = β_{0L} + Σ_{j=1}^M β_{jL} X_{t,j} + ε_{t,L}.
Three methods of hierarchical regression based on decompositions will be discussed here:

HIREG: hierarchical regression using explanatory variables obtained by kernel smoothing with predetermined fixed bandwidths;
HISMOOTH: hierarchical smoothing models with automatic bandwidth selection;
HIWAVE: hierarchical wavelet models.

5.2.2 Hierarchical regression

Given an explanatory time series X_t (t = 1, 2, ..., n), a smoothing kernel K, and a hierarchy of bandwidths b_1 > b_2 > ... > b_M > 0, define

  X_{t,1} = (nb_1)^{−1} Σ_{s=1}^n K((t − s)/(nb_1)) X_s   (5.3)

and, for 1 < j ≤ M,

  X_{t,j} = (nb_j)^{−1} Σ_{s=1}^n K((t − s)/(nb_j)) [X_s − Σ_{l=1}^{j−1} X_{s,l}].   (5.4)

The collection of time series {X_{1,j}, ..., X_{n,j}} (j = 1, ..., M) is called a hierarchical decomposition of X_t. The HIREG model is then defined by (5.2). If the ε_t (t = 1, 2, ...) are independent, then the usual techniques of multiple linear regression can be used (see e.g. Plackett 1960, Rao 1973, Ryan 1996, Srivastava and Sen 1997, Draper and Smith 1998). In the case of correlated errors ε_t, appropriate adjustments of tests, confidence intervals, and parameter selection techniques must be made. The main assumption in the HIREG model is that we know which bandwidths to use. In some cases this may indeed be true. For instance, if there is a three-four meter at the beginning of a musical score, then bandwidths that are divisible by three are plausible.
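A compact Python sketch of the decomposition (5.3)-(5.4): the series is smoothed with the largest bandwidth, and each further component smooths what the previous components leave over. The Epanechnikov kernel, the normalized weights, and the simulated weight-like series are illustrative assumptions, not taken from the text.

```python
import numpy as np

def hierarchical_decomposition(x, bandwidths):
    """Sketch of (5.3)-(5.4). Bandwidths are on the [0,1] scale, so a kernel
    with bandwidth b covers roughly n*b neighboring observations; they are
    assumed decreasing, b1 > b2 > ... > bM."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(n)
    comps, resid = [], x.copy()
    for b in bandwidths:
        u = (t[:, None] - t[None, :]) / (n * b)
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
        comp = (w / w.sum(axis=1, keepdims=True)) @ resid   # smooth the leftover
        comps.append(comp)
        resid = resid - comp
    return comps, resid

# illustrative use on a melodic-weight-like series with two periodicities
rng = np.random.default_rng(5)
n = 288
i = np.arange(n)
x = np.sin(2 * np.pi * i / 96) + 0.5 * np.sin(2 * np.pi * i / 12) \
    + 0.2 * rng.normal(size=n)
comps, resid = hierarchical_decomposition(x, bandwidths=[0.2, 0.05, 0.01])
print([round(c.var(), 3) for c in comps], round(resid.var(), 3))
```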
5.2.3 Hierarchical smoothing

Beran and Mazzola (1999b) consider the case where the bandwidths b_j are not known a priori. Essentially, this amounts to a nonlinear regression model Y_t = β_o + Σ_{j=1}^M β_j X_{t,j} + ε_t, where not only the β_j (j = 0, ..., p) are unknown, but also b_1, ..., b_M, and possibly the order M, have to be estimated. The following definition formalizes the idea (for simplicity, it is given for the case of one explanatory series X_t only):

Definition 40 For integers M, n > 0, let θ = (θ_1, ..., θ_M) ∈ R^M, b = (b_1, ..., b_M) ∈ R^M, b_1 > b_2 > ... > b_M ≥ 0, t_i ∈ [0, T], 0 < T < ∞, t_1 < t_2 < ... < t_n, and ζ = (θ, b)^t. Denote by K: [−1, 1] → R_+ a nonnegative symmetric kernel function such that ∫ K(u) du = 1 and K is twice continuously differentiable, and define, for b > 0 and t ∈ [0, T], the Nadaraya-Watson weights (Nadaraya 1964, Watson 1964)

  a_b(t, t_i) = K((t − t_i)/b) / Σ_{j=1}^n K((t − t_j)/b)   (5.5)

Also, let ε_i (i ∈ Z) be a stationary zero mean process satisfying suitable moment conditions, f_ε the spectral density of ε_i, and assume ε_i to be independent of X_i. Then the sequence of bivariate time series {(X_{1,n}, Y_{1,n}), ..., (X_{n,n}, Y_{n,n})} (n = 1, 2, 3, ...) is a Hierarchical Smoothing Model (or HISMOOTH model), if

  Y_{i,n} = Y(t_i) = Σ_{j=1}^M θ_j g(t_i; b_j) + ε_i   (5.6)

where

  g(t_i; b_j) = Σ_{l=1}^n a_{b_j}(t_i, t_l) X_{l,n}.   (5.7)
Denote by ζ^o = (θ^o, b^o)^t the true parameter vector. Then ζ^o can be estimated by a nonlinear least squares method as follows: define

  e_i(ζ) = Y(t_i) − Σ_{j=1}^M θ_j g(t_i; b_j),   (5.8)

  S(ζ) = Σ_{i=1}^n e_i²(ζ)

and ġ = ∂g/∂b. Then

  ζ̂ = argmin S(ζ)   (5.9)

or equivalently

  Σ_{i=1}^n Ψ(t_i, y; ζ̂) = 0   (5.10)

where Ψ = (Ψ_1, ..., Ψ_{2M})^t,

  Ψ_j(t, y; ζ) = e_i(ζ) g(t; b_j)   (5.11)

for j = 1, ..., M, and

  Ψ_j(t, y; ζ) = e_i(ζ) θ_j ġ(t; b_j)   (5.12)

for j = M + 1, ..., 2M. Under suitable assumptions, the estimate ζ̂ is asymptotically normal. More specifically, set

  h_i(t; ζ^o) = g(t; b_i)  (i = 1, ..., M),
  h_{M+i}(t; ζ^o) = θ_i ġ(t; b_i)  (i = 1, ..., M),
  Σ = [γ_ε(i − j)]_{i,j=1,...,n} = [cov(ε_i, ε_j)]_{i,j=1,...,n},

and define the 2M × n matrix

  G = G_{2M×n} = [h_i(t_j; ζ^o)]_{i=1,...,2M; j=1,...,n}   (5.16)

and the 2M × 2M matrix

  V_n = (GG^t)^{−1} (GΣG^t) (GG^t)^{−1}.   (5.17)

The following assumptions are sufficient to obtain asymptotic normality:

(A1) f_ε(λ) ~ c_f |λ|^{−2d} (c_f > 0) as λ → 0, with −1/2 < d < 1/2;

(A2) Let

  a_r = n^{−1} Σ_{i,j=1}^n γ_ε(i − j) g(t_i; b_r) g(t_j; b_r),
  b_{rs} = n^{−1} Σ_{i,j=1}^n γ_ε(i − j) g(t_i; b_r) g(t_j; b_s).

Then, as n → ∞, lim inf |a_r| > 0 and lim inf |b_{rs}| > 0 for all r, s ∈ {1, ..., M}.

(A3) x(t_i) = χ(t_i), where χ: [0, T] → R is a function in C[0, T], T < ∞.

(A4) The set of time points converges to a set A that is dense in [0, T].

Then we have (Beran and Mazzola 1999b):

Theorem 12 Let Θ_1 and Θ_2 be compact subsets of R and R_+ respectively, Θ = Θ_1^M × Θ_2^M, and let β = (1/2) min{1, 1 − 2d}. Suppose that (A1), (A2), (A3), and (A4) hold, and that ζ^o is in the interior of Θ. Then, as n → ∞,

(i) ζ̂ →_p ζ^o;
(ii) V_n → V, where V is a symmetric positive definite 2M × 2M matrix;
(iii) n^β (ζ̂ − ζ^o) →_d N(0, V).
Thus, ζ̂ is asymptotically normal, but for d > 0 (i.e. long-memory errors), the rate of convergence n^{1/2−d} is slower than the usual n^{1/2} rate. A particular aspect of HISMOOTH models is that the bandwidths b_j are fixed positive unknown parameters that are estimated from the data. This means that, in contrast to nonparametric regression models (see e.g. Gasser and Müller 1979, Simonoff 1996, Bowman and Azzalini 1997, Eubank 1999), the notion of an optimal bandwidth does not exist here. There is a fixed true bandwidth (or a vector of true bandwidths) that has to be estimated. A HISMOOTH model is in fact a semiparametric nonlinear regression rather than a nonparametric smoothing model. Theorem 12 can be interpreted as multiple linear regression where uncertainty due to (explanatory) variable selection is taken into account. The set of possible combinations of explanatory variables is parametrized by a continuous bandwidth-parameter vector b ∈ Θ_2^M. Confidence intervals for θ based on the asymptotic distribution of ζ̂ take into account additional uncertainty due to variable selection from the (infinite) parametric family of explanatory variables X = {(x_{b_1}, ..., x_{b_M}): b_j ∈ Θ_2, b_1 > b_2 > ... > b_M}. For the practical implementation of the model, the following algorithms, which include estimation of M, are defined in Beran and Mazzola (1999b): if M is fixed, then the algorithm consists of two basic steps: a) generation of the set of all possible explanatory variables x_s (s ∈ S), and b) selection of M variables (bandwidths) that maximize R². This means that after step 1, the estimation problem is reduced to variable selection in multiple regression, with a fixed number M of explanatory variables. Standard regression software, such as the function leaps in S-Plus, can be used for this purpose. The detailed algorithm is as follows:

Algorithm 1 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊆ Θ_2 and carry out the following steps:
Step 1: Define k explanatory time series x_s = [x_s(t_1), ..., x_s(t_n)]^t (s ∈ S) by x_s(t_i) = g(t_i, s).
Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let θ̂ = θ̂(b) = (X^t X)^{−1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.
Step 3: Define ζ̂ = (θ̂, b̂)^t by b̂ = argmax_b R²(b) and θ̂ = θ̂(b̂).
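The following Python sketch mimics Algorithm 1 under simplifying assumptions: candidate regressors are built on a small bandwidth grid, and an exhaustive search over decreasing bandwidth tuples replaces the leaps subset-selection step. Kernel, grid, and test data are illustrative, not taken from the text.

```python
import numpy as np
from itertools import combinations

def smooth(x, b, t):
    """Kernel-smoothed version of x with bandwidth b (Epanechnikov weights)."""
    u = (t[:, None] - t[None, :]) / (len(x) * b)
    w = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    return (w / w.sum(axis=1, keepdims=True)) @ x

def hismooth_fit(y, x, grid, M):
    """Sketch of Algorithm 1: Step 1 builds one candidate regressor per trial
    bandwidth; Step 2 runs least squares for every decreasing M-tuple of
    bandwidths; Step 3 keeps the tuple maximizing R^2."""
    n = len(y)
    t = np.arange(n, dtype=float)
    cand = {b: smooth(x, b, t) for b in grid}
    tss = np.sum((y - y.mean()) ** 2)
    best = (-np.inf, None, None)
    for bs in combinations(sorted(grid, reverse=True), M):
        X = np.column_stack([np.ones(n)] + [cand[b] for b in bs])
        theta = np.linalg.lstsq(X, y, rcond=None)[0]
        r2 = 1 - np.sum((y - X @ theta) ** 2) / tss
        if r2 > best[0]:
            best = (r2, bs, theta)
    return best

# illustrative recovery of two known bandwidths
rng = np.random.default_rng(7)
n, grid = 200, [0.4, 0.2, 0.1, 0.05, 0.02]
t = np.arange(n, dtype=float)
x = rng.normal(size=n).cumsum()
y = 1.5 * smooth(x, 0.2, t) - 0.8 * smooth(x, 0.05, t) + 0.1 * rng.normal(size=n)
r2, bs, theta = hismooth_fit(y, x, grid, M=2)
print(bs, np.round(theta, 2))   # ideally picks the bandwidths 0.2 and 0.05
```

The exhaustive search grows combinatorially in M and the grid size, which is why subset-selection routines such as leaps are used in practice.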
If M is unknown, then the algorithm can be modified, for instance by increasing M as long as all coefficients are significant. In order to calculate the standard deviation of θ̂ at each stage, the error process ε_i needs to be modeled explicitly. Beran and Mazzola (1999b) use fractional autoregressive models together with the BIC for choosing the order of the process. This leads to

Algorithm 2 Define a sufficiently fine grid S = {s_1, ..., s_k} ⊆ Θ_2 for the
bandwidths, and calculate k explanatory time series x_s (s ∈ S) by x_s(t_i) = g(t_i, s). Furthermore, define a significance level α, set M_o = 0, and carry out the following steps:
Step 1: Set M = M_o + 1.
Step 2: For each b = (b_1, ..., b_M) ∈ S^M with b_i > b_j (i < j), define the n × M matrix X = (x_{b_1}, ..., x_{b_M}) and let θ̂ = θ̂(b) = (X^t X)^{−1} X^t y. Also, denote by R²(b) the corresponding value of R² obtained from least squares regression of y on X.
Step 3: Define ζ̂ = (θ̂, b̂)^t by b̂ = argmax_b R²(b) and θ̂ = θ̂(b̂).
Step 4: Let e(ζ̂) = [e_1, ..., e_n]^t be the vector of regression residuals. Assume that e_i is a fractional autoregressive process of unknown order p, characterized by a parameter vector ξ = (σ_ε², d, φ_1, ..., φ_p). Estimate p and ξ by maximum likelihood and the BIC.
Step 5: Calculate for each j = 1, ..., M the estimated standard deviation σ̂_j(ζ̂) of θ̂_j, and set

  p_j = 2[1 − Φ(|θ̂_j| / σ̂_j(ζ̂))]

where Φ denotes the cumulative standard normal distribution function. If max(p_j) < α, set M_o = M_o + 1 and repeat Steps 1 through 5. Otherwise, stop the iteration, set M̂ = M_o, and set ζ̂ equal to the corresponding estimate.

5.2.4 Hierarchical wavelet models

Wavelet decomposition has become very popular in statistics and many fields of application in the last few years. This is due to its flexibility in depicting local features at different levels of resolution. There is an extended literature on wavelets, spanning a vast range from profound mathematical foundations and mathematical statistics to concrete applications such as data compression, image and sound processing, and data analysis, to name only a few. For references see for example Daubechies (1992), Meyer (1992, 1993), Kaiser (1994), Antoniadis and Oppenheim (1995), Ogden (1996), Mallat (1998), Härdle et al. (1998), Vidakovic (1999), Percival and Walden (2000), Jansen (2001), Jaffard et al. (2001). The essential principle of wavelets is to express square integrable functions in terms of orthogonal basis functions that are zero except in a small neighborhood, the neighborhoods being hierarchical in size. The set of basis functions {φ_ok, k ∈ Z} ∪ {ψ_jk, j, k ∈ Z} is generated by two functions only, the father wavelet φ and the mother wavelet ψ, by up/down-scaling and shifting of the location. If scaling is done by powers of 2 and shifting by integers, then the basis functions are:

  φ_ok(x) = φ_oo(x − k) = φ(x − k)  (k ∈ Z)   (5.18)
  ψ_jk(x) = 2^{j/2} ψ_oo(2^j x − k) = 2^{j/2} ψ(2^j x − k)  (j ∈ N, k ∈ Z)   (5.19)

With respect to the scalar product ⟨g, h⟩ = ∫ g(x)h(x) dx, the basis functions are orthonormal:

  ⟨φ_ok, φ_om⟩ = 0 (k ≠ m),  ⟨φ_ok, φ_ok⟩ = ‖φ_k‖² = 1
  ⟨ψ_jk, ψ_lm⟩ = 0 (k ≠ m or j ≠ l),  ‖ψ_jk‖² = 1
  ⟨ψ_jk, φ_ol⟩ = 0

Every function g in L²(R) (the space of square integrable functions on R) has a unique representation
  g(x) = Σ_{k=−∞}^∞ a_k φ_ok(x) + Σ_{j=0}^∞ Σ_{k=−∞}^∞ b_jk ψ_jk(x)   (5.23)
       = Σ_{k=−∞}^∞ a_k φ(x − k) + Σ_{j=0}^∞ Σ_{k=−∞}^∞ b_jk 2^{j/2} ψ(2^j x − k)   (5.24)

where

  a_k = ⟨g, φ_k⟩ = ∫ g(x) φ_k(x) dx   (5.25)

and

  b_jk = ⟨g, ψ_jk⟩ = ∫ g(x) ψ_jk(x) dx.   (5.26)

Note in particular that ∫ g²(x) dx = Σ a_k² + Σ b_jk². The purpose of this representation is a decomposition with respect to frequency and time. A simple wavelet, where the meaning of the decomposition can be understood directly, is the Haar wavelet with

  φ(x) = 1{0 ≤ x < 1},   (5.27)

where 1{0 ≤ x < 1} = 1 for 0 ≤ x < 1 and zero otherwise, and

  ψ(x) = 1{0 ≤ x < 1/2} − 1{1/2 ≤ x < 1}.   (5.28)

For the Haar basis functions φ_k, we have coefficients

  a_k = ∫_k^{k+1} g(x) dx.   (5.29)
Thus, the coefficients of the basis functions φ_k are equal to the average value of g in the interval [k, k + 1]. For ψ_jk we have

  b_jk = 2^{j/2} [ ∫_{2^{−j}k}^{2^{−j}(k+1/2)} g(x) dx − ∫_{2^{−j}(k+1/2)}^{2^{−j}(k+1)} g(x) dx ]   (5.30)

which is the difference between the average values of g in the intervals 2^{−j}k ≤ x < 2^{−j}(k + 1/2) and 2^{−j}(k + 1/2) ≤ x < 2^{−j}(k + 1). This can be interpreted as a (signed) measure of variability. Since each interval I_jk =
[2^{−j}k, 2^{−j}(k + 1)] has length 2^{−j} and midpoint 2^{−j}(k + 1/2), the coefficients b_jk (or their squares b²_jk) characterize the variability of g at different scales 2^{−j} (j = 0, 1, 2, ...) and on a grid of locations 2^{−j}(k + 1/2) that becomes finer as the scale decreases with increasing values of j. Suppose now that a time series (function) y_t is observed at a finite number of discrete time points t = 1, 2, ..., n with n = 2^m. To relate this to wavelet decomposition in continuous time, one can construct a piecewise constant function in continuous time by

  g_n(x) = Σ_{k=0}^{n−1} y_{k+1} 1{k/n ≤ x < (k + 1)/n}.   (5.31)

Since g_n is a step function (like the Haar basis functions themselves) and zero outside the interval [0, 1), the Haar wavelet decomposition of g_n has only a finite number of nonzero terms:

  g_n(x) = a_oo φ_oo(x) + Σ_{j=0}^{m−1} Σ_{k=0}^{2^j−1} b_jk ψ_jk(x)   (5.32)
Note that g_n assumes only a finite number of values, g_n(x) = y_{nx} (x = 1/n, 2/n, ..., 1). Moreover, ψ_jk(x) = 2^{j/2} ψ(2^j x − k) is nonzero only for those x with 0 ≤ 2^j x − k < 1. Therefore, Equation (5.32) can be written in matrix form, and calculation of the coefficients a_oo and b_jk can be done by matrix inversion. Since matrix inversion may not be feasible for large data sets, various efficient algorithms such as the so-called discrete wavelet transform have been developed (see e.g. Percival and Walden 2000). An interesting interpretation of wavelet decomposition can be given in terms of total variability. The total variability of an observed series can be decomposed into contributions of the basis functions by

  Σ_t (y_t − ȳ)² = Σ_{j=0}^{m−1} Σ_{k=0}^{2^j−1} b²_jk.   (5.33)

A plot of b²_jk against j (or 2^j = frequency, or 2^{−j} = period) and k (location) shows for each k and j how much of the signal's variability is due to variation at the corresponding location k and frequency 2^j. To illustrate how wavelet decomposition works, consider the following simulated example: let x_i = 2 cos(2πi/90) if i ∈ {1, ..., 300}, {501, ..., 700}, or {901, ..., 1024}. For 301 ≤ i ≤ 500, set x_i = (1/2) cos(2πi/10), and for 701 ≤ i ≤ 900, x_i = 15 cos(2πi/10) + (1/10000)(i − 200)². The observed signal thus consists of several periodic segments with different frequencies and amplitudes, the largest amplitude occurring between t = 701 and 900, together with a slight trend. Figure 5.1a displays x_i. The coefficients for the four highest levels (i.e. j = 0, 1, 2, 3) are plotted against time in Figure 5.1b. Note that D stands for mother and S for father wavelet. Moreover, the numbering in the plot (as given in S-Plus) is opposite to the one given above: s4 and d4 in the plot correspond to the coarsest level j = 0 above. The corresponding functions at the different levels are given in Figure 5.1c. The ten and fifty largest basis contributions are given in Figures 5.1d and e respectively (together with the data on top and residuals at the bottom). Figure 5.1f shows the time-frequency plot of the squared coefficients in the wavelet decomposition of x_i. Bright shading corresponds to large coefficients. All plots emphasize the high-frequency portion with large amplitude between i = 701 and 900. Moreover, the trend at this location is visible through the coefficient values of the father wavelet (s4 in the plot) and the slightly brighter shading in the lowest frequency band of the time-frequency plot. An alternative to HISMOOTH models can be defined via wavelets (the following definition is a slight modification of Beran and Mazzola 2001):

Definition 41 Let φ, ψ ∈ L²(R) be a father and the corresponding mother wavelet respectively, φ_k(·) = φ(· − k), ψ_{j,k} = 2^{j/2} ψ(2^j · − k) (k ∈ Z, j ∈ N) the orthogonal wavelet basis generated by φ and ψ, and u_i and ε_i (i ∈ Z) independent stationary zero mean processes satisfying suitable moment conditions. Assume X(t_i) = g(t_i) + u_i with g ∈ L²[0, T], t_i ∈ [0, T] and wavelet decomposition g(t) = Σ a_k φ_k(t) + Σ b_{j,k} ψ_{j,k}(t). For 0 = c_{M+1} < c_M < ... < c_1 < c_o = ∞, let

  g(t; c_{i−1}, c_i) = Σ_{c_i ≤ |a_k| < c_{i−1}} a_k φ_k(t) + Σ_{c_i ≤ |b_{j,k}| < c_{i−1}} b_{j,k} ψ_{j,k}(t).

Then (X(t_i), Y(t_i)) (i = 1, ..., n) is a Hierarchical Wavelet Model (HIWAVE model) of order M, if there exist M ∈ N, θ = (θ_1, ..., θ_M) ∈ R^M, λ = (λ_1, ..., λ_M) ∈ R_+^M, 0 < λ_M < ... < λ_1 < λ_o = ∞, such that

  Y(t_i) = Σ_{l=1}^M θ_l g(t_i; λ_{l−1}, λ_l) + ε_i.   (5.34)
The definition means that the time series Y(t) is decomposed into orthogonal components that are proportional to certain bands in the wavelet decomposition of the explanatory series X(t), the bands being defined by the size of the wavelet coefficients. As for HISMOOTH models, the parameter vector ζ = (θ, λ)^t can be estimated by nonlinear least squares regression. To illustrate how HIWAVE models may be used, consider the following simulated example: let x_i = g(t_i) (i = 1, ..., 1024) as in the previous example. The function g is decomposed into g(t) = g(t; ∞, λ_1) + g(t; λ_1, 0) = g_1(t) + g_2(t), where λ_1 is such that 50 wavelet coefficients of g are larger than or equal to λ_1. Figure 5.2 shows g, g_1, and g_2. A simulated series of response variables, defined by Y(t_i) = 2g_1(t_i) + ε_i (t = 1, ..., 1024) with independent zero-mean normal errors ε_i with variance σ_ε² = 100, is shown in Figure 5.3b.
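A self-contained Python sketch of the discrete Haar transform (5.29)-(5.30), together with a reconstruction that keeps only the largest detail coefficients, mimics the decomposition of g into g_1 (largest components) and g_2 (remainder). The signal and the number of retained coefficients are illustrative assumptions, not taken from the text.

```python
import numpy as np

def haar_coefficients(y):
    """Discrete Haar transform of a series of length n = 2^m: scaled block
    differences (details) and block averages (smooths), cf. (5.29)-(5.30)."""
    coeffs, approx = [], np.asarray(y, dtype=float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2))   # detail ("mother") part
        approx = (even + odd) / np.sqrt(2)         # smooth ("father") part
    return approx, coeffs                          # finest details first

def haar_reconstruct(approx, coeffs, keep):
    """Zero all but the `keep` largest detail coefficients and invert."""
    flat = np.concatenate(coeffs)
    if keep < len(flat):
        thresh = np.sort(np.abs(flat))[-(keep + 1)]
        coeffs = [np.where(np.abs(c) > thresh, c, 0.0) for c in coeffs]
    for d in reversed(coeffs):                     # coarsest level first
        out = np.empty(2 * len(d))
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

rng = np.random.default_rng(6)
y = np.repeat([0.0, 4.0, -2.0, 1.0], 256) + 0.3 * rng.normal(size=1024)
g1 = haar_reconstruct(*haar_coefficients(y), keep=50)
print(np.mean((y - g1) ** 2))   # g2 = y - g1 carries the discarded detail
```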
Figure 5.1 Wavelet decomposition of the simulated signal: the signal (a), wavelet coefficients of the four highest levels (b), the corresponding components (c), the ten and fifty largest basis contributions (d, e), and the time-frequency plot of the coefficients (f).
[Figure 5.2: the function g, g1 = first 50 wavelet components of x, and g2 = x − g1.]
A comparison of the two scatter plots in Figures 5.3c and d shows a much clearer dependence between y and g_1 as compared to y versus x = g. Figure 5.3e illustrates that there is no relationship between y and g_2. Finally, the time-frequency plot in Figure 5.3f indicates that the main periodic behavior occurs for t ∈ {701, ..., 900}. The difficulty in practice is that the correct decomposition of x into g_1 and the redundant component g_2 is not known a priori. Figure 5.4 shows y and the HIWAVE curve θ̂_o + θ̂_1 g(t_i; ∞, λ̂_1) (for graphical reasons the fitted curve is shifted vertically), fitted by nonlinear least squares regression. Apparently, the algorithm identified λ_1, and hence the relevant time span [701, 900], quite exactly, since g(t_i; ∞, λ̂_1) corresponds to the sum of the largest 51 wavelet components. The estimated coefficients are θ̂_o = −0.36 and θ̂_1 = 1.95. If we assume (incorrectly of course) that λ_1 had been known a priori, then we can give confidence intervals for both parameters as in linear least squares regression. These intervals are generally too short, since they do not take into account that λ_1 is estimated. However, if a null hypothesis is not rejected using these intervals, then it will not be rejected by the correct test either. In our case, the linear regression confidence intervals for θ_o and θ_1 are [−0.96, 0.24] and [1.81, 2.09] respectively, and thus contain the true values θ_o = 0 and θ_1 = 2.
Figure 5.3 Simulated HIWAVE model: explanatory series g1 (a), y series (b), y versus x (c), y versus g1 (d), y versus g2 = x − g1 (e), and time-frequency plot of y (f).
5.3 Specific applications in music

5.3.1 Hierarchical decomposition of metric, melodic, and harmonic weights

Decomposition of metric, melodic, and harmonic weights as in (5.3) and (5.4) can reveal structures and relationships that are not obvious in the original series. To illustrate this, Figures 5.5a through d and 5.5e through h show a decomposition of these weights for Bach's Canon cancricans from Das Musikalische Opfer (BWV 1079) and Webern's Variation op. 27/2 respectively. The bandwidths were chosen based on time signature and bar grouping. Webern's piano piece is written in 2/4 signature; its formal grouping is 1 + 11 + 11 + 11 + 11; however, Webern insists on a grouping in 2-bar portions, suggesting the bandwidths 5.5 (11 bars), 1 (2 bars), and 0.5 (1 bar). Bach's canon is written in 4/4 signature; the grouping is 9 + 9 + 9 + 9. The chosen bandwidths are 9 (9 bars), 3 (3 bars), and 1 (1 bar). For both compositions, much stronger similarities between the smoothed metric, melodic, and harmonic components can be observed than for the original weights. An extended discussion of these and other examples can be found in Beran and Mazzola (1999a).
Figure 5.5 Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's Canon cancricans (Das Musikalische Opfer BWV 1079) and Webern's Variation op. 27, No. 2.
5.3.2 HIREG models of the relationship between tempo and melodic curves

Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score (Figure 5.6). For the analysis of the tempo curves for Schumann's Träumerei (Figure 2.3), Beran and Mazzola (1999a) construct the following matrix of explanatory variables by decomposing structural weight functions into components of different smoothness: let x_1 = x_metric = metric weight, x_2 = x_melod = melodic weight, x_3 = x_hmean = harmonic (mean) weight (see Chapter 3). Define the bandwidths b_1 = 4 (4 bars), b_2 = 2 (2 bars), and b_3 = 1 (1 bar), and denote the corresponding components in the decomposition of x_1, x_2, x_3 by x_{j,metric} = x_{j,1}, x_{j,melod} = x_{j,2}, x_{j,hmean} = x_{j,3}. More exactly, since harmonic weights are originally defined for each note, two alternative variables are considered for the harmonic aspect: x_hmean(t_l) = average harmonic weight at onset time t_l, and x_hmax(t_l) = maximal harmonic weight at onset time t_l. Thus, the decomposition of four different weight functions x_metric, x_melod, x_hmean, and x_hmax is used in the analysis. Moreover, for each curve, discrete derivatives are defined by

  dx(t_j) = [x(t_j) − x(t_{j−1})] / (t_j − t_{j−1})

and

  d²x(t_{j−1}) = [dx(t_j) − dx(t_{j−1})] / (t_j − t_{j−1}).

Each of these variables is decomposed hierarchically into four components, as described above, with the bandwidths b_1 = 4 (weighted averaging over 8 bars), b_2 = 2 (4 bars), b_3 = 1 (2 bars), and b_4 = 0 (residual, no averaging). We thus obtain 48 variables (functions):

  x_{w,j}, dx_{w,j}, d²x_{w,j}  (w ∈ {metric, melodic, hmax, hmean}, j = 1, 2, 3, 4).
In addition to these variables, the following score information is modeled in a simple way:

1. Ritardandi. There are four onset intervals R_1, R_2, R_3, and R_4 with an explicitly written ritardando instruction, starting at onset times t_o(R_j) (j = 1, 2, 3, 4) respectively. This is modeled by the linear functions

  x_{rit_j}(t) = 1{t ∈ R_j} · (t − t_o(R_j)),  j = 1, 2, 3, 4   (5.35)
Figure 5.6 Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score, without attaching any subjective judgement. (Left: Freddy by J.B.; right: J.S. Bach, woodcut by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)
2. Suspensions. There are four onset intervals S_1, S_2, S_3, and S_4 with suspensions, starting at onset times t_o(S_j) (j = 1, 2, 3, 4) respectively. The effect is modeled by the variables

  x_{sus_j}(t) = 1{t ∈ S_j} · (t − t_o(S_j)),  j = 1, 2, 3, 4   (5.36)
3. Fermatas. There are two onset intervals F_1, F_2 with fermatas. Their effect is modeled by the indicator functions

  x_{ferm_j}(t) = 1{t ∈ F_j},  j = 1, 2   (5.37)
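Building the regressors (5.35)-(5.37) from a score annotation is straightforward. The Python sketch below constructs ramp variables for ritardandi and suspensions and indicator variables for fermatas; the onset grid and the interval boundaries are invented for illustration and do not correspond to the Träumerei score.

```python
import numpy as np

def score_regressors(onsets, ritardandi, suspensions, fermatas):
    """Sketch of (5.35)-(5.37): ramp regressors for ritardandi and
    suspensions (zero outside the interval, linear in t - t_o inside it)
    and 0/1 indicators for fermatas. Intervals are (start, end) pairs
    in onset time."""
    t = np.asarray(onsets, dtype=float)
    cols = {}
    for name, intervals in (("rit", ritardandi), ("sus", suspensions)):
        for j, (a, b) in enumerate(intervals, start=1):
            inside = (t >= a) & (t <= b)
            cols[f"x_{name}{j}"] = np.where(inside, t - a, 0.0)
    for j, (a, b) in enumerate(fermatas, start=1):
        cols[f"x_ferm{j}"] = ((t >= a) & (t <= b)).astype(float)
    return cols

# illustrative onset grid of 32 bars in eighth-note steps, one interval each
onsets = np.arange(0, 32, 0.125)
cols = score_regressors(onsets, ritardandi=[(6.0, 8.0)],
                        suspensions=[(12.0, 13.0)], fermatas=[(15.5, 16.0)])
print({k: round(v.sum(), 2) for k, v in cols.items()})
```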
The variables are summarized in an n × 57 matrix X. After orthonormalization, the following model is assumed: y(j) = Zβ(j) + ε(j), where y(j) = [y(t_1, j), y(t_2, j), ..., y(t_n, j)]^t are the tempo measurements for performance j, Z is the orthonormalized X matrix, β(j) is the vector of coefficients (β_1(j), ..., β_p(j))^t, and ε(j) = [ε(t_1, j), ε(t_2, j), ..., ε(t_n, j)]^t is a vector of n identically distributed, but possibly correlated, zero mean random variables ε(t_i, j) (t_i ∈ T) with variance var(ε(t_i, j)) = σ²(j). Beran and Mazzola (1999a) select the most important variables for each of the 28 performances separately, by stepwise linear regression. The main aim of the analysis is to study the relationship between structural weight functions and tempo with respect to a) existence, b) type and complexity, and c) comparison of different performances. It should perhaps be emphasized at this point that quantitative analysis of performance data aims at gaining a better objective understanding of how pianists interpret a score
without attaching any subjective judgement. The aim is thus not to find the ideal performance (which may in fact not exist) or to state an opinion about the quality of a performance. The values of R², obtained for the full model with all explanatory variables, vary between 0.65 and 0.85. Note, however, that the number of potential explanatory variables is very large, so that high values of R² do not necessarily imply that the regression model is meaningful. On the other hand, musical performance is a very complex process. It is therefore not unreasonable that a large number of explanatory variables may be necessary. This is confirmed formally, in that for most performances the selected models turn out to be complex (with many variables), all variables being statistically significant (at the 5%-level) even when correlations in the errors are taken into account. For instance, for Brendel's performance (R² = 0.76), seventeen significant variables are selected (including first and second derivatives). In spite of the complexity, there is a large degree of similarity between the performances in the following sense: a) all except at most 3 of the 57 coefficients β_j have the same sign for all performances (the results are therefore hardly random); b) there are canonical variables that are chosen by stepwise regression for (almost) all performances; and c) the same is true if one considers (for each performance separately) the explanatory variables with the largest coefficient. Figure 5.7 shows three of these curves. The upper curve is the most important explanatory variable for 24 of the 28 performances. The exceptions are: all three Cortot performances and Krust, with a preference for the middle curve, which reflects the division of the piece into 8 parts, and the performance by Ashkenazy, with a curve similar to Cortot's. Apparently, Cortot, Krust, and Ashkenazy put special emphasis on the division into 8 parts. The results can also be used to visualize the structure of tempo curves in the following way: using the size of |β̂_k| as a criterion for the importance of variable k, we may add the terms in the regression equation sequentially to obtain a hierarchy of tempo curves ranging from very simple to complex. This is illustrated in Figures 5.8a and b for Ashkenazy and Horowitz's third performance.

5.3.3 HISMOOTH models for the relationship between tempo and structural curves

An analysis of the relationship between a melodic curve (Chapter 3) and the 28 tempo curves for Schumann's Träumerei is discussed in Beran and Mazzola (1999). In a first step, the effects of fermatas and ritardandi are subtracted from each of the 28 tempo series individually, using linear regression. The component of the melodic curve m_t orthogonal to these variables is then used. The second algorithm for HISMOOTH models is used, with a grid G that takes into account that 0 ≤ t ≤ 32 and that only certain multiples of 1/8 correspond to musically interesting neighborhoods: G = {32, 30, 28, 26, 24,
Figure 5.7 Most important melodic curves obtained from the HIREG fit to tempo curves for Schumann's Träumerei.
Figure 5.8 Successive aggregation of HIREG components for tempo curves by Ashkenazy and Horowitz (third performance).
22, 20, 18, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125}. Note that since for large bandwidths the resulting curves g do not vary much, large trial bandwidths do not need to be too close together. The error process is modeled by a fractional AR(p, d) process, the order being estimated from the data by the BIC. Note that, from the musicological point of view, the fractional differencing parameter can be interpreted as a measure of self-similarity (see Chapter 3). For illustration, consider the performances CORTOT1 and HOROWITZ1 (see Figures 5.9b and c). In both cases, the number M of explanatory variables estimated by Algorithm 2 turns out to be 3 (with a level of significance of α = 0.05). The estimated bandwidths (and 95%-confidence intervals) are b̂_1 = 4.0 ([2.66, 5.34]), b̂_2 = 2.0 ([1.10, 2.90]) and b̂_3 = 0.5 ([0.17, 0.83]) for CORTOT1, and b̂_1 = 4 ([2.26, 5.74]), b̂_2 = 1 ([0.39, 1.62]) and b̂_3 = 0.25 ([0.04, 0.46]) for HOROWITZ1. The estimates of β are β̂_1 = 0.81 ([0.21, 1.05]), β̂_2 = −1.08 ([−1.53, −0.10]) and β̂_3 = −0.624 ([−1.15, −0.10]) for CORTOT1, and β̂_1 = −0.42 ([−0.66, −0.18]), β̂_2 = 0.54 ([0.13, 0.95]) and β̂_3 = −0.68 ([−1.08, −0.28]) for HOROWITZ1. Finally, the fitted error process for Cortot is a fractional AR(1) process with d̂ = −0.25 ([−0.60, 0.09]) and φ̂_1 = 0.77 ([0.48, 1]). For Horowitz we obtain a fractional AR(2) process with d̂ = 0.30 ([0.14, 0.45]), φ̂_1 = 0.26 ([0.09, 0.42]) and φ̂_2 = −0.43 ([−0.55, −0.30]). A possible interpretation of the results is as follows: the largest bandwidth b̂_1 = 4 (one bar) is the same for both performers. A relatively large portion of the shaping of the tempo happens at this level. Apart from this, however, Horowitz's bandwidths are smaller. Horowitz appears to emphasize very local melodic structures more than Cortot. Moreover, for Horowitz, d̂ > 0 (long-range dependence): while the small-scale structures are explained by the melodic structure of the score, the remaining unexplained part of the performance is still coherent in the sense that there is a relatively strong (self-)similarity and positive correlations even between remote parts. On the other hand, for Cortot, d̂ < 0 (antipersistence): while larger-scale structures are explained by the melodic structure of the score, more local fluctuations are still coherent in the sense that there is a relatively strong negative autocorrelation even between remote parts; these smaller-scale structures are, however, difficult to relate directly to the melodic structure of the score. Figures 5.9a through d also show simplified tempo curves for all 28 performances, obtained by HISMOOTH fits with M = 3. The comparison of typical characteristics is now much easier than for the original curves. In particular, there is a strong similarity between all three performances by Horowitz on one hand, and the three performances by Cortot on the other hand. Several performers (Moisewitsch, Novaes, Ortiz, Krust, Schnabel, Katsaris) put even higher emphasis on global melodic features than Cortot. Striking similarities can also be seen between Horowitz, Klien, and
Brendel. Another group of similar performances consists of Cortot, Argerich, Capova, Demus, Kubalek, and Shelley.

5.3.4 Digital encoding of musical sounds (CD, mpeg)

Wavelet decomposition plays an important role in modern techniques of digital sound and image processing. Digital encoding of sounds (e.g. CD, mpeg) relies on algorithms that make it possible to compress complex data in as few storage units as possible. Wavelet decomposition is one such technique: instead of storing a complete function (evaluated or measured at a very large number of time points on a fine grid), one only needs to keep the relatively small number of wavelet coefficients. There is an extensive literature on how exactly this can be done to suit particular engineering needs. Since here the focus is on genuine musical questions rather than signal processing, we do not pursue this further. The interested reader is referred to the engineering literature such as Effelsberg and Steinmetz (1998) and references therein.

5.3.5 Wavelet analysis of tempo curves

Consider the tempo curves for Schumann's Träumerei. Wavelet analysis can help one to understand some of the similarities and differences between tempo curves. This is illustrated in Figures 5.10a through f, where time-frequency plots of the three tempo curves by Cortot are compared with those by Horowitz. (More specifically, only the first 128 observations are used here.) The obvious difference is that Horowitz has more power in the high frequency range. Figures 5.11a through f compare the wavelet coefficients of residuals obtained after subtracting a kernel-smoothed version of the tempo curves (bandwidth 1/8, i.e. averaging was done over one quarter of a bar). This provides an overview of local details of the curves. In particular, it can be seen at which level of resolution each pianist kept essentially the same profile throughout the years. For instance, for Horowitz the complete profile at level 2 (d2) remains essentially the same. An even better adaptation to data is achieved by using so-called wavelet packets, which are generalizations of wavelets, in conjunction with a best-basis algorithm. The idea of the algorithm is to find the best type of basis functions suitable to approximate an observed time series with as few basis functions as possible. This is a way out of the limitation due to the very specific shape of a particular class of wavelet functions (see e.g. Haar wavelets, where we are confined to step functions). For detailed references on wavelet packets see e.g. Coifman et al. (1992) and Coifman and Wickerhauser (1992). Figures 5.12 through 5.14 illustrate the usefulness of this approach: the 28 tempo curves of Schumann's Träumerei are approximated by the most important
Figure 5.10 Time-frequency plots for Cortot's and Horowitz's three performances.
two (Figure 5.12), five (Figure 5.13), and ten (Figure 5.14) best basis functions. The plots show interesting and plausible similarities and differences. Particularly striking are Cortot's 4-bar oscillations, Horowitz's seismic local fluctuations, the relatively unbalanced tempo with a few extreme tempo variations for Eschenbach, Klien, Ortiz, and Schnabel, the irregular shapes for Moisewitsch, and also a strong similarity between Horowitz1 and Moisewitsch with respect to the general shape (Figure 5.12).

5.3.6 HIWAVE models of the relationship between tempo and melodic curves

HIWAVE models can be used, for instance, to establish a relationship between structural curves obtained from a score and a performance of the score. Here, we consider the tempo curves by Cortot and Horowitz (Figure 5.15a), and the melodic weight function m(t) defined in Section 3.3.4. Assuming a HIWAVE model of order 1, Figure 5.15b displays the value of R²
Figure 5.11 Wavelet coefficients for Cortot's and Horowitz's three performances.
Figure 5.12 Tempo curves approximated by the 2 most important best-basis functions (panels: Argerich, Arrau, Askenaze, Brendel, Bunin, Capova, Cortot 1–3, Curzon, Davies, Demus, Eschenbach, Gianoli, Horowitz 1–3, Katsaris, Klien, Krust, Kubalek, Moiseiwitsch, Ney, Novaes, Ortiz, Schnabel, Shelley, Zak).
Figure 5.13 Tempo curves approximated by the 5 most important best-basis functions (same panels as in Figure 5.12).
Figure 5.14 Tempo curves approximated by the 10 most important best-basis functions (same panels as in Figure 5.12).
for the simple linear regression model y_i = β_0 + β_1 g(t_i; λ), as a function of the number of wavelet coefficients of the melodic curve m(t) that are larger than or equal to the cutoff λ. Two observations can be made: a) for almost all choices of λ, the fit for Horowitz (gray lines) is better, and b) the best value of λ is practically the same for all six performances. Figure 5.15c shows the fitted HIWAVE curves for Cortot and Horowitz separately. The result shows an amazing agreement between the three Cortot performances on one hand and the three Horowitz curves on the other hand. The HIWAVE fits seem to have extracted a major aspect of the performance styles. Horowitz appears to build blocks of almost horizontal tempo levels and adds, within these blocks, very fine tempo variations. In contrast, for Cortot, blocks have a more parabolic shape. It should be noted, of course, that, since Haar wavelets were used here, these features (in particular Horowitz's horizontal blocks) may be somewhat overemphasized. Analogous pictures are displayed in Figures 5.16a through c and 5.17a through c for the first and second difference of the tempo respectively. Particularly interesting are Figures 5.17b and c: the values of R² are practically the same for all Horowitz performances and clearly lower than for Cortot. Moreover, as before, both pianists show an amazing consistency in their performances.
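The Haar mechanism behind these fits can be made concrete with a small sketch. The code below is not the HIWAVE implementation; it only shows how a curve of dyadic length can be decomposed into Haar coefficients, how all but the largest coefficients are set to zero, and how the thresholded curve is reconstructed, which produces the kind of blockwise approximation visible in the fitted curves. All function names and the toy signal are illustrative assumptions.

```python
import numpy as np

def haar_decompose(x):
    """Full Haar decomposition of a signal whose length is a power of two."""
    coeffs, approx = [], np.asarray(x, dtype=float)
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2.0))   # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # coarser approximation
    coeffs.append(approx)                            # final scaling coefficient
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        even = (approx + detail) / np.sqrt(2.0)
        odd = (approx - detail) / np.sqrt(2.0)
        approx = np.empty(2 * len(detail))
        approx[0::2], approx[1::2] = even, odd
    return approx

def keep_largest(coeffs, n_keep):
    """Set all but the n_keep largest (in absolute value) detail coefficients to zero."""
    flat = np.concatenate(coeffs[:-1])
    if n_keep < len(flat):
        cutoff = np.sort(np.abs(flat))[-n_keep]
        coeffs = [np.where(np.abs(c) >= cutoff, c, 0.0) for c in coeffs[:-1]] + [coeffs[-1]]
    return coeffs

# Toy example: a 'block plus fine variation' curve of length 128
t = np.arange(128)
tempo = np.where(t < 64, 60.0, 52.0) + 0.5 * np.sin(t / 3.0)
smooth = haar_reconstruct(keep_largest(haar_decompose(tempo), n_keep=10))
```

Varying n_keep (or, equivalently, the cutoff) traces out exactly the kind of trade-off that the trial cutoff parameter controls in the figures below.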
Figure 5.15 Tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in the HIWAVE fit plotted against the trial cutoff parameter (b), and fitted HIWAVE curves (c).
Figure 5.16 First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in the HIWAVE fit plotted against the trial cutoff parameter (b), and fitted HIWAVE curves (c).
Figure 5.17 Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in the HIWAVE fit plotted against the trial cutoff parameter (b), and fitted HIWAVE curves (c).
CHAPTER 6

Markov chains and hidden Markov models
assume that the Markov chain is homogeneous in the sense that for any i, j ∈ N, the conditional probability P(X_{t+1} = j | X_t = i) does not depend on time t. The probability distribution of the process X_t (t = 0, 1, 2, ...) is then fully specified by the initial distribution

δ_i = P(X_0 = i)  (6.2)

and the (finite or infinite dimensional) matrix of transition probabilities

p_ij = P(X_{t+1} = j | X_t = i)  (i, j = 1, 2, ..., |S|),  (6.3)

where |S| = m is the number of elements in the state space S. Without loss of generality, we may assume S = {1, 2, ..., m}. Note that the vector δ = (δ_1, ..., δ_m)^t and the matrix M = (p_ij)_{i,j=1,2,...,m} have the following properties:

0 ≤ δ_i, p_ij ≤ 1,  Σ_{i=1}^m δ_i = 1  and  Σ_{j=1}^m p_ij = 1.

Probabilities of events can be obtained by matrix multiplication, since

p_ij^(n) = P(X_{t+n} = j | X_t = i) = [M^n]_ij  (6.4)

and

p_j^(n) = P(X_{t+n} = j) = [δ^t M^n]_j.  (6.5)
6.2.2 Transience, persistence, irreducibility, periodicity, and stationarity

The dynamic behavior of a Markov chain can essentially be characterized by the notions of transience/persistence, irreducibility/reducibility, aperiodicity/periodicity, and stationarity/nonstationarity. These properties will be discussed now. Consider the probability that the first visit to state j occurs at time n, given that the process started in state i,

f_ij^(n) = P(X_1 ≠ j, ..., X_{n−1} ≠ j, X_n = j | X_0 = i).  (6.6)

Note that f_ij^(n) = P(T_j = n | X_0 = i), where

T_j = min{n ≥ 1 : X_n = j}

is the first time when the process reaches state j. The conditional probability that the process ever visits the state j can be written as

f_ij = P(T_j < ∞ | X_0 = i) = P(∪_{n=1}^∞ {X_n = j} | X_0 = i) = Σ_{n=1}^∞ f_ij^(n).  (6.7)
We then have the following

Definition 42 A state i is called
i) transient, if f_ii < 1;
ii) persistent, if f_ii = 1.

Persistence means that we return to the same state again with certainty. For transient states it can occur, with positive probability, that we never return to the same place. As it turns out, a positive probability of never returning implies that there is indeed a point of no return, i.e. a time point after which one never returns. This can be seen as follows. Conditionally on X_0 = i, the probability that state j is reached at least k + 1 times is equal to f_ij f_jj^k. Hence, for k → ∞, we obtain the probability of returning infinitely often,

q_ij = P(X_n = j infinitely often | X_0 = i) = f_ij lim_{k→∞} f_jj^k.  (6.8)

This implies q_ij = 0 for f_jj < 1 and q_jj = 1 for f_jj = 1. A simple way of checking whether a state is persistent or not is given by

Theorem 13 The following holds for a Markov chain:
i) A state j is transient ⟺ q_jj = 0 ⟺ Σ_{n=1}^∞ p_jj^(n) < ∞;
ii) A state j is persistent ⟺ q_jj = 1 ⟺ Σ_{n=1}^∞ p_jj^(n) = ∞.
The condition on Σ_{n=1}^∞ p_ii^(n) can be simplified further for irreducible Markov chains:

Definition 43 A Markov chain is called irreducible if, for each i, j ∈ S, p_ij^(n) > 0 for some n.

Irreducibility means that wherever we start, any state j can be reached in due time with positive probability. This excludes the possibility of being caught forever in a certain subset of S. With respect to persistent and transient states, the situation simplifies greatly for irreducible Markov chains:

Theorem 14 Suppose that X_t (t = 0, 1, ...) is an irreducible Markov chain. Then one of the following possibilities is true:
i) All states are transient.
ii) All states are persistent.

Instead of speaking of transient and persistent states, one therefore also uses the notion of a transient and a persistent Markov chain respectively. Another important property is stationarity of Markov chains. The word stationarity implies that the distribution remains stable in some sense. The first definition concerns initial distributions:

Definition 44 A distribution π is called stationary if

Σ_{i=1}^m π_i p_ij = π_j,  (6.9)

or, in matrix form,

π^t M = π^t.  (6.10)

This means that if we start with distribution π, then the distribution of all subsequent X_t's is again π. The next question is in how far the initial distribution influences the dynamic behavior (probability distribution) into the infinite future. A possible complication is that the process may be periodic in the sense that one may return to certain states periodically:

Definition 45 A state j is said to have period d if p_jj^(n) > 0 implies that n is a multiple of d.

For an irreducible Markov chain, all states have the same period. Hence, the following definition is meaningful:

Definition 46 An irreducible Markov chain is called periodic if d > 1, and it is called aperiodic if d = 1.

It can be shown that for an aperiodic Markov chain there is at most one stationary distribution and, if there is one, then the initial distribution does not play any role ultimately:

Theorem 15 If X_t (t = 0, 1, ...) is an aperiodic irreducible Markov chain for which a stationary distribution π exists, then the following holds: (i) the Markov chain is persistent; (ii) lim_{n→∞} p_ij^(n) = π_j > 0 for all i, j; (iii) the stationary distribution is unique. In the other case of an aperiodic irreducible Markov chain for which no stationary distribution exists, we have

lim_{n→∞} p_ij^(n) = 0
for all i, j. Note that this is even the case if the Markov chain is persistent. One can then classify irreducible aperiodic Markov chains into three classes:
Theorem 16 If X_t (t = 0, 1, 2, ...) is an irreducible aperiodic Markov chain, then one of the following three possibilities is true:

(i) X_t is transient, with

lim_{n→∞} p_ij^(n) = 0 and Σ_{n=1}^∞ p_ij^(n) < ∞ for all i, j;

(ii) X_t is persistent, with

lim_{n→∞} p_ij^(n) = 0, Σ_{n=1}^∞ p_ij^(n) = ∞ and μ_j = Σ_{n=1}^∞ n f_jj^(n) = ∞ for all i, j;

(iii) X_t is persistent, with

lim_{n→∞} p_ij^(n) = π_j > 0

for all i, j, and the average number of steps till the process returns to state j is given by

μ_j = 1/π_j.

For Markov chains with a finite state space, the results simplify further:

Theorem 17 If X_t is an irreducible aperiodic Markov chain with a finite state space, then the following holds:
(i) X_t is persistent;
(ii) a unique stationary distribution π = (π_1, ..., π_m)^t exists and is the solution of

π^t (I − M) = 0  (0 ≤ π_j ≤ 1, Σ_j π_j = 1),  (6.11)

where I is the m × m identity matrix.

Note that Σ_j M_ij = Σ_j p_ij = 1, so that Σ_j (I − M)_ij = 0, i.e. the matrix (I − M) is singular. (If this were not the case, then the only solution to the system of linear equations would be 0, so that no stationary distribution would exist.) Thus, there are infinitely many solutions of (6.11). However, there is only one solution that satisfies the conditions 0 ≤ π_j ≤ 1 and Σ_j π_j = 1.
6.2.3 Hidden Markov models

A hidden Markov model is, as the name says, a model where an underlying Markov process is not directly observable. Instead, observations X_t (t = 1, 2, ...) are generated by a series of probability distributions which in turn are controlled by an unobserved Markov chain. More specifically, the following definitions are used: let θ_t (t = 1, 2, ...) be a Markov chain with initial distribution δ, so that P(θ_1 = j) = δ_j, and transition probabilities

p_ij = P(θ_{t+1} = j | θ_t = i).  (6.12)

The state of the Markov chain determines the probability distribution of the observable random variables X_t by

ψ_ij = P(X_t = j | θ_t = i).  (6.13)

In particular, if the state spaces of θ_t and X_t are finite with dimensions m_1 and m_2 respectively, then the probability distribution of the process X_t is determined by the m_1-dimensional vector δ, the m_1 × m_1 dimensional transition matrix M = (p_ij)_{i,j=1,...,m_1} and the m_2 × m_1 dimensional matrix Ψ = (ψ_ij)_{i=1,...,m_2; j=1,...,m_1} that links θ_t with X_t. Analogous models can be defined for the case where X_t (t ∈ N) are continuous variables. The flexibility of hidden Markov models is due to the fact that X_t can be an arbitrary quantity with an arbitrary distribution that can change in time. For instance, X_t itself can be equal to a time series X_t = (Z_1, ..., Z_n) = (Z_1(t), ..., Z_n(t)) whose distribution depends on θ_t. Typically, such models are used in automatic speech processing (see e.g. Levinson et al. 1983, Juang and Rabiner 1991). The variable θ_t may represent the unobservable state of the vocal tract at time t, which in turn produces an observable acoustic signal Z_1(t), ..., Z_n(t) generated by a distribution characterized by θ_t. Given observations X_t (t = 1, 2, ..., N), the aim is to guess which configurations θ_t (t = 1, 2, ..., N) the vocal tract was in. More specifically, it is sometimes assumed that there is only a finite number of possible acoustic signals. We may therefore denote by X_t the label of the observed signal and estimate θ_t by maximizing the a posteriori probability P(θ_t = j | X_t = i). Using Bayes' rule, this leads to

θ̂_t = arg max_{j=1,...,m_1} P(θ_t = j | X_t = i) = arg max_{j=1,...,m_1} [ψ_ji P(θ_t = j) / Σ_{l=1}^{m_1} ψ_li P(θ_t = l)].  (6.14)
6.2.4 Parameter estimation for Markov and hidden Markov models

In principle, parameter estimation for Markov chains and hidden Markov models is simple, since the likelihood function can be written down explicitly in terms of simple conditional probabilities. The main difficulties that can occur are:

1. Large number of unknown parameters: the unknown parameters for a Markov chain are the initial distribution δ and the transition matrix M = (p_ij)_{i,j=1,...,m}. If m is finite, then the number of unknown parameters is (m − 1) + m(m − 1). If the initial distribution does not matter, then this reduces to m(m − 1). Both numbers can be quite large compared to the available sample size, since they increase quadratically in m. The situation is even worse if the state space is infinite, since then the number of unknown parameters is infinite. A solution to this problem is to impose restrictions on the parameters or to define parsimonious models where M is characterized by a low-dimensional parameter vector.

2. Implicit solution: the maximum likelihood estimate of the unknown parameters is the solution of a system of nonlinear equations, and therefore must be found by a suitable numerical algorithm. For real-time applications with massive data input, as they typically occur in speech processing or processing of musical sound signals, fast algorithms are required.

3. Asymptotic distribution: the asymptotic distribution of maximum likelihood estimates is not always easy to derive.

6.3 Specific applications in music

6.3.1 Stationary distribution of intervals modulo 12

We consider intervals between successive notes modulo octave for the upper envelopes of the following compositions:

Anonymus: a) Saltarello (13th century); b) Saltarello (14th century); c) Alle Psallite (13th century); d) Troto (13th century)
A. de la Halle (1235?–1287): Or est Bayard en la pature, hure!
J. de Ockeghem (1425–1495): Canon epidiatesseron
J. Arcadelt (1505–1568): a) Ave Maria, b) La Ingratitud, c) Io Dico Fra Noi
W. Byrd (1543–1623): a) Ave Verum Corpus, b) Alman, c) The Queen's Alman
J. Dowland (1562–1626): a) Come Again, b) The Frog Galliard, c) The King of Denmark's Galliard
H.L. Hassler (1564–1612): a) Galliard, b) Kyrie from Missa secunda, c) Sanctus et Benedictus from Missa secunda
G.P. Palestrina (1525–1594): a) Jesu Rex admirabilis, b) O bone Jesu, c) Pueri Hebraeorum
J.P. Rameau (1683–1764): a) La Poplinière, b) Tambourin, c) La Triomphante (Figure 6.1)
J.F. Couperin (1668–1733): a) Les Barricades mystérieuses, b) La Linotte effarouchée, c) Les Moissonneurs, d) Les Papillons
J.S. Bach (1685–1750): Das Wohltemperierte Klavier; Cello Suites I to VI (1st movements)
D. Scarlatti (1685–1757): a) Sonata K 222, b) Sonata K 345, c) Sonata K 381
J. Haydn (1732–1809): Sonata op. 34, No. 2
W.A. Mozart (1756–1791): a) Sonata KV 332, 2nd mov., b) Sonata KV 545, 2nd mov., c) Sonata KV 333, 2nd mov.
F. Chopin (1810–1849): a) Nocturne op. 9, No. 2, b) Nocturne op. 32, No. 1, c) Etude op. 10, No. 6 (Figure 6.2)
R. Schumann (1810–1856): Kinderszenen op. 15
J. Brahms (1833–1897): a) Hungarian Dances No. 1, 2, 3, 6, 7, b) Intermezzo op. 117, No. 1 (Figures 6.12, 9.7, 11.5)
C. Debussy (1862–1918): a) Clair de lune, b) Arabesque No. 1, c) Reflets dans l'eau
A. Scriabin (1872–1915): Preludes a) op. 2, No. 2, b) op. 11, No. 14, c) op. 13, No. 2
S. Rachmaninoff (1873–1943): a) Prelude op. 3, No. 2, b) Preludes op. 23, No. 3, 5, 9
B. Bartók (1881–1945): a) Bagatelle op. 11, No. 2, b) Bagatelle op. 11, No. 3, c) Sonata for piano
O. Messiaen (1908–1992): Vingt regards sur l'enfant Jésus, No. 3
S. Prokofiev (1891–1953): Visions fugitives a) No. 11, b) No. 12, c) No. 13
A. Schönberg (1874–1951): Piano piece op. 19, No. 2
T. Takemitsu (1930–1996): Rain Tree Sketch No. 1
A. Webern (1883–1945): Orchesterstück op. 6, No. 6

Since we are not interested in note repetitions, zero is excluded, i.e. the state space of X_t consists of the numbers 1, ..., 11. For the sake of simplicity, X_t is assumed to be a Markov chain. This is, of course, not really true; nevertheless, an approximation by a Markov chain may reveal certain characteristics of the composition. The elements of the transition matrix M = (p_ij)_{i,j=1,...,11} are estimated by relative frequencies

p̂_ij = Σ_{t=2}^n 1{x_{t−1} = i, x_t = j} / Σ_{t=1}^{n−1} 1{x_t = i}  (6.15)
Figure 6.1 Jean-Philippe Rameau (1683–1764). (Engraving by A. St. Aubin after J. J. Caffieri, Paris after 1764; courtesy of Zentralbibliothek Zürich.)
and the stationary distribution of the Markov chain with transition matrix M̂ = (p̂_ij)_{i,j=1,...,11} is estimated by solving the system of linear equations π̂^t (I − M̂) = 0 as described above. Figures 6.3a through l show the resulting values of π̂_j (joined by lines). For each composition, the vector π̂_j is plotted against j. For visual clarity, points at neighboring states j and j − 1 are connected. The figures illustrate how the characteristic shape of π changed in the course of the last 500 years. The most dramatic change occurred in the 20th century, with a flattening of the peaks. Starting with Scriabin, a pioneer of atonal music though still rooted in the romantic style of the late 19th century, this is most extreme for the compositions by Schönberg, Webern, Takemitsu, and Messiaen. On the other hand, Prokofiev's Visions fugitives exhibit clear peaks, but at varying locations. The estimated stationary distributions can also be used to perform a cluster analysis. Figure 6.4 shows the result of the single linkage algorithm with the Manhattan norm (see Chapter 10). To make names legible, only a subsample of the data was used. An almost perfect separation between Bach and composers from the classical and romantic period can be seen.
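The estimation step just described can be sketched as follows. The fragment below (illustrative only, not the book's implementation) estimates the transition probabilities (6.15) by relative frequencies from a sequence of interval classes and obtains the stationary distribution as the normalized leading left eigenvector of M̂; the toy interval sequence and all names are made up.

```python
import numpy as np

def transition_matrix(states):
    """Estimate p_ij (6.15) by relative transition frequencies.

    states : sequence of interval classes (repetitions, i.e. zeros, already removed).
    Unobserved states are dropped, so every state must occur at least once as a
    predecessor; otherwise a row of the matrix would be undefined.
    """
    labels = sorted(set(states))
    index = {s: k for k, s in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for a, b in zip(states[:-1], states[1:]):
        counts[index[a], index[b]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True), labels

# Toy sequence of interval classes in {1, ..., 11}
intervals = [2, 2, 1, 3, 2, 5, 7, 2, 2, 1, 3, 2]
M_hat, labels = transition_matrix(intervals)

# Stationary distribution: normalized left eigenvector for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(M_hat.T)
pi_hat = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi_hat = pi_hat / pi_hat.sum()
```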
6.3.2 Stationary distribution of interval torus values

An analogous analysis can be carried out replacing the interval numbers by the corresponding values of the torus distance (see Chapter 1). Excluding zeroes, the state space consists of the three numbers 1, 2, 3 only. For the same compositions as above, the stationary probabilities π̂_j (j = 1, 2, 3) are calculated. A cluster analysis as above, but with the new probabilities, yields practically the same result as before (Figure 6.5). Since the state space contains three elements only, it is now even easier to find the patterns that determine clustering. In particular, log-odds ratios log(π̂_i/π̂_j) (i ≠ j) appear to be characteristic. Boxplots are shown in Figures 6.6a, 6.7a and 6.8a for categories of composers defined by date of birth as follows: a) before 1600 (early music); b) [1600, 1720) (baroque); c) [1720, 1800) (classic); d) [1800, 1880) (romantic and early 20th century) (Figure 6.12); e) 1880 and later (20th century). This is a simple, though somewhat arbitrary, division with some inaccuracies; for instance, Schönberg is classified in category 4 instead of 5. The log-odds ratio between π̂_1 and π̂_2 is highest in the classical period and generally tends to decrease afterwards. Moreover, there is a distinct jump from the baroque to the classical period. This jump is also visible for log(π̂_1/π̂_3). Here, however, the attained level is kept in the subsequent time periods. For log(π̂_2/π̂_3) a gradual increase
Figure 6.3 Stationary distributions π̂_j (j = 1, ..., 11) of Markov chains with state space Z12 \ {0}, estimated for the transitions between successive intervals.
Figure 6.4 Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
can be observed. The differences are even more visible when comparing individual composers. This is illustrated in Figures 6.9a and b, where Bach's and Schumann's log(π̂_1/π̂_3) and log(π̂_2/π̂_3) are compared, and in Figures 6.10a through f, where the median and the lower and upper quartiles of π̂_j are plotted against j. Finally, Figure 6.11 shows the plots of log(π̂_1/π̂_3) and log(π̂_2/π̂_3) against the date of birth.

6.3.3 Classification by hidden Markov models

Chai and Vercoe (2001) study classification of folk songs using hidden Markov models. They consider, essentially, four ways of representing a melody, namely by a) a vector of pitches modulo 12; b) a vector of pitches modulo 12 together with duration (duration being represented by repeating the same pitch); c) a sequence of intervals (differenced series of pitches); and d) a sequence of intervals, with intervals being classified into only five interval classes {0}, {1, 2}, {−1, −2}, {x ≥ 3} and {x ≤ −3}. The observed data consist of 187 Irish, 200 German, and 104 Austrian homophonic melodies from folk songs. For each melody representation, the authors estimate the parameters of several hidden Markov models which differ mainly with respect to the size of the hidden state space. The models are fitted for each
Figure 6.5 Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmanino.
country separately. Only 70% of the data are used for estimation. The remaining 30% are used for validation of a classification rule defined as follows: a melody is assigned to country j if the corresponding likelihood (calculated using the country's hidden Markov model) is the largest. Not surprisingly, the authors conclude that the most reliable distinction can be made between Irish and non-Irish songs.
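The likelihood-based classification rule can be illustrated with a small sketch. The code below assumes that a discrete hidden Markov model (initial distribution δ, transition matrix M, emission matrix Ψ as in (6.12) and (6.13)) has already been fitted for each country; a new melody is then assigned to the model with the largest log-likelihood, computed with the standard forward algorithm. The parameter values are random placeholders, not estimates from Chai and Vercoe (2001).

```python
import numpy as np

def log_likelihood(obs, delta, M, Psi):
    """Log-likelihood of an observation sequence under a discrete HMM.

    obs   : sequence of observed symbols (integers 0, ..., m2-1)
    delta : (m1,) initial distribution of the hidden chain
    M     : (m1, m1) transition matrix of the hidden chain
    Psi   : (m1, m2) emission probabilities P(X_t = j | theta_t = i)
    """
    alpha = delta * Psi[:, obs[0]]          # forward variables at t = 1
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()             # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ M) * Psi[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha = alpha / s
    return loglik

def classify(obs, models):
    """Assign a melody to the model (country) with the largest likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))

# Hypothetical two-country example with 3 hidden states and 5 interval classes
rng = np.random.default_rng(0)
def random_model(m1=3, m2=5):
    M = rng.random((m1, m1)); M /= M.sum(axis=1, keepdims=True)
    Psi = rng.random((m1, m2)); Psi /= Psi.sum(axis=1, keepdims=True)
    return np.full(m1, 1.0 / m1), M, Psi

models = {"Irish": random_model(), "German": random_model()}
print(classify([0, 1, 1, 2, 4, 1, 0], models))
```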
Figure 6.6 Comparison of log-odds ratios log(π̂_1/π̂_2) of stationary Markov chain distributions of torus distances.
Figure 6.7 Comparison of log-odds ratios log(π̂_1/π̂_3) of stationary Markov chain distributions of torus distances.
Figure 6.8 Comparison of log-odds ratios log(π̂_2/π̂_3) of stationary Markov chain distributions of torus distances.
Figure 6.9 Comparison of log-odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) of stationary Markov chain distributions of torus distances for Bach and Schumann.
Figure 6.11 Log-odds ratios log(π̂_1/π̂_3) and log(π̂_2/π̂_3) plotted against date of birth of composer.
6.3.4 Reconstructing scores from acoustic signals

One of the ultimate dreams of musical signal recognition is to reconstruct a musical score from the acoustic signal of a musical performance. This is a highly complex task that has not yet been solved in a satisfactory manner. Consider, for instance, the problem of polyphonic pitch tracking, defined as follows: given a musical audio signal, identify the pitches of the music. This problem is not easy for at least two reasons: a) different instruments have different harmonics and a different change of the spectrum; and b) in polyphonic music, one must be able to distinguish different voices (pitches) that are played simultaneously by the same or different instruments. An approach based on a rather complex hierarchical model is proposed, for instance, in Walmsley, Godsill, and Rayner (1999). Suppose that a maximal number N of notes can be played simultaneously, and denote by γ = (γ_1, ..., γ_N)^t the vector of 0-1 variables indicating whether note j (j = 1, ..., N) is played or not. Each note j is associated with a harmonic representation (see Chapter 4) with fundamental frequency ω_j and amplitudes b_1(j), ..., b_k(j) (k = number of harmonics). Time is divided into
disjoint time intervals, so-called frames. In each frame i of length m_i, the sound signal is assumed to be equal to y_i(t) = μ_i(t) + e_i(t), where μ_i(t) (t = 1, ..., m_i) is the sum of the harmonic representations of the notes and e_i is random noise. Walmsley et al. assume e_i to be iid (independent identically distributed) normal with zero mean and variance σ_i². Taking everything together, the probability distribution of the acoustic signal is fully specified by a finite dimensional parameter vector θ. In principle, given an observed signal, θ could be estimated by maximizing the likelihood (see Chapter 4). The difficulty is, however, that the dimension of θ is very high compared to the number of observations. The solution proposed by Walmsley et al. is to circumvent this problem by a Bayesian approach, in that θ is assumed to be generated by an a priori distribution. Given the data, consisting of a sound signal y_i, and an a priori distribution p(θ), the a posteriori distribution p(θ | y_i) of θ is given by

p(θ | y_i) = f(y_i | θ) p(θ) / ∫ f(y_i | θ') p(θ') dθ'  (6.16)

where

f(y_i | θ) = (2πσ_i²)^(−m_i/2) exp(− Σ_{t=1}^{m_i} e_i²(t) / (2σ_i²))
and e_i(t) = e_i(t; θ). How many notes and which pitches are played can then be decided, for instance, by searching for the mode of the distribution. Even if this model is assumed to be realistic, a major practical difficulty remains: the dimension of θ can be several hundred. The computation of the a posteriori distribution is therefore very difficult, since calculation of ∫ f(y_i | θ) p(θ) dθ involves high-dimensional numerical integration. A further complication is that some of the parameters may be highly correlated. Walmsley et al. therefore propose to use Markov chain Monte Carlo methods (see e.g. Gilks et al. 1996). The essential idea is to simulate the integral by a sample mean of f(y_i | θ), where θ is sampled randomly from the a priori distribution p(θ). Sampling can be done by using a Markov process whose stationary distribution is p. The simulation can be simplified further by the so-called Gibbs sampler, which uses suitable one-dimensional conditional distributions (Besag 1989). A more modest task than polyphonic pitch tracking is automatic segmentation of monophonic music. The task is as follows: given a monophonic musical score and a sampled acoustic signal of a performance of the score, identify for each note and rest in the score the corresponding time interval in the performance. A possible approach based on hidden Markov processes and Bayesian models is proposed in Raphael (1999) (also see Raphael 2001a, b). Raphael, who is a professional oboist and a mathematical statistician, also implemented his method in a computer system, called Music Plus One, that performs the role of a musical accompanist.
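The Monte Carlo idea itself is simple and can be illustrated with a toy example: sample θ from the prior and average f(y | θ) to approximate the normalizing integral. The sketch below uses a one-dimensional Gaussian stand-in for the signal model; it is in no way the sampler of Walmsley et al., who need Markov chain Monte Carlo and the Gibbs sampler for the actual high-dimensional problem.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(y, theta, sigma=1.0):
    """Toy likelihood: residuals e(t; theta) = y(t) - theta, iid N(0, sigma^2)."""
    e = y - theta
    m = len(y)
    return (2 * np.pi * sigma**2) ** (-m / 2) * np.exp(-np.sum(e**2) / (2 * sigma**2))

# Observed toy signal and a Gaussian prior for the scalar parameter theta
y = rng.normal(loc=2.0, scale=1.0, size=20)
theta_samples = rng.normal(loc=0.0, scale=3.0, size=50_000)   # draws from p(theta)

weights = np.array([f(y, th) for th in theta_samples])
marginal = weights.mean()                                     # estimate of the integral
posterior_mean = np.sum(weights * theta_samples) / weights.sum()
```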
CHAPTER 7
Circular statistics
7.1 Musical motivation

Many phenomena in music are circular. The best known examples are repeated rhythmic patterns, the circles of fourths and fifths, and scales modulo octave in the well-tempered system. In the circle of fourths, for example, one progresses by steps of a fourth and arrives, after 12 steps, at the initial starting point modulo octave. It is not immediately clear whether and how to calculate in such situations, and what type of statistical procedures may be used. The theory of circular statistics has been developed to analyze data on circles where angles have a meaning. Originally, this was motivated by data in biology (e.g. direction of bird flight), meteorology (e.g. direction of wind), and geology (e.g. magnetic fields). Here we give a very brief introduction, mostly to descriptive statistics. For an extended account of methods and applications of circular statistics see, for instance, Mardia (1972), Batschelet (1981), Watson (1983), Fisher (1993), and Jammalamadaka and SenGupta (2001). In music, circular methods can be applied to situations where angles measure a meaningful distance between points on the circle and arithmetic operations in the sense of circular data are well defined.
7.2 Basic principles

7.2.1 Some descriptive statistics

Circular data are observations on a circle. In other words, observations consist of directions expressed in terms of angles. The first question is which statistics describe the data in a meaningful way or, at an even more basic level, how to calculate at all when moving on a circle. The difficulty can be seen easily by trying to determine the average direction. Suppose we observe two angles α_1 = 330° and α_2 = 10°. It is plausible to say that the average direction is 350°. However, the average of the angles is (330° + 10°)/2 = 170°, which is almost the opposite direction. Calculating the sample mean of angles is obviously not meaningful. The simple solution is to interpret angular observations as vectors in the plane, with end points on the unit circle, and to apply vector addition
instead of adding angles. Thus, we replace α_i (i = 1, ..., n) by x_i = (cos α_i, sin α_i)^t, where α is measured anticlockwise relative to the horizontal axis. The following descriptive statistics can then be defined.

Definition 47 Let

C = Σ_{i=1}^n cos α_i,  S = Σ_{i=1}^n sin α_i,  R = √(C² + S²).  (7.1)

The (vector of the) mean direction of α_i (i = 1, ..., n) is equal to

x̄ = (cos ᾱ, sin ᾱ)^t = (C/R, S/R)^t.  (7.2)

Equivalently one may use the following

Definition 48 The (angle of the) mean direction of α_i (i = 1, ..., n) is equal to

ᾱ = arctan(S/C) + π 1{C < 0} + 2π 1{C > 0, S < 0}.  (7.3)

Moreover, we have

Definition 49 The mean resultant length of α_i (i = 1, ..., n) is equal to

R̄ = R/n.  (7.4)

Note that R is the length of the vector n x̄ obtained by adding all observed vectors. If all angles are identical, then R = n, so that R̄ = 1. In all other cases, we have 0 ≤ R̄ < 1. In the other extreme case with α_i = 2πi/n (i.e. the angles are scattered uniformly over [0, 2π), there are no clusters of directions), we have R̄ = 0. In this sense, R̄ measures the amount of concentration around the mean direction. This leads to

Definition 50 The sample circular variance of α_i (i = 1, ..., n) is equal to

V = 1 − R̄.  (7.5)

Note, however, that R̄ is not a perfect measure of concentration, since R̄ = 0 does not necessarily imply that the data are scattered uniformly. For instance, suppose n is even, α_{2i+1} = π and α_{2i} = 0. Thus there are two preferred directions. Nevertheless, R̄ = 0. Alternative measures of center and variability respectively are the median and the difference between the lower and upper quartile. The median direction M_n is determined as follows: a) find the axis (straight line through zero) such that the data are divided into two groups of equal size (if n is odd, then the axis passes through at least one point, otherwise through the midpoint between the two observations in the middle); b) take the direction on the chosen axis for which the more points
x_i are closer to the point (cos α, sin α)^t defined by that direction. Similarly, the lower and upper quartiles, Q_1 and Q_2, can be defined by dividing each of the halves into two halves again. An alternative measure of variability is then given by IQR = Q_2 − Q_1. Since we are dealing with vectors in the two-dimensional plane, all quantities above can be expressed in terms of complex numbers. In particular, one can define trigonometric moments by

Definition 51 For p = 1, 2, ... let

C_p = Σ_{i=1}^n cos pα_i,  S_p = Σ_{i=1}^n sin pα_i,  R_p = √(C_p² + S_p²),  (7.6)

C̄_p = C_p/n,  S̄_p = S_p/n,  R̄_p = R_p/n.  (7.7)

Then

m_p = C̄_p + i S̄_p = R̄_p e^{i ᾱ(p)}  (7.8)

is called the pth trigonometric sample moment. For p = 1, this definition yields

m_1 = C̄_1 + i S̄_1 = R̄_1 e^{i ᾱ(1)}.  (7.9)

Moreover, let

C_p^o = Σ_{i=1}^n cos p(α_i − ᾱ(1)),  S_p^o = Σ_{i=1}^n sin p(α_i − ᾱ(1)),  (7.10)

C̄_p^o = C_p^o/n,  S̄_p^o = S_p^o/n.  (7.11), (7.12)

Then

m_p^o = C̄_p^o + i S̄_p^o  (7.13)

is called the pth centered trigonometric (sample) moment, centered relative to the mean direction ᾱ(1). Note, in particular, that Σ_i sin(α_i − ᾱ(1)) = 0, so that m_1^o = R̄_1. An overview of descriptive measures of center and variability is given in Table 7.1.
Table 7.1 Overview of sample measures of center and variability: the mean direction (as unit vector x̄ and as angle ᾱ), the median direction M_n with lower and upper quartiles Q_1 and Q_2, the mean resultant length R̄ as measure of concentration, the sample circular variance V = 1 − R̄, the circular standard deviation, the circular dispersion d_n, the mean deviation D_n based on the distances |α_i − M_n|, and the interquartile range IQR = Q_2 − Q_1.
7.2.2 Correlation and autocorrelation

A model for perfect linear association between two circular random variables α, β is

β = α + c (mod 2π),  (7.14)

where c ∈ [0, 2π) is a fixed constant. A sample statistic that measures how close we are to this perfect association is

r_{α,β} = Σ_{i,j=1; i≠j}^n sin(α_i − α_j) sin(β_i − β_j) / √( Σ_{i,j=1; i≠j}^n sin²(α_i − α_j) · Σ_{i,j=1; i≠j}^n sin²(β_i − β_j) )  (7.15)

or

r_{α,β} = det( n^{−1} Σ_{i=1}^n x_i y_i^t ) / √( det( n^{−1} Σ_{i=1}^n x_i x_i^t ) · det( n^{−1} Σ_{i=1}^n y_i y_i^t ) ),  (7.16)

where x_i = (cos α_i, sin α_i)^t and y_i = (cos β_i, sin β_i)^t. For a time series α_t (t = 1, 2, ...) of circular data, this definition can be carried over to autocorrelations

r(k) = Σ_{i,j; i≠j} sin(α_i − α_j) sin(α_{i+k} − α_{j+k}) / √( Σ_{i,j; i≠j} sin²(α_i − α_j) · Σ_{i,j; i≠j} sin²(α_{i+k} − α_{j+k}) )  (7.17)

or

r(k) = det( n^{−1} Σ_{i=1}^{n−k} x_i x_{i+k}^t ) / det( n^{−1} Σ_{i=1}^{n−k} x_i x_i^t ).  (7.18)
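As an illustration of (7.18), a direct implementation of the determinant form of the lag-k circular autocorrelation might look as follows; the vector representation x_i = (cos α_i, sin α_i)^t is the one used above, and the toy sequence at the end is made up.

```python
import numpy as np

def circular_autocorrelation(alpha, k):
    """Lag-k circular autocorrelation r(k) in the determinant form (7.18)."""
    x = np.column_stack([np.cos(alpha), np.sin(alpha)])   # rows x_i = (cos a_i, sin a_i)
    n = len(alpha)
    num = np.linalg.det(x[:n - k].T @ x[k:] / (n - k))     # average of x_i x_{i+k}^t
    den = np.linalg.det(x[:n - k].T @ x[:n - k] / (n - k)) # average of x_i x_i^t
    return num / den

# Toy sequence of pitch classes mapped to the circle, and the maximal autocorrelation
alpha = 2 * np.pi * np.array([0, 2, 4, 5, 7, 9, 11, 0, 2, 4, 5, 7]) / 12
m_hat = max(circular_autocorrelation(alpha, k) for k in range(1, 11))
```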
7.2.3 Probability distributions

A probability distribution for circular data is a distribution F on the interval [0, 2π). The sample statistics defined in Section 7.2.1 are estimates of the corresponding population counterparts in Table 7.2. The most frequently used distributions are the uniform, cardioid, wrapped, von Mises, and mixture distributions.

Uniform distribution U([0, 2π)):

F(u) = P(α ≤ u) = (u/2π) 1{0 ≤ u < 2π},  f(u) = F'(u) = (1/2π) 1{0 ≤ u < 2π}.

In this case ρ_p = 0 for all p, the mean direction is not defined, and the circular standard deviation and dispersion are infinite. This expresses the fact that there is no preference for any direction and variability is therefore maximal.

Cardioid (or cosine) distribution C(μ, ρ):

F(u) = [u/(2π) + (ρ/π) sin(u − μ)] 1{0 ≤ u < 2π}

and

f(u) = (1/2π) (1 + 2ρ cos(u − μ)) 1{0 ≤ u < 2π},

where 0 ≤ ρ ≤ 1/2. In this case, the mean direction is μ, ρ_1 = ρ, ρ_p = 0 (p > 1) and δ = 1/(2ρ²). An interesting property is that this distribution tends to the uniform distribution as ρ → 0.
Table 7.2 Population measures of center and variability.
Center (angle): the mean direction μ; the pth central trigonometric moment; the median direction M, defined by ∫_{M−π}^{M} dF(α) = ∫_{M}^{M+π} dF(α) = 1/2; the quartiles q_1 = median of {α: M − π ≤ α ≤ M} and q_2 = median of {α: M ≤ α ≤ M + π}; the modal direction arg max_α f(α).
Center (direction, unit vector): the principal direction, i.e. the first eigenvector of Σ = E(XX^t).
Concentration: the mean resultant length ρ; λ_1, the first eigenvalue of Σ.
Variability: the circular variance ν = 1 − ρ; the circular standard deviation σ = √(−2 log(1 − ν)); the circular dispersion δ = (1 − ρ_2)/(2ρ²); the mean deviation ∫ |α − M| dF(α); the interquartile range IQR = q_2 − q_1.
Wrapped distribution: Let X be a random variable with distribution function F_X. The random variable α = X (mod 2π) has a distribution F_α on [0, 2π) given by

F_α(u) = Σ_{j=−∞}^{∞} [F_X(u + 2πj) − F_X(2πj)]

and

f_α(u) = Σ_{j=−∞}^{∞} f_X(u + 2πj).

An important special example is the wrapped normal distribution. The wrapped normal distribution WN(μ, ρ) is obtained by wrapping a normal distribution with E(X) = μ and var(X) = σ² = −2 log ρ (0 < ρ ≤ 1). This yields the circular density function

f(u) = (1/2π) [1 + 2 Σ_{j=1}^{∞} ρ^(j²) cos j(u − μ)] 1{0 ≤ u < 2π}.

Then the mean direction is μ, ρ_1 = ρ, δ = (1 − ρ⁴)/(2ρ²), α_{p,C} = ρ^(p²) and α_{p,S} = 0 (p ≥ 1). For ρ → 0, we obtain the uniform distribution, and for ρ → 1 a distribution with point mass in the direction μ.

von Mises distribution M(μ, κ): The most frequently used unimodal circular distribution is the von Mises distribution with density function

f(u) = (1/(2π I_0(κ))) e^{κ cos(u − μ)} 1{0 ≤ u < 2π},

where 0 < κ < ∞, 0 ≤ μ < 2π and

I_0(κ) = (1/2π) ∫_0^{2π} e^{κ cos θ} dθ = Σ_{j=0}^{∞} (1/(j!)²) (κ/2)^(2j)

is the modified Bessel function of the first kind and order 0. In this case, we have mean direction μ, ρ_1 = I_1(κ)/I_0(κ), δ = (κ I_1(κ)/I_0(κ))^(−1), α_{p,C} = I_p(κ)/I_0(κ) and α_{p,S} = 0 (p ≥ 1), where

I_p(κ) = Σ_{j=0}^{∞} [1/((j + p)! j!)] (κ/2)^(2j+p)

is a modified Bessel function of order p. For κ → 0, the M(μ, κ)-distribution converges to U([0, 2π)), and for κ → ∞ we obtain a point mass in the direction μ.

Mixture distribution: All distributions above are unimodal. Distributions with more than one mode can be modeled, for instance, by mixture distributions

f(u) = p_1 f_{θ,1}(u) + ... + p_m f_{θ,m}(u),

where 0 ≤ p_1, ..., p_m ≤ 1, Σ p_i = 1 and the f_{θ,j} are different circular probability densities.

7.2.4 Statistical inference

Statistical inference about population parameters is mainly known for the distributions above. Classical methods can be found in Mardia (1972),
Batschelet (1981), Watson (1983), and Fisher (1993). For recent results see e.g. Jammalamadaka and SenGupta (2001).

7.3 Specific applications in music

7.3.1 Variability and autocorrelation of notes modulo 12
Figure 7.1 Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
The following analysis is done for various compositions: pitch is represented in Z12, with 0 set equal to the note (modulo 12) with the highest frequency in the composition. Given a note j in Z12, the corresponding circular point is then x = (x_1, x_2)^t = (cos(2πj/12), sin(2πj/12))^t. The following statistics are calculated: λ̂_1, R̄, d̂, and the maximal circular autocorrelation m̂ = max_{1≤k≤10} r̂(k). The compositions considered here are:
Figure 7.2 Sergei Prokofiev as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)
Figure 7.3 Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelle No. 3), and S. Prokofiev (Visions fugitives No. 8).
J. S. Bach: Das Wohltemperierte Klavier I (all preludes and fugues)
D. Scarlatti: Sonatas Kirkpatrick No. 49, 125, 222, 345, 381, 412, 440, 541
B. Bartók (Figure 7.1): Bagatelles No. 1–3, Sonata for Piano (2nd movement)
S. Prokofiev (Figure 7.2): Visions fugitives No. 1–15.

To simplify the analysis, the upper envelope is considered for each composition. The data set that was available consists of played music. Thus, instead of the written score we are looking at its realization by a pianist. This results in some changes of onset times. In particular, some notes with equal score onset times are not played simultaneously. Strictly speaking, the analysis thus refers to the played music rather than the original score. In Figure 7.3, four representative compositions are displayed. Z12 is represented by a circle starting on top with 0 and proceeding clockwise as j ∈ Z12 increases. A composition is thus represented by pitches j_1, ..., j_n ∈ Z12, each pitch being represented by a dot on the circle. In order to visualize how frequent each note is, each point x_i = (cos α_i, sin α_i)^t (i = 1, ..., n), where α_i = 2πj_i/12, is displaced slightly by adding a random number from a uniform distribution on [0, 0.1] to the angle α_i. (This technique of exploratory data analysis is often referred to as jittering; see Chambers et al. 1983.) Moreover, to obtain an impression of the dynamic movement, successive points x_i, x_{i+1} are joined by a line. The connections visualize which notes are likely to follow each other. Some clear differences are visible between the four plots: for Bach, the main movements take place along the edges, the main points and vertices corresponding to the D major scale. The rather curious simple figure for Bartók's Bagatelle No. 3 stems from the continuous repetition of the same chromatic figure in the upper voice. For Prokofiev one can see two main vertices that are positioned symmetrically with respect to the middle vertical line. This is due to the repetitive nature of the upper envelope. Figure 7.4 shows boxplots of λ̂_1, R̄, d̂, and log m̂, comparing Bach, Scarlatti, Bartók and Prokofiev. Variability is clearly lower for Bartók and Prokofiev, independently of the specific statistic that is used. There are also some, but less extreme, differences with respect to the maximal autocorrelation m̂. As one may perhaps expect, Bartók has the highest values of m̂.

7.3.2 Variability and autocorrelation of note intervals modulo 12

The same analysis as above can be carried out for intervals between successive notes (Figure 7.5). Figure 7.6 shows that, again, variability is much lower for Bartók and Prokofiev.
Figure 7.4 Boxplots of λ̂_1, R̄, d̂ and log m̂ for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.
Figure 7.5 Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelle No. 3), and S. Prokofiev (Visions fugitives No. 8).
Figure 7.6 Boxplots of λ̂_1, R̄, d̂ and log m̂ for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofiev.
Figure 7.7 Circular representation of notes ordered according to the circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelle No. 3), and S. Prokofiev (Visions fugitives No. 8).
Figure 7.8 Boxplots of λ̂_1, R̄, d̂ and log m̂ for notes modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.
Figure 7.9 Circular representation of intervals of successive notes ordered according to the circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelle No. 3), and S. Prokofiev (Visions fugitives No. 8).
Figure 7.10 Boxplots of λ̂_1, R̄, d̂ and log m̂ for note intervals modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofiev.
7.3.3 Notes and intervals on the circle of fourths

Alternatively, the analysis above can be carried out by ordering notes according to the circle of fourths. Thus, a rotation by 360°/12 = 30° corresponds to a step of one fourth. The analogous plots are given in Figures 7.7 through 7.10. This specific circular representation makes some symmetries and their harmonic meaning more visible.
CHAPTER 8

Principal component analysis
do not differ very much with respect to that projection, and are therefore more difficult to distinguish.

Definition via spectral decomposition of matrices

The algorithm given above has an elegant interpretation:

Theorem 18 (Spectral decomposition theorem) Let B be a symmetric p × p matrix. Then B can be written as

B = AΛA^t = Σ_{j=1}^p λ_j a^(j) [a^(j)]^t  (8.1)

where Λ = diag(λ_1, λ_2, ..., λ_p) is a diagonal matrix, the λ_j are the eigenvalues and the columns a^(j) of A the corresponding orthonormal eigenvectors of B, i.e. we have

B a^(j) = λ_j a^(j),  (8.2)

||a^(j)||² = [a^(j)]^t a^(j) = 1, and [a^(j)]^t a^(l) = 0 for j ≠ l.  (8.3)

In matrix form, equation (8.3) means that A is an orthogonal matrix, i.e.

A^t A = I  (8.4)
where I denotes the identity matrix with I_jj = 1 and I_jl = 0 (j ≠ l). This result can now be applied to the covariance matrix of a random vector X = (X_1, ..., X_p)^t:

Theorem 19 Let X be a p-dimensional random vector with expected value E(X) = μ and p × p covariance matrix Σ. Then

Σ = AΛA^t  (8.5)

where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal matrix with eigenvalues λ_1, ..., λ_p ≥ 0. In particular, we may permute the sequence of the components of X such that the eigenvalues are ordered. We thus obtain:

Theorem 20 Let X be a p-dimensional random vector with expected value E(X) = μ and a p × p covariance matrix Σ. Then there exists an orthogonal matrix A such that

Σ = AΛA^t  (8.6)

where the columns a^(j) of A are eigenvectors of Σ and Λ is a diagonal matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0. Moreover, the covariance matrix of the transformed vector

Z = A^t (X − μ)  (8.7)
is equal to

cov(Z) = A^t Σ A = Λ.  (8.8)

Note in particular that var(Z_1) = λ_1 ≥ var(Z_2) = λ_2 ≥ ... ≥ var(Z_p) = λ_p, and the covariance matrix may be approximated by a matrix

Σ(q) = Σ_{j=1}^q λ_j a^(j) [a^(j)]^t

for a suitably chosen value q ≤ p. If a good approximation can be achieved for a relatively small value of q, then this means that most of the random variation in X occurs in a low dimensional space spanned by the random vector Z(q) = (Z_1, ..., Z_q)^t.

Definition 53 The transformation defined by Z = A^t (X − μ) is called the principal component transformation. The jth component of Z,

Z_j = [A^t (X − μ)]_j = [a^(j)]^t (X − μ),  (8.9)
is called the jth principal component of X. The jth column of A, i.e. the jth eigenvector a^(j), is called the vector of principal component loadings. In summary, the principal component transformation rotates the original random vector X in such a way that the new coordinates Z_1, ..., Z_p are uncorrelated (orthogonal) and they are ordered according to their importance with respect to characterizing the covariance structure of X. The following result states that the algorithmic and the algebraic definition are indeed the same:

Theorem 21 Consider U = b^t X where b = (b_1, ..., b_p)^t and ||b|| = 1. Suppose that U is orthogonal (i.e. uncorrelated) to the first k principal components of X. Then var(U) is maximal, among all such projections, if and only if b = a^(k+1), i.e. if U is the (k + 1)st principal component Z_{k+1}.

8.2.2 Definition of PCA for observed data

The definition of principal components given above cannot be applied directly to data, since the expected value and covariance matrix are usually unknown. It can however be modified in an obvious way by replacing population quantities by suitable estimates. The simplest solution is to use the sample mean and the sample covariance matrix. For observed vectors x(i) = (x_1(i), ..., x_p(i))^t (i = 1, 2, ..., n) one defines

μ̂ = x̄ = (1/n) Σ_{i=1}^n x(i)  (8.10)

and

Σ̂ = (1/n) Σ_{i=1}^n (x(i) − x̄)(x(i) − x̄)^t.  (8.11)
The estimated jth vector of principal component loadings, â^(j), is the standardized eigenvector corresponding to the jth-largest eigenvalue of Σ̂. The estimated principal component transformation is then defined by

z = Â^t (x − x̄) = [(x − x̄)^t Â]^t  (8.12)

where the columns of Â are equal to the orthogonal vectors â^(j). Applying this transformation to the observed vectors x(1), ..., x(n) enables us to compare observations with respect to their principal components. The jth principal component of the ith observation is equal to

z_j(i) = (x(i) − x̄)^t â^(j).  (8.13)

In other words, the ith observed vector x(i) − x̄ is transformed into a rotated vector z(i) = (z_1(i), ..., z_p(i))^t with the corresponding observed principal components. In matrix form, we can define the n × p matrix of observations

X = [x_1(1) x_2(1) ... x_p(1); x_1(2) x_2(2) ... x_p(2); ...; x_1(n) x_2(n) ... x_p(n)]  (8.14)

and the n × p matrix of observed principal components

Z = [z_1(1) z_2(1) ... z_p(1); z_1(2) z_2(2) ... z_p(2); ...; z_1(n) z_2(n) ... z_p(n)],  (8.15)

so that

Z = (X − 1 x̄^t) Â  (8.16)

where 1 denotes the n-dimensional column vector of ones. Note that the jth column z^(j) = (z_j(1), ..., z_j(n))^t consists of the observed jth principal components. Therefore, the sample variance of the jth principal components is given by

s²_{z_j} = (1/n) Σ_{i=1}^n z_j(i)² = λ̂_j.

If λ̂_j is large, then the observed jth principal components z_j(1), ..., z_j(n) have a large sample variance, so that the observed values are scattered far apart.

8.2.3 Scale invariance?

The principal component transformation is based on the covariance matrix. It is therefore not scale invariant, since variance and covariance depend on the units in which the individual components X_j are measured. It is
therefore often recommended to standardize all components. Thus, we replace each coordinate x_j by (x_j − x̄_j)/s_j, where x̄_j = n^{−1} Σ_{i=1}^n x_j(i) and s_j² = n^{−1} Σ_{i=1}^n (x_j(i) − x̄_j)² (or s_j² = (n − 1)^{−1} Σ_{i=1}^n (x_j(i) − x̄_j)²).

8.2.4 Choosing important principal components

Since an orthogonal transformation does not change the length of vectors, the total variability of the random vector Z in (8.7) is the same as that of the original random vector X with covariance matrix Σ = (σ_ij)_{i,j=1,...,p}. More specifically, one defines the total variability by

V_total = tr(Σ) = Σ_{i=1}^p σ_ii.  (8.17)
The singular value decomposition (spectral decomposition) of Σ then implies

Theorem 22 Let Σ be a covariance matrix with spectral decomposition Σ = AΛA^t. Then

V_total = tr(Σ) = Σ_{i=1}^p σ_ii = Σ_{i=1}^p λ_i.  (8.18)
Since the eigenvalues i are ordered according to their size, we may therefore hope that the proportion of total variation P (q ) = 1 + ... + q p i=1 i (8.19)
is close to one for a low value of q. If this is the case, then one may reduce the dimension of the random vector considerably without losing much information. For data, we plot P̂(q) = (λ̂_1 + ... + λ̂_q)/∑ λ̂_i versus q and judge by eye from which point on the increase in P̂(q) is not worth the price of adding additional dimensions. Alternatively, we may plot the contribution of each eigenvalue, λ̂_j/∑ λ̂_i or λ̂_j itself, against j. This is the so-called scree graph. More formal tests, e.g. for testing which eigenvalues are nonzero or for comparing different eigenvalues, are available, however mostly under the rather restrictive assumption that the distribution of X is multivariate normal (see e.g. Mardia et al. 1979, Ch. 8.3.2). In addition to the scree plot, the decision on the number of principal components is often also based on the (possibly subjective) interpretability of the components. The interpretation of principal components may be based on the coefficients of the loading vectors a_(j) and/or on the correlation between Z_j and the coordinates of the original random vector X = (X_1, ..., X_p)^t. Note that, since E(ZX^t) = E(A^t(X − μ)X^t) = A^t Σ = A^t AΛA^t = ΛA^t and var(X_k) = σ_kk, we have
cov(Z_j, X_k) = λ_j a_kj   (8.20)
and
corr(Z_j, X_k) = a_kj √(λ_j / σ_kk),   (8.21)
where a_kj denotes the kth coordinate of the loading vector a_(j).
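The computations in (8.12) through (8.19) can be sketched in a few lines of numpy. The following is a minimal illustration only, not the implementation used for the analyses in this book: the input matrix X is simulated, and the helper name sample_pca is invented.

```python
import numpy as np

def sample_pca(X):
    """Sample principal components of the rows of X (one observation per row).

    Returns eigenvalues (descending), loadings A (columns = a_(j)),
    scores Z = (X - x_bar) A, and the cumulative proportions P(q)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # centre each coordinate
    S = Xc.T @ Xc / X.shape[0]           # sample covariance matrix (1/n version)
    lam, A = np.linalg.eigh(S)           # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]        # reorder so that lam[0] is the largest
    lam, A = lam[order], A[:, order]
    Z = Xc @ A                           # observed principal components (8.13)
    P = np.cumsum(lam) / lam.sum()       # proportion of total variation P(q), (8.19)
    return lam, A, Z, P

# small illustration with simulated data (placeholder for real tempo data)
rng = np.random.default_rng(0)
X = rng.normal(size=(28, 8)) @ rng.normal(size=(8, 8))
lam, A, Z, P = sample_pca(X)
print(np.round(P, 3))                    # values plotted in a scree-type graph
```

The printed cumulative proportions are what one would inspect, by eye, to choose the number of components q.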
8.2.5 Plots

One of the main difficulties with high-dimensional data is that they cannot be represented directly in a two-dimensional display. Principal components provide a possible solution to this problem. The situation is particularly simple if the first two principal components explain most of the variability. In that case, the original data (x_1(i), ..., x_p(i))^t (i = 1, 2, ..., n) may be replaced by the first two principal components (z_1(i), z_2(i))^t (i = 1, 2, ..., n). Thus, z_2(i) is plotted against z_1(i). If more than two principal components are needed, then the plot of z_2(i) versus z_1(i) provides at least a partial view of the data structure, and further projections can be viewed by corresponding scatter plots of other components, or by symbol plots as described in Chapter 2. The scatter plots can be useful for identifying structure in the data. In particular, one may detect unusual observations (outliers) or clusters of similar observations.

8.3 Specific applications in music

8.3.1 PCA of tempo skewness

The 28 tempo curves in Figure 2.3, each consisting of measurements at p = 212 onset times, can be considered as n = 28 observations of a 212-dimensional random vector. Principal component analysis cannot be applied directly to these data. The reason is that PCA relies on estimating the p × p covariance matrix. The number of observations (n = 28) is much smaller than p. Therefore, not all elements of the covariance matrix can be estimated consistently, and an empirical PCA decomposition would be highly unreliable. A solution to this problem is to reduce the dimension p in a meaningful way. Here, we consider the following reduction: the onset-time axis is divided into 8 disjoint blocks A1, A2, A1, A2, B1, B2, A1, A2 of 4 bars each. For each part number i (i = 1, ..., 8) and each performance j (j = 1, ..., 28), we calculate the skewness measure
(x̄ − M) / (Q_2 − Q_1)
Figure 8.1 Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1, A2, B1, B2, A1, A2 for 28 performances, plotted against the number of the part.
where M is the median and Q_1, Q_2 are the lower and upper quartile, respectively. Figure 8.1 shows these skewness values plotted against i. An apparent pattern is the generally strong negative skewness in B2. (Recall that negative skewness can be created by extreme ritardandi.) Apart from that, however, Figure 8.1 is difficult to interpret directly. Principal component analysis helps to find more interesting features. Figure 8.3 shows the loadings for the first four principal components, which explain more than 80% of the variability (see Figure 8.2). The loadings can be interpreted as follows: the first component corresponds to a weighted average emphasizing the skewness values in the first half of the piece. The 28 performances apparently differ most with respect to skewness during the first 16 bars of the piece (parts A1, A2, A1, A2). The second most important distinction between pianists is characterized by the second component. This component compares skewness for the A-parts with the values in B1 and B2. The third component essentially
[Figure 8.2: variances (eigenvalues) of principal components Comp. 1 to Comp. 7 (scree plot). Figure 8.3: loadings of the first four principal components.]
compares the first with the second half. Finally, the fourth component essentially compares the odd with the even numbered parts, excluding the end A1, A2. Components two to five are displayed in Figure 8.4, with z_2 and z_3 on the x and y axis respectively and rectangles representing z_4 and z_5. Note in particular that Cortot and Horowitz mainly differ with respect to the third principal component. Horowitz has a more extreme difference in skewness between the first and second halves of the piece. Also striking are the outliers Brendel, Ortiz, and Gianoli. The overall skewness, as represented by the first component, is quite extreme for Brendel and Ortiz. For comparison, their tempo curves are plotted in Figure 8.5 together with Cortot's and Horowitz's first performances. In view of the PCA one may now indeed see that in the tempo curves by Brendel and Ortiz there is a strong contrast between the small tempo variations applied most of the time and occasional strong local ritardandi.
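The block-wise reduction used above can be sketched as follows. The tempo curve, the number of blocks, and the function names are illustrative placeholders, not the data of Figure 8.1.

```python
import numpy as np

def quartile_skewness(x):
    """(mean - median) / interquartile range, the skewness measure used above."""
    x = np.asarray(x, dtype=float)
    q1, med, q2 = np.percentile(x, [25, 50, 75])
    return (x.mean() - med) / (q2 - q1)

def blockwise_skewness(tempo, n_blocks=8):
    """Split a tempo curve into n_blocks consecutive blocks and compute
    the quartile-based skewness for each block."""
    blocks = np.array_split(np.asarray(tempo, dtype=float), n_blocks)
    return np.array([quartile_skewness(b) for b in blocks])

# illustrative tempo curve with an exaggerated final ritardando
rng = np.random.default_rng(1)
tempo = np.concatenate([60 + rng.normal(0, 2, 200), 60 - np.linspace(0, 30, 12)])
print(np.round(blockwise_skewness(tempo), 2))   # last block strongly negative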
Figure 8.4 Schumann's Träumerei: symbol plot of principal components z_2, ..., z_5 for PCA of tempo skewness.
Figure 8.5 Schumann's Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.
8.3.2 PCA of entropies

Consider the entropy measures E_1, E_2, E_3, E_4, E_9 and E_10 defined in Chapter 3. We ask the following question: is there a combination of entropy measures that enables us to distinguish computationally between various styles of composition? The following compositions are included in the study: Henry Purcell: 2 Airs (Figure 8.6), Hornpipe; J.S. Bach: first movements of Cello Suites No. 1-6, Prelude and Fugue No. 1 and 8 from Das Wohltemperierte Klavier; W.A. Mozart: KV 1e, 331/1, 545/1; R. Schumann: op. 15, No. 2, 3, 4, 7; op. 68, No. 2, 16; A. Scriabin: op. 51, No. 2, 4; F. Martin: Préludes No. 6, 7 (cf. Figures 8.11, 8.12). For each composition, we define the vector x = (x_1, ..., x_6)^t = (E_1, E_2, E_3, E_4, E_9, E_10)^t. The results of the PCA are displayed in Figures 8.7 through 8.10. The first principal component mainly consists of an average of the first four components and a comparison with E_10 (Figure 8.8). The second component essentially includes a comparison between E_9 and E_10, whereas the third component is mainly a weighted average of E_2, E_9, and E_10. Finally, the fourth component compares E_2, E_3 with E_1. According to the scree plot (Figure 8.7), the first three components already explain more than 95% of the variability. Scatterplots of the first three components (Figures 8.9 and 8.10) together with symbols representing the next two components show a
clear clustering. For clarity, only three different names (Purcell, Bach, and Schumann) are written explicitly in the plots. Schumann turns out to be completely separated from Bach. Moreover, Purcell appears to lie somewhat outside the regions of Bach and Schumann, in particular in Figure 8.10. In conclusion, entropies, as defined above, do indeed seem to capture certain features of a composer's style.
[Figure 8.6: Air by H. Purcell (piano score).]
[Figure 8.9 plot: second vs. first principal component; rectangles with width = 3rd component, height = 4th component; points labeled by composer (Purcell, Bach, Schumann).]
Figure 8.9 Entropies: symbol plot of the first four principal components.
[Figure 8.10 plot: third vs. second principal component; rectangles with width = 4th component, height = 5th component; points labeled by composer.]
Figure 8.11 F. Martin (1890-1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 8.12 F. Martin (1890-1971): manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
CHAPTER 9
Discriminant analysis
9.1 Musical motivation

Discriminant analysis, often also referred to under the more general notion of pattern recognition, answers the question of which category an observed item is most likely to belong to. A typical application in music is the attribution of an anonymous composition to a time period or even to a composer. Other examples are discussed below. A prerequisite for the application of discriminant analysis is that a training data set is available for which the correct answers are known. We give a brief introduction to the basic principles of discriminant analysis. For a detailed account see e.g. Mardia et al. (1979), Klecka (1980), Breiman (1984), Seber (1984), Fukunaga (1990), McLachlan (1992), Huberty (1994), Ripley (1995), Duda et al. (2000), and Hastie et al. (2001).

9.2 Basic principles

9.2.1 Allocation rules

Suppose that an observation x ∈ R^k is known to belong to one of p mutually exclusive categories G_1, G_2, ..., G_p. Associated with each category is a probability density f_i(x) of X on R^k. This means that if an individual comes from group i, then the individual's random vector X has the probability distribution f_i. The problem addressed by discriminant analysis is as follows: observe X = x, and try to guess which group the observation comes from. The aim is, of course, to make as few mistakes as possible. In probability terms this amounts to minimizing the probability of misclassification. The solution is defined by a classification rule. A classification rule is a division of R^k into p disjoint regions: R^k = R_1 ∪ R_2 ∪ ... ∪ R_p, R_i ∩ R_j = ∅ (i ≠ j). The rule allocates an observation to group G_i if x ∈ R_i. More generally, we may define a randomized rule by allocating an observation to group G_i with probability φ_i(x), where ∑_{i=1}^p φ_i(x) = 1 for every x. The advantage of allowing random allocation is that discriminant rules can be averaged and the set of all random rules is convex, which makes it possible to find optimal rules. Note that deterministic rules are a special case, obtained by setting φ_i(x) = 1 if x ∈ R_i and 0 otherwise.
9.2.2 Case I: Known population distributions

Discriminant analysis without prior group probabilities: the ML-rule

Assume that it is not known a priori which of the groups is more likely to occur; however, for each group the distribution f_i is known exactly. This case is mainly of theoretical interest; it does however illustrate the essential ideas of discriminant analysis. A plausible discriminant rule is the Maximum Likelihood Rule (ML-rule): allocate x to group G_i, if
f_i(x) = max_{j=1,...,p} f_j(x).   (9.1)
If the maximum is reached for several groups, then x is considered to be in the union of these (for continuous distributions this occurs with probability zero). In the case of two groups the ML-rule means that x is allocated to G_1, if f_1(x) > f_2(x), or, equivalently,
log [f_1(x)/f_2(x)] > 0.   (9.2)
In the case where all probability densities are normal with equal covariance matrices we have:

Theorem 23 Suppose that each f_i is a multivariate normal distribution with expected value μ_i and covariance matrix Σ_i. Suppose further that Σ_1 = Σ_2 = ... = Σ_p = Σ and det Σ > 0. Then the ML-rule is given as follows: allocate x to group G_i, if
(x − μ_i)^t Σ^{-1} (x − μ_i) = min_{j=1,...,p} (x − μ_j)^t Σ^{-1} (x − μ_j).   (9.3)
Note that the Mahalanobis distance d_i = (x − μ_i)^t Σ^{-1} (x − μ_i) measures how far x is from the expected value μ_i, while taking into account covariances between the components of the random vector X = (X_1, ..., X_p)^t. In particular, for p = 2, x is allocated to G_1, if
a^t (x − ½(μ_1 + μ_2)) > 0   (9.4)
where a = Σ^{-1}(μ_1 − μ_2). Thus, we obtain a linear rule where x is compared with the midpoint between μ_1 and μ_2.

Discriminant analysis with prior group probabilities: the Bayesian rule

Sometimes one has a priori knowledge (or belief) about how likely each of the groups is to occur. Thus, it is assumed that we know the probabilities
π_i = P(observation drawn from group G_i)  (i = 1, ..., p)   (9.5)
where 0 ≤ π_i ≤ 1 and ∑ π_i = 1. The conditional likelihood that the observation comes from group G_i given the observed value X = x is proportional
to π_i f_i(x). The natural rule is then the Bayes rule: allocate x to G_i, if
π_i f_i(x) = max_{j=1,...,p} π_j f_j(x).   (9.6)
For the noninformative prior π_1 = π_2 = ... = π_p = 1/p, representing complete lack of knowledge about which groups observations are more likely to come from, the Bayes rule coincides with the ML-rule. In the case of two groups, the Bayes rule is a simple modification of the ML-rule, since x is allocated to G_1, if
log [f_1(x)/f_2(x)] > log (π_2/π_1).   (9.7)

Which rule is better?

The quality of a rule is judged by the probability of correct classification (or misclassification). There are two standard ways of comparing classification rules: a) comparison of individual probabilities of correct classification; and b) comparison of the overall probability of correct classification. The first criterion can be understood as follows: for a random allocation rule with probabilities φ_i(.), the probability that a randomly chosen individual coming from group G_i is classified into group G_j is equal to
p_ji = ∫ φ_j(x) f_i(x) dx.   (9.8)
Thus, correct classification for individuals from group G_i occurs with probability p_ii and misclassification with probability 1 − p_ii. A rule r with correct-classification probabilities p_ii is said to be at least as good as a rule r* with probabilities p*_ii, if p_ii ≥ p*_ii for all i. If there is at least one ">" sign, then r is better. If there is no better rule than r, then r is called admissible. Consider now a Bayes rule r with probabilities p_ij. Is there any better rule r* than r? Suppose that r* is better. Then
∑_i π_i p_ii < ∑_i π_i p*_ii.
On the other hand,
∑_i π_i p*_ii = ∑_i ∫ φ*_i(x) π_i f_i(x) dx ≤ ∫ max_j {π_j f_j(x)} dx = ∑_i π_i p_ii,
since the Bayes rule allocates each x to a group with maximal π_j f_j(x). This contradicts the first inequality. The conclusion is therefore that every Bayes rule is optimal in the sense that it is admissible. If there are no a priori probabilities π_i, or more exactly if the noninformative prior is used, then this means that the ML-rule is optimal. The second criterion is applicable if a priori probabilities are available: the probability of correct allocation is
p_correct = ∑_{i=1}^p π_i p_ii = ∑_{i=1}^p π_i ∫_{R_i} f_i(x) dx.   (9.9)
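When the group densities can be sampled from, (9.9) can be approximated by simulation rather than by integration. A rough sketch, reusing the invented two-group normal setting from the previous example:

```python
import numpy as np

def p_correct_mc(means, Sigma, priors, rule, n=20_000, seed=0):
    """Monte Carlo approximation of (9.9): draw a group with probability pi_i,
    draw x from that group's density, and record whether the rule is correct."""
    rng = np.random.default_rng(seed)
    groups = rng.choice(len(priors), size=n, p=priors)
    hits = 0
    for g in groups:
        x = rng.multivariate_normal(means[g], Sigma)
        hits += (rule(x) == g)
    return hits / n

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)
ml_rule = lambda x: int(np.argmin([(x - m) @ Sinv @ (x - m) for m in means]))
print(round(p_correct_mc(means, Sigma, [0.5, 0.5], ml_rule), 3))
```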
A rule is optimal if p_correct is maximal. In contrast to admissibility, all rules can be ordered according to classification correctness. As before, it can be shown that the Bayes rule is optimal. Both criteria can be generalized to the case where misclassification is associated with costs that may differ between groups.

9.2.3 Case II: Population distribution form known, parameters unknown

Suppose that each f_i is known, except for a finite dimensional parameter vector θ_i. Then the rules above can be adopted accordingly, replacing parameters by their estimates. The ML-rule is then: allocate x to G_i, if
f_i(x; θ̂_i) = max_{j=1,...,p} f_j(x; θ̂_j).   (9.10)
The rule becomes particularly simple if the f_i are normal with unknown means μ_i and equal covariance matrices Σ_1 = Σ_2 = ... = Σ. Let x̄_i be the sample mean and Σ̂_i the sample covariance matrix for observations from group G_i. Estimating the common covariance matrix by
Σ̂ = (n_1 Σ̂_1 + n_2 Σ̂_2 + ... + n_p Σ̂_p)/(n − p),   (9.11)
where n_i is the number of observations from G_i and n = n_1 + ... + n_p, the ML-rule allocates x to G_i, if
(x − x̄_i)^t Σ̂^{-1} (x − x̄_i) = min_{j=1,...,p} (x − x̄_j)^t Σ̂^{-1} (x − x̄_j).   (9.12)
For two groups, we have the linear ML-rule
â^t (x − ½(x̄_1 + x̄_2)) > 0,   (9.13)
where â = Σ̂^{-1}(x̄_1 − x̄_2), and the corresponding Bayes rule
â^t (x − ½(x̄_1 + x̄_2)) > log (π_2/π_1).   (9.14)
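A minimal sketch of the estimated two-group rule (9.11) through (9.14); the training samples are simulated and the function names are placeholders.

```python
import numpy as np

def fit_linear_rule(X1, X2):
    """Estimate the two-group linear ML rule (9.13) from training samples X1, X2
    (one observation per row). Returns a_hat and the midpoint between the means."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, bias=True)        # Sigma_hat_1 (1/n version)
    S2 = np.cov(X2, rowvar=False, bias=True)
    S_pooled = (n1 * S1 + n2 * S2) / (n1 + n2 - 2)  # common covariance (9.11), p = 2
    a_hat = np.linalg.solve(S_pooled, X1.mean(0) - X2.mean(0))
    mid = 0.5 * (X1.mean(0) + X2.mean(0))
    return a_hat, mid

def classify(x, a_hat, mid, log_prior_ratio=0.0):
    """Allocate to group 1 iff a_hat^t (x - mid) > log(pi_2/pi_1), cf. (9.13)/(9.14)."""
    return 1 if a_hat @ (np.asarray(x, float) - mid) > log_prior_ratio else 2

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)
X2 = rng.multivariate_normal([2, 1], np.eye(2), size=40)
a_hat, mid = fit_linear_rule(X1, X2)
print(classify([1.5, 1.0], a_hat, mid))
```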
It should be emphasized here that while a linear discriminant rule is meaningful for the normal distribution, this may not be so for other distributions. For instance, if for G_1 a one-dimensional random variable X is observed with a uniform distribution on [−1, 1] and for G_2 the variable X is uniformly distributed on [−3, −2] ∪ [2, 3], then the two groups can be distinguished perfectly, however not by a linear rule.

9.2.4 Case III: Population distributions completely unknown

If the population distributions f_i are completely unknown, then the search for reasonable rules is more difficult. In recent literature, some rules based on nonparametric estimation or suitable projection techniques have been proposed (see e.g. Friedman 1977, Breiman 1984, Hastie et al. 1994, Polzehl 1995, Ripley 1995, Duda et al. 2000, Hand et al. 2001). The simplest, and historically most important, rule is based on Fisher's linear discriminant function. Fisher postulated that a linear rule may often be reasonable (see however the remark in Section 9.2.3 on why this need not always be so). He proposed to find a vector a such that the linear function a^t x maximizes the ratio between the variability between groups and the variability within the groups. More specifically, define X_{n×p} = X to be the n × p matrix where each row i corresponds to an observed vector x_i = (x_i1, ..., x_ip)^t. We denote the columns of X by x^(j) (j = 1, ..., p). The rows are assumed to be ordered according to groups, i.e. rows 1 to n_1 are observations from G_1, rows n_1 + 1 through n_1 + n_2 are from G_2, and so on. Moreover, define the matrix
M_{n×n} = M = I − n^{-1} 1 1^t
where I is the identity matrix and 1 = (1, ..., 1)^t. We denote the submatrices of X and M that belong to the different groups by X_{n_j×p}^(j) = X^(j) and M_{n_j×n_j}^(j) = M^(j) respectively. The corresponding subvectors of y = (y_1, ..., y_n)^t are denoted by y^(j). Then the variability of the vector y = Xa, defined by
SST = ∑_{i=1}^n (y_i − ȳ)² = y^t M y = a^t X^t M X a,   (9.15)
can be decomposed as
SST = SST_within + SST_between,   (9.16)
where
SST_within = ∑_{j=1}^p ∑_i (y_i^(j) − ȳ^(j))² = a^t W a   (9.17)
and
SST_between = ∑_{j=1}^p n_j (ȳ^(j) − ȳ)² = a^t B a.   (9.18)
Here,
W = ∑_{j=1}^p n_j S_j = ∑_{j=1}^p [X^(j)]^t M^(j) X^(j)
is the within groups matrix,
B = ∑_{j=1}^p n_j (x̄^(j) − x̄)(x̄^(j) − x̄)^t
the between groups matrix, S_j is the sample covariance matrix of the observations x_i from group G_j, ȳ = n^{-1} ∑_{j=1}^p ∑_i y_i^(j) is the overall mean, ȳ^(j) = n_j^{-1} ∑_i y_i^(j) the mean in group G_j, and x̄ and x̄^(j) are the corresponding (vector) means for x. Fisher's linear discriminant function (or first canonical variate) is the linear function a^t x where a maximizes the ratio
Q(a) = SST_between / SST_within = a^t B a / a^t W a.   (9.19)
The solution is given by

Theorem 24 Let a be the eigenvector of W^{-1} B that corresponds to the largest eigenvalue. Then Q(a) is maximal.

The classification rule is then: allocate x to G_i, if
|a^t x − a^t x̄^(i)| = min_{j=1,...,p} |a^t x − a^t x̄^(j)|.   (9.20)
If there are only p = 2 groups, then
B = (n_1 n_2 / n) (x̄^(1) − x̄^(2))(x̄^(1) − x̄^(2))^t
has rank 1 and the only nonzero eigenvalue is
tr(W^{-1} B) = (n_1 n_2 / n) (x̄^(1) − x̄^(2))^t W^{-1} (x̄^(1) − x̄^(2)),
with eigenvector a = W^{-1}(x̄^(1) − x̄^(2)). The discriminant rule then becomes the same as the ML-rule for normal distributions with equal covariance matrices: allocate x to G_1, if
(x̄^(1) − x̄^(2))^t W^{-1} (x − ½(x̄^(1) + x̄^(2))) > 0.   (9.21)

9.2.5 How good is an empirical discriminant rule?

If the densities f_i are not known, then the classification rule as well as the probabilities p_ii of correct classification must be estimated from the given
Figure 9.1 Discriminant analysis combined with time series analysis can be used to judge purity of intonation (Elvira by J.B.).
data. In principle this is easy, since the corresponding estimates can simply be plugged into the formula for p_ii. The observed data that are used for estimation are also called the training sample. A problem with these estimates is, however, that the search for the optimal discriminant rule was done with the same data. Therefore, p̂_ii will tend to be too optimistic (i.e. too large), unless n is very large. The same is true for any method that estimates classification probabilities from the training data. A possibility to avoid this is to partition the data set randomly into a training sample that is used for estimation of the discriminant rule, and a disjoint validation sample that is used for estimation of the classification probabilities. Obviously, this can only be done for large enough data sets. For recently developed computational methods of validation, such as the bootstrap, see e.g. Efron (1979), Läuter (1985), Fukunaga (1990), Hirst (1996), LeBlanc and Tibshirani (1996), Davison and Hinkley (1997), Chernick (1999), Good (2001).

9.3 Specific applications in music

9.3.1 Identification of pitch, tone separation, and purity of intonation

Weihs et al. (2001) investigate objective criteria for judging purity of intonation of singing. The acoustic data are as described in Chapter 4. In order to address the question of how to computationally assess purity of intonation, a vocal expert classified 132 selected tones of 17 performances (Figure 9.1) of Händel's Tochter Zion into the classes flat, correct, and sharp. The opinion of the expert is assumed to be the truth. An objective measure of purity is defined by
Δ = log_{2^{1/12}}(ν_observed) − log_{2^{1/12}}(ν_0) = 12 log_2(ν_observed/ν_0),
where ν_0 is the correct basic frequency, corresponding to the note in the score and adjusted to the tuning of the accompanying piano, and ν_observed is the actually measured frequency. Maximum likelihood discriminant analysis leads to the following classification rule: the maximal permissible error which is accepted in order to classify a tone as correct is about 0.4 halftones below and above the target tone. Note that this is much larger than the 0.03 halftones which is the minimal distance between frequencies that a trained ear can in principle distinguish (see Pierce 1992). If a note is considered incorrect by an expert, then the estimated probability of it nevertheless being classified as correct by the discriminant rule turns out to be 0.174. This rather high error rate may be due to several causes. Purity of intonation is a phenomenon that probably depends on more than just the basic frequency. Possible factors are, for instance, amount of vibrato, loudness, pitch, context (e.g. previous and subsequent notes), timbre, etc. Thus, more variables that characterize the sound may have to be incorporated, in addition to Δ, in order to define a musically meaningful notion of purity of intonation.

9.3.2 Identification of historic periods

For a composition, consider notes modulo octave, with 0 being set equal to the most frequent note (which we will also call the basic tone). The relative frequencies of the notes 0, ..., 11 are denoted by p_0, ..., p_11. We then set x_1 = p_5. Note that, if 0 is the root of the tonic triad, then 5 is the root of the subdominant. Moreover, we define
x_2 = E = −∑_{i=0}^{11} log(p_i + 0.001) p_i,
which is a slightly modified measure of entropy. We now describe each composition by a bivariate observation x = (p_5, E)^t. The question is now whether this very simple two-dimensional descriptive statistic can tell us anything about the time when the music was composed. In view of the somewhat naive simplicity of x, the answer is not at all obvious. To simplify the problem, composers are divided into two groups: Group 1 = composers who died before 1800, and Group 2 = composers who died after 1800 (or are still alive). Essentially, the two groups correspond to the partition into early music to baroque, and classical until today. The compositions considered here are those given in the star plot example (Section 2.7.2). In order to be able to check objectively how well the procedure works, only a subset of n = 94 compositions is used for estimation. Applying a linear discriminant rule partitions the plane into two half planes by
Figure 9.2 Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consist of x = (p_5, E).
a straight line. Figure 9.2 shows the estimated partitioning line together with the training sample (o = before 1800, x = after 1800). Apparently, the two groups can indeed be separated quite well by the estimated straight line. This is quite surprising, given the simplicity of the two variables. As expected, however, the partition is not perfect, and it does not seem possible to improve it by more complicated partitioning lines. To assess how well the rule may indeed classify, we consider 50 other compositions that were not used for estimating the discriminant rule. Figure 9.3 shows that the rule works well, since almost all observations in the validation sample are classified correctly. An unusual composition is Bartók's Bagatelle No. 3, which lies far on the left in the wrong group. The partitioning can be improved if the time periods of the two groups are chosen farther apart. This is done in Figures 9.4 and 9.5, with Group 1 = Early Music to Baroque and Group 2 = Romantic to 20th century. (A beautiful example of early music is displayed in Figure 9.6; also see Figures 9.7 and 9.8 for portraits of Brahms and Wagner.) Figure 9.4 shows the corresponding plot of the partition together with the data (n = 72). Compositions not used in the estimation are shown in Figure 9.5. Again, the rule works well, except for Bartók's third Bagatelle.
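The two descriptive features used here can be computed from a list of pitch numbers in a few lines. The sketch below assumes MIDI-style integer pitches; the note fragment and the coefficients a_hat and mid are invented placeholders, not the values actually estimated from the n = 94 training compositions, so the printed group label is purely illustrative.

```python
import numpy as np

def note_features(pitches):
    """Compute x = (p5, E): the relative frequency of the note five semitones
    above the most frequent pitch class (root of the subdominant), and the
    modified entropy E = -sum log(p_i + 0.001) p_i of the pitch-class distribution."""
    pc = np.asarray(pitches) % 12
    counts = np.bincount(pc, minlength=12).astype(float)
    p = counts / counts.sum()
    basic = int(np.argmax(p))          # most frequent note = "basic tone" 0
    p = np.roll(p, -basic)             # recentre so that index 0 is the basic tone
    p5 = p[5]
    E = -np.sum(np.log(p + 0.001) * p)
    return np.array([p5, E])

# invented fragment (MIDI numbers) and invented rule coefficients
pitches = [60, 62, 64, 65, 67, 65, 64, 62, 60, 67, 65, 64, 60, 60, 65, 67]
x = note_features(pitches)
a_hat, mid = np.array([8.0, -3.0]), np.array([0.12, 2.1])   # placeholders, not fitted values
group = "group 1 (before 1800)" if a_hat @ (x - mid) > 0 else "group 2 (after 1800)"
print(np.round(x, 3), group)
```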
Figure 9.3 Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.4 Linear discriminant analysis of Early Music to Baroque and Romantic to 20th Century. The plotted points belong to the training sample. The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.5 Linear discriminant analysis of Early Music to Baroque and Romantic to 20th century. The plotted points belong to the validation sample. The data used for the discriminant rule consist of x = (p_5, E).
Figure 9.6 Graduale written for an Augustinian monastery of the diocese Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 152.)
Figure 9.7 Johannes Brahms (1833-1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)
Figure 9.8 Richard Wagner (1813-1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)
CHAPTER 10
Cluster analysis
10.1 Musical motivation

In discriminant analysis, an optimal allocation rule between different groups is estimated from a training sample. The type and number of groups are known. In some situations, however, it is neither known whether the data can be divided into homogeneous subgroups nor how many subgroups there may be. How to find such clusters in previously ungrouped data is the purpose of cluster analysis. In music, one may for instance be interested in how far compositions or performances can be grouped into clusters representing different styles. In this chapter, a brief introduction to the basic principles of statistical cluster analysis is given. For an extended account of cluster analysis see e.g. Jardine and Sibson (1971), Anderberg (1973), Hartigan (1978), Mardia et al. (1979), Seber (1984), Blashfield et al. (1985), Hand (1986), Fukunaga (1990), Arabie et al. (1996), Gordon (1999), Höppner et al. (1999), Everitt et al. (2001), Jajuga et al. (2002), Webb (2002).

10.2 Basic principles

10.2.1 Maximum likelihood classification

Suppose that the observations x_1, ..., x_n ∈ R^k are realizations of n independent random variables X_i (i = 1, ..., n). Assume further that each random variable comes from one of p possible groups such that if X_i comes from group j, then it is distributed according to a probability density f(x; θ_j). In contrast to discriminant analysis, it is not observed which groups the x_i (i = 1, ..., n) belong to. Each observation x_i is thus associated with an unobserved parameter (or label) τ_i specifying group membership. We may simply define τ_i = j if x_i belongs to group j. Denote by τ = (τ_1, ..., τ_n)^t the vector of labels and, for each j = 1, ..., p, let A_j = {x_i : 1 ≤ i ≤ n, τ_i = j} be the unknown set of observations that belong to group j. Then the likelihood function of the observed data is
L = L(x_1, ..., x_n; θ_1, ..., θ_p, τ_1, ..., τ_n) = ∏_{j=1}^p ∏_{x_i ∈ A_j} f(x_i; θ_j).   (10.1)
Maximizing L with respect to the unknown parameters θ_1, ..., θ_p and τ_1, ..., τ_n, we obtain ML-estimates θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n and estimated sets Â_1, ..., Â_p. Denoting by m the dimension of θ_j, the number of estimated parameters is p·m + n. This is larger than the number of observations. It can therefore not be expected that all parameters are estimated consistently. Nevertheless, the ML-estimate provides a classification rule due to the following property: suppose that we change one of the Â_j's by removing an observation x_{i_0} from Â_j and putting it into another set Â_l (l ≠ j). Then the likelihood can at most become smaller. The new likelihood is obtained from the old one by dividing by f(x_{i_0}; θ̂_j) and multiplying by f(x_{i_0}; θ̂_l). We therefore have the following property
L(x_1, ..., x_n; θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n) ≥ L(x_1, ..., x_n; θ̂_1, ..., θ̂_p, τ̂_1, ..., τ̂_n) · f(x_{i_0}; θ̂_l) / f(x_{i_0}; θ̂_j),   (10.2)
or, dividing by L (assuming that it is not zero),
f(x; θ̂_j) ≥ f(x; θ̂_l)  for x ∈ Â_j.   (10.3)
This is identical with the ML-allocation rule in discriminant analysis. The only, but essential, difference here is that τ is unknown, i.e. our sample (training data) gives us only information about the distribution of X but not about τ. This makes the task much more difficult. In particular, since the number of unknown parameters is too large in general, maximum likelihood clustering can not only be computationally difficult but its asymptotic performance may not stabilize sufficiently. In special cases, however, a simple method can be obtained. Suppose, for instance, that the distributions in the groups are multivariate normal with means μ_j and covariance matrices Σ_j. Then the ML-estimates of these parameters, given τ, are the group sample means
x̄_j(τ) = n_j(τ)^{-1} ∑_{i: x_i ∈ A_j(τ)} x_i
and the corresponding group sample covariance matrices Σ̂_j(τ),
respectively. The log-likelihood function then reduces to a constant minus ½ ∑_{j=1}^p n_j(τ) log |Σ̂_j(τ)|. Maximization with respect to τ leads to the estimate
τ̂ = arg min_τ h(τ)   (10.4)
where
h(τ) = ∏_{j=1}^p |Σ̂_j(τ)|^{n_j(τ)}.   (10.5)
Computationally this means that the function h(τ) is evaluated for all groupings of the observations x_1, ..., x_n, and the estimate τ̂ is the grouping
Computationally this means that the function h( ) is evaluated for all groupings of the observations x1 , ..., xn , and the estimate is the grouping
that minimizes h( ). Clearly, this is a computationally demanding task. A simpler rule is obtained if we assume that all covariance matrices are equal to a common covariance matrix . Then  = arg min n1 = arg min 
p p
j) (nj
(10.6)
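For very small n, the criterion in (10.6) can be minimized by brute force over all assignments into two groups, since n_j(τ) Σ̂_j(τ) is simply the within-group scatter matrix. The following is a rough sketch under that two-group assumption; the data are simulated and the function names are invented.

```python
import numpy as np
from itertools import product

def pooled_scatter_det(X, labels):
    """Determinant of sum_j n_j * Sigma_hat_j(tau), the quantity minimized in (10.6)."""
    W = np.zeros((X.shape[1], X.shape[1]))
    for j in np.unique(labels):
        Xc = X[labels == j] - X[labels == j].mean(axis=0)
        W += Xc.T @ Xc                  # n_j * Sigma_hat_j = within-group scatter
    return np.linalg.det(W)

def best_two_group_split(X):
    """Exhaustive search over all splits of the rows of X into two non-empty groups."""
    n = X.shape[0]
    best, best_labels = np.inf, None
    for bits in product([0, 1], repeat=n - 1):
        labels = np.array((0,) + bits)  # first label fixed to break the symmetry
        if labels.min() == labels.max():
            continue                    # skip the assignment with an empty group
        crit = pooled_scatter_det(X, labels)
        if crit < best:
            best, best_labels = crit, labels
    return best_labels, best

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])
labels, crit = best_two_group_split(X)
print(labels, round(crit, 2))
```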
Even in this simplified form, finding the best clustering is computationally demanding. For instance, if the data have to be divided into two groups, then the number of possible assignments for which ∑_{j=1}^2 n_j(τ) Σ̂_j(τ) may differ is equal to 2^{n−1}. In addition, if the number of groups is not known a priori, then a suitable, and usually computationally costly, method for estimating p must be applied. As a matter of principle, it should also be noted that if normal distributions, or any other distributions with overlapping domains, are assumed, then there are no perfect clusters. Even if the distributions were known, an observation x can be, with positive probability, from any group with f_i(x) > 0, so that one can never be absolutely sure where it belongs. A variation of ML-clustering is obtained if the groups themselves are associated with probabilities. Let π_j be the probability that a randomly sampled observation comes from group j. In analogy to the arguments above, maximization of the likelihood with respect to all parameters, including π_j (j = 1, ..., p), leads to a Bayesian allocation rule with π̂_j as prior distribution.

10.2.2 Hierarchical clustering

ML-clustering yields a partition of the observations into p groups. Sometimes it is desirable to obtain a sequence of clusters, e.g. starting with two main groups and then subdividing these into increasingly homogeneous clusters. This is particularly suitable for data where a hierarchy is expected, such as, for instance, in music. Generally speaking, a hierarchical method has the following property: a partitioning into p + 1 clusters consists of two clusters whose union is equal to one of the clusters from the partitioning into p groups, and p − 1 clusters that are identical with p − 1 clusters of the partitioning into p groups. In a first step, the data are transformed into a matrix D = (d_ij)_{i,j=1,...,n} of distances or a matrix S = (s_ij)_{i,j=1,...,n} of similarities. The definition of distance and similarity used in cluster analysis is more general than the usual definition of a metric:

Definition 54 Let X be an arbitrary set and d : X × X → R a real valued function such that for all x, y ∈ X
D1. d(x, y) = d(y, x)
D2. d(x, y) ≥ 0
D3. d(x, x) = 0
Then d is called a distance. If in addition we also have
D4. d(x, y) = 0 ⟺ x = y
D5. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality),
then d is a metric.

A measure of similarity is usually assumed to have the following properties:

Definition 55 Let X be an arbitrary set and s : X × X → R a real valued function such that for all x, y ∈ X
S1. s(x, y) = s(y, x)
S2. s(x, y) > 0
S3. s(x, y) increases with increasing similarity.
Then s is called a measure of similarity.

Axiom S3 is of course somewhat subjective, since it depends on what is meant exactly by similarity. Table 10.1 gives examples of distances and measures of similarity.

Table 10.1 Some measures of distance and similarity between x = (x_1, ..., x_k)^t, y = (y_1, ..., y_k)^t ∈ R^k. For some of the distances, it is assumed that a data set of observations in R^k is available to calculate sample variances s_j² (j = 1, ..., k) and a k × k sample covariance matrix S.

Euclidian distance: d(x, y) = [∑_{i=1}^k (x_i − y_i)²]^{1/2}; usual distance in R^k.
Pearson distance: d(x, y) = [∑_{i=1}^k (x_i − y_i)²/s_i²]^{1/2}; standardized Euclidian.
Mahalanobis distance: d(x, y) = [(x − y)^t S^{-1} (x − y)]^{1/2}; standardized Euclidian.
d(x, y) = ∑_{i=1}^k w_i |x_i − y_i|; less sensitive to outliers.
Minkowski metric: d(x, y) = [∑_{i=1}^k w_i |x_i − y_i|^λ]^{1/λ}; for λ = 1: Manhattan.
Bhattacharyya distance: d(x, y) = [∑_{i=1}^k (√x_i − √y_i)²]^{1/2}; for x_i, y_i ≥ 0 (example: proportions).
s(x, y) = k^{-1} ∑ x_i y_i; suitable for x_i = 0, 1.
s(x, y) = k^{-1} ∑ a_i, a_i = x_i y_i + (1 − x_i)(1 − y_i); suitable for x_i = 0, 1.
s(x, y) = 1 − k^{-1} ∑ w_i |x_i − y_i|, with w_i = 1 if x_i qualitative and w_i = 1/R_i if quantitative (R_i = range of the ith coordinate); suitable if some x_i are qualitative, some quantitative.

Suppose now that, for an observed data set x_1, ..., x_n, we can define a distance matrix D = (d_ij)_{i,j=1,...,n} where d_ij denotes the distance between the vectors x_i and x_j. A hierarchical clustering algorithm tries to group the data into a hierarchy of clusters in such a way that the distances within these clusters are generally much smaller than those between the clusters. Numerous algorithms are available in the literature. The reason for the variety of solutions is that in general the result depends on various free choices, such as the sequence in which clusters are built or the definition of distance between clusters. For illustration, we give the definition of the complete linkage (or furthest neighbour) algorithm:
1. Set a threshold d_o.
2. Start with the initial clusters A_1^(0) = {x_1}, ..., A_n^(0) = {x_n} and set i = 1. The distances between the clusters are defined by d_jl^(0) = d(A_j^(0), A_l^(0)) = d(x_j, x_l). This gives the n × n distance matrix D^(0) = (d_jl^(0))_{j,l=1,...,n}.
3. Join the two clusters with the smallest distance d_jl^(i−1), obtaining the new clusters A_1^(i), ..., A_{n−i}^(i).
4. Calculate the new distances between clusters by
d_jl^(i) = d(A_j^(i), A_l^(i)) = max_{x ∈ A_j^(i), y ∈ A_l^(i)} d(x, y)   (10.7)
and the corresponding (n − i) × (n − i) distance matrix D^(i) with elements d_jl^(i) (j, l = 1, ..., n − i).
5. If
min_{j≠l} d_jl^(i) > d_o   (10.8)
then stop. Otherwise, set i = i + 1 and go to step 3.
Note in particular that for the final clusters, the maximal distance within each cluster is at most d_o. As a result, the final clusters tend to be very compact. A related method is the so-called nearest neighbour (single linkage) algorithm. It is identical with the above except that the distance between clusters is defined as the minimal distance between points in the two clusters. This can lead to so-called chaining in the form of elongated clusters.
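Both linkage methods are available in standard software. A sketch using scipy (assumed to be available): method='complete' implements the furthest-neighbour cluster distance (10.7), and cutting the tree at the threshold t plays the role of d_o in (10.8). The feature vectors below are simulated placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# illustrative feature vectors (e.g. transformed note frequencies), one row per piece
X = np.vstack([rng.normal(0, 1, (6, 4)), rng.normal(5, 1, (6, 4))])

D = pdist(X, metric="euclidean")            # pairwise distances d_ij (condensed form)
Z = linkage(D, method="complete")           # furthest-neighbour merging, cf. (10.7)
labels = fcluster(Z, t=5.0, criterion="distance")   # stop once distances exceed d_o = 5
print(labels)

# single linkage ("nearest neighbour") only changes the cluster distance:
labels_single = fcluster(linkage(D, method="single"), t=5.0, criterion="distance")
print(labels_single)
```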
For other algorithms and further properties see the references given at the beginning of this chapter, and references therein.

10.2.3 HISMOOTH and HIWAVE clustering

HISMOOTH and HIWAVE models, as defined in Chapter 5, can be used to extract dominating features of a time series y(t) that are related to an explanatory series x(t). Suppose that we have several y series, y_j(t) (j = 1, ..., N), that share the same explanatory series x(t). An interesting question is then in how far the features related to x(t) are similar, and which series have more in common than others. One way to answer the question consists of the following clustering algorithm:
1. For each series y_j(t), fit a HISMOOTH or HIWAVE model, thus obtaining a decomposition y_j(t) = μ̂_j(t, x_t) + e_j(t), where μ̂_j is the estimated expected value of y_j given x(t).
2. Perform a cluster analysis of the fitted curves μ̂_j(t, x_t).

10.3 Specific applications in music

10.3.1 Distribution of notes

Consider the distribution p_j (j = 0, 1, ..., 11) of notes modulo 12 as defined for the star plots in Chapter 2. Can the visual impression of the star plots in Figure 2.31 be confirmed by cluster analysis? We consider the transformed data vectors θ = (θ_1, ..., θ_11)^t, with θ_j = log(p_j/(1 − p_j)), for the following compositions: 1) Anonymus: Saltarello (13th century); Saltarello (14th century); Troto (13th century); Alle psalite (13th century); 2) A. de la Halle (1235?-1287): Or est Bayard en la pature, hure!; 3) J. Ockeghem (1425-1495): Canon epidiatesseron; 4) J. Arcadelt (1505-1568): Ave Maria; La Ingratitud; Io dico fra noi; 5) W. Byrd (1543-1623): Ave Verum Corpus; Alman; The Queen's Alman; 6) J. Dowland (1562-1626): The Frog Galliard; The King of Denmark's Galliard; Come again; 7) H.L. Hassler (1564-1612): Galliarda; Kyrie from Missa Secunda; Sanctus et Benedictus from Missa Secunda; 8) Palestrina (1525-1594): Jesu! Rex admirabilis; O bone Jesu; Pueri hebraeorum; 9) J.H. Schein (1586-1630): Banchetto musicale; 10) J.S. Bach (1685-1750): Preludes and Fugues 1-24 from Das Wohltemperierte Klavier; 11) J. Haydn (1732-1809): Sonata op. 34/3 (Figure 10.3); 12) W.A. Mozart (1756-1791): Sonata KV 545 (2nd Mv.); Sonata KV 281 (2nd Mv.); Sonata KV 332 (2nd Mv.); Sonata KV 333 (2nd Mv.); 13) C. Debussy (1862-1918): Clair de lune; Arabesque 1; Reflets dans l'eau; 14) A. Schönberg (1874-1951): op. 19/2 (Figure 10.4); 15) A. Webern (1883-1945): Orchesterstück op. 6, No. 6; 16) Bartók (1881-1945): Bagatelles No. 1-3;
Figure 10.3 Joseph Haydn (1732-1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)
Piano Sonata (2nd Mv.); 17) O. Messiaen (1908-1992): Vingt regards sur l'Enfant-Jésus, No. 3; 18) T. Takemitsu (1930-1996): Rain Tree Sketch No. 1. Figure 10.1 shows the result of complete linkage clustering of the vectors (θ_1, ..., θ_11)^t, based on the Euclidian distance and d_o = 5. The most striking feature is the clear separation of early music from the rest. Moreover, the 20th century composers considered here are in a separate cluster, except for Bartók's Bagatelle No. 3 (and Debussy, who may be considered as belonging to the 19th and 20th centuries). In contrast, the clusters provided by a single linkage algorithm are less easy to interpret. Figure 10.2 illustrates a typical result of this method, namely long narrow clusters where the maximal distance within a cluster can be quite large. In our example this does
Figure 10.4 Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)
not seem appropriate since, due to the organic historic development of music, the effect of chaining is likely to be particularly pronounced.

10.3.2 Entropies

Consider the entropies as defined in Chapter 3. More specifically, we define for each composition a vector y = (E_1, ..., E_10)^t. After standardization of each coordinate, cluster analysis is applied to the following compositions by J.S. Bach: Cello Suites No. I to VI (1st movement from each); Preludes and Fugues No. 1 and 8 from Das Wohltemperierte Klavier (each separately). The complete linkage algorithm leads to a clear separation of the Cello Suites from Das Wohltemperierte Klavier, displayed in Figure 10.5.

10.3.3 Tempo curves

One of the obvious questions with respect to the tempo curves in Figure 2.3 is whether one can find clusters of similar performances. Applying complete linkage cluster analysis (with the Euclidian distance) to the raw data yields the clusters in Figure 10.6. Cortot and Horowitz appear to have very individual styles, since they build distinct clusters of their own. It should be noted, however, that this does not imply that other pianists do not have their own styles. Cortot and Horowitz simply happen to be the lucky ones
who are represented more than once in the sample, so that the consistency of their performances can be checked empirically. Figure 10.6 also shows that Cortot is somewhat of an outlier, since his cluster separates from all other pianists at the top level.

10.3.4 Tempo curves and melodic structure

Cluster analysis alone does not provide any further explanation of the meaning of observed clusters. In particular, we do not know which musically meaningful characteristics determine the clustering of tempo curves. In contrast, cluster analysis based on HISMOOTH or HIWAVE models provides a way to gain more insight. The fitted HISMOOTH curves in Figures 5.9a through d extract essential features that make comparisons easier. The estimated bandwidths can be interpreted as a measure of how much emphasis a pianist puts on global and local features respectively. Figure 10.7 shows clusters based on the fitted HISMOOTH curves. In contrast to the original data, complete and single linkage turn out to yield almost the same clusters. Thus, applying the HISMOOTH fit first leads to a stabilization of the results. From Figure 10.7, we may identify about six main clusters, namely: A: KRUST, KATSARIS, SCHNABEL;
B: MOISEIWITSCH, NOVAES, ORTIZ; C: DEMUS, CORTOT1, CORTOT2, CORTOT3, ARGERICH, SHELLEY, CAPOVA; D: ARRAU, BUNIN, KUBALEK, CURZON, GIANOLI; E: ASKENAZE, DAVIES; F: HOROWITZ1, HOROWITZ2, HOROWITZ3, ZAK, ESCHENBACH, NEY, KLIEN, BRENDEL. This is related to the grouping of the vector of estimated bandwidths, (b_1, b_2, b_3)^t ∈ R_+^3. In Figure 10.8, the x and y coordinates correspond to b_1 and b_2 respectively, and the radius of a circle is proportional to b_3. The letters A through F identify locations where one or more observations from that cluster occur. The pictures show that only a few distinct values of b_1 and b_2 occur. Particularly striking are the large bandwidths for clusters A and B. Apparently, these pianists emphasize mostly the larger structures of the composition. Also note that the clusters do not separate equally well in each projection. Apart from clusters A and B, one cannot order the performances in terms of large versus small bandwidths. Overall, one may conclude that HISMOOTH clustering together with analytic indicator functions provides a better understanding of essential characteristics of musical performance (Figure 10.9).
Figure 10.8 Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus log b_3; the horizontal and vertical axes are equal to b_1 and b_2 respectively. The letters A-F indicate where at least one observation from the corresponding cluster occurs.
CHAPTER 11
Multidimensional scaling
11.1 Musical motivation

In some situations the data consist of distances only. These distances are not necessarily euclidian, so that they do not necessarily correspond to a configuration of points in a euclidian space. The question addressed by multidimensional scaling (MDS) is in how far one may nevertheless find points in a, hopefully low-dimensional, euclidian space that have exactly or approximately the observed distances. The procedure is mainly an exploratory tool that helps to find structure in distance data. We give a brief introduction to the basic principles of MDS. For a detailed discussion and an extended bibliography see, for instance, Kruskal and Wish (1978), Cox and Cox (1994), Everitt and Rabe-Hesketh (1997), Borg and Groenen (1997), Schiffman (1997); also see textbooks on multivariate statistics, such as the ones given in the previous chapters. For the origins of MDS and early references see Young and Householder (1941), Guttman (1954), Shepard (1962a,b), Kruskal (1964a,b), Ramsay (1977).

11.2 Basic principles

11.2.1 Basic definitions

In MDS, any symmetric n × n matrix D = (d_ij)_{i,j=1,...,n} with d_ij ≥ 0 and d_ii = 0 is called a distance matrix. Note that this corresponds to the axioms D1, D2, and D3 in the previous chapter. If instead of distances, a similarity matrix S = (s_ij)_{i,j=1,...,n} is given, then one can define a corresponding distance matrix by a suitable transformation. One possible transformation is, for instance,
d_ij = (s_ii − 2s_ij + s_jj)^{1/2}.   (11.1)
The question addressed by metric MDS can be formulated as follows: given an n × n distance matrix D, can one find a dimension k and n points x_1, ..., x_n in R^k such that these points have a distance matrix D̃ approximately, or even exactly, equal to D? Clearly one prefers low dimensions (k = 2 or 3, if possible), since it is then easy to display the points graphically. On the other hand, the dimension cannot be too low if one is to obtain a good approximation of D, and hence a realistic picture of structures in the data. As an alternative to metric MDS, one may also consider
nonmetric methods, where one tries to find points in a euclidian space such that the ranking of the distances remains the same, whereas their nominal values may differ.

11.2.2 Metric MDS

In the ideal case, the metric solution constructs, for some k, n points x_1, ..., x_n ∈ R^k such that their euclidian distance matrix D̃, with elements d̃_ij = [(x_i − x_j)^t (x_i − x_j)]^{1/2}, is exactly equal to the original distance matrix D. If this is possible, then D is called euclidian. The condition under which this is possible is as follows:

Theorem 25 D = D_{n×n} = (d_ij)_{i,j=1,...,n} is euclidian if and only if the matrix B = B_{n×n} = MAM is positive semidefinite, where M = (I − n^{-1} 1 1^t), I = I_{n×n} is the identity matrix, 1 = (1, ..., 1)^t, and A = A_{n×n} has elements a_ij = −½ d_ij² (i, j = 1, ..., n).

The reason for the positive semidefiniteness of B is that if D is indeed a euclidian matrix corresponding to points x_1, ..., x_n ∈ R^k, then
b_ij = (x_i − x̄)^t (x_j − x̄),   (11.2)
so that B defines a centered scalar product for these points. In matrix form we have B = (MX)(MX)^t, where the n rows of X_{n×k} correspond to the vectors x_i (i = 1, ..., n). Since for any matrix C, the matrices C^t C and C C^t are positive semidefinite, so is B. The construction of the points x_1, ..., x_n given D = D_{n×n} (or B_{n×n} positive semidefinite) is done as follows: suppose that B is of rank k ≤ n. Since B is a symmetric matrix, we have the spectral decomposition
B = C Λ C^t = Z Z^t,   (11.3)
where Λ is the n × n diagonal matrix with the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_k > 0 and λ_j = 0 (j > k) in the diagonal, and Z = Z_{n×n} = (z_ij)_{i,j=1,...,n} is the n × n matrix whose first k columns z^(j) (j = 1, ..., k) are equal to the first k eigenvectors of B, scaled so that z^(j)t z^(j) = λ_j. Then the points
x_i = (z_i1, ..., z_ik)^t  (i = 1, ..., n),   (11.4)
formed from the first k coordinates of the rows of Z, have distance matrix D. In practice, the following difficulties can occur: 1. D is euclidian, but k is too large to be of any use (after all, the purpose is to obtain an interpretable picture of the data); 2. D is not euclidian, with a) all λ_i positive, or b) some λ_i negative. Because of these problems, one often uses a rough approximation of D, based on a small number of eigenvectors that correspond to positive eigenvalues. Finally, note that if instead of distances, similarities are given and the similarity matrix S is positive semidefinite, then S can be transformed into a euclidian distance matrix by defining
d_ij = (s_ii − 2s_ij + s_jj)^{1/2}.   (11.5)
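The construction in Theorem 25 and (11.3) through (11.4) amounts to double-centring −½ d_ij², taking the leading eigenvectors, and scaling them by the square roots of the eigenvalues. A minimal sketch; classical_mds is an invented helper name and the four test points are arbitrary.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical metric MDS: points in R^k whose euclidian distances
    approximate the given symmetric distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    A = -0.5 * D**2
    M = np.eye(n) - np.ones((n, n)) / n      # centring matrix I - n^{-1} 1 1^t
    B = M @ A @ M                            # B = MAM, psd iff D is euclidian
    lam, C = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1]
    lam, C = lam[order], C[:, order]
    lam_k = np.clip(lam[:k], 0.0, None)      # ignore (small) negative eigenvalues
    return C[:, :k] * np.sqrt(lam_k)         # rows are the reconstructed points x_i

# small check: recover a planar configuration (up to rotation) from its distances
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
print(np.round(classical_mds(D, k=2), 2))
```

The recovered configuration agrees with the original one only up to translation, rotation, and reflection, which is all that a distance matrix can determine.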
11.2.3 Nonmetric MDS

For qualitative data, or more generally for observations in nonmetric spaces, distances can only be interpreted in terms of ranking. For instance, the subjective judgement of an audience may be that a composition by Webern is slightly more difficult than Wagner, but much more difficult than Mozart, thus defining a larger distance between Webern and Mozart than between Webern and Wagner. It may, however, not be possible to express the distances between the compositions by numbers that could be interpreted directly. In such cases, D is often called a dissimilarity matrix rather than a distance matrix. Since only the relative size of the distances is meaningful, various computationally demanding algorithmic methods for defining points in a euclidian space such that the ranking of the distances remains the same have been developed in the literature (e.g. Shepard 1962a,b, Kruskal 1964a,b, Guttman 1968, Lingoes and Roskam 1973).
11.2.4 Chronological ordering

Suppose a distance matrix D (or a similarity matrix S) is given and one would like to find out whether there is a natural ordering of the observational units. For instance, a listener may assign a distance matrix between various musical pieces without knowing anything about these pieces a priori. The question then may be whether the listener's distance matrix corresponds approximately to the sequence in time in which the pieces were composed. This problem is also called seriation. MDS provides a possible solution in the following way: if the distances expressed the temporal (or any other) sequence exactly, then the configuration of points found by MDS would be one-dimensional. In the more realistic case that the distances are only partially due to the temporal sequence, the points in R^k should be scattered around a one-dimensional, not necessarily straight, line in R^k. In the simplest case, this may already be visible in a two-dimensional plot.
11.3 Specific applications in music

11.3.1 Seriation by simple descriptive statistics

Suppose we would like to guess from which time a composition dates, without listening to the music but instead using an algorithm. There is a large amount of music theory that can be used to determine the time when a composition was written. One may wonder, however, whether there may be a very simple computational way of guessing. Consider, for instance, the following frequencies: x_i = p_{i−1} (i = 1, ..., 12) are the relative frequencies of the notes modulo 12 centered around the central tone, as defined in Section 9.3.2. Moreover, set x_13 equal to the relative frequency of sequences of four notes following the sequence of interval steps 3, 3 and 3. This corresponds to an arpeggio of the diminished seventh chord. Thus, we consider a vector x = (x_1, ..., x_13)^t with coordinates corresponding to proportions. An appropriate measure of distance between proportions is the Bhattacharyya distance (Bhattacharyya 1946b) given in Table 10.1, namely
d(x, y) = [∑_{i=1}^k (√x_i − √y_i)²]^{1/2}.
This is not a euclidian distance, so it is not a priori clear whether a suitable representation of the observations in a euclidian space is possible. MDS with k = 2 yields the points in Figure 11.1. Three time periods are distinguished by using different symbols for the points. The periods are defined in a very simple way, namely by the date of birth of the composer: a) before 1720 (early to baroque; see e.g. Figure 11.3); b) 1720-1880 (classical to romantic); and c) 1880 or later (20th century). The configuration of the respective points does show an effect of time. The three time periods can be associated with regional clusters, though the regions overlap. An outlier from the middle category is Schoenberg. This is due to the crude definition of the time periods: Schoenberg (in particular his op. 19/2) clearly belongs to the 20th century; he just happens to have been born a little bit too early (1874), and is therefore classified as classical to romantic. The dependence between time period and the second MDS coordinate can also be seen by comparing boxplots (Figure 11.2).

11.3.2 Perception and music psychology

MDS is frequently used to analyze data that consist of subjective distances between musical sounds (e.g. with respect to pitch or timbre) or compositions, obtained in controlled experiments. Typical examples are Grey and Gordon (1978), Gromko (1993), Ueda and Ohgushi (1987), Wedin (1972), Wedin and Goude (1972), Markuse and Schneider (1995). Since it is not known in how far the cognitive metric may correspond approximately to
Figure 11.1 Two-dimensional multidimensional scaling of compositions ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.
a euclidian distance, MDS is a useful method to investigate this question, to simplify high-dimensional distance data and possibly find interesting structures. Grey and Gordon consider perceptual effects of timbres characterized by spectra. For a related study see Wedin and Goude (1972). Gromko (1993) carries out an MDS analysis to study perceptual differences between expert and novice music listeners. Ueda and Ohgushi (1987) study perceptual components of pitch and use MDS to obtain a spatial representation of pitch.
Figure 11.2 Boxplots of the second MDS component, where compositions are classified according to three time periods.
Figure 11.3 Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 11.4 Muzio Clementi (1752-1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)
Figure 11.5 Freddy (by J.B.) and Johannes Brahms (1833-1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)
List of figures
Figure 1.1: Quantitative analysis of music helps to understand creative processes. (Pierre Boulez, photograph courtesy of Philippe Gontier, Paris; and Jim by J.B.)
Figure 1.2: J.S. Bach (1685-1750). (Engraving by L. Sichling after a painting by Elias Gottlob Haussmann, 1746; courtesy of Zentralbibliothek Zürich.)
Figure 1.3: Ludwig van Beethoven (1770-1827). (Drawing by E. Dürck after a painting by J.K. Stieler, 1819; courtesy of Zentralbibliothek Zürich.)
Figure 1.4: Anton Webern (1883-1945). (Courtesy of Österreichische Post AG.)
Figure 1.5: Gottfried Wilhelm Leibniz (1646-1716). (Courtesy of Deutsche Post AG and Elisabeth von Janota-Bzowski.)
Figure 1.6: W.A. Mozart (1756-1791) (authorship uncertain): Spiegel-Duett.
Figure 1.7: Wolfgang Amadeus Mozart (1756-1791). (Engraving by F. Müller after a painting by J.W. Schmidt; courtesy of Zentralbibliothek Zürich.)
Figure 1.8: The torus of thirds Z_3 + Z_4.
Figure 1.9: Arnold Schönberg: sketch for the piano concert op. 42, notes with tone row and its inversions and transpositions. (Used by permission of Belmont Music Publishers.)
Figure 1.10: Notes of Air by Henry Purcell. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.11: Notes of Fugue No. 1 (first half) from Das Wohltemperierte Klavier by J.S. Bach. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.12: Notes of op. 68, No. 2 from Album für die Jugend by Robert Schumann. (For better visibility, only a small selection of related motifs is marked.)
Figure 1.13: A miraculous transformation caused by high exposure to Wagner operas. (Caricature from a 19th century newspaper; courtesy of Zentralbibliothek Zürich.)
Figure 1.14: Graphical representation of pitch and onset time in Z2 71 together with instrumentation of polygonal areas. (Excerpt from Śānti, Piano concert No. 2 by Jan Beran, col legno CD 20062; courtesy of col legno, Germany.)
Figure 1.15: Iannis Xenakis (1922-1998). (Courtesy of Philippe Gontier, Paris.)
Figure 1.16: Ludwig van Beethoven (1770-1827). (Courtesy of Zentralbibliothek Zürich.)
Figure 2.1: Robert Schumann (1810-1856): Träumerei op. 15, No. 7.
Figure 2.2: Tempo curves of Schumann's Träumerei performed by Vladimir Horowitz.
Figure 2.3: Twenty-eight tempo curves of Schumann's Träumerei performed by 24 pianists. (For Cortot and Horowitz, three tempo curves were available.)
Figure 2.4: Boxplots of descriptive statistics for the 28 tempo curves in Figure 2.3.
Figure 2.5: qq-plots of several tempo curves (from Figure 2.3).
Figure 2.6: Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16.
Figure 2.7: Frequencies of notes 0, 1, ..., 11 for moving windows of onset-length 16.
Figure 2.8: Johannes Chrysostomus Wolfgangus Theophilus Mozart (1756-1791) in the house of Salomon Gessner in Zurich. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.9: R. Schumann (1810-1856): lithography by H. Bodmer. (Courtesy of Zentralbibliothek Zürich.)
Figure 2.10: Acceleration of tempo curves for Cortot and Horowitz.
Figure 2.11: Tempo acceleration: correlation with other performances.
Figure 2.12: Martha Argerich: interpolation of tempo curve by cubic splines.
Figure 2.13: Smoothed tempo curves ĝ_1(t) = (nb_1)^{-1} ∑_{t_i} K((t − t_i)/b_1) y_i (b_1 = 8).
Figure 2.14: Smoothed tempo curves ĝ_2(t) = (nb_2)^{-1} ∑_{t_i} K((t − t_i)/b_2)[y_i − ĝ_1(t)] (b_2 = 1).
Figure 2.15: Smoothed tempo curves ĝ_3(t) = (nb_3)^{-1} ∑_{t_i} K((t − t_i)/b_3)[y_i − ĝ_1(t) − ĝ_2(t)] (b_3 = 1/8).
Figure 2.16: Smoothed tempo curves: residuals ê(t) = y_i − ĝ_1(t) − ĝ_2(t) − ĝ_3(t).
Figure 2.17: Melodic indicator – local polynomial fits together with first and second derivatives.
Figure 2.18: Tempo curves (Figure 2.3) – first derivatives obtained from local polynomial fits (span 24/32).
Figure 2.19: Tempo curves (Figure 2.3) – second derivatives obtained from local polynomial fits (span 8/32).
Figure 2.20: Kinderszene No. 4 – sound wave of performance by Horowitz at the Royal Festival Hall in London on May 22, 1982.
Figure 2.21: log(Amplitude) and tempo for Kinderszene No. 4 – auto- and cross-correlations (a), scatter plot with fitted least squares and robust lines (b), time series plots (c), and sharpened scatter plot (d).
Figure 2.22: Horowitz performance of Kinderszene No. 4 – log(tempo) versus log(Amplitude) and boxplots of log(tempo) for three ranges of amplitude.
Figure 2.23: Horowitz performance of Kinderszene No. 4 – two-dimensional histogram of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and an image plot respectively.
Figure 2.24: Horowitz performance of Kinderszene No. 4 – kernel estimate of the two-dimensional distribution of (x, y) = (log(tempo), log(Amplitude)) displayed in a perspective and an image plot respectively.
Figure 2.25: R. Schumann, Träumerei op. 15, No. 7 – density of melodic indicator with sharpening region (a) and melodic curve plotted against onset time, with sharpening points highlighted (b).
Figure 2.26: R. Schumann, Träumerei op. 15, No. 7 – tempo by Cortot and Horowitz at sharpening onset times.
Figure 2.27: R. Schumann, Träumerei op. 15, No. 7 – tempo derivatives for Cortot and Horowitz at sharpening onset times.
Figure 2.28: Arnold Schönberg (1874–1951), self-portrait. (Courtesy of Verwertungsgesellschaft Bild-Kunst, Bonn.)
Figure 2.29: a) Chernoff faces for 1. Saltarello (Anonymus, 13th century); 2. Prelude and Fugue No. 1 from Das Wohltemperierte Klavier (J. S. Bach, 1685–1750); 3. Kinderszene op. 15, No. 1 (R. Schumann, 1810–1856); 4. Piano piece op. 19, No. 2 (A. Schönberg, 1874–1951); 5. Rain Tree Sketch 1 (T. Takemitsu, 1930–1996); b) Chernoff faces for the same compositions as in Figure 2.29a, after permuting coordinates.
Figure 2.30: The minnesinger Burchard von Wengen (1229–1280), contemporary of Adam de la Halle (1235?–1288). (From Codex Manesse, courtesy of the University Library Heidelberg.) (Color figures follow page 168.)
Figure 2.31: Star plots of p̂j = (p6, p11, p4, p9, p2, p7, p12, p5, p10, p3, p8)^t for compositions from the 13th to the 20th century.
Figure 2.32: Symbol plot of the distribution of successive interval pairs (y(ti), y(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Bach's Präludium No. 1 (Das Wohltemperierte Klavier I) and Mozart's Sonata KV 545 (beginning of 2nd movement).
Figure 2.33: Symbol plot of the distribution of successive interval pairs (y(ti), y(ti+1)) (a, c) and their absolute values (b, d) respectively, for the upper envelopes of Scriabin's Prélude op. 51, No. 4 and F. Martin's Prélude No. 6.
Figure 2.34: Symbol plot with x = pj5, y = pj7 and radius of circles proportional to pj1.
Figure 2.35: Symbol plot with x = pj5, y = pj7 and radius of circles proportional to pj6. (Color figures follow page 168.)
Figure 2.36: Symbol plot with x = pj5, y = pj7. The rectangles have width pj1 (diminished second) and height pj6 (augmented fourth). (Color figures follow page 168.)
Figure 2.37: Symbol plot with x = pj5, y = pj7, and triangles defined by pj1 (diminished second), pj6 (augmented fourth), and pj10 (diminished seventh). (Color figures follow page 168.)
Figure 2.38: Names plotted at locations (x, y) = (pj5, pj7). (Color figures follow page 168.)
Figure 2.39: Profile plots of p̂j = (p5, p10, p3, p8, p1, p6, p11, p4, p9, p2, p7)^t.
Figure 3.1: Ludwig Boltzmann (1844–1906). (Courtesy of Österreichische Post AG.)
Figure 3.2: Fractal pictures (by Céline Beran, computer generated). (Color figures follow page 168.)
Figure 3.3: György Ligeti (*1923). (Courtesy of Philippe Gontier, Paris.)
Figure 3.4: Comparison of entropies 1, 2, 3, and 4 for J.S. Bach's Cello Suite No. I and R. Schumann's op. 15, No. 2, 3, 4, and 7, and op. 68, No. 2 and 16.
Figure 3.5: Alexander Scriabin (1871–1915) (at the piano) and the conductor Serge Koussevitzky. (Painting by Robert Sterl, 1910; courtesy of Gemäldegalerie Neue Meister, Dresden, and Robert-Sterl-House.)
Figure 3.6: Comparison of entropies 9 and 10 for Bach, Schumann, and Scriabin/Martin.
Figure 3.7: Metric, melodic, and harmonic global indicators for Bach's Canon cancricans.
Figure 3.8: Robert Schumann (1810–1856). (Courtesy of Zentralbibliothek Zürich.)
Figure 3.9: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.10: Metric, melodic, and harmonic global indicators for Schumann's op. 15, No. 7 (upper figure), together with smoothed versions (lower figure).
Figure 3.11: Metric, melodic, and harmonic global indicators for Webern's Variations op. 27, No. 2 (upper figure), together with smoothed versions (lower figure).
Figure 3.12: R. Schumann – Träumerei: motifs used for specific melodic indicators.
Figure 3.13: R. Schumann – Träumerei: indicators of individual motifs.
Figure 3.14: R. Schumann – Träumerei: contributions of individual motifs to overall melodic indicator.
Figure 3.15: R. Schumann – Träumerei: overall melodic indicator.
Figure 4.1: Sound wave of c and f played on a piano.
Figure 4.2: Zoomed piano sound wave – shaded area in Figure 4.1.
Figure 4.3: Periodogram of piano sound wave in Figure 4.2.
Figure 4.4: Sound wave of e played on a harpsichord.
Figure 4.5: Periodogram of harpsichord sound wave in Figure 4.4.
Figure 4.6: Harpsichord sound – periodogram plots for different time frames (moving windows of time points).
Figure 4.7: A harpsichord sound and its spectrogram. Intense pink corresponds to high values of I(t, λ). (Color figures follow page 168.)
Figure 4.8: A harpsichord sound wave (a), logarithm of squared amplitudes (b), histogram of the series (c), and its periodogram on log-scale (d) together with fitted SEMIFAR spectrum.
Figure 4.9: Log-frequencies with fitted SEMIFAR trend, and log-log-periodogram together with SEMIFAR fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d) respectively.
Figure 4.10: Local variability with fitted SEMIFAR trend, and log-log-periodogram together with SEMIFAR fit, for Bach's first Cello Suite (1st movement; a, b) and Paganini's Capriccio No. 24 (c, d) respectively.
Figure 4.11: Niccolò Paganini (1782–1840). (Courtesy of Zentralbibliothek Zürich.)
Figure 5.1: Simulated signal (a) and wavelet coefficients (b); (c) and (d): wavelet components of the simulated signal in a; (e) and (f): wavelet components of the simulated signal in a and frequency plot of coefficients.
Figure 5.2: Decomposition of x-series in simulated HIWAVE model.
Figure 5.3: Simulated HIWAVE model – explanatory series g1 (a), y-series (b), y versus x (c), y versus g1 (d), y versus g2 = x − g1 (e), and time-frequency plot of y (f).
Figure 5.4: HIWAVE – time series and fitted function ĝ1.
Figure 5.5: Hierarchical decomposition of metric, melodic, and harmonic indicators for Bach's Canon cancricans (Das Musikalische Opfer BWV 1079) and Webern's Variations op. 27, No. 2.
Figure 5.6: Quantitative analysis of performance data is an attempt to understand objectively how musicians interpret a score, without attaching any subjective judgement. (Left: Freddy by J.B.; right: J.S. Bach, woodcut by Ernst Würtemberger, Zürich. Courtesy of Zentralbibliothek Zürich.)
Figure 5.7: Most important melodic curves obtained from HIREG fit to tempo curves for Schumann's Träumerei.
Figure 5.8: Successive aggregation of HIREG components for tempo curves by Ashkenazy and Horowitz (third performance).
Figure 5.9 a and b: HISMOOTH fits to tempo curves (performances 1–14); Figure 5.9 c and d: HISMOOTH fits to tempo curves (performances 15–28).
Figure 5.10: Time-frequency plots for Cortot's and Horowitz's three performances.
Figure 5.11: Wavelet coefficients for Cortot's and Horowitz's three performances.
Figure 5.12: Tempo curves – approximation by the most important 2 best-basis functions.
Figure 5.13: Tempo curves – approximation by the most important 5 best-basis functions.
Figure 5.14: Tempo curves – approximation by the most important 10 best-basis functions.
Figure 5.15: Tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE fit plotted against trial cutoff parameter (b), and fitted HIWAVE curves (c).
Figure 5.16: First derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE fit plotted against trial cutoff parameter (b), and fitted HIWAVE curves (c).
Figure 5.17: Second derivative of tempo curves (a) by Cortot (three curves on top) and Horowitz, R² obtained in HIWAVE fit plotted against trial cutoff parameter (b), and fitted HIWAVE curves (c).
Figure 6.1: Jean-Philippe Rameau (1683–1764). (Engraving by A. St. Aubin after J. J. Caffieri, Paris, after 1764; courtesy of Zentralbibliothek Zürich.)
Figure 6.2: Frédéric Chopin (1810–1849). (Courtesy of Zentralbibliothek Zürich.)
Figure 6.3: Stationary distributions πj (j = 1, ..., 11) of Markov chains with state space Z12 \ {0}, estimated for the transition between successive intervals.
Figure 6.4: Cluster analysis based on stationary Markov chain distributions for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.5: Cluster analysis based on stationary Markov chain distributions of torus distances for compositions by Bach, Mozart, Haydn, Chopin, Schumann, Brahms, and Rachmaninoff.
Figure 6.6: Comparison of log odds ratios log(π1/π2) of stationary Markov chain distributions of torus distances.
Figure 6.7: Comparison of log odds ratios log(π1/π3) of stationary Markov chain distributions of torus distances.
Figure 6.8: Comparison of log odds ratios log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.9: Comparison of log odds ratios log(π1/π3) and log(π2/π3) of stationary Markov chain distributions of torus distances.
Figure 6.10: Comparison of stationary Markov chain distributions of torus distances.
Figure 6.11: Log odds ratios log(π1/π3) and log(π2/π3) plotted against date of birth of composer.
Figure 6.12: Johannes Brahms (1833–1897). (Courtesy of Zentralbibliothek Zürich.)
Figure 7.1: Béla Bartók statue by Varga Imre in front of the Béla Bartók Memorial House in Budapest. (Courtesy of the Béla Bartók Memorial House.)
Figure 7.2: Sergei Prokofieff as a child. (Courtesy of Karadar Bertoldi Ensemble; www.karadar.net/Ensemble/.)
Figure 7.3: Circular representation of compositions by J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.4: Boxplots of μ1, R, d, and log m for notes modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.5: Circular representation of intervals of successive notes in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.6: Boxplots of μ1, R, d, and log m for note intervals modulo 12, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.7: Circular representation of notes ordered according to the circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.8: Boxplots of μ1, R, d, and log m for notes modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 7.9: Circular representation of intervals of successive notes ordered according to the circle of fourths in the following compositions: J. S. Bach (Präludium und Fuge No. 5 from Das Wohltemperierte Klavier), D. Scarlatti (Sonata Kirkpatrick No. 125), B. Bartók (Bagatelles No. 3), and S. Prokofieff (Visions fugitives No. 8).
Figure 7.10: Boxplots of μ1, R, d, and log m for note intervals modulo 12 ordered according to the circle of fourths, comparing Bach, Scarlatti, Bartók, and Prokofieff.
Figure 8.1: Tempo curves for Schumann's Träumerei: skewness for the eight parts A1, A2, A1', A2', B1, B2, A1'', A2'' for 28 performances, plotted against the number of the part.
Figure 8.2: Schumann's Träumerei: screeplot for skewness.
Figure 8.3: Schumann's Träumerei: loadings for PCA of skewness.
Figure 8.4: Schumann's Träumerei: symbol plot of principal components z2, ..., z5 for PCA of tempo skewness.
Figure 8.5: Schumann's Träumerei: tempo curves by Cortot, Horowitz, Brendel, and Gianoli.
Figure 8.6: Air by Henry Purcell (1659–1695).
Figure 8.7: Screeplot for PCA of entropies.
Figure 8.8: Loadings for PCA of entropies.
Figure 8.9: Entropies – symbol plot of the first four principal components.
Figure 8.10: Entropies – symbol plot of principal components no. 2–5.
Figure 8.11: F. Martin (1890–1971). (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 8.12: F. Martin (1890–1971) – manuscript from 8 Préludes. (Courtesy of the Société Frank Martin and Mrs. Maria Martin.)
Figure 9.1: Discriminant analysis combined with time series analysis can be used to judge purity of intonation (Elvira by J.B.).
Figure 9.2: Linear discriminant analysis of compositions before and after 1800, with the training sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.3: Linear discriminant analysis of compositions before and after 1800, with the validation sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.4: Linear discriminant analysis of Early Music to Baroque versus Romantic to 20th Century. The points (o and ) belong to the training sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.5: Linear discriminant analysis of Early Music to Baroque versus Romantic to 20th Century. The points (o and ) belong to the validation sample. The data used for the discriminant rule consist of x = (p5, E).
Figure 9.6: Graduale written for an Augustinian monastery of the diocese of Konstanz, 13th century. (Courtesy of Zentralbibliothek Zürich.) (Color figures follow page 168.)
Figure 9.7: Johannes Brahms (1833–1897). (Photograph by Maria Fellinger, courtesy of Zentralbibliothek Zürich.)
Figure 9.8: Richard Wagner (1813–1883). (Engraving by J. Bankel after a painting by C. Jäger, courtesy of Zentralbibliothek Zürich.)
Figure 10.1: Complete linkage clustering of log-odds-ratios of note frequencies.
Figure 10.2: Single linkage clustering of log-odds-ratios of note frequencies.
Figure 10.3: Joseph Haydn (1732–1809). (Title page of a biography published by the Allgemeine Musik-Gesellschaft Zürich, 1830; courtesy of Zentralbibliothek Zürich.)
Figure 10.4: Klavierstück op. 19, No. 2 by Arnold Schönberg. (Facsimile; used by permission of Belmont Music Publishers.)
Figure 10.5: Complete linkage clustering of entropies.
Figure 10.6: Complete linkage clustering of tempo.
Figure 10.7: Complete linkage clustering of HISMOOTH fits to tempo curves.
Figure 10.8: Symbol plot of HISMOOTH bandwidths for tempo curves. The radius of each circle is proportional to a constant plus log b3; the horizontal and vertical axes correspond to b1 and b2 respectively. The letters A–F indicate where at least one observation from the corresponding cluster occurs.
Figure 10.9: Maurizio Pollini (*1942). (Courtesy of Philippe Gontier, Paris.)
Figure 11.1: Two-dimensional multidimensional scaling of compositions ranging from the 13th to the 20th century, based on frequencies of intervals and interval sequences.
Figure 11.2: Boxplots of the second MDS component, where compositions are classified according to three time periods.
Figure 11.3: Fragment of a graduale from the 14th century. (Courtesy of Zentralbibliothek Zürich.)
Figure 11.4: Muzio Clementi (1752–1832). (Lithography by H. Bodmer, courtesy of Zentralbibliothek Zürich.)
Figure 11.5: Freddy (by J.B.) and Johannes Brahms (1833–1897) going for a drink. (Caricature from a contemporary newspaper; courtesy of Zentralbibliothek Zürich.)
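The captions of Figures 2.13–2.16 describe a three-stage hierarchy of kernel smoothers applied to a tempo curve, each stage smoothing the residuals of the previous one with a smaller bandwidth (b1 = 8, b2 = 1, b3 = 1/8). The following minimal Python sketch is not taken from the book: the Gaussian kernel, the normalized (Nadaraya-Watson type) weights used in place of the (nb)^(-1) sum quoted in the captions, and the simulated tempo data are assumptions made purely for illustration.

import numpy as np

def kernel_smooth(t_grid, t_obs, y, b):
    # Kernel smoother: weighted average of y with Gaussian weights of bandwidth b
    # (normalized variant; the captions quote the unnormalized (nb)^(-1) sum form).
    w = np.exp(-0.5 * ((t_grid[:, None] - t_obs[None, :]) / b) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 32.0, 256)                                      # onset times (score time), illustrative
y = 60 + 5 * np.sin(2 * np.pi * t / 32) + rng.normal(0, 2, t.size)   # simulated tempo curve

g1 = kernel_smooth(t, t, y, b=8.0)             # coarse trend, cf. Figure 2.13
g2 = kernel_smooth(t, t, y - g1, b=1.0)        # smoothed first-stage residuals, cf. Figure 2.14
g3 = kernel_smooth(t, t, y - g1 - g2, b=1/8)   # finest level, cf. Figure 2.15
e = y - g1 - g2 - g3                           # remaining residuals, cf. Figure 2.16
print(round(float(e.std()), 3))

With decreasing bandwidths, g1 captures the global shape of the curve, g2 and g3 pick up increasingly local detail, and e is what remains unexplained.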