Leal 1

AD SUM LUCEM: AN ETUDE IN INSTRUMENTAL SPEECH SYNTHESIS

Partiels by Gérard Grisey and Gondwana by Tristan Murail are two landmark pieces

of what can be identified as the spectral movement.1 Both pieces share that they are the

instrumental realization or translation of a technique that is mainly developed through electronic

and/or digital means: additive and Frequency Modulation (FM) syntheses respectively.2 Both

pieces also share that they use such synthesis processes as a model more than they are an

accurate and literal translation of these techniques into the acoustic instrumental realm. In

Partiels, Grisey used the analysis of a trombone pitch to understand the harmonic material that

constituted the trombone timbre at that given pitch, and used the harmonic series and the relative

intensity, duration, and appearance time of each partial as models for the piece. In Gondwana,

Murail used the data obtained from the calculation of the pitches produced by the side bands that

result from an FM equation. In this case, there is a formula that determines which and how

many side bands will be produced when a carrier frequency of a determined value is modulated

by a modulator frequency of a determined value. Such realizations are not literal mainly because

the synthesis techniques mentioned, when using electronic means, use sine waves, which are the

simplest type of periodic wave. In contrast, Grisey and Murail had to deal with the complex sounds of

acoustic instruments, meaning that the resultant timbres include series of additional harmonic

partials in relation to their electronic counterparts.3 In this project I sought to apply a similar

1 Joshua Fineberg, "Spectral Music," Contemporary Music Review 19, no. 2 (2000): 2.

2 François Rose, "Introduction to the Pitch Organization of French Spectral Music,"
Perspectives of New Music 34, no. 2 (Summer 1996): 8–11; Tristan Murail, "Villeneuve-lès-Avignon
Conferences, Centre Acanthes, 9–11 and 13 July 1992," Contemporary Music Review
24, nos. 2–3 (2005): 205–211.

3 Rose, "Pitch Organization," 8–11; Murail, "Villeneuve-lès-Avignon," 205–211.


approach for the realization of speech synthesis, which is almost exclusively done with

computers, by means of traditional-ensemble acoustic instruments.
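
As an illustration of the kind of FM calculation mentioned above, the following minimal sketch lists side-band frequencies as the carrier plus or minus integer multiples of the modulator. The function name, the max_order parameter, and the example frequencies are assumptions made for illustration; they are not Murail's values, and in practice the number of audible side bands depends on the modulation index, which is not modeled here.

```python
def fm_sidebands(carrier_hz, modulator_hz, max_order):
    """Return sorted frequencies carrier +/- k * modulator for k = 0..max_order.

    Negative results are reflected to positive frequencies (the phase
    inversion that accompanies the reflection is ignored in this sketch).
    """
    freqs = set()
    for k in range(max_order + 1):
        freqs.add(abs(carrier_hz + k * modulator_hz))
        freqs.add(abs(carrier_hz - k * modulator_hz))
    return sorted(freqs)


# Hypothetical carrier and modulator values, chosen only for readability.
for freq in fm_sidebands(400.0, 100.0, 3):
    print(f"{freq:.1f} Hz")
```

Each resulting frequency would then still have to be rounded to a playable pitch, which is one reason such instrumental realizations are models rather than literal translations.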

Speech synthesis by analog acoustic means has been explored in the past, and it has had a strong

impact on the development of the study of phonetics. In his thesis on speech synthesis, Sami

Lemmetty described several early examples of mechanical speech synthesis. These examples

include Christian Kratzenstein's development of a series of resonators to produce the long

vowels /a/ /e/ /i/ /o/ /u/; Wolfgang von Kempelen's models of his Acoustic Mechanical Speech

Machine, which consisted of a couple of reeds, a box with a variety of resonators, and a flexible

end that could be manipulated to filter the sound in the way the mouth filters the sounds

produced by the vocal cords (the machine could produce some vowels and some nasal and

fricative consonants); Charles Wheatstone's and Alexander Graham Bell's constructions of von

Kempelen's machine; and the exploration by Robert Willis of the production of vowels by

means of organ pipe-like tubes.4 These examples, however, are mainly based on the production

of phoneme-like sounds by a single emitter of which several parameters can be manipulated in

ways traditional-ensemble instruments cannot. In any case, the production of movable filters,

which could be a valid approach, escapes the scope of this paper.

Digital speech synthesis is a complex process and requires complicated computer

algorithms. Two common approaches to speech synthesis are the unit selection synthesis

approach and the statistical parametric synthesis approach. Both approaches imply access to a

large database of recorded speech sounds. The main difference between the two approaches is that

unit selection mixes and reproduces small bits of sound straight from the database, while

statistical parametric synthesis uses algorithms to analyze the samples, calculate possible
4 Sami Lemmetty, "Review of Speech Synthesis Technology" (master's thesis, Helsinki
University of Technology, 1999), 4–6.

outcomes, and synthesize the result of such operations.5 Understanding the details of any of these

techniques would require a high degree of specialization, which escapes the scope of this project.

In any case, the use of analytical techniques to interpret the components of determined speech

sounds (phonemes), and the later synthesis of an approximated sound using acoustic instruments

better resembles the latter approach.

A less data-intensive approach for synthesizing certain phonemes by electronic means is

subtractive synthesis. Knowledge of the formants of specific vowels and consonants can be used

to control specific band-pass filters applied to a complex sound (such as white noise). This

approach is similar to that of the mechanical synthesis examples described above. The

implementation of such a method, however, seems impractical for traditional-ensemble acoustic

instruments. The visual analysis of such formants and the harmonic components of speech can be

used as models for an additive synthesis approach, which seems more reasonable in terms of data

amounts and more suitable for the selected medium.
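
The subtractive approach described above can be sketched with a two-pole resonator standing in for each band-pass filter. This is only an illustration under stated assumptions: the formant center frequencies and bandwidths below are hypothetical placeholders, not measurements from this project, and the resonator is one simple filter design among many.

```python
import math
import random

SAMPLE_RATE = 44100


def resonator(samples, center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Two-pole resonant filter that emphasizes frequencies near center_hz."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    a1 = 2.0 * r * math.cos(2.0 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical formant settings as (center Hz, bandwidth Hz) pairs.
HYPOTHETICAL_FORMANTS = [(700.0, 100.0), (1200.0, 120.0)]

# White noise source, seeded so that the example is repeatable.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE // 10)]

# One filtered copy of the noise per formant, summed into a single signal.
bands = [resonator(noise, c, bw) for c, bw in HYPOTHETICAL_FORMANTS]
vowel_like = [sum(values) for values in zip(*bands)]
print(f"output RMS: {rms(vowel_like):.2f}")
```

The output is not level-normalized, so in practice it would need to be scaled before playback.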

The analysis of a large pool of phonemes, even using the simplest of the techniques

described above, implies an extended amount of data, even if such data is reduced to visual

representations. Because the scope of this project was limited by the time available during the

second half of a semester's worth of course work (for one class), an achievable goal seemed to

be the realization of a single phrase of text. Because of my Latin American background, I am

more familiar with the sound of Spanish phonemes, and my limited understanding of English

phonemes could lead me to misjudge the resultant sounds and their similarity to

English phonemes. Latin, however, seems a language that could fulfill the needs of this project.

Latin is a language historically related to western art music, which is often what we study in the

5 Heiga Zen, Keiichi Tokuda, and Alan W. Black, "Statistical Parametric Speech
Synthesis," Speech Communication 51, no. 11 (2009): 1039–1044.

academy. At the same time, Latin vowels and consonants are closer to Spanish phonemes than

English phonemes are. I created the following sentence: "Ad sum lucem arte et laborum," which

roughly translates into "I am the light of arts and labor." I am not a religious person, but I interpret

the sentence as a manifestation of Marx and Engels's "Spectre of Communism," with which they

open the Manifesto of the Communist Party.6

Because vowels and most consonants can be understood as spectrally different (mainly

harmonic vs mainly inharmonic components), I decided it was better to approach them in two

different ways. Vowels can be synthesized more easily by means of additive synthesis because

they are mainly constituted of harmonic components. I then proceeded to record and analyze my

voice saying the Spanish vowels (which correspond also to most Latin vowels) /a/ /e/ /i/ /o/

and /u/ (Figures 1, 2, and 3).

Figure 1. Harmonic analysis of vowels using Spear software (panels: /a/, /e/, /i/, /o/, /u/).

6 Friedrich Engels and Karl Marx, The Communist Manifesto (Kindle edition).

Figure 2. Harmonic analysis of vowels using Spear7 (all selected; panels: /a/, /e/, /i/, /o/, /u/).

Figure 3. Sonogram of vowels using Max/MSP8 (panels: /a/, /e/, /i/, /o/, /u/).

7 See http://www.klingbeil.com/spear/

8 See https://cycling74.com/products/max/#.WEiJqvkrI4k

To try an additive synthesis approach to reproduce the vowels, I set fifteen sinewave

oscillators. The first oscillator to the left produced a fundamental pitch, which was estimated at

123 Hz, and each oscillator to the right produced a frequency resulting from the multiplication of

the fundamental by 2, 3, 4, and so on consecutively up to 14. Each oscillator was connected to a

number box, which displayed the resultant frequency in Hz, and to a gain slider, which allowed for

independent control of the volume (Figure 4).

Figure 4 Additive synthesizer using Max/MSP
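
The oscillator frequencies described above follow directly from the harmonic series and can be tabulated with a short sketch. The 123 Hz fundamental and the multipliers up to 14 come from the text; the function name and the count parameter are assumptions introduced for illustration, and the count is left adjustable.

```python
FUNDAMENTAL_HZ = 123.0  # estimated fundamental reported in the text


def harmonic_frequencies(fundamental_hz, count):
    """Frequencies of the first `count` harmonics (multipliers 1, 2, 3, ...)."""
    return [fundamental_hz * n for n in range(1, count + 1)]


# One line per oscillator: multiplier and resulting frequency in Hz.
for n, freq in enumerate(harmonic_frequencies(FUNDAMENTAL_HZ, 14), start=1):
    print(f"oscillator x{n:<2d} {freq:7.1f} Hz")
```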

I observed the harmonic structure of each vowel and the relative intensity of each

harmonic, and used that information as an estimate to produce vowels using the additive

synthesizer. The model was based only on approximations because the production mechanisms are

different. Perception played the most important role in tweaking the gain for each oscillator. I

was satisfied when I identified the sound as a vowel more than as separate oscillators.

Comparing the results for each vowel was also important because the perception of the

formants was easier to understand for one vowel in relation to other vowels. The results for each

vowel were translated as a list of messages for all the oscillators, allowing me to quickly go from

one vowel to another. I recorded the result for a sequence of vowels and analyzed it using Spear

(Figures 5 and 6). While the harmonic analysis of the synthesized vowels is evidently less

complex than that of real speech, the approximation is enough to recognize the timbral

difference between vowels.
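
The idea of storing each vowel as a list of oscillator gains can be sketched outside Max/MSP as follows. This is a rough illustration, not the patch used in the project: the gain values in HYPOTHETICAL_PRESETS are invented placeholders, since the gains actually used were tuned by ear and are not reported here.

```python
import math

SAMPLE_RATE = 44100


def additive_tone(fundamental_hz, gains, duration_s, sample_rate=SAMPLE_RATE):
    """Sum sine partials at integer multiples of the fundamental.

    gains[i] is the non-negative linear amplitude of harmonic i + 1. The sum
    is divided by the total gain so that the output stays within [-1, 1].
    """
    total = sum(gains)
    if total <= 0:
        raise ValueError("at least one gain must be positive")
    samples = []
    for i in range(round(duration_s * sample_rate)):
        t = i / sample_rate
        value = sum(
            g * math.sin(2.0 * math.pi * fundamental_hz * (h + 1) * t)
            for h, g in enumerate(gains)
        )
        samples.append(value / total)
    return samples


# Invented gain lists (one value per harmonic), NOT the values from the study.
HYPOTHETICAL_PRESETS = {
    "a": [1.0, 0.8, 0.6, 0.5, 0.7, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02, 0.02],
    "i": [1.0, 0.9, 0.3, 0.1, 0.05, 0.02, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

# Switching vowels amounts to swapping one gain list for another.
tones = {
    vowel: additive_tone(123.0, gains, 0.05)
    for vowel, gains in HYPOTHETICAL_PRESETS.items()
}
print({vowel: len(samples) for vowel, samples in tones.items()})
```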

Figure 5. Analysis of vowels produced by additive synthesis (all selected; panels: /a/, /e/, /i/, /o/, /u/).


Figure 6. Analysis of vowels produced by additive synthesis (panels: /a/, /e/, /i/, /o/, /u/).

To produce a similar effect with traditional-ensemble acoustic instruments, it was

necessary to consider the timbral qualities of those instruments. I selected the clarinet as the main

instrument since, when played at a very low dynamic, its timbre resembles that of sinewaves.9 The

translation, however, is not free of technical issues: a bass clarinet is needed to produce the pitch

closest to 123 Hz, which is a B2 (123.47 Hz), and for the sound to resemble the timbre of a

sinewave it is crucial that the note is played at a pianissimo dynamic. Otherwise, the result

includes a series of partials that have undesired effects on the addition during the synthesis, and

therefore on the resultant timbre. Similarly, because of the relative intensity of the partials that

constitute each vowel, very high pitches are demanded, which are technically almost impossible

to produce at the required dynamic, especially considering that the lower partials are also played

pianissimo.

These and other issues will be addressed at the end of the report.

For now, an approximation using samples will be treated with as much freedom as possible. To run

9 Joe Wolfe, "Clarinet Acoustics: An Introduction," University of New South Wales
School of Physics, 2016, http://newt.phys.unsw.edu.au/jw/clarinetacoustics.html#pff

a first trial, I downloaded samples from the University of Iowa Electronic Music Studios

website.10 Pitches were determined using the frequencies produced by the oscillators in the

additive synthesizer, and selecting the closest semitone. Using Audacity, a free Digital Audio

Workstation (DAW), each sample was loaded in a separate track, which allowed me to control

the gain of each partial independently. While all samples corresponded to a pianissimo

performance of the determined pitch, a great deal of gain control had to be used to produce a

timbre that could be identified as a vowel. This time, the model was taken from the additive

synthesizer, and again the results were slightly altered to produce the desired effect. A harmonic

analysis shows the results for the pseudo-analog synthesis of vowels in Figures 7 and 8.
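
The step of selecting the closest semitone for each oscillator frequency can be sketched as below. The sketch assumes twelve-tone equal temperament with A4 = 440 Hz, which the text does not state explicitly, and the function name is introduced only for illustration.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_semitone(freq_hz, a4_hz=440.0):
    """Return (note name, equal-tempered frequency in Hz, deviation in cents)."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))  # MIDI note 69 is A4
    tempered_hz = a4_hz * 2 ** ((midi - 69) / 12)
    cents = 1200 * math.log2(freq_hz / tempered_hz)
    name = f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"
    return name, tempered_hz, cents


# Harmonics 1 to 14 of the 123 Hz fundamental reported in the text.
for n in range(1, 15):
    name, tempered_hz, cents = nearest_semitone(123.0 * n)
    print(f"x{n:<2d} {123.0 * n:7.1f} Hz -> {name:<4s} {tempered_hz:8.2f} Hz ({cents:+.1f} cents)")
```

The cents column shows how far each harmonic lies from the sampled pitch; some harmonics fall noticeably between semitones, which is one source of the difference between the sampled aggregate and the sinewave model.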

Figure 7. Pseudo-analog synthesized vowels (all selected; panels: /a/, /e/, /i/, /o/, /u/).

10 Lawrence Fritts, Musical Instruments Samples, University of Iowa Electronic Music Studios.
http://theremin.music.uiowa.edu/MIS.html.

Figure 8. Pseudo-analog synthesized vowels (panels: /a/, /e/, /i/, /o/, /u/).


As suggested earlier, I relied more on intuitive approximation than on exact data. It can be

noted, however, that there are similarities in the analyses of the vowels across the different media.

For instance, in Figures 1 and 2 the difference between the vowels /e/ and /i/ can be observed

mainly in the less nuanced harmonics for the vowel /i/ in the area between 500 and 750 Hz, and

the total lack of harmonics in the 750 to 2000 Hz area. Also, vowel /e/ presents more nuanced

harmonics than vowel /a/ in the 200 to 500 Hz area, but the progression of harmonics upwards for

/e/ is less gradual, and there is a visible weakening of harmonics in the area between 750 and

1500 Hz, while for /a/ the intensity decreases in a gradual fashion up to the 1600 Hz area, where

there is a visible lack of harmonics. Vowel /o/ has a structure similar to that of /a/, but with more

nuanced harmonics in the 200 to 500 Hz area and a visible lack of harmonics in the 1200 Hz

area. /u/ appears similar to /o/, but with less nuanced harmonics in the 500 to 1200 Hz area, where

/u/ keeps losing harmonics gradually in contrast to the more sudden lack of them in /o/. The

difference between /o/ and /u/, however, seems to be much clearer in the higher register that is

not displayed in the figures. Similar tendencies can be observed in Figures 5 and 6, and 7 and 8.

Differences might be explained by the contrasting degrees of complexity contributed by


inharmonic sounds in the cases of the original speech and the pseudo-acoustic representation in

relation to the additive synthesized case.

Two alternative approaches to the synthesis of vowels were explored, one including flutes

as well as clarinets and one using only clarinets but taking into account the harmonic structures of

the individual pitches produced in different dynamics (using mainly the lower pitches to produce

the aggregate). These experiments were not explored in enough depth to produce satisfying

results. In the first case, the lack of understanding of the flute spectrum produced undesired

inharmonic elements. In the second case, there was not enough independence of each partial

regarding gain, which made the result far less controllable.

To study the consonants' spectra, I recorded my voice saying diverse syllables, such as

apa, aba, opo, ola, and so on. Then, once I had compared some of my analyses with the descriptions

and analyses included in articles and electronic resources about phonemes, I proceeded to record

my voice saying the proposed sentence in Latin.11 Figure 9 shows the spectral analysis of the

whole sentence.

Figure 9. Speech analysis using Max/MSP (segments labeled: Ad sum lucem arte et laborum).

11 Ad sum Lucem Arte et Laborum


Figure 10 is another analysis of the sentence (same recording) using the Emu online software.12 I

only included part of the sentence here, since it was the section I was able to realize using the

pseudo-analog method.

Figure 10. Speech analysis using Emu (segment labeled: Ad sum lucem).


Spectrograms were considerably better than harmonic analysis for interpreting consonants. /l/

and /m/ have a relatively harmonic spectrum, and their synthesis was approached in the same way as

in the case of vowels. /d/, /s/, and /ch/, however, include much more noise and lack a clear

harmonic structure. /d/ is a stop phoneme, so the effect was achieved by suddenly cutting the

sound and allowing a softer, considerably less rich (in terms of partials) aggregate to appear right

after the stop, as the resonant segment of the /d/ sound. This is not a very accurate representation

of the phoneme, but it is close enough for the ear to interpret when put in a larger context. The /s/

sound, as can be observed, has an incredibly rich spectrum, consisting mainly of noise and

without a clear harmonic structure, and can be classified as a colored noise.13 As the average

listener cannot really differentiate between very similar colored noises, the task was to find a

percussive sound whose spectrum was also a colored noise with a similar appearance. The sound

12 See http://ips-lmu.github.io/EMU-webApp/

13 Joshua Fineberg, "Appendix 1: Guide to the Basic Concepts and Techniques of Spectral
Music," Contemporary Music Review 19, no. 2 (2000): 91.

of a hi-hat when opened by the foot mechanism was the most similar case among the available

sounds.14 An artificial envelope was applied to avoid the initial percussive hit of the hi-hat. A

similar approach was applied to the phoneme /ch/, but some high frequencies were added to

resemble its whistle-like characteristic. The result of the proposed approach can be heard at

https://soundcloud.com/camilo-ignacio-leal-molina/clar-ad-sum-lucem-1. Figure 11 shows a

spectrogram of the pseudo-acoustic synthesis realized with the Emu web app.
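
The artificial envelope used to hide the hi-hat attack can be illustrated with a simple fade-in applied to a list of samples. The shape of the envelope actually used is not specified above, so the linear ramp and the function name here are assumptions.

```python
def apply_fade_in(samples, fade_samples):
    """Scale the first `fade_samples` samples by a linear ramp rising from 0.

    The ramp approaches 1 at the end of the fade, after which the samples are
    left unchanged. A new list is returned; the input is not modified.
    """
    fade_samples = min(fade_samples, len(samples))
    out = list(samples)
    for i in range(fade_samples):
        out[i] *= i / fade_samples
    return out


# Toy signal: a constant level stands in for the noisy hi-hat samples.
print(apply_fade_in([1.0] * 8, 4))
```

With real audio, fade_samples would be chosen long enough to cover the percussive onset, for example a few hundred milliseconds' worth of samples.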

Figure 11. Spectrogram realized with Emu (segment labeled: Ad sum lucem).


Several elements were ignored in this study, mainly because of time constraints. Future

exploration should include the study of specific transitions between consonants and vowels and

between vowels and vowels, as there are specific harmonic structures and spectral elements that

would help to better interpret the phonemes. In addition, real acoustic synthesis should be

applied to understand the technical limitations of the proposed approach.

14 Fritts, Musical Instruments Samples


Bibliography

Engels, Friedrich, and Karl Marx. The Communist Manifesto. Kindle edition.
Fineberg, Joshua. 2000. "Appendix 1: Guide to the Basic Concepts and Techniques of Spectral
Music." Contemporary Music Review 19 (2): 81–113.
Fineberg, Joshua. 2000. "Spectral Music." Contemporary Music Review 19 (2): 1–5.
doi:10.1080/07494460000640221.
Fritts, Lawrence. 2016. University of Iowa Electronic Music Studios.
http://theremin.music.uiowa.edu/MIS.html.
Lemmetty, Sami. 1999. Review of Speech Synthesis Technology (Master's Thesis). Finland:
Helsinki University of Technology.
Murail, Tristan. 2005. "Villeneuve-lès-Avignon Conferences, Centre Acanthes, 9–11 and 13 July
1992." Contemporary Music Review 24 (2–3): 187–267.
doi:10.1080/07494460500154889.
Rose, François. 1996. "Introduction to the Pitch Organization of French Spectral Music."
Perspectives of New Music 34 (2): 6–39.
Wolfe, Joe. 2016. University of New South Wales School of Physics.
http://newt.phys.unsw.edu.au/jw/clarinetacoustics.html#pff.
Zen, Heiga, Keiichi Tokuda, and Alan W. Black. 2009. "Statistical Parametric Speech
Synthesis." Speech Communication 51 (11): 1039–1064.
doi:10.1016/j.specom.2009.04.004.