
VOICE QUALITY CONVERSION IN TD-PSOLA SPEECH SYNTHESIS

Xuejing Sun

Speech Acoustics Laboratory, Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL 60208, USA

ABSTRACT
The capability of producing different voice qualities is highly desirable in modern speech synthesis systems. Diphone-based synthesizers using TD-PSOLA can generate high quality synthetic speech. However, one drawback of such systems in comparison to the formant synthesizer or the LPC synthesizer is their inflexibility in Voice Quality Conversion (VQC). In this paper, I present a VQC method for the TD-PSOLA synthesis system. For vocal fry, the ST-signals are multiplied by Kaiser windows of alternating magnitude; for breathy voice, the ST-signals are first convolved with a one-pole filter, then combined with shaped noise signals, and finally multiplied by a Hanning window. All the windowed ST-signals are then overlap-added as in standard TD-PSOLA. The perceptual evaluation test shows that this method can generate the desired voice quality successfully. (The samples can be obtained through this WWW URL address: http://mel.speech.nwu.edu/sunxi/voc-abstract.htm)

1. INTRODUCTION
In a speech synthesis system, the ability to synthesize different voice qualities is not a trivial feature. It can make the synthetic speech both more natural and more expressive. Primarily through controlling the glottal volume velocity waveform, Klatt and Klatt (1990) [3] were able to synthesize different voice qualities in a formant synthesizer. Successful results have also been obtained with an LPC synthesizer [1][2]. These speech synthesis techniques directly employ source-filter theory and are inherently more flexible in terms of voice quality synthesis. However, the quality of their synthetic speech has generally been less satisfactory than that of PSOLA-based systems.

Recent developments in the waveform coding technique PSOLA [5], especially TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add), have greatly improved the quality of diphone-based speech synthesis systems. In the family of PSOLA techniques, TD-PSOLA is the most computationally efficient method and is one of the most popular synthesis techniques today. LP-PSOLA (Linear Predictive PSOLA) and FD-PSOLA (Frequency Domain PSOLA), though able to produce equivalent results, require much more computational power. In terms of synthesizing different voice qualities, however, TD-PSOLA has no inherent ability other than using pre-recorded diphones with the desired voice quality. Recording them is time-consuming and costs more storage, so this approach is not commonly adopted. An ideal solution is to plug in an extra module that converts normal voice to the desired voice quality when needed and that can be controlled through simple parameters.

In this paper, I present a Voice Quality Conversion (VQC) method, which gives the TD-PSOLA system the ability to generate different voice qualities (see Figure 1). It should be noted that VQC can also be realized in LP-PSOLA and FD-PSOLA systems. However, since those approaches are relatively computationally inefficient, they are not of interest here.

Figure 1. (a) Block diagram of vocal fry conversion. (b) Block diagram of breathy voice conversion.

A successful VQC also has potential use in voice gender conversion. Lawlor and Fagan (1999) [4] have shown that female-to-male voice conversion gives very good results, while male-to-female conversion is unsuccessful. Since female voices are in general more breathy than male voices, the voice quality conversion algorithm may be extended to voice gender conversion.

This paper is organized as follows. First, I describe the main acoustic features of different voice qualities. Second, I briefly review the TD-PSOLA technique. Then I describe the procedures for converting normal voice to different voice qualities. After this, I present the results of the perceptual evaluation test. Finally, I summarize the current study and discuss potential future research.

0-7803-6293-4/00/$10.00 © 2000 IEEE.

4. VOICE QUALITY CONVERSION


4.1 Converting modal voice to vocal fry
Since vocal fry is characterized by a highly damped speech waveform, a straightforward way to simulate it is to modify the shape of the ST-signal waveform by choosing a window that has a sharper slope. In light of this, the Kaiser window, which is defined by formulas (1)-(3), is a suitable choice. In order to get a sharper window, I use α = 10, which results in nearly -100 dB attenuation of the side lobes.
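For reference, a common textbook form of the Kaiser window is given below. This is only an assumed notation and may differ from the paper's own formulas: here N is the window length, β is the shape parameter (it plays the role of the parameter set to 10 above), and I_0 is the zeroth-order modified Bessel function of the first kind.

```latex
w(n) = \frac{I_0\!\left(\beta \sqrt{1 - \left(\dfrac{2n}{N-1} - 1\right)^{2}}\right)}{I_0(\beta)},
\qquad 0 \le n \le N-1,
\qquad
I_0(x) = \sum_{k=0}^{\infty} \left[ \frac{(x/2)^{k}}{k!} \right]^{2}
```

A larger shape parameter narrows the window in time, which is what makes each windowed ST-signal look more heavily damped.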

2. VOICE QUALITY FEATURES


Voice quality, although it has been studied for a long time, unfortunately has no generally accepted definition [2][3][8]. In the present study, voice quality refers to the perceptual impression along a dimension that ranges from pressed-rough through normal to breathy, namely vocal fry, modal voice, and breathy voice. Perceptually, vocal fry is characterized by a pressed and rough voice, whereas breathy voice is characterized by more aspiration. The acoustic characteristics have been studied extensively [2][3][8][9].

Vocal fry has the following characteristics:
• In terms of the glottal volume velocity waveform, vocal fry has a short open phase and a longer closed phase, hence a small Open Quotient (OQ).
• The slope of the glottal spectrum is -6 dB/octave, whereas in modal voice the slope is -12 dB/octave.
• The level of the turbulent noise component is low.
• The waveform of individual glottal cycles is highly damped.
• Vocal pulse cycles alternate in both amplitude and period.
• Vocal fry often occurs in low-pitched voice.

Breathy voice is characterized by:
• A large open quotient.
• A spectral slope of -18 dB/octave.
• A high level of turbulent noise, especially above 2 kHz.
• A lightly damped waveform of glottal cycles.

In a formant or LPC synthesizer, most of the above features can be controlled directly by adjustable parameters. In TD-PSOLA, it is impossible to manipulate all of these features, such as the open quotient. Fortunately, the acoustic features are not totally independent of each other. We can modify only some of them and still achieve the desired perceptual effects.

Increasing the attenuation of the side lobes widens the main lobe, which makes the original spectrum smoother and reduces the spectral slope to a small extent. This is acceptable for vocal fry, because vocal fry is characterized by a reduced slope of the glottal spectrum. Another side effect of this frequency interpolation process is formant bandwidth widening. Fortunately, this has no obvious perceptual consequence [6].

As mentioned earlier, an important perceptual characteristic of vocal fry is roughness. This can be achieved by using alternating pulse cycles. In real speech, such cycles often alternate in both amplitude and period. It has been found that the rough voice quality can be achieved by manipulating only one of these two aspects [8]. In the present study, therefore, only amplitude alternation is employed, which avoids pitch modification, since that is not of concern here.
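The vocal fry conversion described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name, the `weak_gain` value of 0.5, and the use of NumPy's `np.kaiser` (whose `beta` argument is assumed to correspond to the window parameter set to 10 in the text) are all assumptions.

```python
import numpy as np

def vocal_fry_frames(frames, beta=10.0, weak_gain=0.5):
    """Apply a sharp Kaiser window with alternating amplitude to ST-signals.

    frames    : list of 1-D arrays, one unwindowed ST-signal per pitch mark
    beta      : Kaiser shape parameter (assumed to match the paper's value of 10)
    weak_gain : hypothetical gain for every second cycle (amplitude alternation)
    """
    out = []
    for i, frame in enumerate(frames):
        frame = np.asarray(frame, dtype=float)
        # Strong cycle, weak cycle, strong cycle, ... to induce roughness.
        gain = 1.0 if i % 2 == 0 else weak_gain
        out.append(gain * np.kaiser(len(frame), beta) * frame)
    return out
```

The returned frames would then be overlap-added at the original pitch marks, since pitch modification is excluded from the study.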

3. TD-PSOLA SPEECH SYNTHESIS


TD-PSOLA employs a simple, computationally efficient algorithm that can generate high quality synthetic speech [5]. The first step in PSOLA synthesis is to perform pitch analysis on the speech waveform and generate so-called pitch marks. Then the speech waveform is broken into many short-term signals (ST-signals) by multiplying it by a sequence of pitch-synchronous analysis windows. The windows are of the raised-cosine type (such as the Hanning window). The proper length of the window is 2 to 4 times the local pitch period. To synthesize speech at a different pitch, the ST-signals are simply overlapped and added with the desired spacing between them.
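The analysis and overlap-add steps can be illustrated with a short sketch. It assumes a constant pitch period in samples and pitch marks that lie at least one period from either end of the signal. A real system would use the local period at each mark, and the function names here are hypothetical.

```python
import numpy as np

def extract_st_signals(x, pitch_marks, period):
    """Cut one Hann-windowed ST-signal of length 2*period around each pitch mark."""
    # Periodic Hann window: copies shifted by `period` samples sum to exactly 1.
    win = np.hanning(2 * period + 1)[:-1]
    return [x[m - period : m + period] * win for m in pitch_marks]

def overlap_add(st_signals, new_marks, period, length):
    """Re-centre each ST-signal at its new pitch mark and sum the overlaps."""
    y = np.zeros(length)
    for seg, m in zip(st_signals, new_marks):
        y[m - period : m + period] += seg
    return y
```

With the new marks equal to the original ones, the interior of the signal is reconstructed unchanged. Spacing the new marks farther apart lowers the pitch, and spacing them closer together raises it.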


Figure 3. Noise energy envelope shaping function.

A pitch-modulated square wave and a triangular-like envelope shaping function have been tried [2][7]. In the current study, I use a pitch-modulated cosine function (Figure 3).

3. Spectral tilt. This is achieved by convolving the ST-signal with the one-pole spectral shaping filter of formula (4).

Figure 2. (a) Original speech segment /a/ in a male voice. (b) Synthetic speech segment (vocal fry).

It should be noted that combining some pitch modification with the current scheme would give better perceptual effects, because vocal fry usually occurs at low pitch and alternating period cycles can also induce a roughness sensation. Nonetheless, for the purpose of comparison, pitch modification is excluded from the present study. Figure 2 shows a waveform segment of the original speech and of the synthesized speech, respectively.

In formula (4), the filter's single pole is real and lies inside the unit circle, usually close to it. The filter can produce a spectral slope of approximately -6 dB/octave.

4. Add the noise to the ST-signals one by one.
5. Multiply each "noisy" ST-signal by a Hanning window of the same length.
6. Center the windowed ST-signals at the new pitch marks, then perform the overlap-add procedure. Note that in the present study pitch modification is not of concern, so the new pitch marks are the same as the original ones.

Note that several parameters are adjustable, such as the noise level and the degree of spectral tilt. Their values vary with gender, age, and other individual characteristics. However, for a particular voice database in a speech synthesis system, the values can be constant.
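The spectral tilt, noise addition, and windowing steps can be sketched as below. The sketch rests on several assumptions: the one-pole filter is taken in the common form H(z) = 1 / (1 - a z^-1), the pole value a = 0.9 is illustrative, and the cosine envelope is assumed to peak at the frame center. None of these values come from the paper.

```python
import numpy as np

def one_pole(x, a=0.9):
    """Filter x with y[n] = x[n] + a * y[n-1], i.e. H(z) = 1 / (1 - a z^-1)."""
    y = np.zeros(len(x))
    prev = 0.0
    for n, v in enumerate(x):
        prev = v + a * prev
        y[n] = prev
    return y

def cosine_envelope(length, period):
    """Raised cosine with one cycle per pitch period, peaking at the frame centre."""
    n = np.arange(length) - length // 2
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * n / period))

def breathy_frame(frame, noise, period, a=0.9):
    """Tilt one ST-signal, add envelope-shaped noise, then apply a Hanning window."""
    tilted = one_pole(np.asarray(frame, dtype=float), a)
    noisy = tilted + cosine_envelope(len(frame), period) * np.asarray(noise, dtype=float)
    return noisy * np.hanning(len(frame))
```

Because this filter form has a low-frequency gain of 1 / (1 - a), a practical implementation would probably rescale the tilted frame before overlap-adding. The paper does not describe that detail, so it is left out here.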

4.2 Converting modal voice to breathy voice


Breathy voice is characterized by greater spectral tilt and more aspiration noise in the mid and high frequency regions. Although the waveform of breathy voice can often be observed to be less damped, this is not a reliable measure [2]. In the author's preliminary studies, simple time-domain modification of the waveform proved unsuccessful. In the present study, a frequency domain modification approach is adopted. The steps involved in the conversion procedure are as follows:

1. Split the speech into frames (ST-signals) with a rectangular window according to the pitch marks. Each frame is centered at the pitch peak, and the length of the frame is equal to twice the local pitch period.
2. Generate noise. First, the spectral envelope of the ST-signal is estimated by using autocorrelation LPC of order 12. Pre-emphasis is applied before the estimation. Then a unit-variance white Gaussian noise sequence is filtered by an LPC filter with the coefficients derived above. (In the Harmonic plus Noise Model (HNM) speech modification scheme [7], a lattice filter is used for a similar purpose. However, because of its computational cost, it is not adopted in the present algorithm.) The length of the noise is equal to that of the ST-signal. The gain of the LPC filter is adjusted through a parameter called A, which is related to the original gain derived from LPC by a certain ratio. The ratio is usually in the range [0.1, 0.4]. The noise is then band-pass filtered to 2-8 kHz. It has been shown that the temporal envelope of the noise signal is an important factor for the naturalness of the synthetic speech [2].
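The noise generation step can be sketched as follows, under stated assumptions. The pre-emphasis coefficient 0.97, the default ratio 0.25, and the FFT-masking band-pass are illustrative choices. The all-pole filter is written as a plain loop to keep the example dependency-free, and all function names are hypothetical.

```python
import numpy as np

def lpc_autocorr(frame, order=12, preemph=0.97):
    """Autocorrelation-method LPC; returns (a, gain) with a[0] == 1."""
    x = np.asarray(frame, dtype=float)
    x = np.append(x[0], x[1:] - preemph * x[:-1])        # pre-emphasis
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order] / len(x)
    r[0] *= 1.0 + 1e-9                                    # tiny regularisation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    coeffs = np.linalg.solve(R, r[1 : order + 1])         # predictor coefficients
    a = np.concatenate(([1.0], -coeffs))
    gain = np.sqrt(max(r[0] - coeffs @ r[1 : order + 1], 0.0))  # residual RMS
    return a, gain

def all_pole_filter(b0, a, x):
    """y[n] = b0 * x[n] - sum_k a[k] * y[n-k], for k = 1 .. len(a) - 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = b0 * x[n]
        for k in range(1, min(n, len(a) - 1) + 1):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def band_limit(x, fs, lo=2000.0, hi=8000.0):
    """Crude band-pass: zero all FFT bins outside [lo, hi] Hz."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[(f < lo) | (f > hi)] = 0.0
    return np.fft.irfft(X, n=len(x))

def make_aspiration_noise(frame, fs, ratio=0.25, rng=None):
    """White Gaussian noise shaped by the frame's LPC envelope, then band-limited."""
    rng = np.random.default_rng() if rng is None else rng
    a, gain = lpc_autocorr(frame)
    white = rng.standard_normal(len(frame))
    return band_limit(all_pole_filter(ratio * gain, a, white), fs)
```

The `ratio` argument stands in for the gain ratio of 0.1 to 0.4 mentioned above. The resulting noise frame has the same length as the ST-signal and would still need temporal envelope shaping before being added to it.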


Kaiser window; for breathy voice, the ST-signals are first convolved with a spectral-shaping filter, and then shaped noise signals are added. The results of the perceptual evaluation test indicate that the algorithm can effectively convert modal voice into the desired voice quality. Future research includes integrating the present VQC method into a TD-PSOLA speech synthesis system and conducting more comprehensive perceptual evaluation experiments to test quality and naturalness. It is also hoped that the method can be applied to voice gender conversion, as mentioned in the Introduction.

Figure 4. (a) Spectrum of the original vowel /a/ in a female voice. (b) Spectrum of the converted breathy voice of the same vowel. (Horizontal axis: frequency in Hz.)

7. REFERENCES
[1] Childers, D.G. Glottal source modeling for voice conversion. Speech Communication, 16(2): 127-138, 1995.
[2] Childers, D.G., and Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. Journal of the Acoustical Society of America, 90(5): 2394-2410, 1991.
[3] Klatt, D.H., and Klatt, L.C. Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87(2): 820-856, 1990.
[4] Lawlor, B., and Fagan, A.D. A Novel Efficient Algorithm for Voice Gender Conversion. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, August 1999, Vol. 1, pages 77-80.
[5] Moulines, E., and Charpentier, F. Pitch-Synchronous Waveform Processing Techniques for Text-To-Speech Synthesis Using Diphones. Speech Communication, 9: 453-467, 1990.
[6] Moulines, E., and Laroche, J. Non-parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication, 16(2): 175-205, 1995.
[7] Stylianou, I. Harmonic plus Noise Models for Speech, combined with Statistical Methods, for Speech and Speaker Modification. Ph.D. Thesis, ENST-Telecom Paris, 1996.
[8] Titze, I.R. Principles of Voice Production. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1994.
[9] Wendahl, R.W. Laryngeal analog synthesis of harsh voice quality. Folia Phoniatrica, 15: 241-250, 1963.

5. PERCEPTION TEST RESULTS


To evaluate the performance of the VQC algorithm, a formal perception test was conducted. First, the vowels /i/ and /a/ were recorded in modal voice by a male and a female native English speaker. The samples were judged to be in modal voice by two professors of Speech and Language Pathology who are familiar with different voice qualities. Then the modal voices were converted to vocal fry and breathy voice using the algorithm described above.

In the experiment, the subject was asked to judge the quality of a synthesized vowel (the target vowel) as compared with a reference vowel, i.e., whether the target vowel sounded relatively breathy, the same, or pressed and rough (vocal fry). The reference was the originally recorded vowel. Subjects were given four response categories: Vocal Fry (pressed and rough), Same quality, Breathy, and Other quality. The synthesized vowels and the original speech were mixed, and each had three repetitions. The stimuli were presented to the subject in random order.

Seven subjects participated in the experiment. None of them had previous knowledge of vocal fry or breathy voice, except for one professor of Speech who is very familiar with different voice qualities. Vocal fry and breathy voice quality were explained to the naive subjects before the experiment began.

The results show that over 80% of the synthetic vocal fry and breathy voice stimuli were correctly identified. The professor classified all of the synthetic speech into the expected categories. The naive subjects sometimes made errors, probably because they were not familiar with the different types of voice quality. Overall the results are satisfactory, and all subjects reported that the synthesized vowels sounded like the reference vowels except for voice quality. The experiment tested only two vowels. More extensive tests that use words and sentences are needed for further perceptual evaluation in the future.

6. SUMMARY
In this study, a voice quality conversion algorithm within TD-PSOLA was formulated and tested. The goal is to enable a TD-PSOLA speech synthesis system to produce the desired voice quality as needed. For vocal fry, the ST-signals are modified with the
