Pitch Contour Extraction Using Time-Frequency Representations

Renny E. Badra, Trina Adrián de Pérez, Calogero Bruscianelli and Aquiles Viloria
Grupo de Procesamiento de Señales - Departamento de Electrónica y Circuitos
Universidad Simón Bolívar - Apartado 89000, Caracas 1080A - Venezuela
U.S. Mailing Address: BAMCO CCS 144-00, PO BOX 025322, Miami FL 33102-5322
Phone: (+582) 9063630-9063651  Fax: (+582) 9063631  E-mail: renny@usb.ve

Abstract.- This work introduces a new method of continuous pitch frequency determination built on the use of Time-Frequency Representations (TFR). It offers better time resolution than methods that only provide average pitch results, and it compares favorably with epoch extraction methods in terms of computational efficiency. The proposed TFR uses a kernel function that successfully suppresses cross-terms, producing a time-frequency function that clearly displays harmonic pitch information as well as voiced-unvoiced transitions. The paper also includes a preliminary evaluation of the method's performance and compares it with a standard algorithm, using synthetic and real speech input signals.

I. INTRODUCTION

Although most existing pitch detection methods provide only average results within a fixed-length frame [1], individual pitch period determination (epoch extraction, [2,8]) is of capital importance in obtaining the micro- and macro-melodies, which are key elements in speaker and phrase recognition systems. The method proposed in this paper is based on a version of the Time-Frequency Representation that employs a signal-derived kernel function. The new algorithm produces smooth pitch contours that reflect pitch variations on a sample-by-sample basis, with high time resolution. They are therefore well suited for micro- and macro-melody determination. Moreover, voiced/unvoiced detection can also be accomplished. Results presented in this article are compared with a standard average pitch measurement method for validation.

This paper is organized as follows: section II presents Time-Frequency Representations and signal-dependent kernel functions; section III introduces and describes the new method; section IV shows the results of tests made with synthetic and natural speech signals and their validation; and conclusions are found in section V.

II. SIGNAL-DEPENDENT TIME-FREQUENCY REPRESENTATIONS

The generalized Time-Frequency Distribution (TFD) is defined as the bi-dimensional Fourier transform of the product φ(u,τ)A(u,τ), where A(u,τ) is the so-called ambiguity function of the signal s(t), defined as

    A(u,τ) = ∫ s(t + τ/2) s*(t - τ/2) e^{jut} dt,        (1)

and φ(u,τ) is the transformation kernel. Several commonly employed TFDs (Wigner, Page, Sinc, Choi-Williams, etc.) result from using different kernel functions. Because of the relationship between A(u,τ) and the resulting TFD, any "filtering" action tending to minimize undesired components in the ambiguity function plane (u-τ) will produce a similar effect on the t-f plane. However, some of the criteria employed in selecting the kernel yield results that may not satisfy some of the mathematical properties of a TFD. For this reason the word "representation" is used instead of "distribution".
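As an aside not present in the original paper, the following Python/NumPy sketch shows one common way to compute a discrete counterpart of the ambiguity function in (1). It uses integer lags s[t+τ]s*[t-τ] rather than the half-sample lags of (1), and the function name and axis conventions are our own.

```python
import numpy as np

def ambiguity_function(s):
    """Discrete approximation of the ambiguity function A(u, tau).

    For each lag tau, the instantaneous autocorrelation
    r[t, tau] = s[t + tau] * conj(s[t - tau]) is formed (zero outside the
    signal support) and Fourier-transformed along the time axis t, which
    maps t to the Doppler-like variable u.  Returns an N x N matrix with
    u along the rows (centred by fftshift) and tau along the columns.
    """
    s = np.asarray(s, dtype=complex)
    N = len(s)
    lags = np.arange(-(N // 2), N // 2)            # symmetric lag axis
    r = np.zeros((N, len(lags)), dtype=complex)    # rows: time t, columns: lag tau
    t = np.arange(N)
    for j, tau in enumerate(lags):
        t1, t2 = t + tau, t - tau
        valid = (t1 >= 0) & (t1 < N) & (t2 >= 0) & (t2 < N)
        r[valid, j] = s[t1[valid]] * np.conj(s[t2[valid]])
    A = np.fft.fftshift(np.fft.fft(r, axis=0), axes=0)   # transform t -> u
    return A, lags
```

For a pure sinusoid, the energy of this matrix concentrates along the u = 0 row (the τ axis), which is precisely the property exploited by the kernel choice discussed below.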
Among the undesired components that contaminate TFDs and TFRs, the so-called cross-terms are particularly confusing when trying to obtain information from the t-f plane. These are the result of the combination of two or more auto-terms (the actual components of the signal) by means of the non-linear operations involved in the calculation of the representation. Their suppression or minimization is a goal to be achieved by selecting an appropriate kernel.

Fig. 1. (a) Wigner TFD (kernel φ(u,τ) = 1) of a female voiced section having a pitch frequency of about 250 Hz. It is easy to verify the presence of cross-terms above and below the fundamental pitch trajectory. (b) TFR obtained using a kernel that equals 1 in a rectangular zone around the τ axis, and zero elsewhere. Cross-terms are successfully cancelled.

Several variations of Time-Frequency Representations (TFRs) have been used in time-frequency analysis [3]. However, it has been shown that a fixed kernel function produces useful results only for a limited family of signals, which means that a universal optimum kernel does not exist. Research efforts have recently been directed towards the design of signal-dependent optimum TFR kernels [4]. The goal is to find a kernel that enhances auto-terms (the actual components of the signal) and suppresses cross-terms for a given set of signals. A recent work [5] proposes a general procedure to obtain the optimum kernel for a given set of signals, which should allow good TFR performance. However, it involves a heavy optimization procedure and is highly parameter-dependent.

It can be shown [6] that the auto-terms of any periodic function map onto the τ axis of the ambiguity function plane, while its cross-terms map outside this line. Still, the structure of the ambiguity function of quasi-periodic signals (such as voiced speech) cannot be easily predicted. However, it is reasonable to expect that, for quasi-periodic signals with slowly varying frequencies, most of the auto-term energy will lie close to the τ axis of the ambiguity function plane. Voiced speech signals can be considered quasi-periodic on a short-time basis. Therefore, it can be postulated that most of the energy of their auto-terms is concentrated in the neighborhood of the τ axis. It can then be predicted, with a reasonable hope of success, that a kernel that rejects components located away from the τ axis will reduce the cross-terms, producing a "cleaner" TFR. Fig. 1 depicts the nature of cross-terms in the TFD of a voiced speech segment and how the selection of an appropriate kernel suppresses them. Moreover, since the energy of random white signals is generally spread all over the ambiguity function plane, it can be expected that such a selective kernel will also perform an interesting noise rejection function.

Fig. 2. Block Diagram of the TFR-based Pitch Contour Extraction Method (0-900 Hz lowpass filter, decimation, raised-cosine window, analytic signal construction, N x N ambiguity function, kernel masking, 2-D FFT yielding the TFR R(t,f), peak picking).

III. THE TFR-BASED PITCH CONTOUR EXTRACTION ALGORITHM

Fig. 2 shows the block diagram of the proposed method. After being lowpass-filtered, the 8 kHz-sampled speech segment (4N samples) is decimated to N samples to reduce the computational load. Then, a raised-cosine window with a rise time of N/8 samples is applied to compensate for segmentation effects.
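As an illustration (not part of the original paper), a minimal sketch of this front end in Python/SciPy follows. Only the 0-900 Hz band, the 4N-to-N decimation and the N/8 rise time are taken from the text; the filter order and the use of scipy.signal.decimate are our own assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess_block(x, fs=8000, N=2048):
    """Front end of Fig. 2: lowpass-filter a 4N-sample, 8 kHz speech block,
    decimate it to N samples and apply a raised-cosine window whose rise
    time is N/8 samples (the filter order is an illustrative choice)."""
    b, a = butter(6, 900.0 / (fs / 2.0))      # 0-900 Hz lowpass
    x = filtfilt(b, a, x)
    q = len(x) // N                           # decimation factor (4 for a 4N block)
    x = decimate(x, q)[:N]                    # 4N samples -> N samples
    # Raised-cosine window: unity in the middle, cosine tapers of N/8 samples.
    rise = N // 8
    w = np.ones(N)
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(rise) / rise))
    w[:rise] = ramp
    w[-rise:] = ramp[::-1]
    return x * w
```

A 4N-sample block x would then be processed as s = preprocess_block(x) before the analytic-signal and ambiguity-function steps described next.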
An analytic version of this signal is constructed:

    s̃(t) = s(t) + j s_H(t),        (2)

where s_H(t) is the Hilbert transform of s(t). This procedure, commonly used in time-frequency analysis, suppresses negative frequencies from the representation, thus reducing the number of cross-terms and doubling the frequency resolution. The ambiguity function of the analytic signal is calculated using (1). Then, the kernel function is obtained through

    φ(u,τ) = |A(u,τ)| Π_K(u),        (3)

where Π_K(u) is a rectangular pulse of width K centered on the τ axis. This choice, although somewhat empirical, is motivated by two reasons: (i) the rectangular mask selects the components that lie in the vicinity of the τ axis, which is in agreement with the presumed location of the auto-terms, as stated in Section II; (ii) the modulus of the ambiguity function enhances the zones where the signal components actually lie. The width of the rectangular pulse (K) is set to 1/64 of the length of the decimated sequence.

The bi-dimensional Discrete-Time Fourier Transform of the product φ(u,τ)A(u,τ) is the desired TFR, R(t,f), of the speech segment. The pitch contour Fp(t) is then obtained by following the trajectory corresponding to the first harmonic of the pitch. This is achieved through a peak-picking algorithm that performs the following steps:

a) it selects a time starting point TSTART based on a block estimate of the signal power;
b) it picks the lowest significant local maximum FSTART in the 40-400 Hz range (time being fixed at TSTART);
c) it picks successive frequency maxima in both directions of the time axis, within a range of ±8 Hz relative to the last (adjacent) maximum picked;
d) it stops at the end of the segment, or when the amplitude of the next maximum is less than 25% of the greatest maximum picked; it then labels the remainder of the segment in that direction as unvoiced speech.

The uni-dimensional function obtained, Fp(t), is the desired pitch contour of the processed block. The analysis frame is then shifted 3N/4 decimated samples to the right to ensure that all sections of the signal are processed without the attenuating effects of the raised-cosine window, and the process is repeated. The frequency resolution of the method, Δf, is related to the length of the input segment in the following manner:

    Δf = fs / (4N),        (4)

where fs is the sampling frequency. As can be seen, N must be kept constant to provide the same frequency resolution for every input block.

Computationally, the method may look quite heavy, mostly because of the bi-dimensional Fourier transform that has to be carried out (it involves 2N FFTs for each N-length decimated speech section). However, the rectangular shape of the selected kernel dramatically reduces the number of computations because of the forced zeros in the input matrix φ(u,τ)A(u,τ): in fact, only K full FFTs are required, plus N "reduced" FFTs in which there are only K non-zero input samples (recall that K is the width, in samples, of the rectangular kernel). In addition, output frequencies in the 40-400 Hz range are the only ones needed to perform the peak-picking process, which reduces the processing load even further. This makes the method computationally viable when compared to most existing epoch extraction schemes [2,8], although a more precise evaluation of this feature is yet to be made.

The robustness of the algorithm depends upon the procedure selected for the extraction of the contour Fp(t) from the representation R(t,f). The described peak-picking algorithm performs well for a wide variety of examples, including male and female speech alike.
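To make the kernel masking, the 2-D transform and the peak-picking steps (a)-(d) concrete, here is a hedged Python/NumPy sketch, not taken from the paper. It reuses the `ambiguity_function` sketch given in Section II, treats axis bookkeeping (lag spacing, fftshift conventions, the doubled resolution of the analytic signal) loosely, and replaces the "lowest significant maximum" and signal-power starting rules with simple argmax heuristics; only the 40-400 Hz band, the ±8 Hz tracking window, the 25% stopping threshold and the K = N/64 kernel width are the values quoted in the text.

```python
import numpy as np
from scipy.signal import hilbert

def tfr_pitch_contour(x, fs_dec=2000.0, K=None):
    """Sketch of the TFR-based contour extractor of Section III.

    The kernel is |A(u,tau)| inside a band of K rows around the tau axis
    (Eq. 3); the 2-D FFT of phi*A is taken as the representation R(t, f);
    the first-harmonic ridge is then tracked in both time directions.
    """
    s = hilbert(np.real(np.asarray(x, dtype=float)))   # analytic signal, Eq. (2)
    N = len(s)
    K = K if K is not None else max(2, N // 64)        # kernel width ~ N/64

    A, _ = ambiguity_function(s)                       # (u, tau), u along rows
    phi = np.zeros(A.shape)
    mid = A.shape[0] // 2                              # row of u = 0 after fftshift
    phi[mid - K // 2: mid + K // 2 + 1, :] = 1.0       # rectangular pulse in u
    phi *= np.abs(A)                                   # signal-dependent factor, Eq. (3)

    R = np.abs(np.fft.fft2(phi * A))                   # 2-D FFT -> t-f plane (loose axes)
    freqs = np.abs(np.fft.fftfreq(N, d=1.0 / fs_dec))  # nominal frequency of each column
    band = np.where((freqs >= 40.0) & (freqs <= 400.0))[0]

    # (a) starting time: frame with the largest in-band energy (illustrative rule).
    t0 = int(np.argmax(R[:, band].sum(axis=1)))
    # (b) starting frequency: strongest in-band peak at t0 (the paper picks the
    #     lowest significant local maximum instead).
    f_idx = int(band[np.argmax(R[t0, band])])
    contour = np.full(N, np.nan)                       # NaN marks unvoiced samples
    contour[t0] = freqs[f_idx]
    strongest = R[t0, f_idx]

    # (c)-(d) track adjacent maxima within +/- 8 Hz of the last one, in both
    # directions, stopping when the peak drops below 25 % of the strongest
    # maximum picked (the rest of the segment is labelled unvoiced).
    for step in (1, -1):
        idx, t = f_idx, t0 + step
        while 0 <= t < N:
            near = np.where(np.abs(freqs - freqs[idx]) <= 8.0)[0]
            cand = int(near[np.argmax(R[t, near])])
            if R[t, cand] < 0.25 * strongest:
                break
            strongest = max(strongest, R[t, cand])
            contour[t] = freqs[cand]
            idx, t = cand, t + step
    return contour
```

Note that the sketch computes the full 2-D FFT for clarity; as argued above, a real implementation would exploit the forced zeros of the masked matrix and evaluate only the 40-400 Hz outputs.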
IV. PERFORMANCE OF THE TFR-BASED PITCH CONTOUR EXTRACTION ALGORITHM

The performance of the algorithm was first tested using artificially generated voiced speech. A 3-pole, 1-zero system was fed with pulses at a variable rate, i.e., a frequency-modulated pulse train. The input pulses were previously shaped according to estimates of the glottal excitation during voiced phonemes [7]. Pole and zero frequencies and bandwidths were set to their average values for the English vowels /a/, /i/ and /u/. The results obtained for the corresponding pitch contours are displayed in Fig. 3, along with the theoretical (preset) values. These theoretical values are simply the reciprocals of the preset individual periods of the synthetic speech segments. The average of the absolute errors in the obtained contour Fp(t) is about 4% in each of the examples shown. Note that the error is greater at the extremes.

Fig. 3. Theoretical (preset) and estimated (using the proposed method) pitch contours (in Hz) for the synthetic speech experiment.
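By way of illustration (not from the paper), the sketch below generates such a test signal in Python/SciPy. The Rosenberg-style glottal pulse and the single /a/-like resonance are stand-ins for the glottal-excitation estimates of [7] and the vowel pole/zero values actually used; only the 3-pole, 1-zero structure and the frequency-modulated pulse train are taken from the text.

```python
import numpy as np
from scipy.signal import lfilter

def synthetic_voiced(f0_contour, fs=8000):
    """Generate a synthetic voiced test signal: a frequency-modulated glottal
    pulse train driving a simple 3-pole, 1-zero vocal-tract filter.

    `f0_contour` gives the desired pitch (Hz) at every output sample."""
    n = len(f0_contour)
    phase = np.cumsum(np.asarray(f0_contour, dtype=float) / fs)
    pulses = np.zeros(n)
    pulses[np.where(np.diff(np.floor(phase)) > 0)[0] + 1] = 1.0   # one pulse per period

    # Rosenberg-style glottal pulse (opening/closing phases), about 4 ms long.
    L = int(0.004 * fs)
    t = np.arange(L) / L
    g = np.where(t < 0.6, 0.5 * (1 - np.cos(np.pi * t / 0.6)),
                 np.cos(0.5 * np.pi * (t - 0.6) / 0.4))
    excitation = np.convolve(pulses, g)[:n]

    # 3-pole, 1-zero "vocal tract": one resonance pair plus a real pole and a zero.
    f_res, bw = 700.0, 100.0                       # illustrative /a/-like resonance
    r = np.exp(-np.pi * bw / fs)
    a = np.convolve([1.0, -2 * r * np.cos(2 * np.pi * f_res / fs), r * r],
                    [1.0, -0.9])                   # poles: complex pair + real pole
    b = [1.0, -0.95]                               # single zero
    return lfilter(b, a, excitation)
```

Running the contour extractor on such a signal and comparing Fp(t) with the reciprocal of the preset periods reproduces the kind of comparison shown in Fig. 3.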
To illustrate the operation of the new method on real speech, an experiment performed using a 1.1-second Spanish utterance as a test signal is now presented. The input block length N was set to 2048 samples. The results are shown in Fig. 4, along with those obtained using the Center-Clipping Autocorrelation Method for average pitch determination [1], which have been included for comparison. The mean value of the difference between the contours obtained by the two methods is virtually zero, with a standard deviation of about 4 Hz. Again, the accuracy of the results is somewhat decreased at the extremes of each input block, although this can be neutralized by performing some type of averaging over the overlapping sections of adjacent blocks. A more detailed analysis reveals that this behavior improves when K is increased, although doing so also slightly reinforces the cross-terms.

Fig. 4. (a) Input utterance. (b) Pitch contours obtained using the proposed method (each trace corresponds to a different input segment). Note that some of the traces are shorter, which indicates that unvoiced speech has been detected at one (or both) of the extremes of the segment. (c) Pitch contour obtained using the Center-Clipping Autocorrelation method (shown for comparison).

V. CONCLUSIONS

A new algorithm for pitch contour extraction of speech signals has been introduced and explained. It is based on a Time-Frequency Representation that features a signal-dependent kernel specially designed to minimize cross-terms and facilitate the extraction of the contour from the t-f plane. The tests made show that the method works well for a wide variety of signals, producing pitch contours that are smooth, precise and immune to noise. It also allows for voiced/unvoiced discrimination.

ACKNOWLEDGEMENT

This work was co-sponsored by Project BID-Conicit E-18 (New Technologies Program) and Universidad Simón Bolívar.

REFERENCES

[1] L. Rabiner, M. Cheng, A. Rosenberg and C. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Trans. on Acoust., Speech, and Signal Proc., vol. ASSP-24, no. 5, Oct. 1976, pp. 399-418.
[2] C. Ma, Y. Kamp and L. F. Willems, "A Frobenius Norm Approach to Glottal Closure Detection from the Speech Signal," IEEE Trans. Speech and Audio Proc., vol. 2, no. 2, Apr. 1994, pp. 258-265.
[3] L. Cohen, "Time-Frequency Distributions - A Review," Proc. IEEE, vol. 77, no. 7, Jul. 1989, pp. 941-981.
[4] D. L. Jones and T. W. Parks, "A High-Resolution Data-Adaptive Time-Frequency Representation," IEEE Trans. Acoust., Speech, and Signal Proc., vol. ASSP-38, no. 12, Dec. 1990, pp. 2127-2135.
[5] R. Baraniuk and D. Jones, "A Signal-Dependent Time-Frequency Representation: Optimal Kernel Design," IEEE Trans. on Signal Processing, vol. 41, no. 4, Apr. 1993, pp. 1589-1601.
[6] T. Adrián de Pérez, J. Restrepo and L. M. Díaz, "Optimum time-frequency representations of monocomponent signal combinations," Signal Processing, vol. 38, no. 1, 1994, pp. 187-195.
[7] M. Matausek and V. Batalov, "A New Approach to the Determination of the Glottal Waveform," IEEE Trans. on Acoust., Speech, and Signal Proc., vol. ASSP-28, no. 6, Jun. 1980.
[8] G. González, R. E. Badra, R. Medina and J. Regidor, "Period Estimation Using Minimum Entropy Deconvolution," Signal Processing, vol. 41, no. 1, Jan. 1995.
