

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-25, NO. 1, FEBRUARY 1977

On the Use of Autocorrelation Analysis for Pitch Detection


LAWRENCE R. RABINER,
FELLOW, IEEE

Abstract-One of the most time honored methods of detecting pitch is to use some type of autocorrelation analysis on speech which has been appropriately preprocessed. The goal of the speech preprocessing in most systems is to whiten, or spectrally flatten, the signal so as to eliminate the effects of the vocal tract spectrum on the detailed shape of the resulting autocorrelation function. The purpose of this paper is to present some results on several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal. The types of nonlinearities which are considered are classified by a nonlinear input-output quantizer characteristic. By appropriate adjustment of the quantizer threshold levels, both the ordinary (linear) autocorrelation analysis, and the center clipping-peak clipping autocorrelation of Dubnowski et al. [1] can be obtained. Results are presented to demonstrate the degree of spectrum flattening obtained using these methods. Each of the proposed methods was tested on several of the utterances used in a recent pitch detector comparison study by Rabiner et al. [2]. Results of this comparison are included in this paper. One final topic which is discussed in this paper is an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis.

I. INTRODUCTION

Although a large number of different methods have been proposed for detecting pitch, the autocorrelation pitch detector is still one of the most robust and reliable of pitch detectors [2]. There are several reasons why autocorrelation methods for pitch detection have generally met with good success. The autocorrelation computation is made directly on the waveform and is a fairly straightforward (albeit time consuming) computation. Although a high processing rate is required, the autocorrelation computation is simply amenable to digital hardware implementation, generally requiring only a single multiplier and an accumulator as the computational elements. Finally, the autocorrelation computation is largely phase insensitive.¹ Thus, it is a good method to use to detect pitch of speech which has been transmitted over a telephone line, or has suffered some degree of phase distortion via transmission.

Although an autocorrelation pitch detector has some advantages for pitch detection, there are several problems associated with the use of this method. Although the autocorrelation function of a section of voiced speech generally displays a fairly prominent peak at the pitch period, autocorrelation peaks due to the detailed formant structure of the signal are also often present. Thus, one problem is to decide which of several autocorrelation peaks corresponds to the pitch period. Another problem with the autocorrelation computation is the required use of a window for computing the short-time autocorrelation function. The use of a window for analysis leads to at least two difficulties. First there is the problem of choosing an appropriate window. Second there is the problem that (for a stationary analysis²) no matter which window is selected, the effect of the window is to taper the autocorrelation function smoothly to 0 as the autocorrelation index increases. This effect tends to compound the difficulties mentioned above in which formant peaks in the autocorrelation function (which occur at lower indices than the pitch period peak) tend to be of greater magnitude than the peak due to the pitch period. A final difficulty with the autocorrelation computation is the problem of choosing an appropriate analysis frame (window) size. The ideal analysis frame should contain from 2 to 3 complete pitch periods. Thus, for high pitch speakers the analysis frame should be short (5-20 ms), whereas for low pitched speakers it should be long (20-50 ms).

A wide variety of solutions have been proposed to the above problems. To partially eliminate the effects of the higher formant structure on the autocorrelation function, most methods use a sharp cutoff low-pass filter with cutoff around 900 Hz. This will, in general, preserve a sufficient number of pitch harmonics for accurate pitch detection, but will eliminate the second and higher formants. In addition to linear filtering to remove the formant structure, a wide variety of methods have been proposed for directly or indirectly spectrally flattening the speech signal to remove the effects of the first formant [3]-[5], [1]. Included among these techniques are center clipping and spectral equalization by filter bank methods [3], inverse filtering using linear prediction methods [4], spectral flattening by linear prediction and a Newton transformation [5], and spectral flattening by a combination of center and peak clipping methods [1].

Each of these methods has met with some degree of success; however, problems still remain. It is the purpose of this paper to investigate the properties of a class of nonlinearities applied to the speech signal prior to autocorrelation analysis with the purpose of spectrally flattening the signal. Also a solution to the problem of choosing an analysis frame size which adapts to the estimated average pitch of the speaker will be presented.

Manuscript received April 4, 1976; revised August 16, 1976.
The author is with the Bell Laboratories, Murray Hill, NJ 07974.
¹ In the limit of exactly periodic signals, or for an infinite correlation function it is exactly phase insensitive.
² A stationary analysis is one for which the same set of input samples is used in computing all the points of the autocorrelation function. A nonstationary analysis is impractical for pitch detection because of the large number of autocorrelation points involved in the computation.

The organization of this paper is as follows. In Section II we review the theory of short-time autocorrelation analysis and present the various types of nonlinearities to be investigated for spectrally flattening the speech. Examples of signal spectra

RABINER: AUTOCORRELATION ANALYSIS FOR PITCH DETECTION


obtained with the nonlinearities being used will be given in this section. In Section III the results of a limited but formal evaluation of each of the nonlinear autocorrelation analyses are given. Several of the test utterances used in [2] are used in this test for comparison purposes. In Section IV we discuss a simple algorithm for adapting the frame size of the analysis based on the estimated average pitch period for the speaker, and present results on how well it worked on several test examples.

II. SHORT-TIME AUTOCORRELATION ANALYSIS

Given a discrete time signal x(n), defined for all n, the autocorrelation function is generally defined as

$$\phi_x(m) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\, x(n+m). \tag{1}$$

[Fig. 1 diagram: s(n) -> LPF (0-900 Hz) -> nonlinearities NL1 and NL2 (the speech preprocessor) -> x1(n), x2(n) -> correlator.]

Fig. 1. Block diagram of the nonlinear correlation processing.

The autocorrelation function of a signal is basically a (noninvertible) transformation of the signal which is useful for displaying structure in the waveform. Thus, for pitch detection, if we assume x(n) is exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easily shown that

$$\phi_x(m) = \phi_x(m + P), \tag{2}$$

i.e., the autocorrelation is also periodic with the same period. Conversely, periodicity in the autocorrelation function indicates periodicity in the signal.

For a nonstationary signal, such as speech, the concept of a long-time autocorrelation measurement as given by (1) is not really meaningful. Thus, it is reasonable to define a short-time autocorrelation function, which operates on short segments of the signal, as

$$\phi_\ell(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n+\ell)\, w(n)]\, [x(n+\ell+m)\, w(n+m)], \qquad 0 \le m \le M_0 - 1 \tag{3}$$

where w(n) is an appropriate window for analysis, N is the section length being analyzed, $N'$ is the number of signal samples used in the computation of $\phi_\ell(m)$, $M_0$ is the number of autocorrelation points to be computed, and $\ell$ is the index of the starting sample of the frame. For pitch detection applications $N'$ is generally set to the value

$$N' = N - m \tag{4}$$

so that only the N samples in the analysis frame (i.e., $x(\ell), x(\ell+1), \ldots, x(\ell+N-1)$) are used in the autocorrelation computation. Values of 200 and 300 have generally been used for $M_0$ and N, respectively [1], corresponding to a maximum pitch period of 20 ms (200 samples at a 10 kHz sampling rate) and a 30 ms analysis frame size. As will be discussed later, a rectangular window (i.e., w(n) = 1 for 0 ≤ n ≤ N - 1, w(n) = 0 elsewhere) is used for all the computations to be described in this paper.

To reduce the effects of the formant structure on the detailed shape of the short-time autocorrelation function, two preprocessing functions were used prior to the autocorrelation computation of (3), as discussed in Section I. Fig. 1 shows a block diagram of the processing which was used. The speech signal s(n) is first low-pass filtered by an FIR, linear phase, digital filter with a passband of 0 to 900 Hz, and a stopband beginning at 1700 Hz.³ The output of the low-pass filter is then used as input to two nonlinear processors, labeled NL1 and NL2 in Fig. 1. The nonlinearities used in each path may or may not be identical. The types of nonlinearities which were investigated were various center clippers, and peak clippers. Based on earlier works [3], [1] it has been shown that such nonlinearities can provide a fairly high degree of spectral flattening, and are computationally quite efficient to implement [1]. Additionally, the capability of correlating two nonlinearly processed versions of the same signal provides a useful degree of flexibility into the system. It has also been argued that such a correlation will be most appropriate in a variety of actual situations in pitch detection.⁴

Three types of nonlinearity have been considered. They are classified according to their input-output quantization characteristic in the following way. The first type of nonlinearity is a compressed center clipper whose output y(n) obeys the relation (with x(n) as input)⁵

$$y(n) = \mathrm{clc}[x(n)] = \begin{cases} x(n) - C_L, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ x(n) + C_L, & x(n) \le -C_L \end{cases} \tag{5}$$

where $C_L$ is the clipping threshold. The second nonlinearity is a simple center clipper with the input-output relation⁶

$$y(n) = \mathrm{clp}[x(n)] = \begin{cases} x(n), & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ x(n), & x(n) \le -C_L \end{cases} \tag{6}$$

Finally, the third nonlinearity is the combination center and peak clipper with the input-output relation⁷

$$y(n) = \mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ -1, & x(n) \le -C_L \end{cases} \tag{7}$$

³ The filter had an impulse response duration of 25 samples. The filter passband was flat to within ±0.03, and the stopband response was down at least 50 dB.
⁴ M. Sondhi, personal communication.
⁵ The function clc[x] stands for clip and compress x.
⁶ The function clp[x] stands for clip x.
⁷ The function sgn[x] stands for the sign of x.
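The three quantizers of (5)-(7) and the short-time correlation of (3)-(4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function names, the toy pulse-train frame, the fixed threshold of 0.5, and the lag search range of 20 to 199 are all assumptions made for the example, and a rectangular window is assumed as in the text.

```python
def clc(x, c_l):
    """Compressed center clipper, eq. (5): keep only the excess over the threshold."""
    return [v - c_l if v >= c_l else v + c_l if v <= -c_l else 0.0 for v in x]


def clp(x, c_l):
    """Simple center clipper, eq. (6): pass samples whose magnitude reaches c_l."""
    return [v if abs(v) >= c_l else 0.0 for v in x]


def sgn(x, c_l):
    """Combined center and peak clipper, eq. (7): three-level output."""
    return [1.0 if v >= c_l else -1.0 if v <= -c_l else 0.0 for v in x]


def short_time_correlation(x1, x2, n_lags):
    """Eq. (3) with a rectangular window and N' = N - m, eq. (4).

    x1 and x2 are two (possibly differently) preprocessed versions of the
    same N-sample frame; the result holds phi(m) for m = 0 .. n_lags - 1.
    """
    n = len(x1)
    return [
        sum(x1[i] * x2[i + m] for i in range(n - m)) / n
        for m in range(n_lags)
    ]


# Toy frame (an assumption, not real speech): unit pulses every 80 samples
# in a 300-sample frame, i.e. a "pitch period" of 80 samples.
frame = [1.0 if i % 80 == 0 else 0.0 for i in range(300)]
x1 = clc(frame, 0.5)
phi = short_time_correlation(x1, x1, 200)

# Pick the largest peak over an assumed search range of lags 20..199.
pitch_lag = max(range(20, 200), key=lambda m: phi[m])
print(pitch_lag)  # 80 for this toy frame
```

Passing differently processed versions of the frame as `x1` and `x2` gives the cross-correlations between the two preprocessor paths of Fig. 1.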


TABLE I
COMBINATIONS OF NONLINEARITIES FOR CORRELATION ANALYSIS

Correlation No.   x1(n)         x2(n)         Computation
1                 x(n)          x(n)          *, +
2                 clc[x(n)]     clc[x(n)]     *, +/φ
3                 clp[x(n)]     clp[x(n)]     *, +/φ
4                 x(n)          sgn[x(n)]     ±/φ
5                 clc[x(n)]     sgn[x(n)]     ±/φ
6                 clp[x(n)]     sgn[x(n)]     ±/φ
7                 x(n)          clc[x(n)]     *, +/φ
8                 x(n)          clp[x(n)]     *, +/φ
9                 clc[x(n)]     clp[x(n)]     *, +/φ
10                sgn[x(n)]     sgn[x(n)]     counter/φ
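The ten combinations of Table I can be expressed as a small lookup from correlation number to the pair of nonlinearities applied on the two paths. The sketch below is illustrative only: the names `CORRELATORS`, `correlate`, and `counter_correlation` are hypothetical, and the threshold helper assumes the rule described in the text (68 percent of the smaller of the peak absolute levels in the first and last thirds of the frame). The last function shows why correlation number 10 needs only an up-down counter: with three-level inputs every nonzero product is +1 or -1.

```python
def identity(x, c_l):
    """Direct path (y = x); the threshold is ignored."""
    return list(x)


def clc(x, c_l):
    return [v - c_l if v >= c_l else v + c_l if v <= -c_l else 0.0 for v in x]


def clp(x, c_l):
    return [v if abs(v) >= c_l else 0.0 for v in x]


def sgn(x, c_l):
    return [1 if v >= c_l else -1 if v <= -c_l else 0 for v in x]


# Table I: correlation number -> (nonlinearity for x1, nonlinearity for x2).
CORRELATORS = {
    1: (identity, identity), 2: (clc, clc), 3: (clp, clp),
    4: (identity, sgn), 5: (clc, sgn), 6: (clp, sgn),
    7: (identity, clc), 8: (identity, clp), 9: (clc, clp),
    10: (sgn, sgn),
}


def clipping_threshold(frame, fraction=0.68):
    """Fixed fraction of the smaller peak level in the first and last thirds."""
    third = max(1, len(frame) // 3)
    first_peak = max(abs(v) for v in frame[:third])
    last_peak = max(abs(v) for v in frame[-third:])
    return fraction * min(first_peak, last_peak)


def correlate(frame, number, n_lags):
    """Short-time correlation of x1(n) with x2(n + m) for one Table I entry."""
    nl1, nl2 = CORRELATORS[number]
    c_l = clipping_threshold(frame)
    x1, x2 = nl1(frame, c_l), nl2(frame, c_l)
    n = len(frame)
    return [sum(x1[i] * x2[i + m] for i in range(n - m)) / n
            for m in range(n_lags)]


def counter_correlation(frame, n_lags):
    """Correlation 10 using an up-down counter instead of multiplies."""
    s = sgn(frame, clipping_threshold(frame))
    n = len(frame)
    out = []
    for m in range(n_lags):
        count = 0
        for i in range(n - m):
            if s[i] == 0 or s[i + m] == 0:
                continue  # the "phi" case of Table I: nothing to compute
            count += 1 if s[i] == s[i + m] else -1
        out.append(count / n)
    return out
```

For any frame, `counter_correlation(frame, k)` and `correlate(frame, 10, k)` return the same values, which is the point of the "counter/φ" entry in the table.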

Fig. 2. Input-output characteristics of each of the three nonlinearities used in the investigation.

Fig. 2 illustrates the input-output characteristics for the three nonlinearities of (5)-(7). Allowing a direct path connection between input and output for each of the nonlinearities of Fig. 1 (i.e., y = x) it can be seen that there are ten distinct ways⁸ in which the signals $x_1(n)$ and $x_2(n)$ can be correlated, depending on which of the nonlinearities is used for NL1 and NL2. Table I summarizes these ten possibilities.

It should be noted that correlation number 1 in Table I corresponds to an ordinary autocorrelation, whereas correlation number 10 corresponds to the combination peak clipping, center clipping correlation discussed in [1]. Also shown in Table I are the required computations needed to implement the combined correlation for each possibility. In the most general case (correlation number 1) a multiply and an add are required for each sample in the computation. For cases 2, 3, 7, 8, and 9, whenever either $x_1(n)$ or $x_2(n+m)$ falls below the clipping level, $C_L$, no computation is required, as indicated by the φ in the computation column for these cases. For cases 4, 5, and 6 only an adder/subtractor is required because sgn[x(n + m)] can only assume the values +1 [addition of $x_1(n)$], 0 (no computation), or -1 [subtraction of $x_1(n)$]. Finally, case 10 only requires an up-down counter to implement, as discussed in [1].

The justification for considering the nonlinearities of Fig. 2 for use in autocorrelation analysis is obtained by examining the effects of the nonlinearities on the waveforms. It can be argued that a center clipper effectively attenuates the effects of first formant structure on the waveform, without seriously affecting the pitch pulse indications. However, it has been argued that the peak clipping of the sgn quantizer [Fig. 2(c)] gives too much weight to signal amplitudes that just exceed the clipping threshold, and too little weight to signal amplitudes that exceed the clipping threshold by a wide margin. Thus, the justification for the clc (clip and compress) and the clp (clip) quantizers is that they provide a compromise between the extremes of no clipping and infinite peak clipping.

Before proceeding to some examples showing the effects of each of these nonlinearities, it is worth noting that the method used to set the clipping threshold ($C_L$) for each of these nonlinearities was exactly the method used in [1], i.e., set the clipping level as a fixed percentage (68 percent) of the smaller of the maximum absolute signal level over the first and last one-thirds of the analysis frame. This method has proven quite successful in all tests to date [2].

Fig. 3 illustrates the effects of each of the quantizer characteristics of Fig. 1 on a typical frame of voiced speech. The left-hand side of Fig. 3 shows the sequence of signals x(n), clc[x(n)], clp[x(n)], and sgn[x(n)]. Superimposed on the plot of x(n) is the clipping level for this frame of speech. The right-hand side of Fig. 3 shows the sequence of autocorrelations corresponding to each of the sequences at the left (i.e., correlations numbers 1, 2, 3, and 10 in Table I).⁹ A rectangular window was used in all cases for computing the correlations as no other type of window is reasonable. The effects of the nonlinear clipping are readily evident in this figure. Although there is a sharp peak in the autocorrelation due to the pitch at m = 80 for all four correlations, the shape of the correlation function for the unprocessed speech [Fig. 3(a)] is significantly different from the shape of the correlation function for all of the nonlinearly processed signals [Fig. 3(b)-(d)], especially in the low time part of the correlation (i.e., m going from 20 to 50). Fig. 3(d) also illustrates the problems associated with using the sgn quantizer in that all speech samples which exceed the clipping threshold are weighted equally. Thus in the third period there are five pulses of varying width whereas in the first periods there are only three pulses. Fig. 3(b) shows that such problems are inherently eliminated by the clc quantizer whose output samples are proportional in amplitude to the amount by which they exceed the clipping threshold.

⁸ Theoretically there are 16 ways in which $x_1(n)$ and $x_2(n)$ can be correlated. For all practical purposes, however, six pairs of these results are equivalent. Thus, only ten ways of correlating $x_1(n)$ and $x_2(n)$ are considered here.
⁹ Each of the signal amplitudes in Figs. 3-7, and 9 is scaled so that the maximum value is set to 1.0 for display purposes. Thus, it is difficult to compare these amplitude sequences against each other.

[Figs. 3 and 4 plot panels, titled "FRAME 23 LRR - I SAW THE CAT": signal x(n) with the clipping levels +C_L and -C_L marked; columns for signal $x_1(n)$, correlation $\phi(m)$, and power spectrum S(f) (dB).]

Fig. 3. Each of the processed signals and the resulting correlation function for a section of voiced speech.

Fig. 4. The signal $x_1(n)$; the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech.

Spectral Flattening from the Quantizers

It has already been argued that the effect of the nonlinear processing preceding the correlation computation is to approximately spectrally flatten the signal spectrum, thereby enhancing the periodicity of the signal. To investigate this, the power spectrum of each of the correlation functions of Table I was computed directly from the correlation function by the Fourier transform relation

[Eq. (8): not legible in the source]

A 512-point FFT was used to compute $S(f_k)$, k = 0, 1, ..., 511, where

[Eq. (9): not legible in the source]

i.e., at 512 points around the unit circle. Since $\phi(m)$ is theoretically infinite, a ($2M_0$ + 1) point Hamming window was used to taper $\phi(m)$ smoothly to 0. (Note we are assuming $\phi(m)$ is symmetric, i.e., $\phi(m) = \phi(-m)$.)

Figs. 4-7 show plots of the results of processing four different sections of voiced speech. The left-hand column shows the signal $x_1(n)$, the middle column shows the signal $\phi(m)$ = $x_1(n)$ correlated with $x_2(n)$ (where $x_2(n)$ is as specified in Table I), and the right-hand column shows the power spectrum S(f) obtained as described above. The ten rows in each figure correspond to the ten combinations of signals to be correlated as shown in Table I. An examination of Fig. 4 shows that for the unprocessed signal (i.e., the top row) the first several harmonics are seen in the power spectrum. Beyond 1 kHz, the spectrum decays rapidly due to the low-pass filter (the lack of a sharp falloff in the spectrum is due to a combination of the signal and autocorrelation windows). The amplitudes of the harmonics vary with the first formant envelope. It can be seen that the spectra for the autocorrelations of each of the nonlinear quantizers (i.e., rows 2, 3, and 10) are much flatter than the original signal spectrum. Additionally, the spectra of the nonlinearly processed signals are much broader than the original spectrum. It is interesting to note that the spectra from correlations involving x(n), i.e., correlations numbers 1, 4, 7, and 8, are the least flattened and are generally quite irregular (i.e., the harmonics are not very easy to find).

Fig. 5 shows similar results from a different section of voiced speech. As seen from the spectrum of the unprocessed signal (on the top line) the bandwidth of the first formant is fairly small, causing the correlation function to show a great deal of formant periodicity for small values of m. The effects of the nonlinearities on the signal spectra are quite impressive even for such a difficult case as this one.

Fig. 6 shows results from a section of speech from a female speaker (high pitch). Again the spectrum from the unprocessed signal shows only a few harmonics whose amplitudes vary with the formant amplitude. The nonlinearly processed samples show various degrees of spectral flattening, as anticipated by the previous discussion.

Finally, Fig. 7 shows the results obtained with a voiced frame from a low pitched (long period) male speaker. In this example the first formant has a very narrow bandwidth as seen in the spectrum at the top of Fig. 7. Pitch detection directly on the autocorrelation of the signal yields incorrect results in this case due to the first formant peak(s) in the autocorrelation function. However, as shown in Fig. 7, almost any of the nonlinearities flatten the spectrum and eliminate the troublesome effects of the sharp first formant in the resulting correlation function.

[Figs. 5-7 plot panels, titled "FRAME 88 LRR - I SAW THE CAT" (Fig. 5), "FRAME 50 F105W" (Fig. 6), and "FRAME 146 LMO5T" (Fig. 7): columns for signal $x_1(n)$, correlation $\phi(m)$, and power spectrum S(f) (dB).]

Fig. 5. The signal $x_1(n)$; the resulting correlation and power spectrum for each of the ten correlators of Table I for another section of voiced speech.

Fig. 6. The signal $x_1(n)$; the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a female speaker.

Fig. 7. The signal $x_1(n)$; the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a low pitched male speaker.

In summary, we have presented examples which tend to show that, as anticipated, the effect of nonlinearly quantizing the signal amplitudes using the quantizers of Fig. 1 is to effectively flatten and broaden the signal power spectrum, thereby reducing the effects of the first formant on the correlation function, and simplifying the pitch detection problem. In the next section we present results of a comparative test of the performance of the ten correlation pitch detectors discussed in this section on a series of speech utterances.

III. EVALUATION OF THE TEN NONLINEAR CORRELATIONS

In order to evaluate and compare the performance of the ten nonlinear correlations discussed in the preceding section, a small set of the utterances from the data base in [2] was used as a test set. For each of the utterances a reference pitch contour was available from which an error analysis was made [6]. Since the problem of making a reliable voiced-unvoiced decision was not a concern here, the reference voiced-unvoiced contour was used directly, i.e., each correlator was required to estimate the pitch period, assuming a priori that the interval was properly classified as voiced. (No pitch detection was done during unvoiced intervals.) However, if the peak correlation value (normalized) fell below a threshold (0.25), the interval was classified as unvoiced since reliable selection of

TABLE II
STANDARD DEVIATIONS FOR TEN CORRELATORS

Utterance

TABLE III
ERROR STATISTICS FOR TEN CORRELATORS-UNSMOOTHED

1
2

.53 .63 .h4


.52

.60

.80

.91

.S4
.58

.71
.614

.84
.72

.68

.To
.83 .86

$
.El

5
6

.63
.A7

: :

.7 5 .6 9 ,4 10 .45

7 8

.40

.61 .64 .65 .63

.79 .76 .72


.80

.56 .79 .99


.82

.78 .78

.65 1.15 .68 -97 .79 .. 5 61 9

.b 5 .73 .54 .671.56 1.13 .56 1.76 1.17

1.34 .85 1.58 .8 8 1.59 1.70 1.18 1.19 1.66 1.10 1.14 1.50 1.54 1.82 .97 1.11 1.47 1.75 1.07 1.07 1.55 1.66 1.08 1.19 1.46 1.70 .99 1.04 1.61 i.88 1.09 1.18 1.501.13 1.69
1.25

1.23 1.24 1.21 1.32 1.31 1.32 1.24 1.29 1.34 1.29 1.3 1.39 1.17 1.24 1.02 1.05
1.29

1.52

1.39 1.46
1.50

1 .23 1.05 1.04

1 6
2

2
2

8
6

1 1
1
2

2
2 2 2

1.51
1.41 1.28

1.31 1.43 1.45 1.45


1.F7

1.34 1.21 1.25 121 .1


1.24

4
3 1

6
6

2
2

4
8
10

40 4 3324 6 8 1 5 1 1 22 5 10 13 21 29 2 4 3 14 18 3 25 4 13 24 12

18

18

25

6 10 15 18 11 13 13 16
10

8
1 : 20

7
14 17 14
9

19 13

1.46 1.11 1.17

3 3

4 5 4
1

1
1
2

10

.75

24 31 28 34 37 25 7 14 16 15 27 6

27

16
22
11

8
9

15

11

13

Standard Deviation of Pitch Period - Unsmoothed
1 2 1.u $ ~ ~ 3 1 $ 3 k 5 6 1
8 9 1

Number of Gross

Voiced Errors (Unsmoothed)

0
2

1 2

5
$

3 4

5
6
7 8

2
u

.61 .40 .51 .50 .50 .62 .48 .61 .61 ,119 .55 .60 .56 .97 .43 .49 .55 1.25 .lo . 8 .>I 4 .61 .56 .56 .52 .60 1.06 .59 .46 .54 .63 .59 .59
.48

.50
.50

.54
.71 .8 6 .51 .56

1.59 1.52 2.08 1.22 1.92 1.75 .59 1.09 1.40 1.81 1.25 1.50 1.63 1.54 2.15 .57 .a5 1.211.76 1.28 1.491.3111.98 1.0 1.59 1.33 1.83 1.81 2.17 .62 1.02 1.60 1.55 1.53 1.18 1.65 1.36 2.23 1.13 1 6 1.68 1.57 1.51 .3 1.69 1.93 .1 1.14 1 5 1.75 1.77 1 2 2.15 .1 10 5 .3 .5
1.12

.8b

0 1 2 o 0 0 0
0

1 5 0 o 0 0 0
2

2 4 2 1 1 2 2 2 3 1 4

1 3 o 1

0 1 3 0 2 3 0 1 0 1 14 1 0 1 3 1 1 2 2 0 3 6 2 5 4 o o o 1 1 o 4 0 0 0 2 4 0 1
2

2 5 o 2 4 4

2 4

2 0 1
2

0 2 2
2
0

0 0
0

21 .1 1.92 1.66 1.63

3 1 5

5 3 3

3 0 1

5 2 3 1

2 4 0

9 1 0

.50 .59

.1 6 .64
.54

.50 .51 .8 5

1.19 1.04 10 .5

14 .3 14 .1 12 .6

1.78 1.83 1.67 1.92 16 17 .3 .6

1.78
1.48 1.68

1.84 111 .1

2.48
2.39 22 .9

2 0

0 0

3 0 1

4
5

4
1 6

7
5

L
2

4 1 1 1 0 6 6 7

15 .6

Standard Deviation of Pitch Period - Smoothed

Total Number of

Number of Voiced-Unvoiced Errors (Unsmoothed)

the pitch was not possible with a correlation peak below this threshold. Thirteen utterances from [2] were used in this comparison.

Tables II-V present the results of an error analysis which measured the average and standard deviation of the pitch period, the number of gross pitch period errors, and the number of voiced-to-unvoiced errors [2].¹⁰ For all utterances the average pitch period error was well below 0.5 samples (10 kHz sampling rate) and so the results of this measurement are not presented. Table II presents the standard deviations of the pitch period for the ten correlations. The results are also presented for the errors when the pitch contours were nonlinearly smoothed using a median smoothing algorithm [7]. From Table II it can be seen that the standard deviations for all correlators were approximately the same for the same utterance. It is also seen that as the average pitch period gets longer (reading from left to right) the standard deviation increases proportionally.

Tables III and IV show the error statistics for gross errors and voiced-to-unvoiced errors, both for the unsmoothed pitch contours (Table III) and for the smoothed pitch contours (Table IV). These tables show that for the high pitched speakers (utterances prefaced by C1, F1, F2), although some differences¹¹ were present in the error scores for the unsmoothed data, the nonlinear smoother was able to correct most of the errors. Thus the overall performance on the first

¹⁰ A voiced-to-unvoiced error occurred when a voiced region was improperly classified as an unvoiced region because no peak above the threshold was present in the correlation function.
¹¹ These differences for the high pitched (short period) speakers were due to pitch period doubling, i.e., the correlation peak at twice the period was somewhat higher than the correlation peak at the true period. This is a common effect when the pitch period is on the order of 3 ms (300 Hz pitch) as was the case for these speakers.

Voiced Intervals

213 169 157 1114 141 133 118 170 105 17'4 152 110 134

TABLE IV
ERROR STATISTICS FOR TEN CORRELATORS-SMOOTHED

~ ~ 6 = 5 , ~ ~ Q 1

0 1 1 1 7 4 o 6 5 3 0 0 0 0 0 0 0 7 3 0 0 0 0 0 0 8 8 2 4 0 5 0 0 0 0 0 0 0 6 2 4 ~ 6 0 0 0 0 0 0 0 3 3 7 0 0 0 0 0 0 0 9 8
0
0
0

6 4
0

3 3 8 8 2 2

2 2 2 o 1 1 2 1 1 3 5 2 2 0 2

3 1

3 3 7

8 9 0

0 0 0

0 0 0 0

0 0 0

0 0 0 0

0 0 0

0 1 0

1 4 3

9 9 4

8 1 3

3 0 2

0 3 3

Number of Gross Voiced Errors

(Smoothed)

1 2 b 3
$ 4

3 3
2

1 0

0 1

3 4

5 9 2

2 2

1 1

o
1 1

o
0 G
0

4
4

a
4

z 5
9 4
g "

6 7

6 2 2

o
i 1
0

o 1 1
0

3 3 4

'

6
5

1 1
2 2 2

8 2 9 3 1 0 2

5
6 4

8
9 9

1 1 3 1 3 2 0 1 9 1 0 7 1 3 1 4 5 2 ' 3 8 8 1 2 L 7 15 11 2 9 1 0 8 1 1 1 6 7 1 1 1 0 1 9 8 2 0

8 1 1 1 7 9 9 2 9 3 2 9 0

8
3

1
0
0

1 7 1 3 1 1 1

7 3 9 1 2 6 7 1 1 8 1 0 1 0

6 4
4

Number of Voiced-Unvoiced Errors (Smoothed)

four utterances was approximately the same for all correlators. For the low pitched speakers (utterances prefaced by LM, M2) there were more significant differences between the correlators. For the category of gross errors, correlators 1 and 8 generally had the largest numbers of errors across the last 6 utterances in the test. However, for the category of voiced-to-unvoiced errors, correlators 2 and 9 had consistently the largest number of errors. Although the smoothing significantly reduced the number of gross errors for many of the correlators, in turn it increased the number of voiced-to-unvoiced errors. Since both errors constitute a pitch error, in this case the most significant error statistic is probably the sum of the gross errors and voiced-to-unvoiced errors. Table V shows these results. Based on this combined error statistic the following conclusions can be drawn about the performance of the ten correlators.

1) For high pitched speakers the differences in performance scores between the different correlators are small and probably insignificant. It is for this class of speakers that any type of correlation measurement of pitch period tends to work very well.

2) For low pitched speakers fairly significant differences in the performance scores existed. Correlator number 1 (the normal linear autocorrelation) tended to give the worst performance for all utterances in this class. Correlators numbers 4, 7, 8 (the ones involving an unprocessed x(n) in the computation) were also somewhat poorer in their overall performance based on the sum of gross errors and voiced-to-unvoiced errors.

3) Differences in the performance among the remaining six correlators were not consistent. Thus, any one of these correlators would be appropriate for an autocorrelation pitch detector.

It is interesting to note that (as seen in Tables III-V) the results for utterance M208T were significantly worse than for utterance M208M. These utterances were simultaneously recorded, the difference being that M208T was recorded off a telephone line, whereas M208M was recorded from a close talking microphone. This result is due to the band-limiting effects of the telephone line (300 Hz cutoff frequency) which eliminate the first few harmonics of the pitch, thereby making accurate pitch detection more difficult.

To illustrate the errors made during one of the more difficult utterances, Fig. 8 shows the pitch period contours from three of the correlators for the utterance LMO5T ("we were away a year ago," spoken by a low pitched male over the telephone line). Also shown in this figure is the nonlinearly smoothed pitch contour from correlator number 10. The pitch period contour from correlator number 1 [Fig. 8(a)] shows the large number of gross pitch period errors made during the analysis. It is readily seen that most of the errors involved choosing a low valued correlation peak rather than the one at the pitch period. These errors, although due somewhat to the frame size used for analysis (30 ms or 300 samples), are primarily due to the narrow bandwidth first formant which has a stronger correlation peak than the one due to the pitch period. The results for correlators number 2 [Fig. 8(b)] and 10 [Fig. 8(c)] confirm the fact that the use of the nonlinearities prior to correlation greatly flattens the spectrum, thereby reducing the number of errors of the type discussed above. As shown in Fig. 8(d), the nonlinear smoother is quite capable of correcting most of the gross pitch errors in analysis from the correlators using the nonlinearities; however, the number of errors for correlator number 1 is too large to be adequately corrected by this smoother. The nonlinearly smoothed pitch contour also shows that the only gross pitch period errors which were not

TABLE V
TOTAL ERROR STATISTICS FOR TEN CORRELATORS

1
2

14
11

9 5
I

11

4 6

8
11

2
3

2 4 6 i 5 14 . E : 6 11
8 9 10

B 3
3
L

7 4
4
4

6 5
2
2

8 6 6 6

3
2

2
2

a
11

3
3

8 9
14

5 6 5

a
6

1:

3 2

L 2 5 3 6 L O 6 18 26 15 16 28 12 5 3 2 2 3 0 2 4 3 16 25 18 4 16 25 15 4 2 5 3 4 2 7 19 33 40 35 7 2 0 32 18 7 21 32 17

20

21

27

16

la
11

19
16
22

15 19 22 16
15
19 14

14 19
16

13

23 17 20
26
2 1

17

23 15 15

20

Total Number of Pitch Errors

Unsmoothed

corrected by the smoother were those that occurred near an unvoiced boundary. As already mentioned, these gross pitch period errors were often changed into voiced-to-unvoiced errors in the smoothed pitch contour. I v . ADAPTIVEFRAMESIZE FOR PITCH ANALYSIS One of the remaining problems in designing an effective correlation detector pitch is t o implement an algorithm for making the analysis frame size variable. It is important to note that the variability offrame size for a given speaker is not nearly as important as the variability of frame size from speaker to speaker. The most important feature of the analysis frame size is that it be large enough to encompass at least two complete pitch periods, but not so large that it encompasses a large number of pitch periods. If we consider the range ofpitch period variation across speakers [2], then a frame size on the order of 40 samples (4 ms) is required for a high pitched speaker, and a frame size on the order of 400 samples (40 ms) is required for a low pitched speaker. Thus, a single fixed frame size will not be suitable for all speakers. The question now remains as to a suitable method of adapting the frame size to the pitch of the speaker. We have already argued that adaptation to the detailed pitch variation within an utterance is generally unnecessary-mainly because the range ofpitch variation within an utterance is generally 1 octave or less (a factor of 2 to 1) from the average pitch for theutterance. Thus, an instantaneouslyadapting algorithm for choosing the analysis frame size is notrequired. Thisis fortunate in that instantaneously adaptive methods generally do not work well when the pitch estimates include gross pitch errors. In lieu of an instantaneously adaptive method, a simple but effective method of adapting the frame size is to estimate the average pitch F(m) of the speaker using the relation

$$F(m) = \begin{cases} \dfrac{1}{N_m} \displaystyle\sum_{i=1}^{N_m} p(i), & N_m \ge 10 \\ 100, & N_m < 10 \end{cases} \tag{10}$$

where $p(i)$ is the pitch period of the $i$th voiced frame (i.e., unvoiced frames are not used in the computation), and $N_m$ is the number of voiced frames up to the $m$th frame.¹² The initial condition $F(m) = 100$ for $N_m < 10$ is used to ensure a reasonable "average" pitch period estimate until a sufficient number of voiced frames have been estimated. The frame length $L(m)$ is generated from the simple rule

$$L(m) = 3 \cdot F(m). \tag{11}$$

The factor of 3 allows up to a 50 percent variation in pitch period from the estimated average pitch period, and still ensures that at least two complete pitch periods are contained within each analysis frame (a period of $1.5\,F(m)$ fits exactly twice into $3\,F(m)$). To prevent the analysis frame from getting too small, or too large, $L(m)$ is restricted to the range

$$100 \le L(m) \le 600, \quad \text{for all } m. \tag{12}$$

¹² For continuous text, equation (10) should be modified so that $F(m)$ is computed over a fixed number of past voiced frames.

[Figure 8 plots omitted. Utterance: LM05T, "We were away a year ago". Recoverable panel titles: Correlator No. 2; Correlator No. 10; Correlator No. 10 (smoothed contour). Horizontal axis: frame number.]
Fig. 8. The pitch period contours for the utterance LM05T from three of the correlators of Table I and a nonlinearly smoothed pitch contour from correlator number 10.

[Figure 9 plots omitted. Panels: (a) frame 147 of LM05T, N = 300; (b) frame 147 of LM05T, N = 600. Columns: signal $x_1(n)$, correlation, power spectrum $S(f)$ in dB.]
Fig. 9. The signal $x_1(n)$, the resulting correlation and power spectrum for each of the 10 correlators of Table I for both a 300 and a 600 sample analysis frame.
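The frame-size rule of (10)-(12) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function name `adaptive_frame_lengths`, the constant names, and the convention of marking unvoiced frames with `None` are assumptions made here. The sketch also assumes that the estimate for frame $m$ is counted in $N_m$; a causal system would apply each returned length to the following frame, which gives a 300 sample frame for the first 10 voiced frames.

```python
from typing import Iterable, List, Optional

INITIAL_PERIOD = 100     # samples; value of F(m) in (10) while N_m < 10
MIN_VOICED = 10          # voiced frames required before the average is used
L_MIN, L_MAX = 100, 600  # limits on L(m) from (12)


def adaptive_frame_lengths(periods: Iterable[Optional[float]]) -> List[int]:
    """Return the frame length L(m), in samples, for every frame m.

    `periods` holds one pitch period estimate per frame, in samples.
    None marks an unvoiced frame (a convention assumed for this sketch).
    """
    total = 0.0   # running sum of p(i) over voiced frames
    n_voiced = 0  # N_m, the number of voiced frames seen so far
    lengths = []
    for p in periods:
        if p is not None:
            total += p
            n_voiced += 1
        # Equation (10): running average, or the initial value of 100.
        avg = total / n_voiced if n_voiced >= MIN_VOICED else INITIAL_PERIOD
        # Equation (11), then the range restriction of (12).
        length = int(round(3 * avg))
        lengths.append(min(max(length, L_MIN), L_MAX))
    return lengths
```

For a constant pitch period of 140 samples, the sketch returns 300 samples until ten voiced frames have been seen and 420 samples afterwards. The modification suggested in footnote 12 would replace the running sum with an average over a fixed number of recent voiced frames.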



[Figure 10 plots omitted. Panel titles: LM05T, M107M, F107M, C108T (each: correlator 10, smoothed data). Horizontal axis: frame number.]
Fig. 10. Plots of the pitch contour and the resulting adaptive frame size for four typical utterances.

To demonstrate the necessity and effectiveness of matching the analysis frame size to the speaker's average pitch, Fig. 9 shows plots of the waveforms, correlation functions, and power spectra for a section of voiced speech from a low pitched male. The pitch period during this section was about 150 samples. Fig. 9(a) shows the results for the 10 correlators for an analysis frame size of 300 samples; Fig. 9(b) shows the results for a 600 sample analysis frame size. By comparing the flatness of the power spectrum for the best correlators (i.e., numbers 2, 3, and 10) it can be readily seen that the longer analysis frame size leads to significantly flatter spectra.

The analysis frame adaptation algorithm discussed above was tested on several utterances used in the study of [2]. Fig. 10 shows plots of both the nonlinearly smoothed pitch period contour and the analysis frame size as obtained from (11) and (12). Fig. 10(a) shows the results for a low pitched male whose average pitch period was about 140 samples. As discussed above, the first 10 voiced frames used a 300 sample frame; after that the frame size adapted slowly to the pitch period, reaching a fairly constant value of about 420 samples. Fig. 10(b) shows the results for a normal pitched male speaker with very little pitch variation throughout the utterance. The algorithm very rapidly converges to an analysis frame size of about 210 samples for this speaker. Fig. 10(c) shows the results for a female speaker. In this case the analysis frame size quickly converged to a length of about 135 samples.


Finally, Fig. 10(d) shows the results for a high pitched child. In this case the frame size reached the lower limit of a 100 sample frame size at the first iteration, and remained at that value throughout the utterance.

Adapting the frame size to the estimated average pitch of the speaker can have advantages other than the ones discussed above. In cases where the resulting frame size is smaller than 300 samples, the computation of the correlation function is speeded up. In cases where the frame size falls below 200 samples, the computation is speeded up even more because fewer than 200 correlations need to be computed. Thus, for example, for a frame size of 300 samples, on the order of $N_1 = 300 \times 200 = 60\,000$ operations (multiply, addition) need to be performed to compute 200 autocorrelation points, whereas for a frame size of 100 samples, on the order of $N_2 = 100 \times 100 = 10\,000$ operations are required, providing a 6 to 1 savings in computation. However, in cases where the frame size exceeds 300 samples, the correlation computation time increases, but this increase in computation time is unavoidable if one is to use the proper frame size.
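The operation counts quoted above can be reproduced with a small helper. This is a rough sketch under stated assumptions: the name `autocorrelation_ops` is hypothetical, and the rule that the number of computed lags is the smaller of the frame size and 200 is inferred from the two examples in the text rather than given explicitly.

```python
def autocorrelation_ops(frame_size: int, max_lags: int = 200) -> int:
    """Approximate multiply-add count for one analysis frame.

    Follows the order-of-magnitude estimate in the text: the frame size
    times the number of autocorrelation points computed. The lag count is
    assumed to be min(frame_size, max_lags).
    """
    lags = min(frame_size, max_lags)
    return frame_size * lags


n1 = autocorrelation_ops(300)  # 300 x 200
n2 = autocorrelation_ops(100)  # 100 x 100
savings = n1 / n2              # ratio of the two counts
```

Under the same assumption, a 600 sample frame costs 600 x 200 operations, twice the count for the 300 sample frame, which illustrates the unavoidable increase mentioned for long frames.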

V. SUMMARY
In this paper we have examined several methods for combining nonlinear processing of the speech waveform with a standard correlation analysis to give correlation functions which have sharp peaks at the pitch period. We have shown that the nonlinearities provide some degree of spectral flattening, thereby enhancing the periodicity peaks in the correlation function, and reducing the correlation peaks due to the formant structure of the waveform. A formal evaluation of ten types of nonlinear correlation showed that correlations involving the unprocessed signal were somewhat inferior to correlations involving the nonlinearly processed signal; however, almost all the nonlinearities provided essentially the same performance. In addition, a simple procedure for adapting the analysis frame size of the correlation to the estimated average pitch period of the speaker was proposed and evaluated for several utterances. By basing the adaptation on a running estimate of the pitch period, it was shown that a fairly reliable and robust method of adapting analysis frame size resulted. This method should be appropriate for any frame-by-frame speech analysis system in which pitch is extracted.

REFERENCES
[1] J. J. Dubnowski, R. W. Schafer, and L. R. Rabiner, "Real-time digital hardware pitch detector," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 2-8, Feb. 1976.
[2] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 399-418, Oct. 1976.
[3] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.
[4] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.
[5] B. S. Atal, unpublished work.
[6] C. A. McGonegal, L. R. Rabiner, and A. E. Rosenberg, "A semiautomatic pitch detector (SAPD)," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 570-574, Dec. 1975.
[7] L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, "Applications of a nonlinear smoothing algorithm to speech processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 552-557, Dec. 1975.
