
ABSTRACT

Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals while generating a smooth transition between them. Speech morphing is analogous to image morphing. In image morphing the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. It is this feature that a speech morph should possess: one speech signal should smoothly change into another, keeping the shared characteristics of the starting and ending signals but smoothly changing the other properties. The major properties of concern in a speech signal are its pitch and envelope information. These two reside in a convolved form in the signal, so an efficient method for extracting each of them is necessary. We have adopted an uncomplicated approach, namely cepstral analysis, to do this. Pitch and formant information in each signal is extracted using the cepstral approach. The processing needed to obtain the morphed speech signal includes cross-fading of envelope information, dynamic time warping to match the major signal features (pitch), and signal re-estimation to convert the morphed speech signal back into an acoustic waveform.
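The report does not show the cepstral computation itself, so the following is a minimal sketch of the approach described in the abstract: the real cepstrum separates the slowly varying spectral envelope (low quefrency) from the pitch excitation (a high-quefrency peak). The frame windowing, liftering cutoff, and pitch search range are illustrative assumptions.

```python
import numpy as np

def cepstral_pitch_and_envelope(frame, fs, lifter_ms=2.0,
                                fmin=50.0, fmax=400.0):
    """Separate pitch and spectral envelope with the real cepstrum.

    frame : one speech frame (1-D numpy array)
    fs    : sampling rate in Hz
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    log_mag = np.log(np.abs(spectrum) + 1e-10)      # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)                # real cepstrum

    # Low-quefrency part -> spectral envelope (vocal tract / formants).
    # Keep the symmetric tail too, so the liftered spectrum stays real.
    cutoff = int(lifter_ms * 1e-3 * fs)
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    liftered[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]
    envelope = np.exp(np.fft.rfft(liftered).real)   # smoothed magnitude envelope

    # High-quefrency peak -> pitch period (searched between fmin and fmax)
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    f0 = fs / peak
    return f0, envelope
```

Deconvolving each signal this way yields the envelope tracks to be cross-faded and the pitch tracks to be time-warped, as described above.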

INTRODUCTION

Voice morphing, which is also referred to as voice transformation and voice conversion, is a technique for modifying a source speaker's speech to sound as if it was spoken by some designated target speaker. There are many applications of voice morphing, including customizing voices for text-to-speech (TTS) systems, transforming voice-overs in adverts and films to sound like that of a well-known celebrity, and enhancing the speech of impaired speakers such as laryngectomees. Two key requirements of many of these applications are that, firstly, they should not rely on large amounts of parallel training data where both speakers recite identical texts, and secondly, the high audio quality of the source should be preserved in the transformed speech. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and various approaches have been proposed for doing this, such as codebook mapping, formant mapping, and linear transformations. Codebook mapping, however, typically leads to discontinuities in the transformed speech. Although some discontinuities can be resolved by some form of interpolation technique, the conversion approach can still suffer from a lack of robustness as well as degraded quality. On the other hand, formant mapping is prone to formant tracking errors. Hence, transformation-based approaches are now the most popular. In particular, the continuous probabilistic transformation approach introduced by Stylianou provides the baseline for modern systems. In this approach, a Gaussian mixture model (GMM) is used to classify each incoming speech frame, and a set of linear transformations weighted by the continuous GMM probabilities is applied to give a smoothly varying target output. The linear transformations are typically estimated from time-aligned parallel training data using least mean squares. More recently, Kain has proposed a variant of this method in which the GMM classification is based on a joint density model. However, like the original Stylianou approach, it still relies on parallel training data. Although the requirement for parallel training data is often acceptable, there are applications which require voice transformation for nonparallel training data. Examples can be found in the entertainment and media industries, where recordings of unknown speakers need to be transformed to sound like well-known personalities. Further uses are envisaged in applications where the provision of parallel data is impossible, such as when the source and target speakers speak different languages. Although interpolated linear transforms are effective in transforming speaker identity, the direct transformation of successive source speech frames to yield the required target speech will result in a number of artifacts. The reasons for this are as follows. First, the reduced dimensionality of the spectral vector used to represent the spectral envelope and the averaging effect of the linear transformation result in formant broadening and a loss of spectral detail. Second, unnatural phase dispersion in the target speech can lead to audible artifacts, and this effect is aggravated when pitch and duration are modified. Third, unvoiced sounds have very high variance and are typically not transformed. However, in that case, residual voicing from the source is carried over to the target speech, resulting in a disconcerting background whispering effect.
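As a concrete illustration of the continuous probabilistic transformation described above, the sketch below applies a set of per-class linear transforms weighted by GMM posterior probabilities. It is a minimal sketch, not the implementation of any system cited here: the use of scikit-learn's GaussianMixture and the transform shapes are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def convert_frame(x, gmm, W):
    """Soft-classified linear conversion of one spectral vector.

    x   : source spectral vector, shape (p,)
    gmm : GaussianMixture fitted on source spectral vectors
    W   : per-class transforms, shape (N, p, p + 1)
    """
    x_ext = np.append(x, 1.0)                 # extended vector [x, 1]
    post = gmm.predict_proba(x[None, :])[0]   # posteriors p(C_n | x)
    # Every class contributes, weighted by its posterior, which avoids
    # the hard switching that causes codebook-style discontinuities.
    return sum(p_n * (W_n @ x_ext) for p_n, W_n in zip(post, W))
```

How the matrices W are estimated from time-aligned parallel data is discussed in 2.3 below.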

To achieve high-quality voice conversion, the system therefore includes a spectral refinement approach to compensate for the spectral distortion, a phase prediction method for natural phase coupling, and an unvoiced sound transformation scheme. Each of these techniques is assessed individually, and the overall performance of the complete solution is evaluated using listening tests. Overall, it is found that the enhancements significantly improve the quality of the converted speech.

TRANSFORM-BASED VOICE MORPHING SYSTEM

2.1 Overall Framework

Transform-based voice morphing technology converts the speaker identity by modifying the parameters of an acoustic representation of the speech signal. It normally includes two parts, the training procedure and the transformation procedure. The training procedure operates on examples of speech from the source and the target speakers. The input speech examples are first analyzed to extract the spectral parameters that represent the speaker identity. Usually these parameters encode short-term acoustic features, such as the spectrum shape and the formant structure. After the feature extraction, a conversion function is trained to capture the relationship between the source parameters and the corresponding target parameters. In the transformation procedure, the new spectral parameters are obtained by applying the trained conversion functions to the source parameters. Finally, the morphed speech is synthesized from the converted parameters. There are three interdependent issues that must be decided before building a voice morphing system. First, a mathematical model must be chosen which allows the speech signal to be manipulated and regenerated with minimum distortion. Previous research suggests that the sinusoidal model is a good candidate since, in principle at least, this model can support modifications to both the prosody and the spectral characteristics of the source signal without inducing significant artifacts. However, in practice, conversion quality is always compromised by phase incoherency in the regenerated signal, and to minimize this problem a pitch synchronous sinusoidal model is used in our system. Second, the acoustic features which enable humans to identify speakers must be extracted and coded. These features should be independent of the message and the environment so that whatever and wherever the source speaker speaks, his/her voice characteristics can be successfully transformed to sound like the target speaker. Clearly the changes applied to these features must be capable of straightforward realization by the speech model. Third, the type of conversion function and the method of training and applying the conversion function must be decided.
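The report describes this framework only at block-diagram level; the skeleton below is one hypothetical way to organize the two procedures in code. The function arguments (analyze, fit_transform, convert, synthesize) are placeholders for the stages named above, not the API of any particular system.

```python
def train(source_utts, target_utts, analyze, fit_transform):
    """Training procedure: extract spectral parameters from examples of
    both speakers, then fit the conversion function between them."""
    X = [f for utt in source_utts for f in analyze(utt)]  # source parameters
    Y = [f for utt in target_utts for f in analyze(utt)]  # target parameters
    return fit_transform(X, Y)   # e.g. GMM plus interpolated linear transforms

def morph(source_utt, analyze, convert, synthesize):
    """Transformation procedure: convert each frame's parameters with the
    trained function, then synthesize the morphed speech."""
    return synthesize([convert(f) for f in analyze(source_utt)])
```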

2.2 Spectral Parameters

As indicated above, the overall shape of the spectral envelope provides an effective representation of the vocal tract characteristics of the speaker and the formant structure of voiced sounds. Generally, there are several ways to estimate the spectral envelope, such as using linear predictive coding (LPC), cepstral coefficients, and line spectral frequencies (LSF). The main steps in estimating the LSF envelope for each speech frame are as follows (a code sketch of steps 2-4 is given after the list).

1. Use the amplitudes of the K harmonics determined by the pitch synchronous sinusoidal model to represent the magnitude spectrum. K is determined by the fundamental frequency; its value can typically range from 50 to 200.
2. Resample the magnitude spectrum nonuniformly according to the bark scale frequency warping, using cubic spline interpolation.
3. Compute the LPC coefficients by applying the Levinson-Durbin algorithm to the autocorrelation sequence of the warped power spectrum.
4. Convert the LPC coefficients to LSFs.
5. In order to maintain adequate encoding of the formant structure, LSF spectral vectors with an order of p = 15 were used throughout our voice conversion experiments.
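A rough sketch of steps 2-4 follows. The report names only the operations, so the bark-warping formula, the frequency grid, and the numerical details below are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def bark(f_hz):
    # Traunmueller's bark approximation (an assumed choice; the report
    # does not say which bark formula is used)
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def levinson_durbin(r, p):
    """LPC coefficients a (with a[0] = 1) from autocorrelation r[0..p]."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err
        a[1:i] += k * a[i-1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_lsf(a):
    """LSFs as the angles of the unit-circle roots of the sum/difference
    polynomials P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    a1 = np.append(a, 0.0)
    roots = np.concatenate([np.roots(a1 + a1[::-1]),
                            np.roots(a1 - a1[::-1])])
    ang = np.angle(roots)
    # keep angles strictly inside (0, pi); drops the trivial roots at z = +/-1
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])

def lsf_envelope(harmonic_amps, f0, p=15, n_warp=256):
    """Steps 2-4: bark-warp the harmonic magnitude spectrum, run
    Levinson-Durbin on its autocorrelation, convert LPC to LSF."""
    k = np.arange(1, len(harmonic_amps) + 1)
    spline = CubicSpline(bark(k * f0), harmonic_amps)
    # a uniform grid on the bark axis = nonuniform resampling in Hz
    warped = spline(np.linspace(bark(f0), bark(k[-1] * f0), n_warp))
    power = warped ** 2
    r = np.fft.irfft(power)[: p + 1]     # autocorrelation of the warped spectrum
    return lpc_to_lsf(levinson_durbin(r, p))
```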

2.3 Linear Transforms

We now turn to the key problem of finding an appropriate conversion function to transform the spectral parameters. Assume that the training data contains two sets of spectral vectors X and Y which, respectively, encode the speech of the source speaker and the target speaker. A straightforward method to convert the source vectors is to use a linear transform. In the general case, the linear transformation of a p-dimensional vector x is represented by a p x (p+1) dimensional matrix W applied to the extended vector x' = [x, 1]. Since there is a wide variety of speech sounds, a single global transform is not sufficient to capture the variability in human speech. Therefore, a commonly used technique is to classify the speech sounds into N classes using a statistical classifier such as a Gaussian mixture model (GMM) and then apply a class-specific transform. However, in practice, the selection of a single transform from a finite set of transformations can lead to discontinuities in the output signal. In addition, the selected transform may not be appropriate for source vectors that fall in the overlap area between classes. Hence, in order to generate more robust transformations, a soft classification is preferred in which all N transformations contribute to the conversion of the source vector:

    F(x) = sum_{n=1}^{N} p(C_n | x) W_n x'

The contribution of each transformation matrix W_n is the posterior probability p(C_n | x), that is, the degree to which the source vector belongs to the corresponding speech class C_n.

1) Least Square Error Estimation: When parallel training data is available, the transformation matrices can be estimated directly using the least square error (LSE) criterion. In this case, the source and target vectors are time aligned such that each source training vector x_i corresponds to a target training vector y_i. For ease of manipulation, the general form of the interpolated transformation above can be rewritten compactly as

    F(x_i) = W v_i

where W = [W_1 W_2 ... W_N] is the p x N(p+1) matrix formed by placing the N class transforms side by side, and v_i is the N(p+1)-dimensional vector formed by stacking N copies of the extended source vector x_i' = [x_i, 1], each weighted by the corresponding posterior probability p(C_n | x_i). The accurate alignment of source and target vectors in the training set is crucial for a robust estimation of the transformation matrices. Normally, a dynamic time warping (DTW) algorithm is used to obtain the required time alignment, where the local cost function is the spectral distance between source and target vectors. However, the alignment obtained using this method will sometimes be distorted when the source and target speakers are very different; this is especially a problem in cross-gender transformation.
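Since the alignment step relies on DTW, here is a minimal sketch of the classic dynamic-programming recursion, using Euclidean distance between spectral vectors as the local cost. The report does not specify path constraints, so the standard three-way step is assumed.

```python
import numpy as np

def dtw_align(X, Y):
    """Align source frames X (n, p) to target frames Y (m, p).
    Returns the list of (i, j) index pairs on the minimum-cost path."""
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])   # local spectral distance
            cost[i, j] = d + min(cost[i - 1, j],      # source frame repeated
                                 cost[i, j - 1],      # target frame repeated
                                 cost[i - 1, j - 1])  # one-to-one match
    # Backtrack from (n, m) to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The aligned pairs (x_i, y_j) then serve as the time-aligned parallel data required by the LSE estimation above.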

2) Maximum Likelihood Estimation: As noted in the introduction, the provision of parallel training data is not always feasible, and hence it would be useful if the required transformation matrices could be estimated from nonparallel data. The form of the transformation suggests that, analogous to the use of transforms for adaptation in speech recognition, maximum likelihood (ML) should provide a framework for doing this. To estimate multiple transforms using this scheme, a source GMM is used to assign the source vectors to classes, as in the LSE estimation scheme. A transform matrix is then estimated separately for each class using the above ML scheme applied to just the data for that class. Though it is theoretically possible to estimate multiple transforms using soft classification, in practice the matrices become too large to invert. Hence, the simpler hard classification approach is used here. As with the least mean squares method using parallel data, performance is greatly improved if subphone segment boundaries can be accurately determined in the source data using the target HMM and forced-alignment recognition mode. This enables the set of Gaussians evaluated for each source frame to be limited to just those associated with the HMM state corresponding to the associated subphone. This does, of course, require that the orthography of the source utterances be known. Similarly, knowing the orthography of the target training data makes training the target HMM simpler and more effective.

SYSTEM ENHANCEMENT

The converted speech produced by the baseline system described above will often contain artifacts. This section discusses these artifacts in more detail and describes the solutions developed to mitigate them.

3.1 Phase Prediction

As is well known, the spectral magnitude and phase of human speech are highly correlated. In the baseline system, when only spectral magnitudes are modified and the original phase is preserved, a harsh quality is introduced into the converted speech. However, to simultaneously model the magnitude and phase and then convert them both via a single unified transform is extremely difficult. Instead, a GMM model is first trained to cluster the target spectral envelopes, coded via LSF coefficients, into M classes (C_1, ..., C_M).

For each target envelope v we then have a set of posterior probabilities, which can be regarded as another form of representation of the spectral shape. A set of template signals T = [T_1, ..., T_M] can be estimated by minimising the waveform shape prediction error.

3.2 Spectral Refinement

Although the formant structure of the source speech has been transformed to match the target, the spectral detail has been lost as a result of reducing the dimensionality of the envelope representation during the transform. Another clearly visible effect is the broadening of the spectral peaks caused, at least in part, by the averaging effect of the estimation method. All these degradations lead to a muffled quality in the converted speech. To solve this problem, a straightforward idea is to reintroduce the lost spectral detail into the converted envelopes. A spectral residual prediction approach has been developed to do this, based on the residual codebook method, where the codebook is trained using a GMM model. After the residual codebook is obtained, the spectral residual needed to compensate each converted spectral envelope can be predicted straightforwardly from the posterior probabilities (a sketch of this prediction is given below, after the discussion of unvoiced sounds).

3.3 Transforming Unvoiced Sounds

Many unvoiced sounds have some vocal tract coloring, and simply copying the source to the target affects the converted speech characteristics, especially in cross-gender conversion. A typical effect is the perception of another speaker whispering behind the target speaker. Since most unvoiced sounds have no obvious vocal tract structure and cannot be regarded as short-term stationary signals, their spectral envelopes show large variations. Therefore, it is not effective to convert them using the same solution as for voiced sounds. However, randomly deleting, replicating, and concatenating segments of the same unvoiced sound does not induce significant artifacts. This observation suggests a possible solution based on unit selection and concatenation to transform unvoiced sounds.
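Returning to the spectral residual prediction of 3.2, the sketch below shows one plausible reading of the approach: a codebook with one residual per GMM class is built as a posterior-weighted average over training frames, and at conversion time the residual is predicted as a posterior-weighted sum of codebook entries. The weighting scheme is an assumption; the report states only that prediction is based on the posterior probabilities.

```python
import numpy as np

def train_residual_codebook(posteriors, residuals):
    """posteriors : (T, M) GMM posteriors of T training envelopes
    residuals  : (T, D) spectral detail lost by the envelope model
    Returns an (M, D) residual codebook, one entry per GMM class."""
    weights = posteriors.sum(axis=0)[:, None]             # (M, 1) class mass
    return (posteriors.T @ residuals) / np.maximum(weights, 1e-10)

def predict_residual(posterior, codebook):
    """Posterior-weighted blend of codebook residuals for one frame."""
    return posterior @ codebook                           # shape (D,)

# At conversion time, the predicted detail is added back to the
# converted (smoothed) envelope:
#   converted_env += predict_residual(gmm_posterior_of_frame, codebook)
```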

REALTIME VOICE MORPHING

In real-time voice morphing, what we want is to be able to morph, in real time, a user singing a melody with the voice of another singer. The result is an impersonating system with which the user can morph his/her voice attributes, such as pitch, timbre, vibrato, and articulation, with the ones from a prerecorded target singer. The user is able to control the degree of morphing, thus being able to choose the level of impersonation that he/she wants to accomplish. In our particular implementation we are using as the target voice a recording of the complete song to be morphed. A more useful system would use a database of excerpts of the target voice, thus choosing the appropriate target segment at each particular time in the morphing process. In order to incorporate into the user's voice the corresponding characteristics of the target voice, the system has to first recognize what the user is singing (phonemes and notes), find the same sounds in the target voice (i.e., synchronize the sounds), then interpolate the selected voice attributes, and finally generate the output morphed voice. All this has to be accomplished in real time.

Fig 4.1 System block diagram

4.1 The Voice Morphing System

The figure shows the general block diagram of the voice impersonator system. The underlying analysis/synthesis technique is SMS, to which many changes have been made to better adapt it to the singing voice and to the real-time constraints of the application. A recognition and alignment module was also added for synchronizing the user's voice with the target voice before the morphing is done. Before we can morph a particular song, we have to supply information about the song to be morphed and the song recording itself (Target Information and Song Information). The system requires the phonetic transcription of the lyrics, the melody as MIDI data, and the actual recording to be used as the target audio data. Thus, a good impersonator of the singer that originally sang the song has to be recorded. This recording has to be analyzed with SMS, segmented into morphing units, and each unit labeled with the appropriate note and phonetic information of the song. This preparation stage is done semi-automatically, using a non-real-time application developed for this task. The first module of the running system includes the real-time analysis and the recognition/alignment steps. Each analysis frame, with the appropriate parameterization, is associated with the phoneme of a specific moment of the song and thus with a target frame. The recognition/alignment algorithm is based on traditional speech recognition technology, that is, Hidden Markov Models (HMMs) that were adapted to the singing voice. Once a user frame is matched with a target frame, we morph them by interpolating data from both frames and we synthesize the output sound. Only voiced phonemes are morphed, and the user has control over which parameters are interpolated and by how much. The frames belonging to unvoiced phonemes are left untouched, thus always keeping the user's consonants.
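The report does not give the interpolation formulas, so the following is a minimal sketch of per-frame attribute morphing under common assumptions: pitch and amplitude are interpolated on logarithmic scales and spectral shape linearly, each with its own user-controlled morph factor.

```python
def morph_frame(user, target, alpha):
    """Interpolate the attributes of a matched user/target frame pair.

    user, target : dicts with 'f0' (Hz, > 0 since only voiced frames are
                   morphed), 'amp' (linear gain), 'env' (envelope array)
    alpha        : dict of morph factors in [0, 1], one per attribute
                   (0 = keep the user's value, 1 = take the target's)
    """
    morphed = {}
    # Pitch: interpolate in log frequency so musical intervals blend evenly
    morphed['f0'] = user['f0'] * (target['f0'] / user['f0']) ** alpha['f0']
    # Amplitude: interpolate in log amplitude (i.e., in dB)
    morphed['amp'] = user['amp'] * (target['amp'] / user['amp']) ** alpha['amp']
    # Spectral shape: linear cross-fade of the two envelopes
    morphed['env'] = (1 - alpha['env']) * user['env'] + alpha['env'] * target['env']
    return morphed
```

With alpha['amp'] = 0 and unvoiced frames passed through untouched, this reproduces the default behaviour described above, where the user keeps his/her own amplitude and consonants.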

system. A major improvement to SMS has been the real-time implementation of the whole analysis/synthesis process, with a processing latency of less than 30 milliseconds, tuned to the particular case of the singing voice. This has required many optimizations in the analysis part, especially in the fundamental frequency detection algorithm. These improvements were mainly made in the pitch candidate search process, in the peak selection process, in the fundamental frequency tracking process, and in the implementation of a voiced-unvoiced gate. Another important set of improvements to SMS relates to the incorporation of a higher-level analysis step that extracts the parameters that are most meaningful to morph. Attributes that are important to be able to interpolate between the user's voice and the target's voice in a karaoke application include spectral shape, fundamental frequency, amplitude, and residual signal. Others, such as pitch micro-variations, vibrato, spectral tilt, or harmonicity, are also relevant for various steps in the morphing process or for performing other sound transformations in parallel to the morphing. For example, by transforming some of these attributes we can achieve voice effects such as Tom Waits' hoarseness.

4.3 Phonetic recognition/alignment

This part of the system is responsible for recognizing the phoneme that is being uttered by the user, and also its musical context, so that a similar segment can be chosen from the target information. There is a huge amount of research in the field of speech recognition. Recognition systems work reasonably well when tested in the well-controlled environment of the laboratory. However, phoneme recognition rates decay miserably when the conditions are adverse. In our case, we need a speaker-independent system capable of working in a bar with a lot of noise, loud music being played, and not very high-quality microphones. Moreover, the system deals with the singing voice, which has never been worked on and for which there are no available databases. It also has to work with very low delay: we cannot wait for a phoneme to be finished before we recognize it, and we have to assign a phoneme to each frame.

Fig 4.2 Recognition and matching of morphable units.

This would be a rather impossible/impractical problem if it were not for the fact that we know the words beforehand, namely the lyrics of the song. This removes a big portion of the search problem: all the possible paths are restricted to just one string of phonemes, with several possible pronunciations. The problem then reduces to locating the phoneme in the lyrics and placing the start and end points. We have incorporated a speech recognizer based on phoneme-based discrete HMMs that handles musical information and that is able to work with very low delay. The details of the recognition system can be found in another paper by our group. The recognizer is also used in the preparation of the target audio data, to fragment the recording into morphable units (phonemes) and to label them with the phonetic transcription and the musical context. This is done out of real time for better performance.
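Because the phoneme string is fixed by the lyrics, recognition reduces to forced alignment through a left-to-right chain of phoneme models. The sketch below illustrates the idea with a simple Viterbi pass over per-frame acoustic scores; the scoring function and the transition penalties are assumptions, since the actual recognizer uses discrete HMMs described in a separate paper.

```python
import numpy as np

def forced_alignment(log_probs, self_loop=-0.1, advance=-2.3):
    """Align frames to a known phoneme string (the lyrics).

    log_probs : (T, S) log-likelihood of each of T frames under each of
                the S phonemes of the lyrics, in lyric order
    Returns an array of length T mapping each frame to a phoneme index.
    Assumes at least as many frames as phonemes (T >= S).
    """
    T, S = log_probs.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, 0]          # must start at the first phoneme
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + self_loop           # remain in phoneme
            move = score[t - 1, s - 1] + advance if s > 0 else -np.inf
            score[t, s] = max(stay, move) + log_probs[t, s]
            back[t, s] = s if stay >= move else s - 1
    # Backtrack from the last phoneme at the last frame
    path = np.zeros(T, dtype=int)
    path[-1] = S - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

Each frame is thus assigned a phoneme as soon as its scores are available, which is what allows the low-delay, frame-by-frame operation described above.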

4.4 Morphing

Depending on the phoneme the user is singing, a unit from the target is selected. Each frame from the user is morphed with a different frame from the target, advancing sequentially in time. The user then has the choice to interpolate the different parameters extracted at the analysis stage, such as amplitude, fundamental frequency, spectral shape, and residual signal. In general, the amplitude will not be interpolated, thus always using the amplitude from the user, and the unvoiced phonemes will also not be morphed, thus always using the consonants from the user. This gives the user the feeling of being in control. In most cases the durations of the user and target phonemes to be morphed will be different. If a given user's phoneme is shorter than the one from the target, the system will simply skip the remaining part of the target phoneme and go directly to the articulation portion. In the case where the user sings a longer phoneme than the one present in the target data, the system enters loop mode. Each voiced phoneme of the target has a loop-point frame, marked in the preprocessing, non-real-time stage. The system uses this frame for loop synthesis in case the user sings beyond that point in the phoneme. Once we reach this frame in the target, the rest of the frames of the user will be interpolated with that same frame until the user ends the phoneme. This process is shown in the figure below.

Fig 4.3 Loop synthesis diagram.

The frame used as a loop frame requires a good spectral shape and, if possible, a pitch very close to the note that corresponds to that phoneme. Since we keep a constant spectral shape, we have to do something to make the synthesis sound natural. The way we do it is by using some natural templates obtained from the analysis of a longer phoneme, which are then used to generate, out of the loop frame, more target frames to morph with. One feature that adds naturalness is the pitch variation of a steady-state note sung by the same target. These delta pitches are kept in a lookup table whose first access is random; after that we just read consecutive values. We keep two tables, one with variations of steady pitch and another one with vibrato, to generate target frames. Once all the chosen parameters have been interpolated in a given frame, they are added back to the basic SMS frame of the user. The synthesis is done with the standard synthesis procedures of SMS.
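A small sketch of the loop-mode logic just described follows. The frame representation and table contents are illustrative assumptions; the report specifies only that the first read of the delta-pitch table is random and that subsequent reads are consecutive.

```python
import numpy as np

class LoopSynthesizer:
    """Generate target frames past the loop point of a phoneme by
    reusing the loop frame with natural pitch variations."""

    def __init__(self, loop_frame, steady_deltas, vibrato_deltas):
        self.loop_frame = loop_frame        # SMS frame marked at the loop point
        self.tables = {'steady': np.asarray(steady_deltas),
                       'vibrato': np.asarray(vibrato_deltas)}
        self.table = self.tables['steady']
        self.pos = 0

    def start(self, mode='steady'):
        # First access into the chosen delta-pitch table is random...
        self.table = self.tables[mode]
        self.pos = np.random.randint(len(self.table))

    def next_frame(self):
        # ...then consecutive values are read to vary the loop frame's pitch,
        # so the held note does not sound frozen.
        delta = self.table[self.pos]
        self.pos = (self.pos + 1) % len(self.table)
        frame = dict(self.loop_frame)       # keep the constant spectral shape
        frame['f0'] = frame['f0'] + delta   # apply the natural pitch variation
        return frame
```

Each generated frame is then interpolated with the user's frames exactly as in normal morphing, until the user ends the phoneme.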
