Abstract: The research presents the capability of a Hidden Markov Model-based TTS system to produce
Marathi speech. In this synthesis method, routes of speech parameters are generated from the trained Hidden
Markov Models. A final speech waveform is synthesized from those speech parameters. In our experiments,
spectral properties were represented by Mel Cepstrum Coefficients. Both the training and synthesis issues are
investigated in this research using marked Marathi speech database. Experimental evaluation depicts that the
developed text-to-speech system is accomplished of producing effectively regular speech in terms of
intelligibility and pitch for Marathi.
Keywords: Marathi Speech Synthesis, Text-To-Speech (TTS), Hidden-Markov-Model (HMM), Marathi HTS.
I.
Introduction
A speech synthesis system is a computer-based system that produce speech automatically, through a
grapheme-to-phoneme transcription of the sentences and prosodic features to utter. The synthetic speech is
generated with the available phones and prosodic features from training speech database [1][2][21]. The speech
units is classified into phonemes, diaphones and syllables. The output of speech synthesis system differs in the
size of the stored speech units and output is generated with execution of different methods. A text-to-speech
system is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it
converts raw text containing symbols like numbers and abbreviations into the equivalent words. This process is
often called text normalization, preprocessing, or tokenization. Second task is to assigns phonetic transcriptions
to each word, and divides and marks the text into prosodic units like phrases, clauses, and sentences. Although
text-to-speech systems have improved over the past few years, some challenges still exist. The back end phase
produces the synthesis of the particular speech with the use of output provided from the front end. The symbolic
representations from first step are converted into sound speech and the pitch contour, phoneme durations and
prosody are incorporated into the synthesized speech.
The paper is structured in five sections. The techniques of speech synthesis are described in section 2.
Database for synthesis system is explained in section 3. Section 4 explains speech quality measurement. Section
5 is dedicated with experimental analysis followed by conclusion.
The first obligation of a text-to-speech (TTS) system is intelligibility and the second one is the
naturalness. Actually the concept of naturalness is not to restitute the reality but to suggest it. Thus, listening to a
synthetic voice must allow the listener to attribute this voice to some pseudo-speaker and to perceive some kind
of expressivities as well as some indices characterizing the speaking style and the particular situation of
elocution [3]. Modern speech synthesizers are able to achieve high intelligibility. However, they still suffer from
a rather unnatural speech. Recently, to increase the naturalness, there has been a noticeable shift from di-phone
based towards corpus-based unit selection speech synthesis observed [4]. Between these corpus-based unit
selection technique is by far the best for producing the natural speech. But it requires a large database often in
size of gigabyte. There are many corpus based and di-phone based TTS available for different Indian languages
but those are still not reaches the acceptable quality of naturalness. More over the same has not yet been
implemented for resource-limited or embedded devices such as mobile phones.
In this view, Hidden Markov Models (HMMs) have proven to be an efficient parametric model of the
speech acoustics in the framework of speech synthesis because of its small database size and ability to produce
intelligent and natural speech. Although having been originally implemented for Japanese language, the HMM
based speech synthesis (HSS) [5] approach has also been applied to other languages, e.g., English [6], German
[7], Portuguese [8], Chinese [9], etc. Input contextual labels and questions for context clustering are the only
language dependent topics in the HSS scheme. This paper describes first experiments on statistical parametric
HMM-based speech synthesis for the Marathi language. For building of our experimental TTS system, HTS
toolkit is employed.
DOI: 10.9790/4200-05613439
www.iosrjournals.org
34 | Page
Basic System
The HMM-based speech synthesis technique comprises training and synthesis parts, as depicted in Figure 1.
III.
The speech database research lab here for training purpose is originally developed by IIT-Hyderabad
[14]. Overall 1000 sentences are used for training which consist of 12 type of sentences i.e. Complex affirmative,
Complex negative, Simple affirmative with verb, Simple affirmative without verb, Simple negative, Compound
affirmative, Compound Negative, Exclamatory, Imperative, Passive, WH questions, Yes-No questions.
All the sentences were tagged with marking of phoneme, syllable, and word boundaries along with the
appropriate Parts of Speech (POS) and phrase/clause markers. During the training prosodic word boundary are
used as word boundary instead of syntactic word boundary. Those prosodic word labeling was carried out
manually.
DOI: 10.9790/4200-05613439
www.iosrjournals.org
35 | Page
TABLE-3:M ARATHI PHONEME NUMBER INVENTORY CONSISTS OF 11 NUMBERS INCLUDING TWO GLIDES AND 3
SAMPLE TEXT IN M ARATHI
1) Numbers
3.2. Tone
In Marathi language tone is not phonemically significant. In a simple declarative sentence with neutral
focus, most words and/or phrases in Marathi is said to carry a rising tone with the exception of the last word in
the sentence, which carries only a low tone. In that sense Marathi is called bound stress language. In a
declarative sentence with neutral focus, the intonation pattern is falling. In sentences involving focused words or
phrases, the rising tones last until the right edge of the focused word; all following words carry a low tone. WH
sentences follow the same intonation pattern as the sentences involving focused words, but in yes-No sentences
DOI: 10.9790/4200-05613439
www.iosrjournals.org
36 | Page
IV.
In order to generate the context-dependent label format from the given text first POS is marked. As
existing automatic POS tagging for Marathi language is not up to the mark so it is done manually. Since the
training is done based on the prosodic word, the input text prosodic word labeling is performed based on the rule
as describe below [17].
TABLE 3: LIST OF THE CONTEXT FEATURES
Units
Phoneme
Syllable
Prosodic
Word
Phrase
Utterance
Features
-{preceding, current, succeeding} phonemes
- position of current phoneme in current syllable
-whether or not {preceding, current, succeeding} syllables are stressed
-number of phonemes in {preceding, current, succeeding} syllables
- position of current syllable in current word
-number of stressed syllables in current phrase { before, after} current syllable
-number of syllables, counting from previous stressed to current syllable in the utterance
-number of syllables, counting from current to next stressed syllable in the utterance
-part-of-speech of {preceding, current, succeeding} words
-number of syllables in {preceding, current, succeeding} words
- position of current word in current phrase
-number of content words in current phrase {before,
after}
current word
-number of words counting from previous content
word to current word in the utterance
-number of words counting from current to next content word in the utterance
-number of {syllables, words} in {preceding, current, succeeding} phrases
- position of current phrase in current utterance
- TOBI endtone of current phrase
-number of {syllables, words, phrases} in the utterance
www.iosrjournals.org
37 | Page
Evaluation
Figure- 3 and Figure- 4 show a comparison of spectrogram and F0 patterns between synthesized and
original speech signals for a given sentence which is not included in the training database. It can be noticed that
the generated spectrogram and F0 contour are quite close to the natural.
(a)
(b)
Figure- 3: Examples of spectrogram extracted from utterance
( a) Natural speech, (b) Synthesized speech
For subjective evaluation of the output speech quality 5 subjects, 3 male (L1, L2, L3) and 2 female (L4,
L5), are selected and their age ranging from 24 to 50. All subjects are native speakers of Standard Colloquial
Marathi and non-speech expert. 10 original and synthesized sentences are randomly presented for listening and
their judgment in 5 point score (1=less natural 5= most natural).
Table 4 represents the tabulated mean opinion scores for all sentences of each subject for original as
well as modified sentences.
HTS
DOI: 10.9790/4200-05613439
P1
P2
P3
P4
P5
Avg
5.9
5.9
5.9
5.9
5.9
Stdev
1.31
1.05
1.48
1.31
1.70
Avg
2.5
2.1
2.3
2.6
2.2
Stdev
1.2
1.7
1.6
1.5
1.7
Avg
3.7
3.2
3.9
3.8
3.4
Stdev
1.4
1.7
1.5
1.4
1.6
www.iosrjournals.org
38 | Page
VI.
The evaluation results show the efficacy of HMM based Marathi TTS system for generation of highly
intelligible speech with naturalness although the training corpus is not made for development of TTS system. In
future the above TTS system will be trained by the appropriate training corpus for better quality of output. It
was observed during the testing that the intonation of WH and Yes/no sentences was not good in spite of the
presence of WH and Yes/no sentences in the training corpus. In future derived F0 contour from the training
model can be corrected as per the language input with the help of Fujisaki generation process F0 model.
References
[1].
[2].
[3].
[4].
[5].
[6].
[7].
[8].
[9].
[10].
[11].
[12].
[13].
[14].
[15].
[16].
[17].
[18].
[19].
[20].
[21].
[22].
Mohammed Waseem, C.N Sujatha, Speech Synthesis System for Indian Accent using Festvox, International journal of Scientific
Engineering and Technology Research, ISSN 2319-8885 Vol.03,Issue.34 November-2014, Pages:6903-6911
Sangramsing Kayte, Kavita waghmare ,Dr. Bharti Gawali Marathi Speech Synthesis: A review International Journal on Recent
and Innovation Trends in Computing and Communication ISSN: 2321-8169 Volume: 3 Issue: 6 3708 3711.
Campbell, N., Black, A.: Prosody and the Selection of Source Units for Concatenative Synthesis, J. van Santen, R. Sproat, J. Olive
and J. Hirschberg (Eds.), in Progress in Speech Synthesis, pp. 279282, Springer Verlag, 1996.
Deketelaere S., Deroo O., Dutoit T., Speech Processing for Communications: Whats New?(*) MULTITEL ASBL, 1 Copernic
Ave, Initialis Scientific Park, B-7000 MONS(**) Facult Polytechnique de Mons, TCTS Lab, 1 Copernic Ave, Initialis Scientific
Park, B-7000 MONS.
T. Yoshimura, K. Tokuda, T. Masuko, T.Kobayashi and T.Kitamura, "Simultaneous Modeling of Spectrum, Pitch and Duration in
HMM Based Speech Synthesis," Proc. of EUROSPEECH, vol.5, pp.23472350, 1999.
K. Tokuda, H. Zen, and A. W. Black, An HMM-based speech synthesis applied to English, in IEEE Workshop in Speech
Synthesis, 2002.
S. Krstulovic, A. Hunecke, M. Schroeder, An HMM-Based Speech Synthesis System applied to German and its Adaptation to a
Limited Set of Expressive Football Announcements, Proc. of Interspeech, 2007.
Maia, R., Zen, H., Tokuda, K., Kitamura, T., Resende Jr., F.G.,Towards the development of a Brazilian Portuguese text-tospeech
system based on HMM, Eurospeech, 2003.
Qian, Y., Soong, F., Chen, Y., Chu, M.: An HMM-based Mandarin Chinese text-to-speech system. In: Q. Huo et al. (eds.)
ISCSLP 2006, LNAI, vol. 4274, pp. 223-232. Springer, Heidelberg (2006).
T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in ICASSP, 1992
Tokuda, K., Masuko, T., Miyazaki, N., and Kobayashi, T., Multi-space Probability Distribution HMM, IEICE Trans. Inf. & Syst.,
E85-D(3):pp. 455-464, 2002.
Shinoda, K. and Watanabe, T., Acoustic Modeling Based on the MDL Principle for Speech Recognition, Proc. EuroSpeech 1997 ,
pp. 99-102.
K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech synthesis generation algorithms for HMMbased
speech synthesis," in ICASSP, 2000.
Shyamal Das Mandal, Arup Saha, A.K.Datta Annotated Speech Corpora Development in Indian languages Vishwa Bharat vol. 16,
pp49-64, January 2005.
Basu, J., Basu, T., Mitra, M., Mandal, S. 2009. Grapheme to Phoneme (G2P) conversion for Bangla. Speech Database and
Assessments, Oriental COCOSDA International Conference, pp. 66-71.
Hayes, B. and Lahiri, A., Bengali Intonational Phonology, Natural Language and Linguistic Theory, Springer Science, pp 5658 ,
1991.
Shyamal Kumar Das MandaI and Asoke Kumar Datta,. 2007. Epoch synchronous non-overlap-add (ESNOLA) method-based
concatenative speech synthesis system for Bangla. 6th ISCA Workshop on Speech Synthesis, Germany, pp. 351-355.
S. K. Das Mandal, A. H.Warsi, T. Basu, K. Hirose, and H.Fujisaki, Analysis and Synthesis of F0 Contours for Bangla Readout
Speech, Proc. of Oriental COCOSDA 2010, Kathmandu, Nepal, 2010.
Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali "Classification of Fluent and Dysfluent Speech Using KNN Classifier"
International Journal of Advanced Research in Computer Science and Software Engineering Volume 4, Issue 9, September 2014
(IMPACT FACTOR: 2.080)
Monica Mundada, Bharti Gawali, Sangramsing Kayte "Recognition and classification of speech and its related fluency disorders"
International Journal of Computer Science and Information Technologies (IJCSIT) (IMPACT FACTOR: 3.32)
Sangramsing Kayte, Monica Mundada "Study of Marathi Phones for Synthesis of Marathi Speech from Text" International Jo urnal
of Emerging Research in Management &Technology ISSN: 2278-9359 (Volume-4, Issue-10) October 2015 Impact Factor: 1.492
DOI: 10.9790/4200-05613439
www.iosrjournals.org
39 | Page