Abstract: Currently there is a tremendous increase in electronic text in various languages, including Ethiopian Semitic languages. Yet it is inaccessible to visually impaired people and the illiterate. To come up with high-quality text-to-speech synthesizers for local languages, it is imperative that research in natural language processing and synthetic speech generation be improved. Accordingly, we critically reviewed core issues in text-to-speech synthesis for Ethiopian Semitic languages and found that further research to improve its quality is needed. To optimize the linguistic resources required for high-quality synthetic speech output, we propose designing a generic bilingual text-to-speech framework for Ethiopian Semitic languages.

Keywords: speech synthesis; Amharic; Tigrigna; bilingual TTS; Ethiopian Semitic languages

I. INTRODUCTION
Text-to-speech (TTS) is the process of converting text to audible speech [1]. Towards this end, research has sought to achieve the ultimate goal of speech synthesis: mimicking the human speech production system [2]. The state of the art shows that researchers are now aiming at more natural-sounding and intelligible speech output [1].

The TTS problem integrates two sub-problems: a text analysis phase and a speech generation phase [1]. Text analysis is the process of determining the underlying structure of the sentence and the phonemic composition of each word. The text analysis phase involves text preprocessing (text normalization), word pronunciation (syllabification), phrasing intonation (identifying phrase patterns), segmental duration, and computation of the intonation contour from phonological representations of the text [3]. Real-world text contains non-standard words (NSWs) such as numbers, acronyms and abbreviations in addition to standard words; thus a normalization process that expands each NSW into its pronounceable form is required [4].

Speech generation transforms the abstract linguistic representation of text into a speech waveform. The speech generation phase is responsible for the phonetic realization of each phoneme, that is, conversion of the phonemic representation of the text into its spoken counterpart or speech sound [1]. The speech synthesis part is also concerned with selection and concatenation of appropriate speech units [3].

There is a wide range of problems in TTS [4]. One of the challenges is pre-processing of non-standard words such as numerals, abbreviations and acronyms [2]. While standard words have a specific pronunciation that can be described phonetically by letter-to-phone (L2P) rules, the pronunciation of non-standard words is not straightforward, as the written form is far different from the spoken form [4][1]. There are also problems related to correct prosody (rhythm and melody) and pronunciation analysis of written text [2]. Generating the full range of prosodic effects from text is challenging because text encodes the verbal message that the author generated, whereas prosody reflects the intention of the author. Thus, extraction of prosodic information embedded in the context requires learning about the prosodic features in context [1].

II. ETHIOPIAN SEMITIC LANGUAGE
The Ethiopian Semitic languages are a subfamily of South Semitic, which in turn is a subfamily of West Semitic within the Semitic branch of the Afro-Asiatic superfamily. The Ethiopian Semitic languages include Geez, Tigrigna, Tigre, Amharic, Argoba, and others [5]. The language family basically uses the same writing system, with slight modifications needed to fit the characteristics of each language. In this section, we compare and contrast Amharic and Tigrigna, the most widely spoken Semitic languages in Ethiopia, with specific emphasis on phonetic composition and syllable structure.

A. Phone-sets of Amharic and Tigrigna
Amongst Ethiopian Semitic languages, Amharic and Tigrigna share common characteristics derived from the same Geez root [5].

According to Girmay, Tigrigna has 29 consonantal phonemes and seven vowels [6]. This count excludes the labiovelars [gw], [kw], [kw'] and the fricatives [x] and [x']. Girmay argues that the labiovelars are derivable and that the fricatives are allophones of [k] and [k'], respectively [6]. According to other researchers, however, these labiovelars and fricatives, as well as the phoneme [v], are included in the consonant chart of Tigrigna, as a result of which the number of consonants would be 37 [7][8].
Manner        Voicing      Consonants
Stop          Voiceless    p    t    k    kw    I
              Ejective     p'   t'   k'   kw'
Fricative     Voiced       v    z    z
              Voiceless    f    s    sh   x    xw    h
              Ejective     s'   x'   xw'
Affricate     Voiced       dz
              Voiceless    c
              Ejective     c'
Nasal                      m    n    n
Approximant                w    j
Liquid                     l    r

Ethiopia being a multilingual and multinational country, its constitution decrees that each nation, nationality and people has the right to speak, write and develop its language [13]. Accordingly, many written documents and textbooks are being produced in the respective working languages of the regional states. In addition to textbooks, many journals, magazines, newspapers and novels are available in Amharic and Tigrigna. Furthermore, Ethiopian Semitic languages are under-resourced, so to optimize the utilization of linguistic resources there is a need for a generic bilingual text-to-speech synthesis framework for these languages. Thus, the purpose of this study is to design a novel bilingual text-to-speech synthesis framework for Ethiopian Semitic languages.

IV. REVIEW OF RELATED WORK
There are a few research works based on concatenative, formant and statistical parametric methods for Ethiopian languages. Laine [14] developed the first text-to-speech system for Amharic using diphone-based concatenative synthesis. The tools used were Pascal and MATLAB, and the evaluation was reported as good. Furthermore, Laine noted that prosodic information was not considered in his work.

Henock [15] made the next attempt, applying concatenative speech synthesis to Amharic. The tools and techniques included TD-PSOLA for smoothing, PRAAT for spectrographic analysis, and Delphi and MATLAB for prototype development. The evaluation used the Open Rhyme Test (ORT) and the Mean Opinion Score (MOS), and the result was reported as promising.

Tesfay [16] attempted the first text-to-speech system for Tigrigna using a diphone-based concatenative approach with MATLAB. The performance of the synthesizer, measured using MOS, was 3.05. Inclusion of an acronym converter in the text processing module and prosody control are among the things that need to be researched further [16].

Nadew [17] implemented formant-based speech synthesis for Amharic vowels using MATLAB. The focus was on vowels, since vowels play a big role in changing the pronunciation of a word in different contexts. The result indicated an intelligibility of 88.85% for isolated vowels. Nadew recommended refinement of the work, including consideration of consonants and preparation of an appropriate speech corpus [17].

Bereket [18] modeled an HMM-based speech synthesizer with the objective of developing an unlimited-domain speech synthesizer for Amharic that can generate natural-sounding and intelligible synthetic speech with fewer resource requirements. Out of a corpus of 11,670 sentences, 500 sentences were used to train the HTS and 20 sentences were used for testing. The performance reported was a MOS of 4.12 for intelligibility and 3.6 for naturalness. As future work, Bereket recommended inclusion of prosodic information to identify dialects and word meanings [18].

The experiment by Alula [19] shows the possibility of including non-standard words (NSWs) in Amharic TTS. Alula applied Festival diphone-based unit concatenative synthesis and RELP (Residual Excited Linear Predictive) coding. The performance of the synthesizer was a MOS of 3.0 for intelligibility and 2.83 for naturalness. Alula recommended consideration of all types of NSWs and incorporation of a part-of-speech (POS) tagged corpus for prosody control as future work [19].

The few attempts made so far, which mostly test the possibility of developing TTS synthesizers for Ethiopian languages, were done in a fragmented manner. Hence, there is a need for a generic TTS synthesizer that would optimize linguistic resources. Accordingly, the level of TTS quality for Ethiopian Semitic languages needs further improvement. To this end, the core issues that need to be addressed include homograph disambiguation, extracting prosodic information from text, and incorporation of non-standard words in the development of a generic bilingual TTS.

These sources are among the widely used public media that publish on economics, politics, sport, social and technology issues, among others.

The collected text data is further preprocessed so that the text is composed of standard words from which punctuation marks are removed. Python is used for data cleaning and sentence-level tokenization.

B. Implementation
To explore the extent to which an existing speech synthesis system such as Festival can be used for Ethiopian Semitic languages such as Amharic and Tigrigna, a Linux environment was set up. With selected words and phrases we tested the performance of the synthesizer with default parameters. As shown in Table II, ten words and phrases were selected that are composed of different numbers of syllables, the shortest being a one-syllable word and the longest a ten-syllable phrase.

To make the text input understandable to Festival, transliteration of the Ethiopic orthography was necessary. Accordingly, we developed a transliteration algorithm and implemented it in Python 2.7.8. The transliterated version of the text was then used as input to the synthesizer. The speech output corresponding to the text input was then saved as a .wav file for later evaluation by listeners. The preliminary experimentation shows a promising result, enabling the researchers to identify the points where the existing speech synthesis system fails to capture parts of Ethiopian Semitic languages.

C. Experimental Results and Discussion
The performance of the prototype developed during experimentation is evaluated by means of an intelligibility measure, which tells the extent to which the synthetic speech is comprehensible. Since the speech output is based on default parameters, which make use of foreign pronunciation, a naturalness test was not conducted.

As shown in Table II, the intelligibility of the speech output was evaluated by two bilingual speakers of the two languages, who were asked to write down what they perceived while listening to the speech output. The average number of syllables correctly perceived was then computed against the number of syllables in the text to get the percentage accuracy. The average accuracy of 86.03% makes visible the points at which the existing system has difficulty, namely pronouncing t, k, s and H, which are among the unique phonemes of Amharic and Tigrigna. Moreover, the allophones present in Amharic and Tigrigna are not recognized by the existing synthesis system; thus, there is a need to create voices for these languages to improve the quality of the synthesized speech in terms of both intelligibility and naturalness.
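
The syllable-level intelligibility measure described above can be illustrated with a minimal Python sketch. It assumes the percentage is pooled over all test items, with each item's score being the mean number of correctly perceived syllables across listeners; the text does not state whether the reported figure was pooled or averaged per item. The function name and the sample counts are hypothetical and are not taken from Table II.

```python
def intelligibility_accuracy(items):
    """Return the percentage of syllables correctly perceived.

    items: list of (n_syllables, correct_counts) pairs, where
    correct_counts holds one count per listener for that item.
    Assumption: accuracy is pooled over all items, not averaged per item.
    """
    total_syllables = 0
    perceived = 0.0
    for n_syllables, correct_counts in items:
        if n_syllables <= 0 or not correct_counts:
            raise ValueError("each item needs syllables and listener counts")
        total_syllables += n_syllables
        # Mean number of correctly perceived syllables across listeners.
        perceived += sum(correct_counts) / float(len(correct_counts))
    return 100.0 * perceived / total_syllables


if __name__ == "__main__":
    # Hypothetical responses from two listeners for three test items.
    sample = [(1, [1, 1]), (4, [3, 4]), (10, [7, 8])]
    print("%.2f%%" % intelligibility_accuracy(sample))
```

Pooling over syllables gives longer phrases more weight than single words; averaging per-item percentages instead would weight every item equally and can give a different figure for the same responses.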