Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Available online 3 August 2009

Abstract: A speech synthesizer based on an artificial neural network (ANN) is being developed for application to deeply embedded systems for language-independent speech commands on hands-free interfaces. A feed-forward, backpropagation artificial neural network has been trained for this purpose using a custom-developed, regular-expression-based, text-to-phone transcription engine to generate training patterns. Initial experimental results show the expected properties of language independence and in-system learning capability of this approach. The ANN demonstrates the capacity to generalize and map the words missing at training time, as well as to reduce contradictions related to different pronunciations for the same word.

Keywords: Text-to-speech; Rule-based ANN training; Window alignment; Speech synthesis

© 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2008.08.023
88 M. Malcangi, D. Frontini / Neurocomputing 73 (2009) 87–96
articulatory information embedded in human uttered speech. Such information is difficult to model mathematically.

Language modeling and rule-based representation of speech-production mechanisms attained successful results in the seventies, when industrial synthesizers were introduced to the mass market. However, few implementations were carried out because electronic technology was not adequate to run language- and rule-based speech-production models.

In the days when desktop computing had meager computing power and memory storage, many research and development efforts sought optimal solutions to speech-synthesis problems, such as text-to-phoneme conversion based on neural networks and phoneme-based speech-synthesizer circuits. When digital signal-processor (DSP) chips were introduced in 1980, hardware implementation of speech synthesis was neglected and firmware-based speech synthesizing became a more interesting topic for researchers and developers. Because memory was a scarce resource, great effort was devoted to compressing speech data.

Over the last decade, the widespread availability of memory in desktop and laptop computers has driven speech-synthesis research to aim for high-quality speech production, with engineers and system designers paying little attention to system optimization. Naturalness has been attained by using redundancy as the primary solution to the many hurdles in producing quality speech, thanks to the assumption that vast resources (memory, computing power, etc.) are available at low cost.

The next wave of computing and communication technology exhibits a clear trend toward embedded computing solutions based on high-density integration technologies such as system-on-chip (SoC), system-on-package (SoP), etc. These emerging system technologies are very powerful, but not redundant in terms of memory and computing power. Speech synthesis needs a systemic approach to achieve new results befitting this new scenario.

Today, embedded computing (cellular telephones, PDAs, MP3 players, etc.) is more readily available than desktop computing. Speech-synthesis applications mostly target embedded and deeply embedded systems, especially to implement hands-free interfaces for applications that run on such systems.

Speech synthesis becomes a very complex task if the main goal is to implement it on a deeply embedded system with real-time, unlimited-vocabulary, speaker-independent, and language-independent specifications. Current high-quality speech-synthesis solutions, due to their reliance on substantial processing resources, cannot be scaled down to satisfy the emerging demands of embedded systems in every application field where human-to-machine interaction is required.

The artificial neural-network (ANN) approach [3,8] to speech synthesis [11,20] can optimally solve several implementation and application problems, primarily because it is closer to the process to be emulated, i.e. the human ability to communicate by means of voice and language. Moreover, although the linguistic approach to natural speech synthesis has proven effective, the task of generating rules is time-consuming and tedious, so an ANN may be the most suitable solution.

Sejnowski and Rosenberg [24] were the first researchers to demonstrate that an ANN could be successfully applied to the speech-synthesis challenge. They demonstrated that a three-layer, backpropagation network (BPN) can be successfully trained to convert text into phonemic parameters to drive an articulatory speech synthesizer.

The most interesting result was achieved by observing the nature of the speech produced at incremental stages during the training phase. The utterance evolves, as in children during their learning stages. The ANN proved capable of learning about pronunciation rules, as well as about other embedded information, such as articulatory effects and inflection. Moreover, the ANN was able to produce appropriate pronunciation even when a specific rule had not been furnished during its training stage. Since that demonstration, many new ANN approaches to speech synthesis have been proposed [9,16,22,23].

The main advantage of applying an ANN to speech synthesis is its ability to learn to speak just as a human does. This means that it can learn to speak any language. The ANN engine is the same for any language it is trained in, so there is no coding dependency: an ANN trained for a specific language is only data dependent.

The traditional approach to TTS synthesis requires a large amount of data to represent all the knowledge about how a word has to be converted into a correct speech-synthesizer control stream. The ability of an ANN to generalize reduces the amount of such data and also performs a smoothing action on the output, resulting in more natural speech synthesis [2,4].

In the following sections, this paper presents the system framework, ANN architecture, training strategy, and performance evaluation. A brief concluding section summarizes the work and future plans.

2. System framework

The main idea in developing our system framework is to set up an almost fully automatic process to generate a ready-to-run, ANN-based speech synthesizer trained for a specific language, starting from ASCII text.

Each language is represented by a set of rules that encode all the information needed to correctly pronounce each word in that language. This set of rules is used to generate, from any text, the training patterns to set up the ANN-based speech synthesizer.

A special-purpose development environment (Fig. 1) was designed for this purpose. It consists of four functional blocks, each executing a whole task that helps develop, train, evaluate, and implement a complete text-to-speech application.

The system framework consists of four main functional modules:

regular-expression-based, text-to-phones translator;
training-set builder;
ANN engine; and
speech synthesizer.

Fig. 1. System framework.
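The four framework modules can be viewed as a linear pipeline. The following sketch is illustrative only: the function names, the toy one-character rule table, and the stand-in bodies are our own assumptions, not the framework's actual interfaces.

```python
# Toy pipeline mirroring the four framework modules. All bodies are
# illustrative stand-ins, not the authors' implementation.

def text_to_phones(text, rules):
    # Rule-driven transcription: map each character via a toy rule table.
    return "".join(rules.get(ch, ch) for ch in text.lower())

def build_training_set(text, phones):
    # Pair each letter with its phone (assumes a 1:1 alignment here;
    # the real builder performs rule-based alignment).
    return list(zip(text.lower(), phones))

def train_ann(training_set):
    # Stand-in for ANN training: just memorize the letter-to-phone map.
    return dict(training_set)

def synthesize(phones):
    # Stand-in for the phone-driven speech-synthesizer back end.
    return "<audio:%s>" % phones

rules = {"c": "k"}                      # hypothetical Italian-like rule
phones = text_to_phones("CASO", rules)  # -> "kaso"
model = train_ann(build_training_set("CASO", phones))
audio = synthesize(phones)
```

Swapping the `rules` table (and, in the real system, the trained weights) is all that changes from one language to another; the pipeline itself stays the same.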
Input text and its phonetic transcription are the data input for
the training-set builder. This automatically processes the phonetic
transcription of the text used to train the ANN, generating the
appropriate training patterns to tailor the ANN for this specific
application.
One critical task executed by this ANN's training-set builder is to align text with its phonetic transcription. Manual alignment is superfluous thanks to rule-based alignment of text with phonetic transcription.

Fig. 2. JOONE.
The rule is thus valid for several text strings, such as BIO…, BION…, BIOG…, BIANNUAL, etc.

Rules are grouped in sublists containing all the rules with the same initial character in the context to be matched. Each sublist is internally ordered by specificity of the rules, with the most specific first, the most general last, and the last rule formulated as context-independent, so that it always matches:

!(B)! = /b/i:/
…
!(BI)# = /b/a/I/
…
(B) = /b/          (4)

The regular-expression-based, text-to-phones translator is a general-purpose text-processor engine applicable to any written text in any language. It depends only on the rule set and on the classes defined for that language. The initial character of the substring to be processed is located on the appropriate sublist, which is scanned from top to bottom until the appropriate rule matches. The rule is then applied. Finally, the related phone sequence is appended to the output string, where digits "1" and "2" preceding the phone symbol indicate its degree of intensity.

The union of all the left contexts of this group of rules is the universal set. The last rule always matches even if none of the preceding rules match.

4. ANN architecture

The ANN has a three-layer, feed-forward, backpropagation architecture (FFBP-ANN), similar to that used in NETtalk [24]. Its inputs are fully connected to all the nodes in the hidden layer, and the hidden layer is fully connected to the output nodes (Fig. 4). Text is fed to units in the input layer. Phonetic transcription comes out of the nodes at the output layer. All inputs and outputs have a linear activation function that controls the connection. A non-linear activation function (sigmoid) connects hidden-layer nodes to output-layer nodes as follows:

s_i = 1 / (1 + e^(-I_i)),    I_i = Σ_j w_ij s_j          (6)
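The sublist lookup described above can be sketched in a few lines. The rule encoding (left-context regex, matched substring, right-context regex, output phones) and the sample patterns are our reading of the paper's `!(B)!`-style notation, not its actual engine or rule set; in particular, the `!` word-boundary marker is rendered here as `\A`/`\Z` anchors.

```python
import re

# Each rule: (left-context regex, substring, right-context regex, phones).
# Sublists are keyed by the substring's initial character and ordered from
# most specific to most general; the last rule is context-independent, so
# a match is always guaranteed. The patterns are illustrative only.
RULES = {
    "b": [
        (r"\A", "BI", r"\Z", "/b/a/I/"),  # the word "BI" on its own
        (r"",   "B",  r"",   "/b/"),      # catch-all: always matches
    ],
}

def transcribe_at(word, pos):
    """Scan the sublist top to bottom; return (phones, chars consumed)."""
    for left, sub, right, phones in RULES[word[pos].lower()]:
        if (word.startswith(sub, pos)
                and re.search(left + r"\Z", word[:pos])   # left context ends at pos
                and re.match(right, word[pos + len(sub):])):  # right context follows
            return phones, len(sub)
    raise AssertionError("unreachable: the last rule is fully general")
```

For example, `transcribe_at("BI", 0)` matches the specific first rule, while `transcribe_at("BIANNUAL", 0)` falls through to the general `(B) = /b/` rule.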
that the ANN can be trained for any language, regardless of its peculiarities.

The backpropagation algorithm was used as the learning algorithm to train our FFBP-ANN, minimizing the average squared error between the actual output at the i-th neuron of the output layer and the target output value:

E = Σ_{i=1}^{N} (s_i - t_i)^2          (8)

where N is the total number of units in the output layer.

To avoid oscillations during the training phase, to amplify the learning rate, and to allow escaping from local minima, a momentum-term, modified learning algorithm was applied:

w_{k+1} = w_k - α g_k + β (w_k - w_{k-1})          (9)

Fig. 8. Training-set generation process. [Figure: each letter of an example word is paired with a phone symbol and a flag, e.g. a → X,0; g → g,o; g → –,4; l → 1,4; u → U,1; t → t,o; i → –,0; n → N,o; a → e,2; t → t,o; e → –,o; a → @,2; b → b,o; e → x,0; r → r,o; r → –,4; a → e,1; t → s,4; i → –,0; o → x,o; n → n,o]

where o is the right syllable boundary, 4 is the left syllable boundary, 1 is the primary stress, 2 is the secondary stress, and 0 is the tertiary stress.

This training set is very large but not exhaustive. It cannot be automatically generated for any language because it requires a phonetic dictionary for the target language. It also needs manual alignment when word length fails to match phone-string length. In an attempt to achieve fully automated training-pattern generation, the pronunciation-rule set and the text-to-phones algorithm are used to generate the training pattern starting from the word list alone.

The pronunciation-rule set embeds both alignment information and stress information in each rule, so alignment can be carried out automatically during training-pattern generation. Stress and duration
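Equations (6), (8), and (9) can be combined into a minimal single-layer training step. This is an illustrative plain-Python sketch, not the paper's implementation; the toy pattern, weight initialization, and learning parameters are arbitrary.

```python
import math

def forward(w, s):
    # Eq. (6): I_i = sum_j w_ij * s_j ; s_i = 1 / (1 + exp(-I_i))
    return [1.0 / (1.0 + math.exp(-sum(wij * sj for wij, sj in zip(row, s))))
            for row in w]

def squared_error(out, target):
    # Eq. (8): E = sum_i (s_i - t_i)^2
    return sum((si - ti) ** 2 for si, ti in zip(out, target))

def gradient(w, s, target):
    # dE/dw_ij for a sigmoid output layer: 2 (s_i - t_i) s_i (1 - s_i) s_j
    out = forward(w, s)
    return [[2 * (oi - ti) * oi * (1 - oi) * sj for sj in s]
            for oi, ti in zip(out, target)]

def momentum_step(w, w_prev, g, alpha=0.1, beta=0.5):
    # Eq. (9): w_{k+1} = w_k - alpha * g_k + beta * (w_k - w_{k-1})
    return [[wij - alpha * gij + beta * (wij - pij)
             for wij, gij, pij in zip(rw, rg, rp)]
            for rw, rg, rp in zip(w, g, w_prev)]

s, t = [1.0, 0.0, 1.0], [0.9, 0.1]              # one toy input/target pattern
w = w_prev = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]
e0 = squared_error(forward(w, s), t)
for _ in range(500):
    w, w_prev = momentum_step(w, w_prev, gradient(w, s, t)), w
assert squared_error(forward(w, s), t) < e0      # the error decreases
```

The β(w_k − w_{k−1}) momentum term keeps the update moving in the previous direction, which is what damps oscillations and helps the step escape shallow local minima.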
information is encoded in the phone symbol as follows:

/a/ not stressed
/0 a/ stressed
/0 a:/ stressed and long

This solution allows any prosodic information to be encoded into the phone name, so the ANN can learn more about different pronunciations of each phone.

5.2. Aligning patterns

Pattern alignment [13] is automatically solved by looking up the pronunciation rules employed to generate the phonic transcription for each word in the training set. This is defined as the transformation T of the word string (r) into the phonetic string (p) by applying the rule R (see Section 3):

Once alignment is complete, the sliding window algorithm is applied to the transcribed word, generating the following training array (based on a nine-character window):

– – – – – – – – G  /–/
– – – – – – – G O  /–/
– – – – – – G O G  /–/
– – – – – G O G N  /–/
– – – – G O G N A  /g/
– – – G O G N A –  /0 o/
– – G O G N A – –  /J:J/
– G O G N A – – –  /–/
G O G N A – – – –  /a/
O G N A – – – – –  /–/
G N A – – – – – –  /–/
N A – – – – – – –  /–/
A – – – – – – – –  /–/
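The nine-character sliding-window generation shown above can be reproduced with a few lines of code. This is a sketch, using '-' as the padding symbol; the letter-aligned phone list for the example word GOGNA is our reading of the training array in the text.

```python
PAD = "-"

def sliding_windows(word, phones, width=9):
    """Generate (window, phone) training pairs: the window slides one
    position at a time, and the letter under the window center (if any)
    supplies the target phone; padding at the center maps to /-/."""
    half = width // 2
    padded = PAD * (width - 1) + word + PAD * (width - 1)
    pairs = []
    for k in range(len(word) + width - 1):
        window = padded[k:k + width]
        i = k + half - (width - 1)      # index of the letter at the center
        pairs.append((window, phones[i] if 0 <= i < len(word) else "/-/"))
    return pairs

# Letter-aligned phones for GOGNA, following the example in the text.
patterns = sliding_windows("GOGNA", ["/g/", "/0 o/", "/J:J/", "/-/", "/a/"])
# patterns[0] == ("--------G", "/-/"); patterns[4] == ("----GOGNA", "/g/")
```

A five-letter word with a nine-character window yields the thirteen rows shown above, so every letter (and every padding position near the word) crosses the window center exactly once.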
Table 1
Training set consisting of 720 Italian words starting with the two alphabetical characters "C" and "E".

Word        Transcription
CE          /_tS/0 E_/
…           …
CELLA       /_tS/0 E/l:l/a_/
…           …
CETO        /_tS/0 E:/t/o_/
…           …
CELIBATO    /_tS/0 e:/l/i/b/a/t/o_/
…           …
CETRIOLO    /_tS/e/t/4/i/0 o:/l/o_/

The same word (text) can be sent as input to both transcription processes (soft- and hard-computing) and compared at the phonetic level. Utterance-level comparison is also available because the speech synthesizer is integrated into the TXT2SP development environment. This environment also provides the functions needed to set up and train the ANN, as well as to compare ANN performance with the rule-based implementation, using a text file as test input.

6.1. Test planning
The main goal of our research is to explore the ability of an FFBP-ANN to produce text-to-speech at an acceptable level of intelligibility and to match our main embedded-system requirements.
Two main tests were planned. The first was designed to
evaluate whether the FFBP-ANN could correctly transcribe all
the words used at training time. The second was designed to
evaluate whether the FFBP-ANN could correctly transcribe words
not included in the training set. To run the first test, the 720-word
set used at training time was fed to the ANN. For the second, the
ANN was fed a set of exception words not included in the training
set.
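Both tests reduce to measuring exact-match word accuracy of a transcription function against a reference transcription, first over the training words and then over unseen exception words. The sketch below is a stand-in for the evaluation protocol only: the transcriptions reuse entries from Table 1, and the memorizing "ANN" is a placeholder for the trained network.

```python
def word_accuracy(transcribe, reference):
    """Fraction of words whose transcription exactly matches the reference."""
    hits = sum(1 for w, ref in reference.items() if transcribe(w) == ref)
    return hits / len(reference)

# Toy stand-in for the trained ANN: it merely memorizes its training words.
trained = {"CE": "/_tS/0 E_/", "CETO": "/_tS/0 E:/t/o_/"}
ann = lambda w: trained.get(w, "?")

acc_seen = word_accuracy(ann, trained)                          # test 1
acc_unseen = word_accuracy(ann, {"CELLA": "/_tS/0 E/l:l/a_/"})  # test 2
# acc_seen == 1.0; unlike this memorizing stand-in, a real FFBP-ANN is
# expected to generalize on test 2 as well.
```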
7. Conclusions

An FFBP-ANN-based engine for text-to-phones transcription is being developed to act as the back end for phonetic synthesizers. The network can be trained on any language using the same set of phonemes, if an appropriate set of rules is available. The trained FFBP-ANN's ability to generalize shows that this approach to text-to-speech synthesis becomes advantageous compared with current implementations when embedded systems are its target. Its main advantage is due to the FFBP-ANN's ability to correctly transcribe nearly any word in the trained language, including exception words. This ability leads to limited memory requirements when setting up an unlimited-vocabulary text-to-speech system. On the other hand, rule-based text-to-speech synthesis, to correctly cover any exception in a specific vocabulary, needs large amounts of memory, not available in embedded systems.

Another important advantage of the FFBP-ANN is that its implementation is language-independent. If its size parameters (number of nodes at the input and output layers) are maximized to fit the language to be supported, then a single ANN engine can switch from one language to another by merely exchanging data (the node weights of the ANN trained for the specific language). Rule-based text-to-speech synthesis, to switch from one language to another, needs updated data (rule sets) as well as program code. This is a disadvantage for applications based on embedded systems, because programming code is read-only system information.

The peculiarity of strict data dependency in an FFBP-ANN trained for text-to-speech transcription can also be advantageous for software-embedded applications, such as network-distributed services, where each client needs a specific language for voice interaction. Applications of this kind prove more difficult if program code needs to be updated, as required by rule-based text-to-speech synthesis. Finally, the FFBP-ANN's in-system learning ability can be part of the embedded system's native hardware capacity, so that speech-synthesis functionality can be trained at the user level.

References

[1] P.C. Bagshaw, Phonemic transcription by analogy in text-to-speech synthesis: novel word pronunciation and lexicon compression, Computer Speech and Language 12 (2) (1998) 119–142.
[2] G. Bakiri, T.G. Dietterich, Achieving high-accuracy text-to-speech with machine learning, in: R.I. Damper (Ed.), Data Mining Techniques in Speech Synthesis, Chapman & Hall, New York, NY, 2002.
[3] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
[4] J.A. Bullinaria, Representation, learning, generalization and damage in neural network models of reading aloud, Technical Report, Edinburgh University, 1994.
[5] L. Canepari, Il MaPI: Manuale di pronuncia italiana, Zanichelli, 2004.
[6] G.C. Cawley, M.D. Edgington, Generalization in neural speech synthesis, in: Proceedings of the Institute of Acoustics Autumn Conference, Windermere, UK, 1998.
[7] P. Cosi, P. Frasconi, M. Gori, L. Lastrucci, G. Soda, Competitive radial basis functions training for phone classification, Neurocomputing 34 (2000) 117–129.
[8] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, Englewood Cliffs, NJ, 1999.
[9] F. Hendessi, A. Ghayoori, T.A. Gulliver, A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM, ACM Transactions on Asian Language Information Processing 4 (1) (2005).
[10] JOONE, Java Object Oriented Neural Engine, http://www.joone.org.
[11] O. Karaali, G. Corrigan, N. Massey, C. Miller, O. Schnurr, A. Macie, A high quality text-to-speech system composed of multiple neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, 1998.
[12] P. Ladefoged, Vowels and Consonants: An Introduction to the Sounds of Languages, Blackwell Publishers, Malden, MA, 2004.
[13] C.X. Ling, H. Wang, Alignment algorithms for learning to read aloud, in: Proceedings of IJCAI, 1997.
[14] M. Malcangi, NeuroFuzzy approach to the development of a text-to-speech (TTS) synthesizer for deeply embedded applications, in: Proceedings of the 14th Turkish Symposium on Artificial Intelligence and Neural Networks, Cesme, Turkey, 2005.
[15] M. Malcangi, Combining a fuzzy logic engine and a neural network to develop an embedded audio synthesizer, in: T. Simos, G. Psihoyios (Eds.), Lecture Series on Computer and Computational Sciences, vol. 8, Brill Publishing, Leiden, The Netherlands, 2007, pp. 159–162.
[16] T. Kristensen, Two neural network paradigms of phoneme transcription: a comparison, in: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004.
[17] J.M. Lucassen, R.L. Mercer, An information theoretic approach to the automatic determination of phonemic baseforms, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, CA, 1984, pp. 42.5.1–42.5.4.
[18] Y. Marchand, R.I. Damper, A multi-strategy approach to improving pronunciation by analogy, Computational Linguistics 26 (2) (2000) 195–219.
[19] M. Moreira, E. Fiesler, Neural networks with adaptive learning rate and momentum terms, IDIAP Technical Report, 1995.
[20] D.P. Morgan, C.L. Scofield, Neural Networks and Speech Processing, Kluwer Academic Publishers, London, 1991.
[21] D. O'Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley, Reading, MA, 1987.
[22] M. Rahim, C. Goodyear, Articulatory synthesis with the aid of a neural net, in: Proceedings of ICASSP, Glasgow, Scotland, 1989, pp. 227–230.
[23] M.S. Scordilis, J.N. Gowdy, Neural network based generation of fundamental frequency contours, in: Proceedings of ICASSP, Glasgow, Scotland, 1989, pp. 219–222.
[24] T.J. Sejnowski, C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Systems 1 (1987) 145–168.
[25] A. van den Bosch, A. Content, W. Daelemans, Measuring the complexity of writing systems, Journal of Quantitative Linguistics 1 (3) (1994) 178–188.

David Frontini received his B.Sc. in information science from the Università degli Studi of Milan. From 1999 to 2000 he worked as an IT consultant. From 2000 to 2003 he was an EMEA Professional Services member at Open Text Corporation. From 2003 to 2006 he was a senior IT specialist at IBM, working in the telecommunication business. From 2006 to 2007 he was an IT architect for SIA Group. Since 2007 he has been an IT architect at ASTIR. His research interests include intelligent information systems, such as neural networks and fuzzy logic.