Contents
1 Introduction
2 Problem Description
2.1 Objectives
2.2 Subproblems
2.3 Significance
3 Literature Review
3.1 Emotion and Speech
3.2 The Speech Correlates of Emotion
3.3 Emotion in Speech Synthesis
3.4 Speech Markup Languages
3.5 Extensible Markup Language (XML)
3.5.1 XML Features
3.5.2 The XML Document
3.5.3 DTDs and Validation
3.5.4 Document Object Model (DOM)
3.5.5 SAX Parsing
3.5.6 Benefits of XML
3.5.7 Future Directions in XML
3.6 FAITH
3.7 Resource Review
3.7.1 Text-to-Speech Synthesizer
3.7.2 XML Parser
3.8 Summary
4 Research Methodology
4.1 Hypotheses
5 Implementation
5.1 TTS Interface
5.1.2 Module Inputs
5.1.3 Module Outputs
5.1.4 C/C++ API
5.2 SML - Speech Markup Language
5.2.1 SML Markup Structure
5.3 TTS Module Subsystems Overview
5.4 SML Parser
5.5 SML Document
5.5.1 Tree Structure
5.5.2 Utterance Structures
5.6 Natural Language Parser
5.6.1 Obtaining a Phoneme Transcription
5.6.2 Synthesizing in Sections
5.6.3 Portability Issues
5.7 Implementation of Emotion Tags
5.7.1 Sadness
5.7.2 Happiness
5.7.3 Anger
5.7.4 Stressed Vowels
5.7.5 Conclusion
5.8 Implementation of Low-level SML Tags
5.8.1 Speech Tags
5.8.2 Speaker Tag
5.9 Digital Signal Processor
5.10 Cooperating with the FAML module
5.11 Summary
7 Future Work
7.1 Post Waveform Processing
7.2 Speaking Styles
7.3 Speech Emotion Development
7.4 XML Issues
7.5 Talking Head
7.6 Increasing Communication Bandwidth
8 Conclusion
9 Bibliography
10 Appendix A - SML Tag Specification
List of Figures
Figure 1 - An XML document holding simple weather information
Figure 2 - Sample section of a DTD file
Figure 3 - XML syntax error - list and item tags incorrectly matched
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag)
Figure 5 - Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors
Figure 6 - DOM representation of XML example
Figure 7 - FAITH project architecture
Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology
Figure 9 - Top level outline showing how Festival and MBROLA systems were used together
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head
Figure 11 - Top-level structure of an SML document
Figure 12 - Valid SML markup
Figure 13 - Invalid SML markup
Figure 14 - TTS module subsystems
Figure 15 - Filtering process of unknown tags
Figure 16 - SML Document structure for SML markup given above
List of Tables
Table 1 - Summary of human vocal emotion effects
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness
Table 3 - Speech correlate values implemented for sadness
Table 4 - Speech correlate values implemented for happiness
Table 5 - Speech correlate values implemented for anger
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch
Table 7 - MBROLA command line option values for en1 and us1 diphone databases to output male and female voices
Table 8 - Statistics of participants
Table 9 - Confusion matrix template
Table 10 - Confusion matrix with sample data
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions
Table 12 - Listener response data for neutral phrases spoken with happy emotion
Table 13 - Section 2A listener response data for neutral phrases
Table 14 - Listener response data for Section 2A, Question 1
Table 15 - Listener response data for Section 2A, Question 2
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion
Chapter 1
Introduction
When we talk we produce a complex acoustic signal that carries information in addition
to the verbal content of the message. Vocal expression tells others about the emotional
state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of
the words. Because of this, listeners expect to hear vocal effects, paying attention not
only to what is being said, but also to how it is said. The problem with current speech
synthesizers is that the effect of emotion on speech is not taken into account, producing
output that sounds monotone, or at worst distinctly machine-like. As a result, the
ability of a Talking Head to express its emotional state will be adversely affected if it
uses a plain speech synthesizer to "talk". The objective of this research was to develop a
system that is able to incorporate emotional effects in synthetic speech, and thus improve
the perceived naturalness of a Talking Head.
This thesis reviews the literature in the fields of speech emotion, speech synthesis, and
XML. A discussion of XML features prominently in this thesis because XML was the vehicle
chosen for directing how the synthetic voice should sound, and it had considerable impact
on how speech information was processed. The design and implementation details of the
project are then discussed to describe the developed system. An in-depth analysis of the
project's evaluation data follows, concluding with a discussion of the future work that
has been identified.
Chapter 2
Problem Description
2.1 Objectives
Development of the project was aimed at meeting two main objectives to support the
hypotheses of Section 4.1:
1. To develop a system that can add simulated emotion effects to synthetic
speech. This involved researching the speech correlates of emotion that have
been identified in the literature. The findings were to be applied to the control
parameters available in a speech synthesizer, allowing a specified emotion to be
simulated using rules controlling the parameters.
2. To integrate the system within the TTS (text-to-speech) module of a
Talking Head. The speech system was to be added to the Talking Head that is
part of the FAITH project, which is being developed jointly at Curtin University of
Technology, Western Australia, and the University of Genoa in Italy (Beard et al,
1999). The text-to-speech module must be treated as a 'black box', which is
consistent with the modular design of FAQbot.
2.2 Subproblems
A number of subproblems were identified to successfully develop a system with the
stated objectives.
1. Design and implementation of a speech markup language. It was desirable
that the markup language be XML-based; the reasons for this will become
apparent later in the thesis. The role of the speech markup language (SML) is to
provide explicit directives for how the marked-up text should be rendered as speech.
2.3 Significance
The project is significant because despite the important role of the display of emotion in
human communication, current text-to-speech synthesizers do not cater for its effect on
speech. Research to add emotion effects to synthetic speech is ongoing, notably by
Murray and Arnott (1996), but it has mainly been restricted to standalone systems rather
than forming part of a Talking Head, as this project set out to do.
Increased naturalness in synthetic speech is seen as important for its
acceptance (Scherer, 1996), and this is likely to be the case for applications of Talking
Head technology as well. This thesis attempts to address that need. Advances in this
area will also benefit work in the fields of speech analysis, speech recognition and speech
synthesis when dealing with natural variability. This is because work with the speech
correlates of emotion will help support or disprove speech correlates identified in speech
analysis, help in proper feature extraction for the automatic recognition of emotion in the
voice, and generally improve synthetic speech production.
Chapter 3
Literature Review
This section presents a brief review of the literature relevant to the areas the project is
concerned with: the effects of emotion on speech, speech emotion synthesis, XML and
speech markup languages.
frighten a would-be assailant, with the body tensing for a possible confrontation. The
expression of emotion through speech also serves to communicate to others our
judgement of a particular situation. Importantly, vocal changes due to emotion may in
fact be cross-cultural in nature, though this may only be true for some emotions, and
further work is required to ascertain this for certain (Murray, Arnott and Rohwer, 1996).
We also deliberately use vocal expression in speech to communicate various
meanings. Sudden pitch changes will make a syllable stand out, highlighting the
associated word as an important component of that utterance (Dutoit, 1997). A speaker
will also pause at the end of key sentences in a discussion to allow listeners the chance to
process what was said, and a phrase's pitch will increase towards the end to denote a
question (Malandro, Barker and Barker, 1989). When something is said in a way that
seems to contradict the actual spoken words, we will usually accept the vocal meaning
over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry
tone will generally be taken in a negative way, and not as a compliment, as the literal
meaning of the words alone would suggest. This underscores the importance we place on
the vocal information that accompanies the verbal content.
3. The content is ignored altogether, either by using equipment
designed to extract various speech attributes, or by filtering out the content.
The latter technique involves applying a low-pass filter to the speech signal,
thus eliminating the high frequencies that word recognition depends upon.
(This meets with limited success, however, since some of the vocal
information also resides in the high frequency range.)
The problem of speech parameter identification is further compounded by the
subjective nature of these tests. This is evident in the literature, as results taken from
numerous studies rarely agree with each other. Nevertheless, a general picture of the
speech parameters responsible for the expression of emotion can be constructed. There
are three main categories of speech correlates of emotion (Cahn, 1990; Murray, Arnott
and Rohwer, 1996):
Pitch contour. The intonation of an utterance, which describes the nature
of accents and the overall pitch range of the utterance. Pitch is expressed as
fundamental frequency (F0). Parameters include average pitch, pitch range,
contour slope, and final lowering.
Timing. Describes the speed at which an utterance is spoken, as well as rhythm
and the duration of emphasized syllables. Parameters include speech rate,
hesitation pauses, and exaggeration.
Voice quality. The overall character of the voice, which includes effects
such as whispering, hoarseness, breathiness, and intensity.
It is believed that combinations of these speech parameter values are used to express
vocal emotion. Table 1 summarizes the human vocal emotion effects of four of the so-called
basic emotions: anger, happiness, sadness and fear (Murray and Arnott, 1993;
Galanis, Darsinos and Kokkinakis, 1996; Cahn, 1990; Davitz, 1976; Scherer, 1996). The
parameter descriptions are relative to neutral speech.
Parameter       Anger                     Happiness            Sadness              Fear
Speech rate     Faster                    Slightly faster      Slightly slower      Much faster
Pitch average   Very much higher          Much higher          Slightly lower       Very much higher
Pitch range     Much wider                Much wider           Slightly narrower    Much wider
Intensity       Higher                    Higher               Lower                Higher
Pitch changes   Abrupt, downward-         Smooth, upward       Downward             Downward terminal
                directed contours         inflections          inflections          inflections
Voice quality   Breathy, chesty tone      Breathy, blaring     Resonant             Irregular voicing
Articulation    Clipped                   Slightly slurred     Slurred              Precise

Table 1 - Summary of human vocal emotion effects.
The summary should not be taken as a complete and final description, but rather is
meant as a guideline only. For instance, the table above emphasizes the role of
fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as cited
in Murray and Arnott, 1993) notes that whispered speech is able to convey
emotion, even though whispering makes no use of the voice's fundamental frequency.
Nevertheless, being able to succinctly describe vocal expression like this has significant
benefits for simulating emotion in synthetic speech.
carefully constructed rules. Two of the better known systems capable of adding emotion-by-rule
effects to speech are the Affect Editor, developed by Cahn (1990b), and
HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell, 1988).
The systems both make use of the DECtalk text-to-speech synthesizer, mainly because of
its extensive control parameter features.
Future work is concerned with building a solid model of emotional speech, as this
area is seen as being limited by our understanding of vocal expression, and the quality of
the speech correlates used to describe emotional speech (Cahn, 1988; Murray and Arnott,
1995; Scherer, 1996). Although not within the scope of the project, it is worth
mentioning that research is being undertaken in concept-to-speech synthesis. This work
is aimed at improving the intonation of synthetic speech by using extra linguistic
information (i.e. tagged text) provided by another system, such as a natural language
generation (NLG) system (Hitzeman et al, 1999).
Variability in speech is also being investigated in the area of speech recognition with
the aim of possibly developing computer interfaces that respond differently according to
the emotional state of the user (Dellaert, Polzin and Waibel, 1996). Another avenue for
future research could be to incorporate the effects of facial gestures on speech. For
instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be
friendly over the phone when a person is smiling. A model that could cater for this
would have extremely beneficial applications for recent work concerned with the
synchronization of facial gestures and emotive speech in Talking Heads.
Finally, simulating emotion in synthetic speech not only has the potential to build
more realistic speech synthesizers (and hence provide the benefits that such a system
would offer), but will also add to our understanding of speech emotion itself.
Most research and commercial systems allow for such an annotation scheme, but
almost all are synthesizer dependent, thus making it extremely difficult for software
developers to build programs that can interface with any speech synthesizer. Recent
moves by industry leaders to standardize a speech markup language have led to the draft
specification of SABLE, a system-independent, SGML-based markup language (Sproat
et al, 1998). The SABLE specification has evolved from three existing speech synthesis
markup languages: SSML (Taylor and Isard, 1997), STML (Sproat et al, 1997), and
Java's JSML.
2. Structure - the structure of an XML document can be nested to
any level of complexity, since it is the author that defines the tag set and
grammar of the document.
3. Validation - if a tag set and grammar definition is provided
(usually via a Document Type Definition (DTD)), then applications
processing the XML document can perform structural validation to make sure
it conforms to the grammar specification. So although the nested structure of an
XML document can be quite complex, the fact that it follows a very rigid
guideline makes document processing relatively easy.
<?xml version="1.0"?>
<weather-report>
<date>March 25, 1998</date>
<time>08:00</time>
<area>
<city>Perth</city>
<state>WA</state>
<country>Australia</country>
</area>
<measurements>
<skies>partly cloudy</skies>
<temperature>20</temperature>
<h-index>51</h-index>
<humidity>87</humidity>
<uv-index>1</uv-index>
</measurements>
</weather-report>

Figure 1 - An XML document holding simple weather information.
One of the main observations that should be made for the example given in Figure 1
is that an XML document describes only the data, and not how it should be viewed. This
is unlike HTML, which forces a specific view and does not provide a good mechanism
for data description (Graham and Quinn, 1999). For example, HTML tags such as P,
DIV, and TABLE describe how a browser is to display the encapsulated text, but are
inadequate for specifying whether the data is describing an automotive part, is a section
of a patient's health record, or the price of a grocery item.
The fact that an XML document is encoded in plain text was a conscious decision
made by the XML designers, driven by the goal of a system-independent and
vendor-independent solution (Bosak, 1997). Although text files are usually larger than
comparable binary formats, this can easily be compensated for using freely available
utilities that can efficiently compress files, both in terms of size and time. At worst, the
disadvantages associated with an uncompressed plain text file are deemed to be
outweighed by the advantages of a universally understood and portable file format that
does not require special software for encoding and decoding.
<!ELEMENT list (item)+>                 <!-- one or more item tags -->
<!ELEMENT item (#PCDATA)>
<!ATTLIST item type CDATA #IMPLIED>     <!-- attribute is optional -->

Figure 2 - Sample section of a DTD file.
Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax), but
does not follow the grammar specified in the linked DTD file. (The DTD file is the one
given in Figure 2).
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
<item>Item 1</item>
<item>Item 2</item>
</list>
<item>Item 3</item>

Figure 4 - Well-formed XML document, but does not follow grammar specification in
DTD file (an item tag occurs outside of list tag).
Figure 5 shows a well-formed XML document that also meets the grammar
specification given in the DTD file.
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
<item>Item 1</item>
<item type="x">Item 2</item>
<item>Item 3</item>
</list>

Figure 5 - Well-formed XML document that also follows the DTD
grammar specification. Will not produce any parse errors.
The XML Recommendation states that any parse error detected while processing an
XML document will immediately cause a fatal error (Extensible Markup Language,
1998): the XML document will not be processed any further, and the application will
not attempt to second-guess the author's intent. Note that the DTD does NOT define how
the data should be viewed either. Also, the DTD is able to define which sub-elements can
occur within an element, but not the order in which they occur; the same applies for
attributes specified for an element. For this reason, an application processing an XML
document should avoid being dependent on the order of given tags or attributes.
a. XML document:

<weather-report>
<date>October 30, 2000</date>
<time>14:40</time>
<measurements>
<skies>Partly cloudy</skies>
<temperature>18</temperature>
</measurements>
</weather-report>

b. DOM representation (each element becomes a node in the tree):

weather-report
  date ("October 30, 2000")
  time ("14:40")
  measurements
    skies ("Partly cloudy")
    temperature ("18")

Figure 6 - DOM representation of XML example.
A SAX handler, on the other hand, can process very large documents since it does
not keep the entire document in memory during processing. SAX, the Simple API for
XML, is a standard interface for event-based XML parsing (SAX 2.0, 2000). Instead of
building a structure representing the entire XML document, SAX reports parsing events
(such as the start and end of tags) to the application through callbacks.
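To make the event-based model concrete, the following minimal sketch (not code from the TTS module; it assumes the libxml C API, and the handler names are hypothetical) registers start- and end-of-tag callbacks and parses a small document held in memory:

#include <libxml/parser.h>
#include <cstdio>
#include <cstring>

// Called by the parser each time an opening tag is encountered.
static void onStartElement(void *ctx, const xmlChar *name, const xmlChar **attrs)
{
    std::printf("start of <%s>\n", reinterpret_cast<const char *>(name));
}

// Called each time a closing tag is encountered.
static void onEndElement(void *ctx, const xmlChar *name)
{
    std::printf("end of <%s>\n", reinterpret_cast<const char *>(name));
}

int main()
{
    const char *doc = "<weather-report><skies>partly cloudy</skies></weather-report>";

    xmlSAXHandler handler;
    std::memset(&handler, 0, sizeof(handler));
    handler.startElement = onStartElement;
    handler.endElement = onEndElement;

    // Events are reported through the callbacks; no document tree is built.
    xmlSAXUserParseMemory(&handler, 0, doc, (int)std::strlen(doc));
    return 0;
}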
client computer, reducing the server's workload and thus enhancing server
scalability.
Embedding of multiple data types - XML documents can contain
virtually any kind of data type, such as images, sound, video, URLs, and also
active components such as Java applets and ActiveX.
Data delivery - since XML documents are encoded in plain text, data
delivery can be performed on existing networks, sent using HTTP just like
HTML.
Combined with the XML features discussed in Section 3.5.1, the above list
underscores the enormous potential of XML. Indeed, the extent of these benefits has made
XML a core component in a wide range of applications: the dissemination of
information in government agencies, the management of corporate logistics, the provision of
telecommunication services, XML-based prescription drug databases that help pharmacists
advise their customers, the simplified exchange of complex patient records obtained
from different data sources, and much more (SoftwareAG, 2000a).
XML Schemas (Parts 1 and 2) - aimed at helping developers to define their
own XML-based formats.
Future applications of some of these XML components to this project are discussed
later in this thesis. For more information on these emerging technologies, see the XML
Cover Pages (Cover, 2000).
3.6 FAITH
A very brief description of the FAITH project is required in order to gain an
understanding of where the TTS module fits within the Talking Head architecture.
Figure 7 shows a simplified view of the various subsystems that make up the Talking
Head.
Figure 7 - FAITH project architecture. On the server side, the Brain/Personality, FAML,
and TTS modules exchange text questions, FAPs (visemes), and waveforms; the client side
consists of the MPEG-4 user interface, which sends the user's questions and receives FAPs
and waveforms for playback.
Festival is a widely recognized research project developed at the Centre for Speech
Technology Research (CSTR), University of Edinburgh, with the aim of offering a free,
high quality text-to-speech system for the advancement of research (Black, Taylor and
Caley, 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté
Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed
with aims similar to Festival's (MBROLA Project Homepage, 2000).
Text -> NLP (Festival) -> phonemes, pitch and duration -> DSP (MBROLA) -> waveform

Figure 9 - Top level outline showing how the Festival and MBROLA systems were used together.
It was decided for this project to use the Festival system as the natural language
parser (NLP) component of the module, which accepts text as input and transcribes this
to its phoneme equivalent, plus duration and pitch information. This information can be
then given to the MBROLA synthesizer, acting as the digital signal processing unit
(DSP), which produces a waveform from this information. Although Festival has its own
DSP unit, it was found that the Festival + MBROLA combination produces the best
quality. It is important to note that the Festival system supports MBROLA in its API.
Because MBROLA requires its input in a phoneme-duration-pitch format, it
provides very fine pitch and timing control over each phoneme in the utterance. As stated
before, this level of control is simply unattainable with commercial systems, except
DECtalk. The advantage of using MBROLA over DECtalk, however, is that once a
phoneme's pitch is altered in DECtalk, the synthesizer's generated pitch contour is
overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate
the utterance at the phoneme level, limiting the amount of control, which ultimately
hindered the quality of the simulated emotion. To overcome this, Murray and Arnott
(1995) had to write their own intonation model to replace the DECtalk-generated pitch
contour when they changed pitch values at the phoneme level. Fortunately, this is not an
issue with MBROLA, as changes to the pitch and duration values can be made before
they are passed to MBROLA (as Figure 9 shows). Therefore, the Festival +
MBROLA option offers a degree of control comparable to the DECtalk synthesizer, with the
benefit of less complexity.
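To make the phoneme-duration-pitch format concrete, the fragment below is a hand-written illustration of MBROLA-style input (the phoneme symbols and numeric values are made up for illustration, not output from the module): each line carries a phoneme, its duration in milliseconds, and optional pairs giving a position within the phoneme (as a percentage) and a pitch value in Hz.

_   50
dh  95   0 120
@   101  50 118
m   102  0 116
uu  110  50 112
n   103  100 105
_   50

Because every phoneme's duration and pitch points are stated explicitly, rules that stretch durations or reshape the pitch contour can simply rewrite these values before the data is handed to MBROLA.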
The use of Festival and MBROLA also addressed the platform-independence
subproblem described in Section 2.2. Although developed mainly for the UNIX
platform, Festival's source code can be ported to the Win32 platform with relatively minor
modifications. The MBROLA homepage offers binaries for many platforms, including
Win32, Linux, most UNIX variants, BeOS, and Macintosh.
Before the final decision was made to use the Festival system, however, an
important issue required investigation. The previous TTS module of the Talking Head
did not use the Festival system because, although it was acknowledged that Festival's
output is of a very high quality, the computation time was deemed to be far too expensive
for an interactive application (Crossman, 1999). For example, the phrase "Hello
everybody. This is the voice of a Talking Head. The Talking Head project consists of
researchers from Curtin University and will create a 3D model of a human head that will
answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon
Graphics Indy workstation (Crossman, 2000). It is contested, however, that the negative
impression of the Festival system that could be formed from such data may be a little
misleading. Though execution time may be long on an SGI Indy workstation, informal
testing on several standard PCs (Win32 and Linux platforms) showed that the same
phrase took less than 5 seconds to synthesize (including the generation of a waveform).
Since TTS processing is done on the server side, the system can easily be configured to
ensure Festival carries out its processing on a faster machine. Therefore, Festival's
synthesis time was not considered a problem.
Since it is expected that the program's input will contain marked-up text, an XML parser
was required to parse and validate the input, and create a DOM tree structure for easy
processing. There are a number of freely available XML parsers, though many are still
under development and implement the XML specification to varying degrees. One of the
more complete parsers is libxml, a freely available XML C library for Gnome (libxml,
2000).
Using libxml as the XML parser fulfilled the needs of the project in a number of
ways:
a) Portability - written in C, the library is highly portable. Along with the main
program, it has been successfully ported to the Win32, Linux and IRIX
platforms.
b) Small and simple - only a limited range of the XML features is being used,
therefore a complex parser was not required. This is not to say that libxml is a
trivial library, as it offers some powerful features.
c) Efficiency - informal testing showed that libxml parses large documents in
surprisingly little time. Although not used for this project, libxml offers a
SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).
d) Free - libxml can be obtained cost-free and licence-free.
It is important to note that the libxml library's DOM tree-building feature was used to
help create the required objects that hold the program's utterance information. However,
care was taken to make sure the program's objects were not dependent on the XML
parser being used. Instead, a wrapper class, CTTS_SMLParser, used libxml as the XML
parser and output a custom tree-like structure very similar to the DOM. This
ensured that all other objects within the program used the custom structure, and not the
DOM tree that libxml outputs. (See Chapter 5 for more details.)
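As a rough illustration of this wrapper idea, the sketch below (assuming a libxml2-style API; CTTS_SMLNode, its members, and the function names are hypothetical simplifications, not the actual classes) copies the libxml tree into a program-owned structure so that no other object ever touches libxml types:

#include <libxml/parser.h>
#include <libxml/tree.h>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the custom DOM-like node type.
struct CTTS_SMLNode {
    std::string name;                      // tag name; empty for text nodes
    std::string text;                      // character data for text nodes
    std::vector<CTTS_SMLNode> children;
};

// Recursively copy a libxml node (and its children) into the custom structure.
static CTTS_SMLNode CopyNode(xmlNodePtr src)
{
    CTTS_SMLNode dst;
    if (src->type == XML_TEXT_NODE) {
        xmlChar *content = xmlNodeGetContent(src);
        if (content) {
            dst.text = reinterpret_cast<char *>(content);
            xmlFree(content);
        }
    } else if (src->name) {
        dst.name = reinterpret_cast<const char *>(src->name);
    }
    for (xmlNodePtr child = src->children; child; child = child->next)
        dst.children.push_back(CopyNode(child));
    return dst;
}

// Parse an SML file and return the custom tree; libxml stays hidden in here.
CTTS_SMLNode ParseSMLFile(const char *filename)
{
    CTTS_SMLNode root;
    xmlDocPtr doc = xmlParseFile(filename);
    if (doc) {
        xmlNodePtr rootElement = xmlDocGetRootElement(doc);
        if (rootElement)
            root = CopyNode(rootElement);
        xmlFreeDoc(doc);
    }
    return root;
}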
3.8 Summary
This chapter has explored research that was applicable to this project, focusing on how
the literature can help with achieving the stated objectives and subproblems of Chapter 2,
and supporting the hypotheses of Chapter 4. More specifically, the literature was
investigated to find the speech correlates of emotion, seeking clear definitions so that
there was a solid base to work from during the implementation phase. The work of
prominent researchers in the field of synthetic speech emotion, such as Murray and
Arnott (1995) and Cahn (1990) who have already attempted to simulate emotional
speech, was sought in order to gain an understanding of the problems involved, and the
approach taken in solving them.
The in-depth review of XML served two purposes: a) to describe what XML is and
what the technology is trying to address, and b) to expound the benefits of XML so as to
justify why SML was designed to be XML-based. A resource review was given to
discuss the issues involved in deciding which tools to use for the TTS module, and to
address one of the subproblems stated in Section 2.2; that is, that the TTS module should
be able to run across the Win32, Linux, and UNIX platforms.
Chapter 4
Research Methodology
The literature review of Chapter 3 enabled the formation of the hypotheses stated in this
chapter. It also identified areas where limitations would apply, and defined the scope of
the project.
4.1 Hypotheses
The project was developed to test the following hypotheses:
1. The effect of emotion on speech can be successfully synthesized using
control parameters.
2. Through the addition of emotive speech:
a) Listeners will be able to correctly recognize the intended emotion
being synthesized.
b) The perceived naturalness of the Talking Head will be improved.
It should be noted that hypothesis 2a allows for a significant error rate in recognizing
the simulated emotion, as we ourselves find it difficult to understand each other's non-verbal
cues, and are often misunderstood as a result. Malandra, Barker and Barker
(1989) and Knapp (1980) discuss difficulties in emotive speech recognition. For the
hypothesis to be accepted, however, the recognition rate must be significantly higher than
mere chance, showing that correct recognition of the simulated emotion is indeed
occurring.
4.2.2 Delimitations
The purpose of this research is to determine how well the vocal effects of emotion can be
added to synthetic speech; it is not concerned with generating an emotional state for the
Talking Head based on the words it is to speak. Therefore, the system will not know the
required emotion to simulate from the input text alone. This top-level information will be
provided through the use of explicit tags, hence the need for the implementation of a
speech markup language.
Due to the strict time constraints placed on this project, the emotions to be
simulated by the system were limited to happiness, sadness, and anger. These three
emotions were chosen because of the wealth of study carried out on them (and
hence an increased understanding) compared to other emotions. This is because
happiness, sadness, and anger (along with fear and grief) are often referred to as the
'basic' emotions, on which it is believed other emotions are built.
Chapter 5
Implementation
This chapter discusses the implementation of the TTS module to simulate emotional
speech for a Talking Head, and how the subproblems stated in Chapter 2 were addressed.
The discussion covers how the module's input is processed, and how the various
emotional effects were implemented. This will involve a description of the various
structures and objects that are used by the TTS module. Since the module relies heavily
on SML, the speech markup language that was designed and implemented to enable
direct control over the module's output, the chapter also discusses SML issues such as
parsing and tag processing.
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head
(input: text to synthesize; outputs: waveform and viseme information).
It was mentioned earlier in Section 3.7.1 that the TTS module uses the Festival
Speech Synthesis System and the MBROLA Synthesizer. Figure 10 does not show any
of this detail, nor should it. What is important to describe at this level is the module's
interface; how the module produces its output is irrelevant to the user of the module.
Plain Text
Plain text is the simplest form of input: the TTS module will endeavour to render the
speech equivalent of all the input text. In other words, it is assumed that no characters
within the input represent directives for how to generate the speech. As a result, speech
generated from plain text will have default speech parameters, spoken with neutral
intonation.
SML Markup
If direct control over the TTS module's output is desired, then the text to be spoken can
be marked up in SML, the custom markup language implemented for the module.
Although an in-depth description of SML will not be given here (see Section 5.2 and
Appendix A), it was designed to provide the user of the TTS module with the following
abilities:
Control over speaker properties. This gives the ability to control not only
how the marked-up text is spoken, but also who is speaking.
Speaker properties such as gender, age, and voice can be dynamically changed
within SML markup.
Another important feature of the TTS module with regards to the input it can receive
is that the module is able to handle unknown tags present within the text. This is
important because other modules within the Talking Head may (and do) have their own
markup languages to control processing within that module. For example, Huynh (2000)
has developed a facial markup language to specify many head movements and
expressions. If any non-SML tags are present within the input given to the TTS module,
they will simply be filtered out of the SML input.
A very important note to make is that the filtering out of unknown tags is done
before any XML-related parsing is carried out; the XML Recommendation explicitly
states that the presence of any unknown tags that do not appear in the document's
referenced DTD should immediately cause a fatal error (Extensible Markup Language,
1998). Therefore, it is important that the input is properly filtered before any processing
by the TTS module takes place, since non-SML tags are expected to be present in the
input. Admittedly, the very fact that the TTS module is given input that
However, the FAITH project has only just begun to make use of XML-based markup
languages for its various modules and the XML processing architecture within the
Talking Head is not very mature. One possibly better approach to maintaining several
XML-based markup languages (such as SML and FAML) is discussed in the Future
Work section (Chapter 7).
Waveform
The TTS module will always produce a sound file which, when played, is the speech
equivalent of the text received as input. The sound file is in the WAV 16-bit format.
Once a waveform is produced, it is sent via MPEG-4 to the client side of the application
where it is played (see Section 3.6).
Visemes
The visual equivalent of the phonemes (speech sounds) that are spoken is also output
from the TTS module. The viseme output is encoded as stated in the MPEG-4 standard.
CTTS_Central::SpeakFromFile(const char * ...)

CTTS_Central::SpeakText(const char *Message)
    e.g. SpeakText("Hello World");

CTTS_Central::SpeakTextEx(const char *Message, ...)
A special note to make is that the type of input given to each of these functions is not
explicitly specified. For instance, does the file given to SpeakFromFile contain plain
text, or is it an SML document? Each of the API functions listed above automatically
detects the input type using a simple heuristic: if the start of the input file or character
string contains an XML header declaration, then it is taken to be SML markup;
otherwise the input is treated as plain text.
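A minimal sketch of this heuristic is given below (the helper name is hypothetical, and the check is simplified to the "<?xml" header only):

#include <cctype>
#include <cstring>

// Returns true if the input begins (after whitespace) with an XML header
// declaration, in which case it is treated as SML markup rather than plain text.
static bool LooksLikeSML(const char *input)
{
    while (*input && std::isspace(static_cast<unsigned char>(*input)))
        ++input;
    return std::strncmp(input, "<?xml", 5) == 0;
}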
A C API has also been made available, which has the same functionality as the
CTTS_Central object's interface, except that initialization and destruction routines
have to be called explicitly.
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "./sml-v01.dtd">
<sml>
<p>...</p>
<p>...</p>
<p>...</p>
</sml>
Figure 11 - Top-level structure of an SML document.
In turn, each p node can contain one or more emotion tags (sad, angry, happy, and
neutral), and instances of the embed tag; text not contained within an emotion tag is
not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows
SML markup that is invalid because it does not follow this rule. Note that unlike lazy
HTML, the paragraph (p) tags must be closed properly.
<p>
<sad>I have some sad news:</sad>
this part of the markup is not valid SML.
</p>
Figure 13 - Invalid SML markup.
All tags described in Appendix A can occur inside an emotion tag (except sml, p,
and embed). A limitation of SML is that emotion tags cannot occur within other emotion
tags. However, unless explicitly specified, most other tags can contain even instances of
tags with the same name. For example, a pitch tag can contain another pitch tag as
the following example shows.
<pitch range="+100%">
Not I, <pitch middle="-15%">said the dog.</pitch>
</pitch>
The described structure of an input file containing SML markup can be confirmed by
SML's DTD (see Appendix B). Should the input file not conform to the DTD
specification, a parse error will occur and, in accordance with the XML
Recommendation, the input will not be processed.
Figure 14 - TTS module subsystems (the SML Parser, built on libxml, the SML Document,
the Natural Language Parser, the SML Tags Processor, the DSP, built on MBROLA, and the
Visual Module; text and phoneme data flow between them through the SML Document).
As Figure 14 shows, the design of the TTS module subsystems is centered on the SML
Document object. The main steps for synthesizing the module's input text involve the
creation, processing, and output of the SML Document. This is broken down into the
following tasks:
a) Parsing. The input text is parsed by the SML Parser, which creates
an SML Document object. The SML Parser makes use of libxml.
b) Text-to-Phoneme Transcription. The Natural Language Parser
(NLP) is responsible for transcribing the text into its phoneme equivalent, plus
providing intonation information in the form of each phoneme's duration and
pitch values. This information is given to the SML Document object and
stored within its internal structures. The NLP unit makes use of the Festival
Speech Synthesis System.
c) SML Tag Processing. Any SML tags present in the input text are
processed. This usually involves modifying the text or phonemes held within
the SML Document.
d) Waveform Generation. The phoneme data held within the SML
Document is given to the Digital Signal Processing (DSP) unit to generate a
waveform. The DSP makes use of the MBROLA Synthesizer.
e) Viseme Generation. The Visual Module is responsible for
transcribing the phonemes to their viseme equivalent. Again, the phoneme
data is obtained from the SML Document. The Visual Module will not be
discussed in any further detail in this thesis since it has reused much of the
old TTS module's subroutines. Crossman (1999) provides a description of the
phoneme-to-viseme translation process.
The TTS module keeps track of all SML tag names by keeping a special XML
document that holds SML tag information. (This XML document is called tag-names.xml,
and is held in the special TTS resource directory TTS_rc.) Filtering of the input is done
by creating a copy of the input file, and copying across only those tags that are known. It
is important that this filtering process is carried out because the input is envisaged to
contain other non-SML tags, such as those belonging to the FAML module. Figure 15
shows the filtering process; a sketch of the idea is given below.

Figure 15 - Filtering process of unknown tags (each tag in the input file is looked up
against the SML tag information, and only known tags are copied to the filtered copy).
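The following sketch illustrates the filtering idea only (the function and variable names are hypothetical, and here the known tag names are supplied as an in-memory set rather than read from tag-names.xml): markup belonging to unknown tags is dropped, while the character data it surrounds is kept.

#include <cctype>
#include <set>
#include <string>

// Drops tags whose names are not in the known SML tag set, keeping their
// enclosed character data. XML headers (<?...>) and DOCTYPE lines are kept.
std::string FilterUnknownTags(const std::string &input,
                              const std::set<std::string> &knownTags)
{
    std::string output;
    for (std::string::size_type i = 0; i < input.size(); ) {
        if (input[i] != '<') {                      // ordinary character data
            output += input[i++];
            continue;
        }
        std::string::size_type close = input.find('>', i);
        if (close == std::string::npos) {           // malformed tail; keep as-is
            output += input.substr(i);
            break;
        }
        // Extract the tag name, skipping any leading '/', '?' or '!'.
        std::string::size_type p = i + 1;
        while (p < close && (input[p] == '/' || input[p] == '?' || input[p] == '!'))
            ++p;
        std::string::size_type q = p;
        while (q < close && !std::isspace(static_cast<unsigned char>(input[q]))
               && input[q] != '/')
            ++q;
        std::string name = input.substr(p, q - p);
        bool keep = (input[i + 1] == '?' || input[i + 1] == '!' ||
                     knownTags.count(name) > 0);
        if (keep)
            output += input.substr(i, close - i + 1);
        i = close + 1;
    }
    return output;
}

For example, with knownTags holding only the SML tag names, a hypothetical FAML-style tag such as <nod/> embedded in the input would be removed while the surrounding text is left untouched.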
hold markup information, attribute values, and character data. Figure 16 shows the
high-level structure of an SML Document that would be constructed for the accompanying
SML markup. Note how each node has a type that specifies what kind of node it is.
The hierarchical nature of the SML Document implies which text sections will be
rendered in what way: a parent will affect all its children. So, for the example in Figure
16, the emph node will affect the phoneme data of its (one) child node, the text node
containing the text "too". The happy node will affect the phoneme data of all its (three)
child nodes, which contain the text "That's not", "too", and "far away." respectively. Tags
that were specified with attribute values are represented by element nodes that point to
attribute information (this is not shown in Figure 16 for clarity).
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "./sml-v01.dtd">
<sml>
<p>
<neutral>I live at
<rate speed="-10%">10 Main Street</rate>
</neutral>
<happy>
That's not <emph>too</emph> far away.
</happy>
</p>
</sml>

Corresponding SML Document structure (each node is labelled with its type):

sml (DOCUMENT_NODE)
  p (ELEMENT_NODE)
    neutral (ELEMENT_NODE)
      text (TEXT_NODE): "I live at"
      rate (ELEMENT_NODE)
        text (SML_TEXT_NODE): "10 Main Street"
    happy (ELEMENT_NODE)
      text (TEXT_NODE): "That's not"
      emph (ELEMENT_NODE)
        text (SML_TEXT_NODE): "too"
      text (TEXT_NODE): "far away."

Figure 16 - SML Document structure for the SML markup given above.
In turn, each CTTS_WordInfo object contains a list of CTTS_PhonemeInfo objects that
hold the phoneme information. A CTTS_PhonemeInfo object contains the actual phoneme
and its duration (ms). Each CTTS_PhonemeInfo object then contains a list of
CTTS_PitchPatternPoint objects; each pitch pattern point holds a pitch value together with
its position inside the phoneme, expressed as a percentage of the phoneme's length.
U (the utterance)
  W "the"
    P dh - pp (0, 95)
    P @  - pp (50, 101)
  W "moon"
    P m  - pp (0, 102)
    P uu - pp (50, 110)
    P n  - pp (100, 103)

Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object;
W = CTTS_WordInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint
object, shown as (% inside phoneme length, pitch value).
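A simplified C++ sketch of this nesting is given below; the class names come from the thesis, but the member names and types shown here are assumptions made for illustration only.

#include <string>
#include <vector>

// Hypothetical, simplified declarations of the utterance structures described
// above; the real classes hold more information, but the nesting is the same.
struct CTTS_PitchPatternPoint {
    int positionPercent;   // position within the phoneme (% of its length)
    int pitchValue;        // pitch value at that position
};

struct CTTS_PhonemeInfo {
    std::string phoneme;                          // e.g. "dh", "m", "uu"
    int durationMs;                               // phoneme duration (ms)
    std::vector<CTTS_PitchPatternPoint> pitch;    // pitch pattern points
};

struct CTTS_WordInfo {
    std::string word;                             // e.g. "moon"
    std::vector<CTTS_PhonemeInfo> phonemes;
};

struct CTTS_UtteranceInfo {
    std::vector<CTTS_WordInfo> words;             // e.g. "the", "moon"
};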
Example of text normalization within the nodes:

<neutral>
10 oranges cost
<emph>$8.30</emph>
</neutral>

Text held by each node after normalization:
neutral node: "10 oranges cost" becomes "ten oranges cost"
emph node: "$8.30" becomes "eight dollars thirty"
Once word information is stored within each node's utterance object, phoneme data
can be generated for each word. Obtaining the actual phoneme data (including
intonation) is a more complex process, however, because an entire phrase should
be given to Festival in order for correct intonation to be generated. As an example,
consider the following SML markup (the corresponding nodes held in the SML
Document are shown in Figure 19).
<happy>
<rate speed="-15%">I wonder,</rate> you pronounced it
<emph>tomato</emph> did you not?
</happy>

happy
  rate
    text: "I wonder,"
  text: "you pronounced it"
  emph
    text: "tomato"
  text: "did you not?"

Figure 19 - Nodes held in the SML Document for the markup above.
If each text node's contents are given to Festival one at a time (i.e. first "I wonder,",
then "you pronounced it", and so forth), then although Festival will be able to produce the
correct phonemes, it will not generate proper pitch and timing information for them.
This will result in an utterance whose words are pronounced properly, but which
contains inappropriate intonation breaks that make the utterance sound unnatural.
An appropriate analogy would be a person who is shown a pack of cards with
words written on them, one at a time, and asked to read them out loud. The person, not
knowing what words will follow, will not know how to give the phrase an appropriate
intonation. If, however, the same person is given a card that contains the entire sentence,
then, knowing what the phrase is saying, the person will read it out loud correctly. The
same approach was taken in the solution to this problem. Continuing the above example
will help to show how this is done.
The SML Document is traversed until an emotion node is encountered. In
the example, traversal would stop at the happy node.
The contents of its child text nodes are then concatenated to make one
phrase; a sketch of this concatenation step is given below. So, the contents of the four
text nodes in Figure 19 would be concatenated to form the phrase "I wonder, you
pronounced it tomato did you not?" The phrase is stored in a temporary utterance object
held in the happy node.
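The concatenation step can be sketched as a simple recursive walk over the emotion node's subtree (the Node structure and function names below are hypothetical simplifications of the SML Document structures):

#include <cstddef>
#include <string>
#include <vector>

// Minimal stand-in for an SML Document node: element nodes carry a name and
// children, text nodes carry character data.
struct Node {
    std::string name;                 // empty for text nodes
    std::string text;
    std::vector<Node> children;
};

// Concatenates the character data of all text nodes beneath an emotion node,
// in document order, so the complete phrase can be handed to Festival at once.
static void CollectText(const Node &node, std::string &phrase)
{
    if (node.name.empty()) {          // text node
        if (!phrase.empty())
            phrase += ' ';
        phrase += node.text;
    }
    for (std::size_t i = 0; i < node.children.size(); ++i)
        CollectText(node.children[i], phrase);
}

Applied to the happy node of Figure 19, this yields the single phrase "I wonder, you pronounced it tomato did you not?", which can then be given to Festival in one piece.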
<p>
<neutral>Utterance 1</neutral>
<happy>Utterance 2</happy>
<sad>Utterance 3</sad>
</p>
(Server/client timeline: the client plays Utterance 1, Utterance 2, and Utterance 3 in turn,
while the server synthesizes the next utterance or sits idle.)
Parameter       Anger                     Happiness            Sadness
Speech rate     Faster                    Slightly faster      Slightly slower
Pitch average   Very much higher          Much higher          Slightly lower
Pitch range     Much wider                Much wider           Slightly narrower
Intensity       Higher                    Higher               Lower
Pitch changes   Abrupt, downward-         Smooth, upward       Downward
                directed contours         inflections          inflections
Voice quality   Breathy, chesty tone      Breathy, blaring     Resonant
Articulation    Clipped                   Slightly slurred     Slurred

Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.
5.7.1 Sadness
Basic Speech Correlates
Following the literature-derived guideline for the speech correlates of emotion shown in
Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were
optimized for the TTS module, and are given as percentage values relative to neutral
speech.
Parameter        Value
Speech rate      -15%
Pitch average    -5%
Pitch range      -25%
Volume           0.6

Table 3 - Speech correlate values implemented for sadness.
As a result of the above speech parameter changes, the speech is slower, lower in
tone, and is more monotonic (pitch range reduction gives a flatter intonation curve). The
volume is reduced for sadness so that the speaker talks more softly. (Implementation
details on how speech rate, volume and pitch values are modified can be found in Section
5.8).
Prosodic rules
The following rules, adopted from Murray and Arnott (1995), were deemed to be
necessary for the simulation of sadness. Some parameter values were slightly modified
to work best with the TTS module.
1. Eliminate abrupt changes in pitch between phonemes. The
phoneme data is scanned, and if any phoneme pairs have a pitch difference of
greater than 10%, then the lower of the two pitch values is increased by 5% of
the pitch range.
2. Add pauses after long words. If any word in the utterance contains
six or more phonemes, then a slight pause (80 milliseconds) is inserted after
the word.
The following rules were developed specifically for the TTS module.
1. Lower the pitch of every word that occurs before a pause. Such
words are lowered by scanning the phoneme data in the particular word, and
5.7.2 Happiness
Basic Speech Correlates
Following the literature-derived guideline for the speech correlates of emotion shown in
Table 2, Table 4 shows the parameter values set for the SML happy tag. The values
were optimized for the TTS module, and are given as percentage values relative to
neutral speech.
Parameter        Value
Speech rate      +10%
Pitch average    +20%
Pitch range      +175%
Volume

Table 4 - Speech correlate values implemented for happiness.
As a result of the above speech parameter changes, the speech is slightly faster, is
higher in tone, and sounds more excited since intonation peaks are exaggerated due to the
pitch range increase.
Prosodic rules
The following rules were adopted from Murray and Arnott (1995). Some parameter
values were slightly modified to work best with the TTS module.
1. Increase the duration of stressed vowels. The phoneme data is
scanned, and the duration of all primary stressed vowel phonemes is increased
by 20%. Stressed vowels are discussed in Section 5.7.4.
2. Eliminate abrupt changes in pitch between phonemes. The
phoneme data is scanned, and if any phoneme pairs have a pitch difference of
greater than 10%, then the lower of the two pitch values is increased by 5% of
the pitch range (a code sketch of this rule is given after this list).
3. Reduce the amount of pitch fall at the end of the utterance.
Utterances usually have a pitch drop in the final vowel and any following
consonants. This rule increases the pitch values of these phonemes by 15%,
hence reducing the size of the terminal pitch fall.
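The sketch below illustrates the shared "eliminate abrupt pitch changes" rule used by both the sad and happy tags. The data layout is simplified to one pitch value per phoneme, and the choice of the higher value of each pair as the reference for the 10% comparison is an assumption, since the rule does not state it explicitly.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Smooths abrupt pitch jumps between neighbouring phonemes: if two adjacent
// values differ by more than 10%, the lower value is raised by 5% of the
// utterance's overall pitch range.
void SmoothAbruptPitchChanges(std::vector<double> &pitch)
{
    if (pitch.empty())
        return;
    double lo = *std::min_element(pitch.begin(), pitch.end());
    double hi = *std::max_element(pitch.begin(), pitch.end());
    double range = hi - lo;

    for (std::size_t i = 0; i + 1 < pitch.size(); ++i) {
        double a = pitch[i];
        double b = pitch[i + 1];
        double higher = std::max(a, b);
        // A pair is considered "abrupt" if the values differ by more than 10%
        // (here taken relative to the higher of the two values).
        if (std::fabs(a - b) > 0.10 * higher) {
            if (a < b)
                pitch[i] += 0.05 * range;
            else
                pitch[i + 1] += 0.05 * range;
        }
    }
}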
5.7.3 Anger
Basic Speech Correlates
Table 5 shows the parameter values set for the SML angry tag. Note that the values were
optimized for the TTS module, and are given as percentage values relative to neutral
speech.
Parameter        Value
Speech rate      +18%
Pitch average    -15%
Pitch range      -15%
Volume factor    1.7

Table 5 - Speech correlate values implemented for anger.
Prosodic rules
The following rule was adopted from Murray and Arnott (1995).
1. Increase the pitch of stressed vowels. The phoneme data is
scanned, and the pitch of primary stressed vowels is increased by 20%, while
the pitch of secondary stressed vowels is increased by 10%.
Inspection of the parameter values in Table 5 will reveal that they differ considerably
from the guidelines shown in Table 2. Initially, speech parameters were set for anger as
shown in Table 2 but preliminary tests showed that even with different prosodic rules, the
angry tag produced output that was too similar to that of the happy tag (both had
increases in speech rate, pitch average and pitch range).
It was decided to keep the increase in speech rate to denote an increase in
excitement, but to lower the pitch average. The lower voice seemed to better convey a
menacing tone, for the same reason that animals utter a low growl to ward off possible
intruders. With the help of the increase in volume, the lowering of the pitch average also
results in a perceived 'hoarseness' in the voice (increased hoarseness in the voice for
anger is supported in the literature: Murray and Arnott, 1993), although such voice
quality effects could not be implemented directly due to a limitation of the Festival and
MBROLA systems. The combination of the decreased pitch range and the intonation rule
results in a flatter intonation curve with sharper peaks. This upholds Table 2's description
of pitch changes for anger: abrupt, downward-directed contours.
5.7.4 Stressed Vowels

Stress type      Duration > Avg. Duration      Pitch > Avg. Pitch
Primary          Yes                           Yes
Secondary        Yes                           No
Tertiary         No                            Yes

Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch.
Therefore, the prosodic rules made use of the fact that different stress types exist
for vowel-sounding phonemes, basing classification on the criteria shown in the table
above. Whether or not this follows the stressed phoneme definition of Murray and Arnott
(1995) is unclear; the fact remains that Table 6 allowed the implementation of prosodic
rules that involve different stressed vowel types, and its success is demonstrated in
the evaluation section of this thesis (Chapter 6).
5.7.5 Conclusion
While the literature identifies a number of speech correlates of emotion, speech
synthesizer limitations did not allow for the implementation of some of them; namely,
the intensity, articulation and voice quality parameters. This may have been the main
reason why happiness and anger would have been too similar if the recommendations
found in the literature had been strictly followed: based on acoustic features alone, there
were too few differences between the two emotions.
In discussing HAMLET's prosodic rules, Murray and Arnott (1995) state that the
rules were developed to be as synthesizer-independent as possible. This was found to be
the case, though the specific values given in their paper for the DECtalk system obviously
could not be used. Still, the values served as a very good indication of which settings
needed to be changed for different emotions. The very fact that some of the HAMLET
prosodic rules could be implemented in this project's TTS module serves to show that the
work of Murray and Arnott (1995) is not speech synthesizer dependent. This has the
added advantage of ensuring that the emotion rules implemented in the TTS module
are not dependent on the Festival Speech Synthesis System and the MBROLA
Synthesizer.
also affected. Figure 21 shows the factors that duration and pitch values are multiplied
by.
[Figure 21 - Multiply factors of pitch and duration values for emphasized phonemes, illustrated for the word "sorry" (phonemes s o r ii) with duration = 1.8 and pitch = 1.2, and the target phoneme's neighbours indicated.]
The attributes that can be specified within the emph tag give various options on how
the word will be emphasized. For instance, the affect attribute can specify whether
only the pitch should change for the affected phonemes (default), or just the duration, or
both. The level attribute specifies the strength of the emphasis (e.g. weak (default),
moderate, strong). The current limitation of the tag is that only one target phoneme can
be specified. However, this can be easily modified so that multiple target phonemes can
exist within a word by extending the target attribute and how it is processed.
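A minimal sketch of how the emph processing described above could be applied to a word's phoneme data follows. The structures and the default factors (taken from Figure 21) are illustrative assumptions, and the neighbouring-phoneme adjustments indicated in Figure 21 are omitted.

    #include <string>
    #include <vector>

    // Hypothetical phoneme record; the 1.8 and 1.2 default factors follow Figure 21.
    struct Phoneme { std::string symbol; double durationMs; double pitchHz; };

    enum class Affect { Pitch, Duration, Both };

    // Apply emphasis to the (single) target phoneme of a word, as the emph tag does.
    void applyEmphasis(std::vector<Phoneme>& word, const std::string& target,
                       Affect affect, double durationFactor = 1.8, double pitchFactor = 1.2) {
        for (Phoneme& p : word) {
            if (p.symbol != target) continue;
            if (affect != Affect::Pitch)    p.durationMs *= durationFactor;
            if (affect != Affect::Duration) p.pitchHz    *= pitchFactor;
            break;   // current limitation: only one target phoneme is emphasized
        }
    }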
b) embed
The embed tag enables foreign file types to be embedded within SML markup. The
type of embedded file is specified through the type attribute. Currently, two file types
are supported in SML:
audio - the embedded file is an audio file, and is played aloud. The
filename is specified through the src attribute. Embedding audio files within
SML markup is useful for sound effects.
<embed type="audio" src="sound.wav"/>
mml - the embedded file is an MML file, and is used for synthetic singing. Processing
this file involves obtaining the two input filenames, and calling the MML
library that will perform synthetic singing. The implementation of MML for
synthetic singing was developed by Stallo (2000).
<embed type="mml" music_file="rowrow.mml" lyr_file="rowrow.lyr"/>
c) pause
The pause tag inserts a silent phoneme at the end of the last word of the previous
text node, with a duration specified by the length or msec attributes. Finding the text
node previous to the pause node is non-trivial, as the algorithm must be able to
handle any sub-tree structure. Figure 22 shows a possible structure that the algorithm
must be able to handle.
Take a deep <emph>breath</emph> <pause length="medium"/> and continue
[Figure 22 - DOM sub-tree for the markup above: a TEXT_NODE ("Take a deep"), an emph ELEMENT_NODE containing a TEXT_NODE ("breath"), a pause ELEMENT_NODE (length="medium"), and a following TEXT_NODE. The silent phoneme is to be inserted at the end of "breath".]
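The backward search for the preceding text node could be sketched as follows. The Node structure is a simplified stand-in, not the libxml DOM API, and the function illustrates the idea rather than the module's actual algorithm.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Simplified DOM-like node, used only to illustrate the backward search.
    struct Node {
        enum Type { TEXT_NODE, ELEMENT_NODE } type;
        std::string name;                 // element name, e.g. "emph", "pause"
        std::string text;                 // text content for TEXT_NODEs
        Node* parent = nullptr;
        std::vector<Node*> children;
    };

    // Walk preceding siblings of 'start' (descending into their right-most
    // descendants), climbing to the parent when a level is exhausted.
    Node* previousTextNode(Node* start) {
        Node* n = start;
        while (n != nullptr && n->parent != nullptr) {
            std::vector<Node*>& siblings = n->parent->children;
            std::size_t idx = 0;
            while (idx < siblings.size() && siblings[idx] != n) ++idx;
            while (idx > 0) {
                Node* candidate = siblings[--idx];
                // descend into the deepest, right-most descendant
                while (candidate->type == Node::ELEMENT_NODE && !candidate->children.empty())
                    candidate = candidate->children.back();
                if (candidate->type == Node::TEXT_NODE) return candidate;
            }
            n = n->parent;                // nothing at this level; go up one level
        }
        return nullptr;                   // no preceding text node exists
    }

For the example markup, starting at the pause node this search descends into the emph element and returns the "breath" text node, which is where the silent phoneme is appended.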
d) pitch
Pitch average
The pitch average is modified by changing the pitch values of every phoneme's pitch
point(s). The middle attribute specifies the factor by which the pitch values need to change.
Modifying the pitch value of every phoneme by the same amount has the effect of
changing the pitch average.
Pitch range
Modifying the pitch range of an utterance requires that the pitch average be known.
Therefore, the pitch average of the utterance is calculated before any pitch values can be
modified. Each pitch value is then recalculated relative to the pitch average (a sketch of
the calculation is given after Figure 23 below).
This has the effect of moving pitch points further away from the average line, for
pitch values that are both greater and less than the pitch average (as shown in Figure 23).
Care is taken that the new pitch values do not go above or below predetermined
thresholds, as otherwise the voice loses its human quality and sounds too machine-like.
[Figure 23 - The effect of widening the pitch range of an utterance: original pitch values and new pitch values shown relative to the pitch average.]
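The recalculation described above is consistent with the relation new pitch = average + factor x (old pitch - average), where a factor greater than 1 widens the range. A sketch with clamping follows; the threshold values shown are illustrative only, not those used by the TTS module.

    #include <algorithm>
    #include <vector>

    // Widen (or narrow) the pitch range about the utterance's pitch average:
    //   newPitch = average + factor * (oldPitch - average)
    // New values are clamped so the voice stays within plausible limits.
    void scalePitchRange(std::vector<double>& pitchValues, double factor,
                         double minHz = 50.0, double maxHz = 400.0) {
        if (pitchValues.empty()) return;
        double average = 0.0;
        for (double p : pitchValues) average += p;
        average /= static_cast<double>(pitchValues.size());
        for (double& p : pitchValues) {
            p = average + factor * (p - average);
            p = std::clamp(p, minHz, maxHz);
        }
    }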
e) pron
The pron tag is used to specify a particular pronunciation of a word. The pron tag is
the only tag that modifies the text content of an SML Document text node, and is
processed before the content is given to the NLP module at the text to phoneme
transcription stage. This is because the value of the sub attribute overwrites the contents
of the text node. When the phoneme transcription stage is reached, the substituted text is
what is given to the NLP module, and so the phoneme transcription reflects the specified
pronunciation of the markup. Figure 24 illustrates how this is done for the markup
segment:
<pron sub="toe may toe">tomato</pron>
[Figure 24 - Processing of the pron tag: the sub attribute value ("toe may toe") replaces the original TEXT_NODE content ("tomato") before the NLP and phoneme transcription stages, so the transcription reflects the substituted text.]
f) rate
The speech rate is modified very easily by affecting the phoneme duration data in the
text node structures. How much the speech rate should increase or decrease is specified by
the speed attribute, and each phoneme's duration is multiplied by a factor reflecting the
value of speed.
g) volume
The volume is modified through the use of the MBROLA -v command line
option. The level attribute specifies the volume change, and this is converted to a
suitable value to pass to MBROLA through the command line. The disadvantage of this
way of implementing volume control is that MBROLA applies the volume to the whole
utterance it synthesizes. Since sections are passed to MBROLA at the emotion tag level,
the volume can vary at most from emotion to emotion. Therefore, dynamic volume
change within an emotion tag is not possible, and has been identified as an area for future
work (see Section 7.1).
Name
The value of the name attribute is passed to MBROLA on the command line, and must be
the name of an MBROLA diphone database that already exists on the system. Using a
different diphone database changes the voice since the recorded speech units contained
within the database are from a different source. The TTS module can currently use two
diphone databases: en1, and us1. Making use of more diphone databases (when they
become available) requires very minimal additions. Unfortunately, specifying the actual
name of the MBROLA diphone database in the SML markup has been identified as a
design flaw, since an SML user must be aware that the MBROLA synthesizer is
being used and is forced to write markup that directly accesses an MBROLA voice. It is
very important that this should be altered in the future.
Gender
The MBROLA synthesizer provides the ability to change frequency values and voice
characteristics, and thus provides a way of obtaining a male and female voice from the
same diphone database. Obtaining a male or female voice requires the specification of
the following MBROLA command line options:
1. Frequency ratio specified through the -f command line option. For
instance, if -f 0.8 is specified on the MBROLA command line, all
fundamental frequency values will be multiplied by 0.8 (voice will sound
lower).
2. Vocal tract length ratio specified through the -l command line option. For
instance, if the sampling rate of the database is 16000, indicating -l 18000
allows you to shorten the vocal tract by a ratio of 16/18 (which will make the
voice sound more feminine).
Unfortunately, values for the -f and -l MBROLA command line options are
dependent on the diphone database being used. Table 7 shows the parameter values
required to obtain a male and female voice for the en1 and us1 diphone databases.
Gender                -f     -l
Male (using en1)      1.0    16000
Female (using en1)    1.6    20000
Male (using us1)      0.9    16000
Female (using us1)    1.5    16000
Table 7 - MBROLA command line option values for en1 and us1 diphone databases to output male and female voices.
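Putting the speaker options together, an MBROLA command line could be assembled as sketched below. The option semantics (-v volume ratio, -f frequency ratio, -l vocal tract value) and the Table 7 values follow the text; the invocation details and file names are assumptions rather than the TTS module's actual code.

    #include <cstdio>
    #include <string>

    // Build an MBROLA command line: -v volume ratio, -f frequency ratio,
    // -l vocal tract (sampling) value, followed by database, .pho and output files.
    std::string mbrolaCommand(const std::string& database, bool female,
                              double volumeRatio,
                              const std::string& phoFile, const std::string& wavFile) {
        double freqRatio  = 1.0;      // male defaults for en1 (Table 7)
        int    vocalTract = 16000;
        if (database == "en1" && female) { freqRatio = 1.6; vocalTract = 20000; }
        if (database == "us1")           { freqRatio = female ? 1.5 : 0.9; vocalTract = 16000; }

        char buf[256];
        std::snprintf(buf, sizeof(buf), "mbrola -v %.2f -f %.2f -l %d %s %s %s",
                      volumeRatio, freqRatio, vocalTract,
                      database.c_str(), phoFile.c_str(), wavFile.c_str());
        return std::string(buf);
    }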
The set of recorded speech sounds is often referred to as the voice data corpus, or a diphone database.
[Figure - Example phoneme data in MBROLA input format: each line gives a phoneme symbol, a duration (ms), and optional pitch points, e.g. "i 61 0 109 50 110", "m 69", "@ 48 0 105 50 94", "n 91 100 90", "# 210".]
choice of good descriptive tag names (if an appropriate name is already being
used in another tag set), and similar tag names such as anger and angry will
only confuse future users of TTS and FAML markup. A possible solution would
be the use of XML Namespaces, which allow different markup languages to
contain the same tag names (ambiguity is resolved through the use of resolution
identifiers, e.g. sml.angry and faml.angry).
2. FAML API functions are called from within the TTS module to initialize the
FAML module, and to allow the FAML module to modify the output FAP
stream.
3. The creation of temporary utterance files in a format required by the FAML
module. This information is needed by the FAML module for the proper
synchronization of its generated facial gestures and the TTS module's speech.
The utterance file contains word and phoneme information in the format shown
in Figure 26.
(each ">" line gives a word; the lines that follow give each phoneme and its duration in ms)
>And
# 210
a 90
n 49
d 46
>now
n 72
au 262
# 210
>the
dh 45
@ 40
>latest
l 69
ei 136
t 71
i 66
s 85
t 66
>news
n 72
y 45
uu 217
z 82
# 210
Figure 26 - Example utterance information supplied to the FAML module by the TTS module.
Example phrase: "And now the latest news".
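A minimal sketch of writing the utterance file of Figure 26 is given below; the Word and Phoneme structures are hypothetical stand-ins for the TTS module's internal data.

    #include <fstream>
    #include <string>
    #include <vector>

    // Hypothetical structures mirroring the utterance file of Figure 26:
    // each word is written as ">word" followed by one "phoneme duration" line
    // per phoneme (durations in milliseconds).
    struct Phoneme { std::string symbol; int durationMs; };
    struct Word    { std::string text; std::vector<Phoneme> phonemes; };

    bool writeUtteranceFile(const std::string& path, const std::vector<Word>& words) {
        std::ofstream out(path);
        if (!out) return false;
        for (const Word& w : words) {
            out << '>' << w.text << '\n';
            for (const Phoneme& p : w.phonemes)
                out << p.symbol << ' ' << p.durationMs << '\n';
        }
        return static_cast<bool>(out);
    }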
5.11 Summary
This chapter has demonstrated that the design and implementation of the
described TTS module follow the black box design principle in virtually all of its
subsystems.
As Figure 7 shows, the dependence of other Talking Head modules on the
TTS module is strictly limited to its precisely defined inputs - any modification
of the internal workings of the TTS module will not affect how the rest of the
Talking Head functions.
The TTS module's own dependence on the tools it uses - such as libxml, the
Festival Speech Synthesis System, and the MBROLA Synthesizer - has been
carefully bounded through proper design of C++ classes and structures. There is
also no dependence of the DSP unit (which makes use of MBROLA) on the NLP
unit (Festival), so minimal changes would be required if another NLP unit were
used, and the TTS subsystem design would remain unchanged. The only
requirement for a new speech synthesizer to be used in the NLP and DSP units
is the ability to manipulate the utterance at the phoneme data level
(including pitch and duration information).
The prosodic and speech parameter rules for the emotion tags are totally
speech synthesizer independent, except for the volume settings. Because of this,
the emotion tags could be easily ported for use in another TTS module using
different speech synthesizers.
Processing of most of the low-level speech tags makes use of the SML
Document's utterance structures only. However, the speaker and volume tags
make heavy use of the MBROLA Synthesizer, because these tags affect the way
MBROLA produces the waveform. Future work will look at minimizing the
dependence of these two tags on MBROLA.
Chapter 6
Results and Analysis
Evaluation of this project was primarily concerned with testing the hypotheses stated in
Section 4.1. Therefore, the evaluation process endeavoured to ascertain how well the
system is able to simulate emotion in synthetic speech, and the extent of the effect this
has on a Talking Head's ability to communicate. This chapter describes the procedure by
which data was acquired to test the hypotheses, followed by a presentation of the data
and a full analysis.
b)
c) The same phrases as in (a) above, but spoken in one of the four emotions (includes neutral).
d) The same phrases as in (b) above, but spoken in one of the four emotions.
Both Cahn (1990b) and Murray and Arnott (1995) expressed difficulty in finding
appropriate test phrases, and indeed, the same difficulties were encountered when
designing the questionnaire for this project, especially in finding neutral phrases that
sounded convincing under any of the four emotions. A number of phrases were
borrowed from the experiments of the aforementioned researchers, and the rest were
original. It is important to note that the participants were not made aware of the different
types of phrases that were prepared. See Appendix E for a list of the example test
phrases used.
An important design issue for this section of the questionnaire was deciding the way
in which the participants should indicate their choice. Murray and Arnott (1995) have
identified (and used) two basic methods of user input that are suitable for speech emotion
recognition: forced response tests, and free response tests. In a forced response test, the
subject is forced to choose from a list of words the one that best describes the emotion
that he or she perceives is being spoken. In a free response test, the subject may write
down any word that he or she thinks best describes the emotion.
In experiments performed by Cahn (1990), the evaluation was based on a forced
response test, with only the six emotions that the Affect Editor simulated as possible
responses. Participants were also asked to indicate (on a scale of 1 to 10) how strongly
they heard the emotion in the utterance, and how sure they were.
For this project, it was decided to adopt the forced response test because of its
simplicity and to avoid the possible ambiguity of a free response test (for instance, if a
participant wrote down "exasperated", should it be categorized as angry or disgusted,
or neither?). However, a mechanism needed to be provided so as not to limit the possible
selection of responses. This was important because only three emotions other than
neutral - anger, happiness, and sadness - were simulated by the system. It was feared that
should the selection list be confined to just four possible responses, then it could
potentially invalidate the data. For instance, any utterance that contained positive words,
or had a positive tone to it, would immediately be perceived as happy only because
it would be the only positive emotion in the list. This situation was avoided by adding
two distractor emotions to the selection list: surprise and disgust. This way, listeners
would have more choice. In addition to this, an "Other" option was added to the
selection list to enable participants to write down their own descriptive term if they so
wished. Therefore, a total of seven possible responses were made available for the four
different types of emotion being simulated.
Usefully, the tests in this section had a very similar structure to the forced
response tests conducted by Murray and Arnott (1995). The experiment could not be
exactly the same, since time limitations did not permit as many test utterances to be
played, and not all the emotions tested by Murray and Arnott were simulated. Still, the
experiments are similar enough to provide at least a loose comparison of results.
It would have been interesting to test a female voice as well, since the
voice gender can be changed through the speaker tag (see Section 5.8.2). However,
the questionnaire length was strictly limited, and a sufficient number of examples of one
gender were needed to obtain useful data; interchanging male and female voices would
have introduced complex variables, and making valid conclusions from the data would
have been difficult, if not impossible. Still, differences in subject responses to male and
female voices could have provided useful data for analysis, and this should
certainly be looked into in the future (see Chapter 7).
Section 3 Talking Head
For this section it was desired to obtain data that could describe the effect that adding
vocal emotion to a Talking Head had on a user's perception of the Talking Head. To do
this it was decided to prepare two movies of the Talking Head: one speaking without
vocal emotion effects, and the other including emotion effects in its speech - all other
variables were kept as similar as possible (e.g. facial expressions, movements,
and the actual words spoken). It was not possible to have the visual information of the
Talking Head exactly the same for both examples, since the inclusion of certain speech
tags affected the length of the utterance, and so slight movements such as eye blinking
may have been different. Facial expressions showing emotion, however, were the same
for both examples (the exact placement, duration and intensity of the expressions could
be controlled via facial markup tags (Huynh, 2000)).
The utterance that was synthesized for both examples was excerpted from the Lewis
Carroll classic Alice's Adventures in Wonderland (Carroll, 1946). The excerpt contained
dialogue between three characters of the story: Alice, the March Hare, and the Hatter.
The passage was chosen for its wonderful expressiveness (it is a children's novel), the
fact that it included dialogue, and because various emotions such as sadness, curiosity, and
disappointment could be used when reading the passages.
The first example's speech was synthesized without any markup at all, and therefore
included Festival's intonation unchanged. For the second example, the text was marked
up by hand using a variety of high-level speech emotion tags, and lower-level speech tags
such as rate, pitch and emph (see Section 5.8).
screen at the front of the theatre. The room also had an adequate sound system enabling
each participant to comfortably listen to the demonstration. A total of 45 participants
took part in the demonstration, a number large enough to produce adequate data.
The demonstration began by very briefly introducing the participants to what the
project was about. They were told that by filling out the questionnaire they would be
helping to evaluate how well the project addresses the problem it had set out to solve.
Nevertheless, it was made clear to all participants that it was not compulsory that they
should sit the demonstration. The participants were given an overview of what to expect
in the questionnaire, including the fact that they would be played a number of sound files
and asked to comment on each one. It was emphasized however, that it was not they who
were being tested, but the program itself, and that there were no right or wrong answers.
The participants were asked to fill out Section 1 of the questionnaire. They were told
the relevance of the questions being asked in that section, but that they were under no
obligation to answer questions they did not feel comfortable with. The participants were
encouraged to ask questions to clarify any details they felt had not been made clear.
The sound and movie demonstrations of Sections 2 and 3 had been pre-rendered and
made part of a Microsoft PowerPoint presentation that was shown on the large screen at
the front of the lecture theatre. The reason for this was to avoid waiting for the utterances
to be generated, and to provide a way of showing the audience the current section and
example number that was being played. It also minimized the risk of anything going
wrong.
Parts A and B of Section 2 consisted of the test utterances described in Section 6.1.1
being played. For each test utterance, the sound file was played twice and the
participants were given time to choose which emotion best suited how the speaker
sounded. In order to give the participants a chance to listen to the test utterance again and
confirm their choice, the five most recent utterances were repeated every fifth example.
The repeating of the utterances also served as a mental break for the participants.
The next section of the questionnaire consisted of the two Talking Head examples
described in Section 6.1.1. The participants were asked to first watch both of the
examples, and then fill out that part of the questionnaire. Before the examples were
played, the participants were asked to scan the four questions asked in that section so as
to aid them in what they should look for - the Talking Head's clarity, expressiveness,
naturalness, and appeal. However, the participants were not asked to focus on the voice
only, even though it was only the voice that changed in the two examples. The
participants were then asked to fill out the rest of the questionnaire, which consisted of
more general questions (see Appendix D).
Important to note is that there were some factors that did not allow the evaluation to
be carried out under ideal conditions. For instance, the very fact that the demonstration
was held with a group of participants may have been a cause of distraction for some.
An ideal situation would have been for each participant to complete the
questionnaire one at a time, in front of a computer, without anyone else in the room. This
would have minimized any distractions, and would have allowed the participant to go at
his or her own pace. Given the time and resources available, however, this was not
possible. Still, the demonstration was designed to be short enough to keep the
participants' attention, to give each participant the ability to review their answers
through playbacks, and to be conducted in such a way as to give the participants ample
time to answer the questions.
[Table - Participant statistics gathered in Section 1 of the questionnaire.]
In addition to the statistics shown in the table, it should be noted that all participants
were students enrolled in a second year Computer Science introductory graphics unit. All
participants were computer literate and used computers at home, school and for work.
71.1% had heard of the term "speech synthesis" before, while 26.7% had not (2.2% were
unknown). Also, 82.2% had seen a Talking Head before, while 15.6% had not (2.2%
were unknown).
For example, in Table 10, the first row of example percentage values shows that when
an utterance was combined with happy vocal emotion effects generated by the system,
42.2% of listeners perceived the emotion as happy (i.e. correct recognition took place),
2.2% perceived it as sounding sad, 6.7% angry, 17.9% neutral, 23.7% surprised, 4.4%
disgusted, and 2.9% specified something else. The data in the example would therefore
indicate that happiness was the most recognized emotion for that utterance, and that the
utterance was mostly confused with surprise and neutral. Note that the "Other" category
also includes those participants who could not decide which emotion was being
portrayed.
(Footnote 5: The intended meaning is not that the participants failed to recognize the simulated emotion through a fault of their own, but simply that the participant's choice did not match the emotion being simulated.)
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        42.2%    2.2%    6.7%    17.9%     23.7%       4.4%        2.9%
Sad
Angry
Neutral
Table 10 - Example listener response data for an utterance spoken with happy vocal emotion (other rows left blank for illustration).
From the above example, it can be seen that the data should be read in rows; each
row represents an utterance or group of utterances that were simulating a specific
emotion, and the cell values for that row show the distribution of the participants'
responses.
Cells that have the same row and column names hold values that represent when the
listener's perceived emotion matched the emotion being simulated - any other cell
represents wrong recognition. A table holding values such as those in Table 11 would
therefore be showing ideal data - 100% recognition of the simulated speech emotion took
place for all emotions.
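For illustration, a confusion matrix of this kind could be computed from raw listener responses as follows; the index conventions are assumptions made for the sketch, not part of the evaluation software actually used.

    #include <array>
    #include <utility>
    #include <vector>

    // Build a stimulus-by-perceived-emotion confusion matrix from listener
    // responses and convert each row to percentages. Assumed index conventions:
    // stimuli 0-3 = happy, sad, angry, neutral;
    // perceived 0-6 = happy, sad, angry, neutral, surprised, disgusted, other.
    std::array<std::array<double, 7>, 4>
    confusionPercentages(const std::vector<std::pair<int, int>>& responses) {
        std::array<std::array<double, 7>, 4> matrix{};   // zero-initialised
        std::array<double, 4> rowTotals{};
        for (const std::pair<int, int>& r : responses) { // r = {stimulus, perceived}
            matrix[r.first][r.second] += 1.0;
            rowTotals[r.first] += 1.0;
        }
        for (int s = 0; s < 4; ++s)
            for (int p = 0; p < 7; ++p)
                if (rowTotals[s] > 0.0)
                    matrix[s][p] = 100.0 * matrix[s][p] / rowTotals[s];
        return matrix;
    }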
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        100%     0.0%    0.0%    0.0%      0.0%        0.0%        0.0%
Sad          0.0%     100%    0.0%    0.0%      0.0%        0.0%        0.0%
Angry        0.0%     0.0%    100%    0.0%      0.0%        0.0%        0.0%
Neutral      0.0%     0.0%    0.0%    100%      0.0%        0.0%        0.0%
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all
simulated emotions.
2. People vary in their ability to recognize emotional expressions. In
a study described in Knapp (1980), listeners ranged from 20% correct to
over 50% correct.
3. Emotions themselves vary in how reliably they can be correctly recognized.
Another study showed anger was identified 63% of the time whereas
pride was identified only 20% of the time.
So if obtaining such ideal data for human speech is unrealistic, this is even more the case
when emotion is being simulated in synthetic speech.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        18.9%    3.3%    3.3%    41.1%     22.2%       6.7%        4.4%
Table 12 - Listener response data for neutral phrases spoken with happy emotion.
Similarly, Table 13 shows listener response data for all four emotions demonstrated
in the test utterances for Section 2A. Significant values are displayed in a larger font.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        18.9%    3.3%    3.3%    41.1%     22.2%       6.7%        4.4%
Sad          0.0%     77.8%   1.1%    0.0%      6.7%        6.7%        7.8%
Angry        6.7%     0.0%    44.4%   21.1%     5.6%        21.1%       1.1%
Neutral      6.7%     23.3%   2.2%    54.4%     3.3%        3.3%        6.7%
Table 13 - Listener response data for all four emotions in the test utterances of Section 2A.
The following observations can be made from the data for each emotion (including
happy, which has already been discussed).
a) Happy. A poor recognition rate (18.9%), with the stimulus being
strongly confused with neutral (41.1%) and surprise (22.2%).
b) Sad. A very high recognition rate occurred for sadness (77.8%),
with little confusion occurring with other possible emotions.
c) Angry. A relatively high percentage of listeners recognized the
angry stimulus (44.4%), but a considerable proportion confused anger with
neutral and disgust (21.1% each).
d) Neutral. Most listeners correctly recognized when the utterance
was played without emotion (54.4%), but a significant portion (23.3%)
perceived the emotion as sad.
The Analysis
Any percentage value substantially greater than 14% was deemed significant. This is
based on the logic that if all participants had randomly chosen one of the seven emotions,
then a particular emotion would have a 1/7 (~14%) chance of being chosen. Therefore, if a
cell has a percentage substantially greater than 14%, then there must have been a factor
(or factors) that influenced the listeners' choices.
Except for happy, all emotions had a recognition rate greater than 14%. However,
sadness was the only emotion that enjoyed little confusion with other emotions. So with
the exception of sadness, average recognition of the simulated emotion in the utterances
was quite low.
To give a possible explanation for these values, it will be helpful to look at the data
for each question separately. Table 14 shows listener response data for Question 1 of
Section 2A, and Table 15 shows listener response data for Question 2.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        6.7%     2.2%    6.7%    57.8%     13.3%       8.9%        4.4%
Sad          0.0%     66.7%   2.2%    8.9%      0.0%        11.1%       11.1%
Angry        0.0%     0.0%    66.7%   6.7%      4.4%        20.0%       2.2%
Neutral      0.0%     46.7%   2.2%    40.0%     2.2%        2.2%        6.7%
Table 14 - Listener response data for Question 1 of Section 2A.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        31.1%    4.4%    0.0%    24.4%     31.1%       4.4%        4.4%
Sad          0.0%     88.9%   0.0%    0.0%      2.2%        2.2%        6.7%
Angry        13.3%    0.0%    22.2%   35.6%     6.7%        22.2%       0.0%
Neutral      13.3%    0.0%    2.2%    68.9%     4.4%        4.4%        6.7%
Table 15 - Listener response data for Question 2 of Section 2A.
It is clear that the two tables hold substantially different data values.
Whereas for Question 1 the happy stimulus received a recognition rate of 6.7%, with
57.8% mistaking it for neutral, this changed dramatically for Question 2, with 31.1% of
listeners correctly identifying the happy emotion, and only 24.4% classifying the
utterance as neutral. There is still considerable confusion occurring for the happy
stimulus even for Question 2 (surprise is 31.1%), but the difference between the two
questions' recognition rates is too large to ignore.
Therefore, it is evident that the results in this section were very much utterance
dependent. From the data, it seems that the phrase "The telephone has not rung at all
today" (Question 1) was said more effectively in an angry tone than "I received my
assignment mark today" (Question 2), and this was despite the care that was taken to
choose neutral test phrases. Both Cahn (1990) and Murray and Arnott (1995) report this
problem of utterance-dependent results, and it demonstrates the difficult task of obtaining
quantitative results on speech emotion recognition.
An interesting observation concerns cases where a particular stimulus received strong
listener recognition, and the emotion in the same row receiving the next highest response. For
example, anger received strong recognition for Question 1 (see Table 14). For that
utterance, the emotion that anger was most confused with is disgust, which received
20.0%. Close scrutiny of the two tables will also reveal that when happiness was
strongly recognized, the emotion it was most confused with is surprise (see row 1 of
Table 15). Also, neutral in Table 14 was most confused with sad. The pattern that
emerges is the one described in the literature: the pairs that are most often confused with
each other are happiness-surprise, sadness-neutral, and anger-disgust. This is because
the speech correlates identified in the literature are very similar for these emotion pairs.
As a consequence, emotion recognition will often be confused between these emotion
pairs, especially with neutral text.
From Table 14 and Table 15, it can also be seen that generally, recognition of the
simulated emotions improved for Question 2. This could be due to the listeners
becoming accustomed to the synthetic voice and learning to distinguish between the
emotions. It could also have been due to the listeners, who were all students, relating
more to the phrase of the second question, which was about receiving an assignment
mark. This could be clarified with further tests.
Four types of test utterance were used:
a) Emotionless text with no vocal emotion.
b) Emotive text with no vocal emotion.
c) Emotionless text with vocal emotion.
d) Emotive text with vocal emotion.
In this part of the analysis, the data will be presented for each of these four test
utterance types.
Table 16 shows listener responses for utterances with emotionless text and no vocal
emotion; that is, utterances whose text gave no indication of a particular emotion from
the words alone, and which were spoken with no vocal emotion effects. The text phrases
used for this section are phrases 1-5 shown in Appendix E.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Neutral      2.2%     36.9%   2.7%    44.4%     3.1%        6.7%        4.0%
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion.
The observation that can be made from the data is that there was a strong recognition
that the utterances were spoken with no emotion. However, the utterances were also
confused with sadness, which too received a strong listener response (36.9%). Again, like
the previous section of the questionnaire (discussed in Section 6.2.2), it was found that
there was a great deal of variation in listeners' perception that was dependent on the
utterance being spoken. For instance, one of the phrases in this section was "The
telephone has not rung at all today". The majority (73.3%) of listeners perceived the
speaker as being sad, while only 20.0% thought it sounded neutral. In contrast, the phrase
"I have an appointment at 2 o'clock tomorrow" was perceived as more neutral (62.2% for
neutral, 20.0% for sad). The point being emphasised is that although care was taken to
choose emotionally undetermined phrases, the data suggests that this is a very difficult
task. Our dependency on context to discriminate emotions is stated in Knapp (1980), and
has been confirmed in this evaluation.
Albeit in varying degrees, the general trend for this subsection was that the neutral
voice was often perceived as sounding sad. This suggests that Festival's intonation,
which is modelled to be neutral, may have an underlying sadness. Interestingly, Murray
and Arnott (1995) also made this observation with the HAMLET system, which makes
use of the MITalk phoneme duration rules described in Allen et al. (1987, Chapter 9).
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        13.3%    8.9%    0.0%    68.9%     0.0%        0.0%        8.9%
Sad          2.2%     48.9%   2.2%    37.8%     0.0%        4.4%        4.4%
Angry        1.1%     0.0%    42.2%   33.3%     2.2%        20.0%       1.1%
Neutral      2.2%     4.4%    11.1%   0.0%      6.7%        0.0%        75.6%
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        24.4%    1.1%    0.0%    22.2%     40.0%       2.2%        10.0%
Sad          0.0%     72.2%   0.0%    17.8%     0.0%        2.2%        7.8%
Angry        0.0%     0.0%    91.1%   2.2%      0.0%        4.4%        2.2%
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Happy        66.7%    4.4%    0.0%    13.3%     4.4%        2.2%        8.9%
Sad          0.0%     62.2%   4.4%    24.4%     0.0%        0.0%        8.9%
Angry        0.0%     0.0%    77.8%   1.1%      0.0%        15.6%       5.6%
Neutral      6.7%     2.2%    0.0%    71.1%     13.3%       4.4%        2.2%
Table 19 - Listener responses for utterances containing emotive text with vocal emotion.
By studying the two confusion matrices in Table 16 and Table 18, it can be seen that
emotion recognition was enhanced for neutral phrases spoken with vocal emotion
compared to the same text spoken without vocal emotion. The two confusion matrices,
however, show only the number of listeners who correctly or incorrectly recognized the
simulated emotion. What they do not show is the number of listeners who improved with
the addition of vocal emotion effects. To determine the effect of vocal emotion, therefore,
the data should be filtered to exclude listeners who correctly recognized the intended
emotion when the utterance was spoken without vocal emotion and who also correctly
recognized it when the utterance was spoken with vocal emotion.
In order to address this, further analysis was carried out on the data to determine the
effect the addition of vocal emotion had on listener emotion recognition. To do this, the
analysis kept track of listeners who had improved in their recognition of the intended
emotion, and also kept track of listeners whose recognition was shown to deteriorate
when the utterance was spoken with vocal emotion. Table 20 shows for each simulated
emotion, the percentage of listeners who incorrectly recognized the intended emotion
when it was spoken without vocal emotion and who then improved in their recognition
when the utterance was spoken with vocal emotion. Similarly, Table 21 shows the
percentage of listeners whose emotion recognition deteriorated with the addition of vocal
emotion effects.
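A sketch of this paired improvement/deterioration analysis is given below. The percentages here are taken over all listeners for a given emotion; whether the thesis normalised in exactly this way is an assumption.

    #include <utility>
    #include <vector>

    // For one simulated emotion, each listener contributes a pair of booleans:
    // (recognized correctly WITHOUT vocal emotion, recognized correctly WITH it).
    // Improvement = incorrect -> correct; deterioration = correct -> incorrect.
    struct Change { double improvedPct; double deterioratedPct; };

    Change recognitionChange(const std::vector<std::pair<bool, bool>>& listeners) {
        if (listeners.empty()) return {0.0, 0.0};
        int improved = 0, deteriorated = 0;
        for (const std::pair<bool, bool>& l : listeners) {
            if (!l.first &&  l.second) ++improved;
            if ( l.first && !l.second) ++deteriorated;
        }
        double n = static_cast<double>(listeners.size());
        return {100.0 * improved / n, 100.0 * deteriorated / n};
    }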
Emotion    Improved recognition
Happy      22.2%
Sad        36.7%
Angry      84.4%
Table 20 - Percentage of listeners whose recognition of the intended emotion improved when vocal emotion was added to a neutral text utterance.

Emotion    Deteriorated recognition
Happy      2.2%
Sad        12.2%
Angry      2.2%
Table 21 - Percentage of listeners whose recognition deteriorated when vocal emotion was added to a neutral text utterance.
The above results show that an overall significant increase in emotion recognition
occurred with the introduction of vocal emotion effects in a neutral utterance; this was
true for all the simulated emotions. Anger gained the largest increase in emotion
recognition (84.4%). This suggests that if a neutral text utterance is to be perceived as
being spoken in anger, then its perception is very much dependent on the vocal emotion
effects - without vocal emotion, it is simply very difficult to perceive the speaker as
being angry.
Deterioration for sadness was higher than the other two emotions. The reason for
this is not clear, except that confusion between sadness and neutrality occurred with all
utterance types.
Emotion    Improved recognition
Happy      57.8%
Sad        31.1%
Angry      41.1%
Table 22 - Percentage of listeners whose recognition of the intended emotion improved when vocal emotion was added to an emotive text utterance.

Emotion    Deteriorated recognition
Happy      4.4%
Sad        17.8%
Angry      6.7%
Table 23 - Percentage of listeners whose recognition deteriorated when vocal emotion was added to an emotive text utterance.
From the above results, it can be seen that all emotions received a significant
increase in emotion recognition once vocal emotion was added to the utterance, with the
greatest improvement occurring for happiness (57.8%). Anger did not improve so much
with emotive text as with neutral text. This shows that emotive text also had a strong
influence on determining the correct emotion being simulated.
As with Table 21, Table 23 also shows that the deterioration of recognition with
sadness was significantly higher than other emotions; sadness-neutral confusion was a
problem throughout all the tests. Still, the substantial effect of vocal emotion even on
emotive text phrases can be clearly seen through this analysis.
participants. One discrepancy was found, however, with differences too significant to
overlook: it was noted from the data that participants who did not speak English as their
first language confused sadness with neutrality significantly more than people who did
speak English as their first language. This was noted for "neutral text, emotive voice"
and "emotive text, emotive voice" type utterances. Table 24 shows listener responses for
the sadness stimulus for participants who speak English as their first language, while
Table 25 shows the same listener responses for people who do not speak English as their
first language. Both tables are for "neutral text, emotive voice" utterance types.
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Sad          0.0%     82.7%   0.0%    5.8%      0.0%        1.9%        9.6%
Table 24 - Listener responses for participants who speak English as their first language.
Utterance type is "neutral text, emotive voice".
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Sad          0.0%     55.3%   0.0%    34.2%     0.0%        2.6%        7.9%
Table 25 - Listener responses for participants who do NOT speak English as their first
language. Utterance type is "neutral text, emotive voice".
The pattern is also reflected in the following data for emotive text, emotive voice
utterance types (Table 26 and Table 27).
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Sad          0.0%     76.9%   3.8%    11.5%     0.0%        0.0%        0.0%
Table 26 - Listener responses for participants who speak English as their first language.
Utterance type is "emotive text, emotive voice".
                                 PERCEIVED EMOTION
STIMULUS     Happy    Sad     Angry   Neutral   Surprised   Disgusted   Other
Sad          0.0%     42.1%   5.3%    42.1%     0.0%        0.0%        10.5%
Table 27 - Listener responses for participants who do NOT speak English as their first
language. Utterance type is "emotive text, emotive voice".
The significant difference in emotion perception for sadness suggests that cultural
issues are coming into play; this is an issue that has received a good deal of interest in the
non-verbal communication literature (Knapp, 1980; Malandra, Barker and Barker,
1989). The high confusion between sadness and neutrality for participants without
English as their first language may account for the high deterioration values for sadness
seen in Table 21 and Table 23.
It is perhaps comforting that when sadness was confused, it was neutral that was
chosen, which agrees with Murray and Arnott (1995). It would be alarming indeed if the
data showed that sadness could easily be confused with, say, disgust, as the
consequences in real-life communication would be disastrous.
Understandability. If one Talking Head version was determined to
be generally easier to understand than the other, then it can be assumed that
listener comprehension would benefit.
Expressiveness. The Talking Head version determined to be better able to express
itself would communicate more effectively its feelings, the mood of a story, and
the seriousness or light-heartedness of information.
Understandability
Talking Head 1    Talking Head 2    Neither
2.2%              82.2%             15.6%
The data shows that the overwhelming majority of participants thought the second
Talking Head demonstration (the one that included vocal expression) was easier to
understand. From participants' justifications for their choice, it was apparent that not
only did most participants think the second demonstration was easier to understand, but
that many thought the first demonstration was very difficult to understand. Participants
wrote that the second demonstration was easier to understand because it spoke slower
and because it had better expression in its voice. In fact, the speech rate of the second
demonstration did not slow down; rather, more pauses between sentences and character
dialogues were present. This highlights the importance that silence has in our verbal
communication.
Although it could be argued that the second demonstration was perceived as being
easier to understand simply because the participants were hearing the story for the
second time, the fact that the reasons given for their choice were so widely echoed shows
that the voice was a determining factor.
Expressiveness
Table 29 shows the response of participants when asked which Talking Head
demonstration seemed best able to express itself. Again, the second demonstration was
overwhelmingly favoured over the first demonstration. Reasons given for favouring the
second demonstration included the perception that the storytelling was better
structured, and that there was more variation in its tone. Many participants
commented on how the variability and changes in pitch helped the turn-taking of the
characters and to distinguish ideas behind [the] statements in the story. Still others
commented on the dynamic use of volume, and that the tone of the second demonstration
was more appropriate for the story.
Expressiveness
Talking Head 1    Talking Head 2    Neither
4.4%              86.7%             8.9%
Another factor that was stated by most participants who favoured the second
demonstration was that the vocal and facial expressions were better synchronised, and
that the vocal expressions stood out more. The synchronisation of vocal expression
and facial gestures is deemed to be very important for communication by Cassell et al
(1994a), and Cassell et al (1994b); it is encouraging that most participants were able to
notice this and view it as desirable.
Interestingly, participants who did not favour the second demonstration of the
Talking Head commented that it was too expressive (in a caricature way).
Naturalness
Table 30 shows the response of participants when asked which Talking Head seemed
more natural. The main reason given by the majority of participants who voted for the
second demonstration was that the speech now matched the facial expressions better, and
so seemed more realistic. For this question, there was a considerable number of
participants who commented that the mouth and lip movements were unconvincing and
therefore needed more work for the Talking Head to seem realistic.
Naturalness
Talking Head 1    Talking Head 2    Neither
15.6%             71.1%             13.3%
Interest
When asked which Talking Head demonstration seemed more interesting, participants
gave a variety of reasons for their choice. Most participants opted for the second
demonstration, and Table 31 shows the data for this question.
Interesting
Talking Head 1    Talking Head 2    Neither
2.2%              84.4%             13.3%
Many participants wrote that because the first demonstration was difficult to
understand, they lost interest in what it was saying. Conversely, the second
demonstration was easier to understand, and as a result, they didn't have to concentrate as
much. Others wrote that the Talking Head in the second demonstration seemed more
alert and that it was happier. Several participants commented that the second
demonstration was more interesting because it seemed to have more human qualities,
and that the facial expressions were noticed more because of the improved speech.
For the results shown in this section, it is safe to say that the overwhelming majority
of participants favoured the second demonstration of the Talking Head over the first
demonstration. What is gratifying is the fact that the participants weren't told to focus on
the Talking Head's voice but rather to consider it as a whole, and yet most people
attributed the improvement between the two examples to the vocal expression of the
Talking Head.
A very important note to make, however, is that the results are from data that was
obtained by comparing two versions of the Talking Head: with and without simulated
vocal emotion. Further tests would need to be done to quantitatively determine how
effective a communicator the Talking Head is with its improved speech. What is
important is that the results have been able to show that with vocal emotion, the Talking
Head is a better communicator, basing the argument on its enhanced
understandability, expressiveness, naturalness, and user interest.
6.4 Summary
This chapter briefly described how the evaluation research methodology was employed to
test the hypotheses stated in Section 4.1. The TTS module was tested for how well
listeners could recognize the simulated emotions, and the data was seen to largely support
the stated hypotheses. The variables that can affect speech emotion recognition were
investigated and discussed, and the test data was also seen to support both previous work
done in the field of synthetic speech emotion (namely Murray and Arnott, 1995), and the
general literature from the fields of paralinguistics and non-verbal behaviour.
The last hypothesis of Section 4.1 was also tested, to investigate whether a Talking Head
is able to communicate information more effectively when it is given the capability of
speech expression. Through a series of demonstrations, this testing section showed that
viewers overwhelmingly rated the Talking Head with vocal expression as much easier to
understand, more expressive, more natural, and more interesting to look at and listen to. It
is proposed that these factors contribute to making the Talking Head a more effective
communicator.
Chapter 7
Future Work
Through the development of this project, a number of possible avenues for future work
have been identified. The following sections came about either because the strict time
constraints of the project meant that certain features could not be implemented, or
because it was discovered that many areas offered great depth in which much future
work could continue. This is especially true for XML, whose usability and potential are
indeed enormous. It is envisaged that many of the following sections each contain
enough depth for entire projects.
[Figure 27 - A node carrying waveform processing instructions for an operation (operation, parameters, start/end).]
[Figure - Proposed waveform post-processing architecture: the SML Document (tags plus text/phonemes) is handled by the DSP, where an SML tags processor passes phoneme data and processing directives to MBROLA and builds a waveform processing list; a waveform post-processor then applies the processing directives to MBROLA's waveform to produce the modified waveform.]
For instance, one of the characteristics of dynamic style is a faster rate of speech, and
studies have shown that higher ratings of intelligence, knowledge, and objectivity are
ascribed to the speaker (Miller, 1976 as referenced in Malandra, Barker and Barker,
1989). Contrariwise, a speaker adopting a conversational style of speaking (which is
characterized by a more consistent rate and pitch), is rated to be more trustworthy, better
educated, and more professional by listeners (Pearce and Conklin, 1971 as referenced in
Malandra, Barker and Barker, 1989). If these results can be reproduced, this will clearly
have major repercussions for a Talking Head.
An interesting way that speaking styles could be specified is through the use of
Cascading Style Sheets (CSS), making use of Extensible Stylesheet Language (XSL)
technology. The SML markup itself would not need to be changed to adopt a different
speaking style, but rather the XSL document that defines the style and voice of the
speaker. This confirms one of the benefits of XML that was stated earlier, where the
XML file describes the data and does not force the presentation of the data.
<?xml-stylesheet ...?>
<?xml-stylesheet ...?>
<sml>
<sml>
link to stylesheet
<xsl:stylesheet ...>
<xsl:stylesheet ...>
...
...
...
...
</xsl:stylesheet ...>
</xsl:stylesheet ...>
</sml>
</sml>
Stylesheet
SML File
c) It could well be that work in this area will find that speech correlates may not be
as important for discriminating similar complex emotions (e.g. pride and
satisfaction), and that it is rather what we say that provides the best cues for such
speaker emotions. This is upheld by Knapp (1980), who states that, as we
develop, we rely on context to discriminate emotions with similar
characteristics.
[Figure - Talking Head XML handling architecture: XML data (SML, FAML, etc.) is passed by the Brain to an XML Handler, which dispatches it to the TTS, FAML and other modules, each producing its own output (TTS output, FAML output, other output).]
[Figure 31 - Proposed client-server TTS architecture: on the server side, the TTS module performs SML processing and NLP (Festival) on the text to be rendered, producing phoneme information (text); this is sent to the client side, where the TTS module's DSP (MBROLA) produces the waveform and a viseme generator produces visemes.]
The idea of sending text to the client instead of a waveform is not new, but other
solutions require that the entire TTS module be on the client side: the text to be rendered
is sent to the client, and the speech is fully synthesized there. However, a complex NLP
module such as Festival's is quite large, and it would be very undesirable to force users
to install such an application on their systems.
The architecture in Figure 31 shows a system where all phoneme transcription and
SML tag processing is still carried out on the server. Instead of producing a waveform
and sending this to the client however, the phoneme information is sent across in
MBROLAs input format (described in Section 5.9) to the TTS Module on the client side,
where the waveform is produced by the DSP. The MBROLA Synthesizer is not a very
large program, and has binaries available for most platforms. Note that although the
diagram shows the Festival and MBROLA systems, any synthesizer that can deal with
phoneme information at the input and output level can be used.
Chapter 8
Conclusion
This thesis has focused on the primary goal of simulating emotional speech for a Talking
Head. The objectives reflected this by aiming to develop a speech synthesis system that
is able to add the effects of emotion on speech, and to implement this system as the TTS
module of a Talking Head. The literature was explored and investigation of research in
the fields of non-verbal behaviour, paralinguistics, and speech synthesis allowed (with
some confidence) the following hypotheses of Section 4.1 to be formed:
1. The effect of emotion on speech can be successfully synthesized using
control parameters.
2. Through the addition of emotive speech:
a) Listeners will be able to correctly recognize the intended emotion
being synthesized.
b) Information will be communicated more effectively by the Talking
Head.
The review of the literature also helped to both identify and provide possible
solutions to the subproblems stated in Section 2.2, namely the need to develop a speech
markup language, and synchronizing speech and facial gestures.
Throughout the entire design and implementation phase of the TTS module, the
literature was a solid basis upon which design decisions were made. For instance, SML's
speech emotion tags implemented the speech correlates of emotion found in the literature
and in previous work on synthetic speech emotion, chiefly by Murray and Arnott (1995)
and Cahn (1990).
The evaluation of the TTS module was an integral part of this thesis. Therefore, an
evaluative research methodology was employed to determine if the system supported or
disproved the stated hypotheses. In order for testing to be carried out in a clear and
concise way that could be directly linked with the hypotheses, the questionnaire-based
evaluation process was organized into two main sections:
a) Listener recognition of the emotions being simulated in the synthetic speech.
b) The extent of the effect vocal expression has on a Talking Head's
ability to communicate.
The results from the speech emotion recognition sections support hypotheses 1 and
2a; that is, strong recognition of the simulated emotions took place and so simulation of
the emotive speech was successful. The evaluation was able to show that a good
recognition rate of emotion occurred, comparable to that described in the literature for
both synthetic speech emotion, and for human speech.
The investigation of different combinations of neutral/emotive text and
neutral/emotive vocal expression was important in that it helped identify the importance
of the words we decide to use and how we choose to say them. Importantly, the results in
Chapter 6 are seen to also confirm the literature; both in terms of the effects of the
different variables involved (words and vocal expression), and also in terms of the
emotions that are often confused together.
As found in Murray and Arnott (1995) and Cahn (1990), the results were seen to be
very utterance dependent, and this may have contributed to both correct recognition and
also confusion between emotions. The reason for confusion between certain emotions is
reported in the literature to be dependent on many factors, including the text in the phrase
and its context, the speaker's voice, and who is hearing the utterance, which brings
gender and cultural issues into play (Malandra, Barker and Barker, 1989; Knapp, 1980).
Results of the Talking Head experiments also proved to be positive with the
overwhelming majority of viewers stating that the Talking Head was much easier to
understand, more expressive, natural, and more interesting when vocal expression was
added to its speech. It would be very interesting to investigate listener emotion
recognition rates for the same utterances spoken by the Talking Head itself. This way,
the importance of the combination of the visual channel (facial gestures) and the audio
channel (speech) could be studied.
There is much work yet to be done in the field of synthetic speech emotion, with
Chapter 7 identifying a number of key areas which should be investigated. The main
limitation of the project was the inability to implement dynamic volume control and
speech correlates affecting the voice quality (such as nasality, hoarseness, breathiness
etc). Notwithstanding these limitations however, this thesis has been able to demonstrate
the effect synthetic speech emotion has on user perception when it is applied to a Talking
Head. It is envisaged it would bring similar benefits to other applications using speech
synthesis. It is through this demonstration that the significance and value of this project
is seen to have been confirmed.
Bibliography
Allen, J., Hunnicutt, M. S. and Klatt, D. (1987). From Text to Speech: The MITalk
System, Cambridge University Press, Cambridge.
Bates, J. (1994). The Role of Emotion in Believable Agents, Communications of the
ACM, vol. 37, pp. 122-125.
Beard, S. (1999). FAQBot, Honours Thesis, Curtin University of Technology, Bentley,
Western Australia.
Beard, S., Crossman, B., Cechner, P. and Marriott A. (1999). FAQbot, Proceedings of
Pan Sydney Area Workshop on Visual Information Processing, Nov 1999, University
of Sydney, Australia.
Beard, S., Marriott, A. and Pockaj, R. (2000). A Humane Interface, OZCHI 2000
Conference on Human-Computer Interaction: Interfacing reality in the new
millennium. (4-8 Dec, 2000), Sydney, Australia.
Black, A. W., Taylor, P. and Caley, R. (1999). The Festival Speech Synthesis System
System Documentation, Edition 1.4, for Festival Version 1.4.0, 17th June 1999,
[Online], Available: www.cstr.ed.ac.uk/projects/festival/manual/festival_toc.html.
Bos, B. (2000). XML in 10 points. [Online]. Available: http://www.w3.org/XML/1999/XML-in-10-points.
Bosak, J. (1997). XML, Java, and the Future of the Web. [Online]. Available: http://www.webreview.com/pub/97/12/19/xml/index.html.
Cahn, J. E. (1988). From Sad to Glad: Emotional Computer Voices, Proceedings of
Speech Tech '88, Voice Input/Ouput Applications Conference and Exhibition, April
1988, New York City, pp. 35-37.
Cahn, J. E. (1990). The Generation of Affect in Synthesized Speech, Journal of the
American Voice I/O Society, vol. 8, pp. 1-19.
Cahn, J. E. (1990b). Generating expression in synthesized speech, Technical Report,
Massachusetts Institute of Technology Media Laboratory, MA, USA.
Carroll, L. (1946). Alices Adventures in Wonderland, Random House, New York.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S.,
Becket, T., and Achorn, B. (1994a). Animated conversation: Rule-based generation
of facial expression, gesture and spoken intonation for multiple conversational
agents, In Proceedings of ACM SIGGRAPH 94.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S.,
and Achorn, B. (1994b). Modeling the interaction between speech and gestures, In
Proceedings of the 16th annual Conference of the Cognitive Science Society, pp. 119124.
Cechner, P. (1999), NO-FAITH: Transport Service for Facial Animated Intelligent
Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western
Australia.
Cover, R. (2000). The XML Cover Pages. [Online]. Available: http://www.oasis-open.org/cover/xml.html.
Davitz, J. R. (1964). The Communication of Emotional Meaning, McGraw-Hill, New
York.
Dellaert, F., Polzin, T. and Waibel, A. (1996). Recognizing Emotion in Speech,
Proceedings of ICSLP 96 the 4th International Conference on Spoken Language
Processing, 3-6 October 1996, Philadelphia, PA, USA.
Document Object Model (DOM) Level 1 Specification (Second Edition) (2000).
[Online]. Available: http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929.
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers.
Extensible Markup Language (XML) 1.0, (1998). [Online]. Available: http://www.w3.org/TR/1998/REC-xml-19980210.
library for Gnome. [Online].
Malandra, L., Barker, L., and Barker, D. (1989). Nonverbal Communication, 2nd edition,
Random House, pp. 32-50.
Mauch, J. E. and Birch, J. W. (1993). Guide to the Successful Thesis and Dissertation, 3rd
edition, Marcel Dekker, New York.
MBROLA Project Homepage (2000). [Online]. Available: http://tcts.fpms.ac.be/synthesis/mbrola.html.
Microsoft (2000). Benefiting From XML. [Online]. Available: http://msdn.microsoft.com/xml/general/benefits.asp.
SoftwareAG (2000b). XML Benefits. [Online]. Available: http://www.softwareag.com/xml/about/xml_ben.htm.
Sproat, R., Taylor, P., Tanenblatt, M. and Isard, A. (1997). A Markup Language for
Text-to-Speech Synthesis, in Proceedings of the Fifth European Conference on
Speech Communication and Technology (Eurospeech 97), vol. 4, pp. 1747-1750.
Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A., and Lenzo, K. (1998).
SABLE: A Standard for TTS Markup, in Proceedings of International Conference
on Speech and Language Processing (ICSLP98), pp. 1719-1724.
Stallo, J. (2000). Canto: A Synthetic Singing Voice. Technical Report, Curtin University of Technology, Western Australia. [Online]. Available: http://www.computing.edu.au/~stalloj/projects/canto/.
Taylor, P. and Isard, A. (1997). SSML: A speech synthesis markup language, Speech
Communication, vol. 21, pp. 123-133.
The XML FAQ (2000). Frequently Asked Questions about the Extensible Markup
Language. [Online]. Available: http://www.ucc.ie/xml/.
SML Tags
angry
Description: Simulates the effect of anger on the voice (i.e. generates a
voice that sounds angry).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<angry>I would not give you the time of day</angry>
embed
Description: Gives the ability to embed foreign file types (such as sound files, MML files, etc.) within an SML document, and to have them processed appropriately.
Attributes:
Name          Description                          Values
type          The type of the embedded file.       A character string.
src           The source file to embed.            A character string.
music_file    The music file to embed.             A character string.
lyr_file      The lyrics file to embed.            A character string.
Properties: empty.
Example:
<embed type="mml" music_file="songs/aaf.mml" lyr_file="songs/aaf.lyr"/>
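A second, illustrative example embedding a sound file via the src attribute (the type value and file path here are assumptions, not taken from the original):
<embed type="wav" src="sounds/welcome.wav"/>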
emph
Description: Emphasizes a syllable within a word.
Attributes:
Name          Description                                     Values
target        The phoneme at which the emphasis is placed.    A character string representing a phoneme symbol. Uses the MRPA phoneme set.
level         The strength of the emphasis.                   weakest, weak, moderate, strong.
affect                                                        A character string.
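Example (illustrative; the word and the MRPA phoneme symbol shown are assumptions, not taken from the original):
<emph target="uh" level="strong">wonderful</emph>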
happy
Description: Simulates the effect of happiness on the voice (i.e.
generates a voice that sounds happy).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
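(The sentence below is illustrative; it is borrowed from the happiness test phrase listed under Emotional phrases later in this document.)
<happy>I have some wonderful news for you.</happy>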
neutral
Description: Gives a neutral intonation to the spoken utterance.
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<neutral>I can sometimes sound bored like this.</neutral>
p
Description: Element used to divide text into paragraphs. Can only
occur directly within an sml element. The p element wraps emotion
tags.
Attributes: none.
Properties: Can contain all other elements, except itself and sml.
Example:
<p>
<sad>Today it's been raining all day,</sad>
<happy>But they're calling for sunny skies tomorrow.</happy>
</p>
pause
Description: Inserts a pause in the utterance.
Attributes:
Name          Description                                  Values
length        The length of the pause.                     short, medium, long.
msec          The length of the pause in milliseconds.     A positive number.
smooth        Whether the pause is smoothed.               yes, no.
Properties: empty.
Example:
I'll take a deep breath <pause length="long"/> and try it again.
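A further illustrative example using the msec and smooth attributes (the wording and values are assumptions, not from the original):
Let me think <pause msec="500" smooth="yes"/> before I answer.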
pitch
Description: Element that changes pitch properties of the contained text.
Attributes:
Name          Description                                                        Values
middle        Increases/decreases the pitch of the contained text by N%.         (+/-)N%
range         Increases/decreases the pitch range of the contained text by N%.   (+/-)N%
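Example (illustrative; the sentence and percentage values are assumptions, not from the original):
<pitch middle="+20%" range="-10%">My voice is higher, but flatter, in this sentence.</pitch>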
pron
Description: Enables manipulation of how something is pronounced.
Attributes:
Name          Description                                            Values
sub           The substitute pronunciation for the contained text.   A character string.
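Example (illustrative; the substitution string is an assumption about how sub is used):
<pron sub="tomarto">tomato</pron>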
rate
Description: Sets the speech rate of the contained text.
Attributes:
Name          Description                                        Values
speed         The rate at which the contained text is spoken.
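Example (illustrative; the percentage form of the speed value is an assumption, since the permitted values are not listed above):
<rate speed="-20%">This sentence is spoken more slowly than normal.</rate>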
sad
Description: Simulates the effect of sadness on the voice (i.e.
generates a voice that sounds sad).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<sad>Honesty is hardly ever heard.</sad>
sml
Description: Root element that encapsulates all other SML tags.
Attributes: none.
Properties: root node, can only occur once.
Example:
<sml>
<p>
<happy>The sml tag encapsulates all other tags</happy>
</p>
</sml>
speaker
Description: Specifies the speaker to use for the contained text.
Attributes:
Name          Description                          Values
gender        The gender of the speaker.           male, female.
name          The name of the speaker to use.      A character string.
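Example (illustrative; the speaker name is an assumption, any name understood by the synthesizer could be given):
<speaker gender="male" name="bruce">This text is spoken by a different speaker.</speaker>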
volume
Description: Sets the speaking volume. Note, this element sets only the volume, and does not change voice quality (e.g. quiet is not a whisper).
Attributes:
Name          Description                Values
level         The speaking volume.       (+/-)N%, soft, normal, loud.
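Example (illustrative wording):
<volume level="soft">I can speak quite softly when I need to.</volume>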
<!ATTLIST embed
type CDATA #REQUIRED
src CDATA #IMPLIED
music_file CDATA #IMPLIED
lyr_file CDATA #IMPLIED>
<!--#######################################
# LOW-LEVEL TAGS
#######################################
-->
<!ELEMENT pitch (#PCDATA | %LowLevelElements;)*>
<!ATTLIST pitch
middle CDATA #IMPLIED
base (low | medium | high) "medium"
range CDATA #IMPLIED>
<!ELEMENT rate (#PCDATA | %LowLevelElements;)*>
<!ATTLIST rate
speed CDATA #REQUIRED>
<!-- SPEAKER DIRECTIVES: Tags that can only encapsulate plain text -->
<!ELEMENT emph (#PCDATA)>
<!ATTLIST emph
level (weakest | weak | moderate | strong) "weak"
affect CDATA #IMPLIED
target CDATA #IMPLIED>
<!ELEMENT pause EMPTY>
<!ATTLIST pause
length (short | medium | long) "medium"
msec CDATA #IMPLIED
smooth (yes | no) "no">
<!ELEMENT volume (#PCDATA)>
<!ATTLIST volume
level CDATA #REQUIRED>
<!ELEMENT pron (#PCDATA)>
<!ATTLIST pron
sub CDATA #REQUIRED>
Make sure you have a recent version of the Cygnus Cygwin make utility working on your system.
You can download the Cygwin package here. We used version b20.1.
config/system.mak - This file shouldn't really be hand edited as it's created automatically from the config file, but if the system gets confused, you may need to change the value of OSREV from 1.0 to 20.1.
Creating VCMakefiles
From memory, there were no mishaps here. Only, again, make sure the make utility you're using is recent,
preferably from the Cygnus Cygwin package version b20.1 (we had trouble with earlier versions).
Building
To build the system:
nmake /nologo /fVCMakefile
However, you'll probably incur a number of compile errors. The following is a list of changes I had to
make:
1. SIZE_T redefinition in siod/editline.h.
SIZE_T is redefined in editline.h since it is already defined in a Windows include file. Fix by inserting a compiler directive so that SIZE_T is only defined if it hasn't been defined before, or simply don't define it (i.e. wrap the definition in #if 0 ... #endif).
2. "unknown size" error for template in base_class/EST_TNamedEnum.cc.
nmake doesn't seem to like the syntax used in template void EST_TValuedEnumI<ENUM,VAL,INFO>::initialise(const void *vdefs, ENUM (*conv)(const char *)). Go to the line it's grumbling about, and do the following:
- comment this line out:
  const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *defs = (const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *)vdefs;
- insert these 2 lines (__my_defn can be anything, just make sure it doesn't conflict with another typedef!):
  typedef EST_TValuedEnumDefinition<ENUM,VAL,INFO> __my_defn;
  const __my_defn *defs = (const __my_defn *)vdefs;
3. Multiply defined symbols in base_class/inst_templ/vector_fvector_t.cc.
EST_TVector.cc is already included in vector_f_.cc, so a conflict is occurring. I'm not sure nmake would be the only one to complain about this though.
- Comment out this line:
  #include "../base_class/EST_TVector.cc"
4. T * const p_contents never initialised in the EST_TBox constructor (include/EST_TBox.h).
nmake wants p_contents to be initialised in the constructor using an initializer list.
- Comment the existing EST_TBox constructor out:
  EST_TBox(void) {};
- Define the EST_TBox constructor with an initializer list:
  EST_TBox(void) : p_contents(NULL) {};
I don't think it matters what p_contents is initialised to, since the inline comment says that the constructor will never be called anyway.
Building Festival
If I recall correctly, once the Speech Tools Library was successfully compiled, there were no problems
compiling Festival. However, once you run any of the executables such as festival.exe, pay attention to any
error messages appearing at startup. If most functions are unavailable to you in the Scheme interpreter, it's
probably because it can't find the lib directory that contains the Scheme (.scm) files.
Compiling
If you've included the right header files, and the include paths are correct, you should have no trouble.
However, make sure your program source files have a .cpp extension and not just a .c extension, or you'll get all sorts of compilation errors. (Take it from someone who lost a lot of time over this silly error!) Also, make sure SYSTEM_IS_WIN32 is defined by adding it to the pre-processor definitions in Project-->Settings-->C/C++.
Linking
The manual gives instructions on which Festival and Speech Tool libraries your program needs to link
with. (Note: the manual mentions .a UNIX library files. You should look for .lib files instead.) Add linking
information through Project-->Settings-->Link. In addition to the libraries specified in the manual, you'll
need to link two other libraries:
*10
*11
These two libraries should already be on your system, and Visual C++ should already be able to find them.
That's it! Good luck with your work :) If you've found this document to be helpful, please drop me an
email - I'd like to know if it's proving useful for others. If you have found errors in this document, or
have other suggestions for getting around problems, I'd especially like to hear from you! I also welcome
any questions you may have. You can contact me at stalloj@cs.curtin.edu.au. Finally, my warm thanks
to Alan W Black, Paul Taylor, and Richard Caley for making the Festival Speech Synthesis System
freely available to all of us :)
SECTION 1 - PERSONAL AND BACKGROUND DETAILS
Age: ____________
Gender:    Female      Male
No         I Don't
School     Other
Part B.
For each of the speech samples played, choose the ONE option which you think is most appropriate for
how the speaker sounds.
1.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
2.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
3.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
4.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
5.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
6.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
7.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
8.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
9.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
10.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
11.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
12.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
13.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
14.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
15.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
16.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
17.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
18.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
19.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
20.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
b) Did one speaker seem better able to express itself? (Please tick ONE box only)
The FIRST speaker seemed better able to express itself.
The SECOND speaker seemed better able to express itself.
Neither was able to express itself better than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
c) Did one example seem more natural than the other? (Please tick ONE box only)
The FIRST example seemed more natural.
The SECOND example seemed more natural.
Neither seemed more natural than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
d) Did one example seem more interesting to you? (Please tick ONE box only)
The FIRST example seemed more interesting.
The SECOND example seemed more interesting.
Neither seemed more interesting than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
SECTION 4 - GENERAL
Prior to this demonstration:
No
Yes
No
Yes
Any further comments you could give about the Talking Head and/or its voice would be
greatly appreciated.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
END OF QUESTIONNAIRE
THANK YOU VERY MUCH FOR YOUR HELP
Emotional phrases
6. I have some wonderful news for you. (Happiness)
7. I cannot come to your party tomorrow. (Sadness)
8. I would not give you the time of day. (Anger)
9. Smoke comes out of a chimney. (Neutral)
10. Don't tell me what to do. (Anger)