
Simulating Emotional Speech for a Talking Head

November 2000

Contents
1 Introduction..............................................................................................1
2 Problem Description................................................................................2
2.1 Objectives..............................................................................................................2
2.2 Subproblems..........................................................................................................2
2.3 Significance...........................................................................................................3

3 Literature Review....................................................................................5
3.1 Emotion and Speech..............................................................................................5
3.2 The Speech Correlates of Emotion........................................................................6
3.3 Emotion in Speech Synthesis................................................................................8
3.4 Speech Markup Languages...................................................................................9
3.5 Extensible Markup Language (XML).................................................................10
3.5.1 XML Features..........................................................................................10
3.5.2 The XML Document................................................................................11
3.5.3 DTDs and Validation..............................................................................12
3.5.4 Document Object Model (DOM).............................................................14
3.5.5 SAX Parsing.............................................................................................15
3.5.6 Benefits of XML......................................................................................16
3.5.7 Future Directions in XML........................................................................17
3.6 FAITH.................................................................................................................18
3.7 Resource Review.................................................................................................20
3.7.1 Text-to-Speech Synthesizer.....................................................................20
3.7.2 XML Parser..............................................................................................22
3.8 Summary.............................................................................................................23

4 Research Methodology..........................................................................25
4.1 Hypotheses..........................................................................................................25

4.2 Limitations and Delimitations.............................................................................26


4.2.1 Limitations...............................................................................................26
4.2.2 Delimitations............................................................................................26
4.3 Research Methodologies.....................................................................................27

5 Implementation......................................................................................28
5.1 TTS Interface.......................................................................................................28
5.1.2 Module Inputs..........................................................................................29
5.1.3 Module Outputs........................................................................................30
5.1.4 C/C++ API...............................................................................................31
5.2 SML - Speech Markup Language.......................................................32
5.2.1 SML Markup Structure............................................................................32
5.3 TTS Module Subsystems Overview....................................................................34
5.4 SML Parser..........................................................................................................35
5.5 SML Document...................................................................................................36
5.5.1 Tree Structure...........................................................................................36
5.5.2 Utterance Structures.................................................................................37
5.6 Natural Language Parser.....................................................................................39
5.6.1 Obtaining a Phoneme Transcription........................................40
5.6.2 Synthesizing in Sections..........................................................................42
5.6.3 Portability Issues......................................................................................43
5.7 Implementation of Emotion Tags........................................................................44
5.7.1 Sadness.....................................................................................................45
5.7.2 Happiness.................................................................................................46
5.7.3 Anger........................................................................................................47
5.7.4 Stressed Vowels.......................................................................................48
5.7.5 Conclusion...............................................................................................48
5.8 Implementation of Low-level SML Tags............................................................49
5.8.1 Speech Tags.............................................................................................49
5.8.2 Speaker Tag..............................................................................................53
5.9 Digital Signal Processor......................................................................................54
5.10 Cooperating with the FAML module................................................................55
5.11 Summary...........................................................................................................57


6 Results and Analysis..............................................................................58


6.1 Data Acquisition..................................................................................................58
6.1.1 Questionnaire Structure and Design........................................................58
6.1.2 Experimental Procedure...........................................................................61
6.1.3 Profile of Participants...............................................................................63
6.2 Recognizing Emotion in Synthetic Speech.........................................................64
6.2.1 Confusion Matrix.....................................................................................64
6.2.2 Emotion Recognition for Section 2A.......................................................66
6.2.3 Emotion Recognition for Section 2B.......................................................69
6.2.4 Effect of Vocal Emotion on Emotionless Text........................................73
6.2.5 Effect of Vocal Emotion on Emotive Text..............................................75
6.2.6 Further Analysis.......................................................................................75
6.3 Talking Head and Vocal Expression...................................................................77
6.4 Summary.............................................................................................................81

7 Future Work...........................................................................................82
7.1 Post Waveform Processing..................................................................................82
7.2 Speaking Styles...................................................................................................83
7.3 Speech Emotion Development............................................................................84
7.4 XML Issues.........................................................................................................85
7.5 Talking Head.......................................................................................................86
7.6 Increasing Communication Bandwidth...............................................................87

8 Conclusion...............................................................................................88
9 Bibliography...........................................................................................91
10 Appendix A SML Tag Specification..................................................96

11 Appendix B SML DTD.....................................................................102


12 Appendix C Festival and Visual C++..............................................104
13 Appendix D Evaluation Questionnaire...........................................107
14 Appendix E Test Phrases for Questionnaire, Section 2B..............113


List of Figures
Figure 1 - An XML document holding simple weather information...........................11
Figure 2 - Sample section of a DTD file..................................................12
Figure 3 - XML syntax error - list and item tags incorrectly matched....................13
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag)....................................13
Figure 5 - Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors....................................................13
Figure 6 - DOM representation of XML example............................................15
Figure 7 - FAITH project architecture....................................................19
Figure 8 - Talking Head being developed as part of the FAITH project at the School of Computing, Curtin University of Technology........................................20
Figure 9 - Top level outline showing how Festival and MBROLA systems were used together.............................................................................21
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head.................................................................................28
Figure 11 - Top-level structure of an SML document......................................32
Figure 12 - Valid SML markup.............................................................33
Figure 13 - Invalid SML markup...........................................................33
Figure 14 - TTS module subsystems........................................................34
Figure 15 - Filtering process of unknown tags............................................36
Figure 16 - SML Document structure for SML markup given above...........................37

Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object; W = CTTS_UtteranceInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object.......................................................39
Figure 18 - Tokenization of a part of an SML Document...................................40
Figure 19 - SML Document sub-tree representing example SML markup.......................41
Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above............................................................43
Figure 21 - Multiply factors of pitch and duration values for emphasized phonemes.......50
Figure 22 - Processing a pause tag.......................................................51
Figure 23 - The effect of widening the pitch range of an utterance......................52
Figure 24 - Processing the pron tag......................................................52
Figure 25 - Example MBROLA input.........................................................55
Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: "And now the latest news."..................................56
Figure 27 - A node carrying waveform processing instructions for an operation...........83
Figure 28 - Insertion of new submodule for post waveform processing.....................83
Figure 29 - SML Markup containing a link to a stylesheet................................84
Figure 30 - Inclusion of an XML Handler module to centrally manage XML input............85
Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth problems between server and client....................................................87


List of Tables
Table 1 - Summary of human vocal emotion effects..........................................8
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness......44
Table 3 - Speech correlate values implemented for sadness................................45
Table 4 - Speech correlate values implemented for happiness..............................46
Table 5 - Speech correlate values implemented for anger..................................47
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch...48
Table 7 - MBROLA command line option values for en1 and us1 diphone databases to output male and female voices..........................................................54
Table 8 - Statistics of participants.....................................................63
Table 9 - Confusion matrix template......................................................64
Table 10 - Confusion matrix with sample data.............................................65
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions................................................................65
Table 12 - Listener response data for neutral phrases spoken with happy emotion.........66
Table 13 - Section 2A listener response data for neutral phrases........................67
Table 14 - Listener response data for Section 2A, Question 1............................68
Table 15 - Listener response data for Section 2A, Question 2............................68
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion........................................................................70
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion..............................................................................71
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion..............................................................................72


Table 19 - Listener responses for utterances containing emotive text with vocal emotion..............................................................................73
Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text....................................74
Table 21 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text....................................74
Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text....................................75
Table 23 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text....................................75
Table 24 - Listener responses for participants who speak English as their first language. Utterance type is neutral text, emotive voice..............................76
Table 25 - Listener responses for participants who do NOT speak English as their first language. Utterance type is neutral text, emotive voice........................76
Table 26 - Listener responses for participants who speak English as their first language. Utterance type is emotive text, emotive voice..............................77
Table 27 - Listener responses for participants who do NOT speak English as their first language. Utterance type is emotive text, emotive voice........................77
Table 28 - Participant responses when asked to choose the Talking Head that was more understandable..................................................................78
Table 29 - Participant responses when asked which Talking Head seemed best able to express itself.....................................................................79
Table 30 - Participant responses when asked which Talking Head seemed more natural......80
Table 31 - Participant responses when asked which Talking Head seemed more interesting..80


Chapter 1
Introduction
When we talk we produce a complex acoustic signal that carries information in addition
to the verbal content of the message. Vocal expression tells others about the emotional
state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of
the words. Because of this, listeners expect to hear vocal effects, paying attention not
only to what is being said, but to how it is said. The problem with current speech
synthesizers is that the effect of emotion on speech is not taken into account, producing
output that sounds monotonous or, at worst, distinctly machine-like. As a result, the
ability of a Talking Head to express its emotional state will be adversely affected if it
uses a plain speech synthesizer to "talk". The objective of this research was to develop a
system that is able to incorporate emotional effects in synthetic speech, and thus improve
the perceived naturalness of a Talking Head.
This thesis reviews the literature in the fields of speech emotion, speech synthesis,
and XML. The discussion of XML features prominently because XML was the vehicle chosen
for directing how the synthetic voice should sound; it also had considerable impact on
how speech information was processed. The design and implementation details of the
project are discussed to describe the developed system. An in-depth analysis of the
project's evaluation data is then given, concluding with a
discussion of future work that has been identified.

Chapter 2
Problem Description
2.1 Objectives
Development of the project was aimed at meeting two main objectives to support the
hypotheses of Section 4.1:
1. To develop a system that can add simulated emotion effects to synthetic
speech. This involved researching the speech correlates of emotion that have
been identified in the literature. The findings were to be applied to the control
parameters available in a speech synthesizer, allowing a specified emotion to be
simulated using rules controlling the parameters.
2. To integrate the system within the TTS (text-to-speech) module of a
Talking Head. The speech system was to be added to the Talking Head that is
part of the FAITH1 project. It is being developed jointly at Curtin University of
Technology, Western Australia, and the University of Genoa in Italy (Beard et al,
1999). The text-to-speech module must be treated as a 'black box', which is
consistent with the modular design of FAQbot.

2.2 Subproblems
A number of subproblems were identified to successfully develop a system with the
stated objectives.
1. Design and implementation of a speech markup language. It was desirable
that the markup language be XML-based; the reasons for this will become
apparent later in the thesis. The role of the speech markup language (SML) is to
provide a way to specify in which emotion a text segment is to be rendered. In
addition to this, it was decided to extend the application of the markup to provide
a mechanism for the manipulation of generally useful speech properties such as
rate, pitch, and volume. SML was designed to closely follow the SABLE
specification, described by Sproat et al (1998).
2. Evaluation of each of the existing text-to-speech (TTS) submodules of the
Talking Head was required. Its aim was to determine what could and could not
be reused. This included assessing the existing TTS module's API, and the
modules that interface with other subsystems of the Talking Head (namely the
MPEG-4 subsystem).
3. Cooperative integration with modules that were being concurrently written
for the Talking Head, namely the gesture markup language being developed by
Huynh (2000). The collaboration between the two subprojects was aimed at
providing the Talking Head with synchronization of vocal expressions and facial
gestures. An architecture specification for allowing facial and speech
synchronization is given by Ostermann et al. (1998).
4. Since the Talking Head is being developed to run over a number of
platforms (Win32, Linux, and IRIX 6.3), it was crucial that the new TTS module
would not hamper efforts to make the Talking Head a platform-independent
application.

2.3 Significance
The project is significant because despite the important role of the display of emotion in
human communication, current text-to-speech synthesizers do not cater for its effect on
speech. Research to add emotion effects to synthetic speech is ongoing, notably by
Murray and Arnott (1996), but has mainly been restricted to standalone systems rather
than forming part of a Talking Head, as this project set out to do.
Increased naturalness in synthetic speech is seen as being important for its
acceptance (Scherer, 1996), and this is likely to be the case for applications of Talking
Head technology as well. This thesis attempts to address this need. Advances in this
area will also benefit work in the fields of speech analysis, speech recognition and speech
synthesis when dealing with natural variability. This is because work with the speech
correlates of emotion will help support or disprove speech correlates identified in speech

analysis, help in proper feature extraction for the automatic recognition of emotion in the
voice, and generally improve synthetic speech production.

Chapter 3
Literature Review
This section presents a brief review of the literature relevant to the areas the project is
concerned with: the effects of emotion on speech, speech emotion synthesis, XML and
speech markup languages.

3.1 Emotion and Speech


Emotion is an integral part of speech. Semantic meaning in a conversation is conveyed
not only in the actual words we say, but also in how they are expressed (Knapp, 1980;
Malandro, Barker and Barker, 1989). Even before they can understand words, children
display the ability to recognize vocal emotion, illustrating the importance that nature
places on being able to convey and recognize emotion in the speech channel.
The intrinsic relationship that emotion shares with speech is seen in the direct effect
that our emotional state has on the speech production mechanism. Physiological changes
such as increased heart rate and blood pressure, muscle tremors and dryness of mouth
have been noted to be brought about by the arousal of the sympathetic nervous system,
such as when experiencing fear, anger, or joy (Cahn, 1990). These effects of emotion on
a person's speech apparatus ultimately affect how speech is produced, thus promoting the
view that an "emotion carrier wave" is produced for the words spoken (Murray and
Arnott, 1993).
With emotion being described as "the organism's interface to the world outside"
(Scherer, 1981), considerable interest has been devoted to investigating the role of emotion
in speech, particularly regarding its social aspects (Knapp, 1980). One function is to
notify others of our behavioural intentions in response to certain events (Scherer, 1981).
For example, the contraction of one's throat when experiencing fear will produce a harsh
voice that is increased in loudness (Murray and Arnott, 1993), serving to warn and

frighten a would-be assailant, with the body tensing for a possible confrontation. The
expression of emotion through speech also serves to communicate to others our
judgement of a particular situation. Importantly, vocal changes due to emotion may in
fact be cross-cultural in nature, though this may only be true for some emotions, and
further work is required to ascertain this for certain (Murray, Arnott and Rohwer, 1996).
We also deliberately use vocal expression in speech to communicate various
meanings. Sudden pitch changes will make a syllable stand out, highlighting the
associated word as an important component of that utterance (Dutoit, 1997). A speaker
will also pause at the end of key sentences in a discussion to allow listeners the chance to
process what was said, and a phrase's pitch will increase towards the end to denote a
question (Malandro, Barker and Barker, 1989). When something is said in a way that
seems to contradict the actual spoken words, we will usually accept the vocal meaning
over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry
tone will generally be taken in a negative way, and not as a compliment, as the literal
meaning of the words alone would suggest. This underscores the importance we place on
the vocal information that accompanies the verbal content.

3.2 The Speech Correlates of Emotion


Acoustics researchers and psychologists have endeavoured to identify the speech
correlates of emotion. The motivation behind this work is based on the demonstrated
ability of listeners to recognize different vocal expressions. If vocal emotions are
distinguishable, then there are acoustic features responsible for how various emotions are
expressed (Scherer, 1996). However, this task has met with considerable difficulty. This
is because coordination of the speech apparatus to produce vocal expression is done
unconsciously, even when a speaking style is consciously adopted (Murray and Arnott,
1996).
Traditionally, there have been three major experimental techniques that researchers
have used to investigate the speech correlates of emotion (Knapp, 1980; Murray and
Arnott, 1993):
1. Meaningless, neutral content (e.g. letters of the alphabet, numbers etc.) is
read by actors who express various emotions.
2. The same utterance is expressed in different emotions. This approach aids in
comparing the emotions being studied.
3. The content is ignored altogether, either by using equipment designed to
extract various speech attributes, or by filtering out the content. The latter
technique involves applying a low-pass filter to the speech signal, thus
eliminating the high frequencies that word recognition is dependent upon.
(This meets with limited success, however, since some of the vocal
information also resides in the high frequency range.)
The problem of speech parameter identification is further compounded by the
subjective nature of these tests. This is evident in the literature, as results taken from
numerous studies rarely agree with each other. Nevertheless, a general picture of the
speech parameters responsible for the expression of emotion can be constructed. There
are three main categories of speech correlates of emotion (Cahn, 1990; Murray, Arnott
and Rohwer, 1996):
Pitch contour. The intonation of an utterance, which describes the nature
of accents and the overall pitch range of the utterance. Pitch is expressed as
fundamental frequency (F0). Parameters include average pitch, pitch range,
contour slope, and final lowering.
Timing. Describes the speed at which an utterance is spoken, as well as rhythm
and the duration of emphasized syllables. Parameters include speech rate,
hesitation pauses, and exaggeration.
Voice quality. The overall character of the voice, which includes effects
such as whispering, hoarseness, breathiness, and intensity.

It is believed that value combinations of these speech parameters are used to express
vocal emotion. Table 1 is a summary of human vocal emotion effects of four of the so-called basic emotions: anger, happiness, sadness and fear (Murray and Arnott, 1993;
Galanis, Darsinos and Kokkinakis, 1996; Cahn, 1990; Davtiz, 1976; Scherer, 1996). The
parameter descriptions are relative to neutral speech.

                  Anger                 Happiness            Sadness             Fear

Speech rate       Faster                Slightly faster      Slightly slower     Much faster

Pitch average     Very much higher      Much higher          Slightly lower      Very much higher

Pitch range       Much wider            Much wider           Slightly narrower   Much wider

Intensity         Higher                Higher               Lower               Higher

Pitch changes     Abrupt, downward      Smooth, upward       Downward            Downward terminal
                  directed contours     inflections          inflections         inflections

Voice quality     Breathy, chesty       Breathy, blaring*    Resonant*           Irregular voicing*
                  tone*

Articulation      Clipped               Slightly slurred     Slurred             Precise

* Terms used by Murray and Arnott (1993).

Table 1 - Summary of human vocal emotion effects.

The summary should not be taken as a complete and final description, but rather is
meant as a guideline only. For instance, the table above emphasizes the role of
fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as
referred in Murray and Arnott, 1993) notes that whispered speech is able to convey
emotion, even though whispering makes no use of the voice's fundamental frequency.
Nevertheless, being able to succinctly describe vocal expression like this has significant
benefits for simulating emotion in synthetic speech.
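
By way of illustration, the qualitative settings in Table 1 can be captured as a small data structure that an emotion simulation module might consult. The following C++ sketch is illustrative only: the type and field names are hypothetical, and the numeric factors are rough interpretations of the qualitative terms in Table 1 ("faster", "much wider", and so on), not values used in this project.

// Illustrative only: the qualitative entries of Table 1 expressed as scaling
// factors relative to neutral speech. Names and numbers are hypothetical.
struct EmotionProfile {
    double rateScale;        // speech rate multiplier (1.0 = neutral)
    double pitchMeanScale;   // average F0 multiplier
    double pitchRangeScale;  // F0 range multiplier
    double intensityScale;   // loudness multiplier
};

const EmotionProfile kAnger     = { 1.30, 1.40, 1.60, 1.20 };  // faster, very much higher, much wider, higher
const EmotionProfile kHappiness = { 1.10, 1.30, 1.60, 1.10 };  // slightly faster, much higher, much wider, higher
const EmotionProfile kSadness   = { 0.90, 0.95, 0.80, 0.90 };  // slightly slower, slightly lower, narrower, lower
const EmotionProfile kFear      = { 1.50, 1.40, 1.60, 1.10 };  // much faster, very much higher, much wider, higher

A complete rule set would also have to capture the pitch change, voice quality, and articulation rows of the table, which do not reduce to a single multiplier.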

3.3 Emotion in Speech Synthesis


In the past, focus has been placed on developing speech synthesizer techniques to
produce clearer intelligibility, with intonation being confined to model neutral speech.
However, the speech produced is distinctly machine-sounding and unnatural. Speech
synthesis is seen as being flawed for not possessing appropriate prosodic variation like
that found in human speech. For this reason, some synthesis models are including the
effects of emotion on speech to produce greater variability (Murray, Arnott and Rohwer,
1996). Interestingly, Scherer (1996) sees this as being crucial for the acceptance of
synthetic speech.
The advantage of the vocal emotion descriptions in Table 1 is that the speech
parameters can be manipulated in current speech synthesizers to simulate emotional
speech without dramatically affecting intelligibility. This approach thus allows emotive
effects to be added on top of the output of text-to-speech synthesizers through the use of

carefully constructed rules. Two of the better-known systems capable of adding emotion-by-rule effects to speech are the Affect Editor, developed by Cahn (1990b), and
HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell, 1988).
The systems both make use of the DECtalk text-to-speech synthesizer, mainly because of
its extensive control parameter features.
Future work is concerned with building a solid model of emotional speech, as this
area is seen as being limited by our understanding of vocal expression, and the quality of
the speech correlates used to describe emotional speech (Cahn, 1988; Murray and Arnott,
1995; Scherer, 1996). Although not within the scope of the project, it is worth
mentioning that research is being undertaken in concept-to-speech synthesis. This work
is aimed at improving the intonation of synthetic speech by using extra linguistic
information (i.e. tagged text) provided by another system, such as a natural language
generation (NLG) system (Hitzeman et al, 1999).
Variability in speech is also being investigated in the area of speech recognition with
the aim of possibly developing computer interfaces that respond differently according to
the emotional state of the user (Dellaert, Polzin and Waibel, 1996). Another avenue for
future research could be to incorporate the effects of facial gestures on speech. For
instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be
friendly over the phone when a person is smiling. A model that could cater for this
would have extremely beneficial applications for recent work concerned with the
synchronization of facial gestures and emotive speech in Talking Heads.
Finally, simulating emotion in synthetic speech not only has the potential to build
more realistic speech synthesizers (and hence provide the benefits that such a system
would offer), but will also add to our understanding of speech emotion itself.
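
As a rough sketch of the emotion-by-rule idea, the following C++ fragment scales the duration and pitch targets that a text-to-speech front end has assigned to each phoneme. It is a simplified illustration only (systems such as the Affect Editor and HAMLET also manipulate contour shape, voice quality and stressed syllables); the Phoneme structure and the applyEmotionRules function are hypothetical and are not part of the system described later in this thesis.

#include <cstddef>
#include <vector>

// Hypothetical phoneme record: a segment with a duration (ms) and a list of
// pitch targets (Hz) spread along its length.
struct Phoneme {
    double durationMs;
    std::vector<double> pitchTargetsHz;
};

// Sketch of an emotion-by-rule pass: scale timing and F0 relative to the
// neutral values produced by the front end. rateScale and pitchScale would be
// taken from a table such as Table 1 (e.g. both greater than 1.0 for anger).
void applyEmotionRules(std::vector<Phoneme>& utterance,
                       double rateScale, double pitchScale) {
    for (std::size_t i = 0; i < utterance.size(); ++i) {
        utterance[i].durationMs /= rateScale;               // faster speech -> shorter phonemes
        for (std::size_t j = 0; j < utterance[i].pitchTargetsHz.size(); ++j) {
            utterance[i].pitchTargetsHz[j] *= pitchScale;   // raise or lower the pitch targets
        }
    }
}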

3.4 Speech Markup Languages


Ideally, a text-to-speech synthesizer would be able to accept plain text as input, and speak
it in a manner comparable to a human, emphasizing important words, pausing for effect,
and pronouncing foreign words correctly. Unfortunately, automatically processing and
analyzing plain text is extremely difficult for a machine. Without extra information to
accompany the words it is to speak, the speech synthesizer will not only sound unnatural,
but intelligibility will also decrease. Therefore, it is desirable to have an annotation
scheme that will allow direct control over the speech synthesizers output.

Most research and commercial systems allow for such an annotation scheme, but
almost all are synthesizer-dependent, thus making it extremely difficult for software
developers to build programs that can interface with any speech synthesizer. Recent
moves by industry leaders to standardize a speech markup language have led to the draft
specification of SABLE, a system-independent, SGML-based markup language (Sproat
et al, 1998). The SABLE specification has evolved from three existing speech synthesis
markup languages: SSML (Taylor and Isard, 1997), STML (Sproat et al, 1997), and
Java's JSML.
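
To give a feel for this style of annotation, the fragment below is written in the spirit of the SABLE draft. The exact tag and attribute names differ between draft versions, so it should be read as indicative rather than as a definitive SABLE document; the first sentence is the example phrase used later in this thesis.

<SABLE>
  <SPEAKER NAME="female1">
    And now <EMPH>the latest</EMPH> news.
    <BREAK LEVEL="large"/>
    <RATE SPEED="-20%">
      <PITCH BASE="+10%" RANGE="+30%">
        This sentence is spoken more slowly, with a higher and wider pitch.
      </PITCH>
    </RATE>
  </SPEAKER>
</SABLE>

A synthesizer-independent annotation layer of this kind is what SML, described in Chapter 5, provides for the TTS module.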

3.5 Extensible Markup Language (XML)


XML is the Extensible Markup Language created by W3C, the World Wide Web
Consortium (Extensible Markup Language, 1998). It was specially designed to bring to
the World Wide Web the large-scale document management concepts embodied in SGML, the
Standard Generalized Markup Language. In adopting SGML
concepts however, the aim was also to remove features of SGML that were either not
needed for Web applications or were very difficult to implement (The XML FAQ, 2000).
The result was a simplified dialect of SGML that is relatively easy to learn, use and
implement, and at the same time retains much of the power of SGML (Bosak, 1997).
It is important to note that XML is not a markup language in itself, but rather it is a
meta-language: a language for describing other languages. Therefore, XML allows a
user to specify the tag set and grammar of their own custom markup language that
follows the XML specification.

3.5.1 XML Features


There are three significant features of XML that make it a very powerful meta-language
(Bosak, 1997):
1. Extensibility - new tags and their attribute names can be defined at will.
Because the author of an XML document can markup data using any number of
custom tags, the document is able to effectively describe the data embodied
within the tags. This is not the case with HTML, which uses a fixed tag set.
2. Structure - the structure of an XML document can be nested to any level of
complexity since it is the author that defines the tag set and grammar of the
document.
3. Validation - if a tag set and grammar definition is provided
(usually via a Document Type Definition (DTD)), then applications
processing the XML document can perform structural validation to make sure
it conforms to the grammar specification. So though the nested structure of an
XML document can be quite complex, the fact that it follows a very rigid
guideline makes document processing relatively easy.

3.5.2 The XML Document


An XML document is a sequence of characters that contains markup (the tags that
describe the text they encapsulate), and the character data (the actual text being marked
up). Figure 1 shows an example of a simple XML document.
<?xml version="1.0"?>                        <!-- XML header declaration -->
<weather-report>                             <!-- markup tag -->
  <date>March 25, 1998</date>                <!-- character data (marked up text) -->
  <time>08:00</time>
  <area>
    <city>Perth</city>
    <state>WA</state>
    <country>Australia</country>
  </area>
  <measurements>
    <skies>partly cloudy</skies>
    <temperature>20</temperature>
    <h-index>51</h-index>
    <humidity>87</humidity>
    <uv-index>1</uv-index>
  </measurements>
</weather-report>

Figure 1 - An XML document holding simple weather information.

One of the main observations that should be made for the example given in Figure 1
is that an XML document describes only the data, and not how it should be viewed. This
is unlike HTML, which forces a specific view and does not provide a good mechanism
for data description (Graham and Quinn, 1999). For example, HTML tags such as P,
DIV, and TABLE describe how a browser is to display the encapsulated text, but are

inadequate for specifying whether the data is describing an automotive part, is a section
of a patient's health record, or the price of a grocery item.
The fact that an XML document is encoded in plain text was a conscious decision
made by the XML designers, aimed at producing a system-independent and vendor-independent
solution (Bosak, 1997). Although text files are usually larger than comparable binary
formats, this can be easily compensated for using freely available utilities that can
efficiently compress files, both in terms of size and time. At worst, the
disadvantages associated with an uncompressed plain text file are deemed to be
outweighed by the advantages of a universally understood and portable file format that
does not require special software for encoding and decoding.

3.5.3 DTDs and Validation


The XML specification has very strict rules which describe the syntax of an XML
document - for instance, the characters allowable within the markup section, how tags
must encapsulate text, the handling of white space etc. These rigid rules make the tasks
of parsing and dividing the document into sub-components much easier. A well-formed
XML document is one that follows the syntax rules set in the XML specification.
However, since its author determines the structure of the document, a mechanism must be
provided that allows grammar checking to take place. XML does this through the
Document Type Definition, or DTD.
A DTD file is written in XML's Declaration Syntax, and contains the formal
description of a document's grammar (The XML FAQ, 2000). It defines, amongst other
things: which tags can be used and where they can occur, the attributes within each tag,
and how all the tags fit together.
Figure 2 gives a sample DTD section that describes two elements: list and item.
The example declares that one or more item tags can occur within a list tag.
Furthermore, an item tag may optionally have a type attribute.
<!ELEMENT list (item)+>              <!-- one or more item tags -->
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
          type CDATA #IMPLIED>       <!-- attribute is optional -->

Figure 2 - Sample section of a DTD file

Extending this example, the different levels of validation performed by an XML


parser can be seen. Figure 3 shows an XML document that does not meet the syntax
specified in the XML specification.
<?xml version="1.0"?>
<list><item>
Item 1
</list></item>

Figure 3 - XML syntax error - list and item tags incorrectly matched.

Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax), but
does not follow the grammar specified in the linked DTD file. (The DTD file is the one
given in Figure 2).
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
<item>Item 1</item>
<item>Item 2</item>
</list>

<item>Item 3</item>

Figure 4 - Well-formed XML document, but does not follow grammar specification in
DTD file (an item tag occurs outside of list tag).

Figure 5 shows a well-formed XML document that also meets the grammar
specification given in the DTD file.
<?xml version="1.0"?>
<!DOCTYPE list SYSTEM "list-dtd-file.dtd">
<list>
<item>Item 1</item>
<item type="x">Item 2</item>
<item>Item 3</item>
</list>

Figure 5 - Well-formed XML document that also follows DTD grammar specification.
Will not produce any parse errors.

The XML Recommendation states that any parse error detected while processing an
XML document will immediately cause a fatal error (Extensible Markup Language,
1998): the XML document will not be processed any further, and the application will
not attempt to second-guess the author's intent. Note that the DTD does NOT define how
the data should be viewed either. Also, the DTD is able to define which sub-elements can
occur within an element, but not the order in which they occur; the same applies for
attributes specified for an element. For this reason, an application processing an XML
document should avoid being dependent on the order of given tags or attributes.

3.5.4 Document Object Model (DOM)


The Document Object Model (DOM) Level 1 Specification describes the Document Object
Model as "a platform- and language-neutral interface that allows programs and scripts to
dynamically access and update the content, structure and style of documents" (Document
Object Model, 2000). It provides a tree-based representation of an XML document,
allowing the creation, manipulation, and navigation of any part within the document.
However, it is important to note that the DOM specification itself does not require that
documents be implemented as a tree; it is merely convenient that the logical structure
of the document be described as a tree, due to the hierarchical structure of marked-up
documents. The DOM is therefore a programming API for documents that is truly
structurally neutral as well.
Working with parts of the DOM is quite intuitive since the object structure of the
DOM very closely resembles the hierarchical structure of the document. For instance,
the DOM shown in Figure 6b would represent the tagged text example in Figure 6a.
Again, the hierarchical relationships are logical ones defined in the programming API,
and are not representations of any particular internal structures (Document Object Model,
2000).
Once a DOM tree is constructed, it can be modified easily by adding/deleting nodes
and moving sub-trees. The new DOM tree can then be used to output a new XML
document since all the information required to do so is held within the DOM
representation. A DOM tree will not be constructed until the XML document has been
fully parsed and validated.

a.

<weather-report>
  <date>October 30, 2000</date>
  <time>14:40</time>
  <measurements>
    <skies>Partly cloudy</skies>
    <temperature>18</temperature>
  </measurements>
</weather-report>

b.

<weather-report>
  |-- <date> ----------- "October 30, 2000"
  |-- <time> ----------- "14:40"
  |-- <measurements>
        |-- <skies> ---------- "Partly cloudy"
        |-- <temperature> ---- "18"

Figure 6 - DOM representation of XML example
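
As noted above, once a DOM tree has been built it can be modified and then written out again as a new XML document. The short sketch below illustrates this using the libxml library that is discussed later in Section 3.7.2; the file names and the element being added are hypothetical.

#include <libxml/parser.h>
#include <libxml/tree.h>

int main() {
    // Parse a document such as the one in Figure 6a into a DOM tree
    // ("weather.xml" is a hypothetical file name).
    xmlDocPtr doc = xmlParseFile("weather.xml");
    if (doc == 0) return 1;                              // fatal parse error: stop

    xmlNodePtr root = xmlDocGetRootElement(doc);         // <weather-report>

    // Add a new child element, then serialize the modified tree back to XML.
    xmlNewTextChild(root, 0, BAD_CAST "humidity", BAD_CAST "87");
    xmlSaveFormatFile("weather-updated.xml", doc, 1);

    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}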

3.5.5 SAX Parsing


A downside to the DOM is that most XML parsers implementing the DOM make the
entire tree reside in memory - apart from putting a strain on system resources, this also
limits the size of the XML document that can be processed (Python/XML Howto, 2000;
libxml, 2000). Also, if the application only needs to search the XML document for
occurrences of a particular word, it would be inefficient to construct a complete
in-memory tree to do this.

A SAX handler, on the other hand, can process very large documents since it does
not keep the entire document in memory during processing. SAX, the Simple API for
XML, is a standard interface for event-based XML parsing (SAX 2.0, 2000). Instead of
building a structure representing the entire XML document, SAX reports parsing events
(such as the start and end of tags) to the application through callbacks.
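
To illustrate the callback style, the sketch below registers start-element and character-data handlers with the SAX interface offered by libxml (the parser discussed in Section 3.7.2). It is a minimal sketch assuming libxml's SAX API; the file name is hypothetical, and no in-memory tree is ever built.

#include <cstdio>
#include <libxml/parser.h>

// Called for every opening tag; the third argument is a NULL-terminated attribute list.
static void onStartElement(void*, const xmlChar* name, const xmlChar**) {
    std::printf("start of <%s>\n", reinterpret_cast<const char*>(name));
}

// Called for each run of character data between tags.
static void onCharacters(void*, const xmlChar* ch, int len) {
    std::printf("text: %.*s\n", len, reinterpret_cast<const char*>(ch));
}

int main() {
    xmlSAXHandler handler = {};            // all unused callbacks left unset
    handler.startElement = onStartElement;
    handler.characters   = onCharacters;

    // Stream the document through the callbacks without building a DOM tree.
    xmlSAXUserParseFile(&handler, 0, "weather.xml");
    xmlCleanupParser();
    return 0;
}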

3.5.6 Benefits of XML


The following benefits of using XML in applications have been identified (Microsoft,
2000) (SoftwareAG, 2000b):
Simplicity - XML is easy to read, write and process by both humans and
computers.
Openness - XML is an open and extensible format that leverages other (open)
standards such as SGML. XML is now a W3C Recommendation, which means it is a
very stable technology. In addition, XML is highly supported by industry market
leaders such as Microsoft, IBM, Sun, and Netscape, both in developer tools and
user applications.
Extensibility - data encoded in XML is not limited to a fixed tag set. This
enables precise data description, greatly aiding data manipulators such as
search engines to produce more meaningful searches.
Local computation and manipulation - once data in XML format is sent to the
client, all processing can be done on the local machine. The XML DOM allows
data manipulation through scripting and other programming languages.
Separation of data from presentation - this allows data to be written, read
and sent in the best logical mode possible. Multiple views of the data are
easily rendered, and the look and feel of XML documents can be changed through
XSL style sheets; this means that the actual content of the document need not
be changed.
Granular updates - the structure of XML documents allows for granular updates
to take place, since only modified elements need to be sent from the server to
the client. This is currently a problem with HTML, since even the slightest
modification requires a page to be rebuilt. Granular updates will help reduce
server workload.
Scalability - separation of data from presentation also allows authors to
embed within the structured data procedural descriptions of how to produce
different views. This offloads much of the user interaction from the server to
the client computer, reducing the server's workload and thus enhancing server
scalability.
Embedding of multiple data types - XML documents can contain virtually any
kind of data type, such as images, sound, video, URLs, and also active
components such as Java applets and ActiveX.
Data delivery - since XML documents are encoded in plain text, data delivery
can be performed on existing networks, sent using HTTP just like HTML.
Combined with the XML features discussed in section 3.5.1, the above list
underscores the enormous potential of XML. Indeed, the extent of these benefits makes
XML a core component in a wide range of applications: from dissemination of
information in government agencies to the management of corporate logistics; providing
telecommunication services; XML-based prescription drug databases to help pharmacists
advise their customers; simplifying the exchange of complex patient records obtained
from different data sources, and much more (SoftwareAG, 2000a).

3.5.7 Future Directions in XML


Section 3.5 has served to give a basic introduction to XML: its design
principles, features and benefits, the structure of an XML document, validation using a
DTD, and two parsing strategies, DOM and SAX. The discussion has dealt mainly with
the XML 1.0 Specification, however, and the reader should be aware that there is actually
a "family of technologies" (Bos, 2000) associated with XML. Some of the XML projects
currently underway are (Bos, 2000):

XLink - describes a standard way of including hyperlinks within an XML file.
XPointer & XFragments - syntaxes for pointing to parts of an XML document.
Cascading Style Sheets (CSS) - making the style sheet language applicable to
XML as it is with HTML.
XSL - an advanced language for expressing style sheets.
XML Namespaces - a specification that describes how a URL can be associated
with every tag and attribute within an XML document.
XML Schemas 1 & 2 - aimed at helping developers to define their own XML-based
formats.
Future applications of some of these XML components to this project are discussed
later in this thesis. For more information on these emerging technologies, see the XML
Cover Pages (Cover, 2000).

3.6 FAITH
A very brief description of the FAITH project is required in order to gain an
understanding of where the TTS module fits within the Talking Head architecture.
Figure 7 shows a simplified view of the various subsystems that make up the Talking
Head.

Figure 7 is a block diagram of the FAITH client/server architecture. On the server side
sit the Brain, TTS, Personality and FAML modules; on the client side sit the MPEG-4
subsystem and the User Interface. Text questions flow from the client to the Brain; text
to synthesise flows from the Brain to the TTS module, which outputs waveforms and FAPs
(visemes); FAP and waveform streams are delivered to the client via the MPEG-4 subsystem.

Figure 7 - FAITH project architecture.

As Figure 7 shows, the architecture has been designed to fit a client/server


framework; the client is responsible for interfacing with the user, where the Talking Head
is rendered, audio played, extra information displayed etc. The server accepts a user's
text input (such as questions, dialog etc), and processing is carried out in the following
order:
The Brain module, developed by Beard (1999), processes the user's text
input and forms an appropriate response. The response is then sent to the TTS
module.
The TTS module is responsible for producing the speech equivalent of the
Brain's text response and outputs a waveform. It also outputs viseme
information in MPEG-4s FAP (Facial Animation Parameter) format, and passes
this to modules responsible for generating more FAP values. Visemes are the
visual equivalent of speech phonemes (e.g. the mouth forms a specific shape
when saying "oo"). Generated FAP values can be used to move specific points
of the Talking Head's face (to produce head movements, blinking, gestures etc.),
which is the purpose of the next two subsystems.
The Personality module's role is to generate MPEG-4 FAP values with the
goal of simulating various personalities such as friendliness and dominance
(Shepherdson, 2000).
The FAML module's role is to generate MPEG-4 FAP values to display
various head movements, and facial gestures and expressions specified through a
special markup language (Huynh, 2000).
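
Seen from the rest of the system, then, the TTS module behaves as a black box that accepts text to synthesize and returns audio plus viseme FAPs. The C++ interface sketched below is purely illustrative of that contract: all of the names (TTSModule, FAPFrame, Waveform) are invented here, and the module's actual C/C++ API is described in Chapter 5.

#include <string>
#include <vector>

// Hypothetical stand-ins for the real MPEG-4 FAP and audio structures.
struct FAPFrame { /* facial animation parameter values for one frame */ };
struct Waveform { std::vector<short> samples; int sampleRateHz; };

// Illustrative black-box view of the TTS module: marked-up text in,
// waveform and viseme FAP frames out.
class TTSModule {
public:
    virtual ~TTSModule() {}
    virtual Waveform synthesize(const std::string& markedUpText,
                                std::vector<FAPFrame>& visemesOut) = 0;
};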
As the diagram shows, communication between the client and server is done via an
implementation of the MPEG-4 standard (Cechner, 1999). For a more detailed summary of
the FAITH project, see Beard et al (2000). (The reader should note, however, that the
paper describes the old TTS module, and not the newer TTS module described in this
thesis.) Figure 8 shows one of the models, rendered on the client side, with which the
user interacts.

Figure 8 - Talking Head being developed as part of the FAITH project


at the School of Computing, Curtin University of Technology.

3.7 Resource Review


3.7.1 Text-to-Speech Synthesizer
Much of the preparation for this project required the investigation of an appropriate
text-to-speech synthesizer able to provide the flexibility required to achieve the proposed
aims, and therefore deserves mention in this thesis.
Initially, investigation focused on commercial text-to-speech systems. Most of the
prominent systems offer documentation and interactive sampling via the Internet,
including the ability to send and receive custom example output. Some of the TTS
systems considered were from the following companies: AT&T Labs, Bell Labs, Elan
Informatique (recently acquired by Lernout & Hauspie), Eloquent Technology, Lernout
& Hauspie, and the DECtalk synthesizer (now owned by Compaq).
However, though the speech quality of all the mentioned systems was of a very high
standard, with the exception of the DECtalk synthesizer none of these allowed the control
required for this project. At most, speech rate, average pitch and some user commands
are available, but are too basic and inadequate for the project. The Microsoft Speech API
4.0 (SAPI) specification does allow for voice quality control such as whispering (through
the \vce tag), but such controls were rarely implemented by the speech engines supporting SAPI.
It is therefore no coincidence that the two major research groups in synthetic speech
emotion, Cahn (1990) and Murray and Arnott (1995), chose to use DECtalk in their
implementations. It offers a very large number of control commands through its API,
including the ability to manipulate the utterance at the phoneme level (Hallahan, 1996).
Notwithstanding these features, it was found that the freely available systems, the
Festival Speech Synthesis System and MBROLA, offered the same capabilities (albeit in
an indirect way) plus additional advantages.

Festival is a widely recognized research project developed at the Centre for Speech
Technology Research (CSTR), University of Edinburgh, with the aim of offering a free,
high quality text-to-speech system for the advancement of research (Black, Taylor and
Caley, 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté
Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed
with aims similar to Festival's (MBROLA Project Homepage, 2000).

Text -> NLP (Festival) -> phonemes, pitch and duration -> Pitch and Timing Modifier
     -> modified phonemes, pitch and duration -> DSP (MBROLA) -> waveform

Figure 9 - Top level outline showing how Festival and MBROLA systems were used together.

It was decided for this project to use the Festival system as the natural language
parser (NLP) component of the module, which accepts text as input and transcribes this
to its phoneme equivalent, plus duration and pitch information. This information can
then be given to the MBROLA synthesizer, acting as the digital signal processing unit
(DSP), which produces a waveform from it. Although Festival has its own
DSP unit, it was found that the Festival + MBROLA combination produces the best
quality. It is important to note that the Festival system supports MBROLA in its API.
Because of the phoneme-duration-pitch input format required for MBROLA, it
provides very fine pitch and timing control for each phoneme in the utterance. As stated
before, this level of control is simply unattainable with commercial systems, except

DECtalk. The advantage of using MBROLA over DECtalk, however, is in the fact that
once a phoneme's pitch is altered in the latter system, the generated pitch contour is
overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate
the utterance at the phoneme level, limiting the amount of control, which ultimately
hindered the quality of the simulated emotion. To overcome this, Murray and Arnott
(1995) had to write their own intonation model to replace the DECtalk-generated pitch
contour when they changed pitch values at the phoneme level. Fortunately, this is not an
issue with MBROLA, as changes to the pitch and duration values can be made prior to
passing them to MBROLA (as Figure 9 shows). Therefore, it can be seen that the Festival +
MBROLA option offers high control comparable to the DECtalk synthesizer, with the
benefit of less complexity.
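
For reference, each line of MBROLA's input names a phoneme, gives its duration in milliseconds, and optionally lists one or more pairs of (position within the phoneme as a percentage, F0 target in Hz). The fragment below is a hand-written illustration of the format for the word "hello"; the SAMPA-style symbols, durations and pitch targets are purely illustrative, and the project's own example input appears later as Figure 25.

_    50
h    80
@    60   50 135
l    70
@U   180  0 140   80 110
_    100

Because every phoneme carries its own duration and pitch targets, an emotion rule only has to rewrite these numbers before the data is passed to MBROLA.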
The use of Festival and MBROLA also addressed the platform independence
subproblem described in Section 2.2. Although Festival was developed mainly for the
UNIX platform, its source code can be ported to the Win32 platform via relatively minor
modifications. The MBROLA Homepage offers binaries for many platforms, including:
Win32, Linux, most Unix OS versions, BeOS, Macintosh and more.
Before the final decision was made to make use of the Festival system however, an
important issue required investigation. The previous TTS module of the Talking Head
did not use the Festival system because, although it was acknowledged that Festival's
output is of a very high quality, the computation time was deemed to be far too expensive
for an interactive application (Crossman, 1999). For example, the phrase "Hello
everybody. This is the voice of a Talking Head. The Talking Head project consists of
researchers from Curtin University and will create a 3D model of a human head that will
answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon
Graphics Indy workstation (Crossman, 2000). It is contended, however, that the negative
impression of the Festival system that could be formed from such data may be a little
misleading. Though execution times are longer on an SGI Indy workstation, informal
testing on several standard PCs (Win32 and Linux platforms) showed that the same
phrase took less than 5 seconds to synthesize (including the generation of a waveform).
Since TTS processing is done on the server side, the system can easily be configured to
ensure Festival carries out its processing on a faster machine. Therefore, Festival's
synthesis time was not considered a problem.

3.7.2 XML Parser

Since it is expected that the program's input will contain marked-up text, an XML parser was required to parse and validate the input, and create a DOM tree structure for easy processing. There are a number of freely available XML parsers, though many are still in the development stage and implement the XML specification to varying degrees. One of the more complete parsers is libxml, a freely available XML C library for Gnome (libxml, 2000).
Using libxml as the XML parser fulfilled the needs of the project in a number of
ways:
a) Portability - written in C, the library is highly portable. Along with the main program, it has been successfully ported to the Win32, Linux and IRIX platforms.
b) Small and simple - only a limited range of XML features are being used, therefore a complex parser was not required. This is not to say that libxml is a trivial library, as it offers some powerful features.
c) Efficiency - informal testing showed libxml parses large documents in surprisingly little time. Although not used for this project, libxml offers a SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).
d) Free - libxml can be obtained cost-free and license-free.
It is important to note that the libxml library's DOM tree building feature was used to help create the required objects that hold the program's utterance information. However, care was taken to make sure the program's objects were not dependent on the XML parser being used. Instead, a wrapper class, CTTS_SMLParser, uses libxml as the XML parser and outputs a custom tree-like structure very similar to the DOM. This ensured that all other objects within the program used the custom structure, and not the DOM tree that libxml outputs. (See Chapter 5 for more details.)
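As an illustration, the sketch below shows how such a wrapper might parse a file with libxml and walk the resulting tree. It is a minimal sketch only, written against the libxml2-style API (xmlParseFile, xmlDocGetRootElement); the input file name is hypothetical and the actual CTTS_SMLParser internals are not reproduced here.

// Minimal sketch: parse an SML file with libxml and walk the resulting DOM tree.
// Assumes the libxml2-style API; the real CTTS_SMLParser builds its own custom
// tree structure from this DOM rather than printing node names.
#include <libxml/parser.h>
#include <libxml/tree.h>
#include <cstdio>

static void WalkNode(xmlNodePtr node, int depth)
{
    for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE)
            std::printf("%*selement: %s\n", depth * 2, "",
                        reinterpret_cast<const char *>(cur->name));
        else if (cur->type == XML_TEXT_NODE)
            std::printf("%*stext node\n", depth * 2, "");
        WalkNode(cur->children, depth + 1);   // recurse into child nodes
    }
}

int main()
{
    xmlDocPtr doc = xmlParseFile("input.sml");  // hypothetical input file name
    if (doc == NULL)
        return 1;                               // parse error: input rejected
    WalkNode(xmlDocGetRootElement(doc), 0);
    xmlFreeDoc(doc);
    return 0;
}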

3.8 Summary
This chapter has explored research that was applicable to this project, focusing on how
the literature can help with achieving the stated objectives and subproblems of Chapter 2,
and supporting the hypotheses of Chapter 4. More specifically, the literature was
investigated to find the speech correlates of emotion, seeking clear definitions so that
there was a solid base to work from during the implementation phase. The work of
prominent researchers in the field of synthetic speech emotion, such as Murray and Arnott (1995) and Cahn (1990), who have already attempted to simulate emotional speech, was sought in order to gain an understanding of the problems involved and the approaches taken in solving them.
The in-depth review of XML served two purposes: a) to describe what XML is and what the technology is trying to address, and b) to expound the benefits of XML so as to justify why SML was designed to be XML-based. A resource review was given to discuss the issues involved in deciding which tools to use for the TTS module, and to address one of the subproblems stated in Section 2.2; that is, that the TTS module should be able to run across the Win32, Linux, and UNIX platforms.

Chapter 4
Research Methodology
The literature review of Chapter 3 enabled the formation of the hypotheses stated in this
chapter. It also identified areas where limitations would apply, and defined the scope of
the project.

4.1 Hypotheses
The project was developed to test the following hypotheses:
1. The effect of emotion on speech can be successfully synthesized using
control parameters.
2. Through the addition of emotive speech:
a) Listeners will be able to correctly recognize the intended emotion being synthesized.
b) Information will be communicated more effectively by the Talking Head.
It should be noted that hypothesis 2a allows for a significant error rate in recognizing the simulated emotion, as we ourselves find it difficult to understand each other's nonverbal cues, and are often misunderstood as a result. Malandra, Barker and Barker (1989), and Knapp (1980) discuss difficulties in emotive speech recognition. For the hypothesis to be accepted, however, the recognition rate must be significantly higher than mere chance, showing that correct recognition of the simulated emotion is indeed occurring.

4.2 Limitations and Delimitations


4.2.1 Limitations
Two main limitations have been identified:
1. Vocal Parameters - The quality of the synthesized emotional speech will be
limited by the ability of the vocal parameters to describe the various emotions.
This is a reflection of the current level of understanding of speech emotion itself.
2. Speech Synthesizer Quality - The quality of the speech synthesizer and the
parameters it is able to handle will also have a direct effect on the speech
produced. For instance, most speech synthesizers are unable to change voice
quality features (breathiness, intensity etc) without significantly affecting the
intelligibility of the utterance.

4.2.2 Delimitations
The purpose of this research is to determine how well the vocal effects of emotion can be added to synthetic speech; it is not concerned with generating an emotional state for the Talking Head based on the words it is to speak. Therefore, the system will not know the required emotion to simulate from the input text alone. This top-level information will be provided through the use of explicit tags, hence the need for the implementation of a speech markup language.
Due to the strict time constraints placed on this project, the emotions to be simulated by the system were limited to happiness, sadness, and anger. These three emotions were chosen because of the wealth of study carried out on them (and hence an increased understanding) compared to other emotions. Happiness, sadness, and anger (along with fear and grief) are often referred to as the "basic" emotions, on which it is believed other emotions are built.

4.3 Research Methodologies


The following Research Methodologies of Mauch and Birch (1993) are applicable to this
research:
Design and Demonstration. This is the standard methodology used for the design and implementation of software systems. The speech synthesis system is demonstrated as the TTS module of a Talking Head.
Evaluation. The effectiveness of the system needed to be determined via listener questionnaires, testing how well the TTS module supports the stated hypotheses. Therefore an evaluation research methodology was adopted.
Meta-Analysis. The project involves a number of diverse fields other than speech synthesis, namely psychology, paralinguistics, and ethology. The meta-analysis research methodology was used to determine how well the speech emotion parameters described in these fields mapped to speech synthesis.

Chapter 5
Implementation
This chapter discusses the implementation of the TTS module to simulate emotional speech for a Talking Head, as well as the subproblems stated in Chapter 2. The discussion covers how the module's input is processed, and how the various emotional effects were implemented. This involves a description of the various structures and objects that are used by the TTS module. Since the module relies heavily on SML, the speech markup language that was designed and implemented to enable direct control over the module's output, the chapter also discusses SML issues such as parsing and tag processing.

5.1 TTS Interface


Before an in-depth description of each of the TTS module's components is given, it will be beneficial to describe the inputs and outputs of the system. It was important to be able to describe the system as a very high-level black box; not only for clarity of design, but also to ensure that the replacement of the existing TTS module of the FAITH project would be a smooth one. It also minimizes module and tool interdependency. Figure 10 shows the black box design of the system as the TTS module of a Talking Head.
Text → TTS Module → Waveform + Viseme information

Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head.

It was mentioned earlier in Section 3.7.1 that the TTS module uses the Festival Speech Synthesis System and the MBROLA Synthesizer. Figure 10 does not show any of this detail, nor should it. What is important to describe at this level is the module's interface; how the module produces its output is irrelevant to the user of the module.

5.1.2 Module Inputs


Figure 10 shows text as the single input to the TTS module. However, "text" can be a fairly ambiguous description for input, and indeed, the module caters for two distinct types of text: plain text, and text marked up in the TTS module's own custom Speech Markup Language (SML).

Plain Text
Plain text is the simplest form of input: the TTS module will endeavour to render the speech equivalent of all the input text. In other words, it is assumed that no characters within the input represent directives for how to generate the speech. As a result, speech generated from plain text will have default speech parameters, spoken with neutral intonation.

SML Markup
If direct control over the TTS module's output is desired, then the text to be spoken can be marked up in SML, the custom markup language implemented for the module. Although an in-depth description of SML will not be given here (see Section 5.2 and Appendix A), it was designed to provide the user of the TTS module with the following abilities:

Direct control of speech production. For example, the system could be specified to speak at a certain speech rate or pitch, or to pronounce a particular word in a certain way (this is especially useful for foreign names).
Control over speaker properties. This gives the ability to control not only how the marked up text is spoken, but also who is speaking. Speaker properties such as gender, age, and voice can be dynamically changed within SML markup.
The effect of the speaker's emotion on speech. For example, the markup may specify that the speaker is sad for a portion of the text. As a result, that speech will sound sad. One of the primary objectives of this thesis is to determine how effective the simulated effect of emotion on the voice is.

Another important feature of the TTS module with regard to the input it can receive is that the module is able to handle unknown tags present within the text. This is
important because other modules within the Talking Head may (and do) have their own
markup languages to control processing within that module. For example, Huynh (2000)
has developed a facial markup language to specify many head movements and
expressions. If any non-SML tags are present within the input given to the TTS module,
they will simply be filtered out of the SML input.
It is very important to note that the filtering out of unknown tags is done before any XML-related parsing is carried out; the XML Recommendation explicitly states that the presence of any unknown tags that do not appear in the document's referenced DTD should immediately cause a fatal error (Extensible Markup Language, 1998). Therefore, before any processing of the TTS module's input takes place, proper filtering must be performed, since it is expected that non-SML tags will be present in the input. Admittedly, the very fact that the TTS module is given input that may not contain pure SML markup does not reflect a solid design of the system.
However, the FAITH project has only just begun to make use of XML-based markup
languages for its various modules and the XML processing architecture within the
Talking Head is not very mature. One possibly better approach to maintaining several
XML-based markup languages (such as SML and FAML) is discussed in the Future
Work section (Chapter 7).

5.1.3 Module Outputs


The output of the TTS black box is two streams:

Waveform
The TTS module will always produce a sound file which, when played, is the speech
equivalent of the text received as input. The sound file is in the WAV 16-bit format.
Once a waveform is produced, it is sent via MPEG-4 to the client side of the application
where it is played (see Section 3.6).

Visemes
The visual equivalent of the phonemes (speech sounds) that are spoken is also output from the TTS module. The viseme output is encoded as stated in the MPEG-4 specification (MPEG, 1999). The phoneme-to-viseme translation submodule is one of the few that were retained from the existing TTS module.

5.1.4 C/C++ API


Modules that call the TTS module do so through its C/C++ API, which was designed to be simple and high-level. In C++ programs, the CTTS_Central class is at the center of the TTS API. The following list describes each of its public methods.

CTTS_Central::SpeakFromFile (const char *Filename) - Synthesizes the contents of the file specified by the input parameter Filename, e.g. SpeakFromFile("infile.sml");

CTTS_Central::SpeakText (const char *Message) - Synthesizes the actual character string pointed to by Message, e.g. SpeakText("Hello World");

CTTS_Central::SpeakTextEx (const char *Message, int Emotion) - Synthesizes the actual character string pointed to by Message, simulating the emotion specified by Emotion. Emotion can be specified using any one of the predefined constants: TTS_NEUTRAL, TTS_ANGRY, TTS_HAPPY, TTS_SAD.

A special note to make is that the type of input given to each of these functions is not
explicitly specified. For instance, does the file given to SpeakFromFile contain plain
text, or is it an SML document? Each of the API functions listed above automatically detects the input type using a simple heuristic: if the start of the input file or character string contains an XML header declaration, then it is treated as SML markup; otherwise the input is treated as plain text.
A C API has also been made available, which has the same functionality as the CTTS_Central object's interface; the only difference is that initialization and destruction routines have to be called explicitly.

TTS_Initialise () - used to initialise the TTS module. The function is called once only.

TTS_SpeakFromFile (const char *Filename) - same as CTTS_Central::SpeakFromFile (const char *Filename).

TTS_SpeakText (const char *Message) - same as CTTS_Central::SpeakText (const char *Message).

TTS_SpeakTextEx (const char *Message, int Emotion) - same as CTTS_Central::SpeakTextEx (const char *Message, int Emotion).

TTS_Destroy () - used to cleanly release the TTS module once it is no longer needed. The function is called once only.
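A sketch of the corresponding call sequence is given below; the header name "tts_api.h" is an assumption, and the code compiles as either C or C++.

/* Minimal sketch of the C API lifecycle: initialise once, speak, destroy once.
   The header name "tts_api.h" is an assumption. */
#include "tts_api.h"

int main(void)
{
    TTS_Initialise();                                    /* called once only */
    TTS_SpeakTextEx("Who made that noise?", TTS_ANGRY);
    TTS_SpeakFromFile("infile.sml");
    TTS_Destroy();                                       /* called once only */
    return 0;
}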

5.2 SML Speech Markup Language


In Section 2.2 it was identified that the design and implementation of a suitable markup language was required so that the emotion of a text segment could be specified, as well as providing a means of manipulating other useful speech parameters. SML is the TTS module's XML-based Speech Markup Language, designed to meet these requirements. This section provides an overview of how an utterance should be marked up in SML. For a description of each SML tag with its associated attributes, see Appendix A. For issues regarding SML's implementation, see Section 5.7.

5.2.1 SML Markup Structure


An input file containing correct SML markup must contain an XML header declaration at
the beginning of the file. Following the XML header, the sml tag encapsulates the entire
marked up text, and can contain multiple p (paragraph) tags. Figure 11 shows the basic
layout of an input file marked up in SML. Note that all the XML constraints discussed in
Section 3.5 apply to SML.
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "./sml-v01.dtd">
<sml>
<p>...</p>
<p>...</p>
<p>...</p>
</sml>

Figure 11 - Top-level structure of an SML document. The XML header is followed by a reference to the SML v01 DTD; the root sml tag contains the p (paragraph) tags.

In turn, each p node can contain one or more emotion tags (sad, angry, happy, and
neutral), and instances of the embed tag; text not contained within an emotion tag is not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows
SML markup that is invalid because it does not follow this rule. Note that unlike lazy
HTML, the paragraph (p) tags must be closed properly.
<p>
<neutral>Please remain quiet.</neutral>
<embed src="sound.wav"/>
<angry>Who made that noise?</angry>
</p>

Figure 12 - Valid SML markup.

<p>
<sad>I have some sad news:</sad>
this part of the markup is not valid SML.
</p>
Figure 13 - Invalid SML markup.

All tags described in Appendix A can occur inside an emotion tag (except sml, p, and embed). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified, most other tags can even contain instances of tags with the same name. For example, a pitch tag can contain another pitch tag, as the following example shows.
<pitch range="+100%">
Not I, <pitch middle="-15%">said the dog.</pitch>
</pitch>

The described structure of an input file containing SML markup can be confirmed against SML's DTD (see Appendix B). Should the input file not conform to the DTD specification, a parse error will occur and, in accordance with the XML Recommendation, the input will not be processed.

5.3 TTS Module Subsystems Overview


Figure 14 - TTS module subsystems. (The diagram shows the input text flowing through the SML Parser, which uses libxml, into the SML Document; the NLP unit (Festival) supplies phoneme, pitch and duration data, the SML Tags Processor modifies the text and phoneme data, the DSP unit (MBROLA) generates the waveform from the phoneme data, and the Visual Module generates the visemes.)

As Figure 14 shows, the design of the TTS module subsystems is centered on the SML Document object. The main steps for synthesizing the module's input text involve the creation, processing, and output of the SML Document. This is broken down into the following tasks:
a) Parsing. The input text is parsed by the SML Parser, which creates an SML Document object. The SML Parser makes use of libxml.
b) Text to Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phoneme's duration and pitch values. This information is given to the SML Document object and stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System.
c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document.
d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA Synthesizer.
e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalent. Again, the phoneme data is obtained from the SML Document. In this thesis, the Visual Module will not be discussed in any further detail, since it has reused much of the old TTS module's subroutines. Crossman (1999) provides a description of the phoneme-to-viseme translation process.

5.4 SML Parser


The SML Parser, encapsulated in the CTTS_SMLParser class, is responsible for parsing the module's text input to ensure that it is both a well-formed XML document and that its structure conforms to the grammar specification of the DTD. If the input is fully validated, an SML Document object is created based on the input.
To perform full XML parsing on the input, the XML C library libxml is used. Apart from validating the input, libxml also constructs a DOM tree (described in Section 3.5.4) that represents the input's tag structure, should no parse errors occur. The SML Document object that is returned by the SML Parser follows the hierarchical structure of the DOM very closely. The parser therefore traverses the DOM and creates an SML Document containing nodes mirroring the DOM's structure. Once the SML Document has been constructed, the DOM is destroyed and the SML Document is returned.
It was mentioned in Section 5.1.2 that the TTS module is able to handle unknown tags present within the input markup. This is because the input is filtered to remove all unknown tags before any validation parsing is done by libxml. In doing so, the DOM tree that libxml creates does not hold any unknown tag nodes, and as a consequence neither does the SML Document.

The TTS module keeps track of all SML tag names by keeping a special XML document that holds SML tag information2. Filtering of the input is done by creating a copy of the input file, and copying across only those tags that are known. It is important that this filtering process is carried out because the input is envisaged to contain other non-SML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.

2 The XML document is called tag-names.xml, and is held in the special TTS resource directory TTS_rc.

Figure 15 - Filtering process of unknown tags. (Each tag in the input file is looked up against the SML tag information; only known tags are copied into the filtered copy of the input file.)
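A simplified sketch of such a filter is shown below. It assumes the set of known tag names has already been loaded (e.g. from tag-names.xml); attribute handling, comments and entity references are ignored for brevity, and this is not the module's actual implementation.

// Simplified sketch: copy input markup, dropping any tag whose name is not in
// the known SML tag set. Assumes the known tag names were loaded elsewhere;
// comments, entities and CDATA sections are not handled here.
#include <cctype>
#include <set>
#include <string>

std::string FilterUnknownTags(const std::string &input,
                              const std::set<std::string> &knownTags)
{
    std::string output;
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] != '<') {              // ordinary character data: keep it
            output += input[i++];
            continue;
        }
        std::size_t end = input.find('>', i);
        if (end == std::string::npos) {     // malformed tail: keep as-is
            output += input.substr(i);
            break;
        }
        std::string tag = input.substr(i, end - i + 1);
        // Extract the tag name, skipping '<', '/', '?' and '!'.
        std::size_t p = 1;
        while (p < tag.size() && tag[p] != '>' &&
               !std::isalnum(static_cast<unsigned char>(tag[p])))
            ++p;
        std::size_t q = p;
        while (q < tag.size() && (std::isalnum(static_cast<unsigned char>(tag[q]))
               || tag[q] == '_' || tag[q] == '-'))
            ++q;
        std::string name = tag.substr(p, q - p);
        // Keep declarations (<?xml, <!DOCTYPE) and known SML tags; drop the rest.
        if (tag[1] == '?' || tag[1] == '!' || knownTags.count(name) > 0)
            output += tag;
        i = end + 1;
    }
    return output;
}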

5.5 SML Document


As the TTS module's subsystems diagram shows (Figure 14), the SML Document is at the core of the TTS module. Its role is to store all information required for speech synthesis to take place, such as word, phoneme, and intonation data. It also contains the full tag information that appears in the input; in fact, such is the depth of information held that the original SML markup could easily be recreated from the information held in the SML Document. The tag data is used to control the manipulation of the text and phoneme data. This section describes the structure of the SML Document, as well as the various structures required to perform the above mentioned role. The data held within the SML Document is ultimately used to produce a waveform. The SML Document object is encapsulated by the TTS_SMLDocument and TTS_SMLNode classes.

5.5.1 Tree Structure


In the last section, it was mentioned that the structure of the SML Document matches very closely that of the XML DOM. The SML Document consists of a hierarchy of nodes that represent the information held in the input SML markup. The nodes therefore hold markup information, attribute values, and character data. Figure 16 shows the high-level structure of an SML Document that would be constructed for the accompanying SML markup. Note how each node has a type that specifies what kind of node it is.
The hierarchical nature of the SML Document implies which text sections will be rendered in what way: a parent will affect all its children. So, for the example in Figure 16, the emph node will affect the phoneme data of its (one) child node, the text node containing the text "too". The happy node will affect the phoneme data of all its (three) children nodes, containing the text "That's not", "too", and "far away." respectively. Tags that were specified with attribute values are represented by element nodes that point to attribute information (this is not shown in Figure 16 for clarity purposes).

<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "./sml-v01.dtd">
<sml>
<p>
<neutral>I live at
<rate speed="-10%">10 Main Street</rate>
</neutral>
<happy>
That's not <emph>too</emph> far away.
</happy>
</p>
</sml>

Figure 16 - SML Document structure for the SML markup given above. (The tree has an sml DOCUMENT_NODE at its root, containing a p ELEMENT_NODE with neutral and happy ELEMENT_NODE children. The neutral node holds the text node "I live at" and a rate ELEMENT_NODE whose child text node holds "10 Main Street"; the happy node holds the text nodes "That's not" and "far away." and an emph ELEMENT_NODE whose child text node holds "too".)

5.5.2 Utterance Structures


Each text node contains its own utterance information, which comprises word and phoneme related data. The information is held in different layers.
1. Utterance level - the whole phrase held in that node. The CTTS_UtteranceInfo class is responsible for holding information at this level.
2. Word level - the individual words of the utterance. The CTTS_WordInfo class is responsible for holding information at this level.
3. Phoneme level - the phonemes that make up the words. The CTTS_PhonemeInfo class is responsible for holding information at this level.
4. Phoneme pitch level - the pitch values of the phonemes (phonemes can have multiple pitch values). The CTTS_PitchPatternPoint class is responsible for holding information at this level.
The above mentioned objects are organized within a text node as follows:
A text node contains one CTTS_UtteranceInfo object.
The CTTS_UtteranceInfo object contains a list of CTTS_WordInfo objects that contain word information.
In turn, each CTTS_WordInfo object contains a list of CTTS_PhonemeInfo objects that contain phoneme information. A CTTS_PhonemeInfo object contains the actual phoneme and its duration (ms).
Each CTTS_PhonemeInfo object then contains a list of CTTS_PitchPatternPoint objects that contain pitch information for each phoneme. A pitch point is characterized by a pitch value, and a percentage value of where the point occurs within the phoneme's duration. (A sketch of these structures is given after Figure 17.)

[Figure 17 depicts the nested structure for the phrase "the moon": one CTTS_UtteranceInfo object (U) contains the CTTS_WordInfo objects (W) for "the" and "moon"; each word contains its CTTS_PhonemeInfo objects (P), e.g. dh, @, m, uu, n; and each phoneme holds zero or more CTTS_PitchPatternPoint objects (pp), e.g. (0, 95), (50, 110), where each pair gives the percentage position inside the phoneme's length and the pitch value.]

Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object; W = CTTS_WordInfo object; P = CTTS_PhonemeInfo object; pp = CTTS_PitchPatternPoint object.
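The sketch below shows how this layered structure might be declared. Only the class names and their nesting come from the description above; the member names are illustrative assumptions.

// Illustrative sketch of the layered utterance structures. Only the class names
// and their nesting are taken from the text; the member names are assumptions.
#include <string>
#include <vector>

struct CTTS_PitchPatternPoint {
    double percent;   // position within the phoneme's duration (0-100%)
    double pitch;     // pitch value
};

struct CTTS_PhonemeInfo {
    std::string phoneme;                              // e.g. "dh", "uu"
    int durationMs;                                   // duration in milliseconds
    std::vector<CTTS_PitchPatternPoint> pitchPoints;  // zero or more pitch points
};

struct CTTS_WordInfo {
    std::string word;
    std::vector<CTTS_PhonemeInfo> phonemes;
};

struct CTTS_UtteranceInfo {
    std::vector<CTTS_WordInfo> words;   // the whole phrase held in a text node
};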

5.6 Natural Language Parser


As introduced in Section 5.3, the NLP (Natural Language Parser) module is responsible for transcribing the text to be rendered as speech to its phoneme equivalent. It is also responsible for generating intonation information by providing pitch and duration values for each phoneme in the utterance. The goals this module sets out to achieve are non-trivial, and it is not surprising that this stage takes by far the longest time of any of the stages in the speech synthesis process. Dutoit (1997) gives an excellent discussion of the problems the NLP unit of a speech synthesizer must overcome.
Since the phoneme transcription and the intonation information greatly affect the quality of the synthesized speech, it was very important to have an NLP that would produce high quality output. As mentioned in Section 3.7.1, the Festival Speech Synthesis System, which is able to generate output comparable to commercial speech synthesizers, was chosen to provide these services.

5.6.1 Obtaining a Phoneme Transcription


As described in Section 5.5, each text node within the SML Document contains utterance objects that will ultimately hold the node's word and phoneme information. One of the intermediate steps for obtaining a phoneme transcription is to tokenize the input character string into words. For example, the character string "On May 5 1985, 1,985 people moved to Livingston" would be tokenized into the following words by Festival: "On May fifth nineteen eighty five one thousand nine hundred and eighty five people moved to Livingston". This illustrates the complexity of the input that Festival is able to handle, which has a direct effect on user perception of the intelligence of the Talking Head.
To tokenize the contents of the SML Document, the tree is traversed and each text node's content is individually given to Festival. Festival returns the tokens in the character string, and these are stored as words in the corresponding node's utterance object. Figure 18 shows how each node holds its own token information.
SML markup:
<neutral>
10 oranges cost
<emph>$8.30</emph>
</neutral>

The neutral node's text "10 oranges cost" is tokenized as "ten oranges cost", and the emph node's text "$8.30" is tokenized as "eight dollars thirty".

Figure 18 - Tokenization of a part of an SML Document.

Once word information is stored within each node's utterance object, phoneme data can be generated for each word. Obtaining the actual phoneme data (including intonation) is a more complex process, however. This is because an entire phrase should be given to Festival in order for correct intonation to be generated. As an example, consider the following SML markup (the corresponding nodes held in the SML Document are shown in Figure 19).

<happy>
<rate speed="-15%">I wonder,</rate> you pronounced it
<emph>tomato</emph> did you not?
</happy>

Figure 19 - SML Document sub-tree representing the example SML markup. (The happy node contains a rate node holding "I wonder,", the text node "you pronounced it", an emph node holding "tomato", and the text node "did you not?".)

If each text node's contents are given to Festival one at a time (i.e. first "I wonder,", then "you pronounced it", and so forth), then although Festival will be able to produce the correct phonemes, it will not generate proper pitch and timing information for them. This will result in an utterance whose words are pronounced properly, but which contains inappropriate intonation breaks that make the utterance sound unnatural.
An appropriate analogy would be a person shown a pack of cards with words written on them one at a time, and asked to read them out loud. The person, not knowing what words will follow, will not know how to give the phrase an appropriate intonation. If the same person is instead given a card that contains the entire sentence, then, knowing what the phrase is saying, the person will read it out loud correctly. The same approach was taken in the solution to this problem. Continuing the above example will help explain how this is done.
The SML Document is traversed until an emotion node is encountered. In the example, traversal would stop at the happy node.
The contents of its child text nodes are then concatenated to make one phrase. So, the contents of the four text nodes in Figure 19 would be concatenated to form the phrase "I wonder, you pronounced it tomato did you not?" The phrase is stored in a temporary utterance object held in the happy node.
The phrase is given to Festival, and Festival generates the phoneme transcription as well as intonation information.
The entire phoneme data is stored in the happy node's temporary utterance object.
Because each text node already contains word information in its utterance object, it is a simple process to disperse the phoneme data held in the happy node amongst its children. The temporary utterance object in happy is then destroyed.
If this procedure is followed, correct intonation is given to the utterance. Of course, a limitation is that this does not solve the problem of having an emotion change in mid-sentence. However, the algorithm makes the assumption that this will not occur frequently, and that if it does, the intonation need not continue over emotion boundaries and a break is acceptable.

5.6.2 Synthesizing in Sections


Speech synthesis can be a processor intensive task, and can take a significant amount of
time and memory when synthesizing larger utterances. Finding any way to minimize the
waiting time is highly desirable, especially when the speech production is being waited
upon by an interactive Talking Head.
There was a concern that if a very large amount of SML markup were given to the TTS module, the execution time would be unacceptable for someone communicating with the Talking Head. To prevent this from occurring, a solution was implemented that took advantage of the client/server architecture of the Talking Head.
Instead of the entire SML Document being synthesized in one go, smaller portions
(at the emotion node level) are synthesized one at a time on the server and sent to the
client. As the Talking Head on the client begins to speak, the server synthesizes the
next emotion tag of the SML Document. By the time the Talking Head has finished
talking, the next utterance is ready to be spoken. This way, the actual waiting time is
really only for the first utterance and is now dependent on the communication speed
between the server and client, and not the synthesis time of the whole document. Figure
20 represents a timeline of the example markup.
It should be noted that this section-oriented method of producing speech involves not
only the NLP submodule but also all the steps in the synthesis process after the creation
of the SML Document.

<p>
<neutral>Utterance 1</neutral>
<happy>Utterance 2</happy>
<sad>Utterance 3</sad>
</p>

Server: synthesizes the neutral tag (Utterance 1), then the happy tag (Utterance 2), then the sad tag (Utterance 3), then is idle.
Client: idle while Utterance 1 is synthesized, then plays Utterance 1, Utterance 2, and Utterance 3 in turn while the server works ahead.

Figure 20 - Raw timeline showing server and client execution when synthesizing the example SML markup above.

5.6.3 Portability Issues


To address the portability issue stated in Section 2.2, it was important that Festival be usable across multiple platforms. Because the Festival system has been developed primarily for the UNIX platform, compiling it for IRIX 6.3 was relatively straightforward. Similarly, obtaining a Linux version of Festival was also effortless, since Linux RPMs (RedHat Package Manager) containing precompiled Festival libraries are available. Although it has not been tested extensively on the Win32 platform, the Festival developers are confident that the source code is platform independent enough for Festival to compile on Win32 machines without too many changes.
Despite this optimism, a considerable amount of the project's effort went into realizing this objective. In fact, changes made to the code were recorded, and as this record grew, a help document for compiling Festival with Microsoft Visual C++ resulted. The document was made available on the World Wide Web to help other developers, and has already received attention. The help document has been included in this thesis (see Appendix C). An online version can be found at the following URL: http://www.computing.edu.au/~stalloj/projects/honours/festival-help.html.

5.7 Implementation of Emotion Tags


Previous sections have described the framework constructed to support the main hypothesis; that is, to simulate the effect of emotion on speech. This section discusses the implementation of SML's emotion tags, which, when used to mark up text, cause the text to be rendered with the specified emotion.
As has already been stated in this thesis, the speech correlates of emotion needed to be investigated in the literature for the main objectives to be met. Section 3.2 described the findings of this research, and a table was constructed that describes the speech correlates for four of the five so-called basic emotions: anger, happiness, sadness, and fear (see Table 1). The table formed the basis for implementing the angry, happy, and sad SML tags. For ease of reference, the contents of Table 1 for the anger, happiness, and sadness emotions are shown again in the following table:

Parameter | Anger | Happiness | Sadness
Speech rate | Faster | Slightly faster | Slightly slower
Pitch average | Very much higher | Much higher | Slightly lower
Pitch range | Much wider | Much wider | Slightly narrower
Intensity | Higher | Higher | Lower
Pitch changes | Abrupt, downward directed contours | Smooth, upward inflections | Downward inflections
Voice quality | Breathy, chesty tone1 | Breathy, blaring1 | Resonant1
Articulation | Clipped | Slightly slurred | Slurred

1 Terms used by Murray and Arnott (1993).

Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.

To implement the guidelines found in the literature on human speech emotion, Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET system. The TTS module has adopted some of these rules, though slight modifications were required. Other similar prosodic rules were also developed through personal experimentation.

5.7.1 Sadness
Basic Speech Correlates
Following the literature-derived guideline for the speech correlates of emotion shown in
Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were
optimized for the TTS module, and are given as percentage values relative to neutral
speech.
Parameter | Value (relative to neutral speech)
Speech rate | -15%
Pitch average | -5%
Pitch range | -25%
Volume | 0.6

Table 3 - Speech correlate values implemented for sadness.

As a result of the above speech parameter changes, the speech is slower, lower in
tone, and is more monotonic (pitch range reduction gives a flatter intonation curve). The
volume is reduced for sadness so that the speaker talks more softly. (Implementation
details on how speech rate, volume and pitch values are modified can be found in Section
5.8).
Prosodic rules
The following rules, adopted from Murray and Arnott (1995), were deemed necessary for the simulation of sadness. Some parameter values were slightly modified to work best with the TTS module.
1. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pair has a pitch difference of greater than 10%, then the lower of the two pitch values is increased by 5% of the pitch range (a sketch of this rule is given after this list).
2. Add pauses after long words. If any word in the utterance contains six or more phonemes, then a slight pause (80 milliseconds) is inserted after the word.
The following rules were developed specifically for the TTS module.
1. Lower the pitch of every word that occurs before a pause. Such words are lowered by scanning the phoneme data in the particular word, and lowering the last vowel-sounding phoneme (and any consonant-sounding phonemes that follow) by 15%. This has the effect of lowering the last syllable.
2. Final lowering of the utterance. The last syllable of the last word in the utterance is lowered in pitch by 15%.
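The sketch below illustrates the first adopted rule (smoothing abrupt pitch changes between adjacent phonemes), applied to a flat list of per-phoneme pitch values. How the "pitch difference of greater than 10%" is measured (here, relative to the higher of the two values) is an assumption, as is the flattened data layout.

// Sketch of the "eliminate abrupt changes in pitch between phonemes" rule.
#include <algorithm>
#include <cstddef>
#include <vector>

void SmoothAbruptPitchChanges(std::vector<double> &phonemePitch)
{
    if (phonemePitch.size() < 2)
        return;

    // Pitch range of the utterance, used to scale the 5% correction.
    double lo = *std::min_element(phonemePitch.begin(), phonemePitch.end());
    double hi = *std::max_element(phonemePitch.begin(), phonemePitch.end());
    double range = hi - lo;

    for (std::size_t i = 0; i + 1 < phonemePitch.size(); ++i) {
        double &a = phonemePitch[i];
        double &b = phonemePitch[i + 1];
        double higher = std::max(a, b);
        double diff = higher - std::min(a, b);
        if (higher > 0.0 && diff / higher > 0.10) {
            // Raise the lower of the pair by 5% of the utterance's pitch range.
            (a < b ? a : b) += 0.05 * range;
        }
    }
}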

5.7.2 Happiness
Basic Speech Correlates
Following the literature-derived guideline for the speech correlates of emotion shown in
Table 2, Table 4 shows the parameter values set for the SML happy tag. The values
were optimized for the TTS module, and are given as percentage values relative to
neutral speech.
Parameter | Value (relative to neutral speech)
Speech rate | +10%
Pitch average | +20%
Pitch range | +175%
Volume | 1.0 (same as neutral)

Table 4 - Speech correlate values implemented for happiness.

As a result of the above speech parameter changes, the speech is slightly faster, is
higher in tone, and sounds more excited since intonation peaks are exaggerated due to the
pitch range increase.
Prosodic rules
The following rules were adopted from Murray and Arnott (1995). Some parameter
values were slightly modified to work best with the TTS module.
1. Increase the duration of stressed vowels. The phoneme data is scanned, and the duration of all primary stressed vowel phonemes is increased by 20%. Stressed vowels are discussed in Section 5.7.4.
2. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pair has a pitch difference of greater than 10%, then the lower of the two pitch values is increased by 5% of the pitch range.
3. Reduce the amount of pitch fall at the end of the utterance. Utterances usually have a pitch drop in the final vowel and any following consonants. This rule increases the pitch values of these phonemes by 15%, hence reducing the size of the terminal pitch fall.

5.7.3 Anger
Basic Speech Correlates
Table 5 shows the parameter values set for the SML angry tag. Note that the values were
optimized for the TTS module, and are given as percentage values relative to neutral
speech.
Parameter | Value (relative to neutral speech)
Speech rate | +18%
Pitch average | -15%
Pitch range | -15%
Volume factor | 1.7

Table 5 - Speech correlate values implemented for anger.

Prosodic rules
The following rule was adopted from Murray and Arnott (1995).
1. Increase the pitch of stressed vowels. The phoneme data is scanned, and the pitch of primary stressed vowels is increased by 20%, while secondary stressed vowels are increased by 10%.
Inspection of the parameter values in Table 5 will reveal that they differ considerably from the guidelines shown in Table 2. Initially, speech parameters were set for anger as shown in Table 2, but preliminary tests showed that even with different prosodic rules, the angry tag produced output that was too similar to that of the happy tag (both had increases in speech rate, pitch average and pitch range).
It was decided to keep the increase in speech rate to denote an increase in excitement, but to lower the pitch average. The lower voice seemed to better convey a menacing tone, for the same reason that animals utter a low growl to ward off possible intruders. Together with the increase in volume, the lowering of the pitch average also results in a perceived "hoarseness" in the voice3, although true vocal effects could not be implemented due to a limitation of the Festival and MBROLA systems. The combination of the decreased pitch range and the intonation rule results in a flatter intonation curve with sharper peaks. This upholds Table 2's description of pitch changes for anger: "abrupt … contour".

3 Increased hoarseness in the voice for anger is supported in the literature (Murray and Arnott, 1993).

5.7.4 Stressed Vowels


Some of the prosodic rules in the previous sections made use of the term "stressed vowel phoneme", and denoted two different types: primary and secondary. The term is used by Murray and Arnott (1995) when describing their prosodic rules, but unfortunately no explanation is given of the heuristic used to determine the type of stress. The frequent occurrence of the term within the prosodic rules, however, signified that it was too important to ignore. Therefore, analysis was carried out on the phoneme data output by Festival to ascertain whether there are any discriminating factors between vowel-sounding phonemes.
The results of the analysis showed that indeed, phoneme pitch and duration data can
be used to categorize the importance of vowel-sounding phonemes within a word. Table
6 summarizes the findings of the analysis, and shows how three types of vowel-sounding
phonemes can be discriminated based on the average phoneme pitch and duration values
of the utterance.
Stress type

Duration
Avg.

>

Duration

Pitch > Pitch Avg.

Primary

Yes

Yes

Secondary

Yes

No

Tertiary

No

Yes

Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch.
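A sketch of the classification implied by Table 6 is given below; the enumeration names and the handling of the case where neither criterion is met (not covered by the table) are assumptions.

// Sketch of the stress classification from Table 6: a vowel-sounding phoneme is
// compared against the utterance's average duration and pitch.
enum class VowelStress { Primary, Secondary, Tertiary, Unstressed };

VowelStress ClassifyVowel(double duration, double pitch,
                          double avgDuration, double avgPitch)
{
    bool longerThanAvg = duration > avgDuration;
    bool higherThanAvg = pitch > avgPitch;

    if (longerThanAvg && higherThanAvg)  return VowelStress::Primary;
    if (longerThanAvg && !higherThanAvg) return VowelStress::Secondary;
    if (!longerThanAvg && higherThanAvg) return VowelStress::Tertiary;
    return VowelStress::Unstressed;      // neither criterion met (assumed case)
}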

Therefore, the prosodic rules made use of the fact that different stress types exist for vowel-sounding phonemes, basing classification on the criteria shown in the table above. Whether or not this follows the stressed phoneme definition of Murray and Arnott (1995) is unclear; the fact remains that Table 6 allowed the implementation of prosodic rules that involved different stressed vowel types, and its success will be demonstrated in the evaluation chapter of this thesis (Chapter 6).

5.7.5 Conclusion
While the literature identifies a number of speech correlates of emotion, speech synthesizer limitations did not allow for the implementation of some of them; namely, the intensity, articulation and voice quality parameters. This may have been the main reason why happiness and anger would have been too similar if the recommendations found in the literature had been strictly followed: based on the implementable acoustic features alone, there were too few differences between the two emotions.
In discussing HAMLET's prosodic rules, Murray and Arnott (1995) state that the rules were developed to be as synthesizer independent as possible. This was found to be the case, though the specific values given in their paper for the DECtalk system obviously could not be used. Still, the values served as a very good indication of what settings needed to be changed for the different emotions. The very fact that some of the HAMLET prosodic rules could be implemented in this project's TTS module serves to show that the work of Murray and Arnott (1995) is not speech synthesizer dependent. This has the added advantage of assuring that the emotion rules implemented in the TTS module are not dependent on the Festival Speech Synthesis System and the MBROLA Synthesizer.

5.8 Implementation of Low-level SML Tags


5.8.1 Speech Tags
This section will briefly describe the implementation of SMLs lower level tags that can
be used to directly manipulate the output of the TTS module. For instance, using the
rate tag can make the TTS module produce slower or faster speech, and the volume tag
can increase/decrease the volume of the utterance.
The tree-like structure of the SML Document lends itself to recursive tag processing.
This is because an SML Document element node representing a tag usually affects a subtree of nodes of an unknown structure. Therefore it is very convenient to define recursive
functions that visit all child nodes of the tag node, and process the data held in the text
nodes.
a) emph
Using the emph tag, a word can be emphasized by increasing the pitch and duration of certain phonemes within the word. More specifically, emphasis of a word involves a target phoneme whose pitch and duration information is changed. The target phoneme can be specified via the target attribute (the actual phoneme is specified, not the letter); if not specified, the target becomes the first vowel-sounding phoneme within the word. The pitch and duration values of the target's neighbouring phonemes (if they exist) are also affected. Figure 21 shows the factors by which the duration and pitch values are multiplied.

<emph target="o" affect="b" level="moderate">sorry</emph>

The word "sorry" has the phonemes s o r ii. The target phoneme (o) has its duration multiplied by 2.0 and its pitch by 1.3; its neighbouring phonemes have their duration multiplied by 1.8 and their pitch by 1.2.

Figure 21 - Multiplication factors for the pitch and duration values of emphasized phonemes.

The attributes that can be specified within the emph tag give various options for how the word will be emphasized. For instance, the affect attribute can specify whether only the pitch should change for the affected phonemes (default), or just the duration, or both. The level attribute specifies the strength of the emphasis (e.g. weak (default), moderate, strong). The current limitation of the tag is that only one target phoneme can be specified. However, this could easily be modified so that multiple target phonemes can exist within a word, by extending the target attribute and how it is processed.
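A sketch of how the emphasis factors might be applied to a word's phoneme data is shown below; the factor values are those from Figure 21, while the structure member names are assumptions reused from the earlier utterance-structure sketch.

// Sketch of emph processing: scale the target phoneme's duration and pitch,
// and those of its immediate neighbours, by the factors shown in Figure 21.
#include <cstddef>
#include <string>
#include <vector>

struct PitchPoint { double percent; double pitch; };
struct Phoneme {
    std::string name;
    int durationMs;
    std::vector<PitchPoint> pitchPoints;
};

static void Scale(Phoneme &p, double durFactor, double pitchFactor)
{
    p.durationMs = static_cast<int>(p.durationMs * durFactor);
    for (PitchPoint &pp : p.pitchPoints)
        pp.pitch *= pitchFactor;
}

void EmphasizeWord(std::vector<Phoneme> &word, const std::string &target)
{
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (word[i].name != target)
            continue;
        Scale(word[i], 2.0, 1.3);                              // target phoneme
        if (i > 0)               Scale(word[i - 1], 1.8, 1.2); // left neighbour
        if (i + 1 < word.size()) Scale(word[i + 1], 1.8, 1.2); // right neighbour
        break;
    }
}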
b) embed
The embed tag enables foreign file types to be embedded within SML markup. The type of embedded file is specified through the type attribute. Currently, two file types are supported in SML:

audio - the embedded file is an audio file, and is played aloud. The filename is specified through the src attribute. Embedding audio files within SML markup is useful for sound effects.
<embed type="audio" src="sound.wav"/>

mml - Music Markup Language (MML) markup is being embedded, and signifies that the voice will sing the specified song. Two input files are specified through the following attributes: music_file (contains the music description of the song) and lyr_file (contains the song's lyrics). Processing of this tag involves obtaining the two input filenames and calling the MML library, which performs the synthetic singing. The implementation of MML for synthetic singing was developed by Stallo (2000).
<embed type="mml" music_file="rowrow.mml" lyr_file="rowrow.lyr"/>

c) pause
The pause tag inserts a silent phoneme at the end of the last word of the previous
text node with a duration specified by the length or msec attributes. Finding the text
node previous to the pause node can be is non-trivial, as the algorithm must be able to
handle any sub-tree structure. shows a possible structure that the algorithm must be able
to handle.
Take a deep <emph>breath</emph> <pause length=medium/> and continue

Figure 22 - Processing a pause tag. (The markup produces a text node "Take a deep", an emph element whose child text node holds "breath", and the pause element with length="medium"; the silent phoneme is inserted at the end of "breath", the last word of the previous text node.)

d) pitch
Pitch average
The pitch average is modified by changing the pitch values of every phoneme's pitch point(s). The middle attribute specifies the factor by which the pitch values need to change. Modifying the pitch value of every phoneme by the same amount has the effect of changing the pitch average.

Pitch range
Modifying the pitch range of an utterance requires that the pitch average be known. Therefore, the pitch average of the utterance is calculated before any pitch values are modified. Each pitch value is then recalculated using the following equation:

NewPitch = (OldPitch - Average) * PitchRangeChange

This has the effect of moving pitch points further away from the average line, for pitch values that are both greater and less than the pitch average (as shown in Figure 23). Care is taken that the new pitch values do not go below or above predetermined thresholds, otherwise the voice loses its human quality and sounds too machine-like.
[The diagram compares the original pitch values with the new pitch values, both plotted around the pitch average line.]

Figure 23 - The effect of widening the pitch range of an utterance.
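A sketch of this pitch range modification is given below. It adds the average back onto the scaled deviation so that the result is an absolute pitch value; whether the module applies the equation in exactly this form is an assumption, but it matches the described effect of moving pitch points away from the average line. The clamping thresholds are also assumptions.

// Sketch of pitch range modification: scale each pitch value's deviation from
// the utterance average. Adding the average back and the clamp thresholds are
// assumptions made so the result stays a sensible absolute pitch value.
#include <algorithm>
#include <vector>

void ChangePitchRange(std::vector<double> &pitchValues, double rangeFactor)
{
    if (pitchValues.empty())
        return;

    // Calculate the utterance's pitch average first.
    double sum = 0.0;
    for (double p : pitchValues)
        sum += p;
    double average = sum / pitchValues.size();

    // Move each pitch point further from (factor > 1) or closer to (factor < 1)
    // the average line, clamping so the voice does not sound machine-like.
    const double minPitch = 50.0, maxPitch = 400.0;   // assumed thresholds
    for (double &p : pitchValues) {
        double newPitch = average + (p - average) * rangeFactor;
        p = std::clamp(newPitch, minPitch, maxPitch);
    }
}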

e) pron
The pron tag is used to specify a particular pronunciation of a word. It pron tag is
the only tag that modifies the text content of an SML Document text node, and is
processed before the contents is given to the NLP module at the text to phoneme
transcription stage. This is because the value of the sub attribute overwrites the contents
of the text node. When the phoneme transcription stage is reached, the substituted text is
what is given to the NLP module, and so the phoneme transcription reflects the specified
pronunciation of the markup. Figure 24 illustrates how this is done for the markup
segment:
<pron sub=toe may toe>tomato</pron>

[Figure 24 shows the pron ELEMENT_NODE (with sub="toe may toe") and its child text node containing "tomato". The sub attribute value overwrites the contents of the child text node, so at the phoneme transcription stage the text node gives "toe may toe" to the NLP and receives the corresponding phoneme transcription.]

Figure 24 - Processing the pron tag.

f) rate

The speech rate is modified very easily by modifying the phoneme duration data in the text node structures. How much the speech rate should increase or decrease is specified by the speed attribute, and each phoneme's duration is multiplied by a factor reflecting the value of speed.

g) volume
The volume is modified through the use of MBROLA's -v command line option. The level attribute specifies the volume change, and this is converted to a suitable value to pass to MBROLA on the command line. The disadvantage of implementing volume control this way is that MBROLA applies the volume to the whole utterance it synthesizes. Since sections are passed to MBROLA at the emotion tag level, the volume can vary at most from emotion to emotion. Therefore, dynamic volume change within an emotion tag is not possible, and this has been identified as an area for future work (see Section 7.1).

5.8.2 Speaker Tag


The speaker tag allows control over some speaker properties, principally the diphone database MBROLA is to use (see Section 5.9) and the speaker's gender. Implementation mainly involved providing a way to change the MBROLA Synthesizer's settings. Example SML markup:
<speaker gender="female" name="en1">I am a woman</speaker>

Name
The value of the name attribute is passed to MBROLA on the command line, and must be the name of an MBROLA diphone database that already exists on the system. Using a different diphone database changes the voice, since the recorded speech units contained within the database are from a different source. The TTS module can currently use two diphone databases: en1 and us1. Making use of more diphone databases (when they become available) requires very minimal additions. Unfortunately, specifying the actual name of the MBROLA diphone database in the SML markup has been identified as a design flaw, since an SML user must now be aware that the MBROLA synthesizer is being used and is forced to write markup that directly accesses an MBROLA voice. It is very important that this be altered in the future.

Gender
The MBROLA synthesizer provides the ability to change frequency values and voice characteristics, and thus provides a way of obtaining a male and female voice from the same diphone database. Obtaining a male or female voice requires the specification of the following MBROLA command line options:
1. Frequency ratio - specified through the -f command line option. For instance, if -f 0.8 is specified on the MBROLA command line, all fundamental frequency values will be multiplied by 0.8 (the voice will sound lower).
2. Vocal tract length ratio - specified through the -l command line option. For instance, if the sampling rate of the database is 16000, specifying -l 18000 shortens the vocal tract by a ratio of 16/18 (which makes the voice sound more feminine).
Unfortunately, the values for the -f and -l MBROLA command line options are dependent on the diphone database being used. Table 7 shows the parameter values required to obtain male and female voices for the en1 and us1 diphone databases.
Gender | Frequency ratio (-f) | Vocal tract ratio (-l)
Male (using en1) | 1.0 | 16000
Female (using en1) | 1.6 | 20000
Male (using us1) | 0.9 | 16000
Female (using us1) | 1.5 | 16000

Table 7 - MBROLA command line option values for the en1 and us1 diphone databases to output male and female voices.
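The sketch below shows how the speaker and volume settings might be turned into an MBROLA command line. The -f, -l and -v options and the Table 7 values come from the text; the .pho/.wav file names, the use of std::system, and passing the database by name rather than by path are illustrative assumptions.

// Sketch: build an MBROLA command line from the speaker/volume settings.
#include <cstdlib>
#include <sstream>
#include <string>

std::string BuildMbrolaCommand(const std::string &database,  // e.g. "en1", "us1"
                               bool female,
                               double volume,                // e.g. 0.6 for sad
                               const std::string &phoFile,
                               const std::string &wavFile)
{
    // Frequency and vocal tract ratios from Table 7.
    double freqRatio = 1.0;
    int vocalTract = 16000;
    if (database == "en1") { freqRatio = female ? 1.6 : 1.0; vocalTract = female ? 20000 : 16000; }
    if (database == "us1") { freqRatio = female ? 1.5 : 0.9; vocalTract = 16000; }

    std::ostringstream cmd;
    cmd << "mbrola -f " << freqRatio << " -l " << vocalTract << " -v " << volume
        << " " << database << " " << phoFile << " " << wavFile;
    return cmd.str();
}

int main()
{
    // e.g. "mbrola -f 1.6 -l 20000 -v 1 en1 utterance.pho utterance.wav"
    std::string cmd = BuildMbrolaCommand("en1", true, 1.0,
                                         "utterance.pho", "utterance.wav");
    return std::system(cmd.c_str());
}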

5.9 Digital Signal Processor


Once all phoneme data is finalized, the SML Document contains the information required to produce a waveform, which is one of the outputs of the black box figure shown in Section 5.1. The Digital Signal Processing (DSP) unit of the TTS module makes heavy use of the MBROLA Synthesizer, which is a speech synthesizer based on the concatenation of diphones. This basically means that an utterance is synthesized by concatenating very small units of recorded speech sounds4.

4 The set of recorded speech sounds is often referred to as the voice data corpus, or a diphone database.

MBROLA accepts as input a list of phonemes, together with prosodic information (in the form of phoneme durations and a piecewise linear description of pitch), and from this it is able to produce speech samples. The MBROLA input format is very intuitive and simple for describing phoneme data; hence the design of the TTS module's utterance structures (discussed in Section 5.5.2) was based on the MBROLA input format. Figure 25 shows the example MBROLA input required to produce a speech sample of the word "emotion". Each line holds the following information, delimited by white space: phoneme, duration, and n pitch-point pairs. A pitch-point pair is represented by two numbers: a percentage value representing how far into the phoneme's duration the point occurs, and the actual pitch value.
# 210
i 61 0 109 50 110
m 69
ou 140 0 107 50 110
sh 117
@ 48 0 105 50 94
n 91 100 90
# 210

Figure 25 - Example MBROLA input. Each line gives the phoneme, its duration, and any pitch points.
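A sketch of how the utterance data might be serialized into this format is shown below; the structure member names are the illustrative ones used in the earlier sketches, and the example data reproduces Figure 25.

// Sketch: serialize phoneme data into the MBROLA input (.pho) format shown in
// Figure 25. Structure member names are assumptions, not the module's actual ones.
#include <iostream>
#include <string>
#include <vector>

struct PitchPoint { int percent; int pitch; };
struct Phoneme {
    std::string name;
    int durationMs;
    std::vector<PitchPoint> pitchPoints;
};

void WriteMbrolaInput(std::ostream &out, const std::vector<Phoneme> &phonemes)
{
    for (const Phoneme &p : phonemes) {
        out << p.name << ' ' << p.durationMs;        // phoneme and duration
        for (const PitchPoint &pp : p.pitchPoints)   // n pitch-point pairs
            out << ' ' << pp.percent << ' ' << pp.pitch;
        out << '\n';
    }
}

int main()
{
    // The word "emotion", as in Figure 25 (the '#' entries are silences).
    std::vector<Phoneme> emotion = {
        {"#", 210, {}}, {"i", 61, {{0, 109}, {50, 110}}}, {"m", 69, {}},
        {"ou", 140, {{0, 107}, {50, 110}}}, {"sh", 117, {}},
        {"@", 48, {{0, 105}, {50, 94}}}, {"n", 91, {{100, 90}}}, {"#", 210, {}},
    };
    WriteMbrolaInput(std::cout, emotion);
    return 0;
}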

5.10 Cooperating with the FAML module


A subproblem that was stated in Section 2.2 was the cooperative integration with the FAML (Facial Animation Markup Language) module, which implements a facial gesture markup language for the Talking Head. Collaboration with Huynh (2000) aimed at achieving synchronization of vocal expressions and facial gestures. The TTS module supports the FAML module in the following ways.
1. The allowance of FAML markup tags within SML markup. This required a filtering process to be implemented so that non-SML tags are not present when parsing of the TTS input takes place. It also required that provisions be put in place so that tag name conflicts would not occur. This was resolved by making sure that no tag names occurred in both the SML tag set and the FAML tag set. For instance, to output an angry expression the SML tag set uses the tag name angry, while FAML uses the tag name anger. This is a simplistic solution, however, because it hinders the choice of good descriptive tag names (if an appropriate name is already being used in another tag set), and similar tag names such as anger and angry will only confuse future users of TTS and FAML markup. A possible solution would be the use of XML Namespaces, which allow different markup languages to contain the same tag names (ambiguity is resolved through the use of resolution identifiers, e.g. sml:angry and faml:angry).
2. FAML API functions are called from within the TTS module to initialize the FAML module, and to allow the FAML module to modify the output FAP stream.
3. The creation of temporary utterance files in a format required by the FAML module. This information is needed by the FAML module for the proper synchronization of its generated facial gestures and the TTS module's speech. The utterance file contains word and phoneme information in the format shown in Figure 26.
>And
# 210
a 90
n 49
d 46
>now
n 72
au 262
# 210
>the
dh 45
@ 40
>latest
l 69
ei 136
t 71
i 66
s 85
t 66
>news
n 72
y 45
uu 217
z 82
# 210

Figure 26 - Example utterance information supplied to the FAML module by the TTS module
(lines beginning with ">" give a word; the lines that follow give each phoneme and its
duration in ms). Example phrase: "And now the latest news".
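The following sketch illustrates how XML Namespaces could remove the tag name conflict described in point 1 above; the prefixes and namespace URIs are purely illustrative and are not part of the current SML or FAML definitions:

    <sml xmlns:sml="http://www.example.org/sml"
         xmlns:faml="http://www.example.org/faml">
      <faml:angry/>
      <sml:angry>I told you not to touch that!</sml:angry>
    </sml>

With namespaces in place, both tag sets could keep the natural name angry, and a filtering module could separate the markup by namespace rather than by a list of known tag names.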

5.11 Summary
This chapter has demonstrated that the design and implementation of the described TTS
module follows the black box design principle in virtually all of its subsystems.
• As Figure 7 shows, the dependence of other Talking Head modules on the TTS
module is strictly limited to its precisely defined inputs; any modification of the
internal workings of the TTS module will not affect how the rest of the Talking Head
functions.
• The TTS module's own dependence on the tools it uses, such as libxml, the Festival
Speech Synthesis System, and the MBROLA Synthesizer, has been carefully bounded
through proper design of C++ classes and structures. There is also no dependence of
the DSP unit (which makes use of MBROLA) on the NLP unit (Festival), so if another
NLP unit were used, only minimal changes would be required and the TTS subsystem
design would remain unchanged. The only requirement for a new speech synthesizer
to be used in the NLP and DSP units is the ability to manipulate the utterance at the
phoneme data level (including pitch and duration information).
• The prosodic and speech parameter rules for the emotion tags are totally speech
synthesizer independent, except for the volume settings. Because of this, the emotion
tags could easily be ported for use in another TTS module using different speech
synthesizers.
• Processing of most of the low-level speech tags makes use of the SML Document's
utterance structures only. However, the speaker and volume tags make heavy use of
the MBROLA Synthesizer because these tags affect the way MBROLA produces the
waveform. Future work will look at minimizing the dependence of these two tags on
MBROLA.

Chapter 6
Results and Analysis
Evaluation of this project was primarily concerned with testing the hypotheses stated in
Section 4.1. The evaluation process therefore endeavoured to ascertain how well the
system is able to simulate emotion in synthetic speech, and the extent of the effect this has
on a Talking Head's ability to communicate. This chapter describes the procedure by
which data was acquired to test the hypotheses, followed by a presentation and full
analysis of the data.

6.1 Data Acquisition


6.1.1 Questionnaire Structure and Design
It was decided that evaluation data would be acquired via a questionnaire filled out by
participants of a single demonstration. The decision was based on experience gained in
the evaluation of a previous version of the Talking Head by Shepherdson (2000). The
questionnaire had to be carefully designed to ensure that the questions asked would
provide sufficient data to adequately prove or disprove the project's stated hypotheses.
The following subsections describe the structure of the questionnaire and the design
issues that were considered. For a copy of the actual questionnaire, see Appendix D.
Section 1 - Personal and Background Details
Because it was known beforehand that most, if not all, participants would be university
students, and that a good proportion would be international students, questions were
prepared to acquire each participant's nationality and whether English was their first
spoken language. Combined with their responses in the other sections, it was hoped that
this would provide useful data for analysis. Other information, such as the participant's
age and gender, was also requested.

Section 2 - Emotion Recognition

This section was made up of two parts, and borrowed heavily from the testing method
described in Murray and Arnott (1995). Both parts dealt with the recognition of emotion
in synthetic speech, but followed a different format. Part A had only two test phrases, but
each phrase was synthesized under four different emotions: anger, happiness, sadness,
and neutral (no emotion). The two phrases were specially chosen because they were
deemed to be emotionally undetermined (Murray and Arnott, 1995). Emotionally
undetermined phrases are phrases whose emotion cannot be determined simply from the
words. For instance, the phrase "I received my assignment mark today" can be
convincingly said under a variety of emotions: the speaker may be sad about the mark he
or she received, may be feeling happy about it, or may even be angry at feeling unfairly
treated.
For Part B, ten test phrases were prepared: five of these were emotionally neutral
phrases, and the other five were emotionally biased phrases. An example of a neutral
phrase is "The book is lying on the table". An example of an emotionally biased, or
emotive, phrase is "I would not give you the time of day" (the words already carry
negative connotations, which could influence the listener to identify anger or disgust).
Part B consisted of the following utterances (in this order):
a) Five neutral phrases spoken without emotion.
b) Five emotive phrases spoken without emotion.
c) The same phrases as in (a) above, but spoken in one of the four emotions (including
neutral).
d) The same phrases as in (b) above, but spoken in one of the four emotions.
Both Cahn (1990b) and Murray and Arnott (1995) expressed difficulty in finding
appropriate test phrases, and indeed, the same difficulties were encountered when
designing the questionnaire for this project, especially in finding neutral phrases that
sounded convincing under any of the four emotions. A number of phrases were
borrowed from the experiments of the aforementioned researchers, and the rest were
original. It is important to note that the participants were not made aware of the different
types of phrases that were prepared. See Appendix E for a list of the example test
phrases used.
An important design issue for this section of the questionnaire was deciding the way
in which the participants should indicate their choice. Murray and Arnott (1995) have

identified (and used) two basic methods of user input that are suitable for speech emotion
recognition: forced response tests, and free response tests. In a forced response test, the
subject is forced to choose from a list of words the one that best describes the emotion
that he or she perceives is being spoken. In a free response test, the subject may write
down any word that he or she thinks best describes the emotion.
In experiments performed by Cahn (1990), the evaluation was based on a forced
response test, with only the six emotions that the Affect Editor simulated as possible
responses. Participants were also asked to indicate (on a scale of 1 to 10) how strongly
they heard the emotion in the utterance, and how sure they were.
For this project, it was decided to adopt the forced response test because of its
simplicity and to avoid the possible ambiguity of a free response test (for instance, if a
participant wrote down "exasperated", should it be categorized as angry or disgusted,
or neither?). However, a mechanism needed to be provided so as not to limit the possible
selection of responses. This was important because only three emotions other than
neutral (anger, happiness, and sadness) were simulated by the system. It was feared that
if the selection list were confined to just four possible responses, the data could
potentially be invalidated. For instance, any utterance that contained positive words,
or had a positive tone to it, would immediately be perceived as happy simply because
happiness would be the only positive emotion in the list. This situation was avoided by
adding two distractor emotions to the selection list: surprise and disgust. This way,
listeners would have more choice. In addition, an "Other" option was added to the
selection list to enable participants to write down their own descriptive term if they so
wished. Therefore, a total of seven possible responses were made available for the four
different types of emotion being simulated.
Usefully, the tests in this section had a very similar structure to the forced response
tests conducted by Murray and Arnott (1995). The experiment could not be exactly the
same, since time limitations did not permit as many test utterances to be played, and not
all the emotions tested by Murray and Arnott were simulated. Still, the experiments are
similar enough to provide at least a loose comparison of results.
It would also have been interesting to test a female voice, since the voice gender can
be changed through the speaker tag (see Section 5.8.2). However, the questionnaire
length was strictly limited, and a sufficient number of examples of one gender were
needed to obtain useful data; interchanging male and female voices would have
introduced complex variables, and making valid conclusions from the data would have
been difficult, if not impossible. Still, differences in subject responses to male and
female voices could have provided useful data for analysis, and this should certainly be
looked into in the future (see Chapter 7).
Section 3 - Talking Head
For this section it was desired to obtain data that could describe the effect that adding
vocal emotion to a Talking Head has on a user's perception of the Talking Head. To do
this, two movies of the Talking Head were prepared: one speaking without vocal emotion
effects, and the other including emotion effects in its speech; all other variables were kept
as similar as possible (e.g. facial expressions, movements, and the actual words spoken).
It was not possible to have the visual information of the Talking Head exactly the same
for both examples, since the inclusion of certain speech tags affected the length of the
utterance, and so slight movements such as eye blinking may have differed. Facial
expressions showing emotion, however, were the same for both examples (the exact
placement, duration and intensity of the expressions could be controlled via facial markup
tags (Huynh, 2000)).
The utterance that was synthesized for both examples was excerpted from the Lewis
Carroll classic Alice's Adventures in Wonderland (Carroll, 1946). The excerpt contained
dialogue between three characters of the story: Alice, the March Hare, and the Hatter.
The passage was chosen for its wonderful expressiveness (it is a children's novel), the
fact that it included dialogue, and because various emotions such as sadness, curiosity,
and disappointment could be used when reading the passage.
The first example's speech was synthesized without any markup at all, and therefore
included Festival's intonation unchanged. For the second example, the text was marked
up by hand using a variety of high-level speech emotion tags, and lower-level speech tags
such as rate, pitch and emph (see Section 5.8).

Section 4 - General Information

The last section ascertained whether the participant had heard of the term "speech
synthesis" before, and whether they had seen a Talking Head before. The section also
aimed at receiving comments about the speech examples the participant had heard, and
about the Talking Head they had seen.

6.1.2 Experimental Procedure


The demonstration was held in a moderately sized lecture theatre characterized by
benches arranged in a tiered fashion (maximizing the view of the front) and a large
screen at the front of the theatre. The room also had an adequate sound system, enabling
each participant to comfortably listen to the demonstration. A total of 45 participants
took part in the demonstration, a number large enough to produce adequate data.
The demonstration began by very briefly introducing the participants to what the
project was about. They were told that by filling out the questionnaire they would be
helping to evaluate how well the project addresses the problem it had set out to solve.
It was made clear to all participants that sitting the demonstration was not compulsory.
The participants were given an overview of what to expect in the questionnaire, including
the fact that they would be played a number of sound files and asked to comment on each
one. It was emphasized, however, that it was not they who were being tested, but the
program itself, and that there were no right or wrong answers.
The participants were asked to fill out Section 1 of the questionnaire. They were told
the relevance of the questions being asked in that section, but that they were under no
obligation to answer questions they did not feel comfortable with. The participants were
encouraged to ask questions to clarify any details they felt had not been made clear.
The sound and movie demonstrations of Sections 2 and 3 had been pre-rendered and
made part of a Microsoft PowerPoint presentation that was shown on the large screen at
the front of the lecture theatre. The reason for this was to avoid waiting for the utterances
to be generated, and to provide a way of showing the audience the current section and
example number that was being played. It also minimized the risk of anything going
wrong.
Parts A and B of Section 2 consisted of the test utterances described in Section 6.1.1
being played. For each test utterance, the sound file was played twice and the
participants were given time to choose which emotion best suited how the speaker
sounded. In order to give the participants a chance to listen to the test utterance again and
confirm their choice, the five most recent utterances were repeated every fifth example.
The repeating of the utterances also served as a mental break for the participants.
The next section of the questionnaire consisted of the two Talking Head examples
described in Section 6.1.1. The participants were asked to first watch both of the
examples, and then fill out that part of the questionnaire. Before the examples were
played, the participants were asked to scan the four questions asked in that section so as
to aid them in what they should look for: the Talking Head's clarity, expressiveness,
naturalness, and appeal. However, the participants were not asked to focus on the voice
only, even though it was only the voice that changed between the two examples. The
participants were then asked to fill out the rest of the questionnaire, which consisted of
more general questions (see Appendix D).

It is important to note that some factors did not allow the evaluation to be carried out
under ideal conditions. For instance, the very fact that the demonstration was held with a
group of participants may have been a cause of distraction for some. An ideal situation
would have been for each participant to sit the questionnaire individually, in front of a
computer, without anyone else in the room. This would have minimized distractions and
allowed each participant to go at his or her own pace. Given the time and resources
available, however, this was not possible. Still, the demonstration was designed to be
short enough to keep the participants' attention, to give each participant the ability to
review their answers through playbacks, and to be conducted in such a way as to give the
participants ample time to answer the questions.

6.1.3 Profile of Participants


The first and last sections of the questionnaire provided some statistics about the
participants who were involved in the evaluation process. Table 8 shows a summary of
the information obtained.
Average age:                            23.9
Gender:                 Male            84.00%
                        Female          16.00%
Country lived in most:  Australia       51.10%
                        Singapore       8.90%
                        Indonesia       8.90%
                        Malaysia        6.70%
                        Vietnam         4.40%
                        Other           8.90%
                        Unanswered      11.10%
English first language: Yes             57.80%
                        No              42.20%

Table 8 - Statistics of participants

In addition to the statistics shown in the table, it should be noted that all participants
were students enrolled in a second year Computer Science introductory graphics unit. All
participants were computer literate and used computers at home, at school and for work.
71.1% had heard of the term "speech synthesis" before, while 26.7% had not (2.2% did
not answer). Also, 82.2% had seen a Talking Head before, while 15.6% had not (2.2%
did not answer).

6.2 Recognizing Emotion in Synthetic Speech


The discussion in this section provides a full analysis of the data obtained from the
emotion recognition tests (Sections 2A and 2B of the questionnaire). Before the
discussion continues, however, it will be beneficial to describe the structure and meaning
of the confusion matrix, which is how most of the data will be presented.

6.2.1 Confusion Matrix


The advantage of using a confusion matrix to present the data is that it not only shows the
correct recognition percentage for a particular emotion, but also the distribution of the
emotions that were mistakenly5 chosen. It thus provides a way to clearly show any
confusion that may have occurred between two or more emotions.
Table 9 shows the confusion matrix template that will be used. The column headings
represent the possible emotions the participants could choose from. The row headings
represent the actual emotions that were simulated. Each cell contains a percentage value
indicating the proportion of listeners who perceived a particular emotion for a particular
stimulus.
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy
Sad
Angry
Neutral

Table 9 - Confusion matrix template
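Purely as an illustrative sketch (the types and names are hypothetical, not the actual analysis code), the cell values of such a matrix can be produced by tallying each (simulated, perceived) response pair and converting every row to percentages:

    #include <array>
    #include <vector>

    const int kSimulated = 4;   // happy, sad, angry, neutral
    const int kPerceived = 7;   // happy, sad, angry, neutral, surprised, disgusted, other

    struct Response {
        int simulated;   // row index: the emotion the utterance was generated with
        int perceived;   // column index: the emotion the listener selected
    };

    // cell[row][col] becomes the percentage of listeners who perceived emotion
    // 'col' when emotion 'row' was being simulated.
    std::array<std::array<double, kPerceived>, kSimulated>
    buildConfusionMatrix(const std::vector<Response>& responses)
    {
        std::array<std::array<double, kPerceived>, kSimulated> cell{};
        std::array<double, kSimulated> rowTotal{};
        for (const Response& r : responses) {
            cell[r.simulated][r.perceived] += 1.0;
            rowTotal[r.simulated] += 1.0;
        }
        for (int row = 0; row < kSimulated; ++row)
            for (int col = 0; col < kPerceived; ++col)
                if (rowTotal[row] > 0.0)
                    cell[row][col] = 100.0 * cell[row][col] / rowTotal[row];
        return cell;
    }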

For example, in Table 10, the first row of example percentage values shows that when
an utterance was combined with happy vocal emotion effects generated by the system,
42.2% of listeners perceived the emotion as happy (i.e. correct recognition took place),
2.2% perceived it as sounding sad, 6.7% angry, 17.9% neutral, 23.7% surprised, 4.4%
disgusted, and 2.9% specified something else. The data in the example would therefore
indicate that happiness was the most recognized emotion for that utterance, and that the
utterance was mostly confused with surprise and neutral. Note that the Other category
also includes those participants who could not decide which emotion was being
portrayed.

5 The intended meaning is not that the participants failed to recognize the simulated emotion through a fault
of their own, but simply that the participant's choice did not match the emotion being simulated.
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        42.2%    2.2%     6.7%     17.9%    23.7%      4.4%       2.9%
Sad
Angry
Neutral

Table 10 - Confusion matrix with sample data.

From the above example, it can be seen that the data should be read in rows: each
row represents an utterance or group of utterances simulating a specific emotion, and the
cell values for that row show the distribution of the participants' responses.
Cells that have the same row and column names hold values that represent cases
where the listener's perceived emotion matched the emotion being simulated; any other
cell represents incorrect recognition. A table holding values such as those in Table 11
would therefore show ideal data: 100% recognition of the simulated speech emotion for
all emotions.
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        100%     0.0%     0.0%     0.0%     0.0%       0.0%       0.0%
Sad          0.0%     100%     0.0%     0.0%     0.0%       0.0%       0.0%
Angry        0.0%     0.0%     100%     0.0%     0.0%       0.0%       0.0%
Neutral      0.0%     0.0%     0.0%     100%     0.0%       0.0%       0.0%

Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all
simulated emotions.

Of course, such a result is virtually impossible, since it is well documented in the
literature that recognition rates are far less than perfect even when humans speak
(Malandra, Barker and Barker, 1989). Knapp (1980) identifies three main factors that
explain why this is so:
1. People vary in their ability to express their emotions (not only in speech, but also in
other forms of communication such as facial expressions and body language).
2. People vary in their ability to recognize emotional expressions. In a study described
in Knapp (1980), listeners ranged from 20 percent correct to over 50 percent correct.
3. Emotions themselves vary in how readily they can be correctly recognized. Another
study showed anger was identified 63 percent of the time, whereas pride was only
identified 20 percent of the time.
So if obtaining such ideal data for human speech is unrealistic, this is even more the case
when emotion is being simulated in synthetic speech.

6.2.2 Emotion Recognition for Section 2A


The Data
The listener response data for the test utterances spoken with happy vocal emotion effects
is given in Table 12. The data shows a relatively poor recognition rate (18.9%) for the
happy emotion; 41.1% of listeners perceived the emotion to be neutral, while 22.2%
perceived the speaker as sounding surprised.
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        18.9%    3.3%     3.3%     41.1%    22.2%      6.7%       4.4%
Sad
Angry
Neutral

Table 12 - Listener response data for neutral phrases spoken with happy emotion.

Similarly, Table 13 shows listener response data for all four emotions demonstrated
in the test utterances for Section 2A. Significant values are displayed in a larger font.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        18.9%    3.3%     3.3%     41.1%    22.2%      6.7%       4.4%
Sad          0.0%     77.8%    1.1%     7.8%     0.0%       6.7%       6.7%
Angry        6.7%     0.0%     44.4%    21.1%    5.6%       21.1%      1.1%
Neutral      6.7%     23.3%    3.3%     54.4%    3.3%       2.2%       6.7%

Table 13 - Section 2A listener response data for neutral phrases.

The following observations can be made from the data for each emotion (including
happy, which has already been discussed).
a) Happy. A poor recognition rate (18.9%), with the stimulus being strongly confused
with neutral (41.1%) and surprise (22.2%).
b) Sad. A very high recognition rate occurred for sadness (77.8%), with little confusion
occurring with other possible emotions.
c) Angry. A relatively high percentage of listeners recognized the angry stimulus
(44.4%), but a considerable proportion confused anger with neutral and disgust
(21.1% each).
d) Neutral. Most listeners correctly recognized when the utterance was played without
emotion (54.4%), but a significant portion (23.3%) perceived the emotion as sad.
The Analysis
Any percentage value substantially greater than 14% was deemed significant. This is
based on the logic that if all participants had chosen one of the seven emotions at random,
then a particular emotion would have a 1/7 (~14%) chance of being chosen. Therefore, if
a cell has a percentage substantially greater than 14%, there must have been a factor (or
factors) that influenced the listeners' choice.
Except for happy, all emotions had a recognition rate greater than 14%. However,
sadness was the only emotion that enjoyed little confusion with other emotions. So with
the exception of sadness, average recognition of the simulated emotion in the utterances
was quite low.
To give a possible explanation for these values, it is helpful to look at the data for
each question separately. Table 14 shows listener response data for Question 1 of
Section 2A, and Table 15 shows listener response data for Question 2.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        6.7%     2.2%     6.7%     57.8%    13.3%      8.9%       4.4%
Sad          0.0%     66.7%    2.2%     8.9%     0.0%       11.1%      11.1%
Angry        0.0%     0.0%     66.7%    6.7%     4.4%       20.0%      2.2%
Neutral      0.0%     46.7%    2.2%     40.0%    2.2%       2.2%       6.7%

Table 14 - Listener response data for Section 2A, Question 1.


                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        31.1%    4.4%     0.0%     24.4%    31.1%      4.4%       4.4%
Sad          0.0%     88.9%    0.0%     6.7%     0.0%       2.2%       2.2%
Angry        13.3%    0.0%     22.2%    35.6%    6.7%       22.2%      0.0%
Neutral      13.3%    0.0%     4.4%     68.9%    4.4%       2.2%       6.7%

Table 15 - Listener response data for Section 2A, Question 2.

The two tables clearly hold substantially different values. Whereas for Question 1 the
happy stimulus received a recognition rate of only 6.7%, with 57.8% mistaking it for
neutral, this changed dramatically for Question 2, with 31.1% of listeners correctly
identifying the happy emotion and only 24.4% classifying the utterance as neutral. There
is still considerable confusion for the happy stimulus even in Question 2 (surprise is
31.1%), but the difference between the two questions' recognition rates is too large to
ignore.
It is therefore evident that the results in this section were very much utterance
dependent. From the data, it seems that the phrase "The telephone has not rung at all
today" (Question 1) was said more effectively in an angry tone than "I received my
assignment mark today" (Question 2), and this was despite the care that was taken to
choose neutral test phrases. Both Cahn (1990) and Murray and Arnott (1995) report this
problem of utterance-dependent results, and it demonstrates the difficulty of obtaining
quantitative results on speech emotion recognition.
An interesting observation concerns cases where a particular stimulus received strong
listener recognition: the emotion in the same row with the next highest response follows a
pattern. For example, anger received strong recognition for Question 1 (see Table 14);
for that utterance, the emotion that anger was most confused with was disgust, which
received 20.0%. Close scrutiny of the two tables also reveals that when happiness was
strongly recognized, the emotion it was most confused with was surprise (see row 1 of
Table 15). Also, neutral in Table 14 was most confused with sad. The pattern that
emerges is the one described in the literature: the pairs most often confused with each
other are happiness-surprise, sadness-neutral, and anger-disgust. This is because the
speech correlates identified in the literature are very similar for these emotion pairs. As a
consequence, emotion recognition will often be confused between these emotion pairs,
especially with neutral text.
From Table 14 and Table 15, it can also be seen that recognition of the simulated
emotions generally improved for Question 2. This could be due to the listeners becoming
accustomed to the synthetic voice and learning to distinguish between the emotions. It
could also have been due to the listeners, who were all students, relating more to the
phrase of the second question, which was about receiving an assignment mark. This
could be clarified with further tests.

6.2.3 Emotion Recognition for Section 2B


As outlined in Section 6.1.1, there were ten test phrases in this section, with each phrase
being used to generate two utterances: one spoken in a neutral tone and the other in one
of the three emotions (happy, sad, or angry). Section 2B was thus made up of four
different types of utterances, in the following order:
a) Emotionless or neutral text phrases spoken with no vocal emotion.
b) Emotive text phrases spoken with no vocal emotion.
c) Emotionless text phrases of (a) spoken with vocal emotion.
d) Emotive text phrases of (b) spoken with vocal emotion.
In this part of the analysis, the data will be presented for each of these four test
utterance types.

Emotionless text with no vocal emotion

Table 16 shows listener responses for utterances with emotionless text with no vocal
emotion; that is, utterances whose text was emotionally indistinguishable from just the
words alone, and spoken with no vocal emotion effects. The text phrases used for this
section are phrases 1-5 shown in Appendix E.
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Neutral      2.2%     36.9%    2.7%     44.4%    3.1%       6.7%       4.0%

Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion.

The observation that can be made from the data is that there was strong recognition
that the utterances were spoken with no emotion. However, the utterances were also
confused with sadness, which received a strong listener response (36.9%). Again, as in
the previous section of the questionnaire (discussed in Section 6.2.2), it was found that
there was a great deal of variation in listeners' perception that depended on the utterance
being spoken. For instance, one of the phrases in this section was "The telephone has not
rung at all today". The majority (73.3%) of listeners perceived the speaker as being sad,
while only 20.0% thought it sounded neutral. In contrast, the phrase "I have an
appointment at 2 o'clock tomorrow" was perceived as more neutral (62.2% for neutral,
20.0% for sad). The point being emphasised is that although care was taken to choose
emotionally undetermined phrases, the data suggests that this is a very difficult task. Our
dependency on context to discriminate emotions is stated in Knapp (1980), and has been
confirmed in this evaluation.
Albeit in varying degrees, the general trend for this subsection was that the neutral
voice was often perceived as sounding sad. This suggests that Festival's intonation,
which is modeled to be neutral, may have an underlying sadness. Interestingly, Murray
and Arnott (1995) also made this observation with the HAMLET system, which makes
use of the MITalk phoneme duration rules described in Allen et al. (1987, Chapter 9).

Emotive text with no vocal emotion


The use of utterances of this type endeavoured to determine the influence emotive text
has on listener perception of emotion. Table 17 shows the data accrued for utterances
spoken with no vocal emotion but containing emotive text (i.e. the speaker's feelings can
be approximately determined from the text alone). Phrases 6-10 of Appendix E were
used for this subsection.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        13.3%    8.9%     0.0%     68.9%    0.0%       0.0%       8.9%
Sad          2.2%     48.9%    2.2%     37.8%    0.0%       4.4%       4.4%
Angry        1.1%     0.0%     42.2%    33.3%    2.2%       20.0%      1.1%
Neutral      2.2%     4.4%     11.1%    75.6%    0.0%       6.7%       0.0%

Table 17 - Listener responses for utterances containing emotive text with no vocal emotion.

The following observations can be made from the data:
a) Row maxima occurred for three emotions corresponding to the emotion in the text
(sadness, anger, and neutral). This shows that for these emotions, a good proportion
of the listeners relied on the words in the utterance to distinguish between emotions.
However, a significant amount of confusion occurred for sadness and anger; the
sadness stimulus was perceived as neutral by 37.8% of listeners, and the anger
stimulus was confused with neutral (33.3%) and disgust (20.0%).
b) The phrase containing happy text was very much perceived as having no emotion.
It is unclear why this emotion's recognition suffered more than the others; perhaps
happiness requires strong confirmation in the speaker's voice for it to be identified as
such.
The significant confusion occurring with happiness, sadness, and anger suggests that
although emotion perception can be strongly influenced by the words in the utterance,
the lack of vocal emotion in the voice meant that the utterances sounded unconvincing,
and so emotion recognition was not very strong.

Emotionless text with vocal emotion


The confusion matrix for phrases containing emotionless text (phrases 1-5 of Appendix
E) spoken with vocal emotion is shown in Table 18; that is, utterances whose text was
emotionally indistinguishable from just the words alone, but were spoken with various
vocal emotion effects.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        24.4%    1.1%     0.0%     22.2%    40.0%      2.2%       10.0%
Sad          0.0%     72.2%    0.0%     17.8%    0.0%       2.2%       7.8%
Angry        0.0%     0.0%     91.1%    2.2%     0.0%       4.4%       2.2%

Table 18 - Listener responses for utterances containing emotionless text with vocal emotion.

The following observations can be made from the data:
a) The row maximum for the happiness stimulus was for the surprised emotion (40.0%),
while 24.4% thought it sounded happy and 22.2% thought it sounded neutral.
Though this is a low listener response for happiness, the high combined response for
the happiness and surprised emotions suggests that the stimulus had a pleasant tone.
That listeners wrote descriptive terms (for the Other option) such as "content",
"informative", "proud", and "curious" seems to confirm this.
b) A high correct recognition rate occurred for the sad stimulus (72.2%). Other listeners
gave their own descriptive terms such as "lethargic", "disappointed", and "sleepy".
c) The anger stimulus received a very high recognition rate (91.1%), with very little
confusion occurring with other emotions.
Bearing in mind that the utterances for this subsection contained exactly the same
text as the "neutral text, neutral voice" utterances (shown in Table 16), the strong effect
vocal emotion has on a listener's perception of emotion in an utterance can be clearly
seen. Whereas most listeners chose either neutral or sad for the "neutral text, neutral
voice" utterances, Table 18 shows much better overall emotion recognition for "neutral
text, emotive voice" utterances.

Emotive text with vocal emotion


The confusion matrix for phrases containing emotive text (phrases 6-10 of Appendix E)
spoken with vocal emotion is shown in Table 19; that is, utterances whose text alone gave
an indication of the speaker's emotion, and which were furthermore spoken with the
appropriate vocal emotion.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Happy        66.7%    4.4%     0.0%     13.3%    4.4%       2.2%       8.9%
Sad          0.0%     62.2%    4.4%     24.4%    0.0%       0.0%       8.9%
Angry        0.0%     0.0%     77.8%    1.1%     0.0%       15.6%      5.6%
Neutral      6.7%     2.2%     0.0%     71.1%    13.3%      4.4%       2.2%

Table 19 - Listener responses for utterances containing emotive text with vocal emotion.

The following observations can be made from the data:
a) As predicted, the confusion matrix shows a strong average recognition rate for all
simulated emotions (seen in the high values down the diagonal, where row and
column names match).
b) Confusion continued to occur between emotions, but was not as significant as for the
other utterance types.
Both of the above observations indicate that correct emotion perception is taking
place, upholding the hypotheses related to this section (see Section 4.1).
Important to note is the recurrence of the confusion pairs mentioned in Section 6.2.2:
sad with neutral, and angry with disgusted. Happiness, in Table 19 at least, was not
confused much with surprise; however, this may be because the utterance did not lend
itself to this confusion, as surprise is certainly being confused with happiness in Table 18.
Interestingly, some listeners complained in the general comments section that it was
very difficult to hear disgust or surprise in the speech utterances (the reader will recall
that disgust and surprise were not being simulated, and were included in the questionnaire
selection list as distractors). This may suggest that some listeners sometimes chose
disgust or surprise because they felt the emotion had to come up sooner or later.
Happiness for this section enjoyed a high recognition rate after suffering low
recognition rates for other utterance types. For the "neutral text, emotive voice" test
utterances, happy utterances were perceived as having a pleasant tone, but were not
explicitly identified as happiness. The high recognition rate for happiness for "emotive
text, emotive voice" utterances suggests that the emotive text helped clarify the utterance
as not just pleasant but happy.

6.2.4 Effect of Vocal Emotion on Emotionless Text

By studying the two confusion matrices in Table 16 and Table 18, it can be seen that
emotion recognition was enhanced for neutral phrases spoken with vocal emotion
compared to the same text spoken without vocal emotion. The two confusion matrices,
however, show only the number of listeners who correctly or incorrectly recognized the
simulated emotion; they do not show the number of listeners whose recognition improved
with the addition of vocal emotion effects. To determine the effect of vocal emotion,
therefore, the data should be filtered to exclude listeners who correctly recognized the
intended emotion both when the utterance was spoken without vocal emotion and when it
was spoken with vocal emotion.
In order to address this, further analysis was carried out on the data to determine the
effect the addition of vocal emotion had on listener emotion recognition. To do this, the
analysis kept track of listeners who had improved in their recognition of the intended
emotion, and also kept track of listeners whose recognition was shown to deteriorate
when the utterance was spoken with vocal emotion. Table 20 shows for each simulated
emotion, the percentage of listeners who incorrectly recognized the intended emotion
when it was spoken without vocal emotion and who then improved in their recognition
when the utterance was spoken with vocal emotion. Similarly, Table 21 shows the
percentage of listeners whose emotion recognition deteriorated with the addition of vocal
emotion effects.

Happy    22.2%
Sad      36.7%
Angry    84.4%

Table 20 - Percentage of listeners who improved in emotion recognition with the
addition of vocal emotion effects for neutral text.

Happy    2.2%
Sad      12.2%
Angry    2.2%

Table 21 - Percentage of listeners whose emotion recognition deteriorated with the
addition of vocal emotion effects for neutral text.

The above results show that an overall significant increase in emotion recognition
occurred with the introduction of vocal emotion effects in a neutral utterance; this was
true for all the simulated emotions. Anger gained the largest increase in emotion
recognition (84.4%). This suggests that if a neutral text utterance is to be perceived as
being spoken in anger, its perception is very much dependent on the vocal emotion
effects; without them, it is simply very difficult to perceive the speaker as being angry.
Deterioration for sadness was higher than for the other two emotions. The reason for
this is not clear, except that confusion between sadness and neutrality occurred with all
utterance types.
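As a sketch only (the structures and names are hypothetical, not the actual analysis code), the improvement and deterioration figures of Tables 20-23 amount to a per-listener paired comparison between the response to a phrase spoken without vocal emotion and the response to the same phrase spoken with vocal emotion:

    #include <vector>

    struct PairedResponse {
        int intended;        // the emotion the utterance was meant to convey
        int withoutEmotion;  // emotion perceived when spoken without vocal emotion
        int withEmotion;     // emotion perceived when spoken with vocal emotion
    };

    struct Shift { double improvedPct; double deterioratedPct; };

    // Percentage of listeners who gained, or lost, correct recognition of the
    // intended emotion once vocal emotion effects were added.
    Shift recognitionShift(const std::vector<PairedResponse>& listeners)
    {
        if (listeners.empty()) return {0.0, 0.0};
        int improved = 0, deteriorated = 0;
        for (const PairedResponse& p : listeners) {
            const bool correctBefore = (p.withoutEmotion == p.intended);
            const bool correctAfter  = (p.withEmotion == p.intended);
            if (!correctBefore && correctAfter) ++improved;
            if (correctBefore && !correctAfter) ++deteriorated;
        }
        const double n = static_cast<double>(listeners.size());
        return {100.0 * improved / n, 100.0 * deteriorated / n};
    }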

6.2.5 Effect of Vocal Emotion on Emotive Text


A similar analysis to the one in Section 6.2.4 was carried out to determine the effect of
vocal emotion on emotive text. Table 22 and Table 23 show the results of this analysis.

Happy    57.8%
Sad      31.1%
Angry    41.1%

Table 22 - Percentage of listeners whose emotion recognition improved with the
addition of vocal emotion effects for emotive text.

Happy    4.4%
Sad      17.8%
Angry    6.7%

Table 23 - Percentage of listeners whose emotion recognition deteriorated with the
addition of vocal emotion effects for emotive text.

From the above results, it can be seen that all emotions received a significant
increase in emotion recognition once vocal emotion was added to the utterance, with the
greatest improvement occurring for happiness (57.8%). Anger did not improve so much
with emotive text as with neutral text. This shows that emotive text also had a strong
influence on determining the correct emotion being simulated.
As with Table 21, Table 23 also shows that the deterioration of recognition with
sadness was significantly higher than other emotions; sadness-neutral confusion was a
problem throughout all the tests. Still, the substantial effect of vocal emotion even on
emotive text phrases can be clearly seen through this analysis.

6.2.6 Further Analysis


In Section 6.1.1, it was mentioned that questionnaire participants were asked to provide
personal and background information. Associating this information with participants'
emotion recognition scores enables further analysis; for instance, were there any
differences in emotion recognition between male and female participants? Or between
participants who spoke English as their first language and participants who did not?
Unfortunately, the gender distribution of the participants was too unbalanced to
enable any concrete analysis to take place: 84.4% were male and 15.6% were female. On
the other hand, the participant population was quite evenly divided between those who
spoke English as their first language and those who did not: 57.8% and 42.2%
respectively.
For the language analysis, apart from minor differences in recognition of the intended
stimulus, correct emotion perception was the same for both classes of participants. One
discrepancy was found, however, with differences too significant to overlook:
participants who did not speak English as their first language confused sadness with
neutrality significantly more than those who did. This was noted for "neutral text,
emotive voice" and "emotive text, emotive voice" utterance types. Table 24 shows
listener responses to the sadness stimulus for participants who speak English as their first
language, while Table 25 shows the same listener responses for those who do not. Both
tables are for "neutral text, emotive voice" utterance types.

                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Sad          0.0%     82.7%    0.0%     5.8%     0.0%       1.9%       9.6%

Table 24 - Listener responses for participants who speak English as their first language.
Utterance type is "neutral text, emotive voice".
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Sad          0.0%     55.3%    0.0%     34.2%    0.0%       2.6%       7.9%

Table 25 - Listener responses for participants who do NOT speak English as their first
language. Utterance type is "neutral text, emotive voice".

The pattern is also reflected in the following data for "emotive text, emotive voice"
utterance types (Table 26 and Table 27).
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Sad          0.0%     76.9%    3.8%     11.5%    0.0%       0.0%       0.0%

Table 26 - Listener responses for participants who speak English as their first language.
Utterance type is "emotive text, emotive voice".
                                  PERCEIVED EMOTION
STIMULUS     Happy    Sad      Angry    Neutral  Surprised  Disgusted  Other
Sad          0.0%     42.1%    5.3%     42.1%    0.0%       0.0%       10.5%

Table 27 - Listener responses for participants who do NOT speak English as their first
language. Utterance type is "emotive text, emotive voice".

The significant difference in emotion perception for sadness suggests that cultural
factors are coming into play; this is an issue that has received a good deal of interest in
the non-verbal communication literature (Knapp, 1980; Malandra, Barker and Barker,
1989). The high confusion between sadness and neutrality for participants without
English as their first language may account for the high deterioration values for sadness
seen in Table 21 and Table 23.
It is perhaps comforting that when confusion did occur for sadness, neutral was the
emotion chosen, a pattern also reported by Murray and Arnott (1995). It would be
alarming indeed if the data showed that sadness could easily be confused with, say,
disgust, as the consequences in real-life communication would be disastrous.

6.3 Talking Head and Vocal Expression


This section will endeavour to determine the effect vocal emotion has on peoples
perception of the Talking Head that was described previously. Specifically, it will
address hypothesis 2b that states: through the addition of emotive speech, information
will be communicated more effectively by the Talking Head (see Section 4.1).
In the demonstration of the two example movies of the Talking Head (described in
Section 6.1.1), the following comparisons were asked to be made by the participants. As
a memory refresher, the first Talking Head example was without vocal expression, and
the second Talking Head example included vocal expression. The reader is reminded that
Huynhs (2000) facial markup tags were used to perform facial expressions during the
reading. The markup tags and their placement were identical for both examples.


• Understandability. If one Talking Head version was determined to be generally easier
to understand than the other, then it can be assumed that listener comprehension would
benefit.
• Expressiveness. The Talking Head version determined to be better able to express
itself would communicate more effectively its feelings, the mood of a story, the
seriousness or light-heartedness of information, and so on.
• Naturalness. Lester and Stone (1997) show that believability of pedagogical agents
(an application of Talking Heads) plays an important role in user motivation. Along
with Bates (1994), it is shown that life-like quality and the display of complex
behaviours (including emotion) is a crucial factor for believability.
• Interest. Users who find a speaker interesting will give their attention to what is being
said. If this occurs, the speaker has the opportunity to communicate his or her ideas
better.
Understandability
Table 28 shows the response of participants who compared how well they could
understand the Talking Head in each example, and chose which one they felt was easier
to understand.

Understandability
Talking Head 1    2.2%
Talking Head 2    82.2%
Neither           15.6%

Table 28 - Participant responses when asked to choose the Talking Head that was more
understandable.

The data shows that the overwhelming majority of participants thought the second
Talking Head demonstration (the one that included vocal expression) was easier to
understand. From participants' justifications for their choice, it was apparent that not
only did most participants think the second demonstration was easier to understand, but
many also thought the first demonstration was very difficult to understand. Participants
wrote that the second demonstration was easier to understand because it spoke slower
and because it had better expression in its voice. In fact, the speech rate of the second
demonstration was not slowed down; rather, more pauses between sentences and
character dialogues were present. This highlights the importance that silence has in our
verbal communication.
Although it could be argued that the second demonstration was perceived as easier to
understand simply because the participants were hearing the story for the second time,
the fact that the reasons given for their choice were so widely echoed shows that the voice
was a determining factor.
Expressiveness
Table 29 shows the response of participants when asked which Talking Head
demonstration seemed best able to express itself. Again, the second demonstration was
overwhelmingly favoured over the first. Reasons for favouring the second demonstration
included the perception that the storytelling was better structured and that there was more
variation in its tone. Many participants commented on how the variability and changes in
pitch helped the turn-taking of the characters and helped to "distinguish ideas behind [the]
statements" in the story. Still others commented on the dynamic use of volume, and that
the tone of the second demonstration was more appropriate for the story.

Expressiveness
Talking Head 1    4.4%
Talking Head 2    86.7%
Neither           8.9%

Table 29 - Participant responses when asked which Talking Head seemed best able to express
itself.

Another factor stated by most participants who favoured the second demonstration
was that the vocal and facial expressions were better synchronised, and that the vocal
expressions stood out more. The synchronisation of vocal expression and facial gestures
is deemed very important for communication by Cassell et al. (1994a) and Cassell et al.
(1994b); it is encouraging that most participants were able to notice this and view it as
desirable.
Interestingly, participants who did not favour the second demonstration of the
Talking Head commented that it was too expressive (in a caricature-like way).

Naturalness

Table 30 shows the response of participants when asked which Talking Head seemed
more natural. The main reason given by the majority of participants who voted for the
second demonstration was that the speech now matched the facial expressions better, and
so seemed more realistic. For this question, a considerable number of participants
commented that the mouth and lip movements were unconvincing and therefore needed
more work for the Talking Head to seem realistic.

Naturalness
Talking Head 1    15.6%
Talking Head 2    71.1%
Neither           13.3%

Table 30 - Participant responses when asked which Talking Head seemed more natural.

Interest
When asked which Talking Head demonstration seemed more interesting, participants
gave a variety of reasons for their choice. Most participants opted for the second
demonstration, and Table 31 shows the data for this question.

Interesting
Talking Head 1    2.2%
Talking Head 2    84.4%
Neither           13.3%

Table 31 - Participant responses when asked which Talking Head seemed more interesting.

Many participants wrote that because the first demonstration was difficult to
understand, they lost interest in what it was saying. Conversely, the second
demonstration was easier to understand, and as a result they did not have to concentrate
as much. Others wrote that the Talking Head in the second demonstration seemed more
alert and happier. Several participants commented that the second demonstration was
more interesting because it seemed to have more human qualities, and that the facial
expressions were noticed more because of the improved speech.
From the results shown in this section, it is safe to say that the overwhelming
majority of participants favoured the second demonstration of the Talking Head over the
first. What is gratifying is that the participants were not told to focus on the Talking
Head's voice but rather to consider it as a whole, and yet most people attributed the
improvement between the two examples to the vocal expression of the Talking Head.
A very important note, however, is that these results come from data obtained by
comparing two versions of the Talking Head: with and without simulated vocal emotion.
Further tests would need to be done to quantitatively determine how effective a
communicator the Talking Head is with its improved speech. What is important is that
the results show that, with vocal emotion, the Talking Head is a better communicator, on
the basis that it has enhanced understandability, expressiveness, naturalness, and user
interest.

6.4 Summary
This chapter briefly described how the evaluation research methodology was employed to
test the hypotheses stated in Section 4.1. The TTS module was tested for how well
listeners could recognize the simulated emotions, and the data was seen to largely support
the stated hypotheses. The variables that can affect speech emotion recognition were
investigated and discussed, and the test data was also seen to support both previous work
done in the field of synthetic speech emotion (namely Murray and Arnott, 1995), and the
general literature from the fields of paralinguistics and non-verbal behaviour.
The last hypothesis of Section 4.1 was also tested, to investigate whether a Talking Head
is able to communicate information more effectively when given the capability of vocal
expression. Through a series of demonstrations, this part of the testing showed that
viewers overwhelmingly rated the Talking Head with vocal expression as much easier to
understand, more expressive, more natural, and more interesting to watch and listen to.
It is proposed that these factors contribute to making the Talking Head a more effective
communicator.

Chapter 7
Future Work
Through the development of this project, a number of possible avenues for future work
have been identified. The following areas arose either because the strict time constraints
of the project meant that certain features could not be implemented, or because many
areas were found to offer great depth in which future work could continue. This is
especially true for XML, whose usability and potential is enormous. It is envisaged that
many of the following sections each contain enough depth for entire projects.

7.1 Post Waveform Processing


Another submodule could be added to the TTS module to perform post waveform
processing after the Digital Signal Processor has generated the waveform. This could
bring a range of benefits, from adding novelty filter effects to the voice to simulating
various environments (such as a large hall, the outdoors, or a bathroom effect). But
perhaps the greatest benefit that post waveform processing could bring would be the
implementation of dynamic volume within an utterance. The volume tag would be
extended to allow gradually increasing or decreasing volume, which would help vary
intensity levels within an utterance.
The problem that the implementation of any post waveform effect would need to
solve is determining which part(s) of the waveform need to be altered. This would be
determined within the SML Tag Processor, which could output a list of operation nodes
describing the operation(s) that the Post Waveform Processor needs to perform, and the
waveform parts that are to be affected by each operation. Figure 27 shows the structure
of such a node, while Figure 28 shows an initial design of part of the TTS module with
the addition of the Post Waveform Processor submodule.

[ Operation | Parameters | Start/End ]

Figure 27 - A node carrying waveform processing instructions for an operation.
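As a design sketch only (the field names and units are illustrative), such an operation node might be represented in C++ as:

    #include <string>
    #include <vector>

    // One entry in the waveform processing list produced by the SML Tag Processor.
    struct WaveformOperation {
        std::string operation;            // e.g. a volume fade or an environment effect
        std::vector<double> parameters;   // operation-specific parameters
        double startSec;                  // start of the affected region of the waveform
        double endSec;                    // end of the affected region
    };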

[Figure: the SML Document (tags plus text/phonemes) feeds the SML Tags Processor,
which passes modified phonemes to the DSP (MBROLA) and a waveform processing list
to the Post Waveform Processor; the waveform produced by the DSP is then modified by
the Post Waveform Processor according to the processing directives.]

Figure 28 - Insertion of new submodule for post waveform processing.

7.2 Speaking Styles


An interesting area for future work would be the investigation of different speaking
styles. Work in this area is quite important, as the literature is very rich in its discussion
of how and why we adopt different speaking styles. For instance, Knapp (1980) shows
how research in the field of paralinguistics suggests that the way we speak changes
depending on who we are talking to (e.g. a group of people or one on one, someone of the
opposite sex, someone from a different age group, etc.). Research has also shown that
male and female speakers have different speech intonation patterns, and catering for this
in the output of the SML speaker tag would be beneficial (Knapp, 1980).
Malandra, Barker and Barker (1989) describe how speaking styles, namely
conversational style and dynamic style, affect the listener's perception of the speaker.
For instance, one of the characteristics of dynamic style is a faster rate of speech, and
studies have shown that higher ratings of intelligence, knowledge, and objectivity are
ascribed to such a speaker (Miller, 1976, as referenced in Malandra, Barker and Barker,
1989). Conversely, a speaker adopting a conversational style of speaking (characterized
by a more consistent rate and pitch) is rated as more trustworthy, better educated, and
more professional by listeners (Pearce and Conklin, 1971, as referenced in Malandra,
Barker and Barker, 1989). If these results can be reproduced, this will clearly have major
repercussions for a Talking Head.
An interesting way that speaking styles could be specified is through the use of
Cascading Style Sheets (CSS), making use of Extensible Stylesheet Language (XSL)
technology. The SML markup itself would not need to change to adopt a different
speaking style; only the XSL document that defines the style and voice of the speaker
would change. This reflects one of the benefits of XML stated earlier: the XML file
describes the data and does not force the presentation of the data.
[Figure: an SML file containing a link to a stylesheet; the xml-stylesheet declaration in the
SML file points to a separate XSL stylesheet document.]

Figure 29 - SML Markup containing a link to a stylesheet.
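As a sketch (the stylesheet file name and its contents are hypothetical), an SML document would point to a speaking-style stylesheet through the standard xml-stylesheet processing instruction, leaving the markup itself unchanged:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="conversational-style.xsl"?>
    <sml>
      And now the latest news.
    </sml>

Switching the Talking Head to a different speaking style would then only require pointing the href at a different stylesheet.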

7.3 Speech Emotion Development


The TTS module has implemented three emotions thus far (excluding neutral): happiness,
anger, and sadness. Fear and grief are the remaining two of the five so-called basic
emotions that could be implemented, and even these would not form a comprehensive
list of the vocal emotions humans display. Therefore, there is much work that could be
carried out in this area alone.
However, future work on other (more complex) emotions may not produce the same
results found in this thesis. This would be due to a number of reasons:
a) More complex emotions are less well understood.
b) As a consequence of (a), speech correlates for complex emotions are much harder
to identify, and as a result are featured less prominently in the literature.
c) It could well be that work in this area will find that speech correlates are not
as important for discriminating similar complex emotions (e.g. pride and
satisfaction), and that it is rather what we say that provides the best cues for such
speaker emotions. This is upheld by Knapp (1980), who states that as we
develop, we rely on context to discriminate emotions with similar
characteristics.

7.4 XML Issues


Because FAITH has only just begun to incorporate XML input into its various modules, a
solid framework for handling markup from the various XML-based languages has not yet
been properly defined. Section 5.1.2 raised an important issue concerning the handling of
XML data: the TTS input needs to be filtered since it may contain non-SML tags.
Though this process works, the design is somewhat loose, since every module within the
Talking Head receives XML data that is often not relevant to it. The handling of the
XML input is also inefficient, since individual modules are required to re-parse the
same data.
A more structured approach would be to have a separate module whose role is to
manage the XML data. Its task would then be to pass the relevant XML data to the
other modules, possibly in the form of a pre-built DOM. With this in place, each module
in the system would have a clearer definition of its inputs, knowing that all information it
receives is applicable to it. Figure 30 shows where the XML Handler fits in with the
other modules of the Talking Head (a sketch of the dispatching idea follows the figure).
Note that the design also enables the system to be less TTS driven than it currently is
(see Section 7.5).
Figure 30 - Inclusion of an XML Handler module to centrally manage XML input. (The
Brain sends XML data (SML, FAML, etc.) to the XML Handler, which passes SML to
the TTS module, FAML to the FAML module, and other markup to other modules, each
producing its own output.)
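As a sketch of the dispatching idea only (the handler functions, the wrapping root
element, and the choice of a libxml2-style API are assumptions, not part of the thesis
implementation), the XML Handler could parse the combined input once and route each
top-level markup subtree to the module that understands it:

#include <libxml/parser.h>
#include <libxml/tree.h>
#include <cstdio>

static void tts_handle(xmlNodePtr node)  { (void)node; std::printf("SML subtree -> TTS module\n"); }
static void faml_handle(xmlNodePtr node) { (void)node; std::printf("FAML subtree -> FAML module\n"); }

// Parse the combined XML input once and hand each subtree to its module.
void dispatchMarkup(const char* filename)
{
    xmlDocPtr doc = xmlParseFile(filename);
    if (doc == NULL) return;

    xmlNodePtr root = xmlDocGetRootElement(doc);
    for (xmlNodePtr cur = root ? root->children : NULL; cur != NULL; cur = cur->next) {
        if (cur->type != XML_ELEMENT_NODE) continue;
        if (xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("sml")) == 0)
            tts_handle(cur);      // the TTS module only ever receives SML
        else if (xmlStrcmp(cur->name, reinterpret_cast<const xmlChar*>("faml")) == 0)
            faml_handle(cur);     // the FAML module only ever receives FAML
        // other markup languages would be routed to their modules here
    }
    xmlFreeDoc(doc);
}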

A complete re-think of how XML data is handled is needed if XML is to play a
major role in the way the Talking Head processes information (which is what we believe
will eventuate). This framework should be put in place as soon as possible to aid the
implementation of future modules, so that handling of the input is done in a controllable
manner and the full potential of XML is realized.

7.5 Talking Head


The evaluation of the TTS module showed that there is a need to improve the viseme
production of the Talking Head. This is logical since the speech and facial gestures are
now able to express various emotions, but the mouth movements during speech currently
stay the same. The effect of emotion on mouth movements is therefore an area that
should be explored. This will include the need to investigate how certain facial
expressions can be preserved during speech, especially those including mouth
movements. For instance, it is currently quite difficult for the Talking Head's mouth to
be in the shape of a smile and talk at the same time without causing undesirable effects.
Another limitation that has been identified concerning the Talking Head is that its
head movements are purely TTS driven. Currently, when speech is rendered, the FAP
stream containing just the visemes is passed to the Personality and FAML modules so
that they can add head movement and facial gesture directives down the FAP stream.
The consequence of this process is that if the Talking Head does not say anything, no
head movements or facial expressions can be generated. This results in a Talking Head
that is completely immobile unless it is talking, which gives it a very mechanical feel.
What is required is a complete redesign of how speech and the FAP stream output is
generated. That is, the generation of the FAP stream for head movements and facial
gestures should be partially divorced from speech production to give the Talking Head
the ability to move, blink, and perform facial gestures and expressions even when it is not
talking.
This has not been a major issue during the evaluation stages of the Talking Head in
Shepherdson (2000), Huynh (2000), and this thesis, because testing has mainly been done
through the use of demonstration examples; no thorough testing has yet been done in an
interactive environment. However, the Talking Head has reached a level of maturity
where it is now being demonstrated in public events, and hopefully this will result in
constructive feedback to help prioritize (and shed light on) problems that need to be
solved.

7.6 Increasing Communication Bandwidth


Work is underway to make the FAITH Project's Talking Head accessible from within a
web browser, and therefore to have a presence on the World Wide Web (Levy, 2000). With
the current architecture, applications making use of the Talking Head will suffer
bandwidth problems due to the large volume of data transferred from the server to the
client. Even with high-speed connections, users will still experience delays, since many
waveforms need to be sent to the client as the Talking Head speaks.
A solution to this problem, of course, would be to implement streaming audio, but
this would require considerable effort. This thesis proposes an alternative, simpler solution
that would greatly reduce the amount of traffic between the server and the client. Figure
31 shows the proposed architecture of a modified TTS module that bridges the server and
the client.

Figure 31 - Proposed design of TTS Module architecture to minimize bandwidth
problems between server and client. (On the server side, the TTS module runs the NLP
(Festival) and SML Processing stages on the text to be rendered; the resulting phoneme
information (text) is sent to the client side, where the TTS module's DSP (MBROLA)
produces the waveform and the Viseme Generator produces the visemes.)

The idea of sending text to the client instead of a waveform is not new, but other
solutions require the entire TTS module to reside on the client side: the text to be
rendered is sent to the client, and the speech is fully synthesized there. However, a
complex NLP module such as Festival's is quite large, and it would be very undesirable
to force users to install such an application on their systems.
The architecture in Figure 31 shows a system where all phoneme transcription and
SML tag processing is still carried out on the server. Instead of producing a waveform
and sending this to the client, however, the phoneme information is sent across in
MBROLA's input format (described in Section 5.9) to the TTS Module on the client side,
where the waveform is produced by the DSP. The MBROLA synthesizer is not a very
large program, and has binaries available for most platforms. Note that although the
diagram shows the Festival and MBROLA systems, any synthesizer that can deal with
phoneme information at the input and output level could be used. A sketch of how the
phoneme information could be serialized in this format is given below.
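For illustration, the phoneme information could be serialized as MBROLA's plain-text
input, one phoneme per line with its duration in milliseconds followed by optional
(position %, pitch Hz) targets; the Phoneme structure below is an assumption used only to
show how small the payload sent to the client would be.

#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

struct Phoneme {
    std::string symbol;                       // e.g. "_" for silence, "t", "@"
    int durationMs;                           // duration in milliseconds
    std::vector<std::pair<int, int> > pitch;  // (% position within phoneme, target pitch in Hz)
};

// Serialize the phonemes into MBROLA's text input format for transfer to the client.
std::string toMbrolaInput(const std::vector<Phoneme>& phonemes)
{
    std::ostringstream out;
    for (std::size_t i = 0; i < phonemes.size(); ++i) {
        const Phoneme& p = phonemes[i];
        out << p.symbol << ' ' << p.durationMs;
        for (std::size_t j = 0; j < p.pitch.size(); ++j)
            out << ' ' << p.pitch[j].first << ' ' << p.pitch[j].second;
        out << '\n';
    }
    return out.str();   // a small text payload instead of a full waveform
}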

Chapter 8
Conclusion
This thesis has focused on the primary goal of simulating emotional speech for a Talking
Head. The objectives reflected this by aiming to develop a speech synthesis system that
is able to add the effects of emotion on speech, and to implement this system as the TTS
module of a Talking Head. The literature was explored and investigation of research in
the fields of non-verbal behaviour, paralinguistics, and speech synthesis allowed (with
some confidence) the following hypotheses of Section 4.1 to be formed:
1. The effect of emotion on speech can be successfully synthesized using
control parameters.
2. Through the addition of emotive speech:
a) Listeners will be able to correctly recognize the intended emotion
being synthesized.
b) Information will be communicated more effectively by the Talking Head.

The review of the literature also helped both to identify and to provide possible
solutions to the subproblems stated in Section 2.2, namely developing a speech
markup language and synchronizing speech and facial gestures.
Throughout the entire design and implementation phase of the TTS module, the
literature was a solid basis upon which design decisions were made. For instance, SML's
speech emotion tags implemented the speech correlates of emotion found in the literature
and in previous work on synthetic speech emotion, chiefly by Murray and Arnott (1995)
and Cahn (1990).
The evaluation of the TTS module was an integral part of this thesis. Therefore, an
evaluative research methodology was employed to determine if the system supported or
disproved the stated hypotheses. In order for testing to be carried out in a clear and

concise way that could be directly linked with the hypotheses, the questionnaire-based
evaluation process was organized into two main sections:
a) Speech emotion recognition, and
b) The extent of the effect vocal expression has on a Talking Head's ability to
communicate.
The results from the speech emotion recognition sections support hypotheses 1 and
2a; that is, strong recognition of the simulated emotions took place, and so simulation of
the emotive speech was successful. The evaluation showed that a good emotion
recognition rate occurred, comparable to the rates described in the literature for both
synthetic speech emotion and human speech.
The investigation of different combinations of neutral/emotive text and
neutral/emotive vocal expression was valuable in that it helped identify the importance
both of the words we decide to use and of how we choose to say them. Importantly, the
results in Chapter 6 are seen to confirm the literature, both in terms of the effects of the
different variables involved (words and vocal expression) and in terms of the emotions
that are often confused with one another.
As found in Murray and Arnott (1995) and Cahn (1990), the results were very
utterance dependent, and this dependence may have contributed both to correct
recognition and to confusion between emotions. The reason for confusion between
certain emotions is reported in the literature to depend on many factors, including the
text in the phrase and its context, the speaker's voice, and who is hearing the utterance,
which brings gender and cultural issues into play (Malandra, Barker and Barker, 1989;
Knapp, 1980).
Results of the Talking Head experiments also proved positive, with the
overwhelming majority of viewers stating that the Talking Head was much easier to
understand, more expressive, more natural, and more interesting when vocal expression
was added to its speech. It would be very interesting to investigate listener emotion
recognition rates for the same utterances spoken by the Talking Head itself; in this way,
the importance of combining the visual channel (facial gestures) with the audio channel
(speech) could be studied.
There is much work yet to be done in the field of synthetic speech emotion, with
Chapter 7 identifying a number of key areas that should be investigated. The main
limitations of the project were the inability to implement dynamic volume control and
the speech correlates affecting voice quality (such as nasality, hoarseness, breathiness,
etc.). Notwithstanding these limitations, however, this thesis has been able to demonstrate
the effect synthetic speech emotion has on user perception when it is applied to a Talking
Head, and it is envisaged that it would bring similar benefits to other applications using
speech synthesis. It is through this demonstration that the significance and value of this
project are seen to have been confirmed.

Bibliography
Allen, J., Hunnicutt, M. S. and Klatt, D. (1987). From Text to Speech: The MITalk
System, Cambridge University Press, Cambridge.
Bates, J. (1994). The Role of Emotion in Believable Agents, Communications of the
ACM, vol. 37, pp. 122-125.
Beard, S. (1999). FAQBot, Honours Thesis, Curtin University of Technology, Bentley,
Western Australia.
Beard, S., Crossman, B., Cechner, P. and Marriott A. (1999). FAQbot, Proceedings of
Pan Sydney Area Workshop on Visual Information Processing, Nov 1999, University
of Sydney, Australia.
Beard, S., Marriott, A. and Pockaj, R. (2000). A Humane Interface, OZCHI 2000
Conference on Human-Computer Interaction: Interfacing reality in the new
millennium. (4-8 Dec, 2000), Sydney, Australia.
Black, A. W., Taylor, P. and Caley, R. (1999). The Festival Speech Synthesis System
System Documentation, Edition 1.4, for Festival Version 1.4.0, 17th June 1999,
[Online], Available: www.cstr.ed.ac.uk/projects/festival/manual/festival_toc.html.
Bos, B. (2000). XML in 10 points. [Online]. Available:
http://www.w3.org/XML/1999/XML-in-10-points.

Bosak, J. (1997). XML, Java, and the Future of the Web. [Online]. Available:
http://www.webreview.com/pub/97/12/19/xml/index.html.
Cahn, J. E. (1988). From Sad to Glad: Emotional Computer Voices, Proceedings of
Speech Tech '88, Voice Input/Output Applications Conference and Exhibition, April
1988, New York City, pp. 35-37.
Cahn, J. E. (1990). The Generation of Affect in Synthesized Speech, Journal of the
American Voice I/O Society, vol. 8, pp. 1-19.
Cahn, J. E. (1990b). Generating expression in synthesized speech, Technical Report,
Massachusetts Institute of Technology Media Laboratory, MA, USA.
Carroll, L. (1946). Alice's Adventures in Wonderland, Random House, New York.

Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S.,
Becket, T., and Achorn, B. (1994a). Animated conversation: Rule-based generation
of facial expression, gesture and spoken intonation for multiple conversational
agents, In Proceedings of ACM SIGGRAPH 94.
Cassell, J., Steedman, M., Badler, N., Pelachaud, C., Stone, M., Douville, B., Prevost, S.,
and Achorn, B. (1994b). Modeling the interaction between speech and gestures, In
Proceedings of the 16th annual Conference of the Cognitive Science Society, pp. 119-124.
Cechner, P. (1999), NO-FAITH: Transport Service for Facial Animated Intelligent
Talking Head, Honours Thesis, Curtin University of Technology, Bentley, Western
Australia.
Cover, R. (2000). The XML Cover Pages. [Online]. Available:
http://www.oasis-open.org/cover/xml.html.
Davitz, J. R. (1964). The Communication of Emotional Meaning, McGraw-Hill, New
York.
Dellaert, F., Polzin, T. and Waibel, A. (1996). Recognizing Emotion in Speech,
Proceedings of ICSLP 96, the 4th International Conference on Spoken Language
Processing, 3-6 October 1996, Philadelphia, PA, USA.
Document Object Model (DOM) Level 1 Specification (Second Edition) (2000).
[Online]. Available: http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929.
Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic
Publishers.
Extensible Markup Language (XML) 1.0 (1998). [Online]. Available:
http://www.w3.org/TR/1998/REC-xml-19980210.

Galanis, D., Darsinos, V. and Kokkinakis, G. (1996). Investigating Emotional Speech
Parameters for Speech Synthesis, International Conference on Electronics,
Circuits, and Systems (ICECS 96), pp. 1227-1230.
Graham, I. S. and Quinn, L. (1999). XML Specification Guide. Wiley Computing
Publishing, New York.
Hallahan, W. I. (1996). DECtalk Software: Text-to-Speech Technology and
Implementation, Digital Technical Journal.
Heitzeman, J., Black, A. W., Mellish, C., Oberlander, J., Poesio, M. and Taylor, P.
(1999). An Annotation Scheme for Concept-to-Speech Synthesis, Proceedings of
the European Workshop on Natural Language Generation, Toulouse, France, pp. 59-66.

Hess, U., Scherer, K. L. and Kappas, A. (1988). Multichannel Communication of
Emotion, in Facets of Emotion: Recent Research, edited by Klaus R. Scherer,
Lawrence Erlbaum Associates, Publishers, Hillsdale, New Jersey.
Huynh, Q. (2000). Evaluation into the validity and implementation of a virtual news
presenter for broadcast multimedia, Honours Thesis (yet to be published), Curtin
University of Technology, Bentley, Western Australia.
Knapp, M. L. (1980). Essentials of Nonverbal Communication, Holt, Rinehart and
Winston, pp. 203-229.
Lester, J. C. and Stone, B. A. (1997). Increasing Believability in Animated Pedagogical
Agents, in First International Conference on Autonomous Agents, Marina del Rey,
CA, USA, pp. 16-21.
Levy, Y. (2000). WebFAITH (Web-based Facial Animated Interactive Talking Head),
Honours thesis, Curtin University of Technology, Bentley, Western Australia.
libxml (2000). The XML library for Gnome. [Online]. Available: http://www.xmlsoft.org.

Malandra, L., Barker, L., and Barker, D. (1989). Nonverbal Communication, 2nd edition,
Random House, pp. 32-50.
Mauch, J. E. and Birch, J. W. (1993). Guide to the Successful Thesis and Dissertation, 3rd
edition, Marcel Dekker, New York.
MBROLA Project Homepage (2000). [Online]. Available:
http://tcts.fpms.ac.be/synthesis/mbrola.html.

Microsoft (2000). Benefiting From XML. [Online]. Available:
http://msdn.microsoft.com/xml/general/benefits.asp.

MPEG (1999). Overview of the MPEG-4 Standard. ISO/IEC JTC1/SC29/WG11 N2725,
Seoul, South Korea.
Murray, I. R., Arnott, J. L. and Newell, A. F. (1988). HAMLET - Simulating
emotion in synthetic speech, Proceedings of Speech '88, The 7th FASE Symposium
(Institute of Acoustics), Edinburgh, August 1988, pp. 1217-1223.
Murray, I. R. and Arnott, J. L. (1993). "Toward the Simulation of Emotion in Synthetic
Speech: A Review of the Literature on Human Vocal Emotion", Journal of the
Acoustical Society of America, February 1993, vol. 2, pp. 1097-1108.
Murray, I. R. and Arnott, J. L. (1995). Implementation and testing of a system for
producing emotion-by-rule in synthetic speech, Speech Communication, vol. 16,
pp. 369-390.

Murray, I. R. and Arnott, J. L. (1996). Synthesizing emotions in speech: is it time to get
excited?, in Proceedings of ICSLP 96, the 4th International Conference on Spoken
Language Processing, Philadelphia, PA, USA, 3-6 October 1996, pp. 1816-1819.
Murray, I. R., Arnott, J. L. and Rohwer, E. A. (1996). Emotional stress in synthetic
speech: Progress and future directions, Speech Communication, vol. 20, pp. 85-91.
Ostermann, J., Beutnagel, M., Fischer, A., Wang, Y. (1998). Integration of Talking
Heads and Text-to-Speech Synthesizers for Visual TTS, in Proceedings of
International Conference on Speech and Language Processing (ICSLP 98), Sydney,
Australia, December 1998.
Python/XML Howto (2000). [Online]. Available: http://www.python.org/doc/howto/xml.
SAX 2.0 (2000). Sax: The Simple API for XML, version 2. [Online]. Available:
http://www.megginson.com/SAX.
Scherer, K. L. (1981). Speech and Emotional States, in Speech Evaluation in
Psychiatry, edited by J. K. Darby (Grune and Stratton, New York).
Scherer, K. L. (1996). Adding the Affective Dimension: a New Look in Speech
Analysis and Synthesis, Proceedings of the International Conference on Speech
and Language Processing (ICSLP 96), Philadelphia, USA, October 1996.
Shepherdson, R. H. (2000). The Personality of a Talking Head, Honours Thesis, Curtin
University of Technology, Bentley, Western Australia.
SoftwareAG (2000a). XML Application Examples. [Online]. Available:
http://www.softwareag.com/xml/applications/xml_app_at_work.htm.

SoftwareAG (2000b). XML Benefits. [Online]. Available:
http://www.softwareag.com/xml/about/xml_ben.htm.

Sproat, R., Taylor, P., Tanenblatt, M. and Isard, A. (1997). A Markup Language for
Text-to-Speech Synthesis, in Proceedings of the Fifth European Conference on
Speech Communication and Technology (Eurospeech 97), vol. 4, pp. 1747-1750.
Sproat, R., Hunt, A., Ostendorf, M., Taylor, P., Black, A., and Lenzo, K. (1998).
SABLE: A Standard for TTS Markup, in Proceedings of International Conference
on Speech and Language Processing (ICSLP98), pp. 1719-1724.
Stallo, J. (2000). Canto: A Synthetic Singing Voice. Technical Report, Curtin
University of Technology, Western Australia. [Online]. Available:
http://www.computing.edu.au/~stalloj/projects/canto/.
Taylor, P. and Isard, A. (1997). SSML: A speech synthesis markup language, Speech
Communication, vol. 21, pp. 123-133.

The XML FAQ (2000). Frequently Asked Questions about the Extensible Markup
Language. [Online]. Available: http://www.uc.ie/xml/.

Appendix A SML Tag Specification


The following list is a description of each of SML's elements in alphabetical order. As
with all XML elements, SML elements are case sensitive; they must therefore appear in
lower case, otherwise they will be ignored. The SML DTD is given in Appendix B. For
associated implementation issues, see Sections 5.7 and 5.8.

SML Tags
angry
Description: Simulates the effect of anger on the voice (i.e. generates a
voice that sounds angry).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<angry>I would not give you the time of day</angry>

embed
Description: Gives the ability to embed foreign file types within an SML
document such as sound files, MML files etc., and for them to be
processed appropriately.
Attributes:
type - Specifies the type of file that is being embedded. (Required)
        Values: audio (the embedded file is an audio file), mml (an mml file is embedded).
src - If type = audio, then src gives the path to the audio file.
        Values: A character string.
music_file - If type = mml, then music_file specifies the path of the mml music file.
        Values: A character string.
lyr_file - If type = mml, then lyr_file specifies the path of the mml lyrics file.
        Values: A character string.
Properties: empty.
Example:
<embed type="mml" music_file="songs/aaf.mml" lyr_file="songs/aaf.lyr"/>

emph
Description: Emphasizes a syllable within a word.
Attributes:
target - Specifies which phoneme in the contained text will be the target phoneme. If
        target is not specified, the default target will be the first phoneme found within
        the contained text.
        Values: A character string representing a phoneme symbol. Uses the MPRA phoneme set.
level - The strength of the emphasis. (Default level is weak.)
        Values: weakest, weak, moderate, strong.
affect - Specifies if the element is to affect the contained text's phoneme pitch values,
        or duration values, or both. (Default is pitch only.)
        Values: p (affect pitch only), d (affect duration only), b (affect both pitch and duration).
Properties: Cannot contain other elements.
Example:
I have told you <emph affect="b" level="moderate">so</emph> many times.

happy
Description: Simulates the effect of happiness on the voice (i.e.
generates a voice that sounds happy).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:

<happy>I have some wonderful news for you.</happy>

neutral
Description: Gives a neutral intonation to the spoken utterance.
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<neutral>I can sometimes sound bored like this.</neutral>

p
Description: Element used to divide text into paragraphs. Can only
occur directly within an sml element. The p element wraps emotion
tags.
Attributes: none.
Properties: Can contain all other elements, except itself and sml.
Example:
<p>
<sad>Today it's been raining all day,</sad>
<happy>But they're calling for sunny skies tomorrow.</happy>
</p>

pause
Description: Inserts a pause in the utterance.
Attributes:
length - Specifies the length of the pause using a descriptive value.
        Values: short, medium, long.
msec - Specifies the length of the pause in milliseconds.
        Values: A positive number.
smooth - Specifies if the last phonemes before this pause need to be lengthened slightly.
        Values: yes, no (default = yes).
Properties: empty.
Example:
I'll take a deep breath <pause length="long"/> and try it again.

pitch
Description: Element that changes pitch properties of contained text.
Attributes:
middle - Increases/decreases the pitch average of the contained text by N%.
        Values: (+/-)N%, highest, high, medium, low, lowest.
range - Increases/decreases the pitch range of the contained text by N%.
        Values: (+/-)N%.
Properties: Can contain other non-emotion elements.
Example: Not I, <pitch middle="-20%">said the dog</pitch>

pron
Description: Enables manipulation of how something is pronounced.
Attributes:
sub - The string the contained text should be substituted with. (Required)
        Values: A character string.
Properties: Cannot contain other elements.
Example:
You say tomato, I say <pron sub="toe may toe">tomato</pron>

rate
Description: Sets the speech rate of the contained text.
Attributes:
speed - Sets the speed; can be a percentage value higher or lower than the current rate,
        or a descriptive term. (Required)
        Values: (+/-)N%, fastest, fast, medium, slow, slowest.
Properties: Can contain other non-emotion elements.
Example:
I live at <rate speed="-10%">10 Main Street.</rate>

sad
Description: Simulates the effect of sadness on the voice (i.e.
generates a voice that sounds sad).
Attributes: none.
Properties: Can contain other non-emotion elements.
Example:
<sad>Honesty is hardly ever heard.</sad>

sml
Description: Root element that encapsulates all other SML tags.
Attributes: none.
Properties: root node, can only occur once.
Example:
<sml>
<p>
<happy>The sml tag encapsulates all other tags</happy>
</p>
</sml>

speaker
Description: Specifies the speaker to use for the contained text.
Attributes:
gender - Sets the gender of the speaker. (Default = male.)
        Values: male, female.
name - Specifies the diphone database that should be used (e.g. en1, us1 etc).
        Values: A character string.
Properties: Can contain other non-emotion tags.
Example:
<speaker gender="female">I'm a young woman.</speaker>

volume
Description: Sets the speaking volume. Note, this element sets only
the volume, and does not change voice quality (e.g. quiet is not a
whisper).
Attributes:
level - Sets the volume level of the contained text. Can be a percentage value, or a
        descriptive term.
        Values: (+/-)N%, soft, normal, loud.
Properties: Can contain other non-emotion tags.
Example:
<volume level="soft">You can barely hear me.</volume>

Appendix B SML DTD


<!--
########################################################################
# Speech Markup Language (SML) DTD, version 1.0.
#
# Usage:
# <!DOCTYPE sml SYSTEM "./sml-v01.dtd">
#
# Author: John Stallo
# Date: 11 September, 2000.
#
########################################################################
-->
<!ENTITY % LowLevelElements
"pitch |
rate |
emph |
pause |
volume |
pron |
speaker">
<!ENTITY % HighLevelElements "angry |
happy |
sad |
neutral |
embed">
<!ELEMENT sml (p)+>
<!ELEMENT p (%HighLevelElements;)+>
<!ELEMENT speaker (#PCDATA | %LowLevelElements;)*>
<!ATTLIST speaker
gender (male | female) "male"
name (en1 | us1) "en1">
<!--
#######################################
# HIGH-LEVEL TAGS
#######################################
-->
<!-- EMOTION TAGS -->
<!ELEMENT angry (#PCDATA | %LowLevelElements;)*>
<!ELEMENT happy (#PCDATA | %LowLevelElements;)*>
<!ELEMENT sad (#PCDATA | %LowLevelElements;)*>
<!ELEMENT neutral (#PCDATA | %LowLevelElements;)*>
<!-- OTHER HIGH-LEVEL TAGS -->
<!ELEMENT embed EMPTY>

<!ATTLIST embed
type CDATA #REQUIRED
src CDATA #IMPLIED
music_file CDATA #IMPLIED
lyr_file CDATA #IMPLIED>
<!--
#######################################
# LOW-LEVEL TAGS
#######################################
-->
<!ELEMENT pitch (#PCDATA | %LowLevelElements;)*>
<!ATTLIST pitch
middle CDATA #IMPLIED
base (low | medium | high) "medium"
range CDATA #IMPLIED>
<!ELEMENT rate (#PCDATA | %LowLevelElements;)*>
<!ATTLIST rate
speed CDATA #REQUIRED>
<!-- SPEAKER DIRECTIVES: Tags that can only encapsulate plain text -->
<!ELEMENT emph (#PCDATA)>
<!ATTLIST emph
level (weakest | weak | moderate | strong) "weak"
affect CDATA #IMPLIED
target CDATA #IMPLIED>
<!ELEMENT pause EMPTY>
<!ATTLIST pause
length (short | medium | long) "medium"
msec CDATA #IMPLIED
smooth (yes | no) "no">
<!ELEMENT volume (#PCDATA)>
<!ATTLIST volume
level CDATA #REQUIRED>
<!ELEMENT pron (#PCDATA)>
<!ATTLIST pron
sub CDATA #REQUIRED>

Appendix C Festival and Visual C++


Compiling Festival with Visual C++
I've recently been able to get Festival 1.4.1 and the Edinburgh Speech Tools Library successfully compiled
using Visual C++ 6.0 under Microsoft Windows NT 4.0. There were many problems, but simple ones, so I
started to keep track of what I did to get it to work so the process could be repeated if, God forbid, I needed
to recompile it in the future. Having said this, you'll realise I can't make any guarantees that the changes
described in this document will work, or are all that is required for compiling Festival on your system, but
I've made this available in the hope it may prove helpful to someone.
First things first, make sure you read the README and INSTALL files that come with Festival. These two
files provide important information about the Festival package, on setting it up and compiling. Also, read
the manual! You can find an online version here. Also, in every file that you modify, leave some sort of
comment stating what was changed, why it was changed, how to undo the modification, and who changed
it! I encapsulated my modifications in the following way:
/////////////////////////////////////
// STALLO MODIFICATION START
modified code...
// STALLO MODIFICATION END
/////////////////////////////////////
An important note to be made is that these modifications were not suggested by anyone from Festival, and
so some parts of this document may very well be erroneous. All I know is that I've been able to get it
working with these changes in place! My advice would be to first follow the instructions given in Festival's
INSTALL and README files. If you come across any problems, start looking through this document to
see if there's anything that may help you. Something I noticed is that most of the problems were in
compiling the Speech Tools library. Once this was done, compiling Festival met with very little difficulty.

Building the Edinburgh Speech Tools Library


The Edinburgh Speech Tools Library should be compiled before Festival (all files are found under the
speech_tools directory).
Configuration
Before carrying out the configuration instructions given in INSTALL:
1. Make sure you have a recent version of the Cygnus Cygwin make utility working on your system.
   You can download the Cygwin package here. We used version b20.1.

2. Make changes to the following files:

   - config/system.sh - Statement obtaining OSREV didn't work on our system. Hardcode the
     value (i.e. OSREV=20.1).

   - config/system.mak - This file shouldn't really be hand edited as it's created automatically
     from the config file, but if the system gets confused, you may need to change the value of
     OSREV from 1.0 to 20.1.

3. Now carry out the configuration instructions.

Creating VCMakefiles
From memory, there were no mishaps here. Only, again, make sure the make utility you're using is recent,
preferably from the Cygnus Cygwin package version b20.1 (we had trouble with earlier versions).
Building
To build the system:
nmake /nologo /fVCMakefile
However, you'll probably incur a number of compile errors. The following is a list of changes I had to
make:
1. SIZE_T redefinition in siod/editline.h.
   SIZE_T is redefined in editline.h since it is already defined in a Windows include file. Fix by
   inserting a compiler directive to only define SIZE_T if it hasn't been defined before, or simply
   don't define it (i.e. use #if 0 ... #endif).

2. "unknown size" error for template in base_class/EST_TNamedEnum.cc.
   nmake doesn't seem to like the syntax used in template void
   EST_TValuedEnumI<ENUM,VAL,INFO>::initialise(const void *vdefs, ENUM (*conv)(const char *)).
   Go to the line it's grumbling about, and do the following:
   - comment this line out:
     const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *defs =
         (const struct EST_TValuedEnumDefinition<ENUM,VAL,INFO> *)vdefs;
   - insert these 2 lines (__my_defn can be anything, just make sure it doesn't conflict with
     another typedef!):
     typedef EST_TValuedEnumDefinition<ENUM,VAL,INFO> __my_defn;
     const __my_defn *defs = (const __my_defn *)vdefs;

3. Multiply defined symbols in base_class/inst_templ/vector_fvector_t.cc.
   EST_TVector.cc is already included in vector_f_.cc, so a conflict is occurring. I'm not sure
   nmake would be the only one to complain about this though.
   - Comment out this line: #include "../base_class/EST_TVector.cc"

4. T * const p_contents never initialised in EST_TBox constructor (include/EST_TBox.h).
   nmake wants p_contents to be initialised in the constructor using an initializer list.
   - Comment the EST_TBox constructor out: EST_TBox(void) {};
   - Define the EST_TBox constructor with an initializer list: EST_TBox(void) : p_contents(NULL) {};
   I don't think it matters what p_contents is initialised to since the inline comment says that
   the constructor will never be called anyway.

Building Festival
If I recall correctly, once the Speech Tools Library was successfully compiled, there were no problems
compiling Festival. However, once you run any of the executables such as festival.exe, pay attention to any

error messages appearing at startup. If most functions are unavailable to you in the Scheme interpreter, it's
probably because it can't find the lib directory that contains the Scheme (.scm) files.

Using the Festival C/C++ API


(Linking the Festival libraries with your own source code)
Follow the directions given in the Festival manual for using Festival's C/C++ API. You'll need to include
the appropriate header files, and make sure you have the include paths set correctly within the Visual C++
environment (these are done through Project-->Settings-->C/C++). Try compiling the sample program
given in the manual that uses the Festival functions.
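For reference, the client program given in the manual is roughly of the following shape; treat this as a
sketch and check the manual shipped with your Festival version, since the exact example may differ.

#include "festival.h"

int main(int argc, char** argv)
{
    int heap_size = 210000;     // default Scheme heap size used by the manual's example
    int load_init_files = 1;    // load the init files in festival/lib

    festival_initialize(load_init_files, heap_size);

    // Synthesize and play some text, then wait for the audio spooler to finish.
    festival_say_text("Hello from the Talking Head");
    festival_wait_for_spooler();

    return 0;
}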

Compiling
If you've included the right header files, and the include paths are correct, you should have no trouble.
However, make sure your program source files have a .cpp extension, and not just .c - otherwise you'll get
all sorts of compilation errors. (Take it from someone who lost a lot of time over this silly error!) Also,
make sure SYSTEM_IS_WIN32 is defined by adding it to the pre-processor definitions in
Project-->Settings-->C/C++.

Linking
The manual gives instructions on which Festival and Speech Tool libraries your program needs to link
with. (Note: the manual mentions .a UNIX library files. You should look for .lib files instead.) Add linking
information through Project-->Settings-->Link. In addition to the libraries specified in the manual, you'll
need to link two other libraries:
- wsock32.lib - for the socket functions.
- winmm.lib - for multimedia functions such as PlaySound().

These two libraries should already be on your system, and Visual C++ should already be able to find them.
That's it! Good luck with your work :) If you've found this document to be helpful, please drop me an
email - I'd like to know if it's proving useful for others. If you have found errors in this document, or
have other suggestions for getting around problems, I'd especially like to hear from you! I also welcome
any questions you may have. You can contact me at stalloj@cs.curtin.edu.au. Finally, my warm thanks
to Alan W Black, Paul Taylor, and Richard Caley for making the Festival Speech Synthesis System
freely available to all of us :)

Appendix D Evaluation Questionnaire


SIMULATING EMOTIONAL SPEECH
FOR A TALKING HEAD
QUESTIONNAIRE
PLEASE NOTE:
You do NOT have to take part in this questionnaire.
If you find any of these questions intrusive feel free to leave them
unanswered.
Any data collected will remain strictly confidential, and anonymity will be
preserved.

SECTION 1 - PERSONAL AND BACKGROUND DETAILS

Age: ____________

Gender:   Female   Male

In which country have you lived for most of your life?
______________________________________________________________________________________

Is English your first spoken language?   Yes   No

Where do you use computers?
Home   Work   School   Other _____________   I Don't

The next section will commence shortly.


PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO
THANK YOU

SECTION 2 - EMOTION RECOGNITION


Part A.
For each of the speech samples played, choose ONE which you think is most appropriate for
how the speaker sounds.
1. The telephone has not rung at all today.
a)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
b)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
c)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
d)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 4 EXAMPLES WILL NOW BE REPLAYED

2. I received my assignment mark today.
a)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
b)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
c)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
d)  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 4 EXAMPLES WILL NOW BE REPLAYED

Part B.

For each of the speech samples played, choose ONE which you think is most appropriate for
how the speaker sounds.
1.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
2.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
3.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
4.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
5.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
6.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
7.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
8.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
9.   No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
10.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED

11.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
12.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
13.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
14.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
15.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED
16.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
17.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
18.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
19.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
20.  No emotion    Happy    Sad    Angry    Disgusted    Surprised    Other __________________
THE PREVIOUS 5 EXAMPLES WILL NOW BE REPLAYED

SECTION 3 - TALKING HEAD

Please look at and listen to the Talking Head examples before filling out this section.
[TALKING HEAD DEMONSTRATIONS]
Of the two examples just played:
a) Was one example easier to understand than the other? (Please tick ONE box only)
The FIRST example was easier to understand.
The SECOND example was easier to understand.
Neither was easier to understand than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
b) Did one speaker seem best able to express itself? (Please tick ONE box only)
The FIRST speaker seemed best able to express itself.
The SECOND speaker seemed best able to express itself.
Neither was able to express itself better than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
c) Did one example seem more natural than the other? (Please tick ONE box only)
The FIRST example seemed more natural.
The SECOND example seemed more natural.
Neither seemed more natural than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________
d) Did one example seem more interesting to you? (Please tick ONE box only)
The FIRST example seemed more interesting.
The SECOND example seemed more interesting.
Neither seemed more interesting than the other.
Reason for your choice:
_________________________________________________________
_________________________________________________________
_________________________________________________________

SECTION 4 - GENERAL

Prior to this demonstration:

Had you ever heard of speech synthesis?   No   Yes

Had you seen an animated Talking Head before?   No   Yes
If so, can you remember its name? ______________________________

Any further comments you could give about the Talking Head and/or its voice would be
greatly appreciated.
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

END OF QUESTIONNAIRE
THANK YOU VERY MUCH FOR YOUR HELP

Appendix E Test Phrases for Questionnaire, Section 2B

Neutral phrases
1. I saw her walking along the beach this morning.
2. The book is lying on the table.
3. My creator is a human being.
4. I have an appointment at 2 o'clock tomorrow.
5. Jim came to basketball training last night.

Emotional phrases
6. I have some wonderful news for you. (Happiness)
7. I cannot come to your party tomorrow. (Sadness)
8. I would not give you the time of day. (Anger)
9. Smoke comes out of a chimney. (Neutral)
10. Don't tell me what to do. (Anger)
