Bryan Duggan
School of Informatics, National College of Ireland, Mayor St., IFSC, Dublin 1, Ireland.
Tel.: +353 1 449 8604
E-mail: bduggan@ncirl.ie

Mark Deegan
School of Computing, Dublin Institute of Technology, Kevin St., Dublin 8, Ireland.
Tel.: +353 1 402 2867
E-mail: mark.deegan@comp.dit.ie
Abstract. The voice enabled web is a combination of XML based markup languages, speech recognition, text to speech (TTS) and web technologies. Key to the success of voice enabled web applications is the naturalness of the interface. Users are much more likely to interact with a system they feel comfortable with and that responds in a human-like way. This paper describes the deployment of TTS in commercial voice enabled web systems and considers whether the excessive usage of TTS can be detrimental to users' perceptions of the system.
1 Introduction
The Yankee Group (2001) defines the voice enabled web as any speech-enabled data interaction that utilises some browsing mechanism to navigate between separate sites, including voice-enabled web sites and interactive voice response (IVR) systems. Put another way, the voice enabled web is the web, with voice access typically over a telephone. The voice enabled web combines XML based mark-up languages, speech recognition, text to speech (TTS) and web technologies. The development of the VoiceXML standard by AT&T, IBM, Lucent Technologies and Motorola (since ratified by the World Wide Web Consortium (W3C)) has led to a proliferation of voice enabled web systems in recent years. Notwithstanding the limits of the technology, successful speech user interfaces should sound natural and have an identifiable personality, so that users feel comfortable interacting with the system. The task of TTS in voice enabled web systems is to generate human-like speech from text input, mimicking human speakers. TTS is a straightforward mechanism for delivering content in a voice enabled web system with minimal developer effort, and it offers far more flexibility than pre-recorded prompts. Several approaches to generating speech from text are available, with varying degrees of naturalness. This paper examines the issue of naturalness in voice enabled web systems and evaluates whether the usage of TTS is conducive to creating natural sounding interfaces.
A voice portal, like an Internet portal, is a single place where content from a number of sources is aggregated (The Yankee Group, 2001). For example, a voice portal user can typically access email, news, stock quotes, weather reports, traffic information, restaurant recommendations, cinema reviews and other services over the telephone. Users navigate voice portals with voice commands. Voice portals have been deployed by Internet portal companies such as AOL and Yahoo and by voice-portal-only companies such as Tellme Networks and HeyAnita. Latterly, the technology has been used in vCommerce applications. vCommerce is an emerging term that describes the usage of speech technology over the telephone in commercial applications such as banking, buying cinema tickets or stock trading (Biddlecombe, 2000; The Yankee Group, 2001). VoiceXML is a mark-up language for developing speech user interfaces. The development of the VoiceXML standard by AT&T, IBM, Lucent Technologies and Motorola has freed developers from having to learn about speech recognition algorithms or proprietary Application Programming Interfaces (APIs) for speech recognition engines (McGlashan et al., 2001). With the development of VoiceXML 2.0, a range of supporting standards has emerged for describing TTS, recognition grammars and call control. These standards have been grouped by the W3C into a suite called the W3C Speech Interface Framework and will likely form the basis for future voice enabled web applications (Larson, 2003).
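To give a flavour of the style of development VoiceXML enables, the following is a minimal, illustrative document for a hypothetical weather service; the grammar file name and server URL are assumptions for the purposes of the example, not part of any deployed system:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <field name="city">
      <!-- The prompt may be rendered by TTS or played from a recording -->
      <prompt>Welcome to the weather service. Which city would you like a forecast for?</prompt>
      <!-- An external SRGS grammar constrains what the recogniser listens for -->
      <grammar type="application/srgs+xml" src="cities.grxml"/>
      <filled>
        <prompt>Getting the forecast for <value expr="city"/>.</prompt>
        <!-- Hand the recognised value to an ordinary web server -->
        <submit next="http://example.com/forecast" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>
```

Note that the developer writes only mark-up: the recognition engine, the TTS engine and the telephony platform are all hidden behind the VoiceXML interpreter.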
Most research on speech user interfaces suggests creating a consistent personality for the automated voice of the system (Halpern, 2001; Eisenzopf, 2002; Sharma & Kunins, 2002). This helps users to relate to the system. Kotelly (2002) proposes that the personality of a voice enabled web system is conveyed by the text of the prompts, the voice speaking the prompts and the direction of the prompts. In a brokerage system, for example, the text spoken by the system might be formal in nature and the speaking voice of the system could be friendly, but measured and conscientious. This would inspire confidence in the caller about the personality answering the call. The voice of an order processing system for a computer games company, on the other hand, could be young and energetic. The system might use informal language and colloquial speech in the prompts. The aim should be to reinforce the branding of the company through the personality of the speaking voice in a voice enabled web system. It can also be argued that people draw conclusions about the underlying competence of computer applications in a way that is similar to how people draw conclusions about humans. If the system apologises too much when it makes an unimportant error, or if it talks too slowly, it may create the impression that the system is incompetent. Having a dialogue that creates the impression of an enthusiastic, competent helper is also important in inspiring confidence in users (Sharma & Kunins, 2002). Lawrie (2002) presents the example of "Julie", the voice of Amtrak, an American train company. Amtrak opted for a casual, conversational approach to the interface of its voice enabled web timetabling and ticket booking system. Julie greets all callers in a warm, friendly manner and provides regular reassurance as she navigates callers through the speech service. Since speech-enabling the service, automation rates have increased by 61% (Lawrie, 2002).
producing a waveform output of the text to be spoken, following the parameters defined by the document. TTS systems typically perform a phonetic analysis on the text input to convert the text into a sequence of phonemes. A phoneme is the smallest phonetic unit in a language capable of being distinguished, such as the m of mat and the b of bat in English (Edgington et al., 1996). This is followed by a prosodic analysis to attach appropriate pitch and duration information to the phonetic sequence. Finally, the speech synthesis component takes the parameters from the tagged phonetic sequence and generates the corresponding speech waveform (Huang et al., 2001). There are several methods currently in use to synthesise speech. These are outlined in the following sections.

3.1.1 Articulatory Synthesis
Articulatory synthesis uses a computer-simulated model of the speech production mechanism in humans, including models of the glottis, vocal tract, tongue and lips. It uses time-dependent, three-dimensional differential equations to compute the synthetic speech output. This approach, however, has notoriously high computational requirements and does not at present produce natural sounding, fluent speech. Commercial systems using this approach do not yet exist, as speech scientists still lack sufficient knowledge about the apparatus of speech in humans (Schroeter, 2001).

3.1.2 Formant Synthesis
Formant synthesis uses a rule-based approach that describes speech as a set of (up to 60) parameters related to formant and anti-formant frequencies and bandwidths. A formant is one of several frequency regions of relatively great intensity in a sound spectrum, which together determine the characteristic quality of a vowel sound (Dutoit, 1997). Formant synthesis generates highly intelligible, but not completely natural sounding, speech. It has the advantage, however, of a low memory footprint and moderate computational requirements (Schroeter, 2001).

3.1.3 Concatenative Synthesis
Concatenative synthesis generates speech from actual recorded speech samples stored in a voice database. Speech can be stored either as a waveform or encoded by a suitable speech coding method. Concatenative speech synthesis systems then string together units from the database and output the resulting speech signal. Variable length units are now the norm in concatenative speech synthesis, a unit being a recorded speech sample in the database. A unit can be a phrase, a word, a single phoneme or a diphone. A diphone is the transitional sound from one phoneme to the next, containing the second half of one phoneme plus the first half of the next (e.g. the t in writing) (Yi & Glass, 1998). These systems are the most human sounding, because the units are in fact human speech. The quality of non-uniform-unit (NUU) concatenative synthetic speech may sometimes be indistinguishable from human speech, but this requires many hours of text to be recorded. Concatenative synthesis is the most frequently used method of generating TTS in voice enabled web systems, though future TTS systems may use a hybrid of the concatenative and formant approaches, so that prosodic effects, such as varying the speed, pitch and emotional content of concatenatively generated speech, become possible (Henton, 2002).
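Prosodic control of this kind is, at the mark-up level, already exposed to developers through the W3C Speech Synthesis Markup Language (Burnett et al., 2002), although how faithfully a given engine honours each attribute varies. A brief illustrative fragment, with invented prompt text, might look as follows:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Ask the engine to slow the speaking rate and raise the pitch slightly -->
  <prosody rate="slow" pitch="+10%">
    Your train departs at
  </prosody>
  <!-- Render "08:30" as a time of day rather than as a number -->
  <say-as interpret-as="time">08:30</say-as>
</speak>
```

The mark-up only requests these effects; whether the underlying synthesiser can realise them naturally depends on the synthesis method it uses.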
3.3 Conclusions
A high quality TTS system must be both intelligible and natural. While modern TTS systems are intelligible, they still sound computer generated. Completely natural sounding TTS is not yet achievable using any of the approaches outlined in this paper. Because users weigh TTS output quality very heavily when judging the overall quality of a voice enabled web system, and make this judgement very quickly, companies have been reluctant to deploy TTS technology, using it only where pre-recording prompts with an actor is not possible. Companies have devoted thousands of person-hours to recording prompts, so as to brand the spoken personality of their voice enabled web systems and to avoid using the cheaper, faster and more flexible alternative: the voice of a machine. It is therefore recommended that pre-recorded speech be used where possible and TTS only where this is not feasible, for example in reading emails.
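This mixed approach is directly supported in VoiceXML by the audio element, which plays a recording and falls back to TTS only when the recording cannot be fetched; the file name and variable below are hypothetical, purely for illustration:

```xml
<prompt>
  <!-- Plays the recorded, branded greeting; the inline text is
       synthesised by TTS only if greeting.wav is unavailable -->
  <audio src="prompts/greeting.wav">
    Welcome to the message centre.
  </audio>
  <!-- Dynamic content that cannot be pre-recorded is left to TTS -->
  You have <value expr="unreadCount"/> new messages.
</prompt>
```

In this way a service can keep its recorded voice for fixed prompts while reserving TTS for unpredictable content such as email text.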
References
AT&T (2003) Natural Voices, http://www.naturalvoices.att.com/demos/, accessed April 2003.
BIDDLECOMBE, E. (2000) Talkshow, Communications International, July 2000.
BURNETT, D., WALKER, M., HUNT, A. (2002) Speech Synthesis Markup Language Specification, W3C Working Draft, 5 April 2002.
DUTOIT, T. (1997) Text, Speech and Language Technology, Kluwer Academic Publishers, Dordrecht, April 1997.
EDGINGTON, M. ET AL. (1996) Overview of Current Text-to-Speech Technologies: Part I - Text and Linguistic Analysis, BT Technology Journal, 14(1).
EISENZOPF, J. (2003) Top 10 Best Practices for Voice User Interface Design, http://www.developer.com/voice/article.php/1567051, accessed April 2003.
HALPERN, E. (2001) Human Factors and Voice Applications, VoiceXML Review, June 2001.
HENTON, C. (2002) Fiction and Reality of TTS, Speech Technology Magazine, February 2002.
HUANG, X.D., ACERO, A., HON, H.W., REDDY, R. (2001) Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall PTR, 1st edition, April 2001.
KOTELLY, B. (2002) The Science Behind Successful Caller-Experience, Global Speech Day presentation, 21 May 2002.
LARSON, J.A. (2003) The W3C Speech Interface Framework, Speech Technology Magazine, March/April 2003.
LAWRIE, C. (2002) Best Practices: Achieving Success with Speech, Speech Technology Magazine, November/December 2002.
MARKOWITZ, J. (1996) Using Speech Recognition, Prentice Hall, Upper Saddle River, NJ, 1996.
MCGLASHAN, S., BURNETT, D., DANIELSEN, P., FERRANS, J. (2001) Voice Extensible Mark-up Language (VoiceXML) Version 2.0, W3C Working Draft, October 2001.
NUANCE (2003) Vocalizer Demonstration, http://www.nuance.com/prodserv/demo_vocalizer.html, accessed April 2003.
REEVES, B., NASS, C. (1999) The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places, CSLI Publications, reprint edition.
SCANSOFT (2003) RealSpeak Demonstration, http://www.scansoft.com/realspeak/demo/, accessed April 2003.
SCHROETER, J. (2001) The Fundamentals of Text-to-Speech Synthesis, VoiceXML Review, March 2001.
SHARMA, C., KUNINS, J. (2002) VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, John Wiley & Sons, 1st edition.
THE YANKEE GROUP (2001) Voice Commerce: Speech Technology as an Enabler of Mobile Financial Transactions, The Yankee Report, May 2001.
TURING, A.M. (1950) Computing Machinery and Intelligence, Mind, 1950.
YI, J.R.W., GLASS, J.R. (1998) Natural Sounding Speech Synthesis Using Variable-Length Units, Spoken Language Systems Group, MIT.
ZUE, V., GLASS, J. (2000) Conversational Interfaces: Advances and Challenges, Proceedings of the IEEE, Vol. 88, No. 8, August 2000.