
Considerations In The Usage Of Text To Speech (TTS) In The Creation Of Natural Sounding Voice Enabled Web Systems

Bryan Duggan, School of Informatics, National College of Ireland, Mayor St., IFSC, Dublin 1, Ireland. Tel.: +353 1 449 8604. E-mail: bduggan@ncirl.ie
Mark Deegan, School of Computing, Dublin Institute of Technology, Kevin St., Dublin 8, Ireland. Tel.: +353 1 402 2867. E-mail: mark.deegan@comp.dit.ie

Abstract. The voice enabled web is a combination of XML-based mark-up languages, speech recognition, text to speech (TTS) and web technologies. Key to the success of voice enabled web applications is the naturalness of the interface. Users are much more likely to interact with a system they feel comfortable with and that responds in a human-like way. This paper describes the deployment of TTS in commercial voice enabled web systems and considers whether the excessive usage of TTS can be detrimental to users' perceptions of the system.

1 Introduction
The Yankee Group (2001) defines the voice enabled web as "any speech-enabled data interaction that utilises some browsing mechanism to navigate between separate sites", including voice-enabled web sites and interactive voice response (IVR) systems. Put another way, the voice enabled web is the web, with voice access typically over a telephone. The voice enabled web combines XML-based mark-up languages, speech recognition, text to speech (TTS) and web technologies. The development of the VoiceXML standard by AT&T, IBM, Lucent Technologies and Motorola, since ratified by the World Wide Web Consortium (W3C), has led to a proliferation of voice enabled web systems in recent years. Notwithstanding the limits of the technology, successful speech user interfaces should be natural sounding and have an identifiable personality, so that users feel comfortable interacting with the system. The task of TTS in a voice enabled web system is to generate human-like speech from text input, mimicking human speakers. TTS is a straightforward mechanism for delivering content in a voice enabled web system with minimal developer effort. It also offers a great deal of flexibility compared with pre-recorded prompts. Several approaches to generating speech from text are available, with varying degrees of naturalness. This paper examines the issue of naturalness in voice enabled web systems and evaluates whether the usage of TTS is conducive to creating natural sounding interfaces.

2 The Voice Enabled Web


Voice enabled web technology is being deployed in a broad range of industries. The technology came to widespread attention with the launch of the first voice portals.

A voice portal, like an Internet portal, is a single place where content from a number of sources is aggregated (The Yankee Group, 2001). For example, a voice portal user can typically access email, news, stock quotes, weather reports, traffic information, restaurant recommendations, cinema reviews and other services over the telephone. Users navigate voice portals with voice commands. Voice portals have been deployed by Internet portal companies such as AOL and Yahoo and by voice-portal-only companies such as Tellme Networks and HeyAnita. Latterly, the technology has been used in vCommerce applications. vCommerce is an emerging term that describes the usage of speech technology over the telephone in commercial applications such as banking, buying cinema tickets or stock trading (Biddlecombe, 2000; The Yankee Group, 2001).

VoiceXML is a mark-up language for developing speech user interfaces. The development of the VoiceXML standard by AT&T, IBM, Lucent Technologies and Motorola freed developers from having to learn about speech recognition algorithms or proprietary Application Programming Interfaces (APIs) for speech recognition engines (McGlashan et al., 2001). With the development of VoiceXML 2.0, a range of supporting standards has emerged for describing TTS, recognition grammars and call control. These standards have been grouped by the W3C into a suite called the W3C Speech Interface Framework and will likely form the basis for future voice enabled web applications (Larson, 2003).
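To give a flavour of the language, the following is a minimal, hypothetical VoiceXML 2.0 document (the grammar file, the submission URL and the prompt wording are invented for illustration). It prompts the caller for a city name, matches the answer against a speech recognition grammar and submits the recognised value to a web server:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A single form: ask for a city, then submit it to the server -->
  <form id="weather">
    <field name="city">
      <prompt>Which city would you like a forecast for?</prompt>
      <!-- The recognition grammar lists the city names the engine may match -->
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- On a successful recognition, pass the result to a web application -->
        <submit next="http://example.com/forecast" namelist="city"/>
      </filled>
    </field>
  </form>
</vxml>

The developer works only with mark-up of this kind; the underlying speech recognition and TTS engines are supplied by the platform.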

2.1 Naturalness In Voice Enabled Web Systems


The issue of naturalness in voice enabled web systems is one of the most important in advancing user acceptance. Users are much more likely to interact with a system they feel comfortable with and that responds in a human-like way (Markowitz, 1996). Ultimately, voice enabled web systems should pass the Turing Test, proposed by the mathematician Alan Turing in 1950. As Turing (1950) put it, a machine may be deemed intelligent if it can act in such a manner that a human cannot distinguish the machine from another human merely by asking questions via a mechanical link. Limitations of speech recognition technology and machine intelligence mean that this goal is still far off.

Research suggests that people display similar behaviour when interacting with computer systems as they do when interacting with real people (Reeves & Nass, 1999). It is suggested that because the human brain evolved in an environment where only humans exhibited social behaviour, humans tend to respond to objects that exhibit social characteristics as if they were human. This is an evolved response, which can be overcome only when people are consciously aware of their behaviour and choose to reject it. This response explains why, for example, people feel fear when watching a frightening film, even though it is not real. Reeves & Nass (1999) detail a number of experiments carried out to validate this theory. These experiments suggest that people will exhibit such characteristics as politeness, interpersonal distance, flattery, judgement and prejudice when interacting with even the simplest user interfaces. They propose that the design of user interfaces to computer systems should be based on social principles: human beings are natural experts in social interaction, will respond to computer systems that leverage this skill, and will automatically become experts in such systems.

Most research on speech user interfaces suggests creating a consistent personality for the automated voice of the system (Halpern, 2001; Eisenzopf, 2003; Sharma & Kunins, 2002). This helps users to relate to the system. Kotelly (2002) proposes that the personality of a voice enabled web system is conveyed by the text of the prompts, the voice speaking the prompts and the direction of the prompts. In a brokerage system, for example, the text spoken by the system might be formal in nature and the speaking voice could be friendly but measured and conscientious, inspiring confidence in the caller about the personality answering the call. The voice of an order processing system for a computer games company, on the other hand, could be young and energetic, and the system might use informal language and colloquial speech in its prompts. The aim should be to reinforce the branding of the company through the personality of the speaking voice of the voice enabled web system.

It can also be argued that people draw conclusions about the underlying competence of computer applications in much the same way that they draw conclusions about humans. If the system apologises too much when it makes an unimportant error, or if it talks too slowly, it may create the impression that the system is incompetent. A dialogue that creates the impression of an enthusiastic, competent helper is therefore important in inspiring confidence in users (Sharma & Kunins, 2002). Lawrie (2002) presents the example of "Julie", the voice of Amtrak, an American train company. Amtrak opted for a casual, conversational approach to the interface of its voice enabled web timetabling and ticket booking system. Julie greets all callers in a warm, friendly manner and provides regular reassurance as she navigates callers through the speech service. Since speech enabling the service, automation rates have increased by 61% (Lawrie, 2002).
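To make the point concrete, the same dialogue step can carry a very different personality purely through prompt wording. A hypothetical sketch in VoiceXML (both applications and both prompts are invented for illustration): the brokerage system might greet callers with

<prompt>Welcome to the trading desk. Please say the name of the security you wish to trade.</prompt>

while the games retailer might instead use

<prompt>Hey, welcome back! Which game are you after today?</prompt>

The mark-up is identical in structure; only the prompt text, and the voice recorded or synthesised to speak it, carries the brand.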

3 Text to Speech (TTS)


The task of TTS in voice enabled web systems is to generate human-like speech from text input, mimicking human speakers. TTS is also referred to as speech synthesis, particularly in the engineering community (Huang et al., 2001). It is a non-trivial problem domain: not only must a system be able to pronounce all common words of a particular language, it must also be able to deal with millions of names and acronyms. Moreover, in order to sound natural, the intonation of the sentences must be appropriately generated.

3.1 TTS Approaches


Input to a TTS system can be either raw text or tagged text. Tags can be added to raw text to assist text, phonetic and prosodic analysis. Prosody is a term that refers to the pitch, inflection and duration of a sound. Prosody is what differentiates human-sounding TTS from machine-sounding TTS (Dutoit, 1997). It has also been referred to as the musical qualities of speech (Edgington et al., 1996). Speech Synthesis Markup Language (SSML) is designed to assist the generation of synthetic speech in voice enabled web systems. Its role is to provide a standard way to control aspects of speech output such as pronunciation, volume, pitch and rate across different synthesis-capable platforms (Burnett et al., 2002). SSML defines the format of a document that can form the input to a TTS engine. The TTS engine then interprets the document, producing a waveform output of the text to be spoken, following the parameters defined by the document.
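For illustration, a short SSML document of the kind such an engine might consume (the wording is invented; the elements follow the W3C specification, though attribute details vary between draft and final versions):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-IE">
  Your flight departs at <say-as interpret-as="time">08:45</say-as>.
  <!-- A pause, then a slower, louder rendering of the important sentence -->
  <break time="500ms"/>
  <prosody rate="slow" volume="loud">Please arrive at least one hour early.</prosody>
  <!-- Explicit pronunciation and acronym expansion -->
  Flights to <phoneme alphabet="ipa" ph="ˈdʌblɪn">Dublin</phoneme> are operated by the
  <sub alias="Dublin Airport Authority">DAA</sub>.
</speak>

The engine is free to realise these hints according to its own voice, but the mark-up gives the developer portable control over prosody and pronunciation.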

TTS systems typically perform a phonetic analysis on the text input to convert the text into a sequence of phonemes. A phoneme is the smallest phonetic unit in a language that is capable of being distinguished, such as the m of mat and the b of bat in English (Edgington et al., 1996). This is followed by a prosodic analysis to attach appropriate pitch and duration information to the phonetic sequence. Finally, the speech synthesis component takes the parameters from the tagged phonetic sequence to generate the corresponding speech waveform (Huang et al., 2001). There are several methods currently in use to synthesise speech. These are outlined in the next sections.

3.1.1 Articulatory Synthesis

Articulatory synthesis uses a computer-simulated model of the speech production mechanism in humans. This includes a model of the glottis, vocal tract, tongue and lips. It uses time-dependent, three-dimensional differential equations to compute the synthetic speech output. This approach has notoriously high computational requirements, however, and at present does not produce natural sounding, fluent speech. Commercial systems using this approach do not yet exist, as speech scientists still lack sufficient knowledge about the apparatus of speech in humans (Schroeter, 2001).

3.1.2 Formant Synthesis

Formant synthesis uses a rule-based approach to describe speech as a set of (up to 60) parameters related to formant and anti-formant frequencies and bandwidths. A formant is one of several frequency regions of relatively great intensity in a sound spectrum, which together determine the characteristic quality of a vowel sound (Dutoit, 1997). Formant synthesis generates highly intelligible, but not completely natural sounding, speech. It has the advantage, however, of a low memory footprint and moderate computational requirements (Schroeter, 2001).
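As an illustration of the machinery involved (this formulation is a standard textbook resonator, not drawn from the sources cited above), a single formant with centre frequency $F$ and bandwidth $B$ is commonly realised as a second-order digital resonator with transfer function

$$H(z) = \frac{A}{1 - 2e^{-\pi B T}\cos(2\pi F T)\,z^{-1} + e^{-2\pi B T}\,z^{-2}}$$

where $T$ is the sampling period and the gain $A$ is chosen to normalise the response. A formant synthesiser drives a small bank of such resonators, with its rules updating $F$ and $B$ over time to trace the formant trajectories of the utterance.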

3.1.3 Concatenative Synthesis

Concatenative synthesis generates speech from actual recorded speech samples stored in a voice database. Speech can be stored either as a waveform or encoded by a suitable speech coding method. Concatenative speech synthesis systems then string together units from the database and output the resulting speech signal. Variable-length units are now the norm in concatenative speech synthesis, a unit being a recorded speech sample in a speech database. A unit can be a phrase, a word, a single phoneme or a diphone. A diphone is the transitional sound from one phoneme to the next; it contains the second half of one phoneme plus the first half of the next phoneme, e.g. the t in writing (Yi & Glass, 1998). These systems are the most human sounding, because the units themselves are recordings of a human voice. The quality of non-uniform-unit (NUU) concatenative synthetic speech may sometimes be indistinguishable from human speech, but this requires many hours of text to be recorded. Concatenative synthesis is the most frequently used method of generating TTS in voice enabled web systems, though future TTS systems may use a hybrid of the concatenative and formant approaches, so that prosodic effects, such as varying the speed, pitch and emotional content of concatenatively generated speech, become possible (Henton, 2002).

3.2 Evaluating TTS


There is a tendency for users to weight TTS speech output quality very highly in judging the overall quality of a voice enabled web system, and to make this judgement very quickly, after hearing just a few prompts (Schroeter, 2001). The available literature on the subject suggests that completely natural sounding TTS is not achievable using the technologies available today (Dutoit, 1997; Zue & Glass, 2000; Schroeter, 2001; Henton, 2002). This is confirmed by the authors' evaluation of the TTS engines from leading vendors (AT&T, 2003; Nuance, 2003; ScanSoft, 2003). As a consequence, companies have been reluctant to deploy TTS technology, using it only where it is not feasible to pre-record prompts with an actor.

3.3 Conclusions
A high quality TTS system must be both intelligible and natural. While modern TTS systems are intelligible, they still sound computer generated. Completely natural sounding TTS is not yet achievable using any of the approaches outlined in this paper. Because users weight TTS speech output quality very highly in judging the overall quality of a voice enabled web system, and make this judgement very quickly, companies have been reluctant to deploy TTS technology, using it only where pre-recording prompts with an actor is not possible. Companies have devoted thousands of person-hours to recording prompts so as to brand the spoken personality of their voice enabled web systems and to avoid using the cheaper, faster and more flexible alternative: the voice of a machine. It is therefore recommended that pre-recorded speech be used where possible, with TTS reserved for content that cannot feasibly be pre-recorded, for example reading out emails.
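VoiceXML supports this recommendation directly: the audio element plays a recorded prompt and falls back to synthesised speech only if the recording cannot be fetched. A minimal sketch (the file name and wording are illustrative):

<prompt>
  <!-- Play the studio recording; the enclosed text is spoken by TTS only
       if welcome.wav is missing or cannot be retrieved -->
  <audio src="prompts/welcome.wav">
    Welcome to Example Savings Bank. How may I help you today?
  </audio>
</prompt>

Dynamic content such as the body of an email, for which no recording can exist in advance, would be rendered by TTS within the same dialogue.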

References
AT&T (2003) Natural Voices, http://www.naturalvoices.att.com/demos/, accessed April 2003.
BIDDLECOMBE, E. (2000) Talkshow, Communications International, July 2000.
BURNETT, D., WALKER, M., HUNT, A. (2002) Speech Synthesis Markup Language Specification, W3C Working Draft, 5 April 2002.
DUTOIT, T. (1997) An Introduction to Text-to-Speech Synthesis, Text, Speech and Language Technology series, Kluwer Academic Publishers, Dordrecht.
EDGINGTON, M. ET AL. (1996) Overview of Current Text-to-Speech Technologies: Part I - Text and Linguistic Analysis, BT Technology Journal, 14(1).
EISENZOPF, J. (2003) Top 10 Best Practices for Voice User Interface Design, http://www.developer.com/voice/article.php/1567051, accessed April 2003.
HALPERN, E. (2001) Human Factors and Voice Applications, VoiceXML Review, June 2001.
HENTON, C. (2002) Fiction and Reality of TTS, Speech Technology Magazine, February 2002.
HUANG, X.D., ACERO, A., HON, H.W., REDDY, R. (2001) Spoken Language Processing: A Guide to Theory, Algorithm and System Development, Prentice Hall PTR, 1st edition, April 2001.
KOTELLY, B. (2002) The Science Behind Successful Caller-Experience, Global Speech Day presentation, 21 May 2002.
LARSON, J.A. (2003) The W3C Speech Interface Framework, Speech Technology Magazine, March/April 2003.
LAWRIE, C. (2002) Best Practices: Achieving Success with Speech, Speech Technology Magazine, November/December 2002.
MARKOWITZ, J. (1996) Using Speech Recognition, Prentice Hall, Upper Saddle River, NJ.
MCGLASHAN, S., BURNETT, D., DANIELSEN, P., FERRANS, J. ET AL. (2001) Voice Extensible Mark-up Language (VoiceXML) Version 2.0, W3C Working Draft, October 2001.
NUANCE (2003) Vocalizer Demonstration, http://www.nuance.com/prodserv/demo_vocalizer.html, accessed April 2003.
REEVES, B., NASS, C. (1999) The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places, CSLI Publications, reprint edition.
SCANSOFT (2003) RealSpeak Demonstration, http://www.scansoft.com/realspeak/demo/, accessed April 2003.
SCHROETER, J. (2001) The Fundamentals of Text-to-Speech Synthesis, VoiceXML Review, March 2001.
SHARMA, C., KUNINS, J. (2002) VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0, John Wiley & Sons, 1st edition.
THE YANKEE GROUP (2001) Voice Commerce: Speech Technology as an Enabler of Mobile Financial Transactions, The Yankee Report, May 2001.
TURING, A.M. (1950) Computing Machinery and Intelligence, Mind, 59(236).
YI, J.R.W., GLASS, J.R. (1998) Natural Sounding Speech Synthesis Using Variable-Length Units, Spoken Language Systems Group, MIT.
ZUE, V., GLASS, J. (2000) Conversational Interfaces: Advances and Challenges, Proceedings of the IEEE, 88(8), August 2000.
