WebSphere Voice Server for Multiplatforms
VoiceXML Programmer’s Guide
Note: Before using this information and the product it supports, be sure to read the information in “Notices” on page 185.
Sixth Edition (November 2005)

This edition applies to version 5 release 1 of IBM® WebSphere® Voice Server for Multiplatforms and to all subsequent releases and modifications until otherwise indicated in new editions. IBM may publish one or more new editions of this publication in a downloadable format after the program is generally available. To obtain the most recent edition of this publication, go to the Web site at http://www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi

© Copyright International Business Machines Corporation 2000, 2005. All rights reserved.
Contents
List of figures . . . vii
List of tables . . . ix
About this book . . . xi
Who should read this book . . . xi
Related publications . . . xi
Specifications and standards . . . xi
Speech user-interface design . . . xii
Server-side programming . . . xii
Deployment information . . . xiii
How this book is organized . . . xiv
Document conventions and terminology . . . xv
Making comments on this book . . . xv
Chapter 1. Introduction to VoiceXML . . . 1
What are voice applications? . . . 1
Why create voice applications? . . . 2
What are typical types of voice applications? . . . 2
Queries . . . 2
Transactions . . . 3
What is VoiceXML? . . . 3
What are the advantages of VoiceXML? . . . 4
How do you create and deploy a VoiceXML application? . . . 5
How do users access the deployed application? . . . 6
Chapter 2. Designing a speech user interface (SUI) . . . 9
Introduction . . . 9
The importance of SUI design . . . 10
The bases of SUI design . . . 10
The consumers of SUI design . . . 10
e-Service and speech technology . . . 10
Customer satisfaction with e-Service . . . 11
Service recovery . . . 12
SUI misconceptions . . . 12
Fundamental SUI design . . . 12
Major SUI objectives . . . 12
The power of the SUI . . . 13
Design methodology . . . 13
Design phase . . . 13
Prototype phase (“Wizard of Oz” testing) . . . 16
Test phase . . . 18
Refinement phase . . . 19
Getting started: high-level design decisions . . . 19
Selecting an appropriate user interface . . . 20
Deciding on the type and level of information . . . 20
Choosing the barge-in style . . . 21
Selecting recorded prompts or synthesized speech . . . 24
Deciding whether to use audio formatting . . . 27
Using simple or natural command grammars . . . 29
Adopting a terse or verbose prompt style . . . 31
Allowing only speech input or speech plus DTMF . . . 32
Adopting a consistent set of global navigation commands . . . 34
Deciding whether to use human agents in the deployed system . . . 39
Choosing help mode or self-revealing help . . . 40
Getting specific: low-level design decisions . . . 43
Adopting a consistent “sound and feel” . . . 43
Using consistent timing . . . 44
Designing consistent dialogs . . . 45
Creating introductions . . . 46
Constructing appropriate menus and prompts . . . 48
Designing and using grammars . . . 59
Error recovery and confirming user input . . . 64
Advanced user interface topics . . . 71
Issues in artificial personae . . . 71
Controlling the “lost in space” problem . . . 72
Managing audio lists . . . 72
Chapter 3. VoiceXML language . . . 75
Changes from VoiceXML 2.0 . . . 75
New elements and attributes . . . 75
Speech Synthesis Markup Language (SSML) . . . 76
The structure of a VoiceXML application . . . 76
Forms and form items . . . 76
Menus . . . 77
Flow control . . . 78
Subdialogs . . . 79
Comments . . . 79
A simple VoiceXML example . . . 80
Dynamic content . . . 81
VoiceXML elements and attributes . . . 82
Speech Synthesis Markup Language (SSML) . . . 90
SSML elements and attributes . . . 90
Built-in field types and grammars . . . 92
Recorded audio . . . 95
Using prerecorded audio files . . . 95
Recording spoken user input . . . 95
Playing and storing recorded user input . . . 95
Recording user input during speech recognition . . . 95
Document fetching and caching . . . 96
Controlling fetch and cache behavior . . . 96
Preventing caching . . . 96
Events . . . 96
Predefined events . . . 97
Application-specific events . . . 99
Recurring events . . . 99
Variables and expressions . . . 99
Using ECMAScript . . . 99
Declaring variables . . . 100
Assigning and referencing variables . . . 102
Using shadow variables . . . 102
Grammars . . . 104
Grammar syntax . . . 105
Static grammars . . . 107
Dynamic grammars . . . 109
Grammar scope . . . 109
Hierarchy of active grammars . . . 110
Mixed-initiative application and form-level grammars . . . 111
Specifying a sounds-like spelling in a Japanese, a Cantonese, or a Simplified Chinese grammar . . . 114
Timeout properties . . . 115
Incomplete timeout . . . 115
Complete timeout . . . 116
Example . . . 116
Telephony functionality . . . 117
Automatic Number Identification . . . 117
Dialed Number Identification Service . . . 118
Call transfer . . . 119
Chapter 4. Hints, tips, and best practices . . . 123
VoiceXML application structure . . . 123
Deciding how to group dialogs . . . 123
Deciding where to define grammars . . . 124
Fetching and caching resources for improved performance . . . 124
Invoking a State Table using VoiceXML . . . 125
Confidence-level processing . . . 127
Using a proxy server . . . 127
Testing built-in field types . . . 127
Sample code . . . 129
Calling a Java application . . . 129
Calling legacy telephony applications . . . 133
Using n-best . . . 134
Appendix A. CTTS caching . . . 137
Appendix B. Canadian French . . . 139
Built-in field types and grammars . . . 139
Predefined events . . . 144
Built-in commands . . . 145
Specifying character encoding . . . 145
Testing built-in field types . . . 145
SSML elements and attributes . . . 146
Appendix C. German . . . 149
Built-in field types and grammars . . . 149
Predefined events . . . 153
Built-in commands . . . 154
Specifying character encoding . . . 154
Testing built-in field types . . . 155
Appendix D. Japanese . . . 157
Built-in field types and grammars . . . 157
Predefined events . . . 160
Built-in commands . . . 160
Specifying character encoding . . . 160
Testing built-in field types . . . 160
SSML elements and attributes . . . 162
Appendix E. Korean . . . 163
Built-in field types and grammars . . . 163
Predefined events . . . 167
Built-in commands . . . 168
Specifying character encoding . . . 168
Testing built-in field types . . . 168
SSML elements and attributes . . . 170
Appendix F. Simplified Chinese . . . 171
Built-in field types and grammars . . . 171
Predefined events . . . 174
Built-in commands . . . 174
Specifying character encoding . . . 175
Testing built-in field types . . . 175
Appendix G. UK English . . . 177
Built-in field types and grammars . . . 177
Testing built-in field types . . . 183
Notices . . . 185
Trademarks . . . 187
Accessibility . . . 188
Other attributions . . . 188
Glossary . . . 189
Index . . . 195
List of figures
1. Spoke-too-soon (STS) incident . . . 66
2. Spoke-way-too-soon (SWTS) incident . . . 66
List of tables
1. Sample script for “Wizard of Oz” testing . . . 16
2. When to use a speech interface . . . 20
3. Barge-in Enabled versus Disabled . . . 21
4. Barge-in detection methods . . . 22
5. Audio formatting . . . 28
6. Simple versus natural command grammars . . . 29
7. Prompt styles . . . 31
8. Mixed input modes . . . 33
9. Recommended list of global command types . . . 35
10. Use of human agents . . . 39
11. Comparison of help styles . . . 40
12. Successful use of example . . . 46
13. Unsuccessful use of example: switch to directed dialog . . . 46
14. Recommended maximum number of menu items . . . 49
15. Recognition errors when spelling . . . 56
16. Grammar word/phrase length trade-offs . . . 60
17. Vocabulary robustness and grammar complexity trade-offs . . . 60
18. Number of active grammar trade-offs . . . 61
19. Error-recovery techniques . . . 64
20. Summary of VoiceXML elements and attributes . . . 82
21. Summary of SSML elements and attributes . . . 90
22. Built-in types for US English . . . 92
23. Properties for capturing audio during speech recognition . . . 95
24. Predefined events and event handlers for US English . . . 97
25. Variable scope . . . 100
26. Examples of relational operators . . . 102
27. Shadow variables . . . 103
28. Speech Recognition Grammar limitations in WebSphere Voice Server . . . 104
29. Form grammar scope . . . 110
30. Incomplete timeout . . . 116
31. Complete timeout . . . 116
32. Sample input for US English built-in field types . . . 128
33. Canadian French built-in types . . . 139
34. Canadian French predefined events and event-handler messages . . . 145
35. Canadian French built-in VoiceXML browser commands . . . 145
36. Sample input for Canadian French built-in field types . . . 146
37. Limitations for Canadian French SSML elements . . . 147
38. German built-in types . . . 149
39. German predefined events and event-handler messages . . . 153
40. German built-in VoiceXML browser commands . . . 154
41. Sample input for German built-in field types . . . 155
42. Japanese built-in types . . . 157
43. Japanese predefined events and event-handler messages . . . 160
44. Japanese built-in VoiceXML browser commands . . . 160
45. Sample input for Japanese built-in field types . . . 161
46. Limitations for Japanese SSML elements . . . 162
47. Korean built-in field types . . . 164
48. Korean predefined events and event-handler messages . . . 168
49. Korean built-in VoiceXML browser commands . . . 168
50. Sample input for Korean built-in field types . . . 169
51. Limitations for Korean SSML elements . . . 170
52. Simplified Chinese built-in types . . . 171
53. Simplified Chinese predefined events and event-handler messages . . . 174
54. Simplified Chinese built-in VoiceXML browser commands . . . 175
55. Sample input for Simplified Chinese built-in field types . . . 175
56. Limitations for Simplified Chinese SSML elements . . . 176
57. UK English built-in types . . . 177
58. Sample input for UK English built-in field types . . . 183
About this book

This book provides information about using VoiceXML Version 2.0 and 2.1 to design and develop voice applications. The resulting applications can then be deployed in a telephony environment using IBM® WebSphere® Voice Server to provide voice access to Web-based data using standard telephones.
Who should read this book

Read this book if you are:
• Someone who wants to find out about the advantages of using VoiceXML to deliver Web-based voice services
• An application developer interested in creating VoiceXML applications
• A content creator responsible for the creative aspects of VoiceXML applications
Related publications

Reference, design, and programming information for creating voice applications is available from a variety of sources, as represented by the documents listed in this section.

Note: Guidelines and publications cited in this book are for your information only and do not in any manner serve as an endorsement of those materials. You alone are responsible for determining the suitability and applicability of this information to your needs.
Specifications and standards

You may want to refer to the following sources for information about relevant specifications and standards:
• Voice Extensible Markup Language (VoiceXML) Version 2.0 specification, published by W3C and available at http://www.w3.org/TR/voicexml20/
• Speech Synthesis Markup Language Version 1.0 specification, published by W3C and available at http://www.w3.org/TR/speech-synthesis/
• Speech Recognition Grammar Specification (SRGS) Version 1.0, published by W3C and available at http://www.w3.org/TR/speech-grammar/
• Semantic Interpretation for Speech Recognition, W3C Working Draft 1 April 2003, published by W3C and available at http://www.w3.org/TR/semantic-interpretation/
• HTTP State Management Mechanism (Cookie Specification), available at http://www.w3.org/Protocols/rfc2109/rfc2109
• ECMA Standard 262: ECMAScript Language Specification, 3rd Edition, published by ECMA at http://www.ecma.ch/ecma1/stand/ECMA-262.htm
• The International Phonetic Alphabet, published by the International Phonetic Association at http://www2.arts.gla.ac.uk/IPA/ipachart.html
• The Unicode Standard Version 3.0, published by The Unicode Consortium at http://www.unicode.org
Speech user-interface design

The speech user interface guidelines presented in this book are an evolving set of recommendations based on industry research and lessons learned in the process of developing our own VoiceXML and telephony applications. For more information, refer to speech industry literature and publications such as the following sources:
• Audio System for Technical Readings (ASTeR) by T. V. Raman, a Ph.D. thesis published by Cornell University, May 1994.
• Auditory User Interfaces: Towards The Speaking Computer by T. V. Raman, published by Kluwer Academic Publishers, August 1997.
• “Directing the Dialog: The Art of IVR” by Myra Hambleton, published in Speech Technology, Feb/Mar 2000.
• Handbook of Human-Computer Interaction by Thomas K. Landauer, Martin Helander, and Prasad V. Prabhu, published by Elsevier Science, Amsterdam, North-Holland, June 1997.
• How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues (Second Edition) by Bruce Balentine, David P. Morgan, and William S. Meisel, published by Enterprise Integration Group, San Ramon, CA, 2001.
• Human Factors and Voice Interactive Systems by Daryle Gardner-Bonneau, published by Kluwer Academic Publishers, Boston, MA, March 1999.
Server-side programming

Information about server-side programming is available from a number of sources, including the following:
• Building Blocks for CGI Scripts in Perl at http://www.cc.ukans.edu/~acs/docs/other/cgi-with-perl.shtml
• Design and Implement Servlets, JSPs, and EJBs for IBM WebSphere Application Server (IBM Redbook) SG24-5754-00
• Developing an e-business Application for the IBM WebSphere Application Server (IBM Redpiece) SG24-5423-00
• JavaServer Pages (JSP) at http://java.sun.com/products/jsp
• Java Servlet at http://java.sun.com/products/servlet/
• The ASP (Active Server Pages) Resource Index at http://www.aspin.com/index/default.asp
• The Front of IBM WebSphere Building e-business User Interfaces (IBM Redbook) SG24-5488-00
• WebSphere V4.0 Advanced Edition Handbook (IBM Redbook) SG24-6176-00
• WebSphere Version 4 Application Development Handbook (IBM Redbook) SG24-6134-00
Deployment information

For information about deploying your voice applications, refer to the documentation provided with WebSphere Voice Server for Multiplatforms and WebSphere Voice Response for AIX®.

The following documentation is provided in softcopy only with the product, or can be downloaded from the IBM Publications Center:
• IBM Text-to-Speech SSML Programming Guide Version 6.7.3.0

WebSphere Voice Server for Multiplatforms
• WebSphere Voice Server for Multiplatforms: Administrator’s Guide, G210-1561
• WebSphere Voice Server for Multiplatforms: Application Development using State Tables, G210-1562

WebSphere Voice Response for AIX
• WebSphere Voice Response for AIX: General Information and Planning, GC33-1840
• WebSphere Voice Response for AIX: Installation, GC33-1842
• WebSphere Voice Response for AIX: User Interface Guide, SC33-1841
• WebSphere Voice Response for AIX: Configuring the System, SC33-1843
• WebSphere Voice Response for AIX: Managing and Monitoring the System, SC33-1844
• WebSphere Voice Response for AIX: Designing and Managing State Table Applications, SC33-1845
• WebSphere Voice Response for AIX: Application Development using State Tables, SC33-1846
• WebSphere Voice Response: Developing Java Applications, GC34-6318
• WebSphere Voice Response: Deploying and Managing VoiceXML and Java Applications, GC34-6319
• WebSphere Voice Response: Application Development using Java and VoiceXML, GC34-6049
• WebSphere Voice Response for AIX: Custom Servers, SC33-1847
• WebSphere Voice Response for AIX: 3270 Servers, SC33-1848
• WebSphere Voice Response for AIX: Fax using Brooktrout, SC34-5982
• WebSphere Voice Response for AIX: Cisco ICM Interface User’s Guide, SC34-5317
• WebSphere Voice Response for AIX: Programming for the ADSI Feature, SC34-5380
• WebSphere Voice Response for AIX: Programming for the Signaling Interface, SC33-1851
How this book is organized

Chapter 1, “Introduction to VoiceXML,” on page 1 provides an introduction to voice applications and VoiceXML.

Chapter 2, “Designing a speech user interface (SUI),” on page 9 presents user-interface guidelines for developing voice applications.

Chapter 3, “VoiceXML language,” on page 75 introduces basic concepts and constructs of VoiceXML. For complete syntax, refer to the VoiceXML 2.0 specification at http://www.w3.org/TR/voicexml20/.

Chapter 4, “Hints, tips, and best practices,” on page 123 contains hints and tips for structuring and coding your VoiceXML applications. It includes examples of the following: use of the VoiceXML <object> element to reference Java code; use of the maxnbest property, set with the <property> element, to obtain n-best results; and use of a precompiled grammar.

Appendix A, “CTTS caching,” on page 137 describes audio caching for improving the performance of applications that use concatenative text-to-speech (CTTS) synthesis.

You might want to refer to the following appendixes for language-specific information:
• Appendix B, “Canadian French,” on page 139 contains information that is specific to Canadian French.
• Appendix C, “German,” on page 149 contains information that is specific to German.
• Appendix D, “Japanese,” on page 157 contains information that is specific to Japanese.
• Appendix E, “Korean,” on page 163 contains information that is specific to Korean.
• Appendix F, “Simplified Chinese,” on page 171 contains information that is specific to Simplified Chinese.
• Appendix G, “UK English,” on page 177 contains information that is specific to UK English.

“Notices” on page 185 contains notices and trademark information.

“Glossary” on page 189 defines key terminology used in this document.
Document conventions and terminology

The following conventions are used to present information in this document:

Italic        Used for emphasis, to indicate variable text, and for references to other documents.
Bold          Used for configuration parameters, file names, URLs, and user-interface controls such as command buttons and menus.
Monospaced    Used for sample code.
<text>        Used for editorial comments in scripts.
Making comments on this book

If you especially like or dislike anything about this book, feel free to send us your comments.

You can comment on what you regard as specific errors or omissions, and on the accuracy, organization, subject matter, or completeness of this book. Please limit your comments to the information that is in this book and to the way in which the information is presented. Speak to your IBM representative if you have suggestions about the product itself.

When you send us comments, you grant to IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

You can get your comments to us quickly by sending an e-mail to [email protected]. Alternatively, you can mail your comments to:

User Technologies
IBM United Kingdom Laboratories
Mail Point 095
Hursley Park
Winchester, Hampshire
SO21 2JN
United Kingdom
Chapter 1. Introduction to VoiceXML

This introduction addresses the following questions:
• “What are voice applications?”
• “Why create voice applications?” on page 2
• “What are typical types of voice applications?” on page 2
• “What is VoiceXML?” on page 3
• “What are the advantages of VoiceXML?” on page 4
• “How do you create and deploy a VoiceXML application?” on page 5
• “How do users access the deployed application?” on page 6
What are voice applications?

Voice applications are applications in which the input, the output, or both pass through a spoken, rather than a graphical, user interface. The application files can reside on the local system, an intranet, or the Internet. Users can access the deployed applications anytime, anywhere, from any telephone.

“Voice-enabling the World Wide Web” does not simply mean:
• Using spoken commands to tell a visual browser to look up a specific Web address or go to a particular bookmark
• Having a visual browser throw away the graphics on a traditional visual Web page and read the rest of the information aloud
• Converting the bold or italics on a visual Web page to some kind of emphasized speech

Rather, voice applications provide an easy and novel way for users to surf or shop on the Internet: “browsing by voice.” Users can interact with Web-based data (that is, data available through Web-style architecture such as servlets, ASPs, JSPs, JavaBeans, and CGI scripts) using speech rather than a keyboard and mouse.

The form that this spoken data takes is often not identical to the form it takes in a visual interface, because of the inherent differences between the two interfaces. For this reason, transcoding (that is, using a tool to automatically convert HTML files to VoiceXML) might not be the most effective way to create voice applications.
Why create voice applications?

Until recently, the World Wide Web has relied exclusively on visual interfaces to deliver information and services to users through computers equipped with a monitor, keyboard, and pointing device. In doing so, it has ignored a huge potential customer base: people who, because of time, location, or cost constraints, do not have access to a computer.

Many of these people do, however, have access to a telephone. Providing “conversational access” (that is, spoken input and audio output over a telephone) to Web-based data will permit companies to reach this untapped market. Users benefit from the convenience of using the mobile Internet for self-service transactions, while companies enjoy the Web’s relatively low transaction costs. And, unlike applications that rely on dual tone multi-frequency (DTMF) (telephone keypress) input, voice applications can be used in a hands-free or eyes-free environment, as well as by customers with rotary pulse telephone service or telephones in which the keypad is on the handset.
What are typical types of voice applications?

Voice applications will typically fall into one of the following categories:
• “Queries”
• “Transactions” on page 3
Queries

In this scenario, a customer calls into a system to retrieve information from a Web-based infrastructure.

The system guides the customer through a series of menus and forms by playing instructions, prompts, and menu choices using prerecorded audio files or synthesized speech.

The customer uses spoken commands or DTMF input to make menu selections and fill in form fields.

Based on the customer’s input, the system locates the appropriate records in a back-end enterprise database. The system presents the desired information to the customer, either by playing back prerecorded audio files or by synthesizing speech based on the data retrieved from the database.

Examples of this type of self-service interaction include applications or voice portals providing weather reports, movie listings, stock quotes, health-care-provider listings, and customer service information (Web call centers).
Transactions

In this scenario, a customer calls into a system to execute specific transactions with a Web-based back-end database.

The system guides the customer to provide the data required for the transaction by playing instructions, prompts, and menu choices using prerecorded audio files or synthesized speech. The customer responds using spoken commands or DTMF input.

Based on the customer’s input, the system conducts the transaction and updates the appropriate records in a back-end enterprise database. Typically the system also reports back to the customer, either by playing prerecorded audio files or by synthesizing speech based on the information in the database records.

Examples of this type of self-service interaction include applications or voice portals for employee benefits, employee time card submission, financial transactions, travel reservations, calendar appointments, electronic relationship management (eRM), sales automation, and order management.
What is VoiceXML?

The Voice Extensible Markup Language (VoiceXML) is an XML-based markup language for creating distributed voice applications, much as HTML is a markup language for creating distributed visual applications. It is an industry standard defined by the World Wide Web Consortium (W3C) at http://www.w3.org/TR/voicexml21/.

The VoiceXML language enables Web developers to use a familiar markup style and Web server-side logic to deliver applications for use over telephone lines. The resulting VoiceXML applications can interact with existing back-end business data and logic.

Using VoiceXML, application developers can create Web-based voice applications that users can access by telephone or other pervasive devices. VoiceXML supports dialogs that feature:
• Recognition of spoken input
• DTMF input
• Recording of spoken input
• Synthesized speech output (“text-to-speech”)
• Prerecorded digitized audio output
• Dialog flow control
• Automatic Number Identification (ANI)
• Dialed Number Identification Service (DNIS)
• Call transfer
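To show several of these features working together, the following is a minimal VoiceXML 2.0 document, written as an illustrative sketch rather than taken from the product samples. It plays a prerecorded greeting (with text-to-speech fallback), asks a yes/no question that the caller can answer by voice or DTMF, and branches on the answer. The file name welcome.wav, the form and field names, and the prompt wording are hypothetical placeholders; the built-in boolean field type is defined by the VoiceXML 2.0 specification, and the types that WebSphere Voice Server supports for each language are listed under “Built-in field types and grammars.”

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="weather">
    <block>
      <!-- Prerecorded audio output; the enclosed text is spoken by the
           text-to-speech engine if welcome.wav (hypothetical) cannot be played -->
      <audio src="welcome.wav">Welcome to the weather line.</audio>
    </block>
    <!-- Built-in boolean type: the caller can say "yes" or "no",
         or press 1 (yes) or 2 (no) on the telephone keypad -->
    <field name="wantsForecast" type="boolean">
      <prompt>Would you like today's forecast?</prompt>
      <filled>
        <!-- Dialog flow control: branch on the recognized value -->
        <if cond="wantsForecast">
          <prompt>Placeholder forecast text goes here.</prompt>
        <else/>
          <prompt>Goodbye.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

In a real application, the forecast text would typically come from a back-end system rather than being written into the page.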
What are the advantages of VoiceXML?

While you could certainly build voice applications without using a voice markup language and a speech browser (for example, by writing your applications directly to a speech API), using VoiceXML and a VoiceXML browser provides several important capabilities:
• VoiceXML is a markup language that makes building voice applications easier, in the same way that HTML simplifies building visual applications. VoiceXML also reduces the amount of speech expertise that developers need.
• VoiceXML applications can use the same existing back-end business logic as their visual counterparts, enabling voice solutions to be introduced to new markets quickly. Current and long-term development and maintenance costs are minimized by leveraging the Web design skills and infrastructures already present in the enterprise. Customers can benefit from a consistency of experience between voice and visual applications.
• VoiceXML implements a client/server paradigm, where a Web server provides VoiceXML documents that contain dialogs to be interpreted and presented to a user. The user’s responses are submitted to the Web server, which responds by providing additional VoiceXML documents, as appropriate. VoiceXML allows you to request documents and submit data to server scripts using Uniform Resource Identifiers (URIs). VoiceXML documents can be static, or they can be dynamically generated by CGI scripts, JavaBeans, ASPs, JSPs, Java servlets, or other server-side logic.
• Unlike a proprietary Interactive Voice Response (IVR) system, VoiceXML provides an open application development environment that generates portable applications. This makes VoiceXML a cost-effective alternative for providing voice access services.
• Most installed IVR systems today accept input from the telephone keypad only. In contrast, VoiceXML is designed predominantly to accept spoken input, but it can also accept DTMF input, if desired. As a result, VoiceXML helps speed up customer interactions by providing a more natural interface that replaces the traditional, hierarchical IVR menu tree with a streamlined dialog using a flattened command structure.
• VoiceXML directly supports networked and Web-based applications, meaning that a user at one location can access information or an application provided by a server at another, geographically or organizationally distant, location. This capitalizes on the connectivity and commerce potential of the World Wide Web.
• Using a single VoiceXML browser to interpret streams of markup language originating from multiple locations provides the user with a seamless conversational experience across independent applications. For example, a voice portal application might allow a user to temporarily suspend an airline purchase transaction to interact with a banking application on a different server to check an account balance.
• VoiceXML supports local processing and validation of user input.
• VoiceXML supports playback of prerecorded audio files.
• VoiceXML supports recording of user input. The resulting audio can be played back locally or uploaded to the server for storage, processing, or playback at a later time.
• VoiceXML defines a set of events corresponding to such activities as a user request for help, the failure of a user to respond within a timeout period, and an unrecognized user response. A VoiceXML application can provide catch elements that respond appropriately to a given event for a particular context.
• VoiceXML supports context-specific and tapered help using a system of events and catch elements. Help can be tapered by specifying a count for each event handler, so that different event handlers are executed depending on the number of times that the event has occurred in the specified context. This can be used to provide increasingly detailed messages each time the user asks for help. For more information, see “Choosing help mode or self-revealing help” on page 40.
• VoiceXML supports subdialogs, which are roughly the equivalent of function or method calls. Subdialogs can be used to provide a disambiguation or confirmation dialog, and to create reusable dialog components. For more information, see “Subdialogs” on page 79.
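The event-handling and tapered-help capabilities described in the list above can be sketched as follows. This is an illustrative VoiceXML 2.0 fragment, not a product sample; the form and field names and all prompt wording are hypothetical. Under the VoiceXML 2.0 rules, when an event is thrown the browser selects the handler with the highest count attribute that does not exceed the number of times that event has occurred for the form item, so the second help handler here runs on the second and every later request for help. Which spoken words trigger the help event depends on the built-in commands for the language in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="account">
    <field name="acctNum" type="digits">
      <prompt>Please say or key in your account number.</prompt>

      <!-- First help request for this field: a short hint -->
      <help count="1">
        <prompt>Your account number is printed on your statement.</prompt>
      </help>
      <!-- Second and later help requests: a more detailed message -->
      <help count="2">
        <prompt>Your account number is printed in the upper right corner
          of your monthly statement. Say it one digit at a time, or enter
          it on your telephone keypad.</prompt>
      </help>

      <!-- The caller said nothing before the timeout period expired -->
      <noinput>
        <prompt>Sorry, I did not hear you.</prompt>
        <reprompt/>
      </noinput>

      <!-- The response did not match the field grammar (first and second time) -->
      <nomatch count="1">
        <prompt>Sorry, I did not understand.</prompt>
        <reprompt/>
      </nomatch>
      <!-- Third and later unrecognized responses: give up politely -->
      <nomatch count="3">
        <prompt>I am still having trouble. Please call back later.</prompt>
        <exit/>
      </nomatch>

      <filled>
        <prompt>You entered <value expr="acctNum"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The `<reprompt/>` elements cause the field’s main prompt to be replayed after the error message; the help handlers omit it, so only the help text is played before the browser listens again.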
How do you create and deploy a VoiceXML application?

1. An application developer creates a voice application written in VoiceXML. You can write VoiceXML applications using a text editor, but you might find it more convenient to use a graphical development environment that helps you create, manage, and test VoiceXML files. The Voice Toolkit for WebSphere Studio (Voice Toolkit) supports the development of VoiceXML-based applications.
   VoiceXML pages can be static or can be generated dynamically from CGI scripts, JavaBeans, ASPs, JSPs, Java servlets, or other server-side techniques.
2. (Optional) The developer publishes the VoiceXML application (VoiceXML documents, grammar files, any prerecorded audio files, and any
3. (Optional) The developer uses a desktop workstation and the Voice Toolkit to test the VoiceXML application running on the Web server or local disk, pointing the VoiceXML browser to the appropriate starting VoiceXML page.
4. A telephony expert configures the telephony infrastructure for WebSphere Voice Response for AIX. See WebSphere Voice Response for AIX: Installation and WebSphere Voice Response for AIX: Configuring the System for instructions.
5. The system administrator uses WebSphere Voice Response for AIX to configure, deploy, monitor, and manage a dedicated WebSphere Voice Server system. The WebSphere Voice Response telephone network connection provides the audio channels for the VoiceXML browser.
6. The developer uses a real telephone to test the VoiceXML application using WebSphere Voice Server.
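Step 1 notes that VoiceXML pages can be generated dynamically by server-side programs. The following sketch shows the static half of that pattern: a page that collects a value and submits it to a server program, which is expected to reply with a newly generated VoiceXML document for the browser to interpret. The URL http://www.example.com/movies/listings.jsp and the names movieLookup and zipCode are hypothetical; the server program behind the URL is not shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="movieLookup">
    <field name="zipCode" type="digits">
      <prompt>Please say or key in your zip code.</prompt>
      <filled>
        <!-- Send the collected value to a server-side program (a JSP in
             this sketch; a servlet, ASP, or CGI script would be addressed
             the same way). The response is the next VoiceXML document,
             which the browser fetches and interprets. -->
        <submit next="http://www.example.com/movies/listings.jsp"
                namelist="zipCode" method="get"/>
      </filled>
    </field>
  </form>
</vxml>
```

Keeping the collection dialog static and generating only the results page dynamically lets the static page be cached, while the data-dependent content is produced per request.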
How do users access the deployed application?

Once your voice applications are deployed, users simply dial the telephone number that you provide and are connected to the corresponding voice application. The figure below shows a flowchart of a typical call.

[Figure: Flowchart of a typical call. Answer the telephone call → Play a prompt → Wait for the caller’s response → Take action as directed by the caller → Complete the interaction]

1. A user dials the telephone number you provide. WebSphere Voice Response answers the call and executes the application referenced by the dialed phone number.
2. WebSphere Voice Server plays a greeting to the caller and prompts the caller to indicate what information he or she wants.
   • The application can use prerecorded greetings and prompts, or the application can have the greeting or prompt synthesized from text using the text-to-speech engine.
   • If the application supports barge-in, the caller can interrupt the prompt if he or she already knows what to do.
3. The application waits for the caller’s response for a set period of time.
   • The caller can respond either by speaking or by pressing one or more keys on a DTMF telephone keypad, depending on the types of responses expected by the application.
   • If the response does not match the criteria defined by the application (such as the specific word, phrase, or digits), the voice application can prompt the caller to enter the response again, using the same or different wording.
   • If the waiting period has elapsed and the caller has not responded, the application can prompt the caller again, using the same or different wording.
4. The application takes whatever action is appropriate to the caller’s response. For example, the application might:
   • Update information in a database
   • Retrieve information from a database and speak it to the caller
   • Store or retrieve a voice message
   • Launch another application
   • Play a help message
   After taking action, the application prompts the caller with what to do next.
5. The caller or the application can terminate the call. For example:
   • The caller can terminate the interaction at any moment by hanging up. WebSphere Voice Response can detect if the caller hangs up and can then disconnect itself.
   • If the application permits it, the caller can use a command to indicate explicitly that the interaction is over (for example, by saying “Exit”).
   • If the application has finished running, it can play a closing message and then disconnect.
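The call flow above maps fairly directly onto standard VoiceXML 2.0 elements. The following sketch annotates each part with the step it corresponds to; it is illustrative only, and the URL, form and field names, prompt wording, and the 5-second timeout are hypothetical choices. Step 4 is shown as a submission to a server program; the document that program returns could, for example, read back the order status and then end the call with a closing prompt followed by `<disconnect/>` (step 5).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Step 3: wait up to 5 seconds for the caller before a noinput event -->
  <property name="timeout" value="5s"/>

  <!-- Step 5: the platform throws this event when the caller hangs up -->
  <catch event="connection.disconnect.hangup">
    <exit/>
  </catch>

  <form id="orderStatus">
    <block>
      <!-- Step 2: greeting; bargein="true" lets the caller interrupt it -->
      <prompt bargein="true">Welcome to the order status line.</prompt>
    </block>
    <field name="orderNum" type="digits">
      <prompt bargein="true">Say or key in your order number.</prompt>
      <!-- Step 3: no response within the timeout, so ask again -->
      <noinput>
        <prompt>I did not hear anything.</prompt>
        <reprompt/>
      </noinput>
      <!-- Step 3: response did not match the expected digits, so ask again -->
      <nomatch>
        <prompt>Sorry, I did not understand that.</prompt>
        <reprompt/>
      </nomatch>
      <filled>
        <!-- Step 4: act on the response by handing it to a server program -->
        <submit next="http://www.example.com/orders/status.jsp"
                namelist="orderNum"/>
      </filled>
    </field>
  </form>
</vxml>
```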
Chapter 2. Designing a speech user interface (SUI)

This chapter covers the following topics:
• “Introduction”
• “The importance of SUI design” on page 10
• “Design methodology” on page 13
• “Getting started: high-level design decisions” on page 19
• “Getting specific: low-level design decisions” on page 43
• “Advanced user interface topics” on page 71
Introduction

The SUI guidelines presented here are just that: guidelines. In some cases, the requirements and objectives of particular VoiceXML applications might present valid reasons for overriding certain guidelines. Furthermore, these guidelines address the design points we have found to be the most important in producing speech/audio user interfaces, but are not as comprehensive as those found in a book dedicated to the topic. Finally, keep in mind that while the following guidelines can help you produce a usable application, they do not guarantee usability or user satisfaction; you should plan to conduct usability tests with your application. (See “Design methodology” on page 13.)

Note: The guidelines and referenced publications presented in this book are for your information only, and do not in any manner serve as an endorsement of those materials. You alone are responsible for determining the suitability and applicability of this information to your needs.

The goals of the guidelines presented in this chapter include:
• Helping you create standardized, well-behaved VoiceXML applications
• Reducing development time by teaching current best practices in speech user interface (SUI) design
• Increasing the usability of the SUI and reducing the end user’s learning curve by promoting consistent computer output and predictable user input
The importance of SUI design

The bases of SUI design

Effective SUI design draws upon many disciplines. The key scientific disciplines are Psychology, Human-Computer Interaction, Human Factors, Linguistics, and Communication Theory. The artistic disciplines of Auditory Design and Writing (especially the techniques of writing dialog) are also very important. Finally, for true craftsmanship in SUI design, there is no substitute for experience and codification of best practices (such as the information provided in this chapter).
The consumers of SUI design
A SUI has many consumers and must simultaneously satisfy many objectives. Among the consumers of SUIs are marketers, service providers, end users, and developers.
Marketers have the responsibility of selling speech applications to service providers. The primary objective of a SUI from a marketer’s point of view is to appeal to the targeted service provider, thus helping to make the sale.
Service providers rely on the speech application to help them provide a service to end users. For service providers, the primary objectives of a SUI are to save money and maintain customer contact and satisfaction. To this end, they are also concerned with their corporate image and how their speech applications will fit into their overall branding strategies.
End users call the speech application for the purpose of obtaining a service from the service provider. End users want SUIs that are easy to use, allow efficient task completion, and provide a pleasant user experience.
Developers must write the code that creates the entire speech application, including the SUI. The primary objectives for developers creating SUIs are that the interface be technologically feasible, capable of completion given resource constraints, and require minimal development effort.
e-Service and speech technology
Speech applications are one way to provide customer self-service over electronic networks (the technologies collectively known as e-Service). e-Services must focus on meeting customer needs to increase market share and revenue. Technologies are enablers in e-Service that can enhance customer convenience and support, but applications require carefully designed user interfaces to manage customer expectations.
However, a key element of customer relationship building, human interaction, is absent from most e-Service facilities. Through careful design, speech
limitations of current technologies, conversations with automated systems are very different from conversations between humans. Nonetheless, excellent user interface designs working with current technologies can create the illusion of human-like interaction. To achieve this level of excellence in user interface design, it is important to consider users’ interpersonal communication, cognitive, and social skills in the initial design, and then to apply usability testing methods throughout the design cycle to tune the interface.
Good e-Service design requires a balance of business needs and end-user needs. Some key business needs are:
• Provide information and services.
• Market products and services.
• Build customer loyalty and trust.
• Convey a positive business image to the customer.
• Reduce costs.
In contrast, key customer (end user) goals are:
• Obtain information or perform a task quickly and easily.
• Have a smooth and pleasant interaction.
These goals converge when designers apply knowledge of human conversational behavior to the user interface. Key elements to consider are:
• Organization and call flow
• Language and prompt style
• The system voice (especially loudness, pitch, and pitch variation)
• Use of non-speech audio (for example, music and audio logos)
• Social expectation of the service provider role
Customer satisfaction with e-Service
According to recent market research, customers are satisfied with self-service technologies when the technologies:
• Are better than other service alternatives by virtue of saving time, saving money, providing more customer control, or avoiding service employees. (68%)
• Do their jobs and perform as intended. (21%)
• Solve an intensified need. (11%)
Customers become dissatisfied when:
• Technology fails. (43%)
• Applications have poor design. (36%)
• Service process fails. (17%)
Poor user interface design can be the root cause of customer perception of these apparently different problems.
Service recovery
A common factor that runs through customer dissatisfaction is the lack of service recovery in the face of system failure. To avoid customer dissatisfaction, it is important to:
• Use high-quality, reliable telephony and speech technology.
• Design for communication breakdown by expecting misrecognitions to occur.
• Reinforce correct user responses by moving forward.
• Use a hierarchy of non-intrusive error messages that foster an impression of moving forward while repairing the communicative breakdown in the most appropriate way.
• Use a user-centered design process of iterative evaluation and redesign to quickly identify and avoid user problems.
SUI misconceptions
These statements are not necessarily true:
• It is always easy and natural to use SUIs. After all, almost everyone can talk!
• Talking to a computer is just like talking to a person.
• Usable speech applications require the latest and greatest technologies.
• Barge-in is essential. Without barge-in, a SUI is unusable.
• Users hate beeps and other tones in SUIs because they aren’t natural.
Fundamental SUI design
SUI design is not just reading a visual Web page aloud. You must decide:
• What to present.
• How much to present.
• How to present it.
• When to present it.
Effective SUI design is based on:
• Understanding customer profiles.
• Meeting realistic expectations.
• Following a design methodology that uses proven techniques.
Major SUI objectives
In their influential book on SUIs, Balentine and Morgan stated that the main enemy of the spoken user interface is time. The basis for this assertion is that speech has a temporary existence and listeners must remember what they have heard. If prompts in a speech application are too short, however, they can be subject to multiple interpretations. A clear design objective, therefore, is to avoid making users hear more (or less) than they need to hear, or say more (or less) than they need to say.
It is also important to strive to make every interaction move the user forward (or at least create the illusion of moving forward). This is easier said than done because dialogs need careful crafting and usability evaluation. To work toward these objectives, use prompts that are succinct and sincere (modeled after the prompts provided by expert call center agents), provide self-revealing contextual help messages, and use a professional voice talent for recorded prompts.
The power of the SUI
A good SUI has a natural, human-like quality. There is no need to associate a function with a number (in contrast to touchtone user interfaces) when speech labels provide good functional descriptions. System prompts become much shorter and more natural, and it is possible to add options in the future without any need to change existing speech labels, even if the order of functions changes. Finally, there is no need for users to move the phone away from their ear to find the button to press.
A Harris survey conducted in 2003 supports the effectiveness of speech applications with the following results:
• Speech is widely used and accepted (only 7% of respondents in the survey would avoid future use of speech systems).
• Consumers reported high satisfaction with speech experiences (61% were highly satisfied with their most recent speech interaction).
• Consumers feel that speech provides many advantages over other e-Service methods (90% of respondents preferred speech to touchtone systems).
Design methodology
Developing SUIs, like most development activities, involves an iterative four-phase process:
• “Design Phase”
• “Prototype phase (‘Wizard of Oz’ testing)” on page 16
• “Test phase” on page 18
• “Refinement phase” on page 19
Because this process is iterative, you should attempt to keep the interface as fluid as possible for as long as possible.
Design Phase
In this phase, the goal is to define proposed functionality and create an initial design. This involves the following tasks:
• “Analyzing your users”
• “Analyzing user tasks”
• “Developing the conceptual design (vision clips)” on page 15
• “Making high-level decisions” on page 15
• “Making low-level decisions” on page 15
• “Defining the complete call flow” on page 15
• “Creating the initial dialog script” on page 16
• “Planning for expert users” on page 16
Analyzing your users
The first step in designing your VoiceXML applications should be to conduct user analysis to identify any user characteristics and requirements that might influence application design. For example:
• How frequently will your users use the system?
• What is their motivation for using the system?
• In what type of environment will your users use the system (quiet office, outdoors, noisy shopping mall)?
• What type of telephone connection will most of your users have (land-line, cordless, cellular)?
• Are many of your intended users non-native speakers of the language in which the application will be written?
• How comfortable are your users with automated (“self-service”) applications?
• Will your application be personalized (based on ANI information or user login)?
Analyzing user tasks
After you have identified who your users are, the next step is to determine what tasks your application should support.
Consider:
• What are the most common tasks your users will perform? What tasks are less common?
• Are your users familiar with the tasks they will need to complete?
• Will your users be able to perform these tasks by other means (in person, using a visual Web interface, by calling a customer service representative, etc.)?
• Will your users have the option of transferring to a human operator?
• What words and phrases do your users typically use to describe the tasks and items in your proposed application?
For example, tasks in a banking application could include transferring money, obtaining the current account balance, listing some number of most recent transactions, etc. You might allocate the functions as follows:
• Have the application locate, sort, and store account information, and make any routine decisions. For example, if the user attempts to transfer more money than is available, the application plays a warning message.
• Have the user confirm transactions and make any non-routine or elective decisions. For example, ask the user to confirm the amount of money and account number before the application submits a form that initiates a monetary transfer.
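The second allocation rule above can be sketched in VoiceXML 2.0. The elective decision, the confirmation, is collected before anything is submitted. This is a minimal illustration, not a complete application. The field names, the use of the built-in `number` and `digits` grammars, and the server URL `transfer.jsp` are all hypothetical. The routine balance check from the first rule is assumed to run on the server after submission.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="transfer">
    <!-- Amount in whole dollars (built-in number grammar; an assumption) -->
    <field name="amount" type="number">
      <prompt>How many dollars would you like to transfer?</prompt>
    </field>
    <!-- Destination account, spoken or keyed as a digit string -->
    <field name="toAccount" type="digits">
      <prompt>To which account number?</prompt>
    </field>
    <!-- Elective decision: the user confirms before the form is submitted -->
    <field name="confirmed" type="boolean">
      <prompt>
        Transfer <value expr="amount"/> dollars to account
        <value expr="toAccount"/>?
      </prompt>
      <filled>
        <if cond="confirmed">
          <!-- Hypothetical server-side handler that initiates the transfer -->
          <submit next="transfer.jsp" method="post" namelist="amount toAccount"/>
        <else/>
          <!-- Clearing the fields makes the form ask for both values again -->
          <clear namelist="amount toAccount confirmed"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Depending on the TTS engine, the account number may be read back as a single large number. Wrapping it in a `<say-as>` element that your engine supports is one way to have it read digit by digit.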
Developing the conceptual design (vision clips)
After identifying the tasks, the first design activity should be the conceptual development of the application. This is usually done by writing high-level scripts of proposed user-system interactions called vision clips. There is no attempt to completely define the call flow; rather, the focus is on the user-system interaction in key parts of important tasks. This design activity provides an inexpensive first step prior to any large investment of resources.
In a vision clip, designers create preliminary samples of conversations to promote discussions between the designer and the customer regarding the customer’s task and user interface expectations. Ideally, these scripts are recorded so customers can review the sound and feel of the proposed interactions. Designers of vision clips should be very familiar with the capabilities of the speech technologies, to avoid preparing vision clips that would be difficult or impossible to deploy as applications.
Making high-level decisions
The next step is to make high-level application decisions, such as selecting the appropriate user interface, barge-in style, prompt style, and help style. For details, see “Getting started—high-level design decisions” on page 19.
Making low-level decisions
After you have made the high-level decisions, you should proceed to the lower-level system decisions that address such issues as sound and feel, word choice, etc. “Getting specific—low-level design decisions” on page 43 provides information to help you make these decisions.
Defining the complete call flow
Next, you will want to outline the call flow that maps the interaction between your application and the user. For example:
• What questions do you need to ask the user?
Your application interaction should have a logical progression that takes into account typical responses, unusual responses, and any error conditions that might occur.
Creating the initial dialog script
After you have defined the call flow, you should be ready to create an initial draft of the script for the dialog between the application and the user. The script should include all of the text that will be spoken by the application, as well as expected user responses.
Planning for expert users
As the final step in the Design phase, you may want to identify the potential for expert users and begin considering where you may be able to help them cut through some of the interface to quickly perform common tasks.
Prototype phase (“Wizard of Oz” testing)
The goal of this phase is to create a prototype of the application, leaving the design flexible enough to accommodate changes in prompts and dialog flow in subsequent phases of the design.
For the first iteration, you may want to use a technique known as “Wizard of Oz” testing. This technique can be used before you begin coding, as it requires only a prototype paper script and two people: one to play the role of the user, and a human “wizard” to play the role of the computer system. Here’s how it works:
• The script should include the proposed introduction, prompts, list of global commands, and all planned self-revealing help. Consider Table 1 as an example of a script appropriate for “Wizard of Oz” testing:
Table 1. Sample script for “Wizard of Oz” testing
Title | Message type | Prompts and responses | System actions
Greeting | Intro message | Welcome to PhonePay! You can say Repeat or Help at any time. | Go to Get Account Number.
Get Account Number | Prompt | Account number? |
 | Help 1 | Your account number is on the upper-right portion of your bill. Speak only the numbers on the left side of the dash. You can ignore leading zeros. |
 | Help 2 | At any time, you can say Help, Repeat, Go Back, Main Menu, Exit, or Transfer to Agent. To continue, say or enter your account number. |
 | Caller response | <Says or enters numbers> | If input spoken, go to Confirm Account Number. If entered via DTMF, go to Get PIN Number.
Confirm Account Number | Prompt | Was that <number>? |
 | Help 1 | Please say Yes, No, or Repeat. |
 | Help 2 | At any time, you can say Help, Repeat, Go Back, Main Menu, Exit, or Transfer to Agent. To continue, say Yes or No. |
 | Caller response | Yes | Go to Get PIN Number.
 | Caller response | No | Go to Get Account Number.
• The two participants should be physically separated so that they cannot use visual cues to communicate; a partition will suffice, or you could use separate rooms and allow the people to communicate by telephone.
• The wizard must be very familiar with the script. The user should never see the script.
• The user telephones (or pretends to telephone) the wizard, who begins reading the script aloud. The user responds to the prompts, and the wizard continues the dialog based on the scripted response to the user’s utterance.
“Wizard of Oz” testing helps you fix problems in the script and task flow before committing any of the design to code. However, it cannot detect other types of usability problems, such as recognition or audio quality problems; addressing problems linked to these types of errors requires a working prototype.
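When you move from the paper script to a working prototype, each row of a script like Table 1 maps fairly directly onto VoiceXML 2.0. The following is a sketch of the Get Account Number and Confirm Account Number rows only. Help 1 and Help 2 become `<help>` handlers with `count` attributes. The DTMF-versus-voice branch uses the field's `inputmode` shadow variable. Several things here are assumptions: that the spoken word “help” is enabled through the `universals` property, the form IDs, and the placeholder for Get PIN Number. The Repeat, Go Back, Main Menu, Exit, and Transfer to Agent commands are not implemented; they would need their own grammars, for example in `<link>` elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Activate the platform's universal commands, including "help" -->
  <property name="universals" value="all"/>
  <!-- Document-scoped so the value survives transitions between forms -->
  <var name="accountNumber"/>

  <!-- Greeting: intro message, then go to Get Account Number -->
  <form id="greeting">
    <block>
      <prompt>Welcome to PhonePay! You can say Repeat or Help at any time.</prompt>
      <goto next="#getAccountNumber"/>
    </block>
  </form>

  <!-- Get Account Number -->
  <form id="getAccountNumber">
    <field name="account" type="digits">
      <prompt>Account number?</prompt>
      <!-- Help 1: played the first time the caller asks for help -->
      <help count="1">
        <prompt>
          Your account number is on the upper-right portion of your bill.
          Speak only the numbers on the left side of the dash.
          You can ignore leading zeros.
        </prompt>
      </help>
      <!-- Help 2: played on the second and later requests -->
      <help count="2">
        <prompt>
          At any time, you can say Help, Repeat, Go Back, Main Menu, Exit,
          or Transfer to Agent. To continue, say or enter your account number.
        </prompt>
      </help>
      <filled>
        <assign name="accountNumber" expr="account"/>
        <!-- Spoken input is confirmed; keyed (DTMF) input is not -->
        <if cond="account$.inputmode == 'voice'">
          <goto next="#confirmAccountNumber"/>
        <else/>
          <goto next="#getPinNumber"/>
        </if>
      </filled>
    </field>
  </form>

  <!-- Confirm Account Number -->
  <form id="confirmAccountNumber">
    <field name="correct" type="boolean">
      <prompt>Was that <value expr="accountNumber"/>?</prompt>
      <help count="1">
        <prompt>Please say Yes, No, or Repeat.</prompt>
      </help>
      <help count="2">
        <prompt>
          At any time, you can say Help, Repeat, Go Back, Main Menu, Exit,
          or Transfer to Agent. To continue, say Yes or No.
        </prompt>
      </help>
      <filled>
        <if cond="correct">
          <goto next="#getPinNumber"/>
        <else/>
          <goto next="#getAccountNumber"/>
        </if>
      </filled>
    </field>
  </form>

  <!-- Get PIN Number lies outside the scripted excerpt; placeholder only -->
  <form id="getPinNumber">
    <block><exit/></block>
  </form>
</vxml>
```

Going back to the Get Account Number form reinitializes it. As a result, a “No” at the confirmation clears the earlier answer and asks for the account number again, as the script specifies.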
Test phase
After you have incorporated the results of the “Wizard of Oz” testing, you will want to code and test a working prototype of the application. During this phase, be sure to analyze the behavior of both new and, if applicable, expert users.
Identifying recognition problems
As you proceed with the Test phase, note any consistent recognition problems. The most common cause of recognition problems is acoustic confusability among the currently active phrases. For example, both Madison and Addison are US airports. Thus, these potential user inputs to a travel application are highly confusable:
User: Flying from Madison
User: Flying from Addison
Sometimes there is nothing you can do when this happens. Other times you can try to correct the problem by:
• Using a synonym for one of the terms. For example, if the system is confusing “no” and “new,” you might be able to replace “new” with “recent,” depending on the application’s context.
• Adding a word to one or more of the choices. For the Madison/Addison airport confusion, you could make states optional in the grammar for most cities, but require the state for low-traffic airports that have acoustic confusability with higher-traffic airports.
• Planning for disambiguation by writing code that includes or accesses data about typical acoustic confusions. For example:
System: Flying from?
User: Los Angeles <not flagged as confusable>
System: Flying to?
User: Newark <flagged as confusable with New York>
System: Newark, New Jersey or New York, New York?
User: Newark, New Jersey
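The last two techniques above can be combined in one VoiceXML 2.0 form, sketched below under several assumptions. The inline SRGS grammar makes the state optional for Madison but required for Addison. A second field is guarded by a `cond` attribute, so it is visited only when the recognized destination is flagged as confusable. Because the grammars carry no semantic tags, the sketch assumes that each field's value is the recognized text (for example, `newark`). The city list is illustrative only, and only the “Flying to?” question is shown.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="destination">
    <field name="toCity">
      <prompt>Flying to?</prompt>
      <!-- No semantic tags, so the field value is the recognized text -->
      <grammar version="1.0" root="city" mode="voice">
        <rule id="city" scope="public">
          <one-of>
            <item>los angeles</item>
            <item>new york</item>
            <!-- Flagged as confusable with "new york"; see the next field -->
            <item>newark</item>
            <!-- Higher-traffic airport: the state is optional -->
            <item>madison <item repeat="0-1">wisconsin</item></item>
            <!-- Low-traffic airport confusable with Madison: state required -->
            <item>addison texas</item>
          </one-of>
        </rule>
      </grammar>
    </field>

    <!-- Visited only when the recognized city is flagged as confusable -->
    <field name="toCityExact" cond="toCity == 'newark'">
      <prompt>Newark, New Jersey or New York, New York?</prompt>
      <grammar version="1.0" root="choice" mode="voice">
        <rule id="choice" scope="public">
          <one-of>
            <item>newark <item repeat="0-1">new jersey</item></item>
            <item>new york <item repeat="0-1">new york</item></item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <!-- Replace the ambiguous result with the caller's explicit choice -->
        <assign name="toCity" expr="toCityExact"/>
      </filled>
    </field>

    <block>
      <prompt>Flying to <value expr="toCity"/>.</prompt>
    </block>
  </form>
</vxml>
```

In a real application, the list of confusable pairs would more likely come from data gathered during testing than be hard-coded in a `cond` expression.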
Identifying any user interface breakdowns
The Test phase is also where you will identify potential user interface breakdowns. Some factors you may want to analyze include:
• Percentage of users who did not successfully complete your test scenarios
• Percentage of users who transferred to a human operator when this was not the desired outcome
• Points in the application where users experienced the most difficulty
• Unexpected user behaviors
• Effectiveness of error recovery mechanisms
• Time to complete typical transactions
• Self-reported level of user satisfaction
The first round of user testing typically reveals places where the system’s response needs to be rephrased to improve usability. For this reason, system prompts and other messages should be left flexible for as long as possible, at least until after the first round of user testing.
Refinement phase
During this phase, you will update the user interface based on the results of testing the prototype. For example, you may revise prototype scripts, add tapered prompts and customizable expertise levels, and create dialogs for inter- and intra-application interactions.
Finally, you will want to iterate the Design, Prototype, Test, and Refine process. In the Test phase, include both users from previous rounds of testing and users new to the system. Ideally, the final usability test should be on a deployed system to allow evaluation of its accuracy, latency, barge-in characteristics, and quality of speech output.
Getting started—high-level design decisions
Designing a SUI involves at least two levels of design decisions. First, you need to make certain high-level design decisions regarding system-level interface properties. Only then can you get down to the details of designing specific system prompts and dialogs. The high-level decisions you need to make include:
• “Selecting an appropriate user interface” on page 20
• “Deciding on the type and level of information” on page 20
• “Choosing the barge-in style” on page 21
• “Selecting recorded prompts or synthesized speech” on page 24
• “Deciding whether to use audio formatting” on page 27
• “Using simple or natural command grammars” on page 29
• “Adopting a terse or verbose prompt style” on page 31
• “Allowing only speech input or speech plus DTMF” on page 32
• “Deciding whether to use human agents in the deployed system” on page 39
• “Choosing help mode or self-revealing help” on page 40
There is no single correct answer; the appropriate decisions depend on the application, the users, and the users’ environment(s). The remainder of this section presents the trade-offs associated with each of these decisions.
Selecting an appropriate user interface
The first decision that you must make is to select the appropriate user interface for your application. Not all applications are well-suited to a SUI; some work best with a visual interface, and others benefit from a multi-modal interface (that is, both a speech and a visual interface).
The characteristics in Table 2 can help you decide whether your application is suited to a SUI.
Table 2. When to use a speech interface
Consider using speech if:
• Users are motivated to use the speech interface because it:
  - Saves them time or money
  - Is available 24 hours a day
  - Provides access to features not available through other means
  - Allows them to remain anonymous and avoid discussing sensitive subjects with a human
• Users will not have access to a computer keyboard when they want to use the application.
• Users want to use the application in a “hands-free” or “eyes-free” environment.
• Users are visually impaired or have limited use of their hands.
Applications may not be suited to speech if:
• Users are not motivated to use the speech interface.
• The nature of the application requires a lot of graphics or other visuals (for example, maps or commerce applications for apparel).
• Users will be operating the application in an extremely noisy environment (due to simultaneous conversations, background noise, etc.).
• Users are hearing impaired, have difficulty speaking, or are in an environment that prohibits speech (for example, a courtroom).
Deciding on the type and level of information
To keep from overloading the user’s short-term memory, information presented in a SUI must generally be more concise than information presented visually. It is common to present only the most essential information initially, and then give users the opportunity to access detailed information.
For example, consider a banking application in which a user can request a list of recently cleared checks. In a visual interface, the application might return a table showing the check number, date cleared, payee name, and amount. A similar application with a speech interface might return only the check number and date cleared, and then permit the user to select a specific check number to hear the payee name and amount, if desired.
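The summary-then-details pattern from the checks example might look like the following VoiceXML 2.0 sketch. The summary speaks only the check number and date cleared. A field then lets the caller ask for one check's details. The check data is placeholder text that a server would normally generate. The `checkDetails.jsp` page, which would return the payee-and-amount dialog, is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="clearedChecks">
    <block>
      <!-- Summary only: check number and date cleared
           (placeholder data, normally generated by the server) -->
      <prompt>
        Check 1041 cleared on May 3. Check 1042 cleared on May 5.
      </prompt>
    </block>
    <!-- Details on request: payee name and amount for one check -->
    <field name="checkNumber" type="digits">
      <prompt>To hear the payee and amount, say the check number.</prompt>
      <filled>
        <!-- Hypothetical server page that returns the details dialog -->
        <submit next="checkDetails.jsp" namelist="checkNumber"/>
      </filled>
    </field>
  </form>
</vxml>
```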
Choosing the barge-in style
Enabling barge-in allows the computer and the user to speak at the same time, permitting the user’s speech to interrupt system prompts as the machine plays them.
On the surface, it might seem that enabling barge-in is always preferable to disabling barge-in. It is easy to imagine experienced users wanting to interrupt prompts (especially lengthy ones) when they know what to say. There are situations, however, in which a system with barge-in disabled will be as easy or easier to use.
Table 3 compares implementations with barge-in enabled or disabled:
Table 3. Barge-in enabled versus disabled
Barge-in enabled
Advantages:
• Experienced users can interrupt system prompts to speed up the interaction. Users can say “Quiet” to stop the prompt.
Note: For commands in languages other than US English, see the appropriate appendixes.
Disadvantages:
• Inexperienced users may inadvertently interrupt the prompt before hearing enough to form an acceptable response. You can minimize this problem by keeping system prompts short, to lessen the user’s need to barge in; if your prompts are long, you should try to present key information early in the prompt.
• When using hotword barge-in (see Table 4 on page 22), Lombard speech and the stuttering effect can be problematic. To minimize this problem, you should keep required user inputs very short. See “Controlling Lombard speech and the stuttering effect” on page 23 for more information.
Barge-in disabled
Advantages:
• Guarantees that the entire prompt text plays. This may be especially useful for applications with lots of legal notices, advertisements, or other information that you want to make sure always gets presented to the user.
• Creates a “my turn-your turn” rhythm for the dialog.
Disadvantages:
• Experienced users cannot interrupt prompts; however, if the prompts are short enough, users should not need to interrupt.
• Users may experience turn-taking errors. Keeping prompts short helps minimize this.
If you enable barge-in, you should play an initial prompt of 3 seconds or longer with barge-in disabled to give the system time to calibrate echo cancellation.
Comparing barge-in detection methods
To use barge-in effectively, it is important to understand how the system determines when to stop an interrupted prompt. For WebSphere Voice Server, the default barge-in detection method is speech.
Table 4 compares the available barge-in detection methods.
Table 4. Barge-in detection methods
hotword
Description: Audio output stops only after the system determines that the user has spoken a complete word or phrase that is valid in a currently active grammar.
Advantages: Resistant to accidental interruptions, such as those caused by coughing, muttering, or using the system in an environment with loud ambient conversation.
Disadvantages: Increased incidence of Lombard speech and the “stuttering effect” (see next section); however, you can control this somewhat by making required user responses as short as possible. The time required to recognize spoken input can cause slower system response times.
speech
Description: Audio output stops as soon as the speech recognition engine detects sound. This behavior is more typical of conversation between two humans.
Advantages: Minimizes Lombard speech, the stuttering effect, and the distortion to the first syllable of user speech that often occurs when users barge in.
Disadvantages: Susceptible to accidental interruption due to background noise, non-speech vocalizations, and speech not intended for the system.
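In VoiceXML 2.0, barge-in and its detection method are controlled by the `bargein` and `bargeintype` properties. Individual `<prompt>` elements can override both through attributes of the same names. The sketch below shows one way to combine Table 3, Table 4, and the echo-cancellation advice. It sets document-wide defaults, plays a non-interruptible opening prompt, and uses hotword detection for one prompt whose expected answer is short. Setting `bargeintype` to `speech` only restates the WebSphere Voice Server default described above. The prompt wording is illustrative, and the opening prompt is assumed to last at least 3 seconds.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Document defaults: barge-in on, detected as soon as speech is heard -->
  <property name="bargein" value="true"/>
  <property name="bargeintype" value="speech"/>

  <form id="main">
    <block>
      <!-- Opening prompt (assumed to last 3 seconds or more) with barge-in
           off, giving the platform time to calibrate echo cancellation -->
      <prompt bargein="false">
        Welcome to PhonePay! You can say Repeat or Help at any time.
      </prompt>
    </block>
    <field name="hearBalance" type="boolean">
      <!-- Hotword detection for this prompt only: audio stops after a
           complete in-grammar word, so the expected answer is one syllable -->
      <prompt bargeintype="hotword">
        Would you like to hear your account balance?
      </prompt>
    </field>
  </form>
</vxml>
```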
Controlling Lombard speech and the stuttering effect
When speaking in noisy environments, people tend to exaggerate their speech or raise their voices so others can hear them over the noise. This distorted speech pattern is known as Lombard speech (named for the researcher E. Lombard, who in 1911 was the first to report such an effect). It can occur even when the only noise is the voice of another participant in the conversation: for example, when one person tries to interrupt another, or, in the case of a voice application, when the user tries to barge in while the computer is speaking.
The “stuttering effect” may occur when a prompt keeps playing for more than about 300 ms after the user begins speaking. Unless users have undergone training with the system, they may interpret the continued playing of the prompt as evidence that the system did not hear them. In response, some users may stop what they were saying and begin speaking again, causing a stuttering effect. This stuttering makes it virtually impossible for the system to match the utterance to anything in an active grammar, so the system generally treats the input as an “out-of-grammar” utterance, even if what the user intended to say was actually in one of the active grammars.
To control Lombard speech and the stuttering effect when using hotword barge-in detection, the prompt should stop within about 300 ms after the user begins talking. Because the average time required to produce a syllable of speech is about 150-200 ms, the system design should promote short user responses (ideally no more than two or three syllables) when using hotword barge-in detection. You should also try to keep prompts as short as possible to minimize the likelihood that users will want to interrupt the prompt. If this is not possible, you should consider switching to speech barge-in detection, or in extreme cases consider disabling barge-in.
Weighing user and environmental characteristics
When deciding whether to use barge-in and which type of barge-in detection is most appropriate, you should consider how frequently users will use your application (expert users are more likely to barge in), and in what environment (quality of the telephone connection, general noise level, etc.). In general, you should enable barge-in for deployed applications. However, if echo cancellation on your telephony equipment is not good enough, it might be necessary to disable barge-in.
Minimizing the need to barge in
Even when the system permits barge-in, many users do not like to interrupt the system. To minimize the user’s need to barge in, you might consider placing short pauses (around 0.75 second) at logical points during and between prompts, such as at the end of a sentence or after each menu item. These brief pauses will give users the opportunity to begin talking without actively interrupting the system. In systems for which barge-in has been disabled, you can simulate barge-in by enabling recognition during these pauses. Be sure not to produce a turn-taking tone at the end of these “recognition windows,” because speech at these times is optional, not required.
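Both techniques in the paragraph above can be sketched in VoiceXML 2.0. The document below contains two alternative forms; an application would normally use only one. The first form keeps barge-in enabled and inserts 750 ms `<break>` elements after each menu item. The second form has barge-in disabled and simulates it with recognition windows. Each item becomes its own field whose `timeout` is 750 ms. A silent window is treated as a normal outcome: the `<noinput>` handler quietly marks the field as done, with no message and no turn-taking tone. The grammar file `menu.grxml`, the routing page `menu.jsp`, and the menu wording are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Alternative 1, barge-in enabled: pauses invite the caller to speak -->
  <form id="menuWithPauses">
    <field name="choice">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <prompt bargein="true">
        Checking. <break time="750ms"/>
        Savings. <break time="750ms"/>
        Credit card. <break time="750ms"/>
        Transfer to agent.
      </prompt>
    </field>
  </form>

  <!-- Alternative 2, barge-in disabled: each item is its own field, so
       recognition is active during a 750 ms window after the item plays -->
  <form id="menuWithWindows">
    <var name="choice"/>
    <field name="item1">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <prompt bargein="false" timeout="750ms">Checking.</prompt>
      <!-- Silence is not an error: mark the field done, no message or tone -->
      <noinput><assign name="item1" expr="'none'"/></noinput>
      <filled><assign name="choice" expr="item1"/><goto nextitem="done"/></filled>
    </field>
    <field name="item2">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <prompt bargein="false" timeout="750ms">Savings.</prompt>
      <noinput><assign name="item2" expr="'none'"/></noinput>
      <filled><assign name="choice" expr="item2"/><goto nextitem="done"/></filled>
    </field>
    <field name="item3">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <prompt bargein="false" timeout="750ms">Credit card.</prompt>
      <noinput><assign name="item3" expr="'none'"/></noinput>
      <filled><assign name="choice" expr="item3"/><goto nextitem="done"/></filled>
    </field>
    <!-- Last item: normal timeout and default no-input handling -->
    <field name="item4">
      <grammar src="menu.grxml" type="application/srgs+xml"/>
      <prompt bargein="false">Transfer to agent.</prompt>
      <filled><assign name="choice" expr="item4"/></filled>
    </field>
    <block name="done">
      <!-- Hypothetical server page that routes the caller -->
      <submit next="menu.jsp" namelist="choice"/>
    </block>
  </form>
</vxml>
```

The per-item fields make the second form longer than the first. That cost is one reason to prefer enabled barge-in with pauses when your echo cancellation allows it.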
Using audio formatting
If you need to temporarily disable barge-in (using <prompt bargein="false">), such as while the system reads legal notices or advertisements, you may want to use a unique background sound, tone, or prompt as an indicator. For guidance, see “Applying audio formatting” on page 28.
If you disable barge-in, consider playing a tone to signal the user when it is time to speak. The introductory message should explicitly tell users to speak only after this “turn-taking” tone.
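A minimal VoiceXML 2.0 sketch of both suggestions follows. A non-interruptible notice is preceded by a distinctive sound. The introductory message tells callers to wait for the tone, and a turn-taking tone ends the prompt that expects an answer. The audio file names are hypothetical, and the notice wording is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="terms">
    <block>
      <!-- Introductory message explains the turn-taking tone -->
      <prompt bargein="false">Please speak only after the tone.</prompt>
      <!-- Non-interruptible notice, marked by a distinctive sound -->
      <prompt bargein="false">
        <audio src="audio/notice_marker.wav"/>
        Legal notice text goes here.
      </prompt>
    </block>
    <field name="accept" type="boolean">
      <prompt bargein="false">
        Do you accept these terms?
        <!-- Turn-taking tone: the caller's cue to speak -->
        <audio src="audio/turn_tone.wav"/>
      </prompt>
    </field>
  </form>
</vxml>
```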
Note: The use of tones to signal user input is somewhat controversial, with some designers avoiding tones based on a belief that tones are unnatural in speech and annoying to users. Others contend that effective computer speech interfaces need not perfectly mimic human conversation, and that a well-designed tone can promote clear and efficient turn-taking without annoyance. For guidance in creating an effective turn-taking tone, see “Designing audio tones” on page 27.
Wording prompts
For systems without barge-in, make prompts as concise as possible. If a prompt must be relatively long, place the key information toward the end of the prompt to discourage users from speaking before their turn.
You can do the same for systems with barge-in, assuming your prompts are relatively short. If the prompts are long, you may decide to move the key information to the beginning of the prompt so users know what input to provide if they interrupt the prompt.
Selecting recorded prompts or synthesized speech
Synthesized speech (text-to-speech or TTS) is useful as a placeholder during application development, or when the data to be spoken is “unbounded” (not known in advance), which makes it impossible to prerecord.
When deploying your applications, however, you should plan to use professionally recorded prompts whenever possible. Users expect commercial systems to use high-quality recorded speech, and only recorded speech can guarantee highly natural pronunciation and prosody.
Creating recorded prompts
The following guidelines will help you generate high-quality recorded prompts:
• Use professional voice talent, quality recording equipment, and a suitable recording environment.
• Maintain consistency in microphone placement and recording area.
• If prompts contain long numbers, or if many of the application’s users are not native speakers of the language in which the application speaks, consider slowing down the speech or exaggerating natural pauses.
• If you are planning on disabling barge-in, aggressively trim recorded prompts to remove the beginning and ending silences. If you are planning on enabling barge-in, or if another speech segment will follow immediately, trim the beginning aggressively, but leave silence at the end that is appropriate for the ending punctuation (500 ms for final punctuation, 250 ms for non-final punctuation). Otherwise, leave as little silence as possible.
• As a general rule, use only one voice. When using multiple voices, have a clear design goal (for example, a female voice for introduction and prompts, and a male voice for menu choices). For a consistent sound, you should record your own messages to handle the <error>, <cancel>, and <help> events.
• If a voice segment will appear in phrases with different intonations, be sure to record that segment for each intonation. For example, suppose the system will seek confirmation of a telephone number using the phrase “Was that four three three <pause> five five six three?” The “three” that appears before the pause should have a slightly falling pitch, but the “three” that appears before the question mark should have a rising pitch. The “three” that appears between two other numbers should have a steady pitch. This suggests that you will need one recording for each of the three intonations to obtain the highest-quality speech output. Note that the development effort required for this might not be appropriate for every application.
• Be aware of the appropriate stress to use in each segment that you plan to record. If the appropriate stress point is not the last open-class item (which is either a noun or a verb) in the sentence, make a note of where the speaker should place the stress.
• If you are recording segments that the application will play sequentially (in other words, will splice), be sure to choose the splice points carefully. If possible, choose splice points at natural pause points. Avoid splice points that separate articles such as “a,” “an,” and “the” from the following word (or any other combination that speakers normally run together).
• If you intend to translate the application into other languages, plan ahead when defining the audio segments to record. You might need to seek assistance from a native speaker of the target language. In general, try to avoid defining audio segments to record that are isolated nouns, because in many languages the correct form for determiners (in English, “a,” “an,” and “the”) depends on the following noun. You should be aware that there might be other contextual dependencies that are important in the target language. Some of the known issues are gender sensitivity, ordering of recorded segments, and plurality. Good planning early in the definition of audio segments can prevent unnecessary rework during translation.
When using recorded prompts, you can improve system performance by prefetching and caching the audio files. See “Fetching and caching resources for improved performance” on page 124.
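The following VoiceXML 2.0 sketch brings together several of the guidelines above. It defines document-level handlers for the help, cancel, and error events that play recordings made by the same voice talent as the prompts. It also uses the `fetchhint` and `maxage` attributes of `<audio>` to request prefetching and to accept a cached copy up to one day old. The text inside each `<audio>` element is a TTS fallback that plays only if the recording cannot be fetched. All file names and message wording are hypothetical. Check your platform's documentation for its actual caching behavior.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Activate the platform's universal commands (help, cancel, exit) -->
  <property name="universals" value="all"/>

  <!-- Handlers recorded by the same voice talent as the prompts -->
  <help>
    <audio src="audio/help.wav" fetchhint="prefetch">
      You can say Help or Exit at any time.
    </audio>
    <reprompt/>
  </help>
  <catch event="cancel">
    <audio src="audio/cancel.wav" fetchhint="prefetch">Canceled.</audio>
    <reprompt/>
  </catch>
  <error>
    <audio src="audio/error.wav" fetchhint="prefetch">
      Sorry, a system problem has occurred. Goodbye.
    </audio>
    <exit/>
  </error>

  <form id="main">
    <field name="account" type="digits">
      <prompt>
        <!-- Prefetch the recording; accept a cached copy up to one day old -->
        <audio src="audio/account_number.wav" fetchhint="prefetch" maxage="86400">
          Account number?
        </audio>
      </prompt>
    </field>
  </form>
</vxml>
```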
For DTMF prompts (for example, “For checking, press 1. For savings, press 2. To transfer to an agent, press 3.”), use the following timing guidelines:
• Use a 500 ms pause between items.
• Use a 250 ms pause before “press”.
• Leave no detectable pause after “for”, “to”, or “press”.
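When the DTMF menu is rendered by TTS, these timings can be approximated with SSML `<break>` elements, as in the following VoiceXML 2.0 sketch. Commas are omitted before “press” because the TTS engine might add its own pause at punctuation. For the same reason, the pause after each period may run somewhat longer than 500 ms, so tune by ear. With recorded prompts, the pauses would normally be built into the recordings, or inserted with `<break>` between `<audio>` segments. The inline keypad grammar is a minimal example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="dtmfMenu">
    <field name="choice">
      <!-- Keypad-only grammar accepting 1, 2, or 3 -->
      <grammar mode="dtmf" version="1.0" root="key">
        <rule id="key" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
            <item>3</item>
          </one-of>
        </rule>
      </grammar>
      <!-- 250 ms before "press"; none after "for", "to", or "press";
           500 ms between items -->
      <prompt>
        For checking <break time="250ms"/> press 1.
        <break time="500ms"/>
        For savings <break time="250ms"/> press 2.
        <break time="500ms"/>
        To transfer to an agent <break time="250ms"/> press 3.
      </prompt>
    </field>
  </form>
</vxml>
```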
For speech prompts (for example, “Select checking, savings, or transfer to agent.” or “To work with your checking account say checking.”), use the following timing guidelines:
• Use a 750 ms pause between items when there are more than three items. When there are only two or three items, do not introduce any exaggerated pauses; speak the phrase as a normal sentence.
• Use a 250 ms pause before “say” or “select”.
• Leave no detectable pause after “say” or “select”.
For a mixture of both DTMF and speech prompts, use the following timing guidelines:
• Use a 300 to 500 ms pause after an informational message that precedes the presentation of a menu.
• For longer messages, use 250 ms for a comma-type pause and 500 ms for a period-type pause.
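The speech-prompt and mixed-prompt timings can be sketched the same way in VoiceXML 2.0. An informational message is followed by a 400 ms pause, within the 300 to 500 ms range. That pause also covers the 250 ms pause before “say”. Because the menu has more than three items, each item is separated by 750 ms. The informational text and the grammar file `services.grxml` are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="speechMenu">
    <field name="service">
      <!-- Hypothetical grammar file for the four choices -->
      <grammar src="services.grxml" type="application/srgs+xml"/>
      <prompt>
        Your payment has been recorded.
        <!-- 300 to 500 ms after an informational message before a menu -->
        <break time="400ms"/>
        <!-- More than three items, so 750 ms between them -->
        Say checking <break time="750ms"/>
        savings <break time="750ms"/>
        credit card <break time="750ms"/>
        or agent.
      </prompt>
    </field>
  </form>
</vxml>
```

With only two or three items, you would drop the 750 ms breaks and let the engine speak the menu as a normal sentence, for example “Say checking or savings.”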
Using TTS prompts
Although recorded prompts are best for many applications, it is important to keep in mind that it is easier to maintain and modify an application that uses TTS prompts. For this reason, you should typically use TTS prompts during development.
When you are ready to deploy your application, use recorded prompts when possible. If part of a sentence requires production via TTS, it is generally better to generate that entire sentence with TTS to avoid the jarring juxtaposition of recorded and artificial speech. It is also possible to design sentences to position the dynamic content at the end, and to play the dynamic content following a short pause to separate the dynamic TTS content from the static recorded content. For now, designers should be cautious in using this approach, because it isn’t clear whether people would generally prefer hearing all TTS or this type of combination of recorded and TTS output.
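The mixed approach described above, with recorded static content first and a short pause before the dynamic TTS content at the end, might look like the following VoiceXML 2.0 sketch. The 250 ms pause length, the audio file name, and the `payee` value are assumptions; in practice the value would come from a back-end system. Given the caution above, the all-TTS alternative simply replaces the `<audio>` and `<break>` elements with the plain text of the sentence.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Hypothetical dynamic value; in practice it comes from a back-end system -->
  <var name="payee" expr="'Acme Plumbing'"/>
  <form id="payeeConfirmation">
    <block>
      <prompt>
        <!-- Static part: professionally recorded (TTS text is the fallback) -->
        <audio src="audio/payment_goes_to.wav">Your payment goes to</audio>
        <!-- Short pause separating recorded speech from synthesized speech -->
        <break time="250ms"/>
        <!-- Dynamic part, placed at the end of the sentence: synthesized -->
        <value expr="payee"/>
      </prompt>
    </block>
  </form>
</vxml>
```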
Handling unbounded data: If the information that the application needs to speak is unbounded, you will need to use TTS. Examples of unbounded information include:
• Telephone directories
• E-mail messages
• Frequently updated lists of employee or customer names, movie titles, or other proper nouns
• Up-to-the-minute news stories
Improving TTS output: You can improve the quality of synthesized speech output by using SSML to provide additional information in the input text. For example:
• You can improve the TTS engine’s processing of numerical constructs by using the <say-as> element to specify the desired pronunciation.
• You can improve the TTS engine’s processing of uncommon names by using the <phoneme> tag.
v Forsyn