VoiceXML Programmer's Guide


Mountain View, CA 94043 Part No. 520-0001-02


Tag Summary
Tag Index
Tag Descriptions: <assign>, <audio>, <bevocal:connect>, <bevocal:dial>, <bevocal:disconnect>, <bevocal:enroll>, <bevocal:foreach>, <bevocal:hold>, <bevocal:listen>, <bevocal:register>, <bevocal:verify>, <bevocal:whisper>, <block>, <break>, <catch>, <choice>, <clear>, <data>, <disconnect>, <div>, <dtmf>, <else>, <elseif>, <emp>, <error>, <example>, <exit>, <field>, <filled>, <foreach>, <form>, <goto>, <grammar>, <help>, <if>, <initial>, <item>, <lexicon>, <link>, <log>, <mark>, <menu>, <meta>, <metadata>, <noinput>, <nomatch>, <object>, <one-of>, <option>, <p>, <paragraph>, <param>, <phoneme>, <prompt>, <property>, <pros>, <prosody>, <record>, <reprompt>, <rethrow>, <return>, <rule>, <ruleref>, <s>, <say-as>, <sayas>, <script>, <send>, <sentence>, <speak>, <sub>, <subdialog>, <submit>, <tag>, <throw>, <token>, <transfer>, <value>, <var>, <voice>, <vxml>

12. Properties

Property Summary
Property Index
Property Descriptions: audiofetchhint, audiomaxage, audiomaxstale, bargein, bargeintype, bevocal.audio.capture, bevocal.audio.outputvolume, bevocal.dtmf.flushbuffer, bevocal.fetchaudio.allfetches, bevocal.fetchaudio.extend, bevocal.fetchaudio.flushqueue, bevocal.fetchaudio.sounds, bevocal.goback, bevocal.grammar.interpretationtype, bevocal.grammar.phoneticpruning, bevocal.grammar.weightfactor, bevocal.grammar.wordtransitionpenalty, bevocal.hotwordmax, bevocal.hotwordmin, bevocal.incrementErrorOnNSP, bevocal.locale, bevocal.logging, bevocal.maxdialogerrors, bevocal.maxerrors, bevocal.maxinterpretations, bevocal.mingoback, bevocal.securelogging.enabled, bevocal.securelogging.key, bevocal.security.key, bevocal.sounds.listening, bevocal.sounds.maskrecognitionlatency, bevocal.sounds.recognition, bevocal.transfer.terminatetones, bevocal.utterance.prefix, bevocal.voice.name, bevocal.vxml.maxrecognitionlatency, caching, completetimeout, confidencelevel, datafetchhint, datamaxage, datamaxstale, documentfetchhint, documentmaxage, documentmaxstale, fetchaudio, fetchaudiodelay, fetchaudiominimum, fetchtimeout, grammarfetchhint, grammarmaxage, grammarmaxstale, incompletetimeout, inputmodes, interdigittimeout, maxnbest, maxspeechtimeout, recordutterance, recordutterancetype, scriptfetchhint, scriptmaxage, scriptmaxstale, ssmlfetchhint, ssmlmaxage, ssmlmaxstale, sensitivity, speedvsaccuracy, termchar, termtimeout, timeout, universals

13. Variables

Variable Summary
Variable Index
Variable Descriptions: _event, _message, application.lastaudio$, application.lastresult$, session.bevocal.timeincall, session.bevocal.version, session.iidigits, session.telephone.ani, session.telephone.dnis

JavaScript Constants
bevocal.outboundrequestid
bevocal.sessionid
_addHeader
bevocal.cookies.addClientCookie
bevocal.cookies.deleteClientCookie
bevocal.cookies.getClientCookie
bevocal.cookies.getClientCookies
bevocal.enroll.removeEnrolledPhrase
bevocal.getProperty
bevocal.getVersion
bevocal.log
bevocal.soap.serviceFromWSDL
bevocal.soap.serviceFromEndpoint
bevocal.soap.locateService
bevocal.soap.SoapException
bevocal.soap.SoapFault
bevocal.soap.FaultDetails


Preface

VoiceXML is a markup language for writing telephone-based speech applications. This document describes BeVocal VoiceXML, which is compliant with the W3C VoiceXML Version 2.0 Specification.

Audience

This document is for software developers using the BeVocal Café development environment. It assumes you are familiar with the basic concepts of HTML.

Conventions

Italic font is used for:

• Introducing terms that will be used throughout the document
• Emphasis

Bold font is used for headings.

Fixed width font is used for:

• Code examples
• Tags and attributes
• Values or text that must be typed as shown
• Filenames and pathnames

Italic fixed width font is used for:

• Variables
• Prototypes or templates; what you actually type will be similar in format, but not the exact same characters as shown

How to Use This Guide

Part I of this guide explains how to use VoiceXML features. A new application developer typically reads these chapters completely and in order.

• Chapter 1, “Getting Started” introduces VoiceXML and its major features.

• Chapter 2, “Forms” describes VoiceXML forms.

• Chapter 3, “Event Handling” describes events that can be thrown during the execution of a VoiceXML application and how events are handled.

• Chapter 4, “Fetching and Caching Resources” explains how an application can control the way VoiceXML documents and other resources are fetched and cached.


• Chapter 5, “Using Multiple-Recognition” describes how the BeVocal VoiceXML interpreter can provide multiple recognition results.

Part II of this guide explains how to use Extended VoiceXML features. A new application developer typically reads only the chapters that are relevant to the application being built.

• Chapter 6, “Controlling Outbound Calls” describes the BeVocal VoiceXML call-control features, an extension to VoiceXML.

• Chapter 7, “Go-Back Facility” describes the BeVocal VoiceXML go-back facility, an experimental extension to VoiceXML.

• Chapter 8, “TTS and Recorded Voice Selection” describes the BeVocal VoiceXML TTS and Recorded Voice Selection facility, an experimental extension to VoiceXML.

• Chapter 9, “Dynamic SSML” describes the BeVocal VoiceXML Dynamic SSML facility, an experimental extension to VoiceXML.

• Chapter 10, “SOAP Client Facility” describes the BeVocal VoiceXML SOAP Client facility, an experimental extension to VoiceXML.

Part III of this guide provides reference descriptions of the various components of the VoiceXML language. Application developers typically do not read these chapters from start to finish, but instead use them to look up information about the various tags, properties, and so on.

• Chapter 11, “Tags” describes the tags that make up VoiceXML.

• Chapter 12, “Properties” describes the properties that can be set to control the behavior of a VoiceXML application.

• Chapter 13, “Variables” describes predefined variables that are available in VoiceXML applications.

• Chapter 14, “JavaScript Functions and Objects” describes predefined JavaScript functions that are available in VoiceXML applications.

References

For additional or related information, you can refer to:

• VoiceXML Version 2.0 Specification. VoiceXML Forum. (http://www.w3c.org/TR/voicexml20)

• VoiceXML Tag Summary. BeVocal. (http://cafe.bevocal.com/docs/vxml_summary/index.html)

• Grammar Reference. BeVocal. (http://cafe.bevocal.com/docs/grammar/index.html)

• JavaScript Quick Reference. BeVocal.

PART 1 Using VoiceXML

This part explains how to use VoiceXML features:

• Chapter 1, “Getting Started”
• Chapter 2, “Forms”
• Chapter 3, “Event Handling”
• Chapter 4, “Fetching and Caching Resources”
• Chapter 5, “Using Multiple-Recognition”

1 Getting Started

VoiceXML is a markup language derived from XML for writing telephone-based speech applications. Users call applications by telephone. They listen to spoken instructions and questions instead of viewing a screen display; they provide input using the spoken word and the touchtone keypad instead of entering information with a keyboard or mouse.

This chapter describes:

• VoiceXML
• User Interaction
• Flow of Execution
• Collecting Input and Playing Prompts

VoiceXML

Just as a web browser renders HTML documents visually, a VoiceXML interpreter renders VoiceXML documents audibly. You can think of the VoiceXML interpreter as a telephone-based voice browser. As with HTML documents, VoiceXML documents have web URIs and can be located on any web server. Yet a standard web browser runs locally on your machine, whereas the VoiceXML interpreter is run remotely—at the VoiceXML hosting site, for example. And you use your telephone to access the VoiceXML interpreter.

Environment

In order to support a telephone interface, the VoiceXML interpreter runs within an execution environment that includes a telephony component, a text-to-speech (TTS) speech-synthesis component, and a speech-recognition component.

The VoiceXML interpreter transparently interacts with these infrastructure components as needed. For example:

• Text strings in output elements are rendered using TTS.

• Connection issues (picking up the incoming call, detecting a hang-up, transferring a call) are handled by the telephony component.

• Listening to spoken input from the user and identifying its meaning is handled by the speech-recognition component.

Tags and Elements

VoiceXML uses markup tags and plain text. A tag is a keyword enclosed by the angle bracket characters (< and >). A tag may have attributes inside the angle brackets. Each attribute consists of a name and a value, separated by an equal sign (=); the value must be enclosed in quotes.

Tags occur in pairs; corresponding to the start tag <keyword> is the end tag </keyword>. Between the start and end tag, other tags and text may appear. Everything from the start tag to the end tag is called an element. For example, the following three lines constitute a prompt element:

<prompt>
What is your telephone number?
</prompt>

If there are no other tags or text between the start and end tag, a syntactic shorthand is permitted. You can precede the closing angle bracket (>) of the start tag with a slash (/) and omit the end tag. For example, instead of writing a value element as:

<value expr="result"></value>

you can use the shorthand notation:

<value expr="result"/>

Because the syntax specifies the end of each element, the VoiceXML interpreter can check that the entire document has been received.

If one element contains another, the containing element is called the parent element of the contained element. The contained element is called a child element of its containing element. The parent element may also be called a container.

Although both HTML and VoiceXML use markup tags, the two languages use tags differently. Whereas the markup tags in HTML describe how to render the data, the markup tags in XML (and consequently in VoiceXML) describe the data itself. This allows an XML interpreter or browser to display the data in whatever way is appropriate.

BeVocal VoiceXML generally complies with the VoiceXML 2.0 Specification. It also includes several handy extensions that you can use if you choose. VoiceXML Tag Summary lists any differences between BeVocal VoiceXML and the standard.

Simple Example

In VoiceXML, the <form> element is analogous to an HTML form that contains items for the user to enter. In VoiceXML forms, each logical piece of information to be collected from the user is identified with a <field> tag.

The form in the following example collects one piece of information from the user. Once this information is obtained, execution proceeds to the field’s <filled> element. Other tags used in the example include the following:

• The <script> tag specifies a block of client-side JavaScript code.

• The <var> tag declares a variable to be used within the form.

• The <prompt> tag produces audio output for the user.

• The <assign> tag assigns a value to a variable.

• The <value> tag evaluates an expression and produces spoken output of the result.

This example requests a number from the caller, computes the factorial of that number, and repeats the answer to the caller.

<?xml version="1.0" ?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
  "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <script>
    <![CDATA[
      function factorial(n) {return (n <= 1) ? 1 : n * factorial(n-1);}
    ]]>
  </script>
  <form id="computefactorial">
    <var name="result"/>
    <field name="num" type="number">
      <prompt>please say a number </prompt>
      <filled>
        <assign
          name="result"
          expr="factorial(num)" />
        <prompt>The factorial of <value expr="num"/> is <value expr="result"/> </prompt>
      </filled>
    </field>
  </form>
</vxml>

Tip:

• VoiceXML conforms to XML standards; the formats for VoiceXML tags are more strictly defined than are the formats in HTML. If you are used to HTML and not XML, remember that all container elements must be closed explicitly, either with an end tag or with the shorthand notation.

VoiceXML contains no explicit instructions about how to present the prompt, “please say a number” or how to present the results. In theory, these could be presented textually on a different kind of browser.

In practice, the example document is run as a telephone application and results in conversations such as the following.

Application: Please say a number.

User: 4

Application: The factorial of 4 is 24.


Documents

An executable VoiceXML file is called a document. The VoiceXML interpreter loads a document file to execute it.

Every VoiceXML document must start with header information that conforms to the XML standard:

<?xml version="1.0" ?>
<!DOCTYPE vxml
  PUBLIC "-//W3C/DTD VoiceXML 2.0//EN"
  "http://www.w3.org/TR/voicexml20/vxml.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

These headers describe the language in which the document is written:

• The first tag indicates that the document is an XML document. This tag is required.

Always use this tag exactly as specified; it must be the very first characters in the document. To be a legal XML document, the first 4 characters of any XML file (including a VoiceXML document) must be:

<?xm

No characters, not even whitespace characters such as space or newline, can come before these 4 characters in a VoiceXML document.

• The second tag identifies the Document Type Definition (DTD), which is used to validate that the contents represent well-formed VoiceXML. This tag is optional.

A DTD describes the format of the data that might appear in an XML document. That is, the DTD defines the valid tags by specifying what attributes each tag can have and what child tags or other content each tag can contain.

If your document contains only standard VoiceXML elements, you can use the DTD shown above. If you use any of the BeVocal VoiceXML extensions, use the BeVocal DTD instead by replacing the DOCTYPE declaration with the following:

<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN" "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">

You should include a DOCTYPE declaration during development, as it allows better error checking by the interpreter. You may remove it during deployment for performance.

• The third tag identifies the version of VoiceXML used in this document and the designated namespace for VoiceXML. This tag is required.

For VoiceXML 2.0, this tag should always include these 2 attributes. It can also include optional attributes described in the section on the <vxml> tag.

Apart from headers and possibly comments, all the content in a VoiceXML document is contained within a <vxml> element, that is, between the <vxml> start tag and the </vxml> end tag.

Applications

A VoiceXML application consists of one or more documents. Any multidocument application has a single application root document. Each document in an application identifies the application root document with the application attribute of the <vxml> tag:

<vxml
  version="2.0"
  xmlns="http://www.w3.org/2001/vxml"
  application="myAppRoot.vxml" >

Whenever the interpreter executes a document, it loads that document. If the document specifies an application root document, the root document is also loaded.


You can use an application root document for global items or interactions that you want to be active throughout the application. For example, suppose the application root document myAppRoot.vxml declares a variable named company that has an initial value of BeVocal:

...

This variable has application scope. That is, any document in the application can use the variable.
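As a minimal sketch of this arrangement, the root document might declare the variable and another document in the same application might read it. The second document's name (leaf.vxml), its form id, and the prompt wording are assumptions made for illustration; only myAppRoot.vxml, company, and the value BeVocal come from the text above.

```xml
<?xml version="1.0" ?>
<!-- myAppRoot.vxml: the application root document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Declared directly under <vxml> in the root document,
       so the variable has application scope -->
  <var name="company" expr="'BeVocal'"/>
</vxml>
```

```xml
<?xml version="1.0" ?>
<!-- leaf.vxml (hypothetical name): another document in the application -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="myAppRoot.vxml">
  <form id="greeting">
    <block>
      <!-- Reads the variable declared in the root document -->
      <prompt>Welcome to <value expr="application.company"/>.</prompt>
    </block>
  </form>
</vxml>
```

The expr value is a JavaScript expression, which is why the string BeVocal is wrapped in single quotes inside the attribute.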

Dialogs

Within a document, a user interacts with dialogs, in which the application produces auditory output, typically asking for information. The user provides input by speaking or pressing keys on the telephone. User speech must be recognized and its meaning interpreted. The telephone key input is interpreted as a sequence of tones in the Dual Tone Multifrequency (DTMF) signalling system.

VoiceXML has two kinds of dialogs: forms and menus.

• A form interacts with the user to fill in a number of fields. Every field has an associated variable, called its input-item variable, or just input variable. Initially, the variable has a value of undefined. It is filled in when the speech-recognition engine recognizes a valid response in a user utterance. Note: In VoiceXML 1.0, an input-item variable was known as a field-item variable.

• A menu presents the user with a number of choices; it transitions to a different dialog based on the user’s selection.

Forms

The VoiceXML <form> tag defines a form and the <field> tag defines a field in a form. You specify the name of the input variable with the name attribute of the <field> tag. You can use the input variable’s name in expressions to refer to the stored value.

In the example in “Simple Example”, the input variable is named num:

<field name="num" type="number">

When the user says the number, the number is stored in the num variable. Then the interpreter proceeds to execute the field’s <filled> element. Here, the num variable in the <assign> element is evaluated before being passed as the parameter to the factorial function.

<assign

name="result"

expr="factorial(num)" />

Menus

The <menu> tag defines a menu; each choice consists of a <choice> element. The next attribute of a <choice> element specifies the destination dialog to which the interpreter should transition when the user selects that choice. If a <form> or <menu> element is to be the destination of a transition, the id attribute for the destination dialog should specify a unique identifier.

For example, the following menu consists of three choices.

<menu>
  <prompt>
    Please choose one of <enumerate/>
  </prompt>
  <choice next="#MovieForm">
    local movies
  </choice>
  <choice next="localBroadcast.vxml#RadioForm">
    local radio stations
  </choice>
  <choice next="http://www.nationTV.org/tv.vxml">
    national TV listings
  </choice>
</menu>

The prompt in this menu includes an <enumerate> tag. This tag lets you set up a template for an automatically generated description of the choices. By default, the <enumerate> template simply lists all the choices. In the above example, the prompt is “Please choose one of local movies, local radio stations, national TV listings.”

The destination dialog specified by the next attribute can be in the current document or in a different document:

• If the user says “local movies”, the interpreter transitions to the dialog named MovieForm in the same document.

• If the user says “local radio stations”, the interpreter transitions to the dialog named RadioForm in the document localBroadcast.vxml.

• If the user says “national TV listings”, the interpreter transitions to the first dialog in the document tv.vxml in the national TV web site.

Properties

You can set properties to customize the behavior of the interpreter. The <property> tag specifies the property to set and the value for that property.

Various properties control how the interpreter behaves when prompting the user for input, recognizing speech or DTMF input, and fetching documents and other resources. For additional information, see Chapter 12, “Properties”.
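The following sketch shows what setting properties might look like in practice. The form, the field, and the particular values are assumptions chosen for illustration; timeout and bargein are property names listed in the Properties chapter.

```xml
<form id="order">
  <!-- Set on the form: applies to the items in this form -->
  <property name="timeout" value="5s"/>
  <field name="quantity" type="number">
    <!-- Set on the field: applies while this field is being executed -->
    <property name="bargein" value="false"/>
    <prompt>How many would you like?</prompt>
  </field>
</form>
```

Placing a <property> element in a narrower container limits the setting to that container, so a single field can behave differently from the rest of the form.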

Grammars

The speech-recognition engine uses grammars to interpret user input. See the Grammar Reference for details on creating and using grammars. Here, we only cover a portion of the relevant information. Each field in a form can have a grammar that specifies the valid user responses for that field. An entire form can have a grammar that specifies how to fill multiple input variables from a single user utterance. Each choice in a menu has a grammar that specifies the user input that can select the choice.


Built-in Grammars

The following basic grammars are built into all standard VoiceXML interpreters:

Grammar Type    Description
boolean         Recognizes a positive or negative response.
currency        Recognizes an amount of money, in dollars.
date            Recognizes a calendar date.
digits          Recognizes a sequence of digits.
number          Recognizes a number.
phone           Recognizes a telephone number adhering to the North American Dialing Plan (with no extension).
time            Recognizes a clock time.

BeVocal VoiceXML contains additional built-in grammars as an extension to standard VoiceXML:

Grammar Type    Description
airport         Recognizes an airport name or code, such as DFW or Dallas-Fort Worth.
airline         Recognizes an airline name or code, such as AA or American Airlines.
equity          Recognizes a company symbol or full name, such as IBM or Cisco Systems.
citystate       Recognizes US city and state names, for example, “Sunnyvale, California”.
stockindex      Recognizes the names of the major US stock indexes, such as “Nasdaq”.
street          Recognizes a street name (with or without street number).

You can reference a built-in grammar in either of two ways:

• You can use a standard built-in grammar as the type attribute of a <field> element. The example in “Simple Example” uses the built-in number grammar:

<field name="num" type="number">

This means that the speech-recognition engine tries to interpret what the user says as a number.

• You can use any built-in grammar (standard or BeVocal VoiceXML extension) in a <grammar> element by specifying the src attribute with a URI of the form:

builtin:grammar/typeName

For example, builtin:grammar/digits refers to the built-in digits grammar.

Application-Defined Grammars

Although the built-in grammars can be useful, you typically need to define your own grammars. An application-defined grammar can be specified in the following forms:

• Augmented BNF (ABNF) form of the W3C Speech Recognition Grammar Format
• XML form of the W3C Speech Recognition Grammar Format
• Nuance Grammar Specification Language (GSL)
• Java Speech Grammar Format (JSGF)


A simple grammar can be defined in the document. An inline grammar is defined within the <grammar> element itself. For example, the following inline ABNF grammar matches the words “add” and “subtract”.

<field name="operator">
  <grammar>
    #ABNF 1.0;
    root $op;
    $op = add | subtract;
  </grammar>

...

With this grammar, if the user says “add,” the input variable operator is set to add.

More complex grammars can be written externally. An external grammar is defined in a file separate from the VoiceXML document file and is referenced by the src attribute of the <grammar> element. For example, the following field uses a grammar rule named Colors in an external XML grammar defined in the file partGrammar.grxml.

<field name="part">
  <grammar src="http://www.mySite/partGrammar.grxml#Colors"/>
  ...

The named rule (Colors in the preceding example) is the one the interpreter will use to start recognition. The specified file may include other grammar rules, which may be used as subrules of this rule.

The grammar for a menu choice can be specified explicitly with a <grammar> child of the <choice> element. Alternatively, a grammar can be generated automatically from the choice text.

If the accept attribute of the <menu> tag is set to approximate, the user can say a subset of the words in the choice text to select that choice. Adding this attribute to the preceding example allows the user to say “TV listings” or just “TV” to select the third choice:

<menu accept="approximate">
  ...
  <choice ...>
    national TV listings
  </choice>
</menu>

Note that the words must be spoken in the correct order; “listings, TV” would not be recognized. If you want some choices to be matched exactly and others to allow a subset of the words, you can specify the accept attribute on individual <choice> elements.
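A per-choice setting might look like the following sketch, which reuses two of the choices from the earlier menu. Leaving the first choice without an accept attribute assumes the default of exact matching, as in standard VoiceXML 2.0.

```xml
<menu>
  <prompt>Please choose one of <enumerate/></prompt>
  <!-- No accept attribute: the caller must say the full choice text -->
  <choice next="#MovieForm">local movies</choice>
  <!-- Approximate matching for this choice only:
       "TV listings" or "TV" also selects it -->
  <choice next="http://www.nationTV.org/tv.vxml" accept="approximate">
    national TV listings
  </choice>
</menu>
```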

Active Grammars

The speech-recognition engine uses active grammars to interpret user input. A field grammar is active whenever the interpreter is executing that field. A menu-choice grammar is active whenever the interpreter is executing the containing menu. A form grammar is active whenever the interpreter is executing the containing form.

A form grammar or the collection of choice grammars in a menu can optionally be made active at higher scopes:

• A grammar with document scope is active whenever the interpreter is executing any dialog in the document.

• A grammar with application scope is active whenever the interpreter is executing any document in the application.


If the interpreter is executing one dialog and the user’s input matches an active grammar for a different dialog, control transfers to the latter dialog. If the grammar is in application scope, control might transfer to a dialog in a different document.

Note that within a field, you can temporarily turn off grammars from higher scopes by setting the field’s modal attribute to true.
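For instance, a field that collects sensitive input might be made modal so that a document-scope or application-scope grammar cannot pull the caller away in the middle of it. The field name and prompt below are illustrative assumptions.

```xml
<!-- While this field is executing, only its own grammar is active -->
<field name="pin" type="digits" modal="true">
  <prompt>Please enter your PIN.</prompt>
</field>
```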

Events

The VoiceXML interpreter can throw a number of predefined events based on errors, telephone disconnects, or user input. For example:

• A no-input event is thrown if the user does not respond to a question.

• A no-match event is thrown when the user does not respond intelligibly—that is, when the user’s utterance does not match any active grammar.

• A help event is thrown when the user requests help.

• An error event is thrown when any kind of error occurs.

An application can define additional events and can use a <throw> element to throw an event of a specified kind.

An application can catch an event and take the appropriate response in an event handler. A <catch> element is a general-purpose event handler; its event attribute specifies the kinds of event that it handles. Additional event-handling tags are syntactic shorthand: <noinput>, <nomatch>, <help>, and <error>. Each of these shorthand tags catches one type of event, indicated by its name. For example, a <nomatch> element catches no-match events.

When an event is thrown, the associated event handler, if it exists, is invoked. If the handler did not cause the application to terminate, execution resumes in the element that was being executed when the event was thrown.

For more information, see Chapter 3, “Event Handling”.
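The pieces described in this section can be combined as in the following sketch. The form, the prompts, and the event name com.example.declined are assumptions for illustration; the tags themselves are the ones named above.

```xml
<form id="confirm">
  <field name="answer" type="boolean">
    <prompt>Do you want to continue?</prompt>
    <!-- Shorthand handlers: each catches the event named by its tag -->
    <noinput>
      <prompt>Sorry, I did not hear you.</prompt>
      <reprompt/>
    </noinput>
    <nomatch>
      <prompt>Please say yes or no.</prompt>
    </nomatch>
    <filled>
      <!-- Throw an application-defined event (hypothetical name)
           when the caller answers no -->
      <if cond="!answer">
        <throw event="com.example.declined"/>
      </if>
    </filled>
  </field>
  <!-- General-purpose handler for the application-defined event -->
  <catch event="com.example.declined">
    <prompt>Okay, stopping here.</prompt>
  </catch>
</form>
```

The <catch> handler sits at form level rather than inside the field, so it would also handle the same event if another item in the form threw it.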

Links

A link specifies a grammar that is independent of any particular dialog.

A <link> element defines a link. Each <link> element contains a <grammar> element. A link’s grammar is active in the scope of the element that contains the link. For example, if the link is in a form, its grammar is active when the interpreter is executing that form. If a link is under a <vxml> element, its grammar has document scope; if the link is in the application root document, its grammar has application scope. Links in a <vxml> element can implement global behaviors.

A link can specify one of two possible actions to take if the speech-recognition engine detects a match for its grammar:

• The link can cause a transition to a different location; in that case, its next attribute specifies the destination of the transition. Links, like menu choices, can cause transitions to other dialogs or documents.

• The link can throw an event; in that case, its event attribute specifies the event to throw. After the event is handled, execution resumes with the element that was being executed when the link grammar was matched.


For example, the following link is defined at document level; its grammar is active whenever the interpreter is executing any dialog in the document. If the user says “operator,” the link transfers control to a different document.

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <link next="operator_xfer.vxml">
    <grammar type="application/x-nuance-gsl">
      operator
    </grammar>
  </link>
  ...

Universal Commands and Grammars

A universal command is always available: the user can give the command at any point in an interaction. A universal grammar specifies user utterances that can be recognized as a universal command.

Predefined Universal Grammars

The following predefined universal grammars are available to all applications:

Grammar    Description
help       The user asked for help.
exit       The user asked to exit.
cancel     The user asked to cancel the prompt that is playing.
goback     The user wants to retract the last response and go back to an earlier part of the interaction.

If one of these predefined universal grammars is activated and a user utterance matches the grammar, an event of the same name is thrown. For example, a help event is thrown when the user says “help.”

Application-Defined Universal Grammars

An application creates its own universal command by defining and enabling a new universal grammar and implementing its response to the command.

To define a universal grammar, set the universal attribute in the <grammar> tag that defines the grammar for the command. The attribute value is a name that uniquely identifies the grammar among all universal grammars in the application. In the following example, the new universal grammar is named joke; the user utterance “Tell me a joke” will be a universal command when this universal grammar is activated.

<grammar universal="joke" type="application/x-nuance-gsl">
  (tell me a joke)
</grammar>

Activating Universal Grammars

An application can activate any of the universal grammars to enable the corresponding universal commands. When a universal grammar is activated, a user utterance that matches the grammar is treated as a universal command.

All universal grammars are deactivated by default. The application can activate some or all universal grammars by setting the universals property. This property specifies which of the universal grammars should be active; all other universal grammars are deactivated.

• Set the universals property to all to activate all universal grammars (both predefined and application-defined):

<property name="universals" value="all" />

• Set the universals property to a space-separated list of grammar names to activate those universal grammars and deactivate all others. For example, a value of help exit activates only the help and exit grammars.

• Set the universals property to none to deactivate all previously activated universal grammars in the current scope.

<property name="universals" value="none" />

Note: (VoiceXML 1.0 only) If the <vxml> tag’s version attribute is 1.0, all universal grammars are activated by default.

Responding to Application-Specific Universal Grammars

A <link> element containing a universal grammar implements the application’s response to the corresponding universal command. Your application can respond to the command in whatever manner is appropriate. Typically, the response is to throw an event or to transition to a different form.

If you throw an application-specific event, you must provide an event handler to take the appropriate action. For example:

<link event="joke">
  <grammar universal="joke" type="application/x-nuance-gsl">
    (tell me a joke)
  </grammar>
</link>

<catch event="joke">
  <!-- Illustrative handler body (assumed) -->
  <prompt>Here is a joke.</prompt>
  <reprompt/>
</catch>

Procedural Logic

You can use procedural logic, called executable content, within a few basic elements: <block>, <filled>, and event handlers. Within executable content, you can declare and assign values to variables, use simple conditional logic, perform iteration (a BeVocal VoiceXML extension), output speech or audio to the user, or run a JavaScript script.

Variables

Variables are declared by the <var> tag. Declarations can appear in a document, a form, or executable content. The <var> tag can optionally specify the variable’s initial value; if it doesn’t, the variable is initialized to undefined.


A variable has the scope of the element that contains its declaration:

• A variable has document scope if it is declared in a <vxml> element, or in a <block> or event handler that is a child of the <vxml> element. If the document is the application root document, then the variable has application scope.

You can refer to a variable x with document scope either as x or document.x (for clarity or to resolve ambiguity). If the variable is in the application root document, then you can refer to it in other documents as application.x.

• A variable has dialog scope if it is declared in a <form> element, or in a <block> or <filled> element that is a child of a <form> element, or an event handler that is a child of a <form> or <menu> element.

You can refer to a variable x with dialog scope either as x or dialog.x.

• A variable has an anonymous scope, local to a field, if it is declared in an event handler or <filled> element that is a child of a <field> element.

If a <var> element specifies a variable that is already in scope, it does not declare a new variable with the same name, but simply assigns a value to the existing variable. If the <var> element has an expr attribute, the variable is assigned the specified value; otherwise, the variable is assigned the value undefined.

You can set a variable’s value with the <assign> tag.

VoiceXML variables are in all respects equivalent to JavaScript variables—they are part of the same namespace. For additional information, see “Scripts” on page 17.
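The scoping rules above can be sketched with a small document. The variable names (orders, item) and the form id are hypothetical, and the xmlns attribute follows the VoiceXML 2.0 specification rather than anything stated in this guide.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: can be referenced as orders or document.orders. -->
  <var name="orders" expr="0"/>

  <form id="order">
    <!-- Dialog scope: can be referenced as item or dialog.item.
         No expr attribute, so the initial value is undefined. -->
    <var name="item"/>

    <block>
      <!-- <assign> changes the value of variables that are already declared. -->
      <assign name="item" expr="'coffee'"/>
      <assign name="document.orders" expr="document.orders + 1"/>
      <prompt>
        Order number <value expr="orders"/> is for <value expr="dialog.item"/>.
      </prompt>
    </block>
  </form>
</vxml>
```

Using the qualified names document.orders and dialog.item is optional here; they are shown only to illustrate how a scope prefix resolves ambiguity.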

Conditional Logic

You can use an <if> element to execute a block of code if a condition is satisfied. Within that element, you can use a sequence of <elseif> elements to execute alternative blocks of code if all previous conditions failed and the condition of the <elseif> element is satisfied. You can use an <else> element to execute an alternative block of code if all previous conditions failed.

The conditions in <if> and <elseif> elements are expressed as Boolean-valued JavaScript expressions.
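A minimal sketch of conditional logic inside a <block> follows. The hour variable and the greeting texts are made up for illustration. Note that <elseif> and <else> are empty elements that divide the content of the enclosing <if>, and that the “<” operator is written as an escape sequence because the condition sits inside an XML attribute.

```xml
<block>
  <!-- Current hour of the day (0-23), computed with a JavaScript expression. -->
  <var name="hour" expr="new Date().getHours()"/>

  <if cond="hour &lt; 12">
    <prompt>Good morning.</prompt>
  <elseif cond="hour &lt; 18"/>
    <prompt>Good afternoon.</prompt>
  <else/>
    <prompt>Good evening.</prompt>
  </if>
</block>
```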

Iteration

You can use the BeVocal VoiceXML extension <bevocal:foreach> to execute the contained elements once for each element of a specified array.

Audio Output

A <prompt> or <reprompt> element generates speech output; an <audio> element plays a prerecorded audio clip. The <value> tag evaluates an expression and produces spoken output of the result.

Prompts can appear in executable content as well as in elements for collecting user input. Anywhere a <prompt> is valid, text is interpreted as a prompt even if the enclosing <prompt> and </prompt> tags are omitted.

Each input item, and the <initial> item of a mixed-initiative form, has a prompt counter that lets you play different prompts if the user revisits the item several times. For example, you may want to play shorter descriptions after the first or second time the user is prompted for the same information. The prompt counters are reset on each form invocation.
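The prompt counter can be sketched as follows, using the count attribute of <prompt>. The field name and prompt wording are hypothetical; the field relies on the built-in digits type so that no separate grammar is needed.

```xml
<field name="pin" type="digits">
  <!-- Played on the first visit to this field (count defaults to 1). -->
  <prompt count="1">
    Please say or key in your personal identification number, one digit at a time.
  </prompt>

  <!-- Played on the second and later visits: a shorter version. -->
  <prompt count="2">
    Please say your PIN.
  </prompt>
</field>
```

On each visit the interpreter plays the prompt with the highest count that does not exceed the current prompt counter, so the shorter prompt is used from the second visit on.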

Tip:

• If your JavaScript expression contains any of the characters “<”, “>”, or “&”, that character must be replaced with the corresponding escape sequence “&lt;”, “&gt;”, or “&amp;”.


Scripts

A <script> element executes a JavaScript script, which is run in the scope of the parent element. A <script> element can also define functions that can be called by JavaScript expressions in the same scope.

VoiceXML variables are equivalent to JavaScript variables and are part of the same namespace.

VoiceXML variables can be used in a script just as variables defined in a <script> element can be used in VoiceXML. Declaring a variable using a <var> element is equivalent to using a var statement in a <script> element.

If your JavaScript expression contains any of the characters “<”, “>”, or “&”, that character must be escaped. Inside a <script> element, you can do so in one of two ways. You can replace the individual characters with the corresponding escape sequence “&lt;”, “&gt;”, or “&amp;”. This may result in code that is difficult to read. Alternatively, you can place the entire script inside a CDATA section. For example, either of the following is correct:

<script>
  function factorial(n) {
    return (n &lt;= 1) ? 1 : n * factorial(n-1);
  }
</script>

or

<script> <![CDATA[
  function factorial(n) {
    return (n <= 1) ? 1 : n * factorial(n-1);
  }
]]> </script>

You might argue that the second is a little easier to read.

User Interaction

VoiceXML supports both application-directed and mixed-initiative interactions with a user.

In an application-directed (or simply directed) interaction, the application prompts for the information it needs and the user supplies the requested information by answering the prompts. The application controls the interaction; the user cannot volunteer information. To be more accurate, the application does not understand volunteered information:

• If the application is executing a form, the only active grammar is the one for the current field of the form. The only valid user input is one that provides a value for the current field’s variable.

• If the application is executing a menu, the only active grammars are the grammars of the menu’s choices. The only valid user input is one that selects a choice for the current menu.

In a mixed-initiative interaction, the user and the application both participate in determining what the application does next. A single utterance from the user may provide input for multiple input variables in a form. In response to a prompt in one dialog, the user may provide input that matches a grammar defined in a different form. When this happens, the interpreter transitions to that dialog and fills its input variables from the user input. Similarly, the user may provide input that selects a choice from a different menu or that matches a link grammar, causing a transition to the destination specified by that choice or link.

If an application does not use links or grammars with document or application scope, it may still include mixed-initiative forms. A mixed-initiative form includes a form grammar. It can include an <initial> element to control the initial interaction in the form. This element can request user input or perform other non-interactive initialization tasks. In response to a prompt from the <initial> element, the user could provide input that fills in multiple input variables. If the form prompts for individual fields, any user input that matches the form grammar is valid, even if that input does not fill in the field for which the user was just prompted.

Note: Fewer speech-recognition errors occur in directed interactions than in mixed-initiative interactions.

Flow of Execution

Execution within a VoiceXML document flows in document order until a dialog (form or menu) is entered. Execution flows from the current dialog to a different dialog or document, based on either:

• An explicit transition statement in the current dialog.

• Speech recognition in the current dialog that causes a transition to a different dialog.

In addition, execution can temporarily leave the current dialog to execute a subdialog, returning to the current dialog when execution of the subdialog is complete.

If the current dialog completes execution without transitioning to a different location, the application exits. In addition, you can use an <exit> element to end the application explicitly.

Explicit Transition

You can set up explicit transitions to other dialogs or documents in your application using <goto> or <submit> tags. These transition elements can be placed inside <block> or <filled> elements or event handlers.

The <goto> element lets you transition to another input item in the current form, to another dialog in the current document, or to another document. When you make the transition to the new location, the local variables from the old form or document are lost. This happens even if you transition to the same form you were in before. However, the values of local variables are not affected when you use <goto> to transition between items within a form.

The <submit> tag lets you pass values to another document using an HTTP GET or POST request. Because you use a URI to specify the next document, the URI need not refer to a static VoiceXML document; for example, it could refer to a CGI script.
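The two transition elements can be sketched together as follows. The form ids, the account field, and the example.com URL are all hypothetical; the forms are fragments that would sit inside a <vxml> element.

```xml
<form id="login">
  <field name="account" type="digits">
    <prompt>What is your account number?</prompt>
    <filled>
      <if cond="account.length != 6">
        <!-- Stay in this document and jump to another dialog. -->
        <goto next="#bad_account"/>
      <else/>
        <!-- Send the account variable to the server with an HTTP POST
             and continue with whatever document the server returns. -->
        <submit next="http://www.example.com/lookup.cgi"
                method="post" namelist="account"/>
      </if>
    </filled>
  </field>
</form>

<form id="bad_account">
  <block>
    <prompt>Account numbers have six digits.</prompt>
    <!-- Re-entering the login form starts it afresh; its variables are reset. -->
    <goto next="#login"/>
  </block>
</form>
```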

Recognition-Triggered Transition

User input to a dialog may cause a transition to a different location:

• If the speech-recognition engine matches the grammar of a menu’s <choice> element that has a next or expr attribute, the interpreter transitions to the destination specified by that attribute.

• If the speech-recognition engine matches the grammar of a <link> element that has a next or expr attribute, the interpreter transitions to the destination specified by that attribute.

• If the speech-recognition engine matches a grammar with document or application scope that is defined in a different dialog, the interpreter transitions to that dialog.
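A sketch of the first two cases follows: a menu whose choices name their destinations, and a document-level link that returns to the menu from anywhere in the document. All ids and phrases are hypothetical, and the link grammar uses the same inline GSL style as the earlier joke example.

```xml
<!-- Link declared at document level: active in every dialog of the document. -->
<link next="#main_menu">
  <grammar type="application/x-nuance-gsl">
    (main menu)
  </grammar>
</link>

<menu id="main_menu">
  <prompt>Say news or weather.</prompt>
  <!-- Matching a choice transitions to the dialog named by its next attribute. -->
  <choice next="#news">news</choice>
  <choice next="#weather">weather</choice>
</menu>

<form id="news">
  <block><prompt>Here is the news.</prompt></block>
</form>

<form id="weather">
  <block><prompt>Here is the weather.</prompt></block>
</form>
```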

Subdialogs

A subdialog is a reusable VoiceXML dialog that you can pass data to and get return values from:

• The current dialog passes control to a subdialog with a <subdialog> element. It can pass data to the subdialog with <param> elements inside the <subdialog> element.

• A subdialog returns control to the calling dialog with the <return> element. It can pass values back using the namelist attribute of the <return> element.
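The calling sequence described above can be sketched as follows. The form ids and the who and age names are hypothetical. Following the VoiceXML 2.0 specification, each <param> must correspond to a <var> declared in the subdialog, and the returned values become properties of the calling <subdialog> element's form-item variable.

```xml
<!-- Calling dialog -->
<form id="main">
  <subdialog name="result" src="#get_age">
    <!-- Data passed to the subdialog. -->
    <param name="who" expr="'your child'"/>
    <filled>
      <!-- Values named in the namelist of <return> arrive as result.<name>. -->
      <prompt>The age you gave is <value expr="result.age"/>.</prompt>
    </filled>
  </subdialog>
</form>

<!-- Subdialog -->
<form id="get_age">
  <!-- Receives the value of the matching <param>. -->
  <var name="who"/>
  <field name="age" type="number">
    <prompt>How old is <value expr="who"/>?</prompt>
    <filled>
      <!-- Hand control and the age value back to the caller. -->
      <return namelist="age"/>
    </filled>
  </field>
</form>
```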


Collecting Input and Playing Prompts

At any moment, the VoiceXML interpreter is either waiting for input in an input item, such as a field, or transitioning between input items in response to some input. In this sense, input can be a spoken user utterance, a series of DTMF key presses, or an input-related event such as invalid input. What happens in the waiting and transitioning states is rather intertwined.

While waiting for input (also referred to as being in a recognition state), the interpreter is listening for and attempting to match spoken utterances or DTMF key presses against the currently active grammars. When the interpreter listens for speech input, it constantly compares the incoming audio stream to all active grammars, looking for a match. At some point after the user stops talking, the interpreter decides whether the input is valid. The timing for this is controlled by several properties; the properties are different for spoken grammars and for DTMF grammars. For details on how these properties interact, see Chapter 12, “Properties”.

While transitioning between input items, the interpreter completely ignores spoken utterances. If the property bevocal.dtmf.flushbuffer is set to false, then it does listen for DTMF key presses. It queues (or buffers) any key presses for the next recognition state and it keeps track of timing information for the key presses. The interpreter also queues asynchronously generated events that are not related directly to execution of the transition (such as the user hanging up).

During this transitioning state, prompts and audio are queued to be played and a program’s executable content is run. Prompts get played either at the start of the next waiting state or sometimes when the interpreter goes off to fetch a resource, such as another document. For details on fetching resources, see Chapter 4, “Fetching and Caching Resources”.
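As a sketch, the DTMF buffering behavior mentioned above would be enabled with a property setting such as the following. Where the property is placed (document, form, or field level) is an assumption; see Chapter 12, “Properties”, for the authoritative description.

```xml
<!-- Assumed usage: listen for and queue DTMF key presses during transitions. -->
<property name="bevocal.dtmf.flushbuffer" value="false"/>
```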

At the beginning of a waiting state, there may be DTMF key presses queued during the previous transitioning state. By default, those key presses are not available for the waiting state to use for


2 Forms

The main elements of a document (within the <vxml> element) are forms. VoiceXML forms are analogous to web forms; you use them to collect (voice) input from the user.

This chapter describes:

• Form Items

• Form-Item Variables

• Execution of a Form

• User Interaction

Form Items

So far, the only form item we’ve discussed is the <field> element. However, forms can contain either input items or control items:

Input items are elements for collecting user input or results. An input item is any one of the following:

• A field, defined with the <field> tag, asks the user for a piece of information.

• A record item, defined with the <record> tag, records what the user says (perhaps for a voicemail message).

• A subdialog, defined with the <subdialog> tag, invokes a reusable dialog.

• A transfer item, defined with the <transfer> tag, transfers the user to another telephone number.

Control items are tags that can contain procedural items for audio output or computation. A control item is either of the following:

• A block, defined with the <block> tag, is a container for procedural elements.

• An initial item, defined with the <initial> tag, controls the initial interaction of a mixed-initiative form.

Form-Item Variables

Each form item has an associated form-item variable. When a form is entered, all form-item variables are initially undefined. When a form item is visited, its variable is set to the result of interpreting that form item. For example, visiting a <block> element sets its form-item variable to true. The form-item variable for an input item is also called an input-item variable (or simply input variable); after an input item is visited, its input-item variable is set to the value collected from the user.


Execution of a Form

Within a form, the flow of execution is governed by the Form Interpretation Algorithm (FIA), a looping algorithm. On each iteration, the FIA selects the form item to visit next.

A form item’s guard conditions determine whether it can be selected on a given iteration:

• The value of the form-item variable must be undefined.

• Any cond expression specified for the form item must evaluate to true.

Both guard conditions must be met in order for a form item to be selected. The FIA examines the form items in document order, selecting the first one whose guard conditions are met. If the guard conditions for all form items fail, the form (and the application) exits.

By default, every form-item variable has an initial value of undefined so every form item that does not specify a cond expression is eligible for selection. After the form item is visited, its variable is set to a value, which prevents the same form item from being selected again on the next iteration.

You can explicitly control the execution of any form item if you give its variable a name and an initial value other than undefined. Doing so prevents the form item from being eligible for selection until you explicitly use the <clear> tag to reset its variable. Typically, input-item variables are given names but control-item variables are not.
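The guard conditions and the use of <clear> can be sketched with a short form. The form id, item names, and prompts are hypothetical; the fields use built-in types so that no grammars are needed.

```xml
<form id="survey">
  <!-- Named control item with an initial value other than undefined:
       the FIA skips it until its variable is cleared. -->
  <block name="thanks" expr="true">
    <prompt>Thank you for your answers.</prompt>
  </block>

  <field name="member" type="boolean">
    <prompt>Are you a member?</prompt>
  </field>

  <!-- Selected only while member_id is undefined AND the cond expression is true. -->
  <field name="member_id" type="digits" cond="member == true">
    <prompt>What is your member number?</prompt>
  </field>

  <block>
    <!-- Reset thanks to undefined so that it becomes eligible for selection. -->
    <clear namelist="thanks"/>
  </block>
</form>
```

In this sketch the FIA visits member, then member_id (only if the user answered yes), then the unnamed block. Because that block clears thanks, the next iteration selects the first block; after that no form item passes its guard conditions and the form exits.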

User Interaction

User interaction with a form can be directed or mixed initiative.

A directed form has no form grammar, only grammars for its individual fields. A directed form gives the user explicit directions about what to say and when. For example, a directed form might result in the following dialog:

Application: Would you like to buy, sell, or receive a stock quote?

User: Get a quote.

Application: What stock or stocks would you like a quote for?

User: Intel.

A form that includes its own grammar is a mixed-initiative form. The form grammar allows several input variables to be filled in as a result of a single user utterance. A mixed-initiative form allows the user to speak more naturally. For example, a mixed-initiative form might result in a dialog that opens as follows:

Application: Stock assistant here. How can I help you?

One disadvantage of mixed-initiative forms is that form grammars are more complicated and can result in more recognition errors.

The grammar for a field sets a value for the field’s variable. For example, the grammar in the following field, specified in ABNF, assigns the value june to the variable month if the user says “June.”

<field name="month">
  <grammar>
    #ABNF 1.0;
    root $month;
    $month = june | july | august;
  </grammar>
</field>

The grammar for a form must specify both the input variable to be set by a grammar rule and the value for that variable. For example, the ABNF grammar in the following file, foo.gram, sets values for two variables, quantity and fruit:

#ABNF 1.0;
root $main;
$main = [$amount] $fruit | $amount [$fruit] | $amount $fruit ;
$amount = one { quantity=1 } | two { quantity=2 } | three { quantity=3 } ;
$fruit = (apple | apples) { fruit=apples } | (orange | oranges) { fruit=oranges } ;

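This grammar is used by the following mixed-initiative form:

<form id="foo">
  <!-- Assumed: the grammar reference and the quantity field are reconstructed. -->
  <grammar src="foo.gram"/>
  <initial>
    <prompt>How many apples or oranges do you want?</prompt>
  </initial>
  <field name="quantity">
    <prompt>How many do you want?</prompt>
  </field>
  <field name="fruit">
    <prompt>Do you want apples or oranges?</prompt>
  </field>
  <filled>
    <prompt>
      Ok, you want <value expr="quantity"/> <value expr="fruit"/>
    </prompt>
  </filled>
</form>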


3 Event Handling

The VoiceXML interpreter can throw a number of predefined events based on errors, telephone disconnects, or user requests. You can also throw events you define that are specific to your application. When an event is thrown, the associated event handler, if it exists, is invoked. Then execution resumes in the element that was being executed when the event was thrown.

This chapter describes:

• Predefined Events

• Default Event Handlers

• Application-Defined Event Handlers

• Events in Subdialogs

• Throwing Events

• Application-Defined Events

Predefined Events

The following standard events are predefined:

Event                          Description
exit                           The user asked to exit.
help                           The user asked for help.
noinput                        The user did not provide timely input.
nomatch                        The user did not provide meaningful input.
cancel                         The user asked to cancel the prompt that is being played.
connection.disconnect.hangup   The user hung up. New in VoiceXML 2.0.


The following additional events are defined as BeVocal VoiceXML extensions:

Event                                   Description
goback                                  The user wants to retract the last response and go back to an earlier part of the interaction. See Chapter 7, “Go-Back Facility”.
connection.far_end.busy                 The number for an outbound telephone call was busy.
connection.far_end.disconnect           An outbound telephone call was disconnected because the called third party hung up. Outbound telephone calls are described in Chapter 6, “Controlling Outbound Calls”.
connection.far_end.disconnect.timeout   An outbound telephone call exceeded its maximum allowed duration.
connection.far_end.noanswer             An outbound telephone call was not answered within the time allowed for making the connection.

The following standard errors are predefined:

Error                              Description
error.badfetch                     An error occurred while the interpreter was fetching a document or resource.
error.noauthorization              The user is not authorized to perform the requested action.
error.semantic                     A runtime error occurred in the VoiceXML code.
error.connection.baddestination    The destination URI for an outbound telephone call was invalid.
error.connection.noauthorization   An attempt was made to place an unauthorized outbound telephone call, for example, one that exceeds the maximum allowed duration.
error.connection.noresource        An audio input or output resource is unavailable.
error.noresource                   An audio input or output resource is unavailable.
error.unsupported.format           The requested resource format is not supported.
error.unsupported.element          The requested element is not supported (for example, error.unsupported.subdialog).

The following additional errors are defined as BeVocal VoiceXML extensions:

Error                                    Description
error.internal                           A serious internal error occurred in the interpreter.
error.bevocal.maxdialogerrors_exceeded   The maximum number of speech errors was exceeded in a particular execution of a particular form.
error.bevocal.maxerrors_exceeded         The maximum number of speech errors was exceeded during the call.

Note: In a VoiceXML 2.0 document (when the value of the version attribute of the <vxml> tag is 2.0), the telephone.disconnect.* and error.telephone.* events have been changed to connection.disconnect.* and error.connection.*. See above.

Backward Compatibility with VoiceXML 1.0

The following predefined events are still supported in VoiceXML 1.0:

Event                           Description
telephone.disconnect.hangup     The user hung up.
telephone.disconnect.transfer   The user’s call was transferred.

The following predefined errors are still supported in VoiceXML 1.0:

Error                             Description
error.telephone.baddestination    The destination URI for an outbound telephone call was invalid.
error.telephone.noauthorization   An attempt was made to place an unauthorized outbound telephone call, for example, one that exceeds the maximum allowed duration.
error.telephone.noresource        A telephone resource is unavailable, for example because the application tried to make an outbound telephone call while another outbound call was active.

Default Event Handlers

The BeVocal interpreter provides the following default event handlers for the predefined events and errors:

Event Handler                  Description
exit                           Exit the interpreter.
help                           Play a default audio help message and reprompt. The help message says: “No help available right now.”
noinput                        Play a default audio message and reprompt. The message says: “I’m sorry, I didn’t hear you.”
nomatch                        Play a default audio message and reprompt. The message says: “I’m sorry, I didn’t understand you.”
cancel                         Stop playing audio.
error                          Exit the interpreter.
connection.disconnect.hangup   Exit the interpreter. New in VoiceXML 2.0.
goback                         Undo whatever actions resulted from the last response, then prompt the user for a new response.

Backward Compatibility with VoiceXML 1.0

The following predefined event handler is still supported for VoiceXML 1.0.

Application-Defined Event Handlers

Although the system provides default handlers for the predefined events, you can override these handlers by providing your own event handlers in any element that can throw an event. The <catch>, <error>, <help>, <noinput>, and <nomatch> elements are event handlers.

An element in which an event may be thrown also inherits event handlers defined in its ancestor elements. For example, an event thrown within a field element may be caught by a handler in that element, or in its form, or in its document, or in its application. This inheritance of event handlers allows you to provide consistency in event handling by defining handlers at a higher level.

The method by which event handlers are inherited from ancestor elements is called “as if by copy” semantics in the VoiceXML 2.0 specification. It helps to think of the appropriate event handler as literally being copied into the scope where the event was thrown. Variable references are resolved relative to the scope of the element where the event was thrown, and URL references are resolved relative to the document from which the event was thrown. For example, if you have a <catch> handler in an application root document, which is in a different directory from the main document that threw the event, URLs in the handler are resolved relative to the directory of the main document. Resolving URLs relative to the originating document is considered 2.0 behavior and applies only when the <vxml> tag’s version attribute is set to "2.0" or greater.

Form items contain event counters that let you respond differently if the same event is thrown multiple times. For example, you may want to provide more details each time the user asks for help. The event counters are reset on each form invocation.

When an event occurs, its counter is used to select applicable event handlers:

1. All handlers in the scope in which the event occurred and its containing scopes are considered.

2. A handler for the event is eligible if its count attribute is less than or equal to the event’s counter.

3. Those eligible handlers with the highest count are selected as applicable (more than one handler may have the same highest count).

4. The applicable handlers are ordered by scope, with the innermost event handlers first; within a given scope, the applicable handlers are examined in the order in which they occur in the VoiceXML document. The first applicable handler in this ordering is selected to handle the event.

You can set up event handlers that catch all events with a given prefix (for example, error.unsupported). Note, however, that the interpreter selects a handler based on count, scope, and document order only. A more specific handler does not take precedence. For example, if an error.unsupported.format event is thrown and the first applicable handler is for all events beginning with the prefix error.unsupported, that handler will be invoked even if the next applicable handler is for the specific event error.unsupported.format.

Within an event handler, the _event variable contains the name of the event currently being handled; the _message variable contains the message string that provides additional information about the event. If no message was supplied when the event was thrown, the _message variable is undefined.
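The following sketch combines the features described in this section: handlers selected by count, a handler that catches events by prefix, and the _event and _message variables. The field name, the city names, and the prompt wording are hypothetical; the inline grammar assumes Nuance GSL, as in the earlier examples.

```xml
<field name="city">
  <grammar type="application/x-nuance-gsl">
    [boston chicago denver]
  </grammar>
  <prompt>Which city?</prompt>

  <!-- First help request in this field. -->
  <help count="1">
    Say the name of a city. <reprompt/>
  </help>

  <!-- Second and later help requests: give more detail. -->
  <help count="2">
    You can say Boston, Chicago, or Denver. <reprompt/>
  </help>

  <!-- Prefix match: catches error.unsupported.format,
       error.unsupported.element, and so on. -->
  <catch event="error.unsupported">
    <log>Caught <value expr="_event"/>: <value expr="_message"/></log>
    <exit/>
  </catch>
</field>
```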


Events in Subdialogs

A subdialog must catch any event that is thrown while the subdialog is being executed. If no handler for the event is found in the subdialog’s execution context, a fatal error occurs, causing the interpreter to exit.

(VoiceXML 1.0 only) When a subdialog throws an event, the result depends on whether the subdialog is modal. Subdialogs are modal by default; a subdialog can be made non-modal by setting the modal attribute to false.

• If an event is thrown within a modal subdialog and no handler for the event is found in the subdialog’s execution context, a fatal error occurs, causing the interpreter to exit.

• If an event is thrown within a non-modal subdialog and no handler for the event is found in the subdialog’s execution context, the interpreter causes the subdialog’s context to return and rethrows the event in the calling context, restarting the search for the event handler in that context.

Note: In VoiceXML 2.0, all subdialogs are modal.

Throwing Events

An application can throw events as follows:

• A <throw> element throws an event; it can occur within executable content, that is, in a <block> or <filled> element, or an event handler.

• A <link> element can specify an event to be thrown when the link’s grammar is matched.

• A <choice> element in a menu can specify an event to be thrown when the choice’s grammar is matched.

• A <return> element in a subdialog can specify an event to be thrown after control returns to the calling dialog.
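The first case in the list above, a <throw> element in executable content paired with a handler for the event, can be sketched as follows. The event name com.example.cart.empty, the form id, and the items variable are hypothetical.

```xml
<form id="checkout">
  <!-- Handler for the application-defined event thrown below. -->
  <catch event="com.example.cart.empty">
    <prompt>Your cart is empty.</prompt>
    <exit/>
  </catch>

  <var name="items" expr="0"/>

  <block>
    <if cond="items == 0">
      <!-- Throw an application-defined event from executable content. -->
      <throw event="com.example.cart.empty"/>
    </if>
    <prompt>You have <value expr="items"/> items.</prompt>
  </block>
</form>
```

Because the handler is defined at form level, it also applies to the same event thrown from any field or block inside the form.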

Tips:

• Always set up default <help>, <nomatch>, and <noinput> messages of your own, at top level scope. For example:

<help>
  I'm sorry. There's no help available here. <reprompt/>
</help>

<noinput>
  I'm sorry. I didn't hear anything. <reprompt/>
</noinput>

<nomatch>
  I didn't get that. <reprompt/>
</nomatch>

...

• If you want to execute both an event handler in an inner scope and a handler for the same event in an outer scope, the inner handler can use a <rethrow> element to rethrow the event.
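A sketch of the last tip follows. It assumes that <rethrow/> can be written without attributes; check the <rethrow> tag description for its exact syntax. The form id, field name, grammar, and log text are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Outer (document-level) handler. -->
  <nomatch>
    I didn't get that. <reprompt/>
  </nomatch>

  <form id="pick_color">
    <field name="color">
      <grammar type="application/x-nuance-gsl">
        [red green blue]
      </grammar>
      <prompt>Which color?</prompt>

      <!-- Inner handler: do the field-specific work, then pass the
           event on so that the outer handler also runs. -->
      <nomatch>
        <log>nomatch in the color field</log>
        <rethrow/>
      </nomatch>
    </field>
  </form>
</vxml>
```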

