Voice User Interface Design

(1)

Part III: Technology Support for Voice Application

Dr. Dirk Schnelle-Walka

(2)

W3C Speech Interface Framework

[Figure: framework components: VoiceXML, Speech Synthesis, Grammar, Pronunciation Lexicon, Call Control, Semantic Interpretation]

(3)

Status of W3C Speech Interface Languages

[Figure: status chart placing VoiceXML 2.0 & 2.1, Grammar, Synthesis 1.0, Synthesis 1.1, Call Control, Semantic Interpretation, PLS, and a further entry truncated in the source ("Voice") along the W3C track: Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, Recommendation]

(4)

VoiceXML

VoiceXML is an XML markup language for developing speech applications (the voice-based equivalent of HTML)

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

(5)

Goals

Use the internet via the telephone

– voice and touch tones

Transfer the advantages of web-based development and data transfer to interactive voice-based applications

Ease the development of voice-based dialogs

Authors should not have to take care of low-level programming tasks

Development of a new language since voice

(6)

History of VoiceXML

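[Figure: timeline from native code (before 1994) through PhoneWeb, VoxML, PML, SpeechML, and TalkML to VoiceXML 1.0, VoiceXML 2.0, VoiceXML 2.1, and VoiceXML 3.0; year marks 1995, 1998, 1999, 2000, 2004, 2007, 2008; one entry is labelled "(defunct)"]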

(7)

Scope of VoiceXML

VoiceXML describes the human-machine interaction provided by voice response systems, which includes:

– Output of synthesized speech (text-to-speech).
– Output of audio files.
– Recognition of spoken input.
– Recognition of DTMF input.
– Recording of spoken input.
– Control of dialog flow.
– Telephony features such as call transfer and disconnect.

(8)

VoiceXML-document

VoiceXML document = finite state machine

– Each state specifies the next state

The user is always in a defined conversation state (= dialog)

<vxml> is the top-level element of a document, serving as a container for dialogs

Dialog

– Form
– Menu

(9)

A simple document

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">
        Welcome to Travel Planner!
        <audio src="http://www.adline.com/travelad.wav"/>
      </prompt>
    </block>
  </form>
</vxml>

(10)

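Analogy HTML - VoiceXML

[Figure: side-by-side comparison. HTML side: a PC with a web browser ("point & click"), the internet, and a web server delivering HTML documents, images, audio/video, and scripts. VoiceXML side: a simple telephone, the telephony network, and a voice server (voice browser with ASR, DTMF, TTS, …) acting as internet gateway and fetching VoiceXML documents, audio, grammars, and scripts.]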

(11)

Architecture

[Figure: architecture diagram with the components Document Server, VoiceXML Interpreter Context, VoiceXML Interpreter, and Implementation Platform; the arrows between document server and interpreter are labelled Request and Document]

(12)

Core Concepts

Session

A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

Application

An application is a set of documents sharing the same application root document

Grammar

Each dialog has one or more speech and/or DTMF grammars associated with it.

Events

Notification mechanism

Links

(13)

Execution with one document

Document execution begins at the first dialog by default.

As each dialog executes, it determines the next dialog.

When a dialog doesn't specify a successor dialog, document execution stops.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0">
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      <goto next="#say_goodbye"/>
    </block>
  </form>
  <form id="say_goodbye">
    <block>Goodbye!</block>
  </form>
</vxml>

[Figure: a single document of the form <vxml> <var a=".."> … </vxml>]
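The execution rule for a single document (start at the first dialog, follow each dialog's successor, stop when none is named) can be sketched in Python. This is only an illustration and not an interpreter: the function name dialog_order is made up here, only same-document <goto next="#id"> targets are followed, and nothing is actually spoken or recognised.

```python
import xml.etree.ElementTree as ET

DOC = """<vxml version="2.0">
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block><value expr="hi"/><goto next="#say_goodbye"/></block>
  </form>
  <form id="say_goodbye">
    <block>Goodbye!</block>
  </form>
</vxml>"""


def dialog_order(vxml_text):
    """Return the dialogs visited, by id (or a positional placeholder)."""
    root = ET.fromstring(vxml_text)
    dialogs = [el for el in root if el.tag in ("form", "menu")]
    by_id = {d.get("id"): d for d in dialogs if d.get("id")}
    visited = []
    # Document execution begins at the first dialog by default.
    current = dialogs[0] if dialogs else None
    while current is not None and current not in visited:
        visited.append(current)
        goto = current.find(".//goto")
        target = goto.get("next", "") if goto is not None else ""
        # Hypothetical simplification: only "#id" targets are followed.
        # No successor means document execution stops.
        current = by_id.get(target[1:]) if target.startswith("#") else None
    return [d.get("id") or f"<dialog {dialogs.index(d)}>" for d in visited]


print(dialog_order(DOC))  # ['<dialog 0>', 'say_goodbye']
```

The unnamed first form is reported by position because it has no id attribute.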

(14)

Executing a Multi-Document Application

Multiple documents work together as one application

– Root document

– One or more leaf documents

Each leaf document names the root document in the application attribute of its <vxml> element

If the interpreter loads and executes a leaf document in this application, it first loads the application root document if it is not already loaded

[Figure: a root document (<vxml> <var a=".."> … </vxml>) referenced by three leaf documents, each of the form <vxml application="root.vxml"> … </vxml>]

The application root document

remains loaded until the interpreter is told to load a document that

(15)

Forms

Forms are the key component of VoiceXML documents.

A form contains

– A set of form items

– Event handlers

– <filled>-elements

Form Interpretation Algorithm (FIA)

– Select
– Collect
– Process

[Figure: a form contains any number (*) of event handlers and form items]

(16)

Form Items

Form items are the elements that can be visited in the main loop of the FIA

Input items direct the FIA to gather a result for a specific element

Control items contain a block of procedural code to execute

[Figure: form items are either control items (*) or input items (*)]

(17)

Form Item Variables

Each form item has an associated form item variable

Set to undefined or to the value of the expr attribute when the form is entered

Contains the result of interpreting the form item

Guard condition (cond attribute), which governs whether or not that form item can be selected by the FIA
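How form item variables and guard conditions drive the select, collect, and process phases can be sketched as a small loop. This is a strongly simplified model under stated assumptions, not the FIA from the specification: run_fia, the dictionary layout of form items, and the canned answers are all hypothetical, and prompts, grammars, and events are left out.

```python
UNDEFINED = object()  # stands in for ECMAScript "undefined"


def run_fia(form_items, answers):
    """Run a simplified select/collect/process loop.

    form_items: list of dicts with "name", "kind" ("field" or "block"),
                an optional "expr" initial value and an optional "cond"
                callable that receives the variables dict.
    answers:    hypothetical canned user input, keyed by field name.
    Returns the visiting order and the final form item variables.
    """
    # On form entry each variable is undefined or takes its expr value.
    variables = {item["name"]: item.get("expr", UNDEFINED) for item in form_items}
    visited = []
    while True:
        # Select: first item whose variable is undefined and whose guard holds.
        selected = next(
            (item for item in form_items
             if variables[item["name"]] is UNDEFINED
             and item.get("cond", lambda v: True)(variables)),
            None,
        )
        if selected is None:
            break  # nothing left to select, so the form is finished
        name = selected["name"]
        visited.append(name)
        if selected["kind"] == "block":
            variables[name] = True  # a visited block counts as filled
        else:
            # Collect and process: a real interpreter would prompt, listen,
            # and interpret the input; here the canned answer is stored.
            variables[name] = answers[name]
    return visited, variables


items = [
    {"name": "welcome", "kind": "block"},
    {"name": "flavor", "kind": "field"},
    {"name": "topping", "kind": "field",
     "cond": lambda v: v["flavor"] == "vanilla"},
]
visited, variables = run_fia(items, {"flavor": "chocolate", "topping": "nuts"})
print(visited)  # ['welcome', 'flavor']
```

The topping field is never selected because its guard condition is false, so its variable stays undefined.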

(18)

Variables

VoiceXML variables are equivalent to ECMAScript variables

Variables are declared by

– <var> elements

– Within a script

The <assign> element assigns a value to a variable

The <clear> element resets one or more variables

<var name="phone" expr="'6305551212'"/>

<script>
  var phone = "6305551212";
</script>

<assign name="flavor" expr="'chocolate'"/>

(19)

Variable Scopes

VoiceXML uses an ECMAScript scope chain to allow variables to be declared at different levels of hierarchy in an application

Variable references match the closest enclosing scope according to the scope chain

Prefix a reference with a scope name for clarity or to resolve ambiguity

Scope chain, from outermost to innermost:

– session
– application
– document
– dialog
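The lookup rule (closest enclosing scope wins, a scope prefix selects one scope explicitly) can be mimicked with Python's collections.ChainMap. This is a sketch of the idea only: the variable names and values are invented for illustration, and the resolve helper is hypothetical rather than part of any VoiceXML platform.

```python
from collections import ChainMap

# Hypothetical variables declared at each level of the hierarchy.
session = {"caller": "unknown"}
application = {"greeting": "Hello"}
document = {"greeting": "Welcome back"}
dialog = {"flavor": "vanilla"}

# ChainMap searches its maps left to right, so the innermost scope goes first.
scope_chain = ChainMap(dialog, document, application, session)
named_scopes = {
    "session": session,
    "application": application,
    "document": document,
    "dialog": dialog,
}


def resolve(reference):
    """Resolve "name" through the chain, or "scope.name" in one named scope."""
    scope_name, _, name = reference.partition(".")
    if name and scope_name in named_scopes:
        return named_scopes[scope_name][name]
    return scope_chain[reference]


print(resolve("greeting"))              # Welcome back
print(resolve("application.greeting"))  # Hello
```

The unprefixed reference finds the document-level value because it shadows the application-level one.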

(20)

Conditions

<if>, <elseif>, and <else> work analogously to the conditionals of common programming languages

In cond expressions, the operators "<", "<=", and "&&" must be escaped in XML (to "&lt;", "&lt;=", and "&amp;&amp;").

<if cond="flavor == 'vanilla'">
  <assign name="flavor_code" expr="'v'"/>
<elseif cond="flavor == 'chocolate'"/>
  <assign name="flavor_code" expr="'h'"/>
<else/>
  <assign name="flavor_code" expr="'?'"/>
</if>
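When VoiceXML is generated by a program, the escaping of cond expressions does not have to be done by hand. The sketch below uses quoteattr from Python's standard library; the helper name if_element and the sample expression are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def if_element(cond):
    """Return an empty <if> element whose cond attribute is XML-escaped."""
    return f"<if cond={quoteattr(cond)}/>"


expression = "count < 3 && flavor == 'vanilla'"
markup = if_element(expression)
print(markup)
# <if cond="count &lt; 3 &amp;&amp; flavor == 'vanilla'"/>

# An XML parser turns the escaped attribute back into the original expression.
assert ET.fromstring(markup).get("cond") == expression
```

Writing the unescaped expression directly into the attribute would make the document ill-formed, since "<" and "&" are reserved in XML.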

(21)

Directed Forms

– Simplest and most common type of form

– Form items are executed exactly once in sequential order

Mixed Initiative Forms

– Both the computer and the human direct the conversation

– Must have one or more form-level grammars

– Input items can be filled in any order

– More than one input item can be filled as a result of a single user utterance.
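The last two points can be illustrated with a toy slot filler in which one utterance fills several input items at once. Everything here is an assumption made for illustration: the two regular expressions stand in for a form-level grammar, and the slot names origin and destination are invented.

```python
import re

# Hypothetical form-level "grammar": one regular expression per input item.
FORM_GRAMMAR = {
    "origin": re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
}


def fill_slots(utterance, slots):
    """Fill every still-empty slot that the utterance mentions."""
    filled = dict(slots)
    text = utterance.lower()
    for name, pattern in FORM_GRAMMAR.items():
        match = pattern.search(text)
        if filled.get(name) is None and match:
            filled[name] = match.group(1)
    return filled


empty = {"origin": None, "destination": None}
print(fill_slots("From Berlin to Paris", empty))
# {'origin': 'berlin', 'destination': 'paris'}
print(fill_slots("to Rome", empty))
# {'origin': None, 'destination': 'rome'}
```

After the second utterance the computer would still have to ask for the origin, which is where it takes the initiative back.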

(22)

Block

A block contains executable content

It is executed if the block's form item variable is undefined and the block's cond attribute, if any, evaluates to true.

[Figure: a block is a control item]

<block>
  Welcome to Flamingo, your source for lawn ornaments.
</block>

(23)

Prompt

A prompt controls the output of synthesized speech and prerecorded audio

Content of the <prompt> element is modelled on the W3C Speech Synthesis Markup Language

<prompt>
  Welcome to the <emphasis>Bird Seed Emporium</emphasis>.
  <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
  We have 250 kilogram drums of thistle seed for
  <say-as interpret-as="currency">$299.95</say-as>
  plus shipping and handling this month.
</prompt>

(24)

SSML

Speech Synthesis Markup Language (SSML) is a W3C Recommendation for assisting in the generation of speech in Web applications

Gives the underlying speech synthesiser hints on how to render text as speech

Allows flexible prosodic control such as changing volume, rate, and contour of the synthesised speech

(25)

Simple SSML Example

<?xml version="1.0"?>
<speak version="1.0">
  Would you like
  <emphasis> debit </emphasis> or
  <emphasis> credit </emphasis>
</speak>
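SSML of this kind is often assembled by the application at run time. The following sketch builds a comparable document with Python's ElementTree and adds a <prosody> element to slow the speaking rate; the function name build_ssml and its parameters are hypothetical, and namespace and language attributes are left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def build_ssml(before, emphasized, after, rate="slow"):
    """Build a <speak> document with one emphasised word inside <prosody>."""
    speak = ET.Element("speak", {"version": "1.0"})
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = before
    emphasis = ET.SubElement(prosody, "emphasis")
    emphasis.text = emphasized
    emphasis.tail = after  # text that follows the closing </emphasis>
    return ET.tostring(speak, encoding="unicode")


print(build_ssml("Would you like ", "debit", " or credit?"))
# <speak version="1.0"><prosody rate="slow">Would you like <emphasis>debit</emphasis> or credit?</prosody></speak>
```

Building the tree instead of concatenating strings keeps the markup well-formed even when the prompt text contains characters such as "&".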

(26)

Controlling the order of field collection

Several possibilities

1. Assign a value to a form item variable

2. <clear> to set a form item variable to undefined

3. explicitly specify the next form item to visit using

(27)

Field

A field specifies an input item to be gathered from the user

Each field can have one or more prompts

The permitted input is specified by grammars

[Figure: a field is an input item and can contain any number (*) of event handlers]

<field name="flavor">
  <prompt>What is your favorite ice cream?</prompt>
  <grammar src="../grammars/ice_cream.grxml"
           type="application/srgs+xml"/>
</field>

(28)

Internal and External Grammars

Internal Grammar

<grammar type="media-type" mode="voice">
  inline speech grammar
</grammar>

External Grammar

<grammar type="media-type" src="URI"/>

Reference to a grammar in XML or EBNF form

<grammar type="application/srgs+xml" src="http://www…com/date.grxml"/>

(29)

EBNF-Grammar

EBNF-Form of an internal grammar

<grammar mode="voice" type="application/srgs">
  #ABNF 1.0;
  language en-US; mode voice; root $command;
  public $command = $action $object;
  $action = open | close | delete | move;
  $object = [the | a] (window | file | menu);
</grammar>

(EBNF = Extended Backus-Naur-Form)

[Photo: Peter Naur]

(30)

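XML-Grammar

XML-Form of an internal grammar

<grammar mode="voice" xml:lang="en-US" version="1.0" root="command">
  <rule id="command" scope="public">
    <ruleref uri="#action"/> <ruleref uri="#object"/>
  </rule>
  <!-- the action rule is missing on the slide; restored from the EBNF form -->
  <rule id="action">
    <one-of>
      <item> open </item><item> close </item><item> delete </item><item> move </item>
    </one-of>
  </rule>
  <rule id="object">
    <item repeat="0-1">
      <one-of>
        <item> the </item><item> a </item>
      </one-of>
    </item>
    <one-of>
      <item> window </item><item> file </item><item> menu </item>
    </one-of>
  </rule>
</grammar>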
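What this command grammar accepts can be made concrete with an equivalent regular expression. This is a sketch for illustration only: a speech recogniser does not match text this way, the action alternatives are taken from the EBNF form of the same grammar, and parse_command is a hypothetical helper.

```python
import re

# $action = open | close | delete | move
ACTION = r"(?P<action>open|close|delete|move)"
# $object = [the | a] (window | file | menu)
OBJECT = r"(?:(?:the|a) )?(?P<object>window|file|menu)"
# $command = $action $object
COMMAND = re.compile(rf"{ACTION} {OBJECT}")


def parse_command(utterance):
    """Return the matched action and object, or None if out of grammar."""
    match = COMMAND.fullmatch(utterance.strip().lower())
    return match.groupdict() if match else None


print(parse_command("Open the window"))  # {'action': 'open', 'object': 'window'}
print(parse_command("close menu"))       # {'action': 'close', 'object': 'menu'}
print(parse_command("open door"))        # None
```

The last call corresponds to the nomatch case: the words are understood, but the grammar does not cover them.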

(31)

Grammars can get very complicated:

Sales

– I'd like to place an order
– I need to talk to a salesman

Repair

– repair department
– service
– service department
– customer service

Order status

– where's my order?
– track my order
– track my shipment
– where the hell is my stuff?

(32)

Semantic Interpretation

Semantic Interpretation for Speech Recognition (SISR) is a specification for extracting the semantics or meanings of a raw utterance

SISR is used inside SRGS grammars to annotate the meaning of the matched words

Used to implement “Natural Language Understanding”

(33)

Simple SISR Example

<?xml version="1.0"?>
<grammar version="1.0"
         tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>
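The effect of the literal tags, several different wordings collapsing to one meaning, can be shown by reading the grammar with Python. This is a sketch under assumptions: literal_interpretations is a hypothetical helper, and the fallback for items without a <tag> (using the spoken words themselves) is a simplification made here.

```python
import xml.etree.ElementTree as ET

GRAMMAR = """<grammar version="1.0" tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>"""


def literal_interpretations(grammar_text):
    """Map each spoken alternative to its literal semantic result."""
    result = {}
    for item in ET.fromstring(grammar_text).iter("item"):
        words = (item.text or "").strip()
        tag = item.find("tag")
        # Assumption of this sketch: without a <tag>, the words are the result.
        result[words] = tag.text.strip() if tag is not None else words
    return result


print(literal_interpretations(GRAMMAR))
# {'yes': 'yes', 'sure': 'yes', 'aye': 'yes'}
```

The dialog only ever has to test for the single value yes, regardless of which wording the caller used.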

(34)

Event Handlers

Deal with exceptional or error conditions

Control mechanism for dialog turn retries

– <catch event="noinput"> … </catch>
– <catch event="nomatch"> … </catch>
– <catch event="help"> … </catch>

Shorthand notation available

– <noinput> … </noinput>, etc.

Scoped according to where they occur

(35)

Filled

Specifies an action to perform when some combination of input items are filled

<form id="get_city">
  <field name="city">
    <grammar type="application/srgs+xml" src="served_cities.grxml"/>
    <prompt>What is the city?</prompt>
    <filled>
      <if cond="city == 'Novosibirsk'">
        <prompt>
          Note, Novosibirsk service ends next year.
        </prompt>
      </if>
    </filled>
  </field>
</form>

(36)

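VoiceXML example with error handling

<form>
  <!-- field name is illustrative; the slide omits the field element and its grammar -->
  <field name="department">
    <prompt>
      <audio src="welcome.wav">Welcome to Acme.
      You can choose sales, repair, or order status.</audio>
    </prompt>
    <noinput> You must say something. </noinput>
  </field>
  <block> … </block>
</form>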

(37)

<form>
  <nomatch> I didn't understand you. Please try again. </nomatch>
  <block> … </block>
</form>

(38)

VoiceXML example with error handling

<form>
  <help> You can say sales, repair, or order status. </help>
  <nomatch> I didn't understand you. Please try again. </nomatch>
  <block> … </block>
</form>

(39)

Tapered Prompts

Tapered prompts are those that may change with each attempt.

Information-requesting prompts may become more terse, under the assumption that the user is becoming more familiar with the task

Help messages may become more detailed, under the assumption that the user needs more help.

Prompts can also change just to make the interaction more interesting.

<form>
  <block>
    <prompt bargein="false">Welcome to the ice cream survey.</prompt>
  </block>
  <field name="flavor">
    <prompt count="1">What is your favorite flavor?</prompt>
    <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    <help>Sorry, no help is available.</help>
  </field>
</form>
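Which prompt is played on a given attempt follows from the count attributes: the prompt with the highest count that does not exceed the current attempt number is used. A minimal sketch of that selection, with the two ice cream prompts as data and select_prompt as a hypothetical helper name:

```python
# Prompts of the flavor field, keyed by their count attribute.
PROMPTS = {
    1: "What is your favorite flavor?",
    3: "Say chocolate, vanilla, or strawberry.",
}


def select_prompt(prompts, attempt):
    """Pick the prompt with the highest count not exceeding the attempt."""
    eligible = [count for count in prompts if count <= attempt]
    return prompts[max(eligible)] if eligible else None


for attempt in (1, 2, 3, 4):
    print(attempt, select_prompt(PROMPTS, attempt))
# 1 What is your favorite flavor?
# 2 What is your favorite flavor?
# 3 Say chocolate, vanilla, or strawberry.
# 4 Say chocolate, vanilla, or strawberry.
```

The second attempt reuses the first prompt because no prompt with count 2 exists.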

(40)

Menu

Simple type of a dialog

Selects a predefined option

Selected option determines the next dialog

<menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice next="sports.vxml"> Sports </choice>
  <choice next="weather.vxml"> Weather </choice>
  <choice next="astronews.vxml"> News </choice>
  <noinput>
    Please say one of <enumerate/>
  </noinput>
</menu>

C(omputer): Say one of: Sports; Weather; News.
H(uman): Astrology
C: I did not understand what you said. (a platform-specific default message.)
C: Say one of: Sports; Weather; News.
H: Sports

(41)

DTMF menu

Menu with explicit DTMF sequences given to each choice

<menu>
  <property name="inputmodes" value="dtmf"/>
  <prompt>For sports press 1, for weather press 2, for news press 3.</prompt>
  <choice dtmf="1" next="sports.vxml"/>
  <choice dtmf="2" next="weather.vxml"/>
  <choice dtmf="3" next="astronews.vxml"/>
</menu>

(42)

Subdialog (1)

Subdialogs

– mechanism for reusing common dialogs

– building libraries of reusable applications

Calling dialog waits until execution has finished

Results are returned using the <return> tag

<form>
  <subdialog name="result" src="#getdriverslicense">
    <param name="birthday" expr="'2000-02-10'"/>
    <filled>
      <submit next="http://acme.com/cgi-bin/process"/>
    </filled>
  </subdialog>
</form>

(43)

<form id="getdriverslicense">
  <var name="birthday"/>
  <field name="drivelicense">
    <grammar src="http://grammarlib/drivegrammar.grxml"
             type="application/srgs+xml"/>
    <prompt> Please say your drivers license number. </prompt>
    <filled>
      <if cond="validdrivelicense(drivelicense,birthday)">
        <var name="status" expr="true"/>
      <else/>
        <var name="status" expr="false"/>
      </if>
      <return namelist="drivelicense status"/>
    </filled>
  </field>
</form>

(44)

Record

Collects a recording from the user

The recording can be played back (using the expr attribute on <audio>) or submitted to a server

[Figure: a record element is an input item and can contain any number (*) of event handlers]

<record name="msg" beep="true" maxtime="10s"
        finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
  <prompt timeout="5s">
    Record a message after the beep.
  </prompt>
  <noinput>
    I didn't hear anything, please try again.
  </noinput>
</record>

<prompt>
  Your message is <audio expr="msg"/>.
</prompt>

(45)

Link

Specifies a grammar that is active whenever the user is in the scope of the link

If user input matches the link's grammar, control transfers to the link's destination URI

<link next="http://www.voicexml.org/books/main.vxml">
  <grammar mode="voice" version="1.0" root="root">
    <rule id="root" scope="public">
      <one-of>
        <item>books</item>
        <item>VoiceXML books</item>
      </one-of>
    </rule>
  </grammar>
  <grammar mode="dtmf" version="1.0" root="r2">
    <rule id="r2" scope="public"> 2 </rule>
  </grammar>
</link>

(46)

PLS

Pronunciation Lexicon Specification (PLS) – W3C Voice Browser Activity

– Pronunciation lexicon markup language

Two main applications:

– Speech Synthesis (SSML documents)

PLS improves SSML on text normalization, GTP

(47)

PLS in SSML

An SSML document references an external pronunciation lexicon:

– TTS engine loads the PLS documents and applies them to the SSML document

– applications may specify contextual PLS documents, which are to be used at different points of the interaction (like airports.pls, carriers.pls, …)

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="SI">
  <lexicon uri="http://www.alpineon.com/airports.pls"/>
  Letalo letalske družbe British Airlines, ki prihaja iz Manchestra, bo imelo 5 minut zamude.
</speak>

(48)

Lexeme

The <lexeme> element is the container of a lexicon entry:

– usually only one <grapheme> element

– several <phoneme> or <alias> elements

<?xml version="1.0" encoding="UTF-8"?>
<!-- opening tags are missing on the slide; the attributes shown are assumptions -->
<lexicon version="1.0" alphabet="x-sampa-SI-reduced">
  <lexeme>
    <grapheme>dober</grapheme>
    <phoneme>"d/o:-b@r</phoneme>
    <!-- This is an example of the x-sampa-SI-reduced string for the pronunciation of the Slovenian word: "dober", meaning "good" in English -->
  </lexeme>
</lexicon>

(49)

Events

Events are thrown by the platform under a variety of circumstances

– when the user does not respond
– when the user doesn't respond intelligibly
– when the user requests help, etc.
– on a semantic error in a VoiceXML document
– on explicit use of <throw>

Events are caught by catch elements

<throw event="nomatch"/>
<throw event="connection.disconnect.hangup"/>

<catch event="nomatch noinput" count="3">
  <prompt>Security violation!</prompt>
</catch>

(50)

Goto

The <goto> element is used to:

– transition to another form item in the current form,

– transition to another dialog in the current document, or

– transition to another document.

<goto nextitem="ssn_confirm"/>
<goto next="#another_dialog"/>

(51)

Submit

The <submit> element is used to submit information to the origin Web server and then transition to the document sent back in the response.

<submit next="log_request" method="post"

(52)

Tools

Eclipse Voice Tools Project

(53)

Call Control XML

Call Control XML (CCXML) provides call control support for VoiceXML and other dialog languages

Supports:

– Multi-party conferencing

– Multi-call handling and control

– Asynchronous event handling

(54)

Simple Call Control Example

<?xml version="1.0"?>
<ccxml version="1.0">
  <eventprocessor>
    <transition event="connection.connected">
      <dialogstart uri="helloworld.vxml"/>
    </transition>
  </eventprocessor>
</ccxml>

(55)

IETF Protocols

Three important IETF protocols powering the Speech-Enabled Web:

– Hyper Text Transfer Protocol (HTTP)

– Session Initiation Protocol (SIP)

– Media Resource Control Protocol (MRCP)

HTTP, SIP, and MRCP are common interaction

(56)

HTTP

HTTP is an open protocol designed for distributed, collaborative, hypermedia information systems

A lightweight, request/response protocol that enables a robust and scalable distribution of resources within the Web

Speech applications use HTTP for fetching and transporting resources such as VoiceXML

(57)

Transfer to Speech-Enabled Web Technology

HTTP affords the application developer the ability to deploy his/her application remotely from the platform provider

HTTP employs the http: and https: URI schemes for identification of resources

Speech-Enabled Web Technology inherits resource discovery, load-balancing, and failover solutions from HTTP

(58)

Simple HTTP Example

Request from the VoiceXML browser to the webserver:

GET /application.vxml HTTP/1.1
Host: webserver1.voxpilot.com

Response from the webserver to the VoiceXML browser:

HTTP/1.1 200 OK
Date: Wed, 25 May 2005 12:00:00 GMT
Content-Type: application/voicexml+xml
Content-Length: 128

<?xml version="1.0"?>
<vxml version="2.0"> . . .
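On the browser side, the status line and the Content-Type header decide whether the response body is treated as a VoiceXML document. The sketch below checks both on a raw response string using Python's standard email header parser; the helper is_voicexml_response and the shortened sample response are made up for illustration, and a real browser would use a full HTTP client instead.

```python
from email.parser import Parser

RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/voicexml+xml\r\n"
    "Content-Length: 128\r\n"
    "\r\n"
)


def is_voicexml_response(raw):
    """Return True for a 200 response that declares a VoiceXML body."""
    status_line, _, rest = raw.partition("\r\n")
    _version, status, _reason = status_line.split(" ", 2)
    # HTTP headers share their syntax with e-mail headers.
    headers = Parser().parsestr(rest, headersonly=True)
    return (status == "200"
            and headers.get_content_type() == "application/voicexml+xml")


print(is_voicexml_response(RAW_RESPONSE))  # True
```

A response with another status code or media type would be rejected by this check.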

(59)

SIP

SIP is an open IP signalling protocol for audio/video telephony, conferencing, and presence & instant messaging

SIP is often called a “rendezvous protocol”

Gaining rapid adoption as the signalling protocol of choice: The 3GPP has selected it for powering the IP Multimedia Subsystem (IMS) architecture

(60)

SIP and VoiceXML

SIP is a popular protocol for providing the telephony interface to VoiceXML and CCXML servers

The sip: and sips: URI schemes are used for

identification of VoiceXML and CCXML

(61)

[Figure: SIP message sequence between a SIP phone and a VoiceXML browser: INVITE, 200 OK, ACK, media in both directions, BYE, 200 OK]

(62)

MRCP

MRCP is an open protocol for controlling network-based media resources such as speech recognisers and speech synthesisers

Problem statement:

– Different markets have different preferred speech engine vendors

– Speech engine APIs are complex, diverse and moving targets, often changing per version!

– Platform integrators need to maintain integrations to multiple vendors

(63)

MRCP Benefits

MRCP delivers a standard protocol that alleviates the integration burden for everyone

Win-win situation: speech vendors concentrate on the speech engine, platform vendors concentrate on the platform

MRCP is being widely adopted by leading speech vendors

(64)

MRCP and SIP

MRCP employs SIP to establish media and control sessions to speech recognisers and from speech synthesisers

MRCP is a text-based control protocol (inspired by HTTP) and provides hooks to control media resources and to receive progress notifications

By leveraging SIP, MRCP inherits resource

(65)

Simple MRCP Example

[Figure: MRCP message sequence between a VoiceXML browser and a speech recognizer: RECOGNIZE, 200 IN-PROGRESS, START-OF-SPEECH, RECOGNITION-COMPLETE]

(66)

Putting it all together

Representation: VoiceXML, SRGS, SSML, SISR, CCXML

Interaction: HTTP, SIP, MRCP

Identification: http:, https:, sip:, sips:

Orthogonality allows new speech standards to be created and evolved in parallel to each other

(67)

Putting it all together (2)

Web and Internet standards greatly alleviate the hurdles of closed, proprietary interfaces and APIs

Creating applications no longer requires specialised professional services

Existing Web infrastructure and skills can be leveraged

Scalability, robustness, security, resource discovery solutions are inherited “for free”

(68)