Voice User Interface Design
Part III: Technology Support for Voice Applications
Dr. Dirk Schnelle-Walka
W3C Speech Interface Framework
[Figure: components of the framework: Speech Synthesis, Grammar, Pronunciation Lexicon, VoiceXML, Call Control, Semantic Interpretation]
Status of W3C Speech Interface Languages
[Figure: maturity of VoiceXML 2.0 & 2.1, Grammar, Synthesis 1.0, Synthesis 1.1, Call Control, Semantic Interpretation and PLS on the W3C track: Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, Recommendation]
VoiceXML
VoiceXML is an XML markup language for developing speech applications
(the voice-based equivalent of HTML)
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>
Goals
Use the internet via the telephone
– voice and touch tones
Transfer the advantages of web-based development and data transfer to interactive voice-based applications
Ease the development of voice-based dialogs
Authors should not have to deal with low-level programming tasks
Development of a new language since voice
History of VoiceXML
[Figure: timeline from native code (<1994) through PhoneWeb, VoxML, PML, SpeechML and TalkML to VoiceXML 1.0, VoiceXML 2.0, VoiceXML 2.1 and VoiceXML 3.0; year marks 1995, 1998, 1999, 2000, 2004, 2007, 2008; one entry is labelled "(defunct)"]
Scope of VoiceXML
VoiceXML describes the human-machine interaction provided by voice response systems, which includes:
Output of synthesized speech (text-to-speech).
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.
VoiceXML-document
VoiceXML-document = finite state machine
– Each state specifies the next state
The user is always in a defined conversational state (= dialog)
<vxml> is the top-level element of a document, serving as a container for dialogs
Dialog
– Form
– Menu
A simple document
<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">
        Welcome to Travel Planner!
        <audio src="http://www.adline.com/travelad.wav"/>
      </prompt>
    </block>
  </form>
</vxml>
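Analogy HTML - VoiceXML
[Figure: side-by-side comparison. HTML: a PC with a web browser ("point & click") reaches the web server over the Internet and fetches HTML documents with images, audio/video and scripts. VoiceXML: a simple telephone reaches the voice server (voice browser with ASR, DTMF, TTS, …) over the telephony network and an Internet gateway, and fetches VoiceXML documents with audio, grammars and scripts.]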
Architecture
[Figure: the VoiceXML interpreter runs inside the VoiceXML interpreter context on top of the implementation platform; the interpreter sends requests to the document server, which answers with documents]
Core Concepts
Session
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and
processed, and ends when requested by the user, a document, or the interpreter context.
Application
An application is a set of documents sharing the same application root document
Grammar
Each dialog has one or more speech and/or DTMF grammars associated with it.
Events
Notification mechanism
Links
Execution with one document
Document execution begins at the first dialog by default.
As each dialog executes, it determines the next dialog.
When a dialog doesn't specify a successor dialog, document execution stops.
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0">
  <var name="hi" expr="'Hello World!'"/>
  <form>
    <block>
      <value expr="hi"/>
      <goto next="#say_goodbye"/>
    </block>
  </form>
  <form id="say_goodbye">
    <block> Goodbye! </block>
  </form>
</vxml>
Executing a Multi-Document Application
Multiple documents work together as one application
– Root document
– One or more leaf documents
Each leaf document names the root document in its <vxml> element
If the interpreter loads and executes a leaf document in this application, it first loads the application root document if it is not already loaded
Root document (root.vxml):
<vxml> <var a=".."> … </vxml>
Leaf documents:
<vxml application="root.vxml"> … </vxml>
<vxml application="root.vxml"> … </vxml>
<vxml application="root.vxml"> … </vxml>
The application root document remains loaded until the interpreter is told to load a document that belongs to a different application
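A leaf document of such an application might look like the sketch below. Only the file name root.vxml comes from the slide; the variable company, the form id and orders.vxml are hypothetical, and the root document is assumed to declare the variable at document level.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Leaf document. Assumption: root.vxml declares
     <var name="company" expr="'Acme'"/> at document level. -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="root.vxml">
  <form id="welcome">
    <block>
      <!-- Document-level variables of the root document are visible
           to every leaf document in the application scope. -->
      <prompt>Welcome to <value expr="application.company"/>.</prompt>
      <!-- orders.vxml (hypothetical) is assumed to name the same root,
           so root.vxml stays loaded across this transition. -->
      <goto next="orders.vxml"/>
    </block>
  </form>
</vxml>
```

Keeping shared variables, links and event handlers in the root document avoids repeating them in every leaf.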
Forms
Forms are the key component of VoiceXML documents.
A form contains
– A set of form items
– Event handlers
– <filled>-elements
Form Interpretation Algorithm (FIA)
– Select
– Collect
– Process
[Figure: a form contains any number of event handlers and form items]
Form Items
Form items are the elements that can be visited in the main loop of the FIA
Input items direct the FIA to gather a result for a specific element
Control items contain a block of procedural code to execute
[Figure: form items are either control items or input items]
Form Item Variables
Each form item has an associated form item variable
– Set to undefined or to the value of the expr attribute when the form is entered
– Contains the result of interpreting the form item
Guard condition (cond attribute), which governs whether or not that form item can be selected by the FIA
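The effect of expr and cond on item selection can be sketched as follows. This is an illustration, not part of the original slides: the form, the field names and languages.grxml are hypothetical, and the built-in boolean and digits field types are assumed to be supported by the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="survey">
    <!-- expr pre-sets the form item variable, so the FIA treats this
         field as filled and skips it until the variable is cleared. -->
    <field name="language" expr="'en'">
      <grammar src="languages.grxml" type="application/srgs+xml"/>
      <prompt>Which language do you prefer?</prompt>
    </field>
    <!-- No expr: the variable starts out undefined, so the field is visited. -->
    <field name="member" type="boolean">
      <prompt>Are you a member?</prompt>
    </field>
    <!-- Guard condition: selectable only if the caller answered yes. -->
    <field name="member_id" type="digits" cond="member">
      <prompt>Please say your member number.</prompt>
    </field>
  </form>
</vxml>
```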
Variables
VoiceXML variables are equivalent to ECMAScript variables
Variables are declared by
– <var> elements
– Within a script
The <assign> element assigns a value to a variable
The <clear> element resets one or more variables
<var name="phone" expr="'6305551212'"/>
<script>
  var phone = "6305551212";
</script>
<assign name="flavor" expr="'chocolate'"/>
Variable Scopes
VoiceXML uses an ECMAScript scope chain to allow variables to be declared at different levels of hierarchy in an application
Variable references match the closest enclosing scope according to the scope chain listed below
Prefix a reference with a scope name for clarity or to resolve ambiguity
session
application
document
dialog
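How the scope chain resolves a name that is declared twice can be sketched like this. The variable name greeting and the form id are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- document scope -->
  <var name="greeting" expr="'Hello from the document'"/>
  <form id="scopes">
    <!-- dialog scope: shadows the document-level variable -->
    <var name="greeting" expr="'Hello from the dialog'"/>
    <block>
      <!-- unqualified reference: the closest enclosing scope (dialog) wins -->
      <prompt><value expr="greeting"/></prompt>
      <!-- qualified reference: reaches the shadowed document-level variable -->
      <prompt><value expr="document.greeting"/></prompt>
    </block>
  </form>
</vxml>
```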
Conditions
<if>, <elseif> and <else> work as in common programming languages
In cond, the operators "<", "<=" and "&&" must be escaped in XML (as "&lt;", "&lt;=" and "&amp;&amp;")
<if cond="flavor == 'vanilla'">
  <assign name="flavor_code" expr="'v'"/>
<elseif cond="flavor == 'chocolate'"/>
  <assign name="flavor_code" expr="'h'"/>
<else/>
  <assign name="flavor_code" expr="'?'"/>
</if>
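A condition that needs the escaped operators might look like the following sketch. The variable scoops and the prompts are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <var name="scoops" expr="2"/>
  <form>
    <block>
      <!-- ECMAScript: scoops > 0 && scoops <= 3, written with XML escapes -->
      <if cond="scoops &gt; 0 &amp;&amp; scoops &lt;= 3">
        <prompt>Coming right up.</prompt>
      <else/>
        <prompt>Sorry, we serve one to three scoops.</prompt>
      </if>
    </block>
  </form>
</vxml>
```

Escaping ">" is optional inside an attribute value, but "<" and "&" must always be escaped.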
Directed Forms
– Simplest and most common type of form
– Form items are executed exactly once in sequential order
Mixed Initiative Forms
– Both the computer and the human direct the conversation
– Must have one or more form-level grammars
– Input items can be filled in any order
– More than one input item can be filled as a result of a single user utterance.
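A mixed initiative form could be sketched as below. All names are hypothetical: route.grxml is assumed to be a form-level grammar whose semantic result carries the properties origin and destination, and city.grxml an ordinary field grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="booking">
    <!-- Form-level grammar: one utterance such as
         "from Berlin to Rome" may fill both fields at once. -->
    <grammar src="route.grxml" type="application/srgs+xml"/>
    <initial name="start">
      <prompt>Where do you want to travel from and to?</prompt>
      <!-- After two failed attempts, mark the initial item as done so
           the FIA falls back to asking field by field. -->
      <nomatch count="2">
        <assign name="start" expr="true"/>
      </nomatch>
    </initial>
    <field name="origin">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>From which city?</prompt>
    </field>
    <field name="destination">
      <grammar src="city.grxml" type="application/srgs+xml"/>
      <prompt>To which city?</prompt>
    </field>
    <!-- Runs once both input items are filled, in whatever order. -->
    <filled mode="all" namelist="origin destination">
      <prompt>
        Travelling from <value expr="origin"/> to <value expr="destination"/>.
      </prompt>
    </filled>
  </form>
</vxml>
```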
Block
A block contains executable content
Executed if the block's form item variable is undefined and the block's cond attribute, if any, evaluates to true.
[Figure: a block is a control item]
<block>
  Welcome to Flamingo, your source for lawn ornaments.
</block>
Prompt
A prompt controls the output of synthesized speech and prerecorded audio
Content of the <prompt> element is modelled on the W3C Speech Synthesis Markup Language
<prompt>
  Welcome to the <emphasis>Bird Seed Emporium</emphasis>.
  <audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
  We have 250 kilogram drums of thistle seed for
  <say-as interpret-as="currency">$299.95</say-as>
  plus shipping and handling this month.
</prompt>
SSML
Speech Synthesis Markup Language (SSML) is a W3C Recommendation for assisting in the generation of speech in Web applications
Gives the underlying speech synthesiser hints on how to render text as speech
Allows flexible prosodic control such as changing volume, rate, and contour of the synthesised speech
Simple SSML Example
<?xml version="1.0"?>
<speak version="1.0">
  Would you like
  <emphasis> debit </emphasis> or
  <emphasis> credit </emphasis>
</speak>
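The prosodic control mentioned above (volume, rate, contour) is expressed with the <prosody> element. The following sketch uses invented text; how strongly an engine follows these hints varies between synthesisers.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is
  <!-- slower and louder than the surrounding text -->
  <prosody rate="slow" volume="loud">two hundred dollars</prosody>.
  <!-- pitch contour: raise the pitch towards the end of the question -->
  <prosody contour="(0%,+20Hz) (100%,+40Hz)">Anything else?</prosody>
</speak>
```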
Controlling the order of field collection
Several possibilities
1. Assign a value to a form item variable
2. <clear> to set a form item variable to undefined
3. Explicitly specify the next form item to visit using <goto nextitem>
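The following sketch shows these mechanisms in one form. The form, the field names and currency.grxml are hypothetical, and the built-in number and boolean field types are assumed to be available on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="transfer">
    <block>
      <!-- Assigning a value marks the item as filled, so the FIA
           never selects the currency field below. -->
      <assign name="currency" expr="'EUR'"/>
    </block>
    <field name="currency">
      <grammar src="currency.grxml" type="application/srgs+xml"/>
      <prompt>Which currency?</prompt>
    </field>
    <field name="amount" type="number">
      <prompt>How much do you want to transfer?</prompt>
    </field>
    <field name="confirm" type="boolean">
      <prompt>You said <value expr="amount"/>. Is that correct?</prompt>
      <filled>
        <if cond="!confirm">
          <!-- Reset both items to undefined so they are eligible again ... -->
          <clear namelist="amount confirm"/>
          <!-- ... and jump straight back to the amount field. -->
          <goto nextitem="amount"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Without the <goto>, the FIA would still return to amount, because it selects the first form item whose variable is undefined; the explicit jump only makes the intent visible.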
Field
A field specifies an input item to be gathered from the user
Each field can have one or more prompts
The permitted input is specified by grammars
[Figure: a field is an input item and can contain event handlers]
<field name="flavor">
  <prompt>What is your favorite ice cream?</prompt>
  <grammar src="../grammars/ice_cream.grxml"
           type="application/srgs+xml"/>
</field>
Internal and External Grammars
Internal Grammar
<grammar type="media-type" mode="voice">
  inline speech grammar
</grammar>
External Grammar
<grammar type="media-type" src="URI"/>
Reference to a grammar in XML or EBNF form
<grammar type="application/srgs+xml" src="http://www…com/date.grxml"/>
EBNF-Grammar
EBNF form of an internal grammar
<grammar mode="voice" type="application/srgs">
#ABNF 1.0;
language en-US;
mode voice;
root $command;
public $command = $action $object;
$action = open | close | delete | move;
$object = [the | a] (window | file | menu);
</grammar>
(EBNF = Extended Backus-Naur Form)
[Photo: Peter Naur]
XML-Grammar
XML form of an internal grammar
<grammar mode="voice" xml:lang="en-US" version="1.0" root="command">
  <rule id="command" scope="public">
    <ruleref uri="#action"/>
    <ruleref uri="#object"/>
  </rule>
  <rule id="action">
    <one-of>
      <item> open </item><item> close </item><item> delete </item><item> move </item>
    </one-of>
  </rule>
  <rule id="object">
    <item repeat="0-1">
      <one-of>
        <item> the </item><item> a </item>
      </one-of>
    </item>
    <one-of>
      <item> window </item><item> file </item><item> menu </item>
    </one-of>
  </rule>
</grammar>
Grammars can get very complicated:
Sales
– I'd like to place an order
– I need to talk to a salesman
Repair
– repair department
– service
– service department
– customer service
Order status
– where's my order?
– track my order
– track my shipment
– where the hell is my stuff?
Semantic Interpretation
Semantic Interpretation for Speech Recognition (SISR) is a
specification for extracting the semantics or meanings of a raw utterance
SISR is used inside SRGS grammars
to annotate the meaning of the matched words
Used to implement “Natural Language Understanding”
Simple SISR Example
<?xml version="1.0"?>
<grammar version="1.0"
         tag-format="semantics/1.0-literals">
  <one-of>
    <item> yes </item>
    <item> sure <tag>yes</tag> </item>
    <item> aye <tag>yes</tag> </item>
  </one-of>
</grammar>
Event Handlers
Deal with exceptional or error conditions
Control mechanism for dialog turn retries
– <catch event="noinput"> … </catch>
– <catch event="nomatch"> … </catch>
– <catch event="help"> … </catch>
Shorthand notation available
– <noinput> … </noinput>, etc.
Scoped according to where they occur
Filled
Specifies an action to perform when some combination of input items are filled
<form id="get_city">
  <field name="city">
    <grammar type="application/srgs+xml" src="served_cities.grxml"/>
    <prompt>What is the city?</prompt>
    <filled>
      <if cond="city == 'Novosibirsk'">
        <prompt>
          Note, Novosibirsk service ends next year.
        </prompt>
      </if>
    </filled>
  </field>
</form>
VoiceXML example with error handling
The handlers are added step by step: first <noinput>, then <nomatch>, then <help>.
<form>
  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme.
        You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>
  <help> You can say sales, repair, or order status. </help>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>
  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>
</form>
Tapered Prompts
Tapered prompts are those that may change with each attempt.
Information-requesting prompts may become more terse under the assumption that the user is becoming more familiar with the task
Help messages may become more detailed under the assumption that the user needs more help.
Prompts can change just to make the interaction more interesting.
<form>
  <block>
    <prompt bargein="false">Welcome to the ice cream survey.</prompt>
  </block>
  <field name="flavor">
    <prompt count="1">What is your favorite flavor?</prompt>
    <prompt count="3">Say chocolate, vanilla, or strawberry.</prompt>
    <help>Sorry, no help is available.</help>
  </field>
</form>
Menu
Simple type of dialog
Selects a predefined option
Selected option determines the next dialog
<menu>
  <prompt>
    Welcome home. Say one of: <enumerate/>
  </prompt>
  <choice next="sports.vxml"> Sports </choice>
  <choice next="weather.vxml"> Weather </choice>
  <choice next="astronews.vxml"> News </choice>
  <noinput>
    Please say one of <enumerate/>
  </noinput>
</menu>
C(omputer): Say one of: Sports; Weather; News.
H(uman): Astrology
C: I did not understand what you said. (a platform-specific default message.)
C: Say one of: Sports; Weather; News.
H: Sports
DTMF menu
Menu with explicit DTMF sequences given to each choice
<menu>
  <property name="inputmodes" value="dtmf"/>
  <prompt>For sports press 1, For weather press 2, For news 3.</prompt>
  <choice dtmf="1" next="sports.vxml"/>
  <choice dtmf="2" next="weather.vxml"/>
  <choice dtmf="3" next="astronews.vxml"/>
</menu>
Subdialog (1)
Subdialogs
– mechanism for reusing common dialogs
– building libraries of reusable applications
Calling dialog waits until execution has finished
Results are returned using the <return> tag
<!-- form dialog that calls a subdialog -->
<form>
  <subdialog name="result" src="#getdriverslicense">
    <param name="birthday" expr="'2000-02-10'"/>
    <filled>
      <submit next="http://acme.com/cgi-bin/process"/>
    </filled>
  </subdialog>
</form>
<!-- subdialog to get drivers license -->
<form id="getdriverslicense">
  <var name="birthday"/>
  <field name="drivelicense">
    <grammar src="http://grammarlib/drivegrammar.grxml"
             type="application/srgs+xml"/>
    <prompt> Please say your drivers license number. </prompt>
    <filled>
      <if cond="validdrivelicense(drivelicense,birthday)">
        <var name="status" expr="true"/>
      <else/>
        <var name="status" expr="false"/>
      </if>
      <return namelist="drivelicense status"/>
    </filled>
  </field>
</form>
Record
Collects a recording from the user
Recording can be played back (using the expr attribute on <audio>) or submitted to a server
[Figure: a record element is an input item and can contain event handlers]
<record name="msg" beep="true" maxtime="10s"
        finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
  <prompt timeout="5s">
    Record a message after the beep.
  </prompt>
  <noinput>
    I didn't hear anything, please try again.
  </noinput>
</record>
<prompt>
  Your message is <audio expr="msg"/>.
</prompt>
Link
Specifies a grammar that is active whenever the user is in the scope of the link
If user input matches the link's grammar, control transfers to the link's destination URI
<link next="http://www.voicexml.org/books/main.vxml">
  <grammar mode="voice" version="1.0" root="root">
    <rule id="root" scope="public">
      <one-of>
        <item>books</item>
        <item>VoiceXML books</item>
      </one-of>
    </rule>
  </grammar>
  <grammar mode="dtmf" version="1.0" root="r2">
    <rule id="r2" scope="public"> 2 </rule>
  </grammar>
</link>
PLS
Pronunciation Lexicon Specification (PLS)
– W3C Voice Browser Activity
– Pronunciation lexicon markup language
Two main applications:
– Speech Synthesis (SSML documents)
– Speech Recognition (SRGS grammars)
PLS improves SSML on text normalization and grapheme-to-phoneme (GTP) conversion
PLS in SSML
SSML document references an external pronunciation lexicon:
– TTS engine loads the PLS documents and applies them to the SSML document
– applications may specify contextual PLS documents, which are to be used at different points of the interaction (like airports.pls, carriers.pls, …)
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="sl-SI">
  <lexicon uri="http://www.alpineon.com/airports.pls"/>
  <!-- English: "The British Airlines flight arriving from Manchester will be 5 minutes late." -->
  Letalo letalske družbe British Airlines, ki prihaja iz Manchestra, bo imelo 5 minut zamude.
</speak>
Lexeme
The <lexeme> element is the container of a lexicon entry:
– usually only one <grapheme> element
– several <phoneme> or <alias> elements
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" xml:lang="sl-SI" alphabet="x-sampa-SI-reduced">
  <lexeme>
    <grapheme>dober</grapheme>
    <phoneme>"d/o:-b@r</phoneme>
    <!-- This is an example of the x-sampa-SI-reduced string for the pronunciation of the Slovenian word "dober", meaning "good" in English -->
  </lexeme>
</lexicon>
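Where a phonetic transcription is not needed, an entry can use <alias> to map a grapheme to other, already pronounceable text. A minimal sketch of an acronym expansion, with the entry chosen purely for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <!-- the written form as it appears in the SSML text -->
    <grapheme>W3C</grapheme>
    <!-- spoken as the expanded text instead of being spelled out -->
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>
```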
Events
Events are thrown by the platform under a variety of circumstances
– when the user does not respond
– when the user doesn't respond intelligibly
– when the user requests help, etc.
– on a semantic error in a VoiceXML document
– on explicit use of <throw>
Events are caught by catch elements
<throw event="nomatch"/>
<throw event="connection.disconnect.hangup"/>
<catch event="nomatch noinput" count="3">
  <prompt>Security violation!</prompt>
</catch>
Goto
The <goto> element is used to:
– transition to another form item in the current form,
– transition to another dialog in the current document, or
– transition to another document.
<goto nextitem="ssn_confirm"/>
<goto next="#another_dialog"/>
Submit
The <submit> element is used to submit information to the origin Web server and then transition to the document sent back in the response.
<submit next="log_request" method="post"/>
Tools
Eclipse Voice Tools Project
Call Control XML
Call Control XML (CCXML) provides call control support for VoiceXML and other dialog languages
Supports:
– Multi-party conferencing
– Multi-call handling and control
– Asynchronous event handling
Simple Call Control Example
<?xml version="1.0"?>
<ccxml version="1.0">
  <eventprocessor>
    <transition event="connection.connected">
      <dialogstart uri="helloworld.vxml"/>
    </transition>
  </eventprocessor>
</ccxml>
IETF Protocols
Three important IETF protocols powering the Speech-Enabled Web:
– Hyper Text Transfer Protocol (HTTP)
– Session Initiation Protocol (SIP)
– Media Resource Control Protocol (MRCP)
HTTP, SIP, and MRCP are common interaction
HTTP
HTTP is an open protocol designed for distributed, collaborative, hypermedia information systems
A lightweight, request/response protocol that enables a robust and scalable distribution of resources within the Web
Speech applications use HTTP for fetching and transporting resources such as VoiceXML
Transfer to Speech-Enabled Web Technology
HTTP affords the application developer the ability to deploy his/her application remotely from the platform provider
HTTP employs the http: and https: URI schemes for identification of resources
Speech-Enabled Web Technology inherits resource discovery, load-balancing, and failover solutions from HTTP
Simple HTTP Example
VoiceXML Browser → Webserver:
GET /application.vxml HTTP/1.1
Host: webserver1.voxpilot.com
Webserver → VoiceXML Browser:
HTTP/1.1 200 OK
Date: Wed, 25 May 2005 12:00:00 GMT
Content-Type: application/voicexml+xml
Content-Length: 128
<?xml version="1.0"?>
<vxml version="2.0"> . . .
SIP
SIP is an open IP signalling protocol for audio/video telephony, conferencing, and presence & instant messaging
SIP is often called a “rendezvous protocol”
Gaining rapid adoption as the signalling protocol of choice: The 3GPP has selected it for powering the IP Multimedia Subsystem (IMS) architecture
SIP and VoiceXML
SIP is a popular protocol for providing the telephony interface to VoiceXML and CCXML servers
The sip: and sips: URI schemes are used for
identification of VoiceXML and CCXML
[Figure: SIP call flow between SIP phone and VoiceXML browser: INVITE, 200 OK, ACK, media in both directions, BYE, 200 OK]
MRCP
MRCP is an open protocol for controlling
network-based media resources such as speech recognisers and speech synthesisers
Problem statement:
– Different markets have different preferred speech engine vendors
– Speech engine APIs are complex, diverse and moving targets, often changing per version!
– Platform integrators need to maintain integrations to multiple vendors
MRCP Benefits
MRCP delivers a standard protocol that alleviates the integration burden for everyone
Win-win situation: speech vendors concentrate on the speech engine, platform vendors
concentrate on the platform
MRCP is being widely adopted by leading speech vendors
MRCP and SIP
MRCP employs SIP to establish media and
control sessions to speech recognisers and from speech synthesisers
MRCP is a text-based control protocol (inspired by HTTP) and provides hooks to control media resources and to receive progress notifications
By leveraging SIP, MRCP inherits resource
Simple MRCP Example
[Figure: message flow between VoiceXML browser and speech recognizer: RECOGNIZE, 200 IN-PROGRESS, START-OF-SPEECH, RECOGNITION-COMPLETE]
Interaction: HTTP, SIP, MRCP
Identification: http:, https:, sip:, sips:
Representation: VoiceXML, SRGS, SSML, SISR, CCXML
Putting it all together
Orthogonality allows new speech standards to be created and evolved in parallel to each other
Putting it all together (2)
Web and Internet standards greatly alleviate the hurdles of closed, proprietary interfaces and APIs
Creating applications no longer requires specialised professional services
Existing Web infrastructure and skills can be leveraged
Scalability, robustness, security, resource discovery solutions are inherited “for free”
Further Reading
Voice Browser Activity
– W3C: http://www.w3.org/Voice/
– VoiceXML Zentrale, Udo Gläser: http://www.glaeser-bonn.de/voicexml/index.html
SIP
– SIP Protocol Overview: http://www.en.voipforo.com/SIP/SIP_architecture.php
MRCP
– RFC 4463: http://tools.ietf.org/html/rfc4463
For testing
– http://www.jvoicexml.org
– http://www.speechforge.org/projects/mrcp4j/index.html
– http://www.mjsip.org/
– http://www.eclipse.org/vtp/