ECE6255
Digital Processing of Speech Signals
Chin-Hui Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology
Atlanta, GA 30332, USA
Lecture 27:
Teaching Machines to Speak and Listen:
A Wonderful Journey from Science
Fictions to Technology Realities
Lecture Outline
• Science Fictions: fascination and inspiration
– “2001: A Space Odyssey”, “Star Wars”, “A.I.”
• History of talking machines
– Text-to-speech (TTS) synthesis
• Development of speech synthesis technology
– Capabilities and limitations
• History of listening machines
– Automatic speech recognition (ASR)
• Development of speech recognition technology
– Capabilities and limitations
• Summary: applications today and tomorrow
– Voice user interface, voice search & speech translation
Speaking and Listening Machines
• Equipping machines with human sensory modalities: a modern dream and fascination
• Humans' unique talking capability
• Children succeed better than machines
• Speaking is easier than listening
• These human capabilities for machines are better demonstrated in science fiction than in today's technology
2001: A Space Odyssey (HAL)
HAL listens, talks, sings, reads lips, plays chess, and solves problems!
A Pair of Ideal Communicators
C-3PO translates, interprets and understands six million artificial & natural languages.
R2-D2 listens, understands and interprets natural languages and works like a
“Star Wars” Theme
Droids on Sesame St.
Birthplace of Modern Speech Research
Bell Labs (BL): an excellent culture
• 25,000+ patents
• 11 Nobel Laureates
• 9 National Medals of Science (US)
• 5 National Medals of Technology (US)
• 1 Emmy Award
• Hosted Chairman Jiang Zemin in 2000 (one of three labs)
• Speech research started with Alexander Graham Bell
– First speech synthesizer
– First speech recognizer
Human Speech Production Mechanism
• Air enters the lungs via normal breathing, and (generally) no speech is produced on intake
• As air is expelled from the lungs via the trachea, the air flow causes the tensed vocal cords within the larynx to vibrate
• Air is chopped up into quasi-periodic pulses, then modulated in frequency (spectrally shaped) in passing through the pharynx (throat cavity), the mouth cavity, and possibly the nasal cavity
Abstractions of Physical Model
[Figure: excitation (voiced or unvoiced) → Time-Varying Filter → speech]
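The abstraction above (an excitation passed through a filter) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the lecture: the filter is held fixed for one short frame rather than varying in time, the three formant frequencies and bandwidths are illustrative values, and all function names are hypothetical.

```python
import numpy as np

def excitation(n_samples, fs, voiced, f0=100.0, seed=0):
    """Pulse train at pitch f0 (voiced) or white noise (unvoiced)."""
    if voiced:
        e = np.zeros(n_samples)
        period = int(round(fs / f0))   # samples between glottal pulses
        e[::period] = 1.0
        return e
    return np.random.default_rng(seed).standard_normal(n_samples)

def formant_filter_coeffs(formants_hz, bandwidths_hz, fs):
    """Denominator A(z) of an all-pole filter: one pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)       # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs       # pole angle from centre frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a

def all_pole_filter(x, a):
    """Direct recursion y[n] = x[n] - sum_k a[k] * y[n-k] for H(z) = 1/A(z)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

fs = 8000  # assumed sampling rate in Hz
# Illustrative (assumed) formants and bandwidths for a vowel-like sound
a = formant_filter_coeffs([700, 1200, 2600], [80, 100, 120], fs)
voiced = all_pole_filter(excitation(800, fs, voiced=True), a)
unvoiced = all_pole_filter(excitation(800, fs, voiced=False), a)
```

Recomputing the coefficients `a` frame by frame, for example every 10 to 20 ms, is one way to approximate the time-varying filter in the figure.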
First Set of Vowel Tubes
• Kratzenstein's acoustic resonators
– Apparatus created in St. Petersburg (1779)
– Figure shown from Schroeter (1993)
First Speaking Machine
• Wheatstone's reconstruction of von Kempelen's speaking machine
– Figure shown from Flanagan (1972)
First Mechanical Synthesizer: VODER
Homer Dudley, 1939 World's Fair in New York City
Source-Filter Synthesis Models
• Cascade/serial (formant) synthesis model
“To Be …”- Bell Labs
Daisy-Daisy with music
We Wish You …
Continuing Evolution (1959-1987)
Haskins, 1959
KTH – Stockholm, 1962
Bell Labs, 1973
MIT, 1976
MITalk, 1979
Speak 'N Spell, 1980
Bell Labs, 1985
DECtalk (voice morphing), 1987
Concatenation Synthesis Systems
• “MI3” story: Tom Cruise mimics Philip Seymour Hoffman with a few sentence examples to cover all sounds
• Choice of units: synthesis through cut-'n'-paste, more units give better performance
– Words: there are an infinite number of them
– Syllables: there are about 10K in English
– Phonemes: there are about 45 in English
– Demi-syllables: there are about 2500 in English
– Diphones: there are about 1500-2500 in English
– Phrases and waveforms: as many units as possible (1990s-2000s)
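The unit counts above can be related with a small sketch. A diphone spans the transition from one phoneme to the next, so about 45 phonemes give at most 45 × 45 = 2025 ordered pairs, in line with the 1500-2500 range quoted for English. The function name, the boundary symbol "#" and the phoneme labels below are illustrative assumptions.

```python
def to_diphones(phonemes, boundary="#"):
    """Split a phoneme sequence into overlapping phoneme-to-phoneme units."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

n_phonemes = 45                      # approximate English inventory, from the list above
max_diphones = n_phonemes ** 2       # upper bound on ordered phoneme pairs

# Hypothetical transcription of the word "test"
units = to_diphones(["t", "eh", "s", "t"])
print(units)          # ['#-t', 't-eh', 'eh-s', 's-t', 't-#']
print(max_diphones)   # 2025
```

A concatenative synthesizer would look up one recorded snippet per unit and join them; the lookup and the smoothing at the joins are not shown here.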
Word and Syllable Concatenation
IVR System
HK Airport PA System
Text-to-Speech (TTS) Synthesis
GOAL: convert arbitrary textual messages to intelligible and natural sounding synthetic speech so as to transmit information from a machine to a person
[Figure: input text → Text Analysis → abstract underlying linguistic description → Speech Synthesizer → synthetic speech output]
• Pronunciation of text => phonemes, stress, intonation, duration
• Syntactic structure of sentence (pauses, rate of speaking, emphasis)
• Semantic focus, ambiguity resolution (duration, intonation)
– Rules for word etymology (especially names, foreign terms)
• Expressive, emotional and personalized speech: the ultimate goals
Multilingual TTS (Diphone as Units)
American English
Chinese
French
Italian
Spanish
Russian
Modern TTS (Waveform Concatenation)
Soliloquy from Hamlet
Gettysburg Address
Bob Story
German female
UK British female
Spanish female
Korean female
French male
Hidden Markov Model Based Synthesis:
• Mathematically formed
• Naturally sounding
• Continuously improving
“Talking Heads”: Audiovisual TTS
• 3D Talking Heads: flexible; easy to show in any pose
• Sample-based Talking Heads: look like a 'real person'; require recording of real people; limited
Business Drivers of TTS
• Handicap assistance: Stephen Hawking, reading machine for the blind, telephone for the deaf
• Cost reduction in automation
– TTS as a dialog component for customer care
– TTS to replace expensive recorded IVR prompts
• New products and services
– Location-based services
– Providing information in cars (e.g., driving directions, traffic reports)
– Unified messaging (reading e-mail, fax)
– Voice portals (enterprise, home, phone access to web-based services), talking internet: the Taipei story
– e-commerce (automatic information agents)
– Customized news, stock reports, sports scores
– Portable devices
From: Marilyn Walker <[email protected]>
To: David Ross <[email protected]>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--- 4:30 is fine for me. See you at the meeting.
Marilyn
---Original Message---
From: David Ross <[email protected]>
To: Marilyn Walker <[email protected]>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].
Example: Old TTS, No Filter
Reading Email
From: Marilyn Walker <[email protected]>
To: David Ross <[email protected]>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--- 4:30 is fine for me. See you at the meeting.
Marilyn Walker
---Original Message---
From: David Ross <[email protected]>
To: Marilyn Walker <[email protected]>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].
Thanks, david ross
Example: Enhanced TTS
Reading Email (after Rendering)
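The slides do not say what the rendering filter does before the text reaches the synthesizer. The sketch below shows one plausible kind of preprocessing, purely as an assumption: it drops the quoted original message, keeps only the sender and subject from the headers, and turns the rest into a short speakable string. The function name, the rules and the example.com addresses are all hypothetical.

```python
import re

def render_email_for_tts(raw):
    """Turn a raw e-mail into a short speakable message (illustrative rules only)."""
    # Keep only the newest message: cut at the quoted-original marker.
    newest = re.split(r"^-+\s*Original Message\s*-+\s*$", raw,
                      maxsplit=1, flags=re.M | re.I)[0]
    headers, body = {}, []
    for line in newest.splitlines():
        m = re.match(r"^(From|To|Subject|Date):\s*(.*)$", line)
        if m and not body:
            headers[m.group(1)] = m.group(2)      # header block at the top
        elif line.strip():
            # Strip leading dashes that would otherwise be read aloud
            body.append(line.strip().lstrip("- ").strip())
    sender = re.sub(r"\s*<[^>]*>", "", headers.get("From", "an unknown sender"))
    subject = re.sub(r"^(Re|Fwd?):\s*", "", headers.get("Subject", ""), flags=re.I)
    return f"Message from {sender} about {subject}. " + " ".join(body)

# Placeholder addresses; the addresses on the slide are not available
raw = (
    "From: Marilyn Walker <walker@example.com>\n"
    "To: David Ross <ross@example.com>\n"
    "Subject: Re: Today's Meeting\n"
    "Date: Tuesday, December 01, 1998 4:25 PM\n"
    "--- 4:30 is fine for me. See you at the meeting.\n"
    "Marilyn\n"
    "---Original Message---\n"
    "From: David Ross <ross@example.com>\n"
    "Today's meeting has been changed from 4:00 to 4:30 PM.\n"
)
spoken = render_email_for_tts(raw)
print(spoken)
```

A fuller front end would also expand times such as "4:30" and dates into words before synthesis; that step is left out here.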
Human Peripheral Auditory Mechanism
- Inner, middle and outer ears
- Basilar membrane, hearing
- Brain and auditory neurons
History: Early ASR, 1970’s and 2000’s
1950s: Early speech recognizers
– 1952: Bell Labs single-speaker digit recognizer: analog, 2% error
1960s: FFT, linear prediction, dynamic programming
– NEC: speaker-dependent digit recognizer, huge size at $80,000
1970s: ARPA SUR 5-year project
– Hidden Markov model: a major breakthrough and paradigm shift
1980s-1990s: DARPA annual “bakeoff”, large databases
– Verbex: speaker-dependent recognizer, small size at $250
– Dragon Systems, IBM ViaVoice: dictation systems
– AT&T: VRCP automated operator service (1B/year)
– Commercial ASR: Nuance, SpeechWorks, L&H, ...
– Multilingual ASR systems, services and applications
– ASR as interpreted by Seinfeld (see the TV clip)
2000s: DARPA GALE language translation project
– NTT's speech translation system over the mobile phone
– IBM's hand-held speech translator deployed in Iraq
Phoneme Classification Chart: Early ASR
[Chart: phonemes classified by excitation type, vocal cords vibrating vs. noise-like]
ASR by Scoring Individual Sounds
• Computing word score using a sequence of phone scores
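[Figure: the system prompts “Please say the name of the movie now.” Phone sequences (e-d-t-v, a-n-t-s, p-b-a-k) are scored for the candidate titles EDtv, Ants and Payback; the scores shown are 12.2, 32.5 and 29.4, and the result is EDtv]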
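The idea of computing a word score from a sequence of phone scores can be sketched as follows. This is a deliberately simplified illustration: it assumes the utterance has already been cut into one segment per phone, whereas practical recognizers find the alignment with HMMs and dynamic programming. The lexicon entries follow the letters on the slide loosely, and every number is an invented log-likelihood, not a value from the lecture.

```python
import math

# Hypothetical pronunciation lexicon for the three candidate movie titles
LEXICON = {
    "EDtv": ["e", "d", "t", "v"],
    "Ants": ["a", "n", "t", "s"],
    "Payback": ["p", "b", "a", "k"],
}

def word_score(phones, segment_scores, floor=-20.0):
    """Sum per-phone log-likelihoods, pairing phone i with segment i."""
    if len(phones) != len(segment_scores):
        return -math.inf                      # cannot align without DP
    return sum(seg.get(p, floor) for p, seg in zip(phones, segment_scores))

def recognize(segment_scores, lexicon=LEXICON):
    """Score every word in the lexicon and return the best one with all scores."""
    scores = {w: word_score(ph, segment_scores) for w, ph in lexicon.items()}
    return max(scores, key=scores.get), scores

# Invented phone log-likelihoods for four segments of one utterance
segments = [
    {"e": -1.0, "a": -3.0, "p": -6.0},
    {"d": -1.5, "n": -2.5, "b": -5.0},
    {"t": -0.5, "a": -4.0},
    {"v": -2.0, "s": -3.5, "k": -4.5},
]
best, scores = recognize(segments)
print(best, scores)   # EDtv {'EDtv': -5.0, 'Ants': -9.5, 'Payback': -19.5}
```

Here a higher (less negative) score is better; systems that report distances instead would pick the lowest score.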
Training Speech Recognizer Models
[Figure: a training sample, “This is a test.”, is segmented into phonemes (Th-i-s i-s a t-e-s-t) and each segment is assigned to its model in the phoneme inventory]
Millions of training samples are combined to build sub-word models, one for each phoneme
Human Speech Knowledge Hierarchy
• Acoustic/Phonetic: relationship of speech sounds and English phonemes (Acoustic Model; ASR)
• Phonotactic: rules for phoneme sequences and pronunciation (Word Lexicon; ASR)
• Syntactic: structure of words, phrases in a sentence (Language Model; ASR)
• Semantic: relationship and meanings among words (Understanding Model; SLU)
• Pragmatic: discourse, interaction history, world knowledge (Dialog Manager; DM)
Speech Recognition Capabilities
[Figure: speech recognition capabilities by speaking style (isolated, connected, read, fluent and spontaneous speech), with applications including word spotting, digit strings, speaker verification, voice commands, name dialing, office dictation, transcription, system-driven dialogue, user-driven dialogue, personal assistant and natural conversation; years marked 1997 and 2001. By B. S.]
Bell Labs Voice Call Transactions
• VRCP
- 1 B calls per year (1992)
• Voice Prompter
- 900 M calls/year (1992)
• SDN/NRA
- 250 M calls/year (1996)
• Universal Card
- 50 M calls/year (1995)
• MovieFone
- 40 M calls/year (1999)
• Talking Call Waiting
- ~110 M calls/year (2000)
Over Billion Served
VRCP: Full Deployment
• System deployment
– Fully deployed in the 48 continental states and still in use
– Known as 0+ service (dialing 0 followed by 10 digits)
• System Impact
– Handles over 1B call transactions a year (roughly 3M per day)
– Offers savings of over $300M a year for service providers
– Stands as the most widely used voice-enabled service to date
– Led to many successful automated speech applications
• A key patent (Lee, Rabiner, and Wilpon) made it possible
– 98% accuracy was obtained within 3 months after the initial trial
• Societal perception
– General public: no noticeable difference
Spoken Language Translation (C-3PO) and Voice User Interface (R2-D2)
[Figure: voice input (speech in language A) → Speech Recognizer (acoustic & language models) → text in language A → Language Analyzer (semantic rules) → text understanding in language A/B → Machine Translation or Dialog Management (bilingual databases, translation models) → text reply in language B → Language Generator & TTS Synthesizer (text analysis & pronunciation rules) → voice output (speech in language B)]
Business Drivers of ASR
• Handicap assistance: wheelchair control, voice-enabled keyboard (MIT story), telephone for the deaf
• Cost reduction in automation
– ASR as a dialog component for call centers: $1M per year in savings per 1% of automation for large customer services
– Airline and train information service
• New products and services
– Telematics: applications in cars
– Medical and law dictation services
– Voice writer: IBM's ViaVoice in many languages
– Voice portals (enterprise, home, phone access to Web-based services): listening and talking internet
– e-commerce (automatic information agents)
– Speech translator and travel assistance
Microsoft MiPad
Multimodal Interactive Pad
• Usability studies show double throughput for English inputs
• Speech is mostly useful in cases with lots of alternatives
• Demo (MiPad)
AT&T MATCH
“ Are there any cheap Italian places in this
Multimodal Access To City Help
• Access to information through voice interface, gesture and GPS
• Multimodal integration: combination of speech, gesture and meaning using finite-state technology
Human Speech Recognition vs. ASR
[Figure: machine error (%) plotted against human error (%) on logarithmic axes for the tasks Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB and SWBD, with reference lines at x1, x10 and x100 and a region marked “Machines Outperform Humans”]
S-Learning Curve and Paradigm Shift
[Figure: speech recognition accuracy vs. time, S-shaped learning curves rising toward human performance]
1. Heuristics, handcrafted rules, local optimization
2. HMMs: mathematical formalization, global optimization, search, automatic learning from data
3. ????
Apple iPhone Applications
• 90,000 applications, more to come
• 2 billion downloads, more to come
ASR advances triggered by more interest in smartphones & voice clicks?
Google Text, News, Video & 1-800-goog-411
Moving Forward
• “HAL's Legacy”: a book published in 1997 to celebrate HAL's “birthday” and review 30 years of technology development
– Chess playing, lip reading, singing machines, speaking machines, listening machines: plenty of limitations
– Artificial intelligence: a long way to go
– Most AI problems are AI-complete
• Ten years later:
– Chess playing: IBM's Deep Blue
– Lip reading: improved capabilities
– Singing machines: improved capabilities
– Speaking machines: improved capabilities
– Listening machines: improved capabilities
– Artificial intelligence: still a long way to go
• The Future: AI lives again (the ONE Web)
Summary
• Science Fictions: an inspiration
• History of talking machines
– Text-to-speech (TTS) synthesis
• Development of speech synthesis technology
– Capabilities and limitations
• History of listening machines
– Automatic speech recognition (ASR)
• Development of speech recognition technology
– Capabilities and limitations: “Seinfeld”
• Summary: applications today and tomorrow
– Voice user interface (VUI), voice search & speech translation
– The ultimate “machine” in science fiction: “A.I.”
– ONE Machine: from Kevin Kelly, “5000 days of Internet”, 2007