ECE6255 Digital Processing of Speech Signals

(1)

Chin-Hui Lee

School of Electrical and Computer Engineering Georgia Institute of Technology

Atlanta, GA 30332, USA

Lecture 27:

Teaching Machines to Speak and Listen:

A Wonderful Journey from Science Fiction to Technological Reality

(2)

Lecture Outline

• Science fiction: fascination and inspiration

– “2001: A Space Odyssey”, “Star Wars”, “A.I.”

• History of talking machines

– Text-to-speech (TTS) synthesis

• Development of speech synthesis technology

– Capabilities and limitations

• History of listening machines

– Automatic speech recognition (ASR)

• Development of speech recognition technology

– Capabilities and limitations

• Summary: applications today and tomorrow

– Voice user interface, voice search & speech translation

(3)

Speaking and Listening Machines

• Equipping machines with human sensory modalities: a modern dream and fascination

• Talking is a uniquely human capability

• Children are more successful at it than machines

• Speaking is easier than listening

• Machines with these human capabilities are better demonstrated in science fiction than in today's technology

(4)

2001: A Space Odyssey (HAL)

HAL listens, talks, sings, reads lips, plays chess, and solves problems!

(5)

A Pair of Ideal Communicators

C-3PO translates, interprets, and understands six million artificial and natural languages.

R2-D2 listens to, understands, and interprets natural languages and works like a …

“Star Wars” Theme

Droids on Sesame St.

(6)

Birthplace of Modern Speech Research

Bell Labs (BL): an excellent culture

• 25,000+ patents

• 11 Nobel Laureates

• 9 National Medals of Science (US)

• 5 National Medals of Technology (US)

• 1 Emmy Award

• Hosted Chairman Jiang Zemin in 2000 (one of three labs)

• Speech research dates back to Alexander Graham Bell

– First speech synthesizer

– First speech recognizer

(7)

Human Speech Production Mechanism

• Air enters the lungs via normal breathing, and (generally) no speech is produced on intake

• As air is expelled from the lungs via the trachea, the air flow causes the tensed vocal cords within the larynx to vibrate

• Air is chopped up into quasi-periodic pulses, then modulated in frequency (spectrally shaped) in passing through the pharynx (throat cavity), the mouth cavity, and possibly the nasal cavity

(8)

Abstractions of Physical Model

Figure: an excitation signal (voiced or unvoiced) drives a time-varying filter, whose output is speech.
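The source-filter abstraction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the model used in the lecture. It assumes an impulse train for voiced excitation, uniform white noise for unvoiced excitation, and a fixed (not time-varying) cascade of second-order resonators as the filter. The function names, the 8 kHz sampling rate, the pitch period, and the formant values are all illustrative assumptions.

```python
import math
import random


def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Impulse train if voiced, white noise otherwise (both are assumptions)."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, fs=8000):
    """Second-order all-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs)           # pole radius (< 1, stable)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize(n_samples, voiced, formants, fs=8000, pitch_period=80):
    """Pass the excitation through a cascade of formant resonators."""
    s = excitation(n_samples, voiced, pitch_period)
    for freq_hz, bandwidth_hz in formants:
        s = resonator(s, freq_hz, bandwidth_hz, fs)
    return s


# Illustrative formant (frequency, bandwidth) pairs in Hz; not taken from the slides.
VOWEL_LIKE = [(700.0, 130.0), (1220.0, 70.0)]
voiced_frame = synthesize(400, True, VOWEL_LIKE)
unvoiced_frame = synthesize(400, False, VOWEL_LIKE)
```

A real synthesizer would update the filter coefficients frame by frame, which is what makes the filter time-varying. Here they stay fixed to keep the sketch short.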

(9)

First Set of Vowel Tubes

• Kratzenstein's acoustic resonators

– Apparatus created in St. Petersburg (1779)

– Figure shown from Schroeter (1993)

(10)

First Speaking Machine

• Wheatstone's reconstruction of von Kempelen's speaking machine

– Figure shown from Flanagan (1972)

(11)

First Mechanical Synthesizer: VODER

Homer Dudley, 1939 World's Fair in New York City

(12)

Source-Filter Synthesis Models

• Cascade/serial (formant) synthesis model

“To Be …”- Bell Labs

Daisy-Daisy with music

We Wish You …

(13)

Continuing Evolution (1959-1987)

Haskins, 1959

KTH – Stockholm, 1962

Bell Labs, 1973

MIT, 1976

MIT-talk, 1979

Speak 'N Spell, 1980

Bell Labs, 1985

DECtalk (voice morphing), 1987

(14)

Concatenation Synthesis Systems

• “MI3” story: Tom Cruise mimics Philip Seymour Hoffman using a few example sentences that cover all sounds

• Choice of units: synthesis through cut-'n'-paste; more units give better performance

– Words: there are an infinite number of them

– Syllables: there are about 10K in English

– Phonemes: there are about 45 in English

– Demi-syllables: there are about 2500 in English

– Diphones: there are about 1500-2500 in English

– Phrases and waveforms: as many units as possible (1990s-2000s)
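To make the diphone unit concrete, the sketch below splits a phoneme sequence into the overlapping phoneme-to-phoneme transitions that a diphone inventory would have to cover. The function names and the `#` silence symbol are assumptions for illustration. With about 45 phonemes there are at most 45 × 45 = 2025 ordered pairs, which is of the same order as the 1500-2500 diphones quoted above.

```python
def to_diphones(phonemes, silence="#"):
    """Return the phoneme-to-phoneme transitions, padded with silence at both ends."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphones(n_phonemes):
    """Upper bound on distinct ordered phoneme pairs (ignores silence and unused pairs)."""
    return n_phonemes * n_phonemes


units = to_diphones(["t", "e", "s", "t"])
# units is ["#-t", "t-e", "e-s", "s-t", "t-#"]
```

Because each unit is cut in the middle of a phoneme, the joins fall in relatively stable regions of the signal. This is one common motivation for preferring diphones over whole phonemes.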

(15)

Word and Syllable Concatenation

IVR System

HK Airport PA System

(16)

Text-to-Speech (TTS) Synthesis

GOAL: convert arbitrary textual messages to intelligible and natural sounding synthetic speech so as to transmit information from a machine to a person

Figure: input text feeds Text Analysis, which produces an abstract underlying linguistic description; the Speech Synthesizer turns that description into the synthetic speech output.

• Pronunciation of text => phonemes, stress, intonation, duration

• Syntactic structure of sentence (pauses, rate of speaking, emphasis)

• Semantic focus, ambiguity resolution (duration, intonation)

– Rules for word etymology (especially names, foreign terms)

• Expressive, emotional, and personalized speech: the ultimate goals

(17)

Multilingual TTS (Diphone as Units)

American English

Chinese

French

Italian

Spanish

Russian

(18)

Modern TTS (Waveform Concatenation)

Soliloquy from Hamlet

Gettysburg Address

Bob Story

German female

UK British female

Spanish female

Korean female

French male

Hidden Markov Model Based Synthesis:

• Mathematically formed

• Naturally sounding

• Continuously improving

(19)

“Talking Heads”: Audiovisual TTS

• 3D talking heads: flexible; easy to show in any pose

• Sample-based talking heads: look like a 'real person'; require recording of real people; limited

(20)

Business Drivers of TTS

• Handicap assistance: Stephen Hawking, reading machine for the blind, telephone for the deaf

• Cost reduction in automation

– TTS as a dialog component for customer care

– TTS to replace expensive recorded IVR prompts

• New products and services

– Location-based services

– Providing information in cars (e.g., driving directions, traffic reports)

– Unified messaging (reading e-mail, fax)

– Voice portals (enterprise, home, phone access to web-based services), talking internet: the Taipei story

– e-commerce (automatic information agents)

– Customized news, stock reports, sports scores

– Portable devices

(21)

From: Marilyn Walker <[email protected]>

To: David Ross <[email protected]>

Subject: Re: Today's Meeting

Date: Tuesday, December 01, 1998 4:25 PM

--- 4:30 is fine for me. See you at the meeting.

Marilyn

---Original Message---

From: David Ross <[email protected]>

To: Marilyn Walker <[email protected]>

Date: Tuesday, December 01, 1998 2:25 PM

Subject: Today's Meeting

Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].

Example: Old TTS, No Filter

Reading Email

(22)

From: Marilyn Walker <[email protected]>

To: David Ross <[email protected]>

Subject: Re: Today's Meeting

Date: Tuesday, December 01, 1998 4:25 PM

--- 4:30 is fine for me. See you at the meeting.

Marilyn Walker

---Original Message---

From: David Ross <[email protected]>

To: Marilyn Walker <[email protected]>

Date: Tuesday, December 01, 1998 2:25 PM

Subject: Today's Meeting

Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].

Thanks, david ross

Example: Enhanced TTS

Reading Email (after Rendering)
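The slides do not say what the rendering step does. One plausible reading is that it filters the message before synthesis, so the listener hears the sender, the subject, and the reply body rather than raw addresses, dates, and the quoted original. The sketch below takes that approach as an assumption. The function name, the spoken phrasing, and the `example.com` addresses are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(From|To|Subject|Date):\s*(.*)$")


def render_email_for_tts(raw):
    """Reduce a raw email to a short string suitable for speaking aloud."""
    sender, subject, body = None, None, []
    in_headers = True
    for line in raw.splitlines():
        text = line.strip()
        if text.lower().startswith("---original message"):
            break  # drop the quoted original entirely
        if in_headers:
            if not text:
                continue
            match = HEADER_RE.match(text)
            if match:
                field, value = match.group(1), match.group(2)
                if field == "From":
                    # Keep the display name, drop the <address> part.
                    sender = re.sub(r"<[^>]*>", "", value).strip()
                elif field == "Subject":
                    subject = value.strip()
                continue  # To: and Date: are not spoken
            in_headers = False
        if text:
            body.append(text.lstrip("- "))  # strip leading reply dashes
    parts = []
    if sender:
        parts.append(f"Message from {sender}.")
    if subject:
        parts.append(f"Subject: {subject}.")
    parts.extend(body)
    return " ".join(parts)


SAMPLE = """From: Marilyn Walker <mw@example.com>
To: David Ross <dr@example.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM

--- 4:30 is fine for me. See you at the meeting.
Marilyn
---Original Message---
From: David Ross <dr@example.com>
Today's meeting has been changed from 4:00 to 4:30 PM.
"""
spoken = render_email_for_tts(SAMPLE)
```

A production system would also normalize times, dates, and abbreviations into words before synthesis. That step is left out here.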

(23)

Human Peripheral Auditory Mechanism

- Inner, middle and outer ears

- Basilar membrane, hearing

- Brain and auditory neurons

(24)

History: Early ASR, 1970’s and 2000’s

1950s: Early speech recognizers

– 1952: Bell Labs single-speaker digit recognizer: analog, 2% error

1960s: FFT, linear prediction, dynamic programming

– NEC: speaker-dependent digit recognizer, huge size at $80,000

1970s: ARPA SUR 5-year project

– Hidden Markov model: a major breakthrough and paradigm shift

1980s-1990s: DARPA annual “bakeoff”, large databases

– Verbex: speaker-dependent recognizer, small size at $250

– Dragon Systems, IBM ViaVoice: dictation systems

– AT&T: VRCP automated operator service (1B calls/year)

– Commercial ASR: Nuance, SpeechWorks, L&H, …

– Multilingual ASR systems, services and applications

– ASR as interpreted by Seinfeld (see the TV clip)

2000s: DARPA GALE language translation project

– NTT's speech translation system over the mobile phone

– IBM's hand-held speech translator deployed in Iraq

(25)

Phoneme Classification Chart: Early ASR

Figure: phoneme classification chart, with sound classes labeled “Vocal Cords Vibrating” and “Noise-Like”.

(26)

ASR by Scoring Individual Sounds

• Computing word score using a sequence of phone scores

Prompt: “Please say the name of the movie now.”

Candidates (phone sequences): EDtv (e d t v), Ants (a n t s), Payback (p  b a k)

Scores shown: 12.2, 32.5, 29.4

Recognized: EDtv
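The idea of building a word score from phone scores can be sketched as follows. This is a simplified illustration under stated assumptions. Each candidate word is assumed to come with one log-likelihood per phone, higher meaning a better match. The word score is assumed to be their sum, and the recognizer picks the highest-scoring word. The numbers are made up and do not correspond to the scores on the slide, whose scale and direction the slide does not explain.

```python
def word_score(phone_scores):
    """Combine per-phone log-likelihoods into one word score by summing."""
    return sum(phone_scores)


def recognize(candidates):
    """Return the best-scoring word and the score of every candidate."""
    scored = {word: word_score(scores) for word, scores in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored


# Hypothetical per-phone log-likelihoods for one utterance.
CANDIDATES = {
    "EDtv": [-1.0, -0.5, -0.8, -0.7],
    "Ants": [-3.0, -2.5, -1.0, -2.0],
    "Payback": [-2.0, -3.0, -2.5, -1.5, -2.0],
}
best_word, all_scores = recognize(CANDIDATES)
# best_word is "EDtv"
```

Summing log-likelihoods favors shorter words when all scores are negative. Practical systems normalize by duration or score every candidate against the same frames, which this sketch does not attempt.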

(27)

Training Speech Recognizer Models

Figure: inventory of sub-word (phoneme) units, including a, b, ch, d, e, f, g, h, i, j, k, l, m, n, ng, p, r, s, sh, t, th, v, w, y, z, zh, and # (some symbols were lost in extraction).

Training sample: “This is a test.” segmented as Th-i-s i-s a t-e-s-t

Millions of training samples are combined to build sub-word models, one for each phoneme

(28)

Human Speech Knowledge Hierarchy

Knowledge level: description (model)

• Acoustic/Phonetic: relationship of speech sounds and English phonemes (Acoustic Model)

• Phonotactic: rules for phoneme sequences and pronunciation (Word Lexicon)

• Syntactic: structure of words and phrases in a sentence (Language Model)

• Semantic: relationships and meanings among words (Understanding Model)

• Pragmatic: discourse, interaction history, world knowledge (Dialog Manager)

System components: ASR, SLU, DM

(29)

Speech Recognition Capabilities

Figure: speech recognition tasks arranged by speaking style (isolated, connected speech, read speech, fluent speech, spontaneous speech), with the years 1997 and 2001 marked. Tasks shown: word spotting, digit strings, speaker verification, voice commands, name dialing, transcription, natural conversation, office dictation, user-driven dialogue, personal assistant, system-driven dialogue. (By B. S.)

(30)

Bell Labs Voice Call Transactions

• VRCP

- 1 B calls per year (1992)

• Voice Prompter

- 900 M calls/year (1992)

• SDN/NRA

- 250 M calls/year (1996)

• Universal Card

- 50 M calls/year (1995)

• MovieFone

- 40 M calls/year (1999)

• Talking Call Waiting

- ~110 M calls/year (2000)

Over Billion Served

(31)

VRCP: Full Deployment

• System deployment

– Fully deployed in the 48 continental states and still being used

– Known as 0+ service (dialing 0 followed by 10 digits)

• System impact

– Handles over 1B call transactions a year (about 3M per day)

– Offers savings of over $300M a year for service providers

– Stands as the most widely used voice-enabled service to date

– Led to many successful automated speech applications

• A key patent (Lee, Rabiner, and Wilpon) made it possible

– 98% accuracy was obtained within 3 months after the initial trial

• Societal perception

– General public: no noticeable difference

(32)

Spoken Language Translation (C-3PO) and Voice User Interface (R2-D2)

Figure: processing pipeline with the knowledge sources used at each stage.

1. Voice input (speech in language A)

2. Speech Recognizer, using acoustic & language models, produces text in language A

3. Language Analyzer, using semantic rules, produces text understanding in language A/B

4. Machine Translation or Dialog Management, using bilingual databases and translation models, produces a text reply in language B

5. Language Generator & TTS Synthesizer, using text analysis & pronunciation rules, produces voice output (speech in language B)

(33)

Business Drivers of ASR

• Handicap assistance: wheelchair control, voice-enabled keyboard (MIT story), telephone for the deaf

• Cost reduction in automation

– ASR as a dialog component for call centers: $1M per year in savings for each 1% of automation in large customer services

– Airline and train information service

• New products and services

– Telematics: applications in cars

– Medical and law dictation services

– Voice writer: IBM's ViaVoice in many languages

– Voice portals (enterprise, home, phone access to Web-based services): listening and talking internet

– e-commerce (automatic information agents)

– Speech translator and travel assistance

(34)

Microsoft MiPad

Multimodal Interactive Pad

• Usability studies show double throughput for English inputs

• Speech is mostly useful in cases with lots of alternatives

• Demo (MiPad)

(35)

AT&T MATCH

Multimodal Access To City Help

• Access to information through voice interface, gesture and GPS

• Multimodal integration: combination of speech, gesture and meaning using finite state technology

Example query: “Are there any cheap Italian places in this …”

(36)

Human Speech Recognition vs. ASR

Figure: log-scale plot of machine error (%) against human error (%) for the tasks Digits, RM-LM, RM-null, NAB-mic, NAB-omni, WSJ, WSJ-22dB, and SWBD, with reference lines at x1, x10, and x100 and a region labeled “Machines Outperform Humans”.

(37)

S-Learning Curve and Paradigm Shift

Figure: speech recognition accuracy over time, rising toward human performance in successive S-curves.

1. Heuristics, handcrafted rules, local optimization

2. Mathematical formalization, global optimization, search, automatic learning from data (HMMs)

3. ????

(38)

Apple iPhone Applications

• 90,000 applications, more to come

• 2 billion downloads, more to come

ASR advances triggered by more interest in smartphones & voice clicks?

(39)

Google Text, News, Video & 1-800-goog-411

(40)

Moving Forward

• “HAL's Legacy”: a book published in 1997 to celebrate HAL's “birthday” and review 30 years of technology development

– Chess playing, lip reading, singing machines, speaking machines, listening machines: plenty of limitations

– Artificial intelligence: a long way to go

– Most AI problems are AI-complete

• Ten years later:

– Chess playing: IBM's Deep Blue

– Lip reading: improved capabilities

– Singing machines: improved capabilities

– Speaking machines: improved capabilities

– Listening machines: improved capabilities

– Artificial intelligence: still a long way to go

• The future: AI lives again (the ONE Web)

(41)

Summary

• Science fiction: an inspiration

• History of talking machines

– Text-to-speech (TTS) synthesis

• Development of speech synthesis technology

– Capabilities and limitations

• History of listening machines

– Automatic speech recognition (ASR)

• Development of speech recognition technology

– Capabilities and limitations: “Seinfeld”

• Summary: applications today and tomorrow

– Voice user interface (VUI), voice search & speech translation

– The ultimate “machine” in science fiction: “A.I.”