Speech Processing for Audio Indexing

51  Download (0)

Full text


Speech Processing for Audio Indexing

Lori Lamel

with J.L. Gauvain and colleagues from Spoken Language Processing Group



• Huge amounts of audio data produced (TV, radio, Internet, telephone) • Automatic structuring of multimedia, multilingual data

• Much accessible content in the audio and text streams

• Speech and language processing technologies are key components

for indexing

• Applications: digital multimedia libraries, media monitoring, News on


Processing Audiovisual Data

XML Emotions signal signal Audio Segmentation Language recognition Speaker recognition Named entities, ... Q&A Audio Analysis Dialog Generation Synthesis Acoustic−phonetic modeling Speech transcription Translation studies Corpus based Perceptual tests


Some Recent Projects

• Conversational interfaces & intelligent assistants:

• Computers in the Human Interaction Loop (chil.server.de)

Human-Human communication mediated by the machine – multimodality (speech, image, video)

– “who & where” speaker identification and tracking – “what” linguistic content (far-field ASR)

– “why & how” style, attitude, emotion

• Speech translation:

Technology and Corpora for Speech to Speech Translation (http://www.tc-star.org)

Global Autonomous Language Exploitation (DARPA GALE)

• Developing multimedia and multilingual indexing and management tools

QUAERO (www.quaero.org)


Why Is Speech Recognition Difficult?

Automatic conversion of spoken words in text

Text: I do not know why speech recognition is so difficult

Continuous: Idonotknowwhyspeechrecognitionissodifficult

Spontaneous: Idunnowhyspeechrecnitionsodifficult

Pronunciation: YdonatnowYspiCrEkxgnISxnIzsodIfIk∧lt YdonowYspiCrEknISNsodIfxk∧l

YdontnowYspiCrEkxnISNsodIfIk∧lt YdxnowYspiCrEknISNsodIfxk∧lt

Important variability factors:

Speaker Acoustic environment

physical characteristics (gender, background noise (cocktail party, ...) age, ...), accent, emotional state, room acoustic, signal capture

situation (lecture, conversation, (microphone, channel, ...) meeting, ...)


Audio Samples (1)

• WSJ Read stop


• WSJ Spontaneous


• Broadcast News



Audio Samples (2)

• Conversational Telephone Speech stop

Hello. Yes, This is Bill. Hi Bill. This is Hillary. So this is my first call so I haven’t a clue as to - - what we’re supposed to do yeah. It’s pretty funny that we’re Bill and Hillary talking on the phone though isn’t it?

Ah I didn’t even think about that. {laughter} uhm we’re supposed to

talk about did you get the topic. Yes. So we’re supposed to talk about what changes if any you have made since ...

• CHIL, spontaneous, non native

I just give you a brief overview of [noise] what’s going on in uh audio and why we bother with all these microphones in the the materials of the walls , the carpet and all that ...


Audio Samples (3)

• EPPS stop

Mister President, in these questions we see the European social model in flashing lights. The European social model is a mishmash which pleases no one: a bit of the free market here and a bit of welfare state there, mixed with a little green posturing.

• EPPS, spontaneous

Thank you very much, Mister Chairman. Hum When I came to this room I I have seen and heard homophobia but sort of vaguely ; through T. V. and the rest of it. But I mean listening to some of the Polish colleagues speak here today , especially Roszkowski , Pek , Giertych and Krupa,...

• EPPS, non native

My main my main problem with the environmental proposals of the Commission is that they do not correspond to the objectives of the Sixth Environmental Action Plan. As far as the traffic is concerned, the sixth E. A. P. stressed the decoupling of transport growth and G. D. P. growth.


Indicative ASR Performance

Task Condition Word Error

Dictation read speech, close-talking mic. 3-4% (humans 1%)

read speech, noisy (SNR 15dB) 10%

read speech, telephone 20%

spontaneous dictation 14%

read speech, non-native 20%

Found audio TV & radio news broadcasts 10-15% (humans 4%)

TV documentaries 20-30%

Telephone conversations 20-30% (humans 4%)

Lectures (close mic) 20%

Lectures (distant mic) 50%


from NIST

1 0 0 % 1 0 % 1 % 4 % R e a d S p e e c h 2 0 k 5 k 1 k N o i s y V a r i e d Mi c r o p h o n e s A i r T r a ve l P l a n n i n g K i o s k S p e e c h 2 % C o n ve r s a t i o n a l S p e e c h ( N o n= E ng li s h) S w it c h b o ar d C e ll u l ar S w it c h b o ar dII C T S A ra bi c ( UL) CT S M a n d ar i n ( UL) 0 C T S Fi s h e r ( UL) S w it c hb o ar d ( N on \ En glish) N e ws E n g li s h 1 0 X B r o a d c a s t S p e e c h N e ws E n g li s h 1 X N e ws E n g li s h u n li m it e d N e w s M an d ar i n 10 X N e w s A ra bi c 10 X M ee ti n g Ȃ M D M M e e ti n g= I H M M ee ti n g ȂS D M M e e t i n g S p e e c h N I S T S T T B e n c h m a r k T e s t H i s t o r y

Ȃ ƒ›Ǥǯ͘͟

W E R ( % )


Some Research Topics

• Development of generic technology

Speaker independent, task independent (large vocabulary > 200k words)

Robust (noise, microphone, ...)

• Improving model accuracy

• Processing found audio: varied speaking styles, uncontrolled conditions Multilinguality: low e-resourced languages

• Developing systems with less supervision

Model adaptation: keeping models up-to-date


Internet Languages

Internet users in Millions per language Increase (%) in Internet users per internetworldstats.com, Nov 30, 2007 language from 2000 to 2007


ASR Systems

Speech Sample HMM Extraction Speech Transcription Decoding Training f(X|H) P(H|W) P(W) W* Y X Text Corpus Manual Transcription Recognizer Lexicon Language Model Acoustic Models N−gram Estimation Training Lexicon Training Feature Normalization Corpus Speech Acoustic Front−end Decoder


System Components

• Acouctic models: HMMs (10-20M parameters), allophones (triphones),

discriminative features, discriminative training

• Pronunciation dictionary with probabilities

• Language models: statistical N-gram models (10-20M N-grams), model

interpolation, connectionist LMs, text normalization

• Decoder: multipass decoding, unsupervised acoustic model adaptation,


Multi Layer Perceptron Front-End

• Features incorporate longer temporal information than cepstral features • MLP NN projects the high-dimensional space onto phone state targets • LP-TRAP features (500ms), 4 layer bottle-neck architecture (HMM state

targets, 3rd layer size is 39), PCA [Hermansky & Sharma, ICSLP’98; Fousek, 2007]

• Studied various configurations & ways to combine with cepstral features • Main results:

bnat06 WER (%)


No adaptation 21.8 20.7 19.2

SAT+CMLLR+MLLR 19.0 18.9 17.8

– Signifiant WER reduction without acoustic model adaptation – Best combination via feature vector concatenation (78 features) – Acoustic model adaptation is less effective for MLP


Pronunciation Modeling

• Pronunciation dictionary is an important system component

• Usually represented with phones + non-speech units (silence, breath, filler)

• Well-suited to recognition of read and prepared speech

• Modeling variants particularly important for spontaneous speech

multiwords “I don’t know”

• Explore (semi-)automatic methods to generate pronunciation variants • Explicit use of phonetic knowledge to hypothesize rules

anti, multi: /i/ - /ay/ variation

inter: /t/ vs nasal flap, retroflex schwa vs schwa

stop deletion/insertion

y-insertion (coupon, news, super)

• Data driven: confront pronunciations with very large amounts of data


Example: coupon

kyupan kupan



12 a23 a f( |s ) 11 a22 a f( |s ) 1 . f( |s ) 2 . 3 33 . 3 2 1 a 5 x 4 x 3 x6 x7 x8 x 2 x 1 x


Example: interest

InXIst IntrIst


Pronunciation Variants - BN vs CTS

100 hours of data

Word Pronunciation BN CTS Word Pronunciation BN CTS

interest IntrIst 238 488 interests IntrIss 52 53

IntXIst 3 33 IntrIsts 19 30

InXIst 0 11 IntXIsts 3 2

IntXIss 3 1

interested IntrIstxd 126 386 interesting IntrIst|G 193 1399

IntXIstxd 3 80 IntXIst|G 8 314

InXIstxd 18 146 InXIst|G 21 463

• Similar but less frequent words: interestingly, disinterest, disinterested,

intercom, interconnected, interfere...


Morphological Decomposition

• Morphological decomposition proposed to address problem of lexical

variety (Arabic, German, Turkish, Finnish, Amharic, ...)

• Rule-based vs data-driven approaches • Reduces out-of-vocabulary rate

• Improve n-gram estimation of infrequent words

• Tendency to create errors due to acoustic confusions of small units • Some proposed solutions:

- inhibit decomposition of frequent words

- incorporate linguistic knowledge (recognizer confusions, assimilation)

• Recognition performace of optimized systems is typically comparable to


Continuous Space Language Models

• Project words onto a continuous space (about 300)

• Estimate projection and n-gram probabilities with neural network • Better generalization for unknown n-grams (compared to general

back-off methods)

• Efficient training with a resampling algorithm

• Only model the N most common words (10-12k) • Interpolated with a standard 4-gram backoff LM

• 20% perplexity reduction, 0.5-0.8% reduction in WER across a variety


Reducing Training Costs - Motivation

• State-of-the-art speech recognizers trained on

– hundreds to thousands hours of transcribed audio data – hundreds of million to several billions of words of texts – large recognition lexicons

• Less e-represented languages

– Over 6000 languages, about 800 written – Poor representation in accessible form

– Lack of economic and/or political incentives

– PhD theses: Vietnamese, Khmer [Le, 2006], Somali [Nimaan, 2007], Amharic, Turkish [Pellegrini, 2008]

– SPICE: Afrikaans, Bulgarian, Vietnamese, Hindi, Konkani, Telugu, Turkish, but also English, German, French [Schultz, 2007]



• How much data is needed to train the models?

• What performance can be expected with a given amount of resources? • Are audio or textual training data more important for ASR in less

e-represented languages (languages with texts available)?

• Assess the impact on WER of the quantity of

Speech material (acoustic modeling) Texts (transcriptions)

Texts (newspapers, newswires available on the Web)


OOV and WER Rates

Amharic, 2 hour development set

50 25 0 4.6M 1M 500k 100k 10k 0 OOV(%)

Text corpus size (# words) 2h 10h 35h 75 50 25 4.6M 1M 500k 100k 10k log WER(%)

Text corpus size (# words) 2h 5h 10h 35h

• Audio transcriptions closer to development data • Even with 4.6M texts, more audio transcripts help


Influence of Transcriptions in LM

Amharic, 2 hour development set

100 75 50 25 4.6M 1M 500k 100k 10k log WER(%)

Text corpus size (# words) 2h 10h 35h MA2h−ML10h MA2h−ML35h 50 25 4.6M 1M 500k 100k 10k log WER(%)

Test corpus size (# words) 10h MA10h−ML35h 35h

• 2h of audio training data is not sufficient

• Good operating point: 10h audio, 1M word texts

• Collecting texts is more efficient than transcribing audio at this operating


Reducing Training Supervision

• Use of quick transcriptions (less detailed, less costly to produce)

− low-cost approximate annotations (commercial transcripts, closed-captions)

− raw data with closely related contemporary texts

− raw data with complementary texts (only newspaper, newswire,...)

Lightly supervised and unsupervised training using speech recognizer [Zavaliagkos & Colthurst 1998; Kemp & Waibel 1999; Jang & Hauptmann, 1999, Lamel, Gauvain & Adda 2000]

• Standard MLE training assumes that transcriptions are perfect, and are

aligned with audio

Flexible alignment using full N-gram decoder and garbage model

• Can skip over unknown words, transcription errors Implicit modeling of short vowels in Arabic


Reducing Training Supervision - Non-speech

• Use existing models (trained on manual annotations) to automatically

detect fillers and breath noises

• Add these to reference transcriptions

breath filler

Manual (50h) 8.1% 0.6%

BN (967h) 4.9% 3.4%

BC (162h) 5.4% 6.8%


Reducing Supervision: Pronunciation Generation

• Pronunciation generation in Arabic is relatively straightforward from

a vocalized word form

• Most sites working on Arabic use the Buckwalter morphological analyzer

to generate all possible vocalizations

• But many words are not handled

• Introduced generic vowel (@) model as a placeholder

• Developed rules to automatically generate pronunciations with a

generic vowel from non-vocalized word form

• Enables training with partial information

ktb k@t@b@ k@t@b@n kt@b@ kt@b@n k@tb@ k@tb@n k@t@b


Alignment Example: li AntilyAs

l @ n t @ l y A s


Towards Applications

• Transcription for Translation (TC-STAR) www.tc-star.org • Using the WEB as a Source of Training Data

• Indexing Internet audio-video data: Audiosurf www.audiosurf.org • Open Domain Question/Answering www.lsi.upc.edu/˜qast/2008 • Open domain Interactive Information Retrieval by Telephone (RITel)



ASR Transcripts Versus Texts

– Spoken language, disfluencies (hesitations, fillers, ...)

– No document/sentence boundaries – Lack of capitalization, punctuation

– Transcription errors (substitutions, deletions, insertions)

Strong correlation between WER and IE metrics (NIST) Slightly more errors on NE tagged words

Vocabulary limitation essentially affects person names

+ Known vocabulary, i.e. no orthographic errors, no need for

normalization + Time codes


Transcription for Translation (TC-STAR)

• Three languages (EN, ES, MAN), 3 tasks (European Parliament,

Spanish Parliament, BN)

• Improving modeling and decoding techniques

– audio segmentation, discriminative features, duration models, – automatic learning, discriminative training,

– pronunciation models, NN language model, decoding strategies, ...

• Model adaptation methods

– speaker adaptive trainining, acoustic and language model adaptation, – cross-system adaptation, adaptation to noisy environments, ...

• Integration of ASR and Translation

– transcription errors, confidence scores, word graph interface – punctuation for MT


TC-Star ASR Progress

5 6 7 8 9 10 15 20 30 2004 2005 2006 2007 Best TCStar LIMSI OCT04-BN

• More task-specific data • More complex models

• Improved modeling and decoding techniques

Automatic learning, discriminative training, pronunciation modeling


Using the WEB as a Source of Training Data

(Work done in collaboration with Vecsys Research)

• How to build a system without speaking the language (e.g Russian)

and without transcribed data?

• Main steps

– Get texts from the WEB, and basic linguistic knowledge for letter-to-sound rules

– Get BN audio data with incomplete and approximate transcripts – Normalize texts/transcripts and build n-gram LM

– Start with seed AMs from other languages

– Align incomplete transcripts and audio, discard badly aligned data – Train AMs


Results: Approximate WER

10 20 30 40 50 60 0 2 4 6 8 10 Russian WER Russian WER select

• Estimated WER using approximate transcripts • Approximate WER reduced from 30% to 10%



Indexing Internet audio/video

• Processed data

– 259 sources

– 80h/day, 33 languages

• Transcribing and indexing

– 203 sources – 6 languages

English French German Arabic Spanish Mandarin All

2005 3889h 10735h 964h 1756h 60h - 17404h

2006 5794h 14950h 1639h 3383h 1369h - 27135h

2007 7441h 18675h 1970h 4634h 2337h 2473 37530h


Open Domain Question/Answering in Audio data

• Automatic structurization of audio data

• Detect semantically meaningful units

• Specific entity taggers (named, task, linguistic entities)

- rewrite rules (work like local grammars with specific dictionaries) - named and linguistic entity tagging are language dependent but task independent

- task entity tagging is language independent but task dependent

• Projects: RITel (ritel.limsi.fr), Chil (chil.server.de), Quaero


• Participation in CLEF track QAst07, QAst08 (www.lsi.upc.edu/˜qast/

2008) [Rosset et al, ASRU 2007]

Compare QA performance on manual and ASR transcripts of varied audio Best accuracy about 0.4-0.5, generally degradation with increasing WER


RITel System


RITel System

Information Retrieval by Telephone [Rosset et al, 2005,2006]

ritel.limsi.fr Example stop

• Fully integrated open domain Interactive Information Retrieval System • Speech detection and recognition

• Non-Contextual analysis (for texts and speech transcripts) • General Information Retrieval

• IR on databases and dialogic interactions • Management of dialog history

• Natural Language Generation


Question: What is the Vlaams Blok?

Manual transcript: the Belgian Supreme Court has upheld a previous

ruling that declares the Vlaams Blok a criminal organization and effectively bans it .

Answer: criminal organisation

Extracted portion of an automatic transcript (CTM file format): (...) 20041115 1705 1735 EN SAT 1 1018.408 0.440 Vlaams 0.9779 20041115 1705 1735 EN SAT 1 1018.848 0.300 Blok 0.8305 20041115 1705 1735 EN SAT 1 1019.168 0.060 a 0.4176 20041115 1705 1735 EN SAT 1 1019.228 0.470 criminal 0.9131 20041115 1705 1735 EN SAT 1 1019.858 0.840 organisation 0.5847 20041115 1705 1735 EN SAT 1 1020.938 0.100 and 0.9747 (...) Answer: 1019.228 1019.858


Speech Analytics

• Statistical analyses of huge speech corpora [Infom@gic/Callsurf] • Client satisfaction, service efficiency, ...

• Extraction of metadata (content, topics, quality of interaction ...) • Confidentiality and privacy issues


Speech Analytics Sample Extract

• CallSurf

A: oui bonjour Monsieur, EDF pro `a l’appareil. C: oui. ah oui oui oui.

A: ouais je je vous appelle Monsieur, je vous avais eu il y a quelques

jours au t ´el ´ephone,

C: tout `a fait.

A: et je vous avais transf ´er ´e vers euh des conseillers euh nouvelle offre

qui m’ont renvoy ´e un mail ce matin, en me disant de vous recontacter puisque la demande de mise en service par le premier conseiller que vous aviez eu `a l’origine n’avait pas ´et ´e faite.

C: ah bon

A: ouais. donc euh bon c’est pas coup ´e sur place. c’est pas le probl `eme



Detailed Transcription

edfp h5 fre 00000051 spk1 16.510 17.144 <o,f0,male> all `o oui

edfp h5 fre 00000051 spk2 17.144 18.140 <o,f0,female> {fw} Monsieur Dupont

edfp h5 fre 00000051 spk1 18.140 18.563 <o,f0,male> oui

edfp h5 fre 00000051 spk2 18.563 20.827 <o,f0,female> oui Ma(demoiselle) Mademoiselle Brun EDF pro

edfp h5 fre 00000051 spk1 21.159 21.612 <o,f0,male> oui

edfp h5 fre 00000051 spk2 21.612 27.769 <o,f0,female> {breath} {fw} je vous rappelle

je viens d’ avoir le service technique {fw} concernant {fw} notre discussion {fw} de ce matin

edfp h5 fre 00000051 1 spk2 28.403 37.035 <o,f0,female> donc {fw} apr `es v ´erification en fait le pr ´ed ´ecesseur qui ´etait donc un client {fw} particulier en r ´ef ´erence


Quick Anonymized Transcription

A/ Sylviane XX EDF Pro bonjour .

C/ all `o bonjour

A/ oui

C/ je vous donne ma r ´ef ´erence client ?

A/ s’ il vous plait C/ rrrrr A/ rr rrr C/ rrr A/ oui C/ rrr tiret r r


Training with Reduced Information

• Transcriptions assumed to be imperfect

• Alignment allowing all reference words to be optionally skipped • Filler words and breath can be inserted between words

• Anonymized forms absorbed by a generic model

• Training uses a word lattice generated by the decoder • Word error rate currently about 30%


Some Perspectives

• State-of-the-art transcription systems have WER ranging from 10-30%

depending upon the tasks

• Comparable results obtained on several languages (English, French,

German, Mandarin, Spanish, Dutch ...)

• Performance is sufficient for some tasks (indexing huge amounts of

au-dio/video, spoken document retrieval, speech-to-speech translation)

• Requires substantial training resources but recent research has

demon-strated the potential of semi- and unsupervised methods

• Processing of more varied data (podcasts, voice mail, meetings)

• Dealing with different speaker accents (regional and non-native accents) • Keeping models up-to-date


Over 25 Years of Progress

1980 1990 2000 2008 Controled dialog Voice commands single speaker Isolated words 2 − 30 words Connected words Conversational Telephone Speech

single speaker speaker indep. 10 − 100 words 2 − 30 words

Transcription for indexation speaker indep.

Isolated word dictation single speaker 20k words single speaker 60k words speaker indep 10 words Continuous dictation Speech Analytics Audio mining Unlimited Domain Unlimited Domain Speech−to−speech Translation Transcription Internet audio TV & radio




Related subjects :