AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking

(1)

E

S

E

A

R

C

H

R

E

P

R

O

R

T

I

D

I

A

P

Dalle Molle Institute

for Perceptual Artificial

Intelligence P.O.Box 592 M a r t i g n yV a l a i sS w i t z e r l a n d phone +41 27 721 77 11 fax +41 27 721 77 12 e-mail [email protected] internethttp://www.idiap.ch AV16.3: an Audio-Visual

Corpus for Speaker

Localization and Tracking

Guillaume Lathoud a,b Jean-Marc Odobez a Daniel Gatica-Perez a IDIAP{RR 04-28 August 2004 to appear in

Proceedingsofthe2004WorkshoponMachine LearningforMultimodal

Interaction(MLMI'04),BengioandBourlardEds,Springer-Verlag,2004

a

IDIAPResearchInstitute,CH-1920Martigny,Switzerland

b

(2)

(3)

AV16.3: an Audio-Visual Corpus for Speaker

Localization and Tracking

GuillaumeLathoud Jean-Marc Odobez Daniel Gatica-Perez

August2004

to appear in

Proceedingsofthe2004WorkshoponMachineLearningforMultimodalInteraction(MLMI'04),

BengioandBourlardEds,Springer-Verlag,2004

Abstract. Assessingthe quality of aspeakerlocalization or trackingalgorithm onafew short

examplesisdiÆcult,especiallywhentheground-truthisabsentornotwelldened. Onestep

to-wardssystematicperformanceevaluationofsuchalgorithmsistoprovidetime-continuousspeaker

locationannotationoveraseriesofrealrecordings,coveringvarioustestcases. Areasofinterest

include audio, video and audio-visual speakerlocalization and tracking. The desired location

annotationcanbeeither2-dimensional(imageplane)or3-dimensional(physicalspace). This

pa-permotivatesanddescribesacorpusofaudio-visualdatacalled\AV16.3",alongwithamethod

for3-Dlocationannotationbasedoncalibratedcameras. \16.3"standsfor 16microphones and

3cameras, recordedinafullysynchronizedmanner,inameetingroom. Partof thiscorpushas

(4)

1 Introduction

This paper describes a corpus of audio-visual data called \AV16.3", recorded in a meeting room

context. \16.3"standsfor16microphonesand 3cameras,recordedin afullysynchronizedmanner.

The central idea is to use calibrated cameras to provide continuous 3-dimensional (3-D) speaker

location annotation for testing audio localization and tracking algorithms. Particular attention is

giventooverlappedspeech,i.e. whenseveralspeakersaresimultaneouslyspeaking. Overlapisindeed

animportantissueinmulti-partyspontaneousspeech,asfoundinmeetings[1]. Sincevisualrecordings

areavailable,videoandaudio-visualtrackingalgorithmscanalsobetested. Wethereforedenedand

recorded a series of scenarios so as to cover a variety of research areas, namely audio, video and

audio-visuallocalizationandtrackingofpeoplein ameeting room. Possibleapplicationsrangefrom

automaticanalysisofmeetingsto robustspeechacquisition andvideosurveillance,tonameafew.

Inordertoallowforsuchabroadrangeofresearchtopics,\meetingroomcontext"isdenedhere

in awide way. This includes ahigh varietyof situations, from \meetingsituations" wherespeakers

are seated most of the time, to \motion situations" where speakersare moving most of the time.

This departs from existing, related databases: for example the ICSI database [2] contains

audio-only recordings of natural meetings, the CUAVE database [3] does contain audio-visual recordings

(close-ups)but focuses onmultimodalspeechrecognition. TheCIPIC[4]databasefocuses on

Head-RelatedTransferFunctions. Insteadoffocusingtheentiredatabaseononeresearchtopic,wechoseto

haveasingle,generic setup,allowingverydierentscenariosfordierentrecordings. Thegoalis to

provide annotationbothin termsof \true"3-Dspeakerlocationin the microphonearrays'referent,

and \true" 2-D head/face location in the image plane of each camera. Such annotation permits

systematicevaluation oflocalizationandtrackingalgorithms,asopposedtosubjectiveevaluationon

a few short examples without annotation. To the best of our knowledge, there is no such

audio-visualdatabasepubliclyavailable.Thedatasetwepresentherehasbeguntobeused: tworecordings

withstaticspeakershavealreadybeensuccessfullyusedtoreportresultsonreal multi-sourcespeech

recordings[5].

Whileinvestigating forexisting solutionsforspeakerlocation annotation,wefound various

solu-tions with devices to beworn by each person and abase device that locateseach personal device.

However,these solutionswereeither verycostlyand extremelyperformant(highprecisionand

sam-pling rate, notether betweenthe base and the personal devices), or cheapbut with poorprecision

and/or high constraints (e.g. personal devices tethered to the base). We therefore optedfor using

calibratedcamerasforreconstructing 3-Dlocationof thespeakers. Itis importanttonote that this

solutionispotentiallynon-intrusive,whichisindeedthecaseonpartofthecorpuspresentedhere: on

somerecordingsnoparticularmarkeriswornbytheactors.

Inthedesignofthecorpus,twocontradictingconstraintsneededtobefullled: 1)theareaoccupied

byspeakersshouldbelargeenoughtocoverboth\meetingsituations"and\motionsituations",2)this

areashouldbeentirelyvisiblebyallcameras. Thelatterallowssystematicoptimizationofthecamera

placement. Italsoleadsto robustreconstruction of3-Dlocationinformation,sinceinformationfrom

allcamerascanbeused.

Therestofthispaperisorganizedasfollows: Section2describesthephysicalsetupandthecamera

calibrationprocessusedtoprovide3-Dmouthlocationannotation. Section3describesandmotivates

a set of sequences publicly available via Internet. Section 4 discusses the annotationprotocol, and

reportsthecurrentstatusoftheannotationeort.

2 Physical Setup and Camera Calibration

Forpossiblespeakers'locations,weselectedaL-shapedareaaroundthetablesinameetingroom,as

depictedinFig.1. Ageneraldescriptionofthemeeting roomcanbefoundin [6]. TheL-shapedarea

(5)

Tables

C3

C2

C1

Speakers

Area

MA1

MA2

Figure 1: Physical setup: threecamerasC1,C2 andC3 and two8-microphone circulararraysMA1

andMA2. Thegrayareaisintheeldofviewofallthreecameras. TheL-shapedareaisa3m-long

by2m-widerectangle,minus a0.6m-wideportiontakenbythetables.

dierentcamerascanbeseeninFig.2. Thedataitselfisdescribedin Sect.3.

Thechoiceofhardwareisdescribedandmotivatedin Sect.2.1. Weadopteda2-stepstrategyfor

placingthecamerasandcalibratingthem. First,cameraplacement(location,orientation,zoom)is

op-timized,usingaloopingprocessincludingsub-optimalcalibrationofthecameraswith2-Dinformation

only(Sect.2.2). Second,eachcameraiscalibratedinaprecisemanner,usingboth2-Dmeasurements

and3-Dmeasurementsinthereferentofthemicrophonearrays(Sect. 2.3).

Theideabehindthisprocessisthatifwecantrackthemouthofapersonin eachcamera'simage

plane,thenwecanreconstructthe3-Dtrajectoryofthemouthusingthecameras'calibration

param-eters. Thiscanbeusefulasaudioannotation,providedthe3-Dtrajectoryisdenedinthereferentof

themicrophonearrays. Weshowthat the3-Dreconstructionerroriswithin averyacceptablerange.

2.1 Hardware

Weused3camerasandtwo10cm-radius,8-microphonearraysfromaninstrumentedmeetingroom[6].

Thetwomicrophonearraysareplaced0.8m apart. Themotivationbehindthischoiceisthreefold:

Recordingsmadewithtwomicrophonearraysprovidetestcasesfor3-Daudiosourcelocalization

andtracking,aseachmicrophonearraycanbeusedtoprovidean(azimuth,elevation)location

estimateof eachaudiosource.

Recordingsmadewithseveralcamerasgeneratemanyinteresting,realisticcasesofvisual

occlu-sion,viewingeach person fromseveralviewpoints.

At least two cameras are necessaryfor computing the 3-D coordinates of an object from the

2-Dcoordinatesin cameras'image planes. Theuse ofthree camerasallowsto reconstruct the

3-Dcoordinatesofanobjectinarobustmanner. Indeed,inmostcases,visualocclusionoccurs

inonecameraonly;theheadofthepersonremainsvisiblefromthetwoothercameras.

2.2 Step One: Camera Placement

ThisSectiondescribestheloopingprocessusedtooptimizecamerasplacement(location,orientation,

(6)

(Mul-camera 1

camera 2

camera 3

camera 1

Figure 2: Snapshots from the cameras at their nal positions. Red \+" designate points in the

calibrationtrainingset

train

,green\x"designatepointsinthecalibrationtestset

test .

(7)

unknown. TheMultiCamSelfCalusesonlythe2-Dcoordinatesin theimageplaneofeachcamera. It

jointly produces aset of calibration parameters

1

foreach cameraand 3-Dlocationestimates of the

calibrationpoints,byoptimizingthe\2-Dreprojectionerror". Foreachcamera,\2-D

reprojec-tionerror"isdenedasthedistanceinpixelsbetweentherecorded2-Dpointsandtheprojectionof

their3-Dlocationestimatesbackontothecameraimageplane,usingtheestimatedcameracalibration

parameters. Althoughweused thesoftwarewiththestrictminimumnumberofcameras(three),the

obtained2-Dreprojectionerrorwasdecent: itsupperbound wasestimatedaslessthan0.17pixels.

Thecameraplacementprocedureconsists inaniterativeprocesswiththree steps: Place,Record

andCalibrate:

1. Place the three cameras (location, orientation, zoom)based on experience in prior iterations.

Inpracticethevariouscamerasshould giveviews thatareasdierentaspossible.

2. Record synchronouslywiththe3camerasasetofcalibrationpoints,i.e. 2-Dcoordinatesin the

imageplaneofeachcamera. Asexplainedin[7],wavingalaserbeamerindarknessissuÆcient.

3. Calibrate the3camerasbyrunningMultiCamSelfCalonthecalibrationpoints.

MultiCamSelf-Caloptimizesthe2-Dreprojectionerror.

4. Totrydecreasingthe2-Dreprojectionerror, loopto1. Else goto5. Inpractice,a2-D

repro-jectionerrorbelow0.2pixelsisreasonable.

5. Selectthecameraplacementthatgavethesmallest2-Dreprojectionerror.

Multi-cameraself-calibrationisgenerallyknowntoprovidelessprecisionthanmanualcalibration

usinganobjectwithknown3-Dcoordinates. Themotivationforusingitwaseaseofuse: the

calibra-tionpointscanbequicklyrecordedwithalaserbeamer. OneiterationofthePlace/Record/Calibrate

loop thus takes about 1h30. This process converged to the positioning of the camera depicted in

Fig.1.

Fordetailedinformation,includingthemulti-cameraself-calibrationproblemstatement,thereader

isinvitedtorefertothedocumentationin[7].

2.3 Step Two: Camera Calibration

This Section describes precise calibration of each camera, assuming the cameras' placement xed

(location, orientation, zoom). This is done by selecting and optimizing the calibration parameters

for each camera, on a calibration object. For each point of the calibration object, both true

3-D coordinates in the microphone arrays' referent and true 2-D coordinates in each camera's

image planeareknown. 3-Dcoordinateswereobtainedon-sitewith ameasuringtape(measurement

errorestimatedbelow0.005m). CrossesinFig.2showthe3-Dcalibrationpoints. Thesepointswere

split intwosets:

train

(36points)and

test

(39points).

Particularmentionmust bemade ofthe model selectionissue, i.e. howwechose to model

non-linear distortions produced by each camera's optics. An iterative process that evaluates adequacy

of the calibration parameters of all three cameras in terms of \3-D reconstruction error" was

adopted: the Euclidean distance between 3-D location estimates of points visible from at least 2

cameras,andtheirtrue3-Dlocation. Thecameracalibrationprocedurecanbedetailedasfollows:

1. Model selection: foreach camera,selectthe set ofcalibration parametersbasedonexperience

inprioriterations.

2. Model training: for each camera, estimatethe selected calibrationparameters on

train using

thesoftwareavailablein [8].

(8)

3. 3-D error: for each point in train

, compute the Euclidean distance between true 3-D

coor-dinates and 3-D coordinates reconstructed using the trained calibration parameters and the

2-Dcoordinatesineachcamera'simage plane.

4. Evaluation: estimatethe\training"maximum3-Dreconstructionerroras+3,whereand

respectivelystandformeanandstandarddeviationofthe3-Derror,acrossallpointsin

train .

5. Totrydecreasingthemaximum3-Dreconstructionerror,loopto 1. Elsegoto6.

6. Selectthesetofcalibrationparametersandtheirestimatedvalues,thatgavethesmallest

max-imum3-Dreconstructionerror.

Theresultofthisprocessis aset ofcalibrationparametersandtheirvaluesforeach camera. Forall

camerasthebestsetofparameterswerefocalcenter,focallengths,r

2

radialandtangentialdistortion

coeÆcients.

Oncethetrainingwasover,weevaluatedthe3-Derrorontheunseentestset

test

. Themaximum

3-Dreconstructionerroronthissetwas0.012m. Thismaximumerrorwasdeemeddecent,ascompared

tothediameterofanopenmouth (about0.05m).

3 Online Corpus

This Section rst motivates and describes the variety of sequences recorded, and then describes in

moredetailstheannotatedsequences. \Sequence"means:

3videoDIVXAVIles(resolution288x360),oneforeachcamera,sampledat25Hz. Itincludes

alsooneaudiosignal.

16audioWAV lesrecordedfrom thetwocircular 8-microphonearrays,sampledat16kHz.

Whenpossible, moreaudioWAV les recordedfrom lapels wornby the speakers, sampled at

16kHz.

All les were recorded in a synchronous manner: video les carry a time-stamp embedded in the

upperrowsofeachimage, andaudiolesalwaysstartat videotimestamp00:00:10.00. Complete

details aboutthehardwareimplementation of auniqueclock acrossallsensors canbefound in [6].

Althoughonly8sequenceshavebeenannotated,manyothersequencesarealsoavailable. Thewhole

corpus, alongwith annotationles, camera calibrationparametersand additionaldocumentationis

accessible 2

at: http://mmm.idiap.ch/Lathoud/av16.3v6. It wasrecordedoveraperiodof5days,

andincludes42sequencesoverall,withsequencedurationrangingfrom14secondsto9minutes(total

1h25). 12 dierentactors wererecorded. Although theauthors of thepresent paperwererecorded,

many of theactors don't haveanyparticular expertise in theelds of audioand video localization

andtracking.

3.1 Motivations

Themain objectiveistostudyseverallocalization/trackingphenomena. Anon-limitinglistincludes:

Overlappedspeech.

Closeandfarlocations,smallandlargeangularseparations.

Objectinitialization.

Variable numberofobjects.

(9)

Table 1: List of the annotated sequences. Tags mean: [A]udio, [V]ideo, predominant [ov]erlapped

speech, atleastonevisual[occ]lusion,[S]taticspeakers,[D]ynamicspeakers,[U]nconstrainedmotion,

[M]outh,[F]ace,[H]ead,speech/silence [seg]mentation.

Sequence Duration Modalities Nb. of Speaker(s) Desired

name (seconds) ofinterest speakers behavior annotation

seq01-1p-0000 217 A 1 S M,seg

seq11-1p-0100 30 A, V,AV 1 D M,F,seg

seq15-1p-0100 35 AV 1 S,D(U) M,F,seg

seq18-2p-0101 56 A(ov) 2 S,D M,seg

seq24-2p-0111 48 A(ov),V(occ) 2 D M,F

seq37-3p-0001 511 A(ov) 3 S M,seg

seq40-3p-0111 50 A(ov),AV 3 S,D M,F

seq45-3p-1111 43 A(ov),V(occ),AV 3 D(U) H

Partialandtotalocclusion.

\Natural"changesofillumination.

Accordingly,wedened andrecordedasetofsequencesthat containsahighvarietyoftestcases:

fromshort,veryconstrained,speciccases(e.g. visualocclusion),foreachmodality(audioorvideo),

to naturalspontaneousspeechand/ormotioninmuchlessconstrainedcontext.

Each sequence is useful for at least one of three elds of research: analysis of audio, video or

audio-visualdata. Uptothreepeopleareallowedineachsequence. Humanmotioncanbestatic(e.g.

seatedpersons),dynamic (e.g. walkingpersons)oramix ofbothacrosspersons(some seated,some

walking)andtime(e.g. meetingpreceded andfollowedbypeoplestandingandmoving).

3.2 Contents

As mentionedabove, the online corpuscomprises of 8 annotatedsequences plusmany more

unan-notated sequences. These 8 sequences were selected for the initial annotation eort. This choice

is a compromise betweenhaving asmall numberof sequences for annotation, and coveringa large

varietyofsituations tofulll interestsfromvarious areasof research. It constitutesaminimal setof

sequencescoveringasmuchvarietyaspossibleacrossmodalitiesandspeakerbehaviors. Theprocess

ofannotationisdescribedinSect.4.

The name of each sequence is unique. Table 1 gives a synthetic overview. A more detailed

descriptionofeachsequencefollows.

seq01-1p-0000 Asinglespeaker,staticwhile speaking, at each of16locationscoveringtheshaded

areainFig.1. Thespeakerisfacingthemicrophonearrays. Thepurposeofthissequenceisto

evaluateaudiosourcelocalizationonasinglespeakercase.

seq11-1p-0100 Onespeaker, mostlymovingwhile speaking. The onlyconstrainton the speaker's

motionisto facethemicrophonearrays. Themotivationisto testaudio,videooraudio-visual

(AV)speakertrackingondiÆcultmotioncases. Thespeakeristalkingmostofthetime.

seq15-1p-0100 One moving speaker, walking around while alternating speech and long silences.

The purpose of this sequence is to 1) show that audio tracking alone cannot recover from

unpredictabletrajectoriesduringsilence,2)provideaninitialtestcaseforAVtracking.

seq18-2p-0101 Twospeakers,speakingandfacingthemicrophonearraysallthetime,slowlygetting

(10)

seq24-2p-0111 Twomovingspeakers,crossingtheeldofviewtwiceandoccludingeachothertwice.

Thetwospeakersaretalkingmostofthetime. Themotivationistotest bothaudioandvideo

occlusions.

seq37-3p-0001 Three speakers,staticwhilespeaking. Twospeakersremainseatedallthetimeand

thethirdoneisstanding. Overallvelocationsarecovered. Mostofthetime2or3speakersare

speakingconcurrently. (Forthis particularsequenceonlysnapshotimageles areavailable,no

AVIles.) Thepurposeofthissequenceistoevaluatemulti-sourcelocalizationandbeamforming

algorithms.

seq40-3p-0111 Three speakers,twoseatedand onestanding, allspeaking continuously, facingthe

arrays, thestanding speaker walksback and forth once behind the seatedspeakers. The

mo-tivation is both to test multi-source localization, tracking and separation algorithms, and to

highlightcomplementaritybetweenaudioandvideomodalities.

seq45-3p-1111 Three moving speakers, entering and leaving the scene, all speaking continuously,

occluding each other many times. Speakers' motionis unconstrained. This is a verydiÆcult

caseof overlapped speech and visual occlusions. Its goal is to highlight the complementarity

betweenaudioandvideomodalities.

3.3 Sequence Names

Asystematiccodingwasdened,suchthatthenameofeachsequence(1)isunique,and(2)contains

acompactdescriptionofitscontent. Forexample\seq40-3p-0111"hasthreeparts:

\seq40"istheuniqueidentierofthissequence.

\3p"meansthatoverall3dierentpersonswererecorded{butnotnecessarilyallvisible

simul-taneously.

\0111"arefourbinary agsgivingaquickoverviewofthecontentofthis recording. Fromleft

toright:

bit1: 0 means \very constrained", 1 means \mostly unconstrained" (general behavior:

al-thoughmostrecordingsfollowsomesortofscenario,someinclude verystrongconstraints

suchasthespeakerfacingthemicrophonearraysatalltimes).

bit2: 0means\static motion"(e.g. mostlyseated),1means\dynamicmotion". (e.g.

contin-uousmotion).

bit3: 0means\minorocclusion(s)",1means\atleastonemajorocclusion",involvingatleast

onearrayorcamera: wheneversomebodypassesin frontoforbehindsomebodyelse.

bit4: 0 means \little overlap", 1 means \signicant overlap". This involves audio only: it

indicateswhetherthereisasignicantproportionofoverlapbetweenspeakersand/ornoise

sources.

4 Annotation

Twotypesofannotationscanbecreated: inspace(e.g. speakertrajectory)ortime(e.g. speech/silence

segmentation). Thedenitionofannotationintrinsicallydenestheperformancemetricsthatwillbe

usedto evaluatelocalizationandtrackingalgorithms. Howannotationshould bedenedistherefore

debatable. Moreover, we note that dierent modalities (audio, video) might require very dierent

annotations (e.g. 3-D mouth location vs 2-D head bounding box). Sections 4.1 and 4.2 report the

(11)

Figure3: SnapshotsofthetwowindowsoftheHeadAnnotationInterface.

4.1 Initial Eort

Thetwosequenceswithstaticspeakersonlyhavealreadybeenfullyannotated: \seq01-1p-0000"and

\seq37-3p-0001". Theannotationincludes, foreach speaker,3-Dmouth location andspeech/silence

segmentation. 3-Dmouthlocationisdenedrelativetothemicrophonearrays'referent. Theoriginof

thisreferentisin themiddleofthetwomicrophonearrays. Thisannotationisalsoaccessibleonline.

It hasalreadybeensuccessfullyused toevaluaterecentwork[5]. Moreover,asimpleexampleof use

ofthis annotationisavailablewhithintheonlinecorpus,asdescribedin Sect.4.3.

Asforsequenceswithmovingspeakersandocclusioncases,threeMatlabgraphicalinterfaceswere

written andused toannotate locationofthehead, ofthemouth andofan optionalmarker(colored

ball)onthepersons' heads:

BAI: theBall AnnotationInterface,tomarkthelocationof acoloredballontheheadof aperson,

asan ellipse. Occlusions canbemarked,i.e. whentheball is notvisible. TheBAI includes a

simpletrackertointerpolatebetweenmanualmeasurements.

HAI: theHeadAnnotationInterface,tomarkthelocationoftheheadofaperson,asarectangular

bounding box. Partialorcompleteocclusionscanbemarked.

MAI: the Mouth Annotation Interface, to mark the locationof the mouth of a person as apoint.

Occlusionscanbemarked,i.e. whenthemouthis notvisible.

All three interfaces share verysimilar features, including twowindows: one for the interface itself,

and asecond onefor theimagecurrentlybeingannotated. An exampleofsnapshotofthe HAIcan

beseeninFig.3. All annotationlesaresimplematricesstoredinASCII format.

Allthreeinterfacesareavailableanddocumentedonline,withinthecorpusitself. Wehavealready

usedthemto producecontinuous3-Dmouthlocationannotationfrom sparsemanualmeasurements,

(12)

Table2: Annotationavailable onlineasofAugust31st,2004. \C"meanscontinuousannotation,i.e.

allframesofthe25Hzvideoareannotated. \S"meanssparseannotation,i.e. theannotationisdone

ataratelessthan25Hz(giveninparenthesis).

Sequence ball mouth head speech/silence

2-D 3-D 2-D 3-D 2-D segmentation seq01-1p-0000 C C precise seq11-1p-0100 C C C C seq15-1p-0100 S(2Hz) S(2Hz) seq18-2p-0101 C C C C seq24-2p-0111 C C C C S(2Hz) seq37-3p-0001 C C undersegmented seq40-3p-0111 S(2Hz) S(2Hz) seq45-3p-1111 S(2Hz) S(2Hz) S(2Hz) 4.2 Current State

Theannotationeortisconstantlyprogressingovertime,andTable2detailswhatisalreadyavailable

onlineasofAugust31st,2004.

4.3 Example 1: Audio Source Localization Evaluation

The online corpus includes a complete example (Matlab les) of single source localization followed

bycomparison withtheannotation, for\seq01-1p-0000". Itis basedonaparametricmethod called

SRP-PHAT[9]. All necessaryMatlabcodeto runthe exampleis availableonline

3

. Thecomparison

shows that theSRP-PHAT localization method providesa precisionbetween -5 and +5 degrees in

azimuth.

4.4 Example 2: Multi-Object Video Tracking

As an example, the results of applying three independent, appearance-based particle lters on 200

framesof the\seq45-3p-1111"sequence,using onlyoneof thecameras,areshownin Fig. 4,and in

avideo

4

. Thesequence depictsthree people movingaround theroom whilespeaking, and includes

multiple instances of object occlusion. Each tracker has been initialized by hand, and uses 500

particles. Objectappearanceismodeledbyacolordistribution[10]inRGBspace.

Inthis particularexamplewehavenotdone any performance evaluation yet. Weplan to dene

precisionandrecall basedontheintersecting surfacebetweenthe annotationbounding boxandthe

resultbounding box.

4.5 Example 3: 3-D Mouth Annotation

Fromsparse2-Dmouthannotationoneachcameraweproposeto(1)reconstruct3-Dmouthlocation

using cameracalibration parametersestimated asexplainedin Sect. 2.3, (2) interpolate3-D mouth

locationusing theballlocationasoriginof the3-Dreferent. The3-Dball locationitselfis provided

bythe2-DtrackerintheBAIinterface(seeSect.4.1)and3-Dreconstruction. Themotivationofthis

choice wastwofold: rstofall,using simple(e.g. polynomial) interpolationonmouthmeasurements

wasnotenoughin practice,sincehumanmotioncontainsmanycomplexnon-linearities(sharpturns

and accelerations). Second, visual tracking of the mouth is a hard task in itself. We found that

interpolatingmeasurementsinthemovingreferentofanautomaticallytrackedballmarkeriseective

3

http://mmm.idiap.ch/La thou d/a v16. 3v6/EXAMPLES/AUDIO/READ ME

(13)

(14)

even at low annotationrates (e.g 2Hz= 1videoframe out of12), which is particularly important

since the goal is to saveon time spent doing manual measurements. A complete examplewith all

necessaryMatlabimplementationcanbefoundonline

5

. Thisimplementationwasusedto createall

3-Dlesavailablewithin thecorpus.

4.6 Future Directions

DiÆculties arise mostlyintwocases: 1) predominanceofoverlappedspeech, and 2)highly dynamic

situations,in terms ofmotionsand occlusions. 1) canbeaddressedby undersegmentingthe speech

and dening propermetrics forevaluation. By\undersegmenting" wemean that less segmentsare

dened, each segment comprising some silence and speech which is too weak to be localized. An

exampleisgivenin[5].

2) ismore diÆcultto address. It is intrinsicallylinked to theminimum intervalat which

anno-tation measurementsare taken, and therefore the interval at which performance will be evaluated.

Consideringthefact thatlocationbetweentwomeasurementscanbeinterpolated,twoattitudescan

beenvisaged:

1. Onshortsequences,withveryspecictestcases,theintervalcanbechosenverysmall,inorder

toobtain ne-grained,precise spatial annotation. Even with interpolation, this would require

independentobserver(s)togivemanytruelocationmeasurements.

2. On long sequences, the interval can be chosen larger. If the interpolated annotation is used

for performanceevaluation, slightimprecision canbe tolerated, ascompensatedby thesize of

thedata (\continuous"annotation). Ifthemanualannotationmeasurementsonlyareusedfor

performanceevaluation(\sparse"annotation),theevaluation willbemoreprecise,andthe

rel-ativelylargenumberofsuchmeasurementsmaystillleadtosignicantresults. By\signicant"

we meanthat thestandard deviation of the erroris small enough for theaverage errorto be

meaningful.

5 Conclusion

ThispaperpresentedtheAV16.3corpusforspeakerlocalizationandtracking. AV16.3focusesmostly

on the context of meeting room data, acquired synchronously by 3 cameras, 16 far-distance

micro-phones,andlapels. Ittargetsvariousareasofresearch:audio,visualandaudio-visualspeakertracking.

Inordertoprovideaudioannotation,cameracalibrationisusedtogenerate\true"3-Dspeakermouth

location, using freely available software. Tothe best of ourknowledge, this is the rstattempt to

provide synchronized audio-visual data for extensive testing on a variety of test cases, along with

spatial annotation. AV16.3 is intended as astep towards systematic evaluation of localization and

trackingalgorithms onreal recordings. Futurework includes completion of the annotationprocess,

andpossiblydataacquisitionwithdierentsetups.

6 Acknowledgments

Theauthors acknowledge thesupport ofthe EuropeanUnion throughtheAMI, M4, HOARSEand

IM2.SA.MUCATAR projects. Theauthors wish to thankall actors recordedin this corpus, Olivier

Massonforhelpwiththephysicalsetup,andMathewMagimai.-Dossforvaluablecomments.

(15)

References

[1] E. Shriberg, A. Stolcke,and D. Baron. Observations on overlap: ndings and implicationsfor

automaticprocessingofmulti-partyconversation. InProceedings of Eurospeech 2001, volume2,

pages1359{1362,2001.

[2] A.Janin,D.Baron,J.Edwards,D.Ellis,D.Gelbart,N.Morgan,B.Peskin,T.Pfau,E.Shriberg,

A. Stolcke, andC.Wooters. TheICSImeeting corpus. InProceedings of the 2003 IEEE

Inter-nationalConference onAcoustics,Speech andSignalProcessing(ICASSP-03), 2003.

[3] E.Patterson,S.Gurbuz, Z.Tufekci,and J.Gowdy. Movingtalker,speaker-independentfeature

study and baseline results using the CUAVE multimodal speech corpus. Eurasip Journal on

AppliedSignalProcessing, 11:1189{1201,2002.

[4] V.R.Algazi,R.O.Duda,andD.M.Thompson. TheCIPICHRTFdatabase.InProceedingsofthe

2001 IEEE Workshop on Applications of SignalProcessingto Audioand Acoustics

(WASPAA-01),2001.

[5] GuillaumeLathoud andIainA. McCowan. A Sector-BasedApproachforLocalizationof

Multi-ple Speakers withMicrophone Arrays. In Proceedings of the 2004 ISCA Tutorial andResearch

Workshop onStatistical andPerceptual Audio Processing(SAPA-04),October2004.

[6] D. Moore. TheIDIAPSmartMeetingRoom. IDIAP-COM07,IDIAP,2002.

[7] T.Svoboda.Multi-CameraSelf-Calibration.http://cmp.felk.cvut.cz/ svoboda/SelfCal/index.html,

August2003.

[8] J. Y. Bouguet. Camera Calibration Toolbox for Matlab.

http://www.vision.caltech.edu/bouguetj/calibdoc/,January2004.

[9] J. DiBiase, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In

M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 8, pages 157{180.Springer,

2001.

[10] P.Perez,C.Hue,J.Vermaak,andM.Gangnet. Color-basedProbabilisticTracking. In