IDIAP RESEARCH REPORT
Dalle Molle Institute for Perceptual Artificial Intelligence
P.O. Box 592, Martigny, Valais, Switzerland
phone +41 27 721 77 11, fax +41 27 721 77 12
e-mail [email protected], internet http://www.idiap.ch

AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking

Guillaume Lathoud a,b, Jean-Marc Odobez a, Daniel Gatica-Perez a

IDIAP-RR 04-28
August 2004

To appear in Proceedings of the 2004 Workshop on Machine Learning for Multimodal Interaction
(MLMI'04), Bengio and Bourlard Eds, Springer-Verlag, 2004.
a IDIAP Research Institute, CH-1920 Martigny, Switzerland
b
Abstract. Assessing the quality of a speaker localization or tracking algorithm on a few short
examples is difficult, especially when the ground-truth is absent or not well defined. One step
towards systematic performance evaluation of such algorithms is to provide time-continuous speaker
location annotation over a series of real recordings, covering various test cases. Areas of interest
include audio, video and audio-visual speaker localization and tracking. The desired location
annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This
paper motivates and describes a corpus of audio-visual data called "AV16.3", along with a method
for 3-D location annotation based on calibrated cameras. "16.3" stands for 16 microphones and
3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has [...]
1 Introduction
This paper describes a corpus of audio-visual data called "AV16.3", recorded in a meeting room
context. "16.3" stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner.
The central idea is to use calibrated cameras to provide continuous 3-dimensional (3-D) speaker
location annotation for testing audio localization and tracking algorithms. Particular attention is
given to overlapped speech, i.e. when several speakers are speaking simultaneously. Overlap is indeed
an important issue in multi-party spontaneous speech, as found in meetings [1]. Since visual recordings
are available, video and audio-visual tracking algorithms can also be tested. We therefore defined and
recorded a series of scenarios so as to cover a variety of research areas, namely audio, video and
audio-visual localization and tracking of people in a meeting room. Possible applications range from
automatic analysis of meetings to robust speech acquisition and video surveillance, to name a few.
In order to allow for such a broad range of research topics, "meeting room context" is defined here
in a wide way. This includes a high variety of situations, from "meeting situations" where speakers
are seated most of the time, to "motion situations" where speakers are moving most of the time.
This departs from existing, related databases: for example, the ICSI database [2] contains
audio-only recordings of natural meetings, while the CUAVE database [3] does contain audio-visual
recordings (close-ups) but focuses on multimodal speech recognition. The CIPIC database [4] focuses
on Head-Related Transfer Functions. Instead of focusing the entire database on one research topic,
we chose to have a single, generic setup, allowing very different scenarios for different recordings.
The goal is to provide annotation both in terms of "true" 3-D speaker location in the microphone
arrays' referent (frame of reference), and "true" 2-D head/face location in the image plane of each
camera. Such annotation permits systematic evaluation of localization and tracking algorithms, as
opposed to subjective evaluation on a few short examples without annotation. To the best of our
knowledge, there is no such audio-visual database publicly available. The dataset we present here
has begun to be used: two recordings with static speakers have already been successfully used to
report results on real multi-source speech recordings [5].
While investigating existing solutions for speaker location annotation, we found various
solutions with devices to be worn by each person and a base device that locates each personal device.
However, these solutions were either very costly and high-performance (high precision and
sampling rate, no tether between the base and the personal devices), or cheap but with poor precision
and/or strong constraints (e.g. personal devices tethered to the base). We therefore opted for using
calibrated cameras to reconstruct the 3-D location of the speakers. It is important to note that this
solution is potentially non-intrusive, which is indeed the case for part of the corpus presented here:
in some recordings no particular marker is worn by the actors.
In the design of the corpus, two conflicting constraints needed to be fulfilled: 1) the area occupied
by speakers should be large enough to cover both "meeting situations" and "motion situations", 2) this
area should be entirely visible by all cameras. The latter allows systematic optimization of the camera
placement. It also leads to robust reconstruction of 3-D location information, since information from
all cameras can be used.
The rest of this paper is organized as follows: Section 2 describes the physical setup and the camera
calibration process used to provide 3-D mouth location annotation. Section 3 describes and motivates
a set of sequences publicly available via the Internet. Section 4 discusses the annotation protocol, and
reports the current status of the annotation effort.
2 Physical Setup and Camera Calibration
For possible speakers' locations, we selected an L-shaped area around the tables in a meeting room, as
depicted in Fig. 1. A general description of the meeting room can be found in [6]. The L-shaped area [...]
Figure 1: Physical setup: three cameras C1, C2 and C3 and two 8-microphone circular arrays MA1
and MA2. The gray area is in the field of view of all three cameras. The L-shaped area is a 3 m-long
by 2 m-wide rectangle, minus a 0.6 m-wide portion taken by the tables. (Diagram labels: Tables,
Speakers Area, C1, C2, C3, MA1, MA2.)
[...] different cameras can be seen in Fig. 2. The data itself is described in Sect. 3.
The choice of hardware is described and motivated in Sect. 2.1. We adopted a 2-step strategy for
placing the cameras and calibrating them. First, camera placement (location, orientation, zoom) is
optimized, using a looping process including sub-optimal calibration of the cameras with 2-D information
only (Sect. 2.2). Second, each camera is calibrated in a precise manner, using both 2-D measurements
and 3-D measurements in the referent of the microphone arrays (Sect. 2.3).
The idea behind this process is that if we can track the mouth of a person in each camera's image
plane, then we can reconstruct the 3-D trajectory of the mouth using the cameras' calibration
parameters. This can be useful as audio annotation, provided the 3-D trajectory is defined in the
referent of the microphone arrays. We show that the maximum 3-D reconstruction error, measured on
held-out calibration points, is 0.012 m (Sect. 2.3).
2.1 Hardware
We used 3 cameras and two 10 cm-radius, 8-microphone arrays from an instrumented meeting room [6].
The two microphone arrays are placed 0.8 m apart. The motivation behind this choice is threefold:
- Recordings made with two microphone arrays provide test cases for 3-D audio source localization
  and tracking, as each microphone array can be used to provide an (azimuth, elevation) location
  estimate of each audio source.
- Recordings made with several cameras generate many interesting, realistic cases of visual
  occlusion, viewing each person from several viewpoints.
- At least two cameras are necessary for computing the 3-D coordinates of an object from the
  2-D coordinates in cameras' image planes. The use of three cameras allows the 3-D coordinates
  of an object to be reconstructed in a robust manner. Indeed, in most cases, visual occlusion occurs
  in one camera only; the head of the person remains visible from the two other cameras.
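As a rough illustration of the first point, two (azimuth, elevation) estimates, one per array, can be
combined into a 3-D location by intersecting the two bearing lines. The sketch below is not part of
the corpus tools: the array centers at (-0.4, 0, 0) and (+0.4, 0, 0), the axis conventions and all
function names are assumptions made for illustration (only the 0.8 m spacing comes from the setup).

```python
import math

def to_azimuth_elevation(array_center, source):
    """Bearing in degrees of a 3-D source as seen from an array center (assumed axes)."""
    dx, dy, dz = (s - c for s, c in zip(source, array_center))
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation

def direction(azimuth_deg, elevation_deg):
    """Unit vector pointing along an (azimuth, elevation) bearing."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate(center1, bearing1, center2, bearing2):
    """Midpoint of the shortest segment joining the two bearing lines."""
    d1, d2 = direction(*bearing1), direction(*bearing2)
    w = tuple(a - b for a, b in zip(center1, center2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("bearing lines are parallel")
    t1 = (b * e - c * d) / denom  # parameter of the closest point on line 1
    t2 = (a * e - b * d) / denom  # parameter of the closest point on line 2
    q1 = tuple(p + t1 * u for p, u in zip(center1, d1))
    q2 = tuple(p + t2 * u for p, u in zip(center2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(q1, q2))
```

With noisy bearings the two lines generally do not intersect, which is why the midpoint of the
shortest connecting segment is returned rather than an exact intersection.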
2.2 Step One: Camera Placement
This Section describes the looping process used to optimize camera placement (location, orientation,
[...]

Figure 2: Snapshots from the cameras at their final positions (panels labeled by camera). Red "+"
marks designate points in the calibration training set ("train"); green "x" marks designate points
in the calibration test set ("test").
[...] unknown. MultiCamSelfCal uses only the 2-D coordinates in the image plane of each camera. It
jointly produces a set of calibration parameters for each camera and 3-D location estimates of the
calibration points, by optimizing the "2-D reprojection error". For each camera, the "2-D
reprojection error" is defined as the distance in pixels between the recorded 2-D points and the
projection of their 3-D location estimates back onto the camera image plane, using the estimated
camera calibration parameters. Although we used the software with the strict minimum number of
cameras (three), the obtained 2-D reprojection error was low: its upper bound was estimated as less
than 0.17 pixels.
The camera placement procedure is an iterative process with three steps, Place, Record
and Calibrate:
1. Place the three cameras (location, orientation, zoom) based on experience in prior iterations.
   In practice the various cameras should give views that are as different as possible.
2. Record synchronously with the 3 cameras a set of calibration points, i.e. 2-D coordinates in the
   image plane of each camera. As explained in [7], waving a laser pointer in darkness is sufficient.
3. Calibrate the 3 cameras by running MultiCamSelfCal on the calibration points. MultiCamSelfCal
   optimizes the 2-D reprojection error.
4. To try decreasing the 2-D reprojection error, loop to 1. Else go to 5. In practice, a 2-D
   reprojection error below 0.2 pixels is reasonable.
5. Select the camera placement that gave the smallest 2-D reprojection error.
Multi-camera self-calibration is generally known to provide less precision than manual calibration
using an object with known 3-D coordinates. The motivation for using it was ease of use: the
calibration points can be quickly recorded with a laser pointer. One iteration of the
Place/Record/Calibrate loop thus takes about 1h30. This process converged to the positioning of the
cameras depicted in Fig. 1.
For detailed information, including the multi-camera self-calibration problem statement, the reader
is invited to refer to the documentation in [7].
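To make the "2-D reprojection error" concrete, the sketch below computes it for a single camera.
It is only an illustration: it assumes an ideal pinhole camera without lens distortion, and the
parameter names (rotation, translation, focal, center) are hypothetical rather than the
MultiCamSelfCal interface.

```python
import math

def project(point_3d, rotation, translation, focal, center):
    """Project a 3-D point to pixel coordinates with an ideal pinhole model (assumption)."""
    # Camera coordinates: X_c = R * X + t
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point_3d)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return (focal[0] * xc / zc + center[0], focal[1] * yc / zc + center[1])

def reprojection_errors(points_3d, points_2d, rotation, translation, focal, center):
    """Pixel distance between each recorded 2-D point and its reprojected 3-D estimate."""
    errors = []
    for p3, p2 in zip(points_3d, points_2d):
        u, v = project(p3, rotation, translation, focal, center)
        errors.append(math.hypot(u - p2[0], v - p2[1]))
    return errors
```

In the Place/Record/Calibrate loop, a summary of these per-point distances (for instance their
maximum) would be the quantity compared across candidate placements.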
2.3 Step Two: Camera Calibration
This Section describes precise calibration of each camera, assuming the cameras' placement is fixed
(location, orientation, zoom). This is done by selecting and optimizing the calibration parameters
for each camera, on a calibration object. For each point of the calibration object, both true
3-D coordinates in the microphone arrays' referent and true 2-D coordinates in each camera's
image plane are known. 3-D coordinates were obtained on-site with a measuring tape (measurement
error estimated below 0.005 m). Crosses in Fig. 2 show the 3-D calibration points. These points were
split in two sets: "train" (36 points) and "test" (39 points).
Particular mention must be made of the model selection issue, i.e. how we chose to model
non-linear distortions produced by each camera's optics. An iterative process that evaluates adequacy
of the calibration parameters of all three cameras in terms of "3-D reconstruction error" was
adopted: the Euclidean distance between 3-D location estimates of points visible from at least 2
cameras, and their true 3-D location. The camera calibration procedure can be detailed as follows:
1. Model selection: for each camera, select the set of calibration parameters based on experience
   in prior iterations.
2. Model training: for each camera, estimate the selected calibration parameters on "train" using
   the software available in [8].
3. 3-D error: for each point in "train", compute the Euclidean distance between true 3-D
   coordinates and 3-D coordinates reconstructed using the trained calibration parameters and the
   2-D coordinates in each camera's image plane.
4. Evaluation: estimate the "training" maximum 3-D reconstruction error as μ + 3σ, where μ and σ
   respectively stand for mean and standard deviation of the 3-D error, across all points in "train".
5. To try decreasing the maximum 3-D reconstruction error, loop to 1. Else go to 6.
6. Select the set of calibration parameters, and their estimated values, that gave the smallest
   maximum 3-D reconstruction error.
The result of this process is a set of calibration parameters and their values for each camera. For all
cameras the best set of parameters was: focal center, focal lengths, r^2 radial and tangential
distortion coefficients.
Once the training was over, we evaluated the 3-D error on the unseen "test" set. The maximum
3-D reconstruction error on this set was 0.012 m. This maximum error was deemed acceptable, as
compared to the diameter of an open mouth (about 0.05 m).
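The "maximum 3-D reconstruction error" used in the evaluation step (mean plus three standard
deviations of the per-point Euclidean error) can be sketched as follows. The function name is
hypothetical, and the choice of the population standard deviation is an assumption, since the text
does not say which estimator was used.

```python
import math
import statistics

def max_reconstruction_error(true_points, reconstructed_points):
    """Return mu + 3 * sigma of the per-point Euclidean 3-D error (in the input units)."""
    errors = [math.dist(t, r) for t, r in zip(true_points, reconstructed_points)]
    mu = statistics.mean(errors)
    sigma = statistics.pstdev(errors)  # population std. dev. (assumption)
    return mu + 3.0 * sigma
```

For roughly Gaussian errors, μ + 3σ bounds almost all per-point errors, which makes it a more
stable summary than the largest single observed error.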
3 Online Corpus
This Section first motivates and describes the variety of sequences recorded, and then describes in
more detail the annotated sequences. "Sequence" means:
- 3 video DIVX AVI files (resolution 288x360), one for each camera, sampled at 25 Hz. Each file
  also includes one audio signal.
- 16 audio WAV files recorded from the two circular 8-microphone arrays, sampled at 16 kHz.
- When possible, more audio WAV files recorded from lapel microphones worn by the speakers,
  sampled at 16 kHz.
All files were recorded in a synchronous manner: video files carry a time-stamp embedded in the
upper rows of each image, and audio files always start at video time stamp 00:00:10.00. Complete
details about the hardware implementation of a unique clock across all sensors can be found in [6].
Although only 8 sequences have been annotated, many other sequences are also available. The whole
corpus, along with annotation files, camera calibration parameters and additional documentation, is
accessible at: http://mmm.idiap.ch/Lathoud/av16.3v6. It was recorded over a period of 5 days,
and includes 42 sequences overall, with sequence duration ranging from 14 seconds to 9 minutes (total
1h25). 12 different actors were recorded. Although the authors of the present paper were recorded,
many of the actors don't have any particular expertise in the fields of audio and video localization
and tracking.
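Because audio files start at video time stamp 00:00:10.00, aligning the two streams amounts to
subtracting a 10-second offset. The sketch below illustrates this; it assumes the time stamp reads
as hours:minutes:seconds with a decimal fraction of a second, which the text does not state
explicitly, and the function names are hypothetical.

```python
def timestamp_to_seconds(timestamp):
    """Convert 'hh:mm:ss.xx' to seconds (fractional part read as decimal seconds: assumption)."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def audio_sample_index(video_timestamp, audio_rate_hz=16000, audio_start="00:00:10.00"):
    """Index of the audio sample recorded at the given video time stamp."""
    offset = timestamp_to_seconds(video_timestamp) - timestamp_to_seconds(audio_start)
    if offset < 0:
        raise ValueError("time stamp precedes the start of the audio files")
    return round(offset * audio_rate_hz)
```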
3.1 Motivations
The main objective is to study several localization/tracking phenomena. A non-limiting list includes:
- Overlapped speech.
- Close and far locations, small and large angular separations.
- Object initialization.
- Variable number of objects.
- Partial and total occlusion.
- "Natural" changes of illumination.

Table 1: List of the annotated sequences. Tags mean: [A]udio, [V]ideo, predominant [ov]erlapped
speech, at least one visual [occ]lusion, [S]tatic speakers, [D]ynamic speakers, [U]nconstrained motion,
[M]outh, [F]ace, [H]ead, speech/silence [seg]mentation.

Sequence        Duration    Modalities             Nb. of     Speaker(s)   Desired
name            (seconds)   of interest            speakers   behavior     annotation
seq01-1p-0000   217         A                      1          S            M, seg
seq11-1p-0100   30          A, V, AV               1          D            M, F, seg
seq15-1p-0100   35          AV                     1          S, D(U)      M, F, seg
seq18-2p-0101   56          A(ov)                  2          S, D         M, seg
seq24-2p-0111   48          A(ov), V(occ)          2          D            M, F
seq37-3p-0001   511         A(ov)                  3          S            M, seg
seq40-3p-0111   50          A(ov), AV              3          S, D         M, F
seq45-3p-1111   43          A(ov), V(occ), AV      3          D(U)         H
Accordingly, we defined and recorded a set of sequences that contains a high variety of test cases:
from short, very constrained, specific cases (e.g. visual occlusion), for each modality (audio or video),
to natural spontaneous speech and/or motion in a much less constrained context.
Each sequence is useful for at least one of three fields of research: analysis of audio, video or
audio-visual data. Up to three people are allowed in each sequence. Human motion can be static (e.g.
seated persons), dynamic (e.g. walking persons) or a mix of both across persons (some seated, some
walking) and time (e.g. meeting preceded and followed by people standing and moving).
3.2 Contents
As mentioned above, the online corpus comprises 8 annotated sequences plus many more
unannotated sequences. These 8 sequences were selected for the initial annotation effort. This choice
is a compromise between having a small number of sequences for annotation, and covering a large
variety of situations to fulfill interests from various areas of research. It constitutes a minimal set of
sequences covering as much variety as possible across modalities and speaker behaviors. The process
of annotation is described in Sect. 4.
The name of each sequence is unique. Table 1 gives a synthetic overview. A more detailed
description of each sequence follows.
seq01-1p-0000: A single speaker, static while speaking, at each of 16 locations covering the shaded
   area in Fig. 1. The speaker is facing the microphone arrays. The purpose of this sequence is to
   evaluate audio source localization on a single speaker case.
seq11-1p-0100: One speaker, mostly moving while speaking. The only constraint on the speaker's
   motion is to face the microphone arrays. The motivation is to test audio, video or audio-visual
   (AV) speaker tracking on difficult motion cases. The speaker is talking most of the time.
seq15-1p-0100: One moving speaker, walking around while alternating speech and long silences.
   The purpose of this sequence is to 1) show that audio tracking alone cannot recover from
   unpredictable trajectories during silence, 2) provide an initial test case for AV tracking.
seq18-2p-0101: Two speakers, speaking and facing the microphone arrays all the time, slowly getting
   [...]
seq24-2p-0111: Two moving speakers, crossing the field of view twice and occluding each other twice.
   The two speakers are talking most of the time. The motivation is to test both audio and video
   occlusions.
seq37-3p-0001: Three speakers, static while speaking. Two speakers remain seated all the time and
   the third one is standing. Overall five locations are covered. Most of the time 2 or 3 speakers are
   speaking concurrently. (For this particular sequence only snapshot image files are available, no
   AVI files.) The purpose of this sequence is to evaluate multi-source localization and beamforming
   algorithms.
seq40-3p-0111: Three speakers, two seated and one standing, all speaking continuously, facing the
   arrays. The standing speaker walks back and forth once behind the seated speakers. The
   motivation is both to test multi-source localization, tracking and separation algorithms, and to
   highlight complementarity between audio and video modalities.
seq45-3p-1111: Three moving speakers, entering and leaving the scene, all speaking continuously,
   occluding each other many times. Speakers' motion is unconstrained. This is a very difficult
   case of overlapped speech and visual occlusions. Its goal is to highlight the complementarity
   between audio and video modalities.
3.3 Sequence Names
A systematic coding was defined, such that the name of each sequence (1) is unique, and (2) contains
a compact description of its content. For example "seq40-3p-0111" has three parts:
- "seq40" is the unique identifier of this sequence.
- "3p" means that overall 3 different persons were recorded, but not necessarily all visible
  simultaneously.
- "0111" are four binary flags giving a quick overview of the content of this recording. From left
  to right:
  bit 1: 0 means "very constrained", 1 means "mostly unconstrained" (general behavior:
     although most recordings follow some sort of scenario, some include very strong constraints
     such as the speaker facing the microphone arrays at all times).
  bit 2: 0 means "static motion" (e.g. mostly seated), 1 means "dynamic motion" (e.g.
     continuous motion).
  bit 3: 0 means "minor occlusion(s)", 1 means "at least one major occlusion", involving at least
     one array or camera: whenever somebody passes in front of or behind somebody else.
  bit 4: 0 means "little overlap", 1 means "significant overlap". This involves audio only: it
     indicates whether there is a significant proportion of overlap between speakers and/or noise
     sources.
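The naming scheme above is regular enough to be parsed mechanically. The following sketch decodes a
sequence name into its parts; the class and field names are hypothetical and are not taken from the
corpus documentation.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class SequenceInfo:
    identifier: int            # e.g. 40 for "seq40"
    persons: int               # e.g. 3 for "3p"
    unconstrained: bool        # bit 1: mostly unconstrained behavior
    dynamic: bool              # bit 2: dynamic motion
    major_occlusion: bool      # bit 3: at least one major occlusion
    significant_overlap: bool  # bit 4: significant audio overlap

_NAME_PATTERN = re.compile(r"seq(\d+)-(\d+)p-([01]{4})")

def parse_sequence_name(name):
    """Decode a name such as 'seq40-3p-0111' into a SequenceInfo."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    identifier, persons, flags = match.groups()
    bits = [flag == "1" for flag in flags]
    return SequenceInfo(int(identifier), int(persons), *bits)
```

Such a parser makes it easy to select, for example, all sequences with significant overlap without
consulting Table 1.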
4 Annotation
Two types of annotations can be created: in space (e.g. speaker trajectory) or time (e.g. speech/silence
segmentation). The definition of annotation intrinsically defines the performance metrics that will be
used to evaluate localization and tracking algorithms. How annotation should be defined is therefore
debatable. Moreover, we note that different modalities (audio, video) might require very different
annotations (e.g. 3-D mouth location vs 2-D head bounding box). Sections 4.1 and 4.2 report the [...]

Figure 3: Snapshots of the two windows of the Head Annotation Interface.
4.1 Initial Effort
The two sequences with static speakers only have already been fully annotated: "seq01-1p-0000" and
"seq37-3p-0001". The annotation includes, for each speaker, 3-D mouth location and speech/silence
segmentation. 3-D mouth location is defined relative to the microphone arrays' referent. The origin of
this referent is in the middle of the two microphone arrays. This annotation is also accessible online.
It has already been successfully used to evaluate recent work [5]. Moreover, a simple example of use
of this annotation is available within the online corpus, as described in Sect. 4.3.
As for sequences with moving speakers and occlusion cases, three Matlab graphical interfaces were
written and used to annotate the location of the head, of the mouth and of an optional marker (colored
ball) on the persons' heads:
BAI: the Ball Annotation Interface, to mark the location of a colored ball on the head of a person,
   as an ellipse. Occlusions can be marked, i.e. when the ball is not visible. The BAI includes a
   simple tracker to interpolate between manual measurements.
HAI: the Head Annotation Interface, to mark the location of the head of a person, as a rectangular
   bounding box. Partial or complete occlusions can be marked.
MAI: the Mouth Annotation Interface, to mark the location of the mouth of a person as a point.
   Occlusions can be marked, i.e. when the mouth is not visible.
All three interfaces share very similar features, including two windows: one for the interface itself,
and a second one for the image currently being annotated. An example snapshot of the HAI can
be seen in Fig. 3. All annotation files are simple matrices stored in ASCII format.
All three interfaces are available and documented online, within the corpus itself. We have already
used them to produce continuous 3-D mouth location annotation from sparse manual measurements, [...]
Table 2: Annotation available online as of August 31st, 2004. "C" means continuous annotation, i.e.
all frames of the 25 Hz video are annotated. "S" means sparse annotation, i.e. the annotation is done
at a rate less than 25 Hz (given in parentheses).
Columns: Sequence; ball (2-D, 3-D); mouth (2-D, 3-D); head (2-D); speech/silence segmentation.
Each row lists its non-empty cells in column order.
seq01-1p-0000   C  C  precise
seq11-1p-0100   C  C  C  C
seq15-1p-0100   S(2Hz)  S(2Hz)
seq18-2p-0101   C  C  C  C
seq24-2p-0111   C  C  C  C  S(2Hz)
seq37-3p-0001   C  C  undersegmented
seq40-3p-0111   S(2Hz)  S(2Hz)
seq45-3p-1111   S(2Hz)  S(2Hz)  S(2Hz)

4.2 Current State
The annotation effort is constantly progressing over time, and Table 2 details what is already available
online as of August 31st, 2004.
4.3 Example 1: Audio Source Localization Evaluation
The online corpus includes a complete example (Matlab files) of single source localization followed
by comparison with the annotation, for "seq01-1p-0000". It is based on a parametric method called
SRP-PHAT [9]. All necessary Matlab code to run the example is available online (see footnote 3).
The comparison shows that the azimuth estimates of the SRP-PHAT localization method lie within
-5 to +5 degrees of the annotation.
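A comparison of this kind reduces to converting the annotated 3-D mouth location into an azimuth
and taking a wrapped difference with the estimate. The sketch below is not the Matlab example from
the corpus: it assumes an array located at the origin of the referent, with azimuth measured
counter-clockwise from the x axis in the horizontal plane, and the function names are hypothetical.

```python
import math

def azimuth_deg(position):
    """Azimuth in degrees of a 3-D position seen from the origin (assumed convention)."""
    x, y = position[0], position[1]
    return math.degrees(math.atan2(y, x))

def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def azimuth_error_deg(estimated_azimuth_deg, true_position):
    """Signed azimuth error (estimate minus annotation), wrapped to [-180, 180)."""
    return wrap_deg(estimated_azimuth_deg - azimuth_deg(true_position))
```

Wrapping matters near the ±180 degree boundary, where a naive difference would report an error
close to 360 degrees for two nearly identical directions.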
4.4 Example 2: Multi-Object Video Tracking
As an example, the results of applying three independent, appearance-based particle filters on 200
frames of the "seq45-3p-1111" sequence, using only one of the cameras, are shown in Fig. 4, and in
a video. The sequence depicts three people moving around the room while speaking, and includes
multiple instances of object occlusion. Each tracker has been initialized by hand, and uses 500
particles. Object appearance is modeled by a color distribution [10] in RGB space.
In this particular example we have not done any performance evaluation yet. We plan to define
precision and recall based on the intersecting surface between the annotation bounding box and the
result bounding box.
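One plausible reading of such area-based metrics is sketched below: precision is the fraction of the
result box covered by the annotation, and recall is the fraction of the annotation box covered by the
result. This is an assumption for illustration, not the definition eventually adopted for the corpus;
boxes are taken as (x_min, y_min, x_max, y_max) tuples in pixels.

```python
def area(box):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max); empty boxes give 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)

def intersection(box_a, box_b):
    """Axis-aligned intersection of two boxes (may be empty)."""
    return (
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    )

def precision_recall(annotation_box, result_box):
    """Area-based precision and recall of a tracking result against the annotation."""
    overlap = area(intersection(annotation_box, result_box))
    result_area = area(result_box)
    annotation_area = area(annotation_box)
    precision = overlap / result_area if result_area > 0 else 0.0
    recall = overlap / annotation_area if annotation_area > 0 else 0.0
    return precision, recall
```

A result box that is much smaller than the head scores high precision but low recall, while an
oversized box does the opposite, so the two numbers are best reported together.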
4.5 Example 3: 3-D Mouth Annotation
From sparse 2-D mouth annotation on each camera we propose to (1) reconstruct 3-D mouth location
using camera calibration parameters estimated as explained in Sect. 2.3, (2) interpolate 3-D mouth
location using the ball location as origin of the 3-D referent. The 3-D ball location itself is provided
by the 2-D tracker in the BAI interface (see Sect. 4.1) and 3-D reconstruction. The motivation of this
choice was twofold: first of all, using simple (e.g. polynomial) interpolation on mouth measurements
was not enough in practice, since human motion contains many complex non-linearities (sharp turns
and accelerations). Second, visual tracking of the mouth is a hard task in itself. We found that
interpolating measurements in the moving referent of an automatically tracked ball marker is effective
even at low annotation rates (e.g. 2 Hz = 1 video frame out of 12), which is particularly important
since the goal is to save on time spent doing manual measurements. A complete example with all
necessary Matlab implementation can be found online. This implementation was used to create all
3-D files available within the corpus.

Footnote 3: http://mmm.idiap.ch/Lathoud/av16.3v6/EXAMPLES/AUDIO/README
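The idea of interpolating in the moving referent of the ball can be sketched as follows. This is not
the Matlab implementation from the corpus: it assumes the referent only translates with the ball (no
rotation) and that the mouth-minus-ball offset is interpolated linearly between annotated frames;
the function and argument names are hypothetical.

```python
def interpolate_mouth(ball_track, mouth_sparse):
    """Dense 3-D mouth track from a dense ball track and sparse mouth measurements.

    ball_track:   list of 3-D ball positions, one per video frame.
    mouth_sparse: dict mapping annotated frame indices to 3-D mouth positions.
    Returns a dict mapping every frame between the first and last annotated
    frames to an interpolated 3-D mouth position.
    """
    frames = sorted(mouth_sparse)
    if len(frames) < 2:
        raise ValueError("at least two annotated frames are needed")
    # Mouth position expressed relative to the ball at each annotated frame.
    offsets = {
        f: tuple(m - b for m, b in zip(mouth_sparse[f], ball_track[f])) for f in frames
    }
    result = {}
    for f0, f1 in zip(frames, frames[1:]):
        for f in range(f0, f1 + 1):
            w = (f - f0) / (f1 - f0)
            offset = tuple((1 - w) * a + w * b for a, b in zip(offsets[f0], offsets[f1]))
            result[f] = tuple(b + o for b, o in zip(ball_track[f], offset))
    return result
```

Because the slowly varying offset is interpolated instead of the mouth position itself, sharp turns
captured by the ball tracker carry over to the mouth track even between sparse annotations.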
4.6 Future Directions
Difficulties arise mostly in two cases: 1) predominance of overlapped speech, and 2) highly dynamic
situations, in terms of motions and occlusions. 1) can be addressed by undersegmenting the speech
and defining proper metrics for evaluation. By "undersegmenting" we mean that fewer segments are
defined, each segment comprising some silence and speech which is too weak to be localized. An
example is given in [5].
2) is more difficult to address. It is intrinsically linked to the minimum interval at which
annotation measurements are taken, and therefore the interval at which performance will be evaluated.
Considering the fact that location between two measurements can be interpolated, two approaches can
be envisaged:
1. On short sequences, with very specific test cases, the interval can be chosen very small, in order
   to obtain fine-grained, precise spatial annotation. Even with interpolation, this would require
   independent observer(s) to give many true location measurements.
2. On long sequences, the interval can be chosen larger. If the interpolated annotation is used
   for performance evaluation, slight imprecision can be tolerated, as compensated by the size of
   the data ("continuous" annotation). If the manual annotation measurements only are used for
   performance evaluation ("sparse" annotation), the evaluation will be more precise, and the
   relatively large number of such measurements may still lead to significant results. By "significant"
   we mean that the standard deviation of the error is small enough for the average error to be
   meaningful.
5 Conclusion
This paper presented the AV16.3 corpus for speaker localization and tracking. AV16.3 focuses mostly
on the context of meeting room data, acquired synchronously by 3 cameras, 16 far-distance
microphones, and lapel microphones. It targets various areas of research: audio, visual and
audio-visual speaker tracking. In order to provide audio annotation, camera calibration is used to
generate "true" 3-D speaker mouth location, using freely available software. To the best of our
knowledge, this is the first attempt to provide synchronized audio-visual data for extensive testing
on a variety of test cases, along with spatial annotation. AV16.3 is intended as a step towards
systematic evaluation of localization and tracking algorithms on real recordings. Future work
includes completion of the annotation process, and possibly data acquisition with different setups.
6 Acknowledgments
The authors acknowledge the support of the European Union through the AMI, M4, HOARSE and
IM2.SA.MUCATAR projects. The authors wish to thank all actors recorded in this corpus, Olivier
Masson for help with the physical setup, and Mathew Magimai.-Doss for valuable comments.
References
[1] E. Shriberg, A. Stolcke, and D. Baron. Observations on overlap: findings and implications for
    automatic processing of multi-party conversation. In Proceedings of Eurospeech 2001, volume 2,
    pages 1359-1362, 2001.
[2] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg,
    A. Stolcke, and C. Wooters. The ICSI meeting corpus. In Proceedings of the 2003 IEEE
    International Conference on Acoustics, Speech and Signal Processing (ICASSP-03), 2003.
[3] E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy. Moving talker, speaker-independent feature
    study and baseline results using the CUAVE multimodal speech corpus. Eurasip Journal on
    Applied Signal Processing, 11:1189-1201, 2002.
[4] V. R. Algazi, R. O. Duda, and D. M. Thompson. The CIPIC HRTF database. In Proceedings of the
    2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
    (WASPAA-01), 2001.
[5] G. Lathoud and I. A. McCowan. A Sector-Based Approach for Localization of Multiple
    Speakers with Microphone Arrays. In Proceedings of the 2004 ISCA Tutorial and Research
    Workshop on Statistical and Perceptual Audio Processing (SAPA-04), October 2004.
[6] D. Moore. The IDIAP Smart Meeting Room. IDIAP-COM 07, IDIAP, 2002.
[7] T. Svoboda. Multi-Camera Self-Calibration. http://cmp.felk.cvut.cz/~svoboda/SelfCal/index.html,
    August 2003.
[8] J. Y. Bouguet. Camera Calibration Toolbox for Matlab.
    http://www.vision.caltech.edu/bouguetj/calibdoc/, January 2004.
[9] J. DiBiase, H. Silverman, and M. Brandstein. Robust localization in reverberant rooms. In
    M. Brandstein and D. Ward, editors, Microphone Arrays, chapter 8, pages 157-180. Springer,
    2001.
[10] P. Perez, C. Hue, J. Vermaak, and M. Gangnet. Color-based Probabilistic Tracking. In [...]