Real-time neuro-inspired sound source localization and tracking architecture applied to a robotic platform

(1)

Real-time

neuro-inspired sound source localization and tracking

architecture

applied to a robotic platform

Elena Cerezuela Escudero

a,

_{, Fernando Pérez Peña}

b

_,

_{Rafael Paz Vicente}a

_,

Angel Jimenez-Fernandez

a

_,

_{Gabriel Jimenez Moreno}a

_,

_{Arturo Morgado-Estevez}b a Robotics and Computer Technology Lab (RTC), Universidad de Sevilla, ETSI Informática, Avd. Reina Mercedes s/n, Seville 41012, Spain b Applied Robotics Research Lab, Universidad de Cádiz, Faculty of Engineering, Avda. Universidad, 10, Puerto Real, Cadiz 11519, Spain

Keywords: Sound localization

Interaural intensity difference Spike signal processing Neuromorphic auditory sensor Neurorobotics

FPGA

a

b

s

t

r

a

c

t

Thispaperproposesareal-timesoundsourcelocalizationandtrackingarchitecturebasedontheability ofthemammalianauditorysystemusingtheinterauralintensitydifference.Weusedaninnovative bin-auralNeuromorphicAuditorySensortoobtainspikeratessimilar tothosegeneratedbytheinnerhair cellsofthe humanauditorysystem. Thedesignofthecomponentthatobtains theinterauralintensity differenceisinspiredbythe lateralsuperiorolive.Thespikestream thatrepresentstheIID isusedto turnaroboticplatformtowardsthesoundsourcedirection.ThearchitecturewasimplementedonFPGA devicesusinggeneralpurposeFPGAresourcesandwastestedwithpuretones(1-kHz,2.5-kHzand5-kHz sounds)withanaverageerrorof2.32°.Ourarchitecturedemonstratesapotentialpracticalapplicationof soundlocalizationforrobots,andcanbeusedtotestparadigmsforsoundlocalizationinthemammalian brain.

1. Introduction

Sound localization is a function that the ears, auditory

path-ways andauditorycortex ofthebrain performtogetherto

deter-minethesourceofasound.Itisapowerfulfeatureofmammalian

perceptionthat allowstheanimaltobeawareoftheenvironment

and to locate prey and predators. This has inspired researchers

to develop new computational models of the auditory pathways

andbiologicalmechanismsthatunderliesoundlocalizationinthe

brain.

The ability to model the ways in which mammals locate a

sound source can improve the perception and navigation of

mo-bile robots, allow the development of better virtual realities,

im-proveteleconferencing, providesurveillance systems with

omnidi-rectionalsensitivity,andenhancehearingaids.

During the last decades, the structure and function of

path-ways in the auditorybrainstem for soundlocalization have been

extensivelystudied[1–4]. Thedirectionofasoundinthe

horizon-talplane isdeterminedby acombinationofbinauralcuesderived

fromtheincidentacousticwavesarrivingattheearfromdifferent

angles:interauraltimedifference (ITD)andinterauralintensity,or

level,difference(IIDorILD,respectively).Soundsthatdonot

gen-erate directlyin front of or behindthe receptor arrive earlier at

oneearthanattheother,creatinganITD.Forwavelengthsroughly

equalto, orshorterthan, the diameterofthe head,a shadowing

effectisproducedattheearthatisfurtherawayfromthesource,

creatinganIID[1,2].

Forexample,in generalterms, ifa pure tone soundsource is

positionedontheleftside,thesoundsignalattheleftearis

rep-resentedbytheequation:

Le ft_signal= a× sin

(

2

π

ft

)

(1)

where a is thesoundamplitude, f the soundfrequencyand t the

time.Thesoundattherightearisrepresentedbytheequation:

Right_signal=

(

a/

a

)

× sin

(

2

π

f

(

t−

t

)

(2)

where

a and

t arerelatedto,respectively,theintensity

differ-ence(IID)causedbytheshadowingeffectofthehead,andthe

ad-ditionaltime(ITD)requiredforthe soundwaveto travelthe

fur-therdistancetotherightear.

Due to the head size, the ITD cue in humans is effective for

locating low frequency sounds (20Hz - 1kHz). However, the

in-formation it provides becomes ambiguous for frequencies above

1kHz.Incontrast,theIIDcueisnotusefulforlocatingsounds

be-low 1kHz, butit is more eﬃcientthan the ITD cue for mid-and

(2)

NAS

90º

-90º

0º

HEAD

(Motor DC)

Right micro

Left micro

Processing System

Motor

Driver

LSO

Left spiking

stream

Right spiking

stream

Actuation

Calibration

Fig. 1. Tracking Sound system architecture.

TheITDandIIDcuesareextractedinthemedialandlateral

su-periorolive,respectively(MSOandLSO)whicharelocatedwithin

anareaoftheauditorysystemcalledthesuperiorolivarycomplex

[2]. The LSOhas a tonotopical organization: highfrequencies are

representedin the middleof theLSO and continuallydecreasing

frequenciestobothsides[5].LSOneuronsareinhibitedbysounds

tothecontralateralearandexcitedbysoundstotheipsilateralear,

resultinginaneuralformofsubtraction[2].

Thereisanincreasingdemandforthedevelopmentofreal-time

and low-power sound localization techniques in the industry of

hearingaid[6–8] and roboticapplications[9–11]. Currently,

vari-ousdigitalprocessingtechniquesbasedon FastFourierTransform

(FFT)havebeenproposedtodeterminethesourceofasound

sig-nal[12,13].However,thesetechniquesrequirehighpower

consum-ingdevices,suchasDigitalSignalProcessors(DSP),andmemoryto

performsuchcomplexsignal processing.TraditionalDigital Signal

Processing(DSP)techniquescommonlyapplyMultiply-Accumulate

(MAC)operationsoveracollectionofdiscretesamplescodiﬁedas

ﬁxed or ﬂoating point representations. MAC operations often

re-quirededicatedandcomplexresources,i.e.ﬂoat-pointmultipliers,

whichare availablein FPGAsasdedicated expensiveresources in

relativelysmallquantities.Therefore,applyingasequence ofMAC

operationsover a dataset with these units requires multiplexing

them in time. So they are reused with different input data and

outputresults, whichare storedin a globalmemory.It often

re-quireshighfrequencyclock signalsto achieve a competitive data

throughput.Furthermore,large memory depths to store

interme-diatedataandresultsareneeded.Thesefactsarereﬂected inthe

powerconsumption andcircuitrycomplexity. Onthe other hand,

SpikeSignalProcessing(SSP)implementsthebasicoperationsthat

commonlyareperformedinDSP,butoverspikeratecodedsignals

[14]. Thus, operations are performeddirectly over spike streams,

beingequivalenttosimplyaddingorremovingspikesattheright

moment(althoughitisnotevidentwhich).Thecircuitsthat

imple-mentSSPoperationsusegeneralpurposeFPGAresources,as

coun-ters,comparatorsandlogicgates.Thisallowsthebuildingoflarge

scale dedicated systems in hardware, which process spike coded

signalsinrealtime using low frequencyclocksina fullyparallel

wayfor(lowcost)FPGAs,forexample,theauditorysensorusedin

thiswork demands 29.7mWfor 64 channelsin stereooperation

[15].

The ability to replicate the ways in which mammalslocate a

soundsourcecould allow the developmentofbetter virtual

real-ities. In addition, the performance of robotics with lower power

consumptionwillbeincreased.Furthermore,hearingaidscouldbe

enhancedbyimprovingthelocalizationofindividualsounds.These

improvements,whichareenabledbytheabilitytounderstandand

mimicmammaliansoundlocalization,arethemainreasonsforthe

researchcarriedoutinthispaper.

The aimofthisresearch involvesthe developmentof a

spike-based systemthat processes andextracts thebinaural cueof IID

with a topology inspired by the mammalian auditory pathways,

speciﬁcally the LSO. Using the IID cue, the systemperforms the

taskoftrackingasoundinrealtime,inabiologicallyinspiredway.

In thispaper, toobtain spike ratessimilar to those generated

by the inner hair cells of the human auditory system, we used

an innovative binauralNeuromorphicAuditory Sensor(NAS). This

decomposesanaudiosignal intodifferentfrequencybandswhere

theaudioinformationisencodedinthespikerates[15].Usingthe

out coming spikerates fromtheNAS asthestimulus tothe LSO

modelwe propose,the wholearchitecture dealswithbiologically

inspired data. The NAS, the LSO modelandthe actuationsystem

have been implemented on FPGA devices using general purpose

FPGAresources.ThesemodelsaredevelopedusingSSPtechniques

[14].

There are previous works that propose audiolocalization

sys-tems inspiredby themammalian auditorysystem: thepapers by

[8]and[11] reported that a neuromorphic silicon cochlea can be

usedforspatialauditionandauditorysceneanalysis;bothpapers

were based onITD. In[8], thesound localizationcircuit was

de-visedbymimickingthe neuronalorganizationofbarn owl’s

audi-torypathwaytoobtainITD.

Theworksof[16]and[17]proposedaSpikingNeuralNetwork

(SNN)topartiallysimulatethesuperior olivarycomplex,butthey

didnotuseaneuromorphicdevicetoobtainthespikestreamsthat

representsound. Inthesystemproposed in[16], theinputsound

passesthroughaGammatoneﬁlterbankandisthenencodedinto

phase-locked spikes using a model of the half-wave rectiﬁed

re-ceptor potential of inner hair cells. ITD processing uses a series

ofdelaysandaleaky integrate-and-ﬁreneuron model; theITDis

calculated for all frequency channels to form a full map of ITD

processing. IID processingdoesnot use aneuron model; instead,

a logarithmic ratiocomputes the intensity difference. The model

classiﬁes the sound source between 7 discrete azimuthal angles

(from −90° to 90° in steps of 30°). The model was testedusing

a roboticheadonbroadband sounds,both noiseandspeech,and

it achieved overall localization accuracies of 80%. The paper by

[17] presentedaSNNarchitecture tosimulatethesound

localiza-tionabilityofthemammalianauditorypathwaysusingtheIID.To

train and validate the localizationability of the architecture,

ex-perimentallyderivedhead-relatedtransferfunctionacousticaldata

fromadultdomestic catswere employed;the supervisedlearning

algorithmknownas“remotesupervisionmethod” wasusedforthe

trainingtodeterminetheazimuthalangles.TheSNNclassiﬁedthe

sound source between 13 discrete azimuthal angles (from −60°

to 60° insteps of10°). The experimental resultsusing the same

soundfrequencyusedforthetrainingwere52%for5-kHzsounds,

(3)

Analog Audio AC’97 Audio Codec

AC’97

FSM

L-Cascade Filter Bank Synthetic Spikes Generator

...

Band Pass FIlter 0

Band Pass Filter 63 Left Data AC’97 Serial Link

FPGA

AER Monitor

AER

REQ

ACK

...

Spikes_L_0 Spikes_L_63 Digital Audio Spikes

20

Right Left

Right

Data R-Cascade Filter

Bank Synthetic

Spikes Generator

...

Band Pass FIlter 0

Band Pass Filter 63

...

Spikes_R_0

Spikes_R_63 20

Fig. 2. NAS Architecture.

proposed an FPGA implementationof a soundlocalization, based

onpulsedneuralnetwork.Thepulsedneuralnetworkextractsthe

ITDtoclassifybetween7angleclasseswitharesolutionof30°.In

reference[19],the authors obtainthe ILD fromthe energylevels

ofmultiplemicrophonepairsandproposeanalgorithmtoextract

thebearingangle.

The present work is structured asfollows. Section 2 presents

theglobalarchitectureofthesoundlocalizationandtracking

sys-tem.Inthissection,theLSOmodelthatobtainstheIIDcueandthe

processingsystemthatimplementsthatmodeltotrackasoundare

describedindetail.InSection3,theexperimentalresultsare

pre-sented to show the feasibility andperformance ofthe sound

lo-calizationandtrackingsystem.Finally,aconclusionsectionis

pre-sentedinsection4.

2. Architectureofreal-timesoundlocalizationandtracking system

Inthisworkwepresentaneuromorphicreal-timesound

track-ingsystemwhichconsistsofaneuromorphicauditorysystem.The

aim is to model thefunctionality of theLSO to locate and track

high-frequency sounds ina biologically inspiredway. Thesystem

proposed obtains the IID auditory cue in the same waythe LSO

does.Then,theIIDauditorycueisusedastheinputforthe

spike-based processing systemthat tracks the sound. Fig. 1 showsthe

experimentalsetuptolocateandtrackthesoundsourceusingthe

NAS,andaProcessingSystembasedonLSOandActuation

compo-nents.Theexperimentalsetupconsistsofaroboticplatformbased

onacorkporexpanheadwithonemicrophoneoneachsideofthe

head.ThesoundsignalsfromthemicrophonesaresenttotheNAS

wherethey aredecomposedintodifferentfrequencybands.Then,

the NAS produces spikes that representthe spectral information

of the original audio signal; the ﬁnal output ofthe NAS are the

binauralspikingstreamﬂows.Thesebinauralcuesaresenttothe

LSOsystem,whichextractstheIID.Finally,theIIDspikingstreams

are used bythe calibrationsystemand actuationsystemto track

thesoundsource.

Thissection brieﬂy describestheNAS, the LSOmodelandthe

processingsystem.

2.1. Neuromorphic auditory system

To generate the spike streams inspired by the human

audi-torysystem, a neuromorphicdeviceproposed in aprevious work

[15] was used, which decomposes an audio signal into different

frequency bands of spiking information, in the same way a

bio-logical cochlea sends the audioinformation to the brain. The

bi-ological cochlea performs the transduction betweenthe pressure

SH&F

Input A

(Excitation)

Input B

(Inhibition)

OutPut

Spikes A

OutPut

Spikes B

Fig. 3. Spike Hold & Fire block.

signal representingtheacousticinput andthe neuralsignalsthat

carryinformationto thebrain. Duetothephysical characteristics

ofapartofthecochlea,thebasilarmembrane,thecochleadivides

an inputsignal into itsfrequency components.Thousandsof hair

cellsonthe membranegenerate actionpotentials,or spikes,that

travelalongnerveﬁbrestohigher-orderauditorybrainareas.

The architecture of the NAS is shown in Fig. 2. The system’s

inputsare thedigitalized audiostreams (the left and rightaudio

signals), which represent the audio signalsof a binauralsystem.

A Synthetic Spike Generator [14,20] converts these digital audio

sourcesintotwospikestreams.Then,thecascadebandpassﬁlter

banksplitsthespikestreamsinto64frequencybandsusing64

dif-ferentspikingoutputs. Theseoutputs are combinedbya monitor

block whichencodes each spikeaccordingto Address-Event

Rep-resentation(AER)andtransmits thisinformationtothe

classiﬁca-tion layers [21].All the elementsrequired fordesigning the NAS

components (Synthetic Spike Generators, cascade ﬁlter bank and

theAER monitor)havebeen implementedinVHDL anddesigned

asspike-basedbuildingblocks[14].TheL-CascadeFilterbankand

R-Cascade Filter bank have identical architecture. Table 1 shows

theNASfeatures. The NAShas beenused beforein [22] to

mea-surethe speedofDCmotorandin[23–25]foraudiosound

clas-siﬁcation.In[26]asoftwaretool(NAVIS)todeveloptheﬁrst

post-processinglayerusingtheNASinformationisproposed.

2.2. Lateral superior olive model

WepresentaLSO modelwhichdecodestheIIDfromthe

bin-auralspikingstreamsgeneratedbytheNAS.

Forsoundwavesthathavesimilarorsmallerwavelengthsthan

thediameterofthe humanhead,a shadowingeffectisproduced

attheear furtherfromthesource,creatingan IID[1]. Processing

inthe LSOinvolves takingasinput the two soundsignalsin the

formofa neural stimulus fromeach ear.The ipsilateralstimulus

(4)

Table 1

NAS characteristics.

Number of bands Frequency range Dynamic range Max. Event rate Clock frequency Hardware resources

64 × 2 9.6 Hz – 14.06kHz 75dB 2.19Mevents/s 27 MHz 11,141 slices

Fig. 4. ILD (encoded as an event-rate) measured over time obtained from the input data that generates the left and right NAS when the sound source is placed at 45 ° from the head.

passing through the medial nucleustrapezoid body (MNTB).The

convergenceof excitatory inputs from the ipsilateralear and

in-hibitoryinputsfromtheoppositeearresemblesarelativelysimple

subtractionprocess that createsthe well-describedIID sensitivity

ofLSO neurons.These neuronsproduce an output related tothe

IID[2].

The architecture of thesystem is shownin Fig. 3. It is based

ontheSpikeHold & Fire(SH&F) blockthat transforms twoinput

spikerates into two output spike rates. The outputs are

propor-tionaltothesubtractionbetweenthetwoinput rates.Subtracting

aspike-basedinputsignalfromanotherwillyieldanewspike

sig-nalwhosespikeratewillbeproportionaltothedifferencebetween

bothinput spikerates. Ifthat subtractionispositive, thespike is

ﬁredbyoutputportA,andifthesubtractionisnegative,thespike

isﬁredby outputportB[27,28].Thisblockwassuccessfullyused

inpreviousworksforspikeﬁlteringdesign[14,15,29]andfor

im-plementingaspike-basedclosed-looproboticcontroller[28].

ThefunctionoftheSH&Fistoholdincomingspikesforaﬁxed

periodoftimewhilemonitoringtheinputevolutiontodetermine

outputspikes.Fig.3showstheSH&Fblock,whichhastwoinputs:

A(Excitation input)andB(Inhibitioninput).Ifaspikeisreceived

atinputportA,an“A” stateisheldinternally.Inthecasethat no

spikesarereceived,nothingisdone. Whenanewspikearrives,it

behavesinonewayoranother,accordingtothespikeinputport.If

SH&Freceivesan excitatoryspike(portA),theheldspikeisﬁred,

and a new “A” state is held internally. If a spike is received in

portB,theheldspikeiscancelledandnooutputspikeisﬁredat

anyoutput port.Similar SH&Fbehaviorcan beextended toinput

spikesinportBusingthesamelogic:hold,cancel,andﬁrespikes

accordingtoinput spikesports.WiththisimplementationofLSO,

theMNTBfunction, inorderto obtaininhibitory input,is carried

outbyinput BoftheSH&Fblock.The outputSpikesAﬁringrate

isproportional totheILD corresponding tothe soundoriginating

totheleftofthe headandtheoutput SpikesBﬁringrateis

pro-portionaltotheILDcorresponding tothesoundoriginatingtothe

right.Fig.4showsanexampleofhowtheSH&Fworks:thetwo

in-putsandoutputalongtimeareshowntobetterexplainthemodel

functioning,inordertoexhibithowtheinhibition/excitation

prin-cipleinﬂuencestheoutput.

ThismodelconsiderstheSH&Fblocktobeakintothesubsetof

biologicalneuronsoftheLSOthatdealwiththenarrowfrequency

bandsofsoundtotrack.

2.3. Processing system

The architecture of the processing systemis shown in Fig. 5.

ThissysteminterfacestheNASsystem,processesthedatafromthe

NASgeneratingthe IIDcue andproducesthe commandstodrive

themotor.

The NASwillsendthe eventsfromthe 64bandsforboth the

rightandleftmicrophone-ear.Tocharacterizeoursystem,the

pro-cessingsystemonlysamplesthebands(leftandright)responding

tothefundamentalfrequencyofthesoundtotrack.

Thearchitectureconsistsofacalibrationstagewherethe

base-line is established (i.e. ILD=0dB),three SH&F blocks where the

ILDandtheerrorarecomputed,twospikegenerators[20]to

gen-eratethereference,aserialcommunicationcomponentandaspike

lengtheningmechanism.

Thebehaviorofthislayerisdividedintotwosteps:initial

cali-brationandnormalbehavior.

Initial calibration

1. Thecalibrationprocessistheﬁrststepthattakesplace.Thereis

aninitialﬁvesecondcountdowntoletuspreparethescenario,

i.e.to place the speaker in front of the head (0°) and play a

soundwith thesame fundamental frequencyas thesound to

track.Therestofthecomponentsoftheactuationlayerwillbe

disabledduringthisentirecalibrationphase.

2. After the ﬁve second countdown, two components are

launched:afoursecondcountdownandthebaselinerate

com-ponent (Fig. 5). This last component receives the addresses

fromtheAER bus interfacing theNASsystem. It includes two

registerswheretheeventsreceived fromeach eararecounted

(5)

AER Bus (From NAS) REQ ACK PFM Motor + PFM Motor -16

Processing system (Virtex5 – FPGA)

Encoder channel A Encoder channel B Calibration 5 seconds counter Enable 4 seconds counter

Enable Act. Output Baseline rate Right_ref Left_ref AER REQ ACK NAS Interface Calibration_done Enable AER REQ ACK Ref. – Ears H&F Enable Ears H&F Enable Left spike Right spike Left spike Right spike Pos. ears spikes Neg. ears spikes Ref. H&F Enable Right Ref. spikes Left Ref. spikes Pos. Ref. spikes Neg. Ref.

spikes Pos. Ref._spikes Neg. Ref. spikes Right Ref. Gen. Enable Right Ref. CLK/3 Right Ref. spikes Pos. ears spikes Neg. ears spikes Pos. Motor spikes Neg. Motor spikes PC Serial Tx / Rx Enable Encoder Interface Enable +0.7° Enc. Channel A Enc. Channel B -0.7° Tx/Rx Spikes Width Enable PFM Motor + PFM Motor -Pos. Motor spikes Neg. Motor spikes LSO Model Actuation Left Ref. Gen. Enable Left Ref. CLK/3 Left Ref. spikes

Fig. 5. Block diagram of the processing architecture. For the sake of simplicity, we have avoided drawing the asynchronous reset of all components of the architecture and the way signals are combined. For instance, the ACK out signal can be driven by the NAS interface component and the calibration component, so a multiplexer driven by the “calibration_done” signal, is used to select which component is sending the ACK out of the FPGA. Also, this “calibration_done” signal will enable the rest of the components that are not involved in the calibration stage.

second countdown, we will have the total number of events

received from the right and left ear. Then, both registers are

shiftedtwo positions totheright(divisionby four) sothe

re-sultintheregistersaretheeventspersecondreceivedbyeach

earconsideringthereferencecondition:theheadfacingthe

di-rectionofthesoundsource.Therefore,theserateswillplaythe

roleofreferenceforourcontrolsystem.

3. The way to generate the rate computed only in the previous

stageistousethespikesgeneratorpresentedin[20].Thereare

twoofthesemodules usedinourarchitecture:rightreference

generatorandleftreferencegenerator;eachwillreceivethe

in-putdatafromthebaselineratecomponent.

Therategeneratedbythisblockisdeﬁnedaccordingtothe

fol-lowingequation:

Firingrate=f_clk/2(N−1)_{× input}₌_gain_{× input}

Sincewehavecalculatedthedesiredoutputrate(ﬁringratein

theaboveequation)weneedtoknowtheinputwehaveto

supplyto thismoduleto generatesuch a rate.Then, ifthe

gain of the generator is a power of two values,it is easy

toright-shift thecomputedrateagain,inordertohavethe

correctinputvalueforthegenerators.

Ifweconsideraclockfrequencyof100MHzandthenthisdata

isintroducedintotheequation,itisnotpossibletoreacha

poweroftwo gainby thegenerator.Toachieve it,we have

dividedtheclockfrequencybythreeandused24bitsto

im-plementthegenerator,resultinginagainof1.98∼ 2.

Implementingthegeneratorinthiswayallowsustoright-shift

thereferenceratecalculatedinthebaselineratecomponent

andsupplytheresultastheinputforthespikegenerators.

4. Theoutput spikingratesofbothgenerators willbe subtracted

usingthereferenceHold&FirecomponentshowninFig.5.The

resultingrateofspikesisthereferenceforourdesignwhenthe

normaltrackingbehavioristakingplace.

Normal behavior

During thenormal phase,the NASinterface block is receiving

theaddressesfromtheAERbusanddetectingthechannelsof

in-terest,whichdependonthefrequencyofthesoundtotrack/locate.

Once an event on anyof thesechannels is received, it is

propa-gated to theEarsSH&Fcomponent inthearchitecture wherethe

rates are subtracted (the channels tuned for the samefrequency

ofeach ear arecancelled) tocompute the ILD(Fig.6). ThisSH&F

blockimplementstheLSOmodeltoobtainaspikeratethat

repre-sentstheIIDcueofanarrowfrequencybandofsound.Toremove

theerrorcausedbygaindifferencebetweenthemicrophones,the

resultingrateiscomparedwiththereferencecomputedpreviously

duringthe calibration process (in Ref.-Ears SH&F). The ﬁnal rate

(errorifwe compare witha conventionalcontrol system) issent

to the module responsible for lengthening the time duration of

the events.The reasonfor doing this, is to avoidﬁltering events

bythemotor(witha100MHzclockfrequency,thetimelengthof

an eventwillbe 10nsandthislength willbe ﬁlteredout by the

motor).

Theeventtimelength canbeconﬁgured fromthePCandthis

lengthwillbethesameforalltheeventssenttothemotor.Using

Pulse Frequency Modulation (PFM) removes the delay associated

withPWMandmakesthedirectuseoftheeventsavailable[30].

Finally,the encoder channelsare read andthisinformation is

serially transmitted to the PC whenever the motor turns 0.7° in

anydirection.

Oneof thedifferences betweenourarchitecture andthe ones

proposed in [16] and [17] is: they did not use a neuromorphic

device to obtain the spike streams that representsound. Instead

ofthis,in[17],theyusedanexperimentallyderived, head-related,

transferfunctionacousticaldataforeachearand,in[16],a

second-order Gammatoneﬁlterbank wasused. This means that the

sys-tem in [17] has a high computational cost to be implemented

[31,32]andthesystemproposedin[16]needs highpower-hungry

devicestobeimplemented[33].TheSSNarchitecturepresentedin

[17]isdifferent,depending onif thesound arrivesfromtheleftor

therightsideoftheheadanditneedsaMNTBnodetoobtainthe

inhibitoryinput. The architecture presented in thispaper is able

toreceive thesoundfromleft or rightanddoesnot needan

in-hibitorynode becausetheSH&Fisable tosubtractsigned spikes.

TheSNNof[17] needsa receptiveﬁeld layerandanoutput layer

to determine the angle(each angle corresponds to a SNN class).

However,inthisworkwedonotneedtheselayersbecausetheIID

cueobtainedbytheprocessinglayerisusedtocontrolthemotor.

Allneuronsinthearchitecturepresentedin[17]usedaLIFmodel

andthey havebeentrainedby thesupervised learningalgorithm

knownas“remotesupervisionmethod” .Ourarchitecturedoesnot

needtrainingstagebutitdoesneedacalibrationstage thattakes

4s.

3. Experimentalresults

The experimental setup is shown in Fig. 1. The distance

be-tween the speaker and the head is 40cm atdifferent azimuthal

angles(0° to90° instepsof15°).Themicrophonesareoneachside

(6)

Fig. 6. ILD (encoded as an event-rate) measured over time. The sound source is placed at 0, 15, 30, 45, 60, 75 and 90 ° from the motionless head during a second for each angle.

Table 2

Data obtained for the ﬁrst experiment: 1 kHz stimulus.

Target angle (degrees) Mean angle reached (degrees) Standard deviation (degrees) Max. angle (degrees) Min. angle (degrees)

0 1.97 0.61 2.81 0.70 15 14.16 2.60 16.87 6.33 30 30.67 0.56 31.64 29.53 45 42.83 0.60 44.30 42.19 60 56.25 0.65 58.36 55.55 75 75.84 0.50 76.64 75.23 90 93.21 1.06 94.92 91.41

ontopofa platformdrivenby aDC motorwithan encoder

(Mi-cromotor Ref. 2224R006SR plus gearhead Ref. 20/1112:1 and

en-coderRef.IE2-512fromFaulhaber).TheNASisimplementedusing

a Virtex5 FPGA (XC5VFX70T) and it uses up to 99% of the total

slicesavailable(11,141).TheFPGAisinaXilinxdevelopmentboard

(ML507),which,amongothercomponents,includestheAC’97

au-diocodec. The NASoutput isconnectedto theprocessingsystem

usingtheAERprotocol.Theprocessingsystemisalsoimplemented

usingaVirtex5FPGA(XC5VFX30T),whichusesupto4%ofthe

to-talslices availableon thedevice (5120). Usingthe XilinxXPower

tool,theestimatedpowerconsumption fortheprocessingsystem

is28.63mW and29.7mWfor the NAS [15]. The test scenario is

placed at the center of a classroom with a RT60 of 3.06 s. The

speciﬁcationsofthemicrophoneare:transducerprinciplebasedon

backelectretcondenser element,the frequencyresponserangeis

between20and16,000Hz,thesensitive is ₋₆₄dB_±3dB andthe

impedanceis1000Ohm.

Fig.6showstheIIDmeasuredattheoutputoftheSH&Fwhich

modelstheLSO (EarsH&F showninFig.5). Theexperiment

con-sistsof measuring the SH&F output ratewhen the soundsource

isplacedat7azimuthalangles,[0–90]degreeswithinprogressive

stagesof15°.TheroboticplatformisnotmovingtoobtaintheIID.

Thesoundplayedisapuretoneof1KHzanditlastedonesecond

ateachangle.Then,toobtaintheeventrate,thespikeoutputwas

sampled ata frequencyof 10Hz (100ms period). It can be seen

how the event rateis higher asthe angle increases. This rateis

equivalenttotheperceivedIIDwhenlisteningtoasoundof1kHz

[34].

The systemwastestedusing threedifferentpure tones:1kHz

(channel19oftheNAS),2.5kHz(channel13oftheNAS)and5kHz

(channel7oftheNAS).Forthethreetests, thefollowingidentical

sequence wasfollowed: ﬁrst, we placed thehead in front ofthe

speaker where the calibration phase takes place (4 s); then, we

movedthespeakertoeach targetangle(0° to90° instepsof15°)

andrecordedtheanglereachedbytheroboticplatform(10seach

target). The real angle reached by the head is measured by the

magnetic encoder included within the motor. The encoder

chan-nels are sent to the FPGAwhere they are processed and, ﬁnally,

the FPGA seriallytransmits to the PC wheneverthe motor turns

0.7° in anydirection. This let usmeasure the real anglereached

bythehead.Wethencomparedthismeasurementwiththetarget

angle(where thespeakerwaspositioned).Werepeatedeach test

10times.Thetestscanbeextendedtoawiderrangewithoutany

recalibration.1

Fig.7showsthebehaviorofthesystemwhenthestimulusisa

puretoneof1kHzandthemeasurementsaretakenevery15°.The

(7)

Table 3

Data obtained for the ﬁrst experiment: 2.5 kHz stimulus.

0 2.77 0.56 4.21 1.40 15 14.54 1.23 16.87 11.95 30 30.22 0.48 30.93 28.82 45 45.72 1.31 49.92 44.29 60 59.79 0.80 61.87 58.35 75 84.46 0.75 86.48 82.96 90 94.96 0.42 95.62 94.21

Fig. 7. The system is tested with a pure tone of 1 KHz as the stimulus. The target angles are 0, 15, 30, 45, 60, 75 and 90 °, represented in the graph with a blue cross. In red, the mean, maximum and minimum angles reached by the platform. (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)

Fig. 8. The system is tested with a pure tone of 2.5 KHz as the stimulus. The target angles are 0, 15, 30, 45, 60, 75 and 90 °, represented in the graph with a blue cross. In red, the mean, maximum and minimum angles reached by the platform. (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)

ﬁgureshowsthemeananglereached,themaximumandminimum

valuesreached(absoluteerrorfromthemeanachieved)inredand,

withabluecross,theidealbehaviorisrepresented.Table2shows

thedataforthisﬁrstexperiment.

Fig. 8 shows the behavior of the system when the stimulus

is a pure tone of 2.5kHz and the measurements are taken

ev-ery 15°. The ﬁgure shows the mean angle reached, the

maxi-Fig. 9. The system is tested with a pure tone of 5 kHz as the stimulus. The target angles are 0, 15, 30, 45, 60, 75 and 90 °, represented in the graph with a blue cross. In red, the mean, maximum and minimum angles reached by the platform. (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)

mumandminimumvaluesreached(absoluteerrorfromthemean

achieved)inredand,withabluecross,theidealbehavioris

rep-resented.Table3showsthedataforthissecondexperiment.

Fig.9showsthebehaviorofthesystemwhenthestimulusisa

puretoneof5kHzandthemeasurementsaretakenevery15°.The

ﬁgureshowsthemeananglereached,themaximumandminimum

valuesreached(absoluteerrorfromthemeanachieved)inredand,

withabluecross,theidealbehaviorisrepresented.Table4shows

thedataforthisthirdexperiment.

Theaverageerrorofthesystemis1.92° forthe1kHzstimulus

test,2.49° forthe2.5kHztestand2.57° forthe5kHztest.

There-fore,itresultsonanaverageerrorof2.32°.

Theaverageerrorisverylowwhenthestimulusisapuretone

of1kHz:1.92°.Thiserrorishigherwhenthefrequencyofthetone

isincreased.It goesupto 2.49° forthe2.5kHztestand2.57° for

the5kHztest.Thisis mostlikelyduetothefact that theNASis

designedin a waythat it generates larger spike rates when the

frequencyofthestimulustargetslowbandsofthesystem,dueto

theinterferenceoftheﬁlters.Therefore,higherratesarriveatthe

actuationlayer.Then,sinceweare usingPFM,thereisatrade-off

betweenthespikerateandthetime lengthofasinglespike.The

higherspikerateproducesamismatchinthatrelation,creatinga

largererror.

Consideringthat the environmental SNR ofthese experiments

is73dB,wecheckedthebehaviorofthesystemwhenhigherlevel

of noise is present. We have tested it when a set of SNR is

ap-plied:(−5dB,0dB,4dB,14dB,28dB).TheSNRismeasuredwhen

thesoundsourcedis positionedatazimuthal angleof0° andthe

white noise source is equidistantly located between two

(8)

Table 4

Data obtained for the ﬁrst experiment: 5 kHz stimulus.

0 0.32 0.36 0.70 0.00 15 9.00 0.38 9.84 8.44 30 26.97 0.34 27.42 26.72 45 42.16 0.47 42.89 41.48 60 57.33 0.36 57.66 56.95 75 73.07 0.19 73.12 72.42 90 88.13 0.34 88.59 87.89

Fig. 10. Angle reached by the robotic platform when the stimulus is a pure tone of 1 KHz and a set of SNR was applied: ( −5 dB, 0 dB, 4 dB, 14 dB and 28 dB).

(9)

Fig. 12. Angle reached by the robotic platform when the stimulus is a pure tone of 5 KHz and a set of SNR was applied: ( −5 dB, 0 dB, 4 dB, 14 dB and 28 dB).

Fig. 13. Trajectory followed by the head when the sound source is moving. The tracking of the sound is accurate. A new target is supplied according to the red arrows on the plot. The real trajectory of the head is shown by the blue trace. The mean error is 4.37 ° and the time to reach a new supplied target is 125 ms. (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)

Table 5

Mean error (in degrees) obtained for the experiments where the noise is present. 1 KHz 2.5 KHz 5 KHz −5 dB 9.61 2.86 4.42 0 dB 8.80 2.66 4.28 4 dB 8.22 2.68 3.09 14 dB 5.22 2.60 2.58 28 dB 5.20 2.63 2.39

ofthesystemwhenthreedifferentpuretonesareapplied:1KHz,

2.5KHzand5KHzasthestimulus.

Table5showsthemeanerrorwhenthenoiseispresentinthe

experiments.It can be seenhow whenthe SNR ishigherthe

er-ror is lower.However, the meanerror isvery low (less than ten

degreesinthe worstcase),thereforethesystemhasa highnoise

tolerance.

Fig. 13 shows the tracking of the sound source by the head

whenapuretoneof1kHzisused.Theexperimentlasts61sand

thetargettotrackchanges accordingtotheredarrowsshownon

theﬁgure.Theset oftargetssuppliedis(0,30,60,90,−30,−60,

−90,0)andthesetofpositionsreachedbythehead,onaverage,

is(0,28,68,94,−28,−55,−82,0)resultinginanerrorof3.625°

onaverage.Thetargetsarereachedbytheheadwithin125ms

af-terdeliveringanewone.Fromthattime,wehavetoconsiderthat

the NAS takesan average of 20 μs to process the digital sound,

togenerateaspikingoutputactivityandthat theprocessinglayer

takes50nstoprocesseachspikereceived.

4. Conclusion

Thispaperdescribes thedesignandhardwareimplementation

ofasoundlocalizationandtrackingsysteminspiredby the

mam-malianauditorysystem.TheNASsensor[14]isusedtoproducea

biologicalcochlea-like output. Thisoutput isthestimulus forthe

processingsystemwherethe LSOmodel isimplemented. The

ar-chitecture proposed for the LSO which performs the subtraction

betweentwoinput spikeratesproduces theIID.The IIDauditory

cue isused asthe input forthespike-based actuationstage that

tracksthesound.

The model was tested using 1kHz, 2.5kHz and 5kHz pure

tones,obtainingamaximumerroroflessthanﬁvedegrees,alower

errorthanthoseobtainedinpreviousworksthat proposeda

bio-logicallyinspired architectureof theLSO,[16] and[17],since the

classiﬁcationaccuracyoftheSNNarchitectureproposedby [17]is

52% for5kHzsounds,83% for15kHzsoundsand 40%for25kHz

soundsinstepsof10°,andtheclassiﬁcationaccuracyobtainedby

themodelproposedin[16],usingboth noiseandspeechinsteps

of30°,is80%. Furthermore,our systemshowsa highnoise

toler-ancelevelwhenwhitenoiseisapplied:intheworstcondition,the

averageerrorislowerthantendegrees.

Incomparisontotherelatedwork,thispaperprovides

(10)

are faithful tothe architecture ofthe mammalian auditory

path-waysimplemented inhardwareto tracksound inreal time with

veryhighaccuracy.Furthermore,thearchitecturepresentedinthis

workcanbeimplementedbyusinglow-costcommercialhardware

devices and have a low power consumption because the

hard-ware implementation uses generalpurpose FPGA resources, such

as counters, comparators, logic gates and low clock frequencies.

Speciﬁcally,thepowerconsumptiongoesupto58.33mWin

oper-ation(29.7mWfromtheNASand28.63mWfromtheprocessing

layer).

Forthesereasons,theproposedsystemissuitableforthesound

localizationtaskinrobotics.Inaddition,theworkpresentedinthis

paperis a signiﬁcant step forward in biologically inspired sound

localizationmodeling, becauseitobtains theIID cuesimulatinga

partofthemammalianauditorysystem,LSO.

Acknowledgements

This work was supported by the Spanish grant (withsupport

fromtheEuropeanRegionalDevelopmentFund)COFNET( TEC2016-77785-P).

References

[1] B. Grothe , M. Pecka , D. McAlpine , Mechanisms of sound localization in mammals, Physiol. Rev. 90 (3) (2010) 983–1012 .

[2] D.J. Tollin , The lateral superior olive: a functional role in sound source localization, Neuroscientist 9 (2) (2003) 127–143 .

[3] L.A. Jeffress , A place theory of sound localization, J. Comp Physiol. Psychol. 41 (1) (1948) 35 .

[4] D. McAlpine , B. Grothe , Sound localization and delay lines–do mammals ﬁt the model? Trends Neurosci. 26 (7) (2003) 347–350 .

[5] T.C. Yin , Neural mechanisms of encoding binaural localization cues in the auditory brainstem, Integrative Functions in the Mammalian Auditory Pathway, Springer, New York, 2002, pp. 99–159 .

[6] W.C. Wu , C.H. Hsieh , H.C. Huang , O.C. Chen , Hearing aid system with 3D sound localization, in: Proceedings. of the IEEE International Conference on TENCON, 2007, pp. 1–4 .

[7] H.J. Simon , Bilateral ampliﬁcation and sound localization: then and now, J. Re- habil. Res. Dev. 42 (4) (2005) 117 .

[8] P.K. Park , H. Ryu , J.H. Lee , C.W. Shin , K.B. Lee , J. Woo , J.S. Kim , B.C. Kang , S.C. Liu , T. Delbruck , Fast neuromorphic sound localization for binaural hearing aids, in: Proceedings of the 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2013, pp. 5275–5278 . [9] H. Okuno , K. Nakadai , Real-time sound source localization and separation

based on active audio-visual integration, Comput. Methods Neural Model. 2686 (2003) 118–125 .

[10] K. Nakadai , D. Matsuura , H. Okuno , H. Kitano , Applying scattering theory to robot audition system: robust sound source localization and extraction, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003, pp. 1147–1152 .

[11]A. van Schaik , V. Chan , C. Jin , Sound localisation with a silicon cochlea pair, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2009, pp. 2197–2200 .

[12] Y. Tamai , S. Kagami , H. Mizoguchi , K. Sakaya , K. Nagashima , T. Takano , Circu- lar microphone array for meeting system, in: Proceedings of the IEEE Sensors, 2003, pp. 1100–1105 .

[13] S.S. Yeom , J.S. Choi , Y.S. Lim , M. Park , M. Kim , DSP implementation of sound source localization with gain control, in: Proceedings of the International Con- ference on Control, Automation and Systems, 2007, pp. 224–229 .

[14] A . Jimenez-Fernandez , A . Linares-Barranco , R. Paz-Vicente , G. Jiménez , A. Civit , Building blocks for spikes signals processing, in: Proceedings of the Interna- tional Joint Conference Neural Networks, 2010, pp. 1–8 .

[15] A. Jiménez-Fernández , E. Cerezuela-Escudero , L. Miró-Amarante , M.J. Domínguez-Morales , F.d.A. Gómez-Rodríguez , A. Linares-Barranco , G. Jiménez-Moreno , in: A binaural neuromorphic auditory sensor for FPGA: a spike signal processing approach, 28, 2017, pp. 804–818 .

[16] J. Liu , D. Perez-Gonzalez , A. Rees , H. Erwin , S. Wermter , A biologically inspired spiking neural network model of the auditory midbrain for sound source localisation, Neurocomputing 74 (1) (2010) 129–139 .

[17] J.A. Wall , L.J. McDaid , L.P. Maguire , T.M. McGinnity , Spiking neural network model of sound localization using the interaural intensity difference, IEEE Trans. Neural Netw. Learn. Syst. 23 (4) (2012) 574–586 .

[18] Iwasa, K., Kugler, M., Kuroyanagi, S., & Iwata, A.. A sound localization and recognition system using pulsed neural networks on FPGA. in: Proceedings of the International Joint Conference on Neural Networks (pp. 902–907). IEEE. [19] Birchﬁeld, S.T., & Gangishetty, R.. Acoustic localization by interaural level differ-

ence. in: Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’05 (Vol. 4, pp. iv–1109). IEEE.

[20]F. Gomez-Rodriguez , R. Paz , L. Miro , A. Linares-Barranco , G. Jiménez , A. Civit , Two hardware implementations of the exhaustive synthetic AER generation method, Lecture Notes in Computer Science, Computational Intelligence and Bioinspired Systems, 3512, Springer, Berlin Heidelberg, 2005, pp. 534–540 . [21]E. Cerezuela-Escudero , M.J. Dominguez-Morales , A. Jimenez-Fernandez ,

R. Paz-Vicente , A. Linares-Barranco , G. Jimenez-Moreno , Spikes monitors for FPGAs, an experimental comparative study, in: Proceedings of the 12th International Work-Conference Artiﬁcial Neural Networks, 2013, pp. 179–188 . [22]A. Rios-Navarro , E. Cerezuela-Escudero , M. Dominguez-Morales ,

A. Jimenez-Fernandez , G. Jimenez-Moreno , A. Linares-Barranco , Live demon- stration: Real-time motor rotation frequency detection by spike-based visual and auditory AER sensory integration for FPGA, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2015 1907-1907 . [23]E. Cerezuela-Escudero , A. Jimenez-Fernandez , R. Paz-Vicente , M. Dominguez–

Morales , A. Linares-Barranco , G. Jimenez-Moreno , Musical notes classiﬁcation with neuromorphic auditory system using FPGA and a convolutional spiking network, in: Proceedings of the International Joint Conference on Neural Net- works, 2015, pp. 1–7 .

[24]E. Cerezuela-Escudero , A. Jimenez-Fernandez , R. Paz-Vicente , J.P. Dominguez– Morales , M.J. Dominguez-Morales , A. Linares-Barranco , Sound recognition system using spiking and MLP neural networks, in: Proceedings of the In- ternational Conference on Artiﬁcial Neural Networks, Springer International Publishing, 2016, pp. 363–371 .

[25]J.P. Dominguez-Morales , A . Jimenez-Fernandez , A . Rios-Navarro , E. Cerezue- la-Escudero , D. Gutierrez-Galan , M.J. Dominguez-Morales , G. Jimenez-Moreno , Multilayer , spiking neural network for audio samples classiﬁcation using SpiN- Naker, in: Proceedings of the International Conference on Artiﬁcial Neural Net- works, Springer International Publishing, 2016, pp. 45–53 .

[26]J.P. Dominguez-Morales , A. Jimenez-Fernandez , M. Dominguez-Morales , G. Jimenez-Moreno , NAVIS: neuromorphic auditory VISualizer tool, Neurocom- puting 237 (2017) 418–422 .

[27] Á.F. Jiménez Fernández , Diseño y Evaluación de Sistemas de Control y Proce- samiento de Señales Basados en Modelos Neuronales Pulsantes Ph.D.Thesis, University of Seville, Spain, 2010 .

[28]A. Jimenez-Fernandez , G. Jimenez-Moreno , A. Linares-Barranco , M.J. Dominguez-Morales , R. Paz-Vicente , A. Civit-Balcells , A neuro-inspired spike-based PID motor controller for multi-motor robots with low cost FPGAs, Sensors 12 (4) (2012) 3831–3856 .

[29]M. Domínguez-Morales , A. Jimenez-Fernandez , E. Cerezuela-Escudero , R. Paz-Vicente , A. Linares-Barranco , G. Jimenez , On the designing of spikes band-pass ﬁlters for FPGA, in: Proceedings of the International Conference on Artiﬁcial Neural Networks and Machine Learning, ICANN, 2011, 2011, pp. 389–396 .

[30]F. Pérez-Peña , A. Morgado-Estévez , A. Linares-Barranco , Inter-spikes-intervals exponential and gamma distributions study of neuron ﬁring rate for SVITE motor control model on FPGA, Neurocomputing 14 9 (2015) 4 96–504 .

[31] J. Mackenzie , J. Huopaniemi , V. Valimaki , I. Kale , Low-order modeling of head--related transfer functions using balanced model truncation, IEEE Signal Pro- cess. Lett. 4 (2) (1997) 39–41 .

[32]M. Otani , S. Ise , Fast calculation system specialized for head-related transfer function based on boundary element method, J. Acoust. Soc. Am. 119 (5) (2006) 2589–2598 .

[33]M. Slaney , An Eﬃcient Implementation of the Patterson-Holdsworth Auditory Filter Bank, An Eﬃcient Implementation of the Patterson-Holdsworth Auditory Filter Bank, 35, Apple Computer Perception Group, 1993 Technical Report . [34]W.E. Feddersen , T.T. Sandel , D.C. Teas , L.A. Jeffress , Localization of high-fre-