Adaptive Least Error Rate Algorithm for Neural Network Classifiers

(1)

FOR NEURAL NETWORK CLASSIFIERS

S.Chen y

, B.Mulgrew z

andL.Hanzo y

y

DepartmentofElectronicsandComputerScience

UniversityofSouthampton,Higheld

SouthamptonSO171BJ,U.K.

z

DepartmentofElectronicsandElectricalEngineering

Universityof Edinburgh,King'sBuildings

EdinburghEH93JL,U.K.

Abstract. Weconsidersample-by-sampleadaptivetrainingof

two-class neuralnetwork classiers. Specicapplicationsthatwehave

inmindarecommunicationchannelequalizationandcode-division

multiple-access (CDMA) multiuser detection. Typically, training

of such neural network classiers is done using some stochastic

gradient algorithm that tries to minimize the mean square error

(MSE).Since thegoalshouldreallybeminimizingthe error

prob-ability, the MSE is a \wrong" criterion to use and may lead to

a poor performance. We propose a stochastic gradient adaptive

minimum error rate (MER) algorithm called the least error rate

(LER) for trainingneural network classiers.

INTRODUCTION

We study the class of neural network classiers where pattern vectors are

drawnfromanitesetandcorruptedbyanadditivenoise. Specicexamples

includeneuralnetworkequalizersandmultiuserdetectors[1]-[6]. Weassume

that a sample-by-sampleadaptation isnecessary to meetreal-time

compu-tational constraints. In such applications, the training of neural network

classiersisusuallydoneusingsomestochasticgradientalgorithmbasedon

theMSEcriterion,andaclassicalexampleistheleastmeansquare(LMS)

al-gorithm. However,therealgoalistominimizetheerrorprobability.Thebias

infavouroftheMSEcriterionisperhapsrootedinadaptivelinearltering.

For linear classiers, such as linear equalizers and multiuser detectors,

there exists some relationship between the MSE and error probability. A

small MSE isusuallyassociated witha small error rate. However, even in

the linearcase, theminimum MSE(MMSE) solution ingeneral is notthe

MERone. Itisnowwell-knownthattheerrorrategapbetweentheMMSE

(2)

classiers, there is no relationship between the MSE and error rate, and

the MMSE solution may not correspond to a small error rate. In eect,

standard adaptive algorithmsfor training nonlinear classiers, such asthe

LMSalgorithm,isbasedonacriterionthatisirrelevanttotheproblem.

WeadopttheMERcriterionanddevelopastochasticgradientadaptive

MERalgorithmfortrainingneuralnetworkclassiers. Weemployan

adap-tivestrategythatisverysimilartotheone usedforderivingtheLMS.The

MMSEsolutionrequiresensembleaverages, whicharenotavailable in

gen-eral. Sampletimeaveragescaninpracticebeusedtoprovideestimatesofthe

requiredstatistics.Takingtoextreme,usingone-sampleestimateleadstothe

LMSadaptation. Theerrorprobabilityofaclassierisanintegrationofthe

probabilitydensityfunction(p.d.f.) oftheclassierdecisionvariable,which

isgenerallyunavailable. Howeverasampletimeaverageofthep.d.f.,called

thekerneldensityestimate[14],[15],canbeconstructed. Whenaone-sample

kerneldensityestimateisused,aninstantaneousorstochasticgradient

adap-tivealgorithmisderived. Thisapproachhassuccessfullybeenappliedinour

previousworksonlinearadaptiveMERalgorithms[9],[10].

The proposed LER algorithm is applied to equalization and multiuser

detectionusingaradialbasisfunction(RBF)network. Thesimulationresults

obtainedshowthattheLER algorithmhasareasonableconvergencespeed

anda small RBFnetworktrained bytheLERalgorithmcan closelymatch

theoptimalperformance. Thesimulationstudy alsoconrmsthattheMSE

criterionisirrelevanttothiskindofapplicationsandtheRBFnetworktrained

bytheLMSalgorithm,althoughconvergingwellintheMSE,canproducea

poorerrorrateperformance.

PROBLEM FORMULATION

Weconsidertheclassofnonlinearclassiersthataredenedby

^

c(k)=sgn(y(k)) with y(k)=f(r (k);w ) (1)

wherer(k)isanM-dimensionalpatternvectorwithitsassociatedclasslabel

c(k) 2 f1g,the vector wconsists of all theadjustable parametersof the

classier f, and c(k)^ is the estimated class label for r(k). Any soft sgn

function employed by a classier is included in the function form f. The

patternvectorr(k) isassumedtotake theform: r(k)=r(k)+n(k), where

thecleanpartr(k)takesvaluesfromanite setwithequalprobability

r(k)2fr

j

; 1jN

b

g (2)

andthenoisevectorn(k)isGaussianwithcovariancematrixE[n(k)n T

(k)]=

2

n

I. Eachr

j

hasanassociatedclasslabelc (j)

2f1g.

Classically,adaptivetrainingofsuchanonlinearclassierisdone by

ad-justingwsothattheMSE,E[(c(k) y(k)) 2

],isminimized,andistypically

(3)

y(k)=f(r(k);w (k 1))

w (k)=w (k 1)+(c(k) y(k))

@f(r(k );w (k 1))

@w

(3)

whereisanadaptivegain. However,thetrueperformancecriterionshould

betheerror rate. We considerhow toconstructanadaptive training

algo-rithm basedon theMER criterion. Notethat theMER iswith respectto

thechosenclassierstructure.

ADAPTIVEMINIMUM ERRORRATETRAINING

Theerrorprobabilityoftheclassier(1)is

P

E

(w )=Probfsgn(c(k))y(k)<0g (4)

Denethesignedvariabley

s

(k)=sgn(c(k))y(k)anditsp.d.f. p

y (y s ). Then P E (w )=

Z 0 1 p y (y s )dy s (5)

Bylinearizingtheclassieraround

r (k),itcanbeapproximatedas

y(k)=f(r (k)+n(k);w )f(r(k);w )+

@f(r(k);w )

@r

T

n(k)=f(r (k);w )+e(k)

(6)

wheree(k) isGaussianwithzeromeanandvariance

2 =E " 2 n

@f(r(k);w )

@r

T

@f(r(k);w )

@r # = 2 n N b N b X j=1 @f(r j ;w )

@r

T

@f(r

j ;w )

@r

(7)

Thus, the classier is approximated as an additive Gaussian noise model

y(k)y(k) +e(k),withy(k) takingvaluesfromtheniteset

y(k)2fy

j =f(r

j

;w ); 1jN

b

g (8)

Thep.d.f. ofy

s

(k) canthenbeapproximatedby

p y (y s ) 1 N b p 2 Nb X j=1 exp (y s sgn(c (j) )y j ) 2 2 2 (9)

andtheerrorprobabilityoftheclassierisapproximately

P E (w ) 1 N b p 2 N b X j=1 Z 1 g j (w ) exp x 2 j 2 ! dx j = 1 N b N b X j=1 Q(g j

(w )) (10)

(4)

g

j (w )=

sgn(c (j) )y j = sgn(c (j) )f(r j ;w )

(12)

Approximateminimumerrorrate solution

If theset (2) isknown, anapproximate MERsolution can be obtained by

minimizingtheerrorrateexpression(10). NotingthegradientofP

E (w ) rP E (w ) 1 N b p 2 Nb X j=1 exp y 2 j 2 2 ! @g j (w ) @w = 1 N b p 2 Nb X j=1 exp y 2 j 2 2 ! sgn(c (j) ) @f(r j ;w )

@w

(13)

thefollowingiterativegradientalgorithmcanbeusedtoarriveatan

approx-imateMERsolution. Givenaninitialw (0),atlth iteration:

y

j

(l)=f(r

j

;w (l 1)); 1jN

b

rP

E

(w (l))= 1 Nb p 2 P Nb j=1 exp y 2 j (l) 2 2 sgn(c (j) )

@f(rj;w (l 1))

@w

w (l)=w (l 1) rP

E (w (l)) 9 > = > ; (14)

Notice thatthevariance 2

coulditerativelybecalculated using (7).

How-ever,for numerical andconvergenceconsiderations, it ispreferredtox 2

toanappropriatelychosenconstant, thatis,toconsider 2

asanalgorithm

parameterthatrequirestuning. Iftheclassier(1)islinear,allthe

approxi-mationsbecomesexact,andwearriveattheexactMERsolution[9],[10].

Block-data gradient adaptivealgorithm

We will adopting a sample time averagefor estimating the p.d.f.. Thisis

known as kernel densityestimation [14],[15]. Given a blockof K training

samplesfr(k);c(k)g,akerneldensityestimateofp

y (y s ) is ^ p y (y s )= 1 K p 2 K X k =1 exp (y s sgn(c(k))y(k)) 2 2 2 (15)

Fromtheestimatederrorprobability

^

P

E (w )=

Z 0 1 ^ p y (y s )dy s (16) r ^ P E

(w )canbecalculated

r ^

P

E (w )=

1 K p 2 K X k =1 exp y 2 (k) 2 2 sgn(c(k))

@f(r(k);w )

@w

(17)

(5)

ToderivetheLERalgorithm,considerasingle-sampleestimateofp

y (y

s ):

^ p

y (y

s ;k)=

1

p

2 exp

(y

s

sgn(c(k))y(k)) 2

2 2

(18)

Usingtheinstantaneousorstochasticgradient

r ^

P

E

(k;w )= 1

p

2 exp

y 2

(k)

2 2

sgn(c(k))

@f(r(k);w )

@w

(19)

astochasticgradientalgorithmisgivenby

y(k)=f(r(k);w (k 1))

w (k)=w (k 1)+

p

2 exp

y 2

(k )

2 2

sgn(c(k))

@f(r(k );w (k 1))

@w

)

(20)

whereandaretheadaptivegainandwidth parameter.

APPLICATIONS

The rstapplicationconsidersequalizationinthepresenceof channel

inter-symbol interference(ISI), co-channelinterferenceand additive white

Gaus-siannoise. Forsimplication,itisassumedthatthereexistsonlyone

inter-feringco-channel,andthereceivedsignalatsamplek isgivenby

r(k)=r(k) +n(k)= n0 1

X

i=0 a

0;i b

0

(k i)+ n1 1

X

i=0 a

1;i b

1

(k i)+n(k) (21)

where n(k)isaGaussianwhitenoisewithvariance 2

n ,a

0;i

arethechannel

tapsanda

1;i

theco-channeltaps,thedesiredandinterferingdatab

0 (k)and

b

1

(k)takevaluefromthesetf1g,andtheyareuncorrelated. Theequalizer

uses thereceivedsignalvector, r(k)=[r(k) r(k 1)r(k M+1)] T

,to

produceanestimateofb

0

(k d). Thus,b

0

(k d)servesastheclasslabel,and

the numberof statesis N

b = 2

2M+n

0 +n

1 2

. The optimal equaliser, known

astheBayesianequaliser,isgivenin[16]. ARBFnetworkwithn

c

Gaussian

kernelsisusedastheadaptiveequalizer. Theweight vectorof theequalizer

thusconsistsofalltheRBFweights,centersandwidths.

Inthe simulation, the channeland co-channel are respectivelyA

0 (z) =

0:5+1:0z 1

and A

1

(z) = (1:0+0:5z 1

), with set to give a signal to

interferenceratio SIR = 12 dB. The equalizer order isset to M = 2 and

the decision delay d = 1. The number of states is N

b

= 64. The rst

n

c

=2datapointsthatbelong totheclass+1andtherstn

c

=2datapoints

that belong to theclass 1 are usedas initial centers. The initial weights

areset to 1

n

c (2

2

n )

M=2

accordingly. All thewidthsare initiallyset to8 2

n .

(6)

bytheLMSandLERalgorithms,respectively. Fig.1(a)showsthelearning

rates in terms of the estimated bit error rate(BER) for thefour adaptive

RBFequalizers,where theresultsareaveragedover100runs. FortheLMS

training,thelearningratesintermsoftheMSEaredepictedinFig.1(b).

0.001

0.01

0.1

1

0

200

400

600

800

1000

Estimated Bit Error Rate

Sample k

4-LMS

6-LMS

4-LER

6-LER

0.1

1

0

200

400

600

800

1000

Mean Square Error

Sample k

4-LMS

6-LMS

(a) (b)

Figure1: Convergenceratesintermsof(a)theestimatedBERforvariousadaptive

RBFequalizers,and(b)theMSEforLMSadaptiveRBFequalizers. Theresults

areaveragedover100runs.

-3 -2 -1 0 1 2 3

r(k 1)

r(k) b

0

(k 1)= 1 b

0

(k 1)=1

u

u u

u e

e e

e

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

r r

r

?

? ?

?

Figure2: Comparisonoftheoptimaldecisionboundarywiththatofthe6-center

LER RBFequalizer(thinsolid: RBF, thicksolid: optimal). SNR =20 dBand

SIR=12dB.Thedotsindicatethenoise-freestatesandstarsthenalcenters.

Itcan be seenthat, althoughthe LMSRBFequalizersconverge wellin

the MSE,theyhave poor BERperformance. Typicaldecision boundary of

the6-centerLERRBFequalizeriscomparedwiththeoptimalboundaryin

Fig.2. ThetrueBERsofthe4-centerLERRBFequalizertogetherwiththose

(7)

shown,as theyare almostindistinguishable fromthe optimalperformance.

The BERs of the 6-center LMSRBF equalizer, notshown, are notbetter

thanthose ofthelinearMMSEequaliser.

-5

-4

-3

-2

-1

0

5

10

15

20

25

30

35

40

log10(Bit Error Rate)

Signal to Noise Ratio (dB)

linear MMSE

4-LER

optimal

Figure3: PerformancecomparisonofthreeequalizersintermsofBERversusSNR.

SIR=12dB.TheadaptiveLERRBFequalizerhas4centers.

The second application considers the multiuser detection for the

syn-chronous CDMAdownlink withN usersandM chips persymbol. The bit

vectorof N users isb(k) =[b

1 (k)b

N (k)]

T

, with b

i

(k) 2 f1gdenoting

thekthsymbolofuseri,theunit-lengthsignaturecodesequenceforuseriis

s

i =[s

i;1 s

i;M ]

T

,andthechannel(at chiprate)isH(z)= P n h 1 i=0 h i z i .

Itcanbeshownthatthereceivedsignalvectorsampledatchiprate,r(k)=

[r

1 (k)r

M (k)]

T

,is:

r(k)=P 2 6 6 6 4 b(k) b(k 1) . . .

b(k L+1) 3

7

5

+n(k)=r(k)+n(k) (22)

where r(k) denotesthenoise-free received signal,theGaussian noisevector

n(k) = [n

1 (k)n

M (k)] T with E[n(k)n T (k)] = 2 n

I, the MLN system

matrixPisgivenby

P=H 2 6 6 6 6 4

SA 0 0

0 SA . . . . . . . . . . . . . . . 0

0 0

SA 3 7 7 7 7 5 (23)

theMLM channelmatrixHhastheform

(8)

the user signature sequence matrix S = [s

1 s

N

], and the diagonal user

signal amplitude matrix A = diagfA

1 A

N

g. The channel ISI span L

dependsonthechannellength,n

h

,relatedtothechiplength,M. L=1for

n

h

=1,L=2 for1<n

h

M,L=3for M <n

h

2M, andsoon. The

detectoratthereceiverforuseridetects thetransmittedbitb

i

(k)basedon

thereceivedsignalr(k). Thusb

i

(k) servesasclasslabel,andthenumberof

statesis N

b =2

LN

. The optimal detector,calledtheBayesiandetector,is

givenin[6]. AdaptivedetectoremployedisaRBFnetworkwithn

c

Gaussian

kernelfunctions.

0.0001

0.001

0.01

0.1

1

10

0

400

800

1200

1600

2000

Estimated Bit Error Rate

Sample k

16-LMS

64-LMS

16-LER

64-LER

1

10

100

1000

0

400

800

1200

1600

2000

Mean Square Error

Sample k

16-LMS

64-LMS

(a) (b)

Figure4: Convergenceratesintermsof (a)theestimatedBERforvarioususer-3

adaptive RBF detectorsand (b)the MSE for user-3 LMS RBFdetectors. The

resultsareaveragedover100runs.

A3-user systemwith8chips persymbolisusedinthesimulation. The

chipcodesforthethreeusersare(+1;+1;+1;+1; 1; 1; 1; 1),(+1; 1;

+1; 1; 1;+1; 1;+1)and (+1; 1; 1;+1; 1;+1;+1; 1), respectively,

andthechannelisH(z)=0:8+0:6z 1

+0:5z 2

. Thethreeusershaveequal

signalpower. Forthisexample,thenumberofstatesisN

b

=64. Thedetector

for user 3isconsidered. Weights, centers andwidthsof RBFdetectors are

initialized ina similarmanner asfor thepreviousequalizationapplication.

GivenSNRforuser3SNR

3

=15dB(SINRforuser3isSINR

3

= 3:08dB),

RBF detectors with 16 and 64 centers are trained by the LMS and LER

algorithms,respectively. Thelearningrates intermsoftheestimated BER

aredepictedinFig.4(a)forthe4adaptiveRBFdetectors. Fig.4(b)shows

thelearningratesintermsoftheMSEforthe2LMSRBFdetectors. Itcan

beseenthattheLERalgorithmproducesconsistentresultsand,inparticular,

the64-centerLERRBFdetectorisabletoachievetheoptimalperformance.

For the LMS training, the algorithm converges very well in the MSE and

thereisanalmost30dBreductionintheMSE,ascanbeseeninFig.4(b).

However,theBERsofthetwoLMSRBFdetectorsbothapproachto0.5. In

fact,onaverage,theinitial16-centerRBFdetectorhasaBER=0:2andthe

initial 64-centerRBFdetectorhasaBER=0:008. Yet,aftertraining using

the LMS,bothyieldalmost 1in 2errors. This isnothingto dowithlocal

minima. As far asthe LMSalgorithm is concerned,it does a good job in

whatitsupposestodo: gettingtheMSEdown. The BERsof the16-center

LER RBFdetectorare comparedwiththe optimal detectorinFig. 5for a

(9)

-6

-5

-4

-3

-2

-1

0

5

10

15

20

25

log10(Bit Error Rate)

SNR3 (dB)

linear MMSE

adaptive RBF

optimal

Figure5: Performancecomparisonofthreedetectorsforuser 3. SNRi,1i3,

areidentical. TheadaptiveRBFdetectorhas16centersandistrainedbytheLER

algorithm.

CONCLUSIONS

Adaptive orsample-by-sample training of nonlinearclassiers is classically

done using somestochasticgradient algorithms,suchastheLMS,that are

linkedtotheMMSEcriterion. ThisisdespitethefactthattheMMSE

crite-rionbearsno relationshiptotheMERcriterion. Itiswidelyacknowledged

thatadaptivetrainingofneuralnetworkclassiersoften encounters

diÆcul-ties. These diÆculties are typically attributed to \local minima". Rarely

a more fundamental question is asked: whether the underlying criterion

adopted,theMMSE,isarelevantonefortheproblem.

Themaincontributionof thispaperistoderiveanadaptiveMER

algo-rithm calledtheLER for training aclass of neuralnetwork classiers that

includeadaptivenonlinearequalisersandmultiuserdetectors. Ourapproach

has beenmotivated froma kernel densityestimationof the errorrateas a

smoothfunctionofthetrainingdataandanadoptionofstochasticgradient

of theestimated errorprobability. ThisLERalgorithmhasbeenappliedto

channelequalizationandCDMAmultiuserdetectionusinga RBFnetwork.

SimulationresultshavedemonstratedthattheLERalgorithmperforms

con-sistently and the algorithm has a good convergence speed. A small-size

RBF network trained by the LER algorithm can closely match the

opti-malBayesianperformance. TheresultsalsoconrmsthatthestandardLMS

(10)

[1] G.J.Gibson,S.Siu,S.Chen,C.F.N.CowanandP.M.Grant,\Theapplication

of nonlineararchitecturestoadaptivechannelequalisation,"inProc.ICC'90

(Atlanta,USA),1990,pp.312.8.1.{312.8.5.

[2] S.Chen,G.J.Gibson,C.F.N.CowanandP.M.Grant,\Adaptiveequalization

ofnitenon-linearchannelsusingmultilayerperceptrons,"SignalProcessing,

Vol.20,No.2,pp.107{119,1990.

[3] S. Chen, G.J. Gibson, C.F.N. Cowan and P.M. Grant, \Reconstruction of

binarysignalsusinganadaptiveradial-basis-functionequalizer,"Signal

Pro-cessing,Vol.22,No.1,pp.77{93, 1991.

[4] B.Aazhang,B.P.Parisand G.C.Orsak,\Neuralnetworksformultiuser

de-tectionincode-divisionmultiple-accesscommunications,"IEEETrans.

Com-munications,Vol.40,No.7,pp.1212{1222,1992.

[5] U.Mitraand H.V.Poor,\Neuralnetworktechniquesforadaptivemultiuser

demodulation," IEEE J. Selected Areas in Communications, Vol.12, No.9,

pp.1460-1470, 1994.

[6] S. Chen,A.K. Saminganand L.Hanzo,\Supportvector machinemultiuser

receiver forDS-CDMA signalsin multipath channels," IEEE Trans. Neural

Networks,toappear,May2001.

[7] S. Chen, E.S. Chng, B. Mulgrew and G. Gibson, \Minimum-BER

linear-combinerDFE,"inProc.ICC'96(Dallas,Texas),1996,Vol.2,pp.1173{1177.

[8] S.Chen,B.Mulgrew,E.S.ChngandG.Gibson,\Spacetranslationproperties

and the minimum-BERlinear-combiner DFE,"IEE Proc.Communications,

Vol.145,No.5,1998,pp.316{322.

[9] B.MulgrewandS.Chen,\Stochasticgradientminimum-BERdecision

feed-back equalisers," in Proc. IEEE Symposium on Adaptive Systems for

Sig-nal Processing, CommunicationandControl(LakeLouise,Alberta,Canada),

Oct.1-4,2000,pp.93{98.

[10] S. Chen, A.K. Samingan, B. Mulgrewand L. Hanzo, \Adaptive

minimum-BERlinearmultiuserdetectionforDS-CDMAsignalsinmultipathchannels,"

IEEETrans.SignalProcessing,Vol.49,No.6,pp.1240{1247,2001.

[11] C.C.YehandJ.R.Barry,\Adaptiveminimumbit-errorrateequalizationfor

binarysignaling,"IEEETrans.Communications,Vol.48,No.7,pp.1226{1235,

2000.

[12] C.C.Yeh,R.R.LopesandJ.R.Barry,\Approximateminimumbit-errorrate

multiuser detection," in Proc. Globecom'98 (Sydney, Australia), Nov.1998,

pp.3590{3595.

[13] I.N.Psaromiligkos,S.N. BatalamaandD.A. Pados, \Onadaptiveminimum

probability of error linear lter receivers for DS-CDMA channels," IEEE

Trans.Communications,Vol.47,No.7,pp.1092{1102, 1999.

[14] B.W.Silverman,DensityEstimation.London:ChapmanHall,1996.

[15] A.W.BowmanandA.Azzalini,AppliedSmoothingTechniquesforData

Anal-ysis.OxfordUniversityPress,1997.

[16] S.ChenandB.Mulgrew,\Overcomingco-channelinterferenceusingan

adap-tiveradialbasisfunction equaliser,"Signal Processing,Vol.28, No.1, pp.91{