FOR NEURAL NETWORK CLASSIFIERS
S. Chen†, B. Mulgrew‡ and L. Hanzo†
† Department of Electronics and Computer Science
University of Southampton, Highfield
Southampton SO17 1BJ, U.K.
‡ Department of Electronics and Electrical Engineering
University of Edinburgh, King's Buildings
Edinburgh EH9 3JL, U.K.
Abstract. We consider sample-by-sample adaptive training of
two-class neural network classifiers. Specific applications that we have
in mind are communication channel equalization and code-division
multiple-access (CDMA) multiuser detection. Typically, training
of such neural network classifiers is done using some stochastic
gradient algorithm that tries to minimize the mean square error
(MSE). Since the goal should really be minimizing the error
probability, the MSE is a "wrong" criterion to use and may lead to
poor performance. We propose a stochastic gradient adaptive
minimum error rate (MER) algorithm called the least error rate
(LER) for training neural network classifiers.
INTRODUCTION
We study the class of neural network classifiers where pattern vectors are
drawn from a finite set and corrupted by an additive noise. Specific examples
include neural network equalizers and multiuser detectors [1]-[6]. We assume
that a sample-by-sample adaptation is necessary to meet real-time
computational constraints. In such applications, the training of neural network
classifiers is usually done using some stochastic gradient algorithm based on
the MSE criterion, and a classical example is the least mean square (LMS)
algorithm. However, the real goal is to minimize the error probability. The bias
in favour of the MSE criterion is perhaps rooted in adaptive linear filtering.
For linear classifiers, such as linear equalizers and multiuser detectors,
there exists some relationship between the MSE and error probability. A
small MSE is usually associated with a small error rate. However, even in
the linear case, the minimum MSE (MMSE) solution in general is not the
MER one. It is now well known that the error rate gap between the MMSE [...]
classifiers, there is no relationship between the MSE and error rate, and
the MMSE solution may not correspond to a small error rate. In effect,
standard adaptive algorithms for training nonlinear classifiers, such as the
LMS algorithm, are based on a criterion that is irrelevant to the problem.
We adopt the MER criterion and develop a stochastic gradient adaptive
MER algorithm for training neural network classifiers. We employ an
adaptive strategy that is very similar to the one used for deriving the LMS. The
MMSE solution requires ensemble averages, which are not available in
general. Sample time averages can in practice be used to provide estimates of the
required statistics. Taken to the extreme, using a one-sample estimate leads to the
LMS adaptation. The error probability of a classifier is an integration of the
probability density function (p.d.f.) of the classifier decision variable, which
is generally unavailable. However, a sample time average of the p.d.f., called
the kernel density estimate [14],[15], can be constructed. When a one-sample
kernel density estimate is used, an instantaneous or stochastic gradient
adaptive algorithm is derived. This approach has successfully been applied in our
previous works on linear adaptive MER algorithms [9],[10].
The proposed LER algorithm is applied to equalization and multiuser
detection using a radial basis function (RBF) network. The simulation results
obtained show that the LER algorithm has a reasonable convergence speed
and that a small RBF network trained by the LER algorithm can closely match
the optimal performance. The simulation study also confirms that the MSE
criterion is irrelevant to this kind of application: the RBF network trained
by the LMS algorithm, although converging well in the MSE, can produce a
poor error rate performance.
PROBLEM FORMULATION
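We consider the class of nonlinear classifiers that are defined by

$$\hat{c}(k) = \mathrm{sgn}(y(k)) \quad \text{with} \quad y(k) = f(\mathbf{r}(k); \mathbf{w}) \tag{1}$$

where $\mathbf{r}(k)$ is an $M$-dimensional pattern vector with its associated class label $c(k) \in \{\pm 1\}$, the vector $\mathbf{w}$ consists of all the adjustable parameters of the classifier $f$, and $\hat{c}(k)$ is the estimated class label for $\mathbf{r}(k)$. Any soft sgn function employed by a classifier is included in the functional form $f$. The pattern vector $\mathbf{r}(k)$ is assumed to take the form $\mathbf{r}(k) = \bar{\mathbf{r}}(k) + \mathbf{n}(k)$, where the clean part $\bar{\mathbf{r}}(k)$ takes values from a finite set with equal probability

$$\bar{\mathbf{r}}(k) \in \{\mathbf{r}_j,\ 1 \le j \le N_b\} \tag{2}$$

and the noise vector $\mathbf{n}(k)$ is Gaussian with covariance matrix $E[\mathbf{n}(k)\mathbf{n}^T(k)] = \sigma_n^2 \mathbf{I}$. Each $\mathbf{r}_j$ has an associated class label $c^{(j)} \in \{\pm 1\}$.

Classically, adaptive training of such a nonlinear classifier is done by adjusting $\mathbf{w}$ so that the MSE, $E[(c(k) - y(k))^2]$, is minimized, and this is typically implemented with the LMS stochastic gradient update

$$\begin{aligned} y(k) &= f(\mathbf{r}(k); \mathbf{w}(k-1)) \\ \mathbf{w}(k) &= \mathbf{w}(k-1) + \mu\,(c(k) - y(k))\,\frac{\partial f(\mathbf{r}(k); \mathbf{w}(k-1))}{\partial \mathbf{w}} \end{aligned} \tag{3}$$

where $\mu$ is an adaptive gain. However, the true performance criterion should be the error rate. We consider how to construct an adaptive training algorithm based on the MER criterion. Note that the MER is with respect to the chosen classifier structure.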
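
As a point of reference for what follows, the classical update (3) can be sketched in a few lines. This is only an illustration: the function names (`lms_step`, `f_lin`, `grad_lin`) and the linear classifier f(r; w) = w^T r are assumptions made for the example, not something specified in the paper.

```python
import numpy as np

def lms_step(w, r, c, f, grad_f, mu):
    """One LMS update as in (3): w(k) = w(k-1) + mu * (c - y) * df/dw."""
    y = f(r, w)                          # output computed with the old weights
    return w + mu * (c - y) * grad_f(r, w), y

# Hypothetical linear classifier f(r; w) = w^T r, used only for illustration.
f_lin = lambda r, w: float(w @ r)
grad_lin = lambda r, w: r                # df/dw for the linear case

w = np.zeros(2)
r = np.array([1.0, -0.5])
w, y = lms_step(w, r, c=1.0, f=f_lin, grad_f=grad_lin, mu=0.1)
```

The correction is proportional to the output error c(k) - y(k), so a sample that is already classified correctly still pulls the weights whenever y(k) differs from the label value.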
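ADAPTIVE MINIMUM ERROR RATE TRAINING
The error probability of the classifier (1) is

$$P_E(\mathbf{w}) = \mathrm{Prob}\{\mathrm{sgn}(c(k))\,y(k) < 0\} \tag{4}$$

Define the signed variable $y_s(k) = \mathrm{sgn}(c(k))\,y(k)$ and its p.d.f. $p_y(y_s)$. Then

$$P_E(\mathbf{w}) = \int_{-\infty}^{0} p_y(y_s)\,dy_s \tag{5}$$

By linearizing the classifier around $\bar{\mathbf{r}}(k)$, it can be approximated as

$$y(k) = f(\bar{\mathbf{r}}(k) + \mathbf{n}(k); \mathbf{w}) \approx f(\bar{\mathbf{r}}(k); \mathbf{w}) + \left[\frac{\partial f(\bar{\mathbf{r}}(k); \mathbf{w})}{\partial \mathbf{r}}\right]^T \mathbf{n}(k) = f(\bar{\mathbf{r}}(k); \mathbf{w}) + e(k) \tag{6}$$

where $e(k)$ is Gaussian with zero mean and variance

$$\rho^2 = E\left[\sigma_n^2 \left[\frac{\partial f(\bar{\mathbf{r}}(k); \mathbf{w})}{\partial \mathbf{r}}\right]^T \frac{\partial f(\bar{\mathbf{r}}(k); \mathbf{w})}{\partial \mathbf{r}}\right] = \frac{\sigma_n^2}{N_b} \sum_{j=1}^{N_b} \left[\frac{\partial f(\mathbf{r}_j; \mathbf{w})}{\partial \mathbf{r}}\right]^T \frac{\partial f(\mathbf{r}_j; \mathbf{w})}{\partial \mathbf{r}} \tag{7}$$

Thus, the classifier is approximated as an additive Gaussian noise model $y(k) \approx \bar{y}(k) + e(k)$, with $\bar{y}(k)$ taking values from the finite set

$$\bar{y}(k) \in \{y_j = f(\mathbf{r}_j; \mathbf{w}),\ 1 \le j \le N_b\} \tag{8}$$

The p.d.f. of $y_s(k)$ can then be approximated by

$$p_y(y_s) \approx \frac{1}{N_b \sqrt{2\pi}\,\rho} \sum_{j=1}^{N_b} \exp\left(-\frac{(y_s - \mathrm{sgn}(c^{(j)})\,y_j)^2}{2\rho^2}\right) \tag{9}$$

and the error probability of the classifier is approximately

$$P_E(\mathbf{w}) \approx \frac{1}{N_b \sqrt{2\pi}} \sum_{j=1}^{N_b} \int_{g_j(\mathbf{w})}^{\infty} \exp\left(-\frac{x_j^2}{2}\right) dx_j = \frac{1}{N_b} \sum_{j=1}^{N_b} Q(g_j(\mathbf{w})) \tag{10}$$

where $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp(-u^2/2)\,du$ is the Gaussian tail function and

$$g_j(\mathbf{w}) = \frac{\mathrm{sgn}(c^{(j)})\,y_j}{\rho} = \frac{\mathrm{sgn}(c^{(j)})\,f(\mathbf{r}_j; \mathbf{w})}{\rho} \tag{12}$$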
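
To make the approximation (10) concrete, the sketch below evaluates it for a handful of noise-free outputs. The helper names and the toy numbers are illustrative assumptions, and the width parameter of (7) is written `rho` in the code.

```python
import math
import numpy as np

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def approx_error_rate(y_states, labels, rho):
    """Approximation (10): average of Q(sgn(c_j) * y_j / rho) over the states."""
    g = np.sign(labels) * np.asarray(y_states, dtype=float) / rho
    return float(np.mean([q_function(gj) for gj in g]))

# Toy numbers (not from the paper): two states whose outputs match their labels.
p_e = approx_error_rate(y_states=[1.0, -1.0], labels=[1, -1], rho=0.5)
```

Here both states sit two widths away from the decision threshold, so the estimate equals Q(2). Moving any state output towards zero, or across it, raises that state's contribution towards 0.5 and beyond.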
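Approximate minimum error rate solution
If the set (2) is known, an approximate MER solution can be obtained by minimizing the error rate expression (10). Noting the gradient of $P_E(\mathbf{w})$

$$\nabla P_E(\mathbf{w}) \approx -\frac{1}{N_b \sqrt{2\pi}} \sum_{j=1}^{N_b} \exp\left(-\frac{y_j^2}{2\rho^2}\right) \frac{\partial g_j(\mathbf{w})}{\partial \mathbf{w}} = -\frac{1}{N_b \sqrt{2\pi}\,\rho} \sum_{j=1}^{N_b} \exp\left(-\frac{y_j^2}{2\rho^2}\right) \mathrm{sgn}(c^{(j)})\,\frac{\partial f(\mathbf{r}_j; \mathbf{w})}{\partial \mathbf{w}} \tag{13}$$

the following iterative gradient algorithm can be used to arrive at an approximate MER solution. Given an initial $\mathbf{w}(0)$, at the $l$th iteration:

$$\begin{aligned} y_j(l) &= f(\mathbf{r}_j; \mathbf{w}(l-1)), \quad 1 \le j \le N_b \\ \nabla P_E(\mathbf{w}(l)) &= -\frac{1}{N_b \sqrt{2\pi}\,\rho} \sum_{j=1}^{N_b} \exp\left(-\frac{y_j^2(l)}{2\rho^2}\right) \mathrm{sgn}(c^{(j)})\,\frac{\partial f(\mathbf{r}_j; \mathbf{w}(l-1))}{\partial \mathbf{w}} \\ \mathbf{w}(l) &= \mathbf{w}(l-1) - \mu \nabla P_E(\mathbf{w}(l)) \end{aligned} \tag{14}$$

Notice that the variance $\rho^2$ could iteratively be calculated using (7). However, for numerical and convergence considerations, it is preferred to fix $\rho^2$ to an appropriately chosen constant, that is, to consider $\rho^2$ as an algorithm parameter that requires tuning. If the classifier (1) is linear, all the approximations become exact, and we arrive at the exact MER solution [9],[10].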
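Block-data gradient adaptive algorithm
We will adopt a sample time average for estimating the p.d.f. This is known as kernel density estimation [14],[15]. Given a block of $K$ training samples $\{\mathbf{r}(k), c(k)\}$, a kernel density estimate of $p_y(y_s)$ is

$$\hat{p}_y(y_s) = \frac{1}{K \sqrt{2\pi}\,\rho} \sum_{k=1}^{K} \exp\left(-\frac{(y_s - \mathrm{sgn}(c(k))\,y(k))^2}{2\rho^2}\right) \tag{15}$$

From the estimated error probability

$$\hat{P}_E(\mathbf{w}) = \int_{-\infty}^{0} \hat{p}_y(y_s)\,dy_s \tag{16}$$

$\nabla \hat{P}_E(\mathbf{w})$ can be calculated:

$$\nabla \hat{P}_E(\mathbf{w}) = -\frac{1}{K \sqrt{2\pi}\,\rho} \sum_{k=1}^{K} \exp\left(-\frac{y^2(k)}{2\rho^2}\right) \mathrm{sgn}(c(k))\,\frac{\partial f(\mathbf{r}(k); \mathbf{w})}{\partial \mathbf{w}} \tag{17}$$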
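
The link between the estimated error probability (16) and its gradient (17) can be checked numerically. The sketch below assumes a linear classifier f(r; w) = w^T r and synthetic two-class data; the data model, the seed and all names are illustrative assumptions.

```python
import math
import numpy as np

def kde_error_rate(w, R, c, rho):
    """Estimate (16) for f(r; w) = w^T r: mean of Q(sgn(c(k)) * y(k) / rho)."""
    ys = np.sign(c) * (R @ w)
    return float(np.mean([0.5 * math.erfc(v / (rho * math.sqrt(2.0))) for v in ys]))

def kde_gradient(w, R, c, rho):
    """Block-data gradient (17), with df/dw = r(k) for the linear case."""
    y = R @ w
    kern = np.exp(-y**2 / (2.0 * rho**2)) / (np.sqrt(2.0 * np.pi) * rho)
    return -(kern * np.sign(c)) @ R / len(c)

# Synthetic block of K = 200 samples (assumed data model, not from the paper).
rng = np.random.default_rng(0)
c = rng.choice([-1.0, 1.0], size=200)
R = c[:, None] * np.array([1.0, 0.5]) + 0.3 * rng.standard_normal((200, 2))
w = np.array([0.4, -0.1])
grad = kde_gradient(w, R, c, rho=0.5)
```

Integrating each Gaussian kernel over the negative half-line gives one Q-function term per sample, which is why `kde_error_rate` needs no explicit numerical integration.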
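To derive the LER algorithm, consider a single-sample estimate of $p_y(y_s)$:

$$\hat{p}_y(y_s; k) = \frac{1}{\sqrt{2\pi}\,\rho} \exp\left(-\frac{(y_s - \mathrm{sgn}(c(k))\,y(k))^2}{2\rho^2}\right) \tag{18}$$

Using the instantaneous or stochastic gradient

$$\nabla \hat{P}_E(k; \mathbf{w}) = -\frac{1}{\sqrt{2\pi}\,\rho} \exp\left(-\frac{y^2(k)}{2\rho^2}\right) \mathrm{sgn}(c(k))\,\frac{\partial f(\mathbf{r}(k); \mathbf{w})}{\partial \mathbf{w}} \tag{19}$$

a stochastic gradient algorithm is given by

$$\begin{aligned} y(k) &= f(\mathbf{r}(k); \mathbf{w}(k-1)) \\ \mathbf{w}(k) &= \mathbf{w}(k-1) + \frac{\mu}{\sqrt{2\pi}\,\rho} \exp\left(-\frac{y^2(k)}{2\rho^2}\right) \mathrm{sgn}(c(k))\,\frac{\partial f(\mathbf{r}(k); \mathbf{w}(k-1))}{\partial \mathbf{w}} \end{aligned} \tag{20}$$

where $\mu$ and $\rho$ are the adaptive gain and width parameter.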
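
A minimal sketch of the sample-by-sample update (20) follows. It assumes a linear classifier f(r; w) = w^T r and a synthetic noisy two-class stream purely for illustration; the names `ler_step`, `f_lin` and `grad_lin`, the seed, and the values of the gain `mu` and width `rho` are all hypothetical.

```python
import numpy as np

def ler_step(w, r, c, f, grad_f, mu, rho):
    """One LER update as in (20); returns the new weights and the output y(k)."""
    y = f(r, w)                                           # uses the old weights
    gain = mu / (np.sqrt(2.0 * np.pi) * rho) * np.exp(-y**2 / (2.0 * rho**2))
    return w + gain * np.sign(c) * grad_f(r, w), y

# Hypothetical linear classifier, used only to exercise the update.
def f_lin(r, w):
    return float(w @ r)

def grad_lin(r, w):
    return r

rng = np.random.default_rng(1)
w = np.array([0.0, 0.1])
for k in range(2000):
    c = rng.choice([-1.0, 1.0])                           # class label c(k)
    r = c * np.array([1.0, 0.5]) + 0.3 * rng.standard_normal(2)
    w, _ = ler_step(w, r, c, f_lin, grad_lin, mu=0.05, rho=0.5)
```

Compared with the LMS update (3), the error term c(k) - y(k) is replaced by a Gaussian weighting of y(k): samples whose output lies within a few widths of the decision threshold drive the adaptation, while samples far from it contribute almost nothing.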
APPLICATIONS
The first application considers equalization in the presence of channel intersymbol interference (ISI), co-channel interference and additive white Gaussian noise. For simplification, it is assumed that there exists only one interfering co-channel, and the received signal at sample $k$ is given by

$$r(k) = \bar{r}(k) + n(k) = \sum_{i=0}^{n_0 - 1} a_{0,i}\,b_0(k-i) + \sum_{i=0}^{n_1 - 1} a_{1,i}\,b_1(k-i) + n(k) \tag{21}$$

where $n(k)$ is a Gaussian white noise with variance $\sigma_n^2$, $a_{0,i}$ are the channel taps and $a_{1,i}$ the co-channel taps, the desired and interfering data $b_0(k)$ and $b_1(k)$ take values from the set $\{\pm 1\}$, and they are uncorrelated. The equalizer uses the received signal vector, $\mathbf{r}(k) = [r(k)\ r(k-1) \cdots r(k-M+1)]^T$, to produce an estimate of $b_0(k-d)$. Thus, $b_0(k-d)$ serves as the class label, and the number of states is $N_b = 2^{2M + n_0 + n_1 - 2}$. The optimal equaliser, known as the Bayesian equaliser, is given in [16]. An RBF network with $n_c$ Gaussian kernels is used as the adaptive equalizer. The weight vector of the equalizer thus consists of all the RBF weights, centers and widths.
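
Both the LMS update (3) and the LER update (20) need the derivative of the network output with respect to every adjustable parameter. The paper does not spell out the RBF parametrization here, so the sketch below assumes the common form f(r) = sum_i alpha_i exp(-||r - c_i||^2 / tau_i), with one squared-width parameter tau_i per kernel; the function names and numerical values are illustrative only.

```python
import numpy as np

def rbf_output(r, alpha, centers, tau):
    """Assumed RBF form: f(r) = sum_i alpha_i * exp(-||r - c_i||^2 / tau_i)."""
    d2 = np.sum((r - centers) ** 2, axis=1)       # squared distance to each center
    phi = np.exp(-d2 / tau)                       # Gaussian kernel responses
    return float(alpha @ phi), phi, d2

def rbf_gradients(r, alpha, centers, tau):
    """Partial derivatives of f with respect to weights, centers and widths."""
    _, phi, d2 = rbf_output(r, alpha, centers, tau)
    g_alpha = phi
    g_centers = (2.0 * alpha * phi / tau)[:, None] * (r - centers)
    g_tau = alpha * phi * d2 / tau**2
    return g_alpha, g_centers, g_tau

# Toy two-kernel network in M = 2 dimensions (values chosen for illustration).
alpha = np.array([0.5, -0.5])
centers = np.array([[1.0, 0.5], [-1.0, -0.5]])
tau = np.array([0.8, 0.8])
r = np.array([0.3, -0.2])
y, _, _ = rbf_output(r, alpha, centers, tau)
g_alpha, g_centers, g_tau = rbf_gradients(r, alpha, centers, tau)
```

Stacking `g_alpha`, `g_centers` and `g_tau` into one vector gives the derivative of f with respect to w that either update rule multiplies by its own scalar factor.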
In the simulation, the channel and co-channel are respectively $A_0(z) = 0.5 + 1.0 z^{-1}$ and $A_1(z) = \lambda(1.0 + 0.5 z^{-1})$, with $\lambda$ set to give a signal to interference ratio SIR = 12 dB. The equalizer order is set to $M = 2$ and the decision delay $d = 1$. The number of states is $N_b = 64$. The first $n_c/2$ data points that belong to the class $+1$ and the first $n_c/2$ data points that belong to the class $-1$ are used as initial centers. The initial weights are set to $\pm \frac{1}{n_c (2\pi\sigma_n^2)^{M/2}}$ accordingly. All the widths are initially set to $8\sigma_n^2$. [...] by the LMS and LER algorithms, respectively. Fig. 1(a) shows the learning curves in terms of the estimated bit error rate (BER) for the four adaptive RBF equalizers, where the results are averaged over 100 runs. For the LMS training, the learning curves in terms of the MSE are depicted in Fig. 1(b).
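
The state count quoted for this channel can be checked by enumerating the noise-free received vectors of (21). The sketch assumes that the SIR is defined as the ratio of the channel tap energy to the co-channel tap energy, which fixes the co-channel scaling (`lam` in the code); that definition and the function name are assumptions.

```python
import itertools
import numpy as np

def equalizer_states(a0, a1, M, d):
    """Enumerate noise-free vectors [r(k), ..., r(k-M+1)] and labels b_0(k-d)."""
    n0, n1 = len(a0), len(a1)
    L0, L1 = M + n0 - 1, M + n1 - 1          # symbols of each source seen by r(k)
    states, labels = [], []
    for bits in itertools.product([-1.0, 1.0], repeat=L0 + L1):
        b0 = np.array(bits[:L0])             # b0[i] = b_0(k - i)
        b1 = np.array(bits[L0:])             # b1[i] = b_1(k - i)
        # Entry m is the noise-free sample r(k - m) from (21).
        states.append([a0 @ b0[m:m + n0] + a1 @ b1[m:m + n1] for m in range(M)])
        labels.append(b0[d])
    return np.array(states), np.array(labels)

lam = 10 ** (-12 / 20)                       # assumed SIR definition, 12 dB
a0 = np.array([0.5, 1.0])                    # A_0(z) = 0.5 + 1.0 z^-1
a1 = lam * np.array([1.0, 0.5])              # A_1(z) = lam * (1.0 + 0.5 z^-1)
states, labels = equalizer_states(a0, a1, M=2, d=1)
```

With M = 2 and two taps per channel, six symbols influence the received vector, which gives 2^6 = 64 states, half of them labelled +1.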
[Figure 1 plots: (a) Estimated Bit Error Rate versus Sample k, curves 4-LMS, 6-LMS, 4-LER and 6-LER; (b) Mean Square Error versus Sample k, curves 4-LMS and 6-LMS.]
Figure 1: Convergence rates in terms of (a) the estimated BER for various adaptive
RBF equalizers, and (b) the MSE for LMS adaptive RBF equalizers. The results
are averaged over 100 runs.
[Figure 2 plot: decision boundaries in the plane of r(k) versus r(k-1), both axes from -3 to 3, with the regions labelled b_0(k-1) = -1 and b_0(k-1) = +1.]
Figure 2: Comparison of the optimal decision boundary with that of the 6-center
LER RBF equalizer (thin solid: RBF, thick solid: optimal). SNR = 20 dB and
SIR = 12 dB. The dots indicate the noise-free states and stars the final centers.
It can be seen that, although the LMS RBF equalizers converge well in
the MSE, they have poor BER performance. A typical decision boundary of
the 6-center LER RBF equalizer is compared with the optimal boundary in
Fig. 2. The true BERs of the 4-center LER RBF equalizer together with those [...]
shown, as they are almost indistinguishable from the optimal performance.
The BERs of the 6-center LMS RBF equalizer, not shown, are no better
than those of the linear MMSE equaliser.
[Figure 3 plot: log10(Bit Error Rate) versus Signal to Noise Ratio (dB), curves linear MMSE, 4-LER and optimal.]
Figure 3: Performance comparison of three equalizers in terms of BER versus SNR.
SIR = 12 dB. The adaptive LER RBF equalizer has 4 centers.
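The second application considers multiuser detection for the synchronous CDMA downlink with $N$ users and $M$ chips per symbol. The bit vector of the $N$ users is $\mathbf{b}(k) = [b_1(k) \cdots b_N(k)]^T$, with $b_i(k) \in \{\pm 1\}$ denoting the $k$th symbol of user $i$, the unit-length signature code sequence for user $i$ is $\mathbf{s}_i = [s_{i,1} \cdots s_{i,M}]^T$, and the channel (at chip rate) is $H(z) = \sum_{i=0}^{n_h - 1} h_i z^{-i}$. It can be shown that the received signal vector sampled at chip rate, $\mathbf{r}(k) = [r_1(k) \cdots r_M(k)]^T$, is:

$$\mathbf{r}(k) = \mathbf{P} \begin{bmatrix} \mathbf{b}(k) \\ \mathbf{b}(k-1) \\ \vdots \\ \mathbf{b}(k-L+1) \end{bmatrix} + \mathbf{n}(k) = \bar{\mathbf{r}}(k) + \mathbf{n}(k) \tag{22}$$

where $\bar{\mathbf{r}}(k)$ denotes the noise-free received signal, the Gaussian noise vector $\mathbf{n}(k) = [n_1(k) \cdots n_M(k)]^T$ has $E[\mathbf{n}(k)\mathbf{n}^T(k)] = \sigma_n^2 \mathbf{I}$, the $M \times LN$ system matrix $\mathbf{P}$ is given by

$$\mathbf{P} = \mathbf{H} \begin{bmatrix} \mathbf{S}\mathbf{A} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{S}\mathbf{A} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{S}\mathbf{A} \end{bmatrix} \tag{23}$$

the $M \times LM$ channel matrix $\mathbf{H}$ has the form [...]
the user signature sequence matrix is $\mathbf{S} = [\mathbf{s}_1 \cdots \mathbf{s}_N]$, and the diagonal user signal amplitude matrix is $\mathbf{A} = \mathrm{diag}\{A_1 \cdots A_N\}$. The channel ISI span $L$ depends on the channel length, $n_h$, relative to the chip length, $M$: $L = 1$ for $n_h = 1$, $L = 2$ for $1 < n_h \le M$, $L = 3$ for $M < n_h \le 2M$, and so on. The detector at the receiver for user $i$ detects the transmitted bit $b_i(k)$ based on the received signal $\mathbf{r}(k)$. Thus $b_i(k)$ serves as the class label, and the number of states is $N_b = 2^{LN}$. The optimal detector, called the Bayesian detector, is given in [6]. The adaptive detector employed is an RBF network with $n_c$ Gaussian kernel functions.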
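
The noise-free states of this downlink model can be enumerated without writing out the channel matrix, by spreading the symbols, convolving the chip stream with the channel taps, and keeping the M chips of the current symbol. The sketch below takes that route as an assumption-laden illustration: the function name is hypothetical, the ISI span follows the rule quoted above, and the example uses the 3-user, 8-chip codes and channel taps of the simulation reported in this section, with unit amplitudes.

```python
import itertools
import numpy as np

def cdma_states(S, amps, h, user):
    """Enumerate noise-free received chip vectors and the labels b_user(k)."""
    M, N = S.shape
    n_h = len(h)
    L = 1 if n_h == 1 else 1 + int(np.ceil(n_h / M))   # ISI span rule from the text
    states, labels = [], []
    for bits in itertools.product([-1.0, 1.0], repeat=L * N):
        b = np.array(bits).reshape(L, N)               # b[j] = b(k - j)
        # Transmitted chips in time order: oldest symbol first, symbol k last.
        chips = np.concatenate([S @ (amps * b[j]) for j in range(L - 1, -1, -1)])
        rx = np.convolve(chips, h)[(L - 1) * M : L * M]  # chips of symbol k
        states.append(rx)
        labels.append(b[0, user])
    return np.array(states), np.array(labels)

codes = np.array([[+1, +1, +1, +1, -1, -1, -1, -1],
                  [+1, -1, +1, -1, -1, +1, -1, +1],
                  [+1, -1, -1, +1, -1, +1, +1, -1]], dtype=float)
S = codes.T / np.sqrt(8.0)                             # unit-length signatures
amps = np.ones(3)                                      # equal user amplitudes
states, labels = cdma_states(S, amps, h=[0.8, 0.6, 0.5], user=2)   # user 3
```

For three users and a three-tap channel with eight chips per symbol the span is L = 2, so the enumeration yields 2^6 = 64 eight-dimensional states.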
[Figure 4 plots: (a) Estimated Bit Error Rate versus Sample k, curves 16-LMS, 64-LMS, 16-LER and 64-LER; (b) Mean Square Error versus Sample k, curves 16-LMS and 64-LMS.]
Figure 4: Convergence rates in terms of (a) the estimated BER for various user-3
adaptive RBF detectors and (b) the MSE for user-3 LMS RBF detectors. The
results are averaged over 100 runs.
A 3-user system with 8 chips per symbol is used in the simulation. The
chip codes for the three users are (+1, +1, +1, +1, -1, -1, -1, -1), (+1, -1,
+1, -1, -1, +1, -1, +1) and (+1, -1, -1, +1, -1, +1, +1, -1), respectively,
and the channel is $H(z) = 0.8 + 0.6 z^{-1} + 0.5 z^{-2}$. The three users have equal
signal power. For this example, the number of states is $N_b = 64$. The detector
for user 3 is considered. Weights, centers and widths of the RBF detectors are
initialized in a similar manner as for the previous equalization application.
Given an SNR for user 3 of SNR_3 = 15 dB (the SINR for user 3 is SINR_3 = -3.08 dB),
RBF detectors with 16 and 64 centers are trained by the LMS and LER
algorithms, respectively. The learning curves in terms of the estimated BER
are depicted in Fig. 4(a) for the 4 adaptive RBF detectors. Fig. 4(b) shows
the learning curves in terms of the MSE for the 2 LMS RBF detectors. It can
be seen that the LER algorithm produces consistent results and, in particular,
the 64-center LER RBF detector is able to achieve the optimal performance.
For the LMS training, the algorithm converges very well in the MSE and
there is an almost 30 dB reduction in the MSE, as can be seen in Fig. 4(b).
However, the BERs of the two LMS RBF detectors both approach 0.5. In
fact, on average, the initial 16-center RBF detector has a BER of 0.2 and the
initial 64-center RBF detector has a BER of 0.008. Yet, after training using
the LMS, both yield almost 1 in 2 errors. This has nothing to do with local
minima. As far as the LMS algorithm is concerned, it does a good job in
what it is supposed to do: getting the MSE down. The BERs of the 16-center
LER RBF detector are compared with the optimal detector in Fig. 5 for a [...]
[Figure 5 plot: log10(Bit Error Rate) versus SNR_3 (dB), curves linear MMSE, adaptive RBF and optimal.]
Figure 5: Performance comparison of three detectors for user 3. SNR_i, 1 ≤ i ≤ 3,
are identical. The adaptive RBF detector has 16 centers and is trained by the LER
algorithm.
CONCLUSIONS
Adaptive or sample-by-sample training of nonlinear classifiers is classically
done using some stochastic gradient algorithm, such as the LMS, that is
linked to the MMSE criterion. This is despite the fact that the MMSE
criterion bears no relationship to the MER criterion. It is widely acknowledged
that adaptive training of neural network classifiers often encounters
difficulties. These difficulties are typically attributed to "local minima". Rarely
is a more fundamental question asked: whether the underlying criterion
adopted, the MMSE, is a relevant one for the problem.
The main contribution of this paper is to derive an adaptive MER
algorithm called the LER for training a class of neural network classifiers that
includes adaptive nonlinear equalisers and multiuser detectors. Our approach
has been motivated by a kernel density estimation of the error rate as a
smooth function of the training data and an adoption of the stochastic gradient
of the estimated error probability. This LER algorithm has been applied to
channel equalization and CDMA multiuser detection using an RBF network.
Simulation results have demonstrated that the LER algorithm performs
consistently and has a good convergence speed. A small-size
RBF network trained by the LER algorithm can closely match the
optimal Bayesian performance. The results also confirm that the standard LMS [...]
[1] G.J. Gibson, S. Siu, S. Chen, C.F.N. Cowan and P.M. Grant, "The application of nonlinear architectures to adaptive channel equalisation," in Proc. ICC'90 (Atlanta, USA), 1990, pp. 312.8.1-312.8.5.
[2] S. Chen, G.J. Gibson, C.F.N. Cowan and P.M. Grant, "Adaptive equalization of finite non-linear channels using multilayer perceptrons," Signal Processing, Vol. 20, No. 2, pp. 107-119, 1990.
[3] S. Chen, G.J. Gibson, C.F.N. Cowan and P.M. Grant, "Reconstruction of binary signals using an adaptive radial-basis-function equalizer," Signal Processing, Vol. 22, No. 1, pp. 77-93, 1991.
[4] B. Aazhang, B.P. Paris and G.C. Orsak, "Neural networks for multiuser detection in code-division multiple-access communications," IEEE Trans. Communications, Vol. 40, No. 7, pp. 1212-1222, 1992.
[5] U. Mitra and H.V. Poor, "Neural network techniques for adaptive multiuser demodulation," IEEE J. Selected Areas in Communications, Vol. 12, No. 9, pp. 1460-1470, 1994.
[6] S. Chen, A.K. Samingan and L. Hanzo, "Support vector machine multiuser receiver for DS-CDMA signals in multipath channels," IEEE Trans. Neural Networks, to appear, May 2001.
[7] S. Chen, E.S. Chng, B. Mulgrew and G. Gibson, "Minimum-BER linear-combiner DFE," in Proc. ICC'96 (Dallas, Texas), 1996, Vol. 2, pp. 1173-1177.
[8] S. Chen, B. Mulgrew, E.S. Chng and G. Gibson, "Space translation properties and the minimum-BER linear-combiner DFE," IEE Proc. Communications, Vol. 145, No. 5, 1998, pp. 316-322.
[9] B. Mulgrew and S. Chen, "Stochastic gradient minimum-BER decision feedback equalisers," in Proc. IEEE Symposium on Adaptive Systems for Signal Processing, Communication and Control (Lake Louise, Alberta, Canada), Oct. 1-4, 2000, pp. 93-98.
[10] S. Chen, A.K. Samingan, B. Mulgrew and L. Hanzo, "Adaptive minimum-BER linear multiuser detection for DS-CDMA signals in multipath channels," IEEE Trans. Signal Processing, Vol. 49, No. 6, pp. 1240-1247, 2001.
[11] C.C. Yeh and J.R. Barry, "Adaptive minimum bit-error rate equalization for binary signaling," IEEE Trans. Communications, Vol. 48, No. 7, pp. 1226-1235, 2000.
[12] C.C. Yeh, R.R. Lopes and J.R. Barry, "Approximate minimum bit-error rate multiuser detection," in Proc. Globecom'98 (Sydney, Australia), Nov. 1998, pp. 3590-3595.
[13] I.N. Psaromiligkos, S.N. Batalama and D.A. Pados, "On adaptive minimum probability of error linear filter receivers for DS-CDMA channels," IEEE Trans. Communications, Vol. 47, No. 7, pp. 1092-1102, 1999.
[14] B.W. Silverman, Density Estimation. London: Chapman & Hall, 1996.
[15] A.W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis. Oxford University Press, 1997.
[16] S. Chen and B. Mulgrew, "Overcoming co-channel interference using an adaptive radial basis function equaliser," Signal Processing, Vol. 28, No. 1, pp. 91-