Contents lists available atScienceDirect
C. R.
Acad.
Sci.
Paris,
Ser. I
www.sciencedirect.com
Statistics
On
bandwidth
parameter
choices
for
discrete
nonparametric
kernel
estimator
Choix
de
fenêtres
de
lissage
pour
un
estimateur
non
paramétrique
à
noyau
discret
Tristan Senga
Kiessé
INRA,UMR1069,SolAgroetHydrosystèmeSpatialisation,35000Rennes,France
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Articlehistory: Received20May2012
Acceptedafterrevision10February2016 Availableonline13May2016
PresentedbytheEditorialBoard
This noteconcentrates on the nonparametric estimation ofa probability mass function (p.m.f.) using discrete associated kernels. An expression of the optimal bandwidth minimizing the asymptoticpart of the global squared error is given. Some asymptotic expressions ofbias and varianceof thecross-validation criterionare alsopresented.At last,thetwobandwidthselectionproceduresareillustratedthroughsomesimulationsand anapplicationonarealcountdataset.
©2016Académiedessciences.PublishedbyElsevierMassonSAS.Thisisanopenaccess articleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).
r
é
s
u
m
é
Cette note se focalise sur l’estimation non paramétrique à noyau associé discret d’une fonction de masse de probabilité. Une expression de la fenêtre optimale minimisant la partie asymptotique de l’erreur quadratique globale est donnée. Des expressions asymptotiquespour lebiaisetlavarianced’uncritèredesélectionparvalidation croisée sont égalementprésentées. Enfin,lesdeux méthodesde choixdefenêtre sontillustrées pardessimulationsetuneapplicationsurdesdonnéesréelles.
©2016Académiedessciences.PublishedbyElsevierMassonSAS.Thisisanopenaccess articleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Let
(
Xi)
i=1,2,··· ,n beasampleofi.i.d.discreterandomvariables(r.v.)havingap.m.f. f(
x)
=
Pr(
Xi=
x)
>
0 onsupportS
(e.g.,non-negativeintegerssetN
).Thediscretekernelestimatorfn of f canbeexpressedas fn(
x)
=
1 n n i=1 Kx,h(
Xi),
x∈
S,
(1)E-mailaddress:[email protected]. http://dx.doi.org/10.1016/j.crma.2016.02.012
1631-073X/©2016Académiedessciences.PublishedbyElsevierMassonSAS.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense (http://creativecommons.org/licenses/by-nc-nd/4.0/).
whereh
=
h(
n)
>
0 isan arbitrarysequenceofsmoothingparametersthat fulfillslimn→∞h(
n)
=
0 and Kx,h(
·)
isap.m.f., calleddiscreteassociatedkernel,withr.v.K
x,honsupportS
x (containingtarget x andnotdependingonh)satisfyingH1
:
limh→0E
(K
x,h)
=
x,
and H2:
h→lim0Var(K
x,h)
=
0;
withoutlossofgenerality,weassume
S
x⊆
S
inallthiswork.Thenonparametrickernelestimationhasbeenalreadylargely discussed in the literature for the probability density function (p.d.f.), in particular for choosing smoothing parameters; seeBowman[2]orMarron [6].Incontrast,thediscretekernelestimationforp.m.f.hasreceivedmorelessattentionthan that for p.d.f.In fact, until now, it commonly results from the discretization ofthe continuous case by considering the ordinalvariablesasbeingcontinuous;seeAitchisonandAitken[1],Titterington[8]orSimonoffandTutz[7].Foraspecific investigation ofcount data,Kokonendjietal.[4]andKokonendjiandSengaKiessé[3]havedevelopedthe nonparametric kernelestimationusingsomediscreteassociatedkernels.Thisopenswaytoaspecifictreatmentofthebandwidthselections ofdiscretekernelestimatorsbyadaptingclassicalmethods,whicharewellknownforthecontinuouscase.Thisnotepursues the first works realizedon thissubject, witha first attemptto provide an expression of theoptimal bandwidth andby investigatingtheasymptoticpropertiesofthecross-validationcriterionforfn in(1).2. Discreteassociatedkernels
LetusfirstintroducethefollowinghypothesesonmodalprobabilityofKx,h:
H1
:
Kx,h(
x)
=
1−
hA(K
x,h)
+
O(
h2),
with
y∈Sx\{x}Kx,h(
y)
=
hA(
K
x,h)
+
O(
h2)
→
0 as h→
0,where A(
K
x,h)
isboundedaway from0 forh→
0.Itresultsin thefollowingproposition.Proposition2.1.Considerafixedpointx
∈
S
andthebandwidthh>
0.UnderassumptionH1,theexpectationandvarianceofthe discreteassociatedkernelKx,hfulfillH1andH2.Proof. FortheassumptionH1,theexpectationof
K
x,hcomesdirectlyfromE(
K
x,h)
=
xKx,h(
x)
+
y∈Sx\{x} y Kx,h(
y)
=
x+
B(
x;
h),
withB(
x;
h)
= −
xhA(K
x,h)
+
y∈Sx\{x}y Kx,h(
y)
+
O(
h2
)
→
0 forh→
0.FortheassumptionH2,thevarianceofK
x,hcanbe successivelyexpressedas Var
(K
x,h)
=
y∈Sx y2Kx,h(
y)
−
y∈Sx y Kx,h
(
y)
2=
x2{
Kx,h(
x)
−
1} +
D(
x;
h),
withD(
x;
h)
=
y∈Sx\{x}y 2K x,h(
y)
+
x2−
x+
y∈S x(
y−
x)
Kx,h(
y)
2→
0 whenh→
0 underthehypothesesH1.Indeed, whenh→
0,wehaveboth Kx,h(
y)
→
0 for y=
x and Kx,h(
x)
→
1 for y=
x.2
Then,thefollowingassumptioncanbeformulatedonthevarianceof
K
x,h:H2
:
Var(K
x,h)
=
hV(K
x,h)
+
O(
h2),
leading tosome hypothesesH1 andH2lessgeneralthanH1andH2;A
(K
x,h)
andV(K
x,h)
donotobligatorydependonx andh,asinthenextexample.Example. (SeeKokonendjiandZocchi[5].) Leta1
,
a2 befixed integersandh1,
h2 besmoothingparameters. Foranyfixed x∈
S = Z
,considerthe r.v.K
a1,a2;x,h1,h2 ofdiscrete generalizedtriangularassociatedkernels definedonsupportsS
a1,x=
{
x−
1,
x−
2,
. . . ,
x−
a1}
andS
x,a2= {
x,
x+
1,
. . . ,
x+
a2}
andwhosep.m.f.isKa1,a2;x,h1,h2
(
y)
=
1 P1
−
x−
y a1+
1 h1 1Sa1,x(
y)
+
1
−
y−
x a2+
1 h2 1Sx,a2(
y)
,
where P
≡ (a
1+
a2+
1)
− (a
1+
1)
−h1ak=11kh1− (a
2+
1)
−h2ak=21kh2=
P(
a1,
a2,
h1,
h2)
isthenormalizingconstant.Forh sufficientlysmall,onehasKai;x,hi
(
x)
=
1−
2 i=1 hiA(
ai)
+
O(
h2i)
and Var(K
ai;x,hi)
=
2 i=1 hiV(
ai)
+
O(
h2i)
,
with A(
ai)
=
ailog(
ai+
1)
−
aik=1log
(
k)
andV(
ai)
=
ai(
2a2i+
3ai+
1)
log(
ai+
1)/
6−
aik=1k2log
(
k)
.Hence,theassumptions H1andH2 arefulfilled.3. Nonparametrickernelestimator
Thissectionpresentsanoptimalh-valueminimizinganapproximateofthemeanintegratedsquarederror(MISE)of
fn in(1),incomparisonwiththeh-valuemimimizingcross-validationfunction(KokonendjiandSengaKiessé[3]).3.1. MinimizationofMISE
The
fn’svarianceandbiashavebeenestablishedby KokonendjiandSengaKiessé[3]underH1andH2;bytakinginto accountH1andH2,theglobalerrorisMISE
(
h)
=
x∈S Var{
fn(
x)
} +
x∈S Bias2{
fn(
x)
} =
AMISE(
h)
+
o 1 n+
h 2+
O(
h2),
where AMISE(
h)
=
1 n x∈S f(
x)
1
−
hA(K
x,h)
2−
f(
x)
+
1 4h 2 x∈S V(K
x,h)
f(2)(
x)
2istheapproximatedMISEforh sufficientlysmall,with f(2)afinitedifferenceofsecondorder.Hence,anapproximatevalue
hopt=
arg minh>0AMISE(
h)
ofthetrueoptimalbandwidthhopt=
arg minh>0MISE(
h)
canbegivenby hopt(
n,
f)
=
x∈S f(
x)
A(K
x,h)
x∈S f(
x)
{
A(K
x,h)
}
2+ (
n/
4)
x∈S V(K
x,h)
f(2)(
x)
2,
whichtendsto0 asn→ ∞
with0<
x∈SV(K
x,h)
f(2)(
x)
2<
∞
.Itresultsinanasymptoticrelationshipsuchthathopt∼
k0n−1 withk0=
4 x∈S f(
x)
A(K
x,h)/
[
x∈SV(K
x,h)
f(2)(
x)
2]
.Thus,the discretegeneralizedtriangularassociated kernel presentedasanexamplehasoptimalbandwidth hopt(
n,
ai,
f)
=
A
(
ai)
2
{
A(
ai)
}
2+ (
n/
2)
{
V(
ai)
}
2x∈Zf(2)
(
x)
2,
i=
1,
2.
(2)3.2. Minimizationofcross-validationfunction
We are interested in a bandwidth hcv minimizing a cross-validation score function CV with respect to h such that hcv
=
arg minh>0CV(
h)
withCV
(
h)
=
x∈S 1 n n i=1 Kx,h(
Xi)
2−
2 n(
n−
1)
n i=1 j=i KXi,h Xj.
(3)TheCV’smeanandvarianceforfixedh areprovidedinthefollowingtheorem.
Theorem3.1.Considerafixedpointx
∈
S
andthebandwidthh=
h(
n)
>
0 suchthat limn→∞h
=
0.AssumethatCVisthe cross-validationfunctionforthenonparametricestimatorin(1)ofp.m.f.f .Then,wehaveE{
CV(
h)
} =
AMISE(
h)
−
x∈S f2(
x)
+
O 1 n+
h 2 and Var{
C V(
h)
} =
1 n x∈S x∈S Kx2,h(
x)
−
2Kx,h(
x)
2 f3(
x)
−
1 n x∈S f2(
x)
2+
O h2 n+
1 n2,
whereKx,hisadiscretekernelsatisfyingtheassumptionH1. Proof. Letuspresentthecross-validationscorefunctionin(3)as
C V
(
h)
=
1 n2 n i=1 x∈S K2x,h(
Xi)
+
2 n2j<i Hi j
,
with Hi j=
x∈SKx,h(
Xi)
Kx,h(
Xj)
−
2KXi,h Xj . One hasE{
C V(
h)
}
= (
1/
n)
x∈SE{
K2 x,h(
X1)
}
+ (
1−
1/
n)
E(
Hi j)
. Forthe firsttermofE{
C V(
h)
}
,weget:E{
K2x,h(
X1)
} =
x∈S Kx2,h(
x)
f(
x)
+
x∈S y∈S\{x} Kx2,h(
y)
f(
y)
;
wherethesecondterminthepreviousequationisfinite.Nowletusexpress
E{
KX1,h(
X2)
} =
x∈SE{
Kx,h(
X2)
}
f(
x),
E{
Kx,h(
X1)
} =
y∈S Kx,h(
y)
f(
y)
= E{
f(K
x,h)
},
and
E{
f(
K
x,h)
}
=
f(
x)
+(
1/
2)
f(2)(
x)
Var(
K
x,h)
+
o(
h)
obtainedbyusingadiscreteTaylorexpansionaroundx.Forthesecond termofE{
C V(
h)
}
,itensuesE(
Hi j)
=
h2 4 x∈S V(K
x,h)
f(2)(
x)
2−
x∈S f2(
x)
+
o(
h2)
+
O(
h2),
whereo
(
h2)
isasymptoticallydominatedby O(
h2)
.Hence,itresultsE{
C V(
h)
}
. ForCV’svariance,webeginbycalculatingthevarianceofthefirsttermas1 n3Var
x∈S Kx2,h(
X1)
=
1 n3 x∈S Kx2,h(
x)
2 f(
x)
−
x∈S K2x,h(
x)
f(
x)
2+
Qn(
x;
h),
with n3Qn(
x;
h)
=
y∈S\{x} x∈S Kx2,h(
y)
2 f(
y)
+
x∈S K2x,h(
x)
f(
x)
2−
y∈S x∈S Kx2,h(
y)
f(
y)
2,
where Qn iso
(
1/
n3)
;infact,we canassumeVarx∈Sn−2ni=1Kx2,h(
Xi)
=
o(
n−3)
+
O(
n−3)
.Consideringthevariance ofthesecondtermofCV,wehave1 n4Var
j<i Hi j=
1 n2Var(
Hi j)
+
1 n 1−
3 n Cov(
Hi j,
Hik)
+
O 1 n3.
Without giving all details, we get
E(
Hi j2)
=
x∈Sx∈SKx2,h(
x)
−
2Kx,h(
x)
2f2
(
x)
+
o(
h2)
. Then, by usingexpression ofE(
Hi j)
calculatedpreviously,wehaveVar
(
Hi j)
=
x∈S x∈S K2x,h(
x)
−
2Kx,h(
x)
2 f2(
x)
−
x∈S f2(
x)
2+
h2 2 x∈S V(K
x,h)
f(2)(
x)
2x∈S f2
(
x)
+
o(
h2).
Inaddition,itcanbeshownthat
E(
Hi jHik)
=
x∈S x∈S K2x,h(
x)
−
2Kx,h(
x)
2 f3(
x)
+
o(
h3)
;
then,byconsideringCov
(
Hi j,
Hik)
= E(
Hi jHik)
− E(
Hi j)
E(
Hik)
,wehaveCov
(
Hi j,
Hik)
=
x∈S x∈S Kx2,h(
x)
−
2Kx,h(
x)
2 f3(
x)
−
x∈S f2(
x)
2+
h2 2 x∈S V(K
x,h)
f(2)(
x)
2 x∈S f2(
x)
+
o(
h3)
+
O(
h2),
withCov
(
Hi j,
Hkl)
=
0.HencethedesiredresultonVar{
C V(
h)
}
.2
4. IllustrationsThis section illustrates the bandwidth choices; discrete symmetric and generalized triangular kernels are applied for simulatedandrealdata,respectively.
Table 1
AveragesI S E,hoptandhcvfornonparametricestimatorusingtriangularkernels.
Sample size n hopt I S E(hopt) hopt I S E(hopt) hcv I S E(hcv)
a=2 25 0.49 0.0028 0.23 0.0116 0.73 0.0130 100 0.16 0.0011 0.12 0.0043 0.23 0.0053 a=3 25 0.22 0.0034 0.10 0.0140 0.43 0.0149 100 0.06 0.0009 0.04 0.0057 0.07 0.0062 a=4 25 0.12 0.0033 0.05 0.0196 0.23 0.0195 100 0.03 0.0006 0.02 0.0065 0.04 0.0068
Fig. 1. Estimations of the number of goals per match using a discrete generalized triangular kernel. 4.1. Simulations
Consider simulated data of sample size n from a Poisson distribution f with mean
μ
=
2. The estimatorfn using symmetrictriangularkernelsisappliedwitha1=
a2=
a∈ {
2,
3,
4}
,asanexample.Theperformanceoffn isevaluatedvia theintegratedsquarederrorI S E(
h)
=
x∈N{
fn(
x)
−
f(
x)
}
2 determinedusinghopt,hcv andtruehoptcomputednumerically since f is known. In Table 1, the averages I S E,hopt andhcv are calculated based on 100 replications of the simulated data.The valuehopt iscompetitive since globally we have I S E(
hopt)
≤
I S E(
hcv)
for chosen valuesofa; note that hopt is underestimatedbyhoptandoverestimatedby hcv.Atlast,byestimating f usingempiricalfrequencyofdata, I S E isequal to0.
0308 forn=
25 and0.
0088 forn=
100.4.2. Application
Theestimator
fnisappliedusingageneralizedtriangularkernelwith(
a1,
a2)
= (
1,
2)
oncountdata,describinganumber ofgoalspermatchfromtheFrenchfootballchampionship(Kokonendjietal.[4]).Byacross-validationprocedure,wehave h1cv=
0.
170 andh2cv=
0.
085,whiletheexpression(2)resultsinhopt,01=
0.
156 andhopt,02=
0.
031,obtainedbyreplacing theunknownp.m.f. f in(2)withitsempiricalfrequencyestimate.Thequalityoftheestimateprovidedusing(
h1cv,
h2cv)
is closetothatoftheonecalculatedusing(
hopt,01,
hopt,02)
,sinceISEis,respectively,equalto3.
0081×
10−4and3.
0082×
10−4 (Fig. 1).References
[1]J.Aitchison,C.G.G.Aitken,Multivariatebinarydiscriminationbythekernelmethod,Biometrika63(1976)413–420. [2]A.Bowman,Analternativemethodofcross-validationforthesmoothingofdensityestimates,Biometrika71(1984)352–360. [3]C.C.Kokonendji,T.SengaKiessé,Discreteassociatedkernelmethodandextensions,Stat.Methodol.8(2011)497–516.
[4]C.C.Kokonendji,T.SengaKiessé,S.S.Zocchi,Discretetriangulardistributionsandnon-parametricestimationforprobability massfunction,J. Non-parametr.Stat.19(2007)241–254.
[5]C.C.Kokonendji,S.S.Zocchi,Extensionsofdiscretetriangulardistributionsandboundarybiasinkernelestimationfordiscretefunctions,Stat.Probab. Lett.80(2010)1655–1662.
[6]J.S.Marron,Acomparisonofcross-validationtechniquesindensityestimation,Ann.Stat.15(1987)152–162.
[7]J.S.Simonoff,G.Tutz,Smoothingmethodsfordiscretedata,in:M.G.Schimek(Ed.),SmoothingandRegression:Approaches,Computation,and Applica-tion,Wiley,NewYork,2000,pp. 193–228.