Kernel density estimation with adaptive varying window size
Vladimir Katkovnik^a, Ilya Shmulevich^b,*

^a Kwangju Institute of Science and Technology, Kwangju, South Korea
^b University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA
Received 21 November 2000; received in revised form 21 August 2001
Abstract
A new method of kernel density estimation with a varying adaptive window size is proposed. It is based on the so-called intersection of confidence intervals (ICI) rule. Several examples of the proposed method are given for different types of densities and the quality of the adaptive density estimate is assessed by means of numerical simulations. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Non-parametric; Kernel; Density estimation; Parzen; ICI rule
1. Introduction
In pattern recognition, optimal algorithms often require the knowledge of the underlying densities of signals and/or noise. As these densities are usually unknown, unrealistic assumptions are frequently made, thus compromising the performance of the algorithms in question. A common approach to this problem is to estimate the density from the data. If a particular form of the density is assumed or known, then parametric estimation is used. If nothing is assumed about the density shape, non-parametric estimation is employed. Besides being widely used in the field of pattern recognition and classification (Fukunaga, 1990), non-parametric probability density estimation has been applied in image processing (Sindoukas et al., 1997; Wright et al., 1997), communications (Zabin and Wright, 2000), and many other fields.
One of the most well-known and popular techniques of non-parametric density estimation is the kernel or Parzen density estimate (Fukunaga, 1990; Parzen, 1962; Cacoullos, 1966). Given $N$ samples $X_1, \ldots, X_N$ drawn from a population with density function $f(x)$, $x \in \mathbb{R}^1$, the Parzen density estimate at $x$ is given by

$$\hat{f}_h(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h}\, \kappa\!\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $\kappa(\cdot)$ is a window or kernel function and $h$ is the window width, smoothing parameter, or simply the kernel size. Traditionally, it is assumed that $\int \kappa(u)\,du = 1$ and $\kappa(\cdot)$ is symmetric, that is, $\kappa(-u) = \kappa(u)$. One popular choice is the Gaussian kernel $\kappa(u) = (1/\sqrt{2\pi}) \exp(-u^2/2)$.
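As a concrete illustration of Eq. (1), the following Python sketch evaluates the estimate with a Gaussian kernel at a single point. The function name, sample size, and bandwidth are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen density estimate of Eq. (1) with a Gaussian kernel."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum() / (len(samples) * h)

# usage: estimate a standard normal density at x = 0
rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)
est = parzen_estimate(0.0, samples, h=0.2)
```

For this sample size and bandwidth, `est` lands close to the true value $f(0) = 1/\sqrt{2\pi} \approx 0.399$.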
The kernel size $h$ is the most important characteristic of the Parzen density estimate (Raudys,
* Corresponding author. Tel.: +1-713-745-1502; fax: +1-801-382-2064.
E-mail addresses: vlkatkov@hotmail.com (V. Katkovnik), is@ieee.org (I. Shmulevich).
1991; Silverman, 1978). One can compute the ideal or optimal value of $h$ by minimizing the mean-square error

$$\mathrm{MSE}\{\hat{f}_h(x)\} = E\{[\hat{f}_h(x) - f(x)]^2\} \qquad (2)$$
between the true and estimated densities, with respect to $h$. The MSE is a function of $x$ and so the optimal kernel size $h$ is also a function of $x$. In order to minimize the MSE, a best compromise between variance and bias must be selected. We use Taylor series approximations of the moments of $\hat{f}_h(x)$ and note that

$$\mathrm{MSE}\{\hat{f}_h(x)\} = [E\{\hat{f}_h(x)\} - f(x)]^2 + \mathrm{Var}\{\hat{f}_h(x)\}, \qquad (3)$$

where

$$E\{\hat{f}_h(x)\} = \int \kappa(u) f(x + hu)\,du \qquad (4)$$

and

$$\mathrm{Var}\{\hat{f}_h(x)\} = \frac{1}{N}\left[\frac{1}{h}\int \kappa^2(u) f(x + hu)\,du - E^2\{\hat{f}_h(x)\}\right]. \qquad (5)$$

We then have that for small $h$ and large $N$ ($h \to 0$, $N \to \infty$, and $Nh \to \infty$),

$$\mathrm{Bias}\{\hat{f}_h(x)\} = E\{\hat{f}_h(x)\} - f(x) \simeq \begin{cases} \dfrac{h^2}{2}\, f''(x) \displaystyle\int u^2 \kappa(u)\,du, & \text{if } \kappa(-u) = \kappa(u), \\[2mm] h\, f'(x) \displaystyle\int u\, \kappa(u)\,du, & \text{otherwise,} \end{cases} \qquad (6)$$

and

$$\mathrm{Var}\{\hat{f}_h(x)\} \simeq \frac{f(x)}{Nh} \int \kappa^2(u)\,du. \qquad (7)$$
Then, the optimal value of the kernel size can be shown to be equal to (Parzen, 1962)

$$h_0(x) = \left(\frac{f(x) \int \kappa^2(u)\,du}{N \left[f''(x) \int u^2 \kappa(u)\,du\right]^2}\right)^{1/5} \qquad (8)$$

for a symmetric kernel, when $h \to 0$ as $N \to \infty$ and $Nh \to \infty$, assuring asymptotic unbiasedness and consistency of the estimate. Similarly,

$$h_0(x) = \left(\frac{f(x) \int \kappa^2(u)\,du}{2N \left[f'(x) \int u\, \kappa(u)\,du\right]^2}\right)^{1/3} \qquad (9)$$

for a non-symmetric window. As can be seen from Eqs. (8) and (9), the optimal kernel size depends on the value of the density and on its second or first derivative. These equations, in particular the order with respect to $N$, depend essentially on whether the kernel is symmetric or non-symmetric. It is also possible to obtain an optimal constant kernel size independent of $x$ by minimizing either the integral mean-square error $\int \mathrm{MSE}\{\hat{f}_h(x)\}\,dx$ or the expected mean-square error $\int \mathrm{MSE}\{\hat{f}_h(x)\} f(x)\,dx$ (Fukunaga, 1990; Parzen, 1962).
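For a case where $f$ is known, Eq. (8) can be evaluated in closed form. The sketch below does this for a standard normal density with a Gaussian kernel; this choice of density is an illustrative assumption, and note the formula degenerates at points where $f''(x) = 0$ (here $x = \pm 1$).

```python
import numpy as np

def h0_symmetric(x, N):
    """Point-wise optimal kernel size from Eq. (8), evaluated for a
    standard normal density f and a Gaussian kernel.  For the
    Gaussian kernel: int k^2(u) du = 1/(2 sqrt(pi)) and
    int u^2 k(u) du = 1."""
    f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # f(x)
    f2 = (x**2 - 1.0) * f                            # f''(x) of the normal
    kappa_sq = 1.0 / (2.0 * np.sqrt(np.pi))          # int k^2(u) du
    mu2 = 1.0                                        # int u^2 k(u) du
    return (f * kappa_sq / (N * (f2 * mu2) ** 2)) ** 0.2

# at x = 0 the expression reduces to (1 / (sqrt(2) N))^(1/5)
print(h0_symmetric(0.0, 1000))
```

The printed value also illustrates the slow $N^{-1/5}$ shrinkage of the optimal bandwidth with sample size.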
Clearly, in practice, one does not have access to the true density function $f(x)$ which is to be estimated. Thus, a number of heuristic approaches can be taken for finding the window width. For instance, the optimal constant $h$ can be computed for, say, the Normal distribution (as a function of $N$) and then used for making the estimate $\hat{f}_h(x)$ (Fukunaga, 1990). Since the density estimates are often used for classification purposes, another approach is to determine $h$ on the basis of the expected probability of misclassification (Raudys, 1991). Although $h$ is usually taken to be a constant, several important approaches have been proposed to vary it. One is the well-known $k$th nearest neighbor estimate (Loftsgaarden and Quesenberry, 1965). Another method is the adaptive kernel estimate proposed in (Breiman et al., 1977). A number of other papers considering the problem of kernel size selection exist (e.g., Abramson, 1982; Terrell and Scott, 1992; Chiu, 1992; Sheather and Jones, 1991; Hall et al., 1991; Taylor, 1989).
In this paper, we propose and develop a special rule (statistic) for choosing a data-driven kernel size that is selected in a point-wise manner for every argument value of the density. Our main motivation for this more complex estimator is to make it adaptive to the unknown and varying smoothness of the density to be estimated. This method is described in Section 2. In Section 3, we assess the quality of the density estimate by comparing it to a known density which we are
estimating, under the mean-square error criterion. It will be shown that the estimates based on variable-sized kernels are superior to the estimates based on optimal constant-sized kernels.
2. Adaptive kernel size selection
The method of adaptive kernel size selection is based on the ICI rule, proposed in (Goldenshluger and Nemirovsky, 1997) for adaptive regression smoothing and later developed for signal filtering in (Katkovnik, 1999). One of several attractive properties of the ICI rule is that it is spatially adaptive over a wide range of signal classes, in the sense that its quality is close to that which one could achieve if the smoothness of the original signal were known in advance (Goldenshluger and Nemirovsky, 1997). We briefly reintroduce this method here in the context of density estimation.
Consider the ratio of the standard deviation $\mathrm{Std}\{\hat{f}_h(x)\}$ of the estimate to the absolute value of the bias $|\mathrm{Bias}\{\hat{f}_h(x)\}|$ of the estimate, evaluated at the ideal value $h_0(x)$ given in Eqs. (8) and (9). We get

$$\frac{\mathrm{Std}\{\hat{f}_{h_0}(x)\}}{|\mathrm{Bias}\{\hat{f}_{h_0}(x)\}|} = k, \qquad k = \begin{cases} 2, & \text{if } \kappa(-u) = \kappa(u), \\ \sqrt{2}, & \text{otherwise.} \end{cases}$$

It is useful to note that, as $h \to 0$, the standard deviation is monotonically increasing and the bias is monotonically decreasing, so that

$$\frac{1}{k}\,\mathrm{Std}\{\hat{f}_h(x)\} \geq |\mathrm{Bias}\{\hat{f}_h(x)\}|, \quad \text{if } h \leq h_0,$$
$$\frac{1}{k}\,\mathrm{Std}\{\hat{f}_h(x)\} \leq |\mathrm{Bias}\{\hat{f}_h(x)\}|, \quad \text{if } h \geq h_0. \qquad (10)$$

Although it is known that in regions where the density function is convex, it is theoretically possible to find bandwidths for which the point-wise bias is equal to zero (Sain and Scott, 1996), we assume the monotonicity of the bias with respect to the bandwidth locally, for the theory and initial motivation. Moreover, these zero-bias bandwidths are typically much larger than the asymptotic data-dependent bandwidth given in (8), while Eq. (10) holds for $h \to 0$.
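The value $k = 2$ for a symmetric kernel can be verified by substituting $h_0$ from Eq. (8) into the bias and variance approximations (6) and (7); the following is a sketch filling in that step.

```latex
% With A = f(x)\int\kappa^2(u)\,du and B = \tfrac{1}{2}f''(x)\int u^2\kappa(u)\,du,
% Eqs. (6)-(7) give Var = A/(Nh) and Bias = B h^2, while Eq. (8) reads
% h_0^5 = A / (N (2B)^2).  Then
\mathrm{Bias}^2\{\hat f_{h_0}\} = B^2 h_0^4
  = \frac{B^2}{h_0}\cdot\frac{A}{N(2B)^2}
  = \frac{1}{4}\cdot\frac{A}{N h_0}
  = \frac{1}{4}\,\mathrm{Var}\{\hat f_{h_0}\},
\qquad\text{so}\qquad
\frac{\mathrm{Std}\{\hat f_{h_0}\}}{|\mathrm{Bias}\{\hat f_{h_0}\}|} = 2 .
```

The same substitution with Eq. (9) and the non-symmetric bias term of (6) yields the ratio $\sqrt{2}$.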
For a given kernel size $h$, the estimation error can be represented as

$$|\hat{f}_h(x) - f(x)| = |\mathrm{Bias}\{\hat{f}_h(x)\} + \xi_h(x)| \leq |\mathrm{Bias}\{\hat{f}_h(x)\}| + |\xi_h(x)|,$$

where $\xi_h(x)$ is a random variable with zero mean and standard deviation equal to $\mathrm{Std}\{\hat{f}_h(x)\}$. Thus,

$$|\hat{f}_h(x) - f(x)| \leq |\mathrm{Bias}\{\hat{f}_h(x)\}| + v_p\, \mathrm{Std}\{\hat{f}_h(x)\} \qquad (11)$$

holds with an arbitrary probability $p$, for a suitably chosen $v_p$. Using the relationships in (10) together with (11), we get that for $h \leq h_0$,

$$|\hat{f}_h(x) - f(x)| \leq \left(\frac{1}{k} + v_p\right) \mathrm{Std}\{\hat{f}_h(x)\} = C\, \mathrm{Std}\{\hat{f}_h(x)\}, \qquad (12)$$

where $C = 1/k + v_p$. Larger values of $C$ correspond to larger values of $p$. The ICI rule essentially tests the hypothesis $h \leq h_0$ for various values of $h$ and in this way selects an $h$ close to $h_0$, as follows.
Suppose $H = \{h_1 < h_2 < \cdots < h_J\}$ is a finite collection of kernel sizes, starting with a small $h_1$. Using inequality (12), we determine a sequence of confidence intervals

$$D_j = [L_j, U_j], \quad j = 1, \ldots, J,$$
$$L_j = \hat{f}_{h_j}(x) - C\, \mathrm{Std}\{\hat{f}_{h_j}(x)\},$$
$$U_j = \hat{f}_{h_j}(x) + C\, \mathrm{Std}\{\hat{f}_{h_j}(x)\}, \qquad (13)$$

each one corresponding to a kernel size in $H$. The ICI rule, then, can be stated as follows (Katkovnik, 1999).

ICI rule: Consider the intersection of the intervals $D_j$, $1 \leq j \leq i$, with increasing $i$, and let $i^+$ be the largest of those $i$ for which the intervals $D_j$, $1 \leq j \leq i$, have a point in common. This $i^+$ defines the adaptive kernel size $h^+(x) = h_{i^+}$ and, consequently, the density estimate $\hat{f}_{h^+(x)}(x)$.
It is important to note that the kernel size selection procedure based on the ICI rule requires only the knowledge of the density estimate and its variance, for which Eq. (7) can be used. However, this variance, in turn, depends on the unknown density to be estimated. A pilot estimate of the density can be used in (7) instead of $f(x)$. However, it is emphasized that this pilot estimate should be obtained more or less independently with respect to the final estimate $\hat{f}_{h^+(x)}(x)$. In fact, this is a general rule of using pilot estimates in statistics (see, e.g., Fan and Gijbels, 1996). The kernel density estimate with a constant window size $h = \hat{h}$ is a good choice for the considered problem. In our simulation experiments, we employ the Sheather–Jones plug-in method (Sheather and Jones, 1991) for the estimation of $\hat{h}$ in the pilot density estimate. This method is known to have excellent performance as compared to other known methods. $C$ is a design parameter of the algorithm and the selection of its value is discussed in (Katkovnik and Shmulevich, 2000).
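Putting the pieces together, the ICI selection at a single point can be sketched in Python as follows. The variance of the estimate is approximated via Eq. (7) with a fixed-bandwidth Parzen pilot standing in for the unknown density; the Gaussian kernel, the value C = 2, and the pilot bandwidth are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), with a Gaussian kernel."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * SQRT_2PI)

def ici_kernel_size(x, samples, hs, C=2.0, pilot_h=0.1):
    """Select the kernel size at x by the ICI rule: track the running
    intersection of the intervals of Eq. (13) over increasing h and
    return the largest h for which the intersection is non-empty."""
    N = len(samples)
    kappa_sq = 1.0 / (2.0 * np.sqrt(np.pi))     # int k^2(u) du, Gaussian
    f_pilot = parzen(x, samples, pilot_h)       # pilot estimate for Eq. (7)
    lower, upper = -np.inf, np.inf
    h_plus = hs[0]
    for h in hs:
        f_hat = parzen(x, samples, h)
        std = np.sqrt(f_pilot * kappa_sq / (N * h))   # Eq. (7)
        L, U = f_hat - C * std, f_hat + C * std
        if max(lower, L) > min(upper, U):       # intersection became empty
            break
        lower, upper = max(lower, L), min(upper, U)
        h_plus = h                              # last h with a common point
    return h_plus

rng = np.random.default_rng(1)
samples = rng.standard_normal(5000)
hs = np.arange(0.01, 1.0, 0.01)
h_adaptive = ici_kernel_size(0.0, samples, hs)
```

The returned bandwidth is always an element of the candidate grid `hs`, as the rule only chooses among the supplied sizes.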
This ICI procedure for the varying window density estimate can be implemented by Algorithm 1.

Algorithm 1. Adaptive Window Width Selection

$L \leftarrow -\infty$; $U \leftarrow +\infty$; $i \leftarrow 1$
while ($L \leq U$) and ($i \leq J$) do
    $L_i \leftarrow \hat{f}_{h_i}(x) - C\,\mathrm{Std}\{\hat{f}_{h_i}(x)\}$
    $U_i \leftarrow \hat{f}_{h_i}(x) + C\,\mathrm{Std}\{\hat{f}_{h_i}(x)\}$
    $L \leftarrow \max[L, L_i]$; $U \leftarrow \min[U, U_i]$
    $i \leftarrow i + 1$
end while
$h^+(x) \leftarrow h_{i-1}$

3. Simulation examples
In this section, we will illustrate the use of the kernel size selection procedure based on the ICI rule.
3.1. Qualitative simulation
This group of simulations is given in order to demonstrate the ability of the ICI rule to obtain optimal (reasonable) window sizes. As a first example, consider estimating the piece-wise constant density function shown in Fig. 1a. This example is intended to qualitatively demonstrate the behavior of the adaptive kernel size selection procedure, using the symmetric Gaussian kernel as well as the non-symmetric right and left kernels

$$\kappa_r(u) = \begin{cases} \dfrac{2}{\sqrt{2\pi}}\, \exp(-u^2/2), & u \geq 0, \\ 0, & u < 0, \end{cases} \qquad \kappa_l(u) = \kappa_r(-u). \qquad (14)$$
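Eq. (14) translates directly into code. The sign convention in `parzen_onesided` is an assumption on our part (Eq. (1) can be written with either sign of the kernel argument); with u = (X_i - x)/h, the right kernel averages over samples lying to the right of x.

```python
import numpy as np

def kappa_r(u):
    """Right one-sided Gaussian kernel of Eq. (14)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0.0,
                    2.0 * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi),
                    0.0)

def kappa_l(u):
    """Left kernel: the mirror image, kappa_l(u) = kappa_r(-u)."""
    return kappa_r(-np.asarray(u, dtype=float))

def parzen_onesided(x, samples, h, kernel):
    """Eq. (1) with a one-sided kernel; kernel is kappa_r or kappa_l.
    Sign convention chosen so kappa_r uses samples with X_i >= x."""
    u = (samples - x) / h
    return np.mean(kernel(u)) / h
```

Both one-sided kernels still integrate to one, so the estimates remain proper densities in the limit.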
The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{300} = 3.0$ with a step of 0.01. Fig. 1b–d shows the kernel sizes chosen by the ICI rule, corresponding to the three kinds of kernels ($\kappa(\cdot)$, $\kappa_r(\cdot)$, and $\kappa_l(\cdot)$) used. The number of observations $N$ is equal to 10 000. Especially worthy of notice is the behavior of the non-symmetric kernels in the presence of discontinuities in the density function. For instance, the kernel size of the right kernel $\kappa_r(\cdot)$ is rather high at the point of the first discontinuity ($x = -1$) and becomes smaller as it approaches the second discontinuity ($x = 0$), after which the situation is similar (Fig. 1c). This behavior corresponds to the common sense idea that a large window size for the right kernel should be chosen at $x = -1^+$, while at $x = 0^-$ the data available to the right kernel estimator is very limited and hence the kernel size is accordingly small. Here, the notation $x^-$ means $\lim_{\epsilon \to 0,\, \epsilon > 0} (x - \epsilon)$. For the left kernel $\kappa_l(\cdot)$, the behavior is the opposite (Fig. 1d). Immediately after the first discontinuity, the left kernel still contains very few observations and consequently has a small size, which increases up until $x = 0$. Similarly, just after this point, the kernel size again becomes quite small, since even small sizes encompass two different density regions. Finally, as shown in Fig. 1b, the size of the symmetric kernel increases towards the middle between the discontinuities, that is, at $x = -0.5$. Also, very large kernel sizes, in the form of spikes, can be seen exactly at the points of discontinuities. The reason for this phenomenon is the following. It can be shown, by using Eq. (4), that at a point of discontinuity of the density function, the expectation of the estimate satisfies

$$\lim_{h \to 0} \int \kappa(u) f(x + hu)\,du = \frac{f(x^+) + f(x^-)}{2}.$$

Therefore, the ICI rule behaves correctly and in accordance with this fact. The reason that large kernel sizes are chosen is that the density function on either side of the discontinuity is constant, and larger kernel sizes decrease the variance and, consequently, the MSE. The combined density estimates obtained by fusion of the left and right estimates are considered in (Katkovnik and Shmulevich, 2000).
Another way to evaluate the method of adaptive kernel size selection is to compare it to an ideal varying kernel size. Rather than using Eq. (8), which is an asymptotic result, we shall compare our method to a more stringent criterion, namely, the empirically obtained varying kernel size $h^*(x)$, which minimizes the MSE between the known density $f(x)$ and the estimated density $\hat{f}_{h(x)}(x)$. In other words,

$$h^*(x) = \arg\min_{h(x)} E\left\{\left[f(x) - \hat{f}_{h(x)}(x)\right]^2\right\}. \qquad (15)$$

As a second example, we shall consider estimating the density function shown in Fig. 2a. The density is zero outside the interval $[0, 1]$. The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{100} = 1.0$ with a step of 0.01. The number of observations $N$ is equal to 10 000. An ideal kernel size, from the set of allowable kernel sizes, was found for every $x$, using Eq. (15). This ideal window size is shown in Fig. 2b as a solid line. The dashed line shows the variable kernel size $h^+(x)$ obtained by using the ICI rule. As can be seen, their behavior is very similar. As expected, the kernel size is larger in the flat region of the density as compared with the regions of the peaks, where the kernel size becomes smaller.
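Given a known test density, the empirical minimization of Eq. (15) at a point can be sketched as a Monte Carlo search over a bandwidth grid. The function names, run counts, and the standard normal test density below are illustrative stand-ins, not the paper's exact setup.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), Gaussian kernel."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * SQRT_2PI)

def ideal_varying_h(x, f_true, sampler, hs, n_runs=50, N=1000):
    """Empirical minimizer of Eq. (15) at a single point x: average
    the squared error over Monte Carlo runs for each candidate h and
    return the best one.  f_true(x) is the known density and
    sampler(N) draws N samples from it."""
    mse = np.zeros(len(hs))
    for _ in range(n_runs):
        samples = sampler(N)
        for k, h in enumerate(hs):
            mse[k] += (f_true(x) - parzen(x, samples, h)) ** 2
    return hs[int(np.argmin(mse))]

# usage with a standard normal as the "known" density
rng = np.random.default_rng(2)
f_true = lambda x: np.exp(-0.5 * x**2) / SQRT_2PI
h_star = ideal_varying_h(0.0, f_true, rng.standard_normal,
                         np.arange(0.05, 1.0, 0.05))
```

Repeating this over a grid of x values produces the solid curve of the kind shown in Fig. 2b.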
3.2. Quantitative simulation
For quantitative accuracy analysis, we use a double-peaked mixture density, $f(x) = (1 - a)\,N(0, \sigma_1) + a\,N(m_2, \sigma_2)$, with $m_2 = 7$, $\sigma_1 = 1$, $\sigma_2 = 0.05$, and $a = 1/2$. This type of combined model is most reasonable for demonstrating the advantage of the varying window kernel estimates (Terrell and Scott, 1992). The true density to be estimated is depicted in Fig. 3. The allowable kernel sizes ($H$) are logarithmically spaced between $h_1 = 0.002$ and $h_{40} = 4.0$. The number of observations $N$ is equal to 1000. The following computations were performed. For each constant kernel size in $H$, the point-wise average, over 200 simulation runs, of the Parzen estimate was obtained. That is,

$$\bar{f}_{h_i}(x) = \frac{1}{200} \sum_{j=1}^{200} \hat{f}_{h_i}^{(j)}(x), \qquad (16)$$

where $\hat{f}_{h_i}^{(j)}(x)$ is the Parzen estimate with constant kernel size $h_i$ using data from the $j$th realization. For each kernel size, the square root of the average, over the 200 runs, of the point-wise squared error was computed. In other words,

$$e_{h_i}(x) = \sqrt{\frac{1}{200} \sum_{j=1}^{200} \left[f(x) - \hat{f}_{h_i}^{(j)}(x)\right]^2}. \qquad (17)$$

Fig. 1. Density $f(x)$ to be estimated (a) and adaptive window widths corresponding to a symmetric kernel $\kappa(\cdot)$ (b), right kernel $\kappa_r(\cdot)$ (c), and left kernel $\kappa_l(\cdot)$ (d).
The best (constant) kernel size, $h^*$, was selected by considering the mean value, over $x$, of the error in (17). In our simulations, the best kernel size was equal to $h^* = 0.024$. We note in passing that the data-driven Sheather–Jones plug-in method (Sheather and Jones, 1991) produces a kernel size roughly five times larger.
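The selection of a best constant kernel size from the errors of Eq. (17) can be sketched as follows. The evaluation grid, run count, bandwidth grid, and the standard normal stand-in density are illustrative assumptions rather than the paper's exact experiment.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen_grid(xs, samples, h):
    """Parzen estimate, Eq. (1), evaluated on a grid of points."""
    u = (xs[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * SQRT_2PI)

def best_constant_h(xs, f_true, sampler, hs, n_runs=50, N=1000):
    """Pick the constant kernel size minimizing the mean over x of
    the root-mean-square error e_h(x) of Eq. (17), estimated by
    Monte Carlo with a known test density f_true."""
    sq_err = np.zeros((len(hs), len(xs)))
    for _ in range(n_runs):
        samples = sampler(N)
        for k, h in enumerate(hs):
            sq_err[k] += (f_true(xs) - parzen_grid(xs, samples, h)) ** 2
    e = np.sqrt(sq_err / n_runs)                 # e_{h_i}(x), Eq. (17)
    return hs[int(np.argmin(e.mean(axis=1)))]    # best mean error over x

rng = np.random.default_rng(3)
f_true = lambda x: np.exp(-0.5 * x**2) / SQRT_2PI
xs = np.linspace(-3.0, 3.0, 41)
h_best = best_constant_h(xs, f_true, rng.standard_normal,
                         np.geomspace(0.02, 1.0, 15), n_runs=30)
```

The same machinery, with the ICI-selected bandwidths substituted for the constant $h$, yields the "ICI" error curves used in the comparison below.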
Fig. 2. (a) Density $f(x)$ to be estimated; (b) optimal varying window width $h^*(x)$ (solid line) minimizing the MSE between the known density and the estimate, and the varying window width $h^+(x)$ based on the ICI rule (dashed line) ($C = 5$).
Further, the ICI rule was used with values of $C$ between 0.5 and 3.0 with a step size of 0.5. In a similar manner to the above, the point-wise average of the estimate equipped with the ICI rule was obtained. Also, the error, as in (17), was computed. This was performed for every value of $C$. Finally, the best $C^*$ was chosen in the same way as $h^*$. In our simulations, $C^* = 0.5$. It should be mentioned that the accuracy of estimation is not overly sensitive to the selection of $C$ and there exists a range of values all of which result in similar accuracy (Katkovnik and Shmulevich, 2000).

Fig. 4a and b depict the two peaks of the density estimated using the ideal constant kernel size $h^*$ and the ICI rule. The true density is shown as a dashed line. The average of the 200 estimates given in (16) is indicated as "mean const." and the corresponding average of the estimates using the ICI rule as "mean ICI". Note that in Fig. 4b, the average estimated densities nearly coincide. In addition, these figures also show the upper and lower "confidence intervals" given by $\bar{f}_{h_i}(x) \pm e_{h_i}(x)$ for the ideal constant kernel size as well as for the ICI rule. These are labeled as "upper/lower const." and "upper/lower ICI", respectively. It can be readily seen that on the wide peak of the density, the ICI rule produces smaller confidence intervals and hence less variability of the error in the estimate. We stress that this is done in comparison with the ideal constant kernel size, which is, of course, unknown. Moreover, adaptive constant bandwidth methods (e.g., Sheather and Jones, 1991) produce very different sizes from the ideal. As far as the average of the estimates is concerned, the ideal constant kernel size estimate and the estimate based on the ICI rule are comparable, with the ICI rule producing a slightly smoother estimate for the wide peak (Fig. 4a).
4. Conclusions
We have proposed a new method for varying the bandwidth in kernel density estimation. This method is based on the ICI rule and requires only the knowledge of the variance of the estimate. In our case, as the true density is unknown, the variance of the estimator is approximated by replacing the true density by a pilot estimate with a data-dependent constant kernel size. It is also possible to implement an iterative technique in which successive estimates are used to compute the variance by formula (5), which is then used in the ICI rule to form new estimates. Although we have considered this method for one-dimensional densities, there is no conceptual difficulty in extending it to multi-dimensional densities. In that case, as with other techniques, not only the size but also the shape of the kernel is an important parameter. We have shown, by means of numerical simulations, that the proposed method can perform significantly better than any constant-bandwidth method.

Fig. 4. The results of Monte Carlo simulations with 200 runs: (a) and (b) show two parts of the density using the ideal constant kernel size as well as the ICI rule. Averages of the estimates as well as their confidence intervals are indicated. The true density is shown in dashed lines.
Acknowledgements
The authors are grateful for the support and hospitality of Tampere International Center for Signal Processing in Tampere, Finland, where this work was done.
References
Abramson, I., 1982. On bandwidth variation in kernel estimates – a square root law. Ann. Stat. 10, 1217–1223. Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel
estimates of multivariate densities. Technometrics 19, 135– 144.
Cacoullos, T., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 179–189.
Chiu, S.-T., 1992. An automatic bandwidth selector for kernel density estimation. Biometrika 79 (4), 771–782.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and its Application. Chapman and Hall, London.
Fukunaga, K., 1990. Statistical Pattern Recognition, second ed. Academic Press, New York.
Goldenshluger, A., Nemirovsky, A., 1997. On spatial adaptive estimation of nonparametric regression. Math. Meth. Stat. 6 (2), 135–170.
Hall, P., Sheather, S.J., Jones, M.C., Marron, J.S., 1991. On optimal data-based bandwidth selection in kernel density estimation. Biometrika 78 (2), 263–269.
Katkovnik, V., 1999. A new method for varying adaptive bandwidth selection. IEEE Trans. Signal Process. 47 (9), 2567–2571.
Katkovnik, V., Shmulevich, I., 2000. Kernel density estimation with varying data-driven bandwidth. In: EOS/SPIE Symposium, Image and Signal Processing for Remote Sensing, September 25–29, Barcelona, Spain.
Loftsgaarden, D., Quesenberry, C., 1965. A nonparametric estimate of a multivariate density function. Ann. Math. Stat. 36, 1049–1051.
Parzen, E., 1962. On the estimation of a probability density function and the mode. Ann. Math. Stat. 33, 1065–1076. Raudys, S., 1991. On the effectiveness of Parzen window
classifier. Informatica 2 (3), 434–454.
Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. J. Am. Stat. Assoc. 91 (436), 1525–1534. Sheather, S.J., Jones, M.C., 1991. A reliable data-based
bandwidth selection method for kernel density estimation. J. R. Stat. Soc. B 53 (3), 683–690.
Silverman, B.W., 1978. Choosing the window width when estimating a density. Biometrika 65, 1–11.
Sindoukas, D., Laskaris, N., Fotopoulos, S., 1997. Algorithms for color image edge enhancement using potential functions. IEEE Signal Process. Lett. 4 (9), 269–272.
Taylor, C.C., 1989. Bootstrap choice of the smoothing para-meter in kernel density estimation. Biometrika 76 (4), 705– 712.
Terrell, G., Scott, D., 1992. Variable kernel density estimation. Ann. Stat. 20 (3), 1236–1265.
Wright, D., Stander, J., Nicolaides, K., 1997. Nonparametric density estimation and discrimination from images of shapes. J. R. Stat. Soc. C: Appl. Stat. 46 (3), 365–380.
Zabin, S., Wright, G., 2000. Nonparametric density estimation and detection in impulsive interference channels. IEEE Trans. Commun. 42 (2–4), 1684–1711.