Kernel density estimation with adaptive varying window size
Vladimir Katkovnik^a, Ilya Shmulevich^b,*

^a Kwangju Institute of Science and Technology, Kwangju, South Korea
^b University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA
Received 21 November 2000; received in revised form 21 August 2001
Abstract
A new method of kernel density estimation with a varying adaptive window size is proposed. It is based on the so-called intersection of confidence intervals (ICI) rule. Several examples of the proposed method are given for different types of densities and the quality of the adaptive density estimate is assessed by means of numerical simulations. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Non-parametric; Kernel; Density estimation; Parzen; ICI rule
1. Introduction
In pattern recognition, optimal algorithms often require the knowledge of the underlying densities of signals and/or noise. As these densities are usually unknown, unrealistic assumptions are frequently made, thus compromising the performance of the algorithms in question. A common approach to this problem is to estimate the density from the data. If a particular form of the density is assumed or known, then parametric estimation is used. If nothing is assumed about the density shape, non-parametric estimation is employed. Besides being widely used in the field of pattern recognition and classification (Fukunaga, 1990), non-parametric probability density estimation has been applied in image processing (Sindoukas et al., 1997; Wright et al., 1997), communications (Zabin and Wright, 2000), and many other fields.
One of the most well-known and popular techniques of non-parametric density estimation is the kernel or Parzen density estimate (Fukunaga, 1990; Parzen, 1962; Cacoullos, 1966). Given $N$ samples $X_1, \ldots, X_N$ drawn from a population with density function $f(x)$, $x \in \mathbb{R}^1$, the Parzen density estimate at $x$ is given by

$$\hat{f}_h(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h}\, \kappa\!\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $\kappa(\cdot)$ is a window or kernel function and $h$ is the window width, smoothing parameter, or simply the kernel size. Traditionally, it is assumed that $\int \kappa(u)\,du = 1$ and $\kappa(\cdot)$ is symmetric, that is, $\kappa(-u) = \kappa(u)$. One popular choice is the Gaussian kernel $\kappa(u) = (1/\sqrt{2\pi}) \exp(-u^2/2)$.
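As a concrete illustration of Eq. (1), the following Python sketch evaluates the estimate with a Gaussian kernel at a single point. The function name, sample size, and bandwidth are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen density estimate of Eq. (1) with a Gaussian kernel."""
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum() / (len(samples) * h)

# usage: estimate a standard normal density at x = 0
rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)
est = parzen_estimate(0.0, samples, h=0.2)
```

For this sample size and bandwidth, `est` lands close to the true value $f(0) = 1/\sqrt{2\pi} \approx 0.399$.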
The kernel size $h$ is the most important characteristic of the Parzen density estimate (Raudys,
* Corresponding author. Tel.: +1-713-745-1502; fax: +1-801-382-2064.
E-mail addresses: vlkatkov@hotmail.com (V. Katkovnik), is@ieee.org (I. Shmulevich).
1991; Silverman, 1978). One can compute the ideal or optimal value of $h$ by minimizing the mean-square error

$$\mathrm{MSE}\{\hat{f}_h(x)\} = E\{[\hat{f}_h(x) - f(x)]^2\} \qquad (2)$$
between the true and estimated densities, with respect to $h$. The MSE is a function of $x$ and so the optimal kernel size $h$ is also a function of $x$. In order to minimize the MSE, a best compromise between variance and bias must be selected. We use Taylor series approximations of the moments of $\hat{f}_h(x)$ and note that

$$\mathrm{MSE}\{\hat{f}_h(x)\} = [E\{\hat{f}_h(x)\} - f(x)]^2 + \mathrm{Var}\{\hat{f}_h(x)\}, \qquad (3)$$

where

$$E\{\hat{f}_h(x)\} = \int \kappa(u) f(x + hu)\,du \qquad (4)$$

and

$$\mathrm{Var}\{\hat{f}_h(x)\} = \frac{1}{N}\left[\frac{1}{h}\int \kappa^2(u) f(x + hu)\,du - E^2\{\hat{f}_h(x)\}\right]. \qquad (5)$$

We then have that for small $h$ and large $N$ ($h \to 0$, $N \to \infty$, and $Nh \to \infty$),

$$\mathrm{Bias}\{\hat{f}_h(x)\} = E\{\hat{f}_h(x)\} - f(x) \simeq \begin{cases} \dfrac{h^2}{2}\, f''(x) \displaystyle\int u^2 \kappa(u)\,du, & \text{if } \kappa(-u) = \kappa(u), \\[2mm] h\, f'(x) \displaystyle\int u\, \kappa(u)\,du, & \text{otherwise,} \end{cases} \qquad (6)$$

and

$$\mathrm{Var}\{\hat{f}_h(x)\} \simeq \frac{f(x)}{Nh} \int \kappa^2(u)\,du. \qquad (7)$$
Then, the optimal value of the kernel size can be shown to be equal to (Parzen, 1962)

$$h_0(x) = \left(\frac{f(x) \int \kappa^2(u)\,du}{N \left[f''(x) \int u^2 \kappa(u)\,du\right]^2}\right)^{1/5} \qquad (8)$$

for a symmetric kernel, when $h \to 0$ as $N \to \infty$ and $Nh \to \infty$, assuring asymptotic unbiasedness and consistency of the estimate. Similarly,

$$h_0(x) = \left(\frac{f(x) \int \kappa^2(u)\,du}{2N \left[f'(x) \int u\, \kappa(u)\,du\right]^2}\right)^{1/3} \qquad (9)$$

for a non-symmetric window. As can be seen from Eqs. (8) and (9), the optimal kernel size depends on the value of the density and on its second or first derivative. These equations, in particular the order with respect to $N$, depend essentially on whether the kernel is symmetric or non-symmetric. It is also possible to obtain an optimal constant kernel size independent of $x$ by minimizing either the integral mean-square error $\int \mathrm{MSE}\{\hat{f}_h(x)\}\,dx$ or the expected mean-square error $\int \mathrm{MSE}\{\hat{f}_h(x)\} f(x)\,dx$ (Fukunaga, 1990; Parzen, 1962).
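For a case where $f$ is known, Eq. (8) can be evaluated in closed form. The sketch below does this for a standard normal density with a Gaussian kernel; this choice of density is an illustrative assumption, and note the formula degenerates at points where $f''(x) = 0$ (here $x = \pm 1$).

```python
import numpy as np

def h0_symmetric(x, N):
    """Point-wise optimal kernel size from Eq. (8), evaluated for a
    standard normal density f and a Gaussian kernel.  For the
    Gaussian kernel: int k^2(u) du = 1/(2 sqrt(pi)) and
    int u^2 k(u) du = 1."""
    f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # f(x)
    f2 = (x**2 - 1.0) * f                            # f''(x) of the normal
    kappa_sq = 1.0 / (2.0 * np.sqrt(np.pi))          # int k^2(u) du
    mu2 = 1.0                                        # int u^2 k(u) du
    return (f * kappa_sq / (N * (f2 * mu2) ** 2)) ** 0.2

# at x = 0 the expression reduces to (1 / (sqrt(2) N))^(1/5)
print(h0_symmetric(0.0, 1000))
```

The printed value also illustrates the slow $N^{-1/5}$ shrinkage of the optimal bandwidth with sample size.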
Clearly, in practice, one does not have access to the true density function $f(x)$ which is to be estimated. Thus, a number of heuristic approaches can be taken for finding the window width. For instance, the optimal constant $h$ can be computed for, say, the Normal distribution (as a function of $N$) and then used for making the estimate $\hat{f}_h(x)$ (Fukunaga, 1990). Since the density estimates are often used for classification purposes, another approach is to determine $h$ on the basis of the expected probability of misclassification (Raudys, 1991). Although $h$ is usually taken to be a constant, several important approaches have been proposed to vary it. One is the well-known $k$th nearest neighbor estimate (Loftsgaarden and Quesenberry, 1965). Another method is the adaptive kernel estimate proposed in (Breiman et al., 1977). A number of other papers considering the problem of kernel size selection exist (e.g., Abramson, 1982; Terrell and Scott, 1992; Chiu, 1992; Sheather and Jones, 1991; Hall et al., 1991; Taylor, 1989).
In this paper, we propose and develop a special rule (statistic) for choosing a data-driven kernel size that is selected in a point-wise manner for every argument value of the density. Our main motivation for this more complex estimator is to make it adaptive to the unknown and varying smoothness of the density to be estimated. This method is described in Section 2. In Section 3, we assess the quality of the density estimate by comparing it to a known density which we are
estimating, under the mean-square error criterion. It will be shown that the estimates based on variable-sized kernels are superior to the estimates based on optimal constant-sized kernels.
2. Adaptive kernel size selection
The method of adaptive kernel size selection is based on the ICI rule, proposed in (Goldenshluger and Nemirovsky, 1997) for adaptive regression smoothing and later developed for signal filtering in (Katkovnik, 1999). One of several attractive properties of the ICI rule is that it is spatially adaptive over a wide range of signal classes, in the sense that its quality is close to that which one could achieve if the smoothness of the original signal were known in advance (Goldenshluger and Nemirovsky, 1997). We briefly reintroduce this method here in the context of density estimation.
Consider the ratio of the standard deviation $\mathrm{Std}\{\hat{f}_h(x)\}$ of the estimate to the absolute value of the bias $|\mathrm{Bias}\{\hat{f}_h(x)\}|$ of the estimate, evaluated at the ideal value $h_0(x)$ given in Eqs. (8) and (9). We get

$$\frac{\mathrm{Std}\{\hat{f}_{h_0}(x)\}}{|\mathrm{Bias}\{\hat{f}_{h_0}(x)\}|} = k, \qquad k = \begin{cases} 2, & \text{if } \kappa(-u) = \kappa(u), \\ \sqrt{2}, & \text{otherwise.} \end{cases}$$

It is useful to note that, as $h \to 0$, the standard deviation is monotonically increasing and the bias is monotonically decreasing, so that

$$\frac{1}{k}\,\mathrm{Std}\{\hat{f}_h(x)\} \geq |\mathrm{Bias}\{\hat{f}_h(x)\}|, \quad \text{if } h \leq h_0,$$
$$\frac{1}{k}\,\mathrm{Std}\{\hat{f}_h(x)\} \leq |\mathrm{Bias}\{\hat{f}_h(x)\}|, \quad \text{if } h \geq h_0. \qquad (10)$$

Although it is known that in regions where the density function is convex, it is theoretically possible to find bandwidths for which the point-wise bias is equal to zero (Sain and Scott, 1996), we assume the monotonicity of the bias with respect to the bandwidth locally, for the theory and initial motivation. Moreover, these zero-bias bandwidths are typically much larger than the asymptotic data-dependent bandwidth given in (8), while Eq. (10) holds for $h \to 0$.
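The value $k = 2$ for a symmetric kernel can be verified by substituting $h_0$ from Eq. (8) into the bias and variance approximations (6) and (7); the following is a sketch filling in that step.

```latex
% With A = f(x)\int\kappa^2(u)\,du and B = \tfrac{1}{2}f''(x)\int u^2\kappa(u)\,du,
% Eqs. (6)-(7) give Var = A/(Nh) and Bias = B h^2, while Eq. (8) reads
% h_0^5 = A / (N (2B)^2).  Then
\mathrm{Bias}^2\{\hat f_{h_0}\} = B^2 h_0^4
  = \frac{B^2}{h_0}\cdot\frac{A}{N(2B)^2}
  = \frac{1}{4}\cdot\frac{A}{N h_0}
  = \frac{1}{4}\,\mathrm{Var}\{\hat f_{h_0}\},
\qquad\text{so}\qquad
\frac{\mathrm{Std}\{\hat f_{h_0}\}}{|\mathrm{Bias}\{\hat f_{h_0}\}|} = 2 .
```

The same substitution with Eq. (9) and the non-symmetric bias term of (6) yields the ratio $\sqrt{2}$.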
For a given kernel size $h$, the estimation error can be represented as

$$|\hat{f}_h(x) - f(x)| = |\mathrm{Bias}\{\hat{f}_h(x)\} + \xi_h(x)| \leq |\mathrm{Bias}\{\hat{f}_h(x)\}| + |\xi_h(x)|,$$

where $\xi_h(x)$ is a random variable with zero mean and standard deviation equal to $\mathrm{Std}\{\hat{f}_h(x)\}$. Thus,

$$|\hat{f}_h(x) - f(x)| \leq |\mathrm{Bias}\{\hat{f}_h(x)\}| + v_p\, \mathrm{Std}\{\hat{f}_h(x)\} \qquad (11)$$

holds with an arbitrary probability $p$, for a suitably chosen $v_p$. Using the relationships in (10) together with (11), we get that for $h \leq h_0$,

$$|\hat{f}_h(x) - f(x)| \leq \left(\frac{1}{k} + v_p\right) \mathrm{Std}\{\hat{f}_h(x)\} = C\, \mathrm{Std}\{\hat{f}_h(x)\}, \qquad (12)$$

where $C = 1/k + v_p$. Larger values of $C$ correspond to larger values of $p$. The ICI rule essentially tests the hypothesis $h \leq h_0$ for various values of $h$ and in this way selects an $h$ close to $h_0$, as follows.
Suppose $H = \{h_1 < h_2 < \cdots < h_J\}$ is a finite collection of kernel sizes, starting with a small $h_1$. Using inequality (12), we determine a sequence of confidence intervals

$$D_j = [L_j, U_j], \quad j = 1, \ldots, J,$$
$$L_j = \hat{f}_{h_j}(x) - C\, \mathrm{Std}\{\hat{f}_{h_j}(x)\},$$
$$U_j = \hat{f}_{h_j}(x) + C\, \mathrm{Std}\{\hat{f}_{h_j}(x)\}, \qquad (13)$$

each one corresponding to a kernel size in $H$. The ICI rule, then, can be stated as follows (Katkovnik, 1999).

ICI rule: Consider the intersection of the intervals $D_j$, $1 \leq j \leq i$, with increasing $i$, and let $i^+$ be the largest of those $i$ for which the intervals $D_j$, $1 \leq j \leq i$, have a point in common. This $i^+$ defines the adaptive kernel size $h^+(x) = h_{i^+}$ and, consequently, the density estimate $\hat{f}_{h^+(x)}(x)$.
It is important to note that the kernel size selection procedure based on the ICI rule requires only the knowledge of the density estimate and its variance, for which Eq. (7) can be used. However, this variance, in turn, depends on the unknown density to be estimated. A pilot estimate of the density can be used in (7) instead of $f(x)$. However, it is emphasized that this pilot estimate should be obtained more or less independently with respect to the final estimate $\hat{f}_{h^+(x)}(x)$. In fact, this is a general rule of using pilot estimates in statistics (see, e.g., Fan and Gijbels, 1996). The kernel density estimate with a constant window size $h = \hat{h}$ is a good choice for the considered problem. In our simulation experiments, we employ the Sheather–Jones plug-in method (Sheather and Jones, 1991) for the estimation of $\hat{h}$ in the pilot density estimate. This method is known to have excellent performance as compared to other known methods. $C$ is a design parameter of the algorithm and the selection of its value is discussed in (Katkovnik and Shmulevich, 2000).
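Putting the pieces together, the ICI selection at a single point can be sketched in Python as follows. The variance of the estimate is approximated via Eq. (7) with a fixed-bandwidth Parzen pilot standing in for the unknown density; the Gaussian kernel, the value C = 2, and the pilot bandwidth are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), with a Gaussian kernel."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * SQRT_2PI)

def ici_kernel_size(x, samples, hs, C=2.0, pilot_h=0.1):
    """Select the kernel size at x by the ICI rule: track the running
    intersection of the intervals of Eq. (13) over increasing h and
    return the largest h for which the intersection is non-empty."""
    N = len(samples)
    kappa_sq = 1.0 / (2.0 * np.sqrt(np.pi))     # int k^2(u) du, Gaussian
    f_pilot = parzen(x, samples, pilot_h)       # pilot estimate for Eq. (7)
    lower, upper = -np.inf, np.inf
    h_plus = hs[0]
    for h in hs:
        f_hat = parzen(x, samples, h)
        std = np.sqrt(f_pilot * kappa_sq / (N * h))   # Eq. (7)
        L, U = f_hat - C * std, f_hat + C * std
        if max(lower, L) > min(upper, U):       # intersection became empty
            break
        lower, upper = max(lower, L), min(upper, U)
        h_plus = h                              # last h with a common point
    return h_plus

rng = np.random.default_rng(1)
samples = rng.standard_normal(5000)
hs = np.arange(0.01, 1.0, 0.01)
h_adaptive = ici_kernel_size(0.0, samples, hs)
```

The returned bandwidth is always an element of the candidate grid `hs`, as the rule only chooses among the supplied sizes.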
This ICI procedure for the varying window density estimate can be implemented by Algorithm 1.

Algorithm 1. Adaptive Window Width Selection

$L \leftarrow -\infty$; $U \leftarrow +\infty$; $i \leftarrow 1$
while ($L \leq U$) and ($i \leq J$) do
    $L_i \leftarrow \hat{f}_{h_i}(x) - C\,\mathrm{Std}\{\hat{f}_{h_i}(x)\}$
    $U_i \leftarrow \hat{f}_{h_i}(x) + C\,\mathrm{Std}\{\hat{f}_{h_i}(x)\}$
    $L \leftarrow \max[L, L_i]$; $U \leftarrow \min[U, U_i]$
    $i \leftarrow i + 1$
end while
$h^+(x) \leftarrow h_{i-1}$

3. Simulation examples
In this section, we will illustrate the use of the kernel size selection procedure based on the ICI rule.
3.1. Qualitative simulation
This group of simulations is given in order to demonstrate the ability of the ICI rule to obtain optimal (reasonable) window sizes. As a first example, consider estimating the piece-wise constant density function shown in Fig. 1a. This example is intended to qualitatively demonstrate the behavior of the adaptive kernel size selection procedure, using the symmetric Gaussian kernel as well as the non-symmetric right and left kernels

$$\kappa_r(u) = \begin{cases} \dfrac{2}{\sqrt{2\pi}}\, \exp(-u^2/2), & u \geq 0, \\ 0, & u < 0, \end{cases} \qquad \kappa_l(u) = \kappa_r(-u). \qquad (14)$$
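Eq. (14) translates directly into code. The sign convention in `parzen_onesided` is an assumption on our part (Eq. (1) can be written with either sign of the kernel argument); with u = (X_i - x)/h, the right kernel averages over samples lying to the right of x.

```python
import numpy as np

def kappa_r(u):
    """Right one-sided Gaussian kernel of Eq. (14)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0.0,
                    2.0 * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi),
                    0.0)

def kappa_l(u):
    """Left kernel: the mirror image, kappa_l(u) = kappa_r(-u)."""
    return kappa_r(-np.asarray(u, dtype=float))

def parzen_onesided(x, samples, h, kernel):
    """Eq. (1) with a one-sided kernel; kernel is kappa_r or kappa_l.
    Sign convention chosen so kappa_r uses samples with X_i >= x."""
    u = (samples - x) / h
    return np.mean(kernel(u)) / h
```

Both one-sided kernels still integrate to one, so the estimates remain proper densities in the limit.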
The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{300} = 3.0$ with a step of 0.01. Fig. 1b–d shows the kernel sizes chosen by the ICI rule, corresponding to the three kinds of kernels ($\kappa(\cdot)$, $\kappa_r(\cdot)$, and $\kappa_l(\cdot)$) used. The number of observations $N$ is equal to 10 000. Especially worthy of notice is the behavior of the non-symmetric kernels in the presence of discontinuities in the density function. For instance, the kernel size of the right kernel $\kappa_r(\cdot)$ is rather high at the point of the first discontinuity ($x = -1$) and becomes smaller as it approaches the second discontinuity ($x = 0$), after which the situation is similar (Fig. 1c). This behavior corresponds to the common sense idea that a large window size for the right kernel should be chosen at $x = -1^+$, while at $x = 0^-$ the data available to the right kernel estimator is very limited and hence the kernel size is accordingly small. Here, the notation $x^-$ means $\lim_{\epsilon \to 0,\, \epsilon > 0} (x - \epsilon)$. For the left kernel $\kappa_l(\cdot)$, the behavior is the opposite (Fig. 1d). Immediately after the first discontinuity, the left kernel still contains very few observations and consequently has a small size, which increases up until $x = 0$. Similarly, just after this point, the kernel size again becomes quite small, since even small sizes encompass two different density regions. Finally, as shown in Fig. 1b, the size of the symmetric kernel increases towards the middle between the discontinuities, that is, at $x = -0.5$. Also, very large kernel sizes, in the form of spikes, can be seen exactly at the points of discontinuities. The reason for this phenomenon is the following. It can be shown, by using Eq. (4), that at a point of discontinuity of the density function, the expectation of the estimate satisfies

$$\lim_{h \to 0} \int \kappa(u) f(x + hu)\,du = \frac{f(x^+) + f(x^-)}{2}.$$

Therefore, the ICI rule behaves correctly and in accordance with this fact. The reason that large kernel sizes are chosen is that the density function on either side of the discontinuity is constant, and larger kernel sizes decrease the variance and, consequently, the MSE. The combined density estimates obtained by fusion of the left and right estimates are considered in (Katkovnik and Shmulevich, 2000).
Another way to evaluate the method of adaptive kernel size selection is to compare it to an ideal varying kernel size. Rather than using Eq. (8), which is an asymptotic result, we shall compare our method to a more stringent criterion, namely, the empirically obtained varying kernel size $h^*(x)$, which minimizes the MSE between the known density $f(x)$ and the estimated density $\hat{f}_{h(x)}(x)$. In other words,

$$h^*(x) = \arg\min_{h(x)} E\left\{\left[f(x) - \hat{f}_{h(x)}(x)\right]^2\right\}. \qquad (15)$$

As a second example, we shall consider estimating the density function shown in Fig. 2a. The density is zero outside the interval $[0, 1]$. The allowable kernel sizes ($H$) start with $h_1 = 0.01$ and increase until $h_{100} = 1.0$ with a step of 0.01. The number of observations $N$ is equal to 10 000. An ideal kernel size, from the set of allowable kernel sizes, was found for every $x$, using Eq. (15). This ideal window size is shown in Fig. 2b as a solid line. The dashed line shows the variable kernel size $h^+(x)$ obtained by using the ICI rule. As can be seen, their behavior is very similar. As expected, the kernel size is larger in the flat region of the density as compared with the regions of the peaks, where the kernel size becomes smaller.
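Given a known test density, the empirical minimization of Eq. (15) at a point can be sketched as a Monte Carlo search over a bandwidth grid. The function names, run counts, and the standard normal test density below are illustrative stand-ins, not the paper's exact setup.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen(x, samples, h):
    """Parzen estimate, Eq. (1), Gaussian kernel."""
    u = (x - samples) / h
    return np.mean(np.exp(-0.5 * u**2)) / (h * SQRT_2PI)

def ideal_varying_h(x, f_true, sampler, hs, n_runs=50, N=1000):
    """Empirical minimizer of Eq. (15) at a single point x: average
    the squared error over Monte Carlo runs for each candidate h and
    return the best one.  f_true(x) is the known density and
    sampler(N) draws N samples from it."""
    mse = np.zeros(len(hs))
    for _ in range(n_runs):
        samples = sampler(N)
        for k, h in enumerate(hs):
            mse[k] += (f_true(x) - parzen(x, samples, h)) ** 2
    return hs[int(np.argmin(mse))]

# usage with a standard normal as the "known" density
rng = np.random.default_rng(2)
f_true = lambda x: np.exp(-0.5 * x**2) / SQRT_2PI
h_star = ideal_varying_h(0.0, f_true, rng.standard_normal,
                         np.arange(0.05, 1.0, 0.05))
```

Repeating this over a grid of x values produces the solid curve of the kind shown in Fig. 2b.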
3.2. Quantitative simulation
For quantitative accuracy analysis, we use a double-peaked mixture density, $f(x) = (1 - a)\,N(0, \sigma_1) + a\,N(m_2, \sigma_2)$, with $m_2 = 7$, $\sigma_1 = 1$, $\sigma_2 = 0.05$, and $a = 1/2$. This type of combined model is most reasonable for demonstrating the advantage of the varying window kernel estimates (Terrell and Scott, 1992). The true density to be estimated is depicted in Fig. 3. The allowable kernel sizes ($H$) are logarithmically spaced between $h_1 = 0.002$ and $h_{40} = 4.0$. The number of observations $N$ is equal to 1000. The following computations were performed. For each constant kernel size in $H$, the point-wise average, over 200 simulation runs, of the Parzen estimate was obtained. That is,

$$\bar{f}_{h_i}(x) = \frac{1}{200} \sum_{j=1}^{200} \hat{f}_{h_i}^{(j)}(x), \qquad (16)$$

where $\hat{f}_{h_i}^{(j)}(x)$ is the Parzen estimate with constant kernel size $h_i$ using data from the $j$th realization. For each kernel size, the square root of the average, over the 200 runs, of the point-wise squared error was computed. In other words,

$$e_{h_i}(x) = \sqrt{\frac{1}{200} \sum_{j=1}^{200} \left[f(x) - \hat{f}_{h_i}^{(j)}(x)\right]^2}. \qquad (17)$$

Fig. 1. Density $f(x)$ to be estimated (a) and adaptive window widths corresponding to a symmetric kernel $\kappa(\cdot)$ (b), right kernel $\kappa_r(\cdot)$ (c), and left kernel $\kappa_l(\cdot)$ (d).
The best (constant) kernel size, $h^*$, was selected by considering the mean value, over $x$, of the error in (17). In our simulations, the best kernel size was equal to $h^* = 0.024$. We note in passing that the data-driven Sheather–Jones plug-in method (Sheather and Jones, 1991) produces a kernel size roughly five times larger.
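The selection of a best constant kernel size from the errors of Eq. (17) can be sketched as follows. The evaluation grid, run count, bandwidth grid, and the standard normal stand-in density are illustrative assumptions rather than the paper's exact experiment.

```python
import numpy as np

SQRT_2PI = np.sqrt(2.0 * np.pi)

def parzen_grid(xs, samples, h):
    """Parzen estimate, Eq. (1), evaluated on a grid of points."""
    u = (xs[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * SQRT_2PI)

def best_constant_h(xs, f_true, sampler, hs, n_runs=50, N=1000):
    """Pick the constant kernel size minimizing the mean over x of
    the root-mean-square error e_h(x) of Eq. (17), estimated by
    Monte Carlo with a known test density f_true."""
    sq_err = np.zeros((len(hs), len(xs)))
    for _ in range(n_runs):
        samples = sampler(N)
        for k, h in enumerate(hs):
            sq_err[k] += (f_true(xs) - parzen_grid(xs, samples, h)) ** 2
    e = np.sqrt(sq_err / n_runs)                 # e_{h_i}(x), Eq. (17)
    return hs[int(np.argmin(e.mean(axis=1)))]    # best mean error over x

rng = np.random.default_rng(3)
f_true = lambda x: np.exp(-0.5 * x**2) / SQRT_2PI
xs = np.linspace(-3.0, 3.0, 41)
h_best = best_constant_h(xs, f_true, rng.standard_normal,
                         np.geomspace(0.02, 1.0, 15), n_runs=30)
```

The same machinery, with the ICI-selected bandwidths substituted for the constant $h$, yields the "ICI" error curves used in the comparison below.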
Fig. 2. (a) Density $f(x)$ to be estimated; (b) optimal varying window width $h^*(x)$ (solid line) minimizing the MSE between the known density and the estimate, and the varying window width $h^+(x)$ based on the ICI rule (dashed line) ($C = 5$).
Further, the ICI rule was used with values of $C$ between 0.5 and 3.0 with a step size of 0.5. In a similar manner to the above, the point-wise average of the estimate equipped with the ICI rule was obtained. Also, the error, as in (17), was computed. This was performed for every value of $C$. Finally, the best $C^*$ was chosen in the same way as $h^*$. In our simulations, $C^* = 0.5$. It should be mentioned that the accuracy of estimation is not overly sensitive to the selection of $C$ and there exists a range of values all of which result in similar accuracy (Katkovnik and Shmulevich, 2000).

Fig. 4a and b depict the two peaks of the density estimated using the ideal constant kernel size $h^*$ and the ICI rule. The true density is shown as a dashed line. The average of the 200 estimates given in (16) is indicated as "mean const." and the corresponding average of the estimates using the ICI rule as "mean ICI". Note that in Fig. 4b, the average estimated densities nearly coincide. In addition, these figures also show the upper and lower "confidence intervals" given by $\bar{f}_{h_i}(x) \pm e_{h_i}(x)$ for the ideal constant kernel size as well as for the ICI rule. These are labeled as "upper/lower const." and "upper/lower ICI", respectively. It can be readily seen that on the wide peak of the density, the ICI rule produces smaller confidence intervals and hence less variability of the error in the estimate. We stress that this is done in comparison with the ideal constant kernel size, which is, of course, unknown. Moreover, adaptive constant bandwidth methods (e.g., Sheather and Jones, 1991) produce very different sizes from the ideal. As far as the average of the estimates is concerned, the ideal constant kernel size estimate and the estimate based on the ICI rule are comparable, with the ICI rule producing a slightly smoother estimate for the wide peak (Fig. 4a).
4. Conclusions
We have proposed a new method for varying the bandwidth in kernel density estimation. This method is based on the ICI rule and requires only the knowledge of the variance of the estimate. In our case, as the true density is unknown, the variance of the estimator is approximated by replacing the true density by a pilot estimate with a data-dependent constant kernel size. It is also possible to implement an iterative technique in which successive estimates are used to compute the variance by formula (5), which is then used in the ICI rule to form new estimates. Although we have considered this method for one-dimensional densities, there is no conceptual difficulty in extending it to multi-dimensional densities. In that case, as with other techniques, not only the size but also the shape of the kernel is an important parameter. We have shown, by means of numerical simulations, that the proposed method can perform significantly better than any constant-bandwidth method.

Fig. 4. The results of Monte Carlo simulations with 200 runs: (a) and (b) show two parts of the density using the ideal constant kernel size as well as the ICI rule. Averages of the estimates as well as their confidence intervals are indicated. The true density is shown in dashed lines.
Acknowledgements
The authors are grateful for the support and hospitality of Tampere International Center for Signal Processing in Tampere, Finland, where this work was done.
References
Abramson, I., 1982. On bandwidth variation in kernel estimates – a square root law. Ann. Stat. 10, 1217–1223. Breiman, L., Meisel, W., Purcell, E., 1977. Variable kernel
estimates of multivariate densities. Technometrics 19, 135– 144.
Cacoullos, T., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 179–189.
Chiu, S.-T., 1992. An automatic bandwidth selector for kernel density estimation. Biometrika 79 (4), 771–782.
Fan, J., Gijbels, I., 1996. Local Polynomial Modelling and its Application. Chapman and Hall, London.
Fukunaga, K., 1990. Statistical Pattern Recognition, second ed. Academic Press, New York.
Goldenshluger, A., Nemirovsky, A., 1997. On spatial adaptive estimation of nonparametric regression. Math. Meth. Stat. 6 (2), 135–170.
Hall, P., Sheather, S.J., Jones, M.C., Marron, J.S., 1991. On optimal data-based bandwidth selection in kernel density estimation. Biometrika 78 (2), 263–269.
Katkovnik, V., 1999. A new method for varying adaptive bandwidth selection. IEEE Trans. Signal Process. 47 (9), 2567–2571.
Katkovnik, V., Shmulevich, I., 2000. Kernel density estimation with varying data-driven bandwidth. In: EOS/SPIE Symposium, Image and Signal Processing for Remote Sensing, September 25–29, Barcelona, Spain.
Loftsgaarden, D., Quesenberry, C., 1965. A nonparametric estimate of a multivariate density function. Ann. Math. Stat. 36, 1049–1051.
Parzen, E., 1962. On the estimation of a probability density function and the mode. Ann. Math. Stat. 33, 1065–1076. Raudys, S., 1991. On the effectiveness of Parzen window
classifier. Informatica 2 (3), 434–454.
Sain, S.R., Scott, D.W., 1996. On locally adaptive density estimation. J. Am. Stat. Assoc. 91 (436), 1525–1534. Sheather, S.J., Jones, M.C., 1991. A reliable data-based
bandwidth selection method for kernel density estimation. J. R. Stat. Soc. B 53 (3), 683–690.
Silverman, B.W., 1978. Choosing the window width when estimating a density. Biometrika 65, 1–11.
Sindoukas, D., Laskaris, N., Fotopoulos, S., 1997. Algorithms for color image edge enhancement using potential functions. IEEE Signal Process. Lett. 4 (9), 269–272.
Taylor, C.C., 1989. Bootstrap choice of the smoothing para-meter in kernel density estimation. Biometrika 76 (4), 705– 712.
Terrell, G., Scott, D., 1992. Variable kernel density estimation. Ann. Stat. 20 (3), 1236–1265.
Wright, D., Stander, J., Nicolaides, K., 1997. Nonparametric density estimation and discrimination from images of shapes. J. R. Stat. Soc. C: Appl. Stat. 46 (3), 365–380.
Zabin, S., Wright, G., 2000. Nonparametric density estimation and detection in impulsive interference channels. IEEE Trans. Commun. 42 (2–4), 1684–1711.