Choice of Objective Function - Methods for Shape-Constrained Kernel Density Estimation

The choice of objective function will influence both the nature of the optimization problem and the qualitative behaviour of the resulting density estimates. Several possibilities for $\delta(q,t)$ are discussed below.

Footnote 6: The plotting function returns a set of points that define each contour. Convexity, for example, can be checked by using built-in functions to compare the area enclosed by these points to the area of the points' convex hull.

2.3.1 Objectives Based on the Adjustable Values

The adjustable values $q$ can be considered perturbations of the target vector $t$. It is natural, then, to take the objective to be a measure of distance between vectors. One option is the $L_\alpha$ distance defined in equation (1.7), which takes $\delta(q,t)$ to be a norm of the difference $q - t$.

The choice of $\alpha$ can have important consequences for the performance of an estimator when the $L_\alpha$ distance is used. In data sharpening, for instance, $\alpha$ can be interpreted as controlling the tendency to sharpen by moving single points or groups of points. Setting $\alpha = 2$ discourages movement of single points through large distances, while setting $\alpha = 1$ makes the optimizer indifferent to the number of points moved. The value of $\alpha$ can particularly affect behaviour in the tails of the distribution, where there are few data points.
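The effect of $\alpha$ can be illustrated with a small sketch. Equation (1.7) is not reproduced in this section, so the code assumes the sum-of-powers form that appears in (2.5); the function name `l_alpha` and the four-point target vector are hypothetical.

```python
import numpy as np

def l_alpha(q, t, alpha):
    """Sum of |q_i - t_i|**alpha (assumed form, matching equation (2.5))."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.sum(np.abs(q - t) ** alpha))

# Hypothetical target vector and two ways of spending the same total movement.
t = np.array([0.0, 1.0, 2.0, 3.0])
one_big_move = t + np.array([2.0, 0.0, 0.0, 0.0])     # one point moved by 2
two_small_moves = t + np.array([1.0, 1.0, 0.0, 0.0])  # two points moved by 1

for alpha in (1, 2):
    print(alpha, l_alpha(one_big_move, t, alpha), l_alpha(two_small_moves, t, alpha))
# alpha = 1: both solutions cost 2.0
# alpha = 2: the single large move costs 4.0, the two small moves cost 2.0
```

Under these assumptions, $\alpha = 1$ does not distinguish the two solutions, while $\alpha = 2$ penalizes the single large move more heavily.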

The $L_\alpha$ distance was used by Braun and Hall (2001) and Hall and Kang (2005) to perform data sharpening with SQP as the optimizer. Those studies found that $\alpha = 1$ gave better mean integrated squared error (MISE) performance in test problems, but caused problems with numerical stability, occasionally leading to non-convergence. Failure to converge was attributed to a lack of differentiability: $L_1(y,x)$ is not differentiable in its $i$th dimension at $y_i = x_i$.

To improve the numerical stability of optimization, Hall and Kang (2005) proposed using a metric defined as

$$\Psi_{\tan}(y,x) = \sum_{i=1}^{n} \int_{0}^{d_i} \arctan(t)\,dt,$$

where $d_i = |y_i - x_i|$. The reason for using this function was to mimic the linear behaviour of $L_1$ away from $d_i = 0$ while maintaining differentiability at zero. In Chapter 3, a new alternative, the rounded-corners objective, will be used instead:

$$RC_\gamma(y,x) = \sum_{i=1}^{n} \left[ \left( \frac{2}{3\gamma} d_i^2 - \frac{1}{9\gamma^2} d_i^3 \right) I(d_i \le \gamma) + \left( d_i - \frac{4}{9}\gamma \right) I(d_i > \gamma) \right]. \tag{2.4}$$

[Figure 2.7: Summands of four objective functions ($L_1$, $RC_1$, $RC_2$, and $\Psi_{\tan}$), plotted as $f(y_i - x_i)$ against $y_i - x_i$.]

The summand of (2.4) is a continuous, differentiable piecewise function of $d_i$. It is a convex cubic polynomial in the interval $|y_i - x_i| \le \gamma$ and a line with unit slope (just like $L_1$) outside this interval. The central interval is effectively a curved, differentiable patch that replaces the corner in the usual $L_1$ objective. The constant $\gamma$ determines the width of this interval; smaller values of $\gamma$ more closely approximate $L_1$.
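The properties claimed for the summand of (2.4) can be checked numerically. The sketch below assumes the reading of (2.4) in which the cubic piece is $\frac{2}{3\gamma} d_i^2 - \frac{1}{9\gamma^2} d_i^3$ and the linear piece is $d_i - \frac{4}{9}\gamma$, which is the reading that makes the two pieces meet with unit slope at $d_i = \gamma$. The names `rc_summand` and `rc` are hypothetical.

```python
import numpy as np

def rc_summand(d, gamma):
    """Rounded-corners summand: convex cubic for |d| <= gamma, unit-slope line beyond."""
    d = np.abs(np.asarray(d, dtype=float))
    inner = (2.0 / (3.0 * gamma)) * d**2 - d**3 / (9.0 * gamma**2)
    outer = d - 4.0 * gamma / 9.0
    return np.where(d <= gamma, inner, outer)

def rc(y, x, gamma):
    """RC objective: sum of the summand over the coordinates of y - x."""
    diff = np.asarray(y, dtype=float) - np.asarray(x, dtype=float)
    return float(np.sum(rc_summand(diff, gamma)))

gamma, eps = 1.0, 1e-6

# Continuity and slope at the join d = gamma (central finite difference).
left = float(rc_summand(gamma - eps, gamma))
right = float(rc_summand(gamma + eps, gamma))
slope_at_join = (right - left) / (2.0 * eps)

# Slope at d = 0, where the plain L1 summand has its corner.
slope_at_zero = float(rc_summand(eps, gamma) - rc_summand(-eps, gamma)) / (2.0 * eps)

print(abs(right - left), slope_at_join, slope_at_zero)
```

Both pieces equal $\frac{5}{9}\gamma$ at the join, and the finite-difference slopes should be close to 1 at $d_i = \gamma$ and 0 at $d_i = 0$.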

Figure 2.7 compares the summands of $L_1$, $RC_1$, $RC_2$, and $\Psi_{\tan}$. The RC objective achieves the same aims as $\Psi_{\tan}$, but without the need for integration. It also allows the amount of curvature at the vertex to be controlled by changing the value of $\gamma$.

The goal of shape-constrained estimation is to find a good density estimate that satisfies the constraint. If the problem is posed in terms of the adjustment vector $q$, rather than the density estimate itself, then in some situations the set of solutions $\{q\}$ can have a many-to-one mapping onto the set of density estimates $\{\hat f_y\}$. Data sharpening has this property, for example, because the KDE is invariant to permutations of $y$ while the $L_\alpha$ and RC objectives are not. If two solution vectors $y_1$ and $y_2$ are permutations of each other then they are practically equivalent; nonetheless, numerical optimization routines using objective functions (1.7) or (2.4) will consider them to be different because $\delta(y_1,x) \ne \delta(y_2,x)$ in general for those objectives.
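A minimal sketch of this many-to-one mapping, assuming a Gaussian kernel with a fixed bandwidth; the data, bandwidth, and the helper name `gaussian_kde` are all hypothetical. Two permuted solution vectors give the same density estimate but very different unsorted $L_2$ objective values.

```python
import numpy as np

def gaussian_kde(points, grid, h):
    """Gaussian KDE with bandwidth h, evaluated at each grid value."""
    points = np.asarray(points, dtype=float)
    z = (np.asarray(grid, dtype=float)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2.0 * np.pi))

x = np.array([-1.0, 0.0, 2.0])    # hypothetical observed data
y1 = np.array([-0.8, 0.1, 1.7])   # a sharpened solution
y2 = y1[[2, 0, 1]]                # the same points in a different order

grid = np.linspace(-5.0, 6.0, 401)
same_density = np.allclose(gaussian_kde(y1, grid, 0.5), gaussian_kde(y2, grid, 0.5))

dist1 = float(np.sum((y1 - x) ** 2))  # unsorted L2 objective for y1
dist2 = float(np.sum((y2 - x) ** 2))  # unsorted L2 objective for y2

print(same_density, dist1, dist2)
```

Here `dist1` is about 0.14 and `dist2` about 11.54, even though the two vectors define the same estimate.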

One remedy is a matching step to enforce permutation invariance on the solutions. The simplest way to match points in 1-D problems is to start with $t$ sorted in ascending order and then sort any proposed $q$ before calculating the objective function. A sorted $L_\alpha$ objective can then be defined as:

$$L^s_\alpha(q,t) = L_\alpha(\mathrm{sort}(q), t) = \sum_{i=1}^{n} |q_{(i)} - t_{(i)}|^\alpha, \quad 1 \le \alpha \le 2, \tag{2.5}$$

where $q_{(i)}$ represents the $i$th smallest point in $q$ (its $i$th order statistic). The sorted version of RC can be similarly defined and denoted $RC^s_\gamma(q,t)$.
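The sorted objective in (2.5) can be sketched in a few lines. The function name `sorted_l_alpha` and the example vectors are hypothetical; the target is also sorted inside the function as a convenience, which has no effect if $t$ is already in ascending order.

```python
import numpy as np

def sorted_l_alpha(q, t, alpha):
    """Sorted L_alpha objective: match order statistics of q and t before summing."""
    q_sorted = np.sort(np.asarray(q, dtype=float))
    t_sorted = np.sort(np.asarray(t, dtype=float))
    return float(np.sum(np.abs(q_sorted - t_sorted) ** alpha))

t = np.array([-1.0, 0.0, 2.0])
q = np.array([-0.8, 0.1, 1.7])
q_permuted = q[[2, 0, 1]]

# Both orderings of the same points now receive the same objective value.
print(sorted_l_alpha(q, t, 2), sorted_l_alpha(q_permuted, t, 2))
```

Sorting costs $O(n \log n)$ per evaluation, which is usually small compared with evaluating the density estimate itself.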

An optimization heuristic using only the un-sorted objective function will have no way of knowing whether a solution's objective value could be improved by re-matching its points to $t$. The algorithm may fail to use promising solution paths, or may converge to sub-optimal solutions when points "cross over" each other into an un-matched state. Performing matching before evaluating the solution might improve performance and reliability of solution methods.

2.3.2 Objectives Based on Density Estimates

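Another approach to choosing an objective function is to use a metric based on the constrained and pilot density estimates, $\hat f_q$ and $\hat f_t$. There are a number of suitable distance or discrepancy measures available, including integrated squared error (ISE), Kullback-Leibler divergence (KL), and total variation (TV), respectively defined as

$$\mathrm{ISE}(q,t) = \int_{-\infty}^{\infty} \left( \hat f_t(t) - \hat f_q(t) \right)^2 dt, \tag{2.6}$$

$$\mathrm{KL}(q,t) = \int_{-\infty}^{\infty} \hat f_q(t) \ln\!\left( \frac{\hat f_q(t)}{\hat f_t(t)} \right) dt, \tag{2.7}$$

$$\mathrm{TV}(q,t) = \frac{1}{2} \int_{-\infty}^{\infty} \left| \hat f_t(t) - \hat f_q(t) \right| dt. \tag{2.8}$$

Here the subscript $t$ denotes the target vector, while the argument $t$ is the scalar variable of integration. ISE is an integrated $L_2$ distance between estimates, while TV is an integrated $L_1$ distance. KL is not a true distance because $\mathrm{KL}(q,t) \ne \mathrm{KL}(t,q)$.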
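These three measures can be approximated on a grid. The sketch below assumes Gaussian-kernel estimates built directly on the points $q$ and $t$ (as in data sharpening) and uses simple Riemann sums; the data, bandwidth, grid, and function names are hypothetical.

```python
import numpy as np

def gaussian_kde(points, grid, h):
    """Gaussian KDE with bandwidth h, evaluated at each grid value."""
    points = np.asarray(points, dtype=float)
    z = (np.asarray(grid, dtype=float)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2.0 * np.pi))

def discrepancies(q, t, h, grid):
    """Riemann-sum approximations to ISE (2.6), KL (2.7) and TV (2.8)."""
    f_q = gaussian_kde(q, grid, h)
    f_t = gaussian_kde(t, grid, h)
    dx = grid[1] - grid[0]
    ise = float(np.sum((f_t - f_q) ** 2) * dx)
    kl = float(np.sum(f_q * np.log(f_q / f_t)) * dx)
    tv = float(0.5 * np.sum(np.abs(f_t - f_q)) * dx)
    return ise, kl, tv

t = np.array([-1.0, 0.0, 2.0])
q = np.array([-0.8, 0.1, 1.7])
# The grid is kept within a few bandwidths of the data so neither density underflows.
grid = np.linspace(-4.0, 5.0, 901)

ise_qt, kl_qt, tv_qt = discrepancies(q, t, 0.5, grid)
_, kl_tq, _ = discrepancies(t, q, 0.5, grid)

print(ise_qt, tv_qt)
print(kl_qt, kl_tq)  # the two KL values need not agree
```

Swapping the arguments leaves ISE and TV unchanged by construction, whereas the two KL values are computed from different integrands.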

Devroye and Lugosi (2001) examined different distance measures in a density estimation context and concluded that the TV distance has several theoretical advantages. In particular, it admits a probability interpretation: for any Borel set $B$, $|P_{\hat f_t}(B) - P_{\hat f_q}(B)| \le \mathrm{TV}(q,t)$. That is, $\mathrm{TV}(q,t)$ is the maximum possible difference attainable when the same probability is calculated with the two density estimates $\hat f_q$ and $\hat f_t$.
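The probability interpretation can be checked numerically. The sketch below again assumes Gaussian-kernel estimates on hypothetical data and approximates all integrals by Riemann sums. It compares the probability difference on an arbitrary interval with TV, and then evaluates the difference on the set where $\hat f_t > \hat f_q$, which is where the bound is attained.

```python
import numpy as np

def gaussian_kde(points, grid, h):
    """Gaussian KDE with bandwidth h, evaluated at each grid value."""
    points = np.asarray(points, dtype=float)
    z = (np.asarray(grid, dtype=float)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2.0 * np.pi))

t = np.array([-1.0, 0.0, 2.0])
q = np.array([-0.8, 0.1, 1.7])
grid = np.linspace(-4.0, 5.0, 901)
dx = grid[1] - grid[0]

f_t = gaussian_kde(t, grid, 0.5)
f_q = gaussian_kde(q, grid, 0.5)
tv = float(0.5 * np.sum(np.abs(f_t - f_q)) * dx)

# Probability difference on an arbitrary set, here the interval B = [0, 1].
in_b = (grid >= 0.0) & (grid <= 1.0)
diff_on_b = abs(float(np.sum((f_t - f_q)[in_b]) * dx))

# Probability difference on the set {f_t > f_q}, which attains the maximum.
diff_on_best = float(np.sum((f_t - f_q)[f_t > f_q]) * dx)

print(diff_on_b, diff_on_best, tv)
```

Up to discretization error, `diff_on_b` should not exceed `tv`, and `diff_on_best` should match it.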

Objectives based directly on the estimates have the advantage of being insensitive to the ordering of $q$ and $t$. On the other hand, the density-based objectives are specific to the density-estimation context, and would not apply if, for example, the constraint handling methods were used in a monotone regression problem.

2.3.3 A Likelihood Objective

A final objective function is based on the negative log-likelihood of the data under the constrained density $\hat f_q$. Using this objective, the goal is to find the shape-restricted KDE that assigns greatest likelihood to the observed data. The objective function can be written as

$$\mathrm{LIK}(q,x) = -\sum_{i=1}^{n} \ln \hat f_q(x_i). \tag{2.9}$$
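A minimal sketch of (2.9), assuming the data-sharpening setting in which $\hat f_q$ is a Gaussian KDE built on the points $q$ with a fixed bandwidth. The eight data values, the bandwidth, and the function names are hypothetical and are not the example used elsewhere in this document.

```python
import numpy as np

def gaussian_kde(points, grid, h):
    """Gaussian KDE with bandwidth h, evaluated at each grid value."""
    points = np.asarray(points, dtype=float)
    z = (np.asarray(grid, dtype=float)[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2.0 * np.pi))

def lik(q, x, h):
    """Negative log-likelihood of the data x under the KDE built on q."""
    return float(-np.sum(np.log(gaussian_kde(q, x, h))))

x = np.array([-2.0, -1.0, -0.5, 0.0, 0.3, 1.0, 1.5, 2.5])  # hypothetical data
h = 0.5

# Move only the two extreme kernel centres slightly inward.
y = x.copy()
y[0] += 0.02
y[-1] -= 0.02

lik_at_data = lik(x, x, h)
lik_shifted = lik(y, x, h)
print(lik_at_data, lik_shifted)
```

In this sketch the shifted solution has a lower LIK value than the unadjusted data. Moving an extreme kernel centre inward raises the estimated density at every other observation, and for a small shift this outweighs the loss at the extreme observation itself.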

The LIK objective has $q$ and $x$ as its arguments, rather than $q$ and $t$ as in the general framework. As such, it does not strictly fit into the general framework previously proposed (except in the case of data sharpening, where $x = t$). The algorithms presented in later chapters all take $t$ or $\hat f_t$ as their reference point: the goal is to make the adjusted estimate as close to the pilot estimate as possible. If instead we use likelihood as the measure of success, the pilot estimate might not be the best point of reference and the algorithms might not work as well.

Despite these complications, the LIK objective is given here because of its intuitive appeal. Likelihood is also used in Section 2.4 to motivate a bandwidth selector suitable for shape-constrained estimates.

2.3.4 Visualizing the Objective Functions

Section 2.2.1 included a small example where two of eight points were moved to achieve unimodality (Figure 2.3). The same example can be used to visualize the different objective functions in two dimensions. Figure 2.8 shows this problem’s objective function contours for the eight objectives defined above. Each graph in the figure shows the solution space for the problem, with each point (y1, y8) in the graphs representing a potential solution.

Seven of the objective functions have their minimum value at the observed data. LIK is the exception, reaching its minimum when both $y_1$ and $y_8$ are shifted slightly inward from their corresponding $x$ values.

The $L_1$, $L_2$, and RC objectives are all convex functions of $(y_1, y_8)$. The sorted objective $L^s_2$ and the four density-based objectives (ISE, TV, KL, and LIK) are not; they exhibit many local optima, ridges, and plateaus. They are all symmetric around the $y_1 = y_8$ line, consistent with the symmetry of the original problem.
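The loss of convexity caused by sorting can be seen with a midpoint test on a hypothetical two-point problem (not the eight-point example in Figure 2.8). A convex function never exceeds the average of its endpoint values at the midpoint of a segment; the sorted $L_2$ objective violates this for two solutions that are permutations of each other. The function names below are hypothetical.

```python
import numpy as np

def l2(q, t):
    """Unsorted L2 objective: sum of squared coordinate differences."""
    return float(np.sum((np.asarray(q, dtype=float) - np.asarray(t, dtype=float)) ** 2))

def sorted_l2(q, t):
    """Sorted L2 objective: match order statistics before summing."""
    return l2(np.sort(np.asarray(q, dtype=float)), np.sort(np.asarray(t, dtype=float)))

t = np.array([0.0, 1.0])
a = np.array([0.0, 1.0])   # matches t exactly
b = np.array([1.0, 0.0])   # the same points, crossed over
mid = 0.5 * (a + b)        # [0.5, 0.5]

# Midpoint convexity requires f(mid) <= (f(a) + f(b)) / 2.
l2_ok = l2(mid, t) <= 0.5 * (l2(a, t) + l2(b, t))
sorted_ok = sorted_l2(mid, t) <= 0.5 * (sorted_l2(a, t) + sorted_l2(b, t))

print(l2(a, t), l2(b, t), l2(mid, t), l2_ok)
# 0.0 2.0 0.5 True
print(sorted_l2(a, t), sorted_l2(b, t), sorted_l2(mid, t), sorted_ok)
# 0.0 0.0 0.5 False
```

The two crossed-over solutions are both global minima of the sorted objective, and the ridge between them lies along the line where the two coordinates are equal.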

From an optimization standpoint, the convexity of the first three functions is attractive. All common deterministic optimization routines require a convex objective function in order to reliably find a global optimum. Nevertheless, it would be better if the choice of objective was not motivated by optimization convenience, and in any given situation it may be desirable to use one of the non-convex objectives for theoretical or practical reasons.
