Minimum Density Hyperplanes - Low Density Cluster Separators for Large, High Dimensional, Mixed

4.2 Methodology

4.2.1 Minimum Density Hyperplanes

A hyperplane can be defined by a unit-length vector $v \in S^{d-1} = \{x \in \mathbb{R}^d \mid \|x\| = 1\}$ and a displacement from the origin $b \in \mathbb{R}$, as $H(v,b) = \{x \in \mathbb{R}^d \mid v^\top x = b\}$. To quantify the density of the region intersected by a hyperplane with respect to $\hat{p}_x$ we adapt the density on a hyperplane criterion proposed by Ben-David et al. (2009),

$$\hat{I}(v,b) = \int_{x \in H(v,b)} \hat{p}_x(x)\,dx. \tag{4.1}$$

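The hyperplane that minimises $\hat{I}(v,b)$ is called the minimum density hyperplane (MDH). $\hat{I}(v,b)$ cannot be evaluated analytically for all types of density estimators, but when $\hat{p}_x$ is constructed from an isotropic Gaussian kernel density estimate Eq. (4.1) simplifies greatly,

$$\begin{aligned}
\hat{I}(v,b) &= \int_{x \in H(v,b)} \frac{1}{n(2\pi h^2)^{d/2}} \sum_{i=1}^{n} \exp\left\{-\frac{\|x - x_i\|^2}{2h^2}\right\} dx \\
&= \frac{1}{n\sqrt{2\pi h^2}} \sum_{i=1}^{n} \exp\left\{-\frac{(b - v^\top x_i)^2}{2h^2}\right\} \\
&= \hat{p}_{v^\top x}(b),
\end{aligned} \tag{4.2}$$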

where $\hat{p}_{v^\top x}$ denotes a one-dimensional kernel density estimator constructed from the projection of $X$ onto $v$, and using the same bandwidth, $h$, as $\hat{p}_x$. Eq. (4.2) states that $\hat{I}(v,b)$ can be computed exactly by projecting the data onto $v$; constructing a one-dimensional density estimator from these projections that uses Gaussian kernels with bandwidth $h$; and evaluating it at $b$. Since projections can only contract pairwise distances, it can be shown that $\hat{I}(v,b)$ imposes an upper bound on the estimated density at any point on the hyperplane $H(v,b)$ (Pavlidis et al., 2016),

$$\max_{x \in H(v,b)} \hat{p}_x(x) \leqslant (2\pi h^2)^{\frac{1-d}{2}}\, \hat{I}(v,b).$$

This bound is tight if only one-dimensional projections of $X$ are used. Therefore, the MDH imposes the lowest upper bound (that can be achieved using one-dimensional projections only) on the maximum value of $\hat{p}_x$ along a hyperplane separator.
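The projection identity in Eq. (4.2) and the upper bound above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the synthetic two-blob data, and the values of $h$, $b$ and $v$ are all illustrative assumptions.

```python
import math
import random

def kde_full(x, data, h):
    """Isotropic Gaussian KDE p_hat_x evaluated at a d-dimensional point x."""
    d = len(x)
    norm = len(data) * (2 * math.pi * h * h) ** (d / 2)
    total = 0.0
    for xi in data:
        sq_dist = sum((a - c) ** 2 for a, c in zip(x, xi))
        total += math.exp(-sq_dist / (2 * h * h))
    return total / norm

def density_on_hyperplane(v, b, data, h):
    """I_hat(v, b) via Eq. (4.2): 1-D KDE of the projections v^T x_i, evaluated at b."""
    proj = [sum(vj * xj for vj, xj in zip(v, xi)) for xi in data]
    norm = len(data) * math.sqrt(2 * math.pi * h * h)
    return sum(math.exp(-(b - p) ** 2 / (2 * h * h)) for p in proj) / norm

def random_point_on_hyperplane(v, b, rng):
    """Return b*v + w with w orthogonal to the unit vector v, so that v^T x = b."""
    z = [rng.gauss(0.0, 2.0) for _ in v]
    vz = sum(vj * zj for vj, zj in zip(v, z))
    return [b * vj + zj - vz * vj for vj, zj in zip(v, z)]

# Illustrative data (assumption): two Gaussian blobs in three dimensions.
rng = random.Random(0)
d, h, b = 3, 0.4, 0.3
data = [[rng.gauss(m, 0.5) for _ in range(d)] for m in (-2.0, 2.0) for _ in range(50)]
v = [1 / math.sqrt(d)] * d  # unit-length projection vector

i_hat = density_on_hyperplane(v, b, data, h)
bound = (2 * math.pi * h * h) ** ((1 - d) / 2) * i_hat

# The full density at sampled points of H(v, b) should never exceed the bound
# (a tiny relative tolerance absorbs floating-point rounding).
points = [random_point_on_hyperplane(v, b, rng) for _ in range(200)]
bound_holds = all(kde_full(x, data, h) <= bound * (1 + 1e-9) for x in points)
print(i_hat, bound, bound_holds)
```

Only the $n$ projections are needed to evaluate $\hat{I}(v,b)$, so the cost of one evaluation is linear in $n$ and $d$ and involves no $d$-dimensional integration.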

Assuming without loss of generality that $X$ is centred at zero, the MDH is the solution to the optimisation problem,

$$\min_{v,b}\ \hat{I}(v,b), \quad \text{s.t. } b \in [-\alpha\sigma_v, \alpha\sigma_v], \tag{4.3}$$

where $\sigma_v$ denotes the standard deviation of the data projected onto $v$, and $\alpha > 0$ is a user-defined parameter controlling the width of the search interval for $b$, discussed in detail below. It is necessary to constrain the displacement of the separating hyperplane from the origin, $|b|$, because for any $v \in S^{d-1}$ a hyperplane of arbitrarily low density can be found for sufficiently large $|b|$; that is, $\lim_{|b| \to \infty} \hat{I}(v,b) = 0$. Such hyperplanes are clearly not meaningful for clustering as they assign all observations to one cluster. The constrained optimisation problem in Eq. (4.3) exhibits multiple local minima, as demonstrated in Figure 4.1, which shows the value of $\hat{I}(v,b)$ with changes in the projection angle and displacement from the origin, as well as the resulting hyperplane separators obtained through sequential quadratic programming (SQP) (with 50 random initialisations) over the S4 dataset (Fränti and Virmajoki, 2006).

Figure 4.1: Illustration of local minima of $\hat{I}(v,b)$ and the resulting hyperplane separators from constrained optimisation with 50 random initialisations for the S4 dataset. Panels: (a) $\hat{I}(v,b)$; (b) hyperplane separators from SQP.
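To illustrate why the constraint on $b$ in Eq. (4.3) is needed, the sketch below fixes $v$ and works directly with the projections $v^\top x_i$. It is an illustration under stated assumptions, not the SQP procedure used for Figure 4.1: the grid search, the function names, the toy projections, and the use of the population standard deviation for $\sigma_v$ are all assumptions.

```python
import math
import statistics

def i_hat(b, proj, h):
    """I_hat(v, b) for fixed v, given the projections proj = [v^T x_i]."""
    norm = len(proj) * math.sqrt(2 * math.pi * h * h)
    return sum(math.exp(-(b - p) ** 2 / (2 * h * h)) for p in proj) / norm

def constrained_argmin_b(proj, h, alpha, grid=401):
    """Grid search for the minimiser of I_hat(v, .) over [-alpha*sigma_v, alpha*sigma_v]."""
    sigma_v = statistics.pstdev(proj)  # population std; this convention is an assumption
    lo, hi = -alpha * sigma_v, alpha * sigma_v
    candidates = [lo + (hi - lo) * k / (grid - 1) for k in range(grid)]
    return min(candidates, key=lambda b: i_hat(b, proj, h))

# Toy projections (assumption): centred at zero, with two well-separated groups.
proj = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
h = 0.5

b_star = constrained_argmin_b(proj, h, alpha=0.5)
far_away = 20.0  # a displacement far outside the data

# The unconstrained objective is even lower far from the data, yet a hyperplane
# placed there leaves every observation on the same side.
print(b_star, i_hat(b_star, proj, h), i_hat(far_away, proj, h))
print(all(p < far_away for p in proj))
```

In this toy case the constrained minimiser sits in the gap between the two groups, whereas the distant hyperplane attains a smaller density without separating anything.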

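To alleviate the problem of convergence to poor local minima, the following projection pursuit formulation has been proposed (Pavlidis et al., 2016),

$$\phi(v) = \min_{b \in \mathbb{R}} f(v,b), \tag{4.4}$$

$$f(v,b) = \hat{I}(v,b) + \frac{L}{\eta^{\varepsilon}} \max\{0,\, -\alpha\sigma_v - b,\, b - \alpha\sigma_v\}^{1+\varepsilon}, \tag{4.5}$$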

where $L = (e^{1/2} h^2 \sqrt{2\pi})^{-1} \geqslant \sup_{b \in \mathbb{R}} |\hat{p}'_{v^\top x}(b)|$ and $\varepsilon, \eta \in (0, 1)$. We call $f$ the penalised density integral, and $\phi$ the projection index, as it quantifies the suitability of projection vectors for low-density cluster separation. The choice of $L$ ensures that for fixed $v$ the global minimiser of $f(v,b)$ will be within $\eta$ of the minimiser of $\hat{I}(v,b)$ in the interval $[-\alpha\sigma_v, \alpha\sigma_v]$. The penalty function is continuously differentiable everywhere, while resembling the hinge loss function. For $\eta$ and $\varepsilon$, values close to zero and one are recommended respectively. Fig. 4.2 illustrates the two-dimensional A1 dataset (Kärkkäinen and Fränti, 2002), along with a candidate separating hyperplane (black line). The observations projected onto the vector perpendicular to the separating hyperplane are illustrated with red dots. The one-dimensional kernel density estimator constructed from these projections, $\hat{p}_{v^\top x}$, is also illustrated along with the penalised density integral, $f(v,\cdot)$, for three choices of $(\eta, \varepsilon)$. The figure illustrates the effect of the penalty function, which is to ensure that all minimisers of $f(v,\cdot)$ are identical to the minimisers of $\hat{I}(v,\cdot)$ in $[-\alpha\sigma_v, \alpha\sigma_v]$ and differ by at most $\eta$ at the boundaries. The figure also shows that the precise choices of $\eta$ and $\varepsilon$ are not critical, but sensible values are required to avoid numerical instability.

Figure 4.2: Separating hyperplane $H(v,b)$, estimated density of the projections of $X$ onto $v$ (black line), $\hat{I}(v,\cdot)$, and penalised objective function, $f(v,\cdot)$, for $\eta = 0.01$ and $\varepsilon \in \{0.1, 0.3, 0.9\}$ (burgundy, orange and green lines respectively).
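The penalised density integral of Eq. (4.5) and the projection index of Eq. (4.4) can be sketched for a fixed $v$ as follows. This is an illustrative sketch rather than the reference implementation: the function names, the toy projections, the population standard deviation for $\sigma_v$, and the grid search over $b$ (restricted to within $\eta$ of the feasible interval, using the property stated above) are assumptions.

```python
import math
import statistics

def i_hat(b, proj, h):
    """I_hat(v, b) for fixed v, given the projections proj = [v^T x_i]."""
    norm = len(proj) * math.sqrt(2 * math.pi * h * h)
    return sum(math.exp(-(b - p) ** 2 / (2 * h * h)) for p in proj) / norm

def lipschitz_constant(h):
    """L = (e^{1/2} h^2 sqrt(2*pi))^{-1}, an upper bound on the slope of I_hat(v, .)."""
    return 1.0 / (math.exp(0.5) * h * h * math.sqrt(2 * math.pi))

def penalised_density(b, proj, h, alpha, eta=0.01, eps=0.9):
    """f(v, b) from Eq. (4.5): density integral plus a smooth hinge-like penalty."""
    sigma_v = statistics.pstdev(proj)  # population std; this convention is an assumption
    excess = max(0.0, -alpha * sigma_v - b, b - alpha * sigma_v)
    return i_hat(b, proj, h) + lipschitz_constant(h) / eta ** eps * excess ** (1 + eps)

def projection_index(proj, h, alpha, eta=0.01, eps=0.9, grid=2001):
    """phi(v) from Eq. (4.4), approximated by a grid search over b (a simplification)."""
    sigma_v = statistics.pstdev(proj)
    # The minimiser of f lies within eta of the feasible interval, so a grid
    # over the slightly enlarged interval is sufficient.
    lo, hi = -alpha * sigma_v - eta, alpha * sigma_v + eta
    candidates = [lo + (hi - lo) * k / (grid - 1) for k in range(grid)]
    b_star = min(candidates, key=lambda b: penalised_density(b, proj, h, alpha, eta, eps))
    return b_star, penalised_density(b_star, proj, h, alpha, eta, eps)

# Toy projections (assumption): centred at zero, with two well-separated groups.
proj = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]
h, alpha = 0.5, 0.5

b_star, phi = projection_index(proj, h, alpha)
print(b_star, phi)
```

Inside $[-\alpha\sigma_v, \alpha\sigma_v]$ the penalty term vanishes, so $f$ and $\hat{I}$ coincide there; outside it the penalty grows faster than $\hat{I}$ can decrease, which is what the choice of $L$ is meant to guarantee.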

The parameter $\alpha$ determines the range over which minimisers of $\hat{I}(v,\cdot)$ are sought. If $\alpha$ is constant, then its value critically affects the quality of the estimated MDH. Setting $\alpha$ close to zero favours hyperplanes that induce a balanced bi-partition of $X$, but there is no guarantee that clusters can be separated by a hyperplane that goes through the mean of the data. If instead a large value of $\alpha$ is used there is a risk that the MDH will separate the tail of $\hat{p}_x$ rather than separating high-density regions. Instead of selecting a fixed value, it has been recommended in Pavlidis et al. (2016) to estimate the MDH for a sequence of increasing values of $\alpha$, starting from zero, and using the previously identified MDH as the initial projection direction each time $\alpha$ is increased. Setting $\alpha$ to zero initially forces the algorithm to seek low-density hyperplanes that induce a balanced bi-partition of high-density clusters, while increasing $\alpha$ in subsequent steps fine tunes the location of the MDH. The maximum value of $\alpha$ is not critical in this approach as it is straightforward to detect when the MDH is no longer a local minimiser of $\hat{I}(v,\cdot)$ but instead intersects the tail of $\hat{p}_x$. Such solutions are discarded.

Figure 4.3: Illustration of the resulting hyperplane separators from the projection pursuit formulation with 50 random initialisations for the S4 dataset.
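The warm-started sequence over $\alpha$ described above can be sketched in two dimensions, where the projection vector is parameterised by an angle, $v = (\cos\theta, \sin\theta)$. This is a simplified illustration and not the published algorithm: a derivative-free step-halving search over $\theta$ stands in for the gradient-based optimiser, $b$ is minimised on a grid, and the data, the parameter values, the function names and the local-minimiser check are all assumptions.

```python
import math
import statistics

H, ETA, EPS = 0.5, 0.01, 0.9  # bandwidth and penalty parameters (illustrative values)
L_CONST = 1.0 / (math.exp(0.5) * H * H * math.sqrt(2 * math.pi))

def project(data, theta):
    """Project 2-D points onto v = (cos(theta), sin(theta))."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x + s * y for x, y in data]

def i_hat(b, proj):
    """I_hat(v, b): 1-D Gaussian KDE of the projections, evaluated at b."""
    norm = len(proj) * math.sqrt(2 * math.pi) * H
    return sum(math.exp(-(b - p) ** 2 / (2 * H * H)) for p in proj) / norm

def best_b(proj, alpha, grid=101):
    """Grid approximation of the minimiser of f(v, .); returns (b_star, phi(v))."""
    sigma_v = statistics.pstdev(proj)  # population std; this convention is an assumption
    lo, hi = -alpha * sigma_v - ETA, alpha * sigma_v + ETA

    def f(b):
        excess = max(0.0, -alpha * sigma_v - b, b - alpha * sigma_v)
        return i_hat(b, proj) + L_CONST / ETA ** EPS * excess ** (1 + EPS)

    b_star = min((lo + (hi - lo) * k / (grid - 1) for k in range(grid)), key=f)
    return b_star, f(b_star)

def local_search(data, theta, alpha, step=0.2, tol=1e-3):
    """Step-halving descent on theta (a stand-in for a gradient-based optimiser)."""
    value = best_b(project(data, theta), alpha)[1]
    while step > tol:
        moved = False
        for cand in (theta - step, theta + step):
            cand_value = best_b(project(data, cand), alpha)[1]
            if cand_value < value:
                theta, value, moved = cand, cand_value, True
                break
        if not moved:
            step /= 2
    return theta

def mdh_continuation(data, theta0, alphas):
    """Re-optimise theta for each alpha, warm-starting from the previous solution."""
    theta = theta0
    for alpha in alphas:
        theta = local_search(data, theta, alpha)
    b_star, _ = best_b(project(data, theta), alphas[-1])
    return theta, b_star

def is_local_minimiser(b, proj, delta=0.05):
    """Hypothetical check for tail solutions: b must be a local minimum of I_hat(v, .)."""
    return i_hat(b, proj) <= min(i_hat(b - delta, proj), i_hat(b + delta, proj))

# Illustrative centred data (assumption): two square blobs separated along the x-axis.
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5]
data = [(cx + dx, dy) for cx in (-3.0, 3.0) for dx in offsets for dy in offsets]

theta, b_star = mdh_continuation(data, theta0=0.8, alphas=[0.0, 0.25, 0.5, 0.75, 1.0])
keep = is_local_minimiser(b_star, project(data, theta))
print(theta, b_star, keep)
```

For this symmetric toy dataset the search should settle on a direction close to the x-axis with a displacement near zero, which corresponds to the gap between the two blobs. A solution for which the final check fails would be discarded, in the spirit of the rule stated above.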

The formulation in Eqs. (4.4)-(4.5) can accommodate discontinuous changes of the minimiser, $b^\star = \arg\min_{b \in [-\alpha\sigma_v, \alpha\sigma_v]} \hat{I}(v,b)$, as a result of changes in $v$. It is thus less susceptible to convergence to local minima than a simple constrained optimisation formulation, as seen in Figure 4.3, which shows the hyperplane separators on the S4 dataset arising from this projection pursuit formulation with 50 random initialisations. In contrast to the constrained optimisation approach, projection pursuit converges to only a few solutions, all of which correspond to very high-quality cluster separators.

The projection index $\phi(v)$ is defined through a minimisation over $b$ and is therefore, in general, a nonsmooth function. Lewis and Overton (2013) have strongly advocated that a Broyden–Fletcher–Goldfarb–Shanno (BFGS) method using inexact line searches is very efficient for the minimisation of such functions, while being much less computationally demanding than non-smooth optimisation methods like gradient sampling (Burke et al., 2005). We call the projection pursuit algorithm that minimises the projection index $\phi(v)$ minimum density projection pursuit (MDP2).
