Kernel Fisher Discriminant Analysis - A practical view of large-scale classification: feature s

3.4 Discussion

4.1.1 Kernel Fisher Discriminant Analysis

The Fisher Discriminant Analysis expounded in Section 2.3.1 is probably the most pop- ular supervised feature extraction techniques in machine learning. Although relying on heavy assumptions which are not true in many applications, Fisher Discriminant Analy- sis has proven to be very powerful and it is a point of reference in many machine learning systems due to its appealing properties:

1. The existence of a global solution and the absence of a local minima. 2. The solution has a closed form.

3. Its clear interpretation.

However, the analysis carried out in the previous section still holds here: for most real- world data, a linear discriminant is not complex enough. To tackle this problem, two different approaches can be considered:

1. Use more sophisticated distributions –instead of gaussian distributions– to model the data. Assuming general distributions often sacrifices the simple and closed form and it requires more complex solvers, which is not desirable for large-scale problems.

2. Look for nonlinear directions. These nonlinearities can be in turn introduced explicitly or implicitly.

• Introducing explicit nonlinearities. A set D of predefined nonlinear functions Φi(x) acting directly on the samples is considered: D = {Φi : X → R}. Then, the optimal discriminant is defined as,

f (x) = X

Φi(x)∈D

These models produce nonlinear functions and hence, they are known as gen- eralized linear discriminants. Taking advantage of the linearity of f with re- spect to the parameters w and b, Equation 4.1 can be optimized via least squares fitting. Well-known examples following this approach are radial- basis function (RBF) networks [95], where nonlinear functions have the form Φi(x) = exp(−kxi − xk2/σ) with xi a fixed pattern and σ ≥ 0. If σ pa- rameter is also optimized, the function f is not linear anymore and methods like gradient descent should be applied. Many other works extending the classical linear discriminant framework can be found in the literature. These methods are based on the introduction of special nonlinearities, the impo- sition of certain types of regularization and the use of special optimization algorithms [42]. While an appropriate choice of functions D could approximate any function with an arbitrary precision, the presence of parameters to be tuned in the basis functions (multilayer neural networks) complicates the optimization problem, which is often prone to local minimal and lack for an intuitive interpretation.

• Introducing implicit nonlinearities. Instead of hand-crafting a nonlinear preprocessing, kernel functions are introduced. The idea introduced by Mika et al. [96] suggests an efficient way to increase the expressiveness of the discriminant via nonlinear directions applying the kernel-trick (Section 2.3.2.1): first map the data nonlinearly into some feature space F and compute the Fisher Discriminant Analysis there, which yields a nonlinear discriminant in the original input space. Being still the solutions hardly interpretable, this new formulation is able to find closed form solutions while maintaining the theoretical beauty of Fisher discriminant providing the model with flexibility by means of the use of different kernels. The rest of this section is directed to explain in detail the kernelization of the Fisher discriminant, closely related with the QPFS kernelization.

Formalizing the idea given by Mika et al. [96], let S+ = {x+1, . . . , x +

N1} and S− =

{x−₁, . . . , x−_N

2} be samples from two different classes, xi ∈ R

M _{and S = S}

+∪ S− the complete set of N (N = N++ N−) training samples. And let y ∈ {−1, 1}N be the vector with the corresponding labels. Let Φ : RM _{−→ F be the mapping function to the kernel} space and K(xi, xj) = Φ(xi) · Φ(xj) the Mercer kernel which defines the dot-product

Linear Fisher Direction

(a)

Kernel Fisher Preimage

(b)

Figure 4.3: Illustrating the performance of Fisher Discriminant in a nonlinear synthetic example. Figure 4.3(a) shows the projection direction of the Fisher discriminant in the original space. Figure 4.3(b) shows the preimage in the original space of the

Kernel Fisher Discriminant.

in F . To find the Fisher Discriminant Analysis in F , the following quantity must be maximized1, J (w) = w T_SΦ Bw wTSΦ Ww (4.2)

where w ∈ F and S_BΦand S_WΦ are the corresponding between and within scatter matrices in F , i.e. S_BΦ = (µΦ₊− µ₋Φ)(µΦ₊− µΦ₋)T S_WΦ = X x∈S+ (Φ(x) − µΦ₊)(Φ(x) − µΦ₊)T + X x∈S− (Φ(x) − µΦ₋)(Φ(x) − µΦ₋)T with µΦ + = 1 N+ P x∈S+Φ(x) and µ Φ − = N1− P

x∈S−Φ(x). Finding a solution to Equa-

tion 4.2 in the kernel space F requires to reformulate it in terms of only dot products of the input patterns. Every solution w ∈ F can be written as an expansion in terms of the mapped training data w =PN

i=1αiΦ (xi) making the things tractable. Consider the symmetric operators S on the finite-dimensional subspace spanned by the Φ(xi) in a feature space F –possibly infinite–. For example, any matrix S constructed as a linear combination of {Φ(xi)} as matrices SB and SW. Writing w = v1+ v2 as the decomposition of the part v1 lying in the subspace spanned by the training data and

Observe the similarity between the following expressions and the objective function of Fisher Dis- criminant Analysis given by Equations 2.13–2.15.

v2 lying in its orthogonal subspace, it can be shown that it is sufficient to consider the part of the quadratic form wTSw which lies in the span of the examples,

hw, Swi = h(v1+ v2), S(v1+ v2)i = h(v1+ v2), Sv1i = hv1, Sv1i.

These equalities use the properties Sv2 = 0 and hv, Svi. Moreover, as v1 lies in the span of {Φ(xi)} and S, by construction, only operates in this subspace, Sv1 lies in the subspace spanned by {Φ(xi)} as well.

The formulation of w as the linear combination of the mapped patterns, w =PN

i=1αiΦ (xi) makes it possible to write Equation 4.2 in terms of dot products in the kernel space as follows, J (α) = α T_Bα αTW α (4.3) being B = (B+− B−)(B+− B−)T W = K+(IN+− 1N+)K T ++ K−(IN−− 1N−)K T − (B+)_j = _N1₊P_k=1N+ K(xj, x+_k) (B−)_j = _N1₋P_k=1N− K(xj, x−_k)

where K+ is a N × N+matrix with (K+)nm = K(xn, x+m), IN+ is the identity matrix in

RN+×N+ and 1N+ is a square matrix in R

N+×N+ _{with all entries equal to} 1

N+. Definitions

for the negative class are analogous. Similarly to how the linear Fisher direction is obtained in the Fisher Discriminant Analysis, αKFDA _{can be worked out by finding the} leading eigenvector of W−1B or by computing αKFDA = W−1(B−− B+). However, the proposed model is ill-posed [15]: the N -dimensional covariance matrix has been estimated with N patterns which cause matrix W not to be positive2. Besides the

Let W = KDKT with D = I − v1vT1 − v2v2T, vj ∈ RN and (v+)i = √1

N+

if the example i belongs to the positive class and zero otherwise. The same applies for negative samples. Rank(W ) ≤ min{Rank(K), Rank(D)}. Matrix K is usually full rank in practice but anyway, the rank of matrix D is N − 2 spanning v1and v2the null space of D.

numerical problems, successful learning requires to control the size and complexity of the classification model as stated in Section 2.1. While the Fisher discriminant is rather simple, its reformulation in a kernel space admits a wide variety of nonlinear solutions and the incorporation of a regularization term improves the generalization capability of the model. Regularization functions as kαk2, kwk2 and others have been proposed in [96–98]. The KFDA authors propose to approximate matrix W by Wµ= W + µWIN with µW ≥ 0 and, as it will be shown later, this regularization is equivalent to regularize kαk2_{. This kind of regularization can be viewed from different perspectives:}

• If µW is large enough (ideally, the minimum value which makes W positive defi- nite), the problem in Equation 4.3 is feasible and numerically more stable. • As µ_W is increased, the variance of the W estimation decreases. In fact, when

µW → ∞ the solution lie in the direction of (B−− B+). • The regularization kαk2 _{is a l}

2-regularization which favors solutions with small expansion coefficients.

Regularization on kwk2 is achieved by adding a multiple of the kernel matrix to W : Wµ = W + µKK (µK ≥ 0). This regularization resembles Support Vector Machines studied in Chapter 2 (Section 2.3.2) and indeed it can be formulated as a Least Squares SVM problem [99].

The question now is how to project a point into the direction given by the coefficients αKFDA if wKFDAis not explicitly known (recall that it has been assumed that wKFDA=

i=1αKFDAi Φ(xi)). The dependence on Φ must be removed and the projection of a point need to be expressed as a function of the kernel matrix. More precisely, let ˜x a point in the original input space, its projection according to the KFDA optimal direction is a one-dimensional feature given by,

˜ z = wKFDA· Φ(˜x) = N X i=1 αKFDA_i Φ(xi) ! · Φ(˜x) = N X i=1 αKFDA_i (Φ(xi) · Φ(˜x)) = N X i=1 αKFDA_i K(xi, ˜x).

Regarding the synthetic example given in Figure 4.1, Figure 4.3(a) shows the projection direction computed via FDA which does not improve the discriminatory information in the data. By contrast, the KFDA boundary shown in Figure 4.3(b) separates linearly the input data with a single feature. Then, why not use always KFDA to ensure the most general model? The answer is simple, while the projection of a point to the linear Fisher direction only requires to evaluate one dot product in the input space, the projection of a point in the kernel space requires as many kernel evaluations as training samples and weighting each evaluation by αKFDA. In short, the evaluation of the kernel Fisher discriminant is so computationally intensive that its use is only justified in those cases in which linear feature extraction is insufficient. In addition to its high computational cost and unlike its linear counterpart, when KFDA is used it may not be possible to find an exact preimage of a reconstructed pattern. And, as in FDA, KFDA for binary problems project the data into one dimensional space which may be not enough for complex classification problems.

In document A practical view of large-scale classification: feature selection and real-time classification (Page 113-118)