
Dimensionality Reduction and Feature Selection using a Mixed-norm Penalty Function


Academic year: 2020


ZENG, HUIWEN. DIMENSIONALITY REDUCTION AND FEATURE SELECTION USING A MIXED-NORM PENALTY FUNCTION. (Under the direction of Professor H. Joel Trussell).

Dimensionality reduction, the process of mapping high-dimensional patterns to lower-dimensional subspaces, is a key issue in enhancing the processing efficiency of high-dimensional data such as hyperspectral images. Dimensionality reduction has been widely discussed in the areas of data mining, image processing, pattern recognition, etc. Because in most situations many of the dimensions are redundant or unnecessary for the tasks of interest, removing those dimensions produces more efficient computation while maintaining the original performance. Dimensionality reduction also reduces the measurement and storage requirements, reduces training and utilization times, and mitigates the curse of dimensionality to improve classification performance.

Feature selection, the process of constructing and selecting subsets of features that are useful for building a good predictor, has been of interest for many years. Before Kohavi and John published a special issue on feature selection in 1997, studies usually considered no more than 40 features. Since then, researchers have looked at problems with hundreds to tens of thousands of features. Like dimensionality reduction, feature selection reduces the measurement and storage requirements, reduces training and utilization times, and facilitates data visualization and data understanding.


by

HUIWEN ZENG

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial satisfaction of the requirements for the Degree of

Doctor of Philosophy

Electrical Engineering

Raleigh 2005

Approved By:

Dr. Wesley Snyder Dr. Arne A. Nilsson


Biography


Acknowledgements

It is indeed my pleasure to express my sincere gratitude to Dr. H. Joel Trussell. His guidance and comments have always helped me to look in the right direction, to attend to the finer aspects of the research, and to obtain meaningful results. I want to thank him not only for the clear and valuable guidance and expert suggestions throughout my PhD years, which led me through the gate of research, but also for the encouragement and positive comments, which have always inspired me to work harder and have given me confidence in these years.

I also want to thank my parents and my sister for their kind love and support, which made all of this possible.

I am extremely grateful to the other committee members, Dr. Wesley Snyder, Dr. Tim Kelley and Dr. Arne A. Nilsson, who gave important advice on both the research and the dissertation, which helped me a great deal in completing the project successfully.


Contents

List of Figures vii

List of Tables x

1 Introduction 1

1.1 Hyperspectral Images . . . 2

1.2 The Definitions of Dimensionality Reduction and Feature Selection . . . 5

1.2.1 Dimensionality Reduction Problem . . . 5

1.2.2 Feature Selection . . . 8

1.3 The Motivations for Dimensionality Reduction and Feature Selection . . . . 10

2 Existing Methods for Dimensionality Reduction 12

2.1 Principal Component Analysis (PCA) . . . 13

2.2 Linear Classifier with the Vora Methods . . . 18

2.2.1 Select Residual Space . . . 18

2.2.2 Algorithm for Finding Linear Classifier . . . 25

2.3 Pruning with Neural Networks . . . 26

2.3.1 Introduction to ANN . . . 26

2.3.2 Pruning Methods . . . 29

2.3.3 Optimal Brain Damage . . . 29

2.3.4 Optimal Brain Surgery . . . 30

2.4 Penalty Functions . . . 31

2.4.1 Weight Decay . . . 31

2.4.2 Weight Elimination . . . 32

2.4.3 Laplace Prior . . . 33

2.4.4 Hoyer’s Method . . . 34

3 Existing Methods for Feature Selection 35

3.1 Wrapper Method . . . 35

3.1.1 Wrapper for Neural Networks . . . 36

3.1.2 Wrapper for Support Vector Machines (SVM) . . . 37


3.2 Filter Method . . . 49

3.2.1 The FOCUS Algorithm . . . 50

3.2.2 The Relief Algorithm . . . 50

3.2.3 Filter Method Decision Tree . . . 51

4 The Mixed-Norm Penalty Function 53

4.1 The Mixed-Norm Penalty Function and Dimensionality Reduction . . . 54

4.1.1 The Mixed-Norm Penalty Functions . . . 55

4.1.2 Neural Networks and Bayesian Classifier . . . 61

4.1.3 Bi-Level Optimization . . . 64

4.2 The Mixed-Norm Penalty Function and Feature Selection . . . 66

5 Implementation and Practical Issues 69

5.1 MATLAB Constrained Optimization Routine fmincon . . . 69

5.1.1 An Overview of SQP . . . 70

5.1.2 SQP Implementation . . . 71

5.1.3 Choosing Stepsize . . . 73

5.2 Non-negativity . . . 74

5.3 Trade-offs in Dimensionality vs. Smoothness . . . 75

5.4 Varying Illumination . . . 76

6 Experiments and Results for Dimensionality Reduction and Feature Selection 79

6.1 Dimensionality Reduction with the Vora Value . . . 79

6.1.1 Residual Space R = [F1, C4] . . . 81

6.1.2 Residual Space R = [F3, C4] . . . 81

6.2 Dimensionality Reduction Using Penalty Functions . . . 86

6.2.1 XOR Data . . . 88

6.2.2 Noisy XOR Data . . . 89

6.2.3 Hyperspectral Bear Image . . . 94

6.2.4 Hyperspectral Pond Image . . . 97

6.2.5 The Hoyer’s function Vs. The Mixed-Norm Penalty Function . . . . 100

6.2.6 Result for Smoothness Control . . . 100

6.2.7 Result and Examples for Varying Illumination . . . 102

6.3 Feature Selection . . . 110

6.3.1 Synthetic Linearly Separable Case . . . 110

6.3.2 Synthetic Nonlinearly Separable Case . . . 113

6.3.3 Hyperspectral Images . . . 117

6.3.4 Wisconsin Breast Cancer Data . . . 121

7 Conclusion 125


List of Figures

1.1 The Electromagnetic Spectrum (400nm-700nm is the visible spectrum) . . 2

1.2 An example of Hyperspectral Image . . . 4

1.3 An example of mappings F and f . . . 7

1.4 The Dimensionality Reduction Problem . . . 7

1.5 The Hierarchy of Feature Types . . . 8

2.1 A Principal Component Decomposition . . . 17

2.2 Samples from the Target . . . 19

2.3 Samples from the Neutral Clutter . . . 19

2.4 Comparison between the 1st Eigenvectors of Target and Neutral Clutter . . 20

2.5 Comparison between the 2nd Eigenvectors of Target and Neutral Clutter . . 21

2.6 Vora value of fur space and clutter space . . . 23

2.7 Ordered Vora value of Residual Space Combination . . . 24

2.8 Architecture of Single Layer Neural Networks . . . 27

2.9 Relationship between Weight Distribution and Penalty . . . 33

2.10 Illustration of 4 Different Degrees of Sparseness. This figure is obtained from [34]; we added labels to the figure to make it clearer. The x axis is the index of the weight and the y axis is the magnitude of the weight. This figure just gives an example of sparse and non-sparse vectors, and there are no constraints on the norms. . . 34

3.1 A Flow Chart for Wrapper Method . . . 36

3.2 Illustration of a Separable Case and the Optimal Hyperplane . . . 39

3.3 Data Point Inside the Region of Separation, but On the Right Side of the Decision Surface . . . 42

3.4 Data Point Inside the Region of Separation, but On the Wrong Side of the Decision Surface . . . 43

3.5 A Flow Chart for Filter Method . . . 50

3.6 Depth-First Tree Search for Feature Selection . . . 52

3.7 Breadth-First Tree Search for Feature Selection . . . 52


4.2 |cosθ|p+|sinθ|p vs. θ . . . . 58

4.3 A Comparison of Four Different Two-Dimensional Priors . . . 63

5.1 Non-Smooth Vs. Smooth Filters . . . 75

6.1 Original Bear Image . . . 80

6.2 Training and Testing Set for Bear Image . . . 80

6.3 Error trajectory (R= [F1, C4]) . . . 82

6.4 Training Result (R= [F1, C4]) . . . 83

6.5 Testing Result(R= [F1, C4]) . . . 83

6.6 Error trajectory (R= [F3, C4]) . . . 84

6.7 Training Result (R= [F3, C4]) . . . 85

6.8 Testing Result(R= [F3, C4]) . . . 85

6.9 Successive Neuron Elimination Results for Noisy XOR Data . . . 87

6.10 Successive Neuron Elimination Results for Hyperspectral Image (Left: Fur Vs. Neutral Color Right: Fur Vs. All Colors) . . . 87

6.11 Classification Results of Test 1 in Table 6.4 . . . 91

6.12 Classification Results of Test 1 in Table 6.6 (White Clutter Only) . . . 94

6.13 Classification Results of Test 21 in Table 6.6 (All Color Clutter) . . . 95

6.14 Pond Image (First Horizontal Reflectance Band) . . . 98

6.15 Training Set of Pond Image . . . 98

6.16 A Comparison of the Hoyer’s Function and the Mixed-Norm Penalty Function . . . 101

6.17 The Intensity of Changing Illumination (α= 0.0014) . . . 103

6.18 The Intensity of Changing Illumination (α= 0.0028) . . . 103

6.19 Bear Image with Varying Illumination (α= 0.0014) . . . 104

6.20 Bear Image with Varying Illumination (α= 0.0028) . . . 104

6.21 Trained with Uniform Illumination, Tested with Uniform Illumination . . . 106

6.22 Trained with Uniform Illumination, Tested with Uniform Illumination . . . 106

6.23 Trained with Uniform Illumination, Tested with Varying Illumination (α = 0.0014) . . . 107

6.24 Trained with Uniform Illumination, Tested with Varying Illumination (α = 0.0014) . . . 107

6.25 Trained with Uniform Illumination, Tested with Varying Illumination (α = 0.0028) . . . 108

6.26 Trained with Uniform Illumination, Tested with Varying Illumination (α = 0.0028) . . . 108

6.27 Trained with Random Illumination, Tested with Varying Illumination (α = 0.0028) . . . 109

6.28 Trained with Random Illumination, Tested with Varying Illumination (α = 0.0028) . . . 109

6.29 Synthetic linearly separable Case . . . 110

6.30 Synthetic linearly non-separable Case . . . 111

6.31 The Feature Weights for Four Tests Using Bear Data Without Color . . . . 118

6.32 The Feature Weights for Four Tests Using Bear Data With Color . . . 118


List of Tables

1.1 An Example of Feature-based Data . . . 9

6.1 Result of 50 Tests for the Classic XOR Using Four Penalty Functions . . . . 89

6.2 Selection of Results of the Mixed-Norm Penalty Function for Classic XOR . 90

6.3 Result of 12 Tests for the Noisy XOR Using Four Penalty Functions . . . . 92

6.4 Results of the Mixed-Norm Penalty Function with Noisy XOR Data . . . . 93

6.5 Result of 40 Tests for the Bear Data Using Four Penalty Functions . . . 95

6.6 Results of Four Penalty Functions with Bear Data . . . 96

6.7 Result of 20 Tests for the Pond Data Using Four Penalty Functions . . . 99

6.8 Results of Four Penalty Functions with Pond Data . . . 99

6.9 Trade-offs in Dimensionality and Smoothness . . . 101

6.10 Simulation Results with Bear Data . . . 105

6.11 Results of 1-D Feature Selection on Linearly Separable Data . . . 112

6.12 Results of 2-D Feature Selection on Linearly Separable Data . . . 114

6.13 Results of 1-D Feature Selection on Nonlinearly Separable Data . . . 115

6.14 Results of 2-D Feature Selection on Nonlinearly Separable Data . . . 116

6.15 Feature Selection Results for Wisconsin Breast Cancer Data (average detection rate = 95.15%, average number of features = 3.3) . . . 123


Chapter 1

Introduction

This work focuses on a new penalty function for dimensionality reduction and feature selection. The method has proved successful on both synthetic and real data. Hyperspectral images are used extensively throughout this research to show the performance of the dimensionality reduction and feature selection. Therefore, before we discuss dimensionality reduction and feature selection methods, we introduce hyperspectral images.


Figure 1.1: The Electromagnetic Spectrum (400nm-700nm is the visible spectrum)

detect vegetation stress[52]. Military personnel have used hyperspectral imagery to detect military vehicles under partial vegetation canopy, and many other military target detection objectives [45][46].

Before talking about the processing of hyperspectral images, first we want to introduce some basic concepts for hyperspectral images.

1.1 Hyperspectral Images

The "hyper" in hyperspectral means "over" or "too many" and refers to the large number of measured wavelength bands. Hyperspectral images provide a large amount of spectral information for identifying and distinguishing spectrally unique materials. They usually allow more accurate and detailed information extraction than many other types of remotely sensed data.


Reflectance is defined as the fraction of the incident light that a target reflects, so it takes a value between 0 and 1. Reflectance depends on the color and composition of the target and on the wavelength of the light being reflected.

Hyperspectral images are usually recorded by a spectrometer, which measures reflectance at many narrow, closely spaced wavelength bands. The resulting images record a reflectance spectrum for each pixel in the image. This type of detailed pixel spectrum can provide much more information about the surface than a multispectral pixel spectrum. Multispectral images, on the other hand, are measured as the radiation reflected from a surface at a few wide, separated wavelength bands.

Although most hyperspectral sensors measure hundreds of wavelengths, it is not the number of measured wavelengths that defines a sensor as hyperspectral [47]. Instead, we judge whether an image is hyperspectral by looking at the narrowness and contiguous nature of the measurements. For example, a sensor that measured only 20 bands could be considered hyperspectral if those bands were contiguous and, say, 10 nm wide. If a sensor measured 20 wavelength bands that were, say, 100 nm wide, it would no longer be considered hyperspectral. Hyperspectral imagery provides an opportunity for more detailed image analysis. For example, using hyperspectral data, visually similar materials can be distinguished. To fulfill this potential, new image processing techniques have been developed.

In this work, we work with hyperspectral images. A 31-band image that was used for color reproduction studies was used for the tests of hyperspectral classification [1]. An RGB rendering of this image, which was acquired with a progressive-scanning monochrome digital camera, is shown in Fig. 1.2.

Figure 1.2: An example of Hyperspectral Image (the panels plot reflectance, from 0 to 1, against wavelength, from 400 to 700 nm)

values in the red region, which has wavelengths from 600 to 700 nm.

1.2 The Definitions of Dimensionality Reduction and Feature Selection

We have said earlier that hyperspectral images provide ample information about the pixel reflectance. However, in many situations, hyperspectral images also provide redundant information, which reduces computational efficiency.

1.2.1 Dimensionality Reduction Problem

In statistics, dimensionality reduction is the mapping of a multidimensional space into a space of fewer dimensions. It is sometimes the case that analysis such as regression or classification can be carried out in the reduced space as accurately as in the original space.

Here we give a mathematical definition of dimensionality reduction [15]. Suppose we have a sample set $\{\mathbf{t}_n\}_{n=1}^{N}$ of $D$-dimensional vectors lying in a data space $\mathcal{T}$ (usually $\mathbb{R}^D$ or a subset of it). The fundamental assumption of dimensionality reduction is that the sample actually lies, at least approximately, on a manifold of a space $\mathcal{X}$ of dimension $L$, where $L < D$. The goal of dimensionality reduction is to find a representation of that manifold (a coordinate system) that allows the data vectors to be projected onto it, yielding a low-dimensional, compact representation of the data. Formally, dimensionality reduction is defined by the following problem: given a sample $\{\mathbf{t}_n\}_{n=1}^{N} \subset \mathcal{T}$, find:

A space $\mathcal{X}$ of dimension $L$ (typically $\mathbb{R}^L$ or a subset of it).

A dimensionality reduction mapping $F$:

$$F: \mathcal{T} \to \mathcal{X}, \qquad \mathbf{t} \mapsto \mathbf{x} = F(\mathbf{t}), \tag{1.1}$$

A smooth, nonsingular reconstruction mapping $f$:

$$f: \mathcal{X} \to \mathcal{M} \subset \mathcal{T}, \qquad \mathbf{x} \mapsto \mathbf{t} = f(\mathbf{x}), \tag{1.2}$$

such that

$L < D$ is as small as possible.

The reconstruction error of the sample is small. Let $\mathbf{t}_n^{*} \stackrel{\text{def}}{=} f(F(\mathbf{t}_n))$ be the reconstructed vector for point $\mathbf{t}_n$. The reconstruction error of the sample is defined as $E\{d(\{\mathbf{t}_n\}_{n=1}^{N})\} \stackrel{\text{def}}{=} \sum_{n=1}^{N} d(\mathbf{t}_n, \mathbf{t}_n^{*})$, where $d$ is some distance measure between $\mathbf{t}_n$ and $\mathbf{t}_n^{*}$ in the space $\mathcal{T}$ (e.g., the Euclidean distance in $\mathbb{R}^D$). This condition can also be restated in a simpler way: the manifold $\mathcal{M} \stackrel{\text{def}}{=} f(\mathcal{X})$ approximately contains all the sample points: $\{\mathbf{t}_n\}_{n=1}^{N} \subseteq \mathcal{M}$.

An example of mappings $F$ and $f$ is shown in Fig. 1.3. The dimensionality reduction mapping $F$ (dashed lines) and the reconstruction mapping $f$ (dotted lines) satisfy $F \circ f = \text{identity}$.
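The definition above can be illustrated with a minimal sketch. It assumes a linear pair of mappings built from the leading principal directions of the sample, which is only one possible choice of $F$ and $f$; the function names, the synthetic data, and the use of NumPy are assumptions made for this illustration and are not taken from the dissertation.

```python
import numpy as np

def make_linear_maps(samples, L):
    """Build a linear reduction map F and reconstruction map f (illustrative choice)."""
    mean = samples.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    basis = vt[:L].T  # D x L matrix with orthonormal columns

    def F(t):
        # Reduction mapping: data space T -> reduced space X
        return (t - mean) @ basis

    def f(x):
        # Reconstruction mapping: reduced space X -> manifold M inside T
        return mean + x @ basis.T

    return F, f

def reconstruction_error(samples, F, f):
    """Sum over the sample of d(t_n, t_n*), with d the Euclidean distance."""
    reconstructed = f(F(samples))
    return float(np.sum(np.linalg.norm(samples - reconstructed, axis=1)))

# Synthetic sample: D = 5 dimensional points lying close to an L = 2 dimensional plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
samples = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

F, f = make_linear_maps(samples, L=2)
x = F(samples)                               # low-dimensional representation
err = reconstruction_error(samples, F, f)    # small, because the data is nearly planar
```

For this linear choice, applying `F` after `f` returns the reduced coordinates unchanged, which is the identity condition stated for the two mappings.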

In Fig. 1.4, we show the dimensionality reduction problem as a processing flowchart. Most processing systems are effective only with low-dimensional vector data, so data of a higher dimension must be reduced before being fed into the system.

The dimensionality reduction problems can usually be categorized as [79]:

Hard dimensionality reduction problems: Hard dimensionality reduction deals with data of very high dimensionality, ranging from several hundred to hundreds of thousands of dimensions, and the reduction is usually drastic. Well-known problems such as pattern recognition and classification involving images (e.g., face recognition, character recognition, etc.) or speech (e.g., auditory models) belong to this group.

Figure 1.3: An example of mappings F and f

Figure 1.4: The Dimensionality Reduction Problem (flowchart labels: High-Dimensional Data; Dimensionality Reduction; Low-Dimensional Data; Processing System; Intractable)

Figure 1.5: The Hierarchy of Feature Types (diagram labels: Feature Type; Discrete (Finite); Continuous (Infinite); Complex; Ordinal; Nominal)

Visualization problems: The data itself does not normally have a very high dimension in absolute terms, but we need to reduce it to 2 or 3 in order to plot and visualize it. Several representation techniques allow visualization up to about 5-dimensional data sets, using colors, rotation, stereography, glyph or other devices, but they lack the appeal of a simple plot [65]; a well-known representation technique is the grand tour [8]. Chernoff faces allow even a few more dimensions, but are difficult to interpret and do not produce a spatial view of the data [16].

If we allow the time variable, then there are two more dimensionality reduction categories: static dimensionality reduction and time-dependent dimensionality reduction. Time-dependent reduction is useful for vector time series, such as video sequences or continuous speech. We deal here only with static dimensionality reduction.

1.2.2 Feature Selection

Features are also called attributes, properties, or characteristics. They can usually be categorized into three different types: discrete, continuous, and complex. The hierarchy of feature types is shown in Fig. 1.5.

Table 1.1: An Example of Feature-based Data

Weight (lb)   Height (feet)   Age   Gender   Result
130           5.5             24    Female   Graduate
160           5.8             18    Male     Under
150           6.2             19    Male     Under
170           6.1             32    Male     Graduate
140           5.9             28    Male     Graduate
120           5.3             20    Female   Under
150           6.0             26    Male     Under
110           5.4             18    Female   Under

f(x) outputs one of the predefined classes. In the example shown in Table 1.1, there are four features in total, and the last column, Result, shows the class to which each subject belongs. Weight, height and age are examples of continuous features, while gender takes discrete values.


from undergraduate than the weight and height. In more complicated situations, histograms of the various classes could be used to show the distribution of the data.

In a real system, the problem is usually much more complicated than this example; the feature selection techniques are described in detail in Chapter 3.

1.3 The Motivations for Dimensionality Reduction and Feature Selection

The notorious curse of dimensionality is a well-known phenomenon in many multi-dimensional or high-dimensional problems. The term curse of dimensionality refers to the fact that, in the absence of simplifying assumptions, the sample size needed to estimate a function of several variables to a given degree of accuracy (i.e., to get a reasonably low-variance estimate) grows exponentially with the number of variables [11]. One approach to minimizing the effect of the curse is to reduce the number of dimensions of the high-dimensional data and select the critical features before processing the reduced data. However, transforming data from a high-dimensional space to a lower-dimensional space, or finding the optimal feature set, without losing critical information is not a trivial task.
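The exponential growth can be made concrete with a small sketch. It assumes, purely for illustration, that a given accuracy requires a fixed number of sample points along each variable, so that the samples form a regular grid; the function name and the figure of 10 points per axis are hypothetical.

```python
def samples_needed(points_per_axis, num_variables):
    """Grid size when every variable is sampled at the same fixed density (assumed model)."""
    return points_per_axis ** num_variables

# With 10 points per axis, 2 variables need 100 samples,
# while 31 variables (one per spectral band) would need 10**31.
small_case = samples_needed(10, 2)
hyperspectral_case = samples_needed(10, 31)
```

Under this assumption, each variable that is removed divides the required sample size by the per-axis density, which is one way to read the motivation for reducing the dimension before processing.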


Chapter 2

Existing Methods for Dimensionality Reduction


function [33][7][55][34]. The gain competition technique prunes the neural net by having the gains of hidden neurons compete according to similarities between nodes [44]. Other pruning algorithms include the two-stage procedure [69] and iterative pruning [24]. In [69], useless neurons (zero output) and repetitive neurons (neurons with the same pattern) are removed in the first stage; two neurons with identical or reversed outputs are considered to have the same pattern. In the second stage, nonessential neurons, whose outputs are combinations of the outputs of other neurons, are removed; a neuron is called nonessential if it has nonzero output but the classification can still be done without it. If the output of an entire layer has the same pattern as some other layer, then the entire layer is removed in the third stage of pruning. The iterative method proposed in [24] reduces the size of trained feedforward neural networks by iteratively removing hidden units and then adjusting the remaining weights so as to preserve the overall network behavior. Both the two-stage method and the iterative pruning technique require user interaction, because the redundant and unnecessary neurons must be identified by inspecting and comparing the output weights, which makes these methods poorly suited to automated implementation. We introduce a method that is closely related to the penalty function approaches and is used in combination with a constrained optimization algorithm. In this chapter, we introduce PCA, the Vora approach and a few penalty function methods in more detail.

2.1 Principal Component Analysis (PCA)

Principal component analysis is possibly the dimensionality reduction technique most widely used in practice, perhaps due to its conceptual simplicity, its analytical properties and the fact that relatively efficient algorithms (of polynomial complexity) exist for its computation [38][37]. PCA is well known in signal processing as the Karhunen-Loève transform.

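$$\mathbf{m} \stackrel{\text{def}}{=} \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \tag{2.1}$$

and the scatter matrix

$$\mathbf{S} \stackrel{\text{def}}{=} \sum_{k=1}^{n}(\mathbf{x}_k-\mathbf{m})(\mathbf{x}_k-\mathbf{m})^T. \tag{2.2}$$

The scatter matrix $\mathbf{S}$ is $n-1$ times the sample estimate of the covariance matrix, and it is symmetric positive semidefinite.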

We want to represent all of the vectors by a single vector $\mathbf{x}_0$ such that the sum of the squared Euclidean distances between $\mathbf{x}_0$ and all $\mathbf{x}_k$ is as small as possible. We define the sum-of-squared-error function $J_0(\mathbf{x}_0)$ by

$$J_0(\mathbf{x}_0) = \sum_{k=1}^{n}\|\mathbf{x}_0-\mathbf{x}_k\|^2. \tag{2.3}$$

Therefore, we have

$$\mathbf{x}_0 = \arg\min_{\mathbf{x}_0} J_0. \tag{2.4}$$

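Introducing the sample mean of Eqn. (2.1) into Eqn. (2.3), we have

$$\begin{aligned}
J_0(\mathbf{x}_0) &= \sum_{k=1}^{n}\|(\mathbf{x}_0-\mathbf{m})-(\mathbf{x}_k-\mathbf{m})\|^2 \\
&= \sum_{k=1}^{n}\|\mathbf{x}_0-\mathbf{m}\|^2 - 2\sum_{k=1}^{n}(\mathbf{x}_0-\mathbf{m})^T(\mathbf{x}_k-\mathbf{m}) + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 \\
&= \sum_{k=1}^{n}\|\mathbf{x}_0-\mathbf{m}\|^2 - 2(\mathbf{x}_0-\mathbf{m})^T\sum_{k=1}^{n}(\mathbf{x}_k-\mathbf{m}) + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2
\end{aligned} \tag{2.5}$$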

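Notice that the 2nd term in Eqn. (2.5) vanishes:

$$\begin{aligned}
-2(\mathbf{x}_0-\mathbf{m})^T\sum_{k=1}^{n}(\mathbf{x}_k-\mathbf{m}) &= -2(\mathbf{x}_0-\mathbf{m})^T\left(\sum_{k=1}^{n}\mathbf{x}_k-\sum_{k=1}^{n}\mathbf{m}\right) \\
&= -2(\mathbf{x}_0-\mathbf{m})^T\left(\sum_{k=1}^{n}\mathbf{x}_k-n\mathbf{m}\right) \\
&= -2(\mathbf{x}_0-\mathbf{m})^T\left(\sum_{k=1}^{n}\mathbf{x}_k-n\,\frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k\right) \\
&= 0.
\end{aligned}$$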

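Therefore, Eqn. (2.5) can be simplified as

$$J_0(\mathbf{x}_0) = \sum_{k=1}^{n}\|\mathbf{x}_0-\mathbf{m}\|^2 + \underbrace{\sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2}_{\text{independent of }\mathbf{x}_0} \tag{2.6}$$

The 2nd term is independent of $\mathbf{x}_0$, and it is clear that Eqn. (2.6) is minimized at $\mathbf{x}_0 = \mathbf{m}$.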
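The result can be checked numerically. The sketch below uses synthetic data and hypothetical names, and assumes NumPy; it evaluates $J_0$ at the sample mean and at a shifted point, and compares the increase with the first term of Eqn. (2.6), which equals $n\|\mathbf{x}_0-\mathbf{m}\|^2$.

```python
import numpy as np

def sum_squared_error(x0, X):
    """J0(x0): sum of squared Euclidean distances from x0 to every row of X."""
    return float(np.sum(np.linalg.norm(X - x0, axis=1) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, size=(50, 4))   # n = 50 synthetic vectors in 4 dimensions
m = X.mean(axis=0)                      # sample mean, Eqn. (2.1)

j_at_mean = sum_squared_error(m, X)

# Moving away from the mean by delta should add exactly n * ||delta||^2.
delta = np.array([0.5, -0.2, 0.1, 0.0])
j_shifted = sum_squared_error(m + delta, X)
expected_increase = len(X) * float(delta @ delta)
```

Because the increase is nonnegative for every shift, no other point can give a smaller error than the mean.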

The sample mean is a zero-dimensional representation of the data set, since it maps all the data to a single point. By projecting the data onto a line through the sample mean, we get a one-dimensional representation of the data set. Let $\mathbf{e}$ be a unit vector in the direction of the line; then the equation of the line can be written as

$$\mathbf{x} = \mathbf{m} + \alpha\mathbf{e}, \tag{2.7}$$

where the scalar $\alpha \in \mathbb{R}$ corresponds to the signed Euclidean distance of the point $\mathbf{x}$ from the mean $\mathbf{m}$. If we represent $\mathbf{x}_k$ by $\mathbf{m}+\alpha_k\mathbf{e}$, we can find an "optimal" set of coefficients $\alpha_k$ by minimizing the sum of squared errors

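$$\begin{aligned}
J_1(\alpha_1,\cdots,\alpha_n,\mathbf{e}) &= \sum_{k=1}^{n}\|(\mathbf{m}+\alpha_k\mathbf{e})-\mathbf{x}_k\|^2 \\
&= \sum_{k=1}^{n}\|\alpha_k\mathbf{e}-(\mathbf{x}_k-\mathbf{m})\|^2 \\
&= \sum_{k=1}^{n}\alpha_k^2\|\mathbf{e}\|^2 - 2\sum_{k=1}^{n}\alpha_k\mathbf{e}^T(\mathbf{x}_k-\mathbf{m}) + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2
\end{aligned}$$

To minimize $J_1$, we set its partial derivative with respect to $\alpha_k$ to zero and solve the resulting equation:

$$\frac{\partial J_1}{\partial\alpha_k} = 2\alpha_k\|\mathbf{e}\|^2 - 2\mathbf{e}^T(\mathbf{x}_k-\mathbf{m}) = 0 \;\Rightarrow\; \alpha_k = \mathbf{e}^T(\mathbf{x}_k-\mathbf{m}), \tag{2.8}$$

since $\|\mathbf{e}\|^2 = 1$.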

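Now we want to find the best direction $\mathbf{e}$ for the line. Inserting Eqn. (2.8) into the expression for $J_1$ and using $\|\mathbf{e}\|^2 = 1$, we have

$$\begin{aligned}
J_1(\mathbf{e}) &= \sum_{k=1}^{n}\alpha_k^2 - 2\sum_{k=1}^{n}\alpha_k^2 + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 \\
&= -\sum_{k=1}^{n}\left[\mathbf{e}^T(\mathbf{x}_k-\mathbf{m})\right]^2 + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 \\
&= -\sum_{k=1}^{n}\mathbf{e}^T(\mathbf{x}_k-\mathbf{m})(\mathbf{x}_k-\mathbf{m})^T\mathbf{e} + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2 \\
&= -\mathbf{e}^T\mathbf{S}\mathbf{e} + \sum_{k=1}^{n}\|\mathbf{x}_k-\mathbf{m}\|^2
\end{aligned} \tag{2.9}$$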

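To minimize $J_1$, we want to maximize $\mathbf{e}^T\mathbf{S}\mathbf{e}$ subject to the constraint $\|\mathbf{e}\| = 1$. Using the method of Lagrange multipliers, and letting $\lambda$ be the multiplier, we form

$$u = \mathbf{e}^T\mathbf{S}\mathbf{e} - \lambda\mathbf{e}^T\mathbf{e} \tag{2.10}$$

We maximize Eqn. (2.10) by taking its derivative with respect to $\mathbf{e}$ and setting it to zero:

$$\frac{\partial u}{\partial\mathbf{e}} = 2\mathbf{S}\mathbf{e} - 2\lambda\mathbf{e} = 0 \tag{2.11}$$

It is easy to see that

$$\mathbf{S}\mathbf{e} = \lambda\mathbf{e} \tag{2.12}$$

which is the definition of an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda$. In particular, since $\mathbf{e}^T\mathbf{S}\mathbf{e} = \lambda\mathbf{e}^T\mathbf{e} = \lambda$, to maximize $\mathbf{e}^T\mathbf{S}\mathbf{e}$, $\mathbf{e}$ should be the eigenvector corresponding to the largest eigenvalue of the scatter matrix.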

Therefore, the best one-dimensional projection of the data (best in the least sum-of-squared-error sense) is the projection onto the line through the sample mean in the direction of the eigenvector corresponding to the largest eigenvalue of the scatter matrix.
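A minimal numerical sketch of this conclusion follows. It assumes NumPy and synthetic data, and all function names are hypothetical; `numpy.linalg.eigh` is used because the scatter matrix is symmetric, and it returns eigenvalues in ascending order.

```python
import numpy as np

def scatter_matrix(X):
    """S = sum_k (x_k - m)(x_k - m)^T, with the rows of X as the vectors x_k."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered

def best_direction(X):
    """Unit eigenvector of S for the largest eigenvalue, together with that eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(scatter_matrix(X))  # ascending eigenvalues
    return eigvecs[:, -1], float(eigvals[-1])

def line_error(X, e):
    """J1 for the line through the mean along unit vector e, using alpha_k = e^T (x_k - m)."""
    centered = X - X.mean(axis=0)
    alpha = centered @ e
    residual = centered - np.outer(alpha, e)
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.2])  # most spread along the first axis
e, lam = best_direction(X)
j1_best = line_error(X, e)
```

For the returned direction, the error equals the total scatter minus the largest eigenvalue, so any other unit direction gives an error at least as large.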

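Figure 2.1: A Principal Component Decomposition

$$\mathbf{x} = \mathbf{m} + \sum_{i=1}^{d'}\alpha_i\mathbf{e}_i, \tag{2.13}$$

where $d' \le d$. The error function we want to minimize is then

$$J_{d'} = \sum_{k=1}^{n}\left\|\left(\mathbf{m}+\sum_{i=1}^{d'}\alpha_{ki}\mathbf{e}_i\right)-\mathbf{x}_k\right\|^2 \tag{2.14}$$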

Following the previous approach, we can prove that Eqn. (2.14) is minimized when the vectors $\mathbf{e}_1,\cdots,\mathbf{e}_{d'}$ are the $d'$ eigenvectors corresponding to the $d'$ largest eigenvalues of the scatter matrix. Because the scatter matrix is real and symmetric, the eigenvectors are orthogonal [53]. In Fig. 2.1, we can see an example of a principal component decomposition in 2-D.

2.2 Linear Classifier with the Vora Methods

The goal of a general pattern recognition problem is to classify data into different classes. Features are extracted from each data sample and used as the input to the classifier, which determines the class to which each sample belongs based on its features. There are both linear and non-linear classifiers to choose from, and one may be better than the other depending on the application.

The data we use here is the hyperspectral bear image mentioned in Chapter 1. We want to distinguish the fur of the bear (the target) from the white background (the clutter), which appears very similar to the fur. For this task, the linear classifier is employed to do the classification.

Since the fur and white clutter appear very similar to the human eye, the RGB values alone are not enough to distinguish the target from the clutter. However, we know the reflectance of each pixel in 31 spectral bands, which contains the information we need to distinguish the target from the clutter. Since target and clutter are made of different materials, which have different reflectance spectra, there must be some differences between the target and clutter spaces, which are 31-dimensional. We can examine the properties of the spaces spanned by the eigenvectors of target and clutter, since the eigenvectors represent the properties of a matrix. Because fur and white clutter are similar, the fur space and white clutter space must have a large overlap, which is useless for distinguishing the spaces. We are interested in the space where fur and white clutter differ, so we want to reduce the dimension of the fur space and clutter space and keep only the information where the differences exist.

In the following sections, we will discuss how to choose the optimal residual space and how to distinguish target and clutter using different classifiers.

2.2.1 Select Residual Space

Figure 2.2: Samples from the Target

Figure 2.3: Samples from the Neutral Clutter

Figure 2.4: Comparison between the 1st Eigenvectors of Target and Neutral Clutter

To check the properties of fur and clutter, we need to compute the eigenvalues and eigenvectors of the covariance matrices of the target and clutter. The covariance matrices are defined as:

$$\begin{aligned}
\mathbf{F}_{XY} &= E\left[(\mathbf{F}-\mathbf{m}_F)(\mathbf{F}-\mathbf{m}_F)^T\right] \\
\mathbf{C}_{XY} &= E\left[(\mathbf{C}-\mathbf{m}_C)(\mathbf{C}-\mathbf{m}_C)^T\right],
\end{aligned} \tag{2.15}$$

where $\mathbf{m}_F$ and $\mathbf{m}_C$ are the expected vectors of $\mathbf{F}$ and $\mathbf{C}$. To simplify computation, we can assume $\mathbf{F}$ and $\mathbf{C}$ to have zero mean, which does not have much effect on the eigenvectors. Fig. 2.4 shows the first eigenvectors (the ones associated with the largest eigenvalue) of both the target and clutter, while Fig. 2.5 gives us the comparison of the second eigenvectors, which correspond to the second largest eigenvalue.

Figure 2.5: Comparison between the 2nd Eigenvectors of Target and Neutral Clutter

clutter are not identical, and the differences can be seen from other eigenvectors. Fig. 2.5 shows that the second eigenvectors are indeed quite different. When we want to distinguish fur from clutter, the second eigenvector is more important to us than the first one.

The target and clutter are both in the 31-dimensional space. By using the most significant eigenvectors, we can approximate the target pixels in a lower-dimensional space. For example, 99% of the spectral power is found in a subspace defined by the first $Q_t$ eigenvectors of the target ensemble. Likewise, 99% of the spectral power of the clutter is in the first $Q_c$ eigenvectors associated with that ensemble. The two subspaces containing 99% of the power of each class are different. Consider the vector spaces obtained by using combinations of eigenvectors from the target ensemble and the clutter ensemble. If $\{\mathbf{e}_{ti}\}_{i=1}^{N_t}$ are the eigenvectors associated with the target and $\{\mathbf{e}_{cj}\}_{j=1}^{N_c}$ are the eigenvectors associated with the clutter, the dimensionality of the space spanned by $\{\mathbf{e}_{ti}, \mathbf{e}_{cj}\}$ is at most $\min\{N_t+N_c, 31\}$. If we use all 31 eigenvectors of the target and the clutter, there are $\sum_{i=1}^{31}\binom{31}{i}\sum_{j=1}^{31}\binom{31}{j}$ possible combinations. This is because there may be 1, 2, ..., 31 target eigenvectors in the residual space. If there is only one target eigenvector in the space, it can be any one of the 31 eigenvectors, i.e., there are $\binom{31}{1}$ choices. In general, the number of ways of having $i$ eigenvectors in the residual space is $\binom{31}{i}$; since $i$ can be 1 to 31, the total number of combinations of target eigenvectors is $\binom{31}{1}+\binom{31}{2}+\cdots+\binom{31}{31} = \sum_{i=1}^{31}\binom{31}{i}$. Similarly, the total number of combinations of neutral clutter eigenvectors is $\sum_{j=1}^{31}\binom{31}{j}$, which means the number of possible residual spaces is $N_b = \sum_{i=1}^{31}\binom{31}{i}\sum_{j=1}^{31}\binom{31}{j}$. After we reduce the dimensionality, the total number of combinations is reduced to $N_a = \sum_{i=1}^{N_t}\binom{N_t}{i}\sum_{j=1}^{N_c}\binom{N_c}{j}$; since $N_t \ll 31$ and $N_c \ll 31$, we have $N_a \ll N_b$.
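The counting argument and the 99% power cut-off can be sketched as follows. The function names are hypothetical, and the sketch assumes that "spectral power" is measured by the sum of the covariance eigenvalues, which is an interpretation and not something the text states explicitly. The values $N_t = N_c = 4$ are chosen only as an example.

```python
from math import comb

import numpy as np

def num_nonempty_subsets(n):
    """Sum over i = 1..n of C(n, i): the ways to pick at least one of n eigenvectors."""
    return sum(comb(n, i) for i in range(1, n + 1))

def num_residual_spaces(n_target, n_clutter):
    """Number of (target subset, clutter subset) pairs, as in N_a and N_b."""
    return num_nonempty_subsets(n_target) * num_nonempty_subsets(n_clutter)

def num_eigvecs_for_power(eigenvalues, fraction=0.99):
    """Smallest Q such that the Q largest eigenvalues hold the given fraction of the total.

    Assumption: power is measured by the eigenvalue sum.
    """
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(cumulative, fraction) + 1)

n_b = num_residual_spaces(31, 31)   # all 31 eigenvectors of each class: (2**31 - 1)**2
n_a = num_residual_spaces(4, 4)     # example with N_t = N_c = 4: 15 * 15 = 225
q = num_eigvecs_for_power([80.0, 15.0, 4.5, 0.5])  # 3 eigenvectors reach 99%
```

Since the sum of the binomial coefficients from 1 to n is $2^n - 1$, the exhaustive search over residual spaces is only practical after the number of retained eigenvectors has been cut down.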

Vora and Trussell introduced a measure of the similarity of two range spaces $\Omega_S$ and $\Omega_M$ of matrices $S$ and $M$ [74]. Let an orthonormal basis for $\Omega_S$ be defined by $N = [n_1, n_2, \cdots, n_\alpha]$, such that $\Omega_N = \Omega_S$ and $N^T N = I$. Notice that $\alpha$ is the number of orthogonal vectors and is equal to the rank of $S$.

Similarly, we define an orthonormal basis for $\Omega_M$ by $O = [o_1, o_2, \cdots, o_\beta]$, such that $\Omega_O = \Omega_M$ and $O^T O = I$, where $\beta$ is the number of orthogonal vectors and is equal to the rank of $M$.

A normalized measure of goodness (referred to as the Vora value in our project) is defined as

$$\upsilon(S, M) = \frac{\sum_{i=1}^{\alpha} \lambda_i^2(O^T N)}{\alpha},$$

where $\lambda_i(O^T N)$ is the $i$th singular value of $O^T N$.

Figure 2.6: Vora value of fur space and clutter space (plot title: Vora value comparison in different dimension; x-axis: space dimension; y-axis: Vora value)

It can be verified that $\upsilon(S,M)$ takes values between 0 and 1: when $\upsilon(S,M) = 0$, the spaces $\Omega_S$ and $\Omega_M$ are orthogonal, and when $\upsilon(S,M) = 1$, $\Omega_S$ and $\Omega_M$ are identical.
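The Vora value can be sketched numerically as follows. This is a minimal illustration, not the implementation used in this work: it assumes the measure is the sum of squared singular values of $O^T N$ divided by $\alpha$, and the helper names `orthonormal_basis` and `vora_value` are hypothetical.

```python
import numpy as np

def orthonormal_basis(A, tol=1e-10):
    """Orthonormal basis for the range space of A, taken from its SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    return U[:, :rank]

def vora_value(S, M):
    """Assumed form: sum of squared singular values of O^T N, divided by alpha."""
    N = orthonormal_basis(S)   # basis of the range space of S
    O = orthonormal_basis(M)   # basis of the range space of M
    sv = np.linalg.svd(O.T @ N, compute_uv=False)
    return float(np.sum(sv ** 2) / N.shape[1])
```

For two one-dimensional spaces the value reduces to the squared cosine of the angle between them, so identical lines give 1 and orthogonal lines give 0.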

As discussed in earlier sections, the space spanned by the first target eigenvector and the space spanned by the first clutter eigenvector are very close, so they must have a very large Vora value, close to 1. Since the second eigenvectors are quite different, after we include the second eigenvector in the spaces, the spaces differ to some extent, which leads to a decrease in the Vora value. If we continue to add more eigenvectors to the residual space, we see a fluctuation of Vora values. Since the 31 eigenvectors span the whole space, the target space and clutter space are then exactly the same, i.e., we have a Vora value of 1. This is verified in Fig. 2.6, where we plotted the Vora values of the spaces spanned by $\{v_k^F\}_{k=1}^{j}$ and $\{v_k^C\}_{k=1}^{j}$.

Figure 2.7: Ordered Vora value of Residual Space Combination (fur and white clutter; x-axis: number of different spaces; y-axis: Vora value)

by an exhaustive search. In the end, we sorted the Vora values in increasing order; the Vora value goes from almost zero to almost 1, as shown in Fig. 2.7. Now, we want to distinguish fur and clutter in the residual spaces with low Vora values.

Before we do classification, we also need to check the dimensionality of the residual space, since it may cause computational errors.

Define the optimal residual space $\Omega_{XY}$ as

$$\Omega_{XY} = \left\{ Z \;\middle|\; Z = \sum_{k=m_F}^{n_F} \alpha_k v_k^F + \sum_{k=m_C}^{n_C} \beta_k v_k^C \right\}, \tag{2.17}$$

which has a dimension less than or equal to the sum of the dimensions of the two spaces, $(n_F - m_F + 1) + (n_C - m_C + 1)$.

The covariance matrix of $\Omega_{XY}$ is defined as $K_{XY} = E\{(Z - m_Z)(Z - m_Z)^T\}$, where $m_Z$ is the expected vector of $Z$. Assuming zero mean, this matrix simplifies to $K_{XY} = E\{ZZ^T\}$. We computed the eigenvalues and eigenvectors of $K_{XY}$. As discussed earlier, of the 31 eigenvalues we have, most will have values very close to 0. To avoid singularity, we use only the $Q$ eigenvectors associated with the $Q$ largest eigenvalues. In our project, $Q$ is chosen in such a way that the $Q$ largest eigenvalues contain 99.9% of the total power. We define $\Omega_R$ as the space spanned by these $Q$ eigenvectors, and it will be the residual space for later use.
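The choice of $Q$ can be sketched as follows. This is a minimal illustration under the assumption that "power" means the sum of the eigenvalues of the covariance matrix; the function name `residual_basis` and its `power` parameter are hypothetical.

```python
import numpy as np

def residual_basis(K, power=0.999):
    """Return the Q leading eigenvectors of K that hold `power` of the total power."""
    eigvals, eigvecs = np.linalg.eigh(K)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / np.sum(eigvals)    # cumulative fraction of power
    Q = min(int(np.searchsorted(cum, power)) + 1, len(eigvals))
    return eigvecs[:, :Q], Q
```

The columns of the returned matrix span the space called $\Omega_R$ above.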

2.2.2 Algorithm for Finding Linear Classifier

To distinguish target from clutter, we introduce some concepts of pattern recognition. Given a set of random variables, some of them may have a certain property that indicates they belong to class $\omega_1$; the others do not have this property and are classified as $\omega_2$. In pattern recognition, many different classifiers can perform this task. One of the simplest and most commonly used is the linear classifier.

Define $x$ as the set of random variables we want to study. We want to find a vector $v$ and a scalar $v_0$ such that

$$h(x) = v^T x + v_0 \;\underset{\omega_2}{\overset{\omega_1}{\lessgtr}}\; 0,$$

that is, $x$ is assigned to $\omega_1$ when $h(x) < 0$ and to $\omega_2$ when $h(x) > 0$, where $\omega_1$ and $\omega_2$ are two different classes. In our project, these two classes are target and clutter. The term $h(x)$ is a linear function of $x$ and is called a linear discriminant function.

Define $m_i$ and $\Sigma_i$ as the expected vectors and covariance matrices of the conditional density function $p(X|\omega_i)$. In our project, we use the projection of target and clutter onto $\Omega_R$ as the feature, and $m_i$ and $\Sigma_i$ can be computed accordingly.

The algorithm for finding the best linear classifier for normally distributed random variables is summarized here [22]:

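1. $s \leftarrow 0$.

2. Calculate $v$ for the given $s$ by

$$v = (s\Sigma_1 + (1-s)\Sigma_2)^{-1}(m_2 - m_1). \tag{2.18}$$

3. Using the $v$ obtained, calculate $\sigma_1^2$ and $\sigma_2^2$ by $\sigma_i^2 = v^T \Sigma_i v$, and $v_0$ by

$$v_0 = -\frac{s\sigma_1^2 v^T m_2 + (1-s)\sigma_2^2 v^T m_1}{s\sigma_1^2 + (1-s)\sigma_2^2}. \tag{2.19}$$

4. Calculate the error $\epsilon$ by

$$\epsilon = P(\omega_1)\int_{-\eta_1/\sigma_1}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\xi^2}{2}\right) d\xi + P(\omega_2)\int_{-\infty}^{-\eta_2/\sigma_2} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\xi^2}{2}\right) d\xi, \tag{2.20}$$

where $\eta_i = v^T m_i + v_0$ is the mean of $h(x)$ under class $\omega_i$.

5. While $s + \Delta < 1$, set $s \leftarrow s + \Delta$ and go to step 2.

If we plot the error $\epsilon$ vs. $s$, we will have a concave curve, and what we are interested in is the lowest point, i.e., the minimum error we can have. The $v$ and $v_0$ associated with that lowest point give the optimal classifier.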
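The five steps above can be sketched as a sweep over $s$. This is a minimal sketch, not the code used in this work: it assumes $\eta_i = v^T m_i + v_0$ and that $x$ is assigned to $\omega_1$ when $h(x) < 0$; the function names and the `step` parameter (playing the role of $\Delta$) are hypothetical.

```python
import math
import numpy as np

def upper_tail(a):
    """P(xi > a) for a standard normal variable xi."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def best_linear_classifier(m1, S1, m2, S2, p1=0.5, step=0.01):
    """Sweep s over [0, 1) and keep the (error, s, v, v0) with the smallest error."""
    best = None
    for s in np.arange(0.0, 1.0, step):
        v = np.linalg.solve(s * S1 + (1.0 - s) * S2, m2 - m1)      # Eqn. (2.18)
        var1, var2 = v @ S1 @ v, v @ S2 @ v                        # sigma_i^2
        v0 = -(s * var1 * (v @ m2) + (1.0 - s) * var2 * (v @ m1)) / (
            s * var1 + (1.0 - s) * var2)
        eta1, eta2 = v @ m1 + v0, v @ m2 + v0                      # class means of h(x)
        err = (p1 * upper_tail(-eta1 / math.sqrt(var1))            # class 1 with h > 0
               + (1.0 - p1) * upper_tail(eta2 / math.sqrt(var2)))  # class 2 with h < 0
        if best is None or err < best[0]:
            best = (float(err), float(s), v, float(v0))
    return best
```

For two classes with identity covariance and equal priors, the sweep settles at $s = 0.5$, which places the threshold midway between the class means.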

The Vora value method for dimensionality reduction outperforms the PCA in that it seeks the subspaces with maximum difference between the two classes instead of the space to represent the target and clutter spaces. However, simulation results have shown that it did not provide us with the minimum dimensionality and the classification error is larger than that of other methods that use neural networks. The result of the Vora value method will be shown in Chapter 6.

2.3 Pruning with Neural Networks

A large category of dimensionality reduction methods is associated with ANN. In these methods the dimensionality is reduced by pruning the neural network; therefore, before we review them, we first introduce ANN.

2.3.1 Introduction to ANN


Figure 2.8: Architecture of Single Layer Neural Networks

The basis of most classification problems is the simple two-class problem. We will use this as the basis for our work but the feature selection method can be used equally well for other cases.

For most of the classification problems we are dealing with, we try to detect target from clutter, which is a two-class classification problem. Therefore, we first look at the architecture of ANN for the two-class classification problem, where only one output neuron is needed. The network of greatest interest has a single hidden layer, shown in Fig. 2.8, and can be represented by

$$y = h(x) = g_1(g_2(x^T W_I + b_I) w_O + b_O), \tag{2.21}$$

where $x$ and $y$ are the input vector and output value of the ANN, respectively. The weight matrix connecting the input with the neurons in the hidden layer and the biases are denoted as $W_I$ and $b_I$, respectively; the output weights, which connect the hidden neurons with the output neuron, are represented by the vector $w_O$; the variable $b_O$ is the bias for the output neuron. The elements of $W_I$ are $w_{ij}$, the weight for the $i$th feature that is sent to the $j$th neuron. Let $M$ and $N$ be the number of hidden neurons and the dimensionality of the input data, respectively; then $x$ is an $N \times 1$ vector and the size of both $b_I$ and $w_O$ is $M \times 1$. The input weight matrix $W_I$ is thus an $N \times M$ matrix. In this work we explicitly differentiate the input weights and output weights by the different subscripts in $W_I$ and $w_O$, whereas some pruning techniques use all weights in the network and may use different notations for the weights.
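The forward pass of Eqn. (2.21) can be sketched as follows. This is a minimal illustration: it assumes a tangent sigmoid hidden layer and a linear output as default transfer functions, stores $x$, $b_I$ and $w_O$ as one-dimensional arrays, and uses the hypothetical function name `forward`.

```python
import numpy as np

def forward(x, W_I, b_I, w_O, b_O, g1=lambda a: a, g2=np.tanh):
    """Single-hidden-layer network: y = g1(g2(x^T W_I + b_I) w_O + b_O)."""
    hidden = g2(x @ W_I + b_I)        # x: (N,), W_I: (N, M), b_I: (M,)
    return g1(hidden @ w_O + b_O)     # w_O: (M,), b_O: scalar
```

With $N = 3$ inputs and $M = 2$ hidden neurons, `forward(np.zeros(3), np.ones((3, 2)), np.zeros(2), np.ones(2), 0.5)` returns the output bias, since every hidden activation is $\tanh(0) = 0$.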

Common transfer functions $g_1$ and $g_2$ include linear, tangent sigmoid, and logarithmic sigmoid functions. For multi-class classification problems, we need multiple output neurons, and the output $y$ will be a vector instead of a scalar. Our training criterion is to minimize the sum of squared errors between the network outputs $y(k) = h(x_k)$ and the desired outputs $y_d(k)$, i.e.,

$$\sum_{k=1}^{L} (y(k) - y_d(k))^2, \tag{2.22}$$

where $y(k)$ is the network output for the $k$th input and $y_d(k)$ is the desired output corresponding to the $k$th input. The number of vectors in the training set is denoted as $L$.

Once the training data are input into a neural network, the weights connected to the neurons are adjusted adaptively by a learning algorithm to model the relationship between the input and the desired output. Neural networks learn by example, so in this setting they are used as a supervised learning method. Conventional techniques such as the Bayesian classifier study the probability distribution of the data; because of the inaccuracy of probability estimation, they usually have a reasonably high testing error. Neural networks, in contrast, are trained to classify a specific data set. The training data must be selected carefully so that they are a good representation of the class; otherwise, the network can be overtrained to learn unnecessary or even unrepresentative details of the data set and may fail to generalize to real data from a similar source. One disadvantage is therefore that, because the network finds out how to solve the problem by itself, its operation can be unpredictable. This risk can be reduced by dividing the data into training and test sets: the network is trained on one set, and its generalization is then assessed by testing its performance on the other.


2.3.2 Pruning Methods

Pruning is the procedure of reducing dimensionality by removing redundant neurons from the hidden layer when building the classifier with ANN. The most common network pruning methods are sensitivity calculation and penalty functions. We will introduce some famous algorithms for network pruning here.

Le Cun [48] and Hassibi [31] prune neural networks by calculating the sensitivity of the parameters in the network. Each parameter in the network is associated with a saliency, which is proportional to the second-order derivative of the objective function with respect to that particular parameter. The authors argue that a small saliency means a perturbation in the parameter will hardly change the value of the objective function, and therefore the parameter can be safely removed.

2.3.3 Optimal Brain Damage

In optimal brain damage (OBD) [48], the author approximates the objective function $E$ by a Taylor series, so that a perturbation $\delta U$ of the parameter vector $U$ changes the objective function by

$$\delta E = \sum_i g_i \delta u_i + \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2 + \frac{1}{2}\sum_{i \neq j} h_{ij}\,\delta u_i \delta u_j + O(\|\delta U\|^3), \tag{2.23}$$

where the $\delta u_i$'s are the components of $\delta U$, the $g_i$'s are the components of the gradient $G$ of $E$ with respect to $U$, and the $h_{ij}$'s are the elements of the Hessian matrix $H$ of $E$ with respect to $U$:

$$g_i = \frac{\partial E}{\partial u_i} \tag{2.24}$$

and

$$h_{ij} = \frac{\partial^2 E}{\partial u_i \partial u_j}. \tag{2.25}$$

saliencies are removed after training converges. Using this assumption, the first term in Eqn. (2.23) is removed. The author also assumed that the objective function is nearly quadratic so that the last term can be neglected. Therefore, Eqn. (2.23) is simplified to

$$\delta E = \frac{1}{2}\sum_i h_{ii}\,\delta u_i^2. \tag{2.26}$$

The saliency of each parameter is thus defined as

$$s_i = \frac{1}{2} h_{ii} u_i^2. \tag{2.27}$$

The algorithm for OBD is shown below:

1. Choose a reasonable network architecture

2. Train the network until a reasonable solution is obtained

3. Compute the second derivatives $h_{ii}$ for each parameter

4. Compute the saliencies for each parameter

5. Sort the parameters by saliency and delete some low-saliency parameters

6. Iterate to Step 2

The authors did not mention how many and by what criteria the parameters are actually removed during each iteration.
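Steps 3 to 5 can be sketched as follows. This is a minimal sketch, assuming the diagonal Hessian entries $h_{ii}$ are already available; the number of parameters removed per iteration (`n_prune`) is a free choice here, and the function names are hypothetical.

```python
import numpy as np

def obd_saliencies(u, h_diag):
    """Saliency s_i = 0.5 * h_ii * u_i^2 for each parameter, Eqn. (2.27)."""
    return 0.5 * h_diag * u ** 2

def obd_prune(u, h_diag, n_prune):
    """Zero out the n_prune parameters with the lowest saliency."""
    s = obd_saliencies(u, h_diag)
    drop = np.argsort(s)[:n_prune]    # indices of the least salient parameters
    pruned = u.copy()
    pruned[drop] = 0.0
    return pruned, drop
```

A parameter with a large second derivative can still be pruned when its magnitude is small, because the saliency scales with $u_i^2$.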

2.3.4 Optimal Brain Surgeon

In 1993, Hassibi and Stork proposed an algorithm called optimal brain surgeon (OBS) [31], which is similar to OBD but makes different assumptions. They argued that the assumption of a diagonal Hessian does not hold for many problems. However, they made the same assumptions as in OBD to remove the first and last terms in Eqn. (2.23), and the saliency defined in OBS is

$$s_i = \frac{1}{2}\frac{u_i^2}{[H^{-1}]_{ii}}. \tag{2.28}$$

2.4 Penalty Functions

Among the various dimensionality reduction methods, the penalty function is one of the most commonly used approaches, because penalty functions are easily combined with the other cost functions that need to be minimized in a constructed optimization problem. For classification problems with neural networks, the sum of the penalty function and the cost of misclassification is minimized during the learning procedure. The objective function to be minimized can be written as

$$C(w) = C_s(w) + \lambda C_p(w), \tag{2.29}$$

where the first term, $C_s(w)$, is a function of the error between the actual output and the desired output of the network, and the second term, $C_p(w)$, is the penalty function, which depends on the network parameters alone. The parameter $\lambda$ adjusts the weight of the penalty term in the total cost function and reflects the relative importance of the classification performance and the dimensionality of the network. We review two of the most common penalty functions here, since we will use them for performance comparisons with the new function.

2.4.1 Weight Decay

A classic penalty function is weight decay, which reduces the ANN by minimizing the square of the 2-norm of the weights [33][17]:

$$C_p(w) = \sum_{i \in \text{net}} w_i^2, \tag{2.30}$$

where the $w_i$ are all the weights in the synaptic network. The cost function $C_p(w)$ will be


vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, weight decay can suppress some of the effects of static noise on the targets. However, when the weight decay penalty is added to the cost function, all the weights tend to decrease toward zero. A problem with weight decay is that it treats all the weights in the system in the same way, and it is very possible that all the weights decrease proportionally while the relative importance between the weights stays unchanged, which provides no information as to which neurons are insignificant and need to be removed. The ideal case would produce a few large weights while driving all other weights to zero. Even if we restrict the weights in the penalty function to only the output weights, the tendency is still to reduce all weights to zero.

2.4.2 Weight Elimination

A more selective penalty function is weight elimination [7]:

$$C_p(w) = \sum_{i \in \text{net}} \frac{(w_i/w_0)^2}{1 + (w_i/w_0)^2}, \tag{2.31}$$

where $w_0$ is a user-defined parameter. When $|w_i| \ll w_0$, the $i$th term approaches zero, which means that particular weight is unimportant in the learning process. Weight decay is actually contained within the weight elimination formula: for $|w_i| \ll w_0$, the $i$th term is approximately $(w_i/w_0)^2$. The author believes that weight elimination is well suited for network pruning, since it eliminates variables that offer little or no assistance in estimating the correct outcome; the small weights only add unwanted white noise.
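The two penalties of Eqns. (2.30) and (2.31) can be compared numerically with the short sketch below. The function names are hypothetical, and the weights are passed as a flat NumPy array.

```python
import numpy as np

def weight_decay(w):
    """Sum of squared weights, Eqn. (2.30)."""
    return float(np.sum(w ** 2))

def weight_elimination(w, w0):
    """Sum of (w_i/w0)^2 / (1 + (w_i/w0)^2), Eqn. (2.31)."""
    r = (w / w0) ** 2
    return float(np.sum(r / (1.0 + r)))
```

Each weight elimination term saturates at 1 for $|w_i| \gg w_0$ and behaves like $(w_i/w_0)^2$ for $|w_i| \ll w_0$, so large weights are penalized roughly by their count while small weights are penalized as in weight decay.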

The parameter $w_0$ can be used to adjust whether we want many weights of equal magnitude or some small weights together with a few of large magnitude. To understand the effect of this parameter, consider a network with only two weights. Fig. 2.9 shows $C_p(w)$ as a function of $w_1/(w_1+w_2)$ for several values of $(w_1+w_2)/w_0$.

Figure 2.9: Relationship between Weight Distribution and Penalty (x-axis: $w_1/(w_1+w_2)$; y-axis: $C_p(w)$; curves for $(w_1+w_2)/w_0$ = 1, 1.1, 1.2, 1.3, 1.4)

2.4.3 Laplace Prior

Williams [76] proposed another type of penalty function by assuming that the weights in the network have a Laplace distribution; the resulting penalty term, which comes from the logarithm of the prior, is proportional to the $\ell_1$ norm of the weight vector:

$$C_p(w) = \sum_{i \in \text{net}} |w_i|. \tag{2.32}$$

Figure 2.10: Illustration of 4 Different Degrees of Sparseness ($C_p$ = 0.1, 0.4, 0.7, 0.9). This figure is obtained from [34]; we added labels to the figure to make it clearer. The x axis is the index of the weight and the y axis is the magnitude of the weight. This figure just gives an example of sparse and non-sparse vectors, and there are no constraints on the norms.

2.4.4 Hoyer’s Method

Hoyer proposed a sparseness measure $C_p(w)$ based on the relationship between the $\ell_1$ norm and the $\ell_2$ norm [34][35]:

$$C_p(w) = \frac{\sqrt{N} - \sum |w_i| \big/ \sqrt{\sum w_i^2}}{\sqrt{N} - 1}, \tag{2.33}$$

where $N$ is the number of weights in the network. When all the $w_i$ but one are zero, Eqn. (2.33) reaches its maximum of 1. When all the $w_i$ have the same magnitude, it reaches its minimum of zero. An example of various degrees of sparseness is shown in Fig. 2.10 [34]. Four vectors are shown in the figure, exhibiting sparseness levels of 0.1, 0.4, 0.7, and 0.9. Each bar denotes the value of one element of the vector. At low levels of sparseness (leftmost), all elements are roughly equally active. At high levels (rightmost), most coefficients are zero whereas only a few take significant values.
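The two extreme cases of the sparseness measure can be checked with the short sketch below, which also includes the $\ell_1$ penalty of Eqn. (2.32). The function names are hypothetical, and the sparseness measure is written with $\sqrt{N}$ as in Hoyer's definition.

```python
import numpy as np

def l1_penalty(w):
    """Sum of absolute weights, Eqn. (2.32)."""
    return float(np.sum(np.abs(w)))

def hoyer_sparseness(w):
    """(sqrt(N) - l1/l2) / (sqrt(N) - 1); 1 for one nonzero weight, 0 for equal magnitudes."""
    n = w.size
    l1 = np.sum(np.abs(w))
    l2 = np.sqrt(np.sum(w ** 2))
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))
```

The measure is undefined for an all-zero vector, since the $\ell_2$ norm in the denominator vanishes; the sketch does not guard against that case.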


Chapter 3

Existing Methods for Feature Selection

The existing feature selection methods are usually divided into two groups, the filter methods and the wrapper methods. The biggest difference between them is that the filter method preprocesses the data and reduces the number of features based on distance and information measures [5][6], whereas the wrapper method selects the optimal subset of features with the aim of achieving accurate classification [41].

3.1 Wrapper Method

Figure 3.1: A Flow Chart for Wrapper Method (Phase 1: feature generation and learning algorithm applied to the training data, repeated until the subset accuracy is good; Phase 2: learning algorithm on the training data, then testing the classifier on the testing data to obtain the accuracy)

for each subset of features, a classifier (which can be a linear classifier, Bayesian classifier, neural network, etc.) is generated from the data with the chosen features. Its accuracy is recorded, and the feature subset with the highest accuracy is kept. When the selection process terminates, the subset with the best accuracy is chosen. The second phase in the wrapper model is a normal learning and testing process. Since we keep only the best subset of features in the first phase, we need to re-learn the classifier associated with the best subset. The following sections show how this method can be applied with various classifier architectures.
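The first phase of the wrapper model can be sketched as an exhaustive search over feature subsets. This is only an illustration under stated assumptions: a nearest-class-mean classifier stands in for the learning algorithm, accuracy is measured on a held-out set, and all function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def nearest_mean_accuracy(X_tr, y_tr, X_va, y_va):
    """Train a nearest-class-mean classifier and return its held-out accuracy."""
    classes = np.unique(y_tr)
    means = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = ((X_va[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dist.argmin(axis=1)]
    return float((pred == y_va).mean())

def wrapper_select(X_tr, y_tr, X_va, y_va):
    """Phase 1: evaluate every feature subset and keep the most accurate one."""
    n = X_tr.shape[1]
    best_subset, best_acc = None, -1.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            cols = list(subset)
            acc = nearest_mean_accuracy(X_tr[:, cols], y_tr, X_va[:, cols], y_va)
            if acc > best_acc:            # ties keep the smaller, earlier subset
                best_subset, best_acc = subset, acc
    return best_subset, best_acc
```

The exhaustive loop visits $2^n - 1$ subsets, so it is only practical for a small number of features; the methods reviewed in the following sections avoid this cost.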

3.1.1 Wrapper for Neural Networks

Setiono and Liu proposed a feature selector using neural networks [67][36]. The features are pruned by adding a penalty function to the error function as in Eqn. (4.1), where the penalty function at iteration $k$ is

$$C_{pk}(w) = \epsilon_{1k}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} \frac{\beta (w_\ell^m)^2}{1 + \beta (w_\ell^m)^2}\right) + \epsilon_{2k}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} (w_\ell^m)^2\right), \tag{3.1}$$


iterations. The available data are divided into the training set $S_1$ and the test set $S_2$. The set $S_2$ is used for cross-validation, and the algorithm decides whether to stop removing features by testing the accuracy on $S_2$. The accuracy rates (percentage of correctly classified data) for sets $S_1$ and $S_2$ are denoted as $R_1$ and $R_2$, respectively. The algorithm keeps the best accuracy rate $R_2$ of the networks on this set. If there is still a feature that can be removed such that $R_2$ does not drop by more than $\Delta R$ (a predetermined constant), this feature is removed; otherwise, the algorithm terminates.

The penalty function in Eqn. (3.1) is applied iteratively. At iteration $k$, all features but $F_k$ are input to the neural network. The penalty parameters $\epsilon_{1k}$ and $\epsilon_{2k}$ are set equal for all features initially. After the network is trained, the relative importance of each feature can be inferred from the accuracy rate of the network $N_k$ that has one less feature, $F_k$. A high accuracy rate of $N_k$ suggests that feature $F_k$ can be removed from the feature set. After computing $N_1$ through $N_n$, where $n$ is the number of features, the average accuracy rate is calculated. If the accuracy rate of $N_k$ is higher than the average, the penalty parameters for the network connections from input feature $F_k$ are multiplied by a factor of 1.1, because with larger penalty parameters, the connections from the input feature will have smaller magnitude after the network is retrained, and the feature can be removed in the next iteration. On the other hand, if the accuracy of $N_k$ is below average, $F_k$ is an important feature, and the penalty parameters are divided by a factor of 1.1.
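The penalty parameter update described above can be sketched as follows. This is a minimal sketch with a hypothetical function name; leaving a parameter unchanged when the accuracy equals the average exactly is an assumption, since the text does not cover that case.

```python
def update_penalties(eps, accuracies, factor=1.1):
    """Scale each feature's penalty parameter according to the accuracy of N_k."""
    avg = sum(accuracies) / len(accuracies)
    updated = []
    for e, acc in zip(eps, accuracies):
        if acc > avg:        # network does well without F_k: penalize F_k more
            updated.append(e * factor)
        elif acc < avg:      # network suffers without F_k: penalize F_k less
            updated.append(e / factor)
        else:                # assumption: no change when exactly average
            updated.append(e)
    return updated
```

The same update would be applied to both $\epsilon_{1k}$ and $\epsilon_{2k}$ of a given feature.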

3.1.2 Wrapper for Support Vector Machines (SVM)

The support vector machine, which is known as a special type of feed-forward network, has been used for feature selection [75][29][61]. Before we talk about how the SVM is applied to feature selection, we want to introduce the concept of SVM and its principles.


Linearly Separable Case

Consider a set of data $\{x_i, t_i\}_{i=1}^{N}$, where $x_i$ is the input vector and $t_i$ is the target output. For a two-class problem, we assume the input data with outputs $t_i = 1$ and $t_i = -1$ are linearly separable. The separating hyperplane can be written as

$$w^T x + b = 0, \tag{3.2}$$

where $w$ is the weight vector and $b$ is the bias. For a linearly separable case, we can always find $w$ and $b$ such that

$$\begin{aligned} w^T x_i + b &\ge 0 \quad \text{for } t_i = +1 \\ w^T x_i + b &< 0 \quad \text{for } t_i = -1. \end{aligned} \tag{3.3}$$

For a given set of $w$ and $b$, the separation between the hyperplane and the closest data point is known as the margin of separation, denoted by $\rho$. The goal of the SVM is to find $w$ and $b$ such that the data are classified correctly and the margin is maximized. The hyperplane that maximizes the margin is known as the optimal hyperplane. An example of a simple classification problem with the optimal hyperplane is shown in Fig. 3.2.

Let $w^*$ and $b^*$ be the optimal values that form the optimal hyperplane. The optimal hyperplane, representing a multi-dimensional linear decision surface, is then defined by

$$w^{*T} x + b^* = 0. \tag{3.4}$$

The discriminant function can thus be written as

$$g(x) = w^{*T} x + b^*. \tag{3.5}$$

We can express $x$ as

$$x = x_p + r\frac{w^*}{\|w^*\|}, \tag{3.6}$$

where $x_p$ is the normal projection of $x$ onto the optimal hyperplane, and $r$ is the algebraic distance, which is positive when $x$ is on the positive side of the hyperplane and negative if $x$ is on the negative side.

Figure 3.2: Illustration of a Separable Case and the Optimal Hyperplane (labels: margin; support vectors; $w^T x + b = +1$; optimal hyperplane $w^{*T} x + b^* = 0$; $w^T x + b = -1$)

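Substituting Eqn. (3.6) into Eqn. (3.5) gives

$$\begin{aligned} g(x) &= w^{*T} x + b^* \\ &= w^{*T}\left(x_p + r\frac{w^*}{\|w^*\|}\right) + b^* \\ &= w^{*T} x_p + r\frac{w^{*T} w^*}{\|w^*\|} + b^* \\ &= w^{*T} x_p + b^* + r\|w^*\| \\ &= g(x_p) + r\|w^*\|. \end{aligned}$$

Because $x_p$ is the normal projection of $x$ onto the optimal hyperplane, it satisfies Eqn. (3.4), i.e., $g(x_p) = 0$. Therefore, we have

$$g(x) = r\|w^*\|, \tag{3.7}$$

or

$$r = \frac{g(x)}{\|w^*\|}. \tag{3.8}$$

The distance from the origin ($x = 0$) to the optimal hyperplane is therefore $b^*/\|w^*\|$; in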

algebraic distance from the support vectors $x_s$ to the hyperplane is

$$r = \frac{g(x_s)}{\|w^*\|} = \begin{cases} \dfrac{1}{\|w^*\|} & \text{if } t_s = 1 \\[2ex] -\dfrac{1}{\|w^*\|} & \text{if } t_s = -1, \end{cases}$$

where the positive sign means the support vector $x_s$ is on the positive side of the hyperplane and the negative sign indicates that $x_s$ lies on the other side. The margin of separation, which is twice the distance from the support vectors to the hyperplane, is therefore

$$\rho = 2r = \frac{2}{\|w^*\|}. \tag{3.9}$$
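Eqns. (3.8) and (3.9) can be checked numerically with the short sketch below. The function names are hypothetical, and the hyperplane is assumed to be scaled so that $g(x_s) = \pm 1$ at the support vectors.

```python
import numpy as np

def algebraic_distance(x, w, b):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane, Eqn. (3.8)."""
    return float((w @ x + b) / np.linalg.norm(w))

def margin(w):
    """Margin of separation rho = 2 / ||w||, Eqn. (3.9)."""
    return float(2.0 / np.linalg.norm(w))
```

For $w = (3, 4)^T$ and $b = 0$, the norm is 5, so the margin is 0.4 and the point $(3, 4)^T$ lies at distance 5 on the positive side.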

Finding the Optimal Hyperplane

We discussed earlier that the optimal hyperplane is the plane that satisfies Eqn. (3.3) and maximizes $\rho$. From Eqn. (3.9), maximizing $\rho$ is equivalent to minimizing $\frac{1}{2}\|w\|$. Since $\|w\|^2 = w^T w$, finding the optimal hyperplane can thus be written as the following optimization problem:

$$\min \frac{1}{2} w^T w \quad \text{subject to } t_i(w^T x_i + b) \ge 1 \text{ for } i = 1, 2, \cdots, N, \tag{3.10}$$

where $t_i$ is the class label as defined in Eqn. (3.3).

The constraint in Eqn. (3.10) is just another form of Eqn. (3.3). Note that the objective function is a convex function of $w$ and the constraint is a linear function of $w$. This optimization problem can be solved using the method of Lagrange multipliers, where we have

$$J(w, b, \lambda) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \lambda_i [t_i(w^T x_i + b) - 1], \tag{3.11}$$

where $\lambda_i$ are the Lagrange multipliers. Taking the partial derivative of $J(w, b, \lambda)$ with respect to $w$ and setting it to zero, we have

$$\frac{\partial J(w, b, \lambda)}{\partial w} = w - \sum_{i=1}^{N} \lambda_i t_i x_i = 0. \tag{3.12}$$

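Therefore,

$$w = \sum_{i=1}^{N} \lambda_i t_i x_i. \tag{3.13}$$

Now, taking the partial derivative of $J(w, b, \lambda)$ with respect to $b$ and setting it to zero, we have

$$\frac{\partial J(w, b, \lambda)}{\partial b} = -\sum_{i=1}^{N} \lambda_i t_i = 0, \tag{3.14}$$

i.e.,

$$\sum_{i=1}^{N} \lambda_i t_i = 0. \tag{3.15}$$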

The corresponding dual problem is given in Eqn. (3.16), and a more detailed discussion can be found in [32][20][57]:

$$\max_{\lambda} Q(\lambda) = \sum_{i=1}^{N} \lambda_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \lambda_i \lambda_j t_i t_j x_i^T x_j \quad \text{subject to } \sum_{i=1}^{N} \lambda_i t_i = 0, \; \lambda_i \ge 0. \tag{3.16}$$

Note that $Q(\lambda)$ depends on the data only through the inner products $\{x_i^T x_j\}_{i,j=1}^{N}$. Solving for the optimal Lagrange multipliers $\lambda_i^*$, the optimal weight vector is

$$w^* = \sum_{i=1}^{N} \lambda_i^* t_i x_i, \tag{3.17}$$

and the optimal bias can be solved using the property of support vectors ($w^{*T} x_s + b^* = \pm 1$); therefore,

$$b^* = 1 - w^{*T} x_s \quad \text{for } t_s = 1. \tag{3.18}$$
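Given a set of multipliers, Eqns. (3.16) to (3.18) can be evaluated with the sketch below. It does not solve the dual problem; the multipliers are assumed to come from a quadratic programming solver, and the function names are hypothetical.

```python
import numpy as np

def dual_objective(lam, X, t):
    """Q(lambda) of Eqn. (3.16); rows of X are the x_i, t holds the labels +1/-1."""
    G = (X @ X.T) * np.outer(t, t)    # G_ij = t_i t_j x_i^T x_j
    return float(lam.sum() - 0.5 * lam @ G @ lam)

def recover_hyperplane(lam, X, t, tol=1e-8):
    """Recover w* (Eqn. 3.17) and b* (Eqn. 3.18) from the multipliers."""
    w = (lam * t) @ X
    sv = np.flatnonzero((lam > tol) & (t == 1))[0]   # a support vector with t_s = +1
    b = 1.0 - w @ X[sv]
    return w, float(b)
```

For the two points $x_1 = (1, 1)^T$ with $t_1 = +1$ and $x_2 = (-1, -1)^T$ with $t_2 = -1$, the multipliers $\lambda_1 = \lambda_2 = 0.25$ satisfy the constraints and give $w^* = (0.5, 0.5)^T$ and $b^* = 0$; the dual objective then equals the primal value $\frac{1}{2} w^{*T} w^* = 0.25$.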


Figure 3.3: Data Point Inside the Region of Separation, but On the Right Side of the Decision Surface (labels: margin; support vectors; $w^T x + b = +1$; optimal hyperplane $w^{*T} x + b^* = 0$; $w^T x + b = -1$; correctly classified point)

Linearly Non-Separable Case

Given a set of linearly non-separable data, it is not possible to find a separating hyperplane with no classification error. Therefore, we want to find a hyperplane for which the classification error is minimized.

If the constraint in Eqn. (3.10) is violated, the region of separation is called soft and there are two possible ways of violation as shown in Fig. 3.3 and Fig. 3.4.

In Fig. 3.3, the point inside the dashed circle falls inside the region of separation and thus the constraint in Eqn. (3.10) is violated; however, with the same optimal hyperplane, it can still be correctly classified. In Fig. 3.4, the point inside the dashed circle also falls inside the region of separation, however, since this data point belongs to another class, it will cause a classification error with the optimal hyperplane.

Introducing a set of nonnegative slack variables $\{\xi_i\}_{i=1}^{N}$, the problem of finding the new separating hyperplane can be expressed as

$$\min \frac{1}{2} w^T w \quad \text{subject to } t_i(w^T x_i + b) \ge 1 - \xi_i \text{ for } i = 1, 2, \cdots, N. \tag{3.19}$$

It is easy to see that in Fig. 3.3, we have $0 \le \xi_i \le 1$, and the data are still correctly classified.

Figure 3.4: Data Point Inside the Region of Separation, but On the Wrong Side of the Decision Surface (labels: margin; support vectors; $w^T x + b = +1$; optimal hyperplane $w^{*T} x + b^* = 0$; $w^T x + b = -1$; misclassified point)

To minimize the total classification error, we solve

$$\min_{w} \Phi(w) = \frac{1}{2} w^T w + C\sum_{i=1}^{N} \xi_i \quad \text{subject to } t_i(w^T x_i + b) \ge 1 - \xi_i \text{ for } i = 1, 2, \cdots, N, \; \xi_i \ge 0. \tag{3.20}$$

Minimizing the first term in Eqn. (3.20) corresponds to minimizing the dimension of the support vector machine, and the second term in the equation minimizes the classification error, where $\sum_{i=1}^{N} \xi_i$ is an upper bound on the number of test errors. Like $\lambda$ in Eqn. (2.29), the parameter $C$ adjusts the relative importance of the dimensionality and the classification performance, and its value is given by the user.
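The objective of Eqn. (3.20) can be evaluated for a given hyperplane with the sketch below. It assumes each slack variable takes the smallest value allowed by the constraints, $\xi_i = \max(0, 1 - t_i(w^T x_i + b))$; the function names are hypothetical.

```python
import numpy as np

def slacks(w, b, X, t):
    """Smallest feasible slack for each point: max(0, 1 - t_i (w^T x_i + b))."""
    return np.maximum(0.0, 1.0 - t * (X @ w + b))

def soft_margin_objective(w, b, X, t, C):
    """Phi(w) = 0.5 w^T w + C * sum of slacks, Eqn. (3.20)."""
    return float(0.5 * w @ w + C * slacks(w, b, X, t).sum())
```

A point with $0 < \xi_i \le 1$ lies inside the region of separation but on the correct side, as in Fig. 3.3, while $\xi_i > 1$ corresponds to the misclassified point of Fig. 3.4.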
