Feature Generation - Pattern recognition using genetic programming for classification of diabet

designed to overcome this drawback. In this method, a removed variable may be added back to the dataset if it later appears to be significant.

2.5 Feature Generation

Feature generation is a broad term in the domain of PR, and it is sometimes taken to include feature selection. It is the process of transforming the original set of features into a new, reduced set of features (by combining the original features) such that the underlying classes are more separable.

In image processing and signal processing the term feature generation is sometimes used slightly differently, to describe the extraction of features from the original image or signal. Instead of using the full-size input, features that form a reduced representation of the actual input are extracted from it. These features are expected to contain enough relevant information to perform the desired task. This usage is similar to the definition above in that both seek a reduced representation of the input. The ultimate objective of feature generation is the same in all domains, i.e. to reduce the amount of resources required to perform the desired task. The increase in required resources with increasing input dimension was described previously as the curse of dimensionality. Although computing power has grown exponentially in recent years, the reduction in computational complexity offered by feature generation cannot be ignored. Moreover, feature generation can improve classification performance, which is valuable in its own right.

There are quite a few feature generation methods in the literature, and they can be grouped according to different criteria. For example, they can be categorised as linear or non-linear based on the transformation function used, or as supervised or unsupervised depending on whether class information is available. Linear feature generation methods are relatively simple and provide analytical solutions for most problems. A linear solution, where one exists, is generally preferred over a non-linear one because of its simplicity. In this thesis both linear and non-linear feature generation methods are used to transform the original features.
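As a rough illustration of the linear versus non-linear distinction, the two cases can be written as follows. The symbols W, f, k and the vectors x and y are introduced here only for illustration and are not notation taken from the thesis: x is an m-dimensional feature vector and y is the generated k-dimensional feature vector.

```latex
\begin{aligned}
\text{linear:}\quad     & \mathbf{y} = W^{T}\mathbf{x}, \qquad W \in \mathbb{R}^{m \times k},\; k \le m \\
\text{non-linear:}\quad & \mathbf{y} = f(\mathbf{x}), \qquad f : \mathbb{R}^{m} \to \mathbb{R}^{k}
\end{aligned}
```

In the linear case the whole method is described by the matrix W, which is why analytical solutions are often available. PCA, described in Section 2.5.1, is of this kind.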


2.5.1 Principal Component Analysis

Principal component analysis (PCA), introduced by Pearson in 1901 [12], is one of the simplest and most popular feature generation methods. The purpose of PCA is to linearly transform the original set of possibly correlated features into a new set of linearly uncorrelated features, such that the new features are orthogonal to each other. The new features are called the principal components. The assumption made in PCA is that most of the information lies along the dimensions where the variance is largest. The first principal component accounts for the largest share of the variance, followed by the second principal component, and so on. PCA does not take the class labels into account and is essentially an unsupervised technique. Since this study is not a theoretical study of statistics, only the methodology of PCA is presented; interested readers can consult [1] for more details.

Given an m by n data matrix X, where each column represents an observation and each row represents a feature, the steps for calculating the principal components are given below.

1. Subtract the mean: For PCA to work properly, each feature should have zero mean, which can be achieved by subtracting the mean of each feature from every element of that feature.

2. Calculate the covariance matrix: The symmetric covariance matrix of size m by m can be calculated from the data matrix X using the following equation

$$C_{m \times m} = \sum_{i=0}^{n-1} (x_i - \mu_i)(x_i - \mu_i)^{T} \qquad (2.10)$$

where $x_i$ is the ith feature vector and $\mu_i$ is the mean of the ith vector.

3. Calculation of Eigenvalues and Eigenvectors: The eigenvalues and eigenvectors are calculated from the covariance matrix obtained in the second step.

4. Sorting: The eigenvectors are sorted in descending order of their eigenvalues.

5. Transformation of original matrix: The original matrix is transformed into a new matrix by multiplying it by an eigenvector or a set of eigenvectors. The first eigenvector (corresponding to the highest eigenvalue) will cover the maximum variance in the new feature space.

6. Selection of principal components: Since most of the data variance is covered by the first few principal components, the last few principal components (transformed feature vectors) can be discarded. A well-known criterion for deciding which principal components to discard is the eigenvalue-one criterion [13]. When the features are standardised to unit variance, each original feature contributes one unit of variance to the total variance of the dataset. Any principal component with an eigenvalue greater than one covers more variance than a single original feature; such a component is considered more valuable and is worth retaining. A principal component with an eigenvalue less than one, on the other hand, covers less variance than a single original feature; such a component is considered trivial and is removed from the transformed features.
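The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. The function name `pca_steps` and the synthetic data are hypothetical. The covariance is divided by n - 1, and each feature is scaled to unit variance so that the eigenvalue-one criterion is meaningful. Neither of these normalisations appears in equation (2.10).

```python
import numpy as np

def pca_steps(X):
    """PCA on an m x n matrix X (rows = features, columns = observations).

    Hypothetical helper for illustration only. Returns the sorted
    eigenvalues, the matching eigenvectors (as columns), the transformed
    data, and a boolean mask of components kept by the eigenvalue-one rule.
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape

    # Step 1: subtract the mean of each feature (row).
    Xc = X - X.mean(axis=1, keepdims=True)
    # Assumption: scale each feature to unit variance so that every
    # feature contributes exactly one unit to the total variance.
    Xc = Xc / Xc.std(axis=1, ddof=1, keepdims=True)

    # Step 2: m x m covariance matrix (assumption: normalised by n - 1).
    C = (Xc @ Xc.T) / (n - 1)

    # Step 3: eigen-decomposition; eigh is meant for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(C)

    # Step 4: sort eigenvectors by descending eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: project the data; row k of Y is the k-th principal component.
    Y = eigvecs.T @ Xc

    # Step 6: eigenvalue-one criterion.
    keep = eigvals > 1.0
    return eigvals, eigvecs, Y, keep

# Synthetic example: four features forming two strongly correlated pairs.
rng = np.random.default_rng(0)
n = 200
a, b = rng.normal(size=n), rng.normal(size=n)
X = np.vstack([
    a + 0.1 * rng.normal(size=n),
    a + 0.1 * rng.normal(size=n),
    b + 0.1 * rng.normal(size=n),
    b + 0.1 * rng.normal(size=n),
])

eigvals, eigvecs, Y, keep = pca_steps(X)
print("eigenvalues:", eigvals)
print("components kept:", int(keep.sum()), "of", len(eigvals))
```

Because the features are standardised, the eigenvalues sum to the number of features, which is what makes one a natural threshold. Without that scaling the criterion would depend on the units of each feature.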

PCA is a non-parametric dimensionality reduction method, since it does not require any a priori knowledge of the probability distribution of the data. This is considered both a strength and a weakness of PCA: it requires less information about the problem in question (a strength), but it can result in overfitting when only a limited number of data samples is available (a weakness). It is important to point out that PCA is based on the following assumptions:

1. The dimensionality of the data can be efficiently reduced by a linear transformation.

2. Most of the information is contained in the directions where data variation is maximum.

These two assumptions are by no means always met. For example, if the data points lie on the surface of a hypersphere, a linear transformation (for dimension reduction) cannot cope with it. Similarly, the direction of maximum variation does not necessarily contain the maximum information. This is particularly true when the signal-to-noise ratio (SNR) is low, in which case noise is also treated as useful variance. Moreover, for classification problems, PCA does not take class labels into account, and the direction of maximum variation may not help in class discrimination.
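The last point can be illustrated with a small synthetic experiment. This is a sketch on made-up data, not a result from the thesis: one feature has a large variance but carries no class information, and the other has a small variance but separates the two classes. The helper `threshold_accuracy` is a hypothetical, deliberately crude classifier that splits the values at their mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
labels = rng.integers(0, 2, size=n)           # two classes: 0 and 1

# Feature 1: large variance, unrelated to the class label.
x1 = rng.normal(0.0, 10.0, size=n)
# Feature 2: small variance, but its mean depends on the class.
x2 = np.where(labels == 1, 1.0, -1.0) + rng.normal(0.0, 0.2, size=n)

X = np.vstack([x1, x2])                       # rows = features
Xc = X - X.mean(axis=1, keepdims=True)
C = (Xc @ Xc.T) / (n - 1)

eigvals, eigvecs = np.linalg.eigh(C)
pc1 = eigvecs[:, np.argmax(eigvals)]          # direction of maximum variance
scores = pc1 @ Xc                             # data projected onto PC1

def threshold_accuracy(values, labels):
    """Accuracy of splitting `values` at their mean (best polarity)."""
    pred = values > values.mean()
    acc = np.mean(pred == (labels == 1))
    return max(acc, 1.0 - acc)

acc_pc1 = threshold_accuracy(scores, labels)
acc_x2 = threshold_accuracy(x2, labels)
print("first principal direction:", pc1)
print("accuracy using PC1 only:  ", acc_pc1)
print("accuracy using feature 2: ", acc_x2)
```

In this construction the first principal component aligns almost entirely with the high-variance feature, so keeping only that component discards the feature that actually discriminates between the classes.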

Nevertheless, PCA is a basic and popular method for feature generation. It is fast and effective when the assumptions above are met.

In document Pattern recognition using genetic programming for classification of diabetes and modulation data (Page 39-42)