2.3 Techniques to treat Small Sample Size problem

2.3.2 Feature selection techniques

The SSS problem arises when the number of available samples is comparable to the dimensionality of the samples. The basic idea behind feature selection techniques is to find a more informative subspace of the data and to project all the data onto that subspace [29]. In this way, feature selection techniques remove redundant dimensions and reduce the dimensionality of the samples.

The most popular methods for finding this subspace are Principal Component Analysis (PCA) [182] and its extensions [27, 18]. PCA-based methods provide the eigenvectors of the data, i.e. the directions on which the analysis should focus.

PCA generally works as follows. Suppose that m d-dimensional random variables, denoted by $x_1, x_2, \dots, x_m$, are available. The objective of PCA is to find n < m random variables, denoted by $z_j$, that are linear combinations of the $x_i$ and preserve as much of the information in the $x_i$ as possible. That is,

\[ z_j = \sum_{i=1}^{m} w_{ji} x_i, \qquad j = 1, \dots, n. \tag{2.12} \]

The most prevalent way to solve this information preservation problem is the eigenvalue decomposition of the covariance matrix of the random variables $x_i$.

The covariance matrix is denoted by $C$, and its components are $C_{ij} = \mathrm{cov}(x_i, x_j)$, where $\mathrm{cov}(x_i, x_j)$ is defined as

\[ \mathrm{cov}(x_i, x_j) = E(x_i x_j) - E(x_i)\, E(x_j), \tag{2.13} \]

where $E(x_i)$ is the expected value of $x_i$.
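As an illustration of Eq. 2.13, the following minimal sketch estimates the covariance matrix from observed data by replacing each expectation with a sample average. The data values and the function names (`mean`, `covariance`, `covariance_matrix`) are hypothetical, and each variable is assumed to be scalar with the same number of observations.

```python
def mean(values):
    """Sample average, used here as an estimate of the expected value E(.)."""
    return sum(values) / len(values)


def covariance(xi, xj):
    """Estimate cov(x_i, x_j) = E(x_i x_j) - E(x_i) E(x_j) from paired observations."""
    products = [a * b for a, b in zip(xi, xj)]
    return mean(products) - mean(xi) * mean(xj)


def covariance_matrix(variables):
    """Build C with components C[i][j] = cov(x_i, x_j)."""
    m = len(variables)
    return [[covariance(variables[i], variables[j]) for j in range(m)]
            for i in range(m)]


# Hypothetical data: three variables with four observations each.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
C = covariance_matrix(data)
```

Because the averages divide by the number of observations, this is the population form of the estimate; an unbiased estimate would divide by one less.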

From the definition in Eq. 2.13, it is evident that the covariance matrix is symmetric, i.e. $C_{ij} = C_{ji}$. A symmetric matrix can be decomposed as

\[ C = U D U^T, \tag{2.14} \]

in which $U$ is orthogonal, i.e. $U U^T = I$, and $D = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$ is diagonal. The columns of $U$ are the eigenvectors of $C$, and the $\lambda_i$ are their corresponding eigenvalues.
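The decomposition in Eq. 2.14 can be checked numerically. The sketch below uses NumPy's `numpy.linalg.eigh`, which is intended for symmetric matrices; the matrix `C` is a hypothetical example chosen for illustration, not data from this work.

```python
import numpy as np

# Hypothetical symmetric covariance matrix (illustrative values only).
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])

# eigh returns the eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, U = np.linalg.eigh(C)

# Reorder so that lambda_1 >= lambda_2 >= ... as assumed when ranking components.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
U = U[:, order]
D = np.diag(eigenvalues)

# Check the two properties stated with Eq. 2.14.
reconstruction_ok = np.allclose(C, U @ D @ U.T)   # C = U D U^T
orthogonal_ok = np.allclose(U @ U.T, np.eye(3))   # U U^T = I
```

Sorting in descending order is a convenience here, so that the first column of `U` is the direction associated with the largest eigenvalue.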

The first principal component, which corresponds to the largest eigenvalue, contains the highest amount of information. The second principal component, which corresponds to the second largest eigenvalue, contains the second highest amount of information. Similarly, the ith principal component corresponds to the ith largest eigenvalue and contains the ith highest amount of information.

It can be shown that the share of information carried by each principal component is proportional to the magnitude of its corresponding eigenvalue. Therefore, to select a number of components that preserves a desired amount of information, we first sort the eigenvalues in descending order. Then, the first k principal components are selected, where k is the smallest value such that

\[ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \geq \text{a predefined threshold}. \tag{2.15} \]
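The selection rule of Eq. 2.15 can be sketched in a few lines. This sketch assumes the usual reading of the rule, in which k is the smallest number of leading eigenvalues whose share of the total reaches the threshold; the function name `select_num_components` and the example values are hypothetical.

```python
def select_num_components(eigenvalues, threshold):
    """Return the smallest k whose k largest eigenvalues hold at least
    `threshold` (a fraction between 0 and 1) of the eigenvalue total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues, given in arbitrary order.
eigenvalues = [0.5, 3.0, 1.0]
k = select_num_components(eigenvalues, threshold=0.85)
```

With these values, the largest eigenvalue alone holds about two thirds of the total, so a threshold of 0.85 requires the two largest eigenvalues.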

Another powerful family of tools for finding the most discriminative subspace consists of Linear Discriminant Analysis (LDA) based methods [183–185] and their extensions [186]. Because the criterion that LDA uses to find the discriminative subspace differs from the one used in PCA, the subspace found by LDA is different from the subspace found by PCA.

Initially, LDA divides all samples, denoted by $x_i$, into K classes, each containing M samples. Then, LDA finds the following transformation,

\[ \tilde{x}_{km} = A^T x_{km}, \tag{2.16} \]

where $x_{km}$ is the mth d-dimensional random variable in the kth class and $\tilde{x}_{km}$ is its projection. By optimally selecting the matrix $A$, the transformation in Eq. 2.16 projects the samples onto a subspace that retains an acceptable amount of information.

The matrix $A$ can be determined by the Fisher criterion, which is defined as

\[ F(q) = \frac{q^T S_b\, q}{q^T S_w\, q}, \tag{2.17} \]

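in which $q \in R^n$, and $S_b$ and $S_w$ are the between-class and within-class scatter matrices, respectively. $S_b$ and $S_w$ are defined as

\[ S_b = \sum_{k=1}^{K} (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T, \qquad S_w = \sum_{k=1}^{K} \sum_{m=1}^{M} (x_{km} - \bar{x}_k)(x_{km} - \bar{x}_k)^T, \tag{2.18} \]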

where $\bar{x}_k$ and $\bar{x}$ are the mean of the samples in the kth class and the mean of all samples, respectively. The columns of the matrix $A$ are chosen from the set of $\tilde{q}$s obtained from the optimization below,

\[ \tilde{q} = \arg\max_{q \in R^n} F(q). \tag{2.19} \]

To construct the set of $\tilde{q}$s, one can choose the set of eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$ [187].
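The LDA procedure described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the two-class data are hypothetical, the function names are invented for this example, the within-class scatter is summed over all classes, and $S_w$ is assumed to be invertible (which may fail under the SSS problem).

```python
import numpy as np


def scatter_matrices(classes):
    """Return (S_b, S_w) for a list of (M, d) arrays, one array per class."""
    overall_mean = np.vstack(classes).mean(axis=0)
    d = overall_mean.shape[0]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for samples in classes:
        class_mean = samples.mean(axis=0)
        diff = (class_mean - overall_mean).reshape(-1, 1)
        S_b += diff @ diff.T              # between-class term for this class
        centered = samples - class_mean
        S_w += centered.T @ centered      # within-class terms for this class
    return S_b, S_w


def fisher_criterion(q, S_b, S_w):
    """F(q) = (q^T S_b q) / (q^T S_w q)."""
    return float(q @ S_b @ q) / float(q @ S_w @ q)


def lda_directions(S_b, S_w, n_directions):
    """Eigenvectors of S_w^{-1} S_b for the largest eigenvalues, as columns."""
    # Solving S_w X = S_b avoids forming the inverse explicitly.
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigenvalues.real)[::-1][:n_directions]
    return eigenvectors.real[:, order]


# Hypothetical data: K = 2 classes, M = 5 samples each, d = 2 dimensions.
class_a = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0], [5.0, 5.0]])
class_b = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 4.0]])

S_b, S_w = scatter_matrices([class_a, class_b])
A = lda_directions(S_b, S_w, n_directions=1)

# Each row below is the projection A^T x of one sample.
projected_a = class_a @ A
projected_b = class_b @ A
```

For two classes, $S_b$ has rank one, so only one direction carries discriminative information; in general at most K − 1 eigenvalues of $S_w^{-1} S_b$ are non-zero.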

LDA may still suffer from the SSS problem when the sample size is very small. In that case, combining LDA with PCA has been proposed as a more effective way to treat the SSS problem [187].

Since the above-mentioned methods have been widely used for face recognition, they have been designed to suit this application. The major drawback of dimensionality reduction techniques is that they discard a subspace of the data that, in some situations, may contain a noticeable amount of information. This may therefore reduce the accuracy of the model.