
More Effective Unsupervised Feature Selection Algorithms


As explained in the previous section, it is difficult to perform unsupervised variable selection using the PCA and SPCA decompositions. Several algorithms have been developed in the literature as better alternatives to PCA and SPCA. Many of these can be thought of as extensions to the unsupervised domain of algorithms such as lasso [63], forward selection regression (FSR) and backward elimination regression (BER), which were originally designed for supervised regression [129]. The unsupervised version of lasso, Convex Principal Features Selection (CPFS), is proposed in [117] and is described in the following algorithm.

Algorithm 4.5.1 (Convex Principal Features Selection). CPFS is defined by the following minimization problem:

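\[
\hat{A}_\lambda = \operatorname*{argmin}_{A \in \mathbb{R}^{p \times p}} \; \| X - XA \|_F^2 + \lambda \sum_{i=1}^{p} \| \vec{a}_i \|_\infty \tag{4.39}
\]

where

\[
A = \begin{pmatrix} \vec{a}_1 \\ \vec{a}_2 \\ \vdots \\ \vec{a}_p \end{pmatrix} \tag{4.40}
\]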

The result of the minimization problem is a sparse matrix $\hat{A}_\lambda$ that contains the regression coefficients of each variable. The rows of $A$ whose values are all 0, or whose norm falls below a certain threshold, $\| \vec{a}_j \|_2 \le \epsilon$, correspond to variables that are redundant and can be discarded.
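To make the roles of the two terms in (4.39) concrete, the following minimal Python sketch evaluates the CPFS objective for a given coefficient matrix and lists the rows that fall below the threshold. It does not solve the minimization problem. The function names `cpfs_objective` and `redundant_rows`, the argument `eps` and the toy data are assumptions introduced here for illustration only.

```python
import numpy as np

def cpfs_objective(X, A, lam):
    """Value of ||X - XA||_F^2 + lam * sum_i ||a_i||_inf, with a_i the rows of A."""
    residual = X - X @ A
    penalty = np.sum(np.max(np.abs(A), axis=1))  # l-infinity norm of each row
    return float(np.sum(residual ** 2) + lam * penalty)

def redundant_rows(A, eps):
    """Indices j with ||a_j||_2 <= eps (hypothetical helper)."""
    return np.flatnonzero(np.linalg.norm(A, axis=1) <= eps)

# Toy data: the third variable is the sum of the first two.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 3.0, 4.0]])

A_identity = np.eye(3)                    # trivial reconstruction, every variable used
A_sparse = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0]])    # the third variable is never used

print(cpfs_objective(X, A_identity, lam=1.0))
print(cpfs_objective(X, A_sparse, lam=1.0))
print(redundant_rows(A_sparse, eps=1e-8))
```

Both matrices reconstruct the toy data exactly, but the second one has the smaller penalty because its third row is zero, so the third variable would be flagged as redundant.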

A more aggressive feature selection approach can be obtained with the unsupervised versions of FSR and BER. Forward selection and backward elimination of variables are well-known techniques used extensively in supervised learning. Given an input dataset $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$, an output $y \in \mathbb{R}^n$ and a function $\Phi$ that maps $X$ into $y$, it is of interest to estimate the set of $k < p$ variables $(x_{i_1}, \ldots, x_{i_k})$ that can optimally reconstruct $y$. Determining the optimal solution requires the evaluation of all $\binom{p}{k}$ combinations, which becomes computationally intractable even for moderate values of $p$ and $k$. An approximate solution is obtained using the forward selection [130], [131], [132], [133] or backward elimination [129], [17] procedure. Different algorithms based on these principles have been proposed for unsupervised feature selection. These iteratively select or remove variables according to a maximization criterion.
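The gap between exhaustive search and a greedy procedure can be illustrated with a short count. The sketch below assumes one criterion evaluation per candidate subset in the exhaustive case and one per remaining candidate variable at each greedy forward step. The function names are hypothetical.

```python
from math import comb

def exhaustive_evaluations(p, k):
    """Number of subsets of k variables out of p."""
    return comb(p, k)

def greedy_forward_evaluations(p, k):
    """Candidates examined by forward selection: p, then p - 1, ..., p - k + 1."""
    return sum(p - m for m in range(k))

for p, k in [(20, 5), (50, 10), (100, 10)]:
    print(p, k, exhaustive_evaluations(p, k), greedy_forward_evaluations(p, k))
```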

4.5.1 Unsupervised Forward Selection and Backward Elimination of Variables

4.5.1.1 FOS-MOD

Among the unsupervised forward selection algorithms, one of the most famous is FOS-MOD [119], the Forward Orthogonal Search (FOS) algorithm by Maximizing the Overall Dependency (MOD), described in the following algorithm.

Algorithm 4.5.2 (FOS-MOD). The similarity between two variables is defined as:

\[
sc(x, y) = \frac{(x^T y)^2}{(x^T x)(y^T y)} \tag{4.41}
\]

and a matrix of similarity between variables is computed as:

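\[
C^1 = \{c^1_{i,j}\}_{i,j=1,\ldots,p} = \{sc(x_i, x_j)\}_{i,j=1,\ldots,p} \tag{4.42}
\]

The first variable is then selected as $z_1 = x_{l_1}$, where $l_1$ is obtained as:

\[
\bar{C}^1_j = \frac{1}{p} \sum_{i=1}^{p} c^1_{i,j} \tag{4.43}
\]

\[
l_1 = \operatorname*{argmax}_{1 \le j \le p} \bar{C}^1_j \tag{4.44}
\]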

and the associated orthogonal variable is chosen as

\[
q_1 = z_1 \tag{4.45}
\]

After step $m - 1$, $m - 1$ variables have been selected,

\[
S_{m-1} = \{z_1, \ldots, z_{m-1}\} \tag{4.46}
\]

and their respective $m - 1$ associated orthogonal variables are

\[
Q_{m-1} = \{q_1, \ldots, q_{m-1}\} \tag{4.47}
\]

All the variables in the set complementary to $S_{m-1}$, denoted $C(S_{m-1})$, are transformed as:

\[
q^m_j = \alpha_j - \frac{\alpha_j^T q_1}{q_1^T q_1}\, q_1 - \cdots - \frac{\alpha_j^T q_{m-1}}{q_{m-1}^T q_{m-1}}\, q_{m-1} \qquad \forall\, \alpha_j \in C(S_{m-1}) \tag{4.48}
\]

The matrix C at step m is then defined as:

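\[
C^m = \{c^m_{i,j}\}_{i,j=1,\ldots,p} = \{sc(x_i, q^m_j)\}_{i,j=1,\ldots,p} \tag{4.49}
\]

and the $m$-th chosen variable is then $z_m = x_{l_m}$, where

\[
\bar{C}^m_j = \frac{1}{p} \sum_{i=1}^{p} c^m_{i,j} \tag{4.50}
\]

\[
l_m = \operatorname*{argmax}_{1 \le j \le p} \bar{C}^m_j \tag{4.51}
\]

The associated orthogonal variable is $q_m = q^m_{l_m}$.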
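The steps of Algorithm 4.5.2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation of [119]: the function name `fos_mod` and the tolerance `tol` are introduced here, the columns of `X` are assumed to be non-zero, and the orthogonalization in (4.48) is applied one direction at a time, which gives the same result in exact arithmetic because the $q$ vectors are mutually orthogonal.

```python
import numpy as np

def fos_mod(X, k, tol=1e-10):
    """Greedy FOS-MOD-style selection of k column indices of X (sketch)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    col_ss = np.sum(X ** 2, axis=0)        # x_i^T x_i, assumed non-zero
    R = X.copy()                           # candidates orthogonalised against q_1..q_{m-1}
    chosen = np.zeros(p, dtype=bool)
    selected = []
    for _ in range(k):
        r_ss = np.sum(R ** 2, axis=0)      # q_j^T q_j for every candidate
        # Skip selected variables and candidates that are already fully explained.
        cand = np.flatnonzero((r_ss > tol * col_ss) & ~chosen)
        if cand.size == 0:
            break
        # sc(x_i, q_j) from (4.41) for all variables i and candidates j.
        sc = (X.T @ R[:, cand]) ** 2 / np.outer(col_ss, r_ss[cand])
        j = int(cand[np.argmax(sc.mean(axis=0))])  # average over i, then argmax over j
        q = R[:, j].copy()                 # orthogonal variable of the chosen column
        chosen[j] = True
        selected.append(j)
        R -= np.outer(q, q @ R) / (q @ q)  # remove the direction q from every column
    return selected
```

On data made of two groups of nearly collinear variables, the sketch would be expected to pick one variable from the larger group first and one from the other group second, since after the first step the remaining members of the first group carry almost no unexplained variation.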

FOS-MOD bases all its computations on the similarity matrix C. Other methods base the selection of variables on the reconstruction error or on the explained variance, which makes them more similar to PCA. Some of these are Orthogonal Feature Selection (OFS, [134]), Selection of Variables to Preserve Multivariate Data Structure (SV, [114]) and Forward Selection Component Analysis (FSCA, [89]). In the next section a high-level overview of these algorithms is presented. It will be clear that they all share a similar structure. FSCA will then be used as an exemplar methodology and described in detail. It will also be shown that FOS-MOD is a particular case of FSCA.

4.5.1.2 Orthogonal Feature Selection

In Orthogonal Feature Selection (OFS) [134], the best rank-one approximation of the data, $t_1 p_1^T$, is computed with PCA. The variable $z_1$ is then selected as the one most correlated with $t_1$. The data are then projected onto the space orthogonal to $z_1$ and the procedure is repeated until $k$ variables are selected. The pseudocode for the algorithm is reported in Pseudocode 4.5.1. The main advantage of this algorithm is its low computational complexity when the number of variables $p$ is large. Indeed, $t_1$ can be computed efficiently with NIPALS (Pseudocode 4.3.1), and all the other operations have relatively low complexity even when large datasets are used. On the other hand, the variables are selected to approximate $t_1$, which is itself only an approximation of the original matrix $X$. This means that the resulting reconstruction is, in general, less accurate than the one obtained with FOS-MOD or with the FSCA algorithm, which will be presented in Section 4.6.

Input: Input matrix $X = (x_1, \ldots, x_p)$ and $k$, the number of variables to select.

for $r = 1, \ldots, k$ do

    Compute $t_1$, the projection of $X$ on its first PC

    $z_r = \operatorname{argmax}_{x_i} \operatorname{corr}(x_i, t_1)$

    Save $z_r$ as the $r$-th variable

    $X = X - \Phi(z_r) X$ (deflate $X$)

end for

return $z_1, \ldots, z_k$

Pseudocode 4.5.1: Orthogonal Features Selection
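Pseudocode 4.5.1 can be turned into a short NumPy sketch. Several choices below are assumptions rather than part of [134]: the first PC is obtained from a full SVD instead of NIPALS for brevity, the absolute correlation is used so that strongly anti-correlated variables also qualify, and $\Phi(z)$ is taken to be the orthogonal projector $z z^T / (z^T z)$. The function name `ofs` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def ofs(X, k, tol=1e-10):
    """Greedy OFS-style selection of k column indices of X (sketch)."""
    X = np.asarray(X, dtype=float)
    R = X - X.mean(axis=0)                 # centre, so cosine equals correlation
    scale = np.linalg.norm(R, axis=0).max()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        alive = norms > tol * scale        # columns not yet fully explained
        if not alive.any():
            break
        U, s, _ = np.linalg.svd(R, full_matrices=False)
        t1 = U[:, 0] * s[0]                # scores on the first PC, with norm s[0]
        corr = np.zeros(R.shape[1])
        corr[alive] = np.abs(R[:, alive].T @ t1) / (norms[alive] * s[0])
        j = int(np.argmax(corr))
        z = R[:, j].copy()
        selected.append(j)
        R -= np.outer(z, z @ R) / (z @ z)  # assumed deflation: Phi(z) = z z^T / (z^T z)
    return selected
```

For large $p$, the SVD call could be replaced by an iterative routine that returns only the first component, which is where the low computational cost mentioned above would come from.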

4.5.1.3 Selection of Variables to Preserve Multivariate Data Structure

In [114] another feature selection method, SV, is proposed. Initially all the variables are considered, and they are then recursively discarded. The variables are selected according to their ability to approximate the PCA scores; the algorithm is thus based on ideas similar to those used in OFS, described in the previous section. A full description of the algorithm is reported in Pseudocode 4.5.2. The main drawback of this algorithm is its computational complexity. It is based on backward elimination, which is more computationally demanding than forward selection, and PCA must be computed several times. In addition, the backward elimination procedure does not work if the number of variables is larger than the number of samples, as discussed in [135].

Input: Input matrix $X = (x_1, \ldots, x_p)$, $k$ the number of variables to retain and $v$ the number of PCs.

while $X$ has more than $k$ variables do

    Compute $T_v$, the projection of $X$ on its first $v$ PCs

    for each variable $x_i$ in $X$ do

        Obtain $X_i$ by removing $x_i$ from $X$

        Compute $T_{v(i)}$, the projection of $X_i$ on its first $v$ PCs

        Compute the SVD $U \Sigma V^T$ of $T_v^T T_{v(i)}$

        $D_i = \operatorname{trace}(T_v^T T_v + T_{v(i)}^T T_{v(i)} - 2\Sigma)$

    end for

    $i^* = \operatorname{argmin}_i D_i$

    Discard the variable $x_{i^*}$ from $X$

end while

return the $k$ variables remaining in $X$

Pseudocode 4.5.2: Selection of Variables to Preserve Multivariate Data Structure.
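The backward elimination loop of Pseudocode 4.5.2 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of [114]: the data are mean-centred, the PC scores are obtained from a full SVD, the discrepancy is computed as the Procrustes sum of squares $\operatorname{trace}(T_v^T T_v + T_{v(i)}^T T_{v(i)} - 2\Sigma)$, and $v \le k$ is assumed so that $v$ PCs always exist. The names `pc_scores` and `sv_backward_elimination` are hypothetical.

```python
import numpy as np

def pc_scores(X, v):
    """Scores of the (centred) matrix X on its first v principal components."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :v] * s[:v]

def sv_backward_elimination(X, k, v):
    """Indices of the k columns of X kept by backward elimination (sketch)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    keep = list(range(X.shape[1]))
    while len(keep) > k:
        T = pc_scores(X[:, keep], v)       # scores with all current variables
        D = []
        for pos in range(len(keep)):
            cols = keep[:pos] + keep[pos + 1:]
            Ti = pc_scores(X[:, cols], v)  # scores without the candidate variable
            sigma = np.linalg.svd(T.T @ Ti, compute_uv=False)
            # trace(T^T T) and trace(Ti^T Ti) are the sums of squared entries.
            D.append(np.sum(T ** 2) + np.sum(Ti ** 2) - 2.0 * sigma.sum())
        keep.pop(int(np.argmin(D)))        # drop the variable whose removal changes least
    return keep
```

Each pass of the outer loop recomputes one SVD per remaining variable, which makes the computational cost discussed above directly visible.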