Unsupervised Feature Selection

As discussed in the previous sections, PCA is an efficient method for reducing the dimensionality of the data. Its main drawback is that each principal component is a linear combination of all the original variables, which often makes the results difficult to interpret. It is desirable not only to achieve a good lower-dimensional approximation but also to reduce the number of explicitly used variables. An ad hoc approach is to set to zero all loadings whose absolute value is smaller than a threshold. This informal thresholding is frequently used in practice but can be misleading [105].

[$P_k$, $T_k$] = NIPALS($X$, $k$)

Input: Data matrix $X$, number of PCs $k$

1: Set $E = X$

2: Initialise $P_0 = T_0 = \emptyset$

3: Set $\epsilon = 10^{-6}$ (convergence threshold)

4: Initialise $t$ to a non-zero column of $X$

5: for $j = 1$ to $k$ do

6:     Set $t_{new} = t$ and $t_{old} = t + 2\epsilon$

7:     while $\| t_{old} - t_{new} \|_2 \geq \epsilon$ do

8:         $t_{old} = t_{new}$

9:         $p = E^T t / (t^T t)$

10:        $p = p / \sqrt{p^T p}$

11:        $t = E p$

12:        $t_{new} = t$

13:    end while

14:    $P_j = (P_{j-1}, p)$

15:    $T_j = (T_{j-1}, t)$

16:    $E = E - t p^T$

17: end for

18: return $P_k$, $T_k$
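The NIPALS iteration above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `nipals`, the `max_iter` safeguard and the choice to restart each component from the residual column with the largest norm are assumptions added here (the pseudocode only asks for a non-zero column of X and initialises t once).

```python
import numpy as np

def nipals(X, k, eps=1e-6, max_iter=500):
    """Return loadings P (p x k) and scores T (n x k) for the first k PCs."""
    E = np.asarray(X, dtype=float).copy()      # residual matrix, E = X at the start
    P = np.empty((E.shape[1], 0))
    T = np.empty((E.shape[0], 0))
    for _ in range(k):
        # Assumption: restart from the residual column with the largest norm.
        t = E[:, np.argmax(np.linalg.norm(E, axis=0))].copy()
        for _ in range(max_iter):              # safeguard not in the pseudocode
            t_old = t
            p = E.T @ t / (t @ t)              # regress the columns of E on t
            p = p / np.sqrt(p @ p)             # normalise the loading vector
            t = E @ p                          # update the score vector
            if np.linalg.norm(t_old - t) < eps:
                break
        P = np.column_stack([P, p])
        T = np.column_stack([T, t])
        E = E - np.outer(t, p)                 # deflate: remove the extracted PC
    return P, T
```

Up to the sign of each column, the returned loadings should agree with the leading right singular vectors of X when the corresponding singular values are well separated.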

4.4.1 Sparse PCA

Sparse Principal Components Analysis (Sparse PCA [108]) was developed as a more interpretable alternative to PCA. In Sparse PCA the loading matrix P is sparse (i.e. only a few elements are nonzero). The score matrix T is then obtained as a linear combination of only a few of the original variables. The relevant variables are the ones with nonzero coefficients in the loading matrix. This allows dimensionality reduction and feature selection to be performed at the same time. The next theorem shows that the PCA loadings can be obtained as the solution of a least squares problem.

Theorem 4.4.1. Let $X = U D V^T$ be the singular value decomposition of $X$ and $Y_i = U_i D_i$ the projection of $X$ on $PC_i$. Given the solution of the ridge regression problem

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \| Y_i - X\beta \|_2^2 + \lambda \| \beta \|_2^2 \qquad (4.34)$$

define

$$\hat{v} = \frac{\hat{\beta}_{ridge}}{\| \hat{\beta}_{ridge} \|_2} \qquad (4.35)$$

Then $\hat{v} = V_i$.

The same theorem also holds without the ridge penalty, but λ ≠ 0 allows us to handle high-dimensional data (p > n), where a unique solution to the least squares problem does not exist [108].
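The theorem can be checked numerically with a few lines of NumPy. The sketch below solves the ridge problem in closed form and rescales the solution to unit Euclidean length. The helper name `ridge_direction` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def ridge_direction(X, i, lam):
    """Return (v_hat, V_i): the normalised ridge solution and the i-th right singular vector."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Y = U[:, i] * d[i]                          # Y_i = U_i D_i, the i-th score vector
    p = X.shape[1]
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T Y_i
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    return beta / np.linalg.norm(beta), Vt[i]

rng = np.random.default_rng(0)
X_tall = rng.standard_normal((50, 6))           # n > p
X_wide = rng.standard_normal((5, 12))           # p > n, needs lam > 0
v_hat, v_true = ridge_direction(X_tall, 0, lam=2.5)
w_hat, w_true = ridge_direction(X_wide, 0, lam=2.5)
```

In both cases the ridge solution is a positive multiple of V_i, so `v_hat` and `v_true` (and `w_hat` and `w_true`) should coincide up to floating-point error.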

The ridge regression problem in equation 4.34 can be generalized to yield the elastic net problem.

$$\hat{\beta} = \arg\min_{\beta} \| Y_i - X\beta \|_2^2 + \lambda \| \beta \|_2^2 + \lambda_1 \| \beta \|_1 \qquad (4.36)$$
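One way the problem in (4.36) might be solved numerically is cyclic coordinate descent with soft-thresholding. The sketch below is illustrative only: the function names `soft_threshold` and `elastic_net_cd`, the fixed number of sweeps and the synthetic data are assumptions, and this is not the procedure used in [108].

```python
import numpy as np

def soft_threshold(z, a):
    """Shrink z towards zero by a, setting values with |z| <= a to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def elastic_net_cd(X, y, lam, lam1, n_sweeps=200):
    """Minimise ||y - X b||_2^2 + lam ||b||_2^2 + lam1 ||b||_1 by coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)               # X_j^T X_j for every column j
    r = np.array(y, dtype=float)                # residual y - X beta (beta = 0 initially)
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]              # remove coordinate j from the fit
            # Exact minimiser of the objective in beta_j with the others fixed
            beta[j] = soft_threshold(X[:, j] @ r, lam1 / 2.0) / (col_sq[j] + lam)
            r -= X[:, j] * beta[j]              # put the updated coordinate back
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
X -= X.mean(axis=0)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, 0] * d[0]                              # scores of the first PC
beta_hat = elastic_net_cd(X, Y, lam=1.0, lam1=20.0)
```

With `lam1 = 0` the update reduces to the ridge solution of (4.34), while a sufficiently large `lam1` drives every coefficient to zero.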

The L1 penalty makes $\hat{\beta}$ a sparse vector. Indeed, this is essentially the lasso problem discussed in Chapter 3. The vector $\hat{V}_i = \hat{\beta} / \| \hat{\beta} \|_1$ is the sparse approximation of $V_i$. For consistency with the PCA notation we define

$$P_\lambda = \hat{V} \qquad (4.37)$$

where λ is used to signify the dependency of P on the penalty weighting λ. The main PCA equation is then updated as:

where Tλ is the projection of X in the space defined by Pλ. Equation 4.19 does not hold in this case because Pλ is not orthogonal. Instead, the lower-dimensional reconstruction of X, corresponding to the projection of X onto the space spanned by Pλ, can be obtained by linear regression as explained in Section 4.3.
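The effect of non-orthogonal loadings on the reconstruction can be illustrated with a small sketch. Section 4.3 is not reproduced here, so the code takes one plausible reading of "projection by linear regression": each row of X is regressed on the columns of Pλ by least squares. The helper name `reconstruct` and the loading matrix `P` are hypothetical values chosen for illustration.

```python
import numpy as np

def reconstruct(X, P):
    """Project the rows of X onto the column space of P by least squares."""
    # Solve P @ C ~= X.T for C; this works even when P.T @ P is not the identity.
    C, *_ = np.linalg.lstsq(P, X.T, rcond=None)
    return (P @ C).T

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
X -= X.mean(axis=0)

# Hypothetical sparse loadings with unit-length but non-orthogonal columns.
P = np.array([[0.8, 0.0],
              [0.6, 0.6],
              [0.0, 0.8],
              [0.0, 0.0],
              [0.0, 0.0]])

X_naive = X @ P @ P.T          # the orthogonal-PCA formula, valid only if P.T @ P = I
X_hat = reconstruct(X, P)      # regression-based projection
err_naive = np.linalg.norm(X - X_naive)
err_hat = np.linalg.norm(X - X_hat)
```

Because `X_hat` is the least squares projection onto the span of `P`, `err_hat` can never exceed `err_naive`, and the two reconstructions coincide when the columns of `P` are orthonormal.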

4.4.1.1 Challenges with Variable Selection with SPCA

SPCA was initially motivated by the need for a more interpretable approximation of the PCA model and is commonly used as a benchmark for unsupervised feature selection methodologies. The main issue with using SPCA for feature selection is the difficulty of determining which features to select. This problem was already reported in [117] and is illustrated in the following example.

Example 4.4.1 (SPCA on Glass data). Glass is a popular dataset often used in machine learning; its reference paper is [128]. The data has a total of p = 9 variables. Three components are computed with SPCA and the result is reported in Table 4.1. From the table it is difficult to understand which variables should be selected. Variable x5, for example, has a nonzero loading in the first component, but variable x6 has a much larger weight on the third one. It is difficult to decide whether variable x5 should be preferred to variable x6.

        PC1       PC2       PC3
x1     -9.164     0.0       0.0
x2      0.0       0.0       0.0
x3      0.0       6.872     0.0
x4      0.0      -6.724     0.0
x5      2.585     0.0       0.0
x6      0.0       0.0      -9.595
x7     -8.343     0.0       0.0
x8      0.0      -6.86      0.0
x9      0.0       0.0       0.0

Table 4.1: The first three PCs obtained with SPCA on the glass data.
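To make the ambiguity concrete, the loadings of Table 4.1 can be inspected programmatically. The sketch below lists the variables with nonzero loadings per component and then ranks the variables by their largest absolute loading. That ranking is only one possible heuristic introduced here for illustration, not a rule proposed in the text.

```python
import numpy as np

variables = [f"x{i}" for i in range(1, 10)]
# Loadings copied from Table 4.1 (rows x1..x9, columns PC1..PC3).
P = np.array([
    [-9.164,  0.0,    0.0],
    [ 0.0,    0.0,    0.0],
    [ 0.0,    6.872,  0.0],
    [ 0.0,   -6.724,  0.0],
    [ 2.585,  0.0,    0.0],
    [ 0.0,    0.0,   -9.595],
    [-8.343,  0.0,    0.0],
    [ 0.0,   -6.86,   0.0],
    [ 0.0,    0.0,    0.0],
])

# Variables with a nonzero loading in each component.
support = {f"PC{j + 1}": [v for v, w in zip(variables, P[:, j]) if w != 0]
           for j in range(P.shape[1])}
# Variables that no component uses.
never_used = [v for v, row in zip(variables, P) if not row.any()]
# Heuristic ranking by largest absolute loading across components.
order = np.argsort(-np.abs(P).max(axis=1))
ranking = [variables[i] for i in order if P[i].any()]
```

Seven of the nine variables appear in at least one component, and the heuristic ranking places x6 first and x5 last, although x5 belongs to the first component. This is exactly the kind of conflict that makes the selection hard to justify.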

In addition, SPCA is not efficient for unsupervised variable selection because, in certain situations, it assigns similar loadings to similar variables. This follows from the fact that the lasso can exhibit the grouping effect, although this is not guaranteed [65]. This behaviour is illustrated in Example 4.7.3 later in the chapter.

4.4.1.2 SPCA Algorithms

Different algorithms have been developed to compute the sparse PCA decomposition efficiently. In the original paper [108], SPCA is computed with an iterative algorithm that needs to compute the SVD of the data several times and is consequently inefficient for high-dimensional datasets. An alternative algorithm, denoted sPCA-rSVD, is proposed in [110] and described in Pseudocode 4.4.1. In step 2 of the original version of the algorithm, the best rank-one approximation of X is obtained with the SVD. Since only one component is required, it is more efficient to obtain it with the NIPALS algorithm.

Input: Matrix $X$, number of components $k$, penalty value $\lambda_1$ and tolerance $\tau$.

1: for $i = 1 : k$ do

2:     Compute $u_{new}$, $v_{new}$ such that $u_{new} v_{new}^T$ is the best rank-one approximation of $X$

3:     $u_{old} = 0$, $v_{old} = 0$

4:     while $\| v_{new} - v_{old} \| \geq \tau \| v_{old} \|$ or $\| u_{new} - u_{old} \| \geq \tau \| u_{old} \|$ do

5:         $v_{old} = v_{new}$, $u_{old} = u_{new}$

6:         $v_{new} = \mathrm{sign}(X^T u_{old}) (|X^T u_{old}| - \lambda_1)_+$

7:         $d = \| X v_{new} \|_2$

8:         $u_{new} = X v_{new} / \| X v_{new} \|_2$

9:     end while

10:    $X = X - \Phi(u_{new}) X$

11:    save $v_{new}$

12: end for

13: return $T$, $P$

Pseudocode 4.4.1: sPCA-rSVD
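A minimal NumPy sketch of Pseudocode 4.4.1 is given below. Several details are assumptions made here because the pseudocode leaves them open: Φ(u) is taken to be the orthogonal projector u uᵀ, each saved v is rescaled to unit length, the scores are computed as T = X P on the original data, and a `max_iter` safeguard is added. The rank-one start uses a plain SVD for brevity, although a single NIPALS component would serve the same purpose.

```python
import numpy as np

def spca_rsvd(X, k, lam1, tol=1e-6, max_iter=500):
    """Sketch of sPCA-rSVD; returns scores T (n x k) and sparse loadings P (p x k)."""
    X0 = np.asarray(X, dtype=float)
    Xr = X0.copy()                              # working copy, deflated per component
    P = np.zeros((X0.shape[1], k))
    for i in range(k):
        # Step 2: best rank-one approximation u v^T of the current matrix.
        U, d, Vt = np.linalg.svd(Xr, full_matrices=False)
        u, v = U[:, 0], d[0] * Vt[0]
        for _ in range(max_iter):
            u_old, v_old = u, v
            z = Xr.T @ u_old
            v = np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0)   # step 6
            Xv = Xr @ v
            norm_Xv = np.linalg.norm(Xv)                         # step 7
            if norm_Xv == 0.0:                  # the penalty removed every variable
                break
            u = Xv / norm_Xv                                     # step 8
            if (np.linalg.norm(v - v_old) < tol * np.linalg.norm(v_old)
                    and np.linalg.norm(u - u_old) < tol * np.linalg.norm(u_old)):
                break
        norm_v = np.linalg.norm(v)
        P[:, i] = v / norm_v if norm_v > 0 else v   # assumption: unit-length loadings
        Xr = Xr - np.outer(u, u @ Xr)               # assumption: Phi(u) = u u^T
    T = X0 @ P                                      # assumption: scores on the original data
    return T, P
```

With `lam1 = 0` the inner loop is a fixed point of the rank-one SVD, so the loadings reduce to the ordinary PCA loadings. Increasing `lam1` zeroes out the variables whose entries in Xᵀu fall below the penalty.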