• No results found

Sparsity in Machine Learning: An Information Selecting Perspective

N/A
N/A
Protected

Academic year: 2021

Share "Sparsity in Machine Learning: An Information Selecting Perspective"

Copied!
106
0
0

Loading.... (view fulltext now)

Full text

(1)

University of Massachusetts Amherst University of Massachusetts Amherst

ScholarWorks@UMass Amherst

ScholarWorks@UMass Amherst

Doctoral Dissertations Dissertations and Theses

July 2019

Sparsity in Machine Learning: An Information Selecting

Sparsity in Machine Learning: An Information Selecting

Perspective

Perspective

Siwei Feng

Electrical and Computer Engineering

Follow this and additional works at: https://scholarworks.umass.edu/dissertations_2

Part of the Other Computer Engineering Commons, and the Signal Processing Commons

Recommended Citation Recommended Citation

Feng, Siwei, "Sparsity in Machine Learning: An Information Selecting Perspective" (2019). Doctoral Dissertations. 1550.

https://scholarworks.umass.edu/dissertations_2/1550

This Open Access Dissertation is brought to you for free and open access by the Dissertations and Theses at ScholarWorks@UMass Amherst. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of ScholarWorks@UMass Amherst. For more information, please contact

(2)

SPARSITY IN MACHINE LEARNING:

AN INFORMATION SELECTING PERSPECTIVE

A Dissertation Outline Presented by

SIWEI FENG

Submitted to the Graduate School of the

University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY May 2019

(3)

c

Copyright by Siwei Feng 2019

(4)

SPARSITY IN MACHINE LEARNING:

AN INFORMATION SELECTING PERSPECTIVE

A Dissertation Outline Presented by

SIWEI FENG

Approved as to style and content by:

Marco F. Duarte, Chair

Patrick A. Kelly, Member

Dennis L. Goeckel, Member

Jie Xiong, Member

Robert Jackson, Department Chair Electrical and Computer Engineering

(5)

ABSTRACT

SPARSITY IN MACHINE LEARNING:

AN INFORMATION SELECTING PERSPECTIVE

MAY 2019

SIWEI FENG

B.Sc., SOOCHOW UNIVERSITY

M.Sc., UNIVERSITY OF MASSACHUSETTS AMHERST Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Marco F. Duarte

Today we are living in a world awash with data. Large volumes of data are ac-quired, analyzed and applied to tasks through machine learning algorithms in nearly every area of science, business, and industry. For example, medical scientists analyze the gene expression data from a single specimen to learn the underlying causes of disease (e.g. cancer) and choose the best treatment; retailers can know more about customers’ shopping habits from retail data to adjust their business strategies to bet-ter appeal to customers; suppliers can enhance supply chain success through supply chain systems built on knowledge sharing. However, it is also reasonable to doubt whether all the genes make contributions to a disease; whether all the data obtained from existing customers can be applied to a new customer; whether all shared knowl-edge in the supply network is useful to a specific supply scenario. Therefore, it is crucial to sort through the massive information provided by data and keep what we

(6)

really need. This process is referred to as information selection, which keeps the information that helps improve the performance of corresponding machine learning tasks and discards information that is useless or even harmful to task performance. Sparse learning is a powerful tool to achieve information selection. In this thesis, we apply sparse learning to two major areas in machine learning – feature selection and transfer learning.

Feature selection is a dimensionality reduction technique that selects a subset of representative features. Recently, feature selection combined with sparse learning has attracted significant attention due to its outstanding performance compared with traditional feature selection methods that ignore correlation between features. How-ever, they are restricted by design to linear data transformations, a potential draw-back given that the underlying correlation structures of data are often non-linear. To leverage more sophisticated embedding than the linear model assumed by sparse learning, we propose an autoencoder-based unsupervised feature selection approach that leverages a single-layer autoencoder for a joint framework of feature selection and manifold learning. Additionally, we include spectral graph analysis on the projected data into the learning process to achieve local data geometry preservation from the original data space to the low-dimensional feature space.

Transfer learning describes a set of methods that aim at transferring knowledge from related domains to alleviate the problems caused by limited/no labeled training data in machine learnig tasks. Many transfer learning techniques have been pro-posed to deal with different application scenarios. However, due to the differences in data distribution, feature space, label space, etc., between source domain and target domain, it is necessary to select and only transfer relevant information from source domain to improve the performance of target learner. Otherwise, the target learner can be negatively impacted by the weak-related knowledge from source domain, which is referred to as negative transfer. In this thesis, we focus on two transfer learning

(7)

scenarios for which limited labeled training data are available in target domain. In the first scenario, no label information is avaible in source data. In the second scenario, large amounts of labeled source data are available, but there is no overlap between the source and target label spaces. The corresponding transfer learning technique to

the former case is called self-taught learning, while that for the latter case is called

few-shot learning. We apply self-taught learning to visual, textal, and audio data. We also apply few-shot learning to wearable sensor based human activity data. For both cases, we propose a metric for the relevance between a target sample/class and a source sample/class, and then extract information from the related samples/classes for knowledge transfer to perform information selection so that negative transfer caused by weakly related source information can be alleviated. Experimental results show that transfer learning can provide better performance with information selection.

(8)

TABLE OF CONTENTS

Page ABSTRACT. . . .iv LIST OF TABLES. . . x LIST OF FIGURES. . . .xi CHAPTER 1. INTRODUCTION . . . 1

1.1 Unsupervised Feature Selection . . . 1

1.2 Transfer Learning . . . 4 1.2.1 Self-Taught Learning . . . 5 1.2.2 Few-Shot Learning . . . 7 1.3 Outline . . . 8 2. BACKGROUND . . . 9 2.1 Notations . . . 9 2.2 Single-Layer Autoencoder . . . 10

2.3 Long-Short Term Memory Network . . . 11

2.4 Sparse Learning-Based Unsupervised Feature Selection . . . 12

2.5 Self-Taught Learning . . . 16

2.6 Human Activity Recognition . . . 18

2.7 Few-Shot Learning . . . 19

3. UNSUPERVISED FEATURE SELECTION. . . 21

3.1 Introduction . . . 21

3.2 Proposed Method . . . 22

3.2.1 Objective Function . . . 22

(9)

3.3 Experiments . . . 27

3.3.1 Evaluation Metric . . . 27

3.3.2 Experimental Setup . . . 29

3.3.3 Parameter Sensitivity . . . 30

3.3.4 Feature Selection Illustration . . . 34

3.3.5 Clustering Illustration . . . 34

3.3.6 Performance Comparison . . . 35

4. SELF-TAUGHT LEARNING . . . 40

4.1 Introduction . . . 40

4.2 Proposed Method . . . 41

4.2.1 Knowledge Transfer and Relevance Measure . . . 41

4.2.1.1 Objective Function . . . 41

4.2.1.2 Optimization . . . 45

4.2.2 Classifier Training . . . 48

4.2.2.1 Source Domain Sample Reweighting . . . 49

4.2.2.2 Pseudo-Labeling . . . 49 4.2.2.3 Classifier Training . . . 50 4.3 Experiments . . . 51 4.3.1 Dataset Preparation . . . 51 4.3.2 Experimental Setup . . . 54 4.3.3 Parameter Sensitivity . . . 56 4.3.4 Performance Comparison . . . 58 4.3.5 Discussion . . . 62

4.3.5.1 The Effect of Local Data Structure Preservation . . . 62

4.3.5.2 The Effect of Source Sample Selection . . . 63

5. FEW-SHOT LEARNING . . . 66 5.1 Introduction . . . 66 5.2 Proposed Method . . . 68 5.2.1 Basic Framework . . . 68 5.2.2 Implementation . . . 71 5.3 Experiments . . . 72 5.3.1 Dataset Information . . . 72

(10)

5.3.2 Source/Target Split . . . 73

5.3.3 Baselines . . . 74

5.3.4 Performance Comparison . . . 75

6. CONCLUSION AND FUTURE WORK . . . 82

6.1 Conclusion . . . 82

6.2 Future Work . . . 83

(11)

LIST OF TABLES

Table Page

3.1 Details of datasets used. . . 28 4.1 Details of datasets used in our experiment. . . 53

4.2 Performance stability of GASTL in classification with respect to

balance parameters λ and γ. Classification accuracy mean (%)

and standard deviation (%) are presented. . . 58

4.3 Performance of GASTL and competing feature selection algorithms

in classification on Caltech101 with SIFTBOW as feature.

Classification accuracy (%) is used as the evaluation metric. . . 59

4.4 Performance of GASTL and competing feature selection algorithms

in classification on Caltech101 with VGG19 as feature.

Classification accuracy (%) is used as the evaluation metric. . . 60

4.5 Performance of GASTL and competing feature selection algorithms

in classification on IMDB. Classification accuracy (%) is used as the evaluation metric. TS = Training sample number in each

target class. . . 61

4.6 Performance of GASTL and competing feature selection algorithms

in classification on Twitter. Classification accuracy (%) is used as the evaluation metric. TS = Training sample number in each

target class. . . 62

4.7 Performance of GASTL and competing feature selection algorithms

in classification on ESC-50. Classification accuracy (%) is used as the evaluation metric. . . 63 5.1 Source/target split for activities . . . 74 5.2 Network Structure for Both Datasets . . . 75

(12)

LIST OF FIGURES

Figure Page

3.1 Performance variation of the GAFS w.r.t. dimensionality of subspace

m and the percentage of features selected p(%). . . 31

3.2 Performance variation of the GAFS w.r.t. balance parametersλ and

γ. . . 33

3.3 Feature selection illustration on Yale. Each row corresponds to a

sample human face image and each column refers to percentages of features selected

p∈ {10%,20%,30%,40%,50%,60%,70%,80%,100%} from left to

right. . . 34

3.4 Clustering illustration of data embeddings on hidden layers when

hidden layer size is 2. . . 35

3.5 Classification accuracy w.r.t different unsupervised feature selection

algorithms and the percentage of features selectionp (%) . . . 37

3.6 Clustering accuracy w.r.t different unsupervised feature selection

algorithms and the percentage of features selectionp (%) . . . 38

3.7 Normalized mutual information w.r.t different unsupervised feature

selection algorithms and the percentage of features selection p

(%) . . . 39

4.1 Performance of GASTL in classification as a function of the hidden

layer size m for varying sizes of the autoencoder hidden layer m.

Classification accuracy (%) is used as the evaluation metric. . . 57

4.2 Comparison of GASTL performance whenγ = 0 and optimal GASTL

performance. Classification accuracy (%) is used as the evaluation metric. Optimal values for γ are shown. . . 64

(13)

4.3 Comparison of GASTL performance when all source samples are used for classifier training and optimal GASTL performance.

Classification accuracy (%) is used as the evaluation metric.

Optimal values forp are shown. . . 65 5.1 Basic Framework . . . 68 5.2 Network Structure . . . 69

5.3 Comparison of CDHAR performance with OPP dataset. Source data

and target data are generated from the same participant.

Subfigures in the left column show results when each target class provides one training sample for target network training.

Subfigures in the right column shows results when each target class provides five training samples for target network training.

Classification accuracy (%) is used as the evaluation metric. . . 78

5.4 Comparison of CDHAR performance with OPP dataset. Source data

and target data are generated from the different participants. Subfigures in the left column show results when each target class provides one training sample for target network training.

Subfigures in the right column shows results when each target class provides five training samples for target network training.

Classification accuracy (%) is used as the evaluation metric. . . 79

5.5 Comparison of CDHAR performance with PAMAP2 dataset. Source

data and target data are generated from people from the same group. Subfigures in the left column show results when each target class provides one training sample for target network training. Subfigures in the right column shows results when each target class provides five training samples for target network training.

Classification accuracy (%) is used as the evaluation metric. . . 80

5.6 Comparison of CDHAR performance with PAMAP2 dataset. Source

data and target data are generated from people from the different groups. Subfigures in the left column show results when each target class provides one training sample for target network training. Subfigures in the right column shows results when each target class provides five training samples for target network training. Classification accuracy (%) is used as the evaluation

(14)

CHAPTER 1

INTRODUCTION

Machine learning practitioners are always yearning for training data with large volume and variety since it is believed that those can help train models with more useful information characterized and thus leading to better task performance. How-ever, in practice a large dataset is not always a guarantee for good machine learning task performance since there may exist samples and/or features that are redundant or irrelevant to specific machine learning tasks. In other words, models learned with these data may contain useless or even harmful information that negatively affect task performance. To alleviate the negative influence brought by such “bad” information, feature/sample selection is a necessary preprocessing step. Within the significant literatures on feature/sample selection, sparse learning based approaches play an im-portant role due to the selecting nature of sparse learning. In this thesis, sparse learning in unsupervised feature selection and transfer learning is investigated.

1.1

Unsupervised Feature Selection

In recent years, high-dimensional data can be found in many areas such as com-puter vision, pattern recognition, data mining, etc. Hign dimensionality enables data to include more information. However, learning high-dimensional data often suffer from several issues. For example, with a fixed number of training data, a large data dimensionality can cause the so-called Hughes phenomenon, i.e., a reduction in the generalization of the learned models due to overfitting during the training procedure compared with lower dimensional data [1]. Moreover, high-dimensional data tend to

(15)

include significant redundancy in adjacent features, or even noise, which leads to large amounts of useless or even harmful information being processed, stored, and trans-mitted [2, 3]. All these issues present challenges to many conventional data analysis problems. Moreover, several papers in the literature have shown that the intrinsic dimensionality of high-dimensional data is actually small [4–6]. Thus, dimensionality reduction is a popular preprocessing step for high-dimensional data analysis, which decreases time for data processing and also improves generalization of learned models. Feature selection [7–12] is a set of frequently used dimensionality reduction

ap-proaches that aim at selecting a subset of input dimensions1. Feature selection has

the advantage of preserving the same feature space as that of raw data. Feature selec-tion methods can be categorized into groups based on different criteria summarized below; refer to [13] for a detailed survey on feature selection.

• Label Availability. Based on the availability of label information, feature selection algorithms can be classified into supervised [7–9], semi-supervised [10– 12], and unsupervised [14–30] methods. Since labeled data are usually expensive and time-consuming to acquire [31, 32], unsupervised feature selection has been gaining more and more attention recently and is the subject of our focus in this work.

• Search Strategy. In terms of selection strategies, feature selection meth-ods can be categorized into filter, wrapper, and embedded methmeth-ods. Wrapper methods [33, 34] are seldom used in practice since they rely on a repetition of feature subset searching and selected feature subset evaluation until some stopping criteria or some desired performance are reached, which requires an exponential search space and thus is computationally prohibitive when feature dimensionality is high. Filter feature selection methods, e.g. Laplacian score

(16)

(LapScore) [14] and SPEC [15], assign a score (measuring task relevance, re-dundancy, etc.) to each feature and select those with the best scores. Though convenient to computation, these methods are often tailored specifically for a given task and may not provide an appropriate match to the specific application of interest. Embedded methods combine feature selection and model learning and provide a compromise between the two earlier extremes, as they are more efficient than wrapper methods and more task-specific than filter methods. We focus on embedded feature selection methods.

In recent years, feature selection algorithms aiming at selecting features that pre-serve intrinsic data structure (such as subspace or manifold structure) [16–30] have attracted significant attention due to their good performance and interpretability [13]. In these methods, data are linearly projected onto new spaces through a transforma-tion matrix, with fitting errors being minimized along with some sparse regulariza-tion terms. Feature importance is usually scored using the norms of corresponding rows/columns in the transformation matrix. In some methods [20–25, 28–30], the local data geometric structure, which is usually characterized by nearest neighbor graphs, is also preserved in the low-dimensional projection space. However, one basic assumption of these methods is that the data to be processed lie in or near a com-pletely linear low-dimensional manifold. However, this is not always true in practice, in particular with more sophisticated data.

We propose a novel algorithm for graph and autoencoder-based feature selection (GAFS). In this method, we integrate three objective functions into a single optimiza-tion framework: (i) we use a single-layer autoencoder to reconstruct the input data;

(ii) we use an `2,1-norm penalty on the columns of the weight matrix connecting the

autoencoder’s input layer and hidden layer to provide feature selection; (iii) we pre-serve the local geometric structure of the data through to the corresponding hidden layer activations. Extensive experiments are conducted on image data, audio data,

(17)

text data, and biological data. Many experimental results are provided to demon-strate the outstanding performance achieved by the proposed method compared with other state-of-the-art unsupervised feature selection algorithms.

The key contributions are highlighted as follows.

• We propose a novel unsupervised feature selection framework which is based on

an autoencoder and graph data regularization. By using this framework, the information of the underlying data subspace can be leveraged, which loosens the assumption of linear manifold in many relevant techniques.

• We present an efficient solver for the optimization problem underlying the

pro-posed unsupervised feature selection scheme. Our approach relies on an iterative scheme based on the gradient descent of the proposed objective function.

• We provide multiple numerical experiments that showcase the advantages of

the flexible models used in our feature selection approach with respect to the state-of-the-art approaches from the literature.

1.2

Transfer Learning

Supervised learning has excelled in many machine learning tasks such as classifi-cation [3, 35] and regression [36, 37]. However, the success of a supervised learning algorithm requires large-scale labeled training datasets and that both training and

testing data sharing the same label and feature space.2 These conditions limit the

applications of supervised learning methods in practical scenarios since it is expensive to collect eligible training data [38, 39].

Several techniques have been proposed to tackle the limitations of supervised learning methods. Semi-supervised learning [40–42] algorithms use both labeled and

(18)

unlabeled data to improve performance when labeled training data are limited. How-ever, many semi-supervised learning algorithms assume that unlabeled data and la-beled data have the same distribution [40, 41] or class labels [42]. The success of semi-supervised learning highly depends on the validity of these assumptions. How-ever, it is still difficult to gather unlabeled data which satisfy these preconditions.

In order to further loosen the restrictions on training data, many transfer learning approaches [43, 44] have been proposed. Transfer learning methods use the knowledge obtained from a source domain to improve the performance on target domain tasks. We mainly focus on two transfer learning techniques: self-taught learning and few-shot learning.

1.2.1 Self-Taught Learning

Self-taught learning [45–53] is the type of transfer learning techniques most simi-lar to semi-supervised learning, which also employs unlabeled data with the attempt to improve supervised learning performance when labeled training data are limited. However, compared with semi-supervised learning, self-taught learning methods have fewer restrictions on unlabeled data, as they allow the label spaces and marginal probability distributions of unlabeled and labeled data to be different. In self-taught learning, unlabeled data are used as source from which the knowledge learned is ap-plied to tasks performed on labeled target data. Such a loose restriction on unlabeled data significantly simplifies learning due to the huge volume of unlabeled data we can access. However, the easily obtained unlabeled data inevitably contain samples with weak relation to the labeled training data, which may even harm the supervised learning performance if we treat them equally as other unlabeled samples during knowledge transfer. This is known as negative transfer. Therefore, it is necessary to select samples that are related to the labeled data to reduce the impact caused by negative transfer.

(19)

We propose a novel algorithm for self-taught learning with unlabeled source data which are related to labeled target data to be selected. The algorithm leverages a

linear mapping, a k-nearest neighbor (kNN) graph and a single-layer autoencoder to

obtain a metric for cross domain sample relevance. We refer to this method as graph and autoencoder-based self-taught learning (GASTL). The framework of GASTL in-cludes two modules: a source sample re-weighting module and a classifier training module. In the first module, we assign each unlabeled source sample a weight that

indicates its relevance to labeled target samples.3 In the second module, source

sam-ples with large weights are selected to combine a training set with target data to train a classifier. Each selected source sample is assigned a pseudo-label from the target domain label space to be used during classifier training. The weights of source samples are also used during classifier training. The trained classifier is then used to predict labels of unseen target samples.

The key contributions are as follows:

• We propose a novel metric for the relevance of each source sample to the target

domain in the scenario of self-taught learning based on an autoencoder and graph data regularization. To the best of our knowledge, we are the first to measure source and target sample relevance for self-taught learning problems.

• We propose a novel classifier training scheme with both selected source samples

and target samples as training dataset with the relevance of each source sample to target domain being considered. We are not aware of existing self-taught learning approaches that integrate cross domain sample relevance into classifier training.

3The setting of self-taught learning requires source samples to be unlabeled and target samples

to be labeled. Therefore in the sequel we do not specify the availability of label information for both source and target samples.

(20)

• We present an efficient solver for the knowledge transfer optimization problem described above that relies on an iterative scheme based on the gradient descent of the proposed objective function. This solver shows advantages when the model complexity is large.

• Multiple experimental results are provided to demonstrate the performance

im-provements in terms of classification accuracy and insensitivity to parameters achieved by the proposed method compared with state-of-the-art self-taught learning methods and other relevant techniques.

1.2.2 Few-Shot Learning

Few-shot learning [54–56] assumes availability of labeled training data in the tar-get domain as in self-taught learning. Unlike self-taught learning, few-shot learning problems require large amounts of labeled source data whose label spaces have no overlap with that of target domain. Negative transfer also needs to be alleviated in this scenario.

We do not perform few-shot learning on multiple types of data as what we do with self-taught learning. Instead, we focus on human activity recognition with data obtained from wearable sensors such as a smartphone or wristband, which predicts activity type (e.g. walking, swimming, etc.) from the sensor outputs. Human activity recognition has been applied in many tasks such as sleep state detection [57] and smart home sensing [58].

In order to have a better discrimination of different types of activities, models need to be trained with large amounts of data from a diversity of sources. Unfortunately, in practice we do not always have enough data for each activity type. For example, a wearable health care system needs to be retrained if the previously collected activity data are not representative of the new activity type, and it is not likely for users to provide large amount of data for a single activity type. However, there may exist

(21)

relevance between new and existing activity data. Therefore, relevant knowledge from exisiting activity data can be used in model training with new data.

The algorithm we propose to perform few-shot learning on human activity recog-nition leverages a deep neural network and a linear mapping to obtain a metric for cross domain class relevance. The framework includes a feature extraction module and source class weighting module. In the first module, we use the a neural network to extract features for both source and target samples. In the second module, cross domain class similarities are measured with the features obtained in the first module and parameters of relevant source classes are used for target domain classifier train-ing. The trained classifier is then used to predict the activity type of unseen target samples.

1.3

Outline

The thesis is organized as follows.

Chapter 2 provides notations for this thesis, the concept of single-layer autoen-coder which plays an important role in algorithms described in Chapter 3 and Chapter 4, and a brief literature review of sparse learning-based unsupervised feature selection, self-taught learning, and few-shot learning. Chapter 3 presents our proposed unsu-pervised feature selection approach. Chapter 4 presents our proposed self-taught learning approach. Chap 5 presents our proposed few-shot learning approach for hu-man activity recognition. Finally, we conclude with a summary of our findings and a discussion of ongoing work in Chapter 6.

(22)

CHAPTER 2

BACKGROUND

2.1

Notations

Vectors are denoted by bold lowercase letters while matrices are denoted by bold

uppercase letters. The superscriptT of a matrix denotes the transposition operation.

For a matrix A, A(q) denotes the qth column and A

(p) denotes the pth row, while

A(p,q) denotes the entry at the pth row and qth column. The `

r,p-norm for a matrix

W∈Ra×b is denoted as kWkr,p =   a X i=1 b X j=1 |W(i,j)|r !p/r  1/p . (2.1)

The trace of a matrix L∈Ra×a is defined as Tr(L) =Pa

i=1L(i,i). We use1 and 0 to

denote an all-ones and all-zeros matrix or vector of the appropriate size, respectively.

We useX= [X(1),X(2),· · · ,X(n)]∈Rd×nto denote sample sets, whereX(i) ∈Rdis

theith sample in Xfori= 1,2,· · · , n, and where dandn denote data dimensionality

and number of samples in X, respectively.

For notations in transfer learning, we use D to denote a domain andT for a task.

A domain D consists of a feature space X and a marginal probability distribution

P(X) over a sample set X. A task T consists of a label space Y and an objective

predictive function f(X,Y) to predict the corresponding labelsY of a sample set X.

We use Dsrc = {Xsrc, P(Xsrc)} and Tsrc = {Ysrc, f(Xsrc,Ysrc)} to denote the source

domain and task, and use Dtrg ={Xtrg, P(Xtrg)} and Ttrg ={Ytrg, f(Xtrg,Ytrg)} for

(23)

2.2

Single-Layer Autoencoder

A single-layer autoencoder is an artificial neural network that aims to learn a

function h(x;Θ) ≈ x with a single hidden layer, where x ∈ Rd is the input data,

h(·) is a nonlinear function, and Θ is a set of parameters. To be more specific, the

workflow of an autoencoder contains two steps:

• Encoding: mapping the input data xto a compressed data representation y∈

Rm:

y=σ(W1x+b1), (2.2)

where W1 ∈ Rm×d is a weight matrix, b1 ∈ Rm is a bias vector, and σ(·) is

an elementary nonlinear activation function. Commonly used activation func-tions include the sigmoid function, the hyperbolic tangent function, the rectified linear unit, etc.

• Decoding: mapping the compressed data representation y to a vector in the

original data space X¯ ∈Rd:

¯

X=σ(W2y+b2), (2.3)

where W2 ∈ Rd×m and b2 ∈ Rd are the corresponding weight matrix and bias

vector, respectively.

The optimization problem brought by the autoencoder is to minimize the difference between the input data and the reconstructed/output data. To be more specific,

given a set of data X = [X(1),X(2),· · · ,X(n)], the parameters W

1, W2, b1, and b2

are adapted to minimize the reconstruction error Pn

i=1kX(i) − X¯(i)k22, where X¯(i)

(24)

the reconstruction error is by selecting the parameter values via the backpropagation

algorithm.1

The data reconstruction capability of the autoencoder makes it suitable to capture the essential information of the data while discarding information that is not useful or redundant.

2.3

Long-Short Term Memory Network

A long-short term memory (LSTM) network [60] is a type of recurrent neural network (RNN) which processes time series signals by taking as their input not just the current inputs but also what they have processed earlier in time. Each RNN contains a loop (repeating modules) inside the network structure that allows information to be passed from one step of the network to the next.

RNNs show their limitations when long-term dependencies are needed to be cap-tured. Consider the task of predicting the last word in the text “I come from Jiangsu, a province in China. The closest noun “province” suggests that the last word is proba-bly the name of a country, but if we want to know which country, we need information from further back. When the gap between the position of relevant information and the point where it is needed becomes large, RNNs show their incapabilities to connect in practice.

LSTMs are a special kind of RNN, which are famous for their capabilities to cap-ture long-term dependencies. Compared with the simple repeating modules of most RNNs, which sometimes only contans a single tanh layer, the repeating modules of LSTMs include more complicated interacting layers in their structures. The work-flow of LSTM can be briefly described as follows. The first step is to determine the importance of previous information, which is to decide a status between “completely

(25)

forget about this” and “completely keep this”. The next step is to decide what new information to store in the cell state and then replace the old state with a new state. Finally, we decide the information to output.

A stacked LSTM model [61] is a LSTM model with multiple amultiple hidden LSTM layers where each layer contains multiple memory cells. By stacking LSTM hidden layers, a LSTM model can be deeper that makes it capable of tackling more complex problems.

2.4

Sparse Learning-Based Unsupervised Feature Selection

Many unsupervised feature selection methods based on subspace structure preser-vation have been proposed in the past decades. For cases missing class labels, un-supervised feature selection methods select features that are representative of the underlying subspace structure of the data [16]. The basic idea is to use a transforma-tion matrix to project data to a new space and guide feature selectransforma-tion based on the sparsity of the transformation matrix [17]. To be more specific, the generic framework of these methods is based on the optimization

min

W L(Y,WX) +λR(W), (2.4)

where Y = [Y(1),Y(2),· · · ,Y(n)] ∈ Rm×n (m < d) is an embedding matrix in which

Y(i) Rm for i = 1,2,· · · , n denotes the representation of data point X(i) in the

obtained low-dimensional subspace. L(·) denotes a loss function, andR(·) denotes a

regularization function on the transformation matrix W∈Rm×d. The methods differ

in their choice of embedding Y and loss and regularization functions; some examples

are presented below.

Multi-cluster feature selection (MCFS) [18] and minimum redundancy spectral feature selection (MRSF) [19] are two long-standing and well-known subspace

(26)

of each data X is first learned based on spectral clustering. After that, all data points are regressed to the learned embedding through a transformation matrix W ∈ Rm×d = [W(1);W(2);· · · ;W(m)]. The loss function is set to the Frobenius

norm of the linear transformation error and the regularization function is set to the

`1,1 norm of the transformation matrix, which promotes sparsity. Thus, MCFS can

be formulated mathematically as the set of separate optimization problems

min W(q)kY (q)W(q)Xk2 2+λkW (q)k 1, (2.5)

where W(q) Rd and Y(q) Rn are the qth rows of W and Y, respectively, for

q= 1,2,· · · , m. A score for each feature is measured by the maximum absolute value of the corresponding row of the transformation matrix:

M CF S(q) = max

p=1,2,···,d|W

(p,q)|=kW(q)k∞, (2.6)

This score is then used in a filter-based feature selection scheme. MRSF is an

exten-sion of MCFS that changes the regularization function to an `2,1-norm that enforce

column sparsity on the transformation matrix. Ideally, the selected features should be representative enough to keep the loss value close to that obtained when using all

features. In order to achieve feature selection, we expect that W holds a sparsity

property with its columns, which means only a subset of the columns are nonzeros.

We use the`2-norm of a W column to measure the importance of the corresponding

feature, leading to an `2,1-norm regularization function. Furthermore, MRSF ranks

the importance of each feature according to the`2-norm of the corresponding column

of the transformation matrix. Both MCFS and MRSF are able to select features that best preserve the subspace structure of the data due to the application of spectral clustering. However, the performance of these two methods is often degraded by the separate nature of subspace learning and feature selection [29]. In order to address

(27)

this problem, many approaches on joint subspace learning and feature selection have been proposed. For example, in Gu et. al. [20] data are linearly projected to a low-dimensional subspace with a transformation matrix, and the local data geometric structure captured by a nearest neighbor graph is preserved in data embeddings on

low-dimensional subspace. In the meanwhile, an`2,1−norm is performed on

transfor-mation matrix to guide feature selection simultaneously. That is, subspace learning and feature selection are not two separate steps but combined into a single framework. Studies like [21–25] made further modifications to [20]: besides combining subspace learning and feature selection into a single framework, these methods also exploit the discriminative information of the data for unsupervised feature selection. For exam-ple, in unsupervised discriminative feature selection (UDFS) [21], data instances are

assumed to come from cclasses. UDFS uses local data geometric structure, which is

based on thek-nearest neighbor set of each data point, to incorporate local data

dis-criminative information into a feature selection framework. Like MCFS and MRFS,

UDFS also assumes the existence of a transformation matrix W ∈ Rm×c that maps

data to a low-dimensional space. The objective function of UDFS is

min

WTWTr(W

TMW) +αkWk

2,1, (2.7)

where M is an elaborate matrix that contains local data discriminative information;

see [21] for details. One drawback of these discriminative exploitation feature selection methods is that the feature selection performance relies on an accurate estimation of number of classes.

Instead of projecting data onto a low-dimensional subspace, some approaches consider combining unsupervised feature selection methods with self-representation. In these methods, each feature is assumed to be representable as a linear combination

of all (other) features, i.e.,X=WX+E, whereW∈Rd×d is a representation matrix

(28)

into the same data space so that the relationships between features can be gleaned from the transformation matrix. This type of method can be regarded as a special case of subspace learning-based feature selection methods where the embedding subspace is equal to the original space. Zhu et. al. [26] proposed a regularized self-representation (RSR) model for unsupervised feature selection that sets both the loss function and

the regularization function to`2,1-norms on the representation errorE(for robustness

to outlier samples) and transformation matrixW(for feature selection), respectively.

RSR can therefore be written as

min

W kX−WXk2,1+λkWk2,1. (2.8)

RSR has been extended to non-convex RSR [27], where the regularization function

is instead set to an `2,p-norm for 0 < p < 1. Unsupervised graph self-representation

sparse feature selection (GSR SFS) [28] further extends [27] by changing the loss function to a Frobenius norm, as well as by considering local data geometric structure

preservation on embedding WX through spectral graph analysis. GSR SFS can be

written in the following formulation

min W 1 2||X−WX|| 2 F +λ1Tr(XTWTLWX) +λ2||W||2,1, (2.9)

whereLis the graph Laplacian matrix. Lcan be calculated throughL=D−A, where

Ais the adjacency matrix which stores the similarities between vertices, andD is the

degree matrix which is a diagonal matrix containing information about the degree of vertices. Self-representation based dual-graph regularized feature selection clustering (DFSC) [29] considers the error of self-representation for both the columns and the

rows ofX(i.e., both for features and data samples). Moreover, spectral graph analysis

on both domains is considered. Subspace clustering guided unsupervised feature

(29)

with unsupervised feature selection. In addition, SCUFS also exploits discriminative information for feature selection.

2.5

Self-Taught Learning

Transfer learning methods can be classified as homogeneous and heterogeneous.

Homogeneous transfer learning methods assume Xsrc = Xtrg while heterogeneous

transfer learning methods assume Xsrc 6= Xtrg. We focus on homogeneous transfer

learning.

Self-taught learning can be categorized into the group of inductive transfer

learn-ing methods [43], in which Ttrg 6= Tsrc while the domains can be either same or

different. The idea of self-taught learning was first proposed by Raina et. al. [45] and

implemented through dictionary learning and sparse coding.2 To be more specific, a

dictionary is learned using source samples:

min D,Asrc kXsrc−DAsrck2F +β nsrc X i=1 kA(srci)k1, s.t. kD(j)k ≤1,1≤j ≤s, (2.10)

where D ∈ Rd×s is a dictionary with each column as a dictionary element, and

where each column in Asrc ∈ Rs×nsrc represents the sparse coefficient vector of the

corresponding unlabeled source sample from Xsrc ∈ Rd×nsrc. After the dictionary D

is obtained, a new labeled training set{Atrg,Ytrg}in the target domain is computed

through min Atrg kXtrg−DAtrgk2F +β ntrg X i=1 kA(trgi)k1, (2.11)

where each column in Atrg ∈ Rs×ntrg represents the sparse coefficient vector of the

corresponding labeled target sample fromXtrg ∈Rd×ntrg. Finally, a classifier is learned

2We use abbreviation STL to denote the method of [45] in the sequel, while we use the full name

(30)

on the new labeled training set by applying a supervised learning algorithm. The idea of self-taught learning has been applied in scenarios such as clustering [46], visual tracking [47], object localization [48, 49], hyperspectral image classification [50], wound infection detection [51], etc.

Wang et. al. [52] propose robust and discriminative self-taught learning (RDSTL) as an extension to STL. Compared with STL, two changes are made in order to increase the robustness of the learning model and make use of supervision information

contained in target samples. The first is to replace the `1-norm loss function used in

STL with an `2,1-norm loss function because the latter is claimed to be more robust

to noise and outliers. The second is to take advantage of label information of target

samples during learning. Assume Xk ∈ Rd×nk and Ak ∈ Rs×nk denote the samples

and corresponding sparse codes belonging to thekth class. We refer to source samples

as belonging to the 0th class and assume that the dataset X is arranged by classes

so that X = [X0,X1,· · · ,XK], where K is the total number of classes in the target

samples, with A following the same setup. Then RDSTL can be written as the

following optimization problem:

min D,AkX−DAk 2 F +β K X k=0 kATkk2,1, s.t. kD(j)k ≤1,1≤j ≤s. (2.12)

Another advantage of imposing `2,1-norm regularization on the representation

coef-ficients is that it makes the learning process insensitive to the dictionary size. This

is because sparsity on rows of A helps select basis vectors in D: a basis vector

contributes little to data representation if the `2-norm value of its corresponding

co-efficient vector is close to 0. Therefore, the final task performance should not be sensitive to the dictionary size once it is large enough.

(31)

Li et. al. [53] proposes a self-taught low-rank (S-Low) coding framework which is suitable for both clustering and classification tasks in visual learning. By imposing a low-rank constraint onto the sparse coefficient matrix, S-Low coding is claimed to be able to characterize the global structure information in the target domain. The objective function of S-Low coding is

min D,Asrc,Atrg, Esrc,Etrg kAtrgkγ1 +λ1Mγ2(Esrc) +λ2Mγ2(Etrg) +λ3kAsrck2,1, s.t.Xsrc=DAsrc+Esrc, Xtrg =DAtrg+Etrg, (2.13)

where k · kγ1 denotes the matrix γ-norm with parameter γ1 and Mγ2(·) denotes the

minimax concave penalty norm with parameterγ2. We refer readers to [53] for more

details on the roles and definitions of these two norms.

Though the self-taught learning approaches mentioned above use different schemes for knowledge transfer, they all use the whole source sample set without considering their relevance to target domain, which makes these methods potentially vulnerable to negative transfer.

2.6

Human Activity Recognition

Human activity recognition (HAR) is a technique that aims at learning high-level knowledge about human activities from the low-level sensor inputs [62]. In recent years, the growing ubiquity of sensor-equipped wearables such as smart wristbands and smartphones have significantly promoted researches regarding HAR in the field of pervasive computing [63].

Machine learning algorithms have been widely used in HAR. In the last decade, traditional machine learning tools such as Markov models [64, 65] and decision trees [66, 67] have yielded tremendous progress in HAR. However, traditional machine learning methods suffer from several limitations in HAR using wearables. Most

(32)

tra-ditional machine learning algorithms applied in HAR use manually designed features including mean, variance, and frequency, which are shallow and heavily rely on hu-man domain knowledge and experience; furthermore, they are specific to particular tasks. The drawback of these hand-crafted features is two-fold: 1. shallow features can only be used to recognize low-level activities like sitting and standing but are hardly capable for high-level activities like printing papers and attending a seminar [68]; 2. task-specific features can hardly be transferred to different environments or tasks. Therefore, traditional machine learning algorithms cannot handle complex HAR scenarios, and they require one specifically designed model for each task, which increases the time and labor cost to build HAR systems in terms of both labeled data collection and model construction.

In recent years, the application of deep learning methods to HAR has significantly alleviated the drawbacks of traditional machine learning based HAR methods. First, deep neural network can extract high-level features with little or no human design. Second, deep learning models can be reused for similar tasks, which makes HAR model construction more efficient. Different deep learning models such as deep neu-ral networks [69, 70], convolutional neuneu-ral networks [71, 72], autoencoders [73, 74], restricted Boltzmann machines [75, 76], and recurrent neural networks [77, 78] have been applied in HAR. We refer readers to [62] for more details on deep learning based HAR.

2.7

Few-Shot Learning

Few-shot learning (FSL) is a transfer learning technique that applies knowledge from existing data to data from unseen classes which do not have sufficient labeled training data for model training. For example, if we are performing a task of recog-nizing bird species from images, the number of data for some rare species of birds may be insufficient to be used for training. Then the knowledge of existing bird species

(33)

can be borrowed for the learning of new species. In this case, the machine learning problem with a classifier for bird images with insufficient amount of training data is treated as a FSL problem. If there is only one image of a bird species, this would be a one-shot learning problem.

The first work for FSL is [79], in which a variational Bayesian framework is pro-posed to represent visual object categories as probabilistic models. Existing object categories, denoted as prior knowledge, is represented as a probability density func-tion on the parameters of these models. While unseen categories, denoted as the posterior model, is obtained by updating the prior with one or more observations. Lim et al. [80] propose a sample-borrowing method for multiclass object detection that adds selected samples from similar categories to the training set in order to increase the number of training data.

In recent years, deep learning based FSL has become the mainstream of FSL due to their unparalleled performance. Initialization based methods [54, 81, 82] focus on the fine-tuning process for new tasks. Finn et al. [81] aims to learn a good initialization for new tasks. Ravi et al. [54] and Munkhdalai [82] replace the weight-update process with an external memory. Hallucination based methods [83–85] learn a data generator from the base classes and use the learned generator to hallucinate data for new classes. Hariharan et al. [83] transfers variance from base class data to new classes. Antoniou et al. [84] uses generative adversarial networks to transfer style from base classes to new classes. Wang et al. [85] integrate the generator into a meta-learning framework for transfer learning purposes. Distance metric learning based methods [55, 56, 86, 87] measure the distance between two images and take advantage of the distance to classify unseen images. Examples of distance metrics include cosine similarity [56], Euclidean distance [55], CNN-based relation module [86], and graph neural network [87].

(34)

CHAPTER 3

UNSUPERVISED FEATURE SELECTION

3.1

Introduction

Feature selection is a dimensionality reduction technique that selects a subset of representative features from highdimensional data by eliminating irrelevant and redundant features. Recently, feature selection combined with sparse learning has attracted significant attention due to its outstanding performance compared with tra-ditional feature selection methods that ignores correlation between features. These works first map data onto a low-dimensional subspace and then select features by posing a sparsity constraint on the transformation matrix. However, they are re-stricted by design to linear data transformation, a potential drawback given that the underlying correlation structures of data are often non-linear. To leverage a more sophisticated embedding, we propose an autoencoder-based unsupervised feature se-lection approach that leverages a single-layer autoencoder for a joint framework of feature selection and manifold learning. More specifically, we enforce column sparsity on the weight matrix connecting the input layer and the hidden layer, as in previous work. Additionally, we include spectral graph analysis on the projected data into the learning process to achieve local data geometry preservation from the original data space to the low-dimensional feature space. Extensive experiments are conducted on image, audio, text, and biological data. The promising experimental results validate

the superiority of the proposed method1.

(35)

3.2

Proposed Method

In this section, we introduce our proposed graph autoencoder-based unsupervised feature selection (GAFS). Our proposed framework performs broad data structure preservation through a single-layer autoencoder and also preserves local data geo-metric structure through spectral graph analysis. In contrast to existing methods that exploit discriminative information for unsupervised feature selection by impos-ing orthogonal constraints on the transformation matrix [21] or low-dimensional data representation [22, 23], GAFS does not include such constraints. More specifically, we do not add orthogonal constraints on the transformation matrix because feature weight vectors are not necessarily orthogonal with each other in real-world appli-cations [89], allowing GAFS to be applicable to a larger set of appliappli-cations [17]. Furthermore, methods posing orthogonal constraints on low-dimensional data repre-sentations makes a good estimation of number of classes necessary to obtain reliable label indicators for those algorithms; such estimation is difficult to achieve in an unsupervised framework.

3.2.1 Objective Function

The objective function of GAFS includes three parts: a term based on a single-layer autoencoder promoting broad data structure preservation; a term based on spectral graph analysis promoting local data geometric structure preservation; and a regularization term promoting feature selection. As mentioned in Chapter 2.2, a single-layer autoencoder aims at minimizing the reconstruction error between output and input data by optimizing a reconstruction error-driven loss function:

L(Θ) = 1 2n n X i=1 kX(i)−h(X(i);Θ)k22 = 1 2nkX−h(X;Θ)k 2 F, (3.1)

where Θ = [W1,W2,b1,b2], h(X;Θ) = σ(W2·σ(W1X+b1) +b2); we use the

(36)

SinceW1 is a weight matrix applied directly on the input data, each column ofW1

can be used to measure the importance of the corresponding data feature. Therefore,

R(Θ) =kW1k2,1 can be used as a regularization function to promote feature selection

as detailed in Chapter 2.4. The objective function for the single-layer autoencoder based unsupervised feature selection can be obtained by combining this regularization function with the loss function of (4.2), providing us with the optimization

min Θ 1 2nkX−h(X;Θ)k 2 F +λkW1k2,1, (3.2)

where λ is a balance parameter.

Local geometric structures of the data often contain discriminative information of neighboring data point pairs [18]. They assume that nearby data points should have similar representations. It is often more efficient to combine both broad and local data information during low-dimensional subspace learning [90]. In order to characterize

the local data geometric structure, we construct ak-nearest neighbor (kNN) graphG

on the data space. The edge weight between two connected data points is determined by the similarity between those two points. We choose cosine distance as similarity

measurement for its simplicity. Therefore the adjacency matrix A for the graph Gis

defined as A(i,j)=        X(i)TX(j) kX(i)k 2kX(j)k2 if X(i) ∈ Nk(X(j)) or X(j) ∈ Nk(X(i)), 0 otherwise, (3.3)

where Nk(X(i)) denotes the k-nearest neighborhood set for X(i), and X(i)

T

refers to

the transpose of X(i). The Laplacian matrix L of the graph G is defined as L =

D−A, where D is a diagonal matrix whoseith element on the diagonal is defined as

D(i,i)=Pn

j=1A

(37)

In order to preserve the local data geometric structure in the learned subspace (i.e.,

if two data pointsX(i)andX(j)are close in original data space then the corresponding

low-dimensional representations Y(i) and Y(j) are also close in the low-dimensional

embedding space), we set up the following minimization objective:

G(Θ) = 1 2 n X i=1 n X j=1 kY(i)−Y(j)k2 2A (i,j) = 1 2 n X i=1 n X j=1 (Y(i)TY(i)−Y(i)TY(j)−Y(j)TY(i)+Y(j)TY(j))A(i,j) = n X i=1 Y(i)TY(i)D(i,i)− n X i=1 n X j=1 Y(i)TY(j)A(i,j) = Tr(Y(Θ)DY(Θ)T)−Tr(Y(Θ)AY(Θ)T) = Tr(Y(Θ)LY(Θ)T), (3.4)

where Tr(·) denotes the trace operator, Y(i)(Θ) =σ(W1x(i)+b1) fori= 1,2,· · · , n

(and we often drop the dependence onΘfor readability), andY(Θ) = [Y(1)(Θ),Y(2)(Θ),· · · ,Y(n)(Θ)].

Therefore, by combining the single-layer autoencoder based feature selection ob-jective (3.2) and the local data geometric structure preservation into consideration, the resulting objective function of GAFS can be written in terms of the following

minimization with respect to the parameters Θ= [W1,W2,b1,b2]:

ˆ Θ= arg min Θ F(Θ) = arg minΘ L(Θ) +R(Θ) +G(Θ) = arg min Θ 1 2nkX−h(X;Θ)k 2 F +λkW1k2,1+γTr(YLYT) , (3.5)

where λ and γ are two balance parameters. Filter-based feature selection is then

performed using the score function GAF S(q) =kW(1q)k2 based on the weight matrix

W1 from ˆΘ.

3.2.2 Optimization

The objective function of GAFS shown in (4.6) does not have a closed-form

(38)

toolbox [92] to solve the GAFS optimization problem. The solver requires the

gradi-ents of the objective function in (4.6) with respect to its parameters Θ.

The gradients for the loss term L(Θ) can be obtained through a back-propagation

algorithm. We defer the details for the derivation of the gradients of the error term, which are standard in the formulation of backpropagation for an autoencoder. The resulting gradients are as follows:

∂L(Θ) ∂W1 = 1 n∆2X T, ∂L(Θ) ∂W2 = 1 n∆3Y T, ∂L(Θ) ∂b1 = 1 n n X i=1 ∆2(i)= 1 n∆21, ∂L(Θ) ∂b2 = 1 n n X i=1 ∆3(i)= 1 n∆31. (3.6)

Each column ∆(2i) and ∆(3i) of ∆2 ∈Rm×n and ∆3 ∈Rd×n, respectively, contains the

error term of the corresponding data point for the hidden layer and the output layer, respectively, having entries as follows:

∆(3p,i)= ( ¯X(p,i)−X(p,i))·X¯(p,i)·(1−X¯(p,i)), ∆(2q,i)= d X p=1 W(2q,p)∆(3p,i) ! ·Y(q,i)·(1−Y(q,i)), (3.7)

for p = 1,2,· · · , d, q = 1,2,· · · , m, and i = 1,2,· · · , n, and where ¯X denotes the reconstructed data output of the autoencoder. Equation (3.7) can also be rewritten in matrix form as

∆3 = (X¯ −X)•X¯ •(1−X¯), ∆2 = (WT2∆3)•Y•(1−Y),

(3.8)

where • denotes the element-wise product operator. In the sequel, we use 1 and

0 to denote an all-ones and all-zeros matrix or vector with of the appropriate size,

(39)

The regularization term R(Θ) =kW1k2,1, whose derivative does not exist for its

ith column W(1i) when W1(i) =0 for i= 1,2,· · · , d. In this case,

∂R(Θ)

∂W1

=W1U, (3.9)

where U∈Rd×d is a diagonal matrix whoseith element on the diagonal is

U(i,i)=      kW(1i)k2+ −1 , kW(1i)k2 6= 0, 0, otherwise. (3.10)

where is a small constant added to avoid overflow [29]. Since kW1k2,1 is not

dif-ferentiable at 0, we calculate the subgradient for each element in W1 in that case.

That is, for each element in W1, the subgradient at 0 can be an arbitrary value in

the interval [−1,1], and so we set the gradient to 0 for computational convenience.

In summary, the gradients for the regularization term is:

∂R(Θ) ∂W1 =λW1U, ∂R(Θ) ∂W2 =0, ∂R(Θ) ∂b1 =0, ∂R(Θ) ∂b2 =0, (3.11)

The gradients of the graph term G(Θ) = γTr(YLYT) can be obtained in a

straightforward fashion as follows: ∂L(Θ) ∂W1 = ∂Tr(γYLY T) ∂Y · ∂Y ∂Z · ∂Z ∂W1 = 2γ(YL•Y•(1−Y))XT, ∂L(Θ) ∂b1 = ∂Tr(γYLY T) ∂Y · ∂Y ∂Z · ∂Z ∂b1 = 2γ(YL•Y•(1−Y))1, ∂L(Θ) ∂W2 =0, ∂L(Θ) ∂b2 =0. (3.12)

(40)

To conclude, the gradients of the GAFS objective function with respect to Θ = [W1,W2,b1,b2] can be written as ∂F(Θ) ∂W1 = 1 n∆2X T +λW 1U+ 2γ(YL•Y•(1−Y))XT, ∂F(Θ) ∂W2 = 1 n∆3Y T, ∂F(Θ) ∂b1 = 1 n∆21+ 2γ(YL•Y•(1−Y))1, ∂F(Θ) ∂b2 = 1 n∆31 (3.13)

3.3

Experiments

In this section, we evaluate the feature selection performance of GAFS in terms of both supervised and unsupervised tasks, e.g. clustering and classification, on several benchmark datasets. We also compare GAFS with other state-of-the-art unsupervised

feature selection algorithms. To be more specific, we first select p representative

features and then perform both clustering and classification on those selected features. The performance of clustering and classification is used as the metric to evaluate

feature selection algorithms. We perform experiments on eight benchmark datasets,2

including three image datasets (MNIST, COIL20, Yale), three text datasets (PCMAC, BASEHOCK, RELATHE), one audio dataset (Isolet), and one biological dataset (Prostate GE). Detailed properties of those datasets are summarized in Table 4.1.

3.3.1 Evaluation Metric

We perform both supervised (i.e., classification) and unsupervised (i.e., cluster-ing) tasks on datasets formulated by the selected features in order to evaluate the effectiveness of feature selection algorithms. For classification, we employ softmax classifier for its simplicity and compute the classification accuracy as the evaluation

(41)

Dataset Features Instances Classes Type MNIST 784 60000 10 Image COIL20 1024 1440 20 Image Yale 1024 165 15 Image PCMAC 3289 1943 2 Text BASEHOCK 4862 1993 2 Text RELATHE 4322 1427 2 Text Prostate GE 5966 102 2 Biology Isolet 617 1560 26 Audio

Table 3.1. Details of datasets used.

metric for feature selection effectiveness. For clustering, we use k-means clustering

on the selected features and use two different evaluation metrics to evaluate the clus-tering performance of all methods. The first is clusclus-tering accuracy (ACC), defined as ACC = 1 n n X i=1 δ(gi,map(ci)),

wheren is the total number of data samples,δ(·) is defined byδ(a, b) = 1 when a=b

and 0 when a 6= b, map(·) is the optimal mapping function between cluster labels

and class labels obtained using the Hungarian algorithm [93], and ci and gi are the

clustering and ground truth labels of a given data samplexi, respectively. The second

is normalized mutual information (NMI), which is defined as

NMI = MI(C, G)

max(H(C), H(G)),

where C and G are clustering labels and ground truth labels, respectively, MI(C, G)

is the mutual information betweenC andG, andH(C) andH(G) denote the entropy

of C and G, respectively. More details about NMI are available in [94]. For both

ACC and NMI, 20 clustering processes are repeated with random initialization for each case following the setup of [18] and [21], and we report the corresponding mean values of ACC and NMI.

(42)

3.3.2 Experimental Setup

In our last experiment, we compare GAFS with LapScore3 [14], SPEC4 [15],

MRSF5 [19], UDFS6 [21], and RSR7 [26]. Among these methods, LapScore and SPEC

are filter feature selection methods which are based on data similarity. LapScore uses spectral graph analysis to set a score for each feature. SPEC is an extension to Lap-Score and can be applied to both supervised and unsupervised scenarios in which schemes of constructing graphs used for data similarity measurement are different. Details on MRSF, UDFS, and RSR can be found in Chapter 2.4. Besides the five methods, we also compare GAFS with the performance of using all features as the baseline.

Both GAFS and compared algorithms include parameters to adjust. In this exper-iment, we fix some parameters and tune others according to a “grid-search” strategy.

For all algorithms, we select p ∈ {2%, 4%, 6%, 8%, 10%, 20%, 30%, 40%, 50%,

60%, 70%, 80%} of all features for each dataset. For all graph-based algorithms,

the number of nearest neighbor in a kNN graph is set to 5. For all algorithms

pro-jecting data onto a low-dimensional space, the space dimensionality is set in the

range of m ∈ {10,20,30,40}. In GAFS, the range for the hidden layer size is set to

match that of the subspace dimensionalitym,8 while the balance parameters are given

ranges λ ∈ {10−4,10−3,10−2,10−1,1} and γ ∈ {0,10−4,5×10−4,10−3,5×10−3},

re-spectively. For UDFS, we use the range γ ∈ {10−9,10−6,10−3,1,103,106,109}, and λ

3Available at http://www.cad.zju.edu.cn/home/dengcai/Data/code/LaplacianScore.m 4Available at https://github.com/matrixlover/LSLS/blob/master/fsSpectrum.m 5Available at https://sites.google.com/site/alanzhao/Home

6Available at http://www.cs.cmu.edu/ yiyang/UDFS.rar 7Available at https://github.com/guangmingboy/githubs doc

8We will alternatively use the terminologies subspace dimensionality and hidden layer size in

(43)

is fixed to 103. For RSR, we use the rangeλ ∈ {10−3,5×10−3,10−2,5×10−2,10−1,5×

10−1,1,5,10,102}.

For each specific value of p on a certain dataset, we tune the parameters for each

algorithm in order to achieve the best results among all possible combinations. For classification, we report the highest classification accuracy. For clustering, we report the highest average values for both ACC and NMI from 20 repetitions.

3.3.3 Parameter Sensitivity

We study the performance variation of GAFS with respect to the hidden layer

size m and the two balance parameters λ and γ. We show the results on all the 8

datasets in terms of ACC.

We first study the parameter sensitivity of GAFS with respect to subspace

di-mensionality m. Besides the aforementioned manifold dimensionality range m ∈

{10,20,30,40}, which are common for both proposed and comparing algorithms, we

also conducted experiments with hidden layer size values of m∈ {100,200,300,400}

to investigate the performance change for a larger range of reduced dimensionality values. The results in Fig. 4.1 show that the performance of GAFS is not too sensi-tive to hidden layer size on the given datasets, with the exception of Yale, where the

performance with hidden layer size of m ∈ {10,20,30,40} is apparently better than

that with reduced dimensionality m ∈ {100,200,300,400}, while the performance

variations are small in the latter set. One possible reason behind this behavior is that for a human face image dataset like Yale, the differences between data instances can be subtle since they may only lie in a small area of relevance such as eyes, mouth, nose, etc. Therefore, in this case a small subspace dimensionality can be enough for information preservation, while a large subspace dimensionality may introduce redundant information that may harm feature selection performance.

(44)

10 20 30 40 50 60 70 80 30 35 40 45 50 55 60 65 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (a) MNIST 10 20 30 40 50 60 70 80 40 45 50 55 60 65 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (b) COIL20 10 20 30 40 50 60 70 80 20 25 30 35 40 45 50 55 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (c) Yale 10 20 30 40 50 60 70 80 45 50 55 60 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (d) PCMAC 10 20 30 40 50 60 70 80 45 50 55 60 65 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (e) BASEHOCK 10 20 30 40 50 60 70 80 42 44 46 48 50 52 54 56 58 60 62 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (f) RELATHE 10 20 30 40 50 60 70 80 50 55 60 65 70 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (g) Prostate GE 10 20 30 40 50 60 70 80 25 30 35 40 45 50 55 60 65 70 p (%) ACC (%) m = 10 m = 20 m = 30 m = 40 m = 100 m = 200 m = 300 m = 400 (h) Isolet

Figure 3.1. Performance variation of the GAFS w.r.t. dimensionality of subspace

(45)

We also study the performance of GAFS on balance parameters λ and γ, with

fixed percentage of selected features and hidden layer size. We set p = 20%, as

Fig. 4.1 shows that the performance stabilizes starting at that value ofp. For subspace

dimensionality, we choosem= 10 since Fig. 4.1 shows that the performance of GAFS

is not sensitive to the value of m. The performance results are shown in Fig. 3.2,

where we find that different datasets present different trends on the ACC values

with respect to λ and γ. However, we also find that the performance differences

on PCMCA, BASEHOCK, and RELATHE are not greater than 0.8%, 0.8%, and 0.4%, respectively. Therefore we cannot make any conclusion on the influence from

two balance parameters on ACC based on these 3 datasets. For the parameter λ,

which controls the column sparsity of W1, we can find that for Yale the performance

monotonically improves as the value of λ increases for each fixed value of γ, even

though the number of selected featuresmis fixed. We believe this is further evidence

that a small number of selected features receiving large score (corresponding to large λ) is sufficient to obtain good learning performance, while having a large number of

highly scoring features (corresponding to small λ) may introduce irrelevant features

to the selection. We also find a similar behavior for Prostate GE and Isolet. For both

MNIST and COIL20, we can find that the overall performance is best when λ= 10−2

and both smaller and larger values of λ degrade the performance. This is because

the diversity among instances of these two datasets is large enough: a large value of

λ may remove informative features, while a small value of λ prevents the exclusion

of small, irrelevant, or redundant features. For the parameter γ, which controls local

data geometric structure preservation, we can find that both large values and small

values of γ degrade performance. On one hand, we can conclude that local data

geometric structure preservation does help improve feature selection performance to a certain degree. On the other hand, large weights on local data geometric structure preservation may also harm feature selection performance.

(46)

(a) MNIST (b) COIL20

(c) Yale (d) PCMAC

(e) BASEHOCK (f) RELATHE

(g) Prostate GE (h) Isolet

(47)

3.3.4 Feature Selection Illustration

We randomly select five samples from the Yale dataset to illustrate the choices

made by different feature selection algorithms. For each sample,p∈ {10%,20%,30%,40%,50%,60%,70%,80%,100%}

features are selected. Figure 3.3 shows images corresponding to the selected features

(i.e., pixels) for each sample and value of p, with unselected pixels shown in white.

The figure shows that GAFS is able to capture the most discriminative parts on human face such as eyes, nose, and mouse.

Figure 3.3. Feature selection illustration on Yale. Each row corresponds to a sample

human face image and each column refers to percentages of features selected p ∈

{10%,20%,30%,40%,50%,60%,70%,80%,100%} from left to right.

3.3.5 Clustering Illustration

We show a toy example of clustering from the low-dimensional data representation

(i.e., the hidden

Figure

Table 3.1. Details of datasets used.
Figure 3.1. Performance variation of the GAFS w.r.t. dimensionality of subspace m and the percentage of features selected p (%).
Figure 3.2. Performance variation of the GAFS w.r.t. balance parameters λ and γ.
Figure 3.3. Feature selection illustration on Yale. Each row corresponds to a sample human face image and each column refers to percentages of features selected p ∈ {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 100%} from left to right.
+7

References

Related documents

Most companies recruit for full-time and internship positions, but some indicate Co-Op as a recruiting priority, while not attending Professional Practice

During the thesis work, I measured six different parameters: the number of emergency processes, hash table entry number, caching replacement policy, cache entry

It also promises to provide several features like on demand self service, broad network access along with an infinite data storage capacity to its users, Figure 1

Political Parties approved by CNE to stand in at least some constituencies PLD – Partido de Liberdade e Desenvolvimento – Party of Freedom and Development ECOLOGISTA – MT –

ter mean to the prototypes computed from the true labels of all the samples. Similar to the semi-supervised scenario, we use a PN trained in the episodic mode as the feature

The UNO Assessment Committee is responsible for guiding the process of campus-wide academic assessment of student learning, and to that end it conducts regular reviews of

• Taxpayers subject to the provisions of Title II of the Income Tax Law (ITL) which have declared taxable income of $644,599,005 or more in the immediately preceding tax

Parr and Shanks [18] classify ERP implementations to three broad categories (comprehensive, middle road, and vanilla) and according to them ERP implementations differ with respect