3.2 Feature Selection and Classification - Novel Preprocessing Approaches for Omics Data Types and Their Performance Evaluation

The feature selection method and classifier used are specific to each type of change.

DE: The process differs for microarray and RNA-seq datasets. (a) For microarray data, genes are ranked on their moderated t-statistics using the implementation in limma (Smyth, 2004). Training and prediction for the microarray datasets are performed using diagonal linear discriminant analysis (DLDA), originally developed for classification of cancer samples based on their gene expression measurements (Dudoit, Fridlyand, & Speed, 2002). (b) For RNA-seq data, genes are ranked based on a likelihood ratio test statistic of negative binomial generalised linear models using the implementation in edgeR (Robinson, McCarthy, & Smyth, 2010). Poisson linear discriminant analysis (PLDA) is used to determine a decision boundary and make predictions, as it has been demonstrated that DLDA finds suboptimal decision boundaries for count data, whereas PLDA finds the correct boundary (Witten, 2011). A power transformation was applied to eliminate overdispersion, making PLDA applicable to RNA-seq count data.
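As an illustration of the DLDA rule used for microarray data, the following minimal Python sketch assigns a sample to the class whose mean is nearest once each feature is scaled by its pooled within-class variance. It assumes equal class priors, and the function names (fit_dlda, predict_dlda) and toy values are hypothetical. It is not the implementation used in this work, and the limma ranking step is not shown.

```python
from statistics import mean

def fit_dlda(samples, labels):
    """Estimate per-class feature means and pooled per-feature variances."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    means = {}
    pooled_ss = [0.0] * n_features
    for cls in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        means[cls] = [mean(r[j] for r in rows) for j in range(n_features)]
        for r in rows:
            for j in range(n_features):
                pooled_ss[j] += (r[j] - means[cls][j]) ** 2
    dof = len(samples) - len(classes)  # pooled degrees of freedom
    variances = [ss / dof for ss in pooled_ss]
    return means, variances

def predict_dlda(model, sample):
    """Return the class with the smallest variance-scaled squared distance."""
    means, variances = model
    def score(cls):
        return sum((x - m) ** 2 / v
                   for x, m, v in zip(sample, means[cls], variances))
    return min(means, key=score)

# Toy data (assumed values): two genes, three samples per class.
train = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5],
         [3.0, 5.2], [3.3, 4.8], [2.7, 5.0]]
labels = ["A", "A", "A", "B", "B", "B"]
model = fit_dlda(train, labels)
prediction_low = predict_dlda(model, [1.1, 5.0])
prediction_high = predict_dlda(model, [2.9, 5.1])
```

Because the covariance matrix is taken to be diagonal, each gene contributes independently to the score, which keeps the rule usable when genes far outnumber samples.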

DV: For microarray data, the normalised values were used directly. For RNA-seq data, the mean- […] & Anders, 2014), to avoid detecting DV features simply caused by DE. Features were then ranked by either their Bartlett or Levene statistic, and the top-ranked features were selected. The Bartlett test (Bartlett, 1937) tends to choose features with a small number of outliers, whereas the Levene statistic (Levene, 1960) is robust to outliers. Before training and prediction, each feature value was replaced by the absolute value of the difference between the measurement and the median of that feature over all samples in the training set. Thus, if the values originally came from a normal distribution, the transformed values will follow a half-normal distribution. Fisher’s linear discriminant analysis (FLDA) was used for classification.
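The effect of the absolute deviation transformation can be seen in a small Python sketch. The function names and the toy values are assumptions made for illustration only. Two classes with the same centre but different spread have equal means before the transformation and clearly different means after it, which is what allows a classifier with a linear decision boundary to separate them.

```python
from statistics import mean, median

def fit_medians(training):
    """Median of each feature over all training samples (rows)."""
    n_features = len(training[0])
    return [median(row[j] for row in training) for j in range(n_features)]

def abs_deviation(samples, medians):
    """Replace each value by its absolute distance from the training median."""
    return [[abs(x - m) for x, m in zip(row, medians)] for row in samples]

# Toy single-gene example (assumed values): same centre, different spread.
low_var = [[9.8], [10.0], [10.2], [9.9], [10.1]]
high_var = [[6.0], [14.0], [8.0], [12.0], [10.0]]

medians = fit_medians(low_var + high_var)  # medians come from training data only
low_t = abs_deviation(low_var, medians)
high_t = abs_deviation(high_var, medians)

raw_gap = abs(mean(r[0] for r in low_var) - mean(r[0] for r in high_var))
transformed_gap = mean(r[0] for r in high_t) - mean(r[0] for r in low_t)
```

Test samples would be transformed with the medians estimated from the training set, so no information from the test set enters the transformation.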

DD: Four approaches for assessing the differences between two different distributions (class 1 and class 2) are considered. These are the differences of medians and deviations, the Kolmogorov-Smirnov distance, the log-likelihood ratio, and simply combining the results of individual DE and DV selections. Motivated by the success of finding DE genes by considering the absolute differences in medians for the melanoma dataset (Jayawardana et al., 2015), the Differences of Medians and Deviations (DMD) is defined as:

$$\mathrm{DMD} = |\mathrm{median}_1 - \mathrm{median}_2| + |Q_{n,1} - Q_{n,2}|$$

where $\mathrm{median}_1$ and $\mathrm{median}_2$ represent the median expression values of class 1 and 2, respectively. The values $Q_{n,1}$ and $Q_{n,2}$ represent the robust scale estimator $Q_n$ (Rousseeuw & Croux, 1993) for class 1 and 2, respectively. The Kolmogorov-Smirnov (KS) distance is simply defined as the greatest distance between the empirical cumulative distribution functions of the two classes. Thirdly, a log-likelihood ratio (LLR) statistic with robust estimates of the location and scale was used:

where subscript 𝑖 denotes membership of class 𝑖, 𝑠_𝑖 denotes the number of samples in class 𝑖 and 𝑓 is the probability density function of the normal distribution. The LLR is the log of the likelihood ratio statistic for testing whether the two classes come from the same normal distribution or two different normal distributions where robust estimators of the mean and standard deviation are used in place of the maximum likelihood estimators. Terms without subscripts use values from samples in both classes. Finally, ensemble feature selection was performed by combining the selections which use the moderated t-test from limma and Bartlett test by taking the union of selected features. This is a naïve way to jointly capture features that are changing means and also those that are changing variances.
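The three single-statistic DD scores can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used here. The $Q_n$ estimator is computed naively from its definition without the finite-sample correction. The LLR function is only one plausible reading of the verbal description, with the median as the robust location, $Q_n$ as the robust scale, and an assumed sign and scaling. All function names are hypothetical.

```python
import math
from statistics import median

def qn_scale(values):
    """Naive O(n^2) Qn: scaled k-th smallest pairwise absolute difference."""
    n = len(values)
    diffs = sorted(abs(values[i] - values[j])
                   for i in range(n) for j in range(i + 1, n))
    h = n // 2 + 1
    k = h * (h - 1) // 2             # k = C(h, 2)
    return 2.2219 * diffs[k - 1]     # no finite-sample correction applied

def dmd(class1, class2):
    """Differences of Medians and Deviations."""
    return (abs(median(class1) - median(class2))
            + abs(qn_scale(class1) - qn_scale(class2)))

def ks_distance(class1, class2):
    """Largest gap between the two empirical cumulative distribution functions."""
    def ecdf(data, x):
        return sum(v <= x for v in data) / len(data)
    points = sorted(set(class1) | set(class2))
    return max(abs(ecdf(class1, x) - ecdf(class2, x)) for x in points)

def normal_logpdf(x, loc, scale):
    return (-0.5 * math.log(2 * math.pi * scale ** 2)
            - (x - loc) ** 2 / (2 * scale ** 2))

def robust_llr(class1, class2):
    """ASSUMED form: separate-class log-likelihood minus pooled log-likelihood."""
    def loglik(data):
        loc, scale = median(data), qn_scale(data)
        return sum(normal_logpdf(x, loc, scale) for x in data)
    return loglik(class1) + loglik(class2) - loglik(class1 + class2)

# Toy example (assumed values): two well-separated classes of equal spread.
class1 = [1, 2, 3, 4, 5]
class2 = [11, 12, 13, 14, 15]
scores = (dmd(class1, class2), ks_distance(class1, class2),
          robust_llr(class1, class2))
```

In this toy case the two classes have identical spread, so the DMD reduces to the difference of medians, and the KS distance reaches its maximum of 1 because the samples do not overlap.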

For each of the chosen features, a kernel density estimate is built for each of the two classes using a Gaussian smoothing kernel and bandwidth calculated by Silverman’s rule, which are the default settings of the density function in R. For the RNA-seq dataset, counts were transformed by the regularised log method, to prevent feature selection being biased towards differentially expressed genes, because of overdispersion of count data. To predict a sample from the test set, a naïve Bayes classifier is used for each feature. Two types of the classifier are considered. Firstly, each feature votes once for the class that has the maximum a posteriori estimate and the class with the largest number of votes is the predicted class. This is referred to as unweighted voting. Also, the differences between class densities, the distances from the observation to the nearest non-zero crossover point of the two densities, and the sums of those two weights are calculated and summed over all selected features, with the sign of the sum determining the class prediction. This is termed weighted voting. Intuitively, the crossover distance weighting captures how far away a measurement is from the nearest substantial observation of the class with lower density at the measurement point.

To summarise the voting schemes, let $+$ and $-$ denote the two classes that a sample can belong to. Let $n_+$ and $n_-$ represent the number of samples in the training set belonging to each class. Let $d_+$ and $d_-$ represent the fitted densities of each class. Let $\mathbf{g}$ be the vector of a sample’s measurements of the genes chosen in the feature selection stage, so that $d_+$ and $d_-$ evaluated at $g_i$ are the densities fitted for gene $i$. $\operatorname{sgn}$ is the sign function. Unweighted voting can be mathematically expressed as:

$$\operatorname{mode}_{g_i \in \mathbf{g}} \; \operatorname{sgn}\big(n_+ \, d_+(g_i) - n_- \, d_-(g_i)\big)$$

Weighted voting has the form:

$$\operatorname{sgn}\Big(\sum_{g_i \in \mathbf{g}} \big(n_+ \, d_+(g_i) - n_- \, d_-(g_i)\big)\Big)$$
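The two voting rules can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the implementation used in this work. The bandwidth follows the rule of thumb documented for R's bw.nrd0 but omits its fallbacks for zero spread, and the density is evaluated exactly instead of on an interpolated grid. Only height difference weighting is shown, and all function names and data values are hypothetical.

```python
import math
from statistics import mode, quantiles, stdev

def silverman_bandwidth(values):
    """0.9 * min(sd, IQR / 1.34) * n^(-1/5), assuming non-zero spread."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 0.9 * min(stdev(values), (q3 - q1) / 1.34) * len(values) ** -0.2

def gaussian_kde(values):
    """Return a function evaluating a Gaussian kernel density estimate."""
    h = silverman_bandwidth(values)
    norm = len(values) * h * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm
    return density

def sign(x):
    return (x > 0) - (x < 0)

def height_differences(pos, neg, sample):
    """n+ * d+(g_i) - n- * d-(g_i) for every selected gene i."""
    diffs = []
    for i, g in enumerate(sample):
        d_pos = gaussian_kde([row[i] for row in pos])
        d_neg = gaussian_kde([row[i] for row in neg])
        diffs.append(len(pos) * d_pos(g) - len(neg) * d_neg(g))
    return diffs

def unweighted_vote(diffs):
    # Each gene casts one vote; ties resolve to the first vote encountered.
    return mode(sign(d) for d in diffs)

def weighted_vote(diffs):
    return sign(sum(diffs))

# Toy training data (assumed values): rows are samples, columns are genes.
pos = [[8.0, 1.0], [8.5, 1.2], [7.5, 0.8], [8.2, 1.1], [7.8, 0.9]]
neg = [[4.0, 3.0], [4.5, 3.3], [3.5, 2.7], [4.2, 3.1], [3.8, 2.9]]

diffs_a = height_differences(pos, neg, [7.9, 1.05])
diffs_b = height_differences(pos, neg, [4.1, 3.0])
votes = (unweighted_vote(diffs_a), weighted_vote(diffs_a),
         unweighted_vote(diffs_b), weighted_vote(diffs_b))
```

A result of +1 corresponds to the $+$ class and -1 to the $-$ class. The two rules can disagree when a few genes have large density differences in one direction while most genes favour the other class by a small margin.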

Figure 3.1. Summary of feature types and classifiers. For each of differential expression, differential variability, and differential distribution, a representative gene expression distribution is shown, along with an illustration of the classification process. In the left column, the dashed vertical lines represent the means of the class distributions. In the right column, the variables 𝑥 and 𝑦 denote two different genes in a dataset. Each point indicates a sample. The bottom right panel illustrates that each gene from the selected gene set votes independently in differential distribution classification. The black circle shows the position along the x-axis of the expression measurement value. The intervals show two kinds of distance weighting evaluated in the simulation study. The green interval corresponds to crossover distance voting and the black interval to height difference voting.

All classifications in this chapter have the same basic structure. Each dataset was resampled with replacement to create 100 variations, and 5-fold cross-validation was applied to each variation, to obtain the distribution of metrics. The reasons for doing so are described in Section 1.3. For each iteration of cross-validation:

1. The training data is first processed by a feature selection function. In each case, the feature selection function chooses the set of features by testing each gene individually, ranking them by a score, and calculating resubstitution error rates for the top 𝑥 ranked features. Values of 𝑥 considered ranged from 10 to 150, in increments of 10. The value of x which obtains the lowest balanced error rate determines the size of the set of top features selected.

Let 𝑇 be the set of samples in the training set. Let 𝐶̂(𝑡) be a classifier function which predicts the class of sample 𝑡 and let 𝐶(𝑡) be a function that returns the known class of sample 𝑡. Then, the resubstitution error rate is defined by the expression:

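$$\frac{1}{n} \sum_{t \in T} I_{\mathrm{TRUE}}\big(\hat{C}(t) \neq C(t)\big)$$

where $n$ is the number of samples in $T$ and $I_{\mathrm{TRUE}}$ is an indicator function, equal to 1 when its argument is true and 0 otherwise.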

2. For only the proposed DV classification, a transformation of expression values is made. For each feature, all samples’ expression levels are subtracted from the median expression level of the training set, and absolute values taken. This transforms the data into a form which allows a classifier that uses linear decision boundaries to be applied.

3. The samples assigned to the training set in the current iteration are used for model building.

4. The samples assigned to the test set in the current iteration have their classes predicted.

The choice of using the resubstitution error for feature selection is a pragmatic one. Ideally, a nested cross-validation would be used. However, this approach is computationally infeasible in practice.
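The overall structure described above can be sketched in Python. This is a hedged illustration and not the code used for the analyses. The ranking function and the nearest-centroid classifier are hypothetical stand-ins for the methods of this section, the fold assignment is one simple way of partitioning a resampled dataset, and all names and values are assumptions.

```python
import random
from statistics import mean

def balanced_error_rate(predicted, actual):
    """Mean of the per-class error rates."""
    rates = []
    for cls in set(actual):
        idx = [i for i, a in enumerate(actual) if a == cls]
        rates.append(sum(predicted[i] != cls for i in idx) / len(idx))
    return mean(rates)

def choose_top_features(xs, ys, rank, fit, predict, sizes=range(10, 151, 10)):
    """Keep the top-x features with the lowest resubstitution balanced error."""
    ranked = rank(xs, ys)  # feature indices, best first
    best_error, best_set = None, None
    for size in sizes:
        chosen = ranked[:size]
        subset = [[row[j] for j in chosen] for row in xs]
        model = fit(subset, ys)
        error = balanced_error_rate([predict(model, s) for s in subset], ys)
        if best_error is None or error < best_error:  # ties keep the smaller set
            best_error, best_set = error, chosen
    return best_set

def resample_and_fold(n_samples, n_resamples=100, n_folds=5, seed=0):
    """Yield (train, test) index lists for every fold of every resample."""
    rng = random.Random(seed)
    for _ in range(n_resamples):
        drawn = [rng.randrange(n_samples) for _ in range(n_samples)]
        for fold in range(n_folds):
            test = drawn[fold::n_folds]
            train = [s for pos, s in enumerate(drawn) if pos % n_folds != fold]
            yield train, test

# Hypothetical stand-ins for a ranking statistic and a classifier.
def rank_by_mean_gap(xs, ys):
    a, b = sorted(set(ys))
    def gap(j):
        return abs(mean(r[j] for r, y in zip(xs, ys) if y == a)
                   - mean(r[j] for r, y in zip(xs, ys) if y == b))
    return sorted(range(len(xs[0])), key=gap, reverse=True)

def fit_centroids(xs, ys):
    return {c: [mean(col) for col in zip(*[r for r, y in zip(xs, ys) if y == c])]
            for c in set(ys)}

def predict_centroid(model, sample):
    return min(model, key=lambda c: sum((x - m) ** 2
                                        for x, m in zip(sample, model[c])))

# Toy data (assumed values): only the first gene separates the classes.
xs = [[0.0, 5.0, 1.0], [0.2, 4.0, 1.1], [0.1, 6.0, 0.9],
      [1.0, 5.5, 1.0], [1.2, 4.5, 1.2], [0.9, 5.0, 0.8]]
ys = ["A", "A", "A", "B", "B", "B"]
chosen = choose_top_features(xs, ys, rank_by_mean_gap, fit_centroids,
                             predict_centroid, sizes=(1, 2, 3))
splits = list(resample_and_fold(20, n_resamples=3, n_folds=5))
```

In a full run, choose_top_features would be called on the training indices of each split, so that feature selection never sees the test samples of that fold. A nested cross-validation would replace the resubstitution step with an inner loop over the training data, multiplying the number of model fits accordingly.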

In document Novel Preprocessing Approaches for Omics Data Types and Their Performance Evaluation (Page 59-65)