
STATISTICAL CLASSIFICATION

2.5 NEAREST SHRUNKEN CENTROIDS

The nearest shrunken centroids (NSC) procedure was proposed by Tibshirani et al. (2001) and aims to identify a subset of predictor variables which best characterises the different classes (or the smallest subset of predictors which can accurately classify samples). NSC is a modification of the nearest centroid method and can be applied to classification problems as well as to unsupervised problems (Tibshirani et al., 2001:6567). Nearest centroid classification applied to a test case $\boldsymbol{x}$ computes the (Euclidean) distance between $\boldsymbol{x}$ and each of the class centroids and then assigns $\boldsymbol{x}$ to the class whose centroid is closest. The main drawback of nearest centroid classification is that it uses all the predictors, which is undesirable in, for example, microarray problems where $p \gg N$. NSC modifies this method by “shrinking” each of the class centroids towards the overall centroid for all the classes. The idea behind NSC is that predictors whose class-specific centroids are close to the overall centroid should not play a role in classification. NSC therefore shrinks the class centroids, predictor by predictor, until only a few predictors have an influence on the classification. After shrinking the class centroids, a test sample is classified by the usual nearest centroid rule, but using the shrunken, hopefully denoised, class centroids. In NSC, one must choose a shrinkage parameter, $\Delta$, which influences the number of predictors used in computing the shrunken centroids. If $\Delta = 0$, then all the predictors are used to compute the centroids. Tibshirani et al. (2001) propose that cross-validation should be used to determine the optimal value of $\Delta$, denoted by $\Delta_{\text{opt}}$.
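As an illustration of the nearest centroid rule described above, the following minimal Python sketch computes the class centroids and assigns a test case to the class with the closest centroid. The function names `class_centroids` and `nearest_centroid_predict`, as well as the toy data, are hypothetical and serve only to illustrate the rule; no shrinkage is applied here.

```python
import math
from collections import defaultdict


def class_centroids(X, y):
    """Return {label: centroid}, each centroid being the per-predictor mean."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


# Toy data (illustrative only): N = 4 observations, p = 2 predictors, 2 classes.
X = [[1.0, 2.0], [1.2, 1.8], [4.0, 4.2], [3.8, 4.0]]
y = ["a", "a", "b", "b"]
centroids = class_centroids(X, y)
label = nearest_centroid_predict(centroids, [1.1, 2.1])
```

Every predictor enters the distance with equal weight in this rule, which is the drawback that NSC addresses.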

This section proceeds with a more detailed description of NSC, based on Tibshirani et al. (2001) and Hastie et al. (2009, Section 18.2). Let $x_{ij}$ denote the values for predictor $j$, $j = 1, 2, \ldots, p$, and observations $i = 1, 2, \ldots, N$. Also, let $\bar{x}_{jg}$ be the centroid (mean value) of variable $j$ in class $g$. The overall centroid (mean value) for the $j$th predictor is given by

$$\bar{x}_j = \sum_{i=1}^{N} \frac{x_{ij}}{N}.$$

In the binary case, it is clear that if the group-specific mean values $\bar{x}_{j1}$ and $\bar{x}_{j2}$ do not differ significantly from the overall centroid $\bar{x}_j$ (and therefore $\bar{x}_{j1}$ and $\bar{x}_{j2}$ do not differ significantly from each other), then input variable $j$ will most probably not play a significant role in accurate classification. The NSC method shrinks the class-specific centroids, $\bar{x}_{jg}$, towards the overall centroids, $\bar{x}_j$. If the shrinkage takes $\bar{x}_{j1}$ and $\bar{x}_{j2}$ all the way to the overall centroid, variable $X_j$ is effectively removed from the NSC classifier.

Consider the more detailed explanation for the general case of $G$ groups. Let $d_{jg}$ represent the standardised difference between the centroid of class $g$ and the overall centroid for the $j$th predictor. This is similar to a $t$-statistic for predictor $j$ which compares the mean of class $g$ to the overall centroid, namely

$$d_{jg} = \frac{\bar{x}_{jg} - \bar{x}_j}{m_g (s_j + s_0)}, \tag{2.4}$$

where $m_g = \sqrt{\frac{1}{N_g} + \frac{1}{N}}$, $N_g$ is the number of observations in class $g$, and $s_j$ is the pooled within-class standard deviation for predictor $j$. Assuming $G = 2$,

$$s_j^2 = \frac{1}{N - 2}\left(\sum_{i \in C_1} (x_{ij} - \bar{x}_{j1})^2 + \sum_{i \in C_2} (x_{ij} - \bar{x}_{j2})^2\right),$$

where $C_g$ denotes the set of observations in class $g$. Therefore $m_g s_j$ is equal to the estimated standard error of the numerator of $d_{jg}$ in (2.4).

In (2.4), $s_0$ denotes a small positive constant (regularisation parameter) which guards against large $d_{jg}$ values that arise from predictors with near-zero values. Tibshirani et al. (2001:6568) state that $s_0$ is typically set equal to the median of the $s_j$-values over the set of predictors, and has the same value for all the predictors. Tibshirani et al. (2003:107) explain that Equation (2.4) takes the size of each class into consideration and effectively applies a larger threshold to smaller (high variance) classes.
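The quantities in (2.4) can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `standardised_differences` and the toy data are assumptions, $s_0$ is set to the median of the $s_j$-values as described above, and the divisor $N - G$ in the pooled variance is assumed as the natural generalisation of the $N - 2$ used for $G = 2$.

```python
import math
import statistics


def standardised_differences(X, y):
    """Compute d[g][j] as in (2.4), together with s_j and s_0."""
    labels = sorted(set(y))
    n, p = len(X), len(X[0])
    rows = {g: [x for x, lab in zip(X, y) if lab == g] for g in labels}
    overall = [sum(col) / n for col in zip(*X)]
    cent = {g: [sum(col) / len(rows[g]) for col in zip(*rows[g])] for g in labels}
    # Pooled within-class standard deviation s_j (divisor N - G is an assumption).
    s = [
        math.sqrt(
            sum((x[j] - cent[g][j]) ** 2 for g in labels for x in rows[g])
            / (n - len(labels))
        )
        for j in range(p)
    ]
    s0 = statistics.median(s)  # the same constant for every predictor
    m = {g: math.sqrt(1 / len(rows[g]) + 1 / n) for g in labels}
    d = {
        g: [(cent[g][j] - overall[j]) / (m[g] * (s[j] + s0)) for j in range(p)]
        for g in labels
    }
    return d, s, s0


# Toy data (illustrative only): predictor 0 separates the classes, predictor 1 does not.
X = [[1.0, 5.0], [2.0, 6.0], [5.0, 5.0], [6.0, 6.0]]
y = [1, 1, 2, 2]
d, s, s0 = standardised_differences(X, y)
```

In this toy example the second predictor has identical class means, so its standardised difference is zero in both classes, whereas the first predictor receives differences of equal size and opposite sign.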

Equation (2.4) can also be written in the form

$$\bar{x}_{jg} = \bar{x}_j + m_g (s_j + s_0)\, d_{jg}.$$

Note that the normalisation by $s_j$ in (2.4) has the effect of giving higher weight to predictors that have stable expression within observations of the same class (Tibshirani et al., 2003:107).

The procedure now shrinks $d_{jg}$ towards zero using the soft thresholding operator, giving

$$d'_{jg} = \mathrm{sign}(d_{jg})\,(|d_{jg}| - \Delta)_+, \tag{2.5}$$

where $\Delta$ is determined by choosing the threshold that yields the minimum cross-validation misclassification error rate. In (2.5) each $|d_{jg}|$ is reduced by an amount $\Delta$, and is set to zero if the result is negative. The subscript $+$ in (2.5) denotes the positive part, that is, $t_+ = t$ if $t > 0$, and zero otherwise. This is shown in Figure 2.2 below.

Figure 2.2: Soft thresholding function used in the NSC methodology Source: Tibshirani et al., 2003:106

In Figure 2.2, the soft thresholded function given in (2.5) is shown by the red dotted line, and the 45° line is shown in grey. The black bold dotted line between the red and grey lines represents $\Delta$. If a predictor has $d'_{jg} = 0$ for all classes, then the shrunken centroid for predictor $j$ equals the overall centroid $\bar{x}_j$ in every class and therefore the predictor does not contribute to classification. In Figure 2.2 it is clear that as $\Delta$ becomes larger, there will be more predictors with $d'_{jg} = 0$ and therefore more predictors are eliminated, leaving only the most significant predictors as discriminating features.

It is clear that the tuning parameter in this process is the quantity $\Delta$ and that care has to be taken to specify its value. It is recommended that the value of $\Delta$ should be determined using cross-validation (CV). In such a process one typically chooses the largest value of $\Delta$ (resulting in the smallest number of predictors) that achieves the minimal cross-validation error (Tibshirani et al., 2003:107). Note that in scenarios where there are several solutions with minimal cross-validation error rates, the optimal threshold value (number of genes) can either be chosen as the value with the largest likelihood or the one with the smallest number of genes (larger penalty). For the purpose of this thesis the latter method will be used if there is more than one threshold value yielding the minimal CV error rate.
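The tie-breaking rule adopted above, namely choosing the largest $\Delta$ among those attaining the minimal CV error, can be sketched as follows. The function name `select_threshold` is hypothetical, and the error rates are made-up illustrative values, not results from any data set.

```python
def select_threshold(deltas, cv_errors):
    """Return the largest threshold among those attaining the minimal CV error."""
    best = min(cv_errors)
    return max(d for d, e in zip(deltas, cv_errors) if e == best)


# Hypothetical CV error rates over a grid of thresholds (illustrative only).
deltas = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
cv_errors = [0.10, 0.05, 0.05, 0.05, 0.15, 0.30]
delta_opt = select_threshold(deltas, cv_errors)  # 1.5, the largest of the tied values
```

Three thresholds tie at the minimal error in this example, and the rule returns the largest of them, which corresponds to the smallest number of retained predictors.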

Note that soft thresholding, as implemented in the NSC procedure, typically produces more reliable estimates of the true means since many of the $\bar{x}_{jg}$ values will be noisy and close to the overall mean $\bar{x}_j$ (Donoho and Johnstone, 1994).

Hard thresholding can be used as an alternative to the soft thresholding in (2.5) (Tibshirani et al., 2003:111). Hard thresholding retains all the predictors with differences greater than $\Delta$ in absolute value and discards the other predictors. This implies that (2.5) changes to

$$d'_{jg} = d_{jg} \cdot I(|d_{jg}| > \Delta).$$

From the equation above it is clear that hard thresholding differs from the soft thresholding in (2.5): instead of shrinking all the $d_{jg}$ by an amount $\Delta$ towards zero, the $d_{jg}$ that exceed $\Delta$ in absolute value remain unchanged. The main disadvantage of hard thresholding is its “jumpy” nature: as $\Delta$ is increased, a predictor that has a full contribution $d_{jg}$ is suddenly set to zero (Tibshirani et al., 2003:111). This typically leads to a procedure with larger variance than one based on soft thresholding.
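The difference between the two operators can be made concrete with a small Python sketch. The function names and the input values are illustrative assumptions.

```python
import math


def soft_threshold(d, delta):
    """Soft thresholding as in (2.5): sign(d) * (|d| - delta)_+."""
    magnitude = max(abs(d) - delta, 0.0)
    return 0.0 if magnitude == 0 else math.copysign(magnitude, d)


def hard_threshold(d, delta):
    """Hard thresholding: keep d unchanged if |d| > delta, else set it to zero."""
    return d if abs(d) > delta else 0.0


values = [-2.5, -0.5, 0.75, 1.5]  # illustrative d_jg values
soft = [soft_threshold(v, 1.0) for v in values]  # [-1.5, 0.0, 0.0, 0.5]
hard = [hard_threshold(v, 1.0) for v in values]  # [-2.5, 0.0, 0.0, 1.5]
```

Both operators set the same entries to zero, but the surviving entries are moved towards zero by $\Delta$ under soft thresholding and left untouched under hard thresholding.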

Returning to the soft threshold NSC procedure, the new shrunken versions of $\bar{x}_{jg}$ are calculated by reversing the transformation in (2.4):

$$\bar{x}'_{jg} = \bar{x}_j + m_g (s_j + s_0)\, d'_{jg}.$$

These shrunken centroids $\bar{x}'_{jg}$ are substituted in place of $\bar{x}_{jg}$ in the discriminant score.

This procedure therefore results in variable selection, as only variables with a non-zero $d'_{jg}$ for at least one of the classes play a role in the classification rule, and typically a large number of variables can be discarded. Therefore, NSC actually implements variable selection and thereby dimension reduction.
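A minimal sketch of the back-transformation and the resulting variable selection is given below. It assumes that the overall centroids, the $d_{jg}$, $m_g$, $s_j$ and $s_0$ have already been computed; the function names and the numerical inputs are hypothetical.

```python
def soft_threshold(t, delta):
    """Soft thresholding as in (2.5)."""
    magnitude = max(abs(t) - delta, 0.0)
    return magnitude if t >= 0 else -magnitude


def shrunken_centroids(overall, d, m, s, s0, delta):
    """Return the shrunken centroids and the indices of the retained predictors."""
    p = len(overall)
    d_shrunk = {g: [soft_threshold(t, delta) for t in d[g]] for g in d}
    centroids = {
        g: [overall[j] + m[g] * (s[j] + s0) * d_shrunk[g][j] for j in range(p)]
        for g in d
    }
    # A predictor is retained if its shrunken difference is non-zero in some class.
    retained = [j for j in range(p) if any(d_shrunk[g][j] != 0 for g in d)]
    return centroids, retained


# Hypothetical inputs for two classes and two predictors (illustrative only).
overall = [3.5, 5.5]
d = {"a": [-1.6, 0.3], "b": [1.6, -0.3]}
m = {"a": 0.9, "b": 0.9}
s, s0 = [0.7, 0.7], 0.7
centroids, retained = shrunken_centroids(overall, d, m, s, s0, delta=1.0)
```

With these inputs only the first predictor is retained: the shrunken centroids of the second predictor coincide with its overall centroid in both classes, so it no longer distinguishes between them.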

As mentioned previously, the discriminant score defined for each data case in NSC is similar to that of diagonal LDA (Tibshirani et al., 2001:6569). Suppose there exists a test data case vector with expression levels $\boldsymbol{x}^* = (x_1^*, x_2^*, \ldots, x_p^*)$, then the discriminant score for class $g$ is defined as

$$\delta_g(\boldsymbol{x}^*) = -\sum_{j=1}^{p} \frac{(x_j^* - \bar{x}'_{jg})^2}{(s_j + s_0)^2} + 2\log \pi_g. \tag{2.6}$$

The first term in (2.6) standardises the squared distance of $\boldsymbol{x}^*$ to the $g$th shrunken centroid. The second term, $2\log \pi_g$, makes a correction based on the class prior probability $\pi_g$. Note that $\sum_{g=1}^{G} \pi_g = 1$ and that $\pi_g$ provides the overall relative frequency of class $g$ in the population. The prior probability $\pi_g$ is typically estimated by the sample prior $\hat{\pi}_g = N_g / N$; however, if the sample prior is not representative of the population then more realistic or equal priors can be used, i.e. $\pi_g = 1/G$ (Tibshirani et al., 2001:6569).

Tibshirani et al. (2001:6569) explain that “if the smallest distances are close and hence ambiguous, the prior correction in (2.6) gives a preference for larger classes, because they potentially account for more errors.”
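The discriminant score in (2.6) and the effect of the prior correction can be illustrated with the following sketch. The function names, the shrunken centroids and the priors are hypothetical values chosen for illustration; the test case is assigned to the class with the largest score.

```python
import math


def discriminant_scores(x_star, centroids, s, s0, priors):
    """Evaluate (2.6) for every class, given the shrunken centroids."""
    return {
        g: -sum(
            (xj - cj) ** 2 / (sj + s0) ** 2
            for xj, cj, sj in zip(x_star, centroids[g], s)
        )
        + 2 * math.log(priors[g])
        for g in centroids
    }


def classify(x_star, centroids, s, s0, priors):
    """Assign x_star to the class with the largest discriminant score."""
    scores = discriminant_scores(x_star, centroids, s, s0, priors)
    return max(scores, key=scores.get)


# Hypothetical shrunken centroids for two classes and two predictors.
centroids = {"a": [2.744, 5.5], "b": [4.256, 5.5]}
s, s0 = [0.7, 0.7], 0.7
equal_priors = {"a": 0.5, "b": 0.5}
unequal_priors = {"a": 0.2, "b": 0.8}

near_a = classify([2.0, 6.0], centroids, s, s0, equal_priors)
# This test case is equidistant from both centroids, so the prior decides.
ambiguous = classify([3.5, 5.5], centroids, s, s0, unequal_priors)
```

In the second call the distances to the two centroids are equal, so the prior correction determines the assignment and the larger class "b" is preferred, in line with the quotation above.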

The NSC classification rule is then

$$C(\boldsymbol{x}^*) = \ell \quad \text{if} \quad \delta_\ell(\boldsymbol{x}^*) = \max_g \delta_g(\boldsymbol{x}^*).$$

Estimates of the class probabilities, analogous to Gaussian linear discriminant analysis, can be constructed using the discriminant scores in (2.6):

$\hat{p}_g(\boldsymbol{x}^*) = e$