NEAREST SHRUNKEN CENTROIDS - STATISTICAL CLASSIFICATION

STATISTICAL CLASSIFICATION

2.5 NEAREST SHRUNKEN CENTROIDS

The nearest shrunken centroids (NSC) procedure was proposed by Tibshirani et al. (2001) and aims to identify a subset of predictor variables which best characterises the different classes (or the smallest subset of predictors which can accurately classify samples). NSC is a modification of the nearest centroid method and can be applied to classification problems as well as in unsupervised problems (Tibshirani et al., 2001:6567). Nearest centroid classification applied to a test case 𝒙 computes the (Euclidean) distance between 𝒙 and each of the class centroids and then classifies 𝒙 to the class whose centroid is closest to 𝒙. The main drawback of nearest centroid classification is that is uses all the predictors, which is not desirable in for example microarray problems where 𝑝 ≫ 𝑁. NSC modifies this method by “shrinking” each of the class centroids towards the overall centroid for all the classes. The idea behind NSC is that predictors whose class specific centroids are close to the overall mean centroid should not play a role in classification. Therefore, NSC continuously shrinks the predictor variables until only a few have an influence on the classification. After shrinking the class centroids, a test sample is then classified by the usual nearest centroid rule, but using the shrunken, hopefully denoised class centroids. In NSC, one must choose a shrinkage parameter, ∆, which influences the number of predictors used in computing the shrunken centroids. If ∆ = 0, then all the predictors are used to compute the centroids. Tibshirani et al. (2001) propose that cross-validation should be used to determine the optimal value of ∆, denoted by ∆_𝑜𝑝𝑡.

This section proceeds with a more detailed description of NSC, based on Tibshirani et al. (2001) and Hastie et al. (2009, Section 18.2). Let 𝑥𝑖𝑗 denote the values for predictor 𝑗,

𝑗 = 1,2, … , 𝑝 and observations 𝑖 = 1,2, … , 𝑁. Also, let 𝑥̅𝑗𝑔 be the centroid (mean value) of

variable 𝑗 in class 𝑔. The overall centroid (mean value) for the 𝑗𝑡ℎ_{predictor is given by}

𝑥̅𝑗 = ∑𝑁𝑖=1𝑥_𝑁𝑖𝑗.

In the binary case, it is clear that if the group-specific mean values 𝑥̅_𝑗1 and 𝑥̅_𝑗2 do not differ significantly from the overall centroid 𝑥̅_𝑗 (and therefore 𝑥̅_𝑗1 and 𝑥̅_𝑗2 do not differ significantly from each other), then input variable 𝑗 will most probably not play a significant role in accurate classification. The NSC method shrinks the class-specific, 𝑥̅𝑗𝑔, towards the

overall centroids, 𝑥̅_𝑗. If the shrinkage takes 𝑥̅_𝑗1 and 𝑥̅_𝑗2 all the way to the overall centroid, variable 𝑋_𝑗 is effectively removed from the NSC classifier.

Consider the more detailed explanation for the general case of 𝐺 groups. Let 𝑑𝑗𝑔

represent the shrunken difference for the 𝑗𝑡ℎ_{predictor. This is similar to a 𝑡-statistic for}

predictor 𝑗 which compares the mean of class 𝑔 to the overall centroid, namely 𝑑𝑗𝑔 = 𝑥̅𝑗𝑔 − 𝑥̅𝑗 𝑚𝑔(𝑠𝑗+ 𝑠0) , (2.4) where 𝑚𝑔 = √_𝑁1 𝑔+ 1

𝑁 and 𝑠𝑗 is the pooled within-class standard deviation for predictor 𝑗.

Assuming 𝐺 = 2, 𝑠_𝑗2 = 1 𝑁 − 2(∑(𝑥𝑖𝑗 − 𝒙̅j1) 2 𝑖∈𝐶1 + ∑(𝑥𝑖𝑗 − 𝒙̅j2)2 𝑖∈𝐶2 ).

Therefore 𝑚𝑔𝑠𝑗 is equal to the estimated standard error of the numerator in 𝑑𝑗𝑔 in (2.4).

In (2.4), 𝑠0 denotes a small positive constant (regularisation parameter) which guards

against large 𝑑_𝑗𝑔 values that arise from predictors with near zero values. Tibshirani et al. (2001:6568) state that 𝑠0 is typically set equal to the median of the 𝑠𝑗-values over the set

of predictors, and has the same value for all the predictors. Tibshirani et al. (2003:107) explain that Equation (2.4) takes the size of each class into consideration and effectively applies a larger threshold to smaller (high variance) classes.

Equation (2.4) can also be written in the form

𝑥̅_𝑗𝑔 = 𝑥̅_𝑗+ 𝑚_𝑔(𝑠_𝑗+ 𝑠₀)𝑑_𝑗𝑔.

Note that the normalisation by 𝑠_𝑗 in (2.4) has the effect of giving higher weight to predictors that have stable expression within observations of the same class (Tibshirani et al., 2003:107).

The procedure now shrinks 𝑑_𝑗𝑔 towards zero using the soft thresholding operator, giving 𝑑_𝑗𝑔′ _{= 𝑠𝑖𝑔𝑛(𝑑}

𝑗𝑔)(|𝑑𝑗𝑔| − ∆)₊, (2.5) where ∆ is determined by choosing the threshold that yields the minimum cross-validation misclassification error rate. In (2.5) each |𝑑_𝑗𝑔| is reduced by an amount ∆ until it reaches zero. The subscript in (2.5) ( )₊, indicates positive part, therefore 𝑡₊ = 𝑡 if 𝑡 > 0, and zero otherwise. This is shown in Figure 2.2 below.

Figure 2.2: Soft thresholding function used in the NSC methodology Source: Tibshirani et al., 2003:106

In Figure 2.2, the soft thresholded function given in (2.5) is shown by the red dotted line, and the 45° line is shown in grey. The black bold dotted line between the red and grey lines represents ∆. If a predictor has 𝑑_𝑗𝑔′ = 0 for all classes, then the centroid for predictor 𝑗, 𝒙̅𝑗 is the same for all classes and therefore the predictor does not contribute to

classification. In Figure 2.2 it is clear that as ∆ becomes larger, there will be more predictors with 𝑑_𝑗𝑔′ = 0 and therefore more predictors are eliminated, leaving only the most significant predictors as discriminating features. It is clear that the tuning parameter in this process is the quantity ∆ and that care has to be taken to specify its value. It is recommended that the value of ∆ should be determined using cross-validation (CV). In such a process one typically chooses the largest value of ∆ (resulting in the smallest number of predictors) that achieves the minimal cross-validation error (Tibshirani et al., 2003:107). Note that in scenarios where there are several solutions with minimal cross- validation error rates, the optimal threshold value (number of genes) can either by chosen as the value with the largest likelihood or the one with the smallest number of genes (larger penalty). For the purpose of this thesis the latter method will be used if there is more than one threshold value yielding the minimal CV error rate.

Note that soft thresholding, as implemented in the NSC procedure, typically produces more reliable estimates of the true means since many of the 𝑥̅_𝑗𝑔 values will be noisy and close to the overall mean 𝑥̅_𝑗 (Donoho and Johnstone, 1994).

Hard thresholding can be used as an alternative to the soft thresholding in (2.5) (Tibshirani et al., 2003:111). Hard thresholding retains all the predictors with differences

𝑑_𝑗𝑔

∆ −∆

greater than ∆ in absolute value and discards the other predictors. This implies that (2.5) changes to

𝑑_𝑗𝑔′ = 𝑑𝑗𝑔 ∙ 𝐼(|𝑑𝑗𝑔| > ∆).

From the equation above it is clear that hard thresholding differs from soft thresholding shown in (2.5) in the sense that instead of shrinking all the 𝑑𝑗𝑔’s by an amount ∆ towards

zero (as in (2.5)), the 𝑑_𝑗𝑔’s larger than ∆ remain unchanged. The main disadvantage of using hard thresholding is that it has a “jumpy” nature: as ∆ is increased a predictor that has a full contribution 𝑑_𝑗𝑔 is suddenly set to zero (Tibshirani et al., 2003:111). This typically leads to a procedure with larger variance than one based on soft thresholding.

Returning to the soft threshold NSC procedure, the new shrunken versions of 𝑥̅𝑗𝑔 are

calculated by reversing the transformation in (2.4):

𝑥̅_𝑗𝑔′ _{= 𝑥̅}

𝑗+ 𝑚𝑔(𝑠𝑗+ 𝑠0)𝑑𝑗𝑔′ .

These shrunken centroids 𝑥̅_𝑗𝑔′ are substituted in place of 𝑥̅𝑗𝑔 in the discriminant score.

This procedure therefore results in variable selection as only variables with a non-zero 𝑑_𝑗𝑔′ _{for at least one of the classes play a role in the classification rule and typically a large}

number of variables can be discarded. Therefore, NSC actually implements variable selection and thereby dimension reduction.

As mentioned previously, the discriminant score defined for each data case in NSC is similar to that of diagonal LDA (Tibshirani et al., 2001:6569). Suppose there exists a test data case vector with expression levels 𝒙∗ _{= (𝑥}

1∗, 𝑥2∗, … , 𝑥𝑝∗), then the discriminant score

for class 𝑔 is defined as

𝛿_𝑔(𝒙∗_{) = − ∑}(𝑥𝑗 ∗_{− 𝑥̅} 𝑗𝑔′)2 (𝑠_𝑗+ 𝑠₀)2 + 2𝑙𝑜𝑔𝜋𝑔. 𝑝 𝑗=1 (2.6)

The first term in (2.6) standardises the squared distance of 𝒙∗ to the 𝑔𝑡ℎ shrunken centroid. The second term, 2𝑙𝑜𝑔𝜋𝑔, makes a correction based on the class prior

probability 𝜋_𝑔. Note that ∑𝐺𝑔=1𝜋𝑔 = 1 and that 𝜋𝑔 provides the overall relative frequency

of class 𝑔 in the population. The prior probability 𝜋_𝑔 is typically estimated by the sample prior 𝜋̂𝑔 =𝑁_𝑁𝑔; however, if the sample prior is not representative of the population then

more realistic or equal priors can be used, i.e. 𝜋𝑔 =_𝐺1 (Tibshirani et al., 2001:6569).

Tibshirani et al. (2001:6569) explain that “if the smallest distances are close and hence ambiguous, the prior correction in (2.6) gives a preference for larger classes, because they potentially account for more errors.”

The NSC classification rule is then

𝐶(𝒙∗_{) = ℓ if 𝛿}

𝑙(𝒙∗) = 𝑚𝑎𝑥𝑔𝛿𝑔(𝒙∗).

Estimates of the class probabilities, analogous to Gaussian linear discriminant analysis, can be constructed using the discriminant scores in (2.6):

𝑝̂

_𝑔

(𝒙

∗

_{) =}

𝑒

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 33-37)