Discriminant Analysis

Step 1. Consider a variable-ranking scheme for X defined by a rule ν, and let X ν be the ranked data

Step 2. For 2 ≤ p ≤ r, consider the p-ranked data

$$X_{\nu,p} = \begin{bmatrix} X_{\bullet,\nu_1} \\ \vdots \\ X_{\bullet,\nu_p} \end{bmatrix},$$

which consist of the first p rows of X_ν.

Step 3. Derive the rule r_p from r for the data X_ν,p, classify these p-dimensional data, and calculate the error E_p based on X_ν,p.

Step 4. Put p= p + 1, and repeat steps 2 and 3.

Step 5. Find the value p^∗ which minimises the classification error:

$$p^{*} = \operatorname*{argmin}_{2 \le p \le d} E_p.$$

Then p^∗ is the number of variables that results in the best classification with respect to the error E.
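The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the data are stored as a d × n array with variables in rows, the rule is a two-class Fisher rule, the error E_p is taken to be the number of misclassified training observations, and the names `fisher_rule_errors` and `best_dimension`, the synthetic data and the ranking `order` are all hypothetical.

```python
import numpy as np

def fisher_rule_errors(X, y):
    """Fit a two-class Fisher rule on X (p x n, variables in rows) and
    return the number of observations it misclassifies (assumed error E_p)."""
    X0, X1 = X[:, y == 0], X[:, y == 1]
    m0, m1 = X0.mean(axis=1), X1.mean(axis=1)
    n0, n1 = X0.shape[1], X1.shape[1]
    # Pooled within-class covariance; atleast_2d covers the case p = 1.
    W = ((n0 - 1) * np.atleast_2d(np.cov(X0))
         + (n1 - 1) * np.atleast_2d(np.cov(X1))) / (n0 + n1 - 2)
    w = np.linalg.solve(W, m1 - m0)        # discriminant direction
    threshold = w @ (m0 + m1) / 2          # midpoint between projected means
    pred = (w @ X > threshold).astype(int)
    return int(np.sum(pred != y))

def best_dimension(X, y, order, p_min=2):
    """Steps 1-5: rank the rows of X by `order`, classify with the first
    p rows for p = p_min, ..., d, and return the minimiser p_star."""
    X_ranked = X[order, :]                 # Step 1: ranked data
    d = X.shape[0]
    errors = {}
    for p in range(p_min, d + 1):          # Steps 2-4
        errors[p] = fisher_rule_errors(X_ranked[:p, :], y)
    p_star = min(errors, key=errors.get)   # Step 5; ties go to the smallest p
    return p_star, errors

# Synthetic two-class data: only variables 0 and 3 carry class information.
rng = np.random.default_rng(0)
d, n = 6, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(d, n))
X[0, y == 1] += 2.0
X[3, y == 1] += 2.0

order = np.array([0, 3, 1, 2, 4, 5])       # hypothetical ranking, best first
p_star, errors = best_dimension(X, y, order)
print(p_star, errors)
```

Because `min` scans the dimensions in increasing order, ties in the error are resolved in favour of the smaller p, which matches the preference for a parsimonious model.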

An important difference between Algorithms 4.2 and 4.3 is that the former uses the p-dimensional PC data, whereas the latter is based on the first p ranked variables.

Example 4.12 We continue with a classification of the breast cancer data. I apply Algorithm 4.3 with four variable-ranking schemes, including the original order of the variables, which I call the identity ranking, and compare their performance to that of Algorithm 4.2.

The following list explains the notation of the five approaches and the colour choices in Figure 4.10. Items 2 to 5 of the list refer to Algorithm 4.3. In each case, p is the number of variables, and p ≤ d.

1. The PCs W^{(p)} of Algorithm 4.2. The results are shown as a blue line with small dots in the top row of Figure 4.10.

2. The identity ranking, which leaves the variables in the original order, leads to subsets X_I,p. The results are shown as a maroon line in the top row of Figure 4.10.

3. The ranking induced by the scaled difference of the class means, d of (4.50). This leads to X_d,p. The results are shown as a black line in all four panels of Figure 4.10.

4. The ranking induced by the sphered difference of the class means, b of (4.50). This leads to X_b,p. The results are shown as a red line in the bottom row of Figure 4.10.

5. The ranking induced by s with coefficients (4.52) and Y the vector of labels. This leads to X_s,p. The results are shown as a blue line in the bottom row of Figure 4.10.
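The rankings of items 2 to 4 can be sketched as below. Equation (4.50) is not reproduced in this section, so the forms used here are assumptions: d is taken as the class-mean difference divided by the pooled standard deviation of each variable, b as the class-mean difference premultiplied by the inverse square root of the pooled covariance matrix, and variables are ranked by decreasing absolute value of the entries. The function names and the synthetic data are hypothetical, and the ranking s of item 5 is omitted because (4.52) is not available here.

```python
import numpy as np

def pooled_cov(X, y):
    """Pooled within-class covariance of X (variables in rows)."""
    X0, X1 = X[:, y == 0], X[:, y == 1]
    n0, n1 = X0.shape[1], X1.shape[1]
    return ((n0 - 1) * np.cov(X0) + (n1 - 1) * np.cov(X1)) / (n0 + n1 - 2)

def class_mean_diff(X, y):
    return X[:, y == 1].mean(axis=1) - X[:, y == 0].mean(axis=1)

def rank_identity(X):
    """Identity ranking: keep the original order of the variables."""
    return np.arange(X.shape[0])

def rank_scaled(X, y):
    """Assumed form of d: mean difference scaled by pooled standard deviations."""
    S = pooled_cov(X, y)
    d_vec = class_mean_diff(X, y) / np.sqrt(np.diag(S))
    return np.argsort(-np.abs(d_vec))

def rank_sphered(X, y):
    """Assumed form of b: S^(-1/2) applied to the mean difference."""
    S = pooled_cov(X, y)
    evals, evecs = np.linalg.eigh(S)       # S is symmetric
    S_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    b_vec = S_inv_sqrt @ class_mean_diff(X, y)
    return np.argsort(-np.abs(b_vec))

# Synthetic data in which only variable 2 separates the classes.
rng = np.random.default_rng(0)
n_vars, n = 4, 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n_vars, n))
X[2, y == 1] += 3.0

order_identity = rank_identity(X)
order_d = rank_scaled(X, y)
order_b = rank_sphered(X, y)
print(order_identity, order_d, order_b)
```

Each function returns an array of variable indices, best first, so X[order[:p], :] gives the p-ranked data of Step 2.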

Table 4.9 Number of Misclassified Observations (# Misclass), Best Dimension p^∗, and Colour in Figure 4.10 for the Breast Cancer Data of Example 4.12

                      Raw                 Scaled
#   Data       # Misclass   p^∗     # Misclass   p^∗     Colour in Fig. 4.10
1   W^{(p)}        12       16          12       25      Blue and dots
2   X_I,p          12       25          12       25      Maroon
3   X_d,p          11       21          11       21      Black
4   X_b,p          12       11          12       16      Red
5   X_s,p          15       19          11       21      Solid blue

Figure 4.10 Number of misclassified observations versus dimension p (on the x-axis) for the breast cancer data of Example 4.12: raw data (left) and scaled data (right); X_d,p in black in all panels, W^{(p)} (blue with dots) and X_I,p (maroon) in the top two panels; X_b,p (red) and X_s,p (solid blue) in the bottom two panels. In the bottom-right panel, the blue and black graphs agree.

In Example 4.6 we saw that Fisher’s rule misclassified fifteen observations, and the normal rule misclassified eighteen observations. We therefore use Fisher’s rule in this analysis and work with differently selected subsets of the variables.

Table 4.9 reports the results of classification with the five approaches. Separately for the raw and scaled data, it lists p^∗, the optimal dimension for each method, and the number of observations that are misclassified with the best p^∗. The last column, ‘Colour’, refers to the colours used in Figure 4.10 for the different approaches.

Figure 4.10 complements the information provided in Table 4.9 and displays the number of misclassified observations for the five approaches as a function of the dimension p shown on the x-axis. The left subplots show the number of misclassified observations for the raw data, and the right subplots give the same information for the scaled data. The colours in the plots are those given in the preceding list and in the table. There is considerable discrepancy for the raw data plots but much closer agreement for the scaled data, with identical ranking vectors for the scaled data of approaches 3 and 5.

The black line is the same for the raw and scaled data because the ranking uses standardised variables. For this reason, it provides a benchmark for the other schemes. The performance of the identity ranking (in maroon) is initially worse than the others and needs a larger number p^∗ than the other approaches in order to reach the small error of twelve misclassified observations.
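The invariance of the ranking under scaling can be checked numerically. The sketch below assumes that the scaled difference of the class means divides each entry of the mean difference by the pooled standard deviation of that variable, and that the scaled data are obtained by dividing each variable by its sample standard deviation; both are assumptions, and the function name and the synthetic data are hypothetical.

```python
import numpy as np

def scaled_mean_diff(X, y):
    """Class-mean difference divided by the pooled standard deviation
    of each variable (variables in rows)."""
    X0, X1 = X[:, y == 0], X[:, y == 1]
    n0, n1 = X0.shape[1], X1.shape[1]
    pooled_var = ((n0 - 1) * X0.var(axis=1, ddof=1)
                  + (n1 - 1) * X1.var(axis=1, ddof=1)) / (n0 + n1 - 2)
    return (X1.mean(axis=1) - X0.mean(axis=1)) / np.sqrt(pooled_var)

# Raw data with very different variable scales and different class shifts.
rng = np.random.default_rng(1)
n = 120
y = np.repeat([0, 1], n // 2)
X_raw = rng.normal(size=(5, n))
X_raw[:, y == 1] += np.array([0.5, 1.0, 1.5, 2.0, 0.0])[:, None]
X_raw *= np.array([1.0, 10.0, 100.0, 0.1, 5.0])[:, None]

# Scaled data: each variable divided by its overall standard deviation.
X_scaled = X_raw / X_raw.std(axis=1, ddof=1, keepdims=True)

d_raw = scaled_mean_diff(X_raw, y)
d_scaled = scaled_mean_diff(X_scaled, y)
same_ranking = np.array_equal(np.argsort(-np.abs(d_raw)),
                              np.argsort(-np.abs(d_scaled)))
print(np.allclose(d_raw, d_scaled), same_ranking)
```

Multiplying a variable by a constant multiplies both its mean difference and its pooled standard deviation by that constant, so the ratio, and hence the ranking, does not change.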

An inspection of the bottom panels shows that the blue line on the left does not perform so well. Here we have to take into account that the ranking of Bair et al. (2006) is designed for regression rather than discrimination. For the scaled data, the blue line coincides with the black line and is therefore not visible. The red line is comparable with the black line.

The table shows that the raw and scaled data lead to similar performances, but more variables are required for the scaled data before the minimum error is achieved. The ranking of X_d,p (in black) does marginally better than the others and coincides with X_s,p for the scaled data. If a small number of variables is required, then X_b,p (in red) does best. The compromise between smallest error and smallest dimension is interesting and leaves the user options for deciding which method is preferable. Approaches 3 and 4 perform better than 1 and 2: the error is smallest for approach 3, and approach 4 results in the most parsimonious model.
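The trade-off between error and dimension can be read off Table 4.9 mechanically. The short sketch below transcribes the table entries as (number misclassified, p^∗) pairs and reports, for the raw and scaled data separately, which approaches attain the smallest error and which attain the smallest p^∗. The dictionary layout and the function name `summarise` are illustrative choices only.

```python
# (number misclassified, p_star) per approach, transcribed from Table 4.9.
table = {
    "W^(p)": {"raw": (12, 16), "scaled": (12, 25)},
    "X_I,p": {"raw": (12, 25), "scaled": (12, 25)},
    "X_d,p": {"raw": (11, 21), "scaled": (11, 21)},
    "X_b,p": {"raw": (12, 11), "scaled": (12, 16)},
    "X_s,p": {"raw": (15, 19), "scaled": (11, 21)},
}

def summarise(data_type):
    """Return the approaches with the fewest errors and with the smallest p_star."""
    rows = {name: vals[data_type] for name, vals in table.items()}
    min_err = min(err for err, _ in rows.values())
    min_p = min(p for _, p in rows.values())
    fewest_errors = [name for name, (err, _) in rows.items() if err == min_err]
    most_parsimonious = [name for name, (_, p) in rows.items() if p == min_p]
    return fewest_errors, most_parsimonious

for data_type in ("raw", "scaled"):
    fewest_errors, most_parsimonious = summarise(data_type)
    print(data_type, "fewest errors:", fewest_errors,
          "smallest p*:", most_parsimonious)
```

For both data types the smallest error and the smallest dimension are attained by different approaches, which is the compromise described above.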

In summary, the analysis shows that

1. variable ranking prior to classification

(a) reduces misclassification, and

(b) leads to a more parsimonious model, and that

2. the PC data with sixteen PCs reduce the classification error from fifteen to twelve.

The analysis of the breast cancer data shows that no single approach yields best results, but variable ranking and a suitably chosen number of variables can improve classification.

There is a trade-off between the smallest number of misclassified observations and the most parsimonious model, which gives the user extra flexibility. If time and resources allow, I recommend applying more than one method and, if appropriate, applying different approaches to the raw and scaled data.

In Section 13.3 we continue with variable ranking and variable selection and extend the approaches of this chapter to HDLSS data.

In document Koch I. Analysis of Multivariate and High-Dimensional Data 2013 (Page 190-193)