4. SIMULATION STUDIES INVOLVING CONTINUOUS DATA
4.6.3 Results
The first analysis looked at the error rate estimators for LDA and QDA only. The R main effect (F = 8.71) and six R * factor interactions were significant at the α = 1% level (and indeed at the α = 0.1% level). The largest among these were the R * f(.) (F = 12.07), R * e (F = 7.55) and R * f(.) * e (F = 6.00) interactions. Of the first order interactions, only the R * δ interaction was not significant, showing that for both LDA and QDA all error rate estimators produced fairly similar results, no matter what the distance between populations. Note, though, that when δ = 2, R(B) = 0.159 and when δ = 3, R(B) = 0.067; hence the probability models studied here were for fairly well separated populations. Ganeshanandam and Krzanowski (1990) found the R * δ interaction to be highly significant, although they used δ = 1.01 and δ = 2.53 (R(B) = 0.291 and R(B) = 0.103) and thus based their results on a much wider range of Bayes error rates. The results given here, though, were comparable with those of Fitzmaurice et al (1991) for R(B) = 0.05 and 0.15.
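The two Bayes error rates quoted for this study agree with the standard result for two normal populations with a common covariance matrix and equal priors, R(B) = Φ(−δ/2), where δ is the Mahalanobis distance between the populations and Φ is the standard normal distribution function. The following is a minimal sketch of that check; the function name is illustrative, and the equal-priors, equal-covariance setting is an assumption of the sketch rather than a description of every probability model in the study.

```python
import math

def bayes_error_equal_priors(delta: float) -> float:
    """Bayes error Phi(-delta/2) for two normal populations with a common
    covariance matrix and equal priors (the setting assumed in this sketch)."""
    # Phi(x) = 0.5 * erfc(-x / sqrt(2)), hence Phi(-delta/2) = 0.5 * erfc(delta / (2 * sqrt(2)))
    return 0.5 * math.erfc(delta / (2.0 * math.sqrt(2.0)))

for delta in (2.0, 3.0):
    # Prints the distance and the corresponding Bayes error to three decimals.
    print(delta, round(bayes_error_equal_priors(delta), 3))
```

Under these assumptions the sketch gives 0.159 for a distance of 2 and 0.067 for a distance of 3, matching the values in the text.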
When comparing the seven error rate estimators for CART, the R main effect (F = 50.76) and seven R * factor interactions were significant at the α = 1% level. Of the first order interactions, only the R * f(.) interaction was not significant, which was as expected, given that CART is robust to non-normality of the variables. The results showed, though, that the CART error rate estimators were very sensitive to the choice of the number of variables, sample size, Mahalanobis distance between populations and the priors-covariance structure of the data.
A final analysis was done to compare all the error rate estimators over the three methods. As expected, with so many error rate estimators over different methods, almost all the R * factor interactions were significant at the α = 1% level. A personal correspondence from David Hand suggested that this approach may be infeasible if there were different residual variances among the method-error rate estimator combinations. A plot of the residuals against each of the method-error rate estimator combinations showed that the assumption of equal residual variances was not really valid. As suggested by Hand, a weighting of the MSE's for each method-error rate estimator combination was carried out, using 1/s_i^2 as the weights, where s_i^2 is the variance of the MSE's for the ith method-error rate estimator combination. The results showed a number of differences from the unweighted analysis, in that slightly fewer of the R * factor interactions were significant but, in the main, the important effects identified in the unweighted analysis also showed up in the weighted analysis. The magnitude of the F-ratios can thus still be used to indicate which were the most important effects influencing the performance of the various error rate estimators across the three methods.
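The inverse-variance weighting described above can be sketched as follows. This is only an illustration of how a weight equal to the reciprocal of the variance of the MSE's is formed for each method-error rate estimator combination; the function name and the numerical values are hypothetical and are not taken from the study, and the weighted ANOVA itself is not reproduced.

```python
import statistics

def inverse_variance_weights(mse_by_combination):
    """Return one weight per combination: the reciprocal of the sample
    variance of that combination's MSE values."""
    return {
        name: 1.0 / statistics.variance(values)  # sample variance (n - 1 divisor)
        for name, values in mse_by_combination.items()
    }

# Hypothetical MSE values, for illustration only.
mse = {
    "LDA R(CV)": [0.004, 0.007, 0.005, 0.006],
    "CART R(ACV)": [0.010, 0.030, 0.008, 0.025],
}
weights = inverse_variance_weights(mse)
# The more variable combination receives the smaller weight.
for name, w in weights.items():
    print(name, round(w))
```

Combinations whose MSE's vary widely are thereby down-weighted, which is the purpose of the adjustment when residual variances are unequal.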
Tables 4.17 to 4.19 show the mean values for the most important second order interactions, those being R * n * δ (F = 5.79) and R * f(.) * e (F = 4.37). Table 4.17 shows that most estimators were more reliable (that is, had lower mean square error) when δ = 3 than when δ = 2 for the smaller sample size. A notable exception to the rule was R(0.632) for CART, which confirms what was shown in Crawford (1989). R(0.632) for LDA had lower mean square error for δ = 3 than for δ = 2, which confirms a trend shown in Fitzmaurice et al (1991). When n = 300, some of the CART error rate estimators were more reliable when δ = 2 than when δ = 3. Of the R(CV) estimators, those for QDA and CART were the most reliable while those for LDA were rather unreliable, due most probably to the large variability of the estimator. Overall, R(0.632) for CART did best when δ = 2, no matter what the sample size. When δ = 3, R(CV) for CART did best when n = 60 and R(TEN) for CART did best when n = 300.
Table 4.17: Average mean square errors (MSE's) for different error rate estimators using LDA, QDA and CART with respect to the sample size-distance interaction (n * δ) (× 10^-5).

                                     n = 60             n = 300
Method   Error Rate Estimator    δ = 2   δ = 3      δ = 2   δ = 3
LDA      R(CV)                     708     344        504     424
         R(A)                      564     271        492     417
         R(ROT)                    427     525        484     366
         R(0.632)                  400     375        483     381
QDA      R(CV)                     189     111         84      39
         R(A)                      278      86         70      33
         R(ROT)                    773     533         57      34
         R(0.632)                  366     174         49      31
CART     R(CV)                     409      51         57     101
         R(ACV)                   2869     763         36     293
         R(ROT)                    185     234         58      56
         R(AR)                    1238     687        221     224
         R(0.632)                  106     118         23      74
         R(TEN)                    483     477         66      24
         R(AT)                    1532     850        262     184
Table 4.18 shows that the error rate estimators using LDA and QDA were generally more reliable than those for CART, for normally distributed data. Of the R(CV) estimates, those for LDA
and QDA were the most reliable except when π1 = 0.75 and Σ2 = 3Σ1 (e = 5), where CART
did best and LDA especially fell down. The R(0.632) estimator for CART performed
variable. The R(ROT) estimate for CART did best overall in the equal covariance, unequal priors case (e = 3).
Table 4.18: Average mean square errors (MSE's) for different error rate estimators using LDA, QDA and CART with respect to priors-covariance structure (e) and normal data (× 10^-5).

Method   Error Rate Estimator   e = 1   e = 2   e = 3   e = 4   e = 5
LDA      R(CV)                     60      49     109     152     440
         R(A)                     108      20      64     112     268
         R(ROT)                    18      97     331     140     374
         R(0.632)                  29      52     179      91     315
QDA      R(CV)                     87      55     173      39     194
         R(A)                     411     133      83      87      39
         R(ROT)                   232     294     704     197     940
         R(0.632)                 237     104     237      56     372
CART     R(CV)                    121     182     192     204      74
         R(ACV)                  1422    1203     970    1064     699
         R(ROT)                   114     175      56     177     145
         R(AR)                   1027     421     482     605     427
         R(0.632)                 102      47      79     101      72
         R(TEN)                   434     297     187     268     126
         R(AT)                   1247     651     409     710     519

Legend:
e = 1: π1 = 0.5, Σ1 = Σ2 = I
e = 2: π1 = 0.5, Σ1 = I, Σ2 = 3Σ1
e = 3: π1 = 0.25, Σ1 = Σ2 = I
e = 4: π1 = 0.25, Σ1 = I, Σ2 = 3Σ1
e = 5: π1 = 0.75, Σ1 = I, Σ2 = 3Σ1

Table 4.19 shows the error rate estimators for the lognormally distributed data. The general
trend observed was that the estimators for LDA deteriorated markedly in the case of unequal
error rate. The estimators for QDA remained relatively constant for all levels of e and not too dissimilar from the normal data situation. Naturally, the CART estimators were exactly the same as in the normal data situation. Of the R(CV) estimators, LDA did best in the case of
both equal priors and covariances, CART for unequal priors but equal covariances and QDA
elsewhere. The R(0.632) estimate for CART had consistently low mean square error for all levels of e.
Table 4.19: Average mean square errors (MSE's) for different error rate estimators using LDA, QDA and CART with respect to priors-covariance structure (e) and lognormal data (× 10^-5).

Method   Error Rate Estimator   e = 1          e = 2   e = 3   e = 4   e = 5
LDA      R(CV)                  [illegible]       89     652     323    1150
         R(A)                   50               100     499    2959     182
         R(ROT)                 [illegible]      125    1178     197    1240
         R(0.632)               [illegible]       70     873    2264     208
QDA      R(CV)                  43                84     267      62      52
         R(A)                   62                75     210      20      56
         R(ROT)                 191              269     178     184     305
         R(0.632)               71                89     160      67     157
CART     R(CV)                  121              182     192     204      74
         R(ACV)                 1422            1203     970    1064     699
         R(ROT)                 114              175      56     177     145
         R(AR)                  1027             421     482     605     427
         R(0.632)               102               47      79     101      72
         R(TEN)                 434              297     187     268     126
         R(AT)                  1247             651     409     710     519

Legend:
e = 1: π1 = 0.5, Σ1 = Σ2 = I
e = 2: π1 = 0.5, Σ1 = I, Σ2 = 3Σ1
e = 3: π1 = 0.25, Σ1 = Σ2 = I
e = 4: π1 = 0.25, Σ1 = I, Σ2 = 3Σ1
e = 5: π1 = 0.75, Σ1 = I, Σ2 = 3Σ1

4.6.4 Summary
This study found that the differences between the error rate estimators for CART were most affected by dimension, sample size, distance between populations and the priors-covariance structure of the data. The differences between the error rate estimators for LDA and QDA were affected most by all the above factors except distance between populations, and were also affected by the distribution of the data. Considering the error rate estimators over all methods, it was found that estimation of the actual error rate depended very much on the sample size-distance interaction as well as the distribution-priors-covariance structure interaction.
Overall, the QDA and CART error rate estimates most closely approximated the actual error rate, excluding the apparent error rates for CART, which were unreliable in most situations. Of the n-fold cross-validation estimates, those for QDA and CART were the best, usually having the lowest mean square error. The n-fold cross-validation estimates for LDA, away from the ideal situations, were found to be rather poor.
The 0.632 error rate estimate for CART was found to be very reliable throughout and not influenced to a great extent by any of the factors. This was confirmed by a separate ANOVA explaining the effects on the 0.632 estimator alone. It was noted that this estimator was particularly good for less well separated populations. On the other hand, the 0.632 estimate for LDA (and QDA) was sensitive to changes in the priors-covariance structure of the data and to lognormal data.
4.7 CONCLUSIONS
In this chapter, a comparative study was undertaken of four classification methods, namely LDA, QDA, CART and FACT, on the basis of predictive accuracy. The four methods were compared over different dimensions, sample sizes, distances between populations, distributions and priors-covariance structures. The results showed that LDA and QDA performed best for normal data, higher dimension and well separated populations, while CART performed well for lognormal data, lower dimension, less well separated populations and equal covariances with unequal priors. QDA was found to be the preferred method in the case of unequal covariances. In most situations, FACT's prediction rules were a poor fourth.
In Section 4.5, a study was undertaken, based on the findings from real data sets, to determine the effect on the individual group error rates for each of the four classification methods. These studies found that CART was affected quite drastically by the ratio of class sample sizes used, though not to the same extent as FACT. The individual error rates for LDA and especially QDA were least affected by unequal class sample sizes. It was recommended from that study that caution should be shown when using CART on data sets with grossly unequal class sample sizes.
In the final section of this chapter, an investigation was carried out into the reliability of each n-fold cross-validation error rate estimate for LDA, QDA and CART over the different probability models studied. The results showed that the n-fold cross-validation estimates were fairly reliable for QDA and CART in most situations, but that those for LDA were unreliable in situations where there were unequal covariance matrices. A good deal of promise was shown by the 0.632 estimator for CART, in that it performed uniformly well over all situations studied. The reliability of this and other error rate estimates will be investigated in more detail in Chapter 6.
It could therefore be concluded that LDA and QDA would be preferred over CART in many situations, and CART would be the preferred method in others, if accuracy were the sole measure of the performance of a method.
5. SIMULATION STUDIES INVOLVING CATEGORICAL DATA
5.1 INTRODUCTION
Often in multivariate data settings, problems do not involve continuous variables. Rather, the problem may involve ordered categorical variables such as the ratings of a certain product (bad, average, good) or the number of years of education. These variables can be treated as continuous, although the requirement of multivariate ellipsoidality may not always be met. In other situations, the problem may involve unordered categorical or nominal variables, for which there is no natural ordering of the categories. Race and marital status are two examples of nominal variables.
The general approach taken by traditional discrimination methods for such variables is to code the c categories into (c − 1) binary variables, where

\[ x_j = \begin{cases} 1 & \text{if category } c_j \text{ is present} \\ 0 & \text{otherwise.} \end{cases} \]

As seen earlier, CART and many other tree-based procedures do not require the use of binary variables to handle nominal variables. Instead, most tree-based procedures attempt to find the grouping of categories that leads to the least overall misclassification error. FACT, in contrast, is one tree-based method which takes the LDA approach to handling categorical variables.
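The coding of a nominal variable with c categories into (c − 1) binary variables can be illustrated with a short sketch. The function name, the choice of the last category as the reference level and the example category labels are all assumptions made for illustration.

```python
def dummy_code(value, categories):
    """Code one nominal value into (c - 1) binary indicator variables.

    The last entry of `categories` is treated as the reference level
    (an assumption of this sketch) and is coded as all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == cat else 0 for cat in categories[:-1]]

marital_status = ["single", "married", "widowed", "divorced"]  # illustrative levels
print(dummy_code("married", marital_status))   # [0, 1, 0]
print(dummy_code("divorced", marital_status))  # [0, 0, 0]
```

With four categories the variable is replaced by three indicators, and the reference category is identified by all three being zero.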
This chapter focuses on a comparison of LDA, QDA, CART and FACT for the above type of data. Firstly, the four methods are compared from an accuracy point of view in classifying observations into two distinct populations, Π1 and Π2. To allow a direct comparison between CART and the other three methods, only p-variate binary data are used. As in Chapter 4, the reliability of these classification methods is investigated, as well as different error rate estimates for LDA and QDA.
5.2 PREVIOUS STUDIES
Moore (1973) carried out some simulation studies using six-dimensional, bimodal, binary data sets comparing LDA and QDA with various multinomial procedures. Two factors were varied, those being
p_ij = probability of getting a response x_j = 1 for Π_i
and r_ijk = correlation between x_j and x_k for Π_i.
Moore's results showed that LDA performs very well except when there is a "reversal" in the log-likelihoods for each population. Moore illustrates this with the following example. Let x_1 and x_2 be given as

\[ x_1 = \begin{cases} 1 & \text{if birth weight is high} \\ 0 & \text{if birth weight is low} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if gestation length is long} \\ 0 & \text{if gestation length is short} \end{cases} \]

A baby would be classified as normal when x_1 = 0 and x_2 = 0 or when x_1 = 1 and x_2 = 1; otherwise it is abnormal.
The optimal linear rule would be to assign x to Π_1 if

\[ a(x) = \beta_0 + \sum_{j=1}^{2} \beta_j x_j > c . \]

Now a(1, 1) = a(0, 1) + a(1, 0) − a(0, 0), so that

\[ a(1,1) > \max\{a(1,0),\, a(0,1)\} \;\Rightarrow\; a(0,0) < \min\{a(1,0),\, a(0,1)\} . \]

Hence, if (1, 1) is assigned to Π_1 while (1, 0) and (0, 1) are assigned to Π_2, then (0, 0) also has to be assigned to Π_2. This leads to gross misclassification errors. The problem lies in using a monotonically increasing function of x_1 and x_2 to approximate the log-likelihood, L(x), which is not monotonic. In the two-dimensional case, the problem can be solved quite simply by including an interaction term in the model. When there are a large number of variables, however, this approach is infeasible.
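The reversal in Moore's two-variable example, and its removal by an interaction term, can be checked numerically. The sketch below is illustrative only: the coefficient grid, the cutoff of zero and the hand-chosen interaction coefficients are assumptions, not values from Moore (1973).

```python
from itertools import product

# Moore's example: (0, 0) and (1, 1) are "normal"; (0, 1) and (1, 0) are not.
CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]
IS_NORMAL = {(0, 0): True, (0, 1): False, (1, 0): False, (1, 1): True}

def score(x, b0, b1, b2, b12=0.0):
    """Linear score a(x), with an optional interaction coefficient b12."""
    x1, x2 = x
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def classifies_all(coefs, cutoff=0.0):
    """True if 'assign to the normal group when a(x) > cutoff' is right on every cell."""
    return all((score(x, *coefs) > cutoff) == IS_NORMAL[x] for x in CELLS)

# Coarse grid search over purely linear rules (no interaction term). None can
# succeed, because a(1,1) + a(0,0) always equals a(0,1) + a(1,0).
grid = [v / 2 for v in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
linear_ok = any(classifies_all((b0, b1, b2)) for b0, b1, b2 in product(grid, repeat=3))

# One hand-chosen set of coefficients that includes an interaction term.
interaction_ok = classifies_all((0.5, -1.0, -1.0, 2.0))

print(linear_ok, interaction_ok)  # False True
```

The grid search is only a demonstration; the impossibility for linear rules follows from the identity stated in the comment, whatever coefficients are tried.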
Others have shown that these reversals are caused not only by different correlation structures in the two populations but also by moderate and large positive correlations. Krzanowski (1977) considered a mixture of both binary and continuous random variables with various values of p_ij used and r_ijk set to either 0 or 0.375 (no, or moderate, positive correlation). The results showed that LDA performed well when there was no correlation between the binary variables, but performed poorly if there was a moderate positive correlation among all the binary variables or if the correlations between the binary and continuous variables differed markedly between the two groups. For both types of data, QDA has been found rarely to perform as well as LDA.
Ganeshanandam and Krzanowski (1990) compared a number of different error rate estimators for LDA, as well as the n-fold cross-validation estimate for QDA, for multivariate binary data. Their simulation results showed that the p_ij's and sample size all had significant effects on the estimation of the actual error rate, though the r_ijk factor did not.
5.3 SIMULATION STUDY I
5.3.1 Study Plan
The factors used in this study were the same as those employed in Ganeshanandam and Krzanowski (1990) in order to be able to check the results for LDA and QDA against theirs. The factor p had settings of 5 and 10. A separate analysis was done for each of the two dimension levels. Factor n had three settings, those being "small, medium and large" relative to the number of variables. In the case of p = 5, n = 20, 60 and 100, while for p = 10, n = 40, 120 and 200. In all cases, π1 = π2 = 0.5, implying that class sample sizes were equal. When p = 5, factor r_ijk had two levels, those being all r_ijk = 0 and all r_ijk = 0.25. When p = 10, all r_ijk were set to zero for the first five variables and to 0.25 for the second block of five. The last factor was the values of the p_ij's. The levels of the p_ij's are shown in Table 5.1, with level 1 corresponding to wide separation between groups and increasing levels leading to narrower separation. Level 5 corresponds to identically distributed populations. For p = 10, the p_ij's for the first block of five variables were repeated for the second block of five. This gives 45
multinomial learning samples for p = 5 and 10 combined. Three replicates for each
sets in total (90 for each dimension size). Generation of the data was a straightforward process using MINITAB macros.
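The count of 45 factor combinations for the two dimension levels combined can be checked by enumerating the design described in this section. The sketch below only counts combinations; the variable names are illustrative, and it does not attempt to reproduce the MINITAB data generation or the replication scheme.

```python
from itertools import product

levels = range(1, 6)  # five patterns of the response probabilities (Table 5.1)

# p = 5: three sample sizes, two correlation settings, five probability patterns.
cells_p5 = list(product([20, 60, 100], [0.0, 0.25], levels))

# p = 10: three sample sizes, a single correlation setting
# (0 within the first block of five variables, 0.25 within the second), five patterns.
cells_p10 = list(product([40, 120, 200], ["0 / 0.25 by block"], levels))

print(len(cells_p5), len(cells_p10), len(cells_p5) + len(cells_p10))  # 30 15 45
```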
Table 5.1: Values of p_ij for each set of five binary variables.

                      Π1                                  Π2
Level    x1     x2     x3     x4     x5      x1     x2     x3     x4     x5
1       0.20   0.20   0.20   0.20   0.20    0.80   0.80   0.80   0.80   0.80
2       0.25   0.30   0.35   0.40   0.45    0.75   0.70   0.65   0.60   0.55
3       0.40   0.45   0.50   0.55   0.60    0.60   0.55   0.50   0.45   0.40
4       0.25   0.30   0.35   0.40   0.45    0.45   0.40   0.35   0.30   0.25
5       0.30   0.40   0.50   0.60   0.70    0.30   0.40   0.50   0.60   0.70
The methods were compared by means of the n-fold cross-validation error rates, R(CV), except for FACT, where ten-fold cross-validation was used as outlined in Section 4.3. For both CART and FACT, the minimum size below which a node will not be split was set to five, while for CART alone, the one standard error rule was used. A split-plot ANOVA was used to identify which factors led to differences between the methods, as in Chapter 4. Tables of means and the standard deviations of the differences between means are presented for each significant effect.
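The n-fold (leave-one-out) cross-validation error rate used for the comparison can be sketched generically as follows. The nearest-centroid classifier is a stand-in chosen only to keep the sketch self-contained; it is not one of the four methods studied, and the function names and the toy binary data are hypothetical.

```python
def nearest_centroid_fit(X, y):
    """Stand-in classifier (not one of the methods studied): per-class mean vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose mean is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def loo_cv_error(X, y, fit, predict):
    """n-fold (leave-one-out) cross-validation error rate."""
    errors = 0
    for i in range(len(X)):
        # Refit on all observations except the i-th, then classify the i-th.
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(model, X[i]) != y[i]
    return errors / len(X)

# Tiny illustrative binary data set (hypothetical, not from the study).
X = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [1, 1, 0], [1, 0, 1]]
y = [1, 1, 1, 2, 2, 2]
cv_error = loo_cv_error(X, y, nearest_centroid_fit, nearest_centroid_predict)
print(cv_error)
```

Any classifier exposing the same fit and predict interface could be substituted for the stand-in, which is why the two are passed in as arguments.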
5.3.2 Results
For the case p = 5, the results showed that the R (method) main effect (F = 10.19) was highly significant and dominated the variation in error. All interactions involving R had F-values less than 1.1. This meant that there were differences between the methods when using p = 5 binary variables, and that these differences were not influenced by other factors.
In both the analyses for p = 5 and p = 10, the plot of residuals against fitted values showed no real trends, in contrast with the results for the continuous data. The plots of residuals against each method showed there to be roughly equal residual variances for each method. Individual ANOVA's were constructed for each method separately. These results confirmed the above finding of equal residual variances. Therefore, the results of the split-plot ANOVA are strictly valid.
The average values for each method, when p = 5, are given in Table 5.2, as well as the
standard error of difference between the methods. The results showed that CART was the best method by some distance from LDA, FACT and QDA.
Table 5.2: Means and standard error of the differences in means of the cross-validation error rate estimates for each method (p = 5).

LDA      QDA      CART     FACT
0.343    0.368    0.304    0.358

Standard error of the difference between means = 0.013.
In the case of p = 10, the split-plot ANOVA showed that the R (method) main effect (F = 20.78) was highly significant, though the R * p_ij (method by probability pattern) interaction (F = 4.2) and R * n (method by sample size) interaction (F = 4.91) were also highly significant (α < 0.001). This showed that for p = 10, there were not only differences between the methods but that these differences were affected by sample size and, to a slightly lesser extent, the pattern of probabilities in the parent populations.
Table 5.3 shows that increasing sample size had the effect of reducing the error rates for all methods except CART, where sample size had no real effect. The greatest reduction in error rate occurred when going from n = 40 to n = 120, for LDA, QDA and FACT. CART did notably best when n = 40 and slightly better than LDA when n = 120. When n = 200, however, LDA did better than CART. FACT was a poor fourth except when n = 40, where, because of