Methods for Comparison
7.3.2 Statistical measures
The following measures are designed principally to explain the performance of statistical algorithms, but are likely to be more generally applicable. Often they are much influenced by the simple measures above. For example, the skewness measure often reflects the
Sec. 7.3] Characterisation of datasets 113
number of binary attributes, and if this is so, the skewness and kurtosis are directly related to each other. However, the statistical measures in this section are generally defined only for continuous attributes. Although it is possible to extend their definitions to include discrete and even categorical attributes, the most natural measures for such data are the information theoretic measures discussed in section 7.3.3.
Test statistic for homogeneity of covariances
The covariance matrices are fundamental in the theory of linear and quadratic discrimination detailed in Sections 3.2 and 3.3, and the key to understanding when to apply one and not the other lies in the homogeneity or otherwise of the covariances. One measure of the lack of homogeneity of covariances is the geometric mean ratio of the standard deviations of the populations of individual classes to the standard deviations of the sample, and is given by SD ratio (see below). This quantity is related to a test of the hypothesis that all populations have a common covariance structure, i.e. to the hypothesis
$$H_0 : \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q ,$$
which can be tested via Box's $M$ test statistic:
$$M = \gamma \sum_{i=1}^{q} (n_i - 1) \log \left| S_i^{-1} S \right| ,$$
where
$$\gamma = 1 - \frac{2p^2 + 3p - 1}{6(p+1)(q-1)} \left\{ \sum_{i=1}^{q} \frac{1}{n_i - 1} - \frac{1}{n - q} \right\} ,$$
and $S_i$ and $S$ are the unbiased estimators of the $i$th sample covariance matrix and the pooled covariance matrix respectively. This statistic has an asymptotic $\chi^2_{p(p+1)(q-1)/2}$ distribution, and the approximation is good if each $n_i$ exceeds 20, and if $p$ and $q$ are both much smaller than every $n_i$.
In datasets reported in this volume these criteria are not always met, but the $M$ statistic can still be computed and used as a characteristic of the data. The $M$ statistic can be re-expressed as the geometric mean ratio of standard deviations of the individual populations to the pooled standard deviations, via the expression
$$\text{SD ratio} = \exp \left\{ \frac{M}{p \sum_{i=1}^{q} (n_i - 1)} \right\} .$$
The SD ratio is strictly greater than unity if the covariances differ, and is equal to unity if and only if the $M$ statistic is zero, i.e. all individual covariance matrices are equal to the pooled covariance matrix.
In every dataset that we looked at the $M$ statistic is significantly different from zero, in which case the SD ratio is significantly greater than unity.
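As an illustration only, the computation of Box's $M$ statistic and the SD ratio might be sketched as follows. The function name `box_m_and_sd_ratio`, the use of NumPy, and the layout of the inputs (an $n \times p$ attribute matrix and a vector of class labels) are assumptions made for this sketch and are not part of the original study. It also assumes at least two classes and non-singular class covariance matrices.

```python
import numpy as np

def box_m_and_sd_ratio(X, y):
    """Return (M, SD ratio) for an n x p attribute matrix X and class labels y.

    Hypothetical helper: assumes q >= 2 classes, each with a non-singular
    sample covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    q = len(classes)

    sizes, covs = [], []
    pooled = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        n_i = Xc.shape[0]
        # Unbiased class covariance S_i (divisor n_i - 1).
        S_i = np.atleast_2d(np.cov(Xc, rowvar=False))
        sizes.append(n_i)
        covs.append(S_i)
        pooled += (n_i - 1) * S_i
    pooled /= (n - q)  # unbiased pooled covariance S

    # Correction factor gamma from the text.
    gamma = 1.0 - (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (q - 1)) * (
        sum(1.0 / (n_i - 1) for n_i in sizes) - 1.0 / (n - q)
    )

    # log|S_i^{-1} S| = log|S| - log|S_i|, computed stably via slogdet.
    _, logdet_pooled = np.linalg.slogdet(pooled)
    m_stat = 0.0
    for n_i, S_i in zip(sizes, covs):
        _, logdet_i = np.linalg.slogdet(S_i)
        m_stat += (n_i - 1) * (logdet_pooled - logdet_i)
    m_stat *= gamma

    sd_ratio = float(np.exp(m_stat / (p * sum(n_i - 1 for n_i in sizes))))
    return float(m_stat), sd_ratio
```

If two classes share exactly the same sample covariance matrix, this sketch returns $M$ close to zero and an SD ratio close to unity, in line with the property stated above.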
Mean absolute correlation coefficient, corr.abs
The set of correlations $\rho_{ij}$ between all pairs of attributes gives some indication of the interdependence of the attributes, and a measure of that interdependence may be calculated as follows. The correlations $\rho_{ij}$ between all pairs of attributes are calculated for each class separately. The absolute values of these correlations are averaged over all pairs of attributes and over all classes, giving the measure corr.abs $= \bar{\rho}$, which is a measure of interdependence between attributes.
114 Methods for comparison [Ch. 7
If corr.abs is near unity, there is much redundant information in the attributes and some procedures, such as logistic discriminants, may have technical problems associated with this. Also, CASTLE, for example, may be misled substantially by fitting relationships to the attributes, instead of concentrating on getting right the relationship between the classes and the attributes.
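A minimal sketch of the corr.abs calculation is given below. It assumes NumPy, at least two attributes, and an unweighted average over classes; the text does not say whether classes are weighted by size, so that choice, like the function name `corr_abs`, is an assumption of this sketch.

```python
import numpy as np

def corr_abs(X, y):
    """Mean absolute within-class correlation over all attribute pairs.

    Hypothetical helper: assumes p >= 2 attributes and that every class has
    enough examples with non-constant attributes for correlations to exist.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = X.shape[1]
    upper = np.triu_indices(p, k=1)  # each attribute pair counted once
    values = []
    for c in np.unique(y):
        R = np.corrcoef(X[y == c], rowvar=False)  # p x p correlation matrix
        values.append(np.abs(R[upper]))
    # Average over all pairs and all classes (classes weighted equally).
    return float(np.mean(np.concatenate(values)))
```

For data in which one attribute is an exact linear function of another within every class, the sketch returns a value of unity, the redundant extreme described above.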
Canonical discriminant correlations
Assume that, in $p$ dimensional space, the sample points from each class form a cluster of roughly elliptical shape around the population mean of that class. In general, if there are $q$ classes, the $q$ means lie in a $q - 1$ dimensional subspace. On the other hand, it happens frequently that the classes form some kind of sequence, so that the population means are strung out along some curve that lies in $k$ dimensional space, where $k < q - 1$. The simplest case of all occurs when $k = 1$ and the population means lie along a straight line. Canonical discriminants are a way of systematically projecting the mean vectors in an optimal way to maximise the ratio of between-mean distances to within-cluster distances, successive discriminants being orthogonal to earlier discriminants. Thus the first canonical discriminant gives the best single linear combination of attributes that discriminates between the populations. The second canonical discriminant is the best single linear combination orthogonal to the first, and so on. The success of these discriminants is measured by the canonical correlations. If the first canonical correlation is close to unity, the $q$ means lie nearly along a straight line. If the $(k+1)$th canonical correlation is near zero, the means lie in $k$ dimensional space.
Proportion of total variation explained by first k (=1,2,3,4) canonical discriminants
This is based on the idea of describing how the means for the various populations differ in attribute space. Each class (population) mean defines a point in attribute space, and, at its simplest, we wish to know if there is some simple relationship between these class means, for example, if they lie along a straight line. The sum of the first $k$ eigenvalues of the canonical discriminant matrix divided by the sum of all the eigenvalues represents the “proportion of total variation” explained by the first $k$ canonical discriminants. The total variation here is $\mathrm{tr}(\Sigma)$. We calculate, as fractk, the values of
$$(\lambda_1 + \cdots + \lambda_k) / (\lambda_1 + \lambda_2 + \cdots + \lambda_p) \quad \text{for } k = 1, 2, 3, 4.$$
This gives a measure of collinearity of the class means. When the classes form an ordered sequence, for example soil types might be ordered by wetness, the class means typically lie along a curve in low dimensional space. The $\lambda$'s are the squares of the canonical correlations. The significance of the $\lambda$'s can be judged from the statistics produced by “manova”. This representation of linear discrimination, which is due to Fisher (1936), is discussed also in Section 3.2.
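The canonical correlations and the fractk values might be computed along the following lines. This is a sketch under stated assumptions rather than the procedure used for the tables: it takes the squared canonical correlations to be the eigenvalues of $T^{-1}B$, where $B$ and $T$ are the between-class and total sums-of-squares-and-products matrices, and it uses NumPy. The names `canonical_fractions`, `W`, `B` and `T` are introduced here for illustration.

```python
import numpy as np

def canonical_fractions(X, y, max_k=4):
    """Return (canonical correlations, fract_k for k = 1..max_k).

    Hypothetical helper: the squared canonical correlations are taken as the
    eigenvalues of T^{-1} B, with T = W + B assumed non-singular.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)

    W = np.zeros((p, p))  # within-class sums of squares and products
    B = np.zeros((p, p))  # between-class sums of squares and products
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centred = Xc - mean_c
        W += centred.T @ centred
        d = (mean_c - grand_mean)[:, None]
        B += Xc.shape[0] * (d @ d.T)
    T = W + B

    # Eigenvalues of T^{-1} B; tiny imaginary or negative parts are rounding noise.
    lam = np.linalg.eigvals(np.linalg.solve(T, B)).real
    lam = np.sort(np.clip(lam, 0.0, 1.0))[::-1]

    canonical_corr = np.sqrt(lam)
    fract = (np.cumsum(lam) / lam.sum())[:max_k]
    return canonical_corr, fract
```

When the class means lie exactly on a straight line, $B$ has rank one, so the first fraction returned is unity to within rounding error and the remaining eigenvalues are zero.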
Departure from normality
The assumption of multivariate normality underlies much of classical discrimination procedures. But the effects of departures from normality on the methods are not easily or clearly understood. Moreover, in analysing multiresponse data, it is not known how robust classical procedures are to departures from multivariate normality. Most studies on robustness depend on simulation studies. Thus, it is useful to have measures for verifying the reasonableness of assuming normality for a given dataset. If available, such a measure would be helpful in guiding the subsequent analysis of the data to make it more normally distributed, or suggesting the most appropriate discrimination method. Andrews et al.
(1973), whose excellent presentation we follow in this section, discuss a variety of methods for assessing normality.
With multiresponse data, the possibilities for departure from joint normality are many and varied. One implication of this is the need for a variety of techniques with differing sensitivities to the different types of departure and to the effects that such departures have on the subsequent analysis.
Of great importance here is the degree of commitment one wishes to make to the coordinate system for the multiresponse observations. At one extreme is the situation where the interest is completely confined to the observed coordinates. In this case, the marginal distributions of each of the observed variables and conditional distributions of certain of these given certain others would be the objects of interest.
At the other extreme, the class of all nonsingular linear transformations of the variables would be of interest. One possibility is to look at all possible linear combinations of the variables and find the maximum departure from univariate normality in these combinations (Machado, 1983). Mardia et al. (1979) give multivariate measures of skewness and kurtosis that are invariant to affine transformations of the data: critical values of these statistics for small samples are given in Mardia (1974). These measures are difficult to compare across datasets with differing dimensionality. They also have the disadvantage that they do not reduce to the usual univariate statistics when the attributes are independent.
Our approach is to concentrate on the original coordinates by looking at their marginal distributions. Moreover, the emphasis here is on a measure of non-normality, rather than on a test that tells us how statistically significant is the departure from normality. See Ozturk & Romeu (1992) for a review of methods for testing multivariate normality.
Univariate skewness and kurtosis
The usual measure of univariate skewness (Kendall et al., 1983) is $\gamma_1$, which is the ratio of the mean cubed deviation from the mean to the cube of the standard deviation
$$\gamma_1 = E(X - \mu)^3 / \sigma^3 ,$$
although, for test purposes, it is usual to quote the square of this quantity: $\beta_1 = \gamma_1^2$.
Another measure is defined via the ratio of the fourth moment about the mean to the fourth power of the standard deviation:
$$\beta_2 = E(X - \mu)^4 / \sigma^4 .$$
The quantity $\beta_2 - 3$ is generally known as the kurtosis of the distribution. However, we will refer to $\beta_2$ itself as the measure of kurtosis: since we only use this measure relative to other measurements of the same quantity within this book, this slight abuse of the term kurtosis may be tolerated. For the normal distribution, the measures are $\beta_1 = 0$ and $\beta_2 = 3$, and we will say that the skewness is zero and the kurtosis is 3, although the usual definition of kurtosis gives a value of zero for a normal distribution.
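The sample versions of $\gamma_1$ and $\beta_2$ can be sketched in a few lines. The sketch below replaces the expectations by sample moments with divisor $n$, which is an assumption: the text does not state which sample estimator was used. The function name `skewness_and_kurtosis` is introduced here for illustration.

```python
def skewness_and_kurtosis(values):
    """Return (gamma_1, beta_2) for a sequence of numbers.

    Hypothetical helper: uses central sample moments with divisor n and
    assumes the values are not all equal.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    gamma1 = m3 / m2 ** 1.5  # skewness
    beta2 = m4 / m2 ** 2     # kurtosis in the sense used here (3 for a normal)
    return gamma1, beta2


# A small symmetric sample has zero skewness.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
```

Note that $\beta_2$ is returned without subtracting 3, following the convention adopted in this section.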
Mean skewness and kurtosis
Denote the skewness statistic for attribute in populationÛ
by Ôß| c ~ g . As a single measure of skewness for the whole dataset, we quote the mean of the absolute value of Ô'|
cB
~
g
, averaged over all attributes and over all populations. This gives the measure ÉDd
I
¶¹Ö¨É . For a normal population, ÉEd
I
¶¹Ö¨É is zero: for uniform and exponential variables, the theoretical values ofÉDd
I
116 Methods for comparison [Ch. 7
the mean of the univariate standardised fourth momentõ
cB
~
g
, averaged over all attributes and populations. This gives the measureõ
. For a normal population,õ
8 exactly, and the corresponding figures for uniform and exponential variables are 1.8 and 9, respectively. Univariate skewness and kurtosis of correlated attributes
The univariate measures above have very large variances if the attributes are highly correlated. It may therefore be desirable to transform to uncorrelated variables before finding the univariate skewness and kurtosis measures. This may be achieved via the symmetric inverse square-root of the covariance matrix. The corresponding kurtosis and skewness measures (kurt inv and skew inv, say) may be more reliable for correlated attributes. By construction, these measures reduce to the univariate values if the attributes are uncorrelated. Although they were calculated for all the datasets, these particular measures are not quoted in the tables, as they are usually similar to the univariate statistics.
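The dataset-level measures, with and without the decorrelating transformation, might be sketched as follows. Several choices here are assumptions rather than details given in the text: the use of NumPy, moments with divisor $n$, an unweighted average over classes, and the use of each class's own covariance matrix (rather than the pooled one) for the inverse square-root transformation. The names `inverse_sqrt` and `mean_skew_kurt` are hypothetical.

```python
import numpy as np

def _gamma1_beta2(Z):
    """Column-wise gamma_1 and beta_2 using central moments with divisor n."""
    D = Z - Z.mean(axis=0)
    m2 = (D ** 2).mean(axis=0)
    return (D ** 3).mean(axis=0) / m2 ** 1.5, (D ** 4).mean(axis=0) / m2 ** 2

def inverse_sqrt(S):
    """Symmetric inverse square root of a positive-definite matrix S."""
    w, V = np.linalg.eigh(S)  # S = V diag(w) V^T
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def mean_skew_kurt(X, y, decorrelate=False):
    """Return (mean absolute skewness, mean kurtosis) over attributes and classes.

    With decorrelate=True each class is first transformed by the symmetric
    inverse square root of its own covariance matrix (an assumption), giving
    the analogues of skew inv and kurt inv.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    skews, kurts = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        if decorrelate:
            S = np.atleast_2d(np.cov(Xc, rowvar=False))
            Xc = (Xc - Xc.mean(axis=0)) @ inverse_sqrt(S)
        g1, b2 = _gamma1_beta2(Xc)
        skews.append(np.abs(g1))
        kurts.append(b2)
    return float(np.mean(np.concatenate(skews))), float(np.mean(np.concatenate(kurts)))
```

Because skewness and kurtosis are unaffected by rescaling an attribute, the two variants agree whenever the attributes within each class are exactly uncorrelated, which matches the remark above that the transformed measures reduce to the univariate values in that case.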