Chapter 3 Permutation Tests for Covariance Structures
3.1 Box M Test and Permutation Test
This section discusses the formal testing procedure for differences in covariance structures using Box’s M test (Box, 1949; Mardia et al., 1979), together with the related permutation testing procedure proposed for use in the later chapters of this work.
Box’s M test is a likelihood ratio test of homogeneity of covariance matrices. In the traditional Box’s M test, the null and alternative hypotheses are given by
\[ H_0 : \Sigma_1 = \Sigma_2 \quad \text{vs} \quad H_1 : \Sigma_1 \neq \Sigma_2, \]
where $\Sigma_1$ is the covariance matrix for the first sample and $\Sigma_2$ is the covariance matrix for the second sample.
The test statistic and asymptotic distribution under $H_0$ for Box’s M test are given as follows (Mardia et al., 1979). The test statistic $M$ is given by
\[ M = \gamma \sum_{i=1}^{w} (n_i - 1) \log \left| S_{ui}^{-1} S_u \right|, \]
where
\[ \gamma = 1 - \frac{2p^2 + 3p - 1}{6(p+1)(w-1)} \left( \sum_{i=1}^{w} \frac{1}{n_i - 1} - \frac{1}{n - w} \right), \]
$n_i$ is the $i$th sample size, $n = \sum_{i} n_i$, and $S_{ui}$ and $S_u$ are the unbiased estimators
\[ S_{ui} = \frac{n_i}{n_i - 1} S_i, \qquad S_u = \frac{n}{n - w} S. \]
The maximum likelihood estimate of $\Sigma_i$ is $S = n^{-1} \sum_{i} n_i S_i$ under $H_0$ and $S_i$ under $H_1$, where $S_i$ is the sample covariance matrix of the $i$th sample (with divisor $n_i$). In our DNA example, $w = 2$ and $p = 60$. In general, $w$ is the number of covariance structures being compared and $p$ is the dimension of each covariance matrix. The covariance estimates used in this procedure are the covariance matrices computed in Section 2.6. This allows the calculations to be carried out in the tangent space.
For large samples, Box’s M statistic has an asymptotic $\chi^2$ distribution with $\tfrac{1}{2} p(p+1)(w-1)$ degrees of freedom. The traditional Box’s M test is used in Chapters 4 and 5.
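The statistic and its degrees of freedom can be sketched in a few lines of Python. This is a minimal illustration of the formulas above, not the implementation used for the results in this work; the function name `box_m` and the use of NumPy are assumptions made for the example. A p-value would then follow from the upper tail of the $\chi^2$ distribution with `df` degrees of freedom (for instance via `scipy.stats.chi2.sf`).

```python
import numpy as np


def box_m(samples):
    """Return (M, df) for Box's M test.

    samples: list of w arrays, each of shape (n_i, p) with p >= 2.
    The name and interface are illustrative assumptions.
    """
    w = len(samples)
    p = samples[0].shape[1]
    n_i = np.array([x.shape[0] for x in samples])
    n = n_i.sum()

    # Unbiased estimates S_ui (divisor n_i - 1).
    s_ui = [np.cov(x, rowvar=False, ddof=1) for x in samples]
    # S_u = n / (n - w) * S with S = (1/n) * sum(n_i * S_i),
    # which equals sum((n_i - 1) * S_ui) / (n - w).
    s_u = sum((m - 1) * s for m, s in zip(n_i, s_ui)) / (n - w)

    # log|S_ui^{-1} S_u| = log|S_u| - log|S_ui|; slogdet is used because
    # raw determinants can overflow or underflow when p is large.
    logdet_u = np.linalg.slogdet(s_u)[1]
    logdets = np.array([np.linalg.slogdet(s)[1] for s in s_ui])

    gamma = 1.0 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (w - 1)) * (
        np.sum(1.0 / (n_i - 1)) - 1.0 / (n - w)
    )
    m_stat = gamma * np.sum((n_i - 1) * (logdet_u - logdets))
    df = p * (p + 1) * (w - 1) // 2
    return m_stat, df
```

Working with log-determinants rather than determinants is a design choice for numerical stability; it does not change the value of $M$.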
The results from these tests on the DNA data are given in the table below. The first and second columns of the table give the damaged and undamaged molecules, respectively. The third column contains the test statistic value for each test. The fourth column contains the p-value for each test.
Table 3.1: Box’s M Test Results. This table contains the results of Box’s M test for the damaged-undamaged DNA molecule pairs. The test statistic value is given in the column labeled “Statistic”. The p-value is given in the column labeled “p-value”. It should be noted that the p-values from this test may not be accurate due to the violations of the Box’s M test assumptions.
Damaged   Undamaged   Statistic    p-value
AFA       AGA         37887.034    0
AFC       AGC         35592.063    0
AFG       AGG         12428.683    0
TFA       TGA          7837.304    0
TFC       TGC         37549.346    0
TFT       TGT          9997.205    0
Table 3.1 shows that each test returned highly significant results. The p-value for each test was machine zero. These results seem very promising, but the assumptions for Box’s M test are not met. The underlying assumptions for Box’s M test are that
each $n_i$ exceeds 20, and that $w$ and $p$ are at most 5 (Mardia et al., 1979). Although $n_i = 2500$ and $w = 2$, each test has $p = 60$. Given this violation of the assumption about $p$, the significance results of these tests should be treated with caution.
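To give a sense of scale, the degrees of freedom and the correction factor $\gamma$ of Box’s M test can be evaluated for this setting. The calculation assumes that both samples in each pair have $n_i = 2500$, so that $n = 5000$, with $w = 2$ and $p = 60$:
\[ \tfrac{1}{2} p(p+1)(w-1) = \tfrac{1}{2} \cdot 60 \cdot 61 \cdot 1 = 1830, \]
\[ \gamma = 1 - \frac{2 \cdot 60^2 + 3 \cdot 60 - 1}{6 \cdot 61 \cdot 1} \left( \frac{2}{2499} - \frac{1}{4998} \right) = 1 - \frac{7379}{366} \cdot \frac{1}{1666} \approx 0.9879. \]
Under these assumptions the reference distribution is $\chi^2$ with 1830 degrees of freedom, and the correction factor is close to 1.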
The modified version of Box’s M test used in this chapter to conduct the permutation tests assumes that the mean shapes of the damaged and undamaged molecules in each test are equal. This changes the null and alternative hypotheses relative to the traditional Box’s M test. The null and alternative hypotheses used in the permutation tests in this chapter are
\[ H_0 : \Sigma_D = \Sigma_U, \ \mu_D = \mu_U \quad \text{vs} \quad H_1 : \Sigma_D \neq \Sigma_U \ \text{or} \ \mu_D \neq \mu_U, \]
where $\Sigma_D$ is the covariance matrix for the damaged DNA molecule, $\Sigma_U$ is the covariance matrix for the undamaged DNA molecule, $\mu_D$ is the mean shape of the damaged DNA molecule, and $\mu_U$ is the mean shape of the undamaged DNA molecule. This test is conducted for each of the six pairs of damaged and undamaged DNA molecules (e.g. AFA vs AGA, TFA vs TGA, etc.). With this alteration, the data are exchangeable under the null hypothesis, which is required for a permutation test.
To address the problem with the assumptions, a permutation test based on Box’s M test is used. Using a permutation test instead of a distribution-based test gives results that do not depend on the assumed distribution of the test statistic. The procedure for the test applied to the DNA dataset is given below. The idea behind the test is to compute the test statistic from the original data and then permute the data many times to obtain a large number of test statistics from the permuted data. The original test statistic is then compared to the test statistics from the permuted data.

The test procedure below is a block permutation test, so particular care must be taken in choosing the block size. In this testing procedure, blocks of data are permuted instead of separately permuting each observation in the dataset. The method for permuting the blocks of data is outlined in the following procedure for carrying out the permutation test. An important feature of the permutations is that the testing procedure used in this study only permutes blocks by swapping them between the two groups; it does not swap blocks of data within a group. For example, swapping block 1 in group 1 with block 1 in group 2 is possible, but swapping block 1 in group 1 with block 2 in group 1 is not.

In the traditional Box’s M test, the distribution used to find p-values is a chi-squared distribution. With a permutation test, the p-values no longer rely on the chi-squared approximation. Instead, the p-values for the permutation test are calculated by counting the number of permutations with test statistics greater than or equal to the test statistic associated with the unpermuted data. The disadvantage of this method, particularly for large datasets, is the computational cost of the testing procedure.
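The between-group block swapping and the counting rule for the p-value can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: both groups are assumed to have the same number of observations, block $k$ of one group is swapped only with block $k$ of the other, swaps are drawn at random rather than enumerated, and the p-value is taken as the count divided by the number of permutations. The function names and the stand-in statistic `logdet_statistic` (an uncorrected Box-type log-determinant ratio) are hypothetical.

```python
import numpy as np


def _logdet(s):
    # Log-determinant via slogdet for numerical stability.
    return np.linalg.slogdet(s)[1]


def logdet_statistic(x, y):
    """Stand-in statistic (assumption): Box-type log-determinant ratio
    for two groups, without the correction factor gamma."""
    n1, n2 = len(x), len(y)
    s1 = np.cov(x, rowvar=False, ddof=1)
    s2 = np.cov(y, rowvar=False, ddof=1)
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    return (n1 - 1) * (_logdet(pooled) - _logdet(s1)) + (n2 - 1) * (
        _logdet(pooled) - _logdet(s2)
    )


def block_swap_permutation_test(x, y, block_size, n_perm=999,
                                statistic=logdet_statistic, seed=0):
    """Return (observed statistic, permutation p-value).

    x, y: arrays of shape (n, p) with equal n, n divisible by block_size.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.shape[0] % block_size != 0:
        raise ValueError("groups must match in shape and split evenly into blocks")
    n_blocks = x.shape[0] // block_size
    rng = np.random.default_rng(seed)

    observed = statistic(x, y)
    count = 0
    for _ in range(n_perm):
        # One coin flip per block: swap block k between the groups or not.
        swap = rng.random(n_blocks) < 0.5
        mask = np.repeat(swap, block_size)[:, None]
        x_perm = np.where(mask, y, x)
        y_perm = np.where(mask, x, y)
        # Count permuted statistics at least as large as the observed one.
        if statistic(x_perm, y_perm) >= observed:
            count += 1
    return observed, count / n_perm
```

Because blocks are only exchanged between groups at matching positions, the ordering of observations within each block is preserved. The cost grows linearly with `n_perm`, since the statistic is recomputed for every permutation.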