Chapter 3 Permutation Tests for Covariance Structures
3.4 Permutation Test Simulations
To determine the appropriateness of the tests described above, a simulation study was conducted using the following method of permutation testing:
Simulation 1. 1. Register the original data XiU and XiD, where XiU is the original undamaged molecule dataset and XiD is the original damaged molecule dataset, and estimate µD, µU, ΣD, ΣU.
2. The observations used for the simulations were generated by inducing a correlation, ρ, and a magnification of the error terms, δ. The initial generated observation was a randomly selected value from a multivariate normal distribution with mean µ and covariance Σ, where µ and Σ are the mean shape and covariance matrix for the datasets. Generate the datasets for the simulations according to X_i = ρ(X_{i−1} − µ) + µ + e_i, where e_i ∼ MVN(0, δ^{.5}Σ) for i = 2, 3, . . . , n. Note that in this setting the mean shape and covariance matrix, µ and Σ, used to generate both the undamaged and damaged observations are the mean and covariance structure of XiU.
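The generation step can be sketched in Python as follows. This is a minimal illustration, not the code used for the study: it assumes each shape is stored as a flat vector, reads the error covariance as δ^{.5}Σ, and uses hypothetical names (generate_series, rng) and toy values for µ, Σ, ρ, and δ.

```python
import numpy as np

def generate_series(mu, sigma, rho, delta, n, rng):
    """Generate n correlated observations X_1, ..., X_n around the mean mu.

    Assumption: the error covariance is delta**0.5 * sigma, which is one
    reading of the formula in the text.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.empty((n, mu.size))
    # First observation: a single draw from MVN(mu, sigma).
    x[0] = rng.multivariate_normal(mu, sigma)
    err_cov = np.sqrt(delta) * sigma
    for i in range(1, n):
        # X_i = rho * (X_{i-1} - mu) + mu + e_i, with e_i ~ MVN(0, err_cov).
        e = rng.multivariate_normal(np.zeros(mu.size), err_cov)
        x[i] = rho * (x[i - 1] - mu) + mu + e
    return x

# Toy values, not estimates from the DNA data.
rng = np.random.default_rng(0)
mu = np.zeros(4)
sigma = np.eye(4)
x = generate_series(mu, sigma, rho=0.5, delta=1.0, n=500, rng=rng)
```

Setting rho to 0 in this sketch gives independent draws, which makes it easy to compare correlated and uncorrelated series under the same µ and Σ.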
We now present a permutation test for testing equal covariances and equal means, which we denote as Permutation Test A.
Permutation Test A. 1. Carry out pooled GPA on the whole dataset to get the registered dataset X^p_{ij}, i = 1, . . . , n, j = 1, 2.
2. Obtain the group Procrustes means µ̂_1 and µ̂_2, where µ̂_j = (1/n) ∑_{i=1}^{n} X^p_{ij}.
3. Within each group, obtain the tangent coordinates v_{ij} = T_{µ̂_j}(X^p_{ij}), i = 1, . . . , n, j = 1, 2, where T_µ(X) denotes the tangent coordinates at µ.
4. Work out the factored MLEs Σ̂_1 and Σ̂_2 for each group based on v_{ij}, and evaluate the test statistic T_0 = d(Σ̂_1, Σ̂_2).
5. For I = 1, . . . , h:
a) Swap/permute the blocks of size b at random to get X^s_{ij}, i = 1, . . . , n, j = 1, 2.
b) Obtain the group Procrustes means of the permuted data, µ̂^s_1 and µ̂^s_2.
c) Within each group, obtain the tangent coordinates v^s_{ij} = T_{µ̂^s_j}(X^s_{ij}), j = 1, 2.
d) Work out the factored MLEs Σ̂^s_1 and Σ̂^s_2 for each group based on v^s_{ij}, and evaluate the test statistic T_I = d(Σ̂^s_1, Σ̂^s_2).
6. Compute the p-value (1/h) ∑_{I=1}^{h} 1(T_I ≥ T_0), where 1(T_I ≥ T_0) = 1 if T_I ≥ T_0 and 1(T_I ≥ T_0) = 0 otherwise.
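The resampling loop in steps 4 to 6 can be sketched in Python under several simplifying assumptions. The sketch starts from coordinate vectors that are already registered, so GPA and the tangent projection are omitted. It uses the ordinary sample covariance in place of the factored MLE and the Frobenius norm of the difference as the distance d. It also assumes that the blocks of the two groups are pooled and randomly reassigned, which is one reading of step 5a. The function names are hypothetical.

```python
import numpy as np

def euclidean_cov_distance(s1, s2):
    # Frobenius norm of the difference between two covariance matrices.
    return np.linalg.norm(s1 - s2, ord="fro")

def block_permutation_test(v1, v2, b, h, rng, dist=euclidean_cov_distance):
    """Return (T_0, p-value) for two groups of n row vectors each."""
    n, p = v1.shape
    if v2.shape != (n, p) or n % b != 0:
        raise ValueError("groups must have equal shape, with n divisible by b")
    # Observed statistic T_0 from the unpermuted groups.
    t0 = dist(np.cov(v1, rowvar=False), np.cov(v2, rowvar=False))
    # Cut each group into consecutive blocks of b rows and pool them.
    blocks = np.concatenate([v1, v2]).reshape(-1, b, p)
    m = n // b  # number of blocks per group
    count = 0
    for _ in range(h):
        order = rng.permutation(2 * m)
        g1 = blocks[order[:m]].reshape(n, p)
        g2 = blocks[order[m:]].reshape(n, p)
        # np.cov recentres each permuted group at its own mean.
        t = dist(np.cov(g1, rowvar=False), np.cov(g2, rowvar=False))
        count += int(t >= t0)
    # p-value as in step 6: the proportion of T_I >= T_0.
    return t0, count / h

# Toy data: the second group has a much larger covariance.
rng = np.random.default_rng(1)
v1 = rng.normal(size=(200, 3))
v2 = 3.0 * rng.normal(size=(200, 3))
t0, p_value = block_permutation_test(v1, v2, b=10, h=200, rng=rng)
```

Passing a different function as dist would swap in another of the distances considered in this chapter without changing the loop.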
The registered shapes in the above algorithm were computed using Generalized Procrustes analysis, using the “procGPA” command in the shapes library (Dryden,
2013). For the simulation studies presented in the later chapters, the choice of b was
b=100. Other block sizes were investigated in a smaller study to pick the most useful
block size. The choice of b=100 allowed for tests which preserved the correlation
structure of the data without upsetting the performance of the permutation testing
procedure. The other choices of b that were investigated were b=50 and b=200. An example of choosing the block size is given in Section 6.2 for a much smaller dataset, the Rat Calvarial Growth dataset explored in that chapter. The goal is to choose blocks long enough to give a computationally efficient testing procedure while also preserving the structure of the data. In the case of the DNA dataset, the observations are time dependent and it is of interest to preserve this structure within the blocks.
In the case of the example dataset in Section 6.2, the landmark recordings for the rat skulls are not time dependent, so preserving the correlation between observations is not of interest.
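The effect of the block size on a time dependent series can be illustrated with a toy Python sketch. It uses a scalar autoregressive series rather than shape data, shuffles blocks within a single series rather than between two groups, and uses hypothetical helper names, so it only shows the general idea that longer blocks retain more of the serial correlation.

```python
import numpy as np

def permute_blocks(x, b, rng):
    """Shuffle consecutive blocks of length b, keeping the order inside each block."""
    n = len(x)
    if n % b != 0:
        raise ValueError("n must be divisible by b")
    blocks = x.reshape(n // b, b, *x.shape[1:])
    return blocks[rng.permutation(n // b)].reshape(x.shape)

def lag1_autocorr(x):
    # Correlation between each value and its successor.
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Toy series with correlation rho between neighbouring observations.
rng = np.random.default_rng(0)
n, rho = 10_000, 0.8
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.normal()

# b = 1 permutes single observations; larger b keeps runs of neighbours together.
for b in (1, 50, 100, 200):
    print(b, round(lag1_autocorr(permute_blocks(x, b, rng)), 3))
```

With b = 1 the neighbouring values are unrelated after shuffling, while with larger blocks only the pairs that straddle a block boundary lose their dependence.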
Each of the 1000 simulations consists of a permutation test with 300 permutations of the blocks. The results of the size simulations for each of the tests are summarized in Section 3.5. The nominal significance level for the tests is 0.05. The tests are listed in the heading of the table with abbreviated names. BoxM, Riem, Proc, PShape, Chol, Power, Eucl, LogEucl, and RiemLE are the permutation tests based on Box’s M test, Riemannian distance, Procrustes distance, Procrustes Shape distance, Cholesky distance, Power distance, Euclidean distance, Log Euclidean distance, and Riemannian Le distance, respectively. In the situations where the size of the test was reasonable, a power simulation was conducted to assess the power of the tests. A further explanation of the power simulation method and the results of the power simulations can be found in Section 3.6.
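As a rough guide to reading the size results, the estimated size of a test is the proportion of simulated p-values that lead to rejection at the nominal level. The Python sketch below assumes rejection when the p-value is at most 0.05, a convention the text does not state, and uses the usual binomial standard error to indicate how much an estimate based on 1000 simulations can vary. The function names are hypothetical.

```python
import math

def empirical_size(p_values, alpha=0.05):
    # Proportion of simulations in which the test rejects at level alpha.
    rejections = sum(p <= alpha for p in p_values)
    return rejections / len(p_values)

def monte_carlo_se(alpha, n_sims):
    # Binomial standard error of a rejection rate whose true value is alpha.
    return math.sqrt(alpha * (1 - alpha) / n_sims)

# Standard error for 1000 simulations at the nominal level 0.05.
se = monte_carlo_se(0.05, 1000)
print(round(se, 4))
```

Under these assumptions, an estimated size within about two standard errors of 0.05 is compatible with the nominal level.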
The observations used for the simulations were generated by inducing a correlation, ρ, and a magnification of the error terms, δ. The initial generated observation was a randomly selected value from a multivariate normal distribution with mean µ and covariance Σ, where µ and Σ are the mean shape and covariance matrix based on the observation of the TGC DNA molecule. The estimated mean shape and covariance structure from this molecule were used as the mean shape, µ, and the estimated covariance structure, Σ, in the generation of the first observation as well as the remaining n−1 generated observations. The next n−1 observations were found by
X_i = ρ(X_{i−1} − µ) + µ + e_i, where e_i ∼ MVN(0, δ^{.5}Σ) for i = 2, 3, . . . , n. The tests in the following size simulation study were carried out on the factored covariance matrices. A detailed explanation of the process for factoring the covariance structure can be found in Section 2.6. It should be noted that the method proposed in this chapter assumes that the mean shapes of the two groups are equal. An explanation of the issues with this assumption is given in Section 2.3. In that chapter, it was shown that there are significant differences between the mean shapes of the damaged and undamaged versions of some of the DNA molecules. This assumption will be relaxed in Section 4.1, but this section provides a more computationally efficient testing procedure when the assumption of equal mean shapes between the two groups is valid. In Section 6.1, the testing procedure is applied to the real DNA dataset for each DNA molecule pair, where one molecule is undamaged and the other is the damaged version of that molecule. The factored covariance structures are used to reduce the dimension of the data. This dimension reduction is of particular interest for high dimensional data, but could be applied to lower dimensional problems as well. In Section 6.2, the same procedure is applied to a lower dimensional dataset, the Rat Calvarial Growth dataset. The tests in the power simulation study, Section 3.6, were also carried out on the factored covariance matrices.