5.4 ANCOVA*
5.4.1 Computation of Rank-Based ANCOVA
For the general one-way or k-way ANCOVA models, we have written R functions that compute the rank-based ANCOVA analyses; these functions are included in the R package npsm. We first discuss the computation of rank-based ANCOVA when the design is a one-way layout with k groups.
5 See Huitema (2011) and Watcharotone (2010).
Computation of Rank-Based ANCOVA for a One-Way Layout
For one-way ANCOVA models, we have written two R functions which make use of Rfit to compute a rank-based analysis of covariance. The function onecovaheter computes the test of homogeneous slopes. It also computes a test of no treatment (group) effect, but this test is based on the full model of heterogeneous slopes, so it should not be used if the hypothesis of homogeneous slopes is rejected. The second function, onecovahomog, assumes that the slopes are homogeneous and tests the hypothesis of no treatment effect; this is the adjusted analysis of covariance. It also tests the hypothesis that the covariates have no effect. The arguments to these functions are: the number of groups (the number of cells of the one-way design); an n × 2 matrix whose first column contains the response variable and whose second column contains the responses’ group identification; and the n × m matrix of covariates. The functions compute the analysis of covariance, summarizing the tests in ANCOVA tables. These tables are also returned in the value tab along with the full model fit in fit. For the function onecovahomog the full model is (5.15), while for the function onecovaheter the full model is (5.16).
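As a minimal sketch of how these three arguments might be assembled, the following code builds them from a simulated data set. The data frame sim, its column names, and the simulated model are hypothetical and serve only as an illustration. The calls to onecovaheter and onecovahomog follow the argument order described above and run only if the npsm package is installed.

```r
# Simulated one-way layout: 3 groups, 10 observations each, one covariate.
# All names and values here are hypothetical.
set.seed(2)
sim <- data.frame(
  treatment = rep(c("a", "b", "c"), each = 10),
  x = runif(30, min = 20, max = 50)
)
sim$y <- 3 - 0.05 * sim$x + rnorm(30, sd = 0.5)

# Argument 1: number of groups (cells of the one-way design).
grp <- factor(sim$treatment)
k <- nlevels(grp)

# Argument 2: n x 2 matrix, response first, group identification second.
# Integer group codes are an assumption about the expected coding.
ygroup <- cbind(y = sim$y, group = as.integer(grp))

# Argument 3: n x m matrix of covariates (here m = 1).
xcov <- cbind(x = sim$x)

# The rank-based analyses need npsm (which in turn uses Rfit).
if (requireNamespace("npsm", quietly = TRUE)) {
  library(npsm)
  heter <- onecovaheter(k, ygroup, xcov, print.table = TRUE)  # homogeneous-slopes test
  homog <- onecovahomog(k, ygroup, xcov, print.table = TRUE)  # adjusted analysis
}
```

Keeping the response and group columns in one matrix and the covariates in another mirrors the layout of the arguments, so the same objects can be passed to either function.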
We illustrate these functions with the following example.
Example 5.4.1 (Chateau Latour Wine Data). Sheather (2009) presents a dataset drawn from the Chateau Latour wine estate. The response variable is the quality of a vintage based on a scale of 1 to 5 over the years 1961 to 2004.
The predictor is end of harvest, the number of days between August 31st and the end of harvest for that year, and the factor of interest is whether or not it rained at harvest time. The data are in latour in the package npsm. We first compute the test for homogeneous slopes.
> data = latour[,c('quality','rain')]
> xcov = cbind(latour[,'end.of.harvest'])
> analysis = onecovaheter(2,data,xcov,print.table=T)
Robust ANCOVA (Assuming Heterogeneous Slopes) Table
             df        RD       MRD        F     p-value
Groups        1 0.8830084 0.8830084 2.395327 0.129574660
Homog Slopes  1 2.8012494 2.8012494 7.598918 0.008755332
Based on the robust ANCOVA table, the p-value for the test of homogeneous slopes (0.0088) is less than 0.01, so there is strong evidence of an interaction between the groups and the predictor. Hence, the test in the first row for treatment effect should be ignored.
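The columns of the robust ANCOVA table are linked by simple arithmetic: MRD is RD divided by its degrees of freedom, and, in a drop-in-dispersion test, F is MRD divided by a scale estimate taken from the full model fit. The sketch below re-enters the values printed in the table and recovers the implied scale, which should agree across the two rows. The object names are ours, and the interpretation of the denominator as a scale estimate is stated here as an assumption, since it is not defined in this section.

```r
# Values copied from the robust ANCOVA table for the Chateau Latour data.
tab <- data.frame(
  df    = c(1, 1),
  RD    = c(0.8830084, 2.8012494),
  Fstat = c(2.395327, 7.598918),
  row.names = c("Groups", "Homog Slopes")
)

# Mean reduction in dispersion: RD divided by its degrees of freedom.
tab$MRD <- tab$RD / tab$df

# Implied denominator of the F statistic; it should be the same in both
# rows because both tests use the same full model.
tab$scale <- tab$MRD / tab$Fstat

print(tab)
```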
To investigate this interaction, as shown in the left panel of Figure 5.4, we overlaid the fits of the two linear models over the scatterplot of the data. The dashed line is the fit for the group “rain at harvest time,” while the solid line is the fit for the group “no rain at harvest time.” For both groups, the quality of the wine decreases as the end of harvest is delayed, but the decrease is much steeper if it rains. Because of this interaction, the tests in the first two rows of the ANCOVA table are not of much interest. Based on the plot, interpretations from confidence intervals on the difference between the groups at days 25 and 50 would seem to differ.
FIGURE 5.4
The left panel is the scatterplot of the data of Example 5.4.1. The dashed line is the fit for the group “rain at harvest time” and the solid line is the fit for the group “no rain at harvest time.” The right panel is the normal probability plot of the Wilcoxon Studentized residuals.
If we believe the assumption of equal slopes is a reasonable one, then we may use the function onecovahomog which takes the same arguments as onecovaheter.
Computation of Rank-Based ANCOVA for a k-Way Layout
For the k-way layout, we have written the Rfit function kancova which computes the ANCOVA. Recall that the full model for the design is the cell mean (median) model. Under heterogeneous linear models, each cell in the model has a distinct linear model. This is the full model, Model 5.16, for testing homogeneous slopes. This function also computes the adjusted analysis, assuming that the slopes are homogeneous; that is, for these hypotheses the full model is Model 5.15. So these adjusted tests should be disregarded if there is reason to believe the slopes are different, certainly if the hypothesis of homogeneous slopes is rejected. For the adjusted tests, the standard hypotheses are those of the main effects and interactions of all orders as described in Section 5.3. A test of the hypothesis that all covariate effects are null is also computed. We illustrate this function in the following two examples.
Example 5.4.2. For this illustration we have simulated data from a 2 × 3 layout with one covariate. Denote the factors by A and B, respectively. It is a null model; i.e., for the data simulated, all effects were set at 0. The data are displayed by cell in the following table. The first column in each cell contains the response realizations while the second column contains the corresponding covariate values.
Factor A Factor B
B(1) B(2) B(3)
4.35 4.04 0.69 2.88 4.97 3.4 5.19 5.19 4.41 5.3 6.63 2.91 4.31 5.16 7.03 2.96 5.71 3.79 A(1) 5.9 1.43 4.14 5.33 4.43 3.53 4.49 3.51 3.73 4.93 5.29 5.22 5.75 3.1 5.65 3.89 4.93 3.22 6.15 3.15 6.02 5.69 5.1 4.73 4.94 2.01 4.27 4.2 4.52 2.79 6.1 3.01 4.3 2.57 A(2) 5.53 5.63 4.93 3.87 4.47 3.75 4.21 3.88 5.3 4.47 6.07 2.62 5.65 3.85
We use this example to describe the input to the function kancova. The design is a two-way with the first factor at 2 levels and the second factor at 3 levels.
The first argument to the function is the vector of levels c(2,3). Since there are two factors, the second argument is a matrix with three columns: the vector of responses, Y_ikj; the level of the first factor, i; and the level of the second factor, j. The third argument is the matrix of corresponding covariates. The data are in dataset acov231. The following code segment computes the rank-based ANCOVA.
> levs = c(2,3);
> data = acov231[,1:3];
> xcov = matrix(acov231[,4],ncol=1)
> temp = kancova(levs,data,xcov)
Robust ANCOVA Table
All tests except last row is with homogeneous slopes as the full model. For the last row the full model is with heteroscedastic slopes.
            df         RD        MRD          F   p-value
1 , 0        1 0.21806168 0.21806168 0.46290087 0.5022855
0 , 1        2 0.09420198 0.04710099 0.09998588 0.9051964
1 , 1        2 1.21200859 0.60600430 1.28642461 0.2932631
Covariate    1 0.13639223 0.13639223 0.28953313 0.5950966
Hetrog regr  5 2.33848642 0.46769728 0.98946073 0.4476655
The rank-based tests are all nonsignificant, which agrees with the null model used to generate the data.
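A null-model data set of the kind used in Example 5.4.2 can be mimicked with a few lines of R. The sketch below is not the code used to generate acov231: the balanced cell size, the normal errors, and the covariate distribution are all assumptions made for illustration. The call to kancova follows the argument order described above and runs only if the npsm package is installed.

```r
# Hypothetical null model for a 2 x 3 layout with one covariate:
# the response does not depend on factor A, factor B, or the covariate.
set.seed(1)
levs <- c(2, 3)   # number of levels of factors A and B
n_per_cell <- 5   # assumed balanced design (the text's data are unbalanced)

# One row per observation; i indexes factor A and j indexes factor B.
design <- expand.grid(rep = seq_len(n_per_cell), j = 1:3, i = 1:2)
n <- nrow(design)

x <- rnorm(n, mean = 4, sd = 1)  # covariate, unrelated to the response
y <- 5 + rnorm(n)                # constant plus noise: all effects are 0

# Second argument: responses, level of first factor, level of second factor.
ydat <- cbind(y = y, i = design$i, j = design$j)
# Third argument: matrix of covariates.
xcov <- matrix(x, ncol = 1)

if (requireNamespace("npsm", quietly = TRUE)) {
  library(npsm)
  temp <- kancova(levs, ydat, xcov)
}
```

Because the data are generated under the null model, large p-values are expected in most rows of the resulting table, although any single simulated data set can produce a small p-value by chance.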
Example 5.4.3 (2 × 2 with Covariate). Huitema (2011), page 496, presents an example of a 2 × 2 layout with a covariate. The dependent variable is the number of novel responses under controlled conditions. The factors are type of reinforcement (Factor A at 2 levels) and type of program (Factor B at 2 levels); hence there are four cells. The covariate is a measure of verbal fluency.
There are only 4 observations per cell for a total of n = 16 observations.
Since there are 8 parameters in the heterogeneous slope model, there are only 2 observations per parameter. Hence, the results are tentative. The data are in the dataset huitema496. Using the function kancova with the default Wilcoxon scores, the following robust ANCOVA table is computed.
> levels = c(2,2);
> y.group = huitema496[,c('y','i','j')]
> xcov = huitema496[,'x']
> temp = kancova(levels,y.group,xcov)
Robust ANCOVA Table
All tests except last row is with homogeneous slopes as the full model. For the last row the full model is with heteroscedastic slopes.
            df         RD        MRD          F     p-value
1 , 0        1  5.6740175  5.6740175  6.0883935 0.031261699
0 , 1        1  0.4937964  0.4937964  0.5298585 0.481873895
1 , 1        1  0.1062181  0.1062181  0.1139752 0.742017556
Covariate    1 12.2708792 12.2708792 13.1670267 0.003966071
Hetrog regr  3  8.4868988  2.8289663  4.2484881 0.045209629
The robust ANCOVA table indicates heterogeneous slopes (p-value 0.0452), so we plotted the four regression models, as shown in Figure 5.5. The rank-based test of homogeneity of the sample slopes agrees with the plot. In particular, the slope for the cell with A = 2, B = 2 differs from the others. Again, these results are based on a small sample and thus should be interpreted with caution. As a pilot study, these results may serve as the basis of a power analysis for a larger study.
FIGURE 5.5
Scatterplot of the data of Example 5.4.3 (horizontal axis: verbal fluency; vertical axis: responses). The number 1 represents observations from the cell with A = 1, B = 1, 2 represents observations from the cell with A = 1, B = 2, 3 represents observations from the cell with A = 2, B = 1, and 4 represents observations from the cell with A = 2, B = 2.
The rank-based tests computed by these functions are based on reductions of dispersion as we move from reduced to full models. Hence, as an alternative to these functions, the computations can also be obtained using the Rfit functions rfit and drop.test with only a minor amount of code. In this way, specific hypotheses of interest can easily be computed, as we show in the following example.
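To make the phrase "reduction of dispersion" concrete, the following base R sketch computes a rank-based dispersion with Wilcoxon scores for a reduced and a full model and takes their difference. It is an illustration under stated assumptions, not the Rfit implementation: the score function a(i) = sqrt(12)(i/(n+1) - 1/2), the simulated data, and the use of optim as the minimizer are our choices. No intercept column is included because this dispersion does not change when a constant is added to the residuals.

```r
# Wilcoxon scores; they sum to zero.
wilcoxon_scores <- function(n) sqrt(12) * (seq_len(n) / (n + 1) - 0.5)

# Dispersion of the residuals e = y - x %*% beta:
# the sum of score(rank(e_i)) * e_i.
dispersion <- function(beta, x, y) {
  e <- as.vector(y - x %*% beta)
  a <- wilcoxon_scores(length(e))
  sum(a[rank(e, ties.method = "first")] * e)
}

# Simulated data (hypothetical): the third covariate has no effect.
set.seed(3)
n <- 40
x <- matrix(rnorm(n * 3), ncol = 3)
y <- as.vector(x[, 1:2] %*% c(2, -1) + rt(n, df = 3))

# Reduced model: slopes for the first two covariates only.
red <- optim(c(0, 0), dispersion, x = x[, 1:2], y = y)

# Full model: start at the reduced solution, so the minimized
# dispersion cannot exceed that of the reduced model.
full <- optim(c(red$par, 0), dispersion, x = x, y = y)

# Reduction in dispersion when moving from the reduced to the full model.
rd <- red$value - full$value
print(c(reduced = red$value, full = full$value, RD = rd))
```

A small RD relative to the scale of the errors indicates that the extra covariate adds little, which is the quantity the F statistics in the tables of this section standardize.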
Example 5.4.4 (Triglyceride and Blood Plasma Levels). The data for this example are drawn from a clinical study discussed in Hollander and Wolfe (1999).
The data consist of triglyceride levels on 13 patients. Two factors, each at two levels, were recorded: Sex and Obesity. The concomitant variables are chylomicrons, age, and three lipid variables (very low-density lipoproteins (VLDL), low-density lipoproteins (LDL), and high-density lipoproteins (HDL)). The data are in the npsm dataset blood.plasma. The next code segment displays a subset of it.
> head(blood.plasma)
     Total Sex Obese Chylo VLDL  LDL  HDL Age
[1,] 20.19   1     1  3.11 4.51 2.05 0.67  53
[2,] 27.00   0     1  4.90 6.03 0.67 0.65  51
[3,] 51.75   0     0  5.72 7.98 0.96 0.60  54
[4,] 51.36   0     1  7.82 9.58 1.06 0.42  56
[5,] 28.98   1     1  2.62 7.54 1.42 0.36  66
[6,] 21.70   0     1  1.48 3.96 1.09 0.23  37
The design matrix for the full model to test the hypotheses of no interaction between the factors and the covariates would have 24 columns, which, with only 13 observations, is impossible. Instead, we discuss the code to compute several tests of hypotheses of interest. The full model design matrix consists of the four dummy columns for the cell means and the 5 covariates. In the following code segment this design matrix is in the R matrix xfull while the response, total triglyceride, is in the column Total of blood.plasma. The resulting full model fit and its summary are given by:
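The construction of xfull is not shown in the text. One way it might be built is sketched below, using only the six rows displayed by head(blood.plasma). The helper make_xfull is hypothetical, and the labeling of the cell columns as "00", "01", "10", "11" with Sex as the first digit and Obese as the second is an assumption, chosen to match the coefficient names and contrasts used in this example.

```r
# The six rows of blood.plasma shown in the text.
bp <- matrix(c(
  20.19, 1, 1, 3.11, 4.51, 2.05, 0.67, 53,
  27.00, 0, 1, 4.90, 6.03, 0.67, 0.65, 51,
  51.75, 0, 0, 5.72, 7.98, 0.96, 0.60, 54,
  51.36, 0, 1, 7.82, 9.58, 1.06, 0.42, 56,
  28.98, 1, 1, 2.62, 7.54, 1.42, 0.36, 66,
  21.70, 0, 1, 1.48, 3.96, 1.09, 0.23, 37),
  ncol = 8, byrow = TRUE,
  dimnames = list(NULL, c("Total", "Sex", "Obese", "Chylo",
                          "VLDL", "LDL", "HDL", "Age")))

# Hypothetical helper: four cell-mean indicator columns followed by
# the five covariates, giving a design matrix with 9 columns.
make_xfull <- function(bp) {
  cell_id <- paste0(bp[, "Sex"], bp[, "Obese"])   # e.g. "01" = Sex 0, Obese 1
  cell_labels <- c("00", "01", "10", "11")
  cells <- sapply(cell_labels, function(lab) as.numeric(cell_id == lab))
  cbind(cells, bp[, c("Chylo", "VLDL", "LDL", "HDL", "Age")])
}

xfull <- make_xfull(bp)
print(xfull)
```

With the complete data set, the same helper would be applied to all 13 rows. The cell "10" column is all zeros here only because none of the six displayed patients falls in that cell.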
> fitfull = rfit(blood.plasma[,'Total']~xfull-1)
> summary(fitfull)
Call:
rfit.default(formula = blood.plasma[, "Total"] ~ xfull - 1)

Coefficients:
            Estimate Std. Error t.value  p.value
xfull00      8.59033    9.30031  0.9237 0.407938
xfull01     -3.00427    8.32005 -0.3611 0.736297
xfull10    -12.61631   10.11257 -1.2476 0.280234
xfull11    -11.58851   10.32710 -1.1221 0.324605
xfullChylo   1.74111    0.53220  3.2715 0.030745 *
xfullVLDL    2.87822    0.41674  6.9064 0.002305 **
xfullLDL     3.79748    2.77105  1.3704 0.242433
xfullHDL   -11.46968    4.61116 -2.4874 0.067674 .
xfullAge     0.24942    0.13359  1.8671 0.135284
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared (Robust): 0.9655414
Reduction in Dispersion Test: 14.01015 p-value: 0.01108

> dfull = disp(fitfull$betahat, xfull, fitfull$y, fitfull$scores)
> dfull
         [,1]
[1,] 26.44446
The last line is the minimum value of the dispersion function based on the fit of the full model. One hypothesis of interest discussed in Hollander and Wolfe is whether or not the three lipid covariates (VLDL, LDL, and HDL) are significant predictors. The answer appears to be “yes” based on the above summary of the full model fit. The following code performs a formal test using the hypothesis matrix hmat:
> hmat = rbind(c(rep(0,5),1,rep(0,3)),
+ c(rep(0,6),1,rep(0,2)),
+ c(rep(0,7),1,rep(0,1)))
> hmat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    0    0    0    0    1    0    0    0
[2,]    0    0    0    0    0    0    1    0    0
[3,]    0    0    0    0    0    0    0    1    0
> xred1 = redmod(xfull,hmat)
> fitr1 = rfit(blood.plasma[,'Total']~xred1-1)
> drop.test(fitfull,fitr1)

Drop in Dispersion Test
F-Statistic     p-value
  10.407321    0.023246
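The reduced design matrix for a hypothesis of the form H beta = 0 spans exactly those full-model fits whose coefficients satisfy the constraint. The sketch below shows one way to build such a matrix with a QR decomposition. It illustrates the idea only and is not necessarily how redmod is implemented in Rfit; the helpers null_basis and reduce_design and the simulated design matrix xdemo are hypothetical, and hmat is assumed to have full row rank.

```r
# Orthonormal basis for the null space of hmat (q x p, full row rank).
null_basis <- function(hmat) {
  q <- nrow(hmat)
  p <- ncol(hmat)
  # The first q columns of Q span the row space of hmat;
  # the remaining p - q columns span its null space.
  Q <- qr.Q(qr(t(hmat)), complete = TRUE)
  Q[, (q + 1):p, drop = FALSE]
}

# Reduced design: full-model columns combined along the null-space basis.
reduce_design <- function(xfull, hmat) xfull %*% null_basis(hmat)

# Hypothesis matrix from the text: coefficients 6, 7, and 8 are zero.
hmat <- rbind(c(rep(0, 5), 1, rep(0, 3)),
              c(rep(0, 6), 1, rep(0, 2)),
              c(rep(0, 7), 1, rep(0, 1)))

# Hypothetical 13 x 9 design matrix standing in for xfull.
set.seed(4)
xdemo <- matrix(rnorm(13 * 9), nrow = 13)

xred <- reduce_design(xdemo, hmat)
dim(xred)  # 13 rows and 9 - 3 = 6 columns
```

For this particular hmat, the reduced matrix spans the same space as the full design with columns 6 to 8 dropped. The general construction is what makes contrasts such as those among cell means testable in the same way.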
Hence, based on this p-value (0.0232), it seems that the lipid variables are related to triglyceride levels. The next few lines of code test to see if the factor sex has an effect on triglycerides.
> hmat=rbind(c(1,1,-1,-1,rep(0,5)))
> xred3 = redmod(xfull,hmat)
> fitr3 = rfit(blood.plasma[,'Total']~xred3-1)
> drop.test(fitfull,fitr3)

Drop in Dispersion Test
F-Statistic     p-value
   21.79352     0.00953
Based on the p-value of 0.0095, it appears that the factor sex also has an effect on triglyceride levels. Finally, consider the effect of obesity on triglyceride level.
> hmat=rbind(c(1,-1,1,-1,rep(0,5)))
> xred2 = redmod(xfull,hmat)
> fitr2 = rfit(blood.plasma[,'Total']~xred2-1)
> drop.test(fitfull,fitr2)

Drop in Dispersion Test
F-Statistic     p-value
  11.580682    0.027201
Thus, the rank-based test for obesity results in the test statistic 11.58 with p-value 0.0272.
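The p-values reported for the three drop-in-dispersion tests can be checked from the F statistics alone. The sketch below assumes that each statistic is referred to an F distribution with q and n - p degrees of freedom, where q is the number of rows of the hypothesis matrix, n = 13 is the number of patients, and p = 9 is the number of columns of the full design matrix. The object names are ours.

```r
n <- 13  # patients
p <- 9   # columns of the full model design matrix

# F statistics copied from the drop.test output above.
tests <- data.frame(
  hypothesis = c("lipids", "sex", "obesity"),
  q          = c(3, 1, 1),   # rows of the hypothesis matrix
  Fstat      = c(10.407321, 21.79352, 11.580682)
)

# Upper-tail probability of F(q, n - p); assumed reference distribution.
tests$p.value <- pf(tests$Fstat, df1 = tests$q, df2 = n - p,
                    lower.tail = FALSE)

print(tests)
```

With only n - p = 4 denominator degrees of freedom, the reference distribution has heavy tails, which is one reason these results should be read cautiously for so small a sample.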