Treatment Comparisons in Clinical Trials
3.2 Statistical Models for Treatment Comparisons
3.2.1 Models for Continuous Endpoints
We begin with the comparison of two treatments based on the well-known t-test and then extend the concepts to comparisons of multiple treatments using the analysis of variance approach.
3.2.1.1 Student’s t-tests
Student’s t-test is used to compare two treatment group means, assuming that the clinical trial endpoints are continuous and follow a normal distribution. Consider a clinical trial with two treatment groups, where the numbers of patients randomized to the two groups are denoted by $n_1$ and $n_2$. At the end of the trial we observe clinical endpoints $y_{1i}$ ($i = 1, \ldots, n_1$) and $y_{2i}$ ($i = 1, \ldots, n_2$) on the patients in each treatment group. We test the null hypothesis that the means of the two treatment groups are the same:
$$H_0: \mu_1 = \mu_2 \qquad (3.1)$$
and the alternative may be two-sided, $H_a: \mu_1 \neq \mu_2$, or one-sided, $H_a: \mu_1 > \mu_2$ (or $\mu_1 < \mu_2$), depending on the trial objective.
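The test statistic is constructed as:
$$t = \frac{\bar{y}_1 - \bar{y}_2}{s\sqrt{1/n_1 + 1/n_2}} \qquad (3.2)$$
where $\bar{y}_1$ and $\bar{y}_2$ are the sample means of the two treatment groups calculated from the observed data, and $s$ is the pooled standard deviation calculated as:
$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},$$
under the assumption of constant variance, where $s_1$ and $s_2$ are the sample standard deviations of the two treatment groups. The t-statistic in Equation (3.2) is essentially the standardized difference of the two treatment group means.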
Under the null hypothesis, this t-statistic has a Student’s t-distribution with $n_1 + n_2 - 2$ degrees of freedom. The null hypothesis is not rejected if $|t| < t_{\alpha/2,\, n_1+n_2-2}$, with a $100(1-\alpha)\%$ confidence coefficient (for a two-sided test).
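As a minimal sketch of Equation (3.2) and the decision rule above, the following Python snippet computes the pooled t-statistic for two small groups. The data values, the function name `pooled_t`, and the tabulated critical value 2.306 (the upper 2.5% point of the t-distribution with 8 degrees of freedom) are illustrative assumptions rather than part of the original analysis; in R the corresponding call would be `t.test` with `var.equal = TRUE`.

```python
import math
from statistics import mean, variance

def pooled_t(y1, y2):
    """Return (t, df, s) for the equal-variance two-sample t-test."""
    n1, n2 = len(y1), len(y2)
    # Pooled variance: df-weighted average of the two sample variances.
    s2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    s = math.sqrt(s2)
    t = (mean(y1) - mean(y2)) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, s

# Hypothetical endpoint values (not trial data).
group_1 = [10.0, 12.0, 14.0, 16.0, 18.0]
group_2 = [8.0, 9.0, 10.0, 11.0, 12.0]

t_stat, df, s_pooled = pooled_t(group_1, group_2)
T_CRIT = 2.306  # assumed: tabulated upper 2.5% point of t with 8 df
reject_h0 = abs(t_stat) >= T_CRIT
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

With these made-up values the pooled standard deviation is 2.5 and the statistic is about 2.53 on 8 degrees of freedom, which exceeds the assumed critical value, so the two-sided test rejects at the 5% level.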
Clinical Trial Data Analysis Using R
Alternatively, a $100(1-\alpha)\%$ confidence interval (CI) may be constructed for the true difference in treatment group means and used as the basis of statistical inference. The CI is constructed as:
$$\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2,\, n_1+n_2-2}\; s\sqrt{1/n_1 + 1/n_2}$$
where $t_{\alpha/2,\, n_1+n_2-2}$ is the upper $\alpha/2$ percentile of the Student’s t-distribution with $n_1 + n_2 - 2$ degrees of freedom.
A CI that includes zero indicates insufficient evidence to reject the null hypothesis.
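The interval above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data are hypothetical, `pooled_ci` is a name introduced here, and the critical value 2.306 is taken from a t table for 8 degrees of freedom rather than computed.

```python
import math
from statistics import mean, variance

def pooled_ci(y1, y2, t_crit):
    """CI for mu1 - mu2 based on the pooled standard deviation."""
    n1, n2 = len(y1), len(y2)
    s2 = ((n1 - 1) * variance(y1) + (n2 - 1) * variance(y2)) / (n1 + n2 - 2)
    half_width = t_crit * math.sqrt(s2) * math.sqrt(1 / n1 + 1 / n2)
    diff = mean(y1) - mean(y2)
    return diff - half_width, diff + half_width

# Hypothetical endpoint values (not trial data).
group_1 = [10.0, 12.0, 14.0, 16.0, 18.0]
group_2 = [8.0, 9.0, 10.0, 11.0, 12.0]

# 2.306 is the assumed tabulated upper 2.5% point of t with 8 df.
lower, upper = pooled_ci(group_1, group_2, t_crit=2.306)
print(f"95% CI for the mean difference: ({lower:.3f}, {upper:.3f})")
```

For these values the interval is roughly (0.35, 7.65). It excludes zero, which agrees with a two-sided test at the 5% level rejecting the null hypothesis for the same data.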
The underlying assumptions for a valid t-test are that the observed clinical endpoints $y_1$ and $y_2$ are independent and normally distributed with common variance $\sigma^2$. If any of these assumptions is violated, there are remedies:
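1. Unequal variances: If the two treatment groups have different variances, the t-statistic in Equation (3.2) may be modified as
$$t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \qquad (3.3)$$
with $\nu$ degrees of freedom calculated as
$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$
This test statistic approximately follows a Student’s t-distribution with $\nu$ degrees of freedom. The procedure is known as the Welch test (Welch, 1947) and is implemented in R as t.test.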
2. Non-normal data: The t-test is usually quite robust against departures from normality. However, when the departure is extreme, the recommended remedy is to use the Mann–Whitney–Wilcoxon (MWW) U-test (also called the Wilcoxon rank-sum test or the Wilcoxon–Mann–Whitney test).
This is a non-parametric test for assessing whether two independent samples of observations come from the same distribution. It is one of the most widely used non-parametric significance tests. It was initially proposed by Wilcoxon (1945) for equal sample sizes, and was extended to arbitrary sample sizes and in other ways by Mann and Whitney (1947).
MWW is virtually identical to performing an ordinary parametric two-sample t-test on the ranks of the data after ranking over the combined samples. This U-test is implemented in the R system as wilcox.test.
3. Bootstrap resampling: When any of the assumptions underlying the validity of the t-test does not hold for the data being analyzed, bootstrapping provides a viable alternative. The bootstrap method involves repeatedly resampling the data with replacement, calculating the value of the statistic for each resample, and thereby generating the resampling distribution.
Percentile points of the resampling distribution, chosen according to the Type-I error level and the sidedness of the alternative hypothesis, are then used in the assessment of statistical significance. We illustrate the bootstrapping approach using the R function bootstrap.
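The three remedies listed above can be sketched side by side. The Python snippet below is a hedged illustration, not the R implementations named in the text: the data are hypothetical, the function names are introduced here, the U statistic is computed by direct pair counting (equivalent to the rank-sum form), and the bootstrap uses a simple percentile interval, which is only one of several bootstrap variants.

```python
import math
import random
from statistics import mean, variance

def welch_t(y1, y2):
    """Welch statistic (Equation 3.3) and its approximate degrees of freedom."""
    v1, v2 = variance(y1) / len(y1), variance(y2) / len(y2)
    t = (mean(y1) - mean(y2)) / math.sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (len(y1) - 1) + v2 ** 2 / (len(y2) - 1))
    return t, nu

def mann_whitney_u(y1, y2):
    """U statistic for y1: pairs with y1 > y2, counting each tie as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in y1 for b in y2)

def bootstrap_ci(y1, y2, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = sorted(
        mean(rng.choices(y1, k=len(y1))) - mean(rng.choices(y2, k=len(y2)))
        for _ in range(n_boot)
    )
    lo_idx = round(n_boot * alpha / 2)
    hi_idx = round(n_boot * (1 - alpha / 2)) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Hypothetical endpoint values (not trial data).
group_1 = [10.0, 12.0, 14.0, 16.0, 18.0]
group_2 = [8.0, 9.0, 10.0, 11.0, 12.0]

t_w, nu = welch_t(group_1, group_2)
u_stat = mann_whitney_u(group_1, group_2)
boot_lo, boot_hi = bootstrap_ci(group_1, group_2)
print(f"Welch t = {t_w:.3f} on {nu:.2f} df")
print(f"Mann-Whitney U = {u_stat}")
print(f"Bootstrap 95% interval: ({boot_lo:.2f}, {boot_hi:.2f})")
```

Because the two hypothetical groups have equal sizes, the Welch statistic equals the pooled one, but its degrees of freedom drop from 8 to about 5.9, reflecting the unequal sample variances.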
3.2.1.2 One-Way Analysis of Variance (ANOVA)
For comparisons involving more than two treatment groups, F-tests deriving from a one-way analysis of variance (ANOVA) model are used. The fundamental idea of ANOVA is to partition the overall variance in clinical response into a component reflecting variation among treatment groups (factor levels) and a component reflecting variation within treatment groups (residual variation, for example due to measurement error). For a factor $\alpha$ occurring at $i = 1, \ldots, I$ levels, with $j = 1, \ldots, n_i$ observations per level, the typical one-way ANOVA model may be expressed as
$$y_{ij} = \mu + \alpha_i + \epsilon_{ij} \qquad (3.4)$$
The above model is over-parameterized and not all the parameters are identifiable or estimable. The common constraints are:
1. Set $\mu = 0$ and use $I$ different dummy variables to estimate $\alpha_i$ for $i = 1, \ldots, I$.
2. Set $\alpha_1 = 0$, so that $\mu$ represents the expected mean response for level one and $\alpha_i$, for $i \neq 1$, represents the difference between level $i$ and level one. Level one is then called the reference level or baseline level. This corresponds to the “treatment contrasts” commonly reported in R output.
Treatment effects (differences among specified treatments) are commonly estimated using least squares. The statistical significance of treatment differences may be assessed by testing
$$H_0: \alpha_i = 0 \text{ for all } i = 1, \ldots, I$$
$$H_a: \text{at least one of the } \alpha_i \text{ is not zero}$$
The model under $H_0$ is then $y_{ij} = \mu + \epsilon_{ij}$ and under $H_a$ is $y_{ij} = \mu + \alpha_i + \epsilon_{ij}$. If the null hypothesis is not rejected, the analysis ends and it is concluded that there is insufficient evidence that the treatment group means differ. However, if the null hypothesis is rejected, the next logical step is to investigate which levels differ by using so-called multiple comparisons.
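To make the variance partition concrete, here is a minimal Python sketch of the one-way ANOVA F-statistic, computed as the ratio of the between-group to the within-group mean square. The three groups of values and the function name `one_way_f` are hypothetical; converting the statistic to a p-value would additionally require the F-distribution, which this sketch leaves out.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_obs = [y for g in groups for y in g]
    grand_mean = mean(all_obs)
    # Between-group sum of squares: variation among treatment means.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical responses for three treatment groups (not trial data).
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
f_stat, df_b, df_w = one_way_f(groups)
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```

Here the between-group sum of squares is 38 and the within-group sum of squares is 6, giving mean squares of 19 and 1 and hence F = 19 on (2, 6) degrees of freedom.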
We warn readers that applying the t-test from Section 3.2.1.1 individually to all pairwise comparisons is not the solution, since doing so inflates the type-I error rate. Procedures that adjust for multiple comparisons are therefore used. Tukey’s honest significant difference (HSD) procedure is commonly used for this adjustment in the literature and is easy to understand. Tukey’s HSD procedure is based on the distribution of the studentized range, with quantile $q_{\alpha, df_1, df_2}$ where $df_1 = I$ and $df_2 = \sum_{i=1}^{I} n_i - I$, and gives the confidence interval
$$\hat{\alpha}_i - \hat{\alpha}_j \pm \frac{q_{\alpha, df_1, df_2}}{\sqrt{2}}\, se(\hat{\alpha}_i - \hat{\alpha}_j) \qquad (3.5)$$
The ANOVA procedure is implemented in the R system as aov and Tukey’s HSD procedure as TukeyHSD.
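The size of the type-I error inflation from unadjusted pairwise t-tests can be gauged with a short calculation. The Python sketch below assumes, purely for illustration, that the pairwise tests are independent (they are not exactly so in practice, since they share data), in which case the chance of at least one false rejection among m tests at level α is 1 − (1 − α)^m. The function name `familywise_error` is introduced here.

```python
import math

def familywise_error(n_groups, alpha=0.05):
    """Number of pairwise tests and the chance of at least one false rejection.

    Simplifying assumption: the pairwise tests are treated as independent.
    """
    m = math.comb(n_groups, 2)  # all pairwise comparisons among the groups
    return m, 1 - (1 - alpha) ** m

for n_groups in (3, 4, 5):
    m, fwer = familywise_error(n_groups)
    print(f"{n_groups} groups: {m} comparisons, familywise error about {fwer:.3f}")
```

Under this approximation, four treatment groups already give six comparisons and a familywise error rate of about 0.26 instead of the nominal 0.05, which is the motivation for adjustments such as Tukey’s HSD.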
3.2.1.3 Multi-Way ANOVA: Factorial Design
The R system permits extending the one-way ANOVA in Section 3.2.1.2 to ANOVA accounting for several factors (multi-way ANOVA). We describe the 2-way ANOVA corresponding to a two-factor design and illustrate this procedure in analyzing the DBP data in Section 3.1.
Suppose we have two factors, $\alpha$ (e.g. treatment with drugs A and B) at $I$ levels and $\beta$ (e.g. time at which DBP is measured; time = 1, ..., 5) at $J$ levels. Let $n_{ij}$ be the number of observations at level $i$ of $\alpha$ and level $j$ of $\beta$, and denote those observations by $y_{ijk}$, $k = 1, \ldots, n_{ij}$. The full ANOVA model with fixed effects is
$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \qquad (3.6)$$
where $\alpha_i$ and $\beta_j$ are the main effects. The term $(\alpha\beta)_{ij}$ is the interaction effect between the two factors $\alpha$ and $\beta$, which may be interpreted as the part of the mean response not explained by the additive effects of $\alpha$ and $\beta$.
A significant interaction means that the main effect of $\alpha$ cannot be assessed independently of $\beta$: a comparison of the levels of $\alpha$ depends on the level of $\beta$.
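A small numerical sketch may help show what an interaction looks like. The Python snippet below uses a hypothetical 2 × 2 table of cell means (two drugs by two time points) and assumes a balanced design with sum-to-zero constraints, under which each interaction term can be estimated as the cell mean minus its row mean minus its column mean plus the grand mean. The function name and all values are illustrative assumptions.

```python
from statistics import mean

def interaction_effects(cell_means):
    """Interaction terms under sum-to-zero constraints (balanced design assumed)."""
    row_means = [mean(row) for row in cell_means]
    col_means = [mean(col) for col in zip(*cell_means)]
    grand = mean(row_means)
    return [
        [cell - row_means[i] - col_means[j] + grand for j, cell in enumerate(row)]
        for i, row in enumerate(cell_means)
    ]

# Hypothetical mean DBP by drug (rows: A, B) and time (columns: 1, 2).
cell_means = [[100.0, 90.0],
              [100.0, 98.0]]

ab = interaction_effects(cell_means)
# Drug A minus drug B, evaluated separately at each time point.
simple_effects = [a - b for a, b in zip(*cell_means)]
print("interaction terms:", ab)
print("A - B at each time:", simple_effects)
```

In this made-up table the drugs do not differ at time 1 but differ by 8 units at time 2, so the comparison of drugs depends on the time point, and the interaction terms are correspondingly nonzero (±2).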
The interaction effect may be tested using the F-test from the ANOVA, which is implemented in the R system with the aov function. If the interaction is found to be significant, further investigation is needed before drawing inferences about the main effects of interest.
If the interaction is found not to be significant, the main effects may be tested from the ANOVA table corresponding to the reduced model without interaction:
$$y_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk} \qquad (3.7)$$
3.2.2 Models for Categorical Endpoints: Pearson’s $\chi^2$-test
There are many methods for categorical data analysis; readers are referred to Agresti (2002) for a comprehensive treatise. We introduce Pearson’s chi-square test in this chapter to draw comparisons with other methods of analysis of the clinical trial on duodenal ulcer healing. In addition, Pearson’s $\chi^2$-test is probably the most commonly used statistical method for the analysis of contingency table data.
The first step in the chi-square test is to calculate the value of the chi-square statistic. It is obtained by (1) forming the difference between the observed frequency and the expected frequency (under the null hypothesis of no difference among the groups being compared) in each cell of the contingency table, (2) squaring each difference, (3) dividing each squared difference by the expected frequency, and (4) summing the results. The second step is to determine the degrees of freedom of the test, which is essentially the total number of observed frequencies adjusted for the impact of using some of the observations to compute the “expected frequencies”; for a contingency table with $r$ rows and $c$ columns this is $(r - 1)(c - 1)$.
The value of the test statistic is:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \qquad (3.8)$$
where $O_i$ is the observed frequency and $E_i$ is the expected (theoretical) frequency under the null hypothesis.
Asymptotically, the test statistic $\chi^2$ follows a chi-square distribution. This asymptotic approximation breaks down if the expected frequencies are too low. In this case, a better approximation is obtained by using Yates’ continuity correction, which reduces the absolute value of each difference between observed and expected frequencies by 0.5 before squaring. This Pearson $\chi^2$-test is implemented in R as prop.test.
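The four computational steps and the continuity correction can be sketched as follows. This Python snippet is an illustration on a hypothetical 2 × 2 table of counts (healed versus not healed, by treatment); the function name `pearson_chi2` is introduced here, and expected counts are derived from the row and column totals under the null hypothesis of no association.

```python
def pearson_chi2(table, yates=False):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Yates' correction (intended for 2 x 2 tables): shrink by 0.5.
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical counts: rows are treatments, columns are healed / not healed.
table = [[30, 10],
         [20, 20]]

chi2, df = pearson_chi2(table)
chi2_yates, _ = pearson_chi2(table, yates=True)
print(f"chi-square = {chi2:.3f} (df = {df}), with Yates' correction = {chi2_yates:.3f}")
```

For this table every expected count differs from the observed count by 5, giving a statistic of about 5.33 on 1 degree of freedom; the correction reduces each difference to 4.5 and the statistic to 4.32, illustrating that the corrected test is more conservative.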