
3.6.1 General Concepts

Assessing more than one statistical hypothesis at one time is called simultaneous inference. This is like asking multiple specific questions and expecting one specific answer per question, and more often than not also an overall answer to the set of questions as a whole. Directing several enquiries to the same set of data, however, usually results in a multiple comparison problem. Multiplicity can arise from several group comparisons being made, several endpoints being investigated, several subgroups being analyzed, etc. The common goal with simultaneous inference is to control a joint rate of type I errors occurring over the whole set (or “family”) of hypotheses. Hochberg and Tamhane (1987, p. 5) defined a family as “any collection of inferences for which it is meaningful to take into account some combined measure of errors”. In this context we appreciate the notion of a “claimwise” error rate as proposed by Phillips et al. (2013); this emphasizes that a claim can consist of diverse elementary hypotheses that are meaningful together.

Performing a series of level-α tests for the elementary hypotheses without any further adjustment usually inflates the type I error rate of the entire claim. We focus here on methods that control the FWER in the strong sense, i.e., the probability of incorrectly rejecting one or more true elementary nulls is bounded by α, no matter which and how many elementary nulls are true or false. The inferential procedures discussed in this work control the FWER (at least approximately) in the strong sense for a claim that may comprise comparisons among treatment groups as well as comparisons among time points.
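The inflation mentioned above can be made concrete with a minimal sketch. It assumes, purely for illustration, that the z test statistics are independent and that all z elementary nulls are true; the function name `unadjusted_fwer` is hypothetical and not part of the original text.

```python
def unadjusted_fwer(alpha: float, z: int) -> float:
    """FWER of z unadjusted level-alpha tests.

    Assumption (illustration only): the tests are independent and all z
    elementary null hypotheses are true, so that
    P(at least one false rejection) = 1 - (1 - alpha)^z.
    """
    return 1.0 - (1.0 - alpha) ** z


# The familywise error rate grows quickly with the number of hypotheses.
for z in (1, 5, 10, 20):
    print(f"z = {z:2d}: FWER = {unadjusted_fwer(0.05, z):.3f}")
```

Under dependence the inflation is typically smaller than in this independent case, but it does not vanish unless the tests are perfectly correlated.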

When comparing multiple treatments separately and simultaneously at multiple occasions, it is sufficient for claiming an effect if at least one treatment difference is significant for at least one occasion. Likewise, when comparing multiple occasions separately and simultaneously for multiple treatment groups, an effect may be claimed if at least one occasion difference is significant for at least one treatment. Hence the claims we want to make are formulated as union-intersection tests (UITs), and adjustment is needed as the goal is to bound the FWER (approximately) by α. A UIT involves testing the intersection of elementary null hypotheses against the union of alternatives:

$$H_0 = \bigcap_i H_0^{(i)}, \qquad H_A = \bigcup_i H_A^{(i)}.$$

The global $H_0$ is rejected if at least one elementary $H_0^{(i)}$ is rejected (Roy 1953).

The results of many simultaneous inference procedures can be expressed either as adjusted p-values or SCIs, but intervals are superior to p-values in multiple ways (Gardner and Altman 1986): both convey information about statistical significance, but SCIs additionally allow assessing the magnitude of an effect on the original scale, its direction (decrease or increase), and its subject-matter relevance. Therefore SCIs are much more useful for direct interpretation with respect to the research questions of interest.

3.6.2 Multiple Comparison Procedures

The simplest and best-known adjustment for multiplicity is based on Bonferroni’s inequality (Bonferroni 1935, 1936)², leading to the corrected type I error bound

$$\tilde{\alpha} = \frac{\alpha}{z}$$

and to adjusted p-values (which are then to be compared with α) for hypotheses h = 1, . . . , z given by

$$\tilde{p}_h = \min(z\,p_h,\, 1).$$

This method is universally applicable and ensures strong FWER control, but it is also notorious for being conservative unless the test statistics are independent.
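As a minimal sketch of the Bonferroni adjustment, the adjusted p-values min(z·p_h, 1) can be computed and compared with α as follows. The raw p-values are made up for illustration, and the function name `bonferroni_adjust` is hypothetical; the last line applies the union-intersection logic of rejecting the global null if at least one elementary null is rejected.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: min(z * p_h, 1) for h = 1, ..., z."""
    z = len(p_values)
    return [min(z * p, 1.0) for p in p_values]


alpha = 0.05
raw = [0.001, 0.012, 0.03, 0.4]   # hypothetical raw p-values, z = 4
adj = bonferroni_adjust(raw)

# Each adjusted p-value is compared with the unadjusted alpha.
rejected = [p <= alpha for p in adj]

# Union-intersection logic: the global null is rejected as soon as
# at least one elementary null is rejected.
global_null_rejected = any(rejected)

print(adj, rejected, global_null_rejected)
```

Equivalently, one could compare the raw p-values with the corrected bound α/z; both views lead to the same test decisions.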

Only minimally more powerful is the correction of Šidák (1967), with its adjusted α bound of

$$\tilde{\alpha} = 1 - (1 - \alpha)^{1/z}.$$

Unlike the Bonferroni method, it controls α exactly (in a probabilistic sense), but only under independence of test statistics. In the presence of correlation among tests, its achieved type I error level can lie considerably below the nominal α.
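A short sketch can show how small the gain over Bonferroni is. The per-test bound follows the Šidák formula given above; the adjusted p-values 1 − (1 − p_h)^z are the usual counterpart and are an addition here, not taken from the text. Function names are hypothetical.

```python
def sidak_alpha(alpha: float, z: int) -> float:
    """Per-test level of the Sidak correction: 1 - (1 - alpha)^(1/z)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / z)


def sidak_adjust(p_values):
    """Sidak-adjusted p-values 1 - (1 - p_h)^z (standard counterpart,
    assumed here; valid under independence of the test statistics)."""
    z = len(p_values)
    return [1.0 - (1.0 - p) ** z for p in p_values]


alpha, z = 0.05, 10
print("Bonferroni bound:", alpha / z)
print("Sidak bound:     ", sidak_alpha(alpha, z))

# Under independence, z tests at the Sidak level attain the FWER exactly.
print("Attained FWER:   ", 1.0 - (1.0 - sidak_alpha(alpha, z)) ** z)
```

The Šidák bound is only marginally larger than α/z, which is why the power gain over Bonferroni is minimal.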

This conservatism can be cushioned by incorporating the dependence of test statistics by means of their joint parametric distribution. Tukey’s pairwise comparisons using the studentized range distribution, Dunnett’s comparisons with a control, or the analysis of means (ANOM) are examples of how α can be better exploited under dependence. These are all single-step test procedures (meaning that the same critical value applies to all test statistics), and all of them can be viewed as special cases of multiple contrast tests (MCTs), which we will consider in depth in Section 3.7.
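The benefit of using the joint distribution can be sketched by simulation. The example below is a deliberately simplified illustration and not one of the procedures named above: it assumes z standard normal test statistics with a common correlation ρ (rather than a multivariate t distribution with an arbitrary correlation structure) and estimates the common two-sided critical value by Monte Carlo. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def mc_critical_value(z, rho, alpha=0.05, n_sim=50_000, seed=1):
    """Monte Carlo estimate of the single-step critical value c with
    P(max_h |T_h| <= c) = 1 - alpha.

    Assumption (illustration only): T_1, ..., T_z are standard normal
    with common correlation rho in [0, 1).
    """
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    maxima = []
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)  # component shared by all statistics
        maxima.append(
            max(abs(a * shared + b * rng.gauss(0.0, 1.0)) for _h in range(z))
        )
    maxima.sort()
    # Empirical (1 - alpha) quantile of the maximum absolute statistic.
    return maxima[min(n_sim - 1, int((1.0 - alpha) * n_sim))]


def bonferroni_critical_value(z, alpha=0.05):
    """Two-sided normal critical value at the Bonferroni level alpha / z."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * z))


z, rho = 5, 0.5
print("Bonferroni critical value:    ", round(bonferroni_critical_value(z), 3))
print("Joint-distribution (MC) value:", round(mc_critical_value(z, rho), 3))
```

The simulated critical value lies below the Bonferroni value, and the gap widens as ρ increases; a smaller common critical value translates into more power and narrower simultaneous intervals.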

Stepwise test procedures offer another strategy to lessen conservatism, as they uniformly improve the power of the corresponding single-step tests. The Bonferroni-Holm step-down test (Holm 1979) is uniformly more powerful than Bonferroni; similarly, the single-step many-to-one test of Dunnett (1955) can be made uniformly more powerful in step-down (Naik 1975; Marcus et al. 1976; Dunnett and Tamhane 1991) or step-up (Dunnett and Tamhane 1992, 1995) variants³.
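The step-down idea can be sketched for the Bonferroni-Holm test in terms of adjusted p-values: the ordered p-values are multiplied by z, z − 1, . . . , 1, and monotonicity is enforced by a running maximum. The raw p-values are made up for illustration and the function names are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values.

    The k-th smallest p-value (k = 0, ..., z - 1) is multiplied by z - k;
    a running maximum keeps the adjusted values monotone in the ordering.
    """
    z = len(p_values)
    order = sorted(range(z), key=lambda h: p_values[h])
    adjusted = [0.0] * z
    running_max = 0.0
    for rank, h in enumerate(order):
        candidate = min((z - rank) * p_values[h], 1.0)
        running_max = max(running_max, candidate)
        adjusted[h] = running_max
    return adjusted


def bonferroni_adjust(p_values):
    """Single-step Bonferroni adjustment, for comparison."""
    z = len(p_values)
    return [min(z * p, 1.0) for p in p_values]


alpha = 0.05
raw = [0.001, 0.012, 0.02, 0.4]   # hypothetical raw p-values
holm = holm_adjust(raw)
bonf = bonferroni_adjust(raw)

print("Holm:      ", holm, sum(p <= alpha for p in holm), "rejections")
print("Bonferroni:", bonf, sum(p <= alpha for p in bonf), "rejections")
```

Every Holm-adjusted p-value is at most as large as its Bonferroni counterpart, so Holm rejects everything Bonferroni rejects and possibly more; in this made-up example it rejects one additional hypothesis.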

The trouble with stepwise techniques is that compatible SCIs are, if available at all, cumbersome to derive and in most cases noninformative. Guilbaud (2008) and Strassburger and Bretz (2008) established SCI bounds corresponding to Holm-type step-down (“sequentially rejective”) tests; these bounds stick to the margin δh (usually δh = 0 ∀ h) for all rejected hypotheses, and hence provide no additional information compared to the p-values unless all hypotheses are rejected. Compatible SCIs for step-down Dunnett tests, following suggestions by Bofinger (1987) and Stefansson et al. (1988), have the same unpleasant property. SCIs compatible with step-up Dunnett tests do not exist to date. We consider the unavailability of assuredly informative SCIs a major deficiency that outweighs the achievable gain in power, hence stepwise procedures will not play a role in the remainder of this work.

Detailed treatments of simultaneous inference are provided in the textbooks by Hochberg and Tamhane (1987) and Hsu (1996) as well as in a series of recent review articles (Dmitrienko and D’Agostino, Sr. 2013; Alosh et al. 2013; Dmitrienko et al. 2013; Huque et al. 2013). The books by Dmitrienko et al. (2010) and Dickhaus (2014) connect mathematical theory and biomedical applications. Software-specific overviews are given, e.g., in Bretz et al. (2010, using R) or Westfall et al. (2011, using SAS).