Interim analyses and sequential designs in phase III studies


Susan Todd, Anne Whitehead, Nigel Stallard & John Whitehead

Medical and Pharmaceutical Statistics Research Unit, The University of Reading, PO Box 240, Earley Gate, Reading, Berkshire, RG6 6FN

Recruitment of patients to a clinical trial usually occurs over a period of time, resulting in the steady accumulation of data throughout the trial's duration. Yet, according to traditional statistical methods, the sample size of the trial should be determined in advance, and data collected on all subjects before analysis proceeds. For ethical and economic reasons, the technique of sequential testing has been developed to enable the examination of data at a series of interim analyses. The aim is to stop recruitment to the study as soon as there is sufficient evidence to reach a firm conclusion. In this paper we present the advantages and disadvantages of conducting interim analyses in phase III clinical trials, together with the key steps to enable the successful implementation of sequential methods in this setting. Examples are given of completed trials, which have been carried out sequentially, and references to relevant literature and software are provided.

Keywords: clinical trials, error rates, monitoring, sequential trials

Introduction

In this, the first in a series of three papers dealing with the opportunities and dangers presented by interim analyses in clinical trials, we focus on phase III clinical studies. A phase III clinical trial is a large-scale study, typically comparing a promising experimental treatment with a control (placebo or active). Its purpose is to seek firm evidence to support a claim that the experimental treatment has clinical benefits. In this paper we show how sequential methodology can play an important role in such trials.

The traditional approach to conducting phase III clinical trials has been to calculate a single fixed sample size in advance of the study, which depends upon a specified significance level and power and the treatment advantage to be detected. Data on all patients are then collected before any formal analyses are performed. While such a framework is logical when observations are available simultaneously, as in an agricultural field trial, it may be less suitable for medical studies, in which patients are recruited over months if not years, and data are available sequentially. Here, results from patients who enter the trial early on are available for analysis while later patients are still being enrolled. It is natural to be interested in such results, but the uncontrolled examination of data can lead to misleading and sometimes wholly inappropriate conclusions, an issue which is considered further in this article.

Some routine monitoring of trial progress, usually blinded to treatment allocation, is often undertaken as part of a phase III trial. This can range from simple checking of protocol compliance and the accurate completion of record forms, to monitoring adverse events in trials of serious conditions so that prompt action can be taken. Such monitoring may be undertaken in conjunction with a data and safety monitoring board (DSMB), established to review the information collected. It would therefore appear that assessment of interim treatment differences is a logical and worthwhile extension. However, the handling of treatment comparisons while a trial is still in progress poses problems in medical ethics, statistical analysis and practical organization [1].

In methodological terms, the approach presented in this paper is known as the frequentist approach and is the most widely used framework in clinical trials. An alternative school of thought, not discussed here, but mentioned for completeness, is the Bayesian approach as described by Spiegelhalter et al. [2].

Opportunities and dangers

The most appealing reason for monitoring trial data for treatment differences is that, ethically, it is desirable to terminate or change a trial when evidence has emerged that one treatment is clearly superior to the other. This is particularly important when life-threatening diseases are involved. Alternatively, the data may support the conclusion that the experimental treatment and the control do not differ by some predetermined clinically relevant magnitude, in which case it would be desirable, both ethically and economically, to stop the study and divert resources elsewhere. Finally, if information in a trial is accruing more slowly than expected, perhaps because of a low event rate, then extension of recruitment until a large enough sample has been recruited may be appropriate.

Correspondence: Dr S. Todd, Medical and Pharmaceutical Statistics Research Unit, The University of Reading, PO Box 240, Earley Gate, Reading, Berkshire, RG6 6FN. Tel.: 0118 9318917; Fax: 0118 9753169; E-mail: s.c.todd@reading.ac.uk

Unfortunately, multiple analyses of accumulating data lead to problems in the interpretation of results. The main problem occurs when significance testing is undertaken at the various interim looks. Even if the treatments are really equally effective, the more often one analyses the accumulating data, the greater the chance of eventually and wrongly detecting a difference, thereby drawing incorrect conclusions from the trial. Armitage et al. [3] were the first to compute numerically the extent to which the type I error probability (the probability of incorrectly declaring the experimental treatment as different from control) is increased over its nominal level if a standard hypothesis test is conducted at each of a series of interim looks. They studied the problem of testing a normal mean with known variance and set the significance level or type I error probability for the trial to be 5%. If one interim analysis and one final analysis are performed this error rises to 8%. If four interim analyses and a final analysis are undertaken this figure is 14%. Similar figures can be anticipated for other response types. In order to make use of the advantages of monitoring the treatment difference, methodology is required to maintain the overall type I error rate at an acceptable level.
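The inflation described above can be checked by simulation. The following Python sketch is an illustration only, not the numerical method used by Armitage et al. [3]: it assumes equally sized groups of normally distributed observations with known variance, applies the unadjusted two-sided 5% critical value at every look, and counts how often the null hypothesis is wrongly rejected. The function name and its parameters are hypothetical.

```python
import math
import random


def inflated_type1_error(n_looks, n_sims=50_000, z_crit=1.959964, seed=1):
    """Estimate the overall type I error when an unadjusted two-sided
    5% test is repeated at each of `n_looks` equally spaced analyses.

    Assumption: under the null hypothesis each group of patients adds
    an independent N(0, 1) increment to the cumulative score, so the
    standardized statistic at look k is score / sqrt(k).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        score = 0.0
        for k in range(1, n_looks + 1):
            score += rng.gauss(0.0, 1.0)
            if abs(score) / math.sqrt(k) > z_crit:
                rejections += 1  # a (false) significant result: stop
                break
    return rejections / n_sims


if __name__ == "__main__":
    for looks in (1, 2, 5):
        print(looks, "analyses:", round(inflated_type1_error(looks), 3))
```

With one, two and five analyses the estimates should come out near the 5%, 8% and 14% quoted in the text, up to simulation error.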

A second problem concerns the final analysis. When data are inspected at interim looks, the analysis appropriate for fixed sample size studies is no longer valid. Quantities such as P values, point estimates and confidence intervals are still well defined, but new methods of calculation are required. If a traditional analysis is performed at the end of a trial that stops because the experimental treatment is found better than control, the P value will be too small (too significant), the point estimate too large and the confidence interval too narrow.

To deal with the above problems, special techniques are required. These can be broadly termed sequential methods. In the following section a brief overview of this methodology and related issues is given.

Sequential methodology

In his 1999 paper [4], Whitehead lists the key ingredients required to conduct a trial sequentially (see Figure 1). The first two ingredients are common to both fixed sample size and sequential studies, but are worth emphasizing for completeness. The second two are solutions to the particular problems of error rates and analysis in the sequential setting. Any combination of choices for the four ingredients is permissible, but, largely for historical reasons, particular combinations preferred by authors in the field have been extensively developed, incorporated into software (see below) and used in practice. Each of the four ingredients will now be considered briefly in turn.

Parameterization of the treatment difference

As with a fixed sample size study, the first stage in designing a phase III sequential clinical trial is to establish a primary measure of efficacy. The authority of any clinical trial will be greatly enhanced if a single primary response is specified in the protocol and is subsequently found to show significant benefit of the experimental treatment. The choice should depend upon such criteria as clinical relevance, ease of obtaining accurate measurements and familiarity to clinicians. Appropriate choice for the associated parameter measuring treatment difference can then be made. This should depend upon such criteria as interpretability, for example whether a measurement based on a difference or a ratio is more familiar, and precision of the resulting analysis. A wide variety of continuous and discrete data types can be dealt with. Suppose that in a clinical trial the appropriate response is identified as survival time following treatment for cancer, then a suitable parameter of interest might be the log-hazard ratio. If the primary response is a continuous measure such as the reduction in blood pressure after 1 month of antihypertensive medication then the difference in true (unknown) means is of interest. Finally, if we are considering a dichotomous variable, such as the occurrence (or not) of deep vein thrombosis following hip replacement, the log-odds ratio may be the parameter of interest.

Test statistics for use in interim analyses

A sequential test monitors a statistic summarizing the current difference between the experimental treatment and control at a series of times during the trial. If the absolute value of this statistic exceeds some specified critical value, the trial is stopped and the null hypothesis of no difference between treatments is rejected. The timing of the interim looks can be measured directly in terms of number of patients, or more flexibly in terms of information. It should be noted that the test statistic measuring treatment difference may increase or decrease between looks, while the statistic measuring information will always increase. Early work in this area prescribed designs whereby traditional test statistics, such as the t-statistic or the chi-squared statistic, were monitored after each patient's response was obtained. Examples can be found in the book by Armitage [5]. Later work by Pocock [6] and O'Brien & Fleming [7] allowed inspections after the responses from each group of k patients were obtained, where k was predefined. Since then, statisticians have developed more flexible ways of conducting sequential trials when considering the number and the timing of interim inspections. Whitehead [8] monitors a statistic measuring treatment difference known in technical terms as the efficient score and times the interim looks in terms of a second statistic approximately proportional to study sample size known as observed Fisher's information. Jennison & Turnbull [9] use a direct estimate of the treatment difference itself as the test statistic of interest and record inspections in terms of a function of its standard error.

Stopping rules for sequential trials

As highlighted above, a sequential test compares the test statistic measuring treatment difference with appropriate critical values. These critical values form a stopping rule or boundary for the trial. At any stage in the trial, if the boundary is crossed, the study is stopped and an appropriate conclusion drawn. If the statistic stays within the test boundary then there is not enough evidence to come to a conclusion at present and a further interim look should be taken. It is possible to look after every patient or to have just one or two interim analyses. When interims are performed after groups of patients this may be referred to as a `group sequential trial'. The advantage of looking after every patient is that a trial can be stopped as soon as an additional patient response results in the boundary being crossed. In contrast, performing just one or two looks reduces the potential for stopping, and hence delays it. However, the logistics of performing interim analyses after groups of subjects are far easier to manage. In practice, planning for between 4 and 8 interim analyses appears sensible.
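To make the idea of comparing a statistic with a boundary at each look concrete, the Python sketch below simulates a group sequential trial under the null hypothesis and records how often a boundary is crossed. It is an illustration under stated assumptions, not a design tool: five equally spaced looks are assumed, and the constants 2.413 (constant boundary in the style of Pocock [6]) and 2.040 (boundary in the style of O'Brien & Fleming [7]) are the tabulated two-sided 5% values for five looks taken from the group sequential literature rather than from this paper. The function names are hypothetical.

```python
import math
import random

N_LOOKS = 5  # assumption: five equally spaced analyses


def critical_value(kind, k):
    """Critical value for the standardized statistic at look k (1-based).

    The constants are tabulated two-sided 5% values for five looks and
    are assumptions of this sketch.
    """
    if kind == "pocock":
        return 2.413  # same critical value at every look
    if kind == "obrien-fleming":
        return 2.040 * math.sqrt(N_LOOKS / k)  # strict early, near 2.04 at the end
    raise ValueError(f"unknown boundary type: {kind}")


def crossing_probability(kind, n_sims=50_000, seed=2):
    """Proportion of simulated null trials that cross the boundary."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_sims):
        score = 0.0
        for k in range(1, N_LOOKS + 1):
            score += rng.gauss(0.0, 1.0)  # one group's contribution under H0
            if abs(score) / math.sqrt(k) >= critical_value(kind, k):
                crossed += 1  # boundary crossed: the trial stops here
                break
    return crossed / n_sims


if __name__ == "__main__":
    for kind in ("pocock", "obrien-fleming"):
        print(kind, round(crossing_probability(kind), 3))
```

Both estimates should lie close to the nominal 5%, in contrast with the inflated rates obtained when the fixed sample size critical value is reused at every look.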

Once it had been established that there was a problem with inflating the type I error when using traditional tests and the usual fixed sample size critical values, designs had to be suggested which adjusted for this. It is the details of the derivation of the stopping rule that introduce much of the variety of sequential methodology. Key early work in the area includes the tests of Pocock [6] and O'Brien & Fleming [7]. A more flexible approach, referred to as the alpha-spending method, was proposed by Lan & DeMets [10] and extended by Kim & DeMets [11]. A collection of designs based on straight line boundaries, which builds on work that has steadily accumulated since the 1940s, is discussed by Whitehead [8], the best known and most widely implemented of these being the triangular test.
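As a brief illustration of the alpha-spending idea, the Python sketch below evaluates two spending functions commonly associated with Lan & DeMets [10]: one that mimics O'Brien & Fleming boundaries and one that mimics Pocock boundaries. Each returns the cumulative type I error that may be `spent' by the time a fraction t of the planned information has accrued. The function names are hypothetical, and converting the spent error into critical values requires numerical integration that is not attempted here.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def obf_type_spending(t, alpha=0.05):
    """O'Brien & Fleming-type spending: 2 - 2*Phi(z_{alpha/2} / sqrt(t))."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / math.sqrt(t)))


def pocock_type_spending(t, alpha=0.05):
    """Pocock-type spending: alpha * log(1 + (e - 1) * t)."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)


if __name__ == "__main__":
    # t is the information fraction, 0 < t <= 1
    for t in (0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  OBF-type={obf_type_spending(t):.5f}  "
              f"Pocock-type={pocock_type_spending(t):.5f}")
```

Both functions reach the full 5% at t = 1, but the O'Brien & Fleming-type function spends very little early on, which makes early stopping harder and leaves the final critical value close to the fixed sample size one.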

The important issues to focus upon are the desirable reasons for stopping or continuing a study. Reasons for stopping may include:

• The experimental treatment is obviously worse than the control
• The experimental treatment is already obviously better
• There is little chance of showing that the experimental treatment is better.

Reasons for continuing may include:

• A moderate advantage of the experimental treatment is likely and it is desired to estimate the magnitude carefully
• The event rate is low and more patients are needed to achieve power.

These will determine the type of stopping rule that is appropriate for the study under consideration. Stopping rules are now available for testing superiority, noninferiority, equivalence and even safety aspects of clinical trials. As an example, consider a clinical trial conducted by the Medical Research Council Renal Cancer Collaborators between 1992 and 1997 [12]. Patients with metastatic renal carcinoma were randomly assigned to treatment with either the biological therapy, interferon-α, or the hormone therapy, oral medroxyprogesterone acetate (MPA). The use of interferon-α was experimental and this treatment is known to be both toxic and costly. Consequently its benefits over MPA needed to be substantial to justify its wider use. A stopping rule was required to satisfy the following requirements:

• Early stopping if data showed a clear advantage of interferon-α over oral MPA
• Early stopping if data showed no worthwhile advantage of interferon-α (either interferon-α obviously worse or little difference between treatments).


This suggested use of an asymmetric stopping rule. The design chosen was the triangular test [8], similar in appearance to the stopping rule in Figure 2. Interim analyses were planned every 6 months from the start of the trial.

The precise form of the stopping rule is defined, as is the sample size in a fixed sample size trial, by consideration of significance level, power and desired treatment advantage, with reference to the primary endpoint. The primary endpoint in the MRC study was survival time and the treatment difference was measured by the log-hazard ratio. It was decided that if a difference in 2 year survival from 20% on MPA to 32% on interferon-α (log-hazard ratio −0.342) was present, then a significant treatment difference at the two-sided 5% significance level should be detected with 90% power.
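To indicate how significance level, power and treatment advantage translate into a design, the Python sketch below computes the information that a fixed sample size trial would need for the MRC specification. It is a rough analogue only: the MRC trial used a triangular test, whose sample size is not fixed, and the paper does not report this calculation. The sketch assumes the standard normal-approximation formula V = ((z_{α/2} + z_β)/θ)², and, for survival data with 1:1 randomization, that information is approximately one quarter of the number of deaths. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def fixed_sample_information(theta, alpha=0.05, power=0.90):
    """Information V needed by a fixed sample size test to detect a
    treatment difference `theta` (on the log scale) with the given
    two-sided significance level and power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    z_beta = nd.inv_cdf(power)
    return ((z_alpha + z_beta) / theta) ** 2


# Magnitude of the log-hazard ratio quoted in the text
theta_r = 0.342
v_fixed = fixed_sample_information(theta_r)

# Assumption: V is roughly (number of deaths) / 4 under equal allocation
deaths_needed = math.ceil(4 * v_fixed)

if __name__ == "__main__":
    print(f"required information V: {v_fixed:.1f}")
    print(f"approximate deaths required: {deaths_needed}")
```

A sequential design with the same error rates would typically have a larger maximum but a smaller expected sample size than this fixed sample size benchmark.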

Analysis following a sequential trial

Once a sequential trial has stopped, an analysis will be performed. The interim analyses determine only whether stopping should take place; they do not provide a complete interpretation of the data. An appropriate final analysis must take account of the fact that a sequential design was used. Unfortunately, many trials which have been terminated at an interim analysis are finally reported with analyses which take no statistical account of the inspections made [13]. In a sequential trial, although the meaning and interpretation of data summaries such as significance levels, point estimates and confidence intervals remain as for fixed sample size trials, various methods of calculation have been proposed. These lead to slightly different results when applied to the same set of data. The user of a computer package such as those referenced below may accept the convention of the package and use the resulting analysis without being concerned about the details of calculation. Readers who wish to develop a deeper understanding of statistical analysis following a sequential trial are referred to Chapter 5 of Whitehead [8] and Chapter 8 of Jennison & Turnbull [14].

Sequential clinical trials in practice

Increasingly, sequential procedures are being implemented in modern clinical trials. Peace [15] presents case studies of several applications, some of which have formed part of New Drug Applications (NDAs) that have been approved by the Food and Drug Administration (FDA). Additional examples can be found in the proceedings of two workshops, one on practical issues in data monitoring sponsored by the US National Institutes of Health held in 1992 (published in issues 5 and 6 of volume 12, 1993, of Statistics in Medicine) and the other on early stopping rules in cancer clinical trials held at Cambridge University in 1993 (published in issues 13 and 14 of volume 13, 1994, of Statistics in Medicine). The medical literature also demonstrates the widening use of sequential methods. Examples of such studies include trials of corticosteroids for AIDS-induced pneumonia [16], of enoxaparin for prevention of deep vein thrombosis resulting from hip replacement surgery [17] and of implanted defibrillators in coronary heart disease [18].

Two books dealing exclusively with the implementation of sequential methods in clinical trials are those by Whitehead [8] and Jennison & Turnbull [14]. In addition, there are three commercial software packages currently available. The package PEST [19] is based on straight line boundaries. The package EaSt [20] implements the boundaries of Wang & Tsiatis [21] and Pampallona & Tsiatis [22]. A recent addition to the package S-Plus is the S+ SeqTrial module [23]. PEST and EaSt have both been developed over a number of years and are the leading packages in this field. Both packages allow construction of stopping rules for a variety of practical circumstances, and provide a valid final analysis. PEST also includes computation of appropriate test statistics at each interim analysis, together with some additional final analysis options. A good review of the capabilities of earlier versions is given by Emerson [24]. The S-Plus module is new this year and consequently has not yet been as extensively used. An example of the design and implementation of an actual sequential trial is given in Figure 2.

When planning any clinical trial sequentially, the implications of introducing a stopping rule need to be thought out carefully in advance of the study. In addition, all involved in the trial should be consulted with regard to the choice of a clinically relevant difference, specification of an appropriate power requirement, and the selection of a suitable stopping rule. As part of the protocol for the study the operation of any sequential procedure should be described clearly in the statistical section.

If a DSMB is appointed, one of its roles should be to scrutinize any proposed sequential stopping rule prior to the start of the study and to review the protocol in collaboration with the trial Steering Committee. The procedure for undertaking the interim analyses should also be finalized in advance of the trial start-up. The DSMB would then review results of the interim analyses as they are reported. Membership of the DSMB and its relationship with other parties in a clinical trial has been considered in the 1993 Statistics in Medicine volume referenced above and by Whitehead [25]. It is important that the interim results of an ongoing trial are not circulated widely as this may have an undesirable effect on the future progress of the trial. Investigators' attitudes will clearly be affected by whether a treatment looks good or bad as the trial progresses. It is usual for the DSMB to be supplied with full information and, ideally, the only other individual to have knowledge of the treatment comparison would be the statistician who performs the actual analyses.

Decision making as part of a sequential trial (whether by a DSMB or another party involved in the trial) is both important and time sensitive. A decision taken to stop a study not only affects the current trial, but often affects future trials planned in the same therapeutic area. However, continuing a trial too long puts participants at unnecessary risk and delays the dissemination of important information. It is essential to make important scientific and ethical decisions with confidence. Wondering whether the data supporting interim analyses are accurate and up-to-date is unsettling and makes the decision process harder. It is therefore necessary for the statistician performing the interim analyses to have both timely and accurate data. Unfortunately, a trade-off exists: it takes time to ensure accuracy. Potential problems can be alleviated if data for interim analyses are reported separately from the other trial data, as part of a `fast-track' system. Fewer data can be validated more quickly. If timeliness and accuracy are not in balance, not only may real-time decisions be made on old data, but more seriously, differential reporting may lead to inappropriate study conclusions.

Discussion

Sequential methodology in phase III clinical trials is not new, but it is true to say that it is the more recent theoretical developments, together with the availability of software, which have precipitated its wider use. The methodology is flexible as it enables choice of a stopping rule from a number of alternatives, allowing the trial design to meet the study objectives. One important point is that a stopping rule should not govern the trial completely. If external circumstances change the appropriateness of the trial or assumptions made when choosing the design are suspected to be false, it can and should be overridden, although the reasons for doing so must be carefully documented.

Methodology for conducting a phase III clinical trial sequentially has been extensively developed, evaluated and documented. Error rates can be accurately preserved and valid inferences drawn. It is important that this fact is recognized and that individuals contemplating the use of interim analyses conduct them correctly. Neither the FDA nor the Medicines Control Agency (MCA) looks favourably on evidence from trials incorporating unplanned looks at data. In the US, the Federal Register (1985) published regulations for NDAs which included the requirement that the analysis of a phase III trial `assess...the effects of any interim analyses performed'. The FDA guidelines were updated by publication of `E9 Statistical Principles for Clinical Trials' in a later Federal Register (1998). Section 3 of this document discusses group sequential designs and Section 4 covers trial conduct including trial monitoring, interim analysis, early stopping, sample size adjustment and the role of an independent DSMB. With such acknowledgement from regulatory authorities the future for sequential methodology within clinical trials is encouraging.

Figure 2 [Plot of the statistic Z (vertical axis) against V (horizontal axis), showing the stopping boundaries and the observed data points; image not reproduced.]

Amongst the studies conducted in the development of Viagra was a small trial in men suffering erectile dysfunction as a result of spinal cord injury [26]. An efficient trial methodology for reaching a reliable conclusion with as few subjects as possible was required. It was felt that spontaneous improvement of their erections would be reported by 25% of men on placebo. An increase in the percentage of improvements from 25% on control to 60% on Viagra was felt to be clinically relevant. It was desired to detect this with power 0.8. A significance level of 0.05 was specified. When the objectives of the trial were considered in detail, an appropriate stopping rule known as the triangular test was chosen.

Eligible men attending clinics in Southport, Belfast and Stoke Mandeville, who had a regular female partner, were randomised between Viagra and a matching placebo pill. After 4 weeks they were asked whether the treatment received had improved their erections. By January 1996, 12 men had completed 4 weeks of treatment with 5/6 on Viagra and 1/6 on placebo reporting improvement. The first point plotted on the figure (x) represents those data. The statistic Z signifies the advantage seen so far on Viagra and is calculated from the observed number of successes on Viagra minus the number of successes that would have been expected if Viagra had no effect. The expected number of successes can be found by multiplying the total number of successes (6) by the proportion of men receiving Viagra (1/2), giving 3, so that Z is equal to 5 − 3 = 2 as plotted in the figure. The statistic V measures the information on which that comparison is based. This is the variance of Z. The inner dotted boundaries, known as the Christmas tree correction for discrete looks, form the stopping boundary: reach this and the trial is complete. Crossing the upper boundary results in a positive trial conclusion. The data were studied again in February, when 6/8 had improved on Viagra and 1/8 on placebo, and in March, by which time improvement rates were 8/10 on Viagra and 1/10 on placebo. The upper boundary was reached and recruitment closed. When the results on the 6 men under treatment at that time were added, the rates became 9/12 and 1/14, respectively. By using a series of interim looks, the design allowed a strong positive conclusion to be drawn after only 26 men had been treated. A total of 57 subjects would have been entered into a fixed sample size trial.
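The arithmetic behind the Viagra example of Figure 2 can be reproduced with a short Python sketch. The calculation of Z follows the description in the figure text. The formula used for V, n_E n_C S F / n³ (with S successes and F failures among n patients), is the usual expression for binary data in the efficient score framework and is an assumption here, since the text says only that V is the variance of Z. The fixed sample size calculation likewise assumes the standard normal-approximation formula on the log-odds ratio scale. Function names are hypothetical, and the stopping boundaries themselves are not computed.

```python
import math
from statistics import NormalDist


def score_statistics(succ_e, n_e, succ_c, n_c):
    """Return (Z, V) for binary responses on experimental (e) and control (c)."""
    n = n_e + n_c
    successes = succ_e + succ_c
    failures = n - successes
    # Observed successes on experimental minus those expected under no effect
    z = succ_e - successes * n_e / n
    # Assumed information formula for binary data
    v = n_e * n_c * successes * failures / n ** 3
    return z, v


def fixed_sample_size(p_control, p_experimental, alpha=0.05, power=0.80):
    """Approximate total sample size of a fixed design with 1:1 allocation."""
    nd = NormalDist()
    theta = math.log(p_experimental * (1 - p_control)
                     / (p_control * (1 - p_experimental)))  # log-odds ratio
    v = ((nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) / theta) ** 2
    p_bar = (p_control + p_experimental) / 2
    return math.ceil(4 * v / (p_bar * (1 - p_bar)))


# (successes, patients) on Viagra then placebo, as reported in the text
looks = {
    "January": (5, 6, 1, 6),
    "February": (6, 8, 1, 8),
    "March": (8, 10, 1, 10),
    "Final": (9, 12, 1, 14),
}

if __name__ == "__main__":
    for label, counts in looks.items():
        z, v = score_statistics(*counts)
        print(f"{label:9s} Z = {z:.2f}  V = {v:.2f}")
    print("fixed sample size:", fixed_sample_size(0.25, 0.60))
```

For the January data this gives Z = 2, as stated in the text, and the fixed sample size calculation returns the 57 subjects quoted.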

The authors are grateful to the two referees for their comments and suggestions.

References

1 Pocock SJ. Clinical trials: a practical approach. New York, Wiley, 1983.

2 Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomized trials. J Roy Statist Soc Series A 1994; 157: 357–416.

3 Armitage P, McPherson CK, Rowe BC. Repeated significance tests on accumulating data. J Roy Statist Soc Series A 1969; 132: 235–244.

4 Whitehead J. A unified theory for sequential clinical trials. Statistics Med 1999; 18: 2271–2286.

5 Armitage P. Sequential medical trials (2nd edn). Oxford, UK, Blackwell, 1975.

6 Pocock SJ. Group sequential methods in the design and analysis of clinical trials. Biometrika 1977; 64: 191–199.

7 O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35: 549–556.

8 Whitehead J. The design and analysis of sequential clinical trials (revised 2nd edn). Chichester, UK, John Wiley & Sons Ltd, 1997.

9 Jennison C, Turnbull BW. Group sequential analysis incorporating covariate information. J Am Statist Assoc 1997; 92: 1330–1341.

10 Lan KKG, DeMets DL. Discrete sequential boundaries for clinical trials. Biometrika 1983; 70: 659–663.

11 Kim K, DeMets DL. Design and analysis of group sequential tests based on the type I error spending rate function. Biometrika 1987; 74: 149–154.

12 Medical Research Council Renal Cancer Collaborators. Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. Lancet 1999; 353: 14–17.

13 Facey KM, Lewis JA. The management of interim analyses in drug development. Statistics Med 1998; 17: 1801–1809.

14 Jennison C, Turnbull BW. Group sequential methods with applications to clinical trials. Boca Raton, USA, Chapman & Hall/CRC, 2000.

15 Peace KE. Biopharmaceutical sequential statistical applications. New York, Marcel Dekker, 1992.

16 Montaner JSG, Lawson LM, Levitt N, et al. Corticosteroids prevent early deterioration in patients with moderately severe Pneumocystis carinii pneumonia and the acquired immunodeficiency syndrome (AIDS). Ann Intern Med 1990; 113: 14–20.

17 Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharmaceut Med 1992; 6: 179–191.

18 Moss AJ, Hall WJ, Cannom DS, et al. Improved survival with implanted defibrillator in patients with coronary disease at high risk of ventricular arrhythmia. N Engl J Med 1996; 335: 1933–1940.

19 MPS Research Unit. PEST 4: operating manual. The University of Reading, UK, 2000.

20 Cytel Software Corporation. EaSt. A software package for the design and interim monitoring of group-sequential clinical trials. Cytel Software Corporation, Cambridge, Mass, 2000.

21 Wang SK, Tsiatis AA. Approximately optimal one-parameter boundaries for group sequential trials. Biometrics 1987; 43: 193–199.

22 Pampallona S, Tsiatis AA. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favour of the null hypothesis. J Statistical Planning Inference 1994; 42: 19–35.

23 MathSoft Inc. S-Plus. MathSoft Inc, Seattle, Washington, 2000.

24 Emerson SS. Statistical packages for group sequential methods. Amer Statist 1996; 50: 183–192.

25 Whitehead J. On being the statistician on a data and safety monitoring board. Statistics Med 1999; 18: 3425–3434.

26 Derry FA, Dinsmore WW, Fraser M, et al. Efficacy and safety of oral sildenafil (Viagra) in men with erectile dysfunction caused by spinal cord injury. Neurology 1998; 51: 1629–1633.