Statistical issues in interpreting clinical trials

CLINICAL TRIALS

D. L. DeMets

Department of Statistics and Biostatistics, University of Wisconsin-Madison, Madison, WI, USA

Abstract. D. L. DeMets (University of Wisconsin-Madison, Madison, WI, USA). Statistical issues in interpreting clinical trials (Clinical Trials). J Intern Med 2004; 255: 529–537.

The randomized clinical trial is an important research tool for evaluating new therapeutic agents, devices and procedures. To obtain reliable and unbiased results, careful consideration must be given to the design and conduct of the trial. However, bias can be introduced in the analysis of the final data if certain principles are not followed. Several issues are described that make interpretation of analyses challenging. These include the intent-to-treat principle, the use of surrogate outcome measures, subgroup analyses, missing data and noninferiority trials.

Keywords: intent to treat, subgroups, surrogate outcomes.

Introduction

Randomized clinical trials have become an important research tool in the development of new drugs, biologics, devices and procedures for therapeutic and prevention strategies across all disease areas.

Methods for the design, conduct and analysis of these trials have been widely developed [1–4].

Properly designed clinical trials depend interactively on results from previous observational studies, laboratory experiments and other clinical trials.

Conducting a clinical trial requires careful attention to detail, and complying with the intent of the protocol is itself a large challenge. One important principle is that no clever analysis can rescue a flawed design or a poorly conducted trial. However, often the greatest challenge is in the interpretation of results from a properly designed and well-conducted trial. In this article, five issues in the interpretation of a completed trial are discussed; some are old but still relevant and some are more recent.

Intention-to-treat principle

The intention to treat (ITT) principle is important in the interpretation of clinical trial results and there still remains some confusion as to the definition of this principle despite its existence for over three decades [5]. The ITT principle is defined to mean that all patients randomized into a trial are to be accounted for in the primary analysis and all primary events observed during the follow-up period are to be accounted for as well. If either of these aspects is not adhered to, the analysis of results may easily be biased in unpredictable directions and thus the interpretation of the results compromised [6].

One of the common myths is that large trials are particularly robust to these issues. Reducing bias in the assessment of patient response is important in the design, conduct and analysis. Variability in the outcome measure makes it more difficult to detect a true treatment effect. However, large trials reduce the variability of the estimated response, which is their goal, in order to detect more easily whatever treatment effect exists, whether favourable or not. Thus, any bias that exists in the outcome assessment will have an even greater impact on the interpretation.

For whatever reason, trials still do not strictly adhere to the ITT principle, even when the analysis is claimed to be an ITT analysis.

Two reasons are commonly given for excluding randomized patients from the analyses: post hoc ineligibility assessment and lack of patient compliance with the intervention as specified in the protocol. These issues became central to the interpretation of the Anturane Reinfarction Trial, which evaluated sulfinpyrazone in post heart attack patients [7–9].

This trial randomized 1629 patients who had survived a heart attack into two groups, one treated with sulfinpyrazone and the other with a matching placebo. In preparing for the analyses, the randomized patients were further classified into the 1558 who were deemed eligible and the 71 who did not meet the protocol-specified eligibility criteria. The analysis presented in the published papers [7, 8] focused on the 1558 patients declared ‘eligible’ post hoc. A review by Temple and Pledger [9] revealed the impact of eliminating just a few ineligible patients.

As can be seen in Table 1, the analysis is quite sensitive to the inclusion or exclusion of those 71 ‘ineligible’ patients. In fact, a striking difference is observed in the treatment arm between the patients who were eligible and those declared ‘ineligible’ post hoc, whilst the same comparison in the placebo arm is unremarkable. Clearly, interpretation of the results is biased by the exclusion of those patients declared ineligible post hoc.

The trial results were challenged by regulatory review [9].
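The sensitivity described above can be checked directly from the counts in Table 1. The following is a minimal sketch, not the analysis used in the original papers: it recomputes the mortality percentages and adds a crude risk ratio (Anturane versus placebo), which is a derived quantity that does not appear in the source. The variable and function names are illustrative.

```python
# Deaths and patients per arm, taken from Table 1 (Anturane Reinfarction Trial).
TABLE_1 = {
    "randomized": {"anturane": (74, 813), "placebo": (89, 816)},
    "eligible": {"anturane": (64, 775), "placebo": (85, 783)},
    "ineligible": {"anturane": (10, 38), "placebo": (4, 33)},
}


def mortality_percent(deaths, patients):
    """Crude mortality as a percentage of patients."""
    return 100.0 * deaths / patients


def risk_ratio(group):
    """Crude risk ratio, Anturane versus placebo (not reported in the source)."""
    treated = mortality_percent(*TABLE_1[group]["anturane"])
    control = mortality_percent(*TABLE_1[group]["placebo"])
    return treated / control


for name, arms in TABLE_1.items():
    a = mortality_percent(*arms["anturane"])
    p = mortality_percent(*arms["placebo"])
    print(f"{name:<11} Anturane {a:5.1f}%  placebo {p:5.1f}%  "
          f"risk ratio {risk_ratio(name):.2f}")
```

Dropping the 71 ‘ineligible’ patients moves the crude risk ratio further from 1, which is the direction of the bias discussed in the text.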

Not surprisingly, another reason for the exclusion of patients from the primary analysis in some trials is that the patients did not fully comply with the intervention as specified in the protocol. Some researchers have argued that, in order to get a better estimate of the true treatment effect, these nonadherent patients should be excluded [10]. However, just as exclusions for eligibility can bias interpretation, so can exclusion of patients for noncompliance. An illustrative example was provided long ago by the Coronary Drug Project [11], a trial of cholesterol-lowering strategies in men who had survived a heart attack. One of the agents studied was clofibrate, which was compared with a placebo.

As shown in Table 2, the overall results showed no difference in mortality. However, to illustrate the possible bias that can be introduced in the analysis, the patients were classified as either a ‘good’ complier, defined as taking 80% or more of their assigned medication (assessed by pill counts), or a ‘poor’ complier.

The analysis of the good compliers becomes difficult to interpret, which was the point of the illustrative presentation. That is, the comparison of those who did not comply with clofibrate with those who did showed a mortality reduction from 24.6 to 15.0%.

However, an even greater reduction can be seen for a similar comparison in the placebo arm. Compliers live longer than noncompliers regardless of treatment. As shown in their paper, no multivariate regression analysis could explain this curious result.

The authors concluded that compliance with an intervention is itself an outcome and adjusting analysis of one outcome for another is confounded and thus uninterpretable. There are many other such examples that support this conclusion [1].
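The compliance comparison can be made concrete with the figures in Table 2. This is a small illustrative sketch (the function and variable names are not from the source): it checks that the two compliance strata reproduce the ‘By compliance’ row and then contrasts the within-arm gap between poor and good compliers with the between-arm gap among good compliers.

```python
# (patients, 5-year mortality in %) per compliance stratum, from Table 2.
TABLE_2 = {
    "clofibrate": {"poor": (357, 24.6), "good": (708, 15.0)},
    "placebo": {"poor": (882, 28.2), "good": (1813, 15.1)},
}


def pooled_mortality(strata):
    """Patient-weighted average of the stratum mortality percentages."""
    total = sum(n for n, _ in strata.values())
    return sum(n * pct for n, pct in strata.values()) / total


def complier_gap(strata):
    """Mortality of poor compliers minus mortality of good compliers."""
    return strata["poor"][1] - strata["good"][1]


for arm, strata in TABLE_2.items():
    print(f"{arm:<10} pooled {pooled_mortality(strata):.1f}%  "
          f"poor minus good {complier_gap(strata):.1f} points")

between_arms = TABLE_2["placebo"]["good"][1] - TABLE_2["clofibrate"]["good"][1]
print(f"good compliers, placebo minus clofibrate: {between_arms:.1f} points")
```

The gap between poor and good compliers is larger in the placebo arm than in the clofibrate arm, whilst the two arms are almost identical among good compliers, which is why compliance behaves as an outcome rather than as a baseline covariate.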

Table 1 1980 Anturane mortality results

                          Anturane (%)      Placebo (%)       P-value
Randomized                74/813 (9.1)      89/816 (10.9)     0.20
‘Eligible’                64/775 (8.3)      85/783 (10.9)     0.07
‘Ineligible’              10/38 (26.3)      4/33 (12.1)       0.12
P-values for eligible
versus ineligible         0.0001            0.92

Adapted from [7–9].

Table 2 Coronary Drug Project 5-year mortality

                          Clofibrate              Placebo
                          n       % Deaths        n       % Deaths
Total (as reported)       1103    20.0            2782    20.9
By compliance             1065    18.2            2695    19.4
  <80%                    357     24.6            882     28.2
  ≥80%                    708     15.0            1813    15.1

Adjusting for 40 covariates had little impact. Compliance is an outcome. Compliers do better, regardless of treatment. Adapted from [11].

As perfect compliance is often not attainable and exclusion of noncompliant patients may introduce serious bias, those who design trials must attempt to adjust the design and increase the sample size to compensate for nonadherence. The first goal should be to minimize noncompliance by choosing the best-tolerated intervention strategy and then to monitor noncompliance during the conduct. An increase in sample size will often be necessary to compensate for the dilution effect of noncompliance [1]. For example, a 10% noncompliance in the intervention arm can require an increase of 23% in the sample size. A 20% intervention noncompliance may require a 56% sample size increase. Whilst such an increase is most costly and time consuming, the analysis will not be biased and thus the conclusions not compromised.
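The 23% and 56% figures are consistent with a commonly used dilution adjustment in which the sample size is multiplied by 1/(1 − R)^2, where R is the proportion of the intervention arm that does not comply. The text does not state the formula, so the sketch below assumes it; the function name is illustrative.

```python
def sample_size_inflation(noncompliance):
    """Multiplier applied to the sample size, assuming the 1/(1 - R)^2 rule.

    noncompliance: proportion R of the intervention arm that does not comply.
    The formula is an assumption; the article quotes only the resulting
    percentage increases.
    """
    if not 0.0 <= noncompliance < 1.0:
        raise ValueError("noncompliance must be in [0, 1)")
    return 1.0 / (1.0 - noncompliance) ** 2


for rate in (0.10, 0.20):
    increase = 100.0 * (sample_size_inflation(rate) - 1.0)
    print(f"{rate:.0%} noncompliance: about {increase:.0f}% more patients")
```

Under this assumption the required increase grows faster than the noncompliance rate itself, which is one reason to minimize noncompliance in the design rather than rely on a larger trial.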

Surrogate outcome measures

The outcome measure used to assess the primary question should be clinically relevant, sensitive to the intervention effect, able to be ascertained in all patients and resistant to bias. As clinical trials using clinical outcomes can be large, time consuming and costly, researchers have sought alternatives to achieve smaller and shorter trials. One such approach has been to use a substitute for the clinical outcome, often referred to as the surrogate outcome.

The motivation for using a surrogate outcome is shown in Fig. 1: if the intervention modifies the surrogate, it will also modify the clinical outcome.

Whilst such an alternative approach can be appealing, there are requirements that must be met [12] for a surrogate to be a valid substitute. Whilst these requirements were developed for a binary outcome, their essence appears to be generally applicable. One requirement is that the surrogate outcome must be predictive of the clinical outcome. The second requirement is that the surrogate outcome must fully capture the total effect of the intervention on the clinically relevant outcome. The latter requirement is difficult to meet and to validate. The reason is that, in contrast to the simplicity of Fig. 1, biological systems are likely to be more complex, such as those illustrated in Fig. 2. In the scenarios depicted, the intervention may modify the surrogate and have no effect or only a partial effect on the clinical outcome. Alternatively, the intervention may modify the clinical outcome without affecting the surrogate. As discussed by some authors [13, 14], the track record for the use of surrogate outcomes is very disappointing. In fact, these authors present several examples where sole reliance on a surrogate outcome would have resulted in the adoption of interventions that were useless or even harmful.

Fig. 1 The setting that provides the greatest potential for the surrogate end-point to be valid. Reproduced from [13], with kind permission from the publisher.

Fig. 2 Reasons for failure of surrogate end-points. (a) The surrogate is not in the causal pathway of the disease process. (b) Of several causal pathways of disease, the intervention affects only the pathway mediated through the surrogate. (c) The surrogate is not in the pathway of the intervention’s effect or is insensitive to its effect. (d) The intervention has mechanisms for action independent of the disease process. Dotted lines = mechanisms of action that might exist. Reproduced from [13], with kind permission from the publisher.

One of the classic and frequently cited examples is the Cardiac Arrhythmia Suppression Trial (CAST), which evaluated three newly developed drugs that suppressed premature cardiac ventricular contractions [15]. Two of the drug arms in CAST were terminated early with the results shown in Table 3. With only approximately 10–15% of the trial completed, total mortality as well as cause-specific sudden death were dramatically increased for patients randomized to two of the drug treatment arms compared with the placebo arm, achieving highly statistically significant results. This result was quite contrary to expectations. Interestingly, all patients in CAST had cardiac dysrhythmias and had successfully completed a run-in period during which these dysrhythmias were suppressed. Thus, although the surrogate was modified as intended, the effect on the clinical outcome was clearly not desirable. Without CAST, the use of this new class of drugs might have become widespread but at great risk to a number of patients.

In general, interpretation of clinical trials using surrogate outcomes must be extremely cautious.

Even if a surrogate meets the criteria in spirit, it is not clear that another drug of the same class or from another class could be fully evaluated using the same surrogate. If a trial must be interpreted on the basis of a surrogate, it must be done in the context of the number of individuals at risk and what the implications of being incorrect would be.

Subgroup analyses

Clinical trials typically have fairly liberal entry criteria in order to accommodate as many patients as possible and make recruitment feasible in a reasonable period of time. However, a natural clinical question is how the results of the completed trial might apply to a particular patient that a doctor might see in their office. Another reason is to assess the internal consistency of results across risk categories. In addition, previous hypotheses may be confirmed or new hypotheses may be generated.

Thus, it is common to divide the randomized patient population into categories or subgroups and make comparisons of the interventions within these subgroups. However, interpretation of subgroup results can be challenging. First, the more subgroups that are created, the greater the probability of a positive result by chance alone; that is, the false positive rate increases with the number of analyses conducted. For example, five subgroups would have over a 20% chance of at least one statistically significant result if there were no treatment effect, using a nominal 5% significance level for each subgroup analysis. In order to protect against this increase in false positive claims, clinical trials have required that the subgroups be stated in advance in the protocol and be limited in number with the primary goal of evaluating qualitatively the internal consistency. Even with these qualifications, interpretation of subgroup results must be conducted with extreme caution.
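The figure of over 20% for five subgroups follows from a simple calculation if the subgroup tests are treated as independent, which is an assumption made here for illustration (tests on non-overlapping patient subgroups are roughly independent; overlapping subgroups are not). A minimal sketch, with an illustrative function name:

```python
def chance_of_false_positive(n_subgroups, alpha=0.05):
    """Probability of at least one nominally significant subgroup result
    when there is no treatment effect, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_subgroups


for k in (1, 5, 10, 20):
    print(f"{k:>2} subgroups: {chance_of_false_positive(k):.1%}")
```

With five subgroups the probability is about 22.6%, and it continues to rise as more subgroups are examined.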

Wedel et al. describe an intriguing subgroup result in the MERIT trial [16]. MERIT was a trial of beta-blocker (i.e. metoprolol) treatment for patients with congestive heart failure that demonstrated a dramatic overall 34% reduction in mortality, as shown in Fig. 3 [17]. The second primary outcome was death plus all-cause hospitalization, which also showed a dramatic beta-blocker benefit. In the primary paper, remarkable consistency of mortality results was shown (see Fig. 4) across a large number of traditional subgroups of interest in heart failure patients. Here, the relative risk and 95% confidence interval are plotted for each subgroup. Relative risks to the left of the unity line indicate that the beta-blocker is beneficial.

Table 3 Cardiac Arrhythmia Suppression Trial: early termination in two drug arms

                    Drugs    Placebo
Sudden death        33       9
Total mortality     56       22

Fig. 3 Kaplan–Meier curves of cumulative percentage of total mortality. P-value adjusted for two interim analyses. This figure is adapted from [17] with kind permission of the publisher.

However, further analyses revealed one slight inconsistency in relative risk for mortality, one of the two primary outcomes [16], although the confidence intervals overlapped. Fig. 5 shows the MERIT results for the two primary outcomes by country or region. Regulatory review and labelling indications by the United States (US) Food and Drug Administration (FDA) suggest that the beta-blocker may be less effective for mortality in the US than in other regions, and the FDA did not approve this particular beta-blocker for a mortality benefit. However, the FDA did approve the drug for reductions in death plus hospitalization. Statistical analyses of this issue do not provide an adequate solution to a difficult question. As Wedel et al. [16] point out, the statistical methods for testing for a region by treatment interaction are not unique and are generally regarded as not being very powerful to begin with. Results depend on the specific statistical model. Of all the subgroup analyses that were conducted, both prespecified and ad hoc, only this particular US subgroup seemed to deviate slightly from the overall beneficial effect.

However, despite this internal consistency, the issue might not have been addressed further were it not for two other similar heart failure trials evaluating different beta-blockers [18, 19]. In each trial, the mortality results for the US were consistent with those for other geographical regions and very similar to the overall MERIT results, making the US deviation in the MERIT trial likely to be due to chance.

Fig. 4 Absolute numbers, point estimates of the hazard ratios and 95% confidence intervals (for combined end-point time to first event) for total mortality, total mortality plus all-cause hospitalization, and for total mortality plus hospitalization for worsening heart failure from prespecified analyses in predefined subgroups according to baseline characteristics. Filled squares indicate subgroups with a total of 180 events or more; open squares, subgroups with a total of less than 180 events (low power). CR/XL, controlled release/extended release; NYHA, New York Heart Association; EF, ejection fraction; AMI, acute myocardial infarction; HR, heart rate; SBP, systolic blood pressure; DBP, diastolic blood pressure. Adapted from [16].

Another example of the challenge in interpreting subgroup results is provided by the PRAISE-I and PRAISE-II trials [20, 21]. PRAISE-I was a trial evaluating a drug, amlodipine, for the treatment of congestive heart failure. This trial stratified the randomization of 1153 patients by ischaemic and nonischaemic aetiology of heart failure. The results demonstrated a statistically significant interaction (P = 0.004) between aetiology and treatment for mortality, one of the major secondary endpoints of the trial. Standard statistical procedure would be to interpret the results in each subgroup separately. There was no observed treatment effect in the ischaemic subgroup. However, the nonischaemic subgroup had a significant reduction in mortality (P < 0.001), with a relative risk of approximately 0.6. Despite the fact that these two subgroups had been prespecified, researchers had expected the benefit to be in the ischaemic subgroup and were somewhat surprised by this result. Rather than accept the conclusion that the drug was effective in reducing mortality in the nonischaemic heart failure patient, researchers conducted a second mortality trial with a very similar protocol, PRAISE-II, which evaluated the same drug in the nonischaemic heart failure patient (M. Packer, unpublished data). Whilst the results in the PRAISE-I trial were impressive, the results in PRAISE-II showed remarkably similar mortality in the drug and placebo-treated arms. If researchers had followed standard statistical methods using only the PRAISE-I results, they might have concluded that this drug was beneficial in the nonischaemic heart failure patient. Whilst researchers cannot be sure why the PRAISE-I results were not confirmed in the nonischaemic patient, the fact that they were not repeated at least leaves the question open and not yet established.

Missing data

Fig. 5 Absolute numbers, point estimates of the hazard ratios and 95% confidence intervals (for combined end-point time to first event) for total mortality, for total mortality plus all-cause hospitalization, and for total mortality plus hospitalization for worsening heart failure in post hoc subgroups according to country (all patients randomized). Filled squares indicate subgroups with a total of 180 events or more; open squares, subgroups with a total of less than 180 events (low power). Reproduced, with kind permission of the publisher, from [16].

Numbers shown in Fig. 5 (Meto CR/XL/Plac; the plotted relative risks and confidence intervals are not reproduced here):

                     Total mortality    Total mortality/any hosp.    Total mortality/CHF hosp.
                     (no. of deaths)    (no. of events)              (no. of events)
Belgium              3/13               31/31                        13/21
Czech Republic       9/17               35/50                        25/36
Denmark              11/11              58/60                        24/28
Finland              0/2                6/4                          2/3
Germany              19/31              88/100                       44/63
Hungary              16/29              57/72                        31/48
Iceland              2/2                6/10                         3/4
Norway               6/11               41/48                        17/28
Poland               8/8                26/25                        16/18
Sweden               2/9                15/27                        5/14
Switzerland          0/1                5/4                          1/3
The Netherlands      14/25              63/91                        28/52
UK                   4/9                26/29                        11/12
USA                  51/49              184/216                      91/109
All countries        145/217            641/767                      311/439

The ITT principle requires that all patients randomized be accounted for in the primary analysis. However, if the outcome assessment for a patient is missing, then adhering to the ITT principle is challenging. The problem is that missing data may not be missing at random, and thus the ‘missingness’ is informative. For example, the sickest patient may not be able to return to the clinic for the tests and measurements necessary to ascertain the key outcomes. In some cases, patients with a serious life-threatening toxicity may not be able to get the key assessment. If that is the case, then not accounting for the patients with missing data may bias the analysis. Furthermore, no statistical analyses can ever adequately adjust for missing data, despite many techniques that attempt to do so.
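One way to see how much is at stake is to bound the event rate in an arm by assuming that every patient with a missing outcome either had the event or did not. The sketch below is only a simple sensitivity check with hypothetical numbers; it is not a method described in the source and it does not adjust for missing data.

```python
def event_rate_bounds(events_observed, n_observed, n_missing):
    """Return (best case, complete case, worst case) event rates for one arm.

    best case:     no patient with a missing outcome had the event
    complete case: patients with missing outcomes are simply dropped
    worst case:    every patient with a missing outcome had the event
    """
    n_randomized = n_observed + n_missing
    best = events_observed / n_randomized
    complete = events_observed / n_observed
    worst = (events_observed + n_missing) / n_randomized
    return best, complete, worst


# Hypothetical arm: 100 randomized, 10 with missing outcome, 20 observed events.
best, complete, worst = event_rate_bounds(20, 90, 10)
print(f"best {best:.1%}, complete case {complete:.1%}, worst {worst:.1%}")
```

In this hypothetical arm the true event rate could lie anywhere between 20% and 30%. If the sickest patients are the ones missing, the complete-case figure of about 22% understates it, and a spread of this size can exceed the treatment effect a trial is designed to detect.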

In trials with a time-to-event analysis, missing data often arise through a failure to follow the patient until the event of interest is observed. When a patient without the primary event is not followed beyond a point in time, the patient is said to be censored. There are two forms of censoring, one because the trial came to an end and the other because the trial ceased to follow the patient. In this situation, analysis methods such as Kaplan–Meier survival curve estimates [21] and the logrank test [22] assume that censoring is independent of the event process being observed. As the missing data due to loss to follow-up cannot be assumed to be independent of the disease process, and most likely are not independent, researchers must work very hard to minimize, if not eliminate, loss to follow-up. For mortality outcomes, such a goal can be reached by using the National Death Index in the United States or similar registries in other countries.
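To show where the independence assumption enters, here is a minimal sketch of the Kaplan–Meier product-limit estimate in plain Python. The data are hypothetical and the function name is illustrative. A censored patient simply leaves the risk set, so the estimate implicitly treats that patient's future risk as the same as that of the patients who remain under follow-up.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # patients leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set after time t.
        at_risk -= leaving[t]
    return curve


# Hypothetical follow-up (months): events at 1, 3 and 5; censoring at 2 and 4.
for t, s in kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]):
    print(f"t = {t}: S(t) = {s:.3f}")
```

If the patients censored at months 2 and 4 were lost because they were deteriorating, the survival probabilities printed here would be too optimistic, and nothing in the calculation can detect that.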

However, if the event needs tests to be documented such as an X-ray for lung cancer or an EKG for myocardial infarction, the missing data cannot be ascertained as easily as it can for mortality. One all too common practice in trials is that when a patient is taken off treatment due to toxicity or other side effects, the patient is taken off the study and follow-up is terminated. This prevents getting the data necessary to ascertain the primary event.

If researchers only analyse the events whilst the patient is actively on treatment, what is often referred to as ‘on study analysis’, serious bias may be introduced.

Noninferiority trials

Superiority trials attempt to demonstrate that the experimental intervention is better than or superior to the standard or control intervention, which may be best standard care plus a matching placebo. In contrast, noninferiority trials hope to show that the new or experimental intervention is not worse than the standard by some margin of indifference. The new therapy may be easier to administer, better tolerated, less toxic, or less expensive such that there is a beneficial trade-off if not much in treatment benefit is given up. In addition, regulatory guidelines for noninferiority trials may require that researchers demonstrate that the new intervention would have been better than placebo if a placebo arm had been used. This latter requirement can be partially addressed by a meta-analysis of the previous trials which compared the standard intervention with placebo [23]. However, this approach assumes that clinical conditions present during these historical trials are still present when the new trial is being conducted which may not be the case. Thus, some of the issues and biases in the use of historical control trials also apply to this latter analysis.

Fig. 6 Zone of noninferiority for noninferiority trial design. [Figure: relative risks with 95% confidence intervals for trials A–F on a scale from 0.6 to 1.4, with values below 1.0 labelled ‘Test drug better’ and values above 1.0 labelled ‘Standard drug better’. Labelled regions: superiority, noninferiority (i.e. equivalence), inferiority and underpowered trial. The zone of noninferiority and the estimated benefit of the standard drug over placebo are marked.]

As shown in Fig. 6, any given trial may be a superiority trial as well as a noninferiority trial, depending on the results [24]. In Fig. 6, the relative risk and its 95% confidence interval are plotted. A confidence interval that lies entirely to the left of unity implies that the new treatment is better than the control. A confidence interval that lies entirely to the right of unity indicates that the new treatment is worse than the control. In this figure, trial A is a superiority trial because the upper confidence limit excludes unity, thus demonstrating that the new treatment is better than the standard treatment. Likewise, trials C and E are inferiority trials as the lower confidence limit excludes unity, indicating that the control treatment is better or, equivalently, that the new treatment is worse. Trial B indicates a slight benefit but the upper confidence limit does not exclude unity. However, the upper confidence limit does exclude the predefined margin of indifference and thus meets the criterion for noninferiority. Trial C indicates that the new treatment may be slightly worse than control but the upper confidence limit is still less than the margin of indifference, so it qualifies as a noninferiority trial. Trial F is too small and underpowered, such that the confidence interval is too wide to declare the small positive benefit to be either superior (does not exclude unity) or noninferior (does not exclude the margin of indifference).
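The decision rules described for Fig. 6 can be written out as a short sketch. The function name, the labels and the example intervals are hypothetical; the intervals are not read from the figure. Relative risks below 1 are taken to favour the new treatment, and the margin of indifference is a relative risk greater than 1.

```python
def classify_trial(lower, upper, margin):
    """Classify a trial from the 95% confidence interval of the relative risk.

    lower, upper: confidence limits for new treatment versus control
    margin:       prespecified margin of indifference (a relative risk > 1)
    """
    if lower > upper:
        raise ValueError("lower limit must not exceed upper limit")
    if upper < 1.0:
        return "superior"      # whole interval favours the new treatment
    if lower > 1.0:
        return "inferior"      # whole interval favours the control
    if upper < margin:
        return "noninferior"   # interval excludes the margin of indifference
    return "inconclusive"      # too wide to exclude unity or the margin


# Hypothetical intervals, using a margin of indifference of 1.1.
examples = {
    "clear benefit": (0.70, 0.90),
    "slight benefit": (0.85, 1.05),
    "clear harm": (1.15, 1.40),
    "wide interval": (0.80, 1.30),
}
for label, (lo, hi) in examples.items():
    print(f"{label:<15} -> {classify_trial(lo, hi, margin=1.1)}")
```

In this sketch an interval lying entirely between unity and the margin is labelled inferior because that check comes first; how such a result should be reported is a matter of judgement that the rule itself does not settle.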

In focusing on the task of demonstrating that the new intervention is not worse than the standard by some margin of indifference, several challenges must be met. First, the noninferiority trial must be of the highest quality because a poorly executed trial might not be able to detect clinically meaningful differences. Secondly, the noninferiority trial must have a strong, effective control intervention, something equivalent to state-of-the-art care. Finally, the margin of indifference is somewhat arbitrary, depending not on statistical issues but on the medical importance of the treatment in the context of the underlying disease and risk to benefit trade-offs. This can be highly dependent on the specific disease and health care context. Interpreting these aspects of noninferiority trials can be difficult.

One recent example of a superiority/noninferiority trial is the OPTIMAAL trial [25], which compared losartan with captopril in a heart failure patient population. Researchers would have been pleased if losartan were superior to captopril but would also have been satisfied with demonstrating that losartan was not inferior to captopril. The reason is that losartan is probably better tolerated by patients than captopril. By all current standards, OPTIMAAL was a well-designed and well-conducted trial. The superiority portion of the trial was designed to detect a 20% reduction in relative risk with 95% power. The noninferiority portion specified a margin of indifference for the mortality relative risk of 1.1; that is, the upper limit of the 95% confidence interval needed to exclude a relative risk of 1.1 to declare that losartan was noninferior to captopril. The mortality results for OPTIMAAL had a relative risk of 1.126 with an upper 95% confidence limit of 1.28, so that neither superiority nor noninferiority was achieved. The OPTIMAAL researchers did use the historical data to estimate that patients treated with captopril, the standard arm, had an estimated relative risk of 0.806 compared with placebo, or an almost 20% reduction in mortality. Multiplying this historical relative risk by the relative risk from OPTIMAAL gives an approximation to the effect of losartan compared with placebo, had a placebo arm been used. This imputed relative risk is 0.906 but again must be interpreted with extreme caution, even though OPTIMAAL was a trial of the highest quality and well powered.
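The arithmetic behind the OPTIMAAL interpretation can be laid out as a short sketch using only the figures quoted above; the variable names are illustrative. Multiplying the rounded relative risks gives about 0.908, slightly different from the 0.906 reported in the text, presumably because the published figure was computed from unrounded estimates (an assumption, as the source does not say).

```python
# Figures quoted in the text for OPTIMAAL (mortality).
RR_LOSARTAN_VS_CAPTOPRIL = 1.126   # observed relative risk
UPPER_95_LIMIT = 1.28              # upper 95% confidence limit
MARGIN_OF_INDIFFERENCE = 1.1       # prespecified noninferiority margin
RR_CAPTOPRIL_VS_PLACEBO = 0.806    # historical estimate

# Noninferiority requires the upper confidence limit to fall below the margin.
noninferior = UPPER_95_LIMIT < MARGIN_OF_INDIFFERENCE

# Superiority requires the upper confidence limit to fall below unity.
superior = UPPER_95_LIMIT < 1.0

# Imputed effect of losartan versus placebo: product of the two relative risks.
# This relies on the historical captopril effect still applying today.
imputed_rr = RR_LOSARTAN_VS_CAPTOPRIL * RR_CAPTOPRIL_VS_PLACEBO

print(f"noninferior: {noninferior}, superior: {superior}")
print(f"imputed relative risk versus placebo: {imputed_rr:.3f}")
```

The imputed value inherits the uncertainty of both estimates and the assumption of unchanged clinical conditions, which is why the text urges extreme caution in interpreting it.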

Conclusions

There are many issues in the design, conduct and analysis of randomized clinical trials that must be interpreted cautiously. Some of the more common issues in the interpretation of the analyses have been considered. It is useful to keep these issues in mind during the design and conduct so that the impact can be minimized at the time of analyses. Statistical analysis is limited in being able to compensate for ineligible patients being entered, for noncompliance to intervention, for unreliable outcome measures, for missing data and for underpowered trials.

Fortunately, researchers often do not have to rely on just a single trial which also helps with the interpretation of results.

Conflict of interest statement

No conflict of interest was declared.

References

1 Friedman L, Furberg C, DeMets D. Fundamentals of Clinical Trials. Littleton, MA: John Wright - PSG Inc., 1981, 2nd edn, 1985; St Louis, MO: Mosby-Year Book, Inc., 3rd edn, 1996; New York, NY: Springer-Verlag, 3rd edn, 1998.

2 Pocock SJ. Clinical Trials: A Practical Approach. New York: John Wiley & Sons, 1983.

3 Meinert CL. Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.

4 Piantadosi S. Clinical Trials: A Methodologic Perspective. Hoboken, New Jersey: Wiley-Interscience, 1997.

5 Peto R, Pike MC, Armitage P et al. Design and analysis of randomized clinical trials requiring prolonged observations of each patient. I. Introduction and design. Br J Cancer 1976; 34: 585–612.

6 May GS, DeMets DL, Friedman LM et al. The randomized clinical trial: bias in analysis. Circulation 1981; 64: 669–73.

7 Anturane Reinfarction Trial Research Group. Sulfinpyrazone in the prevention of cardiac death after myocardial infarction: the Anturane Reinfarction Trial. New Engl J Med 1978; 298: 289–95.

8 Anturane Reinfarction Trial Research Group. Sulfinpyrazone in the prevention of cardiac death after myocardial infarction. New Engl J Med 1980; 302: 250–6.

9 Temple R, Pledger GW. The FDA’s critique of the Anturane Reinfarction Trial. New Engl J Med 1980; 303: 1488–92.

10 Sackett DL, Gent M. Controversy in counting and attributing events in clinical trials. New Engl J Med 1979; 301: 1410–2.

11 Coronary Drug Project Research Group. Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. New Engl J Med 1980; 303: 1038–41.

12 Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989; 8: 431–40.

13 Fleming TR, DeMets DL. Surrogate endpoints in clinical trials: are we being misled? Ann Int Med 1996; 125: 605–13.

14 Temple RJ. A regulatory authority’s opinion about surrogate endpoints. In: Nimmo WS, Tucker GT, eds. Clinical Measurement in Drug Evaluation. Hoboken, New Jersey: John Wiley & Sons Ltd., 1996.

15 Cardiac Arrhythmia Suppression Trial (CAST) Investigators. Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. New Engl J Med 1989; 321: 406–12.

16 Wedel H, DeMets D, Deedwania P et al. Challenges of subgroup analyses in multinational clinical trials. Experiences from the MERIT-HF trial. Am Heart J 2001; 142: 502–11.

17 MERIT-HF Study Group. Effect of metoprolol CR/XL in chronic heart failure: metoprolol CR/XL randomised intervention trial in congestive heart failure. Lancet 1999; 353: 2001–7.

18 CIBIS II Investigators and Committees. The cardiac insufficiency Bisoprolol study II (CIBIS-II): a randomized trial. Lancet 1999; 353: 9–13.

19 Packer M, Coats AJS, Fowler MB et al. Effect of carvedilol on survival in severe chronic heart failure. New Engl J Med 2001; 334: 1651–8.

20 Packer M, O’Connor CM, Ghali JK et al. Effect of amlodipine on morbidity and mortality in severe chronic heart failure. New Engl J Med 1996; 335: 1107–14.

21 Kaplan E, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53: 457–81.

22 Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959; 22: 719–48.

23 US Food and Drug Administration. Draft Guidance for Clinical Trial Sponsors on the Establishment and Operation of Clinical Trial Data Monitoring Committees. Rockville, MD: FDA, 2001. http://www.fda.gov./cber/gdlns/clindatmon.htm.

24 Antman EM. Clinical trials in cardiovascular medicine. Circulation 2001; 103: 3101.

25 Dickstein K, Kjekshus J and the OPTIMAAL Steering Committee. Effects of losartan and captopril on mortality and morbidity in high-risk patients after acute myocardial infarction: the OPTIMAAL randomised trial. Lancet 2002; 360: 752–60.

Correspondence: D. L. DeMets, Department of Biostatistics and Medical Informatics, University of Wisconsin, 600 Highland Ave, K6-446, Madison, WI 53792-4675, USA.

(fax: +1 608 263 1059; e-mail: [email protected])
