P Less Than .05: What Does It Really Mean?


COMMENTARY


Zeev N. Kain, MD, MBA, Jill MacLaren, PhD

Center for the Advancement of Perioperative Health and Departments of Anesthesiology, Pediatrics, and Child Psychiatry, Yale University School of Medicine, New Haven, Connecticut

The authors have indicated they have no financial relationships relevant to this article to disclose.

Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, we submit that this concept has not been widely incorporated into the pediatric literature. This is especially problematic because readers of Pediatrics need to understand the limitations of using statistical significance alone to evaluate treatments if they are to draw accurate conclusions from the data presented in this journal. Here we highlight some of the issues related to the complex problem of evaluating treatment effects and the importance of using clinical significance in addition to the traditional P value.

Currently, the magical boundary of P < .05 holds great importance in whether a manuscript is accepted for publication, a research application is funded, or a new drug is approved by the Food and Drug Administration. We submit that if a treatment is to be useful to our children, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful. Evaluating treatment outcomes on the basis of the P value alone is problematic for several reasons. First, with a large sample, it is quite possible to obtain a statistically significant difference between groups despite a minimal effect of treatment (ie, a small effect size). Second, pediatricians typically misinterpret study outcomes with lower P values as reflecting stronger effects than those with higher P values. That is, most clinicians believe that a result with P = .002 reflects a much greater treatment effect than a result with P = .045. This holds only when the 2 studies have the same sample size; if the study with the smaller P value has the larger sample, its treatment effect may be no larger, or even smaller, than that of the other study. This confusion becomes particularly concerning when one realizes that most pharmaceutically funded studies have very large sample sizes.
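The dependence of P on sample size can be illustrated with a short Python sketch. It assumes 2 equal-sized groups, a known common SD, and a two-sided z test (a simplification; a t test would normally be used, especially for small groups), so that the test statistic is z = d × √(n/2), where d is the standardized mean difference and n is the number of patients per group. The function name `two_sided_p` and all example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided P value for a standardized mean difference d between two
    equal-sized groups, assuming a z test with a known common SD."""
    # The SE of a standardized difference between two groups of n is sqrt(2/n),
    # so the z statistic is d / sqrt(2/n) = d * sqrt(n/2).
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same small effect (d = 0.10) tested with ever larger samples:
# the effect never changes, but P keeps falling.
for n in (50, 500, 5000):
    print(f"d = 0.10, n = {n} per group: P = {two_sided_p(0.10, n):.2g}")

# A large effect in a small study vs a tiny effect in a very large study.
print(f"d = 0.80, n = 10 per group:   P = {two_sided_p(0.80, 10):.3f}")
print(f"d = 0.05, n = 5000 per group: P = {two_sided_p(0.05, 5000):.3f}")
```

Under these assumptions, the tiny effect (d = 0.05) studied in 5000 patients per group falls below P = .05, whereas the large effect (d = 0.80) studied in 10 patients per group does not, so ranking the 2 studies by P would reverse their ranking by effect size.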

To combat overreliance on the P value, we recommend that pediatricians seek answers to 3 basic questions when examining the report of a clinical trial:

1. Could the findings of the clinical trial be solely a result of chance? (ie, statistical significance)

2. How large is the difference between the primary end points of the study groups? (ie, impact of treatment, effect size)

3. Is the difference in primary end points between groups meaningful to a patient? (ie, clinical significance)
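The 3 questions can be answered side by side from a trial's summary statistics. The sketch below is a hypothetical helper (`TrialSummary`, `three_questions`, and `meaningful_difference` are illustrative names, not an established API). It assumes a large-sample z approximation for question 1, the d-type effect size described later in this commentary (mean difference divided by the control-group SD) for question 2, and, for question 3, a threshold for a meaningful improvement that must come from patients and clinicians rather than from the data. It also assumes an outcome on which lower scores are better (eg, pain).

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class TrialSummary:
    """Summary statistics for a 2-group trial (hypothetical structure)."""
    mean_control: float
    sd_control: float
    n_control: int
    mean_treatment: float
    sd_treatment: float
    n_treatment: int


def three_questions(t: TrialSummary, meaningful_difference: float) -> dict:
    """Answer the 3 questions for an outcome where lower scores are better."""
    improvement = t.mean_control - t.mean_treatment

    # Question 1: could the difference be due to chance alone? (large-sample z test)
    se = sqrt(t.sd_control ** 2 / t.n_control + t.sd_treatment ** 2 / t.n_treatment)
    p_value = 2 * (1 - NormalDist().cdf(abs(improvement / se)))

    # Question 2: how large is the difference? (d-type effect size, control-group SD)
    effect_size = improvement / t.sd_control

    # Question 3: does the improvement reach the externally defined threshold?
    clinically_meaningful = improvement >= meaningful_difference

    return {"p_value": p_value, "effect_size": effect_size,
            "clinically_meaningful": clinically_meaningful}


# Hypothetical large trial: 2-point improvement on a 0-100 scale, 2000 per group,
# judged against a hypothetical 10-point threshold for a meaningful improvement.
trial = TrialSummary(50.0, 10.0, 2000, 48.0, 10.0, 2000)
print(three_questions(trial, meaningful_difference=10.0))
```

In this made-up example, chance is an unlikely explanation (P is far below .05), but the effect size is only 0.2 and the improvement falls well short of the threshold, so questions 2 and 3 temper the conclusion suggested by question 1.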

UNDERSTANDING STATISTICAL SIGNIFICANCE

As is familiar to most readers of Pediatrics, the P value is the most commonly used method of evaluating the statistical significance of any finding. The origin of the P value lies in 1925 with Sir Ronald A. Fisher, who first suggested the use of a boundary between significance and nonsignificance that was based on probability.1–3 He arbitrarily set this boundary at P = .05, where “P” stands for the probability of obtaining a result at least as extreme as the one observed if chance alone were operating (ie, if there were no true difference between groups).1,2 Although Fisher’s emphasis on significance testing and the arbitrary boundary of P < .05 are familiar and widely used, it is important for pediatricians to recognize that this approach has been widely criticized over the past 80 years. Specifically, it is criticized because it does not take into account the size and clinical significance of the observed effect. That is, a small effect in a study with a large sample size may have the same P value as a large effect in a study with a small sample size.

Abbreviation: CI, confidence interval

Opinions expressed in these commentaries are those of the authors and not necessarily those of the American Academy of Pediatrics or its Committees.

www.pediatrics.org/cgi/doi/10.1542/peds.2006-3030 doi:10.1542/peds.2006-3030

Accepted for publication Oct 20, 2006

Address correspondence to Jill MacLaren, PhD, Department of Anesthesiology, Yale University School of Medicine, 333 Cedar St, New Haven, CT 06510. E-mail: [email protected]

PEDIATRICS (ISSN Numbers: Print, 0031-4005; Online, 1098-4275). Copyright © 2007 by the American Academy of Pediatrics

In an attempt to address some of the limitations of the P value, some clinicians have advocated the use of confidence intervals (CIs).3 It is important that readers realize, however, that these 2 approaches to statistical significance are essentially reciprocal.4 That is, a P value of <.05 is essentially equivalent to a 95% CI for the difference between groups that does not include 0 (or, for ratio measures such as the odds ratio, does not include 1). CIs do have an advantage, however, in that they can be used to estimate the size of the difference between groups.5 Unfortunately, this approach is not widely used in the pediatric literature, and CIs are mostly used today as surrogates for the hypothesis test rather than as a way to consider the full range of likely effect sizes.
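The reciprocity between P values and CIs can be checked numerically. This Python sketch assumes a normal approximation for a difference in means (a z-based interval and test built from the same standard error, for which the correspondence is exact); the function name `diff_ci_and_p` and the group summaries are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci_and_p(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Difference mean_a - mean_b with a normal-approximation CI and two-sided P."""
    diff = mean_a - mean_b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% CI
    ci = (diff - z_crit * se, diff + z_crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, ci, p


# Hypothetical groups of 40 patients each (SD 4), with increasing mean differences.
for mean_b in (9.0, 8.0, 7.0):
    diff, (low, high), p = diff_ci_and_p(10.0, 4.0, 40, mean_b, 4.0, 40)
    excludes_zero = low > 0 or high < 0
    print(f"difference {diff:.1f}: 95% CI {low:.2f} to {high:.2f}, P = {p:.3f}, "
          f"CI excludes 0: {excludes_zero}, P < .05: {p < 0.05}")
```

The last 2 columns agree row by row; what the CI adds is the range of plausible differences. For the 2-point difference, for example, differences from roughly 0.25 to 3.75 points remain compatible with these hypothetical data.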

BEYOND THE P VALUE: EFFECT SIZES

Providing more information than either P values or CIs, the group of statistics called “effect sizes” are measures of the magnitude of difference between groups, standardized by controlling for variation within groups. In other words, whereas a P value denotes whether the difference between 2 groups in a particular study is likely to occur solely by chance, the effect size quantifies the amount of difference between these 2 groups. Because effect sizes are based on standardized differences between groups rather than on sample size, they are a better measure of the strength of the intervention. Of particular relevance to pediatricians are effect sizes of the d type, because these are primarily used to compare 2 treatment groups. The d-type effect size is defined as the magnitude of difference between 2 means divided by the SD [(mean of control group − mean of treatment group)/SD of the control group]. Thus, the d effect size depends on the variation within the control group and the difference between the control and intervention groups. Conventionally, d-type effect sizes near .20 are interpreted as “small,” those near .50 as “medium,” and those around .80 as “large.”6

Effect sizes of another type, the risk potency type, include measures such as the odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics, and it may be helpful to realize that they, too, are a type of effect size. There are a number of different types of effect sizes, but a description of these various types and their formulae is beyond the scope of this commentary; however, we refer the interested reader to a number of review articles that discuss these issues.7,8
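Both families of effect size can be computed in a few lines. This Python sketch follows the commentary's d-type definition (control-group SD in the denominator, a variant also known as Glass's delta) and, for comparison, the widely used pooled-SD version (Cohen's d); it also computes the risk potency measures named above from a 2x2 table in which an "event" is a bad outcome. All function names and data are hypothetical illustrations.

```python
from math import sqrt
from statistics import mean, stdev


def d_control_sd(control, treatment):
    """d-type effect size as defined in the text:
    (mean of control - mean of treatment) / SD of control."""
    return (mean(control) - mean(treatment)) / stdev(control)


def d_pooled_sd(control, treatment):
    """Widely used variant (Cohen's d): standardize by the pooled SD of both groups."""
    n1, n2 = len(control), len(treatment)
    pooled_var = ((n1 - 1) * stdev(control) ** 2
                  + (n2 - 1) * stdev(treatment) ** 2) / (n1 + n2 - 2)
    return (mean(control) - mean(treatment)) / sqrt(pooled_var)


def risk_measures(events_tx, n_tx, events_ctl, n_ctl):
    """Risk potency effect sizes from a 2x2 table, where an event is a bad outcome."""
    risk_tx, risk_ctl = events_tx / n_tx, events_ctl / n_ctl
    odds_tx = events_tx / (n_tx - events_tx)
    odds_ctl = events_ctl / (n_ctl - events_ctl)
    return {
        "risk_ratio": risk_tx / risk_ctl,
        "odds_ratio": odds_tx / odds_ctl,
        "risk_difference": risk_tx - risk_ctl,
        "relative_risk_reduction": (risk_ctl - risk_tx) / risk_ctl,
    }


# Hypothetical pain scores (lower is better).
control = [6, 7, 5, 8, 6, 7, 5, 6]
treatment = [5, 6, 4, 6, 5, 5, 4, 5]
print(f"d (control SD): {d_control_sd(control, treatment):.2f}")
print(f"d (pooled SD):  {d_pooled_sd(control, treatment):.2f}")

# Hypothetical 2x2 table: 10/100 treated vs 20/100 control patients have the bad outcome.
print(risk_measures(10, 100, 20, 100))
```

The 2 d variants differ only in the denominator and give similar values when the groups have similar SDs. For the hypothetical 2x2 table, the risk ratio and the relative risk reduction are both 0.5, and the risk difference is −0.10 (10 fewer bad outcomes per 100 treated patients).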

FURTHER STILL: CLINICAL SIGNIFICANCE

At this point, we feel that it is important to caution Pediatrics readers that the magnitude of change (effect size) should not be interpreted as an indication of clinical significance. The clinical significance of a treatment should instead be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant. Indeed, there is a growing recognition that traditional methods, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant change.

Although there is little consensus about the criteria for these efficacy standards, the most common definitions of clinically significant change include: (1) treated patients make a statistically reliable improvement in their change scores; (2) treated patients are empirically indistinguishable from a normal population after treatment; or (3) there are changes of at least 1 SD. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points.9 Using this method, change is considered unlikely to be the product of measurement error if the reliable change index (the change score divided by the standard error of the difference between the 2 scores) is >1.96. That is, when a patient’s reliable change index exceeds 1.96 in the direction of improvement, one can reasonably assume that the score has indeed improved.
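The 2 steps of the Jacobson-Truax approach can be sketched in Python. The reliable change index below divides the change score by the standard error of the difference, computed from the SD of the measure and its test-retest reliability; the cutoff c is a point between the dysfunctional and functional population means, weighted by their SDs, as described by Jacobson and Truax.9 The function names, the scale, and every number in the example are hypothetical, and the sign convention (improvement as a negative RCI on a scale where higher scores are worse) is an assumption of this example.

```python
from math import sqrt


def reliable_change_index(pre, post, sd_pre, reliability):
    """Jacobson-Truax reliable change index: (post - pre) / S_diff.

    sd_pre      -- SD of the measure in a pretreatment (or normative) sample
    reliability -- test-retest reliability of the measure (r_xx)
    """
    se_measurement = sd_pre * sqrt(1 - reliability)  # standard error of measurement
    s_diff = sqrt(2 * se_measurement ** 2)           # SE of the difference between 2 scores
    return (post - pre) / s_diff


def cutoff_c(mean_dysfunctional, sd_dysfunctional, mean_functional, sd_functional):
    """Jacobson-Truax cutoff c: a point between the dysfunctional and functional
    population means, weighted by their SDs."""
    return ((sd_functional * mean_dysfunctional + sd_dysfunctional * mean_functional)
            / (sd_functional + sd_dysfunctional))


# Hypothetical symptom scale on which HIGHER scores are WORSE, so improvement
# shows up as a negative RCI.
pre, post = 30.0, 18.0
rci = reliable_change_index(pre, post, sd_pre=7.5, reliability=0.88)
c = cutoff_c(mean_dysfunctional=28.0, sd_dysfunctional=7.5,
             mean_functional=12.0, sd_functional=6.0)

reliably_improved = rci < -1.96
crossed_cutoff = post < c  # the post-treatment score is on the functional side of c
print(f"RCI = {rci:.2f}, cutoff c = {c:.2f}")
print("clinically significant change:", reliably_improved and crossed_cutoff)
```

With these hypothetical values the RCI is about −3.3 and the cutoff about 19.1, so this patient would count as both reliably improved and past the clinical cutoff.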

The validity of each of the above-described methods can be improved further by establishing their external validity (eg, from the patient perspective). For example, Flor et al10 conducted a large meta-analysis evaluating the effectiveness of multidisciplinary treatment for chronic pain. The investigators found that pain among the patients who received the intervention was reduced by 25%, with an effect size of .7. Although this finding seems promising statistically, the meaning of the results changes in light of findings from Colvin et al, who reported that patients consider only a 50% pain improvement a “treatment success.”11 Thus, in this example, a 25% reduction in pain scores may be statistically but not clinically significant. Clearly, this is a developing area that warrants additional discussion.
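The contrast between the Flor et al and Colvin et al findings can be made concrete by checking each patient against a patient-defined success criterion, an approach sometimes called a responder analysis. In this Python sketch the 50% threshold is taken from Colvin et al as cited above; the function names and the pain ratings are hypothetical and are not data from either study.

```python
def percent_reduction(baseline, follow_up):
    """Percentage reduction in pain from baseline (positive = less pain)."""
    return 100 * (baseline - follow_up) / baseline


def summarize(baselines, follow_ups, success_threshold=50.0):
    """Return (mean percent reduction, share of patients reaching the threshold)."""
    reductions = [percent_reduction(b, f) for b, f in zip(baselines, follow_ups)]
    successes = sum(r >= success_threshold for r in reductions)
    return sum(reductions) / len(reductions), successes / len(reductions)


# Hypothetical 0-10 pain ratings for 4 patients before and after treatment.
baselines = [8, 8, 8, 8]
follow_ups = [8, 7, 5, 4]
mean_reduction, success_rate = summarize(baselines, follow_ups)
print(f"mean reduction {mean_reduction:.1f}%, "
      f"patients reaching a 50% reduction: {success_rate:.0%}")
```

Here the average pain score falls by 25%, yet only 1 of the 4 hypothetical patients reaches the 50% reduction that patients themselves regard as success.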

CONCLUSIONS

The issue of clinical significance is of utmost importance to both pediatric researchers and clinicians. On the research side, it is imperative that studies routinely evaluate both statistical and clinical significance to advance our understanding of treatment effects. As such, we encourage researchers to report effect sizes, at the very least, and incorporate external validations of clinical significance when possible. On the clinical side, pediatricians must understand the potential disconnect between statistical and clinical significance when making decisions about the adoption of new treatments. The interpretation of any research findings should occur in the context of the magnitude of change that occurred and the clinical significance of the findings.

PEDIATRICS Volume 119, Number 3, March 2007

ACKNOWLEDGMENTS

This work was supported in part by the National Institutes of Health through National Institute of Child Health and Human Development grant R01HD37007-02.

REFERENCES

1. Fisher RA. Statistical Methods for Research Workers. 1st ed. Edinburgh, Scotland: Oliver and Boyd; 1925

2. Fisher RA. Design of Experiments. 1st ed. Edinburgh, Scotland: Oliver and Boyd; 1935

3. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med. 1986;105:429–435

4. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355–360

5. Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ. 1986;292:746–750

6. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Mahwah, NJ: Lawrence Erlbaum; 1988

7. Kirk R. Practical significance: a concept whose time has come. Educ Psychol Meas. 1996;56:746–759

8. Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exp Educ. 1993;61:334–349

9. Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol. 1991;59:12–19

10. Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Pain. 1992;49:221–230

11. Colvin DF, Bettinger R, Knapp R, Pawlicki R, Zimmerman J. Characteristics of patients with chronic pain. South Med J. 1980;73:1020–1023

TV DRUG ADVERTISING

“Two independent government watchdog groups sharply criticized consumer drug advertising recently, and a separate survey Jan 9 commissioned by the PricewaterhouseCoopers accounting and consulting firm indicated that skepticism is widespread among the public, too. Only 1 in 10 consumers said the direct-to-consumer, or DTC, ads could provide useful information to a large audience, the survey said. (Consumer drug advertising is not permitted in most of the world, except New Zealand and the United States.) The pharmaceutical industry itself acknowledges having an image problem. ‘It would be naïve to not acknowledge the fact that DTC advertising is also a lightening-rod in the health care debate in this country,’ said Billy Tauzin, the former congressman who is now president and chief executive of the Pharmaceutical Research and Manufacturers of America, in a speech to venture capitalists last spring. There is ‘one great problem’ that the manufacturers face, he said: ‘in a word, it is trust.’ ‘While individual patients find the information useful in discussions with their physicians,’ he added in his speech, ‘patients, physicians and consumers generally express unhappiness with DTC advertising.’ . . . FDA officials said they had to deal with 54 000 drug promotions each year, aimed at both doctors and consumers.”

Freudenheim M. New York Times. January 22, 2007

Noted by JFL, MD


Zeev N. Kain and Jill MacLaren. P Less Than .05: What Does It Really Mean? Pediatrics 2007;119;608. DOI: 10.1542/peds.2006-3030



Pediatrics is the official journal of the American Academy of Pediatrics. A monthly publication, it has been published continuously since 1948. Pediatrics is owned, published, and trademarked by the American Academy of Pediatrics, 345 Park Avenue, Itasca, Illinois, 60143.