The “significance” crisis in psychology and education
Bruce Thompson∗
Department of Educational Psychology, Texas A&M University, College Station, TX 77843-4225, USA
Abstract
The present article explores various reasons why psychology and education have not proceeded further with adopting reformed statistical practices advocated for several decades. Initially, a brief statistical history is presented. Then both psychological and sociological barriers to reform are considered. Perhaps economics can learn from some of the mistakes made within psychology and education, and invent its own new pitfalls, rather than becoming mired in the same crevasses discovered by other disciplines.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Significance; Psychology; Education

1. Introduction
As noted elsewhere within this special issue, criticisms of statistical significance tests are almost as old as the methods themselves (e.g., Boring, 1919; Berkson, 1938). These criticisms have been voiced in disciplines as diverse as psychology, education, wildlife science, and economics, and the frequency with which such criticisms are published is increasing (Anderson et al., 2000).
Some of the most widely cited exemplars of these critiques were provided by Carver (1978), Cohen (1994), Schmidt (1996), and Thompson (1996). Some of these critics have argued that statistical significance tests should be banned from psychology/education journals. For example, Schmidt and Hunter (1997) argued that “Statistical significance testing
retards the growth of scientific knowledge; it never makes a positive contribution” (p. 37). Rozeboom (1997) was equally direct:

Null-hypothesis significance testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students . . . [I]t is a sociology-of-science wonderment that this statistical practice has remained so unresponsive to criticism . . . (p. 335)
Harlow et al. (1997) provided a balanced and comprehensive treatment of these arguments in their book, “What if there were no significance tests?” Hubbard and Ryan (2000) and Huberty (1999, 2002) presented related historical perspectives on the uptake of statistical testing within psychology and education.
The purpose of the present article is not to summarize these numerous criticisms, nor the
counterarguments to some of them. Instead, some reasons why psychology and education
have not proceeded further with adopting reformed practices are explored. Because the field
has moved inexorably, albeit glacially, the question could also be framed as, “What barriers
have slowed the recent progress of the field in adopting more informative methods?”
First, however, a brief history is presented. Perhaps economics can learn from some of
the mistakes made within psychology and education, and invent its own new pitfalls, rather
than becoming mired in the same crevasses discovered by other disciplines.
2. A brief history
In 1994, the fourth edition of the Publication Manual of the American Psychological Association (APA) first mentioned statistics that characterize magnitude of effect (e.g., “effect sizes” such as Cohen’s d, R², η², ω²). Effect sizes quantify by how much sample results diverge from the null hypothesis. There are dozens of effect size statistics (Kirk, 1996). The estimation and interpretation of effect sizes are discussed by Snyder and Lawson (1993), Rosenthal (1994), Hill and Thompson (2004), Kirk (in press), and Thompson (2002a,b, in press).
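To make the effect size idea concrete, the following minimal sketch computes one common index, Cohen’s d for two independent groups. The data, group labels, and function name are hypothetical and serve only to illustrate the general formula (mean difference divided by the pooled standard deviation).

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardized mean difference (Cohen's d) for two independent groups."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled standard deviation from the two unbiased (n - 1) variance estimates
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical scores for a treatment and a control group
treatment = [23, 27, 31, 29, 25, 30]
control = [20, 22, 26, 24, 21, 23]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

Unlike a p value, the resulting d is expressed in standard deviation units and does not grow or shrink simply because the sample size changes.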
The 1994 APA Publication Manual noted that, “Neither of the two types of probability values reflects the magnitude of effect because both depend on sample size . . . You are encouraged to provide effect-size information” (APA, 1994, p. 18, emphasis added). However, 12 studies of 23 different journals showed that this “encouragement” had little, if any, effect on reporting practices (Vacha-Haase et al., 2000).
In 1996, the APA Board of Scientific Affairs created a Task Force to recommend whether or not statistical significance tests should be banned from journals. The Task Force published its suggestions three years later (Wilkinson and APA Task Force on Statistical Inference, 1999), and did not recommend such a ban, but did present numerous related recommendations.
Included were admonitions about effect sizes, such as “Always [emphasis added] provide some effect-size estimate when reporting a p value” (Wilkinson and APA Task Force, 1999, p. 599). The APA Task Force further emphasized, “reporting and interpreting effect sizes in the context of previously reported effects is essential [emphasis added] to good research” (p. 599). The Task Force also strongly emphasized the value of reporting confidence intervals.
In 2001, APA published the fifth edition of the Publication Manual, which stated that “For the reader to fully understand the importance of your findings, it is almost always necessary [emphasis added] to include some index of effect size or strength of relationship in your results section . . .” (pp. 25–26). The new Manual also labelled “failure to report effect sizes” as a “defect in the design and reporting of research” (APA, 2001, p. 5). And as regards confidence intervals, the 2001 manual suggested that CIs represent “in general, the best reporting strategy. The use of confidence intervals is therefore strongly recommended” (p. 22, emphasis added).
Clearly, over the past decade, some progress has been realized. Nevertheless, the revised manual is not without its critics, as revealed in the interviews conducted by Fiona Fidler (2002), and summarized in her penetrating essay.
Today, researchers in psychology and education seem to recognize that “surely, God loves the 0.06 [level of statistical significance] nearly as much as the 0.05” (Rosnow and Rosenthal, 1989, p. 1277). Changed views are reflected in the increasing uptake of three premises.
First, researchers in these disciplines increasingly accept that magnitudes of effect size are at least as important as (and perhaps more important than) indices of the probability of sample results. As psychologist Roger Kirk (1996) explained,

. . . a rejection means that the researcher is pretty sure of the direction of the difference. Is that any way to develop psychological theory? I think not. How far would physics have progressed if their researchers had focused on discovering ordinal relationships? What we want to know is the size of the difference between A and B and the error associated with our estimate; knowing that A is greater than B is not enough. (p. 754)
For this reason, today 24 journals in psychology and education have gone beyond the
admonitions of the APA
Publication Manual, and explicitly require effect size reports.
Included are journals with more than 50,000 subscriptions!
Second, researchers in these disciplines have increasingly recognized that evidence of result replicability is important, but that statistical significance tests do not evaluate whether or not results are serendipitous. Instead, these fields have turned toward “interpretation of new results, once they are in hand, via explicit, direct [emphasis added] comparison with the prior effect sizes in the related literature” (Thompson, 2002b, p. 28).
Third, researchers are increasingly recognizing the value of reporting and interpreting effect sizes by graphically presenting confidence intervals for effects from both current and prior related studies (Thompson, in press). New software (e.g., Algina and Keselman, 2003; Altman et al., 2000; Cumming and Finch, 2001; Kline, 2004) has greatly facilitated these endeavors.
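As a rough illustration of this kind of display, rather than of the cited software itself, the sketch below plots a hypothetical current effect alongside hypothetical effects from prior related studies, each with a confidence interval. All study names, effect sizes, and standard errors are invented, and the intervals are simple normal-approximation intervals (d ± 1.96 SE) rather than the noncentral-distribution intervals discussed by Cumming and Finch (2001).

```python
import matplotlib.pyplot as plt

# Hypothetical standardized mean differences (d) and standard errors, one per study
studies = ["Prior study A", "Prior study B", "Prior study C", "Current study"]
d_values = [0.42, 0.55, 0.31, 0.48]
std_errors = [0.15, 0.20, 0.12, 0.14]

# 95% normal-approximation confidence interval half-widths
half_widths = [1.96 * se for se in std_errors]

fig, ax = plt.subplots()
y_positions = range(len(studies))
ax.errorbar(d_values, y_positions, xerr=half_widths, fmt="o", capsize=4)
ax.axvline(0, linestyle="--", color="gray")  # null (zero-effect) reference line
ax.set_yticks(list(y_positions))
ax.set_yticklabels(studies)
ax.set_xlabel("Effect size (Cohen's d) with 95% CI")
ax.set_title("Current effect in the context of prior related effects")
plt.tight_layout()
plt.show()
```

A display of this sort makes it immediately visible whether the new interval overlaps those from the prior literature, which is exactly the comparison the reformers emphasize.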
3. Barriers to reform
Journal editor Loftus (1994) lamented that repeated publications of concerns about statistical significance testing “are carefully crafted and put forth for consideration, only to just kind of dissolve away in the vast acid bath of our existing methodological orthodoxy” (p. 1). Another editor commented that “p values are like mosquitos” and apparently “have an evolutionary niche somewhere and [unfortunately] no amount of scratching, swatting or spraying will dislodge them” (Campbell, 1982, p. 698). Nevertheless, as we have seen, some progress on these matters has been achieved, at least in psychology and education.
In formulating models of barriers to reform, there may be a natural tendency to frame
explanations that draw on the language and theories of one’s own discipline. For example,
in the present journal issue economic explanations are cited. It is argued that statistical
significance testing remains unscathed because the methods are cheap, or because market
inefficiencies have occurred.
To complement these economic views offered in the current issue, two alternative perspectives are presented here. Of course, the explanations are not mutually exclusive, and in fact these various barriers to reform may be comorbid.
4. Psychological barriers
In his classic 1994 article, Cohen concluded that the statistical significance test “does not
tell us what we want to know, and we so much want to know what we want to know that, out
of desperation, we nevertheless believe that it does!” (p. 997). Similarly,
Rozeboom (1960)
observed that “the perceptual defenses of psychologists are particularly efficient when
dealing with matters of methodology, and so the statistical folkways of a more primitive
past continue to dominate the local scene” (p. 417). Such views suggest that psychological
dynamics must be considered as part of the etiology of resistance to change, and indeed
various psychological theories have been offered.
For example, Schmidt and Hunter (1997) seemingly argued that statistical testing arose without thoughtful consideration. Once the adoption began, it became self-sustaining. They suggested that “logic-based arguments [against statistical testing] seem to have had only a limited impact . . . [perhaps due to] the virtual brainwashing [emphasis added] in significance testing that all of us have undergone” (pp. 38–39).
An alternative model speaks of a “psychology of addiction to significance testing” (Schmidt and Hunter, 1997, p. 49). The nature of the addiction is not fully explored in their treatment. But Thompson (1993) offered a possible model, pointing to the atavistic desire to escape the burdens of responsibility. He noted that researchers presenting work, when questioned about its import, could merely say, “the result is statistically significant (p < 0.05),” and the answer was universally and thoughtlessly accepted. A seemingly objective and accessible mechanism for routinely adjudicating result value appeared to have been isolated.
Unfortunately, as Thompson (1993) noted, p values cannot contain information about result value. He explained that a valid deductive argument may not contain in its conclusions any information completely absent from its premises. He noted, “If the computer package did not ask you your values prior to its analysis, it could not have considered your value system in calculating p’s, and so p’s cannot be blithely used to infer the value of research results” (p. 365).
Yet another psychological explanation invokes the theory of
cognitive dissonance. During
the 1950s, psychologist Leon
Festinger (1957)
thought it would be intriguing to observe
people who believed they had received messages that the world would end on a given date,
and who abandoned their worldly possessions and/or families to congregate together in a
given place at the appointed time. Festinger then observed these people once the appointed
times for the apocalypse had come and gone, to see how they rationalized their decisions
and the consequences.
In essence, Festinger’s theory says that when beliefs or expectations are perceived to
conflict with other beliefs, or reality, people seek congruence by adjusting perceptions to be
less dissonant with each other.
Festinger (1961)
did a study with rats, published in APA’s
flagship journal, suggesting that the need to resolve dissonance may even be biological.
The psychological theory of cognitive dissonance can be applied at two levels to the phenomenon of continued emphasis on statistical significance testing. First, students typically find mastering the logic of these tests convoluted and, consequently, painful. One aspect of cognitive dissonance theory may be summarized as: we come to love that for which we have suffered. Students may resolve their dissonance over painful exposure to statistical testing by exaggerating the value of statistical significance tests.
Second, dissonance may affect faculty as well. Faculty who have published over the course of their careers, always with p values, may seek to avoid the dissonance of acknowledging the errors of past ways by stubbornly denying the validity of arguments favoring reformed analytic practices.
5. Sociological barriers
Professionals organize themselves into discipline-related professional associations.
Scholars tend to do whatever is viewed as normatively correct within their disciplines,
and professional associations tend to enforce these expectations at their meetings and in
their journals.
Professional associations create sociological barriers to reformed statistical practices in two respects. First, the bureaucracies of professional associations inevitably create time delays in responding to reform initiatives. The committees (e.g., Publications Committee)
of professional associations only meet and conduct their business periodically (e.g., once or
twice a year). Furthermore, some professional associations are hierarchically organized, and
reform initiatives must be approved in a sequential order. Such structures create additional
delays, and provide opportunities to stop reform initiatives. For example, editorial policies
requiring effect size reporting in journals of the American Educational Research Association
(AERA) would first have to be approved by the AERA Publications Committee, and then
by the AERA Governing Council.
Consider the example of the APA Task Force on Statistical Inference. The APA Board of
Scientific Affairs took two years to discuss the creation of the Task Force. Then the primary
report of the Task Force was presented after an additional three years had passed.
Second, each professional organization has its own culture. The last few years have seen paradigm shifts in psychology and education regarding quantitative versus qualitative methods. Some professional associations have responded in an avoidant manner, creating cultures that sidestep these conflicts by adopting nonjudgmental perspectives. In such environs anything goes, and the maintenance of academic rigor is left to the conscience of each individual scholar. Statistical reforms are disadvantaged if the efforts are advanced in cultures that are hostile to any exercise of collective judgment.
6. Some concluding thoughts
Notwithstanding psychological and sociological (and economic, and other) barriers, statistical reform is occurring within psychology and education. Researchers in these fields may still ask, “how probable are the sample results?”, and may still report p values.
But the focus of inquiry in psychology and education has now shifted to addressing two questions: (1) how big are detected effects? and (2) are the effects replicable? To address the first question, researchers are evaluating effect sizes (cf. Kirk, 1996; Thompson, in press). They are evaluating the magnitude of effects by considering the context of the research (e.g., medical life-or-death outcomes, educational instructional effectiveness), and by explicit, direct comparison of study results with those in prior related studies (cf. Thompson, 2002b).
As regards the second question, scholars in psychology and education have moved away from accepting statistical significance tests as evaluating the replicability of results (see Cohen, 1994; Thompson, 1996). Instead, the replicability of effects is increasingly being judged by evaluating how stable effects are across a related literature, using what some have come to call “meta-analytic thinking” (Cumming and Finch, 2001).
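A minimal sketch of such meta-analytic thinking, under the simplifying assumption of a fixed-effect (inverse-variance weighted) summary and with entirely hypothetical study values, might look as follows; it is meant only to illustrate the reasoning, not to substitute for a formal meta-analysis.

```python
import math

# Hypothetical effect sizes (Cohen's d) and standard errors from related studies
studies = {
    "Study 1": (0.42, 0.15),
    "Study 2": (0.55, 0.20),
    "Study 3": (0.31, 0.12),
    "Current": (0.48, 0.14),
}

# Inverse-variance weights and the weighted (fixed-effect) mean effect size
weights = {name: 1.0 / se ** 2 for name, (_, se) in studies.items()}
pooled_d = sum(weights[n] * d for n, (d, _) in studies.items()) / sum(weights.values())
pooled_se = math.sqrt(1.0 / sum(weights.values()))

print(f"Weighted mean effect: {pooled_d:.2f} "
      f"(95% CI {pooled_d - 1.96 * pooled_se:.2f} to {pooled_d + 1.96 * pooled_se:.2f})")

# How far does each study's effect fall from the pooled estimate, in its own SE units?
for name, (d, se) in studies.items():
    print(f"{name}: d = {d:.2f}, deviation from pooled estimate = {(d - pooled_d) / se:+.2f} SE")
```

If the individual effects cluster tightly around the weighted mean, that stability across studies is the kind of replicability evidence the reformers have in mind.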
In other words, don’t tell me your results are improbable, or highly improbable. Tell me they have practical import, reflected in effect sizes. Tell me explicitly why you think a given effect size, given what you are studying, is important. And give me the evidence that effects across studies are reasonably comparable, so that I have some confidence that your results are replicable and not serendipitous. Do not use statistical significance tests as a warrant for replicability, because that is not what the tests do (Cohen, 1994; Thompson, 1996). Instead, use replication as evidence of result replicability!
References
Algina, J., Keselman, H.J., 2003. Approximate confidence intervals for effect sizes. Educational and Psychological Measurement 63, 537–553.
Altman, D.G., Machin, D., Bryant, T.N., Gardner, M.J., 2000. Statistics with Confidence, second ed. BMJ Books, London, England.
American Psychological Association, 1994. Publication Manual of the American Psychological Association, fourth ed. Author, Washington, DC.
American Psychological Association, 2001. Publication Manual of the American Psychological Association, fifth ed. Author, Washington, DC.
Anderson, D.R., Burnham, K.P., Thompson, W., 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912–923.
Berkson, J., 1938. Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33, 526–536.
Boring, E.G., 1919. Mathematical vs. scientific importance. Psychological Bulletin 16, 335–338.
Campbell, N., 1982. Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology 67, 691–700.
Carver, R., 1978. The case against statistical significance testing. Harvard Educational Review 48, 378–399.
Cohen, J., 1994. The earth is round (p < 0.05). American Psychologist 49, 997–1003.
Cumming, G., Finch, S., 2001. A primer on the understanding, use and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement 61, 532–575.
Festinger, L., 1957. A Theory of Cognitive Dissonance. Row, Peterson, Evanston, IL.
Festinger, L., 1961. The psychological effects of insufficient rewards. American Psychologist 16, 1–11.
Fidler, F., 2002. The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62, 749–770.
Harlow, L.L., Mulaik, S.A., Steiger, J.H. (Eds.), 1997. What if There were no Significance Tests? Erlbaum, Mahwah, NJ.
Hill, C.R., Thompson, B., 2004. Computing and interpreting effect sizes. In: Smart, J.C. (Ed.), Higher Education: Handbook of Theory and Research, vol. 19. Kluwer, New York, pp. 175–196.
Hubbard, R., Ryan, P.A., 2000. The historical growth of statistical significance testing in psychology and its future prospects. Educational and Psychological Measurement 60, 661–681.
Huberty, C.J., 1999. On some history regarding statistical testing. In: Thompson, B. (Ed.), Advances in Social Science Methodology, vol. 5. JAI Press, Stamford, CT, pp. 1–23.
Huberty, C.J., 2002. A history of effect size indices. Educational and Psychological Measurement 62, 227–240.
Kirk, R.E., 1996. Practical significance: a concept whose time has come. Educational and Psychological Measurement 56, 746–759.
Kirk, R.E., in press. The importance of effect magnitude. In: Davis, S.F. (Ed.), Handbook of Research Methods in Experimental Psychology. Blackwell, Oxford, United Kingdom.
Kline, R., 2004. Beyond Significance Testing: Reforming Data Analysis Methods in Behavioral Research. American Psychological Association, Washington, DC.
Loftus, G.R., 1994. Why psychology will never be a real science until we change the way we analyze data. Paper presented at the annual meeting of the American Psychological Association, Los Angeles.
Rosenthal, R., 1994. Parametric measures of effect size. In: Cooper, H., Hedges, L.V. (Eds.), The Handbook of Research Synthesis. Russell Sage Foundation, New York, pp. 231–244.
Rosnow, R.L., Rosenthal, R., 1989. Statistical procedures and the justification of knowledge in psychological science. American Psychologist 44, 1276–1284.
Rozeboom, W.W., 1960. The fallacy of the null hypothesis significance test. Psychological Bulletin 57, 416–428.
Rozeboom, W.W., 1997. Good science is abductive, not hypothetico-deductive. In: Harlow, L.L., Mulaik, S.A., Steiger, J.H. (Eds.), What if There were no Significance Tests? Erlbaum, Mahwah, NJ, pp. 335–392.
Schmidt, F., 1996. Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods 1, 115–129.
Schmidt, F.L., Hunter, J.E., 1997. Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In: Harlow, L.L., Mulaik, S.A., Steiger, J.H. (Eds.), What if There were no Significance Tests? Erlbaum, Mahwah, NJ, pp. 37–64.
Snyder, P., Lawson, S., 1993. Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education 61, 334–349.
Thompson, B., 1993. The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education 61, 361–377.
Thompson, B., 1996. AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher 25 (2), 26–30.
Thompson, B., 2002a. “Statistical,” “practical,” and “clinical”: How many kinds of significance do counselors need to consider? Journal of Counseling and Development 80, 64–71.
Thompson, B., 2002b. What future quantitative social science research could look like: confidence intervals for effect sizes. Educational Researcher 31 (3), 24–31.
Thompson, B., in press. Research synthesis: effect sizes. In: Green, J., Camilli, G., Elmore, P.B. (Eds.), Complementary Methods for Research in Education. American Educational Research Association, Washington, DC.
Vacha-Haase, T., Nilsson, J.E., Reetz, D.R., Lance, T.S., Thompson, B., 2000. Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory and Psychology 10, 413–425.
Wilkinson, L., APA Task Force on Statistical Inference, 1999. Statistical methods in psychology journals: guidelines and explanations. American Psychologist 54, 594–604 (reprint available through the APA Home Page: http://www.apa.org/journals/amp/amp548594.html).