Nonparametric and Distribution- Free Statistical Tests

(1)

536

20

Nonparametric

and Distribution- Free Statistical

Tests

Concepts that you will need to remember from previous chapters

: Sums of squares of all scores, of group means, and within groups

: Mean squares for group means, and within groups

F statistic: Ratio of over

Degrees of freedom: The number of independent pieces of information remaining after estimating one or more parameters

Effect size ( ): A measure intended to express the size of a treatment

: Correlation based measures of effect size Multiple comparisons: Tests on differences between specific group

means in terms that are meaningful to the reader

Eta²1h²2, omega²1v²2 dˆ

MS_error MS_group

MS_group, MS_error SS_total, SS_group, SS_error

(2)

I

n this chapter we are going to change our general approach to hypothesis testing and look at procedures that rely on substituting ranks for raw scores. These are members of the class of nonparametric, or distribution-free, tests. We will first look at the underlying principle that stands behind such tests and then discuss the reasons why one might prefer to use this kind of test. We will see that these tests are a sup- plement to what we have learned, not a replacement.

Most of the statistical procedures we have discussed in the preceding chapters have involved the estimation of one or more parameters of the distribution of scores in the population(s) from which the data were sampled and assumptions concerning the shape of that distribution. For example, the t test makes use of the sample variance as an estimate of the population variance and also requires the assumption that the population from which we sampled is normal (or at least that the sampling distribution of the mean is normal). Tests, such as the t test, that involve assumptions either about specific parameters or about the distribution of the population are referred to as parametric tests.

Definition Parametric tests: Statistical tests that involve assumptions about, or estimation of, population parameters.

Nonparametric tests: Statistical tests that do not rely on parameter estimation or precise distributional assumptions.

Distribution-free tests: Another name for nonparametric tests.

One class of tests, however, places less reliance on parameter estimation and/or distribution assumptions. Such tests usually are referred to as nonparametric tests or distribution-free tests. By and large if a test is nonparametric, it is also distribution-free; in fact, it is the distribution-free nature of the test that is most valu- able to us. Although the two names often are used interchangeably, these tests will be referred to here as distribution-free tests.

The argument over the value of distribution-free tests has gone on for many years, and it certainly cannot be resolved in this chapter. Many experimenters feel that, for the vast majority of cases, parametric tests are sufficiently robust (unaffected by violations of assumptions) to make distribution-free tests unnecessary. Others, however, believe just as strongly in the unsuitability of parametric tests and the overwhelm- ing superiority of the distribution-free approach. (Bradley [1968] is a forceful and articulate spokesman for the latter group, even though his book on the subject is over 40 years old.) Regardless of the position you take on this issue, it is important that you are familiar with the most common distribution-free procedures and their underlying rationale. These tests are too prevalent in the experimental literature simply to be ignored.

The major advantage generally attributed to distribution-free tests is also the most obvious—they do not rely on any seriously restrictive assumptions concerning the shape of the sampled population(s). This is not to say that distribution-free tests do not make any distribution assumptions, only that the assumptions they do require are far

1s²2 1s²2

(3)

more general than those required for the parametric tests. The exact null hypothesis being tested may depend, for example, on whether two populations are symmetric or have a similar shape. None of these tests, however, makes an a priori assumption about the specific shape of the distribution; that is, the validity of the test is not affected by whether the distribution of the variable in the population is normal. A parametric test, on the other hand, usually includes some type of normality assumption; if that assumption is false, the conclusions drawn from the test may be inaccurate. Another characteristic of distribution-free tests that often acts as an advantage is that many of them, especially the ones discussed in this chapter, are more sensitive to medians than to means. Thus if the nature of your data is such that you are interested primarily in medians, the tests presented here may be particularly useful to you.

Those who favor using parametric tests in every case do not deny that the distribution-free tests are more liberal in the assumptions they require. They do argue, however, that the assumptions normally cited as being required of parametric tests are overly restrictive in practice and that the parametric tests are remarkably unaffected by violations of distribution assumptions. In other words, they argue that the parametric test is still a valid test even if all of its assumptions are not met.

The major disadvantage generally attributed to distribution-free tests is their lower power relative to the corresponding parametric test. In general, when the assumptions of the parametric test are met, the distribution-free test requires more observations than the comparable parametric test for the same level of power. Thus for a given set of data the parametric test is more likely to lead to rejection of a false null hypothesis than is the corresponding distribution-free test. Moreover, even when the distribution assumptions are violated to a moderate degree, the parametric tests are thought to maintain their advantage.

It often is claimed that the distribution-free procedures are particularly useful because of the simplicity of their calculations. However, for an experimenter who has just invested six months collecting data, a difference of five minutes in computation time hardly justifies the use of a less desirable test. Moreover, since most people run their analyses using computer software, the difference in ease of use disappears completely.

There is one other advantage of distribution-free tests. Because many of them rank the raw scores and operate on those ranks, they offer a test of differences in central tendency that are not affected by one or a few very extreme scores (outliers). An extreme score in a set of data actually can make the parametric test less powerful because it inflates the variance and hence the error term, as well as biasing the mean by shifting it toward the outlier (the latter may increase or decrease the difference between means).

In this chapter we will be concerned with four of the most important distribution-free methods. The first two are analogues of the t test, one for independent samples and one for matched samples. The next two tests are distribution-free analogues of the analysis of variance, the first for k independent groups and the second for k repeated measures. All these tests are members of a class known as rank-randomization tests because they deal with ranked data and take as the distribution of their test statistic, when the null hypothesis is true, the theoretical distribution of randomly distributed ranks. I’ll come back to this idea shortly. Because these tests convert raw data to ranks, the shape of the underlying distribution of scores in the population becomes less important. Thus both the sets

(4)

11 14 15 16 17 22

(data that might have come from a normal distribution) and

11 12 13 30 31 32

(data that might have come from a bimodal distribution) reduce to the ranks

1 2 3 4 5 6

Definition Rank-randomization tests: A class of nonparametric tests based on the theoretical distribution of randomly assigned ranks.

The use of methods based on ranks is not the only approach when we are concerned about nonnormality, though it is the most common. Wilcox (2003) has an extensive discussion of newer alternative methods (often relying on the trimming of samples), though there is not space to discuss those methods here.

Why do we use ranks?

You might reasonably ask why we would use ranks to run any of the tests in this chapter. There are three good reasons why these tests were designed around the substitution of ranks for raw data. In the first place, ranks can eliminate or reduce the effects of extreme values. The two highest ranks of 20 items will be the values 19 and 20. But the highest raw score values could be 77 and 78 or 77 and 130. It makes a difference with raw scores, but not with ranks.

A second advantage of ranks is that we know certain of their properties, such as that the sum of a set of ranks is This greatly simpli- fies calculations. This was especially important in the days before high speed computers.

The third advantage is that once you have worked out the critical value of the test statistic when you have 8 observations in one group and 13 in another, you never have to solve that problem again. The next time you have 8 scores in one group and 13 in another, converting to ranks will yield the same critical value. However, with raw scores you would have to set a cutoff for every conceivable collection of 8 scores in one group and 13 in another.

However, while ranks provided an easy solution when we had to do calculations by hand, that advantage is now largely gone. There is a whole set of statistical tests called randomization tests (or sometimes permutation tests) that work by randomizing raw scores. For the Mann-Whitney test we converted to ranks and then asked about all of the possible ways those ranks could have been assigned to groups if the null hypothesis were true. As I said, ranks made it easy to acquire all possible arrangements and identify the 5% most extreme ones. But now we can

N 3 1N 1 12>2.

(5)

20.1 The Mann–Whitney Test

One of the most common and best known of the distribution-free tests is the Mann–Whitney test for two independent samples. This test often is thought of as the distribution-free analogue of the test for two independent samples, although it tests a slightly different, and broader, null hypothesis. Its null hypothesis is the hypothesis that the two samples were drawn at random from identical populations (not just populations with the same mean), but it is especially sensitive to population differences in central tendency. Thus rejection of generally is interpreted to mean that the two distributions had different central tendencies, but it is possible that rejection actually resulted from some other difference between the populations. Notice that when we gain one thing (freedom from assumptions), we pay for it with something else (loss of specificity).

Definition Mann–Whitney test: A nonparametric test for comparing the central tendency of two independent samples.

The Mann –Whitney test is a variation on a test originally devised by Wilcoxon called the Rank-Sum test. Because Wilcoxon also devised another test, to be discussed in the next section, we will refer to this version as the Mann –Whitney test to avoid confusion. Although the test as devised by Mann and Whitney used a slightly different test statistic, the statistic used in this chapter (the sum of the ranks of the scores in one of the groups) is often advocated because it is much easier to calculate. (In fact, this is the statistic that Wilcoxon uses for his test. So, to be honest, I am calling this the Mann –Whitney test but doing it the way Wilcoxon proposed.) The result is the same, because either way of computing a test statistic would lead to exactly the same conclusion when applied to the same set of data.

The logical basis of the Mann-Whitney test is particularly easy to understand.

Assume that we have two independent treatment groups, with n₁observations in H₀

t

do exactly the same thing with raw scores. We can write a very simple computer program that randomly assigns scores to groups, calculates some statistic, and then repeats that process 5,000 times or more in a very few seconds. Then we identify the 5% most extreme outcomes and that gives us our critical value. And if there are two many possible permutations of the raw scores to make the enumeration practical, we can pick a random 5,000 or 10,000 rearrangements, and that will gives us a result that is acceptably close to the result of the full solution.

If R. A. Fisher were still around, he would argue that when the randomization of raw scores gives a result that is more than trivially different from the results of a parametric test such as or , then it is the or that is wrong.t F t F

(6)

Group 1 and observations in Group 2. To make it concrete, assume that there are 8 observations in each group. Further assume that we don’t know whether or not the null hypothesis is true, but we happen to obtain the following data:

Raw Scores

Group 1 18 16 17 21 15 13 24 20

Group 2 35 38 31 27 37 26 28 25

Well, it looks as if Group 2 outscored Group 1 by a substantial margin. Now suppose that we rank the data from lowest to highest, without regard to group membership.

Ranked Scores

Group 1 5 3 4 7 2 1 8 6 Ranks 36

Group 2 14 16 13 11 15 10 12 9 Ranks 100

Look at that! The lowest 8 ranks ended up in Group 1 and the highest 8 ranks ended up in Group 2. That doesn’t look like a very likely event if the two populations don’t differ.

We could calculate how often such a result would happen if we really need to, and if you are very patient. Although it could be done mathematically, we could do it empirically by taking 16 balls and writing the numbers 1 through 16 on them, corresponding to the 16 ranks. (We don’t have to worry about actual scores, because we are going to replace scores with ranks anyway.) Now we will toss all of the balls into a bucket, shake the bucket thoroughly, pull out 8 balls, which will correspond with the ranks for Group 1, record the sum of the numbers on those balls, toss them back into the bucket, shake and draw again, record the sum of the numbers, and continue that process all night. By the next morning we will have drawn an awful lot of samples, and we can look at the values we recorded and make a frequency distribution of them. This will tell us how often we had a sum of the ranks of only 36, how often the sum was 37, how often it was 50, or 60, or 90, or whatever. Now we really are finished. We know that if we just draw ranks out at random, only very rarely will we get a sum as small as 36. (A simple calculation shows that an outcome as extreme as ours would be expected to occur only one time out of 12,870, for a probability of .00008.) If the null hypothesis is really true, then there should be no systematic reason for the first group to have only the lowest ranks. It should have ranks that are about like those of the second group. If the ranks in Group 1 are improbably low, that is evidence against the null hypothesis.

I mentioned above that this is a “rank randomization” test, and what we have just done illustrates where the name comes from. We run the test by looking at what would happen if we randomly assigned scores (or actually ranks) to groups, even if we don’t actually go through the process of doing the random assignment ourselves.

n₂

(7)

Now consider the case in which the null hypothesis is true and the scores for the two groups were sampled from identical populations. In this situation if we were to rank all scores without regard to group membership, we would expect some low ranks and some high ranks in each group, and the sum of the ranks assigned to Group 1 would be roughly equal to the sum of the ranks assigned to Group 2.

Reasonable results for the situation with a true null hypothesis are illustrated.

Raw Scores

Group 1 31 16 17 28 15 38 24 20

Group 2 35 13 18 27 37 26 21 25

Now it looks as if Group 2 scores are not a lot different from Group 1 scores.

We can rank the data across both groups.

Ranked Scores

Group 1 13 3 4 12 2 16 8 6 Ranks 64

Group 2 14 1 5 11 15 10 7 9 Ranks 72

Here the sum of the ranks in Group 1 is not much different from the sum of the ranks in Group 2, and a sum like that would occur quite often if we just drew ranks at random.

Mann and Whitney (and Wilcoxon) based their tests on the logic just described, using the sum of the ranks in one of the groups as the test statistic.

If that sum is too small relative to the other sum, we will reject the null hypothesis. More specifically, we will take as our test statistic the sum of the ranks assigned to the smaller group, or if the smaller of the two sums. Given this value, we can use tables of the Mann –Whitney statistic to test the null hypothesis. (They needed to concern themselves with only one of the sums, because with a fixed set of numbers [ranks], the sum of the ranks in one group is directly related to the sum of the ranks in the other group. If one sum is high, the other must be low.)

To take a specific example, consider the data in Table 20.1 on the number of recent stressful life events reported by a group of cardiac patients in a local hospital and a control group of orthopedic patients in the same hospital.

It is well known that stressful life events (marriage, new job, death of a spouse, etc.) are associated with illness, and it is reasonable to expect that many cardiac patients would have experienced more recent stressful events than orthopedic patients (who just happened to break an ankle while tearing down a building or a collarbone while skiing). It would appear from the data that this expectation is borne out. Because we have some reason to suspect that life stress scores probably are not symmetrically distributed in the population (especially for cardiac patients if our research hypothesis is true), we will choose to use a distribution-free test. In this case we will use the Mann –Whitney test because we have two independent groups.

1WS2 n₁5 n₂

N

(8)

To apply the Mann–Whitney test, we first rank all 11 scores from lowest to high- est, assigning tied ranks to tied scores. The orthopedic group is the smaller of the two, and if those patients generally have had fewer recent stressful life events, then the sum of the ranks assigned to that group would be relatively low. Letting stand for the sum of the ranks in the smaller group (the orthopedic group), we find

in smaller group

We can evaluate the obtained value of by using Table E.8 in the Appendix E, which gives the smallest value of we would expect to obtain by chance if the null hypothesis were true. From Table E.8 we find that for subjects in the smaller group and subjects in the larger group ( is always used to represent the num- ber of subjects in the smaller group) the entry for (one-tailed) is 18. This means that for a difference between groups to be significant at the two-tailed .05 level (or the one-tailed .025 level), must be less than or equal to 18. Because we found to be 21, we cannot reject (By way of comparison, if we ran a test on these data, ignoring the fact that one sample variance is almost 50 times the other and that the data suggest that our prediction of the shape of the distribution of cardiac scores may be correct, would be 1.52 on 9 t df,which is also a nonsignificant result.)

t H₀.

W_s

W_S

a 5 .025 n₁

n₂5 6

n₁5 5 W_S

W_S W_S5 2 1 3.5 1 3.5 1 5 1 7 5 21

W_S5 ©1Ri2

W_S Table 20.1

Stressful Life Events Reported by Cardiac and Orthopedic Patients Cardiac Patients Orthopedic Patients

Data 32 8 7 29 5 0 1 2 2 3 6

Ranks 11 9 8 10 6 1 2 3.5 3.5 5 7

As an aside, I should point out that we would have rejected if our value of was smaller than the tabled value. Until now you have been rejecting when the obtained test statistic was larger than the corresponding tabled value. When we work with nonparametric tests the tables are usually set up to lead to rejection for small obtained values. If I were redesigning statistical procedures, I would set the tables up differently, but nobody asked me. Just get used to the fact that parametric tables are set up such that you reject for large obtained values, and nonparametric tables are often set up so that you reject for small values. That’s just the way it is.

H₀

H₀ W_S H₀

The entries in Table E.8 are for a one-tailed test and will lead to rejection of the null hypothesis only if the sum of the ranks for the smaller group is sufficiently small.

It is possible, however, that the larger ranks could be congregated in the smaller group, in which case if H₀is false, the sum of the ranks would be larger than chance

(9)

expectation rather than smaller. One rather awkward way around this problem would be to rank the data all over again, this time ranking from high to low, rather than from low to high. If we did that, the smaller ranks would appear in the smaller group, and we could proceed as before. We do not have to go through the process of reranking data, however. We can accomplish the same thing by making use of the symmetric properties of the distribution of the rank sum by calculating a statistic called . is the sum of the ranks for the smaller group that we would have found if we had reversed our ranking and ranked from highest to lowest:

where and is tabled in Table E.8 in Appendix E. For a two- tailed test of (which is what we normally want) we calculate both and enter the table with whichever is smaller, and double the listed value of .

For an illustration of and consider the following two sets of data:

Set 1

Group 1 Group 2

X 2 15 16 19 18 23 25 37 82

Ranks 1 2 3 5 4 6 7 8 9

Set 2

Group 1 Group 2

X 60 40 24 21 23 18 15 14 4

Ranks 9 8 7 5 6 4 3 2 1

Notice that the two data sets exhibit the same degree of extremeness, in the sense that for the first set, four of the five lowest ranks are in Group 1, and in the second set, four of the five highest ranks are in Group 1. Moreover, for Set 1 is equal to for Set 2 and vice versa. Thus if we establish the rule that we will calculate both and for the smaller group and refer the smaller of and to the tables, we will have a two-tailed test and will come to the same conclusion with respect to the two data sets.

The Normal Approximation

Table E.8 in Appendix E is suitable for all cases in which and are less than or equal to 25. For larger values of n₁and/or n₂we can make use of the fact that the

n₂ n₁

W¿_S W_S

W¿_S

W_S W_S5 29 W_S¿5 11

W_S5 11 W_S¿5 29

W¿_S, W_S

a W¿_S, W_S

H₀

2W 5 n₁1n11 n₂1 12 W¿_S5 2W 2 W_S

W¿_S W¿_S

(10)

distribution of approaches a normal distribution as sample sizes increase. This distribution has

and

Because the distribution is normal and we know its mean and its standard deviation (the standard error), we can calculate :

and obtain from the tables of the normal distribution an approximation of the true probability of a value of at least as low as the one obtained.

To illustrate the computations for the case in which the larger ranks fall into the smaller group and to illustrate the use of the normal approximation (although we don’t really need to use an approximation for such small sample sizes), consider the data in Table 20.2. These data are hypothetical (but reasonable) data on the birthweights (in grams) of children born to mothers who did not seek prenatal care until the third trimester and of children born to mothers who received prenatal care starting in the first trimester.

For the data in Table 20.2 the sum of the ranks in the smaller group equals 100.

From Table E.8 in Appendix E we find thus

Because 52 is smaller than 100, we go to Table E.8 with and (Remember, is defined as the smaller sample size.) Because we want a two-tailed test, we will double the column headings for The critical value of (or ) for a two-tailed test at is 53, meaning that only 5% of the time would we expect a value of or less than or equal to 53 when is true. Our obtained value of is 52, which falls into the rejection region, so we will reject We will conclude that mothers who do not receive prenatal care until the third trimester tend to give birth to smaller babies. This does not necessarily mean that not having care until the third trimester causes smaller babies, but only that variables associated with delayed care (e.g., young mothers, poor nutrition, and poverty) also are associated with lower birthweight.

The use of the normal approximation for evaluating is illustrated in the lower section of Table 20.2. Here we find that From Table E.10 in Appendix E we find that the probability of W_Sor W¿_Sat least as small as 52

z 5 2.13.

W_S H₀.

W_S

H₀ W¿_S

W_S

a 5 .05 W¿_S

W_S a.

n₁ n₂5 10.

W_S5 52, n₁5 8, W¿_S5 2W 2 W_S5 52.

2W 5 152;

W_S z 5 Statistic 2 Mean

Standard error 5

W_S2n₁1 n11 n₂1 12 2

B

n₁n₂1 n11 n₂1 12 12

z Standard error 5B

n₁ n₂1 n11 n₂1 12 12

Mean 5n₁1n11 n₂1 12 2 W_S

(11)

(a at least as extreme as ) is Because this value is smaller than our traditional cutoff of we will reject and again conclude that there is sufficient evidence to say that failing to seek early prenatal care is related to lower birthweight. Note that both the exact solution and the normal approximation lead to the same conclusion with respect to (With the normal approximation it is not necessary to calculate and use because use of will lead to the same value of except for the reversal of its sign. It would be instructive for you to calculate Student’s test for two independent groups from the same set of data.)

t z

W_S

W¿_S H₀. H₀

a 5 .05,

210.01662 5 0.033.

;2.13 z

Table 20.2

Data on Birthweight of Infants Born to Mothers with Different Levels of Prenatal Care

Beginning of Care

Third Trimester First Trimester

Birthweight Rank Birthweight Rank

1,680 2 2,940 10

3,830 17 3,380 16

3,110 14 4,900 18

2,760 5 2,810 9

1,700 3 2,800 8

2,790 7 3,210 15

3,050 12 3,080 13

2,660 4 2,950 11

1,400 1

2,775 6

5 100 2 76

2126.66675 2.13 5

100 2818 1 10 1 12 2

B

81102 18 1 10 1 12 12 z 5

W_S2n₁1n11 n₂1 12 2

B

n₁n₂1n11 n₂1 12 12

W_S¿5 2W 2 W_S5 152 2 100 5 52 W_S5 o1ranks in Group 22 5 100

(12)

The Treatment of Ties

When the data contain tied scores, any test that relies on ranks is likely to be somewhat distorted. There are several different ways of dealing with ties. You can assign tied ranks to tied scores (as we have been doing), you can flip a coin and assign con- secutive ranks to tied scores, or you can assign untied ranks in whatever way will make it hardest to reject In actual practice most people simply assign tied ranks.

Although that may not be the statistically best way to proceed, it is the most common and the method we will use here.

The Null Hypothesis

The Mann –Whitney test evaluates the null hypothesis that the two sets of scores were sampled from identical populations. This is broader than the null hypothesis tested by the corresponding test, which dealt specifically with means (primarily as a result of the underlying assumptions that ruled out other sources of difference). If the two populations are assumed to have the same shape and dispersion, then the null hypothesis tested by the Mann –Whitney test would actually deal with the central tendency (in this case the medians) of the two populations; if the populations are also symmetric, the test will be a test of means. In any event the Mann –Whitney test is particularly sensitive to differences in central tendency.

Using SPSS

I will illustrate the use of SPSS for this test, and it should be clear how it would be used for those tests that follow. In Chapter 17 we considered data collected by Willer (2005) on the Masculine Overcompensation Thesis. Those data can be found on the Web site as Tab17.5.dat. The first column represents Gender the second column represents Condition and the third column contains the dependent variable (Price). In Chapter 17 I mentioned that Willer’s data were probably positively skewed, although the data that I created to match his data were more or less normal. This might be a place where the Mann–Whitney test would be useful, especially if we had Willer’s actual data. I also noted there that Willer was most interested in males and the hypothesis that when males’ masculinity is questioned, they might engage in more masculine behavior, and so we will limit our analysis to males.

To restrict the analysis to data from males, you need to go to the drop- down menu labeled Data, choose Select Cases, and then specify that you want to use only the data from Gender 1. Next, choose Analyze/

Nonparametric tests/ 2-independent samples. Next you need to specify that Price is the test variable and that Threat is the Grouping variable. When you do that you also have to indicate that the levels of Threat are 1 and 2. The results of this analysis appear below.

5

11 5 Threat2, 11 5 Male2,

t H₀.

(13)

Mann–Whitney Test

Condition N Mean Rank Sum of Rank

Price willing to pay Threatened 25 29.62 740.50

Confirmed 25 21.38 534.50

Total 50

Ranks

Price willing to pay

Mann–Whitney U 209.500

Wilcoxon W 534.500

Z 1.999

Asymp. Sig. (2-tailed) .046 2 Test Statistics^a

a. Grouping Variable: Condition

Here you see that the probability of this result under the null hypothesis is given as .046, which is less that .05 and will lead us to conclude that threatened males do engage in more masculine behavior. (SPSS uses a normal approximation, but if you look at Appendix Table E.8 you will see that the critical sum of ranks is 536.

From the printout the smaller sum was 534.5, which also leads to rejection of the null hypothesis.)

20.2 Wilcoxon‘s Matched-Pairs Signed-Ranks Test

Frank Wilcoxon is credited with developing the most popular distribution-free test for independent groups, which I referred to as the Mann–Whitney test to avoid confusion and because of their work on it. He also developed the most popular test for matched groups (or paired scores). This test is the distribution-free analogue of the test for related samples. It tests the null hypothesis that two related (matched) samples were drawn either from identical populations or from symmetric populations with the same mean. More specifically it tests the null hypothesis that the distribution of difference scores (in the population) is symmetric about zero. This is the same hypothesis tested by the corresponding test when that test‘s normality assumption is met.

The logic behind Wilcoxon’s matched-pairs signed-ranks test is straight- forward and can be illustrated with an example of a study of schizophrenia and subcortical structures by Suddath, Christison, Torrey, Casanova, and Weinberger (1990). Bleuler (1911) originally described schizophrenia as being characterized by a lack of connections between associations in memory. The hippocampus has

t t

(14)

been suggested as playing an important role in memory storage and retrieval, and it is reasonable to ask if differences in hippocampal structures (particularly size) could play a role in schizophrenia. Suddath obtained MRI scans on the brains of 15 schizophrenic individuals and their monozygotic (identical) twins. They measured the volume of each brain’s left hippocampus. Because there are many things that control the volume of cortical and subcortical structures, Suddath used monozygotic twin pairs in an effort to control as many of these as possible and to reduce the amount of variance to be explained. The results appear in Table 20.3 as taken from Ramsey and Schafer (1996).

Definition Wilcoxon’s matched-pairs signed-ranks test: A nonparametric test for comparing the central tendency of two matched (related) samples.

If you plot the difference scores for these 15 twin pairs, as shown in Figure 20.1, you will note that the distribution is far from normal. With so few observations it is not feasible to make a definitive statement about normality, but I would not like to have to defend the idea that these are normally distributed observations. For that reason I would prefer to rely on a distribution-free test for paired observations, and that test is the Wilcoxon matched-pairs signed-ranks

Table 20.3

Data on Volume (in cm³) of Left Hippocampus in Schizophrenic and Nonschizophrenic Twin Pairs

Signed

Pair Normal Schizophrenic Difference Rank Rank

1 1.94 1.27 0.67 15 15

2 1.44 1.63 .18 9 9

3 1.56 1.47 0.09 5 5

4 1.58 1.39 0.19 10 10

5 2.06 1.93 0.13 8 8

6 1.66 1.26 0.40 12 12

7 1.75 1.71 0.04 3 3

8 1.77 1.67 0.10 6 6

9 1.78 1.28 0.50 13 13

10 1.92 1.85 0.07 4 4

11 1.25 1.02 0.23 11 11

12 1.93 1.34 0.59 14 14

13 2.04 2.02 0.02 1 1

14 1.62 1.59 0.03 2 2

15 2.08 1.97 0.11 7 7

T (Positive ranks) 111 T (Negative ranks) 9

(15)

test, which is based, as its name suggests, on the ranks of the differences rather than the numerical values.

If schizophrenia is associated with lower (or higher) volume for the left hippocampus, we would expect most of the twin pairs to show a lower (or higher) volume for the schizophrenic twin than for the control twin. Thus we would expect a predominantly positive (or negative) difference. We also would expect that twin pairs who broke this pattern to differ only slightly in the opposite direction from the trend. On the other hand, if schizophrenia has nothing to do with volume, we would expect about one-half of the difference scores to be positive and one-half to be negative, with the positive differences about as large as the negative ones. In other words, if is really true, we would no longer expect most changes to be in the predicted direction with only small changes in the unpredicted direction. Notice that I have deliberately phrased this paragraph for a two-tailed (nondirectional) test.

For a directional test you would simply remove the phrases in parentheses.

In carrying out the Wilcoxon matched-pairs signed-ranks test we first calculate the difference score for each pair of measurements. We then rank all difference scores without regard to the sign of the difference, give the algebraic sign of the differences to the ranks themselves, and finally sum the positive and negative ranks separately. The data in Table 20.3 present the numerical scores for the 15 schizophrenic participants and their twin in columns two and three. The fourth column shows the differences between the twins, with these differences ranked (without regard to sign) in the fifth column. Although the difference for pair 2 is the smallest number in column four and would normally be ranked 1, when we drop its sign and look only at the size of the difference and not its direction, it is the ninth-smallest difference. The last column shows the ranks found in column five with the sign of the difference1.192,

12.192, 1in cm³2

H₀ Figure 20.1

Distribution of differences between schizophrenic and normal twins

–.25 0 .13 .25 .38 .50 .63

Difference 0

1 2 3 4 5 6

Histogram of Difference Scores

Std. Dev. = .24 Mean = .20 N = 15.00 –.13

(16)

applied. The test statistic is taken as the smaller of the absolute values (i.e., dropping the sign) of the two sums and is evaluated against Table E.7 in Appendix E. (It is important to note that in calculating we attach algebraic signs to the ranks only for convenience. We could just as easily, for example, cir- cle those ranks that went with lower volume for the normal twin and underline those that went with higher volume for the normal twin. We are merely trying to differentiate between the two cases.)

For the data in Table 20.3 only one of the pairs had the normal twin with a smaller volume that the schizophrenic twin. Although that was the largest difference, it was still only one case. All other pairs showed a difference in the other direction. The sum of the positive ranks and the sum of the negative ranks Because is defined as the smaller absolute value of

and

To evaluate we refer to Table E.7, a portion of which is shown in Table 20.4.

The format of this table is somewhat different from that of the other tables we have seen. The easiest way to understand what the entries in the table represent is by way of an analogy. Suppose that to test the fairness of a coin, you are going to flip it eight times and reject the null hypothesis, at (one-tailed), if there were too few heads. Out of eight flips of a coin there is no set of outcomes that has a probability of exactly .05 under The probability of one or fewer heads is .0352, and the probability of two or fewer heads is .1445. Thus if we want to work at we can either reject for one or fewer heads, in which case the probability of a Type I error is actually .0352 (less than .05), or we can reject for two or fewer heads, in which case the probability of a Type I error is actually .1445 (much greater than .05). Do you see where we are going? The same kind of problem arises with because it is a discrete distribution. No value has a probability of exactly the desired

In Table E.7 we find that for a one-tailed test at (or a two-tailed test at ) with the entries are 25 [.0240] and 26 [.0277]. This tells us that if we want to work at a (one-tailed) (and thus a two-tailed test at ), we can reject either for (in which case actually equals .0240) or for (in which case the true value of is .0277). Because we want a two-tailed test, the probabilities should be doubled to 25 [.0480] and 26 [.0554]. We obtained a value of 9, so we would reject whichever cutoff we choose. We will conclude, therefore, that we reject the null hypothesis of equal volumes for the left hippocampus for both schizophrenic and normal participants. We can see from the data that the left hippocampus is generally smaller in those suffering from schizophrenia. This is a very important finding if only in that it demonstrates that there is a physical basis underlying schizophrenia, and not simply “mistaken ways of living.”

Ties

Ties can occur in the data in two different ways. One way would be for a twin pair to have the same scores for both the normal and schizophrenic twin, leading to a difference score of zero, which has no sign. In that case we normally eliminate that

H₀,

T T # 26 a

T # 25 a H₀

a 5 .05 a 5 .025

n 5 15 a 5 .05

a 5 .025 a.

T a 5 .05,

H₀.

a 5 .05 T,

T2, T 5 9.

T1

1T22 5 29. T 1T12 5 111

9th T

1T2

(17)

pair from consideration and reduce the sample size accordingly, although this leads to some bias in the test.

We could have tied difference scores that lead to tied rankings. If both tied scores have the same sign, we can break the tie in any way we want (or assign tied ranks) without affecting the final outcome. If the scores have opposite signs, we normally assign tied ranks and proceed as usual.

The Normal Approximation

Just as with the Mann–Whitney test, when the sample size is too large (in this case, larger than 50, which is the limit for Table E.7), a normal approximation is available to evaluate For larger sample sizes we know that the sampling distribution is approximately normally distributed with

and

Standard error 5B

n1n 1 12 12n 1 12 24 Mean 5n1n 1 12

4 T.

Table 20.4

Critical Lower-Tail Values of T and Their Associated Probabilities (Abbreviated Version of Table E.7)

Nominal ␣ (One-Tailed)

.05 .025 .01 .005

N T ␣ T ␣ T ␣ T ␣

5 0 .0313

1 .0625

6 2 .0469 0 .0156

3 .0781 1 .0313

7 3 .0391 2 .0234 0 .0078

4 .0547 3 .0391 1 .0156

8 5 .0391 3 .0195 1 .0078 0 .0039

6 .0547 4 .0273 2 .0117 1 .0078

9 8 .0488 5 .0195 3 .0098 1 .0039

9 .0645 6 .0273 4 .0137 2 .0059

10 10 .0420 8 .0244 5 .0098 3 .0049

11 .0527 9 .0322 6 .0137 4 .0068

. . . . . . . . . . . . . . . . . . . . . . . . . . .

15 30 .0473 25 .0240 19 .0090 15 .0042

31 .0535 26 .0277 20 .0108 16 .0051

(18)

Thus we can calculate z as

and evaluate using Table E.10. The procedure is directly analogous to that used with the Mann–Whitney test and will not be repeated here.

z z 5

T 2 n1n 1 12 4

B

n1n 1 12 12n 1 12 24

Frank Wilcoxon (1892 –1965)

Frank Wilcoxon is an interesting person in statistics for the simple reason that he was not really a statistician and didn’t publish any statistical work until he was in his 50s. He was originally trained in inorganic chemistry and spent most of his life doing chemical research dealing with insecticides and fungicides.

Wilcoxon had been in a statistical study group with W. J. Youden, an important early figure in statistics, and they had worked their way through Fisher’s very influential text. But when it came to analyzing data in later years, Wilcoxon was not satisfied with Fisher’s method of randomization of observations. Wilcoxon hit upon the idea of substituting ranks for raw scores, which allowed him to work out the distribution of various test statistics quite easily. His use of ranks stimulated work on inference based ranks on and led to a number of related statistical tests applied to ranks.

Wilcoxon officially retired in 1957, but then joined Florida State University and worked on sequential ranking methods until his death. His name is still largely synonymous with rank-based statistics.

20.3 Kruskal–Wallis One-Way Analysis of Variance

The Kruskal–Wallis one-way analysis of variance is a direct generalization of the Mann–Whitney test to the case in which we have three or more independent groups. As such it is the distribution-free analogue of the one-way analysis of variance discussed in Chapter 16. It tests the hypothesis that all samples were drawn from identical populations and is particularly sensitive to differences in central tendency.

Definition Kruskal–Wallis one-way analysis of variance: A nonparametric test analogous to a standard one-way analysis of variance.

(19)

To perform the Kruskal–Wallis test, we simply rank all scores without regard to group membership and then compute the sum of the ranks for each group. The sums are denoted by If the null hypothesis were true, we would expect the s to be more or less equal (aside from differences due to the size of the samples). A measure of the degree to which the s differ from one another is provided by

where

and the summation is taken over all groups. is then evaluated against the distribution on k 2 1 df.

x² H

R_j5 the sum of the ranks in the jth group n_j5 the number of observations in the jth group

H 5 12 N1N 1 12 a

R²_j

n_j 2 31N 1 12 R_j

R_j R_j.

Students frequently have problems with a statement such as “ is then evaluated against the distribution on ” All that it really means is that we treat as if it were a value of and look it up in the chi-square tables on k 2 1 df.

x² H

k 2 1 df.

x²

H

For an example, assume that the data in Table 20.5 represent the number of simple arithmetic problems (out of 85) solved (correctly or incorrectly) in one hour by participants given a depressant drug, a stimulant drug, or a placebo. Notice that in the Depressant group three of the participants were too depressed to do much of anything and in the Stimulant group three of the participants ran up against the limit of 85 available problems. These data are decidedly nonnormal, and we will convert the data to ranks and use the Kruskal–Wallis test. The calculations are shown in the lower part of the table. The obtained value of is 10.36, which can be treated as a on The critical value of is found in Table E.1 in the Appendices to be 5.99. Because we can reject and conclude that the three drugs lead to different rates of performance. (Like other chi-square tests, this test rejects for large values of It is nonetheless a nondirectional test.)

20.4 Friedman‘s Rank Test for k Correlated Samples

The last test to be discussed in this chapter is the distribution-free analogue of the one-way repeated-measures analysis of variance, Friedman’s rank test for k corre- lated samples. It was developed by the well-known economist Milton Friedman—in

H.

H₀

H₀ 10.36 7 5.99,

x².₀₅122 3 2 1 5 2 df.

x²

H

(20)

the days before he was a well-known economist. This test is closely related to a standard repeated-measures analysis of variance applied to ranks instead of raw scores.

It is a test on the null hypothesis that the scores for each treatment were drawn from identical populations, and it is especially sensitive to population differences in central tendency.

Definition Friedman’s rank test for k correlated samples: A nonparametric test analogous to a standard one-way repeated-measures analysis of variance.

We will base our example on a study by Foertsch and Gernsbacher (1997), who investigated the substitution of the genderless word “they” for “he” or “she.” With the decrease in the acceptance of the word “he” as a gender-neutral pronoun, many writers are using the grammatically incorrect “they” in its place. (You may have noticed that in this text I have very deliberately used the less-expected pronoun, such as “he” for nurse and “she” for professor, to make the point that profession and gender are not linked. You may also have noticed that you sometimes stumbled over some of those sentences, taking longer to read them. That is what Foertsch Table 20.5

Kruskal–Wallis Test Applied to Data on Problem Solving

Depressant Stimulant Placebo

Score Rank Score Rank Score Rank

55 9 73 15 61 11

0 1.5 85 18 54 8

1 3 51 7 80 16

0 1.5 63 12 47 5

50 6 85 18

60 10 85 18

44 4 66 13

69 14

R_i 35 115 40

x²_.05122 5 5.99 5 10.36 5 70.36 2 60 5 12

38012228.1252 2 60 5 12

191202 a35² 7 1115²

8 140²

4 b 2 3119 1 12 H 5 12

N1N 1 12 a

k

i51

R²_i

n_i 2 31N 1 12

(21)

If the null hypothesis were true, we would expect the rankings to be randomly distributed within each subject. Thus one participant might do best on sentences with an expected “he,” another might do best with an expected “she,” and a third and Gernsbacher’s study was all about.) Foertsch and Gernsbacher asked participants to read sentences like “A truck driver should never drive when sleepy, even if (he/she/they) may be struggling to make a delivery on time, because many acci- dents are caused by drivers who fall asleep at the wheel.” On some trials the words in parentheses were replaced by the gender-stereotypic expected pronoun, sometimes by the gender-stereotypic unexpected pronoun, and sometimes by “they.” For our purposes the dependent variable will be taken as the difference in reading time between sentences with unexpected pronouns and sentences with “they.” There were three kinds of sentences in this study, those in which the expected pronoun was male, those in which it was female, and those in which it could equally be male or female. There are several dependent variables I could use from this study, but I have chosen the effect of seeing “she” when expecting “he,” the effect of seeing

“he” when expecting “she,” and effect of seeing “they” when the expectation is neutral. (The original study is more complete than this.) The dependent variable is the reading time/character (in milliseconds). The data in Table 20.6 have been created to have roughly the same medians as the authors’ report.

Here we have repeated measures on each participant, because each participant was presented with each kind of sentence. Some people read anything more slowly than others, which is reflected in the raw data. The data are far from normally distributed, which is why I am applying a distribution-free test.

For Friedman’s test the data are ranked within each subject from low to high.

If it is easier to read neutral sentences with “they” than sentences with an unexpected pronoun, then the lowest ranks for each participant should pile up in the Neutral category. The ranked data follow.

Table 20.6

Data on Reading Times as a Function of Pronoun

Participant 1 2 3 4 5 6 7 8 9 10 11

Expect He/See She 50 54 56 55 48 50 72 68 55 57 68

Expect She/See He 53 53 55 58 52 53 75 70 67 58 67

Neutral/See They 52 50 52 51 46 49 68 60 60 59 60

Raw Data

Participant 1 2 3 4 5 6 7 8 9 10 11 Sum

Expect He/See She 1 3 3 2 2 2 2 2 1 1 3 22

Expect She/See He 3 2 2 3 3 3 3 3 3 2 2 29

Neutral/See They 2 1 1 1 1 1 1 1 2 3 1 15

(22)

might do best with an expected “they.” If this were the case, the sum of the rankings in each condition (row) would be approximately equal. On the other hand, if neutral sentences with “they” are easiest, then most participants would have their lowest ranking under that condition, and the sum of the rankings for the three conditions would be decidedly unequal.

To apply Friedman’s test, we rank the raw scores for each participant separately and then sum the rankings for each condition. We then evaluate the variability of the sums by computing

where

the sum of the ranks for the jth condition the number of subjects

the number of conditions

and the summation is taken over all conditions. This value of can be evaluated with respect to the standard distribution on

For the data in Table 20.5 we have

The critical value of on is 5.99, so we can reject and conclude that reading times are not independent of conditions. People can read a neutral sentence with “they” much faster than they can read sentences wherein the gender of the pronoun conflicts with the expected gender. From additional data that Foertsch and Gernsbacher present, it is clear that “they” is easier to read than the

“wrong” gender, but harder than the expected gender.

20.5 Measures of Effect Size

Measures of effect size are difficult to find with distribution-free statistical tests.¹ An important reason for this is because many of our effect size measures are based on the size of the standard deviation, and if the data are very badly (nonnormally)

H₀ 3 2 1 5 2 df

x²

5 12

11132 142115502 2 31112 142 5 140.9 2 132 5 8.9

5 12

11132 142 122²1 29²1 15²2 2 31112 142 x²_F5 12

Nk1k 1 12 aR²_j 2 3N1k 1 12

k 2 1 df.

x²

x²_F k

k 5 N 5 R_j5

x²_F5 12

Nk1k 1 12 aR²_j 2 3N1k 1 12

1Conover (1980) discusses the use of confidence intervals for nonparametric procedures.

(23)

distributed, a standard deviation loses much of its meaning for this purpose. If we know that data are normally distributed, and we know that the mean for one group is a standard deviation above the mean for another group, we can estimate that about two-thirds of the participants in the second group outscore the mean of the first group. But if our data are badly skewed we lose that kind of interpretation of the effect size. Similarly, even if we don’t standardized the difference between means on the basis of a standard deviation, with badly skewed data we still do not have a good understanding of what it means to say that the median of group 1 is 15 points above the median of group 2.

One effect size measure that you could use is to directly count, in your sample, the number, or better yet the percentage, of one group that outscored those in another group. For example, for the data in Table 20.2 we see that the median birthweight for those who received prenatal care in the first trimester was 3,245 grams.

For those who did not receive it until the third trimester, the median weight was 2,765.5 grams. This difference was statistically significant, and all mothers in the first trimester group had infants that weighed more than the median of the third trimester mothers. (Or, to put it in the reverse, only one mother in the third trimester group gave birth to an infant that was over the median weight of the first trimester group.) Reporting an effect size in this way may not be as satisfying as reporting effect sizes using or a related statistic, but it is certainly more informative that simply reporting that the difference was significant.

20.6 Writing Up the Results

I will give an example of writing up the results for the study by Suddath et al.

(1990). In writing those results we want to mention briefly what the study was about and how it was conducted. Then we want to explain why we would use a distribution-free test in this situation and then go on to report the results of that test.

Finally, we want to give the reader some sense of how large an effect we found.

✍

Suddath et al. (1990) examined the size of cortical structures in schizophrenic patients and their monozygotic twin. To control for possibly confounding variables, they chose 15 pairs of monozygotic twins.

Because schizophrenic patients often cannot seem to form appropriate connections between items stored in memory, the investigators were particularly interested in the hippocampus due to its role in memory storage and retrieval.

The measure used in this example was the volume of the left hippocampus taken from MRI scans of the brains of the participants.

Because the data were far from normally distributed, particularly the set of differences between the normal and schizophrenic siblings in each pair, the Wilcoxon Matched-Pairs Signed-Ranks test was used.

The results showed that the median difference in left hippocampal volume between normal and schizophrenic twins was 10.5 cm³, with the

dˆ

(24)

schizophrenic twins having the lower volume. This difference was statistically significant and in all but one case out of 15 (93%) the normal twin had the larger volume, indicating a robust finding. These results strongly support the belief that there is a clear structural difference between schizophrenic and normal participants in terms of the size of at least one subcortical structure.

20.7 Summary

This chapter summarized briefly a set of procedures that require far fewer restrictive assumptions concerning the populations from which our data have been sampled. Each of these tests involves converting raw scores to ranks and working with those ranks. What all of these tests have in common is that they ask how the ranks would be distributed if the null hypothesis were true. To do so they look at all possible randomizations of the ranks across groups. (That is why they are often called

“rank randomization” tests.) They then reject the null if the obtained pattern of ranks is too extreme. The tests differ in how they rank the observations. These tests are called nonparametric or distribution-free tests because they make many fewer assumptions about the shape of the distribution and do not rely on unknown parameters such as the population mean or variance We first examined the Mann–Whitney test, which is the distribution-free analogue of the independent sample test. (An almost identical test was also developed by Wilcoxon.) To perform the test we simply ranked the data without regard to group membership and asked if the distribution of ranks resembled the distribution we would expect if the null hypothesis were true. We take as our test statistic the sum of the ranks in the smaller group (or the smaller sum of ranks if the groups are of the same size.) If the test sta- tistic is smaller than the critical value found in the Mann–Whitney tables, we reject the null hypothesis. If the sample sizes are larger than those covered by the table, we can convert the sum of ranks to a score and refer that to the normal distribution.

The same general logic applies to the Wilcoxon matched-pairs signed-ranks test, which is the distribution-free test corresponding to the matched-sample test. In this test we first calculate the difference for each pair of scores, then rank those differences and assign these ranks the sign of the difference. Our test statistic is the smaller of the sum of the negative ranks and the sum of the positive ranks. We then either refer this sum to tables of the Mann–Whitney test or use a normal approximation with

We then discussed two distribution-free tests that are analogous to an analysis of variance on independent measures (the Kruskal–Wallis one-way analysis of variance) and repeated measures (Friedman’s rank test for correlated samples).

The Kruskal–Wallis ranks all observations regardless of group membership and then asks if the sums of the ranks assigned to each group are about as equal as the null hypothesis would predict. For the Friedman test we rank the scores for each participant, and then ask if the smaller ranks predominated in one treatment and the larger ranks predominated in another.

k

z.

t z

t

1s²2.

1m2 1T 5 9, p 6 .052,

(25)

I pointed out that we do not have a good measure of effect size to go with these tests, but that one approach would be to report the number of observations in one group that were smaller than the smallest observation in another group.

Some important terms in this chapter are

Parametric tests, 537 Nonparametric tests (distribution-free tests), 537 Rank-randomization tests, 539 Mann–Whitney test, 540

Wilcoxon’s matched-pairs signed-ranks test, 549

Kruskal–Wallis one-way analysis of variance, 553

Friedman’s rank test for k correlated samples, 555

20.8 Exercises

20.1 McConaughy (1980) has argued that younger children organize stories in terms of simple descriptive (“and then . . .”) models, whereas older children incorporate causal statements and social inferences. Suppose we asked two groups of children differing in age to summa- rize a story they just read. We then counted the number of statements in the summary that can be classed as inferences. The data are shown.

Younger Older Children Children

0 4

1 7

0 6

3 4

2 8

5 7

2 5

(a) Analyze these data using the two-tailed rank-sum test.

(b) What would you conclude?

20.2 Kapp, Frysinger, Gallagher, and Hazelton (1979) have demonstrated that lesions in the amygdala can reduce certain responses commonly associated with fear (e.g., decreases in heart rate). If fear is really reduced by the lesion, it should be more difficult to train an avoidance response in those animals because the aversiveness of the stimulus will conse- quently be reduced. Assume two groups of rabbits: One group has lesions in the amygdala, and the other is an untreated control group. The following data represent the number of trials to learn an avoidance response for each animal.