, NOTES NOTE ON WILCOXON'S TWO-SAMPLE TEST WHEN TIES ARE PRESENT. Mathematical Centre, Amsterdam

(1)

Reprinted from THE ANNALS OF MATHEMATICAL STATISTtCS Vol. 23, No. 1, March, 1952

Printed in U.S.A.

, NOTES

NOTE ON WILCOXON'S TWO-SAMPLE TEST WHEN TIES ARE PRESENT

BY J. HEMELRIJK Mathematical Centre, Amsterdam

Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H.B. Mann and

D. R. Whitny [2]) depends on a statistic U with the following definition: If X1 , • • • , Xn and Yr , · · · , Ym are the two samples, U is the number of pairs

(i, j) with Xi

>

Yi. The probability distribution of U, under the hypothesis

that the samples have been drawn independently from the same continuous

pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.

It is noteworthy that Wilcoxon's U is closely connected with the quantity S,

which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation.

When r pairs of numbers (uk, vk) are given, Sis computed by scoring:

-1, if (uh - Uk) (vh - Vk)

<

0, 0, if (uh - Uk) (vh - Vk)

=

0, +1, if (uh - Uk) (vh - vk)

> 0,

and adding the scores for all pairs (h, k) with h

<

k. If, in this definition, we take r

=

n

+

m and substitute the values Xr, · · · , Xn, Y1, ·· · · , Ym in this

order for U1 , • • • , Un , Un+i , • • • , Ur , and O or 1 respectively for vk if uk

=

x;

for some i or '!f'k

=

y i for some j respectively, then the following relation holds:

(1) 2U

+

S

=

nm.

The simplest way to see this is by considering the total score of 2U

+

S for every pair (h, k). This score is equal to

+

1 if Vh

=

0 and vk

=

I, and 0

other-wise. The sum of the scores is therefore nm.

Relation (1) holds if no ties are present among the two samples x1 , • • •,

x,. and Y1, · · · , Ym. It is natural to define U in general by extending (1) to the case when there are ties. Since for a pair (x; , y i) with X;

=

y i the score of

S is equal to zero, the score for U must be taken as

½

for such a pair.

Now Kendall has derived the mean and the standard deviation of S under

the hypothesis that for a given order of the quantities v1 ,· • • • , Vr all the r !

possible permutations of u1 , · · · , Ur are equally probable. This condition is

fulfilled in our case if the samples x1, · · · , Xn and Yr, · · · , Ym have been drawn

at random from the same population (which need not be continuous anymore).

Therefore, the mean and standard deviation of U under the null hypothesis

may be derived from Kendall's formulas.

(2)

134 J. HEMELRIJK

According to Kendall ([4], pp. 56 and 60), we have

(2) and E(S)

=

0 var (S)

= /

13 {r(r - 1)(2r + 5) -

E

t(t - 1)(2t + 5) t (3) -

~

s(s - 1)(2s + 5)

l

+ gr(r _ 1\ r _ 2) {~ t(t - l)(t - 2)

l

• {~ s(s - l)(s - 2)1 + 2r(r

~

l)

q::

t(t - l)} {~ s(s - 1)},

where summation

Li

takes place over the various ties among ui , · · · , Ur , and

E.

over the ties among Vi , • • • , Vr ; t and s respectively indicating the number of

elements in every group of equal numbers among ui , · · · , Ur and v1 , • • · , v,

respectively. From (1) we have

(4)

and

(5)

E(U)

=

½

nm - E(S)

=

½

nm

var (U)

=

¼

var (S).

The group Vi , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)

takes the values n and m and we have

L

s(s - 1) (2s + 5)

=

n(n - 1) (2n + 5) + m(m - 1) (2m + 5),

• E

s(s - 1) (s - 2)

=

n(n - 1) (n - 2) + m(m - 1) (m - 2),

• E

s(s - 1)

=

n(n - 1)

+

m(m - 1) .

•

Substituting in (3) and (5), we obtain after some reduction

var (U)

=

,\nm(n + m + 1) -

-h

E

t(t - 1)(2t +5) (6) I + n(n - 1) (n - 2) + m(m - 1) (m - 2)

E

t(t _ l) (t _ ₂₎ 36(n + m)(n + m - l)(n + m - 2) 1

+

n(n - 1)

+

m(m - 1)

E

t(t _ l) 8(m

+

n)(m

+

n - 1) t '

where

E,

takes place over the ties among the values Xi , • • • , Xn, Yi, • • • , y,,.,

taken together. ·

When no ties are present this reduces to results of Mann and Whitney [2]:

(7) E(U)

= ½nm;var

(U)

=

bnm (n + m + 1).

From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first _

(3)

NOTE ON WILCOXON'S TWO-SAMPLE TEST 135

step towards the possibility of using Wilcoxon's test for samples from any

population.

REFERENCES

[l] F. WiLCoxoN, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1 (1945), pp. 80-83.

[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947), pp. 50-00.

[3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938), pp. 81-93.

(4)

(5)

Reprinted from THE ANNALS OF MATHEMATICAL STATISTICS

Vol. 23, No. 1, March, 1952 Printed in U.S.A.

NOTES

Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H.B. Mann and D. R. Whitny [2]) depends on a statistic U with the following definition: If x1, · · · , x,. and Y1, · · • , Ym are the two samples, U is the number of pairs

(i, j) with x,

>

Yi. The probability distribution of U, under the hypothesis

that the samples have been drawn independently from the same continuous

pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.

which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation. When r pairs of numbers (u1c, v1c) are given, Sis computed by scoring:

-1, if (ui. - u1c) (vi. - v1c)

< 0,

0, if (uh - U1c) (vi. - V1c)

=

0,

+1,

if (ui. - u1c) (vi. - vk)

>

0,

and adding the scores for all pairs (h, le) with h

<

le. If, in this definition, we take r

=

n

+

m and substitute the values X1 , · · · , Xn , Y1 , · · · , Ym in this order for U1 , · · • , Un , Un+1 , · · · , u, , and 0 or 1 respectively for V1c if Uk

=

x;

for some i or Uk

=

Yi for some j respectively, then the following relation holds:

(1) 2U

+

S

=

nm.

The simplest way to Se(\ this is by considering the total score of 2U

+

S for

every pair (h, le). This score is equal to

+1

if Vh = 0 and v1c = l, and 0 other-wise. The sum of the scores is therefore nm.

Relation (1) holds if no ties are present among the two samples x1 , • • •,

x,. and Y1, · · · , Ym. It is natural to define U in general by extending (1) to

the case when there are ties. Since for a pair (xi, Yi) with x;

=

Yi the score of Sis equal to zero, the score for U must be taken as

½

for such a pair.

the hypothesis that for a given order of the quantities v1 , • • • , v, all the r !

possible permutations of U1 , • • • , u, are equally probable. This condition is

fulfilled in our case if the samples X1, • • • , Xn and Yi, · · · , Ym have been drawn

at random from the same population (which need not be continuous anymore).

Therefore, the mean and standard deviation of U under the null hypothesis

may be derived from Kendall's formulas.

(6)

134 J. HEMELRIJK

According to Kendall ([4], pp. 56 and 60), we have (2) and E(S)

=

0 var (S)

=

-ls

{r(r - 1)(2r + 5) -

L

t(t - 1)(2t + 5) t " 1 . " (3) -

L:

s(s - 1)(2s + 5}} + gr(r _ l)(r _ 2)

{'7'

t(t - l)(t - 2)}

• {L

s(s - l)(s - 2)} + 2 ( 1 l) {L t(t - l)}{L s(s - 1)}, • r r - t •

where summation

Lt

takes place over the various ties among u1 , • • • , u, , and

Ls

over the ties among v1 , • • • , Vr ; t and s respectively indicating the number of

elements in every group of equal numbers among u1 , • • • , Ur and v1 , • · • , v,

respectively. From {l) we have (4)

and

(5)

E(U) =

½

nm - E(S) =

½

nm

var (U)

=

¼

var (S).

The group V1 , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)

takes the values n and m and we have ·

I:

s(s - 1) (2s + 5)

=

n(n - 1) (2n + 5) + m(m - 1) (2m + 5),

• L

s(s - 1) (s - 2)

=

n(n - 1) (n - 2) + m(m - 1) (m - 2), •

L

s(s - 1)

=

n(n - 1)

+

m(m - 1). 8

Substituting in (3) and (5), we obtain after some reduction var (U)

=

Anm(n + m + 1) -

-h

L

t(t - 1)(2t +5)

(6) t + n(n - l)(n - 2) + m(m - l)(m - 2)

L

t(t _ l)(t _ ₂) 36(n + m)(n + m - l)(n + m - 2) t

+

n(n - 1)

+

m(m - 1)

L

t(t _ l) 8(m

+

n)(m

+

n - 1) t '

where

Lt

takes place over the ties among the values x1 , • • • , Xn, y1 , • • • , y"',

taken together.

(7)

E(U)

=

½nm;var (U)

=

A

nm (n + m + 1).

From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first

(7)

population.

REFERENCES

[1] F. WrwoxoN, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1 (1945), pp. 80-83.

[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947),

pp. 50-60.

(3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938),

pp. 81-93.

(8)

(9)

~~

•

• (i

...

9..-V

~

Reprinted from THE ANNALS OF MATHEMATICAL STATISTICS

Vol. 23, No. 1, March, 1952 Printed in U.S.A.

NOTES

Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H. B. Mann and

D. R. Whitny [2]) depends on a statistic U with the following definition: If Xi , · • • • , Xn and Yi , • • • , Ym are the two samples, U is the number of pairs (i, j) with x;

>

y_{1 •}The probability distribution of U, under the hypothesis that the samples have been drawn independently from the same continuous pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.

which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation.

When r pairs of numbers (uk, vk) are given, Sis computed by scoring:

-1, if (uh - Uk) (vh - vk)

<

0, 0, if (uh - Uk) (vh - vk)

=

0, +1, if (uh - Uk) (vh - vk)

>

0,

and adding the scores for all pairs (h, k) with h

<

k. If, in this definition, we take r

=

n

+

m and substitute the values Xi , • • • , Xn , Yi , • · • , Ym in this order for ui , · · · , Un , Un+i , · · · , Ur , and 0 or 1 respectively for Vk if Uk

=

X;

for some i or Uk

=

Y; for some j respectively, then the following relation holds:

(1) 2U

+

S

=

nm.

The simplest way to see this is by considering the total score of 2U

+

S for every pair (h, k). This score is equal to

+

1 if vh

=

0 and vk

=

I, and 0 other-wise. The sum of the scores is therefore nm.

Relation (I) holds if no ties are present among the two samples Xi, • • •, Xn and Yi, • • • , Ym. It is natural to define U in general by extending (1) to

the case when there are ties. Since for a pair (x;, y;) with x;

=

y1 the score of

Sis equal to zero, the score for U must be taken as½ for such a pair.

the hypothesis that for a given order of the quantities Vi , • • • , Vr all the r !

possible permutations of ui , · · · , Ur are equally probable. This condition is fulfilled in our case if the samples Xi, • • • , Xn and Yi, • • • , Ym have been drawn at random from the same population (which need not be continuous anymore). Therefore, the mean and standard deviation of U under the null hypothesis may be derived from Kendall's formulas.

133

!iliL}OlHEEK MATHEMATISCH CENi'RUM

AMSTERDAM

(10)

134 J. HEMELRIJK

According to Kendall ([4], pp. 56 and 60), we have

(2) and E(S)

=

0 var (S)

=

-h

{r(r - 1)(2r

+

5) -

L

t(t - 1)(2t

+

5) t (3) -

~

s(s - 1)(2s

+

5))

+

gr(r _ 1\(r _ 2) {~ t(t - l)(t - 2)) • {~ s(s - l)(s - 2))

+

2r(r

~

l) {~ t(t - 1)) {~ s(s - l)l, where summation

Lt

takes place over the various ties among u1 , • • • , Ur, and

L,

over the ties among v1 , • • • , Vr ; t and s respectively indicating the number of elements in every group of equal numbers among u1 , • • • , Ur and v1 , · • • , v,

respectively. From (1) we have

(4) E(U)

=

½

nm - E(S)

=

½

nm

and

(5) var (U)

=

¼

var (S).

The group v1 , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)

takes the values n and m and we have

I:

s(s - 1) (2s + 5)

=

n(n - 1) (2n + 5) + m(m - 1) (2m + 5),

• L

s(s - 1) (s - 2)

=

n(n - 1) (n - 2)

+

m(m - 1) (m - 2),

• L

s(s - 1)

=

n(n - 1)

+

m(m - 1) .

•

Substituting in (3) and (5), we obtain after some reduction var (U)

=

-,hnm(n + m + 1) -

-h

L

t(t - 1)(2t +5) (6) t

+

n(n - l)(n - 2)

+

m(m - l)(m - 2)

L

t(t _ l)(t _ 2) 36(n + m)(n + m - l)(n + m - 2) t

+

n(n - 1)

+

m(m - 1)

L

t(t _ l) S(m

+

n)(m

+

n - 1) t '

where

Lt

takes place over the ties among the values X1, · · · , Xn, Yi,···, y,,,,

taken together.

(7) E(U)

=

½

nm; var (U)

=

-h

nm (n

+

m

+

1).

From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first

(11)

population.

REFERENCES

[l] F. WILCOXON, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1

(1945), pp. 8o-83.

[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947),

pp. 50-60.

[3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938),

pp. 81-93.

(12)