Reprinted from THE ANNALS OF MATHEMATICAL STATISTtCS Vol. 23, No. 1, March, 1952
Printed in U.S.A.
, NOTES
NOTE ON WILCOXON'S TWO-SAMPLE TEST WHEN TIES ARE PRESENT
BY J. HEMELRIJK Mathematical Centre, Amsterdam
Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H.B. Mann and
D. R. Whitny [2]) depends on a statistic U with the following definition: If X1 , • • • , Xn and Yr , · · · , Ym are the two samples, U is the number of pairs
(i, j) with Xi
>
Yi. The probability distribution of U, under the hypothesisthat the samples have been drawn independently from the same continuous
pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.
It is noteworthy that Wilcoxon's U is closely connected with the quantity S,
which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation.
When r pairs of numbers (uk, vk) are given, Sis computed by scoring:
-1, if (uh - Uk) (vh - Vk)
<
0, 0, if (uh - Uk) (vh - Vk)=
0, +1, if (uh - Uk) (vh - vk)> 0,
and adding the scores for all pairs (h, k) with h
<
k. If, in this definition, we take r=
n+
m and substitute the values Xr, · · · , Xn, Y1, ·· · · , Ym in thisorder for U1 , • • • , Un , Un+i , • • • , Ur , and O or 1 respectively for vk if uk
=
x;for some i or '!f'k
=
y i for some j respectively, then the following relation holds:(1) 2U
+
S=
nm.The simplest way to see this is by considering the total score of 2U
+
S for every pair (h, k). This score is equal to+
1 if Vh=
0 and vk=
I, and 0other-wise. The sum of the scores is therefore nm.
Relation (1) holds if no ties are present among the two samples x1 , • • •,
x,. and Y1, · · · , Ym. It is natural to define U in general by extending (1) to the case when there are ties. Since for a pair (x; , y i) with X;
=
y i the score ofS is equal to zero, the score for U must be taken as
½
for such a pair.Now Kendall has derived the mean and the standard deviation of S under
the hypothesis that for a given order of the quantities v1 ,· • • • , Vr all the r !
possible permutations of u1 , · · · , Ur are equally probable. This condition is
fulfilled in our case if the samples x1, · · · , Xn and Yr, · · · , Ym have been drawn
at random from the same population (which need not be continuous anymore).
Therefore, the mean and standard deviation of U under the null hypothesis
may be derived from Kendall's formulas.
134 J. HEMELRIJK
According to Kendall ([4], pp. 56 and 60), we have
(2) and E(S)
=
0 var (S)= /
13 {r(r - 1)(2r + 5) -E
t(t - 1)(2t + 5) t (3) -~
s(s - 1)(2s + 5)l
+ gr(r _ 1\ r _ 2) {~ t(t - l)(t - 2)l
• {~ s(s - l)(s - 2)1 + 2r(r~
l)q::
t(t - l)} {~ s(s - 1)},where summation
Li
takes place over the various ties among ui , · · · , Ur , andE.
over the ties among Vi , • • • , Vr ; t and s respectively indicating the number ofelements in every group of equal numbers among ui , · · · , Ur and v1 , • • · , v,
respectively. From (1) we have
(4)
and
(5)
E(U)
=
½
nm - E(S)=
½
nmvar (U)
=
¼
var (S).The group Vi , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)
takes the values n and m and we have
L
s(s - 1) (2s + 5)=
n(n - 1) (2n + 5) + m(m - 1) (2m + 5),•
E
s(s - 1) (s - 2)=
n(n - 1) (n - 2) + m(m - 1) (m - 2),•
E
s(s - 1)=
n(n - 1)+
m(m - 1) .•
Substituting in (3) and (5), we obtain after some reduction
var (U)
=
,\nm(n + m + 1) --h
E
t(t - 1)(2t +5) (6) I + n(n - 1) (n - 2) + m(m - 1) (m - 2)E
t(t _ l) (t _ 2) 36(n + m)(n + m - l)(n + m - 2) 1+
n(n - 1)+
m(m - 1)E
t(t _ l) 8(m+
n)(m+
n - 1) t 'where
E,
takes place over the ties among the values Xi , • • • , Xn, Yi, • • • , y,,.,taken together. ·
When no ties are present this reduces to results of Mann and Whitney [2]:
(7) E(U)
= ½nm;var
(U)=
bnm (n + m + 1).From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first _
NOTE ON WILCOXON'S TWO-SAMPLE TEST 135
step towards the possibility of using Wilcoxon's test for samples from any
population.
REFERENCES
[l] F. WiLCoxoN, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1 (1945), pp. 80-83.
[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947), pp. 50-00.
[3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938), pp. 81-93.
Reprinted from THE ANNALS OF MATHEMATICAL STATISTICS
Vol. 23, No. 1, March, 1952 Printed in U.S.A.
NOTES
NOTE ON WILCOXON'S TWO-SAMPLE TEST WHEN TIES ARE PRESENT
BY J. HEMELRIJK Mathematical Centre, Amsterdam
Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H.B. Mann and D. R. Whitny [2]) depends on a statistic U with the following definition: If x1, · · · , x,. and Y1, · · • , Ym are the two samples, U is the number of pairs
(i, j) with x,
>
Yi. The probability distribution of U, under the hypothesisthat the samples have been drawn independently from the same continuous
pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.
It is noteworthy that Wilcoxon's U is closely connected with the quantity S,
which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation. When r pairs of numbers (u1c, v1c) are given, Sis computed by scoring:
-1, if (ui. - u1c) (vi. - v1c)
< 0,
0, if (uh - U1c) (vi. - V1c)=
0,+1,
if (ui. - u1c) (vi. - vk)>
0,and adding the scores for all pairs (h, le) with h
<
le. If, in this definition, we take r=
n+
m and substitute the values X1 , · · · , Xn , Y1 , · · · , Ym in this order for U1 , · · • , Un , Un+1 , · · · , u, , and 0 or 1 respectively for V1c if Uk=
x;for some i or Uk
=
Yi for some j respectively, then the following relation holds:(1) 2U
+
S=
nm.The simplest way to Se(\ this is by considering the total score of 2U
+
S forevery pair (h, le). This score is equal to
+1
if Vh = 0 and v1c = l, and 0 other-wise. The sum of the scores is therefore nm.Relation (1) holds if no ties are present among the two samples x1 , • • •,
x,. and Y1, · · · , Ym. It is natural to define U in general by extending (1) to
the case when there are ties. Since for a pair (xi, Yi) with x;
=
Yi the score of Sis equal to zero, the score for U must be taken as½
for such a pair.Now Kendall has derived the mean and the standard deviation of S under
the hypothesis that for a given order of the quantities v1 , • • • , v, all the r !
possible permutations of U1 , • • • , u, are equally probable. This condition is
fulfilled in our case if the samples X1, • • • , Xn and Yi, · · · , Ym have been drawn
at random from the same population (which need not be continuous anymore).
Therefore, the mean and standard deviation of U under the null hypothesis
may be derived from Kendall's formulas.
134 J. HEMELRIJK
According to Kendall ([4], pp. 56 and 60), we have (2) and E(S)
=
0 var (S)=
-ls
{r(r - 1)(2r + 5) -L
t(t - 1)(2t + 5) t " 1 . " (3) -L:
s(s - 1)(2s + 5}} + gr(r _ l)(r _ 2){'7'
t(t - l)(t - 2)}• {L
s(s - l)(s - 2)} + 2 ( 1 l) {L t(t - l)}{L s(s - 1)}, • r r - t •where summation
Lt
takes place over the various ties among u1 , • • • , u, , andLs
over the ties among v1 , • • • , Vr ; t and s respectively indicating the number ofelements in every group of equal numbers among u1 , • • • , Ur and v1 , • · • , v,
respectively. From {l) we have (4)
and
(5)
E(U) =
½
nm - E(S) =½
nmvar (U)
=
¼
var (S).The group V1 , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)
takes the values n and m and we have ·
I:
s(s - 1) (2s + 5)=
n(n - 1) (2n + 5) + m(m - 1) (2m + 5),•
L
s(s - 1) (s - 2)=
n(n - 1) (n - 2) + m(m - 1) (m - 2), •L
s(s - 1)=
n(n - 1)+
m(m - 1). 8Substituting in (3) and (5), we obtain after some reduction var (U)
=
Anm(n + m + 1) --h
L
t(t - 1)(2t +5)(6) t + n(n - l)(n - 2) + m(m - l)(m - 2)
L
t(t _ l)(t _ 2) 36(n + m)(n + m - l)(n + m - 2) t+
n(n - 1)+
m(m - 1)L
t(t _ l) 8(m+
n)(m+
n - 1) t 'where
Lt
takes place over the ties among the values x1 , • • • , Xn, y1 , • • • , y"',taken together.
When no ties are present this reduces to results of Mann and Whitney [2]:
(7)
E(U)=
½nm;var (U)=
A
nm (n + m + 1).From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first
NOTE ON WILCOXON'S TWO-SAMPLE TEST 135
step towards the possibility of using Wilcoxon's test for samples from any
population.
REFERENCES
[1] F. WrwoxoN, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1 (1945), pp. 80-83.
[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947),
pp. 50-60.
(3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938),
pp. 81-93.
~~
•
•
(i
...
9..-V
~
Reprinted from THE ANNALS OF MATHEMATICAL STATISTICSVol. 23, No. 1, March, 1952 Printed in U.S.A.
NOTES
NOTE ON WILCOXON'S TWO-SAMPLE TEST WHEN TIES ARE PRESENT
BY J. HEMELRIJK Mathematical Centre, Amsterdam
Wilcoxon's parameterfree two-sample test (cf. Wilcoxon [1]; H. B. Mann and
D. R. Whitny [2]) depends on a statistic U with the following definition: If Xi , · • • • , Xn and Yi , • • • , Ym are the two samples, U is the number of pairs (i, j) with x;
>
y1 • The probability distribution of U, under the hypothesis that the samples have been drawn independently from the same continuous pop-ulation, has been derived by Mann and Whitney. The influence of ties on this probability distribution has not been investigated as yet.It is noteworthy that Wilcoxon's U is closely connected with the quantity S,
which Kendall (cf. e.g. Kendall [3]) introduced in the theory of rank correlation.
When r pairs of numbers (uk, vk) are given, Sis computed by scoring:
-1, if (uh - Uk) (vh - vk)
<
0, 0, if (uh - Uk) (vh - vk)=
0, +1, if (uh - Uk) (vh - vk)>
0,and adding the scores for all pairs (h, k) with h
<
k. If, in this definition, we take r=
n+
m and substitute the values Xi , • • • , Xn , Yi , • · • , Ym in this order for ui , · · · , Un , Un+i , · · · , Ur , and 0 or 1 respectively for Vk if Uk=
X;for some i or Uk
=
Y; for some j respectively, then the following relation holds:(1) 2U
+
S=
nm.The simplest way to see this is by considering the total score of 2U
+
S for every pair (h, k). This score is equal to+
1 if vh=
0 and vk=
I, and 0 other-wise. The sum of the scores is therefore nm.Relation (I) holds if no ties are present among the two samples Xi, • • •, Xn and Yi, • • • , Ym. It is natural to define U in general by extending (1) to
the case when there are ties. Since for a pair (x;, y;) with x;
=
y1 the score ofSis equal to zero, the score for U must be taken as½ for such a pair.
Now Kendall has derived the mean and the standard deviation of S under
the hypothesis that for a given order of the quantities Vi , • • • , Vr all the r !
possible permutations of ui , · · · , Ur are equally probable. This condition is fulfilled in our case if the samples Xi, • • • , Xn and Yi, • • • , Ym have been drawn at random from the same population (which need not be continuous anymore). Therefore, the mean and standard deviation of U under the null hypothesis may be derived from Kendall's formulas.
133
!iliL}OlHEEK MATHEMATISCH CENi'RUM
AMSTERDAM
134 J. HEMELRIJK
According to Kendall ([4], pp. 56 and 60), we have
(2) and E(S)
=
0 var (S)=
-h
{r(r - 1)(2r+
5) -L
t(t - 1)(2t+
5) t (3) -~
s(s - 1)(2s+
5))+
gr(r _ 1\(r _ 2) {~ t(t - l)(t - 2)) • {~ s(s - l)(s - 2))+
2r(r~
l) {~ t(t - 1)) {~ s(s - l)l, where summationLt
takes place over the various ties among u1 , • • • , Ur, andL,
over the ties among v1 , • • • , Vr ; t and s respectively indicating the number of elements in every group of equal numbers among u1 , • • • , Ur and v1 , · • • , v,respectively. From (1) we have
(4) E(U)
=
½
nm - E(S)=
½
nmand
(5) var (U)
=
¼
var (S).The group v1 , • • • , Vr consists of n numbers O and m numbers 1; thus s in (3)
takes the values n and m and we have
I:
s(s - 1) (2s + 5)=
n(n - 1) (2n + 5) + m(m - 1) (2m + 5),•
L
s(s - 1) (s - 2)=
n(n - 1) (n - 2)+
m(m - 1) (m - 2),•
L
s(s - 1)=
n(n - 1)+
m(m - 1) .•
Substituting in (3) and (5), we obtain after some reduction var (U)
=
-,hnm(n + m + 1) --h
L
t(t - 1)(2t +5) (6) t+
n(n - l)(n - 2)+
m(m - l)(m - 2)L
t(t _ l)(t _ 2) 36(n + m)(n + m - l)(n + m - 2) t+
n(n - 1)+
m(m - 1)L
t(t _ l) S(m+
n)(m+
n - 1) t 'where
Lt
takes place over the ties among the values X1, · · · , Xn, Yi,···, y,,,,taken together.
When no ties are present this reduces to results of Mann and Whitney [2]:
(7) E(U)
=
½
nm; var (U)=
-h
nm (n+
m+
1).From (6) and (7) it is easy to prove (e.g., by induction) that var (U) is decreased by the presence of ties among the observations. These results constitute a first
NOTE ON WILCOXON'S TWO-SAMPLE TEST 135
step towards the possibility of using Wilcoxon's test for samples from any
population.
REFERENCES
[l] F. WILCOXON, "Individual comparisons by ranking methods," Biometrics Bull., Vol. 1
(1945), pp. 8o-83.
[2] H.B. MANN AND D.R. WHITNEY, "On a test of whether one of two random variables is stochastically larger than the other," Annals of Math. Stat., Vol. 18 (1947),
pp. 50-60.
[3] M. G. KENDALL, "A new measure of rank correlation," Biometrika, Vol. 30 (1938),
pp. 81-93.