A Recursive Algorithm for Null Distributions for Outliers: I. Gamma Samples

T. Lewis
Department of Mathematical Statistics
University of Hull, England

N. R. J. Fieller
Department of Probability and Statistics
University of Sheffield, England

A recursive method is described, which may be used to obtain certain distributional results relating to order statistics. The procedure is applied to gamma samples to obtain null distributions of various statistics appropriate for the testing of single and multiple outliers, and some useful inequalities for tail probabilities.
KEY WORDS: Outliers; Order statistics; Dixon's statistics; Recursive algorithm; Gamma samples
1. INTRODUCTION
Problems of analysis of outliers in samples from exponential parent populations have been studied by several authors; see e.g. Epstein [3], Laurent [6], Likeš [7], and Kabe [5]. As regards testing of outliers, some of this work has centred on statistics similar to those of Dixon [2] for testing outliers from a normal parent population. Other work has centred on statistics of the form $T_{(j)} = x_{(j)}/(x_1 + \cdots + x_n)$, where $x_{(j)}$ denotes the $j$th smallest observation in the random sample $x_1, \ldots, x_n$. The latter may be viewed as likelihood-ratio criteria, because the statistic is a monotonic function of the increase in the maximized log-likelihood on omission of the queried observation.
We consider criteria of both kinds and show how their null distributions can be obtained by means of a recursive algorithm. We start by considering the statistic $T_{(n)}$ for an exponential sample. The result is known (Fisher [4]), but it illustrates the procedure and leads naturally to the structure of the distribution noted by Fisher. The method readily extends to the more general case of a gamma population considered by Cochran [1]. We apply the technique to give some new results, and illustrate our procedure by numerical examples.
Received August 1975; revised January 1978
2. PRELIMINARIES
Let $x_1, \ldots, x_n$ be a sample from a gamma population
$$g(x, \lambda) = \{\lambda^r/\Gamma(r)\}\, x^{r-1} \exp(-\lambda x) \qquad (x > 0) \tag{1}$$
where $r$ is known but $\lambda$ is unknown. Denote the ordered sample by $x_{(1)}, \ldots, x_{(n)}$ $(x_{(1)} < \cdots < x_{(n)})$.
The criteria developed by Dixon [2] are of the form
$$Y = \{x_{(s)} - x_{(r)}\}/\{x_{(q)} - x_{(p)}\} \qquad (1 \le p \le r < s \le q \le n,\; q - p > s - r); \tag{2}$$
e.g. for testing the largest observation as an outlier we choose $s = q = n$, $r = n - 1$, $p = 1$.
For the likelihood-based criteria, let $S_n = \sum_{i=1}^{n} x_i$ and $T_{n,j} = x_j/S_n$. Also write
$$T_{n,(j)} = x_{(j)}/S_n, \tag{3}$$
and in particular
$$T_{n,(n)} = \max_{j=1}^{n} T_{n,j} = x_{(n)}/S_n; \tag{4}$$
these will be abbreviated to $T_j$, $T_{(j)}$ when the sample size is obviously $n$. For any $j$, $T_{n,j}$ follows a beta distribution with parameters $r$ and $r(n-1)$, i.e. its density function is $\beta_{r,r(n-1)}(\cdot)$ where
$$\beta_{a,b}(u) = \{\Gamma(a+b)/\Gamma(a)\Gamma(b)\}\, u^{a-1}(1-u)^{b-1} \qquad (0 < u < 1). \tag{5}$$
(Here, and throughout the paper, a density function whose value is specified only over certain regions is understood to be otherwise zero.)
The density functions and distribution functions of various statistics we now discuss will be denoted by $a_n(\cdot)$, $A_n(\cdot)$; $b_n(\cdot)$, $B_n(\cdot)$; $\ldots$ as they arise. The subscript $n$ refers to sample size.
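As a concrete illustration of the notation, the following Python sketch computes the ratios $T_{(j)}$ of (3) and a Dixon-type criterion of the form (2) from a sample. The function names and the data are hypothetical and serve only to exercise the definitions; note that the argument `r` of `dixon_ratio` is the rank index appearing in (2), not the gamma shape parameter.

```python
def ordered_ratios(sample):
    """Return [T_(1), ..., T_(n)], where T_(j) = x_(j) / S_n as in (3)."""
    total = sum(sample)
    return [x / total for x in sorted(sample)]


def dixon_ratio(sample, p, r, s, q):
    """Dixon-type criterion (2): (x_(s) - x_(r)) / (x_(q) - x_(p)), 1-based ranks."""
    n = len(sample)
    if not (1 <= p <= r < s <= q <= n and q - p > s - r):
        raise ValueError("need 1 <= p <= r < s <= q <= n and q - p > s - r")
    x = sorted(sample)
    return (x[s - 1] - x[r - 1]) / (x[q - 1] - x[p - 1])


# Hypothetical data, chosen only for illustration.
data = [2.0, 7.0, 3.0, 20.0, 8.0]
ratios = ordered_ratios(data)                 # ratios[-1] is T_(n)
n = len(data)
largest_test = dixon_ratio(data, p=1, r=n - 1, s=n, q=n)
```

With `s = q = n`, `r = n - 1`, `p = 1` the second function gives the criterion quoted above for testing the largest observation.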
3. A RECURSIVE ALGORITHM FOR $T_{(n)}$
Suppose $T_{(n)}$ has density function $a_n(\cdot)$ and distribution function $A_n(\cdot)$. Then
$$\begin{aligned} a_n(u)\,du &= P(T_{(n)} \in (u, u + du)) \\ &= \sum_{j=1}^{n} P\big(T_{n,j} \in (u, u + du),\; T_{n,(n)} = T_{n,j}\big) \\ &= n\,P\Big(T_{n,n} \in (u, u + du),\; \max_{k=1}^{n-1} x_k/(S_n - x_n) < x_n/(S_n - x_n)\Big). \end{aligned} \tag{6}$$
Now
$$T_{n,n} = x_n/S_n = \frac{x_n/(S_n - x_n)}{1 + x_n/(S_n - x_n)}, \tag{7}$$
and $x_n/(S_n - x_n)$ is independent of $x_k/(S_n - x_n)$ for each $k = 1, \ldots, n - 1$ (since the $x_k$ follow a gamma distribution). Hence $T_{n,n}$ and $\max_{k=1}^{n-1} x_k/(S_n - x_n)$ are independent, and so (6) gives
$$a_n(u)\,du = n\,P\big(T_{n,n} \in (u, u + du)\big)\; P\big(T_{n-1,(n-1)} < u/(1 - u)\big). \tag{8}$$
We thus have the recurrence relation
$$a_n(u) = n\,\beta_{r,r(n-1)}(u)\, A_{n-1}\{u/(1 - u)\} \qquad (1/n \le u \le 1). \tag{9}$$
Consider first the particular case of an exponential parent population ($r = 1$). For $1/2 \le u \le 1$ we have $u/(1 - u) \ge 1$, $A_{n-1}\{u/(1 - u)\} = 1$, and hence from (9)
$$a_n(u) = n\,\beta_{1,n-1}(u) = n(n - 1)(1 - u)^{n-2}, \tag{10}$$
$$A_n(u) = 1 - n(1 - u)^{n-1}. \tag{11}$$
For $1/3 \le u \le 1/2$ we have $1/2 \le u/(1 - u) \le 1$; hence from (11) $A_{n-1}\{u/(1 - u)\} = 1 - (n - 1)\{(1 - 2u)/(1 - u)\}^{n-2}$, and so from (9)
$$a_n(u) = n(n - 1)(1 - u)^{n-2} - n(n - 1)^2 (1 - 2u)^{n-2}, \tag{12}$$
$$A_n(u) = 1 - n(1 - u)^{n-1} + \binom{n}{2}(1 - 2u)^{n-1}; \tag{13}$$
and in general for $1/(q + 1) \le u \le 1/q$, $q = [1/u]$,
$$A_n(u) = 1 - n(1 - u)^{n-1} + \binom{n}{2}(1 - 2u)^{n-1} - \cdots + (-1)^q \binom{n}{q}(1 - qu)^{n-1}, \tag{14}$$
which is the well-known result of Fisher [4].
For general $r$, the same procedure can be applied, using the recursive formula (9) to evaluate the functions $a_n(u)$, $A_n(u)$ in successive intervals $1/(s + 1) \le u \le 1/s$ $(s = 1, \ldots, n - 1)$. In particular, since $A_{n-1}\{u/(1 - u)\} = 1$ when $u \ge 1/2$ and $< 1$ when $u < 1/2$, the upper tail probability $P(T_{(n)} > u)$, which can be used as a significance probability for testing an upper outlier, satisfies
$$P(T_{(n)} > u) = 1 - A_n(u) = n \int_u^1 \beta_{r,r(n-1)}(t)\,dt \qquad (u \ge 1/2), \tag{15a}$$
$$P(T_{(n)} > u) < n \int_u^1 \beta_{r,r(n-1)}(t)\,dt \qquad (u < 1/2). \tag{15b}$$
This may conveniently be written (upon making the usual transformation taking the beta density into the F-density) as
$$P(T_{(n)} > u) = n\,P\big(F_{2r,2r(n-1)} > (n - 1)u/(1 - u)\big) \qquad (u \ge 1/2), \tag{16a}$$
$$P(T_{(n)} > u) < n\,P\big(F_{2r,2r(n-1)} > (n - 1)u/(1 - u)\big) \qquad (1/n < u < 1/2), \tag{16b}$$
where $F_{2r,2r(n-1)}$ denotes the usual variance-ratio statistic on $2r$, $2r(n - 1)$ degrees of freedom.
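For the exponential case the results of this section are easy to evaluate directly. The sketch below, with hypothetical function names, codes Fisher's formula (14) and the right-hand side of (15) for $r = 1$, where the beta integral reduces to $n(1-u)^{n-1}$; it is an illustration of the formulas, not a general-$r$ implementation.

```python
from math import comb


def fisher_cdf(n, u):
    """A_n(u) = P(T_(n) <= u) for an exponential sample, equation (14)."""
    if u < 1.0 / n:
        return 0.0          # T_(n) cannot fall below 1/n
    if u >= 1.0:
        return 1.0
    q = int(1.0 / u)        # q = [1/u], the number of terms after the leading 1
    return sum((-1) ** j * comb(n, j) * (1.0 - j * u) ** (n - 1)
               for j in range(q + 1))


def upper_tail_bound(n, u):
    """n * integral_u^1 beta_{1,n-1}(t) dt = n (1 - u)^(n - 1); may exceed 1."""
    return n * (1.0 - u) ** (n - 1)


exact_tail = 1.0 - fisher_cdf(5, 0.6)     # equality (15a) applies, since u >= 1/2
bound_tail = upper_tail_bound(5, 0.6)
```

For $u \ge 1/2$ the two quantities coincide, as (15a) states; for $u < 1/2$ the second is only an upper bound and becomes uninformative once it exceeds 1.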
4. FURTHER RESULTS
4.1 Null Distribution of $T_{(1)}$
We first apply the method of Section 3 to find the distribution of $T_{(1)}$, a suitable statistic for testing the smallest observation in a gamma sample of size $n$. If $T_{(1)}$ has density function $b_n(\cdot)$ and distribution function $B_n(\cdot)$, an argument similar to that of Section 3 gives the recursive relation
$$b_n(u) = n\,\beta_{r,r(n-1)}(u)\,\big(1 - B_{n-1}\{u/(1 - u)\}\big) \qquad (0 \le u \le 1/n). \tag{17}$$
For $n = 2$, $B_1\{u/(1 - u)\} = 0$ or $1$ according as $u/(1 - u) < 1$ or $> 1$, hence
$$b_2(u) = 2\beta_{r,r}(u) = 2\{\Gamma(2r)/(\Gamma(r))^2\}\, u^{r-1}(1 - u)^{r-1} \qquad (0 \le u \le 1/2). \tag{18}$$
Proceeding recursively we find, in the case of an exponential population ($r = 1$),
$$b_2(u) = 2 \qquad (0 \le u \le 1/2), \tag{19}$$
$$b_n(u) = n(n - 1)(1 - nu)^{n-2} \qquad (0 \le u \le 1/n). \tag{20}$$
In the case $r = 2$, (18) gives
$$b_2(u) = 12u(1 - u) \qquad (0 \le u \le 1/2), \tag{21a}$$
whence recursively
$$b_3(u) = 60u(1 - 3u)(1 - 3u^2) \qquad (0 \le u \le 1/3), \tag{21b}$$
$$b_4(u) = 168u(1 - 4u)^2(1 + 3u - 12u^2 - 4u^3) \qquad (0 \le u \le 1/4), \tag{21c}$$
$$b_5(u) = 360u(1 - 5u)^3(1 + 8u - 18u^2 - 80u^3 + 65u^4) \qquad (0 \le u \le 1/5), \tag{21d}$$
and so on. As a general inequality, analogous to (16), for the lower tail probability appropriate for testing a lower outlier, we have
$$P(T_{(1)} < u) \le n\,P\big(F_{2r,2r(n-1)} < (n - 1)u/(1 - u)\big) \qquad (u < 1/n). \tag{22}$$
4.2 Null Distribution of $T_{(1)}$, $T_{(n)}$
Suppose $T_{(1)}$, $T_{(n)}$ in a gamma sample have joint density function $c_n(\cdot\,;\cdot)$ and distribution function $C_n(\cdot\,;\cdot)$; $c_n(u, v)$ is clearly zero outside the region $R_c$ defined by $u > 0$, $(n - 1)u + v < 1$, $u + (n - 1)v > 1$.
Now the joint density of $T_{n,i}$ and $T_{n,j}$, for any $i$ and $j$, is
$$\{\Gamma(rn)/[(\Gamma(r))^2\,\Gamma(r(n - 2))]\}\; u^{r-1} v^{r-1} (1 - u - v)^{r(n-2)-1} \tag{23}$$
(see, for example, Wilks [8], section 7.7). Thus, on applying the method of Section 3, we find, for $(u, v) \in R_c$ and $n \ge 3$,
$$c_n(u, v) = n(n - 1)\{\Gamma(rn)/[(\Gamma(r))^2\,\Gamma(r(n - 2))]\}\; u^{r-1} v^{r-1} (1 - u - v)^{r(n-2)-1}\; \Psi \tag{24}$$
where
$$\Psi = P\big(u/(1 - u - v) < T_{n-2,(1)} < T_{n-2,(n-2)} < v/(1 - u - v)\big). \tag{25}$$
For $n \ge 5$, (25) can be written
$$\Psi = A_{n-2}\{v/(1 - u - v)\} - C_{n-2}\{u/(1 - u - v),\; v/(1 - u - v)\}. \tag{26}$$
When $n = 3$, the probability $\Psi$ in (25) is $1$ if $u/(1 - u - v) < 1$ and $v/(1 - u - v) > 1$, and is $0$ otherwise; hence
$$c_3(u, v) = 3!\,\{\Gamma(3r)/(\Gamma(r))^3\}\; u^{r-1} v^{r-1} (1 - u - v)^{r-1} \qquad (2u + v < 1,\; u + 2v > 1,\; u > 0). \tag{27}$$
When $n = 4$, $\Psi$ is equal to
$$P\big(T_{2,(1)} > \max\{1 - v/(1 - u - v),\; u/(1 - u - v)\}\big) = 2\{\Gamma(2r)/(\Gamma(r))^2\} \int_w^{1/2} t^{r-1}(1 - t)^{r-1}\,dt \tag{28}$$
where
$$w = \begin{cases} (1 - u - 2v)/(1 - u - v) & \text{for } 2u + 2v < 1,\; (u, v) \in R_c, \\ u/(1 - u - v) & \text{for } 2u + 2v > 1,\; (u, v) \in R_c. \end{cases}$$
Evaluation of the integral in (28) and application of (24) with $n = 4$ gives
$$c_4(u, v) = 4!\,\{\Gamma(4r)/[(\Gamma(r))^4\, 2^{2r-1}]\}\; u^{r-1} v^{r-1} z \times \sum_{j=0}^{r-1} \frac{(-1)^j}{2j + 1} \binom{r-1}{j} (1 - u - v)^{2(r-1-j)} z^{2j} \tag{29}$$
where
$$z = \begin{cases} u + 3v - 1 & \text{for } 2u + 2v < 1,\; u + 3v > 1,\; u > 0, \\ 1 - 3u - v & \text{for } 2u + 2v > 1,\; 3u + v < 1,\; u > 0. \end{cases}$$
Particularizing to the exponential case ($r = 1$), this gives
$$c_3(u, v) = 1 \cdot 2^2 \cdot 3 \qquad (u + 2v > 1,\; 2u + v < 1,\; u > 0), \tag{30a}$$
$$c_4(u, v) = \begin{cases} 2 \cdot 3^2 \cdot 4\,(u + 3v - 1) & (u + 3v > 1,\; 2u + 2v < 1,\; u > 0), \\ 2 \cdot 3^2 \cdot 4\,\{(u + 3v - 1) - 2(2u + 2v - 1)\} & (2u + 2v > 1,\; 3u + v < 1,\; u > 0), \end{cases} \tag{30b}$$
$$c_5(u, v) = \begin{cases} 3 \cdot 4^2 \cdot 5\,(u + 4v - 1)^2 & (u + 4v > 1,\; 2u + 3v < 1,\; u > 0), \\ 3 \cdot 4^2 \cdot 5\,\{(u + 4v - 1)^2 - 3(2u + 3v - 1)^2\} & (2u + 3v > 1,\; 3u + 2v < 1,\; u > 0), \\ 3 \cdot 4^2 \cdot 5\,\{(u + 4v - 1)^2 - 3(2u + 3v - 1)^2 + 3(3u + 2v - 1)^2\} & (3u + 2v > 1,\; 4u + v < 1,\; u > 0), \end{cases} \tag{30c}$$
and so on. The general form of the distribution is clear.
Again, particularizing to the case $r = 2$ we get
$$c_3(u, v) = 720\,uv(1 - u - v) \qquad (u + 2v > 1,\; 2u + v < 1,\; u > 0), \tag{31a}$$
$$c_4(u, v) = \begin{cases} 10080\,uv(u + 3v - 1)\{(1 - u)^2 - 3v^2\} & (u + 3v > 1,\; 2u + 2v < 1,\; u > 0), \\ 10080\,uv(1 - 3u - v)\{(1 - v)^2 - 3u^2\} & (2u + 2v > 1,\; 3u + v < 1,\; u > 0), \end{cases} \tag{31b}$$
whence $c_n(u, v)$ can be obtained recursively for further values of $n$.
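The general expression (29) can be checked against its special cases numerically. The following sketch assumes an integer shape parameter $r$ (needed for the finite binomial sum) and uses a hypothetical function name; it evaluates (29) and can be compared with the closed forms (30b) and (31b) at any point of the region.

```python
from math import comb, factorial


def c4_density(u, v, r):
    """Joint density (29) of (T_(1), T_(4)) for integer r; zero outside R_c."""
    if u <= 0 or 3 * u + v >= 1 or u + 3 * v <= 1:
        return 0.0
    # z switches form across the line 2u + 2v = 1, as stated below (29).
    z = u + 3 * v - 1 if 2 * u + 2 * v < 1 else 1 - 3 * u - v
    const = (factorial(4) * factorial(4 * r - 1)
             / (factorial(r - 1) ** 4 * 2 ** (2 * r - 1)))
    series = sum((-1) ** j / (2 * j + 1) * comb(r - 1, j)
                 * (1 - u - v) ** (2 * (r - 1 - j)) * z ** (2 * j)
                 for j in range(r))
    return const * (u * v) ** (r - 1) * z * series


# Compare with (30b), first branch, at one interior point.
u, v = 0.1, 0.35
from_29 = c4_density(u, v, 1)
from_30b = 2 * 3 ** 2 * 4 * (u + 3 * v - 1)
```

The same comparison with `r = 2` reproduces both branches of (31b), which is a useful guard against sign slips in the alternating sum.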
4.3 Null Distribution of $T_{(n)} - T_{(1)}$
From the joint density function $c_n(\cdot\,;\cdot)$ of $T_{(1)}$, $T_{(n)}$, expressions can be obtained for the density function, $d_n(\cdot)$ say, of $T_{(n)} - T_{(1)}$, a statistic which may be used for testing simultaneously the largest and smallest observations as outlying in a gamma sample of size $n$. Naturally it only makes sense to test for two outliers when $n$ is 5 or more. In the exponential case, $r = 1$, we find successively, for purposes of the recurrence,
$$d_3(u) = \begin{cases} 4u & (0 \le u \le 1/2), \\ 4\{u - (2u - 1)\} & (1/2 \le u \le 1), \end{cases} \tag{32a}$$
$$d_4(u) = \begin{cases} 9\{2u^2\} & (0 \le u \le 1/3), \\ 9\{2u^2 - (3u - 1)^2\} & (1/3 \le u \le 1/2), \\ 9\{2u^2 - (3u - 1)^2 + 2(2u - 1)^2\} & (1/2 \le u \le 1), \end{cases} \tag{32b}$$
$$d_5(u) = \begin{cases} 16\{6u^3\} & (0 \le u \le 1/4), \\ 16\{6u^3 - (4u - 1)^3\} & (1/4 \le u \le 1/3), \\ 16\{6u^3 - (4u - 1)^3 + 3(3u - 1)^3\} & (1/3 \le u \le 1/2), \\ 16\{6u^3 - (4u - 1)^3 + 3(3u - 1)^3 - 3(2u - 1)^3\} & (1/2 \le u \le 1); \end{cases} \tag{32c}$$
the form of $d_n(u)$ for general sample size $n$ is clear. A corresponding calculation in the case $r = 2$ starts with the results
$$d_3(u) = \begin{cases} (40/9)\,u(2 - 5u^2) & (0 \le u \le 1/2), \\ (40/9)(1 - u)^3(1 + 4u) & (1/2 \le u \le 1), \end{cases} \tag{33a}$$
$$d_4(u) = \begin{cases} (21/32)\,u^2(90 - 450u^2 + 182u^4) & (0 \le u \le 1/3), \\ (21/32)\{(1 - u)^5(13 + 101u) - 2(1 - 2u)^4(13 + 104u + 100u^2)\} & (1/3 \le u \le 1/2), \\ (21/32)(1 - u)^5(13 + 101u) & (1/2 \le u \le 1); \end{cases} \tag{33b}$$
further details are not given here.
4.4 Two Upper Outliers
Suppose $g_n(\cdot\,;\cdot)$ is the joint density function of $T_{(n)}$ and $T_{(n-1)}$ in a gamma sample; $g_n(u, v)$ will be zero outside the region $R_g$ defined by $u + v < 1$, $u + (n - 1)v > 1$, $u > v$. For $(u, v) \in R_g$, the procedure of Section 3 gives the recurrence relation
$$\begin{aligned} g_n(u, v) &= n(n - 1)\{\Gamma(rn)/[(\Gamma(r))^2\,\Gamma(r(n - 2))]\}\; u^{r-1} v^{r-1} (1 - u - v)^{r(n-2)-1}\, A_{n-2}\{v/(1 - u - v)\} \\ &= n\{\Gamma(rn)/[\Gamma(r)\,\Gamma(r(n - 1))]\}\; u^{r-1} (1 - u)^{r(n-1)-2}\, a_{n-1}\{v/(1 - u)\}. \end{aligned} \tag{34}$$
In the exponential case ($r = 1$) this becomes, from (14),
$$g_n(u, v) = n(n - 1)^2(n - 2)\Big\{(1 - u - v)^{n-3} - \binom{n-2}{1}(1 - u - 2v)^{n-3} + \binom{n-2}{2}(1 - u - 3v)^{n-3} - \cdots + (-1)^{q-1}\binom{n-2}{q-1}(1 - u - qv)^{n-3}\Big\} \tag{35}$$
where $q = [(1 - u)/v]$. The generalization to three or more upper outliers in an exponential sample is obvious.
A useful statistic in the testing of a pair of upper outliers in a gamma sample is $T_{(n)} + T_{(n-1)} = Z_n$, say, whose density function $h_n(\cdot)$ can be found in principle from the above joint density. Even without the exact distribution of this statistic, we have by the method of Section 3 the following inequality:
$$h_n(u) \le \{n(n - 1)/2\}\,\beta_{2r,r(n-2)}(u), \tag{36}$$
whence
$$P(Z_n > u) \le \{n(n - 1)/2\}\; P\big(F_{4r,2r(n-2)} > ((n - 2)/2)(u/(1 - u))\big) \qquad (2/n \le u \le 1). \tag{37}$$
The generalization to three or more upper outliers is again clear.
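The two expressions for $g_n$ in the exponential case, the series (35) and the second form of (34) with $r = 1$, can be compared numerically. The sketch below uses hypothetical function names; `a_exp` is the density obtained by differentiating Fisher's formula (14) term by term, which is an elementary step not written out in the text.

```python
from math import comb


def a_exp(m, t):
    """Density a_m(t) of T_(m), exponential sample: derivative of (14)."""
    if t <= 1.0 / m or t >= 1.0:
        return 0.0
    q = int(1.0 / t)
    return m * (m - 1) * sum(
        (-1) ** (j - 1) * comb(m - 1, j - 1) * (1.0 - j * t) ** (m - 2)
        for j in range(1, q + 1))


def in_region(n, u, v):
    """Region R_g: u + v < 1, u + (n - 1) v > 1, u > v."""
    return v < u and u + v < 1.0 and u + (n - 1) * v > 1.0


def g_series(n, u, v):
    """Equation (35): joint density of (T_(n), T_(n-1)), exponential case."""
    if not in_region(n, u, v):
        return 0.0
    q = int((1.0 - u) / v)
    return n * (n - 1) ** 2 * (n - 2) * sum(
        (-1) ** (j - 1) * comb(n - 2, j - 1) * (1.0 - u - j * v) ** (n - 3)
        for j in range(1, q + 1))


def g_recursive(n, u, v):
    """Second form of (34) with r = 1: n (n-1) (1-u)^(n-3) a_{n-1}(v/(1-u))."""
    if not in_region(n, u, v):
        return 0.0
    return n * (n - 1) * (1.0 - u) ** (n - 3) * a_exp(n - 1, v / (1.0 - u))


value_series = g_series(5, 0.4, 0.25)
value_recursive = g_recursive(5, 0.4, 0.25)
```

Agreement of the two values at interior points of $R_g$ confirms that the number of terms $q = [(1 - u)/v]$ in (35) matches the interval structure inherited from (14).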
4.5 Two Lower Outliers
Suppose $j_n(\cdot\,;\cdot)$ is the joint density function of $T_{(1)}$ and $T_{(2)}$; $j_n(u, v)$ is clearly zero outside the region $R_j$ defined by
$$u + (n - 1)v < 1, \qquad 0 < u < v.$$
For $(u, v) \in R_j$ the procedure of Section 3 gives the recurrence relation
$$\begin{aligned} j_n(u, v) &= n(n - 1)\{\Gamma(rn)/[(\Gamma(r))^2\,\Gamma(r(n - 2))]\}\; u^{r-1} v^{r-1} (1 - u - v)^{r(n-2)-1}\, \big(1 - B_{n-2}\{v/(1 - u - v)\}\big) \\ &= n\{\Gamma(rn)/[\Gamma(r)\,\Gamma(r(n - 1))]\}\; u^{r-1} (1 - u)^{r(n-1)-2}\, b_{n-1}\{v/(1 - u)\}. \end{aligned} \tag{38}$$
In the exponential case ($r = 1$) this becomes, from (20),
$$j_n(u, v) = n(n - 1)^2(n - 2)\,\{1 - u - (n - 1)v\}^{n-3}. \tag{39}$$
In the general case of $m$ lower outliers the same method gives the joint density $j_n(u, v, \ldots, y, z)$ of $T_{(1)}, \ldots, T_{(m)}$, for $(u, v, \ldots, z) \in R_j$, the region defined by $u + v + \cdots + y + (n - m + 1)z < 1$, $0 < u < v < \cdots < z$, as
$$j_n(u, v, \ldots, y, z) = (n)_{m-1}\,\{\Gamma(rn)/[(\Gamma(r))^{m-1}\,\Gamma(r(n - m + 1))]\}\; u^{r-1} v^{r-1} \cdots y^{r-1}\, (1 - u - v - \cdots - y)^{r(n-m+1)-2}\; b_{n-m+1}\{z/(1 - u - v - \cdots - y)\}, \tag{40}$$
where $(n)_k = n!/(n - k)!$, which in the exponential case ($r = 1$) becomes
$$j_n(u, v, \ldots, y, z) = (n)_{m+1}\,(n - 1)_{m-1}\,\{1 - u - v - \cdots - y - (n - m + 1)z\}^{n-m-1}. \tag{41}$$
An appropriate statistic for testing the significance of $m$ lower outliers in a gamma sample is $T_{(1)} + T_{(2)} + \cdots + T_{(m)} = W_m$, say. The density function $k_n(\cdot)$ of $W_m$ can be found in principle by transforming the above density. In the case of two lower outliers in an exponential sample ($r = 1$, $m = 2$) this yields
$$k_n(t) = \begin{cases} \{n(n - 1)^2/(n - 2)\}\,\{(1 - \tfrac{1}{2}nt)^{n-2} - (1 - (n - 1)t)^{n-2}\} & (0 < t \le 1/(n - 1)), \\ \{n(n - 1)^2/(n - 2)\}\,(1 - \tfrac{1}{2}nt)^{n-2} & (1/(n - 1) < t < 2/n). \end{cases} \tag{42}$$
Even without the exact distribution of $W_m$ in the general case of $m$ outliers, the method of Section 3 yields the following inequality
$$k_n(u) \le \binom{n}{m}\,\beta_{rm,r(n-m)}(u), \tag{43}$$
whence
$$P(W_m < u) \le \binom{n}{m}\; P\big(F_{2r(n-m),2rm} > m(1 - u)/(u(n - m))\big). \tag{44}$$
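The densities (38) to (41) take $b_{n-1}$ or $b_{n-m+1}$ from the recursion (17) as input. For integer $r$ that recursion stays within polynomials on $0 \le u \le 1/n$, so it can be carried out exactly in rational arithmetic. The following is a minimal sketch under that integer-$r$ assumption, with hypothetical function names; it relies on the fact that $(1-u)^{r(n-1)-1}\,[1 - B_{n-1}\{u/(1-u)\}]$ is a polynomial in $u$.

```python
from fractions import Fraction
from math import factorial


def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists (ascending powers)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def one_minus_u_power(k):
    """Coefficients of (1 - u)^k."""
    out = [Fraction(1)]
    for _ in range(k):
        out = poly_mul(out, [1, -1])
    return out


def monomial(k):
    """Coefficients of u^k."""
    return [0] * k + [1]


def smallest_ratio_density(n, r):
    """Coefficients of b_n(u) on 0 <= u <= 1/n, integer r, via (17) and (18)."""
    # b_2(u) = 2 {Gamma(2r)/Gamma(r)^2} u^(r-1) (1-u)^(r-1), equation (18).
    const = Fraction(2 * factorial(2 * r - 1), factorial(r - 1) ** 2)
    b = [const * c for c in poly_mul(monomial(r - 1), one_minus_u_power(r - 1))]
    for m in range(3, n + 1):
        # B_{m-1}(t): integrate b_{m-1} term by term from 0 to t.
        cdf = [Fraction(0)] + [c / (k + 1) for k, c in enumerate(b)]
        deg = len(cdf) - 1   # equals r(m-1) - 1, the power of (1-u) in the beta density
        # (1-u)^deg [1 - B_{m-1}(u/(1-u))] expanded as a polynomial in u.
        bracket = one_minus_u_power(deg)
        for k, c in enumerate(cdf):
            term = poly_mul(monomial(k), one_minus_u_power(deg - k))
            bracket = [x - c * y for x, y in zip(bracket, term)]
        const = Fraction(m * factorial(r * m - 1),
                         factorial(r - 1) * factorial(r * (m - 1) - 1))
        b = [const * c for c in poly_mul(monomial(r - 1), bracket)]
    return b


def evaluate(coeffs, u):
    """Evaluate a coefficient list at u."""
    return sum(c * u ** k for k, c in enumerate(coeffs))


b3_r2 = smallest_ratio_density(3, 2)   # coefficients of 60u(1-3u)(1-3u^2)
```

For `r = 1` the routine returns the coefficients of $n(n-1)(1-nu)^{n-2}$, in agreement with (20), and for `r = 2` it reproduces the polynomials (21b) to (21d).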
4.6 A Dixon Statistic
We have shown that the recursive methods of Section 3 can be applied to likelihood-ratio statistics to useful effect, producing explicit results in some cases and bounds in others. The methods can also be applied to statistics of the Dixon type, though for the larger sample sizes the calculations become involved. We illustrate the principle by considering the statistic
$$y = \{x_{(2)} - x_{(1)}\}/\{x_{(n)} - x_{(1)}\} \tag{45}$$
which has been discussed in the particular context of an exponential sample by Likeš [7] and Kabe [5].
Consider first the joint distribution of $T_{(1)}$, $T_{(2)}$, $T_{(n)}$ for a gamma sample; write $l_n(\cdot\,;\cdot\,;\cdot)$ for its density function. Clearly $l_n(u, v, w)$ will be zero outside the region $R_l$ defined by
$$u + (n - 2)v + w < 1, \qquad u + v + (n - 2)w > 1, \qquad w > v > u > 0.$$
For $(u, v, w) \in R_l$ and $n \ge 4$, the method of Section 3 gives
$$l_n(u, v, w) = n\{\Gamma(rn)/[\Gamma(r)\,\Gamma(r(n - 1))]\}\; u^{r-1}(1 - u)^{r(n-1)-3}\; c_{n-1}\{v/(1 - u),\; w/(1 - u)\}. \tag{46}$$
Hence, putting $y = (v - u)/(w - u)$, the joint density function of $T_{(1)}$, $T_{(n)}$ and $y$ is
$$m_n(u, w, y) = n\{\Gamma(rn)/[\Gamma(r)\,\Gamma(r(n - 1))]\}\; u^{r-1}(1 - u)^{r(n-1)-3}\,(w - u)\; c_{n-1}\{((1 - y)u + yw)/(1 - u),\; w/(1 - u)\} \tag{47}$$
for $(u, w, y) \in R_m$ defined by
$$(n - 2)\{(1 - y)u + yw\} + w < 1 - u, \qquad \{(1 - y)u + yw\} + (n - 2)w > 1 - u, \qquad 0 < y < 1, \qquad u > 0.$$
From this we can find the distribution of $y$ using known results for $c_{n-1}$.
Consider for example the exponential case ($r = 1$) with $n = 4$. Substituting for $c_3$ from (30a) we get
$$m_4(u, w, y) = 144(w - u) \qquad (u > 0,\; (3 - 2y)u + (2y + 1)w < 1,\; (2 - y)u + (y + 2)w > 1). \tag{48}$$
Integrating out the variables $u$ and $w$, the density function of $y$ is found to be
$$6\{1/(2y + 1)^2 - 1/(y + 2)^2\} \qquad (0 < y < 1) \tag{49}$$
and its distribution function
$$9y/[(2y + 1)(y + 2)], \tag{50}$$
in agreement with the percentage points of $y$ given in Table 2 of Likeš [7] and with the series given in equation (8) of Kabe [5] (also with the equivalent series in equation (13) of Kabe's paper, but note that the factor $(q - p - i)!$ in the denominator of the first double sum in Kabe's equation (8) has been accidentally omitted from his equation (13)).
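Because the distribution function (50) is a ratio of low-degree polynomials, percentage points of $y$ for an exponential sample of size 4 follow from a quadratic: setting $9y = p(2y + 1)(y + 2)$ gives $2py^2 - (9 - 5p)y + 2p = 0$, whose two roots multiply to 1, so the smaller one lies in $(0, 1]$. The sketch below, with hypothetical function names, codes this inversion; it is offered as an illustration and has not been compared entry by entry with the published tables.

```python
from math import sqrt


def dixon_cdf_n4(y):
    """Distribution function (50) of y, exponential sample of size 4."""
    if y <= 0.0:
        return 0.0
    if y >= 1.0:
        return 1.0
    return 9.0 * y / ((2.0 * y + 1.0) * (y + 2.0))


def dixon_quantile_n4(p):
    """Root in (0, 1] of 2p y^2 - (9 - 5p) y + 2p = 0, for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    b = 9.0 - 5.0 * p
    # Discriminant b^2 - 16 p^2 = 9 (1 - p)(9 - p) is non-negative on (0, 1].
    return (b - sqrt(b * b - 16.0 * p * p)) / (4.0 * p)


upper_5_percent_point = dixon_quantile_n4(0.95)
```

A round trip through `dixon_cdf_n4` recovers the chosen probability, which is a quick check that the correct root has been selected.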
As an example of a non-exponential gamma situation, not considered by these authors, take the case $r = 2$, $n = 4$. From (47) and (31a), we get
$$m_4(u, w, y) = 120960\; uw(w - u)\{(1 - y)u + yw\}\{1 - (2 - y)u - (1 + y)w\} \qquad (u > 0,\; (3 - 2y)u + (2y + 1)w < 1,\; (2 - y)u + (y + 2)w > 1). \tag{51}$$
Integration with respect to $u$ and $w$ gives the density function of $y$ as
$$\tfrac{3}{8}\{11(2y + 1)^{-2} + 44(2y + 1)^{-3} - 69(2y + 1)^{-4} + 24(2y + 1)^{-5} - 10(y + 2)^{-2} - 44(y + 2)^{-3} - 12(y + 2)^{-4} + 192(y + 2)^{-5}\} \qquad (0 < y < 1). \tag{52}$$

TABLE 1. Frequencies of excess cycle times in steel manufacture.

Excess cycle time x:    1   2   3   4   5   6   7   8   9  10
Frequency:             18  12  18  16  10   4   9   9   2   7

Excess cycle time x:   11  12  13  14  15  21  32  35  92  97
Frequency:              6   7   2   1   3   3   2   1   1   1
5. ILLUSTRATIVE EXAMPLES
5.1 Example 1*
Consider example (6) given by Epstein [3], p. 171. Here the failure times of 10 items are observed, totalling $\sum x_i = 600$ units. The failure times of the first two items to fail are the shortest of the 10, and total 24 units, so we can write $x_{(1)} + x_{(2)} = 24$. Under the assumption that failure times under given conditions are exponentially distributed, Epstein tests whether the first two items to fail can be regarded as having failed abnormally early; making use of a straightforward F-test, he concludes in favour of this hypothesis. Suppose however that the 10 items were placed on test at different starting times and that the two shortest failure times occurred, not necessarily first, but randomly in chronological sequence, so that there was no a priori reason to consider these two items as different from the rest. How strong would the evidence then be for regarding them as inconsistent with the rest in view of their failure times? In the notation of Section 4.5 we have $T_{(1)} + T_{(2)} = W_m$, whose value, with $m = 2$, is
$$w = 24/600 = 0.04,$$
lying in the range $0 < w < 1/9$.
From (42)
$$P(W_2 < 0.04) = \int_0^{0.04} \frac{10 \cdot 9^2}{8}\,\{(1 - 5u)^8 - (1 - 9u)^8\}\,du \approx 0.72.$$
This is the significance probability attaching to the observed ratio $w = 0.04$, i.e. there is no real evidence for regarding the figure of 24 units as abnormally low, a contrary conclusion to Epstein's on our modified premise.

* V. Barnett and T. Lewis, Outliers in Statistical Data (London: John Wiley & Sons, Ltd., 1978), pp. 85-86.
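The significance probability in this example can be reproduced by integrating the density (42) numerically. The sketch below uses composite Simpson integration and hypothetical function names; it covers only the exponential, two-outlier case to which (42) applies.

```python
def lower_pair_density(n, t):
    """k_n(t) of (42): density of T_(1) + T_(2), exponential sample of size n."""
    if t <= 0.0 or t >= 2.0 / n:
        return 0.0
    scale = n * (n - 1) ** 2 / (n - 2)
    value = (1.0 - 0.5 * n * t) ** (n - 2)
    if t <= 1.0 / (n - 1):
        value -= (1.0 - (n - 1) * t) ** (n - 2)
    return scale * value


def lower_pair_cdf(n, w, steps=2000):
    """P(W < w) by composite Simpson integration over (0, w); steps must be even."""
    h = w / steps
    total = lower_pair_density(n, 0.0) + lower_pair_density(n, w)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * lower_pair_density(n, i * h)
    return total * h / 3.0


# Data of the example: n = 10 items, two shortest times total 24 out of 600.
p_value = lower_pair_cdf(10, 24 / 600)
```

Integrating the same density over its whole support $(0, 2/n)$ should return 1, which gives a simple check on the constants in (42).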
5.2 Example 2
Table 1 shows a sample of 132 excess cycle times in steel manufacture (source of data: private communication).
The sample of size 130 obtained by omitting the outliers $x_{(131)} = 92$ and $x_{(132)} = 97$ has mean $\bar{x} = 6.44$, standard deviation $s = 6.18$, and third and fourth moments about the mean $m_3 = 493$, $m_4 = 1.34 \times 10^4$. Hence $\bar{x}/s = 1.04$, $m_3/s^3 = 2.09$, $m_4/s^4 = 9.24$, suggesting that the distribution of $x$ may reasonably be assumed exponential ($\mu/\sigma = 1$, $\mu_3/\sigma^3 = 2$, $\mu_4/\sigma^4 = 9$ for an exponential distribution). On this assumption we can test the outliers 92, 97 for consistency with the other 130 values, using the statistic $Z_n$ of Section 4.4. Its value for the sample is $(92 + 97)/1043 = 0.1812$, so in the null case, from (37),
$$P(Z_{132} > 0.1812) \le (132 \times 131/2)\; P\big(F_{4,260} > (130/2)(0.1812/0.8188)\big) \approx 8646\; P(\chi^2_4 > 57.54) = (8646)(29.77)\exp(-28.77) = 8 \times 10^{-8},$$
i.e., the evidence for regarding the values 92, 97 as not belonging to the same exponential distribution as the other 130 values is very strong.
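The bound can also be evaluated without the chi-square step. Inequality (36) integrates to $\{n(n-1)/2\}\,P(\mathrm{Beta}(2r, r(n-2)) > u)$, and for integer parameters a beta tail equals a binomial sum. The sketch below, with hypothetical function names, uses that standard identity; it is an alternative evaluation of (37), not the calculation performed in the text.

```python
from math import comb, exp


def beta_upper_tail(a, b, u):
    """P(Beta(a, b) > u) for positive integers a, b: P(Binomial(a+b-1, u) <= a-1)."""
    m = a + b - 1
    return sum(comb(m, i) * u ** i * (1.0 - u) ** (m - i) for i in range(a))


def pair_upper_bound(n, r, u):
    """Bound (37) in beta form: {n(n-1)/2} P(Beta(2r, r(n-2)) > u), integer r."""
    return n * (n - 1) / 2 * beta_upper_tail(2 * r, r * (n - 2), u)


z = (92 + 97) / 1043
exact_bound = pair_upper_bound(132, 1, z)

# Chi-square approximation as used in the text: P(chi2_4 > x) = (1 + x/2) exp(-x/2).
chi2_value = 8646 * (1 + 57.54 / 2) * exp(-57.54 / 2)
```

Under this identity the bound comes out at roughly $1.1 \times 10^{-6}$, somewhat larger than the chi-square figure, since replacing $F_{4,260}$ by $\chi^2_4/4$ understates the tail this far out; either way the evidence against the null hypothesis remains very strong.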
6. CONCLUSION
The recursive procedure described in this paper allows ready calculation of the null distributions of various statistics for testing outliers in gamma samples.
Exact significance levels can then be calculated on hand calculators, in many cases avoiding the need for extensive tables of percentage points.
7. ACKNOWLEDGMENTS
We are grateful to the referees for helpful comments.
REFERENCES
[1] COCHRAN, W. G. (1941). The distribution of the largest of a set of estimated variances as a fraction of their total. Ann. Eugenics, 11, 47-52.
[2] DIXON, W. J. (1950). Analysis of extreme values. Ann. Math. Statist., 21, 488-506.
[3] EPSTEIN, B. (1960). Tests for the validity of the assumption that the underlying distribution of life is exponential: Part II. Technometrics, 2, 167-183.
[4] FISHER, R. A. (1929). Tests of significance in harmonic analysis. Proc. Roy. Soc., A, 125, 54-59.
[5] KABE, D. G. (1970). Testing outliers from an exponential population. Metrika, 15, 15-18.
[6] LAURENT, A. G. (1963). Conditional distribution of order statistics and distribution of the reduced i-th order statistic of the exponential model. Ann. Math. Statist., 34, 652-657.
[7] LIKEŠ, J. (1966). Distribution of Dixon's statistics in the case of an exponential population. Metrika, 11, 46-54.
[8] WILKS, S. S. (1962). Mathematical Statistics. New York: John Wiley and Sons, Inc.