• No results found

Identification of outliers in exponential samples with stepwise procedures

N/A
N/A
Protected

Academic year: 2021

Share "Identification of outliers in exponential samples with stepwise procedures"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

samples with stepwise procedures

V. Schultze and J. Pawlitschko

Department of Statistics, University of Dortmund,D-44221 Dortmund

Abstract

Inthispaperweconsidertheproblemofidentifyingoutliersinexponential

sam-pleswithstepwiseprocedures, namely inwardand outwardtesting procedures. We

treat outliers in the spirit of Davies and Gather (1993) as points which for given

> 0 lie in a certain -outlier region and focus especially on the worst-case

be-haviouroftheidenticationrules. Bestresultsyieldstepwise procedureswhich use

test statisticsbased on astandardized versionof thesample median.

KEYWORDS:outlieridentication,inwardtesting,outwardtesting,breakdownpoints,robust

statis-tics

1 Introduction

Insamplestaken fromsome targetpopulationoneoftenobservessomedata pointswhich

seem to dier strongly from the main body of the data. Such seemingly aberrant data

points are usually called \outliers". However, there exists no formal denition of what

constitutes anoutlier that has been widely accepted.

In this paperwe focus onoutlying observations in exponential samples. The exponential

distribution Exp() with unknown scale parameter >0 is commonly used asa simple

but nevertheless quite useful model distribution for lifetime data. The corresponding

(2)

1= exp ( x=); x>0;respectively.

Beginning with Cochran (1941), there is a rich literature on the problem of detecting

outliers in exponential samples, see e.g. Gather (1995) for a comprehensive (yet not

ex-haustive) account on contributions to this topic. In most of this work the problem of

outlier detection is seen as a testing problem with null hypothesis that all observed

life-times come from the same exponential distribution { the null model { and alternative

thatatleast one lifetimecomesfromanotherdistributionpermittedby asometimes only

implicitlyassumed outlier-generatingmodel.

An other approach recently pursued by Schultze and Pawlitschko (2000) is based on the

so called -outlier region of a distribution F which for 1 can be described as the

maximalpart of the support ofF which carriesonlya certainsmall probabilitymass. In

case of F =F

=Exp(), the corresponding -outlier region is given by

out( ;F

) = fx>0: x > ln g:

To take into accountthe size N of agiven sample one may choose

= N = 1 1 ~ 1=N for agiven :~

In Schultze and Pawlitschko (2000),the identicationof outliers ina given sample x

N = (x 1 ;:::;x N

) is achieved by constructingan empiricalversion OR (x

N ; N ) of out( N ;F )

{ a so-called one-step outlier identier { and classifying all observations as

N

-outliers

which liein OR . Typically, in the exponential case such a one-step outlier identier has

the form OR S (x N ; N ) = fx>0: x>S N (x N )g(N; N )g (1) whereS N (x N

)isanestimatorofscaleand g(N;

N

)anappropriatenormalizingconstant.

In the set-upof anormal sample, Carey et al.(1997) calla rule of this type a\resistant

detectionrule"ifitisbasedonrobustestimatorsoftheparametersofthenulldistribution.

In the exponentialcase there are a variety of robust estimators of the scale parameter ,

and itturnsout that astandardizedversionof thesample medianyieldsthebest results.

Other approaches for detecting

N

-outliers are not necessarily based on an explicit

em-pirical version of out(

N ;F

)but merely consist in arule that classies the observations

with respect to their \outlyingness". Especially stepwise procedures, namely

consecu-tive inward and outward testing procedures can be interpreted in this way. The main

(3)

lier identication rules, namely masking and swamping breakdown points and the size

of the largest nonidentiable outlier,and adapt them for stepwise procedures. Section 3

discusses some inward testingprocedures, aclassicalone usingCochran-statisticsateach

step and three alternatives which are based on robust estimators of scale. The fourth

section contains corresponding results for some outward testing procedures: again the

classical version with Cochran-statistics, two procedures based on spacings and a new

proposalthat makesuse of the standardized samplemedian. Clearly, worst-case analysis

onlyshedslightonthebehaviourofoutlieridenticationrulesforextremedatasituations.

Therefore, Section 5 contains the results of a small simulation study where we compare

their power of correctly identifying the outlying observations in a contaminated sample,

generatedfromascale-slippagemodelof Ferguson-type. A realdata exampleandashort

discussion in Section6 conclude this paper.

2 Worst-case behaviour of outlier identication rules

In the literature, comparisonsof dierent rules foroutlier identicationare mostly based

on simulations since exact results on e.g. power are seldom available. A eld, however,

where alsotheoretical resultscan beobtained is the analysis of the worst-case behaviour

of the identicationrules.

When identifying outliers two possible mistakes can occur: (1) a genuine outlier is not

recognized as such, and (2) a non-outlying observation is identied as outlier. If these

errors are caused by the outliers themselves, the phenomena are called (1) masking and

(2) swamping. The worst-case concerning these mistakes occurs if either badly placed

outlyingobservationsinthe samplecausean identicationrule tobe unabletorecognize

anarbitrarilylargeoutlier assuch ortoidentify anon-outlying observation asarbitrarily

largeoutlier. Thesmallestfractionofoutliersneeded inasampletoachievetheseextreme

results are called the masking and swamping breakdown point of the identicationrule,

respectively (cf.Davies and Gather, 1993).

More formally, given a sequence =(

N ) N2N with N 2 (0;1), 2(0;1),and asample

withnarbitraryobservationsx

n

,themaskingbreakdownpointofanoutlieridentication

rule OI may be dened as

M (OI;; x n ; ) = k M n+k M ;

(4)

with k = minfk : ( n+k ;x n ;k;)=0gand M ( n+k ;x n ; k;) = inf

> 0: there exist -outliers x o k = ( x o 1 ;:::;x o k ) such

thatOI failstoidentifysomex o i alsoinout(;F )asan n+k

-outlier inthe combined sample (x

n ;x

o

k ) :

Further, the swampingbreakdown point of OI may be dened as

S (OI; ; x n ; ) = k S n+k S ; with k S =minfk: S ( n+k ;x n ;k;)=0gand S ( n+k ; x n ; k; ) = inf

>0: there exist-outliersx o

k

suchthat OI identies

some

non-n+k

-outlierinx

n

as-outlierinthecombined

sample .

Here, the denitions in Davies and Gather, 1993, have been altered slightly to cover

the situation that with stepwise procedures only the observations of a given sample are

classiedwith respect totheir \outlyingness".

Ahighmaskingand swampingbreakdown pointaredesirablefeaturesof anidentication

rule and there are indeed many rules which achieve the optimal value 1/2. Note that

theoreticallylargervaluesare possiblebut thisisirrelevantinpracticalapplicationssince

a sample with a fraction of outliers of more than 50% makes no sense. Forcomparisons

between those optimal rules further criteria are needed. For instance, if the fraction of

outlying observations in the sample is smaller than the masking breakdown point then

one may still ask for the size of the largest nonidentiable outlier which is then nite.

Formally this quantity can be dened as follows. Let again OI denote an identication

rule and ~ >0 and and x

n

beas inthe denition of the breakdown points. Iffor some

k <n k n+k < M (OI; ; x n ; ); then set LO(k; OI; ;~ x n ; ) = sup x 2 out( n+k ;F ): there exists x o k 2 out(;F ) with x 2 x o k

such that OI fails to identify x as

n+k

-outlier

inthe combined sample (x

n ;x

o

k ) .

The value of LO will in general depend on the given sample x

n

, further it may even

not be well dened if the points in x

n

are located such that for all possible choices of

x o

k

each

n+k

-outlier in the combined sample will be correctly identied. To work with

thisworst-casecriterion, onemay e.g.calculatetheexpectationofthe largestobservation

that is not identied as

n+k

-outlier under the condition that the random sample X

n

comes i.i.d. from an Exp() distribution. Indeed, this expectation will often be a point

in out(

n+k ;F

)and we willuse the notationLO alsointhis case. Anotherway towork

(5)

The idea of applying stepwise procedures for detecting multiple outliers if there is no a

prioriinformationabout theirnumberimmediatelysuggests itselfandthe rst procedure

ofthis typehas indeedbeen proposedasearlyas in1936by Pearson andChandra Sekar.

With an inward testing procedure, in the rst step the \most extreme observation" of a

given sampleisconsidered by meansof adiscordancy test. If the test decidesthat this is

anoutlier it is removed from the sample and in the next step the \most extreme" of the

remainingobservations ischecked. The procedureterminates if for the rst timesuch an

observationisnotidentiedasanoutlierorifthelargestsensiblenumberk

=b(N 1)=2c

of possible outliers isreached.

In the exponential case, the most extreme observation of a (sub-) sample is simply its

maximum. Letx

(1)

x

(N)

denote the ordered observations inthe completesample

and x N i+1;N =(x (1) ;:::;x (N i+1)

) the sampleinvestigated inthe i-th step of the inward

testing procedure. There are many possible ways for choosing the test statistics for the

discordancy tests in eachstep. Appealingare statistics of the form

T S N i+1 (x N i+1;N ) = x (N i+1) S N i+1 (x N i+1;N ) ; i=1;:::;k ; (2) where S N i+1

denotes an estimator of the scale parameter that is based on the rst

N i+1ordered observations only. We consider the followingchoices for S

N i+1 : i) Standardized median SM N i+1 (x N i+1;N ) = 1 :4427Med(x (1) ;:::;x (N i+1) ): ii) RCS-estimator R CS N i+1 (x N i+1;N ) = 1 :6982Med k Med j fjx (j) x (k) j; j;k =1;:::;N i+1g : iii) RCQ-estimator R CQ N i+1 (x N i+1;N ) = 3 :476 jx (j) x (k) j; j;k =1;:::;N i+1; j <k (l ) where l = l (N i+1)(N i) 8 m :

iv) Mean of the subsample (Maximum-Likelihood-(ML-)estimator)

ML N i+1 (x N i+1;N ) = 1 N i+1 N i+1 X j=1 x (j) :

(6)

asrobust alternative tothe sample mean by Gather and Schultze (1999). The RCS- and

RCQ-estimators were proposed by Rousseeuw and Croux (1993) in the general context

of robust estimation of a scale parameter. Their properties in the exponential case have

been studied further by Gather and Schultze. Their usefulness in the construction of

one-step outlier identiers has been shown in Schultze and Pawlitschko (2000). Choice

iv) corresponds to the well known Cochran-statistic which has been discussed e.g. by

Cochran (1941), Kimber (1982), Chikkagoudar and Kunchur (1987), Likes (1987), and

Tseand Balasooryia(1991). Inthe following,wedenotethecorrespondinginward testing

procedures as SM-IT, RCS-IT, RCQ-IT, and Cochran-IT, respectively.

Itremainstospecify the criticalvaluesforthe discordancy tests. Acommonrequirement

for identicationrules is that for given ~ 2(0;1)under the null modelH

0

{all X

i come

fromi.i.d. Exp()-distributed random variables {one has

P H 0 noobservation isidentied as N -outlier 1 :~ (3)

For an inward testing procedure, condition (3) is already fullled if the critical value in

the rststep, sayt S

N

(~ ),ischosensuchthat thecorrespondingdiscordancytest keeps the

level .~ The criticalvalues t S

N i+1

(~ ); i =2;:::;k

;can be chosen arbitrarily. Here they

are determined such that every discordancy test keeps the niveau ~ under H

0 .

For the three inward testing procedures based on robust estimators of scale the null

distributionof T S

N i+1

isquitediculttohandle. ForSM-IT, expressions forthe survival

function of T SM

N i+1

have been calculated in Pawlitschko (2000). These expressions are

helpful for deriving critical values if N is not too large. Otherwise and for RCS- and

RCQ-IT ingeneral, the mostappropriate way toobtain criticalvalues isvia simulations.

ForCochran-IT aquite goodapproximation isgiven by choosing

t ML N i+1 ()~ = ( N i+1) 1 a N i+1 1+(i 1)a N i+1 (4) where for i=1;:::;k a N i+1 = ~ N i 1 1=(i 1) (5)

see e.g.Likes(1987). This approximationisexact if t ML

N i+1

(~)>(N i+1)=2,otherwise

it makesthe Cochran-tests slightly conservative.

We now investigate masking and swamping breakdown points of the four inward testing

procedures. Theorem 3.1 Let=( N ) N2N bea sequence with N 2(0;1), 2(0;1), and x n be

a sample of arbitrary observations assumed to come from Exp(). If ~ in (3) is chosen

(7)

(i) M (SM-IT ;;x n ;) = M (RCS-IT ;;x n ;) = M (RCQ-IT; ;x n ;) = 2 ; (ii) M (Cochran-IT; ;x n ;) k n+k ; where k

2f1;:::;ngisthe smallestk such that with N =n+k

N k t ML N ( ):~ Proof.

(i) Suppose that for an identication rule with test statistics of type (2) there exists

k 2f1;:::;n 1g such that M ( n+k ; x n ; k; ) = 0 :

That is, for any > 0 we can nd k -outliers x o k = ( x o 1 ;:::;x o k

) such that at least

one arbitrarily large -outlier contained in the combined sample x

N = ( x n ;x o k ) of size N =n+kisnotidentiedas N

-outlier. If !0,thelowerborderofthecorresponding

-outlierregionmovestoinnity. Nowassumethatforsomer k wehavethatx

(N r+1) 2

out(;F

) { then it also holds that x

(N i+1)

2 out(;F

) for i < r { and that it is not

correctly identied as

N

-outlier. This implies the existence of some j r such that

T S N j+1 (x N j+1;N ) t S N j+1 (~ ): (6)

However, if S is chosen as anestimatorof scale with explosion breakdown point

+ (S;x n ) = 1 2 > i n+i thedenominatorofT S N i+1

; i=1;:::;n 1;isbounded. Thisholdsespeciallyforj,hence

for ! 0 we have that T S

N j+1 (x

N j+1;N

) moves to innity which contradicts (6). As a

simple consequence M (OI;;x n ;) n 1 2n 1 (7)

for SM-, RCS-, and RCQ-IT, since the explosion breakdown point of the standardized

median,the RCS-, and the RCQ-estimatorequals 1/2 (cf.Gather and Schultze, 1999).

If k = n, for each of those three inward testing procedures we have for reasonable small

~ that inf x o n 2out(;F) T S N (x N ) < t S N ();~

wherethis inmum isobtained ifsome ofthe-outliersmove toinnityand thereforeare

(8)

favourable position of the -outliers is given if they all are placed at the same point x . In this case inf x o n 2out(;F ) T SM N (x N ) = lim x o !1 T SM N (x N ) = 2 ln2 1:386

which is smaller than t SM

N

(~ ) for all reasonable choices of .~ For the other two

proce-dures similar arguments apply. Togetherwith (7), these ndingsestablishpart (i) of the

theorem.

(ii) For Cochran-IT, the least favourableposition of k -outliersis given if they all are

placed atthe same pointx o

. This placementgives

inf x o k 2out(;F ) T ML N (x N ) = lim x o !1 T ML N (x N ) = n+k k ;

note again that for x o

!1 these -outliersare also -outliersfor any >0.

If an inward testing procedure is build with test statistics of type (2) and the critical

value in the rst step ischosenaccording to (3), then the most extremeobservationx

(N)

is identied as

N

-outlier if and only if it is located in the corresponding empirical

N

-outlierregion(1)based onthe scale estimatorS

N

. However, asthe proofof Theorem 3.1

(ii) shows, this fact does not allowthe conclusion that the masking breakdown pointsof

these two identication rules are equal. The reason is the slightly dierent denition of

the breakdown point for the two classes of identicationrules.

Theorem 3.2 Given an arbitray sample x

n , 2(0;1),and a sequence =( N ) N2N ,

the swamping breakdown point of SM-IT, RCS-IT, RCQ-IT, and Cochran-IT is not

smaller than 1=2.

Proof. Denote withOI any ofthe fourinwardtestingprocedures. Supposethere exists

k 2f1;:::;n 1g with S (OI; n+k ;x n ;k;) = 0 : Denote with x r

the largest

non-n+k

-outlier with respect to F

that is contained in x

n .

The condition for S

is then fullledif and only if thereexist k -outliersx o k 2out(;F ) such that T S N i+1 (x N i+1;N ) > t S N i+1 (); i=1;:::;k ; (8)

for all >0, wherek

denotes the rankof x r

in the combined sample(x

n ;x

o

k

). However,

if ! 0, it follows generally for all i 2 f 1;:::;kg that t S N i+1 () ! 1 . Sincex r < ln( N ) < 1, at least at stage k

of the inward testing procedure the corresponding

test statistic becomes bounded. This is obvious for Cochran-IT, in case of SM-, RCS-,

(9)

hence proves the theorem.

Concerning the largestnonidentiable outlier inthe presence of k -outliers, the

calcula-tionofitsexpectationwouldbeaquiteinvolvedtask. However,forSM-ITandCochran-IT

itispossible togiveanupperbound which isquiteaccurate whereincase ofCochran-IT

one has toassume that k=N is smallerthan the masking breakdown point.

Theorem3.3 LetX n =(X 1 ;:::;X n

)beasampleofsizencomingi.i.d.froma

Exp()-distributionandX

(1)

X

(n)

denotetheordered sample. Fork<nandappropriate

2(0;1)letX o k =(X o 1 ;:::;X o k

)bearandomsampleof -outliers(possibly dependent of

X

n

) and set N =n+k. Then conditional onmin(X o

k ) >X

(n)

, for all reasonable ~ > 0

we have the followingresults forthe expected value of the largest observation that isnot

identied as

N

-outlier.

(i) ForSM-IT one has

E LO(k; SM-IT ;X n ; ;~ ) = t SM N (~ ) ln2 8 > > > > < > > > > : (N+1)=2 X i=1 1 N k i+1 ; N odd, N = 2 X i=1 1 N k i+1 + 1 N 2k ! ; N even. (ii) If k N 1 t ML N ()~ (9)

then for Cochran-IT one has

E LO(k;Cochran-IT ;X n ;;~ ) = t ML N (~ ) N k t ML N ( )~ (N k):

Proof. W.l.o.g. let = 1. For both inward testing procedures the least favourable

constellationof k -outliersisgiven if they are located atone point,say X o

, such that in

the rst step the corresponding discordancy test fails toreject.

(i)In case ofSM-IT, letX o

be dened as the solution of

X o = t SM N ()~ Med(X n ;X o k ) ln2

(10)

where now X

k

denotes the random vector with k components equal to X o

. Provided

thatthis solutionislargerthanX

(n)

,we havethat Med(X

n ;X o k )doesdepend onX o k only

through k and its distribution is equal to that of the (N +1)=2-th order statistic out of

N k observations from an Exp()-distribution if N is odd and to that of the mean of

the N=2-th and(N+1)=2-thorderstatisticifN iseven. The resultnowfollows fromthe

well known relation

E(X (r) ) = r X i=1 1 N k i+1 (10)

whichholds for the r-thorder statisticout of N k observations from Exp().

(ii)In case of Cochran-IT,consider the equation

X o = t ML N ( )~ 1 N n X i=1 X i +kX o :

Under condition (9) this equationhas apositivenite solution

X o = t ML N ()~ N kt ML N (~) n X i=1 X i :

Taking expectationgivesthe expression stated inthe theorem.

Simulationsshowthatinbothcases the probabilityof theeventX o

X

(n)

isquitesmall

even ifk issmall,hencetheunconditionalexpectation{whichissmaller{willnotbevery

dierentfromthe conditionalone. Furthercalculationshaveshown thatfor reasonable~

the expectations given inthe theorem are indeed

N

-outliers.

Forthe three inward testing procedures basedonrobust estimators ofscale the following

asymptoticresult can bederived.

Theorem 3.4 Let X

1 ;X

2

;::: be a sequence of observations coming i.i.d. from an

Exp()-distribution and X

n

the sample consisting of the rst n observations of this

se-quence. For 2(0;1)set k =bnc,andN =k+n. Further,let =(

N )

N2N

denote an

appropriatesequence withelementsin(0;1),X o

k

asampleof sizek containing

N -outliers (possibly dependent on X n ), and X N = ( X n ;X o k

) the combined sample. Then for all

~

>0,withn !1the probabilitythat SM-,RCS-, andRCQ-ITidentify all

N -outliers in out( o N ();F

) converges toone, where

o N () = exp(b(S;;)) N and b(S; ;) = limsup n!1 sup X o k 2out( N ;F) ln S N (X N )

(11)

Beforegivingthe proofofTheorem3.4some remarksareinorder: Themaximum

asymp-totic bias for scale estimators has been introduced in Davies and Gather (1993). For

the estimators considered inTheorem 3.4 the corresponding expressions can be found in

Schultze and Pawlitschko(2000). Anapproximation of the largestnonidentiable outlier

in asample of size N with afraction =(1+) of

N

-outliersis then given as

ALO(; OI;N; ; ) = ln o

N ();

where OI stands for one of the three inward testing procedures considered in the

theo-rem. Note that these approximations coincidewith those for the corresponding one-step

identiers (1) based on the robust scale estimators. Hence the comparisons in Schultze

andPawlitschko(2000)carry overdirectlytothe inward testingscheme. Forallvaluesof

N and k considered there it turned out that ALO is smallest for SM-IT and largest for

RCQ-IT.

Proof of Theorem 3.4. Under theconditions statedinthetheorem, the largest

possi-bleobservationX o

(N) inX

N

whichisnot identiedasan

N

-outlierby the inward testing

procedure basedon S N is determined from X o (N) = t S N (~ ) sup X o k 2out( N ;F) S N (X N ): (11)

Consider the critical value: if S

N

is a root-N-consistent estimator of , under the null

modelofnooutlierswecan asymptoticallyapproximatethe distributionof T S N (X N )with that of X (N)

=. That is,for any c>0

lim N!1 P H0 T S N (X N ) c 1 exp ( c) N N = 0 : (12)

Therefore, the critical values in the rst step of the three identication rules considered

here fulll lim N!1 t S N ( )~ +ln N = 0

and from (11) one obtainsthat in probability

lim N!1 X o (N) + ln N exp b(S; ; ) = 0 :

Hence, asymptoticallyobservationslarger than ln

N

exp b(S;;)

are always

iden-tied as N -outliers. Setting ln( o N ) = ln N exp b(S;;)

and solving for o

N

(12)

maximum asymptoticbias ofthe ML-estimator isinnity. Further,notice fromTheorem

3.1 thatfor n!1the masking breakdown pointof Cochran-ITtends tozero. However,

in those cases where it exceeds 1=(n+1) it is possible to give an approximation based

on asymptotic arguments that works quite well if N is suciently large. Given the

assumptions of Theorem 3.4and provided that

<

1

1+ln

N

(13)

we can approximate the largestnonidentiable outlierby

ALO(;Cochran-IT;N;;) = ln N 1+(1+ln N ) : (14)

Condition (13) is derived from the masking breakdown point of Cochran-IT given in

Theorem 3.1(ii), the argument leading to (14) is essentially the same as in the proof of

Theorem 3.3.

SimulationsshowthatallapproximationsworkwellinsamplesofsizeN 50,seeSchultze

andPawlitschko(2000)wheresomenumericalresultsforthe correspondingone-step

iden-tiers are presented.

4 Outward testing procedures

The proneness of classicalinward testing procedures like Cochran-IT tomasking has led

tothe development ofalternativestepwise rules. Rosner (1975),inthe contextof normal

samples, suggested a method that inverts the concept of inward testing and, therefore,

is often denoted as outward testing. In an outward testing procedure, at rst the k

\mostextreme" observations ofa samplex

N

are removed. In thefollowingwealways set

k

=b(N 1)=2c, the maximal reasonable number of outliers. Of course smaller values

fork

arepossible, e.g.ifonehas aroughideaaboutthe fractionof irregularobservations

inthe sample. Then, beginningwiththe least extremeof theselected observations,these

are tested with an appropriate discordancy test whether they are indeed outlying. If at

one stage the test rejects, the corresponding observation and all that are more extreme

are identied asoutliers. Ifthetest rejects not,the correspondingobservation isreunited

with the reduced sample and the next observation is considered. In general, in the rst

step a problem is the need for a criterion to judge the extremeness of an observation.

In case of an exponential sample, however, this problem does not occur since naturally

the k

largest observations stick out asthe most extreme ones. There is arich literature

(13)

(1987), Balasooriya (1989), Tse and Balasooriya (1991), and Balasooriya and Gadag

(1994). To ourknowledge,however, the worst-case behaviouroftheseprocedures has not

been considered yet.

Foreachstep of anoutward testingprocedureprincipallythe samediscordancy tests can

beused aswiththe inwardtestingscheme. However, nowalsoapplicationof discordancy

testswhichinthelattercasesueredfrommaskingyieldssatisfactoryresultsandtherefore

deservesa closer investigation. Weconsider the worst-case behavior of the following four

outward testing procedures based on

i) Cochran-statistics (Cochran-OT),

ii) Dixon-statistics(Dixon-OT), that is inthe j-thstep we use

T D N k +j (x N k +j;N ) = x (N k +j) x (N k +j 1) x (N k +j) ; j =1;:::;k

(see Dixon, 1950, Likes, 1967, Chikkagoudarand Kunchur, 1987),

iii) test statistics proposed by Balasooriya (B-OT),

T B N k +j (x N k +j;N ) = x (N k +j) x (N k +j 1) W j ; j =1;:::;k ; with W j = P N k +j 1 i=1 x (i) +(N k +j 1)x (N k +j 1) (N k +j 1)(k j +1)

(see Balasooriya,1989, Tse and Balasooriya,1991, Balasooriyaand Gadag, 1994),

iv) teststatisticsoftype(2)withthestandardizedmedianasscaleestimator(SM-OT).

The rst three outward testing procedures are well known. The Balasooriya-statistics

have the interesting property that they are mutually independent under the null model

(cf. Sweeting, 1983). In this case, the denominator of T B

r

is equal to the best linear

predictor of the r-thorder statisticgiven the r 1smallest observations. The procedure

basedonthe standardizedmedianisnewand isincluded inthe followinginvestigationto

see how anoutward testing procedure based onarobust estimatorofscale compets with

classicalmethods.

Asintheprevioussection,werequire thatcondition(3)isfullled. In caseofanoutward

testingprocedurewithteststatisticsT

N k

+j

; j =1;:::;k

(14)

P H 0 k [ j=1 T N k +j (X N k +j;N ) >t N k +j (~ ) :~ (15)

Condition (15) doesnot uniquely determine the critical values. This can be achieved by

the additionalrequirement that

P H 0 T N k +j (X N k +j;N ) > t N k +j ()~ = k ; j =1;:::;k : (16)

If the test statistics for the individual steps are mutually independent then

k

can be

chosen to1 (1 )~ 1=k

. In case of B-OTwe get the followingsimple expression for the

corresponding criticalvalues:

t B N k +j ( )~ = ( N k +j 1) 1=(N k +j 1) k 1

(cf. Tse and Balasooriya, 1983). Otherwise, the Bonferroni-inequality allows for the

slightly more conservative choice

k

= =k~

. In case of Cochran-OT the critical

val-ues then can be approximated asin (4) with ~ in(5)now replaced with =k~

. In case of

Dixon-OT we nd them as the solutionsof

~ =k = 1 t D N k +j ()~ N k +j 1 N k +j 1 Y i=1 N i+1 N i t D N k +j ()~ (N k +j i) : (17)

Thisequationcanbededucedfromthemoregeneralexpressions inLikes(1967)andKabe

(1970) orby direct calculation.

Wenowdeterminemaskingandswampingbreakdownpointsfor thefouroutward testing

procedures introducedabove.

Theorem 4.1 Given anarbitrarysample x

n , 2(0;1);and a sequence =( N ) N2N ,

allfour outward testing procedures have masking breakdown point

M (OI; ; x n ; ) 1 2 :

Proof. We rst consider the case of Cochran-OT and assume that its masking

break-down point is smaller than 1/2. Withsimilar arguments as in the proof of Theorem 3.1

this would imply the existence of k<n such that forthe combined sample (x

n ;x

o

k )none

of the discordancy tests with test statistics T ML n+j ; j = 1 ;:::;k;rejects as x o moves to innity. Herex o k

denotes avectorwithk componentsequaltosomex o

2out(;F

). That

is forall j k we must have

lim x o !1 T ML n+j (x n+j;n+k ) t ML n+j (=k~ ): (18)

(15)

lim x o !1 T ML n+1 (x n+1;n+k ) = n+1 whereas t ML n+1 (=k~ ) < n+1

and this contradicts (18).

The proof for the other three outward testing procedures is quite similar and therefore

omitted.

Theorem 4.2 Letthe assumptions ofTheorem 4.1befullled andassume furtherthat

allx

i 2x

n

are positive. Then allfour outward testing procedures have swamping

break-down point S (OI;;x n ;) 1 2 :

Proof. Assumethattheswampingbreakdownpointissmallerthan1/2. Asintheproof

ofTheorem3.2thisimpliestheexistenceof k<nsuchthatforsomex r 2x n whichisnot inout( n+k ;F )wecan ndasamplex o k 2out(;F

)foreach >0suchthatx r

isfalsely

identiedas-outlier. Asinthe proofof Theorem4.1welookcloseratCochran-OT. Set

k

= b(n+k 1)=2c. Then x r

is identied as -outlier if for some j 2 f 1;:::;k

g it is

not smaller than the maximum of the subsample x

n+k k

+j;n+k

of the combined sample

(x n ;x o k ) and if T ML n+k k +j (x n+k k +j;n+k ) > t ML n+k k +j (): (19) Since x r isno n+k -outlierit is bounded: x r ln n+k = c < 1:

Hence, alsothe test statisticof the discordancy test is bounded by

T ML n+k k +j (x n+k k +j;n+k ) (n+k k +j) c n+k k +j 1 X i=1 x (i) +c < n+k k +j

wherethelastinequalityisstrictsincetheregularobservationsareassumedtobepositive.

However, forj =1;:::;k we have lim !0 t ML n+k k +j () = n+k k +j

(16)

quitesimilarly.

Theorems 4.1and 4.2show thatconcerning their breakdown points thereis nodierence

between the three \classical"outward testing procedures and the new one based on the

standardizedmedian: allprocedureshaveoptimalhighmaskingandswampingbreakdown

points. Hence,a nerinvestigationof their worst-case behaviourbased onthe size of the

largest nonidentiableoutlier isnecessary.

As in case of the EDR-ESD identier for normal samples which is thoroughly discussed

inDavies andGather (1993),the most infavourableconstellation ofoutliers isgiven if to

aregularsample ofsize n a\string"of k -outliersisadded. Sucha stringisconstructed

byplacingineachstepofthe outwardtestingprocedurea-outlieratthatpointatwhich

the corresponding discordancy test just fails to reject. For the four identiers discussed

here, this leads to the following results.

Theorem 4.3 Let X n = ( X 1 ;:::;X n

) be a sample of size n coming i.i.d. from an

Exp()-distribution and with X

(n)

denote the maximum of X

n . Fork <n and 2(0;1) letX o k =(X o 1 ;:::;X o k

)bea randomsample of -outliers (possibly dependent of X

n ) and

set N =n+k. Then for allreasonable ~ > 0, conditional on the event that no regular

observation is identied as

N

-outlier,the expected value of the largest observation that

is not identiedas

N

-outlier isgiven by

(i) for Cochran-OT

E LO(k;Cochran-OT;X n ;;~ ) = c ML N (~) k 1 Y j=1 1+c ML N k+j ( )~ (N k) with c ML N k+j ( )~ = t ML N k+j ( =k~ ) (N k+j)+t ML N k+j (~=k ) ; j =1;:::;k;

(ii) for Dixon-OT

E LO(k;Dixon-OT;X n ;;~ ) = k Y j=1 1 1 t D N k+j (~ ) k X i=1 1 N i+1 ;

(iii) for outward testing with Balasooriya-statistics

E LO(k;B-OT;X n ;;~ ) = E(X o k );

(17)

with X k recursively dened by X o 1 = X (n) + 1=(N k) k 1 1 k n X i=1 X i +X (n) ; X o j = X o j 1 + 1=(N k+j 1) k 1 1 k j+1 n X i=1 X i + j 1 X i=1 X o i +X o j 1 ; for j =2;:::;k;

(iv) for SM-OT

E LO(k;SM-OT;X n ; ;~ ) = t SM N (~ =k ) ln2 8 > > > > < > > > > : (N+1)=2 X i=1 1 N k i+1 N odd, N = 2 X i=1 1 N k i+1 + 1 N 2k N even.

Proof. Since the proofs of parts (i) { (iv) are quite similar, we only look closer at

Cochran-OT. Assume that for given ~ no regularobservation is classiedas

N

-outlier.

To achieve that nofurther discordancy test rejects, choose

X o 1 = t ML N k+1 ( =k~ ) (N k+1)+t ML N k+1 (~ =k ) n X i=1 X i

whichis a-outlierfor appropriately chosen ,and then subsequently

X o j = t ML N k +j ( =k~ ) (N k+j)+t ML N k+j (~ =k ) n X i=1 X i + j 1 X r=1 X o r = c ML N k+j ( )~ 1+c ML N k+j 1 ( )~ c ML N k+j 1 (~ ) X o j 1

for j = 2 ;:::;k. Consecutive application of this recurrence relation and making use of

E P n i=1 X i

= (N k) yields the result in (i). For (ii) and (iv), relation (10) can be

applied in the same way as in the proof of Theorem 3.3 since the smallest -outlier is

placed tothe right of X

(n)

.

Notethat underthe assumptionsofTheorem4.3forSM-OTtheexpected sizeofLO only

depends on the critical value used inthe last step and on the expected values of certain

order statisticsof the regularX

i . Further, since t ML N ( =k~ ) t ML N ( )~

(18)

testing procedure based on the standardized median than for the corresponding inward

testingprocedure. A similarresultwould beobtained whenusingone ofthe otherrobust

estimators of scale suggested in Section3.

Figure 1 illustrates the results from Theorem 4.3 for N = 20 and all possible values of

k. From plot a) it can be seen that B-OT and especially Dixon-OT perform very poorly

if the fraction of irregular observations approaches 1/2. Therefore, plot b) contains only

the values forCochran-OT andSM-OT. Generallyitcanbesaidthat the latterperforms

best with respect to this worst-case criterionwith the exception of the case k=1,where

Cochran-OThastheedge. Itisworthnotingthatthetwoprocedureswherethenominator

of the test statistics used in the discordancy tests is a dierence of consecutive order

statisticsare clearly outperformed by thosewhere only asingle order statistic appears.

k

ELO

1

2

3

4

5

6

7

8

9

0

500

1000

1500

2000

2500

3000

3500

a)

k

ELO

1

2

3

4

5

6

7

8

9

0

20

40

60

80

b)

Figure 1: Expected size of LO for ~ =0:05,N =20, with : SM-OT,

(19)

In the previous sections we have mainly been concerned with the worst-case behaviour

of stepwise outlier identicationrules. Ifno distributional assumptionsare made for the

outlyingobservationsapartfrombeinglocatedinacertainoutlierregion,itisnotpossible

togivegeneralresultsfortheperformanceoftheseidentiersinnonworst-casesituations.

Therefore, we have also studied the fraction of correctly identied outliers and inliers

(that are the sample elements not located in out(

N ;F

)) via a small simulation study

under a certain popular outlier-generating model, namely a slippage model of F

erguson-type (cf. Gather, 1995). Under this model, the distribution of a random sample X

N = (X 1 ;::: ;X N )of size N isgiven by F X N = Q N i=1 F s(i)

wheres(i)2f1;bgforsome b 1

and P

N

i=1

I[s(i) = b] = k for some given k b N=2c. To guarantee that the number

of outliers generated by this model is not too small we set b = 1 =13. In Schultze and

Pawlitschko (2000) the expected number of

N

-outliers for ~ = 0 :05 and some selected

values for N and k has been calculated. For these values the expectationalways exceeds

k=2 but isnearly always smaller than 2=3k. Forthe simulation study we set N =20;50;

and ~=0:05. The corresponding

N

-outlier regionsare given by the intervals (5:97; 1)

and (6:88;1), respectively. Then 5000 random samples from the Ferguson-model were

drawn for each combination of N and some selected values of k k

= b(N 1)=2c.

The average fractions of correctly classied observations are displayed in Figures 2 to 5.

To makedierences inthe heightof the blocks morevisible, their shading becomesmore

dense with growing fraction.

Concerning the fractionof correctly identied outliers,Figures2 and 4show that SM-IT

yieldsthebestresultsofthefourinwardtestingprocedureswhiletheothertwoprocedures

based on robustestimators of scale are nearly asgood. Cochran-IT performs poorlyif k

gets large and therefore is not recommendable. The highest power of the four outward

testing procedures is obtained with Cochran-OT. If k is not to small, SM-OT is nearly

as good for N =50, however, it performs not very satisfactory for N =20. For outward

testingprocedures thatuse test statisticsoftype(2) thereseems tobeacloseconnection

between the nite sample eciency of the scale estimator and the power of the

identi-cation rule. Somefurther simulationswith outward testing procedures that are based on

theRCS-andRCQ-estimatorshowed thatthey hadsmallerpowerthanCochran-OT too,

althoughthey did not perform better than SM-OT. Dixon-OTin generalshows the least

convincingresults,whereas B-OTperforms quitewellaslong ask isnottoolarge. When

comparing inward and outward testing procedures one nds that the latter are always

inferior to their competitors if k is small and have similar power for large k only. This

comparison, however, is somewhat misleading, which becomes clear from Figures 3 and

5 where the fraction of correctly classied inliers is shown. With few exceptions for all

outward testing procedures this fraction is at least as great as 90% irrespective of the

(20)

k

-thstep of the inward testingprocedures. These values are determined under the null

modelof nooutlierswhichleadstodiscordancytests thataretooliberalifindeedoutliers

are contained inthe sample. Thereis a nice analogyin the eld of multiple testing: The

probabilityofdeclaringatleastoneregularobservationasoutliercorrespondsinsomeway

tothe probabilityoffalsely rejectingatleastone truenullhypothesis inagiven familyof

hypotheses, the so-calledfamilywiseerror rate (FWE).An outlieridenticationrule that

is standardized according to (3) corresponds to a multiple test that controls FWE only

in the weak sense that is if all hypotheses in the family are true. Neither outward nor

inward testing procedures provide an equivalent to strong control of FWE. However the

former seem to have a smaller probability of classifying at least one regular observation

as outlyingif the nullmodel doesnot hold.

Figure2: Fractionof correctly identied outliersin the Ferguson-model,

(21)

Figure4: Fractionof correctly identied outliersin the Ferguson-model,

(22)

6 Example and conclusion

As anexampleforthe applicationof the stepward outlieridenticationrules discussed in

this paper, we consider a data set taken from Nelson (1982, p. 104). The data are the

times to breakdown of an insulatinguid between two electrodes, recorded at a voltage

of 32 kV. The recorded breakdown times in ascending order are 0.27, 0.40, 0.69, 0.79,

2.75, 3.91, 9.88, 13.95, 15.93, 27.80, 53.24, 82.85, 89.29, 100.58, and 215.10. We suppose

that the breakdown times follow a one-parameter exponential distribution and seek to

nd out if the sample contains any

N

-outliers where N = 15 and ~ = 0 :05. Tables 1

and 2containa comparisonofthe results given by the eightidenticationrules discussed

sofar inthis paper. Foreachrule the test statisticsand corresponding criticalvalues are

listed up to the terminalstep.

The results can be summarized as follows: The outward testing procedures based on

spacings do not classify any observation as outlying whereas Cochran-OT identies the

ve largest ones and SM-OT even one more. Inward testing with Cochran-statistics is

less successful: Cochran-IT failstoidentifyany outlier. This resultisnot duetomasking

sincealsoRCQ-IT declares noobservationasoutlying. RCS-ITidentiesonlythelargest

observationwhereastheinwardtestingprocedurebasedonthestandardizedmedianaggs

the maximal reasonable number k

= 7 of observations as outliers. In this example we

(23)

i x (16 i) T SM 16 i t SM 16 i T ML 16 i t ML 16 i T RCS 16 i t RCS 16 i T RCQ 16 i t RCQ 16 i 1 215.10 10.6879 7.0437 5.2257 6.3146 9.2590 8.0307 5.5251 5.9838 2 100.58 5.8512 4.9587 4.7287 5.6065 3 89.29 6.2643 4.3398 4 82.85 8.3288 3.7570 5 53.24 9.4381 3.6342 6 27.80 5.7866 3.2748 7 15.93 4.0152 3.3284

Table 1: Inward testing procedures forthe example

SM-OT Cochran-OT B-OT Dixon-OT

j x (8+j) T SM 8+j t SM 8+j T ML 8+j t ML 8+j T B 8+j t B 8+j T D 8+j t D 8+j 1 15.93 4.0152 5.4701 2.9518 3.5457 0.8510 6.7968 0.1243 0.5763 2 27.80 5.7866 5.0888 3.6402 3.6587 4.4466 6.5470 0.4270 0.5615 3 53.24 4.5185 3.8111 5.9061 6.3555 0.4778 0.5569 4 82.85 3.8031 6.2041 0.3574 0.5637 5 89.29 0.5029 6.0815 0.0721 0.5861 6 100.58 0.6111 5.9801 0.1122 0.9873 7 215.10 3.1880 5.8950 0.5324 0.7324

Table 2: Outward testingprocedures forthe example

from a one-parameter exponential distribution. The results obtained with Cochran-OT

and both procedures based on the standardized median may however also be seen as

indicationthat thisassumption isquestionable. It isindeedmoreproperlytoanalyzethe

breakdown times undera Weibullmodel.

Tocometoanal conclusion: with respect totheirbreakdownpropertiesthere aremany

competing optimal stepwise procedures for outlier identication in exponential samples,

among themthe well known outward testing procedures based ondiscordancy tests with

Cochran-, Dixon-, and Balasooryia-statistics. A nerworst-case analysis with respect to

the size of the largest nonidentiable outlier LO reveals that these classical procedures

are outperformed by an outward testingprocedure SM-OT that relies on a standardized

version of the sample median. Especially the two procedures based on spacings perform

(24)

An alternativetothese outward testing procedures isgiven by inward testing procedures

that are based on robust estimators of the scale parameter. Whereas inward testing

procedures with classicaldiscordancy tests suer heavily from maskingand are therefore

notrecommendable,these newmethodshaveoptimalbreakdown propertiesand leadtoa

smallersize of LO than competing inward testingprocedures. Aminor drawback istheir

tendency to identify too many observations as outlying if the null model does not hold.

This tendency isdue tothefact thatthe usualchoiceof criticalvalues aftertherst step

seems to be too liberal. How a better choice could be made is an interesting topic for

further investigations.

Acknowledgement

Support ofthe Deutsche Forschungsgemeinschaft(SFB475, \Reduction of complexity in

multivariatedata structures")is gratefully acknowledged.

References

Balasooriya, U. (1989). Detection of outliers in the exponential distribution based on

prediction. Commun. Statist. {Theory Meth., 18, 711{720.

Balasooriya, U. and Gadag, V. (1994). Tests for upper outliers in the two-parameter

exponentialdistribution. J. Statist. Comput. Simul., 50,249{259.

Chikkagoudar,M.S.andKunchur, S.H.(1987). Comparisonofmanyoutlierprocedures

for exponential samples. Commun. Statist.{ Theory Meth., 16,627{645.

Carey,V.J.,Rosner,B. A.,Wager, C.G.and Walters,E.E.(1997). Resistantand

test-basedoutlier rejection: Eects ongaussianone-and two-sampleinference. T

echno-metrics,39, 320{330.

Cochran, W. G. (1941). The distribution of the largest of a set of estimated variances

as afraction of their total. Ann. Eugen., 11,47{52.

Davies, L. and Gather, U. (1993). The identication of multiple outliers. J. Amer.

Statist.Assoc., 88, 782{792.

(25)

ishnan and A.P. Basu (eds.): The exponential distribution: Theory, methods and

applications,Gordon and Breach Publishers, Amsterdam, 221-239.

Gather,U. and V. Schultze (1999). Robustestimation of scale of anexponential

distri-bution. Statistica Neerlandica,53, 327{341.

Kabe, D. G. (1970). Testing outliers from an exponential population. Metrika, 15,

15{18.

Kimber, A. C. (1982). Tests for many outliersin a exponential sample. Applied

Statis-tics, 31,263{271.

Likes, J.(1967). DistributionofDixon's Statisticsinthe case ofanexponential

popula-tion. Metrika,11,46{54.

Likes, J. (1987). Some tests for k 2 upper outliers in aexponential sample. Biom.J.

29,313{324.

Nelson, W. (1982). Applied life data analysis. John Wileyand Sons, New York.

Pawlitschko, J.(2000). On the distribution ofa teststatisticfor outlieridenticationin

exponentialsamples. Universitat Dortmund, preprint.

Pearson, E. S. and Chandra Sekar, C. (1936). The eciency of statistical tools and a

criterionfor the rejection of outlying observations. Biometrika, 28, 308{320.

Rosner, B. (1975). On the detection of many outliers. Technometrics, 17,221{227.

Rousseeuw, P. J. and Croux, C. (1993). Alternativesto the medianabsolute deviation.

J. Amer. Statist.Assoc., 88,1273{1283.

Schultze, V. and Pawlitschko, J. (2000). The identication of outliers in exponential

samples. StatisticaNeerlandica, to appear.

Sweeting,T. J. (1983). Independent scale-freespacingsfor the exponentialand uniform

distributions. Statist. Prob. Letters, 1,115{119.

Tse, Y. K. and Balasooriya, U. (1991). Tests for multiple outliers in an exponential

Figure

Figure 1 illustrates the results from Theorem 4.3 for N = 20 and all possible values of
Figure 2: F raction of correctly identied outliers in the F erguson-model,
Figure 4: F raction of correctly identied outliers in the F erguson-model,

References

Related documents