Working Paper 92-32
División de Economía
July 1992
Universidad Carlos III de Madrid
Calle Madrid, 126
28903 Getafe (Spain)
Fax (341) 624-9849

COMPARING PROBABILISTIC METHODS FOR OUTLIER DETECTION

Daniel Peña and Irwin Guttman*

Abstract

This paper compares the use of two posterior probability methods to deal with outliers in linear models. We show that putting together diagnostics that come from the mean-shift and variance-shift models yields a procedure that seems to be more effective than the use of probabilities computed from the posterior distributions of actual realized residuals. The relation of the suggested procedure to the use of a certain predictive distribution for diagnostics is derived.

Key words: Diagnostic, Posterior and Predictive distributions, leverage, linear models.

*Peña, Departamento de Estadística y Econometría, Universidad Carlos III de Madrid, Getafe, 28903 Spain; Guttman, Department of Statistics, University of Toronto, Toronto, Ontario M5S 1A1.
$p_i$ given by (3.15) with $q = 2$, and the $C_i$ as given by (4.1), which is, as demonstrated in Section 4, inversely proportional to the predictive ordinate. As anticipated, the values for $p_i$ are all very small, except for a large value of .9552 at $i = 11$, which is 2.2 times greater than the next largest value, which occurs at $i = 14$. As far as the $C_i$ are concerned, the largest occurs at $i = 11$ with a value of .9708, which is 511 times greater than the next largest value, which occurs at $i = 20$. We note that the outlier will be identified by both procedures, although in a more powerful way by $C_i$ than by $p_i$.

Table 5.1. Data for the simulated example

   i     x       y       h
   1     1     1.955    .13
   2     2     2.201    .11
   3     3     3.235    .10
   4     4     5.862    .09
   5     5     5.944    .08
   6     6     7.514    .07
   7     7     8.397    .06
   8     8     9.756    .06
   9     9    10.401    .05
  10    10     9.659    .05
  11    11    12.375    .05
  12    12    14.125    .05
  13    13    14.729    .05
  14    14    12.622    .05
  15    15    15.?26    .06
  16    16    16.677    .06
  17    17    18.318    .07
  18    18    18.489    .08
  19    19    19.998    .09
  20    40    42.607    .62

The next experiment we carried out was to introduce to the original set of data of Table 5.1 an outlier at the high leverage point $x_{20} = 40$, by again adding 4 to the original observed data point. Here, however, the largest $p_i$ value is .6758 and occurs at $i = 14$, instead of the expected $i = 20$. Indeed the next largest of the $p_i$'s is $p_{20}$, with the value
country involved in this set of data. We should note that the data are taken from UN data for 1959 and 1960.
Table 5.2. Probability of k outliers for the logistic model with Zellner-Moulton data.

   k        0       1       2       3       4
   p(k)   .3566   .3732   .1961   .0611   .0130
ACKNOWLEDGEMENTS
The authors would like to acknowledge, with thanks, the support received for this research. Daniel Peña was supported by DGICYT (Spain), under Grant No. PB90-0266. Irwin Guttman was supported by NSERC (Canada) through Grant No. A8743.
APPENDIX I. The non-central T-distribution.

We remind the reader that the definition of the classical non-central T-variable comes about as follows. Suppose $Z$ and $W$ are independent random variables with distributions given by

$$ Z \sim N(0,1) \quad \text{and} \quad W \sim \chi^2_{\nu} . \tag{AI.1} $$

Then the non-central T-variable with non-centrality parameter $\delta$ and degrees of freedom $\nu$ is defined as

$$ T = \frac{Z + \delta}{\sqrt{W/\nu}} . \tag{AI.2} $$

It is very easy to see that the density of the random variable $T$ as defined in (AI.2) may be written as

(AI.3)

or

$$ P(T \le t) \approx P(Z \le u) = \Phi(u) , \tag{AI.10} $$

where $u = (t - \delta)/\sqrt{1 + t^2/2\nu}$ and $Z \sim N(0,1)$. Hence, the density function of $T$ is, differentiating, such that

(AI.11)

where of course $\phi(u)$ is the density of a $N(0,1)$ random variable, and, to repeat,

$$ u = \frac{t - \delta}{\sqrt{1 + t^2/2\nu}} . \tag{AI.12} $$

We have that the Jacobian of the transformation from $t$ to $u$ given by (AI.12) has absolute value

$$ |J| = \left| \frac{dt}{du} \right| , $$

so that the density of $U$ is

(AI.13)

that is, the density of $U$ is

$$ g(u) = \phi(u) , $$

and we have that, approximately, $U$ has the density of a standard normal variable, for large $\nu$, and the theorem is proved.
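The approximation (AI.10) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the representation $P(T \le t) = E_W[\Phi(t\sqrt{W/\nu} - \delta)]$, which follows directly from the definition (AI.2), and evaluates the expectation with a plain midpoint rule. The function names, the integration limits and the values of $t$, $\delta$ and $\nu$ are illustrative choices, not taken from the paper.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi2_pdf(w, nu):
    """Density of a chi-square variable with nu degrees of freedom."""
    if w <= 0.0:
        return 0.0
    log_pdf = ((0.5 * nu - 1.0) * math.log(w) - 0.5 * w
               - 0.5 * nu * math.log(2.0) - math.lgamma(0.5 * nu))
    return math.exp(log_pdf)

def noncentral_t_cdf(t, delta, nu, steps=20000):
    """P(T <= t) for T = (Z + delta)/sqrt(W/nu), by a midpoint rule over W.

    The upper integration limit is an ad hoc choice that covers essentially
    all of the chi-square mass for moderate nu (assumes nu >= 2).
    """
    upper = nu + 12.0 * math.sqrt(2.0 * nu) + 50.0
    step = upper / steps
    total = 0.0
    for j in range(steps):
        w = (j + 0.5) * step
        total += std_normal_cdf(t * math.sqrt(w / nu) - delta) * chi2_pdf(w, nu)
    return total * step

def normal_approximation(t, delta, nu):
    """Right-hand side of (AI.10): Phi(u) with u = (t - delta)/sqrt(1 + t^2/(2 nu))."""
    u = (t - delta) / math.sqrt(1.0 + t * t / (2.0 * nu))
    return std_normal_cdf(u)

# Illustrative values only: nu = 18 corresponds to n - p for n = 20, p = 2.
for t in (0.0, 1.0, 2.0, 4.0):
    exact = noncentral_t_cdf(t, 1.0, 18)
    approx = normal_approximation(t, 1.0, 18)
    print(f"t={t:3.1f}  numerical={exact:.4f}  approximation={approx:.4f}")
```

At $t = 0$ the two quantities coincide, since both reduce to $\Phi(-\delta)$; away from zero the discrepancy shrinks as $\nu$ grows.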
REFERENCES

Atkinson, A.C. (1985). Plots, Transformations and Regression. Oxford: Clarendon Press.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics. New York: John Wiley.
Box, G.E.P. (1980). "Sampling and Bayes Inference in Scientific Modelling and Robustness." Journal of the Royal Statistical Society, Series A 143, 383-430 (with discussion).
Box, G.E.P., and Meyer, R. (1986). "An Analysis for Unreplicated Fractional Factorials." Technometrics 28, 11-18.
Box, G.E.P., and Tiao, G.C. (1968). "A Bayesian Approach to Some Outlier Problems." Biometrika 55, 119-129.
Casella, G. and George, E.I. (1992). "An Introduction to Gibbs Sampling." American Statistician. (To appear)
Chaloner, K. and Brant, R. (1988). "A Bayesian approach to outlier detection and residual analysis." Biometrika 75, 651-659.
Cook, R.D., and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
Eddy, W.F. (1980). "Discussion of P. Freeman's paper." Bayesian Statistics, J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Editors. Valencia University Press, 370-373.
Freeman, P.R. (1980). "On the number of outliers in data from a linear model." Bayesian Statistics, J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Editors. Valencia University Press, 349-365.
Geisser, S. (1980). "Discussion of a paper by G.E.P. Box." Journal of the Royal Statistical Society, Series A 143, 416-7.
Geisser, S. (1987). "Influential observations, diagnostics and discordancy tests." Journal of Applied Statistics 14, 133-42.
Geisser, S. (1988). "Predictive approaches to discordancy testing." Bayesian and Likelihood Influence: Essays in Honor of George Barnard, S. Geisser, J.S. Hodges, S.J. Press and A. Zellner, Editors. Amsterdam: North Holland.
Kendall, M., and Stuart, A. (1977). The Advanced Theory of Statistics, Volume 1 - Distribution Theory (Fourth Edition). Macmillan Publishing Company, New York.
Ljung, G.M. (1989). "A note on estimation of Missing Values in Time Series." Communications in Statistics, Simulation and Computation 18, 459-465.
Peña, D. and Maravall, A. (1991). "Interpolation, Outliers and Inverse Autocorrelations." Communications in Statistics, Theory and Methods 20, 3175-3186.
Peña, D. and Tiao, G.C. (1992). "Bayesian Robustness Functions for Linear Models." Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds). Oxford University Press, 365-388.
Pettit, L.I. and Smith, A.F.M. (1985). "Outliers and Influential Observations in Linear Models." Bayesian Statistics 2, J.M. Bernardo, M.H. DeGroot, D.V. Lindley, A.F.M. Smith (Eds). Elsevier Science Publishers, 473-494.
Tussell, F. (1990). "Testing for Interaction in Two-Way ANOVA tables with no replication." Computational Statistics and Data Analysis 10, 29-45.
1. INTRODUCTION.
The analysis of outliers from the Bayesian point of view has become increasingly interesting to the statistical profession because of the possibility of carrying out the difficult computations involved using new algorithms such as the Gibbs sampling method (see Gelfand and Smith (1990), or Casella and George (1992)). Also, specific algorithms have been developed to deal with the Bayesian treatment of outliers (Peña and Tiao (1992)).
Not only has the work on analysis of outliers proved useful in its own right, but it turns out that many other statistical problems can also be usefully analyzed by approaching them from the outlier point of view. As examples, we would cite: analysis of unreplicated fractional factorial designs to detect significant effects, see Box and Meyer (1986), Juan and Peña (1992); detection of interaction in unreplicated ANOVA designs, see Tussell (1990); estimation of missing observations in time series models, see Ljung (1989) and Peña and Maravall (1991).
There have been three main approaches to the problem of outliers in the literature. Succinctly, these may be classified as (i) the diagnostic approach, (ii) the Bayesian approach and (iii) the 'robust' approach to estimation and tests of hypotheses in the presence of outliers. The first approach is clearly identified with the work of Cook and Weisberg (1982), Belsley, Kuh and Welsch (1980), and Atkinson (1985), and the aim of workers in this area is mostly that of identification of observations that may be deemed outlying and/or influential. The approach listed as (iii) above has been motivated by the work of Huber (1981) and Hampel et al (1986), the aim here being to build estimators that are not affected by the fraction of the sample that is outlying. Truly in the middle, and listed as such above, is the Bayesian approach, which seeks to combine identification with estimation: see for example Box and Tiao (1968); Guttman et al (1978), etc. Here, the identification is carried out using the posterior probabilities for an observation or a set of observations being outlying, and these are used as weights in estimation procedures.
identification methods for outliers with no alternative model to the null entertained. Examples of work in this category are (i) use of the predictive distribution for detection, (ii) use of the posterior probabilities of various unobserved perturbations and (iii) looking at the change in a posterior of interest when some observations are deleted. These methods will be discussed in Section 3 of this paper.
The second category that has emerged takes into account an alternative model for the generation of a subset of the sample. As examples, various authors have proposed and utilized the mean-shift model and the variance-inflation model. These methods are discussed in Section 4 of this paper.
All of the above will be compared and discussed in Section 5, where an illustrative example based on real data will be given.
2. THE BASIC MODEL.

In this paper we will be concerned with the univariate linear model

$$ y = X\beta + \varepsilon \tag{2.1} $$

where $y$ is an $(n \times 1)$ vector of normal random variables, $X$ is an $(n \times p)$ matrix of full column rank $p < n$, $\beta$ is a $(p \times 1)$ vector of parameters and $\varepsilon$ is an $(n \times 1)$ vector of normal variables with mean vector $0$ and covariance matrix $\sigma^2 I_n$. This will be called the null model in the rest of the paper. The estimation of the parameters $(\beta, \sigma^2)$ will be done assuming a non-informative prior

$$ p(\beta, \sigma^2) \propto \sigma^{-2} . \tag{2.2} $$

It is well known that in this case the posterior distribution for the parameters is such that, conditional on $\sigma^2$, we have

$$ \beta \sim N\big(\hat\beta,\ \sigma^2 (X'X)^{-1}\big) \tag{2.3} $$

where $\hat\beta = (X'X)^{-1}X'y$, and $p(\sigma^2 \mid y, X)$ is such that

$$ \frac{(n-p)s^2}{\sigma^2} \sim \chi^2_{n-p} , \tag{2.4} $$

where

$$ (n-p)s^2 = e'e = y'[I - H]y , \tag{2.5} $$

with $H$ denoting the so-called hat matrix

$$ H = X(X'X)^{-1}X' . \tag{2.6} $$

We denote a set of $k$ distinct integers chosen from the set $(1, \ldots, n)$ by $I$. Then, the vector $y$ can be decomposed as

$$ y = (y_I',\ y_{(I)}')' \tag{2.7} $$

where $(I)$ means "delete set $I$". Similarly, the $X$ matrix can be partitioned as

$$ X' = [X_I' \ \ X_{(I)}'] . \tag{2.8} $$

(The use of the symbol $I$ without brackets means restrict information to the set $I$.) Consistent with the above notation, we will use in the rest of the paper the designations

$$ \hat\beta_{(I)} = (X_{(I)}'X_{(I)})^{-1}X_{(I)}'y_{(I)} \tag{2.9} $$

$$ s^2_{(I)} = (y_{(I)} - X_{(I)}\hat\beta_{(I)})'(y_{(I)} - X_{(I)}\hat\beta_{(I)})/(n-p-k) \tag{2.10} $$

that is, $\hat\beta_{(I)}$ and $s^2_{(I)}$ are estimators of $\beta$ and $\sigma^2$ based on $(X_{(I)}, y_{(I)})$, etc.

In contrast with the null model (2.1) we will be concerned in this paper with two alternative models. The first is the mean-shift model and takes the form, for the generation of the observations $y = (y_I', y_{(I)}')'$,

$$ y_I = X_I\beta + a + \varepsilon_I , \qquad y_{(I)} = X_{(I)}\beta + \varepsilon_{(I)} , \tag{2.11} $$

where $a$ is a $(k \times 1)$ vector of mean-shift constants, and $\varepsilon_I \sim N(0, \sigma^2 I_k)$ independent of $\varepsilon_{(I)} \sim N(0, \sigma^2 I_{n-k})$. This model was used by Guttman (1973) and further utilized by Guttman et al (1978).

The second model we will be involved with is the so-called variance-inflation model, which says that the distribution of $\varepsilon$ of (2.11) is such that

$$ \varepsilon_i \sim (1-\alpha)\,N(0, \sigma^2) + \alpha\,N(0, \delta^2\sigma^2) , \tag{2.12} $$

where $\alpha$ is small and $\delta^2 > 1$ is usually thought of as being large. These models have been compared in Freeman (1980), Eddy (1980) and Pettit and Smith (1985).
3. BAYESIAN IDENTIFICATION METHODS USING THE NULL MODEL.
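Several authors (for example, see Box (1980), Geisser (1980, 1987, 1988), Pettit and Smith (1985) and Peña and Tiao (1992)) have advocated the use of the predictive distribution to arrive at methods for detecting outlying observations. A main idea here is to compute the predictive density

$$ p(y_I \mid y_{(I)}) = \int p(y_I \mid \theta)\, p(\theta \mid y_{(I)})\, d\theta \tag{3.1} $$

where $y_I$ is a vector of $k$ observations, and where $y_{(I)}$ is the sample data at hand on which the posterior for the vector of parameters $\theta$ is based. For the linear models with normal noise as given in (2.1), it is well known that for the non-informative prior (2.2)

$$ p(y_I \mid y_{(I)}) = K_1\, (s^2_{(I)})^{-k/2}\, |I - H_I|^{1/2}\, [1 + Q_I]^{-(n-p)/2} \tag{3.2} $$

where

$$ K_1 = \frac{\Gamma\!\left(\frac{n-p}{2}\right)}{\left\{\Gamma\!\left(\frac{1}{2}\right)\right\}^{k}\, \Gamma\!\left(\frac{n-p-k}{2}\right) (n-p-k)^{k/2}} \tag{3.3} $$

and

$$ Q_I = \frac{(y_I - X_I\hat\beta_{(I)})'(I - H_I)(y_I - X_I\hat\beta_{(I)})}{(n-p-k)\,s^2_{(I)}} . \tag{3.4} $$

Here $H_I$ is the $k \times k$ block of the hat matrix $H$ of (2.6) formed by using the $k$ rows and columns of $H$ indexed by $I$, or

$$ H_I = X_I(X'X)^{-1}X_I' . \tag{3.4a} $$

Indeed, $H_I$ is referred to as the "leverage of observations $y_I$".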
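For a single observation ($k = 1$) the block $H_I$ reduces to the scalar leverage $h_i$. The following sketch computes these leverages for the design described for the simulated example of Section 5 ($x = 1, \ldots, 19$ plus a point at $x = 40$), assuming a straight-line model with intercept, so that $p = 2$. The closed form used here holds only for that simple design; the function name is an illustrative choice.

```python
def leverages_simple_regression(x):
    """Diagonal of H = X (X'X)^{-1} X' when X has columns [1, x] (p = 2)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # For an intercept plus one regressor: h_i = 1/n + (x_i - x_bar)^2 / Sxx.
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Design of the simulated example: x = 1, ..., 19 and a remote point at x = 40.
x_design = list(range(1, 20)) + [40]
h = leverages_simple_regression(x_design)

for i, (xi, hi) in enumerate(zip(x_design, h), start=1):
    print(f"i={i:2d}  x={xi:2d}  h={hi:.2f}")
print("trace(H) =", round(sum(h), 6))
```

The leverages sum to $p = 2$, and the remote point at $x = 40$ carries by far the largest leverage, which is why it is treated as a potentially influential point.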
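Expression (3.2) could be written in another form that will be useful when comparing it to the other approaches to be discussed in this paper. In order to do so, we use the identity (see Cook and Weisberg, 1982, pg. 191)

$$ (n-p-k)\,s^2_{(I)} = (n-p)\,s^2 - e_I'(I - H_I)^{-1}e_I , \tag{3.5} $$

for then

$$ \frac{s^2}{s^2_{(I)}} = \frac{n-p-k}{n-p}\left(1 + \frac{e_I'(I - H_I)^{-1}e_I}{(n-p-k)\,s^2_{(I)}}\right) . \tag{3.6} $$

Here, we have partitioned the residual vector $e$,

$$ e = (I - H)y \tag{3.6a} $$

using

$$ e = (e_I',\ e_{(I)}')' = (I - H)(y_I',\ y_{(I)}')' \tag{3.6b} $$

with $X = (X_I' : X_{(I)}')'$ used in constructing $H = X(X'X)^{-1}X'$. Now, it can be shown that

$$ y_I - X_I\hat\beta_{(I)} = (I - H_I)^{-1}e_I \tag{3.7} $$

and, therefore, (3.4) itself can be written as

$$ Q_I = \frac{e_I'(I - H_I)^{-1}e_I}{(n-p-k)\,s^2_{(I)}} . \tag{3.8} $$

Hence, since (3.6) and (3.8) give $s^2_{(I)} = (n-p)s^2/[(n-p-k)(1+Q_I)]$, we may rewrite (3.2) as follows

$$ p(y_I \mid y_{(I)}) = K_1\, |I - H_I|^{1/2} \left[\frac{(n-p)\,s^2}{(n-p-k)(1+Q_I)}\right]^{-k/2} [1 + Q_I]^{-(n-p)/2} \tag{3.9} $$

or

$$ p(y_I \mid y_{(I)}) = K\, |I - H_I|^{1/2}\, [1 + Q_I]^{-(n-p-k)/2} \tag{3.10} $$

where

$$ K = \left(\frac{(n-p)\,s^2}{n-p-k}\right)^{-k/2} K_1 . \tag{3.11} $$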
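The deletion identities above lend themselves to a quick numerical check. The sketch below does this for a straight-line regression with a single deleted observation ($k = 1$, $p = 2$), comparing a direct refit without the observation against the shortcut formulas, here read as $(n-p-k)s^2_{(I)} = (n-p)s^2 - e_I'(I-H_I)^{-1}e_I$ for (3.5) and $y_I - X_I\hat\beta_{(I)} = (I-H_I)^{-1}e_I$ for (3.7). The data are hypothetical and chosen only for illustration; `ols_line` is an assumed helper, not something defined in the paper.

```python
def ols_line(x, y):
    """Least-squares fit of y = b0 + b1*x.

    Returns (b0, b1, residuals, leverages, residual sum of squares).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    lev = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    rss = sum(e * e for e in resid)
    return b0, b1, resid, lev, rss

# Hypothetical illustrative data (not from the paper).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 14.0]
n, p, k = len(x), 2, 1
i = 6  # zero-based index of the deleted observation, i.e. I = {7}

b0, b1, e, h, rss_full = ols_line(x, y)          # rss_full = (n-p) s^2
x_del, y_del = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
b0_del, b1_del, _, _, rss_del = ols_line(x_del, y_del)  # (n-p-k) s_(I)^2

# Identity (3.5): deleted RSS from full-sample quantities only.
rhs_35 = rss_full - e[i] ** 2 / (1.0 - h[i])

# Identity (3.7): predictive residual from the ordinary residual.
pred_resid = y[i] - (b0_del + b1_del * x[i])
rhs_37 = e[i] / (1.0 - h[i])

# (3.8) and (3.6): Q_I and the ratio s^2 / s_(I)^2.
q_stat = e[i] ** 2 / ((1.0 - h[i]) * rss_del)
ratio_lhs = (rss_full / (n - p)) / (rss_del / (n - p - k))
ratio_rhs = (n - p - k) / (n - p) * (1.0 + q_stat)

print("(3.5):", rss_del, rhs_35)
print("(3.7):", pred_resid, rhs_37)
print("(3.6):", ratio_lhs, ratio_rhs)
```

Each printed pair should agree up to rounding error, which is what allows the predictive ordinate to be computed from the full-sample fit without refitting for every candidate set $I$.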
Note that p(YrIY(r») behaves in a rather expected way: the larger the studentized residual statistic (3.8), as measured by the quadratic function (3.8) that takes into account the leverage - the smaller the ordinate of the density (3.10). We also note that small ordinates could occur because of large leverage, due to the presence of the factor
11 -
Hr1 1/ 2 in (3.10). Because of all this, many authors use the predictive to rank sets Yr, deeming these with lowest p(Yrly(/») as outliers.Another approach that utilizes the predictive distribution is the one by Johnson and Geisser (1983). They showed that when monotoring the change in the predictive distri bution when some observations are dropped, measures of influence and outlyingness can be built that are related to the standard ones used in the literature. Johnson and Geisser (1985) and Guttman and Peila (1988) extended these ideas to changes in certain posterior distributions. Recently, Guttman and Peiia (1992) ha\'e shown that the beha\'iour of thesE' changes are related to the probability of a group of observations being spuriously gener ated that is, generated not according to the null model. The probability will be precisdy defined and discussed in section 4.
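The third approach has been suggested by Chaloner and Brant (1988) and is based on the posterior probabilities

$$ p_i = P(|\varepsilon_i| > q\sigma \mid y, X) \tag{3.12} $$

where the $\varepsilon_i$ are the unobserved residuals, given by

$$ \varepsilon_i = y_i - x_i'\beta . \tag{3.13} $$

Following the results outlined in Section 2, we have that the posterior distribution of the $\varepsilon_i$, as given by (3.13), conditional on $\sigma^2$, is easily seen to be, on using (2.3), such that

$$ \varepsilon_i \mid \sigma^2, y, X \sim N(e_i,\ \sigma^2 h_i) \tag{3.14} $$

where $e_i = y_i - x_i'\hat\beta$ is the residual of the observation $y_i$, and $h_i$ is the $i$-th diagonal element of $H$ given in (2.6), which is to say, $h_i$ is the leverage of $y_i$. Chaloner and Brant (1988) have shown that $p_i$ of (3.12) can be written in the form

$$ p_i = \int_0^\infty \left[\Phi(-z_1) + \Phi(-z_2)\right] p(\tau \mid y, X)\, d\tau \tag{3.15} $$

where $\tau = \sigma^{-2}$, so that $p(\tau \mid y, X)$ is the density of a $\chi^2_{n-p}/[(n-p)s^2]$ variable, with

$$ z_1 = \frac{q - e_i\sqrt{\tau}}{\sqrt{h_i}} , \qquad z_2 = \frac{q + e_i\sqrt{\tau}}{\sqrt{h_i}} . \tag{3.16} $$

Chaloner and Brant (1988) discuss appropriate choices of $q$ and then declare an observation $y_i$ to be an "outlier" if its $p_i$ is large.

To help us interpret the properties of the Chaloner-Brant procedure, we will obtain an explicit formula for (3.15) as a function of the standard diagnostic measures. We may, of course, write $p(\varepsilon_i, \sigma^2 \mid y, X)$ as

$$ p(\varepsilon_i, \sigma^2 \mid y, X) = p(\varepsilon_i \mid \sigma^2, y, X)\, p(\sigma^2 \mid y, X) \tag{3.17} $$

and from (3.14) and results of Section 2, we have that the right hand side of (3.17) is

$$ \frac{1}{\sqrt{2\pi\sigma^2 h_i}} \exp\left\{-\frac{(\varepsilon_i - e_i)^2}{2\sigma^2 h_i}\right\} \cdot K\,(\sigma^2)^{-[(n-p)/2+1]} \exp\left\{-\frac{(n-p)s^2}{2\sigma^2}\right\} . \tag{3.18} $$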
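Now $p_i$ as given in (3.15) may be written as $p_i = p_{i1} + p_{i2}$, where

$$ p_{i1} = P(\varepsilon_i > q\sigma \mid y, X) , \qquad p_{i2} = P(\varepsilon_i < -q\sigma \mid y, X) . \tag{3.19} $$

Now

(3.20)

(3.21)

and letting $w = (n-p)s^2/\sigma^2$, we find that

$$ p_{i1} = \int_0^\infty \Phi\left(\frac{e_i}{\sqrt{h_i s^2}}\sqrt{\frac{w}{n-p}} - \frac{q}{\sqrt{h_i}}\right) h_{n-p}(w)\, dw , \tag{3.22} $$

where, in general,

$$ h_m(w) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, w^{m/2-1} e^{-w/2} , \qquad w > 0 , \tag{3.23} $$

is the density of a $\chi^2_m$-variable. Now it is proved in Appendix I of this paper that the right hand side of (3.22) is in the form of the probability that a non-central T-variable with $(n-p)$ degrees of freedom and non-centrality parameter $\Delta = q/\sqrt{h_i}$ has value less than or equal to $t = e_i/\sqrt{h_i s^2}$, that is,

$$ p_{i1} = P\left(T \le \frac{e_i}{\sqrt{h_i s^2}} \ \Big|\ \Delta = \frac{q}{\sqrt{h_i}};\ n-p\right) \tag{3.24} $$

and, as is well known (details are given in Appendix I), (3.24) has an approximation, for moderate to large $n$, given by

$$ p_{i1} \approx P\left(Z \le \frac{t - \Delta}{\sqrt{1 + \dfrac{t^2}{2(n-p)}}}\right) \tag{3.25} $$

where $Z \sim N(0,1)$, that is, we have

$$ p_{i1} \approx \Phi\left(\frac{\dfrac{e_i}{\sqrt{h_i s^2}} - \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}}\right) . \tag{3.25a} $$

Similarly, we may easily find that

$$ p_{i2} = P(\varepsilon_i < -q\sigma \mid y, X) = 1 - \int_0^\infty \Phi\left[\frac{e_i}{\sqrt{h_i s^2}}\sqrt{\frac{w}{n-p}} + \frac{q}{\sqrt{h_i}}\right] h_{n-p}(w)\, dw \tag{3.26} $$

$$ p_{i2} \approx 1 - \Phi\left(\frac{\dfrac{e_i}{\sqrt{h_i s^2}} + \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}}\right) . \tag{3.26a} $$

Hence,

$$ p_i = p_{i1} + p_{i2} \approx \Phi\left(\frac{\dfrac{e_i}{\sqrt{h_i s^2}} - \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}}\right) + 1 - \Phi\left(\frac{\dfrac{e_i}{\sqrt{h_i s^2}} + \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}}\right) . \tag{3.27} $$

To summarize we have

$$ p_i \approx 1 - \Phi(u_1) + \Phi(u_2) \tag{3.28} $$

with

$$ u_1 = \frac{\dfrac{e_i}{\sqrt{h_i s^2}} + \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}} \tag{3.28a} $$

and

$$ u_2 = \frac{\dfrac{e_i}{\sqrt{h_i s^2}} - \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{e_i^2}{2h_i(n-p)s^2}}} . \tag{3.28b} $$

It can be shown that $u_1$ and $u_2$ can be written as:

$$ u_1 = \frac{\dfrac{r_i}{\sqrt{l_i}} + \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{r_i^2}{2(n-p)\,l_i}}} \tag{3.29} $$

$$ u_2 = \frac{\dfrac{r_i}{\sqrt{l_i}} - \dfrac{q}{\sqrt{h_i}}}{\sqrt{1 + \dfrac{r_i^2}{2(n-p)\,l_i}}} \tag{3.30} $$

where $r_i$ is the studentized residual $e_i/\sqrt{s^2(1 - h_i)}$, and $l_i$ is the measure of leverage given by

$$ l_i = \frac{h_i}{1 - h_i} . \tag{3.31} $$

Now, suppose that $r_i$ is positive and fixed. If we now let $h_i \to 1$, that is, the leverage of the observation is very high, then $l_i \to \infty$ and $u_1 \to q$, $u_2 \to -q$, and from (3.28) we see that $p_i$ goes to $2\Phi(-q)$, which is the conditional probability, given $\beta$, $\sigma^2$, that $y_i$ is outlying in the Chaloner-Brant sense, that is,

$$ 2\Phi(-q) = P\left[\,|y_i - x_i'\beta| > q\sigma \mid \beta, \sigma^2\right] . \tag{3.32} $$

Therefore, the posterior probability of a high leverage observation $y_i$ ($h_i \approx 1$) being an outlier in the Chaloner-Brant sense is, in moderate to large samples, very nearly

$$ \lim_{h_i \to 1} p_i = 2\Phi(-q) \tag{3.33} $$

regardless of the data, so that leverage is not always being treated by $p_i$ of (3.15) in the way we would wish in moderate to large samples.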
It is interesting to note the similarities between the Chaloner and Brant (1988) result for $p_i$, as stated in (3.15), and the approximation derived in this paper, stated in (3.28). In fact, it is easy to see that after setting $\tau$ in (3.15) equal to

$$ \tau = \frac{w}{(n-p)s^2} , \qquad w \sim \chi^2_{n-p} , \tag{3.34} $$

$p_i$ of (3.15) can be written as

$$ p_i = \int_0^\infty \Phi\left(\frac{e_i}{\sqrt{h_i s^2}}\sqrt{\frac{w}{n-p}} - \frac{q}{\sqrt{h_i}}\right) h_{n-p}(w)\, dw + \int_0^\infty \Phi\left(-\frac{e_i}{\sqrt{h_i s^2}}\sqrt{\frac{w}{n-p}} - \frac{q}{\sqrt{h_i}}\right) h_{n-p}(w)\, dw , \tag{3.35} $$

while the approximation (3.28) is, of course,

$$ p_i \approx \Phi\left(\frac{e_i}{a\sqrt{h_i s^2}} - \frac{q}{a\sqrt{h_i}}\right) + \Phi\left(-\frac{e_i}{a\sqrt{h_i s^2}} - \frac{q}{a\sqrt{h_i}}\right) \tag{3.36} $$

where

$$ a = \sqrt{1 + \frac{e_i^2}{2h_i(n-p)s^2}} . \tag{3.36a} $$

We note that in (3.35) and (3.36) the signs attached to $e_i/\sqrt{h_i s^2}$ and $q/\sqrt{h_i}$ that appear in the $\Phi$ functions are the same, but that the (approximate) effect of integration with respect to $w \sim \chi^2_{n-p}$ is to remove $\sqrt{w/(n-p)}$ and replace it with $1/a$, while changing $q/\sqrt{h_i}$ to $q/(a\sqrt{h_i})$.
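The leverage effect discussed in this section can be made concrete with a small sketch. It evaluates the normal approximation to $p_i$ in the form $1 - \Phi(u_1) + \Phi(u_2)$, with $u_1, u_2 = (r_i/\sqrt{l_i} \pm q/\sqrt{h_i})/\sqrt{1 + r_i^2/(2(n-p)l_i)}$ and $l_i = h_i/(1-h_i)$, for a fixed studentized residual and increasing leverage. This is a sketch under the assumption that the approximation takes that form; the function names and the values of $r_i$, $q$ and $n - p$ are illustrative choices.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chaloner_brant_approx(r, h, q, dof):
    """Normal approximation to p_i = P(|eps_i| > q*sigma | y, X).

    r   : studentized residual e_i / sqrt(s^2 (1 - h_i))
    h   : leverage h_i, with 0 < h < 1
    q   : cut-off constant of the Chaloner-Brant procedure
    dof : residual degrees of freedom n - p
    """
    l = h / (1.0 - h)                       # leverage measure l_i
    scale = math.sqrt(1.0 + r * r / (2.0 * dof * l))
    u1 = (r / math.sqrt(l) + q / math.sqrt(h)) / scale
    u2 = (r / math.sqrt(l) - q / math.sqrt(h)) / scale
    return 1.0 - std_normal_cdf(u1) + std_normal_cdf(u2)

# Illustrative values: fixed studentized residual r = 3, q = 2, n - p = 18.
r_fixed, q_cut, dof = 3.0, 2.0, 18
limit = 2.0 * std_normal_cdf(-q_cut)        # limiting value as h -> 1

for h_i in (0.05, 0.30, 0.62, 0.90, 0.99, 0.999):
    p_approx = chaloner_brant_approx(r_fixed, h_i, q_cut, dof)
    print(f"h={h_i:5.3f}  approx p_i={p_approx:.4f}")
print(f"limit 2*Phi(-q) = {limit:.4f}")
```

With the studentized residual held fixed, the approximate probability decreases as the leverage grows and settles at $2\Phi(-q)$, whatever the data, which is the behaviour criticised in the text.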
4. BAYESIAN IDENTIFICATION METHODS USING
ALTER~ATIVE ~10DELS .•
Starting with the mean-shift model (2.11), it can be shown (see Guttman and Peña (1992)) that, conditional on $k$, the probability for a given set of $k$ observations indexed by $I$ to be spuriously generated is

(4.1)

where

(4.2)

and where the sum is taken over all sets $I$ of size $k$ of distinct integers from $(1, \ldots, n)$.

Now, on the other hand, it is interesting to note that for the variance inflation model (2.12), the probability that $k$ observations indexed by $I$ are generated with 'noise' given by $N(0, \delta^2\sigma^2)$, $\delta^2 > 1$, and $(n-k)$ generated by $N(0, \sigma^2)$ takes the form, as proved in Box and Tiao (1968),

$$W_I = C\left(\frac{\alpha}{1-\alpha}\right)^k \delta^{-k}\left(\frac{|X'X|}{|X'X - \phi X_I'X_I|}\right)^{1/2}\left(\frac{s^2}{s^2_{(I)}}\right)^{(n-p)/2}, \tag{4.3}$$

where $C$ is a normalizing constant that can be shown to be the probability of no outliers, and $\phi = 1 - \delta^{-2}$. (For a precise definition of $s_{(I)}$ see Box and Tiao (1968).) When $\delta$ is large, it can be shown (Peña and Tiao (1992)) that $W_I$ is approximately

$$W_I \approx C\left(\frac{\alpha}{1-\alpha}\right)^k \delta^{-k}\left(\frac{|X'X|}{|X_{(I)}'X_{(I)}|}\right)^{1/2}\left(\frac{s^2}{s^2_{(I)}}\right)^{(n-p)/2}. \tag{4.3a}$$

Adding up the values $W_I$ for all sets of size $k$ we obtain the probability of exactly $k$ outliers in the sample, and, in turn, by adding all the $W_I$'s up, the constant $C$ could be obtained.

For fixed $k$ and $\delta$ large, the conditional probability that a particular set of $k$ observations indexed by $I$ are spuriously generated with noise variances $\delta^2\sigma^2$ is (Peña and Tiao (1992)):

(4.4)

which in turn can be written as

(4.5)
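For fixed $k$, the factors $C$, $(\alpha/(1-\alpha))^k$ and $\delta^{-k}$ are common to every set $I$, so the conditional probabilities depend only on the determinant and residual-variance terms. The following minimal Python sketch enumerates all sets of size $k$ for a straight-line model. It is an illustration under stated assumptions, not the authors' implementation: the weight is taken proportional to $|X_{(I)}'X_{(I)}|^{-1/2}\,\mathrm{RSS}_{(I)}^{-(n-p)/2}$, where $\mathrm{RSS}_{(I)}$ is the residual sum of squares after deleting the observations in $I$ (Box and Tiao (1968) give the precise definition of $s_{(I)}$), and the exponent $(n-p)/2$ is assumed. The function names are hypothetical. The data are those of the simulated example of Section 5, with 4 added to $y_{11}$.

```python
from itertools import combinations
from math import exp, log


def fit_stats(x, y):
    """Return (det(X'X), residual sum of squares) for the line y = b0 + b1 * x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx              # determinant of [[n, sx], [sx, sxx]]
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return det, rss


def conditional_outlier_probs(x, y, k, p=2):
    """Normalized large-delta weights over all index sets of size k (assumed form)."""
    n = len(x)
    log_w = {}
    for subset in combinations(range(n), k):
        keep = [i for i in range(n) if i not in subset]
        det_i, rss_i = fit_stats([x[i] for i in keep], [y[i] for i in keep])
        # Assumption: weight proportional to det^(-1/2) * RSS^(-(n-p)/2).
        log_w[subset] = -0.5 * log(det_i) - 0.5 * (n - p) * log(rss_i)
    top = max(log_w.values())            # subtract the maximum for numerical stability
    w = {key: exp(val - top) for key, val in log_w.items()}
    total = sum(w.values())
    return {key: val / total for key, val in w.items()}


# Data of Table 5.1 (simulated example), with the outlier added at observation 11.
x = list(range(1, 20)) + [40]
y = [1.955, 2.201, 3.235, 5.862, 5.944, 7.514, 8.397, 9.756, 10.401, 9.659,
     12.375, 14.125, 14.729, 12.622, 15.726, 16.677, 18.318, 18.489, 19.998, 42.607]
y[10] += 4.0

probs = conditional_outlier_probs(x, y, k=1)
most_suspect = max(probs, key=probs.get)  # zero-based index set with the largest weight
```

Working on the log scale avoids underflow, since the residual sums of squares are raised to a power that grows with $n - p$.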
As indicated in Section 3, the predictive density has been advocated as a diagnostic tool. Interestingly, there is a strong connection between $C_I$ (or $P_I$) and the predictive density (3.2). Writing $C_I$ from (4.1) as

(4.6)

and using (3.6) and (3.8) we then have

(4.7)

Hence, from (3.10), we see that

(4.8)

that is to say that the posterior probability $C_I$ is inversely proportional to the ordinate of a predictive density function which is related to the general $k$-variate Student-t distribution with $n - k - p$ degrees of freedom. The smaller the predictive ordinate, which occurs for large residuals in absolute value, the larger the probability $C_I$ that the corresponding observations are spuriously generated.

5. TWO ILLUSTRATIVE EXAMPLES.
We first illustrate the behaviour of the predictive and posterior probabilities for the residuals with a simulated example. We have generated 20 observations using the model $y = 1 + x + \epsilon$, where $\epsilon$ is $N(0, 1)$. The values for $y$, $x$ are given in Table 5.1. We have included a potential influential point by locating $x_{20}$ at 40. Then, we introduced an outlier by adding 4 to the original $y_{11}$, and proceeded to compute, for this new set of data, the probability of each observation being an outlier using the posterior probabilities $P_i$ given by (3.15) with $q = 2$, and the $C_i$ as given by (4.1), which are, as demonstrated in Section 4, inversely proportional to the predictive ordinate. As anticipated, the values for $P_i$ are all very small, except for a large value of .9552 at $i = 11$, which is 2.2 times greater than the next largest value, which occurs at $i = 14$. As for the $C_i$, the largest occurs at $i = 11$ with a value of .9708, which is 511 times greater than the next largest value, which occurs at $i = 20$. We note that the outlier will be identified by both procedures, although in a more powerful way by the $C_i$ than by the $P_i$.

Table 5.1. Data for the simulated example

     i     x       y       h
     1     1     1.955    .13
     2     2     2.201    .11
     3     3     3.235    .10
     4     4     5.862    .09
     5     5     5.944    .08
     6     6     7.514    .07
     7     7     8.397    .06
     8     8     9.756    .06
     9     9    10.401    .05
    10    10     9.659    .05
    11    11    12.375    .05
    12    12    14.125    .05
    13    13    14.729    .05
    14    14    12.622    .05
    15    15    15.726    .06
    16    16    16.677    .06
    17    17    18.318    .07
    18    18    18.489    .08
    19    19    19.998    .09
    20    40    42.607    .62

The next experiment we carried out was to introduce to the original set of data of Table 5.1 an outlier at the high leverage point $x_{20} = 40$, by again adding 4 to the original observed data point. Here, however, the largest $P_i$ value is .6758 and occurs at $i = 14$, instead of the expected $i = 20$. Indeed, the next largest of the $P_i$'s is $P_{20}$, with the value [...]. These results are in agreement with the theoretical results of Section 3, in which we have shown that the behaviour of $P_i$ could be unsatisfactory for high leverage points.
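The leverages $h$ reported in Table 5.1 can be reproduced directly from the design. The following minimal Python sketch assumes the straight-line model with an intercept, for which $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The function name is hypothetical.

```python
def leverages(x):
    """Diagonal of the hat matrix for the model y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [1 / n + (v - xbar) ** 2 / sxx for v in x]


# Design of Table 5.1: x = 1, ..., 19 plus the high leverage point x = 40.
x = list(range(1, 20)) + [40]
h = leverages(x)
h_rounded = [round(v, 2) for v in h]  # two decimals, as in the table
```

The leverages sum to $p = 2$, and the point at $x_{20} = 40$ alone accounts for .62 of that total, which is what makes it a useful test case for the behaviour of $P_i$.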
As a second example we have chosen a set of data in which there are no pronounced differences among the leverages. For this purpose, we will use the consumption/income data reported by Zellner and Moulton (1985). They compared the linear, log and logistic transformations for data obtained from 26 countries. The model we will use as our example is described as follows. Let, for the $i$-th country, $Z_i$ be the permanent consumption expenditures and $U_i$ be the permanent disposable income, both on a per capita basis, and define

$$x_i = \log U_i \quad \text{and} \quad y_i = \log\frac{Z_i/U_i}{1 - Z_i/U_i}. \tag{5.1}$$

The logit model, in terms of $x_i$ and $y_i$, is estimated by

$$\hat{y}_i = 3.828 - .215\,x_i; \qquad s = .2236. \tag{5.2}$$

Let us compare the behaviour of $P_i$ and $C_i$ for the logit model applied to this set of data. In order to have a prior probability that an observation is not an outlier equal to .99, the value of $q$ should be 3. Now for $q = 3$, the maximum $P_i$ is $P_{14} = .1634$, followed by $P_{18} = .0020$. We conclude that the use of the $P_i$'s does not focus attention on any one observation.

Table 5.2 provides the probabilities of having exactly $k$ outliers in the sample using the scale contaminated linear model (4.3a), for $\delta = 5$ and the prior probability of no outliers equalling .95, as before. It can be seen that the most likely event is the presence of one outlier. To identify which point has been spuriously generated, we turn to the conditional probabilities $C_i$ as given by (4.1). The largest value is attained at $i = 14$, and is $C_{14} = .5665$, followed in magnitude by $C_{18} = .0813$, a factor of 7, approximately. The eighteenth observation corresponds to Japan, and, as it happens, it was the only Asian country involved in this set of data. We should note that the data come from 1959 and 1960 UN data.

Table 5.2. Probability of k outliers for the logistic model with Zellner-Moulton data.

    k                1        2        3        4
    Probability    .3732    .1961    .0611    .0130
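The transformation (5.1) and the fitted line (5.2) used in this example can be sketched in a few lines of Python. The consumption and income figures below are hypothetical and serve only to illustrate the arithmetic; they are not taken from the Zellner and Moulton data, and the function names are likewise hypothetical.

```python
from math import log


def logit_transform(consumption, income):
    """Return (x, y) as in (5.1): x = log U, y = log of the odds of Z / U."""
    ratio = consumption / income      # must lie strictly between 0 and 1
    return log(income), log(ratio / (1.0 - ratio))


def fitted_logit(x):
    """Fitted value from the estimated logit model (5.2)."""
    return 3.828 - 0.215 * x


# Hypothetical country: consumption Z = 800 and income U = 1000 per capita.
x_i, y_i = logit_transform(800.0, 1000.0)
residual_i = y_i - fitted_logit(x_i)  # residual relative to the fitted line
```

Because $y_i$ is the log-odds of the consumption share, it is unbounded even though $Z_i/U_i$ is confined to $(0, 1)$, which is what makes the linear specification in $x_i$ workable.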
ACKNOWLEDGEMENTS
The authors would like to acknowledge support, with thanks, for this research. Daniel Peña was supported by DGICYT (Spain), under Grant No. PB90-0266. Irwin Guttman was supported by NSERC (Canada) through Grant No. A8743.
APPENDIX
I. The non-central T-distribution.

We remind the reader that the definition of the classical non-central $T$-variable comes about as follows. Suppose $Z$ and $W$ are independent random variables with distributions given by

$$Z \sim N(0, 1) \quad \text{and} \quad W \sim \chi^2_\nu. \tag{AI.1}$$

Then the non-central $T$-variable with non-central parameter $\delta$ and $\nu$ degrees of freedom is defined as

$$T = \frac{Z + \delta}{\sqrt{W/\nu}}. \tag{AI.2}$$

It is very easy to see that the density of the random variable $T$ as defined in (AI.2) may be written as

$$p(t) = \int_0^\infty \sqrt{\frac{w}{\nu}}\,\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}\left[\sqrt{\frac{w}{\nu}}\,t - \delta\right]^2\right\} h_\nu(w)\,dw. \tag{AI.3}$$

Now

$$P(T \le t_0) = \int_{-\infty}^{t_0} p(t)\,dt, \tag{AI.4}$$

so that, inverting the order of integration, we easily find (dropping the subscript zero on the particular value of $T$ given by $t_0$ that we were concerned with) that

$$P(T \le t) = \int_0^\infty \Phi\left(\sqrt{\frac{w}{\nu}}\,t - \delta\right) h_\nu(w)\,dw. \tag{AI.5}$$
Hence, the right hand side of (3.22) and the integral of the second line of (3.26) are easily related to (AI.5), as indicated in (3.24) and (3.26a) respectively.

Now in general, $P(T \le t)$, where $T$ is non-central with $\nu$ degrees of freedom and non-central parameter $\delta$, has, for moderate to large $\nu$, a well known approximation, which we used in Section 2. This approximation, as we will see, rests on the well known result that, for moderate or large $\nu$, $\sqrt{W} = a$ has the approximate distribution given by ($\dot\sim$ denotes "approximately distributed as")

$$a = \sqrt{W} \;\dot\sim\; N\left(\sqrt{\nu},\, 1/2\right). \tag{AI.6}$$

This result is easily obtained from the results of Fisher as quoted in Kendall and Stuart (1977, pg 400). Now we have that

$$P(T \le t) = P\left(Z + \delta \le \frac{t}{\sqrt{\nu}}\sqrt{W}\right), \tag{AI.7}$$

so that

$$P(T \le t) = P\left(Z - \frac{t}{\sqrt{\nu}}\,a \le -\delta\right). \tag{AI.8}$$

But for large $\nu$, using (AI.6), we have that

$$U = Z - \frac{t}{\sqrt{\nu}}\,a \;\dot\sim\; N\left(-t,\; 1 + \frac{t^2}{2\nu}\right). \tag{AI.9}$$

Hence we have that $P(T \le t) \approx P(U \le -\delta)$, or

$$P(T \le t) \approx P\left(Z \le \frac{t - \delta}{\sqrt{1 + t^2/2\nu}}\right) = \Phi(u), \tag{AI.10}$$

where $u = (t - \delta)/\sqrt{1 + t^2/2\nu}$ and $Z \sim N(0, 1)$. Hence, the density function of $T$ is, differentiating, such that

$$p(t) \approx \phi(u)\left|\frac{du}{dt}\right|, \tag{AI.11}$$

where of course $\phi(u)$ is the density of a $N(0, 1)$ random variable, and, to repeat,

$$u = \frac{t - \delta}{\sqrt{1 + t^2/2\nu}}. \tag{AI.12}$$

We have that the Jacobian of the transformation from $t$ to $u$ given by (AI.12) has absolute value $|J| = |dt/du|$, so that the density of $U$ is

$$g(u) \approx \phi(u)\left|\frac{du}{dt}\right| \times \left|\frac{dt}{du}\right|, \tag{AI.13}$$

that is, the density of $U$ is $g(u) = \phi(u)$, and we have that, approximately, $U$ has the density of a standard normal $N(0, 1)$ variable, for large $\nu$, and the theorem is proved.
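The normal approximation to the non-central $T$ distribution function can be checked by simulating the definition of $T$ directly. The following minimal Python sketch draws $Z \sim N(0,1)$ and $W \sim \chi^2_\nu$ independently, forms $T = (Z + \delta)/\sqrt{W/\nu}$, and compares the empirical value of $P(T \le t)$ with $\Phi\big((t - \delta)/\sqrt{1 + t^2/2\nu}\big)$. The function names, the seed, and the values of $t$, $\delta$ and $\nu$ are hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def noncentral_t_cdf_mc(t, delta, nu, draws=40000, seed=1):
    """Monte Carlo estimate of P(T <= t) from the definition of T."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        z = rng.gauss(0.0, 1.0)
        w = rng.gammavariate(nu / 2, 2.0)  # chi-square with nu degrees of freedom
        if (z + delta) / sqrt(w / nu) <= t:
            hits += 1
    return hits / draws


def noncentral_t_cdf_approx(t, delta, nu):
    """Normal approximation Phi(u), with u = (t - delta) / sqrt(1 + t^2 / (2 nu))."""
    return NormalDist().cdf((t - delta) / sqrt(1 + t * t / (2 * nu)))


# Hypothetical check at t = 3, delta = 2, nu = 30.
simulated = noncentral_t_cdf_mc(3.0, 2.0, 30)
approximate = noncentral_t_cdf_approx(3.0, 2.0, 30)
```

The chi-square draw uses the fact that a $\chi^2_\nu$ variable is a gamma variable with shape $\nu/2$ and scale 2. The agreement is expected to improve as $\nu$ grows, in line with the large-$\nu$ argument above.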
REFERENCES
Atkinson, A.C. (1985). Plots, Transformations and Regression. Oxford: Clarendon Press.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics. New York: John Wiley.
Box, G.E.P. (1980). "Sampling and Bayes Inference in Scientific Modelling and Robustness." Journal of the Royal Statistical Society, Series A 143, 383-430 (with discussion).
Box, G.E.P., and Meyer, R. (1986). "An Analysis for Unreplicated Fractional Factorials." Technometrics 28, 11-18.
Box, G.E.P., and Tiao, G.C. (1968). "A Bayesian Approach to Some Outlier Problems." Biometrika 55, 119-129.
Casella, G. and George, E.I. (1992). "An Introduction to Gibbs Sampling." American Statistician. (To Appear)
Chaloner, K. and Brant, R. (1988). "A Bayesian approach to outlier detection and residual analysis." Biometrika 75, 651-659.
Cook, R.D., and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
Eddy, W.F. (1980). "Discussion of P. Freeman's paper." Bayesian Statistics, J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Editors. Valencia University Press, 370-373.
Freeman, P.R. (1980). "On the number of outliers in data from a linear model." Bayesian Statistics, J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Editors. Valencia University Press, 349-365.
Geisser, S. (1980). "Discussion of a paper by G.E.P. Box". Journal of the Royal Statistical Society, Series A 143, 416-7.
Geisser, S. (1987). "Influential observations, diagnostics and discordancy tests". Journal of Applied Statistics 14, 133-42.
Geisser, S. (1988). "Predictive approaches to discordancy testing". Bayesian and Likelihood Influence: Essays in Honor of George Barnard, S. Geisser, J.S. Hodges, S.J. Press and A. Zellner, Editors. Amsterdam: North Holland.
Guttman, I. (1973). "Care and Handling of Univariate or Multivariate Outliers in Detecting Spuriosity - a Bayesian Approach". Technometrics 15, 723-738.
Guttman, I., Dutter, R., and Freeman, P.R. (1978). "Care and Handling of Univariate Outliers in the General Linear Model to Detect Spuriosity - A Bayesian Approach." Technometrics 20, 187-193.
Guttman, I. and Peña, D. (1988). "Outliers and Influence: Evaluation of Posteriors of Parameters in Linear Model." Bayesian Statistics 3, J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, Editors. Oxford University Press, 631-640.
Guttman, I. and Peña, D. (1992). A Bayesian Look at Diagnostics in the Univariate Linear Model. Technical Report, Department of Statistics and Econometrics, Universidad Carlos III de Madrid, Getafe, Spain.
Hampel, F.R. et al. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley, New York.
Huber, P. (1981). Robust Statistics. John Wiley, New York.
Johnston, W., and Geisser, S. (1983). "A Predictive View of the Detection and Characterization of Influential Observations in Regression Analysis." Journal of the American Statistical Association 78, 137-144.
Johnston, W., and Geisser, S. (1985). "Estimative Influence Measures of the Multivariate General Linear Model." Journal of Statistical Planning and Inference 11, 33-56.
Juan, J. and Peña, D. (1992). "A Simple Method to Identify Significant Effects in Unreplicated 2-level Factorial Designs". Communications in Statistics, Theory and Methods 21, 1383-1403.
Kendall, M., and Stuart, A. (1977). The Advanced Theory of Statistics, Volume 1 - Distribution Theory (Fourth Edition). Macmillan Publishing Company, New York.
Ljung, G.M. (1989). "A note on estimation of Missing Values in Time Series". Communications in Statistics, Simulation and Computation 18, 459-465.
Peña, D. and Maravall, A. (1991). "Interpolation, Outliers and Inverse Autocorrelations." Communications in Statistics, Theory and Methods 20, 3175-3186.
Peña, D. and Tiao, G.C. (1992). "Bayesian Robustness Functions for Linear Models." Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds). Oxford University Press, 365-388.
Pettit, L.I. and Smith, A.F.M. (1985). "Outliers and Influential Observations in Linear Models." Bayesian Statistics 2, J.M. Bernardo, M.H. DeGroot, D.V. Lindley, A.F.M. Smith (eds). Elsevier Science Publishers, 473-494.
Tussell, F. (1990). "Testing for Interaction in Two-Way ANOVA tables with no replication." Computational Statistics and Data Analysis 10, 29-45.