THE EFFECT OF A SINGLE POINT ON CORRELATION AND SLOPE
DAVID L. FARNSWORTH
Department of Mathematics
Rochester Institute of Technology
Rochester, New York 14623 U.S.A.
(Received December 8, 1989 and in revised form April 27, 1990)
ABSTRACT. By augmenting a bivariate data set with one point, the correlation coefficient and/or the slope of the regression line can be changed to any prescribed values. For the target value of the correlation coefficient or the slope, the coordinates of the new point are found as a function of certain statistics of the original data. The location of this new point with respect to the original data is investigated.
KEY WORDS AND PHRASES. Correlation coefficient, deletion technique, influence of data, least squares line, regression diagnostic, sample influence curve.
1980 AMS SUBJECT CLASSIFICATION CODES. 62J05, 62F35.
1. INTRODUCTION.
The correlation coefficient and slope of a least squares regression line are known to be sensitive to a few outliers in the data. For this reason, these estimators are called nonrobust or nonresistant. It is not unexpected that changing one data point or introducing a new data point will greatly perturb the regression line and various statistics associated with it. Actually, we can judiciously introduce a new data point and create a line of our choosing. The effect of adding one new data point is the subject of this paper. For a classroom exercise or a made-up case study, we may want to produce certain outcomes. These might be obtained with the introduction of just one additional point.
The idea of adding one special point in order to force a regression line through the origin was studied by Casella [1]. The location of the point was shown to be related to various statistics for the line.

The robustness of an estimator is in part gauged by the influence function. The influence curve or function for a given estimator is defined pointwise as a limit as a vanishing weight is placed on the data point. These curves are derivatives and measure infinitesimal asymptotic (large sample) influence on the given estimator. The earliest work on this concept was by Hampel [2] and [3]. Also, see Belsley et al. [4], Cook and Weisberg [5], and Huber [6].
Finite sample influence curves are difference quotients, not limits. For sample size $n$, the numerator consists of the difference between a statistic of interest with a given point included in its formulation and the same statistic without that point. The denominator is $1/n$. This sample influence curve is used to measure the influence of a single data point within a sample. The deletion of an influential data point would lead to large values of the influence curve. The diagnostic tests for influence presented in Section 5 are standardized versions of the sample influence curve.

In Sections 2, 3 and 4 for any target value of the slope or the correlation coefficient, the coordinates of one additional point are found as a function of certain statistics of the original data. The locations of these points are presented both analytically and graphically. They are realizations of curves of constant influence where the original $n-1$ data are parameters. The trouble is that we may be caught tampering with the data in this way if conventional diagnostics flag the new point. In Section 5, four of the customary measures of leverage and influence are briefly discussed. In Section 6, an illustrative numerical example is given.
2. PRESCRIBING THE CORRELATION COEFFICIENT.
We are given the original data $\{(x_i, y_i) : i = 1, 2, \ldots, n-1\}$. For convenience take the center $(\bar{x}, \bar{y}) = (0, 0)$ and variances $s_x^2 = \sum x_i^2/(n-1) = 1$ and $s_y^2 = \sum y_i^2/(n-1) = 1$, that is, $x_i$ and $y_i$ are in standard units. The denominators of $s_x^2$ and $s_y^2$ are the full sample size since that choice leads to considerable simplification of subsequent formulas. Thus, the regression line is $y = bx$, and the correlation coefficient is $r = b$. The additional point $(x_n, y_n)$ is expressed in standard deviation units of the original $n-1$ data as $(u, v) = (x_n/s_x, y_n/s_y) = (x_n, y_n)$.
The condition that the correlation coefficient of the augmented data is the prescribed value $\rho$ is
$$\rho = \frac{nr + uv}{\sqrt{(n + u^2)(n + v^2)}}. \qquad (2.1)$$
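As a numerical check of equation (2.1), the following minimal sketch standardizes a small data set in the way described above, expresses a candidate point in those units, and compares the formula with the correlation computed directly from the augmented data. The function names, the raw data, and the candidate point are hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Ordinary sample correlation coefficient."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rho_augmented(n, r, u, v):
    """Right-hand side of equation (2.1); n counts the new point."""
    return (n * r + u * v) / math.sqrt((n + u * u) * (n + v * v))

def standard_units(xs, ys, x_new, y_new):
    """Express (x_new, y_new) in standard units of the original data.

    The variances use the original sample size (len(xs) = n - 1)
    as denominator, matching the convention of Section 2.
    """
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / m)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / m)
    return (x_new - mx) / sx, (y_new - my) / sy

# Hypothetical raw data (n - 1 = 5 points) and a hypothetical new point.
xs = [1.0, 2.0, 4.0, 5.0, 7.0]
ys = [2.0, 1.5, 3.5, 3.0, 6.0]
x_new, y_new = 10.0, 0.0

n = len(xs) + 1
r = pearson(xs, ys)
u, v = standard_units(xs, ys, x_new, y_new)

direct = pearson(xs + [x_new], ys + [y_new])   # from the n raw points
formula = rho_augmented(n, r, u, v)            # from n, r, u, v only
print(direct, formula)  # the two values should agree up to rounding
```

The check works on raw data because the correlation coefficient is unchanged by shifting and positively rescaling either coordinate.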
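The dependence of $\rho$ upon the original data is only through the values of $n$ and $r$. Equation (2.1) is expressible as
$$[(1 - \rho^2)u^2 - n\rho^2]\,v^2 + 2nru\,v + n[n(r^2 - \rho^2) - \rho^2 u^2] = 0. \qquad (2.2)$$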
For each $\rho$, symmetries of the solution curves are about the origin and the lines $v = u$ and $v = -u$ since equation (2.1) is invariant under replacement of $(u, v)$ with $(-u, -v)$ or $(\pm v, \pm u)$. Vertical asymptotes appear at $(1 - \rho^2)u^2 - n\rho^2 = 0$, that is, $u = \pm\sqrt{n}\,\rho/\sqrt{1 - \rho^2} = \pm k$. By symmetry, horizontal asymptotes are $v = \pm k$. Note that $k$ is a monotonically increasing function of $n$ and $\rho^2$.
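The asymptote location $k$ is simple to tabulate. The sketch below assumes the closed form $k = \sqrt{n}\,|\rho|/\sqrt{1 - \rho^2}$ read off from the condition above; the function name is hypothetical and $n = 12$ is used only as an example sample size.

```python
import math

def asymptote_k(n, rho):
    """Positive root of (1 - rho^2) u^2 - n rho^2 = 0, for |rho| < 1."""
    return math.sqrt(n) * abs(rho) / math.sqrt(1.0 - rho * rho)

# k grows with both n and rho^2.
for rho in (0.0, 0.5, 0.6, 0.9):
    print(rho, round(asymptote_k(12, rho), 2))
```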
For $\rho = 0$, the asymptotes are the axes, and equation (2.1) becomes $uv = -nr$. The discriminant of equation (2.2) is
$$4n\rho^2\,[n + u^2]\,[(1 - \rho^2)u^2 + n(r^2 - \rho^2)],$$
which is nonnegative for $u^2 \ge n(\rho^2 - r^2)/(1 - \rho^2)$. Thus, for $\rho^2 \le r^2$ there is no restriction upon $u$. For $\rho^2 > r^2$ the nonnegativity of the discriminant does restrict $u$ from a strip about the $v$-axis. If $r = 0$, the $\rho = r$ solution curve becomes the two axes. If $r^2 = 1$, there is just the $\rho^2 \le r^2$ type of solution curves.

Representative solution curves are shown in Figure 1 for $r = b = 0.5$ and $n = 12$. Because of the symmetries, only the right half-plane is displayed. The regression line $v = 0.5u$ and the asymptotes $u = 2$ and $v = 2$ for the $\rho = r = 0.5$ curve are shown. The $\rho = -0.6$ and the $\rho = +0.6$ curves have asymptotes $u = \pm 2.59$ and $v = \pm 2.59$. Only the bulge about $v = u$ for the $\rho = 0.9$ curve is visible. Its shape is similar to that of the curve for $\rho = 0.6$. The $\rho = 0.9$ curve doubles back, changing its concavity and approaching the vertical asymptote $u = 7.15$ from the left and the horizontal asymptote $v = 7.15$ from below. All of the $\rho = 0$ curve is outside the square $\{-2 < u < 2,\ -2 < v < 2\}$, indicating how unusual the new point $(u, v)$ would have to be to convert from $r = 0.5$ to $\rho = 0$. Of particular interest are the two branches of the $\rho = r = 0.5$ curve. The correlation coefficient is unchanged even by points which are very distant along this curve.
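Points on a solution curve can be computed by treating the squared form of equation (2.1) as a quadratic in $v$ for fixed $u$. The sketch below is one way to do this; the coefficients are obtained here by squaring (2.1) and are therefore an assumption about the exact form of equation (2.2), and the function names are hypothetical. Because squaring discards the sign of $\rho$, each root is checked against the unsquared equation.

```python
import math

def rho_augmented(n, r, u, v):
    """Equation (2.1)."""
    return (n * r + u * v) / math.sqrt((n + u * u) * (n + v * v))

def solve_v(n, r, rho, u):
    """Return every v for which the augmented correlation equals rho.

    Squaring (2.1) gives A v^2 + B v + C = 0. Its roots may belong to
    +rho or to -rho, so only roots reproducing rho itself are kept.
    """
    A = (1.0 - rho * rho) * u * u - n * rho * rho
    B = 2.0 * n * r * u
    C = n * (n * (r * r - rho * rho) - rho * rho * u * u)
    if abs(A) < 1e-12:                 # u sits on a vertical asymptote
        roots = [-C / B] if B != 0 else []
    else:
        disc = B * B - 4.0 * A * C
        if disc < 0:                   # u lies in the excluded strip
            return []
        s = math.sqrt(disc)
        roots = [(-B + s) / (2.0 * A), (-B - s) / (2.0 * A)]
    return [v for v in roots if abs(rho_augmented(n, r, u, v) - rho) < 1e-9]

n, r = 12, 0.5                        # values used for Figure 1
print(solve_v(n, r, 0.9, 1.0))        # u inside the strip about the v-axis
print(solve_v(n, r, 0.9, 6.5))        # points on the rho = 0.9 curve
print(solve_v(n, r, -0.6, 3.0))       # a point on the rho = -0.6 curve
```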
Figure 1: Curves along which $(u, v)$ may be placed to change the value of the correlation
3. PRESCRIBING THE SLOPE.
The condition that the slope of the least squares regression line based on the $n$ points is the prescribed value $d$ is
$$\frac{nb + uv}{n + u^2} = d \qquad (3.1)$$
or
$$v = \frac{n(d - b)}{u} + du.$$
For each $d$ the solution curve is symmetric with respect to the origin. Asymptotes are $u = 0$ and $v = du$ for $d \ne b$. Maxima and minima occur at $(\pm\sqrt{n(d - b)/d},\ \pm 2d\sqrt{n(d - b)/d})$ for $(d - b)/d > 0$. Intersections with the $u$-axis occur at $(\pm\sqrt{n(b - d)/d},\ 0)$ for $(d - b)/d < 0$. Placing $d = b$ in equation (3.1) gives the two lines $u = 0$ and $v = bu$.

Figure 2 displays selected solution curves for $b = r = 0.5$ and $n = 12$. Because of symmetry, just the right half-plane is shown. Of course, the $d = 0$ curve in Figure 2 is the same as the $\rho = 0$ curve in Figure 1.
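The slope curve and its turning point are easy to evaluate. The following sketch, with hypothetical function names, solves equation (3.1) for $v$, locates the right half-plane extremum, and confirms that a point on the curve reproduces the target slope. The values $n = 12$, $b = 0.5$ and $d = 0.7$ are taken from the paper's example.

```python
import math

def slope_augmented(n, b, u, v):
    """Left-hand side of equation (3.1)."""
    return (n * b + u * v) / (n + u * u)

def v_for_slope(n, b, d, u):
    """Solve (3.1) for v: v = n (d - b) / u + d u, valid for u != 0."""
    return n * (d - b) / u + d * u

def turning_point(n, b, d):
    """Right half-plane extremum of the curve, or None if (d - b)/d <= 0."""
    ratio = (d - b) / d
    if ratio <= 0:
        return None
    u = math.sqrt(n * ratio)
    return u, 2.0 * d * u

n, b, d = 12, 0.5, 0.7
u_c, v_c = turning_point(n, b, d)
print(round(u_c, 2), round(v_c, 2))

# Any point on the curve should give back the prescribed slope d.
u = 3.0
print(slope_augmented(n, b, u, v_for_slope(n, b, d, u)))
```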
Figure 2: Curves along which $(u, v)$ may be placed to change the value of the slope to selected values $d$ for $r = b = 0.5$ and $n = 12$.
4. PRESCRIBING BOTH THE CORRELATION COEFFICIENT AND THE SLOPE.
From Figures 1 and 2 it is apparent that the slope can be drastically changed while the correlation coefficient is numerically the same. Setting $\rho = r = b$, equations (2.1) and (3.1) give
$$d = \sqrt{\frac{n + v^2}{n + u^2}}\; b. \qquad (4.1)$$
[...] the two numbers $u$ and $v$. However, the issue of remoteness of $(u, v)$ from $(\bar{x}, \bar{y}) = (0, 0)$ and from $y = bx$ is important and is addressed below.
5. DETECTING THE NEW POINT.
Among the available diagnostic tests for gauging whether a data point $(x_i, y_i)$ is an interloper, four are the most common. Each measures a different feature of data. See Neter et al. [7], Atkinson [8], or Cook and Weisberg [5].
The notation and critical values are not uniformly agreed upon in the literature. Those suggested in Chapter 11 of Neter et al. [7] are used below.

First, the point $(u, v)$ could be an $x$-outlier, that is, it could be far in a horizontal direction from the mean of all $n$ data. Such points influence slope the most for a given change in the $y$-coordinate and are said to have high leverage. The usual measure of leverage for $(x_i, y_i)$ is the diagonal element $h_{ii}$ of the hat matrix. See Hoaglin and Welsch [9].
Each $h_{ii}$ depends upon all the $x$-coordinates but does not depend upon the $y$-coordinates of the data. If $4/n < h_{ii} \le 1$, then $(x_i, y_i)$ will be said to have high leverage. If $1/n \le h_{ii} \le 4/n$, then $(x_i, y_i)$ has low leverage.

The next three diagnostic measures arise from the deletion technique. The impact of a data point is often detected by using this technique: The data point is removed; the statistic of interest is computed; and a standardized difference between that statistic and the corresponding statistic utilizing all the data is analysed as the diagnostic measure. The unstandardized differences divided by $1/n$ are the sample influence curves discussed in the Introduction.

Second,
the point $(u, v)$ could be outlying with respect to the $y$-value. A diagnostic statistic for measuring whether $(x_i, y_i)$ is a $y$-outlying point is the externally studentized residual $d_i^*$, which is a standardized vertical distance at $x_i$ between the point $(x_i, y_i)$ and the regression line created without $(x_i, y_i)$. Each $d_i^*$ is $t$-distributed with $df = n - 3$. Using the prediction limits for a new point $(x_i, y_i)$ based on the other $n - 1$ data's regression line as the criterion for designating a point as outlying is equivalent to using $d_i^*$, as shown in Neter et al. [7], page 399.

Third,
$(u, v)$ could inordinately change predicted values. The statistic $(\mathrm{DFFITS})_i$ is a standardized difference between the predicted values at $x_i$, one computed with, and one without, the data point $(x_i, y_i)$. The point $(x_i, y_i)$ is said to be influential in this sense if $|(\mathrm{DFFITS})_i| > 2\sqrt{2/n}$ for large $n$ and if $|(\mathrm{DFFITS})_i| > 1$ for small to moderate $n$. This test is based on the usual confidence bands for regression lines.

Fourth,
$(u, v)$ could unduly impact the slope. The statistic $(\mathrm{DFBETA})_i$ is a standardized difference between the slopes of the two regression lines, one with $(x_i, y_i)$ and one without $(x_i, y_i)$. If $|(\mathrm{DFBETA})_i| > 1$ for small to moderate $n$ and if $|(\mathrm{DFBETA})_i| > 2/\sqrt{n}$ for large $n$, then $(x_i, y_i)$ is called influential in this sense.
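The four measures can be computed for a single case by fitting the line with and without that case. The sketch below follows the usual textbook definitions as understood here (leverage from the hat matrix diagonal, the externally studentized residual written as a prediction error from the deleted fit, and DFFITS and the slope DFBETA standardized with the deleted mean squared error). It is an illustrative sketch, not the paper's own computation; all function and variable names are hypothetical. The data are the constructed data of Section 6 with point C appended.

```python
import math

def fit_line(xs, ys):
    """Least squares fit; returns intercept, slope, Sxx, mean of x, SSE."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, sxx, mx, sse

def deletion_diagnostics(xs, ys, i):
    """Return (h_ii, d_i*, DFFITS_i, slope DFBETA_i) for case i."""
    n = len(xs)
    b0, b1, sxx, mx, _ = fit_line(xs, ys)              # all n points
    xd, yd = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    a0, a1, sxx_d, mx_d, sse_d = fit_line(xd, yd)      # case i deleted
    mse_d = sse_d / (n - 3)                            # n - 1 points, 2 parameters
    h = 1.0 / n + (xs[i] - mx) ** 2 / sxx
    # Variance of the error in predicting case i from the deleted fit.
    pred_var = mse_d * (1.0 + 1.0 / (n - 1) + (xs[i] - mx_d) ** 2 / sxx_d)
    d_star = (ys[i] - (a0 + a1 * xs[i])) / math.sqrt(pred_var)
    dffits = ((b0 + b1 * xs[i]) - (a0 + a1 * xs[i])) / math.sqrt(mse_d * h)
    dfbeta = (b1 - a1) / math.sqrt(mse_d / sxx)
    return h, d_star, dffits, dfbeta

# Constructed data of Section 6, followed by point C as the added case.
xs = [-1.54, -1.16, -0.77, -0.77, -0.39, -0.39, 0.39, 0.77, 1.16, 1.16, 1.54]
ys = [-1.07, -1.26, 1.26, -0.63, -0.60, 0.60, -0.63, 0.0, 1.89, -0.63, 1.07]
xs_aug, ys_aug = xs + [1.85], ys + [2.59]

h, d_star, dffits, dfbeta = deletion_diagnostics(xs_aug, ys_aug, len(xs_aug) - 1)
n = len(xs_aug)
print(round(h, 2), round(d_star, 2), round(dffits, 2), round(dfbeta, 2))
print("high leverage:", h > 4.0 / n)
```

Since the listed data are rounded to two decimals, the printed values can be expected to match tabulated values only to about two decimal places.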
Generally, in data analysis each of these four measures is routinely computed for all $n$ data sets obtained by deleting each point in turn. Then, the structure of each of the four batches of numbers [...] next section.
6. NUMERICAL EXAMPLE.
Consider the following constructed data in standard units: (-1.54, -1.07), (-1.16, -1.26), (-0.77, 1.26), (-0.77, -0.63), (-0.39, -0.60), (-0.39, 0.60), (0.39, -0.63), (0.77, 0), (1.16, 1.89), (1.16, -0.63), and (1.54, 1.07).
For these 11 data $r = b = 0.50$. Let us select $d = 0.7$ as our desired slope and points A(0.39, 6.49), B(0.80, 3.55), C(1.85, 2.59), and D(3.00, 2.90) as possible auxiliary points on the $d = 0.7$ curve. Point B is one where the correlation coefficient remains 0.5. See equation (4.1).
Point C is the local minimum point of the $d = 0.7$ curve.

Each of the four diagnostic statistics discussed in Section 5 is presented in Table 1 for each of these four points representing $(x_n, y_n)$. The symbol * next to a numerical value means that the point would be deemed unusual, that is, considered of high leverage or influence, by the criteria given in Section 5. The level of significance is 0.05 for the $d_n^*$ column.

Point   $h_{nn}$   $d_n^*$   $(\mathrm{DFFITS})_n$   $(\mathrm{DFBETA})_n$
A       0.09       6.27*     2.03*                   0.70
B       0.13       3.08*     1.19*                   0.71
C       0.29       1.48      0.93                    0.79
D       0.48*      1.06      1.01*                   0.92

Table 1.

Points on the $d = 0.7$ curve with $x_n = u \ge 3.59$ will be flagged with $(\mathrm{DFBETA})_n > 1$. Table 1 shows that introducing point C will give us the desired change of slope but will not be detected with these four diagnostic statistics. Other points on the $d = 0.7$ curve may or may not be detected with various test statistics depending upon their locations.

Figure 3: Numerical example described in Section 6.
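Points B and C of this example can be approximately recovered from equations (2.1), (3.1) and (4.1). The sketch below uses a simple bisection search, which is a choice made here for illustration and not a method stated in the paper; the function names and the search interval are hypothetical. It takes the rounded values $r = b = 0.5$, so its output should agree with the coordinates quoted above only to roughly two significant figures.

```python
import math

def v_for_slope(n, b, d, u):
    """Point on the slope-d curve of equation (3.1) at abscissa u."""
    return n * (d - b) / u + d * u

def rho_augmented(n, r, u, v):
    """Equation (2.1)."""
    return (n * r + u * v) / math.sqrt((n + u * u) * (n + v * v))

def find_u_same_correlation(n, b, d, lo, hi, tol=1e-12):
    """Bisection for u on the slope-d curve where rho stays equal to r = b."""
    def g(u):
        return rho_augmented(n, b, u, v_for_slope(n, b, d, u)) - b
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("no sign change on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_mid * g_lo <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return 0.5 * (lo + hi)

n, b, d = 12, 0.5, 0.7

# Point C: local minimum of the d = 0.7 curve in the right half-plane.
u_c = math.sqrt(n * (d - b) / d)
v_c = 2.0 * d * u_c

# Point B: on the same curve, with the correlation left at 0.5.
u_b = find_u_same_correlation(n, b, d, 0.1, u_c)
v_b = v_for_slope(n, b, d, u_b)

print("B:", round(u_b, 2), round(v_b, 2))
print("C:", round(u_c, 2), round(v_c, 2))
# Equation (4.1) should hold at B.
print(b * math.sqrt((n + v_b ** 2) / (n + u_b ** 2)))
```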
REFERENCES
1. CASELLA, G. Leverage and Regression Through the Origin, Amer. Statist. 37 (1983), 147-152.
2. HAMPEL, F.R. Contributions to the Theory of Robust Estimation, Ph.D. Thesis, University of California, Berkeley, 1968.
3. HAMPEL, F.R. The Influence Curve and Its Role in Robust Estimation, J. Amer. Statist. Assoc. 69 (1974), 383-393.
4. BELSLEY, D.A., KUH, E. and WELSCH, R.E. Regression Diagnostics, John Wiley and Sons, New York, 1980.
5. COOK, R.D. and WEISBERG, S. Residuals and Influence in Regression, Chapman and Hall, New York, 1982.
6. HUBER, P.J. Robust Statistics, John Wiley and Sons, New York, 1981.
7. NETER, J., WASSERMAN, W. and KUTNER, M.H. Applied Linear Regression Models, Second Edition, Richard D. Irwin, Inc., Homewood, Illinois, 1989.
8. ATKINSON, A.C. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Oxford University Press, Oxford, 1985.
9. HOAGLIN, D.C. and WELSCH, R.E. The Hat Matrix in Regression and ANOVA,
Amer.