The effect of a single point on correlation and slope


DAVID L. FARNSWORTH

Department of Mathematics
Rochester Institute of Technology
Rochester, New York 14623
U.S.A.

(Received December 8, 1989 and in revised form April 27, 1990)

ABSTRACT. By augmenting a bivariate data set with one point, the correlation coefficient and/or the slope of the regression line can be changed to any prescribed values. For the target value of the correlation coefficient or the slope, the coordinates of the new point are found as a function of certain statistics of the original data. The location of this new point with respect to the original data is investigated.

KEY WORDS AND PHRASES. Correlation coefficient, deletion technique, influence of data, least squares line, regression diagnostic, sample influence curve.

1980 AMS SUBJECT CLASSIFICATION CODES. 62J05, 62F35.

1. INTRODUCTION.

The correlation coefficient and slope of a least squares regression line are known to be sensitive to a few outliers in the data. For this reason, these estimators are called nonrobust or nonresistant.

It is not unexpected that changing one data point or introducing a new data point will greatly perturb the regression line and various statistics associated with it. Actually, we can judiciously introduce a new data point and create a line of our choosing. The effect of adding one new data point is the subject of this paper.


for a classroom exercise or a made-up case study, we may want to produce certain outcomes. These might be obtained with the introduction of just one additional point.

The idea of adding one special point in order to force a regression line through the origin was studied by Casella [1]. The location of the point was shown to be related to various statistics for the line.

The robustness of an estimator is in part gauged by the influence function. The influence curve or function for a given estimator is defined pointwise as a limit as a vanishing weight is placed on the data point. These curves are derivatives and measure infinitesimal asymptotic (large sample) influence on the given estimator. The earliest work on this concept was by Hampel [2] and [3]. Also, see Belsley et al. [4], Cook and Weisberg [5], and Huber [6].

Finite sample influence curves are difference quotients, not limits. For sample size $n$, the numerator consists of the difference between a statistic of interest with a given point included in its formulation and the same statistic without that point. The denominator is $1/n$. This sample influence curve is used to measure the influence of a single data point within a sample. The deletion of an influential data point would lead to large values of the influence curve. The diagnostic tests for influence presented in Section 5 are standardized versions of the sample influence curve.
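The difference quotient just described can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names `correlation` and `sample_influence` and the toy data are hypothetical, and the quotient is taken literally as (statistic with the point minus statistic without it) divided by $1/n$.

```python
import math

def correlation(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / math.sqrt(sxx * syy)

def sample_influence(points, i, statistic=correlation):
    """Difference quotient: (T with point i - T without point i) / (1/n)."""
    n = len(points)
    without = points[:i] + points[i + 1:]
    return (statistic(points) - statistic(without)) / (1.0 / n)

# Hypothetical toy data: four points on the line y = x and one point far off it.
toy = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 0.0)]
influences = [sample_influence(toy, i) for i in range(len(toy))]
most_influential = max(range(len(toy)), key=lambda i: abs(influences[i]))
```

In this toy set the off-line point has by far the largest absolute influence on the correlation coefficient, which is the behaviour the paragraph above describes.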

In Sections 2, 3 and 4, for any target value of the slope or the correlation coefficient, the coordinates of one additional point are found as a function of certain statistics of the original data. The locations of these points are presented both analytically and graphically. They are realizations of curves of constant influence where the original $n-1$ data are parameters. The trouble is that we may be caught tampering with the data in this way if conventional diagnostics flag the new point. In Section 5, four of the customary measures of leverage and influence are briefly discussed. In Section 6, an illustrative numerical example is given.

2. PRESCRIBING THE CORRELATION COEFFICIENT.

We are given the original data $\{(x_i, y_i) : i = 1, 2, \ldots, n-1\}$. For convenience take the center $(\bar{x}, \bar{y}) = (0, 0)$ and variances $s_x^2 = \sum x_i^2/(n-1) = 1$ and $s_y^2 = \sum y_i^2/(n-1) = 1$, that is, $x_i$ and $y_i$ are in standard units. The denominators of $s_x^2$ and $s_y^2$ are the full sample size since that choice leads to considerable simplification of subsequent formulas. Thus, the regression line is $y = bx$, and the correlation coefficient is $r = b$. The additional point $(x_n, y_n)$ is expressed in standard deviation units of the original $n-1$ data as $(u, v) = (x_n/s_x, y_n/s_y) = (x_n, y_n)$.

The condition that the correlation coefficient of the augmented data is the prescribed value $\rho$ is

$$\frac{nr + uv}{\sqrt{(n + u^2)(n + v^2)}} = \rho. \tag{2.1}$$
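As a numerical sanity check, the closed form in (2.1) can be compared with a direct computation of the correlation coefficient of the augmented data. The sketch below assumes the standardization stated above (mean zero, sums of squares divided by the number of original points equal to one); the function names and the small data set are hypothetical.

```python
import math

def augmented_correlation(n, r, u, v):
    """Left side of (2.1): correlation after adding (u, v) to n - 1 standardized points."""
    return (n * r + u * v) / math.sqrt((n + u * u) * (n + v * v))

def standardize(values):
    """Center to mean 0 and scale so that (sum of squares) / (number of values) is 1."""
    m = len(values)
    mean = sum(values) / m
    centered = [t - mean for t in values]
    scale = math.sqrt(sum(t * t for t in centered) / m)
    return [t / scale for t in centered]

def correlation(xs, ys):
    """Pearson correlation coefficient computed from scratch."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical original data (n - 1 = 5 points), converted to standard units.
xs = standardize([1.0, 2.0, 4.0, 5.0, 7.0])
ys = standardize([2.0, 1.0, 5.0, 4.0, 6.0])
n = len(xs) + 1
r = correlation(xs, ys)

u, v = 1.5, -2.0  # an arbitrary new point, in the same standard units
direct = correlation(xs + [u], ys + [v])
formula = augmented_correlation(n, r, u, v)
```

The two values agree to rounding error, since the new point enters the augmented sums only through $u^2$, $v^2$ and $uv$.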

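The dependence of $\rho$ upon the original data is only through the values of $n$ and $r$. Equation (2.1) is expressible as

$$\left[(1 - \rho^2)u^2 - n\rho^2\right]v^2 + 2nruv + n\left[n(r^2 - \rho^2) - \rho^2 u^2\right] = 0. \tag{2.2}$$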

For each $\rho$, symmetries of the solution curves are about the origin and the lines $v = u$ and $v = -u$ since equation (2.1) is invariant under replacement of $(u, v)$ with $(-u, -v)$ or $(\pm v, \pm u)$. Vertical asymptotes appear at $(1 - \rho^2)u^2 - n\rho^2 = 0$, that is, $u = \pm\sqrt{n}\,\rho/\sqrt{1 - \rho^2} = \pm k$. By symmetry, horizontal asymptotes are $v = \pm k$. Note that $k$ is a monotonically increasing function of $n$ and $\rho^2$. For $\rho = 0$, the asymptotes are the axes, and equation (2.1) becomes $uv = -nr$. The discriminant of equation (2.2) is $4n\rho^2[n + u^2][(1 - \rho^2)u^2 + n(r^2 - \rho^2)]$, which is nonnegative for $u^2 \ge n(\rho^2 - r^2)/(1 - \rho^2)$. Thus, for $\rho^2 \le r^2$ there is no restriction upon $u$. For $\rho^2 > r^2$ the nonnegativity of the discriminant does restrict $u$ from a strip about the $v$-axis. If $r = 0$, the $\rho = r$ solution curve becomes the two axes. If $r^2 = 1$, there is just the $\rho^2 \le r^2$ type of solution curves.
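The asymptote and discriminant statements can be explored numerically. The following is a minimal sketch, not the author's procedure: it assumes the quadratic in $v$ obtained by squaring (2.1), with coefficients $(1-\rho^2)u^2 - n\rho^2$, $2nru$ and $n[n(r^2-\rho^2) - \rho^2u^2]$, and it discards any root that fails the unsquared equation (2.1). The names `asymptote_k` and `solve_v` are hypothetical.

```python
import math

def augmented_correlation(n, r, u, v):
    """Left side of (2.1)."""
    return (n * r + u * v) / math.sqrt((n + u * u) * (n + v * v))

def asymptote_k(n, rho):
    """Distance k of the asymptotes u = +-k and v = +-k from the axes."""
    return math.sqrt(n) * abs(rho) / math.sqrt(1.0 - rho * rho)

def solve_v(n, r, rho, u, tol=1e-9):
    """Values v for which adding (u, v) gives correlation rho (possibly none)."""
    a = (1.0 - rho * rho) * u * u - n * rho * rho
    b = 2.0 * n * r * u
    c = n * (n * (r * r - rho * rho) - rho * rho * u * u)
    if abs(a) < tol:
        # u lies on a vertical asymptote, so the equation is linear in v.
        candidates = [-c / b] if abs(b) > tol else []
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return []  # u lies in the excluded strip about the v-axis
        root = math.sqrt(disc)
        candidates = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    # Squaring (2.1) can introduce roots with the wrong sign of rho; drop them.
    return sorted({v for v in candidates
                   if abs(augmented_correlation(n, r, u, v) - rho) < 1e-6})

# Values used in the text: r = 0.5 and n = 12.
k_half = asymptote_k(12, 0.5)
k_nine = asymptote_k(12, 0.9)
roots_same_r = solve_v(12, 0.5, 0.5, 1.0)   # two points that leave r unchanged
roots_strip = solve_v(12, 0.5, 0.9, 1.0)    # no solution inside the strip
roots_zero = solve_v(12, 0.5, 0.0, 3.0)     # rho = 0 reduces to uv = -nr
```

For $\rho^2 > r^2$ the solver returns an empty list whenever $u^2 < n(\rho^2 - r^2)/(1 - \rho^2)$, matching the strip described above.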

Representative solution curves are shown in Figure 1 for $r = b = 0.5$ and $n = 12$. Because of the symmetries, only the right half-plane is displayed. The regression line $v = 0.5u$ and the asymptotes $u = 2$ and $v = 2$ for the $\rho = r = 0.5$ curve are shown. The $\rho = -0.6$ and the $\rho = 0.6$ curves have asymptotes $u = \pm 2.59$ and $v = \pm 2.59$. Only the bulge about $v = u$ for the $\rho = 0.9$ curve is visible. Its shape is similar to that of the curve for $\rho = 0.6$. The $\rho = 0.9$ curve doubles back, changing its concavity and approaching the vertical asymptote $u = 7.15$ from the left and the horizontal asymptote $v = 7.15$ from below. All of the $\rho = 0$ curve is outside the square $\{-2 < u < 2, -2 < v < 2\}$, indicating how unusual the new point $(u, v)$ would have to be to convert from $r = 0.5$ to $\rho = 0$. Of particular interest are the two branches of the $\rho = r = 0.5$ curve. The correlation coefficient is unchanged even by points which are very distant along this curve.

Figure 1: Curves along which $(u, v)$ may be placed to change the value of the correlation

3. PRESCRIBING THE SLOPE.

The condition that the slope of the least squares regression line based on the $n$ points is the prescribed value $d$ is

$$\frac{nb + uv}{n + u^2} = d \tag{3.1}$$

or

$$v = \frac{n(d - b)}{u} + du.$$

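For each $d$ the solution curve is symmetric with respect to the origin. Asymptotes are $u = 0$ and $v = du$ for $d \ne b$. Maxima and minima occur at $(\pm\sqrt{n(d - b)/d}, \pm 2d\sqrt{n(d - b)/d})$ for $(d - b)/d > 0$. Intersections with the $u$-axis occur at $(\pm\sqrt{n(b - d)/d}, 0)$ for $(d - b)/d < 0$. Placing $d = b$ in equation (3.1) implies $u = 0$ or $v = bu$, that is, the solution curve consists of the two lines $u = 0$ and $v = bu$.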
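A short sketch of the slope curve may help in reading these formulas. It evaluates $v = n(d-b)/u + du$ and the stationary point $(\sqrt{n(d-b)/d}, 2d\sqrt{n(d-b)/d})$ for $r = b = 0.5$, $n = 12$ and $d = 0.7$; the function names are hypothetical.

```python
import math

def augmented_slope(n, b, u, v):
    """Left side of (3.1): slope after adding (u, v) to n - 1 standardized points."""
    return (n * b + u * v) / (n + u * u)

def v_on_slope_curve(n, b, d, u):
    """Ordinate v that forces slope d at abscissa u (u must be nonzero)."""
    return n * (d - b) / u + d * u

def extremum(n, b, d):
    """Stationary point of the slope curve in the right half-plane; needs (d - b)/d > 0."""
    u = math.sqrt(n * (d - b) / d)
    return u, 2.0 * d * u

n, b, d = 12, 0.5, 0.7
u_min, v_min = extremum(n, b, d)          # local minimum of the d = 0.7 curve
v_at_three = v_on_slope_curve(n, b, d, 3.0)
slope_check = augmented_slope(n, b, u_min, v_min)
```

Substituting any point of the curve back into (3.1) returns the prescribed slope $d$, which is what `slope_check` verifies for the stationary point.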

Figure 2 displays selected solution curves for $b = r = 0.5$ and $n = 12$. Because of symmetry, just the right half-plane is shown. Of course, the $d = 0$ curve in Figure 2 is the same as the $\rho = 0$ curve in Figure 1.

Figure 2: Curves along which $(u, v)$ may be placed to change the value of the slope to selected values $d$ for $r = b = 0.5$ and $n = 12$.

4. PRESCRIBING BOTH THE CORRELATION COEFFICIENT AND THE SLOPE.

From Figures 1 and 2 it is apparent that the slope can be drastically changed while the correlation coefficient is numerically the same. Setting $\rho = r = b$, equations (2.1) and (3.1) give

$$d = \sqrt{\frac{n + v^2}{n + u^2}}\; b. \tag{4.1}$$


the two numbers $u$ and $v$. However, the issue of remoteness of $(u, v)$ from $(\bar{x}, \bar{y}) = (0, 0)$ and from $y = bx$ is important and is addressed below.
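One way to locate a point that changes the slope to $d$ while leaving the correlation at $r = b$ is to substitute $v = n(d-b)/u + du$ from (3.1) into the square of (4.1). This gives a quadratic in $w = u^2$. The sketch below is a derivation made here for illustration, not taken from the paper; it assumes $0 < |b| < 1$, $d \ne b$ and $d/b > 0$, and the function name is hypothetical.

```python
import math

def point_keeping_correlation(n, b, d):
    """Point (u, v) with u > 0 giving slope d and unchanged correlation r = b.

    Solves  d^2 (b^2 - 1) w^2 + n [b^2 + 2 b^2 d (d - b) - d^2] w + b^2 n^2 (d - b)^2 = 0
    for w = u^2, obtained by combining (3.1) with the square of (4.1).
    """
    if b == 0 or abs(b) >= 1 or d == b or d * b <= 0:
        raise ValueError("sketch assumes 0 < |b| < 1, d != b and d/b > 0")
    qa = d * d * (b * b - 1.0)
    qb = n * (b * b + 2.0 * b * b * d * (d - b) - d * d)
    qc = (b * n * (d - b)) ** 2
    disc = qb * qb - 4.0 * qa * qc        # positive because qa < 0 < qc
    w = (-qb - math.sqrt(disc)) / (2.0 * qa)  # the single positive root
    u = math.sqrt(w)
    v = n * (d - b) / u + d * u
    return u, v

n, b, d = 12, 0.5, 0.7
u_b, v_b = point_keeping_correlation(n, b, d)
new_slope = (n * b + u_b * v_b) / (n + u_b ** 2)
new_corr = (n * b + u_b * v_b) / math.sqrt((n + u_b ** 2) * (n + v_b ** 2))
```

For $r = b = 0.5$, $n = 12$ and $d = 0.7$ the result lies close to the point B(0.80, 3.55) used in Section 6; by the symmetry about the origin, $(-u, -v)$ works equally well.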

5. DETECTING THE NEW POINT.

Among the available diagnostic tests for gauging whether a data point $(x_i, y_i)$ is an interloper, four are the most common. Each measures a different feature of data. See Neter et al. [7], Atkinson [8], or Cook and Weisberg [5]. The notation and critical values are not uniformly agreed upon in the literature. Those suggested in Chapter 11 of Neter et al. [7] are used below.

First, the point $(u, v)$ could be an $x$-outlier, that is, it could be far in a horizontal direction from the mean of all $n$ data. Such points influence slope the most for a given change in the $y$-coordinate and are said to have high leverage. The usual measure of leverage for $(x_i, y_i)$ is the diagonal element $h_{ii}$ of the hat matrix. See Hoaglin and Welsch [9]. Each $h_{ii}$ depends upon all the $x$-coordinates but does not depend upon the $y$-coordinates of the data. If $4/n < h_{ii} \le 1$, then $(x_i, y_i)$ will be said to have high leverage. If $1/n \le h_{ii} \le 4/n$, then $(x_i, y_i)$ has low leverage.
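For the added point, the leverage can be written in terms of $u$ and $n$ alone. The sketch below uses the standard simple-regression formula $h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}$ together with the standardization of Section 2, which gives $h_{nn} = 1/n + (n-1)u^2/[n(n + u^2)]$; this closed form is a derivation made here rather than a formula quoted from the paper, and the function names are hypothetical.

```python
def leverage_of_new_point(n, u):
    """Hat-matrix diagonal for the added point (u, v); it does not depend on v.

    Assumes the original n - 1 abscissas have mean 0 and sum of squares n - 1.
    """
    return 1.0 / n + (n - 1) * u * u / (n * (n + u * u))

def has_high_leverage(n, h):
    """Rule of thumb quoted in the text: high leverage when h exceeds 4/n."""
    return h > 4.0 / n

n = 12
abscissas = {"A": 0.39, "B": 0.80, "C": 1.85, "D": 3.00}  # points of Section 6
leverages = {name: leverage_of_new_point(n, u) for name, u in abscissas.items()}
flags = {name: has_high_leverage(n, h) for name, h in leverages.items()}
```

With $n = 12$ the cutoff is $4/n \approx 0.33$, so among these four abscissas only the one at $u = 3.00$ is flagged.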

The next three diagnostic measures arise from the deletion technique. The impact of a data point is often detected by using this technique: The data point is removed; the statistic of interest is computed; and a standardized difference between that statistic and the corresponding statistic utilizing all the data is analysed as the diagnostic measure. The unstandardized differences, divided by $1/n$, are the sample influence curves discussed in the Introduction.

Second, the point $(u, v)$ could be outlying with respect to the $y$-value. A diagnostic statistic for measuring whether $(x_i, y_i)$ is a $y$-outlying point is the externally studentized residual $d_i^*$, which is a standardized vertical distance at $x_i$ between the point $(x_i, y_i)$ and the regression line created without $(x_i, y_i)$. Each $d_i^*$ is $t$-distributed with df $= n - 3$. Using the prediction limits for a new point $(x_i, y_i)$ based on the other $n - 1$ data's regression line as the criteria for designating a point as outlying is equivalent to using $d_i^*$, as shown in Neter et al. [7], page 399.

Third, $(u, v)$ could inordinately change predicted values. The statistic $(DFFITS)_i$ is a standardized difference between the predicted values at $x_i$, one computed with, and one without, the data point $(x_i, y_i)$. The point $(x_i, y_i)$ is said to be influential in this sense if $|(DFFITS)_i| > 2\sqrt{2/n}$ for large $n$ and if $|(DFFITS)_i| > 1$ for small to moderate $n$. This test is based on the usual confidence bands for regression lines.

Fourth, $(u, v)$ could unduly impact the slope. The statistic $(DFBETA)_i$ is a standardized difference between the slopes of the two regression lines, one with $(x_i, y_i)$ and one without $(x_i, y_i)$. If $|(DFBETA)_i| > 1$ for small to moderate $n$ and if $|(DFBETA)_i| > 2/\sqrt{n}$ for large $n$, then $(x_i, y_i)$ is called influential in this sense.

Generally, in data analysis each of these four measures is routinely computed for all $n$ data sets obtained by deleting each point in turn. Then, the structure of each of the four batches of numbers

next section.

6. NUMERICAL EXAMPLE.

Consider the following constructed data in standard units: (-1.54, -1.07), (-1.16, -1.26), (-0.77, 1.26), (-0.77, -0.63), (-0.39, -0.60), (-0.39, 0.60), (0.39, -0.63), (0.77, 0), (1.16, 1.89), (1.16, -0.63), and (1.54, 1.07). For these 11 data $r = b = 0.50$. Let us select $d = 0.7$ as our desired slope and points A(0.39, 6.49), B(0.80, 3.55), C(1.85, 2.59), and D(3.00, 2.90) as possible auxiliary points on the $d = 0.7$ curve. Point B is one where the correlation coefficient remains 0.5. See equation (4.1). Point C is the local minimum point of the $d = 0.7$ curve.
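The claims about the four auxiliary points can be checked directly from the listed data. The sketch below recomputes the least squares slope and the correlation coefficient after appending each candidate point; the helper name `slope_and_correlation` is hypothetical. Because the published coordinates are rounded to two decimals, agreement is expected only to about two decimals.

```python
import math

# The 11 constructed data points from the text, in standard units.
DATA = [(-1.54, -1.07), (-1.16, -1.26), (-0.77, 1.26), (-0.77, -0.63),
        (-0.39, -0.60), (-0.39, 0.60), (0.39, -0.63), (0.77, 0.0),
        (1.16, 1.89), (1.16, -0.63), (1.54, 1.07)]

# Candidate auxiliary points on the d = 0.7 curve.
CANDIDATES = {"A": (0.39, 6.49), "B": (0.80, 3.55),
              "C": (1.85, 2.59), "D": (3.00, 2.90)}

def slope_and_correlation(points):
    """Least squares slope and Pearson correlation of a list of (x, y) pairs."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

original_slope, original_corr = slope_and_correlation(DATA)
augmented = {name: slope_and_correlation(DATA + [pt])
             for name, pt in CANDIDATES.items()}
```

Each entry of `augmented` is a `(slope, correlation)` pair; all four slopes land near 0.7, and for point B the correlation stays near 0.5.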

Each of the four diagnostic statistics discussed in Section 5 is presented in Table 1 for each of these four points representing $(x_n, y_n)$. The symbol * next to a numerical value means that the point would be deemed unusual, that is, considered of high leverage or influence, by the criteria given in Section 5. The level of significance is 0.05 for the $d_n^*$ column.

Point    h_nn     d_n*     (DFFITS)_n    (DFBETA)_n
A        0.09     6.27*    2.03*         0.70
B        0.13     3.08*    1.19*         0.71
C        0.29     1.48     0.93          0.79
D        0.48*    1.06     1.01*         0.92

Table 1.

Points on the $d = 0.7$ curve with $x_n = u \ge 3.59$ will be flagged with $(DFBETA)_n > 1$.

Table 1 shows that introducing point C will give us the desired change of slope but will not be detected with these four diagnostic statistics. Other points on the $d = 0.7$ curve may or may not be detected with various test statistics depending upon their locations.
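The four diagnostics for an added point can be recomputed from the data of this section. This is a sketch under assumptions: it uses the usual textbook deletion formulas for simple linear regression (leverage $h$, externally studentized residual $e/\sqrt{MSE_{(i)}(1-h)}$, DFFITS as that residual times $\sqrt{h/(1-h)}$, and the slope DFBETA scaled by $\sqrt{MSE_{(i)}/S_{xx}}$), and it takes 2.262, the two-sided 5% $t$ value for 9 degrees of freedom, as the cutoff for $d_n^*$. Whether the paper's test is two-sided is an assumption, and the function names are hypothetical.

```python
import math

DATA = [(-1.54, -1.07), (-1.16, -1.26), (-0.77, 1.26), (-0.77, -0.63),
        (-0.39, -0.60), (-0.39, 0.60), (0.39, -0.63), (0.77, 0.0),
        (1.16, 1.89), (1.16, -0.63), (1.54, 1.07)]

def fit(points):
    """Least squares line; returns intercept, slope, Sxx, mean of x, error sum of squares."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in points)
    return intercept, slope, sxx, mx, sse

def diagnostics(points, i):
    """Leverage, externally studentized residual, DFFITS and slope DFBETA for point i."""
    n = len(points)
    x_i, y_i = points[i]
    a_full, b_full, sxx_full, mx_full, _ = fit(points)
    rest = points[:i] + points[i + 1:]
    _, b_del, _, _, sse_del = fit(rest)
    mse_del = sse_del / (len(rest) - 2)           # n - 3 degrees of freedom
    h = 1.0 / n + (x_i - mx_full) ** 2 / sxx_full
    resid = y_i - (a_full + b_full * x_i)
    t = resid / math.sqrt(mse_del * (1.0 - h))
    dffits = t * math.sqrt(h / (1.0 - h))
    dfbeta = (b_full - b_del) / math.sqrt(mse_del / sxx_full)
    return h, t, dffits, dfbeta

def flags(n, h, t, dffits, dfbeta, t_crit=2.262):
    """Cutoffs for small to moderate n; t_crit assumes a two-sided 5% test with 9 df."""
    return (h > 4.0 / n, abs(t) > t_crit, abs(dffits) > 1.0, abs(dfbeta) > 1.0)

with_c = DATA + [(1.85, 2.59)]                    # append point C
stats_c = diagnostics(with_c, len(with_c) - 1)
flags_c = flags(len(with_c), *stats_c)

with_a = DATA + [(0.39, 6.49)]                    # append point A
stats_a = diagnostics(with_a, len(with_a) - 1)
flags_a = flags(len(with_a), *stats_a)
```

On these data the values for point C agree with the corresponding row of Table 1 to within about 0.01 and none of the four flags is raised, whereas point A is flagged by both $d_n^*$ and DFFITS.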


Figure 3: Numerical example described in Section 6.

REFERENCES

1. CASELLA, G. Leverage and Regression Through the Origin, Amer. Statist. 37 (1983), 147-152.

2. HAMPEL, F.R. Contributions to the Theory of Robust Estimation, Ph.D. Thesis, University of California, Berkeley, 1968.

3. HAMPEL, F.R. The Influence Curve and Its Role in Robust Estimation, J. Amer. Statist. Assoc. 69 (1974), 383-393.

4. BELSLEY, D.A., KUH, E. and WELSCH, R.E. Regression Diagnostics, John Wiley and Sons, New York, 1980.

5. COOK, R.D. and WEISBERG, S. Residuals and Influence in Regression, Chapman and Hall, New York, 1982.

6. HUBER, P.J. Robust Statistics, John Wiley and Sons, New York, 1981.

7. NETER, J., WASSERMAN, W. and KUTNER, M.H. Applied Linear Regression Models, Second Edition, Richard D. Irwin, Inc., Homewood, Illinois, 1989.

8. ATKINSON, A.C. Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Oxford University Press, Oxford, 1985.

9. HOAGLIN, D.C. and WELSCH, R.E. The Hat Matrix in Regression and ANOVA, Amer.