• No results found

The leverage statistic, h, also called the hat-value, is available to identify cases which influence the regression model more than others.

N/A
N/A
Protected

Academic year: 2021

Share "The leverage statistic, h, also called the hat-value, is available to identify cases which influence the regression model more than others."

Copied!
8
0
0

Loading.... (view fulltext now)

Full text

(1)

Outliers

Outliers are data points which lie outside the general linear pattern of which the midline is the

regression line. A rule of thumb is that outliers are points whose standardized residual is greater than 3.3 (corresponding to the .001 alpha level). The removal of outliers from the data set under analysis can at times dramatically affect the performance of a regression model. Outliers should be removed if there is reason to believe that other variables not in the model explain why the outlier cases are unusual -- that is, these cases need a separate model. Alternatively, outliers may suggest that additional explanatory variables need to be brought into the model (that is, the model needs respecification). Another alternative is to use robust regression, whose algorithm gives less weight to outliers but does not discard them.

The leverage statistic, h, also called the hat-value, is available to identify cases which influence the regression model more than others.

• Belsley, Kuh, and Welsch (1980) define the leverage (hi ) of the ith observation as

hi = 1 n % (xi&x¯) 2 (n&1)S2 x

Leverage assesses how far away a value of the independent variable value is from the mean value: the farther away the observation the more leverage it has.

From the definition you can see that leverage is mitigated by a larger sample size (any single point should have less influence) and by a larger variance of the independent variable (again, any single point should have less influence).

• 0<h<1 The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model).

• A rule of thumb is that cases with leverage under .2 are not a problem, but if a case has

leverage over .5, the case has undue leverage and should be examined for the possibility of measurement error or the need to model such cases separately.

STATA command predict h, hat.

Cook's distance, D, is another measure of the influence of a case. Cook's distance measures the effect of deleting a given observation.

• Observations with larger D values than the rest of the data are those which have unusual

leverage.

• D > 4/n the criterion to indicate a possible problem.

(2)

dfbetas, is another statistic for assessing the influence of a case.

• If dfbetas > 0, the case increases the slope; if <0, the case decreases the slope. • The case may be considered an influential outlier if |dfbetas| > 2/%n.

STATA command dfbeta creates dfbeta’s for all variables. or predict DFx1, dfbeta(x1) for individual variables

dfFit. DfFit measues how much the estimate changes as a result of a particular observation being dropped from analysis.

dfits is defined as

• dfitsi = Rstudent . Where Rstudent is the studentized residual.

hi

1&h

i

• The case may be considered an influential outlier if DFITS> 2/%k/n

STATA command predict DFITS, dfits

Studentized residuals and deleted studentized residuals are also used to detect outliers with high leverage. A "studentized residual" is the observed residual divided by the standard deviation. The "studentized deleted residual," also called the "jacknife residual," is the observed residual divided by the standard deviation computed with the given observation left out of the analysis. Analysis of outliers usually focuses on deleted residuals. Other synonyms include externally studentized residual or, misleadingly, standardized residual. There will be a t value for each residual, with df - n - k - 1, where k is the number of independent variables. When t exceeds the critical value for a given alpha level (ex., .05) then the case is considered an outlier. In a plot of deleted studentized residuals versus ordinary residuals, one may draw lines at plus and minus two standard units to highlight cases outside the range where 95% of the cases normally lie; points substantially off the straight line are potential leverage problems.

STATA command predict student, rstudent or predict standard, rstandard

Partial regression plots, also called partial regression leverage plots or added variable plots, are yet another way of detecting influential sets of cases. Partial regression plots are a series of bivariate regression plots of the dependent variable with each of the independent variables in turn. The plots show cases by number or label instead of dots. One looks for cases which are outliers on all or many of the plots.

(3)

STATA command avplots

Example using the Murder.dta data set.

reg mrdrte exec unem d90 d93

Source SS df MS Number of obs= 153

F( 4, 148) = 3.05

Model 977.390644 4 244.347661 Prob > F = 0.0190

Residual 11867.9475 148 80.1888343 R-squared = 0.0761

Adj R-squared= 0.0511

Total 12845.3381 152 84.5088034 Root MSE = 8.9548

mrdrte Coef. Std. Err. t P>t [95% Conf. Interval]

exec .1627547 .1939295 0.84 0.403 -.2204738 .5459832 unem 1.390786 .4508653 3.08 0.002 .4998207 2.281751 d90 2.675335 1.816934 1.47 0.143 -.91515 6.26582 d93 1.607317 1.774768 0.91 0.367 -1.899842 5.114476 _cons -1.864393 3.069517 -0.61 0.545 -7.930134 4.201349 predict e, residual predict yhat

predict standard, rstandard predict student, rstudent predict h, hat

predict D, cooksd predict DFITS, dfits predict W, welsch dfbeta DFexec: DFbeta(exec) DFunem: DFbeta(unem) DFd90: DFbeta(d90) DFd93: DFbeta(d93)

(4)

rvplot

To create a standardized residual plot graph twoway scatter standard yhat, yline(0)

-20 0 20 40 60 Residuals 0 5 10 15 Fitted values

(5)

To identify the outliers.

graph twoway scatter standard yhat, yline(0) mlabel(state)

-2 0 2 4 6 8 Standardized residuals 0 5 10 15 Fitted values A L ALA L AK AK AK AZ AZ A Z AR AR AR C A CA CA CO CO CO CT CTCT DE D E DE DC DC DC FL FL FL GA GA GA HI H I HI ID I D ID IL IL IL IN IN IN IA IA IA KS K S KS KY K Y KY L A L A LA M E M E M E M D MD MD MA MAM A M I MI M I M N M NMN MS MS MS MO M OM O MTM T MT NE NE N E N V N VNV NH NHNH NJ N J NJ N M N M NM NY N Y NY N C NCNC ND N D N D O H O HOH OKOK OK OR OR OR PA PA PA RI RIRI SC SC SC SD SD S D TN T NT N TX T X T X UT UT UT VT V T VT V A VA VA WA WA WA WV WV WV WIWI W I WY WY WY -2 0 2 4 6 8 Standardized residuals 0 5 10 15 Fitted values

(6)

avplots, mlabel(state)

To graphically measure the influence of observations

graph twoway scatter standard yhat [aweight=D], msymbol(oh) yline(0)

Leverage versus residual squared plot marks the means of leverage and squared residuals. Leverage

W VW VA K DC ME NM RI NY KY AK NM WY NJ I L O R WV M I PA M I A RMA ID NH CO OH WA MS IL MT OK CT MD NV KY ID MT MIOHAKTN RI DC TN VT I N DE CO NM MO D C MN I L AZ O R KS WI MAWIMTKY ID CA OH PA NH CA WA SC O R IA ND PA WY MNHIAZIN NY TN ND IA DE CT ME NJ VT WA CO NY KS NEMNSDMDMSWIKSMEI ASD MD CA N C MS NJ ND H I RINE SD VT WY UT IN CTNVMADEHIOKNHAL LA NE UT SC ALN C UT A ROKA RN CSCGAAZGANVVA LA VA ALMOMOFL LA FL FLGATXVA TX TX -20 0 20 40 60 e( mrdrte | X ) 0 10 20 30 e( exec | X ) coef = .16275467, se = .19392954, t = .84 NHNEN EDEMACT SDVTH IRIHI UTVANJIASD MD HINDVAME NC SDWIND N CFL NE KS NYGAN C TX VAKS IA UT M NN DC OKSW II ND EW YMNVT MD S C IA SCNVGA MNPA TN WA CO CA NJ VTOKMEDECTMTAZ MO TN NY AR WI ID GA AZ I NAZO RKY MD CT NVNV PA WY MOFL DC MO MS IN UT OR OK TXCA NHOHOHFL TN NHMTIDKY LA MAMAOHPA M II LTXNM OR LA ALALIL MT OK NJ IL W A SC D C NMR IW AAKA RA LC ON Y R IA KME MI IDA R MIMS D C WY KYNMCA WVMSWV AK WVLA 0 20 40 60 80 e( mrdrte | X ) -4 -2 0 2 4 6 e( unem | X ) coef = 1.3907856, se = .45086525, t = 3.08 NHM AD EC TVTH IR IN JVAMD SDMEN CNE NY KSND FL MN GA IA SC PA CA WIAZORNVMO DC UT I NTNOHMTILOKW ACOALIDAR MI TX WYKYNMM SAKW V NE LASDUTI AH INDWI NC VA KSMNCOI NDEWYVT TN GA OK MTIDAZ A R KY NV CT MDMSMO OHNHMATXFLPAMI O R LA NJ ILALSC WA NM AK RI NY ME D C C A WVNEH ISDND N C IA VA UT KS W I MDSC MN NV WA CONJVTMECTDETN NY AZ INGAPAW YOROK NH CA M O OH I D KY M TFLMA TX LA IL N M D C R IALA RAKM IMSWV -20 0 20 40 60 80 e( mrdrte | X ) -1 -.5 0 . 5 e( d90 | X ) coef = 2.6753348, se = 1.8169343, t = 1.47 TX GA LA FLVASCAL MS N C N V UT I N N H D E M A C T VT H I R I N J SD M D M E K S N E N Y N D M N I AP A C A W I O R A Z MO D C T N O H I L O K MT W A C O I D A R M I W Y K Y N M W V A K TXFL MO AL LA VA NVGA A R UT SC OK MS NEH I SD ND N C IA W I KS MD MN W A CO NJ VT CT ME DE TN NY AZ IN W Y PA O R NH CA OH KY ID MT MA IL NM D C RI AK M I W V TX VAFLMOGAAZN COKA R LA ALCAUTW YW ASC NESDIAH INDW IKS MN CO IN DE VT TN MT ID KY NV MD CT MS OH NH MA PA MI O R IL NJ NM AK RI NY ME D C W V -20 0 20 40 60 e( mrdrte | X ) -.5 0 . 5 e( d93 | X ) coef = 1.6073174, se = 1.774768, t = .91 -2 0 2 4 6 8 Standardized residuals 0 5 10 15 Fitted values

(7)

tells us how much potential for influencing the regression an observation has. lvr2plot

To examine the numerical measure for outliers. leverage h>2k/n Cook’s D>4/n DFITS>2/%k/n Welsch’s W>3/%k DFBETA>2/%n For example, sort D

list state yhat D DFITS W in -5/l

state yhat D DFITS W

149. WV 13.15609 .0150447 -.2742131 -3.513339

150. DC 6.897557 .0452811 .4927538 6.137678

151. TX 15.01208 .0504226 -.500824 -8.794935

152. DC 9.990127 .2905912 1.547051 19.3077

(8)

Recall that

D > 4/n indicates a possible problem. With n=153, D > .02614379

DFITS> 2/%k/n may be considered an influential outlier. With n=153, k=4, DFITS>12.369317

References

Related documents

GISH analysis using Ps. strigosa genomic DNA as probe were conducted to pollen mother cells from 21 typical resistant plants and 29 susceptible plants in order to identify

Figure 5-18 Effects of mTOR inhibition and BCR stimulation on cellular Rap1 staining CLL122, a sample known to harbour a 17p deletion, was pre-treated with mTOR inhibitors for 30

Alain Amade asks the Honorary Members present to stand up so the newcomers to this Congress get to know them: Louis Polome (RSA), Paul Jensch (NED), Andy Harris (GBR), Hans

To answer research question one, does the introduction of problem-based learning positively increase high school students’ attitudes towards geography, the researcher compared the

Overall, these results confirm our previously reported findings suggesting that as long as the interest rate rule satisfies the Taylor principle, average inflation rates are close to

This shows that average duration of work increased with exposure to the fixed-term contract where exposure is defined as the number of months before the change in wage contracts

F aCUltY B Y C ategorY For 2015 Classroom: Deborah VanderLinde Blair Amy Burns Marla Butke Jeremy Cohen Anthony DeQuattro Diane Lange Michael Miles Jeffrey Vaughan

“Perfect genuineness” is an inadequate translation of a term that denotes a type of perfection, or ethical perfection, that characterizes both the action of Heaven, or Tian (as