• No results found

Robust statistics using Stata

N/A
N/A
Protected

Academic year: 2021

Share "Robust statistics using Stata"

Copied!
56
0
0

Loading.... (view fulltext now)

Full text

(1)

Robust

Robust Statistics

Statistics in Stata

in Stata

Vincenzo Verardi ([email protected])

FUNDP (Namur) and ULB (Brussels), Belgium FNRS Associate Researcher

Based on joint work with C. Croux (KULeuven) and Catherine Dehon (ULB)

(2)

1. Introduction

2. Outliers in regression analysis

3. Overview of robust estimators

3. Overview of robust estimators

4. Stata codes

(3)

Y Vertical Outlier Introduction Outliers in Good Leverage X-Design regression analysis Overview of robust estimators Stata codes Conclusion R a n g e f o r y Range for x Bad Leverage

(4)

To illustrate the influence of outliers, we

generate a dataset according to

Y=1.25+0.6X+

ε

, where X and ε~N(0,1).

We then contaminate the data with

Introduction

Outliers in

We then contaminate the data with single outliers.

set set set

set obsobsobsobs 100100100100 drawnorm drawnorm drawnorm drawnorm XXXX eeee gen gen gen

gen y=y=y=y=111.1..25.252525++0++000....666*X+e6*X+e*X+e*X+e replace replace replace replace x=x=x= …x= ……… regression analysis Overview of robust estimators Stata codes Conclusion

(5)

Y

LS

Introduction

Outliers in

Clean Vertical Bad leverage Good leverage Intercept 1.24 2.24 2.07 1.25 t-stat (10.76) (7.15) (6.99) (10.94) Slope 0.59 0.74 -0.42 0.48 t-stat (4.96) (2.26) (-9.02) (14.04) X-Design regression analysis Overview of robust estimators Stata codes Conclusion

(6)

Y OLS

Introduction

Outliers in

Vertical Outlier

Clean Vertical Bad leverage Good leverage Intercept 1.24 2.24 2.07 1.25 t-stat (10.76) (7.15) (6.99) (10.94) Slope 0.59 0.67 -0.42 0.48 t-stat (4.96) (2.26) (-9.02) (14.04) X-Design 2.24 regression analysis Overview of robust estimators Stata codes Conclusion

(7)

Y

OLS

Introduction

Outliers in

Clean Vertical Bad leverage Good leverage Intercept 1.24 2.24 4.07 1.25 t-stat (10.76) (7.15) (6.99) (10.94) Slope 0.59 0.67 -0.42 0.48 t-stat (4.96) (2.26) (-9.02) (14.04) X-Design 4.07 -0.42 regression analysis Overview of robust estimators Stata codes Conclusion

(8)

Y

OLS

Introduction

Outliers in

Good Leverage Point

Clean Vertical Bad leverage Good leverage Intercept 1.24 2.24 4.07 1.25 t-stat (10.76) (7.15) (6.99) (10.94) Slope 0.59 0.67 -0.42 0.57 t-stat (4.96) (2.26) (-9.02) (14.04) X-Design regression analysis Overview of robust estimators Stata codes Conclusion (14.04)

(9)

The objective of regression analysis is to figure out how a dependent variable is linearly related to a set of explanatory ones.

Introduction Outliers in

Technically speaking, it consists in

estimating the θ parameters in:

to find the model that better fits the data.

θ θ θ θ ε = 0 + 1 1 + 2 2 + ... + 1 1 + i i i p ip i y x x x regression analysis Overview of robust estimators Stata codes Conclusion

(10)

On the basis of the estimated parameters, it is then possible to fit the

model and predict, the dependent

variable. The discrepancy between y

and is called the residual

ˆ y ˆ y (ri = yiyˆi ). Introduction Outliers in

and is called the residual

The objective of LS is to minimize the sum of the squared residuals:

ˆ y (ri = yiyˆi ). 0 2 1 1 ˆ argmin n ( ) where LS i i p r θ

θ

θ

θ

θ

θ

= −     = =    

regression analysis Overview of robust estimators Stata codes Conclusion
(11)

However, the squaring of the residuals makes LS very sensitive to outliers.

To increase robustness, the square function could be replaced by the absolute value (Edgeworth, 1887).

Introduction Outliers in

absolute value (Edgeworth, 1887).

[qreg function in Stata]

regression analysis Overview of robust estimators Stata codes Conclusion 1 1 ˆ argmin n ( ) L i i r θ θ θ = =

(12)

Huber (1964) generalized this idea to a

set of symmetric ρ functions that could

be used instead of the absolute value to increase efficiency and robustness.

To guarantee scale equivariance,

Introduction Outliers in

To guarantee scale equivariance,

residuals are standardized by a

measure of dispersion σ.

The problem becomes:

regression analysis Overview of robust estimators Stata codes Conclusion 1 ( ) ˆ argmin n i M i r θ θ θ ρ σ =   =  

(13)

M-estimators can be redescending (1) or monotonic (2). Introduction Outliers in 0.4 0.6 0.8 1.0 1.2 2 3 4 5 y ρ(r) ρ(r) regression analysis Overview of robust estimators Stata codes Conclusion -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 0.2 0.4 r -5 -4 -3 -2 -1 0 1 2 3 4 5 1 x -5 -4 -3 -2 -1 1 2 3 4 5 -1.0 -0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 0.8 1.0 x y -5 -4 -3 -2 -1 1 2 3 4 5 -1.0 -0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 0.8 1.0 x y r r r r ρ’(r) ρ’(r) (1) (2)

(14)

If σ is known, the practical

implementation of M-estimators is

straightforward. Indeed, by defining a weight: Introduction Outliers in ( ) r θ ρ  

the problem boils down to:

regression analysis Overview of robust estimators Stata codes Conclusion 2 ( ) ( ) i i i r w r θ ρ σ θ       = 2 1 ˆ argmin n ( ) M i i i w r θ θ θ = =

(15)

Introduction Outliers in 2 1 ˆ argmin n ( ) M i i i w r θ θ θ = =

However:

1. Weights wi are a function of θ that

should thus be estimated iteratively

regression analysis Overview of robust estimators Stata codes Conclusion

should thus be estimated iteratively

2. This iterative algorithm is guaranteed to converge (and yield a solution which is unique) only for monotonic M-estimators … which are not robust

(16)

Introduction Outliers in

The rreg command was created to

tackle these problems. It works as follows:

1. It awards a weight zero to individuals with Cook distances larger than 1.

regression analysis Overview of robust estimators Stata codes Conclusion

with Cook distances larger than 1.

2. A “redescending” M-estimator is

computed using the iterative algorithm starting from a monotonic M-solution.

3. σ is re-estimated at each iteration

using the median residual of the previous iteration.

(17)

Introduction Outliers in

Unfortunately, this command has not the expected robust properties:

1. Cook distances do not help

identifying leverage points when

(clustered) outliers mask one the other.

regression analysis Overview of robust estimators Stata codes Conclusion

(clustered) outliers mask one the other.

2. The preliminary monotonic

M-estimator provides a poor initial

candidate because of point 1.

3. σ is poorly estimated because of 1

(18)

Introduction Outliers in

qreg and rreg are not robust methods:

Stata example: set obs 100 drawnorm x1-x5 e regression analysis Overview of robust estimators Stata codes Conclusion gen y=x1+x2+x3+x4+x5+e replace x1=invnorm(uniform())+10 in 1/10 qreg y x* rreg y x* display e(rmse)

(19)

Raw sum of deviations 22220000222.2.8..88844445551511 (about -1 -.--...22232338388899299225255858887777)

Median regression Number of obs = 110110000000 Iteration 6: sum of abs. weighted deviations = 11111116166.6...00100116166767777777

Iteration 5: sum of abs. weighted deviations = 11111116166.6...00100119199090005555 Iteration 4: sum of abs. weighted deviations = 11111116166.6...66566551511414445555 Iteration 3: sum of abs. weighted deviations = 11111117177.7...00400443433636669999 Iteration 2: sum of abs. weighted deviations = 11111117177.7...11811887877171114444 Iteration 1: sum of abs. weighted deviations = 11111119199.9...66466448488181118888 Iteration 1: WLS sum of weighted deviations = 11111117177.7...33133118188282224444

Introduction Outliers in _cons ----....0000000000009999224224454555 ..1..1118888887887767666444848 88 ----00.00..0.00000 00 0000....999999969666 ----...3.373377755557772722121113333 ...3.373377733338878877272224444 x5 ..9..999999393133116166767577555 ..1..11177797991911199993383388 8 55.55..5.55454 44 0000....000000000000 ...6.63663337773737337477444 111.1..3.34334448888999696661111 x4 ....8887877777377333552552212111 ..1..11166626224244466661111111 1 55.55..4.44040 00 0000....0000000000 00 ..5..5555555444477877881811177 77 111.1.1..11919999999992992222222 x3 ...9.99494944999111919989888 ...1.11616667777555858 88 55.55..6.6666666 0000....000000000000 ...6.61661116664644646664444 111.1..2.28228881111999393332222 x2 ....7775754554744777221221121222 ..1..11155585889899999994444444 4 44.44..7.77575 55 0000....0000000000 00 ..4..4434333999900300334344411 11 111.1.0..00707770000440440008888 x1 ..1..11717779999888787777777 ....000055535363366868882222222 2 3333...3.3353555 00.00...000000010111 ...0.007077733332228289889997777 ...2.228286886664444666464443333 y Coef. Std. Err. t P>|t| [95% Conf. Interval] Min sum of deviations 11111111666.6..0.00011611668688 8 Pseudo R2 = 00.00...4444222828881111 Raw sum of deviations 22220000222.2.8..88844445551511 (about -1 -.--...22232338388899299225255858887777)

regression analysis Overview of robust estimators Stata codes Conclusion

(20)

Prob > F = 000.0..0.000000000000000 F( 5, 94) = 3333333.3.2..2228888 Robust regression Number of obs = 111010000000 Biweight iteration 5: maximum difference in weights = ...0.0000000777777077008088800008888

Biweight iteration 4: maximum difference in weights = ...1.1114444777575955990900055552222 Huber iteration 3: maximum difference in weights = ...0.0001111555757277224244400001111 Huber iteration 2: maximum difference in weights = ...0.0006666000202522553533300006666 Huber iteration 1: maximum difference in weights = ...4.4448888444141711771711177773333 . rreg y x* Introduction Outliers in 1 11 1....6666111155155111555555575777 . display e(rmse) _cons --.--...0000555588884424422828878777 ...1.17117775555009009998888 ----0000..3..333333 3 0000....77773393399 9 --.--.4..440406006660000889889899888 ....2222889889929223233232225555 x5 1111....111111115558583883363666 ...1.116163663339999770770007777 6666....8881811 1 0000....00000000000 0 ...7.79779990000226226866888 1111....444444414114144040003333 x4 ....7777777788881119190990050555 ...1.115155555554444880880007777 5555....0001011 1 000.0...00000000000 0 ..4..446469669994444880880100111 1111....008008868669699090001111 x3 ....9999222222221112129229969666 ...1.115155656669999992992226666 5555....8887877 7 000.0...00000000000 0 ..6..661610110004444117117277222 1111....223223333338388484442222 x2 ....9999222244441112129229959555 ...1.114144545559999884884445555 6666....3333333 3 000.0...00000000000 0 ..6..663634334442222773773933999 1111....221221131339399898885555 x1 ..1..11177775552522626766777 ...0.005055151411444999966166111 3333....4440400 0 00.00...00000001011 1 ...0.007077373330000220220003333 ....2222777777757551511313336666 y Coef. Std. Err. t P>|t| [95% Conf. Interval] regression analysis Overview of robust estimators Stata codes Conclusion

(21)

Robustness can be however achieved by tackling the problem from a different perspective.

Instead of minimizing the variance of the residuals (LS) a more robust

Introduction Outliers in

the residuals (LS) a more robust measure of spread of the residuals could be minimized (Rousseeuw and Yohai, 1987).

The measure of spread considered here is an M-estimator of scale.

regression analysis Overview of robust estimators Stata codes Conclusion

(22)

Intuition:

The variance is defined by:

Introduction Outliers in

2 2

1

1

ˆ ( ) which can be rewritten: n i i r n σ θ = =

But the square function ...

regression analysis Overview of robust estimators Stata codes Conclusion 1 2 1 ( ) 1

1 hence LS looks for the

ˆ

minimal ˆ that satisfies the equality. i n i i n r n θ σ σ = =   =  

(23)

Replace the square by another

ρ

: Introduction Outliers in 1 ( ) 1 1 ˆ

but for Gaussian data we want ˆ to be the standard deviation ( correction)

n i S i S r n θ ρ σ σ =   =   ⇒

regression analysis Overview of robust estimators Stata codes Conclusion 1

the standard deviation ( correction) ( )

1

ˆ

ˆ The problem boils down to finding the associated to the m n i S i S r n θ δ ρ σ θ = ⇒   =  

inimal ˆ that satisfies the equality

S

σ

M-estimator of scale …

(24)

ρ is generally (Tukey Biweight): Introduction Outliers in 3 2 / 1 1 if ( ) i i i r r k r k

σ

σ

ρ

σ

   = 

where for k=1.548 the BDP is 50% and

the efficiency is 28%. For k=5.182 the efficiency is 96% but the BDP is 10%.

regression analysis Overview of robust estimators Stata codes Conclusion ( ) 1 if ri k

ρ

σ

σ

=   >  
(25)

To ensure robustness AND efficiency, Yohai (1987) proposes to estimate an M-estimator: Introduction Outliers in ˆ argmin ( ) n i M r θ θ ρ σ   =  

where ρ is a 95% efficiency Tukey

Biweight function and where

σ

is set

equal to , estimated using a high BDP

S-estimator. The starting point for the

iterations is . regression analysis Overview of robust estimators Stata codes Conclusion 1 argmin M i θ θ ρ σ = =  

ˆ

S

σ

ˆ

S

θ

(26)

Scale parameter= 1111....111188088000777474446666 _cons ....0000333333339999997997727222 ...1.11144644666444747477444222 2 000.0.2..222333 3 000.0...888811711777 --.--.2..222558558884444447447977999 ....3333226226664444444422242444 x5 ....7777000088886606600011121222 ...1.1411444444433733777888844 44 444.4.9..999111 1 00.00...000000000000 ...4.444222020003334340440400444 ....9999996996668886866622212111 x4 ....6666555577778888888800080888 ...1.1411442422255555555777733 33 444.4.6..666111 1 000.0...000000000000 ....337337773332325225655666 ....9999442442225550500055575777 x3 ....999922220008080880030333 ...1.11144454555000505455444555 5 666.6..3.333555 5 000.0...000000000000 ..6..666331331111119192992322333 1111....221221110004044411141444 x2 1111....111188881161166666686888 ...1.1211229299966866888111188 88 999.9.1..111111 1 00.00...000000000000 ...9.999222222227774749449899888 1111....444444440005055588868666 x1 ....9999777755555565566600060666 ...1.1311333333311711777111111 11 777.7.3..333333 3 00.00...000000000000 ...7.777000909996667675775855888 1111....224224441114144444454555 y Coef. Std. Err. t P>|t| [95% Conf. Interval] . Sregress y x* Introduction Outliers in Scale parameter= 1111....111188088000777474446666 Scale parameter= 1.180745 _cons ---.-...11211221211144644668688585 55 ...1.1211228288484044000333366 66 ---0-0.00...999955 55 0000..3..334344477 77 ----..3..337377766866888111313133111 ..1..11313333383388877767666 x5 ....9989988989992292299696676777 ...1.1211222666868888888777722 22 777.7.8..8880000 00.00...000000000000 ....7773733636669999667667777777 1111....224224441111666622262666 x4 ....9929922828889969966666656555 ...1.1111111999797377333000099 99 777.7.7..7776666 00.00...000000000000 ....6669699090008888668668488444 1111....116116667777000066656555 x3 1111..0..00000005505500101161666 ...1.1111111777979299222000033 33 888.8.5..5552222 00.00...000000000000 ....7777777070005555118118688666 1111....223223339999555511131333 x2 ....8898899696667757755353353555 ...1.1111111000808388333333311 11 888.8.0..0009999 00.00...000000000000 ....6667677676663333449449899888 1111....111111117777111155575777 x1 1111...0.003033355255223233636 66 ....111111161669695995556666 888.8.8..8885555 0000....000000000000 ....8880802002226666555555558888 1111....226226667777888811151555 y Coef. Std. Err. t P>|t| [95% Conf. Interval] . MMregress y x* regression analysis Overview of robust estimators Stata codes Conclusion

(27)

Introduction Outliers in

The implemented algorithm:

Salibian-Barrera and Yohai (2006) 1. P-subset

2. Improve the 10 best candidates (i.e.

regression analysis Overview of robust estimators Stata codes Conclusion

2. Improve the 10 best candidates (i.e.

those with the 10 smallest ) using

iteratively reweighted least squares. 3. Keep the improved candidate with the

smallest .

ˆS σ

ˆS σ

(28)

Y

Introduction Outliers in

Pick 2 (p) points randomly and

estimate the equation of the line (hyperplane) connecting them

regression analysis Overview of robust estimators Stata codes Conclusion X

(29)

Y

Introduction Outliers in

Estimate the residuals associated to this line (hyperplane)

regression analysis Overview of robust estimators Stata codes Conclusion X

(30)

Y

Introduction Outliers in

Do it N times and each time calculate the robust residual spread

regression analysis Overview of robust estimators Stata codes Conclusion X

(31)

Introduction Outliers in

Take the 10 regression lines

(hyperplanes) associated with the

smallest robust spreads and run the iterative algorithm described previously to improve the initial candidate.

regression analysis Overview of robust estimators Stata codes Conclusion

to improve the initial candidate.

The regression line (hyperplane)

associated with the smallest refined robust spread will be the estimated S.

(32)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Contamination: %
(33)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Will be the probability that one random point in the dataset is not an outlier

(34)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Will be the probability that none of the p random points in a p-subset is an outlier

(35)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Will be the probability that at least one of the p random points in a p-subset is an outlier

(36)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Will be the probability that there is at least one outlier in each of the N subsets considered (i.e. that all p-subsets are corrupt)

(37)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Will be the probability that there is at least one clean p-subset among the N considered

(38)

Introduction Outliers in

The minimal number of subsets we need

to have a probability (Pr) of having at

least one clean if α% of outliers corrupt

the dataset can be easily derived:

regression analysis Overview of robust estimators Stata codes Conclusion

(

)

log(1 Pr) * log(1 (1 ) )

Pr

1

1

1

p N p

N

α

α

− − −

= −

=

Rearranging we have:
(39)

Introduction Outliers in

If several dummies are present, the algorithm might lead collinear samples. To solve this we programmed the MS-estimator (out of the scope here). Idea:

regression analysis Overview of robust estimators Stata codes Conclusion

(40)

Introduction Outliers in

If several dummies are present, the algorithm might lead collinear samples. To solve this we programmed the MS-estimator (out of the scope here). Idea:

regression analysis Overview of robust estimators Stata codes Conclusion

1 2 1 1 2 2 1 2 2 1 1 1 2 1 1 2 2

ˆ

argmin

([

]

)

ˆ

argmin ˆ ([

]

)

discrete continuous n MS MS i i MS S MS i

y

X

X

y

X

X

y

X

X

θ θ

θ

θ

ε

θ

ρ

θ

θ

θ

σ

θ

θ

=

=

+

+

=



=



(41)

Introduction Outliers in

To properly identify outliers, in addition to robust (standardized) residuals, we need an assessment of the outlyingness in the design space (x variables).

This is generally done by calling on

regression analysis Overview of robust estimators Stata codes Conclusion

This is generally done by calling on Mahalanobis distances:

That are known to be distributed as a for Gaussian data.

1 ( i ) ( i )' MD = x

µ

Σ− x

µ

2 p χ
(42)

Introduction Outliers in

However MD are not robust since they

are based on classical estimations of µ

(location) and Σ (scatter).

This drawback can be easily solved by

regression analysis Overview of robust estimators Stata codes Conclusion

This drawback can be easily solved by

(43)

Introduction Outliers in

A well suited method for this is MCD that considers several subsets containing (generally) 50% of the observations and

estimates µ and Σ on the data of the

regression analysis Overview of robust estimators Stata codes Conclusion

subset associated with the smallest covariance matrix determinant.

(44)

Introduction Outliers in

The generalized variance proposed by

Wilks (1932), is a one-dimensional

measure of multidimensional scatter. It

is defined as .

In the 2x2 case it is easy to see the

det( )

GV

=

Σ

regression analysis Overview of robust estimators Stata codes Conclusion

In the 2x2 case it is easy to see the underlying idea: 2 2 2 2 2

and det( )

x xy x y xy xy y

σ

σ

σ σ

σ

σ

σ

Σ =

Σ =

Raw bivariate spread

(45)

Introduction Outliers in

The implemented algorithm:

Rousseeuw and Van Driessen (1999) 1. P-subset

2. Concentration (sorting distances)

regression analysis Overview of robust estimators Stata codes Conclusion

2. Concentration (sorting distances)

3. Estimation of robust µMCD and ΣMCD

4. Estimation of robust distances:

1

ˆ

(

i

ˆ

MCD

)

MCD

(

i

ˆ

MCD

)'

(46)

clear set obs 1000 local b=sqrt(invchi2(5,0.95)) drawnorm x1-x5 e replace x1=invnorm(uniform())+5 in 1/100 gen outlier=0 Introduction Outliers in gen outlier=0 replace outlier=1 in 1/100 mcd x*, outlier gen RD=Robust_distance hadimvo x*, gen(a b) p(0.5) Scatter RD b regression analysis Overview of robust estimators Stata codes Conclusion

(47)

Introduction Outliers in 6 8 R o b u s t d is ta n c e Fast-MCD regression analysis Overview of robust estimators Stata codes Conclusion 0 2 4 R o b u s t d is ta n c e 0 1 2 3 4 5 Hadi distance (p=.5) Hadi

(48)

1.96

Vertical outliers Bad leverage (Rousseeuw and Van Zomeren, 1990)

Introduction Outliers in _ ˆ i S S res σ 2 ,0.95 p χ χχ χ 0 RDi -1.96

Vertical outliers Bad leverage

Good leverage Good leverage regression analysis Overview of robust estimators Stata codes Conclusion

(49)

clear set obs 1000 local b=sqrt(invchi2(5,0.95)) drawnorm x1-x5 e gen y=x1+x2+x3+x4+x5+e replace x1=invnorm(uniform())+5 in 1/100 Introduction Outliers in replace x1=invnorm(uniform())+5 in 1/100 gen noise=1 in 1/100 Sregress y x*, outlier mcd x*, outlier hadimvo x*, gen(a b) regression analysis Overview of robust estimators Stata codes Conclusion

(50)

-8 -6 -4 -2 0 2 R o b u s t S ta n d a rd iz e d R e s id u a ls 0 2 4 6 8 Introduction Outliers in 0 2 4 6 8 Robust Distances -8 -6 -4 -2 0 2 R o b u s t S ta n d a rd iz e d R e s id u a ls 0 2 4 6 8 Hadi Distances regression analysis Overview of robust estimators Stata codes Conclusion

(51)

webuse auto

xi: Sregress price mpg headroom trunk

weight length turn displacement

gear_ratio foreign i.rep78, outlier

mcd mpg headroom trunk weight length

turn displacement gear_ratio, outlier

Introduction Outliers in

turn displacement gear_ratio, outlier Scatter S_stdres Robust_distance

gen w1= invnormal(0.975)/abs(S_stdres) replace w1=1 if w1>1 gen w2= sqrt(invchi2(r(N),0.95))/RD replace w2=1 if w2>1 gen w=w1*w2 regression analysis Overview of robust estimators Stata codes Conclusion

(52)

Cad. Seville Volvo 260 Linc. Mark V Cad. Eldorado Linc. Versailles 1 0 1 5 2 0 R o b u s t S ta n d a rd iz e d R e s id u a ls Introduction Outliers in VW Diesel Plym. Arrow Subaru -5 0 5 R o b u s t S ta n d a rd iz e d R e s id u a ls 0 5 10 15 20 Robust distances regression analysis Overview of robust estimators Stata codes Conclusion

(53)

S Introduction Outliers in LS regression analysis Overview of robust estimators Stata codes Conclusion

(54)

Introduction Outliers in S LS regression analysis Overview of robust estimators Stata codes Conclusion Furthermore: LS_R²=0.61 LS_RMSE=2031 S_R²=0.82 S_RMSE=402

(55)

Sregress varlist [if exp] [in range] [, e(#) proba(#) noconstant outlier test replic(#) setseed(#)]

MMregress varlist [if exp] [in range] [, e(#) proba(#) noconstant outlier eff

Introduction Outliers in

[, e(#) proba(#) noconstant outlier eff replic(#)]

mcd varlist [if exp] [in range] [, e(#) p(#) trim(#) outlier finsample]

MSregress varlist [if exp] [in range] , dummies(dummies) [ e(#) proba(#)

noconstant outlier test]

regression analysis Overview of robust estimators Stata codes Conclusion

(56)

The available methods to identify (and treat) outliers in Stata are not fully efficient

The proposed commands should be helpful to deal with outliers in:

Introduction Outliers in

1.Regression analysis

2.Multivariate analysis (PCA, etc)

3.Available from [email protected] regression analysis Overview of robust estimators Stata codes Conclusion

References

Related documents