Robust Statistics in Stata
Vincenzo Verardi ([email protected])
FUNDP (Namur) and ULB (Brussels), Belgium FNRS Associate Researcher
Based on joint work with C. Croux (KULeuven) and Catherine Dehon (ULB)
1. Introduction
2. Outliers in regression analysis
3. Overview of robust estimators
4. Stata codes
[Figure: the three types of outliers in the (X, Y) plane: vertical outlier (outside the range for y), good leverage point and bad leverage point (both outside the range for x).]
To illustrate the influence of outliers, we generate a dataset according to $Y = 1.25 + 0.6X + \varepsilon$, where $X$ and $\varepsilon \sim N(0,1)$.
We then contaminate the data with single outliers.
set obs 100
drawnorm X e
gen y=1.25+0.6*X+e
replace X= …
OLS estimates on the clean sample and on the three contaminated samples (t-statistics in parentheses):

             Clean     Vertical   Bad leverage   Good leverage
Intercept    1.24      2.24       4.07           1.25
  t-stat     (10.76)   (7.15)     (6.99)         (10.94)
Slope        0.59      0.67       -0.42          0.57
  t-stat     (4.96)    (2.26)     (-9.02)        (14.04)

The vertical outlier mainly shifts the intercept (2.24), the bad leverage point distorts both the intercept and the slope (4.07 and -0.42), and the good leverage point leaves the fit close to the clean one while inflating the t-statistic of the slope (14.04).
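A minimal sketch of how the three contamination scenarios could be generated. The seed and the outlier positions (a shift of 15 in y, X set to 10) are assumptions chosen for illustration; they are not the values behind the table, so the estimates will differ.

```stata
* Sketch only: seed and outlier values are assumptions
clear
set seed 12345
set obs 100
drawnorm X e
gen y = 1.25 + 0.6*X + e
regress y X                          // clean sample

preserve
replace y = y + 15 in 1              // vertical outlier: unusual y, ordinary X
regress y X
restore

preserve
replace X = 10 in 1                  // bad leverage: unusual X, y far from the true line
regress y X
restore

preserve
replace X = 10 in 1                  // good leverage: unusual X, y on the true line
replace y = 1.25 + 0.6*X + e in 1
regress y X
restore
```

Using preserve and restore keeps each contamination separate, so every regression is compared with the same clean sample.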
The objective of regression analysis is to figure out how a dependent variable is linearly related to a set of explanatory ones.
Technically speaking, it consists in estimating the θ parameters in:
$$y_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_{p-1} x_{i,p-1} + \varepsilon_i$$
to find the model that best fits the data.
On the basis of the estimated parameters, it is then possible to fit the model and predict $\hat{y}$, the dependent variable. The discrepancy between $y$ and $\hat{y}$ is called the residual ($r_i = y_i - \hat{y}_i$).
The objective of LS is to minimize the sum of the squared residuals:
$$\hat{\theta}_{LS} = \arg\min_{\theta} \sum_{i=1}^{n} r_i^2(\theta) \quad \text{where } \theta = (\theta_0, \dots, \theta_{p-1})'$$
However, the squaring of the residuals makes LS very sensitive to outliers.
To increase robustness, the square function could be replaced by the absolute value (Edgeworth, 1887).
[qreg function in Stata]
$$\hat{\theta}_{L_1} = \arg\min_{\theta} \sum_{i=1}^{n} |r_i(\theta)|$$
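For intuition, consider the intercept-only model $y_i = \theta_0 + \varepsilon_i$, a special case chosen here for illustration with made-up data. LS then returns the sample mean and $L_1$ the sample median, so a single gross outlier moves the first but not the second:

```latex
\[
y = (1,\,2,\,3,\,4,\,100):\qquad
\hat{\theta}_{LS} = \bar{y} = \frac{1+2+3+4+100}{5} = 22,
\qquad
\hat{\theta}_{L_1} = \operatorname{med}(y) = 3 .
\]
```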
Huber (1964) generalized this idea to a set of symmetric ρ functions that could be used instead of the absolute value to increase efficiency and robustness.
To guarantee scale equivariance, residuals are standardized by a measure of dispersion σ.
The problem becomes:
$$\hat{\theta}_{M} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\left(\frac{r_i(\theta)}{\sigma}\right)$$
M-estimators can be redescending (1) or monotonic (2).
[Figure: ρ(r) and its derivative ρ'(r) plotted against r for (1) a redescending and (2) a monotonic M-estimator.]
If σ is known, the practical implementation of M-estimators is straightforward. Indeed, by defining a weight:
$$w_i = \frac{\rho\left(r_i(\theta)/\sigma\right)}{r_i^2(\theta)}$$
the problem boils down to:
$$\hat{\theta}_{M} = \arg\min_{\theta} \sum_{i=1}^{n} w_i\, r_i^2(\theta)$$
However:
1. The weights $w_i$ are a function of θ and should thus be estimated iteratively.
2. This iterative algorithm is guaranteed to converge (and yield a unique solution) only for monotonic M-estimators … which are not robust to bad leverage points.
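The iterative scheme can be sketched as an iteratively reweighted least squares loop for a monotonic (Huber) M-estimator. This is an illustration under stated assumptions, not the code used by any Stata command: the data are simulated, σ is treated as known and set to 1, the tuning constant c = 1.345 is a conventional choice, and the weight is the usual IRLS weight ψ(u)/u = min(1, c/|u|) rather than the ρ/r² form written above.

```stata
* Sketch only: simulated data, seed, c and sigma are assumptions
clear
set seed 2024
set obs 200
gen double x = rnormal()
gen double y = 1 + 2*x + rnormal()
replace y = y + 20 in 1/10               // vertical outliers

local c     = 1.345                      // Huber tuning constant
local sigma = 1                          // scale treated as known

gen double w = 1                         // first pass is plain LS
gen double u = .
local b_old = .
local diff  = 1
local iter  = 0
while `diff' > 1e-8 & `iter' < 200 {
    quietly regress y x [aweight = w]
    local b_new = _b[_cons]
    * standardized residuals at the current estimate
    quietly replace u = (y - _b[_cons] - _b[x]*x)/`sigma'
    * Huber weights: 1 for small residuals, c/|u| beyond the cutoff
    quietly replace w = cond(abs(u) <= `c', 1, `c'/abs(u))
    if `iter' > 0 {
        local diff = abs(`b_new' - `b_old')
    }
    local b_old = `b_new'
    local ++iter
}
display "Huber M-estimate of the intercept: " %6.3f `b_old' " (" `iter' " iterations)"
quietly regress y x
display "LS estimate of the intercept:      " %6.3f _b[_cons]
```

Because the contamination here is purely vertical, the monotonic M-estimator copes with it; the same loop would not protect against bad leverage points.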
The rreg command was created to tackle these problems. It works as follows:
1. It awards a weight of zero to individuals with Cook distances larger than 1.
2. A "redescending" M-estimator is computed using the iterative algorithm, starting from a monotonic M-solution.
3. σ is re-estimated at each iteration using the median absolute deviation of the residuals of the previous iteration.
Unfortunately, this command does not have the expected robustness properties:
1. Cook distances do not help to identify leverage points when (clustered) outliers mask one another.
2. The preliminary monotonic M-estimator provides a poor initial candidate because of point 1.
3. σ is poorly estimated because of point 1.
qreg and rreg are not robust methods:
Stata example:
set obs 100
drawnorm x1-x5 e
gen y=x1+x2+x3+x4+x5+e
replace x1=invnorm(uniform())+10 in 1/10
qreg y x*
rreg y x*
display e(rmse)
. qreg y x*
Iteration 1: WLS sum of weighted deviations  = 117.31824
Iteration 1: sum of abs. weighted deviations = 119.64818
Iteration 2: sum of abs. weighted deviations = 117.18714
Iteration 3: sum of abs. weighted deviations = 117.04369
Iteration 4: sum of abs. weighted deviations = 116.65145
Iteration 5: sum of abs. weighted deviations = 116.01905
Iteration 6: sum of abs. weighted deviations = 116.01677

Median regression                      Number of obs = 100
  Raw sum of deviations 202.8451 (about -.239)
  Min sum of deviations 116.0168       Pseudo R2 = 0.4281

(estimates rounded to three decimals)
y        Coef.    Std. Err.   t       P>|t|   [95% Conf. Interval]
x1       .180     .054        3.35    0.001    .073     .286
x2       .755     .159        4.75    0.000    .439    1.070
x3       .949     .168        5.66    0.000    .616    1.282
x4       .877     .162        5.40    0.000    .555    1.200
x5       .993     .179        5.54    0.000    .637    1.349
_cons   -.001     .189       -0.00    0.996   -.376     .374

. rreg y x*
Huber iteration 1:    maximum difference in weights = .48417173
Huber iteration 2:    maximum difference in weights = .06025306
Huber iteration 3:    maximum difference in weights = .01572401
Biweight iteration 4: maximum difference in weights = .14759052
Biweight iteration 5: maximum difference in weights = .00770808

Robust regression                      Number of obs = 100
                                       F(5, 94)      = 33.28
                                       Prob > F      = 0.0000

(estimates rounded to three decimals)
y        Coef.    Std. Err.   t       P>|t|   [95% Conf. Interval]
x1       .175     .051        3.40    0.001    .073     .278
x2       .924     .146        6.33    0.000    .634    1.214
x3       .922     .157        5.87    0.000    .610    1.234
x4       .778     .155        5.01    0.000    .469    1.087
x5      1.116     .164        6.81    0.000    .790    1.441
_cons   -.058     .175       -0.33    0.739   -.406     .289

. display e(rmse)
1.612

In both cases the coefficient of x1 is estimated at about 0.18, far from its true value of 1.
Robustness can however be achieved by tackling the problem from a different perspective.
Instead of minimizing the variance of the residuals (LS), a more robust measure of spread of the residuals could be minimized (Rousseeuw and Yohai, 1987).
The measure of spread considered here is an M-estimator of scale.
Intuition:
The variance is defined by:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} r_i^2(\theta)$$
which can be rewritten:
$$1 = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{r_i(\theta)}{\hat{\sigma}}\right)^2$$
hence LS looks for the θ associated with the minimal $\hat{\sigma}$ that satisfies the equality.
But the square function ...
Replace the square by another ρ:
$$\frac{1}{n} \sum_{i=1}^{n} \rho\left(\frac{r_i(\theta)}{\hat{\sigma}_S}\right) = 1$$
but for Gaussian data we want $\hat{\sigma}_S$ to be the standard deviation (correction):
$$\Rightarrow \quad \frac{1}{n} \sum_{i=1}^{n} \rho\left(\frac{r_i(\theta)}{\hat{\sigma}_S}\right) = \delta$$
where $\delta = E[\rho(Z)]$ with $Z \sim N(0,1)$.
The problem boils down to finding the $\hat{\theta}_S$ associated with the minimal $\hat{\sigma}_S$ that satisfies the equality.
For a given θ, $\hat{\sigma}_S$ is an M-estimator of scale.
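A minimal sketch of how an M-estimate of scale could be computed for a fixed vector of residuals, using the Tukey biweight ρ with k = 1.548. Everything here is an assumption for illustration: the residuals are simulated, δ = 0.5 is the value taken for E[ρ(Z)] at this k, and the fixed-point update is one common way to solve the defining equation, not necessarily the one used in Sregress.

```stata
* Sketch only: simulated residuals, seed and contamination are assumptions
clear
set seed 42
set obs 500
gen double res = rnormal()
replace res = res + 10 in 1/50            // 10% gross outliers

local k     = 1.548                       // biweight constant for a 50% BDP
local delta = 0.5                         // assumed E[rho(Z)], Z ~ N(0,1)

* Starting value: normalized median absolute deviation
quietly summarize res, detail
gen double absdev = abs(res - r(p50))
quietly summarize absdev, detail
local s = r(p50)/0.6745

* Fixed-point iterations on (1/n) * sum rho(res/s) = delta
gen double rho = .
forvalues it = 1/50 {
    quietly replace rho = cond(abs(res/`s') <= `k', 1 - (1 - (res/(`k'*`s'))^2)^3, 1)
    quietly summarize rho, meanonly
    local s = `s'*sqrt(r(mean)/`delta')
}
display "M-estimate of scale: " %6.3f `s'
quietly summarize res
display "Standard deviation:  " %6.3f r(sd)
```

The two displayed values can be compared to see how much the outliers inflate the classical standard deviation relative to the M-estimate.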
ρ is generally the Tukey Biweight:
$$\rho\left(\frac{r_i}{\sigma}\right) = \begin{cases} 1 - \left[1 - \left(\dfrac{r_i}{k\sigma}\right)^2\right]^3 & \text{if } \left|\dfrac{r_i}{\sigma}\right| \le k \\[2ex] 1 & \text{if } \left|\dfrac{r_i}{\sigma}\right| > k \end{cases}$$
where for k=1.548 the BDP is 50% and the efficiency is 28%. For k=5.182 the efficiency is 96% but the BDP is 10%.
To ensure robustness AND efficiency, Yohai (1987) proposes to estimate an M-estimator:
$$\hat{\theta}_{M} = \arg\min_{\theta} \sum_{i=1}^{n} \rho\left(\frac{r_i(\theta)}{\sigma}\right)$$
where ρ is a 95% efficiency Tukey Biweight function and where σ is set equal to $\hat{\sigma}_S$, estimated using a high BDP S-estimator. The starting point for the iterations is $\hat{\theta}_S$.
. Sregress y x*
(estimates rounded to three decimals)
y        Coef.    Std. Err.   t       P>|t|   [95% Conf. Interval]
x1       .976     .133        7.33    0.000    .710    1.241
x2      1.182     .130        9.11    0.000    .923    1.441
x3       .921     .145        6.35    0.000    .631    1.210
x4       .658     .143        4.61    0.000    .373     .943
x5       .709     .144        4.91    0.000    .420     .997
_cons    .034     .146        0.23    0.817   -.258     .326
Scale parameter= 1.180746

. MMregress y x*
(estimates rounded to three decimals)
y        Coef.    Std. Err.   t       P>|t|   [95% Conf. Interval]
x1      1.035     .117        8.85    0.000    .803    1.268
x2       .897     .111        8.09    0.000    .676    1.117
x3      1.005     .118        8.52    0.000    .771    1.240
x4       .929     .120        7.76    0.000    .691    1.167
x5       .989     .127        7.80    0.000    .737    1.242
_cons   -.121     .128       -0.95    0.347   -.377     .134
Scale parameter= 1.180745
The implemented algorithm:
Salibian-Barrera and Yohai (2006)
1. P-subset
2. Improve the 10 best candidates (i.e. those with the 10 smallest $\hat{\sigma}_S$) using iteratively reweighted least squares.
3. Keep the improved candidate with the smallest $\hat{\sigma}_S$.
The P-subset step, illustrated in the (X, Y) plane:
1. Pick 2 (p) points randomly and estimate the equation of the line (hyperplane) connecting them.
2. Estimate the residuals associated with this line (hyperplane).
3. Do it N times and each time calculate the robust residual spread.
4. Take the 10 regression lines (hyperplanes) associated with the smallest robust spreads and run the iterative algorithm described previously to improve each initial candidate.
5. The regression line (hyperplane) associated with the smallest refined robust spread will be the estimated S.
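The random search in steps 1 to 3 can be sketched for a simple regression (p = 2). This is a simplified illustration, not the Sregress implementation: the data, the seed and N = 200 are assumptions, the median absolute residual stands in for the M-estimator of scale, and the refinement of the 10 best candidates is left out.

```stata
* Sketch only: simulated data, seed and N = 200 are assumptions; the median
* absolute residual stands in for the M-estimator of scale
clear
set seed 7
set obs 100
gen double x = rnormal()
gen double y = 1.25 + 0.6*x + rnormal()
replace x = x + 10 in 1/10               // bad leverage points

gen double u    = .
gen double absr = .
local best = .
forvalues j = 1/200 {
    quietly replace u = runiform()
    sort u                               // first 2 rows form a random p-subset
    * line through the two selected points
    local b1 = (y[2] - y[1])/(x[2] - x[1])
    local b0 = y[1] - `b1'*x[1]
    * residuals of all observations with respect to that line
    quietly replace absr = abs(y - `b0' - `b1'*x)
    quietly summarize absr, detail
    if r(p50) < `best' {                 // keep the candidate with the smallest spread
        local best    = r(p50)
        local best_b0 = `b0'
        local best_b1 = `b1'
    }
}
display "Best candidate: intercept " %6.3f `best_b0' ", slope " %6.3f `best_b1'
```

The comparison `r(p50) < .` is true on the first pass because Stata treats missing as larger than any number, so the first candidate always initializes the search.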
The minimal number of subsets $N^*$ needed to have a probability Pr of drawing at least one clean p-subset when a fraction α of outliers (the contamination, in %) corrupts the dataset can be easily derived:
1. $(1-\alpha)$ is the probability that one random point in the dataset is not an outlier.
2. $(1-\alpha)^p$ is the probability that none of the p random points in a p-subset is an outlier.
3. $1-(1-\alpha)^p$ is the probability that at least one of the p random points in a p-subset is an outlier.
4. $\left[1-(1-\alpha)^p\right]^N$ is the probability that there is at least one outlier in each of the N subsets considered (i.e. that all p-subsets are corrupt).
5. $Pr = 1-\left[1-(1-\alpha)^p\right]^N$ is the probability that there is at least one clean p-subset among the N considered.
Rearranging we have:
$$N^* = \frac{\log(1-Pr)}{\log\left(1-(1-\alpha)^p\right)}$$
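As a worked example of the formula, the required number of subsets can be evaluated directly. The values Pr = 0.99, p = 5 and α = 0.20 are assumptions chosen for illustration.

```stata
* Number of p-subsets needed for a 99% chance of at least one clean subset
* (Pr, p and alpha are illustrative assumptions)
local Pr    = 0.99
local p     = 5
local alpha = 0.20
display ceil(ln(1 - `Pr')/ln(1 - (1 - `alpha')^`p'))
```

With these values the ratio is about 11.6, so 12 subsets are enough. The number grows quickly with both p and α because $(1-\alpha)^p$ shrinks geometrically.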
If several dummies are present, the algorithm might lead to collinear subsamples. To solve this we programmed the MS-estimator (out of the scope here). Idea:
$$y = X_1\theta_1 + X_2\theta_2 + \varepsilon \qquad (X_1 \text{ discrete},\; X_2 \text{ continuous})$$
$$\hat{\theta}_1^{MS} = \arg\min_{\theta_1} \sum_{i=1}^{n} \rho\left([y_i - X_{2i}\hat{\theta}_2^{MS}] - X_{1i}\theta_1\right)$$
$$\hat{\theta}_2^{MS} = \arg\min_{\theta_2} \hat{\sigma}_S\left([y - X_1\hat{\theta}_1^{MS}] - X_2\theta_2\right)$$
To properly identify outliers, in addition to robust (standardized) residuals, we need an assessment of the outlyingness in the design space (x variables).
This is generally done by calling on Mahalanobis distances:
$$MD_i = \sqrt{(x_i - \mu)\,\Sigma^{-1}\,(x_i - \mu)'}$$
which are known to be distributed as $\sqrt{\chi^2_p}$ for Gaussian data.
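A minimal sketch of classical Mahalanobis distances computed in Mata and compared with the 95% cutoff. The simulated data, the seed and the variable name MD are assumptions for illustration.

```stata
* Sketch only: simulated data; MD is a hypothetical variable name
clear
set seed 1
set obs 1000
drawnorm x1 x2 x3

mata:
X  = st_data(., "x1 x2 x3")
mu = mean(X)                              // classical location (row vector)
S  = variance(X)                          // classical scatter matrix
// row-wise quadratic form (x_i - mu) S^-1 (x_i - mu)'
MD = sqrt(rowsum(((X :- mu) * invsym(S)) :* (X :- mu)))
st_store(., st_addvar("double", "MD"), MD)
end

* share of points beyond the sqrt(chi2_p) cutoff, here with p = 3
count if MD > sqrt(invchi2(3, 0.95))
```

On clean Gaussian data roughly 5% of the observations are expected to exceed the cutoff.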
However, MD are not robust since they are based on classical estimates of µ (location) and Σ (scatter).
This drawback can be easily solved by using robust estimates of µ and Σ.
A well-suited method for this is MCD, which considers several subsets containing (generally) 50% of the observations and estimates µ and Σ on the data of the subset associated with the smallest covariance matrix determinant.
The generalized variance proposed by Wilks (1932) is a one-dimensional measure of multidimensional scatter. It is defined as $GV = \det(\Sigma)$.
In the 2x2 case it is easy to see the underlying idea:
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix} \quad \text{and} \quad \det(\Sigma) = \sigma_x^2\sigma_y^2 - \sigma_{xy}^2$$
where $\sigma_x^2\sigma_y^2$ is the raw bivariate spread.
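A small numerical illustration with made-up values: with $\sigma_x^2 = 4$, $\sigma_y^2 = 9$ and $\sigma_{xy} = 3$, the raw bivariate spread is 36 and the covariance term removes 9 of it. With uncorrelated variables nothing would be removed.

```latex
\[
\Sigma = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}
\quad\Rightarrow\quad
\det(\Sigma) = 4 \cdot 9 - 3^2 = 36 - 9 = 27 .
\]
```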
The implemented algorithm:
Rousseeuw and Van Driessen (1999)
1. P-subset
2. Concentration (sorting distances)
3. Estimation of robust $\hat{\mu}_{MCD}$ and $\hat{\Sigma}_{MCD}$
4. Estimation of robust distances:
$$RD_i = \sqrt{(x_i - \hat{\mu}_{MCD})\,\hat{\Sigma}_{MCD}^{-1}\,(x_i - \hat{\mu}_{MCD})'}$$
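The concentration step can be sketched in Mata as follows. This is a simplified illustration and not the mcd command: it starts from a single random half-sample instead of many p-subsets, runs a fixed number of iterations, and applies no consistency or finite-sample correction. The data, the seed and the variable name RD_sketch are assumptions.

```stata
* Sketch only: one random start, no correction factors; names are hypothetical
clear
set seed 99
set obs 1000
drawnorm x1 x2
replace x1 = x1 + 5 in 1/100              // 10% outliers

mata:
X   = st_data(., "x1 x2")
n   = rows(X)
h   = ceil(n/2)                           // size of the half-sample
o   = order(runiform(n, 1), 1)
idx = o[1::h]                             // random initial subset
for (it = 1; it <= 20; it++) {
    mu  = mean(X[idx, .])                 // location on the current subset
    S   = variance(X[idx, .])             // scatter on the current subset
    d2  = rowsum(((X :- mu) * invsym(S)) :* (X :- mu))
    o   = order(d2, 1)                    // sort squared distances
    idx = o[1::h]                         // keep the h closest observations
}
mu = mean(X[idx, .])
S  = variance(X[idx, .])
RD = sqrt(rowsum(((X :- mu) * invsym(S)) :* (X :- mu)))
st_store(., st_addvar("double", "RD_sketch"), RD)
end

summarize RD_sketch in 1/100
summarize RD_sketch in 101/1000
```

Each pass recomputes location and scatter on the h observations closest to the current center, which cannot increase the determinant of the subset covariance matrix.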
clear
set obs 1000
local b=sqrt(invchi2(5,0.95))
drawnorm x1-x5 e
* shift the first 100 observations of x1 and flag them
replace x1=invnorm(uniform())+5 in 1/100
gen outlier=0
replace outlier=1 in 1/100
* robust (MCD) distances versus Hadi distances
mcd x*, outlier
gen RD=Robust_distance
hadimvo x*, gen(a b) p(0.5)
scatter RD b
[Figure: robust distances (Fast-MCD) plotted against Hadi distances (p=.5).]
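[Figure: outlier diagnostic plot (Rousseeuw and Van Zomeren, 1990). Vertical axis: standardized S residuals $res_i/\hat{\sigma}_S$, with cutoffs at 1.96 and -1.96. Horizontal axis: robust distances $RD_i$, with cutoff at $\sqrt{\chi^2_{p,0.95}}$. Points outside the residual band are vertical outliers (small $RD_i$) or bad leverage points (large $RD_i$); points inside the band with large $RD_i$ are good leverage points.]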
clear
set obs 1000
local b=sqrt(invchi2(5,0.95))
drawnorm x1-x5 e
gen y=x1+x2+x3+x4+x5+e
replace x1=invnorm(uniform())+5 in 1/100
gen noise=1 in 1/100
Sregress y x*, outlier
mcd x*, outlier
hadimvo x*, gen(a b)
[Figure: two panels showing robust standardized residuals plotted against robust distances (left) and against Hadi distances (right).]
webuse auto
xi: Sregress price mpg headroom trunk weight length turn displacement ///
    gear_ratio foreign i.rep78, outlier
mcd mpg headroom trunk weight length turn displacement gear_ratio, outlier
gen RD=Robust_distance
scatter S_stdres Robust_distance
* down-weight large robust residuals (w1) and large robust distances (w2)
gen w1= invnormal(0.975)/abs(S_stdres)
replace w1=1 if w1>1
* r(N) is kept from the original; the chi-squared cutoff needs the number of
* variables passed to mcd (8 here) as its degrees of freedom
gen w2= sqrt(invchi2(r(N),0.95))/RD
replace w2=1 if w2>1
gen w=w1*w2
[Figure: robust standardized residuals plotted against robust distances for the auto data. Labelled points: Cad. Seville, Volvo 260, Linc. Mark V, Cad. Eldorado, Linc. Versailles, VW Diesel, Plym. Arrow, Subaru.]
[Figure: comparison of the S and LS fits.]
Furthermore:
LS: R² = 0.61, RMSE = 2031
S: R² = 0.82, RMSE = 402
Sregress varlist [if exp] [in range] [, e(#) proba(#) noconstant outlier test replic(#) setseed(#)]
MMregress varlist [if exp] [in range] [, e(#) proba(#) noconstant outlier eff replic(#)]
mcd varlist [if exp] [in range] [, e(#) p(#) trim(#) outlier finsample]
MSregress varlist [if exp] [in range] , dummies(dummies) [ e(#) proba(#) noconstant outlier test]
The available methods to identify (and treat) outliers in Stata are not fully effective.
The proposed commands should be helpful to deal with outliers in:
1. Regression analysis
2. Multivariate analysis (PCA, etc.)
The commands are available from [email protected]