High Breakdown Rank-Based Fits - Nonparametric Statistical Methods Using R

High breakdown rank-based (HBR) estimates were developed by Chang et al. (1999) and are fully discussed in Section 3.12 of Hettmansperger and McKean (2011). To obtain HBR fits of linear models, a suite of R functions (ww) was developed by Terpstra and McKean (2005). We use a modified version, hbrfit, of ww to compute HBR fits.¹

The objective function for HBR estimation is a weighted Wilcoxon dispersion function given by

$$\|v\|_{HBR} = \sum_{i<j} b_{ij}\,|v_i - v_j|, \tag{7.6}$$

where $b_{ij} \ge 0$ and $b_{ij} = b_{ji}$. The HBR estimator of $\beta$ minimizes this objective function, which we denote by

$$\hat{\beta}_{HBR} = \operatorname{Argmin} \|y - X\beta\|_{HBR}. \tag{7.7}$$

As with the rank-based estimates, the intercept $\alpha$ is estimated as the median of the residuals; that is,

$$\hat{\alpha} = \operatorname{med}_i \{y_i - x_i^T \hat{\beta}_{HBR}\}. \tag{7.8}$$

As shown in Chapter 3 of Hettmansperger and McKean (2011), if all the weights are one (i.e., $b_{ij} \equiv 1$) then $\|\cdot\|_{HBR}$ is the Wilcoxon norm. Thus the question is, what weights should be chosen to yield estimates which are robust to outliers in both the x- and y-spaces? In Section 7.2.1, we discuss the HBR weights implemented in hbrfit, which achieve 50% breakdown. For now, though, we illustrate their use and computation with several examples.
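To make expressions (7.6) through (7.8) concrete, the following is a minimal base R sketch for a single predictor with all weights equal to one. The function name wdisp, the toy data, and the use of optimize as the minimizer are our own illustrative choices; this is not the algorithm used by hbrfit.

```r
# Weighted Wilcoxon dispersion (7.6): sum over i < j of b_ij * |v_i - v_j|.
# With the default b (all ones) this is the unweighted (Wilcoxon) case.
wdisp <- function(v, b = matrix(1, length(v), length(v))) {
  d <- abs(outer(v, v, "-"))      # matrix of |v_i - v_j|
  sum((b * d)[upper.tri(d)])      # keep only the pairs with i < j
}

# Made-up toy data: an exact line y = 2 + 3x with one gross outlier in y.
x <- 1:6
y <- 2 + 3 * x
y[6] <- 40

# (7.7) for a single predictor: minimize the dispersion of y - x * beta.
# optimize() is a generic 1-D minimizer, used here only for illustration.
beta_hat <- optimize(function(beta) wdisp(y - beta * x),
                     interval = c(-10, 10))$minimum

# (7.8): the intercept is the median of the residuals.
alpha_hat <- median(y - beta_hat * x)

c(alpha = alpha_hat, beta = beta_hat)
```

On these toy data the estimates stay close to the slope 3 and intercept 2 of the uncontaminated line, since the single outlier is in the y-space only. Outliers in the x-space are what motivate weights other than one.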

Stars Data

In this subsection we present an example to illustrate the usage of the weighted Wilcoxon code hbrfit to compute HBR estimates. This example uses the stars dataset, which is from Rousseeuw et al. (1987). The data are from an astronomy study on the star cluster CYG OB1. The cluster contains 47 stars. Measurements were taken on light intensity and temperature. The response variable is log light intensity and the explanatory variable is log temperature.

FIGURE 7.1

Scatterplot of stars data (light versus temperature).

¹ See https://github.com/kloke/book for more information.

As is apparent in the scatterplot displayed in Figure 7.1 there are several outliers: there are four stars with lower temperature and higher light intensity than the other members of the cluster. These four stars are labeled giant stars in this dataset. The others are labeled main sequence stars, except for the two with log temperature 3.84 and 4.01 which are between the giant and main sequence stars.

In Figure 7.2, the Wilcoxon (WIL), high breakdown (HBR), and least squares (LS) fits are overlaid on the scatterplot. As seen in Figure 7.2, both the least squares and Wilcoxon fits are affected substantially by the outliers; the HBR fit, however, is robust.

The HBR fit is computed as

> fitHBR<-hbrfit(stars$light ~ stars$temperature)

FIGURE 7.2

Scatterplot of stars data with fitted regression lines (WIL, HBR, LS) overlaid.

As we have emphasized throughout the book, the use of residuals, in particular Studentized residuals, is essential to the model building process. Studentized residuals are available through the command rstudent. In addition, a set of diagnostic plots can be obtained using diagplot. For the HBR fit of the stars data, Figure 7.3 displays these diagnostic plots, which resulted from the code:

> diagplot(fitHBR)

Note from these plots in Figure 7.3 that the Studentized residuals of the HBR fit clearly identify the 4 giant stars. They also identify the two stars between the giant and main sequence stars.

Finally, we may examine the estimated regression coefficients and their standard errors in the table of regression coefficients with the command summary; i.e.,

> summary(fitHBR)
Call:
hbrfit(formula = stars$light ~ stars$temperature)

Coefficients:
                  Estimate Std. Error t.value   p.value
(Intercept)       -3.46917    1.64733 -2.1059   0.04082 *
stars$temperature  1.91667    0.38144  5.0248  8.47e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Wald Test: 25.24853 p-value: 1e-05

FIGURE 7.3

Diagnostic plots based on the HBR fit of the stars data. Panels: Residuals vs. Fits; Histogram of Residuals; Case Plot of Studentized Residuals; Normal Q-Q Plot of Residuals.

The estimate of the intercept is −3.47 (se = 1.65). The estimate of the slope is 1.92 (se = 0.38). Critical values based on a t-distribution with n − p − 1 degrees of freedom are recommended for inference; for example, the p-values in the coefficients table for the stars data are based on a $t_{45}$ distribution. Also displayed is a Wald test of $H_0\colon \beta = 0$.
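As a check on the degrees of freedom, the t-values and p-values in the table can be recomputed from the printed estimates and standard errors using base R alone. This is a sketch that copies the rounded numbers from the output; it does not call hbrfit, and small rounding differences are to be expected.

```r
# Values copied from the summary(fitHBR) output for the stars data.
n <- 47; p <- 1
est <- c(intercept = -3.46917, slope = 1.91667)
se  <- c(intercept = 1.64733,  slope = 0.38144)

df    <- n - p - 1                       # 45 degrees of freedom
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df)    # two-sided p-values from t_45

round(t_val, 4)
signif(p_val, 3)
unname(t_val["slope"]^2)   # compare with the reported Wald statistic
```

For this fit with a single predictor, the squared t-value of the slope is close to the reported Wald statistic.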

7.2.1 Weights for the HBR Fit

Let $X_c$ be the centered design matrix. For weights it seems reasonable to downweight points far from the center of the data. The traditional distances are the leverage values $h_i = n^{-1} + x_{ci}'(X_c'X_c)^{-1}x_{ci}$, where $x_{ci}$ is the vector of the ith row of $X_c$. Because the leverage values are based on the LS variance-covariance scatter matrix, they are not robust. The weights for the HBR estimate make use of the high breakdown minimum covariance determinant, MCD, which is an ellipsoid in p-space that covers about half of the data and yet has minimum determinant. Rousseeuw and Van Driessen (1999) present a fast computational algorithm for it. Let $V$ and $v_c$ denote respectively the MCD and the center of the MCD. The robust distances and weights are respectively

$$v_{ni} = (x_{ci} - v_c)'V^{-1}(x_{ci} - v_c) \quad \text{and} \quad w_i = \min\left\{1, \frac{c}{v_{ni}}\right\},$$

where $c$ is usually set at the 95th percentile of the $\chi^2(p)$ distribution. Note that “good” points generally have weight 1. The estimator $\hat{\beta}^*$ (7.7) of $\beta$ obtained with these weights is called a generalized R (GR) estimator. In general, this GR estimator has a bounded influence function in both the Y- and the x-spaces and a positive breakdown. It can be computed using the suite of R functions ww with wts = "GR".
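The leverage values, robust distances, and weights just described can be sketched in a few lines of R. We assume here that cov.rob from the MASS package, with method = "mcd", is an acceptable stand-in for the MCD; the simulated design matrix and the planted leverage point are made up, and the ww and hbrfit code may compute these quantities differently.

```r
library(MASS)   # cov.rob() provides an MCD estimate (assumed stand-in)

set.seed(1)
n <- 30; p <- 2
X <- cbind(rnorm(n), rnorm(n))                 # made-up design matrix
X[n, ] <- c(8, 8)                              # plant one high-leverage point
Xc <- scale(X, center = TRUE, scale = FALSE)   # centered design matrix

# Classical leverage values h_i = 1/n + x_ci' (Xc'Xc)^{-1} x_ci
h <- 1 / n + rowSums((Xc %*% solve(crossprod(Xc))) * Xc)

# MCD center v_c and scatter V, then the robust distances v_ni
mcd <- cov.rob(Xc, method = "mcd")
v   <- mahalanobis(Xc, center = mcd$center, cov = mcd$cov)

# Weights w_i = min{1, c / v_ni}, with c the 95th percentile of chi-square(p)
cc <- qchisq(0.95, df = p)
w  <- pmin(1, cc / v)

# First point versus the planted leverage point
round(cbind(leverage = h, distance = v, weight = w)[c(1, n), ], 3)
```

The planted point receives a weight well below 1, while points near the bulk of the data keep weight 1 or close to it.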

Note that the GR estimate downweights “good” points as well as “bad” points of high leverage. Due to this indiscriminate downweighting the GR estimator is less efficient than the Wilcoxon estimator. At times, the loss in efficiency can be severe. The HBR weights also use the MCD to determine weights in the x-space. Unlike the GR weights, though, residual information from the Y-space is also used. These residuals are based on the least trimmed squares (LTS) estimate, which is

$$\operatorname{Argmin} \sum_{i=1}^{h} [Y - \alpha - x'\beta]^2_{(i)},$$

where $h = [n/2] + 1$ and $(i)$ denotes the ith ordered residual. This is a high breakdown initial estimate; see Rousseeuw and Van Driessen (1999). Let $\hat{e}^{(0)}$ denote the residuals from this initial fit.
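The LTS criterion itself is simple to write down, even though minimizing it requires a specialized algorithm. The following base R sketch only evaluates the criterion at two candidate fits; the function name lts_crit and the toy data are hypothetical, and no attempt is made to search for the minimizer.

```r
# LTS criterion: sum of the h smallest squared residuals, h = [n/2] + 1.
lts_crit <- function(alpha, beta, x, y) {
  x  <- as.matrix(x)
  r2 <- sort(drop(y - alpha - x %*% beta)^2)   # ordered squared residuals
  h  <- floor(length(y) / 2) + 1
  sum(r2[seq_len(h)])
}

# Made-up data: exact line y = 1 + 2x with three grossly shifted responses.
x <- 1:10
y <- 1 + 2 * x
y[8:10] <- y[8:10] + 50

ls_coef <- coef(lm(y ~ x))                # least squares fit, for contrast

lts_crit(1, 2, x, y)                      # criterion at the uncontaminated line
lts_crit(ls_coef[1], ls_coef[2], x, y)    # criterion at the LS fit
```

Because only the h = 6 smallest squared residuals enter the sum, the three shifted responses do not affect the criterion at the uncontaminated line, whereas the LS fit is pulled toward them and scores worse.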

Define the function $\psi(t)$ by $\psi(t) = 1$, $t$, or $-1$ according as $t \ge 1$, $-1 < t < 1$, or $t \le -1$. The HBR weights $b_{ij}$ are formed by applying $\psi$ to a product of factors based on the robust distances in the x-space and on the initial residuals, where $b$ and $c$ are tuning constants. Following Chang et al. (1999), $b$ is set at the upper $\chi^2_{.05}(p)$ quantile and $c$ is set as

$$c = [\operatorname{med}\{a_i\} + 3\,\mathrm{MAD}\{a_i\}]^2,$$

where $a_i = \hat{e}^{(0)}_i / (\mathrm{MAD} \cdot Q_i)$. From this point of view, it is clear that these weights downweight both outlying points in factor space and outlying responses. Note that the initial residual information is a multiplicative factor in the weight function. Hence, a good leverage point will generally have a small (in absolute value) initial residual which will offset its distance in factor space.

These are the weights used for the HBR fit computed by hbrfit.
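The function ψ and the two tuning constants can be sketched in base R as follows. We take ψ to be the usual clipping of t to the interval [−1, 1], and we use R's mad, which includes the consistency factor 1.4826, for every MAD in the formulas; both readings are assumptions. The residuals e0 and distances Q are made-up numbers, and the sketch stops short of forming the weights themselves.

```r
# psi(t) = 1, t, or -1 according as t >= 1, -1 < t < 1, or t <= -1
# (assumed reading: clip t to the interval [-1, 1])
psi <- function(t) pmax(-1, pmin(1, t))

# Made-up inputs, for illustration only
e0 <- c(0.2, -0.5, 0.1, 4.0, -0.3, 0.4)   # initial (LTS) residuals
Q  <- c(1.1, 0.8, 2.0, 9.5, 1.4, 0.6)     # robust distances in the x-space
p  <- 1

# Tuning constants. Treating R's mad() as the MAD of the text is an assumption.
b      <- qchisq(0.95, df = p)            # upper .05 quantile of chi-square(p)
a      <- e0 / (mad(e0) * Q)
c_tune <- (median(a) + 3 * mad(a))^2

psi(c(-2, 0.5, 3))
c(b = b, c = c_tune)
```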

In general, the HBR estimator has a 50% breakdown point, provided the initial estimates used in forming the weights also have a 50% breakdown point. Further, its influence function is a bounded function in both the Y- and the x-spaces, is continuous everywhere, and converges to zero as $(x^*, Y^*)$ get large in any direction. The estimator $\hat{\beta}_{HBR}$ is asymptotically normal.

As with all high breakdown estimates, $\hat{\beta}_{HBR}$ is less efficient than the Wilcoxon estimates, but it regains some of the efficiency loss of the GR estimate. See Section 3.12 of Hettmansperger and McKean (2011) for discussion.
