
Problems with the Predictors

5.1 Errors in the Predictors

The regression model $Y = X\beta + \varepsilon$ allows for Y being measured with error by having the $\varepsilon$ term, but what if X is measured with error? In other words, what if the X we see is not the X used to generate Y? It is not unreasonable that there might be errors in measuring X.

For example, consider the problem of determining the effects of being exposed to a potentially hazardous substance such as secondhand tobacco smoke. Such exposure would be a predictor in such a study, but clearly it is very hard to measure this exactly over a period of years.

One should not confuse the errors in predictors with treating X as a random variable.

For observational data, X could be regarded as a random variable, but the regression inference proceeds conditional on a fixed value for X. We make the assumption that the Y is generated conditional on the fixed value of X. Contrast this with the errors in predictors case where the X we see is not the X that was used to generate the Y.

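Suppose that what we observe is $(x_i^O, y_i^O)$ for $i = 1, \ldots, n$, which are related to the true values $(x_i^A, y_i^A)$ by:

$$y_i^O = y_i^A + \varepsilon_i, \qquad x_i^O = x_i^A + \delta_i$$

where the errors $\varepsilon$ and $\delta$ are independent. The situation is depicted in Figure 5.1. The true underlying relationship is:

$$y_i^A = \beta_0 + \beta_1 x_i^A$$

but we only see $(x_i^O, y_i^O)$. Putting it together, we get:

$$y_i^O = \beta_0 + \beta_1 x_i^O + (\varepsilon_i - \beta_1 \delta_i)$$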
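Written out step by step, the substitution that combines the observation equations with the true relationship is as follows. This is only a restatement of the definitions, with $x^A$ the true and $x^O$ the observed predictor. The last line additionally assumes that $\varepsilon$ is uncorrelated with both $x^A$ and $\delta$, and writes $\sigma_\delta^2$ for the variance of $\delta$ and $\sigma_{x\delta}$ for $\mathrm{cov}(x^A, \delta)$.

```latex
\begin{align*}
y_i^O &= y_i^A + \varepsilon_i \\
      &= \beta_0 + \beta_1 x_i^A + \varepsilon_i \\
      &= \beta_0 + \beta_1 (x_i^O - \delta_i) + \varepsilon_i \\
      &= \beta_0 + \beta_1 x_i^O + (\varepsilon_i - \beta_1 \delta_i), \\[4pt]
\operatorname{cov}(x^O,\ \varepsilon - \beta_1 \delta)
      &= \operatorname{cov}(x^A + \delta,\ \varepsilon - \beta_1 \delta)
       = -\beta_1 (\sigma_{x\delta} + \sigma_\delta^2).
\end{align*}
```

The composite error term is therefore generally correlated with the observed predictor, which is why least squares applied to the observed data can be biased. The covariance vanishes only when $\beta_1 = 0$ or $\sigma_{x\delta} = -\sigma_\delta^2$.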

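Suppose we use least squares to estimate $\beta_0$ and $\beta_1$. Let's assume $E\varepsilon_i = E\delta_i = 0$ and that $\mathrm{var}\,\varepsilon_i = \sigma_\varepsilon^2$, $\mathrm{var}\,\delta_i = \sigma_\delta^2$. Let:

$$\sigma_x^2 = \sum_i (x_i^A - \bar{x}^A)^2 / n, \qquad \sigma_{x\delta} = \mathrm{cov}(x^A, \delta)$$

For observational data, $\sigma_x^2$ is (almost) the sample variance of $X^A$, while for a controlled experiment we can view it as just a numerical measure of the spread of the design. A similar distinction should be made for $\mathrm{cov}(x^A, \delta)$, although in many cases it will be reasonable to assume that this is zero.

Figure 5.1 Measurement error: True vs. observed data.

Now $\hat\beta_1 = \sum_i (x_i^O - \bar{x}^O) y_i^O / \sum_i (x_i^O - \bar{x}^O)^2$. The numerator, divided by $n$, estimates $\mathrm{cov}(x^O, y^O) = \beta_1(\sigma_x^2 + \sigma_{x\delta})$ and the denominator, divided by $n$, estimates $\mathrm{var}(x^O) = \sigma_x^2 + \sigma_\delta^2 + 2\sigma_{x\delta}$, so after some calculation we find that:

$$E\hat\beta_1 = \beta_1 \frac{\sigma_x^2 + \sigma_{x\delta}}{\sigma_x^2 + \sigma_\delta^2 + 2\sigma_{x\delta}}$$

There are two main special cases of interest:

1. If there is no relation between $X^A$ and $\delta$, so that $\sigma_{x\delta} = 0$, this simplifies to:

$$E\hat\beta_1 = \beta_1 \frac{1}{1 + \sigma_\delta^2 / \sigma_x^2}$$

So $\hat\beta_1$ will be biased towards zero, regardless of the sample size. If $\sigma_\delta^2$ is small relative to $\sigma_x^2$, then the problem can be ignored. In other words, if the variability in the errors of observation of X is small relative to the range of X, then we need not be too concerned. For multiple predictors, the usual effect of measurement errors is also to bias the $\hat\beta$ in the direction of zero.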

2. In controlled experiments, we need to distinguish two ways in which error in x may arise. In the first case, we measure x, so although the true value is $x^A$, we observe $x^O$. If we were to repeat the measurement, we would have the same $x^A$ but a different $x^O$. In the second case, you fix $x^O$; for example, you make up a chemical solution with a specified concentration $x^O$. The true concentration would be $x^A$. Now if you were to repeat this, you would get the same $x^O$, but the $x^A$ would be different. In this latter case we have:

$$\sigma_{x\delta} = \mathrm{cov}(X^O - \delta, \delta) = -\sigma_\delta^2$$

and then we would have $E\hat\beta_1 = \beta_1$. So our estimate would be unbiased. This seems paradoxical, until you notice that the second case effectively reverses the roles of $x^A$ and $x^O$, and if you get to observe the true X, then you will get an unbiased estimate of $\beta_1$. See Berkson (1950) for a discussion of this.
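The two special cases can be checked with a small simulation. The sketch below is illustrative only: the sample size, coefficients, and standard deviations are assumptions chosen for the example, and `sim_slope()` is a hypothetical helper, not a function from any package. In the "classical" case the observed x is the true x plus noise; in the "berkson" case the observed x is a value we set and the true x scatters around it.

```r
# Sketch: classical measurement error versus the Berkson (fixed x^O) case.
# All numeric settings below are illustrative assumptions.
set.seed(123)
n     <- 200   # observations per simulated data set
nsim  <- 500   # number of simulated data sets
beta0 <- 1
beta1 <- 2
sd_x <- 2; sd_delta <- 1.5; sd_eps <- 1

sim_slope <- function(type) {
  if (type == "classical") {
    xA <- rnorm(n, mean = 10, sd = sd_x)     # true values
    xO <- xA + rnorm(n, sd = sd_delta)       # observed = true + error
  } else {
    xO <- rep(seq(5, 15, length.out = 20), length.out = n)  # values we set
    xA <- xO - rnorm(n, sd = sd_delta)       # true = set value - error
  }
  y <- beta0 + beta1 * xA + rnorm(n, sd = sd_eps)
  unname(coef(lm(y ~ xO))[2])                # slope fitted on the observed x
}

classical <- mean(replicate(nsim, sim_slope("classical")))
berkson   <- mean(replicate(nsim, sim_slope("berkson")))
predicted <- beta1 / (1 + sd_delta^2 / sd_x^2)  # attenuation formula, case 1

round(c(classical = classical, predicted = predicted, berkson = berkson), 3)
```

Under these settings the attenuation formula gives 2/(1 + 2.25/4) = 1.28. The average classical slope should sit close to that value, while the average slope in the fixed-design case should sit close to the true value of 2.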

If the model is used for prediction purposes, we can make the same argument as in the second case above. In repeated “experiments,” the value of x at which the prediction is to be made will be fixed, even though these may represent different underlying “true” values of x.

In cases where the error in X simply cannot be ignored, we should consider alternatives to the least squares estimation of $\beta$. The least squares regression equation can be written as:

$$\frac{y - \bar{y}}{SD_y} = r\,\frac{x - \bar{x}}{SD_x}$$

so that $\hat\beta_1 = r\,SD_y / SD_x$. Note that if we reverse the roles of x and y, we do not get the same regression equation. Since we have errors in both x and y in our problem, we might argue that neither one, in particular, deserves the role of response or predictor and so the equation should be the same either way. One way to achieve this is to set $\hat\beta_1 = SD_y / SD_x$.

This is known as the geometric mean functional relationship. More on this can be found in Draper and Smith (1998). Another approach is to use the SIMEX method of Cook and Stefanski (1994), which we illustrate below.
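As a minimal sketch of the geometric mean functional relationship, the code below computes the three candidate slopes for R's built-in `cars` data (speed and stopping distance). The variable names are assumptions made for the illustration, and the `sign(r)` factor is an added convention so that the slope takes the sign of the correlation.

```r
# Sketch: geometric mean functional relationship (GMFR) slope for the cars data.
data(cars)
r <- cor(cars$speed, cars$dist)

b_yx <- r * sd(cars$dist) / sd(cars$speed)         # least squares slope of dist on speed
b_xy <- sd(cars$dist) / (r * sd(cars$speed))       # speed-on-dist regression, solved for dist
b_gm <- sign(r) * sd(cars$dist) / sd(cars$speed)   # GMFR slope: geometric mean of the two
a_gm <- mean(cars$dist) - b_gm * mean(cars$speed)  # intercept: line passes through the means

c(ols = b_yx, gmfr = b_gm, inverse = b_xy)
```

Because the absolute correlation is below one, the GMFR slope always lies between the ordinary least squares slope and the slope obtained by regressing x on y and inverting.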

Consider some data on the speed and stopping distances of cars in the 1920s. We plot the data, as seen in Figure 5.2, and fit a linear model:

> data(cars)
> plot(dist ~ speed, cars, ylab="distance")
> g <- lm(dist ~ speed, cars)
> summary(g)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -17.579      6.758   -2.60    0.012
speed          3.932      0.416    9.46  1.5e-12

Residual standard error: 15.4 on 48 degrees of freedom
Multiple R-Squared: 0.651, Adjusted R-squared: 0.644
F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
> abline(g)

We could explore transformations and diagnostics for these data, but we will just focus on the measurement error issue. Now we investigate the effect of adding measurement error to the predictor. We plot the modified fits in Figure 5.2:

> ge1 <- lm(dist ~ I(speed + rnorm(50)), cars)
> coef(ge1)
         (Intercept) I(speed + rnorm(50))
            -15.0619               3.7582
> abline(ge1, lty=2)
> ge2 <- lm(dist ~ I(speed + 2*rnorm(50)), cars)
> coef(ge2)
             (Intercept) I(speed + 2 * rnorm(50))
                 -5.3503                   3.1676
> abline(ge2, lty=3)
> ge5 <- lm(dist ~ I(speed + 5*rnorm(50)), cars)
> coef(ge5)
             (Intercept) I(speed + 5 * rnorm(50))
                 15.1589                   1.8696
> abline(ge5, lty=4)

Figure 5.2 Stopping distance and speeds of cars. The least squares fit is shown as a solid line. The fits with three progressively larger amounts of measurement error on the speed are shown as dotted lines, where the slope gets shallower as the error increases.

We can see that the slope becomes shallower as the amount of noise increases.
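The size of this effect can be compared with the attenuation formula for the case where the added error is unrelated to the predictor. The sketch below treats the recorded speeds as if they were the true values, which is an assumption made purely for illustration, and evaluates the expected slope for added error standard deviations of 1, 2, and 5.

```r
# Sketch: expected slope under E(beta1_hat) = beta1 / (1 + sigma_delta^2 / sigma_x^2),
# treating the recorded speeds as the true x (an illustrative assumption).
data(cars)
b1      <- unname(coef(lm(dist ~ speed, cars))[2])  # slope with no added error
sigma2x <- mean((cars$speed - mean(cars$speed))^2)  # spread of x, using divisor n
sd_add  <- c(1, 2, 5)                               # error SDs of the three modified fits

expected <- b1 / (1 + sd_add^2 / sigma2x)
round(setNames(expected, paste0("sd=", sd_add)), 2)
```

A single fit with random added error, such as each of those above, will scatter around these expected values rather than match them exactly.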

Suppose we knew that the predictor, speed, in the original data had been measured with a known error variance, say 0.5. Given what we have seen in the simulated measurement error models, we might extrapolate back to suggest an estimate of the slope under no measurement error. This is the idea behind SIMEX.

Here we simulate the effects of adding normal random error with variances ranging from 0.1 to 0.5, replicating the experiment 1000 times for each setting:

> vv <- rep(1:5/10, each=1000)
> slopes <- numeric(5000)
> for (i in 1:5000) slopes[i] <- lm(dist ~ I(speed + sqrt(vv[i])*rnorm(50)), cars)$coef[2]

Now plot the mean slopes for each variance. We are assuming that the speeds in the original data already carry measurement error with variance 0.5, so the extra variance is added to this:

> betas <- c(coef(g)[2], colMeans(matrix(slopes, nrow=1000)))
> variances <- c(0, 1:5/10) + 0.5
> plot(variances, betas, xlim=c(0, 1), ylim=c(3.86, 4))

We fit a linear model and extrapolate to zero variance:

> gv <- lm(betas ~ variances)
> coef(gv)
(Intercept)   variances
    3.99975    -0.13552
> points(0, gv$coef[1], pch=3)

Figure 5.3 Simulation-Extrapolation estimation of the unbiased slope in the presence of measurement error in the predictors. We predict $\hat\beta_1$ at a variance of zero.

The predicted value of $\hat\beta_1$ at variance equal to zero, that is, with no measurement error, is 4.0.

Better models for extrapolation are worth considering; see Cook and Stefanski (1994) for details.
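As one possible direction, the simulation and extrapolation steps can be wrapped in a function that allows a polynomial extrapolant of chosen degree. This is a sketch under stated assumptions: `simex_slope()` is a hypothetical helper written for this illustration, not a library function. The number of replicates `B` and the quadratic alternative are choices made here, not the specific recommendation of Cook and Stefanski (1994).

```r
# Sketch: SIMEX for the slope of dist ~ speed with a polynomial extrapolant.
# simex_slope() is a hypothetical helper; err_var is the assumed known
# measurement error variance of speed.
simex_slope <- function(data, err_var, added = 1:5 / 10, B = 200, degree = 2) {
  n    <- nrow(data)
  base <- unname(coef(lm(dist ~ speed, data))[2])  # slope with no extra error
  # Simulation step: mean slope after adding extra error of variance v
  sim <- sapply(added, function(v) {
    mean(replicate(B, coef(lm(dist ~ I(speed + sqrt(v) * rnorm(n)), data))[2]))
  })
  betas     <- c(base, sim)
  variances <- c(0, added) + err_var               # total error variance per setting
  # Extrapolation step: with raw polynomial terms, the intercept is the
  # fitted value at a total error variance of zero
  fit <- lm(betas ~ poly(variances, degree, raw = TRUE))
  unname(coef(fit)[1])
}

set.seed(1)
data(cars)
lin  <- simex_slope(cars, err_var = 0.5, degree = 1)  # linear extrapolant, as in the text
quad <- simex_slope(cars, err_var = 0.5, degree = 2)  # quadratic extrapolant
c(linear = lin, quadratic = quad)
```

A higher-degree extrapolant can follow curvature in the relationship between slope and error variance, but it is also more sensitive to simulation noise when extrapolating beyond the simulated range, so more replicates may be needed.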