
In document Practical Data Science with R (Page 178-183)

Linear and logistic regression

7.1.5 Reading the model summary and characterizing coefficient quality

In section 7.1.3, we checked whether our income predictions were to be trusted. We’ll now show how to check whether model coefficients are reliable. This is especially urgent because we’ve been discussing presenting the relationships shown by the coefficients to others as advice.

Most of what we need to know is already in the model summary, which is produced using the summary() command: summary(model). This produces the output shown in figure 7.7, which looks intimidating, but contains a lot of useful information and diagnostics. You’re likely to be asked about elements of figure 7.7 when presenting results, so we’ll demonstrate how all of these fields are derived and what the fields mean.

We’ll first break down the summary() into pieces.

THE ORIGINAL MODEL CALL

The first part of the summary() is how the lm() model was constructed:

Call:
lm(formula = log(PINCP, base = 10) ~ AGEP + SEX + COW + SCHL, data = dtrain)

This is a good place to double-check whether we used the correct data frame, performed our intended transformations, and used the right variables. For example, we can double-check whether we used the data frame dtrain and not the data frame dtest.

THE RESIDUALS SUMMARY

The next part of the summary() is the residuals summary:

Residuals:
     Min       1Q   Median       3Q      Max
-1.29220 -0.14153  0.02458  0.17632  0.62532

2 To see a trick for dealing with factors with very many levels, see http://mng.bz/ytFY.

Indicator variables

Most modeling methods handle a string-valued (categorical) variable with n possible levels by converting it to n (or n-1) binary variables, or indicator variables. R has commands to explicitly control the conversion of string-valued variables into well-behaved indicators: as.factor() creates categorical variables from string variables; relevel() allows the user to specify the reference level.

But beware of variables with a very large number of levels, like ZIP codes. The runtime of linear (and logistic) regression increases as roughly the cube of the number of coefficients. Too many levels (or too many variables in general) will bog the algorithm down.
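As a minimal sketch of the conversion described in this sidebar, the following uses a tiny made-up data frame. The column name cow and its three levels are illustrative assumptions, not the values in dtrain. It shows as.factor(), relevel(), and the n-1 indicator columns that model.matrix() builds alongside the intercept.

```r
# Illustrative data only: the column 'cow' and its levels are made up.
d <- data.frame(cow = c("private", "government", "self-employed",
                        "private", "government", "private"),
                stringsAsFactors = FALSE)

d$cow <- as.factor(d$cow)                    # string column -> categorical
d$cow <- relevel(d$cow, ref = "private")     # choose the reference level

# model.matrix() shows the indicator encoding a linear model would use:
# one intercept column plus (n - 1) indicator columns for n levels.
indicators <- model.matrix(~ cow, data = d)
print(levels(d$cow))
print(colnames(indicators))
```

The reference level ("private" here) gets no column of its own; its effect is absorbed into the intercept, and the other coefficients are read relative to it.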

In linear regression, the residuals are everything. Most of what we want to know about the quality of our model fit is in the residuals. The residuals are our errors in prediction: log(dtrain$PINCP,base=10) - predict(model,newdata=dtrain). We can find useful summaries of the residuals for both the training and test sets, as shown in the following listing.

Listing 7.6 Summarizing residuals

> summary(log(dtrain$PINCP,base=10) - predict(model,newdata=dtrain))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.29200 -0.14150  0.02458  0.00000  0.17630  0.62530
> summary(log(dtest$PINCP,base=10) - predict(model,newdata=dtest))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-1.494000 -0.165300  0.018920 -0.004637  0.175500  0.868100

In linear regression, the coefficients are chosen to minimize the sum of squares of the residuals. This is why the method is also often called the least squares method. So for good models, we expect the residuals to be small.

In the residual summary, you’re given the Min. and Max., which are the smallest and largest residuals seen. You’re also given three quantiles of the residuals: 1st Qu., Median, and 3rd Qu. An r-quantile is a number x such that an r-fraction of the residuals is less than x and a (1-r)-fraction of residuals is greater than x. The 1st Qu., Median, and 3rd Qu. values are the 0.25, 0.5, and 0.75 quantiles.

Figure 7.7 (callouts): Model call summary; Residuals summary; Coefficients; Model quality summary

What you hope to see in the residual summary is the median near 0 and symmetry in that 1st Qu. is near -3rd Qu. (with neither too large). The 1st Qu. and 3rd Qu. quantiles are interesting because exactly half of the training data has a residual in this range. If you drew a random training example, its residual would be in this range exactly half the time. So you really expect to commonly see prediction errors of these magnitudes. If these errors are too big for your application, you don’t have a usable model.
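To make the residual summary concrete, here is a small sketch on synthetic data. The data frame d, its columns x and y, and the seed are assumptions made for illustration; they are not the PUMS data used in this chapter. It computes the residuals by hand, takes their quartiles, and checks what fraction of training residuals falls between 1st Qu. and 3rd Qu.

```r
# Synthetic stand-in for the training data (illustrative names and values).
set.seed(7)
n <- 400
d <- data.frame(x = runif(n, 0, 10))
d$y <- 1 + 0.5 * d$x + rnorm(n, sd = 0.3)
model <- lm(y ~ x, data = d)

# Residuals are actual minus predicted, the same values residuals(model) returns.
res <- d$y - predict(model, newdata = d)

# The 0.25, 0.5, and 0.75 quantiles reported as 1st Qu., Median, and 3rd Qu.
q <- quantile(res, probs = c(0.25, 0.5, 0.75))
print(q)

# Fraction of training residuals between the first and third quartiles.
inside <- mean(res >= q[1] & res <= q[3])
print(inside)
```

On training data fit with an intercept, the mean residual is zero up to rounding, which is why the Mean column in the training summary reads 0.00000.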

THE COEFFICIENTS TABLE

The next part of the summary(model) is the coefficients table, as shown in figure 7.8. A matrix form of this table can be retrieved as summary(model)$coefficients.

Each model coefficient forms a row of the summary coefficients table. The columns report the estimated coefficient, the uncertainty of the estimate, how large the coefficient is relative to the uncertainty, and how likely such a ratio would be due to mere chance. Figure 7.8 gives the names and interpretations of the columns.
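The relationship between the four columns can be sketched by recomputing the last two from the first two. The example below runs on synthetic data (the names d, x1, x2, and y are illustrative assumptions) and uses the standard two-sided t-test against zero.

```r
# Synthetic data for illustration; x2 has no true effect on y.
set.seed(11)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 0.8 * d$x1 + rnorm(n)
model <- lm(y ~ x1 + x2, data = d)

coefs <- summary(model)$coefficients   # matrix form of the coefficients table
est <- coefs[, "Estimate"]
se <- coefs[, "Std. Error"]

# t-value: how many standard errors the estimate is away from zero.
t_value <- est / se

# p-value: two-sided tail probability of a t-value at least this large,
# using the residual degrees of freedom of the model.
p_value <- 2 * pt(abs(t_value), df = df.residual(model), lower.tail = FALSE)

print(cbind(coefs, t_check = t_value, p_check = p_value))
```

The t_check and p_check columns should agree with the "t value" and "Pr(>|t|)" columns that summary() reports.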

We set out to study income and the impact on income of getting a bachelor’s degree. But we must look at all of the coefficients to check for interfering effects.

Figure 7.8 (column callouts): Name of coefficient; Coefficient estimate; Standard error in estimate; t-value: number of standard errors the estimate is away from zero; p-value: probability of such a large t-value forming by mere chance

For example, the coefficient of -0.093 for SEXF means that our model learned a penalty of -0.093 to log(PINCP,base=10) for being female. Females are modeled as earning 10^-0.093 (about 0.81) times what males earn, which is 1-10^-0.093 or 19% less, all other model parameters being equal. Note we said “all other model parameters being equal” not “all other things being equal.” That’s because we’re not modeling the number of years in the workforce (which age may not be a reliable proxy for) or occupation/industry type (which has a big impact on income). This model is not, with the features it was given, capable of testing if, on average, a female in the same job with the same number of years of experience is paid less.
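Because the outcome is log(PINCP,base=10), a coefficient is an additive shift on the log scale and a multiplicative factor on the income scale. The arithmetic behind the SEXF interpretation can be sketched as follows; the variable names are ours, and only the value -0.093 comes from the text.

```r
coef_sexf <- -0.093               # SEXF coefficient reported in the text

# Additive shift in log10(income) becomes a multiplicative income ratio.
ratio <- 10^coef_sexf             # modeled female-to-male income ratio
pct_less <- (1 - ratio) * 100     # percent lower modeled income

print(round(ratio, 2))
print(round(pct_less))
```

The same conversion applies to any coefficient in this model: 10 raised to the coefficient gives the modeled income multiplier for a one-unit change in that variable.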

The p-value (also called the significance) is one of the most important diagnostic columns in the coefficient summary. The p-value estimates the probability of seeing a coefficient with a magnitude as large as we observe if the true coefficient is really 0 (if the variable has no effect on the outcome). Don’t trust the estimate of any coefficient with a large p-value. As a general rule of thumb, a coefficient with p>=0.05 is not to be trusted. The estimate of the coefficient may be good, but you want to use more data to build a model that reliably shows that the estimate is good. However, lower p-values aren’t always “better” once they’re good enough. There’s no reason to prefer a coefficient with a p-value of 1e-23 to one with a p-value of 1e-08; at this point you know both coefficients are likely good estimates and you should prefer the ones that explain the most variance.

Statistics as an attempt to correct bad experimental design

The absolute best experiment to test if there’s a sex-driven difference in income distribution would be to compare incomes of individuals who were identical in all possible variables (age, education, years in industry, performance reviews, race, region, and so on) but differ only in sex. We’re unlikely to have access to such data, so we’d settle for a good experimental design: a population where there’s no correlation between any other feature and sex. Random selection can help in experimental design, but it’s not a complete panacea. Barring a good experimental design, the usual pragmatic strategy is this: introduce extra variables to represent effects that may have been interfering with the effect we were trying to study. Thus a study of the effect of sex on income may include other variables like education and age to try to disentangle the competing effects.

Collinearity also lowers significance

Sometimes, a predictive variable won’t appear significant because it’s collinear (or correlated) with another predictive variable. For example, if we did try to use both age and number of years in the workforce to predict income, neither variable may appear significant. This is because age tends to be correlated with number of years in the workforce. If you remove one of the variables and the other one gains significance, this is a good indicator of correlation.

Another possible indication of collinearity in the inputs is seeing coefficients with an unexpected sign: for example, seeing that income is negatively correlated with years in the workforce.
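The effect of collinearity on coefficient uncertainty can be sketched with a small simulation. Everything here is synthetic and assumed for illustration: years_worked is constructed to track age closely, and log_income depends on age only. We compare the standard error of the age coefficient with and without the collinear variable in the model.

```r
# Simulated data: years_worked is nearly a linear function of age.
set.seed(42)
n <- 200
age <- runif(n, 25, 65)
years_worked <- age - 22 + rnorm(n, sd = 1)
log_income <- 4 + 0.01 * age + rnorm(n, sd = 0.25)
d <- data.frame(age, years_worked, log_income)

# Fit with both collinear inputs, then with age alone.
both <- summary(lm(log_income ~ age + years_worked, data = d))$coefficients
alone <- summary(lm(log_income ~ age, data = d))$coefficients

se_both <- both["age", "Std. Error"]
se_alone <- alone["age", "Std. Error"]

print(cor(d$age, d$years_worked))   # how strongly the inputs are correlated
print(both)                         # inflated standard errors, larger p-values
print(alone)                        # tighter estimate once one input is removed
```

The larger standard error in the two-variable fit is what pushes the p-values up: the model can’t tell which of the two inputs deserves the credit.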


OVERALL MODEL QUALITY SUMMARIES

The last part of the summary(model) report is the overall model quality statistics. It’s a good idea to check the overall model quality before sharing any predictions or coefficients. The summaries are as follows:

Residual standard error: 0.2691 on 578 degrees of freedom
Multiple R-squared: 0.3383, Adjusted R-squared: 0.3199
F-statistic: 18.47 on 16 and 578 DF, p-value: < 2.2e-16

The degrees of freedom is the number of data rows minus the number of coefficients fit; in our case, this:

df <- dim(dtrain)[1] - dim(summary(model)$coefficients)[1]

The degrees of freedom is thought of as the number of training data rows you have after correcting for the number of coefficients you tried to solve for. You want the degrees of freedom to be large compared to the number of coefficients fit to avoid overfitting. Overfitting is when you find chance relations in your training data that aren’t present in the general population. Overfitting is bad: you think you have a good model when you don’t.

The residual standard error is the square root of the sum of the squares of the residuals (or the sum of squared error) divided by the degrees of freedom. So it’s similar to the RMSE (root mean squared error) that we discussed earlier, except with the number of data rows adjusted to be the degrees of freedom; in R, this:

modelResidualError <- sqrt(sum(residuals(model)^2)/df)

Multiple R-squared is just the R-squared (discussed in section 7.1.3).
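The degrees of freedom and residual standard error calculations above can be checked against what R reports. This is a sketch on synthetic data; the data frame d and its columns are illustrative assumptions. It compares the hand-computed values with df.residual(model) and summary(model)$sigma, and shows the RMSE alongside for contrast.

```r
# Synthetic data with one numeric and one three-level categorical input.
set.seed(3)
n <- 100
d <- data.frame(x = rnorm(n),
                g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
d$y <- 1 + 2 * d$x + rnorm(n)
model <- lm(y ~ x + g, data = d)

# Degrees of freedom: data rows minus number of coefficients fit.
df <- dim(d)[1] - dim(summary(model)$coefficients)[1]

# Residual standard error divides by df; RMSE divides by the row count.
rse <- sqrt(sum(residuals(model)^2) / df)
rmse <- sqrt(mean(residuals(model)^2))

print(c(df = df, df_builtin = df.residual(model)))
print(c(rse = rse, sigma = summary(model)$sigma, rmse = rmse))
```

Because df is smaller than the number of rows, the residual standard error is always slightly larger than the training RMSE.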

Collinearity also lowers significance (continued)

The overall model can still predict income quite well, even when the inputs are correlated; it just can’t determine which variable deserves the credit for the prediction. Using regularization (especially ridge regression as found in lm.ridge() in the package MASS) is helpful in collinear situations (we prefer it to “x-alone” variable preprocessing, such as principal components analysis). If you want to use the coefficient values as advice as well as to make good predictions, try to avoid collinearity in the inputs as much as possible.

The adjusted R-squared is the multiple R-squared penalized by the ratio of the degrees of freedom to the number of training examples. This attempts to correct the fact that more complex models tend to look better on training data due to overfitting. Usually it’s better to rely on the adjusted R-squared. Better still is to compute the R-squared between predictions and actuals on hold-out test data. In section 7.1.3, we showed the R-squared on test data was 0.26, which is significantly lower than the reported adjusted R-squared of 0.32. So the adjusted R-squared discounts for overfitting, but not always enough. This is one of the reasons we advise preparing both training and test datasets; the test dataset estimates can be more representative of production model performance than statistical formulas.
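One common way to write the adjusted R-squared is 1 - (1 - R-squared) * (n - 1) / df, where n is the number of training rows and df is the residual degrees of freedom. The sketch below applies that formula to a synthetic model and also computes the R-squared on held-out rows. The data, the train/test split, and the helper rsq() are illustrative assumptions. The last lines plug in the figures reported above, inferring n = 595 from 578 degrees of freedom plus 17 coefficients.

```r
# Synthetic data; only x1 truly affects y, x2 and x3 are noise inputs.
set.seed(5)
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + rnorm(n)
train <- d[1:100, ]
test <- d[101:150, ]
model <- lm(y ~ x1 + x2 + x3, data = train)

# Adjusted R-squared from the multiple R-squared and the degrees of freedom.
s <- summary(model)
n_train <- nrow(train)
adj_r2 <- 1 - (1 - s$r.squared) * (n_train - 1) / df.residual(model)

# Hypothetical helper: R-squared between actuals y and predictions f.
rsq <- function(y, f) 1 - sum((y - f)^2) / sum((y - mean(y))^2)
test_r2 <- rsq(test$y, predict(model, newdata = test))
print(c(r2 = s$r.squared, adj_r2 = adj_r2, test_r2 = test_r2))

# The same formula with the reported summary values (n inferred as 578 + 17).
book_adj <- 1 - (1 - 0.3383) * (595 - 1) / 578
print(round(book_adj, 2))
```

The hold-out R-squared needs no correction formula at all, which is part of why we prefer it as an estimate of production performance.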

The F-statistic is similar to the p-values we saw with the model coefficients. It’s used to measure whether the linear regression model predicts outcome better than the constant model (the mean value of y). The F-statistic gets its name from the F-test, which is the technique used to check if two variances (in this case, the variance of the residuals from the constant model and the variance of the residuals from the linear model) are significantly different. The corresponding p-value is the estimate of the probability that we would’ve observed an F-statistic this large or larger if the two variances in question are in reality the same. So you want the p-value to be small (rule of thumb: less than 0.05).

In our example, the model is doing better than just the constant model, and the improvement is incredibly unlikely to have arisen from sampling error.
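The comparison against the constant model can be sketched directly. The example below uses synthetic data (all names are illustrative assumptions). It pulls the F-statistic out of the summary, converts it to a p-value with pf(), and runs the equivalent F-test with anova() on the constant and linear models.

```r
# Synthetic data for illustration.
set.seed(9)
n <- 80
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.7 * d$x1 + rnorm(n)
model <- lm(y ~ x1 + x2, data = d)
constant_model <- lm(y ~ 1, data = d)   # predicts the mean of y for every row

# summary() stores the statistic with its two degrees-of-freedom values.
f <- summary(model)$fstatistic          # named: value, numdf, dendf
f_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# The same test expressed as a comparison of the two nested models.
cmp <- anova(constant_model, model)

print(f)
print(unname(f_p))
print(cmp)
```

The F and Pr(>F) entries in the second row of the anova() table should match the values recovered from the summary.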

INTERPRETING MODEL SIGNIFICANCES

Most of the tests of linear regression, including the tests for coefficient and model significance, are based on the assumption that the error terms, or residuals, are normally distributed. It’s important to examine the residuals graphically or using quantile analysis to determine if the regression model is appropriate.
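One possible form of such a quantile check is sketched below on synthetic data; the names and the choice of diagnostics are our assumptions, not a prescribed procedure. It compares the quantiles of the standardized residuals with those of a normal distribution without drawing a plot, and also runs the Shapiro-Wilk test from base R as a formal check.

```r
# Synthetic data with normally distributed noise, for illustration.
set.seed(13)
n <- 300
d <- data.frame(x = runif(n))
d$y <- 2 + 3 * d$x + rnorm(n, sd = 0.5)
model <- lm(y ~ x, data = d)

std_res <- rstandard(model)               # standardized residuals

# Theoretical normal quantiles (x) versus sample quantiles (y), no plotting.
qq <- qqnorm(std_res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)                 # close to 1 when residuals look normal

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(std_res)

print(qq_cor)
print(sw$p.value)

# In an interactive session, the graphical version is:
# qqnorm(std_res); qqline(std_res)
```

A Q-Q plot that bends away from the reference line in the tails suggests the significance tests above should be read with caution.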
