Model specification
4.3 Linear regression
4.3.3 Goodness of fit
The regression line represents a linear fit of the dispersion diagram and therefore involves a degree of approximation. We want to measure the accuracy of that approximation. An important judgement criterion is based on a decomposition of the variance of the dependent variable. Recall that the variance is a measure of variability, and variability in statistics means ‘information’. By applying Pythagoras’ theorem to the right-angled triangle in Section 4.3.2, we obtain
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . \]
This identity establishes that the total sum of squares (SST), on the left, equals the sum of squares explained by the regression (SSR) plus the sum of squares of the errors (SSE). It can also be written as
\[ \mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}. \]
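The decomposition can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data invented for illustration; the slope and intercept use the standard least squares formulas for the bivariate case (b as covariance over variance, a from the means).

```python
import numpy as np

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=50)

# Least squares estimates of the slope b and the intercept a.
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares explained by the regression
sse = np.sum((y - y_hat) ** 2)         # sum of squares of the errors

# The two printed values should agree up to floating-point rounding.
print(sst, ssr + sse)
```

The identity holds only for the least squares line with an intercept; for an arbitrary line the cross-product term does not vanish and SST generally differs from SSR + SSE.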
These three quantities are called deviances; if we divide them by the number of observations n, and denote statistical variables using the corresponding capital letters, we obtain
\[ \operatorname{Var}(Y) = \operatorname{Var}(\hat{Y}) + \operatorname{Var}(E). \]
We have decomposed the variance of the response variable into two components: the variance ‘explained’ by the regression line, and the ‘residual’ variance. This leads to our main index for goodness of fit of the regression line; it is the index of determination $R^2$, defined by
\[ R^2 = \frac{\operatorname{Var}(\hat{Y})}{\operatorname{Var}(Y)} = 1 - \frac{\operatorname{Var}(E)}{\operatorname{Var}(Y)} . \]
The coefficient $R^2$ is equal to the square of the linear correlation coefficient, so it takes values between 0 and 1. It is equal to 0 when the regression line is constant ($\hat{Y} = \bar{y}$, i.e. $b = 0$); it is equal to 1 when the fit is perfect (all the residuals are zero). In general, a high value of $R^2$ indicates that the dependent variable $Y$ can be well predicted by a linear function of $X$. The $R^2$ coefficient of cluster analysis can be derived in exactly the same way by substituting the group means for the fitted line. From the definition of $R^2$, notice that $\operatorname{Var}(E) = \operatorname{Var}(Y)(1 - R^2)$. This relationship shows how the error in predicting $Y$ reduces from $\operatorname{Var}(Y)$, when the predictor is the mean ($\hat{Y} = \bar{y}$), to $\operatorname{Var}(E)$, when the predictor is $\hat{y}_i = a + b x_i$. Notice that the linear predictor is at least as good as the mean predictor, and its superiority increases with $R^2 = r^2(X, Y)$.
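These relationships can be illustrated with a short numerical sketch. It assumes NumPy and synthetic data made up for the purpose; `np.polyfit` is used here only as a convenient way to obtain the least squares line.

```python
import numpy as np

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 - 1.2 * x + rng.normal(scale=1.0, size=200)

# np.polyfit returns coefficients from highest degree down: slope, then intercept.
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x
e = y - y_hat

r2_explained = np.var(y_hat) / np.var(y)   # Var(Y_hat) / Var(Y)
r2_residual = 1.0 - np.var(e) / np.var(y)  # 1 - Var(E) / Var(Y)
r = np.corrcoef(x, y)[0, 1]                # linear correlation coefficient

# Mean squared prediction error of the two predictors.
mse_mean = np.mean((y - y.mean()) ** 2)    # predictor is the mean: equals Var(Y)
mse_line = np.mean(e ** 2)                 # predictor is a + b*x: equals Var(E)

print(r2_explained, r2_residual, r ** 2)
print(mse_mean, mse_line)
```

The first line of output should show three matching values, and the second should show that the linear predictor's error is no larger than that of the mean predictor.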
Figure 4.3 has $R^2$ equal to 0.81. This indicates a good fit of the regression line to the data. For the time being, we cannot establish a threshold value for $R^2$ above which we can say that the regression is valid and below which it is not. We can do this if we assume a normal linear model, as in Section 4.11.
$R^2$ is only a summary index. Sometimes it is appropriate to augment it with graphical diagnostics, which show where the regression line approximates the observed data well and where the approximation is poorer. Most of these tools plot the residuals so that their pattern can be inspected. If the linear regression model is valid, the Y points should be distributed around the fitted line in a random way, without showing obvious trends. A good starting point is the plot of the residuals against the fitted values of the response variable. If that plot indicates a poor fit, look at the plot of the residuals against the explanatory variable and try to see for which values of the explanatory variable the observations lie above or below the fit. Figure 4.4 is a diagnostic plot of the residuals (R REND) against the fitted values (P REND) for the financial data in Figure 4.3. The diagnostic confirms a good fit of the regression line.
Figure 4.4 Diagnostic plot of a regression model.
Determination of the regression line can be strongly influenced by the presence of anomalous values, or outliers. This is because the calculation of the parameters is fundamentally based on determining mean measures, so it is sensitive to the presence of extreme values. Before fitting a regression model, it is wise to conduct a careful exploratory analysis to identify anomalous observations. Plotting the residuals against the fitted values can support the univariate analysis of Section 3.1 in locating such outliers.
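The sensitivity to outliers can be seen in a small experiment. The sketch below assumes NumPy and synthetic data; the helper `fit_line`, the single added outlier and the cut-off of 3 for standardised residuals are illustrative choices, not part of the text.

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) for the least squares line y = a + b*x (hypothetical helper)."""
    b, a = np.polyfit(x, y, deg=1)
    return a, b

# Synthetic, well-behaved data invented for illustration.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

a, b = fit_line(x, y)
fitted = a + b * x
residuals = y - fitted

# For a least squares fit with an intercept, the residuals average to zero
# and are uncorrelated with the fitted values (up to rounding error).
print(residuals.mean(), np.corrcoef(fitted, residuals)[0, 1])

# Add one anomalous observation and refit.
x_out = np.append(x, 10.0)
y_out = np.append(y, 30.0)
a_out, b_out = fit_line(x_out, y_out)
res_out = y_out - (a_out + b_out * x_out)

# Flag observations whose standardised residual exceeds an arbitrary cut-off of 3.
z = res_out / res_out.std()
flagged = np.flatnonzero(np.abs(z) > 3)

print(b, b_out)   # slope before and after adding the outlier
print(flagged)    # indices of the flagged observations
```

To reproduce a diagnostic plot like the one described, one would plot `res_out` against the corresponding fitted values with any plotting library; the flagged point would then stand well apart from the rest.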
4.3.4 Multiple linear regression
We now consider a more general (and realistic) situation, in which there is more than one explanatory variable. Suppose that all variables contained in the data matrix are explanatory, except for the variable chosen as response variable. Let $k$ be the number of such explanatory variables. The multiple linear regression is defined, for $i = 1, 2, \ldots, n$, by
\[ y_i = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik} + e_i \]
or, equivalently, in more compact matrix terms,
\[ Y = Xb + E, \]
where, for all the $n$ observations considered, $Y$ is a column vector with $n$ rows containing the values of the response variable; $X$ is a matrix with $n$ rows and $k+1$ columns containing, in each column, the values of one explanatory variable for the $n$ observations, plus a column (for the intercept) containing $n$ values equal to 1; $b$ is a vector with $k+1$ rows containing all the model parameters to be estimated on the basis of the data (the intercept and the $k$ slope coefficients relative to each explanatory variable); and $E$ is a column vector of length $n$ containing the error terms. Whereas in the bivariate case the regression model was represented by a line, now it corresponds to a $k$-dimensional plane in the $(k+1)$-dimensional space of the variables, called the regression plane. Such a plane is defined by the equation
\[ \hat{y}_i = a + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_k x_{ik}. \]
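The correspondence between the scalar equation and the matrix form can be sketched as follows, assuming NumPy. The data and parameter values are hypothetical, and placing the column of ones first is a convention chosen here, not something the text prescribes.

```python
import numpy as np

# Hypothetical data: n = 4 observations on k = 2 explanatory variables.
x_vars = np.array([[1.0, 2.0],
                   [2.0, 0.5],
                   [3.0, 1.5],
                   [4.0, 3.0]])
n, k = x_vars.shape

# Design matrix X: a column of ones for the intercept, then the k variables.
X = np.column_stack([np.ones(n), x_vars])

# Illustrative parameter values (a, b1, b2); these are not estimates.
a, b1, b2 = 0.5, 2.0, -1.0
b = np.array([a, b1, b2])

# Fitted values from the matrix product and from the scalar equation.
y_hat_matrix = X @ b
y_hat_sum = a + b1 * x_vars[:, 0] + b2 * x_vars[:, 1]

print(X.shape)
print(y_hat_matrix, y_hat_sum)
```

Both computations give the same fitted values, which is why the column of ones is needed: it lets the intercept be treated like any other coefficient.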
To determine the fitted plane it is necessary to estimate the vector of the parameters $(a, b_1, \ldots, b_k)$ on the basis of the available data. Using the least squares optimality criterion, as before, the $b$ parameters will be obtained by minimising the square of the Euclidean distance:
\[ d^2(y, \hat{y}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 . \]
We can obtain a solution in a similar way to bivariate regression; in matrix terms it is given by $\hat{Y} = X\beta$, where
\[ \beta = (X^{\top} X)^{-1} X^{\top} Y . \]
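A minimal numerical sketch of this solution follows, assuming NumPy and synthetic data invented for illustration. It solves the linear system $(X^{\top}X)\beta = X^{\top}Y$ rather than forming the inverse explicitly, which gives the same result when $X^{\top}X$ is invertible and is usually better behaved numerically; `np.linalg.lstsq` is used only as an independent check.

```python
import numpy as np

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(3)
n, k = 100, 3
x_vars = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x_vars])   # intercept column plus k variables
true_b = np.array([1.0, 0.5, -2.0, 0.0])    # hypothetical generating parameters
Y = X @ true_b + rng.normal(scale=0.1, size=n)

# Least squares solution: solve (X'X) beta = X'Y.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Independent check with NumPy's least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fitted values and the minimised squared Euclidean distance.
Y_hat = X @ beta
d2 = np.sum((Y - Y_hat) ** 2)

# Any other parameter vector gives a distance at least as large.
d2_perturbed = np.sum((Y - X @ (beta + 0.01)) ** 2)

print(beta, beta_lstsq)
print(d2, d2_perturbed)
```

When the explanatory variables are strongly collinear, $X^{\top}X$ is close to singular and the formula becomes unstable; in that situation a routine such as `np.linalg.lstsq` is the safer choice.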
Therefore, the optimal fitted plane is defined by
ˆ