Ordinary Least Squares Regression - Exploring Spatial Data with GeoDa TM : A Workbook

22.4.1 Saving Predicted Values and Residuals

If you want to add the predicted values and/or residuals to the current data table, do not select the OK button, but instead, click on Save, as in Figure 22.11. This brings up a dialog to specify the variable names for residuals and/or predicted values, as shown in Figure 22.12 on p. 173.

In this dialog, you can check the box next to Predicted Value and/or Residual and either keep the default variable names (OLS_PREDIC for the predicted value and OLS_RESIDU for the residuals), or replace them with more meaningful names (simply overwrite the defaults). Click OK to add the selected variables to the data table. The result is as in Figure 22.13 on p. 173. Remember to save the shapefile under a different file name to make the new variables permanently part of the dbf file.

Figure 22.12: Predicted values and residuals variable name dialog.

Figure 22.13: Predicted values and residuals added to table.

Figure 22.14: Showing regression output.

Figure 22.15: Standard (short) OLS output window.

22.4.2 Regression Output

Click on OK in the regression variable dialog (Figure 22.14 on p. 173) to bring up the results window, shown in Figure 22.15. The top part of the window contains several summary characteristics of the model as well as measures of fit. This is followed by a list of variable names, with associated coefficient estimates, standard errors, t-statistics and probabilities (of rejecting the null hypothesis that β = 0).5 Next is a list of model diagnostics, a discussion of which is left for Exercise 23.

The summary characteristics of the model listed at the top include the name of the data set (columbus), the dependent variable (CRIME), its mean (35.1288) and standard deviation (16.5605). In addition, the number of observations is listed (49), as well as the number of variables included in the model (inclusive of the constant term, so the value is 3), and the degrees of freedom (46).

In the left hand column of the standard output are traditional measures of fit, including the R² (0.552404) and adjusted R² (0.532943), the sum of squared residuals (6014.89), and the residual variance and standard error estimate, both with adjustment for a loss in degrees of freedom (Sigma-square and S.E. of regression) as well as without (Sigma-square ML and S.E. of regression ML).6
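The relation between these fit measures can be checked by hand. The following is a minimal sketch in Python (not part of GeoDa) that starts from the values reported in the output window and recomputes the degrees of freedom, the adjusted R², and both versions of the residual variance and standard error. The variable names are illustrative, and the adjusted R² formula used is the standard textbook one, which is assumed here to be what GeoDa reports.

```python
import math

# Values reported in the GeoDa output for the Columbus example.
n_obs = 49        # number of observations
n_params = 3      # coefficients in the model, including the constant term
r2 = 0.552404     # R-squared
ssr = 6014.89     # sum of squared residuals

# Degrees of freedom: observations minus estimated coefficients.
dof = n_obs - n_params

# Adjusted R-squared (standard textbook formula, assumed to match GeoDa).
adj_r2 = 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# Residual variance and standard error with the degrees of freedom adjustment.
sigma2 = ssr / dof
se_reg = math.sqrt(sigma2)

# The same two measures without the adjustment (the "ML" versions).
sigma2_ml = ssr / n_obs
se_reg_ml = math.sqrt(sigma2_ml)

print(f"degrees of freedom      : {dof}")
print(f"adjusted R-squared      : {adj_r2:.6f}")
print(f"Sigma-square            : {sigma2:.4f}")
print(f"S.E. of regression      : {se_reg:.4f}")
print(f"Sigma-square ML         : {sigma2_ml:.4f}")
print(f"S.E. of regression ML   : {se_reg_ml:.4f}")
```

Because the ML versions divide by 49 rather than 46, they are always the smaller of the two.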

In the right hand column are listed the F-statistic on the null hypothesis that all regression coefficients are jointly zero (28.3856), and the associated probability (9.34074e-009). This test statistic is included for completeness' sake, since it typically will reject the null hypothesis and is therefore not that useful.
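As an illustration, the F-statistic can be recovered from the R² and the degrees of freedom alone. The sketch below is in Python and is not GeoDa code; it assumes the usual textbook form of the overall F-test for a model with a constant term. The probability is computed with a closed-form expression for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2, as they do in this example; for other models a statistics library would be needed.

```python
# Values reported in the GeoDa output for the Columbus example.
n_obs = 49        # number of observations
n_params = 3      # coefficients in the model, including the constant term
r2 = 0.552404     # R-squared

df_num = n_params - 1       # slope coefficients being tested: 2
df_den = n_obs - n_params   # residual degrees of freedom: 46

# Overall F-statistic expressed in terms of R-squared.
f_stat = (r2 / df_num) / ((1.0 - r2) / df_den)

# Upper-tail probability of an F(2, df_den) variate.
# This closed form is valid ONLY for df_num == 2.
assert df_num == 2
p_value = (1.0 + df_num * f_stat / df_den) ** (-df_den / 2.0)

print(f"F-statistic : {f_stat:.4f}")
print(f"probability : {p_value:.5e}")
```

Up to rounding in the reported R², the result agrees with the 28.3856 and 9.34074e-009 shown in the output window.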

Finally, this column contains three measures that are included to maintain comparability with the fit of the spatial regression models, treated in Exercises 24 and 25. They are the log likelihood (-187.377), the Akaike information criterion (380.754) and the Schwarz criterion (386.43). These three measures are based on an assumption of multivariate normality and the corresponding likelihood function for the standard regression model. The higher the log-likelihood, the better the fit (high on the real line, so less negative is better). For the information criteria, the direction is opposite, and the lower the measure, the better the fit.7
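These three measures can also be reproduced from the sum of squared residuals. The following Python sketch (not GeoDa code) assumes the standard concentrated log-likelihood of the normal linear regression model, evaluated at the ML variance estimate, and then applies the AIC and SC definitions given in the footnote. Variable names are illustrative.

```python
import math

# Values reported in the GeoDa output for the Columbus example.
n_obs = 49        # number of observations (N)
n_params = 3      # number of parameters in the model (K)
ssr = 6014.89     # sum of squared residuals

# ML estimate of the error variance (no degrees of freedom adjustment).
sigma2_ml = ssr / n_obs

# Log-likelihood of the normal linear model at the ML estimates
# (standard textbook expression, assumed to match GeoDa).
log_lik = -0.5 * n_obs * (math.log(2.0 * math.pi) + math.log(sigma2_ml) + 1.0)

# Information criteria: lower values indicate a better fit.
aic = -2.0 * log_lik + 2.0 * n_params
sc = -2.0 * log_lik + n_params * math.log(n_obs)

print(f"log likelihood    : {log_lik:.3f}")
print(f"Akaike info crit. : {aic:.3f}")
print(f"Schwarz criterion : {sc:.2f}")
```

Since both criteria add a penalty that grows with K to the same −2L term, they can rank models differently from the log-likelihood alone when the models have different numbers of parameters.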

When the long output options are checked in the regression title dialog, as in Figure 22.6 on p. 167, an additional set of results is included in the output window. These are the full covariance matrix for the regression coefficient estimates, and/or the predicted values and residuals for each observation. These results are listed after the diagnostics and are illustrated in Figure 22.16 on p. 176. The variable names are given on top of the columns of the covariance matrix (this matrix is symmetric, so the rows match the columns). In addition, for each observation, the observed dependent variable is listed, as well as the predicted value and residual (observed less predicted).
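To make the contents of the long output concrete, the sketch below computes the same three items (coefficient covariance matrix, predicted values, residuals) for a regression with a constant and a single explanatory variable. It is written in plain Python, is not GeoDa code, and uses a small made-up data set rather than the Columbus data; the textbook formula s²(X'X)⁻¹ for the covariance matrix is assumed.

```python
# Hypothetical toy data (NOT the Columbus data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
sx = sum(x)
sy = sum(y)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))

# X'X for the design matrix [1, x] is [[n, sx], [sx, sxx]].
det = n * sxx - sx * sx

# OLS estimates for y = intercept + slope * x.
slope = (n * sxy - sx * sy) / det
intercept = (sy - slope * sx) / n

# Predicted values and residuals (observed less predicted).
predicted = [intercept + slope * v for v in x]
residuals = [obs - pred for obs, pred in zip(y, predicted)]

# Residual variance with the degrees of freedom adjustment (n - 2 here).
s2 = sum(e * e for e in residuals) / (n - 2)

# Covariance matrix of the estimates: s2 * (X'X)^-1, ordered (constant, x).
cov = [
    [s2 * sxx / det, -s2 * sx / det],
    [-s2 * sx / det, s2 * n / det],
]

print("covariance matrix:", cov)
for obs, pred, res in zip(y, predicted, residuals):
    print(f"{obs:8.3f} {pred:8.3f} {res:8.3f}")
```

The off-diagonal elements of `cov` are identical, which is the symmetry mentioned above, and the square roots of the diagonal elements are the standard errors of the coefficient estimates.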

6 The difference between the two is that the first divides the sum of squared residuals by the degrees of freedom (46), the second by the total number of observations (49). The second measure will therefore always be smaller than the first, but for large data sets, the difference will become negligible.

7 The AIC = −2L + 2K, where L is the log-likelihood and K is the number of parameters in the model, here 3. Hence, in the Columbus example: AIC = −2 × (−187.377) + 2 × 3 = 380.754. The SC = −2L + K ln(N), where ln is the natural logarithm and N is the number of observations, here 49. As a result, in the Columbus example: SC = −2 × (−187.377) + 3 × 3.892 = 386.43.

Figure 22.16: OLS long output window.

Figure 22.17: OLS rich text format (rtf) output file in Wordpad.

22.4.3 Regression Output File

The results of the regression are also written to a file in the current working directory, with a file name specified in the title dialog (Figure 22.6 on p. 167). In the current example, this is columbus.rtf. The file is in rich text format and opens readily in Wordpad, Word and other word processors, allowing you to cut and paste results to other documents. The file contents are illustrated in Figure 22.17 on p. 176.

Figure 22.18: OLS rich text format (rtf) output file in Notepad.

Note that when you attempt to open this file in a simple text editor like Notepad, the result is as in Figure 22.18, revealing the formatting codes.
