
Statistical data mining

5.3 The normal linear model

5.3.2 Application

To illustrate the application of the normal linear model, we will again consider the data matrix with 262 rows and 6 columns containing observations on the behaviour of an investment fund return, and the five sector indexes that can be adopted as explanatory variables (Section 4.3). The objective is to determine the best linear model that describes the returns as a function of the sector indexes. We can do this by comparing different linear combinations of the predictors and choosing the best one. The exploratory procedures in Chapter 4 detected a strong correlation between the return and the predictors, encouraging us to apply a linear model. Before proceeding with model comparisons, it is useful to test whether the response variable satisfies the assumptions for the normal linear model, and whether it has an approximately normal distribution. If this is not the case, we will need to transform it to bring it closer to normality.
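A preliminary normality check of this kind can be sketched with sample skewness and excess kurtosis, both of which are 0 for a normal distribution. The sketch below uses the simple moment-based definitions, which may differ slightly from the bias-adjusted versions reported by SAS; the function names and the `sample` data are hypothetical and are not the fund returns.

```python
def central_moment(values, order):
    """Return the central moment of the given order (divisor n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^(3/2). Zero for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical, perfectly symmetric sample (not the REND data).
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(skewness(sample), excess_kurtosis(sample))
```

Values far from 0 on the real response variable would suggest trying a transformation before fitting the model.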

Table 5.1 shows the calculation of a few summary univariate indexes for the response variable. The values of the skewness and kurtosis do not depart much from those for the theoretical normal distribution (both equal to 0 in SAS). Figure 5.3 shows the qq-plot of the response variable, and confirms the validity of the normal approximation, apart from the possible presence of anomalous observations in the tail of the distribution. This means we can proceed with the

Table 5.1 Univariate statistics for the response variable.

Table 5.2 Estimates with the most complex linear model.

Backward Elimination Procedure for Dependent Variable REND
Step 0   All Variables Entered   R-square = 0.81919887

Source        DF   Sum of Squares     Mean Square        F   Prob>F
Regression     5     755.61964877    151.12392975   231.98   0.0001
Error        256     166.76888880      0.65144097
Total        261     922.38853757

             Parameter     Standard          Type II
Variable      Estimate        Error   Sum of Squares        F   Prob>F
INTERCEP   -0.00771196   0.05111640       0.01482805     0.02   0.8802
COMIT      -0.01453272   0.02254704       0.27063860     0.42   0.5198
JAPAN       0.07160597   0.02174317       7.06525395    10.85   0.0011
PACIFIC     0.08148180   0.02408673       7.45487776    11.44   0.0008
EURO        0.35309221   0.03924075      52.74444954    80.97   0.0001
NORDAM      0.35357909   0.02945975      93.84044089   144.05   0.0001

selection of the model. Although we are assuming a normal linear model, we could also examine the other assumptions of the linear model, such as constant variance and uncorrelated residuals. We will do this indirectly by examining the residuals of the final model. Econometrics textbooks give an introduction to more formal tests for choosing different kinds of linear model.

Using the backward elimination procedure, we begin by fitting the most complex model, containing all the explanatory variables. Table 5.2 shows the typical output from applying the normal linear model. The first part of the table shows the variance decomposition of the response variable; it is an ANOVA table. For each source of variability (regression, error, total) it shows the degrees of freedom (DF) and the sum of squares, which measures the variability attributable to that source. The mean square regression is the regression sum of squares divided by the regression DF. The mean square error is the error sum of squares divided by the error DF. The F statistic (from Result 4 on page 148) is the mean square regression divided by the mean square error. We can evaluate a p-value for F. The p-value is small (lower than 5%), so we reject the null hypothesis that the explanatory variables offer no predictive improvement over the mean alone. Therefore the model with five explanatory variables is significant. The multiple coefficient of determination R² is equal to 81.92%, a relatively large value that leads, through the application of the F test, to a significant model.
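The ANOVA quantities described above can be recomputed from the sums of squares and degrees of freedom in Table 5.2. The following is a minimal sketch of that arithmetic only; the variable names are illustrative and nothing is fitted from the data.

```python
# Sums of squares and degrees of freedom copied from Table 5.2.
ss_regression = 755.61964877
ss_error = 166.76888880
df_regression = 5
df_error = 256

# Total variability is the sum of the explained and residual parts.
ss_total = ss_regression + ss_error

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# F statistic and multiple coefficient of determination.
f_statistic = ms_regression / ms_error
r_squared = ss_regression / ss_total

print(f"MS regression = {ms_regression:.8f}")
print(f"MS error      = {ms_error:.8f}")
print(f"F             = {f_statistic:.2f}")
print(f"R-square      = {r_squared:.8f}")
```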

The second part of the table shows the maximum likelihood estimates of the six parameters of the regression plane (the intercept plus the five slope coefficients). These estimates match those obtained with the method of least squares (Table 4.5). But now the introduction of a statistical model allows us to attach measures of sampling variability (standard errors). Calculating the ratio between the value of the estimated parameters and their standard errors, we obtain the T statistic (from Result 3 on page 148). To test the hypothesis that the COMIT slope coefficient is 0, we obtain a T-value equal to −0.64, corresponding to a p-value of 0.52. This clearly indicates that the COMIT coefficient is not significantly different from 0 and therefore COMIT is not a significant predictor. The type II sum of squares is the additional contribution to the explained variability, with respect to a model that contains all other variables. Each type II sum of squares corresponds to the numerator of the F statistic (from Result 5 on page 149). The F statistic is the type II sum of squares divided by the mean square error in the first part of the table; for a single coefficient it equals the square of the T statistic. The final column gives the corresponding p-values, which show that the response variable strongly depends on only four of the five explanatory variables.
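The relation between the T statistic, the type II sum of squares and the F statistic can be checked for the COMIT row of Table 5.2. This is a sketch under one stated assumption: the two-sided p-value is approximated with the standard normal distribution, which is close to the Student's t distribution at 256 degrees of freedom, so it will not match the SAS value to every digit.

```python
import math

# Values copied from Table 5.2 (COMIT row and mean square error).
ms_error = 0.65144097
estimate = -0.01453272
std_error = 0.02254704
type2_ss = 0.27063860

# T statistic: estimate divided by its standard error.
t_value = estimate / std_error

# Two routes to the same F statistic for a single coefficient.
f_from_t = t_value ** 2
f_from_ss = type2_ss / ms_error

# ASSUMPTION: normal approximation to the t distribution (256 error DF).
# erfc(|t| / sqrt(2)) equals the two-sided normal tail probability.
p_approx = math.erfc(abs(t_value) / math.sqrt(2.0))

print(f"T = {t_value:.4f}")
print(f"F from T squared = {f_from_t:.4f}, F from type II SS = {f_from_ss:.4f}")
print(f"approximate two-sided p-value = {p_approx:.3f}")
```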

Table 5.3 is a summary of the backward elimination procedure. We removed only the COMIT variable and then we stopped the procedure. The table shows the step of the procedure at which the variable was removed, the number of variables remaining in the model (In = 4), the partial R² of the excluded variable, given all the others (0.0003), and the coefficient of multiple determination of the model with the remaining four variables (0.8189). It also gives the F statistic of Result 5 and its p-value, for inserting COMIT in a plane with all the other variables. With a p-value of 0.5198, the hypothesis that COMIT should be included is clearly rejected. Table 5.4 shows the final linear model, where all the remaining variables are significant.
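The figures in the elimination summary can be reproduced from the regression sums of squares of the full and reduced models (Tables 5.2 and 5.4). The sketch below assumes that the partial R² reported by the procedure is the drop in the regression sum of squares divided by the total sum of squares; the variable names are illustrative.

```python
# Sums of squares copied from Table 5.2 (full) and Table 5.4 (reduced).
ss_total = 922.38853757
ss_reg_full = 755.61964877      # five explanatory variables
ss_reg_reduced = 755.34901017   # COMIT removed
ms_error_full = 0.65144097      # mean square error of the full model

# Explained variability lost by dropping COMIT (its type II sum of squares).
ss_drop = ss_reg_full - ss_reg_reduced

# ASSUMPTION: partial R-square is the loss expressed as a share of the total.
partial_r2 = ss_drop / ss_total
model_r2 = ss_reg_reduced / ss_total

# F statistic for removing COMIT, tested against the full-model error.
f_remove = ss_drop / ms_error_full

print(f"partial R-square = {partial_r2:.4f}")
print(f"model R-square   = {model_r2:.4f}")
print(f"F                = {f_remove:.4f}")
```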

Once a statistical model is chosen, it is useful to diagnose it, perhaps by analysing the residuals (Section 4.3). To facilitate comparisons, the residuals are often Studentised – they are divided by their estimated standard error. The name derives from the fact that we can take the resulting ratio and apply a Student’s t test to see whether each residual significantly departs from the null value. If this

Table 5.3 Results of the backward selection procedure.

Summary of Backward Elimination Procedure for Dependent Variable REND

Step   Removed   In   Partial R**2   Model R**2        F   Prob>F
1      COMIT      4         0.0003       0.8189   0.4154   0.5198

Table 5.4 Estimates with the chosen final model.

Source        DF   Sum of Squares     Mean Square        F   Prob>F
Regression     4     755.34901017    188.83725254   290.54   0.0001
Error        257     167.03952741      0.64995925
Total        261     922.38853757

             Parameter     Standard          Type II
Variable      Estimate        Error   Sum of Squares        F   Prob>F
INTERCEP   -0.00794188   0.05105699       0.01572613     0.02   0.8765
JAPAN       0.07239498   0.02168398       7.24477432    11.15   0.0010
PACIFIC     0.08249154   0.02400838       7.67324503    11.81   0.0007
EURO        0.33825116   0.03173909      73.82028615   113.58   0.0001
NORDAM      0.35346510   0.02942570      93.78332454   144.29   0.0001


Figure 5.4 Diagnostics: (a) residuals against fitted values, (b) residuals qq-plot.

is so, it indicates a possible problem for the linear model. At a significance level of 5%, the absolute value of a Studentised residual should not exceed about 2. Figure 5.4 shows the analysis of the residuals for our chosen model. Figure 5.4(a) shows a plot of the observed residuals (y-axis) versus the fitted values (x-axis); Figure 5.4(b) is a qq-plot of the observed residuals (y-axis) against the theoretical normal ones (x-axis). Both plots show good behaviour of the model’s diagnostics. In particular, all the residuals are included in the interval (−2, +2) and there are no evident trends, that is, no increasing or decreasing tendencies. The qq-plot confirms the hypothesis underlying the normal linear model. We can therefore conclude that we have chosen a valid model, on which it is reasonable to base predictions. The final model is described by the following regression plane:

REND = −0.0079 + 0.0724 JAPAN + 0.0825 PACIFIC
       + 0.3383 EURO + 0.3535 NORDAM

Comparing this model with the model in Table 4.5, there are slight differences in the estimated coefficients, due to the absence of the variable COMIT. The slight effect of COMIT on the response variable is absorbed by the other variables.
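As an illustration of how the final regression plane and the residual rule of thumb might be used together, the sketch below computes a fitted return and an approximately standardised residual for one observation. The coefficients and the mean square error come from Table 5.4; the index values, the observed return and the function names are hypothetical. Dividing by the root mean square error ignores the leverage of the observation, so this is only an approximation to a properly Studentised residual.

```python
import math
from statistics import NormalDist

# Estimates copied from Table 5.4.
INTERCEPT = -0.00794188
SLOPES = {
    "JAPAN": 0.07239498,
    "PACIFIC": 0.08249154,
    "EURO": 0.33825116,
    "NORDAM": 0.35346510,
}
ROOT_MSE = math.sqrt(0.64995925)  # square root of the mean square error


def predict_rend(indexes):
    """Fitted fund return for a dict of sector index values."""
    return INTERCEPT + sum(SLOPES[name] * indexes[name] for name in SLOPES)


def approx_standardised_residual(observed_rend, indexes):
    """Residual divided by the root MSE (ASSUMPTION: leverage is ignored)."""
    return (observed_rend - predict_rend(indexes)) / ROOT_MSE


# Two-sided 5% normal threshold, the source of the "about 2" rule of thumb.
threshold = NormalDist().inv_cdf(0.975)

# Hypothetical observation, not taken from the data set.
hypothetical_indexes = {"JAPAN": 1.0, "PACIFIC": 1.0, "EURO": 1.0, "NORDAM": 1.0}
fitted = predict_rend(hypothetical_indexes)
z = approx_standardised_residual(1.5, hypothetical_indexes)
flagged = abs(z) > threshold

print(f"fitted = {fitted:.4f}, residual ratio = {z:.3f}, flagged = {flagged}")
```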