4.2 Linear Regression
4.2.1 Statistical Significance
Before moving on to prediction, let us consider the significance value, which is widely used in linear regression. It is also termed the probability value and is denoted by the letter p. The p value gives the likelihood that a particular outcome could have occurred by chance. It can be used to identify whether two or more variables are significantly correlated with each other. A smaller p value therefore gives stronger evidence that a result is not due to chance. Social scientists generally accept a p value of less than 0.05 as indicating a statistically significant correlation [33].
Figure 4.6: Scatter plot of the sample dataset (axes: x and y).
4.2.2 Linear Regression
Linear regression analysis is a way of testing hypotheses concerning the relationship between two numerical variables and a way of estimating the specific nature of such relationships [34]. The relationship is expressed in the form of an equation or a model connecting the dependent variable and one or more independent variables, depending on the problem of interest. The method of least squares is the one most frequently used to fit a line in linear regression.
The simplest relationship between an independent variable x and a dependent variable y is represented as
y = β0 + β1x + ε    (4.15)
where β0 is the intercept and β1 is the slope. The random error term is given by ε, which should be normally distributed with mean 0; at each possible value of x, the variance of ε|xi should be constant, and each error should be independent of the other errors [34]. In practice, we approximate this error term by examining the residuals, which are the differences between the observed values (y) and the estimated values.
The unknowns β0 and β1 have to be estimated from the samples in the training dataset. Let us consider the sample dataset in Table 3.1 of Chapter 3, which has an independent variable x and a dependent variable y, as shown in Figure 4.6.
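The least-squares estimates of the two unknowns can be sketched in a few lines of Python. This is only an illustration of the textbook formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x), not the procedure SPSS runs internally. The function name `least_squares_fit` and the data values are hypothetical; they are not the values of Table 3.1.

```python
from statistics import mean


def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical sample, NOT the dataset of Table 3.1
xs = [1, 2, 3, 4, 5]
ys = [1.8, 3.4, 3.9, 5.3, 5.8]

b0, b1 = least_squares_fit(xs, ys)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```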
SPSS generates four tables in linear regression analysis: a table of the variables in the regression equation, a model summary, an ANOVA table, and a table of coefficients.
Table 4.1: Variables table.
| Model | Variables Entered | Variables Removed | Method |
|-------|-------------------|-------------------|--------|
| 1     | x                 |                   | Enter  |
A variables table with only one independent variable is shown in Table 4.1. With several independent variables, SPSS can build multiple linear regression models, each defined by the variables listed in the Entered and Removed columns. In this case, however, there is only one model, because only one independent variable is available.
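Table 4.2: Model summary table.
| Model | R    | R Square | Adjusted R Square | Std. Error of the Estimate |
|-------|------|----------|-------------------|----------------------------|
| 1     | .984 | .969     | .958              | .32914                     |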
The model summary table shown in Table 4.2 describes the goodness of fit of the regression. Here, R is the correlation coefficient, which ranges from -1 to +1, and R2 is the coefficient of determination, which is the square of R. If R2 is equal to 1, the fit is perfect. The value in the table (i.e. 0.969) is very close to one, meaning that the points in the dataset show a very strong linear relationship. The adjusted R square value corrects R square for the number of independent variables and is a better basis for population estimates, especially in multiple regression.
The third table is the ANOVA table. ANOVA is generally used to compare three or more means with one another. For a single independent variable it is called one-way ANOVA [34].
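Table 4.3: ANOVA table.
| Model |            | Sum of Squares | df | Mean Square | F      | Sig. |
|-------|------------|----------------|----|-------------|--------|------|
| 1     | Regression | 10.000         | 1  | 10.000      | 92.308 | .002 |
|       | Residual   | .325           | 3  | .108        |        |      |
|       | Total      | 10.325         | 4  |             |        |      |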
The Sig. value is also known as the p value. If it is less than 0.05, we say that the ANOVA is significant (i.e. the F value is significant), and it can be concluded that the regression model explains a significant part of the variation in y. The F value in the table is the ratio of the regression mean square to the residual mean square. In Table 4.3 the p value is less than .05, and we can therefore conclude that the relationship between the two variables is statistically significant.
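The entries of Tables 4.2 and 4.3 are linked by simple arithmetic, which the following sketch reproduces from the sums of squares and degrees of freedom alone. The variable names are ours, and the formulas are the standard textbook definitions, which we assume match what SPSS reports.

```python
import math

# Values read from Table 4.3
ss_regression = 10.000
ss_residual = 0.325
df_regression = 1
df_residual = 3

ss_total = ss_regression + ss_residual      # 10.325
n = df_regression + df_residual + 1         # number of samples (5)

# Mean squares and the F ratio
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_value = ms_regression / ms_residual

# Goodness-of-fit measures of Table 4.2
r_square = ss_regression / ss_total
r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
std_error_estimate = math.sqrt(ms_residual)

print(f"F = {f_value:.3f}")
print(f"R = {r:.3f}, R Square = {r_square:.3f}, "
      f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.5f}")
```

Rounded to the precision of the tables, these quantities agree with the SPSS output (F = 92.308, R = .984, R Square = .969, Adjusted R Square = .958, Std. Error of the Estimate = .32914).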
Figure 4.7: Fitted line for the sample dataset (axes: x and y).
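Table 4.4: Coefficients table.
| Model |            | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t     | Sig. |
|-------|------------|------------------|---------------------------|-------------------|-------|------|
| 1     | (Constant) | 1.100            | .345                      |                   | 3.187 | .050 |
|       | x          | 1.000            | .104                      | .984              | 9.608 | .002 |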
The last table is the table of coefficients. Recall that β0 and β1 are the coefficients in equation 4.15. The first value in column B (i.e. 1.100) is the intercept β0 and the second value, 1.000, is the slope β1. If we consider the t value and the significance value for the slope, the significance value is less than 0.05, meaning that there is a statistically significant relationship between x and y.
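The t and Sig. columns of Table 4.4 can be checked by hand. In the sketch below, each t value is taken as the coefficient divided by its standard error, and the two-sided p value is obtained from the closed-form CDF of Student's t distribution with 3 degrees of freedom (the residual df in Table 4.3). The helper names `t_cdf_3df` and `two_sided_p` are ours, and the closed form holds only for 3 degrees of freedom; a general solution would need a statistics library.

```python
import math


def t_cdf_3df(t):
    """CDF of Student's t distribution with exactly 3 degrees of freedom."""
    u = t / math.sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi


def two_sided_p(t):
    """Two-sided p value for a t statistic with 3 degrees of freedom."""
    return 2.0 * (1.0 - t_cdf_3df(abs(t)))


# (B, Std. Error) pairs read from Table 4.4
coefficients = {"(Constant)": (1.100, 0.345), "x": (1.000, 0.104)}

for name, (b, std_error) in coefficients.items():
    t_value = b / std_error  # differs slightly from the table due to rounding
    print(f"{name}: t = {t_value:.3f}, p = {two_sided_p(t_value):.3f}")

# With a single independent variable, the squared t value of the slope
# equals the F value of the ANOVA table (up to rounding).
print(f"t^2 for the slope = {9.608 ** 2:.3f}")
```

Because the standard errors in the table are rounded to three decimals, the recomputed t values differ from the tabulated ones in the second or third decimal place.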
Using this information, we can write equation 4.15 for the considered dataset as follows.
y = 1.1 + x + ε    (4.16)
The fitted line for the dataset discussed above is illustrated in Figure 4.7. We can now estimate the value of the target variable when x = 6, i.e. y = 7.1.
We should now turn our attention to the error term (or disturbance) of the fitted line. For this, we look at Table 4.2. The standard error of the estimate, .329, is a measure of the variability of the random error. Multiplying it by two gives an approximate 95% confidence interval, which can be taken as the residual error term for the regression line. Our estimate for y is therefore:
y = 7.1 ± .658
Figure 4.8: The normal P-P plot of regression, which is used for residual analysis.
Figure 4.9: Graph of standardized predicted value versus standardized residual value, which is used for residual analysis.
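The point estimate and its rough interval can be reproduced with a short sketch. The coefficients come from Table 4.4 and the standard error of the estimate from Table 4.2; the function names `predict` and `rough_interval` are ours, and the multiplier of two is the rule of thumb used in the text rather than an exact critical value.

```python
INTERCEPT = 1.1                  # B for (Constant), Table 4.4
SLOPE = 1.0                      # B for x, Table 4.4
STD_ERROR_ESTIMATE = 0.32914     # Table 4.2


def predict(x):
    """Point estimate of y from the fitted line of equation 4.16."""
    return INTERCEPT + SLOPE * x


def rough_interval(x, multiplier=2.0):
    """Approximate interval: estimate +/- multiplier * standard error."""
    centre = predict(x)
    half_width = multiplier * STD_ERROR_ESTIMATE
    return centre - half_width, centre + half_width


low, high = rough_interval(6)
print(f"y(6) = {predict(6):.1f}, rough interval = [{low:.3f}, {high:.3f}]")
```

With only 3 residual degrees of freedom, the exact two-sided 95% critical value of the t distribution is about 3.18 rather than 2, so passing `multiplier=3.18` gives a more cautious interval.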
To check the validity of the first assumption about the residuals (i.e. normality), we look at the normal probability plot shown in Figure 4.8.
The assumption of equal variances can be checked with a scatter plot of the standardized residuals versus the standardized fitted values [34]. This is illustrated in Figure 4.9.
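The quantities behind both diagnostic plots can be sketched as follows. The data are hypothetical (not Table 3.1), standardization is done here with plain z-scores, and the observed cumulative proportion uses the plotting position (i + 0.5) / n; both choices are assumptions and may differ from what SPSS uses internally.

```python
from statistics import NormalDist, mean, stdev


def standardize(values):
    """Convert values to z-scores (mean 0, sample standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical sample, NOT the dataset of Table 3.1
xs = [1, 2, 3, 4, 5]
ys = [1.8, 3.4, 3.9, 5.3, 5.8]

# Least-squares fit of the line
x_bar, y_bar = mean(xs), mean(ys)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

# Points for the equal-variance check (cf. Figure 4.9)
z_fitted = standardize(fitted)
z_resid = standardize(residuals)

# Points for the normal P-P plot (cf. Figure 4.8):
# observed cumulative proportion versus expected normal probability
n = len(z_resid)
ordered = sorted(z_resid)
pp_points = [((i + 0.5) / n, NormalDist().cdf(z))
             for i, z in enumerate(ordered)]

for observed, expected in pp_points:
    print(f"observed = {observed:.2f}, expected = {expected:.3f}")
```

If the residuals are approximately normal, the P-P points lie close to the diagonal; if the variance is constant, the standardized residuals show no pattern when plotted against the standardized fitted values.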