Multicollinearity - Multiple regression analysis

Chapter 7: Results and Analysis of the Factors Influencing Disclosures

7.3 Multiple regression analysis

7.3.6 Multicollinearity

Multicollinearity exists when there is a linear relationship between two or more predictors (independent variables) in a regression, or when the predictors are highly correlated (Field, 2005). Two common indications of multicollinearity are a correlation coefficient above 0.8 between any pair of predictors and a Variance Inflation Factor (VIF)[89] of more than 10 (Field, 2005).[90] Table 7.2 presents the correlation matrix for the dependent and independent variables, together with the VIF values.
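As an illustration of the VIF criterion, the following is a minimal sketch, not the procedure used in this study. It relies on the standard result that the VIF of predictor j, 1 / (1 - R_j^2), equals the j-th diagonal element of the inverse of the Pearson correlation matrix of the predictors. Note that this is a Pearson-based quantity, whereas the pairwise correlations reported in Table 7.2 are Kendall's tau_b. The variable names x1, x2, x3 and their values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(predictors):
    """Return {name: VIF} for a dict mapping predictor name -> observations."""
    names = list(predictors)
    corr = [[pearson(predictors[a], predictors[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated to both.
example = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2],
    "x3": [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0],
}
result = vif(example)
flagged = [name for name, value in result.items() if value > 10]
print(result, flagged)
```

In this made-up example the two near-duplicate predictors exceed the threshold of 10, while the unrelated predictor stays close to 1.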

[89] VIF indicates whether a predictor has a strong linear relationship with the other predictors (Field, 2005).

Table 7.2: Kendall's tau_b correlation coefficients among independent variables

| | AR Index | AE Index | Staff | Revenue | Asset | Population | Med. income | VIF |
|---|---|---|---|---|---|---|---|---|
| AR Index | 1.000 | | | | | | | |
| AE Index | .304** | 1.000 | | | | | | |
| Staff | .168* | .097 | 1.000 | | | | | 15.846 |
| Revenue | .176* | .137 | .779** | 1.000 | | | | 41.450 |
| Asset | .222** | .141 | .712** | .817** | 1.000 | | | 29.066 |
| Population | .179* | .093 | .775** | .840** | .768** | 1.000 | | 16.916 |
| Med. income | .087 | .042 | .225* | .253** | .277** | .279** | 1.000 | 1.204 |

** Correlation is significant at the 0.01 level (2-tailed).
* Correlation is significant at the 0.05 level (2-tailed).

From Table 7.2, it is evident that multicollinearity exists among the independent variables in the standard regression model. Staff, revenue, assets, and population are highly intercorrelated, with Kendall's tau_b correlation coefficients close to or higher than 0.8. Further, the VIF values for staff, revenue, assets, and population (15.846, 41.450, 29.066, and 16.916 respectively) exceed the threshold of 10, which confirms the problem of multicollinearity.
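The screening described above can be reproduced directly from the values in Table 7.2. The sketch below simply applies the two thresholds (0.8 for pairwise correlation, 10 for VIF) to the reported figures; the dictionary layout and variable names are illustrative choices, not part of the original analysis.

```python
# Values transcribed from Table 7.2: Kendall's tau_b between predictors, and VIF.
TAU_B = {
    ("Staff", "Revenue"): 0.779,
    ("Staff", "Asset"): 0.712,
    ("Staff", "Population"): 0.775,
    ("Revenue", "Asset"): 0.817,
    ("Revenue", "Population"): 0.840,
    ("Asset", "Population"): 0.768,
    ("Staff", "Med. income"): 0.225,
    ("Revenue", "Med. income"): 0.253,
    ("Asset", "Med. income"): 0.277,
    ("Population", "Med. income"): 0.279,
}
VIF = {
    "Staff": 15.846,
    "Revenue": 41.450,
    "Asset": 29.066,
    "Population": 16.916,
    "Med. income": 1.204,
}

CORR_THRESHOLD = 0.8   # pairwise correlation threshold cited from Field (2005)
VIF_THRESHOLD = 10.0   # VIF threshold cited from Field (2005)

# Pairs whose correlation strictly exceeds the threshold.
high_pairs = sorted(pair for pair, tau in TAU_B.items() if tau > CORR_THRESHOLD)
# Predictors whose VIF exceeds the threshold.
high_vif = sorted(name for name, value in VIF.items() if value > VIF_THRESHOLD)

print(high_pairs)
print(high_vif)
```

Only two pairs strictly exceed 0.8 (revenue with asset, and revenue with population); the other pairs among the four size-related variables lie between .712 and .779, which is consistent with the wording "close to or higher than 0.8". All four size-related variables exceed the VIF threshold, while median income does not.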

To control the problem of multicollinearity, and as there is no overriding reason to choose one variable over another, separate regressions could be run, each including one of the highly correlated variables in turn (Cooke, 1989), or factor analysis[91] could be used to derive factor score(s) as independent variable(s), replacing the highly correlated variables (Craig & Diga, 1998; Ingram, 1984).

As discussed earlier in this study, outliers, non-linear relationships, heteroscedasticity, and multicollinearity are the principal problems of the standard regression analysis that need to be solved. To tackle the first three violations, transformation and removal of outliers were considered. To control multicollinearity, factor analysis and regression models that each use one of the correlated variables in turn could be used.

[91] Factor analysis combines variables with similar characteristics into a group or a single factor. As a result, it reduces a large number of independent variables to a subset of uncorrelated factors (Field, 2005).

Numerous combinations of different types of transformation were regressed, and natural logarithms of both the AR and AE indices and of the independent variables were found to give the lowest residual mean square error (RMSE). The logarithmic transformation alleviated the problems of heteroscedasticity and non-linearity; however, outliers persisted. The statistics of both regression models for the AR and AE indices show evidence of outliers, with standardised residuals of more than 2 and Cook's Distance values of more than .06. (Appendices I and J show the standardised residual and Cook's Distance values for the logarithmically transformed AR and AE indices.) These outliers were the four lowest-scoring authorities (Papakura DC, Manukau CC, Auckland CC, and Hurunui DC). Given the problem of outliers, the four authorities were removed from both models. Consequently, the sample size was reduced to 69 local authorities.
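The two outlier diagnostics mentioned above can be sketched for the simplest case of a regression with a single predictor, where the leverage has a closed form. This is an illustrative simplification, not the study's own computation: the models in this chapter have two predictors, the standardised residual is taken here as the residual divided by the square root of the residual mean square, and the data are made up. The cut-offs of 2 and .06 are those quoted in the text; in the study the variables would be log-transformed before fitting.

```python
import math

def outlier_diagnostics(x, y):
    """Fit y = a + b*x by OLS; return (standardised residuals, Cook's distances)."""
    n = len(x)
    p = 2  # number of estimated parameters: intercept and slope
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)  # residual mean square
    # Leverage of each observation in a one-predictor regression.
    leverage = [1 / n + (xi - mx) ** 2 / sxx for xi in x]
    std_resid = [e / math.sqrt(mse) for e in resid]
    cooks = [e ** 2 / (p * mse) * h / (1 - h) ** 2
             for e, h in zip(resid, leverage)]
    return std_resid, cooks

# Hypothetical data: points on the line y = x, with one gross outlier at index 4.
x = [float(v) for v in range(1, 11)]
y = list(x)
y[4] = 15.0

std_resid, cooks = outlier_diagnostics(x, y)
flagged_resid = [i for i, z in enumerate(std_resid) if abs(z) > 2]
flagged_cook = [i for i, d in enumerate(cooks) if d > 0.06]
print(flagged_resid, flagged_cook)
```

Observations flagged by either list would be candidates for inspection and, as in the text, possible removal before the model is re-estimated.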

When a factor analysis was run to extract the explanatory contribution of the highly correlated variables (revenue, total assets, population, and staff), one principal component was found (eigenvalue = 3.785), which accounted for 94 percent of total variance. The factor loadings are .984, .965, .984, and .957 for revenue, total assets, population, and number of staff respectively. The equation to derive a factor score for these variables, collectively called size, is:

SIZE = .984*REV + .965*ASSET + .984*POP + .957*STAFF
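The reported figures can be cross-checked with a short calculation. Assuming the component was extracted from the correlation matrix of the four variables, its eigenvalue equals the sum of the squared loadings, and the eigenvalue divided by the number of variables gives the share of total variance explained. The sketch below performs that check and applies the SIZE equation exactly as printed; the function name size_score and the input values are hypothetical, and whether the inputs should be standardised first is not stated in the text.

```python
# Loadings reported in the text for the single principal component.
LOADINGS = {"REV": 0.984, "ASSET": 0.965, "POP": 0.984, "STAFF": 0.957}
REPORTED_EIGENVALUE = 3.785

# Sum of squared loadings should reproduce the eigenvalue (up to rounding).
eigenvalue_from_loadings = sum(w ** 2 for w in LOADINGS.values())
# Eigenvalue / number of variables is the share of total variance explained.
variance_share = REPORTED_EIGENVALUE / len(LOADINGS)

def size_score(values):
    """Apply the SIZE equation as printed: a loading-weighted sum of the inputs.

    `values` maps REV, ASSET, POP and STAFF to one authority's (log) values.
    """
    return sum(LOADINGS[name] * values[name] for name in LOADINGS)

# Hypothetical log values for a single local authority.
example = {"REV": 17.2, "ASSET": 20.1, "POP": 10.8, "STAFF": 5.3}
print(round(eigenvalue_from_loadings, 3), round(variance_share, 3))
print(round(size_score(example), 3))
```

The squared loadings sum to about 3.784, in line with the reported eigenvalue of 3.785, and 3.785 / 4 is about 0.946, in line with the reported 94 percent of total variance.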

The size variable captures, to a certain degree, the explanatory contribution of each of the highly correlated variables, and the problem of an omitted variable is therefore avoided. This size variable is another independent variable suitable for alleviating the multicollinearity problem. Since there is no theoretical reason to regard any one of these independent variables (size derived from factor analysis, revenue, total assets, population, and number of staff) as best at capturing the effect on the disclosures while avoiding multicollinearity, five regression models were developed, using each of the five variables alternately. The models are:

Model A: INDEX_i = α + β1 SIZE_i + β2 MEDINC_i + ε_i

Model B: INDEX_i = α + β1 REV_i + β2 MEDINC_i + ε_i

Model C: INDEX_i = α + β1 ASSET_i + β2 MEDINC_i + ε_i

Model D: INDEX_i = α + β1 POP_i + β2 MEDINC_i + ε_i

Model E: INDEX_i = α + β1 STAFF_i + β2 MEDINC_i + ε_i

where

INDEX = log of the AR index score or log of the AE index score

SIZE = size factor score derived from factor analysis

STAFF = log of number of staff

REV = log of revenue

ASSET = log of total assets

POP = log of population

MEDINC = log of median income per capita

i = local authority

α, β = constants or parameters

ε = errors

Note that each of these models applies to both indices. As a result, ten models are run. All models include median income per capita. Model A includes the size variable derived from factor analysis, while Models B to E each include one of the highly correlated variables.
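The alternating-variable design can be sketched as a loop that fits the same two-predictor equation once per size measure and per index. The code below is a minimal illustration using ordinary least squares via the normal equations; the function names, the dictionary keys LOG_AR and LOG_AE, and all data values are hypothetical and chosen only so that the example runs. LOG_AR is constructed as 1 + 2*REV + 3*MEDINC so that Model B recovers known coefficients.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_model(index, size_var, medinc):
    """OLS estimates [alpha, beta1, beta2] for INDEX = a + b1*size_var + b2*MEDINC."""
    rows = [[1.0, s, m] for s, m in zip(size_var, medinc)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, index)) for i in range(3)]
    return solve(xtx, xty)

# Model label -> the size-related variable it uses alongside MEDINC.
MODELS = {"A": "SIZE", "B": "REV", "C": "ASSET", "D": "POP", "E": "STAFF"}

# Hypothetical (already log-transformed) data for six authorities.
data = {
    "MEDINC": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "REV": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ASSET": [2.0, 2.5, 3.5, 4.5, 5.0, 6.5],
    "POP": [1.5, 2.0, 2.5, 4.0, 4.5, 6.0],
    "STAFF": [0.5, 1.5, 2.0, 3.5, 4.0, 5.5],
    "SIZE": [-1.2, -0.8, -0.3, 0.3, 0.7, 1.3],
    "LOG_AE": [0.9, 1.1, 1.6, 1.8, 2.4, 2.5],
}
data["LOG_AR"] = [1 + 2 * r + 3 * m for r, m in zip(data["REV"], data["MEDINC"])]

# Five models for each of the two indices: ten fits in total.
results = {
    (index_name, label): fit_model(data[index_name], data[var], data["MEDINC"])
    for index_name in ("LOG_AR", "LOG_AE")
    for label, var in MODELS.items()
}
for key, coefs in results.items():
    print(key, [round(c, 3) for c in coefs])
```

Because each fit contains only one size-related variable, the highly correlated predictors never enter the same equation, which is the point of running the models alternately.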

In document Examination of Statements of Service Performance of New Zealand Local Authorities: the Case of Wastewater Services (Page 173-177)