6.2 Model development
6.2.5 Model estimation
This stage entails fitting the model to a set of data and using statistical measures to assess its goodness of fit and usefulness in predicting the dependent variable. Among a number of different approaches to estimation of the parameters of models, this study considered multiple regression analysis as the most appropriate. Described by Greene (2008) as the benchmark approach, the technique is designed to accommodate multiple variables. The advantage of multiple regression analysis is its ability to identify the independent effect of a set of explanatory variables on a dependent variable. It can also be used to predict the amount or size of the variable of interest (Hair et al, 2010).
Underlying the standard linear regression model are a number of assumptions that any model should meet. Greene (2008) lists six assumptions of the classical linear regression model:-
a) Linearity. The regression model specifies a linear relationship. In the regression context this linearity assumption refers to the manner in which the parameters and the disturbance enter the regression equation and not necessarily the relationship of the variables. Thus the model
expresses the concept that models possess properties of additivity and homogeneity.
b) Full rank. The regression analysis assumes that there is no exact linear relationship among the explanatory variables.
c) Exogeneity of the explanatory variables. The disturbance is assumed to have conditional expected value zero at every observation.
d) Homoscedasticity and nonautocorrelation. The variance of the error terms is assumed constant over the range of the explanatory variables, and the error terms are assumed to be uncorrelated with one another.
e) Exogenously generated data. The data generation for the explanatory variables operates independently of the process that generates the disturbance 𝑒ᵢ.
f) Normal distribution. Regression assumes that the disturbances are normally distributed, with zero mean and constant variance.
In practice, many models violate some of these regression assumptions and corrective action has to be taken to remedy the breach. The way the study dealt with these issues is addressed in the following sections.
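The full rank assumption listed above can be checked numerically before estimation. The following is a minimal sketch, assuming NumPy is available and using synthetic regressors rather than the study's data; the helper name `has_full_rank` is hypothetical.

```python
import numpy as np

def has_full_rank(X):
    """Return True if no column of X is an exact linear combination of the others."""
    X = np.asarray(X, dtype=float)
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)   # synthetic regressor (illustrative only)
x2 = rng.normal(size=n)   # synthetic regressor (illustrative only)

# Design matrix with an intercept column and two unrelated regressors.
X_ok = np.column_stack([np.ones(n), x1, x2])

# Adding a column that is an exact linear combination of x1 and x2
# violates the full rank assumption.
X_bad = np.column_stack([X_ok, 2.0 * x1 - 3.0 * x2])

print(has_full_rank(X_ok), has_full_rank(X_bad))
```

A rank check of this kind only detects exact linear dependence; strong but imperfect correlation among explanatory variables would need a separate diagnostic.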
6.2.5.1 Estimating gravity model equations
The multiplicative functional form of gravity models violates the linearity assumption. To rectify the breach and make the relationship between the dependent and the explanatory variables linear, the model is transformed using natural logs. In this form, what is multiplied (the generative variables) in the gravity model becomes added and what is divided (impedance variables) is subtracted. Equation 6-1 was therefore estimated in the following form:
$$\ln V_{ija} = b_0 + b_1 \ln POP_{ij} + b_2 \ln GNI_{ij} + b_3 \ln T_{ij} + b_4 \ln D_{ij} + \dots + b_n \ln O_{ij} + e_i \qquad \text{(Equation 6-2)}$$
where the explanation for the variables is the same as in equation 6-1 and $e_i$ is the disturbance term.
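The effect of the log transformation can be illustrated with a small numerical check. This is a sketch only: the three-variable gravity form, the function names and all numerical values below are assumptions chosen for illustration and are not taken from the study.

```python
import math

def gravity_volume(k, pop, gni, dist, b1, b2, b3):
    """Hypothetical multiplicative gravity form: V = k * POP^b1 * GNI^b2 / D^b3."""
    return k * pop**b1 * gni**b2 / dist**b3

def log_gravity_volume(k, pop, gni, dist, b1, b2, b3):
    """The same relationship after taking natural logs of both sides."""
    return (math.log(k)
            + b1 * math.log(pop)     # multiplied terms become added
            + b2 * math.log(gni)
            - b3 * math.log(dist))   # divided terms become subtracted

# Illustrative values only.
params = dict(k=0.5, pop=2.0e6, gni=3500.0, dist=900.0, b1=0.8, b2=0.6, b3=1.2)

lhs = math.log(gravity_volume(**params))
rhs = log_gravity_volume(**params)
print(math.isclose(lhs, rhs))
```

In a regression setting the minus sign on an impedance variable is usually absorbed into its coefficient, which is then expected to be negative.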
Once the assumptions of the regression analysis are satisfied, the unknown parameters of the log-transformed model are estimated. Using the ordinary least squares method, regression analysis generates an equation that describes the statistical relationship between one or more explanatory variables and the dependent variable. In conducting the regression analysis, measures of predictive accuracy are set and statistical tests are used to assess the significance of the predictive power. The objective is that the econometric model achieves acceptable levels of predictive accuracy to justify its application.
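Ordinary least squares estimation of a log-transformed model can be sketched as follows. The example assumes NumPy, generates synthetic city-pair data from assumed "true" parameters and then recovers them; none of the values come from the study, and the three explanatory variables are a simplification.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Synthetic city-pair data (illustrative values only, not from the study).
pop = rng.uniform(1e5, 5e6, size=n)
gni = rng.uniform(500.0, 8000.0, size=n)
dist = rng.uniform(200.0, 3000.0, size=n)

# Assumed parameters used to generate the data: intercept, POP, GNI, distance.
true_b = np.array([1.0, 0.8, 0.6, -1.2])
noise = rng.normal(scale=0.1, size=n)

# Design matrix of the log-transformed model, with a leading column of ones.
X = np.column_stack([np.ones(n), np.log(pop), np.log(gni), np.log(dist)])
ln_v = X @ true_b + noise

# Ordinary least squares: choose b_hat to minimise the sum of squared residuals.
b_hat, *_ = np.linalg.lstsq(X, ln_v, rcond=None)
residuals = ln_v - X @ b_hat

print(np.round(b_hat, 2))
```

By construction of the least squares solution, the residuals are orthogonal to every column of the design matrix, which offers a quick check on the fit.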
6.2.5.2 Assessing goodness of model fit
Regression results indicate the direction, size, and statistical significance of the relationship between explanatory variables and the dependent variable. Having fitted the regression equation to the model data, the next step is to establish how statistically sound the econometric model is. Fleming and Nellis (2000) suggest four areas:-
a. Interpretation of the individual regression coefficients
b. Statistical significance of the regression coefficients
c. Overall predictive power of the estimated equation
d. Statistical significance of the overall predictive power
6.2.5.2.1 Interpretation of the individual regression coefficients
Regression coefficients represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant. The sign of each coefficient indicates the direction of the relationship. The signs, magnitude and statistical significance of each of the regression coefficients are examined to ascertain their collective prediction of the dependent variable and their individual contribution to the regression model and its predictive power.
A priori assumptions on the SADC model are that regression coefficients for the population, per capita income, trade, tourism, border and language should be positive as they are postulated to be positively related to volume of air passenger traffic. The regression coefficients of distance and travel restrictions are expected to be negative as the two variables are hypothesised to be inversely related to demand for air travel. The HDI and PSI are expected to be either negative or positive.
Since the SADC gravity model is multiplicative in nature, the measured changes are in percentage terms. For explanatory variables that enter the model in logarithms, each regression coefficient estimates the percentage change in the mean response associated with a one per cent change in that explanatory variable when all other explanatory variables are held constant.
6.2.5.2.2 Statistical significance of the individual regression coefficients
The size of each regression coefficient indicates the contribution of the variable to the variation in the volume of city-pair passenger traffic. The statistical significance of the individual regression coefficients is tested using a t-test. In the regression equation, this involves testing whether the values of the regression coefficients differ significantly from zero at a predefined level of significance (probability value). The null hypothesis for the t-test is 𝑯₀: b₁ = 0 and the alternative is 𝑯₁: b₁ ≠ 0. This means that if the probability value (p-value) associated with a test of a regression coefficient is less than the chosen alpha (α) level of significance, which in the SADC model is 0.05, the relationship between the explanatory variable and the dependent variable is considered statistically significant.
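The coefficient test described above can be sketched in a few lines. This example assumes NumPy and synthetic data, and the function name `ols_t_tests` is hypothetical. To stay dependency-free it approximates the two-sided p-value with the standard normal distribution, which is a large-sample stand-in for the Student t distribution and not necessarily the calculation used in the study.

```python
import math
import numpy as np

def ols_t_tests(X, y):
    """Return coefficients, standard errors, t-statistics and approximate p-values."""
    n, k = X.shape
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b_hat
    sigma2 = resid @ resid / (n - k)            # unbiased error-variance estimate
    cov_b = sigma2 * np.linalg.inv(X.T @ X)     # estimated covariance of b_hat
    se = np.sqrt(np.diag(cov_b))
    t_stats = b_hat / se                        # test statistic for H0: b_i = 0
    # Two-sided p-values from the standard normal: a large-sample approximation
    # to the t distribution with n - k degrees of freedom.
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return b_hat, se, t_stats, p_values

rng = np.random.default_rng(1)
n = 500
x_relevant = rng.normal(size=n)
x_irrelevant = rng.normal(size=n)               # has no effect on y by construction
y = 2.0 + 0.5 * x_relevant + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_relevant, x_irrelevant])
b_hat, se, t_stats, p_values = ols_t_tests(X, y)

alpha = 0.05                                    # significance level
significant = p_values < alpha
print(significant)
```

With small samples the exact t distribution should be used in place of the normal approximation.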
6.2.5.2.3 Overall predictive power of the econometric model
The coefficient of determination, R-squared (𝙍²), measures the proportion of the total variation in the dependent variable (𝘺) that is accounted for by the variation in the explanatory variables (Greene, 2008). This measure ranges from 0 to 100% and the more variation that is accounted for by the regression model, the better the model fits the data (Hair et al, 2010). 𝙍² therefore provides an indication of the overall predictive power of the estimated regression equation (Mendenhall and Sincich, 2010; Greene, 2008). Hair et al (2010) suggest that projects should aim to design models that achieve high predictive power. Their rule of thumb is 80%.
However, 𝙍² has some limitations as a measure of a model’s goodness of fit. Greene (2008) notes that 𝙍² will never decrease when more variables are added to a model. A higher 𝙍² would therefore not necessarily imply a better model. Computing the 𝙍²adjusted is useful as it incorporates a penalty for the number of degrees of freedom used up in estimating parameters. Greene (2008) notes that 𝙍²adjusted may rise or fall when a new variable is added to a model. Its movement depends on the contribution of the new variable to the fit of the regression model: 𝙍²adjusted increases only if the new explanatory variable improves the model, and decreases when the new explanatory variable does not add value to the predictive power of the whole model.
Hair et al (2010) note that the 𝙍²adjusted is a useful tool for comparing the explanatory power of competing models with the same dependent variable but different numbers of explanatory variables.
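The behaviour of the two measures when an uninformative variable is added can be sketched as follows. The example assumes NumPy and synthetic data, and uses the common textbook form of the adjustment, 1 - (1 - R²)(n - 1)/(n - k), where k counts the intercept; the function name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Return (R2, adjusted R2) for an OLS fit; X must include the intercept column."""
    n, k = X.shape                          # k counts the intercept
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b_hat
    ss_res = resid @ resid                  # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)   # degrees-of-freedom penalty
    return r2, adj_r2

rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                   # unrelated to y by construction
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, junk])

r2_small, adj_small = r_squared(X_small, y)
r2_big, adj_big = r_squared(X_big, y)

print(r2_small, r2_big)     # R2 cannot fall when a variable is added
print(adj_small, adj_big)   # adjusted R2 may rise or fall
```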
6.2.5.2.4 Statistical significance of the overall predictive power of the model
The statistical significance of the overall explanatory power as given by 𝙍² is assessed by the 𝑭 test. Hair et al (2010) define the F-statistic as the ratio of the explained to the unexplained variance. They argue that the F-test compares the explanatory power of the full model to the reduced model with only the y-intercept term in it, which is equivalent to the sample mean. Thus, it compares a model with one or more partial slope coefficients to the sample mean model, to see if it provides a convincingly better fit to the data. This is therefore a test of whether or not all the slope coefficients are jointly equal to zero. As is the case with the t-test for the partial regression coefficients, there is a null and an alternative hypothesis, stated as follows:-
𝑯₀: b₁ = b₂ = b₃ = … = bₙ = 0
𝑯₁: at least one bᵢ ≠ 0
If 𝑯₀ is rejected, the conclusion is that there is a significant relationship between the volume of air passenger traffic and at least one of the explanatory variables and that the regression as a whole is significant.
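The overall F-statistic can be computed either from the explained and unexplained sums of squares or directly from 𝙍². The sketch below assumes NumPy and synthetic data and checks that the two routes agree; the function name `overall_f_statistic` is hypothetical.

```python
import numpy as np

def overall_f_statistic(X, y):
    """F-statistic for H0: all slope coefficients are zero (X includes the intercept)."""
    n, k = X.shape                                 # k counts the intercept
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b_hat
    ss_reg = np.sum((fitted - y.mean()) ** 2)      # explained sum of squares
    ss_res = np.sum((y - fitted) ** 2)             # unexplained sum of squares
    # Ratio of explained to unexplained variance, each per degree of freedom.
    f_from_ss = (ss_reg / (k - 1)) / (ss_res / (n - k))
    # Equivalent expression in terms of R2.
    r2 = ss_reg / (ss_reg + ss_res)
    f_from_r2 = (r2 / (k - 1)) / ((1.0 - r2) / (n - k))
    return f_from_ss, f_from_r2

rng = np.random.default_rng(3)
n = 80
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.7 * x1 - 0.4 * x2 + rng.normal(size=n)   # synthetic response

X = np.column_stack([np.ones(n), x1, x2])
f_from_ss, f_from_r2 = overall_f_statistic(X, y)
print(f_from_ss, f_from_r2)
```

The statistic is then compared with the F distribution with k - 1 and n - k degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_from_ss, k - 1, n - k)` would give the corresponding p-value.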