7.1 Methodology
7.1.1 Procedures Used for Evaluating Multiple Regression Models
The multiple regression models defined in this study were estimated using the standard Ordinary Least Squares (OLS) method. This method obtains estimates of the parameters of a multiple regression model by minimizing the sum of squared residuals and was
implemented using SPSS for Windows version 14.0. However, for the statistical findings to be reliable and valid, the estimated multiple regression models have to meet the specific theoretical assumptions and requirements of OLS regression analysis.
506 Cf. Albers and Hildebrandt (2006), p. 2 507 See chapter 3.2.
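The least-squares criterion described above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in this study: the function names (`ols_fit`, `sum_squared_residuals`) and the toy data are hypothetical and only show how the coefficients that minimize the sum of squared residuals follow from the normal equations.

```python
def ols_fit(X, y):
    """Solve the normal equations (X'X) b = X'y for the OLS coefficients b."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(A[r][c] * coef[c] for c in range(r + 1, p))
        coef[r] = (b[r] - tail) / A[r][r]
    return coef


def sum_squared_residuals(X, y, coef):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - sum(c * v for c, v in zip(coef, row))) ** 2
               for row, yi in zip(X, y))


# Hypothetical toy data: an intercept column of ones plus one regressor.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
X = [[1.0, xi] for xi in x]

coef = ols_fit(X, y)
ssr = sum_squared_residuals(X, y, coef)
print("coefficients:", coef)
print("minimized sum of squared residuals:", ssr)
```

Any other coefficient vector yields a larger sum of squared residuals on the same data, which is what defines the OLS estimates.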
The first requirement relates to the correct specification of the regression model. That is, the model needs to be linear in the parameters β0, …, βk, it needs to include all relevant explanatory (independent) variables, and the number of parameters to be estimated needs to be smaller than the number of observations.508 If all relevant variables are included in the regression model, the error term (u) only represents random deviations of the estimated values from the actual (observed) values. Therefore, the conditional mean of the error term (u) is assumed to be equal to zero. That is, the error (u) has an expected value of zero, given any values of the independent variables. In other words, E(u | x1, x2, …, xk) = 0 is assumed.509 The zero-conditional-mean assumption guarantees that two conditions necessary for deriving OLS estimators are satisfied. It leads to zero mean of the error term and zero covariance between the error term and the independent variables.
However, the assumption can fail if a relevant explanatory variable is omitted or excluded from the regression model. In this case, the OLS estimators can be biased and inconsistent. Yet, omitted-variable bias occurs only when the omitted variable actually belongs in the true population model (i.e., it has a non-zero marginal effect on the dependent variable) and when it is correlated with at least one of the explanatory variables (x1, x2, …, xk) included in the estimated model.510 Moreover, Wooldridge notes that the zero-conditional-mean assumption can also fail if the functional relationship between the explained and
explanatory variables is misspecified in the estimated regression model. Functional form misspecification occurs, for example, when important nonlinearities are neglected, i.e., when quadratic (or cubic) terms of some of the explanatory variables are not included in the estimated model.511
To statistically test whether the employed multiple regression models are correctly specified, this dissertation performs Ramsey's regression specification error test
(RESET).512 The RESET test is an F-test of the difference in R2 between the originally estimated regression model and an augmented model that includes power functions of the predicted values (i.e., their squares and cubes) obtained from the original model. The general philosophy of the RESET test is that if the original model can be significantly improved by artificially including powers of the predictions of the model, then the original
508 Cf. Backhaus et al. (2005), p. 79.
509 Cf. Backhaus et al. (2005), pp. 83-84; Wooldridge (2003), p. 85.
510 At this point, it should be noted that in any application there are always factors that the researcher will not be able to include, due to data limitations or ignorance.
511 Cf. Wooldridge (2003), pp. 89-94. 512 See Ramsey (1969), pp. 350-371.
model must have been inadequate. Thus, a significant F-statistic (p < 0.05) implies that the original model is inadequate and can be improved. In contrast, an insignificant F-statistic (p > 0.05) suggests that the test has not been able to detect any misspecification. Such a result is consistent with a correctly specified regression model, although it does not prove correct specification. Ramsey proposed the RESET test as a general misspecification test designed to detect both omitted variables and inappropriate functional form.513 However, the RESET test does not
technically test for omitted variables. Therefore, several researchers argue that "RESET is a functional form test, and nothing more."514 Consequently, this dissertation interprets the results of the RESET test primarily as indicative of the presence or absence of functional form misspecification.
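The mechanics of the RESET test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the routine used in this study: the helper names and the toy data are hypothetical, the small solver is repeated so that the sketch runs on its own, and the fitted values are standardized before being squared and cubed, which leaves the F-statistic unchanged when the model contains an intercept but keeps the arithmetic well behaved.

```python
def ols_fit(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    coef = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(A[r][c] * coef[c] for c in range(r + 1, p))
        coef[r] = (b[r] - tail) / A[r][r]
    return coef


def fitted_values(X, coef):
    return [sum(c * v for c, v in zip(coef, row)) for row in X]


def reset_f_statistic(X, y):
    """F-statistic for adding squared and cubed fitted values to the model."""
    n, p = len(X), len(X[0])  # p = k + 1, the intercept column is part of X
    fit = fitted_values(X, ols_fit(X, y))
    ssr_restricted = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    # Standardize the fitted values (same column space, better conditioning).
    mean = sum(fit) / n
    sd = (sum((f - mean) ** 2 for f in fit) / n) ** 0.5
    z = [(f - mean) / sd for f in fit]
    X_aug = [list(row) + [zi ** 2, zi ** 3] for row, zi in zip(X, z)]
    fit_aug = fitted_values(X_aug, ols_fit(X_aug, y))
    ssr_unrestricted = sum((yi - fi) ** 2 for yi, fi in zip(y, fit_aug))
    q = 2  # number of added terms
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - p - q))


# Hypothetical data: one truly linear and one truly quadratic relationship.
x = [float(i) for i in range(10)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.1, -0.3, 0.4, -0.1, -0.1]
X = [[1.0, xi] for xi in x]
y_linear = [1.0 + 2.0 * xi + e for xi, e in zip(x, noise)]
y_quadratic = [xi ** 2 + e for xi, e in zip(x, noise)]

f_linear = reset_f_statistic(X, y_linear)
f_quadratic = reset_f_statistic(X, y_quadratic)
print("RESET F, linear data:   ", f_linear)
print("RESET F, quadratic data:", f_quadratic)
```

Fitting a straight line to the quadratic data produces a very large F-statistic, because the powers of the fitted values pick up the neglected curvature. The statistic is compared with an F distribution with 2 and n - k - 3 degrees of freedom.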
A second requirement is that the variance of the error term (u), conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. This condition is known as the homoscedasticity or "constant variance" assumption.515 If this assumption fails, then the regression model exhibits
heteroscedasticity. While heteroscedasticity does not cause bias or inconsistency in the OLS estimators, it distorts the OLS standard errors. Consequently, the OLS standard errors are no longer valid for constructing confidence intervals and t-statistics.516 Thus, the statistics used to test hypotheses in a multiple regression model are not valid in the presence of heteroscedasticity. To test for the presence of heteroscedasticity in the employed regression models, this study conducts Breusch-Pagan tests.517 The Breusch-Pagan test is a test of the null hypothesis of homoscedasticity. Thus, a statistically insignificant test statistic (p > 0.05) means that the null hypothesis cannot be rejected, i.e., the test provides no evidence of heteroscedasticity in the examined regression model.
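A minimal Python sketch of the idea behind the Breusch-Pagan test follows. It uses the LM form n·R2, where R2 comes from an auxiliary regression of the squared OLS residuals on the explanatory variables. Whether this variant matches the output of the software used in this study is an assumption, and all names and data are hypothetical.

```python
def ols_fit(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    coef = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(A[r][c] * coef[c] for c in range(r + 1, p))
        coef[r] = (b[r] - tail) / A[r][r]
    return coef


def predict(row, coef):
    return sum(c * v for c, v in zip(coef, row))


def breusch_pagan_lm(X, y):
    """LM statistic n * R2 from regressing squared residuals on X."""
    n = len(X)
    coef = ols_fit(X, y)
    u2 = [(yi - predict(row, coef)) ** 2 for row, yi in zip(X, y)]
    aux = ols_fit(X, u2)
    mean_u2 = sum(u2) / n
    sst = sum((v - mean_u2) ** 2 for v in u2)
    ssr = sum((v - predict(row, aux)) ** 2 for row, v in zip(X, u2))
    return n * (1.0 - ssr / sst)


# Hypothetical data: the same error pattern, once with constant spread and
# once with a spread that grows with x.
pattern = [1.0, -0.5, -1.0, 0.5]
x = [float(i) for i in range(1, 21)]
X = [[1.0, xi] for xi in x]
y_hom = [2.0 + 3.0 * xi + pattern[i % 4] for i, xi in enumerate(x)]
y_het = [2.0 + 3.0 * xi + 0.3 * xi * pattern[i % 4] for i, xi in enumerate(x)]

lm_hom = breusch_pagan_lm(X, y_hom)
lm_het = breusch_pagan_lm(X, y_het)
# With one regressor the statistic is compared with a chi-square distribution
# with 1 degree of freedom (5% critical value: 3.841).
print("LM, constant error variance:", lm_hom)
print("LM, variance rising with x: ", lm_het)
```

Only the second data set yields a statistic above the critical value, mirroring the decision rule described above.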
Moreover, it is required that none of the independent variables in the regression model is constant, and there are no exact linear relationships among the independent variables. That is, no perfect multicollinearity among the independent (explanatory) variables in the multiple regression model should exist.518 While perfect multicollinearity can practically
513 Cf. Ramsey (1969), p. 369. 514 Wooldridge (2003), p. 294. 515 Cf. Wooldridge (2003), p. 95. 516 Cf. Backhaus et al. (2005), pp. 85-86.
517 Cf. Breusch and Pagan (1979), pp. 1287-1294; Wooldridge (2003), pp. 266-267. 518 Cf. Backhaus et al. (2005), p. 70; Wooldridge (2003), p. 86.
be ruled out through correct specification of the regression model519, some level of multicollinearity will always be present, particularly in complex models. However, high (but not perfect) multicollinearity among the independent variables can have detrimental effects on the results of the regression analysis. With high levels of multicollinearity it may be difficult to identify the specific contribution of each explanatory variable, because the OLS estimators may be unstable. That is, the estimators (regression coefficients) can be sensitive to the deletion or addition of explanatory variables. Moreover, high
multicollinearity causes the estimators to be less efficient, i.e., they possess larger standard errors and wider confidence intervals.520 One way to assess the level of multicollinearity in the employed regression models is to examine the bivariate (pairwise) correlations between the independent variables. Commonly, the presence of bivariate correlations above 0.8 is considered indicative of the presence of strong linear associations, suggesting that
multicollinearity may be a problem.521 However, the absence of high bivariate correlations does not imply a lack of collinearity, because the correlation matrix may not reveal collinear relationships involving more than two variables. Therefore, this dissertation examines the level of multicollinearity in the employed multiple regression models by calculating variance inflation factors (VIFs).522 In line with the literature, a maximum VIF greater than 10 is used as the cut-off threshold for high (harmful) multicollinearity.523 It is important to note that the "multicollinearity problem" is not really well-defined, since multicollinearity does not violate a regression assumption. Therefore, it is ultimately up to each researcher individually to determine how serious the multicollinearity problem is, and how the problem should be treated.524
If the above requirements (assumptions) are fulfilled, the OLS estimators obtained from the multiple regression analysis are unbiased and efficient.525 However, in order to perform statistical inference, the unobserved error needs to be normally distributed. Thus, prior to testing hypotheses using F-tests and t-tests, the normality of the residuals needs to be
519 Perfect multicollinearity mostly results from model misspecification, e.g., by including the same variable twice. Note that the inclusion of a nonlinear function of an independent variable (e.g., by including a squared term of the variable) does not violate the assumption of no perfect multicollinearity. Even though the squared term is an exact function of the respective independent variable, it is not an exact linear function of the variable (Cf. Wooldridge (2003), p. 87). 520 Cf. Backhaus et al. (2005), p. 90.
521 Cf. Farrar and Glauber (1967), p. 98; Mason and Perreault (1991), p. 270. 522 Cf. Backhaus et al. (2005), p. 91.
523 Cf. Belsley (1991), p. 28; Mason and Perreault (1991), p. 270. 524 Cf. Wooldridge (2003), p. 97; Backhaus et al. (2005), p. 92.
525 The assumption of no serial correlation (autocorrelation) has been neglected because it is only relevant in regressions using time series or panel data. This is not the case in this dissertation.
demonstrated. For this purpose, the study performs Kolmogorov-Smirnov and Shapiro-Wilk tests on the standardized residuals obtained from the estimated regression models.526 In both cases, a significant test statistic (p < 0.05) indicates a significant deviation from normality. In contrast, insignificant test results (p > 0.05) provide support for the normality of the residuals, which in turn implies that the conditions for performing statistical
inferences are met.
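To make the logic of the Kolmogorov-Smirnov test concrete, the following Python sketch computes only the test statistic, i.e., the largest gap between the empirical distribution of the standardized residuals and the standard normal distribution. It is an illustration with hypothetical names and data. The p-values reported by statistical packages, and the Shapiro-Wilk statistic, are deliberately not reproduced here.

```python
from statistics import NormalDist, mean, stdev


def standardize(residuals):
    """Center residuals and scale them by their sample standard deviation."""
    m, s = mean(residuals), stdev(residuals)
    return [(r - m) / s for r in residuals]


def ks_statistic(z):
    """Largest distance between the empirical CDF of z and the N(0, 1) CDF."""
    ordered = sorted(z)
    n = len(ordered)
    std_normal = NormalDist()
    d = 0.0
    for i, value in enumerate(ordered, start=1):
        cdf = std_normal.cdf(value)
        # The empirical CDF jumps from (i - 1) / n to i / n at each point.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


n = 20
# Hypothetical "ideal" sample placed exactly at standard normal quantiles.
normal_like = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
# Hypothetical strongly right-skewed residuals.
skewed = standardize([0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5,
                      0.6, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 9.0])

d_normal = ks_statistic(normal_like)
d_skewed = ks_statistic(skewed)
print("D, normal-like sample:", d_normal)  # smallest possible value, 0.5 / n
print("D, skewed sample:     ", d_skewed)
```

A larger statistic indicates a stronger departure from normality. The test converts it into the p-value that is compared with the 0.05 threshold.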
In addition to testing whether the general assumptions and requirements of OLS regression analysis have been met, the goodness-of-fit of the estimated regression models needs to be assessed. For this purpose, three global goodness-of-fit statistics are used that indicate how well the estimated regression model explains the dependent variable. First, the coefficient of determination (R2) provides the fraction of the sample variation in the dependent variable that is explained by the independent (explanatory) variables. The value of R2 is always between zero and one, with higher values indicating greater model fit. Thus, an R2 value of one indicates that the estimated regression model provides a perfect fit to the data. However, because the explanatory power of an independent variable is at worst zero, the coefficient of determination (R2) never decreases when a new independent variable is added to the regression model.527 Therefore, a second statistic called the adjusted coefficient of determination (adjusted R2) is calculated. The adjusted R2 statistic depends explicitly on the number of independent variables (k) and therefore imposes a penalty for adding additional independent variables to a model. Thus, in contrast to R2, adjusted R2 can go up or down when a new independent variable is added to a regression.
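The difference between the two statistics can be illustrated with a short Python sketch based on the standard formulas R2 = 1 - SSR/SST and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). The function names and the numbers in the example are hypothetical and are not taken from the models of this study.

```python
def r_squared(y, fitted):
    """Share of the sample variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)             # total variation
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    return 1.0 - ssr / sst


def adjusted_r_squared(r2, n, k):
    """Penalize R2 for the number of independent variables k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: a fourth regressor raises R2 only from 0.300 to 0.301.
n = 50
before = adjusted_r_squared(0.300, n, k=3)  # about 0.254
after = adjusted_r_squared(0.301, n, k=4)   # about 0.239
print("adjusted R2 with 3 regressors:", round(before, 3))
print("adjusted R2 with 4 regressors:", round(after, 3))
```

Although R2 rises slightly, adjusted R2 falls, because the gain in explained variation does not offset the loss of one degree of freedom.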
There is no general guideline requiring R2 or adjusted R2 to be above any particular value. As Backhaus et al. note, the evaluation of the obtained R2 (adjusted R2) largely depends on the individual research setting.528 Indeed, low values of R2 in regression equations are not uncommon in the social sciences. Moreover, "a seemingly low R2 does not necessarily mean that an OLS regression equation is useless."529 Thus, it is not possible to determine a priori the level of R2 (adjusted R2) that indicates a satisfying fit of the regression models employed by this study. Consequently, the main purpose of these two statistics is to compare the different regression models estimated in this study based on their ability to explain the variance in the dependent variable. In addition, the (adjusted) R2 values provide
526 Cf. Field (2005), p. 205.
527 Cf. Wooldridge (2003), pp. 197-198. 528 Cf. Backhaus et al. (2005), p. 97. 529 Wooldridge (2003), p. 41.
a basis for comparing the fit of the analyzed regression models with the fit of the models employed by previous empirical studies on the internationalization-performance
relationship.
Because the evaluation of regression models should not put too much weight on the size of (adjusted) R2,530 an important role is assigned to the third global goodness-of-fit statistic. The F-statistic for the overall significance of a regression model tests the null hypothesis that none of the explanatory variables has an effect on the dependent variable. Thus, if the F-statistic fails to reject the null hypothesis, then there is no evidence that any of the independent variables help to explain the dependent variable. This is the case when the p-value of the F-statistic is greater than 0.05. In contrast, a statistically significant F-statistic (p < 0.05) indicates that the independent variables are jointly significant, i.e., that at least one of them helps to explain the variation in the dependent variable. As Wooldridge demonstrates, even a seemingly small R2 may result in a highly significant F-statistic, which indicates that the regression model still has explanatory power.531 Thus, the F-statistic for overall significance is an important measure of the quality of the multiple regression models employed by this study.
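The link between R2 and the overall F-statistic can be made explicit with the standard formula F = (R2 / k) / ((1 - R2) / (n - k - 1)). The Python sketch below uses hypothetical values of R2, n, and k to illustrate why a small R2 can still be accompanied by a large F-statistic in a large sample.

```python
def overall_f_statistic(r2, n, k):
    """F-statistic for the joint significance of all k independent variables."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical example: the same small R2 in a large and in a small sample.
f_large_sample = overall_f_statistic(0.04, n=1000, k=3)  # about 13.83
f_small_sample = overall_f_statistic(0.04, n=30, k=3)    # about 0.36
print("F with n = 1000:", round(f_large_sample, 2))
print("F with n = 30:  ", round(f_small_sample, 2))
```

With three regressors, the 5% critical value of the F distribution is roughly 2.6 in large samples, so only the first of the two hypothetical models would be judged jointly significant.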
Once the overall significance of the estimated regression models has been established, it is possible to proceed with the assessment of the individual regression coefficients. In
multiple regression analysis, the regression coefficient for a particular independent variable measures the partial effect of the independent variable on the dependent variable after controlling for all other independent variables included in the model.532 To assess the strength of the partial effect of each independent variable analyzed in the multiple
regression models, this dissertation computes a t-statistic for the regression coefficient of each variable. The t-statistic is used to test the null hypothesis that the regression
coefficient equals zero, which would indicate that the independent variable has no effect on the dependent variable.533 A significant t-statistic (p < 0.05) therefore leads to the rejection of the null hypothesis and indicates that the independent variable contributes significantly to explaining the variation in the dependent variable. Thus, the t-statistics of the regression coefficients are used to identify which variables in the regression models have significant explanatory power. Consequently, calculating t-statistics is one of the main methods of testing the hypotheses developed in chapter 4.
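The computation behind these t-statistics can be sketched in Python as follows. Each coefficient is divided by its standard error, which is the square root of the estimated error variance times the corresponding diagonal element of the inverse of X'X. The sketch assumes homoscedastic errors, and its function names and data are hypothetical.

```python
def invert(A):
    """Gauss-Jordan inverse of a small square matrix."""
    p = len(A)
    M = [list(row) + [1.0 if r == c else 0.0 for c in range(p)]
         for r, row in enumerate(A)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[p:] for row in M]


def t_statistics(X, y):
    """Return OLS coefficients, their standard errors, and t-statistics."""
    n, p = len(X), len(X[0])  # p = k + 1, the intercept column is part of X
    xtx = [[sum(row[r] * row[c] for row in X) for c in range(p)]
           for r in range(p)]
    xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(p)]
    inv = invert(xtx)
    coef = [sum(inv[r][c] * xty[c] for c in range(p)) for r in range(p)]
    resid = [yi - sum(c * v for c, v in zip(coef, row))
             for row, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - p)  # estimated error variance
    se = [(sigma2 * inv[j][j]) ** 0.5 for j in range(p)]
    return coef, se, [b / s for b, s in zip(coef, se)]


# Hypothetical data with an intercept and two regressors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [2.9, 4.2, 6.8, 8.1, 11.3, 11.6]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

coef, se, t_stats = t_statistics(X, y)
for name, b, s, t in zip(["intercept", "x1", "x2"], coef, se, t_stats):
    print(f"{name}: coefficient={b:.3f}, std. error={s:.3f}, t={t:.2f}")
```

Each t-statistic is compared with a t distribution with n - k - 1 degrees of freedom to obtain the p-value referred to above.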
530 Cf. Wooldridge (2003), p. 41. 531 Cf. Wooldridge (2003), p. 153. 532 Cf. Wooldridge (2003), p. 120. 533 Cf. Backhaus et al. (2005), p. 74.
Table 18 summarizes the various criteria and test statistics used to evaluate the multiple regression models employed by this study.
| Criteria | Test Statistic | Requirement |
| --- | --- | --- |
| Regression Assumptions | | |
| Correctly specified regression model: linear in parameters β0, β1, …, βk; inclusion of all relevant variables | RESET Test | p > 0.05 |
| Correctly specified regression model: number of parameters (k+1) smaller than the number of observations (n) | | n > (k+1) |
| Homoscedasticity | Breusch-Pagan Test | p > 0.05 |
| No perfect multicollinearity | Variance Inflation Factor (VIF) | VIF (max) < 10 |
| Normal distribution of the standardized residuals | Kolmogorov-Smirnov Test | p > 0.05 |
| Normal distribution of the standardized residuals | Shapiro-Wilk Test | p > 0.05 |
| Global Goodness-of-Fit Statistics | | |
| Coefficient of determination | R2 | Highest values possible |
| Adjusted coefficient of determination | Adj. R2 | Highest values possible |
| Overall significance of the regression | F-Test | p ≤ 0.05 |
| Local Goodness-of-Fit Statistics | | |
| Significance of the regression coefficients | T-Test | p ≤ 0.05 |
Table 18: Criteria Used to Assess the Goodness-of-Fit of Regression Models (Source: own illustration)
7.1.2 Regression Strategy Used to Test the Moderating Effects of