Multiple linear regression analysis

4 HYPOTHESES TESTING METHODS

4.3 STATISTICAL METHODS

4.3.2 Multiple linear regression analysis

Multiple linear regression analysis was used to test the hypotheses. Regression analysis is a statistical technique for analyzing the relationship between a single dependent variable and several predictor (independent, explanatory) variables. In its general form, the multiple linear regression equation is $Y_j = B_0 + B_1 X_{1j} + B_2 X_{2j} + \dots + B_n X_{nj} + E_j$, where $Y_j$ is the value of the dependent variable for observation $j$, $B_0$ is a constant, $X_{1j}, \dots, X_{nj}$ are the independent variables, $B_1, \dots, B_n$ are the regression coefficients for $X_{1j}, \dots, X_{nj}$, and $E_j$ is an error term representing the residual left when the regression equation is fitted to observation $j$. This study applies ordinary least squares (OLS) regression, which is the most commonly applied regression method. In OLS regression, the regression coefficients are estimated so as to minimize the sum of squared residuals, i.e. the sum of squared differences between the observed values of the dependent variable and the corresponding values predicted by the regression equation.557
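
The OLS estimation described above can be illustrated with a minimal sketch. It assumes the coefficients are obtained by solving the normal equations with plain Gaussian elimination; the function names (`solve_linear_system`, `ols_fit`, `residuals`) and the data are hypothetical and are not taken from the study, which most likely relied on a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_fit(predictors, y):
    """Return [B0, B1, ..., Bn] minimizing the sum of squared residuals.

    predictors: one row per observation j, each row being [X1j, ..., Xnj].
    """
    design = [[1.0] + list(row) for row in predictors]  # leading 1 for B0
    p = len(design[0])
    # Normal equations: (X'X) B = X'y
    xtx = [[sum(r[i] * r[k] for r in design) for k in range(p)] for i in range(p)]
    xty = [sum(r[i] * yj for r, yj in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def residuals(predictors, y, coefs):
    """Observed minus fitted values, i.e. the estimated error terms Ej."""
    fitted = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
              for row in predictors]
    return [yj - fj for yj, fj in zip(y, fitted)]


# Illustrative data generated from Y = 1 + 2*X1 - X2 with no error term,
# so the fit should recover these coefficients almost exactly.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4]]
Y = [1 + 2 * x1 - x2 for x1, x2 in X]
coefs = ols_fit(X, Y)
sse = sum(e * e for e in residuals(X, Y, coefs))
```

Because the illustrative data contain no error term, the sum of squared residuals `sse` is practically zero; with real data it would be the quantity that OLS minimizes.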

557

Assumptions

The use of multiple linear regression analysis is based on several assumptions that the empirical data and the investigated phenomenon must fulfill. These are 1) linearity of the phenomenon measured, 2) normality of the error term (residual) distribution, 3) constant variance of the error terms, 4) independence of the error terms, 5) low multicollinearity, and 6) sufficient sample size.558 In addition, multiple regression analysis can only handle metric data. Categorical data were therefore converted into metric form in this study by creating dummy variables.
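
As a minimal sketch of the dummy-variable transformation mentioned above, the following assumes the common scheme in which one category serves as the reference and receives no column of its own. The function name `dummy_code` and the sector labels are hypothetical; the source does not state which coding scheme or categories were used.

```python
def dummy_code(values, reference):
    """Turn a categorical variable into 0/1 dummy variables.

    The reference category gets no column, so k categories yield k - 1
    dummies and the columns are not perfectly collinear with the constant.
    """
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical industry sector labels for five observations
sectors = ["electronics", "machinery", "software", "machinery", "electronics"]
dummies = dummy_code(sectors, reference="electronics")
```

Each resulting column can then enter the regression equation as an ordinary metric independent variable, and its coefficient is read as the difference from the reference category.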

The first assumption, linearity of the phenomenon measured, means that the relationship between the dependent variable and each independent variable should be linear. Linearity refers to the degree to which the change in the dependent variable associated with a unit change in the independent variable (the regression coefficient) is constant across the range of values of the independent variable. If a curvilinear pattern is found, data transformations should be used.559 Linearity was investigated by creating a scatter plot for each pair of dependent and independent variables and fitting a straight line to it. These scatter plots did not reveal any non-linear relationships.

The second assumption requires the error terms to be normally distributed. This concerns independent variables especially.560 There are two common ways to check the normality assumption. The simplest diagnostic tool is a histogram of the residuals (error terms), which can be checked visually to see whether the distribution is approximately normal. The other is a normal probability plot, in which the standardized residuals are compared with a normal distribution. If the residuals are normally distributed, the residual line closely follows the straight diagonal line that represents a normal distribution.561 This study used both a histogram of residuals and normal probability plots to analyze the normality of the error terms. No indication of non-normality was found.
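
The idea behind the normal probability plot can be sketched numerically. The example below assumes plotting positions of (i - 0.5) / n, which is one convention among several, and summarizes how straight the plotted line is with a Pearson correlation. That summary is an addition for illustration, since the study inspected the plots visually. All names and data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def normal_probability_points(residuals):
    """Return (theoretical normal quantile, standardized residual) pairs."""
    n = len(residuals)
    m, s = mean(residuals), pstdev(residuals)
    standardized = sorted((e - m) / s for e in residuals)
    # Plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, standardized))


def straightness(points):
    """Pearson correlation of the plotted points; values near 1 mean the
    residual line lies close to a straight diagonal."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Illustrative residuals: one roughly symmetric set, one strongly skewed set
symmetric = [-1.8, -1.1, -0.7, -0.4, -0.1, 0.1, 0.3, 0.6, 1.2, 1.9]
skewed = [0.1, 0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 1.0, 2.5, 6.0]

r_symmetric = straightness(normal_probability_points(symmetric))
r_skewed = straightness(normal_probability_points(skewed))
```

In this illustrative data the symmetric residuals give a correlation much closer to 1 than the skewed ones, which corresponds to the residual line hugging the diagonal in the plot.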

558 Hair et al. 1998, Cohen and Cohen 2003, Nummenmaa et al. 1997
559 Hair et al. 1998, Cohen and Cohen 2003
560 Cohen and Cohen 2003
561

Thirdly, the error terms should have a constant variance. Unequal variance of the error terms, known as heteroscedasticity, is one of the most commonly violated assumptions in linear regression analysis. Heteroscedasticity can be investigated by using the Levene test, which measures the equality of variances for a single pair of variables. There are two alternative remedies for heteroscedasticity. If the violation occurs in only one independent variable, the weighted least squares method can be applied.562 The other option is to use variance-stabilizing transformations or a correction such as the White correction. The Levene test was conducted for each pair of variables, and no signs of heteroscedasticity were found.
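
A minimal sketch of the Levene statistic follows. It assumes the observations have already been split into groups (for example by low and high values of one independent variable) and uses absolute deviations from the group means. The grouping, the function name `levene_w` and the data are hypothetical and do not reproduce the study's procedure.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic for equality of variances across groups.

    groups: list of lists of observations. W is compared with an
    F distribution with (k - 1, N - k) degrees of freedom.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Z_ij = |Y_ij - mean of group i|
    z = []
    for g in groups:
        g_mean = mean(g)
        z.append([abs(y - g_mean) for y in g])
    z_group_means = [mean(zi) for zi in z]
    z_grand_mean = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand_mean) ** 2
                  for zi, zm in zip(z, z_group_means))
    within = sum((v - zm) ** 2
                 for zi, zm in zip(z, z_group_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Illustrative groups: identical spread versus a five-fold larger spread
equal_spread = [[-1, 0, 1, -2, 2], [-1, 0, 1, -2, 2]]
unequal_spread = [[-1, 0, 1, -2, 2], [-5, 0, 5, -10, 10]]

w_equal = levene_w(equal_spread)
w_unequal = levene_w(unequal_spread)
```

A W value near zero is consistent with equal variances, while a large W relative to the F distribution points towards heteroscedasticity. Obtaining the p-value would require the F distribution, which this sketch leaves out.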

The fourth assumption concerns the independence of the error terms. The error terms of the observations must be independent of each other, i.e. they must not be sequenced by any variable. Typically, any random sample from a population fulfils this criterion.563 Independence of the error terms can be investigated by plotting the residuals against any possible sequencing variable; independent error terms appear as a random pattern in the residual plot. Data transformations and the inclusion of control variables can be used to remedy a violation.564 In this study, several control variables, such as firm size, R&D intensiveness, front end intensiveness, industry sector and objectives of the front end project, were used to ensure the independence of the error terms.

The fifth assumption requires low multicollinearity among the independent variables. When multicollinearity is present, the same variation enters the regression model more than once, which makes it difficult to determine the influence of each independent variable.565 Hair et al. state that a high correlation (.90 or more) between independent variables is one indication of high collinearity. The absence of high correlations, however, does not guarantee the absence of collinearity, which may be caused by the combined effect of several other independent variables. Two measures are commonly used to evaluate multicollinearity: 1) the tolerance value and 2) the variance inflation factor (VIF). These measures indicate the extent to which each independent variable is explained by the other independent variables. Tolerance is the share of an independent variable's variance that the other independent variables do not explain, and VIF is its reciprocal, so a small tolerance value and a high VIF value reflect high collinearity. Typically applied cut-off values, indicating serious multicollinearity problems, are a tolerance value below .10 and a VIF value above 10.566 All the VIF values of the independent variables without interaction terms were found to be below 2, the highest being 1.42. This indicates that multicollinearity is not a problem in this study.

562 Hair et al. 1998
563 Cohen and Cohen 2003
564 Hair et al. 1998
565 Nummenmaa et al. 1997
566 Hair et al. 1998
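
The tolerance and VIF measures can be sketched for the simplest case. The example assumes exactly two predictors, where the share of one predictor's variance explained by the other equals their squared correlation; with more predictors that share would come from regressing each predictor on all the others. Function names and data are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def tolerance_and_vif(r_squared_j):
    """r_squared_j: R^2 from regressing predictor j on the other predictors."""
    tolerance = 1.0 - r_squared_j
    return tolerance, 1.0 / tolerance


# Two hypothetical, fairly strongly correlated predictors
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]

r = pearson_r(x1, x2)
tolerance, vif = tolerance_and_vif(r ** 2)

# Cut-off values quoted in the text: tolerance below .10, VIF above 10
serious_multicollinearity = tolerance < 0.10 or vif > 10
```

Even with a fairly high correlation between the two illustrative predictors, the values stay clear of the cut-offs, which shows how demanding those thresholds are.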

Finally, the sample size should be sufficient to ensure that multiple regression analysis is appropriate and has adequate statistical power.567 With small samples, only very strong associations can be detected with certainty. Very large samples (over 1000 observations), on the other hand, make statistical significance tests overly sensitive. Hair et al. give the general rule that there should be at least five times as many observations as there are independent variables in total, in order to avoid ‘overfitting’ the model and the resulting problems of generalizability.568 This study follows these recommendations.

Interpreting the regression model

The standardized coefficients (Beta values $b_k$) indicate the relative importance of the independent variables, i.e. how much each uniquely accounts for the variance of the dependent variable. The larger the absolute Beta value, which typically lies between 0 and 1, the more important the independent variable. The t-test is used to examine whether the contribution of each independent variable is statistically significant. The t-value indicates how many standard errors the coefficient is from zero. The probability value p, in turn, indicates the significance of the test that $b_k$ differs from zero. For statistical significance, the p-value needs to be below .05.569
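
The relation between a coefficient, its standardized Beta value and its t-value can be sketched for the single-predictor case, chosen here only to keep the example short. The function name and data are hypothetical. The p-value is omitted because it requires Student's t distribution, which Python's standard library does not provide.

```python
from statistics import mean, stdev


def simple_regression_summary(x, y):
    """Unstandardized slope, standardized Beta and t-value for one predictor."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # Standard error of the slope, with n - 2 residual degrees of freedom
    se_slope = (sse / (n - 2) / sxx) ** 0.5
    return {
        "b": slope,
        # Standardizing rescales the slope by the two standard deviations
        "beta": slope * stdev(x) / stdev(y),
        # Number of standard errors between the coefficient and zero
        "t": slope / se_slope,
    }


# Illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
summary = simple_regression_summary(x, y)
```

With a single predictor the standardized Beta coincides with the Pearson correlation between the two variables; in multiple regression each Beta is instead adjusted for the other independent variables.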

R values indicate the overall explanatory power of the regression equation. The R value is the multiple correlation between the independent variables and the dependent variable. The $R^2$ value shows the percentage of variance in the dependent variable that the independent variables collectively account for.570 However, the $R^2$ value is influenced by the number of independent variables in the regression equation relative to the sample size. The adjusted $R^2$ value, which takes into account the number of independent variables and the sample size, is therefore typically used to measure the explanatory power, i.e. goodness of fit, of the overall regression equation.571 The statistical significance of the overall regression equation is indicated by the F value of the analysis of variance. If the p-value associated with the F value is below .05, the null hypothesis that there is no association between the independent variables and the dependent variable can be rejected.572 Beta values, p-values, R values, $R^2$ values and F values are reported for each regression analysis.

567 Nummenmaa et al. 1997, Hair et al. 1998
568 Hair et al. 1998
569 Dewberry 2004
570 Ibid.
571 Hair et al. 1998
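
As a minimal sketch, the three overall fit measures can be computed from the observed and fitted values. It assumes the fitted values come from an OLS equation that includes a constant; the function name `fit_statistics` and the data are hypothetical.

```python
def fit_statistics(y, fitted, n_predictors):
    """Return (R^2, adjusted R^2, F value) for a fitted regression equation."""
    n = len(y)
    y_mean = sum(y) / n
    sst = sum((v - y_mean) ** 2 for v in y)              # total variation
    sse = sum((v - f) ** 2 for v, f in zip(y, fitted))   # unexplained variation
    r2 = 1.0 - sse / sst
    residual_df = n - n_predictors - 1
    # Penalize R^2 for the number of predictors relative to sample size
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / residual_df
    # Explained variance per predictor over unexplained variance per
    # residual degree of freedom
    f_value = (r2 / n_predictors) / ((1.0 - r2) / residual_df)
    return r2, adj_r2, f_value


# Illustrative observed values and OLS fitted values (0.6 * x + 2.2, x = 1..5)
y = [2, 4, 5, 4, 5]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, adj_r2, f_value = fit_statistics(y, fitted, n_predictors=1)
```

The adjusted value is always at most the unadjusted one, and the gap widens as predictors are added to a small sample. The p-value of the F value would come from an F distribution with `n_predictors` and `n - n_predictors - 1` degrees of freedom.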

Moderating effect

The moderating effects of market uncertainty and technology uncertainty on two independent variables (front end process formalization and outcome-based rewarding) were tested. A moderating effect (interaction effect) means that an independent variable (C) changes the form of the relationship between another independent variable (A) and the dependent variable (B), as presented in Figure 4. In a regression equation, the moderating effect is represented by an interaction term formed by multiplying the independent variable by the moderating variable. The moderating effect is investigated in three steps. First, the original, unmoderated regression equation is estimated. Second, the moderated equation, which includes the interaction term, is estimated. Third, the change in $R^2$ between these two equations is examined. A statistically significant change in the $R^2$ value indicates a significant moderating effect.573

Figure 4. Moderating effect of C on the relationship between A and B.
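
The three-step procedure for testing moderation can be written out as equations. The symbols $\gamma_0, \dots, \gamma_3$, $\varepsilon$, $q$, $k$ and $N$ are notation introduced here for illustration and do not appear in the source. The coefficients are re-estimated in each equation, $q$ is the number of interaction terms added, $k$ is the number of predictors in the moderated equation, and $N$ is the sample size.

```latex
\begin{align}
B &= \gamma_0 + \gamma_1 A + \gamma_2 C + \varepsilon
  && \text{(unmoderated)} \\
B &= \gamma_0 + \gamma_1 A + \gamma_2 C + \gamma_3 (A \times C) + \varepsilon
  && \text{(moderated)} \\
\frac{\partial B}{\partial A} &= \gamma_1 + \gamma_3 C
  && \text{(slope of $A$ depends on $C$)} \\
F_{\text{change}} &=
  \frac{\left(R^2_{\text{mod}} - R^2_{\text{unmod}}\right)/q}
       {\left(1 - R^2_{\text{mod}}\right)/(N - k - 1)}
\end{align}
```

The third line shows why the product term captures moderation: the effect of A on B is no longer a single constant but shifts with the level of C. The last line is the standard F test for the change in $R^2$ between nested regression equations.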

Predictor centering was used when investigating interaction terms in this study in order to avoid the multicollinearity problems caused by interaction terms.574 Centering means that a predictor is linearly transformed into a new variable with a mean of zero, i.e. the mean of the predictor is subtracted from each of its scores. According to Lance, centering provides the following advantages: 1) multicollinearity among predictors is reduced, 2) interaction and main effects are separated, and 3) the regression coefficient for the residual cross-product term is directly interpretable.575

572 Dewberry 2004
573 Hair et al. 1998
574
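
The effect of centering on multicollinearity can be sketched by comparing how strongly a predictor correlates with its interaction term before and after the mean is subtracted. The scores and function names below are hypothetical, and the size of the reduction depends on the data.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def center(values):
    """Subtract the mean from each score so the new variable has mean zero."""
    m = mean(values)
    return [v - m for v in values]


x = [1, 2, 3, 4, 5, 6]  # hypothetical predictor scores
z = [3, 1, 4, 2, 6, 5]  # hypothetical moderator scores

# Interaction term built from the raw scores
raw_product = [a * b for a, b in zip(x, z)]
r_raw = pearson_r(x, raw_product)

# Interaction term built from the centered scores
xc, zc = center(x), center(z)
centered_product = [a * b for a, b in zip(xc, zc)]
r_centered = pearson_r(xc, centered_product)
```

In this illustrative data the predictor correlates far less with the interaction term once both variables are centered, which is the first of the advantages listed above.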

In document Management control in the front end of innovation (Page 113-118)