Regression analysis - Data analysis procedures

3. Methodology

3.3 Methods

3.3.5 Data analysis procedures

3.3.5.3 Regression analysis

As the researcher aims to analyse quantitative data on the firm performance of grid owner companies in Germany, and to search for patterns or causal relationships in the data in order to create law-like generalisations (Saunders et al., 2011), regression analysis seems to be an appropriate analytical technique. In general, regression analysis is a set of statistical processes for estimating the relationships among variables; i.e. the focus of the analysis is on the relationships between one or more predictor variables and an outcome variable. Regression analysis means fitting a model to the data and using it to predict values of an outcome variable from one or more predictor variables (Davies & Hughes, 2014; Field, 2013). In other words, regression analysis is used to model the dependence of a variable on one or more explanatory (independent) variables (Davies & Hughes, 2014). Thus, the task is to find the mathematical formula that best describes the relationship between the relevant variables (Field, 2013). In contrast, correlation analysis, a method of statistical evaluation used to examine the strength of a relationship between two variables, does not imply causality. In particular, correlation coefficients give no indication of the direction of causality. Hence, questions on the cause-effect relationships of variables are usually addressed by the application of regression analysis (Field, 2013). Overall, bivariate and multivariate statistical analyses are appropriate to test the data. Whereas “bivariate analysis is concerned with the analysis of two variables at a time in order to uncover whether or not the two variables are related” (Bryman & Bell, 2011, p. 346), “multivariate analysis entails the simultaneous analysis of three or more variables” (Bryman & Bell, 2011, p. 350).

There are several regression techniques, in particular linear, logistic and multiple regressions. According to Davies and Hughes (2014), the selection of a particular regression technique depends on the type of variables. Whereas linear regression models are based upon a straight line with regard to interval or ratio variables, logistic regression is a version of multiple regression in which the outcome is a categorical variable (Field, 2013).

A simple regression is a linear model in which one variable or outcome is predicted from a single predictor variable. The formula is:

$Y_i = (b_0 + b_1 X_i) + \varepsilon_i$

$Y_i$ denotes the outcome variable, $X_i$ the independent variable, $b_1$ is the regression coefficient, $b_0$ is the value of the dependent variable when the independent variable is zero and $\varepsilon_i$ symbolises some error (Field, 2013). An extension of simple regression is multiple regression in which an outcome is predicted by a linear combination of two or more predictor variables (Field, 2013).
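To make the simple regression formula concrete, the following minimal sketch estimates the intercept and the regression coefficient with the standard closed-form least-squares expressions. The function name `fit_simple_regression` and the numbers are purely illustrative assumptions; they are not data or code from this thesis.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1) that minimise the sum of squared errors for y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of the predictor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor must take at least two distinct values")
    b1 = sxy / sxx              # gradient of the regression line
    b0 = mean_y - b1 * mean_x   # value of the outcome when the predictor is zero
    return b0, b1


# Hypothetical illustrative values (NOT data from the thesis).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_regression(x, y)
# The residuals play the role of the error term for each observation i.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

With an intercept in the model, the residuals of the fitted line sum to zero up to rounding error.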

In research on corporate governance, many critical drivers that determine firm performance have already been identified. For example, firm size, industry affiliation and debt ratio have an influence on firm performance. However, these influences cannot be separated from one another in univariate analyses. Multivariate analyses offer the opportunity to isolate effects and to minimise bias (Fessler, 2013). Hence, regression analysis is preferred here and is supplemented by tests of robustness.

The structure of the multiple linear model is as follows:

$Y_i = (b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}) + \varepsilon_i$

$Y$ represents the outcome (dependent variable), and each predictor (independent variable) is denoted as $X$. $b_1$ is the regression coefficient of the first predictor $X_1$, $b_2$ is the regression coefficient of the second predictor $X_2$, etc. In general, each predictor has a regression coefficient $b$ associated with it that plays the role of the gradient of the regression line in a simple regression model. The regression coefficients $b$ estimate the relationship between predictors and the outcome, i.e. the value of $b$ stands for the change in the outcome resulting from a unit change in the predictor, holding the other predictors constant (Field, 2013). $b_0$ as the intercept of the regression line is the value of the outcome when all predictors are zero. The error term of a regression model is symbolised by $\varepsilon$. It summarises all those factors that have an influence on the dependent variable beyond the independent variables (Kohler & Kreuter, 2016). To assess the error in a regression model, the sum of squared errors is used, the so-called residual sum of squares (Field, 2013). In all cases, the subscript $i$ denotes an individual item, for example a grid owner company.
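Written out, the residual sum of squares compares each observed outcome with the value predicted by the fitted model. The symbols $\hat{Y}_i$ (predicted value), $SS_R$ and $N$ (number of observations) are notational assumptions introduced here for illustration:

```latex
\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni},
\qquad
SS_R = \sum_{i=1}^{N} \left( Y_i - \hat{Y}_i \right)^2
```

The smaller $SS_R$ is, the closer the fitted values lie to the observed outcomes.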

Applied to the phenomenon of grid owner companies, the ROA as the outcome (dependent variable) is denoted as $Y_i$. $X_1$ stands for the population or area as proxies for firm size, $X_2$ symbolises the private participation quota and $X_3$ refers to the legal form of a grid owner company. Hence, the multiple linear regression model is suitable for the analysis.
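Substituting these variables into the general structure gives, as a sketch, the following model. The variable labels are illustrative, and treating the legal form as a single numerically coded (for example dummy) variable is an assumption made here, since the coding is not specified in this section:

```latex
\mathit{ROA}_i = b_0
  + b_1 \,\mathit{Size}_i
  + b_2 \,\mathit{PrivateQuota}_i
  + b_3 \,\mathit{LegalForm}_i
  + \varepsilon_i
```

Here $\mathit{Size}_i$ stands for either the population or the area served by grid owner company $i$.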

As the same German grid owner companies are observed over several years, i.e. from 2010 to 2015, the data have a temporal dimension and form a panel. Normally, panel data are analysed by panel data regression. However, the data on German grid owner companies from 2010 to 2015 are insufficient to run a valid panel data regression. One main reason is that grid owner companies are a relatively new phenomenon and most of them were established in 2014. Thus, instead of panel data regression, multiple linear regression seems to be a suitable method.

In order to find the regression model, i.e. the regression coefficients, that fits the data best, the OLS method is applied. OLS stands for “ordinary least squares” and is a method of estimating the parameters of a model, here the regression coefficients, by minimising the sum of squared errors. The parameter estimate is the value, out of all possible values, that yields the smallest sum of squared errors. The aim of a multiple linear OLS regression model is to determine the influence of at least two independent variables on a dependent variable (Field, 2013; Kohler & Kreuter, 2016). As the underlying analysis aims at determining the influence of firm size, private participation quota and legal form on firm performance of German grid owner companies, a multiple OLS regression model is chosen.
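The least-squares idea can be sketched in a few lines of standard-library Python. The example below solves the normal equations (X'X) b = X'y, which is one common way of obtaining the OLS estimates; it is not necessarily how a statistics package computes them. All function names and data values are hypothetical and serve only as an illustration.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(predictors, outcome):
    """Return [b0, b1, ..., bn] by solving the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1.0 carries b0
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, outcome)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def residual_sum_of_squares(coefs, predictors, outcome):
    """Sum of squared differences between observed and fitted outcomes."""
    total = 0.0
    for row, y in zip(predictors, outcome):
        fitted = coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
        total += (y - fitted) ** 2
    return total


# Hypothetical illustrative data (NOT from the thesis): two predictors per case.
predictors = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 1.0], [6.0, 0.0]]
outcome = [3.2, 1.9, 6.8, 6.1, 8.3, 12.7]

coefs = fit_ols(predictors, outcome)
rss = residual_sum_of_squares(coefs, predictors, outcome)
print("coefficients:", [round(b, 3) for b in coefs], "RSS:", round(rss, 3))
```

Any other choice of coefficients, for example shifting the intercept slightly, yields a residual sum of squares at least as large as the one obtained from `fit_ols`.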

To sum up, in the following regression analysis, the influences of population or area as proxies for firm size, private participation quota and legal form on ROA are investigated. Thus, a multiple linear regression is applied and an ordinary least squares estimation method is used.

The estimation of the regression coefficients of a regression model with the OLS method rests on the assumption that the expected value of the error terms is zero, formally E(ε) = 0. This means that all influences on the dependent variable that are not part of the model cancel each other out on average. In other words, these other influences are zero over a large number of repetitions (Kohler & Kreuter, 2016). To avoid biased estimations, it has to be checked whether the requirements are met (Kohler & Kreuter, 2016).

The assumption E(ε) = 0 is violated if

(1) the relationship between the dependent and one of the independent variables is non-linear,

(2) individual outliers excessively influence the regression outcomes and

(3) multicollinearity between the independent variables exists.

A violation of the requirements reduces the quality of the results. However, full compliance with all assumptions is not possible in practice, as this would require a purely linear relationship between the independent and dependent variables (Field, 2013; Kohler & Kreuter, 2016).
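As a rough first screen for multicollinearity, the pairwise correlations between the predictors can be inspected. The sketch below is only an illustration: the function names, the column names, the data and the threshold of 0.8 are assumptions rather than values taken from this thesis, and a more complete check (for example variance inflation factors) would normally be run in a statistics package.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("each variable must take at least two distinct values")
    return cov / sqrt(var_a * var_b)


def flag_collinear_pairs(columns, threshold=0.8):
    """Return (name_1, name_2, r) for predictor pairs with |r| >= threshold.

    The default threshold is an assumed rule of thumb, not a fixed standard.
    """
    flagged = []
    for (name_1, col_1), (name_2, col_2) in combinations(columns.items(), 2):
        r = pearson_r(col_1, col_2)
        if abs(r) >= threshold:
            flagged.append((name_1, name_2, r))
    return flagged


# Hypothetical predictor values (NOT data from the thesis).
example_columns = {
    "population": [12.0, 45.0, 30.0, 80.0, 22.0, 60.0],
    "area": [15.0, 50.0, 28.0, 95.0, 20.0, 70.0],
    "private_quota": [0.49, 0.0, 0.25, 0.49, 0.0, 0.10],
}

for name_1, name_2, r in flag_collinear_pairs(example_columns):
    print(f"{name_1} and {name_2} are strongly correlated (r = {r:.2f})")
```

A flagged pair suggests that the two predictors carry largely the same information, so only one of them should be entered into the model at a time.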

Concerning the methods of predictor selection, i.e. the way in which variables can be entered into the regression model, different methods have to be distinguished. In order to obtain a robust regression model, only those predictor variables that account for a large proportion of the variation in the outcome variable should be included in the model. In general, stepwise regressions like the forward and the backward method, the hierarchical (blockwise entry) method and forced entry are common methods. Whereas stepwise techniques are often influenced by random variation in the data, forced entry, a method in which all predictors are forced into the model simultaneously, is seen as the only appropriate method for theory testing (Field, 2013). Thus, forced entry is chosen in this thesis. As a rule of thumb, the number of observations should be about 20 times larger than the number of variables studied (Schneider, Hommel, & Blettner, 2010).

The general procedure for conducting regression analysis and fitting a regression model is as follows (Field, 2013):

(1) Producing scatterplots in order to check whether the assumption of linearity is met, and checking for any outliers and obviously unusual cases.

(2) Running initial regression and fitting a model.

(3) Generalising the model beyond the sample by examining residuals to check for homoscedasticity, normality, independence and linearity.
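Part of step (3) can be illustrated with two simple residual diagnostics: standardised residuals to spot unusual cases and the Durbin-Watson statistic as an indicator of independent errors. This is a minimal sketch under stated assumptions: the function names and residual values are hypothetical, and the standardisation is a plain z-score rather than the exact definition used by any particular statistics package.

```python
from math import sqrt


def standardise(residuals):
    """Convert residuals to z-scores using the sample standard deviation."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = sqrt(sum((e - mean) ** 2 for e in residuals) / (n - 1))
    if sd == 0:
        raise ValueError("residuals must not all be identical")
    return [(e - mean) / sd for e in residuals]


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest uncorrelated adjacent errors."""
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    return numerator / denominator


# Hypothetical residuals from a fitted model (NOT results from the thesis).
example_residuals = [0.4, -0.3, 0.1, -0.6, 0.5, -0.2, 0.3, -0.2]

z_scores = standardise(example_residuals)
# Flagging cases beyond three standard deviations is an assumed convention.
unusual_cases = [i for i, z in enumerate(z_scores) if abs(z) > 3]
print("unusual cases:", unusual_cases)
print("Durbin-Watson:", round(durbin_watson(example_residuals), 2))
```

The Durbin-Watson statistic ranges from 0 to 4; homoscedasticity, normality and linearity are usually judged from residual plots, which are not shown here.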

In document An Analysis of the Critical Drivers of Firm Performance for Profitable Grid Owner Companies in Germany (Page 167-171)