4.2. Methods of Data Analysis
4.2.3. Econometric Methods
This sub-section describes the empirical models used in the econometric analysis (in particular chapters 6 and 7). Given the research questions and the hypotheses of the study, only conditional likelihood procedures are relevant. Conditional likelihood procedures are statistical techniques that estimate the probability of observing a given event conditional on a given set of explanatory variables and parameters. There are several forms of conditional likelihood models, depending on the number of categories of the dependent variable and on whether those categories are ordered. Binary logit/probit models40 are used when the dependent variable is a binary response, i.e. it takes on two values: 0 and 1 (y = 0 if no, 1 if yes). These binary outcome models estimate the probability that y = 1 as a function of the independent variables. Multinomial logistic regression is the appropriate regression analysis when the dependent variable is nominal with more than two levels. It is thus an extension of binary logistic regression, which analyses dichotomous dependent variables. The ordinal logistic regression (OLR) model, also called the ordered logit model, is a statistical technique for an ordered dependent variable. Examples include: rating systems (poor, fair, good, excellent); opinion surveys (strongly agree, agree, neutral, disagree, strongly disagree); rankings (unimportant, moderately important, important, very important); frequencies (never, sometimes, often, always), etc. Among these conditional likelihood procedures, the only method that suits the ordered structure of the variables obtained from the questionnaire administered in this study is ordinal logistic regression (OLR). The reason for choosing the OLR model over other conditional likelihood estimations, such as the binary logit and probit models and the multinomial logit and probit models, is that all the variables of interest in this study are ordered outcomes. Where outcomes have more than two categories with a meaningful sequence, OLR models are the appropriate choice (Norusis, 2012; Katchova, 2013; Torres-Reyna, 2014).
40 The main difference between a logit model and a probit model is that for the logit model, the cumulative distribution function (CDF) is that of the logistic distribution, while for the probit model, the CDF is that of the standard normal distribution. In both models, the predicted probabilities are bounded between 0 and 1. Both models are estimated by maximum likelihood (ML). The choice between logit and probit models depends on the data generating process, which is unknown. Both models typically produce almost identical results (different coefficients but similar marginal effects).
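The comparison of the two link functions in footnote 40 can be illustrated numerically. The sketch below is only an illustration: the grid of index values is hypothetical, and the rescaling constant of about 1.7 applied to the logistic index is a commonly quoted approximation that is brought in here as an assumption, not something taken from the study.

```python
import math

def logistic_cdf(z: float) -> float:
    """Logistic CDF, the link used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z: float) -> float:
    """Standard normal CDF, the link used by the probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical grid of index values x'beta from -5 to 5 (illustration only).
grid = [i / 10.0 for i in range(-50, 51)]

# Both links map any real index into a probability strictly between 0 and 1.
all_in_unit_interval = all(
    0.0 < logistic_cdf(z) < 1.0 and 0.0 < normal_cdf(z) < 1.0 for z in grid
)

# Assumed rescaling constant: a logit index is roughly 1.7 times a probit index,
# which is why coefficients differ while fitted probabilities stay close.
SCALE = 1.702
max_gap = max(abs(logistic_cdf(SCALE * z) - normal_cdf(z)) for z in grid)

print("probabilities inside (0, 1):", all_in_unit_interval)
print("largest gap between rescaled logit and probit:", round(max_gap, 4))
```

Under these assumptions the two curves stay within about one percentage point of each other, which is consistent with the footnote's remark that the coefficients differ while the implied probabilities are similar.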
Fitting an OLR Model
An OLR model can be used when a dependent variable has more than two categories and the categories have a meaningful sequential order, so that each value is indeed ‘higher’ than the previous one. The categories of the dependent variable are rankings, so the numeric codes carry no cardinal meaning. For example, even if they are coded as 0, 1, 2, 3, 4, the difference between the first and second outcome may not be the same as that between the second and third (Katchova, 2013). Thus, in fitting an OLR model, the event of interest is observing a particular score or less (Norusis, 2012). Suppose we are rating the frequency of loan pricing decisions by Nigerian banks using the following scale: ‘never’ (1), ‘sometimes’ (2), ‘often’ (3) and ‘always’ (4). We can then model the following odds41:
$$\theta_1 = \frac{\text{prob(score 1)}}{\text{prob(score greater than 1)}} \tag{4.1}$$
$$\theta_2 = \frac{\text{prob(score of 1 or 2)}}{\text{prob(score greater than 2)}} \tag{4.2}$$
$$\theta_3 = \frac{\text{prob(score of 1, or 2, or 3)}}{\text{prob(score greater than 3)}} \tag{4.3}$$
Notice that for the last category (always - 4), we do not include an equation since the probability of scoring up to and including the last score is 1 (i.e. the only score greater than 3 is 4). This can be better understood by the concept of cumulative probability. An OLR model simultaneously estimates multiple equations. The number of equations it estimates will depend on the number of categories in the dependent variable minus one (Snedker et al., 2002). So since we have four categories for the dependent variable, three equations will be estimated. All of the odds are of the form:
$$\theta_j = \frac{\text{prob}(\text{score} \le j)}{\text{prob}(\text{score} > j)} \tag{4.4}$$
We can also write equation 4.4 as:
$$\theta_j = \frac{\text{prob}(\text{score} \le j)}{1 - \text{prob}(\text{score} \le j)}, \tag{4.5}$$
since the probability of a score greater than j is 1 minus the probability of a score less than or equal to j.
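As a worked illustration of equations 4.1 to 4.5, the cumulative odds can be computed directly from a set of category probabilities. The probabilities below are hypothetical values chosen for the four-point frequency scale; they are not estimates from the study.

```python
# Hypothetical category probabilities for the scale
# never (1), sometimes (2), often (3), always (4); illustrative values only.
probs = {1: 0.10, 2: 0.30, 3: 0.40, 4: 0.20}

def cumulative_odds(category_probs: dict) -> dict:
    """Return theta_j = P(score <= j) / (1 - P(score <= j)) for j = 1..J-1."""
    scores = sorted(category_probs)
    odds = {}
    cumulative = 0.0
    # The last category is skipped: P(score <= J) = 1, so no equation is needed.
    for j in scores[:-1]:
        cumulative += category_probs[j]
        odds[j] = cumulative / (1.0 - cumulative)
    return odds

theta = cumulative_odds(probs)
for j, value in theta.items():
    print(f"theta_{j} = {value:.4f}")
```

With four categories the function returns three odds, one per equation, matching the rule that the number of equations is the number of categories minus one.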
The ordered logistic regression (OLR) model has the form:
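$$\begin{aligned}
\operatorname{logit}(p_1) &\equiv \log\frac{p_1}{1-p_1} = \alpha_1 + \beta' x \\
\operatorname{logit}(p_1+p_2) &\equiv \log\frac{p_1+p_2}{1-p_1-p_2} = \alpha_2 + \beta' x \\
&\;\;\vdots \\
\operatorname{logit}(p_1+p_2+\dots+p_k) &\equiv \log\frac{p_1+p_2+\dots+p_k}{1-p_1-p_2-\dots-p_k} = \alpha_k + \beta' x
\end{aligned} \tag{4.6}$$
and $p_1+p_2+\dots+p_{k+1}=1$. Note that in this form $\beta' x$ enters with a positive sign, so a positive element of $\beta$ raises the cumulative probability of the lower categories. The latent-variable form in equations 4.7 and 4.8 uses the opposite sign convention ($\alpha_j - x'\beta$), under which a positive coefficient shifts probability towards the higher categories.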
This model is known as the proportional-odds model because the odds ratio of the event is independent of the category j; that is, the odds ratio is assumed to be constant for all categories (Snedker et al., 2002). We can define an index model for a single latent variable y*, which is unobservable (we only know when it crosses the thresholds):
$$y_i^* = x_i'\beta + \mu_i, \qquad y_i = j \ \text{ if } \ \alpha_{j-1} < y_i^* \le \alpha_j \tag{4.7}$$
The probability that observation i will select the alternative j is:
$$p_{ij} = p(y_i = j) = p(\alpha_{j-1} < y_i^* \le \alpha_j) = F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta) \tag{4.8}$$
For the ordered logit, F is the logistic cumulative distribution function (cdf), $F(z) = e^z/(1+e^z)$. The ordered logit model with j alternatives will have one set of coefficients with (j-1) intercepts. As noted earlier, the OLR model can be identified by multiple intercepts.
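To make equation 4.8 concrete, the sketch below computes the category probabilities for one observation. The thresholds and the value of the index x'β are hypothetical numbers chosen for illustration, and the lowest and highest thresholds are taken to be minus and plus infinity, which is the usual convention and is assumed here.

```python
import math

def logistic_cdf(z: float) -> float:
    """F(z) = e^z / (1 + e^z), the logistic cdf used by the ordered logit."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probabilities(xb: float, thresholds: list) -> list:
    """P(y = j) = F(alpha_j - xb) - F(alpha_{j-1} - xb) for j = 1..J.

    Assumes alpha_0 = -infinity and alpha_J = +infinity, so the cumulative
    probabilities start at 0 and end at 1.
    """
    cumulative = [logistic_cdf(a - xb) for a in thresholds] + [1.0]
    probabilities = []
    previous = 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities

# Hypothetical thresholds (J - 1 = 3 intercepts) and index value x'beta.
thresholds = [-1.0, 0.5, 2.0]
xb = 0.8

probs = category_probabilities(xb, thresholds)
for j, p in enumerate(probs, start=1):
    print(f"P(y = {j}) = {p:.4f}")
```

Three thresholds produce four category probabilities that sum to one, which is the sense in which a model with j alternatives has (j-1) intercepts.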
Interpretation of OLR Estimates
The sign of each parameter shows whether the latent variable y* increases with the regressor. Because the dependent variable has more than two categories, the OLR coefficients are interpreted slightly differently from the coefficients of a binary logistic regression, which involves only one transition. A positive coefficient indicates an increased likelihood that a subject with a higher score on the independent variable will be observed in a higher category. A negative coefficient indicates an increased likelihood that a subject with a higher score on the independent variable will be observed in a lower category (Snedker et al., 2002). So, for example, if we are interested in testing whether the risk premium on an SME loan depends on whether a firm has collateral or not, a positive coefficient will imply that a firm with no collateral to secure the loan is more likely to be charged a higher risk premium, while a negative coefficient will mean that there is a lower likelihood of a higher risk premium with available collateral.
It is worth noting that logit coefficients are in log-odds units and cannot be read as regular OLS coefficients, so their magnitudes cannot be interpreted directly. To interpret them, we need to estimate the predicted probability of each score category. OLR provides only one set of coefficients for each independent variable. Therefore, there is an assumption of parallel regression (Torres-Reyna, 2014). That is, the coefficients for the variables in the equations would not vary significantly if they were estimated separately. The intercepts would be different, but the slopes would be essentially the same. This means that the results are a set of parallel lines or planes, one for each category of the outcome variable (Norusis, 2012). A significant test statistic from a test of this assumption provides evidence that the parallel regression assumption has been violated.
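The parallel regression (proportional-odds) assumption can be illustrated with a small numerical sketch. It assumes a single regressor, the α_j − x'β form of equation 4.8, and hypothetical thresholds and slope; none of these values come from the study's estimates.

```python
import math

def logistic_cdf(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical thresholds and a single hypothetical slope coefficient.
thresholds = [-1.0, 0.5, 2.0]
beta = 0.7

def cumulative_odds(threshold: float, x: float) -> float:
    """Odds of a score at or below the category bounded by `threshold`."""
    p = logistic_cdf(threshold - beta * x)  # P(score <= j | x)
    return p / (1.0 - p)

# Odds ratio for a one-unit increase in x, computed separately at each threshold.
odds_ratios = [
    cumulative_odds(a, 1.0) / cumulative_odds(a, 0.0) for a in thresholds
]

for j, ratio in enumerate(odds_ratios, start=1):
    print(f"threshold {j}: odds ratio = {ratio:.4f}")
```

Every threshold gives the same odds ratio, exp(−β) in this parameterisation, so the cumulative logits are parallel lines that differ only in their intercepts. A test of the assumption asks whether slopes estimated separately for each threshold depart from this common value.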
OLR Marginal Effects
In addition to the OLR coefficient estimates, we can also estimate the ordered logit marginal effects. The ordered logit model with j alternatives will have j sets of marginal effects (Katchova, 2013). The marginal effect of an increase in a regressor $x_r$ on the probability of selecting alternative j is:
$$\frac{\partial p_{ij}}{\partial x_{ri}} = \left\{ F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta) \right\} \beta_r \tag{4.9}$$
The marginal effects of each variable on the different alternatives sum to zero, since the probabilities of the alternatives must always sum to one. To interpret the marginal effects, we say that each unit increase in the independent variable increases/decreases the probability of selecting alternative j by the marginal effect expressed as a percentage.
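Equation 4.9 can be evaluated directly once thresholds and coefficients are available. The sketch below uses hypothetical thresholds, a hypothetical index value x'β and a hypothetical coefficient β_r, and it assumes that the logistic density vanishes at the outermost thresholds (minus and plus infinity).

```python
import math

def logistic_cdf(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z: float) -> float:
    """F'(z) = F(z) * (1 - F(z)) for the logistic distribution."""
    f = logistic_cdf(z)
    return f * (1.0 - f)

def marginal_effects(xb: float, thresholds: list, beta_r: float) -> list:
    """Marginal effect of regressor r on P(y = j) for j = 1..J (equation 4.9)."""
    # Densities at alpha_0 .. alpha_J; the two infinite end points contribute 0.
    densities = [0.0] + [logistic_pdf(a - xb) for a in thresholds] + [0.0]
    return [
        (densities[j] - densities[j + 1]) * beta_r
        for j in range(len(thresholds) + 1)
    ]

# Hypothetical inputs for illustration only.
thresholds = [-1.0, 0.5, 2.0]
xb = 0.8
beta_r = 0.7

effects = marginal_effects(xb, thresholds, beta_r)
for j, effect in enumerate(effects, start=1):
    print(f"dP(y = {j})/dx_r = {effect:+.4f}")
print("sum of marginal effects:", round(sum(effects), 12))
```

With a positive coefficient the effect on the lowest category is negative and the effect on the highest category is positive, while the effects across all categories cancel out.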
Alternative Multivariate Statistical Methods
Apart from conditional likelihood estimations, there are alternative multivariate statistical procedures that can be used to analyze survey data (e.g. Principal Components Analysis and Factor Analysis), but these were not chosen for the study. Principal Components Analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The main advantage of PCA is that it is a dimensionality reduction or data compression method. The main disadvantage is that there is no guarantee that the reduced dimensions are interpretable. Factor Analysis (FA) is a similar statistical procedure that identifies interrelationships that exist among a large number of variables. However, it is mostly suited for exploratory or confirmatory studies. As an exploratory procedure, factor analysis is used to search for a possible underlying structure in the variables. In confirmatory research, the researcher evaluates how similar the actual structure of the data, as indicated by factor analysis, is to the expected structure. The main reason for choosing conditional likelihood procedures (in particular, the OLR) over other statistical methods is that they help in studying the relationship between two or more variables or independent samples without altering the underlying structure of the dataset, as PCA or factor analysis does. Moreover, as noted earlier, given the nature of the research questions, only conditional likelihood procedures would help provide statistically viable answers.
Stepwise Regression Analysis
The stepwise regression procedure was employed as part of the robustness checks to test the quality of the predictors used in the OLR regression models in chapters 6 and 7. Stepwise regression is a multiple regression procedure that is used to determine the best combination of independent (predictor) variables for predicting the dependent (predicted) variable. Hauser (1974) describes stepwise regression as essentially a search procedure to identify which independent variables, previously thought to be of some importance, actually appear to have the strongest relationship with the dependent variable. In stepwise regression, predictor variables are entered into the regression equation one at a time based upon statistical criteria. At each step in the analysis, the predictor variable that contributes the most to the prediction equation in terms of increasing the multiple correlation, R, is entered. This process continues only as long as additional variables improve the predictive power of the model or add anything statistically meaningful to the regression equation. When no additional predictor variables add anything statistically meaningful to the regression equation, the analysis stops. Thus, not all independent (predictor) variables may enter the equation in stepwise regression.
Stepwise analysis is an approach to selecting a subset of variables and evaluating the order of importance of variables in a regression model. It can be useful in the following situations: (a) there is little theory to guide the selection of terms for a model; (b) the researcher wants to explore which predictors seem to provide a good fit; or (c) the researcher wants to improve a model’s prediction performance by reducing the variance caused by estimating unnecessary terms. However, a number of problems have been identified with the application of stepwise analysis. According to Thompson (1995) and Lewis (2007), there are three problems with using stepwise procedures. First, computer packages use incorrect degrees of freedom in their stepwise computations, resulting in an artifactually greater likelihood of obtaining spurious statistical significance. Second, stepwise methods do not correctly identify the best predictor variable set of a given size. This problem is further compounded by the presence of multicollinearity, where predictors are correlated with each other. High intercorrelations result in high standard errors for regression coefficients and the consequent exclusion of variables from regression equations (Hauser, 1974). Thus, where independent variables are correlated, relevant variables may be discarded purely on the grounds of collinearity, with the resultant possibility of specification bias. Third, stepwise methods tend to capitalize on sampling error, and thus tend to yield results that are not replicable.
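The forward-entry logic described in this sub-section can be sketched in a few lines. This is a simplified illustration and not the procedure used in chapters 6 and 7: it fits ordinary least squares rather than an OLR model, it uses a plain R-squared gain threshold (the hypothetical parameter `min_gain`) in place of a formal entry test, and the data are made-up values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of an OLS fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_stepwise(predictors, y, min_gain=0.01):
    """Enter one predictor at a time while the best R^2 gain is at least min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[s] for s in selected + [name]], y)
            for name in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_r2 < min_gain:
            break  # no remaining predictor adds enough; the analysis stops
        selected.append(candidate)
        remaining.remove(candidate)
        best_r2 = scores[candidate]
    return selected, best_r2

# Hypothetical data: y depends on x1 and x2, while x3 is unrelated to y given them.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
y = [5.1, 3.9, 9.05, 7.95, 13.1, 11.9, 17.05, 15.95]

selected, final_r2 = forward_stepwise(predictors, y)
print("entered in order:", selected)
print("final R^2:", round(final_r2, 4))
```

In this made-up example the third predictor never enters the equation, which mirrors the point that not all predictors need appear in the final model. The sketch also shows why the criticisms above matter: the entry order depends entirely on sample correlations, so correlated predictors or a different sample could change which variables are retained.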