3.3 Data Analysis
3.3.2 Multiple regression analyses
Multiple regression is used when the researcher wants to predict the value of a variable based on the values of two or more other variables. In the present body of research, the researcher was interested in determining the relationship between a range of continuous and categorical independent variables and the incidence of prison offending. Because the dependent variable was continuous and the independent variables were both categorical and continuous, multiple regression was chosen as the primary means of determining which independent variables were statistically significant in predicting the incidence of offending in prison.
The general purpose of multiple regression is to assess the significance of the relationships between the dependent variable and the independent variables that are being modelled (Hosmer & Lemeshow, 1989). Multiple regression is based on correlation, but allows for a refined exploration of the interrelationship between variables (Field, 2013). Multiple regression can be used to explain how well a set of variables can predict a particular outcome; for example, and in reference to the current study, the incidence of offending in prison. Multiple regression can also be used to statistically control for an additional variable or variables when exploring the predictive ability of several other independent variables (Field, 2013). In regard to the analyses pertaining to the incidence of prison offending, all variables which provide little predictive ability will be removed until the most parsimonious model is found, in order to determine which variables are statistically significant in predicting the rate of offending in prison. It is noted that the best fitting model in regard to the number of independent variables to be entered into a multiple regression analysis is one where all independent variables are included (McDonald, 2009). However, when the purpose of a multiple regression analysis is to predict a relationship between independent variables and one dependent variable, it is useful to determine which independent variables are important and which are unimportant to the relationship.
Methods of multiple regression.
Based on available literature, three methods of multiple regression were reviewed before a decision was made as to which was most suitable for determining the most parsimonious model. Those methods were standard or simultaneous regression, hierarchical regression and stepwise regression. Each is explained in more detail in the following sections.
Standard or simultaneous regression.
The most commonly used of all methods of multiple regression, standard or simultaneous regression, also known as the enter method, involves entering all possible predictor variables into the model simultaneously. Once all variables have been entered, each is assessed in terms of its predictive power over and above that offered by the other variables (Field, 2013). This method is useful if the researcher has a set of variables and wants to determine how much variance in the dependent variable they explain (Field, 2013).
Hierarchical regression.
Hierarchical regression involves entering independent variables into the model in the order specified by the researcher, based on theory and past research. Variables, or groups of variables, are entered in steps, or blocks. Following this, each independent variable, or block of variables, is assessed as to what it adds to the prediction of the dependent variable, after the previously entered variables have been controlled for. Once all variables or blocks of variables have been entered, the overall model is assessed in terms of its ability to predict the dependent variable (Field, 2013).
Stepwise regression.
One approach to simplifying multiple regression equations is the use of stepwise procedures (Dallal, 2001). These include forward selection, backward elimination, and bi-directional elimination (Makridakis, Wheelwright & Hyndman, 1998). Each of these stepwise methods involves the addition or removal of variables, one at a time. Forward selection starts with an empty model. The variable that has the smallest 𝑝-value when it is the only predictor in the regression equation is placed in the model. Each subsequent step adds the variable that has the smallest 𝑝-value in the presence of the predictors already in the equation. Variables are added as long as their 𝑝-values are small enough; typically less than 0.05 or 0.10 (Dallal, 2001).
Backward elimination starts with all of the predictors in the model. The variable that is the least significant is removed and the model is refitted. Each subsequent step removes the least significant variable in the model until all remaining variables have individual 𝑝-values smaller than a pre-determined level of significance, such as 0.05 or 0.10. Bi-directional elimination, or stepwise selection, is a combination of forward selection and backward elimination, where the model is tested at each step for variables to be included or excluded (Dallal, 2001). Backward elimination has an advantage over forward selection and bi-directional elimination because it is possible for a set of variables to have considerable predictive capability even though any subset of them does not, in which case forward selection and bi-directional elimination will fail to identify them (Dallal, 2001).
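The two one-directional procedures described above can be sketched as follows. This is a minimal illustration rather than the procedure run in the study: the function names, the variable names x1 to x3 and the 𝑝-values returned by toy_p_values are all hypothetical, and the stand-in function simply looks up invented 𝑝-values where a statistics package would refit the regression. The invented values are chosen so that x1 and x2 are significant only when both are in the model, which is the situation Dallal (2001) describes.

```python
from typing import Callable, Dict, List

# A "fit" is represented by any function that takes the variables in the model
# and returns a p-value for each of them (hypothetical interface).
PValueFn = Callable[[List[str]], Dict[str, float]]


def forward_selection(candidates: List[str], p_values: PValueFn,
                      alpha: float = 0.05) -> List[str]:
    """Start empty; repeatedly add the candidate with the smallest p-value."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining:
        # p-value of each candidate in the presence of those already selected
        trial = {v: p_values(selected + [v])[v] for v in remaining}
        best = min(trial, key=trial.get)
        if trial[best] >= alpha:
            break  # no remaining candidate is significant enough to add
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_elimination(candidates: List[str], p_values: PValueFn,
                         alpha: float = 0.05) -> List[str]:
    """Start full; repeatedly drop the least significant variable and refit."""
    model = list(candidates)
    while model:
        fitted = p_values(model)
        worst = max(fitted, key=fitted.get)
        if fitted[worst] < alpha:
            break  # every remaining variable is significant
        model.remove(worst)
    return model


def toy_p_values(model: List[str]) -> Dict[str, float]:
    """Invented p-values: x1 and x2 are significant only when both are present."""
    both = "x1" in model and "x2" in model
    table = {"x1": 0.01 if both else 0.40,
             "x2": 0.02 if both else 0.35,
             "x3": 0.60}
    return {v: table[v] for v in model}


forward_model = forward_selection(["x1", "x2", "x3"], toy_p_values)
backward_model = backward_elimination(["x1", "x2", "x3"], toy_p_values)
print(forward_model)   # []
print(backward_model)  # ['x1', 'x2']
```

With these invented values forward selection never admits a variable, because no single predictor is significant on its own, whereas backward elimination retains the jointly predictive pair.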
Selection of a multiple regression method.
As discussed in the previous section, standard multiple regression has the ability to determine how much unique variance in the dependent variable each of the independent variables explains. Although hierarchical regression has the ability to determine an overall model’s predictive capability in regard to the dependent variable, this method of multiple regression is most often used where the researcher wishes to assess whether adding particular variables improves a model, rather than to determine which variables from a group of variables have the most predictive capability. Stepwise regression, by contrast, has been criticised because it involves the computer selecting variables based on mathematical criteria, rather than the researcher making important methodological decisions in regard to the independent variables (Field, 2013). For this reason, standard multiple regression was used to determine which variables had the most predictive capability, followed by a manual backward elimination process in which the researcher, rather than the software, identified the least significant variable for removal and then refitted the model.
A three-step process was developed in order to find the most parsimonious model in regard to the incidence of prison offending. The first step involves a careful univariate analysis of each independent variable, in order to minimise the number of variables prior to the commencement of the multiple regression analyses in the second step. The minimisation of variables included in the initial multiple regression analyses was undertaken for three reasons. Firstly, consistency with the previously discussed binary logistic regression analyses was deemed appropriate. Secondly, the small sample size of the female prisoner cohort dictates the requirement for fewer variables to ensure a parsimonious final model. Lastly, a similar approach to multiple regression analyses has been employed by other researchers (e.g. Rajakaruna, Henry & Scott, 2015). In accordance with the univariate analyses undertaken in regard to the binary logistic regression analyses, independent variables for inclusion in the first multivariable model are identified as any variable whose univariate test has a 𝑝-value less than 0.25. The use of a more traditional level of significance such as 0.05 often fails to identify variables known to be important from a clinical point of view, due to their potential to interact in a model with other variables present (Pallant, 2013).
The second step involves fitting the model containing all covariates identified for inclusion at the first step. Variables that do not contribute at the traditional level of statistical significance of 0.05 are eliminated and a new model is fitted. The third step involves fitting the final model with only the significant variables remaining.
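The three-step process can be sketched as below. This is an illustrative outline only: the study's analyses were run in SPSS with the researcher choosing each removal, and every name here (screen_univariate, manual_backward, var_a to var_d) and every 𝑝-value is invented. For brevity the stand-in fit_p_values returns the same 𝑝-values whatever else is in the model, whereas a real refit would change them at each step.

```python
from typing import Callable, Dict, List, Tuple


def screen_univariate(univariate_p: Dict[str, float],
                      threshold: float = 0.25) -> List[str]:
    """Step 1: keep variables whose univariate p-value is below the threshold."""
    return [name for name, p in univariate_p.items() if p < threshold]


def manual_backward(variables: List[str],
                    fit_p_values: Callable[[List[str]], Dict[str, float]],
                    alpha: float = 0.05) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Steps 2 and 3: refit, dropping the least significant variable each time.

    Returns the final model and a log of (variable, p-value) removals.
    """
    model = list(variables)
    removed: List[Tuple[str, float]] = []
    while model:
        p = fit_p_values(model)
        worst = max(p, key=p.get)
        if p[worst] < alpha:
            break  # all remaining variables are significant: final model
        removed.append((worst, p[worst]))
        model.remove(worst)
    return model, removed


# Invented univariate p-values for four hypothetical variables.
univariate_p = {"var_a": 0.01, "var_b": 0.20, "var_c": 0.30, "var_d": 0.04}


def fit_p_values(model: List[str]) -> Dict[str, float]:
    """Stand-in for refitting the regression (invented, fixed p-values)."""
    multivariable_p = {"var_a": 0.003, "var_b": 0.12, "var_d": 0.03}
    return {name: multivariable_p[name] for name in model}


candidates = screen_univariate(univariate_p)
final_model, removed = manual_backward(candidates, fit_p_values)
print(candidates)   # ['var_a', 'var_b', 'var_d']
print(final_model)  # ['var_a', 'var_d']
print(removed)      # [('var_b', 0.12)]
```

Here var_c fails the 0.25 screen at the first step, var_b passes the screen but is removed at the 0.05 level in the second step, and the third step leaves a final model containing var_a and var_d.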
Assumption testing.
This section discusses the key assumptions of multiple regression and the tests conducted to ensure that robust analyses were performed, and explains the processes employed to address assumption testing at each step of the data analysis process.
Sample size.
As discussed in the previous section in regard to binary logistic regression analyses, sample size can affect the reliability of results of statistical tests, and this is particularly the case in regard to regression analyses (Field, 2013). Generally, the rule of thumb is that 10 to 15 cases are sufficient for each independent variable included in the model (Field, 2013). However, it is further suggested that this rule of thumb may be too simple, and the number of variables should be determined by the
regression is dependent on the number of independent variables (k) and the sample size (N) (Cohen, 1988) where R = k/(N – 1). Cohen (1988) has standardised effect sizes into small, medium and large values depending on the type of analyses employed. In terms of regression analyses, the effect size indices for small, medium and large effects are .02, .15 and .35 respectively. It has been noted that a medium effect size is desirable as it would be able to approximate the average size of observed effects in a range of fields (Cohen, 1988).
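As a worked illustration, and assuming that R = k/(N – 1) denotes the multiple correlation that would be expected by chance when the predictors have no real relationship with the outcome, the formula shows why small samples are a concern. The values of k and N below are illustrative only and are not taken from the study. The second part assumes that the effect size index referred to is Cohen's f², which is related to R² as shown.

```latex
\begin{align*}
R &= \frac{k}{N - 1} \\[4pt]
k = 6,\ N = 21:  \quad R &= \frac{6}{20} = 0.30 \\[4pt]
k = 6,\ N = 100: \quad R &= \frac{6}{99} \approx 0.06 \\[8pt]
f^2 &= \frac{R^2}{1 - R^2}
\quad\Longleftrightarrow\quad
R^2 = \frac{f^2}{1 + f^2} \\[4pt]
f^2 = .02,\ .15,\ .35 \quad &\Longrightarrow \quad R^2 \approx .02,\ .13,\ .26
\end{align*}
```

With six predictors, a sample of 21 cases would be expected to produce a multiple correlation of about .30 by chance alone, whereas a sample of 100 cases reduces this to about .06.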
Outliers and influential cases.
Statistical tests can be quite sensitive to, and influenced by, outliers (Pallant, 2013). One or two values that are far from the mean can alter the results considerably. The following tests were used in SPSS to determine the presence of outliers and cases which have the potential to affect the regression model. Firstly, leverage values were checked in the output from SPSS after running the initial multiple regression. Leverage gauges the influence of the observed value of the dependent variable over the predicted values (Field, 2013). The average leverage is defined as (k + 1)/n, where k is the number of independent variables and n is the number of cases. To further ensure influential cases were not present, the Cook’s distance value of each case was assessed. Cook’s distance is a commonly used estimate of the influence of one case on the model as a whole (Field, 2013). It is suggested that Cook’s distance values over 1 may denote an influential case in the model.
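The two diagnostics can be illustrated with a small sketch. It is restricted to a single predictor (k = 1) so that it runs without a statistics package, which is an assumption made for brevity; the study's models contained several predictors and the values were taken from SPSS output. The function name and the ten data points are invented, with the last case deliberately placed far from the others.

```python
from typing import List, Tuple


def leverage_and_cooks(x: List[float], y: List[float]) -> Tuple[List[float], List[float]]:
    """Leverage and Cook's distance for a one-predictor least-squares fit."""
    n = len(x)
    k = 1          # number of independent variables
    p = k + 1      # number of estimated parameters (slope and intercept)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)   # residual mean square
    # Leverage: diagonal of the hat matrix for simple regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Cook's distance: squared residual scaled by leverage
    cooks = [(e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
             for e, h in zip(residuals, leverage)]
    return leverage, cooks


# Invented data: nine cases on a line plus one extreme case (index 9).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

leverage, cooks = leverage_and_cooks(x, y)
average_leverage = (1 + 1) / len(x)            # (k + 1)/n with k = 1
flagged = [i for i, d in enumerate(cooks) if d > 1]

print(round(sum(leverage) / len(x), 3))        # 0.2, equal to (k + 1)/n
print(flagged)                                 # [9]
```

The mean of the leverage values equals (k + 1)/n, and only the extreme case has a Cook's distance above 1.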
Multicollinearity and singularity.
To ensure that no independent variable was strongly related to another independent variable, collinearity diagnostics were requested of SPSS. The SPSS output provides the Collinearity Statistics ‘Tolerance’ and VIF. Tolerance is an indicator of how much of the variability of each independent variable is not explained by the other independent variables (Pallant, 2013). A tolerance value of less than .10 indicates that an independent variable has a high correlation with the other independent variables (Pallant, 2013). VIF, or the variance inflation factor, is the inverse of the Tolerance value (1 divided by Tolerance), where a value over 10 indicates multicollinearity (Pallant, 2013).
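The relationship between the two statistics can be written out as follows, where R_j² denotes the proportion of variance in independent variable j that is explained when it is regressed on the other independent variables. The value R_j² = 0.95 is an invented example, not a result from the study.

```latex
\begin{align*}
\text{Tolerance}_j &= 1 - R_j^2 \\[4pt]
\text{VIF}_j &= \frac{1}{\text{Tolerance}_j} = \frac{1}{1 - R_j^2} \\[8pt]
R_j^2 = 0.95: \quad \text{Tolerance}_j &= 0.05, \qquad \text{VIF}_j = \frac{1}{0.05} = 20
\end{align*}
```

Because VIF is the reciprocal of Tolerance, a tolerance of .10 corresponds exactly to a VIF of 10, so the two statistics flag the same cases.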
Normality, linearity and homoscedasticity.
Normality, linearity and homoscedasticity refer to aspects of the distribution of scores and the nature of the underlying relationship between the variables. These assumptions are checked by assessing the residuals scatterplot which is produced by SPSS following a multiple regression analysis. Residuals are the differences between the obtained and the predicted dependent variable scores (Pallant, 2013; Tabachnick & Fidell, 2013).
The following three chapters seek to determine what prisoner and prison characteristics are related to the prevalence and incidence of prison offending, and what prisoner, prison and situational characteristics are related to the type of prison offending, in Western Australian male, female, Aboriginal and non-Aboriginal prisoner samples. Further details of the analyses are provided in appendices.
CHAPTER FOUR: PREVALENCE, INCIDENCE AND TYPE OF PRISON OFFENDING IN THE MALE PRISONER SAMPLE