Chapter 4: Results and discussion
4.2 Stage 1: Preparation of data
Preparing data for analysis requires cleaning to ensure that the data are suitable for statistical analysis. Part of the cleaning process involves removing missing responses and identifying outliers to eliminate their potential impact on the analysis (Hair et al. 2010).
4.2.1 Dealing with missing data
The cleaning of data commences by identifying the types of missing data in the sample. The key distinction is whether the missing data are ignorable or non-ignorable (Hair et al. 2010). This distinction is based on the relationships between the missing data and the observed values.
As a result of the cleaning process, the final sample comprised 337 observations. Details of the removed non-ignorable data are presented in Table 25 below.
Table 25: Cleaning of sample data

Description                                                          No.
Original sample size prior to cleaning data                          446
Number of removed cases with non-ignorable missing data in:
  Employment                                                           7
  Residency                                                           11
  WIL                                                                  2
  Self-efficacy (items 1-12, all missing)                             24
  Self-efficacy item 1                                                 2
  Self-efficacy item 2                                                 1
  Self-efficacy item 3                                                 3
  Self-efficacy item 4                                                 4
  Self-efficacy item 5                                                 2
  Self-efficacy item 10                                                1
  Self-efficacy item 11                                                1
  Self-efficacy item 12                                                4
  Self-efficacy - all 12 items answered with identical response
    (e.g. all agree or all disagree, thus ignoring the meaning
    of statements)                                                    47
Total number of cases removed                                      (109)
Sample size after cleaning                                           337
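The totals in Table 25 can be checked with a short tally. The sketch below simply transcribes the removal counts from the table; the dictionary keys and variable names are labels chosen here for illustration.

```python
# Removal counts transcribed from Table 25 (reason -> number of cases removed).
removed = {
    "Employment": 7,
    "Residency": 11,
    "WIL": 2,
    "Self-efficacy items 1-12, all missing": 24,
    "Self-efficacy item 1": 2,
    "Self-efficacy item 2": 1,
    "Self-efficacy item 3": 3,
    "Self-efficacy item 4": 4,
    "Self-efficacy item 5": 2,
    "Self-efficacy item 10": 1,
    "Self-efficacy item 11": 1,
    "Self-efficacy item 12": 4,
    "Self-efficacy, identical response to all 12 items": 47,
}

original_n = 446                       # sample size prior to cleaning
total_removed = sum(removed.values())  # sum of all removal counts
cleaned_n = original_n - total_removed

print(f"Total removed: {total_removed}")  # 109
print(f"Cleaned sample: {cleaned_n}")     # 337
```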
The initial sample shown in Table 25 (n=446) included 81 students who had secured employment and 45 students who had completed WIL. After cleaning, the sample (n=337) retained 55 students who had secured employment and 21 students who had completed WIL.
Data were collected from two Melbourne-based universities: students of Swinburne University of Technology represented 32.6 percent of the total sample and students of VU represented 67.4 percent. To test for differences between the two subsets in the sample, the study employed chi-square tests (see Appendix D). The results indicated that the subsets were not significantly different. The full sample was therefore deemed to be homogeneous, and the study’s full data set included the survey results of both universities.
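The general form of such a comparison can be sketched as follows. This is a minimal illustration only, assuming a 2x2 layout (university by a binary characteristic) and a Pearson chi-square test without continuity correction; the study's actual tests and tables are those reported in Appendix D. The function name `chi_square_2x2` is hypothetical, and the counts are invented placeholders, not the study's data.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square upper-tail probability
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# HYPOTHETICAL counts for illustration only (not taken from the study):
# rows = two universities, columns = characteristic present / absent.
placeholder_table = [[20, 80], [50, 150]]
statistic, p_value = chi_square_2x2(placeholder_table)
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

A p-value above the chosen significance level (commonly 0.05) would be read as no significant difference between the two subsets on that characteristic.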
The preparation of sample data also involved the testing of the assumptions for the use of factor analysis and logistic regression.
4.2.2 Testing assumptions for data analysis
The study employed logistic regression analysis to address RQ2 and RQ3 and factor analysis and association tests to address RQ1. The following assumptions needed to be met for the data analyses in this study.
4.2.2.1 Sample size
Logistic regression analysis requires a relatively large sample size to estimate all the parameters in the model accurately. According to Hair et al. (2010), the minimum ratio of valid cases to independent variables in a logistic regression should be 10 to 1. This ratio is commonly expressed as events per variable (EPV): one predictor variable can be studied for every ten events without risk of overfitting the logistic regression model (Harrell et al. 1984). EPV is the number of positive outcomes (events) per independent variable included in the logistic regression model.
The employment logistic regression model includes 55 cases with a positive outcome (secured employment) and 9 independent variables, giving an EPV ratio for the employment model of approximately 6:1. Although the overall sample size for this study was adequate (n=337), the EPV ratio for the employment model was below 10:1, which raised a potential issue. Consequently, in addition to the logistic regression model, the study employed Lasso and R-glmulti techniques to ensure the robustness of the employment model. The same issue applied to the WIL model, which included three independent variables and 21 positive outcomes (participation in WIL). The EPV ratio in the WIL model was 7:1 and thus also had the potential to cause an issue of data sparsity. A logistic regression model was therefore also run for WIL, together with Lasso and R-glmulti techniques, to ensure the robustness of the WIL model. According to Santner and Duffy (1989), King and Zheng (2001), and Cox and Snell (1989), inclusion of imbalanced data in the logistic regression model is acceptable where the variable is of a categorical nature.
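The EPV ratios reported above follow directly from the event and predictor counts. The short sketch below reproduces the arithmetic; the function name `events_per_variable` and the `THRESHOLD` constant are labels introduced here for illustration, with the 10:1 threshold taken from the guideline cited above.

```python
THRESHOLD = 10  # minimum recommended events per variable (10:1 guideline)

def events_per_variable(events, predictors):
    """Number of positive outcomes (events) per independent variable."""
    return events / predictors

# (events, independent variables) for each model, as reported in the text.
models = {"employment": (55, 9), "WIL": (21, 3)}

for name, (events, predictors) in models.items():
    epv = events_per_variable(events, predictors)
    status = "meets" if epv >= THRESHOLD else "is below"
    print(f"{name}: EPV = {epv:.1f}, which {status} the 10:1 guideline")
```

Both models fall below the guideline (about 6.1 and 7.0 events per variable respectively), which is why additional robustness checks were applied.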
With respect to factor analysis, typical guidelines recommend a sample size of n>200 as a fair sample (Comrey & Lee 1992), with a minimum of five cases per variable and an ideal of more than 20 cases per variable. The GSES includes 12 items, which would require at least 60 cases for a minimum sample size (5 cases per item) and 240 cases for an ideal sample size (20 cases per item). The study’s sample of 337 cases was therefore classified as ‘good’ based on the above guidelines.
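The case-per-item thresholds above can be verified with a brief calculation. This sketch uses only the figures stated in the text (12 GSES items, 5 and 20 cases per item, n=337); the variable names are illustrative.

```python
n_items = 12    # number of GSES items
sample_n = 337  # cleaned sample size

minimum_n = 5 * n_items   # minimum guideline: 5 cases per item
ideal_n = 20 * n_items    # ideal guideline: 20 cases per item
cases_per_item = sample_n / n_items

print(f"Minimum sample: {minimum_n}")  # 60
print(f"Ideal sample: {ideal_n}")      # 240
print(f"Cases per item in this study: {cases_per_item:.1f}")
```

The cleaned sample exceeds both thresholds, at roughly 28 cases per item.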
4.2.2.2 Diagnostic tests
A number of diagnostic tests were performed. The results showed that the study data did not exhibit multicollinearity; these results are discussed later and shown in Tables 39 and 44.
Since factor analysis is based on correlations between the variables, linear relationships among the items of the GSES needed to be tested. Correlation analysis of the 12-item GSES, using Pearson correlation, is discussed in Section 4.3.
In addition, tests for normality, outliers, and heteroscedasticity were performed. The results, shown in Appendix C, did not reveal any issues with the data. Thus, the results of testing the associated assumptions indicated that the present research’s sample data were appropriate for conducting further statistical analyses.