CHAPTER 4 – DATA ANALYSIS AND RESULTS
4.6. Missing Data and Multiple Imputation
4.6.1. Missing data
With the reliability and validity of the JSS (Spector, 1997) established, this section discusses missing data and data imputation. A summary of the variables with missing data is presented in Table 23. As pointed out above, all 36 items of the Job Satisfaction Survey (Spector, 1997) were answered by every respondent, so these items contain no missing data. In contrast, all of the remaining variables contain some missing data, with ‘Age’ having the largest share (20 of N = 202 cases missing, 9.9%; n = 182 valid).
Table 23
Variables with Missing Data (Main Study)
Variable               Valid Count   Missing Count   Missing Percent
Age                    182           20              9.9
Sex                    194           8               4.0
Education background   192           10              5.0
Work type              192           10              5.0
‘Age’ appears as the only variable that contains more than 5% missing data, so it is the only indicator variable created. As reported in Table 24, the mean values of the JSS (Spector, 1997) items when age is missing differ little from those when age is present (by no more than 0.56 points on any item). This suggests that the data are either missing at random or missing completely at random.
Table 24
Separate Variance t Tests
Indicator variable: Age
JSS item   t      df     # Present   # Missing   Mean (Present)   Mean (Missing)
PAY1       -.8    25.5   182         20          3.62             3.80
PAY2R      .5     25.8   182         20          3.75             3.60
PAY3R      1.0    31.0   182         20          3.20             3.00
PAY4       -.4    28.5   182         20          3.36             3.45
PRO1R      -.8    24.9   182         20          2.97             3.20
PRO2       .4     30.3   182         20          4.04             3.95
PRO3       .1     25.0   182         20          3.64             3.60
PRO4       .6     30.3   182         20          3.76             3.65
SUV1       1.3    23.7   182         20          4.60             4.25
SUV2R      .3     26.3   182         20          2.71             2.65
SUV3R      2.2    27.4   182         20          3.36             2.80
SUV4       -.1    24.9   182         20          3.88             3.90
FB1R       1.8    32.4   182         20          3.40             3.05
FB2        -1.9   30.4   182         20          3.41             3.75
FB3        1.1    31.1   182         20          3.51             3.30
FB4R       1.2    26.6   182         20          3.80             3.50
CRE1       -.8    28.1   182         20          3.79             3.95
CRE2R      .0     25.1   182         20          3.06             3.05
CRE3R      .1     26.7   182         20          3.72             3.70
CRE4R      1.6    30.3   182         20          3.37             3.05
OPC1R      -.7    28.6   182         20          3.38             3.55
OPC2       1.4    24.6   182         20          4.19             3.80
OPC3R      .5     26.0   182         20          4.52             4.40
OPC4R      .4     25.3   182         20          4.15             4.05
COW1       2.4    28.0   182         20          4.74             4.30
COW2R      1.7    27.1   182         20          3.03             2.65
COW3       1.9    25.8   182         20          4.62             4.25
COW4R      .8     27.2   182         20          2.70             2.50
NAT1R      .7     25.4   182         20          3.05             2.85
NAT2       .6     28.3   182         20          3.87             3.75
NAT3       1.4    33.1   182         20          4.09             3.85
NAT4       2.1    31.4   182         20          3.99             3.60
COM1       3.1    34.7   182         20          4.43             3.95
COM2R      1.8    29.2   182         20          2.85             2.45
COM3R      1.5    25.1   182         20          2.93             2.55
COM4R      1.8    29.1   182         20          2.95             2.55
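The separate-variance t tests in Table 24 compare each item's mean between respondents with and without a reported age, without assuming equal variances. The following is a minimal sketch of that calculation using only the Python standard library. The function name `welch_t` and the two score lists are hypothetical illustrations, not the study data, and the sketch does not claim to reproduce the exact SPSS procedure.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(present, missing):
    """Return (t, df) for a separate-variance (Welch) t test."""
    n1, n2 = len(present), len(missing)
    # Squared standard errors of the two group means
    se1, se2 = variance(present) / n1, variance(missing) / n2
    t = (mean(present) - mean(missing)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical item scores, split by whether age was reported
scores_age_present = [4, 5, 3, 4, 6, 2, 4, 5, 3, 4]
scores_age_missing = [4, 3, 5, 4, 2]

t, df = welch_t(scores_age_present, scores_age_missing)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Because the degrees of freedom are approximated from the two sample variances, they are generally not whole numbers, which is consistent with the fractional df values reported in Table 24.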
In the cross-tabulations, ‘Age’ again acts as the sole indicator variable because it is the only variable with more than 5% missing data. The variable AGE is reported by 91.9% of female respondents and 91.6% of male respondents (Table 25). This difference is negligible and does not imply any non-randomness in the missing data.
Table 25
Crosstabulation of Gender and Age
                          Total   1.00 Woman   2.00 Man   Missing
Age   Present   Count     182     91           87         4
                Percent   90.1    91.9         91.6       50.0
      Missing   Percent   9.9     8.1          8.4        50.0
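The percentages in Table 25 can be checked with a short calculation. The sketch below is illustrative only: the counts of respondents with AGE present come from Table 25, whereas the AGE-missing counts per group (8, 8, and 4) are not printed in the table and are inferred here from the reported percentages and the overall total of 20 missing cases. The names `age_present`, `age_missing`, and `present_rate` are hypothetical.

```python
# AGE-present counts by sex, as reported in Table 25
age_present = {"Woman": 91, "Man": 87, "Sex missing": 4}
# AGE-missing counts by sex: an assumption inferred from the reported
# percentages, not values printed in the table
age_missing = {"Woman": 8, "Man": 8, "Sex missing": 4}


def present_rate(group):
    """Percentage of respondents in a group who reported their age."""
    total = age_present[group] + age_missing[group]
    return 100 * age_present[group] / total


for group in age_present:
    print(f"{group}: {present_rate(group):.1f}% present")
```

Under these inferred counts, the rates round to the 91.9%, 91.6%, and 50.0% shown in the table, and the missing counts sum to the 20 cases reported in Table 23.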
Meanwhile, the chance of AGE being missing is greatest for respondents who hold Bachelor’s degree(s) (10.7% missing, i.e., 89.3% reporting), but the percentages for the other groups are broadly similar (Table 26). The only exceptions are the respondents who hold PhD degree(s) and those who hold PhD degree(s) and professional certificate(s). These groups have no missing AGE data; however, each contains only one respondent. Taken together, these results indicate that the missing data are likely due to chance.
Table 26
Cross-tabulation of Education Background and Age
                                                          Age
                                                          Present             Missing
Education background                                      Count    Percent    Percent
Total                                                     182      90.1       9.9
3. Bachelor’s Degree(s)                                   100      89.3       10.7
4. Bachelor’s Degree(s) and Professional Certificate(s)   30       90.9       9.1
5. Master’s Degree(s)                                     32       94.1       5.9
6. Master’s Degree(s) and Professional Certificate(s)     10       90.9       9.1
7. PhD Degree(s)                                          1        100.0      .0
In contrast, age is reported by 87.2% of respondents currently working as auditors, whereas ex-auditors reported it 94.0% of the time (Table 27). This difference might suggest a greater tendency among current auditors not to report their age. A discrepancy is also noted between geographic groups: no AGE data are missing among respondents working in Ho Chi Minh City or in other regions of Vietnam, whereas only 89.3% of respondents working in Hanoi reported their age (Table 28). These two findings suggest that the data might not be missing completely at random.
Table 27
Cross-tabulation of Work Type and Age
                          Work Type
                          Total   Auditors   Ex-Auditors   Missing
Age   Present   Count     182     95         78            9
                Percent   90.1    87.2       94.0          90.0
      Missing   Percent   9.9     12.8       6.0           10.0
Table 28
Cross-tabulation of Work City and Age
                          Work City
                          Total   Hanoi   Ho Chi Minh City   Other   Missing
AGE   Present   Count     182     158     12                 10      2
                Percent   90.1    89.3    100.0              100.0   66.7
Furthermore, tabulated missing patterns are examined to identify whether the data are jointly missing (IBM, 2012). The patterns of missing data that occur in more than 1% of the cases are presented in Table 29. Firstly, only one pattern of jointly missing data occurs in more than 1% of the cases: the two variables SEX and AGE are missing together in only four out of 202 cases (2%). Secondly, where information on gender is missing, the mean age is the highest among the missing patterns. Nonetheless, only three out of the 202 cases share this pattern. Given the low frequency of these two missing patterns, they do not seem to affect the randomness of the missing data.
Table 29
Tabulated Patterns
              Missing patterns(a)                                   Sex(d)        Education background(d)     Work type(d)           Work city(d)
Total cases   Sex   Edu   Work type   Age   Complete if ...(b)   Age(c)   Woman   Man   3    4    5    6    7    8    Auditor   Ex-Auditor   Hanoi   Ho Chi Minh City   Other
161                                         161                  27.40    82      79    93   29   29   9    0    1    86        75           141     11                 9
7                   X                       168                  30.43    2       5     0    0    0    0    0    0    4         3            6       1                  0
3             X                             164                  32.00    0       0     2    0    1    0    0    0    3         0            3       0                  0
4             X                       X     181                  -        0       0     3    1    0    0    0    0    3         1            4       0                  0
13                                    X     174                  -        7       6     8    2    2    1    0    0    10        3            13      0                  0
8                         X                 169                  28.75    5       3     4    1    1    1    1    0    0         0            7       0                  1
Note. Patterns with less than 1% cases (2 or fewer) are not displayed. a. Variables are sorted on missing patterns.
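The tabulation of missing patterns can be illustrated with a short sketch. It assumes each case is stored as a dictionary in which a missing value is recorded as `None`. The function `tabulate_patterns`, the variable names, and the four example records are hypothetical and are not taken from the study data or from SPSS.

```python
from collections import Counter


def tabulate_patterns(records, variables):
    """Count how often each combination of missing variables occurs."""
    patterns = Counter()
    for row in records:
        # A pattern is the tuple of variables that are missing in this case
        missing = tuple(v for v in variables if row.get(v) is None)
        patterns[missing] += 1
    return patterns


variables = ["sex", "education", "work_type", "age"]
# Hypothetical cases; None marks a missing value
records = [
    {"sex": 1, "education": 3, "work_type": 1, "age": 27},
    {"sex": 2, "education": 3, "work_type": 2, "age": None},
    {"sex": None, "education": 4, "work_type": 1, "age": None},
    {"sex": 1, "education": 5, "work_type": 1, "age": 31},
]

for pattern, count in tabulate_patterns(records, variables).most_common():
    print(pattern or "complete", count)
```

A pattern with two or more variable names, such as sex and age together, corresponds to jointly missing data in the sense used above.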
These patterns are visualized in a grid-form chart and a bar graph produced with SPSS (Figure 6 and Figure 7). The Missing Value Patterns chart displays the existing patterns of missing values across all the variables, and the bar graph presents the percentage frequency of each pattern. The most common pattern is the one with no missing data, which accounts for roughly 4 in 5 cases (161 of 202 cases, 79.7%), and there are 12 patterns of missing values. Pattern 11 involves the most variables: work city, educational background, and age. However, it is not shown in the frequency bar graph (Figure 7), which means that it occurs in less than 0.5% of the cases. This frequency is minimal, so the pattern is highly likely to be caused by chance. Furthermore, according to the IBM manual for SPSS Missing Values 21 (IBM, 2012, p. 48):
If the data are monotone, then all missing cells and non-missing cells in the chart will be contiguous; that is, there will be no “islands” of non-missing cells in the lower right portion of the chart and no “islands” of missing cells in the upper left portion of the chart.
As shown in Figure 6, there is no cluster of non-missing cells in the lower right portion of the chart and no cluster of missing cells in the upper left portion. Therefore, the data for this study are monotone, which further supports the conclusion that the data are missing completely at random (McKnight, McKnight, Sidani, & Figueredo, 2007).
Figure 7. Missing value pattern graph.
Considering the statistics and patterns of missing data, it appears that the data are missing completely at random. Little’s MCAR test is used to confirm this assumption. The null hypothesis for Little’s MCAR test is that the data are missing completely at random (IBM, 2012). The test returns χ² = 33.7, df = 36, p = .58, which is much larger than .05. Therefore, the null hypothesis is retained, and it can be concluded that the data are missing completely at random.
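The reported significance value can be checked from the test statistic and its degrees of freedom alone. The sketch below does not implement Little's MCAR test itself; it only evaluates the upper-tail probability of a chi-square distribution, using a closed-form sum that is valid for even degrees of freedom (36 in this case). The function name `chi2_sf_even_df` is a hypothetical helper, not an SPSS or library routine.

```python
from math import exp


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate; even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half_x = x / 2
    term, total = 1.0, 1.0  # i = 0 term of the sum
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1
    for i in range(1, df // 2):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


# Statistic and degrees of freedom as reported for Little's MCAR test
p = chi2_sf_even_df(33.7, 36)
print(f"p = {p:.2f}")
```

The result should come out close to the reported p = .58; small differences are possible because the reported statistic is rounded to one decimal place.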
When data are missing completely at random, the missing data have an ignorable impact on the statistical analyses (McKnight et al., 2007). Nonetheless, it is important to examine the overall summary of missing data to decide whether data imputation is needed. The percentages of missing data for variables, cases, and values are shown in the pie charts in Figure 8 (12.20%, 20.3%, and 0.62%, respectively). Five of the 41 variables have a missing value on at least one case, 41 out of 202 cases contain a missing value on at least one variable, and 51 out of 8,282 values (41 variables × 202 cases) are missing. Only 0.62% of the values are missing, but listwise deletion would remove as many as 20.3% of the cases. This reduction is substantial and might seriously affect the statistical analyses and thus lead to biased inference (Leech et al., 2014). Therefore, data imputation seems advisable for this study.
Figure 8. Overall summary of missing data.
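The three percentages summarized in Figure 8 follow directly from the counts given in the text. The short sketch below recomputes them; the variable names are hypothetical, and the counts are those reported above.

```python
# Counts reported in the text
n_variables, n_cases = 41, 202
vars_with_missing = 5     # variables with at least one missing value
cases_with_missing = 41   # cases with at least one missing value
values_missing = 51       # individual missing values

n_values = n_variables * n_cases  # total number of data values
pct_variables = 100 * vars_with_missing / n_variables
pct_cases = 100 * cases_with_missing / n_cases
pct_values = 100 * values_missing / n_values

print(f"values in data set: {n_values}")
print(f"variables: {pct_variables:.2f}%  cases: {pct_cases:.1f}%  values: {pct_values:.2f}%")
```

The contrast between the share of missing values and the share of incomplete cases is what makes listwise deletion costly here: each incomplete case is discarded in full even though it is missing only one or two values.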