Missing values and multiple imputation - Multilevel modelling analyses

Chapter 4 Methodology

4.6 Data analysis procedures

4.6.2 Quantitative data analyses on PISA data

4.6.2.2 Multilevel modelling analyses

4.6.2.2.2 Missing values and multiple imputation

As an international large-scale assessment, PISA routinely has missing values because students do not respond to, or do not reach, some questions. The Fangshan PISA 2012 China Trial database used for MLM also contains missing values. As the fifth column of Table 4.5 shows, the largest number of observed student cases is 614 rather than the 624 reported in Table 4.3 (see Section 4.5.2). This is because ten cases with missing values on ESCS were removed. ESCS is composed from three other indices, which reflect students’ background in terms of home possessions, parents’ occupations and parents’ educational levels (OECD, 2014a). When one of these three indices is missing, it is imputed by regression on the other two (OECD, 2014a). In the Fangshan PISA 2012 China Trial database, the ten students with missing ESCS values had missing values on two or even all three of the indices. Because there was too little information to predict their ESCS values, these ten cases were excluded from the analyses, and the student sample size of the database is therefore 614.
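The exclusion rule for ESCS described above can be sketched as a simple classification of cases. This is only an illustration in Python, not the procedure actually run for this study: the component names HISEI, PARED and HOMEPOS, the function name escs_status and the four student records are assumptions made up for the example.

```python
# Assumed names for the three ESCS component indices (illustration only).
COMPONENTS = ("HISEI", "PARED", "HOMEPOS")


def escs_status(record):
    """Classify a student record by how many ESCS components are missing."""
    n_missing = sum(record.get(name) is None for name in COMPONENTS)
    if n_missing == 0:
        return "complete"
    if n_missing == 1:
        # One index can be predicted by regression on the other two.
        return "impute_component"
    # Two or three indices missing: too little information to predict ESCS.
    return "exclude"


# Made-up records; None marks a missing value.
students = [
    {"id": 1, "HISEI": 45.0, "PARED": 12.0, "HOMEPOS": -0.3},
    {"id": 2, "HISEI": None, "PARED": 9.0, "HOMEPOS": -1.1},
    {"id": 3, "HISEI": None, "PARED": None, "HOMEPOS": 0.2},
    {"id": 4, "HISEI": None, "PARED": None, "HOMEPOS": None},
]

statuses = {s["id"]: escs_status(s) for s in students}
retained = [s["id"] for s in students if statuses[s["id"]] != "exclude"]
print(statuses)
print("retained cases:", retained)
```

In this toy example, students 3 and 4 correspond to the kind of case that was excluded from the Fangshan database, while student 2 corresponds to a case whose missing index could still be imputed.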

Table 4.5 also shows that there are missing values in each of the non-cognitive outcomes variables and processes variables. In PISA 2012, three forms of the student questionnaire were used to cover more mathematics-focused issues of policy interest, with common items linking the forms. Because of this rotation design, each student received only one of the three questionnaire forms. Hence, on each of the mathematics-focused items, the missing values produced by this design would theoretically account for around one third of the student sample. This is the case for the processes variables and non-cognitive outcomes variables that were constructed from these items. For the anchored variables ANCINTMAT and ANCINSTMOT, the percentage of missing values was even larger, as Table 4.5 shows, since fewer students were given items with anchoring vignettes. Table 4.6 below shows the patterns of missingness.

Table 4.6 Patterns of missingness

Variables                                         Percent (%)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1   <1
 1  1  1  1  1  1  1  1  1  0  1  1  1  1  0  0   34
 1  1  1  0  0  0  0  0  0  1  0  0  0  0  0  0   33
 0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1   31
 0  0  0  1  1  1  1  1  1  1  0  0  0  0  0  0   <1
 0  0  0  1  1  1  1  1  0  1  0  0  0  0  0  0   <1
 0  0  0  1  1  1  0  0  1  1  1  1  1  1  1  1   <1
 1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0   <1
Total                                             100%

Note: variables are (1) EXAPPLM (2) EXPUREM (3) FAMCONC (4) ANXMAT (5) DISCLIMA (6) TCHBEHFA (7) TCHBEHSO (8) TCHBEHTD (9) TEACHSUP (10) MATHEFF (11) ANCCLSMAN (12) ANCCOGACT (13) ANCMTSUP (14) ANCSCMAT (15) ANCINSTMOT (16) ANCINTMAT

In this table, “1” in the patterns represents complete data, and “0” represents missingness. Reading across the table, we see that almost all cases have missing values on at least three variables. For example, the second row shows that 34% of cases have missing values on MATHEFF, ANCINSTMOT and ANCINTMAT. Ignoring the cases with missing values would waste information and would prevent the analyses from involving all the variables of interest. In line with the missing-data mechanisms explained earlier, these missing values were assumed to be missing at random (MAR). I used multiple imputation (Rubin, 1987), which is suitable for this kind of missing data. Because the variables with missing values are continuous and the missing pattern is arbitrary, I imputed them using the Markov chain Monte Carlo (MCMC) approach (OECD, 2014a).
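A pattern table of the kind shown in Table 4.6 can be tabulated by recoding every value to 1 (observed) or 0 (missing) and counting the distinct rows. The following Python sketch illustrates the idea under stated assumptions: the function name missingness_patterns and the four toy records are hypothetical, only three of the sixteen variables are used, and the study itself relied on Stata rather than this code.

```python
from collections import Counter


def missingness_patterns(rows, variables):
    """Return (pattern, percent) pairs, most frequent pattern first.

    In each pattern, 1 means the value is observed and 0 means it is missing.
    """
    counts = Counter(
        tuple(0 if row.get(v) is None else 1 for v in variables)
        for row in rows
    )
    total = sum(counts.values())
    return [(pattern, 100.0 * n / total) for pattern, n in counts.most_common()]


# Made-up records; None marks a missing value.
variables = ["EXAPPLM", "MATHEFF", "ANXMAT"]
rows = [
    {"EXAPPLM": 0.2, "MATHEFF": None, "ANXMAT": 0.5},
    {"EXAPPLM": -0.1, "MATHEFF": None, "ANXMAT": 1.0},
    {"EXAPPLM": None, "MATHEFF": 0.7, "ANXMAT": -0.4},
    {"EXAPPLM": 0.9, "MATHEFF": 0.3, "ANXMAT": 0.0},
]

patterns = missingness_patterns(rows, variables)
for pattern, percent in patterns:
    print(" ".join(str(flag) for flag in pattern), f"{percent:.0f}%")
```

Each printed line corresponds to one row of a pattern table: the 0/1 flags in variable order, followed by the share of cases showing that pattern.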

It is suggested that the imputation model should include all the variables in the analysis model, including the dependent variable (i.e. mathematics PVs), together with additional variables that could predict the missing values (Acock, 2014). However, more variables are not always better when the variables used for imputation themselves have many missing values (Acock, 2014). In my research, bivariate correlation analysis was conducted on the observed dataset to identify potential predictors of the variables with missing values. As Table 4.7 shows below, most bivariate correlations amongst processes variables were significant, and the same holds amongst non-cognitive outcomes variables. For most processes variables, the correlations with non-cognitive outcomes variables were non-significant or weak. Although the processes variables ANCCOGACT, ANCMTSUP and ANCCLSMAN had medium correlations with most non-cognitive outcomes, it is notable that no students had responses both on the processes variables EXAPPLM, EXPUREM and FAMCONC and on the non-cognitive outcomes variables ANCINTMAT and ANCINSTMOT. Using processes variables to impute missing values for non-cognitive outcomes variables would therefore be problematic, and vice versa. Turning to reading performance, its correlations with most processes variables and non-cognitive outcomes variables were weak or non-significant, so it was not suitable for inclusion in the imputation. For these reasons, missing data were imputed separately for processes variables and for non-cognitive outcomes variables. Input variables, all the processes variables and mathematics PVs were used to impute missing values for processes variables. Input variables, all the non-cognitive outcomes variables and mathematics PVs were used to impute missing values for non-cognitive outcomes variables. Whether weights need to be included in imputation remains inconclusive (Goldstein, 2014; Kim et al., 2014). In addition, techniques for using weights when imputing multilevel data are still rarely developed (Rutkowski and Zhou, 2014). Hence, in my research, weights were not used in multiple imputation.
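The screening of potential predictors rests on correlations computed from the cases observed on both variables of a pair. A minimal Python sketch of such a pairwise-deletion correlation is given below, to illustrate why some pairs have no estimate at all when the questionnaire forms do not overlap. The function name pairwise_corr, the min_n threshold and all data values are assumptions for illustration; the correlations reported in this study were not produced with this code.

```python
import math


def pairwise_corr(x, y, min_n=3):
    """Pearson r using only cases observed on both variables.

    Returns None when fewer than min_n cases are observed on both,
    or when either variable has no variance among those cases.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < min_n:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Made-up data: the first three students received one questionnaire form,
# the last three another, and all six answered the common items.
exapplm = [0.1, 0.5, -0.2, None, None, None]
ancintmat = [None, None, None, 0.3, -0.4, 0.8]
tchbehtd = [0.2, 0.6, -0.1, 0.4, -0.3, 0.9]

no_overlap = pairwise_corr(exapplm, ancintmat)  # no student observed on both
common = pairwise_corr(exapplm, tchbehtd)       # three students observed on both
print(no_overlap, common)
```

A pair such as exapplm and ancintmat in this toy example has no jointly observed cases, which is the situation that makes one set of variables unusable as predictors for the other in an imputation model.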

Table 4.7 Bivariate correlation matrix of variables

                      1        2        3        4        5        6        7        8        9       10       11       12       13       14       15       16       17
 1 pv1math        1.000
 2 centred_ESCS   0.069    1.000
 3 centred_AGE   -0.029   -0.028    1.000
 4 male          -0.015    0.052    0.031    1.000
 5 highersec      0.407*   0.039    0.114*  -0.148*   1.000
 6 SCH_ESCS       0.395*   0.336*  -0.065   -0.014    0.101*   1.000
 7 EXAPPLM       -0.151*   0.134*   0.024   -0.029   -0.072    0.008    1.000
 8 EXPUREM       -0.016    0.084    0.032    0.011   -0.032   -0.053    0.411*   1.000
 9 FAMCONC        0.048    0.050   -0.083   -0.002    0.071    0.026   -0.056    0.111*   1.000
10 TCHBEHFA      -0.202*   0.061    0.026    0.135*  -0.179*  -0.116*   0.312*   0.268*   0.022    1.000
11 TCHBEHSO      -0.329*   0.060    0.022    0.179*  -0.265*  -0.126*   0.301*   0.195*   0.050    0.724*   1.000
12 TCHBEHTD      -0.203*   0.058    0.057    0.006   -0.174*  -0.109*   0.208*   0.190*   0.099    0.644*   0.522*   1.000
13 TEACHSUP      -0.050    0.042    0.048   -0.121*  -0.038   -0.093    0.132    0.160*   0.045    0.448*   0.310*   0.619*   1.000
14 ANCCOGACT      0.097    0.033    0.065   -0.083    0.083   -0.011    0.092    0.081    0.039    0.206*   0.120*   0.283*   0.201*   1.000
15 DISCLIMA       0.053    0.037    0.119*  -0.127*  -0.016    0.020   -0.013    0.082   -0.023    0.162*   0.079    0.308*   0.329*   0.112*   1.000
16 ANCMTSUP       0.090   -0.084    0.072   -0.242*   0.141*  -0.085    0.086    0.114   -0.012    0.093   -0.048    0.155*   0.198*   0.565*   0.068    1.000
17 ANCCLSMAN      0.083   -0.063    0.049   -0.206*   0.073   -0.059   -0.019   -0.045   -0.051    0.065    0.002    0.170*   0.206*   0.653*   0.238*   0.638*   1.000
18 ANCINTMAT      0.180*  -0.100    0.146*  -0.055    0.105   -0.069       .        .        .     0.175*   0.043    0.228*   0.181*   0.456*   0.157*   0.527*   0.606*
19 ANCINSTMOT     0.113   -0.132    0.098   -0.173*   0.104   -0.057       .        .        .     0.096    0.012    0.173*   0.145*   0.564*   0.166*   0.677*   0.681*
20 MATHEFF        0.407*   0.123*  -0.031    0.149*   0.127*   0.198*   0.221*   0.013   -0.015    0.125   -0.014    0.137    0.178*   0.109    0.131    0.053    0.035
21 ANCSCMAT       0.289*  -0.065    0.005   -0.067    0.110*   0.032    0.042    0.049   -0.005    0.056   -0.044    0.114*   0.121*   0.578*   0.123*   0.460*   0.578*
22 ANXMAT        -0.223*   0.031    0.102*  -0.081   -0.001   -0.027   -0.031   -0.122    0.104   -0.049   -0.020   -0.073   -0.150*  -0.084   -0.228*  -0.051   -0.084
23 pv1read        0.800*   0.042   -0.005   -0.249*   0.429*   0.338*  -0.171*   0.012    0.086   -0.252*  -0.405*  -0.200*  -0.051    0.115*   0.063    0.136*   0.093

Table 4.7 (continued)

                     18       19       20       21       22       23
18 ANCINTMAT      1.000
19 ANCINSTMOT     0.763*   1.000
20 MATHEFF        0.300*   0.110    1.000
21 ANCSCMAT       0.740*   0.697*   0.268*   1.000
22 ANXMAT        -0.255*  -0.201*  -0.273*  -0.326*   1.000
23 pv1read        0.086    0.098    0.306*   0.181*  -0.067    1.000

Note: For illustration, pv1math and pv1read were used in the correlation calculation. Significance at the 95% confidence level is marked with "*". Column numbers refer to the numbered row variables.

In multiple imputation, several datasets rather than a single dataset are typically generated to reflect the uncertainty of imputation. Usually, five imputations are considered sufficient to obtain valid results (Rubin, 1987; van Buuren et al., 1999). In international large-scale assessments (ILSAs) such as PISA (before PISA 2015) and TIMSS, five imputations are commonly used. To align with the five sets of PVs of students’ performance, five imputed datasets were generated in my research. The MI procedure in Stata 13 (StataCorp, 2013) was used for multiple imputation. It is suggested that the imputation be evaluated by comparing the completed dataset, in which missing values are imputed, with the observed dataset (Acock, 2014). In my research, imputed values were generally consistent with observed values. To save space, imputation results are displayed below in Figure 4.2 for only one of the variables with missing values, ANXMAT (Mathematics Anxiety).
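The reason for generating several imputed datasets is that the variation of an estimate across them carries the uncertainty due to imputation. As a hedged illustration, the Python sketch below combines one estimate from each of five imputed datasets using Rubin's (1987) combining rules: the pooled estimate is the mean of the estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name pool_rubin and the coefficient and standard error values are hypothetical; they are not results from this study.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine m point estimates and their standard errors by Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficient of one predictor, estimated once per imputed dataset.
estimates = [0.40, 0.42, 0.38, 0.41, 0.39]
std_errors = [0.05, 0.05, 0.05, 0.05, 0.05]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates differ across imputations, which is how the extra uncertainty from the imputed values enters the final inference.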

Figure 4.2 Imputation results for ANXMAT

In document The Impact of PISA on Students' Learning: a Chinese Perspective (Page 103-107)