Chapter 4: Data and samples
4.8 Missing data strategy
Survey datasets, particularly longitudinal survey datasets, usually suffer from the problem of missing data. Data on variables may be missing because participants did not respond to particular items in questionnaires, for example if the question was sensitive or poorly worded, and they chose not to give an answer. Data may also be missing because a participant left the survey altogether, and did not wish to
participate in later surveys (or could not be traced for follow-up). The former type of missing data may lead to biased estimates if the characteristics of those who did answer the question differed from those who did not, and if these characteristics were also related to the outcomes in question. The latter form of missing data, or attrition, may compromise the representativeness of the survey. Whilst each of the
surveys aimed initially to create a sample that was similar to the overall population of each country, it is likely that some groups would be more prone to attrition over time. People who become homeless, for example, would be harder to contact over multiple waves, and thus some of the most disadvantaged individuals would no longer be represented. There may also be missing or incorrect responses through human error, if participants or interviewers skipped over a question by mistake or input an incorrect response.
There are three main types of missing data: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). Data that are MCAR cause the fewest problems for analysis, as the missingness is uncorrelated with any variables of interest in the study. This may occur if the missingness is driven purely by chance. If data were MCAR, analysis using listwise deletion (complete case analysis) or pairwise deletion would be justified, as the only implication for results would be a loss of sample size and thus power. If this were not the case, however, these approaches would bias results, and it would need to be stated that the sample is no longer representative. Whilst missing data arising through error may well be random, it would be too strong an assumption to assert that all instances of missing data were completely random.
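The difference between these mechanisms can be illustrated with a small simulation. The following is a minimal sketch in Python and is not part of the thesis analysis: the variables `x` and `y`, the missingness probabilities and the function name `simulate_complete_case_means` are all hypothetical, chosen only for illustration. It compares the complete case mean of an outcome when values are deleted completely at random with the complete case mean when deletion depends on an observed background variable.

```python
import random
import statistics


def simulate_complete_case_means(n=20000, seed=1):
    """Return (full mean, complete case mean under MCAR,
    complete case mean when missingness depends on x)."""
    rng = random.Random(seed)
    # Hypothetical data: x is a fully observed background variable and
    # y is an outcome that is positively related to x.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]

    # MCAR: every case has the same 30% chance of being missing.
    mcar_cases = [yi for yi in y if rng.random() > 0.3]

    # Missingness depends on x: cases with low x are missing 60% of the
    # time, cases with high x only 10% of the time (illustrative values).
    dependent_cases = [
        yi for xi, yi in zip(x, y)
        if rng.random() > (0.6 if xi < 0 else 0.1)
    ]

    return (
        statistics.fmean(y),
        statistics.fmean(mcar_cases),
        statistics.fmean(dependent_cases),
    )


full, mcar, dependent = simulate_complete_case_means()
print(f"full sample mean:              {full:.3f}")
print(f"complete cases, MCAR:          {mcar:.3f}")
print(f"complete cases, depends on x:  {dependent:.3f}")
```

Under these assumptions the MCAR complete case mean should stay close to the full sample mean, whereas the mean under x-dependent missingness is pulled upwards, because cases with low x (and therefore low y) are under-represented among the complete cases.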
In this thesis I have used missing data strategies that assume data are MAR; that is, that missing responses or attrition can be fully explained by observable variables included in the dataset. I therefore also assume that there are no unobserved variables that explain some of the missingness, for example underlying motivation, and that the variables included in the model measure each construct reliably. This assumption is made more plausible by the rich set of indicators included in each dataset, and it is less restrictive than assuming data are MCAR. Data may also be MNAR if the missingness is directly related to the construct being measured. This may occur, for example, if father’s education level is missing because the participant does not have a father. Where possible, this is accounted for when coding the data by creating an additional level. For other variables this is not possible, for example if people who do not like science are more likely to skip questions about their attitudes to science.
Given that I assume data are MAR, I use two main missing data strategies throughout this thesis. The first, to account for cross-sectional and longitudinal non-response or attrition, is weighting. This strategy assigns greater weight in analysis to individuals whose characteristics are associated with attrition. Individual
responses are given less weight if people with their characteristics are more likely either to remain in the study or to be sampled in the first place. In Next Steps, certain ethnic minority groups are over-sampled, to allow researchers greater power in analyses involving these groups. In this case, weighting assigns less weight to these individuals. In Next Steps and the NLSY79, weights are provided for researchers, and these are used in all analyses. In BCS70, however, weights were not provided. I therefore constructed weights using logistic regression methods, predicting the probability of being in the most recent wave (2012) based on baseline characteristics. The characteristics chosen were informed by Mostafa & Wiggins (2014), and included sex, birth weight, parity, mother’s age, whether the mother lived in the southeast of England at the first survey, social class at birth, and mother’s and father’s age at completion of education.
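The construction of attrition weights from a logistic regression can be sketched as follows. This is a minimal illustration in Python rather than the procedure actually used for BCS70: it assumes that each weight is the inverse of the predicted probability of being retained (the text above does not state the exact formula), it uses a single hypothetical binary baseline characteristic (`disadvantaged`) in place of the full set of predictors, and the retention rates and the helper names `predict` and `fit_logistic` are invented for the example.

```python
import math
import random


def predict(beta, row):
    """Predicted probability of remaining in the study for one case."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=300, lr=1.0):
    """Fit a logistic regression by plain gradient ascent.

    Returns the coefficients with the intercept first. This is a teaching
    sketch; a real analysis would use a statistics package.
    """
    beta = [0.0] * (len(X[0]) + 1)
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(X, y):
            err = yi - predict(beta, row)
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


rng = random.Random(2)
n = 2000
# Hypothetical baseline characteristic: 1 = disadvantaged at birth.
disadvantaged = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
# Assumed attrition pattern: disadvantaged cases are retained less often.
retained = [
    1 if rng.random() < (0.4 if d else 0.8) else 0 for d in disadvantaged
]

X = [[d] for d in disadvantaged]
beta = fit_logistic(X, retained)

# Attrition weight for each retained case: the inverse of the predicted
# probability of being retained (an assumption of this sketch).
weights = [1.0 / predict(beta, row) for row, r in zip(X, retained) if r]
kept = [d for d, r in zip(disadvantaged, retained) if r]

baseline_share = sum(disadvantaged) / n
unweighted_share = sum(kept) / len(kept)
weighted_share = sum(w * d for w, d in zip(weights, kept)) / sum(weights)
print(f"share disadvantaged at baseline:     {baseline_share:.3f}")
print(f"share among retained, unweighted:    {unweighted_share:.3f}")
print(f"share among retained, weighted:      {weighted_share:.3f}")
```

In this toy setting the retained sample under-represents the disadvantaged group, and applying the weights should bring its share back close to the baseline share, which is the sense in which weighting assigns greater weight to individuals whose characteristics are associated with attrition.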
Whilst this helps to account for attrition over time and non-response to the entire survey, it does not account for missing data on particular items within the survey. To account for this, I constructed multiple imputation (MI) models through chained equations for each of the studies. This method is considered particularly suitable when data are MAR (Allison, 2001), and is considered to lead to less bias in results than complete case analysis or other imputation methods. Simple imputation methods underestimate standard errors, overestimate ‘t’ statistics and can therefore return significant effects where there are none. This is because imputing a single value (for example the mean) for a large proportion of cases makes the variance in scores appear artificially small. Had these cases responded, even if data were MCAR, there would likely have been random variation between respondents. Single regression imputation methods would also return biased standard error and test statistic estimates. When using regression methods, the imputed data become a direct function of the outcome and other predictors in the model. This then artificially inflates the relationship between the imputed variable and the outcome. MI deliberately introduces this random variation by creating many datasets based on the regression equation entered, and takes the mean value from these estimates. Variation across the different imputed datasets is then used to calculate standard errors and test statistics that are larger and smaller respectively, to reflect
expected natural random variation in responses. The standard errors are thus a function of the variance between cases within each dataset and the variance between datasets. In all MI models, I created 20 datasets, as guidance suggests using a large number of imputations (Graham, Olchowski, & Gilreath, 2007). Analysis was conducted using the MI ICE command in Stata (Royston, 2004). Variables included in each imputation model are shown in Appendix A.
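The pooling step described above is commonly carried out with the combining rules known as Rubin's rules. The text does not name these rules, so treating them as the method applied here is an assumption; the function name `pool_rubin` and the example numbers are hypothetical. The sketch below, in Python, takes one estimate and its squared standard error from each imputed dataset and returns the pooled estimate and pooled standard error.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and sampling variances (Rubin's rules).

    estimates: one point estimate per imputed dataset
    variances: the squared standard error of each estimate
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations of equal length")
    pooled = statistics.fmean(estimates)        # mean of the estimates
    within = statistics.fmean(variances)        # average within-dataset variance
    between = statistics.variance(estimates)    # between-dataset variance (m - 1)
    total = within + (1 + 1 / m) * between      # total variance
    return pooled, math.sqrt(total)


# Hypothetical coefficients and standard errors from five imputed datasets.
estimates = [0.42, 0.47, 0.39, 0.45, 0.44]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]
pooled, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

Because the between-dataset variance is added to the average within-dataset variance, the pooled standard error is never smaller than the one obtained by averaging the within-dataset variances alone, which is how MI avoids the understated standard errors of single imputation.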