Chapter 6 General methods and preliminary analyses
6.1 Data management and statistical analysis
Obtaining data
I obtained the B-CAMHS datasets from colleagues at ONS and the Institute of Psychiatry, London. I did not receive personal identifiers such as the child’s name or postcode; instead, when I needed this information for name-matching techniques or for assigning area deprivation, I visited the ONS offices and performed the matches on site. As described in Chapter 9, I also generated additional variables regarding the schools in B-CAMHS04 in collaboration with the Office for Standards in Education (OFSTED).
Statistical software
I conducted most data analysis using Stata 9.2. The exception was my use of MPlus5 for factor analyses, as indicated in the relevant methods sections.
Adjusting for survey design
As described in Section 5.2.1, Chapter 5, the B-CAMHS surveys sampled children from 901 clusters (postal sectors) in 439 strata. These were sampled without replacement from a total of 8265 postal sectors. The original B-CAMHS team calculated probability weights to account for:
1. The oversampling of the Highlands and Islands of Scotland in B-CAMHS99 and the under-sampling of Wales in B-CAMHS04
2. Differential non-response rates by age, sex and region (for details see [2-3])
The original B-CAMHS team additionally calculated three-year follow-up weights, adjusting for the oversampling in B-CAMHS99 of children with disorders at baseline (see Section 5.2.3, Chapter 5).
As discussed more fully in Appendix 1 Section 13.3, adjusting for complex survey design is important when conducting analyses. Failure to weight the data may lead to biased point estimates of means, proportions or effect sizes. It may also bias estimates of variance and standard errors, usually underestimating these in unweighted data. Not adjusting for clustered design is likewise expected to bias (and usually underestimate) estimates of variance. In both cases, this will generate misleadingly narrow confidence intervals, misleadingly large test statistics and misleadingly small p-values. By contrast, failure to adjust for stratification may overestimate the variance, although this effect is often comparatively small.
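The effect of ignoring probability weights can be illustrated with a minimal Python sketch. The data, weights and function names below are hypothetical and are not taken from B-CAMHS; the thesis analyses themselves used the survey commands in Stata and MPlus. The sketch compares an unweighted and a weighted mean when one region is oversampled, and computes Kish's approximate design effect due to unequal weighting.

```python
# Toy data (hypothetical, not B-CAMHS): regions A and B each contribute
# four children, although A holds 90% of the population and B only 10%.
values = [9, 10, 10, 11, 19, 20, 20, 21]
# Probability weights proportional to (population share / sample size).
weights = [9, 9, 9, 9, 1, 1, 1, 1]


def unweighted_mean(y):
    """Simple mean that ignores the sampling design."""
    return sum(y) / len(y)


def weighted_mean(y, w):
    """Mean in which each observation counts in proportion to its weight."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def kish_design_effect(w):
    """Kish's approximation to the variance inflation from unequal weights."""
    n = len(w)
    return n * sum(wi ** 2 for wi in w) / sum(w) ** 2


naive = unweighted_mean(values)
adjusted = weighted_mean(values, weights)
deff = kish_design_effect(weights)
print(naive, adjusted, deff)
```

In this toy example the unweighted mean is pulled towards the oversampled region B, while the weighted mean recovers the population-level value. The design effect above one indicates that the weighted estimate is less precise than a simple random sample of the same size would be.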
Both Stata and MPlus have specialised commands for accommodating complex survey design, including stratification, clustering and probability weights. Both estimate parameters using pseudo-maximum likelihood methods and calculate robust standard errors [476-477].
Throughout this PhD, I use these in-built options to adjust for the complex B-CAMHS survey design whenever calculating proportions and means; when fitting regression models (including those using multiple imputation); when conducting exploratory and confirmatory factor analyses; and when calculating Pearson’s correlation coefficients. This includes analyses using the three-year follow-up data, which use the follow-up weights. The use of pseudo-maximum likelihood methods means, however, that I cannot adjust for survey design while performing likelihood ratio tests. I therefore compare models without adjusting for survey design, and then present the better model with adjustment, as follows:
1. Calculate likelihood ratio of nested and general models – not adjusted for survey design.
2. Use likelihood ratio to select model – not adjusted for survey design.
3. Present results from the chosen model – adjusted for survey design.
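The first two steps above can be sketched in Python as follows. This is an illustration only: the log-likelihood values are hypothetical, the function names are my own, and the actual tests were run in Stata. The sketch computes the likelihood ratio statistic from the unadjusted log-likelihoods of the nested and general models and refers it to a chi-squared distribution whose degrees of freedom equal the number of extra parameters.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values for df = 1 (odd) or df = 2 (even).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_nested, loglik_general, extra_params):
    """Return the likelihood ratio statistic and its p-value."""
    stat = 2.0 * (loglik_general - loglik_nested)
    return stat, chi2_sf(stat, extra_params)


# Hypothetical log-likelihoods from models fitted WITHOUT survey adjustment.
stat, p_value = likelihood_ratio_test(-1204.6, -1201.1, extra_params=2)
prefer_general = p_value < 0.05  # illustrative cut-off only
print(round(stat, 2), round(p_value, 4), prefer_general)
```

Whichever model is preferred on this basis would then be refitted with the survey design adjustments before its results are reported (step 3).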
Furthermore, I do not adjust for survey design when calculating Spearman’s coefficients as neither program allows this. In fact the effect of adjusting for survey design was modest in B-CAMHS (see Appendix 1, Table 13.7), meaning that these occasional failures to adjust for survey design are unlikely to affect my substantive findings.
Checking assumptions in regression models
Regression models feature in this and all subsequent data analysis chapters, with linear and logistic regression being the most common types. Throughout this thesis, I check the assumptions underlying these models as outlined below. Section 13.2, Appendix 1 provides a more detailed discussion of regression techniques in general, and (in Section 13.2.1) of their underlying assumptions in particular.
Linear and logistic models
Assess linearity
(All models) Plot the outcome (or logit(outcome) for logistic regression) against all continuous or ordered categorical explanatory variables to check for approximate linearity in univariable analysis.
(All models) Plot the residuals against the expected values to inspect whether these show random scatter around zero.
(Ordered categorical variables) Likelihood ratio tests to compare linear vs. categorical entry of variables.
(Continuous variables) Enter quadratic and cubic terms and use the Wald test statistic to determine their significance; or band and enter as categorical.
Normality of the errors
(All models) Histograms and normal plots of standardized residuals.
Constant variance of the errors
(All models) Plot the residuals against the explanatory variable; check that there is no tendency for the scatter to increase or decrease at higher values.
Identify influential data points
(Linear regression models) Sensitivity analyses excluding observations with a Cook’s distance of over 4/n.
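The 4/n rule for Cook’s distance can be illustrated with a minimal Python sketch for a linear regression with a single explanatory variable. The data are hypothetical and the function name is my own; in practice these values come from the regression software's post-estimation commands.

```python
def cooks_distances(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2)
    return distances


# Hypothetical data: the final observation lies far above the linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1, 18.0, 35.0]

distances = cooks_distances(x, y)
threshold = 4 / len(x)
flagged = [i for i, d in enumerate(distances) if d > threshold]
print(flagged)  # zero-based positions of observations exceeding 4/n
```

A sensitivity analysis would then refit the model without the flagged observations and compare the substantive conclusions.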
Dealing with violation of assumptions
Where the relationship between the explanatory and outcome variable was not linear, I entered the variable as an ordered categorical variable or with a quadratic/cubic term.
If the residuals of regression models were skewed rather than normally distributed I repeated the analyses after taking zero-skew logs (see Appendix 1, p.379). I also used these approaches if the variance of the errors was not constant. Both in repeating analyses after taking zero-skew logs and in sensitivity analyses excluding highly influential points, I only report the results of these analyses if there was any substantive difference to the model’s findings.
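As an illustration, the following Python sketch assumes that a zero-skew log is a shifted log transform, ln(x - k), with k chosen so that the transformed values have zero skewness (the approach taken by Stata's lnskew0 command). The sketch handles only right-skewed data, finds k by simple bisection, and uses hypothetical values and function names throughout.

```python
import math


def skewness(values):
    """Moment-based coefficient of skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def zero_skew_log(values, iterations=200):
    """Find k such that ln(v - k) has zero skewness; return k and the new values."""
    smallest = min(values)
    spread = max(values) - smallest

    def skew_at(k):
        return skewness([math.log(v - k) for v in values])

    k_low = smallest - 1000.0 * spread   # almost linear: skewness stays positive
    k_high = smallest - 1e-9 * spread    # minimum dragged far left: negative skew
    if not (skew_at(k_low) > 0 > skew_at(k_high)):
        raise ValueError("sketch assumes right-skewed data")
    for _ in range(iterations):
        mid = (k_low + k_high) / 2.0
        if skew_at(mid) > 0:
            k_low = mid
        else:
            k_high = mid
    k = (k_low + k_high) / 2.0
    return k, [math.log(v - k) for v in values]


# Hypothetical right-skewed values.
data = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15, 40]
k, transformed = zero_skew_log(data)
print(skewness(data), skewness(transformed))
```

The transformed values can then be used in place of the original variable when repeating the regression.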
Proportional odds models
Ordered logistic regression requires the proportional odds assumption: that is, that the true population odds ratio for being in category ≥k vs. category <k is the same for all values of k. When using ordered logistic regression, I used likelihood ratio tests to compare the fit of a non-proportional odds model with that of a partial proportional odds model, in which the odds ratios of a given explanatory variable of interest were constrained to be identical. Variables not of substantive interest (e.g. potential confounders such as age or sex) were allowed to have non-proportional odds.
If there was no evidence of a violation of the proportional odds assumption (with evidence defined as p<0.01 in the likelihood ratio test), I selected the partial proportional odds model; otherwise I selected the fully non-proportional odds model. I used a 1% significance cut-off to reduce spurious findings when fitting multiple models. Having selected the appropriate model, I then reported the results of that model with adjustment for complex survey design.
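The proportional odds assumption described above can be made concrete with a small Python sketch. The counts are hypothetical and the function name is my own. For a binary explanatory variable and a three-category ordered outcome, the sketch computes the odds ratio for being in category ≥k vs. category <k at each cut-point k; under proportional odds these should be approximately equal.

```python
def cumulative_odds_ratios(exposed_counts, unexposed_counts):
    """Odds ratio for outcome category >= k vs. < k at every cut-point k."""
    ratios = []
    for k in range(1, len(exposed_counts)):
        exposed_odds = sum(exposed_counts[k:]) / sum(exposed_counts[:k])
        unexposed_odds = sum(unexposed_counts[k:]) / sum(unexposed_counts[:k])
        ratios.append(exposed_odds / unexposed_odds)
    return ratios


# Hypothetical counts across three ordered outcome categories.
unexposed = [50, 30, 20]
exposed_proportional = [100, 100, 100]     # same odds ratio at both cut-points
exposed_non_proportional = [100, 150, 50]  # odds ratio changes with k

ors_ok = cumulative_odds_ratios(exposed_proportional, unexposed)
ors_violated = cumulative_odds_ratios(exposed_non_proportional, unexposed)
print(ors_ok, ors_violated)
```

In the first comparison the odds ratio is the same at both cut-points, so a single constrained odds ratio summarises the association. In the second it differs between cut-points, which is the pattern the likelihood ratio test is designed to detect.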