Chapter 2 – Materials and Methods
2.3 Statistical methods
2.3.1 Distributions of treatment outcome measures
The distribution of a given set of data determines which statistical tools are appropriate for its analysis. Broadly speaking, the outcomes analysed in this thesis can be categorised as either continuous or binary. In all instances hypothesis testing required the estimation of an effect size in order to gauge the magnitude of a genetic variant’s effect on the outcome in question. To this end, regression models appropriate to the distribution of the data were used in preference to non‐parametric tests, which are largely agnostic to the distribution of the data but do not yield effect size estimates beyond correlation statistics (e.g. Spearman’s ρ). Most variables were approximately normally distributed, that is, they did not deviate markedly from the expected normal distribution. The fit of a variable to the normal distribution was gauged using histograms and normal probability plots (Figure 2.1).
Some dependent variables, such as the baseline CRP measurements shown in Figure 2.1, displayed a non‐normal distribution. In these cases it is permissible either to use a non‐parametric test that does not rely on the assumption that the data are normally distributed, or to transform the data so that they more closely follow a normal distribution. For CRP measurements the data follow an apparently logarithmic curve, so a log transformation yields data that more closely fit a normal (Gaussian) distribution.
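As a minimal sketch of the effect of a log transformation, the following Python snippet simulates right‐skewed values and compares the sample skewness before and after taking natural logarithms. The simulated log‐normal sample, the seed and the parameter values are illustrative assumptions and are not study data. Values of zero would need separate handling, since the logarithm of zero is undefined.

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness (third standardised moment); close to 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(1)  # arbitrary seed so the sketch is reproducible
# Hypothetical right-skewed measurements (simulated, not real CRP data).
raw = [random.lognormvariate(3.0, 1.0) for _ in range(2000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after transformation is consistent with, but does not prove, approximate normality; the graphical checks described above remain the primary assessment.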
Figure 2.1 Histograms and normal distribution probability plots of clinical variables. The top panel (A) shows an example of a normally distributed variable, baseline DAS28‐CRP. The green line shows the expected normal distribution. Panel B shows a non‐normally distributed variable, baseline CRP, demonstrated by its deviation from the expected normal probability distribution. Logarithmic
transformation of non‐normally distributed data can be used to alleviate this as shown in panel C. The exception here is the presence of zero‐inflation in the baseline CRP data that arises because of the way CRP is measured in NHS laboratories (i.e. values <5 are treated as 0).
Consideration must be given to how the distribution of values is generated. For instance a count variable may be treated as arising by a particular process such as a Poisson process.
A Poisson process describes the probability distribution of the number of independent events occurring over a given time interval, i.e. a counting process. Thus it may be used to model data that can be thought of as count variables. A pertinent example would be the swollen joint count (SJC) used to calculate the DAS. The underlying theory behind this is that the number of swollen joints is a function of the severity and/or activity of the disease, so the more severe the disease the more joints are affected, though damage at one joint may be independent of damage at another. Alternatively, to bring in a time interval explanation, one would expect the joint count (if we consider each affected joint as a single event) to change as the disease develops or is managed by medical intervention, i.e. the patient is given treatment for the disease. The Poisson distribution is defined:
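\[ \Pr(Y_i = y_i \mid \mu_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!} \]   (Equation 1)
where:
\[ \lambda_i = \log(\mu_i) = \alpha + \beta x_i \]
α = intercept
β = parameter estimate of the explanatory variable
x = explanatory variable
Y = number of events
y = any value ≥ 0 (dependent variable)
i = ith case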
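The Poisson probability in Equation 1 can be evaluated directly. The sketch below assumes the usual log link, log(μ) = α + βx; the function name and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(y, alpha, beta, x):
    """Pr(Y = y) for a Poisson model with log link: log(mu) = alpha + beta * x."""
    mu = math.exp(alpha + beta * x)
    return math.exp(-mu) * mu ** y / math.factorial(y)

# Hypothetical coefficients: the expected count is exp(0.5 + 0.2 * x).
alpha, beta, x = 0.5, 0.2, 1.0
probs = [poisson_pmf(y, alpha, beta, x) for y in range(50)]

total = sum(probs)                                 # approximately 1
mean = sum(y * p for y, p in enumerate(probs))     # approximately exp(alpha + beta * x)
print(total, mean)
```

Because the linear predictor sits on the log scale, a one‐unit increase in x multiplies the expected count by exp(β).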
This distribution requires the conditional mean and variance to be equal; violation of this assumption, where the variance is greater than the mean, is called overdispersion. The negative binomial distribution is a generalisation of the Poisson distribution that accommodates overdispersion and can therefore be used to model such count data. In the example above, the SJC28 fits a negative binomial distribution better than a Poisson distribution because of the overdispersion present (Figure 2.2).
Thus negative binomial regression can be used to estimate the effects of a given predictor variable on the dependent variable without resorting to data transformation.
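Overdispersion can be screened for informally by comparing the sample variance with the sample mean. The sketch below does this for a small set of invented counts and also evaluates the negative binomial probability under one common parameterisation, assumed here, in which the variance is μ + aμ² for a dispersion parameter a > 0. The counts and parameter values are hypothetical.

```python
import math
import statistics

def dispersion_ratio(counts):
    """Variance-to-mean ratio; about 1 under a Poisson model, above 1 suggests overdispersion."""
    return statistics.variance(counts) / statistics.fmean(counts)

def negbin_pmf(y, mu, a):
    """Negative binomial Pr(Y = y) with mean mu and variance mu + a * mu**2 (a > 0)."""
    r = 1.0 / a
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

# Hypothetical joint counts (illustrative only, not study data).
counts = [0, 0, 1, 1, 2, 2, 3, 4, 5, 7, 9, 12, 15, 20]
ratio = dispersion_ratio(counts)
print(f"variance/mean = {ratio:.2f}")

# Sanity checks on the distribution: probabilities sum to ~1 and the mean is ~mu.
nb_total = sum(negbin_pmf(y, 5.0, 1.0) for y in range(2000))
nb_mean = sum(y * negbin_pmf(y, 5.0, 1.0) for y in range(2000))
```

As a approaches zero the negative binomial variance approaches the mean, recovering the Poisson case.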
2.3.2 Zero‐inflated negative binomial regression
An excess of zeros in a dataset, beyond the number expected under the assumed distribution, is called zero‐inflation. Adaptations of both the Poisson and negative binomial regression models are available to account for zero‐inflated data. These adaptations assume that the zeros in the data can arise by two processes: genuine zeros that are part of the expected distribution of the data, and systematic zeros, i.e. data points that can only take a zero value for a particular reason. Both zero‐inflation and a non‐normal distribution were observed in the radiographic damage variable, the Sharp/van der Heijde score (SHS) (Figure 2.3).
Figure 2.2 Comparison of the clinical variable SJC28 against the expected Poisson and negative binomial distributions generated in Stata. The green connected line represents the expected Poisson distribution, which is clearly a poor fit for the observed data (blue connected line). The red connected line represents the negative binomial distribution, which the SJC28 appears to approximate very closely.
The underlying theory behind the excess zeros in these data could be that certain patients cannot take a value other than 0 because of some clinical or biological factor. This will be explored in further detail in the results chapter on radiographic joint damage (Chapter 3.3.2). The joint damage variables SHS, erosions (ERN) and joint space narrowing (JSN) can be modelled using the negative binomial distribution (Figure 2.4), similar to the SJC, but with the incorporation of the zero‐inflated data; this is the zero‐inflated negative binomial (ZINB) regression model. A formal test can be used to assess whether a zero‐inflated model is appropriate. In this instance a Vuong test of non‐nested models, comparing the negative binomial with the zero‐inflated negative binomial model, gives a test statistic with a standard normal distribution, where large positive values indicate that the zero‐inflated model is a better fit [264].
Figure 2.3 Zero‐inflation is highly evident in the clinical variable Sharp score used to quantify joint damage in RA patients. The histogram in the left‐hand panel shows the large proportion of zeros in the data, whilst the normal probability plot on the right shows both the non‐normal distribution of the data and the effect of the zero‐inflation within this variable.
The ZINB models the data in two parts: the count model, whose parameters are estimated using the negative binomial distribution, and the zero‐inflation model, a logit‐linked regression that estimates the log odds of zero‐inflation, where the dependent variable is the presence/absence of a zero value. Predictor variables can thus be entered separately into the count and zero‐inflation equations to observe their effects.
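The two‐part structure can be illustrated with a short sketch of the ZINB probability function. It assumes an intercept‐only zero‐inflation part on the log‐odds scale and a negative binomial count part with variance μ + aμ²; the function names and all parameter values are hypothetical, and this is not the estimation routine used by any particular statistical package.

```python
import math

def negbin_pmf(y, mu, a):
    """Negative binomial Pr(Y = y) with mean mu and variance mu + a * mu**2 (a > 0)."""
    r = 1.0 / a
    log_p = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
             + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, a, gamma0):
    """Zero-inflated NB: a systematic zero with probability pi, otherwise NB(mu, a).

    gamma0 is the (hypothetical) intercept of the zero-inflation part on the
    log-odds scale, so pi = 1 / (1 + exp(-gamma0)).
    """
    pi = 1.0 / (1.0 + math.exp(-gamma0))          # inverse logit
    count_part = (1.0 - pi) * negbin_pmf(y, mu, a)
    return pi + count_part if y == 0 else count_part

# Illustrative values only: compare the probability of a zero with and without inflation.
p_zero_nb = negbin_pmf(0, mu=10.0, a=0.5)
p_zero_zinb = zinb_pmf(0, mu=10.0, a=0.5, gamma0=-1.0)
zinb_total = sum(zinb_pmf(y, 10.0, 0.5, -1.0) for y in range(3000))
print(p_zero_nb, p_zero_zinb)
```

In a full regression model both μ and the log odds of a systematic zero would depend on predictor variables, which is what allows a predictor to be entered separately into each part.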
2.3.3 Modified Poisson regression
Figure 2.4 Comparison of the expected negative binomial and Poisson distributions against the SHS. The SHS very closely approximates the negative binomial distribution (red connected line), whilst the Poisson distribution (green connected line) would provide a poor fit, and thus an inappropriate regression model for predictors of joint damage.
The exponent of the estimate from the Poisson model (β in Equation 1) is the incidence rate ratio (IRR); when the model is applied to a binary outcome this can be interpreted simply as the risk ratio (RR). To estimate the RR for a binary outcome accurately using Poisson regression, a sandwich estimator of the standard error, otherwise known as a robust error variance, must be applied [265]. This allows the robust estimation of risk ratios in a cohort‐based study, as opposed to the use of logistic regression to estimate the log odds ratio, which can overestimate the effect of a predictor where the outcome is common.
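The divergence between the odds ratio and the risk ratio for a common outcome can be seen in a simple worked example. The 2×2 tables below are invented for illustration, and the sketch computes the ratios directly from the tables; it does not implement the Poisson regression or the sandwich estimator.

```python
def risk_ratio(a, b, c, d):
    """RR from a 2x2 table: a, b = exposed with/without outcome; c, d = unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same 2x2 table."""
    return (a / b) / (c / d)

# Hypothetical cohorts of 100 exposed and 100 unexposed patients.
common = (60, 40, 40, 60)   # outcome in 60% of exposed vs 40% of unexposed
rare = (6, 94, 4, 96)       # outcome in 6% of exposed vs 4% of unexposed

for label, table in (("common", common), ("rare", rare)):
    print(f"{label}: RR = {risk_ratio(*table):.2f}, OR = {odds_ratio(*table):.2f}")
```

With the common outcome the odds ratio (2.25) is well above the risk ratio (1.50), whereas with the rare outcome the two nearly coincide (1.53 against 1.50).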
2.3.4 Linear regression and treatment response models
As discussed in Chapter 1.5.1, treatment response requires a strict prior definition based on the manner in which it is measured. The DAS28 is commonly used as a metric of response because it is familiar to most clinicians and has a simple algorithm for calculation [246]. The distribution of the DAS28 is approximately normal whether it is calculated using the CRP or ESR measurement, and whether it includes 4 variables or 3 (Figure 2.5). The magnitude of response can be calculated between individual time points; for instance, the response over the first six months of treatment can be measured by subtracting the value at 6 months from the baseline value (∆DAS28‐CRP). This gives a directional measure of response in which positive values indicate an improvement in the patient’s DAS28 and negative values indicate a worsening (Figure 2.6). The ∆DAS28‐CRP has an approximately normal distribution, which makes it suitable as a dependent variable in a linear regression analysis. The other requirements of a linear regression are that the relationship between the predictor and the dependent variable is linear, and that the residuals from the regression model have a constant variance and are normally distributed. These assumptions can be checked by visualising the regression model output graphically. For instance, a normal probability plot of the residuals can detect deviation from the values expected under the normal distribution, whilst linearity and constant variance can be assessed on a scatter plot of the residuals against the fitted values (Figure 2.7).
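The calculation of the change score and the residuals used for the graphical checks can be sketched as follows. The ages and DAS28 values are invented for illustration, the helper names are hypothetical, and the fit is an ordinary least‐squares line with a single predictor rather than the full models used in the analyses.

```python
import statistics

def delta_das28(baseline, six_month):
    """Baseline minus 6-month value; positive values indicate improvement."""
    return [b - s for b, s in zip(baseline, six_month)]

def ols_fit(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, fitted values, residuals).
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return intercept, slope, fitted, residuals

# Hypothetical patients (not study data).
age = [34, 45, 52, 58, 63, 70]
baseline = [5.8, 6.1, 4.9, 5.5, 6.4, 5.2]
six_month = [3.1, 3.9, 3.2, 4.0, 4.6, 4.1]

delta = delta_das28(baseline, six_month)
intercept, slope, fitted, residuals = ols_fit(age, delta)
# Plotting residuals against fitted values gives the scatter plot described above.
print(list(zip(fitted, residuals)))
```

The residuals from a least‐squares fit with an intercept always sum to zero, so it is their spread and shape across the fitted values, not their average, that is informative.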
Figure 2.5 – Histogram of baseline DAS28‐CRP calculated using 4 variables (SJC28, TJC28, CRP and VAS‐Global Health). The DAS28‐CRP is distributed in an approximately normal manner, as shown by the turquoise curve. Data are from the YEAR cohort with symptom duration of 24 months or less.
Figure 2.6 – Histogram of ∆DAS28‐CRP from baseline to 6 months in the YEAR cohort. The direction of the ∆
Figure 2.7 – Graphical visualisation of linear regression residuals to check for deviation from model assumptions. A – A scatter plot of the residuals and fitted values from an example linear regression model with the ∆DAS28‐CRP as the dependent variable and patient baseline age as the predictor variable. No obvious patterns are observed in the data indicative of non‐constant variance. B – A normal probability plot of the linear regression model residuals. There is very little deviation from the expected distribution for this model implying it is appropriate for this context.
Measuring the treatment response of a patient using such a measure at a single time point is subject to several confounders: 1) the patient may be experiencing a flare in their symptoms on a particular day, which may lead to underestimation of the magnitude of their response; 2) non‐disease related factors can affect their response on the VAS‐Global Health questionnaire of the DAS28, thereby introducing subjective error into the measure; 3) the weighting of individual components can affect the total DAS28 score differently. For example, a patient may have a few large inflamed joints, which can lead to a highly elevated CRP/ESR, whilst a patient with a higher number of smaller affected joints may have a lower CRP/ESR. Several approaches can be taken to deal with these possible confounders. Data can be measured at serial time points to draw a curve of disease activity. The area under the curve (AUC) would then reflect a smoothed average of the disease activity over time, which would account for 1). The DAS28 can be calculated with 3 variables, omitting the VAS‐Global Health and so removing some confounding subjective error; this would partially account for 2). The weighting of the individual components is based upon their relative contribution to the original factor analysis from which the DAS28 was derived, thus this confounder cannot be easily accounted for. One possible solution is to use multiple response measures which individually reflect different aspects of disease activity in RA, such as the separate components of the DAS28.
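The area under the curve for serial measurements can be approximated with the trapezoidal rule, as in the sketch below. The visit times and DAS28 values are hypothetical, and the trapezoidal rule is one reasonable choice assumed here rather than a method stated in the text.

```python
def auc_trapezoid(times, values):
    """Area under the curve by the trapezoidal rule for ordered time points."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:]))

# Hypothetical visits (months) and DAS28 values for one patient.
months = [0, 3, 6, 12]
das28 = [5.6, 4.2, 3.8, 3.1]

auc = auc_trapezoid(months, das28)
time_averaged = auc / (months[-1] - months[0])  # mean disease activity over follow-up
print(f"AUC = {auc:.1f}, time-averaged DAS28 = {time_averaged:.2f}")
```

Dividing the AUC by the length of follow‐up gives a time‐averaged score on the original DAS28 scale, which makes patients with different follow‐up durations easier to compare.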
2.3.5 Genetic association models
The effects of biallelic SNPs were tested in regression models using an additive genetic model. This assumes an allele‐dose effect of the minor allele on the dependent variable on a linear scale; on a log scale this is denoted the log‐additive model. Alternative genetic models, such as recessive or dominant models, may fit the data better in some circumstances, and specifying an inappropriate genetic model results in a loss of statistical power. However, testing each SNP under each genetic model increases the multiple testing burden, whereas the additive model has little impact on statistical power if the true model is dominant or co‐dominant, though it lacks power to detect recessive effects with MAF<0.5 [167]. Genetic association testing was performed using Stata IC v11 (Stata Corp, College Station, Texas, USA) and PLINK [266].
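The genetic models differ only in how the genotype is coded as a predictor. A minimal sketch, with a hypothetical helper function and the genotype expressed as the number of copies of the minor allele:

```python
def code_genotype(minor_allele_count, model="additive"):
    """Code a biallelic genotype (0, 1 or 2 minor alleles) under a genetic model."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count            # per-allele (dose) effect
    if model == "dominant":
        return int(minor_allele_count >= 1)  # one copy is sufficient
    if model == "recessive":
        return int(minor_allele_count == 2)  # two copies are required
    raise ValueError(f"unknown model: {model}")

for model in ("additive", "dominant", "recessive"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```

Under the additive coding a single regression coefficient represents the change in the dependent variable per copy of the minor allele.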
2.3.6 Genetic analysis programs – Haploview and LocusZoom
GWAS results are commonly displayed using a Manhattan plot: a plot of –log10 p values (y axis) against SNP position (x axis), ordered by chromosome. A standard Manhattan plot can therefore display data from millions of markers across the human genome in an easy to interpret manner. However, it does not necessarily allow for the fine‐scale interrogation of association signals on its own. It is possible to generate high‐resolution Manhattan plots in a number of statistical packages, but these do not readily allow for the integration of other data of interest, such as locus‐wide recombination hotspots and linkage disequilibrium between SNP markers. A commonly used server for generating such plots directly from the output of genetic analysis software is LocusZoom. This online application utilises the R statistical package to generate plots that integrate data from several public sources, including the HapMap and 1000 Genomes projects [267]. LocusZoom was used to visualise data from all statistical association analyses that utilised large‐scale genotyping data, i.e. those derived from whole‐genome genotyping chips.
Genotype data for association studies can also be used to generate LD plots and perform statistical association studies. Haploview 4.2 is capable of generating LD plots for a number of LD measures, including r² and D′ [268], and was used in this thesis for this purpose.
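For reference, the two LD measures can be computed from haplotype and allele frequencies using their standard definitions. The sketch below assumes that phased haplotype frequencies are known and uses invented frequencies; it is not Haploview's implementation, and the function name is hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B;
    p_a, p_b: frequencies of allele A and allele B (each strictly between 0 and 1).
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies: allele B only ever occurs on haplotypes carrying allele A.
d, d_prime, r2 = ld_measures(p_ab=0.2, p_a=0.5, p_b=0.2)
print(f"D = {d:.2f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

In this example D′ is 1 while r² is only 0.25, which illustrates why the two measures can give different impressions of LD when allele frequencies differ between markers.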
2.3.7 Genotype imputation
Genotype imputation was performed on pre‐phased haplotypes using a set of reference haplotypes (1000 Genomes March 2012) in IMPUTE2 [269]. Imputation was performed on the Roche clinical trials, YEAR and healthy control primary cell gene expression data donated by Dr Julian Knight. All pre‐phasing and imputation was carried out by Mr John
Taylor (Section of Epidemiology and Biostatistics, Leeds Institute of Cancer and Pathology, University of Leeds).