Chapter 2. Patients and methodology
2.2 Statistical methods
2.2.2 Statistical models
Statistical models aim to provide an understanding of the relationships between variables and can be applied for both descriptive and predictive purposes. They provide a mathematical representation of how the variability of a response can be explained in terms of explanatory variables, and they incorporate a random component to account for the deviation of the observed response values from the predicted values. In HIV epidemiology research, models are used to examine a number of virological, immunological and clinical outcomes.
Univariable models contain only one explanatory variable and are used to investigate the effect of this variable alone on the response. However, there may be additional factors that are associated with both the response variable and the explanatory variable and that confound their relationship. For example, a univariable analysis may find that IDUs have a higher risk of death, but IDUs also typically have a lower CD4 cell count due to lifestyle factors, and a lower CD4 cell count is itself associated with a higher risk of death. It is therefore possible that the apparent relationship between IDU and death is explained, at least in part, by CD4 cell count. To reduce bias from potential confounding factors, multivariable analyses are used that contain more than one explanatory variable and adjust the effect of each variable for the others in the model. In observational studies, however, there may still be unknown or unmeasured confounding variables, which is a limitation of analyses such as these.
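The confounding example above can be made concrete with a small numerical sketch. The counts below are invented purely for illustration and are not data from this thesis; the group labels and function names are likewise hypothetical. Stratification by CD4 cell count is used here as a simple stand-in for adjustment in a multivariable model.

```python
# Hypothetical counts, invented for illustration only (not thesis data).
# Each entry is (number of deaths, number of patients).
COUNTS = {
    ("IDU", "low CD4"): (160, 800),
    ("IDU", "high CD4"): (10, 200),
    ("non-IDU", "low CD4"): (40, 200),
    ("non-IDU", "high CD4"): (40, 800),
}


def risk(group, strata):
    """Proportion of patients in `group` who died, pooled over `strata`."""
    deaths = sum(COUNTS[(group, s)][0] for s in strata)
    patients = sum(COUNTS[(group, s)][1] for s in strata)
    return deaths / patients


def risk_ratio(strata):
    """Risk of death in IDUs relative to non-IDUs within the given strata."""
    return risk("IDU", strata) / risk("non-IDU", strata)


crude = risk_ratio(["low CD4", "high CD4"])  # ignores CD4 cell count
low_only = risk_ratio(["low CD4"])           # patients with low CD4 only
high_only = risk_ratio(["high CD4"])         # patients with high CD4 only

print(f"Crude risk ratio: {crude:.2f}")
print(f"Risk ratio, low CD4 stratum: {low_only:.2f}")
print(f"Risk ratio, high CD4 stratum: {high_only:.2f}")
```

In these invented data the crude risk ratio is above two, yet within each CD4 stratum IDUs and non-IDUs have the same risk of death, so the crude association is explained entirely by the different CD4 distributions of the two groups.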
Statistical interactions may also be included in multivariable analyses. These occur when the relationship between the explanatory and the response variable is stronger in some groups than in others. For example, if the cumulative effect of treatment exposure on cholesterol is worse in men than in women, there is a statistical interaction between gender and length of time on treatment. As a general rule, statistical interactions between all variables are not routinely tested, because repeated statistical testing increases the chance of a false positive result. Interactions of interest are normally decided a priori based upon clinical suspicion.
Three main methods of statistical modelling are used in this thesis: logistic regression, survival analysis and Poisson regression. Linear regression is also used in some sensitivity analyses. Formal descriptions can be found in Appendix V.
Overviews of what the models are used for and the reasons for choosing these particular methods are briefly outlined below.
2.2.2.1 Logistic regression
Logistic regression models are used when the response variable under investigation is binary, i.e. has only two possible outcomes [370,371]. These are often labelled ‘success’ and ‘failure’, usually corresponding to a positive and a negative outcome, for example, a successful reduction of viral load to an undetectable level following ART as opposed to failure to reach an undetectable level. For simple analyses, the proportion of successes can be compared between groups. However, the probability of a success, p, is a number between 0 and 1, and in order to use regression methods for analysis it has to be mathematically transformed into a quantity that can take any value between minus infinity (-∞) and infinity (∞). Hence, logistic regression models predict the log odds of observing a ‘success’, which can take values in (-∞,∞), based on observed explanatory variables. The odds of observing a success are defined as the ratio of the probability of success to the probability of failure, i.e. p/(1-p). When the outcome is rare, the odds are approximately equal to the probability of a success.
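The transformation from a probability to the log odds, and back again, can be sketched in a few lines. This is a minimal illustration only; the function names are chosen here for convenience and do not come from any particular statistical package.

```python
import math


def odds(p):
    """Odds of a success: probability of success over probability of failure."""
    return p / (1 - p)


def log_odds(p):
    """Log odds (logit) of a probability p with 0 < p < 1."""
    return math.log(odds(p))


def inverse_logit(x):
    """Map a log odds value in (-inf, inf) back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-x))


for p in (0.01, 0.2, 0.5, 0.9):
    print(f"p = {p}: odds = {odds(p):.4f}, log odds = {log_odds(p):.4f}")
```

For a rare outcome such as p = 0.01, the odds are very close to p itself, whereas for common outcomes the two quantities diverge.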
This type of model is called a linear logistic model and assumes a linear relationship between the explanatory variables and the log odds of a ‘success’. It can be used to estimate odds ratios (and 95% CIs) that compare the outcomes from two groups of patients. An odds ratio is defined as the ratio of the odds in one group to the odds in a second group. When the value is greater than one, this indicates that the first group has greater odds of a success than the second. When it is less than one, the second has greater odds.
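As a sketch of how an odds ratio and its 95% CI can be obtained in the simplest case, the example below works from a two-by-two table of counts, using a normal approximation on the log odds ratio scale. This is the unadjusted calculation rather than a fitted logistic regression model, and the counts are hypothetical.

```python
import math


def odds_ratio_ci(success_1, failure_1, success_2, failure_2, z=1.96):
    """Odds ratio (group 1 vs group 2) with an approximate 95% CI.

    The interval uses a normal approximation for the log odds ratio,
    so all four counts are assumed to be greater than zero.
    """
    odds_ratio = (success_1 / failure_1) / (success_2 / failure_2)
    se_log_or = math.sqrt(
        1 / success_1 + 1 / failure_1 + 1 / success_2 + 1 / failure_2
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 80 successes and 20 failures in group 1,
# 60 successes and 40 failures in group 2.
estimate, lower, upper = odds_ratio_ci(80, 20, 60, 40)
print(f"Odds ratio {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With these invented counts the odds ratio is greater than one and the interval excludes one, indicating greater odds of a success in the first group.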
2.2.2.2 Survival analysis
Survival analysis methods are used to investigate the time to an event from a well-defined time origin [372]. If the event is death, the data are literally survival times. Survival data are often highly skewed and are not normally distributed. However, their main feature is that they can be censored, which occurs when the event has not been observed for an individual. For example, if the event has not been observed during a patient's follow-up, the patient can be censored at their last visit, indicating that no information is available from that point in time onwards. This relies on the assumption that the actual survival time is independent of any mechanism that causes the censoring. Censoring that occurs after the last known survival time is called right censoring (as it lies to the right when plotted on a graph) and gives a right censored survival time that is less than the unknown actual survival time. Data can also be left censored, when the actual survival time is less than that observed.
A mathematical function that summarises a distribution of survival times is called the survivor function. This can be estimated using a Kaplan-Meier estimate, which can be displayed visually as a plot of the estimated cumulative proportion of individuals who have experienced the event (or, equivalently, who remain event-free) over time. The median survival time (and other percentiles) can be read from this plot.
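A minimal sketch of the Kaplan-Meier calculation is given below, assuming right censored data only. The follow-up times and event indicators are invented for illustration, and the function names are hypothetical. Where an event and a censoring share the same time, the censored individual is treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survivor function.

    times:  follow-up time for each individual
    events: 1 if the event was observed, 0 if right censored
    Returns a list of (time, estimated survival) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


def median_survival(curve):
    """First time at which estimated survival is 0.5 or less, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up times (e.g. in months) and event indicators.
times = [2, 3, 3, 5, 7, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
print("Median survival time:", median_survival(curve))
```

The estimate only changes at times when an event is observed; censored individuals contribute to the number at risk up to their censoring time and are then removed.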
The log-rank test is a non-parametric test that can be used to compare survival times in independent groups.
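The log-rank test for two groups can be sketched as follows. At each distinct event time, the observed number of events in the first group is compared with the number expected if both groups shared the same hazard. This is an illustrative implementation under the assumption of right censored data and two groups only; the data and names are hypothetical.

```python
import math


def log_rank_test(times_1, events_1, times_2, events_2):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    Event indicators are 1 if the event was observed, 0 if right censored.
    """
    subjects = [(t, e, 1) for t, e in zip(times_1, events_1)]
    subjects += [(t, e, 2) for t, e in zip(times_2, events_2)]
    event_times = sorted({t for t, e, _ in subjects if e == 1})

    observed_1 = expected_1 = variance = 0.0
    for t in event_times:
        at_risk = [s for s in subjects if s[0] >= t]
        n = len(at_risk)
        n_1 = sum(1 for s in at_risk if s[2] == 1)
        d = sum(1 for s in at_risk if s[0] == t and s[1] == 1)
        d_1 = sum(
            1 for s in at_risk if s[0] == t and s[1] == 1 and s[2] == 1
        )
        observed_1 += d_1
        expected_1 += d * n_1 / n
        if n > 1:
            variance += d * (n_1 / n) * (1 - n_1 / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("Variance is zero; the groups cannot be compared.")
    chi_square = (observed_1 - expected_1) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical data: all events observed, group A fails earlier than group B.
chi_square, p_value = log_rank_test(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 1, 1, 1, 1],
)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

In practice an established statistical package would normally be used; the sketch is intended only to show what the test compares.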
To assess the effects of explanatory variables on survival times, a proportional hazards regression model can be used. This models the hazard function, which gives the instantaneous risk, or hazard, of the event occurring at a given time point after the time origin, conditional on the individual having survived up to that point. The model can be used to estimate hazard ratios, or relative hazards (and 95% CIs), that compare the risk of an event between groups of patients and are interpreted in a similar way to odds ratios in a logistic regression model.
A proportional hazards model assumes that, for different values of an explanatory variable, the hazards are proportional over time, i.e. the relative hazard is constant over time. This can be checked by including in the model an interaction term between the log survival time and the variable of interest. If the model fits significantly better after including this term, there is evidence of non-proportionality. The model also assumes a log-linear relationship between the hazard and the explanatory variables. In this thesis, the Cox proportional hazards regression model is used, which makes no assumption about the shape of the baseline hazard function.
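The proportional hazards assumption can be illustrated with a short sketch. Under this form of model, the hazard for an individual is the baseline hazard multiplied by the exponential of a linear combination of the explanatory variables, so the hazard ratio between two groups does not depend on time. The baseline hazard and coefficient below are arbitrary, hypothetical choices; in a Cox model the baseline hazard is left unspecified and is not estimated in this way.

```python
import math


def hazard(t, x, beta, baseline_hazard):
    """Hazard at time t for covariate value x under proportional hazards."""
    return baseline_hazard(t) * math.exp(beta * x)


def baseline(t):
    """Arbitrary baseline hazard, chosen for illustration only."""
    return 0.01 + 0.002 * t


beta = math.log(2)  # hypothetical coefficient, i.e. a hazard ratio of 2

# Hazard ratio for x = 1 versus x = 0 at several time points.
ratios = [
    hazard(t, 1, beta, baseline) / hazard(t, 0, beta, baseline)
    for t in (1, 5, 10)
]
print(ratios)
```

Although the hazard itself changes with time, the ratio between the two groups is the same at every time point and equals the exponential of the coefficient.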
2.2.2.3 Poisson regression
Poisson regression models are used for count data to predict rates of an event and hence can be used to calculate rate ratios (and 95% CIs) [373]. As with a logistic regression model, a function is needed to transform the response, here a rate that can only take positive values, into a form that can take values in (-∞,∞). The appropriate function for this is the natural logarithm, which gives a log-linear model.
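In the simplest case of comparing two groups, a rate ratio and an approximate 95% CI can be calculated directly from the number of events and the person-years of follow-up, working on the log scale. The sketch below is the unadjusted calculation rather than a fitted Poisson regression model, and the counts are hypothetical.

```python
import math


def rate_ratio_ci(events_1, person_years_1, events_2, person_years_2, z=1.96):
    """Rate ratio (group 1 vs group 2) with an approximate 95% CI.

    The interval uses a normal approximation for the log rate ratio,
    so both event counts are assumed to be greater than zero.
    """
    rate_1 = events_1 / person_years_1
    rate_2 = events_2 / person_years_2
    rate_ratio = rate_1 / rate_2
    se_log_rr = math.sqrt(1 / events_1 + 1 / events_2)
    lower = math.exp(math.log(rate_ratio) - z * se_log_rr)
    upper = math.exp(math.log(rate_ratio) + z * se_log_rr)
    return rate_ratio, lower, upper


# Hypothetical data: 30 events in 1000 person-years versus
# 15 events in 1000 person-years.
rr, rr_lower, rr_upper = rate_ratio_ci(30, 1000, 15, 1000)
print(f"Rate ratio {rr:.2f} (95% CI {rr_lower:.2f} to {rr_upper:.2f})")
```

Because the interval is constructed on the log scale and then exponentiated, its limits are always positive, which reflects the role of the log transformation described above.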
2.2.2.4 Linear regression
Linear regression is the most straightforward of the linear models. It is used when the response is continuous and approximately normally distributed, and it assumes a linear relationship between the response and the explanatory variables [374]. Differences in the mean response can be compared between values of an explanatory variable.
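For a single explanatory variable, the least squares estimates of the intercept and slope can be sketched as below. The data are invented for illustration and the function name is hypothetical.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical explanatory variable and continuous response.
x = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 8.8]

intercept, slope = fit_line(x, y)
print(f"Intercept {intercept:.2f}, slope {slope:.2f}")
```

The slope is the estimated difference in the mean response for a one-unit difference in the explanatory variable.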