The statistical methods used in this study are based upon the methods developed by the
APHEA project. This applies to the formulation of the statistical model and
assumptions and the method used to select the appropriate terms for the model to allow
for confounding factors.
The aim of the time-series analyses is to assess, using an appropriate statistical model,
the association between GP consultations and air pollution. In order to do this it is first
necessary to determine any 'confounding factors' which influence the health outcome
and then allow for them in the statistical model. These confounding factors can be other
environmental variables such as temperature, relevant clinical variables such as
influenza epidemics, or temporal factors such as day of week patterns, seasonal patterns
(seasonality) and other patterns linked to human activity and the environment. Once
these factors have been allowed for, the air pollution variables can be added to the
model and their associations assessed in terms of their magnitude, direction and
statistical significance.
The focus of interest is the short-term association between daily measures of air
pollution and daily counts of GP consultations, that is, association over the space of a
few days. Therefore, it is necessary to exclude any associations between the two
variables that may exist because they have similar time-dependent patterns when viewed
over a long time period, i.e. over a number of years. These can be long-term systematic
changes in the mean or trend or seasonal variations over the course of the year. It is not
necessary to explain these patterns, simply to identify their existence and account for
them in the modelling procedure. Any subsequent association found between the health
outcome and the air pollution indicators can then be assumed to be of a short-term
nature.
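The effect of shared long-term patterns can be illustrated with a minimal sketch on simulated data (not the study data). Two series that have nothing in common except an annual cycle appear strongly correlated until the cycle is removed. The simulated series, the single annual sine/cosine pair and the function names are assumptions made purely for illustration.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def remove_annual_cycle(series, period=365):
    """Subtract the mean and a fitted annual sine/cosine pair.

    Assumes the series covers a whole number of periods, so that the
    least-squares coefficients reduce to simple projections.
    """
    n = len(series)
    mean = sum(series) / n
    angles = [2 * math.pi * t / period for t in range(n)]
    a = 2 / n * sum(v * math.sin(w) for v, w in zip(series, angles))
    b = 2 / n * sum(v * math.cos(w) for v, w in zip(series, angles))
    return [v - mean - a * math.sin(w) - b * math.cos(w)
            for v, w in zip(series, angles)]


# Simulated data: two years of daily values sharing only a seasonal cycle.
rng = random.Random(1)
days = range(2 * 365)
season = [math.sin(2 * math.pi * t / 365) for t in days]
pollution = [s + rng.gauss(0, 0.5) for s in season]
consultations = [s + rng.gauss(0, 0.5) for s in season]

raw_corr = correlation(pollution, consultations)
adjusted_corr = correlation(remove_annual_cycle(pollution),
                            remove_annual_cycle(consultations))
print(f"raw: {raw_corr:.2f}, after removing the cycle: {adjusted_corr:.2f}")
```

The raw correlation here is produced entirely by the common seasonal pattern; once that pattern is accounted for, only the independent day-to-day noise remains.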
The data consist of routine observational measurements and therefore it is difficult to
infer causality from any statistically significant associations found in the analyses. In
addition no specific hypotheses are being tested; instead the study is an exploration of
the data for consistent, plausible findings. For these reasons the modelling methodology
is conservative. Air pollution variables are only included in the model once every effort
has been made to explain the GP consultations time-series using a core model derived
from other explanatory factors.
Time-series of health outcome data, usually counts, are often approximately Poisson
distributed, overdispersed and usually positively autocorrelated (non-independence of
observations). These features need to be allowed for in the statistical model.
Overdispersion occurs when the variance of the outcome is greater
than the mean; this can be due to omitted variables in the model and whilst this is
reduced as the model is built, remaining overdispersion can be accounted for by the
appropriate modelling procedure. Autocorrelation (or serial correlation) is the
dependence of the level of a variable on a given day on its level on the previous day,
and, to a lesser extent, on its level on the day before that, etc. The autocorrelation
in the GP data is not intrinsic but due to autocorrelated explanatory variables, some of
which may be known. Seasonal terms are examples of such explanatory variables, as
are temperature and air pollution. Again this autocorrelation can be reduced or removed
altogether by inclusion of the relevant explanatory variables. Any autocorrelation
remaining thereafter is accounted for by the inclusion of appropriate autoregressive
terms in the model.
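The two quantities defined above can be computed directly from observed counts and fitted values. The following is a minimal sketch, assuming the Pearson chi-square statistic divided by the residual degrees of freedom as the overdispersion estimate and the usual sample autocorrelation. The function names and the toy numbers are illustrative and are not taken from the study.

```python
def dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    Values near 1 are consistent with Poisson variation; values well
    above 1 indicate overdispersion.
    """
    chi_sq = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_sq / (len(observed) - n_params)


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (in days)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    ck = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return ck / c0


# Toy example: daily counts and fitted means from a one-parameter model.
counts = [2, 4, 4, 6]
fitted = [4.0, 4.0, 4.0, 4.0]
print(dispersion(counts, fitted, n_params=1))

# A series that drifts steadily is positively correlated at lag 1.
print(autocorrelation([1, 2, 3, 4, 5], lag=1))
```

In practice the autocorrelation would be examined on the model residuals, at each stage of model building, rather than on a raw series.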
The relevant explanatory variables are not known a priori. Each diagnostic and age
grouping may have different seasonal patterns, temperature lags, etc. The modelling
procedure is therefore one of trial and error to select the 'best' combination, from many
confounding factors, that describes the GP time-series. The following sections
describe the guidelines used to assess what constitutes the 'best' model for a given
outcome as well as the procedures used to arrive at the model.
3.4.2 Criteria for selecting core model
The following criteria and diagnostic tools are used to determine whether a suitable core
model has been obtained. In applying them, experience and judgement are required. In
most cases the final model represents a
balance of a number of factors as judged by the analyst.
• Raw/Predicted Time Series. A plot of observed and predicted values helps identify
areas where the model fitting is inadequate. For example, if seasonal peaks vary
from year to year, such a plot will help identify whether these have been correctly
modelled.
• Residual Time Series. Used in conjunction with the plot of the observed and
predicted values. The residual plot should be as close as possible to white noise,
showing no systematic patterns or changes in variance over time. Trends and
patterns visible in the raw data should be described by the predicted series and
invisible in the residual series. Also no obvious outliers or extreme points should be
visible.
• Model Residuals against fitted values. A plot of model residuals against fitted values
will show up any model inadequacies usually as gaps in the residuals for given ranges
of the fitted values or clumping together in vertical bands. In addition, specific
patterns or trends suggesting omitted variables or model mis-specification are
revealed.
• Periodogram. The periodogram is part of the spectral decomposition of a time series.
It is used to indicate the presence of seasonality in the data. After correct control for
long-term trends and seasonal patterns the periodogram of the residuals should be close
to white noise, i.e. the remaining spikes representing cycles of long wavelength
should be of similar, or smaller, magnitude than the short wavelength spikes.
• Partial Autocorrelation Function (PACF). The PACF describes the serial correlation
(autocorrelation) of a time series at lags 1,2,3 etc. with the value of each lag corrected
for the previous lags. After correct control for long-term trends and seasonal patterns
a plot of the PACF should show random non-significant partial autocorrelation
coefficients, preferably at all lags. It may be that the first few lags will be strongly
positive. These can be accounted for at a later stage using autoregressive terms.
The sum of these estimates over 60 days should be close to zero (assuming no
strong autocorrelation remains) or equal to the sum of the autocorrelation terms in
the first few days if these terms are large and significant. Meeting this criterion is
often not possible and compromise is necessary. The aim is therefore to minimise
the positive autocorrelation present in the raw series without introducing negative
autocorrelation. If the first several lags are consistently below 0 this points to over-
fitting of the model, that is inclusion in the model of unnecessary seasonal terms.
• Overdispersion in the core model should be as close as possible to the value 1.
• Akaike's information criterion (AIC) is used in conjunction with the plots to assess
model fit.
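The numerical diagnostics in the list above can be sketched as follows. This is an illustration under stated assumptions, not the software used in the study: the periodogram is computed by a direct discrete Fourier transform, the PACF by the Durbin-Levinson recursion on sample autocorrelations, and the AIC from the Poisson log-likelihood as -2 log L + 2p. All function names and the toy residual series are hypothetical.

```python
import cmath
import math


def periodogram(series):
    """Return (period_in_days, power) pairs at the Fourier frequencies k/n."""
    n = len(series)
    mean = sum(series) / n
    out = []
    for k in range(1, n // 2 + 1):
        coef = sum((v - mean) * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(series))
        out.append((n / k, abs(coef) ** 2 / n))
    return out


def pacf(series, max_lag):
    """Partial autocorrelations at lags 1..max_lag (Durbin-Levinson)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    r = [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
         for k in range(max_lag + 1)]
    phi_prev = []  # autoregressive coefficients of order k - 1
    result = []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def poisson_aic(observed, fitted, n_params):
    """AIC = -2 * Poisson log-likelihood + 2 * number of parameters."""
    log_lik = sum(y * math.log(mu) - mu - math.lgamma(y + 1)
                  for y, mu in zip(observed, fitted))
    return -2 * log_lik + 2 * n_params


# Toy residual series (28 days) with a 7-day cycle left in it.
residuals = [math.sin(2 * math.pi * t / 7) for t in range(28)]
peak_period, peak_power = max(periodogram(residuals), key=lambda p: p[1])
print(peak_period)  # period, in days, of the strongest remaining cycle

partials = pacf(residuals, max_lag=5)
bound = 1.96 / math.sqrt(len(residuals))  # approximate 95% limits
print([round(p, 2) for p in partials], round(bound, 2))
print(sum(partials))  # summed partial autocorrelations over the lags shown
```

A dominant periodogram spike, or partial autocorrelations well outside the approximate limits, would suggest that the core model still leaves systematic structure in the residuals. When comparing candidate core models by AIC, lower values indicate a better trade-off between fit and number of parameters.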