Aims of the modelling process

The statistical methods used in this study are based upon the methods developed by the

APHEA project. This applies to the formulation of the statistical model and

assumptions and the method used to select the appropriate terms for the model to allow

for confounding factors.

The aim of the time-series analyses is to assess, using an appropriate statistical model,

the association between GP consultations and air pollution. In order to do this it is first

necessary to determine any 'confounding factors' which influence the health outcome

and then allow for them in the statistical model. These confounding factors can be other

environmental variables such as temperature, relevant clinical variables such as

influenza epidemics, or temporal factors such as day of week patterns, seasonal patterns

(seasonality) and other patterns linked to human activity and their environment. Once

these factors have been allowed for, the air pollution variables can be added to the model and

their associations assessed in terms of their magnitude, direction and statistical

significance.

The focus of interest is the association between daily measures of air pollution and daily

counts of GP consultations that exists on a short-term basis, that is over the space of a

few days. Therefore, it is necessary to exclude any associations between the two

variables that may exist because they have similar time-dependent patterns when viewed

over a long time period i.e. over a number of years. These can be long-term systematic

changes in the mean or trend or seasonal variations over the course of the year. It is not

necessary to explain these patterns, simply to identify their existence and account for

them in the modelling procedure. Any subsequent association found between the health

outcome and the air pollution indicators can then be assumed to be of a short-term

nature.
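
The kind of terms used to absorb long-term and seasonal patterns can be sketched as follows. This is a minimal illustration and not the study's own specification: the function name `core_terms`, the use of sine and cosine pairs for the annual cycle, the number of harmonics and the choice of reference day are all assumptions made for the example.

```python
import math


def core_terms(n_days, harmonics=2, period=365.25):
    """Build one row of candidate confounder terms per day of the series."""
    rows = []
    for t in range(n_days):
        # Long-term linear trend, scaled to the interval [0, 1).
        row = {"trend": t / n_days}
        for k in range(1, harmonics + 1):
            # Sine/cosine pair completing k cycles per year (assumed form).
            angle = 2.0 * math.pi * k * t / period
            row[f"sin{k}"] = math.sin(angle)
            row[f"cos{k}"] = math.cos(angle)
        for d in range(1, 7):
            # Day-of-week indicators; day 0 is the reference (arbitrary choice).
            row[f"dow{d}"] = 1.0 if t % 7 == d else 0.0
        rows.append(row)
    return rows


rows = core_terms(n_days=730)
print(sorted(rows[0]))  # names of the terms generated for each day
```

Terms like these describe the slow patterns without explaining them, which is all the text requires before the short-term association is examined.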

The data consist of routine observational measurements and therefore it is difficult to

infer causality from any statistically significant associations found in the analyses. In

addition no specific hypotheses are being tested; instead the study is an exploration of

the data for consistent, plausible findings. For these reasons the modelling methodology

is conservative. Air pollution variables are only included in the model once every effort

has been made to explain the GP consultations time-series using a core model derived

from other explanatory factors.

Time-series of health outcome data, usually counts, are often approximately Poisson

distributed, overdispersed and usually positively autocorrelated (non-independence of

observations). These features need to be taken into account in

the statistical model. Overdispersion occurs when the variance of the outcome is greater

than the mean; this can be due to omitted variables in the model and whilst this is

reduced as the model is built, remaining overdispersion can be accounted for by the

appropriate modelling procedure. Autocorrelation (or serial correlation) is the

dependence of levels of a variable on a given day on what they were on the previous day,

and, to a lesser extent, on what they were on the day before that, etc. The autocorrelation

in the GP data is not intrinsic but due to autocorrelated explanatory variables, some of

which may be known. Seasonal terms are examples of such explanatory variables, as

are temperature and air pollution. Again this autocorrelation can be reduced or removed

altogether by inclusion of the relevant explanatory variables. Any autocorrelation

remaining thereafter is accounted for by the inclusion of appropriate autoregressive

terms in the model.
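
The two quantities discussed above can be computed as in the sketch below, which uses simulated counts rather than the study data. The helper names, the simulated seasonal mean and the parameter counts are assumptions for illustration. To keep the example short, the fitted values of the "seasonal" model are simply the true simulated means, not the output of a fitted regression.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson count using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square statistic divided by the residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / sum(d * d for d in dev)


rng = random.Random(1)
n = 730
# Simulated daily mean with an annual cycle (hypothetical values).
true_mean = [20.0 * (1.0 + 0.5 * math.sin(2.0 * math.pi * t / 365.0)) for t in range(n)]
counts = [poisson_draw(rng, mu) for mu in true_mean]

# Model 1: constant mean only, so the seasonal term is an omitted variable.
flat = [sum(counts) / n] * n
disp_flat = pearson_dispersion(counts, flat, n_params=1)
acf1_flat = autocorrelation([y - m for y, m in zip(counts, flat)], lag=1)

# Model 2: seasonal mean included (true means stand in for fitted values;
# n_params=3 assumes an intercept plus one sine/cosine pair).
disp_seasonal = pearson_dispersion(counts, true_mean, n_params=3)
acf1_seasonal = autocorrelation([y - m for y, m in zip(counts, true_mean)], lag=1)

print(disp_flat, disp_seasonal)
print(acf1_flat, acf1_seasonal)
```

In this simulation both the dispersion statistic and the lag-1 residual autocorrelation fall once the seasonal term is accounted for, which is the behaviour the paragraph describes for omitted autocorrelated explanatory variables.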

The relevant explanatory variables are not known a priori. Each diagnostic and age grouping may have different seasonal patterns, temperature lags etc. The modelling

procedure is then one of trial and error to select the 'best' combination, from many

confounding factors, which describes the GP time-series. The following sections

describe the guidelines used to assess what constitutes the 'best' model for a given

outcome as well as the procedures used to arrive at the model.
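
One step of such a trial-and-error comparison might look like the sketch below, which uses Akaike's information criterion (one of the criteria listed in the next section) to compare two candidate models on simulated counts. The data, the helper names and the two candidates are hypothetical. The example relies on the fact that a Poisson model containing only a categorical factor has the group means as its maximum likelihood fitted values, so no iterative fitting is needed.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson count using Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def poisson_aic(observed, fitted, n_params):
    """AIC = -2 * Poisson log-likelihood + 2 * number of parameters."""
    loglik = sum(
        y * math.log(mu) - mu - math.lgamma(y + 1)
        for y, mu in zip(observed, fitted)
    )
    return -2.0 * loglik + 2.0 * n_params


rng = random.Random(7)
n = 364  # 52 complete weeks
# Simulated counts with fewer consultations on days 5 and 6 of each week.
counts = [poisson_draw(rng, 10.0 if t % 7 >= 5 else 30.0) for t in range(n)]

# Candidate 1: a single overall mean (1 parameter).
fit_null = [sum(counts) / n] * n

# Candidate 2: one mean per day of the week (7 parameters).
dow_mean = [sum(counts[d::7]) / len(counts[d::7]) for d in range(7)]
fit_dow = [dow_mean[t % 7] for t in range(n)]

aic_null = poisson_aic(counts, fit_null, n_params=1)
aic_dow = poisson_aic(counts, fit_dow, n_params=7)
print(aic_null, aic_dow)  # the smaller value indicates the preferred candidate
```

A lower AIC alone would not settle the choice here; as the text notes, the final model is a balance of several diagnostics as judged by the analyst.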

3.4.2 Criteria for selecting core model

The following criteria and diagnostic tools are used to determine whether a suitable core

model has been obtained. No single criterion is decisive, and

experience and judgement are required. In most cases the final model represents a

balance of a number of factors as judged by the analyst.

• Raw/Predicted Time Series. A plot of observed and predicted values helps identify

areas where the model fitting is inadequate. For example, if seasonal peaks vary

from year to year, such a plot will help identify whether these have been correctly

modelled.

• Residual Time Series. Used in conjunction with the plot of the observed and

predicted values. The residual plot should be as close as possible to white noise,

showing no systematic patterns or changes in variance over time. Trends and

patterns visible in the raw data should be described by the predicted series and

invisible in the residual series. Also no obvious outliers or extreme points should be

visible.

• Model Residuals against fitted values. A plot of model residuals against fitted values

will show up any model inadequacies usually as gaps in the residuals for given ranges

of the fitted values or clumping together in vertical bands. In addition specific

patterns or trends suggesting omitted variables or model mis-specification are also

revealed.

• Periodogram. The periodogram is part of the spectral decomposition of a time series.

It is used to indicate the presence of seasonality in the data. After correct control for

long-term trends and seasonal patterns, the periodogram of the residuals should be close

to white noise, i.e. the remaining spikes representing cycles of long wavelength

should be of similar, or smaller, magnitude than the short wavelength spikes.

• Partial Autocorrelation Function (PACF). The PACF describes the serial correlation

(autocorrelation) of a time series at lags 1,2,3 etc. with the value of each lag corrected

for the previous lags. After correct control for long-term trends and seasonal patterns

a plot of the PACF should show random non-significant partial autocorrelation

coefficients, preferably at all lags. It may be that the first few lags will be strongly

positive. These can be accounted for at a later stage using autoregressive terms.

The sum of these estimates over 60 days should be close to zero (assuming no

strong autocorrelation remains) or equal to the sum of the autocorrelation terms in

the first few days if these terms are large and significant. Meeting this criterion is

often not possible and compromise is necessary. The aim is therefore to minimise

the positive autocorrelation present in the raw series without introducing negative

autocorrelation. If the first several lags are consistently below 0 this points to over-

fitting of the model, that is inclusion in the model of unnecessary seasonal terms.

• Overdispersion. The overdispersion parameter of the core model should be as close as possible to 1.

• Akaike's information criterion (AIC). To be used in conjunction with the plots to assess

model fit.
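
Two of the diagnostics above, the periodogram and the PACF, can be computed as in the following sketch. It is an illustration on simulated series, not the study's procedure: `noise` stands in for residuals from an adequate core model and `seasonal` for residuals that still contain an annual cycle. The function names, the Durbin-Levinson recursion for the PACF, and the approximate significance band of 2 divided by the square root of n are choices made for the example.

```python
import cmath
import math
import random


def acf(series, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Partial autocorrelations at lags 1..max_lag (Durbin-Levinson recursion)."""
    rho = acf(series, max_lag)
    phi_prev, out = [], []
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


def periodogram(series):
    """Power at 1..n//2 complete cycles over the length of the series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    power = []
    for cycles in range(1, n // 2 + 1):
        total = sum(dev[t] * cmath.exp(-2j * math.pi * cycles * t / n)
                    for t in range(n))
        power.append(abs(total) ** 2 / n)
    return power


rng = random.Random(3)
n = 730
noise = [rng.gauss(0.0, 3.0) for _ in range(n)]
seasonal = [10.0 * math.sin(2.0 * math.pi * t / 365.0) + e
            for t, e in enumerate(noise)]

# Periodogram: a dominant long-wavelength spike signals uncontrolled seasonality.
power = periodogram(seasonal)
peak_cycles = 1 + power.index(max(power))
peak_period_days = n / peak_cycles

# PACF over 60 lags, with an approximate band for non-significant coefficients.
band = 2.0 / math.sqrt(n)
pacf_seasonal = pacf(seasonal, 60)
pacf_noise = pacf(noise, 60)
outside_band = sum(1 for v in pacf_noise if abs(v) > band)

print(peak_period_days, pacf_seasonal[0], sum(pacf_noise), outside_band)
```

In this simulation the series with a remaining annual cycle shows a periodogram peak at a period of one year and a strongly positive first partial autocorrelation, while the white-noise series gives small coefficients whose sum over 60 lags stays near zero.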

In document Effects of air pollution on daily general practitioner consultations in London (Page 77-81)