Modelling Framework: The Generalised Additive Model

3.3 Discussion

4.1.1 Modelling Framework: The Generalised Additive Model

Generalised Additive Models (GAMs) are a more flexible extension to the more com-monly known Generalised Linear Models (GLMs) which are in turn an extension of the linear model. The reasons for using a GAM in this study largely relate to its allowance of non-Gaussian response variables and the flexibility it allows in the incorporation of explanatory variables. The following paragraph gives a brief introduction to statistical modelling terminology and statistical models in general before overviews of the two sim-pler modelling approaches, and the reasons why they were not suitable for this study, are given. There follows an explanation of the GAM in more detail, which leads on to details of the selection of the specific GAM used for this work. Throughout this section Simon Wood’s book on Generalised Additive Models in R, [Wood, 2006], will be the primary reference.

A statistical model is concerned with modelling a response variabley= (y₁, ..., yn)^T, which has some probability distribution p(y; θ), where θ is a vector of the model parameters. For example, θ could be the model’s mean and variance, which could in turn be modelled by explanatory variables, x_i, that are related to the response variable in some way. When using a statistical model the θ will normally need to be estimated. The driving force be-hind this estimation process is the likelihood, L(θ) = p(y; θ). When p(y; θ) is viewed as the likelihood function it is treated as a function of θ for fixedy as it represents how (relatively) likely different values of θ are for the observed datay. If you have fixed val-ues of θ and are looking at p(y; θ) as a function of y then it is a probability distribution, as already stated. The values of θ that maximise the value of L(θ) are the maximum likelihood estimates, ˆθ, which can be found by setting the first derivatives of log(L(θ)) to zero. (log(L(θ)) tends to be used instead of L(θ) as it tends to be easier to work with and will give the same ˆθ estimates). These estimates have desirable statistical proper-ties including being consistent (varying less from the true θ as sample size increases) and being asymptotically unbiased and efficient (varying less than other estimates for the same sample size). If one then takes the second derivative of log(L(θ)) then it is possible

to also form confidence intervals for the maximum likelihood estimates.

The Normal Linear Model (LM)

A normal linear model, henceforth referred to as the linear model, in statistics is a model wherey is assumed to be normally distributed with a mean, µ, and constant variance, σ². For this model the mean is itself represented using a linear combination of parameters βmultiplied by explanatory variables,x. The model can therefore be written in the form y∼ N (µ, σ²) where µ = Xβ and X is the model matrix that contains the values of the explanatory variables (or functions of them).

The assumption of linearity in the parameters means that although functions of explana-tory variables can be included, they cannot incorporate the β’s. For example, the term β₁sin(x_i)would be acceptable, but the term sin(β₁x_i)would not be. Further information on linear models including the method of fitting can be found in chapter one of Wood [2006]. The method of fitting revolves around the idea that the β’s should be chosen to minimise the squared difference between the predictions and the observations, that is, they should be chosen to minimise the least squares function, S = ||y − Xβ||². The values minimising S will be those values maximising the likelihood, L(θ).

It is evident from figures 3.4 and 3.5 that neither the normality, nor the constant variance assumption of this model hold for the daily temperature data in the focus regions of this study. It is often the case in climate studies that temperature data have their seasonal cy-cle removed and are standardised by dividing by their standard deviation to try and better match these assumptions. However, this is removing information that could instead be exploited by a more sophisticated model.

The Generalised Linear Model (GLM)

The generalised linear model (GLM) is more flexible than the linear model, allowing the response to be non-Gaussian. In fact, the response can have any distribution in the exponential family of distributions, where any distribution in this family can be written as

p (y; θ, φ) = exp (yθ − b (θ))

a(φ) + c (y, φ)

where a, b and c are functions, φ is a scale parameter and θ is the location, or canonical, parameter of the distribution [Wood, 2006]. The mean and variance of any distribution in this family can be written as µ = b⁰(θ)and σ² = b⁰⁰(θ)a(φ). Many well known distributions can be written in this manner, including the Normal, Binomial and Gamma distributions.

Thus, the linear model is a special case of the generalised linear model. Using a slightly different fitting approach the range of distributions can be extended to anywhere where the mean-variance relationship is known, for an explanation of this the reader is referred to section 2.1.10 in Wood [2006] on quasi-likelihood.

The assumption of linearity in the parameters is relaxed to some extent for the GLM as the model is now written asy ∼ p(y; θ, φ) and a link function is defined as g(µ) = η = Xβ, which relates the mean of the chosen exponential family distribution to the explanatory

4. Creation of the Benchmark Clean Data

variables. Thus, the β’s have to enter the link function linearly, but this link function could be non-linear as long as it is smooth (differentiable) and monotonic.

The equations it is necessary to solve to find the parameter estimates of a GLM nor-mally require an iterative solution instead of the straightforward least squares estimation method used for the linear model. Therefore, a method known as iteratively re-weighted least squares (IRLS) is employed to get the maximum likelihood estimates of the β’s. A detailed explanation of this method is given in chapter 2 of Wood [2006], an overview, heavily relying on this source is given here.

The least squares function that must be minimised is derived from the log likelihood of the model and can be written as S =Pn

i=1

(yi−µi)²

V (µi) ; the derivation of this expression can be found in section 2.1.2 of Wood [2006]. V (µ_i) is known as the variance function of a GLM and is equal to ^b⁰⁰_w^(θⁱ⁾ and w is a known constant. In matrix form this can be written as S = ||

V_[k]⁻¹[y − µ(β)]||² whereV is a diagonal matrix with V_[k]ii = V (µi[k]) and k denotes the iteration number. Using Taylor expansions of β and introducing the notation G which is the diagonal matrix with G_ii = g⁰(µ_i^[k]) this expression can be manipulated into an iterative least squares expression of the form S = ||

√

W^[k][z^[k]−Xβ]||². Therefore iteration proceeds as follows until convergence occurs:

1. Use the current iterations of µ and η to calculatez^[k] = g⁰(µ^[k])(y−µ^[k]) + η^[k] and the iterative weights from the diagonal matrixW with W_ii^[k] = _{V (µ} ¹

i[k])g⁰(µi[k])². If this is the first iteration then common practice is to set µ_i= y_iand η_i = g(µ_i).

2. Minimise S with respect to β to get β^[k+1] and use these new estimates of β to get updated values of µ and η. Increment k by 1 and repeat until convergence has occurred.

Clearly this extension to the linear model allows more features of the data to be modelled instead of removed and removing the assumption of normality also allows greater flexibil-ity. Link functions could be investigated and functions of explanatory variables included so as to allow a desired relationship with the response to be mimicked. However, this could become very complicated very quickly, also, the aim of this study is not to perfectly explain the given data, as these contain inhomogeneities, but instead to be able to repro-duce data that are like them. Thus, a still more flexible approach would be desirable with even fewer constraints on the model framework and this is what the Generalised Additive Model offers.

The Generalised Additive Model

A generalised additive model (GAM) takes a similar form to the GLM, but with even fewer restrictions in how the explanatory variables can enter the model. y∼ p( y; θ, φ) remains the same, but now g(µ) can contain both traditionalXβ terms and smooth functions of the explanatory variables, f_j(x_ji). These smooth functions are commonly fitted using a spline based approach.

Splines are formed from a set of basis functions that can be combined in such a way as to

create a smooth curve that mimics the behaviour of f(·). If left unconstrained these smooth curves can be over fitted to the data, therefore, it is advisable to add a penalty term so that the spline will be penalised if it becomes too unsmooth. This means that the estimate for the function f (x_i)is chosen to minimisePn

i=1[y_i− f (x_i)] − λR

Range(x)[f⁰⁰(x_i)]²dxwhere λ is a smoothing parameter and the smaller λ is the less the smooth function is penalised.

Because the form of the GAM differs from the GLM, the fitting method also changes and is now Penalised-Iteratively Re-weighted Least Squares (P-IRLS). This fitting mechanism is described below, and is paraphrased from chapter four of Wood [2006].

Before the fitting process is explained it is helpful to introduce some notation. If a spline can be written as a series of basis functions, then each smooth function can be written in matrix form f_j = fX_jβf_jas defined on page 167 of Wood [2006]. This means that the entire GAM link function can be written as g(µ) =Xβ, subject to some identifiability constraints to ensure that this is a one-to-one function. HereX contains the traditional model matrices as from the GLM in addition to the newly defined X_j’s for each smooth function, which are the fXj’s multiplied by a matrix Z that is used to ensure that the aforementioned identifiability constraints are met. Further details on Z can be found on page 168 of Wood [2006]. Similarly, the β contains the parameter vector for the parametric part of the model and the parameters from the smooth function bases.

It would be possible to get the maximum likelihood estimate of β, but this would likely result in over fitting to the data. Which is why penalised likelihood is maximised using P-IRLS. This penalised likelihood is defined as l_p(β) = l(β) −¹₂β^TSβ where S=P

jλjS_j and theS_j are matrices of known coefficients and the reader is referred to section 4.2 of Wood [2006] for a more in depth explanation of these matrices. The λj are smoothing parameters.

If the scale and smoothing parameters are unknown then fitting proceeds as follows, details are not given for the fitting method if these parameters are known as they never were for this thesis. First some new terminology is defined, any terminology not defined here can be assumed to be defined in the same way as for the GLM.

Define the ’influence matrix’ of the GAM as A=X(X^TWX+S)⁻¹X^TW and the trace of this matrix as tr(A)=Pdim(A)

i=1 aii. Then the scale parameter can be estimated as ˆφ =

iV ( ˆµi)⁻¹(yi− ˆµi)²

n−tr(A) . Because the scale parameter has been estimated, the smoothing pa-rameters are estimated so that they minimise the Generalised Cross Validation (GCV) score, Vg = _[n−tr(A)]^{nD( ˆ}^β) 2. Here D( ˆβ)is the deviance of the model, defined as D = 2[l( ˆβ_max)−

l( ˆβ)]φ, and ’l( ˆβ_max)indicates the maximised likelihood of the saturated model: the model with one parameter per data point’, from section 2.1.6 of Wood [2006]. To minimise Vg

a numerical method involving its derivatives can be used inside P-IRLS as the following steps indicate. These are the steps that must be iterated until convergence occurs. What follows is directly taken from page 187 of Wood [2006], further information on the calcu-lation of some of these steps can be found there. Two final pieces of notation to introduce are ρ_k= log(λ_k)and the omission of hats from estimates for ease of writing.

1. ”Evaluate the pseudodata, z_i = ^∂η_∂µⁱ

i(y_i − µ_i) + η_i, and corresponding derivatives,

4. Creation of the Benchmark Clean Data

3. Drop any observations (for this iteration only) for which w_i = 0or ^∂µ_∂ηⁱ

i = 0.

4. Find the parameters, ˆβ, minimisingP

iw²_i(zi− X_iβ)²+ β^THβ +P

ke^ρⁱβ^TS_iβand the derivative vector, _∂ρ^{∂ ˆ}^β

k, for the next iteration” whereH is any positive semi-definite matrix which could be used to impose bounds on the smoothing parameters, regu-late an ill-conditioned problem or just be zero if neither of these are necessary, see section 4.6.1 of Wood [2006].

Here the GAM fitting method that has been described is that used for this study; where the scale parameter, φ, and the smoothing parameters, λj, are unknown and must be estimated along with the β⁰s. The fitting method can vary according to what is known about the model beforehand and the amount of computational cost that is deemed ac-ceptable. For further information on these other methods, the method described here or the possible spline bases available the reader is referred to chapters three and four of Wood [2006].

In document Benchmarking the Performance of Homogenisation Algorithms on Daily Temperature Data (Page 58-62)