Generalized Additive Models - Statistical Considerations

3.9 Statistical Considerations

3.9.1 Generalized Additive Models

3.9.1.1 The Linear Framework: Linear Models and Generalized Linear Models

Important tools in inferential statistics, likelihood-based regression models presume a likelihood for a given response variable (y), modeling the mean as a linear function of a set of linear (or other parametric) covariates (X1, X2, …, Xp). Using maximum likelihood estimation

(MLE), we get the parameters of the linear function (T. J. Hastie & Tibshirani, 1986). In a linear regression model, we investigate/predict an outcome variable, which we believe to be a function of (an)other variable(s). The outcome variable (y) is assumed to be normally distributed with mean µ and variance 𝜎1_{, 𝑦 ~ 𝑁(𝜇, 𝜎}1_{). The Xs are the predictors/covariates, represented in}

Equation 1:

𝑦 = 𝛽L + 𝛽8𝑋8+ 𝛽1𝑋1… + 𝛽/𝑋/ (1)

The coefficients (𝛽) are obtained using ordinary least squares (OLS); summing these coefficients results in the linear predictor, which also directly yields the estimated fitted values (M. Clark, 2016, June 26; Field et al., 2013). The problem with linear models is that they can be limiting in terms of their scope, i.e. ability to model polynomial or wiggly curves. If there is a dichotomous (binary) outcome variable, a linear model does not work and a switch to a generalized linear model (GLM) is necessary. A GLM allows for other types of distributions,

such as binomial, or Poisson, adding a link function, g(), to the equation (e.g. linear regression: link = identity; logistic regression: link = logit). Said link function relates the expected values to a linear predictor still assuming normal distribution for the outcome variable and

homoscedasticity for all observations (M. Clark, 2016, June 26; Gelman & Hill, 2007; Larsen, 2015, July 30), see Equation 2:

𝑔(𝜇) = 𝛽_L+ 𝛽₈𝑋₈+ 𝛽₁𝑋₁ (2)

I offer this comparison not to diminish linear models’ efficacy or dispute their usefulness in many statistical analyses and research settings in any way, but just to show their limitations when it comes to handling non-linearity, and the investigation of patterns that often get overlooked because a general additive model would have been a better fit and a linear model would have fit a straight line through a more complex underlying pattern.

3.9.1.2 The Generalized Additive Model (GAM)

The Generalized Additive Model was invented by Trevor Hastie and Robert Tibshirani (1986). Although Generalized Additive Models (GAMs) have been in existence for a while, data scientists and researchers in many disciplines are not making a lot of use of this statistical

technique, which some have even labeled “the predictive modeling silver bullet” (Larsen, 2015, July 30, p. 1), and called for more extended applications. While GAMs are certainly not a silver bullet, just like most statistical techniques, they lend themselves to analyzing natural data for one simple reason: natural relationships are often non-linear, which can be a problem for standard linear regression models (SLiM) estimated via ordinary least squares (OLS). This makes GAMs

especially applicable to linguistic/language data, as relationships between predictors can often be non-linear. Winter and Wieling (2016) showed exactly this in their analysis of linguistic change over time for instance. Assuming basic familiarity with linear regression and generalized linear models (GLMs, e.g. logit), I will present a short introduction to GAMs and why they were employed in this study, depending on the hypothesis to be tested.

Larsen (2015, July 30), Clark (2016, June 26), and Wood (2006) praise GAMs for several reasons: (1) they are easy to interpret, (2) oft-overlooked hidden patterns in the data can be revealed, and (3) we can avoid/reduce overfitting by regularizing (penalizing) the predictor(s). Simply put, “[r]elationships between the individual predictors and the dependent variable follow smooth patterns that can be linear or non-linear” (Larsen, 2015, July 30, p. 1). Another advantage is that GAMs support various link functions like GLMs, such as the logit link for a dichotomous dependent variable (T. J. Hastie & Tibshirani, 1986; Larsen, 2015, July 30).

Figure 4. Graphical Representation of Possible GAM Structure (Larsen, 2015, July 30, p. 2).

Figure 4 shows how a GAM captures linear relationships (middle), just as well as polynomial relationships (left and right). The additive model extends the linear regression by adding a smoothing function, which denotes the impact of the predictors on the dependent

variable showing underlying patterns, which can be non-linear (T. J. Hastie & Tibshirani, 1986, 1990; Larsen, 2015, July 30), see Equation 3 and compare to Figure 4 above:

𝑔 𝐸 𝑦 = 𝛽_L+ /_;=8𝑓(𝑋_;) (3)

Here, y is the dependent variable, E(y) is the expected value, with g(y) being the link function that relates the outcome to the predictors X1, …, Xp. Beta zero (𝛽L) is the intercept (like

in a SLiM), and f(X1), …, f(Xp) are smooth nonparametric functions, which are estimated

iteratively and automatically during model estimation (T. J. Hastie & Tibshirani, 1986, 1990; Larsen, 2015, July 30; Wood, 2006).

These smooth functions f(∙) can best be explained by invoking this simple example from Wood (2006, p. 120) in Equation 4:

𝑦; = 𝑓 𝑋; + 𝜀; (4)

Here, yi is the dependent variable, Xi is a predictor variable, f is a smooth function, and 𝜀;,

the error term, consists of normally distributed random variables, 𝑁 0, 𝜎1 _{. To estimate f, (4) has}

to become a linear model. By choosing a basis, or basis function, b, which is definitively known, we select a space of which f is an element (Wood, 2006):

Equation 5 illustrates the representation of f for the unknown parameter 𝛽;, which results

in a linear model (Wood, 2006). One of the main differences to other models is that the linear predictor now has smooth functions for one or potentially all covariates/predictors (M. Clark, 2016, June 26). At the core of any GAM, there are basis functions, and smoothing splines, of which there are too many to go into detail here (e.g. cubic splines, and regression splines), to determine the ‘wiggliness’ of non-linear curves with a smoothing parameter lambda (𝜆) between zero and one, which had to be selected manually by the researcher (T. J. Hastie & Tibshirani, 1986), but can now be selected automatically. A 𝜆 of 0.6 often seems to yield good results (Larsen, 2015, July 30; Wood, 2006). With 𝜆, we can control the balance between ‘wiggliness’ of the function f and goodness of fit to the data. As larger values of 𝜆 make f smoother, the curves also tend to have more bias; i.e. in-sample-error (T. J. Hastie & Tibshirani, 1986, 1990; Larsen, 2015, July 30).

Thin plate regression splines (TPRS, Wood, 2003) in particular, often yield excellent results with the mean squared error in mind by combining lower-level functions (e.g. linear, quadratic, cubic, or logarithmic) to fit a smoothed function (Winter & Wieling, 2016). This is probably why TPRS is the default setting in the mgcv-package to conduct GAMs in R (Wood, 2006). As for the interpretation of GAMs, each smooth (non-parametric variable) has a p-value to test whether it is significantly different from 0 and the ‘edf’ (effective degrees of freedom, which indicate the level of non-linearity. An edf of 1 indicates a linear pattern, and any edf > 1 indicates a non-linear pattern. Since there are no coefficients for the predictors, visualization of the model fit is crucial (Winter & Wieling, 2016). The interpretation of the GAM-output, as well as other important issues such as concurvity, selection of smoothing parameters, and

analysis section. Naturally, the present work can only include a cursory introduction, and there is a lot more to be said about GAMs. Larsen (2015, July 30) and Clark (2016, June 26) offer two concise, yet thorough introductions to GAMs with examples in R, which should be

complemented by Hastie and Tibshirani (1986, 1990) and Wood (2006), especially for output interpretation and more detailed explanations of core concepts.

3.10 Software

In document Interactions cubed (Page 115-120)