4.1.8 The generalized linear model and regularizers

Let 𝑦 be a response variable, 𝒙 the covariates and 𝒃 a vector of coefficients. The standard model for regression is written as in (20):

$$ y = \mathbf{x}^{T}\mathbf{b} + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}) \tag{20} $$

However, this relationship forces the response to vary linearly with the predictors, which can be a restrictive modeling assumption in many cases.

The generalized linear model extends this relationship by introducing a link function $g$ that relates the mean of the response, $\mu = E[y \mid \mathbf{x}]$, to the linear predictor, as in (21):

$$ g(\mu) = \mathbf{x}^{T}\mathbf{b} \tag{21} $$

The response $y$ can follow any distribution from the exponential family.

A common measure of fit for the generalized linear model is the deviance, defined in (22):

$$ D = -2\left[\log p(y \mid \hat{\theta}_{0}) - \log p(y \mid \hat{\theta}_{s})\right] \tag{22} $$

where $\hat{\theta}_{0}$ are the parameters of the fitted model and $\hat{\theta}_{s}$ are the parameters of the saturated model, which has one parameter for every observation.

In place of the standard residuals used in linear regression, other kinds of residuals can be used, such as deviance residuals. The squared deviance residuals sum to the deviance, as in (23):

$$ D = \sum_{i=1}^{n} d_{i}^{2} \tag{23} $$

where $d_{i}$ is the deviance residual of observation $i$, whose form depends on the distribution of the response.
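As an illustration of (22) and (23), the following is a minimal sketch for a Poisson response. It assumes the fitted means are already available; the data values and the names `poisson_loglik` and `deviance_residuals` are hypothetical and chosen only for this example.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y under means mu."""
    total = 0.0
    for yi, mi in zip(y, mu):
        if mi == 0:
            # Only reached for the saturated model at y_i = 0,
            # where the observation has probability 1 (log p = 0).
            continue
        total += yi * math.log(mi) - mi - math.lgamma(yi + 1)
    return total

def deviance_residuals(y, mu):
    """Signed square roots of the Poisson unit deviances."""
    residuals = []
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        unit_deviance = 2.0 * (log_term - (yi - mi))
        root = math.sqrt(max(unit_deviance, 0.0))
        residuals.append(math.copysign(root, yi - mi))
    return residuals

y = [0, 1, 3, 2, 6]                  # made-up observed counts
mu = [0.5, 1.2, 2.5, 2.8, 5.0]       # hypothetical fitted means

# Deviance as in (22): the saturated model sets each mean to its observation.
saturated_mu = [float(v) for v in y]
D = -2.0 * (poisson_loglik(y, mu) - poisson_loglik(y, saturated_mu))

# Relationship (23): the squared deviance residuals sum to D.
d = deviance_residuals(y, mu)
print(D, sum(r * r for r in d))
```

The two printed numbers should agree up to floating-point error, which is the content of (23) for this particular distribution.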

Some models used in this thesis include:

Logistic regression

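Setting the link function to the logit shown in (24):

$$ g(\mu) = \log\frac{\mu}{1 - \mu}, \qquad \text{where } \mu = F(\mathbf{x}^{T}\mathbf{b}) \text{ and } F(z) = \frac{1}{1 + e^{-z}} \tag{24} $$

leads to the logistic regression model. The link function gives the log odds of class A versus class B, which is easily converted to the probability of class A, as shown in (25):

$$ P(\text{class} = A) = \frac{e^{\mathbf{x}^{T}\mathbf{b}}}{1 + e^{\mathbf{x}^{T}\mathbf{b}}} \tag{25} $$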
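The conversion between log odds and probability in (24) and (25) can be sketched as follows. The coefficient and covariate values are hypothetical, and the covariate vector is assumed to carry a leading 1 for the intercept.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Logistic function: maps log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

b = [-1.0, 0.8]   # hypothetical intercept and slope
x = [1.0, 2.5]    # leading 1.0 stands for the intercept term

eta = sum(bi * xi for bi, xi in zip(b, x))   # linear predictor (log odds)
p_class_a = inv_logit(eta)                   # probability of class A, as in (25)

print(eta, p_class_a)
```

Applying `logit` to `p_class_a` recovers `eta`, which shows that the two functions are inverses of each other.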

Ordinal regression

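It is possible to extend the previous model to handle response variables with more than two ordered categories. Let the response be composed of $r$ categories, and let $p_{j}$ be the probability of category $j$. We define the sequence of cumulative logits as shown in (26):

$$
\begin{aligned}
L_{1} &= \log\frac{p_{1}}{p_{2} + p_{3} + \cdots + p_{r}} \\
L_{2} &= \log\frac{p_{1} + p_{2}}{p_{3} + p_{4} + \cdots + p_{r}} \\
&\;\;\vdots \\
L_{r-1} &= \log\frac{p_{1} + p_{2} + \cdots + p_{r-1}}{p_{r}}
\end{aligned}
\tag{26}
$$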

Then, the final model is defined as shown in (27):

$$ \mathbf{L} = \mathbf{a} + \boldsymbol{\beta}^{T}\mathbf{x} \tag{27} $$

where L is a vector containing the cumulative logits for each category. This is equivalent to running multiple logistic regression models.
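To make the cumulative logits in (26) concrete, here is a small sketch that computes them from a vector of category probabilities. The probabilities are made up, and `cumulative_logits` is a hypothetical helper name.

```python
import math

def cumulative_logits(p):
    """Return [L_1, ..., L_{r-1}] for ordered category probabilities p.

    L_j = log((p_1 + ... + p_j) / (p_{j+1} + ... + p_r)).
    """
    total = sum(p)
    logits = []
    cumulative = 0.0
    for pj in p[:-1]:            # the last category has no logit of its own
        cumulative += pj
        logits.append(math.log(cumulative / (total - cumulative)))
    return logits

p = [0.2, 0.5, 0.3]              # made-up probabilities for r = 3 categories
L = cumulative_logits(p)
print(L)
```

With $r$ categories there are $r - 1$ logits, and they increase with the category index because the numerator accumulates probability while the denominator loses it.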

Poisson regression and Negative binomial regression

Another common choice for a link function is the log function, which is used in Poisson regression. Poisson regression can be applied when the response variable is a count, that is, when it takes non-negative integer values.

A core assumption of the Poisson distribution is that the mean is equal to the variance; a variance larger or smaller than the mean is called overdispersion or underdispersion, respectively. This assumption can be checked before running the analysis by comparing the mean and the variance of the response variable, and it can also be tested after the analysis by using residual plots of the fitted values against the true values. In a plot that satisfies the assumption, the variance of the points stays approximately the same across the whole range. If the assumption does not hold, a possible remedy is to use negative binomial regression, which allows the variance to differ from the mean.
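The check of the mean against the variance before the analysis can be sketched as below. This is only a rough marginal check that ignores the covariates; the counts are made up, and `dispersion_ratio` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 under a Poisson model."""
    return variance(counts) / mean(counts)

counts = [0, 1, 0, 2, 7, 0, 1, 12, 0, 3]   # made-up count response
ratio = dispersion_ratio(counts)

if ratio > 1:
    print(f"ratio = {ratio:.2f}: variance exceeds the mean (possible overdispersion)")
else:
    print(f"ratio = {ratio:.2f}: no sign of overdispersion in the raw counts")
```

A ratio well above 1 suggests that negative binomial regression may be worth considering, although the residual-based check after fitting remains the more informative one.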

Also, when estimating the standard errors of the coefficients under heteroscedasticity, alternative methods can be used, such as White’s heteroscedasticity-consistent estimator. The variance of the estimate of the coefficient 𝛽 is given by (28):

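$$ v(\hat{\boldsymbol{\beta}}) = (X^{T}X)^{-1} X^{T} \operatorname{diag}(\hat{\varepsilon}_{1}^{2}, \ldots, \hat{\varepsilon}_{n}^{2})\, X (X^{T}X)^{-1} \tag{28} $$

where $\hat{\varepsilon}_{i}$ are the fitted residuals and $X$ is an $n \times k$ design matrix, where $n$ is the total number of datapoints and $k$ the total number of covariates.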
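A minimal NumPy sketch of the estimator in (28), assuming an ordinary least-squares fit and squared residuals on the diagonal. The simulated data and the function name `white_covariance` are hypothetical and serve only as an illustration.

```python
import numpy as np

def white_covariance(X, y):
    """OLS coefficients and White's heteroscedasticity-consistent covariance."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # X^T diag(resid^2) X, computed without building the n x n diagonal matrix.
    middle = X.T @ (resid[:, None] ** 2 * X)
    return beta, XtX_inv @ middle @ XtX_inv

# Simulated data whose noise level grows with the covariate.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 5.0, n)
X = np.column_stack([np.ones(n), x])            # n x k design matrix, k = 2
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 + 0.5 * x)

beta, cov = white_covariance(X, y)
robust_se = np.sqrt(np.diag(cov))               # robust standard errors
print(beta, robust_se)
```

The square roots of the diagonal of the returned matrix are the robust standard errors of the coefficients.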

Optimization and regularization

The usual optimization goal for these models is the minimization of the sum of squares defined by (29):

$$ \min_{\mathbf{b}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{x}_{i}^{T}\mathbf{b}\right)^{2} \tag{29} $$

It is possible to improve the performance of a model on predictive tasks by imposing a penalty on the size of the weights. This technique is called regularization. A specific kind of regularization is ridge regression (also known as the L2 penalty, or weight decay in neural networks), defined by (30):

$$ \min_{\mathbf{b}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{x}_{i}^{T}\mathbf{b}\right)^{2} + \lambda \lVert \mathbf{b} \rVert_{2}^{2} \tag{30} $$

where the parameter 𝜆 ≥ 0 controls the amount of the penalty. Another choice is the LASSO (or L1) penalty defined by (31):

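$$ \min_{\mathbf{b}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{x}_{i}^{T}\mathbf{b}\right)^{2} + \lambda \lVert \mathbf{b} \rVert_{1} \tag{31} $$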

The LASSO leads to sparse solutions, but it does not deal well with highly correlated variables: it tends to select some of them arbitrarily, and ridge regression has been shown to perform better than the LASSO in this context (Murphy, 2012).

The elastic net is a model that combines both penalties (Zou & Hastie, 2005). The standard version tries to solve an optimization problem of the form defined by (32):

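$$ \min_{\mathbf{b}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{x}_{i}^{T}\mathbf{b}\right)^{2} + \lambda_{1} \lVert \mathbf{b} \rVert_{1} + \lambda_{2} \lVert \mathbf{b} \rVert_{2}^{2} \tag{32} $$

The parameters $\lambda_{1}, \lambda_{2} \geq 0$ control the size of each kind of penalty.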
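The objectives (29) to (32) can be written as a single function, and the ridge case (30) has a closed-form minimizer that makes the shrinkage effect easy to see. The sketch below uses simulated data and hypothetical function names, and for simplicity it penalizes every coefficient, including any intercept.

```python
import numpy as np

def elastic_net_objective(b, X, y, lam1, lam2):
    """Sum of squares plus L1 and squared-L2 penalties, as in (32).

    lam1 = lam2 = 0 gives (29), lam1 = 0 gives ridge (30), lam2 = 0 gives LASSO (31).
    """
    resid = y - X @ b
    return resid @ resid + lam1 * np.abs(b).sum() + lam2 * (b @ b)

def ridge_solution(X, y, lam):
    """Closed-form minimizer of (30): solve (X^T X + lam I) b = X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                      # simulated covariates
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

b_ols = ridge_solution(X, y, 0.0)                 # no penalty
b_ridge = ridge_solution(X, y, 10.0)              # penalty shrinks the weights

print(np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
```

The LASSO and elastic net have no closed form in general and are usually fitted with iterative methods such as coordinate descent, so only their objective is shown here.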

Cook’s distance

When fitting a regression model, it is possible that one or more observations have a strong influence on the fit. One way to measure this is Cook’s distance (Cook, 1979), defined by (33):

$$ D_{i} = \frac{\sum_{j=1}^{n} \left(\hat{Y}_{j} - \hat{Y}_{j(i)}\right)^{2}}{k \cdot MSE} \tag{33} $$

where $D_{i}$ is Cook’s distance for point $i$, $k$ is the number of fitted parameters in the model, $MSE$ is the mean squared error, $\hat{Y}_{j}$ is the prediction for observation $j$ from the original model, and $\hat{Y}_{j(i)}$ is the prediction for observation $j$ from the model fitted with point $i$ omitted.
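Definition (33) can be applied directly by refitting the model once per omitted observation. The following sketch does this for a least-squares fit; the data are made up, the function names are hypothetical, and the MSE is assumed to use $n - k$ degrees of freedom, which the text does not specify.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cooks_distance(X, y):
    """Cook's distance for every observation, by leave-one-out refitting."""
    n, k = X.shape
    fitted = X @ ols_fit(X, y)
    mse = np.sum((y - fitted) ** 2) / (n - k)     # assumption: n - k degrees of freedom
    distances = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                  # drop observation i
        fitted_without_i = X @ ols_fit(X[keep], y[keep])
        distances[i] = np.sum((fitted - fitted_without_i) ** 2) / (k * mse)
    return distances

x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])         # intercept and slope, k = 2
y = 1.0 + 2.0 * x
y[9] += 8.0                                       # one artificially shifted point

D = cooks_distance(X, y)
print(np.argmax(D), D.round(3))
```

Refitting $n$ times is slow for large datasets; for least squares the same values can be obtained from the residuals and leverages of a single fit, but the loop follows the definition more literally.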

In document Predictive modelling of football injuries (Page 41-44)