4.1.8 The generalized linear model and regularizers

Let 𝑦 be a response variable, 𝒙 the covariates and 𝒃 a vector of coefficients. The standard model for regression is written as in (20):

$$ y = \boldsymbol{x}^{T}\boldsymbol{b} + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}) \tag{20} $$

However, this relationship forces the response to vary linearly with the predictors, which can be a restrictive modeling assumption in many cases.

The generalized linear model extends this relationship by introducing the notion of the link function 𝑔 defined in (21):

$$ g(\mu) = \boldsymbol{x}^{T}\boldsymbol{b}, \qquad \mu = \mathbb{E}[y \mid \boldsymbol{x}] \tag{21} $$

The variable 𝑦 can follow any distribution from the exponential family and the function 𝑔 is called the link function.

A common measure of fit used in the generalized linear model is the deviance, defined in (22):

$$ D = -2\left[\log p(y \mid \hat{\theta}_{0}) - \log p(y \mid \hat{\theta}_{s})\right] \tag{22} $$

where $\hat{\theta}_{0}$ are the parameters of the model and $\hat{\theta}_{s}$ are the parameters of the saturated model, which has a parameter for every observation.

In place of the standard residuals used in linear regression, other kinds of residuals can be used, such as deviance residuals. The deviance equals the sum of the squared deviance residuals, as stated in (23):

$$ D = \sum_{i=1}^{n} d_{i}^{2} \tag{23} $$

where $d_{i}$ is the deviance residual of observation $i$, whose form depends on the distribution of the response.
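As an illustration of relationship (23), the following minimal sketch computes the deviance of (22) and the deviance residuals for a Poisson response. The choice of the Poisson distribution, the function names, and the example values are assumptions made for illustration only; other response distributions lead to a different residual formula.

```python
import numpy as np

def poisson_log_lik(y, mu):
    """Poisson log-likelihood without the log(y!) term, which cancels in (22)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Where y == 0 the term y * log(mu) is 0; a dummy mean of 1 avoids log(0).
    safe_mu = np.where(y > 0, mu, 1.0)
    return float(np.sum(y * np.log(safe_mu) - mu))

def poisson_deviance_residuals(y, mu):
    """Signed square roots of the per-observation deviance contributions."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)  # y * log(y / mu) is 0 when y == 0
    unit_deviance = 2.0 * (y * np.log(ratio) - (y - mu))
    # Clip tiny negative values that can arise from rounding.
    return np.sign(y - mu) * np.sqrt(np.maximum(unit_deviance, 0.0))

y = np.array([0.0, 2.0, 5.0, 3.0])   # made-up counts
mu = np.array([0.5, 2.5, 4.0, 3.0])  # made-up fitted means

# The saturated model fits every observation exactly, so its means equal y.
deviance = -2.0 * (poisson_log_lik(y, mu) - poisson_log_lik(y, y))
d = poisson_deviance_residuals(y, mu)
print(deviance, np.sum(d ** 2))  # the two values agree up to rounding
```

The sign of each residual shows whether the observation lies above or below its fitted mean, which is why deviance residuals can stand in for ordinary residuals in diagnostic plots.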

Some models used in this thesis include:

Logistic regression

Setting the link function to the one shown in (24):

$$ g(\mu) = \log\frac{\mu}{1-\mu}, \qquad \text{where } \mu = F(\boldsymbol{x}^{T}\boldsymbol{b}) \text{ and } F(z) = \frac{1}{1+e^{-z}} \tag{24} $$

leads to the logistic regression model. The link function gives the log odds of class A versus class B, which can easily be converted to the probability of class A, as shown in (25):

$$ P(\text{class} = A) = \frac{e^{\boldsymbol{x}^{T}\boldsymbol{b}}}{1+e^{\boldsymbol{x}^{T}\boldsymbol{b}}} \tag{25} $$
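The conversion between log odds and class probability can be sketched in a few lines. The function names `log_odds` and `probability` are hypothetical, and the argument `z` stands for the value of the linear predictor; this is an illustration, not the implementation used in the thesis.

```python
import math

def log_odds(p):
    """Log odds of class A given its probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def probability(z):
    """Probability of class A for log odds z, i.e. e^z / (1 + e^z)."""
    if z >= 0:
        # Algebraically identical form that avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

p_even = probability(0.0)  # log odds of 0 correspond to even odds
z_four = log_odds(0.8)     # odds of 4 to 1 in favour of class A
print(p_even, z_four, probability(z_four))
```

The two functions are inverses of each other, so a coefficient can be read on the log-odds scale and a prediction reported on the probability scale.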

Ordinal regression

It is possible to extend the previous model to handle response variables with more than two ordered categories. Let the response be composed of $r$ categories, with $p_{j}$ the probability of category $j$. We define the sequence of cumulative logits as shown in (26):

$$ L_{1} = \log\frac{p_{1}}{p_{2}+p_{3}+\cdots+p_{r}}, \quad L_{2} = \log\frac{p_{1}+p_{2}}{p_{3}+p_{4}+\cdots+p_{r}}, \quad \ldots, \quad L_{r-1} = \log\frac{p_{1}+p_{2}+\cdots+p_{r-1}}{p_{r}} \tag{26} $$

Then, the final model is defined as shown in (27):

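$$ \boldsymbol{L} = \boldsymbol{a} + \boldsymbol{\beta}^{T}\boldsymbol{x} \tag{27} $$

where $\boldsymbol{L}$ is the vector containing the $r-1$ cumulative logits. This is equivalent to fitting one binary logistic regression model for each cumulative split of the ordered categories.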
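To make the cumulative logits concrete, the sketch below computes them from a vector of category probabilities. The function name and the example probabilities are made up for illustration.

```python
import numpy as np

def cumulative_logits(p):
    """Return the r - 1 cumulative logits for ordered category probabilities p."""
    p = np.asarray(p, dtype=float)
    lower = np.cumsum(p)[:-1]   # p_1, p_1 + p_2, ..., p_1 + ... + p_{r-1}
    upper = p.sum() - lower     # probability mass of the remaining categories
    return np.log(lower / upper)

probs = [0.2, 0.3, 0.5]         # hypothetical probabilities for r = 3 categories
L = cumulative_logits(probs)
print(L)                        # [log(0.2 / 0.8), log(0.5 / 0.5)]
```

Each entry is the log odds of falling at or below a given category, so the entries are non-decreasing by construction.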

Poisson regression and Negative binomial regression

Another common choice for a link function is the log function, which is used in Poisson regression. Poisson regression can be applied when the response variable takes non-negative integer values, such as counts.

A core assumption of the Poisson distribution is that the mean is equal to the variance; violations of this assumption are known as overdispersion (variance larger than the mean) or underdispersion (variance smaller than the mean). The assumption can be checked before running the analysis by comparing the mean and the variance of the response variable, and it can also be assessed after the analysis by using residual plots of the fitted values against the true values. In a plot that satisfies the assumption, the variance of the points stays approximately the same across the whole range. If the assumption does not hold, a possible remedy is to use negative binomial regression, which allows the variance to differ from the mean.
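The pre-analysis check described above, comparing the mean and the variance of the response, can be sketched as follows. The simulated data, the seed, and the helper name `dispersion_ratio` are assumptions for illustration; no threshold for an acceptable ratio is implied by the text.

```python
import numpy as np

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)

# Simulated responses: both have mean 4, but the second has variance 12.
poisson_counts = rng.poisson(lam=4.0, size=5000)
overdispersed_counts = rng.negative_binomial(n=2, p=1 / 3, size=5000)

print(dispersion_ratio(poisson_counts))        # near 1
print(dispersion_ratio(overdispersed_counts))  # clearly above 1
```

This check looks at the marginal distribution of the response only; dispersion conditional on the covariates still has to be assessed from the fitted model.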

Also, when estimating the standard errors of the coefficients under heteroscedasticity, alternative methods can be used, such as White's heteroscedasticity-consistent estimator. The variance of the estimate of the coefficient $\beta$ is given by (28):

$$ v(\hat{\beta}) = (X^{T}X)^{-1} X^{T}\,\mathrm{diag}(\hat{\varepsilon}_{1}^{2}, \ldots, \hat{\varepsilon}_{n}^{2})\, X (X^{T}X)^{-1} \tag{28} $$

where $\hat{\varepsilon}_{i}$ are the fitted residuals and $X$ is an $n \times k$ design matrix, with $n$ the total number of data points and $k$ the total number of covariates.
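A minimal NumPy sketch of this sandwich form is given below for an ordinary least squares fit. It assumes the basic variant that uses squared residuals on the diagonal with no small-sample correction; the function name `white_hc0` and the simulated data are illustrative only.

```python
import numpy as np

def white_hc0(X, y):
    """OLS coefficients and their heteroscedasticity-consistent covariance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    bread = np.linalg.inv(X.T @ X)             # (X^T X)^{-1}
    beta_hat = bread @ X.T @ y
    resid = y - X @ beta_hat
    # X^T diag(resid^2) X without building the n x n diagonal matrix.
    meat = X.T @ (resid[:, None] ** 2 * X)
    return beta_hat, bread @ meat @ bread

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])           # intercept plus one covariate
y = 1.0 + 2.0 * x + rng.normal(scale=x)        # noise spread grows with x

beta_hat, cov = white_hc0(X, y)
robust_se = np.sqrt(np.diag(cov))
print(beta_hat, robust_se)
```

Scaling the rows of `X` by the squared residuals gives the same result as the explicit diagonal matrix while keeping memory use linear in the number of observations.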

Optimization and regularization

The usual optimization goal for these models is the minimization of the sum of squares defined by (29):

$$ \min_{\boldsymbol{b}} \sum_{i=1}^{n} \left(y_{i} - \boldsymbol{x}_{i}^{T}\boldsymbol{b}\right)^{2} \tag{29} $$

It is possible to improve the performance of a model on predictive tasks by imposing a penalty on the size of the weights. This technique is called regularization. A specific kind of regularization is ridge regression (also known as the L2 penalty, or weight decay in neural networks), defined by (30):

$$ \min_{\boldsymbol{b}} \sum_{i=1}^{n} \left(y_{i} - \boldsymbol{x}_{i}^{T}\boldsymbol{b}\right)^{2} + \lambda \lVert \boldsymbol{b} \rVert_{2}^{2} \tag{30} $$

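where the parameter $\lambda \geq 0$ controls the amount of the penalty. Another choice is the LASSO (or L1) penalty, defined by (31):

$$ \min_{\boldsymbol{b}} \sum_{i=1}^{n} \left(y_{i} - \boldsymbol{x}_{i}^{T}\boldsymbol{b}\right)^{2} + \lambda \lVert \boldsymbol{b} \rVert_{1} \tag{31} $$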

The LASSO leads to sparse solutions, but it does not deal well with highly correlated variables. LASSO tends to select some of the variables arbitrarily and ridge regression has been shown to have better performance than LASSO in this context (Murphy, 2012).

The elastic net is a model that combines both penalties (Zou & Hastie, 2005). The standard version tries to solve an optimization problem of the form defined by (32):

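$$ \min_{\boldsymbol{b}} \sum_{i=1}^{n} \left(y_{i} - \boldsymbol{x}_{i}^{T}\boldsymbol{b}\right)^{2} + \lambda_{1} \lVert \boldsymbol{b} \rVert_{1} + \lambda_{2} \lVert \boldsymbol{b} \rVert_{2}^{2} \tag{32} $$

The parameters $\lambda_{1}, \lambda_{2} \geq 0$ control the size of each kind of penalty.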
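The penalized objectives can be written as one function, since setting either penalty weight to zero recovers the ridge, LASSO, or unpenalized case. The sketch below also solves the ridge problem in closed form to show the shrinkage effect. It assumes no intercept term and no standardization of the covariates, and the function names and simulated data are illustrative; the LASSO and elastic net have no closed-form solution and are not solved here.

```python
import numpy as np

def elastic_net_objective(b, X, y, lam1=0.0, lam2=0.0):
    """Sum of squares plus an L1 penalty (lam1) and a squared L2 penalty (lam2)."""
    resid = y - X @ b
    return float(resid @ resid + lam1 * np.sum(np.abs(b)) + lam2 * (b @ b))

def ridge_solution(X, y, lam):
    """Minimizer of the ridge objective: (X^T X + lam * I)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + rng.normal(scale=0.5, size=50)

b_ols = ridge_solution(X, y, lam=0.0)      # lam = 0 gives ordinary least squares
b_ridge = ridge_solution(X, y, lam=10.0)   # a larger lam shrinks the weights

print(np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
print(elastic_net_objective(b_ridge, X, y, lam2=10.0))
```

In practice the penalty weights are usually chosen by cross-validation rather than fixed in advance.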

Cook's distance

When fitting a regression model, it is possible that one or more observations have a strong influence on the fit. One way to measure this is Cook's distance (Cook, 1979), defined by (33):

$$ D_{i} = \frac{\sum_{j=1}^{n} \left(\hat{Y}_{j} - \hat{Y}_{j(i)}\right)^{2}}{k \cdot \mathit{MSE}} \tag{33} $$

where $D_{i}$ is Cook's distance for point $i$, $k$ is the number of fitted parameters in the model, $\mathit{MSE}$ is the mean squared error, $\hat{Y}_{j}$ is the prediction for observation $j$ from the original model, and $\hat{Y}_{j(i)}$ is the prediction for point $j$ from the model fitted with point $i$ omitted.
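Cook's distance can be computed directly from its definition by refitting the model once per omitted observation. The sketch below does this for ordinary least squares. It assumes that the mean squared error is the residual sum of squares divided by $n - k$, which the text does not state explicitly, and the function name and simulated data are illustrative.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation via leave-one-out refits of OLS."""
    n, k = X.shape
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted_full = X @ beta_full
    # Assumption: MSE is the residual sum of squares over n - k.
    mse = np.sum((y - fitted_full) ** 2) / (n - k)
    distances = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # drop observation i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        fitted_i = X @ beta_i                    # predictions for all points
        distances[i] = np.sum((fitted_full - fitted_i) ** 2) / (k * mse)
    return distances

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 10.0, -20.0                       # plant one influential point
X = np.column_stack([np.ones(n), x])

D = cooks_distance(X, y)
print(np.argmax(D), D.max())
```

The loop costs one refit per observation; for least squares the same values can be obtained from the residuals and leverages of a single fit, which is preferable for large data sets.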