4 Techniques and methods
4.1 Machine learning and statistical methods
4.1.8 The generalized linear model and regularizers
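Let $y$ be a response variable, $\mathbf{x}$ the covariates and $\mathbf{w}$ a vector of coefficients. The standard model for regression is written as in (20):
$$y = \mathbf{w}^{T}\mathbf{x} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}) \tag{20}$$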
However, this relationship forces the response to vary linearly with the predictors, which can be a restrictive modeling assumption in many cases.
The generalized linear model extends this relationship by introducing a link function $g$, which relates the mean of the response to the linear predictor as defined in (21):
$$g\left(\mathbb{E}[y \mid \mathbf{x}]\right) = \mathbf{w}^{T}\mathbf{x} \tag{21}$$
The response $y$ can follow any distribution from the exponential family.
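To make the role of the link function concrete, the sketch below maps a linear predictor to the mean of the response through the inverse of three common links. The `LINKS` table, the function `mean_response` and all numbers are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import math

# Each entry maps a link name to the pair (link g, inverse link g^-1).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def mean_response(weights, covariates, link_name):
    """Map the linear predictor w^T x to the mean of y via the inverse link."""
    eta = sum(w * x for w, x in zip(weights, covariates))
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)


weights = [0.5, -0.25]      # hypothetical coefficients
covariates = [2.0, 4.0]     # hypothetical covariates, so that w^T x = 0

means = {name: mean_response(weights, covariates, name) for name in LINKS}
```

With a linear predictor of zero, the three links give different means, which shows how the same coefficients describe responses on different scales.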
A common measure of fit used in the generalized linear model is the deviance, defined as in formula (22):
$$D = -2\left[\log p(y \mid \hat{\theta}_{0}) - \log p(y \mid \hat{\theta}_{s})\right] \tag{22}$$
where $\hat{\theta}_{0}$ are the parameters of the model and $\hat{\theta}_{s}$ are the parameters of the saturated model, which has a parameter for every observation.
In place of the standard residuals used in linear regression, other kinds of residuals can be used, such as deviance residuals. The squared deviance residuals sum to the deviance, as stated in (23):
$$D = \sum_{i=1}^{n} d_{i}^{2} \tag{23}$$
where $d_{i}$ is the deviance residual of observation $i$, whose form depends on the distribution of the response.
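As an illustration of (22) and (23), the following sketch computes the deviance of a Poisson model in two ways: from the log-likelihoods of the fitted and saturated models, and as the sum of squared deviance residuals. The choice of the Poisson distribution, the function names and the toy numbers are assumptions made for this example only.

```python
import math


def poisson_log_likelihood(ys, mus):
    """Poisson log-likelihood; y * log(mu) is taken as 0 when y == 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        log_term = y * math.log(mu) if y > 0 else 0.0
        total += log_term - mu - math.lgamma(y + 1)
    return total


def poisson_deviance_residuals(ys, mus):
    """Signed square roots of each observation's contribution to the deviance."""
    residuals = []
    for y, mu in zip(ys, mus):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        unit = max(2.0 * (log_term - (y - mu)), 0.0)  # guard against rounding
        residuals.append(math.copysign(math.sqrt(unit), y - mu))
    return residuals


# Made-up observed counts and fitted means
observed = [0, 2, 3, 7, 4]
fitted = [0.8, 1.9, 3.5, 5.2, 4.6]

# (22): the saturated model reproduces every observation, so its means equal y
deviance_from_likelihoods = -2.0 * (
    poisson_log_likelihood(observed, fitted)
    - poisson_log_likelihood(observed, observed)
)

# (23): sum of squared deviance residuals
residuals = poisson_deviance_residuals(observed, fitted)
deviance = sum(d * d for d in residuals)
```

Both routes should agree up to floating-point error, and the sign of each residual shows whether the model over- or under-predicts that observation.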
Some models used in this thesis include:
Logistic regression
Setting the link function to the logit shown in (24):
$$g(\mu) = \log\frac{\mu}{1-\mu}, \quad \text{where } \mu = F(\mathbf{w}^{T}\mathbf{x}) \text{ and } F(x) = \frac{1}{1+e^{-x}} \tag{24}$$
leads to the logistic regression model. The link function gives the log odds $g$ of class A versus class B, which can easily be converted to the probability of class A as shown in (25):
$$P(\text{class} = A) = \frac{e^{g}}{1 + e^{g}} \tag{25}$$
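The conversion between log odds and probability in (25) can be sketched as follows. The function names and the example probability are hypothetical; the branch on the sign of the log odds is a common numerical precaution, not something prescribed by the text.

```python
import math


def log_odds(p):
    """Log odds of class A given its probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def probability_from_log_odds(g):
    """Inverse of the logit, e^g / (1 + e^g), as in (25)."""
    # Use the algebraically equivalent form 1 / (1 + e^-g) for g >= 0
    # so that math.exp never receives a large positive argument.
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    return math.exp(g) / (1.0 + math.exp(g))


p_class_a = 0.8
g = log_odds(p_class_a)                  # log(0.8 / 0.2) = log(4)
recovered = probability_from_log_odds(g)
```

Applying the two functions in sequence returns the original probability, which confirms that they are inverses of each other.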
Ordinal regression
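It is possible to extend the previous model to handle response variables with more than two ordered categories. Let the response be composed of $J$ ordered categories, and let $\pi_{j}$ denote the probability of category $j$. We define the sequence of cumulative logits as shown in (26):
$$\begin{aligned}
L_{1} &= \log\frac{\pi_{1}}{\pi_{2} + \pi_{3} + \cdots + \pi_{J}} \\
L_{2} &= \log\frac{\pi_{1} + \pi_{2}}{\pi_{3} + \pi_{4} + \cdots + \pi_{J}} \\
&\;\;\vdots \\
L_{J-1} &= \log\frac{\pi_{1} + \pi_{2} + \cdots + \pi_{J-1}}{\pi_{J}}
\end{aligned} \tag{26}$$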
Then, the final model is defined as shown in (27):
$$\mathbf{L} = \boldsymbol{\alpha} + \boldsymbol{\beta}^{T}\mathbf{x} \tag{27}$$
where $\mathbf{L}$ is a vector containing the cumulative logits. This amounts to fitting one binary logistic regression for each cumulative split of the ordered categories.
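The cumulative logits in (26) can be computed directly from a set of category probabilities. The sketch below assumes the probabilities are given in category order and sum to one; the function name and the example probabilities are hypothetical.

```python
import math


def cumulative_logits(probs):
    """Return the cumulative logits for ordered category probabilities.

    For J categories this yields J - 1 values: the k-th value is the log of
    P(category <= k) divided by P(category > k).
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("category probabilities must sum to 1")
    logits = []
    cumulative = 0.0
    for p in probs[:-1]:          # the last category has no logit of its own
        cumulative += p
        logits.append(math.log(cumulative / (1.0 - cumulative)))
    return logits


category_probs = [0.2, 0.3, 0.5]   # made-up probabilities for three categories
logits = cumulative_logits(category_probs)
# logits[0] = log(0.2 / 0.8) and logits[1] = log(0.5 / 0.5)
```

Because the cumulative probability can only grow, the logits are non-decreasing in the category index.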
Poisson regression and Negative binomial regression
Another common choice for a link function is the log function, which is used in Poisson regression. Poisson regression can be applied when the response variable takes non-negative integer values, such as counts.
A core assumption of the Poisson distribution is that the mean is equal to the variance; a variance larger than the mean is called overdispersion and a smaller one underdispersion. This assumption can be checked before running the analysis by comparing the mean and the variance of the response variable, and it can also be checked after the analysis by using residual plots against the fitted values. In a plot that satisfies the assumption, the spread of the points stays approximately the same across the whole range. If the assumption does not hold because of overdispersion, a possible remedy is to use negative binomial regression, which allows the variance to exceed the mean.
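The first check described above, comparing the mean and the variance of the response before fitting, can be sketched as below. The function name `dispersion_ratio` and the counts are made up for illustration, and what counts as "far from 1" is left to the analyst.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean of the response.

    Values near 1 are compatible with the Poisson assumption, values well
    above 1 suggest overdispersion and values well below 1 underdispersion.
    """
    return variance(counts) / mean(counts)


counts = [0, 1, 1, 2, 3, 9, 0, 12, 1, 1]   # made-up count data
ratio = dispersion_ratio(counts)
```

Note that this is only a marginal check on the raw response; dispersion conditional on the covariates is better judged from the residuals of the fitted model.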
Also, when estimating the standard errors of the coefficients under heteroscedasticity, alternative methods can be used, such as White's heteroscedasticity-consistent estimator. The variance of the estimate $\hat{\beta}$ of the coefficients is given by (28):
$$v(\hat{\beta}) = (X^{T}X)^{-1}\, X^{T} \operatorname{diag}(\hat{\varepsilon}_{1}^{2}, \ldots, \hat{\varepsilon}_{n}^{2})\, X\, (X^{T}X)^{-1} \tag{28}$$
where $\hat{\varepsilon}_{i}$ are the fitted residuals and $X$ is an $n \times p$ design matrix, where $n$ is the total number of data points and $p$ the total number of covariates.
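For a single covariate and no intercept (an assumption made here to keep the example free of matrix code), the design matrix has one column and (28) reduces to a ratio of sums. The sketch below computes that special case; the function names and data are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope for a regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def white_variance(xs, ys):
    """White's estimator (28) for one covariate, i.e. p = 1.

    X'X is the scalar sum of x_i^2 and X' diag(e_i^2) X is the scalar
    sum of x_i^2 * e_i^2, so the sandwich becomes meat / bread^2.
    """
    slope = ols_slope(xs, ys)
    residuals = [y - slope * x for x, y in zip(xs, ys)]
    bread = sum(x * x for x in xs)
    meat = sum((x * e) ** 2 for x, e in zip(xs, residuals))
    return meat / bread ** 2


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.4, 3.6, 5.9]   # made-up responses
slope = ols_slope(xs, ys)
robust_variance = white_variance(xs, ys)
robust_standard_error = robust_variance ** 0.5
```

Unlike the classical formula, this estimate weights each squared residual by its own covariate value instead of pooling all residuals into one common error variance.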
Optimization and regularization
The usual optimization goal for the linear model is the minimization of the sum of squared errors defined by (29):
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}\right)^{2} \tag{29}$$
It is possible to improve the performance of a model on predictive tasks by imposing a penalty on the size of the weights. This technique is called regularization. A specific kind of regularization is ridge regression (also known as the L2 penalty, or weight decay in neural networks), defined by (30):
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}\right)^{2} + \lambda \lVert \mathbf{w} \rVert_{2}^{2} \tag{30}$$
where the parameter $\lambda \geq 0$ controls the amount of the penalty. Another choice is the LASSO (or L1) penalty, defined by (31):
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}\right)^{2} + \lambda \lVert \mathbf{w} \rVert_{1} \tag{31}$$
The LASSO leads to sparse solutions, but it does not deal well with highly correlated variables: it tends to select some of them arbitrarily, and ridge regression has been shown to perform better than the LASSO in this setting (Murphy, 2012).
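The elastic net is a model that combines both penalties (Zou & Hastie, 2005). The standard version solves an optimization problem of the form defined by (32):
$$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i}\right)^{2} + \lambda_{1} \lVert \mathbf{w} \rVert_{1} + \lambda_{2} \lVert \mathbf{w} \rVert_{2}^{2} \tag{32}$$
The parameters $\lambda_{1}, \lambda_{2} \geq 0$ control the size of each kind of penalty.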
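With a single covariate and no intercept (a simplifying assumption for this example), the elastic net problem in (32) has a closed-form solution, and ridge and the LASSO appear as the special cases where one of the two penalties is zero. The sketch below uses hypothetical function names and made-up data.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_weight(xs, ys, lam1, lam2):
    """Minimiser of sum (y - w*x)^2 + lam1*|w| + lam2*w^2 for one covariate.

    Setting the derivative to zero gives
    w = soft_threshold(sum x*y, lam1 / 2) / (sum x^2 + lam2).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam1 / 2.0) / (sxx + lam2)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]   # made-up data, roughly y = 2x

w_unpenalized = elastic_net_weight(xs, ys, 0.0, 0.0)   # ordinary least squares
w_ridge = elastic_net_weight(xs, ys, 0.0, 5.0)         # L2 penalty only
w_lasso = elastic_net_weight(xs, ys, 4.0, 0.0)         # L1 penalty only
w_sparse = elastic_net_weight(xs, ys, 100.0, 0.0)      # L1 penalty large enough to zero the weight
```

The ridge penalty rescales the weight without ever making it exactly zero, while a large enough L1 penalty sets it to zero, which is the source of the sparsity mentioned above.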
Cook's distance
When fitting a regression model, it is possible that one or more observations have a strong influence on the fit. One way to measure this influence is Cook's distance (Cook, 1979), defined by (33):
$$D_{i} = \frac{\sum_{j=1}^{n} \left(\hat{y}_{j} - \hat{y}_{j(i)}\right)^{2}}{k \cdot \mathrm{MSE}} \tag{33}$$
where $D_{i}$ is Cook's distance for point $i$, $k$ is the number of fitted parameters in the model, MSE is the mean squared error, $\hat{y}_{j}$ is the prediction for observation $j$ from the original model, and $\hat{y}_{j(i)}$ is the prediction for observation $j$ from the model fitted with point $i$ omitted.
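The definition in (33) can be applied literally by refitting the model once per omitted point. The sketch below does this for a straight-line fit with an intercept, so that k = 2. It assumes the MSE is the residual sum of squares divided by n - k, which is the usual convention for Cook's distance but is not stated in the text; the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cooks_distances(xs, ys):
    """Cook's distance of every point, obtained by leave-one-out refits."""
    n, k = len(xs), 2                      # intercept and slope
    a, b = fit_line(xs, ys)
    full_predictions = [a + b * x for x in xs]
    # Assumption: MSE = residual sum of squares / (n - k)
    mse = sum((y - p) ** 2 for y, p in zip(ys, full_predictions)) / (n - k)
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Squared change in the prediction for every point j, including i
        shift = sum((p - (a_i + b_i * x)) ** 2
                    for p, x in zip(full_predictions, xs))
        distances.append(shift / (k * mse))
    return distances


xs = [1.0, 2.0, 3.0, 4.0, 10.0]
ys = [1.2, 1.9, 3.2, 3.9, 4.0]   # made-up data; the last point lies far from the rest
distances = cooks_distances(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: distances[i])
```

Refitting n times is wasteful for large datasets, where the equivalent formula based on leverages is normally preferred, but it mirrors the definition directly.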