Maximum Likelihood Estimates - Generalized Linear Models (GLMs)

CHAPTER 2 OVERVIEW

2.2 Generalized Linear Models (GLMs)

2.2.6 Maximum Likelihood Estimates

This section covers parameter estimation for generalized linear models by maximum likelihood. Although closed-form expressions for the estimators exist in some particular cases, iterative numerical methods are often required. We first briefly review the least squares and weighted least squares methods, and then describe the MLE for GLMs using an iterative variant of weighted least squares called Iteratively Reweighted Least Squares (IRLS).

Least squares: the simplest form of the method of least squares in linear regression finds the estimator $\hat{w}$ that maximizes the log-likelihood, which is equivalent to minimizing the sum of squared differences between the $y_i$ and their expected values. The solution is

$$\hat{w} = (X^\top X)^{-1} X^\top z, \tag{2.42}$$

where $z$ is the vector of responses (in ordinary linear regression $z_i = y_i$); in the GLM setting, $z_i$ is called the working dependent variable.
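As a minimal illustration of (2.42), the sketch below computes the least squares estimate with NumPy on synthetic data. The function name `ols_estimate`, the simulated data, and the layout of `X` with one observation per row (so that the linear predictor is computed as `X @ w`) are assumptions made for this example and do not come from the original text.

```python
import numpy as np

def ols_estimate(X, z):
    """Least squares estimate (X^T X)^{-1} X^T z for an (n, p) model matrix X."""
    # Solve the normal equations instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ z)

# Synthetic data (assumed for illustration): intercept plus one predictor.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
w_true = np.array([1.0, 2.0])
z = X @ w_true + 0.1 * rng.normal(size=50)

w_hat = ols_estimate(X, z)
print(w_hat)
```

Solving the linear system is generally preferable to inverting $X^\top X$, although both correspond to the same formula.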

Weighted least squares: when the variability of the error is unequal across the predictions, a standard remedy is to compensate for this nonconstant error variance by inserting a diagonal matrix $\Omega$ into the estimation of $w$, so that the observed heteroscedasticity is alleviated. The $\Omega$ matrix can be constructed by taking the error variance of the $i$th response at the trial estimate, as described in (2.34), and assigning its inverse to the $i$th diagonal entry of $\Omega$. The main motivation is that large error variances are mitigated by multiplication by their reciprocals (Gill, 2000).

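To describe weighted regression further, recall the standard form of the linear regression,

$$y_i = x_i^\top w + \varepsilon_i.$$

Multiplying each term by the square root of the weight matrix, $\Omega^{1/2}$, which is produced from a Cholesky factorization, yields

$$\Omega^{1/2} y_i = \Omega^{1/2} x_i^\top w + \Omega^{1/2} \varepsilon_i. \tag{2.43}$$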

Thus, if the heteroscedasticity in the error term is expressed as $\varepsilon \sim (0, \sigma^2 V)$, it follows from (2.43) that $\Omega^{1/2}\varepsilon \sim (0, \sigma^2 \Omega V) = (0, \sigma^2 I)$, since $\Omega = V^{-1}$, and the heteroscedasticity is therefore eliminated. As a result, the parameters are estimated by minimizing $(z - X^\top w)^\top \Omega\, (z - X^\top w)$ rather than $(z - X^\top w)^\top (z - X^\top w)$ as in ordinary least squares. The estimator for weighted least squares takes the form

$$\hat{w} = (X^\top \Omega X)^{-1} X^\top \Omega z, \tag{2.44}$$

where $X$ is the model matrix, $\Omega$ is a diagonal weight matrix with entries $\omega_i$, and $z$ is the vector of responses with entries $z_i$ (Gill, 2000).
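The weighted estimator in (2.44) and its equivalence to ordinary least squares on data rescaled by $\Omega^{1/2}$ can be checked numerically. The following is a sketch under stated assumptions: the name `wls_estimate`, the synthetic data, the error standard deviation proportional to the predictor, and the row-per-observation layout of `X` are all hypothetical choices for illustration.

```python
import numpy as np

def wls_estimate(X, z, omega):
    """Weighted least squares (X^T Ω X)^{-1} X^T Ω z with Ω = diag(omega)."""
    # X.T * omega scales column i of X.T by omega[i], i.e. X.T @ diag(omega),
    # without building the full n x n diagonal matrix.
    XtO = X.T * omega
    return np.linalg.solve(XtO @ X, XtO @ z)

# Synthetic heteroscedastic data (assumed): error s.d. grows with x.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
sd = 0.2 * x
z = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=n)

omega = 1.0 / sd**2                      # weights are inverse error variances
w_wls = wls_estimate(X, z, omega)

# Equivalent route: rescale rows by sqrt(omega) and run ordinary least squares.
root = np.sqrt(omega)
w_rescaled = np.linalg.lstsq(X * root[:, None], z * root, rcond=None)[0]
print(w_wls, w_rescaled)
```

Both routes should agree up to floating-point error, which mirrors the argument that the transformation in (2.43) removes the heteroscedasticity.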

Iteratively Reweighted Least Squares (IRLS): when the individual variances, whose reciprocals form the diagonal matrix $\Omega$, are a function of the mean, $v_i = f(\mu_i) = f\{E(y_i)\}$, and the function $f(\cdot)$ is known, the estimation algorithm can still be straightforward. The solution in this circumstance is to estimate the weight matrix $\Omega$ iteratively, improving the estimates at each trial.

Given $\mu = g^{-1}(X^\top w)$, an estimate $\hat{w}$ provides an estimate of the mean, and vice versa. The IRLS algorithm estimates these quantities by improving the weight matrix iteratively, as follows (Gill, 2000):

1. Initialize the diagonal weight matrix $\Omega$, generally to the identity ($\omega_i = 1$).

2. Estimate $w$ using weighted least squares in (2.44) with the current weight values.

3. Update the weight matrix using the new estimates of the mean.

4. Repeat steps 2 and 3 until the iterative process converges, in the sense that the estimate $\hat{w}$ changes by less than a prespecified threshold.
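The four steps above can be sketched as a short loop. This is only an illustrative sketch under several assumptions that are not in the original text: an identity link (so the mean is `X @ w`), a user-supplied variance function `var_fn` playing the role of $f(\cdot)$, a variance proportional to the squared mean in the example, and the hypothetical names `irls_known_variance`, `tol`, and `max_iter`.

```python
import numpy as np

def irls_known_variance(X, y, var_fn, tol=1e-8, max_iter=100):
    """IRLS for a linear mean X @ w with known variance function var_fn(mu)."""
    n, p = X.shape
    omega = np.ones(n)                      # step 1: unit weights
    w = np.zeros(p)
    for _ in range(max_iter):
        XtO = X.T * omega                   # X^T Ω for diagonal Ω
        w_new = np.linalg.solve(XtO @ X, XtO @ y)   # step 2: WLS as in (2.44)
        mu = X @ w_new                      # new estimate of the mean
        omega = 1.0 / var_fn(mu)            # step 3: weights = reciprocal variances
        if np.max(np.abs(w_new - w)) < tol: # step 4: stop when w barely changes
            return w_new
        w = w_new
    return w

# Synthetic data (assumed): error variance proportional to the squared mean.
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
mu_true = X @ np.array([1.0, 2.0])
y = mu_true + 0.1 * mu_true * rng.normal(size=n)

w_irls = irls_known_variance(X, y, var_fn=lambda mu: mu**2)
print(w_irls)
```

The stopping rule compares successive parameter vectors; other criteria, such as the change in the objective, would be equally reasonable choices.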

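Suppose that the method of maximum likelihood is applied to GLMs with a canonical link of the form $\eta_i = g(\mu_i) = x_i^\top w$. The log-likelihood function then takes the form

$$\ell(w; y) = \log L(w; y) = \sum_{i=1}^{n} \left[ \frac{y_i\theta_i - b(\theta_i)}{a(\sigma)} + c(y_i, \sigma) \right], \tag{2.45}$$

with the derivative of the log-likelihood easily seen to be

$$\frac{\partial \ell}{\partial w} = \sum_{i=1}^{n} \frac{\partial \ell}{\partial \theta_i}\frac{\partial \theta_i}{\partial w} = \sum_{i=1}^{n} \frac{1}{a(\sigma)}\left( y_i - \frac{\partial b(\theta_i)}{\partial \theta_i} \right) x_i = \sum_{i=1}^{n} \frac{1}{a(\sigma)}\,(y_i - \mu_i)\, x_i, \tag{2.46}$$

since $\theta_i = \eta_i = x_i^\top w$ under the canonical link and $\mu_i = \partial b(\theta_i)/\partial \theta_i$.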
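The gradient in (2.46) can be checked numerically for a concrete family. The sketch below assumes a Poisson model with the canonical log link, for which $a(\sigma) = 1$, $b(\theta) = e^{\theta}$, and $\mu_i = e^{\eta_i}$; the function names, the synthetic data, and the row-per-observation layout of `X` are assumptions made for illustration.

```python
import numpy as np

def poisson_loglik(w, X, y):
    """Poisson log-likelihood in w; the term c(y_i, σ) = -log(y_i!) is
    constant in w and is omitted."""
    eta = X @ w
    return np.sum(y * eta - np.exp(eta))

def poisson_score(w, X, y):
    """Analytic gradient: sum_i (y_i - mu_i) x_i, written as X^T (y - mu)."""
    mu = np.exp(X @ w)
    return X.T @ (y - mu)

# Synthetic count data (assumed for illustration).
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))
w = np.array([0.2, -0.1])                  # arbitrary evaluation point

# Central finite differences of the log-likelihood, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (poisson_loglik(w + eps * e, X, y) - poisson_loglik(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(w))
])
analytic = poisson_score(w, X, y)
print(numeric, analytic)
```

Agreement between the two vectors supports the closed-form gradient for this particular family; it is a numerical check, not a proof.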

Using the canonical link, the maximum likelihood estimates of the parameters are obtained by solving the following score equations for $w$:

$$U(w) = \sum_{i=1}^{n} \frac{1}{a(\sigma)}\,(y_i - \mu_i)\, x_i = 0, \tag{2.47}$$

where in most cases $a(\sigma)$ is treated as a known constant. Because a constant $a(\sigma)$ does not affect the solution, the equations can be written in matrix form as

$$U(w) = X^\top (y - \mu) = 0, \tag{2.48}$$

where $\mu = [\mu_1, \mu_2, \ldots, \mu_n]$ and $\mu_i = g^{-1}(x_i^\top w)$. Note that $\mu_i$ is a function of $\eta_i = x_i^\top w$ that is not necessarily linear; thus, in general there is no closed-form solution for $w$. This system of equations is called the maximum likelihood score equations and holds for GLMs with a canonical link. A first-order Taylor expansion about a trial value $w^*$ gives the approximation

$$\mu \approx \mu^* + \Omega\,(X^\top w - X^\top w^*), \tag{2.49}$$

where $\mu^* = g^{-1}(X^\top w^*)$ (Hastie et al., 2009).

Therefore, the following linear approximation is obtained for the score for $w$:

$$\frac{\partial \ell}{\partial w} = X^\top \Omega\,(z - X^\top w), \tag{2.50}$$

where $z = X^\top w^* + \Omega^{-1}(y - \mu^*)$ is known as the working dependent variable. This approximation is computed at $w^*$ or, equivalently, $\mu^*$, both of which are treated as constants in the equation. Setting the approximate score to zero and solving gives the maximum likelihood estimator $\hat{w}$ as

$$\hat{w} = (X^\top \Omega X)^{-1} X^\top \Omega z,$$

where $\Omega$ plays the same role as the diagonal weight matrix in weighted least squares. Note that for the canonical link the $\Omega$ matrix is determined by the mean-variance relationship, and it also affects the variability of $\hat{w}$. The maximum likelihood solution for $\hat{w}$ can be computed by progressively improving the weights with IRLS, as described before (Myers et al., 2012).

Suppose that we have a trial estimate $\hat{w}$ of the parameters. The estimated linear predictor $\hat{\eta}_i = x_i^\top \hat{w}$ can then be computed and used to evaluate the fitted values $\hat{\mu}_i = g^{-1}(\hat{\eta}_i)$. Given $\hat{\eta}_i$ and $\hat{\mu}_i$, the working dependent variable $z_i$ is

$$z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)\,\frac{\partial \eta_i}{\partial \mu_i}, \tag{2.51}$$

where the term $\partial \eta_i / \partial \mu_i$ denotes the derivative of the link function evaluated at the current iteration (the trial estimate). Furthermore, the iterative weights turn out to be

$$\omega_i = \frac{p_i}{b''(\theta_i)\left(\dfrac{\partial \eta_i}{\partial \mu_i}\right)^{2}}, \tag{2.52}$$

under the assumption that $a(\sigma) = \sigma / p_i$, where $b''(\theta_i)$ is the second derivative of $b(\theta_i)$ with respect to $\theta_i$, evaluated at the trial estimate. Finally, an updated estimate of the parameter $w$ is obtained by regressing the working dependent variable $z_i$ on the predictors $x_i$ using the weights $\omega_i$. The update is repeated until successive estimates change by less than a chosen threshold. Hence the estimates of the parameter $w$ can be obtained simply with an IRLS algorithm (Charnes et al., 1976).
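The update built from (2.51) and (2.52) can be sketched for one concrete family. The example below assumes a Poisson model with log link and $p_i = 1$, for which $b''(\theta_i) = \mu_i$ and $\partial \eta_i / \partial \mu_i = 1/\mu_i$; the function name `irls_poisson`, the starting value of zero, the tolerance, and the synthetic data are all hypothetical choices, and `X` again stores one observation per row.

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=50):
    """IRLS sketch for Poisson regression with the canonical log link."""
    n, p = X.shape
    w = np.zeros(p)                              # assumed starting value
    for _ in range(max_iter):
        eta = X @ w                              # linear predictor at trial estimate
        mu = np.exp(eta)                         # fitted values g^{-1}(eta)
        deta_dmu = 1.0 / mu                      # derivative of the log link
        z = eta + (y - mu) * deta_dmu            # working dependent variable (2.51)
        omega = 1.0 / (mu * deta_dmu**2)         # weights (2.52), p_i = 1, b'' = mu
        XtO = X.T * omega
        w_new = np.linalg.solve(XtO @ X, XtO @ z)   # weighted regression of z on X
        if np.max(np.abs(w_new - w)) < tol:      # stop when the estimate stabilizes
            return w_new
        w = w_new
    return w

# Synthetic count data (assumed for illustration).
rng = np.random.default_rng(4)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
w_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ w_true))

w_hat = irls_poisson(X, y)
score = X.T @ (y - np.exp(X @ w_hat))            # should be close to zero
print(w_hat, score)
```

At convergence the score $X^\top(y - \mu)$ should be numerically close to zero, which is one way to confirm that the iteration has reached a solution of the score equations.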

It is proved in (McCullagh, 1984) that this procedure is equivalent to Fisher scoring and yields the maximum likelihood estimates. Recall that for data from a normal distribution with identity link $\eta_i = \mu_i$, the derivative equals

$$\frac{\partial \eta_i}{\partial \mu_i} = 1,$$

In document Learning Activation Functions in Deep Neural Networks (Page 41-45)