Logistic Regression. Steve Kroon

Course notes sections: 24.3-24.4

Disclaimer: these notes do not explicitly indicate whether values are vectors or scalars, but expect the reader to discern this from the context.

Scenario — supervised classification

• We are given training data {(x_i, y_i) : i = 1, ..., n} from some (mixture) distribution, where y_i indicates class membership.

• Aim: given a new x, predict the corresponding y.

• This situation is often not deterministic (e.g. given height and weight info, predicting gender).

Using class membership probabilities

• Given prior probabilities for each class, and a generative model for each class, we can use the maximum a posteriori (MAP) estimate of the class.

• This is the class the point has the highest probability of being in.

• To do this, we only need to know which class has the highest P(y|x) at each x.

• More generally, we might not want to pick the class with highest probability (e.g. spam classification, cancer diagnosis, extreme sports).

• Deciding when this is the case, and what to do then, is the subject of decision theory. The theory makes use of a so-called loss function.

• The key insight is that the actual probabilities of each class are useful beyond just the maximum.

• However, we still only need to know P(y|x) at each x.
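As a minimal sketch of the decision-theoretic idea in the list above, the following Python snippet picks the decision with the smallest expected loss rather than the class with the highest probability. The posterior values and the loss matrix are made-up illustrative numbers, and the function name `min_expected_loss_class` is hypothetical.

```python
def min_expected_loss_class(posterior, loss):
    """Return the index of the decision with the smallest expected loss.

    posterior[j] is P(y = j | x); loss[d][j] is the loss incurred when we
    decide class d but the true class is j (illustrative values only).
    """
    expected = [
        sum(loss[d][j] * posterior[j] for j in range(len(posterior)))
        for d in range(len(loss))
    ]
    return min(range(len(expected)), key=expected.__getitem__), expected


# Hypothetical spam example: class 0 = legitimate mail, class 1 = spam.
posterior = [0.3, 0.7]          # spam is the more probable class
loss = [[0.0, 1.0],             # decide "legitimate": costs 1 if it was spam
        [10.0, 0.0]]            # decide "spam": costs 10 if it was legitimate
decision, expected = min_expected_loss_class(posterior, loss)
```

With these numbers the less probable class is chosen, because wrongly discarding legitimate mail is assumed to be ten times as costly as letting spam through.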

Generative vs discriminative models

• We can get class probabilities if we have generative models. (A generative model is a full specification of P(x, y).)

• The key issue: The more parameters you have to estimate from data, the less sure you are of each estimate.


• Since we usually don’t actually know the model for each class, we must estimate it from the class data. Two phases: certain assumptions/prior knowledge, such as normality; followed by estimating parameters from the data.

• If we need the model, there is no problem with this approach.

• However, if we only want to classify, we don’t need to know the marginal distribution P(x), even though generative models provide this information.

• Discriminative models are specifications of the conditional distribution P(y|x).

• Since generative models usually have more parameters than discriminative ones, discriminative models often outperform generative models for classification.

• Note that generative models can be used for tasks discriminative models can’t perform.

What should a discriminative model look like?

We don’t know a model for P(y|x), and have no intuition yet. To develop an intuition, let us look at what P(y|x) looks like when we do know the models generating the data. Assume we have two classes C1 and C2, with prior probabilities P(C1) and P(C2). Then

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \frac{1}{1+\exp(-a(x))} = \sigma(a(x)),$$

where we conveniently define

$$a(x) = \ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}$$

and the logistic function

$$\sigma(y) = \frac{1}{1+\exp(-y)}.$$

Note that σ(y) lies in (0,1), and that a(x) is the so-called log-odds for class membership of x. (You should see a similar expression turning up in assignment 2.) Also, it is worth verifying that the derivative of σ(y) is σ(y)(1−σ(y)).
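The two claims above can be spot-checked numerically. The sketch below compares a central finite difference of the logistic function with σ(y)(1−σ(y)), and compares the posterior from Bayes' rule with the logistic function of the log-odds. The likelihood and prior values are made up for illustration.

```python
import math


def sigmoid(y):
    """Logistic function sigma(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


# Central finite-difference check of d(sigma)/dy = sigma(y) * (1 - sigma(y)).
h = 1e-6
max_derivative_error = max(
    abs((sigmoid(y + h) - sigmoid(y - h)) / (2 * h) - sigmoid(y) * (1 - sigmoid(y)))
    for y in [-3.0, -0.5, 0.0, 1.0, 4.0]
)

# Made-up class-conditional likelihoods and priors at a single point x.
lik_c1, lik_c2 = 0.20, 0.05      # P(x|C1), P(x|C2)
prior_c1, prior_c2 = 0.4, 0.6    # P(C1), P(C2)

posterior_bayes = lik_c1 * prior_c1 / (lik_c1 * prior_c1 + lik_c2 * prior_c2)
log_odds = math.log((lik_c1 * prior_c1) / (lik_c2 * prior_c2))
posterior_logistic = sigmoid(log_odds)
```

Both routes to P(C1|x) should agree up to rounding error.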

Next, we will assume the classes each have Gaussian distributions, with means µ1 and µ2 and covariance matrices Σ1 and Σ2. What is a(x) then?
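As a worked step under the Gaussian assumption just stated, substituting the two multivariate normal densities into the definition of a(x) would give the expression below. The factors of 2π cancel between the classes, while the determinants of the two covariance matrices do not.

```latex
a(x) = \ln\frac{P(C_1)}{P(C_2)}
     + \frac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|}
     - \frac{1}{2}\left[(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)
                       - (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2)\right]
```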

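If we further assume that Σ1 = Σ2 = Σ, we get some cancellation, yielding

$$a(x) = \ln\frac{P(C_1)}{P(C_2)} - \frac{1}{2}\left[(x-\mu_1)^T\Sigma^{-1}(x-\mu_1) - (x-\mu_2)^T\Sigma^{-1}(x-\mu_2)\right].$$

Multiplying out, we get:

$$\begin{aligned}
a(x) &= -\frac{1}{2}\left[(\mu_2-\mu_1)^T\Sigma^{-1}x + x^T\Sigma^{-1}(\mu_2-\mu_1) + (\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2)\right] + \ln\frac{P(C_1)}{P(C_2)} \\
&= [\Sigma^{-1}(\mu_1-\mu_2)]^T x - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{P(C_1)}{P(C_2)} \\
&= w^T x + w_0,
\end{aligned}$$

where these equations define w and w0.

Thus, we find that for 2 classes with equal covariances, but different means, the log-odds is a linear function of the observations.¹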

It follows that in this case, if we used the data to directly estimate the means and covariance matrix, we would estimate 2d + d(d+1)/2 parameters, while if we could directly estimate (w, w0), we would only be estimating d + 1 parameters.
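The linearity of the log-odds can be checked numerically. The sketch below, which assumes NumPy and uses made-up means, covariance and priors, computes P(C1|x) once via Bayes' rule with explicit Gaussian densities and once as σ(wᵀx + w0), taking w = Σ⁻¹(µ1 − µ2) and w0 = −½µ1ᵀΣ⁻¹µ1 + ½µ2ᵀΣ⁻¹µ2 + ln P(C1)/P(C2). It also evaluates the two parameter counts for d = 2.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal at x."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm


# Made-up two-class problem in d = 2 dimensions with a shared covariance.
mu1, mu2 = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
p1, p2 = 0.3, 0.7

cov_inv = np.linalg.inv(cov)
w = cov_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ cov_inv @ mu1 + 0.5 * mu2 @ cov_inv @ mu2 + np.log(p1 / p2)

x = np.array([0.4, -1.2])
posterior_bayes = gaussian_pdf(x, mu1, cov) * p1 / (
    gaussian_pdf(x, mu1, cov) * p1 + gaussian_pdf(x, mu2, cov) * p2
)
posterior_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Parameter counts from the text for d-dimensional observations.
d = len(x)
generative_params = 2 * d + d * (d + 1) // 2   # two means + shared covariance
discriminative_params = d + 1                  # w and w0
```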

The multivariate normal case

Let us now consider the same problem, but with k classes. Then

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{\sum_i P(x|C_i)P(C_i)}.$$

We could go the same route as before (dividing the numerator and denominator by P(x|C1)P(C1)), but that leads to complications with more than 2 classes. Instead, we shall write $a_i(x) = \ln P(x|C_i)P(C_i)$, so that $P(C_1|x) = \exp(a_1(x))/\sum_i \exp(a_i(x))$.² Again assuming Gaussians with shared covariance, and dropping the terms common to all classes (which cancel in the ratio), we eventually conclude that $a_i(x) = w_i^T x + w_{0i}$, where

$$w_i = \Sigma^{-1}\mu_i$$

and

$$w_{0i} = -\frac{\mu_i^T\Sigma^{-1}\mu_i}{2} + \ln P(C_i).$$

Comparing the number of parameters, we have kd + d(d+1)/2 for a generative approach versus k(d+1) for the discriminative approach. If we restrict ourselves to using a diagonal covariance matrix in the generative approach, a la Naive Bayes, the number of parameters is reduced to k(d+1). But now there is a higher chance the model is wrong.
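The multi-class result can be checked in the same way. The sketch below assumes NumPy and made-up means, covariance and priors for k = 3 classes. It compares a softmax over linear exponents, using w_i = Σ⁻¹µ_i and w_0i = −½µ_iᵀΣ⁻¹µ_i + ln P(C_i), with a direct application of Bayes' rule.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax of a vector of exponents."""
    e = np.exp(a - np.max(a))
    return e / e.sum()


# Made-up problem: k = 3 Gaussian classes in d = 2 with a shared covariance.
means = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -2.0]])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])
priors = np.array([0.2, 0.5, 0.3])
cov_inv = np.linalg.inv(cov)

# Linear exponents a_i(x) = w_i^T x + w_0i for each class.
W = means @ cov_inv                                   # row i is w_i^T (cov_inv is symmetric)
w0 = -0.5 * np.sum(W * means, axis=1) + np.log(priors)

x = np.array([0.4, -1.2])
posterior_softmax = softmax(W @ x + w0)

# Direct Bayes computation; factors shared by all classes cancel in the ratio.
diffs = x - means
unnormalized = priors * np.exp(-0.5 * np.sum((diffs @ cov_inv) * diffs, axis=1))
posterior_bayes = unnormalized / unnormalized.sum()
```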

¹ If we assume different covariance matrices, we get a quadratic function of the observations.

² Thus a(x) in the two-class case is $a_1(x) - a_2(x)$.

Finding w

These examples motivate modelling P(y|x) by a logistic function of the log-odds of the observation, which we model using linear functions (for 2 classes); or a softmax function, using linear functions of the observations as exponents (for multi-class problems). More generally, we could use quadratic functions, or even more generally, a linear function of some transformation of the observation. The extension to transformations of the data is in the textbook; we will stick to the linear case here. However, we add a “1” to the feature vector for each observation to get rid of the inconvenient $w_{0i}$.

Let us try to select w using maximum likelihood on a training set. Thus, we try to identify which selection of w was most likely to generate the labels in the training set! We begin by writing down the likelihood of the training data as a function of w (binary case in notes).³ We have P(X, Y|w) = P(Y|X, w)P(X|w), but since P(X|w) = P(X), this equals

$$P(Y|X, w)P(X) = P(X)\prod_i P(y_i|x_i, w).$$

This factorization assumes that the label of $x_i$ is conditionally independent of other observations and labels, given the observation $x_i$.⁴

For mathematical convenience define $t_{ij} = 1$ if $y_i = C_j$, and 0 for the other k−1 classes.⁵ Then the likelihood becomes

$$P(X)\prod_i\prod_j P(y=C_j|x_i, w)^{t_{ij}}.$$

In order to maximize this, we minimize the negative log-likelihood w.r.t. w. This equals

$$-\ln P(X) - \sum_i\sum_j t_{ij}\ln P(y=C_j|x_i, w).$$

Writing $P(y=C_j|x_i, w) = \exp(a_j(x_i))/\sum_r \exp(a_r(x_i))$, we get

$$-\ln P(X) - \sum_i\sum_j t_{ij}\left[a_j(x_i) - \ln\sum_r \exp(a_r(x_i))\right],$$

where the $a_r$ are linear functions of $x_i$: $a_r(x_i) = w_r^T x_i$.

To minimize, we take the gradient w.r.t. each $w_v$:⁶

$$\nabla_{w_v} = -\sum_i\left[t_{iv}x_i - \frac{\exp(a_v(x_i))\,x_i}{\sum_r \exp(a_r(x_i))}\right] = \sum_i\left[\frac{\exp(a_v(x_i))}{\sum_r \exp(a_r(x_i))} - t_{iv}\right]x_i,$$

where we have used the fact that $\sum_j t_{ij} = 1$ for every i.

For an optimum all k of these gradients must simultaneously be zero. This is a non-linear system of k(d+1) equations in k(d+1) unknowns, so we will make use of a numerical optimization technique, Newton-Raphson optimization.
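The negative log-likelihood and its gradient can be written compactly in matrix form. The sketch below assumes NumPy, random made-up data, and hypothetical function names. It drops the constant −ln P(X), stacks the w_v as rows of a matrix `W`, and compares the analytic gradient with central finite differences.

```python
import numpy as np


def softmax_rows(A):
    """Row-wise softmax; row i holds the class probabilities for x_i."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def neg_log_likelihood(W, X, T):
    """-sum_i sum_j t_ij ln P(y=C_j|x_i, w), dropping the constant -ln P(X)."""
    return -np.sum(T * np.log(softmax_rows(X @ W.T)))


def gradient(W, X, T):
    """Row v is sum_i (y_iv - t_iv) x_i, the gradient w.r.t. w_v."""
    return (softmax_rows(X @ W.T) - T).T @ X


rng = np.random.default_rng(0)
n, d, k = 20, 2, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])  # append the constant 1
labels = rng.integers(0, k, size=n)
T = np.eye(k)[labels]                                       # 1-of-k encoding
W = rng.normal(size=(k, d + 1))

# Compare the analytic gradient with central finite differences.
numeric = np.zeros_like(W)
h = 1e-6
for v in range(k):
    for c in range(d + 1):
        step = np.zeros_like(W)
        step[v, c] = h
        numeric[v, c] = (neg_log_likelihood(W + step, X, T)
                         - neg_log_likelihood(W - step, X, T)) / (2 * h)
max_gradient_error = np.max(np.abs(numeric - gradient(W, X, T)))
```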

³ Here X is the observation matrix and Y the vector of labels.

⁴ A common setting for supervised learning is assuming IID data, which satisfies this. This assumption keeps things simple, even if often not quite true.

⁵ This is known as a “1-of-k” encoding.

⁶ Note that ln P(X) is constant w.r.t. w, so this term can be dropped.

Newton-Raphson for multi-class logistic regression

Recall that we wanted to find the elements of w minimizing the negative log-likelihood

$$\ell(w) = -\ln P(X) - \sum_i\sum_j t_{ij}\left[a_j(x_i) - \ln\sum_r \exp(a_r(x_i))\right]$$

with $a_r(x_i) = w_r^T x_i$. Setting the gradient $\nabla\ell(w)$ to zero directly yielded a large non-linear system of equations, which we could not solve analytically.

To apply Newton-Raphson, we need not only $\ell$ and $\nabla\ell$, but also the Hessian $H_\ell$, so we must do further differentiation. To simplify this, let us define $y_v(b) = \exp(b_v)/\sum_r \exp(b_r)$, with $y_{iv} = y_v(a(x_i))$, where $a(x_i)$ is the vector with components $a_r(x_i)$. In this notation, the gradient of the negative log-likelihood (w.r.t. $w_v$) turns out to simply be $\sum_i [y_{iv} - t_{iv}]x_i$.

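Another advantage of this definition is to simplify the calculus. Let us first calculate $\nabla_b y_v$. We have

$$\frac{\partial y_v}{\partial b_v} = \frac{\exp(b_v)\left(\sum_r \exp(b_r) - \exp(b_v)\right)}{\left(\sum_r \exp(b_r)\right)^2} = y_v(b)(1 - y_v(b)),$$

similar to the derivative of the logistic function, while for $j \neq v$ we have

$$\frac{\partial y_v}{\partial b_j} = \frac{-\exp(b_v)\exp(b_j)}{\left(\sum_r \exp(b_r)\right)^2} = -y_v(b)\,y_j(b).$$

Note that these results can be pooled as $\partial y_v/\partial b_j = y_v(b)(I_{vj} - y_j(b))$, where $I_{vj}$ is 1 if v = j and 0 otherwise, so that we do not need to handle the case $j = v$ separately.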
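The pooled derivative formula can be verified numerically. The sketch below assumes NumPy and an arbitrary test point `b`. It builds the matrix with entries y_v(I_vj − y_j) and compares it with central finite differences of the softmax function.

```python
import numpy as np


def softmax(b):
    """y_v(b) = exp(b_v) / sum_r exp(b_r), computed stably."""
    e = np.exp(b - np.max(b))
    return e / e.sum()


def softmax_jacobian(b):
    """Entry (v, j) is dy_v/db_j = y_v (I_vj - y_j)."""
    y = softmax(b)
    return np.diag(y) - np.outer(y, y)


b = np.array([0.5, -1.0, 2.0, 0.1])   # arbitrary test point
h = 1e-6
numeric = np.zeros((len(b), len(b)))
for j in range(len(b)):
    step = np.zeros_like(b)
    step[j] = h
    # Column j holds the derivatives of every y_v with respect to b_j.
    numeric[:, j] = (softmax(b + step) - softmax(b - step)) / (2 * h)
max_jacobian_error = np.max(np.abs(numeric - softmax_jacobian(b)))
```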

Using this, we can find the entries of the Hessian as follows (where $x_i(k)$ denotes the k-th component of $x_i$):⁷

$$\begin{aligned}
\frac{\partial^2 \ell}{\partial w_{v_1,d_1}\,\partial w_{v_2,d_2}}
&= \frac{\partial}{\partial w_{v_1,d_1}}\sum_i [y_{iv_2} - t_{iv_2}]\,x_i(d_2) \\
&= \sum_i x_i(d_2)\,\frac{\partial}{\partial w_{v_1,d_1}}\,y_{v_2}(a(x_i)) \\
&= \sum_i x_i(d_2)\sum_j y_{v_2}(a(x_i))\left(I_{v_2 j} - y_j(a(x_i))\right)\frac{\partial}{\partial w_{v_1,d_1}}\,a_j(x_i),
\end{aligned}$$

where the last step follows from the chain rule. Now, $\frac{\partial}{\partial w_{v_1,d_1}} a_j(x_i)$ is zero for $j \neq v_1$, and $x_i(d_1)$ for $j = v_1$, so that the above expression equals

$$\sum_i y_{v_2}(a(x_i))\left(I_{v_2 v_1} - y_{v_1}(a(x_i))\right)x_i(d_2)\,x_i(d_1),$$

so that we can write the block of the Hessian corresponding to $w_{v_1}$ and $w_{v_2}$ as

$$\sum_i y_{iv_2}\left(I_{v_2 v_1} - y_{iv_1}\right)x_i x_i^T.$$

Now that we can calculate the Hessian and the gradient, we can start with an initial guess (for example, setting all the w’s to zero initially), and then apply Newton-Raphson updates. We leave showing that the Hessian is positive semi-definite to the interested reader.
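The block formula can be turned into a numerical spot check of the positive semi-definiteness claim. This is a sanity check on random made-up data, not a proof. The sketch assumes NumPy, and the function name `hessian` is hypothetical.

```python
import numpy as np


def softmax_rows(A):
    """Row-wise softmax; row i holds y_i1, ..., y_ik."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def hessian(W, X):
    """Full k(d+1) x k(d+1) Hessian; block (v1, v2) is
    sum_i y_iv2 (I_v2v1 - y_iv1) x_i x_i^T."""
    k, m = W.shape
    Y = softmax_rows(X @ W.T)
    H = np.zeros((k * m, k * m))
    for v1 in range(k):
        for v2 in range(k):
            coeff = Y[:, v2] * (float(v1 == v2) - Y[:, v1])   # one weight per data point
            H[v1 * m:(v1 + 1) * m, v2 * m:(v2 + 1) * m] = (X * coeff[:, None]).T @ X
    return H


rng = np.random.default_rng(1)
n, d, k = 30, 2, 3
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
W = rng.normal(size=(k, d + 1))

H = hessian(W, X)
is_symmetric = np.allclose(H, H.T)
min_eigenvalue = np.linalg.eigvalsh(H).min()
```

The smallest eigenvalue is expected to be zero up to rounding, since shifting every w_v by the same vector leaves the class probabilities unchanged.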

⁷ Here $w_{v,d}$ denotes the d-th component of $w_v$.

Two-class logistic regression

It is worth noticing that the solution to the multi-class problem above is not unique: adding the same constant to a given component of all the w vectors yields the same class probabilities. Thus, we can assume that the solution vector for one of the classes is the zero vector. This means that we only need to find the w vectors for k−1 classes, rather than k. In the binary case, this simplifies things considerably.

It is left to the reader to verify that after the adjustment mentioned in the previous paragraph (setting $w_1 = 0$), the softmax function for class probability for the first class reduces to the logistic function discussed earlier, where w now represents the adjusted weight $w_2$.⁸

The negative log-likelihood of the observations, as obtained earlier, is

$$\ell = -\ln P(X) - \sum_i\sum_j t_{ij}\ln P(y=C_j|x_i, w).$$

In this case, we have $P(y=C_1|x_i, w) = \sigma(a(x_i))$ and $P(y=C_2|x_i, w) = 1-\sigma(a(x_i))$, with $a(x_i) = w^T x_i$ (again, an extra feature has been added to the observations to cater for the bias term). For the binary case, it is more convenient to replace the 1-of-k encoding $t_{ij}$ with a binary encoding: $t_i = 1$ if $x_i$ is in class 1, and 0 otherwise. Then, the negative log-likelihood becomes

$$-\ln P(X) - \sum_i\left(t_i\ln\sigma(a(x_i)) + (1-t_i)\ln(1-\sigma(a(x_i)))\right).$$

Next we derive the gradient and Hessian of $\ell$:

$$\begin{aligned}
\frac{\partial\ell}{\partial w_{d_1}}
&= -\sum_i\Big\{t_i[\sigma(a(x_i))]^{-1}\sigma(a(x_i))(1-\sigma(a(x_i)))\,x_i(d_1) \\
&\qquad\quad - (1-t_i)[1-\sigma(a(x_i))]^{-1}\sigma(a(x_i))(1-\sigma(a(x_i)))\,x_i(d_1)\Big\} \\
&= -\sum_i\left[t_i(1-\sigma(a(x_i))) - (1-t_i)\sigma(a(x_i))\right]x_i(d_1) \\
&= \sum_i\left(\sigma(a(x_i)) - t_i\right)x_i(d_1),
\end{aligned}$$

so that $\nabla_w\ell = \sum_i(\sigma(a(x_i)) - t_i)x_i$. Next,

$$\frac{\partial^2\ell}{\partial w_{d_1}\partial w_{d_2}} = \sum_i x_i(d_2)\,\sigma(a(x_i))[1-\sigma(a(x_i))]\,x_i(d_1),$$

so that $H_\ell(w) = \sum_i \sigma(a(x_i))[1-\sigma(a(x_i))]\,x_i x_i^T$.

In order to ensure that our optimization finds a minimum, we show that the Hessian matrix is positive semi-definite. First note that for any i, $x_i x_i^T$ is positive semi-definite, since for any u,

$$u^T x_i x_i^T u = (x_i^T u)^T(x_i^T u) = \|x_i^T u\|^2 \geq 0.$$

⁸ It should also be easy to verify that the new weight vector w equals the difference of the two original weight vectors.

Next we note that since the range of σ is (0,1), the coefficient of $x_i x_i^T$ is always positive, so that the Hessian is a sum of positive semi-definite matrices, and is thus positive semi-definite.⁹

Finally, we can apply Newton-Raphson optimization¹⁰ to the negative log-likelihood to obtain the weight vector w.
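A minimal sketch of the resulting procedure for the binary case follows, assuming NumPy. The data set is a made-up one-dimensional example with overlapping classes, so that a finite optimum exists. The fixed iteration count and the function name `newton_raphson_logistic` are choices made for this illustration only.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def newton_raphson_logistic(X, t, iterations=25):
    """Minimize the binary negative log-likelihood with Newton-Raphson steps.

    X is n x (d+1) with a trailing column of ones; t holds 0/1 labels.
    The fixed iteration count is an arbitrary choice for this sketch.
    """
    w = np.zeros(X.shape[1])                       # initial guess: all zeros
    for _ in range(iterations):
        s = sigmoid(X @ w)
        grad = X.T @ (s - t)                       # sum_i (sigma_i - t_i) x_i
        H = (X * (s * (1 - s))[:, None]).T @ X     # sum_i sigma_i (1 - sigma_i) x_i x_i^T
        w = w - np.linalg.solve(H, grad)
    return w


# Tiny made-up 1-D data set whose classes overlap (not linearly separable).
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([x, np.ones_like(x)])

w = newton_raphson_logistic(X, t)
final_gradient = X.T @ (sigmoid(X @ w) - t)
```

At the returned w the gradient should vanish up to rounding error, which is a convenient convergence check.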

A complication with logistic regression — overfitting

Suppose that a weight vector w leads to a perfect classification of the data set in the binary case, using classification by the class with highest probability, and where we consider all classes equally likely. In such a case, we say the data set is linearly separable. For this classification, the classification boundary (or decision surface) lies where P(C1|x) = 0.5. Since $P(C_1|x) = (1+\exp(-w^T x))^{-1}$, we must have that $w^T x = 0$, i.e. the decision surface is a hyperplane (a line in two dimensions) passing through the origin of the augmented feature space.

Now, consider what happens if we rather classify with $w' = 2w$. In such a case, the decision boundary and all the point classifications remain the same. However, the likelihoods associated with each point now become greater, yielding a higher likelihood solution than the original w. We can continue doubling w repeatedly in this way, leading in the limit to a situation where the predicted probabilities become step functions at the decision boundary. (To clarify this, draw the logistic function as its argument increases.) Although the maths is different, this behaviour manifests to varying degrees even when there are multiple classes, prior probabilities for the classes are unequal, and the classes are not linearly separable.
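The effect of repeatedly doubling w can be seen on a small example. The sketch below uses made-up, linearly separable one-dimensional data and, as a simplifying assumption, a single weight with no bias term, so the decision boundary sits at x = 0.

```python
import math

# Made-up linearly separable 1-D data: class 1 (t = 1) lies at positive x.
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 1)]


def log_likelihood(w):
    """Sum of ln P(t_i | x_i, w) with P(C1|x) = sigma(w * x), no bias term."""
    total = 0.0
    for x, t in data:
        p1 = 1.0 / (1.0 + math.exp(-w * x))
        total += math.log(p1 if t == 1 else 1.0 - p1)
    return total


# Doubling w keeps the decision boundary at x = 0 but raises the likelihood.
weights = [1.0, 2.0, 4.0, 8.0, 16.0]
log_likelihoods = [log_likelihood(w) for w in weights]

# The predicted probability just right of the boundary approaches a step.
prob_near_boundary = [1.0 / (1.0 + math.exp(-w * 0.1)) for w in weights]
```

The log-likelihood keeps increasing towards zero, so the maximum is never attained at a finite w.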

This is a common problem with many machine learning approaches that estimate parameters by optimization, and is known as overfitting. To see why this is a problem, note that your model is now very confident about its classification of future points close to the decision boundary, even though it has never observed data there! Essentially, the only guide available to the algorithm is the data it has been given, and the linear constraint on the decision surface. However, we usually do not expect the probabilities of the classes to change abruptly between 0 and 1. Thus, we must find some way of taking this into account in our calculations. There are two major approaches to doing this, which are closely related: first, one can use a prior distribution on the weight vector w, which is then updated by the likelihood calculation to obtain a maximum a posteriori (MAP) estimate; second, one can penalize choices of w leading to undesirable behaviour in our classifier by adding an extra term to the likelihood function; this is known as regularization.

The relationship between these approaches is that the size of the penalty, or regularization, term, should depend on how likely you think certain values of w are in advance. Thus, the choice of regularization function effectively encodes a prior distribution on the parameters under investigation into the likelihood, so that the optimum of this regularized likelihood function is actually the MAP estimate for the corresponding prior.

⁹ Under fairly general conditions, the Hessian can in fact be shown to be positive definite, by noting that it is a weighted sum of the rank 1 matrices $x_i x_i^T$. However, we do not go into that here, since the regularization we apply later will easily lead to a positive definite Hessian, in any case.

¹⁰ Because each quadratic approximation step in the Newton-Raphson optimization is effectively a weighted least squares fit to the data (a common approach for estimating parameters in statistics), this procedure is sometimes called iterative reweighted least squares (IRLS).

Priors and regularization

Let us assume a normal prior distribution on w. We would prefer smaller w, so let us set the mean of the prior to 0. Also, we have no reason to expect that certain components of w must be larger than others, or that they should be correlated, so let us assume a diagonal covariance matrix, with equal entries on the diagonals (i.e. Σ = λI for some λ > 0).

Given the data set (X, Y), what is the posterior distribution for w? We have

$$p(w|X, Y) = \frac{p(w)\,p(X, Y|w)}{p(X, Y)},$$

where the denominator is not dependent on w. We can maximize this by minimizing the negative logarithm of the numerator,

$$-\log p(w) - \log p(X, Y|w) = \frac{w^T w}{2\lambda} + \ell(w) + C$$

for a constant C, and in this formulation we see that the prior distribution on w has led to the regularization penalty $\lambda^{-1}J(w) = \frac{w^T w}{2\lambda}$.¹¹ This particular form is very convenient from a calculus point of view, since $\nabla J(w) = w$, and thus $H_J(w) = I$. The choice of λ determines how strong one wishes the penalty term to be, and must usually be determined empirically.

To apply regularization in this context is a straightforward modification of the earlier approach: one still uses Newton-Raphson optimization, but rather than optimizing $\ell(w)$, one optimizes $\ell(w) + \lambda^{-1}J(w)$, which has a slightly modified gradient and Hessian.¹²
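A sketch of the modified update for the binary case follows, assuming NumPy. The gradient gains a term w/λ and the Hessian gains I/λ. The data are made up and linearly separable, the bias weight is penalized along with the others (as the formulation above implies), and the iteration count is arbitrary.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def regularized_newton(X, t, lam, iterations=50):
    """Newton-Raphson on l(w) + w^T w / (2 * lam) for binary labels t."""
    w = np.zeros(X.shape[1])
    identity = np.eye(X.shape[1])
    for _ in range(iterations):
        s = sigmoid(X @ w)
        grad = X.T @ (s - t) + w / lam                           # gradient of l plus w / lam
        H = (X * (s * (1 - s))[:, None]).T @ X + identity / lam  # Hessian of l plus I / lam
        w = w - np.linalg.solve(H, grad)
    return w


# Made-up linearly separable data, where the unregularized optimum diverges.
x = np.array([-2.0, -1.0, 0.5, 1.5])
t = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([x, np.ones_like(x)])

w_strong = regularized_newton(X, t, lam=0.1)    # small lam: strong penalty
w_weak = regularized_newton(X, t, lam=10.0)     # large lam: weak penalty
gradient_at_optimum = X.T @ (sigmoid(X @ w_weak) - t) + w_weak / 10.0
```

Even on separable data the regularized optimum is finite, and a smaller λ (a tighter prior) pulls the weights closer to zero.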

Choice of λ

Let us next discuss the selection of λ: λ is an example of an algorithm parameter which we can adjust, or “tune”, in the hope of obtaining good performance for our classifier, although we have no prior guidance for its selection. One way to get an indication of a good choice is to keep some of our training data aside (let us call this part the validation set), and do parameter estimation on the remaining training set for various choices of λ. Our final choice of λ can then be obtained by comparing the performance of the various classifiers built with differing choices of λ on the validation set. Finally, we might re-estimate the parameters for this choice of λ using the whole original training set for our final classifier.¹³
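The validation procedure described above might be sketched as follows, assuming NumPy. The synthetic data, the candidate grid for λ, the 75/25 split, and the use of held-out negative log-likelihood as the performance measure are all choices made for this illustration. Other measures, such as classification accuracy, would be equally reasonable.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def fit(X, t, lam, iterations=50):
    """Regularized Newton-Raphson fit of binary logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        s = sigmoid(X @ w)
        grad = X.T @ (s - t) + w / lam
        H = (X * (s * (1 - s))[:, None]).T @ X + np.eye(X.shape[1]) / lam
        w = w - np.linalg.solve(H, grad)
    return w


def validation_loss(w, X, t):
    """Negative log-likelihood of held-out labels (lower is better)."""
    s = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.sum(t * np.log(s) + (1 - t) * np.log(1 - s))


# Synthetic data: two overlapping 2-D Gaussian classes (illustrative only).
rng = np.random.default_rng(0)
n = 200
t = rng.integers(0, 2, size=n).astype(float)
features = rng.normal(size=(n, 2)) + np.where(t[:, None] == 1, 1.0, -1.0)
X = np.hstack([features, np.ones((n, 1))])

# Hold out the last quarter of the data as the validation set.
split = 3 * n // 4
X_train, t_train, X_val, t_val = X[:split], t[:split], X[split:], t[split:]

candidates = [0.01, 0.1, 1.0, 10.0, 100.0]      # arbitrary grid of lambda values
losses = [validation_loss(fit(X_train, t_train, lam), X_val, t_val)
          for lam in candidates]
best_lam = candidates[int(np.argmin(losses))]

# Re-estimate on all the data with the selected lambda.
w_final = fit(X, t, best_lam)
```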

¹¹ Reviewing the math, we see a general rule of thumb that the regularization penalty corresponds roughly to the negative logarithm of the prior distribution, since for MAP estimates, we typically minimize the negative log posterior, which equals the negative log-likelihood plus the negative log-prior.

¹² Note that the modified Hessian is positive definite now, rather than positive semi-definite.

¹³ Many other approaches, such as cross-validation, are possible, and finding good approaches to handling parameter tuning is somewhat of an art. Much research has been done in this area, but it is fraught with difficulties.

An alternative view

If we consider $\ell(w)$ for the binary case, and ignore the constant −ln P(X), we have

$$-\sum_i\left(t_i\ln\sigma(a(x_i)) + (1-t_i)\ln(1-\sigma(a(x_i)))\right).$$

For each data point, this function calculates the predicted class probability p for the actual class of that point, and adds −ln p to a total. If the probability is close to one, the amount added is small, but for points which are badly misclassified, the amount added can be much larger. This interpretation helps us understand why maximum likelihood approaches overfit: there is pressure to reduce these penalties. However, when regularization is performed, an extra $\lambda^{-1}J(w)$ is added to this function, in such a way that overfitting to reduce the loss function is prevented by a compensating increase in this regularization term. Many other classification techniques can also be formulated in terms of a regularization function combined with a penalty function on classification of points.
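To make the per-point penalty concrete, the short sketch below computes −ln p for a few made-up predictions. The function name `point_penalty` and the numbers are illustrative only.

```python
import math


def point_penalty(p_class1, t):
    """-ln p, where p is the predicted probability of the point's actual class."""
    p_actual = p_class1 if t == 1 else 1.0 - p_class1
    return -math.log(p_actual)


# Made-up predictions sigma(a(x_i)) paired with true labels t_i.
predictions = [(0.99, 1), (0.60, 1), (0.40, 0), (0.01, 1)]
penalties = [point_penalty(p, t) for p, t in predictions]
total_loss = sum(penalties)   # equals the binary sum above for these points
```

The confidently correct first point contributes almost nothing, while the confidently wrong last point dominates the total.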