Section 6: Model Selection, Logistic Regression and more...

(1)

Carlos M. Carvalho

The University of Texas McCombs School of Business

(2)

Model Building Process

When building a regression model, remember that simplicity is your friend... smaller models are easier to interpret and have fewer unknown parameters to be estimated.

Keep in mind that every additional parameter represents a cost!!

The first step of every model building exercise is the selection of the universe of variables to be potentially used. This task is entirely solved through your experience and context-specific knowledge...

- Think carefully about the problem
- Consult subject matter research and experts
- Avoid the mistake of selecting too many variables

(3)

Model Building Process

With a universe of variables in hand, the goal now is to select the model. Why not include all the variables?

Big models tend to over-fit and find features that are specific to the data at hand... i.e., not generalizable relationships.

The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn... (check the beer and weight example!)

We need a strategy to build a model in a way that accounts for the trade-off between fitting the data and the uncertainty associated with the model.

(4)

Out-of-Sample Prediction

One idea is to focus on the model’s ability to predict... How do we evaluate a forecasting model? Make predictions!

Basic Idea: We want to use the model to forecast outcomes for observations we have not seen before.

- Use the data to create a prediction problem.
- See how our candidate models perform.

We’ll use most of the data for training the model, and the left-over part for validating the model.

(5)

Out-of-Sample Prediction

In a cross-validation scheme, you fit a bunch of models to most of the data (training sample) and choose the model that performed the best on the rest (left-out sample).

- Fit the model on the training data
- Use the model to predict $\hat{Y}_j$ values for all of the $N_{LO}$ left-out data points
- Calculate the Mean Square Error for these predictions

$$\mathrm{MSE} = \frac{1}{N_{LO}} \sum_{j=1}^{N_{LO}} (Y_j - \hat{Y}_j)^2$$
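The three steps above can be sketched in a few lines of R. This is only an illustration on simulated data (the variables `x`, `y`, the 25% hold-out fraction and the helper `oos_mse` are assumptions, not the lecture's telemarketing data): fit polynomials of increasing order on the training rows and score each one by its MSE on the left-out rows.

```r
# Simulated data (hypothetical; not the lecture's data set)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Hold out 25% of the rows as the left-out (validation) sample
left_out <- sample(n, size = n / 4)
train <- dat[-left_out, ]
valid <- dat[left_out, ]

# Out-of-sample MSE of a polynomial of the given order:
# fit on the training rows, predict the left-out rows
oos_mse <- function(order) {
  fit <- lm(y ~ poly(x, order), data = train)
  pred <- predict(fit, newdata = valid)
  mean((valid$y - pred)^2)
}

mse_by_order <- sapply(1:5, oos_mse)
best_order <- which.min(mse_by_order)  # order with the smallest left-out MSE
```

The training-sample fit can only improve as the order grows, so it is the left-out MSE that is worth comparing across orders.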

(6)

Example

To illustrate the potential problems of “over-fitting” the data, let’s look again at the Telemarketing example... let’s look at multiple polynomial terms...

[Figure: Calls versus months]

(7)

Example

Let’s evaluate the fit of each model by its $R^2$ (on the training data)

[Figure: $R^2$ of each model on the training data]

(8)

Example

How about the MSE?? (on the left-out data)

[Figure: RMSE on the left-out data versus polynomial order]

(9)

BIC for Model Selection

Another way to evaluate a model is to use Information Criteria metrics, which attempt to quantify how well our model would have predicted the data (regardless of what you’ve estimated for the $\beta_j$’s).

A good alternative is the BIC: Bayes Information Criterion, which is based on a “Bayesian” philosophy of statistics.

$$\mathrm{BIC} = n \log(s^2) + p \log(n)$$
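As a rough sketch, the formula can be computed by hand for an `lm` fit. The simulated data and the helper `bic_lm` are hypothetical, and the code assumes $s^2$ is the residual sum of squares divided by $n$ and that $p$ counts every estimated coefficient including the intercept; under those assumptions it matches what `extractAIC(fit, k = log(n))` reports.

```r
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)   # x2 is irrelevant by construction

# BIC = n log(s^2) + p log(n)
bic_lm <- function(fit) {
  n <- length(residuals(fit))
  p <- length(coef(fit))            # number of estimated coefficients
  s2 <- sum(residuals(fit)^2) / n   # assumption: s^2 = RSS / n
  n * log(s2) + p * log(n)
}

small <- lm(y ~ x1)
big <- lm(y ~ x1 + x2)
bics <- c(small = bic_lm(small), big = bic_lm(big))
```

The `p * log(n)` term is the price of each extra parameter: the bigger model has to lower `n * log(s2)` by more than `log(n)` to be preferred.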

(10)

BIC for Model Selection

One (very!) nice thing about the BIC is that you can interpret it in terms of model probabilities.

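Given a list of possible models $\{M_1, M_2, \ldots, M_R\}$, the probability that model $i$ is correct is

$$P(M_i) \approx \frac{e^{-\frac{1}{2}\mathrm{BIC}(M_i)}}{\sum_{r=1}^{R} e^{-\frac{1}{2}\mathrm{BIC}(M_r)}} = \frac{e^{-\frac{1}{2}[\mathrm{BIC}(M_i) - \mathrm{BIC}_{\min}]}}{\sum_{r=1}^{R} e^{-\frac{1}{2}[\mathrm{BIC}(M_r) - \mathrm{BIC}_{\min}]}}$$

(Subtract $\mathrm{BIC}_{\min} = \min\{\mathrm{BIC}(M_1), \ldots, \mathrm{BIC}(M_R)\}$ for numerical stability.)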
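The conversion from BICs to model probabilities is a one-liner in R. The helper name `bic_to_prob` and the three BIC values below are made up for illustration.

```r
# Turn a vector of BICs into approximate model probabilities
bic_to_prob <- function(bic) {
  w <- exp(-0.5 * (bic - min(bic)))  # subtract the minimum for numerical stability
  w / sum(w)
}

bic <- c(M1 = 100, M2 = 102, M3 = 110)  # hypothetical BIC values
probs <- bic_to_prob(bic)
round(probs, 3)
```

With these made-up values the probabilities come out at roughly 0.73, 0.27 and 0.005. Only differences in BIC matter, which is why subtracting the minimum changes nothing except the risk of underflow in `exp()`.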

(11)

BIC for Model Selection

Thus BIC is an alternative to testing for comparing models.

- It is easy to calculate.
- You are able to evaluate model probabilities.
- There are no “multiple testing” type worries.
- It generally leads to simpler models than F-tests.

As with testing, you need to narrow down your options before comparing models. What if there are too many possibilities?

(12)

Stepwise Regression

One computational approach to building a regression model step-by-step is “stepwise regression”. There are 3 options:

- Forward: adds one variable at a time until no remaining variable makes a significant contribution (or meets a certain criterion... could be out-of-sample prediction)
- Backwards: starts with all possible variables and removes one at a time until further deletions would do more harm than good
- Stepwise: just like the forward procedure but allows for deletions at each step
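A minimal sketch of the three options using R's `step()` function, with BIC as the criterion (`k = log(n)`). The data are simulated, with only `x1` and `x2` actually related to `y`; all variable names here are assumptions for illustration.

```r
set.seed(3)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)  # only x1 and x2 matter

null <- lm(y ~ 1, data = dat)   # smallest model: intercept only
full <- lm(y ~ ., data = dat)   # largest model: the whole universe of variables

# Forward: start small, add one variable at a time
fwd <- step(null, scope = formula(full), direction = "forward",
            k = log(n), trace = 0)
# Backwards: start with everything, remove one variable at a time
bwd <- step(full, direction = "backward", k = log(n), trace = 0)
# Stepwise: forward, but deletions are allowed at each step
both <- step(null, scope = formula(full), direction = "both",
             k = log(n), trace = 0)

names(coef(fwd))
```

Setting `trace = 0` only silences the step-by-step printout; drop it to see which variable enters or leaves at each step.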

(13)

LASSO

The LASSO is a shrinkage method that performs automatic selection. Yet another alternative... it has properties similar to stepwise regression but it is more automatic... R does it for you!

The LASSO solves the following problem:

$$\arg\min_{\beta} \left\{ \sum_{i=1}^{N} (Y_i - X_i'\beta)^2 + \lambda|\beta| \right\}$$

- Coefficients can be set exactly to zero (automatic model selection)
- Very efficient computational method
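To see how coefficients end up exactly at zero, here is a bare-bones coordinate-descent sketch in base R. It is not what R's LASSO packages do internally in detail, just one simple way to minimise the objective above, reading $|\beta|$ as $\sum_j |\beta_j|$. The function names `soft` and `lasso_cd`, the fixed number of sweeps and the simulated data are all assumptions; in practice you would use a dedicated package.

```r
# Soft-thresholding: shrink z towards zero by g, and set it to zero if |z| <= g
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimise sum((y - X b)^2) + lambda * sum(abs(b)) one coefficient at a time
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X, center = TRUE, scale = FALSE)  # centring removes the intercept
  y <- y - mean(y)
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # residual that ignores variable j
      r_j <- y - drop(X[, -j, drop = FALSE] %*% beta[-j])
      # exact one-dimensional minimiser for coefficient j
      beta[j] <- soft(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

set.seed(4)
n <- 100
X <- matrix(rnorm(n * 4), n, 4)
y <- drop(X %*% c(3, 0, -2, 0)) + rnorm(n)   # variables 2 and 4 are irrelevant

b0 <- lasso_cd(X, y, lambda = 0)      # no penalty: least squares
b_mid <- lasso_cd(X, y, lambda = 50)  # moderate penalty: shrunken coefficients
b_big <- lasso_cd(X, y, lambda = 1e6) # huge penalty: everything set to zero
```

The soft-threshold step is where the selection happens: whenever a variable's correlation with the current residual is smaller than `lambda / 2`, its coefficient is set exactly to zero.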

(14)

One informal but very useful idea to put it all together...

I like to build models from the bottom, up...

- Set aside a set of points to be your validating set (if the dataset is large enough)
- Working on the training data, add one variable at a time, deciding which one to add based on some criteria:
  1. larger increases in $R^2$ while significant
  2. larger reduction in MSE while significant
  3. BIC, etc...
- At every step, carefully analyze the output and check the residuals!
- Stop when no additional variable produces a “significant” improvement
- Always make sure you understand what the model is doing in the specific context of your problem

(15)

Binary Response Data

Let’s now look at data where the response Y is a binary variable (taking the value 0 or 1).

- Win or lose.
- Sick or healthy.
- Buy or not buy.
- Pay or default.
- Thumbs up or down.

The goal is generally to predict the probability that Y = 1, and you can then do classification based on this estimate.

(16)

Binary Response Data

Y is an indicator: Y = 0 or 1. The conditional mean is thus

$$E[Y|X] = p(Y = 1|X) \times 1 + p(Y = 0|X) \times 0 = p(Y = 1|X)$$

The mean function is a probability: we need a model that gives mean/probability values between 0 and 1.

We’ll use a transform function that takes the right-hand side of the model ($x'\beta$) and gives back a value between zero and one.

(17)

Binary Response Data

The binary choice model is

$$p(Y = 1|X_1, \ldots, X_d) = S(\beta_0 + \beta_1 X_1 + \ldots + \beta_d X_d)$$

where $S$ is a function that increases in value from zero to one.

(18)

Binary Response Data

There are two main functions that are used for this:

- Logistic Regression: $S(z) = \frac{e^z}{1 + e^z}$.
- Probit Regression: $S(z) = \mathrm{pnorm}(z)$.

Both functions are S-shaped and take values in (0, 1).

Probit is used by economists, logit by biologists, and the rest of us are fairly indifferent: they result in practically the same fit.
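Both transform functions are available in base R, so it is easy to look at them side by side. In this small sketch the helper name `logistic` and the grid `z` are just for illustration; `plogis()` is R's built-in version of the logistic function and `pnorm()` is the standard normal CDF.

```r
# Logistic transform written out as in the formula
logistic <- function(z) exp(z) / (1 + exp(z))

z <- seq(-4, 4, by = 0.5)
s_logit <- logistic(z)   # same values as plogis(z)
s_probit <- pnorm(z)     # probit transform

# Both map any real number into (0, 1) and equal 0.5 at z = 0
cbind(z, s_logit, s_probit)
```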

(19)

Logistic Regression

We’ll use logistic regression, such that

$$p(Y = 1|X_1, \ldots, X_d) = \frac{\exp[\beta_0 + \beta_1 X_1 + \ldots + \beta_d X_d]}{1 + \exp[\beta_0 + \beta_1 X_1 + \ldots + \beta_d X_d]}$$

The “logit” link is more common, and it’s the default in R. These models are easy to fit in R:

glm(Y ~ X1 + X2, family=binomial)

“g” stands for generalized, and binomial indicates Y = 0 or 1. Otherwise, generalized linear models use the same syntax as lm().

(20)

Logistic Regression

What is happening here? Instead of least-squares, glm is maximizing the product of probabilities:

$$\prod_{i=1}^{n} P(Y_i|x_i) = \prod_{i=1}^{n} \left(\frac{\exp[x_i'b]}{1 + \exp[x_i'b]}\right)^{Y_i} \left(\frac{1}{1 + \exp[x_i'b]}\right)^{1 - Y_i}$$

This maximizes the likelihood of our data (which is also what least-squares did).
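To make this concrete, the sketch below maximises the log of that product directly with a general-purpose optimiser and compares the answer with `glm`. The data are simulated and the names (`neg_loglik`, `opt`) are hypothetical; `glm` itself uses a different, more specialised algorithm, so this only illustrates that both are chasing the same maximum.

```r
set.seed(5)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)   # design matrix with an intercept column

# Negative log-likelihood: the log of the product above is
# sum over i of [ Y_i * x_i'b - log(1 + exp(x_i'b)) ]
neg_loglik <- function(b) {
  eta <- drop(X %*% b)
  -sum(y * eta - log(1 + exp(eta)))
}

opt <- optim(c(0, 0), neg_loglik, method = "BFGS")  # minimise the negative
fit <- glm(y ~ x, family = binomial)

rbind(optim = opt$par, glm = coef(fit))
```

The two rows should agree to a few decimal places; any small gap comes from the optimiser's stopping tolerance.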

(21)

Logistic Regression

The important things are basically the same as before:

- Individual parameter p-values are interpreted as always.
- extractAIC(reg,k=log(n)) will get your BICs.
- The predict function works as before, but you need to add type = "response" to get $\hat{p}_i = \exp[x_i'b]/(1 + \exp[x_i'b])$ (otherwise it just returns the linear function $x_i'b$).

Unfortunately, techniques for residual diagnostics and model checking are different (but we’ll not worry about that today).
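A short sketch of these points on simulated data (the variables and the three new `x` values are made up): the default `predict` output is the linear function, `type = "response"` gives probabilities, and `extractAIC` with `k = log(n)` gives the BIC.

```r
set.seed(6)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 0.8 * x))
reg <- glm(y ~ x, family = binomial)

new_x <- data.frame(x = c(-1, 0, 1))
eta <- predict(reg, newdata = new_x)                       # linear function x'b
p_hat <- predict(reg, newdata = new_x, type = "response")  # probabilities

# Second element is the criterion value; the first is the number of parameters
bic <- extractAIC(reg, k = log(n))[2]
```

Applying `exp(eta) / (1 + exp(eta))` to the default output reproduces `p_hat`, which is a handy check that you asked `predict` for the scale you meant.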

(22)

Example: Basketball Spreads

NBA basketball point spreads: we have Las Vegas betting point spreads for 553 NBA games and the resulting scores.

We can use logistic regression of scores onto spread to predict the probability of the favored team winning.

- Response: favwin=1 if favored team wins.
- Covariate: spread is the Vegas point spread.

[Figure: frequency of spread for favwin=1 and favwin=0; spread by favwin]

(23)

Example: Basketball Spreads

This is a weird situation where we assume there is no intercept.

- There is considerable evidence that betting odds are efficient.
- A spread of zero implies p(win) = 0.5 for each team.
- Thus $p(\text{win}) = \exp[\beta_0]/(1 + \exp[\beta_0]) = 1/2 \Leftrightarrow \beta_0 = 0$.

The model we want to fit is thus

$$p(\text{favwin}|\text{spread}) = \frac{\exp[\beta \times \text{spread}]}{1 + \exp[\beta \times \text{spread}]}$$

(24)

Example: Basketball Spreads

summary(nbareg <- glm(favwin ~ spread - 1, family=binomial))

Some things are different (z not t) and some are missing (F, $R^2$).

(25)

Example: Basketball Spreads

The fitted model is

$$p(\text{favwin}|\text{spread}) = \frac{\exp[0.156 \times \text{spread}]}{1 + \exp[0.156 \times \text{spread}]}$$

[Figure: P(favwin) versus spread]

(26)

Example: Basketball Spreads

We could consider other models... and compare with BIC!

Our “Efficient Vegas” model:

> extractAIC(nbareg, k=log(553))
1.000 534.287

A model that includes a non-zero intercept:

> extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))
2.0000 540.4333

What if we throw in home-court advantage?

> extractAIC(glm(favwin ~ spread+favhome, family=binomial), k=log(553))
3.0000 545.6371

The simplest model is best.
(The model probabilities are 19/20, 1/20, and zero.)
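Those probabilities follow from plugging the three reported BICs into the model-probability formula from the BIC slides, with $\mathrm{BIC}_{\min} = 534.287$. The weights $w_r$ below are just shorthand for the numerator terms, and the decimals are rounded by hand:

```latex
\begin{aligned}
w_1 &= e^{-\frac{1}{2}(534.287 - 534.287)} = 1 \\
w_2 &= e^{-\frac{1}{2}(540.4333 - 534.287)} \approx 0.046 \\
w_3 &= e^{-\frac{1}{2}(545.6371 - 534.287)} \approx 0.003 \\
P(M_1) &\approx \frac{1}{1.050} \approx 0.953, \qquad
P(M_2) \approx \frac{0.046}{1.050} \approx 0.044, \qquad
P(M_3) \approx \frac{0.003}{1.050} \approx 0.003
\end{aligned}
```

This is in line with the rounded figures of 19/20, 1/20 and (essentially) zero.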

(27)

Example: Basketball Spreads

Let’s use our model to predict the result of a game:

- Portland vs Golden State: spread is PRT by 8

$$p(\text{PRT win}) = \frac{\exp[0.156 \times 8]}{1 + \exp[0.156 \times 8]} = 0.78$$

- Chicago vs Orlando: spread is ORL by 4

$$p(\text{CHI win}) = \frac{1}{1 + \exp[0.156 \times 4]} = 0.35$$
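These predictions are quick to check by hand in R using the fitted slope of 0.156. The helper `p_fav` is hypothetical, and the Chicago calculation assumes the underdog's win probability is one minus the favorite's.

```r
b <- 0.156  # fitted slope reported for the spread model

# Probability that the favored team wins, given the point spread
p_fav <- function(spread) exp(b * spread) / (1 + exp(b * spread))

p_prt <- p_fav(8)       # Portland is favored by 8
p_chi <- 1 - p_fav(4)   # Chicago is the underdog; Orlando is favored by 4

round(c(PRT = p_prt, CHI = p_chi), 2)
```

This gives about 0.78 for Portland and about 0.35 for Chicago.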

(28)

Example: Credit Scoring

A common business application of logistic regression is in evaluating the credit quality of (potential) debtors.

- Take a list of borrower characteristics.
- Build a prediction rule for their credit.
- Use this rule to automatically evaluate applicants (and track your risk profile).

You can do all this with logistic regression, and then use the predicted probabilities to build a classification rule.

(29)

Example: Credit Scoring

We have data on 1000 loan applicants at German community banks, and judgement of the loan outcomes (good or bad).

The data has 20 borrower characteristics, including

- Credit history (5 categories).
- Housing (rent, own, or free).
- The loan purpose and duration.

(30)

Example: Credit Scoring

We can use forward stepwise regression to build a model.

null <- glm(Y ~ history3, family=binomial, data=credit[train,])
full <- glm(Y ~ ., family=binomial, data=credit[train,])
reg <- step(null, scope=formula(full), direction="forward", k=log(n))

. . .
Step: AIC=882.94
Y[train] ~ history3 + checkingstatus1 + duration2 + installment8

The null model has credit history as a variable, since I’d include this regardless, and we’ve left out 200 points for validation.

(31)

Classification

A common goal with logistic regression is to classify the inputs depending on their predicted response probabilities.

For example, we might want to classify the German borrowers as having “good” or “bad” credit (i.e., do we loan to them?). A simple classification rule is to say that anyone with p(good|x) > 0.5 can get a loan, and the rest do not.
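The 0.5 rule can be sketched as follows. The data here are simulated (a single made-up characteristic `x` and an outcome `good`), not the German credit data, and the 300/100 split is an arbitrary choice for illustration.

```r
set.seed(7)
n <- 400
x <- rnorm(n)
good <- rbinom(n, 1, plogis(0.5 + 1.5 * x))   # 1 = good credit, 0 = bad
dat <- data.frame(x = x, good = good)

train <- sample(n, 300)                        # 100 points left out for validation
fit <- glm(good ~ x, family = binomial, data = dat[train, ])

# Predicted probability of "good" for the left-out applicants
p_good <- predict(fit, newdata = dat[-train, ], type = "response")

# Classification rule: loan if p(good | x) > 0.5
loan <- as.integer(p_good > 0.5)

# Compare the rule with what actually happened
confusion <- table(actual = dat$good[-train], predicted = loan)
misclass <- mean(loan != dat$good[-train])     # misclassification rate
```

The 0.5 cut-off treats both kinds of mistake as equally costly; if lending to a bad borrower hurts more than turning away a good one, a higher threshold may make more sense.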

(32)

Example: Credit Scoring

Let’s use the validation set to compare this and the full model.

> full <- glm(formula(terms(Y[train] ~ ., data=covars)),
              data=covars[train,], family=binomial)
> predreg <- predict(reg, newdata=covars[-train,], type="response")
> predfull <- predict(full, newdata=covars[-train,], type="response")
> # 1 = false negative, -1 = false positive
> errorreg <- Y[-train]-(predreg >= .5)
> errorfull <- Y[-train]-(predfull >= .5)
> # misclassification rates:
> mean(abs(errorreg))
0.220
> mean(abs(errorfull))
0.265

Our model classifies borrowers correctly 78% of the time.