Common Predictive
GENERALIZED LINEAR MODELS
In 1972, statisticians John Nelder and Robert Wedderburn, who worked together at the Rothamsted Experimental Station in England, defined generalized linear models (GLMs) as a unified framework for probit⁴ models used in the study of chemical dosage tolerance, contingency tables, ordinary least squares (OLS) regression, and many more. This generalized
4 Probit is a statistical term that describes a type of regression model whose response variable has only two possible values: for example, male or female.
C O M M O N P R E D I C T I V E M O D E L I N G T E C H N I Q U E S ◂ 85
description of models lifted the restrictions, which were prohibitive for certain types of problems, and offered the flexibility to accommodate response variables that are nonnormally distributed, have a mean with a restricted range, and/or have nonconstant variance, all of which violate the assumptions of OLS regression. GLMs are characterized by three components:
1. The probability distribution of the response variable $Y_i$ (the random component) can be any distribution from the exponential family.⁵
2. The linear model, which includes the explanatory variables and the model parameters, $x_i'\beta$ (the systematic component).
3. The link function, which describes the relationship between the systematic and random components.
More specifically, the usual linear component of a linear model, $x_i'\beta$, now links to the expected value $\mu_i$ of the response variable $Y_i$ through a function $g$ such that $g(\mu_i) = x_i'\beta$. The link function $g$ can be any monotonic, differentiable function.⁶ For ordinary linear regression, the link is simply the identity and the response distribution is normal.
A logistic regression model, another example of a GLM, uses the logit link function for a Bernoulli‐distributed response variable; in turn, this gives a probability estimate between 0 and 1 for the modeled outcome.⁷ This model would be used to predict churn, attrition, or any other binary event where the likelihood that the event will occur matters more than the event itself.
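As a small illustration of the link-function idea for logistic regression, the following Python sketch maps a linear predictor to a probability with the inverse logit and back again. The coefficient values and all function names are made up for illustration; they do not come from any fitted model.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps any linear predictor value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for a churn model with one explanatory variable.
beta0, beta1 = -2.0, 0.5

def predicted_probability(x):
    eta = beta0 + beta1 * x      # systematic component x'beta
    return inverse_logit(eta)    # expected value of the Bernoulli response

probabilities = [predicted_probability(x) for x in range(0, 11)]
```

However large or small the linear predictor becomes, the inverse logit keeps the estimate strictly between 0 and 1, which an identity link cannot guarantee.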
5 The exponential family of distributions includes the normal, gamma, beta, and chi‐squared distributions, among many others.
6 “Monotonic” describes a function whose values never decrease (or never increase) as the independent variable increases. “Differentiable” means that a derivative exists at each point along the function.
7 The Bernoulli distribution is named after Jacob Bernoulli, a Swiss seventeenth‐century mathematician. The distribution describes one discrete trial (like a coin toss) with probability of success $p$ and probability of failure $q = 1 - p$.
86 ▸ B I G D A T A , D A T A M I N I N G , A N D M A C H I N E L E A R N I N G
Counts can be modeled in a GLM using a Poisson or negative binomial distribution and a log link function. Examples of count models include the number of claims someone makes against an insurance policy, the number of cars passing through a toll booth, the number of items that will be returned to a store after purchase, or any other counting‐type process where the occurrence of one event is independent of another occurrence.
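To make the log link for counts concrete, here is a minimal Python sketch. It assumes a Poisson response and a single explanatory variable; the coefficients and the "years insured" interpretation are hypothetical and chosen only for illustration.

```python
import math

def poisson_mean(beta0, beta1, x):
    """Inverse of the log link: the expected count exp(x'beta) is always positive."""
    return math.exp(beta0 + beta1 * x)

def poisson_pmf(k, mu):
    """Probability of observing exactly k events when the expected count is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical coefficients: expected claims as a function of years insured.
beta0, beta1 = -1.5, 0.1

mu = poisson_mean(beta0, beta1, x=5)                 # expected number of claims
prob_no_claims = poisson_pmf(0, mu)                  # chance of filing zero claims
claim_probs = [poisson_pmf(k, mu) for k in range(50)]
```

Under the log link, a one-unit increase in the explanatory variable multiplies the expected count by exp(beta1) rather than adding a fixed amount to it.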
Example of a Probit GLM
I, like most people, do not like bugs in my home. Because North Carolina has lots of bugs, I periodically spray the perimeter of my home and the trunks of my trees with pesticide to keep them out of my house. I have a friend whom I met when I first moved to North Carolina who worked for a national consumer pest control company. His job was to research the effectiveness of certain chemical compounds at killing pests. Consider this common example for modeling whether an insect dies from a dosage of a toxic chemical in a pesticide. Several different dosages were applied to a sample of insects, and data was collected on how many insects lived and died at the particular dosages. The plot in Figure 5.8 shows the proportion of insects that died; the Y‐axis represents the proportion that died at the dosage amounts represented on the X‐axis.
Figure 5.8 Proportion of Response by Dose (Y‐axis: observed proportion; X‐axis: Dose)
While the relationship between these two variables does appear somewhat linear, a linear regression model is not suitable in this case; we are trying to predict the probability of dying at a particular dosage, which needs to be between 0 and 1 because we cannot have less than 0% or more than 100% of the insects die, and OLS is not bounded. This probability of death $P$ at a given dosage or log dosage, $x$, is equal to the probability that the insect’s tolerance for the chemical, $T$, is less than $x$. The tolerance for all subjects (in this case, insects) at a fixed dosage or log dosage is often assumed to have a normal distribution with mean $\mu$ and standard deviation $\sigma$. Thus we have:
$$P = \Pr(T < x) = \Phi\left[\frac{x - \mu}{\sigma}\right]$$
The function $\Phi$ is the standard normal cumulative distribution function, and its inverse $\Phi^{-1}$ is the probit function. Since $\Phi^{-1}$ is a legitimate link function, this model can be expressed in the form of a GLM as:
$$\Phi^{-1}(P) = \beta_0 + \beta_1 x$$
where $\beta_0 = -\mu/\sigma$ and $\beta_1 = 1/\sigma$.
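The equivalence between the tolerance formulation and the GLM formulation can be checked numerically. The sketch below assumes a hypothetical tolerance distribution (the mean and standard deviation are invented, not estimated from the insect data) and confirms that both expressions give the same probability of death at each dose.

```python
import math

def std_normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical tolerance distribution (not taken from the book's data).
mu, sigma = 4.0, 1.5

# GLM coefficients implied by the tolerance distribution.
beta0 = -mu / sigma
beta1 = 1.0 / sigma

def p_tolerance(x):
    """P(T < x) written directly in terms of the tolerance distribution."""
    return std_normal_cdf((x - mu) / sigma)

def p_glm(x):
    """The same probability written with the probit link: Phi(beta0 + beta1 * x)."""
    return std_normal_cdf(beta0 + beta1 * x)

doses = [1, 2, 3, 4, 5, 6, 7]
pairs = [(p_tolerance(x), p_glm(x)) for x in doses]
```

Read in the other direction, fitted coefficients can be turned back into tolerance parameters: the mean tolerance is -beta0 / beta1 and the standard deviation is 1 / beta1.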
One of the motivations for Nelder and Wedderburn’s paper was to introduce a process for fitting GLMs because an exact estimate of the parameters was not available most of the time.⁸ They presented an iterative weighted least squares process that can be used to obtain the maximum likelihood estimates (MLEs) of the parameters $\beta$. The predicted probability of death $\hat{P}$ can be estimated from the MLEs of $\beta$ as:
$$\hat{P} = \Phi(\hat{\beta}_0 + \hat{\beta}_1 x)$$
Figure 5.9 shows the probit model curve (the solid line), which represents the predicted probabilities from the model fit to the collected data.
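The following is a minimal sketch of an iterative weighted least squares fit for a probit model with one explanatory variable, written with only the Python standard library. The data are made up for illustration and are not the data behind Figures 5.8 and 5.9; the function name, starting values, and stopping rule are assumptions of this sketch, not the procedure used by any particular software package.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def fit_probit_irls(doses, n_insects, n_dead, max_iter=50, tol=1e-10):
    """Estimate (b0, b1) in P(death) = Phi(b0 + b1 * dose) from grouped data."""
    b0, b1 = 0.0, 0.0  # starting values: every fitted probability is 0.5
    for _ in range(max_iter):
        sw = swx = swz = swxx = swxz = 0.0
        for x, n, y in zip(doses, n_insects, n_dead):
            eta = b0 + b1 * x                       # linear predictor
            mu = STD_NORMAL.cdf(eta)                # fitted probability of death
            dens = STD_NORMAL.pdf(eta)              # d(mu)/d(eta) for the probit link
            w = n * dens ** 2 / (mu * (1.0 - mu))   # working weight
            z = eta + (y / n - mu) / dens           # working response
            sw += w
            swx += w * x
            swz += w * z
            swxx += w * x * x
            swxz += w * x * z
        # Closed-form weighted least squares for the line z ~ b0 + b1 * x.
        new_b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        new_b0 = (swz - new_b1 * swx) / sw
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Made-up illustration data: 20 insects exposed at each of seven doses.
doses = [1, 2, 3, 4, 5, 6, 7]
n_insects = [20] * 7
n_dead = [1, 3, 6, 10, 14, 17, 19]

b0_hat, b1_hat = fit_probit_irls(doses, n_insects, n_dead)
predicted = [STD_NORMAL.cdf(b0_hat + b1_hat * x) for x in doses]
```

Each pass replaces the response with a linearized "working response" and solves an ordinary weighted regression, which is why no closed-form solution is needed. The sketch does not guard against fitted probabilities of exactly 0 or 1, which a production routine would have to handle.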
Unlike the regression example, in most cases for GLMs, we do not get a measure like R‐Square to determine how well the model fits. Instead of R‐Square, there is a set of measures to evaluate model fit.
8 A closed-form solution for the maximum likelihood estimates of the parameters is not available in most cases (an exception being OLS regression).
A simple way to classify the model fit measures is into two groups: tests where small p‐values indicate a poor model fit, and criteria where small values indicate a good model fit.
Small p‐values indicate a poor fit:
■ Pearson’s chi‐square statistic
■ Deviance⁹ or scaled deviance
Significant p‐values¹⁰ for these two test statistics indicate a lack of fit.
Small values indicate a good fit:
■ Akaike information criterion (AIC)
■ Corrected AIC (AICC)
■ Bayesian information criterion (BIC)
These fit measures can also be used for evaluating the goodness of a model fit, with smaller values indicating a better fit. As with the regression analysis discussed in the Regression section, examining the residual plots can give insight into the reason for a lack of fit, as can the other diagnostics mentioned, such as Cook’s D for outlier detection.
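The fit measures listed above can be computed directly from grouped binomial data and fitted probabilities. The sketch below uses the standard textbook formulas; the data and the coefficients used to produce the fitted probabilities are hypothetical, and the choice of sample size for AICC and BIC (here, the number of dose groups) is an assumption that differs between software packages.

```python
import math
from statistics import NormalDist

def fit_statistics(n_trials, n_events, fitted_probs, n_params, sample_size):
    """Pearson chi-square, deviance, AIC, AICC, and BIC for a binomial GLM."""
    pearson = deviance = log_lik = 0.0
    for n, y, p in zip(n_trials, n_events, fitted_probs):
        expected = n * p
        pearson += (y - expected) ** 2 / (expected * (1.0 - p))
        if y > 0:   # a term with y = 0 contributes nothing (0 * log 0 = 0)
            deviance += 2.0 * y * math.log(y / expected)
        if y < n:   # likewise for a group in which every trial is an event
            deviance += 2.0 * (n - y) * math.log((n - y) / (n - expected))
        log_lik += (math.log(math.comb(n, y))
                    + y * math.log(p) + (n - y) * math.log(1.0 - p))
    aic = -2.0 * log_lik + 2.0 * n_params
    aicc = aic + 2.0 * n_params * (n_params + 1) / (sample_size - n_params - 1)
    bic = -2.0 * log_lik + n_params * math.log(sample_size)
    return {"pearson": pearson, "deviance": deviance,
            "aic": aic, "aicc": aicc, "bic": bic}

# Made-up data and hypothetical probit coefficients, for illustration only.
doses = [1, 2, 3, 4, 5, 6, 7]
n_insects = [20] * 7
n_dead = [1, 3, 6, 10, 14, 17, 19]
b0, b1 = -2.1, 0.525
fitted = [NormalDist().cdf(b0 + b1 * x) for x in doses]

stats = fit_statistics(n_insects, n_dead, fitted, n_params=2, sample_size=len(doses))
```

Turning the Pearson and deviance statistics into p‐values requires comparing them with a chi‐square distribution on the residual degrees of freedom, which this sketch leaves out.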
Figure 5.9 Predicted Probabilities for Response (plot title: Predicted Probabilities for Response With 95% Confidence Limits; Y‐axis: Probability; X‐axis: Dose)
9 Also called the likelihood ratio chi‐square statistic.
10 A p‐value is the probability of obtaining a result at least this extreme when the null hypothesis is true.
The DFBETA and standardized DFBETA statistics are other measures of the influence of individual observations on the model fit. An example of the unstandardized DFBETA measure plotted against observation number for each of the two estimated parameters is shown in Figure 5.10. These statistics and plots can help identify observations that could be outliers and are adversely affecting the fit of the model to the data.
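One way to see what an unstandardized DFBETA measures is to refit the model with each observation left out and record how much each coefficient moves. The sketch below does this by brute force for a probit model on made-up grouped data. It is an assumption of this sketch that exact refits are used; statistical packages commonly use a cheaper approximation, so the values need not match those in Figure 5.10. A compact fitting routine is included so the block runs on its own.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def fit_probit(doses, n_insects, n_dead, iterations=200):
    """Compact iterative weighted least squares fit of Phi(b0 + b1 * dose)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        sw = swx = swz = swxx = swxz = 0.0
        for x, n, y in zip(doses, n_insects, n_dead):
            eta = b0 + b1 * x
            mu, dens = STD_NORMAL.cdf(eta), STD_NORMAL.pdf(eta)
            w = n * dens ** 2 / (mu * (1.0 - mu))   # working weight
            z = eta + (y / n - mu) / dens           # working response
            sw += w
            swx += w * x
            swz += w * z
            swxx += w * x * x
            swxz += w * x * z
        b1 = (sw * swxz - swx * swz) / (sw * swxx - swx ** 2)
        b0 = (swz - b1 * swx) / sw
    return b0, b1

def dfbeta(doses, n_insects, n_dead):
    """Change in (b0, b1) caused by deleting each observation in turn."""
    full_b0, full_b1 = fit_probit(doses, n_insects, n_dead)
    rows = []
    for i in range(len(doses)):
        keep = [j for j in range(len(doses)) if j != i]
        b0_i, b1_i = fit_probit([doses[j] for j in keep],
                                [n_insects[j] for j in keep],
                                [n_dead[j] for j in keep])
        rows.append((full_b0 - b0_i, full_b1 - b1_i))
    return rows

# Made-up illustration data: 20 insects exposed at each of seven doses.
doses = [1, 2, 3, 4, 5, 6, 7]
n_insects = [20] * 7
n_dead = [1, 3, 6, 10, 14, 17, 19]

dfbeta_rows = dfbeta(doses, n_insects, n_dead)  # one (intercept, slope) pair per observation
```

An observation whose row is far from zero relative to the others is one whose removal noticeably changes the fitted curve, which is the pattern the DFBETA plots are meant to expose.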
Applications in the Big Data Era
A main application of GLMs with respect to big data is in the insurance industry for building ratemaking models. Ratemaking is the determination of insurance premiums based on risk characteristics (which are captured in rating variables); GLMs have become the standard for fitting ratemaking models due to their ability to accommodate all rating variables simultaneously, to remove noise or random effects, to provide diagnostics for the fitted models, and to allow for interactions between rating variables.
Figure 5.10 Unstandardized DFBETA Plots for Response (two panels, Intercept and Idose, each plotted against Observation)
The response variables modeled for ratemaking applications are typically claim counts (the number of claims an individual files) with a Poisson distribution, claim amounts with a gamma distribution, or pure premium models that use the Tweedie distribution. These distributions are all members of the exponential family, and the log link function typically relates the expected response to a linear combination of rating variables, allowing these models to be formulated as GLMs. One large insurance company based in the United States uses GLMs to build ratemaking models on data that contains 150 million policies and 70 degrees of freedom.
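A practical consequence of the log link in ratemaking is that rating variables combine multiplicatively. The sketch below illustrates this with entirely hypothetical numbers: the base frequency, the two rating factors, and the average claim amount are invented, and the pure premium is taken as expected frequency times expected severity, one common way of combining a Poisson frequency model with a gamma severity model.

```python
import math

# Hypothetical log-link coefficients for a claim-frequency model.
intercept = math.log(0.08)                  # base rate: 0.08 claims per policy-year
coefficients = {
    "young_driver": math.log(1.50),         # relativity of 1.50 on the rate scale
    "urban_territory": math.log(1.20),      # relativity of 1.20 on the rate scale
}

def expected_frequency(risk_flags):
    """Expected claim count: exp of the linear combination of rating variables."""
    eta = intercept + sum(coefficients[name] for name in risk_flags)
    return math.exp(eta)

base = expected_frequency([])
both = expected_frequency(["young_driver", "urban_territory"])

# Hypothetical average claim amount, standing in for a gamma severity model.
expected_severity = 3000.0
pure_premium = both * expected_severity     # expected claim cost per policy-year
```

Because the coefficients add on the log scale, each one can be read as a multiplicative relativity on the base rate, which is the form in which rating plans are usually expressed.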