Generalized Linear Models
Dr. Marcus Wurzer
Introduction I
I One assumption of linear models is that the error term ε is normally distributed. If we want to model non-normal responses Y, for example nominal, ordinal or count variables, we use generalized linear models (GLMs) instead.
Introduction II
I Rationale for calling such models GLMs: As in linear models, the goal is the prediction of a response by a linear predictor, but in contrast to these, a direct prediction is not possible since the relationship is not linear.
I A link function that linearizes the relationship is needed. It links the linear predictor to the mean of the response.
Introduction III
I General form of GLMs: $g(\mu_i) = x_i^T \beta$
I Benefits if a model belongs to the GLM family
I Common parameter estimation routines
I Common strategies for model testing
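To make the role of the link function concrete, the following is a minimal sketch of the relationship $\mu_i = g^{-1}(x_i^T \beta)$ for three common links. The names `LINKS` and `mean_from_predictor`, the observation and the coefficients are illustrative assumptions, not part of the slides.

```python
import math

# Three common link functions g and their inverses g^{-1}.
# The dictionary name LINKS is illustrative, not taken from the slides.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}

def mean_from_predictor(link, x, beta):
    """Return mu_i = g^{-1}(x_i^T beta) for a single observation."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))  # linear predictor
    _, inverse = LINKS[link]
    return inverse(eta)

# Hypothetical observation (leading 1.0 for the intercept) and coefficients.
x_i = [1.0, 2.0]
beta = [-1.0, 0.5]
for name in LINKS:
    print(name, mean_from_predictor(name, x_i, beta))
```

The linear predictor is the same in all three cases; only the mapping to the mean of the response changes with the link.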
Introduction IV
I Depending on the scales on which the variables are measured, we use the following methods:
Response                     Explanatory variables      Method
Binary                       Categorical & continuous   Binary logistic regression
Nominal with >2 categories   Categorical & continuous   Multinomial logistic regression
Ordinal                      Categorical & continuous   Ordinal logistic regression
Counts                       Categorical                Log-linear models
Counts                       Categorical & continuous   Poisson regression
Logistic Regression I
I With Logistic Regression, the probability of the occurrence of a certain event Y can be modelled as a function of a set of explanatory variables $X_1, \dots, X_k$.
I The independent variables can be metric or categorical
I The dependent variable is categorical. If it has only two categories, the method is called Binary Logistic Regression. If it has more than two categories, we speak of Multinomial Logistic Regression
Binary Logistic Regression I
I Typically, the values 0 and 1 are assigned to the dependent variable. For example, 0 may mean ’no customer’ and 1 may stand for ’customer’.
I This dependent variable Y is binomially distributed and falls into one of the two categories with probabilities π and 1 − π: $P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$
Binary Logistic Regression II
I Example: The presence of Coronary Heart Disease (CHD)
Binary Logistic Regression III
I Goal: Find a function for the probability of the event Y = 1, thus:
$\pi = P(Y = 1) = f(X_1, \dots, X_k)$
I One could try to express the dependent variable as an equation linear in x, like it is done in Linear Regression:
$\pi(x) = \beta_0 + \beta_1 x$
Binary Logistic Regression IV
I If age is transformed into a variable giving the age group of the person, the mean proportion of people having CHD can be computed for every group.
Binary Logistic Regression VIII
I Using the logistic distribution, we can formulate the logistic regression model that is defined as:
$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
I This relationship between the probability π(x) and the predictor x is nonlinear, so a transformation is needed to linearize it
Binary Logistic Regression IX
I This transformation is called the logit transformation and is defined as: $\ln \frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x$, where $\ln \frac{\pi(x)}{1 - \pi(x)}$ is called the logit
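The step from the logistic regression model to the logit can be written out explicitly. The following derivation is a sketch that uses only the symbols $\pi(x)$, $\beta_0$ and $\beta_1$ from the model definition:

```latex
\begin{align*}
\pi(x) &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
1 - \pi(x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}} \\
\frac{\pi(x)}{1 - \pi(x)} &= e^{\beta_0 + \beta_1 x} \\
\ln \frac{\pi(x)}{1 - \pi(x)} &= \beta_0 + \beta_1 x
\end{align*}
```

The logit is therefore linear in x even though $\pi(x)$ itself is not.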
Parameter Estimation and Interpretation I
I Least Squares estimation, as used in Linear Regression Analysis, has a number of undesirable properties when applied to a model with a dichotomous dependent variable.
I Because of that, Logistic Regression uses a more general form of estimation, called maximum likelihood estimation. In fact, applying maximum likelihood in linear regression with normally distributed errors leads to the least squares criterion.
I Generally speaking, for a certain data set, maximum likelihood estimation yields values for the unknown parameters that maximize the probability of obtaining just that data set.
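As an illustration of the principle, the sketch below maximizes the log-likelihood of a one-predictor logistic model by plain gradient ascent. The toy data, the function names, the step size and the number of steps are assumptions made for this example; statistical software typically uses a different algorithm (iteratively reweighted least squares).

```python
import math

def log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli log-likelihood of a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta0 + beta1 * x
        # y*log(pi) + (1-y)*log(1-pi) simplifies to y*eta - log(1 + e^eta)
        total += y * eta - math.log(1 + math.exp(eta))
    return total

def fit_gradient_ascent(xs, ys, rate=0.05, steps=5000):
    """Climb the log-likelihood surface starting from beta0 = beta1 = 0."""
    beta0 = beta1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
            g0 += y - p          # derivative with respect to beta0
            g1 += (y - p) * x    # derivative with respect to beta1
        beta0 += rate * g0
        beta1 += rate * g1
    return beta0, beta1

# Hypothetical toy data (not the CHD data from the slides).
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_gradient_ascent(xs, ys)
print(b0, b1, log_likelihood(b0, b1, xs, ys))
```

The fitted coefficients give the observed 0/1 pattern a higher probability than the starting values, which is exactly what maximum likelihood estimation aims for.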
Parameter Estimation and Interpretation II
I For the Heart disease example, the fitted values are given by the equation:
$\hat{\pi}(x) = \frac{e^{-5.309 + 0.111 \cdot AGE}}{1 + e^{-5.309 + 0.111 \cdot AGE}}$
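The fitted equation can be evaluated directly. The sketch below plugs a few ages into it; the function name `chd_probability` and the chosen ages are illustrative, while the two coefficients are the estimates quoted above.

```python
import math

B0, B1 = -5.309, 0.111  # estimates quoted in the slides

def chd_probability(age):
    """Fitted probability of CHD for a given age."""
    eta = B0 + B1 * age  # linear predictor (logit)
    return math.exp(eta) / (1 + math.exp(eta))

for age in (30, 50, 70):  # ages chosen only for illustration
    print(age, round(chd_probability(age), 3))
```

The fitted probabilities stay between 0 and 1 and increase with age, following the S-shaped logistic curve.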
Parameter Estimation and Interpretation III
I On the logit scale, the relationship between predictor(s) and the dependent variable is just like in Linear Regression: the intercept has the value −5.309, and with each additional year of age the logit increases by 0.111.
I The problem now is that the interpretation of the dependent variable is not as straightforward as in Linear Regression. What does it mean if the logit increases by a certain value?
I This issue can be solved by introducing a measure of association that is called the odds ratio. For the construction of this measure, the odds of a certain outcome are needed.
Parameter Estimation and Interpretation IV
I For example, in the age group 20-29 there are 9 persons without CHD and 1 person with CHD. The probability of having CHD therefore would be 1/10 = 0.1. The odds that a person has CHD are 1/9 ≈ 0.111.
I The values can be converted with the following formulas: $odds = \frac{\pi}{1 - \pi}$ and $\pi = \frac{odds}{1 + odds}$
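The conversion between probabilities and odds can be sketched in a few lines, using the standard definitions odds = π / (1 − π) and π = odds / (1 + odds). The function names are illustrative; the numbers are those of the age group 20-29.

```python
def odds_from_probability(p):
    """Odds of an event that has probability p (requires p < 1)."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability of an event that has the given odds."""
    return odds / (1 + odds)

# Age group 20-29 from the slides: 1 person with CHD, 9 without.
p = 1 / 10
print(odds_from_probability(p))        # odds of 1 to 9
print(probability_from_odds(1 / 9))    # back to the probability
```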
Parameter Estimation and Interpretation V
I Therefore, the logit can be defined as being the logarithm of the odds
I In the example above, the independent variable AGE was metric. To illustrate the interpretation of the corresponding parameter, we make the assumption that AGE is
dichotomized, having people up to 55 years of age in category 0 and people older than 55 in category 1 (We return to the metric variable AGE below)
Parameter Estimation and Interpretation VI
I The odds ratio then is defined as the ratio of the odds of the outcome being present among individuals with x = 1 and the individuals with x = 0:
$OR = \frac{\pi_1 / [1 - \pi_1]}{\pi_0 / [1 - \pi_0]}$
I It can be shown that after substituting the expressions from the logistic regression model for $\pi_1$ and $\pi_0$ we get
$OR = e^{\beta_1}$
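The substitution can be sketched as follows, assuming the dichotomous coding x = 1 and x = 0 and writing $\pi_1 = \pi(1)$ and $\pi_0 = \pi(0)$. Since the odds under the logistic model are $\pi(x) / [1 - \pi(x)] = e^{\beta_0 + \beta_1 x}$, we obtain:

```latex
\begin{align*}
\frac{\pi_1}{1 - \pi_1} &= e^{\beta_0 + \beta_1}, \qquad
\frac{\pi_0}{1 - \pi_0} = e^{\beta_0} \\
OR &= \frac{\pi_1 / [1 - \pi_1]}{\pi_0 / [1 - \pi_0]}
    = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}}
    = e^{\beta_1}
\end{align*}
```

The intercept cancels, so the odds ratio depends on the slope parameter alone.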
Parameter Estimation and Interpretation VII
I For the example with the dichotomized AGE variable, this means that the odds of having CHD are about eight times as high for persons older than 55 as for people up to 55 ($OR = e^{2.094} \approx 8.1$).
I Generally speaking, the possible values of the parameter estimators and the corresponding odds ratios have the following meaning:
$\beta < 0 \rightarrow 0 < e^{\beta} < 1$ (the odds decrease)
$\beta > 0 \rightarrow e^{\beta} > 1$ (the odds increase)
Parameter Estimation and Interpretation VIII
I If $\beta = 0$ and, accordingly, $e^{\beta} = 1$, this implies that x has no influence on the dependent variable (the odds stay the same, regardless of the category of x).
Parameter Estimation and Interpretation IX
I The independent variables can also be continuous. The odds ratio for a change of c units is then defined as: $OR(c) = e^{c \beta_1}$
I Using the AGE variable as it was defined initially (metric scale), a value of $\hat{\beta}_1 = 0.111$ was obtained. For an increase of ten years in age, this results in an estimated odds ratio of $\widehat{OR}(10) = e^{10 \cdot 0.111} \approx 3.03$
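Both odds ratios quoted in this part can be reproduced with a one-line helper. The function name `odds_ratio` is illustrative; the coefficients 2.094 and 0.111 are the estimates given in the slides.

```python
import math

def odds_ratio(beta, c=1.0):
    """Odds ratio for a change of c units in a predictor with coefficient beta."""
    return math.exp(c * beta)

print(round(odds_ratio(2.094), 1))        # dichotomized AGE: 8.1
print(round(odds_ratio(0.111, c=10), 2))  # metric AGE, ten years: 3.03
```

Note that the ten-year odds ratio is the one-year odds ratio raised to the tenth power, not ten times the one-year value.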
Assessing Model Fit I
I A number of statistics and tests are available to assess the overall fit of the model. The null and alternative hypotheses are always stated as
H0: The hypothesized model fits the data
H1: The hypothesized model does not fit the data
I Thus, non-rejection of the null hypothesis is the desired outcome.
Assessing Model Fit II
I A widely accepted test statistic is the likelihood ratio statistic that is based on the likelihood function, where the likelihood L of a model is defined as the probability of obtaining the observed data under the hypothesized model
I The likelihood L is transformed to −2 log L in order to test the hypotheses
I If the value of this statistic is not significant, this means that H0 cannot be rejected, i.e., the model is taken to fit the data
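I For the test of the overall model, the saturated model, which has as many parameters as observations and therefore reproduces the data perfectly, serves as the reference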
Assessing Model Fit III
I The comparison of observed values (that can also be seen as being predictions of the saturated model) and predicted values (by the fitted model) is based on a test statistic that is called deviance:
$D = -2 \ln \left[ \frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}} \right]$
I The term in brackets is called the likelihood ratio and the whole expression can be used for model testing purposes
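A minimal sketch of the deviance computation is given below. It assumes ungrouped binary data, for which the saturated model predicts every observation perfectly, so that its likelihood is 1 and its log-likelihood is 0. The function names, outcomes and fitted probabilities are hypothetical.

```python
import math

def log_likelihood(ys, ps):
    """Bernoulli log-likelihood of 0/1 outcomes ys given probabilities ps."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(ys, ps))

def deviance(ys, ps):
    """D = -2 ln(L_fitted / L_saturated) for ungrouped binary data."""
    log_l_saturated = 0.0  # assumption: saturated likelihood equals 1
    return -2 * (log_likelihood(ys, ps) - log_l_saturated)

# Hypothetical outcomes and fitted probabilities.
ys = [0, 0, 1, 1, 1]
ps = [0.2, 0.4, 0.5, 0.7, 0.9]
print(deviance(ys, ps))
```

The closer the fitted probabilities are to the observed outcomes, the smaller the deviance becomes.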
Assessing Model Fit IV
I The likelihood ratio statistic can also be used to determine if an additional independent variable would significantly improve the model fit
I For this purpose, the model containing the additional variable and the model not containing it are compared. If the difference between the −2 log L values of the two models is significantly large, the variable should be included in the model
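The comparison of two nested models can be sketched as follows for the case of one additional parameter, where the difference of the −2 log L values is referred to a chi-square distribution with one degree of freedom. The function name and the two −2 log L values are hypothetical; the p-value uses the identity P(χ²₁ > g) = erfc(√(g/2)), which holds only for one degree of freedom.

```python
import math

def lr_test_one_df(neg2logl_reduced, neg2logl_full):
    """Likelihood ratio test for one additional parameter.

    Returns the test statistic G and its p-value under a chi-square
    distribution with 1 degree of freedom. Assumes the models are nested,
    so that G is not negative.
    """
    g = neg2logl_reduced - neg2logl_full
    p_value = math.erfc(math.sqrt(g / 2))
    return g, p_value

# Hypothetical -2 log L values of the model without and with the variable.
g, p_value = lr_test_one_df(120.0, 112.0)
print(g, p_value)
```

For more than one additional parameter, the chi-square distribution with the corresponding degrees of freedom would be needed, which is not covered by this shortcut.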
Assessing Model Fit V
I Researchers also tried to establish measures that contain information comparable to the $R^2$ measure in Linear Regression. These 'Pseudo'-$R^2$ measures can be useful for model comparison, but have a number of drawbacks when used for the assessment of goodness-of-fit.
Prominent examples are McFadden's $R^2$ and Nagelkerke's $R^2$
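As an example of such a measure, McFadden's pseudo-R² compares the log-likelihood of the fitted model with that of the null model (intercept only): 1 − ln L(fitted) / ln L(null). The sketch below uses hypothetical log-likelihood values; the function name is illustrative.

```python
def mcfadden_r2(log_l_model, log_l_null):
    """McFadden's pseudo-R^2 from two log-likelihoods (both negative)."""
    return 1 - log_l_model / log_l_null

# Hypothetical log-likelihoods of the fitted and the intercept-only model.
print(mcfadden_r2(log_l_model=-45.0, log_l_null=-60.0))
```

A value of 0 means the predictors do not improve on the null model; unlike R² in Linear Regression, the value is not a proportion of explained variance.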
Assessing Model Fit VI
I Other measures of goodness-of-fit are the
Pearson-Chi-Square Statistic and, like in Linear
Multinomial Logistic Regression
I Multinomial logistic regression is used when the response variable has more than two categories
I The response variable may be ordinal, but the order is not taken into account
Ordinal Logistic Regression I
I Multinomial response models can become very complex,
e.g., k regressor variables and m response categories will result in (k + 1)(m − 1) coefficients that have to be
estimated
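The growth of the parameter count can be checked quickly. The sketch below evaluates (k + 1)(m − 1) for a few hypothetical combinations of k and m; the function name is illustrative.

```python
def multinomial_coefficient_count(k, m):
    """Coefficients in a multinomial logit model: (k + 1)(m - 1).

    Each of the m - 1 non-reference categories gets an intercept
    and k slopes.
    """
    return (k + 1) * (m - 1)

for k, m in [(3, 2), (3, 4), (10, 5)]:  # hypothetical model sizes
    print(k, m, multinomial_coefficient_count(k, m))
```

With m = 2 the count reduces to k + 1, the number of coefficients in binary logistic regression.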
I For ordinal responses, it may be possible to get a simpler model
I Several possibilities exist; proportional-odds logistic regression is the most common model
Ordinal Logistic Regression II
I Sometimes it is assumed that we want to measure some
latent (i.e., unobservable) metric variable in reality, but instead of the actual values, we only observe a rough categorization
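The latent-variable view can be sketched as follows: cutpoints divide the latent scale into intervals, and only the interval is observed. The second function assumes the usual proportional-odds parametrization P(Y ≤ j) = logistic(θ_j − η) with linear predictor η, which is not spelled out in the slides. All names, cutpoints and values are hypothetical.

```python
import math
from bisect import bisect_left

def observed_category(latent_value, cutpoints):
    """Index j of the interval (theta_{j-1}, theta_j] containing the value."""
    return bisect_left(cutpoints, latent_value)

def category_probabilities(eta, cutpoints):
    """Category probabilities, assuming P(Y <= j) = logistic(theta_j - eta)."""
    cumulative = [1 / (1 + math.exp(-(t - eta))) for t in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

CUTPOINTS = [-1.0, 1.0]  # two hypothetical cutpoints give three categories

print(observed_category(0.2, CUTPOINTS))       # falls between the cutpoints
print(category_probabilities(0.3, CUTPOINTS))  # probabilities sum to 1
```

Because the same η enters every cumulative probability, only the cutpoints differ between categories, which is what keeps the model simpler than the multinomial one.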
Poisson, Quasi-Poisson, Negative-Binomial regression
I Used when the response is a count
I In principle, linear regression could also be used to model counts, but since the distribution is often skewed, the assumption of normally distributed errors is not met
I In such cases, it is better to model the counts using Poisson regression, Quasi-Poisson regression, or Negative-Binomial regression
I Which one of these methods should be used depends on
the presence of overdispersion (see accompanying R-script)
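A Poisson distribution has equal mean and variance, so a variance that clearly exceeds the mean hints at overdispersion. The sketch below computes this ratio for hypothetical counts. It is only a crude check on the raw counts without explanatory variables and is not taken from the accompanying R-script; for a fitted model, the dispersion is usually estimated from the residuals instead.

```python
import statistics

def variance_to_mean_ratio(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return statistics.variance(counts) / statistics.mean(counts)

counts = [0, 1, 1, 2, 3, 5, 8, 13]  # hypothetical count data
ratio = variance_to_mean_ratio(counts)
print(ratio)  # clearly above 1 here, suggesting overdispersion
```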
I There are special methods (Hurdle regression, Zero-Inflated regression) for data that show excess zeros
Further extensions
I Beta regression for variables that assume values in the
standard unit interval (0, 1) (e.g., proportions)
I (Parametric) Survival Analysis if the response is a failure
time
I Models for Repeated Measures and Longitudinal Data