Learn About Ordered Logit in R With Data From the Behavioral Risk Factor Surveillance System (2013)


This PDF has been generated from SAGE Research Methods Datasets.


Student Guide

Introduction

This dataset example introduces ordered logit. This technique allows researchers to evaluate whether a categorical variable with three or more categories that follow some order is a function of one or more independent variables. The ordered logit model is most commonly estimated via maximum likelihood estimation (MLE).

This example describes ordered logit, discusses the assumptions underlying it, and shows how to estimate and interpret ordered logit models. We illustrate ordered logit using a subset of data from the 2013 Behavioral Risk Factor Surveillance System (BRFSS) operated by the U.S. Centers for Disease Control and Prevention (http://www.cdc.gov/brfss/). Specifically, we test whether a four-category measure of body mass index (BMI) is predicted by gender, age, and a person’s level of activity.

An analysis like this allows researchers to evaluate factors that affect personal health situations, which may be helpful in designing health policy.

What Is Ordered Logit?

Ordered logit models explain variation in a categorical variable that consists of three or more ordered categories as a function of one or more independent variables. The categories need only be ordered (e.g., lowest to highest, weakest to strongest, strongly agree to strongly disagree); the method does not require that the distance between the categories be equal. Typically the values of such variables are scored sequentially starting at 0 or 1, but the method only requires that the scoring follow some recognizable order. Ordered logit models are typically used when the dependent variable has 3 to 7 ordered categories. With more categories than that, researchers often turn to OLS regression, while if the dependent variable has only two categories, the ordered logit model reduces to simple logit.

Ordered logit is one example from the family of Generalized Linear Models (GLMs). GLMs connect a linear combination of independent variables and estimated parameters – often called the linear predictor – to a dependent variable using a link function. The link function typically involves some sort of non-linear transformation, which in the case of ordered logit means that the probabilities that a given observation in the dataset falls into each of the categories of the dependent variable are non-linear functions of the independent variables. The parameters of GLMs are typically estimated using Maximum Likelihood Estimation (MLE). Because ordered logit models are estimated via MLE, it is best if the dataset has a sufficiently large number of observations. Just how many is open to debate, but in his book Regression Models for Categorical and Limited Dependent Variables (SAGE, 1997), J. Scott Long suggests trying to meet two criteria: (1) have at least 100 observations total, and (2) have at least 10 observations for each coefficient estimated in the model.

In simple terms, MLE is an iterative process that searches for the coefficient values that maximize the fit of the model to the sample of data, that is, the values under which the observed outcomes are most likely. In that sense, MLE plays the same role for ordered logit that ordinary least squares (OLS) plays for standard regression, although it maximizes a likelihood rather than minimizing a sum of squared residuals.

When computing statistical tests, it is customary to define the null hypothesis (H0) to be tested. In ordered logit, the standard null hypothesis is that each coefficient is equal to zero. The actual coefficient estimates will not be exactly equal to zero in any particular sample of data, simply due to random chance in sampling. The tests conducted on each individual coefficient (reported as t- or z-statistics, depending on the software) are designed to help determine whether the coefficients are different enough from zero to be declared statistically significant. "Different enough" is typically defined as producing a test statistic with a level of statistical significance, or p-value, that is less than 0.05. This would lead us to reject the null hypothesis (H0) that the coefficient in question equals zero.

Estimating an Ordered Logit Model

One way to understand the ordered logit model is to imagine that there is a continuous, but unobserved, dependent variable that is a linear function of an independent variable. Let’s call that latent dependent variable Y* and the observed independent variable X. We do not observe Y*, but we do observe Y as a categorical variable. Which category of Y a case falls into depends on whether Y* crosses a given threshold.

Figure 1 illustrates this in the simple case where Y only takes on two values, coded 0 for those voting for Romney for U.S. President in 2012 and 1 for those voting for Obama. Figure 1 shows a latent, or unobserved, propensity of voting for Obama on the y-axis as Y*. This propensity is a linear function of X. At any given value for X, there is a probability distribution representing possible values for Y*. Because the area under any probability distribution sums to 1, at any given value of X, the average probability of someone voting for Obama is captured by the proportion of the probability distribution that falls above the threshold dividing Y* into those two groups. That proportion of the distribution is shown in blue in Figure 1. For a logit model, the distribution is assumed to be logistic. (Note: If the distribution were assumed to be normal, we would estimate the model using probit rather than logit.)

Figure 1: Illustration of the latent variable interpretation of a simple logit model with a single threshold and two categories for the dependent variable.

Figure 2 shows what happens when we have three ordered categories rather than two. Figure 2 still represents a latent variable Y* on the y-axis as a linear function of X. However, Y* is now divided by two thresholds into three observed categories. At any given value of X, the proportion of the probability distribution that falls below the first threshold (shown in white) represents the probability of falling into the first category. The proportion of the probability distribution that falls between the two thresholds (shown in red) represents the probability of falling into the second category. The proportion of the probability distribution that falls above the second threshold (shown in blue) represents the probability of falling into the third category. At any given value of X, all of these probabilities will (and must) sum to 1. For an ordered logit model, the distribution is assumed to be logistic. (Note: If the distribution were assumed to be normal, we would estimate the model using ordered probit rather than ordered logit.)

Figure 2: Illustration of the latent variable interpretation of an ordered logit model with two thresholds and three ordered categories for the dependent variable.
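The latent variable story behind Figures 1 and 2 can be checked numerically. The following is a minimal Python sketch, not part of the original guide: the slope, thresholds, and value of X are hypothetical numbers chosen only for illustration. It draws Y* with standard logistic errors, sorts each draw into a category by counting how many thresholds it crosses, and compares the simulated shares with the areas under the logistic distribution.

```python
import math
import random


def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_shares(x, beta, thresholds, n_draws=100_000, seed=1):
    """Draw Y* = beta * x + e with logistic e and tabulate the observed category."""
    rng = random.Random(seed)
    counts = [0] * (len(thresholds) + 1)
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        e = math.log(u / (1.0 - u))  # inverse-CDF draw from the standard logistic
        y_star = beta * x + e
        category = sum(y_star > t for t in thresholds)  # thresholds crossed
        counts[category] += 1
    return [c / n_draws for c in counts]


def analytic_probs(x, beta, thresholds):
    """Area of the logistic distribution falling between adjacent thresholds."""
    cum = [logistic_cdf(t - beta * x) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]


# Hypothetical values: one predictor, two thresholds, three categories.
beta, thresholds, x = 0.8, [-1.0, 1.0], 0.5
simulated = simulate_shares(x, beta, thresholds)
exact = analytic_probs(x, beta, thresholds)
for j, (s, p) in enumerate(zip(simulated, exact), start=1):
    print(f"category {j}: simulated {s:.3f}, analytic {p:.3f}")
```

With a large number of draws the simulated shares should sit close to the analytic ones, and both sets sum to 1 at any value of X.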

Because the latent variable Y* is unobserved, it has no natural origin or scale. The scale is set by assuming a standard logistic distribution for the errors; to set the origin in the model illustrated in Figure 1 or 2, we need to impose a restriction on either the intercept of the regression line or one of the thresholds. In the case of simple logit, where there is only one threshold and thus two categories on the dependent variable, nearly every statistical software package fixes the threshold at zero and estimates the intercept. In the case of ordered logit, where there are two or more thresholds and thus three or more categories, nearly every statistical software package fixes the intercept at zero and estimates the thresholds. In fact, many software programs refer to the set of thresholds as intercepts. These restrictions are necessary, but the choice between them is not consequential. Regardless of which restriction is imposed, the estimated impact of each independent variable on the dependent variable will be unchanged.

Ordered logit models express the latent variable Y* as a function of one or more independent variables, as shown in Equation 1:

$$ Y^{*} = X\beta + \varepsilon \tag{1} $$

Where:

• Y* is a vector representing the individual values of the latent dependent variable

• X is a matrix of the individual values for one or more independent variables

• β is a vector of coefficients that link those independent variables to the dependent variable

• ε is a vector of stochastic error terms.

Because we only observe the categorical version of Y rather than Y*, we need a way to link Xβ to the probability that an observation falls into each given category of the observed dependent variable Y. Suppose the dependent variable has J categories, indexed by j. We need a way to transform Xβ into the probability that each observation falls into each of the J categories. This also requires that we estimate values for the thresholds. In short, we need what is called a link function to perform the operation shown in Equation 2:

$$ g(p_{ij}) = \tau_j - X_i\beta \tag{2} $$

Where:

• g() is a link function we have yet to define

• pij is the cumulative probability of observation i falling into category j of the dependent variable or any category below it

• τj is the estimated threshold separating category j from those above it

• Xiβ is the matrix of individual values of the independent variables for each observation i multiplied by the vector of coefficient estimates that link those variables to the dependent variable.

For the ordered logit model, g() is the logit link function. Thus, we can calculate values for pij using the inverse of the logit link function, which is the logistic distribution function, as shown in Equation 3:

$$ p_{ij} = \frac{\exp(\tau_j - X_i\beta)}{1 + \exp(\tau_j - X_i\beta)} \tag{3} $$

Researchers have values for Y and the independent variables in their datasets. They use MLE to estimate the β coefficients as well as the thresholds, the τj values (sometimes called the intercepts). Unlike the coefficients in standard multiple regression, the β coefficients cannot be directly interpreted as slope coefficients that describe the marginal effect of each independent variable on the probability that Y falls into any particular category. Interpreting the coefficient estimates of an ordered logit model is more complicated and is described below in the context of a specific example.
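One step is worth spelling out. Equation 3 is the logistic distribution function evaluated at τj − Xiβ, so it returns the probability that observation i falls into category j or any category below it. Assuming J ordered categories separated by J − 1 thresholds (the symbol J is introduced here for illustration), the probability of each individual category follows by differencing adjacent values of Equation 3:

```latex
\begin{aligned}
\Pr(Y_i = 1) &= p_{i1}, \\
\Pr(Y_i = j) &= p_{ij} - p_{i,j-1}, \qquad 1 < j < J, \\
\Pr(Y_i = J) &= 1 - p_{i,J-1}.
\end{aligned}
```

These J probabilities sum to 1 for every observation, which matches the areas between thresholds shown in Figure 2.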

Assumptions Behind the Model

Nearly every statistical model or test relies on some underlying assumptions, and they are all affected by the mix of data you happen to have. Different textbooks present the assumptions for an ordered logit model in different ways. Here are the key factors to consider when estimating an ordered logit:

• The dependent variable must consist of ordered categories.

• The model is correctly specified (e.g. we have the right independent variables in the model properly measured).

• The values of the independent variables are fixed in repeated samples.

• The individual residuals are independent of each other and follow a logistic distribution.

• The effect of a given independent variable on the latent variable Y* is the same across all thresholds. This is sometimes called the parallel regression assumption or the proportional odds assumption. It simply means that we only estimate one β for each independent variable rather than having that coefficient estimate change as we move from one category of the dependent variable to another.

• Because it is generally estimated via MLE, ordered logit requires moderate to large sample sizes.
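The proportional odds assumption in the list above has a concrete numerical meaning: on the log-odds scale, the gap between two groups is the same at every threshold. The Python sketch below is an illustration rather than part of the original analysis. It borrows the rounded estimates that this example reports in Table 2 and compares women and men aged 30 to 64 with moderate activity; the function names are my own.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def cumulative_probs(xb, thresholds):
    """P(Y <= j) at each threshold for a given linear predictor xb."""
    return [1.0 / (1.0 + math.exp(-(t - xb))) for t in thresholds]


# Rounded estimates as reported in Table 2 of this example.
thresholds = [-4.784, -1.265, 0.297]
b_female, b_active = -0.310, -0.316

xb_men = b_active * 1               # male, age 30-64, moderate activity
xb_women = b_female + b_active * 1  # female, otherwise identical

gaps = [
    logit(pw) - logit(pm)
    for pw, pm in zip(cumulative_probs(xb_women, thresholds),
                      cumulative_probs(xb_men, thresholds))
]
print([round(g, 3) for g in gaps])
```

Each gap equals the negative of the female coefficient, because the model estimates a single β per variable. If the data called for a different gap at each threshold, the assumption would be in doubt.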

Illustrative Example: Body Mass Index (BMI) Classification

This analysis examines whether gender, age, and level of activity predict a person’s body mass index (BMI) classification. The specific research questions are:

• After controlling for age and level of activity, are women more or less likely than men to fall into a higher BMI classification?

• After controlling for gender and level of activity, are people of different age groups more or less likely to fall into a higher BMI classification?

• After controlling for gender and age, are people who engage in more strenuous exercise activities more or less likely to fall into a higher BMI classification?


Each of these research questions could be stated in the form of a null hypothesis:

• H0a = After controlling for age and level of activity, gender has no impact on BMI classification.

• H0b = After controlling for gender and level of activity, age has no impact on BMI classification.

• H0c = After controlling for gender and age, how strenuous a person’s exercise activity is has no impact on BMI classification.

The Data

This example uses data from the 2013 BRFSS. We use several variables:

• BMI classification (bmicat): 1 = Underweight, 2 = Normal Weight, 3 = Overweight, 4 = Obese.

• Whether the respondent is female (female): 1 = Yes, 0 = No.

• Age under 30 (under30): 1 = Yes, 0 = No.

• Age 65 or older (age65plus): 1 = Yes, 0 = No.

• Strenuousness of physical activity in last 30 days (active1): 0 = None or Below Moderate, 1 = Moderate, 2 = Vigorous.

Including dummy variables for those under age 30 and those age 65 or older means that the coefficients estimated for these variables are in comparison to the excluded age group, those age 30 through 64. BMI is calculated as a person’s body weight in kilograms divided by the square of their height in meters.

Respondents are classified into the four categories for the dependent variable (bmicat) as follows:

• Underweight = BMI below 18.5

• Normal Weight = BMI from 18.5 to just under 25

• Overweight = BMI from 25 to just under 30

• Obese = BMI of 30 or greater

Responses are recorded on a 4-point scale that follows an order, making this example appropriate for ordered logit.
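The classification rule can be written as a short function. This is a Python sketch for illustration only. It uses the standard definition of BMI (weight in kilograms divided by the square of height in meters) and assumes that a BMI of exactly 25 counts as Overweight and exactly 30 as Obese. The function names are hypothetical and are not taken from the BRFSS files.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2


def bmi_category(bmi_value):
    """Map a BMI value to the 1-4 coding used for bmicat in this example."""
    if bmi_value < 18.5:
        return 1  # Underweight
    if bmi_value < 25:
        return 2  # Normal Weight
    if bmi_value < 30:
        return 3  # Overweight (boundary at 25 is an assumption)
    return 4      # Obese


# A person weighing 70 kg at 1.75 m has a BMI of about 22.9.
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 1), bmi_category(example_bmi))
```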

Analyzing the Data

Before proceeding to the ordered logit model, it is a good idea to produce a frequency distribution of the dependent variable. Remember that the dependent variable records the BMI category in which each respondent falls. Table 1 presents the frequency distribution.

Table 1: Frequency distribution of BMI classification, 2013 BRFSS.

Response Category    Frequency

Underweight              5,698

Normal Weight          116,788

Overweight             130,049

Obese                  107,390

Total                  359,925

There are 359,925 respondents in this sample. The sample includes only 5,698 people classified as underweight, but 130,049 classified as overweight and 107,390 classified as obese. Ordered logit models do not perform as well if there are small numbers of observations in one or more of the categories of the dependent variable or if there is a substantial skew in the distribution of observations across the categories. While fewer than 2% of the observations fall in the first category, that category still contains more than 5,000 individuals. Thus, there is little reason based on this frequency distribution to expect problems with estimating the ordered logit model.
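The share of respondents in each category can be read straight off the counts in Table 1. A short Python sketch, included only to make the arithmetic explicit:

```python
# Frequencies as reported in Table 1.
counts = {
    "Underweight": 5698,
    "Normal Weight": 116788,
    "Overweight": 130049,
    "Obese": 107390,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, n in counts.items():
    print(f"{label:<14}{n:>9,}{shares[label]:>8.1%}")
print(f"{'Total':<14}{total:>9,}")
```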


It would also be valuable to produce summary statistics and explore the distribution of each independent variable. However, in the interest of space, we forgo doing so here.

The results of the ordered logit model itself are presented in Table 2.

Table 2: Results from an ordered logit model predicting BMI classification as a function of gender, age, and activity level, 2013 BRFSS.

                               BMI Category

Slope Coefficients

  Female                       −0.310 (0.006)***

  Under 30                     −0.850 (0.011)***

  65 and Older                 −0.113 (0.007)***

  Strenuousness of Activity    −0.316 (0.004)***

Thresholds/Intercepts

  Group 1–2                    −4.784 (0.015)***

  Group 2–3                    −1.265 (0.007)***

  Group 3–4                     0.297 (0.007)***

AIC                            821524

BIC                            821601

Log Likelihood (df = 7)        −410756

Deviance                       821511

Num. obs.                      359925

Standard errors in parentheses. *** p < 0.001, ** p < 0.01, * p < 0.05


The top portion of Table 2 reports the individual parameter estimates linking the independent variables to the dependent variable, their estimated standard errors in parentheses, and indicators of statistical significance. The middle portion of Table 2 reports the estimated values for the thresholds, or intercepts. Researchers generally do not have predictions about the thresholds or their level of statistical significance.
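The significance stars in Table 2 can be approximated from the table itself by dividing each coefficient by its standard error. The Python sketch below does this with the rounded values printed in Table 2, so the statistics are approximate. It assumes the ratio can be compared to a standard normal distribution, which is reasonable with a sample this large; the function name is hypothetical.

```python
import math

# (coefficient, standard error) as printed in Table 2.
estimates = {
    "Female": (-0.310, 0.006),
    "Under 30": (-0.850, 0.011),
    "65 and Older": (-0.113, 0.007),
    "Strenuousness of Activity": (-0.316, 0.004),
}


def wald_test(coef, se):
    """Return the z statistic and two-sided p-value under a standard normal."""
    z = coef / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


results = {name: wald_test(coef, se) for name, (coef, se) in estimates.items()}
for name, (z, p) in results.items():
    print(f"{name:<27} z = {z:8.2f}   p < 0.001: {p < 0.001}")
```

Every statistic is far from zero, which is why each null hypothesis of a zero coefficient is rejected at the 0.001 level.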

The bottom portion of Table 2 reports four measures of relative model fit and the sample size. None of the measures of model fit follow any particular scale, so they cannot be interpreted as "large" or "small" in absolute terms. They would only become relevant if we were to estimate additional models using the same exact dataset and dependent variable.

Each coefficient estimate is statistically significantly different from zero, and all are negative. From them we can say that as the value of each independent variable increases, the latent propensity of falling into the higher BMI classifications tends to decrease. However, just looking at ordered logit coefficients and tests of statistical significance does not tell the whole story. We explore some of the findings in greater detail through computing predicted probabilities.

Predicted Probabilities

We can compute the predicted probability of a respondent falling into the various BMI classifications based on the results in Table 2 using the inverse link function as shown previously in Equation 3. Because the relationship between all of the independent variables and the probability that Y falls into a particular category is nonlinear, you can only compute a predicted probability by setting every independent variable in the model to some specific value.

For example, to compare the predicted probabilities of falling into each of the four categories on the dependent variable among women and men, we need to set the value for the female indicator variable to the appropriate value and we need to set all of the other independent variables to some fixed value as well. The most common strategy is to set the remaining variables to central measures such as their means, medians, or modes. An alternative is to compute the predicted probability for each observation based on its own values for its independent variables, but this makes it harder to isolate the potential effect of any one independent variable. In order to keep it simple, we set the activity variable to its middle value of "Moderate" and we also set the two age variables equal to zero, which means our predicted probabilities for women and men will be for people aged 30 to 64.

Table 3 reports the results of estimating these predicted probabilities using post-estimation simulation. A full discussion of this process is beyond the scope of this example, but briefly, the process computes 1,000 sets of predicted probabilities by simulating values for the model coefficients based on their estimated values, variances, and covariances. For more information, see “Making the most of statistical analyses: improving interpretation and presentation” by King, Tomz, and Wittenberg (American Journal of Political Science, 44(2): 341–355).
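The mechanics of post-estimation simulation can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used to build Table 3. It draws each parameter independently from a normal distribution using only the standard errors printed in Table 2, which amounts to assuming zero covariances. The full method draws from a multivariate normal with the complete covariance matrix, which Table 2 does not report, so the interval produced here should not be expected to match Table 3. The sketch tracks one quantity: the probability that a man aged 30 to 64 with moderate activity is classified as Obese.

```python
import math
import random

# (estimate, standard error) as printed in Table 2.
TAU3 = (0.297, 0.007)       # threshold between Overweight and Obese
B_ACTIVE = (-0.316, 0.004)  # coefficient on strenuousness of activity


def prob_obese(tau3, xb):
    """P(Y = 4) = 1 - P(Y <= 3) under the ordered logit model."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(tau3 - xb)))


def simulate(n_sims=1000, seed=2013):
    """Simulate the predicted probability; independence of draws is an assumption."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        tau3 = rng.gauss(*TAU3)
        b_active = rng.gauss(*B_ACTIVE)
        draws.append(prob_obese(tau3, b_active * 1))  # active1 = 1 (Moderate)
    draws.sort()
    mean = sum(draws) / n_sims
    lower = draws[int(0.025 * n_sims)]      # approximate 2.5th percentile
    upper = draws[int(0.975 * n_sims) - 1]  # approximate 97.5th percentile
    return mean, lower, upper


mean, lower, upper = simulate()
print(f"mean {mean:.3f}, 95% interval {lower:.3f} to {upper:.3f}")
```

The mean of the simulated values serves as the point estimate, and the percentiles of the sorted draws serve as the interval.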

Table 3: Estimated predicted probability of female and male respondents falling into each of four BMI classifications for people age 30 to 64 who engaged in moderately strenuous exercise in the last 30 days, 2013 BRFSS.

Expected Values

                   Women                        Men

Category           Probability   95% CI         Probability   95% CI

Underweight        0.015         0.015–0.016    0.011         0.011–0.012

Normal Weight      0.330         0.317–0.343    0.268         0.255–0.280

Overweight         0.370         0.367–0.373    0.369         0.363–0.375

Obese              0.284         0.270–0.300    0.352         0.334–0.370

Table 3 shows that women are more likely than men to fall into the Normal Weight classification and less likely than men to fall into the Obese classification. Men and women have similar probabilities of being Underweight or being Overweight.

Overall, Table 3 shows that gender is significantly related to a person’s BMI classification.
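The point estimates in Table 3 can be approximated directly from Table 2 by applying Equation 3 at each threshold and differencing the resulting cumulative probabilities. The Python sketch below uses the rounded coefficients printed in Table 2 and skips the simulation step, so small discrepancies from Table 3 in the third decimal place are expected. The function and dictionary names are hypothetical.

```python
import math

# Rounded estimates as printed in Table 2.
THRESHOLDS = [-4.784, -1.265, 0.297]
COEFS = {"female": -0.310, "under30": -0.850, "age65plus": -0.113, "active1": -0.316}
LABELS = ["Underweight", "Normal Weight", "Overweight", "Obese"]


def category_probs(profile):
    """Predicted probability of each BMI category for one covariate profile."""
    xb = sum(COEFS[name] * value for name, value in profile.items())
    # Equation 3 at each threshold gives P(Y <= j); the top category closes at 1.
    cumulative = [1.0 / (1.0 + math.exp(-(t - xb))) for t in THRESHOLDS] + [1.0]
    return [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]


# Age 30 to 64 (both age dummies at 0) with moderate activity (active1 = 1).
women = category_probs({"female": 1, "under30": 0, "age65plus": 0, "active1": 1})
men = category_probs({"female": 0, "under30": 0, "age65plus": 0, "active1": 1})

for label, pw, pm in zip(LABELS, women, men):
    print(f"{label:<14} women {pw:.3f}   men {pm:.3f}")
```

Changing the values in the profile dictionary gives predicted probabilities for any other combination of gender, age group, and activity level.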

Tables like Table 3 are helpful when you only need to compute a small number of predicted probabilities to interpret the findings of the model. However, for continuous independent variables or variables that take on more values, it is better to compute a large number of predicted probabilities and present the results graphically. For example, this analysis includes two dummy variables representing age. One (under30) is coded 1 for respondents who are under 30 years of age and 0 otherwise. The other (age65plus) is coded 1 for respondents age 65 and older and 0 otherwise. This means when both are equal to 0, the respondent is age 30 to 64.

Figure 3 presents the predicted probability of falling into each BMI classification for respondents in each of these three age groups graphically. The height of each bar captures the size of the predicted probability and the color of each bar indicates the particular BMI category being plotted.

Figure 3: Predicted probability of respondents falling into each of four BMI classifications across different age groups while holding the female and activity variables at their respective means, 2013 BRFSS.


Figure 3 shows that the most dramatic movement occurs between the first and second age groups. The predicted probability of someone under age 30 being at normal weight (blue bars) is nearly 0.5, but that declines to 0.3 for people age 30 through 64. At the same time, the predicted probability of someone under 30 being obese (orange bars) is about 0.18, but that rises to more than 0.3 for those age 30 through 64. In contrast, the BMI classification distributions for those age 30 through 64 and those age 65 and older are nearly identical. In short, Figure 3 shows a clear relationship between age and BMI classification, with nearly all of the effect occurring between the younger and middle-aged groups.

Complete interpretation of the results of an ordered logit model would present similar tables or figures for every independent variable in the model.

Presenting Results


The results of an ordered logit model can be presented in a variety of ways. Here we offer one example.

“We used a subset of data from the 2013 BRFSS to test three null hypotheses:

• H0a = After controlling for age and level of activity, gender has no impact on BMI classification.

• H0b = After controlling for gender and level of activity, age has no impact on BMI classification.

• H0c = After controlling for gender and age, how strenuous a person’s exercise activity is has no impact on BMI classification.

The data included 359,925 individual respondents. Results from the ordered logit model are presented in Table 2. Those results show that gender, age, and prior activity level are statistically significantly linked to the probability of falling into a particular BMI category. Each coefficient is negative, meaning that as the value of that independent variable increases, the probability of a respondent falling into a higher BMI category decreases. Table 3 explores the effect of gender more fully, showing that the real differences between women and men occur in the Normal Weight and Obese categories. Figure 3 shows that the biggest differences in age occur between those under age 30 and those age 30 through 64, again in the Normal Weight and Obese categories. Further interpretation and diagnostic testing should be explored to evaluate the robustness of these findings.”

Review

Ordered logit expresses an ordered categorical dependent variable as a function of one or more independent variables. Ordered logit models are estimated via MLE. Direct interpretation of the coefficient estimates is limited to whether they are positive, negative, or not statistically significant. To really understand the results of an ordered logit model requires calculating predicted probabilities.


The ordered logit model is very similar to the ordered probit model. Ordered logit simply assumes the residuals of the latent variable model follow a logistic distribution, whereas the ordered probit model assumes they follow a normal distribution. Ordered logit is also somewhat similar to the multinomial logit model, in which the dependent variable takes on three or more categorical values that are not assumed to follow any order. There is also a multinomial probit model, which shares some similarities with multinomial logit.

You should know:

• What types of variables are suitable for an ordered logit model.

• The basic assumptions behind the ordered logit model.

• How to estimate and interpret the results of an ordered logit model.

• How to report the results from an ordered logit model.

Your Turn

You can download the sample dataset along with a guide showing how to estimate an ordered logit model using statistical software. See if you can reproduce the results presented here, then try producing your own ordered logit model using genhealth as the dependent variable. It measures a respondent’s general health on a 5-point scale where 1 = Excellent and 5 = Poor.