Statistics and Data Analysis

PROC LOGISTIC: The Logistics Behind Interpreting Categorical Variable Effects

Taylor Lewis, U.S. Office of Personnel Management, Washington, DC

ABSTRACT

The goal of this paper is to demystify how SAS models (a.k.a. parameterizes) categorical variables in PROC LOGISTIC. Specifically, readers will become more familiar with the commonly used effect and reference parameterizations. In conjunction with these two parameterizations and associated options, this paper touches on issues such as why SAS needs to create dummy variables for the k distinct categories and why the output displays estimates for only k – 1 parameters. At the conclusion of the paper, readers should feel more confident interpreting a categorical variable’s effect on the response as well as testing for significance, by way of the odds ratios computed from the output or via the CONTRAST statement. The discussion uses real-world data from the U.S. Office of Personnel Management, collected for a multiple logistic regression project in which the likelihood of a promotion for Federal civilian employees was modeled using personnel data.

BACKGROUND

PROC LOGISTIC is the SAS/STAT procedure that allows users to model and analyze factors affecting the outcome of a dichotomous response variable, one in which an ‘event’ or ‘nonevent’ can occur. After some initial derivations to linearize this modeling process (the details of which are not a concern of this paper), the end result involves computing the log-odds, or logits, and producing a logit function, L(x), modeled as follows:

$$L(x) = \log\left(\frac{P(\text{event} \mid x)}{P(\text{nonevent} \mid x)}\right) = \beta_0 + \beta_1 x$$

In the instance of a continuous variable, β1 is interpreted as the change in the log-odds for a one-unit increase in the variable x. Exponentiating this parameter estimate, exp(β1), gives the more readily interpretable multiplicative change in the odds themselves (no more logarithms) for that same one-unit increase in x.
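To see why exponentiating β1 yields an odds ratio, the short derivation below (a sketch that uses only the logit model defined above) compares the odds at x + 1 with the odds at x:

```latex
\frac{\text{odds}(x+1)}{\text{odds}(x)}
  = \frac{\exp\{\beta_0 + \beta_1 (x+1)\}}{\exp\{\beta_0 + \beta_1 x\}}
  = \exp(\beta_1)
```

The intercept cancels, so exp(β1) is the factor by which the odds are multiplied for each one-unit increase in x, whatever the starting value of x.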

The plot thickens, however, when the predictor variable of interest is categorical in nature, rather than continuous. A series of design, or dummy, variables must be created for the different levels of the categorical variable, and interpretations and tests of significance can quickly become more involved. Lucky for us, PROC LOGISTIC performs a lot of the nitty-gritty modeling work behind the scenes, but it is imperative to first understand the varying SAS parameterization schemes available before utilizing the PROC’s options and output to guide SAS in producing exactly what is desired.

EFFECT CODING – THE DEFAULT PARAMETERIZATION

Through the course of this paper, we will consider a personnel data extract of nearly 60,000 Federal employees used to model the likelihood of promotion over a one-year period. The SAS data set PROM contains, for each employee, the variable PROMOTION given as ‘1’ if a promotion occurred, ‘0’ if not. The predictor variable to be investigated is education level attainment, EDLEVEL, consisting of four groups of employees: A=high school diploma or equivalent; B=bachelor’s degree; C=master’s degree; and D=Ph.D. To initially model education, we invoke PROC LOGISTIC with the following syntax:

PROC LOGISTIC data=prom descending;
   CLASS edlevel;
   MODEL promotion = edlevel;
RUN;

A note about the ‘descending’ option in the PROC LOGISTIC statement: by default, SAS models the probability of the lower-ordered response value, which here is PROMOTION=‘0’. Recall that our data indicate a promotion with a ‘1’, and discussion makes more sense when talking about the likelihood of promotion as opposed to the likelihood of not being promoted. This option is a quick way to reverse the SAS default.
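As an alternative to reversing the sort order, the event category can be named explicitly. The sketch below assumes a SAS release in which the MODEL statement accepts the EVENT= response variable option; it is not the syntax used for the output shown in this paper, but it should request the same model of PROMOTION=‘1’.

```sas
/* Sketch: name the modeled event directly instead of using DESCENDING. */
/* Assumes the EVENT= response option is available in your SAS release. */
proc logistic data=prom;
   class edlevel;
   model promotion(event='1') = edlevel;
run;
```

Naming the event makes the intent visible in the code itself, which can help when a response variable is later recoded.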

We immediately note from the Analysis of Maximum Likelihood Estimates section of the output that parameter estimates are given for EDLEVEL A, B, and C but not D:

Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept     1   -2.0146     0.0221   8282.8034      <.0001
EDLEVEL A     1    0.2589     0.0280     85.6330      <.0001
EDLEVEL B     1    0.2596     0.0247    110.4632      <.0001
EDLEVEL C     1   -0.0206     0.0313      0.4324      0.5108

We also note there is a Class Level Information section with a curious matrix of 1s, 0s, and -1s:

Class Level Information

Class      Value    Design Variables
EDLEVEL    A         1    0    0
           B         0    1    0
           C         0    0    1
           D        -1   -1   -1

This parameterization scheme is PROC LOGISTIC’s default effect coding of dummy variables. SAS sorts the class variable’s value list and assigns dummy variables for one less than the number of distinct values, omitting the last category; the number of columns under the Design Variables heading indicates the count of dummy variables created. An initial roadblock with this scheme is that the parameter estimates of the dummy variables are not directly interpretable as comparisons between levels; each measures the difference between that classification level’s effect and the average effect across all levels.
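The "difference from the average" interpretation can be read off the design matrix. The sketch below writes out the four logit functions implied by the rows of the Design Variables matrix (with βA, βB, βC denoting the three dummy variable parameters) and averages them:

```latex
\begin{aligned}
L(A) &= \beta_0 + \beta_A, \qquad L(B) = \beta_0 + \beta_B, \qquad L(C) = \beta_0 + \beta_C, \\
L(D) &= \beta_0 - \beta_A - \beta_B - \beta_C, \\
\tfrac{1}{4}\,\{L(A) + L(B) + L(C) + L(D)\} &= \beta_0,
\qquad\text{so}\qquad \beta_A = L(A) - \beta_0 .
\end{aligned}
```

Under this coding the intercept is the unweighted mean of the four level logits (not weighted by group size), and each dummy variable parameter is that level's departure from the mean. With the estimates above, for example, L(A) = -2.0146 + 0.2589 = -1.7557.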

Notice, however, there is an Odds Ratio Estimates section in the output:

Odds Ratio Estimates

                     Point         95% Wald
Effect            Estimate    Confidence Limits
EDLEVEL A vs D       2.131      1.817     2.500
EDLEVEL B vs D       2.133      1.826     2.491
EDLEVEL C vs D       1.612      1.368     1.899

For any logistic regression model without interaction terms, SAS computes a series of odds ratios and confidence limits for each class variable. It is important to review how these odds ratios are computed, since SAS will not output all possible comparisons of interest.

From the Design Variables section of Class Level Information, the first, second, and third columns correspond to the dummy variables for groups A, B, and C, which are all of the dummy variables in the model. Each row can be thought of as the sequence of coefficients to be placed in front of the dummy variable parameter estimates to arrive at a logit function estimate for that particular level. For instance, the row of -1s for the last group, D, corresponds to a logit function of β0 + (-1)*βA + (-1)*βB + (-1)*βC, or β0 - βA - βB - βC. Assume we want to investigate the odds of promotion between groups A and D. Our log-odds difference of interest is

$$L(A) - L(D) = (\beta_0 + \beta_A) - (\beta_0 - \beta_A - \beta_B - \beta_C) = 2\beta_A + \beta_B + \beta_C = 2(0.2589) + 0.2596 + (-0.0206) = 0.7568$$

And the odds ratio turns out to be exp(0.7568) = 2.13, exactly as seen in the first row of the Odds Ratio Estimates output.

This says the odds of promotion for those educated at the high school level are more than double those of the Ph.D. level.
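Because an odds ratio is a ratio of odds rather than of probabilities, it can help to convert the two logits back to the probability scale. The DATA step below is a minimal sketch that hard-codes the effect-coded estimates printed above; the variable names (b0, bA, logitA, and so on) are illustrative and not part of the original analysis.

```sas
/* Sketch: rebuild the A vs. D comparison from the printed estimates. */
data _null_;
   b0 = -2.0146;  bA = 0.2589;  bB = 0.2596;  bC = -0.0206;

   logitA = b0 + bA;                 /* row  1  0  0 of the design matrix */
   logitD = b0 - bA - bB - bC;       /* row -1 -1 -1 of the design matrix */

   oddsRatio = exp(logitA - logitD); /* odds ratio, A vs. D               */
   pA = 1 / (1 + exp(-logitA));      /* estimated P(promotion), group A   */
   pD = 1 / (1 + exp(-logitD));      /* estimated P(promotion), group D   */
   probRatio = pA / pD;              /* ratio of probabilities            */

   put oddsRatio= 6.3 pA= 6.4 pD= 6.4 probRatio= 6.3;
run;
```

With these inputs the odds ratio comes out near 2.13, while the ratio of the two estimated probabilities is somewhat smaller, about 1.96, which is why the comparison is best described in terms of odds.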

Knowing how the odds ratios are calculated gives us greater flexibility to compare, say, two levels within a classification variable that do not happen to be listed in the Odds Ratio Estimates output. For instance, we may wish to investigate a statistical difference between group A, high school graduates, and group B, bachelor’s degrees. We note from the output how close the maximum likelihood parameter estimates for the two groups are and further reason the model could be simplified if we could collapse groups A and B into one group.

For the two groups, we take coefficients from the first and second rows of the Class Level Information matrix to arrive at the following:

$$L(A) - L(B) = (\beta_0 + \beta_A) - (\beta_0 + \beta_B) = \beta_A - \beta_B = 0.2589 - 0.2596 = -0.0007$$

We observe this logit difference is approximately zero, and exp(0) = 1. With an odds ratio of 1, the probabilities of promotion between the two groups are roughly the same, so it is not necessary for the model to distinguish between them. It may prove easier to collapse groups A and B together into one category covering all employees who have attained a bachelor’s degree or less.
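If the analyst does decide to combine the two groups, one way to do so is sketched below. It assumes EDLEVEL is a character variable holding the raw values 'A' through 'D'; the data set name PROM_COLLAPSED and the variable EDLEVEL2 are hypothetical names introduced here for illustration.

```sas
/* Sketch: collapse groups A and B into a single level, then refit. */
data prom_collapsed;
   set prom;
   length edlevel2 $2;
   if edlevel in ('A', 'B') then edlevel2 = 'AB';  /* bachelor's or less  */
   else edlevel2 = edlevel;                        /* C and D unchanged   */
run;

proc logistic data=prom_collapsed descending;
   class edlevel2;
   model promotion = edlevel2;
run;
```

Keeping the original EDLEVEL variable on the data set makes it easy to return to the four-level model if the collapsed version proves too coarse.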

REFERENCE CODING – AN ALTERNATIVE PARAMETERIZATION

While there are situations where effect coding is preferable, SAS allows users to change this setting to other parameterizations. A second useful coding scheme is called reference coding, where one level of the classification variable is designated as the reference level to which parameter estimates for the remaining levels are directly comparable. Under this coding scheme, the exponentiated parameter estimate of a level is interpreted as the odds ratio between that level and the reference level. Hence, it would make sense to assign as the reference level whichever level we wanted to pit against all others.

Suppose we were interested in reporting the effect of education level on promotion likelihood and wanted to compare, individually, those who had obtained a bachelor’s, master’s, or Ph.D. with those holding a high school diploma. We can use additional CLASS statement options to reference parameterize EDLEVEL with group A as the reference category:

PROC LOGISTIC data=prom desc;
   CLASS edlevel(param=ref ref='A');
   MODEL promotion = edlevel;
RUN;

In parentheses after the listed CLASS variable, param=ref overrides the default param=effect and ref='A' designates the high school level to be the reference. Other ref options are LAST, the default, which sorts the distinct variable levels and sets the last level as the reference, and FIRST, which sorts and sets the first value in the list as the reference.

Interestingly, the ref= option in the CLASS statement is also available under the effect parameterization; it determines what level gets the -1 row of dummy variable coefficients and, thus, what group is compared to all others in the Odds Ratio Estimates portion of the output.
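For illustration, the sketch below keeps the default effect coding but moves the -1 row to group A. It was not run for this paper, so no output is shown; following the description above, the Design Variables matrix would be expected to show -1 -1 -1 in the row for A, and the Odds Ratio Estimates section would compare B, C, and D each against A.

```sas
/* Sketch: effect coding with group A, rather than D, as the omitted level. */
proc logistic data=prom descending;
   class edlevel (param=effect ref='A');
   model promotion = edlevel;
run;
```

The individual parameter estimates are still deviations from the average effect; only the omitted level and the comparisons reported as odds ratios change.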

Looking at the output, we note some differences in the Analysis of Maximum Likelihood Estimates and Class Level Information matrix from what we initially saw under the effect parameterization:

Analysis of Maximum Likelihood Estimates

                            Standard        Wald
Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept     1   -1.7557     0.0242   5269.8636      <.0001
EDLEVEL B     1  0.000723     0.0287      0.0006      0.9799
EDLEVEL C     1   -0.2794     0.0395     50.0299      <.0001
EDLEVEL D     1   -0.7567     0.0814     86.4424      <.0001

Class Level Information

Class      Value    Design Variables
EDLEVEL    A         0    0    0
           B         1    0    0
           C         0    1    0
           D         0    0    1

In terms of the parameter estimates, notice how no dummy variable is created for the reference group A, as the three other groups’ estimates are interpreted as the difference in the log-odds from that first group. The 0.0007 parameter estimate for EDLEVEL group B suggests a small, nearly zero increase in the log-odds compared to group A. This is precisely the conclusion we drew under the effect coding. This should serve as an affirmation that PROC LOGISTIC can take more than one path to arrive at a given conclusion. The path ultimately chosen can be whichever is most comfortable for the analyst.

Rest assured, we are still able to compute odds ratios by hand from the Class Level Information matrix by plugging in the appropriate dummy variables:

$$L(A) - L(B) = (\beta_0) - (\beta_0 + \beta_B) = -\beta_B = -0.0007$$

Recall that our model parameter estimates under the reference coding have a new interpretation involving odds ratios related to the reference level, but they are still reported in the output as log-odds differences. To quickly convert these to odds-ratios sans logarithms, we have the EXPB option available in the MODEL statement

MODEL promotion = edlevel / expb;

This adds a column to the end of the parameter estimates output:

                            Standard        Wald
Parameter    DF  Estimate      Error  Chi-Square  Pr > ChiSq  Exp(Est)
Intercept     1   -1.7557     0.0242   5269.8636      <.0001     0.173
EDLEVEL B     1  0.000723     0.0287      0.0006      0.9799     1.001
EDLEVEL C     1   -0.2794     0.0395     50.0299      <.0001     0.756
EDLEVEL D     1   -0.7567     0.0814     86.4424      <.0001     0.469

Again, this last column is simply the Estimate column exponentiated for quick reference. We observe how this agrees with the Odds Ratio Estimates section of the output, which is still created:

Odds Ratio Estimates

                     Point         95% Wald
Effect            Estimate    Confidence Limits
EDLEVEL B vs A       1.001      0.946     1.059
EDLEVEL C vs A       0.756      0.700     0.817
EDLEVEL D vs A       0.469      0.400     0.550

THE CONTRAST STATEMENT

We have seen how we can compute basic odds ratios by hand. The limitation to these is they lack confidence intervals on the estimates. We often want to check that the odds ratio estimate’s confidence interval does not contain 1, for example. The Odds Ratio Estimates output will contain confidence intervals, but only for the levels of a categorical variable compared to one particular reference level. Though we could re-run PROC LOGISTIC with differing reference levels to get additional odds ratio estimates and confidence intervals, we are still restricted to a one-to-one comparison. It may be prudent to investigate a difference between the average of two EDLEVEL groups compared with a reference group, as we will explore momentarily, or any other relevant combination of levels.

To solve this dilemma, we can make use of the CONTRAST statement. It is in constructing these statements that we need to be familiar with the Class Level Information matrix and the effect versus reference parameterizations.

The general syntax of the CONTRAST statement is

CONTRAST 'label' var-name dummy-coeff-1 <…dummy-coeff-n> </ options >;

After providing a label (required, since more than one CONTRAST statement is allowed), we give the name of the variable for which we are interested in constructing odds ratios. Immediately after that, we assign the dummy variable coefficients by consulting the Class Level Information matrix.

Just as we did by hand, we can use the CONTRAST statement in a simple, one-to-one comparison to test the logit function difference between EDLEVEL A and D. Recall that under effect coding we had

$$L(A) - L(D) = (\beta_0 + \beta_A) - (\beta_0 - \beta_A - \beta_B - \beta_C) = 2\beta_A + \beta_B + \beta_C$$

The CONTRAST statement syntax would then be

CONTRAST 'EDLEVEL A vs. D' EDLEVEL 2 1 1 / estimate=both;

Contrast Test Results

                         Wald
Contrast           DF  Chi-Square  Pr > ChiSq
EDLEVEL A vs. D     1     86.4424      <.0001

Contrast Rows Estimation and Testing Results

                                      Standard                                  Wald
Contrast          Type  Row  Estimate    Error  Alpha  Confidence Limits  Chi-Square
EDLEVEL A vs. D   PARM    1    0.7567   0.0814   0.05   0.5972   0.9162      86.4424
EDLEVEL A vs. D   EXP     1    2.1313   0.1735   0.05   1.8170   2.4999      86.4424

With no options in the CONTRAST statement, the only output is the Wald test of the null hypothesis that the difference in the logit functions is zero. We see here that the Wald test statistic is large, so we have a significant result, but the test alone does not tell us in which direction the odds are favored. The estimate=both option in the CONTRAST statement adds the value of the logit function difference in both log-odds terms (Type=PARM line) and exponentiated, odds ratio terms (Type=EXP line). The 2.1313 is the same odds ratio we calculated by hand earlier, and the 95% confidence interval (1.8170, 2.4999) matches what was seen in the Odds Ratio Estimates section of the output.

Relating this to the reference parameterization with ‘A’ as the reference level, we reason that the parameter for the third dummy variable SAS created for EDLEVEL is the log-odds difference of group D vs. group A, whose exponentiated value is the corresponding odds ratio. To invert this comparison and make it comparable to the contrast above, testing -1 times this estimate produces the desired group A vs. group D odds ratio.

CONTRAST 'EDLEVEL A vs. D' EDLEVEL 0 0 -1 / estimate=both;

Though we refrain from reprinting, the syntax above produces the exact same contrast output as does the syntax under effect parameterization of EDLEVEL.
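Putting the pieces together, the sketch below shows the two equivalent A vs. D contrasts in context, one per parameterization. The second CONTRAST statement in the reference-coded step is an added illustration, not part of the original analysis: with A as the reference, the average of the C and D logits minus the A logit reduces to (βC + βD)/2, giving coefficients 0, 0.5, 0.5.

```sas
/* Effect coding (default): A vs. D is 2*bA + bB + bC. */
proc logistic data=prom descending;
   class edlevel (param=effect);
   model promotion = edlevel;
   contrast 'EDLEVEL A vs. D' edlevel 2 1 1 / estimate=both;
run;

/* Reference coding with A as reference: A vs. D is -bD. */
proc logistic data=prom descending;
   class edlevel (param=ref ref='A');
   model promotion = edlevel;
   contrast 'EDLEVEL A vs. D' edlevel 0 0 -1 / estimate=both;
   /* Added illustration: average of groups C and D compared with A. */
   contrast 'Avg of C,D vs. A' edlevel 0 0.5 0.5 / estimate=both;
run;
```

Because the coefficients depend on both the parameterization and the reference level, it is worth rederiving them from the Class Level Information matrix whenever either setting changes.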

We saw there was very little difference between odds of promotion between EDLEVEL groups A and B, suggesting we could collapse the two groups to simplify the model. We could also employ the CONTRAST statement to jointly test whether groups A/B and C/D could be collapsed, respectively. One can separate by a comma two parts, or rows, of a contrast.

Staying with reference coding and ‘A’ as the reference level, to test A vs. B you would have

$$L(A) - L(B) = (\beta_0) - (\beta_0 + \beta_B) = -\beta_B$$

Furthermore, to test C vs. D you would have

$$L(C) - L(D) = (\beta_0 + \beta_C) - (\beta_0 + \beta_D) = \beta_C - \beta_D$$

So we painlessly determined the dummy variable coefficients necessary for the CONTRAST statement. This time we apply a few more options. The first is the estimate=exp option, which outputs only the exponentiated logit function (odds ratio); the second is the e option that outputs the vector of coefficients and corresponding dummy variables.

This is good practice to double-check that the contrast being calculated is what the analyst intended. Needless to say, changes to the reference level or parameterization scheme can quickly change what a sequence of coefficients is actually testing.

contrast 'Joint A/B & C/D' edlevel -1 0 0,
                           edlevel 0 1 -1 / e estimate=exp;

This produces the following output:

Coefficients of Contrast Joint A/B & C/D

Parameter    Row1    Row2
Intercept       0       0
EDLEVELB       -1       0
EDLEVELC        0       1
EDLEVELD        0      -1

Contrast Test Results

                         Wald
Contrast           DF  Chi-Square  Pr > ChiSq
Joint A/B & C/D     2     32.4769      <.0001

Contrast Rows Estimation and Testing Results

                                      Standard
Contrast          Type  Row  Estimate    Error  Alpha  Confidence Limits
Joint A/B & C/D   EXP     1    0.9993   0.0287   0.05   0.9446   1.0571
Joint A/B & C/D   EXP     2    1.6117   0.1350   0.05   1.3677   1.8992

After acknowledging the Coefficients of Contrast as what we intended, we note that the Contrast Test Results section yields a Wald test statistic which suggests strongly the contrast is not equal to zero. Virtually all of the deviation from zero is clearly coming from the second part of the contrast between group C and group D, as the odds ratio for that comparison is significantly greater than 1 (1.6117), while the group A vs. group B odds ratio is not significantly different from 1. At this point, we conclude that we cannot jointly collapse groups A with B and C with D.
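For readers who want to see what the 2-degree-of-freedom test is doing, the joint hypothesis can be written in matrix form. The sketch below assumes the usual Wald construction, with b denoting the vector of parameter estimates (Intercept, B, C, D) and V-hat its estimated covariance matrix; neither symbol appears in the SAS output itself.

```latex
H_0:\; \mathbf{L}\boldsymbol{\beta} = \mathbf{0},
\qquad
\mathbf{L} =
\begin{pmatrix}
0 & -1 & 0 & 0 \\
0 & 0 & 1 & -1
\end{pmatrix},
\qquad
W = (\mathbf{L}\mathbf{b})^{\top}
    \left(\mathbf{L}\,\widehat{\mathbf{V}}\,\mathbf{L}^{\top}\right)^{-1}
    (\mathbf{L}\mathbf{b})
```

The rows of L are the two columns of the Coefficients of Contrast table, and W is compared with a chi-square distribution whose degrees of freedom equal the number of rows, here 2. The reported value of 32.4769 is this statistic.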

CONCLUSION

This paper outlined two parameterization schemes for a logistic regression model in which the predictor variable is categorical. There are other parameterizations available within SAS for this PROC, but in the author’s practice and experience the effect and reference parameterizations are used most frequently. At an initial glance of the unabridged output from a PROC LOGISTIC invocation, the sheer amount of output can make interpretation and analysis appear a daunting task. Yet after a little work picking out the relevant sections and tweaking the SAS code with a few added options, the task at hand can be quickly simplified, especially once one realizes how the various sections are interrelated.

REFERENCES

SAS Institute Inc. 2004. SAS/STAT® 9.1 User’s Guide. Cary, NC: SAS Institute Inc.

Hosmer, David, and Lemeshow, Stanley. 1989. Applied Logistic Regression. John Wiley & Sons.

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. John Wiley & Sons.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Taylor Lewis

U.S. Office of Personnel Management (OPM)
1900 E St., NW, Room 7439
Washington, DC 20415
Work Phone: (202) 606-1309
Fax: (202) 606-1719

E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.