Training Document
Logistic Regression
Table of Contents
Objective
About Logistic Regression
CONCEPT
Steps of developing a Logistic Regression Model
Key Metrics Finalization
Rolling Performance Windows
Data Preparation
Data Treatment
Derived variables creation
Data Split
Oversampling
Variable Selection/Reduction
Data Distribution Related Issues
Information Value
WOE Approach
MULTI COLLINEARITY CHECK
Standardization of Variables
Logistic Regression Procedure
Key Model Statistics
Model Fit Statistics
Model description
KS Statistic and Rank Ordering
Gini and Lorenz curves
Divergence Index Test
Clustering checks
Deviance and Residual Test
Model Validation
1) Re-estimation on Hold out sample
2) Rescoring on bootstrap samples
Objective
The purpose of this document is to guide new joiners, or anyone new to logistic modelling, through each step from data collection and preparation to logistic modelling results and validation.
Each stage is covered at an introductory level. The reader starts at the beginning and sees how each stage of the process contributes to the overall problem, and how the stages interact and flow together while progressing towards a final solution and its presentation.
The focus will be on execution of each step of the process and methods used to verify the integrity of the process.
About Logistic Regression
Logistic regression uses maximum likelihood estimation to develop models. It is a form of statistical modeling that is often appropriate for dichotomous outcomes, for example good and bad. It describes the relationship between a binary dependent (predicted) variable and a set of independent explanatory variables from a set of observations. The independent variables typically comprise demographic characteristics, past performance characteristics, and product-related characteristics. Essentially, it is a method of finding the best fit to a set of data points.
CONCEPT
Logistic Regression predicts the probability (P) that an event (Y) occurs through the following equation:
Log(P/(1-P)) = α + β1X1 + β2X2 + ... + βnXn
P is the probability that the event Y occurs, p(Y=1)
Odds = P/(1-P)
Log(P/(1-P)) = log of the odds (the logit)
METHOD OF ESTIMATION
• Maximum Likelihood Estimation: The coefficients α, β1, β2, ..., βn are estimated such that the log of the likelihood function is as large as possible.
• Maximum likelihood solves the following condition for each predictor Xj (and, with Xj = 1, for the intercept α): Σ (Yi – p(Yi=1)) Xij = 0, summed over all observations i = 1, 2, ..., N.
• Assumption: Yi and Yj are independent for all i≠j.
• There are no distributional assumptions on the independent predictors.
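The estimation condition above can be checked numerically. The following is a minimal Python sketch (illustrative only, not part of the SAS workflow) that fits a one-predictor model by plain gradient ascent on the log likelihood and then evaluates the two sums of (Y – p(Y=1)) X. The data, the function names and the step size are assumptions made for this example; PROC LOGISTIC uses its own optimisation routine.

```python
import math

def sigmoid(z):
    """Logistic function: converts log odds z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit log(P/(1-P)) = alpha + beta*x by gradient ascent on the log likelihood."""
    alpha, beta = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals (Y - p(Y=1)) drive both likelihood equations.
        resid = [y - sigmoid(alpha + beta * x) for x, y in zip(xs, ys)]
        alpha += lr * sum(resid) / n
        beta += lr * sum(r * x for r, x in zip(resid, xs)) / n
    return alpha, beta

# Made-up data: x is a single predictor, y = 1 if the event occurred.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

alpha, beta = fit_logistic(xs, ys)
resid = [y - sigmoid(alpha + beta * x) for x, y in zip(xs, ys)]
score_alpha = sum(resid)                             # condition for the intercept
score_beta = sum(r * x for r, x in zip(resid, xs))   # condition for the predictor
print(alpha, beta, score_alpha, score_beta)
```

At the fitted coefficients both sums are zero up to numerical precision, which is the condition that maximum likelihood solves.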
Steps of developing a Logistic Regression Model
Key Metrics Finalization
• Observation Window: Time frame from which the independent variables (X’s) come
• Observation Point: Point in time at which the population will be scored
• Performance Window: Time frame from which the dependent variable (Y) comes
[Figure: timeline showing the Observation Window, the Observation Point and the Performance Window]
Rolling Performance Windows
The above example uses Jan’14 to Mar’14 as the Observation Window and May’14 to Aug’14 as the Performance Window, i.e. a single observation window and a single performance window. Multiple rolling performance windows are used in the following cases:
1. To capture data seasonality
With a single performance window, the assumption is that the parameters of the model are constant over time. However, the economic environment often changes considerably, and it may not be reasonable to assume that a model’s parameters are constant. A common technique to assess the constancy of a model’s parameters is to compute parameter estimates over a rolling window of fixed size through the sample. If the parameters are truly constant over the entire sample, the estimates over the rolling windows should not differ much. If the parameters change at some point during the sample, the rolling estimates should capture this instability.
e.g. The example below uses 3 performance windows of 3 months each. With multiple performance windows, 10 months of data (Jan’13 to Oct’13) are used in model development, which would not be possible with a single performance window. This caters for seasonality in the data.
2. To use campaign data of multiple months for model development
If campaign data of multiple months is to be used for campaign response model development, multiple performance windows can be used.
e.g. Credit card campaign data of 3 months (Jan’13 to Mar’13) is available for campaign response model development. Instead of a single performance window, the following rolling windows can be used. Since a different set of customers is targeted in each month’s campaign, there are no duplicates across windows.
Performance Window 1: Customers targeted in Jan’13 for the campaign – whether they bought a credit card from Feb’13 to Apr’13
Performance Window 2: Customers targeted in Feb’13 for the campaign – whether they bought a credit card from Mar’13 to May’13
Performance Window 3: Customers targeted in Mar’13 for the campaign – whether they bought a credit card from Apr’13 to Jun’13
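The three campaign windows listed above follow a fixed pattern: the performance window starts the month after targeting and runs for three months. A minimal Python sketch of that pattern is given below (illustrative only; the function names and the (year, month) representation are assumptions made for this example).

```python
def add_months(year, month, k):
    """Return (year, month) shifted forward by k calendar months."""
    idx = year * 12 + (month - 1) + k
    return idx // 12, idx % 12 + 1

def rolling_windows(first_campaign, n_windows, perf_months=3):
    """List (campaign month, performance start, performance end) for each window."""
    year, month = first_campaign
    windows = []
    for i in range(n_windows):
        campaign = add_months(year, month, i)
        start = add_months(*campaign, 1)            # month after targeting
        end = add_months(*campaign, perf_months)    # last month of the window
        windows.append((campaign, start, end))
    return windows

windows = rolling_windows((2013, 1), 3)
for campaign, start, end in windows:
    print("targeted", campaign, "performance", start, "to", end)
```

Each customer's response flag is then taken from the performance window attached to the month in which that customer was targeted.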
Target variable:
Once the objective and scope of analysis is defined, it is important to identify the target variable. For example, a risk model depending upon the data and business, can have multiple target variables, such as 90DPD, Bankruptcy Indicator, Charge Off, BAD12 (account becoming bad within first 12 months of activation).
Different target variables will lead to different models, performance, and business insights. Sometimes a combination of target variables is used to build the overall model. For example, an overall BAD predictor can be a combination of 90DPD, Bankruptcy or Charge-offs.
In some problems, the target variable needs to be created and possibly defined. For example, the client may want to build a model to identify potential churn but might not have a clear definition of attrition. In such situations, it might often help to look at the data and come up with some set of rules/algorithm to identify the dependent variable.
Again, the definition of the dependent variable in certain cases may influence the overall value of the model. For example, say the objective is to predict bankruptcy of cardholders. We can choose to define the dependent variable to capture bankruptcy next month or bankruptcy in 3 months. Clearly the latter model is more useful if the objective of the analysis is to take some pre-emptive action against those likely to go bankrupt.
In the current sample data, the target variable is defined as follows:
• Positive responders to the campaign in the population data are tagged as ‘1’. The rest of the population (non-target) is tagged as ‘0’.
Exclusion Criteria:
Policy exclusions and any other exclusions need to be applied prior to model development to ensure that the data is not biased and the model base is representative of the actual population.
Data Preparation
The goal of this step is to prepare a “master” data set to be used in the modeling phase of the problem solution. This dataset should contain at least:
• A key, or set of keys, that identifies each record uniquely
• The dependent variable relevant to the problem
• All independent variables that are relevant or may be important to the problem solution
In the early stages of a solution, it can sometimes be hard to determine an exact set of independent variables. Often, nothing is left out to begin with, and the list of relevant variables is derived and constantly updated as the process unfolds.
If the required master data is spread across several data sets, the pertinent records and variables need to be extracted from each dataset and merged to form the master dataset. In that case it is very important to use proper keys across the datasets, so that the final dataset not only contains all the needed variables but is also merged correctly. For example, you may have a customer dataset with customer-level information such as name, dob, age, sex, address, etc. (a “static” data set), and another data set, “account” data, which contains account-level information such as account number, account type (savings/current/mortgage/fixed deposit), total balance, date of opening, last transaction date, etc. This account-level dataset needs to be rolled up to customer level before merging with the customer dataset to create the master dataset.
Note: If you try to merge two datasets by a common numeric variable whose length was defined differently in each dataset, you may see a warning in the log file similar to:
WARNING: Multiple lengths were specified for the BY variable by_var by input data sets. This may cause unexpected results.
It is generally not wise to overlook log file warnings unless you have a very good reason to. A short data step redefining the length of the shorter variable in one of the datasets before merging will suffice to get rid of the warning, and could reveal important data problems, such as information being truncated from some values of the BY variables in the data set with the shorter length.
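The roll-up and merge described in this section can be sketched as follows. This is a minimal Python illustration, not the SAS data steps used in practice; the record layouts and field names (cust_id, acct_no, balance and so on) are made up for the example.

```python
from collections import defaultdict

# Hypothetical customer-level ("static") and account-level records.
customers = [
    {"cust_id": 1, "age": 34},
    {"cust_id": 2, "age": 51},
    {"cust_id": 3, "age": 27},
]
accounts = [
    {"acct_no": "A1", "cust_id": 1, "acct_type": "savings", "balance": 1200.0},
    {"acct_no": "A2", "cust_id": 1, "acct_type": "mortgage", "balance": 300.0},
    {"acct_no": "A3", "cust_id": 2, "acct_type": "current", "balance": 50.0},
]

# Step 1: roll the account-level data up to one row per customer.
rollup = defaultdict(lambda: {"n_accounts": 0, "total_balance": 0.0})
for acct in accounts:
    agg = rollup[acct["cust_id"]]
    agg["n_accounts"] += 1
    agg["total_balance"] += acct["balance"]

# Step 2: merge onto the customer data by the key, keeping customers with no accounts.
master = []
for cust in customers:
    agg = rollup.get(cust["cust_id"], {"n_accounts": 0, "total_balance": 0.0})
    master.append({**cust, **agg})

# Step 3: confirm the key still identifies each record uniquely.
assert len({row["cust_id"] for row in master}) == len(master)
```

Merging before the roll-up would instead repeat the customer row once per account, which breaks the uniqueness of the key.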
Data Treatment
Once the master dataset has been created, the univariate macro (if available) needs to be run to understand the data. Characteristics of the data that need to be looked at are:
• Variable name
• Format
• Number of unique values
• Number of missing values
• Distribution (proc means output for numeric variables; highest and lowest frequency categories for a categorical variable)
o Numeric variables: standard numerical distribution including the mean, min, max, and percentiles 1, 5, 10, 25, 50, 75, 90, 95, and 99
o Categorical variables: number of times the variable takes each categorical value
Output of the univariate macro for a few variables is given below (the columns between p1_or_top3 and p99_or_bot2 have been deleted):
Obs  name  type  var_length  n_pos  numobs  nmiss  unique  mean_or_top1  min_or_top2  p1_or_top3  p99_or_bot2  max_or_bot1
1    Var1  num   8           0      46187   0      929     0.12648       0            0           0.769        0.998
2    Var2  num   8           8      46187   0      505     0.06473       0            0           0.285        0.944
3    Var3  num   8           16     46187   0      175     714.42876     0            650         756          794
4    Var4  num   8           24     46187   0      257     656.30054     0            0           755          794
5    Var5  num   8           32     46187   0      10673   100.50368     0            0           1710.922     136318.123
6    Var6  num   8           40     46187   0      33123   305.97356     0            0           2552.431     221315.614
7    Var7  num   8           48     46187   0      1332    0.11786       0            0           1.073        47.794
8    Var8  char  1           56     46187   0      10      2::524952     ::429733     1::376441   7::5468      8::2781
Code for getting univariate output of variables:
Univariate_Macro.txt
Put the library path (where the dataset exists) and the dataset name (on which the univariate macro will run) in place of XXX at the bottom of the univariate code before running it.
Basic things to look for when first assessing the data:
• Are data formats correct?
o Are numerical variables stored as text? Do date variables need to be converted in order to be useful?
• Which variables have missing values?
• Are there outliers in the data?
• Do any variables exhibit invalid values (many “9999999”, “101010101…”, “0/1” values, etc.)?
o If you have a data dictionary provided by the client, it may contain information on invalid values, so this is the first thing to check
• Are any distributions clearly out of line with expectations?
Missing Value Imputation
It is important to impute the missing values in a dataset before analysis can be performed on it. Below are some popular techniques:
• Replace missing values with zero
• Replace missing values with the value of the same variable from records that have the closest mean value of the response variable
• Regression on other predictors
• Replace missing values with mean
• Replace missing values with median
• Inter-correlation
• Do not impute missing values
Replace missing values with zero
Situations where the absence of a value is implicitly zero – for example, NUMBER OF LATE PAYMENTS. The value of this field would be expected to be zero for most customers. Check related fields to justify this decision. Also, if some records have a 0 and others are blank (missing), then check with client if a blank has a different interpretation.
Regression on other predictors
Create a linear model on the populated records to predict the values of this variable, using other variables in the dataset as predictors. Then score the records with missing values using the model coefficients. This is a very good method when there is sufficient covariance among the variables in the dataset to produce a precise and accurate regression.
Replace missing values with mean
This technique should be used where the great majority of records (roughly 85% or more) are populated and where other methods are not feasible. It can also be used where the variable is a predictor with low influence in the model but needs to be included.
Replace missing values with median
This technique should be used in place of mean imputation where the great majority of records (roughly 85% or more) are populated, the distribution is highly skewed, and other methods are not feasible. It can also be used where the variable is a predictor with low influence in the model but needs to be included.
Inter-correlation
This method involves finding another predictor variable that has a very high fill rate and is very highly correlated with the variable being imputed. The other predictor is binned, and the median value of the variable being imputed is calculated for each bin. A missing value is then imputed with the median of the bin into which the record falls. This is a good method but needs a very high correlation among the predictors and a very high (close to 100%) fill rate for the other predictor.
Do not impute missing values
Missing values should not be imputed in the following cases:
• Response variable for the model
• Predictor variable that has low correlation with other predictors, where imputation of zero, mean or median would bias model results
• Predictor variable that has low correlation with the response and is unlikely to play a significant role in the model: exclude it from modeling
• Predictor variable that could be important in the model but has a large percentage of values missing: either exclude the records with missing values or exclude the variable from the model (imputation using the above techniques and inclusion in the model would result in a model that either has inflated performance statistics or reflects data manipulation rather than the original source data)
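As an illustration of median imputation with the fill-rate condition described above, here is a minimal Python sketch. The function name, the use of None for a missing value and the sample balances are assumptions for this example; the 85% threshold is the rough guideline from the text.

```python
import statistics

def impute_median(values, min_fill_rate=0.85):
    """Replace None with the median of the populated values.

    The values are returned unchanged when the fill rate is below
    min_fill_rate, because median imputation is meant for
    well-populated variables only.
    """
    populated = [v for v in values if v is not None]
    fill_rate = len(populated) / len(values)
    if fill_rate < min_fill_rate:
        return list(values), fill_rate
    median = statistics.median(populated)
    return [median if v is None else v for v in values], fill_rate

# Made-up balances: one missing value and one extreme value (4000).
balances = [120, 80, None, 95, 4000, 110, 100, 90, 105, 115]
imputed, fill_rate = impute_median(balances)
print(fill_rate, imputed[2])
```

The extreme value of 4000 would pull a mean-based imputation far above the typical balance, which is why the median is preferred when the distribution is skewed.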
Outlier Treatment
It is very important to treat outliers in the dataset before any analysis is performed. Outliers can be detected using Proc Univariate output.
Comparing the P99 and Max values (or the P1 and Min values) identifies the variables that may have outliers.
Here are some common ways of dealing with outliers:
• Cap all outliers at P1 or P99.
• Cap all outliers at (P1 - δ) or (P99 + δ). The value of δ will be subjective.
• Use exponential smoothing for all values beyond the range P1-P99.
The first two methods are easier to implement but lose the ordinality of the data beyond the cap. The third method takes care of the outlier problem without losing the ordinality of the data.
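The capping methods can be sketched as below. This is a minimal Python illustration that assumes the percentiles are computed with statistics.quantiles using the "inclusive" method; Proc Univariate may use a different percentile definition, so the exact cut points can differ slightly. The data and the function name are made up.

```python
import statistics

def cap_outliers(values, delta=0.0):
    """Cap values at (P1 - delta) and (P99 + delta)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[0] - delta, cuts[98] + delta    # P1 and P99 cut points
    return [min(max(v, low), high) for v in values], low, high

# Made-up data: 99 ordinary values and one extreme value.
spend = list(range(1, 100)) + [10_000]
capped, low, high = cap_outliers(spend)
print(low, high, max(capped))
```

Every value above the upper cap is set to the same number, which is why the ordering of the capped records among themselves is lost.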
Derived variables creation
Derived variables are created in order to capture the underlying trends and aspects of the data.
Rather than using only the raw variables in the model, taking proportions and ratios and building indexes sometimes helps reduce bias and identify new trends in the data.
E.g. taking average monthly spend instead of total spend in the last 12 months is more insightful because it neutralizes the effect of new customers having lower spend due to their shorter tenure. The normalized average provides a more accurate comparison among customers.
Data Split
• Development dataset – Fit the model on this dataset
• Validation dataset (Hold-out sample) – Validate the model using the hold-out sample
• Out of time sample (Validation) – Validate the model on a different time period to ensure the model works.
Development and validation samples can be split in any ratio, with 50-80% of records in the development sample. Sample code for a 70-30 split is below:
data temp; set xxx; n = ranuni(8); run;     /* attach a random sort key (seed 8) */
proc sort data=temp; by n; run;             /* shuffle the records */
data training validation;
  set temp nobs=nobs;
  if _n_ <= .7*nobs then output training;   /* first 70% of shuffled records */
  else output validation;                   /* remaining 30% */
run;
In many situations data is scarce and it is not possible to generate separate validation datasets. In such cases, sampling techniques are used, as explained below.
Multifold Cross validation
Note that this technique is a reasonable substitute for out-of-sample validation but not for out-of-time validation. Usually a 10-fold cross validation is applied. Say you are performing a k-fold cross validation:
1. Divide the data into k disjoint portions.
2. Build k models, each time using a different portion as the validation dataset and the remaining k-1 portions as the modeling set.
3. Average the coefficients of the models that validate well to get the final set of coefficients.
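The first two steps of the k-fold procedure can be sketched as follows. This minimal Python example only produces the row indices for each modeling/validation pair; fitting and averaging the models is left out. The function name and the seed are assumptions for the example.

```python
import random

def k_fold_indices(n_rows, k=10, seed=8):
    """Yield (modeling_idx, validation_idx) pairs for k disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]    # k disjoint portions of near-equal size
    for i in range(k):
        validation = folds[i]
        modeling = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield modeling, validation

splits = list(k_fold_indices(n_rows=25, k=5))
for modeling, validation in splits:
    print(len(modeling), len(validation))
```

Every record is used for validation exactly once and for modeling k-1 times.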
Bootstrapping
This technique involves re-estimating the model over numerous randomly drawn sub-samples. It is used in several ways. Often the model coefficients are taken to be the average of the coefficients in the sub-sample models. The final model is then validated.
In other instances, bootstrapping is used as a variable reduction technique. The steps are:
1. Form 1000 random subpopulations from the modeling population (the number of subpopulations can be varied).
2. Run a regression on each subpopulation, creating 1000 models.
3. Calculate the Presence Rate of each variable by counting the models in which it appears as a predictor. A variable present as a predictor in 750 of the 1000 models has a presence rate of 75%.
4. Build models with the variables that have at least a 50% Presence Rate (the cutoff can be changed to suit the modeling task at hand).
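The presence-rate calculation can be sketched as below. This minimal Python example is illustrative only: the toy_selector function is a stand-in for a real stepwise regression, the sub-samples are drawn without replacement at 60% of the rows, and the data and variable names are made up. All of these are assumptions for the example, not the method used in any particular project.

```python
import random

def presence_rates(rows, select_variables, n_models=1000, frac=0.6, seed=8):
    """Share of sub-sample models in which each variable is selected."""
    rng = random.Random(seed)
    counts = {}
    size = int(len(rows) * frac)
    for _ in range(n_models):
        sample = rng.sample(rows, size)          # one random subpopulation
        for var in select_variables(sample):     # variables kept by that model
            counts[var] = counts.get(var, 0) + 1
    return {var: c / n_models for var, c in counts.items()}

def toy_selector(sample, min_gap=0.5):
    """Stand-in for a regression: keep a variable when its mean differs
    between events and non-events by more than min_gap."""
    kept = set()
    for var in ("x_strong", "x_noise"):
        ev = [r[var] for r in sample if r["y"] == 1]
        non = [r[var] for r in sample if r["y"] == 0]
        if abs(sum(ev) / len(ev) - sum(non) / len(non)) > min_gap:
            kept.add(var)
    return kept

# Made-up data: x_strong tracks the target exactly, x_noise is random.
noise = random.Random(1)
rows = [{"y": i % 2, "x_strong": float(i % 2), "x_noise": noise.random()}
        for i in range(200)]

rates = presence_rates(rows, toy_selector, n_models=200)
final_vars = [v for v, r in rates.items() if r >= 0.5]   # 50% presence-rate cutoff
print(rates, final_vars)
```

In practice select_variables would wrap the actual regression run on each subpopulation and return the names of the predictors that entered the model.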
Oversampling
When the target event is rare (less than 2-5% of the population), it is common to oversample the rare event, that is, to take a disproportionately large number of event cases. Oversampling rare events is generally believed to lead to better predictions.
In the case of logistic regressions, oversampling only affects the intercept term and the coefficients are left unaffected. Therefore, the rank order produced by the model estimated on oversampled data holds true and will not be changed even if the intercept is corrected for oversampling.
Therefore, if the objective of the logistic regression is to produce a scorecard, then no correction is required for oversampling.
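Where actual probabilities are needed rather than a rank order, the intercept can be adjusted back to the population event rate. The sketch below uses the standard prior correction, which subtracts the log of the ratio of sample odds to population odds from the intercept. This correction is not described in the text above, and the numbers and function names are hypothetical.

```python
import math

def corrected_intercept(intercept, sample_event_rate, population_event_rate):
    """Adjust an intercept estimated on oversampled data back to the population."""
    sample_odds = sample_event_rate / (1 - sample_event_rate)
    population_odds = population_event_rate / (1 - population_event_rate)
    return intercept - math.log(sample_odds / population_odds)

def probability(intercept, linear_part):
    """Probability from the intercept plus the sum of coefficient * variable terms."""
    return 1 / (1 + math.exp(-(intercept + linear_part)))

# Hypothetical case: events are 2% of the population but 50% of the modeling sample.
b0_sample = -0.40
b0_pop = corrected_intercept(b0_sample, sample_event_rate=0.50,
                             population_event_rate=0.02)

# Two customers with different linear scores keep their order after correction.
print(probability(b0_sample, 1.0), probability(b0_sample, 0.5))
print(probability(b0_pop, 1.0), probability(b0_pop, 0.5))
```

Only the level of the predicted probabilities changes; the customer with the higher score before the correction still has the higher score after it.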
Variable Selection/Reduction
We might get client data with hundreds and even thousands of variables. It is important to reduce the dimension of the dataset by eliminating redundant/irrelevant variables before meaningful analysis can be performed.
Irrelevant variables carry no meaningful information, while redundant variables carry little or no additional information. An example of a redundant variable is a variable C that is a linear combination of variables A and B. An irrelevant variable, on the other hand, is something like a variable X that has very low correlation with the dependent variable Y.
Various techniques used for variable reduction are described below.
Data Distribution Related Issues
Following categories of variables are generally not considered for model development exercise.
• Variables with very high missing data
• Variables with low variation (concentration in single bin)
• Date variables
• Variables not usable in raw form because of granularity
Information Value
For logistic regression, the raw variable is replaced by its Weight of Evidence (WOE), which usually has higher predictive power than the raw variable.
Weight of Evidence is the natural log of (% of events / % of non-events) for each category (bin) of the variable considered for analysis.
Methodology to calculate Weight of Evidence and Information Value:
1. Fine Classing: Fine classing divides the population into categories that differentiate events from non-events. Customers belonging to a category that differentiates events from non-events well have a high likelihood of having objective=’positive’ in the final model, so such categories have higher predictive power. To fine class, divide the population into buckets of 5%-10% each.
2. Calculate the Event % and Non-event % for each category.
3. Calculate the log odds of the category:
Log odds = natural log(%Events/%non-events)
A positive value indicates that customers in that category are more likely to belong to the Persona. The higher the positive value, the higher the differentiating power of the category. A negative value indicates that customers in that category are less likely to be in the Persona.
4. Calculate the Information Value for each category. Information Value (IV) indicates the predictive strength of the category: the higher the IV, the better the category separates events from non-events.
Consider the variable “bfiinca” from the training data:
PER_GOOD = % event = % of customers with Objective=”Positive”
PER_BAD = % non-event = % of customers with Objective=”Negative”
Information Value of the variable “bfiinca” = 0.145
1. Event % is calculated for each category (column PER_GOOD). E.g. for category 22167 – 38485, with 40 events out of 591 total events: Event % = (40/591)*100 = 6.77%
2. Non-event % is calculated for each category (column PER_BAD). For category 22167 – 38485, with 76 non-events out of 567 total non-events: Non-event % = (76/567)*100 = 13.4%
3. % Event/% Non-event (column RAW_ODDS) = 6.77/13.4 = 0.51
4. Log odds = natural log(% Event/% Non-event) = natural log(RAW_ODDS) = natural log(0.51) = -0.67
5. Information Value = (PER_GOOD - PER_BAD) * natural log(RAW_ODDS)/100 = (6.77 - 13.4) * (-0.67)/100 = 0.044
Similarly, the information value is calculated for each category of the variable.
Information Value of variable = sum of the information values of all categories of the variable
The above calculation of IV with fine classing can be done automatically by running the codes below and pasting the output into the given Excel file:
csv1.sas
fclassc1_n2.sas
fclassd1_n2.sas
Fineclassing.sas
fc_code_postModification.xls
Steps of Using the above codes:
1. Create an inverse of the target variable, i.e. for records where the target variable is “1”, set the inverse target variable to “0”, and for records where the target variable is “0”, set it to “1”.
2. Save all the codes in the same location as your dataset.
3. Open fclassc1_n2 and change the libname at the bottom of the code.
4. Do the same for fclassd1_n2.
5. Once you have made the changes, save and close these two codes.
6. Then open the “fineclassing” code and run it.
7. Paste the output into the given Excel file using Text to Columns, keeping the first column as type Text; save the file in Excel 2003 format and close it.
8. Open fc_code_postModification.xls and open the Excel file created in step 7 above. Run the “Set Parameters” macro first, followed by the “Sas code” macro in the Excel file, which will display the fine classing output.
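Independently of the SAS and Excel codes, the per-category arithmetic for the “bfiinca” category 22167 – 38485 can be reproduced with a few lines. This is a minimal Python sketch; the function name is an assumption, and the counts (40 of 591 events, 76 of 567 non-events) are the ones quoted in the worked example.

```python
import math

def woe_iv(events, non_events, total_events, total_non_events):
    """Return (% event, % non-event, log odds, IV contribution) for one category."""
    pct_event = events / total_events
    pct_non_event = non_events / total_non_events
    log_odds = math.log(pct_event / pct_non_event)    # Weight of Evidence
    iv = (pct_event - pct_non_event) * log_odds       # always >= 0
    return pct_event, pct_non_event, log_odds, iv

# Category 22167 - 38485: 40 of 591 events, 76 of 567 non-events.
pct_event, pct_non_event, log_odds, iv = woe_iv(40, 76, 591, 567)
print(f"{pct_event:.2%} {pct_non_event:.2%} {log_odds:.2f} {iv:.3f}")
```

Without intermediate rounding the log odds come to about -0.68 and the category IV to about 0.045, close to the hand calculation that rounds the raw odds to 0.51 first. The IV of the variable is the sum of this quantity over all of its categories.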
The information value measures how well the characteristic discriminates between good and bad, and whether it should be considered for modeling. As a rule of thumb apply the following cut-offs, which can be changed based on the data:
Below 0.03: Not predictive, do not consider for modelling
0.03 – 0.1: Predictive, consider for modelling
Above 0.1: Very predictive, use in modelling
Coarse Classing:
Coarse Classing is a method to identify and group similar categories. To coarse class the population, group categories with similar log odds and the same sign, then calculate the log odds for the grouped category. The new log odds is the Weight of Evidence of the variable for that group.
Each of the characteristics deemed to be predictive (information value > 0.03) should be grouped (normally performed using fine class output and a ruler) into larger more robust groups such that the underlying trend in the characteristic is preserved. A rule of thumb suggests that at least 3% of the goods and 3% of the bads should fall within a group.
For continuous characteristics 10% bands are used to get an initial indication of the predictive patterns and the strength of the characteristic. However, for the grouping of attributes more detailed reports are produced using 5% bands.
Try to make classes with around 5-10% of the population. Classes with less than 5% might not give a true picture of the data distribution and might lead to model instability.
The trend after coarse classing should be monotonically increasing, monotonically decreasing, a parabola or an inverted parabola. Polytonic trends (with more than one turning point) are usually not acceptable.
Business inputs from the SMEs in the markets are essential for the coarse classing process, so that fluctuations in variables can be better explained and the classes make business sense.
WOE Approach
Concept:
In the standard WOE approach every variable is replaced by its binned counterpart. The binned variable is created by assigning each record the WOE of the coarse-class bin into which it falls.
WOE = ln (% Good/ % Bad)
WOE = 0 is imputed for the bins containing missing records and for bins that consisted of less than 2% of the population.
Advantage:
Every attribute (bin) of the variable is weighted differently, which avoids the neutral weight assignment that occurs in the dummy approach.
Disadvantage:
Fewer degrees of freedom, hence the chance of a variable being represented in the model is lower than with the dummy approach.
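Replacing a raw variable by its WOE can be sketched as a simple lookup. In this minimal Python example the bin boundaries and WOE values are hypothetical (only the middle bin echoes the worked “bfiinca” category), and treating a value that falls outside every bin like a missing value is an assumption of the sketch.

```python
def woe_transform(value, bins):
    """Map a raw value to the WOE of its coarse class; missing values get WOE = 0."""
    if value is None:
        return 0.0
    for lower, upper, woe in bins:
        if lower <= value < upper:
            return woe
    return 0.0   # outside every bin: treated like missing (assumption)

# Hypothetical coarse classes: (lower bound, upper bound, WOE).
income_bins = [
    (0, 22167, -0.95),
    (22167, 38485, -0.67),
    (38485, float("inf"), 0.40),
]

incomes = [30000, None, 50000, 10000]
income_woe = [woe_transform(v, income_bins) for v in incomes]
print(income_woe)
```

The transformed column, rather than the raw income, is what enters the regression, so the model estimates a single coefficient for the variable.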
MULTI COLLINEARITY CHECK
Multicollinearity Macro Introduction:
The macro MULTI COLLINEARITY is used to remove the multicollinearity. It identifies the variables that are correlated and helps in removing the correlated and / or insignificant variables.
Logic:
1. Capture the outputs of Regression and Logistic Regression procedures
2. Transpose the factor-loading matrix and attach the parameter estimate, VIF, Wald-Chi Square value to each independent variable.
3. Go to last eigen vector (with lowest eigen value) and find out the variables that are correlated (that have factor loadings more than specified in the Cut-Off Factor loading in the Excel sheet)
4. Remove those variables (not more than 3 variables at any iteration / point of time) that have a high VIF (more than 1.75) and a lower Wald Chi-Square value.
5. Go to each eigen vector in ascending order of eigen value and find the variables that are correlated (that have factor loadings above the Cut-Off Factor loading specified in the Excel sheet).
Points to note:
1. If the factor loadings on a particular eigen vector are not above the cut-off, that vector is ignored and the next eigen vector is examined.
2. Not more than 3 variables can be dropped at any iteration.
3. Not more than 250 variables can be used because of the Excel limitation on the number of rows.
4. Clear the contents of columns M & P of the Multicol8.xls sheet before using the macro for each new project.
Programs – Overview of Inputs & Outputs
1. SAS Program:
Multi Collinearity.sas
Inputs required apart from the library and dataset name are:
1. List of variables for the REG and LOGISTIC procedures
2. Response variable name
Output: One Excel sheet "mc.xls" created in directory C (the location and name of the file can be changed).
Please go through the program and input the values at the appropriate places (the COMMENTS will guide you in doing that).
2. Excel Sheet: Multicol8.xls. Save this in the C folder.
MultiCol8.ZIP
This has VB macro that runs on the file "C:\mc.xls" created out of SAS Program. Save this file on your computer and keep it open at the time of removing multi collinearity.
3. Outputs:
List of Variables Retained (column M) and Removed (column P) will get pasted in the same excel sheet "MultiCol8.xls"
Tracking of variables removed from the first iteration to the last iteration. The name of the tracking file is to be specified each time the macro is run (e.g. c:\log.txt). You can use the same file name throughout the project. The file holds the history of variables removed and the corresponding correlated variables. Please make sure that you change the file name when you start a new project; otherwise the existing file "log.txt" gets appended.
The idea of this tracking file is to find replacement variables for any variable that was dropped at any point of time. Open this txt file in Excel as a Delimited file, selecting Comma and Other ("-") as the delimiters.
Each row will give the variables that are correlated.
Columns (B, C and D) give the variables removed at a point. Variables from F column are correlated with the removed variables and retained at that respective point.
Frequently Asked Questions
Q1: Why should we do a multicollinearity check?
Ans1: Multicollinearity refers to correlation among the independent variables. It inflates the standard errors of the coefficient estimates, which in turn makes the model unreliable.
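To make the VIF cut-off of 1.75 used by the macro concrete, here is a minimal Python sketch for the simplest case of two predictors, where VIF = 1/(1 - R²) and R² is the squared correlation between the two. The data and variable names are made up; with more than two predictors, R² comes from regressing each predictor on all the others, which this sketch does not do.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r * r)

# Made-up predictors: spend is nearly a copy of balance, tenure is unrelated.
balance = [10, 20, 30, 40, 50, 60]
spend = [12, 19, 33, 38, 52, 61]
tenure = [5, 1, 4, 2, 6, 3]

vif_related = vif_two_predictors(balance, spend)
vif_unrelated = vif_two_predictors(balance, tenure)
print(vif_related, vif_unrelated)
```

The nearly duplicated pair produces a VIF far above 1.75, so one of the two would be a candidate for removal, while the unrelated pair stays close to 1.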
Clustering of variables
Varclus is a SAS procedure that groups variables into clusters based on the variance of the data they explain. It is unsupervised in that it does not depend on the dependent variable. In the background it performs a principal-component-like analysis and then makes the components orthogonal so that they contain distinct sets of variables.
Here are some practical implementation steps for running varclus:
• Varclus will group similar variables into clusters
• Ideal representative(s) from each cluster can be retained
• The selected variable should have a high r-square with its own cluster and a low r-square with the next closest cluster
• The 1-R-squared ratio is a good selection criterion (1-r-squared with own cluster / 1-r-squared with next closest cluster)
• Multiple selections can be made from a cluster if necessary
• Business judgment might often determine variable selection
Here is the sample code:
proc varclus data=imputed maxeigen=.7 short hi;
  var &list;
run;
Here is the sample output:
Cluster      Variable   R-squared with   R-squared with         1-R-squared
                        own cluster      next closest cluster   ratio
Cluster 69   bb         0.5804           0.468                  0.788722
             cc         0.443            0.1362                 0.644825
Cluster 70   kk         0.8057           0.2993                 0.277294
             ll         0.6345           0.2918                 0.516097
Cluster 71   mm         0.5625           0.0013                 0.438069
Cluster 72   nn         0.7797           0.2811                 0.30644
In the above output we have 4 clusters. We want to select one variable to represent each cluster, although more than one variable from a cluster is often used for business reasons. Ideally we choose a variable that has a high r-square with its own cluster and a low r-square with its next closest cluster. Choosing the variable with the lowest “1-R-squared” ratio accomplishes this. This implies the choice of cc, kk, mm and nn.
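The selection rule can be reproduced from the sample output with a short calculation. This is a minimal Python sketch that recomputes the 1-R-squared ratio from the two R-squared columns and keeps the variable with the lowest ratio in each cluster; the data structure and function name are assumptions for the example.

```python
# (cluster, variable, R-squared with own cluster, R-squared with next closest cluster)
varclus_rows = [
    ("Cluster 69", "bb", 0.5804, 0.468),
    ("Cluster 69", "cc", 0.443, 0.1362),
    ("Cluster 70", "kk", 0.8057, 0.2993),
    ("Cluster 70", "ll", 0.6345, 0.2918),
    ("Cluster 71", "mm", 0.5625, 0.0013),
    ("Cluster 72", "nn", 0.7797, 0.2811),
]

def one_minus_r2_ratio(own, nxt):
    """(1 - R-squared with own cluster) / (1 - R-squared with next closest cluster)."""
    return (1 - own) / (1 - nxt)

best = {}   # cluster -> (variable, ratio) with the lowest ratio seen so far
for cluster, var, own, nxt in varclus_rows:
    ratio = one_minus_r2_ratio(own, nxt)
    if cluster not in best or ratio < best[cluster][1]:
        best[cluster] = (var, ratio)

selected = [var for var, _ in best.values()]
print(selected)
```

The recomputed ratios match the last column of the sample output, and the selected variables are cc, kk, mm and nn.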
Standardization of Variables
Once the final set of variables on which the logistic regression is to be run is decided, the variables can be standardized so that standardized coefficients are obtained. For example, the z score for age would be calculated as (age - mean[age])/std dev[age]. Your output would then give standardized coefficients as results.
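The z score calculation can be sketched as below. This minimal Python example uses the sample standard deviation (divisor n-1), which is an assumption of the sketch, and made-up ages.

```python
import statistics

def standardize(values):
    """Return z scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)    # sample standard deviation
    return [(v - mean) / std for v in values]

ages = [25, 35, 45, 55, 65]   # made-up ages
z_age = standardize(ages)
print(z_age)
```

After standardization each variable has mean 0 and standard deviation 1, so the sizes of the coefficients can be compared across variables measured in different units.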
Logistic Regression Procedure
Before running a logistic model in SAS you first need to check that your dataset is ready for logistic modeling. SAS will discard any observation with a missing value for any of the variables specified for the model, so it is necessary that general modeling data preparation and missing value imputation have been completed. Also, PROC LOGISTIC will only accept numeric variables as predictors, unless categorical character variables are specified in the CLASS statement. You must also be careful with categorical variables that are coded with numerals: the program will treat these as continuous numerics unless they are specified in the CLASS statement.
To run a logistic regression in SAS, and estimate the coefficients of the model, one would run code similar to the below:
proc logistic data=<libname>.<modeling dataset> descending;
  model dependent_variable = var1 var2 var3;
run;
The DESCENDING option lets SAS know that the value of the dependent variable we wish to predict is “1”, and not “0”.
With no other options selected, SAS will estimate the “full model”, meaning all variables will be included, regardless of whether their coefficients are significantly different from 0.
PROC LOGISTIC also permits the following variable selection methods, requested through the SELECTION= option of the MODEL statement:
– forward selection
– backward elimination
– forward stepwise
– selection of ‘optimal subsets’ of independent variables
Default significance levels for entry into/removal from the model can be modified by use of the SLENTRY and SLSTAY options.
Key Model Statistics
PROC LOGISTIC provides several means of assessing model fit:
ALL THE BELOW TABLES/GRAPHS ARE ILLUSTRATIVE AND YET TO BE DEVELOPED FOR SAMPLE DATA
Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC          501.977           470.517
SC           505.968           494.466
-2 Log L     499.977           458.517

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square    DF    Pr > ChiSq
Likelihood Ratio    41.4590       5     <.0001
Score               40.1603       5     <.0001
Wald                36.1390       5     <.0001

Type 3 Analysis of Effects
Effect    DF    Wald Chi-Square    Pr > ChiSq
GRE       1     4.2842             0.0385
GPA       1     5.8714             0.0154
RANK      3     20.8949            0.0001
The portion of the output labeled Model Fit Statistics describes and tests the overall fit of the model. The -2 Log L (499.977) can be used in comparisons of nested models, but we won't show an example of that here.
In the next section of output, the likelihood ratio chi-square of 41.4590 with a p-value of <.0001 tells us that our model as a whole fits significantly better than an empty model. The Score and Wald tests are asymptotically equivalent tests of the same hypothesis tested by the likelihood ratio test; not surprisingly, these tests also indicate that the model is statistically significant.
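The figures in the Model Fit Statistics output are linked by simple arithmetic, which the Python sketch below reproduces for illustration. The parameter counts follow from the output (an intercept, GRE, GPA and three RANK dummies). The sample size of 400 is an assumption inferred from the SC values; it is not stated in the output.

```python
import math

neg2logl_null = 499.977  # -2 Log L, intercept only
neg2logl_full = 458.517  # -2 Log L, intercept and covariates
k_null = 1               # parameters: intercept only
k_full = 6               # intercept + GRE + GPA + 3 RANK dummies
n_obs = 400              # ASSUMPTION: inferred from the SC values, not stated

# Likelihood ratio chi-square: drop in -2 Log L between the two models.
lr_chi_square = neg2logl_null - neg2logl_full
lr_df = k_full - k_null


def aic(neg2logl, k):
    """Akaike Information Criterion: -2 Log L + 2k."""
    return neg2logl + 2 * k


def sc(neg2logl, k, n):
    """Schwarz Criterion: -2 Log L + k * ln(n)."""
    return neg2logl + k * math.log(n)


print(round(lr_chi_square, 3), lr_df)
print(round(aic(neg2logl_null, k_null), 3), round(aic(neg2logl_full, k_full), 3))
print(round(sc(neg2logl_null, k_null, n_obs), 3), round(sc(neg2logl_full, k_full, n_obs), 3))
```

Both AIC and SC penalise -2 Log L for the number of parameters, so lower values indicate a better trade-off between fit and complexity.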
The section labeled Type 3 Analysis of Effects shows the hypothesis tests for each of the variables in the model individually. The chi-square test statistics and associated p-values shown in the table indicate that each of the three variables in the model significantly improves the model fit. For gre and gpa, this test duplicates the test of the coefficients shown below. However, for class variables (e.g., rank), this table gives the multiple degree of freedom test for the overall effect of the variable.
Model description
Variables used in the model:
The LOGISTIC Procedure

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept    1     -5.5414     1.1381            23.7081            <.0001
GRE          1     0.00226     0.00109           4.2842             0.0385
GPA          1     0.8040      0.3318            5.8714             0.0154
RANK 1       1     1.5514      0.4178            13.7870            0.0002
RANK 2       1     0.8760      0.3667            5.7056             0.0169
RANK 3       1     0.2112      0.3929            0.2891             0.5908

Odds Ratio Estimates
Effect         Point Estimate    95% Wald Confidence Limits
GRE            1.002             1.000     1.004
GPA            2.235             1.166     4.282
RANK 1 vs 4    4.718             2.080     10.701
RANK 2 vs 4    2.401             1.170     4.927
RANK 3 vs 4    1.235             0.572     2.668

Association of Predicted Probabilities and Observed Responses
Percent Concordant    69.1     Somers' D    0.386
Percent Discordant    30.6     Gamma        0.387
Percent Tied          0.3      Tau-a        0.168
Pairs                 34671    c            0.693
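The odds ratio table follows directly from the parameter estimates: each point estimate is exp(estimate) and the 95% Wald limits are exp(estimate ± 1.96 × standard error). The Python sketch below (illustration only, hypothetical function names) reproduces this from the estimates above. Small differences in the third decimal arise because the printed estimates are rounded.

```python
import math

# (estimate, standard error) copied from the Analysis of Maximum Likelihood Estimates
estimates = {
    "GRE": (0.00226, 0.00109),
    "GPA": (0.8040, 0.3318),
    "RANK 1 vs 4": (1.5514, 0.4178),
    "RANK 2 vs 4": (0.8760, 0.3667),
    "RANK 3 vs 4": (0.2112, 0.3929),
}
Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) for one coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def wald_chi_square(beta, se):
    """Wald chi-square for a single coefficient: (estimate / std error)^2."""
    return (beta / se) ** 2


for effect, (beta, se) in estimates.items():
    ratio, low, high = odds_ratio_with_ci(beta, se)
    print(f"{effect:12s} {ratio:7.3f} {low:7.3f} {high:7.3f} {wald_chi_square(beta, se):8.4f}")
```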
KS Statistic and Rank Ordering –
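The population is divided into deciles in ascending order of Bads. A model that rank orders well places the highest number of Goods in the first decile, with the number of Goods falling progressively across the subsequent deciles. We define the Kolmogorov-Smirnov (K-S) statistic as the maximum value of |G(s) – B(s)| over the score range, where G(s) and B(s) are the cumulative proportions of goods and bads with a score less than s. The K-S statistic has a known distribution under the null hypothesis that G(s) and B(s) are identical, with a critical value at the 5% level of $1.36 / \sqrt{GB/(G+B)}$, where G and B are the numbers of goods and bads in the sample.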
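As a rough illustration of the K-S calculation, the Python sketch below computes the maximum gap between the two cumulative distributions and the usual two-sample critical value at the 5% level, 1.36 × sqrt((G + B) / (G × B)). The scores are made-up values and the function names are hypothetical. The code evaluates the cumulative proportions at or below each observed score, which gives the same maximum as the "less than s" definition.

```python
import math


def ks_statistic(good_scores, bad_scores):
    """Maximum of |G(s) - B(s)| over all observed scores s."""
    thresholds = sorted(set(good_scores) | set(bad_scores))
    n_good, n_bad = len(good_scores), len(bad_scores)
    best = 0.0
    for s in thresholds:
        g = sum(1 for x in good_scores if x <= s) / n_good  # cumulative % goods
        b = sum(1 for x in bad_scores if x <= s) / n_bad    # cumulative % bads
        best = max(best, abs(g - b))
    return best


def ks_critical_value(n_good, n_bad, c_alpha=1.36):
    """Two-sample K-S critical value; c_alpha = 1.36 corresponds to 5%."""
    return c_alpha * math.sqrt((n_good + n_bad) / (n_good * n_bad))


goods = [620, 680, 700, 710, 750]  # made-up scores, illustration only
bads = [540, 560, 600, 640, 690]
ks = ks_statistic(goods, bads)
critical = ks_critical_value(len(goods), len(bads))
print(ks, critical, ks > critical)
```

With only five goods and five bads the critical value is large; on a realistic development sample it shrinks quickly as G and B grow.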
Model performance on development sample –
Comparison of model performance on development and validation samples using Lorentz curve –
Gini and Lorentz curves
The industry-standard measures for assessing the predictive power of a model are the Gini coefficient and the K-S statistic. A model with a Gini greater than 40% and/or a K-S statistic above roughly 20% would be classified as a good model. Let G(s) be the proportion of goods with a score less than s, and B(s) be the proportion of bads with a score less than s. A Lorentz curve is a plot of G(s) against B(s). The Gini coefficient captures the degree to which the distributions differ by calculating the difference between the areas under the G(s) and B(s) curves, i.e.:

$$\int \left( G(s) - B(s) \right) \, ds$$

If G(s) and B(s) are identical, the Lorentz curve is the straight line between (0, 0) and (1, 1), and the required integral is 0.

The Lorentz curves above show the plot of the cumulative percentage of bads against the cumulative percentage of goods in the development and validation samples. The dark blue line shows the distribution of goods and bads under random scoring, whereas the brown curve (development sample) and green curve (validation sample) show the ‘lift’ provided by the Conversion Rate Model over and above random selection. The model exhibits a similar level of performance across the development and validation samples, as can be seen from the almost overlapping Lorentz curves.
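In practice the Gini is usually obtained from the concordance statistic c, using the standard identity Gini = 2c - 1 (the same quantity SAS reports as Somers' D). The Python sketch below, for illustration only, counts concordant, discordant and tied good/bad pairs on made-up scores. It assumes that a higher score indicates a good; the function name is hypothetical.

```python
def concordance(good_scores, bad_scores):
    """Count good/bad pairs and derive c and Gini.

    ASSUMPTION: a higher score means 'more likely good', so a pair is
    concordant when the good scores above the bad.
    """
    concordant = discordant = tied = 0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                concordant += 1
            elif g < b:
                discordant += 1
            else:
                tied += 1
    pairs = concordant + discordant + tied
    c = (concordant + 0.5 * tied) / pairs
    return {"pairs": pairs, "concordant": concordant,
            "discordant": discordant, "tied": tied,
            "c": c, "gini": 2 * c - 1}


goods = [620, 680, 700, 710, 750]  # made-up scores, illustration only
bads = [540, 560, 600, 640, 690]
result = concordance(goods, bads)
print(result)

# The same arithmetic applied to the percentages printed in the SAS output:
c_from_table = (69.1 + 0.5 * 0.3) / 100
gini_from_table = 2 * c_from_table - 1
print(round(c_from_table, 4), round(gini_from_table, 3))
```

Applied to the association table, this gives c of about 0.693 and a Gini of about 0.385, in line with the reported c and Somers' D up to rounding.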
Frequently Asked Questions
Q: What is C-stat (C)?
Ans: C is the area under the ROC curve: C = (# of concordant pairs + 0.5 * # of tied pairs) / # of pairs
= % concordant + 0.5 * % tied = (1 + Gini) / 2
Q: How do we measure the contribution of variables in the model?
Ans: To measure the contribution of variables we standardize them, since the variables are generally on different scales. Only if the variables share the same unit of measurement can the magnitudes of their coefficients be compared directly.
The contribution of variables in the model can be measured by:
a) Wald chi square
b) Point estimate
Q: What does point estimate tell us?
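Ans: A point estimate tells us how the dependent variable changes for every one-unit change in the independent variable. In the Odds Ratio Estimates table it is the factor by which the odds of the event are multiplied for a one-unit increase in that variable, holding the other variables constant. For example, the GPA point estimate of 2.235 means that each additional point of GPA multiplies the odds by about 2.2.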
Divergence Index Test
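The Divergence Index,

$$D = \frac{\bar{x}_g - \bar{x}_b}{s}$$

where $\bar{x}_g$ and $\bar{x}_b$ are the mean scores of the goods and the bads and $s$ is the pooled standard deviation of the scores, is a commonly employed measure of the separation achieved by a model. Multiplying D by $\sqrt{GB/(G+B)}$, where G and B are the numbers of goods and bads, gives the two-sample t statistic if the two population variances are equal. This measure tells us how well the means of the goods and bads are differentiated. A t statistic with |t| > 6 shows a high level of differentiation.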
Null Hypothesis (H0): The mean score of the good in the population is less than or equal to the mean score of the bad in the population. A robust model implies that the mean score for good will be significantly greater than the mean score for bad, i.e. the null hypothesis needs to be rejected. As shown by the p-value in Table 4.2.4, the null hypothesis is rejected at the 1% level of significance.
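A minimal Python sketch of the divergence calculation, for illustration only. It assumes the standard equal-variance two-sample form: the difference in mean scores is divided by the pooled standard deviation, and the t statistic is obtained by multiplying by sqrt(G × B / (G + B)). The scores are made-up values and the function name is hypothetical.

```python
import math
import statistics


def divergence_and_t(good_scores, bad_scores):
    """Return (D, t) for two score samples, assuming equal variances."""
    n_g, n_b = len(good_scores), len(bad_scores)
    mean_g = statistics.mean(good_scores)
    mean_b = statistics.mean(bad_scores)
    var_g = statistics.variance(good_scores)  # sample variance (n - 1)
    var_b = statistics.variance(bad_scores)
    # Pooled standard deviation of the two groups.
    pooled_sd = math.sqrt(((n_g - 1) * var_g + (n_b - 1) * var_b) / (n_g + n_b - 2))
    d = (mean_g - mean_b) / pooled_sd
    t = d * math.sqrt(n_g * n_b / (n_g + n_b))
    return d, t


goods = [620, 680, 700, 710, 750]  # made-up scores, illustration only
bads = [540, 560, 600, 640, 690]
d, t = divergence_and_t(goods, bads)
print(round(d, 3), round(t, 3))
```

A positive t far from zero supports rejecting the null hypothesis that the goods' mean score is no higher than the bads'.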
Clustering checks –
A good model should not have significant clustering of the population at any particular score and the population must be well scattered across score points.
Clustering refers to the proportion of accounts falling at various integral values of the model-generated scores (see footnote 1).
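One simple way to run this check is to tabulate the share of accounts at each integral score and inspect the largest shares. The Python sketch below is illustrative only: the probabilities are made up, the function name is hypothetical, and the score is taken as the predicted probability multiplied by 1000, as defined in the footnote.

```python
from collections import Counter


def score_concentration(probabilities, top_n=3):
    """Return the top_n (score, share of accounts) pairs.

    Score = predicted probability * 1000, rounded to an integer.
    """
    scores = [round(p * 1000) for p in probabilities]
    counts = Counter(scores)
    total = len(scores)
    return [(score, n / total) for score, n in counts.most_common(top_n)]


probs = [0.012, 0.012, 0.012, 0.250, 0.731, 0.731]  # made-up probabilities
print(score_concentration(probs, top_n=2))
```

A large share of accounts at a single score point would indicate clustering and warrants a closer look at the variables driving that score.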
Deviance and Residual Test
Both the Pearson and the Deviance tests assess whether there is sufficient evidence that the observed data do not fit the model. The null hypothesis is that the data fit the model. If the tests are not significant, there is no reason to assume that the model is incorrect, and we accept that the model generally fits the data. For this model both the Pearson and the Deviance tests are insignificant, further confirming that the model fits the data.
Hosmer and Lemeshow Test
The Hosmer-Lemeshow Goodness-of-Fit test tells us whether we have constructed a valid overall model or not. If the model is a good fit to the data then the Hosmer-Lemeshow Goodness-of-Fit test should have an associated p-value greater than 0.05. In our case the associated p-value is high for both the development and the validation sample, signalling that the model is a good fit for the data.
1 Score is defined as the probability of being a responder (as per Conversion Rate Model) multiplied by 1000.
Frequently Asked Questions
Q: What drawback does Hosmer Lemeshow test have?
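Ans: Hosmer-Lemeshow is a goodness-of-fit test. However, this metric is volatile: its value depends on the number of groups into which the observations are divided, which also sets the degrees of freedom (number of groups minus 2).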
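To make the dependence on grouping concrete, the Python sketch below computes the Hosmer-Lemeshow statistic for a chosen number of groups. It is an illustration on made-up data, not the SAS implementation: observations are sorted by predicted probability and split into equal-sized groups, and the p-value (from a chi-square distribution with groups - 2 degrees of freedom) is left out because the standard library has no chi-square function. The function name is hypothetical.

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """Return (HL statistic, degrees of freedom).

    Assumes the mean predicted probability in every group is strictly
    between 0 and 1 (otherwise the denominator is zero).
    """
    pairs = sorted(zip(probs, outcomes), key=lambda po: po[0])
    n = len(pairs)
    stat = 0.0
    for i in range(groups):
        chunk = pairs[i * n // groups:(i + 1) * n // groups]
        if not chunk:
            continue
        n_k = len(chunk)
        observed = sum(y for _, y in chunk)  # observed events in the group
        expected = sum(p for p, _ in chunk)  # expected events in the group
        mean_p = expected / n_k
        stat += (observed - expected) ** 2 / (n_k * mean_p * (1 - mean_p))
    return stat, groups - 2


probs = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8]  # made-up predictions
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]               # made-up 0/1 outcomes
hl_stat, hl_df = hosmer_lemeshow(probs, outcomes, groups=4)
print(round(hl_stat, 4), hl_df)
```

Rerunning the function with a different `groups` value on the same data generally changes both the statistic and its degrees of freedom, which is the volatility referred to above.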
Confusion Matrix (1 is the Event Indicator) – Development Data
The model correctly predicted 65% of the consumer completions.
Confusion Matrix (1 is the Event Indicator) – Validation Data
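A confusion matrix of this kind can be built by comparing the actual event indicator with the predicted class at a chosen probability cutoff. The Python sketch below is illustrative only: the data are made up, the 0.5 cutoff is an assumption, and the function names are hypothetical.

```python
def confusion_matrix(actual, predicted_prob, cutoff=0.5):
    """Count true/false positives and negatives (1 is the event)."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            cm["tp"] += 1
        elif predicted == 1 and y == 0:
            cm["fp"] += 1
        elif predicted == 0 and y == 0:
            cm["tn"] += 1
        else:
            cm["fn"] += 1
    return cm


def event_capture_rate(cm):
    """Share of actual events that the model predicted as events."""
    return cm["tp"] / (cm["tp"] + cm["fn"])


actual = [1, 1, 1, 0, 0, 0]               # made-up event indicators
probs = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]    # made-up predicted probabilities
cm = confusion_matrix(actual, probs)
print(cm, round(event_capture_rate(cm), 3))
```

The same function would be applied separately to the development and validation data so that the two matrices can be compared.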
Model Validation
1) Re-estimation on hold-out sample –
We re-estimate the model parameters on the hold-out validation sample to ensure that the parameters are in close proximity to those of the development sample and that all the other model performance measures hold.
2) Rescoring on bootstrap samples –
The samples are selected with replacement each time, and the equation from the development sample is used to re-score the model on several bootstrap samples of varying sample proportions (20% to 80%). The model should satisfy the performance measures stated above. Test statistics show that the model validates for all 5 bootstrap samples, with results within the confidence interval, and achieves complete rank-ordering.
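The resampling step can be sketched as follows, in Python for illustration only. The five sample proportions between 20% and 80% are hypothetical choices, as are the function names and the toy records; in practice each bootstrap sample would be scored with the development equation and the performance measures recomputed on it.

```python
import random


def bootstrap_samples(records, fractions=(0.2, 0.35, 0.5, 0.65, 0.8), seed=0):
    """Yield (fraction, sample) pairs, each drawn WITH replacement.

    The fractions are hypothetical choices spanning 20% to 80%.
    """
    rng = random.Random(seed)
    for frac in fractions:
        size = max(1, round(frac * len(records)))
        yield frac, rng.choices(records, k=size)


def bad_rate(sample):
    """Example statistic to recompute on each sample: share of bads (flag = 1)."""
    return sum(flag for _, flag in sample) / len(sample)


# Toy records: (score, bad flag). Made-up data, illustration only.
records = [(300 + 5 * i, 1 if i % 4 == 0 else 0) for i in range(100)]
results = [(frac, len(sample), bad_rate(sample))
           for frac, sample in bootstrap_samples(records)]
for frac, size, rate in results:
    print(f"{frac:.0%} sample: n={size}, bad rate={rate:.3f}")
```

Fixing the seed makes the draw reproducible, which helps when the validation results need to be audited later.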