Assessing the Importance of Variables - Ratner - Statistical and Machine-Learning Data Mining

The classic approach for assessing the statistical significance of a variable considered for model inclusion is the well-known null hypothesis significance testing procedure, which is based on the reduction in prediction error (actual response minus predicted response) associated with the variable in question. The statistical apparatus of the formal testing procedure for logistic regression analysis consists of the log likelihood (LL) function, the G statistic, degrees of freedom (df), and the p value. The procedure uses the apparatus within a theoretical framework with weighty and untenable assumptions. From a purist point of view, this could cast doubt on findings that actually have statistical significance. Even if findings of statistical significance are accepted as correct, they may not be of practical importance or have noticeable value to the study at hand. For the data miner with a pragmatic slant, the limitations and lack of scalability inherent in the classic system cannot be overlooked, especially within big data settings. In contrast, the data mining approach uses the LL units, the G statistic, and degrees of freedom in an informal data-guided search for variables that suggest a noticeable reduction in prediction error. One point worth noting is that the informality of the data mining approach calls for a suitable change in terminology, from declaring a result statistically significant to declaring it worthy of notice or noticeably important.

Before I describe the data mining approach of variable assessment, I would like to comment on the objectivity of the classic approach as well as degrees of freedom. The classic approach is so ingrained in the analytic community that no viable alternative occurs to practitioners, especially an alternative based on an informal and sometimes highly individualized series of steps. Declaring a variable statistically significant appears to be purely objective as it is based on sound probability theory and statistical mathematical machinery. However, the settings of the testing machinery defined by model builders could affect the results. The settings include the acceptable rates of declaring a variable significant when, in fact, it is not, and of declaring a variable not significant when, in fact, it is. Determining the proper sample size is also a subjective setting as it depends on the amount budgeted for the study. Last, the allowable degree of violation of test assumptions is set by the model builder’s experience. Therefore, by acknowledging the subjective nature of the classic approach, the model builder can be receptive to the alternative data mining approach, which is free of theoretical ostentation and mathematical elegance.

A word about degrees of freedom clarifies the discussion. This concept is typically described as a generic measure of the number of independent pieces of information available for analysis. To ensure accurate results, this concept is accompanied by the mathematical adjustment “replace N with N - 1.” The concept of degrees of freedom gives a deceptive impression of simplicity in counting the pieces of information. However, the principles used in counting are difficult for anyone but the mathematical statistician. To date, there is no generalized calculus for counting degrees of freedom. Fortunately, the counting already exists for many analytical routines. Therefore, the correct degrees of freedom are readily available; computer output automatically provides them, and there are lookup tables in older statistics textbooks. For the analyses in the following discussions, the counting of degrees of freedom is provided.

8.10.1 Computing the G Statistic

In data mining, the assessment of the importance of a subset of variables for predicting response involves the notion of a noticeable reduction in prediction error due to the subset of variables and is based on the ratio of the G statistic to the degrees of freedom, G/df. The degrees of freedom is defined as the number of variables in the subset. The G statistic is defined, in Equation (8.7), as the difference between two LL quantities, one corresponding to a model without the subset of variables and the other corresponding to a model with the subset of variables.

G = -2LL(model without variables) - -2LL(model with variables)    (8.7)

There are two points worth noting: First, the LL units are multiplied by a factor of -2, a mathematical necessity; second, the term subset is used to imply there is always a large set of variables available from which the model builder considers the smaller subset, which can include a single variable.

In the following sections, I detail the decision rules in three scenarios for assessing the likelihood that the variables have some predictive power. In brief, the larger the average G value per degrees of freedom (G/df), the more important the variables are in predicting response.
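The arithmetic of Equation (8.7) and the G/df ratio can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function names and the log-likelihood values are hypothetical, and in practice the two LL values would come from the output of a logistic regression routine.

```python
def g_statistic(ll_without: float, ll_with: float) -> float:
    """Equation (8.7): -2LL(model without variables) minus -2LL(model with variables)."""
    return (-2.0 * ll_without) - (-2.0 * ll_with)


def g_per_df(ll_without: float, ll_with: float, df: int) -> float:
    """Average G value per degree of freedom; df is the number of variables in the subset."""
    if df < 1:
        raise ValueError("df must be at least 1")
    return g_statistic(ll_without, ll_with) / df


# Hypothetical log-likelihood values, chosen only to illustrate the arithmetic.
g = g_statistic(ll_without=-350.0, ll_with=-338.0)   # 700 - 676
ratio = g_per_df(ll_without=-350.0, ll_with=-338.0, df=3)
print(g, ratio)
```

With these made-up values, G is 24 and, for a subset of three variables, G/df is 8.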

8.10.2 Importance of a Single Variable

If X is the only variable considered for inclusion into the model, the G statistic is defined in Equation (8.8):

G = -2LL(model with intercept only) - -2LL(model with X)    (8.8)

The decision rule for declaring X an important variable in predicting response is as follows: If G/df is greater than the standard G/df value of 4, then X is an important predictor variable and should be considered for inclusion in the model. Because the subset consists of a single variable, df = 1 and G/df equals G. Note that the decision rule only indicates that the variable has some importance, not how much importance. The decision rule implies that a variable with a greater G/df value has a greater likelihood of some importance than a variable with a smaller G/df value, not that it has greater importance.
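A small worked example may help make the single-variable rule concrete. The sketch below assumes a binary predictor X and uses hypothetical counts. It relies on a standard property of logistic regression that is not stated in the text: with an intercept and one binary predictor, the fitted probabilities equal the observed response rates in each group, so both LL values can be computed directly from counts without a fitting routine. The helper name binomial_ll is illustrative.

```python
import math

STANDARD_G_PER_DF = 4.0  # the standard G/df value used in the decision rule


def binomial_ll(responders: int, n: int) -> float:
    """Log likelihood of n binary outcomes when the fitted probability is responders / n."""
    ll = 0.0
    if responders > 0:
        ll += responders * math.log(responders / n)
    non_responders = n - responders
    if non_responders > 0:
        ll += non_responders * math.log(non_responders / n)
    return ll


# Hypothetical counts: (responders, group size) for X = 0 and X = 1.
group_x0 = (20, 100)
group_x1 = (40, 100)

# Intercept-only model: a single response rate for all individuals.
total_responders = group_x0[0] + group_x1[0]
total_n = group_x0[1] + group_x1[1]
ll_intercept_only = binomial_ll(total_responders, total_n)

# Model with binary X: the fitted probability is each group's own response rate.
ll_with_x = binomial_ll(*group_x0) + binomial_ll(*group_x1)

# Equation (8.8), with df = 1 for a single variable.
g = (-2.0 * ll_intercept_only) - (-2.0 * ll_with_x)
df = 1
x_is_important = g / df > STANDARD_G_PER_DF
print(f"G = {g:.2f}, G/df = {g / df:.2f}, important: {x_is_important}")
```

For these counts G comes out near 9.7, which exceeds 4, so X would be flagged as having some importance. As the rule states, this says nothing about how much importance X has.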

8.10.3 Importance of a Subset of Variables

When subset A consisting of k variables is the only subset considered for model inclusion, the G statistic is defined in Equation (8.9):

G = -2LL(model with intercept only) - -2LL(model with A(k) variables)    (8.9)

The decision rule for declaring subset A important in predicting response is as follows: If G/k is greater than the standard G/df value of 4, then subset A is an important subset of predictor variables and should be considered for inclusion in the model. As before, the decision rule only indicates that the subset has some importance, not how much importance.

8.10.4 Comparing the Importance of Different Subsets of Variables

Let subsets A and B consist of k and p variables, respectively. The number of variables in each subset does not have to be equal. If they are equal, then all but one variable can be the same in both subsets. The G statistics for A and B are defined in Equations (8.10) and (8.11), respectively:

G(k) = -2LL(model with intercept only) - -2LL(model with “A” variables)    (8.10)

G(p) = -2LL(model with intercept only) - -2LL(model with “B” variables)    (8.11)

The decision rule for declaring which of the two subsets is more important (i.e., greater likelihood of having some predictive power) in predicting response is as follows:

1. If G(k)/k is greater than G(p)/p, then subset A is the more important predictor variable subset; otherwise, B is the more important subset.

2. If G(k)/k and G(p)/p are equal or have comparable values, then both subsets are to be regarded tentatively as being of comparable importance. The model builder should consider additional indicators to assist in the decision about which subset is better.

It follows clearly from the decision rule that the better model is defined by the more important subset. Of course, this rule assumes that G(k)/k and G(p)/p are greater than the standard G/df value 4.
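The two-part rule, together with the requirement that both ratios exceed 4, can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed procedure: the text does not define how close two ratios must be to count as comparable, so the rel_tol parameter (5 percent here) is a hypothetical choice, as are the class name, the function name, and the G values.

```python
import math
from dataclasses import dataclass

STANDARD_G_PER_DF = 4.0  # the standard G/df value from the text


@dataclass
class SubsetResult:
    name: str
    g: float       # G statistic versus the intercept-only model, Equations (8.10)/(8.11)
    n_vars: int    # number of variables in the subset, which is its df

    @property
    def g_per_df(self) -> float:
        return self.g / self.n_vars


def compare_subsets(a: SubsetResult, b: SubsetResult, rel_tol: float = 0.05) -> str:
    """Return the name of the more important subset, 'comparable', or 'not applicable'."""
    # The rule assumes both ratios exceed the standard G/df value.
    if min(a.g_per_df, b.g_per_df) <= STANDARD_G_PER_DF:
        return "not applicable"
    # rel_tol is an assumed definition of "comparable values"; the text leaves this informal.
    if math.isclose(a.g_per_df, b.g_per_df, rel_tol=rel_tol):
        return "comparable"
    return a.name if a.g_per_df > b.g_per_df else b.name


# Hypothetical G values, for illustration only.
subset_a = SubsetResult("A", g=36.0, n_vars=3)   # G/df = 12.0
subset_b = SubsetResult("B", g=50.0, n_vars=5)   # G/df = 10.0
subset_c = SubsetResult("C", g=47.0, n_vars=4)   # G/df = 11.75
subset_d = SubsetResult("D", g=6.0, n_vars=2)    # G/df = 3.0

print(compare_subsets(subset_a, subset_b))
print(compare_subsets(subset_a, subset_c))
print(compare_subsets(subset_a, subset_d))
```

Note that subset B has the larger G (50 versus 36) yet the smaller G/df, so the rule favors A. A result of "comparable" is the cue to consult additional indicators, as item 2 advises.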
