An Empirical Example - Extensions and Applications of Ensemble-of-trees Methods in Machine Learning

We turn now to analyses of real data. The dataset was selected to be typical of those recently used in parole or probation settings. Recall, however, that it is very difficult with real data to arrive at results that are broadly generalizable.

6.5.1 Forecasting Arrests for Serious Crimes

The data address how well parolees manage under supervision. There are 20,000 observations in the training data and 5,000 observations in the test data. We consider whether an individual is arrested for a serious crime within 2 years of release on probation. Serious crimes include murder, attempted murder, rape, aggravated assault, and arson. About 13% fail by this definition. Such crimes are of widespread concern. Static and dynamic predictors include:

(a) Date of Birth;

(e) Earliest Age for a Charge as a Juvenile;

(f) Total Number of Priors as a Juvenile;

(g) Number of Charges for Drug Crimes as an Adult;

(h) Number of Sex Crime Priors as an Adult.

There is nothing special about these predictors. They represent the usual kinds of information that are routinely available on parolees when they begin their supervision. From past experience, they can make important contributions to forecasting accuracy (Berk, 2012).

We first apply logistic regression to the training data. A threshold of .135 is imposed on the predicted probabilities in order to arrive empirically at a 5 to 1 cost ratio of false negatives to false positives. Table 6.4 is the confusion table that results when the model is applied to test data. From the column on the far right, about 44% of the true failures are misclassified and about 32% of the true successes are misclassified. The forecasting accuracy is within the range of recent studies with similar data (Berk, 2012) and could well be useful for decision-makers.

| | Predict Fail | Predict No Fail | Model Error |
|---|---|---|---|
| Actual Fail | 378 | 302 | 0.444 |
| Actual No Fail | 1385 | 2935 | 0.321 |

Table 6.4: Logistic regression test data confusion table for serious crime.
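The entries in the far right column of Table 6.4 can be checked directly from the counts. The following is a minimal sketch, not the authors' code: the function name `confusion_summary` and its dictionary keys are hypothetical, and it reads the empirical cost ratio as the number of false positives per false negative, which is an assumption about how the 5 to 1 ratio is measured.

```python
def confusion_summary(tp, fn, fp, tn):
    """Row-wise model error and false positives per false negative.

    tp, fn: actual failures predicted to fail / predicted not to fail.
    fp, tn: actual non-failures predicted to fail / predicted not to fail.
    """
    return {
        "fail_error": fn / (tp + fn),      # share of true failures misclassified
        "no_fail_error": fp / (fp + tn),   # share of true successes misclassified
        "fp_per_fn": fp / fn,              # assumed reading of the empirical cost ratio
    }


# Counts taken from Table 6.4 (logistic regression, test data).
logistic = confusion_summary(tp=378, fn=302, fp=1385, tn=2935)
for name, value in logistic.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, the logistic regression table shows between four and five false positives for every false negative, which is close to, but not exactly, the 5 to 1 target.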

Table 6.5 is the confusion table for RF using the test data. The procedure was tuned to arrive empirically at a cost ratio of about 5 to 1 for false negatives to false positives. From the column on the far right, about 37% of those who actually fail are incorrectly identified and about 28% of those who actually do not fail are incorrectly identified. Forecasting accuracy for RF appears to be superior.

| | Predict Fail | Predict No Fail | Model Error |
|---|---|---|---|
| Actual Fail | 427 | 253 | 0.372 |
| Actual No Fail | 1196 | 3124 | 0.277 |

Table 6.5: Random forests test data confusion table for serious crime.

Table 6.6 is the confusion table for SGB using the test data. A threshold of .13 was used on the predicted probabilities from the training data to empirically arrive at a cost ratio of about 5 to 1. From the column on the far right, about 42% of those who actually fail are incorrectly identified and about 32% of those who actually do not fail are incorrectly identified. SGB does appreciably better than logistic regression when forecasting failures, but only slightly better when forecasting successes.

| | Predict Fail | Predict No Fail | Model Error |
|---|---|---|---|
| Actual Fail | 396 | 284 | 0.418 |
| Actual No Fail | 1361 | 2959 | 0.315 |

Table 6.6: Stochastic gradient boosting test data confusion table for serious crime.
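The thresholds of .135 and .13 reported for logistic regression and SGB were chosen so that the resulting confusion tables show roughly five false positives per false negative. The text does not describe how the thresholds were searched for, so the following is only a sketch of one plausible approach: a grid search over candidate thresholds. The function names, the grid, and the toy data are all hypothetical and made up for illustration.

```python
def confusion_counts(probs, outcomes, threshold):
    """Count (tp, fn, fp, tn) when 'fail' is predicted for probs >= threshold."""
    tp = fn = fp = tn = 0
    for p, y in zip(probs, outcomes):
        predicted_fail = p >= threshold
        if y == 1 and predicted_fail:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_fail:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def threshold_for_cost_ratio(probs, outcomes, target_ratio, grid):
    """Return the grid threshold whose FP-to-FN ratio is closest to target_ratio."""
    best_threshold, best_gap = None, float("inf")
    for t in grid:
        _, fn, fp, _ = confusion_counts(probs, outcomes, t)
        if fn == 0:
            continue  # ratio undefined when there are no false negatives
        gap = abs(fp / fn - target_ratio)
        if gap < best_gap:
            best_threshold, best_gap = t, gap
    return best_threshold


# Toy data (illustrative only): 3 failures followed by 10 non-failures.
toy_probs = [0.05, 0.30, 0.40,
             0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.20, 0.25, 0.50]
toy_outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_grid = [0.03, 0.07, 0.09, 0.11, 0.15, 0.35]

best = threshold_for_cost_ratio(toy_probs, toy_outcomes, 5.0, toy_grid)
print(best, confusion_counts(toy_probs, toy_outcomes, best))
```

In practice the search would be run on the training data and the chosen threshold then applied to the test data, which is one reason the ratio realized in the test data need not match the target exactly.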

It appears that across the three tables, RF performs better than logistic regression and SGB. This is consistent with published studies (Berk, 2012). But one must not overstate what is learned from the comparisons we report. It is difficult to guarantee that after tuning, one is necessarily comparing apples to apples. We have tried to ensure that, for all practical purposes, the false negative to false positive cost ratios are the same for all three procedures. But the cost ratios are not identical, and it is essentially impossible to make them so. The test data and training data are different random splits of the available dataset. Tuning done on the training data will carry over a bit differently to the test data, depending on the forecasting procedure. Moreover, each procedure was tuned with its own special set of tuning parameters. There is no guarantee that the results are fully comparable. Indeed, it is not even clear how to define such a thing.
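How close the three procedures come to a common cost ratio can be read off the false positive and false negative counts in Tables 6.4 through 6.6. The sketch below assumes, as a simplification, that the empirical cost ratio is the count of false positives divided by the count of false negatives in the test data; the dictionary layout and variable names are hypothetical.

```python
# False negative and false positive counts from Tables 6.4, 6.5, and 6.6.
tables = {
    "logistic regression": {"fn": 302, "fp": 1385},
    "random forests": {"fn": 253, "fp": 1196},
    "stochastic gradient boosting": {"fn": 284, "fp": 1361},
}

# Assumed definition: false positives incurred per false negative.
cost_ratios = {name: c["fp"] / c["fn"] for name, c in tables.items()}

for name, ratio in cost_ratios.items():
    print(f"{name}: {ratio:.2f} to 1")
```

All three ratios fall somewhat short of 5 to 1 and differ from one another in the first decimal place, which illustrates why the comparison is only approximately like-for-like.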

Another important issue is whether the differences are large enough to matter. As already explained, that judgment depends on the application. For example, the agency from which these data were obtained supervises about 40,000 individuals on probation each year. About 5,000 of these individuals are arrested for a serious crime within 24 months, most in less than a year. For failures, the difference of approximately 7 percentage points between the accuracy of logistic regression and that of RF translates into about 350 serious crimes. Roughly 50 of those will be homicides or attempted homicides, the perpetrators of which could be identified in advance by RF, but not by logistic regression. In this instance, stakeholders found the practical difference in forecasting accuracy dramatic.
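The arithmetic behind the figure of about 350 can be reproduced from the failure rows of Tables 6.4 and 6.5. This is a back-of-the-envelope sketch under the assumption that the test data error rates carry over unchanged to the roughly 5,000 annual failures quoted above; the variable names are hypothetical.

```python
# Share of true failures misclassified, from Tables 6.4 and 6.5.
fail_error_logistic = 302 / 680
fail_error_rf = 253 / 680

# Difference in accuracy for failures, in proportion terms (about 0.072).
gap = fail_error_logistic - fail_error_rf

# Approximate number of individuals arrested for a serious crime, per the text.
annual_failures = 5000

# Additional failures RF would flag, using the unrounded gap.
additional_identified = gap * annual_failures

# The same calculation using the rounded 7 point difference.
rounded_version = 0.07 * annual_failures

print(f"gap: {gap:.3f}")
print(f"unrounded: {additional_identified:.0f}, rounded: {rounded_version:.0f}")
```

Using the rounded difference gives 350, while the unrounded difference gives a figure closer to 360; either way the order of magnitude is the same.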

If one is looking for firm conclusions about forecasting accuracy from our results and others, it is almost certain that, properly applied, RF will always do at least as well as logistic regression, and much of the time meaningfully better. SGB will do at least as well as logistic regression, but is somewhat less likely to dominate it.
