• No results found

Our goal in this section is to use squared error and cross-entropy performance as a quantitative measure of the calibration of a model and conduct an empirical com- parison across the learning algorithms and calibration methods discussed earlier in the chapter. Rather than delving into details of performances on each individual problem, we will maintain a more high-level view and only present and discuss average performance across the eleven test problems.

For each learning algorithm, we use all the the parameter settings described in the previous section to train classifiers, and calibrate each classifier with Isotonic Regression and Platt Scaling. In the case of boosted trees and stumps we also cali- brate the models using Logistic Correction and train models that directly optimize log-loss. Models are trained on 4000 points and calibrated on 1000 independent points (if needed). For each data set, learning algorithm, and calibration method, we select the model with the best performance using the same 1000 points used for calibration, and report it’s performance on the large final test set. We use five fold cross-validation to obtain five trials.

Figure 3.12 shows the root mean squared error (left) and mean cross-entropy (right) for each learning method before and after calibration. Each bar averages

0.3 0.32 0.34 0.36 0.38

BST-DT RF SVM BAG-DT ANN KNN BST-STMP DT LOGREG NB

Root Mean Squared Error

Uncalibrated Isotonic Regression Platt Scaling Logistic Correction LogLoss 0.4 0.45 0.5 0.55 0.6 0.65 0.7 NB LOGREG DT BST-STMP KNN ANN BAG-DT SVM RF BST-DT Mean Cross-Entropy Uncalibrated Isotonic Regression Platt Scaling Logistic Correction LogLoss

Figure 3.12: Performance of learning algorithms

over five trials on each of the eleven problems. Error bars representing one standard error are shown, but they are too small to be seen in some cases. The learning algorithms are roughly ordered by their squared error performance, with the best algorithm on the left. Boosted decision trees calibrated with Platt Scaling are clearly the best method with significantly lower squared error and cross-entropy when compared to the other learning algorithms. Boosted trees are followed by calibrated random forests, calibrated SVMs, bagged decision trees and uncalibrated neural nets. If we only look at the uncalibrated models however, bagged trees, random forests and neural nets are the best models. At the other end of the spectrum, boosted stumps, decision trees, logistic regression and Naive Bayes have the worst performance.2 The main reason why boosted stumps, logistic regression

and Naive Bayes perform poorly regardless of the calibration method used is not that these models could not be calibrated, but because the models learned using these algorithms are too simple to capture the full complexity of most datasets. Logistic regression and Naive Bayes learn linear models, while boosted stumps are unable to represent any interactions between attributes (Friedman et al., 2000). Similarly the performance of decision trees suffers because they have very high

2

The mean cross-entropy for the uncalibrated Naive Bayes models is actually 1.06, but the figure is cut at 0.7 to better show the interesting region.

variance or make coarse predictions, and there is little that calibration can do to fix these problems. This ranking of learning algorithms by their performance is fairly consistent whether we look at squared error or at cross-entropy. The only difference is that bagged trees have slightly better cross-entropy than SVMs calibrated with Platt’s method, while having slightly worse squared error.

As expected, the probabilities predicted by four learning methods — boosted trees, SVMs, boosted stumps, and Naive Bayes — are dramatically improved by calibration, while neural nets and logistic regression models are actually hurt a little by post-training calibration. For random forests, bagged trees and memory based learning we have seen in the previous section that Platt Scaling helps on some problems that display sigmoidal reliability plots, but hurts on other problems where the models already are well calibrated. For random forest and memory based learning the benefit is greater than the loss, so on average the models calibrated with Platt Scaling perform better. For bagged trees it’s a wash. Comparing the left and the right graphs in Figure 3.12 it looks as if Isotonic Regression does perform slightly worse than Platt Scaling for cross-entropy for all learning methods. When looking at squared error, Isotonic Regression performs a little better than Platt Scaling for decision trees and Naive Bayes.

We have seen in Sections 3.3.1 and 3.3.1 that boosted trees calibrated with Logistic Correction and boosted trees trained to optimize log-loss only output zero or one predictions for most problems, which explains their poor squared error and cross-entropy performance. On the other hand, for boosted stumps, Logistic Correction performs on par with Platt Scaling. This makes Logistic Correction very appealing, especially since it does not require any extra time or data for calibration. Directly optimizing log-loss with boosted stumps performs a few per- cent worse than Logistic Correction, which is consistent with the results reported

by Lebanon and Lafferty (2001). Unfortunately, the main reason why Logistic Correction works well with boosted stumps and not with boosted trees is that boosted stumps have limited expressive power. Most datasets are too complex to be learned well using boosted stumps. Boosted trees however, are able to learn models for these complex datasets well, and when calibrated with Platt Scaling produce excellent probabilistic predictions.3

Related documents