Choosing and evaluating models
Evaluating models
K-FOLD CROSS-VALIDATION
Testing on hold-out data, while useful, only gives a single-point estimate of model performance. In practice we want both an unbiased estimate of our model’s future performance on new data (simulated by test data) and an estimate of the distribution of this estimate under typical variations in data and training procedures. A good method to perform these estimates is k-fold cross-validation and the related ideas of empirical resampling and bootstrapping.
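As a rough illustration of how k-fold cross-validation yields a distribution of scores rather than a single point, here is a minimal sketch in Python (not the book's R). The data is synthetic, and the helper names (k_fold_scores, fit_mean, rmse) and the toy "predict the training mean" model are assumptions made for this example only.

```python
import random
import statistics

def k_fold_scores(xs, ys, k, fit, score, seed=0):
    """Fit on k-1 folds, score on the held-out fold, once per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint sets of row indices
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            score(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return scores

# Hypothetical toy model: always predict the mean of the training outcomes.
def fit_mean(xs, ys):
    return statistics.fmean(ys)

# Root mean squared error of the constant prediction on held-out data.
def rmse(model, xs, ys):
    return (sum((y - model) ** 2 for y in ys) / len(ys)) ** 0.5

# Synthetic data, for illustration only.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

scores = k_fold_scores(xs, ys, k=5, fit=fit_mean, score=rmse)
print("mean held-out RMSE:", statistics.fmean(scores))
print("spread across folds:", statistics.stdev(scores))
```

The mean of the fold scores plays the role of the performance estimate, and the spread across folds gives a rough sense of how much that estimate varies with the data used for training.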
The idea behind k-fold cross-validation is to repeat the construction of the model on different subsets of the available training data and then evaluate the model only on data not seen during construction. This is an attempt to simulate the performance of the model on unseen future data. The need to cross-validate is one of the reasons it’s critical that model construction be automatable, such as with a script in a language like R, and not depend on manual steps. Assuming you have enough data to cross-validate (not having to worry too much about the statistical efficiency of techniques) is one of the differences between the attitudes of data science and traditional statistics. Section 6.2.3 works through an example of automating k-fold cross-validation.
SIGNIFICANCE TESTING
Statisticians have a powerful idea related to cross-validation called significance testing. Significance also goes under the name of p-value and you will actually be asked, “What is your p-value?” when presenting.
The idea behind significance testing is that we can believe our model’s performance is good if it’s very unlikely that a naive model (a null hypothesis) could score as well as our model. The standard incantation in that case is “We can reject the null hypothesis.” This means our model’s measured performance is unlikely for the null model. Null models are always of a simple form: assuming two effects are independent when we’re trying to model a relation, or assuming a variable has no effect when we’re trying to measure an effect strength.
For example, suppose you’ve trained a model to predict how much a house will sell for, based on certain variables. You want to know if your model’s predictions are better than simply guessing the average selling price of a house in the neighborhood (call this the null model). Your new model will mispredict a given house’s selling price by a certain average amount, which we’ll call err.model. The null model will also mispredict a given house’s selling price by a different amount, err.null. The null hypothesis is that D = (err.null - err.model) == 0; that is, on average, the new model performs the same as the null model.
When you evaluate your model over a test set of houses, you will (hopefully) see that D = (err.null - err.model) > 0 (your model is more accurate). You want to make sure that this positive outcome is genuine, and not just something you observed by chance. The p-value is the probability that you’d see a D at least as large as the one you observed if the two models actually perform the same.
Our advice is this: always think about p-values as estimates of how often you’d find a relation (between the model and the data) when there actually is none. This is why low p-values are good, as they’re estimates of the probabilities of undetected disastrous situations (see http://mng.bz/A3G1). You might also think of the p-value as the probability of your whole modeling result being one big “false positive.” So, clearly, you want the p-value (or the significance) to be small, say less than 0.05.
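To make the house-price example concrete, the following Python sketch estimates a p-value for D by simulation. It assumes you have per-house absolute errors for both models on the same test set, and it uses a paired sign-flip randomization test: if the two models really perform the same, swapping which model produced which error should not matter. This technique is swapped in for illustration and is not the book's stated method; the data is synthetic, and the function name paired_sign_flip_p_value is hypothetical.

```python
import random
import statistics

def paired_sign_flip_p_value(err_null, err_model, n_rounds=2000, seed=0):
    """Estimate P(D >= observed D) when the two models are interchangeable."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(err_null, err_model)]
    observed = statistics.fmean(diffs)  # this is D = err.null - err.model
    hits = 0
    for _ in range(n_rounds):
        # Under the null, each per-house difference is equally likely
        # to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if statistics.fmean(flipped) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_rounds + 1)

# Synthetic per-house absolute errors (in thousands of dollars, say).
rng = random.Random(42)
err_null = [abs(rng.gauss(0, 30)) for _ in range(200)]
err_model = [abs(rng.gauss(0, 20)) for _ in range(200)]

observed_d, p_value = paired_sign_flip_p_value(err_null, err_model)
print("D =", observed_d, "estimated p-value =", p_value)
```

A small estimated p-value here says that a D this large would rarely arise if the new model were no better than the null model.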
The traditional statistical method of computing significance or p-values is through a Student’s t-test or an F-test (depending on what you’re testing). For classifiers, there’s a particularly good significance test that’s run on the confusion matrix: Fisher’s exact test, available in R as fisher.test(). These tests are built into most model fitters. They have a lot of math behind them that lets a statistician avoid fitting more than one model. These tests also rely on a few assumptions (to make the math work) that may or may not be true about your data and your modeling procedure.
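For a sense of what a test on the confusion matrix computes, here is a minimal Python sketch of a one-sided Fisher's exact test on a 2x2 table, built from the hypergeometric distribution. It is an illustration only: the function name fisher_exact_greater and the example counts are made up, and R's fisher.test() reports a two-sided p-value by default, so its output will generally differ from this one-sided figure.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the actual classes and columns the predicted classes, so `a`
    counts true positives. Returns the probability of seeing at least `a`
    true positives if predictions were independent of the actual class,
    with all row and column totals held fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1)
    )
    return numerator / comb(total, col1)

# Hypothetical confusion matrix: 3 true positives, 1 false negative,
# 1 false positive, 3 true negatives.
p = fisher_exact_greater(3, 1, 1, 3)
print(p)  # 17/70, about 0.243
```

With only eight examples, even a classifier that gets six of them right does not reach the usual 0.05 threshold, which is a reminder that significance depends on sample size as well as accuracy.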
One way to directly simulate a bad modeling situation is by using a permutation test. This is when you permute the input (or independent) variables among examples. In this case, there’s no real relation between the modeling features (which we have permuted among examples) and the quantity to be predicted, because in our new dataset the modeling features and the result come from different (unrelated) examples. Thus each rerun of the permuted procedure builds something much like a null model. If our actual model isn’t much better than the population of permuted models, then we should be suspicious of our actual model. Note that in this case, we’re thinking about the uncertainty of our estimates as being a distribution drawn about the null model.
We could modify the code in section 6.2.3 to perform an approximate permutation test by permuting the y-values each time we resplit the training data. Or we could try a package that performs the work and/or brings in convenient formulas for the various probability and significance statements that come out of permutation experiments (for example, http://mng.bz/SvyB).
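The permutation idea can be sketched in a few lines of Python. This is not the code from section 6.2.3: it assumes a single input variable, a least-squares line as the model, and in-sample R-squared as the score, all chosen only to keep the example short. The data is synthetic and the function names are hypothetical. With one input variable, shuffling the y-values is equivalent to shuffling the inputs among examples.

```python
import random
from statistics import fmean

def r_squared(xs, ys):
    """In-sample R-squared of the least-squares line of ys on xs."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def permutation_p_value(xs, ys, n_rounds=1000, seed=0):
    """Fraction of permuted fits scoring at least as well as the real fit."""
    rng = random.Random(seed)
    actual = r_squared(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(shuffled)  # break any real relation between x and y
        if r_squared(xs, shuffled) >= actual:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return actual, (hits + 1) / (n_rounds + 1)

# Synthetic data with a genuine linear relation plus noise.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

actual_score, p_value = permutation_p_value(xs, ys)
print("actual R-squared:", actual_score, "permutation p-value:", p_value)
```

If the actual score sits well above nearly every permuted score, the estimated p-value is small and the real model is doing something a null model does not.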
CONFIDENCE INTERVALS
An important and very technical frequentist statistical concept is the confidence interval. To illustrate, a 95% confidence interval is an interval from an estimation procedure such that the procedure is thought to have a 95% (or better) chance of catching the true unknown value to be estimated in an interval. It is not the case that there is a 95% chance that the unknown true value is actually in the interval at hand (though it’s often misstated as such). The Bayesian alternative to confidence intervals is credible intervals (which can be easier to understand, but do require the introduction of a prior distribution).
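The "chance of catching" in this definition belongs to the procedure, not to any single interval, and a small simulation can make that visible. The Python sketch below repeats an experiment many times and counts how often the interval contains the true mean. It assumes normally distributed data with a known standard deviation (so the textbook z-interval applies); the parameter values and the function name coverage are arbitrary choices for illustration.

```python
import random
import statistics

def coverage(true_mean=10.0, sigma=2.0, n=30, n_experiments=2000, seed=0):
    """Fraction of repeated experiments whose 95% interval catches true_mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    half_width = z * sigma / n ** 0.5  # assumes sigma is known
    caught = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = statistics.fmean(sample)
        if m - half_width <= true_mean <= m + half_width:
            caught += 1
    return caught / n_experiments

rate = coverage()
print("fraction of intervals containing the true mean:", rate)
```

Each individual interval either contains the true mean or it does not; the figure near 0.95 describes how often the procedure succeeds over many repetitions.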
USING STATISTICAL TERMINOLOGY
The field of statistics has spent the most time formally studying the issues of model correctness and model soundness (probability theory, operations research, theoretical computer science, econometrics, and a few other fields have of course also contributed). Because of their priority, statisticians often insist that the checking of model performance and soundness be solely described in traditional statistical terms. But a data scientist must present to many non-statistical audiences, so the reasoning behind a given test is in fact best explicitly presented and discussed. It’s not always practical to allow the dictates of a single field to completely style a cross-disciplinary conversation.
5.4 Summary
You now have some solid ideas on how to choose among modeling techniques. You also know how to evaluate the quality of data science work, be it your own or that of others. The remaining chapters of part 2 of the book will go into more detail on how to build, test, and deliver effective predictive models. In the next chapter, we’ll actually start building predictive models, using the simplest types of models that essentially memorize and summarize portions of the training data.
Key takeaways
Always first explore your data, but don’t start modeling before designing some measurable goals.
Divide your model testing into establishing the model’s effect (performance on various metrics) and soundness (likelihood of being a correct model versus arising from overfitting).
Keep a portion of your data out of your modeling work for final testing. You may also want to subdivide your training data into training and calibration sets to estimate the best values for various modeling parameters.
Keep many different model metrics in mind, and for a given project try to pick