The Problem of Overfitting Data

Suppose that you are given the job of distinguishing likely Democratic party voters from likely Republican voters on the basis of age and income.

A scatter plot illustrating political party affiliation as a function of age and income.

To give you the basis to make such a decision, you have polled 100 people, obtained their age, income, and party affiliation, and put the results in a scatter plot.

How can you generalize from this data to predict political affiliation based only on people’s age and income? The simplest approach would be to carve up the age–income plane into two regions and assign each of these regions to one of the political parties. We present two possible divisions in the accompanying figure. On the top is the best possible Democrat–Republican discriminator that can be built from a single straight line. It cuts the space completely according to income; anyone who makes less than $80,000 a year is classified a Democrat, whereas anyone who makes more than that is called a Republican.

Such a simple-minded division makes mistakes, of course. Indeed, three of the Democrats and four of the Republicans ended up on the wrong side of the line. On the bottom we provide a different divider that correctly classifies all the compassionate rich and misguided poor in our test set, but it has to jump around a lot in order to do so.

Which of these two classifiers do you think does a better job distinguishing Democrats from Republicans? Even though it makes a few mistakes, I prefer the simpler, straight-line model. Its simplicity helps guard against overfitting the data, that is, building a model that so completely reflects the weirdnesses of the training data that it misses the larger picture. The more convoluted classifier distorts its shape to classify the outliers correctly, whereas the straight-line classifier mislabels these oddballs on the assumption that they are, in fact, oddballs without predictive value.
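The trade-off can be made concrete with a small simulation. What follows is only a sketch under stated assumptions, not the data behind the figure: the voters are randomly generated, exactly seven of every 100 are deliberately given the "wrong" party to play the role of oddballs, and the contorted divider is approximated by a nearest-neighbor rule that simply copies the label of the closest polled person. The names `make_sample`, `line_classifier`, and `make_memorizer` are invented for this illustration.

```python
import random

def make_sample(rng, n=100, n_oddballs=7):
    """Hypothetical voters as (age, income in $1000s, party).

    Party follows the $80K income rule, except that the first
    n_oddballs people are deliberately given the opposite label.
    """
    people = []
    for i in range(n):
        age = rng.uniform(18, 90)
        income = rng.uniform(10, 150)
        republican = income > 80
        if i < n_oddballs:
            republican = not republican  # an oddball
        people.append((age, income, "R" if republican else "D"))
    return people

def line_classifier(age, income):
    """The single straight line from the text: split at $80K income."""
    return "R" if income > 80 else "D"

def make_memorizer(train):
    """Assumed stand-in for the contorted divider: copy the nearest polled person."""
    def classify(age, income):
        nearest = min(train, key=lambda p: (p[0] - age) ** 2 + (p[1] - income) ** 2)
        return nearest[2]
    return classify

def accuracy(classify, people):
    """Fraction of people whose party the classifier gets right."""
    hits = sum(classify(age, income) == party for age, income, party in people)
    return hits / len(people)

rng = random.Random(0)
train = make_sample(rng)   # the 100 people we polled
fresh = make_sample(rng)   # 100 new people the model has never seen
memorizer = make_memorizer(train)

for name, clf in [("straight line", line_classifier), ("memorizer", memorizer)]:
    print(f"{name}: polled {accuracy(clf, train):.2f}, fresh {accuracy(clf, fresh):.2f}")
```

By construction the straight line gets 93 of 100 right on both samples, while the memorizer is perfect on the people it was built from. The comparison of interest is how the memorizer fares on the fresh sample, where the oddballs it bent itself around no longer help.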

Properly modeling the expected trifecta payoffs required care to guard against overfitting our data. Recall that we averaged the results of all previous payoffs to predict future returns. This method worked well for most bet types such as win, place, show, and quiniela. However, the results of simple averaging are not so easy to believe in the case of trifectas. There are 336 different trifecta combinations, and thus the average trifecta should have occurred roughly 23 times during our sampling interval. But this average is misleading because there is a high variance in the number of occurrences.

Two dividers that discriminate between Democrats and Republicans.

The mean or average is a statistical measure of the central value of a sequence, whereas variance, and its close cousin standard deviation, measure the consistency of values in a sequence. Let us consider the annual salaries (in thousands of dollars) of 10 people in each of 2 different professions. The first sample comes from unionized postal workers in Omaha, Nebraska:

33, 27, 39, 25, 26, 24, 36, 28, 32, 30

and the second sample comes from people in the telemarketing industry (a large fraction of whom happen to operate out of Omaha):

19, 30, 20, 24, 108, 17, 23, 19, 22, 18

Both of these sequences have the same average (30K). But the variance of the telemarketers is considerably higher because it is thrown off by the inclusion of one high-paid member of management. The standard deviation in salaries at a union shop is likely to be much lower than one in which management feels free to oppress the masses and appropriately oil the squeaky wheel.
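These figures are easy to check. The sketch below uses Python's standard `statistics` module and treats each list of ten salaries as the whole population (hence `pvariance` and `pstdev`); that choice, and the variable names, are assumptions of this illustration.

```python
import statistics

# Annual salaries in thousands of dollars, copied from the two samples above.
postal = [33, 27, 39, 25, 26, 24, 36, 28, 32, 30]
telemarketing = [19, 30, 20, 24, 108, 17, 23, 19, 22, 18]

stats = {}
for name, salaries in [("postal", postal), ("telemarketing", telemarketing)]:
    stats[name] = {
        "mean": statistics.mean(salaries),
        # Population variance: average squared distance from the mean.
        "variance": statistics.pvariance(salaries),
        # Standard deviation: square root of the variance, back in $1000s.
        "stdev": statistics.pstdev(salaries),
    }
    print(name, stats[name])
```

Both samples report a mean of 30, while the variance comes out to 22 for the postal workers and 688.8 for the telemarketers, for standard deviations of roughly 4.7 and 26.2. The single salary of 108 accounts for most of the difference.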

The high variance of payoffs associated with rare trifectas becomes a problem in trying to estimate their expected return accurately. Suppose we were to pick a single random element of each of the two sets of salaries above. Which random salary would more accurately reflect the average of the group? There is less chance that a random element of the low-variance sequence will do a bad job representing his cronies than one from the more diverse sequence. Picking the manager as a typical representative of the telemarketing industry would be seriously misleading, but is just as likely as picking the single fellow who is right on the average.
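One way to quantify "doing a bad job" is to ask how far, on average, a single randomly chosen salary lands from its group's mean. The sketch below computes that mean absolute error for both samples; the helper name `expected_error` is invented for this illustration, and each person is assumed equally likely to be picked.

```python
# Annual salaries in thousands of dollars, copied from the two samples above.
postal = [33, 27, 39, 25, 26, 24, 36, 28, 32, 30]
telemarketing = [19, 30, 20, 24, 108, 17, 23, 19, 22, 18]

def expected_error(salaries):
    """Average distance from the group mean of one uniformly random pick."""
    mean = sum(salaries) / len(salaries)
    return sum(abs(s - mean) for s in salaries) / len(salaries)

postal_error = expected_error(postal)
telemarketing_error = expected_error(telemarketing)

print(f"postal: off by {postal_error:.1f}K on average")
print(f"telemarketing: off by {telemarketing_error:.1f}K on average")

# Each individual is picked with the same probability, manager included.
print(f"chance of picking any one person: {1 / len(telemarketing):.0%}")
```

A random postal worker misses the group average by 4.0K on average, whereas a random telemarketer misses by 15.6K, even though both groups have the same mean.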

Simply averaging the payoffs for rare, high-variance trifectas doesn’t make much sense. Over the last 2 years the trifecta 8–7–6 came in only five times at Milford, paying at $3708.60, $4568.40, $4574.70, $1975.50, and $1293.00 for a $3 bet. What about even rarer trifectas that may have come in only once or never? What should they pay off at?

To do a better job of estimating the payoff of rare trifectas, we partitioned them into groups with similar occurrence frequencies and then averaged all the payoffs within each group. This meant all of the low-probability trifectas in a given group were assigned the same expected payoff. Damping the projected payoff from the highest-return singleton payoff was essential to keep our betting system from being burned like a moth attracted to a flame. If one trifecta had a projected payoff of $2000 per dollar invested based on only one or two actual occurrences, the system would be liable to keep chasing what was probably a fluke payoff instead of a real quirk in the public’s betting strategy.
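The pooling idea can be sketched in a few lines. This is a minimal illustration rather than the system's actual code: only the 8–7–6 payoffs come from the text, every other entry is made up, and the two-bucket rule in `bucket` (five or fewer occurrences counts as "rare") is an assumed grouping chosen to keep the example small.

```python
from collections import defaultdict

# Observed payoffs per trifecta. Only the 8-7-6 figures come from the text;
# every other entry is hypothetical and exists purely for illustration.
observed = {
    (8, 7, 6): [3708.60, 4568.40, 4574.70, 1975.50, 1293.00],
    (7, 8, 6): [2000.00],      # hypothetical: came in once
    (8, 6, 7): [],             # hypothetical: never came in
    (1, 2, 3): [300.00] * 40,  # hypothetical: a common trifecta
}

def bucket(count):
    """Assumed grouping rule: trifectas seen at most 5 times are 'rare'."""
    return "rare" if count <= 5 else "common"

def pooled_estimates(observed):
    """Give every trifecta the average payoff of its whole frequency group."""
    pools = defaultdict(list)
    for payoffs in observed.values():
        pools[bucket(len(payoffs))].extend(payoffs)
    group_mean = {g: sum(p) / len(p) for g, p in pools.items() if p}
    # A group with no payoffs at all has no estimate (None).
    return {t: group_mean.get(bucket(len(p))) for t, p in observed.items()}

estimates = pooled_estimates(observed)
for trifecta, payoff in estimates.items():
    print(trifecta, f"{payoff:.2f}")
```

Under these assumptions the trifecta that never came in still receives an estimate, and all three rare trifectas share a single pooled payoff instead of each being judged on its own handful of results.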
