Using Naive Bayes - Memorization methods - Practical Data Science with R

Memorization methods

6.3.4 Using Naive Bayes

Naive Bayes is an interesting method that memorizes how each training variable is related to outcome, and then makes predictions by multiplying together the effects of each variable. To demonstrate this, let’s use a scenario in which we’re trying to predict whether somebody is employed based on their level of education, their geographic region, and other variables. Naive Bayes begins by reversing that logic and asking this question: Given that you are employed, what is the probability that you have a high school education? From that data, we can then make our prediction regarding employment.

Let’s call a specific variable (x_1) taking on a specific value (X_1) a piece of evidence: ev_1. For example, suppose we define our evidence (ev_1) as the predicate education=="High School", which is true when the variable x_1 (education) takes on the value X_1 ("High School"). Let’s call the outcome y (taking on values T or True if the person is employed and F otherwise). Then the fraction of all positive examples where ev_1 is true is an approximation to the conditional probability of ev_1, given y==T. This is usually written as P(ev1|y==T). But what we want to estimate is the conditional probability of a subject being employed, given that they have a high school education: P(y==T|ev1). How do we get from P(ev1|y==T) (the quantities we know from our training data) to an estimate of P(y==T|ev1 ... evN) (what we want to predict)?

Building models using many variables

Bayes’ law tells us we can expand P(y==T|ev1) and P(y==F|ev1) like this:

$$P(y==T \mid ev_1) = \frac{P(y==T) \times P(ev_1 \mid y==T)}{P(ev_1)}$$

$$P(y==F \mid ev_1) = \frac{P(y==F) \times P(ev_1 \mid y==F)}{P(ev_1)}$$

The left-hand side is what you want; the right-hand side is all quantities that can be estimated from the statistics of the training data. For a single feature ev1, this buys us little as we could derive P(y==T|ev1) as easily from our training data as from P(ev1|y==T). For multiple features (ev1 ... evN) this sort of expansion is useful. The Naive Bayes assumption lets us assume that all the evidence is conditionally independent of each other for a given outcome:

$$P(ev_1 \& \ldots ev_N \mid y==T) \approx P(ev_1 \mid y==T) \times P(ev_2 \mid y==T) \times \ldots \times P(ev_N \mid y==T)$$

$$P(ev_1 \& \ldots ev_N \mid y==F) \approx P(ev_1 \mid y==F) \times P(ev_2 \mid y==F) \times \ldots \times P(ev_N \mid y==F)$$

This gives us the following:

$$P(y==T \mid ev_1 \& \ldots ev_N) \approx \frac{P(y==T) \times (P(ev_1 \mid y==T) \times \ldots \times P(ev_N \mid y==T))}{P(ev_1 \& \ldots ev_N)}$$

$$P(y==F \mid ev_1 \& \ldots ev_N) \approx \frac{P(y==F) \times (P(ev_1 \mid y==F) \times \ldots \times P(ev_N \mid y==F))}{P(ev_1 \& \ldots ev_N)}$$

The numerator terms of the right sides of the final expressions can be calculated efficiently from the training data, while the left sides can’t. We don’t have a direct scheme for estimating the denominators in the Naive Bayes expression (these are called the joint probability of the evidence). However, we can still estimate P(y==T|evidence) and P(y==F|evidence), as we know by the law of total probability that we should have P(y==T|evidence) + P(y==F|evidence) = 1. So it’s enough to pick a denominator such that our estimates add up to 1.

For numerical reasons, it’s better to convert the products into sums, by taking the log of both sides. Since the denominator term is the same in both expressions, we can ignore it; we only want to determine which of the following expressions is greater:

$$\mathrm{score}(T \mid ev_1 \& \ldots ev_N) = \log(P(y==T)) + \log(P(ev_1 \mid y==T)) + \ldots + \log(P(ev_N \mid y==T))$$

$$\mathrm{score}(F \mid ev_1 \& \ldots ev_N) = \log(P(y==F)) + \log(P(ev_1 \mid y==F)) + \ldots + \log(P(ev_N \mid y==F))$$

It’s also a good idea to add a smoothing term so that you’re never taking the log of zero.
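The log-score comparison and the normalization to 1 can be tried on a toy version of the employment example. The following is a minimal sketch, not the book's implementation: the data frame d, the helper logScore, and all of the values in it are hypothetical and chosen only for illustration.

```r
# Toy training data (hypothetical values, for illustration only).
d <- data.frame(
  education = c("High School", "High School", "College", "College",
                "High School", "College", "High School", "College"),
  region    = c("East", "West", "East", "East",
                "West", "West", "East", "East"),
  employed  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Evidence for the example to be scored.
ev <- list(education = "High School", region = "East")

# Unnormalized log score for one outcome value:
# log P(y) + sum_i log P(ev_i | y), with a small smoothing term.
logScore <- function(d, ev, outcomeValue, eps = 1.0e-5) {
  inClass <- d$employed == outcomeValue
  prior <- mean(inClass)
  condProbs <- vapply(names(ev), function(v) {
    mean(d[inClass, v] == ev[[v]])
  }, numeric(1))
  log(prior + eps) + sum(log(condProbs + eps))
}

scoreT <- logScore(d, ev, TRUE)
scoreF <- logScore(d, ev, FALSE)

# Pick the denominator so that the two estimates add up to 1.
m <- max(scoreT, scoreF)
pT <- exp(scoreT - m) / (exp(scoreT - m) + exp(scoreF - m))
print(pT)
```

With these toy counts the unnormalized products are 5/8 × 1/5 × 4/5 = 0.1 for employed and 3/8 × 1 × 1/3 = 0.125 for not employed, so the estimate of P(y==T|evidence) comes out at roughly 0.44.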

All of the single-variable models we’ve built up to now are estimates of the form model(e_i) ~ P(y==T|e_i), so by another appeal to Bayes’ law we can say that the proportions we need for the Naive Bayes calculation (the ratios of P(e_i|y==T) to P(e_i|y==F)) are identical to the ratios of model(e_i)/P(y==T) to (1-model(e_i))/P(y==F), because the common factor P(e_i) cancels. So our single-variable models can be directly used to build an overall Naive Bayes model (without any need for additional record keeping). We show such an implementation in the following listing.

Listing 6.23 Building, applying, and evaluating a Naive Bayes model

pPos <- sum(dTrain[,outcome]==pos)/length(dTrain[,outcome])

# Define a function that performs the Naive Bayes prediction.
nBayes <- function(pPos,pf) {
  pNeg <- 1 - pPos
  smoothingEpsilon <- 1.0e-5
  # For each row, compute (with a smoothing term) the sum of
  # log(P[positive & evidence_i]/P[positive]) across all columns.
  # This is equivalent to the log of the product of
  # P[evidence_i | positive] up to terms that don't depend on the
  # positive/negative outcome.
  scorePos <- log(pPos + smoothingEpsilon) +
    rowSums(log(pf/pPos + smoothingEpsilon))
  # For each row, compute (with a smoothing term) the sum of
  # log(P[negative & evidence_i]/P[negative]) across all columns.
  # This is equivalent to the log of the product of
  # P[evidence_i | negative] up to terms that don't depend on the
  # positive/negative outcome.
  scoreNeg <- log(pNeg + smoothingEpsilon) +
    rowSums(log((1-pf)/(1-pPos) + smoothingEpsilon))
  # Exponentiate to turn sums back into products, but make sure we
  # don't cause a floating point overflow in doing so.
  m <- pmax(scorePos,scoreNeg)
  expScorePos <- exp(scorePos-m)
  expScoreNeg <- exp(scoreNeg-m)
  # Use the fact that the predicted positive probability plus the
  # predicted negative probability should sum to 1.0 to find and
  # eliminate the unknown normalizing denominator Z. Return the
  # correctly scaled predicted probability of being positive as our
  # forecast.
  expScorePos/(expScorePos+expScoreNeg)
}

# Apply the function to make the predictions.
pVars <- paste('pred',c(numericVars,catVars),sep='')
dTrain$nbpredl <- nBayes(pPos,dTrain[,pVars])
dCal$nbpredl <- nBayes(pPos,dCal[,pVars])
dTest$nbpredl <- nBayes(pPos,dTest[,pVars])

# Calculate the AUCs. Notice the overfit: fantastic performance on the
# training set that isn't repeated on the calibration or test sets.
print(calcAUC(dTrain$nbpredl,dTrain[,outcome]))
## [1] 0.9757348
print(calcAUC(dCal$nbpredl,dCal[,outcome]))
## [1] 0.5995206
print(calcAUC(dTest$nbpredl,dTest[,outcome]))
## [1] 0.5956515
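The listing shifts both log scores by their row-wise maximum before exponentiating. The following minimal sketch shows why that step matters when many log terms have been summed. The score values are hypothetical, picked only to be extreme enough that exp() runs out of floating-point range.

```r
# Hypothetical log scores; rows 1 and 3 are far too negative to
# exponentiate directly.
scorePos <- c(-1000, -2.0, -800)
scoreNeg <- c(-1001, -1.0, -790)

# Direct normalization: exp() underflows to 0 in both terms,
# so the ratio becomes 0/0.
naive <- exp(scorePos) / (exp(scorePos) + exp(scoreNeg))

# Stabilized normalization: subtract the larger score in each row first.
# The shift cancels in the ratio, so the result is unchanged in exact
# arithmetic, but the larger term is now exp(0) = 1.
m <- pmax(scorePos, scoreNeg)
stable <- exp(scorePos - m) / (exp(scorePos - m) + exp(scoreNeg - m))

print(naive)
print(stable)
```

In this sketch the direct version yields NaN for the first and third rows, while the shifted version returns a finite probability for every row.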

Intuitively, what we’ve done is built a new per-variable prediction column from each of our single-variable models. Each new column is the logarithm of the ratio of the single-variable model’s predicted churn rate over the overall churn rate. When the model predicts a rate near the overall churn rate, this ratio is near 1.0 and therefore the logarithm is near 0. Similarly, for high predicted churn rates, the prediction column is a positive number, and for low predicted churn rates the column prediction is negative. Summing these signed columns is akin to taking a net-consensus vote across all of the columns’ variables. If all the evidence is conditionally independent given the outcome (this is the Naive Bayes assumption, and remember it’s only an assumption), then this is exactly the right thing to do. The amazing thing about the Naive Bayes classifier is that it can perform well even when the conditional independence assumption isn’t true.
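The net-consensus vote can be made concrete with a few made-up numbers. In this minimal sketch the prediction columns predA, predB, and predC and the overall rate of 0.10 are hypothetical stand-ins for the single-variable model outputs, and smoothing is left out to keep the arithmetic visible.

```r
# Hypothetical single-variable model predictions for three examples
# (rows) and three variables (columns).
pf <- data.frame(
  predA = c(0.30, 0.10, 0.05),
  predB = c(0.20, 0.10, 0.10),
  predC = c(0.15, 0.10, 0.02)
)
pPos <- 0.10  # assumed overall positive (churn) rate

# One signed "vote" per variable: the log of the predicted rate over
# the overall rate.
votes <- log(pf / pPos)

# The net vote for each example is the row sum of the signed columns.
netVote <- rowSums(votes)

print(round(votes, 3))
print(netVote)
```

The first row, where every variable predicts a rate above the overall rate, gets a positive net vote; the second row, where every prediction equals the overall rate, nets to zero; the third row nets negative.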

There are many discussions of Bayes’ law and Naive Bayes methods that cover the math in much more detail. One thing to remember is that Naive Bayes doesn’t perform any clever optimization, so it can be outperformed by methods like logistic regression and support vector machines (when you have enough training data). Also, variable selection is very important for Naive Bayes. Naive Bayes is particularly useful when you have a very large number of features that are rare and/or nearly independent.

Smoothing

The most important design parameter in Naive Bayes is how smoothing is handled. The idea of smoothing is an attempt to obey Cromwell’s rule that no probability estimate of 0 should ever be used in probabilistic reasoning. This is because if you’re combining probabilities by multiplication (the most common method of combining probability estimates), then once some term is 0, the entire estimate will be 0 no matter what the values of the other terms are. The most common form of smoothing is called Laplace smoothing, which counts k successes out of n trials as a success ratio of (k+1)/(n+2) and not as a ratio of k/n (defending against the k=0 case). Frequentist statisticians think of smoothing as a form of regularization and Bayesian statisticians think of smoothing in terms of priors.
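A small sketch shows the effect of smoothing on the k=0 case. It assumes the add-one variant with one pseudo-success and one pseudo-failure; the function name smoothedRate and its alpha parameter are hypothetical, not taken from any library.

```r
# Additive (Laplace-style) smoothing with `alpha` pseudo-successes and
# `alpha` pseudo-failures; alpha = 1 is the add-one case assumed here.
smoothedRate <- function(k, n, alpha = 1) {
  (k + alpha) / (n + 2 * alpha)
}

k <- c(0, 3, 10)   # observed successes
n <- c(10, 10, 10) # trials

raw <- k / n
smoothed <- smoothedRate(k, n)

print(raw)
print(smoothed)

# The raw estimate of 0 has a log of -Inf, which would swamp every
# other term in a sum of logs; the smoothed estimate stays finite.
print(log(raw))
print(log(smoothed))
```

Other pseudo-count choices are possible; the point of the sketch is only that the smoothed estimates stay strictly between 0 and 1, so their logarithms remain finite.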

Document classification and Naive Bayes

Naive Bayes is the workhorse method when classifying text documents (as done by email spam detectors). This is because the standard model for text documents (usually called bag-of-words or bag-of-k-grams) can have an extreme number of possible features. In the bag-of-k-grams model, we pick a small k (typically 2) and each possible consecutive sequence of k words is a possible feature. Each document is represented as a bag, which is a sparse vector indicating which k-grams are in the document. The number of possible features runs into the millions, but each document only has a non-zero value on a number of features proportional to k times the size of the document.
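To make the bag-of-k-grams representation concrete, here is a minimal base-R sketch. The helper kgrams, its simple tokenization rule (lowercase, split on anything that isn't a letter or digit), and the two example documents are all assumptions made for illustration; real text pipelines usually do considerably more preprocessing.

```r
# Split a document into lowercase words and list its k-grams
# (consecutive sequences of k words).
kgrams <- function(text, k = 2) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nchar(words) > 0]
  if (length(words) < k) {
    return(character(0))
  }
  starts <- seq_len(length(words) - k + 1)
  vapply(starts,
         function(i) paste(words[i:(i + k - 1)], collapse = " "),
         character(1))
}

# Two hypothetical documents.
docs <- c(spam = "Win money now, win big money",
          ham  = "Meeting moved to Monday morning")

# Sparse "bag": store only the k-grams that occur, with their counts.
bags <- lapply(docs, function(d) table(kgrams(d, k = 2)))

print(bags$spam)
print(names(bags$ham))
```

Each bag stores only the handful of 2-grams its document actually contains, even though the space of possible 2-grams is enormous.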

Of course we can also call a prepackaged Naive Bayes implementation (that includes its own variable treatments), as shown in the following listing.

library('e1071')
lVars <- c(catVars,numericVars)
ff <- paste('as.factor(',outcome,'>0) ~ ',
   paste(lVars,collapse=' + '),sep='')
nbmodel <- naiveBayes(as.formula(ff),data=dTrain)
dTrain$nbpred <- predict(nbmodel,newdata=dTrain,type='raw')[,'TRUE']
dCal$nbpred <- predict(nbmodel,newdata=dCal,type='raw')[,'TRUE']
dTest$nbpred <- predict(nbmodel,newdata=dTest,type='raw')[,'TRUE']
calcAUC(dTrain$nbpred,dTrain[,outcome])
## [1] 0.4643591
calcAUC(dCal$nbpred,dCal[,outcome])
## [1] 0.5544484
calcAUC(dTest$nbpred,dTest[,outcome])
## [1] 0.5679519

The e1071 code is performing a bit below our expectations on raw data. We do see superior performance from e1071 if we call it again with our processed and selected variables. This emphasizes the advantage of combining hand-built variable processing with pre-made machine learning libraries.

6.4 Summary

The single-variable and multiple-variable memorization style models in this section are always worth trying first. This is especially true if most of your variables are categorical variables, as memorization is a good idea in this case. The techniques of this chapter are also a good repeat example of variable treatment and variable selection.

We have, at a bit of a stretch, called all of the modeling techniques of this chapter memorization methods. The reason is that, having worked an example using all of these models in one place, you now have enough experience to see the common memorization traits in these models: their predictions are all sums of summaries of the original training data.

The models of this chapter are conceptualized as follows:

• Single-variable models can be thought of as being simple memorizations or summaries of the training data. This is especially true for categorical variables where the model is essentially a contingency table or pivot table, where for every level of the variable we record the distribution of training outcomes (see section 6.2.1). Some sophisticated ideas (like smoothing, regularization, or shrinkage) may be required to avoid overfitting and to build good single-variable models. But in the end, single-variable models essentially organize the training data into a number of subsets indexed by the predictive variable and then store a summary of the distribution of outcome as their future prediction. These models are atoms or sub-assemblies that we sum in different ways to get the rest of the models of this chapter.

• Decision tree model decisions are also sums of summaries over subsets of the training data. For each scoring example, the model makes a prediction by choosing the summary of all training data that was placed in the same leaf node of the decision tree as the current example to be scored. There’s some cleverness in the construction of the decision tree itself, but once we have the tree, it’s enough to store a single data summary per tree leaf.

• K-nearest neighbor predictions are based on summaries of the k pieces of training data that are closest to the example to be scored. KNN models usually store all of their original training data instead of an efficient summary, so they truly do memorize the training data.

• Naive Bayes models partially memorize training data through intermediate features. Roughly speaking, Naive Bayes models form their decision by building a large collection of independent single-variable models.[8] The Naive Bayes prediction for a given example is just the product of all the applicable single-variable model adjustments (or, isomorphically, the sum of logarithms of the single-variable contributions). Note that Naive Bayes models are constructed without any truly clever functional forms or optimization steps. This is why we stretch terms a bit and call them memorization: their predictions are just sums of appropriate summaries of the original training data.
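The idea that a single-variable model is a stored summary per level of the variable can be sketched in a few lines. The data frame train, its columns, and the helper predictSingleVar are hypothetical, and this sketch leaves out the smoothing that a production version would likely need.

```r
# Hypothetical training data: one categorical variable and a binary outcome.
train <- data.frame(
  plan  = c("basic", "basic", "basic", "plus", "plus", "premium"),
  churn = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# "Training" is just a per-level summary of the outcome
# (one row of a contingency table per level).
levelRates <- tapply(train$churn, train$plan, mean)
baseRate <- mean(train$churn)

# Prediction looks up the stored summary; levels never seen in
# training fall back to the overall rate.
predictSingleVar <- function(values) {
  idx <- match(values, names(levelRates))
  pred <- as.numeric(levelRates)[idx]
  pred[is.na(idx)] <- baseRate
  pred
}

print(levelRates)
print(predictSingleVar(c("basic", "plus", "enterprise")))
```

Nothing is fitted here beyond counting: the model is the table itself, which is the sense in which these methods memorize.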

For all their fascinating features, at some point you’ll have needs that push you away from memorization methods. For some problems, you’ll want models that capture more of the functional or additive structure of relationships. In particular, you’ll want to try regression for value prediction and logistic regression for category prediction, as we demonstrate in chapter 7.

[8] As you saw in section 6.3.4, these are slightly modified single-variable models, since they model feature-driven change in outcome distribution, or in Bayesian terms “have the priors pulled out.”

Key takeaways

• Always try single-variable models before trying more complicated techniques.

• Single-variable modeling techniques give you a useful start on variable selection.

• Always compare your model performance to the performance of your best single-variable model.

• Consider decision trees, nearest neighbor, and Naive Bayes models as basic data memorization techniques and, if appropriate, try them early in your projects.
