Building a classifier and making predictions

Learning by Henrik Brink, Joseph W Richards, and Mark Fetherolf.

3.2.1 Building a classifier and making predictions

The first order of business is to choose the classification algorithm to use for building the classifier. Many algorithms are available, and each has pros and cons for different data and deployment requirements. The appendix provides a table of algorithms and a comparison of their properties. You’ll use this table throughout the book for select- ing algorithms to try for different problems. In this section, the choice of algorithm isn’t essential; in the next chapter, you’ll learn how to properly measure the performance of the algorithm and choose the best for the job.

The next step is to ready the data for modeling. After exploring some of the features in the dataset, you may want to preprocess the data to deal with categorical features, missing values, and so on (as discussed in chapter 2). The preprocessing requirements are also dependent on the specific algorithm, and the appendix lists these requirements for each algorithm.

For the Titanic survival model, you’ll start by choosing a simple classification algorithm: logistic regression.1_{For logistic regression, you need to do the following:}

1 Impute missing values. 2 Expand categorical features.

3 From chapter 2, you know that the Fare feature is heavily skewed. In this situation, it’s advantageous (for some ML models) to transform the variable to make the feature distribution more symmetric and to reduce the potentially harmful impact of outliers. Here, you’ll choose to transform Fare by taking the square root.

The final dataset that you’ll use for modeling is shown in figure 3.6.

Pclass 3 Age 1 3 1 3 22 38 26 35 35 SibSp 1 1 0 1 0 Parch 0 0 0 0 0 sqrt_Fare 2.692582 8.442944 2.815138 7.286975 2.837252 Gender = female 0 1 1 1 0 Gender = male 1 0 0 0 1 Embarked = C 0 1 0 0 0 Embarked = Q 0 0 0 0 0 Embarked = S 1 0 1 1 1

Figure 3.6 The first five rows of the Titanic Passengers dataset after processing categorical  features and missing values, and transforming the Fare variable by taking the square root (see the

prepare_data function in the source code repository). All features are now numerical, which is 

the preferred format for most ML algorithms.

You can now go ahead and build the model by running the data through the logistic regression algorithm. This algorithm is implemented in the scikit-learn Python package, and the model-building and prediction code look like the following listing.

1 _The_regression_{in logistic regression doesn’t mean it’s a regression algorithm. Logistic regression expands}

from sklearn.linear_model import LogisticRegression as Model def train(features, target):

model = Model()

model.fit(features, target) return model

def predict(model, new_features): preds = model.predict(new_features) return preds

# Assume Titanic data is loaded into titanic_feats, # titanic_target and titanic_test

model = train(titanic_feats, titanic_target) predictions = predict(model, titanic_test) Returns

predictions (0 or 1)

After building the model, you predict the survival of previously unseen passengers based on their features. The model expects features in the format given in figure 3.6, so any new passengers will have to be run through exactly the same processes as the training data. The output of the predict function will be 1 if the passenger is predicted to survive, and 0 otherwise.

It’s useful to visualize the classifier by plotting the decision boundary. Given two of the features in the dataset, you can plot the boundary that separates surviving passengers from the dead, according to the model. Figure 3.7 shows this for the Age and square-root Fare features.

Listing 3.1 Building a logistic regression classifier with scikit-learn

Imports the logistic regression algorithm Fits the logistic regression

algorithm using features and target data

Makes predictions on a new set of features using the model

Returns the model built by the algorithm

0 20 sqrt (fare) 10 20 30 40 50 60 70 15 5 10 0 Age Survived (actual) Died (actual) Classifier predicts: Survived Decision boundary Classifier predicts: Died

Figure 3.7 The decision boundary for the Age and sqrt(Fare) features. The diamonds show passengers who survived, whereas circles denote passengers who died. The light background denotes the combinations of Age and Fare that are predicted to yield survival. Notice that a few instances overlap the boundary. The classifier isn’t perfect, but you’re looking in only two dimensions. The algorithm on the full dataset finds this decision boundary in 10 dimensions, but that becomes harder to visualize.

Classification: predicting into buckets

Algorithm highlight: logistic regression

In these “Algorithm highlight” boxes, you’ll take a closer look at the basic ideas behind the algorithms used throughout the book. This allows curious readers to try to code up, with some extra research, basic working versions of the algorithms. Even though we focus mostly on the use of existing packages in this book, understanding the basics of a particular algorithm can sometimes be important to fully realize the predictive potential.

The first algorithm you’ll look at is the logistic regression algorithm, arguably the sim- plest ML algorithm for classification tasks. It’s helpful to think about the problem as having only two features and a dataset divided into two classes. Figure 3.7 shows an example, with the features Age and sqrt(Fare); the target is Survived or Died. To build the classifier, you want to find the line that best splits the data into the target classes. A line in two dimensions can be described by two parameters. These two numbers are the parameters of the model that you need to determine.

The algorithm then consists of the following steps:

1 You can start the search by picking the parameter values at random, hence plac-

ing a random line in the two-dimensional figure.

2 Measure how well this line separates the two classes. In logistic regression, you use the statistical deviance for the goodness-of-fit measurement.

3 Guess new values of the parameters and measure the separation power. 4 Repeat until there are no better guesses. This is an optimization procedure that

can be done with a range of optimization algorithms. Gradient descent is a pop- ular choice for a simple optimization algorithm.

This approach can be extended to more dimensions, so you’re not limited to two features in this model. If you’re interested in the details, we strongly encourage you to research further and try to implement this algorithm in your programming language of choice. Then look at an implementation in a widely used ML package. We’ve left out plenty of details, but the preceding steps remain the basis of the algorithm.

Some properties of logistic regression include the following:

 The algorithm is relatively simple to understand, compared to more-complex algorithms. It’s also computationally simple, making it scalable to large datasets.  The performance will degrade if the decision boundary that separates the classes

needs to be highly nonlinear. See section 3.2.2.

 Logistic regression algorithms can sometimes overfit the data, and you often need to use a technique called regularization that limits this danger. See section 3.2.2 for an example of overfitting.