In our earlier discussion of parametric models, we saw that they come with a procedure to train the model using a set of training data. Nonparametric models will typically either perform lazy learning, in which case there really isn't an actual training procedure at all beyond memorizing the training data, or as in the case of splines, will perform local computations on the training data.
Either way, if we are to assess the performance of our model, we need to split our data into a training set and a test set. The key idea is that we want to assess our model based on how we expect it to perform on unseen future data. We do this by using the test set, which is a portion (typically 15-30 percent) of the data we collected that we set aside for this purpose and do not use during training. For example, one possible divide is to have a training set with 80 percent of the observations in our original data, and a test set with the remaining 20 percent. We need a test set because we cannot use the training set to fairly assess our model's performance: we fit our model to the training data, so it does not represent data that we haven't seen before. From a prediction standpoint, if our goal were to maximize performance on our training data alone, then the best thing to do would be to simply memorize the input data along with the desired output values, and our model would thus be a simple look-up table!
A good question to ask is how we decide how much data to use for training and how much for testing. There is a trade-off involved here that makes the answer to this question nontrivial. On the one hand, we would like to use as much data as possible in our training set, so that the model has more examples from which to learn. On the other, we would like to have a large test set so that we can test our trained model using many examples in order to minimize the variance of our estimate of the model's predictive performance. If we only have a handful of observations in our test set, then we cannot really generalize about how our model performs on unseen data overall.
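To make the variance argument concrete, the following is a minimal sketch that assumes the test observations are independent, so that the number of correct predictions is approximately binomial. The helper name accuracy_se and the accuracy value of 0.9 are hypothetical and chosen only for illustration:

```r
# Approximate standard error of an accuracy estimate computed from n
# independent test observations (binomial approximation).
accuracy_se <- function(acc, n) sqrt(acc * (1 - acc) / n)

# Hypothetical true accuracy of 0.9, evaluated on test sets of growing size
test_sizes <- c(10, 100, 1000)
se <- accuracy_se(acc = 0.9, n = test_sizes)
round(se, 4)
# [1] 0.0949 0.0300 0.0095
```

Under these assumptions, a test set of 10 observations leaves an uncertainty of roughly nine to ten percentage points around the estimated accuracy, whereas 1,000 observations bring this down to about one percentage point.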
Another factor that comes into play is how much data we have collected to begin with. If we have very little data, we may have to devote a larger share of it to training our model, such as an 85-15 split. If we have enough data, then we might consider a 70-30 split so that we can get a more reliable estimate of performance from our test set.
To split a data set using the caret package, we can use the createDataPartition() function to create a sampling vector containing the indices of the rows we will use in our training set. These are selected by randomly sampling the rows until a specified proportion of the rows have been sampled, using the p parameter:
> set.seed(2412)
> iris_sampling_vector <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
It is good practice, when reporting the results of a statistical analysis that involves random number generation, to call the set.seed() function with an arbitrarily chosen but fixed number. This ensures that the random numbers generated by the next function call involving random number generation will be the same every time the code is run, so that others who read the analysis are able to reproduce the results exactly. Note that if we have several functions in our code that perform random number generation, or the same function is called multiple times, we should ideally apply set.seed() before each one of them.
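The effect of set.seed() can be checked directly with base R. This is only an illustrative sketch; the variable names are hypothetical and sample() stands in for any function that draws random numbers:

```r
# Resetting the seed before each call reproduces the same draw
set.seed(2412)
first_draw <- sample(1:150, 5)

set.seed(2412)
second_draw <- sample(1:150, 5)

identical(first_draw, second_draw)
# [1] TRUE

# Without resetting the seed, the generator simply continues its stream,
# so this draw will in general differ from the two above
third_draw <- sample(1:150, 5)
```

This is why the advice above suggests calling set.seed() before each function that relies on random numbers: the result of a given call then does not depend on how many random draws happened earlier in the script.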
Using our sampling vector, which we created for the iris data set, we can construct our training and test sets. We'll do this for a few versions of the iris data set that we built earlier on when we experimented with different feature transformations.
> iris_train <- iris_numeric[iris_sampling_vector,]
> iris_train_z <- iris_numeric_zscore[iris_sampling_vector,]
> iris_train_pca <- iris_numeric_pca[iris_sampling_vector,]
> iris_train_labels <- iris$Species[iris_sampling_vector]
>
> iris_test <- iris_numeric[-iris_sampling_vector,]
> iris_test_z <- iris_numeric_zscore[-iris_sampling_vector,]
> iris_test_pca <- iris_numeric_pca[-iris_sampling_vector,]
> iris_test_labels <- iris$Species[-iris_sampling_vector]
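After a split like this, it is worth checking how many observations of each class ended up on each side. When createDataPartition() is given a factor, it samples within each class. The sketch below imitates that idea in base R so that it runs without any additional packages. The helper stratified_indices() is hypothetical and is not caret's implementation, so the exact rows selected will differ from those in iris_sampling_vector:

```r
set.seed(2412)

# Hypothetical helper: sample a proportion p of the row indices within each
# class, so that class proportions are preserved in the training set.
stratified_indices <- function(labels, p) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_indices(iris$Species, p = 0.8)

# Class counts on each side of the split
train_counts <- table(iris$Species[train_idx])
test_counts <- table(iris$Species[-train_idx])
train_counts
test_counts
```

Because iris has 50 observations per species, an 80-20 split of this kind places 40 observations of each species in the training set and 10 in the test set.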
We are now in a position to build and test three different models for the iris data set. These are, in turn, the unnormalized model, a model where the input features have been centered and scaled with a Z-score transformation, and the PCA model with two principal components. We could use our test set to measure the predictive performance of each of these models after we build them; however, this would mean that our final estimate of accuracy on unseen data would come from a test set that had already been used for model selection, and would therefore be biased. For this reason, we often maintain a separate split of the data, usually as large as the test set, known as the validation set. This is used to tune model parameters, such as k in kNN, and to choose among different encodings and transformations of the input features, before using the test set to estimate performance on unseen data. In Chapter 5, Support Vector Machines, we'll discuss an alternative to this approach known as cross-validation.
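One simple way to obtain a validation set is to shuffle the row indices once and cut them into three parts. The following base R sketch assumes a 60-20-20 division, which is an illustrative choice rather than a recommendation from the text, and all variable names are hypothetical. Unlike the stratified split sketched for createDataPartition(), this plain shuffle does not balance the classes:

```r
set.seed(2412)

n <- nrow(iris)
shuffled <- sample(n)            # random permutation of the row indices

n_train <- round(0.6 * n)        # 90 rows for fitting the models
n_valid <- round(0.2 * n)        # 30 rows for tuning and model selection

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]   # remaining 30 rows

c(train = length(train_idx), valid = length(valid_idx), test = length(test_idx))
```

Candidate models, such as kNN with different values of k, would be compared on the rows in valid_idx, and only the chosen model would then be evaluated once on the rows in test_idx.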
Once we split our data, train our model by following the relevant training procedure that it requires, and tune our model parameters, we then have to assess its performance on the test set. Typically, we won't find the same performance on our test set as on our training set. Sometimes, we may even find that the performance we see when we deploy our model does not match what we expected to see, based on the performance on our training or test sets. There are a number of possible reasons for this disparity in performance. The first of these is that the data we have collected may not be representative of the process that we are modeling, or that there are certain combinations of feature inputs that we simply did not encounter in our training data. This could produce results that are inconsistent with our expectations. This situation can arise in the real world, but also with our test set if it contains outliers, for example. Another common situation is the problem of model overfitting.
Overfitting is a problem in which some models, especially more flexible models, perform well on their training data set but perform significantly worse on an unseen test set. This occurs when a model matches the observations in the training data too closely and fails to generalize on unseen data. Put differently, the model is picking up on spurious details and variations in a training data set, which are not representative of the underlying population as a whole. Overfitting is one of the key reasons why we do not choose our model based on its performance on the training data. Other sources of discrepancy between training and test data performance are model bias and variance. Together, these actually form a well-known trade-off in statistical modeling known as the bias-variance trade-off.
The variance of a statistical model refers to how much the model's predicted function would change, should a differently chosen training set (but generated from the exact same process or system that we are trying to predict as the original) be used to train the model. A low variance is desired because essentially, we don't want to predict a very different function with a different training set that is generated from the same process. Model bias refers to the errors inherently introduced in the predicted function, as a result of the limitation as to what functional forms the specific model can learn. For example, linear models introduce bias when trying to approximate nonlinear functions because they can only learn linear functions. The ideal scenario for a good predictive model is to have both a low variance and a low bias. It is important for a predictive modeler to be aware that there is a bias-variance trade-off that arises from the choice of models. Models that are more complex, in the sense that they make fewer assumptions about the target function, are typically prone to less bias but higher variance than simpler but more restrictive models, such as linear models. This is because more complex models are able to approximate the training data more closely due to their flexibility, but as a result, they are more sensitive to changes in training data. This, of course, is also related to the problem of overfitting that complex models often exhibit.
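The trade-off can be illustrated with a small simulation. The sketch below is built entirely on assumptions made for illustration: the target function is a sine wave, the noise is Gaussian, and the restrictive and flexible models are polynomials of degree 1 and 9 fitted with lm(). Each model is fitted to many training sets drawn from the same process, and we then measure how far the average fit is from the true function (squared bias) and how much the individual fits vary around that average (variance). All function and variable names are hypothetical:

```r
set.seed(2412)

true_fn <- function(x) sin(2 * pi * x)          # assumed target function
x_grid <- seq(0.05, 0.95, length.out = 50)      # fixed evaluation points
n_sims <- 200                                   # number of training sets

# Fit a polynomial of the given degree to a fresh noisy training set and
# return its predictions on the fixed evaluation grid.
fit_once <- function(degree) {
  train <- data.frame(x = runif(30))
  train$y <- true_fn(train$x) + rnorm(30, sd = 0.3)
  model <- lm(y ~ poly(x, degree), data = train)
  predict(model, newdata = data.frame(x = x_grid))
}

# Squared bias and variance of the fitted function, averaged over the grid
summarise_fits <- function(degree) {
  preds <- replicate(n_sims, fit_once(degree))  # rows: grid points, columns: fits
  c(bias_sq = mean((rowMeans(preds) - true_fn(x_grid))^2),
    variance = mean(apply(preds, 1, var)))
}

linear <- summarise_fits(1)
flexible <- summarise_fits(9)
rbind(linear, flexible)
```

In a run of this kind, we would expect the linear model to show the larger squared bias, since a straight line cannot follow a sine wave, and the degree 9 polynomial to show the larger variance, since it reacts more strongly to the particular training set it was given.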
We can actually see the effects of overfitting by first training some kNN models on our iris data sets. There are a number of packages that offer an implementation of the kNN algorithm, but we will use the knn3() function provided by the caret package with which we are familiar. To train a model using this function, all we have to do is provide it with a data frame that contains the numerical input features, a vector of output labels, and k, the number of nearest neighbors we want to use for the prediction:
> knn_model <- knn3(iris_train, iris_train_labels, k = 5)
> knn_model_z <- knn3(iris_train_z, iris_train_labels, k = 5)
> knn_model_pca <- knn3(iris_train_pca, iris_train_labels, k = 5)
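To see how the choice of k interacts with training and test performance, the following self-contained sketch implements a bare-bones kNN classifier in base R and compares accuracy on both sets for a few values of k. It is not caret's knn3(): the helpers knn_predict() and accuracy() are hypothetical, the split is a plain random one, and the raw iris measurements are used rather than the transformed versions built earlier:

```r
set.seed(2412)

features <- as.matrix(iris[, 1:4])
labels <- iris$Species
train_idx <- sample(nrow(iris), size = 120)   # simple random 80-20 split

# Hypothetical helper: classify each row of `queries` by majority vote among
# its k nearest training rows (Euclidean distance). Tied votes go to the
# first factor level, which is a simplification.
knn_predict <- function(train_x, train_y, queries, k) {
  apply(queries, 1, function(q) {
    dists <- sqrt(colSums((t(train_x) - q)^2))
    nearest <- order(dists)[1:k]
    names(which.max(table(train_y[nearest])))
  })
}

# Proportion of correct predictions on the rows selected by idx
accuracy <- function(k, idx) {
  preds <- knn_predict(features[train_idx, ], labels[train_idx],
                       features[idx, , drop = FALSE], k)
  mean(preds == as.character(labels[idx]))
}

results <- data.frame(k = c(1, 5, 15))
results$train_accuracy <- sapply(results$k, accuracy, idx = train_idx)
results$test_accuracy <- sapply(results$k, accuracy, idx = -train_idx)
print(results)
```

With k = 1, every training observation is its own nearest neighbor, so the training accuracy is perfect. This is the look-up table behavior described earlier, and it tells us nothing about how the model will fare on the test set.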
To see the effect of different values of k, we will use the iris PCA model that is
In the preceding plots, we have used different symbols to denote data points corresponding to different species. The lines shown in the plots correspond to the decision boundaries between the different species, which are the class labels of our output variable. Notice that using a low value of k, such as 1, captures local variation in the data very closely and as a result, the decision boundaries are very irregular. A higher value of k uses many neighbors to create a prediction, resulting in a smoothing effect and smoother decision boundaries. Tuning k in kNN is an example of tuning a model parameter to balance the effect of overfitting.
We haven't mentioned any specific performance metrics in this section. There are different measures of model quality relevant to regression and classification, and we will address these after we wrap up our discussion on the predictive