Feature engineering and dimensionality reduction

The number and type of features that we use with a model is one of the most important decisions that we will make in the predictive modeling process. Having the right features for a model will ensure that we have sufficient evidence on which to base a prediction. On the flip side, the number of features that we work with is precisely the number of dimensions that the model has. A large number of dimensions can be the source of several complications. High-dimensional problems often suffer from data sparsity, which means that because of the number of dimensions available, the range of possible combinations of values across all the features grows so large that it is unlikely that we will ever collect enough data in order to have enough representative examples for training. In a similar vein, we often talk about the curse of dimensionality. This describes the fact that because of the overwhelmingly large space of possible inputs, data points that we have collected are likely to be far away from each other in the feature space. As a result, local methods, such as k-nearest neighbors that make predictions using observations in the training data that are close to the point for which we are trying to make a prediction, will not work as well in high dimensions. A large feature set

Consequently, there are two types of processes that feature engineering involves. The first of these, which grows the feature space, is the design of new features based on features within our data. Sometimes, a new feature that is a product or ratio of two original features might work better. There are many ways to combine existing features into new ones, and often it is expert knowledge from the problem's particular application domain that might help guide us. In general though, this process takes experience and a lot of trial and error. Note that there is no guarantee that adding a new feature will not degrade performance. Sometimes, adding a feature that is very noisy or highly correlated with an existing feature may actually cause us to lose accuracy.
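As a small illustration of designing new features from existing ones, the following sketch builds a product and a ratio from the built-in iris measurements. The names iris_fe, Petal.Area, and Sepal.Ratio are hypothetical choices made for this example, and nothing here shows that these particular features help any model.

```r
# Start from the four numeric iris features (built-in data set).
iris_fe <- iris[1:4]

# Product of two original features: a rough proxy for petal area.
iris_fe$Petal.Area <- iris_fe$Petal.Length * iris_fe$Petal.Width

# Ratio of two original features: describes sepal shape rather than size.
iris_fe$Sepal.Ratio <- iris_fe$Sepal.Length / iris_fe$Sepal.Width

# A new feature that is highly correlated with an existing one adds little
# new information, so we inspect its correlation with the original features.
cor_with_originals <- cor(iris_fe$Petal.Area, iris[1:4])
print(round(cor_with_originals, 2))
```

Checking correlations like this is only a first screen; whether a derived feature is worth keeping is ultimately decided by its effect on model performance.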

The second process in feature engineering is feature reduction or shrinkage, which reduces the size of the feature space. In the previous section on data preprocessing, we looked at how we can detect individual features that may be problematic for our model in some way. Feature selection refers to the process in which the subset of features that are the most informative for our target output are selected from the original pool of features. Some methods, such as tree-based models, have built-in feature selection, as we shall see in Chapter 6, Tree-based Methods. In Chapter 2, Linear Regression, we'll also explore methods to perform feature selection for linear models. Another way to reduce the overall number of features, a concept known as dimensionality reduction, is to transform the entire set of features into a completely new set of features that are fewer in number. A classic example of this is Principal Component Analysis (PCA).

In a nutshell, PCA creates a new set of input features, known as principal components, all of which are linear combinations of the original input features. For the first principal component, the linear combination weights are chosen in order to capture the maximum amount of variation in the data. If we could visualize the first principal component as a line in the original feature space, this would be the line along which the data varies the most. It also happens to be the line that is closest to all the data points in the original feature space, in the sense that it minimizes the sum of squared perpendicular distances to them. Every subsequent principal component attempts to capture a line of maximum variation, but in a way that the new principal component is uncorrelated with the previous ones already computed. Thus, the second principal component selects the linear combination of original input features that has the highest degree of variation in the data, while being uncorrelated with the first principal component.
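The two properties just described (maximum variance for the first component, and no correlation between components) can be checked numerically. The sketch below is one way to do so with base R only. It standardizes the iris features without any Box-Cox step and obtains the weights from an eigen-decomposition of the covariance matrix, so it is an illustration of the idea rather than the procedure used elsewhere in this section. All variable names are assumptions made for the example.

```r
# Standardize the four numeric iris features (no Box-Cox step here).
x <- scale(iris[1:4])

# Eigen-decomposition of the covariance matrix of the standardized data.
eig <- eigen(cov(x))
w1 <- eig$vectors[, 1]          # weights of the first principal component
pc_scores <- x %*% eig$vectors  # scores for all four principal components

# The variance of the first component equals the largest eigenvalue.
var_pc1 <- var(pc_scores[, 1])

# Compare against 1000 random unit-length directions: none of them
# should capture more variance than the first principal component.
set.seed(1)
random_dirs <- matrix(rnorm(4 * 1000), nrow = 4)
random_dirs <- sweep(random_dirs, 2, sqrt(colSums(random_dirs^2)), "/")
max_random_var <- max(apply(x %*% random_dirs, 2, var))

# Correlations between different components should be numerically zero.
pc_cor <- cor(pc_scores)
print(round(pc_cor, 3))
```

Here var_pc1 should be at least as large as max_random_var, and the off-diagonal entries of pc_cor should vanish up to floating-point error.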

The principal components are naturally ordered in descending order according to the amount of variation that they capture. This allows us to perform dimensionality reduction in a simple manner by keeping the first N components, where we choose N so that the components chosen together capture at least a chosen minimum fraction of the variance in the original data set. We won't go into the details of the underlying linear algebra necessary to compute the principal components.
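To make the choice of N concrete, here is a minimal sketch using prcomp() from base R, assuming a threshold of 95 percent. The names threshold, cum_var, n_components, and reduced are introduced only for this example, and the data is standardized without a Box-Cox step.

```r
# PCA on the centered and scaled numeric iris features.
pca <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)

# Fraction of the total variance captured by each component, in order.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest N whose components together reach the threshold.
threshold <- 0.95
n_components <- which(cum_var >= threshold)[1]

# Keep only the first N components as the reduced feature set.
reduced <- pca$x[, seq_len(n_components), drop = FALSE]
print(round(cum_var, 4))
print(n_components)
```

Using drop = FALSE keeps reduced as a matrix even when a single component is enough.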

Instead, we'll direct our attention to the fact that this process is sensitive to the variance and scale of the original features. For this reason, we often scale our features before carrying out this process. To visualize how useful PCA can be, we'll once again turn to our faithful iris data set. We can use the caret package to carry out PCA. To do this, we specify pca in the method parameter of the preProcess() function. We can also use the thresh parameter, which specifies the minimum variance we must retain. We'll explicitly use the value 0.95 so that we retain 95 percent of the variance of the original data, but note that this is also the default value of this parameter:

> pp_pca <- preProcess(iris_numeric, method = c("BoxCox", "center", "scale", "pca"), thresh = 0.95)
> iris_numeric_pca <- predict(pp_pca, iris_numeric)
> head(iris_numeric_pca, n = 3)
     PC1     PC2
1 -2.304 -0.4748
2 -2.151  0.6483
3 -2.461  0.3464

As a result of this transformation, we are now left with only two features, so we can conclude that the first two principal components of the numerical iris features incorporate over 95 percent of the variation in the data.
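The sensitivity of PCA to the scale of the features, mentioned earlier, can be seen in a short experiment. The sketch below uses prcomp() from base R rather than caret, and the rescaled data frame iris_mm is a hypothetical variant of the data in which one feature is expressed in millimeters instead of centimeters.

```r
# Helper: fraction of total variance captured by each component.
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)

# PCA on the original units, without and with scaling to unit variance.
pca_raw    <- prcomp(iris[1:4], center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)

# Change the units of a single feature from centimeters to millimeters.
iris_mm <- iris[1:4]
iris_mm$Sepal.Width <- iris_mm$Sepal.Width * 10

pca_mm        <- prcomp(iris_mm, center = TRUE, scale. = FALSE)
pca_mm_scaled <- prcomp(iris_mm, center = TRUE, scale. = TRUE)

# Without scaling, the rescaled feature takes over the first component.
print(round(pca_raw$rotation[, 1], 3))
print(round(pca_mm$rotation[, 1], 3))

# With scaling, the change of units has no effect on the variances.
print(round(prop_var(pca_scaled), 4))
print(round(prop_var(pca_mm_scaled), 4))
```

A mere change of units should not alter our conclusions about the data, which is why centering and scaling usually come before PCA.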

If we are interested in learning the weights that were used to compute the principal components, we can inspect the rotation attribute of the pp_pca object:

> options(digits = 2)
> pp_pca$rotation
               PC1    PC2
Sepal.Length  0.52 -0.386
Sepal.Width  -0.27 -0.920
Petal.Length  0.58 -0.049
Petal.Width   0.57 -0.037

This means that the first principal component, PC1, was computed as follows:

$$\mathrm{PC1} = 0.52 \cdot \text{Sepal.Length} - 0.27 \cdot \text{Sepal.Width} + 0.58 \cdot \text{Petal.Length} + 0.57 \cdot \text{Petal.Width}$$
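The linear combination for PC1 can be reproduced by hand. The following sketch does this with prcomp() from base R on standardized features, where the weights are applied to the centered and scaled values rather than to the raw measurements. Because the Box-Cox step is omitted here, the weights will not match the rotation values printed earlier exactly; the variable names are assumptions made for the example.

```r
# Standardize the features; these are the inputs to the linear combination.
x <- scale(iris[1:4])
pca <- prcomp(x)

# Weights (loadings) of the first principal component.
w <- pca$rotation[, "PC1"]

# PC1 written out term by term, mirroring the formula in the text.
pc1_manual <- w["Sepal.Length"] * x[, "Sepal.Length"] +
  w["Sepal.Width"]  * x[, "Sepal.Width"] +
  w["Petal.Length"] * x[, "Petal.Length"] +
  w["Petal.Width"]  * x[, "Petal.Width"]

# The same computation as a single matrix-vector product.
pc1_matrix <- as.vector(x %*% w)

# Both versions should agree with the scores that prcomp() returns.
print(all.equal(unname(pc1_manual), unname(pca$x[, "PC1"])))
print(all.equal(pc1_matrix, unname(pca$x[, "PC1"])))
```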

Sometimes, instead of directly specifying a threshold for the total variance captured by the principal components, we might want to examine a plot of each principal component and its variance. This is known as a scree plot, and we can build this by first performing PCA and indicating that we want to keep all the components. To do this, instead of specifying a variance threshold, we set the pcaComp parameter, which is the number of principal components we want to keep. We will set this to 4, which includes all of them, remembering that the total number of principal components is the same as the total number of original features or dimensions we started out with. We will then compute the variance and cumulative variance of these components and store it in a data frame. Finally, we will plot this in the figure that follows, noting that the numbers in brackets are cumulative percentages of variance captured:

> pp_pca_full <- preProcess(iris_numeric, method = c("BoxCox", "center", "scale", "pca"), pcaComp = 4)
> iris_pca_full <- predict(pp_pca_full, iris_numeric)
> pp_pca_var <- apply(iris_pca_full, 2, var)
> iris_pca_var <- data.frame(Variance = round(100 * pp_pca_var / sum(pp_pca_var), 2), CumulativeVariance = round(100 * cumsum(pp_pca_var) / sum(pp_pca_var), 2))
> iris_pca_var
    Variance CumulativeVariance
PC1    73.45              73.45
PC2    22.82              96.27
PC3     3.20              99.47
PC4     0.53             100.00
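The plotting step itself is not listed above, so the following is only one possible sketch of a scree plot, drawn with base R graphics. It uses prcomp() on standardized features without the Box-Cox step, so the percentages it shows will differ slightly from those in the table. The label format, with the cumulative percentage in brackets, is an assumption based on the description in the text.

```r
# PCA on the centered and scaled numeric iris features, all components kept.
pca <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)

# Percentage of variance per component and the running total.
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)
names(pct) <- colnames(pca$x)
cum_pct <- cumsum(pct)

# Bar chart of the variance percentages; barplot() returns bar midpoints.
bar_mid <- barplot(pct, ylim = c(0, 100),
                   xlab = "Principal component",
                   ylab = "Percentage of variance",
                   main = "Scree plot for the iris data")

# Label each bar as "variance (cumulative variance)".
text(bar_mid, pct,
     labels = sprintf("%.2f (%.2f)", pct, cum_pct),
     pos = 3)
```

A sharp drop between consecutive bars, often called the elbow, is a common visual cue for where to stop adding components.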

As we can see, the first principal component accounts for 73.45 percent of the total variance in the iris data set, while together with the second component, the total variance captured is 96.27 percent.

PCA is an unsupervised method for dimensionality reduction that does not make use of the output variable even when it is available. Instead, it looks at the data geometrically in the feature space. This means that we cannot ensure that PCA will give us a new feature space that will perform well in our prediction problem, beyond the computational advantages of having fewer features. These advantages might make PCA a viable choice even when there is a reduction in model accuracy, as long as this reduction is small and acceptable for the specific task. As a final note, we should point out that the weights of the principal components, often referred to as loadings, are unique up to a sign flip as long as they have been normalized. In cases where we have perfectly correlated features or perfect linear combinations, we will obtain a few principal components that are exactly zero.
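Both closing remarks can be checked with a short experiment. In the sketch below, the column Petal.Sum is a hypothetical feature added only to create a perfect linear combination of two existing features, and prcomp() from base R stands in for the caret workflow used above.

```r
# Add a feature that is an exact linear combination of two others.
x <- iris[1:4]
x$Petal.Sum <- x$Petal.Length + x$Petal.Width

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Five features but only four independent directions: the last
# component has zero variance up to floating-point error.
last_var <- pca$sdev[5]^2
print(signif(pca$sdev^2, 3))

# Flipping the sign of the loadings negates the scores but leaves
# the variance captured by the component unchanged.
z <- scale(x)
w1 <- pca$rotation[, 1]
var_pos <- var(as.vector(z %*% w1))
var_neg <- var(as.vector(z %*% (-w1)))
print(c(var_pos, var_neg))
```

Because of this sign ambiguity, loadings from two different implementations can differ by a factor of -1 and still describe the same component.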

In document Mastering Predictive Analytics with R - Sample Chapter (Page 39-43)