Preprocessing a data set can also involve the decision to drop some of the features if we know that they will cause us problems with our model. A common example is when two or more features are highly correlated with each other. In R, we can easily compute pairwise correlations on a data frame using the cor() function:
> cor(iris_numeric)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
Here, we can see that the Petal.Length feature is very highly correlated with the Petal.Width feature, with the correlation exceeding 0.96. The caret package offers the findCorrelation() function, which takes a correlation matrix as input along with an optional cutoff parameter that specifies a threshold for the absolute value of a pairwise correlation. It returns a (possibly zero-length) vector of the indices of the columns that should be removed from our data frame because of correlation. The default setting of cutoff is 0.9:
> iris_cor <- cor(iris_numeric)
> findCorrelation(iris_cor)
[1] 3
> findCorrelation(iris_cor, cutoff = 0.99)
integer(0)
> findCorrelation(iris_cor, cutoff = 0.80)
[1] 3 4
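The indices returned by findCorrelation() still have to be applied to the data frame. The following is a minimal sketch of one way to do this, not something prescribed by caret. It assumes that iris_numeric holds the four numeric columns of the built-in iris data, and drop_correlated() is a hypothetical helper name introduced only for illustration. The guard matters because negative indexing with a zero-length vector selects no columns at all, rather than all of them.

```r
library(caret)

# Assumption: iris_numeric is the four numeric columns of the built-in iris data.
iris_numeric <- iris[1:4]

# Hypothetical helper: drop the columns that findCorrelation() flags.
drop_correlated <- function(df, cutoff = 0.9) {
  to_drop <- findCorrelation(cor(df), cutoff = cutoff)
  # With nothing to drop, df[, -integer(0)] would return zero columns,
  # so return the data frame unchanged in that case.
  if (length(to_drop) == 0) {
    return(df)
  }
  df[, -to_drop, drop = FALSE]
}

reduced_default <- drop_correlated(iris_numeric)
reduced_strict <- drop_correlated(iris_numeric, cutoff = 0.99)

names(reduced_default)  # column 3 (Petal.Length) is removed at the default cutoff
names(reduced_strict)   # nothing is removed at a cutoff of 0.99
```

Wrapping the call in a helper keeps the cutoff in one place, which makes it easier to apply the same choice to any later data split.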
An alternative approach to removing correlation is a complete transformation of the entire feature space as is done in many methods for dimensionality reduction, such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). We'll see the former shortly, and the latter we'll visit in Chapter 11,
Recommendation Systems.
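As a brief preview of the transformation idea, the sketch below uses the base R prcomp() function to compute principal components and then checks the correlations between the transformed features. It assumes that iris_numeric holds the four numeric columns of the built-in iris data, and the choice to center and scale the features is an assumption of this example rather than a requirement.

```r
# Assumption: iris_numeric is the four numeric columns of the built-in iris data.
iris_numeric <- iris[1:4]

# prcomp() comes with base R (stats package); center and scale each feature first.
iris_pca <- prcomp(iris_numeric, center = TRUE, scale. = TRUE)

# Correlation matrix of the transformed features (the principal component scores).
pc_cor <- cor(iris_pca$x)
round(pc_cor, 6)

# Largest absolute correlation between two different components.
max_off_diagonal <- max(abs(pc_cor[upper.tri(pc_cor)]))
max_off_diagonal
```

The off-diagonal correlations should be zero up to floating-point error, which is the sense in which the transformed feature space no longer contains correlated features.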
In a similar vein, we might want to remove features that are linear combinations of each other. By linear combination of features, we mean a sum of features where each feature is multiplied by a scalar constant. To see how caret deals with these, we will
create a new iris data frame with two additional columns, which we will call Cmb and Cmb.N, as follows:
> new_iris <- iris_numeric
> new_iris$Cmb <- 6.7 * new_iris$Sepal.Length - 0.9 * new_iris$Petal.Width
> set.seed(68)
> new_iris$Cmb.N <- new_iris$Cmb + rnorm(nrow(new_iris), sd = 0.1)
> options(digits = 4)
> head(new_iris, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Cmb Cmb.N
1          5.1         3.5          1.4         0.2 33.99 34.13
2          4.9         3.0          1.4         0.2 32.65 32.63
3          4.7         3.2          1.3         0.2 31.31 31.27
As we can see, Cmb is a perfect linear combination of the Sepal.Length and Petal.Width features. Cmb.N is the same as Cmb but with added Gaussian noise that has a mean of zero and a very small standard deviation (0.1), so that its values are very close to those of Cmb. Using the findLinearCombos() function, the caret package can detect exact linear combinations of features, though not if the features are noisy:
> findLinearCombos(new_iris)
$linearCombos
$linearCombos[[1]]
[1] 5 1 4

$remove
[1] 5
As we can see, the function only suggests that we should remove the fifth feature (Cmb) from our data frame, because it is an exact linear combination of the first and fourth features. Exact linear combinations are rare, but can sometimes arise when we have a very large number of features and redundancy occurs between them. Both correlated features and linear combinations are an issue with linear regression models, as we shall soon see in Chapter 2, Linear Regression. In that chapter, we'll also see a method of detecting features that are very nearly linear combinations of each other.
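To get some intuition for why an exact combination is detectable while a noisy one is not, we can compare the numerical rank of the feature matrix in both cases. The sketch below uses only base R and is not a description of how caret is implemented. It assumes that iris_numeric holds the four numeric columns of the built-in iris data. The R-squared check at the end is just one rough way to spot a near linear combination, shown here as an illustration.

```r
# Assumption: iris_numeric is the four numeric columns of the built-in iris data.
iris_numeric <- iris[1:4]

new_iris <- iris_numeric
new_iris$Cmb <- 6.7 * new_iris$Sepal.Length - 0.9 * new_iris$Petal.Width
set.seed(68)
new_iris$Cmb.N <- new_iris$Cmb + rnorm(nrow(new_iris), sd = 0.1)

# Numerical rank from a QR decomposition. A rank lower than the number of
# columns means at least one column is an exact linear combination of others.
rank_exact <- qr(as.matrix(new_iris[, 1:5]))$rank        # original features + Cmb
rank_noisy <- qr(as.matrix(new_iris[, c(1:4, 6)]))$rank  # original features + Cmb.N

c(columns = 5, rank_exact = rank_exact, rank_noisy = rank_noisy)

# Regressing the noisy feature on the original features gives an R-squared
# very close to (but below) 1, hinting at a near linear combination.
noisy_fit <- lm(Cmb.N ~ ., data = new_iris[, c(1:4, 6)])
noisy_r2 <- summary(noisy_fit)$r.squared
noisy_r2
```

With the exact combination the matrix loses a rank, whereas the small amount of noise in Cmb.N is enough to restore full rank, even though the feature is almost entirely explained by the others.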
A final issue that we'll look at is that of features that do not vary at all in our data set, or that have near zero variance. For some models, having these types of features does not cause us problems. For others, it may create problems and we'll demonstrate why this is the case. As in the previous example, we'll create a new iris data frame, as follows:
> newer_iris <- iris_numeric
> newer_iris$ZV <- 6.5
> newer_iris$Yellow <- ifelse(rownames(newer_iris) == 1, T, F)
> head(newer_iris, n = 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width  ZV Yellow
1          5.1         3.5          1.4         0.2 6.5   TRUE
2          4.9         3.0          1.4         0.2 6.5  FALSE
3          4.7         3.2          1.3         0.2 6.5  FALSE
The ZV column has the constant value 6.5 for all observations. The Yellow column is a fictional column that records whether an observation had some yellow color on the petal. All the observations except the first have this feature set to FALSE, so this is a near zero variance column. The caret package uses a definition of near zero variance that checks whether the number of unique values that a feature takes is very small compared to the overall number of observations, and whether the ratio of the frequency of the most common value to that of the second most common value (referred to as the frequency ratio) is very high. The nearZeroVar() function applied to a data frame returns a vector containing the indices of the features which have zero or near zero variance. By setting the saveMetrics parameter to TRUE, we can see more information about the features in our data frame:
> nearZeroVar(newer_iris)
[1] 5 6
> nearZeroVar(newer_iris, saveMetrics = T)
             freqRatio percentUnique zeroVar   nzv
Sepal.Length     1.111       23.3333   FALSE FALSE
Sepal.Width      1.857       15.3333   FALSE FALSE
Petal.Length     1.000       28.6667   FALSE FALSE
Petal.Width      2.231       14.6667   FALSE FALSE
ZV               0.000        0.6667    TRUE  TRUE
Yellow         149.000        1.3333   FALSE  TRUE
Here, we can see that the ZV column has been identified as a zero variance column (which is also by definition a near zero variance column). The Yellow column does have a nonzero variance, but its high frequency ratio and low unique value percentage make it a near zero variance column. In practice, we tend to remove zero variance columns, as they don't have any information to give to our model. Removing near zero variance columns, however, is tricky and should be done with care. To understand this, consider the fact that a model for species prediction, using our newer iris data set, might learn that if a sample has yellow in its petals, then regardless of all other predictors, we would predict the setosa species, as this is the species that corresponds to the only observation in our entire data set that had the color yellow in its petals. This might indeed be true in reality, in which case the Yellow feature is informative and we should keep it. On the other hand, the presence of the color yellow on iris petals may be completely random and non-indicative of species, but also an extremely rare event. This would explain why only one observation in our data set had the yellow color in its petals. In this case, keeping the feature is dangerous because the model could draw the aforementioned conclusion from a single coincidental observation. Another potential problem with keeping this feature will become apparent when we look at splitting our data into training and test sets, as well as other cases of data splitting, such as cross-validation, described in Chapter 5, Support Vector Machines. Here, the issue is that one split of our data may leave a near zero variance column with only a single unique value, for example, only FALSE values for our Yellow iris column.
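The two metrics behind this decision, and the data splitting problem, can both be reproduced with a few lines of base R. The following is a rough sketch rather than caret's own implementation. It assumes that iris_numeric corresponds to the four numeric columns of the built-in iris data. The helper nzv_metrics() is a hypothetical name, and the seed and the 120/30 split are arbitrary choices made for illustration.

```r
# Assumption: the four numeric columns of the built-in iris data.
newer_iris <- iris[1:4]
newer_iris$ZV <- 6.5
newer_iris$Yellow <- rownames(newer_iris) == "1"

# Hypothetical helper: recompute the frequency ratio and the percentage
# of unique values by hand for a single column.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # A constant column has no second most common value; report 0 in that case.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  c(freqRatio = freq_ratio,
    percentUnique = 100 * length(counts) / length(x))
}

metrics <- t(sapply(newer_iris, nzv_metrics))
round(metrics, 4)

# Splitting problem: only one row has Yellow == TRUE, so any train/test split
# leaves at least one of the two parts with a constant Yellow column.
set.seed(1)  # arbitrary seed
train_idx <- sample(nrow(newer_iris), size = 120)
train <- newer_iris[train_idx, ]
test <- newer_iris[-train_idx, ]

table(train$Yellow)
table(test$Yellow)
```

Whichever part does not receive the single TRUE observation ends up with a zero variance Yellow column, which is exactly the situation that some models cannot handle.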