Often, we'll find that our numerical features are measured on scales that are completely different from each other. For example, we might measure a person's body temperature in degrees Celsius, so the numerical values will typically be in the range of 36 to 38. At the same time, we might also measure a person's white blood cell count per microliter of blood. This feature generally takes values in the thousands. If we use these features as inputs to an algorithm such as kNN, we'd find that the large values of the white blood cell count feature dominate the Euclidean distance calculation. We could have several features in our input that are important and useful for classification, but if they were measured on scales that produce numerical values much smaller than one thousand, we'd essentially be picking our nearest neighbors on the basis of a single feature, namely the white blood cell count. This problem comes up often and applies to many models, not just kNN. We handle it by transforming (also referred to as scaling) our input features before using them in our model.
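To make the scale problem concrete, here is a minimal sketch in base R. The three patients and their measurements are made up purely for illustration; they are not taken from any data set used in this book.

```r
# Hypothetical patients (illustrative values only):
# body temperature in degrees Celsius and white blood cell count per microliter.
patients <- data.frame(
  temperature = c(36.6, 39.5, 36.7),
  wbc         = c(6000, 6050, 7500)
)
rownames(patients) <- c("A", "B", "C")

# Euclidean distances on the raw features: the white blood cell count dominates,
# so A looks far closer to the feverish patient B (distance about 50)
# than to C (distance about 1500), whose temperature is almost identical to A's.
raw_dist <- as.matrix(dist(patients))

# After standardizing each column, both features contribute on a comparable scale.
scaled_dist <- as.matrix(dist(scale(patients)))

raw_dist["A", c("B", "C")]
scaled_dist["A", c("B", "C")]
```

On the raw data, the temperature difference between A and B barely registers in the distance; once both columns are standardized, the two candidate neighbors of A end up at comparable distances.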
We'll discuss three popular options for feature scaling. When we know that our input features are close to being normally distributed, one possible transformation to use is Z-score normalization, which works by subtracting the mean and dividing the result by the standard deviation:
$$x_{\text{z-score}} = \frac{x - E(x)}{\sqrt{\mathrm{Var}(x)}}$$
E(x) is the expectation or mean of x, and the standard deviation is the square root of the variance of x, written as Var(x). Notice that as a result of this transformation, the new feature will be centered on a mean of zero and will have unit variance. Another possible transformation, which is better when the input is uniformly distributed, is to scale all the features and outputs so that they lie within a single interval, typically the unit interval [0,1]:
$$x_{\text{unit-interval}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
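Both of the formulas above are easy to check by hand. The following is a minimal base R sketch that applies them to the Sepal.Length column of the built-in iris data; the variable names are our own, and sd() is assumed here as the estimate of the standard deviation (it uses the sample, n - 1, denominator).

```r
x <- iris$Sepal.Length

# Z-score normalization: subtract the mean, then divide by the standard deviation.
x_zscore <- (x - mean(x)) / sd(x)

# Unit interval scaling: shift by the minimum, then divide by the range.
x_unit <- (x - min(x)) / (max(x) - min(x))

mean(x_zscore)   # zero, up to floating-point error
sd(x_zscore)     # one, up to floating-point error
range(x_unit)    # 0 1

# Base R's scale() performs the same centering and scaling.
all.equal(x_zscore, as.numeric(scale(x)))
```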
A third option is known as the Box-Cox transformation. This is often applied when our input features are highly skewed (asymmetric) and our model requires the input features to be normally distributed or symmetrical at the very least:
$$x_{\text{box-cox}} = \frac{x^{\lambda} - 1}{\lambda}$$
As λ is in the denominator, this formula requires λ to take a value other than zero. The transformation is nevertheless defined for a zero-valued λ: in this case, it is given by the natural logarithm of the input feature, ln(x). Notice that this is a parameterized transform, so we need to specify a concrete value of λ. There are various ways to estimate an appropriate value for λ from the data itself. One such technique is cross-validation, which we will encounter later on in this book in Chapter 5, Support Vector Machines.
The original reference for the Box-Cox transformation is a paper published in 1964 in the Journal of the Royal Statistical Society, titled An Analysis of Transformations and authored by G. E. P. Box and D. R. Cox.
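The Box-Cox formula, together with its special case for a zero-valued λ, can be sketched in a few lines of base R. The function name box_cox and the value λ = 0.5 below are our own illustrative choices, not estimates derived from the data; the transformation is only applicable to strictly positive inputs, which the function checks.

```r
# A hand-written Box-Cox transformation (hypothetical helper for illustration).
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))            # Box-Cox requires strictly positive values
  if (lambda == 0) {
    log(x)                         # the lambda = 0 case: natural logarithm
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- iris$Sepal.Length

x_half <- box_cox(x, 0.5)          # an arbitrary, illustrative lambda
x_log  <- box_cox(x, 0)

# As lambda approaches zero, the general formula approaches ln(x),
# which is why the logarithm is the natural definition at lambda = 0.
x_near_zero <- box_cox(x, 1e-6)
max(abs(x_near_zero - x_log))      # a very small number
```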
To get a feel for how these transformations work in practice, we'll try them out on the Sepal.Length feature from our iris data set. Before we do this, however, we'll introduce the first R package that we will be working with, caret.
The caret package is a very useful package that has a number of goals. It provides a number of helpful functions that are commonly used in the process of predictive modeling, from data preprocessing and visualization, to feature selection and resampling techniques. It also features a unified interface for many predictive modeling functions and provides functionalities for parallel processing.
The definitive reference for predictive modeling using the caret package is a book called Applied Predictive Modeling, written by Max Kuhn and Kjell Johnson and published by Springer. Max Kuhn is the principal author of the caret package itself. The book also comes with a companion website at http://appliedpredictivemodeling.com.
When we transform our input features on the data we use to train our model, we must remember that we will need to apply the same transformation to the features of later inputs that we will use at prediction time. For this reason, transforming data using the caret package is done in two steps. In the first step, we use the preProcess() function, which stores the parameters of the transformations to be applied to the data, and in the second step, we use the predict() function to actually compute the transformation. We tend to use the preProcess() function only once, and then the predict() function every time we need to apply the same transformation to some data. The preProcess() function takes a data frame with some numerical values as its first input, and we also pass a vector containing the names of the transformations to apply through the method parameter. The predict() function then takes the output of the previous function along with the data we want to transform, which in the case of the training data itself may well be the same data frame. Let's see all this in action:
> library("caret")
> iris_numeric <- iris[1:4]
> pp_unit <- preProcess(iris_numeric, method = c("range"))
> iris_numeric_unit <- predict(pp_unit, iris_numeric)
> pp_zscore <- preProcess(iris_numeric, method = c("center", "scale"))
> iris_numeric_zscore <- predict(pp_zscore, iris_numeric)
> pp_boxcox <- preProcess(iris_numeric, method = c("BoxCox"))
> iris_numeric_boxcox <- predict(pp_boxcox, iris_numeric)
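The reason for the two-step design becomes clearer when the data to be transformed differs from the data used to estimate the transformation. The following is a minimal sketch under our own assumptions: the 100/50 split, the seed, and the variable names are arbitrary choices made for illustration, not part of the example above.

```r
library(caret)

set.seed(1)                                  # arbitrary seed, for reproducibility
train_idx <- sample(nrow(iris), 100)         # illustrative 100/50 split
iris_train <- iris[train_idx, 1:4]
iris_new   <- iris[-train_idx, 1:4]          # stands in for data seen at prediction time

# Step 1: estimate the centering and scaling parameters on the training data only.
pp_train <- preProcess(iris_train, method = c("center", "scale"))

# Step 2: apply those stored parameters to any data frame with the same columns.
iris_train_z <- predict(pp_train, iris_train)
iris_new_z   <- predict(pp_train, iris_new)

colMeans(iris_train_z)   # zero, up to floating-point error
colMeans(iris_new_z)     # close to, but generally not exactly, zero
```

The new observations are shifted and scaled using the training means and standard deviations rather than their own, which is what keeps the features on the scale the model was trained on.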
We've created three new versions of the numerical features of the iris data, each produced by a different transformation. We can visualize the effects of our transformations by plotting the density of the Sepal.Length feature for each scaled data frame, using the density() function and plotting the result.
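One way to draw these density plots is sketched below. The 2-by-2 layout, the panel titles, and the versions list are our own choices; the transformed data frames are recreated here so that the snippet can be run on its own.

```r
library(caret)

iris_numeric <- iris[1:4]
iris_numeric_unit <- predict(
  preProcess(iris_numeric, method = "range"), iris_numeric)
iris_numeric_zscore <- predict(
  preProcess(iris_numeric, method = c("center", "scale")), iris_numeric)
iris_numeric_boxcox <- predict(
  preProcess(iris_numeric, method = "BoxCox"), iris_numeric)

# Collect the original and transformed data frames under descriptive names.
versions <- list(
  "Original"      = iris_numeric,
  "Unit interval" = iris_numeric_unit,
  "Z-score"       = iris_numeric_zscore,
  "Box-Cox"       = iris_numeric_boxcox
)

# Draw one kernel density estimate of Sepal.Length per panel.
old_par <- par(mfrow = c(2, 2))
for (name in names(versions)) {
  plot(density(versions[[name]]$Sepal.Length),
       main = name, xlab = "Sepal.Length")
}
par(old_par)   # restore the previous plotting layout
```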
Notice that the Z-score and unit interval transformations preserve the overall shape of the density while shifting and scaling the values, whereas the Box-Cox transformation also changes the overall shape, resulting in a density that is less skewed than the original.