This chapter covers
4.1.2 Data transformations
The purpose of data transformation is to make data easier to model—and easier to understand. For example, the cost of living will vary from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer’s income by the typical income in the area where they live. The next listing is an example of a relatively simple (and common) transformation.
> summary(medianincome) State Median.Income
: 1 Min. :37427
Listing 4.4 Tracking original NAs with an extra categorical variable
Listing 4.5 Normalizing income by state
The missingIncome variable lets you differentiate the two kinds of zeros in the data: the ones that you are about to add, and the ones that were already there.
Replace the NAs with zeros.
Suppose medianincome is a data frame of median income by state.
Alabama : 1 1st Qu.:47483 Alaska : 1 Median :52274 Arizona : 1 Mean :52655 Arkansas : 1 3rd Qu.:57195 California: 1 Max. :68187 (Other) :46
> custdata <- merge(custdata, medianincome,
by.x="state.of.res", by.y="State")
> summary(custdata[,c("state.of.res", "income", "Median.Income")])
state.of.res income Median.Income
California :100 Min. : -8700 Min. :37427
New York : 71 1st Qu.: 14600 1st Qu.:44819
Pennsylvania: 70 Median : 35000 Median :50977
Texas : 56 Mean : 53505 Mean :51161
Michigan : 52 3rd Qu.: 67000 3rd Qu.:55559
Ohio : 51 Max. :615000 Max. :68187
(Other) :600
> custdata$income.norm <- with(custdata, income/Median.Income) > summary(custdata$income.norm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1791 0.2729 0.6992 1.0820 1.3120 11.6600
The need for data transformation can also depend on which modeling method you plan to use. For linear and logistic regression, for example, you ideally want to make sure that the relationship between input variables and output variable is approxi- mately linear, and that the output variable is constant variance (the variance of the output variable is independent of the input variables). You may need to transform some of your input variables to better meet these assumptions.
In this section, we’ll look at some useful data transformations and when to use them: converting continuous variables to discrete; normalizing variables; and log transformations.
CONVERTING CONTINUOUS VARIABLES TO DISCRETE
For some continuous variables, their exact value matters less than whether they fall into a certain range. For example, you may notice that customers with incomes less than $20,000 have different health insurance patterns than customers with higher incomes. Or you may notice that customers younger than 25 and older than 65 have high probabilities of insurance coverage, because they tend to be on their parents’ coverage or on a retirement plan, respectively, whereas customers between those ages have a different pattern.
In these cases, you might want to convert the continuous age and income variables into ranges, or discrete variables. Discretizing continuous variables is useful when the relationship between input and output isn’t linear, but you’re using a modeling tech- nique that assumes it is, like regression (see figure 4.2).
Merge median income information into the custdata data frame by matching the column custdata$state.of.res to the column medianincome$State. Median.Income is now part of custdata. Normalize income by Median.Income.
71
Cleaning data
Looking at figure 4.2, you see that you can replace the income variable with a Boolean variable that indicates whether income is less than $20,000:
> custdata$income.lt.20K <- custdata$income < 20000 > summary(custdata$income.lt.20K)
Mode FALSE TRUE NA's
logical 678 322 0
If you want more than a simple threshold (as in the age example), you can use the
cut() function, as you saw in the section “When values are missing systematically.”
> brks <- c(0, 25, 65, Inf) > custdata$age.range <- cut(custdata$age, breaks=brks, include.lowest=T) > summary(custdata$age.range) [0,25] (25,65] (65,Inf] 56 732 212
Even when you do decide not to discretize a numeric variable, you may still need to transform it to better bring out the relationship between it and other variables. You saw this in the example that introduced this section, where we normalized income by the regional median income. In the next section, we’ll talk about normalization and rescaling.
Listing 4.6 Converting age into ranges
Kink in the graph at about $20,000; a cut here divides the graph into two regions that each have less variation than the total graph.
It also expresses the relative flatness of the left side of the cut.
This is a good candidate to split into a binary variable.
0.0 0.3 0.6 0.9
1e+03 1e+04 1e+05
income
as
.n
umer
ic(health.ins)
Figure 4.2 Health insurance coverage versus income (log10 scale)
Select the age ranges of interest. The upper and lower bounds should encompass the full range of the data. Cut the data into age ranges.
The include.lowest=T argument makes sure that zero age data is included in the lowest age range category. By default it would be excluded. The output of cut() is a factor variable.
NORMALIZATION AND RESCALING
Normalization is useful when absolute quantities are less meaningful than relative ones. We’ve already seen an example of normalizing income relative to another meaningful quantity (median income). In that case, the meaningful quantity was external (came from the analyst’s domain knowledge); but it can also be internal (derived from the data itself).
For example, you might be less interested in a customer’s absolute age than you are in how old or young they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.
> summary(custdata$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 38.0 50.0 51.7 64.0 146.7
> meanage <- mean(custdata$age)
> custdata$age.normalized <- custdata$age/meanage > summary(custdata$age.normalized)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.7350 0.9671 1.0000 1.2380 2.8370
A value for age.normalized that is much less than 1 signifies an unusually young cus- tomer; much greater than 1 signifies an unusually old customer. But what constitutes “much less” or “much greater” than 1? That depends on how wide an age spread your customers tend to have. See figure 4.3 for an example.
Listing 4.7 Centering on mean age
0.00 0.02 0.04 0.06 0.08 30 60 90 age density label population1 population2 A 35-year-old seems fairly typical
(a little young) in population1, but unusually young in population2.
The average age of both populations is 50.
Population2 falls mostly in the 40-60 age range.
Population1 includes people with a wide
range of ages.
73
Cleaning data
The typical age spread of your customers is summarized in the standard deviation. You can rescale your data by using the standard deviation as a unit of distance. A customer who is within one standard deviation of the mean is not much older or younger than typical. A customer who is more than one or two standard deviations from the mean can be considered much older, or much younger.
> summary(custdata$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 38.0 50.0 51.7 64.0 146.7 > meanage <- mean(custdata$age) > stdage <- sd(custdata$age) > meanage [1] 51.69981 > stdage [1] 18.86343 > custdata$age.normalized <- (custdata$age-meanage)/stdage > summary(custdata$age.normalized)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.74100 -0.72630 -0.09011 0.00000 0.65210 5.03500
Now, values less than -1 signify customers younger than typical; values greater than 1 signify customers older than typical.
Normalizing by mean and standard deviation is most meaningful when the data distribution is roughly symmetric. Next, we’ll look at a transformation that can make some distributions more symmetric.
LOG TRANSFORMATIONS FOR SKEWED AND WIDE DISTRIBUTIONS
Monetary amounts—incomes, customer value, account, or purchase sizes—are some of the most commonly encoun- tered sources of skewed distributions in data science applications. In fact, as we discuss in appendix B, monetary amounts are often lognormally distrib- uted—the log of the data is normally distributed. This leads us to the idea that taking the log of the data can restore symmetry to it. We demonstrate this in figure 4.4.
For the purposes of modeling, which logarithm you use—natural logarithm, log base 10, or log base 2—is generally not critical. In regression, for exam- ple, the choice of logarithm affects the
Listing 4.8 Summarizing age
Take the
mean. Take the
standard deviation. Use the mean value as the origin (or reference point) and rescale the distance from the mean by the standard deviation.
A technicality
The common interpretation of standard deviation as a unit of distance implicitly assumes that the data is distributed normally. For a normal distribution, roughly two-thirds of the data (about 68%) is within plus/minus one stan- dard deviation from the mean. About 95% of the data is within plus/minus two standard deviations from the mean. In figure 4.3, a 35-year-old is (just barely) within one standard devia- tion from the mean in population1, but more than two standard deviations from the mean in population2. You can still use this transformation if the data isn’t normally distributed, but the standard deviation is most mean- ingful as a unit of distance if the data is
unimodal and roughly symmetric
magnitude of the coefficient that corresponds to the logged variable, but it doesn’t affect the value of the outcome. We like to use log base 10 for monetary amounts, because orders of ten seem natural for money: $100, $1000, $10,000, and so on. The transformed data is easy to read.
AN ASIDE ON GRAPHING Notice that the bottom panel of figure 4.4 has the same shape as figure 3.5. The difference between using the ggplot layer
scale_x_log10 on a density plot of income and plotting a density plot of
log10(income) is primarily axis labeling. Using scale_x_log10 will label the x-axis in dollars amounts, rather than in logs.
It’s also generally a good idea to log transform data with values that range over several orders of magnitude—first, because modeling techniques often have a difficult time with very wide data ranges; and second, because such data often comes from multipli- cative processes, so log units are in some sense more natural.
For example, when you’re studying weight loss, the natural unit is often pounds or kilograms. If you weigh 150 pounds and your friend weighs 200, you’re both equally active, and you both go on the exact same restricted-calorie diet, then you’ll probably both lose about the same number of pounds—in other words, how much weight you lose doesn’t (to first order) depend on how much you weighed in the first place, only on calorie intake. This is an additive process.
On the other hand, if management gives everyone in the department a raise, it probably isn’t giving everyone $5,000 extra. Instead, everyone gets a 2% raise: how
0e+00 5e 0 6 1e 0 5 $0 $200,000 $400,000 $600,000 income density 0.00 0.25 0.50 0.75 1.00 2 3 4 5 6 log10(income) density
The income distribution is asymmetric, skewed so most of the mass is
on the left.
Most of the mass of log10(income) is nearly symmetric, though there is
a long tail on the left (very small incomes).
75
Cleaning data
much extra money ends up in your paycheck depends on your initial salary. This is a multiplicative process, and the natural unit of measurement is percentage, not absolute dollars. Other examples of multiplicative processes: a change to an online retail site increases conversion (purchases) for each item by 2% (not by exactly two purchases); a change to a restaurant menu increases patronage every night by 5% (not by exactly five customers every night). When the process is multiplicative, log transforming the process data can make modeling easier.
Of course, taking the logarithm only works if the data is non-negative. There are other transforms, such as arcsinh, that you can use to decrease data range if you have zero or negative values. We don’t always use arcsinh, because we don’t find the values of the transformed data to be meaningful. In applications where the skewed data is monetary (like account balances or customer value), we instead use what we call a signed logarithm. A signed logarithm takes the logarithm of the absolute value of the variable and multiplies by the appropriate sign. Values strictly between -1 and 1 are mapped to zero. The difference between log and signed log is shown in figure 4.5.
Here’s how to calculate signed log base 10, in R: signedlog10 <- function(x) {
ifelse(abs(x) <= 1, 0, sign(x)*log10(abs(x))) }
Clearly this isn’t useful if values below unit magnitude are important. But with many monetary variables (in US currency), values less than a dollar aren’t much different
0.00 0.25 0.50 0.75 1.00 4 2 0 2 4 6 log10income density
Solid line: log10(income)
Dotted line: signedlog10(income)
signedlog() picks up the non-positive data that log ()
dropped
from zero (or one), for all practical purposes. So, for example, mapping account bal- ances that are less than or equal to $1 (the equivalent of every account always having a
minimum balance of one dollar) is probably okay.2
Once you’ve got the data suitably cleaned and transformed, you’re almost ready to start the modeling stage. Before we get there, we have one more step.
4.2
Sampling for modeling and validation
Sampling is the process of selecting a subset of a population to represent the whole, during analysis and modeling. In the current era of big datasets, some people argue that computational power and modern algorithms let us analyze the entire large data- set without the need to sample.
We can certainly analyze larger datasets than we could before, but sampling is a necessary task for other reasons. When you’re in the middle of developing or refining a modeling procedure, it’s easier to test and debug the code on small subsamples before training the model on the entire dataset. Visualization can be easier with a sub- sample of the data; ggplot runs faster on smaller datasets, and too much data will often obscure the patterns in a graph, as we mentioned in chapter 3. And often it’s not feasible to use your entire customer base to train a model.
It’s important that the dataset that you do use is an accurate representation of your population as a whole. For example, your customers might come from all over the United States. When you collect your custdata dataset, it might be tempting to use all the customers from one state, say Connecticut, to train the model. But if you plan to use the model to make predictions about customers all over the country, it’s a good idea to pick customers randomly from all the states, because what predicts health insurance coverage for Texas customers might be different from what predicts health insurance coverage in Connecticut. This might not always be possible (perhaps only your Connecticut and Massachusetts branches currently collect the customer health insurance information), but the shortcomings of using a nonrepresentative dataset should be kept in mind.
The other reason to sample your data is to create test and training splits. 4.2.1 Test and training splits
When you’re building a model to make predictions, like our model to predict the probability of health insurance coverage, you need data to build the model. You also need data to test whether the model makes correct predictions on new data. The first set is called the training set, and the second set is called the test (or hold-out) set.
The training set is the data that you feed to the model-building algorithm—regres- sion, decision trees, and so on—so that the algorithm can set the correct parameters to best predict the outcome variable. The test set is the data that you feed into the resulting model, to verify that the model’s predictions are accurate. We’ll go into
2 There are methods other than capping to deal with signed logarithms, such as the arcsinh function (see
77
Sampling for modeling and validation
detail about the kinds of modeling issues that you can detect by using hold-out data in chapter 5. For now, we’ll just get our data ready for doing hold-out experiments at a later stage.
Many writers recommend train/calibration/test splits, which is also good advice. Our philosophy is this: split the data into train/test early, don’t look at test until final evaluation, and if you need calibration data, resplit it from your training subset. 4.2.2 Creating a sample group column
A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbi- trary size from the data frame by using the appropriate threshold on the sample group column.
For example, once you’ve labeled all the rows of your data frame with your sample group column (let’s call it gp), then the set of all rows such that gp < 0.4 will be about four-tenths, or 40%, of the data. The set of all rows where gp is between 0.55 and 0.70 is about 15% of the data (0.7 - 0.55 = 0.15). So you can repeatably generate a ran- dom sample of the data of any size by using gp.
> custdata$gp <- runif(dim(custdata)[1])
> testSet <- subset(custdata, custdata$gp <= 0.1) > trainingSet <- subset(custdata, custdata$gp > 0.1) > dim(testSet)[1]
[1] 93
> dim(trainingSet)[1] [1] 907
R also has a function called sample that draws a random sample (a uniform random sample, by default) from a data frame. Why not just use sample to draw training and test sets? You could, but using a sample group column guarantees that you’ll draw the same sample group every time. This reproducible sampling is convenient when you’re debugging code. In many cases, code will crash because of a corner case that you for- got to guard against. This corner case might show up in your random sample. If you’re using a different random input sample every time you run the code, you won’t know if you will tickle the bug again. This makes it hard to track down and fix errors.
You also want repeatable input samples for what software engineers call regression testing (not to be confused with statistical regression). In other words, when you make changes to a model or to your data treatment, you want to make sure you don’t break what was already working. If model version 1 was giving “the right answer” for a cer- tain input set, you want to make sure that model version 2 does so also.
Listing 4.9 Splitting into test and training using a random group mark
dim(custdata) returns the number of rows and columns of the data frame as a vector, so dim(custdata)[1] returns the number of rows. Here we generate a training using the remaining data.
Here we generate a test set of about 10% of the data (93 customers—a little over 9%, actually) and train on the remaining 90%.
REPRODUCIBLE SAMPLING IS NOT JUST A TRICK FOR R If your data is in a data- base or other external store, and you only want to pull a subset of the data into R for analysis, you can draw a reproducible random sample by generat- ing a sample group column in an appropriate table in the database, using the
SQL command RAND .