Filling in the Unknown Values by Exploring Correla- Correla-tions

Predicting Algae Blooms

2.5 Unknown Values

2.5.3 Filling in the Unknown Values by Exploring Correla- Correla-tions

An alternative for getting less biased estimators of the unknown values is to explore the relationships between variables. For instance, using the correlation between the variable values, we could discover that a certain variable is highly correlated with mxPH, which would enable us to obtain other, more probable values for the sample number 48, which has an unknown on this variable. This could be preferable to the use the mean as we did above.

To obtain the variables correlation we can issue the command

> cor(algae[, 4:18], use = "complete.obs")

The function cor() produces a matrix with the correlation values between

19Without this ‘detail’ the result of the call would beNA because of the presence of NA values in this column.

Predicting Algae Blooms 57 the variables (we have avoided the ﬁrst 3 variables/columns because they are nominal). The use="complete.obs" setting tellsR to disregard observations withNA values in this calculation. Values near 1 (−1) indicate a strong pos-itive (negative) linear correlation between the values of the two respective variables. OtherR functions could then be used to approximate the functional form of this linear correlation, which in turn would allow us to estimate the values of one variable from the values of the correlated variable.

The result of this cor() function is not very legible but we can put it through the function symnum() to improve this:

> symnum(cor(algae[,4:18],use="complete.obs")) mP mO Cl NO NH o P Ch a1 a2 a3 a4 a5 a6 a7 mxPH 1

mnO2 1

Cl 1

NO3 1

NH4 , 1

oPO4 . . 1

PO4 . . * 1

Chla . 1

a1 . . . 1

a2 . . 1

a3 1

a4 . . 1

a5 1

a6 . . . 1

a7 1

attr(,"legend")

[1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

This symbolic representation of the correlation values is more legible, par-ticularly for large correlation matrices.

In our data, the correlations are in most cases irrelevant. However, there are two exceptions: between variables NH4 and NO3, and between PO4 and oPO4. These two latter variables are strongly correlated (above 0.9). The correlation between NH4 and NO3 is less evident (0.72) and thus it is risky to take advantage of it to fill in the unknowns. Moreover, assuming that you have removed the samples 62 and 199 because they have too many unknowns, there will be no water sample with unknown values on NH4 and NO3. With respect to PO4 and oPO4, the discovery of this correlation²⁰allows us to fill in the unknowns on these variables. In order to achieve this we need to find the form of the linear correlation between these variables. This can be done as follows:

20According to domain experts, this was expected because the value of total phosphates (PO4 ) includes the value of orthophosphate (oPO4 ).

> data(algae)

> algae <- algae[-manyNAs(algae), ]

> lm(PO4 ~ oPO4, data = algae) Call:

lm(formula = PO4 ~ oPO4, data = algae) Coefficients:

(Intercept) oPO4

42.897 1.293

The function lm() can be used to obtain linear models of the form Y = β₀+ β₁X₁+ . . . + β_nX_n. We will describe this function in detail in Section 2.6.

The linear model we have obtained tells us that PO4 = 42.897+1.293×oPO4.

With this formula we can ﬁll in the unknown values of these variables, provided they are not both unknown.

After removing the sample 62 and 199, we are left with a single observation with an unknown value on the variable PO4 (sample 28); thus we could simply use the discovered relation to do the following:

> algae[28, "PO4"] <- 42.897 + 1.293 * algae[28, "oPO4"]

However, for illustration purposes, let us assume that there were several samples with unknown values on the variable PO4. How could we use the above linear relationship to ﬁll all the unknowns? The best would be to create a function that would return the value of PO4 given the value of oPO4, and then apply this function to all unknown values:

> data(algae)

> algae <- algae[-manyNAs(algae), ]

> fillPO4 <- function(oP) { + if (is.na(oP))

+ return(NA)

+ else return(42.897 + 1.293 * oP) + }

> algae[is.na(algae$PO4), "PO4"] <- sapply(algae[is.na(algae$PO4), + "oPO4"], fillPO4)

We ﬁrst create a function called fillPO4() with one argument, which is assumed to be the value of oPO4. Given a value of oPO4, this function re-turns the value of PO4 according to the discovered linear relation (try issuing

“fillPO4(6.5)”). This function is then applied to all samples with unknown value on the variable PO4. This is done using the function sapply(), another example of a meta-function. This function has a vector as the ﬁrst argument and a function as the second. The result is another vector with the same length, with the elements being the result of applying the function in the sec-ond argument to each element of the given vector. This means that the result of this call to sapply() will be a vector with the values to ﬁll in the unknowns

Predicting Algae Blooms 59 of the variable PO4. The last assignment is yet another example of the use of function composition. In eﬀect, in a single instruction we are using the result of function is.na() to index the rows in the data frame, and then to the result of this data selection we are applying the function fillPO4() to each of its elements through function sapply().

The study of the linear correlations enabled us to ﬁll in some new unknown values. Still, there are several observations left with unknown values. We can try to explore the correlations between the variables with unknowns and the nominal variables of this problem. We can use conditioned histograms that are available through the lattice R package with this objective. For instance, Figure 2.7 shows an example of such a graph. This graph was produced as follows:

> histogram(~mxPH | season, data = algae)

mxPH

Percent of Total

0 10 20 30 40

6 7 8 9 10

autumn spring

summer

6 7 8 9 10

0 10 20 30 40 winter

FIGURE 2.7: A histogram of variable mxPH conditioned by season.

This instruction obtains an histogram of the values of mxPH for the diﬀer-ent values of season. Each histogram is built using only the subset of observa-tions with a certain season value. You may have noticed that the ordering of the seasons in the graphs is a bit unnatural. If you wish the natural temporal ordering of the seasons, you have to change the ordering of the labels that form the factor season in the data frame. This could be done by

> algae$season <- factor(algae$season, levels = c("spring", + "summer", "autumn", "winter"))

By default, when we factor a set of nominal variable values, the levels parameter assumes the alphabetical ordering of these values. In this case we want a diﬀerent ordering (the temporal order of the seasons), so we need to specify it to the factor function. Try executing this instruction and afterward obtain again the histogram to see the diﬀerence.

Notice that the histograms in Figure 2.7 are rather similar, thus leading us to conclude that the values of mxPH are not seriously inﬂuenced by the season of the year when the samples were collected. If we try the same using the size of the river, with histogram(∼ mxPH | size,data=algae), we can observe a tendency for smaller rivers to show lower values of mxPH. We can extend our study of these dependencies using several nominal variables. For instance,

> histogram(~mxPH | size * speed, data = algae)

shows the variation of mxPH for all combinations of size and speed of the rivers. It is curious to note that there is no information regarding small rivers with low speed.²¹The single sample that has these properties is exactly sample 48, the one for which we do not know the value of mxPH !

Another alternative to obtain similar information but now with the con-crete values of the variable is

> stripplot(size ~ mxPH | speed, data = algae, jitter = T)

The result of this instruction is shown in Figure 2.8. The jitter=T pa-rameter setting is used to perform a small random permutation of the values in the Y-direction to avoid plotting observations with the same values over each other, thus losing some information on the concentration of observations with some particular value.

This type of analysis could be carried out for the other variables with unknown values. Still, this is a tedious process because there are too many combinations to analyze. Nevertheless, this is a method that can be applied in small datasets with few nominal variables.

2.5.4 Filling in the Unknown Values by Exploring

In document Data.mining.with.R.learning.with.Case.studies.nov (Page 71-75)