Data Pre-treatment and Variable Selection
3.5 DATA REDUCTION
This chapter is concerned with the pre-treatment of data, and so far we have discussed the properties of the distribution of data, means by which data may be scaled, and correlations between variables. All of these matters are important, in so far as they dictate what can be done with data, but perhaps the most important is to answer the question ‘What information does the data contain?’ It is most unlikely that any given data set will contain as many pieces of information as it does variables.¹ That is to say, most data sets suffer from a degree of redundancy, particularly when they contain more variables than cases, a situation in which the data matrix is referred to as being over-square. Most people are aware of the fact that with two data points it is possible to construct a line, a 1-dimensional object, and with three data points a plane, a 2-dimensional object. This can be continued so that 4 data points allow a 3-dimensional object, 5 points 4 dimensions, and so on. Thus, the maximum dimensionality of an object, and hence the maximum number of dimensions, in a data set is n − 1, where n is the number of data points. For dimensions we can substitute ‘independent pieces of information’, and thus the maximum that any data set may contain is n − 1. This, however, is a maximum and in reality the true dimensionality, where dimension means ‘information’, is often much less than n − 1.
Figure 3.2 Illustration of the sharing of variance between three correlated variables. The hatched areas show shared variance between X1 and X2 and X1 and X3. The crosshatched area shows variance shared by all three variables.
¹ An example where this is not true is the unusual situation where all of the variables in the set are orthogonal to one another, e.g. principal components (see Chapter 4), but even here some variables may not contain information but be merely ‘noise’.
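The n − 1 limit on dimensionality can be checked numerically. The sketch below is illustrative only: the small data matrix is made up, and it assumes that ‘dimension’ is measured as the rank of the data matrix after each variable has been centred on its mean, which is one way of reading the geometric argument. Exact fractions are used so that the rank is not affected by rounding.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def centre_columns(rows):
    """Subtract the column (variable) mean from every entry."""
    n = len(rows)
    means = [Fraction(sum(col), n) for col in zip(*rows)]
    return [[Fraction(v) - mu for v, mu in zip(row, means)] for row in rows]


# Hypothetical data: n = 3 samples (rows) described by 4 variables (columns),
# i.e. an 'over-square' matrix with more variables than cases.
X = [
    [1, 2, 0, 3],
    [2, 0, 1, 1],
    [0, 1, 4, 2],
]
n = len(X)
raw_rank = matrix_rank(X)
centred_rank = matrix_rank(centre_columns(X))
print(raw_rank, centred_rank)
```

For this matrix the centred rank is 2, that is n − 1, even though four variables are present: once the mean has been removed, three samples cannot support more than two independent directions.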
This section describes ways by which redundancy may be identified and, to some extent at least, eliminated. This stage in data analysis, in which selected variables are removed from a data set, is called data reduction. It should not be confused with dimension reduction, described in the next chapter, in which high-dimensional data sets are reduced to lower dimensions, usually for the purposes of display.
An obvious first test to apply to the variables in a data set is to look for missing values: is there an entry in each column for every row? What can be done if there are missing values? An easy solution, and often the best one, is to discard the variable, but the problem with this approach is that the particular variable concerned may contain information that is useful for the description of the dependent property. Another approach, which has the advantage of retaining the variable, is to delete samples with missing values. The disadvantage of this is that it reduces the size and variety of the data set. In fact either of these methods of dealing with missing data, variable deletion or sample deletion, results in a smaller data set which is likely to contain less information.
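The two deletion strategies can be compared side by side. The following is a minimal sketch, assuming the data are held as a dictionary of equal-length columns with None marking a missing entry; the variable names and values are hypothetical.

```python
def drop_variables(table):
    """Variable deletion: keep only columns with no missing entries."""
    return {
        name: col
        for name, col in table.items()
        if all(v is not None for v in col)
    }


def drop_samples(table):
    """Sample deletion: keep only rows with an entry in every column."""
    n_rows = len(next(iter(table.values())))
    keep = [
        i for i in range(n_rows)
        if all(col[i] is not None for col in table.values())
    ]
    return {name: [col[i] for i in keep] for name, col in table.items()}


# Hypothetical data set: 4 samples, 3 variables, one missing value.
data = {
    "logP": [1.2, 0.8, None, 2.1],
    "MR": [10.3, 11.1, 9.8, 12.0],
    "dipole": [1.5, 1.7, 1.6, 1.9],
}

by_variable = drop_variables(data)  # loses the logP column
by_sample = drop_samples(data)      # loses the third sample
print(sorted(by_variable), len(by_sample["logP"]))
```

Either way the result is smaller than the original: one route gives up a whole variable, the other a whole sample.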
An alternative to deletion is to provide the missing values, and if these can be calculated with a reasonable degree of certainty, then all is well. If not, however, other methods may be sought. Missing values may be replaced by random numbers, generated to lie in the range of the variable concerned. This allows the information contained in the variable to be used for the members of the set which have ‘real’ values but, of course, any correlation or pattern involving that variable does not apply to the other members of the set. A problem with random fill is that some variables may only have certain values, and the use of random numbers, even within the range of values of the variable, may distort this structure. In this case a better solution is to randomly take some of the existing values of the variable for other cases and use these to replace the missing values. This has the advantage that the distribution of the variable is unaltered and any special properties that it has, like only being able to take certain values, are unchanged.
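The difference between the two kinds of random fill is easiest to see on a variable that can only take certain values. The sketch below is one possible implementation, not a prescribed one: it assumes missing entries are marked with None, draws range-based fills from a uniform distribution (the text does not specify a distribution), and uses a hypothetical count variable.

```python
import random


def observed_values(values):
    """The entries that are actually present."""
    return [v for v in values if v is not None]


def random_fill(values, rng):
    """Replace missing entries with random numbers inside the observed range."""
    observed = observed_values(values)
    lo, hi = min(observed), max(observed)
    return [rng.uniform(lo, hi) if v is None else v for v in values]


def resample_fill(values, rng):
    """Replace missing entries with values drawn from the observed entries."""
    observed = observed_values(values)
    return [rng.choice(observed) if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical variable that can only be 0, 1 or 2 (e.g. a count).
n_rings = [0, 1, None, 2, 1, None, 0]

filled_uniform = random_fill(n_rings, rng)
filled_resampled = resample_fill(n_rings, rng)
print(filled_uniform)
print(filled_resampled)
```

The range-based fill will usually produce non-integer values such as fractions of a ring, whereas the resampled fill can only return values that the variable has already taken.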
An alternative to random fill is mean fill which, as the name implies, replaces missing values by the mean of the variable involved. This, like random fill, has the advantage that the variable with missing values can now be used; it also has the further advantage that the mean of the variable is not altered, although its variance is reduced and its kurtosis perhaps increased. Another approach to the problem of missing values is to use linear combinations of the other variables to produce an estimate for the missing value. As will be seen later in this section, data sets sometimes suffer from a condition known as multicollinearity in which one variable is correlated with a linear combination of the other variables. This method of filling missing values certainly involves more work, unless the statistics package has it ‘built in’, and is probably of debatable value since multicollinearity is a condition which is generally best avoided. There are a number of other ways in which missing data can be filled in, and some statistics packages have procedures to analyse ‘missingness’ and offer a variety of options to estimate the missing values. The ideal solution to missing values, of course, is not to have them in the first place!
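Mean fill and fill by regression can both be sketched in a few lines. The example below is a simplification under stated assumptions: missing entries are marked with None, the data are hypothetical, and the regression uses a single predictor fitted by ordinary least squares, whereas the approach described in the text would use a linear combination of several variables.

```python
from statistics import mean, pvariance


def mean_fill(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    centre = mean(observed)
    return [centre if v is None else v for v in values]


def regression_fill(y, x):
    """Estimate missing y entries from x by a least-squares straight line.

    The line is fitted only to the samples where y is present.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = mean([xi for xi, _ in pairs])
    my = mean([yi for _, yi in pairs])
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Hypothetical variable with two missing entries.
values = [2.0, 4.0, None, 6.0, None]
observed = [v for v in values if v is not None]
filled = mean_fill(values)
print(mean(observed), mean(filled))            # the mean is preserved
print(pvariance(observed), pvariance(filled))  # the spread is not

# Hypothetical pair of variables where y tracks x closely.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]
y_filled = regression_fill(y, x)
print(y_filled)
```

Printing the variance before and after mean fill makes its side effect visible: every filled entry sits exactly at the centre of the distribution, so the spread shrinks.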
The next stage in data reduction is to examine the distribution of the variables as discussed in Section 3.2. A fairly obvious step is to identify those parameters which have constant, or nearly constant, values. Such a situation may arise because a property has been poorly chosen in the first place, but may also happen when structural changes in the compounds in the set lead to compensating changes in physicochemical properties. Some data analysis packages have a built-in facility for the identification of such ill-conditioned variables. At this stage in data reduction it is also a good idea to actually plot the distribution of each of the variables in the set so as to identify outliers or variables which have become ‘indicators’, as discussed in Section 1.4.3.
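A simple screen for constant or nearly constant variables might look like the sketch below. The criterion is an assumption made for illustration, not one given in the text: a variable is flagged when a single value accounts for at least 90 % of the cases. The function name, the threshold and the data are all hypothetical.

```python
from collections import Counter


def is_ill_conditioned(values, max_share=0.9):
    """True when one value accounts for at least max_share of the cases."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) >= max_share


# Hypothetical data set with 10 samples per variable.
data = {
    "constant": [3.1] * 10,
    "nearly_constant": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "varied": [0.2, 1.4, 0.9, 2.3, 1.1, 0.5, 1.8, 2.9, 0.7, 1.6],
}

flagged = [name for name, values in data.items() if is_ill_conditioned(values)]
print(flagged)
```

A share-of-cases rule suits indicator-like variables. For continuous variables whose values differ only slightly, a threshold on the standard deviation relative to the mean or range would be a more natural choice, and in either case the cut-off is a matter of judgement.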
This introduces the correlation matrix. Having removed ill-conditioned variables from the data set, a correlation matrix is constructed by calculation of the correlation coefficient (see Section 3.4) between each pair of variables in the set. A sample correlation matrix is shown in Table 3.1 where the correlation between a pair of variables is found by the intersection of a particular row and column, for example, the correlation between ClogP and Iy is 0.503. The diagonal of the matrix consists of 1.00s, since this represents the correlation of each variable with itself, and it is usual to show only half of the matrix since it is symmetrical (the top right-hand side of the matrix is identical to the bottom left-hand side).

Table 3.1 Correlation matrix for a set of physicochemical properties.

         Ix       Iy       ClogP    CMR      CHGE(4)  ESDL(4)  DIPMOM   EHOMO
Ix       1.00
Iy       0.806    1.00
ClogP    0.524    0.503    1.00
CMR      0.829    0.942    0.591    1.00
CHGE(4)  0.344    0.349    0.286    0.243    1.00
ESDL(4)  0.299    0.257    0.128    0.118    0.947    1.00
DIPMOM   0.337    0.347    0.280    0.233    0.531    0.650    1.00
EHOMO    0.229    0.172    0.209    0.029    0.895    0.917    0.433    1.00
Inspection of the correlation matrix allows the identification of pairs of correlating features, although choice of the level at which correlation becomes important is problematic and dependent to some extent on the requirements of the analysis. There are, however, a number of high correlations (r > 0.9) in Table 3.1, and removal of one variable from each of these pairs will reduce the size of the data set without much likelihood of removing useful information. At this point the data reduction process might begin to be called ‘variable selection’, which is not just a matter of semantics but actually a different procedure with different aims from data reduction. There are a number of strategies for variable selection; some are applied before any further data analysis, as discussed in the next section, while others are actually an integral part of the data analysis process.
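The search for highly correlated pairs is easily automated. The sketch below uses the values printed in Table 3.1 and the r > 0.9 level mentioned above; the function name is hypothetical, and the threshold is a parameter precisely because the appropriate level depends on the analysis. Absolute values are compared so that strong negative correlations would also be reported.

```python
names = ["Ix", "Iy", "ClogP", "CMR", "CHGE(4)", "ESDL(4)", "DIPMOM", "EHOMO"]

# Lower triangle of the correlation matrix, row by row, as printed in Table 3.1.
lower = [
    [1.00],
    [0.806, 1.00],
    [0.524, 0.503, 1.00],
    [0.829, 0.942, 0.591, 1.00],
    [0.344, 0.349, 0.286, 0.243, 1.00],
    [0.299, 0.257, 0.128, 0.118, 0.947, 1.00],
    [0.337, 0.347, 0.280, 0.233, 0.531, 0.650, 1.00],
    [0.229, 0.172, 0.209, 0.029, 0.895, 0.917, 0.433, 1.00],
]


def correlated_pairs(names, lower, threshold=0.9):
    """List (variable, variable, r) for off-diagonal entries with |r| > threshold."""
    pairs = []
    for i, row in enumerate(lower):
        for j in range(i):  # j < i skips the diagonal
            if abs(row[j]) > threshold:
                pairs.append((names[j], names[i], row[j]))
    return pairs


high = correlated_pairs(names, lower)
for first, second, r in high:
    print(f"{first:8s} {second:8s} {r:.3f}")
```

Three pairs exceed the 0.9 level: Iy with CMR, CHGE(4) with ESDL(4), and ESDL(4) with EHOMO. Note that ESDL(4) appears in two of them, so which member of each pair to remove is itself a choice that affects how many variables are lost.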
So, to summarize the data reduction process so far:
• Missing values have been identified and the problem treated by either filling them in or removing the offending cases or variables.
• Variables which are constant or nearly constant have been identified and removed.
• Variables which have ‘strange’ or extreme distributions have been identified and the problem solved, by fixing mistakes or removing samples, or the variables removed.
• Correlated variables have been identified and marked for future removal.