Data Pre-treatment and Variable Selection
3.2 DATA DISTRIBUTION
As discussed in Section 1.4.2 of Chapter 1, knowledge of the distribution of the data values of a variable is important. Examining the distribution allows the identification of outliers, whether real or artefacts, shows whether apparently continuous variables really are and gives an idea of how well the data conforms to the assumptions (usually of normality) employed in some analytical methods. A prime example of this is the very commonly used technique of regression (see Chapter 6) which depends on numerous assumptions about the distribution of the data. Thus, it is often rewarding to plot data values and examine the frequency distributions of variables as a preliminary step in any data analysis. This, of course, is easy to do if the problem consists of only one or two dependent variables and a few tens of independent variables but becomes much more tedious if the set contains hundreds of variables. This, amongst other reasons, is why it is very rarely done! At the very least one should always examine the distribution of any variables that end up in some sort of statistical or mathematical model of a data set.
One particularly important property of variables that can be easily checked, even when considering a large number of variables, is the variance ($s^2$) or standard deviation ($s$) which was discussed in Section 1.4.2. The sample variance is the average of the squares of the distance of each data value from the mean:
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \tag{3.1}$$
Calculating the variance and standard deviation for a set of variables is trivial and is a standard feature of all statistics packages. Examination of the standard deviation (which has the same units as the original variable) will show up variables whose standard deviation is small, that is, variables that contain little information about the samples. In the limit, when the standard deviation of a variable is zero, that variable contains no information at all since its value is the same for every sample. It isn't possible to give any 'guide' value for a 'useful' standard deviation since this depends on the units of the original measurements but if we have a set of, say, 50 variables and need to reduce this to 10 or 20 then a reasonable filter would be to discard those variables with the smallest standard deviations. It goes without saying that variables with a standard deviation of zero are useless!
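The filter described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a recommended procedure: the function name `keep_highest_sd`, the data values and the choice of `n_keep` are all hypothetical, and the comparison of raw standard deviations is only meaningful when the variables are expressed in comparable units.

```python
import numpy as np

def keep_highest_sd(data, n_keep):
    """Return the column indices of the n_keep variables with the largest
    sample standard deviation, together with all the standard deviations."""
    # ddof=1 gives the (n - 1) divisor used in Equation (3.1)
    sd = np.std(data, axis=0, ddof=1)
    order = np.argsort(sd)[::-1]      # largest standard deviation first
    keep = np.sort(order[:n_keep])    # report kept columns in original order
    return keep, sd

# Hypothetical data set: 5 samples (rows) by 4 variables (columns).
# Column 1 is constant, so its standard deviation is zero.
data = np.array([
    [1.0, 10.0, 5.0, 0.1],
    [2.0, 10.0, 5.1, 0.9],
    [3.0, 10.0, 4.9, 0.4],
    [4.0, 10.0, 5.0, 0.7],
    [5.0, 10.0, 5.0, 0.2],
])

keep, sd = keep_highest_sd(data, n_keep=2)
print("standard deviations:", sd)
print("columns kept:", keep)
```

The constant column is the first to be discarded, as expected for a variable that carries no information about the samples.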
The mean and standard deviation of a data set are dependent on the first and second 'moments' of the set. The term moments here refers to the powers that the data are raised to in the calculation of that particular statistic. There are two higher moment statistics that may be used to characterize the shape of a distribution: skewness, based on the third moment, and kurtosis, based on the fourth moment. Skewness is a measure of how symmetrical a distribution is around its mean and for a completely 'normal' distribution with equal numbers of data values either side of the mean the value of skewness should be zero (see Figure 1.6 for examples of skewed distributions). Distributions with more data values smaller than the mean are said to be positively skewed and will generally have a long right tail so they are also known as 'skewed to the right'. Negatively skewed distributions have the opposite shape, of course. Kurtosis is a measure of how 'peaked' a distribution is as shown in Figure 1.6. Kurtosis, as measured by the moment ratio, has a value of 3 for a normal distribution, although there are other ways of calculating kurtosis which give different values for a normal distribution. It is thus important to check how kurtosis is computed in the particular statistics package being used to analyse data. Distributions which have a high peak (kurtosis > 3) are known as leptokurtic, those with a flatter peak (kurtosis < 3) are called platykurtic and a normal distribution is mesokurtic.
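As a rough illustration of the moment-ratio definitions, the sketch below computes skewness and kurtosis directly from the central moments. The function name `moment_skew_kurtosis` and the two small data sets are hypothetical, and the use of the simple (divide by n) moments is an assumption made here; statistics packages may apply small-sample corrections or report 'excess' kurtosis (kurtosis minus 3), which is exactly why the convention should be checked.

```python
import numpy as np

def moment_skew_kurtosis(x):
    """Skewness and kurtosis as ratios of central moments.

    Assumption: moments are simple averages (divisor n), so the kurtosis
    of a normal distribution is 3 on this scale.
    """
    x = np.asarray(x, dtype=float)
    d = x - x.mean()                # deviations from the mean
    m2 = np.mean(d ** 2)            # second central moment
    m3 = np.mean(d ** 3)            # third central moment
    m4 = np.mean(d ** 4)            # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis

# Hypothetical values: one symmetric set and one with a long right tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 2.0, 2.0, 10.0]

skew_sym, kurt_sym = moment_skew_kurtosis(symmetric)
skew_right, kurt_right = moment_skew_kurtosis(right_tailed)
print("symmetric:    skewness, kurtosis =", skew_sym, kurt_sym)
print("right-tailed: skewness, kurtosis =", skew_right, kurt_right)
```

The symmetric set gives a skewness of zero, while the set with one large value gives a positive skewness, matching the 'skewed to the right' description.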
These measures of the spread and shape of the distribution of a variable allow us to decide how close the distribution is to normal but how important is this and what sort of deviation is acceptable? Perhaps unsurprisingly, there is no simple answer to these questions. If the analytical method which will be used on the data depends on assumptions of normality, as in linear regression for example, then the nearer the distributions are to normal the more reliable the results of the analysis will be. If, however, the analytical technique does not rely on assumptions of normality then deviations from normality may well not matter at all. In any case, any 'real' variable is unlikely to have a perfect, normal distribution. The best use that can be made of these measures of normality is as a filter for the removal of variables in order to reduce redundancy in a data set. As will be seen in later sections, variables may be redundant because they contain the same or very similar information to another variable or combination of variables. When it is necessary to remove one of a pair of variables then it makes sense to eliminate the one which has the least normal distribution.
3.3 SCALING
Scaling is a problem familiar to anyone who has ever plotted a graph. In the case of a graph, the axes are scaled so that the information present in each variable may be readily perceived. The same principle applies to the scaling of variables before subjecting them to some form of analysis. The objective of scaling methods is to remove any weighting which is solely due to the units which are used to express a particular variable. An example of this is measurement of the height and weight of people. Expressing height in feet and weight in stones gives comparable values but inches and stones or feet and pounds will result in apparent greater emphasis on height or weight, respectively. Another example can be seen in the values of ¹H and ¹³C NMR shifts. In any comparison of these two types of shifts the variance of the ¹³C measurements will be far greater simply due to their magnitude. One means by which this can be overcome, to a certain extent at least, is to express all shifts relative to a common structure, the least substituted member of the series, for example. This only partly solves the problem, however, since the magnitude of the shifts will still be greater for ¹³C than for ¹H. A commonly used steric parameter, MR, is often scaled by division by 10 to place it on a similar scale to other parameters such as π and σ.
These somewhat arbitrary scaling methods are far from ideal since, apart from suffering from subjectivity, they require the individual inspection of each variable in detail which can be a time-consuming task. What other forms of scaling are available? One of the most familiar is called normalization or range scaling where the minimum value of a variable is set to zero and the values of the variable are divided by the range of the variable:
$$X'_{ij} = \frac{X_{ij} - X_{j(\mathrm{MIN})}}{X_{j(\mathrm{MAX})} - X_{j(\mathrm{MIN})}} \tag{3.2}$$
In this equation $X'_{ij}$ is the new range-scaled value for row i (case i) of variable j. The values of range-scaled variables fall into the range $0 \le X'_j \le 1$; the variables are also described as being normalized in the range zero to one. Normalization can be carried out over any preferred range, perhaps for aesthetic reasons, by multiplication of the range-scaled values by a factor. A particular shortcoming of range scaling is that it is dependent on the minimum and maximum values of the variable, thus it is very sensitive to outliers. One way to reduce this sensitivity to outliers is to scale the data by subtracting the mean from the data values, a process known as mean centring:
$$X'_{ij} = X_{ij} - \bar{X}_j \tag{3.3}$$
As for Equation (3.2), $X'_{ij}$ is the new mean-scaled value for row i (case i) of variable j where $\bar{X}_j$ is the mean of variable j. Mean centred variables are better 'behaved' in terms of extreme values but they are still dependent on their units of measurement.
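Range scaling and mean centring are both one-line column operations, as the following sketch suggests. The function names `range_scale` and `mean_centre` and the small data matrix are hypothetical; the sketch also assumes that no variable is constant, since a range of zero would lead to division by zero.

```python
import numpy as np

def range_scale(data):
    """Range scaling (Equation 3.2): each column is mapped onto 0 to 1.

    Assumption: no column is constant (a zero range would divide by zero).
    """
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    return (data - col_min) / col_range

def mean_centre(data):
    """Mean centring (Equation 3.3): subtract each column mean."""
    return data - data.mean(axis=0)

# Hypothetical data: 3 samples (rows) by 2 variables on very different scales
data = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [4.0, 200.0],
])

scaled = range_scale(data)
centred = mean_centre(data)
print("range scaled:\n", scaled)
print("mean centred:\n", centred)
```

After range scaling both columns span zero to one regardless of their original units, whereas the mean centred columns have a mean of zero but keep their original spread.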
Another form of scaling which is less sensitive to outliers and which addresses the problem of dependence on the units of measurement is known as autoscaling, in which the mean is subtracted from the variable values and the resultant values are divided by the standard deviation:
$$X'_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \tag{3.4}$$
Again, in this equation $X'_{ij}$ represents the new autoscaled value for row i of variable j, $\bar{X}_j$ is the mean of variable j, and $\sigma_j$ is the standard deviation given by Equation (3.5).
$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{N} (X_{ij} - \bar{X}_j)^2}{N - 1}} \tag{3.5}$$
Autoscaled variables have a mean of zero and a variance (standard deviation) of one. Because they are mean centred, they are less susceptible to the effects of compounds with extreme values. That they have a variance of one is useful in variance-related methods (see Chapters 4 and 5) since they each contribute one unit of variance to the overall variance of a data set. Autoscaled variables are also known as Z scores, symbol Z, or standard scores.
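A minimal sketch of autoscaling, following Equations (3.4) and (3.5), is given below. The function name `autoscale` and the height and weight values are hypothetical, and the sketch assumes that no variable has a standard deviation of zero.

```python
import numpy as np

def autoscale(data):
    """Autoscaling (Equation 3.4): subtract the column mean and divide by
    the column standard deviation computed with the N - 1 divisor of
    Equation (3.5).

    Assumption: no column has a standard deviation of zero.
    """
    mean = data.mean(axis=0)
    sd = data.std(axis=0, ddof=1)   # ddof=1 gives the N - 1 divisor
    return (data - mean) / sd

# Hypothetical data: height in inches and weight in pounds for 4 people
data = np.array([
    [60.0, 120.0],
    [66.0, 150.0],
    [72.0, 180.0],
    [69.0, 200.0],
])

z = autoscale(data)
column_variances = z.var(axis=0, ddof=1)
print("autoscaled values:\n", z)
print("column means:", z.mean(axis=0))
print("column variances:", column_variances)
print("total variance:", column_variances.sum())
```

Each autoscaled column contributes one unit of variance, so the total variance of the scaled set equals the number of variables.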
One further method of scaling which may be employed is known as feature weighting where variables are scaled so as to enhance their effects in the analysis. The objective of feature weighting is quite opposite to that of 'equalization' scaling methods described here; it is discussed in detail in Chapter 7.
3.4 CORRELATIONS
When a data set contains a number of variables which describe the same samples, which is the usual case for most common data sets, then some of these variables will have values which change in a similar way across the set of samples. As was shown in the box in Chapter 2, the way that two variables are distributed about their means is given by a quantity called covariance:
$$C(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n} \tag{3.6}$$
Where the covariance is positive the values of one variable increase as the values of the other increase; where it is negative the values of one variable get larger as the values of the other get smaller. This can be handily expressed as the correlation coefficient, r, shown in Equation (3.7), which ranges from −1, a perfect negative correlation, through 0, no correlation, to +1, a perfect positive correlation.
$$r = \frac{C(x,y)}{\left(V(x) \times V(y)\right)^{1/2}} \tag{3.7}$$
If two variables are perfectly correlated, either negatively or positively, then it is clear that one of them is redundant and can be excluded from the set without losing any information, but what about the situation where the correlation coefficient between a pair of variables is less than one but greater than zero? One useful property of the correlation coefficient is that its square multiplied by 100 gives the percentage of variance that is shared or common to the two variables. Thus, a correlation coefficient of 0.7 between a pair of variables means that they share almost half (49 %) of their variance. A correlation coefficient of 0.9 means a shared variance of 81 %. A diagrammatic representation of these correlations is shown in Figure 3.1.
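The relationship between covariance, the correlation coefficient and shared variance can be checked numerically with a short sketch. The function names and the data values are hypothetical; the sketch assumes that the variance V(x) is computed with the same divisor n as the covariance in Equation (3.6), so that the divisors cancel in Equation (3.7).

```python
import numpy as np

def covariance(x, y):
    """Covariance with divisor n, as in Equation (3.6)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Correlation coefficient of Equation (3.7).

    Assumption: V(x) is the covariance of x with itself (divisor n).
    """
    return covariance(x, y) / np.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical values of two variables measured on the same 5 samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = correlation(x, y)
shared_variance = 100.0 * r ** 2    # percentage of variance in common
print("covariance:", covariance(x, y))
print("correlation coefficient r:", r)
print("shared variance (%):", shared_variance)
```

For these values r is a little under 0.8, so the two variables share about 60 % of their variance.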
These simple, pairwise, correlation coefficients can be rapidly computed and displayed in a table called the correlation matrix, as shown in the next section. Inspection of this matrix allows the ready identification of variables that are correlated, a situation also known as collinearity, and thus are good candidates for removal from the set.
Figure 3.1 Illustration of the sharing of variance between two correlated variables. The hatched area represents shared variance.
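Inspection of a correlation matrix for collinear pairs can be automated along the following lines. This is only a sketch: the function name `correlated_pairs`, the data and the threshold of 0.9 are hypothetical choices made for illustration, not values recommended by the text.

```python
import numpy as np

def correlated_pairs(data, threshold):
    """Return the correlation matrix of the columns of data and a list of
    (i, j, r) for every pair of variables with |r| >= threshold."""
    # rowvar=False treats columns as variables and rows as samples
    r = np.corrcoef(data, rowvar=False)
    n_vars = r.shape[0]
    pairs = []
    for i in range(n_vars):
        for j in range(i + 1, n_vars):   # upper triangle only
            if abs(r[i, j]) >= threshold:
                pairs.append((i, j, r[i, j]))
    return r, pairs

# Hypothetical data: 6 samples by 3 variables.
# Variable 1 is an exact linear function of variable 0.
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = 2.0 * x0 + 1.0
x2 = np.array([2.0, 5.0, 1.0, 6.0, 3.0, 4.0])
data = np.column_stack([x0, x1, x2])

r_matrix, pairs = correlated_pairs(data, threshold=0.9)  # 0.9 is arbitrary
print("correlation matrix:\n", r_matrix)
print("highly correlated pairs:", pairs)
```

Only the pair of variables 0 and 1 is flagged, so one of them would be a candidate for removal.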
If two variables can share variance then is it possible that three or more variables can also share variance? The answer to this is yes and this is a situation known as multicollinearity. Figure 3.2 illustrates how three variables can share variance.
The statistic that describes this situation, equivalent to the simple correlation coefficient, r, is called the multiple correlation coefficient. This is discussed further in the next section and in Chapter 6 but suffice it to say here that the multiple correlation coefficient can also be used to identify redundancy in a data set and can be used as a criterion for the removal of variables.
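One common way to obtain a squared multiple correlation coefficient is to regress one variable on all of the others by least squares and take the fraction of its variance that is explained. The sketch below follows that approach as an assumption; it is not necessarily the procedure used in the next section or in Chapter 6. The function name `multiple_r_squared` and the data are hypothetical, and the target variable is assumed not to be constant.

```python
import numpy as np

def multiple_r_squared(data, target):
    """Squared multiple correlation of column `target` with all the other
    columns, from a least-squares fit with an intercept.

    Assumption: the target column is not constant.
    """
    y = data[:, target]
    others = np.delete(data, target, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # intercept + others
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_res = np.sum(residual ** 2)             # unexplained sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical data: variable 2 is the sum of variables 0 and 1, so it is
# fully redundant although neither pairwise correlation is perfect.
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = np.array([2.0, 5.0, 1.0, 6.0, 3.0, 4.0])
x2 = x0 + x1
data = np.column_stack([x0, x1, x2])

r2_multiple = multiple_r_squared(data, target=2)
r_pairwise = np.corrcoef(x0, x2)[0, 1]
print("pairwise r between variables 0 and 2:", r_pairwise)
print("squared multiple correlation for variable 2:", r2_multiple)
```

Here the pairwise correlations would not flag variable 2 as redundant, but the multiple correlation shows that it shares all of its variance with the other two variables taken together.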