Non-linear transformations - for Statistical Analysis: Ranking and Transformations

10.2 Non-linear transformations

10.2.1 Square root transformation

For count data, e.g., numbers of mineral grains, or geochemical data where the concentration of an element is a direct function of the number of rare mineral grains present, it is often possible to approach a normal distribution via a simple square root transformation. Instead of the original data, the square root of the result for each sample is studied and used for further data analysis. This reflects the fact that count data follow a Poisson distribution (Bartlett, 1947; Krumbein and Graybill, 1965; Weissberg, 1980), for which the square root transform is the generally accepted procedure for normalising the data.
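The effect of the square root transformation on count data can be sketched in a few lines of Python. The grain counts below are hypothetical values chosen only for illustration, and the helper `skewness` is an assumed name for a simple moment-based skewness measure; neither is taken from the Kola data.

```python
import math

def skewness(values):
    """Moment-based sample skewness; 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical grain counts per sample (illustrative values only).
counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 19]

# Square root transformation of every count.
sqrt_counts = [math.sqrt(c) for c in counts]

print(f"skewness of raw counts:         {skewness(counts):.2f}")
print(f"skewness of square root counts: {skewness(sqrt_counts):.2f}")
```

For right-skewed counts such as these, the transformed values should show a skewness closer to zero than the raw counts.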

10.2.2 Power transformation

Instead of applying the square root transformation, which is a special case of the power transformation (x^(1/2)), a transformation to any other power can sometimes be useful to approach a normal distribution. Even negative powers can be used, e.g., the inverse of the square root transformation (x^(−1/2)). In the special case of a power of zero the power transformation is defined as log(x). Power transformations require that the data are positive, which is the usual case in environmental sciences and applied geochemistry. If negative data are present, these must first be brought into a positive range via addition of a suitable constant. The choice of an appropriate power for the transformation is the objective of the Box–Cox procedure (see Section 10.2.4).
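A minimal sketch of such a power transformation is given below. The function name `power_transform` and its `shift` argument (the constant added to bring the data into a positive range) are assumptions made for this illustration, as are the example values.

```python
import math

def power_transform(values, power, shift=0.0):
    """Raise each shifted value to `power`; a power of zero is defined as log(x)."""
    shifted = [v + shift for v in values]
    if min(shifted) <= 0:
        raise ValueError("all values must be positive after adding `shift`")
    if power == 0:
        return [math.log(v) for v in shifted]
    return [v ** power for v in shifted]

data = [4.0, 16.0, 81.0, 256.0]  # illustrative positive values

print(power_transform(data, 0.5))    # square root transformation
print(power_transform(data, 0.25))   # power +1/4
print(power_transform(data, -0.5))   # inverse of the square root
print(power_transform(data, 0))      # special case: natural logarithm

# Data containing negative values are first shifted into a positive range.
print(power_transform([-3.0, 0.0, 5.0], 0.5, shift=4.0))
```

Note that a negative power reverses the order of the values: the largest original value becomes the smallest transformed value.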

Figure 10.2 shows as an example the original, strongly right-skewed distribution for Cr from the Kola C-horizon data set, and the effect of power transformations with powers of +1/4 and −1/4 on the distribution. Both transformations result in more symmetrical distributions; however, some skewness still exists.

Figure 10.2 Density trace for Cr of the Kola C-horizon soil data set. Upper diagram: original data (Cr [mg/kg]); middle diagram: power transformed data with power +1/4; lower diagram: power transformed data with power −1/4

10.2.3 Log(arithmic)-transformation

In geochemistry a log-transformation is often used to approach a normal distribution. A log-transformation will reduce the very high values and spread out the small data values and is thus well suited for right-skewed distributions. In practice there are two commonly used log-transformations, the transformation to the natural logarithm (ln) and the transformation to base 10 logarithms (log10). Some statistical data inspection techniques (e.g., the CP-plot) benefit from a logarithmic transformation. If the objective is to check whether data points follow a straight line, either a ln or a log10 transformation will suffice for a CP-plot. Otherwise, when working with environmental data, a transformation to the base 10 logarithms may be the better choice, because it is considerably easier to relate to the original data, as it is common knowledge that 0-1-2-3 on the log10-scale corresponds to 1-10-100-1000 in the original scale.
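The reason why either logarithm will suffice for checking a straight line is that the two differ only by a constant factor, ln(x) = ln(10) · log10(x). The following sketch checks this numerically and shows the back-transformation from the log10-scale; the concentrations and the helper name `back_transform` are assumptions for illustration only.

```python
import math

# Illustrative concentrations on the original scale (e.g. mg/kg).
concentrations = [1.0, 10.0, 100.0, 1000.0, 250.0]

log10_values = [math.log10(c) for c in concentrations]
ln_values = [math.log(c) for c in concentrations]

# ln(x) = ln(10) * log10(x): the two scales differ only by a constant factor,
# so a straight line on one scale is also a straight line on the other.
factor = math.log(10.0)
for lg, ln in zip(log10_values, ln_values):
    assert math.isclose(ln, factor * lg, abs_tol=1e-12)

def back_transform(log10_value):
    """Return a log10-scaled value to the original scale."""
    return 10.0 ** log10_value

for c, lg in zip(concentrations, log10_values):
    print(f"{c:8.1f} mg/kg -> log10 = {lg:.3f} -> back = {back_transform(lg):.1f}")
```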

Data transformations are also often carried out to make different variables more comparable. For example, when plotting Cr versus Cu for the Kola C-horizon soil data, the majority of points fall in the lower left corner of a plotting area that is dominated by a few extreme values. When the log-transformed values are plotted instead, the data points are spread over the whole plotting area, and the structure of the two-dimensional data can be studied in far greater detail (see Figure 10.3).

Figure 10.3 Scatterplot for Cu and Cr from the Kola Project C-horizon soil data set (both axes in mg/kg). Left diagram: original data scale; right diagram: log-scaled data

10.2.4 Box–Cox transformation

The algorithm for the Box–Cox transformation (Box and Cox, 1964) estimates the power that is most likely to result in a normal distribution when applied to the given data set. This is an extremely flexible transformation and, as noted above, includes a logarithmic transformation for the special case of the estimated power being zero. In reality the estimated power may not transform the data to perfect normality; however, it is the power that brings the data closest to normality. If the parameter is close to some specific value, for example −1, 0, 1/3, or 0.5, it is often sufficient to use a reciprocal (−1), logarithmic (0), cube root (1/3), or square root (0.5) transform, respectively. This is most appropriate when there is evidence that the underlying physical or chemical processes controlling the data are best modelled by a specific distribution normalised by that power.
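The estimation step can be sketched as follows. The sketch assumes the standard form of the transformation, (x^λ − 1)/λ for λ ≠ 0 and ln(x) for λ = 0, and selects the power by a simple grid search over the profile log-likelihood under a normal model. The function names, the search range of −2 to 2 and the example values are assumptions for illustration; this is not the implementation used in DAS+R or R.

```python
import math

def boxcox_transform(values, lam):
    """Box-Cox transform; lam = 0 corresponds to the log-transformation."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def boxcox_loglik(values, lam):
    """Profile log-likelihood of lam, assuming normal transformed data.

    Assumes the values are not all identical (variance > 0).
    """
    n = len(values)
    y = boxcox_transform(values, lam)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # maximum-likelihood variance
    log_sum = sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + (lam - 1.0) * log_sum

def estimate_power(values, lo=-2.0, hi=2.0, steps=400):
    """Grid search for the power with the highest profile log-likelihood."""
    if min(values) <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: boxcox_loglik(values, lam))

# Illustrative right-skewed values (not the Kola data).
data = [0.8, 1.1, 1.3, 1.6, 1.9, 2.2, 2.6, 3.1, 3.9, 5.2, 7.5, 12.0]

lam = estimate_power(data)
print(f"estimated power: {lam:.2f}")
print([round(v, 3) for v in boxcox_transform(data, lam)])
```

A grid search is used here only because it is easy to follow; a numerical optimiser would serve the same purpose. The estimated power can then be compared with the specific values mentioned above before deciding on the final transformation.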

As an example, the variable Fe_XRF from the Kola C-horizon soil data set can be used. The Box–Cox transformation estimates a power of 0.31. There is the choice of using a log-transform or a power transformation. Using a log-transformation, a Shapiro–Wilk test for normality (see Chapter 9) returns a p-value of 4 · 10⁻⁶, and thus the hypothesis that the data follow a normal distribution is rejected. Applying a power transformation with the value of 1/3 to the data, the Shapiro–Wilk test delivers a p-value of 0.057 and the hypothesis that the data follow a normal distribution cannot be rejected. Figure 10.4 shows this example in the form of density traces. The direct relation to the unit of the measurements is lost under power transformation. This can be overcome by an appropriate transformation of the scale of the x-axis (not routinely provided in DAS+R, but possible in R).

Figure 10.4 Density trace for Fe_XRF from the Kola C-horizon data set (x-axis: Fe_XRF [wt%]). Upper diagram: original data; middle diagram: log-transformed data; lower diagram: power transformed data with power 0.31

10.2.5 Logit transformation

To approach a normal distribution for proportional data, the logit transformation (Berkson, 1944) performs better than the log-transformation in many instances. It is particularly appropriate for proportions or probabilities as it opens up the data to an unbounded scale, though values of exactly zero (and, likewise, exactly one) are not permitted. To carry out a logit transformation, if the data are not already on a 0–1 scale, the data need to be transformed such that they fall into the range between 0 and 1. Data given in wt% thus need to be divided by 100; data given in mg/kg need to be divided by 1 000 000 and data given in µg/kg need to be divided by 10⁹. Each value is then divided by its inverse proportion (one minus the value) and the result is log-transformed, i.e. if the value is 0.1 the transform is ln(0.1/(1 − 0.1)); if the value is 0.3 the transform is ln(0.3/(1 − 0.3)).
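The two steps, rescaling to the 0–1 range and taking the logarithm of the ratio, can be sketched as below. The names `UNIT_DIVISORS`, `logit` and `inverse_logit` are assumptions introduced for this illustration, and the back-transformation is added only to show how transformed values can be returned to a proportion.

```python
import math

# Divisors that bring common concentration units onto a 0-1 scale.
UNIT_DIVISORS = {"proportion": 1.0, "wt%": 100.0, "mg/kg": 1e6, "ug/kg": 1e9}

def logit(value, unit="proportion"):
    """Logit transform ln(p / (1 - p)) after rescaling `value` to a proportion p."""
    p = value / UNIT_DIVISORS[unit]
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for proportions strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(y):
    """Return a logit-transformed value to the 0-1 proportion scale."""
    return 1.0 / (1.0 + math.exp(-y))

print(logit(0.1))             # ln(0.1 / (1 - 0.1))
print(logit(0.3))             # ln(0.3 / (1 - 0.3))
print(logit(10.0, "wt%"))     # 10 wt% is the proportion 0.1
print(logit(250.0, "mg/kg"))  # 250 mg/kg is the proportion 0.00025
print(inverse_logit(logit(0.3)))
```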

Because the relation to the original data is lost, this transformation is not favoured for applied geochemical and environmental data. In addition, when working with skewed data (as is often the case in environmental sciences), the calculation of the inverse proportion will increase the skewness of the distribution, which is not a desirable result. The logit transformation will perform much better than the log-transformation when dealing with uniformly distributed data, or data that are expressed as proportions or probabilities.

In document Statistical Data Analysis Explained (Page 191-194)