Statistical methods to identify extreme values and data outliers

Defining Background and Threshold, Identification of Data Outliers and Element Sources

7.1 Statistical methods to identify extreme values and data outliers

7.1.1 Classical statistics

Statisticians use mean and standard deviation to identify extreme values of the normal distribution (see Chapter 7). This is a well-established and very reliable method if independent data are drawn from a normal distribution.

Table 7.1 Lower and upper limits for extreme values estimated using the “MEAN ± 2 · SD” rule for Cu in the Kola Project O-horizon data. Results are shown when using the non-transformed and the log-transformed values for the calculation. The log-transformed values have been back-transformed to the original data scale

              [MEAN − 2 · SD]         [MEAN + 2 · SD]
              original   log-based    original   log-based
Cu            −451       1.8          540        98
Cu-Finland    2.3        3.9          12         13
Cu-Norway     −31        2.6          56         32
Cu-Russia     −626       2.5          790        213

The distribution of geochemical data has been repeatedly discussed in the literature (for a recent paper see, for example, Reimann and Filzmoser, 2000). Many of the classical statistical methods (not only “MEAN ± 2 · SD”) require that the data follow a normal distribution or, as a minimum, approach symmetry. In the previous chapters it was demonstrated that geochemical data are usually strongly right skewed and are characterised by the existence of outliers. Skew and outliers will have a strong influence on location (central value) and spread of a data set.

Table 7.1 shows the estimated threshold values calculated for the variable Cu as measured in the Kola Project O-horizon data set based on the classical statistical parameters. Because the data distribution is strongly skewed, the resulting values are clearly biased. To overcome the right skew, the data could be log-transformed. Values for MEAN and SD can then be calculated for the log-transformed data; the boundaries can be calculated using these estimates, and back-transformed to the original data scale. All the statistical parameters have changed dramatically, and the estimate of threshold is now much lower. But is it correct? A statistical test for data normality (see Chapter 9) will show that the variable Cu in the Kola Project O-horizon data set follows neither a normal nor a lognormal distribution (see Table 9.1). This provides a strong reason to be suspicious of calculated threshold values. Due to the strong right skew of the data, the calculation of the limits for the lower outlier boundaries, “thresholds”, without the use of a log-transformation results in negative estimates. But negative concentrations do not exist in geochemistry, which further highlights the problem with this approach.
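The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Table 7.1: the function names, the use of base-10 logarithms, the sample standard deviation, and the data values are all assumptions made for the example, and the numbers are invented rather than taken from the Kola Project.

```python
import math
import statistics

def mean_sd_limits(values, k=2.0):
    """Return (lower, upper) = MEAN -/+ k * SD for the values as given."""
    m = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumption)
    return m - k * s, m + k * s

def log_mean_sd_limits(values, k=2.0):
    """Apply the rule to log10 values, then back-transform the limits."""
    lo, hi = mean_sd_limits([math.log10(v) for v in values], k)
    return 10 ** lo, 10 ** hi

# Hypothetical right-skewed concentrations, NOT Kola Project data.
cu = [5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 20, 35, 50, 90, 400]

raw_lo, raw_hi = mean_sd_limits(cu)
log_lo, log_hi = log_mean_sd_limits(cu)
print(f"original:  {raw_lo:.1f} to {raw_hi:.1f}")
print(f"log-based: {log_lo:.1f} to {log_hi:.1f}")
```

For these invented values the lower limit on the original scale is negative, while the back-transformed limits are both positive and the upper limit is lower, which is the same pattern the text describes for the skewed Cu data.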

The problem of using the above statistical formula to identify extreme values and thus the upper and lower boundaries of the background variation for spatial (applied geochemical and environmental) data is obvious when treating the three countries as data subsets. Calculating MEAN, SD, and outlier boundaries for each of the countries using the above approach (non-transformed versus log-transformed values) results in totally different estimates for each country. What is the most reliable estimate of “background” for the survey area? Is it justified to base regulatory action or guidance values on any of these estimates?

7.1.2 The boxplot

As demonstrated in Chapter 3 the boxplot as defined by Tukey (1977) will automatically identify extreme values. However, it has also been discussed that the calculation of the Tukey outlier boundaries requires data symmetry and is based on the assumption of underlying normality (Section 3.5.2). Thus, before using the boxplot to identify data outliers, the data distribution needs to be studied in some detail and the required data transformation to approach distributional symmetry carried out. When this is done, the boxplot should be better suited to identify data outliers in geochemical data. Table 7.2 compares outlier boundaries as obtained when using the boxplot for the original and for log-transformed Cu data. Using the log-transformed data and back-transforming the resulting boundaries to the original data scale is equivalent to using the log-boxplot (Section 3.5.2) on the original data. Due to the construction rules for the boxplot, negative lower boundaries for extreme values are now impossible (the lower whisker is taken to the lowest original data value). The values for the upper whisker are often considerably higher when using the log-boxplot (see Section 3.5.2).

Table 7.2 Lower and upper boundaries for extreme values for Cu in the Kola Project O-horizon data as obtained with the Tukey boxplot (non-transformed data) and the log-boxplot (data log-transformed to approach symmetry and resulting boundaries for extreme values back-transformed into the original data scale)

              Lower whisker           Upper whisker
              original   log-based    original   log-based
Cu            2.7        2.7          35         76
Cu-Finland    2.7        3.9          11.9       13
Cu-Norway     4.5        4.5          17         24
Cu-Russia     3.1        3.1          72         185
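A minimal sketch of the whisker calculation, on the original and on the log scale, is given here in Python. Several details are assumptions made for the illustration: quartiles from `statistics.quantiles` with the "inclusive" method stand in for Tukey's hinges, the usual factor of 1.5 times the hinge width defines the fences, base-10 logarithms are used, and the data are invented rather than Kola Project values.

```python
import math
import statistics

def fences(x, k=1.5):
    """Inner fences Q1 - k*IQR and Q3 + k*IQR on the scale of x."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def whiskers(values, log=False, k=1.5):
    """Whisker ends: the most extreme data values inside the fences.

    With log=True the fences are computed on log10 values, which
    corresponds to the log-boxplot; the whisker ends are reported
    as original data values.
    """
    scale = math.log10 if log else float
    lo_fence, hi_fence = fences([scale(v) for v in values], k)
    inside = [v for v in values if lo_fence <= scale(v) <= hi_fence]
    return min(inside), max(inside)

# Hypothetical right-skewed concentrations, NOT Kola Project data.
cu = [5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 20, 35, 50, 90, 400]

print("original: ", whiskers(cu))            # (5, 50)
print("log-based:", whiskers(cu, log=True))  # (5, 90)
```

Because the whisker ends are always actual data values, the lower boundary cannot become negative, and for these invented data the upper whisker moves outward once the fences are computed on the log scale.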

Compared to the boundaries calculated above using the “MEAN ± 2 · SD” rule, only the upper threshold for the country subset “Finland” remains the same when using log-transformed data in both cases. Again the three country subsets all provide different results.

7.1.3 Robust statistics

To better accommodate the special properties of applied geochemical data it is possible to replace MEAN and SD in the “MEAN ± 2 · SD” formula by robust estimators for the location (central value) and spread (see Sections 4.1 and 4.2). When, for example, using MEDIAN and MAD, the resulting formula for identifying extreme values is “MEDIAN ± 2 · MAD”.

However, the definition of the constant used in estimating the MAD is based on the assumption of a normal data distribution of the core (inner 50 percent) of the data. Thus the data distribution must first be checked for symmetry of the majority of the data before performing any calculations of the boundaries for extreme values, even when using robust estimators.
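The robust rule can be sketched as follows. The constant 1.4826 is the commonly used scaling factor that makes the MAD comparable to the SD for normally distributed data; the function names, the base-10 logarithm, and the data values are assumptions for this illustration and are not taken from the Kola Project.

```python
import math
import statistics

# Scaling constant that makes the MAD comparable to the SD when the
# core of the data is normally distributed.
MAD_CONSTANT = 1.4826

def median_mad_limits(values, k=2.0):
    """Return (lower, upper) = MEDIAN -/+ k * MAD for the values as given."""
    med = statistics.median(values)
    mad = MAD_CONSTANT * statistics.median([abs(v - med) for v in values])
    return med - k * mad, med + k * mad

def log_median_mad_limits(values, k=2.0):
    """Apply the rule to log10 values, then back-transform the limits."""
    lo, hi = median_mad_limits([math.log10(v) for v in values], k)
    return 10 ** lo, 10 ** hi

# Hypothetical right-skewed concentrations, NOT Kola Project data.
cu = [5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 20, 35, 50, 90, 400]

raw_lo, raw_hi = median_mad_limits(cu)
log_lo, log_hi = log_median_mad_limits(cu)
print(f"original:  {raw_lo:.1f} to {raw_hi:.1f}")
print(f"log-based: {log_lo:.1f} to {log_hi:.1f}")
```

Note that the single very high value in these invented data has no effect on either the median or the MAD, which is why the limits stay close to the bulk of the data.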

The estimates obtained for the upper and lower outlier boundaries (Table 7.3) are very different from the estimates obtained when using the classical parameters (Table 7.1) or the boxplot (Table 7.2). Differences between estimates using the original and the log- and back-transformed values are now much smaller. The reason is that the MAD is much less influenced by skewed data than the SD and even less than the hinge width (IQR) used in the boxplot. The “MEDIAN ± 2 · MAD” rule delivers the lowest threshold values of all techniques discussed so far. Reimann et al. (2005) have shown that the percentage of extreme values is usually overestimated when using this rule. Again, different estimates for the threshold will be obtained when location or size of the survey area change.

Table 7.3 Lower and upper limits for extreme values calculated using the “MEDIAN ± 2 · MAD” rule for Cu in the Kola Project O-horizon data. Results are shown when using the non-transformed and the log-transformed values for the calculation. The log-transformed values have been back-transformed to the original data scale

              [MEDIAN − 2 · MAD]      [MEDIAN + 2 · MAD]
              original   log-based    original   log-based
Cu            −0.6       2.8          20         33
Cu-Finland    3.2        4.0          10         11
Cu-Norway     2.7        3.9          13         15
Cu-Russia     −5.3       3.8          40         79

7.1.4 Percentiles

The uppermost 2 percent, 2.5 percent, or 5 percent of the data (the uppermost extreme values) are sometimes arbitrarily defined as “outliers” for further inspection. This will result in the same percentage of extreme values for all measurements. This approach is not necessarily valid, because the real percentage of extreme values could be very different. In a data distribution derived from natural processes, there may be no extreme values at all, or, in the case of multiple natural background processes, there may appear to be outliers in the context of the main mass of the data that are not in fact outliers in the context of the background process with the highest levels. However, in practice the percentile approach delivers a number of samples for further inspection that can easily be handled. In some cases environmental regulators have used the 98th percentile of background data as a more sensible inspection level (threshold) than values calculated by the “MEAN ± 2 · SD” rule (see, e.g., Ontario Ministry of Environment and Energy, 1993). The remaining problem is that even percentiles will change with size or location of the survey area.

Table 7.4 shows the 2nd and 98th percentiles for Cu in the Kola Project O-horizon data set for all data and for the three countries. Because percentiles are based solely on sorting the data, log- and back-transformation will not change the resulting percentiles (see Section 4.3).
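The percentile approach, and its insensitivity to a log-transformation, can be illustrated with a short Python sketch. The nearest-rank definition of a percentile is assumed here because it always returns an actual data value, so the invariance holds exactly; percentile definitions that interpolate between neighbouring values may give slightly different results after back-transformation. The function name and the data are hypothetical.

```python
import math

def percentile(values, p):
    """Empirical percentile by the nearest-rank rule: the smallest data
    value with at least p percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical right-skewed concentrations, NOT Kola Project data.
cu = [5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 20, 35, 50, 90, 400]
log_cu = [math.log10(v) for v in cu]

p02, p98 = percentile(cu, 2), percentile(cu, 98)
print("2nd and 98th percentile:", p02, p98)

# Sorting is unaffected by a monotone transformation, so the percentile
# of the log values is simply the log of the percentile.
print(percentile(log_cu, 98) == math.log10(p98))
```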

The results for the lower boundary for extreme values are of course again within the data range and are actually the highest estimates of all the techniques presented so far. Results for the upper threshold are high when compared to the other techniques. The exception is for the data from Finland. These provide a relatively stable upper threshold for all techniques. This indicates that the distribution of Cu in the Finnish samples is quite symmetrical and does not contain more than two percent upper extreme values.

Table 7.4 2nd and 98th percentiles of Cu from the Kola Project O-horizon data set. To highlight the inherent problem of working with spatial data the 2nd and 98th percentiles for the three country subsets are also shown

              2nd percentile   98th percentile
Cu            4.7              248
Cu-Finland    4.4              13
Cu-Norway     4.7              62
Cu-Russia     6.9              478

7.1.5 Can the range of background be calculated?

As demonstrated above, several methods are available for calculating the range of geochemical background. When applied, all result in quite different estimates for the same variable. When changing size or location of the survey area, again all estimates will change. Reimann et al. (2005) studied in detail the different methods and their statistical behaviour with simulated and real data. The conclusion was reached that, given the special properties of geochemical data and that in fact real data outliers and not “extreme values of the normal distribution” are being sought, the “MEDIAN ± 2 · MAD” rule and the boxplot (if a suitable data transform is used) will provide the most reliable calculated results.

It should be noted that in the literature on robust statistics many other approaches for outlier detection have been proposed (see, e.g., Huber, 1981; Rousseeuw and Leroy, 1987; Barnett and Lewis, 1994; Dutter et al., 2003; Maronna et al., 2006). There are also multivariate methods for outlier detection (e.g., Chapter 13) and mixture decomposition methods have also been proposed as a solution (see, e.g., Graf and Henning, 1952; Carral et al., 1995; Neykov et al., 2007). However, none of these methods is able to solve the problem of working with spatial data. Estimated values will always change depending on size and location of the survey area.

Applied geochemical data are not the well-behaved data from classical investigations, where statistical parameters will improve with the number of samples; they may in fact change dramatically when the survey boundaries are changed.

Percentiles based on the empirical data have the advantage that they will identify the uppermost (lowermost) two percent (or five percent) of the data as “unusual” without being based on any distributional model. They will result in the identification of a reasonable number of samples that will need further investigation. These samples may, or may not, be true outliers.

However, in a highly mineralised or contaminated survey area there may in fact be far more than, for example, five percent outliers, e.g., Kola Project mosses or O-horizon soils.

Considering the above results, it is clearly inappropriate that the estimation of background for applied geochemical data should be solely based on such calculations when the values are to be used for regulatory purposes. However, the resulting values may still provide useful estimates for data comparison purposes.
