Comparing Data Using Statistical Tests

9.1.1 The Kola data set and the normal or lognormal distribution

In the scientific literature it is often stated that geochemical data follow a lognormal distribution.

This has been questioned by a number of scientists, both more than 40 years ago (see, e.g., Aubrey, 1954, 1956; Chayes, 1954; Miller and Goldberg, 1955; Vistelius, 1960) and quite recently by Reimann and Filzmoser (2000). The Shapiro–Wilk test can be applied to the Kola data sets to test for normal data distributions of the untransformed and the log-transformed values. Table 9.1 provides the resulting p-values for the Shapiro–Wilk tests for the 24 variables where more than 95 per cent of all data in all four materials returned analytical results above the respective limits of detection.
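The procedure described above, testing each variable once on the untransformed values and once on the log-transformed values, can be sketched as follows. This is a minimal illustration in Python using `scipy.stats.shapiro`, not the code used to produce Table 9.1. The helper name `shapiro_raw_and_log`, the synthetic data, and the sample size of 600 are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def shapiro_raw_and_log(values):
    """Return Shapiro-Wilk p-values for the raw and the log10-transformed values."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log transform needs strictly positive values")
    _, p_norm = stats.shapiro(x)               # H0: values are normal
    _, p_lognorm = stats.shapiro(np.log10(x))  # H0: values are lognormal
    return p_norm, p_lognorm


# Synthetic stand-in for one element in one sample material (not Kola data).
rng = np.random.default_rng(42)
concentrations = rng.lognormal(mean=1.0, sigma=1.0, size=600)

p_norm, p_lognorm = shapiro_raw_and_log(concentrations)
print(f"SW norm:    p = {p_norm:.3f}")
print(f"SW lognorm: p = {p_lognorm:.3f}")
```

A small p-value rejects the tested distribution. A large p-value only means that the test found no evidence against it.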

Table 9.1 Resulting p-values of the Shapiro–Wilk (SW) test for normal (norm) and lognormal (lognorm) distribution for 24 variables in the four sample materials collected for the Kola Project

      Moss              O-horizon         B-horizon         C-horizon
      SW       SW       SW       SW       SW       SW       SW       SW
      norm     lognorm  norm     lognorm  norm     lognorm  norm     lognorm
Ag    0        0        0        0.048    0        0.001    0        0
Al    0        0        0        0.001    0        0.228    0        0.003
As    0        0        0        0        0        0        0        0
Ba    0        0        0        0.025    0        0.022    0        0.002
Bi    0        0        0        0.062    0        0        0        0
Ca    0        0        0        0        0        0        0        0
Cd    0        0        0        0.002    0        0        0        0
Co    0        0        0        0        0        0        0        0.840
Cr    0        0        0        0        0        0        0        0
Cu    0        0        0        0        0        0.840    0        0.257
Fe    0        0        0        0        0        0.004    0        0.269
K     0        0.001    0        0        0        0        0        0.001
Mg    0        0.006    0        0        0        0.028    0        0
Mn    0        0        0        0.066    0        0        0        0
Mo    0        0        0        0        0        0        0        0
Na    0        0        0        0        0        0        0        0
Ni    0        0        0        0        0        0.239    0        0.001
P     0        0        0        0        0        0.130    0        0
Pb    0        0        0        0        0        0        0        0
S     0        0        0        0        0        0        0        0
Sr    0        0        0        0        0        0        0        0
Th    0        0        0        0        0        0        0        0
V     0        0        0        0        0        0        0        0.841
Zn    0        0.587    0        0        0        0.446    0        0.012

Table 9.1 demonstrates that for the Kola data set the vast majority of variables in all four layers do not follow a normal or lognormal distribution. This is not really surprising. Environmental/geochemical data are spatially dependent, and spatially dependent data almost never show a normal or lognormal distribution. Furthermore, environmental/geochemical data are based on rather imprecise measurements (see Chapter 18). There are many sources of error involved in sampling, sample preparation, and analysis. Trace element analyses are often plagued by detection limit problems, i.e. a substantial number of samples are not characterised by a real measured value. In addition, the precision of the measurements changes with element concentration, i.e. values are less precise at very low and very high concentrations. The existence of data outliers, in most cases the existence of some samples with unusually high concentrations, is a very common characteristic of such data sets. They are thus strongly skewed. Even worse, these outliers originate from (a) different population(s) than the main body of data. Not only do the outliers originate from another population, in the majority of cases there exist several different primary and secondary factors that influence the chemical composition of the samples at every sample site. Primary factors include geology, the existence of mineralisation and/or contamination sources in an area, vegetation zones and plant communities, and distance to coast and topography. Examples of secondary factors having influence on the chemical composition of a soil sample include pH, the amount of organic material, the amount and exact composition of Fe-Mn-oxyhydroxides, and the grain size distribution. These factors will change from site to site. The emerging empirical data distribution is thus a result of a mixture of several unspecified populations, and it should not be surprising that the data fail formal tests for normality. What is interesting is that data often best mimic a lognormal distribution, and for those interested in this observation Vistelius (1960) is the classic publication.
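The effect of mixing populations can be illustrated with a small simulation. The sketch below is not based on Kola data: it assumes a hypothetical lognormal background population and a smaller, hypothetical second population with higher concentrations, and all parameters are invented for the example. Each population alone is lognormal by construction, but the log-transformed mixture is right-skewed and the Shapiro–Wilk test rejects it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical background population: lognormal concentrations.
background = rng.lognormal(mean=1.0, sigma=0.5, size=540)
# Hypothetical second population, e.g. samples near a contamination source.
anomalous = rng.lognormal(mean=3.0, sigma=0.5, size=60)
mixture = np.concatenate([background, anomalous])

# Shapiro-Wilk on the log values tests for a lognormal distribution.
_, p_background = stats.shapiro(np.log(background))
_, p_mixture = stats.shapiro(np.log(mixture))
skew_mixture = stats.skew(np.log(mixture))

print(f"background alone: p = {p_background:.3f}")
print(f"mixture:          p = {p_mixture:.3g}")
print(f"skewness of log(mixture) = {skew_mixture:.2f}")
```

With real survey data the contributing populations are unknown, so this kind of check can only show that a mixture is sufficient to make the test fail, not which factors are responsible.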

Cobalt as measured in the C-horizon of the Kola data set was one of the few elements where the tests indicate the data were drawn from a lognormal distribution (Section 9.1 and Table 9.1). Figure 9.2 shows the density trace for a number of lithology-related data subsets in C-horizon soil samples collected on top of the five main lithologies in the survey area (see geological map, Figure 1.2). The density trace of the Co distribution for all the selected subsets (Figure 9.2, left) clearly suggests a lognormal distribution. The Shapiro–Wilk test delivers a p-value of 0.88 (note, not 0.84 as in Table 9.1, as this test is for the combined data from only five lithologies). However, when the density traces of the different subsets are plotted separately (Figure 9.2, right) several of the selected subsets clearly deviate from a lognormal distribution.

A data set that passes a test for lognormality with a high p-value can thus still consist of groups with very different data structures.
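Checking the pooled data and each subset separately, as done for Co above, could be sketched as follows. The helper `lognormality_by_group` is a hypothetical name, and the values are synthetic. Only the group labels are borrowed from the lithology codes of Figure 9.2, so the printed p-values say nothing about the Kola data.

```python
import numpy as np
from scipy import stats


def lognormality_by_group(groups):
    """Shapiro-Wilk p-values for log10 values, per group and pooled.

    `groups` maps a label (here a lithology code) to an array of strictly
    positive concentrations. The key "pooled" holds the combined result.
    """
    logs = {label: np.log10(np.asarray(v, dtype=float)) for label, v in groups.items()}
    p_values = {label: stats.shapiro(x)[1] for label, x in logs.items()}
    p_values["pooled"] = stats.shapiro(np.concatenate(list(logs.values())))[1]
    return p_values


# Synthetic values only: medians, spreads and sample sizes are invented.
rng = np.random.default_rng(1)
groups = {
    9: rng.lognormal(mean=1.5, sigma=0.6, size=80),
    20: rng.lognormal(mean=2.0, sigma=0.4, size=120),
    32: rng.lognormal(mean=2.8, sigma=0.5, size=60),
    51: rng.lognormal(mean=2.4, sigma=0.7, size=50),
    52: rng.lognormal(mean=3.0, sigma=0.3, size=70),
}

for label, p in lognormality_by_group(groups).items():
    print(f"{label}: p = {p:.3f}")
```

Comparing the pooled p-value with the per-group values, together with density traces of each group, shows whether an apparently lognormal whole hides subsets with different structure.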

[Figure 9.2: two panels, each with x-axis "Co in C-horizon [mg/kg]" and y-axis "Density"; legend: lithologies 9, 20, 32, 51, 52]

Figure 9.2 Density traces for a number of lithology-related subsets of the Co C-horizon soil data, Kola Project. Left: one density trace plotted using all selected samples; right: density traces for each of the selected subsets. Lithologies: 9, sedimentary rocks; 20, felsic gneisses; 32, mafic granulites; 51, andesites; 52, basalts (see Figure 1.2)

Testing a wide variety of different regional geochemical data sets, Reimann and Filzmoser (2000) came to the conclusion that the data almost never follow a normal or lognormal distribution. In the majority of cases a data transformation (e.g., log, ln, logit, square root, range or Box–Cox; see Chapter 10) did not result in a normal distribution. The vast majority of classical statistical methods are based on the assumption that the data follow a normal distribution. When using them with non-normally distributed data, one should be very aware that this could give biased, or even erroneous, results. Data outliers have no, or only minimal, influence on robust methods, and non-parametric methods are not based on such strict model assumptions. These are thus preferable to the classical methods. In any case, a thorough graphical exploratory data analysis and documentation of geochemical and environmental data sets is an absolute necessity before moving on to more advanced statistical methods.
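The difference between classical and robust estimators can be seen in a short example. The sketch below uses only the Python standard library and invented numbers. It compares the mean and standard deviation with the median and the median absolute deviation (MAD, here scaled by the usual factor 1.4826 so that it is comparable to the standard deviation for normal data) before and after two high outliers are added.

```python
import statistics


def mad(values):
    """Median absolute deviation, scaled by 1.4826 for comparability with
    the standard deviation of normally distributed data."""
    values = list(values)
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(v - med) for v in values)


# Invented concentrations; the last two values of `contaminated` are outliers.
clean = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14]
contaminated = clean + [150, 400]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{name:>12}: mean = {statistics.mean(data):6.2f}, "
        f"sd = {statistics.stdev(data):7.2f}, "
        f"median = {statistics.median(data):5.2f}, "
        f"MAD = {mad(data):5.2f}"
    )
# The mean moves from 11 to 55, while the median only moves from 11 to 11.5.
```

The two outliers dominate the classical estimates but barely change the robust ones, which is the behaviour the text refers to.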
