Comparing Data in Tables and Graphics

Data analysis often starts with some kind of comparison of data. It may be desired to compare data collected in one area with data collected from another. It may be necessary to compare project data for a certain sample material with “world average data” for that same material.

It may be important to compare the data obtained from a certain area five years ago with data recently obtained from the same area. Typical questions include the following: “are there differences”, “which factors can cause such differences”, and last but not least a “statistical” question: “how significant are the differences”? This latter question is addressed in Chapter 9.

Often much can be learned about the processes that cause a certain data behaviour if the data can be divided into subsets that differ in a known property. In the case of the “complete” Kola data set there exist, for example, four different sample materials, each carefully selected to represent a certain part of the ecosystem. Moss primarily reflects the atmospheric input of elements in the survey area. The O-horizon reflects the interplay between the atmosphere, the biosphere and the pedosphere (see Chapter 1). The B-horizon is influenced by many soil-forming processes and is representative of the pedosphere. The C-horizon is weathered soil parent material, and its composition represents the bedrock in the survey area, the lithosphere (for a more complete discussion see Chapter 1). The project was designed so that the chemical composition of these four materials can be compared directly in tables and graphics. Such comparisons facilitate a better understanding of the cycling and fate of elements in and between the different compartments of the ecosystem.

8.1 Comparing data in tables

Because it is not usually possible to compare large data sets data item by data item, it is necessary to employ some kind of data summary. Most “classical” methods for comparing data populations are built around the MEAN and the SD, and thus the “location” or “central value” and “spread” of each variable. Most data tables summarising a data set report at least these two parameters, hopefully together with the data set size. It can also be informative to report the minimum (MIN) and maximum (MAX) values observed and certain percentiles of the data set (see summary table in Section 4.6). It has been demonstrated that MEAN and SD are usually not good descriptors for applied geochemical and environmental data. Thus a better choice for reporting in a table is MEDIAN and MAD; MEAN and SD can be provided in addition because so many scientists still use these classical measures and might want to make data set comparisons. A large difference between the MEDIAN and MEAN shows that the data are strongly skewed and/or contain some extreme values that bias the estimates of the MEAN and SD (environmental data will usually show a higher MEAN than MEDIAN, indicating a right skew). In the simplest case, a table compares the averages for a project's sample material with averages of the same material collected elsewhere.

Statistical Data Analysis Explained: Applied Environmental Statistics with R. C. Reimann, P. Filzmoser, R. G. Garrett, R. Dutter © 2008 John Wiley & Sons, Ltd. ISBN: 978-0-470-98581-6

Table 8.1 shows an example of such a table using a selection of the elements from the Kola Project C-horizon data (<2 mm fraction, aqua regia extraction). These Kola Project data are compared to the MEDIAN of soil samples collected in England & Wales (<2 mm fraction, “sub-soil”, aqua regia extraction; McGrath and Loveland, 1992), to the MEDIAN of soil samples from the Baltic Soil Survey (BSS), covering ten northern European countries (<2 mm fraction, B-/C-horizon, aqua regia extraction; Reimann et al., 2003), and to the world average for soils (as provided in Reimann and de Caritat, 1998). The latter value is not based on measurements of soil collected all over the world but is an “estimate” of the most likely average of total concentration of these elements in soils based on limited data from soil surveys from different parts of the world. The table allows the reader to judge whether the project data are within a likely range and to note some major differences between the data sets.

Table 8.1 also demonstrates one of the problems with the approach; there are often elements where information is missing in one or another data set. Looking at Table 8.1 it is apparent that the world average values for soil are often much higher than the values found in any of the other data sets. This is in part caused by the fact that the world soil values are provided for “total” concentrations of the elements and not for an aqua regia extraction. The very high value for S in the world soils indicates that they represent agricultural soils, taken at the earth surface, while all other projects collected soils at a certain depth. The table also indicates that the As and Pb concentrations in the C-horizon soils from the Kola Project area are unusually low. A first reaction would be to check the quality control results for these two elements (see Section 18.2) to be sure that the data are correct. Once this is established one could start to consider why these elements might show such low concentrations in the survey area.

Table 8.1 MEDIAN (mg/kg) of some selected elements from the Kola C-horizon data set in comparison to soil average values as cited in literature

Element  Unit   Kola C-horizon  England & Wales  BSS     World soil
                MEDIAN          MEDIAN           MEDIAN  AVERAGE
Ag       mg/kg  0.008           NA               NA      0.07
As       mg/kg  0.5             NA               2       5
Bi       mg/kg  0.03            NA               0.05    0.3
Cd       mg/kg  0.02            0.7              0.06    0.3
Co       mg/kg  7               9.8              5       10
Cu       mg/kg  16              18               7       25
Ni       mg/kg  19              23               9       20
Pb       mg/kg  1.6             40               5       17
S        mg/kg  30              NA               62      800
Th       mg/kg  6               NA               5.4     9.4

NA: not available.
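The comparison in Table 8.1, including its missing entries, can be sketched in a few lines of Python. This is only an illustration: the function name `ratio_to_reference` and the use of `None` for "NA" are assumptions made here, and only a few rows of Table 8.1 are copied in.

```python
# MEDIAN values (mg/kg) copied from Table 8.1; None stands for "NA".
kola_c_horizon = {"Ag": 0.008, "As": 0.5, "Cd": 0.02, "Cu": 16, "Pb": 1.6}
bss = {"Ag": None, "As": 2, "Cd": 0.06, "Cu": 7, "Pb": 5}


def ratio_to_reference(project, reference):
    """Return project/reference per element, or None where the reference is missing."""
    ratios = {}
    for element, value in project.items():
        ref = reference.get(element)
        ratios[element] = None if ref is None else value / ref
    return ratios


ratios = ratio_to_reference(kola_c_horizon, bss)
for element, ratio in ratios.items():
    shown = "NA" if ratio is None else f"{ratio:.2f}"
    print(f"{element}: Kola/BSS = {shown}")
```

Ratios well below 1, as for As and Pb, flag the elements whose quality control results are worth checking first; elements with a missing reference value stay marked as NA rather than being dropped silently.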

Table 8.2 Comparison of some selected elements in the Kola Project O- and C-horizon data sets

Element  Number of samples  MEAN (mg/kg)    MEDIAN (mg/kg)  SD (mg/kg)      MAD (mg/kg)
         O-hor    C-hor     O-hor   C-hor   O-hor   C-hor   O-hor   C-hor   O-hor   C-hor
Ag       617      606       0.283   0.011   0.2     0.008   0.325   0.011   0.160   0.0044
As       617      606       1.60    1.25    1.16    0.5     2.49    2.35    0.460   0.445
Bi       617      606       0.186   0.049   0.159   0.026   0.113   0.164   0.076   0.021
Cd       617      606       0.327   0.029   0.303   0.024   0.150   0.020   0.114   0.010
Co       617      606       3.1     8.2     1.6     7       7.1     5.03    1.11    3.71
Cu       617      606       44      22      9.7     16.2    246     18      5.15    10.8
Ni       617      606       51      23      9.2     18.7    199     21      7.74    11.6
Pb       617      606       24      2.75    18.8    1.6     49      3.3     7.41    0.741
S        617      606       1550    41      1530    30      334     43      297     17.8
Th       617      606       0.571   7.9     0.345   6.5     0.93    6.2     0.254   3.71

A serious shortcoming of just comparing MEDIAN values is that information about the spread of the data and the number of samples that support the reported MEDIAN value is missing. However, in the published literature this information is often not provided. For the Kola data the original data sets are available. For comparing the element concentrations in the collected sample materials it is thus possible to provide the information that is needed for a statistically more appropriate first comparison. Table 8.2 compares MEAN, MEDIAN, SD and MAD for the O- and C-horizon samples. Even such a small table, covering just 10 elements (out of 50 that could be compared), quickly becomes complex and requires careful attention (and good editing and presentation of the table) in order to extract the required information.

In Table 8.2 it is apparent that for some elements there are large differences between MEAN and MEDIAN (even more so between SD and MAD). The elements Co, Cu and Ni in the O-horizon are prominent examples. All three are emitted by the Russian nickel industry, and the existence of a large number of upper outliers in the data sets causes extremely right-skewed data distributions. MEAN and SD should thus not be used for these data. Table 8.2 also shows that it is difficult to format such a table in a way that the inherent information can be retrieved at a glance.
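The effect of upper outliers on the classical and robust summaries can be illustrated with a minimal Python sketch. The concentrations below are hypothetical and are not Kola data. The scaling constant 1.4826 for the MAD is an assumption; it is the usual choice that makes the MAD comparable to the SD for normally distributed data, and several C-horizon MAD values in Table 8.2 are consistent with it.

```python
import statistics


def mad(values, scale=1.4826):
    """Median absolute deviation from the median, multiplied by `scale` (assumed constant)."""
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    return scale * statistics.median(deviations)


def summarise(values):
    """Classical (MEAN, SD) and robust (MEDIAN, MAD) summaries side by side."""
    return {
        "n": len(values),
        "MEAN": statistics.mean(values),
        "SD": statistics.stdev(values),
        "MEDIAN": statistics.median(values),
        "MAD": mad(values),
    }


# Hypothetical concentrations in mg/kg: nine background values, then one extreme value added.
background = [8, 9, 9, 10, 10, 10, 11, 11, 12]
with_outlier = background + [500]

clean = summarise(background)
skewed = summarise(with_outlier)
for name in ("MEAN", "SD", "MEDIAN", "MAD"):
    print(f"{name:6s} background={clean[name]:.2f}  with outlier={skewed[name]:.2f}")
```

In this toy example a single extreme value moves the MEAN and the SD substantially, while the MEDIAN and the MAD are unchanged, which is the pattern seen for Co, Cu and Ni in the O-horizon.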

In Table 8.3 the values for MEAN and SD are removed and the robust coefficient of variation (CVR) is added instead, to obtain a measure of spread that is independent of the unit and range of the data. To make it easier to “see” the differences between the two sample materials and the elements, the ratio between the MEDIANS for O- and C-horizon samples was calculated. It becomes apparent that there are large differences in these ratios between the different elements. To further improve the “readability” of such a table it is also possible to sort the variables according to one of these ratios.
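The derived columns can be recomputed from MEDIAN and MAD alone. The sketch below assumes that CVR is MAD divided by MEDIAN, expressed as a fraction; this is inferred from the tabulated values rather than stated in this section. The input numbers are copied from Table 8.2 for three elements, and the dictionary layout and function name are illustrative only.

```python
# (MEDIAN, MAD) in mg/kg for the O- and C-horizon, copied from Table 8.2.
summary = {
    "Ag": {"O": (0.2, 0.160), "C": (0.008, 0.0044)},
    "As": {"O": (1.16, 0.460), "C": (0.5, 0.445)},
    "S": {"O": (1530, 297), "C": (30, 17.8)},
}


def derived_columns(o_hor, c_hor):
    """Return O/C ratios and CVR values, assuming CVR = MAD / MEDIAN."""
    (med_o, mad_o), (med_c, mad_c) = o_hor, c_hor
    cvr_o, cvr_c = mad_o / med_o, mad_c / med_c
    return {
        "median_ratio": med_o / med_c,
        "mad_ratio": mad_o / mad_c,
        "cvr_O": cvr_o,
        "cvr_C": cvr_c,
        "cvr_ratio": cvr_o / cvr_c,
    }


rows = {element: derived_columns(h["O"], h["C"]) for element, h in summary.items()}
for element, row in rows.items():
    print(element, {name: round(value, 2) for name, value in row.items()})
```

Because the inputs here are the rounded table entries, a recomputed value can differ from the published one in the last digit.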

Table 8.3 Data from Table 8.2, MEAN and SD removed, CVR added, and the ratio of MEDIAN, MAD and CVR O-/C-horizon provided

Element  Number of samples  MEDIAN (mg/kg)          MAD (mg/kg)             CVR
         O-hor    C-hor     O-hor   C-hor   O/C-rat O-hor   C-hor   O/C-rat O-hor   C-hor   O/C-rat
Ag       617      606       0.2     0.008   25      0.16    0.0044  36      0.80    0.56    1.44
As       617      606       1.16    0.5     2.32    0.46    0.445   1.03    0.40    0.89    0.45
Bi       617      606       0.159   0.026   6.12    0.076   0.021   3.64    0.48    0.80    0.60
Cd       617      606       0.303   0.024   12.6    0.114   0.010   11      0.38    0.43    0.87
Co       617      606       1.57    7       0.22    1.11    3.71    0.30    0.71    0.53    1.34
Cu       617      606       9.69    16.2    0.60    5.15    10.8    0.48    0.53    0.67    0.79
Ni       617      606       9.18    18.7    0.49    7.74    11.6    0.67    0.84    0.62    1.36
Pb       617      606       18.8    1.6     11.75   7.41    0.741   10      0.39    0.46    0.85
S        617      606       1530    30      51      297     18      17      0.19    0.59    0.33
Th       617      606       0.345   6.5     0.053   0.25    3.71    0.07    0.73    0.57    1.29

In Table 8.4 the elements are no longer sorted in alphabetical order but according to the ratio of the MEDIANS for the O- and C-horizons. Here it is possible to detect immediately which elements, sulphur and silver, are most enriched in the O-horizon when compared to the C-horizon. S is a major plant nutrient and is thus found at high concentrations in the organic layer. The next most highly enriched element in the O-horizon is Ag, as already observed in Figure 7.2. It is an interesting observation that none of the elements with a ratio > 1 belongs to the most important metals emitted by industry, Cu and Ni, both of which still have higher concentrations in the C-horizon. Thus there is every reason to suspect that a process other than anthropogenic contamination causes the enrichment of elements like Ag, Cd, Pb, Bi and As in the O-horizon (compare Goldschmidt, 1937; Reimann et al., 2001a, 2007).

When looking at the additional information provided by MAD or CVR (Table 8.4), it is apparent that there are large differences in variation between the elements. For example S, with the highest MEDIAN-ratio, shows the lowest ratio for the CVR (Table 8.4). This is probably caused by the fact that S is a major plant nutrient and that plants keep uptake of this element regulated and in an optimal concentration range. It is also interesting to note the difference in the CVR between Cu and Ni, the major components in the emissions from the Russian nickel industry (Table 8.4). One important difference between the two elements is that Cu is a better-regulated micro-nutrient for the plants than Ni. Such tables, especially when sorted, can be used to help develop ideas about processes.
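Sorting by the MEDIAN ratio is a one-line operation once the medians are in a list. The following sketch uses the O- and C-horizon MEDIAN values from Table 8.3; the tuple layout and variable names are chosen here for illustration.

```python
# (element, O-horizon MEDIAN, C-horizon MEDIAN) in mg/kg, copied from Table 8.3.
medians = [
    ("Ag", 0.2, 0.008), ("As", 1.16, 0.5), ("Bi", 0.159, 0.026),
    ("Cd", 0.303, 0.024), ("Co", 1.57, 7), ("Cu", 9.69, 16.2),
    ("Ni", 9.18, 18.7), ("Pb", 18.8, 1.6), ("S", 1530, 30),
    ("Th", 0.345, 6.5),
]


def median_ratio(row):
    """O-/C-horizon MEDIAN ratio for one table row."""
    _, o_hor, c_hor = row
    return o_hor / c_hor


# Largest enrichment in the O-horizon first.
ranked = sorted(medians, key=median_ratio, reverse=True)
order = [element for element, _, _ in ranked]
enriched = [element for element, o_hor, c_hor in ranked if o_hor / c_hor > 1]

for row in ranked:
    print(f"{row[0]:2s}  O/C MEDIAN ratio = {median_ratio(row):.3g}")
```

The resulting order reproduces the row order of Table 8.4, and the elements with a ratio above 1 are the six discussed in the text (S, Ag, Cd, Pb, Bi and As).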

Table 8.4 Data from Table 8.3 sorted according to the MEDIAN ratio O-/C-horizon

Element  Number of samples  MEDIAN (mg/kg)          MAD (mg/kg)             CVR
         O-hor    C-hor     O-hor   C-hor   O/C-rat O-hor   C-hor   O/C-rat O-hor   C-hor   O/C-rat
S        617      606       1530    30      51      297     18      17      0.19    0.59    0.33
Ag       617      606       0.2     0.008   25      0.160   0.0044  36      0.80    0.56    1.44
Cd       617      606       0.303   0.024   13      0.114   0.010   11      0.38    0.43    0.87
Pb       617      606       18.8    1.6     12      7.4     0.741   10      0.39    0.46    0.85
Bi       617      606       0.159   0.026   6.1     0.076   0.021   3.64    0.48    0.80    0.60
As       617      606       1.16    0.5     2.3     0.460   0.445   1.03    0.40    0.89    0.45
Cu       617      606       9.69    16      0.598   5.1     11      0.476   0.53    0.67    0.79
Ni       617      606       9.18    19      0.492   7.7     12      0.669   0.84    0.62    1.36
Co       617      606       1.57    7       0.224   1.1     3.7     0.300   0.71    0.53    1.34
Th       617      606       0.345   6.5     0.053   0.254   3.7     0.068   0.73    0.57    1.29

As long as only a few populations are to be compared, a table of summary statistics can be built in a word processor or a spreadsheet from the results displayed by the data analysis software. It may even be possible to “cut and paste” the results (DAS+R is able to write these results directly into a “.csv” file). Once in a tabular form, results from other investigations that may be relevant can be added. However, the point is soon reached where such “summary tables” contain too much data to really grasp their content. For example, the complete Kola data set contains information for four materials, not only the two shown in the above tables (Tables 8.2–8.4). In addition, this information exists for many more elements than the ten selected for these tables. Working with tables thus soon becomes very tedious, and it is at this point that graphical methods come to the fore.
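Writing a summary table to a ".csv" file is straightforward in most environments. The sketch below uses Python's standard `csv` module rather than DAS+R, whose export function is not shown in this section; the column names and the helper `write_summary` are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# (element, O-horizon MEDIAN, C-horizon MEDIAN) in mg/kg, three rows from Table 8.4.
summary_rows = [("S", 1530, 30), ("Ag", 0.2, 0.008), ("Pb", 18.8, 1.6)]


def write_summary(handle, rows):
    """Write a header line and one line per element to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["element", "median_O_mg_kg", "median_C_mg_kg"])
    writer.writerows(rows)


# An in-memory buffer keeps the example free of side effects; for a real file,
# pass open("summary.csv", "w", newline="") instead (the file name is illustrative).
buffer = io.StringIO()
write_summary(buffer, summary_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

A file written this way can be opened in a spreadsheet, where results from other investigations can be added as further columns.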

8.2 Graphical comparison of the data distributions of several data sets
