Subset creation – a mighty tool in graphical data analysis

Comparing Data in Tables and Graphics

8.4 Subset creation – a mighty tool in graphical data analysis

The comparison of the four sample materials analysed in the Kola Project is an obvious choice.

What can be done, however, if there is only one data set? Is it still possible to learn something by subsetting or grouping the data? Most data sets can be subdivided further into subsets or groups. For example, the Kola Project covered the territory of three different states (Finland, Norway and Russia). For national reasons alone it could be important to compare element levels and variation in the sample materials from the different countries. In addition, the Russian part of the survey area is heavily industrialised, while the Norwegian and Finnish parts are almost pristine (Norway being more influenced by the Russian industry than Finland).

Thus, it could be informative to compare the behaviour of the elements in each of the materials for the three country data subsets to learn something about the impact of contamination. Other important factors in the survey area include the existence of three different vegetation zones, distance to the coast with a steady input of several elements via marine aerosols, distance to the different industries in the area and major differences in geology and topography. All these could be used to construct useful data subsets/groups (if the data file includes such information). It could also aid interpretation to use certain variables (e.g., pH) to construct data subsets – sometimes trends become more recognisable when large groups of samples are compared rather than single points in a scatterplot.
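The idea of building subsets from either a categorical field (such as country) or a classed continuous variable (such as pH) can be sketched in a few lines of Python. This is a minimal illustration only: the records, the field names `country`, `pH` and `Cu`, the helper names `split_by` and `ph_class`, and the pH cut at 4.0 are all assumptions made for the example and are not taken from the Kola data file.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records; values are invented for illustration, not Kola data.
samples = [
    {"country": "FIN", "pH": 3.9, "Cu": 5.0},
    {"country": "FIN", "pH": 4.2, "Cu": 7.0},
    {"country": "FIN", "pH": 4.0, "Cu": 6.0},
    {"country": "NOR", "pH": 3.8, "Cu": 9.0},
    {"country": "NOR", "pH": 4.1, "Cu": 12.0},
    {"country": "RUS", "pH": 3.7, "Cu": 40.0},
    {"country": "RUS", "pH": 4.3, "Cu": 150.0},
    {"country": "RUS", "pH": 4.0, "Cu": 25.0},
]


def split_by(records, key):
    """Group records by a field name or by a function that returns a label."""
    groups = defaultdict(list)
    for rec in records:
        label = key(rec) if callable(key) else rec[key]
        groups[label].append(rec)
    return dict(groups)


def ph_class(rec, cut=4.0):
    """Turn the continuous pH value into two classes (cut value is assumed)."""
    return "low pH" if rec["pH"] < cut else "high pH"


# Subsets from a categorical variable and from a classed continuous variable.
by_country = split_by(samples, "country")
by_ph = split_by(samples, ph_class)

# A simple per-subset summary: the median Cu value in each country subset.
cu_median = {
    label: median(rec["Cu"] for rec in recs)
    for label, recs in by_country.items()
}
```

The same `split_by` helper works for any grouping variable carried in the data file, which is why such auxiliary fields need to be recorded when the file is built.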

It is necessary to include information like “country of origin”, “vegetation zone” or “geology” in the data file at the time of its construction to be able to define subsets/groups that may become important later on during data analysis. Thus, it is time well spent to think in terms of “data analysis” already during the project design stages to ensure that data, especially field data, required for subsetting/grouping are captured during the execution of the survey. In effect, auxiliary data are selected for inclusion on the supposition that they could influence the geochemistry of the study area, and their inclusion in the data file permits some preliminary investigation and testing of these suppositions, i.e. hypotheses.

Figure 8.6 (left) shows that Cu concentrations and variation in C-horizon soils are quite comparable between the three countries. The samples from Russia actually have the lowest MEDIAN. For the O-horizon the picture is quite different (Figure 8.6, right). Here Cu data for Finland exhibit the lowest MEDIAN and the lowest spread, while Russia is at the other extreme. Both the MEDIAN and spread are much higher in Russia than in either of the other two countries, and additionally a large number of upper outliers are visible in the plot (Figure 8.6, right). This change is due to the effect of the Russian Cu-Ni industry on the O-horizon samples.

Norway falls in an intermediate position in this plot as expected from the fact that many of the Norwegian sample sites are close to the Russian border and thus the Norwegian subset of samples is more affected by the emissions than the samples from Finland.
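The quantities that such a per-country boxplot comparison displays can be computed directly. The sketch below assumes the common Tukey construction (box drawn at the hinges, fences at 1.5 times the hinge spread, whiskers at the most extreme values inside the fences) and optionally works on log10-transformed values, as for a logarithmic axis. The function name `tukey_boxplot_stats` and the example values are hypothetical; plotting software may use slightly different quartile definitions.

```python
import math
from statistics import median


def tukey_boxplot_stats(values, log_scale=True):
    """Return median, hinges, whisker ends and outliers for one subset.

    With log_scale=True all statistics are computed on log10(values) and
    back-transformed, so every value must be positive.
    """
    x = sorted(math.log10(v) for v in values) if log_scale else sorted(values)
    n = len(x)
    half = (n + 1) // 2  # each half includes the median when n is odd
    lower_hinge = median(x[:half])
    upper_hinge = median(x[n - half:])
    step = 1.5 * (upper_hinge - lower_hinge)
    lo_fence, hi_fence = lower_hinge - step, upper_hinge + step

    inside = [v for v in x if lo_fence <= v <= hi_fence]
    outside = [v for v in x if v < lo_fence or v > hi_fence]

    def back(v):
        return 10 ** v if log_scale else v

    return {
        "median": back(median(x)),
        "lower_hinge": back(lower_hinge),
        "upper_hinge": back(upper_hinge),
        "whisker_low": back(min(inside)),
        "whisker_high": back(max(inside)),
        "outliers": [back(v) for v in outside],
    }


# Hypothetical Cu values [mg/kg] for two subsets; not Kola data.
cu_by_country = {
    "Finland": [4, 5, 6, 7, 8, 9, 11],
    "Russia": [6, 9, 15, 30, 60, 200, 4000],
}
stats_by_country = {
    name: tukey_boxplot_stats(vals) for name, vals in cu_by_country.items()
}
```

Comparing the `median` and the hinge spread across the entries of `stats_by_country` mirrors reading location and spread off the boxes side by side.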

When directly comparing O- and C-horizon results for the whole Kola data set, it appears that these show little relation (e.g., Figure 8.5). Comparing the distribution of the variables in a number of subsets may provide a different impression and facilitate data interpretation.

Figure 8.7 (lower half) is a boxplot comparison of Al₂O₃ and K₂O in C-horizon soils collected in areas of several different bedrock lithologies: Caledonian sediments (lithologies 9 and 10), Palaeozoic basalts (lithologies 51 and 52), alkaline rocks (lithologies 81 and 82) and granites (lithology 7); see Figure 1.2. The boxplots reveal pronounced differences in the chemical composition of the C-horizon soils collected in areas of these lithologies. The samples influenced by the alkaline rocks have the highest concentrations of both elements (Figure 8.7). In the O-horizon the Al distribution patterns are still similar to those exhibited in the C-horizon (Figure 8.7, upper left). For K, however, which is an important plant nutrient, the clear influence of the different bedrock types as seen in the C-horizon is greatly reduced in the O-horizon (Figure 8.7, upper right).

Note that when different units are used, or when element contents are expressed in different ways (e.g., as oxides for major components), the direct comparability of data location and spread is lost (Figure 8.7). Transforming all variables to the same unit thus has important advantages during graphical data analysis.
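Bringing oxide contents in wt% and element contents in mg/kg onto the same unit is a matter of stoichiometry. The sketch below converts an oxide weight percentage to the element concentration in mg/kg, using 1 wt% = 10 000 mg/kg and the mass fraction of the element in the oxide. The function name and the rounded atomic masses are assumptions for illustration, and the conversion addresses units only, not differences in analytical method between sample materials.

```python
# Rounded standard atomic masses [g/mol]; sufficient for plotting purposes.
ATOMIC_MASS = {"Al": 26.9815, "K": 39.0983, "O": 15.999}


def oxide_wt_pct_to_element_mg_kg(wt_pct, element, n_element, n_oxygen):
    """Convert an oxide content in wt% to the element content in mg/kg.

    n_element and n_oxygen are the stoichiometric counts in the oxide
    formula, e.g. Al2O3 -> n_element=2, n_oxygen=3.
    """
    element_mass = n_element * ATOMIC_MASS[element]
    oxide_mass = element_mass + n_oxygen * ATOMIC_MASS["O"]
    element_fraction = element_mass / oxide_mass
    return wt_pct * element_fraction * 10_000  # 1 wt% = 10 000 mg/kg


# Example: 15 wt% Al2O3 and 3 wt% K2O expressed as Al and K in mg/kg.
al_mg_kg = oxide_wt_pct_to_element_mg_kg(15.0, "Al", 2, 3)
k_mg_kg = oxide_wt_pct_to_element_mg_kg(3.0, "K", 2, 1)
```

With these masses roughly 53 % of Al₂O₃ and 83 % of K₂O by weight is the element itself, so 15 wt% Al₂O₃ corresponds to roughly 79 000 mg/kg Al.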

[Figure 8.6 panels: Cu in C-horizon [mg/kg] (left) and Cu in O-horizon [mg/kg] (right); boxplots for Russia, Norway and Finland on logarithmic concentration axes]

Figure 8.6 Tukey boxplot (logarithmic scale) comparison of Cu concentrations in the Kola O- and C-horizon soil data for the three countries where samples were collected

[Figure 8.7 panels: Al in O-horizon [mg/kg] (upper left), K in O-horizon [mg/kg] (upper right), Al₂O₃ in C-horizon [wt%] (lower left), K₂O in C-horizon [wt%] (lower right); boxplots for Granites, Alkaline, Basalts and Sediments]

Figure 8.7 Tukey boxplot comparison of Al and K in the O-horizon (upper row – log-scale) and Al₂O₃ and K₂O in C-horizon samples (lower row) as collected above four abundant bedrock lithologies

Comparing data behaviour in a number of carefully defined data subsets in a variety of graphics (instead of boxplots, density traces or various CDF-plots could be used) is a very powerful tool in exploratory data analysis. However, it must be kept in mind that more than one factor or process may hide behind the selected subsets (for example, subsets for the vegetation zones are clearly linked to “distance from coast” and vice versa), and may obscure the true causation or even provide graphical evidence for an erroneous conclusion. In this context the user needs to be aware of the possibility of “lurking variables”: an apparent correlation with a factor may arise where no direct relationship exists, because both the factor and the data respond similarly to a third variable not directly present in the display. The graphics may thus not “prove” a process but can, in a truly “exploratory” sense, be used to investigate and test hypotheses in an informal, but informative, way.
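The effect of a lurking variable can be made concrete with a small synthetic example. In the sketch below, two variables are built so that each depends only on a hidden third variable (labelled here, hypothetically, as a "near coast" versus "inland" class) plus unrelated offsets. The data are hand-constructed for illustration and have nothing to do with the Kola measurements; `pearson` is a plain implementation of the Pearson correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Offsets chosen so that x and y are uncorrelated within each stratum.
offsets_x = [-1, 1, -1, 1]
offsets_y = [-1, -1, 1, 1]

# Level of the hidden third variable in each stratum (synthetic values).
strata = {"near coast": 0.0, "inland": 10.0}

# Both x and y respond to the hidden variable z, but not to each other.
data = {
    label: ([z + dx for dx in offsets_x], [z + dy for dy in offsets_y])
    for label, z in strata.items()
}

all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]

r_overall = pearson(all_x, all_y)  # about 0.96: strong apparent correlation
r_within = {label: pearson(xs, ys) for label, (xs, ys) in data.items()}  # 0.0 each
```

Pooling the strata produces a strong correlation that vanishes once the data are split by the hidden variable, which is the kind of check that subsetting makes possible when the relevant variable has been recorded.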

In document Statistical Data Analysis Explained (Page 161-164)