Mapping Spatial Data
5.5 Mapping geochemical data using classes
It is often assumed that geochemical data follow a single (log)normal distribution, which is “disturbed” by a limited number of extreme values, the outliers. Traditionally the high extremes have been of greatest interest. The problems with this concept are discussed in a number of recent papers (see, e.g., Reimann and Filzmoser, 2000a; Reimann and Garrett, 2005b; Reimann et al., 2005c) and in several chapters of this book. In general, geochemical data do not follow a normal or lognormal distribution but are poly-populational. The resulting distribution may mimic a lognormal distribution due to the superimposition of a number of separate distributions related to different geochemical processes. These processes are reflected in separate data populations, i.e. sets of data with unique statistical parameters, such as means and standard deviations.
Usually, the most dominant distributions in regional geochemical surveys are data related to the geochemically distinct bedrock lithologies present in the survey area. Superimposed on these rock type-related distributions are effects of secondary processes, such as anthropogenic contamination, sea spray, or enrichment or depletion of elements due to a wide variety of causes (pH, grain size, Fe- and Mn-oxyhydroxides, presence and amount of organic material, to name just a few). All these processes are location dependent. Sometimes non-bedrock factors can become so dominant that the “original” geochemical signature of different bedrock lithologies is lost.
Thus, geochemical data do not consist of independent samples as assumed in classical statistics; the samples related to certain processes are additionally linked by a spatial dependence. Therefore the task in geochemical mapping and interpretation cannot be simply to detect a few high values, which may be indicative of a large mineral occurrence or extreme contamination in a survey area. The task is rather to display these different processes determining the data structure in map form and to detect local deviations from the dominating process in any one sub-area (Reimann, 2005). Because multiple processes are involved, this may appear close to impossible at first glance. However, by splitting the data into groups on the basis of order statistics (see Sections 5.5.2 and 5.5.3) it is possible to display the spatial aspects of the data structure in a map such that the symbols or isopleths reflect at least a limited number of the main processes underlying the regional distribution of the elements.
5.5.1 Choice of symbols for geochemical mapping
Almost as many different symbol sets as there are geochemists have been used for geochemical mapping. This lack of standardisation has resulted in maps that are not directly comparable, and in a preference for proportional dot maps. These do, however, focus on very high values and do not facilitate detecting the data structure in a map.
It is EDA approaches that have permitted the acceptance of a standard set of symbol class boundaries. However, within this context different practitioners have used different symbol sets for mapping (Figure 5.3). These aim to provide an even optical weight for each symbol in a map in order to be able to focus on data structure and not on “high” or “low” values. The original EDA symbol set is based on the boxplot, reflecting the data structure, and permits the mapping of five classes. Limiting the number of classes results in a rather “quiet” and relatively easy to grasp distribution on a map. Experience teaches that seven classes is the maximum number that should be shown in a black and white map.
[Figure 5.3: three symbol-set panels (EDA symbol set; EDA symbol set with accentuated extreme values; GSC symbol set), each showing symbols for five classes: highest values, higher values, inner values, lower values, lowest values.]
Figure 5.3 Three possible symbol sets of use for geochemical mapping: original EDA symbol set; EDA symbol set with accentuated upper extreme values; an alternative symbol set as used by the Geological Survey of Canada (GSC). If desired these symbol sets can be easily extended from five to seven classes by using an additional size-class for the outer symbols
Although today colour is most widely used for mapping, black and white geochemical maps still have the advantage of being inexpensively copied and printed. In addition, black and white maps introduce less perceptual bias than colour maps. This may have substantial advantages when trying to detect geochemical processes with such a map. As a further advantage, colour-blind individuals are able to work with black and white maps. However, for a black and white map the choice of symbols is crucial. A black and white map can also be drawn as a contour (isopleth) map, avoiding the use of discrete individual symbols altogether, or as a smoothed surface map (see Section 5.6).
Note that the choice of symbols and their size in relation to the size of the map are critical to map appearance. It is not a trivial task to find the optimal symbol type and size for a map. Too many different symbols will usually result in a cluttered and thus hard to read map.
Because the symbol size must have a relation to the scale of the map, it is a clear advantage if plotting software permits scaling a whole group of symbols at once so that the relative size proportions between the symbols stay constant once a good set of symbols has been determined.
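The group scaling described above can be sketched in a few lines of Python. This is only an illustration: the marker names and base sizes are hypothetical and are not taken from any particular plotting package. One common factor is applied to the whole symbol set, so the size ratios between classes stay constant.

```python
# Hypothetical five-class symbol set: class index -> (marker name, base size).
# Marker names and sizes are illustrative assumptions only.
BASE_SYMBOLS = {
    0: ("circle", 4.0),   # lowest values: large circle
    1: ("circle", 2.0),   # lower values: small circle
    2: ("dot", 0.5),      # inner values: dot
    3: ("cross", 2.0),    # higher values: small cross
    4: ("cross", 4.0),    # highest values: large cross
}


def scale_symbol_set(symbols, factor):
    """Scale every symbol by the same factor, keeping relative proportions."""
    if factor <= 0:
        raise ValueError("scale factor must be positive")
    return {cls: (marker, size * factor)
            for cls, (marker, size) in symbols.items()}


# Enlarge the whole set for a larger map sheet
scaled = scale_symbol_set(BASE_SYMBOLS, 1.5)
```

Once a well-balanced set has been found, only the single factor needs to be adjusted when the map scale changes.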
5.5.2 Percentile classes
Percentiles are based on order statistics and their use does not assume any underlying data distribution. This is a major advantage when dealing with geochemical data. However, the geochemist is left with the decision as to how to distribute the classes over the range of percentiles from 0 to 100. If the task is to identify geochemical processes from regional patterns in a map, this data structure must be revealed in the map. One possibility is to spread the symbols or colours almost evenly across the range of values, e.g., via using the 20th, 40th, 60th, 80th percentiles (Figure 5.4, lower left). However, geochemists are often more interested in the tails of the data distribution. To highlight the tails the 5th, 25th, 75th, and 95th percentiles can be used (Figure 5.4, upper left and upper right). Additional classes can easily be accommodated.
A logical choice may be to include further class boundaries at the 50th and 98th (and possibly 2nd) percentiles (Figure 5.4, lower right). An extended version of the EDA symbol set as well as the Geological Survey of Canada symbol set as introduced above (see Figure 5.3) can easily handle up to seven classes for mapping (Figure 5.4, lower right). Maps constructed with percentile classes and one of these symbol sets will usually facilitate the direct recognition of a number of major processes determining the distribution of the mapped variable in space.
A remaining problem is that when using percentiles, there is no satisfactory way to identify extreme values or true outliers. The percentile-based map thus includes the assumption that the uppermost, or lowermost, two, five, or ten percent of the data are outliers. This can be justified as identifying an appropriate number of samples for further inspection (Reimann et al., 2005c).
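As a minimal sketch of percentile-based class selection (in Python, with invented data), the following computes class boundaries at chosen percentiles and assigns each sample to a class. The function names, the linear-interpolation percentile rule and the convention that a value equal to a boundary falls into the upper class are assumptions made for illustration; other percentile definitions give slightly different boundaries.

```python
from bisect import bisect_right


def percentile(sorted_values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    if not sorted_values:
        raise ValueError("no data")
    pos = (len(sorted_values) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def percentile_classes(values, percents=(5, 25, 75, 95)):
    """Return the class boundaries and a class index (0 = lowest) per value."""
    ordered = sorted(values)
    breaks = [percentile(ordered, p) for p in percents]
    # bisect_right places a value equal to a boundary in the upper class
    classes = [bisect_right(breaks, v) for v in values]
    return breaks, classes


# Invented concentrations (1, 2, ..., 101), for illustration only
conc = [float(k) for k in range(1, 102)]
breaks, classes = percentile_classes(conc)
counts = [classes.count(k) for k in range(len(breaks) + 1)]
```

Passing `percents=(20, 40, 60, 80)` gives the evenly spread alternative, and `percents=(2, 5, 25, 50, 75, 95, 98)` an extended set of boundaries.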
5.5.3 Boxplot classes
The boxplot is based on order statistics and is almost free of any assumption about data distribution (see Section 3.5). More than 15 years ago Kürzl (1988) realised that it is also well suited to define classes for geochemical mapping. O’Connor and Reimann (1993), O’Connor et al. (1988) and Reimann et al. (1998a) subsequently used this technique with great success.
Figure 5.4 Geochemical maps of the variable As in the Kola C-horizon based on different percentile classes and using EDA (upper left), GSC (upper right), extended EDA (lower left), and accentuated EDA (lower right) symbol sets
It should be recognised, however, that the normal distribution sneaks into the definition of the boxplot when the whiskers, the borders for extreme values, are computed. It was demonstrated above (see Section 3.5) that the boxplot will not recognise lower extreme values and will seriously overestimate the number of upper extreme values when the data distribution is right skewed (the most usual case in applied geochemistry). If identifying “too many” upper extreme values is desirable, one can continue using the original Tukey boxplot with the “raw” data.
The boxplot then provides a simple and fast method to define class boundaries for mapping.
If the task is to really study the data structure in a map, it may be preferable to use percentiles, a version of the boxplot based on percentiles, or a version of the boxplot that takes care of the vulnerability of the original boxplot to skewed distributions (e.g., the log-boxplot).
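The effect of skewness on the boxplot can be illustrated with a small simulation in Python. The sketch rests on stated assumptions: the data are simulated lognormal values and not real survey data, the hinges are approximated by quartiles, and the function names are hypothetical. It counts the values flagged as extreme by the Tukey rule, first on the raw data and then after a log-transform.

```python
import math
import random
import statistics


def tukey_fences(values):
    """Lower and upper fence: quartiles -/+ 1.5 times the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def count_extremes(values):
    """Number of values below the lower fence and above the upper fence."""
    lo, hi = tukey_fences(values)
    return sum(v < lo for v in values), sum(v > hi for v in values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]  # simulated data
logged = [math.log10(v) for v in raw]

raw_lower, raw_upper = count_extremes(raw)
log_lower, log_upper = count_extremes(logged)
print("raw data:", raw_lower, "lower and", raw_upper, "upper extremes")
print("log data:", log_lower, "lower and", log_upper, "upper extremes")
```

For right-skewed data of this kind the lower fence of the raw data is negative, so no lower extreme value can be flagged, while the number of upper extreme values is far larger than after the transform.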
Five main classes result from using the boxplot for class selection for mapping: lower extreme values to lower whisker, lower whisker to lower hinge, the box, containing the inner 50 per cent of data (the box can be divided into two classes if needed), upper hinge to upper whisker, and upper extreme values. A set of black and white symbols (see Figure 5.3) based on this EDA approach can be used to display these five classes on a map. These symbols are based on the concept that to show the data structure on a map objectively, each class should have an even optical (graphical) weight. This is achieved by using large symbols for the lower and upper extreme values. These symbols will not dominate the map, because there are usually not a large number of extreme values. At the same time, the symbols of circle (“○”) and cross (“+”) or square (“□”) are almost intuitively interpreted as low and high. The next classes at both ends of the distribution, consisting of ≤ 25 per cent of the data each, employ smaller versions of the extreme symbols. The inner 50 per cent, and thus the majority, of the data have the smallest symbol, a dot. However, in some instances the dots are hard to see, and in such instances a small cross can be used to advantage (GSC symbol set, Figure 5.3).
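The five boxplot classes can be assigned programmatically. The following Python sketch is illustrative only: the function name and the concentrations are invented, the hinges are approximated by quartiles, and the fences (hinge plus or minus 1.5 times the box length) stand in for the whisker ends as class boundaries. With `log_scale=True` the classes are derived from log-transformed values, one way of handling right-skewed data.

```python
import math
import statistics

CLASS_LABELS = (
    "lower extreme values",
    "lower whisker to lower hinge",
    "box (inner 50 per cent)",
    "upper hinge to upper whisker",
    "upper extreme values",
)


def boxplot_classes(values, log_scale=True):
    """Assign each value to one of the five boxplot classes (0 to 4)."""
    # The log-transform requires strictly positive values
    work = [math.log10(v) for v in values] if log_scale else list(values)
    q1, _, q3 = statistics.quantiles(work, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    lo_fence, hi_fence = q1 - step, q3 + step
    classes = []
    for w in work:
        if w < lo_fence:
            classes.append(0)
        elif w < q1:
            classes.append(1)
        elif w <= q3:
            classes.append(2)
        elif w <= hi_fence:
            classes.append(3)
        else:
            classes.append(4)
    return classes


# Invented concentrations, for illustration only
conc = [0.001, 8, 9, 10, 10, 10, 11, 12, 13, 5000]
classes = boxplot_classes(conc)
for value, cls in zip(conc, classes):
    print(value, CLASS_LABELS[cls])
```

Each class index can then be mapped to one symbol of the chosen symbol set.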
Maps constructed using these symbol sets look unusually “calm” at first glance, maybe even lacking in information content, to someone used to reading maps where the eye is automatically drawn to the high values. Their real power lies in the fact that the spatial data structure becomes visible (Reimann, 2005). The underlying geochemical processes usually determine the spatial data structure. Such maps can thus be used to understand and interpret the main processes governing the geochemical distribution in space. In addition, even locally unusual data behaviour which is not marked as “extreme” can be easily recognised in such a map in the form of the occurrence of different symbols in an otherwise uniform area. Furthermore, via the boxplot an automated check for extreme values is performed (based on an assumption of log-normality), which is the main advantage of the boxplot over percentiles. Note that for this to be effective, a symmetrical boxplot that has been adjusted to handle right-skewed data distributions should be used, for example by employing a log-transform; otherwise a large number of upper extreme values will be displayed. Instead of the EDA symbols, colour classes can also be used. With the exception of the rare situations where no outliers exist, colour maps constructed using boxplot classes will in general look similar to maps constructed with percentile classes with a symmetrical distribution of symbols about the MEDIAN. Figure 5.5 shows the As distribution in the C-horizon of the Kola Project area based on boxplot classes. Figure 5.5 (right) uses the original EDA symbols with an accentuated outlier symbol. To satisfy the geochemist’s quest for the highest value, this symbol grows continuously in direct relation to the analytical result.
Considering the major advantages of boxplot classes in combination with EDA symbols for black and white mapping, it is surprising that the technique has found so little application in applied geochemistry. One reason is probably that other maps, such as proportional dots, look much simpler at first glance. Another may be the lack of available software to prepare such maps. Many geochemists still think in terms of “high” (= interesting, may indicate a mineral occurrence, or contamination in environmental sciences) and “low” (= useless, no mineralisation in these areas, or background in contamination studies), and they seek a map that simply displays the range of the data in a relative manner. However, EDA maps achieve the task of spatially locating high and low extreme values even better, by displaying symbols indicative of the extent of “extremeness” in terms of the data distribution itself. Other reasons may be that EDA symbols need appropriate scaling in relation to the other features or symbols displayed on the map (Kürzl, 1988) and that hardly any software is available that permits this approach. An additional reason may be the overestimation of the number of upper extreme values by the original Tukey boxplot when care is not taken to use a transform that brings the data into symmetry.
Figure 5.5 Boxplot class-based maps showing the distribution of As in C-horizon soils of the Kola Project area. Tukey boxplot (with log-scale) classes and original EDA symbol set (left), Tukey boxplot classes (log-scale), and accentuated EDA symbol set (right)
5.5.4 Use of ECDF- and CP-plot to select classes for mapping
One of the best procedures to study geochemical distributions graphically is to plot the empirical cumulative distribution function (ECDF-plot) and cumulative probability plot (CP-plot) (Sections 3.4.2 and 3.4.4) and seek breaks or changes of slope in the plots that can be used to identify useful classes for geochemical mapping. The procedure is discussed in detail in Reimann et al. (2005c). Working with ECDF- and CP-plots requires experience and detailed study of data distributions prior to mapping. It is thus a technique that an experienced applied geochemist will use in a second step of data analysis, once the first set of maps using a standard default technique (preferably percentile or boxplot classes with EDA symbols) as described above has been prepared.
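The coordinates needed for both plots can be computed as in the following Python sketch. It is a sketch under stated assumptions: the function name and the data are invented, and the plotting position (i - 0.5)/n is one common convention among several. Selecting breaks or changes of slope remains a matter of visual judgement and is not automated here.

```python
import math
from statistics import NormalDist


def ecdf_cp_coordinates(values, log_scale=True):
    """Return x values, cumulative probabilities and normal quantiles.

    Plotting x against the probabilities gives the ECDF-plot;
    plotting x against the normal quantiles gives the CP-plot.
    """
    ordered = sorted(values)
    n = len(ordered)
    # The log-transform requires strictly positive values
    x = [math.log10(v) for v in ordered] if log_scale else ordered
    # Plotting position (i - 0.5) / n for i = 1..n (an assumed convention)
    probs = [(i + 0.5) / n for i in range(n)]
    standard_normal = NormalDist()
    quantiles = [standard_normal.inv_cdf(p) for p in probs]
    return x, probs, quantiles


# Invented concentrations, for illustration only
conc = [2.0, 50.0, 5.0, 1.0, 10.0]
x, probs, quantiles = ecdf_cp_coordinates(conc)
```

Class boundaries read off the plots can afterwards be applied to the data in the same way as percentile boundaries.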