Preparing the Data for Use in R and DAS+R
2.2 The detection limit problem
A common problem in the analysis of applied geochemical and environmental data is that a data set as received from the laboratory will often contain a number of variables where a certain percentage of the samples return values below the detection limit (DL) (see, e.g., Levinson, 1974; AMC, 2001). Any analytical method has both a lower and an upper DL. In practice it is the lower DL that will most often be of concern in applied geochemistry and environmental projects (when analysing ore samples or samples from an extremely contaminated site, the upper DL of the method may also need consideration). In a spreadsheet or file as received from the laboratory these samples will most often be marked as “<DL” (or “<rl”, “<reporting limit”), where DL is the actual value of the detection limit (e.g., 2 mg/kg). While values marked “<” are desirable in a printed data table, they cannot be used for any numerical processing or statistical studies. Means and standard deviations cannot be computed for samples marked “<”, and graphics or maps cannot be easily prepared. Because most statistical tests are based on estimates of central tendency (e.g., MEAN) and spread (e.g., variance or standard deviation), such tests cannot be carried out.
Thus selecting analytical procedures that provide adequate DLs is one of the most important considerations in planning any applied geochemistry project. The (lower) DL is commonly understood to be the smallest concentration that can be measured reliably with a particular technique. Different analytical techniques will have different DLs for the same element, and different laboratories may well quote different DLs for the same element using the same analytical technique. In addition the DL will change with the model of the instrument, the technician performing the analysis, and the solid-to-liquid ratio in any procedure requiring sample dissolution.
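Before any numerical processing is possible, entries marked “<” have to be separated into a number and a censoring flag. The following is a minimal sketch in Python (used here only for illustration), assuming the laboratory writes the numeric limit after the sign, as in “<2”. The helper name `parse_lab_value` and the returned layout are hypothetical.

```python
import re

# Matches entries such as "<2", "< 0.5" or ">1000" (assumed notation).
_CENSORED = re.compile(r"^\s*([<>])\s*([0-9]*\.?[0-9]+)\s*$")


def parse_lab_value(text):
    """Return (value, flag); flag is "<", ">" or "" for a measured value."""
    match = _CENSORED.match(text)
    if match:
        return float(match.group(2)), match.group(1)
    return float(text), ""


raw = ["5.1", "<2", "3.4", "< 2", "12"]
parsed = [parse_lab_value(entry) for entry in raw]
values = [value for value, _ in parsed]
censored = [flag == "<" for _, flag in parsed]
```

Keeping the flag next to the value means that any later replacement of “<DL” entries can still be traced back. Entries that are not numeric after the sign raise a `ValueError` in this sketch rather than being silently converted.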
The choice of the analytical method for a given project is often a compromise between cost, a careful evaluation of expected concentrations, and the need to have a complete data set for the elements required (e.g., for geochemical mapping). In many instances the practitioner will end with a data set containing a substantial number of “<” signs. As mentioned above, these create problems in data analysis. There are several approaches to dealing with variables where some analytical results are “<DL”. Options are:
Delete the whole variable or all samples with values “<DL” from data analysis;
Mark all observations “<DL” as missing;
Model a distribution in the interval [0, DL], and assign an arbitrarily chosen value from this distribution to each sample <DL;
Try to predict a value for this variable in each sample via multiple regression (imputation) techniques using all other analytical results; or
Set all values marked “<DL” to an arbitrarily chosen low number, e.g., half the DL.
None of these solutions is ideal. Deleting samples from data analysis is not acceptable: it will shift all statistical estimates towards the “high” end (or the “low” end, in the rare case of observations above the upper limit of detection), although there is information that the concentration in a considerable number of samples is low (high). The same happens if the values are marked as “missing”.
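The shift caused by deletion can be made concrete with a small calculation. The sketch below uses invented numbers and an assumed DL of 2 mg/kg. It compares the mean after deleting the censored samples with the range in which the true mean must lie, given that each censored concentration is somewhere in [0, DL].

```python
from statistics import mean

DL = 2.0  # assumed lower detection limit, mg/kg

# Invented example: (reported value, censored?) with "<DL" stored as the DL.
samples = [(2.0, True), (2.0, True), (2.0, True), (3.0, False),
           (4.0, False), (5.0, False), (8.0, False), (12.0, False)]

# Censored samples deleted (or, equivalently, marked as missing).
mean_deleted = mean(v for v, cens in samples if not cens)

# The true mean must lie between these bounds, because every censored
# concentration lies somewhere in the interval [0, DL].
mean_lower = mean(0.0 if cens else v for v, cens in samples)
mean_upper = mean(DL if cens else v for v, cens in samples)
```

In this example the mean after deletion lies above the upper bound, so it cannot equal the true mean whatever the censored concentrations actually were.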
Modelling a distribution based on the existing data and an assumption about the shape of the distribution (e.g., lognormal) will give the statistically most satisfying result. On the other hand it is then no longer easy to differentiate between true (measured) values and “modelled” values that have no geochemical legitimacy beyond the model, especially in the multivariate context. This is again an undesirable situation.
To predict the values <DL, for example via regression techniques using all other analytical results or kriging (for spatial data, see Chapter 5), is only sensible when the value is really needed, e.g., for mapping. The problem of differentiating between true (measured) values and “modelled” values that have no geochemical legitimacy beyond the model, especially in the multivariate context, remains.
Thus the most widely applied solution to the “<DL” problem is to set the value to an arbitrarily chosen value of half the DL. Some practitioners use the value of the DL itself (but note that there may be “true” measurements at this value, thus half the DL is the better choice).
This is easy to do in a spreadsheet and works well as long as the number of values <DL is relatively low (say less than 10–15 per cent of all samples). A more realistic value than half the DL can often be estimated from an ECDF- or CP-plot (see Chapter 3). This method is acceptable for constructing statistical graphics or maps: the value of half the DL is used as a “place holder” and marks “some low value” in the graphics. Often it is easy to recognise <DL situations in a graphic, e.g., a scatterplot (e.g., Figure 18.8) or an ECDF- or CP-plot, by a straight line of values at the lower end of the data (e.g., Figure 18.7). An element that contains values below the detection limit is called “censored”. Data below the lower limit of detection are said to be “left censored”; conversely, data above the upper limit of detection are said to be “right censored”.
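The half-DL replacement described above is simple to automate. The sketch below is one possible Python version. The function name `substitute_half_dl` is hypothetical, and the default threshold of 15 per cent is an assumption taken from the upper end of the range quoted in the text.

```python
def substitute_half_dl(values, censored, dl, max_fraction=0.15):
    """Replace censored entries by dl / 2.

    Returns the new list, the censored fraction, and whether that
    fraction is within max_fraction (an assumed threshold).
    """
    if len(values) != len(censored):
        raise ValueError("values and censored must have the same length")
    fraction = sum(censored) / len(values)
    replaced = [dl / 2 if cens else v for v, cens in zip(values, censored)]
    return replaced, fraction, fraction <= max_fraction


# Invented example: ten samples, the first one reported as "<2".
values = [2.0, 5.0, 7.0, 3.0, 9.0, 11.0, 4.0, 6.0, 8.0, 10.0]
censored = [True] + [False] * 9
replaced, fraction, ok = substitute_half_dl(values, censored, dl=2.0)
```

Returning the censored fraction together with the data makes it harder to overlook a variable in which the place holder dominates.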
Simply replacing the values <DL by an arbitrarily chosen low value is not, however, acceptable if we need to calculate a “reliable” estimate (e.g., for complying with some environmental regulation) of central tendency and spread (see Chapter 4), as needed for most statistical tests. Helsel and Hirsch (1992) and Helsel (2005) provide a detailed description of these problems and possible solutions. A first and easy solution is to use MEDIAN (Section 4.1.4) and MAD (median absolute deviation, see Section 4.2.4) as estimators of central value and spread instead of MEAN (Section 4.1.1) and SD (standard deviation, Section 4.2.3) (but remember that the values representing “<DL” need to be kept in the data set as the lowest (highest) values for a correct ranking). MEDIAN and MAD can be used as long as no more than 50 per cent of the data are below the limit of detection.
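A minimal Python sketch of this approach follows, with invented data and an assumed DL of 2 mg/kg. The function name `median_and_mad` is hypothetical, and the MAD is left unscaled (the consistency constant 1.4826 that is often applied is omitted). The place holders stay in the data so that the ranking is correct.

```python
from statistics import median


def median_and_mad(values, censored):
    """Median and unscaled MAD; place holders for "<DL" stay in the data."""
    if sum(censored) / len(values) > 0.5:
        raise ValueError("more than 50 per cent of the data are censored")
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad


censored = [True, True, True, False, False, False, False, False]
measured = [3.0, 4.0, 5.0, 8.0, 12.0]

# Same data with two different place holders for a DL of 2 mg/kg.
med_half, mad_half = median_and_mad([1.0] * 3 + measured, censored)
med_full, mad_full = median_and_mad([2.0] * 3 + measured, censored)
```

The median is the same for both place holders because only the ranks of the censored values matter. The MAD is computed with the place holders in place, so it is worth checking how sensitive it is to that choice for a given data set.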
To further complicate the issue, analytical methods exist where the laboratory will report different DLs for different samples depending on the matrix of the actual sample (typical for Instrumental Neutron Activation Analysis, INAA). It is also not unusual to find different detection limits over time because a laboratory has improved its methodology or instrumentation, or when employing different laboratories reporting with different DLs. One possible solution in this situation is to use the value of half the lowest DL reported as “place holder” for all samples <DL.
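The “half the lowest DL” rule can be sketched as follows in Python, assuming each censored record stores the DL that applied to that particular sample. The function name `half_lowest_dl` and the example values are invented for illustration.

```python
def half_lowest_dl(records):
    """records: (value, censored) pairs; a censored value is that sample's DL.

    Returns the data with one common place holder and the place holder itself
    (None if no record is censored).
    """
    dls = [value for value, cens in records if cens]
    if not dls:
        return [value for value, _ in records], None
    placeholder = min(dls) / 2
    filled = [placeholder if cens else value for value, cens in records]
    return filled, placeholder


# Invented example with three different DLs: 2, 0.5 and 1 mg/kg.
records = [(5.0, False), (2.0, True), (0.5, True), (1.2, False), (1.0, True)]
filled, placeholder = half_lowest_dl(records)
```

Using a single place holder keeps all “<DL” samples below every measured value, including measured values that lie below the higher DLs.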
To solve these problems, the reporting of values below detection limit as one value has been officially discouraged in favour of reporting all measurements with their individual uncertainty value (AMC, 2001). However, at the time of writing, most laboratories are still unwilling to deliver instrument readings for those results that they consider as “below detection” to their clients, and the replacement of values that are marked “<DL” by a value of half the DL prior to data analysis is still a practical, if not perfect, solution for many applications.
In general a conscious decision is required as to whether a variable with censored data is to be included in a multivariate data analysis. A variable can, for example, contain 10 per cent of censored data, an amount that could probably be handled by the selected multivariate method.
However, a second variable can also contain 10 per cent of censored observations, but the samples with censored values do not need to be the same as those with censored values in the first variable. In the worst case all censored values occur in different samples and this will accumulate to 20 per cent of censored observations for the multivariate data set. Before entering multivariate analysis, it is thus important to check how many observations (samples) are plagued with missing values for any variable. The proportion of such samples should be as low as possible. If standard multivariate methods are to be used, the total proportion of samples with censored data should probably not exceed 10 per cent. Even when planning to use robust methods (which could in theory handle up to 50 per cent of “outliers”), care is necessary because further data quality issues or missing values in the remainder of the data set may exist.
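The accumulation of censored observations across variables can be checked directly from the censoring flags. The sketch below reproduces the worst case described above with invented flags. The variable names “As” and “Cd” are hypothetical.

```python
n = 20  # number of samples in the invented example

# True marks a censored result; "As" and "Cd" are hypothetical variables.
flags = {
    "As": [i in (0, 1) for i in range(n)],
    "Cd": [i in (2, 3) for i in range(n)],
}

# Share of censored results for each variable on its own.
per_variable = {name: sum(col) / n for name, col in flags.items()}

# A sample counts as affected if any of its variables is censored.
affected = [any(row) for row in zip(*flags.values())]
overall = sum(affected) / n
```

Each variable has 10 per cent of censored results, but because the censored values fall in different samples, 20 per cent of the samples are affected in the multivariate data set.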
The problem of censored data can create a serious dilemma when dealing with several sample materials that are to be compared. For example, when excluding variables with more than 5 per cent of censored data from all four sample materials collected during the Kola Project (moss, O-, B-, and C-horizon), only 24 (out of more than 40) variables remain in common for direct comparison.
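The selection of variables common to all sample materials can be sketched as a set intersection. The censored fractions below are invented for illustration and are not Kola Project results. Only the 5 per cent threshold is taken from the text.

```python
MAX_CENSORED = 0.05  # exclusion threshold taken from the text

# Invented fractions of censored results per material and variable.
censored_fraction = {
    "moss":      {"Cu": 0.00, "Pb": 0.01, "Ag": 0.12, "Bi": 0.30},
    "O-horizon": {"Cu": 0.00, "Pb": 0.00, "Ag": 0.02, "Bi": 0.08},
    "B-horizon": {"Cu": 0.01, "Pb": 0.04, "Ag": 0.40, "Bi": 0.03},
    "C-horizon": {"Cu": 0.00, "Pb": 0.05, "Ag": 0.55, "Bi": 0.02},
}

# Variables that pass the threshold in each material separately.
usable = [
    {var for var, frac in fractions.items() if frac <= MAX_CENSORED}
    for fractions in censored_fraction.values()
]

# Variables that pass the threshold in every material.
common = sorted(set.intersection(*usable))
```

A variable that is well determined in three materials is still lost for the comparison if it fails in the fourth, which is why the common list shrinks so quickly.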
To avoid as many “DL-problems” as possible, one of the most important first steps in the design of the Kola Project was to construct a list of all the chemical elements for determination, setting a “priority” for each element (how important is the element for project success), and identifying the detection limit that was needed to obtain a complete data set for this element for all of the sample materials. This list can then be compared against the analytical packages offered by different laboratories together with their respective detection limits. The list can then be used to discuss detection limits, costs (if appropriate), and terms with the most suitable laboratories. This list is of course an “ideal” that cannot be achieved for all elements. For some elements no analytical technique may exist that provides a low enough DL for obtaining a “complete” data set (no sample <DL) at the time of the project. For others, procedures may exist but be so expensive that it is not possible to use them within a given budget. Thus in the end the final choice of the elements to be determined, the detection limits, and the choice of the laboratory that undertakes the work will be a compromise. It is always good practice to archive a complete sample set to be recoverable later when analytical techniques are improved (or cheaper) or interest in some element(s) arises that was of no interest at the time of the original project.
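Comparing such a list with laboratory offers amounts to a simple lookup. The sketch below uses hypothetical elements, required DLs, and laboratory names. None of the numbers are project data.

```python
import math

# Hypothetical required DLs (mg/kg) and laboratory offers; not project data.
required = {"Au": 0.001, "Cu": 0.5, "Pb": 1.0}
offers = {
    "Lab A": {"Au": 0.005, "Cu": 0.1, "Pb": 0.5},
    "Lab B": {"Au": 0.001, "Cu": 1.0},
}

# Elements for which a laboratory's DL is too high or not offered at all.
unmet = {
    lab: sorted(el for el, need in required.items()
                if dls.get(el, math.inf) > need)
    for lab, dls in offers.items()
}
```

The resulting list of unmet requirements per laboratory is a starting point for the discussion of detection limits and costs, not a substitute for it.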