Preparing the Data for Use in R and DAS+R
2.1 Required data format for import into R and DAS+R
The question could be asked, “what is the problem with using the original data file as received from the laboratory for data analysis?” Firstly, the laboratory’s result file will likely not contain the geographic coordinates of the samples, field information, and other ancillary data that may be required in the following data analyses, nor the “keys” identifying field duplicates, analytical duplicates, the project standards (control reference materials), or even the original sample site number. The data file must thus be linked with a “key-file” containing all this information.
Secondly, data will often be received from several different laboratories, and the different files must be linked somehow. Thirdly, the received files most likely contain a lot of information that is necessary to know and may be useful to view in a table, but is of no use in statistical analysis. Tables prepared for “nice looks” are often especially unsuited as input for data analysis software and may require much editing. Thus it is sensible not to spend time on “table design” at this stage.
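The linking of a laboratory result file with a key file can be sketched as follows. This is only an illustration in Python using the standard library; the column names (LABNO, QC), the sample numbers, the coordinates, and all values are invented for the example and are not taken from the Kola data.

```python
import csv
import io

# Hypothetical laboratory result file: laboratory numbers and results only.
lab_csv = """LABNO,Al_XRF,Cu
1001,61000,12.5
1002,58500,8.1
1003,60200,12.1
"""

# Hypothetical key file: links laboratory numbers to the field sample ID,
# the coordinates, and the quality-control role of each sample.
key_csv = """LABNO,ID,XCOO,YCOO,QC
1001,S17,547960,7693790,
1002,S18,770025,7679167,
1003,S17,547960,7693790,FIELD_DUPLICATE
"""


def read_rows(text):
    """Read csv text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# Index the key file by laboratory number for fast lookup.
keys = {row["LABNO"]: row for row in read_rows(key_csv)}

linked = []
for result in read_rows(lab_csv):
    key = keys[result["LABNO"]]  # a KeyError flags a result without a key entry
    linked.append({**key, **result})
```

Each linked row now carries the sample ID, the coordinates, and the duplicate flag next to the analytical results, so field duplicates can be separated from routine samples before any statistics are calculated.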
However, during data analysis it is often advantageous to have all variables in the same unit in the data file (for example, for direct 1:1 comparisons or if a centred logratio transformation (see Section 10.5.2) is necessary). For the Kola data, mg/kg was used because it was the most frequent laboratory reporting unit. For example, results of XRF analysis reported as oxides in wt%, e.g., Al2O3, were re-calculated as element concentrations in mg/kg, resulting in the new variable Al_XRF (Figure 2.1). A data file for import into data analysis software packages should not contain any special signs such as “<”, “>”, “!”, “?”, “%” or characters from a language other than English, e.g., “å”, “ø”, “ü”, or “µ” (µg/kg). Sometimes these can cause serious problems during data import or cause the software to crash later on. All variables need to have unique names. Missing values (see Section 2.3) need to be marked by an empty field or by some coded value, e.g., −9999, which can be used to tell the program that this is a missing value. Data below (or above) the analytical method detection limit must be treated in a consistent manner (see Section 2.2). In the case of the Kola data sets, all values below the detection limit were set to a value of half of the detection limit (there were no data above an upper detection limit). Thus a certain amount of editing is usually necessary to make the data file software-compatible.
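Two of the editing steps described above lend themselves to a short worked example: the re-calculation of an oxide in wt% to an element concentration in mg/kg, and the replacement of values below the detection limit by half of that limit. The following is a minimal sketch in Python, not the procedure actually used for the Kola data. The function names are hypothetical, and the sketch assumes that the laboratory marks values below detection with a leading “<” and missing values with an empty field or −9999.

```python
# Standard atomic masses (g/mol), rounded.
ATOMIC_MASS = {"Al": 26.9815, "O": 15.999}


def al2o3_wt_percent_to_al_mg_kg(oxide_wt_percent):
    """Convert Al2O3 in wt% to Al in mg/kg."""
    oxide_mass = 2 * ATOMIC_MASS["Al"] + 3 * ATOMIC_MASS["O"]
    al_fraction = 2 * ATOMIC_MASS["Al"] / oxide_mass  # about 0.529
    return oxide_wt_percent * al_fraction * 10_000  # 1 wt% = 10 000 mg/kg


def clean_value(raw, missing_codes=("", "-9999")):
    """Turn one raw laboratory entry into a number, or None if missing.

    Assumed conventions (hypothetical): "<0.01" means below a detection
    limit of 0.01 and is replaced by half of that limit. Entries above an
    upper limit (">...") are not handled and raise a ValueError.
    """
    text = raw.strip()
    if text in missing_codes:
        return None
    if text.startswith("<"):
        return float(text[1:]) / 2.0
    return float(text)


al_mg_kg = al2o3_wt_percent_to_al_mg_kg(15.0)
cleaned = [clean_value(v) for v in ["12.5", "<0.01", "", "-9999"]]
```

Here 15 wt% Al2O3 corresponds to roughly 79 400 mg/kg Al, and the four raw entries become 12.5, 0.005, and two missing values.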
Most statistical software, including R, requires as input a rather simple data format, in which the first row identifies the variables and all further rows contain the results to be used during data analysis. Some further requirements are discussed below. DAS+R will read such files, but it also accepts data files with additional auxiliary information such as the project name, a description of the sampling and analytical methods used, the lower and upper detection limits, and the units in which the data of each variable are reported; these can be stored directly with the data (and thus retrieved during data analysis). This special DAS+R data file format is constructed such that it can be turned into a “normal” R file with no more than two commands.
The simplest file format, which can be read by almost all statistical data analysis packages, consists of a header row identifying the variables, with the results following row by row underneath (Figure 2.1). Such a file should be stored in a simple “csv” (comma-separated values: variables and results separated by commas) or “txt” format.
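Before such a file is imported, the header row can be checked against the requirements stated above (unique names, no blanks, no special signs). The sketch below uses Python's standard csv module. The function check_header, the list of forbidden characters, and the sample values are assumptions made for this illustration, not part of R or DAS+R.

```python
import csv
import io

# Characters that should not appear in variable names (illustrative list).
FORBIDDEN = set("<>!?% ")


def check_header(names):
    """Return a list of problems found in a header row (empty if none)."""
    problems = []
    seen = set()
    for name in names:
        if name in seen:
            problems.append(f"duplicate name: {name!r}")
        seen.add(name)
        if not name.isascii() or FORBIDDEN & set(name):
            problems.append(f"special character or blank in: {name!r}")
    return problems


# Invented two-sample file in the simple format: header row, then results.
sample = (
    "ID,XCOO,YCOO,Al_XRF,Cu\n"
    "S1,547960,7693790,61000,12.5\n"
    "S2,770025,7679167,,8.1\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
# Empty cells are kept as None, i.e. as missing values.
rows = [
    dict(zip(header, [cell if cell != "" else None for cell in line]))
    for line in reader
]
problems = check_header(header)
```

For this file the list of problems is empty, whereas a header such as “Cu, Cu, Al %” would be reported twice: once for the duplicate name and once for the blank and the “%” sign.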
Figure 2.1 Screen snapshot of a simple Microsoft Excel™ file of Kola Project C-horizon results. These are ready for import by most statistical data analysis packages, including R, once stored in a simple format, e.g., the “csv” format.
The special DAS+R file format, fine-tuned for use in applied geochemistry, allows much more information about the data and about each single variable to be stored, so that it can be retrieved and used during data analysis. This format requires an otherwise empty first column that holds a number of predetermined keywords (at most eleven), which tell the software what auxiliary information it is expected to handle (see Figure 2.2). The sequence in which the keywords are provided is flexible. Only HEADER and VARIABLE must be specified; none of the other keywords need to be used. The software checks for the keywords and stores the information if it detects one or several of them. Of course, variable-relevant information other than “EXTRACTION” and “METHOD” could be stored using these fields. The keywords are:
HEADER: holding information on the data file, e.g., “Kola Project, regional mapping 1995, C-horizon”.
COMMENT DATASET: this row can contain free text with additional comments that will be kept with the data and are valid for the whole file or large parts thereof, e.g., “<2 mm fraction, air dried, laboratories: Geological Survey of Finland (all ICP and AAS results); Geological Survey of Norway (all XRF results); ACTLABS (all INAA results)”.
These two keywords are not linked to any specific variable but to the data file as a whole.
Figure 2.2 Kola Project Moss data file prepared with auxiliary information (project name, sample type and preparation, analytical method, detection limit, unit) that may be needed during data analysis. This format will be accepted by DAS+R. If the data file is to be used outside of DAS+R, for example in R, the first column and the uppermost ten rows should be deleted or commented out.
SAMPLE IDENTIFIER: this keyword is used to identify the table column (variable) that contains the sample or site number (or code) by entering “ID” in the column that contains this information in the VARIABLE record. If necessary, this can also be done later, after the data have been imported into DAS+R. If no sample identifier is provided, the samples are numbered from 1 to n, where n is the number of observations (samples), and are identified by this number in the graphics and tables where samples are identified.
COORDINATES: used to identify the two columns holding the information about the geographical coordinates by entering “XCOO” (east) and “YCOO” (north) in the columns containing this information. If the two variables containing the coordinates are not identified, mapping and spatial functions within the software will not be available during data analysis.
COMMENT VARIABLES: can hold a free comment that is linked to each single variable, e.g., a remark on data quality or number (or percentage) of samples below detection.
EXTRACTION: holds a second comment linked to each variable separately, like “aqua regia” or “total” (or any other variable-related comment).
METHOD: holds a third comment, the method of determination for each variable, like “ICP-AES” or “XRF” (or any other variable-related comment).
UDL: the value of the upper limit of detection for each variable (if not applicable for the data at hand, it is simplest to not provide this keyword and row), e.g., “10000”.
LDL: the value of the lower limit of detection for each variable, e.g., “0.01”.
UNIT: the unit of the measurement of each variable, e.g., “mg/kg”, “wt-percent”, “micro g/kg” (it is generally wise to try to avoid “special” signs like “%” or “µ” – see above).
VARIABLE: the line usually starting with the sample identifier and providing the coordinate and variable names. Attention: a unique name must be used for each variable. If the same element has been analysed by different methods, a simple text or numeric extension can be used. Note that variable names should not contain blanks.
The data array can contain empty cells (missing values, see above; R replaces empty cells with the code NA, “not available”) but should normally not contain any special signs like “<”, “>” or “!” if the variable is to be used for statistical analyses. Text variables or variables consisting of a mixture of text and numbers are allowed and will be automatically recognised as such. They can, for example, contain important information that is linked to each sample and can be used to create data subsets or groups (see Chapter 8). Figure 2.2 shows the above example file in the special DAS+R format with all auxiliary information (except the upper limit of detection) provided.
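How a keyword file of this kind separates into auxiliary information and a plain data table can be sketched in a few lines of Python. This is not the DAS+R import routine. The sketch assumes a comma-separated file in which each keyword row starts with its keyword, each data row starts with an empty first cell, and the keywords are spelled as in the list above. The function name split_dasr and all values in the example are invented.

```python
import csv
import io

# Invented miniature file in the assumed keyword layout.
dasr_text = """HEADER,Kola Project example,,,
METHOD,,,,ICP-AES
LDL,,,,0.01
UNIT,,,,mg/kg
VARIABLE,ID,XCOO,YCOO,Cu
,S1,547960,7693790,12.5
,S2,770025,7679167,8.1
"""


def split_dasr(text):
    """Split keyword rows from data rows.

    Returns (aux, names, data): aux maps each keyword to its row of cells,
    names is the list of variable names from the VARIABLE row, and data is
    a list of dictionaries, one per sample.
    """
    aux = {}
    names = None
    raw = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        keyword, cells = row[0].strip(), row[1:]
        if keyword == "VARIABLE":
            names = cells
        elif keyword:
            aux[keyword] = cells
        else:
            raw.append(cells)  # empty first cell: a data row
    data = [dict(zip(names, cells)) for cells in raw] if names else []
    return aux, names, data


aux, names, data = split_dasr(dasr_text)
```

Dropping the first column and the keyword rows in this way leaves exactly the simple header-plus-results table of Figure 2.1, while the detection limits and units remain available per variable in aux.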
Note that different types of variables exist and can be used in DAS+R. When importing a data file, DAS+R will try to allocate each variable automatically to a certain data type. These types are later displayed in the software and can be changed. It is important to assign the “correct” data type to each variable, because it determines what can be done with this variable during data analysis. If the automatic assignment is wrong, this should be a clear indication to check the data for this variable for inconsistencies before continuing with data analysis.
Logical: TRUE/FALSE or T/F, e.g., used to identify samples that belong to a certain subset.
Integer: a number without decimals.
Double: any real numerical value, i.e. a number with decimals.
Factor: a variable having very few different levels, which can be a character string (text) or a number or a combination thereof.
Character: any text; the sample identifier (sample number) is a typical “character” variable because it is the unique name of each sample. It can consist of a number, a text string, or a combination of text and numbers.
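The idea of automatic type allocation, and why a wrong allocation points to a data problem, can be illustrated with a small Python sketch. The rules below are a simplification invented for this example and are not the rules used by R or DAS+R. In particular, the threshold for a “factor” (max_factor_levels) is a hypothetical parameter, and numeric columns with few levels are treated as numbers here, although the list above allows them to be factors.

```python
def guess_type(values, max_factor_levels=5):
    """Guess a data type for one column of raw text cells.

    Empty cells and None are treated as missing and ignored.
    """
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "character"  # nothing to infer from; arbitrary choice here
    if all(v in ("TRUE", "FALSE", "T", "F") for v in present):
        return "logical"
    try:
        for v in present:
            int(v)
        return "integer"
    except ValueError:
        pass
    try:
        for v in present:
            float(v)
        return "double"
    except ValueError:
        pass
    levels = set(present)
    # Few, repeated text levels suggest a grouping variable.
    if len(levels) <= max_factor_levels and len(levels) < len(present):
        return "factor"
    return "character"


types = {
    "IN_SUBSET": guess_type(["TRUE", "F", "T"]),
    "COUNT": guess_type(["1", "2", "", "3"]),
    "Cu": guess_type(["0.5", "12"]),
    "LITHO": guess_type(["granite", "gneiss", "granite", "granite"]),
    "ID": guess_type(["S1", "S2", "S3"]),
    "Cu_raw": guess_type(["0.5", "12", "<0.01"]),
}
```

The last column shows the warning sign mentioned above: a single leftover “<0.01” turns a concentration variable into “character”, which indicates that the data for this variable need to be checked before the analysis continues.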