Some “typical” problems encountered when editing a laboratory data report file to a DAS +R file

Preparing the Data for Use in R and DAS+R

2.4 Some “typical” problems encountered when editing a laboratory data report file to a DAS +R file

A data file as received from a laboratory will usually contain a lot of text that is necessary only for general information purposes. Figure 2.3 shows a screen snapshot of one of the Kola Project result files as received from the laboratory. This kind of file cannot be easily imported by most data analysis software. The file will probably contain information on the analytical technique(s) used for the determination of each element, the detection limit, the unit of the measurement for each variable, and more. Some of these text lines contain the information DAS+R can hold in the first 11 comment rows, others may need to be deleted or edited. Files may contain “accessory” information or comments at the beginning as well as at the end of the data array (check the bottom line of any file – see Figure 2.3), or this information may be given in other worksheets (check for multiple work sheets!) within the file. If the laboratory has invested some time in table design and merged cells somewhere, e.g., at the top of the file, this will create severe problems when trying to import such a file into data analysis software.

In contrast an “ideal” file for most data analysis packages will only contain one row identi-fying the variables (with the first variable being the “Identifier” for the sample (i.e. the sample number), and then row by following row will contain the identifier and the analytical results

Figure 2.3 One of the “original” laboratory data files as received for the Kola Project C-horizon results

22 PREPARING THE DATA FOR USE IN R AND DAS+R for all variables listed in the first row in the same sequence (see above). If possible such a file format should be agreed upon with the laboratory at an early stage of the project. The

“optimal” data report file from a laboratory will look like the files in Figures 2.1 or 2.2 rather than like a “nice table”.

2.4.1 Sample identification

Laboratories often add their own “laboratory number” to all samples received for analysis according to the laboratory’s standards (see Figure 2.3). This laboratory number is used as a

“unique” number for storage in the laboratory’s database. Make sure that the laboratory reports the results according to the original sample numbers and not with these “new” laboratory numbers only. In the case where samples were randomised and given new numbers prior to submission (see Chapter 18), it is necessary to ensure that the laboratory analyses the samples in the exact sequence of the new numbers and not in the sequence of the laboratory’s own numbers. This requires that the “submitter” must keep and safely back-up the table that relates the original sample numbers to the new randomised sequence numbers that are on the physical samples submitted to the laboratory.

2.4.2 Reporting units

Units used to report the values are often quite different from variable to variable and information on units is provided somewhere in the data file. In the given example (Figure 2.3) several different units as identified in row five are used for the few variables (␮g/kg, mg/kg, g/kg) visible in the screen snapshot. Thus the units of analysis are important information. Unfortunately, most statistical data analysis packages do not appreciate the presence of this row. For many packages this row disrupts the connection between variable names and values. Keeping the row in this position will create severe problems in entering the data into most data analysis software, deleting it will create severe problems during data analysis when it becomes important to know the units! It is tempting to think that this need not be a problem because the unit can usually be guessed from the analytical values, and if not it is always possible to go back to the

“original” data file (good if “the original” data file was stored somewhere safely as suggested above – a problem if this was not done and the units were deleted for data analysis). It is often not so easy to “guess” the correct unit of a measurement just from the analytical values.

Thus information on the unit needs to be preserved – but where and how? DAS+R can store the information about the unit, but it is often better to transfer all results to one unit only.

For geochemical projects “mg/kg” (mg/L for water) is the most universal unit. This approach cannot be followed in all situations because, for example, pH measurements come without a unit, conductivity measurements are reported in mS/m, and radioisotopes are measured in Bq/kg, and all these may appear in the same data file. It is, nevertheless, a good start to “unify”

the units as much as possible. In case another unit is needed later on for “nice looks”, for example in a data table, it is easy to transform any values to that unit. It is also necessary to check that the unit the data are reported in is not changed during the lifetime of a project or even within just one data file (e.g., different staff measuring different batches of samples and using different reporting units!). It is often possible, and advisable, to agree on the unit for reporting measurements when formulating an analytical contract with a laboratory. That way a lot of unnecessary and always dangerous file editing can be avoided.

SOME “TYPICAL” PROBLEMS ENCOUNTERED WHEN EDITING 23

2.4.3 Variable names

Variable names need to be unique. For the laboratory it may be sufficient to use the same variable name if the method code or extraction is identified as different in another row of the data file – for data analysis this is not sufficient! If one and the same element is analysed by a variety of techniques or in a variety of different extractions, this information needs to be incorporated into the variable name (e.g., Ca aa, Ca ar and Ca tot for Ca determined in an ammonium acetate extraction, Ca determined in an aqua regia extraction and Ca determined by XRF or INAA, giving total concentrations). The same applies when files with different sample materials are linked (e.g., Ca MOSS, Ca OHOR, Ca BHOR, Ca CHOR). Note that most data analysis programs cannot handle variable names containing a space – it is thus advisable to use another sign as separator and an underscore is a common solution. If material and method need to be included in a variable name (e.g., Ca CHOR ar, Ca CHOR tot), very long variable names may result. Note that lengthy variable names may not be convenient during data analyses (e.g., when the variable names are displayed in graphics).

2.4.4 Results below the detection limit

The laboratory will have marked results below the detection limit (Section 2.2) somehow – either via a “<” sign followed by the value of the detection limit or by the actual reading of the instrument marked in some specific way as “below detection” (e.g., via an exclamation mark before or after the result or by a “−”). Such markers are acceptable in a printed table, however, a computer or a data analysis package will not know what to do with such special symbols. In the example file (Figure 2.3) the laboratory has managed to use different ways to identify measurements that fall below the lower limit of detection. For some variables a

“!” marks such values, for other samples a “<” was used. As most data analysis software cannot handle such special signs they need to be removed. However, these signs cannot just be deleted because the file will most likely contain true measurements with exactly this value, and then these true values could no longer be differentiated from values that are lower. It may also be tempting to just delete the contents of the whole cell marked as “lower than the detection limit”. However, an empty cell denotes a “missing value”, a value “below de-tection” denotes a value too small to measure with the chosen analytical method, which is an important difference when making statistical estimations. It is also no solution to replace these values with a “0” – for some variables “0” may be a true data value, importantly “0”

will often provide a far too low estimate of the real concentration, and in case there are any

“0” in a data file, logarithmic transformations become impossible. One standard solution to the situation (the solution chosen for the Kola data) is to replace all values marked as “lower than detection limit” by a value of one half of the detection limit. Therefore, in the example data file from the Kola Project all values marked with a “!” or a “<” need to be replaced by a value of one half of the detection limit. Before starting to replace the “<” sign, it is necessary to check that the laboratory is operating with one detection limit per variable only.

It may happen that the laboratory has operated with different detection limits, depending on the matrix of the samples (a common situation in neutron activation analyses (NAA or INAA (“I” for “instrumental”)). When replacing all “<” values with one half of their value in such a situation some of these replaced values may suddenly be in the range of “real” values. Thus this situation requires special care (one possible solution is to replace all the “<” values with one half of the lowest DL value reported – see Section 2.2). Afterwards it is easy to use the

24 PREPARING THE DATA FOR USE IN R AND DAS+R

“find” and “replace” functions in spreadsheet software to get rid of all the “<” signs (and of the “!”).

When actually doing this for the example file, it turned out to be more difficult than antici-pated – the laboratory had used blanks (sometimes none or one, two, three, four, and even five blanks) between the actual number and the sign. Each single one of these possibilities needs thus to be checked until in the end there are no more “<” or “!” in the data file. Thus great care is necessary when replacing all values below detection with a value of one half of the detection limit. To avoid confusions (e.g., when a file contains values<1 and <10) this should be done variable by variable when using “find” and “replace” options of the spreadsheet software rather than for the whole data file at once. It is good practice to agree with the laboratory beforehand exactly how values below (or above) detection limit are to be reported. As can be seen this editing process can be error prone; this just emphasises the importance of making a back-up copy of the original file received from a laboratory, so it can always be referred to in case it is thought that an editing error may have occurred.

2.4.5 Handling of missing values

Empty fields may exist in a data file, indicating “missing values” (Section 2.3), e.g., a sample that got lost during transport or where there was not enough material available for all the different analytical techniques that were ordered. Never replace empty fields by a number (for example it could be tempting to use a “0”). Empty fields indicate “no information”, if these fields are replaced with zeros, which may be valid for a measurement, their original meaning will be lost. The “empty” information thus needs to be preserved or replaced by a term that the software recognises as “missing value”. In Figure 2.3 in row 14 (and row 798) the special sign

“−” was used by the laboratory to identify whole samples or single variables for a sample that were not analysed. Here the “−” thus marks “missing values”, a directly dangerous approach.

Such a “−” should be immediately replaced by an empty cell. When doing this replacement, it is wise to use the “find entire cells only” option to ensure that no real “−” sign, identifying possibly existing negative values, is replaced. Also, it is not a good idea to just delete a whole row with “missing values”. At some point it may either be important to identify the samples with missing values, to know the absolute number of missing values, or another data file will be linked with the first data file, and this new file may contain data for the sample that contains missing values in the first file. The standard in R to identify missing values is to use NA (not available) as marker for a missing value, NA is automatically inserted when a data file containing empty fields is imported to R (or DAS+R). Thus empty fields should be left empty when editing the data file.

2.4.6 File structure

In case the file contains a column identifying the variables and the results follow in columns rather than in rows, it is possible to use the “copy” and “paste special” – “transposed” - features in a spreadsheet to produce a transposed file that follows the above standard. Again it is essential to ensure that the original data file is safely stored somewhere before attempting to transpose the file so that it cannot be destroyed by this procedure.

Data tables as received from different laboratories as spreadsheets may have a very dissimilar structure, and thus there is no simple way around a substantial amount of file editing, no matter whether these files are to be incorporated into a database or directly imported into a data analysis package.

APPENDING AND LINKING DATA FILES 25 For R (outside DAS+R) there exists the possibility to mark several leading (top) rows with a “#”. The software will ignore lines marked this way (just like comment lines in a computer program) when importing data, and the information is at least not deleted. Data files prepared with all auxiliary information for DAS+R can thus be directly read into other R-routines using this facility without having to re-edit the file.

2.4.7 Quality control samples

However, before the file is really ready for import into R or DAS+R, more work may still be required. Hopefully the file contains Quality Control (QC) samples (i.e. duplicates and standards inserted among the “real” samples). These QC-samples must be removed prior to data analysis (and used for quality control (see Chapter 18)). The “key-file” identifying QC-samples needs to be used to identify and remove the QC samples from all other samples.

2.4.8 Geographical coordinates, further editing and some unpleasant limitations of spreadsheet programs

For geochemical mapping and thus identifying the location of unusual results the samples also need coordinates. Thus these coordinates need also to be linked to the data file. Furthermore, analytical results may be received from more than one laboratory, and all these different results will be needed in just one large file for data analysis. A “cut” and “paste” job in a spreadsheet has to be carried out with great care. If very many variables exist, or if results for different materials are to be linked into just one file, it is possible to reach the limits of the most common spreadsheet software packages. These are usually limited to 256 columns. In such a case it is also possible to link the different files in R (or DAS+R). Alternatively, those users familiar with database software, e.g., Microsoft Access^TM, may choose to import the different field, laboratory, and other tables into the database manager and undertake the editing and final table construction in that environment.

Many packages used for data analysis will not be able to read the special spreadsheet file format (e.g., “.xls” files). Furthermore, software manufacturers sometimes change their formats with each upgrade of their spreadsheet software with the consequence that older versions of the program are no longer able to read files that were saved in a newer version. This can turn out to be a real nuisance when data files are to be exchanged between colleagues. Thus the

“final” file should be saved separately in both the spreadsheet format (for easy handling and editing) and in a simple text or ascii format. For this it is best to use the “.csv” format, which is a simple file format where the variables and values are separated by commas. To use the

“.csv” format, it is necessary to make sure that the comma was not used as decimal sign in the file. This may actually even depend on the “country” setting of the spreadsheet used for editing. Note that the “.csv” format does not support multiple worksheets; information stored in different worksheet will be lost when storing a former Microsoft Excel^TM file in “.csv”

format. The major advantage of this format is that almost all “foreign” software packages are able to import the “.csv” format. This is the file format used to import all Kola data into R (and DAS+R).

In document Statistical Data Analysis Explained (Page 45-49)