Data Management for
Multi-Environment Trials in Excel
Cathy Garlick ([email protected]), Statistical Services Centre, University of Reading, UK 27 May 2010
1.
Introduction
Excel is widely available and easy to use, and thus, is frequently used in agricultural research for entering and managing data. However, it is easy to use “badly”. The Statistical Services Centre has produced a guide on using Excel “with discipline”, available from the SSC web site
www.reading.ac.uk/ssc - the booklet is entitled “Disciplined Use of Spreadsheets for Data Entry”. This is recommended reading for all researchers using Excel. In this supplement we emphasise the use of Excel for preparing data from multi-environment trials (METs) for analysis.
2.
“List” Format
Many users of Excel are tempted to make their spreadsheets look “pretty”; effectively laying out the data in tabular form. The following is a simple example for an experiment with 3 treatments and 4 breeds of hen. The number of eggs laid and the cholesterol levels in the blood were measured. The researcher laid out the data in table format as shown below:
This layout is fine for a summary report but is not suitable for analysis. Instead you should store your data in a “List Format”. List format is a rectangular block of data where the rows represent cases and the columns represent variables – for experimental data these are generally factors or measurement variables. The list should include column headers. When transferring to a statistics package the column headers are generally used as variable names.
There should be no blank rows in the data – a statistics package will often interpret a blank row as the end of the data. Each column should have the same data type throughout – a statistics package often determines the data type from the first few rows of data.
The layout is less “pretty” but more practical. You can always use Pivot Tables to create the summary table from the list.
Note we have included a longer description of the variable in the top row with a short variable name in row 2. It is useful to have the longer description, although some statistics packages restrict variable names to 8 characters and do not allow spaces. Thus, we have the description in row 1 and the shorter name in row 2. When the data are exported, we would export from row 2 onwards.
3.
Multiple Site Experiments
In multiple site experiments the same rules apply, for example, data from all plots in all sites should be together in one list. A column should be added to distinguish the site. This is illustrated below:
The last case for site 1 is on row 14, and the first case for site 2 is on row 15.
Do not attempt to make this “prettier” by adding blank rows between sites - there should be no blank rows in the data as this is often interpreted as the end of the data.
If the layout factors differ at some sites, then include extra columns and leave blanks as appropriate. For example let’s assume site 1 has blocks as a layout factor, but site 2 has ROWs and COLUMNs1
; we need 3 columns for the layout factors and would leave blanks for site 1 for ROW and COLUMN and blanks for site 2 for “block”:
4.
Dealing with Numeric Codes
Many variables are set up as numeric codes. For example let’s assume we have codes associated with districts as follows:
District Code District Name 1 Nsitu
2 Kalulu 3 Mkanda 4 Dididi 5 Chitunda 6 Linga 7 Mwansamba 8 Nkhunga
The best way to deal with these codes is to enter just the numeric code in the main data sheet. Then, on a separate sheet, enter the list of codes with the district names – this is like a codebook. You can then use the Vlookup function in Excel to include the district names in a new column in the main sheet.
So we would have a sheet which we’ll call Codes, shown below:
1
Note when we say ROW and COLUMN using capital letters we are referring to “rows” and “columns” as used in agricultural experiments. When we use local case letters we are referring to rows and columns in the Excel spreadsheet.
This is our “lookup table” or code list. On the main data sheet shown below, we have entered the district code into column A. We insert a new column B and use the vlookup function as shown:
Note the formula vlookup (A3, Codes!$A$2:$B$9, 2, FALSE) in cell B3. Vlookup takes 4 parameters:
The first – in this case A3 – is the reference for the cell containing the value we are looking up. In this case A3 contains the value 1.
The second parameter – Codes!$A$2:$B$9 – defines the range of cells, which defines the lookup table or code list. In this example it is the range of cells A2:B9 on the sheet named
Codes. Note, here we use absolute cell references (i.e. with the $ signs), so that if we copy this formula the reference remains the same.
The third parameter specifies the column in the lookup table from where we should take the value. In this case it is column 2, i.e. column B. Thus in our example, Excel will look for a 1 in the first column of the range A2:B9 on the sheet called Codes. It then takes the
corresponding value in column B, i.e. the second column. There is a 1 in cell A2 so the function returns the value in B2 which is Nsitu.
The fourth parameter is TRUE or FALSE and determines the action if the code is not found in the lookup table. If you choose TRUE, Excel will return the nearest value; if you choose
FALSE it will give the result as #N/A. For example, our coding table includes codes from 1 to 8 – if we were to look up the value of 9, then Excel would consider this 4th parameter. If TRUE, it would find the nearest value which is 8 and return Nkhunga; if FALSE it would return #N/A. For most cases it is best to use FALSE so you can easily see if you have used invalid codes.
This method of using codes improves data quality, as you are only typing the text values once – entering text values repeatedly can lead to errors. Note you should not attempt to change the district names in the main data sheet – if the district is wrong you should change the code and the name will automatically update. If the district name has a mistake, then change it in the Codes sheet.
5.
Importing Excel data into a Statistics Package
Note that when Excel data are imported into statistics packages, it is the values that are imported and not the formulae. Therefore, if the data mentioned above with the district codes and names were imported into Genstat, changing the district code in Genstat would not automatically change the district name.
You must be aware that once you start importing data into other packages you are producing copies of the data, and it is important to keep all copies up-to-date or to have some level of version control. A typical scenario is that a researcher has the data in Excel and imports it into Genstat. She then notices an error which she corrects in Genstat. Meanwhile, another researcher takes the Excel data into STATA where he notices a different error, which he corrects within STATA. There are now 3 copies of the data, all slightly different and none of them totally correct.
The solution is to designate the Excel file as the MASTER dataset. If and when errors are found, the correction must be made immediately in the MASTER dataset and ideally a log should be kept of the change so that others using the data will be aware of the change. The corrected data can then be re-imported into the statistics package and analysis can continue or be rerun. In most statistics packages it is possible to save the syntax for any analysis you do, so rerunning the analysis should not be a problem.
6.
“Raw data” and “Results”
It is advisable to keep raw data and results on separate sheets in the Excel file. For example Pivot tables and graphs should not be on the same sheet with the raw data.
Some researchers like to include summaries (totals and averages) at the end of the data – again we would recommend storing these summaries on a separate sheet otherwise you will have problems when exporting the data.
7.
Calculations
As far as possible, all calculations should be done using formulae in Excel. For example, in the dataset below we have the Fresh Pod Weight in column H and the Sample Grain weight in column I; to calculate the Total grain per plot we use the formula I*H as shown below.
We keep the formulae in the spreadsheet so that the values can easily be checked – in this case we would not enter values into column K but only the formulae.
Doing calculations by hand and entering the results into the spreadsheet can and will lead to errors both in the calculations themselves and in the transcription of the results – let the computer do the boring work!
8.
Data at different levels
Often you will have data at different levels. For example you may have data at the plot level and at the plant level; or data at the site or district level, as well as plot level data. Data at different levels should be stored on separate sheets in the workbook to avoid unnecessary duplication, which can lead to errors. There must be a column or variable which links the 2 sets of data. The example below shows some data from a pigeon pea experiment.
Note the farmers are identified by a code (Farmer ID). We also have data at the farmer level, which are shown below:
The data are linked by the FarmerID, which appears in both sheets. Data from the farmer level can be displayed in the pigeon pea sheet using Vlookup. The example below shows farm size – a farmer level variable – included with the pigeon pea data.
This is an extension of using Vlookup to display labels for numeric codes. Here we just have more columns in our “lookup table”, the range for which is A3:K57, so we have 11 columns in the table and we are picking data from the 10th column.
In each sheet you should be able to determine what each row represents – e.g. one row per plot, one row per farmer, etc. You should know the number of plots, farmers and other subjects in your experiment, and easily check that you have the correct number of rows. If you have 10 farmers, you must have 10 rows of data at the farmer level; if you have 50 plots in total, you must have 50 rows of data at the plot level.
9.
One item per cell – dealing with repetitions and missing data
A single cell in an Excel spreadsheet must only contain a single item. Two or more numbers in the same cell will be treated as a text value, and will not be unusable. The examples below to the left show where the diameter was measured twice for each tree; this should be split into two columns as shown in the example below to the right.
Where data are missing, the cell should be left blank in Excel. You might want to include the reason for the missing value, and this can be done by giving the explanation as a comment or in a separate column.
In this example, the tree in row 15 died and the text “Dead was entered instead of the diameter measurements. This is text in an otherwise numeric column, which will cause problems and should be avoided.
The cells should be left blank and a comment added explaining the missing data as shown below.
10.
Sorting the data
Data should be entered and stored in field collection order. There is no need to sort the data. The layout factors can be entered prior to data collection, and the spreadsheet printed and used as a data collection sheet.
Sorting the sheet can lead to problems and it is easy in Excel to accidentally sort a single column. It is important to check your data after every manipulation, and of course to keep backup copies of your data.
11.
Summary
The ideas introduced here can be summarised into a set of “rules” or recommendations as follows: 1. Store the data in “List format”;
2. Data from a multiple site experiment should be combined into a single worksheet, not a separate sheet of file for each location;
3. Do not have blank rows in the data; 4. Keep data and results separate;
5. Let the computer do the calculations – don’t do hand calculations; 6. Store data at different levels on separate sheets;
7. Use vlookup for coded data and to link data at different levels; 8. Only enter one item per cell;
9. Do not include text in numeric columns – use comments instead; 10. Keep regular backups of all your data;