Research Data Management Training Manual

The World Agroforestry Centre, an autonomous, non-profit research organization, aims to bring about a rural transformation in the developing world by encouraging and enabling smallholders to increase their use of trees in agricultural landscapes. This will help to improve food security, nutrition, income and health; provide shelter and energy; and lead to greater environmental sustainability.

We are one of the 15 centres of the Consultative Group on International Agricultural Research (CGIAR).

Headquartered in Nairobi, Kenya, we operate six regional offices located in Brazil, Cameroon, India, Indonesia, Kenya and Malawi, and conduct research in 18 other countries around the developing world.

We receive funding from over 50 different investors. Our current top 10 investors are Canada, the European Union, Finland, Ireland, the Netherlands, Norway, Denmark, the United Kingdom, the United States of America and the World Bank.

World Agroforestry Centre, Nairobi, Kenya, 2012.

Publisher: World Agroforestry Centre Compilation: Leroy Mwanzia

Design and Layout: Martha Mwenda

World Agroforestry Centre United Nations Avenue, Gigiri P. O. Box 30677-00100 Nairobi, Kenya.

This manual is released under the Creative Commons Attribution NonCommercial ShareAlike license unless otherwise noted.

This manual makes substantial use of materials from the University of Essex, the UK Data Archive, MIT Libraries, University of Edinburgh and the Dataverse Network project.

CGIAR Research Program 6 - Forests, Trees and Agroforestry: Livelihoods, Landscapes and Governance

A third of our planet is covered with forests and about 1 billion people depend on forest resources for their everyday lives. Forests are a nutritional bounty and provide essential services to mainstream agriculture. Without them, future food supplies will be compromised. Forests play a vital role in slowing the pace of climate change through carbon storage and in helping countries adapt to severe weather events.

Yet, in the time it takes to read this webpage, an area of forest roughly equal to 100 football fields (45 hectares) will have been cleared to make way for agriculture, mining, pastures and other non-forest uses, or will have been degraded by unsustainable and illegal logging and other poor land-use practices.

The CGIAR Research Program - ‘Forests, Trees and Agroforestry: Livelihoods, Landscapes and Governance’ responds to a call for an urgent, strong and sustained effort focused on forest management and governance, given the crucial role of forests in confronting some of the most important challenges of our time: climate change, poverty, and food security.

The Center for International Forestry Research leads the program in partnership with Bioversity International, the International Center for Tropical Agriculture and the World Agroforestry Centre. The centers collaborate with leading national research institutes and other organizations. They partner with knowledge-sharing experts to maximize outreach and share research results with policy and practitioner partners, who can use and share this knowledge on the ground in the developing world.

The program is made up of five research components:

• Smallholder production systems and markets, with a focus on boosting the productivity and sustainability of forestry and agroforestry, increasing incomes in forested areas, and improving policies and institutions that affect land rights for the rural poor;

• Management and conservation of forest and tree resources, which involves research into threats to important tree species, conserving high-value tree species, improving silviculture practices, and developing ways to resolve conflicts over resource rights;

• Landscape management of forested areas for environmental services, biodiversity conservation and livelihoods, which explores the drivers and consequences of forest transition—in which deforested and degraded lands are restored—for environmental goods and services;

• Climate change adaptation and mitigation, which considers how forests, trees and agroforestry can play a role in climate change mitigation and also how they can help people adapt to climate change; and

• Impacts of trade and investment on forests and people, which seeks to understand the effects of forest-related trade and investment and to improve efforts to mitigate the negative and enhance the positive impacts.

Impact-driven and innovative, the program will ultimately enhance the management and use of forests, agroforestry and tree genetic resources across the landscape, from farms to forests. The initiative will target 46% of global forest cover, 1.3 billion hectares of closed forests and 500 million hectares of open and fragmented forests.

1 About the Course

1.1 Introduction

This document is a description of an introductory Data Management Training Course, organised by the World Agroforestry Centre.

Data management is a key component of every project carried out under the CGIAR Research Program 6 Forests, Trees and Agroforestry (CRP6). All research data generated by the projects should be of high quality and managed efficiently. Research data, like all research outputs, should therefore be delivered to the people who need them (scientists, practitioners, donors, development agencies, policymakers, media and NGOs) while adhering to these standards.

The main aim of this training is to encourage research scientists and research projects to allocate resources to data management and data sharing as a tool for scientific knowledge sharing.

1.2 Objectives

The objectives of the data management training are to:

a) Provide CRP6 partners with the necessary understanding and capability to handle data in this project as well as in future projects.

b) Promote the use of Dataverse as an easy and economic application for in-house data management and archiving.

c) Assist the partners in developing a Research Data Management Plan for each project.

1.3 Learning outcomes

On completion of the training course, participants should be able to:

a) Recognise the importance of good practice in managing research data in general and apply it within their own work context.

b) Draw up a data management plan and maintain it throughout the project life.

c) Organise and document their data efficiently during the course of a project.

d) Be aware of the options available for securely storing and backing up data.

2 Introduction to Data Management

Think Ahead Quiz: What is Data?

True or False: In research projects, only the information and observations made as part of scientific inquiry are considered research data.

Answer: False. Data also includes documents, procedures, scripts and all data elements that comprise research observation, findings or outcomes including primary materials and analysed data. Therefore questionnaires, methodologies, scripts and models are also considered research data.

2.1 Concepts

Terms commonly used in data management are usually defined differently depending on the context and field of study. For this training we approach data management in the context of doing research projects.

Data are raw facts and statistics collected together for reference or analysis.

Research data is data that is collected, observed, or created, for purposes of analysis to produce original research results. Research data would include all data that comprise research observations, findings or outcomes, including primary materials and analysed data.

Research data can be classified into four types [1]:

1. Observational: data captured in real-time, usually irreplaceable, examples: Sensor data, telemetry, survey data, sample data, neuroimages.

2. Experimental: data from lab equipment, often reproducible, but can be expensive, examples: gene sequences, chromatograms and toroid magnetic field data.

3. Simulation: data generated from test models where model and metadata (inputs) are more important than output data, examples: climate models, economic models.

4. Derived or compiled: data that is reproducible (but very expensive), examples: text and data mining, compiled database, 3D models, data gathered from public documents.

Research data may include all of the following:

• Text or Word documents, spreadsheets

• Models, algorithms, scripts

• Laboratory notebooks, field notebooks, diaries

• Questionnaires, transcripts, codebooks

• Audiotapes, videotape, photographs, films

• Test responses

• Slides, artefacts, specimens, samples

• Data files

• Database contents including video, audio, text, images

• Contents of an application such as input, output, log files for analysis software, simulation software, schemas

• Methodologies and workflows

• Standard operating procedures and protocols

Research Data Life Cycle – There is a need to manage data throughout the research data life cycle shown in Figure 1 below.

Figure 1: Research Data Cycle (UK Data Archive. 2012)

Different data management activities are carried out during the stages of the life cycle. These activities are listed below.

Processing data

• Enter data, digitize, transcribe, translate

• Check, validate, clean data

• Anonymise data where necessary

• Describe data

• Manage and store data

Analysing data

• Interpret data

• Derive data

• Produce research outputs

• Author publications

• Prepare data for preservation

Preserving data

• Migrate data to best format

• Migrate data to suitable medium

• Back-up and store data

• Create metadata and documentation

• Archive data

Giving access to data

• Distribute data

• Share data

• Control access

• Establish copyright

• Promote data

Re-using data

• Follow-up research

• New research

• Undertake research reviews

• Scrutinize findings

• Teach and learn

Data management comprises the processes involved in creating, obtaining, transforming, sharing, protecting, documenting and preserving data. [3]

Data management therefore includes all activities associated with data other than the direct use of the data itself. It may include activities such as:

• data entry

• data cleaning

• data organisation

• data backups

• data archiving

• data sharing or publishing

• data security and confidentiality

2.2 Need for Data Management

Preserving research data

Proper management, archiving and sharing of data ensure that your research data will be available to you and other researchers for a long time. Preserving the evidence of your writings ensures that you preserve your unique contribution to research. It also ensures that your research does not fall prey to information entropy.

Meeting funding agency or partnership agreement requirements

Many funding agencies and partnership agreements require researchers to share data, usually by depositing it in an archive. Funding agencies are also increasingly requiring data management plans as part of the project proposal.

Figure 3: Proposal Requirement

Figure 4: Program Participant Agreement

Increasing your research efficiency and saving time

Documenting your data throughout its life cycle saves time because it ensures that in the future you and others will be able to understand and use your data. Careful planning of data management at the start of the project also ensures that time is not wasted when preparing data for analysis, e.g. statistics and modelling.

Planning for your data management needs ahead of time will also save you time and resources in the long run.

Ensuring data quality

Planned data management ensures that research data and records are accurate, complete, authentic and reliable. This is done by putting quality checks or procedures at every stage of the data life cycle.

Facilitating science through interoperable discovery and access

Making your data available to other researchers through searchable repositories ensures that other scientists can build on your work, which prevents duplication of effort. It reinforces open scientific enquiry and can lead to new, unanticipated discoveries.

Increasing the visibility of your research

Sharing your data also promotes your work and demonstrates the continued use of the data and the relevance of the research. It can also provide direct credit to the researcher as a research output in its own right.

Exercise: Data Life Cycle

In this activity you will be considering the data life cycle of a research project.

Think of a past or current research project of yours and try to identify the data management activities or procedures that you think should be, or should have been, carried out at each stage of the data life cycle. See an example of a past project below.

[Figure: example data life cycle of a past project (Corti et al.)]

3 Data Management Planning

Think Ahead Quiz:

Should data management planning be thought of as a simple administrative task that is done at the beginning of the study, with little or no intention to implement the planned data management measures?

Answer: No. Late in the research cycle, the cost of implementing data management and sharing measures can be prohibitively high. Implementing data management measures during the planning and development stages of research will avoid later panic and frustration. Many aspects of data management can be embedded in everyday aspects of research co-ordination and management and in research procedures.

A data management plan is a formal document you develop at the start of your research project which outlines all aspects of your data i.e. what you will do with your data during and after your research project.

Data management planning helps you ensure your research data is accurate, complete, reliable, and secure both during and after you complete your research. Your data management plan may describe:

• What research data you will be creating or collecting.

• Who will be responsible for each aspect of the management plan you are developing.

• What policies (funding, institutional, and legal) will apply to your data.

• How data will be organised (folder structures, file naming conventions, file versioning).

• How data will be documented during the collection and analysis phase of your research.

• What data management practices (backups, storage, access control, archiving) you will be using to store and secure your data.

• What facilities and equipment will be required (hard-disk space, backup server, repository).

• Who will have ownership and access rights to your data.

• How the data will be preserved and made available in the long term once your research is completed. [4]

3.2 Creating a data management plan

3.2.1 Data management plan checklist

• What type of data will be produced? Will it be reproducible? What would happen if it got lost or became unusable later?

• How much data will it be, and at what growth rate? How often will it change?

• Who will use it now, and later?

• Who controls it (Principal investigator, student, lab, Institution, funder)?

• How long should it be retained? e.g. 3-5 years, 10-20 years, permanently

• Are there tools or software needed to create/process/visualize the data?

• Any special privacy or security requirements? e.g., personal data, high-security data

• Any sharing requirements? e.g., funder data sharing policy

• Any other funder requirements? e.g., data management plan in proposal

• Is there good project and data documentation?

• What directory and file naming convention will be used?

• What project and data identifiers will be assigned?

• What file formats? Are they long-lived?

• Storage and backup strategy?

• When will I publish it and where?

• Is there ontology or other community standard for data sharing/integration?

• Who in the research group will be responsible for data management?

3.2.2 Data Management Plan Components

A simple data management plan may include the following components. [4]

A. Context

Basic project information:

• Name of the project

• Aim & purpose of the project

• Funding body/bodies

• Duration

• Partner institutions

Data collection:

• What kind of data will be created or captured? (Data description, including anticipated volume, type, content to be created).

• How will the data be collected?

• Have you surveyed existing data, in your own institution and from third parties?

• What existing datasets could you use or build upon?

Related policies:

• Funding body requirements.

• Institutional or research group guidelines.

Responsibilities:

• Your responsibilities as a researcher.

• Staff/organisational (PIs, supervisors, colleagues, Project Manager, School etc.) roles and responsibilities for implementing this plan.

Adherence

• When will adherence to this data management plan be checked or demonstrated?

• Who will do this?

• How and when will this data management plan be reviewed?

B. Organising data & file formats

• File structure

• Folder structure

• File and folder naming conventions

• Versioning

• File formats

• File transformation

C. Documentation and metadata

Some examples of data documentation:

• Laboratory notebooks & experimental protocols

Metadata

• What contextual details are needed to make the data you capture or collect meaningful?

• How will you create or capture these metadata?

• What form will the metadata take?

• To what extent will metadata creation be automated?

• Which metadata standards will you use?

D. Storage and security

Storage

• Where (physically) will you store the data?

• On what media will you store the data?

• Whose responsibility is the storage of the data?

• How will you transmit the data, if required?

Back-up

• How will you back up the data?

• How regularly will backups be made?

• Whose responsibility will this be?

Security

• How will you manage access arrangements and data security?

• How will you enforce permissions, restrictions and embargoes?

• Other security issues

E. Data protection, rights and access

Ethical and privacy issues

• Are there ethical and privacy issues?

• If so, how will these be resolved?

• Confidentiality.

• Is the data ‘personal data’ in terms of the Data Protection Act 1998 (the DPA)?

• What have you done to comply with your obligations under the DPA?

IPR (Intellectual Property Rights)

• Is the dataset covered by copyright or the Database Right? If so, who owns the copyright and other intellectual property?

• How will the dataset be licensed if rights exist?

F. Preservation, sharing and licensing

• What is the long-term strategy for maintaining, curating and archiving the data?

• On what basis will data be selected for preservation?

• How long will (or should) data be kept beyond the life of the project?

• How will you dispose of/transfer sensitive data?

• Which archive/repository/central database/ data centre have you identified as a place to deposit data?

• Appraisal and retention timeframes (ideally with definite figures)

• What transformations will be necessary to prepare data for preservation / data sharing?

• What related (representation) information will be deposited?
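
As an illustration only, and not part of the DCC template, the components A to F above can be treated as a checklist that a short script verifies. The sketch below is in Python; the names SECTIONS and missing_sections and the idea of holding a plan as a dictionary of section title and text are assumptions made for this example.

```python
# Section titles follow components A to F listed above.
SECTIONS = [
    "Context",
    "Organising data & file formats",
    "Documentation and metadata",
    "Storage and security",
    "Data protection, rights and access",
    "Preservation, sharing and licensing",
]


def missing_sections(plan):
    """Return the sections of a plan (a dict of title -> text) that are still empty."""
    return [title for title in SECTIONS if not plan.get(title, "").strip()]
```

For example, a draft plan in which only the Context section has been written would return the other five titles, which gives a quick view of what is still to be completed before the plan is reviewed.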

3.2.3 Data Management Plan tools and template

Normally you will not have to create a data management plan from scratch. Most funding agencies or institutions that require a data management plan will provide tools and/or templates to assist you in creating one.

The Digital Curation Centre (DCC) provides an online tool, DMP Online (http://dmponline.dcc.ac.uk/) to enable you to build and edit DMPs with a view to the requirements stipulated by the major UK funders. They also have a generic template in the tool that can be used for any project. The tool itself requires registration. However, the DMP template and the DMP Checklist are both available without registration (http://dmponline.dcc.ac.uk/documents).

The DCC data management plan template is provided as a hand-out in this training.

Exercise:

Use the DCC data management plan template to help point out relevant data management topics you should consider when planning this project.

4 Local Data Management

Think Ahead Quiz:

What is in a name? Does how you name your research files matter in data management?

Answer: Clear, unique and descriptive file names that describe the contents of the file are important for effective data management. This is especially true when you are dealing with multiple people working on the project.

4.1 Local Folder Management

A well-organised folder structure and clear, descriptive and unique file and folder names make it easier to find and keep track of data files. Good file management practices are required to enable you to identify, locate and use your research data files efficiently and effectively. [4]

File and Folder names should be constructed for easy management by various data systems. The following are some general guidelines for folder and file names:

• Names should contain only numbers, letters, dashes, and underscores - no spaces or special characters.

• Lower-case names are less software and platform dependent and are preferred. If you use mixed case file names (for readability), make sure that you do not have two filenames which differ only by case.

• When choosing a name, check for any system limitations on the use of special characters and file name length. For practical reasons of legibility and usability, file names should not be more than 64 characters in length; usually you can construct a meaningful name with fewer than 25 characters.
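
The guidelines above can be checked automatically before files are shared. The following is a minimal Python sketch and not part of the original course material; the function names check_name and case_collisions are hypothetical, and allowing a single file extension such as ".csv" is an assumption made for this example.

```python
import re

# Allowed characters per the guidelines: letters, numbers, dashes, underscores.
# One dot-separated extension (e.g. ".csv") is also accepted here, as an assumption.
ALLOWED = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?$")
MAX_LENGTH = 64  # upper limit suggested in the guidelines


def check_name(name):
    """Return a list of problems found in a file or folder name."""
    problems = []
    if not ALLOWED.match(name):
        problems.append("contains spaces or special characters")
    if len(name) > MAX_LENGTH:
        problems.append("longer than %d characters" % MAX_LENGTH)
    return problems


def case_collisions(names):
    """Return groups of names that differ only by upper/lower case."""
    groups = {}
    for name in names:
        groups.setdefault(name.lower(), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

An empty list from check_name means the name passes both checks. case_collisions is intended to be run on all the names in one folder.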

4.1.1 File Naming

Good file names can provide useful cues to the content and status of a file; they can uniquely identify a file and can help in classifying files. File names may contain information such as project acronyms, study title, location, investigator name or initials, year(s) of study, data type, version number, date and file type.

Do not use generic data file names that may conflict when moved from one location to another. Ensure filenames are independent of location and if you work on more than one computer ensure that your files are synchronised.

Consider how scalable your data file naming policy needs to be, e.g. if you want to include the project number, don’t limit your project number to two digits, or you can only have ninety-nine projects.

Best practice for naming files is to:

• create unique, meaningful but brief names

• reflect the contents in the file name, i.e. identify the activity or project

• use file names to classify broad types of files

• avoid using spaces and special characters

• avoid very long file names

Examples of Bad File Names

Mydata.xls, 2001_data.csv, best version.txt

Figure 6: Story Told by File Names (Federation of Earth Science Information Partners. 2012)

Better File Name

LandscapeMosaic_TZ_2000_HHIncome.xls

• LandscapeMosaic – Project Name

• TZ – Site Name

• 2000 - Year

• HHIncome – What was measured

• xls – File type
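
A name of this kind can be assembled from its components so that every file in a project follows the same pattern. The Python sketch below is illustrative only; the function name build_file_name and the choice to strip spaces from components are assumptions, not a prescribed convention.

```python
def build_file_name(project, site, year, measure, extension):
    """Join name components with underscores and append the file type."""
    parts = [project, site, str(year), measure]
    # Remove spaces inside components so the result contains no spaces.
    cleaned = [part.replace(" ", "") for part in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_file_name("LandscapeMosaic", "TZ", 2000, "HHIncome", "xls"))
# prints LandscapeMosaic_TZ_2000_HHIncome.xls
```

Generating names from one function avoids small variations (extra spaces, different ordering) creeping in when several people create files by hand.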

Benefits of consistent file naming

The benefits of consistent and unique data file labelling are:

• Data files are distinguishable from each other within their containing folder

• Data file naming prevents confusion when multiple people are working on shared files or when files from different investigators are combined in a directory or FTP site

• Data files are easier to locate and browse

• Data files can be retrieved not only by the creator but by other users

• Data files can be sorted in logical sequence

• Data files are not accidentally overwritten or deleted

• Precise search and discovery of particular files is possible

• Different versions of data files can be identified

• If data files are moved to another storage platform, their names will retain useful context

4.1.2 Folder Structure

Think carefully about how best to organise your folder structure so that files are easy to locate. This is especially important when working in a collaborative environment. Below we have outlined some general guidelines on folder structure.

• When organizing files, the top-level folder (directory) should include the project title, unique identifier, and date (year).

• The substructure should have a clear, documented naming convention; for example, each run of an experiment, each version of a dataset, and/or each person in the group. [1]
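
One way to apply these guidelines consistently is to create the folder skeleton with a script at the start of the project. The Python sketch below is a suggestion only; the function name create_project_folders and the subfolder names in the usage note are hypothetical and should be replaced by the convention your group has documented.

```python
from pathlib import Path


def create_project_folders(root, title, identifier, year, subfolders):
    """Create a top-level project folder and its documented substructure."""
    top = Path(root) / "{}_{}_{}".format(title, identifier, year)
    for name in subfolders:
        # parents=True creates the top-level folder on first use;
        # exist_ok=True makes the function safe to run more than once.
        (top / name).mkdir(parents=True, exist_ok=True)
    return top
```

For instance, calling create_project_folders(".", "LandscapeMosaic", "P001", 2012, ["data_raw", "data_processed", "documentation", "scripts"]) would create a folder named LandscapeMosaic_P001_2012 with four subfolders in the current directory.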

4.2 Versioning

It is important to ensure that different copies or versions of files held in different formats or locations, and information that is cross-referenced between files, are all subject to version control.

It can be difficult to locate a correct version or to know how versions differ after some time has elapsed. A version control strategy depends on whether files are used by single or multiple users, in one or multiple locations, and whether versions across users or locations need to be synchronised.

It is also important to keep track of master versions of files, for example the latest iteration, especially where data files are shared between people or locations, e.g. on both a PC and a laptop. Checks and procedures may also need to be put in place to make sure that if the information in one file is altered, the related information in other files is also updated.

Best practice is to:

• decide how many versions of a file to keep, which versions to keep, for how long and how to organise versions

• identify milestone versions of files to keep

• uniquely identify files using a systematic naming convention

• record version and status of a file, e.g. draft, interim, final, internal

• record what changes are made to a file when a new version is created

• record relationships between items where needed, e.g. relationship between code and the data file it is run against; between data file and related documentation or metadata; or between multiple files

• track the location of files if they are stored in a variety of locations

• regularly synchronise files in different locations, e.g. using MS SyncToy software

• maintain single master files in a suitable file format to avoid version control problems associated with multiple working versions of files being developed in parallel

• identify a single location for the storage of milestone and master versions of files

• turn on versioning or tracking in collaborative documents or storage utilities such as wikis, Google Docs, etc.

• consider using version control software, e.g. Subversion, TortoiseSVN

Examples of file versions

• date recorded in the file name or embedded within the file: HealthProj_Kisumu_06-04-2008

• version numbering in the file name (v1, v2, v3 or 00.01, 01.00): BGHSurveyProcedures_00_04

• version description in the file name or embedded within the file (draft, final): FoodInterview_1_draft, FoodInterview_1_final [8]

Some structured examples of maintaining version control, in the form [document name] [version number] [status: draft/final]:

• Smith_interview_July2010_V1_DRAFT

• Lipid-analysis-rate-V2_definitive

• 2001_01_28_ILB_CS3_V6_AB_edited [4]
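
When version numbers are recorded in file names, a small helper can produce the next name so that numbering stays systematic. The Python sketch below is illustrative; it assumes the version marker has the form "_V" or "_v" followed by digits, and the function name next_version is hypothetical.

```python
import re

# Matches a version marker such as "_V1" or "_v12" that is followed
# by "_" or by the end of the name.
VERSION = re.compile(r"_[Vv](\d+)(?=_|$)")


def next_version(name):
    """Return the name with its version number increased by one."""
    match = VERSION.search(name)
    if match is None:
        raise ValueError("no version marker found in %r" % name)
    new_number = int(match.group(1)) + 1
    return name[:match.start(1)] + str(new_number) + name[match.end(1):]


print(next_version("Smith_interview_July2010_V1_DRAFT"))
# prints Smith_interview_July2010_V2_DRAFT
```

Names that use a different pattern are rejected with an error instead of being renamed silently, which is a deliberate choice in this sketch.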

4.3 File Formats

A file format describes the way information is organised or encoded in a computer file. A program or application must be able to recognise the file format in order to access data within the file.

All digital information is designed to be interpreted by computer programs to make it understandable and is - by nature - software dependent. All digital data are thus endangered by the obsolescence of the hardware and software environment on which access to data depends.

Although many software packages can import data created in previous versions, and competing popular programs are often interoperable, the safest option to guarantee long-term data access and usable data is to convert data to standard formats that most software is capable of interpreting and that are suitable for data interchange.

Figure 8: Example Version Control Data table. Source - UK Data Archive

Examples of such standard formats are OpenDocument Format (ODF), ASCII, tab-delimited format, comma-separated values and XML, as opposed to proprietary formats. Some proprietary formats, such as MS Rich Text Format, MS Excel and SPSS, are widely used and likely to be accessible for a reasonable, but not unlimited, time.

Examples of preferred formats

• Documents – PDF/A or OpenDocument Text (.pdf, .odt), not MS Word

• Tabular data – Delimited ASCII text, e.g. CSV (.csv, .txt, .tab), or OpenDocument Spreadsheet (.ods), not MS Excel

• Images – TIFF (.tif), not JPEG

• Digital Audio – Free Lossless Audio Codec (FLAC - .flac), not MP3

• Digital Video – MPEG-4 (.mp4), not QuickTime

4.3.1 File format Conversion

Data may need to be converted from the original format to a preferred data preservation format in preparation for long-term storage. When data are converted from one format to another - through export or by using data translation software - certain changes may occur to the data. It is important for you to understand what is at risk for the type of data you are working with.

Potential risks for loss or corruption on conversion or migration to new media:

• For data held in statistical packages, spreadsheets or databases, some data or internal metadata such as missing value definitions, decimal numbers, formulae or variable labels may be lost during conversion to another format, or data may be truncated

• For textual data, editing such as highlighting, bold text or headers/footers may be lost.

• For other numeric files: special characters (such as quotation marks), end of line returns, last characters in rows (due to row size limitations), last rows (due to row number limitations)

• For database files: as numeric files, but also relations between items in a table and between tables.

• For Image files: loss of layers, colour fidelity, resolution, sound quality, etc.

• For Multimedia: as image files, but attention to frame rates, codecs and wrappers is needed.
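
For tabular data, one simple safeguard is to read the converted file back and compare it with the original values. The Python sketch below writes rows to CSV text using the standard csv module and reports differences; the function name csv_round_trip_problems is hypothetical, and the check covers only row counts and cell text, not formulae or variable labels.

```python
import csv
import io


def csv_round_trip_problems(rows):
    """Write rows to CSV text, read them back, and report any differences."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    restored = list(csv.reader(buffer))

    problems = []
    if len(restored) != len(rows):
        problems.append("row count changed: %d -> %d" % (len(rows), len(restored)))
    for number, (before, after) in enumerate(zip(rows, restored), start=1):
        # CSV stores everything as text, so compare the text form of each cell.
        if [str(cell) for cell in before] != after:
            problems.append("row %d differs after conversion" % number)
    return problems
```

A missing value held as None is written to CSV as an empty cell, so a row containing one is reported as different. This is a small instance of the loss of missing value definitions described above.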

4.4 Data Storage

Throughout the course of your research you must ensure that all your research data, regardless of format, is stored securely, backed up and maintained regularly. You should estimate the volume of data required for your project at an early stage, ideally while drawing up your data management plan. It is also a good idea to include costs for data storage in your funding proposal.
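
A rough volume estimate can be made with simple arithmetic. The Python sketch below assumes steady monthly growth and a fixed number of copies (for example, a master copy and one back-up); the function name, the parameter names and the figures are all hypothetical.

```python
def estimate_storage_gb(initial_gb, growth_gb_per_month, months, copies=2):
    """Estimate total storage: data volume at project end times number of copies."""
    volume_at_end = initial_gb + growth_gb_per_month * months
    return volume_at_end * copies


print(estimate_storage_gb(initial_gb=10, growth_gb_per_month=2, months=36))
# prints 164
```

In this example, 10 GB of existing data growing by 2 GB a month over a 36-month project reaches 82 GB, and keeping two copies brings the total to 164 GB.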

Data storage is crucial to a research project for the following reasons:

• Properly storing data is a way to safeguard your research investment.

• Data may need to be accessed in the future to explain or augment subsequent research.

• Other researchers might wish to evaluate or use the results of your research.

• Stored data can establish precedence in the event that similar research is published.

• Storing data can protect research subjects and researchers in the event of legal allegations.

Key considerations for data storage are:

• Type and amount of data

• Thorough documentation to allow data to be appropriately used in the future

• Storage format that is easily adaptable to evolving computer hardware and software

• Rapid access to the data

• Fast read/write rates

• Low cost

• Ability to archive the data

• Removability

• A backup system, such as storing data on CDs [6]

4.4.1 Digital Data Storage Media

Networked Drives

Networked drives are managed by staff centrally or within your School, College or organization. It is highly recommended that you store your research data on regularly backed up networked drives such as:

• Fileservers managed by your research group or school.

• Fileservers managed by Information Services.

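Removable storage media such as CDs, DVDs, tapes and external hard drives have the following drawbacks:

• Their longevity is not guaranteed, especially if they are not stored correctly; for example, CDs and hard drives degrade, and tapes shrink in the long term. They can also be easily damaged, misplaced or lost.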

• Errors with writing to CDs and DVDs are common.

• They may not be big enough for all the research data, so multiple disks or drives may be needed.

• They pose a security risk.

Data should be regularly migrated to new media.

Personal Computers and Laptops

Personal computers (PCs) and laptops are convenient for storing your data temporarily. However, they should not be used for storing master copies of your data. Local drives may fail, or PCs and laptops may be lost or stolen, leading to the loss of your data. [4]

4.5 Data Backup

Making back-ups of files is an essential element of data management. Regular back-ups protect against accidental or malicious data loss and can be used to restore originals if there is loss of data.

Accidental or malicious loss of data can be due to:

• hardware faults or failure

• software or media faults

• virus infection or malicious hacking

• power failure

• human error, such as accidentally changing or deleting files

Choosing a precise back-up procedure to adopt depends on local circumstances, the perceived value of the data and the levels of risk considered appropriate for the circumstances. For many researchers, carrying out an informal risk analysis provides an indication of back-up needs.

Should you back up particular data files or back up the entire system?

What will you need to restore in the event of data loss? If your institution can restore your system then you may wish to take responsibility only for your data files. If it cannot, you may wish to take full responsibility for your own ‘system’ back-ups. Where applicable this should include portable computers or devices, non-network computers and home-based computers.

Where data contain personal information, care should be taken to only create the minimal number of copies needed, e.g. a master file and one back-up copy.

Does your institution have a back-up policy?

Most institutions have a back-up policy for data that are held on the institution’s network space. You should check with your institution about any strategies and policies in place. If you are not happy with the robustness of the solution, you should maintain an independent back-up of critical files.

How often should you back up?

To reduce risk as far as possible, back-ups should be made after every change to data or at regular intervals. You can use an automated back-up process to back up frequently used and critical data files.
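An automated back-up can be as simple as a small script run on a schedule. The following is a minimal sketch only, not a recommended tool: the function name back_up_files, the file patterns and the time-stamped folder naming are all assumptions made for illustration.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up_files(source_dir, backup_root, patterns=("*.csv", "*.mdb")):
    """Copy files matching the patterns into a new time-stamped folder.

    The patterns and the folder naming scheme are illustrative assumptions.
    """
    source_dir = Path(source_dir)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = Path(backup_root) / f"backup_{stamp}"
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in patterns:
        for path in sorted(source_dir.glob(pattern)):
            if path.is_file():
                # copy2 also preserves the file's modification time
                shutil.copy2(path, target / path.name)
                copied.append(path.name)
    return target, copied
```

Such a script could be started by the operating system's task scheduler so that back-ups happen at regular intervals without relying on memory.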


Which media should I use?

The choice of media on which to store back-up files depends on the quantity of files, type of data, and the preferred method of backing up. Examples include recordable CD/DVD, networked hard drive, removable hard drive or magnetic tape. If you are backing up many small data files on a daily basis, copying them to a recordable CD probably suffices, but if you are making back-ups of very large quantities of data from a networked hard drive, a removable hard drive or even magnetic tape is probably more convenient.

Where should I store my back-ups?

Depending on the form of back-up and the risks associated with data loss, it is most convenient to keep back-up files on a networked hard drive. For critical data, which are not available elsewhere, we would recommend that you adopt offline storage on recordable CD/DVD, removable hard drive or magnetic tape. Physical media can be safely stored in another location. Most manufacturers provide recommendations for the best storage conditions of physical media.

Validation of back-up copies

It is important that you verify and validate back-up files regularly by fully restoring them to another location and comparing them with the original. Back-up copies can be checked for completeness and integrity, for example by checking the MD5 checksum value, file size and date. [8]
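The checksum comparison mentioned above can be sketched with Python's standard hashlib module. This is an illustration under assumptions: the function names md5_checksum and backup_matches_original are hypothetical, and the check covers file size and MD5 value only, not file dates.

```python
import hashlib
from pathlib import Path


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the MD5 checksum of a file as a hexadecimal string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large data files do not fill the memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_original(original, restored):
    """Compare file size first (cheap), then the MD5 checksums."""
    original, restored = Path(original), Path(restored)
    if original.stat().st_size != restored.stat().st_size:
        return False
    return md5_checksum(original) == md5_checksum(restored)
```

A restored copy that returns False here should be treated as a failed back-up and investigated before the original is relied upon.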

Exercise: Folder Management

For this exercise, consider your research project and all the data files collected and documentation produced by the project. Lay out what would be the most effective folder structure for your project.

Please see a simple example of a project structure below.
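One way to try out a layout is to create it with a short script. The sketch below is only an illustration: the folder names in PROJECT_FOLDERS are hypothetical and are not the structure of any particular project, and the function name create_project_structure is likewise an assumption.

```python
from pathlib import Path

# Hypothetical layout for illustration only; adapt the names to your project
PROJECT_FOLDERS = [
    "data/raw",
    "data/cleaned",
    "documentation/questionnaires",
    "documentation/reports",
    "analysis",
]


def create_project_structure(root, folders=PROJECT_FOLDERS):
    """Create each folder under root and return the list of created paths."""
    root = Path(root)
    created = []
    for folder in folders:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created
```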


5 Data Entry – PEN Database Case

Think ahead quiz:

Data that are collected as part of a scientific research project ultimately prove or disprove the PI’s hypotheses and justify a body of research to the public at large. Which statement is true about data collection in scientific research?

__ Ensuring validity of the data is the key to successful research.

__ Ensuring reliability of the data is the key to successful research.

__ Ensuring reliability and validity are equally important.

__ Data collection is actually not a key part of scientific research, since many researchers use previously collected data.

Answer: Ensuring reliability and validity are equally important. Both must be ensured during data collection. When data collection meets both criteria, researchers will be able to accurately assess, replicate, and disseminate their results. Read on to learn more. [6]

The databases are all designed to look like the physical questionnaire. The first page of the physical questionnaire is also the first page in the database. Data entry personnel will work with only two modules of the database: Forms and Tables. Other modules, i.e. macros and queries, provide added functionality such as carrying over values of a variable from one page to the next, and it is recommended that they not be modified by anyone other than the database designers.

5.1 Opening the database

Opening the database to begin data entry can be done in two ways:

a) On Windows 2000, XP or later, go to the Start menu, click on Programs, locate Microsoft Access and click on it.


Once you have Microsoft Access open, go to the File menu and click Open. Locate the folder under which the database is stored, click on the database that you want to open so that it is highlighted in blue then click Open at the bottom right of the Open dialogue box.

This will open the Main Microsoft Access window which by default opens to the forms that are in the database. Double click on the form that you want to open. When beginning data entry for the first time, this should be the first form at the top since the forms are named according to what section of the questionnaire they come from.

Figure 11: Example Opening the Database.


b) Double clicking on the icon for the database will take you to the Main Microsoft Access window which by default opens to the forms that are in the database. Double click on the form that you want to open.

5.2 Forms

There are two types of forms: main forms and sub forms. A main form is the main level at which data is being collected, e.g. a village or a household. A sub form is a form within a form, e.g. within a household there are multiple forest products collected. Main forms have a one-to-one relationship, i.e. for one household there is one village code, district code etc., while sub forms have a one-to-many relationship, i.e. for one household there are multiple forest products collected, multiple fish types collected etc.
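The difference between the two relationships can be sketched as two linked tables. The example below is a hedged illustration only: it uses Python's built-in sqlite3 module as a stand-in for Access, and the table names, column names and codes are assumptions, not the actual PEN database design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the link
conn.executescript("""
CREATE TABLE household (               -- main form: one row per household
    household_code INTEGER PRIMARY KEY,
    village_code   INTEGER NOT NULL,   -- one village code per household
    district_code  INTEGER NOT NULL    -- one district code per household
);
CREATE TABLE forest_product (          -- sub form: many rows per household
    id             INTEGER PRIMARY KEY,
    household_code INTEGER NOT NULL REFERENCES household(household_code),
    product_code   INTEGER NOT NULL
);
""")

# One household (hypothetical codes) with three forest products
conn.execute("INSERT INTO household VALUES (26, 3, 10)")
conn.executemany(
    "INSERT INTO forest_product (household_code, product_code) VALUES (?, ?)",
    [(26, 101), (26, 102), (26, 103)],
)

products = [
    row[0]
    for row in conn.execute(
        "SELECT product_code FROM forest_product "
        "WHERE household_code = ? ORDER BY product_code",
        (26,),
    )
]
```

Here the household table holds exactly one row for household 26, while the forest_product table holds one row for each product that household collected.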

5.3 When to use Forms and Tables

It is possible to have a database that has only tables and no forms but because the forms look like the questionnaire, it eases data entry. All the data that is entered into a form is stored in a table of the same name e.g. data entered into the form qtrhhd_b_fup will be stored in the table named qtrhhd_b_fup.

When entering data from or making corrections in a single questionnaire, use the form. When making modifications for multiple records use the tables e.g. if district A had the code 10 and this code was later on changed to 11, this change would have to be made for all the villages that fall under district A.
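A change that affects many records, such as the district recoding described above, is typically made with a single update rather than record by record. The following is a minimal sketch only: it uses Python's built-in sqlite3 module as a stand-in for Access, and the table name village, its columns and the function name recode_district are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE village (village_code INTEGER PRIMARY KEY, district_code INTEGER)"
)
# Hypothetical data: villages 1 and 2 fall under district code 10
conn.executemany("INSERT INTO village VALUES (?, ?)", [(1, 10), (2, 10), (3, 12)])


def recode_district(connection, old_code, new_code):
    """Change a district code on every matching record; return the count."""
    with connection:  # commits on success, rolls back if an error occurs
        cursor = connection.execute(
            "UPDATE village SET district_code = ? WHERE district_code = ?",
            (new_code, old_code),
        )
    return cursor.rowcount


changed = recode_district(conn, 10, 11)
```

Returning the number of changed records gives a quick check that the update touched as many villages as expected.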

Figure 13: Example Opening the Database.


5.4 Entering Data

Upon opening the form into which data is to be entered, begin data entry. To move to the next question on the questionnaire (also known as a field) press the Tab key. To go back to the previous field press the combination Shift + Tab.

As data is entered, the number of records in the database can be seen at the bottom of the form.

To go to the next page, click on the button labeled ‘Next Page’. Doing so filters the data so that the values in the header section of the form are carried over to the next page automatically.

Note: At the bottom of the form there are the words 1 of 1 (Filtered). This does not mean that there is only one record in the database. To see how many records there are in total, close the form then

Figure 14: Example Entering data in forms and tables.


Locate the desired record by searching for it using the header information. For instance, to search for the household whose code is 26, place the cursor in the household code field then press the combination Ctrl + F. Type in 26 next to the words ‘Find What:’ and press Enter. To find the next occurrence of 26, click on the button labeled ‘Find Next’.

5.5 Validation

One advantage of using Access forms for data entry is the ability to validate the data as it is entered.

This allows you to ensure that only data in a certain range or of a certain type is entered in a particular field. This increases the accuracy of the entered data and reduces the amount of data cleaning needed at a later stage.

Figure 15: Example Entering data in forms and tables.


5.6 Navigating records

To go through the records one by one, use the buttons with a single arrow located at the bottom of the form. The one facing left takes you to the previous record and the one facing right takes you to the next record. To go to the first record in the database, click on the left facing arrow with a line in front of it (|<) and to go to the last record use the right facing arrow with a line after it (>|). The button with an arrow followed by a star (>*) will bring up a new record.

Note: When using these buttons in a form that has a sub form in it, to navigate in the sub form place the cursor in the white box that has the record number in it at the bottom of the sub form. To navigate in the main form use the same type of white box that is at the very bottom of the page.

Tip: When the database has been closed and then reopened, to begin entering a new questionnaire, go to the first page and click the new record button.

Figure 16: Example Validating data in Access.


5.7 Data types

In Access, each “variable” is a field (or what you would normally call a column in Excel). Each field has a field name, data type, and a description. Select table design view to see these characteristics.

Field name: is the name of the variable. e.g. houscode. We have tried to make these as informative as possible while keeping them short. Many of the PEN partners work with Stata which has a limit on

Figure 17-19: Examples of Data types.


Data type: There are a number of data types; however, because this particular database will be used in many time and currency zones, we have opted to use only text and numbers for all fields. Text has only been allowed for names (household member, administrative regions etc.). All other fields will be numbers and you will need to enter the codes from the codelist. We return to data types below.

Description: This is a descriptive phrase that says something about the field i.e. the question being answered.

5.7.1 Field Properties

Both text and numbers have properties. For text, the most important property is the field size, which ranges from 0 to 255 characters. In all cases we have used the default length of 50 characters; however, this can be increased.

More interesting however are the properties of numbers. The most important properties are:

• Field Size

• Decimal Places

• Validation Rule


5.7.2 Field Size

This defines the precision with which numbers will be stored. Numbers can be stored as bytes, integers, long integers, single, double and decimal.

Byte:

Values between 0 and 255 will be stored as bytes.

Example:

Field name    Field Description    Field Values
hhc_sex       sex of member        0 “Male”, 1 “Female”

The field hhc_sex is the sex of the household member and takes on two values 0 and 1. We do not allow for negative values because we assume everyone will fall in one of those classes. Furthermore, such a field does not have decimals so we hold it as a byte.

Integer (and Long Integer)

Fields with larger numbers in the range −32,768 through +32,767 will be stored as integers (larger non decimal fields can be stored as long integers). Long integer stores numbers from −2,147,483,648 to 2,147,483,647.

Example:

Field name    Field Description      Field Values
hhc_educ      Education of member    0 – 30 or -9 or -8

Single and Double

Fields with negative or positive values as well as decimal points will be stored as single or double. The former holds about 7 significant digits whereas the latter holds about 15.

A more detailed discussion on these properties can be found in any material that discusses data precision.

Decimals

For fields held as single and double, you can set the number of decimal places.
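The choice of field size can be illustrated with a short sketch. It is an assumption-laden illustration rather than anything Access itself provides: the function names smallest_integer_field_size and as_single are hypothetical, and Python's struct module is used only to imitate 4-byte single precision storage.

```python
import struct

# Ranges of the whole-number field sizes described in this section
INTEGER_FIELD_SIZES = [
    ("Byte", 0, 255),
    ("Integer", -32768, 32767),
    ("Long Integer", -2147483648, 2147483647),
]


def smallest_integer_field_size(minimum, maximum):
    """Return the smallest field size that can hold the whole range."""
    for name, low, high in INTEGER_FIELD_SIZES:
        if low <= minimum and maximum <= high:
            return name
    raise ValueError("values do not fit in a Long Integer")


def as_single(value):
    """Round-trip a number through 4-byte (single precision) storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]
```

For instance, a field such as hhc_sex with values 0 and 1 fits in a Byte, while hhc_educ, which allows -9 and -8, needs an Integer because a Byte cannot hold negative values. Passing 0.1 through as_single returns a slightly different number, which shows why single precision fields should not be relied on beyond about 7 significant digits.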


5.7.3 Validation rules

Recall that each field has validation rules. These rules are useful for preventing data entry errors.

We have used the code book to anticipate what data will be entered and set this range in the validation rules. Validation rules instruct Access what values can be entered for any given field.

Other examples of validation rules:

Is Null Or Between 1 And 15

Is Null Or Between 200 And 400 Or -9

If you try to enter a value that violates the validation rules, Access will reject the value and warn you.

We have also tried to put some informative text in the validation text, which is the message you get when Access rejects a value you try to enter.
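To make the logic of the two example rules concrete, they can be imitated outside Access. The sketch below is an illustration only, not how Access evaluates rules internally: the helper names between, equals, make_rule and enter_value are hypothetical, None stands in for Null, and the validation text is invented.

```python
def between(low, high):
    """Like 'Between low And high' in Access, both end points are included."""
    return lambda value: low <= value <= high


def equals(target):
    return lambda value: value == target


def make_rule(*checks, allow_null=True):
    """Build a rule that passes if the value is Null or meets any check."""
    def rule(value):
        if value is None:  # None plays the role of Null
            return allow_null
        return any(check(value) for check in checks)
    return rule


# Is Null Or Between 1 And 15
rule_1_to_15 = make_rule(between(1, 15))
# Is Null Or Between 200 And 400 Or -9
rule_200_to_400_or_minus_9 = make_rule(between(200, 400), equals(-9))


def enter_value(value, rule, validation_text):
    """Accept the value, or reject it with the validation text as message."""
    if not rule(value):
        raise ValueError(validation_text)
    return value
```

Calling enter_value(16, rule_1_to_15, "Enter a code between 1 and 15") raises an error carrying that message, which mirrors the warning a data entry clerk would see.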

5.8 Modifying Data

5.8.1 Validation rules

The most common modification you will likely make is to change the validation rules. Typically you may want to increase the upper limit or add a new negative value. This is quite straightforward: simply select the field you are interested in and scroll down to the validation rule to make changes.

Important! At the end of the project, PEN would like to compile a comprehensive dataset which means we will append multiple databases to make one final database. Access requires that fields being appended are identical. You therefore need to keep a record of all changes you make so that we can make them in all other databases. We cannot emphasize this enough!
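One simple way to keep such a record is a change log file that gains one line per modification. The following is a sketch under assumptions: the column names, the CSV format and the function name log_change are hypothetical and are not a format prescribed by PEN.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical columns for illustration only
LOG_COLUMNS = ["date", "table", "field", "property", "old_value", "new_value"]


def log_change(log_path, table, field, prop, old_value, new_value, when=None):
    """Append one change to the log, writing the header on first use."""
    log_path = Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(LOG_COLUMNS)
        writer.writerow(
            [(when or date.today()).isoformat(), table, field, prop,
             old_value, new_value]
        )
```

Each call records what was changed, in which table and field, and the old and new settings, so the same change can be repeated in the other databases before they are appended.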

Other changes may be to include decimal places in fields we have defined as non decimal place. To

Figure 17: An example of validation rules.


Note: Some validation rules (e.g. tenure types) are really long and may be troublesome to modify in the validation rule window. The best way to change such validation rules is to use the expression builder.

1. Click on the small button at the end of the validation rule

2. The expression builder will pop up and then you can make the modifications

Figure 18 and 19: Example of validation rules and expression builders.


5.9 Editing records

You may occasionally make mistakes whilst entering data and need to correct them. Editing records in Access is not so different from editing records in Excel. Place the cursor where you want to make the change and type. As in Excel, you can undo this change.

5.10 Deleting records

You may want to delete a record that has been entered. To do this:

1. Select the records you want to delete by left clicking in the grey column at the extreme left and dragging downwards.

2. Press delete


You will get a warning that you are about to delete a number of rows, confirm the deletion by clicking Yes or cancel by clicking No.

Exercise: Data Entry Tools

Review all the data entry tools you use for this project and ask the instructor for any clarification you require on data entry.


6 Data Documentation and Metadata

6.1 Introduction

A crucial part of making data user-friendly, shareable and with long-lasting usability is to ensure they can be understood and interpreted by any user. This requires clear data description, annotation, contextual information and documentation.

Data documentation explains how data were created or digitised, what data mean, what their content and structure are, and any manipulations that may have taken place. It ensures that data can be understood during research projects, that researchers continue to understand data in the longer term and that re-users of data are able to interpret the data. Good documentation is also vital for successful data preservation. [8]

It is critical to begin to document your data at the very beginning of your research project, even before data collection begins; doing so will make data documentation easier and reduce the likelihood that you will forget aspects of your data later in the research project.

Data documentation can be viewed at two different levels:

• Project or Study Level

• Data Level

We will look at the two levels in more detail ahead.

Metadata are a subset of core data documentation, which provides standardised structured information that lets you find, understand and use the data. It could, for example, include the purpose, origin, time references, geographic location, creator, access conditions and terms of use of a data collection.
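A structured metadata record of this kind might be sketched as follows. This is an illustration only: the element names follow the list in the paragraph above, but every value in example_record is invented, and the names DESCRIPTIVE_ELEMENTS and missing_elements are hypothetical rather than part of any metadata standard.

```python
import json

# Element names taken from the paragraph above
DESCRIPTIVE_ELEMENTS = [
    "purpose",
    "origin",
    "time_references",
    "geographic_location",
    "creator",
    "access_conditions",
    "terms_of_use",
]

# All values below are invented placeholders, not a real data collection
example_record = {
    "purpose": "Hypothetical household survey on the use of forest products",
    "origin": "Questionnaire data entered into a database",
    "time_references": "2010-01 to 2011-12",
    "geographic_location": "Hypothetical district",
    "creator": "Example research team",
    "access_conditions": "Available on request",
    "terms_of_use": "Attribution required",
}


def missing_elements(record):
    """List the descriptive elements that are absent or left empty."""
    return [name for name in DESCRIPTIVE_ELEMENTS if not record.get(name)]


# Standardised, structured text that other software can read
metadata_json = json.dumps(example_record, indent=2)
```

Checking for missing elements before a dataset is archived is one way to make sure the record is complete enough for others to find and understand the data.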

6.2 Project level documentation

Project level documentation provides the overall specifications and instructions of what the project was meant to do and why, how it went about meeting its goals, where the research was done and when it was done. It should include an overview of the research context and design, data collection methods, data preparation and results or findings. The project level documentation enables the user to understand how to make best use of the data for their purposes.

Good project-level data documentation includes information on:

• the context of data collection: project history, aims, objectives and hypotheses


• modifications made to data over time since their original creation and identification of different versions of datasets

• for time series or longitudinal surveys, changes made to methodology, variable content, question text, variable labelling, measurements or sampling

• information on data confidentiality, access and use conditions, where applicable [8]

Data documentation would include country reports, technical reports, working papers, questionnaires, interview instructions and research methods.

Figure 18: Example of Project Level Documentation Archived

6.3 Data Level

Data level documentation describes the files and tables that make up the dataset and also each variable that makes up a file or table of a dataset.

Data documentation can be embedded in data, such as variable and code descriptions in databases or headers in an interview transcript. Alternatively, information about data items can be recorded in a structured document.

Documenting data at the data level includes:

• names, labels and descriptions for variables, records and their values

• explanation of codes and classification schemes used

• codes of, and reasons for, missing values

• derived data created after collection, with code, algorithm or command file used to create them

• weighting and grossing variables created and how they should be used

• data list describing cases, individuals or items studied, for example for logging qualitative


6.3.1 Labelling and Coding

All structured, tabular data should have cases or records and variables adequately documented with:

• Names, labels and descriptions for all variables, fields, records and their values

Variable labels should:

o be brief with a maximum of 80 characters

o indicate the unit of measurement, where applicable

o reference the question number of a survey or questionnaire, where applicable e.g. variable ‘q11hexw’ with label ‘Q11: hours spent taking physical exercise in a typical week’ - the label gives the unit of measurement and a reference to the question number (Q11b)

• Code labels

e.g. variable ‘p1sex’ = ‘sex of respondent’ with codes ‘1=female’, ‘2=male’, ‘-8=don’t know’, ‘-9=not answered’

• Coding or classification schemes used, ideally with a bibliographic reference

e.g. Standard Occupational Classification 2000 - a list of codes to classify respondents’ jobs;

ISO 3166 alpha-2 country codes - an international standard of 2-letter country codes

• Codes of, and reasons for, missing data - blanks, system-missing or ‘0’ values are best avoided

e.g. ‘99=not recorded’, ‘98=not provided (no answer)’, ‘97=not applicable’, ‘96=not known’, ‘95=error’ [8]
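The labelling guidance above can be captured in a small machine-readable codebook. The sketch below is illustrative only: it reuses the two example variables from this section, while the dictionary layout and the function names overlong_labels and decode are assumptions.

```python
MAX_LABEL_LENGTH = 80  # variable labels should be at most 80 characters

# Variables and codes taken from the examples in this section
CODEBOOK = {
    "p1sex": {
        "label": "sex of respondent",
        "codes": {1: "female", 2: "male", -8: "don't know", -9: "not answered"},
    },
    "q11hexw": {
        "label": "Q11: hours spent taking physical exercise in a typical week",
        "codes": {},  # a measured quantity, so there are no code labels
    },
}


def overlong_labels(codebook):
    """Return the names of variables whose label exceeds the limit."""
    return [
        name
        for name, entry in codebook.items()
        if len(entry["label"]) > MAX_LABEL_LENGTH
    ]


def decode(codebook, variable, value):
    """Translate a stored code into its label; None if the code is unknown."""
    return codebook[variable]["codes"].get(value)
```

A code that decodes to None has no documented meaning and should be either added to the codebook or treated as a data entry error.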

6.3.2 Embedding data documentation

Data-level descriptions can be embedded within a data file itself. Many data analysis software packages have facilities for data annotation and description, as variable attributes (labels, codes, data type, missing values), data type definitions, table relationships, etc.

Statistical e.g. SPSS

Variable descriptions and attributes (codes, data type, missing values) of each variable in the data file can be documented in ‘Variable View’ or via syntax, in which case the embedded data documentation is contained in the SPSS command file.

Databases e.g. MS Access


Figure 19: Embedded Data in SPSS file (UK Data Archive. 2012)