Child Labour Survey Data Processing and Storage of Electronic Files

Statistical Information and Monitoring Programme on Child labour (SIMPOC)

International Programme on the Elimination of Child Labour (IPEC)

A Practical Guide

Revised December 2003

Publications of the International Labour Office enjoy copyright under Protocol 2 of the Universal Copyright Convention. Nevertheless, short excerpts from them may be reproduced without authorization, on condition that the source is indicated. For rights of reproduction or translation, application should be made to the ILO Publications Bureau (Rights and Permissions), International Labour Office, CH-1211 Geneva 22, Switzerland. The International Labour Office welcomes such applications. Libraries, institutions and other users registered in the United Kingdom with the Copyright Licensing Agency, 90 Tottenham Court Road, London W1T 4LP [Fax: (+44) (0)207631 5500; e-mail: [email protected]], in the United States with the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 [Fax: (+1) (978) 7504470; e-mail: [email protected]] or in other countries with associated Reproduction Rights Organizations, may make photocopies in accordance with the licences issued to them for this purpose.

ISBN 92-2-113629-9

First published 2004

The designations employed in ILO publications, which are in conformity with United Nations practice, and the presentation of material therein do not imply the expression of any opinion whatsoever on the part of the International Labour Office concerning the legal status of any country, area or territory or of its authorities, or concerning the delimitation of its frontiers.

The responsibility for opinions expressed in signed articles, studies and other contributions rests solely with their authors, and publication does not constitute an endorsement by the International Labour Office of the opinions expressed in them.

Reference to names of firms and commercial products and processes does not imply their endorsement by the International Labour Office, and any failure to mention a particular firm, commercial product or process is not a sign of disapproval.

ILO publications can be obtained through major booksellers or ILO local offices in many countries, or direct from ILO Publications, International Labour Office, CH-1211 Geneva 22, Switzerland. Catalogues or lists of new publications are available free of charge from the above address.

Foreword and acknowledgements

The production of survey results in a presentable format is often delayed, one of the main reasons being that data processing issues are addressed neither properly nor early enough. Emphasizing the importance of careful and informed data processing, this guide provides detailed guidelines for survey planners, data processors, and computer system administrators with respect to data processing planning, actual data processing activities, and the storage of generated files. The guide also outlines the requirements and procedures for transferring electronic files to the ILO at the completion of child labour surveys, a process that is contributing to a growing global child labour data repository. The main aim is to facilitate the generation of high-quality micro-data derived from child labour surveys.

This guide has been prepared by Muhammad Q. Hasan of SIMPOC/IPEC, ILO. Many people involved with child labour surveys helped in the exercise. We would like to express sincere thanks to all concerned. We particularly wish to thank Mr Sylvester Young, Director of the ILO Bureau of Statistics, and Mr Farhad Mehran of the ILO’s Department of Policy Integration for their valuable comments and suggestions.

This guide is planned to be revised and reproduced on a regular basis. Suggestions and comments are therefore always welcome. Users should direct any feedback to [email protected]

1. Introduction

1.1 Background

ILO/IPEC’s Statistical Information and Monitoring Programme on Child Labour (SIMPOC) supports child labour surveys conducted in a large number of countries. One of the most important aspects of this programme is the collection, archiving, and dissemination of credible, well-documented, and easily accessible micro-data. This requires extensive planning, organization, and execution of planned activities, especially at the country level, where, it is expected, collected data will be archived for an indefinite period.

At the ILO, meanwhile, this information will provide the basis of a global child labour data repository for use by a variety of people in a variety of countries in a variety of computing environments. Thus, the data must be clean, free of inconsistencies, well documented, and readily accessible for use at any time in research and policy-making activities. The dataset received by the ILO also needs to be complete (incorporating codebooks, questionnaires and so on) and ready for straightforward use by any analyst in any computing environment.

Child labour surveys include three phases. First of all, data are collected through interviews with children and other family members. Data collection is followed by data processing, where the collected information is checked for errors, and micro-data and relevant documentation files are created. Finally, data analysis is performed in the light of any additional requirements or policy.

Data processing is a difficult and complex process, but in many cases this is the stage that receives least attention. Data processing activities such as planning for equipment, software, and training of personnel can be conducted concurrently with such activities as survey design and field-data collection. Since all child labour surveys are carried out under strict time constraints, it is recommended that all planning, training, and testing procedures are completed before the field data collection is undertaken.

The data processing phase includes several distinct stages, each of these comprising multiple steps where errors can and do occur. Child labour surveys are smaller operations than censuses, but, since most are first-time surveys and collect a greater amount of information than many other general household surveys, they tend to be more complex. While overall data processing activities are in many respects similar to other general household-type surveys, child labour surveys, given their larger sample sizes and questionnaires, sometimes make greater demands on time and other resources.

Presentable survey results are commonly delayed because data processing issues are addressed neither appropriately nor early enough. This guide presents a brief overview of the data collection phase before going on, first, to highlight the importance of data processing and, second, to provide detailed guidelines for its conduct, with particular emphasis on issues pertinent to child labour surveys. Chapter 2 addresses planning issues involved in data processing. Chapter 3 looks at the conduct of data processing and, immediately upon completion of a child labour survey, the generation of files, including well-documented public use datasets. One main purpose of this guide is to help data processors at the country level to produce, at the conclusion of surveys, clean and reliable datasets together with all the documentation that secondary analysts need in order to produce reliable aggregate data. Chapter 4 provides information on how to preserve datasets, allowing continued ease of access over an indefinite period.

Survey design issues, data analysis, and data dissemination lie beyond the scope of this guide.

The information presented in the following chapters should be viewed only as guidelines: the procedures outlined here may of course be adapted in the light of available national resources and experience.

This guide as a whole is intended for planners and technical experts supervising data processing activities. Chapter 3, however, is designed specifically for those who perform the actual data processing, while Chapter 4 is for computer system administrators responsible for storage of child labour survey data. The guide also provides an overview of data processing activities that can be carried out at the survey design stage.

1.2 Field data collection: A brief overview

In general, data collection can involve a variety of methods, from face-to-face or telephone interviews to aerial photography. Child labour surveys, however, involve only face-to-face interviews, and only two such methods are feasible.

PAPI. With paper-and-pencil interviews, enumerators apply questionnaires on paper, recording the data with pencils. Data entry operators then key data into computers or convert them into machine-readable form through some scanning technique coupled with character recognition technology. No matter which method of data entry is selected, the information needs to be rechecked. Various means are used to ensure that data are entered properly. Much of this will be explained in the following chapters.

CAPI. With computer-aided personal interviews, enumerators are supplied with hand-held electronic devices (e.g. palmtop or laptop computers), permitting the direct digital recording of data. This method offers advantages, compared to PAPI, since major errors occur only while keying in the data, which can be rechecked immediately after data collection. Data are then transferred to computers, with almost no time needed for additional data entry, and data cleaning can begin immediately.

This guide primarily addresses PAPI, which is the data collection method applied in most child labour surveys.

1.3 Importance of data processing

Child labour typically remains a hidden issue in many respects, and surveys seek reliable quantitative data about the various related concerns. National surveys require substantial sums of money and enormous organizational efforts involving ministries, national statistical offices, and other agencies. The resultant data are then provided to policy-makers, researchers, global estimators, and campaigners responsible for publicizing the adverse effects of child labour. All of these users need easily accessible, reliable data on the various aspects of child labour.

Both non-sampling and sampling errors appear in survey datasets. Sampling errors are handled during the sample design phase, and are not addressed in this guide. Non-sampling errors may originate with respondents, interviewers, data-entry clerks, or processing programmers. One main objective of data processing is to find these errors and fix them in the shortest possible time. Where irreparable errors are found, they should be flagged with explanations. Unidentified and unflagged errors can corrupt interpretations of data and, ultimately, may result in the adoption of inappropriate policies.

Competent and thorough data processing activities (including error correction, logical checks, and compilation of information as the basis of documentation) are vital to reliable survey information. Otherwise, output from a successful survey (the collected field data) may be limited to only a few tables. Secondary analysts will find it difficult, if not impossible, to use the data, while national and international policy-makers may be misled by the survey results.

One key to successful data processing is careful planning. The various related activities need to be detailed as early as possible, and should include fallback plans.

Data processing is immensely important to the survey output, and cleaning and verifying the data is essential.

2. Planning

2.1 Introduction

The preparation of high-quality datasets requires proper planning, and involves two essential elements:

• Statistical method. One should employ good data collection tools and a well-developed survey methodology.

• Processing and subsequent storage of datasets. A second essential element involves the informed use of established data processing tools, processing methodology, and up-to-date computer hardware and software where applicable.

In most cases, child labour surveys are conducted either as stand-alone operations or are attached as a module to some form of national household survey. In stand-alone surveys, children and their parents are usually interviewed. On the basis of initial investigations, this manual assumes that all child labour surveys employ paper-and-pencil interviews (PAPI) for data collection. Survey planning and data cleaning are discussed in light of this assumption. Upon completion of the interviews, the collected data are entered on a computer. Data entry in the field may occur under the supervision of field office supervisors or at survey headquarters, which is normally the national statistics office. If data are entered in the field, there will be a minimum of one file at each field location. Since the same survey questionnaire is used, all files generated in field locations will be similar with respect to the number of variables. No matter how the data are entered, different files are either appended before data cleaning or cleaned and then appended. These activities are normally conducted at survey headquarters.
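The appending of field files described above can be sketched as follows. This is only a minimal illustration in Python, assuming each field file has already been read into a list of records (one dictionary per record); the function name, the SOURCE tracking field, and the sample values are hypothetical and not part of this guide.

```python
def append_field_files(field_files):
    """Append the records of several field files into a single list.

    field_files maps a (hypothetical) field-office name to its records,
    where each record is a dict of variable name -> value.
    """
    combined = []
    expected = None  # variable names seen in the first record
    for office, records in field_files.items():
        for record in records:
            variables = set(record)
            if expected is None:
                expected = variables
            elif variables != expected:
                # Same questionnaire, so every file should carry the same variables.
                raise ValueError(f"{office}: variables differ from earlier files")
            # SOURCE is an assumed extra field that lets errors be traced back.
            combined.append({**record, "SOURCE": office})
    return combined


north = [{"HHID": "0100101", "AGE": 11}]
south = [{"HHID": "0200101", "AGE": 9}, {"HHID": "0200102", "AGE": 14}]
all_records = append_field_files({"north": north, "south": south})
print(len(all_records))  # 3
```

Refusing to append files whose variables differ is one possible design choice; it surfaces data entry set-up problems at headquarters before cleaning begins.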

Where a child labour survey is conducted as a module attached to some other household-based survey (e.g. a household member health and education module), child labour data may be collected together with other modules (as with stand-alone surveys) or as a complete module without the household information (which is collected as part of another module). The data may also be collected at different times (e.g. if attached to a quarterly labour-force survey, the total sample will be covered over a period of one year). In such cases, household information needs to be extracted from the other data file(s) and then combined with the child labour data. Such cases entail both appending and merging of data. (Merging and appending are described in greater detail in later sections.) Completion of a data file is followed by data cleaning (partial cleaning may also occur in each modular file). It should be noted that child labour is difficult to define unless all relevant information about children is thoroughly investigated: understanding the causes and consequences of child labour requires analysis of information about the household and other family members. Another scenario¹ presents itself when the survey is conducted in phases, with a series of questionnaires referring to different entities or differing in their respective coverage. In this situation, data may have to be presented in separate files, with no merging or appending. All of the situations described above show that careful planning is needed before the collected information is processed and made available for analysis. All planning issues can be addressed while the survey design is in progress. If financial and time constraints are not an issue, all data processing activities should be tested during the pilot survey. (If the CAPI data collection method is used, this is essential.) It should be noted that careful advance planning considerably reduces actual processing time.

¹ One example of this would be the SIMPOC-assisted country report Survey of activities of young people in South Africa 1999, http://www.ilo.org/childlabour/simpoc/southafrica/report/rep1999.pdf.

The following sections discuss planning issues that need consideration before the actual data processing.

2.2 Data processing policy planning

Two areas of planning are important to data processing. On the one hand, we must decide how the actual data processing is to be accomplished, and this is treated in detail in Chapter 3. But first we must ask what resources and definitions are required for effective and efficient data processing. We may term this initial step “policy planning”.

Policy planning comprises the following essential features:

• defining the relevant aspects of a dataset;

• selecting hardware and software;

• identifying personnel;

• scheduling the time needed for data processing;

• formulating a data preservation strategy; and

• designing an access procedure.

2.3 Defining the relevant aspects of a dataset

If analysts are to use a dataset effectively, the micro-data must first be properly processed. This involves a number of stages. Preliminary planning is essential, and includes the identification and definition of such aspects of the dataset as the following.

Record identification variable

To identify a case or record, an identifying variable is usually created and encoded with a unique value. The encoding method and the elements that constitute this variable have to be determined, and the variable (often referred to as the unique record identifier) should be named in accordance with the procedure described later in this chapter. This identification variable will provide the only linkage between the original dataset containing all the variables and a public use dataset (where many identification variables may have been deleted for reasons of confidentiality), or between files when the data are held in different files but a cross comparison of information is required.

For example, a combination of state or provincial code, enumeration area code, and house number appended one after another may be enough to identify a house uniquely. A line number (position of a person in a house) can be used to identify a person in the house uniquely. Other approaches may achieve the same goal, but care should always be taken when appending these numbers, and each household, as well as each person living in a household, should always have a unique identifier.
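The care needed when appending these numbers can be illustrated with a short Python sketch. The field widths and function names below are assumptions chosen for illustration, not values prescribed by this guide; the point is that each component is padded to a fixed width so that different combinations can never collapse into the same identifier.

```python
# Field widths are assumptions for illustration; choose them from the
# largest code actually used in the survey.
WIDTHS = {"province": 2, "enumeration_area": 3, "house": 4, "line": 2}


def _pad(value, width, name):
    """Zero-pad a numeric code, refusing values that do not fit the width."""
    if not 0 <= value < 10 ** width:
        raise ValueError(f"{name}={value} does not fit in {width} digits")
    return str(value).zfill(width)


def household_id(province, enumeration_area, house):
    """Identifier for a house: province + enumeration area + house number."""
    return (
        _pad(province, WIDTHS["province"], "province")
        + _pad(enumeration_area, WIDTHS["enumeration_area"], "enumeration_area")
        + _pad(house, WIDTHS["house"], "house")
    )


def person_id(province, enumeration_area, house, line):
    """Identifier for a person: household identifier + line number."""
    return household_id(province, enumeration_area, house) + _pad(
        line, WIDTHS["line"], "line"
    )


# Without padding, (1, 12, 3) and (11, 2, 3) would both give "1123".
print(household_id(1, 12, 3))   # 010120003
print(household_id(11, 2, 3))   # 110020003
print(person_id(1, 12, 3, 4))   # 01012000304
```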

File structure

In child labour surveys, the unit of analysis is the “child” or the “person”, whereas the medium is the “household”, because information about the child or person is collected by first identifying a house. Thus, it is worth deciding what the final data files should look like.

The structure of data files may vary considerably in format and organization when, upon completion of data entry, they become available to secondary analysts. Is a large data file with one long data record preferable (describing both a child and the house in which he/she is living, for example), or does one want several small data files with short data records (where child and house information, for instance, reside in different files with a linking variable)? This decision will depend on factors such as how the survey is conducted and what statistical software is being used for data entry and processing. The following considerations may serve as guidelines.

A data file may contain one long record or several smaller records. A large number of records slows processing speed. Some statistical packages (e.g. Stata) limit records to a maximum number of variables. On the other hand, one advantage of long records in a single file is that secondary analysts do not have to merge files at a later date. Annex I describes limitations associated with statistical packages such as SPSS, SAS, and Stata.

Data may be organized in a file such that household records are followed by person records (with different record types in an ASCII hierarchical file). Alternatively, there may be two separate files: one for a house and one for persons living in that house, with well-defined linking variables common to both files in a package-specific format. There may also be a single merged file with long records. The values for many variables will be repeated for members of the same house in such files, thus occupying more storage space. Each system has its pros and cons, and one planning decision must address the questions of how many data files are to be included in the dataset and what the structure of each should be.

Because of the way specific software handles data files, processing large data files within a Windows environment may be a problem. A child labour survey data file may become large when associated with a labour force survey, so the data file may need to be split before analysis.

The file structure should be chosen according to available computing resources and the experience of the data processors. Because of its simplicity, however, a single flat data file is recommended where possible for child labour surveys.
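As a rough sketch of how a single flat file relates to separate house and person files, the following Python example attaches household variables to every person record through a linking variable. The variable names (HHID, ROOMS, LINE, AGE) and the function name are hypothetical, and the records are assumed to be already loaded as dictionaries.

```python
def build_flat_file(households, persons, key="HHID"):
    """Attach household variables to every person record via a linking variable."""
    by_id = {}
    for house in households:
        if house[key] in by_id:
            raise ValueError(f"duplicate household identifier {house[key]}")
        by_id[house[key]] = house
    flat = []
    for person in persons:
        house = by_id.get(person[key])
        if house is None:
            raise ValueError(f"no household record for {person[key]}")
        # Household values are repeated on each member's record.
        flat.append({**house, **person})
    return flat


households = [{"HHID": "010120003", "ROOMS": 2}]
persons = [
    {"HHID": "010120003", "LINE": 1, "AGE": 38},
    {"HHID": "010120003", "LINE": 2, "AGE": 11},
]
flat = build_flat_file(households, persons)
print(flat[1])
```

Each flat record carries the household values again, which is the extra storage cost mentioned above; in exchange, analysts do not need to merge files themselves.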

Naming files

As soon as a file is created, it must be named, and it is worth deciding beforehand how all files are to be named. This means, at a minimum, adopting a naming convention.

It is always recommended, for one thing, that names reflect the file contents. The version number of the file can also be included. (In Chapter 3 we see how different versions may be generated.) For child labour surveys, specifically, it is recommended that the following information is included in file names:

• file content (data, documentation, questionnaire, etc.);

• to whom the file relates (child, parent, both);

• version number;

• relevant country; and

• file use (general or restricted).


Such a standard naming convention greatly assists users in choosing the correct file from the dataset. In general, it facilitates processing the contents, often at a much later date, of computer-based storage systems that may contain thousands of files. Other information, such as survey year and survey round, may also be included in the file name. However, there are generally restrictions on the number of characters used in naming a file, with most computer systems allowing 8.3 structures—i.e., eight characters for the actual file name and three characters for the file extension (e.g. MY_FILE.DOC). The extension is usually allocated by the package that created the file. (MSWord, for example, will use DOC as the extension.) In other words, only eight characters can be manipulated to express as much information as possible about the nature of a file. In view of these limitations, the following naming convention is recommended.

All filenames should start with a country code (Annex II lists the two-character codes) followed by the abbreviations C for child or P for parents or F for family (both parents and children) and H for house (dwelling). The version number follows and, since more than nine versions can easily evolve over time, two characters should be used. G indicates the file is available for general use, and R marks it as restricted. Finally, the eighth character—D, Q, or C representing data, questionnaire, or codebook respectively—indicates the file contents. If any field in the file name is not applicable, that should be replaced with an underscore (_), thereby simplifying manipulations during computer processing. In summary, when naming files according to an 8.3 structure, use the following convention.

The first eight characters:

• first and second characters: country code

• third and fourth characters: child/parent (person), house (dwellings) or both

    • C_: for child only
    • F_: for child and parent (family)
    • H_: for house
    • P_: for parent only
    • FH: a single file containing information on child, parent and house (dwellings) combined

    Note: an underscore (_) is used to fill the blank space of the fourth character

• fifth and sixth characters: version number

    • 01: first or original version
    • 02: second version, and therefore not the original version
    • and so on…

• seventh character: file use

    • G: for general (public) use
    • R: for restricted (internal) use (in case of data only)

• eighth character: file contents

    • C: for codebook (normally associated with an ASCII data file)
    • D: for data
    • L: internal consistency check rules
    • Q: for questionnaire
    • S: for summary of classification of occupations
    • V: for variable list

The last three characters, following the dot, denote the type of file (proprietary or otherwise).

The following examples should clarify the convention:

BDC_01RD.DOC/SAV/POR

A file containing data about children in Bangladesh, and which is the original version, might be named BDC_01RD, where BD stands for Bangladesh; C stands for child; _ indicates there is no information regarding the house or dwelling; 01 marks this file as the first version; R shows that the file is restricted; and D stands for data. The corresponding public use data file derived from the above would carry the name BDC_01GD. The associated questionnaires would be called BDC_01GQ. (Since the questionnaires are for general public use, they would always include the “G” code.) The extension would say whether it is a package-specific data file or documentation. For example, an SPSS data file takes a SAV or POR extension, while documentation in MSWord takes a DOC extension.

UAFH04RD.[xxx]

Similarly, a file containing data about parents, children, and their house in Ukraine, and which is the fourth version, can be named UAFH04RD. The public use version would then be named UAFH04GD. Associated questionnaires are named UAFH04GQ, while a variable description file is named UAFH04GV. A summary classification of occupations file would be named UAFH04GS. All the file names would include appropriate three-character extensions.

PAFH02RD.txt

An ASCII data file that contains data about parents, children, and households in Panama, and is the second version, can be named PAFH02RD.txt, and the public use version would be PAFH02GD.txt. The associated codebook file should be named PAFH02GC with a TXT or DOC extension, depending on file type.
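The convention can also be checked mechanically. The Python sketch below builds and parses 8.3 file names using the codes listed above; the function names and error messages are illustrative assumptions, and the country code is only checked for shape (two capital letters), not against Annex II.

```python
SUBJECTS = {"C_": "child", "F_": "family", "H_": "house",
            "P_": "parent", "FH": "family and house"}
USES = {"G": "general", "R": "restricted"}
CONTENTS = {"C": "codebook", "D": "data", "L": "consistency check rules",
            "Q": "questionnaire", "S": "occupation classification summary",
            "V": "variable list"}


def build_name(country, subject, version, use, content, extension):
    """Compose an 8.3 file name from its fields, validating each one."""
    if not (len(country) == 2 and country.isalpha() and country.isupper()):
        raise ValueError("country must be a two-letter upper-case code")
    if subject not in SUBJECTS or use not in USES or content not in CONTENTS:
        raise ValueError("unknown subject, use, or content code")
    if not 1 <= version <= 99:
        raise ValueError("version must be between 1 and 99")
    if not 1 <= len(extension) <= 3:
        raise ValueError("extension must have one to three characters")
    return f"{country}{subject}{version:02d}{use}{content}.{extension}"


def parse_name(filename):
    """Split an 8.3 file name into its fields, rejecting names off-convention."""
    stem, dot, extension = filename.partition(".")
    if len(stem) != 8 or not dot:
        raise ValueError("expected an 8.3 file name")
    fields = {
        "country": stem[0:2],
        "subject": stem[2:4],
        "version": int(stem[4:6]),
        "use": stem[6],
        "content": stem[7],
        "extension": extension,
    }
    # Rebuilding the name re-runs all validation checks.
    if build_name(**fields) != filename:
        raise ValueError("file name does not follow the convention")
    return fields


print(build_name("BD", "C_", 1, "R", "D", "SAV"))  # BDC_01RD.SAV
print(parse_name("UAFH04GQ.DOC")["version"])       # 4
```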

Creation and naming of variables

Once a survey is completed, a set of variables is created from the questionnaire (primary variables). At a later stage, manipulating the primary variables may produce derived variables. Unless conventions are followed, naming these variables can prove awkward. Here are a few rules of thumb:

• Variable names should convey the meaning of the data content they represent. Any potential analyst should be confident that the same variable names apply to the same data. For example, if two questions are used to determine the work status of a respondent (e.g. enquiring as to both current work and usual work), variables representing these questions should never be named “work1” and “work2”, since this leaves it unclear which variable refers to which question.

• Ideally, questionnaires should be prepared such that each question comes with a pre-designated variable name. For example: “How old are you?” would be annotated with the variable name AGE. This type of questionnaire is often referred to as an annotated questionnaire.

• As with files, naming variables often depends on statistical packages that restrict the number of code characters to eight or fewer (SPSS for example).² The prevailing computing environment in any particular country will also influence naming conventions.

• Each answer in a multiple-choice question should also be assigned a variable name. For example, if question number 9 has two multiple-choice answers, the variables may be named Q9A and Q9B.

Several different methods may be applied for naming variables.³

One-up numbers. In this approach, variables are numbered sequentially. Thus if there are 100 variables in a data file, they can be numbered from 1 to 100. However, many statistical software packages (e.g. SPSS) do not allow a digit to be the first character in a variable name, so a letter can be added as the first character (e.g. in SPSS, variable names will be automatically assigned either v1 to v100 or var00001 to var00100). Variable names can be changed manually afterwards. However, the problem with this method is that it is often impossible to comprehend the meaning of the variable or to match some variable names with the respective questions without additional labels. Errors can easily happen if variables are named in this way.

Question numbers. A possible alternative to the one-up number method is to name variables with the respective question number; for example, Q1 is the variable that corresponds to question 1. Since multiple answer questions would require more than one variable to be created for a single question, a letter can be appended after the question number (Q4a, Q4b, etc.). Since all child labour questionnaires consist of multiple sections, the first letter can be chosen to represent the section (A1, A2, … B4a, B4b, etc., where A and B are two different sections). Again, additional labels can also be used to explain the actual meaning of the variables.

Mnemonic names. In this method, variables are named with words representing the concept of the variable. However, the same word may offer different meanings to different users. Also, the maximum of eight permissible characters in the variable name may impose severe restrictions on conveying the actual meaning. It is also hard to assign the same word manually to different variables conveying the same type of meaning.

Prefix, root, suffix systems. A possible alternative to the mnemonic method of constructing variable names is to use predefined abbreviated words and join them as prefix, root and suffix. For example, all variables related to children may use CH as a prefix; WW and WY, to denote last week’s work and last year’s work respectively, as a root; and GRP, to group cases, as a suffix.

Derived variables. As mentioned earlier, derived variables are created from primary variables or by combining multiple primary variables. For example, age may be a primary variable, but analysts might need information about children in the 5- to 9-year age group. Information about individual children’s ages can then be grouped to form the derived variable “age group”. It is always recommended that primary and derived variables are distinguishable. For a variety of reasons, it is also advised that public use datasets should not contain large numbers of derived variables: they are costly in terms of data-processing time; if they are to be properly used, they need adequate explanations; and the datasets may become too large and unwieldy. Moreover, data analysts may not have occasion to use these derived variables at a later date, and prefer to tailor derived variables to their own requirements.

² See Annex I for the maximum number of characters allowed in naming a variable in some statistical packages.

³ This follows the approaches outlined in: Inter-university Consortium for Political and Social Research (ICPSR), Guide to Social Science Data Preparation and Archiving. Retrieved from http://www.icpsr.umich.edu/access/dpm.html

Remember that the weight factor included in a dataset is not a variable from the questionnaire, and it should be treated separately. It should be named WEIGHT, using the naming convention applied to a primary variable.

Individual countries are of course free to choose the naming convention appropriate to their variables. With the aim of establishing international consistency with regard to child labour data, however, the following rules are recommended:

• use the question-number method in naming variables, with the character representing the section appearing as the first character in the variable name;

• use the prefix method in naming derived variables;

• use capital letters for primary variables, when possible;

• use lower-case characters for derived variables; and

• the weight factor should be named according to the rules for primary variables, but at the same time be distinguished from a primary variable.
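The recommended rules can be illustrated with a small Python sketch. Everything named here is an assumption for illustration: the eight-character limit stands in for whatever the chosen package allows, A3 is a hypothetical question holding age in years, and only the 5-9 age band comes from the text above (the other bands are invented for the example).

```python
MAX_LENGTH = 8  # assumed package limit; actual limits vary by package


def primary_name(section, question, answer=""):
    """Question-number name in capitals, e.g. ('B', 4, 'a') -> 'B4A'."""
    name = f"{section}{question}{answer}".upper()
    if len(name) > MAX_LENGTH or not name[0].isalpha():
        raise ValueError(f"invalid variable name {name!r}")
    return name


def derived_name(prefix, root, suffix=""):
    """Prefix-based name in lower case, e.g. ('CH', 'WW', 'GRP') -> 'chwwgrp'."""
    name = f"{prefix}{root}{suffix}".lower()
    if len(name) > MAX_LENGTH or not name[0].isalpha():
        raise ValueError(f"invalid variable name {name!r}")
    return name


def age_group(age):
    """Band an age; only the 5-9 band comes from the text, the rest are assumed."""
    for low, high in ((5, 9), (10, 14), (15, 17)):
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"


record = {primary_name("A", 3): 7}  # hypothetical: A3 holds age in years
record[derived_name("CH", "AGE", "GRP")] = age_group(record["A3"])
print(record)  # {'A3': 7, 'chagegrp': '5-9'}
```

Keeping primary names in capitals and derived names in lower case makes the two kinds of variable distinguishable at a glance, as recommended above.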

Variable labels

It is more difficult to understand a dataset if attributes associated with the variables (for example, the literal question asked) are not properly described inside that dataset. People who want to perform secondary analyses of child labour surveys prefer that all information be contained in the dataset. One sign-posting method is to provide an adequate label for each variable.

Since nowadays almost all data processing software (e.g. SPSS) provides the option to add labels, this option should be used to describe each variable. If no suitable labels can be found, the literal question together with the appropriate question number should be used as a label. If the variable is a derived variable, a label can be added to state which variable or variables were used to create it and, if possible, the reason for creating it.
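A minimal Python sketch of this labelling approach is given below. It does not use any statistical package; the helper names, the variable names (A3, B4A, chagegrp), and the label wording are hypothetical and serve only to show the fallback label, the derived-variable label, and a check for variables left without a label.

```python
def question_label(number, text):
    """Fallback label: the question number followed by the literal question."""
    return f"{number}. {text}"


def derived_label(sources, reason=""):
    """Label a derived variable with its source variables and, optionally, why it exists."""
    label = "Derived from " + ", ".join(sources)
    return f"{label} ({reason})" if reason else label


def unlabelled(variables, labels):
    """Return the variables that still lack a non-empty label."""
    return [name for name in variables if not labels.get(name, "").strip()]


labels = {
    "A3": question_label("A3", "How old are you?"),
    "chagegrp": derived_label(["A3"], "age bands for tabulation"),
}
print(labels["chagegrp"])  # Derived from A3 (age bands for tabulation)
print(unlabelled(["A3", "B4A", "chagegrp"], labels))  # ['B4A']
```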

Coding

A statistical software package is used to analyse the information collected in the field. Thus, the information needs to be transformed into data that the software can handle. To this end, each answer is coded, and the process that determines which symbol represents what item is known as coding. Coding should be undertaken during the survey design process, and it is important that the data processors themselves are involved.

Child labour surveys should be pre-coded before data entry. All possible values (including those such as “not available”, “not applicable”, “refused to answer”) ought to be included in the questionnaire, and interviewers should receive proper training. These measures will greatly reduce the time that data entry or data processing personnel need to spend on coding. Following are a few guidelines drawing on the ICPSR Guide to Social Science Data Preparation and Archiving4 and the Audience Dialogue Survey analysis.5

4 ibid.

5 Audience Dialogue: Survey analysis. Retrieved from http://www.audiencedialogue.org/

• Should the need for additional codes arise (for example, assigning a specific code for open-ended questions), this is to be carried out with proper consideration to the coding scheme defined during the questionnaire design. It is particularly important to ensure that there are no overlaps between code categories and that each code fits into only one category.

• For open-ended questions, major categories/classifications should be identified by examining the number of responses and should be used for additional coding. The meaning of each code should be clearly documented. During the additional coding procedure it is also good practice to preserve as much information as possible in the data as they are collected (i.e. no collapsing or bracketing etc.).

• With occupational coding, it is important to follow a standard format defined by an accepted standards institution—e.g. the International Standard Classification of Occupations, ISCO-88—and to use as many digits and, therefore, include as many details as possible.

• Specify all possible missing values (such as “no response” or “not applicable”). Assign the same value (99, for example) to each type (e.g. “not applicable”) in the same dataset. One of the following factors is usually responsible for missing values in child labour survey data, and a different code should be assigned to each case.

•• Refused to answer. A child or parent did not answer the question.

•• Don’t know. A child or parent was unable to answer the question. The respondent might not have had any concept of time or arithmetic, for example, and replied “Don’t know” to the question: “What was your total income last year?”. (Respondents should be discouraged from answering, “Don’t know”.)

•• Not applicable. For some valid reason, the question was not asked. Following the response “Not working”, for example, any questions related to income were not asked.

It has been observed in many child labour surveys that missing values were left blank or coded with a “zero” that was not pre-defined. It is of paramount importance, therefore, that all cases are assigned different codes during the coding process; and these should then appear pre-coded in the questionnaire. If for any reason missing values are assigned codes, the documentation should include clear descriptions.
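A minimal Python sketch of this guideline follows. The specific code values (97, 98, 99) and the yes/no answer codes are assumptions chosen for illustration; the guide only requires that each type of missing value has its own documented code, used consistently throughout the dataset. The sketch separates documented missing codes from undefined entries such as blanks and stray zeros.

```python
# Hypothetical missing-value codes, one per type of missing value.
MISSING_CODES = {
    97: "Refused to answer",
    98: "Don't know",
    99: "Not applicable",
}
# Hypothetical pre-coded valid answers for one question.
VALID_CODES = {1: "Yes", 2: "No"}


def classify(value):
    """Classify one raw entry as valid, a documented missing code, or undefined."""
    if value in VALID_CODES:
        return "valid"
    if value in MISSING_CODES:
        return "missing: " + MISSING_CODES[value]
    # Blanks (None), zeros that were never pre-defined, and wild codes end up here.
    return "undefined"


# Hypothetical raw entries for the question, in data entry order.
raw_answers = [1, 2, 99, None, 0, 98, 97]
undefined_positions = [
    i for i, value in enumerate(raw_answers) if classify(value) == "undefined"
]
print(undefined_positions)
```

Entries reported as undefined are the ones that need to be traced back to the questionnaire and given a proper code.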

It is often quite difficult to code such items as occupations and industries. Where codes are developed, some classifications (occupation, for example) may be missed, making the jobs of enumerators and data processors even more difficult. Consequently, countries are encouraged to consult the following resources for help:

International Standard Classification of Occupations (ISCO)6

International Classification of Status in Employment (ICSE)7

International Standard Industrial Classification of all Economic Activities (ISIC)8

Classifications of Occupational Injuries9

6 Retrieved from http://www.ilo.org/public/english/bureau/stat/class/isco.htm

7 Retrieved from http://www.ilo.org/public/english/bureau/stat/class/icse.htm

8 Retrieved from http://www.ilo.org//public/english/bureau/stat/class/isic.htm

9 Retrieved from http://www.ilo.org/public/english/bureau/stat/class/acc/index.htm

This list, which is not exhaustive, can be accessed through the Bureau of Statistics web page.10 Child labour classifications, the relevant categories varying from country to country, are not yet in a finalized form, and additional coding schemes may need to be developed.

Consistency and logic check rules

It is important to develop as many logic check rules as possible by going through the questionnaire. This requires a detailed understanding of the questionnaire and its flow, and will greatly help computer programmers at later stages. First, consistency check rules have to be generated by studying the routing of each question (e.g., if the answer to question 20 is “yes”, enter skip pattern as answers to questions 21 and 22). Sample responses from questionnaires that suggest other consistency checking rules include these:

• A child aged younger than six years is reported as having completed secondary school.

• A child is reported as not working but as nevertheless bringing cash into the household.

• A child did not work, but reported a work-related injury.

Another type of logic check rule needs to be developed for cases where the data contain a legal value that nevertheless does not look right. For example, a parent is reported as having 11 children. This may be true, but it could well represent a typographical error, and the correct value may more likely be 1 child. The corresponding rule could read: “Flag cases where parents reported having more than 10 children”. These flagged cases then need to be checked manually.
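The rules described in this section can be written down as data and applied to every record. The following Python sketch is one possible way of doing so; the field names, the skip code (99), and the education code (3) are hypothetical. Consistency rules report errors, while plausibility rules only flag cases for manual checking.

```python
SKIP = 99       # hypothetical code entered for questions skipped by routing
SECONDARY = 3   # hypothetical code for "completed secondary school"

# Each rule is (description, function returning True when the record FAILS).
CONSISTENCY_RULES = [
    ("q20 is yes but q21/q22 are not coded as skipped",
     lambda r: r["q20"] == "yes" and (r["q21"] != SKIP or r["q22"] != SKIP)),
    ("child under six reported as having completed secondary school",
     lambda r: r["age"] < 6 and r["education"] == SECONDARY),
    ("child not working but bringing cash into the household",
     lambda r: not r["worked"] and r["cash_income"] > 0),
    ("child did not work but reported a work-related injury",
     lambda r: not r["worked"] and r["work_injury"]),
]

# Legal but suspicious values: flagged for manual checking, not treated as errors.
PLAUSIBILITY_RULES = [
    ("parent reported having more than 10 children",
     lambda r: r["parent_children"] > 10),
]


def check(record):
    """Return (errors, flags) as two lists of rule descriptions."""
    errors = [text for text, fails in CONSISTENCY_RULES if fails(record)]
    flags = [text for text, fails in PLAUSIBILITY_RULES if fails(record)]
    return errors, flags


# Hypothetical records.
suspect = {"q20": "yes", "q21": SKIP, "q22": SKIP, "age": 5,
           "education": SECONDARY, "worked": False, "cash_income": 0,
           "work_injury": False, "parent_children": 11}
clean = {"q20": "no", "q21": 1, "q22": 2, "age": 12,
         "education": 1, "worked": True, "cash_income": 50,
         "work_injury": False, "parent_children": 4}

print(check(suspect))
print(check(clean))
```

Keeping the rules in a list makes it easy to add new ones as the questionnaire is reviewed, and the descriptions can be reused in the processing log.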

Imputations

Once consistency checks are performed, many missing values can be replaced following imputation rules. Imputations estimate what would otherwise be missing values, where survey respondents failed to provide responses to given items. One rule might indicate, for example, that a person’s income can be imputed by generating a formula involving age, type of work, wage rate, and number of days worked in a particular geographical area. As many of these formulae as possible should be developed by going through the questionnaire.

It must be decided how imputed variables are to be incorporated in the dataset, and, where needed, relevant computer programs may be developed and tested. For simplicity, either a completely new variable can be created that includes imputed values in place of missing codes, or missing codes can be replaced with imputed values and accompanied by a flag variable with a value of 1 for imputed and 0 if not.
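As an illustration only, the Python sketch below creates a new imputed variable together with a flag variable, leaving the original variable untouched. The missing-value code (99999), the field names, and the rule "income = daily wage × days worked" are assumptions made for the example; an actual imputation formula would be agreed with the survey specialists.

```python
MISSING_INCOME = 99999  # hypothetical missing-value code for income


def impute_income(record):
    """Return (income_imp, income_flag) without changing the original value.

    income_flag is 1 when the value was imputed and 0 when it was reported.
    The rule used here (daily wage times days worked) is illustrative only.
    """
    if record["income"] != MISSING_INCOME:
        return record["income"], 0
    return record["daily_wage"] * record["days_worked"], 1


# Hypothetical records.
records = [
    {"id": 1, "income": 1200, "daily_wage": 10, "days_worked": 120},
    {"id": 2, "income": MISSING_INCOME, "daily_wage": 8, "days_worked": 100},
]

for r in records:
    r["income_imp"], r["income_flag"] = impute_income(r)
    print(r["id"], r["income"], r["income_imp"], r["income_flag"])
```

Keeping the original variable alongside the imputed one lets analysts decide for themselves whether to use the imputed values.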

Weights

Since all child labour surveys are sample surveys, weights need to be calculated in order to produce national estimates. In choosing a sampling procedure, we should ask whether standard errors based on simple random sampling are appropriate, or whether more complex methods are required. If weights are required, they should be described. A clear indication of the response rate should be provided in the documentation, indicating what proportion of those sampled actually participated in the survey. The retention rate, if applicable, should also be noted. Weights are usually developed by specialists, and it is essential that a weighting formula with descriptions of all its elements is obtained well before data processing begins.
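The following Python sketch shows, under simplifying assumptions, how a response rate and a basic weight might be computed and documented. It assumes a stratified design with equal selection probabilities within each stratum and a simple non-response adjustment; the strata and all figures are hypothetical. It is not a substitute for the weighting formula supplied by the sampling specialists.

```python
# Hypothetical strata: households in the population, sampled, and responding.
strata = {
    "urban": {"population": 50000, "sampled": 500, "responded": 450},
    "rural": {"population": 150000, "sampled": 500, "responded": 400},
}


def weight(stratum):
    """Design weight (population / sampled) times a non-response
    adjustment (sampled / responded)."""
    design = stratum["population"] / stratum["sampled"]
    adjustment = stratum["sampled"] / stratum["responded"]
    return design * adjustment


total_sampled = sum(s["sampled"] for s in strata.values())
total_responded = sum(s["responded"] for s in strata.values())
response_rate = total_responded / total_sampled

# With these weights, responding households add up to the population total.
estimated_households = sum(weight(s) * s["responded"] for s in strata.values())

for name, s in strata.items():
    print(name, round(weight(s), 2))
print(round(response_rate, 2), round(estimated_households))
```

Each element of the calculation (selection probability, adjustment, response rate) is the kind of item that should be described in the documentation.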

Documentation

Documentation should be as much a part of overall planning as is analysis. It has to be decided who is responsible for keeping a log of what is happening during data processing, including such considerations as problems encountered, major decisions taken, and any imputation method adopted. A more detailed account of this process is presented in “Final documentation”, (Section 3.8).

2.4 Selection of hardware and software

Marshalling resources for a child labour survey strongly depends on what hardware, software, and national statistics office personnel are available. Given those constraints, the following aspects must be considered when selecting hardware and software for data processing:

• computers and printers

• data entry and data cleaning

• statistical processing and tabulations

• documentation and other tabulations

• software utility tools

• automation tools (to perform repeated tasks)

•• tools for transferring files among different computers

•• virus-checking software

• hardware accessories

•• cables, disks, CD, UPS, etc.

Computers and printers

Since data will be entered in batches and probably in parallel, one PC is needed for each data entry operator. Different data entry operators, however, can often share the same computer at different times. Printers capable of printing “landscape” format are also necessary. If line printers or dot matrix printers are used, they should have a capacity of 120 characters per line.

A Pentium computer with a 1GB hard disk is more than enough for data processing and temporary storage of child labour survey data. A permanent computer is also needed where the final dataset will be archived. It is highly recommended that the computer used for permanent storage of data is not the same one used for day-to-day work, even where this computer may be a central one, shared by different sections in the national statistics offices to store their data on a permanent basis.

Data entry and data cleaning

A great number of staff-hours are sometimes devoted to developing custom software for checking data entry errors. A better solution can be to use automatic data entry software, most of which has some form of built-in checking facility. Over the years, a variety of organizations have developed data entry software, and many national statistics offices use one or all of the following programs for data entry and initial data validations (this list is not exhaustive):

Blaise.11 A flexible, relatively powerful system developed by Statistics Netherlands for computer-assisted interviewing, data entry, and data editing, Blaise is a software system for survey processing on microcomputers. Blaise also simplifies subsequent processing of the collected data. This software is being used primarily by European Union countries.

IMPS.12 Developed by the US Census Bureau, the original DOS-based Integrated Microcomputer Processing System has been superseded by a Windows-based version. Many developing countries are using this software for data entry.

ISSA.13 Integrated Systems for Survey Analysis is produced jointly by SerPro Ltd of Chile and Macro International of the USA. A number of developing countries are using this software for data entry. Evidence suggests that ISSA does not have a wide user base in SIMPOC countries and offers limited support in the form of training courses and documentation.

EpiInfo.14 This word-processing, database, and statistics program for public health on IBM-compatible microcomputers is produced by the Centre for Disease Control and Prevention, in the USA. Many developing countries are using this software for data entry.

CSPro.15 The Census and Survey Processing System was also developed by the US Census Bureau. Incorporating many features of IMPS, ISSA, and EpiInfo, CSPro is designed to replace both IMPS and ISSA, eventually.

Detailed evaluation of the above software lies beyond the scope of this manual. In general, however, availability of financial resources, trained personnel, and microcomputers are all important considerations in choosing any child labour survey software.

Where no other data entry software is available and trained national statistics office personnel are lacking, CSPro (see above), public domain software from the US Census Bureau, can be used for entering, tabulating, and mapping survey data. This software, together with its documentation, is free online, although online registration may be required. The US Census Bureau can arrange training programmes, but charges for them. According to the software documentation, it is possible to handle child labour survey data with this software. Nevertheless, although some national statistics offices reportedly use versions of this software, those versions have yet to be tried on child labour surveys specifically, and the training may be worth the cost.

An alternative is Blaise (see above), a user-friendly, high-speed data entry and data manipulation software with an interactive editing facility and survey management capabilities. The software is not free, but is offered at a discounted price to developing countries. However, it has a number of characteristics that can make it harder for non-programmers to learn. One such characteristic is the use of advanced programming concepts such as data typing and procedure parameters. Another is the lack of structured forms to aid in defining questionnaire forms and variables. Blaise is not widely used outside Europe, moreover, so an established user base in developing countries does not yet exist.

11 Details may be obtained from Statistics Netherlands http://neon.vb.cbs.nl/blaise

12 Details may be obtained from U.S Census Bureau http://www.census.gov/ipc/www/imps/index.html

13 More information is available at SERPRO http://www.serpro.com/about.asp

14 Details may be obtained from Centre for Disease Control and Prevention http://www.cdc.gov/epiinfo/

15 Details may be obtained from U.S Census Bureau http://www.census.gov/ipc/www/

In any case, software should be tested beforehand, and data entry operators should be both trained with the software and familiar with child labour surveys before actual data entry.

Processing and tabulations

Evidence suggests that virtually all national statistics offices have access to either SAS or SPSS or both statistical packages. Where that may not be the case, national statistical offices should try to adopt one standard statistical software package (e.g. SPSS, SAS, or Stata). Where that is not possible, data entry software can also be used for child labour survey data processing purposes. (Data analysis can be performed using EpiInfo, for example.) See Annex I for a comparison of the SAS, SPSS, and Stata statistical packages.

Documentation and other tabulation

Microsoft Office Suite, comprising Word, Excel, and Access, is being used by many statistical offices, and is adequate for creating the appropriate documentation, including creation of the questionnaires. Both MS Excel, a spreadsheet program, and Access, a database program, are user-friendly means of preparing tables. TPL, table generation software from QQQ software,16 can also be used. Again, availability of resources and trained personnel should be the main criteria for choosing a particular software.

Software utility tools

The following list of software utilities is not an exhaustive one, and many other utility tools may currently be in use in various countries.

Databases. General users are often unfamiliar with statistical packages, and they might prefer to have a subset of the data (or even the entire dataset) in a database format. Many statistical packages allow data to be saved in a database format, and database programs such as Microsoft Access are sometimes quite helpful.

File compression software (e.g. WinZip, PKZIP, gzip). This software is used for compressing files. It is sometimes possible to reduce file sizes by as much as 80 per cent or more using these kinds of software. Compression is useful where a hard disk is short of storage space or when using floppy disks to transfer files between computers.

Compiling software (e.g. Visual Basic, FoxPro, C++). This is programming software other than that incorporated in the statistical package. Compilers can be used to develop a user-friendly front-end for data entry, for example, or to produce customized, in-house automation software for performing repetitive tasks.

Conversion software. Utility software such as STAT Transfer and DBMScopy converts files from one specific statistical package to another. SAS proc convert statements can easily convert SPSS portable files into SAS datasets.

File transfer software. This is software that allows files to be transferred between computers, whether networked or not. These utilities include Direct Cable Connection, which is included in the Windows operating system, or LL3 for non-Windows-based transfers. FTP programs are also helpful in transferring files among networked computers.

Virus checking and recovery software. Programs such as Norton Utilities, McAfee Virus Shield and Scan Disk (which may or may not come with the operating system) not only provide protection against virus attacks, but can also sometimes be useful in recovering corrupted files.

Hardware accessories

Apart from computers, required resources include hardware accessories such as cables, floppy disks, CDs, uninterruptible power supplies, air-conditioning systems, and dehumidifiers. Associated problems will vary from country to country.

2.5 Identification of personnel

Human resources are required in the following areas:

Data entry personnel. These individuals are responsible for such tasks as data entry and initial validations.

Although some countries are trying to use scanning technology coupled with optical character recognition systems for data entry, most child labour survey data is still entered manually. Data entry personnel should be familiar with data entry software, as well as with questionnaire design. Ideally, they should have previous data entry experience. At a minimum, data entry operators should be familiar with the computer keyboard and have typing skills.

A rule of thumb: 10 data entry operators working in parallel for about 40 hours a week are needed to enter and make preliminary validation of data regarding 8,000 households over a period of about 2 months.

Using the CAPI method of data collection eliminates the need for such data entry operators.
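The rule of thumb for manual data entry can be turned into a rough planning calculation. The Python sketch below assumes that "about 2 months" means 8 working weeks and that effort scales linearly with the number of households; both are assumptions made for illustration. Under them, 10 operators × 40 hours × 8 weeks = 3,200 operator-hours for 8,000 households, or about 24 minutes per household.

```python
# Rule of thumb: 10 operators x 40 hours/week x 8 weeks for 8,000 households.
# Integer minutes are used to avoid rounding problems.
MINUTES_PER_HOUSEHOLD = (10 * 40 * 8 * 60) // 8000  # = 24


def operators_needed(households, weeks, hours_per_week=40):
    """Estimate the number of data entry operators, rounded up.

    Assumes effort grows linearly with the number of households.
    """
    required = households * MINUTES_PER_HOUSEHOLD
    available_per_operator = weeks * hours_per_week * 60
    return -(-required // available_per_operator)  # ceiling division


print(MINUTES_PER_HOUSEHOLD)
print(operators_needed(8000, 8))    # the rule-of-thumb case
print(operators_needed(12000, 8))   # a larger survey, same schedule
print(operators_needed(8000, 4))    # the same survey in half the time
```

A longer questionnaire would raise the minutes needed per household, so the constant should be revised after the pilot survey.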

Data processing personnel. These persons should be thoroughly familiar with the survey questionnaire, data processing activities, editing, and the necessary tabulations. They need to be familiar with statistical packages, and should be capable of finding errors and correcting certain types of errors in the dataset. They also have to be capable of performing repetitious tasks efficiently.

Computer programmer. This person should be able to develop programs, either in the software-specific format or by using other computer programming languages, based on consistency checking rules. Ideally, the person should also be capable of understanding the survey questionnaire and developing the consistency rules. If any programmers are used in the questionnaire design, it is strongly recommended that they are subsequently included in the programmer team.

Computer system administration. This person should be a competent computer systems administrator, familiar with managing stand-alone or networked systems, printers, file transfer methods, virus-checking systems, back-up operations, and corrupted-file recovery methods.

Supervisor. This position requires a highly qualified data processing specialist with programming experience, capable of overseeing the entire data processing operation. He or she should have previous experience managing survey or census data processing, and be familiar with the software packages used to process the child labour survey data.

It is likely that one and the same person could perform a number of the activities described above. Where this is the case, the supervisor should decide which of the activities the same person can perform, and specify how that person’s time should be allocated.

2.6 Scheduling the data processing

Time is always a crucial factor in child labour surveys. Administrative procedures, non-submission of progress reports to funding agencies, non-availability of resources, and training of personnel are among the factors that can delay data capturing and data processing. Plans should stipulate that all data processing activities be completed within three months of the data entry starting date, if not sooner. Other major considerations at this time are identifying tasks that can be conducted in parallel and ascertaining the availability of human and machine resources.

In what follows, we present guidelines for processing 8,000 household records with about 50 questions. A greater number of household records or questions will normally mean that data entry, cleaning, and error correction require extra resources, including more time. Fewer records or questions, conversely, will require less. The following time allocations should serve as rules of thumb for stand-alone child labour surveys:

• about one month for data entry, including additional coding; and

• about one month for data validations.

2.7 Data preservation strategy and access procedure

Surveys often conclude with the preparation of tables. If the micro-data are not properly archived, they may eventually become unusable where, for example, data are stored in a package-specific format, and the package used to create the data is superseded by a newer version. Planning must take into account data storage and strategies for how this information can later be accessed.

When one is establishing data preservation and access procedures, certain considerations require careful attention:

Hardware. Sometimes a shared machine, where other datasets are stored by a national statistics office, is used. This may be a workstation with offline storage capacity or any server where data are stored. The minimum requirement is a Pentium PC that is not used for day-to-day work.

Automation software. This may be in-house, purpose-built software, and will vary depending on the hardware platforms available in the individual country. This software is used to perform repetitive tasks, for example checking that all files are transferred and labelled.

Directory structure. Design a structure for storage of data, documentation, and programme files. Remember that files will be created using a variety of software packages. It is not good practice to store all files related to the same dataset in a single directory. (A model directory structure is presented in Chapter 4.)

Access policy. Decide who is allowed to access the datasets. A data access request may come from someone within the department, another ministry or organization, or a complete outsider. Access policy, in general, should be to make data available to all users. However, certain data may be restricted to a particular group of users only.

Backup policy. Child survey data backup procedures will probably resemble those which exist for an organization’s data in general. In any case, aspects to consider when backing up include these:

• which files are to be backed up;

• how often they are to be backed up (daily, weekly, monthly, etc.);

• what backup medium is to be used (CD, tape, etc.); and

• what procedure is to be used: who is responsible, and how the backups are to be performed. Backup procedures during and after data processing are different:

•• During the processing, files are incomplete and backups are only short term (yet the latest versions can still be recovered if the system crashes). Typically, these temporary files are small; child labour data processing files can normally be accommodated on a couple of floppy disks.

•• Following data processing, permanent backups are needed.

Dissemination procedure. Access policy determines in part how the data will be disseminated. Dissemination procedures should be simple. Approaches to consider include online dissemination through the Internet or an intranet, and offline dissemination using diskettes or CD-ROMs. Detailed procedures need to be formulated.

All policy planning activities can be performed in parallel with survey design and field data collection. Policy planning should be completed before field data collection so that data can be entered immediately afterwards.

3. Data processing

3.1 Introduction

In many respects, survey data processing has remained unchanged over the past few decades. With the invention of more sophisticated technology, however, it has become faster and more reliable.

Data are first collected using manually completed questionnaires. Next, a manual count of the completed questionnaires is cross-checked with the number of persons interviewed. Then the data are coded and sent for data entry. Data entry is accomplished as quickly as possible using operators who automatically enter what they see. Some countries may use scanning technology followed by optical character recognition (OCR) procedures, where the responses to questionnaires are scanned to enter the data and then OCRed to identify the codes in a manner that statistical software can handle. If CAPI is used for data collection, this kind of data entry is unnecessary. But most child labour surveys are conducted using the PAPI method, so the data must be captured as quickly as possible to produce a preliminary count for the survey.

As mentioned above, data processing represents the second stage in the survey process. Data are first received, usually from multiple sources, and then converted to a format suitable for the following stage, which is analysis.

Data may be collected on paper or as digital information. Similarly, data processing may be either electronic or non-digital. Initial investigations at the country level reveal that data processing activities for child labour surveys are performed electronically, in most cases using personal desktop computers. This guide, then, addresses only electronic data processing.

Whichever way data are entered, the following phases need attention during data processing:

• data entry and preliminary validations;

• appending, merging, and splitting files;

• data validation (further checking, editing, and imputations);

• final decisions on errors;

• completion of data processing and generation of data file(s);

• preparation of public use datasets;

• final documentations;

• final tabulations;

• conversion of data files to other formats as required; and

• storage of all files.

Child labour survey data should go through these stages at a minimum, and there should be no shortcuts. “Shortcuts” are rarely effective, since they increase the risk of producing unreliable and thus less credible datasets, which then require more time for error correction.

The following sections of this chapter elaborate on these stages. It is recommended that those involved in data processing read this chapter carefully before approaching any processing tasks. Remember that it is also important to include proper weighting factors in the data.

3.2 Data entry and preliminary validations

Depending on the prevailing situation in a given country, data entry may occur either at the field level, under the supervision of a field supervisor, or at survey headquarters.

Where data is entered in batches, each batch should appear in a separate file, rather than together in a single large file. Most importantly, at this stage, the data should be entered right after collection and checked to ensure all information has been entered correctly. Error-detection procedures should be in place, and errors should be corrected immediately.

Data entry operators should not leave their computer while entering data related to a household record. However short the interval away from the task, this practice tends to generate errors.

Once a batch of data is entered, the questionnaires should be bundled, labelled, and stored for future reference.

A variety of common data entry errors with the appropriate precautionary or corrective measures follow:

Data from an old questionnaire entered (e.g. from a pilot survey). This can be verified by referring to interview dates or to the colour of the questionnaire paper (where different colours should be used for pilot surveys and actual surveys). Data entry software should be programmed to recognize this problem.

Wrong data but within range. The sex of a female child is entered as male. Both “male” and “female” are legal, so this type of error evades normal statistical checking. Custom-built programs involving different questions need to be developed for this and tested beforehand.

Wrong data and out of range (wild code). If 1 stands for “male” and 2 for “female”, a value of 3 represents erroneous data. Frequency distribution procedures will flag these cases. Once the error is found, compare the erroneous record with other answers to correct the error.

False logic (consistency). A six-year-old child is reported as having completed secondary school. The child may have responded appropriately, but the answer was typed incorrectly. This type of error may also be caught with custom-built programming involving different questions. Once found, compare the erroneous record with other answers to correct the error.

Data not typed (missing data). Missing data codes for items such as “not applicable” and “refused to answer” may not have been pre-coded in the questionnaire, even though they should have been. Or, during data entry, all such values may have been left as blanks to be filled in later. In either case, all these instances should be found and replaced with the appropriate data code.

Duplicate entry (same records or cases entered more than once). As data are entered in batches, the same cases or records may be entered twice. Checks can be performed to capture this type of error (e.g. refer to unique identification numbers). Once such cases are identified, appropriate actions include deletion (this may not be possible with some data entry software), flagging, or reporting to supervisor.

Unmatched record (for hierarchical files). A “household” record may be followed by a “persons living in the house” record. In this case, two types of error may occur. One is where there are either fewer or more persons’ records than were actually collected. A second is where entire households have been missed. Both cases can be caught and corrected with proper programming.

Dropped cases (interviewed but not entered). Sometimes data are not entered in the computer. Where undesired cases/records are dropped, this should be verified.

Appending error (data entered but not appropriately joined). Where data are entered in batches, programmes need to be developed to join/merge all files. The number of cases must correspond to the sample size or number of persons interviewed (or records collected), whichever is applicable.

The data entry software described above is capable of catching many of these errors. Data entry using interactive software is often referred to as “intelligent data entry”. The “double entry” method—where two different people enter the same data, and the two files are then compared to find any differences—is also used to validate the data entry process. It is recommended that both “double entry” and “intelligent data entry” methods be applied in child labour surveys.
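The “double entry” comparison can be sketched in a few lines of Python. The example assumes that each operator’s file has been read into a dictionary keyed by a unique record identification number; the identifiers, variable names, and values are hypothetical.

```python
def compare_entries(first, second):
    """List every disagreement between two independent entries of the same data.

    Each result is (record_id, variable, value_in_first, value_in_second).
    A record present in only one file is reported under the pseudo-variable
    "<record>", with True/False showing which file contains it.
    """
    differences = []
    for rec_id in sorted(set(first) | set(second)):
        a, b = first.get(rec_id), second.get(rec_id)
        if a is None or b is None:
            differences.append((rec_id, "<record>", a is not None, b is not None))
            continue
        for var in sorted(set(a) | set(b)):
            if a.get(var) != b.get(var):
                differences.append((rec_id, var, a.get(var), b.get(var)))
    return differences


# Hypothetical files keyed by record identification number.
entry_1 = {101: {"sex": 1, "age": 12}, 102: {"sex": 2, "age": 9}}
entry_2 = {101: {"sex": 1, "age": 21}, 102: {"sex": 2, "age": 9},
           103: {"sex": 1, "age": 14}}

for difference in compare_entries(entry_1, entry_2):
    print(difference)
```

Each reported difference is then resolved by going back to the paper questionnaire rather than by choosing one operator’s value.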

Electronic files generated after data entry may be formatted as modules. In this case, check each module separately, revisiting the questionnaires as necessary. Once the data entry is complete, record identification variables should be checked to see that each is unique for each record or case as applicable. If not, to help processors avoid problems merging files, those cases with errors should be corrected by revisiting the questionnaires.
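Several of the checks described in this section (wild codes, duplicate identification numbers, and unmatched household and person records) can be sketched as follows. The record layout, with hh_id, line, and n_persons fields, is hypothetical.

```python
from collections import Counter

LEGAL_SEX = {1, 2}  # 1 = male, 2 = female

# Hypothetical household and person files.
households = [
    {"hh_id": 1, "n_persons": 2},
    {"hh_id": 2, "n_persons": 1},
    {"hh_id": 3, "n_persons": 1},
]
persons = [
    {"hh_id": 1, "line": 1, "sex": 1},
    {"hh_id": 1, "line": 2, "sex": 3},  # wild code
    {"hh_id": 2, "line": 1, "sex": 2},
    {"hh_id": 2, "line": 1, "sex": 2},  # same person entered twice
    {"hh_id": 4, "line": 1, "sex": 1},  # no matching household record
]

# Wild codes: values outside the legal range.
wild_codes = [(p["hh_id"], p["line"]) for p in persons if p["sex"] not in LEGAL_SEX]

# Duplicate entries: identification numbers that occur more than once.
id_counts = Counter((p["hh_id"], p["line"]) for p in persons)
duplicates = sorted(key for key, n in id_counts.items() if n > 1)

# Unmatched records: compare distinct persons per household with the
# number of persons recorded on the household record.
persons_per_household = Counter(hh_id for hh_id, _line in id_counts)
count_mismatch = sorted(
    h["hh_id"] for h in households
    if persons_per_household.get(h["hh_id"], 0) != h["n_persons"]
)
persons_without_household = sorted(
    set(persons_per_household) - {h["hh_id"] for h in households}
)

print(wild_codes, duplicates, count_mismatch, persons_without_household)
```

Checks of this kind only locate the problem records; the corrections themselves still come from the questionnaires.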

3.3 Appending/merging/splitting files

“Appending” is the method of combining multiple files with different observations (consisting of variables) into a single file. The properties of each variable are usually the same in each file. Conceptually, it helps to understand appending as increasing the data size vertically.

“Merging” is the method of combining variables from multiple files into a single file. The variables in each file describe the same observation, usually with different units, such as household and person. Conceptually, it helps to understand merging as increasing the data size horizontally. The files to be merged must have one or more unique identifying variables in common. Merging operations can be of different types, depending on how the files are merged. “Splitting”, also called sub-setting, is the reverse operation: it divides a file into smaller files, either by variables or by observations.
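The two forms of splitting can be sketched in Python as follows. The variable names and the age cut-off are assumptions chosen only for the example; neither is taken from a particular questionnaire.

```python
def split_by_variables(records, keep, id_field="ID"):
    """Keep the identifier plus the named variables (a narrower file)."""
    names = [id_field] + [name for name in keep if name != id_field]
    return [{name: rec[name] for name in names} for rec in records]


def split_by_observations(records, condition):
    """Keep only the records for which condition(record) is true (a shorter file)."""
    return [dict(rec) for rec in records if condition(rec)]


# Hypothetical person file.
person_file = [
    {"ID": 1, "age": 34, "sex": "F", "hours": 40},
    {"ID": 2, "age": 11, "sex": "M", "hours": 15},
    {"ID": 3, "age": 8, "sex": "F", "hours": 0},
]

work_vars = split_by_variables(person_file, ["hours"])
children = split_by_observations(person_file, lambda rec: rec["age"] < 18)
print(work_vars)
print(children)
```

Keeping the identifying variable in every subset is what allows the pieces to be merged again later.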

Extreme care should be taken in merging files. Merging often leads to missing values, even though the files to be merged may be perfectly clean and correct. Different types of merging, appending, and sub-setting of files are described below. They are based on the SPSS class notes of the UCLA Academic Technology Services17 and The University of North Carolina’s Carolina Population Center.18

17 SPSS Learning Module: Match merging data files. http://www.ats.ucla.edu/stat/spss/modules/merge.htm

18 Stata Programming: Data Management


One-to-one merging

One-to-one merging refers to the process of joining files where one record in each file constitutes a case, and each record in each file must have at least one unique identifying variable. There may or may not be more than one common variable.

Merging is performed according to unique identifying variables. This procedure is usually applied when data are collected at two different times, or when data are entered as two different modules, thus generating more than one file. For example, File 1 (house file) may include three variables a1, a2, and a3 representing age of household head, energy sources in the house, and number of people living in the house. (See Table 1, below.) File 2 (person file), on the other hand, may include more detailed information about the household head, such as number of hours (x1) worked per week, educational level (x2), and income (x3). Numbers 1, 2, and 3 represent unique record/case identification numbers based, perhaps, on cluster, household, and line number nested in order. In this case, there will be a one-to-one merge, since one house has one household head. The merged file will present all six items of information (variables) about the household head in a single file.

During one-to-one merging, some statistical packages place restrictions on the number of variables. Stata, for example, has a limit of 2,047 variables; limitations imposed by some statistical packages are listed in Annex I. Although it is unlikely, in child labour surveys, that the total number of variables will exceed the number allowed by a particular statistical package, data processors should remain alert to the possibility during a one-to-one merge.

Table 1 Example of one-to-one merging

Before merge                                      After merge

File 1 (house file)     File 2 (person file)
1  a1 a2 a3             1  x1 x2 x3               1  a1 a2 a3 x1 x2 x3
2  b1 b2 b3             2  y1 y2 y3               2  b1 b2 b3 y1 y2 y3
3  c1 c2 c3             3  z1 z2 z3               3  c1 c2 c3 z1 z2 z3

(Numbers are unique identifiers used for merging)

Some exceptions may arise: one of the files may have more cases than the other, or the two files may contain the same variables. Different statistical packages handle such situations differently.
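A one-to-one merge, including the two exceptions just mentioned, can be sketched in Python as follows. This is an illustration under explicit assumptions, not the behaviour of any particular statistical package: unmatched cases are kept and their absent variables are set to missing (None), and the merge stops with an error if the two files share a variable name other than the identifier. The variable names and values are hypothetical.

```python
def index_by_id(records, id_field):
    """Index records by ID, refusing duplicate IDs (not one-to-one)."""
    index = {}
    for rec in records:
        key = rec[id_field]
        if key in index:
            raise ValueError(f"duplicate {id_field} {key!r}: not a one-to-one merge")
        index[key] = rec
    return index


def variable_names(records, id_field):
    """List the non-identifier variables in order of first appearance."""
    names = []
    for rec in records:
        for name in rec:
            if name != id_field and name not in names:
                names.append(name)
    return names


def merge_one_to_one(file1, file2, id_field="ID"):
    """Join two files case by case on a unique identifier."""
    left, right = index_by_id(file1, id_field), index_by_id(file2, id_field)
    left_vars = variable_names(file1, id_field)
    right_vars = variable_names(file2, id_field)
    shared = set(left_vars) & set(right_vars)
    if shared:
        raise ValueError(f"variables present in both files: {sorted(shared)}")
    merged = []
    for key in sorted(left.keys() | right.keys()):
        row = {id_field: key}
        for name in left_vars:
            row[name] = left.get(key, {}).get(name)    # None if case is absent
        for name in right_vars:
            row[name] = right.get(key, {}).get(name)   # None if case is absent
        merged.append(row)
    return merged


# Hypothetical files: the house file has one more case than the person file.
house_file = [
    {"ID": 1, "head_age": 45, "n_persons": 5},
    {"ID": 2, "head_age": 38, "n_persons": 3},
    {"ID": 3, "head_age": 52, "n_persons": 6},
]
person_file = [
    {"ID": 1, "hours": 40, "income": 120},
    {"ID": 2, "hours": 25, "income": 80},
]

merged = merge_one_to_one(house_file, person_file)
for row in merged:
    print(row)
```

In this example, case 3 appears in the merged file with missing values for hours and income. This is how a merge of two clean files can still introduce missing values.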

The operation can be performed in the following way:

• SORT household file by unique variable ID and save as a separate file (File1), thereby preserving the original in case of mishaps.

• SORT persons file by unique variable ID and save as a separate file (File2), thereby preserving the original in case of mishaps.

• Make sure both files (File1) and (File2) are properly saved by closing both files and reopening them.

• Execute MERGE FILES command to merge File1 and File2.

• SAVE merged file (New File).

Sample SPSS syntax programming:

* Sort the household file by ID and save a working copy (File1).
GET FILE="Household.sav".
SORT CASES BY ID.
SAVE OUTFILE="File1.sav".
* Sort the persons file by ID.
GET FILE="Persons.sav".
SORT CASES BY ID.
