3. Data processing
3.8 Final documentation
Preparing high-quality documentation can be a time-consuming task, but clear and complete documentation greatly enhances the survey process. It is always best to engage those personnel who have been involved with the dataset from the beginning. They will know better than anyone else how the datasets were created, why the derived variables were created, and which major decisions were taken or editing rules applied during the data processing and why.
On the other hand, people involved with the survey and initial data processing are often so close to the project that they tend to think some of the information does not need to be documented. However, the data is intended for use by a variety of people, and thorough, clear, and concise documentation greatly enhances its usability.
Preparation of documentation or metadata should begin well before the start of actual data processing. As soon as data processing is complete, all the relevant information should be compiled. Two items can serve as final documentation. The first is a short file describing the structure of the dataset together with information concerning variables and values, coding and classification schemes, and weighting. A brief description of the survey should also be included. The second document is a more detailed one, and is described in what follows.
The following are extracts relevant to child labour surveys from the Data Documentation Initiative (DDI) codebook DTD Version 1.0 (FINAL).23
Note that the information organized under the following headings is not a replacement for the codebook or data dictionary for ASCII datasets that defines micro-data layout.
Summary survey description
Title. The full, authoritative title of the survey will be used for all data and documentation, and it should indicate the geographic scope of the data collection as well as the time period covered. For example: Child labour in Portugal: Social characterization of school-age children and their families, 1998.
Subtitle. A secondary title may be used to amplify or state limitations on the main title.
For example: Child labour in Portugal, 1998.
Alternative title. The alternative title may be the title by which a data collection is
commonly referred to, or it may be an abbreviation of the title. For example: SIMPOC
Portugal survey, 1998.
Parallel title. A title may be translated into another language. For example: Trabalho Infantil em Portugal: Caracterização social dos menores em idade escolar e suas famílias, 1998.
Keywords. Words or phrases should be specified that describe salient aspects of the
survey and which may be used in building keyword indexes for classification and retrieval purposes.
Abstract. This is a summary describing the purpose, nature, and scope of the child labour data collection. Special characteristics of the contents and a listing of major variables in the data can be added here.
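The summary survey description lends itself to a small structured record. The following is a minimal sketch, assuming Python and field names invented for illustration (they are not DDI tag names); the titles are the examples given above, and `missing_fields` is a hypothetical helper that lists entries still to be written.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SurveyDescription:
    """Summary survey description; field names are illustrative only."""
    title: str
    subtitle: str = ""
    alternative_title: str = ""
    parallel_title: str = ""
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


description = SurveyDescription(
    title=("Child labour in Portugal: Social characterization of "
           "school-age children and their families, 1998"),
    subtitle="Child labour in Portugal, 1998",
    alternative_title="SIMPOC Portugal survey, 1998",
)
print(description.missing_fields())
```

A check of this kind can be run before dissemination to confirm that no heading has been left blank.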
Summary data description
This should briefly describe the child labour survey in terms of its duration and data collection dates, geographic coverage, and unit of analysis.
Time period covered. This is the time period to which the data refers—the period
covered by the data, not the dates of coding or of making documents machine-readable or the dates when the data were collected. For example, if the data was collected in 1999, and one question was “Did you work last year?”, the time period should be 1998-99.
Date of collection. Contains the date(s) when the data were collected.
Country. Name of the country where the survey was conducted.
23 The Data Documentation Initiative Codebook DTD, http://www.icpsr.umich.edu/DDI/users/dtd/codebook.html (The excerpts have been modified, since the document was prepared as a guide to different types of surveys conducted in a variety of situations, and some fields might not be applicable to a particular country. In addition, the same information is sometimes presented in different sections of the codebook, since, during on-line dissemination, software might seek the same information under a variety of headings. A full version of the codebook is available from the Codebook DTD website.)
Geographic coverage. Includes the total geographic scope of the data, and any additional levels of geographic coding provided in the variables. Most child labour surveys are national in scope.
Geographic unit. This item refers to the lowest level of geographic aggregation covered
by the data—for example province, state, or district.
Unit of analysis. For most child labour surveys, the basic unit of analysis or observation is the individual person.
Universe. The summary should also include a description of the population covered by the data in the file: the group of persons or other elements who are the objects of the survey and to which the survey results refer. Age, nationality, and residence commonly help to delineate a given universe (also known as a universe of interest, population of interest, or target population), but a number of other factors may be involved, among them age limits, sex, marital status, race, ethnic group, nationality, income, veteran status, and history of criminal conviction. The universe may consist of elements other than persons, including housing units and countries. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is a member of the population under survey (for example, where a child labour survey interviewed only children from the 5- to 15-year age group).
Kind of data. This item refers to the type of data included in the file, for example survey, aggregate, clinical, or event/transaction data; program source code; machine-readable text; administrative records data; textual data; coded textual data; coded documents; time budget diaries; observation data/ratings; or process-produced data. All applicable data types should be included.
Notes. Notes should be used to provide additional information, clarifying and annotating codebook information on the scope of the data collection.
Survey methodology and processing
Time method. Panel, cross-sectional, trend, and time-series are some ways of approaching the time dimension of data collection.
Data collector. This refers to the entity (e.g. a national statistics office) responsible for
administering the questionnaire or interview or for compiling the data.
Frequency of data collection. If the data were collected at different times, indicate the
frequency with which this happened. For example, in first-time child labour surveys, “first time” would suffice.
Sampling procedure. This is the type of sample and sample design used to select
survey respondents representative of the target population. It may include reference to the target sample size and the sampling fraction.
Major deviations from the sample design. Show correspondences as well as discrepancies between the sampled units (obtained) and available statistics for the population as a whole (age, sex-ratio, marital status, etc.).
Mode of data collection. This is the method used to collect the data (e.g. face-to-face
interviews).
Type of research instrument. “Structured” indicates a questionnaire that presents all
small portion of such a questionnaire includes open-ended questions, provide appropriate comments. “Semi-structured” indicates that the questionnaire contains mainly open-ended questions. “Unstructured” indicates that in-depth interviews were conducted. Most child labour surveys are structured in nature.
Actions to minimize losses. The summary should include such actions taken to minimize data loss as follow-up visits, supervisory checks, historical matching, and estimation.
Control operations. Describe the methods used to facilitate data control during the
survey and subsequent data processing.
Weighting. The use of sampling procedures may make it necessary to apply weights
to produce accurate statistical results. Describe here the criteria for using weights in the analysis of a data collection. If a weighting formula or coefficient was developed, provide the formula, define its elements, and indicate how the formula was applied to the data.
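As an illustration of documenting a weighting formula, the sketch below assumes a two-stage design in which the base weight is the inverse of the overall selection probability. This is a common textbook form and not necessarily the formula used in any particular survey; the function name, parameter names, and counts are hypothetical.

```python
def design_weight(areas_total: int, areas_sampled: int,
                  households_total: int, households_sampled: int) -> float:
    """Base weight = inverse of the overall selection probability.

    Stage 1 selects areas_sampled out of areas_total enumeration areas;
    stage 2 selects households_sampled out of households_total households
    within a sampled area. Both stages are assumed to be equal-probability.
    """
    if min(areas_sampled, households_sampled) <= 0:
        raise ValueError("sampled counts must be positive")
    if areas_sampled > areas_total or households_sampled > households_total:
        raise ValueError("cannot sample more units than exist")
    return (areas_total / areas_sampled) * (households_total / households_sampled)


# Hypothetical design: 20 of 400 areas, then 15 of 150 households per area.
weight = design_weight(400, 20, 150, 15)
print(weight)
```

Under these assumptions each responding household represents 200 households in the population; the documentation would state the formula, define each element, and say to which records the weight applies.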
Cleaning operation. Methods used to clean the data collected may include consistency
checking and wild code checking, for example.
Study-level error note. Include any information annotating or clarifying the methodology and data processing procedures.
Data appraisal information
Response rate. This refers to the percentage of sample members who provided information.
Estimates of sampling error. Include a measure of how precisely one can estimate a
population value from a given sample.
Other forms of data appraisal. Include such issues as response variance, non-response
rate and testing for question bias, interviewer and response bias, and confidence levels.
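The first two appraisal measures can be illustrated with a short calculation. The sketch assumes simple random sampling for the standard error of a proportion; clustered or stratified designs, which are usual in household surveys, need a design-based estimator instead. Function names and all figures are hypothetical.

```python
import math


def response_rate(responding: int, eligible: int) -> float:
    """Percentage of eligible sample members who provided information."""
    if eligible <= 0 or not 0 <= responding <= eligible:
        raise ValueError("need eligible > 0 and 0 <= responding <= eligible")
    return 100.0 * responding / eligible


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion under simple random sampling."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("need 0 <= p <= 1 and n > 0")
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical figures: 4,500 of 5,000 sampled households responded, and
# 20 per cent of 1,600 sampled children were found to be working.
rate = response_rate(responding=4500, eligible=5000)
se = proportion_standard_error(p=0.2, n=1600)
lower, upper = 0.2 - 1.96 * se, 0.2 + 1.96 * se  # approximate 95% interval
print(rate, round(se, 4), round(lower, 4), round(upper, 4))
```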
Data access
This section describes access conditions and terms of use as well as other information regarding availability and storage of the data collection.
Location. Say where the data is currently stored (e.g. a national statistics office).
Archive where study originally stored. Give the place, if any, where the data was stored earlier (e.g. another ministry or department).
Availability status. Provide a statement of data availability. For example, data may be
unavailable because it was embargoed before formal dissemination of the final report.
Extent of data. Summarize the number of physical files that exist in a dataset, recording the number of files that contain data and noting whether the collection contains machine-readable documentation or other supplementary files and information such as data dictionaries, data definition statements, and data collection instruments.
Completeness of study stored. Describe the relationship of the data collected to the
amount of data coded and stored in the data collection. Where appropriate, explain why certain items of collected information were not included in the data file.
Number of files. Give the total number of physical files associated with a collection.
Collection notes. Provide any additional information regarding data availability.
Access authority. Identify the contact person or organization that controls access to the data collection at the country level (with full address and telephone number, if available).
Data use statement. Explain the terms of use for the data collection, if any.
Conditions. Where appropriate, describe use and access conditions not covered elsewhere.
Citation requirement. Specify any text that should be cited in publications based on
analysis of the data.
Deposit requirement. Information regarding the responsibility of external users for
informing countries or the ILO of their use of data when citing or providing copies of the published work.
Notes. Include a generic “notes” sub-section in the data access section to facilitate annotation/clarification of information regarding data access.
File-by-file descriptions
All files, including data and documentation files, should be individually described.
File name. Use a short title to distinguish a particular file/part from other files/parts in
the data collection.
File contents. Provide an abstract or short description of the file describing its purpose,
nature, and scope, special characteristics of its contents, major subject areas covered, and the reason the file was first created. It is also important to list the major variables contained in the file. In the case of multi-file collections, describe the contents of each file individually.
File structure. Describe the type of file structure, for example indicating whether a
given file is hierarchical, rectangular, or relational.
Record or record group. If the file is hierarchical or relational, then describe the record
groupings.
Label (of record). Provide more detailed information for each record group.
Dimensions (of record). Describe the physical characteristics of the record, including
such items as number of variables per record, number of cases, and record length if applicable.
Notes (on record or record group). Indicate any additional information regarding this
record type.
Dimensions of the overall file
Overall case count. With rectangular files, specify the number of cases or observations in the entire file.
Overall variable count. With rectangular files, specify the number of variables in the
entire file.
Logical record length. The logical record length of a file is the number of characters contained in each record. Provide this for rectangular files or where all records in a hierarchical file are the same length.
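For a fixed-width ASCII file, the record length can be verified before it is written into the documentation. The sketch below is illustrative: the function name and the sample records are hypothetical, and line terminators are excluded from the count.

```python
def logical_record_lengths(lines) -> set:
    """Return the set of distinct record lengths, excluding line terminators."""
    return {len(line.rstrip("\r\n")) for line in lines}


# Hypothetical fixed-width records (household ID, sex, age).
records = ["0011205\n", "0022111\n", "0031209\n"]
lengths = logical_record_lengths(records)

if len(lengths) == 1:
    print("logical record length:", lengths.pop())
else:
    print("records differ in length:", sorted(lengths))
```

With a real file, `records` would be replaced by an open file handle; more than one distinct length signals either a hierarchical file with several record types or damaged records.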
Type of file. If the data files are of mixed types (e.g. both ASCII and software-dependent), mention their types.
Data format. Specify the physical format of the data file, i.e. delimited format, free
format, software dependent, etc.
Place of file production. Indicate which department produced the file.
Extent of processing checks. Indicate here, at the file level, the types of checks and
operations performed on the data file.
Processing status. Indicate the processing status of the file, if part of a multi-file
collection.
Missing data. Provide information that can be used to account for missing data—show
that missing data have been standardized across the collection, that missing data are the result of merging, etc.
Software. Identify the software used to create the file, including the software version
number.
Version statement. Provide a version statement for the data file.
Notes. Provide additional information about the data file not covered in the other
elements of this summary.
Variable group
This refers to a group of variables that may share a common subject, arise from the interpretation of a single question, or are linked by some other factor. Specify whichever of the following apply:
Type. Show the general type of variable grouping (topic, multiple responses, etc.).
Var. This indicates the IDs of all the constituent variables in the group.
Variable group. This indicates all the subsidiary variable groups nested under the
current variable group, allowing the encoding of a hierarchical structure of variable groups.
Name. This is the unique ID for the group.
Summary data description references. These record the ID values of all elements
within the summary data description referred to previously that apply to this variable group. These elements include time period covered, date of collection, nation or country, geographic coverage, geographic unit, unit of analysis, universe, and type of data.
Methodology and processing references. These record the ID values of all elements within the survey methodology and processing description referred to previously that apply to this variable group. These elements include information on data collection and data appraisal (e.g. sampling, sources, weighting, data cleaning, response rates, and sampling error estimates).
Variable group label. Create a short description of the variable group.
Variable group text. This is a lengthier description of the variable group.
Variable group definition. Provide a rationale for why the variables are grouped in
this way.
Notes. Add any clarifying information/annotation regarding the variable groups.
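The nesting of variable groups described above can be represented as a simple tree. The following is a minimal sketch, assuming Python; the class, the attribute names, and the group and variable IDs are all hypothetical and do not reproduce DDI element names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariableGroup:
    name: str                 # unique ID for the group
    group_type: str           # e.g. "topic" or "multiple responses"
    label: str = ""           # short description of the group
    variables: List[str] = field(default_factory=list)
    subgroups: List["VariableGroup"] = field(default_factory=list)

    def all_variables(self) -> List[str]:
        """Collect variable IDs from this group and every nested group."""
        ids = list(self.variables)
        for group in self.subgroups:
            ids.extend(group.all_variables())
        return ids


hours = VariableGroup("G2", "topic", "Hours worked", ["V10", "V11"])
work = VariableGroup("G1", "topic", "Economic activity", ["V01"], [hours])
print(work.all_variables())
```

Walking the tree in this way gives the full list of variables covered by a top-level group, which is useful when checking that every variable belongs to at least one group.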
Variable
Each variable needs a name to serve as its unique ID. For each variable provide the following information:
• whether the variable is a weight;
• reference to the weight variable for this variable;
• a question ID for the variable;
• reference of the file to which the variable belongs;
• which format has been used (e.g. SAS, SPSS);
• the number of decimal points in the variable;
• whether the options are discrete or continuous;
• which record type this variable belongs to;
• references to the summary data description that records the ID values for all elements that apply to this variable; and
• references to the methodology and processing that records the ID value of all elements that apply to this variable.
•• Variable label. This is a descriptive phrase that defines the variable. The length
of the phrase may depend on the statistical analysis system used.
•• Imputation. Imputation is the process by which missing values for items that
survey respondents failed to provide are estimated. If applicable in this context, mention the procedure used.
•• Embargo. This provides information on variables that may not be currently available because of policies established by national statistics offices or ministries.
•• Response unit. Describes who provided the information contained within the variable (e.g. respondent, proxy, interviewer).
•• Analysis unit. This provides details of whom or what the variable describes.
•• Literal question. This is the literal text of the actual question asked.
•• Post-question text. This text describes what occurred, if anything, after the literal
question was asked.
•• Interviewer instructions. These are the specific instructions to the individual conducting the interview.
•• Range of valid data values. This refers to the values for a particular variable that
represent legitimate responses.
•• Range of invalid data values. This refers to the values for a particular variable
that represent missing data, “not applicable” responses, etc.
•• List of undocumented codes. These are values the meanings of which are
unknown.
•• Summary statistics. This refers to one or more statistical measures that describe the responses to a particular variable, and which may include one or more standard summaries, e.g. minimum and maximum values.
•• Variable text. This refers to an extended description of the variable, something
beyond that provided by “variable name” and “variable label”.
•• Coder instructions. These are any special instructions to those who converted the
information from one form to another for a particular variable. These might include the reordering of numeric information into another form, or the conversion of textual information into numeric information.
•• Version statement. If a variable has undergone changes, a version statement is
required.
•• Derivation. Used only in the case of derived variables, this element provides both a description of how the derivation was performed and the command used to generate the derived variable, as well as indicating the other variables in the study used to generate the derivation.
•• Derivation description. This is a textual description of the way in which this variable was derived for display to users.
•• Derivation command. This is the actual command used to generate the derived
variable. The syntax attribute is used to indicate the command language employed (e.g. SPSS, SAS, Fortran).
•• Variable format. This refers to the format for the variable in question, and includes type (character or numeric), name for the particular format (if applicable), schema (the vendor or standards body that defines the format: one of SAS, SPSS, IBM, ANSI, ISO, or XML-DATA), category (date, time, currency, other), and network identifier for the format definition.
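Several of the variable-level items above (valid range, invalid values, undocumented codes, and summary statistics) can be checked mechanically once they are recorded. The sketch below is illustrative only: the class, the variable `AGE`, its missing-data code 99, and the sample values are hypothetical; the 5 to 15 range follows the age group mentioned earlier in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class VariableSpec:
    name: str                       # unique ID
    label: str                      # descriptive phrase
    valid_range: Tuple[int, int]    # inclusive range of legitimate responses
    invalid_codes: FrozenSet[int]   # documented missing / not-applicable codes


def classify(spec: VariableSpec, value: int) -> str:
    """Label a value as valid, invalid (documented), or undocumented."""
    if value in spec.invalid_codes:
        return "invalid"
    low, high = spec.valid_range
    return "valid" if low <= value <= high else "undocumented"


def summary(spec: VariableSpec, values: Iterable[int]) -> dict:
    """Minimum, maximum, and count over the valid values only."""
    valid = [v for v in values if classify(spec, v) == "valid"]
    return {"n_valid": len(valid),
            "min": min(valid, default=None),
            "max": max(valid, default=None)}


age = VariableSpec("AGE", "Age in completed years", (5, 15), frozenset({99}))
values = [7, 12, 99, 15, 42]
print([classify(age, v) for v in values])
print(summary(age, values))
```

In this example 99 is a documented missing-data code, while 42 has no documented meaning and would be reported in the list of undocumented codes.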