
3. Data processing

3.4 Data validation

With effective entry procedures, data are already basically clean as soon as they have been entered. Secondary editing involves complex internal consistency and structure checks that require the review of several sections of the questionnaire, and which, if corrections are needed, must follow detailed recommendations. More advanced data processors can do this interactively. Some people prefer to carry out all data validations before merging of files. Individual countries will decide which approach is appropriate for a given situation.

Even when the first statistical operations are performed with due care on new data, it is not uncommon to find cases such as six-year-olds who have completed secondary schooling. Validation checks are needed to find these errors and fix them. Although no system is perfect, the number of errors can be reduced more reliably if, at a minimum, the following steps are taken.19

Number of variables check. It sometimes happens that the number of variables that should be generated from a questionnaire does not match the number of variables in the data. Various factors may be responsible. For example, a variable may not have been created in the first place, or the questionnaires may have been imperfectly translated from one language to another. Although this type of error should be recognized in earlier stages, it is best to recheck after all files have been merged or appended.

Number of record/cases check. If the household is considered as a case, check that the number of cases entered equals the expected number of households (which may be equal to the sample size). Also check that the number of person records is equal to the number of persons interviewed (or for whom data were collected).

Record matches and counts. If household records and records about persons living in the household are in two different files, check to make sure that identifying variables required for merging are clearly defined. Also make sure that all members belonging to a household are properly entered by comparing the number of persons in the household file with the number of persons in the same household in the person file.
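The record match and count comparison can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the field names hh_id, n_members and person_id, and the sample records, are assumptions made for the example.

```python
from collections import Counter

def compare_household_sizes(households, persons):
    """Return (mismatches, orphans).

    mismatches maps a household id to (members expected, person records found).
    orphans lists household ids that appear only in the person file.
    """
    found = Counter(p["hh_id"] for p in persons)
    mismatches = {}
    for hh in households:
        expected = hh["n_members"]
        actual = found.get(hh["hh_id"], 0)
        if expected != actual:
            mismatches[hh["hh_id"]] = (expected, actual)
    known = {hh["hh_id"] for hh in households}
    orphans = sorted(set(found) - known)
    return mismatches, orphans

# Hypothetical household file and person file
households = [
    {"hh_id": 1, "n_members": 3},
    {"hh_id": 2, "n_members": 2},
]
persons = [
    {"hh_id": 1, "person_id": 1},
    {"hh_id": 1, "person_id": 2},
    {"hh_id": 1, "person_id": 3},
    {"hh_id": 2, "person_id": 1},
    {"hh_id": 3, "person_id": 1},
]

mismatches, orphans = compare_household_sizes(households, persons)
print(mismatches)  # households whose member counts disagree
print(orphans)     # person records with no matching household
```

In this example household 2 is one person short in the person file, and household 3 has a person record but no household record, so both would need to be traced back to the questionnaires.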

Wild codes and out-of-range values. Wild codes are those that are not defined as acceptable legal codes in the data, whereas out-of-range values are values that are legal for the variable but may not be right. For example, if 1 stands for male and 2 for female, 3 will be a wild code, whereas recording the weekly income of a child as 1000 when it should actually be 100 gives an out-of-range value. Frequency distributions as well as graphs expose these types of errors, so frequency distributions of all variables should be examined for possible anomalies. Revisit the questionnaire as necessary to correct these problems.
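A frequency count and two simple filters are often enough to expose both kinds of error. The sketch below uses the sex codes from the example above; the income ceiling of 500, the field names and the sample records are hypothetical and would have to be set for each survey.

```python
from collections import Counter

VALID_SEX_CODES = {1, 2}    # 1 = male, 2 = female, as in the text
MAX_WEEKLY_INCOME = 500     # hypothetical plausibility ceiling

# Hypothetical person records
records = [
    {"id": 1, "sex": 1, "weekly_income": 100},
    {"id": 2, "sex": 3, "weekly_income": 80},
    {"id": 3, "sex": 2, "weekly_income": 1000},
]

# Frequency distribution: any code outside 1 and 2 stands out immediately
sex_frequencies = Counter(r["sex"] for r in records)

# Wild codes: values that are not legal codes for the variable
wild_codes = [r["id"] for r in records if r["sex"] not in VALID_SEX_CODES]

# Out-of-range values: legal in form but implausible in size
out_of_range = [
    r["id"] for r in records
    if not 0 <= r["weekly_income"] <= MAX_WEEKLY_INCOME
]

print(sex_frequencies, wild_codes, out_of_range)
```

The flagged record identifiers are the ones to look up in the original questionnaires.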

19 Further development of the procedures outlined in Inter-university Consortium for Political and Social Research (ICPSR), Guide to Social Science Data Preparation and Archiving, op. cit., and Audience Dialogue: Survey analysis, op. cit.

Missing values. Flag all values that are missing, in each case indicating the reason why the value is missing. Responses such as “do not know”, “not applicable”, “not available”, and “refused to answer” should be clearly marked. Their codes, to the extent possible, should also be uniform throughout the dataset. No cell in the dataset should be left blank.
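One way to enforce this is to list every blank cell and to tabulate the explicit missing-value codes for each variable. In the sketch below, code 99 for “not applicable” follows the example used later in this section; codes 96 to 98, the variable names and the sample records are assumptions.

```python
from collections import Counter

# 99 follows the text; the other codes are hypothetical
MISSING_CODES = {
    96: "not available",
    97: "refused to answer",
    98: "do not know",
    99: "not applicable",
}

def find_blank_cells(records, variables):
    """Return (record id, variable) pairs whose cell is empty or absent."""
    blanks = []
    for record in records:
        for name in variables:
            value = record.get(name)
            if value is None or str(value).strip() == "":
                blanks.append((record["id"], name))
    return blanks

def missing_summary(records, variable):
    """Count how often each explicit missing-value code occurs."""
    counts = Counter(r.get(variable) for r in records)
    return {label: counts.get(code, 0) for code, label in MISSING_CODES.items()}

# Hypothetical records
records = [
    {"id": 1, "Q5": 1, "Q6": 98},
    {"id": 2, "Q5": "", "Q6": 99},
    {"id": 3, "Q5": 2, "Q6": None},
]

blank_cells = find_blank_cells(records, ["Q5", "Q6"])
q6_summary = missing_summary(records, "Q6")
print(blank_cells)
print(q6_summary)
```

Blank cells cannot be given a reason automatically; each one listed has to be resolved against the questionnaire and then assigned one of the agreed codes.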

Consistency checks. There are always possibilities of inconsistencies among responses to related questions. For example, 100 children say they did work, but 105 children report earnings. Some inconsistencies may also involve more than two variables. For example, the five extra children who reported earnings may have been recorded wrongly because they were in fact attending school.

One of the easiest ways to perform consistency checks is to check the “question route”. For instance, where the questionnaire says “When answer to question number 10 is 2 (“NO”), skip to question number 14”, a logical rule can be developed:

If Q10 = 2, then Q11 = Q12 = Q13 = 99 (means “not applicable”).

If the data indicate that this is not true, then it is possible that Q10 is really 1, but during data entry it was entered as 2. This is probably the case if Q11, Q12, and Q13 all have valid codes. So the answer to Q10 can easily be changed to 1 (“YES”). If they have been coded but are not all valid, one may need to refer back to the original questionnaire to find out which need to be changed. Comparing frequency counts or cross tabulations among all possible related variables would reveal many inconsistencies. Revisit the questionnaire to correct these problems.
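The skip rule and its follow-up decision can be written as a small function. This is a sketch under stated assumptions: the set of valid codes for Q11 to Q13 is hypothetical, and the function only recommends an action rather than changing the data.

```python
NOT_APPLICABLE = 99
VALID_FOLLOW_UP_CODES = {1, 2, 3}   # hypothetical valid codes for Q11-Q13

def check_skip_rule(record):
    """Apply the rule: if Q10 = 2, then Q11 = Q12 = Q13 = 99."""
    if record["Q10"] != 2:
        return "ok"                      # the rule only applies when Q10 is NO
    follow_ups = [record[q] for q in ("Q11", "Q12", "Q13")]
    if all(v == NOT_APPLICABLE for v in follow_ups):
        return "ok"                      # skip pattern was respected
    if all(v in VALID_FOLLOW_UP_CODES for v in follow_ups):
        return "set Q10 to 1"            # Q10 was probably mis-keyed
    return "refer to questionnaire"      # mixed codes: check the source

# Hypothetical records
consistent = {"Q10": 2, "Q11": 99, "Q12": 99, "Q13": 99}
miskeyed = {"Q10": 2, "Q11": 1, "Q12": 2, "Q13": 3}
unclear = {"Q10": 2, "Q11": 1, "Q12": 99, "Q13": 3}

for record in (consistent, miskeyed, unclear):
    print(check_skip_rule(record))
```

Returning a recommendation instead of editing the record keeps the original value available until someone has confirmed the change.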

A useful example (see also Annex 3) of how logic checking rules can be developed is presented in the following consistency check rules derived from the 1999 Zambia End of Decade and Child Labour Survey (education module)20, in which information about children aged 5-17 years was collected:

If a child answers YES to “ever attended school”, then skip the (not applicable) question “why not attended school”.

Where a child answers NO to “ever attended school”, yet reports a grade to the question “highest grade attained”, his/her answer should be changed to YES to “ever attended school”.

Those who answer neither the “ever attended school” nor the “highest grade attained” questions should be taken as responding NO to “attended school” and 0 GRADE to “highest grade attained”. Rationale: Neither of the two related questions is answered, so the likelihood is that the correct answers are actually NO.

If the child answers NO to “ever attended school” and NO to “attending school”, then the rest of the questions in the education module should be considered not applicable.

If a child answers NO to “attending school”, then skip the (not applicable) question “grade currently attending”.

Where a child answers NO to question “attending school last year”, but YES to “attending type of school last year” and YES to “grade attending last year”, the first response should be changed to YES. Rationale: The two related YES responses, in this case, suggest YES is a more likely response than NO in the third instance.
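Three of these rules, those involving “ever attended school”, “highest grade attained” and “why not attended school”, might be coded as shown below. The variable names and the codes 1 = YES, 2 = NO, 99 = not applicable are assumptions for illustration and are not taken from the Zambia questionnaire. Recoding “why not attended school” to not applicable is one possible reading of the first rule.

```python
YES, NO, NOT_APPLICABLE = 1, 2, 99   # hypothetical codes

def apply_education_rules(child):
    """Return a corrected copy of one child's record and a log of changes."""
    fixed = dict(child)              # never modify the original record
    log = []
    ever = fixed.get("ever_attended")
    grade = fixed.get("highest_grade")

    if ever is None and grade is None:
        # Neither question answered: assume NO and grade 0
        fixed["ever_attended"], fixed["highest_grade"] = NO, 0
        log.append("both unanswered: set ever_attended=NO, highest_grade=0")
    elif ever == NO and grade not in (None, 0, NOT_APPLICABLE):
        # A grade was reported, so the child must have attended school
        fixed["ever_attended"] = YES
        log.append("grade reported: changed ever_attended NO -> YES")

    # This rule runs last so that it sees the corrected ever_attended value
    if (fixed.get("ever_attended") == YES
            and fixed.get("why_not_attended") != NOT_APPLICABLE):
        fixed["why_not_attended"] = NOT_APPLICABLE
        log.append("ever_attended=YES: why_not_attended set to not applicable")
    return fixed, log

# Hypothetical record: NO to ever attended, yet grade 4 reported
example = {"ever_attended": NO, "highest_grade": 4, "why_not_attended": 3}
fixed, log = apply_education_rules(example)
print(fixed)
print(log)
```

The function works on a copy and records each change, so every edit can be reviewed and the original answers remain available.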

20 Central Statistical Office, Zambia End of Decade and Child Labour Survey (1999) Household Questionnaire http://www.ilo.org/public/english/standards/ipec/simpoc/zambia/document/zafh01gq.pdf

During consistency checks, extreme care should be taken to avoid executing scripts based on false logic (that is, on an incorrect consistency rule). Moreover, if several consistency check operations involving the same variable are to be carried out, take great care to choose the appropriate sequence for executing the checks and changing the values.

Some programmers find it more effective to split files before running consistency checks, and merge the files again once the checks are complete.

Errors are inevitable during complex consistency checks. It is good practice to keep old files so that it is always possible to refer to a copy of the original data.

At a minimum, consistency checks should be run to ensure that no fields are blank and that all fields contain valid values.

Finally, the first three to five per cent of the records should be carefully checked to ascertain that those records are error free. Afterwards, random checks should be conducted to test the overall integrity of the dataset.
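The selection of records for this final review can be sketched as follows. The five per cent share comes from the text; the sample size of 10, the fixed seed and the record identifiers are assumptions chosen for the example.

```python
import random

def select_records_for_review(record_ids, head_percent=5, sample_size=10, seed=0):
    """Return (head, spot): the first records, then a random sample of the rest."""
    # Ceiling of head_percent per cent of the file, using integer arithmetic
    n_head = (len(record_ids) * head_percent + 99) // 100
    head = record_ids[:n_head]
    rest = record_ids[n_head:]
    rng = random.Random(seed)        # fixed seed makes the selection repeatable
    spot = rng.sample(rest, min(sample_size, len(rest)))
    return head, spot

# Hypothetical file of 200 records identified by 1..200
record_ids = list(range(1, 201))
head, spot = select_records_for_review(record_ids)
print(len(head), len(spot))
```

A fixed seed means the same records are drawn each time the script is rerun, which makes the review reproducible; a different seed gives a fresh sample.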

At the end of the data validation stage, there should be no undefined missing values of any type (e.g. not applicable codes are properly included) and no consistency errors, and all records should be matched, with all record/case identifiers uniquely defined. In other words, all missing values must be properly defined. However, if any errors remaining in the dataset cannot be rectified, then a file should be generated that contains the following information:

• case/record identification;

• type of error (missing value, non-response error, etc.);

• detailed breakdown in terms of number of cases, records, etc.;

• reasons why such errors could not be corrected;

• some tabulations to show their impact on the overall dataset;

• number of mismatches between cases and applicable records in a case; and

• mismatches between number of cases and data collection, and possible reasons for the errors.
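A simple error report file covering the first few items in this list could be produced as sketched below. The column names, the sample entries and the use of CSV are assumptions; any format the supervisor can read would serve.

```python
import csv
import io
from collections import Counter

# Hypothetical column names for the error report
FIELDS = ["case_id", "variable", "error_type", "reason_not_corrected"]

def write_error_report(errors, stream):
    """Write one row per unresolved error and return counts by error type."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(errors)
    return Counter(e["error_type"] for e in errors)

# Hypothetical unresolved errors
errors = [
    {"case_id": 17, "variable": "Q12", "error_type": "missing value",
     "reason_not_corrected": "questionnaire page lost"},
    {"case_id": 42, "variable": "Q10", "error_type": "consistency error",
     "reason_not_corrected": "original answer illegible"},
]

buffer = io.StringIO()               # stands in for a file opened for writing
breakdown = write_error_report(errors, buffer)
print(buffer.getvalue())
print(breakdown)                     # number of unresolved errors by type
```

To write to disk instead of memory, the buffer can be replaced by a file opened with open(path, "w", newline="").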

In addition, a list of all variables with labels needs to be generated. These tabulations, together with the error report file and variable list, should then be forwarded to the supervisor for consideration.
