Data capture - Sampling Errors - Model Quality Report in Business Statistics

Part 1: Sampling Errors

7.5 Data capture

A variety of methods may be used to ‘capture’ data. These include:

• keying responses from pencil and paper questionnaires;

• using scanning to capture images followed by automated data recognition to translate those images into data records;

• keying by interviewers of responses during computer assisted interviews;

and these are discussed in turn below.

7.5.1 Data keying from pencil and paper questionnaires

The traditional method of data capture for business surveys is the keying of responses from pencil and paper questionnaires onto computer by a centrally located data entry team. This is a very labour-intensive task, which has now been replaced on many surveys by more modern technologies. Some modes of data collection, such as computer-assisted personal interviewing (CAPI) and computer-assisted telephone interviewing (CATI), enter the data onto computer in the course of the interview.

Data keying is used on many postal surveys where pencil and paper questionnaires are the simplest way to collect information. ONS is investigating the potential of other methods of data capture including scanning and automated data recognition to reduce the number of surveys where data are captured in this way.

7.5.1.1 Measuring error occurring during data keying

The accuracy of data keying can be measured either by comparing a batch of entered data with the original questionnaires or, more commonly, by re-entering the batch and comparing the two sets of data. Lyberg & Kasprzyk (1997) give a range of examples with error rates varying from 0.1% to 1.6%. Any new method of data capture must achieve error rates at least as low as these to maintain the quality of survey data.
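The re-entry check described above can be sketched as a field-by-field comparison of two independent keyings of the same batch. This is a minimal illustration only: the function name, the record layout (one dict per questionnaire) and the sample values are assumptions, not part of any ONS system. Note that a disagreement shows only that at least one keying is wrong; the original questionnaire is still needed to decide which.

```python
def keying_error_rate(first_pass, second_pass):
    """Return (discrepancy_rate, discrepancies) for two keyings of one batch.

    Each pass is a list of records; each record is a dict mapping a field
    name to the keyed value. Both passes must hold the same questionnaires
    in the same order.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("both passes must contain the same number of records")

    discrepancies = []
    fields_compared = 0
    for index, (rec_a, rec_b) in enumerate(zip(first_pass, second_pass)):
        # Compare every field present in either keying of this record.
        for field in sorted(set(rec_a) | set(rec_b)):
            fields_compared += 1
            if rec_a.get(field) != rec_b.get(field):
                discrepancies.append((index, field, rec_a.get(field), rec_b.get(field)))

    rate = len(discrepancies) / fields_compared if fields_compared else 0.0
    return rate, discrepancies


# Hypothetical batch of two questionnaires, keyed twice.
first = [{"turnover": "1250", "employees": "37"},
         {"turnover": "980", "employees": "12"}]
second = [{"turnover": "1250", "employees": "31"},
          {"turnover": "980", "employees": "12"}]

rate, diffs = keying_error_rate(first, second)
print(f"{rate:.1%} of fields disagree")
for index, field, a, b in diffs:
    print(f"record {index}, {field}: {a!r} vs {b!r}")
```

In this made-up batch one of the four fields disagrees, so the discrepancy rate is 25%. Real batches would be far larger, and the rate would be compared against the 0.1% to 1.6% range quoted above.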

7.5.1.2 Minimising error occurring during data keying

Methods of minimising errors during data keying include:

• checking regular batches of questionnaires for keying errors;

• using in-built edits in computer assisted transmission to identify keying errors;

• checking all data entry work of new staff until they reach an acceptable level of accuracy.

7.5.2 Data capture using scanning and automated data recognition

The potential cost savings offered by the use of scanning and automated data recognition over traditional data keying have led to increasing interest in this technology. Within ONS, scanning is being used for some business surveys. For example, the last Census of Employment carried out in the UK used scanning equipment to capture all the data, resulting in quicker processing and a lower cost for a very large survey. Other organisations that have investigated the use of scanning and automated data recognition for data capture include Statistics Sweden (Blom & Friberg, 1995) and Statistics Canada (Vezina, 1996).

The stages in the data capture process are:

• Scanning

The questionnaires are separated into single sheets and fed into the scanner, which stores the image of each page as a TIF file. The preparation of questionnaires for the scanner can be fairly labour-intensive (Elder & McAleese, 1996) since any staples need to be removed and the questionnaire correctly aligned. The storage of images of questionnaires has the additional advantages of providing rapid access to questionnaires if any queries arise and reducing the need for storage of large volumes of paper questionnaires.

• Form Out

In many data recognition systems the image of the original printed questionnaire is removed electronically from the image of the data filled in by the respondent. This reduces the computer memory needed to store the image of the data and clarifies the image for automated data recognition.

• Automated data recognition

Different methods are used to extract the data from the image depending on the type of information being captured. These include:

➢ Bar code recognition (BCR). Used to read bar codes, for example serial numbers on paper questionnaires. Very accurate.

➢ Optical Mark Recognition (OMR). Used to read responses in tick boxes. Over 99% of items are (presumably correctly) recognised by the system.

➢ Optical Character Recognition (OCR). Used to read machine-printed text. Over 99% of items are (presumably correctly) recognised by the system.

➢ Intelligent Character Recognition (ICR). Used to read handwritten characters. For handwritten numerical information, 65% to 90% of question responses were recognised. This figure is lower for handwritten text information; as a result ICR is rarely used for collecting such information.

The recognition figures quoted above are from Statistics Sweden’s experience of automated data recognition as reported in Blom & Friberg (1995). It must be emphasised that technology is developing quickly in this area so that the accuracy of automated data recognition systems can be expected to improve.
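The "Form Out" stage described above can be pictured as a pixel-wise subtraction of the blank questionnaire from the completed one. The sketch below is an assumption-laden illustration rather than a description of any particular product: it treats both images as small binary grids (1 for a dark pixel, 0 for a light one) that are already perfectly aligned, and the function name `form_out` is hypothetical.

```python
def form_out(filled, template):
    """Keep only dark pixels of the filled image that are light in the blank template."""
    same_shape = len(filled) == len(template) and all(
        len(f_row) == len(t_row) for f_row, t_row in zip(filled, template)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    return [
        [1 if f == 1 and t == 0 else 0 for f, t in zip(f_row, t_row)]
        for f_row, t_row in zip(filled, template)
    ]


# Blank form: a printed answer box (border pixels only).
template = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
# Completed form: the same box with a vertical stroke written inside it.
filled = [
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
]

respondent_marks = form_out(filled, template)
for row in respondent_marks:
    print("".join("#" if pixel else "." for pixel in row))
```

Only the respondent's stroke survives, which is what both shrinks the stored image and gives the recognition step a cleaner input. The sketch ignores any misalignment between the scanned page and the template, and it also loses handwriting that overlaps printed lines.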

7.5.2.1 Measuring error associated with scanning and automated data recognition

Automated data recognition may introduce errors into data when characters are incorrectly recognised; for example the numbers 3 and 8 may be confused, as may the numbers 1 and 7. If the system is more likely to misread a 3 as an 8 than an 8 as a 3, and similarly more likely to misread a 1 as a 7 than a 7 as a 1, then these errors could cause an upward bias in the survey estimates. Some of these errors may be detected at the editing stage but some inaccuracies may slip through.

The accuracy of automated data recognition may be compared with keyed data entry by processing a batch of forms in both ways and comparing the resulting data. Elder & McAleese (1996) report the results of such a comparison, in which they found that for some questionnaires the accuracy achieved by the automated recognition system was at least as high as that achieved by the keyed data entry process.
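The effect of an asymmetric confusion between 3 and 8 on an estimated total can be illustrated with a small expected-value calculation. Everything numeric here is hypothetical: the misreading probabilities, the three reported values and the function name are chosen only to show the mechanism, and the sketch ignores the 1 and 7 confusion and any correction at the editing stage.

```python
def expected_recognised_value(value, p_3_to_8, p_8_to_3):
    """Expected recorded value of a non-negative integer when 3s and 8s can be swapped.

    p_3_to_8 is the assumed probability that a written 3 is read as an 8;
    p_8_to_3 is the assumed probability that a written 8 is read as a 3.
    """
    if value < 0:
        raise ValueError("value must be a non-negative integer")
    expected = 0.0
    for position, char in enumerate(reversed(str(value))):
        digit = int(char)
        if digit == 3:
            expected_digit = 3 + 5 * p_3_to_8   # read as 8 adds 5
        elif digit == 8:
            expected_digit = 8 - 5 * p_8_to_3   # read as 3 subtracts 5
        else:
            expected_digit = digit
        expected += expected_digit * 10 ** position
    return expected


# Hypothetical reported values and misreading rates (3 -> 8 twice as likely as 8 -> 3).
true_values = [38, 83, 303]
P_3_TO_8 = 0.02
P_8_TO_3 = 0.01

true_total = sum(true_values)
expected_total = sum(
    expected_recognised_value(v, P_3_TO_8, P_8_TO_3) for v in true_values
)
print(f"true total: {true_total}")
print(f"expected recognised total: {expected_total:.2f}")
print(f"expected bias: {expected_total - true_total:+.2f}")
```

With these made-up inputs the expected recognised total exceeds the true total, which is the upward bias the text warns of. The sign and size of the bias depend both on the two misreading rates and on how often each digit occurs in each position, since an error in a leading digit moves the value far more than one in the units position.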

7.5.2.2 Minimising error associated with scanning and automated data recognition

The most effective way to ensure high quality data capture using automated data recognition is to design forms that are easily scanned and interpreted by the data recognition process. Vezina (1996) provides a useful discussion of aspects of form design that influence data quality. These include:

• the characteristics of the paper – it needs to feed easily into the scanner;

• the colour of the ink – scanners pick up some colours better than others and this can be used to enhance the images of the data;

• page identifiers;

• registration points – marks on the form which enable the system to align the scanned image;

• definition of zones of data to be captured – this is particularly important for parts of the form where the respondent is asked to write in numbers or letters. The provision of boxes encourages the respondent to print characters in capitals that are easier for the system to recognise than manuscript;

and to these we could add one which Vezina does not mention:

• instructions asking the contributor to provide data in the required format.

In document Model Quality Report in Business Statistics (Page 123-126)