Testing a Handprint Data Capture System (with Two Different Form Designs)

A paper written for the Business Forms Management Association, Inc.
BFMA 38th International Symposium, May 4-8, 2008

K. Bradley Paxton, Ph.D. and Dawn R. Savacool

Abstract

In helping the U.S. Census Bureau develop and test the data capture systems for the Year 2000 and 2010 Decennial Census, we have developed a unique testing approach for handprint data capture systems. This approach not only tests the inherent quality of the handprint data capture automation and keying, but also monitors the progress in forms design, where the objective is to produce a form that is both "respondent-friendly" and "data capture-friendly". We illustrate this approach to testing with examples from our Census experience, and also from a commercial client study. Using this technology, we produced two test decks containing the same respondent data; one on a “poor” form design, and one on a “good” form design, and actually measured the data capture quality from each design. These results, along with related cost estimates, enable us to estimate the possible Return On Investment (ROI) due to better handprint recognition resulting from an improved forms design.

Introduction

We began consulting with the U.S. Census Bureau in 1993, and are continuing to do so to this day, preparing for the 2010 Decennial Census. We assist the Bureau in many areas, including forms design, printing, automatic recognition, system testing and evaluation, and quality assurance. We became aware early on in developing the Census’ Data Capture System 2000 (DCS 2000) that marked improvements in form design were going to be needed for the year 2000 Decennial Census. The forms used for the 1990 Census were considered “machine-friendly”, but were not “respondent-friendly”, which became a Census goal for DCS 2000. An example snippet of the 1990 form is shown in Fig. 1. The 1990 form had a confusing layout, a scattering of distracting black marks needed for registration, and very tiny circles that the respondent was instructed to fill in to indicate the answers to multiple-choice questions. In the 1990 Census, no Optical Character Recognition (OCR) was used, but rather human keyers keyed the write-in data. The forms were microfilmed, and an ingenious Census invention called “FOSDIC” (Film Optical Sensing Device for Input to Computers) was used to scan the microfilm with a flying-spot scanner and read where the respondents filled in the tiny circles, performing an early version of what we now call Optical Mark Recognition (OMR). [For an excellent discussion of respondent-friendly design, FOSDIC, and many other related matters, see Ref. 1].

Figure 1 - 1990 U.S. Census Form

One of the challenges we faced in moving the Census Bureau to an electronic imaging solution for DCS 2000 was to design forms that were friendly to the respondent as well as the data capture system. A snippet of the Census 2000 “short form” is shown in Fig. 2. Here, you can see that color was used to guide the respondent through the questionnaire, check-boxes were spaced apart for easy marking of the answers to multiple-choice questions, and light colored lines were used to aid the respondent in filling out the write-in fields neatly to improve OCR. Although not perfect, the Census 2000 forms showed that it is actually possible to achieve “respondent-friendly” and “data capture-friendly” in the same form design.

As we at ADI began to work with clients other than the U.S. Census, we found that handprint data capture in the commercial world was quite variable in terms of quality. We saw that OCR systems were not being used optimally, that excessive manual keying was used to get the answers, and some of the form designs we saw were neither respondent-friendly nor data capture-friendly. In one case, when we suggested changing the forms designs to get better data capture, we were told that “some other business unit” was in charge of that, and changes were not feasible.

So we devised a way to test the client’s data capture system with both the incumbent form design and an improved one, both populated with identical data, and we showed how to estimate the possible financial value to the organization as a whole of moving to an improved form design. This is the topic of this paper.

Total Systems Approach

To begin with, an aspect of handprinted forms data capture that does not seem to be appreciated by many is that it is a “system”, containing many interacting subsystems. A diagram similar to one we have used with Census to illustrate this concept is shown in Fig. 3. Note we have intentionally chosen to place “The Form” at the center of this mini-solar system, surrounded by 10 planetary subsystems, all of which interact to produce overall data capture results. The major point here is that a high-quality, cost-effective data capture system is possible only to the extent these interactions are understood and incorporated into the overall design. This diagram provides the outline for a lot of discussion beyond the scope of this paper, but here we will focus on how the form design interacts with the data capture technology, and how to produce more accurate output data at lower cost by understanding this interaction.

[Figure 3: diagram with “The Form” at the center, surrounded by subsystem labels: Security, USPS Automation Advances, Replacement Mailing, Printing Technology, Implementation (DRIS), Technical Requirements, Data Capture Technology, Color Science, Concept Form Designs, System Testing]

A Typical OCR System

Briefly, a typical OCR system looks like the diagram shown in Fig. 4. Here, forms containing respondent data are scanned and the form images are processed and passed to an OCR subsystem that attempts to infer the contents of the respondent’s handprint. Depending on the confidence the OCR has in a write-in field it is trying to read, it either accepts the inferred result or rejects it. Rejected field snippets are sent to human keyers, who attempt to key the correct answer to the field snippet. The term “Reject Rate” is used to measure the fraction of fields input to the OCR that is not read automatically. The keyed fields are then merged with the accepted fields and stored in a database for use by the client in their subsequent business processes. It turns out that there are always errors in the final merged data fields, either from the OCR accepted fields, or from the human keyers. The errors in the accepted OCR fields are measured by a quantity called OCR “Error Rate”.
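The accept/reject routing and the two quality measures just described can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Field` record, the confidence values, the threshold, and the sample data are invented for illustration, and the error rate is assumed here to be the fraction of accepted fields that disagree with the truth (the paper does not spell out the denominator).

```python
from dataclasses import dataclass


@dataclass
class Field:
    truth: str         # known correct value, e.g. from a truthed test deck
    ocr_text: str      # what the OCR inferred from the handprint
    confidence: float  # OCR confidence in [0, 1] (hypothetical scale)


def score(fields, threshold):
    """Return (reject_rate, ocr_error_rate) at a given confidence threshold.

    Fields below the threshold are rejected and would go to human keyers.
    The error rate is computed over the accepted fields only (an assumption).
    """
    accepted = [f for f in fields if f.confidence >= threshold]
    reject_rate = (len(fields) - len(accepted)) / len(fields)
    errors = sum(1 for f in accepted if f.ocr_text != f.truth)
    error_rate = errors / len(accepted) if accepted else 0.0
    return reject_rate, error_rate


# Invented sample data: five write-in fields with their OCR readings.
fields = [
    Field("SMITH", "SMITH", 0.98),
    Field("JONES", "JONES", 0.91),
    Field("BROWN", "BR0WN", 0.72),
    Field("DAVIS", "DAVIS", 0.65),
    Field("LOPEZ", "L0PFZ", 0.30),
]

for threshold in (0.0, 0.5, 0.8):
    reject_rate, error_rate = score(fields, threshold)
    print(f"threshold={threshold:.1f}  reject={reject_rate:.2f}  error={error_rate:.2f}")
```

Raising the threshold sends more fields to keying and, in this toy data, leaves fewer errors among the accepted fields, which is the trade-off between Reject Rate and Error Rate that the rest of the paper builds on.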

Figure 4 – An Elementary OCR System

Engineers who work on OCR subsystems are fond of measuring the quality of OCR by plotting OCR Error Rate versus Reject Rate, as shown in Fig. 5. In this somewhat idealized model, the OCR Error Rate starts out highest at a Reject Rate value of zero,


Figure 5 - OCR Error Rate vs. Reject Rate

How to Measure OCR Quality

Most present-day practitioners of data capture use what is sometimes called a “golden deck” of carefully truthed results they can run through their data capture system to infer its performance. We prefer to use what we call a Digital Test Deck®, which makes possible a lot of synthetic (but realistic) test data on the client’s preferred form type, reproducible as needed for testing over time or between sites. An example of a snippet from one of these decks made recently to test the 2008 Census Dress Rehearsal system is shown in Fig. 6. You can’t tell them from “real” forms, and neither can the scanner and data capture system under test. (The data looks real, but is not; it is test data.)

Cost Model

To estimate overall data capture system costs for processing forms, it is necessary to measure a graph something like that in Fig. 5 (or at least a few points on such a graph). We presented a data capture cost model at TAWPI in 2006, and it was published in TAWPI’s “Today” magazine (Ref. 2). When we apply the usual costs of keying, costs of scanning, and other costs typically associated with forms processing (including the cost of correcting an error downstream), we may obtain a plot of the cost to process an average form versus Reject Rate something like the example shown in Fig. 7.

Figure 7 - Cost per Form vs. Reject Rate
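A curve of this general kind can be sketched in a few lines. This is not the cost model of Ref. 2: the functional form, the exponential error-versus-reject curve, and every dollar figure below are assumptions chosen only to show how keying costs (which rise with Reject Rate) and downstream error costs (which fall with it) can combine into a cost curve with an interior minimum.

```python
import math

# All values below are illustrative assumptions, not figures from the paper.
SCAN_COST = 0.05       # dollars per form for scanning and image handling
FIELDS_PER_FORM = 20   # write-in fields on an average form
KEY_COST = 0.02        # dollars to key one rejected field
ERROR_COST = 1.00      # dollars to correct one wrong field downstream
ERROR_AT_ZERO = 0.10   # OCR error rate when nothing is rejected
DECAY = 20.0           # how quickly errors fall as more fields are rejected


def ocr_error_rate(reject_rate):
    """Idealized error-vs-reject curve: highest at zero, falling as rejects rise."""
    return ERROR_AT_ZERO * math.exp(-DECAY * reject_rate)


def cost_per_form(reject_rate):
    """Scanning, plus keying of rejected fields, plus cost of accepted OCR errors.

    Keying errors are ignored to keep the sketch short.
    """
    keying = reject_rate * KEY_COST
    errors = (1.0 - reject_rate) * ocr_error_rate(reject_rate) * ERROR_COST
    return SCAN_COST + FIELDS_PER_FORM * (keying + errors)


def optimum_reject_rate(steps=100):
    """Grid search over reject rates 0.00, 0.01, ..., 1.00 for the lowest cost."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=cost_per_form)


best = optimum_reject_rate()
print(f"lowest cost ${cost_per_form(best):.3f} per form at reject rate {best:.2f}")
```

With these made-up numbers the minimum falls well inside the range rather than at either end: rejecting nothing leaves too many costly errors, and rejecting everything pays for keying that the OCR could have avoided.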

In Fig. 7, we see a very interesting effect, namely, that there is an optimum Reject Rate for a given data capture system that produces the lowest cost to process a form. If a fundamental improvement in the data capture system is made, such as an improved form

Figure 8 - A "Poor" Form Design with Black Boxes

Figure 9 - A "Good" Form Design with Color Boxes

This may seem to be an extreme example, as it is pretty well understood these days that black write-in boxes are very confusing for OCR engines to process; nevertheless it is real, and actually being used. Essentially, the OCR engines get confused by the extra horizontal and vertical lines in the handprint image snippet, and tend to reject practically everything. We created two Digital Test Decks®, one with the incumbent form design and one with the improved form design, but each containing the same respondent data. The results of testing the OCR system with these two form types were dramatic, as shown in Fig. 10.

Using the cost model, and business values particular to AnyCo’s system, we were able to show that AnyCo could save up to 40 cents per form in data capture and processing costs if they went to the improved form designs, as shown in Fig. 11.

Figure 11 - Cost to Process a Form versus Reject Rate for “Good” (Blue) and “Poor” (Red) Form Designs

Since AnyCo processed several tens of millions of forms per year, this showed them the possibility of millions of dollars in savings to their enterprise by using better form designs. This client did make some form design improvements, and began discussions with a major OCR vendor to install an improved OCR system as well.
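The arithmetic behind that statement is simple enough to check directly. The 40 cents per form is the upper bound quoted above; the annual volumes in this sketch are assumptions standing in for “several tens of millions”, since the paper does not give AnyCo’s exact figure.

```python
def max_annual_savings_dollars(forms_per_year, savings_cents_per_form=40):
    """Upper bound on yearly savings, using the 'up to 40 cents per form' figure."""
    return forms_per_year * savings_cents_per_form / 100


# Assumed volumes; the paper says only "several tens of millions of forms per year".
for forms_per_year in (20_000_000, 50_000_000):
    savings = max_annual_savings_dollars(forms_per_year)
    print(f"{forms_per_year:,} forms/year -> up to ${savings:,.0f} per year")
```

Because 40 cents is a ceiling, these are best-case figures; actual savings would depend on where the client operates on its cost curve.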

Conclusions

REFERENCES

(1) Dillman, Don A., Mail and Internet Surveys. New York: John Wiley & Sons, Inc., 2000.

(2) Paxton, K. Bradley, Optimizing Paper Data Capture, TAWPI “Today”, 29 (2006): 26-28.
