Chapter 5. Designing a Datacap solution

5.3 Choose your starting point

5.3.4 Extracting data from the images

There are four ways of extracting data from an image. They are listed here from most preferred to least preferred:

• Zonal

• Locate with regular expressions

• Locating by keyword

• Click N Key

Where possible, try to choose the extraction method that ranks highest in the list.

Zonal method

Zonal extraction is considered the best extraction method. You know exactly where the data you are trying to extract is located, and, knowing that position, you can be specific about how that area of the form is recognized.

In Figure 5-8, zonal techniques can be used to extract the data in a specific way. For example, the TOTAL field can be set to recognize hand-printed characters, consider only numeric digits, and look for nine character positions. This gives you much better results than simply recognizing the zone without the additional information provided.

Figure 5-8 Zonal recognition example

Zonal recognition is not dependent on the format of the data (as is the case with Regular Expressions) nor on the successful recognition of data elements close to the desired data (as is the case with Keywords) to extract the data. You define a zone, and whatever is in that zone gets populated in the field.
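The idea of a zone with recognition constraints can be sketched in Python. This is a minimal illustration and not Datacap code: the Word and Zone structures, the coordinates, and the function names are all hypothetical. In a real recognition engine the constraints guide recognition itself, whereas this sketch only checks the recognized text afterward.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word and its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Zone:
    """A rectangular zone plus optional constraints on its content."""
    left: int
    top: int
    right: int
    bottom: int
    allowed_chars: str = ""  # empty string means no character restriction
    max_length: int = 0      # 0 means no length limit


def extract_zone(words, zone):
    """Return the text of every word whose box lies fully inside the zone."""
    inside = [
        w for w in words
        if w.left >= zone.left and w.right <= zone.right
        and w.top >= zone.top and w.bottom <= zone.bottom
    ]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in inside)


def satisfies_constraints(text, zone):
    """Check the extracted text against the zone's constraints."""
    if not text:
        return False
    if zone.max_length and len(text) > zone.max_length:
        return False
    if zone.allowed_chars and any(ch not in zone.allowed_chars for ch in text):
        return False
    return True


# Made-up page content: a label, a value, and unrelated text.
page = [
    Word("TOTAL", 400, 700, 470, 720),
    Word("10495", 500, 700, 580, 720),
    Word("Thank", 100, 760, 160, 780),
]

# Zone for the TOTAL field: digits only, at most nine character positions.
total_zone = Zone(490, 690, 600, 730, allowed_chars="0123456789", max_length=9)
total_value = extract_zone(page, total_zone)
total_ok = satisfies_constraints(total_value, total_zone)
```

Whatever falls inside the zone becomes the field value, regardless of its format or of the text around it, which is the property the paragraph above describes.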

Locate with regular expressions

This technique can find data that is located anywhere on the recognized page. When recognition takes place, a CCO file is created that contains the recognized data and the location where the data was found. If a zone is not known (eliminating zonal extraction), regular expressions are the next best choice.

A regular expression locates data by searching the entire CCO for data in the specified format. For instance, to find a date, you might write a regular expression that looks for white space followed by two digits, a dash, two more digits, another dash, and four more digits. In common regular expression syntax, that pattern can be written as \s\d{2}-\d{2}-\d{4}.
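The date search described above can be sketched in Python. The list of (text, left, top) tuples is only a hypothetical stand-in for the recognized words and positions that a CCO file holds; it is not the CCO format. Because each word is tested on its own, the word boundaries play the role of the surrounding white space.

```python
import re

# Hypothetical stand-in for recognized words and their positions.
recognized = [
    ("Invoice", 50, 40),
    ("date:", 130, 40),
    ("03-15-2016", 190, 40),
    ("Total", 50, 400),
    ("104.95", 130, 400),
]

# Two digits, a dash, two digits, a dash, four digits.
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{4}")


def find_by_pattern(words, pattern):
    """Return every (text, left, top) entry whose whole text matches the pattern."""
    return [entry for entry in words if pattern.fullmatch(entry[0])]


dates = find_by_pattern(recognized, DATE_PATTERN)
```

The match carries its position with it, so data can be found anywhere on the page without a predefined zone.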

Although regular expression syntax might look daunting, it is worth the trouble to learn the basics and to use this technique where applicable.

Its use is limited, however, to finding data that is uniquely and somewhat predictably formatted. Insurance IDs, banking account numbers, credit card numbers, phone numbers, and so on have a specific format that they must follow, which is unlikely to resemble other unrelated data in the document. A five-digit employee number, by contrast, is not distinctive: a search for it might also match unrelated five-digit values elsewhere on the document. Regular expressions can therefore be used successfully for locating and extracting data only in certain situations.
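The difference between a distinctive format and an ambiguous one can be seen in a short Python sketch. The page text and both patterns are made up for illustration.

```python
import re

# Made-up recognized text containing two unrelated five-digit values.
page_text = "Employee 48213 Springfield IL 62704 Phone 555-201-7788"

five_digit = re.compile(r"\b\d{5}\b")
phone = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

five_digit_hits = five_digit.findall(page_text)  # more than one candidate
phone_hits = phone.findall(page_text)            # exactly one candidate
```

The phone pattern yields a single hit, while the five-digit pattern yields two candidates that the expression alone cannot tell apart.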

Locating by keyword

Keywords are labels that accompany the data you want to find on the form. For instance, you might have the data as shown in Figure 5-9.

Figure 5-9 Sample keyword data

To find the various pieces of data in this example, you can find text, such as Insurance:, and then look to the right of or below that word to find the actual data that you want to extract ($104.95).

Locating by keyword ranks third behind the zonal and regular expression methods in terms of reliability because you have to know what the keyword is and recognize the keyword correctly. You also have to anticipate the location of the actual data, relative to the position of the keyword, for extraction.

Keyword extraction is made easier by using keyword lists, which are text files that contain many different keywords that might be used as labels around the data you want to extract. For instance, a keyword list could contain Total, Total Due, Pay this amount, and similar words or phrases that occur in close proximity to the actual data.

The rules engine can be configured to account for the distance and direction in which your data might be located relative to the keyword. For example, you could create a function that first looks one word to the right, and if the appropriate data is not found there, looks one line below the keyword.
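A keyword locate with a keyword list and a right-then-below search can be sketched in Python as follows. This is an illustration under stated assumptions, not the Datacap rules engine: the page is modeled as a list of lines, each line as a list of (text, left) pairs, and the function name, the amount pattern, and the sample layouts are hypothetical.

```python
import re

# Hypothetical keyword list and value format.
KEYWORDS = ["Total", "Total Due", "Pay this amount"]
AMOUNT = re.compile(r"\$?\d+\.\d{2}")


def locate_by_keyword(lines, keywords, value_pattern):
    """Find a keyword, then look one word to the right, then one line below."""
    # Try longer keywords first so "Total Due" wins over "Total".
    ordered = sorted(keywords, key=lambda k: -len(k.split()))
    for row, line in enumerate(lines):
        texts = [text.rstrip(":").lower() for text, _ in line]
        for keyword in ordered:
            parts = keyword.lower().split()
            for start in range(len(texts) - len(parts) + 1):
                if texts[start:start + len(parts)] != parts:
                    continue
                end = start + len(parts)
                # Step 1: the word immediately to the right of the keyword.
                if end < len(line) and value_pattern.fullmatch(line[end][0]):
                    return line[end][0]
                # Step 2: the word on the next line that is horizontally
                # closest to the start of the keyword.
                if row + 1 < len(lines) and lines[row + 1]:
                    key_left = line[start][1]
                    below = min(lines[row + 1], key=lambda w: abs(w[1] - key_left))
                    if value_pattern.fullmatch(below[0]):
                        return below[0]
    return None


# Value to the right of the keyword.
right_layout = [
    [("Insurance:", 50), ("$104.95", 150)],
    [("Total", 50), ("Due:", 100), ("$250.00", 150)],
]
# Value one line below the keyword.
below_layout = [
    [("Total", 300)],
    [("Thank", 50), ("$99.10", 310)],
]

found_right = locate_by_keyword(right_layout, KEYWORDS, AMOUNT)
found_below = locate_by_keyword(below_layout, KEYWORDS, AMOUNT)
```

The sketch shows why this method is less reliable than the two before it: it fails if the keyword is misrecognized, and it depends on a guess about where the value sits relative to the keyword.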

Click N Key

The fourth and final method of extracting data from a form is for the data entry operator to click the form. Data at that location is then extracted and placed into the active field. This is the least preferred approach because it is the most expensive (it requires a human to do the clicking), although it is reliable.

An application is not limited to one method of extracting data. For forms, zonal recognition is generally sufficient because the zones are configured for each image type before use. However, in a learning environment, we typically use a hierarchy of techniques to find the data we are looking for.

It is common in a learning application for each field to try to find data in a predefined zone, and, if not found there, to try a regular expression or keyword locate. If those two methods also fail, the operator is prompted to click the data on the image. When the operator clicks the image, the data is extracted, but the position (zone) information of where the operator clicked is also stored and can be saved and used for future encounters with similar images.

This is why such applications are called “learning applications.” They automatically use the best extraction technique available on every image, and they can use the process to learn how to best extract data from future images. Rather than knowing what every image looks like at the outset of the project, such applications add fingerprints during runtime and use zonal recognition on an increasing percentage of the images encountered over time.
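The fallback hierarchy and the learning step can be sketched in Python. Everything here is hypothetical: the page dictionaries, the fingerprint names, the invoice-number pattern, and the simulated operator are assumptions made for illustration, and the sketch does not reflect how Datacap stores fingerprints or zones.

```python
import re


def extract_field(page, learned_zones, automatic_methods, ask_operator):
    """Try a known zone, then the automatic locate methods, then the operator."""
    zone = learned_zones.get(page["fingerprint"])
    if zone is not None:
        value = page["words"].get(zone)
        if value:
            return value, "zonal"
    for name, method in automatic_methods:
        value = method(page)
        if value:
            return value, name
    # Last resort: the operator clicks the data on the image.
    value, clicked_zone = ask_operator(page)
    # Learning step: remember where the operator clicked for this fingerprint.
    learned_zones[page["fingerprint"]] = clicked_zone
    return value, "click"


def by_regex(page):
    """Locate an invoice number in a made-up INV-nnnnnn format."""
    match = re.search(r"\bINV-\d{6}\b", " ".join(page["words"].values()))
    return match.group(0) if match else None


def simulated_operator(page):
    """Stand-in for a human click: picks the first zone on the page."""
    zone = next(iter(page["words"]))
    return page["words"][zone], zone


# Each page maps a zone (left, top, width, height) to its recognized text.
page1 = {"fingerprint": "vendor-A", "words": {(400, 50, 120, 20): "A7731"}}
page2 = {"fingerprint": "vendor-A", "words": {(400, 50, 120, 20): "A7732"}}
page3 = {"fingerprint": "vendor-B", "words": {(10, 10, 100, 20): "INV-004512"}}

learned = {}
methods = [("regex", by_regex)]
result1 = extract_field(page1, learned, methods, simulated_operator)
result2 = extract_field(page2, learned, methods, simulated_operator)
result3 = extract_field(page3, learned, methods, simulated_operator)
```

The first vendor-A page falls through to the operator, and the stored click position lets the second vendor-A page be read zonally with no human involvement.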

In document Implementing Mobile Document Capture with IBM Datacap Software (Page 144-146)