Recognition and text manipulation

Chapter 2. Advanced Datacap capabilities

2.2 Multichannel input

2.3.6 Recognition and text manipulation

Recognition is concerned with the process of converting the contents locked in the images of paper or electronic documents to live data that can be put to use, such as indexing

documents and driving the business workflow, producing full-text searchable documents for analytics, or feeding data to line-of-business systems.

In addition to recognition, Datacap provides several capabilities to locate, validate, manipulate, and look up extracted data in support of classification, redaction, and quality assurance.

Bar code recognition

The bar code detection capabilities are used to automatically locate and recognize one or several bar codes in an image. Bar codes are used to store data that can be detected with high accuracy, even on poor quality documents.

Many types of bar codes have been developed over the years to serve the needs of specific industries, such as retail, logistics, manufacturing, postal services, healthcare, airlines, and transportation in general. The choice of using a specific bar code in a Datacap application can be dictated by the bar code standard in use by the customer. Alternatively, you can decide on the choice of bar code arbitrarily to satisfy the internal needs of the application. Such needs might be to identify a type of form, mark document boundaries in a batch, or reconcile incoming documents with a work item or other documents that belong together. Datacap can detect 1D and 2D bar codes. One-dimensional bar codes (Figure 2-6) store a limited amount of data, typically a single alphanumeric string that can be used as a reference. They are coded by using a pattern of vertical lines of varying width read along the horizontal axis of the bar code.

Figure 2-6 One-dimensional bar code Code 39

For example, the Code 39 bar code can be used to code a claim number on outgoing correspondence to a customer to automate the claim identification process when the mail returns. Code 39 can store up to 43 alphanumeric characters.

Two-dimensional bar codes can store up to several kilobytes of data. They are coded by using a matrix that represents information along the vertical and horizontal axes of the bar code.

The PDF417 bar code is a popular two-dimensional bar code (see Figure 2-7) that has many uses including to code driver license information in certain states. It codes information in multiple rows (from 3 to 90) that show clusters of bars and spaces. It is described as a “portable data file,” because the bar code carries the information, not just a reference to it.

Figure 2-7 Two-dimensional bar code PDF417

For example, this PDF417 bar code encodes the entire postal address: “International Business Machines, 3565 Harbor Boulevard, Costa Mesa, California, 92626-1405, United States.”

A full list of the 1D and 2D bar codes supported by Datacap can be found in the Datacap documentation.

A special type of bar code is the patch code (Figure 2-8), which is typically used on sheets of paper that are inserted in a batch to separate documents. Unlike a bar code, a patch code needs to be positioned precisely parallel to the leading edge of the page in the scanner transport.

Figure 2-8 Patch code type 2

A patch code consists of a pattern of 4 horizontal bars with 2 levels of thickness which results in 6 combinations. Each patch code type has a specific use. For example, type 2 indicates the start of a new document, whereas type 4, also called the toggle patch code, can be used to indicate the start and end of a portion of the batch with color images.

Optical character recognition (OCR)

OCR technology is used to convert machine-printed text in an image to editable text. It is at the core of the Datacap document identification and data capture process and is used for full page recognition and zonal data extraction. Datacap includes two OCR engines that can be used individually or in combination to offer the best possible results.

Although implementations differ, the two engines operate in a similar fashion. At a high level, the OCR engine analyzes the image and the textual zones and isolates individual characters. It then compares each character to a collection of template character bitmaps (in various fonts) and selects the closest match. It assigns each character a “recognition confidence level” based on how well it correlates with the template. Then it assembles the characters into words and resolves ambiguities by using dictionaries or lexicons and various other

The confidence levels are measured against the confidence thresholds set in Datacap. The higher the OCR confidence threshold is, the higher the number of errors. Confidence levels of recognized zones are saved in the Document Hierarchy at run time to be used to color code the fields and drive the focus to problem fields in the Datacap Verify user interface. It is also possible to run multiple OCR engines on the same image to achieve the most accurate results across them all, which is a technique called

voting

. In that case, in each successive pass, Datacap compares the confidence level of the recognized data in the current pass with the one recorded in the previous pass. It then only updates the field with the new data and the confidence level if it is higher.

Although the accuracy of OCR engines has improved over the years, it is affected by many factors. These factors include the page layout and background, image resolution,

color/contrast/brightness, skewing, jagged characters, font type, size, and emphasis, type of compression, language, and character set. However, in most Datacap implementations, the following factors can be adjusted to produce the best possible results:

򐂰 Resolution, color, brightness, contrast, and skewing can be tuned at the scanner level or corrected by using Datacap image cleanup and enhancement actions.

򐂰 Character attributes, such as jagging, thickness, or thinness, can also be corrected by using image enhancement actions.

򐂰 Page layout, font type, and font size can be adjusted if the organization can influence the design of forms.

򐂰 Uncompressed or CITT Group 3/4 TIFF can yield better recognition results. In such case, Datacap offers the ability to convert incoming images with other formats and compression schemes to TIFF. They can still be carried through the Datacap workflow by using a separate stream to the repository. Alternatively, Datacap can convert the TIFF images to a more compressed format at the Export stage.

򐂰 Recognition performance of national languages and character sets (especially accented characters) can be adjusted by selecting the OCR engine that yields the best results or by using voting.

Intelligent character recognition (ICR)

ICR technology converts hand-written characters in an image to editable text. The overall operating principle of ICR is similar to OCR. However, because of the variability in handwriting, the techniques to separate and classify characters are different. Rather than trying to isolate and match whole characters, individual handwriting strokes are isolated and analyzed spatially to see which strokes most likely belong to which characters.

Also, character classification requires a much broader base of shapes and more complex methods, including statistical probabilities, to eliminate ambiguity and identify characters and words with confidence.

Figure 2-9 shows an example of ICR on a tax form.

Figure 2-9 ICR on tax form

ICR is affected by the same factors as OCR. However, it is affected by the clutter introduced by handwriting going over the boxes used in form fields and for normalizing character spacing (“constraint boxes”). Line removal or repair and dropout are typically employed to make the text stand out and improve accuracy.

Optical mark recognition (OMR)

OMR is used in Datacap to detect whether check boxes, bubbles, or other types of marks have been selected. It is typically used in combination with other Datacap functions that applies processing logic to the basic OMR results to interpret and turn them into actionable data. The results depend on the purpose and type of prompt or answer expected. For example, the answer might be yes or no, yes or no to all that apply, multiple choice, grid to form numbers or words, or add up marks.

Although this technology has been in use for years, it is not easy to address the many situations that can arise and to determine the confidence levels and errors that trigger the Verify task. Factors that affect accuracy include the type of marks, filling method, variability in the response of the filler (spill over, too light, erasure), and interfering specks and background noise in the image. See Figure 2-10.

Datacap provides flexibility to address these cases. When configuring the zones and fields, you define a parent field (for example, “Frequency”) as an OMR group and subfields (for example, “One Time,” “Monthly,” “Quarterly,” and “Annual”) to host options that belong together. You also specify in the parent field the number of options and whether multiple ones can be selected.

To do the actual recognition, Datacap offers two methods to address various situations. In most cases, when regular check boxes are used, the quickest setup is to use the OCR_A (ABBYY) engine. When the outline of the check box is removed by line removal or dropout, which is sometimes necessary to process high-density forms, or when using bubbles or unusual check marks, you might need to use the “pixel evaluation method” instead. In this method, the zone defined for the check box is evaluated against a threshold of darkness (black pixel percentage) and a threshold of background noise that you specify to determine what is considered a checked mark. The difficulty with this technique is that it is more sensitive to variations in background noise or specks, which affects the pixel count for the zone. Therefore, adjustments might be necessary to find the appropriate values of these thresholds.

Locating text

Datacap offers an extensive library of actions that can be used to search the recognized text for specific words and regular expressions and to navigate around a page based on the positions of lines and words detected by the OCR engines.

These actions are used in instances when text location cannot be predicted, such as in semi-structured or free forms, but when specific keywords or text can be expected in labels and headers. They include the capability to form lists of keywords, such as Invoice, InvNum,

Invoice #, and regular expressions, such as the one that follows, for a US Social Security number, and to iterate through them until matches are found:

[\^\b\s\n\r][0-9]{3}[\-]{1}[0-9]{2}[\-]{1}[0-9]{4}[\b\s]*

When a particular textual form is located, you can use navigation actions such as moving up/down several lines, or right or left several words, to find and manipulate the target data with other actions, such as those used for validation or to process line items and tables in purchase orders, invoices, or shipping manifests.

Combining recognition techniques for successful extraction

Successful content extraction depends on combining three aspects of recognition techniques:

򐂰 The correct recognition engine on the correct type of content and language

򐂰 Judiciously defining the recognition scope, that is, zonal versus full page recognition, depending on the type of page layout

򐂰 The method used in the application to target the text to be extracted, depending on the variability of the page layout

We have seen in the previous sections that OCR is designed for machine print and ICR for hand-printed characters. Recognition accuracy, however, also depends on the national language being recognized, especially when character sets are different from one language to another, such as Russian and English. Within the same character set, accuracy can also be improved by precisely selecting the locale, because specific dictionaries can be used to disambiguate wrongly recognized words. Also, some recognition engines can be better than others on certain fonts or languages, so it is possible to improve accuracy by using voting to run different OCR engines on the same document and to keep the best result of the combined passes.

For information about the languages supported by Datacap, see the Datacap Language support page on the Web:

http://www.ibm.com/support/docview.wss?uid=swg27044111

If the full content of a page is not required, zonal recognition typically performs much faster than full-page recognition and works both for machine and hand-printed characters. However, zone positions need to be reliable across documents of a given type. They are stored in Datacap as part of the fingerprint to delineate the parts of the image where the recognition needs to take place. If the page layout changes, the recognition engine will look at the wrong place and produce unreliable results.

When layout variability is too great, a better strategy is to use full-page recognition and search the extracted text content for the target strings or search by using regular expressions. When the OCR engine processes a full page, it analyses the image to find and identify the zones of interest, such as the blocks of text, text lines within the blocks, words within the lines, photo areas, tables, and so on. The engine stores their positions together with the recognized content so that it can be used later by Datacap to navigate the text from the page.

For example, a medical record number in a block of text that flows with the preceding text can be extracted by searching for the string Medical Record # and then navigating immediately to the right and capturing the actual number, as shown in Figure 2-11. Similarly, a Social Security Number (SSN) can be directly extracted from the content by using a regular expression that filters an SSN number with 3 numbers, dash, 2 numbers, dash, and 4 numbers.

Figure 2-11 Navigating the recognized text to extract data

Validation

Datacap offers an extensive library of elemental actions to validate and manipulate data captured in the objects of the runtime Document Hierarchy and to ensure that they conform to your business rules.

The following actions are possible:

򐂰 Data formatting actions to normalize field values or prepare them for calculation or comparison purposes, including the following actions:

– Padding with zeros or spaces to match the expected number of characters – Deleting a specific character in a specific position or all instances

– Deleting a class of characters from a field (alpha, numeric, punctuation, non-alpha numeric, or system characters)

– Testing data types (date, currency, alpha, or numeric) and field length – Converting the case of characters or value to currency

– Clearing a field value

– Inserting a decimal point or a specified character in an existing field value – Trimming spaces and truncating and splitting field values

򐂰 Manipulation of field values and document hierarchy variables, including the following actions:

– Assigning default values to fields

– Copying or appending values between fields

– Comparing field values and verifying arithmetic calculations between fields – Comparing dates and testing them within a range or days

– Assigning data or a time stamp to a field

– Testing field content: Filled, empty, max or min. length, specified matching value, or percentage numeric or non-numeric data

– Testing the number of OMR boxes and whether they are selected – Testing a regular expression in a field

– Testing variables and assigning variables to fields – Summing up values of subfields

򐂰 Invoking a message box to provide guidance in the Verify task

For example, by using combinations of these actions, you can create rules and attach them to any object of the Document Hierarchy to test the following information:

– Values are within accepted ranges – Data is formatted as required

– Dates are valid and deadlines are met – Numbers add correctly or are not missing

– Mandatory fields and check boxes have been completed – Dependencies between data are respected

– Data matches sets of permitted values, as checked by using lookup actions

Upon failure of any of these rules, Datacap flags the associated field and page for manual review in a Verify task, similar to actions for low-confidence recognition results.

Lookups

As a good practice, check and normalize data early in the business process to reduce errors and enforce consistency. To achieve this objective, Datacap provides a set of actions that you can use to connect through ODBC to a business database hosting reference information to run SQL queries to populate fields in the runtime Document Hierarchy. Such information might include customer names, part numbers, and geographical information, such as states and locations of branch offices.

On the client side, Datacap Navigator also offers the ability to use the External Data Service (EDS) capability of IBM Content Navigator for lookups. When an external data service is configured for a certain action or property, the service is invoked automatically every time a user interacts with that item in Datacap Navigator.

In document Implementing Mobile Document Capture with IBM Datacap Software (Page 66-72)