Chapter 4. Planning considerations
4.2 Define requirements for the capture system
4.2.6 Capture design considerations
This section highlights the areas to consider when designing a capture system and indicates the alternatives that are available. You can select from multiple options depending on your business and technical requirements.
Document acquisition
Datacap services various input channels that deliver documents in several formats. Channels or methods of capturing documents include scan, mobile device, multifunction devices, printers, fax, email, file import, and a web service. Some of the considerations for each channel are provided in the following sections.
Scanning
Direct scanning is typically done by internal users. Both desktop and web client options are available.
Desktop scanning is used in centralized scanning operations that use mid-range to high-volume scanners that have heavy-duty cycles. These types of scanners support continuous operation in multiple shifts. Even though a lower-cost scanner might have a high scan rate, it might not be designed for continuous operation.
The Datacap scanning user interface is production-oriented to support highly efficient operation of the scanner. In this environment, scanners are operated nonstop. Scan operators occasionally check the scan quality of images. The goal is to maximize the throughput of the production-level scanners.
Although multifunction devices (MFDs) can also be used to scan documents, they are typically used in distributed capture environments rather than in high-volume environments. MFDs have integrated scanner, printer, copier, and fax capabilities. Production-level MFDs can be operated as stand-alone devices without being connected to a workstation. In this mode, the MFD control panel is used to control the scanner. Images can be transferred to a well-defined storage location by using the network filing, File Transfer Protocol (FTP), or email functions of the device. Datacap can import and process the documents by using its virtual scanning and email import actions.
A preferable MFD solution is to enable direct Datacap integration through the use of NSi Autostore or Imagine Solutions’ Encapture products, which can be purchased through IBM. These products can integrate directly into the MFD console, which provides the ability to select a Datacap application to scan into, select document types, and enter index properties. One area of common confusion is the difference between thin client scanning and operating an MFD directly. If you scan with an MFD by using thin client scanning, the MFD is connected to a workstation by using a TWAIN driver and the web user interface on the workstation provides the scanning control panel. This method is used with lower-end desktop MFDs and is not used with higher-end production MFDs.
Consider the following additional factors:
Many current generation devices include image enhancement features that are run within the scanner hardware or in the scanner driver. In either case, the resulting image might have improved readability, improved recognition results, and reduced file size.
Scanners are available for specialized purposes, such as remittance scanning and large format document scanning. These devices might not have Image and Scanner Interface Specification (ISIS) or TWAIN drivers. Therefore, they interface with Datacap by using import features.
If you expect an MFD to be used full-time as a scanning device, consider using a dedicated scanner instead. Production scanning can handle larger scan jobs that might occupy an MFD that needs to be shared by a workgroup.
Web client scanning can also be used in centralized scanning operations but is more commonly deployed for distributed capture. Common deployment models are dedicated scanning stations connected to mid-volume production scanners and user workstations connected to low-volume scanners.
You can use Datacap Web or IBM Content Navigator. For organizations that currently use or plan to deploy Content Navigator, use Content Navigator as the Datacap client. Content Navigator gives users a single interface whether they are scanning or validating documents or browsing a content repository, and it provides beneficial features such as detachable image windows for use on multi-window workstations.
Mobile
Adding support for mobile devices can enable field workers to process documents in near real time. Datacap Mobile is currently available for both iOS and Android phones and tablets. It acquires images and uploads them to a Datacap server for processing. Apple and Android phones and tablets support Content Navigator. Images can originate from the device’s photo album or from the built-in camera.
As is the case with other clients, users need to log in to the application that they want to add documents to. Their credentials determine which application they can use.
Datacap Mobile dynamically detects document edges and automatically captures and rectifies the image only when quality criteria are met. This ensures that documents are of sufficient quality and can be processed downstream. Image enhancement tools are provided to help the user improve the quality of the image if necessary. Users can provide manual index values if necessary. When captured, the images are uploaded to Datacap for further processing.
IBM provides the Datacap Mobile SDK for iOS and Android to integrate document capture and image processing capabilities into custom applications.
Fax
Datacap software works with fax server products so that documents that are sent to a fax server can be imported into the capture system and processed in the same manner as scanned documents.
Fax is typically used by external users. The trend in many organizations is to reduce the internal use of fax for capturing documents. This trend is due to the lower quality of the image and the greater time needed to send a fax compared to remote scanning. However, because fax requires low bandwidth, its use is common in situations where only dial-up connections are feasible.
The primary disadvantage of fax is low image quality. The quality of the equipment varies, which results in inconsistent image quality. Fax image resolution is also low and depends on the transmission mode:
Standard mode provides a horizontal scan at 200 or 204 scan lines per inch. It provides a vertical scan at 100 or 98 scan lines per inch.
Fine mode provides a horizontal scan at 200 or 204 scan lines per inch. It provides a vertical scan at 200 or 196 scan lines per inch.
Each fax transmission is received as a single TIFF or PDF file that contains multiple images. Datacap can burst the file into individual image pages for processing by the system. The image enhancement actions improve the ability of the system to recognize text. Datacap can normalize the dimensions of the image so that all the images are 200 dpi in both dimensions. It can also compress images to the TIFF Group 4 format.
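The normalization step can be illustrated with a little arithmetic. The following Python sketch is not Datacap code. It only shows the scale factors that are needed to bring a fax image to 200 dpi in both dimensions. The function names and the sample page size (1728 pixels wide by 1078 rows) are assumptions made for the example.

```python
from fractions import Fraction

# Nominal fax resolutions as (horizontal, vertical) lines per inch, from the text.
FAX_MODES = {
    "standard": (204, 98),
    "fine": (204, 196),
}

def scale_to_target(h_res, v_res, target_dpi=200):
    """Return the (x_scale, y_scale) factors that resample to target_dpi."""
    return Fraction(target_dpi, h_res), Fraction(target_dpi, v_res)

def normalized_size(width_px, height_px, h_res, v_res, target_dpi=200):
    """Return the pixel dimensions of the image after normalization."""
    x_scale, y_scale = scale_to_target(h_res, v_res, target_dpi)
    return round(width_px * x_scale), round(height_px * y_scale)

if __name__ == "__main__":
    # Illustrative standard-mode page: 1728 pixels wide, 1078 rows (11 in at 98 lpi).
    h_res, v_res = FAX_MODES["standard"]
    print(normalized_size(1728, 1078, h_res, v_res))  # (1694, 2200)
```

A standard-mode page is stretched roughly twofold vertically, which is why unnormalized standard-mode faxes look squashed and recognize poorly.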
Email
Datacap can capture and process email messages and their attached files. In addition to scanned images, Datacap can accept various electronic formats, such as word-processing documents and spreadsheets. Electronic documents can be converted to TIFF by Datacap so that they can be processed as images for data extraction and export.
Consider the following common scenarios for using email:
Documents can be received directly from customers or other external parties. In the scenario in this book, customers who want to refinance their loans can be allowed to send supporting documents by email to a service email account.
Email can be used as a replacement for fax as a way to transmit scanned or electronic documents.
File import
File import is a common method for inputting files into the system. The virtual scan (VScan or MVScan) features of Datacap are used to import files. File import can be done in an attended or unattended mode. In an attended mode, a user starts the virtual scan by using the desktop or web-client user interface. In an unattended mode, the virtual scan is run by the Rulerunner service, which runs as a Microsoft Windows service.
Consider the following common scenarios for using file import:
Receiving images from an external party. For example, a financial institution might receive loan file images as part of the process for purchasing loans from another financial institution.
Receiving images scanned by a scanning service. For example, large quantities of documents might be scanned by a third-party service as part of a backfile conversion.
Interfacing with fax or MFDs.
Interfacing with a scanner that does not have a TWAIN or ISIS driver. Some specialized scanners operate in this fashion.
MVScan can use index files to process images. An index XML file is provided along with the images to import. Within this XML document, you can specify the document type, the properties to be passed, and the pointers to the images to be imported.
Multiple MVScan threads can be configured within Rulerunner. They can point to the same or to different file locations. This is ideal for situations where you need a higher ingestion throughput.
Web service
Datacap exposes its document processing capabilities as a web service. The web service can run the background document processing tasks. This method is used by software applications that need to process documents.
Consider the following common scenarios for using a web service:
Processing previously scanned and stored documents that were not previously processed for recognition. A bank that stored loan documents when a loan was originated might want to perform data extraction on the same documents years later when a loan is modified.
Providing a service where documents can be processed in an ad hoc manner. An organization might provide a service to upload documents for recognition and transformation through a web application or portal.
The Datacap web service can run in Microsoft IIS or as a Windows service.
Centralized capture
With centralized capture, dedicated staff and equipment process documents in a factory-like setting. Documents are mailed or delivered to the central location where documents are prepared into controlled batches. Batches are scanned on high-speed scanners. Other tasks, such as indexing, data entry, and fixup, are performed on separate workstations so that each task is optimized and labor and other resources are used efficiently at the central location.
Centralized capture offers the following advantages:
Economy of scale
Standardized processes
Dedicated trained personnel who only do capture-related tasks
Easier to maintain image quality controls
Centralized capture has the following disadvantages:
Documents must be delivered to a central location.
Users understand less about the documents.
Corrections might require returning documents to the sender or interacting with remote users to correct problems.
Decentralized capture
With decentralized capture, remote offices or individuals scan, fax, and process documents, but they do not send the paper to a central location. Staff is not dedicated to performing capture activities. Capture might be done directly by the customer or by an external business partner.
Decentralized capture has the following advantages:
Documents do not need to be mailed or shipped to a central location.
Documents are stored into the repository more quickly.
Users can correct errors immediately.
Users understand the documents and can more accurately enter and correct data.
Work can be offloaded to a partner or customer by using self-service.
Decentralized capture has the following disadvantages:
Equipment is needed at each location.
It is harder to maintain standardized processes.
More users need to be trained.
Users do not perform capture functions all the time and, therefore, do not handle the tasks as efficiently.
Image quality varies, and image quality issues are more difficult to correct.
Authenticity is more difficult to verify.
In many instances, organizations use a blend of these models. The capture system needs to accommodate the constraints and demands of the business. Organizations have multiple applications that require one or both models.
We must also consider the network capacity to determine whether it is sufficient to handle the required load. In some locations with low bandwidth, networks might need additional bandwidth to accommodate higher volumes of imaging network traffic.
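A rough capacity check can be sketched as follows. This Python example is only an illustrative estimate. The function name, the peak factor, and the sample volumes (pages per day, average kilobytes per page, and upload window) are assumptions that you would replace with measurements from your own sites.

```python
def required_mbps(pages_per_day, avg_page_kb, upload_window_hours, peak_factor=1.0):
    """Estimate the sustained upload bandwidth, in megabits per second.

    pages_per_day       -- number of page images a site uploads per day
    avg_page_kb         -- average image size in kilobytes (1 KB = 1024 bytes)
    upload_window_hours -- hours per day during which uploads occur
    peak_factor         -- multiplier for bursts above the average rate
    """
    total_bits = pages_per_day * avg_page_kb * 1024 * 8
    seconds = upload_window_hours * 3600
    return total_bits / seconds / 1_000_000 * peak_factor

if __name__ == "__main__":
    # Hypothetical branch office: 5,000 pages per day at 50 KB each over 8 hours,
    # with bursts assumed to reach four times the average rate.
    print(f"{required_mbps(5000, 50, 8, peak_factor=4):.2f} Mbps")
```

Compare the result with the uplink capacity that is actually available at each remote location, because other applications share the same link.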
In either scenario, the background processing of documents is handled centrally using the Rulerunner service. Background processing includes image enhancement, OCR or ICR, format conversion from input or for export, and export. Because these are processor-intensive activities, they are handled most effectively in servers or high-end workstations. In this manner, client workstations do not need to have software installed to perform these functions.
Image enhancement
Images can be enhanced to improve recognition and readability and to reduce file size. Image enhancement is most important when using OCR and ICR or to improve the format of faxed images. Datacap includes image enhancement capabilities for this purpose. The current generation of document scanners often includes image enhancement capabilities in the hardware or scan driver that can be configured in the scanning user interface. Use the capabilities of your scanning hardware for image enhancement, and supplement those capabilities with the Datacap enhancement features.
New in version 9 is the ability to change the order of execution of the image enhancement tasks, add or remove tasks, run tasks more than one time, and see changes to documents in real time. Also, several new image enhancement capabilities have been added.
Page ID and document assembly
Page ID and document assembly are often referred to as classification. This process identifies the type of each page in a batch and creates documents from the stream of pages.
Page identification is the process of identifying the type of each page.
Document assembly is the process of determining where each document begins and ends.
Orchestrated classification
Datacap performs automated classification by using the orchestrated classification technique. Orchestrated classification uses page identification rules, document integrity rules, and document creation rules to automate the classification process. Classification can also be done manually in a scanning or verification user interface.
Orchestrated classification uses a set of rules that takes a stream of pages. Then, it optionally enhances images, identifies each type of page using one of many methods, creates documents from the pages, and validates the resulting structure. All of the classification processing can occur in a single module in one workflow step. If necessary, you can have multiple types of classification modules. Classification can use any of the processing actions in Datacap.
Page identification
Documents are created and separated based on the page types and a set of document integrity rules. Page types can be determined by one of the following methods:
Bar code
Pattern match using image anchors
Pattern match using text anchors
Match image-based fingerprint
Match text-based fingerprint
Match regular expressions to recognized text
Text analytics using IBM Content Classification
Document structure using rules
Consider using bar codes as the primary method of page identification for forms that you control. When you do not control the layout of the form, you can use the other page identification methods depending on the characteristics of the pages.
Careful planning should go into selecting classification methods and the order in which they are used. Most applications will use several classification methods. For example, an application might first look for separator sheets, then page-level bar codes, and finally text-based matching. Some methods are faster than others. For example, reading a bar code is faster than recognizing an entire page to look for a specific keyword. Therefore, try the faster classification methods first and fall back to slower methods only when the faster ones do not identify the page.
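The ordering idea can be sketched as a simple cascade. The following Python example is not the Datacap rules engine. It only illustrates trying cheap methods before expensive ones. The page representation (a dictionary), the function names, and the bar code and keyword mappings are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical lookup tables for the example.
BAR_CODE_TYPES = {"LA-001": "Loan Application"}
KEYWORD_TYPES = {"marketing postcard": "Marketing Postcard"}

# A classifier takes a page and returns a page type, or None if it cannot decide.
Classifier = Callable[[dict], Optional[str]]

def by_separator(page: dict) -> Optional[str]:
    """Cheapest check: the page was flagged as a separator sheet."""
    return "Separator" if page.get("is_separator") else None

def by_bar_code(page: dict) -> Optional[str]:
    """Fast check: map a decoded bar code value to a page type."""
    return BAR_CODE_TYPES.get(page.get("bar_code"))

def by_keyword(page: dict) -> Optional[str]:
    """Slowest check: search recognized full-page text for keywords."""
    text = page.get("text", "").lower()
    for keyword, page_type in KEYWORD_TYPES.items():
        if keyword in text:
            return page_type
    return None

def classify(page: dict, classifiers: Sequence[Classifier]) -> str:
    """Run classifiers in order and stop at the first one that identifies the page."""
    for classifier in classifiers:
        page_type = classifier(page)
        if page_type is not None:
            return page_type
    return "Other"

# Fastest methods first, as recommended in the text.
CASCADE = [by_separator, by_bar_code, by_keyword]

if __name__ == "__main__":
    print(classify({"bar_code": "LA-001"}, CASCADE))
    print(classify({"text": "Your Marketing Postcard offer"}, CASCADE))
```

In a real application, the expensive step is producing the recognized text, so the cascade should avoid running full-page recognition at all when an earlier method succeeds.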
Document assembly
In Datacap, the system determines document separation and document type by matching the document hierarchy to the identified pages. After pages are identified, Datacap uses the information in the document hierarchy to determine the correct document type. For example, a Loan Application page type is part of a Loan Application document, whereas a page type named Marketing Postcard is part of a Marketing Postcard document.
Each page has the following variables that define the structure of the parent document:
Maximum number of pages of this type for each document (0 means no maximum)
Minimum number of pages of this type for each document (0 means no minimum)
Order, which is the position of this page relative to other pages in the same document (0 means any position)
Datacap uses the information in the document hierarchy to assemble individual pages into multipage documents.
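The effect of these three variables can be sketched as a structure check. The following Python example is an illustration only and does not reproduce the Datacap implementation. The class and function names, the Attachment and Signature Page types, and the interpretation of order as "pages with a lower order value must not follow pages with a higher one" are assumptions.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PageRule:
    minimum: int = 0  # 0 means no minimum
    maximum: int = 0  # 0 means no maximum
    order: int = 0    # 0 means any position

def check_document(pages, rules):
    """Return a list of integrity problems; an empty list means the structure is valid."""
    problems = []
    counts = Counter(pages)
    for page_type in counts:
        if page_type not in rules:
            problems.append(f"{page_type}: not allowed in this document")
    for page_type, rule in rules.items():
        found = counts.get(page_type, 0)
        if rule.minimum and found < rule.minimum:
            problems.append(f"{page_type}: found {found}, minimum is {rule.minimum}")
        if rule.maximum and found > rule.maximum:
            problems.append(f"{page_type}: found {found}, maximum is {rule.maximum}")
    last_order = 0
    for page_type in pages:
        rule = rules.get(page_type)
        order = rule.order if rule else 0
        if order == 0:
            continue  # this page type can appear in any position
        if order < last_order:
            problems.append(f"{page_type}: out of order")
        last_order = max(last_order, order)
    return problems

# Hypothetical rules for a Loan Application document.
LOAN_RULES = {
    "Loan Application": PageRule(minimum=1, maximum=1, order=1),
    "Attachment": PageRule(),
    "Signature Page": PageRule(minimum=1, maximum=1, order=2),
}

if __name__ == "__main__":
    print(check_document(["Loan Application", "Attachment", "Signature Page"], LOAN_RULES))
    print(check_document(["Signature Page", "Loan Application"], LOAN_RULES))
```

Documents that fail a check like this are the ones that an operator would typically need to review in a verification step.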
A common approach to document separation is to place bar code sheets between documents or to print bar codes on the first page of each document. During scanning, several documents of varying lengths can be scanned in a batch, with bar-coded sheets separating each individual document. The system saves the documents as separate documents automatically. Bar codes can also be used to identify the type of document or pages. Position bar codes vertically. Keep in mind that scanners and fax machines can produce vertical lines on a page when dirt is on the scanning sensor. If the line is parallel to the bar code lines, it can make a bar code unreadable. If the line runs perpendicular across the bar code, the bar code remains readable.
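Separation by bar-coded sheets amounts to splitting a stream of pages at each separator. The following Python sketch shows the idea under simple assumptions: pages are arbitrary objects, the caller supplies a hypothetical is_separator test, and separator sheets are discarded rather than kept in the output.

```python
def split_on_separators(pages, is_separator):
    """Group a scanned batch into documents, starting a new document at each separator sheet."""
    documents = []
    current = []
    for page in pages:
        if is_separator(page):
            # Close the document in progress; ignore consecutive or leading separators.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents

if __name__ == "__main__":
    batch = ["SEP", "app-1", "app-2", "SEP", "postcard-1", "SEP"]
    print(split_on_separators(batch, lambda page: page == "SEP"))
    # [['app-1', 'app-2'], ['postcard-1']]
```

If bar codes are printed on the first page of each document instead, the same loop applies, except that the page carrying the bar code is kept as the first page of the new document.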
Recognition, fingerprinting, and locating data
Recognition is used to read data from images by using OCR, ICR, OMR, or bar code technology. Recognition is used in three primary use cases: to automate document classification, to automate indexing, or to reduce data entry typing.
One of the methods of classification involves performing recognition on the document or a portion of a document looking for keywords, patterns, form numbers, or other meaningful information. Use recognition for classification carefully to ensure performance. For example, when looking for a form number that is always in the lower-right corner, a well-designed application focuses recognition only on that area of the form. This enables the recognition process and the subsequent search for the form number to run much faster.
Indexing is the process of identifying the documents stored in the content repository. Documents are identified with properties that are stored in a content engine catalog. The process of entering these properties is called indexing. Users search for documents by using these properties. As a result, these properties must clearly identify each document with information, such as the name, social security number, and address. Usually, only a few properties are used to index a document.
Data entry is the process of typing data into a database or application system. Documents can contain dozens or hundreds of fields of data on many pages. In a manual process, users type data by looking at the paper document or at an image of the document in a window. Typing from a window is called type from image.
When we design the capture system in the scenario in this book, we use recognition features to read the data from images so that we can reduce the amount of manual typing.
Data recognition and extraction can be highly accurate when certain conditions are met. An understanding of the document characteristic is vital, because you need to choose the most