(1)

A Cost Efficient approach to correct OCR errors in Large Document Collections


Deepayan Das, Jerin Philip, Minesh Mathew and C.V. Jawahar

Center for Visual Information Technology, IIIT Hyderabad

(2)

Digital Library


A digital repository for books, accessible to people around the world

(3)

Digital Libraries

Popular digital libraries include:

Project                  #Books
Google Books Project     25 million (as of 2015)
Project Gutenberg        60,000
Million Books Project    1.5 million

(4)

Digital Libraries

● Easy access to millions of books and articles.

● Lower maintenance and support costs.

● Supports search and indexing.


(5)

Digital Libraries

Workflow: Scanning centers -> OCR -> Annotator proofreads the text -> Access to millions of people

(6)

OCR output is not always 100% accurate

(7)

● OCR is sensitive to quality of document images.

● Degradations can result in words being misclassified.

OCR prediction    Ground truth
Lord Cauning      Lord Canning
Cawnporo          Cawnpore
Dolhi             Delhi
rnorning          Morning

(8)

Information Retrieval on OCR text

OCR errors lead to differences in the ranking of the retrieved documents.

(9)

Post-processing for Large Document Collection

Project Gutenberg: G. B. Newby and C. Franks. “Distributed proofreading.” JCDL, 2003.

Google Books: Von Ahn et al. “reCAPTCHA: Human-based character recognition.” Science, 2008.

(10)

Motivation

● OCR makes consistent errors throughout a document collection.

Word images and their corresponding predictions in a collection:

Juiiet, Juiiet, Juiiet, Juiiet

Qucen, Qucen, Qucen, Qucen

Camiing, Canning, Caniiing, Caiiing

(11)

Motivation

● Books and collections have a finite vocabulary that repeats throughout the text.

A small subset of distinct words can cover more than 50% of the total words in a collection.
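The coverage claim can be checked on any tokenized text. The following is a minimal sketch, not the authors' code: the function name `vocabulary_coverage` and the toy sentence are assumptions made for illustration. It counts how many of the most frequent distinct words are needed to reach a target share of all word occurrences.

```python
from collections import Counter


def vocabulary_coverage(words, target=0.5):
    """Return (k, vocabulary_size), where k is the number of most frequent
    distinct words needed to cover at least `target` of all occurrences."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for k, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return k, len(counts)
    return len(counts), len(counts)


# Toy text (made up for illustration, not taken from the datasets in the talk).
text = "the queen met the lord and the queen thanked the lord of delhi".split()
k, vocab_size = vocabulary_coverage(text)
print(f"{k} of {vocab_size} distinct words cover at least 50% of the text")
```

On a real collection, `words` would be the OCR word predictions or ground-truth tokens of all pages.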

(12)

Motivation

● Grouping and correcting high-frequency words can lead to significant gains in word accuracy.

(13)

t-SNE Image Embedding


Maaten, Laurens van der and Hinton, Geoffrey. “Visualizing data using t-SNE”. JMLR, 2008

(14)


Reverse Annotation

Sankar et al. “Probabilistic Reverse Annotation For Large Scale Image Annotation.” CVPR, 2007.

Fusing Word Clusters

Rasagna et al. “Robust Recognition of Documents by Fusing Results of Word Clusters.” ICDAR, 2009.

Abdulkader and Casey. “Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment.” ICDAR, 2009.

(15)

Automatic Error Correction


(16)

Automatic Error Correction


Cluster representative, propagated to all cluster elements

Character Majority Voting

● Word Images are clustered on a feature space.

● A cluster representative is chosen for each cluster.

Rasagna et al. use character majority voting, where the most frequent character is taken at each time step.
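Character majority voting can be sketched in a few lines. This is a simplified illustration, not the implementation of Rasagna et al.: it assumes that only predictions of the most common length take part in the vote, so that character positions line up without any sequence alignment. The function name `character_majority_vote` is hypothetical.

```python
from collections import Counter


def character_majority_vote(predictions):
    """Pick a cluster label by voting on each character position.

    Simplification (assumption): only predictions whose length equals the
    most common length vote, so positions line up without alignment.
    """
    if not predictions:
        return ""
    modal_length = Counter(len(p) for p in predictions).most_common(1)[0][0]
    voters = [p for p in predictions if len(p) == modal_length]
    # zip(*voters) yields, for each position, the characters of all voters.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*voters)
    )


cluster = ["money", "money", "aoney", "money", "morey"]
label = character_majority_vote(cluster)
corrected = [label] * len(cluster)  # propagate the voted label to every element
```

Predictions with a different length (for example a truncated word) would need an alignment step before they could vote, which this sketch leaves out.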

(17)

Automatic Error Correction

● Voted label is propagated to all the cluster elements.

Figure: a cluster of word images with the label “thousand”. Two predictions, “housan” and “thusiasn”, are incorrect and can be corrected by propagating the voted label.

(18)

Automatic Error Correction

Figure: nearest neighbours of the image embedding for the word “money”. The predictions are “money” (seven times), “aoney” and “more”. The error word (highlighted in red) can be corrected using character majority voting.

(19)

Cluster Impurities

● Not every cluster is completely homogeneous.

● Character majority voting can lead to error propagation.

Figure: the “money” cluster with one element marked as an impurity. The predictions include “money”, “money,”, “aoney” and “more”.

(20)

Can a better clustering algorithm help?


(21)

MST on word predictions

● We further partition the clusters using a minimum spanning tree (MST).

● The nodes are the predictions.

● The edit distances between the predictions form the edge weights.

(22)

Minimum spanning tree

Fully connected graph -> Minimum spanning tree -> Forest of individual trees
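The partitioning step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the slides do not say how the tree is cut into a forest, so the threshold `max_edge` and the function names are hypothetical. Running Kruskal's algorithm and stopping at the threshold gives the same sub-clusters as building the full MST and then deleting every edge heavier than `max_edge`.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def mst_partition(predictions, max_edge=1):
    """Split a cluster into sub-clusters: build the MST over the predictions
    (Kruskal) but keep only edges with weight <= max_edge (assumed cut rule)."""
    n = len(predictions)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = sorted(
        (edit_distance(predictions[i], predictions[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    for weight, i, j in edges:
        if weight > max_edge:
            break  # all remaining edges are heavier and would be cut
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    groups = {}
    for i, p in enumerate(predictions):
        groups.setdefault(find(i), []).append(p)
    return sorted(groups.values(), key=len, reverse=True)


cluster = ["money!", "money", "money,", "money,", "money", "money", "aoney", "more,"]
parts = mst_partition(cluster, max_edge=1)
```

With `max_edge=1`, the prediction “more,” ends up in its own tree because it is at least two edits away from every other prediction.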

(23)

MST on word predictions

Figure: cluster partition using MST. Predictions in the cluster: “money!”, “money”, “money,”, “money,”, “money”, “money”, “aoney”, “more,”.

(24)

What happens when all the predictions in a cluster are wrong?

(25)

Manual correction

Figure: a cluster in which the OCR predictions are erroneous.

(26)

Automatic error correction by label propagation will not achieve perfect word accuracy when clusters are not homogeneous.

(27)

Human vs Machine

Humans: accurate but slow. High cost.

Machines: fast but inaccurate. Error propagation. A human is needed to rectify the mistakes made by the machine, which leads to high cost.

(28)

Human Machine Collaboration


(29)

Proposed Method

A human should verify each cluster by:

1. Picking the cluster representative.

2. Choosing the cluster elements to which the label should be propagated.
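The two verification steps above can be sketched as a small helper. This is an illustration only: the function name `propagate_verified_label` and its parameters are hypothetical and do not describe the authors' annotation tool.

```python
def propagate_verified_label(predictions, label, selected):
    """Return a corrected copy of a cluster's predictions.

    `label` is the representative picked by the human, and `selected` holds
    the indices of the elements the human approved for propagation.
    Elements that were not selected keep their original prediction.
    """
    selected = set(selected)
    return [label if i in selected else p for i, p in enumerate(predictions)]


cluster = ["money", "money", "aoney", "more"]
# The reviewer picks "money" and leaves out the impurity "more".
corrected = propagate_verified_label(cluster, "money", selected=[0, 1, 2])
```

Because the human decides both the label and the set of recipients, an impurity such as “more” is never overwritten by the cluster label.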

(30)

Pipeline

Pipeline components: CRNN-OCR [1] (word predictions), HWNet [2] (image features), error detection, clustering, error clusters, manual correction.

[1] Shi et al. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.” PAMI, 2017.

[2] Krishnan et al. “HWNet v2: An Efficient Word Image Representation for Handwritten Documents.” IJDAR, 2018.

(31)

Implementation Details


(32)

Edit Actions

● Full typing (no dictionary involved)

● Type + select from dropdown (static dictionary)

● Type + select from dropdown (growing dictionary)

(33)

Datasets

● Fully Annotated

English: 19 books, ~2.5k pages

Hindi: 32 books, ~5k pages

Figure: sample word images from the fully annotated dataset.

(34)

Evaluation Protocol

● For Fully Annotated

○ We measure the number of seconds a human spends correcting a book.

○ We refer to this as the cost of correction (C).

○ We measure the cost of each method relative to full typing.

(35)

Cost of correction

Cost: $C = w_1 c_t + w_2 c_d + w_3 c_v$

$c_t$ = typing cost

$c_d$ = selection cost

$c_v$ = verification cost

$w_1$ = number of error words that need typing

$w_2$ = number of error words whose correct alternative is present in the dictionary

$w_3$ = number of words that are correct but wrongly flagged as errors
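The cost formula is straightforward to evaluate. The sketch below is illustrative: the per-word costs and word counts are made-up numbers, and the full-typing baseline (every error word is typed, flagged correct words are still verified) is an assumption about how the relative cost is normalized, not something stated on the slide.

```python
def correction_cost(w1, w2, w3, ct, cd, cv):
    """C = w1*ct + w2*cd + w3*cv, in seconds."""
    return w1 * ct + w2 * cd + w3 * cv


def relative_cost(w1, w2, w3, ct, cd, cv):
    """Cost relative to full typing.

    Assumed baseline: with no dictionary, all w1 + w2 error words are typed,
    and the w3 wrongly flagged words still incur the verification cost.
    """
    baseline = correction_cost(w1 + w2, 0, w3, ct, cd, cv)
    if baseline == 0:
        raise ValueError("no words to correct")
    return correction_cost(w1, w2, w3, ct, cd, cv) / baseline


# Hypothetical numbers: seconds per action and word counts for one book.
cost = correction_cost(w1=100, w2=300, w3=50, ct=5, cd=2, cv=1)
ratio = relative_cost(w1=100, w2=300, w3=50, ct=5, cd=2, cv=1)
print(f"cost = {cost} s, relative to full typing = {ratio:.2f}")
```

A ratio below 1 means the dictionary-assisted method is cheaper than typing every correction in full.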

(36)

Results


(37)

Relative cost

Figure: relative cost of correction with respect to full typing when no clustering is involved.

(38)

Relative cost

Figure: cost of correction across different clustering techniques for automatic label selection and propagation, compared with the cost without clustering.

(39)

Qualitative Results

Figure: qualitative results of k-means + MST clustering. Images relevant to the cluster are marked correct, while non-relevant ones (false positives) are crossed out.

(40)

Results: Fully Annotated Data (English)

Comparison of automatic and manual correction

● No false error propagation.

● Reduces cost of correction.

(41)

Scope for automatic error correction


(42)

Clustering on Large Scale Dataset

● We cluster on images from 100 unannotated books.

● Testing is done on 200 annotated pages.

● We use character majority voting (CMV) to assign labels to erroneous predictions.

(43)

Results

● Comparison of the performance of CRNN-OCR and Tesseract OCR.

Automatic error correction rectifies errors more accurately as the size of the collection increases.

Gain in word accuracy: CRNN-OCR >> Tesseract OCR

(44)

Clustering on Large Scale Dataset

We observe that as the size of the collection increases, CMV becomes better at picking the correct cluster representative, which is subsequently propagated to all the cluster elements.

(45)

Conclusion

● We proposed a cost-efficient batch correction scheme for reducing OCR errors.

● We also demonstrated how our approach can be scaled effectively to larger collections.

(46)

Future Work

● Active learning techniques to find clusters/subclusters that need post-processing.

● Adapting the recognizer to a collection, not just the post-processing module.

(47)

Thank You
