OCR for historical printings a tutorial

(1)

OCR for historical printings

-a tutori-al

Kolloquium Korpuslinguistik

HU Berlin, 23.04.2014

(2)

● Centrum für Informations- und Sprachverarbeitung

● OCR group (led by Prof. Dr. Klaus Schulz) has existed for over 10 years

● specific areas: interactive and automatic postcorrection, lexical resources for

improved OCR results

● partner in EU IMPACT project (Improving Access to Text – 2008-2011) ● open source postcorrection tool PoCoTo

● since 2013: recognition and postcorrection of Latin texts

(3)

Uwe Springmann, 23.04.2014 p. 3 1. Document acquisition 2. Preprocessing 3. OCR 4. Evaluation 5. Postcorrection 6. Training

Agenda

(4)

● Deutsche Digitale Bibliothek (DDB): www.ddb.de

with links to page images of holding institutions

● Bayerische Staatsbibliothek (BSB)

www.bsb-muenchen.de or www.digitale-sammlungen.de sometimes with OCR results for texts

● Google books: books.google.com

often texts from BSB in lower resolution, but sometimes with OCR even when BSB does not offer it

● Europeana: www.europeana.eu

(5)

Uwe Springmann, 23.04.2014 p. 5

book search@ddb: Book on podagra and herbs

(Adam von Bodenstein, 1577)

(6)

(7)

Recommendations

● If you want to manually correct OCR, go to Google (they offer the best freely

available OCR results)

● If you want your own OCR, go to DDB or BSB and download the pdf (we will be

taking this route in this tutorial)

(8)

2. Preprocessing

● (page splitting) ● deskewing ● border removal ● crop ● binarize ● dewarp ● despeckle ● . . .

(9)

Uwe _Springmann, 23.04.2014 p. 9

UNIX preprocessing commands

● pdftk WieSichMeniglich_Basel_1557-original.pdf cat 64-104 output kräuter.pdf ● mkdir pdf

● pdftk kräuter.pdf burst output pdf/%03d.pdf ● mkdir png

● cd pdf

● for f in *.pdf; do convert "$f" "${f/%pdf/png}"; done ● mv *.png ../png

(10)

Tutorial: ScanTailor (10 min)

Using your downloaded pdf and ScanTailor (www.scantailor.org), produce a set of

clean tif files of pages 65–105

002.png: 3.8 MB (colored image, background counts)

(11)

(12)

Tutorial: gImageReader (5 min)

Convert page 2 (file 002.tif) into text!

(13)

4. Evaluation

How good (in % correct characters = accuracy) is your text?

Compare against correct transcription (= ground truth)!

OCR evaluation toolkit (ISRI/UNLV) adapted to UTF-8 by Nick White:

https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools/source/ 207f421c198d12f793d3ba0215677a294bed1583

Engine Accuracy (%) Remark

Tesseract (deu-frak) 65.03 raw png, not cleaned tif! Tesseract (deu-frak) 76.12 s for ſ not counted as error OCRopus (fraktur) 78.94 s for ſ not counted as error

Google books 78.18 Google sponsors Tess. & Ocrop. ABBYY FR 11 (gothic) 83.23 “industry leader”

(14)

5. Postcorrection

Can we clean up messy OCR output?

a) Tesseract: interactive correction in gImageReader (side-by-side view + dictionary)

b)OCRopus: interactive correction in browser (text line synopsis of image + OCR)

ocropus-gtedit html book/????/??????.bin.png -o corr.html; firefox corr.html

c) Tesseract: interactive correction in PoCoTo (postcorrection tool of CIS, open source)

(15)

OCRopus line synopsis

● Fraktur model is not too well

adapated to book font

(Schwabacher), but it's a start

● correct OCR output in browser ● generated ground truth can be

used for later training

● with better training, OCRopus

will yield better result

● correct remaining errors in the

(16)

PostCorrectionTool PoCoTo

● locally installable Java package for postcorrection ● developed at CIS as part of the EU IMPACT project ● word synopsis: image + OCR with interactive correction ● error profiling: calculate statistical error model based on

a) historical spelling (not an error) b) proper OCR errors

and propose most probable correction candidate

● batch correction: rank errors according to frequency & error pattern

and enable quick correction decision in concordance view

● try it out:

http://www.digitisation.eu/tools/browse/ocr-post-correction-and-enrichment/post-correction-tool/ https://github.com/thorstenv/PoCoTo

(17)

PoCoTo (developed at CIS)

● error frequency ● word synopsis

(tesseract hocr output)

(18)

PoCoTo (developed at CIS)

(19)

6. Training

Can we beat ABBYY (83%) by training an open source engine on the relevant font(s)?

OCRopus training on images: ground truth data available! pp. 1-34: training set; pp. 35-40: test set

page

acc. (%)

35 97.19 36 98.58 37 98.75 38 97.37 39 97.62 40 97.29 cave:

p. 35 has 12 (out of 22) errors due to confusions in ground truth between

ů: U+016F and

u: uU+030A (combined character)̊

(20)

Training result – page 36

Kreüter

ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun⸗ dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn⸗ rath der můter/ treibt auſz die an

d geburt vñ weiber menſes. Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ d roo nvñ geſchiferte bein. Hiemitt beſchlies ich mein

rede diſer zeit von den zwelff zei_⸗ chen kreütteren/ beg ren menckao ⸗ lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn_⸗

(21)

OCR for historical printings a tutorial