• No results found

OCR for historical printings a tutorial

N/A
N/A
Protected

Academic year: 2021

Share "OCR for historical printings a tutorial"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

OCR for historical printings

-a tutori-al

Kolloquium Korpuslinguistik

HU Berlin, 23.04.2014

(2)

Centrum für Informations- und Sprachverarbeitung

● OCR group (led by Prof. Dr. Klaus Schulz) has existed for over 10 years

● specific areas: interactive and automatic postcorrection, lexical resources for

improved OCR results

● partner in EU IMPACT project (Improving Access to Text – 2008-2011) ● open source postcorrection tool PoCoTo

● since 2013: recognition and postcorrection of Latin texts

(3)

Uwe Springmann, 23.04.2014 p. 3 1. Document acquisition 2. Preprocessing 3. OCR 4. Evaluation 5. Postcorrection 6. Training

Agenda

(4)

Deutsche Digitale Bibliothek (DDB): www.ddb.de

with links to page images of holding institutions

Bayerische Staatsbibliothek (BSB)

www.bsb-muenchen.de or www.digitale-sammlungen.de sometimes with OCR results for texts

Google books: books.google.com

often texts from BSB in lower resolution, but sometimes with OCR even when BSB does not offer it

Europeana: www.europeana.eu

(5)

Uwe Springmann, 23.04.2014 p. 5

book search@ddb: Book on podagra and herbs

(Adam von Bodenstein, 1577)

(6)
(7)

Uwe Springmann, 23.04.2014 p. 7

Recommendations

● If you want to manually correct OCR, go to Google (they offer the best freely

available OCR results)

● If you want your own OCR, go to DDB or BSB and download the pdf (we will be

taking this route in this tutorial)

(8)

2. Preprocessing

(page splitting)deskewingborder removalcropbinarizedewarpdespeckle. . .

(9)

Uwe _Springmann, 23.04.2014 p. 9

UNIX preprocessing commands

● pdftk WieSichMeniglich_Basel_1557-original.pdf cat 64-104 output kräuter.pdf ● mkdir pdf

● pdftk kräuter.pdf burst output pdf/%03d.pdf ● mkdir png

● cd pdf

● for f in *.pdf; do convert "$f" "${f/%pdf/png}"; done ● mv *.png ../png

(10)

Tutorial: ScanTailor (10 min)

Using your downloaded pdf and ScanTailor (www.scantailor.org), produce a set of

clean tif files of pages 65–105

002.png: 3.8 MB (colored image, background counts)

(11)
(12)

Tutorial: gImageReader (5 min)

Convert page 2 (file 002.tif) into text!

(13)

Uwe Springmann, 23.04.2014 p. 13

4. Evaluation

How good (in % correct characters = accuracy) is your text?

Compare against correct transcription (= ground truth)!

OCR evaluation toolkit (ISRI/UNLV) adapted to UTF-8 by Nick White:

https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools/source/ 207f421c198d12f793d3ba0215677a294bed1583

Engine Accuracy (%) Remark

Tesseract (deu-frak) 65.03 raw png, not cleaned tif! Tesseract (deu-frak) 76.12 s for ſ not counted as error OCRopus (fraktur) 78.94 s for ſ not counted as error

Google books 78.18 Google sponsors Tess. & Ocrop. ABBYY FR 11 (gothic) 83.23 “industry leader”

(14)

5. Postcorrection

Can we clean up messy OCR output?

a) Tesseract: interactive correction in gImageReader (side-by-side view + dictionary)

b)OCRopus: interactive correction in browser (text line synopsis of image + OCR)

ocropus-gtedit html book/????/??????.bin.png -o corr.html; firefox corr.html

c) Tesseract: interactive correction in PoCoTo (postcorrection tool of CIS, open source)

(15)

Uwe Springmann, 23.04.2014 p. 15

OCRopus line synopsis

● Fraktur model is not too well

adapated to book font

(Schwabacher), but it's a start

● correct OCR output in browser ● generated ground truth can be

used for later training

● with better training, OCRopus

will yield better result

● correct remaining errors in the

(16)

PostCorrectionTool PoCoTo

● locally installable Java package for postcorrection ● developed at CIS as part of the EU IMPACT project ● word synopsis: image + OCR with interactive correction ● error profiling: calculate statistical error model based on

a) historical spelling (not an error) b) proper OCR errors

and propose most probable correction candidate

● batch correction: rank errors according to frequency & error pattern

and enable quick correction decision in concordance view

● try it out:

http://www.digitisation.eu/tools/browse/ocr-post-correction-and-enrichment/post-correction-tool/ https://github.com/thorstenv/PoCoTo

(17)

Uwe Springmann, 23.04.2014 p. 17

PoCoTo (developed at CIS)

● error frequency ● word synopsis

(tesseract hocr output)

(18)

PoCoTo (developed at CIS)

(19)

Uwe Springmann, 23.04.2014 p. 19

6. Training

Can we beat ABBYY (83%) by training an open source engine on the relevant font(s)?

OCRopus training on images: ground truth data available! pp. 1-34: training set; pp. 35-40: test set

page

acc. (%)

35 97.19 36 98.58 37 98.75 38 97.37 39 97.62 40 97.29 cave:

p. 35 has 12 (out of 22) errors due to confusions in ground truth between

ů: U+016F and

u: uU+030A (combined character)̊

(20)

Training result – page 36

Kreüter

ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun⸗ dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn⸗ rath der můter/ treibt auſz die an

d geburt vñ weiber menſes. Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ d roo nvñ geſchiferte bein. Hiemitt beſchlies ich mein

rede diſer zeit von den zwelff zei chen kreütteren/ beg ren menckao lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn

(21)

Uwe Springmann, 23.04.2014 p. 21

Thank you for your attention!

Questions?

References

Related documents

Specification Function Usage Increment Planning Box structure Specification and Design Usage Modeling and Test Case Generation Statistical Testing Quality Certification

Small-Herd Facility Layout – The small-herd facility for alpaca ranchers with two to twelve alpacas should have the capacity to house one or two herd sires, one to five juvenile

Try Scribd FREE for 30 days to access over 125 million titles without ads or interruptions. Start

Ratio of 401(k) assets in 2040 to assets in 2000, for families, by lifetime earnings decile and by Social Security wealth decile--historical rate of equity return and historical

A two-dimensional pseudo-homogeneous model has been developed to investigate the influence of tube size on the thermal behavior and performance of packed fixed bed reactor for the

If it is not possible to declare Bolshevism, taken as a whole, a Jewish creation it is nevertheless true that the Jews have furnished several leaders to the Marximalist movement

• In 2011, OCR established a pilot audit program, developed an audit protocol and used the protocol to evaluate the HIPAA compliance efforts of 115 covered entities.. • OCR

Although the renormalization of the GI values makes the asset allocation consistent with the assumption that web search volume is a risk indicator, renormalized GI values produce