OCR for historical printings
-a tutori-al
Kolloquium Korpuslinguistik
HU Berlin, 23.04.2014
● Centrum für Informations- und Sprachverarbeitung
● OCR group (led by Prof. Dr. Klaus Schulz) has existed for over 10 years
● specific areas: interactive and automatic postcorrection, lexical resources for
improved OCR results
● partner in EU IMPACT project (Improving Access to Text – 2008-2011) ● open source postcorrection tool PoCoTo
● since 2013: recognition and postcorrection of Latin texts
Uwe Springmann, 23.04.2014 p. 3 1. Document acquisition 2. Preprocessing 3. OCR 4. Evaluation 5. Postcorrection 6. Training
Agenda
● Deutsche Digitale Bibliothek (DDB): www.ddb.de
with links to page images of holding institutions
● Bayerische Staatsbibliothek (BSB)
www.bsb-muenchen.de or www.digitale-sammlungen.de sometimes with OCR results for texts
● Google books: books.google.com
often texts from BSB in lower resolution, but sometimes with OCR even when BSB does not offer it
● Europeana: www.europeana.eu
Uwe Springmann, 23.04.2014 p. 5
book search@ddb: Book on podagra and herbs
(Adam von Bodenstein, 1577)
Uwe Springmann, 23.04.2014 p. 7
Recommendations
● If you want to manually correct OCR, go to Google (they offer the best freely
available OCR results)
● If you want your own OCR, go to DDB or BSB and download the pdf (we will be
taking this route in this tutorial)
2. Preprocessing
● (page splitting) ● deskewing ● border removal ● crop ● binarize ● dewarp ● despeckle ● . . .Uwe _Springmann, 23.04.2014 p. 9
UNIX preprocessing commands
● pdftk WieSichMeniglich_Basel_1557-original.pdf cat 64-104 output kräuter.pdf ● mkdir pdf
● pdftk kräuter.pdf burst output pdf/%03d.pdf ● mkdir png
● cd pdf
● for f in *.pdf; do convert "$f" "${f/%pdf/png}"; done ● mv *.png ../png
Tutorial: ScanTailor (10 min)
Using your downloaded pdf and ScanTailor (www.scantailor.org), produce a set ofclean tif files of pages 65–105
002.png: 3.8 MB (colored image, background counts)
Tutorial: gImageReader (5 min)
Convert page 2 (file 002.tif) into text!
Uwe Springmann, 23.04.2014 p. 13
4. Evaluation
How good (in % correct characters = accuracy) is your text?
Compare against correct transcription (= ground truth)!
OCR evaluation toolkit (ISRI/UNLV) adapted to UTF-8 by Nick White:
https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools/source/ 207f421c198d12f793d3ba0215677a294bed1583
Engine Accuracy (%) Remark
Tesseract (deu-frak) 65.03 raw png, not cleaned tif! Tesseract (deu-frak) 76.12 s for ſ not counted as error OCRopus (fraktur) 78.94 s for ſ not counted as error
Google books 78.18 Google sponsors Tess. & Ocrop. ABBYY FR 11 (gothic) 83.23 “industry leader”
5. Postcorrection
Can we clean up messy OCR output?
a) Tesseract: interactive correction in gImageReader (side-by-side view + dictionary)
b)OCRopus: interactive correction in browser (text line synopsis of image + OCR)
ocropus-gtedit html book/????/??????.bin.png -o corr.html; firefox corr.html
c) Tesseract: interactive correction in PoCoTo (postcorrection tool of CIS, open source)
Uwe Springmann, 23.04.2014 p. 15
OCRopus line synopsis
● Fraktur model is not too well
adapated to book font
(Schwabacher), but it's a start
● correct OCR output in browser ● generated ground truth can be
used for later training
● with better training, OCRopus
will yield better result
● correct remaining errors in the
PostCorrectionTool PoCoTo
● locally installable Java package for postcorrection ● developed at CIS as part of the EU IMPACT project ● word synopsis: image + OCR with interactive correction ● error profiling: calculate statistical error model based ona) historical spelling (not an error) b) proper OCR errors
and propose most probable correction candidate
● batch correction: rank errors according to frequency & error pattern
and enable quick correction decision in concordance view
● try it out:
http://www.digitisation.eu/tools/browse/ocr-post-correction-and-enrichment/post-correction-tool/ https://github.com/thorstenv/PoCoTo
Uwe Springmann, 23.04.2014 p. 17
PoCoTo (developed at CIS)
● error frequency ● word synopsis
(tesseract hocr output)
PoCoTo (developed at CIS)
Uwe Springmann, 23.04.2014 p. 19
6. Training
Can we beat ABBYY (83%) by training an open source engine on the relevant font(s)?
OCRopus training on images: ground truth data available! pp. 1-34: training set; pp. 35-40: test set
page
acc. (%)
35 97.19 36 98.58 37 98.75 38 97.37 39 97.62 40 97.29 cave:p. 35 has 12 (out of 22) errors due to confusions in ground truth between
ů: U+016F and
u: uU+030A (combined character)̊
Training result – page 36
Kreüter
ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun⸗ dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn⸗ rath der můter/ treibt auſz die an
d geburt vñ weiber menſes. Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ d roo nvñ geſchiferte bein. Hiemitt beſchlies ich mein
rede diſer zeit von den zwelff zei⸗ chen kreütteren/ beg ren menckao ⸗ lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn⸗
Uwe Springmann, 23.04.2014 p. 21