TRANSKRIBUS. Research Infrastructure for the Transcription and Recognition of Historical Documents

Academic year: 2021


(1)

TRANSKRIBUS. Research Infrastructure for the Transcription and Recognition of Historical Documents

Günter Mühlberger, Sebastian Colutto, Philip Kahle

Digitisation and Digital Preservation group (University of Innsbruck)

(2)

TRANSKRIBUS enables collaboration among humanities scholars, computer scientists, archives and volunteers, with the ultimate goal of revolutionising the recognition and transcription of historical handwritten documents, and access to them.

(3)

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

(4)

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

PUBLIC USERS

TRANSCRIPTION & RECOGNITION PLATFORM

(5)

TRP

DOCUMENTS: provide images; receive standard formats (METS/ALTO/TEI/PDF-A)

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

(6)

TRP

DOCUMENTS: provide images; receive standard formats (METS/ALTO/TEI/PDF-A)

EXPERT & CROWD CLIENTS: transcribe text; receive TEI, PDF

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

(7)

TRP

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

PUBLIC USERS

DOCUMENTS: provide images; receive standard formats (METS/ALTO/TEI/PDF-A)

EXPERT & CROWD CLIENTS: transcribe text; receive TEI, PDF

HTR, DIA, KWS, NLP, HPC …: invent algorithms; work with reference data

(8)

TRP

DOCUMENTS: provide images; receive standard formats (METS/ALTO/TEI/PDF-A)

EXPERT & CROWD CLIENTS: transcribe text; receive TEI, PDF

HTR, DIA, KWS, NLP, HPC …: improve algorithms; work with reference data

WEBSITE: search and retrieve handwritten documents; annotate, evaluate, contribute

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

(9)

HUMANITIES SCHOLARS

ARCHIVES & LIBRARIES

COMPUTER SCIENTISTS

PUBLIC USERS

TRANSKRIBUS TRANSCRIPTION & RECOGNITION PLATFORM

Provide images; receive standard formats (METS/ALTO/TEI/PDF-A)

Transcribe text; receive TEI, PDF

Improve algorithms; work with reference data

Search and retrieve handwritten documents; annotate, evaluate, contribute

(10)

Cycle of growth

HTR TECHNOLOGY

NEW SERVICES

MORE USERS

MORE TRANSCRIPTS

ACCELERATED DIGITISATION

16/06/2015

(11)

TRP

Platform – Software as a Service

Images, METS, ALTO, PDF, TEI, RTF, PAGE

Integrated Tools

Website, Interface, Search, Crowd-Sourcing GUI, External Services, Expert GUI, Document Management, User Management

(12)

Website – http://transkribus.eu/

(13)

TRANSKRIBUS Expert GUI

(14)

TSX – Crowd-sourcing platform

(15)

Archives/collection holders are enabled to

• Manage the transcription process of large and heterogeneous document collections in a standardized and effective way

• Expose their digitised documents to their own employees, humanities scholars, volunteers and the crowd

• Involve users according to their background

• Support humanities scholars in their research

• Outsource transcription (or the training of an HTR model) to service providers (e.g. off-shore)

• Make large amounts of handwritten documents searchable (if a trained HTR model is available)

• Import transcribed documents in machine readable formats (METS/ALTO)

• Provide standardized requirements to researchers and technology providers dealing with “issues of interest” (e.g. manage tendering)

(16)

Humanities scholars are enabled to (1)

• Use TRANSKRIBUS for free (registration)

• Upload as many single documents as they want

• Transcribe their documents semi-automatically in a way that image and text are linked to each other (improved transcription quality!)

• Enrich their documents with Named Entities (person and geo names, dates) and personal tags

• Normalize their transcription in a transparent way (abbreviations, character sets, editorial declaration)

• Export their document in various formats

– TEI: machine- and human-readable transcribed text (“scientific style”)

– PDF-Text: transcribed text with some formatting (“book style”)

– PDF-Image: transcribed text in the background, image in the foreground (“Scanned PDF style”)

– RTF: transcribed text for human readability (“working style”)

– METS/ALTO: full machine readable package (“digital library style”)

– METS/PAGE: full machine readable package (“computer science style”)
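To illustrate what "image and text are linked to each other" can look like in a machine readable export, here is a minimal Python sketch that reads a PAGE-style XML fragment and pairs each text line with its polygon on the page image. The sample fragment, the file name `page_0001.jpg`, the line texts and the helper `read_lines` are hypothetical and hand-written for illustration; they are not actual TRANSKRIBUS output, and real exports carry more structure than is shown here.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE content schema (2013-07-15 version).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical, simplified sample; not produced by TRANSKRIBUS.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page_0001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>Anno Domini 1523</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="100,280 950,280 950,340 100,340"/>
        <TextEquiv><Unicode>in der Stadt Innsbruck</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(xml_text):
    """Return (line id, polygon points, transcribed text) for each text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        # "x1,y1 x2,y2 ..." becomes [(x1, y1), (x2, y2), ...]
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points").split()
        ]
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        lines.append((line.get("id"), points, text))
    return lines


if __name__ == "__main__":
    for line_id, points, text in read_lines(SAMPLE):
        print(line_id, points[0], text)
```

Because every transcribed line keeps its coordinates, a viewer or search tool can highlight the matching area of the scanned page for any piece of text.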

(17)

Humanities scholars are enabled to (2)

• Manage their documents in a private collection (no one else has access!)

• Invite other users to collaborate (e.g. students, colleagues) and to work on the same document

• See different versions of their documents

• Expose their documents also in a simplified web-based transcription GUI (TSX)

• Transcribe documents with the support of HTR if a model is available (requires offline training beforehand)

• Exploit a language resource database in the background (e.g. words, variants, frequencies, etc.)

• Search within a handwritten text collection

• Benefit from other TRANSKRIBUS users in an indirect way. The following resources will be available to every user of the platform:

– Trained HTR models

– Normalized editorial declarations

– Normalized Named Entities

– Language resources

(18)

Sustainability model

Free usage

– Single documents (for anyone who wants to work with the platform)

– E.g. humanities scholars, volunteers, genealogists, archivists,…

Service Level Agreement

– Document collections

– E.g. transcription projects (grants), archives, libraries (OCR collections)

Public-Public-Partnership

– Subsidiary model, not only for TRANSKRIBUS itself, but also for participants (e.g. archives, collection holders)

– E.g. DARIAH, direct support via e-infrastructure funds, research agencies (which may have an interest in standardized formats and workflows, etc.)

– Research agencies

Vision

– Dozens of archives expose their collections via TRANSKRIBUS, hundreds of humanities scholars are actually working with it and thousands of volunteers are involved…

(19)

tranScriptorium

READ

– Recognition and Enrichment of Archival Documents

– E-Infrastructure programme: Virtual Research Environments

– 13 partners, coordinated by UIBK

– 3.5 years (very likely 1.1.2016 until 30.6.2019)

– EUR 8.2 million funding

– 10 associated partners

Highlights

– Release of TRANSKRIBUS e2015 (=all main features included)

– Implementation of HTR technology in four Large Scale Demonstrators (Zurich, Finland, Passau, Venice Time Machine)

– 10 associated partners (archives, collection holders) joining the project in Y1

– ScanApp (mobile phone as document scanner)

– E-Learning application (history students!)

– European Hands (marketing campaign)

– Large Data Sets for Research Competitions (put HTR on the research agenda)

(20)

Thank you for your attention!
