(1)

Australian Newspapers Digitisation Program

Development of the Newspapers Content Management System

Rose Holley, ANDP Manager

ANPlan/ANDP Workshop, 28 November 2008

(2)

Requirements

• Manage, store and organise millions of digital newspaper pages behind the scenes.
• Manage the entire digitisation workflow from scanning to public delivery.

(3)

How?

• Current NLA Digital Content Management System cannot cope with volume of digital newspapers or complex structure of newspapers
• No ‘off the shelf’ product available that meets requirements
• Need the system now (March 2007)

(4)

Solution

• NLA team to develop a software solution
• Ensure the system uses open source software
• System to be standalone and not bolted into other systems
• Possibility of sharing system in future/providing

(5)

Software Development

• Agile method of development used
• Modules designed in stages as required
• Stage 1 – Receipt and checking of scanned images
• Stage 2 – Quality Assurance Modules
• Stage 3 – Sending/receiving items from OCR
• Stage 4 – System Administration and Statistics
• Stage 5 – Interface Design and Usability of System

(6)

Progress

• Software development March 2007 – June 2008
• First module in use May 2007
• CMS in use for 18 months
• CMS in final stages of completion (Jan – June 2009)
• Further development required to enable acceptance

(7)
(8)

Australian Newspapers CMS

• Screenshots of the system and an explanation of the workflows follow.

(9)

Workflow Summary

• Preparing for Digitisation
• Creation of digital images
• Adding metadata and Quality Assurance
• Optical Character Recognition
• Quality Assurance
• Statistics and Admin

(10)

Preparing for Digitisation

• Identify title to be digitised
• Source master microfilm from owner
• Send master microfilm to scanning contractors

(11)

CMS - Add Title

(12)

Microfilm converted to digital images

(13)

Image Reception

• Images received from scanning contractor on LTO2 Tape
• Tapes added to tape robot and extracted
• Reels automatically added to Content Management System
• Reel details are checked
• Images ingested into Content Management System
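
The reception steps above can be read as a fixed sequence of states that each reel passes through. The following is a minimal sketch of that reading, not the CMS's actual data model: the `ReelStatus` names and the `advance` helper are hypothetical and simply mirror the five bullets.

```python
from enum import Enum


class ReelStatus(Enum):
    """Hypothetical reel states, one per step listed on the slide."""
    RECEIVED = 1    # tape arrived from the scanning contractor
    EXTRACTED = 2   # tape loaded into the tape robot and extracted
    REGISTERED = 3  # reel automatically added to the CMS
    CHECKED = 4     # reel details checked
    INGESTED = 5    # images ingested into the CMS


def advance(status: ReelStatus) -> ReelStatus:
    """Move a reel to the next step; steps cannot be skipped."""
    if status is ReelStatus.INGESTED:
        raise ValueError("reel is already ingested")
    return ReelStatus(status.value + 1)
```

Modelling the steps as an ordered enumeration makes it impossible to ingest a reel whose details have not been checked, which matches the order given on the slide.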

(14)

CMS - Check Reel Details

(15)

CMS - Ingest Reels

(16)

CMS - Tasks 1 and 2

• Task 1 – Add metadata (dates and page numbers)
• Supervisor reviews marked pages
• Task 2 – Define batches

(17)

Identify title to be worked on

(18)

Identify reel

(19)

CMS - Adding Metadata

• Date and Page Sequence number added

(20)

Supervisor Review

• Supervisor reviews pages marked for attention

(21)

CMS - Define Batches

• Batches defined by date
• Each batch contains 2,000-3,000 images
• Batches are automatically assigned a number
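
The batching rules above (grouped by date, capped at roughly 3,000 images, numbered automatically) can be sketched as follows. This is an illustration under stated assumptions, not the CMS's implementation: the function name `define_batches`, the `(issue_date, image_id)` input shape, and the rule that one issue date is never split across batches are all hypothetical.

```python
from collections import defaultdict
from datetime import date

MAX_BATCH_IMAGES = 3000  # upper bound quoted on the slide


def define_batches(pages, max_images=MAX_BATCH_IMAGES):
    """Group (issue_date, image_id) pairs into date-ordered, numbered batches.

    Assumption: all images for one issue date stay in the same batch.
    """
    by_date = defaultdict(list)
    for issue_date, image_id in pages:
        by_date[issue_date].append(image_id)

    batches = []
    current = []
    for issue_date in sorted(by_date):
        images = by_date[issue_date]
        # Close the current batch if this date would push it over the cap.
        if current and len(current) + len(images) > max_images:
            batches.append(current)
            current = []
        current.extend(images)
    if current:
        batches.append(current)

    # Batch numbers are assigned automatically, starting at 1.
    return {number: images for number, images in enumerate(batches, start=1)}
```

A single date with more images than the cap still ends up in one batch here; a real system would need a policy for that case, which the slide does not describe.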

(22)

CMS - Resolve Duplicates

• Duplicate pages are compared and the best copy is selected
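
The slide does not say how duplicates are detected. One plausible approach, sketched here as an assumption, is to group images by the date and page sequence number added in Task 1 and flag any group with more than one image. The function name `find_duplicates` and the tuple layout are hypothetical; choosing the best copy is left to the operator, as on the slide.

```python
from collections import defaultdict


def find_duplicates(pages):
    """Return groups of image ids that share an issue date and page number.

    `pages` is an iterable of (image_id, issue_date, page_number) tuples.
    """
    groups = defaultdict(list)
    for image_id, issue_date, page_number in pages:
        groups[(issue_date, page_number)].append(image_id)
    # Only groups with more than one image need an operator's decision.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```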

(23)

Missing Pages

• Missing page targets are generated
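
How missing pages are detected is not stated on the slide. A minimal sketch, assuming page sequence numbers start at 1 within an issue and that a gap in the sequence marks a missing page, could look like this. The function name `missing_pages` is hypothetical, and the form of the generated "target" is not shown in the source.

```python
def missing_pages(page_numbers):
    """Return page numbers absent between 1 and the highest page seen."""
    present = set(page_numbers)
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]
```

Pages missing from the end of an issue cannot be found this way, because the highest page seen is taken as the issue length.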

(24)

Optical Character Recognition (OCR)

• Complete batches are added to a tape
• Tapes are generated and written
• Tapes sent to OCR contractor
• Contractor completes OCR processes

(25)

CMS - Tapes Created

• Completed batches added to a tape

(26)

Optical Character Recognition (OCR) of pages and article zoning

(27)

OCR Data Reception (Automated process)

• OCR contractor advises NLA server that a batch has been completed
• NLA server downloads the batch
• Batch is ingested into Content Management System
• Checks are performed on data validity
• QA Derivatives are generated
• Articles may now be searched, but are not yet

(28)

CMS - Batch information

(29)

Quality Assurance (QA)

• A random sample of Issues and Articles is checked
• Volume and Issue number are checked for accuracy
• Sample articles are checked against agreed Quality Acceptance Criteria (QAC)
• Error rates are calculated against the QAC on the fly
• Supervisor checks final results

(30)

CMS - Selecting the batch

(31)

Volume & Issue Number Check

(32)

Article checked against QAC

(33)

Re-keyed fields checked for accuracy

(34)

Supervisor checks results (auto or manual accept/reject)
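
The automatic accept/reject step can be illustrated by comparing a batch's sampled error rate with a threshold. This is a sketch only: the agreed Quality Acceptance Criteria are not given in the slides, so `max_error_rate` is a hypothetical parameter, and the function names are invented for the example.

```python
def error_rate(errors_found, items_checked):
    """Fraction of sampled items that failed a check."""
    if items_checked <= 0:
        raise ValueError("items_checked must be positive")
    return errors_found / items_checked


def qa_decision(errors_found, items_checked, max_error_rate):
    """Return 'accept' when the sampled error rate is within the threshold."""
    rate = error_rate(errors_found, items_checked)
    return "accept" if rate <= max_error_rate else "reject"
```

For example, `qa_decision(2, 100, 0.05)` returns `"accept"`, while `qa_decision(10, 100, 0.05)` returns `"reject"`. A supervisor could still override either outcome manually, as the slide indicates.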

(35)

QA Results

• Automated email sent to supplier advising the result
• Emails for rejected batches include a summary of errors
• Summary of errors saved for all batches
• Accepted batches are immediately accessible in public search system

(36)

Batch History and details retained

(37)
(38)

Search or Browse articles within CMS

(39)

Statistics

• Stats for content received, QA’d and delivered to the public are generated by the Content Management System
• (Stats for usage of public search system are collected using Google Analytics)

(40)

CMS - Content Statistics

(41)

CMS - Work Statistics

(42)

Access

• Public access to digital newspapers is provided through the Australian Newspapers Search and Delivery System
• Users can search or browse newspapers
• Search results can be refined using filters

(43)
