(1)

DLM Forum: “Making the Information Governance Landscape in Europe”

José Carlos Ramalho [email protected]

KEEP SOLUTIONS // University of Minho

Nov. 12-14, 2014, Lisbon, Portugal

Database preservation toolkit: a flexible tool to normalize and give access to databases

(2)

This work was partially supported by the SCAPE Project.

The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

http://www.keep.pt


KEEP SOLUTIONS: Projects

• DigitArq, CRAV (2003..[2008-2012])

• RODA (2006..[2008-…[)

• RCAAP (2008-…)

• PPA (2009)

• Open source: RODA, KOHA, DSpace, Moodle, etc.

• Scientific research (European projects):

SCAPE: Large scale preservation

4C: the cost for curation

E-ARK: presented by Kuldar

(3)

Developed within RODA project

Now stand-alone open-source project

http://keeps.github.io/db-preservation-toolkit/

Imports and exports between DBMS and DB formats

Supports preservation formats: DBML, SIARD

(4)

RODA: Repositório de Objectos Digitais Autênticos (Repository of Authentic Digital Objects)

José Carlos Ramalho [email protected]
Miguel Ferreira [email protected]
Luis Faria [email protected]
Rui Castro [email protected]
Francisco Barbedo [email protected]
Cecília Henriques [email protected]
Luis Corujo [email protected]

The Past: 2006 - 2009

(5)

What is RODA?

A digital repository created for archives, with the following features:

• Long-term preservation and authenticity

• Standards based (OAIS, EAD, PREMIS, METS, etc.)

• Following TRAC demands (Trustworthy Repositories Audit & Certification)

• Security policies

• Service Oriented Architecture (SOA)

• Nice and simple design

• Open source

DGArq and University of Minho

(6)

Context

RODA (2006-2009)

• Metadata management (EAD)

• Digital Object management (...)

• Digital Preservation Policies and protocols

• National Project (Archives National Board)

CRiB: Preservation Services Digital Repositories (2005-2008)

• Distributed Migration Service

• Migration Service supported by knowledge base

• PhD Thesis / U. Minho (Miguel Ferreira)


(8)

OAIS

(9)

Architecture


(11)

Data Model


(16)

Digital Object Classes


Normalization

Significant properties

(20)

Ingest

(21)

SIP structure

• Object class and format

• Place in classification plan

• Descriptive metadata

• Preservation metadata

• Technical metadata

• Representation files

(29)
(30)

Databases

How do we keep archived databases readable and usable in the long term (at acceptable cost)?

How do we separate the data from a specific database management environment?

How can we preserve the original data semantics and structure?

How can we preserve authenticity and provenance of databases?

How can we preserve data while it continues to evolve? (here we can split the problem in two: operational databases and frozen databases)

How can we have efficient preservation frameworks, while retaining the ability to query different database versions?

How can multi-user online access be provided to hundreds of archived databases containing terabytes of data?

Can we move from a centralized model to a distributed, redundant model of database preservation?

(39)
(40)

Databases

What are the salient features of a database that should be preserved? (significant properties...)

What are the different stages in the database preservation life cycle?

What documentation is preserved together with a database, and in what format?

What are the legal encumbrances on database preservation?

What can be learned from traditional archival appraisal for the selection of databases for preservation?

To what extent can the preservation strategies and procedural policies developed by archivists be adapted for databases?

How can we measure the quality of preservation strategies when they are applied to databases? What are DB significant properties? (quality assurance...)

(48)

Databases: goals

• How do we store them?

• How do we access them?


(50)
(51)

Databases

Normal evolution path:

• Data?

• Structure?

• Views?

• Reports?

• Stored Procedures?

• ...

First prototype:

• Data

• Structure

• Only “frozen” DBs

(59)

HOW?

The need for an intermediate representation

Input Formats (M)

Output Formats (N)

M*N migrators

(61)

IR: DBML

Input Formats (M)

Output Formats (N)

DBML

For each new output or input format you only need to code one new migrator.
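
The saving can be sketched in a few lines of Python. This is only an illustration of the hub-and-spoke idea, not the toolkit's code: the `IR` dictionary stands in for DBML, and `import_csv`, `export_sql`, `convert` and `migrators_needed` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the intermediate representation (not real DBML).
IR = Dict[str, List[dict]]

# One registry per direction: adding a format means adding one entry.
importers: Dict[str, Callable[[str], IR]] = {}
exporters: Dict[str, Callable[[IR], str]] = {}


def import_csv(text: str) -> IR:
    """Parse comma-separated text (header line plus rows) into the IR."""
    header, *rows = [line.split(",") for line in text.strip().splitlines()]
    return {"rows": [dict(zip(header, row)) for row in rows]}


def export_sql(ir: IR) -> str:
    """Render the IR as INSERT statements for a hypothetical table 't'."""
    lines = []
    for row in ir["rows"]:
        cols = ", ".join(row)
        vals = ", ".join(f"'{v}'" for v in row.values())
        lines.append(f"INSERT INTO t ({cols}) VALUES ({vals});")
    return "\n".join(lines)


importers["csv"] = import_csv
exporters["sql"] = export_sql


def convert(source_format: str, target_format: str, payload: str) -> str:
    """Any input format reaches any output format through the IR."""
    return exporters[target_format](importers[source_format](payload))


def migrators_needed(m: int, n: int) -> tuple:
    """Return (direct pairwise migrators, migrators when going through an IR)."""
    return m * n, m + n
```

With 5 input and 4 output formats, `migrators_needed(5, 4)` gives 20 direct migrators against 9 through the intermediate representation.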

(63)

DBML design principles

• Hardware independent;

• Software independent;

• Easy to process;

• Descriptive;

• It should be possible to add metadata;

• It should be possible to add semantics;
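
To make the principles concrete, here is a rough sketch of a self-describing XML serialization in which structure and data travel together in plain text. The element and attribute names (`table`, `structure`, `column`, `data`, `row`, `cell`) are assumptions made up for this example and are not the actual DBML schema.

```python
import xml.etree.ElementTree as ET


def table_to_xml(name, columns, rows):
    """Serialize one table: column definitions first, then the rows.

    `columns` is a list of (name, type) pairs and `rows` a list of tuples.
    The layout is illustrative only, not the real DBML schema.
    """
    table = ET.Element("table", {"name": name})
    structure = ET.SubElement(table, "structure")
    for col_name, col_type in columns:
        ET.SubElement(structure, "column", {"name": col_name, "type": col_type})
    data = ET.SubElement(table, "data")
    for row in rows:
        row_el = ET.SubElement(data, "row")
        for (col_name, _), value in zip(columns, row):
            # Each cell names its column, so the file can be read without the DBMS.
            cell = ET.SubElement(row_el, "cell", {"column": col_name})
            cell.text = str(value)
    return ET.tostring(table, encoding="unicode")
```

Because the output is ordinary XML text, it does not depend on the hardware or software that produced it, and further metadata elements could be added next to `structure` without touching the data.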

(65)

SIP Structure (DB example)

Compressed File

• Manifest

• KEEPS Employees

• Technical Metadata:
- color
- dimensions
- ...

• Binaries

• DBML:
- Data;
- Structure;
- Other metadata.

• Descriptive Metadata (EAD):
- producer
- collection
- ...
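
As a rough illustration of a compressed package with a manifest, the sketch below builds a ZIP archive in memory and records a checksum for each file. The JSON manifest, the SHA-256 checksums and the file paths are assumptions chosen for this example; they do not reproduce RODA's actual SIP specification.

```python
import hashlib
import io
import json
import zipfile


def build_sip(files: dict) -> bytes:
    """Pack files plus a checksum manifest into one compressed archive.

    `files` maps archive paths to bytes. The layout and the JSON manifest
    are illustrative assumptions, not RODA's actual SIP specification.
    """
    manifest = {
        path: hashlib.sha256(content).hexdigest()
        for path, content in files.items()
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as sip:
        sip.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path, content in files.items():
            sip.writestr(path, content)
    return buffer.getvalue()


def verify_sip(sip_bytes: bytes) -> bool:
    """Recompute every checksum listed in the manifest."""
    with zipfile.ZipFile(io.BytesIO(sip_bytes)) as sip:
        manifest = json.loads(sip.read("manifest.json"))
        return all(
            hashlib.sha256(sip.read(path)).hexdigest() == digest
            for path, digest in manifest.items()
        )
```

A package could then be assembled with, for instance, `build_sip({"representation/database.dbml": b"<db/>", "metadata/descriptive.ead.xml": b"<ead/>"})` and checked on arrival with `verify_sip`.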

(76)
(77)
(78)
(79)

Databases: RODA answers

• How do we store them?

★ DBML + binaries + technical metadata

• How do we access them?

★ phpMyAdmin (hacked version)

(81)

Some problems

Extracting data is easy:

SELECT * FROM ...

Extracting the structure is not:

• DBMSs protect this information;

• Each DBMS stores it differently;

• Different versions of the same DBMS can also act differently;
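
The contrast can be seen with SQLite, used here only because it ships with Python. The data query is portable SQL, while the structure query uses `PRAGMA table_info`, which is specific to SQLite; the `employees` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO employees (name) VALUES ('Ana')")

# Data: plain SQL that any relational DBMS accepts.
rows = conn.execute("SELECT * FROM employees").fetchall()

# Structure: SQLite-specific catalogue access. Other systems expose this
# differently (for example through information_schema views or vendor catalogues).
columns = [
    {"name": name, "type": col_type, "not_null": bool(notnull), "pk": bool(pk)}
    for _cid, name, col_type, notnull, _default, pk
    in conn.execute("PRAGMA table_info(employees)")
]
```

An import module for another DBMS would need its own version of the second query, which is where most of the per-system effort goes.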

(82)

Open Planets: “Database Archiving” (Copenhagen 2012)

Two approaches emerged:

• Preserve the database and the environment, allowing authentic access to the information and any accompanying applications. Relies on an emulation strategy.

• Extract / migrate the raw data and table structure from the original database. Relies on a migration strategy:

RODA+DBML

SIARD

SIARD-DK

ADDML

(89)

DBML vs SIARD

                     DBML                     SIARD

Data                 Yes (no segmentation)    Yes (with segmentation)

Structure            Yes (XML)                Yes (XSD)

Stored Procedures    No                       Yes (XML)

Triggers             No                       Yes (XML)

(90)
(91)

db-preservation-toolkit architecture


(93)
(94)

Pre-ingest / Production / SIP creation


(98)
(99)

Access

Working on RDF viewer, we don’t need

(100)

New prototype

MSc thesis: “Database Convert Tool: exploring SIARD”

• OWL model development

• Triple Store implementation

• RDF endpoint implementation

• SPARQL based access
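
A very small sketch of the underlying idea, turning relational rows into triples and matching a pattern against them, might look as follows. The URI pattern, `rows_to_triples` and `match` are assumptions made for illustration; they are not the OWL model or the triple store of the prototype, and `match` only imitates a single SPARQL triple pattern.

```python
def rows_to_triples(base, table, key, rows):
    """Map each row to (subject, predicate, object) triples.

    One subject per row, one predicate per column. The URI pattern is an
    illustrative assumption, not the model developed in the thesis.
    """
    triples = []
    for row in rows:
        subject = f"{base}/{table}/{row[key]}"
        triples.append((subject, "rdf:type", f"{base}/{table}"))
        for column, value in row.items():
            triples.append((subject, f"{base}/{table}#{column}", value))
    return triples


def match(triples, s=None, p=None, o=None):
    """Tiny stand-in for a SPARQL triple pattern: None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]
```

For example, `match(triples, p="http://example.org/db/employees#name", o="Rui")` returns the triple whose subject identifies the row holding that name.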

(101)

Future plan:

(102)

• PhD thesis: Graph view dissemination, relational model reverse engineering, algorithm development (from relational to graph model) [http://repositorium.sdum.uminho.pt/handle/1822/25655]

• MSc thesis: SIARD support (DBML replacement)

• MSc thesis: RDF representation for databases

(103)

José Carlos Ramalho

Engineer / Researcher / Teacher

[email protected] / [email protected]

www.keep.pt

A R Q U I V O S | B I B L I O T E C A S | M U S E U S
