
(1)

STEINBUCH CENTRE FOR COMPUTING - SCC

Why long time storage does not equate to archive

Jos van Wezel

HUF Toronto 2015

(2)
(3)
(4)

Archive tasks at KIT

State of Baden-Württemberg

• universities, museums, state archives, libraries
• currently 9 PB / 1 billion files
• est. growth 0.5 PB / year

HLRS Stuttgart

• finalised projects from e.g. climatology, CFD, molecular dynamics, engineering
• currently 1.5 PB / 7.5 million files
• est. growth 0.5 PB / year

GridKa (German LHC Tier 1 center)

• support of all 4 LHC experiments
• currently 16 PB / 13 million files
• est. growth 1 PB / year
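The three holdings differ far more in average file size than in total volume, which is one reason an archive has to cope with a wide range of file sizes. A small calculation from the figures quoted above, assuming 1 PB = 10^15 bytes (the slide does not say whether decimal or binary units are meant):

```python
# Volumes and file counts as quoted on the slide.
# Assumption: 1 PB = 10**15 bytes, 1 MB = 10**6 bytes.
PB = 10**15

holdings = {
    "State of Baden-Württemberg": (9 * PB, 1_000_000_000),
    "HLRS Stuttgart": (1.5 * PB, 7_500_000),
    "GridKa": (16 * PB, 13_000_000),
}


def mean_file_size_mb(total_bytes, n_files):
    """Mean file size in megabytes (10**6 bytes)."""
    return total_bytes / n_files / 10**6


if __name__ == "__main__":
    for name, (total, files) in holdings.items():
        print(f"{name}: {mean_file_size_mb(total, files):,.0f} MB per file on average")
```

Under these assumptions the means come out at roughly 9 MB, 200 MB and 1.2 GB per file respectively.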

Roadmap

EU projects: Human Brain Project, EUDAT, DARIAH

Helmholtz Data Center

BW-HPC-C5

KIT has been designated as central site for long time digital data storage in the State of Baden-Württemberg

Move archived data out of TSM to HPSS.

(5)

Long time storage business cases

A. Project Store

• Temporary storage for large data, 3 to 4 years

• Non-active data that active experiments depend on

• Simple upload interface

B. Good scientific practice store

All of A above plus

• At least 10 years, conformity with good scientific practice

• Minimal set of metadata, requires PIDs

• Expiration, integrity checks

C. Long term store, archive

All of B above plus

• Forever or as long as is needed

• Interaction with community for curation

(6)

Archive layer model

[Figure: archive layer model. Labels: content preservation, referencing (Libraries, Museums, Communities); bit preservation (Data centres); disk/tape storage; data center interface; data exploration (web) interface; Portals; Collections]

Communities take care of selection, ordering, ingesting of data

The community, or ideally archivists representing the community, takes care of curation, deletions, pruning, formats, …

Libraries and museums are confronted with large data sets and subsequent handling

Content Preservation

The data centers take care of bit preservation

Bit preservation is not a done deal

Access to preservation information is not standardised

The bits that go in are the bits that come out

technical metadata

(7)

Classic view

Data sources

Archive Management

(Archive) Storage

HPSS

LTFS

Black Pearl

other

Content Management Systems, DSpace, Fedora, ArchiveMatica, Invenio, ePrints, etc…

(8)

Application access to data in deep archives

Data interfaces assume disk interface i.e. on-line storage

Archivematica, Fedora, Invenio, Fixity etc.

DB access

Interfaces still stick to POSIX

No consideration with regard to off-line, or replicated, storage

Assume different storage is handled behind the scenes

No one waits for information to become on-line

Unless you know it will come on-line

How to handle asynchronous data access

Requirements from the provider, i.e. data center, perspective

How to do efficient off-line checksumming (with user provided method)

Applications should not DoS the (tape) storage systems

An upstream reference

how to get from content to the ‘owner of the data’

Data center exchange
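The two requirements above, asynchronous access to off-line data and protecting the tape system from request floods, can be illustrated together. The following is a minimal sketch only: `SimulatedTapeArchive` and its methods are hypothetical stand-ins, not an HPSS or TSM interface, and a short sleep takes the place of a tape mount.

```python
import asyncio


class SimulatedTapeArchive:
    """Hypothetical stand-in for an off-line backend (not a real HPSS/TSM API)."""

    def __init__(self, max_concurrent_recalls=2, recall_delay=0.01):
        # The semaphore caps how many recalls run at once, so a burst of
        # client requests queues up instead of flooding the tape system.
        self._limit = asyncio.Semaphore(max_concurrent_recalls)
        self._delay = recall_delay
        self._online = {}
        self._active = 0
        self.peak_recalls = 0

    async def bring_online(self, name):
        """Stage one file to disk, waiting for a free recall slot first."""
        async with self._limit:
            self._active += 1
            self.peak_recalls = max(self.peak_recalls, self._active)
            await asyncio.sleep(self._delay)  # simulated tape mount and read
            self._active -= 1
            self._online[name] = f"contents of {name}".encode()

    async def read(self, name):
        """Return the data, staging it first if it is still off-line."""
        if name not in self._online:
            await self.bring_online(name)
        return self._online[name]


async def fetch_all(archive, names):
    # All requests are issued together; the archive decides how many run at once.
    return await asyncio.gather(*(archive.read(n) for n in names))


def demo():
    async def run():
        # Create the archive inside the running loop so the semaphore is
        # bound to it on older Python versions as well.
        archive = SimulatedTapeArchive(max_concurrent_recalls=2)
        data = await fetch_all(archive, [f"file{i}" for i in range(6)])
        return archive, data

    return asyncio.run(run())


if __name__ == "__main__":
    archive, data = demo()
    print(len(data), "files read, peak concurrent recalls:", archive.peak_recalls)
```

The caller never blocks a thread while waiting for a file to come on-line, and the recall limit is enforced on the provider side rather than left to each application.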

(9)

Rough edges and requirements

storing [m|b]illions of files

data exchange

roles and rights, accounting

wide range of file sizes

assured reliability

storage quality

(10)

[Figure: archive service architecture. Labels: standard protocols for data transport (*FTP, S3, …); Archive service (REST); EUDAT development; Archive Logic; VIM; Meta DB; IDM; Auth; storage backends (disk storage, B2SAFE, LTFS, iRODS); Archive service path; Data path]

• meta data (PREMIS)

• checksums

• storage quality

(11)

archive access

enhance fixity support

user provided checksum

‘lazy’ checking and periodic integrity verification

‘we need to be more intelligent if we want to scale in the future’

PoC implementation for HPSS and TSM, with PSNC (Poland) and MPCDF (formerly RZG)

integrate in current EUDAT B2SAFE service

enable bring-on-line functionality

on-line and off-line data represent two different storage qualities

task in INDIGO data cloud project

what else?
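The fixity items on this slide (user provided checksum, 'lazy' checking, periodic verification) could be combined roughly as follows. This is a sketch under stated assumptions, not the PoC mentioned above: `FixityRecord` and `read_with_lazy_check` are hypothetical names, and "lazy" is interpreted here as verifying a file when it is read anyway, provided its last check is older than a chosen interval.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Digest of data using any algorithm name that hashlib accepts."""
    return hashlib.new(algorithm, data).hexdigest()


class FixityRecord:
    """Digest supplied by the user at ingest, plus the time of the last check."""

    def __init__(self, algorithm: str, expected: str, now: float):
        self.algorithm = algorithm
        self.expected = expected.lower()
        self.last_verified = now

    def verify(self, data: bytes, now: float) -> bool:
        """Recompute the digest; record the time only if it still matches."""
        ok = checksum(data, self.algorithm) == self.expected
        if ok:
            self.last_verified = now
        return ok

    def is_due(self, max_age: float, now: float) -> bool:
        """True when the last successful check is older than max_age."""
        return now - self.last_verified > max_age


def read_with_lazy_check(data: bytes, record: FixityRecord,
                         max_age: float, now: float) -> bytes:
    """Verify on read, but only when a check is due."""
    if record.is_due(max_age, now) and not record.verify(data, now):
        raise ValueError("fixity check failed")
    return data
```

Because the user names the algorithm, the archive can verify at ingest that it received what the user sent; a periodic job can call `is_due` over all records to find files that were never read and therefore never checked lazily.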

(12)

Conclusions

A data archive is more than long term data storage

HPSS is used in HPC and increasingly as a cold storage engine for many other data sources, i.e. communities

With some user space effort ARCHIVE requirements can be implemented on current HPSS

Work on HPSS from bottom and top

Buddy Bland: We must make it easy for sites to match the types of storage to the requirements of the users.

We must make it easy for users to match the types of data to the requirements of the sites.

(13)
(14)
(15)

Challenges

Determining the rate of bit rot and taking countermeasures

we have no metric for archive (bit preservation) quality

this should somehow be part of a site certification

Intelligent selection of candidates for caching

getting data from a deep (i.e. cost effective) archive takes time

on-line caches in front of archives for fast access

what data goes into and what data stays in the cache

Notions: trends, markets, relationships: maybe Google can help?

Uniform interface for “cold” storage

API that goes beyond classic POSIX and takes longer access times into account

Exchange data sets or data volumes with peers for redundancy or for contract changes: asynchronous operations
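The question of what goes into and what stays in the cache is left open above. As a baseline for comparison only, here is a size-aware least-recently-used (LRU) staging cache; LRU is a swapped-in textbook policy, not the "intelligent selection" the slide asks for, and `StagingCache` is a hypothetical name.

```python
from collections import OrderedDict


class StagingCache:
    """Size-bounded LRU cache of file names staged from the archive."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._items = OrderedDict()  # name -> size, least recently used first

    def access(self, name, size):
        """Record an access; return True on a cache hit, False on a miss."""
        if name in self._items:
            self._items.move_to_end(name)  # mark as most recently used
            return True
        if size > self.capacity:
            return False  # too large to cache at all; serve without caching
        # Evict least recently used entries until the new file fits.
        while self.used + size > self.capacity:
            _, evicted_size = self._items.popitem(last=False)
            self.used -= evicted_size
        self._items[name] = size
        self.used += size
        return False

    def cached(self):
        """Names currently in the cache, least recently used first."""
        return list(self._items)
```

A smarter policy would replace the eviction loop with a decision based on access trends or relationships between data sets, which is exactly the information a plain LRU ignores.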

(16)

Archive interface

work in progress of the RADAR and bwArchive projects

revisit the 2008 SNIA XAM standard

Asynchronous access to data on off-line storage

XAM comes with an emulator / reference implementation

XAM modular adaptation to different back-ends through Vendor Interface Modules (VIM)

copyright: Jan Potthoff / SCC

(17)

What is an Archive? SCC view

Definition of an Archive:

“A long time file repository containing a massive collection of digital information”

No loss of data or degradation of quality for an agreed amount of time

SCC ensures proper integrity

Does not depend on proprietary middleware or storage formats

migration should be possible in some number of years and must be independent from content or owner

Data is stored in (self-defining) containers

containers are not files, they are “bitfiles” with a checksum, they may contain one or more files.

Reference to the container is fixed for the lifetime of the container

The operator (SCC) is not responsible for passing content on to the next generation of owners

data ownership has a lifetime that must be known in advance, but can be renewed

data is stored after consent of a data provider and SCC; this is put into effect in a service level agreement for archives.
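To make the container idea above concrete (a self-defining "bitfile" with a checksum that may hold one or more files), here is a minimal sketch. The choice of tar plus an embedded JSON manifest, the name `MANIFEST.json` and both function names are assumptions for illustration; the slides do not specify SCC's container format.

```python
import hashlib
import io
import json
import tarfile


def build_container(files):
    """Pack {name: bytes} into one tar bitfile with an embedded manifest.

    Returns (container_bytes, container_sha256).
    """
    manifest = {
        name: {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in files.items()
    }
    # The manifest travels inside the container, so the bitfile describes itself.
    members = {"MANIFEST.json": json.dumps(manifest, sort_keys=True).encode()}
    members.update(files)

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    container = buffer.getvalue()
    return container, hashlib.sha256(container).hexdigest()


def verify_container(container, expected_sha256):
    """Check the bitfile digest, then every member against the manifest."""
    if hashlib.sha256(container).hexdigest() != expected_sha256:
        return False
    with tarfile.open(fileobj=io.BytesIO(container), mode="r") as tar:
        manifest = json.load(tar.extractfile("MANIFEST.json"))
        for name, entry in manifest.items():
            data = tar.extractfile(name).read()
            if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                return False
    return True
```

The data center only needs the outer checksum for bit preservation, while the manifest lets a later owner validate individual files without relying on proprietary middleware.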
