Why long time storage does not equate to archive

(1)

STEINBUCH CENTRE FOR COMPUTING - SCC

Jos van Wezel

HUF Toronto 2015

(4)

Archive tasks at KIT

State of Baden-Württemberg

• universities, museums, state archives, libraries

• currently 9 PB / 1 billion files

• est. growth 0.5 PB / year

HLRS Stuttgart

• finalised projects from e.g. climatology, CFD, molecular dynamics, engineering

• currently 1.5 PB / 7.5 million files

• est. growth 0.5 PB / year

GridKa (German LHC Tier 1 center)

• support of all 4 LHC experiments

• currently 16 PB / 13 million files

• est. growth 1 PB / year
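The figures above imply very different average file sizes per community, which matters for the "wide range of file sizes" requirement. A minimal sketch of the arithmetic, assuming decimal units (1 PB = 10^15 bytes, the slides do not say which) and strictly linear growth at the estimated rates; the function and variable names are illustrative only:

```python
PB = 10**15  # decimal petabyte: an assumption, the slides do not distinguish PB from PiB

# (current volume in PB, number of files, estimated growth in PB/year), taken from the slide
archives = {
    "State of Baden-Württemberg": (9.0, 1_000_000_000, 0.5),
    "HLRS Stuttgart": (1.5, 7_500_000, 0.5),
    "GridKa": (16.0, 13_000_000, 1.0),
}


def average_file_size_mb(volume_pb, files):
    """Mean file size in MB (10**6 bytes)."""
    return volume_pb * PB / files / 10**6


def projected_volume_pb(volume_pb, growth_pb_per_year, years):
    """Linear projection; real growth is unlikely to be exactly linear."""
    return volume_pb + growth_pb_per_year * years


for name, (volume, files, growth) in archives.items():
    print(
        f"{name}: {average_file_size_mb(volume, files):.0f} MB per file on average, "
        f"{projected_volume_pb(volume, growth, 5):.1f} PB after 5 years"
    )
```

Under these assumptions the averages come out at roughly 9 MB, 200 MB and 1.2 GB per file, so the same archive has to serve workloads that differ by two orders of magnitude in file size.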

Roadmap

EU projects: Human Brain Project, EUDAT, DARIAH

Helmholtz Data Center

BW-HPC-C5

KIT has been designated as the central site for long-term digital data storage in the State of Baden-Württemberg

Move archived data out of TSM to HPSS.

(5)

Long time storage business cases

A. Project Store

• Temporary storage for large data, 3 to 4 years

• Non-active data that active experiments depend on

• Simple upload interface

B. Good scientific practice store

• All of A above plus

• At least 10 years, conformity with good scientific practice

• Minimal set of metadata, requires PIDs

• Expiration, integrity checks

C. Long term store, archive

• All of B above plus

• Forever or as long as is needed

• Interaction with community for curation
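The three business cases build on each other ("all of A above plus", "all of B above plus"). A small sketch of that cumulative structure, assuming each case can be described by a set of feature labels; the class name `StoreClass` and the label strings are illustrative, not part of any SCC service definition:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreClass:
    name: str
    retention: str
    features: frozenset

    def extended(self, name, retention, extra):
        """Return a new class that offers everything this one does, plus `extra`."""
        return StoreClass(name, retention, self.features | frozenset(extra))


PROJECT = StoreClass(
    "A. Project store",
    "3 to 4 years",
    frozenset({"large data", "simple upload interface"}),
)
GSP = PROJECT.extended(
    "B. Good scientific practice store",
    "at least 10 years",
    {"minimal metadata", "PIDs", "expiration", "integrity checks"},
)
ARCHIVE = GSP.extended(
    "C. Long term store, archive",
    "forever or as long as is needed",
    {"community curation"},
)

for store in (PROJECT, GSP, ARCHIVE):
    print(store.name, "|", store.retention, "|", sorted(store.features))
```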

(6)

Archive layer model

[Figure: archive layer model. Labels: content preservation, referencing (libraries, museums, communities); bit preservation (data centres); disk/tape storage; data center interface; data exploration (web) interface; portals; collections]

Content Preservation

Communities take care of selection, ordering, ingesting of data

The community, or ideally archivists representing the community, takes care of curation, deletions, pruning, formats, …

Libraries and museums are confronted with large data sets and their subsequent handling

The data centers take care of bit preservation

Bit preservation is not a done deal

Access to preservation information is not standardised

The bits that go in are the bits that come out

technical metadata

(7)

Classic view

Data sources

Archive Management

(Archive) Storage

HPSS

LTFS

Black Pearl

other

Content Management Systems: DSpace, Fedora, Archivematica, Invenio, EPrints, etc.

(8)

Application access to data in deep archives

Data interfaces assume disk interface i.e. on-line storage

Archivematica, Fedora, Invenio, Fixity etc.

DB access

Interfaces still stick to POSIX

No consideration with regard to off-line, or replicated, storage

Assume different storage is handled behind the scenes

No one waits for information to come on-line

Unless they know it will come on-line

How to handle asynchronous data access

Requirements from the provider, i.e. data center, perspective

How to do efficient off-line checksumming (with a user-provided method)

Applications should not DoS the (tape) storage systems

An upstream reference

how to get from content to the ‘owner of the data’

Data center exchange
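One way to keep applications from overloading tape storage is to collect recall requests and group them per tape, so that each cartridge is mounted once and read in order. The following is a sketch under stated assumptions: the `locate` callable, which maps a path to a tape id and a position on that tape, is hypothetical and stands in for whatever the storage system exposes.

```python
from collections import defaultdict


def plan_recalls(requests, locate):
    """Group requested paths by tape and order them by position on the tape.

    `locate` is a hypothetical callable returning (tape_id, position) for a path.
    """
    by_tape = defaultdict(list)
    for path in set(requests):  # duplicate requests collapse into a single recall
        tape_id, position = locate(path)
        by_tape[tape_id].append((position, path))
    return {
        tape: [path for _, path in sorted(entries)]
        for tape, entries in by_tape.items()
    }


# Toy catalogue standing in for the storage system's own metadata.
catalogue = {
    "/a/run1.dat": ("T001", 40),
    "/a/run2.dat": ("T002", 5),
    "/b/img.tif": ("T001", 3),
}

plan = plan_recalls(
    ["/a/run1.dat", "/b/img.tif", "/a/run2.dat", "/a/run1.dat"],
    catalogue.__getitem__,
)
print(plan)
```

Four requests become two mounts here; the duplicate request for `/a/run1.dat` is served by the same recall.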

(9)

Rough edges and requirements

storing [m|b]illions of files

data exchange

roles and rights, accounting

wide range of file sizes

assured reliability

storage quality

(10)

[Figure: archive service architecture, with two paths marked: archive service path and data path]

standard protocols for data transport: *FTP, S3, …

Archive service (REST)

EUDAT development

Archive Logic

VIM, Meta DB, IDM, Auth

storage backends: disk storage, B2SAFE, LTFS, iRODS

• meta data (PREMIS)

• checksums

• storage quality

(11)

archive access

enhance fixity support

user provided checksum

‘lazy’ checking and periodic integrity verification

‘we need to be more intelligent if we want to scale in the future’

PoC implementation for HPSS and TSM

with PSNC (Poland) and MPCDF (was RZG)

integrate in current EUDAT B2SAFE service
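The fixity items above can be sketched as follows. This is a minimal illustration, not the PoC implementation: the user names the checksum algorithm and supplies the expected digest at ingest, verification is deferred ("lazy"), and a periodic job picks the records whose last verification is too old. The class and function names are assumptions made for this example.

```python
import hashlib
import time


def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Digest of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class FixityRecord:
    """Expected digest for one file, as provided by the user at ingest."""

    def __init__(self, path, algorithm, expected):
        self.path = path
        self.algorithm = algorithm
        self.expected = expected
        self.last_verified = None  # nothing is checked at ingest: verification is lazy

    def verify(self):
        ok = checksum(self.path, self.algorithm) == self.expected
        if ok:
            self.last_verified = time.time()
        return ok


def due_for_check(records, max_age_seconds, now=None):
    """Records never verified, or verified longer ago than max_age_seconds."""
    now = time.time() if now is None else now
    return [
        record
        for record in records
        if record.last_verified is None or now - record.last_verified > max_age_seconds
    ]
```

For data on tape, `verify` would ideally run where the data already is (for example during a scheduled recall) rather than triggering a recall of its own; that scheduling is outside this sketch.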

enable bring-on-line functionality

on-line and off-line data represent two different storage qualities

task in the INDIGO-DataCloud project
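A bring-on-line call treats on-line and off-line as two storage qualities and lets the caller request a transition without blocking on each file. The sketch below uses an in-memory stand-in for the archive; `ToyArchive`, `Quality` and `bring_online` are hypothetical names and do not correspond to an HPSS, TSM or INDIGO interface.

```python
import asyncio
from enum import Enum


class Quality(Enum):
    ONLINE = "on-line"
    OFFLINE = "off-line"


class ToyArchive:
    """In-memory stand-in for a tape-backed store; not a real HPSS or TSM API."""

    def __init__(self, files, stage_delay=0.01):
        self._quality = dict(files)
        self._stage_delay = stage_delay

    def quality(self, path):
        return self._quality[path]

    async def bring_online(self, path):
        if self._quality[path] is Quality.OFFLINE:
            await asyncio.sleep(self._stage_delay)  # simulates the tape recall
            self._quality[path] = Quality.ONLINE
        return path


async def main():
    archive = ToyArchive({"a.dat": Quality.OFFLINE, "b.dat": Quality.ONLINE})
    # Both requests are issued at once; the caller does not wait per file.
    staged = await asyncio.gather(
        *(archive.bring_online(path) for path in ["a.dat", "b.dat"])
    )
    return archive, staged


archive, staged = asyncio.run(main())
print(staged, archive.quality("a.dat"))
```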

what else?

(12)

Conclusions

A data archive is more than long term data storage

HPSS is used in HPC and increasingly as cold storage engine for many other data sources, i.e. communities

With some user space effort ARCHIVE requirements can be implemented on current HPSS

Work on HPSS from bottom and top

Buddy Bland: We must make it easy for sites to match the types of storage to the requirements of the users.

We must make it easy for users to match the types of data to the requirements of the sites.

(15)

Challenges

Determining the rate of bit rot and taking countermeasures

we have no metric for archive (bit preservation) quality

this should somehow be part of a site certification

Intelligent selection of candidates for caching

getting data from a deep (i.e. cost effective) archive takes time

on-line caches in front of archives for fast access

what data goes into and what data stays in the cache

Notions: trends, markets, relationships: maybe Google can help?
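As a baseline for the caching question, a least-recently-used (LRU) policy bounded by bytes is the simplest candidate. It is shown here only as a reference point: the "intelligent selection" asked for above would replace the eviction rule with something informed by trends or relationships between data sets. The class name is illustrative.

```python
from collections import OrderedDict


class LRUDiskCache:
    """Byte-bounded LRU cache of file names in front of a slow archive."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._entries = OrderedDict()  # path -> size, least recently used first

    def access(self, path, size):
        """Return True on a cache hit; on a miss, admit the file and evict as needed."""
        if path in self._entries:
            self._entries.move_to_end(path)
            return True
        if size > self.capacity:
            return False  # too large to cache at all; serve it straight from the archive
        while self.used + size > self.capacity:
            _, evicted_size = self._entries.popitem(last=False)
            self.used -= evicted_size
        self._entries[path] = size
        self.used += size
        return False

    def cached(self):
        return list(self._entries)
```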

Uniform interface for “cold” storage

API that goes beyond classic POSIX and takes longer access times into account

Exchange data sets or data volumes with peers for redundancy or for contract changes: asynchronous operations

(16)

Archive interface

work in progress of the RADAR and bwArchive projects

revisit the 2008 SNIA XAM standard

Asynchronous access to data on off-line storage

XAM comes with an emulator / reference implementation

XAM allows modular adaptation to different back-ends through Vendor Interface Modules (VIM)
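The idea of pluggable back-end modules can be illustrated with a small abstract interface and a registry. This is a sketch of the pattern only and does not reproduce the XAM API or an actual VIM; `StorageBackend`, `register` and the `"memory"` back-end are hypothetical names.

```python
from abc import ABC, abstractmethod

BACKENDS = {}  # back-end name -> class, filled by the register() decorator


def register(name):
    """Class decorator that makes a back-end selectable by name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class StorageBackend(ABC):
    """Hypothetical back-end interface; this is not the XAM VIM API."""

    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, object_id: str) -> bytes: ...


@register("memory")
class MemoryBackend(StorageBackend):
    """Toy back-end keeping objects in a dict; a real module would talk to HPSS, TSM, etc."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, data):
        self._objects[object_id] = data

    def get(self, object_id):
        return self._objects[object_id]


# The archive logic only knows the interface and picks a module by name.
backend = BACKENDS["memory"]()
backend.put("obj-1", b"payload")
print(backend.get("obj-1"))
```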

copyright: Jan Potthoff / SCC

(17)

What is an Archive? SCC view

Definition of an Archive:

“A long time file repository containing a massive collection of digital information”

No loss of data or degradation of quality for an agreed amount of time

SCC ensures proper integrity

Does not depend on proprietary middleware or storage formats

migration should be possible in some number of years and must be independent from content or owner

Data is stored in (self-defining) containers

containers are not files; they are “bitfiles” with a checksum and may contain one or more files

Reference to the container is fixed for the lifetime of the container
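A self-defining container of this kind could look like the sketch below: a tar archive that carries its own manifest of per-file digests, is addressed by a fixed reference, and has a checksum over the whole bitfile. The choice of tar, JSON, SHA-256 and a UUID as reference are all assumptions made for illustration; the slides do not specify the container format used at SCC.

```python
import hashlib
import io
import json
import tarfile
import uuid


def build_container(files):
    """Pack {name: bytes} into a tar bitfile with an embedded manifest.

    Returns (reference, container_bytes, container_checksum).
    """
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    members = {"MANIFEST.json": json.dumps(manifest, sort_keys=True).encode(), **files}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    blob = buffer.getvalue()
    reference = str(uuid.uuid4())  # stays fixed for the lifetime of the container
    return reference, blob, hashlib.sha256(blob).hexdigest()


def read_manifest(blob):
    """The container describes itself: the manifest is read from the bitfile alone."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return json.load(tar.extractfile("MANIFEST.json"))
```

Because the manifest travels inside the bitfile, a later migration needs neither the original middleware nor knowledge of the content or its owner to check what the container holds.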

The operator (SCC) is not responsible for passing content on to the next generation of owners

data ownership has a lifetime that must be known in advance, but can be renewed

data is stored only by agreement between a data provider and SCC, formalised in a service level agreement for archives