STEINBUCH CENTRE FOR COMPUTING - SCC
Why long time storage does not equate to archive
Jos van Wezel
HUF Toronto 2015
Archive tasks at KIT
State of Baden-Württemberg
universities, museums, state archives, libraries
currently 9 PB / 1 billion files
est. growth 0.5 PB / year
HLRS Stuttgart
finalised projects from e.g. climatology, CFD, molecular dynamics, engineering
currently 1.5 PB / 7.5 million files
est. growth 0.5 PB / year
GridKa (German LHC Tier 1 center)
support of all 4 LHC experiments
currently 16 PB / 13 million files
est. growth 1 PB / year
Roadmap
EU projects: Human Brain Project, EUDAT, DARIAH
Helmholtz Data Center
BW-HPC-C5
KIT has been designated as central site for long time digital data storage in the State of Baden-Württemberg
Move archived data out of TSM to HPSS.
Long time storage business cases
A. Project Store
• Temporary storage for large data, 3 to 4 years
• Non active data that active experiments depend on
• Simple upload interface
B. Good scientific practice store
• All of A above plus
• At least 10 years, conformity with good scientific practice
• Minimal set of meta data, requires PIDs
• Expiration, integrity checks
C. Long term store, archive
• All of B above plus
• Forever or as long as is needed
• Interaction with community for curation
Archive layer model
[Figure: archive layer model. Labels: content preservation, referencing (Libraries, Museums, Communities); bit preservation (Data centres); disk/tape storage; data center interface; data exploration (web) interface; Portals; Collections]
Communities take care of selection, ordering, ingesting of data
The community, or ideally archivists representing the
community, take care of curation, deletions, pruning, formats, …
Libraries and museums are confronted with large data sets and subsequent handling
Content Preservation
The data centers take care of bit preservation
Bit preservation is not a done deal
Access to preservation information is not standardised
The bits that go in are the bits that come out
technical metadata
Classic view
Data sources
Archive Management
(Archive) Storage
HPSS
LTFS
Black Pearl
other
Content Management Systems, DSpace, Fedora, ArchiveMatica, Invenio, ePrints, etc…
Application access to data in deep archives
Data interfaces assume disk interface i.e. on-line storage
Archivematica, Fedora, Invenio, Fixity etc.
DB access
Interfaces still stick to POSIX
No consideration with regard to off-line, or replicated, storage
Assume different storage is handled behind the scenes
No one waits for information to become on-line
Unless you know it will come on-line
How to handle asynchronous data access
Requirements from the provider, i.e. data center, perspective
How to do efficient offline check summing (with user provided method)
Applications should not DOS the (tape) storage systems
An upstream reference
how to get from content to the ‘owner of the data’
Data center exchange
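The requirement that applications should not DOS the (tape) storage systems can be illustrated with a small throttle. This is a minimal sketch, not something taken from the talk: `RecallThrottle`, `fake_recall` and `recall_all` are hypothetical names, and the recall itself is simulated with a short sleep rather than a call into HPSS or TSM.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RecallThrottle:
    """Cap the number of tape recalls that run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous recalls observed

    def recall(self, path, recall_fn):
        # Block until a slot is free, then run the (hypothetical) recall.
        with self._slots:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return recall_fn(path)
            finally:
                with self._lock:
                    self._active -= 1


def fake_recall(path):
    # Stand-in for a real tape recall; the sleep simulates mount latency.
    time.sleep(0.01)
    return f"staged:{path}"


def recall_all(paths, max_concurrent=2, workers=8):
    """Issue many recall requests while never exceeding max_concurrent."""
    throttle = RecallThrottle(max_concurrent)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: throttle.recall(p, fake_recall), paths))
    return results, throttle.peak
```

However many application threads ask for data, the back-end only ever sees `max_concurrent` recalls in flight; a real deployment would probably also want to group requests by tape, which this sketch does not attempt.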
Rough edges and requirements
storing [m|b]illions of files
data exchange
roles and rights, accounting
wide range of file sizes
assured reliability
storage quality
standard protocols for data transport (*FTP, S3, …)
[Figure: archive service architecture. Labels: Archive service (REST); EUDAT development; Archive Logic; VIM; Meta DB; IDM; Auth; storage backends; disk storage; B2SAFE; LTFS; iRODS; Archive service path; Data path]
• meta data (PREMIS)
• checksums
• storage quality
archive access
enhance fixity support
user provided checksum
‘lazy’ checking and periodic integrity verification
‘we need to be more intelligent if we want to scale in the future’
poc implementation for HPSS and TSM
with PSNC (Poland) and MPCDF (was RZG)
integrate in current EUDAT B2SAFE service
enable bring-on-line functionality
on-line and off-line data represent two different storage qualities
task in the INDIGO-DataCloud project
what else?
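The fixity items above (user provided checksum, 'lazy' checking and periodic integrity verification) could look roughly like the following. It is only a sketch under stated assumptions: `FixityRecord` and `due_for_check` are hypothetical names, the checksum method is whatever algorithm name the user supplies to Python's `hashlib.new`, and nothing here reflects the actual HPSS, TSM or B2SAFE implementation.

```python
import hashlib


def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Stream the file through the digest algorithm named by the user."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class FixityRecord:
    """User-provided checksum plus the time of the last verification."""

    def __init__(self, path, algorithm, expected):
        self.path = path
        self.algorithm = algorithm
        self.expected = expected
        self.last_verified = None  # seconds since epoch; None means never

    def verify(self, now):
        """Recompute the checksum and remember when this was done."""
        ok = checksum(self.path, self.algorithm) == self.expected
        self.last_verified = now
        return ok


def due_for_check(records, now, max_age):
    """'Lazy' selection: records never checked, or last checked max_age ago."""
    return [r for r in records
            if r.last_verified is None or now - r.last_verified >= max_age]
```

A periodic job would call `due_for_check` and verify only what it returns, ideally when the file is on-line anyway, so that integrity checks do not trigger extra tape recalls.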
Conclusions
A data archive is more than long term data storage
HPSS is used in HPC and increasingly as cold storage
engine for many other data sources i.e. communities
With some user space effort ARCHIVE requirements can
be implemented on current HPSS
Work on HPSS from bottom and top
Buddy Bland: We must make it easy for sites to match the types of
storage to the requirements of the users.
We must make it easy for users to match the types of data to the
requirements of the sites.
Challenges
Determining the rate of bit rot and taking countermeasures
we have no metric for archive (bit preservation) quality
this should somehow be part of a site certification
Intelligent selection of candidates for caching
getting data from a deep (i.e. cost effective) archive takes time
on-line caches in front of archives for fast access
what data goes into and what data stays in the cache
Notions: trends, markets, relationships: maybe Google can help?
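As a baseline for the caching question above, the simplest policy is least-recently-used eviction with a byte budget. The sketch below is that baseline only, not the "intelligent" selection the slide asks for; `DiskCache` and its methods are hypothetical names.

```python
from collections import OrderedDict


class DiskCache:
    """Least-recently-used cache with a byte budget, in front of an archive."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._files = OrderedDict()  # name -> size, oldest access first

    def access(self, name, size):
        """Return True on a cache hit, False if the file had to be recalled."""
        if name in self._files:
            self._files.move_to_end(name)  # mark as most recently used
            return True
        if size > self.capacity:
            return False  # too large to cache at all; serve it uncached
        # Evict least recently used files until the new one fits.
        while self.used + size > self.capacity:
            _, evicted_size = self._files.popitem(last=False)
            self.used -= evicted_size
        self._files[name] = size
        self.used += size
        return False

    def cached(self):
        """Names currently in the cache, least recently used first."""
        return list(self._files)
```

Any smarter policy (access trends, relationships between data sets) can be compared against this one by replaying the same access log and counting hits.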
Uniform interface for “cold” storage
API that goes beyond classic POSIX and takes longer access times
into account
Exchange data sets or data volumes with peers for redundancy or
for contract changes: asynchronous operations
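One way to picture an interface that goes beyond POSIX is to make the on-line/off-line state explicit and let a request return at once with an estimated wait, instead of blocking in `read()`. The following is a toy in-memory model under that assumption; `ColdStore`, `State`, `NotOnline` and the fixed `stage_seconds` estimate are all hypothetical and do not correspond to any existing HPSS, XAM or EUDAT API.

```python
import enum


class State(enum.Enum):
    OFFLINE = "offline"
    STAGING = "staging"
    ONLINE = "online"


class NotOnline(Exception):
    """Raised instead of blocking when data is not yet on disk."""


class ColdStore:
    """Toy model of an archive with explicit on-line/off-line state."""

    def __init__(self, objects, stage_seconds=3600):
        self._data = dict(objects)
        self._state = {name: State.OFFLINE for name in self._data}
        self.stage_seconds = stage_seconds  # assumed typical recall time

    def bring_online(self, name):
        """Request a recall; return at once with (state, estimated wait in s)."""
        if self._state[name] is State.OFFLINE:
            self._state[name] = State.STAGING
        wait = 0 if self._state[name] is State.ONLINE else self.stage_seconds
        return self._state[name], wait

    def status(self, name):
        return self._state[name]

    def complete_staging(self, name):
        """Stand-in for the back-end reporting that a recall has finished."""
        if self._state[name] is State.STAGING:
            self._state[name] = State.ONLINE

    def read(self, name):
        if self._state[name] is not State.ONLINE:
            raise NotOnline(f"{name} is {self._state[name].value}")
        return self._data[name]
```

The caller learns immediately that the data will come on-line and roughly when, which is the information a plain POSIX `open()` cannot convey.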
Archive interface
work in progress of the RADAR and bwArchive projects
revisit the 2008 SNIA XAM standard
Asynchronous access to data on off-line storage
XAM comes with an emulator /
reference implementation
XAM allows modular adaptation to
different back-ends through
Vendor Interface Modules (VIM)
copyright: Jan Potthoff / SCC
What is an Archive? SCC view
Definition of an Archive:
“A long time file repository containing a massive collection of digital
information”
No loss of data or degradation of quality for an agreed amount of time
SCC ensures proper integrity
Does not depend on proprietary middleware or storage formats
migration should be possible in some number of years and must be independent from content or owner
Data is stored in (self-defining) containers
containers are not files, they are “bitfiles” with a checksum, they may contain one or more files.
Reference to the container is fixed for the lifetime of the container
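The container idea above (a "bitfile" with a checksum that may hold one or more files and describes itself) can be sketched with a plain tar archive and an embedded manifest. This is an illustration only: the tar format, the `MANIFEST.json` member name, SHA-256, and the functions `build_container` and `verify_container` are assumptions, not the container format SCC actually uses.

```python
import hashlib
import io
import json
import tarfile


def build_container(files):
    """Pack {name: bytes} into one tar 'bitfile' with an embedded manifest.

    Returns (container_bytes, container_checksum).
    """
    manifest = {name: hashlib.sha256(data).hexdigest()
                for name, data in files.items()}
    members = dict(files)
    members["MANIFEST.json"] = json.dumps(manifest, sort_keys=True).encode()

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in sorted(members):
            info = tarfile.TarInfo(name=name)
            info.size = len(members[name])
            tar.addfile(info, io.BytesIO(members[name]))
    blob = buffer.getvalue()
    return blob, hashlib.sha256(blob).hexdigest()


def verify_container(blob, expected_checksum):
    """Check the container checksum, then every file against the manifest."""
    if hashlib.sha256(blob).hexdigest() != expected_checksum:
        return False
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        manifest = json.load(tar.extractfile("MANIFEST.json"))
        return all(
            hashlib.sha256(tar.extractfile(name).read()).hexdigest() == digest
            for name, digest in manifest.items()
        )
```

Because the manifest travels inside the container, a later migration can verify the content without consulting the original middleware, which fits the requirement of not depending on proprietary formats.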
The operator (SCC) is not responsible for passing content on to the next
generation of owners
data ownership has a lifetime that must be known in advance, but can be renewed
data is stored after consent of a data provider and SCC. It is effectuated in a service level agreement for archives.