Research Data Storage Infrastructure (RDSI) Project
Recap from the Node Workshop
(Cherry-picked)
* Higher Tiered DCs cost roughly twice as much as Lower Tiered DCs.
* However, a robust “Higher Tiered”-like service can still be provided:
  * Using co-operating Lower Tiered DCs.
  * With distributed and/or replicated mechanisms.
  * If a service (partially) fails, another DC can temporarily provide it.
  * If a DC fails, other DCs can provide its services temporarily.
* Loss of service is pardonable. Loss of data is unforgivable.
What’s DaSh all about?
* “Developing sufficient elements of potential technical architectures for data interoperability and sharing.”
* “So that its use can be appropriately specified in the call for nodes proposal.”
* Mile-high view of technical architectures to get data into and out of the RDSI node(s).
* Ensure (meta)data durability and curation.
  * Loss of (meta)data is a capital offence.
* Ensure data scalability.
  * Storage capacity, moving data into and out of the node(s).
* Ensure end-user usability.
  * Provide a good end-user experience.
* DaSh straw-man seeks community opinion on the various possible architectures.
Building Blocks
[Diagram labels: GRIDs (SRM with protocol negotiation; gsiftp, https, dCap, DPM, xrootd; wide area transfers). Clouds (REST, S3; wide area transfers). Re-exported FS (NFS, CIFS, WebDAV, FUSE). HSM, Tiers, Storage Classes.]
iRODS and Federation
* Federation is a feature in which separate iRODS Zones (iRODS instances) can be integrated.
* When zones 'A' and 'B' are federated, they work together.
* Each zone continues to be separately administered.
* Users in the multiple zones, if given permission, will be able to access data and metadata in the other zones.
* No user passwords are exchanged.
* Zone admins set up trust relationships to other zones.
ARCS Data Fabric
[Diagram: iCAT only, hosted on NeCTAR NSP; four iRODS servers with tape and three iRODS servers without tape.]
Node’s Eye View. (N=6)
No Federation.
Node’s Eye View. (N=6)
Too much Federation.
Too much confusion!!
Node’s Eye View. (N=6)
Just right Federation.
[Diagram: one Master ICAT and six Slave ICATs.]
Distributed Fault-Tolerant Parallel FS over N=6 nodes
Distributed vs Federated
[Diagram labels: GRIDs (SRM with protocol negotiation; gsiftp, https, dCap, DPM, xrootd; wide area transfers). Clouds (REST, S3; wide area transfers). Re-exported FS (NFS, CIFS, WebDAV, FUSE). HSM, Tiers, Storage Classes.]
Distributed Pros and Cons
* Distributed over a larger number of nodes.
  * Geographic scaling as well as node scaling.
  * Inherent data replication.
* Fault Tolerant.
  * A storage brick took a lickin’ but the service keeps on tickin’.
  * A node took a lickin’ but the service keeps on tickin’.
* Parallel I/O.
  * All nodes can participate to move data. High aggregate BW.
* Single global namespace.
  * Rather than separate logical namespaces.
* Cost Effective.
  * Use cheap hardware. Big disks over fast disks.
File Replication
* Whole file
  * Duplicated and stored on multiple bricks.
* Slices of file
  * File sliced and diced, slices stored on multiple bricks.
  * A single brick may not contain the whole file.
* Erasure Codes
  * Parity Blocks (used in RAID)
  * Reed-Solomon
    * Over-sampled polynomial constructed from data.
  * Add erasure codes and slice file
    * Need M of N pieces to recover file (M < N)
    * Can store a slice on multiple bricks. Extra redundancy.
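The "M of N" idea can be sketched in a few lines. This is a toy illustration of the over-sampled polynomial approach, not a production Reed-Solomon codec: it assumes a prime field of size 257 (real codecs normally use GF(2^8), so that every symbol fits in a byte), and the names `interpolate_at`, `encode` and `decode` are hypothetical.

```python
from itertools import combinations

P = 257  # smallest prime above 255, so every byte value is a field element


def interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n):
    """Turn M data symbols into N slices (x, y). Slices 0..M-1 carry the
    data itself; slices M..N-1 are extra samples of the same polynomial."""
    points = list(enumerate(data))
    return [(x, interpolate_at(points, x)) for x in range(n)]


def decode(slices, m):
    """Rebuild the M data symbols from any M surviving slices."""
    if len(slices) < m:
        raise ValueError("not enough slices to recover the data")
    return [interpolate_at(slices[:m], x) for x in range(m)]


data = list(b"RDSI")          # M = 4 symbols
slices = encode(data, 6)      # N = 6 slices, one per brick

# Any 4 of the 6 slices are enough, so any 2 bricks may be lost.
for survivors in combinations(slices, 4):
    assert decode(list(survivors), 4) == data
```

Note that an extra sample may take the value 256, which does not fit in a byte. That is one reason real implementations work in GF(2^8) instead of a prime field.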
SurfNET Survey of Wide Area Distributed Storage. (Circa 2010) [1/4]
http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
Requirements:
* Scalable.
  * Capacity, performance and concurrent access.
  * Expandable storage without degrading performance.
* High Availability.
  * Keeps data available to apps and clients.
  * Even in the event of a malfunction.
  * Or system reconfiguration.
SurfNET Survey of Wide Area Distributed Storage. (Circa 2010) [2/4]
http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
* Durability
  * No data is lost from a single software or hardware failure.
  * Automatically maintain minimum number of replicas.
  * Support backup to tape.
* Performance at Traditional SAN/NAS Level.
  * Comparable performance to traditional non-distributed SAN/NAS.
* Dynamic Operation.
  * Availability, durability, performance configurable per application.
    * Reduces costs by not running at the highest support level all the time.
    * Allows users, apps, sysadmins to balance cost vs features.
  * System should be self-configurable, self-tunable.
  * Support data movement between different storage technologies.
SurfNET Survey of Wide Area Distributed Storage. (Circa 2010) [3/4]
http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
* Cost Effective
  * Must be possible to build, configure, run and maintain in a cost effective manner.
  * Must work with commodity hardware.
    * Hardware may not be as reliable as high end hardware.
  * Configuration of system and its maintenance must be easy and straightforward.
  * Operation of system is energy efficient.
  * License fees for software, when applicable, must be limited.
* Generic Interfaces.
  * System offers generic interfaces to apps and clients.
    * POSIX interface. POSIX/NFSv4.1 semantics.
SurfNET Survey of Wide Area Distributed Storage. (Circa 2010) [4/4]
http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
* Protocols Based on Open Standards
  * System built using open protocols.
  * Reduces vendor lock-in.
  * More economical in the long run.
* Multi-Party Access
  * System must support access by multiple geographically dispersed parties at the same time.
SurfNET Survey of Wide Area Distributed Storage. (Circa 2010)
http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf
Candidates
* Lustre
* GlusterFS
* GPFS
* Ceph
+ dCache
Non-Candidates
* XtreemFS
* MogileFS
* NFS v4.1 (pNFS)
* ZFS
* VERITAS FS
* Parascale
* CAStor
* Tahoe-LAFS
* DRBD
The DEISA Global File System at European Scale
TeraGrid
(GPFS & Lustre)
SurfNET Survey of Wide Area Distributed Storage + dCache

| | Lustre | GlusterFS | GPFS | Ceph | dCache |
|---|---|---|---|---|---|
| Owner | Oracle | Gluster | IBM | Newdream | dCache.org |
| Licence | GNU GPL | GNU GPL | commercial | GNU GPL | DESY |
| Data primitive | Object (file) | Object (file) | block | Object (file) | Object (file) |
| Data placement | Round robin + free space heuristics | Different strategies via modules | Policy based | Placement groups, random mappings | Policy based |
| Metadata | Max 2 metadata servers | Stored with file | Distribute over storage servers | Multiple metadata servers | pnfs (postgreSQL) |
| Storage tiers | Pools of object targets | unknown | Policy defined | CRUSH rules | Policy defined |
SurfNET Survey of Wide Area Distributed Storage + dCache.

| | Lustre | GlusterFS | GPFS | Ceph | dCache |
|---|---|---|---|---|---|
| Failure handling | Assuming reliable nodes | Assuming unreliable nodes | Assuming reliable nodes, failure groups | Assuming unreliable nodes | Assuming reliable nodes |
| Replication | Server side (failover pairs) | Client side | Server side | Server side | Server side |
| WAN deployment example | TeraGrid | City Cloud (Swedish IaaS provider) | TeraGrid, DEISA | unknown | Fermilab, Swegrid, NDGF |
| Client interface | Native client, FUSE, CIFS, NFS | Native client, FUSE | Native client, exports NFSv3, CIFS, pCIFS, WebDAV, SRM (StoRM) | Native client, FUSE | NFSv4.1, HTTP, WebDAV, GridFTP, Xrootd, SRM, dCap |
| Node types | Clients, metadata, objects | Client, data | Client, data | Clients, metadata, objects | Clients, metadata, objects |
WAN Data Caching and Performance
Bringing data closer to where it is consumed.
* Researchers are naturally distributed over the city and country.
  * Some may not benefit from the high speed networks provided by AARNet and the NRN due to their location.
* Can RDSI help these spatially disenfranchised?
  * Yes, (sort of).
* Take the model of Content Delivery Networks.
  * e.g. Akamai, Amazon CloudFront, etc.
  * Web content, videos etc. are cached close to the end user.
* But focus on data caching rather than content caching.
  * May not provide the same experience as the spatially franchised.
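The CDN-style model above amounts to a read-through cache sitting near the end user. A minimal sketch, assuming a small LRU cache in front of a slow origin node; the class `ReadThroughCache`, the file names and the in-memory `origin` dictionary are all hypothetical stand-ins, not part of any RDSI design.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Tiny LRU read-through cache standing in for a storage pool near the user."""

    def __init__(self, fetch_from_origin, capacity):
        self.fetch_from_origin = fetch_from_origin  # slow WAN read (hypothetical)
        self.capacity = capacity                    # max number of cached objects
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)      # mark as most recently used
            return self.store[key]
        self.misses += 1
        value = self.fetch_from_origin(key)  # one trip over the WAN
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used
        return value


origin = {"genome.dat": b"ACGT", "survey.csv": b"1,2,3"}  # pretend remote node
cache = ReadThroughCache(origin.__getitem__, capacity=1)

cache.get("genome.dat")   # miss: fetched from the origin node
cache.get("genome.dat")   # hit: served locally
cache.get("survey.csv")   # miss: evicts genome.dat
print(cache.hits, cache.misses)
```

Only the first access to a data set pays the WAN cost. How much this helps depends on how often the same collections are re-read at the remote site.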
WAN Data Caching
Continued …
* dCache is a distributed cache system.
  * Locate a dCache pool close to the spatially disenfranchised.
  * dCache admin can populate required data collections to the spatially disenfranchised using standard SRM processes.
  * Potentially a (reasonably) fast parallel transfer.
* BioTorrents <http://www.biotorrents.net>
  * Allows scientists to rapidly share their results, datasets, and software using the popular BitTorrent file sharing technology.
  * All data is open-access and any illegal filesharing is not allowed on BioTorrents.
* Or RDSI can provide BitTorrent seeders itself from its nodes.
Data Durability.
Things that go bump in the night (or not!)
* Data Durability is an absolute necessity.
  * RDSI must provide a safe and enduring home for research data.
  * This might be more difficult than it appears!
* The enemy is …
  * Physics.
  * The world is a complex quantum/probabilistic system.
  * And so is all your computing and storage infrastructure.
* Random events in your infrastructure will create:
  * Bit Rot and Silent Corruptions.
Data Durability.
Sources of Bit Rot and Silent Corruptions
From “Silent Corruptions”, Peter.Kelemen, CERN
[Diagram labels, where corruption can arise: physical magnetic media, user space, VM, memory, filesystems, block layer, SCSI layer, low-level drivers, controller firmware, storage firmware, disk mechanics, plus all interconnecting cables, cosmic rays/sun spots, EM radiation, etc.]
[Diagram labels, failure modes: wear out, bugs in FW, inter-op issues, ECC errors, corrupted metadata, corrupted data, flipped bits, latent sector errors, lost writes, torn writes, misdirected writes.]
Data Durability.
Expected Background Bit Error Rate (BER)
From “Silent Corruptions”, Peter.Kelemen, CERN
* NIC/Link/HBA: 10^-10 (1 bit in ~1.1 GB)
  * Check-summed, retransmit if necessary
* Memory: 10^-12 (1 bit in ~116 GB)
  * ECC
* SATA Disk: 10^-14 (1 bit in ~11.3 TB)
  * Various error correction codes
* Enterprise Disk: 10^-15 (1 bit in ~113 TB)
  * Various error correction codes
* Tape: 10^-18 (1 bit in ~111 PB)
  * Various error correction codes
* Data may be encoded up to five or more times as it travels to and from physical disk/tape to user space.
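The "1 bit in ~X" figures follow directly from the error rates: a BER of 10^-k means one bad bit per 10^k bits on average, i.e. per 10^k / 8 bytes. A small sketch of that arithmetic, assuming the slide's sizes are in binary (1024-based) units; the helper names `bytes_per_bit_error` and `human` are hypothetical.

```python
def bytes_per_bit_error(exponent):
    """Average number of bytes transferred per flipped bit when BER = 10**-exponent."""
    return 10 ** exponent / 8


def human(nbytes):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"):
        if nbytes < 1024 or unit == "EiB":
            return f"{nbytes:.3g} {unit}"
        nbytes /= 1024


for layer, exponent in [("NIC/Link/HBA", 10), ("Memory", 12), ("SATA disk", 14),
                        ("Enterprise disk", 15), ("Tape", 18)]:
    print(f"{layer}: 1 bit in ~{human(bytes_per_bit_error(exponent))}")
```

For a rate of 10^-18 the same arithmetic gives roughly 111 PiB per expected bit error.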
Data Durability.
The errors you know. The errors you don’t know.
From “Silent Corruptions”, Peter.Kelemen, CERN
There are known errors;
there are errors we know we know.
We also know there are known unknown errors;
that is to say we know there are some things we do not
know.
But there are also unknown unknown errors;
the ones we don't know we don't know.
Data Durability.
The errors you know. The errors you don’t know.
From “Silent Corruptions”, Peter.Kelemen, CERN
* There are Data Errors that you will know about.
  * Log messages.
  * SMART messages.
  * Detection: SW/HW-level with error messages.
  * Correction: SW/HW-level with warnings.
  * If you’re really lucky your kernel will panic so you’ll know something happened.
* There are Data Errors that you will never know about.
  * As far as your storage infrastructure knows, that write/read was executed perfectly.
  * In reality you will probably never know the data has been corrupted.
Data Durability.
How to discover the unknown unknowns.
From “Silent Corruptions”, Peter.Kelemen, CERN
* Checksums (CRC32, MD5, SHA1, ...)
  * Checksum (meta)data.
  * Transport checksum with (meta)data for later comparison.
* Error detection and correction codings.
  * Detect errors caused by noise, etc. (See checksums.)
  * Correct detected errors and reconstruct the original, error-free data.
  * Backward error correction:
    * Automatic retransmit on error detection.
  * Forward error correction:
    * Encode extra redundant data.
    * Regenerate data from forward error codes.
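Carrying a checksum alongside the (meta)data can be sketched with the Python standard library. This is a minimal illustration, assuming the checksum record travels with the file's metadata; the record layout and the names `make_record` and `verify` are hypothetical.

```python
import hashlib
import zlib


def make_record(payload: bytes) -> dict:
    """Compute checksums once, at the source, to travel with the metadata."""
    return {
        "size": len(payload),
        "crc32": zlib.crc32(payload),
        "sha1": hashlib.sha1(payload).hexdigest(),
    }


def verify(payload: bytes, record: dict) -> bool:
    """Recompute the checksums at the destination and compare."""
    return make_record(payload) == record


original = b"research data that must not rot"
record = make_record(original)

# Simulate a silent corruption: flip one bit in transit or at rest.
corrupted = bytearray(original)
corrupted[5] ^= 0x01

print(verify(original, record))          # True
print(verify(bytes(corrupted), record))  # False
```

A checksum only detects the damage. Recovering the data still needs a retransmit (backward correction), a second copy, or redundant coding (forward correction).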
Data Durability.
Silent Corruptions and CERN
From “Silent Corruptions”, Peter.Kelemen, CERN
* Circa 2007. 9 PB tape. 4 PB disk. 6000 nodes. 20000 drives, 1200 RAID.
* Probabilistic storage integrity check (fsprobe) on 4000 nodes.
  * Write known bit pattern.
  * Read it back.
  * Compare and alert when mismatch found.
  * 6 cycles over 1 hour each.
  * Low I/O footprint for background operation on 2 GB file.
* Keep complexity to the minimum.
  * Use static buffers.
  * Attempt to preserve details about detected corruptions for further analysis.
Data Durability.
Silent Corruptions and CERN
From “Silent Corruptions”, Peter.Kelemen, CERN
* 2000 incidents reported over 97 PB of traffic.
  * 6/day on average observed!
* 192 MB of silent data corruption.
* 320 nodes affected over 27 hardware types.
* Multiple types of corruptions.
* Some corruptions are transient.
* Overall BER considering all the links in the chain:
  * 3x10^-7.
Data Durability.
Types of Silent Corruptions
From “Silent Corruptions”, Peter.Kelemen, CERN
* Type I
  * Single/double bit flip errors. Usually persistent.
  * Usually bad memory (RAM, cache, etc.)
  * Happens with expensive ECC memory too.
* Type II
  * Small, 2^n-sized random chunks (128-512 bytes) of unknown origin.
  * Usually transient.
  * Possible OOM Killer or corrupted SLAB/SLUB allocator.
* Type III
  * Multiple large chunks of 64K, “old file data”. I/O command timeouts.
  * Usually persistent.
* Type IV
Data Durability.
What Can Be Done?
From “Silent Corruptions”, Peter.Kelemen, CERN
* Self-examining/healing hardware.
* WRITE-READ cycles before ACK.
* Check-summing, though not necessarily enough.
* End-to-end check-summing.
* Store multiple copies.
* Regular scrubbing of RAID arrays.
* Data refresh. Re-read cycles on tapes.
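A write-read cycle before acknowledging can be sketched as below. This is an illustration under stated assumptions, not how any particular storage product does it: the function `write_verified` is hypothetical, and a plain read-back like this may be served from the OS page cache rather than from the medium.

```python
import hashlib
import os
import tempfile


def write_verified(path: str, data: bytes) -> bool:
    """Write data, read it back, and acknowledge only if the checksums match."""
    expected = hashlib.sha1(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    # Caveat: this read may be served from the page cache rather than the
    # medium, so a real implementation would need to bypass or drop the cache.
    with open(path, "rb") as f:
        actual = hashlib.sha1(f.read()).hexdigest()
    return actual == expected


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "block.bin")
    acked = write_verified(target, os.urandom(1 << 20))
    print("ACK" if acked else "NACK: retry or write elsewhere")
```

The same compare step, run periodically over stored copies instead of once at write time, is the essence of scrubbing.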
Data Durability.
The solutions. ZFS. The Good.
* Developed by Sun (now Oracle) on Solaris.
* Designed from the ground up with a focus on data integrity.
* Combined filesystem and logical volume manager.
* RAID-Z, RAID-Z2, RAID-Z3, or mirrored.
* Copy-on-write. Transactional operation.
* Built-in end-to-end data integrity.
* Data/metadata checksum all the way to the root.
* Always consistent on disk. No fsck or journaling.
* Automatic self-healing.
* Intelligent online scrubbing and resilvering.
* Very large filesystem limits. Max. 256 ZB FS.
* Deduplication.
* Snapshots.
Data Durability.
The solutions. ZFS. The Bad.
* Supported on Solaris only.
  * OpenSolaris is no more.
* Kernel ports for FreeBSD and NetBSD.
  * Using OpenSolaris kernel source code.
* Linux port via ZFS-FUSE.
  * Kernel space good. User space not so good.
* ZFS on Linux.
  * Supported by Lawrence Livermore National Laboratory.
  * Issues with CDDL and GPL license compatibility in the kernel.
  * Solaris Portability Layer/shim to the rescue.
Data Durability.
The solutions. ZFS for Lustre.
* 1999: Peter Braam from CMU creates Lustre.
  * A GPL massively parallel distributed file system.
* 2003: Braam creates Cluster File Systems Inc to continue work.
* 2007: Sun acquires Cluster File Systems Inc.
  * Works to combine ZFS and Lustre.
  * High performance parallel FS with end-to-end data integrity.
  * But only supported on Solaris.
* 2009: LLNL starts porting ZFS kernel to Linux.
* Oracle acquires Sun.
* 2010: Oracle announces ZFS/Lustre only for Solaris.
* 2011: LLNL starts ZFS/Lustre port for Linux.
* Late 2011: LLNL plans ZFS/Lustre FS.
Data Durability.
The solutions. DataDirect Networks S2A Technology.
* SATA storage with:
  * Enterprise-class performance.
  * Reliability and data integrity.
* Automatic self-healing.
  * Detects anomalies and begins journaling all writes while recovering operations.
* Dynamic MAID (D-MAID).
  * Save additional power and cooling by powering down the platters, where over 80% of power is consumed.
Community Input Time.
* Are we barking up the right tree?
* Are we barking up the wrong tree?
* Is there even a tree in the first place?
Building Blocks
* Are the base building blocks sufficient?
  * If not, what should be added?
* Is there a need for additional data transfer protocols?
  * If so, what should be added?
* Is there a need for additional file system protocols?
  * If so, what should be added?
* What additional public cloud storage infrastructure should RDSI consider?
* What additional private cloud storage infrastructure should RDSI consider?
Federated vs Distributed.
* Should RDSI continue to embrace the federated iRODS model?
* Should RDSI embrace the Distributed FS model?
* Should RDSI embrace both the federated and distributed models?
Distributed Fault Tolerant Parallel Filesystems.
* If RDSI chooses to use a Distributed Fault Tolerant Parallel filesystem component, are there such systems that we have not yet considered?