Research Data Storage Infrastructure (RDSI) Project. DaSh Straw-Man

(1)

Research Data Storage Infrastructure (RDSI) Project

(2)

Recap from the Node Workshop

(Cherry-picked)

*

Higher Tiered DCs cost roughly twice as much as Lower Tiered DCs.

*

However, one can provide a robust “Higher Tiered”-like service.

*

Using co-operating Lower Tiered DCs.

*

With distributed and/or replicated mechanisms.

*

If a service (partially) fails another DC can temporarily provide it.

*

If a DC fails other DCs can provide its services temporarily.

*

Loss of service pardonable. Loss of data unforgivable.
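
The argument for co-operating Lower Tiered DCs can be made concrete with a back-of-the-envelope availability estimate. The sketch below assumes that DC outages are statistically independent and uses a purely hypothetical per-DC availability figure; neither assumption comes from the workshop, and the calculation covers service availability only, not data loss.

```python
def combined_availability(per_dc_availability: float, dc_count: int) -> float:
    """Probability that at least one of dc_count independent DCs is up."""
    if not 0.0 <= per_dc_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if dc_count < 1:
        raise ValueError("need at least one DC")
    # The service is down only if every DC is down at the same time.
    return 1.0 - (1.0 - per_dc_availability) ** dc_count


if __name__ == "__main__":
    # 0.99 is an illustrative figure only, not a measured value.
    for n in (1, 2, 3):
        print(n, combined_availability(0.99, n))
```

Under these assumptions, each additional co-operating DC multiplies the expected downtime by the single-DC unavailability, which is why several cheaper DCs can approach the service level of one expensive DC.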

(3)

*

What's DaSh all about?

*

“Developing sufficient elements of potential technical architectures for data interoperability and sharing.”

*

“So that its use can be appropriately specified in the call for nodes proposal.”

*

Mile high view of technical architectures to get data into and out of the RDSI node(s).

*

Ensure (meta)data durability and curation.

*

Loss of (meta)data is a capital offence.

*

Ensure data scalability.

*

Storage capacity, moving data into and out of the node(s).

*

Ensure End-user usability.

*

Provide a good end-user experience.

*

DaSh straw-man seeks community opinion on the various possible architectures.

(4)

Building Blocks

[Diagram labels]

GRIDs: SRM (protocol neg.); wide area xfers via gsiftp, https, dCap, DPM, xrootd.

Clouds: wide area xfers via REST, S3.

Re-exported FS: NFS, CIFS, WebDAV, FUSE.

HSM, Tiers, Storage Classes.

(5)

*

iRODS and Federation

*

Federation is a feature in which separate iRODS Zones

(iRODS instances) can be integrated.

*

When zones 'A' and 'B' are federated, they work

together.

*

Each zone continues to be separately administered.

*

Users in the multiple zones, if given permission, will

be able to access data and metadata in the other

zones.

*

No user passwords exchanged

*

Zone admins set up trust relationships to other

(6)

ARCS Data Fabric

[Diagram labels: iCAT only, hosted on NeCTAR NSP; four iRODS servers with tape; three iRODS servers without tape.]

(7)

Node’s Eye View. (N=6)

No Federation.

(8)

Node’s Eye View. (N=6)

Too much Federation.

Too much confusion!!

(9)

Node’s Eye View. (N=6)

Just right Federation.

[Diagram labels: one Master ICAT; six Slave ICATs.]

(10)

Distributed vs Federated

Distributed Fault-Tolerant Parallel FS over N=6 nodes

[Diagram labels]

GRIDs: SRM (protocol neg.); wide area xfers via gsiftp, https, dCap, DPM, xrootd.

Clouds: wide area xfers via REST, S3.

Re-exported FS: NFS, CIFS, WebDAV, FUSE.

HSM, Tiers, Storage Classes.

(11)

Distributed Pros and Cons

*

Distributed over a larger number of nodes.

*

Geographic scaling as well as node scaling.

*

Inherent data replication.

*

Fault Tolerant.

*

A storage brick takes a lickin’ but the service keeps on tickin’.

*

A node takes a lickin’ but the service keeps on tickin’.

*

Parallel I/O.

*

All nodes can participate to move data. High aggregate BW.

*

Single global namespace.

*

Rather than separate logical namespaces.

*

Cost Effective

*

Use cheap hardware. Big disks over fast disks.

(12)

File Replication

*

Whole file

*

Duplicated and stored on multiple bricks.

*

Slices of file

*

File sliced and diced, slices stored on multiple bricks.

*

A single brick may not contain the whole file.

*

Erasure Codes

* Parity Blocks

* (used in RAID)

* Reed-Solomon

* Over sampled polynomial constructed from data.

* Add Erasure codes and slice file

* Need M of N pieces to recover file (M < N)

* Can store a slice on multiple bricks. Extra redundancy.
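
The "M of N pieces" idea can be illustrated with the simplest possible erasure code: a single XOR parity slice, as used in RAID. This is a minimal sketch, not Reed-Solomon; it only handles N = M + 1, so it can recover exactly one lost piece. The function names `encode` and `decode` are made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, m: int) -> List[bytes]:
    """Split data into m equal slices (zero-padded) and append one XOR parity slice."""
    if m < 1:
        raise ValueError("m must be at least 1")
    slice_len = -(-len(data) // m)  # ceiling division
    padded = data.ljust(slice_len * m, b"\0")
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(m)]
    parity = reduce(xor_bytes, slices)
    return slices + [parity]


def decode(pieces: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data from M + 1 pieces, at most one of which is None (lost)."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost piece")
    pieces = list(pieces)
    if missing:
        # XOR of all surviving pieces (data and parity) equals the lost piece.
        survivors = [p for p in pieces if p is not None]
        pieces[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the parity slice and the zero padding.
    return b"".join(pieces[:-1])[:original_len]
```

With `m = 4` the file is stored as five pieces on five bricks, and any four of them are enough to rebuild it. Codes such as Reed-Solomon generalise this so that more than one piece may be lost.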

(13)

SurfNET Survey of Wide Area Distributed

Storage. (Circa 2010) [1/4]

http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf

Requirements:

*

Scalable.

*

Capacity, performance and concurrent access.

*

Expandable storage without degrading performance.

*

High Availability.

*

Keeps data available to apps and clients.

*

Even in the event of a malfunction.

*

Or system reconfiguration.

(14)

SurfNET Survey of Wide Area Distributed

Storage. (Circa 2010) [2/4]

*

Durability

*

No data is lost from a single software or hardware failure.

*

Automatically maintain minimum number of replicas.

*

Support backup to tape.

*

Performance at Traditional SAN/NAS Level.

*

Comparable performance to traditional non-distributed SAN/NAS.

*

Dynamic Operation.

*

Availability, durability, performance configurable per application.

* Reduces costs by not running at the highest support level all the time.

* Allow users, apps, sysadmins to balance cost vs features.

*

System should be self-configurable, self-tunable.

*

Support data movement between different storage technologies.

(15)

SurfNET Survey of Wide Area Distributed

Storage. (Circa 2010) [3/4]

*

Cost Effective

*

Must be possible to build, configure, run and maintain in a cost effective manner.

*

Must work with commodity hardware.

* Hardware may not be as reliable as high end hardware.

*

Configuration of the system and its maintenance must be easy and straightforward.

*

Operation of system is energy efficient.

*

License fees for software when applicable must be limited.

*

Generic Interfaces.

*

System offers generic interfaces to apps and clients

* POSIX interface. POSIX/NFSv4.1 semantics.

(16)

SurfNET Survey of Wide Area Distributed

Storage. (Circa 2010) [4/4]

*

Protocols Based on Open Standards

*

System built using open protocols

*

Reduces vendor lock-in

*

More economical in the long run.

*

Multi-Party Access

*

System must support access by multiple geographically dispersed parties at the same time.

(17)

SurfNET Survey of Wide Area Distributed

Storage. (Circa 2010)

http://www.surfnet.nl/nl/Innovatieprogramma's/gigaport3/Documents/EDS-3R%20open-storage-scouting-v1.0.pdf

Candidates

*

Lustre

*

GlusterFS

*

GPFS

*

Ceph

*

+ dCache

Non-Candidates

*

XtreemFS

*

MogileFS

*

NFS v4.1 (pNFS)

*

ZFS

*

VERITAS FS

*

Parascale

*

CAStor

*

Tahoe-LAFS

*

DRBD

(18)

(19)

The DEISA Global File System at European Scale

(20)

TeraGrid

(GPFS & Lustre)

(21)

SurfNET Survey of Wide Area Distributed Storage + dCache

Owner

* Lustre: Oracle
* GlusterFS: Gluster
* GPFS: IBM
* Ceph: Newdream
* dCache: dCache.org

Licence

* Lustre: GNU GPL
* GlusterFS: GNU GPL
* GPFS: commercial
* Ceph: GNU GPL
* dCache: DESY

Data Primitive

* Lustre: Object (file)
* GlusterFS: Object (file)
* GPFS: block
* Ceph: Object (file)
* dCache: Object (file)

Data placement

* Lustre: Round robin + free space heuristics
* GlusterFS: Different strategies via modules
* GPFS: Policy based
* Ceph: Placement groups, random mappings
* dCache: Policy based

Metadata

* Lustre: Max 2 metadata servers
* GlusterFS: Stored with file
* GPFS: Distributed over storage servers
* Ceph: Multiple metadata servers
* dCache: pnfs (PostgreSQL)

Storage tiers

* Lustre: Pools of object targets
* GlusterFS: unknown
* GPFS: Policy defined
* Ceph: CRUSH rules
* dCache: Policy defined

(22)

SurfNET Survey of Wide Area Distributed Storage + dCache.

Failure handling

* Lustre: Assuming reliable nodes
* GlusterFS: Assuming unreliable nodes
* GPFS: Assuming reliable nodes, Failure groups
* Ceph: Assuming unreliable nodes
* dCache: Assuming reliable nodes

Replication

* Lustre: Server side (failover pairs)
* GlusterFS: Client side
* GPFS: Server side
* Ceph: Server side
* dCache: Server side

WAN deployment example

* Lustre: TeraGrid
* GlusterFS: City Cloud (Swedish IaaS provider)
* GPFS: TeraGrid, DEISA
* Ceph: unknown
* dCache: Fermilab, Swegrid, NDGF

Client interface

* Lustre: Native client, FUSE, CIFS, NFS
* GlusterFS: Native client, FUSE
* GPFS: Native client, exports NFSv3, CIFS, pCIFS, WebDAV, SRM (StoRM)
* Ceph: Native client, FUSE
* dCache: NFSv4.1, HTTP, WebDAV, GridFTP, Xrootd, SRM, dCap

Node types

* Lustre: Clients, metadata, objects
* GlusterFS: Client, data
* GPFS: Client, data
* Ceph: Clients, metadata, objects
* dCache: Clients, metadata, objects

(23)

WAN Data Caching and Performance

Bringing data closer to where it is consumed.

*

Researchers are naturally distributed over the city and country

*

Some may not benefit from the high speed networks provided by

AARNet and the NRN due to their location.

*

Can RDSI help these spatially disenfranchised?

*

Yes, (sort of).

*

Take the model of Content Delivery Networks.

*

e.g. Akamai, Amazon CloudFront, etc.

*

Web content, videos etc are cached close to the end user.

*

But focus on data caching rather than content caching.

*

May not provide the same experience as the spatially franchised.
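
The CDN-style model above boils down to a read-through cache placed near the consumer. The sketch below is a toy illustration of that idea only; the class name `ReadThroughCache`, the `fetch_from_origin` callback and the LRU eviction policy are all assumptions made for the example and do not describe any particular RDSI or CDN component.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Tiny LRU read-through cache keyed by dataset name."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch = fetch_from_origin  # slow path: go to the remote node
        self._capacity = capacity
        self._store = OrderedDict()      # name -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, name: str) -> bytes:
        if name in self._store:
            # Served locally; mark as most recently used.
            self.hits += 1
            self._store.move_to_end(name)
            return self._store[name]
        # Not cached: fetch over the WAN once, then keep a local copy.
        self.misses += 1
        data = self._fetch(name)
        self._store[name] = data
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict least recently used
        return data
```

Only the first request for a dataset pays the wide-area cost; repeat requests are served from the nearby cache for as long as the dataset stays resident.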

(24)

(25)

WAN Data Caching

Continued …

*

dCache is a distributed cache system.

*

Locate a dCache pool close to the spatially disenfranchised.

*

A dCache admin can populate required data collections to the spatially disenfranchised using standard SRM processes.

*

Potentially a (reasonably) fast parallel transfer.

*

BioTorrents <http://www.biotorrents.net>

*

Allows scientists to rapidly share their results, datasets, and software using the popular BitTorrent file sharing technology.

*

All data is open-access and any illegal filesharing is not allowed on BioTorrents.

*

Or RDSI can provide BitTorrent seeders itself from its nodes.

(26)

Data Durability.

Things that go bump in the night (or not!)

*

Data Durability is an absolute necessity.

*

RDSI must provide a safe and enduring home for research data.

*

This might be more difficult than it appears!

*

The enemy is …

*

Physics.

*

The world is a complex quantum/probabilistic system.

*

And so is all your computing and storage infrastructure.

*

Random events in your infrastructure will create:

*

Bit Rot and Silent Corruptions.

(27)

Data Durability.

Sources of Bit Rot and Silent Corruptions

[Diagram labels, from “Silent Corruptions”, Peter Kelemen, CERN]

Where corruption can arise: Physical magnetic media, User Space, VM Memory, Filesystems, Block layer, SCSI layer, Low-level drivers, Controller firmware, Storage firmware, Disk Mechanics, plus all interconnecting cables, cosmic rays/sun spots, EM radiation, etc.

Causes and symptoms: Wear Out, Bugs in FW, Inter-op issues, ECC errors, Corrupted Metadata, Corrupted Data, Flipped Bits, Latent sector errors, Lost Writes, Torn Writes, Misdirected Writes.

(28)

Data Durability.

Expected Background Bit Error Rate (BER)

* NIC/Link/HBA: 10^-10 (1 bit in ~1.1 GB)

* Check-summed, retransmit if necessary

* Memory: 10^-12 (1 bit in ~116 GB)

* ECC

* SATA Disk: 10^-14 (1 bit in ~11.3 TB)

* Various error correction codes

* Enterprise Disk: 10^-15 (1 bit in ~113 TB)

* Tape: 10^-18 (1 bit in ~1.11 PB)

* Data may be encoded up to five or more times as it travels to and from physical disk/tape to user space.
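
The "1 bit in N bytes" figures above follow directly from the bit error rate: on average one bit error is expected for every 1/BER bits, i.e. every 1/(8 x BER) bytes. The sketch below shows the arithmetic; reading the slide's GB and TB as binary units (2^30 and 2^40 bytes) is an assumption that appears to match the disk and memory figures.

```python
GIB = 2 ** 30  # assumed meaning of "GB" on the slide
TIB = 2 ** 40  # assumed meaning of "TB" on the slide


def bytes_per_bit_error(ber: float) -> float:
    """Expected number of bytes moved per single bit error at a given BER."""
    if ber <= 0:
        raise ValueError("BER must be positive")
    return 1.0 / ber / 8.0


if __name__ == "__main__":
    print("NIC/Link/HBA: %.2f GiB" % (bytes_per_bit_error(1e-10) / GIB))
    print("Memory:       %.1f GiB" % (bytes_per_bit_error(1e-12) / GIB))
    print("SATA disk:    %.2f TiB" % (bytes_per_bit_error(1e-14) / TIB))
    print("Enterprise:   %.1f TiB" % (bytes_per_bit_error(1e-15) / TIB))
```

At petabyte scale these intervals are crossed many times over, which is why background bit errors have to be planned for rather than treated as rare accidents.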

(29)

Data Durability.

The errors you know. The errors you don’t know.

There are known errors;

there are errors we know we know.

We also know there are known unknown errors;

that is to say we know there are some things we do not

know.

But there are also unknown unknown errors;

the ones we don't know we don't know.

(30)

Data Durability.

The errors you know. The errors you don’t know.

*

There are Data Errors that you will know about.

*

Log messages.

*

SMART messages

*

Detection: SW/HW-level with error messages

*

Correction: SW/HW-level with warnings

*

If you're really lucky your kernel will panic so you’ll know something happened.

*

There are Data Errors that you will never know about.

*

As far as your storage infrastructure knows that write/read was executed perfectly.

*

In reality you will probably never know the data has been corrupted.

(31)

Data Durability.

How to discover the unknown unknowns.

* checksums

* (CRC32, MD5, SHA1, ...)

* Checksum (meta)data.

* Transport checksum with (meta)data for later comparison.

* Error detection and correction codings.

* Detects errors caused by noise, etc. (See checksums.)

* Corrects detected errors and reconstructs the original, error-free data.

* Backward error correction:

* Automatic Retransmit on error detection.

* Forward error correction:

* Encode extra redundant data.

* Regenerate data from Forward Error Codes.
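
The "checksum, transport, compare later" step can be sketched with Python's standard library, which provides the three algorithms named above. The helper names `checksums` and `verify` are made up for illustration; in practice the recorded values would travel with the (meta)data.

```python
import hashlib
import zlib


def checksums(data: bytes) -> dict:
    """Compute the checksums to be stored alongside the (meta)data."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def verify(data: bytes, recorded: dict) -> bool:
    """Recompute the checksums and compare them with the recorded ones."""
    return checksums(data) == recorded


if __name__ == "__main__":
    payload = b"research data"
    recorded = checksums(payload)  # taken when the data is ingested
    flipped = bytes([payload[0] ^ 0x01]) + payload[1:]  # simulate one flipped bit
    print(verify(payload, recorded))
    print(verify(flipped, recorded))
```

A checksum alone only detects the corruption; repairing it needs either a retransmission or redundant data, as the bullets above describe.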

(32)

Data Durability.

Silent Corruptions and CERN

*

Circa 2007. 9PB tape. 4PB disk. 6000 nodes. 20000 drives, 1200 RAID.

*

Probabilistic storage integrity check (fsprobe) on 4000 nodes.

*

Write known bit pattern

*

Read it back.

*

Compare and alert when mismatch found.

*

6 cycles over 1 hour each.

*

Low I/O footprint for background operation on 2GB file.

*

Keep complexity to the minimum.

*

use static buffers

*

Attempt to preserve details about detected corruptions for further analysis.
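
The write, read back, compare cycle can be sketched in a few lines. This is an illustration in the spirit of the probe described above, not the actual fsprobe tool: the function names, the bit pattern and the default size are assumptions, and a real probe would also have to defeat the operating system's page cache so that the read really comes from the storage device.

```python
import os
import tempfile


def compare(expected: bytes, actual: bytes) -> list:
    """Return the byte offsets at which actual differs from expected."""
    mismatches = [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]
    if len(expected) != len(actual):
        # A short or long read is reported at the offset where the overlap ends.
        mismatches.append(min(len(expected), len(actual)))
    return mismatches


def probe(directory: str, size: int = 1 << 20, pattern: bytes = b"\x55\xaa") -> list:
    """Write a known bit pattern to a scratch file, read it back and compare."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    expected = (pattern * (size // len(pattern) + 1))[:size]
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(expected)
            handle.flush()
            os.fsync(handle.fileno())  # push the data towards the device
        with open(path, "rb") as handle:
            actual = handle.read()
        return compare(expected, actual)
    finally:
        os.remove(path)
```

An empty result means the cycle saw no mismatch; any offsets returned are the detail one would keep for further analysis.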

(33)

Data Durability.

Silent Corruptions and CERN

*

2000 incidents reported over 97 PB of traffic.

*

6/day on average observed!

*

192 MB of silent data corruption.

*

320 nodes affected over 27 hardware types.

*

Multiple types of corruptions.

*

Some corruptions are transient.

*

Overall BER considering all the links in the chain

*

3x10^-7.

(34)

Data Durability.

Types of silent Corruptions

*

Type I

*

Single/double bit flip errors. Usually persistent.

*

Usually bad memory (RAM, cache, etc.)

*

Happens with expensive ECC memory too.

*

Type II

*

Small, 2^n-sized random chunks (128-512 bytes) of unknown origin.

*

Usually transient.

*

Possible OOM Killer or corrupted SLAB/SLUB allocator.

*

Type III

*

Multiple large chunks of 64K, “old file data”. I/O command timeouts.

*

Usually persistent.

*

Type IV

(35)

Data Durability.

What Can Be Done?

*

Self-examining/healing hardware.

*

WRITE-READ cycles before ACK.

*

Check-summing, though not necessarily enough.

*

End-to-end check-summing.

*

Store multiple copies.

*

Regular scrubbing of RAID arrays.

*

Data refresh. Re-read cycles on tapes.
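
Storing multiple copies and scrubbing them regularly combine naturally: a scrub pass checksums every copy and rewrites any copy that disagrees with the majority. The sketch below illustrates that idea on in-memory replicas; the function name `scrub` and the majority-vote rule are assumptions made for the example, not a description of any specific RAID or filesystem implementation.

```python
import hashlib
from collections import Counter
from typing import List


def scrub(replicas: List[bytes]) -> List[int]:
    """Repair replicas in place from the majority copy; return the repaired indices."""
    if not replicas:
        raise ValueError("need at least one replica")
    digests = [hashlib.sha1(r).hexdigest() for r in replicas]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes * 2 <= len(replicas):
        # Without a strict majority there is no way to tell which copy is good.
        raise RuntimeError("no majority among replicas")
    good = replicas[digests.index(winner)]
    repaired = [i for i, d in enumerate(digests) if d != winner]
    for i in repaired:
        replicas[i] = good
    return repaired
```

With only two copies a mismatch can be detected but not resolved, which is one reason to keep a checksum recorded at ingest or at least three copies.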

(36)

Data Durability.

The solutions. ZFS. The Good.

*

Developed by Sun (now Oracle) on Solaris.

*

Designed from the ground up with a focus on data integrity.

*

Combined filesystem and logical volume manager.

*

RAID-Z, RAID-Z2, RAID-Z3, or mirrored

*

Copy-on-write. Transactional operation.

*

Built-in end-to-end data integrity.

*

Data/metadata checksum all the way to the root.

*

Always consistent on disk. No fsck or journaling.

*

Automatic self-healing.

*

Intelligent online scrubbing and resilvering.

*

Very large filesystem limits. Max. pool size of 2^128 bytes (256 quadrillion ZB).

*

Deduplication.

*

Snapshots.
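
"Checksum all the way to the root" describes a hash tree: every parent records a checksum of its children, so a single trusted root value covers every block beneath it. The sketch below is a generic Merkle-tree illustration of that principle, assuming SHA-256 and a simple pairwise tree; it is not ZFS's on-disk format.

```python
import hashlib
from typing import List


def merkle_root(blocks: List[bytes]) -> str:
    """Fold per-block SHA-256 hashes pairwise until a single root hash remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [hashlib.sha256(b).digest() for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing any one block changes the root, so comparing the stored root with a recomputed one detects corruption anywhere in the tree.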

(37)

Data Durability.

The solutions. ZFS. The Bad.

*

Supported on Solaris only.

*

OpenSolaris is no more.

*

Kernel ports for FreeBSD and NetBSD.

*

Using OpenSolaris kernel source code.

*

Linux port via ZFS-FUSE.

*

Kernel space good. User space not so good.

*

ZFS on Linux.

*

Supported by Lawrence Livermore National Laboratory.

*

Issues with CDDL and GPL license compatibility in the kernel.

*

Solaris Porting Layer (SPL) shim to the rescue.

(38)

Data Durability.

The solutions. ZFS for Lustre.

*

1999: Peter Braam from CMU creates Lustre.

*

A GPL massively parallel distributed file system.

*

2003: Braam creates Cluster File Systems Inc. to continue work.

*

2007: Sun acquires Cluster File Systems Inc.

*

Works to combine ZFS and Lustre.

*

High Performance parallel FS with end to end data integrity.

*

But only supported on Solaris.

*

2009: LLNL starts porting ZFS kernel to Linux.

*

Oracle acquires Sun.

*

2010: Oracle announced ZFS/Lustre only for Solaris.

*

2011: LLNL starts ZFS/Lustre port for Linux.

*

Late 2011: LLNL plans ZFS/Lustre FS.

(39)

Data Durability.

The solutions. DataDirect Networks S2S Technology.

*

SATA storage with:

*

Enterprise-class performance.

*

Reliability and data integrity.

*

Automatic self-healing

*

Detects anomalies and begins journaling all writes while recovering operations.

*

Dynamic MAID (D-MAID)

*

Save additional power and cooling by powering down the platters,

*

Where over 80% of power is consumed.

(40)

Community Input Time.

*

Are we barking up the right tree?

*

Are we barking up the wrong tree?

*

Is there even a tree in the first place?

(41)

Building Blocks

*

Are the base building blocks sufficient?

*

If not what should be added?

*

Is there a need for additional data transfer protocols?

*

If so what should be added?

*

Is there a need for additional file system protocols?

*

If so what should be added?

*

What additional public cloud storage infrastructure should

RDSI consider?

*

What additional private cloud storage infrastructure should

RDSI consider?

(42)

Federated vs Distributed.

*

Should RDSI continue to embrace the federated iRODS model?

*

Should RDSI embrace the Distributed FS model?

*

Should RDSI embrace both the federated and distributed

model?

(43)

Distributed Fault Tolerant Parallel Filesystems.

*

If RDSI chooses to use a Distributed Fault Tolerant Parallel

filesystem component, are there such systems that we have not yet considered?

(44)

WAN Data Caching

There are always going to be researchers who may not be able to

benefit from the high speed networks provided by AARNet and

the NRN. WAN Data Caching may partially eliminate their

disadvantage, but at a cost.

*

Should RDSI consider the use of WAN Data Caches?

(45)

Data Durability.

Data Durability is one of the foremost challenges of RDSI.

However, it seems impossible to entirely eliminate the various

issues of bit rot and silent corruptions.

*

Given this fact of nature, what level of data durability is the

research community willing to accept?