National Data Storage: data replication in the network

(1)

Maciej Brzeźniak, Michał Jankowski, Norbert Meyer

PSNC, Supercomputing Dept.

1st Technical meeting in Munich, December 5-6th, 2011

Project funded by: NCBiR for 2011-2013 under „KMD2” project (no. NR02-0025-10/2011)

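Full Polish name of the project: System bezpiecznego przechowywania i współdzielenia danych oraz składowania kopii zapasowych i archiwalnych w Krajowym Magazynie Danych (in English: System for secure storage and sharing of data and for storing backup and archival copies in the National Data Storage)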

(2)

National Data Storage

NDS Overview:

NDS Architecture:

Design assumptions

Overall architecture

Data replication in NDS

Data Replication modes

Replication protocols usage

User profiles vs data replication settings

Rule-based replication?

NDS vs external world vs EUDAT

(3)

NDS - design assumptions

Overall assumptions:

Avoid single points of failure (SPoF): distributed design with data & meta-data replication

Standard access protocols (tools) to be usable

Abstraction of system internals

Logical namespace visible to user;

Separate namespaces for different user groups

Robust implementation (C/C++) within 2 years

Tape systems (HSMs) on the back-end (for cost-efficiency)

Main applications

Archival and backup data storage

Effective storage of and access to large files

(4)

NDS Project status

National Data Storage (R&D project: 2007-2009)

System architecture & concept

Software stack (rpms for CentOS/RHEL)

Current NDS deployment:

Backup and Archive Services for Science – BADSS (Service Platform for e-Science)

Capacity: 12.5 PB of tapes in 5 sites

Performance: 2 PB of disks in 5 sites

National Data Storage 2 (R&D project: 2011-2013)

(5)

NDS highlights

Automated, TRANSPARENT, data replication

Users do not see the details (unless they want to)

They "speak" to a remote virtual filesystem

Abstract data access interfaces:

File-system view of the data (remote virtual filesystem)

NDS is implemented as user-level code (FUSE library)

User access: standard methods: SFTP, WebDAV, GridFTP

Storage access: NFS / GridFTP-NFS

(each SN exposes at least NFS)

Meta-data replication:

Automated, transparent

PostgreSQL with Slony-I + semi-synchronous replication

DR, not full HA (no recovery automation)

(6)

NDS architecture (1)

Overall picture

[Figure: overall NDS architecture. Labelled components: User; Access Node (access methods servers: SSH, HTTPS, WebDAV...; NDS system logic; VFS for data and meta-data); Database Node (meta-data DB, users DB, accounting & limits DB); Storage Nodes (replica access methods servers: NFS, GridFTP) in front of an HSM system (NFS), an FS with data migration (HSM) and a NAS appliance; replication between Storage Nodes]

(7)

NDS architecture (2)

Data replication & presentation

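[Figure: the same architecture diagram, focused on data replication and presentation. Labelled components: User; Access Node (access methods servers: SSH, HTTPS, WebDAV...; NDS system logic); Database Node (meta-data DB, users DB, accounting & limits DB); Storage Nodes (replica access methods servers: NFS, GridFTP) in front of an FS with data migration (HSM) and a NAS appliance; replication between Storage Nodes]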

(8)

NDS architecture (3)

Data replication & presentation – Data Daemon

• Implements the core NDS system logic (together with the meta-catalog, MC):

– I/O serving, filesystem presentation

– data operations with replication

– meta-data-related operations

• Emulates Virtual File System

• Supports most POSIX functions:

– open, close, read, write

– opendir, readdir,

– getattr, setattr, rename, link, unlink...

• Based on FUSE (Filesystem in USErspace)

• Additional:

– enforces security policies (access control)

– optimizes replica access and creation

(9)

NDS architecture (4)

Async. vs sync. replication from VFS perspective:

Writing to the system (async mode)

VFS: OPEN (new file, O_RDWR|O_CREAT)
- Register a new logical file in MC (lock for writing)
- Create one physical replica
- Register replica in MC

VFS: WRITE...
- Write to local replica (async.) (QUICK: local, single-replica write)
- Update meta-data (size, last access etc.)

VFS: CLOSE (on an opened file)
- Flush buffers and close replica (async.) (QUICK)
- Update meta-data incl. release write locks
- Return to user

Asynchronous action: Make replicas
- Enqueue replication tasks to replication daemon
- Update meta-data
- Replication daemon (in the background) makes the replicas (typically 3rd-party: SN1->SN2)

Writing to the system (sync mode)

VFS: OPEN (new file, O_RDWR|O_CREAT)
- Register a new logical file in MC (lock for writing)
- Create multiple physical replicas
- Register replicas in MC

VFS: WRITE...
- Write to all replicas (sync.) (TAKES TIME: remote, multiple sites to write)
- Update meta-data (size, last access etc.)

VFS: CLOSE (on an opened file)
- Flush buffers (also to remote replicas) (TAKES TIME)
- Update meta-data incl. release write locks
- Return to user
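The two write paths can be compared in a small toy model. This is only a sketch under stated assumptions: the class and method names (`ToyVFS`, `open_new`, `run_replication_daemon`) are hypothetical, storage is an in-memory dictionary rather than real Storage Nodes, and the background daemon is run by hand instead of as a separate process.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogicalFile:
    """Meta-catalog (MC) record for one logical file."""
    path: str
    replicas: list = field(default_factory=list)  # storage node names
    size: int = 0
    write_locked: bool = False


class ToyVFS:
    """Toy model of the OPEN / WRITE / CLOSE sequences; storage is a dict."""

    def __init__(self, local_sn, remote_sns, sync):
        self.local_sn = local_sn
        self.remote_sns = list(remote_sns)
        self.sync = sync
        self.mc = {}                      # path -> LogicalFile
        self.storage = {}                 # (node, path) -> bytearray
        self.replication_queue = deque()  # (path, source node, target node)

    def open_new(self, path):
        # Register the logical file in the MC and lock it for writing.
        entry = LogicalFile(path, write_locked=True)
        self.mc[path] = entry
        # Async mode creates one replica; sync mode creates all of them.
        targets = [self.local_sn] + (self.remote_sns if self.sync else [])
        for node in targets:
            self.storage[(node, path)] = bytearray()
            entry.replicas.append(node)

    def write(self, path, data):
        entry = self.mc[path]
        # Only the replicas registered at OPEN time are written here.
        for node in entry.replicas:
            self.storage[(node, path)] += data
        entry.size += len(data)

    def close(self, path):
        entry = self.mc[path]
        entry.write_locked = False
        if not self.sync:
            # Remaining replicas are made later, in the background.
            for node in self.remote_sns:
                self.replication_queue.append((path, self.local_sn, node))

    def run_replication_daemon(self):
        # Stand-in for the third-party SN1 -> SN2 copy.
        while self.replication_queue:
            path, src, dst = self.replication_queue.popleft()
            self.storage[(dst, path)] = bytearray(self.storage[(src, path)])
            self.mc[path].replicas.append(dst)
```

In this model an async `close` returns while the file still has a single replica, and the second replica appears only after `run_replication_daemon()` has drained the queue; in sync mode every replica is already complete when `close` returns, at the cost of writing to all sites on every `write`.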

(10)

NDS architecture (5)

Replica access methods (AN-SN) (1)

[Figure: an Access Node (access methods servers: SSH, HTTPS, WebDAV...; NDS system logic; NFS and GridFTP replica access clients) connected to a LOCAL Storage Node over LAN (low latency, high bandwidth, e.g. 10 GbE) and to a REMOTE Storage Node over WAN (high latency, high bandwidth, e.g. 1 GbE). Each Storage Node exposes NFS and GridFTP replica access methods; the links are labelled NFS and GridFTP]

GridFTP or NFS used when needed
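Why latency matters even when bandwidth is high can be shown with a little arithmetic: a single TCP stream cannot move more than one window of data per round trip, so its throughput is bounded by window size divided by round-trip time. The sketch below works that bound out. The 64 KiB window and the two round-trip times are illustrative assumptions, not measurements from NDS; real stacks with window scaling, and links with packet loss, will behave differently.

```python
import math


def per_stream_throughput_bps(window_bytes, rtt_s):
    """Upper bound for one TCP stream: one window per round trip."""
    return window_bytes * 8 / rtt_s


def streams_needed(link_bps, window_bytes, rtt_s):
    """Parallel streams required to fill the link under that bound."""
    per_stream = per_stream_throughput_bps(window_bytes, rtt_s)
    return math.ceil(link_bps / per_stream)


LINK_BPS = 1e9          # 1 Gbit/s link
WINDOW = 64 * 1024      # assumed per-stream window, in bytes

# Assumed round-trip times: 0.2 ms on a LAN, 20 ms on a long WAN path.
for label, rtt in [("LAN", 0.0002), ("WAN", 0.020)]:
    print(label, streams_needed(LINK_BPS, WINDOW, rtt))
```

Under these assumptions one stream is enough on the LAN, while the WAN path needs dozens of parallel streams, which is the regime where a multi-stream protocol pays off.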

(11)

NDS architecture (6)

Replica access methods (AN-SN) (2)

NFS:
• Stateless, IOPS friendly
• Low overhead on IOPS operations:
  – small files access
  – meta-data related operations
• Low performance (MB/s) over long distances
• No parallelism (to/from single file)
  – NFS 4.1 (pNFS) on the horizon but still not there
• Usage in NDS:
  – meta-data related operations
  – accessing replicas on local SNs
  – access to small files on remote SNs (future)
• Stable, standardised

GridFTP:
• Stateful, can exploit available bandwidth
• High overhead on IOPS operations:
  – even small files access and meta-data ops require a session
• High performance (MB/s) despite distance
• Parallelism (up to 256 streams)
  – 64+ streams can sustain a 1 GbE link (1000 km long)
• Usage in NDS:
  – 3rd-party replication (async. mode)
  – transferring replicas to/from remote SNs
• Stability issues, even if standard in Grids

NFS and GridFTP used where they fit best:
– Static protocol selection (currently)
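A static selection rule of this kind can be written as a plain lookup. The sketch below paraphrases the "Usage in NDS" entries; the operation names and the function `select_protocol` are hypothetical and do not reflect the actual NDS code.

```python
from enum import Enum


class Protocol(Enum):
    NFS = "nfs"
    GRIDFTP = "gridftp"


def select_protocol(operation, node_is_local):
    """Static rule set paraphrasing the 'Usage in NDS' entries."""
    if operation == "metadata":
        return Protocol.NFS          # IOPS-style operation, low overhead
    if operation == "third_party_replication":
        return Protocol.GRIDFTP      # SN -> SN copy in async mode
    if operation == "replica_io":
        # Local replicas over NFS, remote replicas over GridFTP.
        return Protocol.NFS if node_is_local else Protocol.GRIDFTP
    raise ValueError(f"unknown operation: {operation}")
```

Because the rule depends only on the operation type and on whether the Storage Node is local, it needs no measurements at run time; a dynamic variant would have to take file size or observed link performance into account.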

(12)

NDS architecture (7)

GridFTP 3rd-party replication (SN-SN) (3)

• 3rd-party transmission (SN->SN) used in async. replication mode

[Figure: the Access Node (VFS for data and meta-data, NDS system logic, replication daemon acting as GridFTP client) holds GridFTP control connections to the GridFTP replica access methods on the LOCAL Storage Node (LAN: low latency, high bandwidth, e.g. 10 GbE) and on the REMOTE Storage Node (WAN: high latency, high bandwidth, e.g. 1 GbE); the two Storage Nodes are connected over WAN (high latency, high bandwidth, e.g. 1 GbE)]

(13)

NDS architecture in PLATON:

Replica access: GridFTP-NFS access to SNs

[Figure: an Access Node with NFS and GridFTP replica access clients, connected over NFS and GridFTP to two virtual Storage Nodes. Each virtual Storage Node runs a GridFTP replica access method and an NFS client; one is backed by a NAS appliance (NFS), the other by an FS with data migration (HSM)]

(14)

NDS architecture (8)

Replication-related settings (1)

Profile parameter: possible values and meaning

• Replication mode:
  – Asynchronous (default)
  – Synchronous
• Number of replicas:
  – Typically 2 (max. 3)
  – Can be set to "any" value
• Allowed storage sites and nodes (default replica locations, additional replica location):
  – Replicas are typically created in default locations
  – Additional replica locations are used in case of failure of default ones
• Storage media type (disk vs tape (HSM)):
  – The media type can be determined from the combination of allowed storage sites & nodes and knowledge of the deployment infrastructure
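How a media type can follow from the allowed nodes is easy to illustrate. In the sketch below the deployment map, the node names and the function `media_types` are all hypothetical; the only idea taken from the slide is that the medium is derived from node placement rather than configured directly.

```python
# Hypothetical deployment knowledge: which medium backs each storage node.
DEPLOYMENT = {
    "site-a/sn1": "disk",
    "site-a/sn2": "tape",
    "site-b/sn1": "tape",
}


def media_types(allowed_nodes, deployment=DEPLOYMENT):
    """Return the set of media a profile can land on, given its allowed nodes."""
    unknown = [n for n in allowed_nodes if n not in deployment]
    if unknown:
        raise KeyError(f"nodes missing from deployment map: {unknown}")
    return {deployment[n] for n in allowed_nodes}
```

A profile restricted to `site-a/sn2` and `site-b/sn1` would therefore be a tape-only profile, without any explicit "media type" setting.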

(15)

NDS architecture (9)

Replication-related settings (2)

• Replication is configured per profile, NOT per data object (e.g. directory / file)
• Policies are "static": they cannot be changed dynamically
• Users can use one or many profiles:
  – Users are assigned to profiles by the DNs of their certificates, so multiple certificates have to be used in order to access different profiles

Example profiles:

Fast, HA & FT space for backups:
• Replication: SYNC
• 3 replicas (1 local + 2 distant), on disks only

FT space for archives:
• Replication: ASYNC
• 2 replicas on distant nodes; both copies on tapes

Safe storage space for collaboration:
• Replication: ASYNC
• 2 replicas: 1 local on disk + 1 remote in HSM
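The three example profiles, and the certificate-based assignment, could be represented roughly as follows. This is a sketch, not the NDS configuration format: the class `ReplicationProfile`, the profile keys and the certificate DNs are hypothetical, and the dataclass is frozen only to mirror the statement that policies are static.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ASYNC = "async"
    SYNC = "sync"


@dataclass(frozen=True)
class ReplicationProfile:
    name: str
    mode: Mode
    placements: tuple  # one (location, medium) pair per replica

    @property
    def replicas(self):
        return len(self.placements)


PROFILES = {
    "backup": ReplicationProfile(
        "backup", Mode.SYNC,
        (("local", "disk"), ("distant", "disk"), ("distant", "disk"))),
    "archive": ReplicationProfile(
        "archive", Mode.ASYNC,
        (("distant", "tape"), ("distant", "tape"))),
    "collaboration": ReplicationProfile(
        "collaboration", Mode.ASYNC,
        (("local", "disk"), ("remote", "hsm"))),
}

# Hypothetical certificate DNs; one certificate maps to exactly one profile.
DN_TO_PROFILE = {
    "CN=Jan Kowalski Backup,O=Example,C=PL": "backup",
    "CN=Jan Kowalski Archive,O=Example,C=PL": "archive",
}


def profile_for(dn):
    """Look up the replication profile bound to a certificate DN."""
    return PROFILES[DN_TO_PROFILE[dn]]
```

The one-to-one mapping from DN to profile makes the consequence visible: a user who needs both the backup and the archive space has to present two different certificates.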

(16)

NDS features vs EUDAT (1)

Automated, TRANSPARENT data replication
⇒ Safe, transparent replication service case

Abstract data interfaces above and below NDS:

SFTP, WebDAV, GridFTP for users
• Possible to interface with NDS from other systems, e.g. 3rd-party transfer to/from NDS

Data available through VFS layer on ANs:
• Possible to add new access methods
• Some work needed to extend the authentication mechanisms

Storage access: NFS / GridFTP
• "Any" kind of storage can be used as the backend, as long as it provides an NFS service

[Figure: NDS logic (NDS VFS, NDS access methods servers: SSH, HTTPS, WebDAV...) on top of GridFTP front-ends to storage: HSM system (NFS), NAS appliance (NFS), FS with migration (HSM), "any" other storage (NFS); a truncated label reads "’s data centres? service"]

(17)

NDS features vs EUDAT (2)

Persistent IDs?

User always sees the same logical structure, regardless of:

• The replication process:
  – Physical location is transparent
  – Replication process does not affect the logical namespace
• Which Access Node they use:
  – The logical structure of VFS is the same everywhere
• What access method they use:
  – The logical structure of VFS is presented similarly through different access methods
• Failures:
  – As long as at least one replica is OK

Path to the file or directory is constant
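The "constant path despite failures" property amounts to resolving a logical path to a list of replicas and reading from the first one that answers. The sketch below assumes a hypothetical catalog structure and a caller-supplied `fetch` function; it is not the NDS implementation.

```python
class ReplicaUnavailable(Exception):
    """Raised when a storage node cannot serve a replica."""


def read_logical(path, catalog, fetch):
    """Read a logical path from the first replica that responds.

    catalog: dict mapping logical path -> list of (node, physical_path)
    fetch:   callable(node, physical_path) -> bytes, raising
             ReplicaUnavailable when the node cannot serve the replica
    """
    errors = []
    for node, physical in catalog[path]:
        try:
            return fetch(node, physical)
        except ReplicaUnavailable as exc:
            errors.append((node, str(exc)))
    raise ReplicaUnavailable(f"no replica of {path} is reachable: {errors}")
```

The caller only ever handles the logical path; which node served the data, and whether an earlier node failed, stays hidden unless every replica is unavailable.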

(18)

NDS features vs EUDAT (3)

User-level metadata:

• User can assign free-form text files to data objects; these can include metadata
• This is done through the Web GUI or a procfs-like mechanism
• Metadata search is possible but not yet implemented (on the roadmap)
• Can the above be re-used in EUDAT?

Extensibility?

• Functionality can be easily extended, as the architecture & interfaces are open (PostgreSQL, NFS/GridFTP...)
• A "micro-services"-like approach is possible, but requires effort on the NDS consortium side
• For instance: some basic interfaces to meta-data can be defined (e.g. for searching data meeting some criteria)
• Example: we are currently designing and developing a mechanism for periodic data integrity checking (data scrubbing)
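The slides do not describe how the scrubbing mechanism works. As a generic illustration of the idea only, a scrubbing pass can recompute a digest for every stored replica and compare it with the digest recorded when the file was written. The data layout and function names below are assumptions, as is the choice of SHA-256.

```python
import hashlib


def digest(data):
    """Hex digest of a replica's content (SHA-256 chosen for illustration)."""
    return hashlib.sha256(data).hexdigest()


def scrub(recorded, replicas):
    """Return the (path, node) pairs whose content no longer matches.

    recorded: dict path -> expected hex digest
    replicas: dict (path, node) -> bytes currently stored
    """
    damaged = []
    for (path, node), data in sorted(replicas.items()):
        if digest(data) != recorded[path]:
            damaged.append((path, node))
    return damaged
```

A damaged replica found this way could then be re-created from a healthy one, which is where scrubbing and replication meet.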

(19)

NDS2 - features

• Secure data storage:
  – Data encrypted on the user side
    • Symmetric file keys stored in the system, protected by the user's asymmetric key
  – Integrity control on the user side
    • MD5/SHA-1 digests stored in the system, encrypted
• Secure data exchange:
  – 2-level access control:
    • ACLs on virtual filesystem level
    • User-side encryption and key exchange make the sharing safe (e.g. if we don't trust the provider)
• Secure data publication:
  – 2 kinds of storage space: private (for internal users) and public (sandboxed)
  – Multiple web servers (load-balancing, HA, data synchronisation) to serve data effectively
• Specialised user-side tools needed:
  – Java GUI for managing file sharing, ACLs, publication and versioning
  – Virtual encrypted (!) filesystem for end-users: both for Linux and Windows
• Status: R&D project (2011-2013); prototype expected in 2Q2013

(20)

(21)

NDS1:

Summary

• Data storage & replication:
  – Implemented at VFS level: portability and security
  – Robust and lightweight
• Data replication:
  – Automatic, transparent to users
  – Sync. and async. modes
  – NFS, GridFTP or GridFTP 3rd-party transfer used to access/make replicas
• Meta-data handling & replication:
  – Handles file system-level metadata and user-level metadata
  – Logically centralized, but DR solutions in place for quick recovery
• Logical filesystem structure persistency:
  – Physical location-agnostic access
• "Pluggable":
  – Open interfaces, standard interfaces to the external world (both user- and storage-side)
  – We can provide custom interfaces to meta-data if needed

(22)

NDS architecture: (10)

Meta-catalog (1)

• Functionality:
  – System-level meta-data storage and handling
    • File system structure
    • Data replicas
  – User-level meta-data storage
• Implementation:
  – C++ library used by Data Daemon
  – Postgres database with Slony-I replication at the backend
  – Separation of namespaces:
    • No sharing among user groups assumed
    • Security by isolation
    • Scalability: multiple instances of MC for multiple user groups / institutions

[Figure label: Database Node]
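Namespace separation with one meta-catalog instance per user group can be pictured as a simple router. The classes below are a hypothetical in-memory sketch; the real meta-catalog is a C++ library on top of a Postgres database.

```python
class MetaCatalog:
    """One isolated namespace: logical path -> replica locations."""

    def __init__(self):
        self._entries = {}

    def register(self, path, replicas):
        self._entries[path] = list(replicas)

    def lookup(self, path):
        return self._entries[path]


class CatalogRouter:
    """Routes each user group to its own MetaCatalog instance."""

    def __init__(self):
        self._by_group = {}

    def catalog_for(self, group):
        # A group's catalog is created on first use and never shared.
        return self._by_group.setdefault(group, MetaCatalog())
```

Since two groups never share a catalog object, the same logical path can exist independently in both namespaces, and each instance can be placed on a different Database Node.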

(23)

NDS architecture: (11)

Meta-catalog (2)

• Meta-data redundancy:
  problem: reliability and performance?
  – (1) Postgres database with Slony-I replication
    • Each meta-catalog replicated asynchronously in master-slaves mode (Slony-I)
    • In case of failure of master MC: a slave MC is manually selected as master (DR, not full HA, human intervention needed)
  – (2) Semi-synchronous data replication:
    • All operations on metadata synchronously logged to distributed logs
    • In case of failure of master MC: part of the logged operations is repeated on the "new" master (human intervention needed)
• Comments:
  – Reliability similar to synchronous DBMS replication, but the mechanism is lighter
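The semi-synchronous scheme combines a lagging asynchronous copy with an operation log that is always complete. A minimal sketch of the recovery step, assuming sequence-numbered operations and in-memory lists as log stores (all names are hypothetical, and Slony-I itself is not modelled):

```python
class MetaCatalogDB:
    """Toy key-value meta-catalog that remembers the last applied operation."""

    def __init__(self):
        self.data = {}
        self.last_seq = 0

    def apply(self, seq, key, value):
        self.data[key] = value
        self.last_seq = seq


class Master(MetaCatalogDB):
    def __init__(self, logs):
        super().__init__()
        self.logs = logs  # independent log stores, written synchronously

    def update(self, key, value):
        seq = self.last_seq + 1
        # The operation counts as done only after every log has recorded it.
        for log in self.logs:
            log.append((seq, key, value))
        self.apply(seq, key, value)
        return seq


def promote(slave, log):
    """Replay on the slave every logged operation it has not yet applied."""
    for seq, key, value in log:
        if seq > slave.last_seq:
            slave.apply(seq, key, value)
    return slave
```

In this model `promote` stands for the manual step an operator would trigger: the slave may be arbitrarily behind because of asynchronous replication, yet after replaying the tail of any surviving log it holds the same state as the failed master.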