National Data Storage
data replication
in the network
Maciej Brzeźniak,
Michał Jankowski, Norbert Meyer,
PSNC, Supercomputing Dept.
1st Technical meeting in Munich,
December 5-6th, 2011
Project funded by: NCBiR for 2011-2013 under „KMD2” project (no. NR02-0025-10/2011)
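Full Polish name of the project: System bezpiecznego przechowywania i współdzielenia danych oraz składowania kopii zapasowych i archiwalnych w Krajowym Magazynie Danych (System for secure storage and sharing of data and for storing backup and archive copies in the National Data Storage)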
National Data Storage
NDS Overview:
NDS Architecture:
Design assumptions
Overall architecture
Data replication in NDS
Data Replication modes
Replication protocols usage
User profiles vs data replication settings
Rule-based replication?
NDS vs external world vs EUDAT
NDS - design assumptions
Overall assumptions:
Avoid SPoF (single point of failure) - distributed: data & meta-data replication
Standard access protocols (tools) to be usable
Abstraction of system internals
Logical namespace visible to user;
Separate namespaces for different user groups
Robust implementation (C/C++) within 2 years
Tape systems (HSMs) on the back-end (for cost-efficiency)
Main applications
Archival and backup data storage
Effective storage of and access to large files
NDS Project status
National Data Storage (R&D project: 2007-2009)
System architecture & concept
Software stack (rpms for CentOS/RHEL)
Current NDS deployment:
Backup and Archive Services for Science – BADSS (Service Platform for e-Science)
Capacity: 12.5 PB of tapes – in 5 sites
Performance: 2 PB of disks – in 5 sites
National Data Storage 2 (R&D project: 2011-2013)
NDS highlights
Automated, TRANSPARENT, data replication
Users do not see the details (unless they want to)
They „speak” to a remote virtual filesystem
Abstract data access interfaces:
File-system view of the data (remote virtual filesystem)
NDS is implemented as user-level code (FUSE library)
User access: standard methods: SFTP, WebDAV, GridFTP
Storage access: NFS / GridFTP-NFS
(each SN exposes at least NFS)
Meta-data replication:
Automated, transparent
PostgreSQL with Slony-I + semi-synchronous replication
DR, not full HA (no recovery automation)
NDS architecture (1)
Overall picture
[Figure: overall NDS architecture. Access Node: access methods servers (SSH, HTTPS, WebDAV...), NDS system logic, VFS for data and meta-data. Database Node: user meta-data DB, users DB, accounting & limits DB. Storage Nodes: replica access methods servers (NFS, GridFTP) in front of an HSM system (NFS), an FS with data migration (HSM) or a NAS appliance; replication runs between Storage Nodes.]
NDS architecture (2)
Data replication & presentation
[Figure: the same architecture diagram (Access Node, Database Node, Storage Nodes with HSM system, FS with data migration and NAS appliance), here highlighting the data replication and presentation components.]
NDS architecture (3)
Data replication & presentation – Data Daemon
• Implements the core NDS system logic (together with MC):
– I/O serving, filesystem presentation
– data operations with replication
– meta-data-related operations
• Emulates Virtual File System
• Supports most of POSIX functions:
– open, close, read, write
– opendir, readdir
– getattr, setattr, rename, link, unlink...
• Based on FUSE (Filesystem in USErspace)
• Additional:
– enforces security policies (access control)
– optimizes replica access and creation
NDS architecture (4)
Async. vs sync. replication from VFS perspective:
Writing to the system (async mode)
VFS: OPEN (new file, O_RDWR|O_CREAT)
- Register a new logical file in MC (lock for writing)
- Create one physical replica
- Register replica in MC
VFS: WRITE...
- Write to local replica (async.) (QUICK: local, single replica write)
- Update meta-data (size, last access etc.)
VFS: CLOSE (on an opened file)
- Flush buffers and close replica (async.) (QUICK)
- Update meta-data incl. release write locks
- Return to user
Asynchronous action: Make replicas
- Enqueue replication tasks to replication daemon
- Update meta-data
- Replication daemon (in the background) makes the replicas (typically 3rd-party: SN1->SN2)
Writing to the system (sync mode)
VFS: OPEN (new file, O_RDWR|O_CREAT)
- Register a new logical file in MC (lock for writing)
- Create multiple physical replicas
- Register replicas in MC
VFS: WRITE...
- Write to all replicas (sync.) (TAKES TIME: remote, multiple sites to write)
- Update meta-data (size, last access etc.)
VFS: CLOSE (on an opened file)
- Flush buffers (also to remote replicas) (TAKES TIME)
- Update meta-data incl. release write locks
- Return to user
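The two write paths can be compared with a small in-memory model. This is only an illustrative sketch, not NDS code: the class names (`MetaCatalog`, `ToyVFS`), the dictionary standing in for storage nodes, and the method names are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MetaCatalog:
    """Toy stand-in for the MC: logical path -> replica sites, plus write locks."""
    replicas: dict = field(default_factory=dict)
    locked: set = field(default_factory=set)


@dataclass
class ToyVFS:
    mode: str    # "async" or "sync"
    sites: list  # storage nodes; the first one is treated as local
    mc: MetaCatalog = field(default_factory=MetaCatalog)
    storage: dict = field(default_factory=dict)  # (site, path) -> bytes
    queue: deque = field(default_factory=deque)  # pending replication tasks

    def open_new(self, path):
        # Register the logical file (locked for writing), create the replica(s).
        self.mc.locked.add(path)
        targets = self.sites if self.mode == "sync" else self.sites[:1]
        for site in targets:
            self.storage[(site, path)] = b""
        self.mc.replicas[path] = list(targets)

    def write(self, path, data):
        # async: one local replica; sync: every registered replica.
        for site in self.mc.replicas[path]:
            self.storage[(site, path)] += data

    def close(self, path):
        # Release the write lock; in async mode enqueue the remaining copies.
        self.mc.locked.discard(path)
        if self.mode == "async":
            for dst in self.sites[1:]:
                self.queue.append((path, self.sites[0], dst))

    def run_replication_daemon(self):
        # Background step: copy SN1 -> SN2 and register the new replica.
        while self.queue:
            path, src, dst = self.queue.popleft()
            self.storage[(dst, path)] = self.storage[(src, path)]
            self.mc.replicas[path].append(dst)
```

In async mode the file has a single registered replica until `run_replication_daemon()` has drained the queue; in sync mode all replicas exist when `close()` returns, at the cost of writing to every site on each `write()`.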
[Figure: Access Node (access methods servers: SSH, HTTPS, WebDAV...; NDS system logic; VFS for data and meta-data; replica access clients: NFS, GridFTP) connected to a LOCAL Storage Node and a REMOTE Storage Node, each exposing replica access methods (NFS, GridFTP).]
NDS architecture (5)
Replica access methods (AN-SN) (1)
[Figure: AN-SN links over LAN (low latency, high bandwidth, e.g. 10 Gbit Ethernet) and over WAN (high latency, high bandwidth, e.g. 1 Gbit Ethernet), labelled with NFS and GridFTP.]
GridFTP or NFS used when needed
NDS architecture (6)
Replica access methods (AN-SN) (2)
NFS:
• Stateless, IOPS-friendly
• Low overhead on IOPS operations:
– small files access
– meta-data related operations
• Low performance (MB/s) over long distances
• No parallelism (to/from a single file)
– NFS 4.1 (pNFS) on the horizon but still not there
• Usage in NDS:
– meta-data related operations
– accessing replicas on local SNs
– access to small files on remote SNs (future)
• Stable, standardised
GridFTP:
• Stateful, can exploit available bandwidth
• High overhead on IOPS operations:
– even small files access and meta-data ops require a session
• High performance (MB/s) despite distance
• Parallelism (up to 256 streams)
– 64+ streams can sustain a 1 Gbit Ethernet link (1000 km long)
• Usage in NDS:
– 3rd-party replication (async. mode)
– transferring replicas to/from remote SNs
• Stability issues, even if standard in Grids
NFS and GridFTP used where they fit best:
– Static protocol selection (currently)
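A static selection rule of this kind could look roughly like the sketch below. The operation names (`"metadata"`, `"third_party_replication"`, `"replica_io"`) and the function signature are hypothetical; only the mapping itself follows the usage lists above.

```python
from enum import Enum


class Protocol(Enum):
    NFS = "NFS"
    GRIDFTP = "GridFTP"


def select_protocol(operation: str, node_is_local: bool) -> Protocol:
    """Pick the AN-SN protocol from a fixed table (no runtime measurements)."""
    if operation == "metadata":
        # Meta-data related operations are IOPS-bound: NFS.
        return Protocol.NFS
    if operation == "third_party_replication":
        # SN -> SN copies in async mode: GridFTP.
        return Protocol.GRIDFTP
    if operation == "replica_io":
        # Local replicas over NFS, remote replicas over GridFTP.
        return Protocol.NFS if node_is_local else Protocol.GRIDFTP
    raise ValueError(f"unknown operation: {operation}")
```

Because the rule is static, small files on remote SNs still go over GridFTP here; the slide lists NFS for that case only as a future option.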
NDS architecture (7)
GridFTP 3-rd party replication (SN-SN) (3)
• 3-rd party transmission (SN->SN) used in async. replication mode
[Figure: the replication daemon (GridFTP client) on the Access Node holds GridFTP control connections to the GridFTP replica access method on the LOCAL Storage Node (LAN: low latency, high bandwidth, e.g. 10 Gbit Ethernet) and on the REMOTE Storage Node (WAN: high latency, high bandwidth, e.g. 1 Gbit Ethernet); the two Storage Nodes are linked over the WAN.]
NDS architecture in PLATON:
Replica access: GridFTP-NFS access to SNs
[Figure: the Access Node (access methods servers: SSH, HTTPS, WebDAV...; VFS for data and meta-data; replica access clients: NFS, GridFTP) reaches virtual Storage Nodes over NFS and GridFTP. Each virtual Storage Node combines a GridFTP replica access method with an NFS client in front of the actual storage: an HSM system (NFS), an FS with data migration (HSM) or a NAS appliance (NFS).]
NDS architecture (8)
Replication-related settings (1)
Profile parameter: possible values and meaning
Replication mode:
• Asynchronous (default)
• Synchronous
Number of replicas:
• Typically 2 (max. 3)
• Can be set to „any” value
Allowed storage sites and nodes (default replica locations, additional replica locations):
• Replicas are typically created in default locations
• Additional replica locations are used in case of failure of the default ones
Storage media type (disk vs tape (HSM)):
• Determined from the combination of allowed storage sites & nodes and knowledge of the deployment infrastructure
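The fallback from default to additional replica locations can be sketched as a small placement function. This is an assumption-laden illustration: the function name, its parameters and the idea of passing in a set of currently available nodes are invented for the example and are not taken from NDS.

```python
def choose_locations(default_locations, additional_locations, available, n_replicas):
    """Prefer default locations; use additional ones only to fill the gaps."""
    # Defaults first, in their configured order, skipping failed nodes.
    chosen = [loc for loc in default_locations if loc in available][:n_replicas]
    # Top up from the additional locations if some defaults are down.
    for loc in additional_locations:
        if len(chosen) >= n_replicas:
            break
        if loc in available and loc not in chosen:
            chosen.append(loc)
    if len(chosen) < n_replicas:
        raise RuntimeError("not enough available storage nodes for the profile")
    return chosen
```

For example, with defaults `["SN-A", "SN-B"]`, an additional location `"SN-C"` and `"SN-B"` unavailable, a two-replica profile would be placed on `"SN-A"` and `"SN-C"`.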
NDS architecture (9)
Replication-related settings (2)
• Replication is configured per profile, NOT per data object (e.g. directory / file)
• Policies are ‚static’ – cannot be changed dynamically
• Users can use one or many profiles:
– Users are assigned to profiles using the DNs of their certificates => multiple certificates have to be used in order to access different profiles
Example profiles:
• Fast, HA & FT space for backups:
– Replication: SYNC
– 3 replicas (1 local + 2 distant), on disks only
• FT space for archives:
– Replication: ASYNC
– 2 replicas on distant nodes; both copies on tapes
• Safe storage space for collaboration:
– Replication: ASYNC
– 2 replicas: 1 local on disk + 1 remote in HSM
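The three example profiles and the DN-based assignment can be written down as plain data. The sketch below is hypothetical throughout: the `Profile` fields, the profile keys and the certificate DNs are made up for illustration and do not reflect the actual NDS configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: mirrors the "static" nature of the policies
class Profile:
    name: str
    mode: str       # "sync" or "async"
    replicas: int
    placement: str  # free-text description of sites and media


PROFILES = {
    "backup": Profile("Fast, HA & FT space for backups", "sync", 3,
                      "1 local + 2 distant, disks only"),
    "archive": Profile("FT space for archives", "async", 2,
                       "distant nodes, both copies on tapes"),
    "collab": Profile("Safe storage space for collaboration", "async", 2,
                      "1 local on disk + 1 remote in HSM"),
}

# Hypothetical certificate DNs: one certificate per profile a user wants to use.
DN_TO_PROFILE = {
    "/C=PL/O=Example/CN=Jan Kowalski backup": "backup",
    "/C=PL/O=Example/CN=Jan Kowalski archive": "archive",
}


def profile_for(dn: str) -> Profile:
    """Resolve the replication settings from the certificate DN alone."""
    return PROFILES[DN_TO_PROFILE[dn]]
```

Since the lookup key is the DN, the same person reaches the backup and the archive space only by presenting two different certificates.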
NDS features vs EUDAT (1)
Automated, TRANSPARENT data replication ⇒ safe, transparent replication service case
Abstract data interfaces above and below NDS:
• SFTP, WebDAV, GridFTP for users
– Possible to interface with NDS from other systems, e.g. 3rd-party transfer to/from NDS
• Data available through VFS layer on ANs:
– Possible to add new access methods
– Some work needed to extend the authentication mechanisms
• Storage access: NFS / GridFTP
– „Any” kind of storage can be used as the backend... as long as it provides an NFS service
[Figure: Access Methods servers (SSH, HTTPS, WebDAV...) on top of the NDS logic and NDS VFS; below them, GridFTP front-ends to storage in front of an HSM system (NFS), a NAS appliance (NFS), an FS with migration (HSM) and „any” other storage (NFS); one element is labelled „’s data centres?”.]
NDS features vs EUDAT (2)
Persistent IDs?
User always sees the same logical structure, regardless of:
• The replication process:
– Physical location is transparent
– Replication process does not affect the logical namespace
• Which Access Node they use:
– The logical structure of VFS is the same everywhere
• What access method they use:
– The logical structure of VFS is presented similarly through different access methods
• Failures:
– As long as at least one replica is OK
Path to the file or directory is constant
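The idea that a logical path stays constant while replicas come and go can be illustrated with a lookup over a replica catalogue. This is a minimal sketch under assumptions: the catalogue layout (logical path mapped to a list of node and physical-path pairs) and the function name are invented here.

```python
def resolve(logical_path, catalog, healthy_nodes):
    """Return one usable (node, physical_path) for a logical path.

    The caller only ever sees logical_path; which replica is used is hidden.
    """
    for node, physical_path in catalog[logical_path]:
        if node in healthy_nodes:
            return node, physical_path
    raise OSError(f"no healthy replica for {logical_path}")


# Hypothetical catalogue content: one logical file with two physical replicas.
CATALOG = {
    "/projects/data.tar": [
        ("SN1", "/export/vol3/0001a7"),
        ("SN2", "/hsm/pool1/93f2c0"),
    ],
}
```

With both nodes healthy the first replica is returned; if `SN1` fails, the same logical path resolves to the copy on `SN2`, and only when no replica is healthy does the access fail.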
NDS features vs EUDAT (3)
User-level metadata:
User can attach free-form text files to data objects; these files can hold metadata
This is done through the Web GUI or a procfs-like mechanism
Metadata search is possible but not yet implemented (on the roadmap)
Can the above be re-used in EUDAT?
Extendability?
Functionality can be easily extended, as the architecture & interfaces are open (PostgreSQL, NFS/GridFTP...)
„micro-services”-like approach possible, but requires effort on the NDS consortium side
For instance: some basic interfaces to meta-data can be defined (e.g. for searching data meeting some criteria)
Example: we are currently designing and developing a mechanism for periodic data integrity checking (data scrubbing)
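The slides do not describe how the scrubbing mechanism works, so the following is only a generic sketch of the idea: recompute a digest for every replica and compare it with a stored reference value. The choice of SHA-1, the function names and the in-memory replica contents are all assumptions.

```python
import hashlib


def sha1_digest(data: bytes) -> str:
    """Hex SHA-1 digest of a replica's content."""
    return hashlib.sha1(data).hexdigest()


def scrub(replicas: dict, expected_digest: str) -> list:
    """Return the names of replicas whose content no longer matches.

    replicas maps a storage node name to the bytes read back from it.
    """
    return sorted(
        node for node, data in replicas.items()
        if sha1_digest(data) != expected_digest
    )
```

A periodic job would run `scrub()` per logical file and schedule re-replication from a good copy for every node it reports.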
NDS2 - features
• Secure data storage:
– Data encrypted on the user side: symmetric file keys are stored in the system, protected by the user’s asymmetric key
– Integrity control on the user side: MD5/SHA-1 digests are stored in the system, encrypted
• Secure data exchange:
– 2-level access control:
– ACLs on virtual filesystem level
– User-side encryption and key exchange make the sharing safe (e.g. if we don’t trust the provider)
• Secure data publication:
– 2 kinds of storage space: private (for internal users) and public (sandboxed)
– Multiple web servers (load-balancing, HA, data synchronisation) to serve data effectively
• Specialised user-side tools needed:
– Java GUI for managing file sharing, ACLs, publication and versioning
– Virtual encrypted (!) filesystem for end-users: both for Linux and Windows
• Status: R&D project (2011-2013); prototype expected in 2Q2013
NDS1: Summary
• Data storage & replication:
– Implemented @ VFS level: portability and security
– Robust and lightweight
• Data replication:
– Automatic, transparent to users
– Sync. and async. modes
– NFS, GridFTP or GridFTP 3rd-party transfer used to access/make replicas
• Meta-data handling & replication:
– Handles file system-level metadata and user-level metadata
– Logically centralized but DR solutions in place for quick recovery
• Logical filesystem structure persistency:
– Physical location-agnostic access
• „Pluggable”:
– Open interfaces, standard interfaces to external world (both user- and storage-side)
– We can provide custom interfaces to meta-data if needed
NDS architecture: (10)
Meta-catalog (1)
• Functionality:
– System-level meta-data storage and handling: file system structure, data replicas
– User-level meta-data storage
• Implementation:
– C++ library used by Data Daemon
– Postgres database with Slony-I replication at the backend
– Separation of namespaces:
– No sharing among user groups assumed
– Security by isolation
– Scalability: multiple instances of MC for multiple user groups / institutions
[Figure: Database Node]
NDS architecture: (11)
Meta-catalog (2)
• Meta-data redundancy: problem: reliability and performance?
– (1) Postgres database with Slony-I replication
– Each meta-catalog replicated asynchronously in master-slaves mode (Slony-I)
– In case of failure of master MC: slave MC is manually selected as master (DR, not full HA, human intervention needed)
– (2) Semi-synchronous data replication:
– All operations on metadata synchronously logged to distributed logs
– In case of failure of master MC: part of the logged operations are repeated on the „new” master (human intervention needed)
• Comments:
– Reliability similar to synchronous DBMS replication but the mechanism is lighter!
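How the two mechanisms combine on failover can be shown with a toy model: the slave holds an asynchronously replicated state that may lag, and the synchronously written log supplies the missing operations. This is a sketch under assumptions, not the NDS implementation; the class and function names, the key/value form of operations and the use of sequence numbers are invented for the example.

```python
class MetaMaster:
    """Toy master meta-catalog that logs every operation before applying it."""

    def __init__(self, logs):
        self.logs = logs  # list of log replicas (plain lists here)
        self.state = {}   # meta-data: key -> value
        self.seq = 0      # sequence number of the last operation

    def execute(self, key, value):
        self.seq += 1
        # Synchronous part: the operation reaches every log replica first.
        for log in self.logs:
            log.append((self.seq, key, value))
        self.state[key] = value
        return self.seq


def recover(slave_state, slave_seq, log):
    """Build the new master's state: lagging slave copy + replay of newer ops."""
    state = dict(slave_state)
    for seq, key, value in log:
        if seq > slave_seq:  # only the operations the slave has not seen
            state[key] = value
    return state
```

If the master fails after three operations while the slave has received only the first, replaying log entries two and three on the promoted slave reproduces the master's last state; in NDS this promotion step is triggered manually.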