XtreemFS - a distributed and replicated cloud file system
Michael Berlin
Zuse Institute Berlin
Who we are
–
Zuse Institute Berlin
– operates the HLRN supercomputer
(#63+64)
– Research in Computer Science and
Mathematics
–
Parallel and Distributed Systems
Group
– lead by Prof. Alexander Reinefeld (Humboldt University)
– Distributed and failure-tolerant storage systems
Who we are
Michael Berlin
PhD student since 03/2011
studied Informatik at Humboldt Universität zu Berlin
Diplom thesis dealt with XtreemFS
Motivation
Problem: Multiple copies of data
Where? Copy complete? Different versions? internal Cluster storage external Cluster storage local file server PC external Cluster Nodes internal Cluster Nodes
Motivation (2)
Problem: Different access interfaces
external Cluster storage local file server PC Laptop via 3G/Wi-Fi external Cluster Nodes <parallel file system> NFS/ Samba VPN+?/ SSHFS SCP
Motivation (3)
XtreemFS goals:
Transparency Availability PC Laptop via 3G/Wi-Fi external Cluster Nodes internal Cluster NodesXtreemFS
Outline
1.
XtreemFS Architecture
2.
Client Interfaces
3.
Read-Only Replication
4.
Read-Write Replication
5.
Metadata Replication
6.
Customization through Policies
7.
Security
8.
Use Case: Mosgrid
XtreemFS Architecture (1)
Volume on a Metadata Server:
provides hierarchical namespace
File Content on Storage servers:
accessed directly by clients
internal Cluster storage PC internal Cluster Nodes local file server
XtreemFS Architecture (2)
Metadata and Replica Catalog (MRC):
– holds volumes
Object Storage Devices (OSDs):
– file content split into objects
– objects can be striped across OSDs
object-based
file system
architecture
Scalability
File I/O Throughput
parallel I/O: scales with number of OSDs
Storage Capacity
add and removal of OSDs possible
OSDs may be used by multiple volumes
Metadata Throughput
limited by MRC hardware
use many volumes spread over multiple MRCs
REA
D
W
Accessing Components
Directory Service (DIR)
central registry
all servers (MRC, OSD) register there with their id
provides:
list of available volumes
mapping id URL to service
Client Interfaces
XtreemFS supports POSIX interface and semantics
mount.xtreemfs: using FUSE
runs on Linux, FreeBSD, OS X and Windows (Dokan)
libxtreemfs for Java and C++
PC Laptop via 3G/WiFi external Cluster Nodes
XtreemFS
mount.xtreemfs mount.xtreemfs mount.xtreemfs internal Cluster NodesRead-Only Replication
Requirement: Mark file as read-only
Replica types:
a. Full replica:
requires complete copy
b. Partial replica:
fills itself on demand
instantly ready to use
external Cluster storage external Cluster Nodes internal Cluster storage
Read-Only Replication (3)
Receiver-initiated transfer at object level
OSDs exchange object lists
–
Filling strategies:
Fetch objects
– in order
– rarest first
–
Prefetching available
Read-Write Replication
Availability
Data safety
Allow Modifications
internal Cluster storage PC local file server important.cppRead-Write Replication (2)
Read-Write Replication (3)
Primary/Backup:
1. Lease Acquisition
at most one valid lease per file
revocation
Read-Write Replication (4)
Primary/Backup:
1. Lease Acquisition
at most one valid lease per file
revocation
= lease timeout
Read-Write Replication (5)
Lease Acquisition
XtreemFS: Flease scalable majority-based
Data Dissemination
Update Strategies: Write All, Read 1
Write Quorum, Read Quorum
Flease Central Lock Service
Metadata Replication
Primary/backup replication
volume = database
transparently replicate database
use leases to elect primary
replicate insert/update/delete
Database = Key/Value Store
Customization through Policies
Example:
Which replica shall the client select?
determined by policies
Policies:
– Authentication – Authorization – UID/GID mappings – Replica placement – Replica selection external Cluster storage external Cluster Nodes internal Cluster storage???
Customization through Policies (2)
Replica Placement/Selection Policies:
filter / sort / group replica list
available default policies:
FQDN-based
datacenter map
Vivaldi (latency estimation)
can be chained
own policies possible (Java)
external Cluster storage external Cluster Nodes internal Cluster storage node1.ext-cluster osd1.ext-cluster open() sorted replica list
Security
X.509 certificates support for authentication
SSL to encrypt communication
Laptop via 3G/Wi-Fi external Cluster NodesXtreemFS
mount.xtreemfs w/ user certificate mount.xtreemfs w/ host certificateUse case: Mosgrid
Mosgrid:
ease running experiments in computational chemistry
use grid resources through a web portal
portal allows to submit and retrieve compute jobs
Use case: Mosgrid (2)
PC Cluster NodesXtreemFS
Web Portal Browser Unicore Frontend Results Input Data Retrieve Results Submit Job XtreemFS scope mount.xtreemfs w/ user certificate mount.xtreemfs w/ host certificate libxtreemfs (Java)Snapshots
Backups needed in case of
accidental deletion/modification
virus infections
Snapshot
stable image of the file system at a given point in time
internal Cluster storage PC local file server important.cpp unlink(“important.cpp“)
Snapshots (2)
MRC: create snapshot if requested
OSDs: Copy-on-Write
on modify: create new object instead of overwriting
on delete: only mark as deleted
t t0 snapshot() write("file.txt“) file.txt: V1, t1 write("file.txt“) file.txt: V2, t2
Snapshots (3)
No exact global time: Loosely synchronized clocks
assumption: maximum drift ε
Time span-based snapshots
t t0 snapshot() write("file.txt“) file.txt: V1, t1 write("file.txt“) file.txt: V2, t2 t0 - ε t0 + ε write("file.txt“) file.txt: V2, t2
Snapshots (4)
OSDs: limit number of versions
not version-on-every-write
Instead: close-to-open
problem: client sends no explicit close
implicit close:
create new version if last write at least X seconds ago
Cleanup tool:
deletes versions which belong to no snapshot
Future Research
Self-Tuning
Quota support
Data de-duplication
XtreemFS Software
Open source:
www.xtreemfs.org
Development:
5 core developers at ZIB
integration tests for quality assurance
Community:
users and bug reporters
mailing list with 102 subscribers
Release 1.3:
Thank You!
www.contrail-project.eu
The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization.
GA nr.: FP7-ICT-257438.