XtreemFS – A Cloud File System
Michael Berlin
Zuse Institute Berlin
Contrail Summer School, Almere, 24.07.2012
Funded under: FP7 (Seventh Framework Programme)
Motivation Cloud Storage / Cloud File System
• Cloud Storage Requirements
• highly available
• scalable
• elastic: add and remove capacity
• suitable for wide area networks
• Support for legacy applications
• POSIX-compatible file system required
Outline
• XtreemFS Architecture
• Replication in XtreemFS
• Read-Only File Replication
• Read/Write File Replication
• Custom Replica Placement and Selection
• Metadata Replication
• XtreemFS Use Cases
XtreemFS - A Cloud File System
• History
• 2006 initial development in XtreemOS project
• 2010 further development in Contrail project
• 2012 August: Release 1.3.2
• Features
• Distributed File System
• POSIX compatible
• Replication
• X.509 Certificates and SSL Support
XtreemFS Architecture
Metadata and Replica Catalog (MRC):
– stores metadata per volume
Object Storage Devices (OSDs):
– directly accessed by clients
– file content split into objects
Separation of Metadata
and File Content:
Scalability
• Storage Capacity
• addition and removal of OSDs possible
• OSDs may be used by multiple volumes
• File I/O Throughput
• scales with number of OSDs
• Metadata Throughput
• limited by MRC hardware
Read-Only Replication (1)
• Only for “write-once” files
• File must be marked as “read-only”
• done automatically after close()
• Use Case: CDN
• Replica Types:
1. Full replicas
• complete copy, fills itself as fast as possible
2. Partial replicas
• Initially empty
• on-demand fetching of missing objects
Read/Write Replication (1)
• Primary/backup scheme
• POSIX requires total order of update operations
primary/backup
• Primary fail-over?
• Leases
• grants access to a resource (here: primary role) for a predefined period of time
• Failover after timeout possible
• Assumption: loosely synchronized clocks
Read/Write Replication (2)
Read/Write Replication (3)
Replicated write():
1. Lease Acquisition
Read/Write Replication (4)
Replicated write():
1. Lease Acquisition
Read/Write Replication (5)
Replicated read():
1. Lease Acquisition
1b. “Replica Reset”
update primary’s replica
2. Respond to read() using
local replica
Read/Write Replication: Distributed Lease Acquisition with Flease
• Flease
• Failure tolerant: majority-based
• Scalable: lease per file
• Experiment:
• Zookeeper: 3 servers
• Flease: 3 nodes
(2 randomly selected)
Flease Central Lock Service
Example with 3 (=N) Replicas (W = 2, R = 2)
a) Write
b) Read
Read/Write Replication: Data dissemination
• Ensuring Consistency with Quorum Protocol
• R + W > N
• R - # replicas have to be read from
• W - # replicas have to be updated
• Quorum intersection property • Intersection never empty
• Write All, Read 1 (W = N, R = 1)
• No availability
• Reads from Backup Replicas allowed
• Write Quorum, Read Quorum
• Available if majority reachable
Quorum Read covered by “Replica Reset” phase
Read/Write Replication: Summary
• High up-front costs (for first access to inactive file)
• 3+ round-trips
• 2 for Flease (lease acquisition)
• 1 for Replica Reset
• + further when fetching missing objects
• Minimal cost for subsequent operations
• Read: identical to non-replicated case
• Write: latency increases by time to update majority of backups
Custom Replica Placement and Selection
• Policies
• filter and sort available OSDs/replicas
• evaluates client information (IP address/hostname, estimated latency) “create file on OSD close to me”
“access closest replica”
• Available default policies:
• Server ID
• DNS
• Datacenter Map
• Vivaldi
Metadata Replication
• Replication at database level
• same approach as file R/W replication
• Loosen consistency
• allow stale reads
• All services replicated
XtreemFS Use Cases
• Storage of VM images for IaaS solutions (OpenNebula, ...)
• Storage-as-a-Service: Volumes per User
• XtreemFS as HDFS replacement in Hadoop
XtreemFS and OpenNebula (1)
• Use Case: VM images in OpenNebula cluster
• no distributed file system: scp VM images to hosts
• distributed file system: shared storage, available on all nodes
• Support for live migration
• Fault-tolerant storage of VM images
• Resume VM on another node after crash Use XtreemFS Read/Write file replication
XtreemFS and OpenNebula (2)
• VM deployment
• Create copy (clone) of original VM image
• Run cloned VM image at scheduled host
• (Discard cloned image after VM shutdown)
• Problems
1. cloning time-consuming
2. waste of space
XtreemFS and OpenNebula: qcow2 + Replication
• qcow2 VM image format
• allows snapshots
1. immutable backing file
2. mutable, initially empty snapshot file
instead of cloning, snapshot original VM image (< 1 second)
Use Read/Write replication for snapshot file
• Problem left: run multiple VMs simultaneously
• snapshot file: R/W replication scales with # OSDs and # files
• backing file: bottle neck
XtreemFS and OpenNebula: Benchmark (1)
• OpenNebula Test Cluster
• Frontend + 30 Worker nodes
• Gigabit Ethernet (100 MB/s)
• SATA disk (70 MB/s)
• Setup
• Frontend
• MRC
• OSD (has the ConPaaS VM image)
• Each worker node
• OSD
XtreemFS and OpenNebula: Benchmark (2)
Setup Total Boot Time
copy (1.6 GB image file) 82 seconds (69 seconds for copy)
qcow2, 1 VM 13.6 seconds
qcow2, 30 VMs 20.8 seconds
qcow2, 30 VMs, 30 partial replicas 142.8 seconds
- second run 20.1 seconds
- after second run 17.5 seconds
+ Read/Write Replication on
snapshot file 19.5 seconds
• few read()s on image, no bottleneck yet
Future Research & Work
• Deduplication
• Improved Elasticity
• Fault Tolerance
• Optimize Storage Cost
• Erasure Codes
• Self-*
• Client Cache
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & virtualization (ICT-2009.1.2) Project reference: 257438
Total cost: 11,29 million euro EU contribution: 8,3 million euro
Execution: From 2010-10-01 till 2013-09-30 Duration: 36 months