XtreemFS-A Cloud File System

(1)

XtreemFS – A Cloud File System

Michael Berlin

Zuse Institute Berlin

Contrail Summer School, Almere, 24.07.2012

Funded under: FP7 (Seventh Framework Programme)

(2)

Motivation Cloud Storage / Cloud File System

• Cloud Storage Requirements

• highly available

• scalable

• elastic: add and remove capacity

• suitable for wide area networks

• Support for legacy applications

• POSIX-compatible file system required

(3)

Outline

• XtreemFS Architecture

• Replication in XtreemFS

• Read-Only File Replication

• Read/Write File Replication

• Custom Replica Placement and Selection

• Metadata Replication

• XtreemFS Use Cases

(4)

XtreemFS - A Cloud File System

• History

• 2006 initial development in XtreemOS project

• 2010 further development in Contrail project

• 2012 August: Release 1.3.2

• Features

• Distributed File System

• POSIX compatible

• Replication

• X.509 Certificates and SSL Support

(5)

XtreemFS Architecture

Metadata and Replica Catalog (MRC):

– stores metadata per volume

Object Storage Devices (OSDs):

– directly accessed by clients

– _{file content split into objects}

Separation of Metadata

and File Content:

(6)

Scalability

• Storage Capacity

• addition and removal of OSDs possible

• OSDs may be used by multiple volumes

• File I/O Throughput

• scales with number of OSDs

• Metadata Throughput

• limited by MRC hardware

(7)

Read-Only Replication (1)

• Only for “write-once” files

• File must be marked as “read-only”

• done automatically after close()

• Use Case: CDN

• Replica Types:

1. Full replicas

• complete copy, fills itself as fast as possible

2. Partial replicas

• Initially empty

• on-demand fetching of missing objects

(8)

(9)

Read/Write Replication (1)

• Primary/backup scheme

• POSIX requires total order of update operations

 primary/backup

• Primary fail-over?

• Leases

• grants access to a resource (here: primary role) for a predefined period of time

• Failover after timeout possible

• Assumption: loosely synchronized clocks

(10)

Read/Write Replication (2)

(11)

Read/Write Replication (3)

Replicated write():

1. Lease Acquisition

(12)

Read/Write Replication (4)

Replicated write():

1. Lease Acquisition

(13)

Read/Write Replication (5)

Replicated read():

1. Lease Acquisition

1b. “Replica Reset”

 update primary’s replica

2. Respond to read() using

local replica

(14)

Read/Write Replication: Distributed Lease Acquisition with Flease

• Flease

• Failure tolerant: majority-based

• Scalable: lease per file

• Experiment:

• Zookeeper: 3 servers

• Flease: 3 nodes

(2 randomly selected)

Flease Central Lock Service

(15)

Example with 3 (=N) Replicas (W = 2, R = 2)

a) Write

b) Read

Read/Write Replication: Data dissemination

• Ensuring Consistency with Quorum Protocol

• R + W > N

• R - # replicas have to be read from

• W - # replicas have to be updated

• Quorum intersection property • Intersection never empty

• Write All, Read 1 (W = N, R = 1)

• No availability

• Reads from Backup Replicas allowed

• Write Quorum, Read Quorum

• Available if majority reachable

Quorum Read covered by “Replica Reset” phase

(16)

Read/Write Replication: Summary

• High up-front costs (for first access to inactive file)

• 3+ round-trips

• 2 for Flease (lease acquisition)

• 1 for Replica Reset

• + further when fetching missing objects

• Minimal cost for subsequent operations

• Read: identical to non-replicated case

• Write: latency increases by time to update majority of backups

(17)

Custom Replica Placement and Selection

• Policies

• filter and sort available OSDs/replicas

• evaluates client information (IP address/hostname, estimated latency) “create file on OSD close to me”

“access closest replica”

• Available default policies:

• Server ID

• DNS

• Datacenter Map

• Vivaldi

(18)

(19)

Metadata Replication

• Replication at database level

• same approach as file R/W replication

• Loosen consistency

• allow stale reads

• All services replicated

(20)

XtreemFS Use Cases

• Storage of VM images for IaaS solutions (OpenNebula, ...)

• Storage-as-a-Service: Volumes per User

• XtreemFS as HDFS replacement in Hadoop

(21)

XtreemFS and OpenNebula (1)

• Use Case: VM images in OpenNebula cluster

• no distributed file system: scp VM images to hosts

• distributed file system: shared storage, available on all nodes

• Support for live migration

• Fault-tolerant storage of VM images

• Resume VM on another node after crash  Use XtreemFS Read/Write file replication

(22)

XtreemFS and OpenNebula (2)

• VM deployment

• Create copy (clone) of original VM image

• Run cloned VM image at scheduled host

• (Discard cloned image after VM shutdown)

• Problems

1. cloning time-consuming

2. waste of space

(23)

XtreemFS and OpenNebula: qcow2 + Replication

• qcow2 VM image format

• allows snapshots

1. immutable backing file

2. mutable, initially empty snapshot file

 instead of cloning, snapshot original VM image (< 1 second)

 Use Read/Write replication for snapshot file

• Problem left: run multiple VMs simultaneously

• snapshot file: R/W replication scales with # OSDs and # files

• backing file: bottle neck

(24)

XtreemFS and OpenNebula: Benchmark (1)

• OpenNebula Test Cluster

• Frontend + 30 Worker nodes

• Gigabit Ethernet (100 MB/s)

• SATA disk (70 MB/s)

• Setup

• Frontend

• MRC

• OSD (has the ConPaaS VM image)

• Each worker node

• OSD

(25)

XtreemFS and OpenNebula: Benchmark (2)

Setup Total Boot Time

copy (1.6 GB image file) 82 seconds (69 seconds for copy)

qcow2, 1 VM 13.6 seconds

qcow2, 30 VMs 20.8 seconds

qcow2, 30 VMs, 30 partial replicas 142.8 seconds

- second run 20.1 seconds

- after second run 17.5 seconds

+ Read/Write Replication on

snapshot file 19.5 seconds

• few read()s on image, no bottleneck yet

(26)

Future Research & Work

• Deduplication

• Improved Elasticity

• Fault Tolerance

• Optimize Storage Cost

• Erasure Codes

• Self-*

• Client Cache

(27)

Funded under: FP7 (Seventh Framework Programme)

Area: Internet of Services, Software & virtualization (ICT-2009.1.2) Project reference: 257438

Total cost: 11,29 million euro EU contribution: 8,3 million euro

Execution: From 2010-10-01 till 2013-09-30 Duration: 36 months