(1)

XtreemFS – A Cloud File System

Michael Berlin

Zuse Institute Berlin

Contrail Summer School, Almere, 24.07.2012

Funded under: FP7 (Seventh Framework Programme)

(2)

Motivation: Cloud Storage / Cloud File System

• Cloud Storage Requirements

• highly available

• scalable

• elastic: add and remove capacity

• suitable for wide area networks

• Support for legacy applications

POSIX-compatible file system required

(3)

Outline

• XtreemFS Architecture

• Replication in XtreemFS

• Read-Only File Replication

• Read/Write File Replication

• Custom Replica Placement and Selection

• Metadata Replication

• XtreemFS Use Cases

(4)

XtreemFS - A Cloud File System

• History

• 2006 initial development in XtreemOS project

• 2010 further development in Contrail project

• 2012 August: Release 1.3.2

• Features

• Distributed File System

• POSIX compatible

• Replication

• X.509 Certificates and SSL Support

(5)

XtreemFS Architecture

Separation of Metadata and File Content:

Metadata and Replica Catalog (MRC):

– stores metadata per volume

Object Storage Devices (OSDs):

– directly accessed by clients

– file content split into objects
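To illustrate how file content split into objects can be spread over several OSDs, here is a minimal sketch. It assumes a simple round-robin striping scheme; the class name `StripingPolicy`, the 128 KiB object size, and the OSD names are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripingPolicy:
    """Hypothetical round-robin striping of a file over a list of OSDs."""
    object_size: int   # bytes per object (assumed value, not from the slides)
    osds: tuple        # identifiers of the OSDs that hold the file's objects

    def locate(self, offset: int):
        """Map a byte offset to (object number, OSD, offset inside the object)."""
        object_no = offset // self.object_size
        osd = self.osds[object_no % len(self.osds)]
        return object_no, osd, offset % self.object_size


policy = StripingPolicy(object_size=128 * 1024, osds=("osd-a", "osd-b", "osd-c"))
print(policy.locate(0))
print(policy.locate(300 * 1024))
```

A client that knows the striping parameters can compute the responsible OSD itself, which fits the statement that OSDs are accessed directly by clients.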

(6)

Scalability

• Storage Capacity

• addition and removal of OSDs possible

• OSDs may be used by multiple volumes

• File I/O Throughput

• scales with number of OSDs

• Metadata Throughput

• limited by MRC hardware

(7)

Read-Only Replication (1)

• Only for “write-once” files

• File must be marked as “read-only”

• done automatically after close()

• Use Case: CDN

• Replica Types:

1. Full replicas

• complete copy, fills itself as fast as possible

2. Partial replicas

• Initially empty

• on-demand fetching of missing objects
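The on-demand behaviour of a partial replica can be sketched as follows. This is an illustration only: `PartialReplica` and `fetch_remote` are hypothetical names, and a plain dictionary stands in for a full replica on another OSD.

```python
class PartialReplica:
    """Sketch of a replica that starts empty and fetches objects on first use."""

    def __init__(self, fetch_remote):
        self._objects = {}               # object number -> bytes held locally
        self._fetch_remote = fetch_remote
        self.remote_fetches = 0          # counts how often the remote copy was needed

    def read_object(self, object_no):
        if object_no not in self._objects:
            # Missing object: fetch it once from a replica that has it.
            self._objects[object_no] = self._fetch_remote(object_no)
            self.remote_fetches += 1
        return self._objects[object_no]


# A dictionary plays the role of a full replica in this sketch.
full_replica = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
replica = PartialReplica(fetch_remote=full_replica.__getitem__)

first = replica.read_object(1)    # triggers a remote fetch
second = replica.read_object(1)   # served from the local copy
print(replica.remote_fetches)
```

Only the first read of an object pays the remote fetch; later reads are local.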

(8)
(9)

Read/Write Replication (1)

• Primary/backup scheme

• POSIX requires total order of update operations

→ primary/backup

• Primary fail-over?

• Leases

• grants access to a resource (here: primary role) for a predefined period of time

• Failover after timeout possible

• Assumption: loosely synchronized clocks
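A lease with a timeout and a clock-skew margin might look like the sketch below. It is not the Flease protocol itself; `Lease`, `MAX_CLOCK_SKEW`, and both helper functions are hypothetical, and the margin value is an arbitrary assumption.

```python
from dataclasses import dataclass

# Assumed upper bound on the difference between any two nodes' clocks (seconds).
MAX_CLOCK_SKEW = 0.5


@dataclass(frozen=True)
class Lease:
    holder: str        # node that currently has the primary role
    expires_at: float  # expiry time in seconds


def holder_may_act(lease: Lease, node: str, now: float) -> bool:
    """The holder stops acting as primary one skew margin before expiry."""
    return lease.holder == node and now < lease.expires_at - MAX_CLOCK_SKEW


def may_take_over(lease: Lease, now: float) -> bool:
    """Another node may take over only once its own clock shows the lease expired."""
    return now >= lease.expires_at


lease = Lease(holder="osd-1", expires_at=10.0)
print(holder_may_act(lease, "osd-1", now=9.0))
print(may_take_over(lease, now=9.9))
```

Because the holder gives up early by the skew bound, a second node whose clock already shows the expiry time cannot overlap with a holder that still believes it is primary, as long as the clocks differ by less than `MAX_CLOCK_SKEW`.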

(10)

Read/Write Replication (2)

(11)

Read/Write Replication (3)

Replicated write():

1. Lease Acquisition

(12)

Read/Write Replication (4)

Replicated write():

1. Lease Acquisition

(13)

Read/Write Replication (5)

Replicated read():

1. Lease Acquisition

1b. “Replica Reset”

→ update primary’s replica

2. Respond to read() using local replica

(14)

Read/Write Replication: Distributed Lease Acquisition with Flease

• Flease

• Failure tolerant: majority-based

• Scalable: lease per file

• Experiment:

• Zookeeper: 3 servers

• Flease: 3 nodes (2 randomly selected)

Figure (plot not preserved): Flease vs. Central Lock Service

(15)

Read/Write Replication: Data dissemination

• Ensuring Consistency with Quorum Protocol

• R + W > N

• R: number of replicas that have to be read from

• W: number of replicas that have to be updated

• N: total number of replicas

• Quorum intersection property

• Intersection never empty

• Write All, Read 1 (W = N, R = 1)

• No availability: a single unreachable replica blocks writes

• Reads from Backup Replicas allowed

• Write Quorum, Read Quorum

• Available if majority reachable

Quorum Read covered by “Replica Reset” phase

Figure: Example with 3 (=N) Replicas (W = 2, R = 2): a) Write, b) Read
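The quorum intersection property can be checked by brute force for small N. The sketch below is only an illustration of the condition R + W > N; the function name `quorums_intersect` is hypothetical.

```python
from itertools import combinations


def quorums_intersect(n: int, r: int, w: int) -> bool:
    """True if every read quorum of size r shares a replica with every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)          # non-empty intersection is truthy
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )


# The example from the slide: N = 3, W = 2, R = 2.
print(quorums_intersect(3, r=2, w=2))

# "Write All, Read 1": W = N, R = 1.
print(quorums_intersect(3, r=1, w=3))

# R + W = N is not enough: a read quorum can miss the latest write.
print(quorums_intersect(3, r=1, w=2))
```

Since every read quorum overlaps every write quorum, at least one replica in any read quorum has seen the most recent acknowledged write.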

(16)

Read/Write Replication: Summary

• High up-front costs (for first access to inactive file)

• 3+ round-trips

• 2 for Flease (lease acquisition)

• 1 for Replica Reset

• + further when fetching missing objects

• Minimal cost for subsequent operations

• Read: identical to non-replicated case

• Write: latency increases by time to update majority of backups
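The cost summary can be turned into a small back-of-the-envelope model. This is a sketch under two assumptions that the slides do not state: the primary's own local write counts toward the majority, and backups are updated in parallel. Both function names are hypothetical.

```python
def first_access_round_trips(object_fetch_round_trips: int = 0) -> int:
    """Round-trips for the first access to an inactive file, per the summary above."""
    flease_round_trips = 2          # lease acquisition
    replica_reset_round_trips = 1   # Replica Reset
    return flease_round_trips + replica_reset_round_trips + object_fetch_round_trips


def added_write_latency(backup_ack_times_ms):
    """Extra write latency: time until enough backups acknowledged to form a majority.

    Assumes the primary counts toward the majority and backups are updated in parallel.
    """
    total_replicas = len(backup_ack_times_ms) + 1
    majority = total_replicas // 2 + 1
    backup_acks_needed = majority - 1
    if backup_acks_needed == 0:
        return 0.0
    return sorted(backup_ack_times_ms)[backup_acks_needed - 1]


print(first_access_round_trips())
print(added_write_latency([5.0, 40.0]))
```

Under these assumptions, a slow backup does not delay writes as long as a majority of replicas responds quickly.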

(17)

Custom Replica Placement and Selection

• Policies

• filter and sort available OSDs/replicas

• evaluate client information (IP address/hostname, estimated latency)

• “create file on OSD close to me”

• “access closest replica”

• Available default policies:

• Server ID

• DNS

• Datacenter Map

• Vivaldi
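A policy chain that filters and sorts OSDs could be sketched like this. It is not the XtreemFS policy interface; `Osd`, the two policy functions, `apply_policies`, and the client dictionary are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Osd:
    uuid: str
    datacenter: str
    free_bytes: int


def filter_has_free_space(osds, client):
    """Filter policy: drop OSDs without free capacity."""
    return [osd for osd in osds if osd.free_bytes > 0]


def sort_same_datacenter_first(osds, client):
    """Sort policy: OSDs in the client's datacenter come first (stable sort)."""
    return sorted(osds, key=lambda osd: osd.datacenter != client["datacenter"])


def apply_policies(osds, client, policies):
    """Run the policies in order; each one receives the previous one's output."""
    for policy in policies:
        osds = policy(osds, client)
    return osds


osds = [
    Osd("osd-1", "berlin", free_bytes=0),
    Osd("osd-2", "amsterdam", free_bytes=10**9),
    Osd("osd-3", "berlin", free_bytes=10**9),
]
client = {"datacenter": "berlin"}
ranked = apply_policies(osds, client, [filter_has_free_space, sort_same_datacenter_first])
print([osd.uuid for osd in ranked])
```

The same chain shape serves both purposes named above: picking OSDs for a new file and ordering existing replicas for a client.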

(18)
(19)

Metadata Replication

• Replication at database level

• same approach as file R/W replication

• Loosen consistency

• allow stale reads

• All services replicated

(20)

XtreemFS Use Cases

• Storage of VM images for IaaS solutions (OpenNebula, ...)

• Storage-as-a-Service: Volumes per User

• XtreemFS as HDFS replacement in Hadoop

(21)

XtreemFS and OpenNebula (1)

• Use Case: VM images in OpenNebula cluster

• no distributed file system: scp VM images to hosts

• distributed file system: shared storage, available on all nodes

• Support for live migration

• Fault-tolerant storage of VM images

• Resume VM on another node after crash

→ Use XtreemFS Read/Write file replication

(22)

XtreemFS and OpenNebula (2)

• VM deployment

• Create copy (clone) of original VM image

• Run cloned VM image at scheduled host

• (Discard cloned image after VM shutdown)

• Problems

1. cloning time-consuming

2. waste of space

(23)

XtreemFS and OpenNebula: qcow2 + Replication

• qcow2 VM image format

• allows snapshots

1. immutable backing file

2. mutable, initially empty snapshot file

→ instead of cloning, snapshot original VM image (< 1 second)

→ Use Read/Write replication for snapshot file

• Problem left: run multiple VMs simultaneously

• snapshot file: R/W replication scales with # OSDs and # files

• backing file: bottleneck
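Creating a snapshot file on top of an immutable backing file is typically done with `qemu-img create`. The sketch below only builds the command line; the paths, the helper name `qcow2_snapshot_command`, and the `raw` backing format are assumptions for illustration, not details from the slides.

```python
import shlex


def qcow2_snapshot_command(backing_file, snapshot_file, backing_format="raw"):
    """Build a qemu-img command that creates an empty qcow2 file backed by backing_file."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",            # format of the new snapshot file
        "-b", backing_file,       # immutable backing file (original VM image)
        "-F", backing_format,     # format of the backing file (assumed here)
        snapshot_file,
    ]


# Hypothetical paths on a mounted XtreemFS volume.
cmd = qcow2_snapshot_command(
    "/mnt/xtreemfs/images/base.img",
    "/mnt/xtreemfs/vms/vm-01.qcow2",
)
print(shlex.join(cmd))
```

To actually create the file, the list can be passed to `subprocess.run(cmd, check=True)` on a host where `qemu-img` is installed.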

(24)

XtreemFS and OpenNebula: Benchmark (1)

• OpenNebula Test Cluster

• Frontend + 30 Worker nodes

• Gigabit Ethernet (100 MB/s)

• SATA disk (70 MB/s)

• Setup

• Frontend

• MRC

• OSD (has the ConPaaS VM image)

• Each worker node

• OSD

(25)

XtreemFS and OpenNebula: Benchmark (2)

Setup                                          Total Boot Time
copy (1.6 GB image file)                       82 seconds (69 seconds for copy)
qcow2, 1 VM                                    13.6 seconds
qcow2, 30 VMs                                  20.8 seconds
qcow2, 30 VMs, 30 partial replicas             142.8 seconds
  - second run                                 20.1 seconds
  - after second run                           17.5 seconds
  + Read/Write Replication on snapshot file    19.5 seconds

• few read()s on image, no bottleneck yet

(26)

Future Research & Work

• Deduplication

• Improved Elasticity

• Fault Tolerance

• Optimize Storage Cost

• Erasure Codes

• Self-*

• Client Cache

(27)

Funded under: FP7 (Seventh Framework Programme)

Area: Internet of Services, Software & virtualization (ICT-2009.1.2)

Project reference: 257438

Total cost: 11.29 million euro

EU contribution: 8.3 million euro

Execution: From 2010-10-01 till 2013-09-30

Duration: 36 months
