• No results found

XtreemFS - a distributed and replicated cloud file system

N/A
N/A
Protected

Academic year: 2021

Share "XtreemFS - a distributed and replicated cloud file system"

Copied!
34
0
0

Loading.... (view fulltext now)

Full text

(1)

XtreemFS - a distributed and replicated cloud file system

Michael Berlin

Zuse Institute Berlin

(2)

Who we are

Zuse Institute Berlin

– operates the HLRN supercomputer

(#63+64)

– Research in Computer Science and

Mathematics

Parallel and Distributed Systems

Group

– lead by Prof. Alexander Reinefeld (Humboldt University)

– Distributed and failure-tolerant storage systems

(3)

Who we are

Michael Berlin

 PhD student since 03/2011

 studied Informatik at Humboldt Universität zu Berlin

 Diplom thesis dealt with XtreemFS

(4)

Motivation

Problem: Multiple copies of data

 Where?  Copy complete?  Different versions? internal Cluster storage external Cluster storage local file server PC external Cluster Nodes internal Cluster Nodes

(5)

Motivation (2)

Problem: Different access interfaces

external Cluster storage local file server PC Laptop via 3G/Wi-Fi external Cluster Nodes <parallel file system> NFS/ Samba VPN+?/ SSHFS SCP

(6)

Motivation (3)

XtreemFS goals:

 Transparency  Availability PC Laptop via 3G/Wi-Fi external Cluster Nodes internal Cluster Nodes

XtreemFS

(7)
(8)

Outline

1.

XtreemFS Architecture

2.

Client Interfaces

3.

Read-Only Replication

4.

Read-Write Replication

5.

Metadata Replication

6.

Customization through Policies

7.

Security

8.

Use Case: Mosgrid

(9)

XtreemFS Architecture (1)

Volume on a Metadata Server:

 provides hierarchical namespace

File Content on Storage servers:

 accessed directly by clients

internal Cluster storage PC internal Cluster Nodes local file server

(10)

XtreemFS Architecture (2)

Metadata and Replica Catalog (MRC):

– holds volumes

Object Storage Devices (OSDs):

– file content split into objects

– objects can be striped across OSDs

object-based

file system

architecture

(11)

Scalability

File I/O Throughput

 parallel I/O: scales with number of OSDs

Storage Capacity

 add and removal of OSDs possible

 OSDs may be used by multiple volumes

Metadata Throughput

 limited by MRC hardware

 use many volumes spread over multiple MRCs

REA

D

W

(12)

Accessing Components

Directory Service (DIR)

 central registry

 all servers (MRC, OSD) register there with their id

 provides:

 list of available volumes

 mapping id  URL to service

(13)

Client Interfaces

XtreemFS supports POSIX interface and semantics

mount.xtreemfs: using FUSE

 runs on Linux, FreeBSD, OS X and Windows (Dokan)

libxtreemfs for Java and C++

PC Laptop via 3G/WiFi external Cluster Nodes

XtreemFS

mount.xtreemfs mount.xtreemfs mount.xtreemfs internal Cluster Nodes

(14)

Read-Only Replication

Requirement: Mark file as read-only

Replica types:

a. Full replica:

 requires complete copy

b. Partial replica:

 fills itself on demand

 instantly ready to use

external Cluster storage external Cluster Nodes internal Cluster storage

(15)
(16)

Read-Only Replication (3)

Receiver-initiated transfer at object level

OSDs exchange object lists

Filling strategies:

Fetch objects

– in order

– rarest first

Prefetching available

(17)

Read-Write Replication

Availability

Data safety

Allow Modifications

internal Cluster storage PC local file server important.cpp

(18)

Read-Write Replication (2)

(19)

Read-Write Replication (3)

Primary/Backup:

1. Lease Acquisition

 at most one valid lease per file

 revocation

(20)

Read-Write Replication (4)

Primary/Backup:

1. Lease Acquisition

 at most one valid lease per file

 revocation

= lease timeout

(21)

Read-Write Replication (5)

Lease Acquisition

 XtreemFS: Flease  scalable  majority-based

Data Dissemination

 Update Strategies:

 Write All, Read 1

 Write Quorum, Read Quorum

Flease Central Lock Service

(22)

Metadata Replication

Primary/backup replication

 volume = database

 transparently replicate database

 use leases to elect primary

 replicate insert/update/delete

Database = Key/Value Store

(23)

Customization through Policies

Example:

 Which replica shall the client select?

 determined by policies

Policies:

– Authentication – Authorization – UID/GID mappings – Replica placement – Replica selection external Cluster storage external Cluster Nodes internal Cluster storage

???

(24)

Customization through Policies (2)

Replica Placement/Selection Policies:

 filter / sort / group replica list

 available default policies:

 FQDN-based

 datacenter map

 Vivaldi (latency estimation)

 can be chained

 own policies possible (Java)

external Cluster storage external Cluster Nodes internal Cluster storage node1.ext-cluster osd1.ext-cluster open() sorted replica list

(25)

Security

X.509 certificates support for authentication

SSL to encrypt communication

Laptop via 3G/Wi-Fi external Cluster Nodes

XtreemFS

mount.xtreemfs w/ user certificate mount.xtreemfs w/ host certificate

(26)

Use case: Mosgrid

Mosgrid:

 ease running experiments in computational chemistry

 use grid resources through a web portal

 portal allows to submit and retrieve compute jobs

(27)

Use case: Mosgrid (2)

PC Cluster Nodes

XtreemFS

Web Portal Browser Unicore Frontend Results Input Data Retrieve Results Submit Job XtreemFS scope mount.xtreemfs w/ user certificate mount.xtreemfs w/ host certificate libxtreemfs (Java)

(28)

Snapshots

Backups needed in case of

 accidental deletion/modification

 virus infections

Snapshot

 stable image of the file system at a given point in time

internal Cluster storage PC local file server important.cpp unlink(“important.cpp“)

(29)

Snapshots (2)

MRC: create snapshot if requested

OSDs: Copy-on-Write

 on modify: create new object instead of overwriting

 on delete: only mark as deleted

t t0 snapshot() write("file.txt“) file.txt: V1, t1 write("file.txt“) file.txt: V2, t2

(30)

Snapshots (3)

No exact global time: Loosely synchronized clocks

 assumption: maximum drift ε

Time span-based snapshots

t t0 snapshot() write("file.txt“) file.txt: V1, t1 write("file.txt“) file.txt: V2, t2 t0 - ε t0 + ε write("file.txt“) file.txt: V2, t2

(31)

Snapshots (4)

OSDs: limit number of versions

 not version-on-every-write

 Instead: close-to-open

 problem: client sends no explicit close

 implicit close:

 create new version if last write at least X seconds ago

Cleanup tool:

 deletes versions which belong to no snapshot

(32)

Future Research

Self-Tuning

Quota support

Data de-duplication

(33)

XtreemFS Software

Open source:

www.xtreemfs.org

Development:

 5 core developers at ZIB

 integration tests for quality assurance

Community:

 users and bug reporters

 mailing list with 102 subscribers

Release 1.3:

(34)

Thank You!

www.contrail-project.eu

 The Contrail project is supported by funding under the Seventh Framework Programme of the European Commission: ICT, Internet of Services, Software and Virtualization.

GA nr.: FP7-ICT-257438.

References:

References

Related documents

• There are 2 automatic hydrological stations which measure and collect data on air and water temperature, water level and water discharge (data transmited to Armstatehydromet

«Medarbeiderskap – bry deg om arbeidsplassen din!» (Ueland, 2013) at det ikke er mangelen på lederskapsteorier som er problemet. Han mener det er evnen til å ta tak og gjennomføre

in relative prices of female-intensive goods can explain changes in female relative wages in the Mexican manufacturing sector before and after NAFTA in 1994.. The results of this

The objectification of women experienced by female characters in the novel Tanah Tabu Anindita S Thayf's works found were 28 data which were divided into 8 parts,

percentage point differential for mode-adjusted VMT (i.e., MVMT) was -90.3—a product of a mean 66.8 percent decline for members and a 23.5 percent increase for non-members over

• samba daemons on cluster nodes need to act as one CIFS server: – view of file ownership.. – windows file

This paper presents the results of a systematic analysis of all judgments handed down by the High Court, Court of Appeal, and House of Lords in defamation claims brought by non-human

These studies were: INdacaterol: Value in COPD: Longer term Validation of Ef ficacy and safety (INVOLVE, Clinicaltrials.gov registration number NCT00393458), a double- blind