BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency

(1)

BlobSeer: Enabling Efficient Lock-Free,

Versioning-Based Storage for Massive Data

under Heavy Access Concurrency

Gabriel Antoniu

1

, Luc Bougé

2

, Bogdan Nicolae

3

KerData research team

1 _{INRIA Rennes - Bretagne-Atlantique, France} 2 _{ENS Cachan - Brittany, France}

3 _{University of Rennes 1, France}

(2)

New challenges for large-scale data storage

Scalable storage management for new-generation, data-oriented high-performance applications  Massive, unstructured data objects (Terabytes)  Many data objects (10³)  High concurrency (10³ concurrent clients)  Fine-grain access (Megabytes)  Large-scale platforms: large clusters, grids, clouds, petascale machines, desktop grids

Applications: distributed, with high-throughput requirements under concurrency

 Map-Reduce-based data-mining applications  High resolution medical image processing  Data-intensive HPC simulations  Storage services for cloud infrastructures  Checkpointing on desktop grids A new research team at INRIA Rennes: KerData - http://www.irisa.fr/kerdata/  Recently created from the PARIS project-team

(3)

BlobSeer: a BLOB-based approach

Generic data-management platform for huge, unstructured data  Huge data (TB)  Highly concurrent, fine-grain access (MB): R/W/A  Prototype available  Ph.D. theses: Bogdan Nicolae, Alexandra Carpen Amarie, Diana Moise, Viet-Trung Tran Key design features

 Decentralized metadata management

 Beyond MVCC: multiversioning exposed to the user  Lock-free concurrent writes (enabled by versioning) A back-end for higher-level, sophisticated data management systems  Short term: highly scalable distributed file systems  Middle term: storage for cloud services  Long term: extremely large distributed databases http://blobseer.gforge.inria.fr/

(4)

BlobSeer: key design choices

Each blob is fragmented into equally-sized “pages”

 Allows huge data amounts to be distributed all over the peers  Avoids contention for simultaneous accesses to disjoint parts of the data block

Metadata : locate pages that make up a given blob

 Fine-grained and distributed  Efficiently managed through a segment tree over a DHT

Versioning

 Update/append: generate new pages rather than overwrite  Metadata is extended to incorporate the update  Both the old and the new version of the blob are accessible

(5)

Clients Providers Provider manager Version manager

BlobSeer: architecture

Clients

 Perform fine grain blob accesses

Providers

 Store the pages of the blob

Provider manager

 Monitors the providers  Favors data load balancing

Metadata providers

 Store information about page location

Version manager

 Ensures concurrency control

(6)

How does a read work?

Client Providers Metadata providers Version manager I II III

 1. Optionally ask the version

manager for the latest published

version

 2. Fetch the corresponding

metadata from the metadata

providers

 3. Contact providers 

in parallel

and fetch the pages in the local

buffer

(7)

 1. Get a list of providers that are

able to store the pages, one for each

page

 2. Contact providers 

in parallel

 and

write the pages to the corresponding

providers

 3. Get a version number for the

update

 4. Add new metadata to consolidate

the new version

 5. Report the new version is ready

for publication.

How does a write work?

Client Providers Metadata providers Version manager Provider manager II I III IV V

(8)

How versioning enables efficient,

heavy access concurrency

Client

#1 Client #2 Providers Metadataproviders Versionmanager

Publish Publish

Pages are written concurrently by

the clients

Versions are assigned in the

order the clients finish writing

Metadata is written concurrently

by the clients

Versions are published in the

order they were assigned

(9)

[0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1]

Organized as a segment tree

Each node covers a range of the

blob identified by (offset, size)

The first/second half of the range

is covered by the left/right child

Each leaf corresponds to a page

and holds information about its

location

Metadata zoom (1)

(10)

[0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1] [0, 2] [2, 2] [0, 4] [1, 1] [2, 1] [0, 8] [4, 4] [4, 2] [4, 1]

Each node holds versioning

Information

Write/Append

•

Add leaves and build subtree

up to the root

•

The tree may grow one level

Read: descend from the root

towards the leaves

Tree nodes are 

distributed

 among

metadata providers

Full access concurrency: 

R/R, R/W,

W/W

(11)

•

Initial version: v = 1

•

2 concurrent writers: gray and black

•

Both write their pages independently

•

Gray is first, it is enqueued on the

versioning manager and assigned

version v2, black follows and gets v3

•

Both write independently the metadata

tree nodes: black is faster and links to

(the not yet created node) B2

•

First to finish is black, it is marked ready

•

Next is gray, being the first means its

root gets published and it is dequeued

•

Finally black gets first in the queue and

and will be published

How concurrent writes work by example

(12)

•

Both write their pages independently

•

versioning manager and assigned

•

tree nodes: black is faster and links to

•

root gets published and it is dequeued

•

Finally black gets first in the queue and

(13)

•

Both write their pages independently

•

versioning manager and assigned

•

tree nodes: black is faster and links to

•

root gets published and it is dequeued

•

Finally black gets first in the queue and

and will be published

(14)

Evaluation: experimental platform

Implementation

•

Custom RPC layer based on Boost ASIO

•

Metadata providers rely on a custom simplified DHT

Testbed: Grid’5000

•

Used the nodes of two sites: Rennes and Orsay

•

Each node: x86_64 architecture, 4GB RAM

•

Internode parameters within the same cluster:

•

Bandwidth: 117MB/s with MTU=1500B

•

Latency: 0.1ms

(15)

(16)

Impact of metadata decentralization

under heavy pressure

90 storage machines, on each: • 1 data provider • 1 metadata provider 90 client machines, on each: • 4 writers Each writer writes 128 consecutive pages of 64KB for 50 times Represented: total aggregated bandwidth for all writers

(17)

Towards a BLOB-based file system

Goal: Build a BLOB-based file system, able to cope with huge data and heavy access concurrency in a large-scale environment

Hierarchical approach

 High-level ﬁle system metadata management: the Gfarm grid file system  Low-level object management: the BlobSeer BLOB management system

BlobSeer

Gfarm

(18)

The Gfarm grid file system

The Gfarm file system [University of Tsukuba, Japan]

A distributed file system designed for working at the Grid scale

File can be shared among all nodes and clients

Main components

 Gfarm's metadata server  File system nodes  Gfarm clients

(19)

Why combine Gfarm and BlobSeer?

Lack of POSIX file

system interface

Access concurrency

Fine-grain access

Versioning

BlobSeer

POSIX interface

User management

GSI support

File sizes are

limited

Not suitable for

concurrent access

No versioning

Gfarm

Access concurrency

Huge file sizes

Fine-grain access

Versioning

Gfarm/BlobSeer

POSIX interface

User management

GSI support

General idea: Gfarm handles file metadata, BlobSeer handles file data

(20)



The first approach

 Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data  The gfsd manage the mapping from Gfarm files to BLOBs  The gfsd always acts as an intermediary for data transfer

Coupling Gfarm and BlobSeer [1]

Gfarm

BlobSeer

1 2 3 4

(21)



The first approach

 Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data  The gfsd manage the mapping from Gfarm files to BLOBs  The gfsd always acts as an intermediary for data transfer  Bottleneck!

Gfarm

BlobSeer

1 2 3 4

(22)

Second approach

 The gfsd maps Gfarm files to BLOBs,

and provides the client with the BLOB ID

 Then, the client directly access data

in BlobSeer

Gfarm

1 2 3 4 5

(23)

Experimental evaluation on Grid'5000 [1]



Access throughput 

under concurrency



Configuration

 1 gfmd  1 gfsd  24 data providers  Each client accesses 1GB of a 10GB file  Page size 8MB 

Gfarm sequentializes concurrent accesses

(24)

Experimental evaluation on Grid'5000 [2]



Access throughput 

under heavy

concurrency



Configuration (deployed on 157 nodes)

 1 gfmd  1 gfsd  Each client accesses 1GB of a 64GB file  Page size 8MB  Up to 64 concurrent clients  64 data providers  24 metadata providers  1 version manager

(25)

Work in progress:

Introducting versioning in Gfarm/BlobSeer



Clients may access data in a specified file version



Not only 

rollback

 data when desired, but also access different file versions

within the same computation



Favors efficient access 

concurrency



Approach

 Delegate versioning management to BlobSeer  A Gfarm file is mapped to a single BLOB  A file version is mapped to the corresponding version of the BLOB

(26)

Versioning interface



Versioning capability was fully implemented



At Gfarm API level

 gfs_get_current_version(GFS_File gf,size_t *version)‏  gfs_get_latest_version(GFS_File gf,size_t *version)‏  gfs_set_version(GFS_File gf,size_t version)‏  gfs_pio_vread(size_t nversion,GFS_File gf, void *buffer, int size, int *np)‏ 

At POSIX file system level

 Defined some ioctl commands fd = open(argv[1], 0_RDWR);

np = pwrite(fd, buffer_w, BUFFER_SIZE,0);

ioctl(fd, BLOB_GET_LATEST_VERSION, &nversion); ioctl(fd, BLOB_SET_ACTIVE_VERSION, &nversion); np = pread(fd, buffer_r, BUFFER_SIZE,0);

(27)

Work in progress: support for MapReduce

Integrating BlobSeer with Yahoo!’s Hadoop MapReduce framework

  Use BlobSeer instead of HDFS

Implemented a Java API for BlobSeer

 Basic ﬁle system operations: create, read, write...

BlobSeer File System (BSFS)

 File system namespace - keeps ﬁle metadata, maps ﬁles to BLOB’s  Data prefetching  Exposing data distribution

(28)

BSFS vs. HDFS: concurrent reads from a

shared file

(29)

(30)

Open issues and opportunities for collaboration

BSFS/BlobSeer on Petascale architectures: open issues

 Impact of topology-awareness: multi-level hierarchy  Impact of data access patterns in Petascale applications  Coupling topology-aware storage ressource management with job scheduling  Which fault-tolerant mechanisms to use to ensure a high availability for data and metadata?  Which strategy to use for metadata distribution?

BSFS/BlobSeer vs. GPFS?

 BSFS is highly optimized for heavy access concurrency  Leverage the versioning support?  An in-depth comparison with data-intensive applications with highly concurrent accesses may prove interesting  Imagine some cooperation scheme?

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency