BlobSeer: Enabling Efficient Lock-Free,
Versioning-Based Storage for Massive Data
under Heavy Access Concurrency
Gabriel Antoniu
1, Luc Bougé
2, Bogdan Nicolae
3KerData research team
1 INRIA Rennes - Bretagne-Atlantique, France 2 ENS Cachan - Brittany, France
3 University of Rennes 1, France
New challenges for large-scale data storage
Scalable storage management for new-generation, data-oriented high-performance applications Massive, unstructured data objects (Terabytes) Many data objects (10³) High concurrency (10³ concurrent clients) Fine-grain access (Megabytes) Large-scale platforms: large clusters, grids, clouds, petascale machines, desktop gridsApplications: distributed, with high-throughput requirements under concurrency
Map-Reduce-based data-mining applications High resolution medical image processing Data-intensive HPC simulations Storage services for cloud infrastructures Checkpointing on desktop grids A new research team at INRIA Rennes: KerData - http://www.irisa.fr/kerdata/ Recently created from the PARIS project-team
BlobSeer: a BLOB-based approach
Generic data-management platform for huge, unstructured data Huge data (TB) Highly concurrent, fine-grain access (MB): R/W/A Prototype available Ph.D. theses: Bogdan Nicolae, Alexandra Carpen Amarie, Diana Moise, Viet-Trung Tran Key design features Decentralized metadata management
Beyond MVCC: multiversioning exposed to the user Lock-free concurrent writes (enabled by versioning) A back-end for higher-level, sophisticated data management systems Short term: highly scalable distributed file systems Middle term: storage for cloud services Long term: extremely large distributed databases http://blobseer.gforge.inria.fr/
BlobSeer: key design choices
Each blob is fragmented into equally-sized “pages”
Allows huge data amounts to be distributed all over the peers Avoids contention for simultaneous accesses to disjoint parts of the data blockMetadata : locate pages that make up a given blob
Fine-grained and distributed Efficiently managed through a segment tree over a DHTVersioning
Update/append: generate new pages rather than overwrite Metadata is extended to incorporate the update Both the old and the new version of the blob are accessibleClients Providers Provider manager Version manager
BlobSeer: architecture
Clients
Perform fine grain blob accessesProviders
Store the pages of the blobProvider manager
Monitors the providers Favors data load balancingMetadata providers
Store information about page locationVersion manager
Ensures concurrency controlHow does a read work?
Client Providers Metadata providers Version manager I II III1. Optionally ask the version
manager for the latest published
version
2. Fetch the corresponding
metadata from the metadata
providers
3. Contact providers
in parallel
and fetch the pages in the local
buffer
1. Get a list of providers that are
able to store the pages, one for each
page
2. Contact providers
in parallel
and
write the pages to the corresponding
providers
3. Get a version number for the
update
4. Add new metadata to consolidate
the new version
5. Report the new version is ready
for publication.
How does a write work?
Client Providers Metadata providers Version manager Provider manager II I III IV VHow versioning enables efficient,
heavy access concurrency
Client
#1 Client #2 Providers Metadataproviders Versionmanager
Publish Publish
Pages are written concurrently by
the clients
Versions are assigned in the
order the clients finish writing
Metadata is written concurrently
by the clients
Versions are published in the
order they were assigned
[0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1]
Organized as a segment tree
Each node covers a range of the
blob identified by (offset, size)
The first/second half of the range
is covered by the left/right child
Each leaf corresponds to a page
and holds information about its
location
Metadata zoom (1)
[0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1] [0, 2] [2, 2] [0, 4] [1, 1] [2, 1] [0, 8] [4, 4] [4, 2] [4, 1]
Each node holds versioning
Information
Write/Append
•
Add leaves and build subtree
up to the root
•
The tree may grow one level
Read: descend from the root
towards the leaves
Tree nodes are
distributed
among
metadata providers
Full access concurrency:
R/R, R/W,
W/W
•
Initial version: v = 1
•
2 concurrent writers: gray and black
•
Both write their pages independently
•
Gray is first, it is enqueued on the
versioning manager and assigned
version v2, black follows and gets v3
•
Both write independently the metadata
tree nodes: black is faster and links to
(the not yet created node) B2
•
First to finish is black, it is marked ready
•
Next is gray, being the first means its
root gets published and it is dequeued
•
Finally black gets first in the queue and
and will be published
How concurrent writes work by example
•
Initial version: v = 1
•
2 concurrent writers: gray and black
•
Both write their pages independently
•
Gray is first, it is enqueued on the
versioning manager and assigned
version v2, black follows and gets v3
•
Both write independently the metadata
tree nodes: black is faster and links to
(the not yet created node) B2
•
First to finish is black, it is marked ready
•
Next is gray, being the first means its
root gets published and it is dequeued
•
Finally black gets first in the queue and
How concurrent writes work by example
•
Initial version: v = 1
•
2 concurrent writers: gray and black
•
Both write their pages independently
•
Gray is first, it is enqueued on the
versioning manager and assigned
version v2, black follows and gets v3
•
Both write independently the metadata
tree nodes: black is faster and links to
(the not yet created node) B2
•
First to finish is black, it is marked ready
•
Next is gray, being the first means its
root gets published and it is dequeued
•
Finally black gets first in the queue and
and will be published
How concurrent writes work by example
Evaluation: experimental platform
Implementation
•
Custom RPC layer based on Boost ASIO
•
Metadata providers rely on a custom simplified DHT
Testbed: Grid’5000
•
Used the nodes of two sites: Rennes and Orsay
•
Each node: x86_64 architecture, 4GB RAM
•
Internode parameters within the same cluster:
•
Bandwidth: 117MB/s with MTU=1500B
•
Latency: 0.1ms
Impact of metadata decentralization
under heavy pressure
90 storage machines, on each: • 1 data provider • 1 metadata provider 90 client machines, on each: • 4 writers Each writer writes 128 consecutive pages of 64KB for 50 times Represented: total aggregated bandwidth for all writersTowards a BLOB-based file system
Goal: Build a BLOB-based file system, able to cope with huge data and heavy access concurrency in a large-scale environment
Hierarchical approach
High-level file system metadata management: the Gfarm grid file system Low-level object management: the BlobSeer BLOB management system
BlobSeer
Gfarm
The Gfarm grid file system
The Gfarm file system [University of Tsukuba, Japan]
A distributed file system designed for working at the Grid scale
File can be shared among all nodes and clients
Main components
Gfarm's metadata server File system nodes Gfarm clientsWhy combine Gfarm and BlobSeer?
Lack of POSIX file
system interface
Access concurrency
Fine-grain access
Versioning
BlobSeer
POSIX interface
User management
GSI support
File sizes are
limited
Not suitable for
concurrent access
No versioning
Gfarm
Access concurrency
Huge file sizes
Fine-grain access
Versioning
Gfarm/BlobSeer
POSIX interface
User management
GSI support
General idea: Gfarm handles file metadata, BlobSeer handles file data
The first approach
Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data The gfsd manage the mapping from Gfarm files to BLOBs The gfsd always acts as an intermediary for data transferCoupling Gfarm and BlobSeer [1]
Gfarm
BlobSeer
1 2 3 4
The first approach
Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data The gfsd manage the mapping from Gfarm files to BLOBs The gfsd always acts as an intermediary for data transfer Bottleneck!Coupling Gfarm and BlobSeer [1]
Gfarm
BlobSeer
1 2 3 4Coupling Gfarm and BlobSeer [2]
Second approach
The gfsd maps Gfarm files to BLOBs,
and provides the client with the BLOB ID
Then, the client directly access data
in BlobSeer
Gfarm
1 2 3 4 5Experimental evaluation on Grid'5000 [1]
Access throughput
under concurrency
Configuration
1 gfmd 1 gfsd 24 data providers Each client accesses 1GB of a 10GB file Page size 8MB Gfarm sequentializes concurrent accesses
Experimental evaluation on Grid'5000 [2]
Access throughput
under heavy
concurrency
Configuration (deployed on 157 nodes)
1 gfmd 1 gfsd Each client accesses 1GB of a 64GB file Page size 8MB Up to 64 concurrent clients 64 data providers 24 metadata providers 1 version managerWork in progress:
Introducting versioning in Gfarm/BlobSeer
Clients may access data in a specified file version
Not only
rollback
data when desired, but also access different file versions
within the same computation
Favors efficient access
concurrency
Approach
Delegate versioning management to BlobSeer A Gfarm file is mapped to a single BLOB A file version is mapped to the corresponding version of the BLOBVersioning interface
Versioning capability was fully implemented
At Gfarm API level
gfs_get_current_version(GFS_File gf,size_t *version) gfs_get_latest_version(GFS_File gf,size_t *version) gfs_set_version(GFS_File gf,size_t version) gfs_pio_vread(size_t nversion,GFS_File gf, void *buffer, int size, int *np) At POSIX file system level
Defined some ioctl commands fd = open(argv[1], 0_RDWR);np = pwrite(fd, buffer_w, BUFFER_SIZE,0);
ioctl(fd, BLOB_GET_LATEST_VERSION, &nversion); ioctl(fd, BLOB_SET_ACTIVE_VERSION, &nversion); np = pread(fd, buffer_r, BUFFER_SIZE,0);
Work in progress: support for MapReduce
Integrating BlobSeer with Yahoo!’s Hadoop MapReduce framework
Use BlobSeer instead of HDFSImplemented a Java API for BlobSeer
Basic file system operations: create, read, write...BlobSeer File System (BSFS)
File system namespace - keeps file metadata, maps files to BLOB’s Data prefetching Exposing data distributionBSFS vs. HDFS: concurrent reads from a
shared file
Open issues and opportunities for collaboration
BSFS/BlobSeer on Petascale architectures: open issues
Impact of topology-awareness: multi-level hierarchy Impact of data access patterns in Petascale applications Coupling topology-aware storage ressource management with job scheduling Which fault-tolerant mechanisms to use to ensure a high availability for data and metadata? Which strategy to use for metadata distribution?BSFS/BlobSeer vs. GPFS?
BSFS is highly optimized for heavy access concurrency Leverage the versioning support? An in-depth comparison with data-intensive applications with highly concurrent accesses may prove interesting Imagine some cooperation scheme?