(1)

BlobSeer: Enabling Efficient Lock-Free, Versioning-Based Storage for Massive Data under Heavy Access Concurrency

Gabriel Antoniu¹, Luc Bougé², Bogdan Nicolae³
KerData research team

¹ INRIA Rennes - Bretagne-Atlantique, France
² ENS Cachan - Brittany, France
³ University of Rennes 1, France

(2)

New challenges for large-scale data storage

Scalable storage management for new-generation, data-oriented high-performance applications
• Massive, unstructured data objects (Terabytes)
• Many data objects (10³)
• High concurrency (10³ concurrent clients)
• Fine-grain access (Megabytes)
• Large-scale platforms: large clusters, grids, clouds, petascale machines, desktop grids

Applications: distributed, with high-throughput requirements under concurrency
• Map-Reduce-based data-mining applications
• High resolution medical image processing
• Data-intensive HPC simulations
• Storage services for cloud infrastructures
• Checkpointing on desktop grids

A new research team at INRIA Rennes: KerData - http://www.irisa.fr/kerdata/
• Recently created from the PARIS project-team

(3)

BlobSeer: a BLOB-based approach

Generic data-management platform for huge, unstructured data
• Huge data (TB)
• Highly concurrent, fine-grain access (MB): R/W/A
• Prototype available
• Ph.D. theses: Bogdan Nicolae, Alexandra Carpen Amarie, Diana Moise, Viet-Trung Tran

Key design features
• Decentralized metadata management
• Beyond MVCC: multiversioning exposed to the user
• Lock-free concurrent writes (enabled by versioning)

A back-end for higher-level, sophisticated data management systems
• Short term: highly scalable distributed file systems
• Middle term: storage for cloud services
• Long term: extremely large distributed databases

http://blobseer.gforge.inria.fr/

(4)

BlobSeer: key design choices

Each blob is fragmented into equally-sized “pages”
• Allows huge data amounts to be distributed all over the peers
• Avoids contention for simultaneous accesses to disjoint parts of the data block

Metadata: locate pages that make up a given blob
• Fine-grained and distributed
• Efficiently managed through a segment tree over a DHT

Versioning
• Update/append: generate new pages rather than overwrite
• Metadata is extended to incorporate the update
• Both the old and the new version of the blob are accessible
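
To make the paging idea concrete, here is a minimal Python sketch of how a byte range of a blob maps onto equally-sized pages. The function name `pages_for_range` and the sample offsets are assumptions made for illustration; they are not part of BlobSeer's API. The 64 KB page size is borrowed from the evaluation slide, taking 1 KB as 1024 bytes.

```python
def pages_for_range(offset, size, page_size):
    """Return (page_index, start_within_page, length) for each page the range touches."""
    if size <= 0:
        return []
    first = offset // page_size
    last = (offset + size - 1) // page_size
    pieces = []
    for page in range(first, last + 1):
        page_start = page * page_size
        begin = max(offset, page_start)                    # clip to the requested range
        end = min(offset + size, page_start + page_size)
        pieces.append((page, begin - page_start, end - begin))
    return pieces


PAGE_SIZE = 64 * 1024  # assumption: 64 KB pages, 1 KB = 1024 bytes

# A 100,000-byte access starting at byte 100,000 touches pages 1, 2 and 3 only.
touched = pages_for_range(offset=100_000, size=100_000, page_size=PAGE_SIZE)
```

Two clients whose ranges map to disjoint page indices never touch the same page, which is the contention argument made on the slide.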
(5)

BlobSeer: architecture

Clients
• Perform fine-grain blob accesses

Providers
• Store the pages of the blob

Provider manager
• Monitors the providers
• Favors data load balancing

Metadata providers
• Store information about page location

Version manager
• Ensures concurrency control
(6)

How does a read work?

1. Optionally ask the version manager for the latest published version
2. Fetch the corresponding metadata from the metadata providers
3. Contact providers in parallel and fetch the pages into the local buffer
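
The three read steps can be sketched in Python with in-memory dictionaries standing in for the version manager, the metadata providers and the data providers. All names and the data layout below are hypothetical; a thread pool models the parallel page fetch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the BlobSeer processes.
version_manager = {"latest_published": 2}
# metadata[version] lists, in page order, the (provider, page key) of each page.
metadata = {
    1: [("p0", "a"), ("p1", "b")],
    2: [("p0", "a"), ("p2", "c")],  # version 2 replaced the second page only
}
providers = {"p0": {"a": b"AAAA"}, "p1": {"b": b"BBBB"}, "p2": {"c": b"CCCC"}}


def fetch_page(location):
    provider, key = location
    return providers[provider][key]


def read_blob(version=None):
    if version is None:
        version = version_manager["latest_published"]   # step 1 (optional)
    locations = metadata[version]                        # step 2
    with ThreadPoolExecutor(max_workers=len(locations)) as pool:
        pages = list(pool.map(fetch_page, locations))    # step 3, in parallel
    return b"".join(pages)


latest = read_blob()            # reads version 2
older = read_blob(version=1)    # an earlier version stays readable
```

Step 1 is skipped when the caller already knows which version it wants, which is why the slide marks it as optional.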

(7)


How does a write work?

1. Get a list of providers that are able to store the pages, one for each page
2. Contact providers in parallel and write the pages to the corresponding providers
3. Get a version number for the update
4. Add new metadata to consolidate the new version
5. Report that the new version is ready for publication
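
A minimal single-process Python sketch of the five write steps follows. Everything in it is an assumption made for illustration: the round-robin allocation, the tiny 4-byte page, the dictionary-based providers, and the fact that pages are stored sequentially rather than in parallel. Publication is also simplified to an append; ordering between concurrent writers is not modeled here.

```python
import itertools

PAGE = 4  # bytes per page, deliberately tiny for illustration
providers = {"p0": {}, "p1": {}, "p2": {}}      # provider name -> {page id: bytes}
_round_robin = itertools.cycle(sorted(providers))
_page_ids = itertools.count()
_versions = itertools.count(1)
metadata = {}     # version -> list of (provider, page id), in page order
published = []    # versions visible to readers


def write_blob(data):
    chunks = [data[i:i + PAGE] for i in range(0, len(data), PAGE)]
    # 1. Provider manager: one provider per page (round-robin is an assumption).
    targets = [next(_round_robin) for _ in chunks]
    # 2. Store the pages (sequentially here; the slide does this in parallel).
    locations = []
    for provider, chunk in zip(targets, chunks):
        page_id = next(_page_ids)
        providers[provider][page_id] = chunk
        locations.append((provider, page_id))
    # 3. Version manager: get a version number for the update.
    version = next(_versions)
    # 4. Metadata providers: record where the pages of this version live.
    metadata[version] = locations
    # 5. Report that the version is ready for publication.
    published.append(version)
    return version


v1 = write_blob(b"abcdefghij")
stored = b"".join(providers[p][i] for p, i in metadata[v1])
```

Note that the pages are stored before a version number is requested, so the slow part of a write does not hold any version-related resource.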
(8)

How versioning enables efficient, heavy access concurrency

[Diagram: Client #1, Client #2, Providers, Metadata providers, Version manager; Publish, Publish]

Pages are written concurrently by the clients
Versions are assigned in the order the clients finish writing
Metadata is written concurrently by the clients
Versions are published in the order they were assigned
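
The ordering rule can be sketched as a small Python class: versions are handed out in request order, and a version becomes visible only once every earlier version has been reported ready. The class and method names are hypothetical, not BlobSeer's interface, and the sketch is single-threaded.

```python
class VersionManager:
    """Assigns versions in arrival order and publishes them strictly in that order."""

    def __init__(self, initial=1):
        self.last_assigned = initial
        self.last_published = initial
        self._ready = set()   # versions reported ready but not yet published

    def assign(self):
        self.last_assigned += 1
        return self.last_assigned

    def report_ready(self, version):
        """Mark a version ready; return the versions that become published as a result."""
        self._ready.add(version)
        newly_published = []
        while self.last_published + 1 in self._ready:
            self.last_published += 1
            self._ready.remove(self.last_published)
            newly_published.append(self.last_published)
        return newly_published


vm = VersionManager(initial=1)
v_client1 = vm.assign()                  # client #1 finished writing pages first
v_client2 = vm.assign()                  # client #2 follows
after_client2 = vm.report_ready(v_client2)   # client #2 finishes its metadata first
after_client1 = vm.report_ready(v_client1)   # client #1 reports later
```

Only the two short calls to the version manager are ordered; page and metadata writes happen outside it, which is where the concurrency comes from.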

(9)

Metadata zoom (1)

Organized as a segment tree
Each node covers a range of the blob identified by (offset, size)
The first/second half of the range is covered by the left/right child
Each leaf corresponds to a page and holds information about its location

[Tree diagram, node labels as [offset, size]: [0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1]]
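
As an illustration of the (offset, size) labeling, the following Python sketch enumerates the nodes of such a segment tree level by level. The function name is an assumption, and the page count is assumed to be a power of two so that every range splits into two equal halves.

```python
from collections import deque


def segment_tree_nodes(total_pages):
    """List the (offset, size) of every node, level by level, for a power-of-two page count."""
    nodes = []
    queue = deque([(0, total_pages)])   # the root covers the whole blob
    while queue:
        offset, size = queue.popleft()
        nodes.append((offset, size))
        if size > 1:                    # inner node: split the range in two halves
            half = size // 2
            queue.append((offset, half))          # left child: first half
            queue.append((offset + half, half))   # right child: second half
    return nodes


four_page_tree = segment_tree_nodes(4)
```

For a four-page blob this yields the same seven labels as the tree in the diagram, with the four leaves standing for the four pages.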

(10)

[Tree diagram, node labels as [offset, size]: [0, 4] [0, 2] [2, 2] [0, 1] [1, 1] [2, 1] [3, 1] [0, 2] [2, 2] [0, 4] [1, 1] [2, 1] [0, 8] [4, 4] [4, 2] [4, 1]]

Each node holds versioning information
Write/Append: add leaves and build the subtree up to the root; the tree may grow one level
Read: descend from the root towards the leaves
Tree nodes are distributed among metadata providers
Full access concurrency: R/R, R/W, W/W
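
A possible reading of the write, append and read rules is sketched below in Python. A dictionary stands in for the DHT, and nodes are keyed by (version, offset, size); this key layout, the function names and the page location strings are assumptions made for illustration. A write creates new nodes only on the path from the written leaves to a new root, and reuses the untouched subtrees of the previous version.

```python
dht = {}  # (version, offset, size) -> page location for a leaf, (left key, right key) otherwise


def write_version(version, prev_root, root_size, new_pages):
    """Store the nodes of `version` for new_pages {page index: location}; return the root key."""
    def build(offset, size, prev_key):
        # If the tree grew, the old root becomes an ordinary subtree of the new one.
        if prev_key is None and prev_root is not None and (offset, size) == prev_root[1:]:
            prev_key = prev_root
        if not any(offset <= page < offset + size for page in new_pages):
            return prev_key             # untouched range: share the older subtree
        key = (version, offset, size)
        if size == 1:
            dht[key] = new_pages[offset]
        else:
            half = size // 2
            old_left, old_right = dht[prev_key] if prev_key is not None else (None, None)
            dht[key] = (build(offset, half, old_left),
                        build(offset + half, half, old_right))
        return key
    return build(0, root_size, None)


def read_pages(root, first, count):
    """Descend from the root; return {page index: location} for pages in [first, first + count)."""
    found = {}
    def descend(key):
        if key is None:
            return
        _, offset, size = key
        if offset + size <= first or offset >= first + count:
            return
        if size == 1:
            found[offset] = dht[key]
        else:
            for child in dht[key]:
                descend(child)
    descend(root)
    return found


r1 = write_version(1, None, 4, {0: "p0/a", 1: "p1/b", 2: "p2/c", 3: "p0/d"})
r2 = write_version(2, r1, 4, {1: "p1/e", 2: "p2/f"})   # update pages 1 and 2
r3 = write_version(3, r2, 8, {4: "p0/g"})               # append page 4: tree grows one level
```

With these three writes the dictionary holds 7 + 5 + 4 = 16 nodes, and the labels of the nodes created by versions 2 and 3 match the additional labels in the diagram. Readers of version 1 are unaffected because none of its nodes is ever modified.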

(11)

How concurrent writes work by example

Initial version: v = 1
2 concurrent writers: gray and black
Both write their pages independently
Gray is first: it is enqueued on the version manager and assigned version v2; black follows and gets v3
Both write the metadata tree nodes independently: black is faster and links to the (not yet created) node B2
Black is the first to finish; it is marked ready
Gray finishes next; being first in the queue, its root gets published and it is dequeued
Finally, black becomes first in the queue and is published
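
One way black can link to a node that gray has not stored yet is for node keys to be predictable. The Python sketch below assumes that a node is keyed by (version, offset, size) and that a writer knows which page ranges earlier versions cover; both are assumptions made for illustration, as are the page ranges chosen for gray and black.

```python
history = {1: range(0, 4)}   # version -> pages written; v1 wrote the whole 4-page blob


def latest_writer(version, offset, size):
    """Most recent version below `version` whose write touches [offset, offset + size)."""
    for v in range(version - 1, 0, -1):
        pages = history[v]
        if pages.start < offset + size and offset < pages.stop:
            return v
    return None


def child_key(version, offset, size):
    """Key of the node a writer of `version` links to for a range it did not write."""
    owner = latest_writer(version, offset, size)
    return None if owner is None else (owner, offset, size)


history[2] = range(0, 2)   # gray, assigned v2, writes pages 0-1 (hypothetical ranges)
history[3] = range(2, 4)   # black, assigned v3, writes pages 2-3

dht = {}
# Black stores its root first. The left half [0, 2) was not written by black,
# so it links to the node that gray (v2) is going to create for that range.
dht[(3, 0, 4)] = (child_key(3, 0, 2), (3, 2, 2))
dangling = dht[(3, 0, 4)][0] not in dht   # gray's node does not exist yet

# Gray later stores its node under exactly the predicted key.
dht[(2, 0, 2)] = ((2, 0, 1), (2, 1, 1))
resolved = dht[(3, 0, 4)][0] in dht
```

The temporarily dangling link is harmless in this scheme because v3 is not published before v2, so no reader can follow it before gray's node exists.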

(14)

Evaluation: experimental platform

Implementation
Custom RPC layer based on Boost ASIO
Metadata providers rely on a custom simplified DHT

Testbed: Grid’5000
Used the nodes of two sites: Rennes and Orsay
Each node: x86_64 architecture, 4 GB RAM
Internode parameters within the same cluster:
Bandwidth: 117 MB/s with MTU = 1500 B
Latency: 0.1 ms

(15)
(16)

Impact of metadata decentralization under heavy pressure

90 storage machines, on each:
• 1 data provider
• 1 metadata provider

90 client machines, on each:
• 4 writers

Each writer writes 128 consecutive pages of 64 KB, 50 times
Represented: total aggregated bandwidth of all writers
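
The workload size implied by this configuration can be worked out with a few lines of Python. The only assumption is that 1 KB means 1024 bytes; the variable names are illustrative.

```python
KB = 1024  # assumption: binary kilobytes

client_machines = 90
writers_per_client = 4
pages_per_write = 128
page_size = 64 * KB
iterations = 50

writers = client_machines * writers_per_client      # 360 concurrent writers
bytes_per_write = pages_per_write * page_size       # 8 MiB per write
bytes_per_writer = bytes_per_write * iterations     # 400 MiB per writer
total_bytes = writers * bytes_per_writer
total_gib = total_bytes / 1024**3                   # 140.625 GiB in total
```

In other words, each run pushes roughly 140 GiB through 90 data providers, and every write touches 128 page-level metadata entries.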
(17)

Towards a BLOB-based file system

Goal: Build a BLOB-based file system, able to cope with huge data and heavy access concurrency in a large-scale environment

Hierarchical approach
• High-level file system metadata management: the Gfarm grid file system
• Low-level object management: the BlobSeer BLOB management system

(18)

The Gfarm grid file system

The Gfarm file system [University of Tsukuba, Japan]
A distributed file system designed for working at the Grid scale
Files can be shared among all nodes and clients

Main components
• Gfarm's metadata server
• File system nodes
• Gfarm clients
(19)

Why combine Gfarm and BlobSeer?

BlobSeer
• Strengths: access concurrency, fine-grain access, versioning
• Limitation: lack of a POSIX file system interface

Gfarm
• Strengths: POSIX interface, user management, GSI support
• Limitations: file sizes are limited, not suitable for concurrent access, no versioning

Gfarm/BlobSeer
• Access concurrency, huge file sizes, fine-grain access, versioning
• POSIX interface, user management, GSI support

General idea: Gfarm handles file metadata, BlobSeer handles file data

(20)

Coupling Gfarm and BlobSeer [1]

The first approach
• Each storage node (gfsd) connects to BlobSeer to store/get Gfarm file data
• The gfsd manages the mapping from Gfarm files to BLOBs
• The gfsd always acts as an intermediary for data transfer

[Diagram: Gfarm and BlobSeer, steps 1 to 4]
(21)

The first approach (continued)
• Because the gfsd always acts as an intermediary for data transfer, it becomes a bottleneck
(22)

Coupling Gfarm and BlobSeer [2]

Second approach
• The gfsd maps Gfarm files to BLOBs, and provides the client with the BLOB ID
• Then, the client directly accesses data in BlobSeer

[Diagram: Gfarm, steps 1 to 5]
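
The difference between the two coupling approaches can be sketched in Python with two small stand-in classes. `BlobStore`, `Gfsd`, their methods and the sample path are all hypothetical; they only model who moves the data, not the real Gfarm or BlobSeer interfaces.

```python
class BlobStore:
    """Stand-in for BlobSeer: holds BLOB contents by BLOB ID."""

    def __init__(self):
        self.blobs = {}

    def read(self, blob_id, offset, size):
        return self.blobs[blob_id][offset:offset + size]


class Gfsd:
    """Stand-in for a Gfarm storage node that maps Gfarm files to BLOBs."""

    def __init__(self, store):
        self.store = store
        self.file_to_blob = {}
        self.bytes_relayed = 0

    def read_for_client(self, path, offset, size):
        # First approach: the data itself flows through the gfsd.
        data = self.store.read(self.file_to_blob[path], offset, size)
        self.bytes_relayed += len(data)
        return data

    def lookup_blob_id(self, path):
        # Second approach: only the BLOB ID is returned to the client.
        return self.file_to_blob[path]


store = BlobStore()
store.blobs[7] = b"0123456789"
gfsd = Gfsd(store)
gfsd.file_to_blob["/gfarm/data.bin"] = 7

via_gfsd = gfsd.read_for_client("/gfarm/data.bin", 2, 4)
relayed_first = gfsd.bytes_relayed

blob_id = gfsd.lookup_blob_id("/gfarm/data.bin")
direct = store.read(blob_id, 2, 4)          # the client talks to BlobSeer itself
relayed_second = gfsd.bytes_relayed - relayed_first
```

Both paths return the same bytes, but in the second one the gfsd relays no file data, so it no longer limits the aggregate throughput of concurrent clients.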
(23)

Experimental evaluation on Grid'5000 [1]

Access throughput under concurrency

Configuration
• 1 gfmd
• 1 gfsd
• 24 data providers
• Each client accesses 1 GB of a 10 GB file
• Page size 8 MB

Gfarm serializes concurrent accesses

(24)

Experimental evaluation on Grid'5000 [2]

Access throughput under heavy concurrency

Configuration (deployed on 157 nodes)
• 1 gfmd
• 1 gfsd
• Each client accesses 1 GB of a 64 GB file
• Page size 8 MB
• Up to 64 concurrent clients
• 64 data providers
• 24 metadata providers
• 1 version manager
(25)

Work in progress: Introducing versioning in Gfarm/BlobSeer

Clients may access data in a specified file version
Not only roll back data when desired, but also access different file versions within the same computation
Favors efficient access concurrency

Approach
• Delegate versioning management to BlobSeer
• A Gfarm file is mapped to a single BLOB
• A file version is mapped to the corresponding version of the BLOB
(26)

Versioning interface

Versioning capability was fully implemented

At Gfarm API level
• gfs_get_current_version(GFS_File gf, size_t *version)
• gfs_get_latest_version(GFS_File gf, size_t *version)
• gfs_set_version(GFS_File gf, size_t version)
• gfs_pio_vread(size_t nversion, GFS_File gf, void *buffer, int size, int *np)

At POSIX file system level
• Defined some ioctl commands

fd = open(argv[1], O_RDWR);                        /* open the file for reading and writing */
np = pwrite(fd, buffer_w, BUFFER_SIZE, 0);         /* write at offset 0 */
ioctl(fd, BLOB_GET_LATEST_VERSION, &nversion);     /* query the latest version number */
ioctl(fd, BLOB_SET_ACTIVE_VERSION, &nversion);     /* make that version the active one */
np = pread(fd, buffer_r, BUFFER_SIZE, 0);          /* read at offset 0 from the active version */
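
The intended effect of the write, get-latest, set-active, read sequence can be mimicked with a small in-memory Python mock. The class `VersionedFile` and its methods are hypothetical names modeled on the ioctl commands above, and the mock keeps a full copy of the contents per version purely for simplicity, whereas BlobSeer stores only the new pages.

```python
class VersionedFile:
    """In-memory mock: every write yields a new version, reads use the active version."""

    def __init__(self):
        self.versions = [b""]   # version 0 is the empty file
        self.active = None      # None means "follow the latest version"

    def pwrite(self, data, offset):
        content = bytearray(self.versions[-1])
        if len(content) < offset:                       # pad a gap with zero bytes
            content.extend(b"\0" * (offset - len(content)))
        content[offset:offset + len(data)] = data
        self.versions.append(bytes(content))
        return len(data)

    def get_latest_version(self):
        return len(self.versions) - 1

    def set_active_version(self, version):
        if not 0 <= version < len(self.versions):
            raise ValueError("unknown version")
        self.active = version

    def pread(self, size, offset):
        version = self.get_latest_version() if self.active is None else self.active
        return self.versions[version][offset:offset + size]


f = VersionedFile()
f.pwrite(b"hello", 0)
f.pwrite(b"HELLO", 0)
latest = f.get_latest_version()
f.set_active_version(latest)
newest = f.pread(5, 0)
f.set_active_version(1)       # go back to the first written version
oldest = f.pread(5, 0)
```

Switching the active version changes what subsequent reads return without touching the stored data, which is the behavior the ioctl pair is meant to expose.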

(27)

Work in progress: support for MapReduce

Integrating BlobSeer with Yahoo!’s Hadoop MapReduce framework
• Use BlobSeer instead of HDFS

Implemented a Java API for BlobSeer
• Basic file system operations: create, read, write...

BlobSeer File System (BSFS)
• File system namespace - keeps file metadata, maps files to BLOBs
• Data prefetching
• Exposing data distribution
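
As a rough Python illustration of the namespace and of what exposing data distribution could look like, the sketch below maps a file path to a BLOB ID and reports which hosts store the pages of a byte range. All names, hosts and sizes are hypothetical, and using the answer to pick a host for a task is only one plausible use of that information.

```python
namespace = {"/logs/part-0": 42}                              # file path -> BLOB ID
page_hosts = {42: ["node-a", "node-b", "node-a", "node-c"]}   # BLOB ID -> host of each page
PAGE_SIZE = 8   # bytes, deliberately tiny for illustration


def hosts_for_range(path, offset, size):
    """Hosts storing the pages that cover [offset, offset + size) of the file."""
    blob_id = namespace[path]
    first = offset // PAGE_SIZE
    last = (offset + size - 1) // PAGE_SIZE
    return page_hosts[blob_id][first:last + 1]


def preferred_host(path, offset, size):
    """Host holding the most pages of the range (ties broken alphabetically)."""
    hosts = hosts_for_range(path, offset, size)
    return max(sorted(set(hosts)), key=hosts.count)


split_hosts = hosts_for_range("/logs/part-0", 0, 24)
best = preferred_host("/logs/part-0", 0, 24)
```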
(28)

BSFS vs. HDFS: concurrent reads from a shared file

(29)
(30)

Open issues and opportunities for collaboration

BSFS/BlobSeer on Petascale architectures: open issues
• Impact of topology-awareness: multi-level hierarchy
• Impact of data access patterns in Petascale applications
• Coupling topology-aware storage resource management with job scheduling
• Which fault-tolerance mechanisms to use to ensure high availability for data and metadata?
• Which strategy to use for metadata distribution?

BSFS/BlobSeer vs. GPFS?
• BSFS is highly optimized for heavy access concurrency
• Leverage the versioning support?
• An in-depth comparison with data-intensive applications with highly concurrent accesses may prove interesting
• Imagine some cooperation scheme?
