Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management
Tevfik Kosar
State University of New York at Buffalo (SUNY), USA
Information Science Reference

Detailed Table of Contents

Preface

Section 1
New Paradigms in Data Intensive Computing

Chapter 1
Data-Aware Distributed Computing
Esma Yildirim, State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data

Chapter 2
Towards Data Intensive Many-Task Computing
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.

Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA

Service-oriented architectures (SOA) enable orchestration of loosely-coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a distributed data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.

Section 2
Distributed Storage

Chapter 4
Distributed Storage Systems for Data Intensive Computing
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5
Metadata Management in PetaShare Distributed Storage Network
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified namespace and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high level cross-domain metadata schema to provide a structured systematic view of multiple science domains supported by PetaShare.

Chapter 6
Data Intensive Computing with Clustered Chirp Servers
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA

Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.

Section 3
Data & Workflow Management

Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on the type of workflows whose tasks process huge amounts of data, usually in the range from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimize total makespan of the workflow, minimize cost and usage of network bandwidth, minimize cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the working of the Grid systems providing

Chapter 8
Data Management in Scientific Workflows
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase where the analysis is defined, the workflow planning phase where resources needed for execution are selected, the workflow execution part, where the actual computations take place, and the result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow cycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9
Replica Management in Data Intensive Distributed Science Applications
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss the data replication solutions to meet the challenges associated with increasingly

Section 4
Data Discovery & Visualization

Chapter 10
Data Intensive Computing for Bioinformatics
Judy Qiu, Indiana University - Bloomington, USA
Jaliya Ekanayake, Indiana University - Bloomington, USA
Thilina Gunarathne, Indiana University - Bloomington, USA
Jong Youl Choi, Indiana University - Bloomington, USA
Seung-Hee Bae, Indiana University - Bloomington, USA
Yang Ruan, Indiana University - Bloomington, USA
Saliya Ekanayake, Indiana University - Bloomington, USA
Stephen Wu, Indiana University - Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University - Bloomington, USA
Mina Rho, Indiana University - Bloomington, USA
Haixu Tang, Indiana University - Bloomington, USA

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.

Chapter 11
Visualization of Large-Scale Distributed Data
Jason Leigh, University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Tom Peterka, Argonne National Laboratory, USA
Nicholas Schwarz, Northwestern University, USA

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly [...] on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain through case studies approaches for scaling them by leveraging distributed computing. Lastly, they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.

Chapter 12
On-Demand Visualization on Scalable Shared Infrastructure
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of the Pacific, USA
Jian Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which is serving computational scientists who want adequate visualization services provided to them on-demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that were able to achieve fault-tolerance and load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline driven (i.e., time-critical) visualization.

Compilation of References
About the Contributors