Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management

Tevfik Kosar
State University of New York at Buffalo (SUNY), USA

Information Science Reference

Detailed Table of Contents

Preface

Section 1
New Paradigms in Data Intensive Computing

Chapter 1
Data-Aware Distributed Computing
Esma Yildirim, State University of New York at Buffalo (SUNY), USA
Mehmet Balman, Lawrence Berkeley National Laboratory, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

With the continuous increase in the data requirements of scientific and commercial applications, access to remote and distributed data has become a major bottleneck for end-to-end application performance. Traditional distributed computing systems closely couple data access and computation, and generally, data access is considered a side effect of computation. The limitations of traditional distributed computing systems and CPU-oriented scheduling and workflow management tools in managing complex data handling have motivated a newly emerging era: data-aware distributed computing. In this chapter, the authors elaborate on how the most crucial distributed computing components, such as scheduling, workflow management, and end-to-end throughput optimization, can become "data-aware." In this new computing paradigm, called data-aware distributed computing, data placement activities are represented as full-featured jobs in the end-to-end workflow, and they are queued, managed, scheduled, and optimized via a specialized data-aware scheduler. As part of this new paradigm, the authors present a set of tools for mitigating the data bottleneck in distributed computing systems, which consists of three main components: a data-aware scheduler, which provides capabilities such as planning, scheduling, resource reservation, job execution, and error recovery for data movement tasks; integration of these capabilities to the other layers in distributed computing, such as workflow planning; and further optimization of data


Chapter 2
Towards Data Intensive Many-Task Computing
Ioan Raicu, Illinois Institute of Technology, USA & Argonne National Laboratory, USA
Ian Foster, University of Chicago, USA & Argonne National Laboratory, USA
Yong Zhao, University of Electronic Science and Technology of China, China
Alex Szalay, Johns Hopkins University, USA
Philip Little, University of Notre Dame, USA
Christopher M. Moretti, University of Notre Dame, USA
Amitabh Chaudhary, University of Notre Dame, USA
Douglas Thain, University of Notre Dame, USA

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a "data diffusion" approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
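
The locality idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model or eviction policy: the class `NodeCache`, the function `dispatch`, the least-loaded fallback, and the use of LRU eviction are all assumptions chosen only to show how a scheduler might prefer nodes that already cache a task's input.

```python
from collections import OrderedDict


class NodeCache:
    """Per-node file cache with LRU eviction (an assumed stand-in policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # file name -> None, oldest entry first

    def __contains__(self, name):
        return name in self.files

    def access(self, name):
        """Record an access to a file; return True on a cache hit."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return True
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)  # evict the least recently used
        self.files[name] = None
        return False


def dispatch(tasks, caches):
    """Assign each task to a node, preferring nodes that cache its input.

    tasks  -- list of input file names, one per task (hypothetical format)
    caches -- dict mapping node name to NodeCache
    On a miss, the task goes to the node with the fewest assigned tasks.
    Returns (assignments, hit_count).
    """
    load = {node: 0 for node in caches}
    assignments, hits = [], 0
    for name in tasks:
        holders = [n for n in caches if name in caches[n]]
        candidates = holders or list(caches)
        node = min(candidates, key=lambda n: load[n])
        if caches[node].access(name):
            hits += 1
        load[node] += 1
        assignments.append((name, node))
    return assignments, hits
```

Repeated requests for the same file land on the node that fetched it first, so only the first access to each file has to reach shared storage, provided the file has not been evicted in the meantime.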

Chapter 3
Micro-Services: A Service-Oriented Paradigm for Scalable, Distributed Data Management
Arcot Rajasekar, University of North Carolina at Chapel Hill, USA
Mike Wan, University of California at San Diego, USA
Reagan Moore, University of North Carolina at Chapel Hill, USA
Wayne Schroeder, University of California at San Diego, USA

Service-oriented architectures (SOA) enable orchestration of loosely-coupled and interoperable functional software units to develop and execute complex but agile applications. Data management on a distributed data grid can be viewed as a set of operations that are performed across all stages in the life-cycle of a data object. The set of such operations depends on the type of objects, based on their physical and discipline-centric characteristics. In this chapter, the authors define server-side functions, called micro-services, which are orchestrated into conditional workflows for achieving large-scale data management specific to collections of data. Micro-services communicate with each other using parameter exchange, in-memory data structures, a database-based persistent information store, and a network messaging system that uses a serialization protocol for communicating with remote micro-services. The orchestration of the workflow is done by a distributed rule engine that chains and executes the workflows and maintains transactional properties through recovery micro-services. They discuss the micro-service oriented architecture, compare the micro-service approach with traditional SOA, and describe the use of micro-services for implementing policy-based data management systems.
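
As a rough illustration of chaining with recovery, the sketch below runs a list of steps in order and undoes the completed ones when a later step fails. It is a plain Python analogue under stated assumptions, not the rule engine described in the chapter: `run_workflow`, the `(condition, action, recovery)` tuple layout, the shared `context` dictionary (standing in for parameter exchange), and the sample steps are all hypothetical.

```python
def run_workflow(steps, context):
    """Run micro-service-like steps in order; undo completed ones on failure.

    steps   -- list of (condition, action, recovery) callables, each taking
               the shared context dict
    Returns True if every applicable step succeeded, otherwise False.
    """
    completed = []
    for condition, action, recovery in steps:
        if not condition(context):
            continue  # conditional workflow: skip steps that do not apply
        try:
            action(context)
        except Exception as exc:
            context["error"] = str(exc)
            # Recovery steps run in reverse order of completion.
            for undo in reversed(completed):
                undo(context)
            return False
        completed.append(recovery)
    return True


# Hypothetical steps used only to exercise the chain.
def always(ctx):
    return True


def store(ctx):
    ctx["stored"] = True


def unstore(ctx):
    ctx["stored"] = False


def replicate(ctx):
    if ctx.get("replica_site") is None:
        raise RuntimeError("no replica site configured")
    ctx["replicated"] = True


def unreplicate(ctx):
    ctx["replicated"] = False


WORKFLOW = [(always, store, unstore), (always, replicate, unreplicate)]
```

Calling `run_workflow(WORKFLOW, {"replica_site": None})` fails at the replication step and rolls the store step back, which is the all-or-nothing behavior the abstract attributes to recovery micro-services.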


Section 2
Distributed Storage

Chapter 4
Distributed Storage Systems for Data Intensive Computing
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory, USA
Ali R. Butt, Virginia Polytechnic Institute and State University, USA
Xiaosong Ma, North Carolina State University, USA

In this chapter, the authors present an overview of the utility of distributed storage systems in supporting modern applications that are increasingly becoming data intensive. Their coverage of distributed storage systems is based on the requirements imposed by data intensive computing and not a mere summary of storage systems. To this end, they delve into several aspects of supporting data-intensive analysis, such as data staging, offloading, checkpointing, and end-user access to terabytes of data, and illustrate the use of novel techniques and methodologies for realizing distributed storage systems therein. The data deluge from scientific experiments, observations, and simulations is affecting all of the aforementioned day-to-day operations in data-intensive computing. Modern distributed storage systems employ techniques that can help improve application performance, alleviate I/O bandwidth bottleneck, mask failures, and improve data availability. They present key guiding principles involved in the construction of such storage systems, associated tradeoffs, design, and architecture, all with an eye toward addressing challenges of data-intensive scientific applications. They highlight the concepts involved using several case studies of state-of-the-art storage systems that are currently available in the data-intensive computing landscape.

Chapter 5
Metadata Management in PetaShare Distributed Storage Network
Ismail Akturk, Bilkent University, Turkey
Xinqi Wang, Louisiana State University, USA
Tevfik Kosar, State University of New York at Buffalo (SUNY), USA

The unbounded increase in the size of data generated by scientific applications necessitates collaboration and sharing among the nation's education and research institutions. Simply purchasing high-capacity, high-performance storage systems and adding them to the existing infrastructure of the collaborating institutions does not solve the underlying and highly challenging data handling problem. Scientists are compelled to spend a great deal of time and energy on solving basic data-handling issues, such as the physical location of data, how to access it, and/or how to move it to visualization and/or compute resources for further analysis. This chapter presents the design and implementation of a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients that enable easy, transparent, and scalable access. In PetaShare, the authors have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability. The authors also present a high level cross-domain metadata schema to provide a structured systematic view of multiple science domains supported by PetaShare.


Chapter 6
Data Intensive Computing with Clustered Chirp Servers
Douglas Thain, University of Notre Dame, USA
Michael Albrecht, University of Notre Dame, USA
Hoang Bui, University of Notre Dame, USA
Peter Bui, University of Notre Dame, USA
Rory Carmichael, University of Notre Dame, USA
Scott Emrich, University of Notre Dame, USA
Patrick Flynn, University of Notre Dame, USA

Over the last few decades, computing performance, memory capacity, and disk storage have all increased by many orders of magnitude. However, I/O performance has not increased at nearly the same pace: a disk arm movement is still measured in milliseconds, and disk I/O throughput is still measured in megabytes per second. If one wishes to build computer systems that can store and process petabytes of data, they must have large numbers of disks and the corresponding I/O paths and memory capacity to support the desired data rate. A cost efficient way to accomplish this is by clustering large numbers of commodity machines together. This chapter presents Chirp as a building block for clustered data intensive scientific computing. Chirp was originally designed as a lightweight file server for grid computing and was used as a "personal" file server. The authors explore building systems with very high I/O capacity using commodity storage devices by tying together multiple Chirp servers. Several real-life applications such as the GRAND Data Analysis Grid, the Biometrics Research Grid, and the Biocompute Facility use Chirp as their fundamental building block, but provide different services and interfaces appropriate to their target communities.

Section 3
Data & Workflow Management

Chapter 7
A Survey of Scheduling and Management Techniques for Data-Intensive Application Workflows
Suraj Pandey, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
Rajkumar Buyya, The University of Melbourne, Australia

This chapter presents a comprehensive survey of algorithms, techniques, and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable, and automated executions. This chapter focuses on the type of workflows that have tasks processing huge amounts of data, usually in the range from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources for optimizing various objectives: minimize total makespan of the workflow, minimize cost and usage of network bandwidth, minimize cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding the working of the Grid systems providing


Chapter 8
Data Management in Scientific Workflows
Ewa Deelman, University of Southern California, USA
Ann Chervenak, University of Southern California, USA

Scientific applications such as those in astronomy, earthquake science, gravitational-wave physics, and others have embraced workflow technologies to do large-scale science. Workflows enable researchers to collaboratively design, manage, and obtain results that involve hundreds of thousands of steps, access terabytes of data, and generate similar amounts of intermediate and final data products. Although workflow systems are able to facilitate the automated generation of data products, many issues still remain to be addressed. These issues exist in different forms in the workflow lifecycle. This chapter describes a workflow lifecycle as consisting of a workflow generation phase where the analysis is defined, the workflow planning phase where resources needed for execution are selected, the workflow execution part, where the actual computations take place, and the result, metadata, and provenance storing phase. The authors discuss the issues related to data management at each step of the workflow cycle. They describe challenge problems and illustrate them in the context of real-life applications. They discuss the challenges, possible solutions, and open issues faced when mapping and executing large-scale workflows on current cyberinfrastructure. They particularly emphasize the issues related to the management of data throughout the workflow lifecycle.

Chapter 9
Replica Management in Data Intensive Distributed Science Applications
Ann L. Chervenak, University of Southern California, USA
Robert Schuler, University of Southern California, USA

Management of the large data sets produced by data-intensive scientific applications is complicated by the fact that participating institutions are often geographically distributed and separated by distinct administrative domains. A key data management problem in these distributed collaborations has been the creation and maintenance of replicated data sets. This chapter provides an overview of replica management schemes used in large, data-intensive, distributed scientific collaborations. Early replica management strategies focused on the development of robust, highly scalable catalogs for maintaining replica locations. In recent years, more sophisticated, application-specific replica management systems have been developed to support the requirements of scientific Virtual Organizations. These systems have motivated interest in application-independent, policy-driven schemes for replica management that can be tailored to meet the performance and reliability requirements of a range of scientific collaborations. The authors discuss the data replication solutions to meet the challenges associated with increasingly


Section 4
Data Discovery & Visualization

Chapter 10
Data Intensive Computing for Bioinformatics
Judy Qiu, Indiana University - Bloomington, USA
Jaliya Ekanayake, Indiana University - Bloomington, USA
Thilina Gunarathne, Indiana University - Bloomington, USA
Jong Youl Choi, Indiana University - Bloomington, USA
Seung-Hee Bae, Indiana University - Bloomington, USA
Yang Ruan, Indiana University - Bloomington, USA
Saliya Ekanayake, Indiana University - Bloomington, USA
Stephen Wu, Indiana University - Bloomington, USA
Scott Beason, Computer Sciences Corporation, USA
Geoffrey Fox, Indiana University - Bloomington, USA
Mina Rho, Indiana University - Bloomington, USA
Haixu Tang, Indiana University - Bloomington, USA

Data intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in Life Sciences. The recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large datasets coming from high throughput devices of steadily increasing power. They show results for both clustering and dimension reduction algorithms, and the use of MapReduce on modest size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high performance environments such as MPI.
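
For readers unfamiliar with the MapReduce model named in this abstract, the following single-process sketch shows its three phases on a toy sequence task. The task (counting k-mers, i.e. length-k substrings) and every function name here are assumptions chosen for illustration; the chapter's own applications and runtimes are not reproduced.

```python
from collections import defaultdict


def map_kmers(sequence, k):
    """Map phase: emit (k-mer, 1) for every length-k window of a sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(sequences, k):
    """Run map, shuffle, and reduce over a list of sequences."""
    pairs = []
    for seq in sequences:
        pairs.extend(map_kmers(seq, k))
    return reduce_counts(shuffle(pairs))
```

In a real deployment the map calls would run in parallel on different nodes and the shuffle would move data across the network; here all three phases run in one process so the data flow stays visible.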

Chapter 11
Visualization of Large-Scale Distributed Data
Jason Leigh, University of Illinois at Chicago, USA
Andrew Johnson, University of Illinois at Chicago, USA
Luc Renambot, University of Illinois at Chicago, USA
Venkatram Vishwanath, University of Illinois at Chicago, USA & Argonne National Laboratory, USA
Tom Peterka, Argonne National Laboratory, USA
Nicholas Schwarz, Northwestern University, USA

An effective visualization is best achieved through the creation of a proper representation of data and the interactive manipulation and querying of the visualization. Large-scale data visualization is particularly


on an average desktop computer. Large-scale data visualization therefore requires the use of distributed computing. By leveraging the widespread expansion of the Internet and other national and international high-speed network infrastructure such as the National LambdaRail, Internet-2, and the Global Lambda Integrated Facility, data and service providers began to migrate toward a model of widespread distribution of resources. This chapter introduces different instantiations of the visualization pipeline and the historic motivation for their creation. The authors examine individual components of the pipeline in detail to understand the technical challenges that must be solved in order to ensure continued scalability. They discuss distributed data management issues that are specifically relevant to large-scale visualization. They also introduce key data rendering techniques and explain through case studies approaches for scaling them by leveraging distributed computing. Lastly they describe advanced display technologies that are now considered the "lenses" for examining large-scale data.

Chapter 12
On-Demand Visualization on Scalable Shared Infrastructure
Huadong Liu, University of Tennessee, USA
Jinzhu Gao, University of The Pacific, USA
Jian Huang, University of Tennessee, USA
Micah Beck, University of Tennessee, USA
Terry Moore, University of Tennessee, USA

The emergence of high-resolution simulation, where simulation outputs have grown to terascale levels and beyond, raises major new challenges for the visualization community, which is serving computational scientists who want adequate visualization services provided to them on-demand. Many existing algorithms for parallel visualization were not designed to operate optimally on time-shared parallel systems or on heterogeneous systems. They are usually optimized for systems that are homogeneous and have been reserved for exclusive use. This chapter explores the possibility of developing parallel visualization algorithms that can use distributed, heterogeneous processors to visualize cutting edge simulation datasets. The authors study how to effectively support multiple concurrent users operating on the same large dataset, with each focusing on a dynamically varying subset of the data. From a system design point of view, they observe that a distributed cache offers various advantages, including improved scalability. They develop basic scheduling mechanisms that were able to achieve fault-tolerance and load-balancing, optimal use of resources, and flow-control using system-level back-off, while still enforcing deadline driven (i.e. time-critical) visualization.

Compilation of References

About the Contributors