Arjun Shankar, Ph.D.

Thanks also to: James Horey and Arvind Ramanathan

ORNL Computational Sciences and Engineering Division

SOS, Jekyll Island, Georgia

March 2013

Analytics and Fusion for Distributed Big Data

(2)

2 Managed by UT-Battelle

for the U.S. Department of Energy

Outline

Context

Examples of Distributed Big Data Analytics

Emerging needs as Big Compute and Big Data

(3)

(4)

Volume and rates

Twitter Updates: 400 M/d

Facebook Likes/Comments: 2.7 B/d

Facebook Shared Contents: 30 B/m

World Emails: 419 B/d

YouTube Storage: 76 PB/yr

YouTube Traffic: 16.2 EB/yr

(5)

Volume and rates

2.5 m Telescope: 200 GB/d

Ion Mobility Spectroscopy: 10 TB/d

3D X-ray Diffraction Microscopy: 24 TB/d

Boeing 737 cross-country flight: 240 TB

Personal Location Data: 1 PB/yr

Astrophysics Data: 10 PB (2014)

(6)

Big Data = Volume, Variety, Velocity

(7)

Adam Jacobs, “The Pathologies of Big Data,” 2009

“Data is typically acquired in a transactional fashion... The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it.”

Learning from data is a major problem

How do we harness our infrastructure for data management and data analytics?

(8)

Data Systems and Analytics

Pre-History (1900-1970’s)

Databases (1970)

Networking and Analytics (1980)
• Machine learning
• Large-scale cluster computing

Infrastructure (2000)
• Large scale commodity processing (Google, Hadoop)
• NoSQL systems
• Virtualization

Web 3.0 (2005)
• Mobile
• Semantics, linked-data (IBM – Watson)
• Apps, social-media

(9)

Big Computing Developments

Prehistory (1930-1970): ENIAC, ILLIAC, CDC, Cray

Vector, Pipelined, Compilers, Interconnects, FLOPS

Shared/Distributed Memory Hierarchies, Storage

Multicore, Heterogeneity

Big Data Programming Models

Flexible Data Access

(10)

Examples of Distributed Big Data Analytics

(11)

Large Scale Infrastructures for Analytics

Centralized (aka Big Compute) HPC

Centralized Big Compute/Data platforms

Distributed Big Compute/Data platforms

(12)

Big Data Analytics

1. Centralized (aka Big Compute) HPC
First Principles, Floating Point Solvers, Prospective Data Generation

2. Centralized Big Compute/Data platforms
Discrete/Combinatorial, Retrieval, Indexing Lookup, Query, Graph-lookups, Data Ingest (ETL)

3. Distributed Big Compute/Data platforms
Virtualization and Utility Computing, Machine Learning Stack

4. Wide-Area Distributed Big Data platforms
Distributed Sensing, Correlate, and Actuate – Real-time, HPC Resource Use

(13)

(#2) eBay – Big Data Analytics Example

Killer app: mining web logs

Slides from: Tom Fastener, 2011 eBay Principal Architect, @ High-Performance Transaction Systems, Asilomar

(14)

(#3) Healthcare Overpayments Analysis (@ORNL)

Inaccurate payments (preliminary estimates): $10M, $7M

Categories shown: deceased beneficiaries; omitted from consolidated billing

User submits SQL to Hive

Hive transforms SQL to MapReduce

MapReduce is executed in Hadoop cluster

[Figure: SQL, Hive, MapReduce jobs; NAS, with data loaded into the Hadoop filesystem (HDFS); Hadoop infrastructure with compute nodes, name node, and job tracker; output: fraud patterns]
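
The SQL-to-MapReduce path on this slide can be illustrated with a small in-memory sketch. It mimics how a query such as "total the payments per beneficiary for claims dated after the beneficiary's recorded death" splits into map, shuffle, and reduce phases. The claim records, the field layout, and the `DECEASED_ON` lookup are hypothetical; this is not Hive's query planner and not the ORNL analysis itself.

```python
from collections import defaultdict

# Hypothetical claim records: (beneficiary_id, service_day, amount_paid).
CLAIMS = [
    ("b1", 10, 120.0),
    ("b1", 45, 80.0),
    ("b2", 12, 200.0),
    ("b3", 50, 60.0),
    ("b3", 70, 40.0),
]

# Hypothetical lookup: day on which a beneficiary was recorded as deceased.
DECEASED_ON = {"b1": 30, "b3": 40}


def map_claim(claim):
    """Map phase: emit (beneficiary_id, amount) for claims dated after death."""
    beneficiary_id, service_day, amount = claim
    death_day = DECEASED_ON.get(beneficiary_id)
    if death_day is not None and service_day > death_day:
        yield beneficiary_id, amount


def shuffle(pairs):
    """Shuffle phase: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_amounts(key, values):
    """Reduce phase: total the suspect payments for one beneficiary."""
    return key, sum(values)


def suspect_payments(claims):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    emitted = (pair for claim in claims for pair in map_claim(claim))
    return dict(reduce_amounts(k, v) for k, v in shuffle(emitted).items())


result = suspect_payments(CLAIMS)
print(result)
```

On a real cluster the map and reduce functions would run on different compute nodes over HDFS blocks; the single-process version only shows how the work is partitioned.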

(15)

(#4) Real-Time Distributed Analytics

Centralized data analysis across infrastructure and data modalities is prohibitive (and often too late)!

(16)

Distributed Pattern Storage and Correlation

If (Sensor-noticed-X && Report-said-Y && Within-Last-Day) Notify!

Message flow up a hierarchy

Middleware Optimizes In-Network Storage
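
The notification rule on this slide could be sketched as follows. The `Event` type, its field names, and the reading of "Within-Last-Day" as "both observations fall inside the last 24 hours" are assumptions made for illustration; the slide does not specify them.

```python
from dataclasses import dataclass

DAY_SECONDS = 24 * 60 * 60


@dataclass(frozen=True)
class Event:
    source: str       # "sensor" or "report" (hypothetical labels)
    kind: str         # what was observed, e.g. "X" or "Y"
    timestamp: float  # seconds since a shared epoch


def should_notify(events, now, window=DAY_SECONDS):
    """True if a sensor noticed X and a report said Y, both within the window."""
    recent = [e for e in events if 0 <= now - e.timestamp <= window]
    sensor_noticed_x = any(e.source == "sensor" and e.kind == "X" for e in recent)
    report_said_y = any(e.source == "report" and e.kind == "Y" for e in recent)
    return sensor_noticed_x and report_said_y


events = [Event("sensor", "X", 1000.0), Event("report", "Y", 5000.0)]
print(should_notify(events, now=6000.0))
```

In a hierarchy, a node could evaluate such a rule over the events it stores locally and pass only the resulting notification upward.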

(17)

Processing in the Network

Application-specific open questions

Value of Information

What to keep and what to drop (or send later)?

When to take notice?

Infrastructure

Where in the hierarchy to compute?
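
One simple policy for the "what to keep and what to drop (or send later)" question is a send-on-delta (deadband) filter: forward a reading only when it differs enough from the last value forwarded. The sketch below is an illustrative assumption, not a method proposed in the talk; the function name and threshold are hypothetical.

```python
def send_on_delta(readings, threshold):
    """Split readings into (forwarded, held_back) using a deadband rule."""
    forwarded, held_back = [], []
    last_sent = None
    for reading in readings:
        # Forward the first reading, and any reading that moved past the threshold.
        if last_sent is None or abs(reading - last_sent) > threshold:
            forwarded.append(reading)
            last_sent = reading
        else:
            held_back.append(reading)  # candidate to drop or send later
    return forwarded, held_back


forwarded, held_back = send_on_delta(
    [20.0, 20.1, 20.2, 23.0, 23.1, 19.0], threshold=1.0
)
print(forwarded, held_back)
```

The threshold stands in for a value-of-information judgment, which in practice depends on the application.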

(18)

Big Data Analytics on Big Compute - Emerging Applications

1. Graph analytics

2. Hypothesis-driven to data-driven analytics

3. Compute-analyze-compute paradigm

(19)

(1) Data Analytics Beyond Data-Parallelism

Data-Parallel (Map Reduce)
• Feature Extraction
• Cross Validation
• Computing Sufficient Statistics

Graph-Parallel
• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
• Semi-Supervised Learning: Label Propagation, CoEM
• Graph Analysis: PageRank, Triangle Counting
• Collaborative Filtering: Tensor Factorization

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012

(20)

PageRank

What’s the rank of this user?

Depends on rank of who follows her

Depends on rank of who follows them…

Loops in graph: must iterate!

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012

(21)

PageRank Iteration

Iterate until convergence: “My rank is weighted average of my friends’ ranks”

α is the random reset probability

w_ji is the probability of transitioning (similarity) from j to i

[Figure: vertex i with rank R[i], neighbor j with rank R[j], edge weight w_ji]

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012
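
A minimal sketch of the iteration described on this slide follows. It assumes the commonly used update R[i] = α + (1 − α) Σ_j w_ji R[j], which matches the symbols defined above but is not reproduced in the extracted slide text; the follower graph, the tolerance, and the function name are hypothetical.

```python
def pagerank(weights, alpha=0.15, tol=1e-10, max_iters=1000):
    """Iterate R[i] = alpha + (1 - alpha) * sum_j w[j][i] * R[j] to convergence.

    weights[j][i] is the transition probability from j to i; each vertex's
    outgoing weights are assumed to sum to 1.
    """
    vertices = set(weights)
    for targets in weights.values():
        vertices.update(targets)
    rank = {v: 1.0 for v in vertices}
    for _ in range(max_iters):
        # Every vertex starts from the reset term, then collects neighbor ranks.
        new_rank = {v: alpha for v in vertices}
        for j, targets in weights.items():
            for i, w_ji in targets.items():
                new_rank[i] += (1 - alpha) * w_ji * rank[j]
        change = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical follower graph: an edge j -> i means j follows i.
FOLLOWS = {
    "ann": {"cat": 1.0},
    "bob": {"cat": 0.5, "ann": 0.5},
    "cat": {"ann": 1.0},
}
ranks = pagerank(FOLLOWS)
print(ranks)
```

Each vertex update reads only the ranks of its in-neighbors, which is the local, iterative structure that graph-parallel abstractions exploit.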

(22)

Properties of Graph Parallel Algorithms

Dependency Graph

Iterative Computation (My Rank, Friends Rank)

Local Updates

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012

(23)

Addressing Graph-Parallel ML

Data-Parallel (Map Reduce)
• Feature Extraction
• Cross Validation
• Computing Sufficient Statistics

Graph-Parallel (Map Reduce? Graph-Parallel Abstraction)
• Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
• Semi-Supervised Learning: Label Propagation, CoEM
• Data-Mining: PageRank, Triangle Counting
• Collaborative Filtering: Tensor Factorization

Slide courtesy: Prof. Carlos Guestrin’s GraphLab Workshop, July 2012

GraphLab software tailors abstractions for asynchronous updates, “natural” graphs, in-memory computations, etc.

(24)

(2) Scaling Analytics: Let a Million Hypotheses Bloom ...

Classic scientific discovery scenario: choose relevant D dimensions, evaluate on N samples

Modern data-driven analysis: measure everything, hope to discover relevant D using N samples

The progression to ultrascale: characterized by interdependency (not necessarily redundancy); many inter-related subsystems

[Figure: relations between D and N for each scenario, with N fixed. Legend: error; D: dimensionality; N: number of samples]
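
One way to see why "measure everything" is risky when N is fixed is a small experiment with pure noise: as the number of candidate dimensions D grows, the best-looking correlation with a target grows too, even though no dimension is actually relevant. The sketch below is an illustrative assumption about the point of this slide, not a result from the talk; all names and parameters are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_samples, dims, seed=0):
    """Strongest |correlation| between pure-noise features and a noise target.

    Returns {d: max |r| over the first d features} for each d in dims.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    results, best = {}, 0.0
    for d in range(1, max(dims) + 1):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
        if d in dims:
            results[d] = best
    return results


trend = best_spurious_correlation(n_samples=20, dims=(1, 10, 100, 1000))
for d, r in trend.items():
    print(f"D={d:5d}  strongest |r| among noise features = {r:.2f}")
```

Because each larger D includes the features of the smaller ones, the reported maximum can only stay the same or increase as D grows.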

(25)

(3) Big Data Analytics Integrated with Big Compute: In Situ Machine Learning

Biological Data: high resolution all-atom simulations (Jaguar/Titan), > 1 petabyte

Infer biological function?

Save state (Big Data)

Analyze state (Big Data Analytics)

Run MD (Big Compute)

Save?
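
The compute-analyze-compute idea can be sketched as a loop that analyzes each simulation state in memory and saves only sparse checkpoints, rather than writing every state to storage first. The example uses a 1-D random walk as a stand-in for an MD step and Welford's online algorithm for the running statistics; both choices, and all names, are assumptions for illustration rather than the method used on Jaguar/Titan.

```python
import random


class RunningStats:
    """Welford's online mean and variance: O(1) memory per tracked quantity."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def simulate_step(position, rng, step_size=0.1):
    """Stand-in for one MD step: a 1-D random walk, not real dynamics."""
    return position + rng.gauss(0, step_size)


def run_in_situ(n_steps, checkpoint_every, seed=0):
    """Compute, analyze each state in memory, and save only sparse checkpoints."""
    rng = random.Random(seed)
    stats, checkpoints, position = RunningStats(), [], 0.0
    for step in range(1, n_steps + 1):
        position = simulate_step(position, rng)  # run MD (Big Compute)
        stats.update(position)                   # analyze state in situ
        if step % checkpoint_every == 0:         # save a small subset of states
            checkpoints.append((step, position))
    return stats, checkpoints


stats, checkpoints = run_in_situ(n_steps=10_000, checkpoint_every=1_000)
print(stats.count, len(checkpoints))
```

Here 10,000 states are analyzed but only 10 are kept, which is the storage trade-off the "Save?" question points at.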

(26)

Data Analytic Needs from Big-Compute

• Better programming abstractions and environments for machine learning suites
• Automatic memory hierarchies management libraries
• Automatic storage management libraries
• Simplify communication and synchronization abstractions

DARPA is asking for programming techniques to simplify big data analytics… (19th March 2013)

(27)

Thank you!
