• No results found

HIGH PERFORMANCE BIG DATA ANALYTICS

N/A
N/A
Protected

Academic year: 2021

Share "HIGH PERFORMANCE BIG DATA ANALYTICS"

Copied!
26
0
0

Loading.... (view fulltext now)

Full text

(1)

HIGH PERFORMANCE

BIG DATA ANALYTICS

Kunle  Olukotun  

 

Electrical  Engineering  and  Computer  Science  

Stanford  University  

   

(2)

Explosion of Data Sources

n 

“DoD is swimming in sensors and drowning in data”

n 

Challenge: enable discovery

n 

Deliver the capability to mine, search and analyze this

data in near real time

(3)

PROCESSING SYSTEMS

FOR BIG DATA

n 

Goal: Make high-performance data analytics easy to

use

n 

Manipulate big data sets in real time

n 

Make better decisions, the right conclusions

n 

Streaming data

n 

Low latency computation

n 

Interactive data exploration

n 

Requires the full power of modern computing

platforms

(4)

Processing

PERFORMANCE TODAY

E xe cut io n T ime R ela tiv e to C++

(5)

PERFORMANCE Vs. EASE

OF USE

Performance

E

as

e

o

f

us

e

Matlab, R ︎ MapReduce ︎ Spark ︎ GraphLab ︎ Pig, Hive ︎ Goal

(6)

Heterogeneous parallelism

Cluster Multicore CPU Graphics Processing Unit (GPU) Programmable Logic

(7)

MPI MapReduce Verilog VHDL CUDA OpenCL Threads OpenMP

Expert PARALLEL

programming

Cluster Multicore CPU Graphics Processing Unit (GPU) Programmable Logic

(8)

8 Applications

Heterogeneous

Hardware New Arch.

THe programmability GAP

Data Wrangling Graph Analysis Prediction Recommendation Data Transformation

(9)

Applications

Heterogeneous

Hardware New Arch.

Scala

Java Python

Ruby C++

Clojure

Not enough semantic knowledge to compile automatically No restrictions

general-purpose languages

Data Wrangling Graph Analysis Prediction Recommendation Data Transformation

(10)

Domain Specific Languages

n 

Domain Specific Languages (DSLs)

n 

Programming language with restricted expressiveness for a

particular domain

(11)

High Performance

DSLs for Data

Analytics

Data Wrangling Graph Analysis Prediction Recommendation Data Transformation Data Query OptiQL Graph Alg. OptiGraph Machine Learning OptiML Convex Opt. OptiCVX Data Transform OptiWrangle Applications Domain Specific Languages Heterogeneous Hardware DSL Compiler New Arch. DSL Compiler DSL Compiler DSL Compiler DSL Compiler

(12)

Scaling the DSL Approach

n 

Many potential DSLs

n 

How do we quickly create high-performance

implementations for DSLs we care about?

n 

Enable expert parallel programmers to easily create

new DSLs

n 

Make optimization knowledge reusable

n 

Simplify the compiler generation process

n 

A few DSL developers enable many more DSL users

(13)

Delite Common DSL Infrastructure

Delite: dSL

infrastructure

Data Wrangling Graph Analysis Prediction Recommendation Data Transformation Data Query OptiQL Graph Alg. OptiGraph Machine Learning OptiML Convex Opt. OptiCVX Data Transform OptiWrangle Applications Domain Specific Languages Heterogeneous Hardware DSL Compiler New Arch. DSL Compiler DSL Compiler DSL Compiler DSL Compiler

(14)

Delite DSL Framework

Delite: dSL

infrastructure

Graph Analysis Prediction Recommendation Data Transformation Data Query OptiQL Graph Alg. OptiGraph Machine Learning OptiML Data Transform OptiWrangle Applications Domain Specific Languages Heterogeneous Hardware DSL Compiler DSL Compiler DSL Compiler DSL Compiler

(15)

Delite Overview

Key elements

n  Scala libraries on steroids n  Intermediate representation n  Domain specific optimization

n  General parallelism and

locality optimizations

n  Mapping to HW targets

Op?{CVX,  Graph,  ML,  QL,  Wrangle}  

Generic  analyses    &   transforma?ons   Domain  specific   analyses  &   transforma?ons   Code  generators   domain ops domain data

parallel data Parallel

patterns

Threads   OpenMP   CUDA   OpenCL   MPI   Verilog  

DSL  U se r   DSL  De ve lope r   D elite  Fr am ew or k  

(16)

optiMl: a Dsl for machine

learning

n 

Designed for Iterative Statistical Inference

n 

e.g. SVMs, logistic regression, neural networks, etc.

n 

Dense/sparse vectors and matrices, message-passing

graphs, training/test sets

n 

Mostly Functional

n 

Data manipulation with classic functional operators (map,

filter) and

ML-specific ones (sum, vector constructor,

untilconverged)

n 

Math with MATLAB-like syntax (a*b, chol(..), exp(..))

n 

Runs Anywhere

(17)

K-MEANs CLUSTERING

PERFORMANCE

Op,ML   Parallelized  MATLAB   C++  

1.0   1.9   3.6   5.2   10.8   0.3   0.3   0.3   0.3   0.3   1.6   0.0   2.0   4.0   6.0   8.0   10.0   12.0  

1  CPU   2  CPU   4  CPU   8  CPU   CPU  +  GPU  

Spe

edup  

K-­‐means  

untilconverged(kMeans,  tol){kMeans  =>  

   val  clusters  =  samples.groupRowsBy  {  sample  =>  

         val  allDistances  =  kMeans.mapRows  {  mean  =>  

                   dist(sample,  mean)                

         }  

         allDistances.minIndex  

   }  

   val  newKmeans  =  clusters.map(e  =>  e.sum  /  e.length)      newKmeans  

(18)

MSM Builder Using OptiML

with Vijay Pande

!

Markov State Models (MSMs) MSMs are a powerful means of modeling the structure and

dynamics of molecular systems, like proteins

x86 ASM

high prod, low perf

low prod, high perf

(19)

OptiGraph

n  A DSL for large-scale graph analysis based on Green-Marl

n  A DSL for Real-world Graph Analysis

n  Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et. al.), ASPLOS ’12

n  Functional DSL

n  No mutation, no explicit iteration

n  Data structures

n  Graph (directed, undirected), node, edge,

n  Set of nodes, edges, neighbors, …

n  Graph traversals

n  Summation, Breadth-first traversal, …

n  Parallel reductions but no deferred assignment

n  Example applications written in framework

n  Betweeness Centrality

n  Directed/Undirected Triangle Counting

(20)

Example: Betweenness

Centrality

n

Betweenness Centrality

n 

A measure that tells how ‘central’ a

node is in the graph

n 

Used in social network analysis

n 

Definition

n 

How many shortest paths are

there between any two nodes

going through this node.

Ayush K. Kehdekar Kevin Bacon High BC Low BC

(21)

OptiGraph Betweenness

Centrality

 

1.  val  bc  =  sum(g.nodes){  s  =>  

2.      val  sigma  =  g.inBfOrder(s)  {  (v,prev_sigma)  =>  if(v==s)  1    

3.          else  sum(v.upNbrs){  w  =>  prev_sigma(w)}  

4.      }  

5.      val  delta  =  g.inRevBfOrder(s){  (v,prev_delta)  =>  

6.          sum(v.downNbrs){  w  =>   7.              sigma(v)/sigma(w)*(1+prev_delta(w))   8.          }   9.      }   10.    delta   11.  }

(22)

OptiGraph: Page Rank

 

1.  val  pr  =  untilConverged(0,threshold){  oldPR  =>  

2.      g.mapNodes{  n  =>    

3.          ((1.0-­‐damp)/g.numNodes)  +  damp*sum(n.inNbrs){  w  =>  

4.          oldPr(w)/w.outDegree  

5.          }   6.      }  

(23)

OptiGraph vs. GPS (Pregel)

Higher Productivity

Algorithm (lines of code) OptiGraph (lines of code) Native GPS

Average Teenage Follower (AvgTeen) 13 130

PageRank 11 110

Conductance (Conduct) 12 149

Single Source Shortest Paths (SSSP) 29 105

Random Bipartite Matching (Bipartite) 47 225

Approximate Betweeness Centrality 25 Not Available

(24)

OptiGraph vs. Native GPS

(Pregel)

Similar Performance

25 x 4 = 100 cores

(25)

TRIANGLE COUNTING

n 

Microsoft Research, IBM, Logic Blox, MIT, and Oracle are trying

to make this application run fast

n 

Simple yet practical example of multi-way join–a fundamental

operation to any data analytics engine

n 

Serves as the building block to many graph mining applications

such as identifying cliques, graph transitivity, and clustering

coefficients  

 

1.  val  triangleCount  =  g.sumOverNodes{  n  =>  

2.    sum(n.nbrs){nbr  =>  n.commonNbrs(nbr).size  }  

(26)

KeyS to TRIANGLE

COUNTING PERFORMANCE

n 

Data layout

n 

Hash

n 

Compressed sparse row

n 

Bit set

n 

Compressed bit set

n 

Dynamically switch based on graph characteristics

n 

E.g. mesh vs. scale free (power law)

n 

Mixed layouts are best in some cases

n 

Dynamic scheduling for load balance

References

Related documents