HIGH PERFORMANCE
BIG DATA ANALYTICS
Kunle Olukotun
Electrical Engineering and Computer Science
Stanford University
Explosion of Data Sources
- "DoD is swimming in sensors and drowning in data"
- Challenge: enable discovery
- Deliver the capability to mine, search, and analyze this data in near real time
PROCESSING SYSTEMS
FOR BIG DATA
- Goal: make high-performance data analytics easy to use
- Manipulate big data sets in real time
- Make better decisions and reach the right conclusions
- Streaming data
- Low-latency computation
- Interactive data exploration
- Requires the full power of modern computing platforms
Processing
PERFORMANCE TODAY
[Figure: execution time relative to C++]
PERFORMANCE VS. EASE OF USE
[Figure: performance vs. ease of use. Plotted: Matlab, R; MapReduce; Spark; GraphLab; Pig, Hive; Goal]
Heterogeneous parallelism
- Cluster: MPI, MapReduce
- Multicore CPU: Threads, OpenMP
- Graphics Processing Unit (GPU): CUDA, OpenCL
- Programmable Logic: Verilog, VHDL
EXPERT PARALLEL PROGRAMMING
[Diagram: Cluster, Multicore CPU, Graphics Processing Unit (GPU), Programmable Logic]
[Diagram: Applications above Heterogeneous Hardware and New Arch.]
THE PROGRAMMABILITY GAP
[Diagram: Applications (Data Wrangling, Graph Analysis, Prediction, Recommendation, Data Transformation) above Heterogeneous Hardware and New Arch.]
General-purpose languages (Scala, Java, Python, Ruby, C++, Clojure): no restrictions, so not enough semantic knowledge to compile automatically
Domain Specific Languages
- Domain Specific Languages (DSLs)
  - A programming language with restricted expressiveness for a particular domain
High Performance
DSLs for Data
Analytics
[Diagram: Applications (Data Wrangling, Graph Analysis, Prediction, Recommendation, Data Transformation); Domain Specific Languages (OptiQL for data query, OptiGraph for graph algorithms, OptiML for machine learning, OptiCVX for convex optimization, OptiWrangle for data transformation), each with a DSL compiler; Heterogeneous Hardware and New Arch.]
Scaling the DSL Approach
- Many potential DSLs
- How do we quickly create high-performance implementations for the DSLs we care about?
- Enable expert parallel programmers to easily create new DSLs
- Make optimization knowledge reusable
- Simplify the compiler generation process
- A few DSL developers enable many more DSL users
Delite Common DSL Infrastructure
DELITE: DSL INFRASTRUCTURE
[Diagram: Applications (Data Wrangling, Graph Analysis, Prediction, Recommendation, Data Transformation); Domain Specific Languages (OptiQL, OptiGraph, OptiML, OptiCVX, OptiWrangle), each with a DSL compiler; Heterogeneous Hardware and New Arch.]
Delite DSL Framework
[Diagram: Applications (Graph Analysis, Prediction, Recommendation, Data Transformation); Domain Specific Languages (OptiQL, OptiGraph, OptiML, OptiWrangle), each with a DSL compiler; Heterogeneous Hardware]
Delite Overview
Key elements
- Scala libraries on steroids
- Intermediate representation
- Domain specific optimization
- General parallelism and locality optimizations
- Mapping to HW targets
[Diagram with layers labeled DSL User, DSL Developer, and Delite Framework: Opti{CVX, Graph, ML, QL, Wrangle}; domain ops and domain data; parallel data and parallel patterns; domain specific analyses & transformations; generic analyses & transformations; code generators; targets Threads, OpenMP, CUDA, OpenCL, MPI, Verilog]
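The key elements above (a library front end that builds an intermediate representation, an optimization pass, and a back end) can be illustrated with a toy example in plain Scala. This is a minimal sketch and not Delite's API: `TinyDsl`, `Exp`, `Source`, `MapOp`, `SumOp`, `fuse`, `mapCount`, and `eval` are all hypothetical names, and the sequential interpreter only stands in for a real code generator.

```scala
// Toy illustration only: these types are hypothetical and are not Delite's API.
object TinyDsl {
  // Intermediate representation: DSL calls build a tree instead of running immediately.
  sealed trait Exp
  final case class Source(data: Vector[Double]) extends Exp
  final case class MapOp(in: Exp, f: Double => Double) extends Exp
  final case class SumOp(in: Exp) extends Exp

  // Optimization on a parallel pattern: fuse map(map(x, f), g) into a single map.
  def fuse(e: Exp): Exp = e match {
    case MapOp(in, g) =>
      fuse(in) match {
        case MapOp(inner, f) => MapOp(inner, f andThen g)
        case other           => MapOp(other, g)
      }
    case SumOp(in) => SumOp(fuse(in))
    case s: Source => s
  }

  // Count map nodes, to observe the effect of fusion.
  def mapCount(e: Exp): Int = e match {
    case MapOp(in, _) => 1 + mapCount(in)
    case SumOp(in)    => mapCount(in)
    case _: Source    => 0
  }

  // Stand-in for a code generator: a sequential interpreter over the tree.
  def eval(e: Exp): Vector[Double] = e match {
    case Source(d)    => d
    case MapOp(in, f) => eval(in).map(f)
    case SumOp(in)    => Vector(eval(in).sum)
  }
}
```

Because the program is data, the fusion pass can remove an intermediate collection before anything runs, which is the kind of rewrite an ordinary library call cannot perform on its own.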
OPTIML: A DSL FOR MACHINE LEARNING
- Designed for Iterative Statistical Inference
  - e.g. SVMs, logistic regression, neural networks, etc.
  - Dense/sparse vectors and matrices, message-passing graphs, training/test sets
- Mostly Functional
  - Data manipulation with classic functional operators (map, filter) and ML-specific ones (sum, vector constructor, untilconverged)
  - Math with MATLAB-like syntax (a*b, chol(..), exp(..))
- Runs Anywhere
K-MEANS CLUSTERING PERFORMANCE
[Figure: k-means speedup for OptiML, Parallelized MATLAB, and C++ on 1 CPU, 2 CPU, 4 CPU, 8 CPU, and CPU + GPU. Data labels in the chart: 1.0, 1.9, 3.6, 5.2, 10.8, 0.3, 0.3, 0.3, 0.3, 0.3, 1.6. Speedup axis from 0.0 to 12.0.]
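untilconverged(kMeans, tol) { kMeans =>
  // assign each sample to the index of its nearest mean
  val clusters = samples.groupRowsBy { sample =>
    val allDistances = kMeans.mapRows { mean =>
      dist(sample, mean)
    }
    allDistances.minIndex
  }
  // the new mean of each cluster is the average of its samples
  val newKmeans = clusters.map(e => e.sum / e.length)
  newKmeans
}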
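For readers without OptiML at hand, the same loop can be sketched with the standard Scala collections. This is an illustrative approximation, not OptiML's implementation: `KMeansSketch`, `Point`, `mean`, and the `maxIter` cap are assumptions, `dist` is taken to be squared Euclidean distance, and an empty cluster simply keeps its previous mean.

```scala
// Plain-Scala sketch of the k-means loop; names are illustrative, not OptiML's API.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension (assumed metric).
  def dist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / ps.length).toVector

  // Repeat `step` until successive results move by less than `tol` in total.
  @annotation.tailrec
  def untilConverged(init: Vector[Point], tol: Double, maxIter: Int = 100)(
      step: Vector[Point] => Vector[Point]): Vector[Point] = {
    val next = step(init)
    val change = init.zip(next).map { case (a, b) => dist(a, b) }.sum
    if (change < tol || maxIter <= 1) next
    else untilConverged(next, tol, maxIter - 1)(step)
  }

  def kMeans(samples: Seq[Point], init: Vector[Point], tol: Double): Vector[Point] =
    untilConverged(init, tol) { means =>
      // Assign each sample to the index of its nearest mean.
      val clusters = samples.groupBy { s =>
        means.indices.minBy(i => dist(s, means(i)))
      }
      // Recompute each mean; keep the old mean if a cluster is empty.
      means.indices.map(i => clusters.get(i).map(mean).getOrElse(means(i))).toVector
    }
}
```

With the one-dimensional samples 0, 2, 10, 12 and initial means 0 and 12, the loop settles on means 1 and 11 after two passes.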
MSM Builder Using OptiML
with Vijay Pande
Markov State Models (MSMs)
MSMs are a powerful means of modeling the structure and dynamics of molecular systems, such as proteins
[Diagram labels: x86 ASM; high productivity, low performance; low productivity, high performance]
OptiGraph
- A DSL for large-scale graph analysis based on Green-Marl
  - A DSL for Real-world Graph Analysis
  - Green-Marl: A DSL for Easy and Efficient Graph Analysis (Hong et al.), ASPLOS '12
- Functional DSL
  - No mutation, no explicit iteration
- Data structures
  - Graph (directed, undirected), node, edge
  - Set of nodes, edges, neighbors, …
- Graph traversals
  - Summation, Breadth-first traversal, …
- Parallel reductions but no deferred assignment
- Example applications written in framework
  - Betweenness Centrality
  - Directed/Undirected Triangle Counting
Example: Betweenness
Centrality
- Betweenness Centrality
  - A measure of how 'central' a node is in the graph
  - Used in social network analysis
- Definition
  - The number of shortest paths between any two nodes that pass through this node
[Figure labels: Ayush K. Kehdekar, Kevin Bacon, High BC, Low BC]
OptiGraph Betweenness
Centrality
val bc = sum(g.nodes) { s =>
  // forward BFS from s: sigma(v) is the number of shortest paths from s to v
  val sigma = g.inBfOrder(s) { (v, prev_sigma) =>
    if (v == s) 1
    else sum(v.upNbrs) { w => prev_sigma(w) }
  }
  // reverse BFS from s: accumulate the dependency of s on each node
  val delta = g.inRevBfOrder(s) { (v, prev_delta) =>
    sum(v.downNbrs) { w =>
      sigma(v) / sigma(w) * (1 + prev_delta(w))
    }
  }
  delta
}
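The two traversals in the OptiGraph listing follow the shape of Brandes' algorithm for unweighted graphs: a forward breadth-first pass that counts shortest paths, then a reverse pass that accumulates dependencies. The sketch below spells that out with standard Scala collections. It is an assumption-laden illustration, not OptiGraph's implementation: `BetweennessSketch` and the adjacency-list input format are hypothetical, and for an undirected graph (each edge listed in both directions) every pair of endpoints is counted once per direction.

```scala
import scala.collection.mutable

// Plain-Scala sketch of Brandes-style betweenness centrality on an unweighted graph.
object BetweennessSketch {
  // adj(v) lists the out-neighbors of node v; nodes are 0 until adj.length.
  def betweenness(adj: Vector[Seq[Int]]): Vector[Double] = {
    val n = adj.length
    val bc = Array.fill(n)(0.0)
    for (s <- 0 until n) {
      val dist = Array.fill(n)(-1)     // BFS level, -1 means not yet visited
      val sigma = Array.fill(n)(0.0)   // number of shortest paths from s
      val delta = Array.fill(n)(0.0)   // dependency of s on each node
      val order = mutable.ArrayBuffer.empty[Int] // nodes in BFS visit order
      val queue = mutable.Queue(s)
      dist(s) = 0
      sigma(s) = 1.0
      // Forward BFS: sigma(w) is the sum of sigma over the BFS parents of w.
      while (queue.nonEmpty) {
        val v = queue.dequeue()
        order += v
        for (w <- adj(v)) {
          if (dist(w) < 0) { dist(w) = dist(v) + 1; queue.enqueue(w) }
          if (dist(w) == dist(v) + 1) sigma(w) += sigma(v)
        }
      }
      // Reverse BFS order: push dependencies from BFS children back to parents.
      for (v <- order.reverseIterator; w <- adj(v) if dist(w) == dist(v) + 1)
        delta(v) += sigma(v) / sigma(w) * (1.0 + delta(w))
      // The source itself does not lie between s and any other node.
      for (v <- order if v != s) bc(v) += delta(v)
    }
    bc.toVector
  }
}
```

On the three-node path 0 - 1 - 2, only the middle node lies on a shortest path between two others, so it is the only node with a nonzero score.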
OptiGraph: Page Rank
val pr = untilConverged(0, threshold) { oldPR =>
  g.mapNodes { n =>
    ((1.0 - damp) / g.numNodes) + damp * sum(n.inNbrs) { w =>
      oldPR(w) / w.outDegree
    }
  }
}
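The same update rule can be written against plain Scala vectors to make the fixed-point iteration explicit. This is a hedged sketch rather than OptiGraph's implementation: `PageRankSketch`, the `inNbrs`/`outDegree` input format, the default `damp` of 0.85, the `maxIter` cap, and the uniform starting rank of 1/n (the slide starts from 0) are all assumptions. Convergence is measured as the summed absolute change between iterations.

```scala
// Plain-Scala sketch of the PageRank iteration; not OptiGraph's implementation.
object PageRankSketch {
  // inNbrs(v): nodes with an edge into v; outDegree(v): number of edges leaving v.
  def pageRank(inNbrs: Vector[Seq[Int]], outDegree: Vector[Int],
               damp: Double = 0.85, threshold: Double = 1e-10,
               maxIter: Int = 1000): Vector[Double] = {
    val n = inNbrs.length
    var pr = Vector.fill(n)(1.0 / n) // assumed uniform starting rank
    var diff = Double.MaxValue
    var iter = 0
    while (diff > threshold && iter < maxIter) {
      // Same formula as the listing: teleport term plus damped in-neighbor contributions.
      val next = Vector.tabulate(n) { v =>
        (1.0 - damp) / n + damp * inNbrs(v).map(w => pr(w) / outDegree(w)).sum
      }
      diff = next.zip(pr).map { case (a, b) => math.abs(a - b) }.sum
      pr = next
      iter += 1
    }
    pr
  }
}
```

On a directed three-node cycle every node ends up with rank close to 1/3, and a node that receives links from two others ranks above them.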
OptiGraph vs. GPS (Pregel)
Higher Productivity
Algorithm                                   OptiGraph (lines of code)   Native GPS (lines of code)
Average Teenage Follower (AvgTeen)          13                          130
PageRank                                    11                          110
Conductance (Conduct)                       12                          149
Single Source Shortest Paths (SSSP)         29                          105
Random Bipartite Matching (Bipartite)       47                          225
Approximate Betweenness Centrality          25                          Not Available
OptiGraph vs. Native GPS
(Pregel)
Similar Performance
25 x 4 = 100 cores
TRIANGLE COUNTING
- Microsoft Research, IBM, LogicBlox, MIT, and Oracle are trying to make this application run fast
- Simple yet practical example of a multi-way join, a fundamental operation in any data analytics engine
- Serves as the building block for many graph mining applications, such as identifying cliques, graph transitivity, and clustering coefficients
val triangleCount = g.sumOverNodes { n =>
  sum(n.nbrs) { nbr => n.commonNbrs(nbr).size }
}
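The two-line listing is a sum of common-neighbor counts over every edge. A plain-Scala sketch of that idea follows. It is not OptiGraph's implementation: `TriangleSketch` and the edge-list input are hypothetical, and the sketch assumes edges are oriented from the smaller to the larger node ID so that each triangle is counted exactly once (the slide does not say how `nbrs` is defined).

```scala
// Plain-Scala sketch of triangle counting by neighbor-set intersection.
object TriangleSketch {
  // edges: undirected edges as (u, v) pairs; self-loops and duplicates are ignored.
  def countTriangles(edges: Seq[(Int, Int)]): Long = {
    // Orient each edge from the smaller to the larger node ID so that
    // every triangle a < b < c is found exactly once, at node a with neighbor b.
    val higher: Map[Int, Set[Int]] = edges
      .collect { case (u, v) if u != v => (math.min(u, v), math.max(u, v)) }
      .groupBy(_._1)
      .map { case (u, es) => u -> es.map(_._2).toSet }
    def nbrs(n: Int): Set[Int] = higher.getOrElse(n, Set.empty)
    // Sum over nodes, then over neighbors, of the common-neighbor count.
    higher.keys.toSeq.map { n =>
      nbrs(n).toSeq.map(nbr => nbrs(n).intersect(nbrs(nbr)).size.toLong).sum
    }.sum
  }
}
```

The intersection inside the inner sum is the multi-way join mentioned above: it matches edges (n, nbr), (n, x), and (nbr, x) on their shared endpoints.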
KEYS TO TRIANGLE COUNTING PERFORMANCE
- Data layout
  - Hash
  - Compressed sparse row
  - Bit set
  - Compressed bit set
- Dynamically switch based on graph characteristics
  - E.g. mesh vs. scale free (power law)
- Mixed layouts are best in some cases
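To make the layout choices concrete, the sketch below contrasts two of them for the common-neighbor step: a compressed sparse row (CSR) layout intersected by merging sorted ranges, and a bit-set layout intersected with a bitwise AND. It is illustrative only: `LayoutSketch`, `Csr`, `toCsr`, and especially the density-based `preferBitSet` rule and its `cutoff` are hypothetical and are not the switching heuristic used in the talk.

```scala
import scala.collection.immutable.BitSet

// Plain-Scala sketch of two neighbor-set layouts; names and the switching rule are assumptions.
object LayoutSketch {
  // Compressed sparse row: the neighbors of node v are stored in
  // cols(rowPtr(v)) until cols(rowPtr(v + 1)), sorted ascending.
  final case class Csr(rowPtr: Array[Int], cols: Array[Int]) {
    def degree(v: Int): Int = rowPtr(v + 1) - rowPtr(v)
  }

  def toCsr(adj: Vector[Seq[Int]]): Csr = {
    val rowPtr = adj.scanLeft(0)(_ + _.distinct.length).toArray
    val cols = adj.flatMap(_.distinct.sorted).toArray
    Csr(rowPtr, cols)
  }

  // Common neighbors on CSR: merge the two sorted neighbor ranges.
  def commonCsr(g: Csr, a: Int, b: Int): Int = {
    var i = g.rowPtr(a); val iEnd = g.rowPtr(a + 1)
    var j = g.rowPtr(b); val jEnd = g.rowPtr(b + 1)
    var count = 0
    while (i < iEnd && j < jEnd) {
      if (g.cols(i) < g.cols(j)) i += 1
      else if (g.cols(i) > g.cols(j)) j += 1
      else { count += 1; i += 1; j += 1 }
    }
    count
  }

  // Common neighbors on bit sets: bitwise AND, then count the set bits.
  def commonBits(a: BitSet, b: BitSet): Int = (a & b).size

  // Hypothetical switching rule: prefer bit sets only when the graph is dense.
  def preferBitSet(numNodes: Int, numEdges: Long, cutoff: Double = 0.05): Boolean =
    numNodes > 0 && numEdges.toDouble / (numNodes.toDouble * numNodes) > cutoff
}
```

CSR costs space proportional to the number of edges and intersects in time proportional to the two degrees, while a bit set costs space proportional to the number of nodes per row but intersects a word at a time, which is one reason the better layout can depend on the graph.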