A System for Distributed Graph-Parallel Machine Learning

(1)

Joseph Gonzalez

Postdoc, UC Berkeley AMPLab PhD @ CMU

Yucheng Low Aapo Kyrola Danny Bickson Alex Smola Haijie Gu The Team: Carlos Guestrin Guy Blelloch

(2)

About me …

2

Machine

Learning

Graphical

Models

+

BigLearning

Big

Scalable

(3)

Graphs are Essential to

Data-Mining

and

Machine Learning

• Identify influential people and information • Find communities

• _{Target ads and products}

• _{Model complex data dependencies}

(4)

Liberal Conservative Post Post Post Post Post Post Post Post

Example:

Estimate Political Bias

Post Post Post Post Post Post Post Post Post Post Post Post Post Post ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? _? ? 4

Loopy Belief Propagation

Conditional Random Field

(5)

Collaborative Filtering: Exploiting Dependencies

City of God

Wild Strawberries The Celebration

La Dolce Vita

Women on the Verge of a Nervous Breakdown

What do I recommend???

(6)

Matrix Factorization

Alternating Least Squares (ALS)

r₁₃ r₁₄ r₂₄ r₂₅ f(1) f(2) f(3) f(4) f(5) User F act or s (U) Mo vie F act or s (M) User s Movies Netflix User s

≈

x Movies f(i) f(j) Iterate: Factor for User i Factor for Movie j

(7)

PageRank

• Everyone starts with equal ranks • Update ranks in parallel

• _{Iterate until convergence}

Rank of

user i Weighted sum of

neighbors’ ranks

(8)

How

should

we program

graph-parallel

algorithms?

Low-level

tools like

MPI

and

Pthreads

?

- Me, during my first years of grad school

(9)

Threads, Locks, and MPI

• ML experts repeatedly solve the same parallel design challenges:

– Implement and debug complex parallel system

– Tune for a single parallel platform

– Six months later the conference paper contains:

“We implemented ______ in parallel.”

• _{The resulting code:}

– is difficult to maintain and extend

– couples learning model and implementation

(10)

How

should

we program

graph-parallel

algorithms?

10

High-level

Abstractions!

- Me, now

(11)

The

Graph-Parallel

Abstraction

• A user-defined Vertex-Program runs on each vertex

• Graph constrains interaction along edges

– Using messages (e.g. Pregel [PODC’09, SIGMOD’10])

– Through shared state (e.g., GraphLab [UAI’10, VLDB’12])

• Parallelism: run multiple vertex programs simultaneously

11

“Think like a Vertex.”

(12)

Better for Machine Learning

Graph-parallel

Abstractions

12 Shared State ii Dynamic Asynchronous Messaging ii Synchronous

(13)

The GraphLab Vertex Program

Vertex Programs directly access adjacent vertices and edges

GraphLab_PageRank(i)

// Compute sum over neighbors

total = 0

foreach( j in neighbors(i)): total = total + R[j] * w_ji

// Update the PageRank

R[i] = 0.15 + total

// Trigger neighbors to run again

priority = |R[i] – oldR[i]| if R[i] not converged then

signal neighborsOf(i) with priority

13 R[4] * w₄₁ + + 44 11 33 22

(14)

Benefit of Dynamic PageRank

1 100 10000 1000000 100000000 0 10 20 30 40 50 60 70 Num-V e rtices Number of Updates

51% updated only once!

Be

tt

er

(15)

GraphLab

Asynchronous

Execution

CPU 1

CPU 2

The scheduler determines the order that vertices are executed

ee ff gg kk jj ii hh dd cc bb aa bb ii hh aa ii bb ee ff jj cc

Scheduler

(16)

Asynchronous Belief Propagation

Synthetic Noisy Image

Cumulative Vertex Updates

Many Updates

Few Updates

Algorithm identifies and focuses on hidden sequential structure

Graphical Model

(17)

GraphLab Ensures a

Serializable

Execution

• _{Enables: Gauss-Seidel iterations, Gibbs} Sampling, Graph Coloring, …

(18)

Never Ending Learner Project (CoEM)

• Language modeling: named entity recognition

18

GraphLab 16 Cores 30 min

15x Faster! 6x fewer CPUs! Hadoop (BSP) 95 Cores 7.5 hrs Distributed GraphLab 32 EC2 machines 80 secs 0.3% of Hadoop time

(19)

GraphLab provided a

powerful new abstraction

But…

Thus far…

We couldn’t scale up to

Altavista Webgraph from 2002

(20)

20

Natural Graphs

Graphs derived from natural

phenomena

(21)

Properties of Natural Graphs

21

Power-Law Degree Distribution

(22)

Power-Law Degree Distribution

100 102 104 106 108 100 102 104 106 108 1010 degree c o u n t

Top 1% of vertices are adjacent to 50% of the edges!

High-Degree

Vertices

22 Number of V ertices AltaVista WebGraph 1.4B Vertices, 6.6B Edges Degree

More than 108 _vertices

(23)

Power-Law Degree Distribution

23

“Star Like”

Motif

President

(24)

Asynchronous Execution

requires heavy locking (GraphLab)

Challenges of

High-Degree

Vertices

Touches a large fraction of graph (GraphLab) Sequentially process edges Sends many messages (Pregel) Edge meta-data too large for single

machine

Synchronous Execution prone to stragglers (Pregel)

(25)

Graph Partitioning

• Graph parallel abstractions rely on partitioning:

– Minimize communication

– Balance computation and storage

25

Machine 1 Machine 2

Comm. Cost

(26)

Power-Law Graphs are

Difficult to Partition

• Power-Law graphs do not have low-cost balanced

cuts [Leskovec et al. 08, Lang 04]

• Traditional graph-partitioning algorithms perform

poorly on Power-Law Graphs.

[Abou-Rjeili et al. 06]

26

(27)

Machine 1 Machine 2

Random Partitioning

• GraphLab resorts to random (hashed) partitioning on natural graphs

10 Machines  90% of edges cut

100 Machines  99% of edges cut!

(28)

Machine 1 Machine 2

• Split High-Degree vertices

• _{New Abstraction}  _Equivalence _{on Split Vertices}

28

Program

(29)

Gather Information About Neighborhood

Update Vertex Signal Neighbors &

Modify Edge Data

A

Common Pattern

for

Vertex-Programs

GraphLab_PageRank(i)

// Compute sum over neighbors

total = 0

foreach( j in neighbors(i)): total = total + R[j] * w_ji

// Update the PageRank

R[i] = total

// Trigger neighbors to run again

priority = |R[i] – oldR[i]| if R[i] not converged then

signal neighbors(i) with priority

(30)

Formal GraphLab2 Semantics

• _Gather_{(SrcV, Edge, DstV)} _A

– Collect information from neighbors

• Sum(A, A)  A

– Commutative associative Sum

• _Apply_{(V, A)} _V

– Update the vertex

• Scatter(SrcV, Edge, DstV)  (Edge, signal)

– Update edges and signal neighbors

(31)

GraphLab2_PageRank(i)

Gather( j  i ) : return w_ji * R[j]

sum(a, b) : return a + b;

Apply(i, Σ) : R[i] = 0.15 + Σ

Scatter( i  j ) :

if R[i] changed then trigger j to be recomputed

PageRank in

GraphLab2

(32)

Machine 2 Machine 1 Machine 4 Machine 3

GAS Decomposition

Σ₁ Σ₂ Σ₃ Σ₄ + + + YYYY Y’

Σ

Y’Y’ Y’

G

ather

A

pply

S

catter

32 Master Mirror Mirror Mirror

(33)

Minimizing Communication in PowerGraph

YYY

A vertex-cut minimizes

machines each vertex spans

Percolation theory suggests that power law graphs have good vertex cuts. [Albert et al. 2000]

Communication is linear in the number of machines

each vertex spans

33

New Theorem:

For any edge-cut we can directly

construct a vertex-cut which requires

(34)

Constructing Vertex-Cuts

• Evenly assign edges to machines

– Minimize machines spanned by each vertex

• Assign each edge as it is loaded

– Touch each edge only once

• Propose two distributed approaches:

– Random Vertex Cut

– Greedy Vertex Cut

(35)

Machine 2

Machine 1 Machine 3

Random

Vertex-Cut

• Randomly assign edges to machines

YYYYYYYYY ZZZ Y Spans 3 Machines Z Spans 2 Machines Balanced Vertex-Cut Not cut! 35

(36)

Random Vertex-Cuts vs. Edge-Cuts

• Expected improvement from vertex-cuts:

1 10 100 0 50 100 150 R e duction in Comm. and St or ag e Number of Machines 36 Order of Magnitude Improvement

(37)

Streaming

Greedy Vertex-Cuts

• Place edges on machines which already have the vertices in that edge.

Machine1 Machine 2 B A B C D AB E 37

(38)

Greedy Vertex-Cuts Improve Performance

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 PageRank Collaborative Filtering Shortest Path Run time R e la tiv e to Random Random Greedy

Greedy partitioning improves

(39)

System Design

• Implemented as C++ API

• Uses HDFS for Graph Input and Output

• _{Fault-tolerance is achieved by check-pointing}

– Snapshot time < 5 seconds for twitter network

39

EC2 HPC Nodes

MPI/TCP-IP PThreads HDFS

(40)

Implemented Many Algorithms

• Collaborative Filtering

– Alternating Least Squares

– Stochastic Gradient Descent

– SVD

– Non-negative MF

• Statistical Inference

– Loopy Belief Propagation

– Max-Product Linear Programs – Gibbs Sampling • Graph Analytics – PageRank – Triangle Counting – Shortest Path – Graph Coloring – K-core Decomposition • Computer Vision – Image stitching • Language Modeling – LDA 40

(41)

PageRank on the Twitter Follower Graph

0 10 20 30 40 50 60 70 GraphLab Pregel (Piccolo) PowerGraph 41 0 5 10 15 20 25 30 35 40 GraphLab Pregel (Piccolo) PowerGraph To ta l N et w o rk ( G B ) Sec o nds Communication Runtime

Natural Graph with 40M Users, 1.4 Billion Links

Reduces Communication Runs Faster

(42)

PageRank on Twitter Follower Graph

Natural Graph with 40M Users, 1.4 Billion Links

Hadoop results from [Kang et al. '11]

Twister (in-memory MapReduce) [Ekanayake et al. ‘10] 42

0 50 100 150 200 Hadoop GraphLab Twister Piccolo PowerGraph

Runtime Per Iteration

Order of magnitude by

exploiting properties

(43)

GraphLab2 is Scalable

Yahoo Altavista Web Graph (2002):

One of the largest publicly available web graphs

1.4 Billion Webpages, 6.6 Billion Links

1024 Cores (2048 HT) 64 HPC Nodes

7 Seconds per Iter.

1B links processed per second

30 lines of user code

(44)

Topic Modeling

• English language Wikipedia

– 2.6M Documents, 8.3M Words, 500M Tokens

– Computationally intensive algorithm

44

0 20 40 60 80 100 120 140 160

Smola et al.

PowerGraph

Million Tokens Per Second

100 Yahoo! Machines

Specifically engineered for this task

64 cc2.8xlarge EC2 Nodes

(45)

Triangle Counting

• For each vertex in graph, count number of triangles containing it

• _{Measures both “popularity” of the vertex and} “cohesiveness” of the vertex’s community:

More Triangles Stronger Community Fewer Triangles

(46)

Counted: 34.8 Billion Triangles

46

Triangle Counting

on The Twitter Graph

Identify individuals with strong communities.

64 Machines

1.5 Minutes

1536 Machines 423 Minutes

Hadoop

[WWW’11]

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

282 x Faster

Why? Wrong Abstraction 

(47)

EC2 HPC Nodes MPI/TCP-IP PThreads HDFS GraphLab2 System Graph Analytics Graphical Models Computer Vision Clustering Topic Modeling Collaborative Filtering

Machine Learning

and

Data-Mining

Toolkits

Apache 2 License

(48)

GraphChi

:

Going small with GraphLab

Solve huge problems on small or embedded

devices?

Key: Exploit non-volatile memory (starting with SSDs and HDs)

(49)

GraphChi

– disk-based GraphLab

Novel Parallel Sliding

Windows algorithm • Single-Machine

– Parallel, asynchronous execution

• Solves big problems

– That are normally solved in cloud

• Efficiently exploits disks

– Optimized for stream acces

– Efficient on both SSD and hard-drives

(50)

Triangle Counting in Twitter Graph

40M Users

1.2B Edges

Total: 34.8 Billion Triangles

Hadoop results from [Suri & Vassilvitskii '11]

64 Machines, 1024 Cores 1.5 Minutes PowerGraph GraphChi Hadoop 1536 Machines 423 Minutes

(51)

Apache 2 License

http://graphlab.org

(52)

Active Work

• Cross language support (Python/Java)

• Support for incremental graph computation

• Integration with Graph Databases

• Declarative representations of GAS decomposition:

– my.pr := nbrs.in.map(x => x.pr).reduce( (a,b) => a + b )

52 Joseph E. Gonzalez Postdoc, UC Berkeley [email protected] [email protected] http://graphlab.org

(53)

Why not use

Map-Reduce

for

(54)

Data Dependencies are Difficult

• Difficult to express dependent data in Map

Reduce

– Substantial data transformations

– User managed graph structure

– Costly data replication

Independen t Da ta R ec o rd s

(55)

Iterative Computation is Difficult

• System is not optimized for iteration:

Data Data Data Data Data Data Data Data Data Data Data Data Data Data CPU 1 CPU 2 CPU 3 Data Data Data Data Data Data Data CPU 1 CPU 2 CPU 3 Data Data Data Data Data Data Data CPU 1 CPU 2 CPU 3 Iterations Disk P enalt y Disk P enalt y Disk P enalt y Startup P enalt y Startup P enalt y Startup P enalt y

(56)

The Pregel Abstraction

Vertex-Programs interact by sending messages.

ii

Pregel_PageRank(i, messages) :

// Receive all the messages

total = 0

foreach( msg in messages) :

total = total + msg

// Update the rank of this vertex

R[i] = total

// Send new messages to neighbors

foreach(j in out_neighbors[i]) :

Send msg(R[i]) to vertex j

56

(57)

Barrier

Pregel

Synchronou

s Execution

(58)

Communication Overhead

for High-Degree Vertices

Fan-In vs. Fan-Out

(59)

Pregel

Message Combiners

on

Fan-In

Machine 1 Machine 2 ++ B A C D Sum

• User defined commutative associative (+) message operation:

(60)

Pregel Struggles with

Fan-Out

Machine 1 Machine 2 B A C D

• Broadcast sends many copies of the same

message to the same machine!

(61)

Fan-In and Fan-Out Performance

• PageRank on synthetic Power-Law Graphs

– Piccolo was used to simulate Pregel with combiners

0 2 4 6 8 10 1.8 1.9 2 2.1 2.2 To ta l Comm. (GB) Power-Law Constant α

(62)

GraphLab Ghosting

• Changes to master are synced to ghosts Machine 1 A B C Machine 2 D D A B C Ghost 62

(63)

GraphLab Ghosting

• Changes to neighbors of high degree vertices creates substantial network traffic

Machine 1 A B C Machine 2 D D A B C Ghost 63

(64)

Fan-In and Fan-Out Performance

• PageRank on synthetic Power-Law Graphs

• GraphLab is undirected 0 2 4 6 8 10 1.8 1.9 2 2.1 2.2 To ta l Comm. (GB)

Power-Law Constant alpha

(65)

Comparison with GraphLab & Pregel

• PageRank on Synthetic Power-Law Graphs:

Runtime Communication 0 2 4 6 8 10 1.8 To ta l N et w o rk ( G B ) Power-Law Constant α 0 5 10 15 20 25 30 1.8 Sec o nds Power-Law Constant α Pregel (Piccolo) GraphLab Pregel (Piccolo) GraphLab 65

High-degree vertices High-degree vertices

(66)

GraphLab on Spark

66

#include <graphlab.hpp>

struct vertex_data : public graphlab::IS_POD_TYPE { float rank; vertex_data() : rank(1) { }

};

typedef graphlab::empty edge_data;

typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type; class pagerank :

public graphlab::ivertex_program<graph_type, float>, public graphlab::IS_POD_TYPE {

float last_change; public:

float gather(icontext_type& context, const vertex_type& vertex, edge_type& edge) const {

return edge.source().data().rank / edge.source().num_out_edges(); }

void apply(icontext_type& context, vertex_type& vertex, const gather_type& total) {

const double newval = 0.15*total + 0.85;

last_change = std::fabs(newval - vertex.data().rank); vertex.data().rank = newval;

}

void scatter(icontext_type& context, const vertex_type& vertex, edge_type& edge) const {

if (last_change > TOLERANCE) context.signal(edge.target()); } }; struct pagerank_writer { std::string save_vertex(graph_type::vertex_type v) { std::stringstream strm; strm << v.id() << "\t" << v.data() << "\n"; return strm.str(); }

std::string save_edge(graph_type::edge_type e) { return ""; } };

int main(int argc, char** argv) { graphlab::mpi_tools::init(argc, argv); graphlab::distributed_control dc;

graphlab::command_line_options clopts("PageRank algorithm."); graph_type graph(dc, clopts);

graph.load_format(“biggraph.tsv”, "tsv");

graphlab::omni_engine<pagerank> engine(dc, graph, clopts); engine.signal_all();

engine.start();

graph.save(saveprefix, pagerank_writer(), false, true false); graphlab::mpi_tools::finalize();

return EXIT_SUCCESS; }

import spark.graphlab._

val sc = spark.SparkContext(master, “pagerank”) val graph = Graph.textFile(“bigGraph.tsv”)

val vertices = graph.outDegree().mapValues((_, 1.0, 1.0)) val pr = Graph(vertices, graph.edges).iterate(

(meId, e) => e.source.data._2 / e.source.data._1, (a: Double, b: Double) => a + b,

(v, accum) => (v.data._1, (0.15 + 0.85*a), v.data._2), (meId, e) => abs(e.source.data._2-e.source.data._1)>0.01) pr.vertices.saveAsTextFile(“results”)