How To Create A Graphlab Framework For Parallel Machine Learning

(1)

Datacenter Simulation Methodologies:

GraphLab

Tamara Silbergleit Lehman, Qiuyun Wang, Seyed Majid Zahedi and Benjamin C. Lee

(2)

Tutorial Schedule

Time Topic

09:00 - 09:15 Introduction

09:15 - 10:30 Setting up MARSSx86 and DRAMSim2

10:30 - 11:00 Break 11:00 - 12:00 Spark simulation 12:00 - 13:00 Lunch 13:00 - 13:30 Spark continued 13:30 - 14:30 GraphLab simulation 14:30 - 15:00 Break

15:00 - 16:15 Web search simulation 16:15 - 17:00 Case studies

(3)

Agenda

• Objectives

• be able to deploy graph analytics framework • be able to simulate GraphLab engine, tasks • Outline

• Learn GraphLab for recommender, clustering • Instrument GraphLab for simulation

• Create checkpoints

(4)

Types of Graph Analysis

• Iterative, batch processing over entire graph dataset • Clustering

• PageRank • Pattern Mining

• Real-time processing over fraction of the entire graph • Reachability

• Shortest-path

(5)

Parallel Graph Algorithms

• Common Properties

• Sparse data dependencies • Local computations • Iterative updates

• Difficult programming models • Race conditions, deadlocks • Shared memory synchronization

(6)

Graph Computation Challenges

• Poor memory locality • I/O intensive

• Limited data parallelism • Limited scalability

(7)

MapReduce for Graphs

MapReduce performs poorly for parallel graph analysis

• Graph is re-loaded, re-processed iteratively

• MapReduce writes intermediate results to disk between

(8)

GraphLab, An Alternative Approach

• Captures data dependencies • Performs iterative analysis • Updates data asynchronously • Enables parallel execution models

• Multiprocessor • Distributed machines

www.select.cs.cmu.edu/code/ graphlab

(9)

The GraphLab Framework

• Represent data as graph • Specify update functions,

user computation

• Choose consistency model

• Choose task scheduler _{www.cs.cmu.edu/~pavlo/courses/fall2013/} static/slides/graphlab.pdf

(10)

Represent Data as Graph

• Data graph associates data to each vertex and edge

(11)

Update Functions and Scope

• Computation with stateless • Scheduler prioritizes computation

• Scope determines affected edges and vertices

(12)

Scheduling Tasks and Updates

• Update scheduler orders update functions • Static scheduling

• Models: synchronous, round-robin • Task scheduler adds, reorders tasks

• Dynamic scheduling

(13)

GraphLab Software Stack

• Collaborative filtering – recommendation system • Clustering – Kmeans++

(14)

Questions?

(15)

Datacenter Simulation Methodologies

Getting Started with GraphLab

Tamara Silbergleit Lehman, Qiuyun Wang, Seyed Majid Zahedi and Benjamin C. Lee

(16)

Agenda

• Objectives

(17)

GraphLab Setup

• Get a product key from:

http://graphlab.com/products/create/ quick-start-guide.html

• Launch QEMU emulator:

$ qemu - system - x 8 6 _ 6 4 - m 4 G - d r i v e f i l e = m i c r o 2 0 1 4 . qcow2 , c a c h e = u n s a f e - n o g r a p h i c

• In QEMU, install required tools and GraphLab-create python package

# apt - get i n s t a l l python - pip python - dev build - e s s e n t i a l gcc

(18)

GraphLab Setup

• Register product with generated key by opening file /root/.graphlab/config and editing it as follows [ P r o d u c t ]

p r o d u c t _ k e y = ’ ’ < g e n e r a t e d _ k e y > ’ ’

• Create a directory for GraphLab # m k d i r g r a p h l a b

# cd g r a p h l a b

• Create a directory for the dataset # m k d i r d a t a s e t

(19)

Downloading a Dataset

• Download the dataset: 10 million movie ratings by 72,000 users on 10,000 movies # w g e t f i l e s . g r o u p l e n s . org / d a t a s e t s / m o v i e l e n s / ml -10 m . zip # u n z i p ml -10 m . zip # sed ’ s /::/ ,/ g ’ ml -10 M 1 0 0 K / r a t i n g s . dat > r a t i n g s . csv

• Open the file and add column names on the first line: userid,moveid,rating,timestamp

(20)

GraphLab Recommender Toolkit

We will create a factorization recommender program. • Create a new python file called recommender.py

i m p o r t g r a p h l a b as gl d a t a = gl . S F r a m e . r e a d _ c s v (’ / r o o t / g r a p h l a b / d a t a s e t s / r a t i n g s . csv ’, c o l u m n _ t y p e _ h i n t s ={’ r a t i n g ’: int } , h e a d e r = T r u e ) m o d e l = gl . r e c o m m e n d e r . c r e a t e ( data , u s e r _ i d =’ u s e r i d ’, i t e m _ i d =’ m o v i e i d ’, t a r g e t =’ r a t i n g ’) r e s u l t s = m o d e l . r e c o m m e n d ( u s e r s = None , k =5) p r i n t r e s u l t s

• The gl.recommender.create(args)command chooses a recommendation model based on the input dataset format,

(21)

GraphLab Recommender Toolkit

• The user can specify recommendation model • item similarity recommender,

• factorization recommender,

• ranking factorization recommender, • popularity-based recommender.

• When user specifies model explicitly, she can also specify • number of latent factors,

• number of maximum iterations, etc.

• model.recommend(args) returns the k-highest scored items for each user. When users parameter is None, it returns

(22)

Setup for Creating Checkpoints

• Copy fileptlcalls.hfrom marss.dramsim directory

# scp u s e r 0 1 @ s a i l 0 3 . egr . d u k e . edu :/ h o m e / u s e r 0 1 / m a r s s . d r a m s i m / p t l s i m / t o o l s / p t l c a l l s . h .

(23)

libptlcalls.cpp

# i n c l u d e < i o s t r e a m > # i n c l u d e " p t l c a l l s . h " # i n c l u d e < s t d l i b . h > e x t e r n " C " v o i d c r e a t e _ c h e c k p o i n t () { c h a r * c h _ n a m e = g e t e n v (" C H E C K P O I N T _ N A M E ") ; if( c h _ n a m e != N U L L ) { p r i n t f (" c r e a t i n g c h e c k p o i n t % s \ n ", c h _ n a m e ) ; p t l c a l l _ c h e c k p o i n t _ a n d _ s h u t d o w n ( c h _ n a m e ) ; } } e x t e r n " C " v o i d s t o p _ s i m u l a t i o n () { p r i n t f (" S t o p p i n g s i m u l a t i o n \ n ") ; p t l c a l l _ k i l l () ; }

(24)

Compile libptlcalls.cpp

• Compile C++ code

# g ++ - c - f P I C l i b p t l c a l l s . cpp - o l i b p t l c a l l s . o

• Create shared library for Python

# g ++ - s h a r e d - Wl , - soname , l i b p t l c a l l s . so - o l i b p t l c a l l s . so l i b p t l c a l l s . o

(25)

Setup for Creating Checkpoints

• Include the library in recommender.py source code f r o m c t y p e s i m p o r t c d l l

lib = c d l l . L o a d L i b r a r y (’ ./ l i b p t l c a l l s . so ’)

• Call function to create checkpoint before the recommender is created. Stop the simulation after recommend function. lib.create checkpoint() m o d e l = gl . r e c o m m e n d e r . c r e a t e ( data , u s e r _ i d =’ u s e r i d ’, i t e m _ i d =’ m o v i e i d ’, t a r g e t =’ r a t i n g ’) lib.stop simulation() r e s u l t s = m o d e l . r e c o m m e n d ( u s e r s = None , k = 2 0 )

(26)

Creating Checkpoints

• Shutdown QEMU emulator

# p o w e r o f f

• Once the emulator is shut down change into the marss.dramsim directory

$ cd m a r s s . d r a m s i m

• Run MARSSx86’ QEMU emulator

$ ./ q e m u / qemu - system - x 8 6 _ 6 4 - m 4 G - d r i v e f i l e =/ h o m e t e m p / u s e r X X / m i c r o 2 0 1 4 . qcow2 , c a c h e = u n s a f e - n o g r a p h i c

• Export CHECKPOINT NAME

# e x p o r t C H E C K P O I N T _ N A M E = g r a p h l a b • Run recommender.py

(27)

Running from Checkpoints

• Add -simconfig micro2014.simcfgto specify the simulation configuration

• Add -loadvmoption to load from newly created checkpoint • Add -snapshotto prevent the simulation from modifying disk

image

> ./ q e m u / qemu - system - x 8 6 _ 6 4 - m 4 G - d r i v e f i l e =/ h o m e t e m p / u s e r X X / m i c r o 2 0 1 4 . qcow2 , c a c h e = u n s a f e n o g r a p h i c s i m c o n f i g m i c r o 2 0 1 4 . s i m c f g -l o a d v m g r a p h -l a b - s n a p s h o t

(28)

GraphLab Clustering Toolkit

We will now perform k-means++ clustering

• We will use airline ontime information for 2008

• Download dataset from Statistical Computing web site. Decompress it

# w g e t stat - c o m p u t i n g . org / d a t a e x p o / 2 0 0 9 / 2 0 0 8 . csv . bz2

(29)

GraphLab Clustering Toolkit

• Create a new python file called clustering.py i m p o r t g r a p h l a b as gl f r o m m a t h i m p o r t s q r t d a t a _ u r l =’ 2 0 0 8 . csv ’ d a t a = gl . S F r a m e . r e a d _ c s v ( d a t a _ u r l ) # r e m o v e e m p t y r o w s d a t a _ g o o d , d a t a _ b a d = d a t a . d r o p n a _ s p l i t () # d e t e r m i n e the n u m b e r of r o w s in the d a t a s e t n = len ( d a t a _ g o o d ) # c o m p u t e the n u m b e r of c l u s t e r s to c r e a t e k = int ( s q r t ( n / 2 . 0 ) ) p r i n t " S t a r t i n g k - m e a n s w i t h % d c l u s t e r s " % k m o d e l = gl . k m e a n s . c r e a t e ( d a t a _ g o o d , n u m _ c l u s t e r s = k ) # # p r i n t s o m e i n f o r m a t i o n on c l u s t e r s c r e a t e d m o d e l [’ c l u s t e r _ i n f o ’][[’ c l u s t e r _ i d ’, ’ _ _ w i t h i n _ d i s t a n c e _ _ ’, ’ _ _ s i z e _ _ ’]]

(30)

Setup for Creating Checkpoints

• Include the library in clustering.py source code f r o m c t y p e s i m p o r t c d l l

lib = c d l l . L o a d L i b r a r y (’ ./ l i b p t l c a l l s . so ’)

• Call the function to create checkpoint before k-means clustering model is created.

p r i n t " S t a r t i n g k - m e a n s w i t h % d c l u s t e r s " % k lib.create checkpoint()

m o d e l = gl . k m e a n s . c r e a t e ( d a t a _ g o o d , n u m _ c l u s t e r s = k )

(31)

Creating Checkpoints

• Shutdown QEMU emulator

# p o w e r o f f

• Once the emulator is shut down change into the marss.dramsim directory

$ cd m a r s s . d r a m s i m

• Run MARSSx86’ QEMU emulator

$ ./ q e m u / qemu - system - x 8 6 _ 6 4 - m 4 G - d r i v e f i l e =/ h o m e t e m p / u s e r X X / m i c r o 2 0 1 4 . qcow2 , c a c h e = u n s a f e - n o g r a p h i c

• Export CHECKPOINT NAME

(32)

Running from Checkpoints

• Run clustering.py

# p y t h o n g r a p h l a b / c l u s t e r i n g . py

• The checkpoint will be created. Then the VM will shutdown • Once the VM shuts down, update micro2014.simcfg to specify

number of instructions to simulate -stopinsns 1B • Run MARSSx86 from the checkpoint

$ ./ q e m u / qemu - system - x 8 6 _ 6 4 - m 4 G - d r i v e f i l e =/ h o m e t e m p / u s e r X X / m i c r o 2 0 1 4 . qcow2 , c a c h e = u n s a f e - n o g r a p h i c - s i m c o n f i g m i c r o 2 0 1 4 . s i m c f g - l o a d v m k m e a n s - s n a p s h o t

(33)

Datasets

Problem Domain Contributor Link

Misc Amazon Web Services public datasets dataset

Social Graphs Stanford Large Network Dataset

(SNAP)

dataset

Social Graphs Laboratory for Web Algorithms dataset

Collaborative Filtering Million Song dataset dataset

Collaborative Filtering Movielens dataset GroupLens dataset

Collaborative Filtering KDD Cup 2012 by Tencent, Inc. dataset

Collaborative Filtering

(matrix factorization

based methods)

University of Florida sparse matrix col-lection

dataset

Classification Airline on time performance dataset

Classification SF restaurants dataset dataset

(34)

Agenda

• Objectives

(35)

GraphLab

(36)

Tutorial Schedule

Time Topic

09:00 - 09:15 Introduction

09:15 - 10:30 Setting up MARSSx86 and DRAMSim2

10:30 - 11:00 Break 11:00 - 12:00 Spark simulation 12:00 - 13:00 Lunch 13:00 - 13:30 Spark continued 13:30 - 14:30 GraphLab simulation 14:30 - 15:00 Break

15:00 - 16:15 Web search simulation 16:15 - 17:00 Case studies