• No results found

Big Data Analytics. Lucas Rego Drumond

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Analytics. Lucas Rego Drumond"

Copied!
102
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data Analytics

Big Data Analytics

Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science

University of Hildesheim, Germany

Big Data Analytics

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(2)

Big Data Analytics

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example

(3)

Big Data Analytics 1. Introduction

Outline

1. Introduction

2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example

2.5 Execution Model and Data Consistency

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(4)

Big Data Analytics 1. Introduction

Overview

Part III

Part II

Part I

Machine Learning Algorithms

Large Scale Computational Models

(5)

Big Data Analytics 1. Introduction

MapReduce - Review

1. Each mapper transforms a set key-value pairs into a list of output keys and intermediate value pairs

2. all intermediate values are grouped according to their output keys 3. each reducer receives all the intermediate values associated with a

given keys

4. each reducer associates one final value to each key

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(6)

Big Data Analytics 1. Introduction

(7)

Big Data Analytics 2. GraphLab

Outline

1. Introduction

2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example

2.5 Execution Model and Data Consistency

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(8)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

(9)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

Available for download at: http://graphlab.org

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(10)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

(11)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

Available for download at: http://graphlab.org

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(12)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

(13)

Big Data Analytics 2. GraphLab

GraphLab

I A new framework for parallel machine learning

I Represents data as a directed graph

I Blocks of data stored in vertices and edges

I Globally shared data table

I Shared memory (GraphLab 1.0) distributed (GraphLab 2.1)

Available for download at: http://graphlab.org

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(14)

Big Data Analytics 2. GraphLab

(15)

Big Data Analytics 2. GraphLab

GraphLab System Overview

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(16)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

(17)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(18)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

(19)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(20)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

(21)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation

I Sync Mechanism: global aggregation

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(22)

Big Data Analytics 2. GraphLab

GraphLab Abstraction

The GraphLab abstraction consists of the following main components:

I A Data Graph provides a high level representation for data and

complex computational dependencies

I Configurable consistency models for automatically guaranteeing

data consistancy

I Scheduling Primitives able to express iterative parallel algorithms

I User defined computation consisting of:

I Update Functions: local computation I Sync Mechanism: global aggregation

(23)

Big Data Analytics 2.1 GraphLab Data Model

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model

2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example

2.5 Execution Model and Data Consistency

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(24)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Consists of two parts:

1. A directed data graph:

G := (V , E )

where V is the set of vertices and E ⊆ V × V the directed edges 2. A globally shared data table:

T[Key] → Value which maps keys to arbitrary blocks of data

(25)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Consists of two parts: 1. A directed data graph:

G := (V , E )

where V is the set of vertices and E ⊆ V × V the directed edges

2. A globally shared data table:

T[Key] → Value which maps keys to arbitrary blocks of data

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(26)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Consists of two parts: 1. A directed data graph:

G := (V , E )

where V is the set of vertices and E ⊆ V × V the directed edges 2. A globally shared data table:

T[Key] → Value which maps keys to arbitrary blocks of data

(27)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗) Set of all inbound edges from v : (∗ → v )

Neighboring vertices of v : Nv := {u|(u, v ) ∈ E ∨ (v , u) ∈ E }

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(28)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗) Set of all inbound edges from v : (∗ → v )

(29)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗) Set of all inbound edges from v : (∗ → v )

Neighboring vertices of v : Nv := {u|(u, v ) ∈ E ∨ (v , u) ∈ E }

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(30)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗)

Set of all inbound edges from v : (∗ → v )

(31)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗) Set of all inbound edges from v : (∗ → v )

Neighboring vertices of v : Nv := {u|(u, v ) ∈ E ∨ (v , u) ∈ E }

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(32)

Big Data Analytics 2.1 GraphLab Data Model

Graphlab Data Model

Users can associate with each graph vertice and edge:

I Arbitrary data blocks

I Program or model parameters

Some notation:

Data associated with a vertex v : Dv

Data associated with an edge (u, v ): Du,v

Set of all outbound edges from v : (v → ∗) Set of all inbound edges from v : (∗ → v )

(33)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

I runs concurrently with the update functions

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(34)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

(35)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

I runs concurrently with the update functions

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(36)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

(37)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

I runs concurrently with the update functions

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(38)

Big Data Analytics 2.1 GraphLab Data Model

GraphLab - User defined computation

There are two types of user defined computation on GraphLab:

I Update function:

I local computation

I executed on small neighborhoods

I different functions may access and modify overlapping contexts

I Sync Mechanism:

I global aggregation

(39)

Big Data Analytics 2.2 Update Function

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model

2.2 Update Function

2.3 Sync Mechanism 2.4 PageRank Example

2.5 Execution Model and Data Consistency

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(40)

Big Data Analytics 2.2 Update Function

Update Function - Scope of a Vertex

The Update Function is a stateless user defined function

It operates on the scope Sv of a vertex v :

Sv := {v } ∪ (v → ∗) ∪ (∗ → v ) ∪ Nv

Once executed on a vertex v , it may trigger the execution of the update functions on other vertices T

(41)

Big Data Analytics 2.2 Update Function

Update Function - Scope of a Vertex

The Update Function is a stateless user defined function It operates on the scope Sv of a vertex v :

Sv := {v } ∪ (v → ∗) ∪ (∗ → v ) ∪ Nv

Once executed on a vertex v , it may trigger the execution of the update functions on other vertices T

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(42)

Big Data Analytics 2.2 Update Function

Update Function - Scope of a Vertex

The Update Function is a stateless user defined function It operates on the scope Sv of a vertex v :

Sv := {v } ∪ (v → ∗) ∪ (∗ → v ) ∪ Nv

Once executed on a vertex v , it may trigger the execution of the update functions on other vertices T

(43)

Big Data Analytics 2.2 Update Function

Update Function - Scope of a Vertex

The Update Function is a stateless user defined function It operates on the scope Sv of a vertex v :

Sv := {v } ∪ (v → ∗) ∪ (∗ → v ) ∪ Nv

Once executed on a vertex v , it may trigger the execution of the update functions on other vertices T

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(44)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

(45)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

I T is a set of vertices

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(46)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

(47)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

I T is a set of vertices

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(48)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

(49)

Big Data Analytics 2.2 Update Function

Update Function

Be Sv the scope of a vertex v and T the globally shared table.

The application of an update function f to a vertex v can be defined as (DSv, T ) ← f (DSv, T)

Where:

I DSv is the data associated with the scope of v

I f has read-only access to T

I T is a set of vertices

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(50)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

(51)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(52)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

(53)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(54)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

(55)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

This decomposition allows us to execute a single vertex program on several machines simultaneously and move computation to the data.

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(56)

Big Data Analytics 2.2 Update Function

Update Function and Vertex Programs

In the most recent version of the GraphLab abstraction, Update Functions are called Vertex Programs

A Vertex Program implements the Gather-Apply-Scatter (GAS) model, and has three phases:

I Gather: executed in parallel on each edge of the current vertex

(usually to read data)

I Apply: atomic function that modifies the vertex data

I Scatter: Executed in parallel on each edge of the current vertex

(usually to update edge data and signaling neighboring nodes)

(57)

Big Data Analytics 2.2 Update Function

Structure of an Update Function

1: procedure UpdateFunction

input: vertex v , scope Sv, data DSv, shared table T

2: GatherType: g . stores the results of the gather phase

3: for (u, v ) ∈ (∗ → v ) do . In Parallel

4: g ← g + gather ((u, v )) 5: end for 6: apply (g , Dv) 7: for (v , u) ∈ (v → ∗) do . In Parallel 8: scatter ((v , u)) 9: end for 10: end procedure

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(58)

Big Data Analytics 2.3 Sync Mechanism

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function

2.3 Sync Mechanism

2.4 PageRank Example

(59)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph

Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |



Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(60)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T

The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |

(61)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |



Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(62)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |

(63)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |



Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(64)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |

(65)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |



Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(66)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

Aggregates data accross all vertices in the graph Associates the result with a particular entry in T The user provides:

I a key k

I an initial value rk0

I a fold function: rki +1 ← Foldk Dv, rki



I an optional merge function: rkl ← Mergekrki, rkj

I an apply function: T[k] ← Applyk

 rk|V |

(67)

Big Data Analytics 2.3 Sync Mechanism

Sync Mechanism

1: procedure Sync Algorithm

input: key k, vertices V , data {Dv}v ∈V, initial value rk0, shared table

T 2: t ← rk0 3: for v ∈ V do 4: t ← Foldk(Dv, t) 5: end for 6: T[k] ← Applyk(t) 7: end procedure

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(68)

Big Data Analytics 2.4 PageRank Example

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism

2.4 PageRank Example

(69)

Big Data Analytics 2.4 PageRank Example

Pagerank: Data Model

I G := {w1, w2, w3, w4} I E := {(w1, w2), (w1, w3), (w3, w1), (w2, w3), (w4, w3)} I Dv := PR(v ) I Dv ,u := ∅

w

1

w

2

w

3

w

4

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(70)

Big Data Analytics 2.4 PageRank Example

Pagerank: Data Model

I G := {w1, w2, w3, w4} I E := {(w1, w2), (w1, w3), (w3, w1), (w2, w3), (w4, w3)} I Dv := PR(v ) I Dv ,u := ∅

w

1

w

2

w

3

w

4

(71)

Big Data Analytics 2.4 PageRank Example

Pagerank: Data Model

I G := {w1, w2, w3, w4} I E := {(w1, w2), (w1, w3), (w3, w1), (w2, w3), (w4, w3)} I Dv := PR(v ) I Dv ,u := ∅

w

1

w

2

w

3

w

4

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(72)

Big Data Analytics 2.4 PageRank Example

Pagerank: Data Model

I G := {w1, w2, w3, w4} I E := {(w1, w2), (w1, w3), (w3, w1), (w2, w3), (w4, w3)} I Dv := PR(v ) I Dv ,u := ∅

w

1

w

2

w

3

w

4

(73)

Big Data Analytics 2.4 PageRank Example

Pagerank: Update function

1: procedure PageRankUpdateFunction

input: vertex v , scope Sv, ingoing edges (∗ → v )

2: T ← ∅ 3: Dvold ← Dv 4: p ← 0 5: for u ∈ {u|(u, v ) ∈ (∗ → v )} do 6: p ← p + Du |(u→∗)| 7: end for 8: Dv ← (1 − β) + β ∗ p 9: if |Dv − Dvold| >  then 10: T ← Nv 11: end if 12: return (DSv, T ) 13: end procedure

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(74)

Big Data Analytics 2.4 PageRank Example

Pagerank: Vertex Program

1: procedure PageRankGather

input: vertex v , scope Sv,

ingoing edge (u → v )

2: return Du

|(u→∗)|

3: end procedure

1: procedure PageRankApply

input: vertex v , scope Sv,

gather result p,

2: Dvold ← Dv

3: Dv ← (1 − β) + β ∗ p

4: if |Dv− Dvold| >  then

5: Signal to perform scatter

6: end if

(75)

Big Data Analytics 2.4 PageRank Example

Pagerank: Vertex Program

1: procedure PageRankGather

input: vertex v , scope Sv,

ingoing edge (u → v )

2: return Du

|(u→∗)|

3: end procedure

1: procedure PageRankApply

input: vertex v , scope Sv,

gather result p,

2: Dvold ← Dv

3: Dv ← (1 − β) + β ∗ p

4: if |Dv− Dvold| >  then

5: Signal to perform scatter

6: end if

7: end procedure

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(76)

Big Data Analytics 2.4 PageRank Example

Pagerank: Scatter

1: procedure PageRankScatter

input: outgoing edge (v → u), set of vertices to be rescheduled T

2: Add u to T

(77)

Big Data Analytics 2.4 PageRank Example

PageRank Implementation: Data Types

s t r u c t web_page {

std: :string pagename; d o u b l e pagerank;

e x p l i c i t web_page(std: :string name) :

pagename(name) ,pagerank( 0 . 0 ) { } v o i d save(graphlab: :oarchive& oarc) c o n s t {

oarc << pagename << pagerank; }

v o i d load(graphlab: :iarchive& iarc) {

iarc >> pagename >> pagerank; }

} ;

t y p e d e f graphlab: :d i s t r i b u t e d _ g r a p h

<web_page, graphlab: :empty> gr aph _ty pe;

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(78)

Big Data Analytics 2.4 PageRank Example

PageRank Implementation: Vertex Program

c l a s s p a g e r a n k _ p r o g r a m :

p u b l i c graphlab: :ivertex_program<graph_type, double >, p u b l i c graphlab: :I S _ P O D _ T Y P E { p u b l i c : b o o l p e r f o r m _ s c a t t e r; e d g e _ d i r _ t y p e g a t h e r _ e d g e s(i c o n t e x t _ t y p e& context, c o n s t v e r t e x _ t y p e& vertex) ; e d g e _ d i r _ t y p e s c a t t e r _ e d g e s(i c o n t e x t _ t y p e& context, c o n s t v e r t e x _ t y p e& vertex) ; d o u b l e gather(i c o n t e x t _ t y p e& context,

c o n s t v e r t e x _ t y p e& vertex, edge_type& edge) ; v o i d apply(i c o n t e x t _ t y p e& context, v e r t e x _ t y p e& vertex,

(79)

Big Data Analytics 2.4 PageRank Example

PageRank Implementation: Gather

// we a r e g o i n g t o g a t h e r on a l l t h e i n −e d g e s e d g e _ d i r _ t y p e g a t h e r _ e d g e s(i c o n t e x t _ t y p e& context, c o n s t v e r t e x _ t y p e& vertex) c o n s t { r e t u r n graphlab: :IN_EDGES; } // f o r e a c h i n −e d g e g a t h e r t h e w e i g h t e d sum o f t h e e d g e . d o u b l e gather(i c o n t e x t _ t y p e& context,

c o n s t v e r t e x _ t y p e& vertex,

edge_type& edge) c o n s t { r e t u r n edge.source( ) .data( ) .pagerank

/ edge.source( ) .n u m _ o u t _ e d g e s( ) ; }

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(80)

Big Data Analytics 2.4 PageRank Example

PageRank Implementation: Apply

// Use t h e t o t a l r a n k o f a d j a c e n t p a g e s // t o u p d a t e t h i s p a g e

v o i d apply(i c o n t e x t _ t y p e& context,

v e r t e x _ t y p e& vertex,

c o n s t g a t h e r _ t y p e& total) { d o u b l e newval = total ∗ 0 . 8 5 + 0 . 1 5 ; d o u b l e oldval = vertex.data( ) .pagerank;

vertex.data( ) .pagerank = newval;

p e r f o r m _ s c a t t e r = (std: :fabs(prevval − oldval) > 1E−3); }

(81)

Big Data Analytics 2.4 PageRank Example

PageRank Implementation: Scatter

e d g e _ d i r _ t y p e s c a t t e r _ e d g e s(i c o n t e x t _ t y p e& context, c o n s t v e r t e x _ t y p e& vertex) { i f (p e r f o r m _ s c a t t e r) r e t u r n graphlab: :OUT_EDGES;

e l s e r e t u r n graphlab: :NO_EDGES; }

v o i d scatter(i c o n t e x t _ t y p e& context, c o n s t v e r t e x _ t y p e& vertex,

edge_type& edge) c o n s t {

context.signal(edge.target( ) ) ; }

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(82)

Big Data Analytics 2.5 Execution Model and Data Consistency

Outline

1. Introduction 2. GraphLab

2.1 GraphLab Data Model 2.2 Update Function 2.3 Sync Mechanism 2.4 PageRank Example

(83)

Big Data Analytics 2.5 Execution Model and Data Consistency

Execution Model

1: procedure GraphLabExec

input: Graph G := (V , E ), data D, initial vertex set T ⊆ V

2: while |T | > 0 do 3: v ← RemoveNext(T ) 4: (T0, DSv) ← f (DSv, T) 5: T ← T ∪ T0 6: end while 7: end procedure

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(84)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

(85)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(86)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph

I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

(87)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(88)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

(89)

Big Data Analytics 2.5 Execution Model and Data Consistency

Serializability

I The GraphLab abstraction provides a rich sequential model

I Parallel Execution:

I Multiple processors to execute the same loop on the same graph I Simutaneous execution of update functions on different vertices

I We desire a serializable execution:

Definition

A GraphLab program is sequentially consistentif for every parallel execution, there exists a sequential execution of update functions that produces the same values in the data graph

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(90)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

To ensure Serializability we need consistency models GraphLab supports three types of consistency models:

I Full Consistency: the scope of two concurrently executing update

functions do not overlap

I Edge Consistency: each update function has R/W access to its

vertex and adjacent edges but only read access to neighboring vertices

I Vertex Consistency: only ensures that no concurrently update

(91)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

To ensure Serializability we need consistency models GraphLab supports three types of consistency models:

I Full Consistency: the scope of two concurrently executing update

functions do not overlap

I Edge Consistency: each update function has R/W access to its

vertex and adjacent edges but only read access to neighboring vertices

I Vertex Consistency: only ensures that no concurrently update

functions are executed on the same vertice

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(92)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

To ensure Serializability we need consistency models GraphLab supports three types of consistency models:

I Full Consistency: the scope of two concurrently executing update

functions do not overlap

I Edge Consistency: each update function has R/W access to its

vertex and adjacent edges but only read access to neighboring vertices

I Vertex Consistency: only ensures that no concurrently update

(93)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency - Exclusion Sets

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(94)

Big Data Analytics 2.5 Execution Model and Data Consistency

(95)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

GraphLab guarantees sequential consistency under the following three conditions:

I The full consistency model is used

I The edge consistency model is used and update functions do not

modify data in adjacent vertices

I The vertex consistency model is used and update functions only

access local vertex data

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(96)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

GraphLab guarantees sequential consistency under the following three conditions:

I The full consistency model is used

I The edge consistency model is used and update functions do not

modify data in adjacent vertices

I The vertex consistency model is used and update functions only

(97)

Big Data Analytics 2.5 Execution Model and Data Consistency

Data Consistency

GraphLab guarantees sequential consistency under the following three conditions:

I The full consistency model is used

I The edge consistency model is used and update functions do not

modify data in adjacent vertices

I The vertex consistency model is used and update functions only

access local vertex data

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(98)

Big Data Analytics 2.5 Execution Model and Data Consistency

Benchmarks - Horizontal Scaling

Dataset 1 Dataset 2

Dataset |V | |E | Size

1 61.2K 50.9 M 655 MB

2 65.6 M 1.8 B 31 GB

(99)

Big Data Analytics 2.5 Execution Model and Data Consistency

Summary

A GraphLab program consists of:

I A data graph I An update function I Gather Function I Apply Function I Scatter Function I A sync mechanism I A consistency model

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(100)

Big Data Analytics 2.5 Execution Model and Data Consistency

Summary

A GraphLab program consists of:

I A data graph I An update function I Gather Function I Apply Function I Scatter Function I A sync mechanism I A consistency model

(101)

Big Data Analytics 2.5 Execution Model and Data Consistency

Summary

A GraphLab program consists of:

I A data graph I An update function I Gather Function I Apply Function I Scatter Function I A sync mechanism I A consistency model

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

(102)

Big Data Analytics 2.5 Execution Model and Data Consistency

Summary

A GraphLab program consists of:

I A data graph I An update function I Gather Function I Apply Function I Scatter Function I A sync mechanism I A consistency model

References

Related documents

The main wall of the living room has been designated as a &#34;Model Wall&#34; of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

Using graduate competencies as a way to specify expected outcomes gives us a powerful way to tell the story about the strengths of our graduates, continue to bolster our image,

Factors were related to the host (i. population size of elderly and/or susceptible people; ii. underlying condition rate), the food (iii. monocytogenes prevalence in RTE food

To protect this alternate point of entry, which leads deep into the dungeon complex, the priests of Orcus erected a dark temple on the surface (see the Wilderness Area 3

Most of the work in Spark is a set of operation on Resilient Distributed Datasets (RDDs):.. I Main

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany.?.

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany. Big Data Analytics 18

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim,