
(1)

! E6893 Big Data Analytics Lecture 10:

! Linked Big Data — Graph Computing (II)

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center

November 6th

, 2014

(2)

Course Structure

Class Date  Number  Topics Covered

09/04/14  1  Introduction to Big Data Analytics
09/11/14  2  Big Data Analytics Platforms
09/18/14  3  Big Data Storage and Processing
09/25/14  4  Big Data Analytics Algorithms — I
10/02/14  5  Big Data Analytics Algorithms — II (recommendation)
10/09/14  6  Big Data Analytics Algorithms — III (clustering)
10/16/14  7  Big Data Analytics Algorithms — IV (classification)
10/23/14  8  Big Data Analytics Algorithms — V (classification & clustering)
10/30/14  9  Linked Big Data — Graph Computing I (Graph DB)
11/06/14  10  Linked Big Data — Graph Computing II (Graph Analytics)
11/13/14  11  Big Data on Hardware, Processors, and Cluster Platforms
11/20/14  12  Final Project First Presentations
11/27/14      Thanksgiving Holiday
12/04/14  13  Next Stage of Big Data Analytics
12/11/14  14  Big Data Analytics Workshop — Final Project Presentations

(3)

Final Project Proposal (First) Presentation

Date/Time: November 20, 7pm - 9:30pm

Each Team — about 3 mins:

1. Team members and the expected contribution of each member;
2. Motivation for your project (the problem you would like to solve);
3. Dataset, algorithm, and tools for your project;
4. Current status of your project.

Please update your team info on the project webpage by November 11. The presentation schedule will be announced on November 13.

The website will be opened for you to upload your slides by November 20.

If a project is purely by CVN students, please submit your slides without an oral presentation.

(4)

ScaleGraph — an Open Source version of IBM System G

(5)

ScaleGraph algorithms ranked #1 in the Graph 500 benchmark

Source: Dr. Toyotaro Suzumura, ICPE2014 keynote

(6)

Graph Definitions and Concepts

▪ A graph: G = (V, E)

▪ V = Vertices or Nodes

▪ E = Edges or Links

▪ The number of vertices ("Order"): N_v = |V|

▪ The number of edges ("Size"): N_e = |E|

(7)

Subgraph

▪ A graph H is a subgraph of another graph G, if:

H ⊆ G

V_H ⊆ V_G and E_H ⊆ E_G

(8)

Families of Graphs

▪ Complete Graph: every vertex is linked to every other vertex.

▪ Clique: a complete subgraph.

(9)

Multi-Graph vs. Simple Graph

▪ Loops:

▪ Multi-Edges:


(10)

Directed Graph vs. Undirected Graph

▪ Directed Edges = Arcs: ordered pairs (u, v), drawn as u → v

▪ Mutual arcs: both (u, v) and (v, u) are present

(An undirected edge is the unordered pair {u, v}.)

(11)

Adjacency

▪ Two edges e_1 and e_2 are adjacent if joined by a common endpoint in V.

▪ u and v are adjacent if joined by an edge {u, v} in E.

(12)

Decorated Graph

▪ Weighted Edges: each edge carries a weight, e.g. 0.2 or 0.8 in the figure.

(13)

Incident and Degree

▪ A vertex v ∈ V is incident on an edge e ∈ E if v is an endpoint of e.

▪ The degree of a vertex v, say d_v, is defined as the number of edges incident on v. (In the figure, d_v = 2.)

(14)

In-degrees and out-degrees

▪ For Directed graphs:

In-degree = 8 Out-degree = 8

(15)

Degree Distribution Example: Power-Law Network

A.-L. Barabási and E. Bonabeau, "Scale-Free Networks", Scientific American 288: 50-59, 2003.

Power law with exponential cutoff (Newman, Strogatz and Watts, 2001):

p_k = C k^(-τ) e^(-k/κ)

Compare a Poisson degree distribution (classical random graph): p_k = e^(-m) m^k / k!

(16)

Another example of complex network: Small-World Network

Six Degree Separation:

by adding long-range links, a regular graph can be transformed into a small-world network, in which the average number of degrees of separation between two nodes becomes small.

from Watts and Strogatz, 1998

C: Clustering Coefficient, L: path length, (C(0), L(0) ): (C, L) as in a regular graph;

(C(p), L(p)): (C,L) in a Small-world graph with randomness p.

(17)

Indication of ‘Small’

A graph is ‘small’ usually indicates that the average distance between distinct vertices is ‘small’:

l̄ = (1 / (N_v (N_v + 1) / 2)) Σ_{u≠v∈V} dist(u, v)

For instance, a protein interaction network would be considered to have the small- world property, as there is an average distance of 3.68 among the 5,128 vertices in its giant component.
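As a sketch of how such an average geodesic distance can be computed on an unweighted graph (a minimal Python illustration, not from the lecture; the adjacency-list dict and function name are mine):

```python
from collections import deque

def avg_distance(adj):
    """Mean geodesic distance over all ordered pairs of distinct,
    mutually reachable vertices, via one BFS per source vertex."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                      # standard breadth-first search
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.values())   # dist(s, s) = 0 contributes nothing
        pairs += len(dist) - 1
    return total / pairs

# toy example: a path graph 1-2-3-4-5
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(avg_distance(path))  # 2.0
```

Counting ordered pairs counts every unordered pair twice, which leaves the average unchanged.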

(18)

Some examples of Degree Distribution

(a) scientist collaborations: biologists (circles) and physicists (squares), (b) collaborations of movie actors, (d) network of directors of Fortune 1000 companies

(19)

Degree Distribution

Kolaczyk, “Statistical Analysis of Network Data: Methods and Models”, Springer 2009.

(20)

ScaleGraph Analytics Algorithms

(21)

Centrality

“There is certainly no unanimity on exactly what centrality is or its conceptual foundations, and there is little agreement on the procedure of its measurement.” – Freeman 1979.

Degree (centrality), Closeness (centrality), Betweenness (centrality), Eigenvector (centrality)

(22)

Conceptual Descriptions of Three Centrality Measurements

Kolaczyk, “Statistical Analysis of Network Data: Methods and Models”, Springer 2009.

(23)

Distance

▪ Distance of two vertices: The length of the shortest path between the vertices.

▪ Geodesic: another name for shortest path.

▪ Diameter: the value of the longest distance in a graph

(24)

Closeness

Closeness: a vertex is ‘close’ to the other vertices.

c_Cl(v) = 1 / Σ_{u∈V} dist(v, u)

where dist(v, u) is the geodesic distance between vertices v and u.
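The definition translates directly into code; a small Python sketch (assuming an adjacency-list dict, unweighted edges, and a connected graph):

```python
from collections import deque

def closeness(adj, v):
    """c_Cl(v) = 1 / sum over u of dist(v, u), distances by BFS."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return 1.0 / sum(dist.values())

# star graph: the hub (0) is closer to everyone than any leaf is
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(closeness(star, 0))  # 1/3 (hub)
print(closeness(star, 1))  # 1/5 (leaf)
```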

(25)

Betweenness

Betweenness measures are aimed at summarizing the extent to which a vertex is located ‘between’ other pairs of vertices.

Freeman's definition:

c_B(v) = Σ_{s≠t≠v∈V} σ(s, t | v) / σ(s, t)

where σ(s, t) is the number of shortest paths between s and t, and σ(s, t | v) is the number of those that pass through v.

Calculation of all betweenness centralities requires:

calculating the lengths of shortest paths among all pairs of vertices

computing the summation in the above definition for each vertex
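One way to organize that computation is an illustrative brute-force Python sketch for small, connected, unweighted graphs (names are mine), using the identity σ(s,t|v) = σ(s,v)·σ(v,t) whenever dist(s,v) + dist(v,t) = dist(s,t); Brandes' algorithm does the same job far more efficiently:

```python
from collections import deque

def bfs_counts(adj, s):
    """Geodesic distances and shortest-path counts (sigma) from s."""
    dist, sigma = {s: 0}, {s: 1}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], sigma[w] = dist[u] + 1, 0
                q.append(w)
            if dist[w] == dist[u] + 1:  # u feeds shortest paths into w
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """Freeman's c_B(v): sum over pairs {s,t} of sigma(s,t|v)/sigma(s,t)."""
    dist_v, sig_v = bfs_counts(adj, v)
    others = [u for u in adj if u != v]
    cb = 0.0
    for s in others:
        dist_s, sig_s = bfs_counts(adj, s)
        for t in others:
            # v lies on an s-t geodesic iff the distances add up
            if t != s and dist_s[v] + dist_v[t] == dist_s[t]:
                cb += sig_s[v] * sig_v[t] / sig_s[t]
    return cb / 2  # each unordered pair {s,t} was counted twice

# path 1-2-3-4: vertex 2 sits on the geodesics {1,3} and {1,4}
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(betweenness(path, 2))  # 2.0
```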

(26)

© 2014 CY Lin, Columbia University

Betweenness ➔ Bridges

Key social bridges

Connections between different divisions

Example: Healthcare experts in the U.S.

Example: Healthcare experts in the world

(27)


■ Structurally diverse networks with an abundance of structural holes are associated with higher performance.

Having diverse friends helps.

■ Betweenness is negatively correlated with performance at the people level but highly positively correlated at the project level.

Being a bridge between a lot of people is a bottleneck.

Being a bridge across a lot of projects is good.

■ Network reach is highly correlated with performance.

The number of people reachable in 3 steps is positively correlated with higher performance.

■ Having too many strong links — communicating frequently with the same set of people — is negatively correlated with performance.

Perhaps frequent communication with the same person implies redundant information exchange.

Productivity effect from network variables

• An additional person in network size ~ $986 revenue per year

• Each person that can be reached in 3 steps ~ $0.163 in revenue per month

• A link to a manager ~ $1074 in revenue per month

• 1 standard deviation of network diversity (1 - constraint) ~ $758

• 1 standard deviation of betweenness ~ -$300K

• 1 strong link ~ -$7.9 per month

Network Value Analysis

– First Large-Scale Economic Social Network Study

(28)

Eigenvector Centrality

Try to capture the ‘status’, ‘prestige’, or ‘rank’.

The more central the neighbors of a vertex are, the more central the vertex itself is.

c_Ei(v) = α Σ_{{u,v}∈E} c_Ei(u)

The vector c_Ei = (c_Ei(1), ..., c_Ei(N_v))^T is the solution of the eigenvalue problem:

A c_Ei = α^(-1) c_Ei
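The dominant eigenvector can be approximated by power iteration; a minimal Python sketch (my own names, iterating on A + I rather than A so the iteration cannot oscillate on bipartite graphs; both matrices share the same dominant eigenvector):

```python
def eigenvector_centrality(adj, iters=200):
    """Power iteration: c <- (A + I) c, rescaled so the max entry is 1."""
    c = {v: 1.0 for v in adj}
    for _ in range(iters):
        new = {v: c[v] + sum(c[u] for u in adj[v]) for v in adj}  # (A + I) c
        m = max(new.values())
        c = {v: x / m for v, x in new.items()}
    return c

# path 1-2-3: dominant eigenvector is proportional to (1, sqrt(2), 1)
c = eigenvector_centrality({1: [2], 2: [1, 3], 3: [2]})
print(c[2])  # 1.0 -- the middle vertex is most central
```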

(29)

PageRank Algorithm (Simplified)

(30)

PageRank Steps

Example: Simplified Initial State:

R(A) = R(B) = R(C) = R(D) = 0.25 Iterative Procedure:

R(A) = R(B) / 2 + R(C) / 1 + R(D) / 3

(Vertices A, B, C, D as in the figure.)

R(u) = d Σ_{v∈B_u} R(v)/N_v + e

where

F_u = the set of pages u points to
B_u = the set of pages that point to u
N_u = |F_u| = the number of links from u
e = (1 - d)/N, the normalization / damping term

In general, d = 0.85.
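The iterative procedure is easy to run directly; a small Python sketch of the damped update (function and variable names are mine). The slide does not show A's outgoing links, so the example below assumes A links to B:

```python
def pagerank(out_links, d=0.85, iters=100):
    """Iterate R(u) = (1-d)/N + d * sum over v in B_u of R(v)/N_v."""
    n = len(out_links)
    r = {p: 1.0 / n for p in out_links}           # uniform initial state
    for _ in range(iters):
        new = {p: (1 - d) / n for p in out_links}
        for v, outs in out_links.items():
            for u in outs:                        # v spreads its rank evenly
                new[u] += d * r[v] / len(outs)
        r = new
    return r

# B -> A, C;  C -> A;  D -> A, B, C;  A -> B (assumed)
r = pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]})
print(r["A"])  # A collects rank from B, C, and D, so it scores highest
```

With no dangling pages, the total rank stays 1 at every iteration.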

(31)

Solution of PageRank

The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix.

R := (R(p_1), R(p_2), ..., R(p_N))^T

where R is the solution of the equation R = d L R + ((1 - d)/N) 1, with L = [l(p_i, p_j)]; l(p_i, p_j) is the adjacency function, l(p_i, p_j) = 0 if page p_j does not link to p_i, and it is normalized such that for each j,

Σ_{i=1}^{N} l(p_i, p_j) = 1

(32)

Walk

▪ A walk on a graph G, from v_0 to v_l, is an alternating sequence:

{v_0, e_1, v_1, e_2, ..., v_{l-1}, e_l, v_l}

▪ The length of this walk is l.

▪ A walk may be:

– Trail --- no repeated edges

– Path --- a trail without repeated vertices.

(33)

Connectivity of Graph

A measure related to the flow of information in the graph. Connected ➔ every vertex is reachable from every other.

A connected component of a graph is a maximally connected subgraph.

A graph usually has one dominating the others in magnitude ➔ giant component.

(34)

Reachable, Connected, Component

▪ Reachable: A vertex v in a graph G is said to be reachable from another vertex u if there exists a walk from u to v.

▪ Connected: A graph is said to be connected if every vertex is reachable from every other.

▪ Component: A component of a graph is a maximally

connected subgraph.

(35)

Local Density

A coherent subset of nodes should be locally dense.

Cliques:

3-cliques

A sufficient condition for a clique of size n to exist in G is:

N_e > (N_v^2 / 2) · ((n - 2) / (n - 1))

(36)

Weakened Versions of Cliques -- Plexes

A subgraph H consisting of m vertices is called an n-plex, for m > n, if no vertex has degree less than m - n.

1-plex

1-plex ➔ No vertex is missing more than 1 of its possible m-1 edges.

(37)

Another Weakened Version of Cliques -- Cores

A k-core of a graph G is a subgraph H in which all vertices have degree at least k.

3-core

Batagelj et al., 1999: a maximal k-core subgraph may be computed in as little as O(N_v + N_e) time, by computing the shell indices of every vertex in the graph.

Shell index of v = the largest value, say c, such that v belongs to the c-core of G but not its (c+1)-core.

For a given vertex, those neighbors with lesser degree lead to a decrease in the potential shell index of that vertex.
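A simple O(N_v^2) Python sketch of this peeling idea (my own names; the linear-time algorithm of Batagelj et al. replaces the repeated min-search with bucket queues):

```python
def shell_indices(adj):
    """Repeatedly remove a vertex of minimum remaining degree; the running
    maximum of the removed degrees is that vertex's shell index."""
    deg = {v: len(adj[v]) for v in adj}
    removed, shell, k = set(), {}, 0
    while len(removed) < len(adj):
        v = min((u for u in adj if u not in removed), key=deg.get)
        k = max(k, deg[v])       # the shell index never decreases while peeling
        shell[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
    return shell

# triangle {1,2,3} with a pendant vertex 4: the triangle is the 2-core
g = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
print(shell_indices(g))  # 4 has shell index 1; 1, 2, 3 have shell index 2
```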

(38)

Density measurement

The density of a subgraph H = (V_H, E_H) is:

den(H) = |E_H| / (|V_H| (|V_H| - 1) / 2)

Range of density: 0 ≤ den(H) ≤ 1

and den(H) = d̄(H) / (|V_H| - 1), where d̄(H) is the average degree of H.
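Computed directly, this is a one-liner (trivial Python sketch; vertex set and undirected edge set as inputs, names are mine):

```python
def density(vertices, edges):
    """den(H) = |E_H| / (|V_H| (|V_H| - 1) / 2): fraction of possible edges present."""
    n = len(vertices)
    return len(edges) / (n * (n - 1) / 2)

print(density({1, 2, 3}, {(1, 2), (2, 3), (1, 3)}))  # 1.0 -- a triangle is complete
print(density({1, 2, 3, 4}, {(1, 2), (3, 4)}))       # 2/6 = 0.333...
```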

(39)

Use of the density measure

Density of a graph: let H=G

‘Clustering’ of edges local to v: let H = H_v, the subgraph formed by the neighbors of a vertex v and the edges between them

Clustering Coefficient of a graph: The average of den(Hv) over all vertices

(40)

An insight of clustering coefficient

A triangle is a complete subgraph of order three.

A connected triple is a subgraph of three vertices connected by two edges (regardless of whether the third pair is connected).

The local clustering coefficient can be expressed as:

cl(v) = den(H_v) = τ_Δ(v) / τ_3(v)

where τ_Δ(v) = # of triangles containing v, and τ_3(v) = # of connected triples for which the 2 edges are both incident to v.

The clustering coefficient of G is then:

cl(G) = (1 / |V′|) Σ_{v∈V′} cl(v)

where V′ ⊆ V is the set of vertices v with d_v ≥ 2.
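These quantities are easy to compute directly; a small Python sketch over an adjacency-list dict (function names and example graph are mine):

```python
def local_clustering(adj, v):
    """cl(v) = den(H_v): fraction of pairs of v's neighbors that are linked."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return None                  # cl(v) is only defined for d_v >= 2
    linked = sum(1 for i, u in enumerate(nbrs)
                 for w in nbrs[i + 1:] if w in adj[u])
    return linked / (len(nbrs) * (len(nbrs) - 1) / 2)

def clustering_coefficient(adj):
    """cl(G): average of cl(v) over the vertices with degree at least 2."""
    vals = [local_clustering(adj, v) for v in adj if len(adj[v]) >= 2]
    return sum(vals) / len(vals)

# triangle {1,2,3} plus a pendant vertex 4 hanging off vertex 1
g = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2], 4: [1]}
print(local_clustering(g, 1))     # 1/3: one of the three neighbor pairs is linked
print(clustering_coefficient(g))  # (1/3 + 1 + 1) / 3 = 7/9
```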

(41)

An example

(42)

Transitivity of a graph

A variation of the clustering coefficient ➔ takes a weighted average:

cl_T(G) = Σ_{v∈V′} τ_3(v) cl(v) / Σ_{v∈V′} τ_3(v) = 3 τ_Δ(G) / τ_3(G)

where

τ_Δ(G) = (1/3) Σ_{v∈V} τ_Δ(v) is the number of triangles in the graph

τ_3(G) = Σ_{v∈V} τ_3(v) is the number of connected triples

➔ The friend of your friend is also a friend of yours

Clustering coefficients have become a standard quantity for network structure analysis. But it is important to report which clustering coefficient is used.

(43)

Vertex / Edge Connectivity

If an arbitrary subset of k vertices or edges is removed from a graph, is the remaining subgraph connected?

A graph G is called k-vertex-connected, if (1) Nv>k, and (2) the removal of any subset of

vertices X in V of cardinality |X| smaller than k leaves a subgraph G – X that is connected.

The vertex connectivity of G is the largest integer k such that G is k-vertex-connected.

A similar measurement holds for edge connectivity.

(44)

Vertex / Edge Cut

If the removal of a particular set of vertices in G disconnects the graph, that set is called a vertex cut.


For a given pair of vertices (u,v), a u-v-cut is a partition of V into two disjoint non-empty subsets, S and S’, where u is in S and v is in S’.

Minimum u-v-cut: the sum of the weights on edges connecting vertices in S to vertices in S’ is a minimum.

(45)

Minimum cut and flow

Finding a minimum u-v-cut is equivalent to maximizing a measure of flow on the edges of a derived directed graph.

Ford and Fulkerson, 1962. Max-Flow Min-Cut theorem.

(46)

Graph Partitioning

Many uses of graph partitioning:

E.g., community structure in social networks


A cohesive subset of vertices generally is taken to refer to a subset of vertices that (1) are well connected among themselves, and (2) are relatively well separated from the remaining vertices.

Graph partitioning algorithms typically seek a partition of the vertex set of a graph such that the sets E(Ck, Ck') of edges connecting vertices in Ck to vertices in Ck' are relatively small in size compared to the sets E(Ck) = E(Ck, Ck) of edges connecting vertices within Ck.

(47)

Classify the nodes

(48)

Example: Karate Club Network

(49)

Hierarchical Clustering

Agglomerative Divisive

In agglomerative algorithms, given two sets of vertices C1 and C2, two standard approaches to assigning a dissimilarity value to this pair of sets are to use the minimum (called single-linkage) or the maximum (called complete-linkage) of the dissimilarity x_ij over all pairs.

x_ij = |N_{v_i} Δ N_{v_j}| (suitably normalized)

The "normalized" number of neighbors of v_i and v_j that are not shared.

(50)

Hierarchical Clustering Algorithms Types

Primarily differ in [Jain et al. 1999]:

(1) how they evaluate the quality of proposed clusters, and
(2) the algorithms by which they seek to optimize that quality.

Agglomerative: successive coarsening of partitions through the process of merging.

Divisive: successive refinement of partitions through the process of splitting.

At each stage, the current candidate partition is modified in a way that minimizes a specified measure of cost.

In agglomerative methods, the least costly merge of two previously existing partition elements is executed.

In divisive methods, it is the least costly split of a single existing partition element into two that is executed.

(51)

Hierarchical Clustering

The resulting hierarchy typically is represented in the form of a tree, called a dendrogram.

The measure of cost incorporated into a hierarchical clustering method used in graph partitioning should reflect our sense of what defines a ‘cohesive’ subset of vertices.

In agglomerative algorithms, given two sets of vertices C1 and C2, two standard approaches to assigning a dissimilarity value to this pair of sets are to use the minimum (called single-linkage) or the maximum (called complete-linkage) of the dissimilarity x_ij over all pairs.

Dissimilarities for subsets of vertices were calculated from the x_ij using the extension of Ward (1963)'s method, and the lengths of the branches in the dendrogram are in relative proportion to the changes in dissimilarity.

x_ij = |N_{v_i} Δ N_{v_j}| (suitably normalized)

x_ij is the "normalized" number of neighbors of v_i and v_j that are not shared.

N_v is the set of neighbors of a vertex.

Δ is the symmetric difference of two sets, which is the set of elements that are in one or the other but not both.

(52)

Other dissimilarity measures

There are various other common choices of dissimilarity measure, such as:

x_ij = Σ_{k≠i,j} (A_ik - A_jk)^2

Hierarchical clustering algorithms based on dissimilarities of this sort are reasonably efficient, running in O(N_v^2 log N_v) time.
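For illustration, this adjacency-based dissimilarity is one line of Python (the adjacency matrix, example graph, and names are mine):

```python
def dissimilarity(A, i, j):
    """x_ij = sum over k != i, j of (A_ik - A_jk)^2: vertices whose
    neighborhoods (apart from each other) coincide get x_ij = 0."""
    return sum((A[i][k] - A[j][k]) ** 2
               for k in range(len(A)) if k not in (i, j))

# 4-cycle 0-1-3-2-0: vertices 1 and 2 share exactly the neighbors {0, 3}
A = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]
print(dissimilarity(A, 1, 2))  # 0 -- structurally equivalent
print(dissimilarity(A, 0, 1))  # 2 -- they disagree on neighbors 2 and 3
```

A clustering algorithm would then merge the pair with the smallest x_ij first.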

(53)

Hierarchical Clustering Example

(54)

Several Open Source Graph Tools

Titan is a native Blueprints enabled graph database

(55)

Graph Language

(56)

Dataset: 12.2 million edges, 2.2 million vertices

Goal: Find paths in a property graph. One of the vertex properties is called TYPE. In this scenario, the user provides either a particular vertex or a set of particular vertices of the same TYPE (say, "DRUG"). In addition, the user also provides another TYPE (say, "TARGET"). Then, we need to find all the paths from the starting vertex to a vertex of TYPE "TARGET". Therefore, we need to 1) find the paths using graph traversal; 2) keep track of the paths, so that we can list them after the traversal. Even for shortest paths, there can be multiple between two nodes, such as: drug->assay->target, drug->MOA->target.

Performance Comparison of Titan and others:

                                      First test     Avg time (100 tests)
                                      (cold-start)   Depth-5 traversal   Full-depth traversal
IBM System G (NativeStore C++)        39 sec         3.0 sec             4.2 sec
IBM System G (NativeStore JNI, Java)  57 sec         4.0 sec             6.2 sec
Neo4j (Blueprints 2.4)                105 sec        5.9 sec             8.3 sec
Titan (Berkeley DB)                   3861 sec       641 sec             794 sec
Titan (HBase)                         3046 sec       1597 sec            2682 sec

First full test: full depth 23. All data pulled from disk; nothing initially cached.

Modes: all tests in the default mode of each graph implementation. Titan can only be run in transactional mode; the other implementations do not default to transactional mode.

(57)

ScaleGraph DB — System G DB’s open source version

Prereqs

▪ Linux

▪ Intel 64

▪ OpenJDK 6 or higher

▪ Maven - http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html

(58)

ScaleGraph DB (a.k.a. PropelGraph)

Installation

1a) git clone https://github.com/scalegraph/propelgraph.git
or
1b) wget https://github.com/scalegraph/propelgraph/archive/master.zip ; unzip master.zip
2) cd propelgraph/propelgraph-gremlin
3) ./makepackage.sh

(59)

ScaleGraph DB

Trying It Out

3) cd propelgraph-gremlin-2.4.0
4) bin/gremlin.sh
5) optional: read a gremlin tutorial
6) g = CreateGraph.openGraph("nativemem_authors","awesome")
7) new LoadCSV().populateFromVertexFile(g, "data/movies.movies.v.csv", "movies", 5555555)
8) new LoadCSV().populateFromVertexFile(g, "data/movies.appearances.e.csv", "appearances", 5555555)
9) g.v(20).both.bothV
10) Analytics.collaborativeFilter(g, 20, "appearance", Direction.OUT, "appearance", Direction.IN)

(60)

ScaleGraph DB

Help

▪ https://github.com/scalegraph/scalegraph/propelgraph

▪ ccjason@us.ibm.com

(61)

Questions?
