LINEAR-ALGEBRAIC GRAPH MINING

(1)

Geoffrey Sanders, CASC/LLNL

New Applications of Computer Analysis to Biomedical Data Sets

QB3 Seminar, UCSF Medical School, May 28th, 2015

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

(2)

LLNL and LDRD

• LLNL is a DOE FFRDC

• Center for Applied Scientific Computing (CASC)

• Several of us work on Laboratory Directed Research and Development (LDRD) projects in HPC and Data Analysis

• Graph Analytics, Machine Learning, Network Analysis

• We are ALWAYS looking for domain scientist collaborators with interesting datasets or new data mining tasks

• DOE national labs have a history of building open HPC software for PDE-related applications (Physics Simulation)

• PETSc [H2], Trilinos [H4], hypre [H3], SAMRAI [H1], etc.

(3)

Outline

1. Introduction

2. Analytics that Rank

3. Analytics that Cluster

4. Analytics that Approximate Expensive Calculations

(4)

Graph Model Definitions

• Graph G(V,E)

• Vertices i, j in V

• Edges (i,j) in E

• Edge weights

• Hypergraphs

• (i,j,k,l) and (p,q,r) in E

• Undirected vs Directed?

• (i,j) and (j,i)

• Attributes?

• Vertex Labels

• Height, Gender, Profession

• Edge Labels

• Timestamp, volume

(5)

Difficult Topologies

• Scale-Free

• Small-World

• Community Structure

• Hierarchical

• Overlapping

• Heterogeneous in size, density, type, etc.

• Other Structure

(6)

Web Data Commons Hyperlink Graph

• Crawled in 2014, directed [D1]

(7)

Spy Plot

• Graphs have natural sparse matrix representations

• Linear algebra applies

[Spy plot: nonzero pattern of the matrix, axes vertices × vertices]
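To make the sparse matrix representation concrete, here is a minimal sketch that stores an edge list in compressed sparse row (CSR) form and applies the matrix to a vector. The function names and the 4-vertex example graph are hypothetical, chosen only for illustration; a production code would use a sparse linear algebra library instead of plain lists.

```python
def edges_to_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) for an unweighted directed edge list."""
    counts = [0] * num_vertices
    for i, _ in edges:
        counts[i] += 1
    # row_ptr[i] is the offset of the first stored neighbor of vertex i.
    row_ptr = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        row_ptr[i + 1] = row_ptr[i] + counts[i]
    col_idx = [0] * len(edges)
    next_slot = list(row_ptr[:-1])
    for i, j in edges:
        col_idx[next_slot[i]] = j
        next_slot[i] += 1
    return row_ptr, col_idx


def csr_matvec(row_ptr, col_idx, x):
    """Compute y = A x, where A has a 1 at (i, j) for each stored edge (i, j)."""
    return [
        sum(x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


# Hypothetical undirected graph: each edge is stored in both directions.
undirected = [(0, 1), (1, 2), (2, 0), (2, 3)]
edges = undirected + [(j, i) for i, j in undirected]
row_ptr, col_idx = edges_to_csr(4, edges)

# Multiplying by the all-ones vector gives the vertex degrees.
degrees = csr_matvec(row_ptr, col_idx, [1, 1, 1, 1])
```

Only the nonzeros are stored, so memory and matrix-vector cost scale with the number of edges rather than with the square of the number of vertices.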

(8)

Linear-Algebraic Kernels

• Linear Solve: $L x = b$

• Matrix Factorization: $L \approx F G^t$

• Eigensolve: $L x = \lambda x$

• Tensor Factorization [T1]: $A \approx \sum_{k=1}^{r} u_k \circ v_k \circ w_k$
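The tensor factorization kernel [T1] models a three-way array as a sum of r outer products of vectors u_k, v_k, w_k. The sketch below only evaluates that sum for given factors; it does not fit them. The function name and the tiny factor vectors are hypothetical, and treating the third index as time is an assumption made for illustration.

```python
def cp_reconstruct(us, vs, ws):
    """Return the tensor sum_k u_k (outer) v_k (outer) w_k as nested lists."""
    n_i, n_j, n_t = len(us[0]), len(vs[0]), len(ws[0])
    tensor = [[[0.0] * n_t for _ in range(n_j)] for _ in range(n_i)]
    for u, v, w in zip(us, vs, ws):
        for i in range(n_i):
            for j in range(n_j):
                for t in range(n_t):
                    # Entry (i, j, t) receives u[i] * v[j] * w[t] from this term.
                    tensor[i][j][t] += u[i] * v[j] * w[t]
    return tensor


# Hypothetical rank-2 example: 2 source vertices, 2 target vertices, 3 time steps.
us = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 1.0], [0.0, 2.0]]
ws = [[1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
A = cp_reconstruct(us, vs, ws)
```

Each term contributes one "block" of activity: the first term links vertex 0 to both targets at time 0, the second links vertex 1 to target 1 at time 2.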

(10)

Ranking Calculations

• Global Ranking?

• Ordered list

• Often only care about top few of the list

• Personalized PR [R2]

• Supervised

• PageRank [R1]

• Centrality measure

• Random walk

• Connection Subgraph [R3]

• Solve for direction

• Rank vertices
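As an illustration of the random-walk view of PageRank [R1], the following sketch runs a power iteration on a small directed graph. The damping factor 0.85, the uniform treatment of dangling vertices, the function name, and the 4-vertex example are assumptions made here, not details taken from the slides.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank; out_links[i] lists the targets of vertex i."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Mass sitting on dangling vertices (no out-links) is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if not out_links[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for i, targets in enumerate(out_links):
            for j in targets:
                # The walker at i follows one of its out-links uniformly at random.
                new_rank[j] += damping * rank[i] / len(targets)
        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph: 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
out_links = [[2], [2], [0], [3 - 1]]
scores = pagerank(out_links)

# Ordered list of vertices, highest score first.
ordering = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Replacing the uniform teleport term `(1.0 - damping) / n` with a vector concentrated on a few seed vertices gives a personalized variant in the spirit of [R2].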

(11)

Exotic Ranking Calculations

• My brother gets

• My score

• Worse than a buoy

(13)

Clustering

• Spectral Clustering [O1,C4]

• Recursive Bipartite SC

• Unsupervised?

• Hard or Soft?

• Agglomerative [C1]

• Start with n groups

• Make local grouping decisions to maximize Modularity: $\sum_{\text{comms}} \left[ (\text{internal edges}) - (\text{expected internal edges}) \right]$

[Figure: matrix sparsity pattern ordered randomly vs. ordered by Fiedler vector; annotation: sign(v).*(log10(|v|+ε)+min{log10(|v|+ε)})]

[Figure: recursive splitting tree of the original vertex set. Reason split: vector splitting or connected components. Reason stopped: minimum cluster size or max clusters. Node labels: cc 0, vec 1, cc 2, vec 3, cc 4.]
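The modularity objective compares, for each community, the number of internal edges with the number expected under a random graph with the same degrees. A minimal sketch follows, assuming the common normalization Q = Σ_c [ m_c/m − (d_c/2m)² ] for a simple undirected graph, where m_c counts internal edges and d_c is the total degree of community c. The function name and the example graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity of a partition; community[v] is the label of vertex v."""
    m = len(edges)
    internal = {}    # m_c: edges with both endpoints in community c
    degree_sum = {}  # d_c: sum of degrees of the vertices in community c
    for i, j in edges:
        ci, cj = community[i], community[j]
        degree_sum[ci] = degree_sum.get(ci, 0) + 1
        degree_sum[cj] = degree_sum.get(cj, 0) + 1
        if ci == cj:
            internal[ci] = internal.get(ci, 0) + 1
    # Expected internal edges of community c is d_c^2 / (4 m).
    return sum(
        (internal.get(c, 0) - d * d / (4.0 * m)) / m
        for c, d in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = modularity(edges, [0, 0, 0, 1, 1, 1])
one_group = modularity(edges, [0, 0, 0, 0, 0, 0])
singletons = modularity(edges, [0, 1, 2, 3, 4, 5])
```

An agglomerative method in the style of [C1] would start from the singleton partition and repeatedly apply the local move that increases this quantity the most.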

(14)

[Excerpt: Haifeng Xu, Hans De Sterck and Geoffrey Sanders]

Figure 1. Feature matrix F and induced weighted bipartite graph. Red dots correspond to row variables of F, and blue dots to column variables. The rows and columns may, e.g., represent LinkedIn users and skills, with the weights indicating how often a user’s skill was endorsed by the user’s connections.

[...] research groups, etc. These hierarchical structures are overlapping; for example, some professors may be active in multiple departments or faculties, and many skills (often even the more specialized ones) are taught in multiple degree programs. In a similar way, it is to be expected that many of the currently emerging online social networks also contain inherent overlapping hierarchical organization, in particular when they focus on a specific dimension of the human condition, like, e.g., the professional dimension. Consider for example the LinkedIn social network, where users connect to their business relations and acquaintances, and list user-defined ‘skills and expertise’ on their user profiles that can be endorsed by their connections. Similar to the case of universities, it is clear that in a social network like LinkedIn there must be hierarchical overlapping groups of users with similar skills and professions, and hierarchical overlapping groups of skill keywords that characterize professional groups.

Figure 2. Bipartite graph hierarchy obtained by the FMCC algorithms. The input feature matrix is located at the bottom of the diagram.

However, in contrast to the example of universities, in emerging social networks this hierarchy is not ‘hard-coded’ into the structure of the network; if it were, it would seriously impede the growth and dynamical evolution of these networks. Since the hierarchy is not explicitly hard-coded into the structure of the network but is nevertheless present, it is at once a very interesting and a challenging problem to try to automatically generate a representation of this hierarchy from the

Copyright © 2015 John Wiley & Sons, Ltd. Numer. Linear Algebra Appl. (2015), DOI: 10.1002/nla

Overlapping (Co-)Clustering

• Non-Negative MF [C2]

• Factors positive-valued

• Interpreted as probabilities

• Coarsening [C5]

• Multilevel

• LinkedIn data

• Latent Dirichlet Allocation [C3]

• Model a document as a weighted group of topics

• Each topic has individual vocabulary

• Document is a random combination of terms from its topics

• More general than term-document data

$L \approx F G$
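A minimal sketch of a non-negative factorization L ≈ F G follows. It uses the Lee-Seung multiplicative updates, which are one common choice and not necessarily the algorithm of [C2] or [C5]. Here G is stored as a rank × columns array; the function names, the random initialization, and the small users-by-skills matrix are all hypothetical.

```python
import random


def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(matrix, f, g):
    """Frobenius norm of matrix - F G."""
    fg = matmul(f, g)
    return sum(
        (m - p) ** 2 for m_row, p_row in zip(matrix, fg) for m, p in zip(m_row, p_row)
    ) ** 0.5


def nmf(matrix, rank, iterations=200, seed=0, eps=1e-12):
    """Multiplicative updates keep every entry of F and G non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    f = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    g = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # G <- G * (F^T L) / (F^T F G), elementwise.
        ft = transpose(f)
        num, den = matmul(ft, matrix), matmul(matmul(ft, f), g)
        g = [[g[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        # F <- F * (L G^T) / (F G G^T), elementwise.
        gt = transpose(g)
        num, den = matmul(matrix, gt), matmul(f, matmul(g, gt))
        f = [[f[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return f, g


# Hypothetical 4 users x 4 skills matrix with two overlapping groups.
users_by_skills = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 3.0, 3.0],
    [0.0, 0.0, 3.0, 4.0],
]
f0, g0 = nmf(users_by_skills, rank=2, iterations=0)
f, g = nmf(users_by_skills, rank=2)
initial_error = frobenius_error(users_by_skills, f0, g0)
final_error = frobenius_error(users_by_skills, f, g)
```

Because the factors stay non-negative, each row of F can be read as soft membership weights of a user in the two groups, which is what permits overlapping clusters.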

(16)

Approximations to Expensive Calcs

• Triangle Counting [O3]

• Sum of the diagonal of $A^3$ is 6 × (# triangles)

• Estimate diagonal entries of $A^3$

• $\mathrm{Trace}(A^3) = \sum_i \lambda_i(A)^3$

• Nearly-planar coloring [O2]

• Mincut [O1]

• Maxcut [O4]
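The triangle-counting identity can be checked exactly on a small graph. The sketch below forms the diagonal of A³ with dense arithmetic, which is only feasible for tiny examples; the approach in [O3] instead estimates the trace from a few eigenvalues. The function names and the example adjacency matrix are hypothetical.

```python
def closed_walks_of_length_3(adj):
    """Return the diagonal of A^3 for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    a2 = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # (A^3)_ii = sum_j (A^2)_ij * A_ji
    return [sum(a2[i][j] * adj[j][i] for j in range(n)) for i in range(n)]


def count_triangles(adj):
    """Each triangle is counted 6 times in trace(A^3): 3 start vertices x 2 directions."""
    return sum(closed_walks_of_length_3(adj)) // 6


# Hypothetical graph: triangle 0-1-2 plus a pendant edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
diag_a3 = closed_walks_of_length_3(adj)
triangles = count_triangles(adj)
```

Each diagonal entry equals twice the number of triangles through that vertex, so the pendant vertex 3 contributes zero.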

(18)

Important Extensions

• Bipartite Graphs

• Directed Graphs

• Highly-cyclic structure

Figure 12: Matrix sparsity structure for example 3 under the reordering given by the Fiedler vector (left) and associated ordering of row and column variables (right). The purple line in the figure on the right shows the one-dimensional search space for row and column splittings considered by the out-of-box undirected method.

Figure 13: Original graph from example 4 (left), randomly reordered (middle), and bipartized (right).

• Dynamic Graphs [T2]

• Streaming

• Gather stats

• Tensors

• Time is a tensor dim.

• Causality?

• Labeled Graphs

• Factor in labels

• Label anomalies?

[Figure: Author-Conference co-cluster]

where M [ADD THE NECESSARY CONSTRAINTS TO M]. Then, the eigenpairs of B(o,x) and (o,x) can be used to define a mapping into $\mathbb{R}^2$ such that the nodes are mapped to [DETERMINE PARAMETERS OF REGION] around vectors of length 1 at angles of $\pi/3$, $\pi$, or $5\pi/3$.

[Figure 2: matrix sparsity pattern (nz = 26130) and spectrum of $D^{-1}A$ in the complex plane (axes Re, Im)]

[Figure 3: spectral coordinates, Re($v_i + w_i$) vs. Im($v_i - w_i$)]

4 General c-cyclic structure

For $p = 0, \dots, c-1$, $\theta_{p,c} \in \sigma(C_c)$. Entries of the eigenvector $C v = \theta_{p,c} v$ are $v_i = \theta_{p,c}^i$.

$B_c = B \otimes C_c$

5 Numerical Approximation

$x_{k+1} = f(B) x_k$
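The eigenpair statement for c-cyclic structure can be checked numerically. The sketch below assumes that C is the adjacency matrix of a directed cycle on c vertices and that θ_{p,c} = exp(2πi p/c), a c-th root of unity; neither definition is spelled out in the excerpt, so both are assumptions, and the function names are hypothetical.

```python
import cmath


def cyclic_shift_matrix(c):
    """Adjacency matrix of the directed cycle 0 -> 1 -> ... -> c-1 -> 0."""
    return [[1 if j == (i + 1) % c else 0 for j in range(c)] for i in range(c)]


def eigen_residual(c, p):
    """Max entry of |C v - theta v| with theta = exp(2 pi i p / c), v_i = theta^i."""
    theta = cmath.exp(2j * cmath.pi * p / c)
    v = [theta ** i for i in range(c)]
    C = cyclic_shift_matrix(c)
    Cv = [sum(C[i][j] * v[j] for j in range(c)) for i in range(c)]
    return max(abs(Cv[i] - theta * v[i]) for i in range(c))


# Residuals for every p = 0, ..., c-1 on a few small cycle lengths.
residuals = {
    (c, p): eigen_residual(c, p) for c in range(3, 7) for p in range(c)
}
```

All residuals should be at the level of floating-point roundoff, since row i of C v picks out v_{i+1} = θ v_i and the wrap-around row uses θ^c = 1.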

(19)

Summary

• Linear Algebra addresses a diverse set of graph analytics

• Linear Algebra kernels are somewhat scalable, implemented in many computing environments

• Often requires close interaction with math or computer

(20)

References I of III

HPC

• [H1] Wissink et al. Large Scale Structured AMR Calculations Using the SAMRAI Framework, SC01 Proceedings, 2001. LLNL tech report UCRL-JC-144755.

• [H2] Balay et al. Efficient Management of Parallelism in Object Oriented Numerical Software Libraries, Modern Software Tools in Scientific Computing, 1997.

• [H3] Falgout et al. Design of the hypre Preconditioner Library, Proc. of the SIAM Workshop on Object Oriented Methods for Inter-operable Scientific and Engineering Computing, 1998.

• [H4] Heroux et al. An Overview of Trilinos, SNL Tech Report SAND2003-2927, 2003.

Data

• [D1] Meusel et al. Web Data Commons 2014 Hyperlink Graph, http://webdatacommons.org/hyperlinkgraph/2014-04/topology.htm

• [D2] Leskovec et al. Stanford Large Network Dataset Collection,

(21)

References II of III

Ranking

• [R1] Page. PageRank: Bringing order to the web, Stanford Digital Libraries Working Paper, 1997.

• [R2] Haveliwala. Topic-sensitive pagerank, In WWW, pages 517–526, 2002.

• [R3] Faloutsos et al. Fast Discovery of Connection Subgraphs, KDD 2004.

• [R4] Walkscore, http://www.walkscore.com/

Clustering

• [C1] Blondel et al. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment (10), P10008, 2008.

• [C2] Paatero et al. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values, Environmetrics, 1994.

• [C3] Blei et al. Latent Dirichlet Allocation, Journal of Machine Learning Research, 2003.

• [C4] von Luxburg. A tutorial on spectral clustering, Statistics and Computing, 2007.

• [C5] Xu et al. Fast Multilevel Co-Clustering: Unraveling the Multilevel Overlapping Cluster Structure of Social Network Data, submitted to Numerical Linear Algebra

(22)

References III of III

Tensors

• [T1] Kolda et al. Tensor Decompositions and Applications, SIAM Review, 2008.

• [T2] Dunlavy et al. Clustering network data using graphs, hypergraphs, and tensors, lecture given at University of Montreal, May 2015.

Discrete Optimization

• [O1] Fiedler. Algebraic connectivity of Graphs, Czechoslovak Mathematical Journal 23 (98), 1973.

• [O2] Hu et al. On Maximum Differential Graph Coloring, Lecture Notes in Computer Science, 2011.

• [O3] Tsourakakis et al. Spectral Counting of Triangles in Power-Law Networks via