**LINEAR-ALGEBRAIC GRAPH MINING**

**Geoffrey Sanders, CASC/LLNL**

### New Applications of Computer Analysis to Biomedical Data Sets

### QB3 Seminar, UCSF Medical School, May 28th, 2015

**Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551**


**This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344**

### LLNL and LDRD

- LLNL is a DOE FFRDC
  - Center for Applied Scientific Computing (CASC)
- Several of us work on Laboratory Directed Research and Development (LDRD) projects in HPC and Data Analysis
  - Graph Analytics, Machine Learning, Network Analysis
- We are ALWAYS looking for domain scientist collaborators with **interesting datasets** or **new data mining tasks**
- DOE national labs have a history of building open HPC software for PDE-related applications (Physics Simulation)
  - PETSc [H2], Trilinos [H4], Hypre [H3], Samrai [H1], etc.

### Outline

1. Introduction
2. Analytics that Rank
3. Analytics that Cluster
4. Analytics that Approximate Expensive Calculations

### Graph Model Definitions

- Graph G(V,E)
  - Vertices *i*, *j* in V
  - Edges (*i,j*) in E
  - Edge weights
- Hypergraphs
  - (*i,j,k,l*) and (*p,q,r*) in E
- Undirected vs Directed?
  - (*i,j*) and (*j,i*)
- Attributes?
  - Vertex Labels
    - Height, Gender, Profession
  - Edge Labels
    - Timestamp, volume

### Difficult Topologies

- **Scale-Free**
- Small-World
- Community Structure
  - Hierarchical
  - Overlapping
  - Heterogeneous in size, density, type, etc.
- Other Structure

### Web Data Commons Hyperlink Graph

- Crawled in 2014, directed [D1]

### Spy Plot

- Graphs have natural sparse matrix representations
- Linear algebra applies

[Spy plot; both axes labeled *vertices*.]
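As a minimal sketch of the sparse-matrix view of a graph, assuming NumPy and SciPy are available, the following builds an adjacency matrix from an edge list. The helper name `adjacency_from_edges` and the toy edge list are illustrative and not from the talk; the sketch also assumes there are no self-loops.

```python
import numpy as np
import scipy.sparse as sp

def adjacency_from_edges(edges, n, directed=False):
    """Build an n-by-n sparse adjacency matrix from (i, j, weight) triples."""
    rows = np.array([e[0] for e in edges])
    cols = np.array([e[1] for e in edges])
    vals = np.array([e[2] for e in edges], dtype=float)
    A = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
    if not directed:
        # An undirected edge (i, j) is stored as both (i, j) and (j, i).
        A = A + A.T
    return A

# Toy graph: a triangle 0-1-2 plus a heavier edge 2-3 (hypothetical data).
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (2, 3, 2.0)]
A = adjacency_from_edges(edges, n=4)

# Weighted degree of each vertex is a row sum of the adjacency matrix.
degrees = np.asarray(A.sum(axis=1)).ravel()
```

Only the nonzero entries are stored, which is what makes this representation practical for large, sparse graphs such as hyperlink crawls.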

### Linear-Algebraic Kernels

- Linear Solve: $L\mathbf{x} = \mathbf{b}$
- Matrix Factorization: $L \approx F G^{t}$
- Eigensolve: $L\mathbf{x} = \lambda\mathbf{x}$
- Tensor Factorization [T1]: $\mathbf{A} \approx \sum_{k=1}^{r} \mathbf{u}_k \circ \mathbf{v}_k \circ \mathbf{w}_k$
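The first three kernels can be sketched on a tiny graph Laplacian. This is an illustration under stated assumptions, not the solvers used in the talk: the path graph, the identity shift that makes the singular Laplacian solvable, and the use of a truncated SVD as the factorization $L \approx F G^{t}$ are all choices made here. At scale one would use iterative sparse solvers instead of the dense calls shown.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

# Path graph 0-1-2-3 (hypothetical example) and its Laplacian L = D - A.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
n = A.shape[0]
D = sp.diags(np.asarray(A.sum(axis=1)).ravel())
L = (D - A).tocsc()

# Linear solve: L is singular (constant vectors are in its null space),
# so this sketch solves the shifted system (L + I) x = b instead.
b = np.array([1.0, 0.0, 0.0, -1.0])
shifted = L + sp.identity(n, format="csc")
x = spsolve(shifted, b)

# Eigensolve: L x = lambda x (dense solver, fine for a 4-by-4 example).
lam, X = np.linalg.eigh(L.toarray())

# Matrix factorization: rank-r truncated SVD gives L ~ F G^t.
r = 2
U, s, Vt = np.linalg.svd(L.toarray())
F = U[:, :r] * s[:r]
G = Vt[:r, :].T
err = np.linalg.norm(L.toarray() - F @ G.T)
```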


### Ranking Calculations

- Global Ranking?
  - Ordered list
  - Often only care about top few of the list
- Personalized PR [R2]
  - Supervised
- PageRank [R1]
  - Centrality measure
  - Random walk
- Connection Subgraph [R3]
  - Solve for direction
  - Rank vertices
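A minimal sketch of PageRank as a random-walk ranking, using dense NumPy arrays for brevity. The damping factor `alpha=0.85`, the treatment of dangling vertices (their mass is redistributed according to `v`), and the function name are assumptions of this sketch rather than details from the talk. Passing a non-uniform teleportation vector `v` gives a personalized variant.

```python
import numpy as np

def pagerank(A, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank; A[i, j] = 1 means an edge i -> j."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Teleportation vector: uniform for global PageRank, custom for personalized.
    v = np.full(n, 1.0 / n) if v is None else np.asarray(v, dtype=float) / np.sum(v)
    out_deg = A.sum(axis=1)
    dangling = out_deg == 0
    # Row-normalize A into a transition matrix; dangling rows stay zero.
    P = np.divide(A, out_deg[:, None], out=np.zeros_like(A),
                  where=~dangling[:, None])
    x = v.copy()
    for _ in range(max_iter):
        # Follow an edge with probability alpha, otherwise teleport to v.
        x_new = alpha * (P.T @ x + x[dangling].sum() * v) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Hypothetical example: vertices 1, 2, 3 all point to 0, and 0 points to 1.
hub = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0]])
scores = pagerank(hub)
top_few = np.argsort(-scores)[:2]   # often only the top of the list matters
```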

### Exotic Ranking Calculations

- My brother gets
- My score
- Worse than a buoy


### Clustering

- Unsupervised?
- Hard or Soft?
- Agglomerative [C1]
  - Start with *n* groups
  - Make local grouping decisions to maximize *Modularity*:

$$\sum_{\text{comms}} \left[ \left(\text{Internal edges}\right) - \left(\text{Expected internal edges}\right) \right]$$

- Spectral Clustering [O1,C4]
  - Recursive Bipartite SC

[Figure panel labels: ordered randomly; ordered by Fiedler vector; sign(**v**).*(log10(|**v**|+ε)+min{log10(|**v**|+ε)}).]

[Figure: recursive splitting tree rooted at the **ORIGINAL VERTEX SET**. **Reason split:** vector splitting, connected components. **Reason stopped:** minimum cluster size, max clusters. Internal nodes: cc 0, vec 1, cc 2, vec 3, cc 4; leaf clusters 0 to 13.]
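One bisection step of spectral clustering, together with the modularity score, can be sketched as follows. This is an illustration only: it uses the unnormalized Laplacian, splits on the sign of the Fiedler vector, and normalizes modularity by the total edge weight in the usual Newman form. These choices and the function names are assumptions, not the recursive method from the talk.

```python
import numpy as np

def fiedler_bisect(A):
    """Split vertices by the sign of the Fiedler vector of L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    lam, X = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = X[:, 1]                   # eigenvector of second-smallest eigenvalue
    return (fiedler > 0).astype(int), fiedler

def modularity(A, labels):
    """Sum over communities of (internal edges) - (expected internal edges),
    divided by twice the number of edges."""
    k = A.sum(axis=1)                   # vertex degrees
    two_m = k.sum()                     # twice the number of edges
    Q = 0.0
    for c in np.unique(labels):
        idx = labels == c
        internal = A[np.ix_(idx, idx)].sum()      # counts each internal edge twice
        expected = k[idx].sum() ** 2 / two_m      # same quantity under random wiring
        Q += (internal - expected) / two_m
    return Q

# Hypothetical example: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

labels, fiedler = fiedler_bisect(A)
Q = modularity(A, labels)
```

Sorting the vertices by `fiedler` is the reordering that exposes block structure in a spy plot; recursing on each side of the split gives a hierarchy of clusters.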

Excerpt from Haifeng Xu, Hans De Sterck and Geoffrey Sanders, Numer. Linear Algebra Appl. (2015), DOI: 10.1002/nla. Copyright © 2015 John Wiley & Sons, Ltd.

Figure 1. Feature matrix F and induced weighted bipartite graph. Red dots correspond to row variables of F, and blue dots to column variables. The rows and columns may, e.g., represent LinkedIn users and skills, with the weights indicating how often a user’s skill was endorsed by the user’s connections.

[...] research groups, etc. These hierarchical structures are overlapping; for example, some professors may be active in multiple departments or faculties, and many skills (often even the more specialized ones) are taught in multiple degree programs. In a similar way, it is to be expected that many of the currently emerging online social networks also contain inherent overlapping hierarchical organization, in particular when they focus on a specific dimension of the human condition, like, e.g., the professional dimension. Consider for example the LinkedIn social network, where users connect to their business relations and acquaintances, and list user-defined ‘skills and expertise’ on their user profiles that can be endorsed by their connections. Similar to the case of universities, it is clear that in a social network like LinkedIn there must be hierarchical overlapping groups of users with similar skills and professions, and hierarchical overlapping groups of skill keywords that characterize professional groups.

Figure 2. Bipartite graph hierarchy obtained by the FMCC algorithms. The input feature matrix is located at the bottom of the diagram.

However, in contrast to the example of universities, in emerging social networks this hierarchy is not ‘hard-coded’ into the structure of the network; if it were, it would seriously impede the growth and dynamical evolution of these networks. Since the hierarchy is not explicitly hard-coded into the structure of the network but is nevertheless present, it is at once a very interesting and a challenging problem to try to automatically generate a representation of this hierarchy from the [...]

### Overlapping (Co-)Clustering

- Non-Negative MF [C2]
  - Factors positive-valued
  - Interpreted as probabilities
- Coarsening [C5]
  - Multilevel
  - LinkedIn data
- Latent Dirichlet Allocation [C3]
  - Model a document as a weighted group of topics
  - Each topic has individual vocabulary
  - Document is a random combination of terms from its topics
  - More general than term-document data

$L \approx F G$
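A minimal sketch of non-negative matrix factorization $L \approx F G$ with $F, G \ge 0$, using the standard multiplicative updates for the Frobenius-norm objective. The update rule, iteration count, random initialization, and the toy matrix are assumptions of this sketch and not necessarily what [C2] or the talk used.

```python
import numpy as np

def nmf(L, r, iters=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix L (m x n) as F (m x r) times G (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = L.shape
    F = rng.random((m, r)) + eps
    G = rng.random((r, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry of F and G non-negative.
        G *= (F.T @ L) / (F.T @ F @ G + eps)
        F *= (L @ G.T) / (F @ G @ G.T + eps)
    return F, G

# Hypothetical feature matrix: two blocks of rows/columns that overlap at index 2.
L = np.array([[1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 1, 2, 1, 1],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1]], dtype=float)

F, G = nmf(L, r=2)

# Normalizing each row of F gives soft (overlapping) cluster memberships
# that can be read as probabilities.
memberships = F / F.sum(axis=1, keepdims=True)
```

Because a row may carry weight in more than one column of `F`, a vertex can belong to several clusters at once, which is the overlapping behaviour the slide refers to.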


### Approximations to Expensive Calcs

- Triangle Counting [O3]
  - Diagonal of $A^3$ sums to 6 × (# triangles)
  - Estimate diagonal entries of $A^3$
  - $\mathrm{Trace}(A^3) = \sum \left[\mathrm{eigenvalues}(A)^3\right]$
- Nearly-planar coloring [O2]
- Mincut [O1]
- Maxcut [O4]
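The triangle-counting identities can be checked on a small undirected, unweighted graph. The exact and spectral counts follow directly from the trace identity; the randomized diagonal estimator with ±1 probe vectors is a technique swapped in here as one plausible way to "estimate diagonal entries", not necessarily the method of [O3]. Function names and sample counts are illustrative.

```python
import numpy as np

def triangles_exact(A):
    """Number of triangles = trace(A^3) / 6 for a simple undirected graph."""
    return int(round(np.trace(A @ A @ A) / 6))

def triangles_spectral(A, k=None):
    """Use trace(A^3) = sum of eigenvalues cubed; keeping only the k
    largest-magnitude eigenvalues gives an approximation."""
    lam = np.linalg.eigvalsh(A)
    if k is not None:
        lam = lam[np.argsort(-np.abs(lam))[:k]]
    return np.sum(lam ** 3) / 6

def diag_A3_estimate(A, samples=2000, seed=0):
    """Randomized estimate of diag(A^3) using only matrix-vector products.
    Entry i is roughly 2 x (number of triangles through vertex i)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    acc = np.zeros(n)
    for _ in range(samples):
        z = rng.choice([-1.0, 1.0], size=n)
        acc += z * (A @ (A @ (A @ z)))   # never forms A^3 explicitly
    return acc / samples

# Complete graph on 4 vertices: 4 triangles, 3 through each vertex.
K4 = np.ones((4, 4)) - np.eye(4)
exact = triangles_exact(K4)
spectral = triangles_spectral(K4)
estimate = diag_A3_estimate(K4).sum() / 6
```

The estimator needs only three sparse matrix-vector products per sample, which is the point of approximating rather than forming $A^3$ on a large graph.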


- Bipartite Graphs
- Directed Graphs
- Highly-cyclic structure

Figure 12: Matrix sparsity structure for example 3 under the reordering given by the Fiedler vector (left) and associated ordering of row and column variables (right). The purple line in the figure on the right shows the one-dimensional search space for row and column splittings considered by the out-of-box undirected method.

Figure 13: Original graph from example 4 (left), randomly reordered (middle), and bipartized (right).


### Important Extensions

- Dynamic Graphs [T2]
  - Streaming
  - Gather stats
- Tensors
  - Time is a tensor dim.
  - Causality?
- Labeled Graphs
  - Factor in labels
  - Label anomalies?

[Figure labels: **Author**, **Conference**. Co-cluster!]

where M [ADD THE NECESSARY CONSTRAINTS TO M]. Then, the eigenpairs of B(o,x) and (o,x) can be used to define a mapping into $\mathbb{R}^2$ such that the nodes are mapped to [DETERMINE PARAMETERS OF REGION] around vectors of length 1 at angles of $\pi/3$, $\pi$ or $5\pi/3$.

[Figure 2 (caption empty): matrix sparsity pattern, nz = 26130; spectrum of $D^{-1}A$, axes Re and Im.]

[Figure 3 (caption empty): two panels titled Spectral Coordinates, axes Re $(v_i + w_i)$ and Im $(v_i - w_i)$.]

### 4 General *c*-cyclic structure

For $p = 0, \dots, c-1$, $\theta_{p,c} \in \sigma(C_c)$. Entries of the eigenvector $Cv = \theta_{p,c} v$ are $v_i = \theta_{p,c}^{i}$.

$B_c = B \otimes C_c$

### 5 Numerical Approximation

$x_{k+1} = f(B)\,x_k$
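The eigenstructure of a cyclic matrix can be verified numerically. The sketch assumes $C_c$ is the $c \times c$ cyclic shift with $(Cv)_i = v_{(i+1) \bmod c}$ and $\theta_{p,c} = e^{2\pi \mathrm{i} p / c}$; the excerpt does not state either convention, so both are assumptions, as are the function names and the small matrix `B`.

```python
import numpy as np

def cyclic_shift(c):
    """c-by-c cyclic shift matrix: (C v)_i = v_{(i+1) mod c} (assumed convention)."""
    return np.roll(np.eye(c), 1, axis=1)

def cyclic_eigpair(c, p):
    """Eigenvalue theta_{p,c} (a c-th root of unity) and eigenvector v_i = theta^i."""
    theta = np.exp(2j * np.pi * p / c)
    v = theta ** np.arange(c)
    return theta, v

c = 3
C = cyclic_shift(c)

# Each p = 0, ..., c-1 gives one eigenpair of C.
residuals = []
for p in range(c):
    theta, v = cyclic_eigpair(c, p)
    residuals.append(np.linalg.norm(C @ v - theta * v))

# Kronecker structure: if B u = mu u, then (B kron C)(u kron v) = mu * theta * (u kron v).
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # hypothetical small symmetric matrix
mu, U = np.linalg.eigh(B)
u = U[:, 0]
theta, v = cyclic_eigpair(c, 1)
Bc = np.kron(B, C)
kron_residual = np.linalg.norm(Bc @ np.kron(u, v) - mu[0] * theta * np.kron(u, v))
```

For $c = 3$ the eigenvalues are the three cube roots of unity, which sit at equally spaced angles on the unit circle.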
