Complex Networks Analysis: Clustering Methods

Academic year: 2021

Complex Networks Analysis:

Clustering Methods

Spring 2013 ISI ETH Zurich

Nikolai Nefedov

(2)

Outline

Purpose

to give an overview of modern graph-clustering methods and their applications for analysis of complex dynamic networks.

Planned topics

short introduction to complex networks

discrete vector calculus, graph Laplacian, graph spectral analysis

methods of community detection based on modularity maximization

random walk on graphs, Laplacian dynamics, stability of community detection

multi-layer graphs: clustering and regularization

topology detection via system dynamics

dynamic network analysis and missing links prediction

applications for real-world datasets

(multi-dimensional time series and network analysis)

(3)

Complex vs Complicated

Complex systems (no unique definition):

• a (large) number of interacting elements

• stochastic interactions

• no centralized authority, self-organized

• Emergent properties

system behavior arises from interaction structure:

• detailed understanding of elements in isolation is not enough

• even if elements follow simple rules (chaotic behavior)

• evolving structures, system adaptation

• hierarchies, heavy-tails,...

Complex Systems => Statistical physics

• large scale regularities

• microscopic origins of macroscopic behavior

• multiple (hierarchical) scales

Complex Systems

(4)

Complex Systems => Complex Networks

Stat. Physics approach

• a fixed level of abstraction

• vertices => interacting elements

• edges => interactions

• (statistical) analysis of network structure

• dynamical processes taking place on a network

• dynamics of a network

Graph theory approach (mostly static graphs)

• simple graphs => cuts, structure, factorization, spanning trees, ...

• multigraphs => multiple edges and self-loops

• hypergraphs => hyper-edge as a set of vertices

• multi-layer graphs => a set of graphs on the same vertices => tensors

• multiplexing graphs

(5)

Graph Theory

Origin: Leonhard Euler (1736)

L. Euler, Solutio problematis ad geometriam situs pertinentis, Comment. Academiae Sci. J. Petropolitanae 8, 128-140 (1736)

Euler theorem: when a graph can be drawn with a single line

Königsberg
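Euler's criterion can be checked in a few lines. The sketch below assumes a connected multigraph given as an edge list; the land-mass labels A to D and the function names are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def odd_degree_vertices(edges):
    """Return the vertices of odd degree in a multigraph given as an edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, k in degree.items() if k % 2 == 1)

def can_draw_with_single_line(edges):
    """Euler's criterion, assuming the multigraph is connected:
    a single-line drawing exists iff 0 or 2 vertices have odd degree."""
    return len(odd_degree_vertices(edges)) in (0, 2)

# Königsberg: four land masses (hypothetical labels A-D) joined by seven bridges.
konigsberg = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]

print(odd_degree_vertices(konigsberg))        # all four land masses have odd degree
print(can_draw_with_single_line(konigsberg))  # so no single-line walk over all bridges
```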

(6)
(7)

Stat. Physics approach

• network analysis

• statistical analysis (random networks, small-world, scale-free networks)

• network structure analysis

• clustering

• network partition

• classification (taxonomy => hierarchical classification)

• clustering => unsupervised classification (problem dependent), relates data to knowledge (basic human activity)

• dynamical processes taking place on a network

• random walk, opinion (voting) dynamics, synchronization, game strategies...

• convergence, stability...

• distributed computations/control

• dynamics of a network

• evolving networks

• interplay between network topology and dynamics on a network

• adaptive/learning networks

Complex Networks

(9)

Outline

Planned topics

short introduction to complex networks

complex networks, definitions, basics

Graph partition

min-cut, normalized-cut, min-ratio-cut

Brief overview of vector calculus:

differential operators (gradient, divergence, Laplace operator)

Graph Laplacian as a discrete version of Laplace-Beltrami operator

Spectral analysis based on graph Laplacian

Limits of spectral analysis

(10)

Basics: Network Structure

Network or graph $G = (V, E)$ => set of vertices joined by edges

$V = \{v_i\}$: set of vertices, $i = 1, \dots, N$

$E = \{e(i,j)\}$: set of links/edges => (ordered) pairs of elements from $V$, $\max |E| = N(N-1)/2$

$v_i$ is a neighbor of $v_j$ if there is $e(i,j)$ in $E$

The number of neighbors $k$ of a vertex $v_i$ is called its degree; in directed networks: in- and out-degrees $k_{in}$, $k_{out}$

Edge density of the graph: $\rho = |E| / (N(N-1)/2)$

$\rho = 1$ => fully connected, $\rho \ll 1$ => sparse graph

Cycle/loop = closed path (distinct vertices/edges)

Graph types: regular, tree, forest, ...

Bipartite network: 2 types of nodes, links only between nodes of different types.
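A minimal sketch of the degree and edge-density definitions, assuming an undirected simple graph stored as an edge list over vertices 0..N-1 (the function names and the toy graph are illustrative):

```python
def degrees(n, edges):
    """Degree k_i of every vertex of an undirected graph with vertices 0..n-1."""
    k = [0] * n
    for i, j in edges:
        k[i] += 1
        k[j] += 1
    return k

def edge_density(n, edges):
    """rho = |E| / (N (N - 1) / 2) for an undirected simple graph."""
    return len(edges) / (n * (n - 1) / 2)

# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
print(degrees(4, edges))       # [2, 2, 3, 1]
print(edge_density(4, edges))  # 4 of the 6 possible edges are present
```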

(11)

Basics: Network Structure

Shortest path between i and j => a path with the minimum number of edges

Distance $d(i,j)$ => measure associated with the shortest path between i and j

Average shortest distance: $\langle l \rangle = \frac{2}{N(N-1)} \sum_{i<j} d(i,j)$

Diameter of the graph: $d_{\max} = \max_{i,j} d(i,j)$

Connected graph: there is a path between any pair of nodes

Min connected graph => no loops => tree, $|E| = N - 1$ edges

Forest => collection of trees

Fully connected (complete) graph: $d(i,j) = 1$ for all $i \ne j$, $|E| = N(N-1)/2$

Adjacency matrix: $A(i,j) = 1$ if $e(i,j)$ in $E$, 0 otherwise

Clique: a fully connected subgraph; k-clique: clique with k vertices

Motifs: subgraphs which often occur in a network (with respect to a null model)
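The average shortest distance and the diameter can be computed with breadth-first search from every vertex. This is a sketch that assumes a connected, unweighted, undirected graph stored as an adjacency dictionary; the helper names are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances d(source, v) to every vertex reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adj):
    """Return (<l>, diameter) for a connected undirected graph."""
    n = len(adj)
    total, diameter = 0, 0
    for s in adj:
        dist = bfs_distances(adj, s)
        total += sum(dist.values())  # over all sources, each pair is counted twice
        diameter = max(diameter, max(dist.values()))
    # total = 2 * sum_{i<j} d(i,j), so <l> = total / (N (N - 1))
    return total / (n * (n - 1)), diameter

# Path graph 0 - 1 - 2 - 3
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
avg_l, diam = path_statistics(path4)
```

For the path graph the six pairwise distances sum to 10, so the average shortest distance is 10/6 and the diameter is 3.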

(12)

Basics: Network Structure

Centrality measures:

Node degree = number of neighbors

Closeness centrality: measures how far (on average) a vertex is from all other vertices

$c_i = 1 / \sum_{j \ne i} d(i,j)$

Betweenness centrality = number of shortest paths going through a vertex/edge; measures the amount of flow through a vertex/edge; computationally demanding

$b_i = \sum_{l,m} d_i(l,m) / d(l,m)$

$d(l,m)$: number of shortest paths between l and m; $d_i(l,m)$: number of shortest paths between l and m going through node i

Clustering coefficient of a node (counts the triangles through node i):

$C_i = \frac{1}{k_i (k_i - 1)} \sum_{j \ne k} e_{ij} e_{jk} e_{ki} = \frac{2 E_i}{k_i (k_i - 1)}$

where $E_i$ is the number of edges among the $k_i$ neighbors of node i

(13)

Network: Statistical characterization

Degree distribution $p(k)$ => probability that a randomly chosen vertex has degree k

$P(k|k')$ => conditional probability that a vertex of degree k is connected to a vertex of degree k'

Average degree: $\langle k \rangle = 2|E| / N$

Sparse graphs: $\langle k \rangle \ll N$

Average degree of nearest neighbors of node i: $k_{nn,i} = \frac{1}{k_i} \sum_{j \in N(i)} k_j$, where $N(i)$ is the set of neighbors of i

Average degree fluctuations: $\langle k^2 \rangle$

Clustering spectrum (of vertices which have the same degree)

Topological heterogeneity:

homogeneous networks: light tails

heterogeneous networks: skewed, heavy tails

(14)

Stochastic Networks

Stochastic network -> not a single graph, but a statistical ensemble

Erdős–Rényi (random) networks: $G(N, p)$

- connect N vertices randomly, each pair is connected with probability p

- ensemble of possible realizations: network properties => averages over the ensemble

- average number of edges: $\langle E \rangle = pN(N-1)/2$

- average degree: $\langle k \rangle = 2\langle E \rangle / N = p(N-1) \approx pN$

Clustering coefficient

$C(G) = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}$

E-R networks: $C_{ER} = p = \langle k \rangle / N$

practically there is no clustering

large random networks are tree-like networks
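One realization of $G(N, p)$ can be drawn by testing every vertex pair independently. The sketch below uses illustrative values N = 200, p = 0.05 and a fixed seed (all assumptions) and compares a single sample with the ensemble averages.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """One realization of G(N, p): each of the N(N-1)/2 pairs is linked with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.05
edges = erdos_renyi(n, p, seed=0)

expected_edges = p * n * (n - 1) / 2   # ensemble average <E>
expected_degree = p * (n - 1)          # ensemble average <k>
sample_degree = 2 * len(edges) / n     # average degree of this one realization

print(len(edges), expected_edges)
print(sample_degree, expected_degree)
```

A single realization fluctuates around the ensemble averages; averaging over many seeds would bring the sample values closer to $\langle E \rangle$ and $\langle k \rangle$.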

(15)

Erdős–Rényi Networks

Example N = 3, p = 1/3
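For N = 3 the whole ensemble is small enough to enumerate. The sketch below lists all $2^3 = 8$ labeled graphs with their exact probabilities for p = 1/3; the function name is an illustrative assumption.

```python
from fractions import Fraction
from itertools import combinations, product

def ensemble(n, p):
    """All labeled graphs on n vertices with their G(N, p) probabilities."""
    pairs = list(combinations(range(n), 2))
    out = []
    for present in product([0, 1], repeat=len(pairs)):
        m = sum(present)  # number of edges in this realization
        prob = p**m * (1 - p)**(len(pairs) - m)
        edges = [e for e, bit in zip(pairs, present) if bit]
        out.append((edges, prob))
    return out

graphs = ensemble(3, Fraction(1, 3))

# Probability of observing m edges, m = 0..3
by_edge_count = {}
for edges, prob in graphs:
    by_edge_count[len(edges)] = by_edge_count.get(len(edges), 0) + prob

mean_edges = sum(len(edges) * prob for edges, prob in graphs)
```

The edge-count probabilities come out as 8/27, 12/27, 6/27 and 1/27, and the mean number of edges equals $pN(N-1)/2 = 1$.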

(16)

Erdős–Rényi Networks

For E-R networks

Probability that vertex i has degree k:

• connected to k vertices,

• not connected to the other N – k – 1

$p_i(k) = \binom{N-1}{k} p^k (1-p)^{N-1-k}$

Degree distribution for the whole network:

$P(k) = \sum_{i=1}^{N} p_i(k) / N$

For $N \to \infty$ s.t. $\langle k \rangle = \text{const}$:

$P(k) = \frac{(pN)^k}{k!} e^{-pN} = \frac{\langle k \rangle^k}{k!} e^{-\langle k \rangle}$ => Poisson distribution

average degree: $\langle k \rangle = 2\langle E \rangle / N = p(N-1) \approx pN$
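The Poisson limit can be checked numerically by comparing the exact binomial degree probabilities with the Poisson form. The values N = 1000 and p = 0.005 below are illustrative assumptions chosen so that the average degree is about 5.

```python
import math

def binomial_degree_pmf(n, p, k):
    """Exact p_i(k) = C(N-1, k) p^k (1 - p)^(N-1-k) for G(N, p)."""
    return math.comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

def poisson_pmf(mean, k):
    """Poisson approximation P(k) = <k>^k exp(-<k>) / k!."""
    return mean**k * math.exp(-mean) / math.factorial(k)

n, p = 1000, 0.005
mean = p * (n - 1)

# Largest pointwise difference over the degrees that carry almost all probability
max_gap = max(abs(binomial_degree_pmf(n, p, k) - poisson_pmf(mean, k))
              for k in range(31))
print(max_gap)
```

The gap shrinks further when N grows with $\langle k \rangle$ held fixed, which is the limit stated on the slide.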

(17)

Erdős–Rényi Networks

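Connected component sizes ($N \to \infty$ s.t. $\langle k \rangle = \text{const}$)

$\langle k \rangle < 1$: many small subgraphs

$\langle k \rangle = 1$: phase transition (percolation)

$\langle k \rangle \gg 1$: giant component + small subgraphs

[Figure: relative giant component size and mean component size vs. $\langle k \rangle$; regions: small subgraphs, giant component]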
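The transition can be observed in a small simulation. This is a sketch under stated assumptions: N = 1000, fixed seeds, $p = \langle k \rangle / (N-1)$, and a union-find structure (a swapped-in bookkeeping technique, not from the slides) to track components.

```python
import random
from itertools import combinations

def largest_component_fraction(n, mean_degree, seed=None):
    """Relative size of the largest connected component in one G(N, p) sample."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        # Find the component representative, halving paths along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            parent[find(i)] = find(j)  # merge the two components

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

below = largest_component_fraction(1000, 0.5, seed=1)  # <k> < 1
above = largest_component_fraction(1000, 3.0, seed=1)  # <k> > 1
print(below, above)
```

Below the transition the largest component holds only a small fraction of the vertices; above it, most vertices belong to one giant component.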

(18)

Erdős–Rényi Networks

• Degree distribution: Poisson (degrees of all nodes close to average)

• No correlations, all edges exist independently of each other

• Path lengths grow logarithmically with system size, <l> ~ ln(N)

• Connectivity depends on average degree <k>: small <k> => several disjoint components, high <k> => giant connected component; there is a percolation phase transition (from a fragmented to a connected network)

• Very "homogeneous" networks

(19)

Real-World Networks

Network type                  Shortest path   Clustering
Random networks               Short           Low
Real networks                 Short           High
Regular-topology networks *   Long            High

* [Watts & Strogatz 1998]

(20)

Random vs Real-World Networks

Heavy tail distributions

Degree distributions

$P(k) = \frac{\langle k \rangle^k}{k!} e^{-\langle k \rangle}$: Poisson distribution

(21)

Network Models: Small-World

D.J. Watts and S. Strogatz, "Collective dynamics of 'small-world' networks", Nature 393, 440–442, 1998

WS model:

• Take a regular clustered network

• Rewire the endpoint of each link to a random node with probability p

• SWN => a simple model for interpolating between regular and random networks

• Randomness controlled by a single tuning parameter

N >> k >> ln(N) >> 1

[Figure: degree distribution and clustering coefficient of the WS model, k > 2; the clustering coefficient is independent of system size] [Barrat & Weigt, 2000]
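The rewiring procedure can be sketched as follows. The code assumes a ring lattice as the regular clustered network, forbids self-loops and duplicate links when rewiring, and uses illustrative values N = 200, k = 4; these choices are assumptions, not taken from the slide.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Ring lattice with k neighbors per node (k even); each link rewired with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if rng.random() < p and j in adj[i]:
                # New endpoint: any node that is neither i nor already linked to i
                candidates = [w for w in range(n) if w != i and w not in adj[i]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(w)
                    adj[w].add(i)
    return adj

def average_clustering(adj):
    """Mean of C_i = 2 E_i / (k_i (k_i - 1)); vertices with k_i < 2 contribute 0."""
    total = 0.0
    for i, nbrs in adj.items():
        ki = len(nbrs)
        if ki < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (ki * (ki - 1))
    return total / len(adj)

regular = watts_strogatz(200, 4, 0.0, seed=1)    # p = 0: regular lattice
rewired = watts_strogatz(200, 4, 1.0, seed=1)    # p = 1: every link rewired
print(average_clustering(regular), average_clustering(rewired))
```

For k = 4 the unrewired lattice has clustering 0.5 regardless of N, while full rewiring drives the clustering toward the low values of a random network.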

(22)

Network Models: Small-World Networks

"Small-World Network": short paths, high clustering

[Figure: path length and clustering vs. rewiring probability p, between the regular network and the random network; N = 1000, k = 10, average over 20 realizations at each p]

(23)

Network Models: Small-World Networks

Dynamics of sync, virus spreading: a small number of shortcuts greatly speeds up the process: 3% shortcuts => 50% epidemic

Network structure strongly affects processes taking place on networks

[Figure: epidemic size (number of infected) vs. density of shortcuts] [Watts & Strogatz]

(24)

Network Models: Scale-Free Networks

A.-L. Barabási & R. Albert,

Emergence of Scaling in Random Networks, Science 286, 509 (1999)

[Figure: degree distributions on logarithmic axes: power-law distribution]

(25)

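Power Law Distributions

Power-law: $P(k) = A k^{-\gamma}$

Scale invariance: $P(\alpha k) = A (\alpha k)^{-\gamma} = \alpha^{-\gamma} P(k)$ <=> shift on log scale

Normalized for $k > k_{min}$: $P(k) = (\gamma - 1) k_{min}^{\gamma - 1} k^{-\gamma}$

Moments: $\langle k^n \rangle = \int_{k_{min}}^{\infty} k^n P(k) \, dk = k_{min}^n \frac{\gamma - 1}{\gamma - 1 - n}$ for $\gamma > n + 1$

$1 < \gamma < 2 \Rightarrow \langle k \rangle \to \infty$; only moments $\langle k^n \rangle$ with $n < \gamma - 1$ are finite

$2 < \gamma < 3 \Rightarrow \langle k^2 \rangle \to \infty$

Level of heterogeneity: power-law tails, diverging degree fluctuations for $\gamma < 3$

$k_c$ = cut-off due to finite size

for most real-world networks $2 < \gamma < 3$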
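The moment formula can be checked numerically. The sketch below assumes the continuous power law normalized on $k > k_{min}$ and evaluates the moment integral by a midpoint rule after the substitution $u = (k/k_{min})^{-(\gamma-1)}$, which maps the integral onto (0, 1); the function names and the cell count are illustrative assumptions.

```python
def moment_closed_form(n, gamma, k_min=1.0):
    """<k^n> = k_min^n (gamma - 1) / (gamma - 1 - n); infinite when gamma <= n + 1."""
    if gamma <= n + 1:
        return float("inf")
    return k_min**n * (gamma - 1) / (gamma - 1 - n)

def moment_numeric(n, gamma, k_min=1.0, cells=200_000):
    """Midpoint-rule estimate of <k^n> = integral over u in (0, 1) of k(u)^n,
    with k(u) = k_min * u^(-1 / (gamma - 1))."""
    h = 1.0 / cells
    total = 0.0
    for c in range(cells):
        u = (c + 0.5) * h
        total += (k_min * u ** (-1.0 / (gamma - 1))) ** n
    return total * h

print(moment_closed_form(1, 3.5), moment_numeric(1, 3.5))  # finite mean for gamma = 3.5
print(moment_closed_form(2, 2.5))                          # diverging <k^2> for gamma < 3
```

For $\gamma = 3.5$ the mean is $2.5/1.5 \approx 1.67\, k_{min}$; for $2 < \gamma < 3$ the mean stays finite while the second moment diverges.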

(26)

Power Law Distributions

power-law Pk =〈k 〉kexp−〈k〉/k!

logarithmic axes

Networks with Power Law Distributions => Scale-Free Networks

no characteristic scale (node degree) in the distribution

(27)

Barabási-Albert Model: Scale-Free Networks

Where do networks come from?

Networks are not static => growing networks

B-A model of network growth

• based on the principle of preferential attachment - "the rich get richer"

• results in networks with a power-law degree distribution (average degree <k> = 2m)

1. Take a small seed network, e.g. a few connected nodes

2. Let a new node of degree m enter the network

3. Connect the new node to existing nodes such that the probability of connecting to node i of degree $k_i$ is $\pi_i = k_i / \sum_j k_j$

Degree distribution: $P(k) = 2m^2 / k^3$

Average shortest path lengths, clustering coefficient
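The three growth steps can be sketched as below. The seed network (a complete graph on m + 1 nodes) and the sampling trick (a list in which each node appears once per unit of degree) are assumptions made for illustration; the slides only ask for "a few connected nodes".

```python
import random
from collections import Counter

def barabasi_albert(n, m, seed=None):
    """Grow a network to n nodes; each new node attaches to m distinct existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Step 1: seed network, here a complete graph on m + 1 nodes (assumption)
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Node i appears k_i times in stubs, so a uniform draw picks i with pi_i = k_i / sum_j k_j
    stubs = [v for e in edges for v in e]
    for new in range(m + 1, n):
        # Steps 2-3: the new node picks m distinct targets by preferential attachment
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

edges = barabasi_albert(2000, 2, seed=0)
degree = Counter(v for e in edges for v in e)
mean_degree = 2 * len(edges) / 2000
print(mean_degree, max(degree.values()))
```

The average degree comes out close to 2m, while the largest degree is many times the average: a few early nodes become hubs.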

(28)

Network Models

[Figure: example networks: random (p = 0.02), small world (p = 0.1), scale free (<k> = 2)]

(29)

Network Models: Summary

Erdős–Rényi model

• short path lengths

• Poisson distribution (no hubs)

• no clustering

Barabási-Albert scale-free model

• short path lengths

• power-law distribution for degrees

• robustness

• no clustering (may be fixed)

Real-world networks

• short path lengths

• high clustering

• broad degree distributions, often power laws

Watts-Strogatz small-world model

• short path lengths

• high clustering (N independent)

• almost constant degrees

(30)

Similarity Graphs

Graphs embedded in space

- Euclidean distance (L2 norm)

- Manhattan distance (L1 norm)

- Cosine similarity

Graphs built from data:

- Data points from Euclidean space, sampling of some underlying distribution, ...

- Connectivity parameter: k (KNN), ε-neighborhood graph, ...

- Similarity measure => fully connected (weighted) matrix

Graphs not embedded in space

Neighborhood measures

- structural equivalence: share the same neighbors => Jaccard coefficient

- regular equivalence: if neighbors of a node are similar

- Pearson correlation coefficient

Path dependent measures

Measures based on random walk:

- commute-time: average number of steps for a random walker to hit a target and return

- escape probability: probability to hit a target before coming back
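Two of these constructions in a short sketch: a k-nearest-neighbor graph built from points under Euclidean distance, and the Jaccard coefficient of two neighbor sets. Symmetrizing the KNN graph (linking i and j if either lists the other) and the sample points are assumptions made for illustration.

```python
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph under Euclidean (L2) distance."""
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)  # assumed rule: link if either point lists the other
    return adj

def jaccard(adj, i, j):
    """Jaccard coefficient of the neighbor sets of i and j (structural equivalence)."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

# Two well-separated groups of points in the plane
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
graph = knn_graph(points, k=1)
print(graph)
print(jaccard(graph, 1, 2), jaccard(graph, 0, 3))
```

Points 1 and 2 share exactly the same neighbors, so their Jaccard coefficient is 1, while vertices from different groups share none.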
