Kernel methods for complex networks and big data

(1)

Kernel methods for complex networks and

big data

Johan Suykens

KU Leuven, ESAT-STADIUS Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium Email: [email protected]

http://www.esat.kuleuven.be/stadius/

(2)

(3)

Data world

10 30 24 16 19 23 21 15 9 13 12 20 17 11 22 4 7 5 6 28 25 32 29 26 14 8 3 2 1 18 31 33 34 27 C 3 C 2 C 1 C 4 (a)

(4)

Challenges

• data-driven

• general methodology

• scalability

(5)

Sparsity

(6)

Different paradigms

SVM & Kernel methods Convex Optimization Sparsity & Compressed sensing

(7)

Different paradigms

SVM & Kernel methods Convex Optimization Sparsity & Compressed sensing

?

(8)

(9)

Sparsity: through regularization or loss function

• through regularization: model yˆ = wTx + b

min X j |w_j_| + γ X i e2_i ⇒ sparse w

• through loss function: model yˆ = P

iαiK(x, xi) + b

min wTw + γ X

i

L(e_i)

(10)

Function estimation in RKHS

• Find function f such that

min f∈H 1 N N X i=1 L(yi, f(xi)) + λkfk2_K

with L(_·,_·) the loss function. _kf_k_K is norm in RKHS _H defined by K.

• Representer theorem: for convex loss function, solution of the form

f(x) =

N

X

i=1

αiK(x, xi)

Reproducing property f(x) = _hf, K_x_i_K with K_x(_·) = K(x, _·)

(11)

(12)

Learning models from data: alternative views

- Consider model yˆ = f(x;w), given input/output data _{(x_i, y_i)_}N_i₌₁:

min w w T_w ₊ _γ N X i=1 (yi − f(xi;w))2

- Rewrite the problem as

minw,e wTw + γ PN_i₌₁ (yi − f(xi;w))2

subject to ei = yi − f(xi;w), i = 1, ..., N

- Construct Lagrangian and take condition for optimality - For a model f(x;w) = Ph

j=1 wjϕj(x) = wTϕ(x) one obtains then

ˆ

f(x) = PN

(13)

Learning models from data: alternative views

- Consider model yˆ = f(x;w), given input/output data _{(xi, yi)}N_i₌₁:

min w w T_w ₊ _γ N X i=1 (y_i ₋ f(x_i;w))2

- Rewrite the problem as

min

w,e w

T_w ₊ _γ PN

i=1 e2i

subject to e_i = y_i ₋ f(x_i;w), i = 1, ..., N

- Express the solution and the model in terms of Lagrange multipliers α_i

- For a model f(x;w) = Ph

j=1 wjϕj(x) = wTϕ(x) one obtains then

(14)

Least Squares Support Vector Machines: “core models”

• Regression min w,b,e w T_w ₊ _γ X i e2_i s.t. y_i = wTϕ(x_i) + b + e_i, _∀i • Classification min w,b,e w T_w ₊ _γ X i e2_i s.t. yi(wTϕ(xi) + b) = 1 − ei, ∀i

• Kernel pca (V = I), Kernel spectral clustering (V = D−1)

min

w,b,e −w

T_w ₊ _γ X

i

vie2_i s.t. ei = wTϕ(xi) + b, ∀i

• Kernel canonical correlation analysis/partial least squares

min w,v,b,d,e,r w T_w ₊ _vT_v ₊ _ν X i (e_i ₋ r_i)2 s.t. e_i = wTϕ(1)(x_i) + b ri = vTϕ(2)(yi) + d [Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]

(15)

Probability and quantum mechanics

• Kernel pmf estimation – Primal: min w,p_i 1 2hw, wi subject to pi = hw, ϕ(xi)i, i = 1, ..., N and PN i=1 pi = 1 – Dual: pi = PN j=1K(xj,xi) PN i=1PNj=1K(xj,xi)

• Quantum measurement: state vector _|ψ_i, measurement operators M_i

– Primal:

min

|wi,p_i

1

2 hw|wi subject to pi = Re(hw|Miψi), i = 1, ..., N and

PN

i=1 pi = 1

(16)

SVMs: living in two worlds ...

x o o o o x xx x x x x x x o o o o _{Input space} Feature space ϕ(x) Parametric Nonparametric Primal space Dual space ˆ y = sign[wTϕ(x) + b] ˆ y = sign[P#sv i=1 αiyiK(x, xi) + b] K(xi, xj) = ϕ(xi)Tϕ(xj) (Mercer) ˆ y ˆ y w₁ w_nh α₁ α_#sv ϕ₁(x) ϕ_nh(x) K(x, x₁) x x

(17)

SVMs: living in two worlds ...

x o o o o x xx x x x x x x o o o o _{Input space} Feature space ϕ(x) Parametric Nonparametric Primal space Dual space ˆ y = sign[wTϕ(x) + b] ˆ y = sign[P#sv i=1 αiyiK(x, xi) + b] K(xi, xj) = ϕ(xi)Tϕ(xj) (“Kernel trick”) ˆ y ˆ y w₁ w_nh α₁ α ϕ₁(x) ϕ_nh(x) K(x, x₁) x x Parametric Non−parametric

(18)

Linear model: solving in primal or dual?

inputs x _∈ _Rd, output y _∈ _R training set _{(x_i, y_i)_}N_i₌₁ (P) : yˆ = wTx + b, w _∈ _Rd ր Model ց (D) : yˆ = P i αi xTi x + b, α ∈ RN

(19)

Linear model: solving in primal or dual?

inputs x _∈ _Rd, output y _∈ _R training set _{(x_i, y_i)_}N_i₌₁ (P) : yˆ = wTx + b, w _∈ _Rd ր Model ց (D) : yˆ = P i αi xTi x + b, α ∈ RN

(20)

few inputs, many data points: d _≪ N

primal : w _∈ _Rd

(21)

many inputs, few data points: d _≫ N

primal: w _∈ _Rd

(22)

Feature map and kernel

From linear to nonlinear model:

(P) : yˆ = wTϕ(x) + b ր Model ց (D) : yˆ = P i αi K(xi, x) + b Mercer theorem: K(x, z) = ϕ(x)Tϕ(z) Feature map ϕ(x) = [ϕ₁(x);ϕ₂(x);...;ϕ_h(x)]

Kernel function K(x, z) (e.g. linear, polynomial, RBF, ...)

• Use of feature map and positive definite kernel [Cortes & Vapnik, 1995]

• Optimization in infinite dimensional spaces for LS-SVM

(23)

(24)

Fixed-size method: steps

1. selection of a subset from the data

2. kernel matrix on the subset

3. eigenvalue decomposition of kernel matrix

4. approximation of the feature map based on the eigenvectors

(Nystr¨om approximation)

5. estimation of the model in the primal using the approximate feature map (applicable to large data sets)

(25)

Selection of subset

• random

(26)

Subset selection using quadratic Renyi entropy: example

−2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

(27)

Fixed-size method: using quadratic Renyi entropy

Algorithm:

1. Given N training data

2. Choose a working set with fixed size M (i.e. M support vectors) (typically M ≪ N).

3. Randomly select a SV x∗ from the working set of M support vectors.

4. Randomly select a point xt∗ from the training data and replace x∗ by xt∗.

If the entropy increases by taking the point xt∗ instead of x∗ then this point xt∗

is accepted for the working set of M SVs, otherwise the point xt∗ is rejected (and

returned to the training data pool) and the SV x∗ stays in the working set.

5. Calculate the entropy value for the present working set. The quadratic Renyi entropy

equals HR = − log 1

M2

P

ij Ω(M,M)_ij.

(28)

Nystr¨

om method

• “big” kernel matrix: Ω₍_N,N₎ _∈ _RN×N

“small” kernel matrix: Ω₍_M,M₎ _∈ _RM×M (on subset)

• Eigenvalue decompositions: Ω₍_N,N₎U˜ = ˜U Λ˜ and Ω₍_M,M₎U = U Λ

• Relation to eigenvalues and eigenfunctions of the integral equation

Z K(x, x′)φi(x)p(x)dx = λiφi(x′) with ˆ λ_i = 1 Mλi, φˆi(xk) = √ M u_ki, φˆ_i(x′) = √ M λi M X k=1 u_kiK(x_k, x′)

(29)

Fixed-size method: estimation in primal

• For the feature map ϕ(_·) : _Rd _→ _Rh obtain an approximation

˜

ϕ(_·) : _Rd _→ _RM

based on the eigenvalue decomposition of the kernel matrix with

˜

ϕ_i(x′) = pλˆ_iφˆ_i(x′) (on a subset of size M _≪ N).

• Estimate in primal: min ˜ w,˜b 1 2w˜ T_w_˜ ₊ _γ 1 2 N X i=1 (y_i ₋ w˜Tϕ˜(x_i) ₋ ˜b)2

(30)

Fixed-size method: performance in classification

pid spa mgt adu ftc

N 768 4601 19020 45222 581012 Ncv 512 3068 13000 33000 531012 Ntest 256 1533 6020 12222 50000 d 8 57 11 14 54 FS-LSSVM (# SV) 150 200 1000 500 500 C-SVM (# SV) 290 800 7000 11085 185000 ν-SVM (# SV) 331 1525 7252 12205 165205 RBF FS-LSSVM 76.7(3.43) 92.5(0.67) 86.6(0.51) 85.21(0.21) 81.8(0.52) Lin FS-LSSVM 77.6(0.78) 90.9(0.75) 77.8(0.23) 83.9(0.17) 75.61(0.35) RBF C-SVM 75.1(3.31) 92.6(0.76) 85.6(1.46) 84.81(0.20) 81.5(no cv) Lin C-SVM 76.1(1.76) 91.9(0.82) 77.3(0.53) 83.5(0.28) 75.24(no cv) RBF ν-SVM 75.8(3.34) 88.7(0.73) 84.2(1.42) 83.9(0.23) 81.6(no cv) Maj. Rule 64.8(1.46) 60.6(0.58) 65.8(0.28) 83.4(0.1) 51.23(0.20)

• Fixed-size (FS) LSSVM: good performance and sparsity wrt C-SVM and ν-SVM

• Challenging to achieve high performance by very sparse models

(31)

Two stages of sparsity

stage 1

primal FS model estimation

→

reweighted l1

↑

dual subset selection

Nystr¨om approximation

Synergy between parametric & kernel-based models

[Mall & Suykens, IEEE-TNNLS, in press], reweighted l1 [Candes et al., 2008]

(32)

coefficient-Two stages of sparsity

stage 1

→

reweighted ℓ1

↑

[Mall & Suykens, IEEE-TNNLS, in press], reweighted l1 [Candes et al., 2008]

Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficient-based ℓ (0 < q ≤ 1) [Shi et al., 2013]; two-level ℓ [Huang et al., 2014]

(33)

Two stages of sparsity

stage 1 stage 2

→

reweighted ℓ1

↑

[Mall & Suykens, IEEE-TNNLS, in press], reweighted ℓ1 [Candes et al., 2008]

(34)

coefficient-Two stages of sparsity

stage 1 stage 2

→

reweighted ℓ1

↑

[Mall & Suykens, IEEE-TNNLS, in press], reweighted ℓ1 [Candes et al., 2008]

Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001];

(35)

Big data: representative subsets using k-NN graphs (1)

• Convert the large scale dataset into a sparse undirected k-NN graph

using a distributed network generation framework

• Julia language (http://julialang.org/)

• Large N _× N kernel matrix Ω for data set _D with N data points

• Batch cluster-based approach: a batch subset _D_p _{⊂ D} is loaded per node with _∪P_p₌₁_Dp = D; related matrix slice Xp and Ωp.

• MapReduce and AllReduce settings implementable using Hadoop or Spark (see also [Agarwal et al., JMLR 2014] Terascale linear learning)

• Computational complexity: complexity for construction of the kernel

(36)

Big data: representative subsets using k-NN graphs (2)

Map: for Silverman’s Rule of Thumb, compute mean and standard deviation

of the data per node; compute slice Ωp; sort in ascending order the columns

of Ωp (sortperm in Julia); pick indices for top k values.

Reduce: merge k-NN subgraphs into aggregated k-NN graph

(37)

Big data: representative subsets using k-NN graphs (3)

Steps:

1. MapReduce for representative subsets using k-NN graphs

2. Subset selection using FURS [Mall et al., SNAM 2013]

(which aims for good modelling of dense regions; deterministic method) 3. Use in fixed-size LS-SVM ([Mall & Suykens, IEEE-TNNLS, in press])

Dataset N SR SRE FURSk=10 FURSk=100 FURSk=500

Generalization or Test Error ± Standard Deviation

4G 28,500 0.395±0.235 0.898±0.375 0.252±0.023 0.331±0.063 0.414±0.070

5G 81,000 0.264±0.225 0.298±0.142 0.358±0.296 0.082±0.016 0.086±0.009

Magic 19,020 25.32±3.913 28.44±4.446 33.07±2.076 31.05±4.143 36.13±3.062

Shuttle 58,000 2.437±1.104 2.330±0.958 4.223±0.998 1.482±0.600 1.980±0.715

Skin 245,057 0.578±0.387 0.254±0.078 3.277±1.133 0.494±0.078 0.772±0.080

SR: stratified random sampling; SRE: stratified Renyi entropy

(38)

(39)

Spectral graph clustering (1)

Minimal cut: given the graph _G = (V, E), find clusters _A₁,_A₂

min q_i∈{−1,+1} 1 2 X i,j w_ij(q_i ₋ q_j)2

with cluster membership indicator qi (qi = 1 if i ∈ A1, qi = −1 if i ∈ A2)

and W = [wij] the weighted adjacency matrix.

2 3 4 5 6 1 cut of size 2 cut of size 1

(40)

Spectral graph clustering (2)

• Min-cut spectral clustering problem

min ˜

qTq˜=1 q˜

T_L_q_˜

with L = D ₋ W the unnormalized graph Laplacian, degree matrix

D = diag(d₁, ..., d_N), d_i = P

j wij, giving

Lq˜ = λq.˜

Cluster member indicators: qˆ_i = sign(˜q_i ₋ θ) with threshold θ.

• Normalized cut

Lq˜ = λDq˜

[Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997; von Luxburg, 2007]

• Discrete version to continuous problem (Laplace operator)

(41)

Kernel Spectral Clustering (KSC): case of two clusters

• Primal problem: training on given data _{x_i_}N_i₌₁

min w,b,e 1 2w T_w − γ1 2e T_{V e} subject to ei = wTϕ(xi) + b, i = 1, ..., N

with weighting matrix V and ϕ(_·) : _Rd _→ _Rh the feature map.

• Dual:

V M_V Ωα = λα

with λ = 1/γ, M_V = I_N ₋ 1

1TNV1N1N1

T

NV weighted centering matrix,

Ω = [Ω_ij] kernel matrix with Ω_ij = ϕ(x_i)Tϕ(x_j) = K(x_i, x_j)

• Taking V = D−1 with degree matrix D = diag_{d₁, ..., d_N_} and d_i =

PN

(42)

Kernel Spectral Clustering (KSC): case of two clusters

• Primal problem: training on given data _{x_i_}N_i₌₁

min w,b,e 1 2w T_w − γ1 2e T_V _e subject to ei = wTϕ(xi) + b, i = 1, ..., N

with weighting matrix V and ϕ(_·) : _Rd _→ _Rh the feature map.

• Dual:

V M_V Ωα = λα

with λ = 1/γ, M_V = I_N ₋ 1

1TNV1N1N1

T

NV weighted centering matrix,

Ω = [Ω_ij] kernel matrix with Ω_ij = ϕ(x_i)Tϕ(x_j) = K(x_i, x_j)

• Taking V = D−1 with degree matrix D = diag_{d₁, ..., d_N_} and d_i =

PN

j=1 Ωij relates to random walks algorithm.

(43)

Kernel spectral clustering: more clusters

• Case of k clusters: additional sets of constraints

min w(l),e(l),b_l 1 2 k−1 X l=1 w(l)Tw(l) ₋ 1 2 k−1 X l=1 γ_le(l)TD−1e(l) subject to e(1) = Φ_N_×_n hw (1) ₊ _b 11N e(2) = Φ_N_×_n hw (2) ₊ _b 21N ... e(k−1) = Φ_N_×_n hw (k−1) ₊ _b k−11N where e(l) = [e(₁l);...;e(_Nl)] and Φ_N_×_n h = [ϕ(x1) T_;_..._;_ϕ₍_x N)T] ∈ RN×nh. • Dual problem: M_DΩα(l) = λDα(l), l = 1, ..., k ₋ 1.

(44)

Primal and dual model representations

k clusters

k ₋ 1 sets of constraints (index l = 1, ..., k ₋ 1)

(P) : sign[ˆe(∗l)] = sign[w(l) T ϕ(x∗) + bl] ր M ց (D) : sign[ˆe(∗l)] = sign[P_j α_j(l)K(x∗, xj) + bl]

(45)

Advantages of kernel-based setting

• model-based approach

• out-of-sample extensions, applying model to new data

• consider training, validation and test data

(training problem corresponds to eigenvalue decomposition problem)

• model selection procedures

(46)

Out-of-sample extension and coding

−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 )

(47)

Out-of-sample extension and coding

−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 )

(48)

Model selection: toy example

−0.4 −0.2 0 0.2 0.4 0.6 0.8 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 σ2 = 0.5,BLF= 0.56 e(1) i,val e (2 ) i, va l −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x₍₁₎ x (2 ) BAD −0.4 −0.2 0 0.2 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 σ2 = 0.16,BLF = 1.0 e(1) i,val e (2 ) i, va l validation set −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x₍₁₎ x (2 )

train + validation + test data

(49)

Example: image segmentation

−5 −4 −3 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 −2 −1 0 1 2 3 4 e(1) i,val e(2) i,val e (3 ) i, va l

(50)

Kernel spectral clustering: sparse kernel models

original image binary clustering

Incomplete Cholesky decomposition: Ω ≃ GGT with G ∈ _RN×R and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

Time-complexity O(R2N2) in [Alzate & Suykens, 2008]

(51)

Kernel spectral clustering: sparse kernel models

original image sparse kernel model

Incomplete Cholesky decomposition: Ω ≃ GGT with G ∈ RN×R and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

(52)

Incomplete Cholesky decomposition and reduced set

• For KSC problem M_DΩα = λDα, solve the approximation

UTMDUΛ2ζ = λζ

from Ω _≃ GGT, singular value decomposition G = UΛV T and ζ = UTα.

A smaller matrix of size R _× R is obtained instead of N _× N.

• Pivots are used as subset _{x˜_i_} for the data

• Reduced set method [Scholkopf et al., 1999]: approximation of w =

PN i=1 αiϕ(xi) by w˜ = PM j=1 βjϕ(˜xj) in the sense min β kw − w˜k 2 2 X j |β_j_|

• Sparser solutions by adding ℓ1 penalty, reweighted ℓ1 or group Lasso.

(53)

Incomplete Cholesky decomposition and reduced set

• For KSC problem M_DΩα = λDα, solve the approximation

UTMDUΛ2ζ = λζ

from Ω _≃ GGT, singular value decomposition G = UΛV T and ζ = UTα.

A smaller matrix of size R _× R is obtained instead of N _× N.

• Pivots are used as subset _{x˜_i_} for the data

• Reduced set method [Scholkopf et al., 1999]: approximation of w =

PN i=1 αiϕ(xi) by w˜ = PM j=1 βjϕ(˜xj) in the sense min β kw − w˜k 2 2 + ν X j |β_j_|

(54)

Big data: representative subsets using k-NN graphs

Steps:

1. MapReduce for representative subsets using k-NN graphs

2. Subset selection using FURS [Mall et al., SNAM 2013]

(which aims for good modelling of dense regions; deterministic method) 3. Use in Kernel Spectral Clustering KSC

Dataset Points M Random R`enyi entropy FURS

DB SIL DB SIL DB SIL

4G 28,500 285 0.804 ± 0.167 0.667 ± 0.059 2.909 ± 2.820 0.594 ± 0.148 0.616 0.765 5G 81,000 810 2.011 ± 0.998 0.649 ± 0.049 0.885 ± 0.276 0.601 ± 0.054 0.808 0.764 Batch 13,910 139 4.287 ± 0.764 0.503 ± 0.066 3.969 ± 0.783 0.460 ± 0.134 1.868 0.669 House 34,112 342 0.612 ± 0.154 0.679 ± 0.073 0.507 ± 0.028 0.579 ± 0.006 0.407 0.751 Mopsi Finland 13,467 135 0.897 ± 0.935 0.824 ± 0.085 0.526 ± 0.223 0.886 ± 0.010 0.568 0.920 DB DB DB KDDCupBio 145,751 1458 2.932 ± 1.205 4.233 ± 1.82 3.883 Power 2,049,280 2,050 2.558 ± 1.374 2.0619 ± 0.760 1.921

(quality metrics: Davies-Bouldin (DB) and silhouette (SIL)) [Mall, Jumutc, Langone, Suykens, IEEE Bigdata 2014]

(55)

Semi-supervised learning using KSC (1)

• N unlabeled data, but additional labels on M ₋ N data

X = _{x1, ..., xN,xN+1, ..., xM}

• Kernel spectral clustering as core model (binary case [Alzate & Suykens,

WCCI 2012], multi-way/multi-class [Mehrkanoon et al., TNNLS in press])

min w,e,b 1 2w T_w − γ 1 2e T_D−1_e ₊ _ρ 1 2 M X m=N+1 (e_m ₋ y_m)2 subject to e_i = wTϕ(x_i) + b, i = 1, ..., M

Dual solution is characterized by a linear system. Suitable for clustering as well as classification.

• Other approaches in semi-supervised learning and manifold learning, e.g.

(56)

Semi-supervised learning using KSC (2)

Dataset size nL/nU test (%) FS semi-KSC RD semi-KSC Lap-SVMp

Spambase 4597 368/736 919 (20%) 0.885 ± 0.01 0.883 ± 0.01 0.880 ± 0.03 Satimage 6435 1030/1030 1287 (20%) 0.864 ± 0.006 0.831 ± 0.009 0.834 ± 0.007 Ring 7400 592/592 1480 (20%) 0.975 ± 0.005 0.974 ± 0.005 0.972 ± 0.006 Magic 19020 761/1522 3804 (20%) 0.836 ± 0.006 0.829 ± 0.006 0.827 ± 0.005 Cod-rna 331152 1325/1325 66230 (20%) 0.957 ± 0.006 0.947 ± 0.008 0.951 ± 0.001 Covertype 581012 2760/2760 29050 (5%) 0.715 ± 0.005 0.684 ± 0.008 0.697 ± 0.001 2760/27600 0.729 ± 0.04 0.709 ± 0.05 − 2760/82800 0.739 ± 0.04 0.716 ± 0.03 − 2760/138000 0.742 ± 0.05 0.723 ± 0.06 −

FS semi-KSC: Fixed-size semi-supervised KSC

RD semi-KSC: other subset selection related to [Lee & Mangasarian, 2001]

Lap-SVM: Laplacian support vector machine [Belkin et al., 2006]

(57)

Semi-supervised learning using KSC (3)

original image KSC

(58)

(59)

Modularity and community detection

• Modularity for two-group case [Newman, 2006]:

Q = 1 4m X i,j (Aij − didj 2m )qiqj

with A adjacency matrix, d_i degree of node i, m = 1₂ P

i di,

qi = 1 if node i belongs to group 1 and qi = −1 for group 2.

• Use of modularity within kernel spectral clustering [Langone et al., 2012]:

– use modularity at the level of model validation

– finding representative subgraph using a fixed-size method by

maximizing the expansion factor [Maiya, 2010] |N_|G|(G)| with a subgraph

G and its neighborhood _N(_G).

– definition data in unweighted networks: x_i = A(:, i); use of a community kernel function [Kang, 2009].

(60)

Protein interaction network (1)

Pajek

(61)

Protein interaction network (2)

- Yeast interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001] - KSC community detection, representative subgraph [Langone et al., 2012]

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 m o d u la ri ty 7 detected clusters

(62)

Power grid network (1)

Pajek

(63)

Power grid network (2)

- Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998] - KSC community detection, representative subgraph [Langone et al., 2012]

0.35 0.4 0.45 0.5 0.55 m o d u la ri ty 16 detected clusters

(64)

KSC for big data networks (1)

• YouTube Network: YouTube social network where users form friendship with each other and the users can create groups where others can join.

• RoadCA Network: a road network of California. Intersections and endpoints are represented by nodes and the roads connecting these intersections are represented as edges.

• Livejournal Network: free online social network where users are bloggers and they declare friendship among themselves.

Dataset Nodes Edges

YouTube 1,134,890 2,987,624

roadCA 1,965,206 5,533,214

Livejournal 3,997,962 34,681,189

(65)

KSC for big data networks (2)

BAF-KSC:

• Select representative training subgraph using FURS

(FURS = Fast and Unique Representative Subset selection [Mall et al., 2013]: selects nodes with high degree centrality belonging to different dense regions, using a deactivation and activation procedure)

• Perform model selection using BAF

(BAF = Balanced Angular Fit: makes use of a cosine similarity measure related to the

e-projection values on validation nodes)

• Train the KSC model by solving a small eigenvalue problem of size

min(0.15N,5000)2

• Apply out-of-sample extension to find cluster memberships of the remaining nodes

(66)

KSC for big data networks (3)

BAF-KSC [Mall, Langone, Suykens, 2013] Louvain [Blondel et al., 2008]

Infomap [Lancichinetti, Fortunato, 2009] CNM [Clauset, Newman, Moore, 2004]

Dataset BAF-KSC Louvain Infomap CNM

Cl Q Con Cl Q Con Cl Q Con Cl Q Con

Openflight 5 0.533 0.002 109 0.61 0.02 18 0.58 0.005 84 0.60 0.016 PGPnet 8 0.58 0.002 105 0.88 0.045 84 0.87 0.03 193 0.85 0.041 Metabolic 10 0.22 0.028 10 0.43 0.03 41 0.41 0.05 11 0.42 0.021 HepTh 6 0.45 0.0004 172 0.65 0.004 171 0.3 0.004 6 0.423 0.0004 HepPh 5 0.56 0.0004 82 0.72 0.007 69 0.62 0.06 6 0.48 0.0007 Enron 10 0.4 0.002 1272 0.62 0.05 1099 0.37 0.27 6 0.25 0.0045 Epinion 10 0.22 0.0003 33 0.006 0.0003 17 0.18 0.0002 10 0.14 0.0 Condmat 6 0.28 0.0002 1030 0.79 0.03 1086 0.79 0.025 8 0.38 0.0003

Flight network (Openflights), network based on trust (PGPnet), biological network (Metabolic), citation networks (HepTh, HepPh), communication network (Enron), review based network (Epinion), collaboration network (Condmat) [snap.stanford.edu]

(67)

Multilevel Hierarchical KSC for complex networks (1)

Generating a series of affinity matrices over different levels:

(68)

Multilevel Hierarchical KSC for complex networks (2)

• Start at ground level 0 and compute cosine similarities on validation

nodes S_val(0)(i, j) = 1 ₋ eT_i e_j/(_ke_i_kke_j_k)

• Select projections ej satisfying S_val(0)(i, j) < t(0) with t(0) a threshold

value. Keep these nodes as the first cluster at level 0 and remove the nodes from matrix S_val(0) to obtain a reduced matrix.

• Iterative procedure over different levels h:

Communities at level h become nodes for the next level h+1, by creating

a set of matrices at different levels of hierarchy:

S_val(h)(i, j) = 1 |C_i(h−1)_{| |}C_j(h−1)_| X m∈C_i(h−1) X l∈C_j(h−1) S_val(h−1)(m, l)

and working with suitable threshold values t(h) for each level.

(69)

Multilevel Hierarchical KSC for complex networks (3)

MH-KSC on PGP network:

fine _intermediate _intermediate coarse

Multilevel Hierarchical KSC finds high quality clusters at coarse as well as fine and intermediate levels of hierarchy.

(70)

Multilevel Hierarchical KSC for complex networks (4)

Louvain

Infomap

OSLOM

Louvain, Infomap, and OSLOM seem biased toward a particular scale

(71)

Conclusions

• Synergies parametric and kernel based-modelling

• Sparse kernel models using fixed-size method

• Supervised, semi-supervised and unsupervised learning applications

• Large scale complex networks and big data

Software: see ERC AdG A-DATADRIVE-B website

(72)

Acknowledgements (1)

• Current and former co-workers at ESAT-STADIUS:

M. Agudelo, C. Alzate, A. Argyriou, R. Castro, M. Claesen, A. Daemen, J. De Brabanter, K. De Brabanter, J. de Haas, L. De Lathauwer, B. De Moor, F. De Smet, M. Espinoza, T. Falck, Y. Feng, E. Frandi, D. Geebelen, O. Gevaert, L. Houthuys, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, Z. Karevan, P. Karsmakers, R. Langone, Y. Liu, X. Lou, R. Mall, S. Mehrkanoon, Y. Moreau, M. Novak, F. Ojeda, K. Pelckmans, N. Pochet, D. Popovic, J. Puertas, M. Seslija, L. Shi, M. Signoretto, P. Smyth, Q. Tran Dinh, V. Van Belle, T. Van Gestel, J. Vandewalle, S. Van Huffel, C. Varon, X. Xi, Y. Yang, S. Yu, and others

• Many others for joint work, discussions, invitations, organizations

• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet,

(73)

(74)