Kernel methods for exploratory data analysis and community detection

(1)

Kernel methods for exploratory data analysis

and community detection

Johan Suykens

KU Leuven, ESAT-SCD/SISTA Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium Email: [email protected]

http://www.esat.kuleuven.be/scd/

(2)

Overview

• Principal component analysis

• Kernel principal component analysis and LS-SVM; sparse and robust extensions

• Kernel spectral clustering

• Community detection in complex networks

(3)

Principal component analysis (1)

• Given a data cloud, potentially in a high dimensional input space

× × × × × × × × × × × × ×× × × × × × × × × × × × × × × × × × × × × × × × ×× × × × × × × × × × × ×

• assume an ellipsoidal data cloud

(4)

Principal component analysis (2)

• Given data {xi}N_i₌₁ with xi ∈ Rn (assumed zero mean)

• Find projected variables wTx_i with maximal variance

max

w E{(w

T_x₎2_} ₌ _wT_E_{xxT_}w

= wT C w

with covariance matrix C = E{xxT} and E{·} the expected value.

• For N given data points one has C ≃ _N1

N

X

i=1

x_ixT_i .

• Problem: the optimal solution for w in the above problem is unbounded. Therefore an additional constraint should be taken: a common choice is to impose wTw = 1.

(5)

Principal component analysis (2)

• Given data {xi}N_i₌₁ with xi ∈ Rn (assumed zero mean)

• Find projected variables wTx_i with maximal variance

max

w E{(w

T_x₎2_} ₌ _wT_E_{xxT_}w

= wT C w

with covariance matrix C = E{xxT} and E{·} the expected value.

• For N given data points one has C ≃ _N1

N

X

i=1

x_ixT_i .

• Problem: the optimal solution for w in the above problem is unbounded.

Therefore an additional constraint should be taken: a common choice is

(6)

Principal component analysis (3)

• The problem formulation becomes then:

max

w w

T _{C w} _{subject to} _wT_w _{= 1}

• Constrained optimization problem solved by taking the Lagrangian L:

L(w;λ) = 1 2w

T_Cw ₋ _λ₍_wT_w ₋ ₁₎

with Lagrange multiplier λ.

• Solution is given by the eigenvalue problem

Cw = λw

(7)

Principal component analysis (4)

u2 u1 µ x1 x2 λ1/2 1 λ1/2 2

Illustration of an eigenvalue decomposition Cu = λu

- the eigenvalues λi are real and positive

- the eigenvectors u₁ and u₂ are orthogonal with respect to each other

- maximal variance solution is the direction u₁ corresponding to λ_max = λ₁. - note that µ = 0 (the data should be made zero-mean beforehand)

(8)

Principal component analysis: dimensionality reduction (1)

• Aim:

Decreasing the dimensionality of the given input space by mapping vectors x ∈ Rn to z ∈ Rm with m < n.

• A point x is mapped to z in the lower dimensional space by

z₍_j₎ = uT_j x

where u_j are the eigenvectors corresponding to the m largest eigenvalues and z = [z₍₁₎z₍₂₎...z₍_m₎]T.

• The error resulting from the dimensionality reduction is characterized by

the neglected eigenvalues, i.e. Pn

(9)

Principal component analysis: dimensionality reduction (2)

i λ_i

In this example reducing the original 6-dimensional space to a 2-dimensional space is a good choice because the largest two eigenvalues λ₁ and λ₂ are much larger than the other ones.

(10)

Principal component analysis: reconstruction problem (1)

• Consider x ∈ _Rn and z ∈ _Rm with m ≪ n (dimensionality reduction).

• Encoder mapping:

z = G(x)

Decoder mapping:

˜

x = F(z)

• Objective: squared distortion error (reconstruction error)

minE = 1 N N X i=1 kxi − x˜ik2₂ = 1 N N X i=1 kxi − F(G(xi))k2₂

(11)

Principal component analysis: reconstruction problem (2)

x

G

(

x

)

z

F

(

z

)

x

˜

(12)

Principal component analysis: denoising example

Images: 16 × 15 pixels (n = 240)

Training on N = 200 clean digits (not containing digit 9)

Test data: (bottom-left) denoised digit 9 after reconstruction using 10 principal components; (bottom-right) 180 principal components

(13)

Overview

• Kernel principal component analysis and LS-SVM; sparse and

robust extensions

(14)

Kernel principal component analysis

−1.5 −1 −0.5 0 0.5 1 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 linear PCA −1.5 −1 −0.5 0 0.5 1 −2 −1.5 −1 −0.5 0 0.5 1 1.5

kernel PCA (RBF kernel)

Kernel PCA [Sch¨olkopf et al., 1998]:

by eigenvalue decomposition of   K(x1, x1) ... K(x1, xN) ... ...   x(2) x₍₁₎

(15)

Kernel PCA: primal and dual problem

• Underlying primal problem [Suykens et al., IEEE-TNN 2003]

• Primal problem: min w,b,e ∓ 1 2w T_w _± 1 2γ N X i=1 e2_i s.t. ei = wTϕ(xi) + b, i = 1, ..., N.

• (Lagrange) dual problem = kernel PCA :

Ω_cα = λα with λ = 1/γ

with Ω_c,ij = (ϕ(x_i) − µˆ_ϕ)T(ϕ(x_j) − µˆ_ϕ) the centered kernel matrix.

• Interpretation:

1. pool of candidates components (objective function equals zero) 2. select relevant components (components with high variance)

(16)

Kernel PCA: model representations

Primal and dual model representations:

(P) : eˆ∗ = wTϕ(x∗) + b ր

M

ց

(D) : eˆ∗ = P_i αiK(x∗, xi) + b

which can be evaluated at any point x∗ ∈ Rd, where

K(x∗, xi) = ϕ(x∗)Tϕ(xi)

(17)

Generalizations to Kernel PCA: other loss functions

• Consider general loss function L:

min w,b,e − 1 2w T_w ₊ 1 2γ N X i=1 L(e_i) s.t. e_i = wTϕ(x_i) + b, i = 1, ..., N.

Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ǫ-insensitive loss, Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:

min w,b,e − 1 2w T_w ₊ 1 2γ N X i=1 v_ie2_i s.t.          ei = wTϕ(xi) + b, i = 1, ..., N PN i=1 eie (1) i = 0 ... PN i=1 eie (l−1) i = 0

Find l-th PC w.r.t. l − 1 orthogonality constraints (previous PC e(_ij)). The solution is given by a generalized eigenvalue problem.

(18)

Robustness: Kernel Component Analysis

original image corrupted image

KPCA reconstruction

(19)

Robustness: Kernel Component Analysis

original image corrupted image

KPCA reconstruction KCA reconstruction

(20)

Generalizations to Kernel PCA: sparseness

−0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC1 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC2 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC3

Sparse kernel PCA using ǫ-insensitive loss [Alzate & Suykens, 2006]

(21)

Overview

(22)

Spectral graph clustering

Minimal cut: given the graph G = (V, E), find clusters A₁,A₂

min q_i∈{−1,+1} 1 2 X i,j wij(qi − qj)2

with cluster membership indicator q_i (q_i = 1 if i ∈ A₁, q_i = −1 if i ∈ A₂) and W = [w_ij] the weighted adjacency matrix.

2 3 4 5 6 1 cut of size 2 cut of size 1 (minimal cut)

(23)

Spectral graph clustering

• Relaxation to Min-cut spectral clustering problem

min

˜

qTq˜=1 q˜

T_L_q_˜

with L = D − W the unnormalized graph Laplacian, degree matrix

D = diag(d₁, ..., d_N), d_i = P

j wij, giving

Lq˜ = λq˜.

Cluster member indicators: qˆ_i = sign(˜q_i − θ) with threshold θ.

• Normalized cut

Lq˜ = λDq˜

[Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997; von Luxburg, 2007] • Discrete version to continuous problem (Laplace operator)

(24)

(25)

Kernel spectral clustering: case of two clusters

• Underlying model (primal representation):

ˆ

e∗ = wTϕ(x∗) + b

with qˆ∗ = sign[ˆe∗] the estimated cluster indicator at any x∗ ∈ Rd.

• Primal problem: training on given data {x_i}N_i₌₁

min w,b,e − 1 2w T_w ₊ _γ1 2 N X i=1 v_i e2_i subject to e_i = wTϕ(x_i) + b, i = 1, ..., N

with positive weights v_i (will be related to inverse degree matrix).

(26)

Lagrangian and conditions for optimality

• Lagrangian: L(w, b, e;α) = −1 2w T_w ₊ _γ 1 2 N X i=1 v_ie2_i − N X i=1 α_i(e_i − wTϕ(x_i) − b)

• Conditions for optimality:

                       ∂L ∂w = 0 ⇒ w = P i αiϕ(xi) ∂L ∂b = 0 ⇒ P iαi = 0 ∂L ∂e_i = 0 ⇒ αi = γviei, i = 1, ..., N ∂L ∂α_i = 0 ⇒ ei = w T_ϕ₍_x i) + b, i = 1, ..., N

(27)

Kernel-based model representation

• Dual problem: V M_V Ωα = λα with λ = 1/γ M_V = I_N − ₁T 1 NV1N

1_N1T_NV : weighted centering matrix

Ω = [Ω_ij]: kernel matrix with Ω_ij = ϕ(x_i)Tϕ(x_j) = K(x_i, x_j)

• Dual model representation:

ˆ e∗ = N X i=1 α_iK(x_i, x∗) + b with K(x_i, x∗) = ϕ(xi)Tϕ(x∗).

(28)

Choice of weights

v

_i

• Take V = D−1 where D = diag{d1, ..., dN} and di = PN_j₌₁ Ωij

• This gives the generalized eigenvalue problem:

MDΩα = λDα

with MD = IN − ₁T 1 ND−11N

1N1T_ND−1

This is a modified version of random walks spectral clustering.

• Note that sign[e_i] = sign[α_i] if γv_i > 0 (on training data) ... but sign[e∗] applies beyond training data

(29)

Kernel spectral clustering: more clusters

• Case of k clusters: additional sets of constraints

min w(l),e(l),b_l −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γ_le(l)TD−1e(l) subject to e(1) = Φ_N_×_n hw (1) ₊ _b 11N e(2) = Φ_N_×_n hw (2) ₊ _b 21N ... e(k−1) = Φ_N_×_n hw (k−1) ₊ _b k−11N where e(l) = [e(₁l);...;e(_Nl)] and Φ_N_×_n h = [ϕ(x1) T_;_..._;_ϕ₍_x N)T] ∈ RN×nh. • Dual problem: M_DΩα(l) = λDα(l), l = 1, ..., k − 1.

(30)

Primal and dual model representations

k clusters

k − 1 sets of constraints (index l = 1, ..., k − 1)

(P) : sign[ˆe(∗l)] = sign[w(l) T ϕ(x∗) + bl] ր M ց (D) : sign[ˆe(∗l)] = sign[P_j α_j(l)K(x∗, xj) + bl]

• Note: additional sets of constraints also in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999]

(31)

Out-of-sample extension and coding

−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 )

(32)

Out-of-sample extension and coding

−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 )

(33)

Piecewise constant eigenvectors and extension (1)

• Definition. [Meila & Shi, 2001] Vector α is called piecewise constant

relative to a partition (A₁, ..., A_k) iff α_i = α_j ∀x_i, x_j ∈ A_p, p = 1, ..., k.

• Proposition. [Alzate & Suykens, 2010] Assume

(i) a training set D = {xi}N_i₌₁ and validation set Dv = {xv_m}N_mv₌₁

i.i.d. sampled from the same underlying distribution; (ii) a set of k clusters {A₁, ...,A_k} with k > 2;

(iii) an isotropic kernel function such that K(x, z) = 0 when x and z

belong to different clusters;

(iv) the eigenvectors α(l) for l = 1, ..., k − 1 are piecewise constant.

Then validation set points belonging to the same cluster are collinear

in the k − 1 dimensional subspace spanned by the columns of Ev ∈

RNv×(k−1) _where _Ev

ml = e

(l)

(34)

Piecewise constant eigenvectors and extension (2)

• Key aspect of the proof: for x∗ ∈ Ap one has

e∗(l) = PN_i₌₁ α_i(l)K(xi, x∗) + b(l)

= c(pl)P_i_∈A_p K(xi, x∗) + P_{i /}N_∈A_p α(_il)K(xi, x∗) + b(l)

= c(pl)P_i_∈A_p K(xi, x∗) + b(l)

• Model selection to determine kernel parameters and k:

Looking for line structures in the space (e(1)_i , e(2)_i , ..., e(_ik−1)), evaluated on validation data (aiming for good generalization)

• Choice kernel:

Gaussian RBF kernel

(35)

Model selection (looking for lines): toy problem 1

−0.4 −0.2 0 0.2 0.4 0.6 0.8 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 σ2 = 0.5,BLF= 0.56 e(1) i,val e (2 ) i, va l −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x₍₁₎ x (2 ) −0.4 −0.2 0 0.2 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 σ2 = 0.16,BLF= 1.0 e(1) i,val e (2 ) i, va l validation set −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x₍₁₎ x (2 )

(36)

Model selection (looking for lines): toy problem 2

−8 −6 −4 −2 0 2 4 −6 −4 −2 0 2 4 6 8 σ2 = 1.20,BLF = 0.49 e(1) i,val e (2 ) i, va l −2 0 2 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 x₍₁₎ x₍₂₎ x (3 ) −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 σ2 = 0.003,BLF= 1.0 e(1) i,val e (2 ) i, va l validation set −2 0 2 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 x₍₁₎ x₍₂₎ x (3 )

(37)

Example: image segmentation (looking for lines)

−5 −4 −3 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 −2 −1 0 1 2 3 4 e(1) i,val e(2) i,val e (3 ) i, va l

(38)

145086 42049 167062 147091 196073 62096 119082 3096 295087 37073

Image Proposed method Nyström method Human Image ID

(39)

Example: power grid - identifying customer profiles (1)

Power load: 245 substations, hourly data (5 years), d = 43.824

Periodic AR modelling: dimensionality reduction 43.824 → 24

k-means clustering applied after dimensionality reduction

2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad

(40)

Example: power grid - identifying customer profiles (2)

Application of kernel spectral clustering, directly on d = 43.824

Model selection on kernel parameter and number of clusters

[Alzate, Espinoza, De Moor, Suykens, 2009]

5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad

(41)

Example: power grid - identifying customer profiles (3)

5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad

Electricity load: 245 substations in Belgian grid (1/2 train, 1/2 validation)

xi ∈ R43.824: spectral clustering on high dimensional data (5 years)

3 of 7 detected clusters:

- 1: Residential profile: morning and evening peaks

- 2: Business profile: peaked around noon

(42)

Kernel spectral clustering: sparse kernel models

original image binary clustering

Incomplete Cholesky decomposition: kΩ − GGTk2 ≤ η with G ∈ RN×R and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV e(∗l) = P_i_∈S

SV α

(l)

(43)

Kernel spectral clustering: sparse kernel models

original image sparse kernel model

Incomplete Cholesky decomposition: kΩ − GGTk2 ≤ η with G ∈ RN×R and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV e(∗l) = P_i_∈S

SV α

(l)

(44)

Highly sparse kernel models on images

• application on images:

x_i ∈ _R3 (r,g,b values per pixel), i = 1, ..., N

pre-processed into z_i ∈ R8 (quantization to 8 colors)

χ2-kernel to compare two local color histograms (5 × 5 pixels window)

• N > 100.000, select subset M ≪ N based on quadratic Renyi entropy as in the fixed-size method [Suykens et al., 2002]

• Highly sparse representations: # SV = 3 k

• Completion of cluster indicators based on out-of-sample extensions

sign[ˆe(∗l)] = sign[P_j_∈S_SV α_j(l)K(x∗, xj) + bl]

applied to the full image

(45)

Highly sparse kernel models: toy example 1

−5 0 5 10 −6 −4 −2 0 2 4 6 8 x₍₁₎ x (2 ) −3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3 4 e(1)_i e (2 ) i

(46)

Highly sparse kernel models: toy example 2

−4 −3 −2 −1 0 1 2 3 4 −4 −3 −2 −1 0 1 2 3 4 x(1) x(2 )

(47)

−5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5 x₍₁₎ x(2 )

(48)

−4 −2 0 2 4 6 −4 −2 0 2 4 −4 −2 0 2 4 ˆ e(1)_i ˆ e(2)_i ˆ e (3 ) i

(49)

Highly sparse kernel models: image segmentation

* * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * *** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * ** * ** * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * ** * * * * * * * * ** * * * * * * * * * * * * * ** * * * * * * * ** * * * * * * * * * * * * * * *_* * * * * * ** * * * ** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *** * * * * * * ** * * ** * * * * * * * * * * * * * * * ** **_* * * * * * ** * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * ** * * * * * ** * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * ** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * *** * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * −1.5 −1 −0.5 0 0.5 −3 −2 −1 0 1 2 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 e(1)_i e(2)_i e (3 ) i

(50)

Highly sparse kernel models: image segmentation

* * * * * * * * * * * * −1.5 −1 −0.5 0 0.5 −3 −2 −1 0 1 2 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 e(1)_i e(2)_i e (3 ) i

(51)

Hierarchical kernel spectral clustering

Hierarchical kernel spectral clustering: - looking at different scales

- use of model selection and validation data

(52)

Kernel spectral clustering: adding prior knowledge

• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link

• Primal problem [Alzate & Suykens, IJCNN 2009]

min w(l),e(l),b_l −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γ_le(l)TD−1e(l) subject to e(1) = Φ_N_×_n hw (1) ₊ _b 11N ... e(k−1) = Φ_N_×_n hw (k−1) ₊ _b k−11N w(1)Tϕ(x†) = cw(1) T ϕ(x‡) ... w(k−1)Tϕ(x†) = cw(k−1) T ϕ(x‡)

(53)

Kernel spectral clustering: adding prior knowledge

• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link

• Primal problem [Alzate & Suykens, IJCNN 2009]

min w(l),e(l),b_l −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γ_le(l)TD−1e(l) subject to e(1) = Φ_N_×_n hw (1) ₊ _b 11N ... e(k−1) = Φ_N_×_n hw (k−1) ₊ _b k−11N w(1)Tϕ(x†) = cw(1) T ϕ(x‡) ... w(k−1)Tϕ(x†) = cw(k−1) T ϕ(x‡)

(54)

Adding prior knowledge

(55)

Adding prior knowledge

(56)

Semi-supervised learning

• N unlabeled data, but additional labels on M − N data

X = {x1, ..., xN,xN+1, ..., xM}

• Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]:

min w,e,b 1 2w T_w ₋ _γ 1 2e T_D−1_e₊_ρ 1 2 M X m=N+1 (em − ym)2 subject to e_i = wTϕ(x_i) + b, i = 1, ..., M

Dual solution is characterized by a linear system.

(57)

Semi-supervised learning

• N unlabeled data, but additional labels on M − N data

X = {x1, ..., xN,xN+1, ..., xM}

• Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]:

min w,e,b 1 2w T_w ₋ _γ 1 2e T_D−1_e₊_ρ 1 2 M X m=N+1 (em − ym)2 subject to e_i = wTϕ(x_i) + b, i = 1, ..., M

Dual solution is characterized by a linear system.

(58)

Overview

(59)

Modularity and community detection

• Modularity for two-group case [Newman, 2006]:

Q = 1 4m X i,j (Aij − didj 2m )qiqj

with A adjacency matrix, d_i degree of node i, m = 1₂ P

i di,

qi = 1 if node i belongs to group 1 and qi = −1 for group 2.

• Use of modularity within kernel spectral clustering [Langone et al., 2012]:

– use at the level of model validation

– finding representative subgraph using a fixed-size method by maximizing the expansion factor [Maiya, 2010] |N_|G|(G)| with a subgraph

G and its neighborhood N(G).

– definition data in unweighted networks: x_i = A(:, i); use of a community kernel function [Kang, 2009].

(60)

Protein interaction network (1)

Pajek

(61)

Protein interaction network (2)

- Yeast interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001] - KSC community detection, representative subgraph [Langone et al., 2012]

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 number of clusters m o d u la ri ty 7 detected clusters

(62)

Power grid network (1)

Pajek

(63)

Power grid network (2)

- Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998] - KSC community detection, representative subgraph [Langone et al., 2012]

20 40 60 80 100 120 140 160 180 200 220 240 0.35 0.4 0.45 0.5 0.55 number of clusters m o d u la ri ty 16 detected clusters

(64)

Evolving networks

• Binary clustering case: adding a memory effect [Langone et al, 2012]

min w,e,b 1 2w T_w ₋ _γ 1 2e T_D−1_e−_{ν w}T_w old subject to ei = wTϕ(xi) + b, i = 1, ..., N

with wold the previous result in time. • Aims at including temporal smoothness

(65)

Overview

(66)

Dimensionality reduction and data visualization

• Traditionally:

commonly used techniques are e.g. principal component analysis (PCA), multi-dimensional scaling (MDS), self-organizing maps (SOM)

• More recently:

isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps

(“kernel eigenmap methods and manifold learning”)

[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:

(67)

Kernel maps with reference point: formulation

- LS-SVM core part: realize dimensionality reduction x 7→ z

- Regularization term: (z − PDz)T(z − PDz) = PN_i₌₁ kzi − P_jN₌₁ sijDzjk2₂

with D diagonal matrix and sij = exp(−kxi − xjk2₂/σ2)

- reference point q (e.g. first point; sacrificed in the visualization)

• Example: d = 2 min z,w₁,w₂,b₁,b₂_,ei,₁_,ei,₂ 1 2(z − PDz) T (z − PDz)+ ν 2 (w T 1 w1 + w T 2 w2) + η 2 N X i=1 (e2_i,1 + e2_i,2) such that cT₁_,₁z = q1 + e1,1 cT₁_,₂z = q2 + e1,2 cT_i,₁z = wT₁ ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N cT_i,₂z = wT₂ ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N

(68)

Kernel maps with reference point: formulation

- Regularization term: (z−PDz)T(z−PDz) = P_iN₌₁ kzi−P_jN₌₁ sijDzjk2₂

with D diagonal matrix and s_ij = exp(−kxi − xjk2₂/σ2)

(69)

Kernel maps with reference point: formulation

- Regularization term: (z−PDz)T(z−PDz) = P_iN₌₁ kzi−P_jN₌₁ sijDzjk2₂

with D diagonal matrix and sij = exp(−kxi − xjk2₂/σ2)

(70)

Model selection by validation

Model selection criterion: min

Θ X i,j ˆ z_iTzˆ_j kˆz_ik₂kˆz_jk₂ − xT_i x_j kx_ik₂kx_jk₂ 2 Tuning parameters Θ:

• Kernels tuning parameters in s_ij, K₁, K₂,(K₃)

• Regularization constants ν, η

• Choice of the diagonal matrix D

• Choice of reference point q, e.g. q ∈ {[+1; +1],[+1;−1],[−1; +1],[−1,−1]}

(71)

Kernel maps: spiral example

−1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −0.5 0 0.5 x1 x2 x3 q = [+1;−1] −0.02 −0.015 −0.01 −0.005 0 0.005 0.01 −2 0 2 4 6 8 10 12x 10 −3 z1hat z2 hat q = [−1;−1] −0.01−2 −0.005 0 0.005 0.01 0.015 0.02 0 2 4 6 8 10 12x 10 −3 z1hat z2 hat

training data (blue *), validation data (magenta o), test data (red +)

Model selection: minX i,j ˆ z_iTzˆ_j kˆz_ik₂kˆz_jk₂ − xT_i x_j kx_ik₂kx_jk₂ 2

(72)

Kernel maps: swiss roll example

−1 −0.5 0 0.5 1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 x 1 x 2 x 3 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0 x 10−3 0 0.5 1 1.5 2 2.5 3x 10 −3 z 1 z2

Given 3D swiss roll data Kernel map result - 2D

(73)

Kernel maps: visualizing gene distribution

−2.35 −2.3 −2.25 −2.2 −2.15 −2.1_−2.05 −2 −1.95 x 10−3 1.9 2 2.1 2.2 2.3 x 10−3 1.7 1.8 1.9 2 2.1 x 10−3 z 1 z 2 z3

Alon colon cancer microarray data set: 3D projections Dimension input space: 62

(74)

Kernel maps: time-series data visualization

Santa Fe laser data

0 100 200 300 400 500 600 700 800 900 0 50 100 150 200 250 300 discrete time k 1 1.5 2 2.5 3 3.5 4 x 10−3 −4 −3.5 −3 −2.5 −2 −1.5 −1 x 10−3 2.3 2.4 2.5 2.6 2.7 2.8 2.9 x 10−3 z 1hat z 2hat z3 hat

Data {y_k_|_k₋₉}N_k₌₁: train, validation, test

Tuning parameters (kernel & regularization) based on validation set Model is able to make out of sample extensions

(75)

Conclusions

• From PCA to KPCA

• LS-SVM model framework with primal-dual setting

• Out-of-sample extensions, model selection procedures and large scale methods

• From spectral clustering to kernel spectral clustering

• Applications in complex networks

• Data visualization problems: learning and generalization

(76)

Acknowledgements (1)

• Colleagues at ESAT-SCD (especially research units: systems, models, control - biomedical data processing - bioinformatics):

C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, L. De Lathauwer, B. De Moor, M. Diehl, Ph. Dreesen, M. Espinoza, T. Falck, D. Geebelen, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, M. Moonen, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, P. Tsiaflakis, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, T. van Waterschoot, C. Varon, S. Yu, and others

• Topics of this lecture: Carlos Alzate and Rocco Langone.

• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST

(77)

(78)