Kernel methods for exploratory data analysis and community detection


(1)

Kernel methods for exploratory data analysis and community detection

Johan Suykens

KU Leuven, ESAT-SCD/SISTA

Kasteelpark Arenberg 10

B-3001 Leuven (Heverlee), Belgium

Email: [email protected]

http://www.esat.kuleuven.be/scd/

(2)

Overview

• Principal component analysis

• Kernel principal component analysis and LS-SVM; sparse and robust extensions

• Kernel spectral clustering

• Community detection in complex networks

(3)

Principal component analysis (1)

• Given a data cloud, potentially in a high dimensional input space

[Figure: scatter plot of a data cloud]

• assume an ellipsoidal data cloud

(4)

Principal component analysis (2)

• Given data $\{x_i\}_{i=1}^N$ with $x_i \in \mathbb{R}^n$ (assumed zero mean)

• Find projected variables $w^T x_i$ with maximal variance

$$\max_w \; E\{(w^T x)^2\} = w^T E\{x x^T\} w = w^T C w$$

with covariance matrix $C = E\{x x^T\}$ and $E\{\cdot\}$ the expected value.

• For $N$ given data points one has $C \simeq \frac{1}{N} \sum_{i=1}^N x_i x_i^T$.

• Problem: the optimal solution for $w$ in the above problem is unbounded. Therefore an additional constraint is needed: a common choice is to impose $w^T w = 1$.

(6)

Principal component analysis (3)

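• The problem formulation then becomes:

$$\max_w \; w^T C w \quad \text{subject to} \quad w^T w = 1$$

• Constrained optimization problem solved by taking the Lagrangian $\mathcal{L}$:

$$\mathcal{L}(w; \lambda) = \frac{1}{2} w^T C w - \lambda (w^T w - 1)$$

with Lagrange multiplier $\lambda$.

• Setting $\partial \mathcal{L} / \partial w = C w - 2 \lambda w = 0$ gives, after absorbing the factor 2 into $\lambda$, the eigenvalue problem

$$C w = \lambda w$$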
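As a minimal sketch of the eigenvalue problem above, the following NumPy snippet computes the direction of maximal variance for a small synthetic data set. The data, the random seed and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N = 500 points in R^2, stretched along one direction.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                 # make the data zero-mean

C = X.T @ X / X.shape[0]               # sample covariance (1/N) sum_i x_i x_i^T

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending

w = eigvecs[:, 0]                      # unit-norm direction of maximal variance
projected_variance = np.mean((X @ w) ** 2)           # equals w^T C w
```

Under these assumptions the variance of the projected variables $w^T x_i$ coincides with the largest eigenvalue of $C$.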

(7)

Principal component analysis (4)

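[Figure: ellipsoidal data cloud in the $(x_1, x_2)$ plane with center $\mu$ and principal axes $u_1$, $u_2$ labeled $\lambda_1^{1/2}$, $\lambda_2^{1/2}$]

Illustration of an eigenvalue decomposition $C u = \lambda u$

- the eigenvalues $\lambda_i$ are real and nonnegative ($C$ is symmetric positive semidefinite)

- the eigenvectors $u_1$ and $u_2$ are orthogonal to each other

- the maximal variance solution is the direction $u_1$ corresponding to $\lambda_{\max} = \lambda_1$

- note that $\mu = 0$ (the data should be made zero-mean beforehand)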

(8)

Principal component analysis: dimensionality reduction (1)

• Aim:

Decreasing the dimensionality of the given input space by mapping vectors $x \in \mathbb{R}^n$ to $z \in \mathbb{R}^m$ with $m < n$.

• A point $x$ is mapped to $z$ in the lower dimensional space by

$$z^{(j)} = u_j^T x$$

where $u_j$ are the eigenvectors corresponding to the $m$ largest eigenvalues and $z = [z^{(1)} \; z^{(2)} \ldots z^{(m)}]^T$.

• The error resulting from the dimensionality reduction is characterized by the neglected eigenvalues, i.e. $\sum_{i=m+1}^{n} \lambda_i$.

(9)

Principal component analysis: dimensionality reduction (2)

[Figure: eigenvalues $\lambda_i$ plotted against the index $i$]

In this example reducing the original 6-dimensional space to a 2-dimensional space is a good choice because the largest two eigenvalues λ1 and λ2 are much larger than the other ones.

(10)

Principal component analysis: reconstruction problem (1)

• Consider x ∈ Rn and z ∈ Rm with m ≪ n (dimensionality reduction).

• Encoder mapping:

$$z = G(x)$$

Decoder mapping:

$$\tilde{x} = F(z)$$

• Objective: squared distortion error (reconstruction error)

$$\min E = \frac{1}{N} \sum_{i=1}^N \| x_i - \tilde{x}_i \|_2^2 = \frac{1}{N} \sum_{i=1}^N \| x_i - F(G(x_i)) \|_2^2$$
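For linear PCA, one natural choice is a linear encoder and decoder built from the leading eigenvectors. The sketch below assumes this choice and checks numerically that the reconstruction error then equals the sum of the neglected eigenvalues. The synthetic data, the dimensions and the helper names `G` and `F` follow the slide notation but are otherwise illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, m = 400, 6, 2

# Hypothetical data in R^6 with two dominant directions.
scales = np.array([5.0, 3.0, 0.3, 0.2, 0.1, 0.1])
X = rng.normal(size=(N, n)) * scales
X = X - X.mean(axis=0)                      # zero-mean data

C = X.T @ X / N                             # sample covariance matrix
eigvals, U = np.linalg.eigh(C)
eigvals, U = eigvals[::-1], U[:, ::-1]      # descending eigenvalues
U_m = U[:, :m]                              # eigenvectors of the m largest eigenvalues


def G(x):
    """Encoder: z = U_m^T x, applied row-wise."""
    return x @ U_m


def F(z):
    """Decoder: x_tilde = U_m z, applied row-wise."""
    return z @ U_m.T


X_rec = F(G(X))
E = np.mean(np.sum((X - X_rec) ** 2, axis=1))   # (1/N) sum_i ||x_i - F(G(x_i))||^2
neglected = eigvals[m:].sum()                    # sum of the neglected eigenvalues
```

With the covariance estimated as $\frac{1}{N} \sum_i x_i x_i^T$, the two quantities `E` and `neglected` agree up to rounding.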

(11)

Principal component analysis: reconstruction problem (2)

[Diagram: $x \rightarrow G(x) \rightarrow z \rightarrow F(z) \rightarrow \tilde{x}$]

(12)

Principal component analysis: denoising example

Images: 16 × 15 pixels (n = 240)

Training on N = 200 clean digits (not containing digit 9)

Test data: (bottom-left) denoised digit 9 after reconstruction using 10 principal components; (bottom-right) 180 principal components

(14)

Kernel principal component analysis

[Figure: two panels in the $(x^{(1)}, x^{(2)})$ plane: linear PCA and kernel PCA (RBF kernel)]

Kernel PCA [Schölkopf et al., 1998]:

by eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & \ldots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \ldots & K(x_N, x_N) \end{bmatrix}$$
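A minimal sketch of kernel PCA as an eigenvalue decomposition of the (centered) kernel matrix is given below. The two-ring toy data, the kernel width and the helper `rbf_kernel` are assumptions for illustration; the RBF kernel is written as $\exp(-\|x - z\|_2^2 / \sigma^2)$.

```python
import numpy as np


def rbf_kernel(X, Z, sigma2):
    """RBF kernel matrix with entries exp(-||x - z||^2 / sigma2)."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)


rng = np.random.default_rng(0)
N = 200

# Hypothetical toy data: two noisy concentric rings in R^2.
angles = rng.uniform(0.0, 2.0 * np.pi, N)
radii = np.where(np.arange(N) < N // 2, 1.0, 3.0) + 0.05 * rng.normal(size=N)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

K = rbf_kernel(X, X, sigma2=1.0)

# Centering in feature space: Kc = M K M with M = I - (1/N) 1 1^T.
M = np.eye(N) - np.ones((N, N)) / N
Kc = M @ K @ M

lam, alpha = np.linalg.eigh(Kc)             # ascending eigenvalues
lam, alpha = lam[::-1], alpha[:, ::-1]      # descending eigenvalues

# Score variables of the training points on the first two components.
scores = Kc @ alpha[:, :2]
```

The columns of `scores` play the role of the projected variables in linear PCA, now computed in the feature space induced by the kernel.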

(15)

Kernel PCA: primal and dual problem

• Underlying primal problem [Suykens et al., IEEE-TNN 2003]

• Primal problem:

$$\min_{w,b,e} \; \mp \frac{1}{2} w^T w \pm \frac{1}{2} \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, N.$$

• (Lagrange) dual problem = kernel PCA:

$$\Omega_c \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$

with $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ the centered kernel matrix.

• Interpretation:

1. pool of candidate components (objective function equals zero)

2. select relevant components (components with high variance)

(16)

Kernel PCA: model representations

Primal and dual model representations:

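Model M, with two representations:

(P): $\hat{e}_* = w^T \varphi(x_*) + b$

(D): $\hat{e}_* = \sum_i \alpha_i K(x_*, x_i) + b$

which can be evaluated at any point $x_* \in \mathbb{R}^d$, where

$$K(x_*, x_i) = \varphi(x_*)^T \varphi(x_i)$$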

(17)

Generalizations to Kernel PCA: other loss functions

• Consider general loss function L:

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N L(e_i) \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, N.$$

Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik $\epsilon$-insensitive loss, Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \frac{1}{2} \gamma \sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi(x_i) + b, & i = 1, \ldots, N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \quad \vdots \\ \sum_{i=1}^N e_i e_i^{(l-1)} = 0 \end{cases}$$

Find the $l$-th PC subject to $l - 1$ orthogonality constraints (previous PCs $e_i^{(j)}$). The solution is given by a generalized eigenvalue problem.

(19)

Robustness: Kernel Component Analysis

[Figure: original image, corrupted image, KPCA reconstruction, KCA reconstruction]

(20)

Generalizations to Kernel PCA: sparseness

[Figure: four panels in the $(x_1, x_2)$ plane: the data, PC1, PC2 and PC3]

Sparse kernel PCA using $\epsilon$-insensitive loss [Alzate & Suykens, 2006]

(22)

Spectral graph clustering

Minimal cut: given the graph $G = (V, E)$, find clusters $A_1, A_2$

$$\min_{q_i \in \{-1, +1\}} \; \frac{1}{2} \sum_{i,j} w_{ij} (q_i - q_j)^2$$

with cluster membership indicator $q_i$ ($q_i = 1$ if $i \in A_1$, $q_i = -1$ if $i \in A_2$) and $W = [w_{ij}]$ the weighted adjacency matrix.

[Figure: graph with 6 nodes showing a cut of size 2 and a cut of size 1 (minimal cut)]
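To make the objective concrete, the sketch below evaluates it by brute force on a hypothetical 6-node graph (two triangles joined by one edge; this graph is an assumption and not necessarily the one in the figure). The trivial assignments that put all nodes in one cluster are excluded, since they give a zero objective. With unit weights, every cut edge contributes 4 to the objective, so the cut size is the objective divided by 4.

```python
import itertools

import numpy as np

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
N = 6
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0


def cut_objective(q):
    """(1/2) * sum_{i,j} w_ij (q_i - q_j)^2 for q in {-1, +1}^N."""
    diff = q[:, None] - q[None, :]
    return 0.5 * np.sum(W * diff ** 2)


best_q, best_val = None, np.inf
for bits in itertools.product([-1.0, 1.0], repeat=N):
    q = np.array(bits)
    if abs(q.sum()) == N:          # skip the trivial single-cluster assignments
        continue
    val = cut_objective(q)
    if val < best_val:
        best_q, best_val = q, val

cut_size = best_val / 4.0          # number of unit-weight edges crossing the cut
```

The exhaustive search visits $2^N$ assignments, which is only feasible for very small graphs and motivates a relaxation of the problem.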

(23)

Spectral graph clustering

• Relaxation to Min-cut spectral clustering problem

$$\min_{\tilde{q}^T \tilde{q} = 1} \; \tilde{q}^T L \tilde{q}$$

with $L = D - W$ the unnormalized graph Laplacian, degree matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$, $d_i = \sum_j w_{ij}$, giving

$$L \tilde{q} = \lambda \tilde{q}.$$

Cluster member indicators: $\hat{q}_i = \mathrm{sign}(\tilde{q}_i - \theta)$ with threshold $\theta$.

• Normalized cut

$$L \tilde{q} = \lambda D \tilde{q}$$

[Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997; von Luxburg, 2007]

• Discrete version to continuous problem (Laplace operator)
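A minimal sketch of both relaxations on a hypothetical 6-node graph (two triangles joined by one edge, an assumption for illustration) is shown below. Since the smallest eigenvalue of $L$ is zero with a constant eigenvector, the sketch uses the eigenvector of the second-smallest eigenvalue and the threshold $\theta = 0$, which is also an assumption. The generalized problem $L \tilde{q} = \lambda D \tilde{q}$ is solved through the symmetric matrix $D^{-1/2} L D^{-1/2}$.

```python
import numpy as np

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
N = 6
W = np.zeros((N, N))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

d = W.sum(axis=1)                  # degrees d_i = sum_j w_ij
L = np.diag(d) - W                 # unnormalized graph Laplacian

# Min-cut relaxation: L q = lambda q, eigenvalues returned in ascending order.
lam, Q = np.linalg.eigh(L)
q_tilde = Q[:, 1]                  # eigenvector of the second-smallest eigenvalue
theta = 0.0                        # assumed threshold
q_hat = np.sign(q_tilde - theta)

# Normalized cut: L q = lambda D q, via the symmetric matrix D^{-1/2} L D^{-1/2}.
d_inv_sqrt = 1.0 / np.sqrt(d)
L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
lam_n, V = np.linalg.eigh(L_sym)
q_tilde_n = d_inv_sqrt * V[:, 1]   # map back to the generalized eigenvector
q_hat_n = np.sign(q_tilde_n - theta)
```

On this graph both indicator vectors separate the two triangles. The overall sign of an eigenvector is arbitrary, so only the grouping is meaningful.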

(24)
(25)

Kernel spectral clustering: case of two clusters

• Underlying model (primal representation):

$$\hat{e}_* = w^T \varphi(x_*) + b$$

with $\hat{q}_* = \mathrm{sign}[\hat{e}_*]$ the estimated cluster indicator at any $x_* \in \mathbb{R}^d$.

• Primal problem: training on given data $\{x_i\}_{i=1}^N$

$$\min_{w,b,e} \; -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2 \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, N$$

with positive weights $v_i$ (will be related to inverse degree matrix).

(26)

Lagrangian and conditions for optimality

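• Lagrangian:

$$\mathcal{L}(w, b, e; \alpha) = -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2 - \sum_{i=1}^N \alpha_i (e_i - w^T \varphi(x_i) - b)$$

• Conditions for optimality:

$$\begin{cases} \dfrac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i \varphi(x_i) \\ \dfrac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i \alpha_i = 0 \\ \dfrac{\partial \mathcal{L}}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma v_i e_i, \; i = 1, \ldots, N \\ \dfrac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \Rightarrow e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, N \end{cases}$$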

(27)

Kernel-based model representation

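• Dual problem:

$$V M_V \Omega \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$

$M_V = I_N - \frac{1}{1_N^T V 1_N} 1_N 1_N^T V$: weighted centering matrix

$\Omega = [\Omega_{ij}]$: kernel matrix with $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$

• Dual model representation:

$$\hat{e}_* = \sum_{i=1}^N \alpha_i K(x_i, x_*) + b \quad \text{with} \quad K(x_i, x_*) = \varphi(x_i)^T \varphi(x_*).$$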
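The dual eigenvalue problem can be obtained from the conditions for optimality by eliminating $w$ and $e$. The following is a sketch of that elimination, writing $V = \mathrm{diag}(v_1, \ldots, v_N)$ and $e = [e_1; \ldots; e_N]$ (the stacked notation is an assumption made here for compactness).

```latex
\begin{aligned}
& w = \textstyle\sum_i \alpha_i \varphi(x_i)
  \;\Rightarrow\; e = \Omega \alpha + b\, 1_N, \qquad
  \alpha = \gamma V e, \qquad 1_N^T \alpha = 0 \\[4pt]
& 1_N^T \alpha = \gamma\, 1_N^T V (\Omega \alpha + b\, 1_N) = 0
  \;\Rightarrow\; b = -\frac{1_N^T V \Omega \alpha}{1_N^T V 1_N} \\[4pt]
& e = \Omega \alpha - \frac{1_N 1_N^T V}{1_N^T V 1_N}\, \Omega \alpha = M_V \Omega \alpha,
  \qquad M_V = I_N - \frac{1}{1_N^T V 1_N}\, 1_N 1_N^T V \\[4pt]
& \alpha = \gamma V M_V \Omega \alpha
  \;\Rightarrow\; V M_V \Omega \alpha = \lambda \alpha, \qquad \lambda = 1/\gamma
\end{aligned}
```

The bias term $b$ is thus fixed by the condition $\sum_i \alpha_i = 0$ once $\alpha$ is known.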

(28)

Choice of weights $v_i$

• Take $V = D^{-1}$ where $D = \mathrm{diag}\{d_1, \ldots, d_N\}$ and $d_i = \sum_{j=1}^N \Omega_{ij}$

• This gives the generalized eigenvalue problem:

$$M_D \Omega \alpha = \lambda D \alpha$$

with $M_D = I_N - \frac{1}{1_N^T D^{-1} 1_N} 1_N 1_N^T D^{-1}$

This is a modified version of random walks spectral clustering.

• Note that $\mathrm{sign}[e_i] = \mathrm{sign}[\alpha_i]$ if $\gamma v_i > 0$ (on training data) ... but $\mathrm{sign}[e_*]$ applies beyond training data
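The sketch below illustrates the binary case with $V = D^{-1}$ on two hypothetical, well-separated Gaussian blobs. The data, the kernel width, the helper names, and the choice of the eigenvector with the largest eigenvalue are assumptions for illustration. The generalized problem is solved as the ordinary eigenvalue problem $D^{-1} M_D \Omega \alpha = \lambda \alpha$, and the bias is taken as $b = -\frac{1_N^T D^{-1} \Omega \alpha}{1_N^T D^{-1} 1_N}$, which follows from $\sum_i \alpha_i = 0$.

```python
import numpy as np


def rbf_kernel(X, Z, sigma2):
    """RBF kernel matrix with entries exp(-||x - z||^2 / sigma2)."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)


rng = np.random.default_rng(0)
SIGMA2 = 2.0                                  # assumed kernel parameter

# Hypothetical training data: two blobs of 40 points each.
X = np.vstack([rng.normal(-3.0, 0.5, size=(40, 2)),
               rng.normal(3.0, 0.5, size=(40, 2))])
N = X.shape[0]

Omega = rbf_kernel(X, X, SIGMA2)              # kernel matrix
d = Omega.sum(axis=1)                         # degrees d_i = sum_j Omega_ij
D_inv = np.diag(1.0 / d)
ones = np.ones((N, 1))

# Weighted centering matrix M_D = I - (1 / (1^T D^-1 1)) 1 1^T D^-1.
M_D = np.eye(N) - (ones @ ones.T @ D_inv) / (ones.T @ D_inv @ ones)

# M_D Omega alpha = lambda D alpha  <=>  D^-1 M_D Omega alpha = lambda alpha.
lam, A = np.linalg.eig(D_inv @ M_D @ Omega)
alpha = A[:, np.argmax(lam.real)].real        # eigenvector of the largest eigenvalue

b = -(ones.T @ D_inv @ Omega @ alpha).item() / (ones.T @ D_inv @ ones).item()


def score(X_new):
    """Dual model: e_* = sum_i alpha_i K(x_i, x_*) + b."""
    return rbf_kernel(X_new, X, SIGMA2) @ alpha + b


q_train = np.sign(score(X))                                   # training indicators
q_new = np.sign(score(np.array([[-3.0, -3.0], [3.0, 3.0]])))  # out-of-sample points
```

The function `score` can be evaluated at any new point, which is what allows cluster indicators beyond the training data.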

(29)

Kernel spectral clustering: more clusters

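• Case of $k$ clusters: additional sets of constraints

$$\min_{w^{(l)}, e^{(l)}, b_l} \; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)T} D^{-1} e^{(l)}$$

subject to

$$\begin{aligned} e^{(1)} &= \Phi_{N \times n_h} w^{(1)} + b_1 1_N \\ e^{(2)} &= \Phi_{N \times n_h} w^{(2)} + b_2 1_N \\ &\;\;\vdots \\ e^{(k-1)} &= \Phi_{N \times n_h} w^{(k-1)} + b_{k-1} 1_N \end{aligned}$$

where $e^{(l)} = [e_1^{(l)}; \ldots; e_N^{(l)}]$ and $\Phi_{N \times n_h} = [\varphi(x_1)^T; \ldots; \varphi(x_N)^T] \in \mathbb{R}^{N \times n_h}$.

• Dual problem: $M_D \Omega \alpha^{(l)} = \lambda D \alpha^{(l)}$, $l = 1, \ldots, k - 1$.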

(30)

Primal and dual model representations

k clusters

k − 1 sets of constraints (index l = 1, ..., k − 1)

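Model M, with two representations:

(P): $\mathrm{sign}[\hat{e}_*^{(l)}] = \mathrm{sign}[w^{(l)T} \varphi(x_*) + b_l]$

(D): $\mathrm{sign}[\hat{e}_*^{(l)}] = \mathrm{sign}[\sum_j \alpha_j^{(l)} K(x_*, x_j) + b_l]$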

• Note: additional sets of constraints also in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999]

(31)

Out-of-sample extension and coding

[Figure: two scatter plots in the $(x^{(1)}, x^{(2)})$ plane]
(33)

Piecewise constant eigenvectors and extension (1)

• Definition. [Meila & Shi, 2001] Vector $\alpha$ is called piecewise constant relative to a partition $(A_1, \ldots, A_k)$ iff $\alpha_i = \alpha_j \; \forall x_i, x_j \in A_p$, $p = 1, \ldots, k$.

• Proposition. [Alzate & Suykens, 2010] Assume

(i) a training set $\mathcal{D} = \{x_i\}_{i=1}^N$ and validation set $\mathcal{D}^v = \{x_m^v\}_{m=1}^{N_v}$ i.i.d. sampled from the same underlying distribution;

(ii) a set of $k$ clusters $\{A_1, \ldots, A_k\}$ with $k > 2$;

(iii) an isotropic kernel function such that $K(x, z) = 0$ when $x$ and $z$ belong to different clusters;

(iv) the eigenvectors $\alpha^{(l)}$ for $l = 1, \ldots, k - 1$ are piecewise constant.

Then validation set points belonging to the same cluster are collinear in the $k - 1$ dimensional subspace spanned by the columns of $E^v \in \mathbb{R}^{N_v \times (k-1)}$, where the entry $E^v_{ml}$ is the score $e^{(l)}$ evaluated at the validation point $x_m^v$.

(34)

Piecewise constant eigenvectors and extension (2)

• Key aspect of the proof: for x∗ ∈ Ap one has

$$\begin{aligned} e_*^{(l)} &= \sum_{i=1}^N \alpha_i^{(l)} K(x_i, x_*) + b^{(l)} \\ &= c_p^{(l)} \sum_{i \in A_p} K(x_i, x_*) + \sum_{i \notin A_p} \alpha_i^{(l)} K(x_i, x_*) + b^{(l)} \\ &= c_p^{(l)} \sum_{i \in A_p} K(x_i, x_*) + b^{(l)} \end{aligned}$$

• Model selection to determine kernel parameters and $k$:

Looking for line structures in the space $(e_i^{(1)}, e_i^{(2)}, \ldots, e_i^{(k-1)})$, evaluated on validation data (aiming for good generalization)

• Choice of kernel:

Gaussian RBF kernel

(35)

Model selection (looking for lines): toy problem 1

[Figure: validation score variables $(e^{(1)}_{i,val}, e^{(2)}_{i,val})$ and the data in the $(x^{(1)}, x^{(2)})$ plane with the validation set, for $\sigma^2 = 0.5$ (BLF = 0.56) and $\sigma^2 = 0.16$ (BLF = 1.0)]
(36)

Model selection (looking for lines): toy problem 2

[Figure: validation score variables $(e^{(1)}_{i,val}, e^{(2)}_{i,val})$ and the data in $(x^{(1)}, x^{(2)}, x^{(3)})$ space with the validation set, for $\sigma^2 = 1.20$ (BLF = 0.49) and $\sigma^2 = 0.003$ (BLF = 1.0)]
(37)

Example: image segmentation (looking for lines)

[Figure: 3D plot of the validation score variables $(e^{(1)}_{i,val}, e^{(2)}_{i,val}, e^{(3)}_{i,val})$]
(38)

[Figure: for each image ID (145086, 42049, 167062, 147091, 196073, 62096, 119082, 3096, 295087, 37073), the image and its segmentation by the proposed method, the Nyström method and a human]

(39)

Example: power grid - identifying customer profiles (1)

Power load: 245 substations, hourly data (5 years), d = 43,824

Periodic AR modelling: dimensionality reduction 43,824 → 24

k-means clustering applied after dimensionality reduction

[Figure: 8 panels of normalized load versus hour]

(40)

Example: power grid - identifying customer profiles (2)

Application of kernel spectral clustering, directly on d = 43,824

Model selection on kernel parameter and number of clusters

[Alzate, Espinoza, De Moor, Suykens, 2009]

[Figure: 7 panels of normalized load versus hour]

(41)

Example: power grid - identifying customer profiles (3)

[Figure: 3 panels of normalized load versus hour]

Electricity load: 245 substations in Belgian grid (1/2 train, 1/2 validation)

$x_i \in \mathbb{R}^{43{,}824}$: spectral clustering on high dimensional data (5 years)

3 of 7 detected clusters:

- 1: Residential profile: morning and evening peaks

- 2: Business profile: peaked around noon

(42)

Kernel spectral clustering: sparse kernel models

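[Figure: original image and binary clustering]

Incomplete Cholesky decomposition: $\| \Omega - G G^T \|_2 \leq \eta$ with $G \in \mathbb{R}^{N \times R}$ and $R \ll N$

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

$$e_*^{(l)} = \sum_{i \in S_{SV}} \alpha_i^{(l)} K(x_i, x_*) + b_l$$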
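A minimal sketch of a pivoted incomplete Cholesky decomposition is given below. It stops once the trace of the residual $\Omega - G G^T$ drops to $\eta$ or below; since the residual is positive semidefinite, its spectral norm is bounded by its trace, so this stopping rule is one simple (conservative) way to meet the bound. The function name, the toy data and the stopping rule are assumptions for illustration, not necessarily the variant used in the lecture.

```python
import numpy as np


def rbf_kernel(X, Z, sigma2):
    """RBF kernel matrix with entries exp(-||x - z||^2 / sigma2)."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Z ** 2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)


def incomplete_cholesky(K, eta):
    """Greedy pivoted Cholesky: returns G (N x R) and the selected pivot indices."""
    N = K.shape[0]
    diag = np.diag(K).astype(float).copy()    # diagonal of the residual K - G G^T
    G = np.zeros((N, 0))
    pivots = []
    while diag.sum() > eta and G.shape[1] < N:
        j = int(np.argmax(diag))              # pivot: largest residual diagonal entry
        col = (K[:, j] - G @ G[j, :]) / np.sqrt(diag[j])
        G = np.column_stack([G, col])
        pivots.append(j)
        diag = np.maximum(diag - col ** 2, 0.0)
    return G, pivots


rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                # hypothetical data in the unit square
K = rbf_kernel(X, X, sigma2=1.0)
eta = 1e-3

G, pivots = incomplete_cholesky(K, eta)
R = G.shape[1]                                # number of columns actually needed
err = np.linalg.norm(K - G @ G.T, 2)          # spectral norm of the residual
```

Only the pivot columns of the kernel matrix are used, so in a large-scale setting the full matrix would not need to be formed; it is built here only to check the error.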

(43)

Kernel spectral clustering: sparse kernel models

[Figure: original image and sparse kernel model]


(44)

Highly sparse kernel models on images

• Application on images:

$x_i \in \mathbb{R}^3$ (r, g, b values per pixel), $i = 1, \ldots, N$

pre-processed into $z_i \in \mathbb{R}^8$ (quantization to 8 colors)

$\chi^2$-kernel to compare two local color histograms (5 × 5 pixels window)

• $N > 100{,}000$, select subset $M \ll N$ based on quadratic Rényi entropy as in the fixed-size method [Suykens et al., 2002]

• Highly sparse representations: # SV = 3k

• Completion of cluster indicators based on out-of-sample extensions

$$\mathrm{sign}[\hat{e}_*^{(l)}] = \mathrm{sign}\Big[\sum_{j \in S_{SV}} \alpha_j^{(l)} K(x_*, x_j) + b_l\Big]$$

applied to the full image

(45)

Highly sparse kernel models: toy example 1

[Figure: data in the $(x^{(1)}, x^{(2)})$ plane and score variables $(e_i^{(1)}, e_i^{(2)})$]
(46)

Highly sparse kernel models: toy example 2

[Figure: data in the $(x^{(1)}, x^{(2)})$ plane]
(47)

Highly sparse kernel models: toy example 2

[Figure: data in the $(x^{(1)}, x^{(2)})$ plane]
(48)

Highly sparse kernel models: toy example 2

[Figure: 3D plot of the score variables $(\hat{e}_i^{(1)}, \hat{e}_i^{(2)}, \hat{e}_i^{(3)})$]
(49)

Highly sparse kernel models: image segmentation

[Figure: 3D plot of the score variables $(e_i^{(1)}, e_i^{(2)}, e_i^{(3)})$]
(50)

Highly sparse kernel models: image segmentation

[Figure: 3D plot of the score variables $(e_i^{(1)}, e_i^{(2)}, e_i^{(3)})$ with a small number of points]
(51)

Hierarchical kernel spectral clustering

Hierarchical kernel spectral clustering:

- looking at different scales

- use of model selection and validation data

(52)

Kernel spectral clustering: adding prior knowledge

• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link

• Primal problem [Alzate & Suykens, IJCNN 2009]

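$$\min_{w^{(l)}, e^{(l)}, b_l} \; -\frac{1}{2} \sum_{l=1}^{k-1} w^{(l)T} w^{(l)} + \frac{1}{2} \sum_{l=1}^{k-1} \gamma_l \, e^{(l)T} D^{-1} e^{(l)}$$

subject to

$$\begin{aligned} e^{(1)} &= \Phi_{N \times n_h} w^{(1)} + b_1 1_N \\ &\;\;\vdots \\ e^{(k-1)} &= \Phi_{N \times n_h} w^{(k-1)} + b_{k-1} 1_N \\ w^{(1)T} \varphi(x^\dagger) &= c \, w^{(1)T} \varphi(x^\ddagger) \\ &\;\;\vdots \\ w^{(k-1)T} \varphi(x^\dagger) &= c \, w^{(k-1)T} \varphi(x^\ddagger) \end{aligned}$$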

(54)

Adding prior knowledge

(55)

Adding prior knowledge

(56)

Semi-supervised learning

• N unlabeled data, but additional labels on M − N data

X = {x1, ..., xN,xN+1, ..., xM}

• Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]:

$$\min_{w,e,b} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} e^T D^{-1} e + \rho \frac{1}{2} \sum_{m=N+1}^{M} (e_m - y_m)^2 \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, M$$

Dual solution is characterized by a linear system.

(59)

Modularity and community detection

• Modularity for two-group case [Newman, 2006]:

$$Q = \frac{1}{4m} \sum_{i,j} \Big( A_{ij} - \frac{d_i d_j}{2m} \Big) q_i q_j$$

with $A$ the adjacency matrix, $d_i$ the degree of node $i$, $m = \frac{1}{2} \sum_i d_i$, $q_i = 1$ if node $i$ belongs to group 1 and $q_i = -1$ for group 2.

• Use of modularity within kernel spectral clustering [Langone et al., 2012]:

– use at the level of model validation

– finding representative subgraph using a fixed-size method by maximizing the expansion factor [Maiya, 2010] $\frac{|N(G)|}{|G|}$ with a subgraph $G$ and its neighborhood $N(G)$.

– definition of data in unweighted networks: $x_i = A(:, i)$; use of a community kernel function [Kang, 2009].
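The two-group modularity can be evaluated directly from the adjacency matrix. The sketch below does so for a hypothetical 6-node graph (two triangles joined by one edge) and compares a natural split with an arbitrary alternating one; the graph, the function name and both partitions are assumptions for illustration.

```python
import numpy as np


def modularity_two_groups(A, q):
    """Q = (1 / 4m) * sum_{i,j} (A_ij - d_i d_j / 2m) q_i q_j with q_i in {-1, +1}."""
    d = A.sum(axis=1)                   # node degrees
    m = 0.5 * d.sum()                   # number of edges
    B = A - np.outer(d, d) / (2.0 * m)  # modularity matrix
    return float(q @ B @ q) / (4.0 * m)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_good = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # one triangle per group
q_bad = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])    # alternating assignment

Q_good = modularity_two_groups(A, q_good)
Q_bad = modularity_two_groups(A, q_bad)
```

A partition that follows the community structure yields the larger value of $Q$, which is what makes modularity usable as a validation criterion.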

(60)

Protein interaction network (1)

[Figure: network visualization (Pajek)]

(61)

Protein interaction network (2)

- Yeast interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001]

- KSC community detection, representative subgraph [Langone et al., 2012]

[Figure: modularity versus number of clusters; 7 detected clusters]

(62)

Power grid network (1)

[Figure: network visualization (Pajek)]

(63)

Power grid network (2)

- Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998]

- KSC community detection, representative subgraph [Langone et al., 2012]

[Figure: modularity versus number of clusters; 16 detected clusters]

(64)

Evolving networks

• Binary clustering case: adding a memory effect [Langone et al, 2012]

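$$\min_{w,e,b} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} e^T D^{-1} e - \nu \, w^T w_{\text{old}} \quad \text{subject to} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \ldots, N$$

with $w_{\text{old}}$ the previous result in time.

• Aims at including temporal smoothness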

(66)

Dimensionality reduction and data visualization

• Traditionally:

commonly used techniques are e.g. principal component analysis (PCA), multi-dimensional scaling (MDS), self-organizing maps (SOM)

• More recently:

isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps

(“kernel eigenmap methods and manifold learning”)

[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:

(67)

Kernel maps with reference point: formulation

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:

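- LS-SVM core part: realize dimensionality reduction $x \mapsto z$

- Regularization term: $(z - P_D z)^T (z - P_D z) = \sum_{i=1}^N \| z_i - \sum_{j=1}^N s_{ij} D z_j \|_2^2$

with $D$ a diagonal matrix and $s_{ij} = \exp(-\| x_i - x_j \|_2^2 / \sigma^2)$

- reference point $q$ (e.g. first point; sacrificed in the visualization)

• Example: $d = 2$

$$\min_{z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}} \; \frac{1}{2} (z - P_D z)^T (z - P_D z) + \frac{\nu}{2} (w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)$$

such that

$$\begin{aligned} c_{1,1}^T z &= q_1 + e_{1,1} \\ c_{1,2}^T z &= q_2 + e_{1,2} \\ c_{i,1}^T z &= w_1^T \varphi_1(x_i) + b_1 + e_{i,1}, \; \forall i = 2, \ldots, N \\ c_{i,2}^T z &= w_2^T \varphi_2(x_i) + b_2 + e_{i,2}, \; \forall i = 2, \ldots, N \end{aligned}$$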

(70)

Model selection by validation

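Model selection criterion:

$$\min_{\Theta} \; \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\| \hat{z}_i \|_2 \| \hat{z}_j \|_2} - \frac{x_i^T x_j}{\| x_i \|_2 \| x_j \|_2} \right)^2$$

Tuning parameters $\Theta$:

• Kernel tuning parameters in $s_{ij}$, $K_1$, $K_2$, $(K_3)$

• Regularization constants $\nu$, $\eta$

• Choice of the diagonal matrix $D$

• Choice of reference point $q$, e.g. $q \in \{[+1; +1], [+1; -1], [-1; +1], [-1; -1]\}$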
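The criterion compares pairwise cosine similarities in the embedding with those in the input space. A minimal sketch of how it could be evaluated for one candidate embedding is shown below; the function names and the random test data are assumptions, and all points are assumed to have nonzero norm.

```python
import numpy as np


def cosine_matrix(V):
    """Matrix of pairwise cosine similarities between the rows of V (nonzero rows assumed)."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ U.T


def selection_criterion(Z_hat, X):
    """Sum over all pairs (i, j) of the squared difference in cosine similarity."""
    diff = cosine_matrix(Z_hat) - cosine_matrix(X)
    return float(np.sum(diff ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))           # hypothetical input data in R^3

# A rescaled copy of the inputs preserves all angles, so the criterion vanishes.
crit_scaled = selection_criterion(2.5 * X, X)

# Dropping a coordinate distorts the angles, so the criterion becomes positive.
crit_projected = selection_criterion(X[:, :2], X)
```

In a model selection loop, this value would be computed on validation data for each candidate setting of $\Theta$, and the setting with the smallest value retained.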

(71)

Kernel maps: spiral example

[Figure: 3D spiral data in $(x_1, x_2, x_3)$ and the 2D kernel maps $(\hat{z}_1, \hat{z}_2)$ for $q = [+1; -1]$ and $q = [-1; -1]$]

training data (blue *), validation data (magenta o), test data (red +)

Model selection: $\min \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\| \hat{z}_i \|_2 \| \hat{z}_j \|_2} - \frac{x_i^T x_j}{\| x_i \|_2 \| x_j \|_2} \right)^2$

(72)

Kernel maps: swiss roll example

[Figure: given 3D swiss roll data in $(x_1, x_2, x_3)$ and the 2D kernel map result in $(z_1, z_2)$]

(73)

Kernel maps: visualizing gene distribution

[Figure: 3D kernel map in $(z_1, z_2, z_3)$]

Alon colon cancer microarray data set: 3D projections

Dimension input space: 62

(74)

Kernel maps: time-series data visualization

Santa Fe laser data

[Figure: Santa Fe laser time series versus discrete time $k$, and its 3D kernel map in $(\hat{z}_1, \hat{z}_2, \hat{z}_3)$]

Data $\{y_{k|k-9}\}_{k=1}^N$: train, validation, test

Tuning parameters (kernel & regularization) based on validation set

Model is able to make out-of-sample extensions

(75)

Conclusions

• From PCA to KPCA

• LS-SVM model framework with primal-dual setting

• Out-of-sample extensions, model selection procedures and large scale methods

• From spectral clustering to kernel spectral clustering

• Applications in complex networks

• Data visualization problems: learning and generalization

(76)

Acknowledgements (1)

• Colleagues at ESAT-SCD (especially research units: systems, models, control - biomedical data processing - bioinformatics):

C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, L. De Lathauwer, B. De Moor, M. Diehl, Ph. Dreesen, M. Espinoza, T. Falck, D. Geebelen, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, M. Moonen, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, P. Tsiaflakis, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, T. van Waterschoot, C. Varon, S. Yu, and others

• Topics of this lecture: Carlos Alzate and Rocco Langone.

• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST
