Kernel methods for exploratory data analysis
and community detection
Johan Suykens
KU Leuven, ESAT-SCD/SISTA Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium Email: [email protected]
http://www.esat.kuleuven.be/scd/
Overview
• Principal component analysis
• Kernel principal component analysis and LS-SVM; sparse and robust extensions
• Kernel spectral clustering
• Community detection in complex networks
Principal component analysis (1)
• Given a data cloud, potentially in a high dimensional input space
× × × × × × × × × × × × ×× × × × × × × × × × × × × × × × × × × × × × × × ×× × × × × × × × × × × ×
• assume an ellipsoidal data cloud
Principal component analysis (2)
• Given data {xi}Ni=1 with xi ∈ Rn (assumed zero mean)• Find projected variables wTxi with maximal variance
max
w E{(w
Tx)2} = wTE{xxT}w
= wT C w
with covariance matrix C = E{xxT} and E{·} the expected value.
• For N given data points one has C ≃ N1
N
X
i=1
xixTi .
• Problem: the optimal solution for w in the above problem is unbounded. Therefore an additional constraint should be taken: a common choice is to impose wTw = 1.
Principal component analysis (2)
• Given data {xi}Ni=1 with xi ∈ Rn (assumed zero mean)• Find projected variables wTxi with maximal variance
max
w E{(w
Tx)2} = wTE{xxT}w
= wT C w
with covariance matrix C = E{xxT} and E{·} the expected value.
• For N given data points one has C ≃ N1
N
X
i=1
xixTi .
• Problem: the optimal solution for w in the above problem is unbounded.
Therefore an additional constraint should be taken: a common choice is
Principal component analysis (3)
• The problem formulation becomes then:max
w w
T C w subject to wTw = 1
• Constrained optimization problem solved by taking the Lagrangian L:
L(w;λ) = 1 2w
TCw − λ(wTw − 1)
with Lagrange multiplier λ.
• Solution is given by the eigenvalue problem
Cw = λw
Principal component analysis (4)
u2 u1 µ x1 x2 λ1/2 1 λ1/2 2Illustration of an eigenvalue decomposition Cu = λu
- the eigenvalues λi are real and positive
- the eigenvectors u1 and u2 are orthogonal with respect to each other
- maximal variance solution is the direction u1 corresponding to λmax = λ1. - note that µ = 0 (the data should be made zero-mean beforehand)
Principal component analysis: dimensionality reduction (1)
• Aim:Decreasing the dimensionality of the given input space by mapping vectors x ∈ Rn to z ∈ Rm with m < n.
• A point x is mapped to z in the lower dimensional space by
z(j) = uTj x
where uj are the eigenvectors corresponding to the m largest eigenvalues and z = [z(1)z(2)...z(m)]T.
• The error resulting from the dimensionality reduction is characterized by
the neglected eigenvalues, i.e. Pn
Principal component analysis: dimensionality reduction (2)
i λi
In this example reducing the original 6-dimensional space to a 2-dimensional space is a good choice because the largest two eigenvalues λ1 and λ2 are much larger than the other ones.
Principal component analysis: reconstruction problem (1)
• Consider x ∈ Rn and z ∈ Rm with m ≪ n (dimensionality reduction).• Encoder mapping:
z = G(x)
Decoder mapping:
˜
x = F(z)
• Objective: squared distortion error (reconstruction error)
minE = 1 N N X i=1 kxi − x˜ik22 = 1 N N X i=1 kxi − F(G(xi))k22
Principal component analysis: reconstruction problem (2)
x
G
(
x
)
z
F
(
z
)
x
˜
Principal component analysis: denoising example
Images: 16 × 15 pixels (n = 240)
Training on N = 200 clean digits (not containing digit 9)
Test data: (bottom-left) denoised digit 9 after reconstruction using 10 principal components; (bottom-right) 180 principal components
Overview
• Principal component analysis
• Kernel principal component analysis and LS-SVM; sparse and
robust extensions
• Kernel spectral clustering
• Community detection in complex networks
Kernel principal component analysis
−1.5 −1 −0.5 0 0.5 1 −2.5 −2 −1.5 −1 −0.5 0 0.5 1 1.5 linear PCA −1.5 −1 −0.5 0 0.5 1 −2 −1.5 −1 −0.5 0 0.5 1 1.5kernel PCA (RBF kernel)
Kernel PCA [Sch¨olkopf et al., 1998]:
by eigenvalue decomposition of K(x1, x1) ... K(x1, xN) ... ... x(2) x(1)
Kernel PCA: primal and dual problem
• Underlying primal problem [Suykens et al., IEEE-TNN 2003]• Primal problem: min w,b,e ∓ 1 2w Tw ± 1 2γ N X i=1 e2i s.t. ei = wTϕ(xi) + b, i = 1, ..., N.
• (Lagrange) dual problem = kernel PCA :
Ωcα = λα with λ = 1/γ
with Ωc,ij = (ϕ(xi) − µˆϕ)T(ϕ(xj) − µˆϕ) the centered kernel matrix.
• Interpretation:
1. pool of candidates components (objective function equals zero) 2. select relevant components (components with high variance)
Kernel PCA: model representations
Primal and dual model representations:
(P) : eˆ∗ = wTϕ(x∗) + b ր
M
ց
(D) : eˆ∗ = Pi αiK(x∗, xi) + b
which can be evaluated at any point x∗ ∈ Rd, where
K(x∗, xi) = ϕ(x∗)Tϕ(xi)
Generalizations to Kernel PCA: other loss functions
• Consider general loss function L:min w,b,e − 1 2w Tw + 1 2γ N X i=1 L(ei) s.t. ei = wTϕ(xi) + b, i = 1, ..., N.
Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ǫ-insensitive loss, Huber loss function [Alzate & Suykens, 2006].
• Weighted least squares versions and incorporation of constraints:
min w,b,e − 1 2w Tw + 1 2γ N X i=1 vie2i s.t. ei = wTϕ(xi) + b, i = 1, ..., N PN i=1 eie (1) i = 0 ... PN i=1 eie (l−1) i = 0
Find l-th PC w.r.t. l − 1 orthogonality constraints (previous PC e(ij)). The solution is given by a generalized eigenvalue problem.
Robustness: Kernel Component Analysis
original image corrupted image
KPCA reconstruction
Robustness: Kernel Component Analysis
original image corrupted image
KPCA reconstruction KCA reconstruction
Generalizations to Kernel PCA: sparseness
−0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC1 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC2 −0.5 0 0.5 1 1.5 2 2.5 −1 −0.5 0 0.5 1 1.5 2 2.5 x 1 x2 PC3Sparse kernel PCA using ǫ-insensitive loss [Alzate & Suykens, 2006]
Overview
• Principal component analysis
• Kernel principal component analysis and LS-SVM; sparse and robust extensions
• Kernel spectral clustering
• Community detection in complex networks
Spectral graph clustering
Minimal cut: given the graph G = (V, E), find clusters A1,A2
min qi∈{−1,+1} 1 2 X i,j wij(qi − qj)2
with cluster membership indicator qi (qi = 1 if i ∈ A1, qi = −1 if i ∈ A2) and W = [wij] the weighted adjacency matrix.
2 3 4 5 6 1 cut of size 2 cut of size 1 (minimal cut)
Spectral graph clustering
• Relaxation to Min-cut spectral clustering problemmin
˜
qTq˜=1 q˜
TLq˜
with L = D − W the unnormalized graph Laplacian, degree matrix
D = diag(d1, ..., dN), di = P
j wij, giving
Lq˜ = λq˜.
Cluster member indicators: qˆi = sign(˜qi − θ) with threshold θ.
• Normalized cut
Lq˜ = λDq˜
[Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997; von Luxburg, 2007] • Discrete version to continuous problem (Laplace operator)
Kernel spectral clustering: case of two clusters
• Underlying model (primal representation):
ˆ
e∗ = wTϕ(x∗) + b
with qˆ∗ = sign[ˆe∗] the estimated cluster indicator at any x∗ ∈ Rd.
• Primal problem: training on given data {xi}Ni=1
min w,b,e − 1 2w Tw + γ1 2 N X i=1 vi e2i subject to ei = wTϕ(xi) + b, i = 1, ..., N
with positive weights vi (will be related to inverse degree matrix).
Lagrangian and conditions for optimality
• Lagrangian: L(w, b, e;α) = −1 2w Tw + γ 1 2 N X i=1 vie2i − N X i=1 αi(ei − wTϕ(xi) − b)• Conditions for optimality:
∂L ∂w = 0 ⇒ w = P i αiϕ(xi) ∂L ∂b = 0 ⇒ P iαi = 0 ∂L ∂ei = 0 ⇒ αi = γviei, i = 1, ..., N ∂L ∂αi = 0 ⇒ ei = w Tϕ(x i) + b, i = 1, ..., N
Kernel-based model representation
• Dual problem: V MV Ωα = λα with λ = 1/γ MV = IN − 1T 1 NV1N1N1TNV : weighted centering matrix
Ω = [Ωij]: kernel matrix with Ωij = ϕ(xi)Tϕ(xj) = K(xi, xj)
• Dual model representation:
ˆ e∗ = N X i=1 αiK(xi, x∗) + b with K(xi, x∗) = ϕ(xi)Tϕ(x∗).
Choice of weights
v
i• Take V = D−1 where D = diag{d1, ..., dN} and di = PNj=1 Ωij
• This gives the generalized eigenvalue problem:
MDΩα = λDα
with MD = IN − 1T 1 ND−11N
1N1TND−1
This is a modified version of random walks spectral clustering.
• Note that sign[ei] = sign[αi] if γvi > 0 (on training data) ... but sign[e∗] applies beyond training data
Kernel spectral clustering: more clusters
• Case of k clusters: additional sets of constraintsmin w(l),e(l),bl −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γle(l)TD−1e(l) subject to e(1) = ΦN×n hw (1) + b 11N e(2) = ΦN×n hw (2) + b 21N ... e(k−1) = ΦN×n hw (k−1) + b k−11N where e(l) = [e(1l);...;e(Nl)] and ΦN×n h = [ϕ(x1) T;...;ϕ(x N)T] ∈ RN×nh. • Dual problem: MDΩα(l) = λDα(l), l = 1, ..., k − 1.
Primal and dual model representations
k clustersk − 1 sets of constraints (index l = 1, ..., k − 1)
(P) : sign[ˆe(∗l)] = sign[w(l) T ϕ(x∗) + bl] ր M ց (D) : sign[ˆe(∗l)] = sign[Pj αj(l)K(x∗, xj) + bl]
• Note: additional sets of constraints also in multi-class and vector-valued output LS-SVMs [Suykens et al., 1999]
Out-of-sample extension and coding
−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x(1) x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x(1) x (2 )Out-of-sample extension and coding
−12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x(1) x (2 ) −12 −10 −8 −6 −4 −2 0 2 4 6 −10 −8 −6 −4 −2 0 2 4 6 8 x(1) x (2 )Piecewise constant eigenvectors and extension (1)
• Definition. [Meila & Shi, 2001] Vector α is called piecewise constant
relative to a partition (A1, ..., Ak) iff αi = αj ∀xi, xj ∈ Ap, p = 1, ..., k.
• Proposition. [Alzate & Suykens, 2010] Assume
(i) a training set D = {xi}Ni=1 and validation set Dv = {xvm}Nmv=1
i.i.d. sampled from the same underlying distribution; (ii) a set of k clusters {A1, ...,Ak} with k > 2;
(iii) an isotropic kernel function such that K(x, z) = 0 when x and z
belong to different clusters;
(iv) the eigenvectors α(l) for l = 1, ..., k − 1 are piecewise constant.
Then validation set points belonging to the same cluster are collinear
in the k − 1 dimensional subspace spanned by the columns of Ev ∈
RNv×(k−1) where Ev
ml = e
(l)
Piecewise constant eigenvectors and extension (2)
• Key aspect of the proof: for x∗ ∈ Ap one has
e∗(l) = PNi=1 αi(l)K(xi, x∗) + b(l)
= c(pl)Pi∈Ap K(xi, x∗) + Pi /N∈Ap α(il)K(xi, x∗) + b(l)
= c(pl)Pi∈Ap K(xi, x∗) + b(l)
• Model selection to determine kernel parameters and k:
Looking for line structures in the space (e(1)i , e(2)i , ..., e(ik−1)), evaluated on validation data (aiming for good generalization)
• Choice kernel:
Gaussian RBF kernel
Model selection (looking for lines): toy problem 1
−0.4 −0.2 0 0.2 0.4 0.6 0.8 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 σ2 = 0.5,BLF= 0.56 e(1) i,val e (2 ) i, va l −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x(1) x (2 ) −0.4 −0.2 0 0.2 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 0.8 σ2 = 0.16,BLF= 1.0 e(1) i,val e (2 ) i, va l validation set −30 −20 −10 0 10 20 30 −25 −20 −15 −10 −5 0 5 10 15 20 25 x(1) x (2 )Model selection (looking for lines): toy problem 2
−8 −6 −4 −2 0 2 4 −6 −4 −2 0 2 4 6 8 σ2 = 1.20,BLF = 0.49 e(1) i,val e (2 ) i, va l −2 0 2 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 x(1) x(2) x (3 ) −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 σ2 = 0.003,BLF= 1.0 e(1) i,val e (2 ) i, va l validation set −2 0 2 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 x(1) x(2) x (3 )Example: image segmentation (looking for lines)
−5 −4 −3 −2 −1 0 1 2 −3 −2 −1 0 1 2 3 −2 −1 0 1 2 3 4 e(1) i,val e(2) i,val e (3 ) i, va l145086 42049 167062 147091 196073 62096 119082 3096 295087 37073
Image Proposed method Nyström method Human Image ID
Example: power grid - identifying customer profiles (1)
Power load: 245 substations, hourly data (5 years), d = 43.824
Periodic AR modelling: dimensionality reduction 43.824 → 24
k-means clustering applied after dimensionality reduction
2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 2 4 6 8 10 12 14 16 18 20 22 24 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad
Example: power grid - identifying customer profiles (2)
Application of kernel spectral clustering, directly on d = 43.824
Model selection on kernel parameter and number of clusters
[Alzate, Espinoza, De Moor, Suykens, 2009]
5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad
Example: power grid - identifying customer profiles (3)
5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo ad 5 10 15 20 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 hour n or m al iz ed lo adElectricity load: 245 substations in Belgian grid (1/2 train, 1/2 validation)
xi ∈ R43.824: spectral clustering on high dimensional data (5 years)
3 of 7 detected clusters:
- 1: Residential profile: morning and evening peaks
- 2: Business profile: peaked around noon
Kernel spectral clustering: sparse kernel models
original image binary clustering
Incomplete Cholesky decomposition: kΩ − GGTk2 ≤ η with G ∈ RN×R and R ≪ N
Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV e(∗l) = Pi∈S
SV α
(l)
Kernel spectral clustering: sparse kernel models
original image sparse kernel model
Incomplete Cholesky decomposition: kΩ − GGTk2 ≤ η with G ∈ RN×R and R ≪ N
Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV e(∗l) = Pi∈S
SV α
(l)
Highly sparse kernel models on images
• application on images:xi ∈ R3 (r,g,b values per pixel), i = 1, ..., N
pre-processed into zi ∈ R8 (quantization to 8 colors)
χ2-kernel to compare two local color histograms (5 × 5 pixels window)
• N > 100.000, select subset M ≪ N based on quadratic Renyi entropy as in the fixed-size method [Suykens et al., 2002]
• Highly sparse representations: # SV = 3 k
• Completion of cluster indicators based on out-of-sample extensions
sign[ˆe(∗l)] = sign[Pj∈SSV αj(l)K(x∗, xj) + bl]
applied to the full image
Highly sparse kernel models: toy example 1
−5 0 5 10 −6 −4 −2 0 2 4 6 8 x(1) x (2 ) −3 −2 −1 0 1 2 3 −4 −3 −2 −1 0 1 2 3 4 e(1)i e (2 ) iHighly sparse kernel models: toy example 2
−4 −3 −2 −1 0 1 2 3 4 −4 −3 −2 −1 0 1 2 3 4 x(1) x(2 )Highly sparse kernel models: toy example 2
−5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5 x(1) x(2 )Highly sparse kernel models: toy example 2
−4 −2 0 2 4 6 −4 −2 0 2 4 −4 −2 0 2 4 ˆ e(1)i ˆ e(2)i ˆ e (3 ) iHighly sparse kernel models: image segmentation
* * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * *** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * ** * ** * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * ** * * * * * * * * ** * * * * * * * * * * * * * ** * * * * * * * ** * * * * * * * * * * * * * * ** * * * * * ** * * * ** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *** * * * * * * ** * * ** * * * * * * * * * * * * * * * ** *** * * * * * ** * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * ** * * * * * ** * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * ** * ** * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ** * *** * * * ** * * * * * * * * * * * * * * * * * * * * * * * * * * ** * * * * * * −1.5 −1 −0.5 0 0.5 −3 −2 −1 0 1 2 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 e(1)i e(2)i e (3 ) iHighly sparse kernel models: image segmentation
* * * * * * * * * * * * −1.5 −1 −0.5 0 0.5 −3 −2 −1 0 1 2 −3 −2.5 −2 −1.5 −1 −0.5 0 0.5 e(1)i e(2)i e (3 ) iHierarchical kernel spectral clustering
Hierarchical kernel spectral clustering: - looking at different scales
- use of model selection and validation data
Kernel spectral clustering: adding prior knowledge
• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link• Primal problem [Alzate & Suykens, IJCNN 2009]
min w(l),e(l),bl −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γle(l)TD−1e(l) subject to e(1) = ΦN×n hw (1) + b 11N ... e(k−1) = ΦN×n hw (k−1) + b k−11N w(1)Tϕ(x†) = cw(1) T ϕ(x‡) ... w(k−1)Tϕ(x†) = cw(k−1) T ϕ(x‡)
Kernel spectral clustering: adding prior knowledge
• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link• Primal problem [Alzate & Suykens, IJCNN 2009]
min w(l),e(l),bl −1 2 k−1 X l=1 w(l)Tw(l) + 1 2 k−1 X l=1 γle(l)TD−1e(l) subject to e(1) = ΦN×n hw (1) + b 11N ... e(k−1) = ΦN×n hw (k−1) + b k−11N w(1)Tϕ(x†) = cw(1) T ϕ(x‡) ... w(k−1)Tϕ(x†) = cw(k−1) T ϕ(x‡)
Adding prior knowledge
Adding prior knowledge
Semi-supervised learning
• N unlabeled data, but additional labels on M − N data
X = {x1, ..., xN,xN+1, ..., xM}
• Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]:
min w,e,b 1 2w Tw − γ 1 2e TD−1e+ρ 1 2 M X m=N+1 (em − ym)2 subject to ei = wTϕ(xi) + b, i = 1, ..., M
Dual solution is characterized by a linear system.
Semi-supervised learning
• N unlabeled data, but additional labels on M − N data
X = {x1, ..., xN,xN+1, ..., xM}
• Binary classification by using a binary spectral clustering core model [Alzate & Suykens, WCCI 2012]:
min w,e,b 1 2w Tw − γ 1 2e TD−1e+ρ 1 2 M X m=N+1 (em − ym)2 subject to ei = wTϕ(xi) + b, i = 1, ..., M
Dual solution is characterized by a linear system.
Overview
• Principal component analysis
• Kernel principal component analysis and LS-SVM; sparse and robust extensions
• Kernel spectral clustering
• Community detection in complex networks
Modularity and community detection
• Modularity for two-group case [Newman, 2006]:
Q = 1 4m X i,j (Aij − didj 2m )qiqj
with A adjacency matrix, di degree of node i, m = 12 P
i di,
qi = 1 if node i belongs to group 1 and qi = −1 for group 2.
• Use of modularity within kernel spectral clustering [Langone et al., 2012]:
– use at the level of model validation
– finding representative subgraph using a fixed-size method by maximizing the expansion factor [Maiya, 2010] |N|G|(G)| with a subgraph
G and its neighborhood N(G).
– definition data in unweighted networks: xi = A(:, i); use of a community kernel function [Kang, 2009].
Protein interaction network (1)
Pajek
Protein interaction network (2)
- Yeast interaction network: 2114 nodes, 4480 edges [Barabasi et al., 2001] - KSC community detection, representative subgraph [Langone et al., 2012]
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 number of clusters m o d u la ri ty 7 detected clusters
Power grid network (1)
Pajek
Power grid network (2)
- Western USA power grid: 4941 nodes, 6594 edges [Watts & Strogatz, 1998] - KSC community detection, representative subgraph [Langone et al., 2012]
20 40 60 80 100 120 140 160 180 200 220 240 0.35 0.4 0.45 0.5 0.55 number of clusters m o d u la ri ty 16 detected clusters
Evolving networks
• Binary clustering case: adding a memory effect [Langone et al, 2012]
min w,e,b 1 2w Tw − γ 1 2e TD−1e−ν wTw old subject to ei = wTϕ(xi) + b, i = 1, ..., N
with wold the previous result in time. • Aims at including temporal smoothness
Overview
• Principal component analysis
• Kernel principal component analysis and LS-SVM; sparse and robust extensions
• Kernel spectral clustering
• Community detection in complex networks
Dimensionality reduction and data visualization
• Traditionally:
commonly used techniques are e.g. principal component analysis (PCA), multi-dimensional scaling (MDS), self-organizing maps (SOM)
• More recently:
isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps
(“kernel eigenmap methods and manifold learning”)
[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]
• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:
Kernel maps with reference point: formulation
• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:- LS-SVM core part: realize dimensionality reduction x 7→ z
- Regularization term: (z − PDz)T(z − PDz) = PNi=1 kzi − PjN=1 sijDzjk22
with D diagonal matrix and sij = exp(−kxi − xjk22/σ2)
- reference point q (e.g. first point; sacrificed in the visualization)
• Example: d = 2 min z,w1,w2,b1,b2,ei,1,ei,2 1 2(z − PDz) T (z − PDz)+ ν 2 (w T 1 w1 + w T 2 w2) + η 2 N X i=1 (e2i,1 + e2i,2) such that cT1,1z = q1 + e1,1 cT1,2z = q2 + e1,2 cTi,1z = wT1 ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N cTi,2z = wT2 ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N
Kernel maps with reference point: formulation
• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:- LS-SVM core part: realize dimensionality reduction x 7→ z
- Regularization term: (z−PDz)T(z−PDz) = PiN=1 kzi−PjN=1 sijDzjk22
with D diagonal matrix and sij = exp(−kxi − xjk22/σ2)
- reference point q (e.g. first point; sacrificed in the visualization)
• Example: d = 2 min z,w1,w2,b1,b2,ei,1,ei,2 1 2(z − PDz) T (z − PDz)+ ν 2 (w T 1 w1 + w T 2 w2) + η 2 N X i=1 (e2i,1 + e2i,2) such that cT1,1z = q1 + e1,1 cT1,2z = q2 + e1,2 cTi,1z = wT1 ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N cTi,2z = wT2 ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N
Kernel maps with reference point: formulation
• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:- LS-SVM core part: realize dimensionality reduction x 7→ z
- Regularization term: (z−PDz)T(z−PDz) = PiN=1 kzi−PjN=1 sijDzjk22
with D diagonal matrix and sij = exp(−kxi − xjk22/σ2)
- reference point q (e.g. first point; sacrificed in the visualization)
• Example: d = 2 min z,w1,w2,b1,b2,ei,1,ei,2 1 2(z − PDz) T (z − PDz)+ ν 2 (w T 1 w1 + w T 2 w2) + η 2 N X i=1 (e2i,1 + e2i,2) such that cT1,1z = q1 + e1,1 cT1,2z = q2 + e1,2 cTi,1z = wT1 ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N cTi,2z = wT2 ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N
Model selection by validation
Model selection criterion: min
Θ X i,j ˆ ziTzˆj kˆzik2kˆzjk2 − xTi xj kxik2kxjk2 2 Tuning parameters Θ:
• Kernels tuning parameters in sij, K1, K2,(K3)
• Regularization constants ν, η
• Choice of the diagonal matrix D
• Choice of reference point q, e.g. q ∈ {[+1; +1],[+1;−1],[−1; +1],[−1,−1]}
Kernel maps: spiral example
−1.5 −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 −0.5 0 0.5 x1 x2 x3 q = [+1;−1] −0.02 −0.015 −0.01 −0.005 0 0.005 0.01 −2 0 2 4 6 8 10 12x 10 −3 z1hat z2 hat q = [−1;−1] −0.01−2 −0.005 0 0.005 0.01 0.015 0.02 0 2 4 6 8 10 12x 10 −3 z1hat z2 hattraining data (blue *), validation data (magenta o), test data (red +)
Model selection: minX i,j ˆ ziTzˆj kˆzik2kˆzjk2 − xTi xj kxik2kxjk2 2
Kernel maps: swiss roll example
−1 −0.5 0 0.5 1 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 −0.8 −0.6 −0.4 −0.2 0 0.2 0.4 0.6 x 1 x 2 x 3 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 0 x 10−3 0 0.5 1 1.5 2 2.5 3x 10 −3 z 1 z2Given 3D swiss roll data Kernel map result - 2D
Kernel maps: visualizing gene distribution
−2.35 −2.3 −2.25 −2.2 −2.15 −2.1−2.05 −2 −1.95 x 10−3 1.9 2 2.1 2.2 2.3 x 10−3 1.7 1.8 1.9 2 2.1 x 10−3 z 1 z 2 z3Alon colon cancer microarray data set: 3D projections Dimension input space: 62
Kernel maps: time-series data visualization
Santa Fe laser data
0 100 200 300 400 500 600 700 800 900 0 50 100 150 200 250 300 discrete time k 1 1.5 2 2.5 3 3.5 4 x 10−3 −4 −3.5 −3 −2.5 −2 −1.5 −1 x 10−3 2.3 2.4 2.5 2.6 2.7 2.8 2.9 x 10−3 z 1hat z 2hat z3 hat
Data {yk|k−9}Nk=1: train, validation, test
Tuning parameters (kernel & regularization) based on validation set Model is able to make out of sample extensions
Conclusions
• From PCA to KPCA• LS-SVM model framework with primal-dual setting
• Out-of-sample extensions, model selection procedures and large scale methods
• From spectral clustering to kernel spectral clustering
• Applications in complex networks
• Data visualization problems: learning and generalization
Acknowledgements (1)
• Colleagues at ESAT-SCD (especially research units: systems, models, control - biomedical data processing - bioinformatics):
C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, L. De Lathauwer, B. De Moor, M. Diehl, Ph. Dreesen, M. Espinoza, T. Falck, D. Geebelen, X. Huang, B. Hunyadi, A. Installe, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, M. Moonen, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, P. Tsiaflakis, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, T. van Waterschoot, C. Varon, S. Yu, and others
• Topics of this lecture: Carlos Alzate and Rocco Langone.
• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST