Prototype based methods:
Mathematical foundations,
interpretability, and data visualization
Barbara Hammer, Xibin Zhu
CITEC Centre of Excellence
Bielefeld University
http://www.techfak.uni-bielefeld.de/~xzhu/ijcnn14_tutorial.html
Why LVQ?
[Machine Learning that Matters, Kiri L. Wagstaff, ICML 2012]
.... of 152 non-cross-conference papers published at ICML 2011:
!
there is a need for machine learning techniques which facilitate a
direct interpretation of the results
Why LVQ?
!
LVQ is a prime example of a Machine Learning model
which is intuitive and interpretable
!
but classical LVQ is a mere heuristic
!
This Tutorial:
Prototypes
• prototypes are points in the data space: $\vec{w}_i \in \mathbb{R}^n$
• which decompose the space into receptive fields
• induce a classification
Prototypes
!
prototypes offer a sparse encoding
!
prototypes represent data
16.07.2014
WSOM 2005, Paris
Prototype learning
!
supervised: classes are known a priori:
training set:
• LVQ, GLVQ, RSLVQ, ...
• unsupervised: clusters are not known a priori
!
NG, GTM, AP, ...
!
... usually solid mathematical foundation available
LVQ
Learning vector quantization [Kohonen, 1988]
init positions of $\vec{w}_j$, labels are $c(\vec{w}_j)$
repeat:
  pick data point $(\vec{x}_i, y_i)$ randomly
  determine winner $\vec{w}_I$
  if $y_i = c(\vec{w}_I)$: $\Delta\vec{w}_I \sim (\vec{x}_i - \vec{w}_I)$
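The LVQ loop can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not Kohonen's reference implementation: the names `lvq1_train`, `lvq_predict`, `eta`, and `epochs` are hypothetical, random picking is realised as a random permutation per epoch, and the repelling branch for a wrong winner is the standard LVQ1 rule, which is not legible in the slide text.

```python
import numpy as np

def lvq1_train(X, y, W, c, eta=0.05, epochs=20, seed=0):
    """LVQ1 sketch: attract the winner if its label matches, repel it otherwise."""
    rng = np.random.default_rng(seed)
    W = W.astype(float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # winner = prototype closest to x_i (squared Euclidean distance)
            I = np.argmin(((W - X[i]) ** 2).sum(axis=1))
            sign = 1.0 if c[I] == y[i] else -1.0   # repelling branch is an assumption
            W[I] += sign * eta * (X[i] - W[I])
    return W

def lvq_predict(X, W, c):
    """Assign each point the label of its closest prototype."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return c[np.argmin(d, axis=1)]

# toy data (made up for illustration): two well separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W0 = np.array([[-0.5, 0.0], [0.5, 0.0]])   # initial prototype positions
c = np.array([0, 1])                       # prototype labels
W = lvq1_train(X, y, W0, c)
accuracy = (lvq_predict(X, W, c) == y).mean()
```

The trained prototypes in `W` can be stored and inspected directly, which is the interpretability argument made for LVQ.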
LVQ
LVQ 2.1 [Kohonen, 1990]
init positions of $\vec{w}_j$, labels are $c(\vec{w}_j)$
repeat:
  pick data point $(\vec{x}_i, y_i)$ randomly
  determine closest prototype with $y_i = c(\vec{w}^+)$: $\vec{w}^+$
  determine closest prototype with $y_i \neq c(\vec{w}^-)$: $\vec{w}^-$
  if prototypes fall into a window around decision boundary:
    $\Delta\vec{w}^+ \sim (\vec{x}_i - \vec{w}^+)$
    $\Delta\vec{w}^- \sim -(\vec{x}_i - \vec{w}^-)$
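A single LVQ 2.1 update might look as follows. The slide does not state the window rule, so this sketch assumes the commonly used relative-distance test min(d+/d-, d-/d+) > (1 - window)/(1 + window) on Euclidean distances; `lvq21_step`, `eta`, and `window` are hypothetical names.

```python
import numpy as np

def lvq21_step(x, y, W, c, eta=0.05, window=0.3):
    """One LVQ2.1 update (in place on W); returns True if the window test passed."""
    d = np.sqrt(((W - x) ** 2).sum(axis=1))      # Euclidean distances to all prototypes
    correct = np.flatnonzero(c == y)
    wrong = np.flatnonzero(c != y)
    p = correct[np.argmin(d[correct])]           # closest correct prototype w+
    m = wrong[np.argmin(d[wrong])]               # closest wrong prototype w-
    s = (1.0 - window) / (1.0 + window)
    # assumed window rule: only points near the decision boundary trigger an update
    if min(d[p], d[m]) <= s * max(d[p], d[m]):
        return False
    W[p] += eta * (x - W[p])                     # attract w+
    W[m] -= eta * (x - W[m])                     # repel w-
    return True
```

Without the window test, the repelling term can push the wrong prototype arbitrarily far away, which is why the choice of window matters.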
Online detection of faults
Setting:
•
high dim. features
•
few training data
•
online training
LVQ:
•
close to 100% accuracy
•
prototypes
•
can be stored
•
can be inspected
[T.Bojer et al., 2003]
Clinical proteomics
unhappy because
possibly ill ..
take serum
observe a characteristic spectrum
which tells us more about the
peptides in the serum
put into
mass
prostate cancer [National Cancer Institute, Prostate Cancer Dataset,
www.cancer.gov, 2004]:
!
318 examples, SELDI-TOF from blood serum, 130 dim after preprocessing
(normalization, peak detection)
!
2 classes (healthy versus cancer in different states)
Clinical proteomics
LVQ: 62.5%, GRLVQ: 93.7%, SVM: 92.7%
Steroid metabolomics
unhappy because
possibly ill ..
take serum
extract steroid markers
(32 selected steroid metabolites)
by means of GC/MS
Steroid metabolomics
Object recognition
Take home message
!
LVQ offers an intuitive classifier with high potential
for industrial applications
LVQ code
!
lvq PAK (http://www.cis.hut.fi/research/lvq_pak/): only
basic versions
!
included in popular software such as WEKA: only basic
versions
!
SOM toolbox (http://www.cis.hut.fi/somtoolbox/): also
GLVQ, matrix learning
!
mloss: also GLVQ, matrix learning
!
see also material at tutorial web site in particular for
advanced versions as covered in the following: http://
LVQ
• LVQ 1 does not have a valid cost function: $\sum_i f_{\mathrm{LVQ}}(d^+, d^-)$
where $d^\pm = (\vec{x}_i - \vec{w}^\pm)^2$ squared distance to closest correct / wrong prototype and
$f_{\mathrm{LVQ}}(a, b) = \begin{cases} a & \text{if } a \le b \\ -b & \text{else} \end{cases}$
LVQ2.1
• LVQ2.1 has a valid cost function: $\sum_i f_{\mathrm{LVQ2.1}}(d^+, d^-)$
where $d^\pm = (\vec{x}_i - \vec{w}^\pm)^2$ squared distance to closest correct / wrong prototype and
$f_{\mathrm{LVQ2.1}}(a, b) = \mathrm{window}(a - b)$
But this is unbounded!
LVQ2.1
!
behavior without window in simple model situations:
!
so tricky choice of window necessary....
generalization error of LVQ
depending on its initialization
in simple model setting:
result can be far from optimum
[Biehl,Ghosh,Hammer,2007]
(figure labels: $p_-$, $p_+ > p_-$)
More reasonable cost function for LVQ
!
based on margin maximization: GLVQ
[Sato/Yamada 1996, Hammer/Villmann 2002, Crammer et
al 2002, Schneider et al. 2009]
!
based on probabilistic modeling: RSLVQ
[Seo/Obermayer 2003]
• function class F given by possible LVQ-networks
• training data (x_i, y_i)
• machine learner
• LVQ-function f in F
• often: f(x_i) = y_i for training points (i.e. small empirical error)
• desired: P(f(x) = y) should be large (i.e. small real error)
Colt for LVQ in a nutshell
• (hypothesis) margin of $x_i$: $m(x_i) = d^- - d^+$ where $d^+$ / $d^-$ is the squared distance to closest correct / wrong prototype
• mathematics
• error is bounded by: $E/m + O\big(p^2 (B^3 \ln 1/\delta)^{1/2} / (\rho\, m^{1/2})\big)$
where E = number of misclassified training data with margin smaller than ρ (including errors), δ = confidence, m = number of examples, B = support, p = number of prototypes
(figure: safe classification vs. insecure classification)
bound = data with (too) small margin + term / margin
does not include dimensionality
good bounds for few
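The hypothesis margin and the count E that enters the bound can be computed directly from the prototypes. A minimal sketch with made-up one-dimensional toy data embedded in the plane; `closest_correct_wrong`, `hypothesis_margin`, and the threshold `rho` are hypothetical names.

```python
import numpy as np

def closest_correct_wrong(X, y, W, c):
    """Squared distances d+ / d- to the closest correct / wrong prototype."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    same = c[None, :] == y[:, None]
    d_plus = np.where(same, d, np.inf).min(axis=1)
    d_minus = np.where(~same, d, np.inf).min(axis=1)
    return d_plus, d_minus

def hypothesis_margin(X, y, W, c):
    """m(x_i) = d- - d+; negative values are misclassifications."""
    d_plus, d_minus = closest_correct_wrong(X, y, W, c)
    return d_minus - d_plus

W = np.array([[0.0, 0.0], [4.0, 0.0]])          # two prototypes
c = np.array([0, 1])                            # their labels
X = np.array([[1.0, 0.0], [3.0, 0.0], [2.5, 0.0]])
y = np.array([0, 0, 1])
margins = hypothesis_margin(X, y, W, c)
rho = 5.0
E = int((margins < rho).sum())                  # points with margin below rho, errors included
```

Here the second point is misclassified (negative margin) and the third is classified correctly but with a margin below `rho`, so both count towards E.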
Margin maximization
• mathematical objective: maximize margin
• $\min \sum_i d^+(\vec{x}_i) - d^-(\vec{x}_i)$
• minimize $\sum_i \dfrac{d^+(\vec{x}_i) - d^-(\vec{x}_i)}{d^+(\vec{x}_i) + d^-(\vec{x}_i)}$
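To make the normalized objective concrete, the following sketch evaluates the summands (d+ - d-)/(d+ + d-) for a set of labeled points. Each summand lies in [-1, 1] and is negative exactly when the point is classified correctly. The function name `glvq_cost` and the toy data are assumptions for illustration.

```python
import numpy as np

def glvq_cost(X, y, W, c):
    """Return the summed cost and the per-point terms (d+ - d-) / (d+ + d-)."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    same = c[None, :] == y[:, None]
    d_plus = np.where(same, d, np.inf).min(axis=1)    # closest correct prototype
    d_minus = np.where(~same, d, np.inf).min(axis=1)  # closest wrong prototype
    mu = (d_plus - d_minus) / (d_plus + d_minus)
    return mu.sum(), mu

W = np.array([[0.0, 0.0], [4.0, 0.0]])
c = np.array([0, 1])
X = np.array([[1.0, 0.0], [3.0, 0.0], [2.5, 0.0]])
y = np.array([0, 0, 1])
total, mu = glvq_cost(X, y, W, c)
```

Because of the normalization the cost is bounded, in contrast to the unnormalized difference of distances.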
Generalized LVQ (GLVQ)
derivatives: LVQ2.1 update with an additional scaling
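The derivative formulas are not legible in the slide text. As a sketch, differentiating μ = (d+ - d-)/(d+ + d-) with the chain rule gives ∂μ/∂d+ = 2d-/(d+ + d-)², ∂μ/∂d- = -2d+/(d+ + d-)² and ∂d/∂w = -2(x - w), which yields an LVQ2.1-like update with a scaling factor. The name `glvq_step` and the learning rate `eta` are hypothetical, and the cost is taken without an additional nonlinearity.

```python
import numpy as np

def glvq_step(x, y, W, c, eta=0.1):
    """One stochastic gradient step on mu = (d+ - d-)/(d+ + d-), in place on W.

    Returns the value of mu before the update.
    """
    d = ((W - x) ** 2).sum(axis=1)
    correct = np.flatnonzero(c == y)
    wrong = np.flatnonzero(c != y)
    p = correct[np.argmin(d[correct])]   # closest correct prototype
    m = wrong[np.argmin(d[wrong])]       # closest wrong prototype
    dp, dm = d[p], d[m]
    denom = (dp + dm) ** 2
    xi_plus = 2.0 * dm / denom           # scaling of the attracting update
    xi_minus = 2.0 * dp / denom          # scaling of the repelling update
    W[p] += eta * xi_plus * 2.0 * (x - W[p])
    W[m] -= eta * xi_minus * 2.0 * (x - W[m])
    return (dp - dm) / (dp + dm)
```

The scaling factors shrink when the point is far from both prototypes, which is what keeps the update from diverging the way plain LVQ2.1 can.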
Probabilistic modeling
Take home
!
LVQ can be substantiated by large margin generalization
bounds (independent of dimensionality)
!
LVQ can be based on cost functions:
!
probabilistic modeling
!
excellent results
!
bandwidth is a very critical parameter (crisp limit does not perform well)
!
prototypes not always representative
!
margin maximization
!
very good results
!
parameters not critical
!
prototypes are representative for data
Why metric learning?
Example: acceptance of papers at some conference
L - layout, T - technical quality, I - interesting subject, F - famous author, S - appropriate
subject, Q - overall quality, P - author registers for conference, E - appropriate length, B -
likes beer, P - looks pretty, G - gives good talks, K - knows program committee, M
- member of program committee, C - special session, R - has red hair
Why metric learning?
!
data are usually represented by feature vectors
!
feature vectors are compared using Euclidean distance
!
but this might tell you nothing useful
(42,42,42,0, ...)
smell head belly human
(41,43,44,1, ...)
(-41,43,44,1, ...)
Metric learning: G relevance LVQ
• mathematical objective: minimize $\sum_i \dfrac{d_\lambda^+(x_i) - d_\lambda^-(x_i)}{d_\lambda^+(x_i) + d_\lambda^-(x_i)}$
where $d_\lambda(x, y) = \sum_l \lambda_l (x_l - y_l)^2$
GRLVQ
• mathematical objective: min $\sum_i \dfrac{d_\lambda^+(x_i) - d_\lambda^-(x_i)}{d_\lambda^+(x_i) + d_\lambda^-(x_i)}$
• derivatives: LVQ2.1 update with scaling, plus relevance update
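The relevance update follows from the same chain rule as the prototype update, using ∂d_λ/∂λ_l = (x_l - w_l)². The sketch below performs one gradient step on the relevance vector for a single point. Clipping at zero and renormalising to sum 1 are assumptions (a common convention); the names `weighted_sqdist`, `grlvq_relevance_step`, and `eta` are hypothetical.

```python
import numpy as np

def weighted_sqdist(x, w, lam):
    """d_lambda(x, w) = sum_l lambda_l (x_l - w_l)^2."""
    return float(np.sum(lam * (x - w) ** 2))

def grlvq_relevance_step(x, w_plus, w_minus, lam, eta=0.1):
    """One gradient step on the relevances for mu = (d+ - d-)/(d+ + d-)."""
    dp = weighted_sqdist(x, w_plus, lam)
    dm = weighted_sqdist(x, w_minus, lam)
    denom = (dp + dm) ** 2
    grad = (2.0 * dm / denom) * (x - w_plus) ** 2 - (2.0 * dp / denom) * (x - w_minus) ** 2
    lam = np.clip(lam - eta * grad, 0.0, None)   # assumption: keep relevances non-negative
    return lam / lam.sum()                       # assumption: normalise to sum 1

# toy example: dimension 1 separates x from the correct prototype without being informative
x = np.array([1.0, 5.0])
w_plus = np.array([0.0, 0.0])
w_minus = np.array([4.0, 5.0])
lam_new = grlvq_relevance_step(x, w_plus, w_minus, np.array([0.5, 0.5]))
```

In this toy case the step shifts weight from the second dimension to the first, because the second dimension is what makes the point look closer to the wrong prototype.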
GRLVQ
Generalized Matrix LVQ (GMLVQ)
Interpretability: Steroid metabolomics
GMLVQ
… yields (local) matrices, i.e. (local) scaling and
rotations of the space
GRLVQ: global scaling
GMLVQ: global scaling and rotation
LGMLVQ: local scaling and rotation
• GMLVQ with positive semidefinite matrices: quadratic complexity w.r.t. data dimensionality
• GMLVQ with positive semidefinite low rank matrices: linear complexity w.r.t. data dimensionality, equivalent to full version (if data are intrinsically low dimensional)
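The complexity difference can be seen in a short sketch. It assumes the usual parameterisation of a positive semidefinite matrix as Λ = ΩᵀΩ, so that the distance is (x - w)ᵀ Λ (x - w) = ‖Ω(x - w)‖². The symbol names `omega` and `Lam` and the chosen sizes are assumptions for illustration.

```python
import numpy as np

def gmlvq_sqdist(x, w, omega):
    """Matrix distance computed as ||Omega (x - w)||^2.

    For Omega of shape (r, n) this costs O(r * n) instead of O(n^2).
    """
    z = omega @ (x - w)
    return float(z @ z)

rng = np.random.default_rng(0)
n, r = 300, 2                      # data dimensionality and assumed rank
omega = rng.normal(size=(r, n))
x, w = rng.normal(size=n), rng.normal(size=n)

d_low_rank = gmlvq_sqdist(x, w, omega)
Lam = omega.T @ omega              # full n x n matrix, only built here for comparison
d_full = float((x - w) @ Lam @ (x - w))
```

With r = 2 or 3, the map x → Ωx can also be used directly as a discriminative projection for visualization.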
LiRamLVQ
(figure: global and local low rank matrices)
induces global projection: f: x → Ω x
Discriminative visualization
Stationary solutions of GMLVQ
!
assume fixed receptive fields, what is the optimum metric?
!
update of matrix has the form (prefactor indicates sign):
(x centered in prototype) plus normalization
!
similar to von Mises iteration
!
converges to first eigenvector of
Stationary solution
contributes with +
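For readers unfamiliar with it, von Mises iteration (power iteration) repeatedly multiplies a vector by a matrix and renormalises, which converges to the eigenvector of the largest-magnitude eigenvalue under mild conditions. The matrix the slide refers to is not recoverable from the text, so `A` below is an arbitrary symmetric stand-in and `von_mises_iteration` is a hypothetical name.

```python
import numpy as np

def von_mises_iteration(A, iters=200, seed=0):
    """Power iteration: multiply by A, then normalise, repeatedly."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)    # the normalisation step mentioned on the slide
    return v

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
v = von_mises_iteration(A)
rayleigh = float(v @ A @ v)              # estimate of the dominant eigenvalue
```

The analogy to the matrix update is that an update followed by normalisation, with fixed receptive fields, drives the metric towards a dominant eigen-direction.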
Interpretation of matrix terms
infra-red spectral data: 124 wine samples
256 wavelengths, 30 training data, 94 test spectra
alcohol content: high / low / medium
Interpretation of matrix terms
!
often: diagonal terms are interpreted as relevance
!
problem: for high dimensional data
holds for all matrices with differences in the null space of
Interpretation of matrix terms
!
dividing out null space yields the profile
!
direct interpretation of relevance profile misleading for high
dim data, get rid of null space first!
Interpretation of matrix terms
(figure: GMLVQ with null-space correction; annotations: P=30 dimensions, over-fitting effect, 7 dimensions remaining, best performance)
Take home
!
metric adaptation:
!
increases accuracy
!
does not deteriorate the generalization ability
!
low rank matrix:
!
allows efficient training
!
data visualization
!
no restriction as compared to optimum metric
!
interpretation:
!
by looking at feature weighting,
Schneider, Biehl, Hammer
• feature extraction → vectorial data
• pairwise (dis)similarity measurement → (dis)similarity matrix
Dissimilarity or similarity data
size softness color curvature ... → (20,7,...)
0    0.2  0.8
0.2  0    0.8
0.8  0.8  0
(Dis)similarity data
GTTACAGGT
GTGACAAGT
GGTACACGT
• (dis)similarity measures, e.g.:
1. Alignment
2. Normalized Compression Distance
3. Graph structure kernels
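As an illustration of how such data arise, the sketch below builds a dissimilarity matrix for the three sequences on the slide. It uses plain Levenshtein edit distance as a simple stand-in for an alignment score, which is an assumption; real alignment scores usually use substitution matrices and gap penalties. `edit_distance` is a hypothetical helper.

```python
import numpy as np

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over one rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]

seqs = ["GTTACAGGT", "GTGACAAGT", "GGTACACGT"]
D = np.array([[edit_distance(a, b) for b in seqs] for a in seqs], dtype=float)
```

The resulting `D` is symmetric with zero diagonal, which is the only structure the relational methods require.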
LVQ for dis-/similarities
!
kernel GLVQ (Suganthan et al.)
!
differentiable kernel GLVQ (Villmann et al.)
!
relational GLVQ/SRLVQ (Xibin et al.)
!
kernel SRLVQ (Hofmann et al.)
Assumption: Prototypes are expressed as linear combinations
$w_i = \sum_j \alpha_{ij} x_j$ where $\sum_j \alpha_{ij} = 1$
Fact: for every symmetric bilinear form and linear representation as
above we find
$\|x_j - w_i\|^2 = (D \cdot \alpha_i)_j - \frac{1}{2} \cdot \alpha_i^T \cdot D \cdot \alpha_i$
Method: in the original methods, substitute all terms $\|x_j - w_i\|^2$ and use the relational expression above
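The relational distance formula can be checked numerically in the Euclidean case, where explicit prototypes are available for comparison. A minimal sketch with random data; `relational_sqdist` is a hypothetical name, `alpha` holds one coefficient row per prototype with rows summing to 1, and `D` contains squared Euclidean distances.

```python
import numpy as np

def relational_sqdist(D, alpha):
    """R[i, j] = (D alpha_i)_j - 1/2 alpha_i^T D alpha_i for symmetric D."""
    DA = alpha @ D                                        # row i holds D alpha_i
    self_term = 0.5 * np.einsum("ij,ij->i", DA, alpha)    # 1/2 alpha_i^T D alpha_i
    return DA - self_term[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                               # data, only used for the check
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)    # squared Euclidean dissimilarities
alpha = rng.random(size=(2, 6))
alpha /= alpha.sum(axis=1, keepdims=True)                 # normalised coefficients

R = relational_sqdist(D, alpha)
W = alpha @ X                                             # explicit prototypes
R_direct = ((W[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
```

The point of the identity is that `R` needs only `D` and `alpha`, so the same computation applies when no vectorial representation `X` exists.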
Relational GLVQ
0.6
0.1
0.2
0.05
0.05
Relational GLVQ
assume prototypes have the form
then
GLVQ costs become
Similarities/dissimilarities
euclid: $\|\vec{x}_i - \vec{x}_j\|^2$, general: $d_{ij} = d(x_i, x_j)$
euclid: $\langle \vec{x}_i, \vec{x}_j \rangle$, general: $s_{ij} = s(x_i, x_j)$
assumption:
symmetric: $d_{ij} = d_{ji}$, $s_{ij} = s_{ji}$
zero diagonal: $d_{ii} = 0$
normalization of $s$ is possible: $s_{ii} = 1$
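Similarities/dissimilarities
euclid: $\|\vec{x}_i - \vec{x}_j\|^2$, general: $d_{ij} = d(x_i, x_j)$
euclid: $\langle \vec{x}_i, \vec{x}_j \rangle$, general: $s_{ij} = s(x_i, x_j)$
$s_{ij} = -\frac{1}{2} \Big( d_{ij} - \frac{1}{n} \sum_l d_{il} - \frac{1}{n} \sum_l d_{lj} + \frac{1}{n^2} \sum_{l,l'} d_{ll'} \Big)$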
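The conversion from dissimilarities to similarities (often called double centering) has a compact matrix form, S = -1/2 · J D J with J = I - (1/n) 11ᵀ. The sketch below checks it on Euclidean toy data, where the result should equal the Gram matrix of the centered points. `double_centering` is a hypothetical name.

```python
import numpy as np

def double_centering(D):
    """s_ij = -1/2 (d_ij - row mean_i - column mean_j + grand mean)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return -0.5 * J @ D @ J

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
S = double_centering(D)

Xc = X - X.mean(axis=0)                                  # centered data for comparison
gram = Xc @ Xc.T
```

For non-Euclidean `D` the same formula still applies, but `S` may then have negative eigenvalues.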
Pseudo-euclidean embedding
general: $d_{ij} = d(x_i, x_j)$, $s_{ij} = s(x_i, x_j)$
pseudo-euclid:
$\|\vec{x}_i - \vec{x}_j\|^2_{pq} = \|\vec{x}^1_i - \vec{x}^1_j\|^2 - \|\vec{x}^2_i - \vec{x}^2_j\|^2$
$\langle \vec{x}_i, \vec{x}_j \rangle_{pq} = \langle \vec{x}^1_i, \vec{x}^1_j \rangle - \langle \vec{x}^2_i, \vec{x}^2_j \rangle$
signature $(p, q, n - p - q)$
Pseudo-Euclidean Space
For every symmetric D a vector space embedding in pseudo-Euclidean
space exists; symmetric bilinear form induces dissimilarities
(figure: two axes with signature +1 / -1 and points P1=(6.1,1), P2=(-6.1,1), P3=(0.1,0), P4=(-0.1,0), P5=(4,-1), P6=(-4,-1))
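A pseudo-Euclidean squared distance is a difference of two Euclidean ones, so it can be negative. The sketch below evaluates it for two of the points from the figure, assuming that the first coordinate carries signature +1 and the second signature -1; `pseudo_sqdist` is a hypothetical name.

```python
import numpy as np

def pseudo_sqdist(a, b, p, q):
    """Squared distance with signature (p, q): the first p coordinates count
    positively, the next q coordinates negatively, the rest are ignored."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sum(diff[:p] ** 2) - np.sum(diff[p:p + q] ** 2))

P1, P3 = (6.1, 1.0), (0.1, 0.0)
d13 = pseudo_sqdist(P1, P3, p=1, q=1)          # 6.0**2 - 1.0**2

# points that differ only in the negative part get a negative "squared distance"
d_neg = pseudo_sqdist((0.0, 0.0), (0.0, 1.0), p=1, q=1)
```

Negative values like `d_neg` are the reason why nearest-prototype intuition from Euclidean space does not always transfer.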
classification based on $\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$
training optimizes $f\big( (\|\vec{x}_i - \vec{w}_j\|^2)_{i,j} \big)$
LVQ for dis-/similarities
prototypes as linear combinations $\vec{w}_j = \sum_i \alpha_{ji} \vec{x}_i$
classification based on $\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$
training optimizes $f\big( (\|\vec{x}_i - \vec{w}_j\|^2)_{i,j} \big)$
LVQ for dis-/similarities
kernel approach: $\|\vec{x}_i - \vec{w}_j\|^2 = s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'}$
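The kernel form of the distance can be verified against explicit prototypes when the similarity is a plain inner product. A minimal sketch with random data; `kernel_sqdist` is a hypothetical name, and `alpha[j, l]` is the coefficient of data point l in prototype j.

```python
import numpy as np

def kernel_sqdist(S, alpha):
    """R[i, j] = s_ii - 2 sum_l alpha_jl s_il + sum_{l,l'} alpha_jl alpha_jl' s_ll'."""
    cross = S @ alpha.T                                      # shape (n, p)
    self_term = np.einsum("jl,lm,jm->j", alpha, S, alpha)    # shape (p,)
    return np.diag(S)[:, None] - 2.0 * cross + self_term[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
S = X @ X.T                                   # linear kernel as similarity
alpha = rng.random(size=(2, 6))

R = kernel_sqdist(S, alpha)
W = alpha @ X                                 # explicit prototypes, only for the check
R_direct = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
```

Unlike the relational form, this expression does not need normalised coefficients.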
LVQ for dis-/similarities
relational approach: $\|\vec{x}_i - \vec{w}_j\|^2 = \sum_l \alpha_{jl} d_{il} - \frac{1}{2} \sum_{l,l'} \alpha_{jl} \alpha_{jl'} d_{ll'}$ for normalized $\alpha_{jl}$
optimize: gradient descent with respect to $\alpha_{jl}$ followed by normalization
• relational GLVQ / SRLVQ
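LVQ for dis-/similarities
$f\Big( \big( \sum_l \alpha_{jl} d_{il} - \frac{1}{2} \sum_{l,l'} \alpha_{jl} \alpha_{jl'} d_{ll'} \big)_{i,j} \Big)$
$f\Big( \big( s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'} \big)_{i,j} \Big)$
gradient descent with respect to $\alpha_{jl}$:
• kernel GLVQ / SRLVQ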
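The building block of the coefficient update is the gradient of the relational distance with respect to the coefficients, which for symmetric D is ∂/∂α_jl = d_il - Σ_l' α_jl' d_ll'. The sketch below covers only this inner gradient plus the normalisation step; the outer factors coming from the cost function f are omitted. The names `relational_dist`, `relational_grad`, and `eta` are hypothetical.

```python
import numpy as np

def relational_dist(D, a, i):
    """Distance of data point i to the prototype with coefficient vector a."""
    return float(D[i] @ a - 0.5 * a @ D @ a)

def relational_grad(D, a, i):
    """Gradient of relational_dist with respect to a (D symmetric)."""
    return D[i] - D @ a

rng = np.random.default_rng(0)
M = rng.random(size=(5, 5))
D = M + M.T                        # symmetric toy dissimilarities
np.fill_diagonal(D, 0.0)           # zero diagonal
a = rng.random(size=5)
a /= a.sum()                       # normalised coefficients

g = relational_grad(D, a, i=2)
eta = 0.01
a_new = a - eta * g                # descent step on the distance alone (illustration)
a_new /= a_new.sum()               # normalisation after the step
```

In a full relational GLVQ step the gradient `g` would be multiplied by the same scaling factors as in vectorial GLVQ, with opposite signs for the closest correct and closest wrong prototype.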
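LVQ for dis-/similarities
$\vec{w}_j = \sum_l \alpha_{jl} \vec{x}_l$
$\frac{\partial}{\partial \vec{w}_j} f\big( (\|\vec{x}_i - \vec{w}_j\|^2)_{i,j} \big) = -2 f' \cdot (\vec{x}_i - \vec{w}_j)$
hence: $\Delta \sum_l \alpha_{jl} \vec{x}_l \sim 2 f' \cdot \big( \vec{x}_i - \sum_l \alpha_{jl} \vec{x}_l \big)$
this can be decomposed into contributions of the coefficients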
LVQ for dis-/similarities
GLVQ
similarities
gradient w.r.t. coefficients
RSLVQ
dissimilarities
gradient w.r.t. prototypes
only in the euclidean case:
kernel variants resemble gradient w.r.t w
large margin generalization bounds
Computational effort
Size of Matrix (Double Precision)
n          Size
5,000      190MB
10,000     763MB
20,000     3.0GB
50,000     18.6GB
200,000    300.0GB
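The table entries follow from n² entries of 8 bytes each. A small sketch that recomputes them, assuming the slide's MB and GB are binary units (MiB and GiB); `matrix_size_bytes` is a hypothetical helper.

```python
def matrix_size_bytes(n, bytes_per_entry=8):
    """Memory for a dense n x n matrix of double precision numbers."""
    return n * n * bytes_per_entry

for n in (5_000, 10_000, 20_000, 50_000, 200_000):
    size = matrix_size_bytes(n)
    print(f"{n:>9,d}  {size / 2**20:12.1f} MiB  {size / 2**30:8.2f} GiB")
```

Under this reading the first four rows match the table up to rounding, and the last row comes out at about 298 GiB, which the slide rounds to 300.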
Computational effort?
$\|\vec{x}_i - \vec{w}_j\|^2 = s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'} = e_i^t S e_i - 2 \cdot e_i^t S \alpha_j + \alpha_j^t S \alpha_j$
sample m landmarks only: $S_{m,m}$, $S_{m,n}$, $S_{n,m}$
approximate $S \approx S_{m,n} S_{m,m}^{-1} S_{n,m}$
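The landmark approximation can be tried on synthetic low rank data. In this sketch `S_nm` denotes the n × m block of similarities between all points and the landmarks and `S_mm` the m × m block among the landmarks; this index convention, the use of a pseudo-inverse with an explicit cutoff, and the name `nystroem` are assumptions.

```python
import numpy as np

def nystroem(S_nm, S_mm):
    """Approximate the full n x n matrix as S_nm pinv(S_mm) S_nm^T."""
    return S_nm @ np.linalg.pinv(S_mm, rcond=1e-10) @ S_nm.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # rank-3 data, so S has rank 3
S = X @ X.T                              # full matrix, only built here for the check
landmarks = rng.choice(200, size=10, replace=False)
S_nm = S[:, landmarks]
S_mm = S[np.ix_(landmarks, landmarks)]

S_hat = nystroem(S_nm, S_mm)
max_error = float(np.abs(S - S_hat).max())
```

In practice the product is never formed explicitly; distances are evaluated by multiplying the three factors with the coefficient vectors from right to left, which keeps the effort linear in n for fixed m.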
Take home
!
there exist cool methods which enable the application of
LVQ for similarities / dissimilarities
!
quadratic complexity
!
Nystroem approximation for low rank data reduces to
linear complexity
!
metric adaptation possible in a similar way as for GMLVQ:
adapt w.r.t similarity/dissimilarity parameters (has been
Confidence measures
!
Certainty of a classification?
x
Conformal prediction
!
framework to accompany pointwise classification of online
methods by provable guarantees:
classifier trained on N (exchangeable) data
conformity measure yields possible labels
such that for a new point
it holds:
Conformal prediction
!
pick conformity measure, e.g.
• induces two terms:
• Credibility: how sure that a prediction is correct
• Confidence: how sure that ALL OTHER labels are incorrect
(figure: regions of lower / higher credibility and lower / higher confidence)
.. any measure is valid,but
Conformal prediction algorithm
Simplified conformal prediction
given training data and new point
1. train the model on training data
2. compute nonconformity of training set
3. for every possible label: compute the nonconformity of the new point with that label
4. compare values
5. output label with best r-value
credibility: largest r-value
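The steps can be sketched for an already trained prototype model. The nonconformity measure is not legible in the slide text, so this sketch assumes the ratio d+/d- with respect to the hypothesised label. The r-value is taken as the fraction of nonconformity values (training set plus the new point) that are at least as large as the new one, and confidence as one minus the second largest r-value. All function names are hypothetical.

```python
import numpy as np

def nonconformity(x, label, W, c):
    """Assumed measure: d+ / d- for the hypothesised label (small = conforming)."""
    d = ((W - x) ** 2).sum(axis=1)
    return d[c == label].min() / (d[c != label].min() + 1e-12)

def conformal_predict(x, X_train, y_train, W, c):
    """Return (label, credibility, confidence) for a new point x."""
    a_train = np.array([nonconformity(xi, yi, W, c)
                        for xi, yi in zip(X_train, y_train)])
    r = {}
    for label in np.unique(c):
        a_new = nonconformity(x, label, W, c)
        r[int(label)] = (np.sum(a_train >= a_new) + 1) / (len(a_train) + 1)
    ranked = sorted(r, key=r.get, reverse=True)
    return ranked[0], r[ranked[0]], 1.0 - r[ranked[1]]

# toy setting: prototypes assumed to be already trained
W = np.array([[-2.0, 0.0], [2.0, 0.0]])
c = np.array([0, 1])
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
label, credibility, confidence = conformal_predict(
    np.array([-1.8, 0.1]), X_train, y_train, W, c)
```

Since only distances to prototypes are needed, the same scheme carries over to the relational setting by swapping in the relational distance.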
Growing conformal semi-supervised LVQ
given labeled data and unlabeled data
init model with minimum number of prototypes
train model on the labeled data
Loop:
predict confidence/credibility on the unlabeled data and consider the secure part
predict labels on the unlabeled data based on the secure part
add the part of the unlabeled data with high confidence/credibility
identify regions with poor confidence/credibility
generate new prototype
Take home
!
conformal prediction makes it possible to accompany classification
results with confidence values
!
can be realised efficiently for LVQ based on distance
measures
!
allows incremental versions (also for relational setting,
semi-supervised training)
Literature
!
T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1997.
!
T. Kohonen. Learning vector quantization. In: M.A. Arbib, editor, The Handbook of Brain Theory and Neural Networks., pages 537–540. MIT Press, Cambridge, MA, 1995.
!
M. Biehl, B. Hammer, P. Schneider, T. Villmann, Metric Learning for Prototype-based, in: Innovations in Neural Information Paradigms and Applications, M. Bianchini, M. Maggini, F. Scarselli, L.C. Jain (eds.), Springer Studies in Computational Intelligence, Vol 247 (2009), 183-199
!
M. Biehl, B. Hammer, F.-M. Schleif, P. Schneider, T. Villmann, Stationarity of Matrix Relevance Learning Vector Quantization, Machine Learning Reports 01/2009, Univ. Leipzig (2009)
!
M. Biehl, A. Ghosh, and B. Hammer, Dynamics and generalization ability of LVQ algorithms, J. Machine Learning Research 8 (Feb):323-360, 2007
!
W. Arlt, M. Biehl, A.E. Taylor, S. Hahner, R. Libe, B.A. Hughes, P. Schneider, D.J. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C.H.L. Shackleton, X. Bertagna, M. Fassnacht, P.M. Stewart: Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors. J. of Clinical Endocrinology & Metabolism 96: 3775-3784 (2011).
!
Frank-Michael Schleif, Thomas Villmann, Markus Kostrzewa, Barbara Hammer, Alexander Gammerman: Cancer informatics by prototype networks in mass spectrometry. Artificial Intelligence in Medicine 45(2-3): 215-228 (2009)
!
S. Kirstein, H. Wersing, H.-M. Gross, and E. Körner. A Life-Long Learning Vector Quantization Approach for Interactive Learning of Multiple Categories. Neural Networks 28:90-205 (2012).
!
Sambu Seo, Klaus Obermayer: Soft Learning Vector Quantization. Neural Computation 15(7): 1589-1604 (2003)
!
Barbara Hammer, Daniela Hofmann, Frank-Michael Schleif, Xibin Zhu: Learning vector quantization for (dis-)similarities. Neurocomputing (IJON) 131:43-51 (2014)
!
Marc Strickert, Barbara Hammer, Thomas Villmann, Michael Biehl: Regularization and improved interpretation of linear data mappings and adaptive distance measures. CIDM 2013:10-17
Literature
!
B. Mokbel, B. Paassen, and B. Hammer. Adaptive distance measures for sequential data. In Michel Verleysen, editor, ESANN, pages 265–270, 2014.
!
Daniela Hofmann, Frank-Michael Schleif, Benjamin Paaßen, and Barbara Hammer. Learning interpretable kernelized prototype-based models. Neurocomputing, accepted, 2013.
!
Xibin Zhu, Frank-Michael Schleif, and Barbara Hammer. Semi-supervised vector quantization for proximity data. In ESANN, pages 89–94, 2013.
!
Frank-Michael Schleif, Xibin Zhu, and Barbara Hammer. Sparse conformal prediction for dissimilarity data. Annals of Mathematics and Artificial Intelligence (AMAI), 2014.
!
Xibin Zhu, Frank-Michael Schleif, and Barbara Hammer. Patch processing for relational learning vector quantization. In Jun Wang, Gary G. Yen, and Marios M. Polycarpou, editors, Advances in Neural Networks - ISNN 2012 - 9th International Symposium on Neural Networks, Shenyang, China, July 11-14, 2012. Proceedings, Part I, volume 7367, pages 55–63. Springer, 2012.
!
Andrej Gisbrecht, Bassam Mokbel, Frank-Michael Schleif, Xibin Zhu, and Barbara Hammer. Linear time relational prototype based learning. Int. J. Neural Syst., 22(5), 2012.
!
Kerstin Bunte, Petra Schneider, Barbara Hammer, Frank-Michael Schleif, Thomas Villmann, and Michael Biehl. Limited rank matrix learning, discriminative dimension reduction and visualization. Neural Networks, 26:159–173, 2012.
!
P. Schneider, K. Bunte, H. Stiekema, B. Hammer, T. Villmann, and M. Biehl. Regularization in matrix relevance learning. IEEE Transactions on Neural Networks, 21:831–840, 2010.
!
Petra Schneider, Michael Biehl, Barbara Hammer: Adaptive Relevance Matrices in Learning Vector Quantization. Neural Computation 21(12): 3532-3561 (2009)
!
Koby Crammer, Ran Gilad-Bachrach, Amir Navot, Naftali Tishby: Margin Analysis of the LVQ Algorithm. NIPS 2002: 462-469