Prototype based methods: Mathematical foundations, interpretability, and data visualization

(1)

Barbara Hammer, Xibin Zhu

CITEC Centre of Excellence, Bielefeld University

http://www.techfak.uni-bielefeld.de/~xzhu/ijcnn14_tutorial.html

(2)

(3)

Why LVQ?

[Machine Learning that Matters, Kiri L. Wagstaff, ICML 2012]

.... of 152 non-cross-conference papers published at ICML 2011:

!

there is a need for machine learning techniques which facilitate a

direct interpretation of the results

(4)

Why LVQ?

!

LVQ is a prime example of a Machine Learning model

which is intuitive and interpretable

!

but classical LVQ is a mere heuristic

!

This Tutorial:

(5)

Prototypes

!

prototypes are points in the data space: $\vec{w}_i \in \mathbb{R}^n$

!

which decompose the space into receptive fields

!

which induce a classification

(6)

Prototypes

!

prototypes offer a sparse encoding

!

prototypes represent data

(7)

16.07.2014

WSOM 2005, Paris

7

(8)

(9)

Prototype learning

!

supervised: classes are known a priori:

training set:

!

LVQ, GLVQ, RSLVQ, ...

!

unsupervised: clusters are not known a priori

!

NG, GTM, AP, ...

!

... usually solid mathematical foundation available

(10)

LVQ

Learning vector quantization [Kohonen, 1988]

init positions of $\vec{w}_j$, labels are $c(\vec{w}_j)$

repeat:

  pick data point $(\vec{x}_i, y_i)$ randomly

  determine winner $\vec{w}_I$

  if $y_i = c(\vec{w}_I)$: $\Delta\vec{w}_I \sim (\vec{x}_i - \vec{w}_I)$
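The LVQ loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's implementation: the learning rate `eta`, the toy data, and the helper names are assumptions, and the repelling update for a wrongly labelled winner follows the standard LVQ1 rule, which the slide text does not show.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lvq1_step(prototypes, labels, x, y, eta=0.1):
    """One LVQ1 update: attract the winner if its label equals y, else repel it."""
    winner = min(range(len(prototypes)), key=lambda j: sq_dist(x, prototypes[j]))
    sign = 1.0 if labels[winner] == y else -1.0  # repulsion is the standard rule (assumed)
    prototypes[winner] = [w + sign * eta * (xi - w)
                          for w, xi in zip(prototypes[winner], x)]
    return winner

def train_lvq1(data, prototypes, labels, eta=0.1, steps=200, seed=0):
    """Repeat: pick a data point at random and update the winner."""
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        lvq1_step(prototypes, labels, x, y, eta)
    return prototypes

def classify(prototypes, labels, x):
    """Label of the closest prototype (winner takes all)."""
    return labels[min(range(len(prototypes)), key=lambda j: sq_dist(x, prototypes[j]))]

# Hypothetical toy data: two well separated classes in the plane.
data = [([0.0, 0.0], "a"), ([0.2, 0.1], "a"), ([1.0, 1.0], "b"), ([0.9, 1.1], "b")]
prototypes = [[0.3, 0.3], [0.7, 0.7]]
labels = ["a", "b"]
train_lvq1(data, prototypes, labels)
```

After training, `classify(prototypes, labels, [0.1, 0.0])` returns the label of the nearest prototype, which is the whole classification rule.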

(11)

LVQ

LVQ 2.1 [Kohonen, 1990]

init positions of $\vec{w}_j$, labels are $c(\vec{w}_j)$

repeat:

  pick data point $(\vec{x}_i, y_i)$ randomly

  determine closest prototype with $y_i = c(\vec{w}^+)$: $\vec{w}^+$

  determine closest prototype with $y_i \neq c(\vec{w}^-)$: $\vec{w}^-$

  if prototypes fall into a window around decision boundary:

    $\Delta\vec{w}^+ \sim (\vec{x}_i - \vec{w}^+)$

    $\Delta\vec{w}^- \sim -(\vec{x}_i - \vec{w}^-)$
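A single LVQ 2.1 step might look as follows. This is a sketch under assumptions: the window test uses Kohonen's usual relative rule min(d+/d-, d-/d+) > (1 - w)/(1 + w) on (non-squared) distances, and `eta`, `window`, and the toy numbers are illustrative values, not taken from the slides.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def lvq21_step(prototypes, labels, x, y, eta=0.1, window=0.3):
    """Update the closest correct and closest wrong prototype if x lies in the window."""
    correct = [j for j, c in enumerate(labels) if c == y]
    wrong = [j for j, c in enumerate(labels) if c != y]
    jp = min(correct, key=lambda j: dist(x, prototypes[j]))
    jm = min(wrong, key=lambda j: dist(x, prototypes[j]))
    dp, dm = dist(x, prototypes[jp]), dist(x, prototypes[jm])
    if dp == 0.0 or dm == 0.0:
        return False  # ratio undefined; skip this sample
    threshold = (1.0 - window) / (1.0 + window)  # assumed window rule
    if min(dp / dm, dm / dp) <= threshold:
        return False  # x is not close enough to the decision boundary
    prototypes[jp] = [w + eta * (xi - w) for w, xi in zip(prototypes[jp], x)]  # attract
    prototypes[jm] = [w - eta * (xi - w) for w, xi in zip(prototypes[jm], x)]  # repel
    return True

# Hypothetical one-dimensional example with prototypes at 0 and 1.
p_near = [[0.0], [1.0]]
updated_near = lvq21_step(p_near, ["a", "b"], [0.45], "a")  # near the boundary
p_far = [[0.0], [1.0]]
updated_far = lvq21_step(p_far, ["a", "b"], [0.05], "a")    # far from the boundary
```

The point near the boundary moves both prototypes, while the point deep inside its class leaves them untouched; without the window, the repelled prototype can drift away without bound.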

(12)

(13)

Online detection of faults

(14)

Online detection of faults

Setting:

•

high dim. features

•

few training data

•

online training

LVQ:

•

close to 100% accuracy

•

prototypes

•

can be stored

•

can be inspected

[T.Bojer et al., 2003]

(15)

Clinical proteomics

unhappy because

possibly ill ..

take serum

observe a characteristic spectrum

which tells us more about the

peptides in the serum

put into

mass

(16)

Clinical proteomics

prostate cancer [National Cancer Institute, Prostate Cancer Dataset, www.cancer.gov, 2004]:

!

318 examples, SELDI-TOF from blood serum, 130 dim after preprocessing (normalization, peak detection)

!

2 classes (healthy versus cancer in different states)

LVQ      GRLVQ    SVM
62.5%    93.7%    92.7%

(17)

Steroid metabolomics

unhappy because

possibly ill ..

take serum

extract steroid markers

(32 selected steroid metabolites)

by means of GC/MS

(18)

Steroid metabolomics

(19)

Object recognition

(20)

Take home message

!

LVQ offers an intuitive classifier with high potential

for industrial applications

(21)

LVQ code

!

lvq PAK (http://www.cis.hut.fi/research/lvq_pak/): only

basic versions

!

included in popular software such as WEKA: only basic

versions

!

SOM toolbox (http://www.cis.hut.fi/somtoolbox/): also

GLVQ, matrix learning

!

mloss: also GLVQ, matrix learning

!

see also material at tutorial web site in particular for

advanced versions as covered in the following: http://

(22)

(23)

LVQ

!

LVQ 1 does not have a valid cost function:

$\sum_i f_{\mathrm{LVQ}}(d^+, d^-)$

where $d^\pm = (\vec{x}_i - \vec{w}^\pm)^2$ is the squared distance to closest correct / wrong prototype and

$f_{\mathrm{LVQ}}(a, b) = \begin{cases} a & \text{if } a \le b \\ -b & \text{else} \end{cases}$

(24)

LVQ2.1

!

LVQ2.1 has a valid cost function:

$\sum_i f_{\mathrm{LVQ2.1}}(d^+, d^-)$

where $d^\pm = (\vec{x}_i - \vec{w}^\pm)^2$ is the squared distance to closest correct / wrong prototype and

$f_{\mathrm{LVQ2.1}}(a, b) = \mathrm{window}(a - b)$

But this is unbounded!

(25)

LVQ2.1

!

behavior without window in simple model situations:

generalization error of LVQ depending on its initialization in simple model setting: result can be far from optimum [Biehl, Ghosh, Hammer, 2007]

[Figure labels: $p_-$, $p_+ > p_-$]

!

so tricky choice of window necessary....

(26)

More reasonable cost function for LVQ

!

based on margin maximization: GLVQ

[Sato/Yamada 1996, Hammer/Villmann 2002, Crammer et

al 2002, Schneider et al. 2009]

!

based on probabilistic modeling: RSLVQ

[Seo/Obermayer 2003]

(27)

!

function class F given by possible LVQ-networks

!

training data $(x_i, y_i)$

!

machine learner

!

LVQ-function f in F

!

often: $f(x_i) = y_i$ for training points (i.e. small empirical error)

!

desired: $P(f(x) = y)$ should be large (i.e. small real error)

(28)

(29)

Colt for LVQ in a nutshell

!

(hypothesis) margin of $x_i$: $m(x_i) = d^- - d^+$ where $d^+$ / $d^-$ is the squared distance to closest correct / wrong prototype

!

mathematics

!

error is bounded by:

$E/m + O\left( p^2 \left( B^3 + (\ln 1/\delta)^{1/2} \right) / (\rho\, m^{1/2}) \right)$

where E = number of misclassified training data with margin smaller than ρ (including errors), δ = confidence, m = number of examples, B = support, p = number of prototypes

data with (too) small margin

term / margin

does not include dimensionality

good bounds for few

[Figure labels: safe classification, insecure classification]

(30)

Margin maximization

!

mathematical objective:

maximize margin

(31)

Margin maximization

!

$\min \sum_i \left( d^+(\vec{x}_i) - d^-(\vec{x}_i) \right)$

(32)

Margin maximization

!

minimize $\sum_i \frac{d^+(\vec{x}_i) - d^-(\vec{x}_i)}{d^+(\vec{x}_i) + d^-(\vec{x}_i)}$
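To make the GLVQ objective concrete, the following sketch evaluates the normalized cost for a fixed set of prototypes. The function names and the toy numbers are assumptions for illustration; each summand lies in [-1, 1] and is negative exactly when the point is classified correctly.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def glvq_term(x, y, prototypes, labels):
    """(d+ - d-) / (d+ + d-) for one labelled point; assumes d+ + d- > 0."""
    d_plus = min(sq_dist(x, w) for w, c in zip(prototypes, labels) if c == y)
    d_minus = min(sq_dist(x, w) for w, c in zip(prototypes, labels) if c != y)
    return (d_plus - d_minus) / (d_plus + d_minus)

def glvq_cost(data, prototypes, labels):
    """Sum of the normalized margin terms over all training points."""
    return sum(glvq_term(x, y, prototypes, labels) for x, y in data)

# Hypothetical one-dimensional example with prototypes at 0 ("a") and 1 ("b").
protos, proto_labels = [[0.0], [1.0]], ["a", "b"]
well_classified = glvq_term([0.25], "a", protos, proto_labels)  # d+ = 0.0625, d- = 0.5625
misclassified = glvq_term([0.75], "a", protos, proto_labels)    # d+ = 0.5625, d- = 0.0625
total = glvq_cost([([0.25], "a"), ([0.75], "a")], protos, proto_labels)
```

Because of the normalization by d+ + d-, the cost is bounded, in contrast to the unbounded LVQ2.1 cost.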

(33)

Generalized LVQ (GLVQ)

derivatives

GLVQ

(34)

derivatives

(35)

derivatives

LVQ2.1

scaling

(36)

Probabilistic modeling

(37)

(38)

(39)

(40)

(41)

Take home

!

LVQ can be substantiated by large margin generalization

bounds (independent of dimensionality)

!

LVQ can be based on cost functions:

!

probabilistic modeling

!

excellent results

!

bandwidth is a very critical parameter (crisp limit does not perform well)

!

prototypes not always representative

!

margin maximization

!

very good results

!

parameters not critical

!

prototypes are representative for data

(42)

(43)

Why metric learning?

Example: acceptance of papers at some conference

L - layout, T - technical quality, I - interesting subject, F - famous author, S - appropriate subject, Q - overall quality, P - author registers for conference, E - appropriate length, B - likes beer, P - looks pretty, G - gives good talks, K - knows program committee, M - member of program committee, C - special session, R - has red hair

(44)

Why metric learning?

!

data are usually represented by feature vectors

!

feature vectors are compared using Euclidean distance

!

but this might tell you nothing useful

(42,42,42,0, ...)

smell head belly human

(41,43,44,1, ...)

(-41,43,44,1, ...)

(45)

(46)

(47)

Metric learning: Generalized relevance LVQ (GRLVQ)

!

minimize $\sum_i \frac{d_\lambda^+(x_i) - d_\lambda^-(x_i)}{d_\lambda^+(x_i) + d_\lambda^-(x_i)}$

where $d_\lambda(x, y) = \sum_l \lambda_l (x_l - y_l)^2$
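The effect of the relevance weights can be seen on the example vectors from the "Why metric learning?" slide, restricted here to their four listed components (an assumption, since the slide elides the rest). The weight vectors `lam` below are hypothetical; in GRLVQ they are adapted by gradient descent together with the prototypes.

```python
def relevance_sq_dist(x, y, lam):
    """d_lambda(x, y) = sum_l lam_l * (x_l - y_l)^2."""
    return sum(l * (xi - yi) ** 2 for l, xi, yi in zip(lam, x, y))

a = [42, 42, 42, 0]
b = [41, 43, 44, 1]
c = [-41, 43, 44, 1]

uniform = [1, 1, 1, 1]          # plain squared Euclidean distance
ignore_first = [0, 0.5, 0.5, 0]  # hypothetical relevance profile

d_ab_uniform = relevance_sq_dist(a, b, uniform)
d_ac_uniform = relevance_sq_dist(a, c, uniform)
d_ab_weighted = relevance_sq_dist(a, b, ignore_first)
d_ac_weighted = relevance_sq_dist(a, c, ignore_first)
```

With uniform weights the first component dominates the comparison of `a` and `c`; once its relevance is set to zero, `b` and `c` are equally far from `a`.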

(48)

(49)

GRLVQ

!

mathematical objective: $\min \sum_i \frac{d_\lambda^+(x_i) - d_\lambda^-(x_i)}{d_\lambda^+(x_i) + d_\lambda^-(x_i)}$

derivatives

LVQ2.1

scaling

relevance update

(50)

GRLVQ

(51)

Generalized Matrix LVQ (GMLVQ)

(52)

(53)

(54)

Interpretability: Steroid metabolomics

(55)

(56)

GMLVQ

… yields (local) matrices, i.e. (local) scaling and

rotations of the space

GRLVQ: global scaling

GMLVQ: global scaling and rotation

LGMLVQ: local scaling and rotation

(57)

!

GMLVQ with positive semidefinite matrices:

quadratic complexity w.r.t. data dimensionality

(58)

!

GMLVQ with positive semidefinite low rank matrices:

linear complexity w.r.t. data dimensionality

equivalent to full version (if data are intrinsically low dimensional)
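The complexity statement can be illustrated with a small sketch. It assumes the usual GMLVQ parameterization of the metric as Λ = Ωᵀ Ω with an m x n matrix Ω (the symbols are not part of the extracted slide text), so that the distance is the squared length of the projected difference and costs O(m·n) per evaluation.

```python
def project(omega, v):
    """Apply the m x n matrix omega (given as a list of rows) to an n-vector."""
    return [sum(o * vi for o, vi in zip(row, v)) for row in omega]

def matrix_sq_dist(omega, x, w):
    """d(x, w) = (x - w)^T Omega^T Omega (x - w) = ||Omega (x - w)||^2."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    return sum(p * p for p in project(omega, diff))

# Hypothetical rank-2 matrix on 3-dimensional data: the third feature is ignored.
omega = [[1.0, 0.0, 0.0],
         [0.0, 2.0, 0.0]]
w = [0.0, 0.0, 0.0]
d1 = matrix_sq_dist(omega, [1.0, 1.0, 5.0], w)
d2 = matrix_sq_dist(omega, [1.0, 1.0, -7.0], w)
embedding = project(omega, [1.0, 1.0, 5.0])  # 2-D coordinates, usable for plotting
```

With m = 2 or m = 3 rows, the same projection that defines the metric also gives coordinates for a discriminative visualization of the data.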

(59)

(60)

LiRamLVQ

[Figure labels: global, global, local, global]

induces global projection:

f: x → * x

(61)

Discriminative visualization

(62)

(63)

Stationary solutions of GMLVQ

!

assume fixed receptive fields, what is the optimum metric?

!

update of matrix has the form (prefactor indicates sign):

(x centered in prototype) plus normalization

!

similar to von Mises iteration

!

converges to first eigenvector of
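For readers unfamiliar with the von Mises (power) iteration, here is a generic sketch: repeated multiplication by a matrix followed by normalization converges to the eigenvector of the dominant eigenvalue. The matrix below is an arbitrary illustration and not the matrix that governs the GMLVQ update, which the slide text does not give.

```python
import math

def matvec(a, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def power_iteration(a, iters=100):
    """Return (eigenvalue estimate, unit eigenvector estimate) for the dominant eigenpair."""
    v = [1.0] * len(a)
    for _ in range(iters):
        v = matvec(a, v)
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]  # normalization step after each update
    eigenvalue = sum(vi * wi for vi, wi in zip(v, matvec(a, v)))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical symmetric matrix with eigenvalues 2 and 1.
eigenvalue, eigenvector = power_iteration([[2.0, 0.0], [0.0, 1.0]])
```

The sketch assumes the start vector is not orthogonal to the dominant eigenvector and that the dominant eigenvalue is unique in magnitude.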

(64)

Stationary solution

contributes with +

(65)

(66)

Interpretation of matrix terms

infra-red spectral data: 124 wine samples, 256 wavelengths

30 training data, 94 test spectra

alcohol content: high, low, medium

(67)

!

often: diagonal terms are interpreted as relevance

!

problem: for high dimensional data

holds for all matrices with differences in the null space of

(68)

!

dividing out null space yields the profile

!

direct interpretation of relevance profile misleading for high

dim data, get rid of null space first!

(69)

GMLVQ

best performance

7 dimensions remaining

over-fitting

effect

null-space correction

P=30 dimensions

(70)

Take home

!

metric adaptation:

!

increases accuracy

!

does not deteriorate its generalization ability

!

low rank matrix:

!

allows efficient training

!

data visualization

!

no restriction as compared to optimum metric

!

interpretation:

!

by looking at feature weighting,

(71)

Schneider, Biehl, Hammer

(72)

(73)

Dissimilarity or similarity data

!

feature extraction → vectorial data

size softness color curvature ... → (20,7,...)

!

pairwise (dis)similarity measurement → (dis)similarity matrix

0    0.2  0.8
0.2  0    0.8
0.8  0.8  0

(74)

(Dis)similarity data

GTTACAGGT

GTGACAAGT

GGTACACGT

!

(dis)similarity measures, e.g.:

1. Alignment

2. Normalized Compression Distance

3. Graph structure kernels

(75)

LVQ for dis-/similarities

!

kernel GLVQ (Suganthan et al.)

!

differentiable kernel GLVQ (Villmann et al.)

!

relational GLVQ/SRLVQ (Xibin et al.)

!

kernel SRLVQ (Hofmann et al.)

(76)

Relational GLVQ

Assumption: Prototypes are expressed as linear combinations $w_i = \sum_j \alpha_{ij} x_j$ with normalized coefficients $\alpha_{ij}$

Fact: for every symmetric bilinear form and linear representation as above we find

$\|x_j - w_i\|^2 = (D \cdot \alpha_i)_j - \frac{1}{2} \cdot \alpha_i^T \cdot D \cdot \alpha_i$

Method: Substitute all terms in original methods and use $\|x_j - w_i\|^2$

[Figure: example coefficients 0.6, 0.1, 0.2, 0.05]

Relational GLVQ

assume prototypes have the form

then

GLVQ costs become

(78)

(79)

Similarities/dissimilarities

euclid: $\|\vec{x}_i - \vec{x}_j\|^2$, $\langle \vec{x}_i, \vec{x}_j \rangle$

general: $d_{ij} = d(x_i, x_j)$, $s_{ij} = s(x_i, x_j)$

assumption:

symmetric: $d_{ij} = d_{ji}$, $s_{ij} = s_{ji}$

zero diagonal: $d_{ii} = 0$

normalization of $s$ is possible: $s_{ii} = 1$

(80)

(81)

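Similarities/dissimilarities

euclid: $\|\vec{x}_i - \vec{x}_j\|^2$, $\langle \vec{x}_i, \vec{x}_j \rangle$

general: $d_{ij} = d(x_i, x_j)$, $s_{ij} = s(x_i, x_j)$

$s_{ij} = -\frac{1}{2} \left( d_{ij} - \frac{1}{n} \sum_l d_{il} - \frac{1}{n} \sum_l d_{lj} + \frac{1}{n^2} \sum_{l,l'} d_{ll'} \right)$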
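The conversion from dissimilarities to similarities (double centering) can be checked numerically. The sketch below uses a hypothetical example of three points on a line with squared Euclidean distances; the function name is an assumption.

```python
def double_centering(d):
    """s_ij = -1/2 * (d_ij - mean_l d_il - mean_l d_lj + mean_{l,l'} d_ll')."""
    n = len(d)
    row_mean = [sum(row) / n for row in d]
    col_mean = [sum(d[l][j] for l in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (d[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

# Squared distances between the points 0, 1 and 3 on the real line.
D = [[0.0, 1.0, 9.0],
     [1.0, 0.0, 4.0],
     [9.0, 4.0, 0.0]]
S = double_centering(D)

# The dissimilarities are recovered via d_ij = s_ii - 2 s_ij + s_jj.
recovered = [[S[i][i] - 2.0 * S[i][j] + S[j][j] for j in range(3)] for i in range(3)]
```

For Euclidean data, `S` is the Gram matrix of the mean-centered points; here the centered points are -4/3, -1/3 and 5/3, so `S[0][0]` equals 16/9.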

(82)

Pseudo-euclidean embedding

pseudo-euclid:

$\|\vec{x}_i - \vec{x}_j\|^2_{pq} = \|\vec{x}^1_i - \vec{x}^1_j\|^2 - \|\vec{x}^2_i - \vec{x}^2_j\|^2$

$\langle \vec{x}_i, \vec{x}_j \rangle_{pq} = \langle \vec{x}^1_i, \vec{x}^1_j \rangle - \langle \vec{x}^2_i, \vec{x}^2_j \rangle$

signature $(p, q, n - p - q)$

general:

$d_{ij} = d(x_i, x_j)$

$s_{ij} = s(x_i, x_j)$

(83)

Pseudo-Euclidean Space

For every symmetric

D

a vector space embedding in pseudo-Euclidean

space exists; symmetric bilinear form induces dissimilarities

+1

-1

P1=(6.1,1)

P4=(-0.1,0)

P3=(0.1,0)

P6=(-4,-1)

P5=(4,-1)

P2=(-6.1,1)

(84)

classification based on

$\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$

training optimizes

$f\left( \left( \|\vec{x}_i - \vec{w}_j\|^2 \right)_{i,j} \right)$

(85)

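$\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$

training optimizes

$f\left( \left( \|\vec{x}_i - \vec{w}_j\|^2 \right)_{i,j} \right)$

prototypes as linear combinations

$\vec{w}_j = \sum_i \alpha_{ji} \vec{x}_i$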

(86)

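$\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$

training optimizes

$f\left( \left( \|\vec{x}_i - \vec{w}_j\|^2 \right)_{i,j} \right)$

kernel approach

$\|\vec{x}_i - \vec{w}_j\|^2 = s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'}$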

(87)

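$\|\vec{x}_i - \vec{w}_j\|^2 = \|\vec{x}_i\|^2 - 2 \langle \vec{x}_i, \vec{w}_j \rangle + \|\vec{w}_j\|^2$

training optimizes

$f\left( \left( \|\vec{x}_i - \vec{w}_j\|^2 \right)_{i,j} \right)$

relational approach

$\|\vec{x}_i - \vec{w}_j\|^2 = \sum_l \alpha_{jl} d_{il} - \frac{1}{2} \sum_{l,l'} \alpha_{jl} \alpha_{jl'} d_{ll'}$ for normalized $\alpha_{jl}$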
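Both the kernel and the relational expression can be verified against the explicit Euclidean distance on a small example. The points and coefficients below are hypothetical; the coefficients of the prototype sum to 1, which the relational formula requires.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def dot(a, b):
    """Euclidean inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel_sq_dist(s, alpha, i):
    """s_ii - 2 sum_l alpha_l s_il + sum_{l,l'} alpha_l alpha_l' s_ll'."""
    n = len(alpha)
    cross = sum(alpha[l] * s[i][l] for l in range(n))
    quad = sum(alpha[l] * alpha[k] * s[l][k] for l in range(n) for k in range(n))
    return s[i][i] - 2.0 * cross + quad

def relational_sq_dist(d, alpha, i):
    """sum_l alpha_l d_il - 1/2 sum_{l,l'} alpha_l alpha_l' d_ll'; needs sum(alpha) == 1."""
    n = len(alpha)
    lin = sum(alpha[l] * d[i][l] for l in range(n))
    quad = sum(alpha[l] * alpha[k] * d[l][k] for l in range(n) for k in range(n))
    return lin - 0.5 * quad

points = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
alpha = [0.5, 0.25, 0.25]  # coefficients of one prototype, summing to 1
w = [sum(a * p[k] for a, p in zip(alpha, points)) for k in range(2)]  # explicit prototype

S = [[dot(p, q) for q in points] for p in points]      # similarity (Gram) matrix
D = [[sq_dist(p, q) for q in points] for p in points]  # dissimilarity matrix

explicit = [sq_dist(p, w) for p in points]
via_kernel = [kernel_sq_dist(S, alpha, i) for i in range(3)]
via_relational = [relational_sq_dist(D, alpha, i) for i in range(3)]
```

All three lists agree (1.25, 3.25, 9.25), although the kernel and relational versions never touch the vectors themselves, only the matrices `S` and `D`.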

(88)

optimize:

$f\left( \left( \sum_l \alpha_{jl} d_{il} - \frac{1}{2} \sum_{l,l'} \alpha_{jl} \alpha_{jl'} d_{ll'} \right)_{i,j} \right)$

$f\left( \left( s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'} \right)_{i,j} \right)$

gradient descent with respect to $\alpha_{jl}$ followed by normalization

!

relational GLVQ / SRLVQ

(89)

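gradient descent with respect to $\vec{w}_j = \sum_l \alpha_{jl} \vec{x}_l$:

$\frac{\partial}{\partial \vec{w}_j} f\left( \left( \|\vec{x}_i - \vec{w}_j\|^2 \right)_{i,j} \right) = -2 f' \cdot (\vec{x}_i - \vec{w}_j)$

hence: $\Delta \sum_l \alpha_{jl} \vec{x}_l \sim 2 f' \cdot \left( \vec{x}_i - \sum_l \alpha_{jl} \vec{x}_l \right)$

this can be decomposed into contributions of the coefficients

!

kernel GLVQ / SRLVQ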

(90)

GLVQ

similarities

gradient w.r.t. coefficients

RSLVQ

dissimilarities

gradient w.r.t. prototypes

only in the euclidean case:

kernel variants resemble gradient w.r.t w

large margin generalization bounds

(91)

(92)

Computational effort

Size of Matrix (Double Precision)

n         Size
5000      190MB
10,000    763MB
20,000    3.0GB
50,000    18.6GB
200,000   300.0GB
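The table entries follow from storing n² entries of 8 bytes each. The short sketch below reproduces them, assuming binary units (1 MB = 2^20 bytes, 1 GB = 2^30 bytes); under that assumption the last entry comes out at about 298 GB, which the slide rounds to 300.

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a full n x n matrix in double precision."""
    return n * n * bytes_per_entry

def mib(num_bytes):
    """Convert bytes to units of 2**20 bytes."""
    return num_bytes / 2 ** 20

def gib(num_bytes):
    """Convert bytes to units of 2**30 bytes."""
    return num_bytes / 2 ** 30

sizes = {
    5000: mib(dense_matrix_bytes(5000)),        # about 190.7
    10000: mib(dense_matrix_bytes(10000)),      # about 762.9
    20000: gib(dense_matrix_bytes(20000)),      # about 2.98
    50000: gib(dense_matrix_bytes(50000)),      # about 18.6
    200000: gib(dense_matrix_bytes(200000)),    # about 298.0
}
```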

(93)

Computational effort?

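$\|\vec{x}_i - \vec{w}_j\|^2 = s_{ii} - 2 \sum_l \alpha_{jl} s_{il} + \sum_{l,l'} \alpha_{jl} \alpha_{jl'} s_{ll'} = e_i^t S e_i - 2 \cdot e_i^t S \alpha_j + \alpha_j^t S \alpha_j$

sample m landmarks only: $S_{m,m}$, $S_{m,n}$, $S_{n,m}$

approximate $S \approx S_{m,n} S_{m,m}^{-1} S_{n,m}$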
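The landmark approximation can be tried on a tiny example. The sketch below is written with explicit shapes (an n x m block, the inverse of the m x m landmark block, and an m x n block), which is an assumption about how the subscripts on the slide are meant; the helper functions and the data are hypothetical. The example similarity matrix has rank 2, so two landmarks reproduce it exactly.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting; assumes m is square and invertible."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def nystroem(s, landmarks):
    """Approximate s using only the rows and columns of the landmark indices."""
    n = len(s)
    s_nm = [[s[i][l] for l in landmarks] for i in range(n)]   # n x m
    s_mm = [[s[a][b] for b in landmarks] for a in landmarks]  # m x m
    s_mn = [[s[l][j] for j in range(n)] for l in landmarks]   # m x n
    return matmul(matmul(s_nm, inverse(s_mm)), s_mn)

# Inner-product similarities of four points in the plane (rank 2).
points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
S = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]
S_approx = nystroem(S, landmarks=[0, 1])
max_error = max(abs(S[i][j] - S_approx[i][j]) for i in range(4) for j in range(4))
```

The full n x n product is formed here only to check the error. In practice one would keep the factors and evaluate expressions such as $\alpha_j^t S \alpha_j$ factor by factor, which is where the linear complexity comes from.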

(94)

(95)

☺

(96)

Take home

!

there exist cool methods which enable the application of

LVQ for similarities / dissimilarities

!

quadratic complexity

!

Nystroem approximation for low rank data reduces to

linear complexity

!

metric adaptation possible in a similar way as for GMLVQ:

adapt w.r.t similarity/dissimilarity parameters (has been

(97)

(98)

Confidence measures

!

Certainty of a classification?

x

(99)

Conformal prediction

!

framework to accompany pointwise classification of online

methods by provable guarantees:

classifier trained on N (exchangeable) data

conformity measure yields possible labels

such that for a new point

it holds:

(100)

Conformal prediction

!

pick conformity measure, e.g.

!

induces two terms:

•

Credibility: how sure that a prediction is correct

•

Confidence: how sure that ALL OTHER labels are incorrect

[Figure labels: lower credibility, lower confidence, higher credibility, higher confidence]

.. any measure is valid, but

(101)

Conformal prediction algorithm

(102)

Simplified conformal prediction

given training data and new point

1. train the model on training data

2. compute nonconformity of training set

3. for every non conformity of is

4. compare values

5. output label with best r-value

credibility: largest r-value
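The simplified procedure can be sketched for a prototype classifier as follows. Several choices here are assumptions rather than the tutorial's definitions: the nonconformity measure d+/d- (distance to the closest prototype of the candidate label over distance to the closest prototype of any other label), the r-value computed as (count + 1)/(N + 1), and confidence taken as one minus the second-largest r-value. The prototypes stand in for an already trained model.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nonconformity(x, y, prototypes, labels):
    """d+ / d-: small when x is close to a prototype of class y (assumed measure)."""
    d_plus = min(sq_dist(x, w) for w, c in zip(prototypes, labels) if c == y)
    d_minus = min(sq_dist(x, w) for w, c in zip(prototypes, labels) if c != y)
    return d_plus / d_minus  # assumes x does not coincide with a wrong prototype

def conformal_predict(x, train, prototypes, labels):
    """Return (label, credibility, confidence, r-values) for a new point x."""
    scores = [nonconformity(xi, yi, prototypes, labels) for xi, yi in train]
    r = {}
    for y in sorted(set(labels)):
        a = nonconformity(x, y, prototypes, labels)
        # Share of training points that conform no better than (x, y).
        r[y] = (sum(1 for s in scores if s >= a) + 1) / (len(scores) + 1)
    ranked = sorted(r, key=r.get, reverse=True)
    credibility = r[ranked[0]]        # largest r-value
    confidence = 1.0 - r[ranked[1]]   # one minus second-largest r-value (assumed)
    return ranked[0], credibility, confidence, r

# Hypothetical trained model and training set in one dimension.
prototypes, labels = [[0.0], [1.0]], ["a", "b"]
train = [([0.1], "a"), ([0.2], "a"), ([0.9], "b"), ([0.7], "b")]
label, credibility, confidence, r_values = conformal_predict([0.15], train, prototypes, labels)
```

For the new point 0.15 the label "a" gets the larger r-value, so it is predicted, and the small r-value of "b" translates into a high confidence that the alternative label is wrong.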

(103)

(104)

Growing conformal semi-supervised LVQ

given labeled data and unlabeled data

init model with minimum number of prototypes

train model on

Loop:

predict confidence/credibility on and consider secure part

predict labels on based on secure part

add the part of with high confidence/credibility

identify regions with poor confidence/credibility for

generate new prototype

(105)

(106)

(107)

(108)

Take home

!

conformal prediction makes it possible to accompany classification results with confidence values

!

can be realised efficiently for LVQ based on distance

measures

!

allows incremental versions (also for relational setting,

semi-supervised training)

(109)

(110)

Literature

!

T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1997.

!

T. Kohonen. Learning vector quantization. In: M.A. Arbib, editor, The Handbook of Brain Theory and Neural Networks., pages 537–540. MIT Press, Cambridge, MA, 1995.

!

M. Biehl, B. Hammer, P. Schneider, T. Villmann, Metric Learning for Prototype-based Classification, in: Innovations in Neural Information Paradigms and Applications, M. Bianchini, M. Maggini, F. Scarselli, L.C. Jain (eds.), Springer Studies in Computational Intelligence, Vol 247 (2009), 183-199

!

M. Biehl, B. Hammer, F.-M. Schleif, P. Schneider, T. Villmann, Stationarity of Matrix Relevance Learning Vector Quantization, Machine Learning Reports 01/2009, Univ. Leipzig (2009)

!

M. Biehl, A. Ghosh, and B. Hammer, Dynamics and generalization ability of LVQ algorithms, J. Machine Learning Research 8 (Feb):323-360, 2007

!

W. Arlt, M. Biehl, A.E. Taylor, S. Hahner, R. Libe, B.A. Hughes, P. Schneider, D.J. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C.H.L. Shackleton, X. Bertagna, M. Fassnacht, P.M. Stewart: Urine steroid metabolomics as a biomarker tool for detecting malignancy in adrenal tumors. J. of Clinical Endocrinology & Metabolism 96: 3775-3784 (2011).

!

Frank-Michael Schleif, Thomas Villmann, Markus Kostrzewa, Barbara Hammer, Alexander Gammerman: Cancer informatics by prototype networks in mass spectrometry. Artificial Intelligence in Medicine 45(2-3): 215-228 (2009)

!

S. Kirstein, H. Wersing, H.-M. Gross, and E. Körner. A Life-Long Learning Vector Quantization Approach for Interactive Learning of Multiple Categories. Neural Networks 28:90-105 (2012).

!

Sambu Seo, Klaus Obermayer: Soft Learning Vector Quantization. Neural Computation 15(7): 1589-1604 (2003)

!

Barbara Hammer, Daniela Hofmann, Frank-Michael Schleif, Xibin Zhu: Learning vector quantization for (dis-)similarities. Neurocomputing (IJON) 131:43-51 (2014)

!

Marc Strickert, Barbara Hammer, Thomas Villmann, Michael Biehl: Regularization and improved interpretation of linear data mappings and adaptive distance measures. CIDM 2013:10-17

(111)

Literature

!

B. Mokbel, B. Paassen, and B. Hammer. Adaptive distance measures for sequential data. In Michel Verleysen, editor, ESANN, pages 265–270, 2014.

!

Daniela Hofmann, Frank-Michael Schleif, Benjamin Paaßen, and Barbara Hammer. Learning interpretable kernelized prototype-based models. Neurocomputing, accepted, 2013.

!

Xibin Zhu, Frank-Michael Schleif, and Barbara Hammer. Semi-supervised vector quantization for proximity data. In ESANN, pages 89–94, 2013.

!

Frank-Michael Schleif, Xibin Zhu, and Barbara Hammer. Sparse conformal prediction for dissimilarity data. Annals of Mathematics and Artificial Intelligence (AMAI), 2014.

!

Xibin Zhu, Frank-Michael Schleif, and Barbara Hammer. Patch processing for relational learning vector quantization. In Jun Wang, Gary G. Yen, and Marios M. Polycarpou, editors, Advances in Neural Networks - ISNN 2012 - 9th International Symposium on Neural Networks, Shenyang, China, July 11-14, 2012. Proceedings, Part I, volume 7367, pages 55–63. Springer, 2012.

!

Andrej Gisbrecht, Bassam Mokbel, Frank-Michael Schleif, Xibin Zhu, and Barbara Hammer. Linear time relational prototype based learning. Int. J. Neural Syst., 22(5), 2012.

!

Kerstin Bunte, Petra Schneider, Barbara Hammer, Frank-Michael Schleif, Thomas Villmann, and Michael Biehl. Limited rank matrix learning, discriminative dimension reduction and visualization. Neural Networks, 26:159–173, 2012.

!

P. Schneider, K. Bunte, H. Stiekema, B. Hammer, T. Villmann, and M. Biehl. Regularization in matrix relevance learning. IEEE Transactions on Neural Networks, 21:831–840, 2010.

!

Petra Schneider, Michael Biehl, Barbara Hammer: Adaptive Relevance Matrices in Learning Vector Quantization. Neural Computation 21(12): 3532-3561 (2009)

!

Koby Crammer, Ran Gilad-Bachrach, Amir Navot, Naftali Tishby: Margin Analysis of the LVQ Algorithm. NIPS 2002: 462-469