Data visualization using nonlinear dimensionality reduction techniques: method review and quality assessment John A. Lee Michel Verleysen

Academic year: 2021

(1)

Data visualization using nonlinear dimensionality reduction techniques:

method review and quality assessment

John A. Lee

Michel Verleysen

Machine Learning Group,

Université catholique de Louvain Louvain-la-Neuve, Belgium

(2)

How can we detect structure in data?

 Hopefully data convey some information…

 Informal definition of ‘structure’:

 We assume that we have vectorial data in some space

 General ‘probabilistic’ model:

• Data are distributed w.r.t. some distribution

 Two particular cases:

• Manifold data

• Clustered data

.

(3)

How can we detect structure in data?

 Two main solutions

 Visualize data

(the user’s eyes play a central part)

• Data are left unchanged

• Many views are proposed

• Interactivity is inherent

Examples:

• Scatter plots

• Projection pursuit

• …

 Represent data

(the software does a data processing job)

• Data are appropriately modified

• A single interesting representation is to be found

→ (nonlinear) dimensionality reduction

(4)

High-dimensional spaces

 The curse of dimensionality

 Empty space phenomenon

(function approximation requires an exponential number of points)

 Norm concentration phenomenon

(distances in a normal distribution have a chi distribution)

 Unexpected consequences

 A hypercube looks like a sea urchin (many spiky corners!)

 Hypercube corners collapse towards the center in any projection

 The volume of a unit hypersphere tends to zero

 The sphere volume concentrates in a thin shell

 Tails of a Gaussian get heavier than the central bell

 Dimensionality reduction can hopefully address some of those issues…

[Figure: 3D → 2D]
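The norm concentration phenomenon listed above can be observed numerically. The following is a minimal sketch, assuming standard Gaussian samples and NumPy; the function name `relative_norm_spread` is hypothetical. It measures how the relative spread of the Euclidean norms shrinks as the dimension D grows.

```python
import numpy as np

def relative_norm_spread(dim, n_samples=2000, seed=0):
    """Return std/mean of the Euclidean norms of standard Gaussian samples."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_samples, dim)), axis=1)
    return norms.std() / norms.mean()

# The norms follow a chi distribution with `dim` degrees of freedom.
spreads = {dim: relative_norm_spread(dim) for dim in (2, 10, 100, 1000)}
for dim, spread in spreads.items():
    print(f"D = {dim:4d}: std/mean of norms = {spread:.3f}")
```

The ratio decreases with D: in high dimensions all points appear to lie at nearly the same distance from the centre.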

(5)

The manifold hypothesis

 The key idea behind dimensionality reduction

 Data live in a D-dimensional space

 Data lie on some P-dimensional subspace

Usual hypothesis: the subspace is a smooth manifold

 The manifold can be

 A linear subspace

 Any other function of some latent variables

 Dimensionality reduction aims at

 Inverting the latent variable mapping

 Unfolding the manifold (topology allows us to ‘deform’ it)

 An appropriate noise model makes the connection with the general probabilistic model

 In practice:

 P is unknown → estimator of the intrinsic dimensionality

(6)

Estimator of the intrinsic dimensionality

 General idea: estimate the fractal dimension

 Box counting (or capacity dimension)

 Create bins of width ε along each dimension

 Data sampled on a P-dimensional manifold occupy $N(\varepsilon) \approx \alpha\,\varepsilon^{-P}$ boxes

 Compute the slope in a log-log diagram of N(ε) w.r.t. ε (the slope is −P)

 Simple but

• Subjective method (slope estimation at some scale)

• Not robust against noise

• Computationally expensive
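Box counting can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `box_counting_dimension`, the range of box widths, and the test data (a straight segment in 3D) are all choices made here.

```python
import numpy as np

def box_counting_dimension(points, widths):
    """Estimate the capacity dimension as minus the log-log slope of N(eps)."""
    counts = []
    for eps in widths:
        # Index of the box containing each point, one integer per coordinate.
        boxes = np.floor(points / eps).astype(np.int64)
        counts.append(len(np.unique(boxes, axis=0)))
    slope, _ = np.polyfit(np.log(widths), np.log(counts), 1)
    return -slope

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=20000)
# A straight segment (P = 1) embedded in a 3-dimensional space.
segment = np.outer(t, np.array([1.0, 0.7, 0.4]))
widths = np.logspace(-2, -1, 8)
estimate = box_counting_dimension(segment, widths)
print(f"estimated dimension: {estimate:.2f}")
```

The estimate depends on the chosen range of widths, which is the subjectivity mentioned above.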

.

(7)

Estimator of the intrinsic dimensionality

 Correlation dimension

 Any datum of a P-dimensional manifold is surrounded by

$C_2(\varepsilon) \approx \alpha\,\varepsilon^{P}$ neighbours, where ε is a small neighbourhood radius

 Compute the slope of the correlation sum in a log-log diagram

.

[Figure: noisy spiral, and log-log plot of its correlation sum; slope ≈ intrinsic dimensionality]
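The correlation dimension can be estimated in the same spirit. The sketch below assumes a brute-force computation of all pairwise distances and a hand-picked range of radii; the names `correlation_sum` and `correlation_dimension` and the flat test sheet are illustrative choices.

```python
import numpy as np

def correlation_sum(points, radii):
    """Fraction of point pairs closer than each radius."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dist = dist[np.triu_indices(len(points), k=1)]
    return np.array([(pair_dist < eps).mean() for eps in radii])

def correlation_dimension(points, radii):
    """Slope of the correlation sum in a log-log diagram."""
    log_sum = np.log(correlation_sum(points, radii))
    slope, _ = np.polyfit(np.log(radii), log_sum, 1)
    return slope

rng = np.random.default_rng(0)
uv = rng.uniform(0.0, 1.0, size=(1000, 2))
# A flat 2-dimensional sheet (P = 2) embedded in a 3-dimensional space.
sheet = np.column_stack([uv[:, 0], uv[:, 1], 0.5 * uv[:, 0] + 0.2 * uv[:, 1]])
radii = np.logspace(-1.5, -1.0, 6)
estimate = correlation_dimension(sheet, radii)
print(f"estimated dimension: {estimate:.2f}")
```

Boundary effects pull the slope slightly below 2 on this finite sheet.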

(8)

Estimator of the intrinsic dimensionality

 Other techniques

 Local PCAs

• Split manifold into small patches

• Manifold is locally linear

→ Apply PCA on each patch

 Trial-and-error:

• Pick an appropriate DR method

• Run it for P = 1, …, D and record the value E*(P) of the cost function after optimisation

• Draw the curve E*(P) w.r.t. P and detect its elbow

.

[Figure: curve of E*(P) w.r.t. P]

(9)

Historical review of some NLDR methods

 Principal component analysis

 Classical metric multidimensional scaling

 Stress-based MDS & Sammon mapping

 Nonmetric multidimensional scaling

 Self-organizing map

 Auto-encoder

 Curvilinear component analysis

 Spectral methods

 Kernel PCA

 Isomap

 Locally linear embedding

 Laplacian eigenmaps

 Maximum variance unfolding

 Similarity-based embedding

 Stochastic neighbor embedding

 Simbed & CCA revisited

Time 1900 1950 1965 1980 1990 1995 1996 2000 2003 2009

(10)

A technical slide… (some reminders)

(11)

Yet another bad guy…

(12)

Jamais deux sans trois (never 2 w/o 3)

(13)

Principal component analysis

 Pearson, 1901; Hotelling, 1933; Karhunen, 1946; Loève, 1948.

 Idea

 Decorrelate zero-mean data

 Keep large variance axes

→ Fit a plane through the data cloud and project

 Details (maximise projected variance)

.

(14)

Principal component analysis

 Details (minimise the reconstruction error)

.

(15)

Principal component analysis

 Implementation

 Center data by removing the sample mean

 Multiply data set with top eigenvectors of the sample covariance matrix

 Illustration

 Salient features

 Spectral method

• Incremental embeddings

• Estimator of the intrinsic dimensionality

• (covariance eigenvalues = variance along the projection axes)

 Parametric mapping model
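The two implementation steps above (centring, then projection onto the top eigenvectors of the sample covariance matrix) can be sketched as follows. The function name `pca` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca(data, n_components):
    """Project centred data onto the top eigenvectors of the sample covariance."""
    centred = data - data.mean(axis=0)
    covariance = centred.T @ centred / (len(data) - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    return centred @ eigenvectors[:, :n_components], eigenvalues

rng = np.random.default_rng(0)
# Two latent variables linearly mixed into five dimensions, plus small noise.
latent = rng.standard_normal((500, 2)) * np.array([3.0, 1.0])
mixing = rng.standard_normal((2, 5))
data = latent @ mixing + 0.01 * rng.standard_normal((500, 5))
embedding, eigenvalues = pca(data, n_components=2)
print(np.round(eigenvalues, 4))
```

The eigenvalues equal the variances along the projection axes, so the sharp drop after the second one hints at an intrinsic dimensionality of 2 for this linear example.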

(16)

Classical metric multidimensional scaling

 Young & Householder, 1938; Torgerson, 1952.

 Idea

 Fit a plane through the data cloud and project

 Inner product preservation (≈ distance preservation)

 Details

.

(17)

Classical metric multidimensional scaling

 Details (cont’d)

.

(18)

Classical metric multidimensional scaling

 Implementation

 ‘Double centering’:

• It converts distances into inner products

• It indirectly cancels the sample mean in the Gram matrix

 Eigenvalue decomposition of the centered Gram matrix

 Scaled top eigenvectors provide projected coordinates

 Salient features

 Provides same solution as PCA iff dissimilarity = Eucl. distance

 Nonparametric model

(Out-of-sample extension is possible with Nyström formula)
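The implementation steps of classical metric MDS can be sketched as follows, assuming the input is a matrix of squared Euclidean distances. The helper names `classical_mds` and `squared_distances` are hypothetical.

```python
import numpy as np

def classical_mds(sq_dist, n_components):
    """Classical metric MDS from a matrix of squared pairwise distances."""
    n = len(sq_dist)
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ sq_dist @ centring       # double centring
    eigenvalues, eigenvectors = np.linalg.eigh(gram)  # ascending order
    top = np.argsort(eigenvalues)[::-1][:n_components]
    # Scale the top eigenvectors by the square roots of their eigenvalues.
    return eigenvectors[:, top] * np.sqrt(np.maximum(eigenvalues[top], 0.0))

def squared_distances(points):
    sq_norms = (points ** 2).sum(axis=1)
    gram = points @ points.T
    return np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * gram, 0.0)

rng = np.random.default_rng(0)
points = rng.standard_normal((60, 3))
embedding = classical_mds(squared_distances(points), n_components=3)
```

With Euclidean distances and as many components as the data dimension, the embedding reproduces the original pairwise distances, which illustrates the equivalence with PCA up to a rotation.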

.

(19)

Stress-based MDS & Sammon mapping

 Kruskal, 1964; Sammon, 1969; de Leeuw, 1977.

 Idea

 True distance preservation, quantified by a cost function

 Particular case of stress-based MDS

 Details

 Distances:

 Objective functions:

• ‘Strain’

• ‘Stress’

• Sammon’s stress

.

(20)

Stress-based MDS & Sammon mapping

 Implementation

 Steepest descent of the stress function (Kruskal, 1964)

 Pseudo-Newton minimization of the stress function

(diagonal approximation of the Hessian; used in Sammon, 1969)

 SMaCoF for weighted stress

(scaling by majorizing a complex function; de Leeuw, 1977)

 Salient features

 Nonparametric mapping

 Main metaparameter: distance weights wij

 How can we choose them?

→ Give more importance to small distances

→ Pick a decreasing function of distance δij such as in Sammon mapping

 Sammon mapping has almost no metaparameters

 Any distance in the high-dim space can be used (e.g. geodesic distances; see Isomap)

 Optimization procedure can get stuck in local minima
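As an illustration, Sammon's stress and a simple minimisation can be sketched as below. This is a sketch under assumptions: it uses the usual definition of Sammon's stress (squared distance errors weighted by 1/δij, normalised by the sum of the δij) and plain gradient descent with step halving, not the pseudo-Newton scheme of Sammon (1969). All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(delta, embedding):
    """Sammon's stress: squared distance errors weighted by 1 / delta_ij."""
    upper = np.triu_indices(len(delta), k=1)
    d = pairwise_distances(embedding)[upper]
    return ((delta[upper] - d) ** 2 / delta[upper]).sum() / delta[upper].sum()

def sammon_gradient(delta, embedding):
    d = pairwise_distances(embedding)
    np.fill_diagonal(d, 1.0)                  # avoid 0/0 on the diagonal
    safe_delta = delta + np.eye(len(delta))
    coeff = (safe_delta - d) / (safe_delta * d)
    np.fill_diagonal(coeff, 0.0)
    diff = embedding[:, None, :] - embedding[None, :, :]
    scale = delta[np.triu_indices(len(delta), k=1)].sum()
    return -2.0 / scale * (coeff[:, :, None] * diff).sum(axis=1)

def sammon_mapping(delta, init, n_iter=200, step=20.0):
    """Gradient descent; the step is halved whenever the stress would increase."""
    embedding, stress = init.copy(), sammon_stress(delta, init)
    for _ in range(n_iter):
        candidate = embedding - step * sammon_gradient(delta, embedding)
        candidate_stress = sammon_stress(delta, candidate)
        if candidate_stress < stress:
            embedding, stress = candidate, candidate_stress
        else:
            step *= 0.5
    return embedding, stress

rng = np.random.default_rng(0)
points = rng.standard_normal((40, 3))
delta = pairwise_distances(points)
init = rng.standard_normal((40, 2))
initial_stress = sammon_stress(delta, init)
embedding, final_stress = sammon_mapping(delta, init)
print(f"stress: {initial_stress:.4f} -> {final_stress:.4f}")
```

A different random initialisation may end in a different local minimum, as noted above.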

(21)

Nonmetric multidimensional scaling

 Shepard, 1962; Kruskal, 1964.

 Idea

 Stress-based MDS for ordinal (nonmetric) data

 Try to preserve monotonically transformed distances (and optimise the transformation)

 Details

 Cost function

 Implementation

 Monotone regression

 Salient features

 Ad hoc optimization

 Nonparametric model

(22)

Self-organizing map

 von der Malsburg, 1973; Kohonen, 1982.

 Idea

 Biological inspiration (brain cortex)

 Nonlinear version of PCA

• Replace PCA plane with an articulated grid

• Fit the grid through the data cloud

(≈ K-means with a priori topology and ‘winner takes most’ rule)

 Details

 A grid is defined in the low-dim space: and

 Grid nodes have high-dim coordinates as well:

 The high-dim coordinates are updated in an adaptive procedure

(at each epoch, all data vectors are presented 1 by 1 in random order):

• Best matching node:

• Coordinate update:
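The adaptive procedure can be sketched as follows. Since the formulas are not reproduced here, the sketch assumes the common choices: a Gaussian neighbourhood function of the grid distance with width λ, a learning rate α, and exponential decay laws for both. The 1-D grid, the names `train_som` and `quantization_error`, and all default values are assumptions.

```python
import numpy as np

def train_som(data, n_nodes=10, n_epochs=20, alpha0=0.5, lambda0=3.0, seed=0):
    """Online SOM with a 1-D grid; alpha and lambda decay exponentially."""
    rng = np.random.default_rng(seed)
    grid = np.arange(n_nodes, dtype=float)    # fixed low-dim coordinates
    noise = 1e-3 * rng.standard_normal((n_nodes, data.shape[1]))
    prototypes = data.mean(axis=0) + noise    # high-dim coordinates
    for epoch in range(n_epochs):
        decay = epoch / max(n_epochs - 1, 1)
        alpha = alpha0 * (0.01 / alpha0) ** decay   # learning rate
        lam = lambda0 * (0.3 / lambda0) ** decay    # neighbourhood width
        # All data vectors are presented one by one in random order.
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(((prototypes - x) ** 2).sum(axis=1))
            neighbourhood = np.exp(-0.5 * ((grid - grid[winner]) / lam) ** 2)
            # 'Winner takes most': every node moves, the winner moves the most.
            prototypes += alpha * neighbourhood[:, None] * (x - prototypes)
    return prototypes

def quantization_error(data, prototypes):
    sq = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq.min(axis=1)).mean()

rng = np.random.default_rng(1)
angle = rng.uniform(0.0, np.pi, size=300)
arc = np.column_stack([np.cos(angle), np.sin(angle)])
data = arc + 0.02 * rng.standard_normal((300, 2))
prototypes = train_som(data)
print(f"quantization error: {quantization_error(data, prototypes):.3f}")
```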

.

(23)

Self-organizing maps

 Illustrations in the high-dim space (cactus dataset)

. Epochs

(24)

Self-organizing map

 Visualisations in the grid space

 Salient features

 Nonparametric model

 Many metaparameters: grid topology and decay laws for α and λ

 Performs a vector quantization

 Batch (non-adaptive) versions exist

 Popular in visualization and exploratory data analysis

 Low-dim coordinates are fixed…

 … but principle can be ‘reversed’ → Isotop, XOM

(25)

Auto-encoder

 Kramer, 1991; DeMers & Cottrell, 1993;

Hinton & Salakhutdinov, 2006.

 Idea

 Based on the TLS reconstruction error like PCA

 Cascaded codec with a ‘bottleneck’ (as in an hourglass)

 Replace PCA linear mapping with a nonlinear one

 Details

 Depends on chosen function approximator

(often a feed-forward ANN such as a multilayer perceptron)

 Implementation

 Apply the learning procedure to the cascaded networks

 Catch output value of the bottleneck layer

 Salient features

 Parametric model (out-of-sample extension is straightforward)

 Provides both backward and forward mapping

 The cascaded networks have a ‘deep architecture’

→ learning can be inefficient

Solution: initialize backpropagation with restricted Boltzmann machines

(26)

Auto-encoder

Original figure in Kramer, 1991.

Original figure in Salakhutdinov, 2006.

(27)

Curvilinear component analysis

 Demartines & Hérault, 1995.

 Idea

 Distance preservation

 Change Sammon weighting scheme

(use decreasing function of the low-dim distance

instead of decreasing function of the high-dim distance)

 Cost function

 Implementation

 Stochastic gradient descent (or ‘pin-point’ radial update)

 Salient features

 Nonparametric mapping

 Metaparameters are the decay laws of α and λ

 Can be used with geodesic distances (Lee & Verleysen, 2000)

 Able to ‘tear’ manifolds

(28)

CCA can tear manifolds

Sammon NLM CCA

Sphere

(29)

Historical review of some NLDR methods

 Principal component analysis

 Classical metric multidimensional scaling

 Stress-based MDS & Sammon mapping

 Nonmetric multidimensional scaling

 Self-organizing map

 Auto-encoder

 Curvilinear component analysis

 Spectral methods

 Kernel PCA

 Isomap

 Locally linear embedding

 Laplacian eigenmaps

 Maximum variance unfolding

 Similarity-based embedding

 Stochastic neighbor embedding

 Simbed & CCA revisited

Time 1900 1950 1965 1980 1990 1995 1996 2000 2003 2009

(30)

Kernel PCA

 Schölkopf, Smola & Müller, 1996.

 Idea

 Apply ‘kernel trick’ to classical metric MDS (and not to PCA!)

 Apply MDS in an (unknown) ‘feature space’

 Details

.

(31)

Kernel PCA

 Implementation

 Compute the kernel matrix K

(starting from pairwise distances or inner products)

 Perform ‘double centering’ of K

 Run classical metric MDS on centered K

 Kernels from kernel…

 Salient features

 Nonparametric mapping (Nyström formula can be used)

 Choice of the kernel? How to adjust its parameter(s)?

 Important milestone in the history of spectral embedding
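The three implementation steps can be sketched as follows, assuming a Gaussian kernel whose width is picked by hand. The names `kernel_pca` and `gaussian_kernel` are hypothetical.

```python
import numpy as np

def kernel_pca(kernel, n_components):
    """Double-centre a kernel matrix and embed with its scaled top eigenvectors."""
    n = len(kernel)
    centring = np.eye(n) - np.ones((n, n)) / n
    centred = centring @ kernel @ centring
    eigenvalues, eigenvectors = np.linalg.eigh(centred)  # ascending order
    top = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.maximum(eigenvalues[top], 0.0))
    return eigenvectors[:, top] * scale, centred

def gaussian_kernel(points, width):
    sq_norms = (points ** 2).sum(axis=1)
    gram = points @ points.T
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * gram, 0.0)
    return np.exp(-sq_dist / (2.0 * width ** 2))

rng = np.random.default_rng(0)
points = rng.standard_normal((80, 3))
embedding, centred = kernel_pca(gaussian_kernel(points, width=1.5), n_components=2)
# With a linear kernel (plain inner products), the method reduces to classical MDS.
linear_embedding, _ = kernel_pca(points @ points.T, n_components=3)
```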

(32)

Isomap

 Tenenbaum, 1998, 2000.

 Idea

 Apply classical metric MDS with a ‘smart metric’

 Replace Euclidean distance with geodesic distance

 Data-driven approximation of the geodesic distances with shortest paths in a graph of K-ary neighbourhoods

.

Original figure in Tenenbaum, 2000.

(33)

Isomap

 Model

 Classical metric MDS is optimal for linear manifold

→ Isomap is optimal for Euclidean manifolds (a P-dimensional manifold is Euclidean iff

it is isometric to a P-dimensional Euclidean space)

 Implementation

 Compute/collect pairwise distances

 Compute graph of K-ary neighbourhoods

 Compute the weighted shortest paths in the graph

 Apply classical metric MDS on the pairwise geodesic distances

 Salient features

 Nonparametric mapping (Nyström not applicable because…)

 Double-centred Gram matrix is not positive semidefinite (but fortunately not far from being so)

 Manifold must be convex

 Parameter K is critical with noisy data (‘short circuits’)
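The four implementation steps can be sketched as below. This is an illustrative sketch: it uses dense matrices and the Floyd-Warshall algorithm for the shortest paths, which is simple but cubic in the number of points. The function name `isomap` and the test arc are assumptions.

```python
import numpy as np

def isomap(points, n_neighbours, n_components):
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Graph of K-ary neighbourhoods, weighted by Euclidean edge lengths.
    graph = np.full((n, n), np.inf)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbours + 1]
    rows = np.repeat(np.arange(n), n_neighbours)
    graph[rows, nearest.ravel()] = dist[rows, nearest.ravel()]
    graph = np.minimum(graph, graph.T)        # make the graph undirected
    np.fill_diagonal(graph, 0.0)
    # Floyd-Warshall shortest paths approximate the geodesic distances.
    for k in range(n):
        graph = np.minimum(graph, graph[:, k:k + 1] + graph[k:k + 1, :])
    if np.isinf(graph).any():
        raise ValueError("disconnected neighbourhood graph; increase n_neighbours")
    # Classical metric MDS on the pairwise geodesic distances.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (graph ** 2) @ centring
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.maximum(eigenvalues[top], 0.0))
    return eigenvectors[:, top] * scale, graph

# Three quarters of a circle: a curved 1-dimensional manifold in 2D.
angle = np.linspace(0.0, 1.5 * np.pi, 150)
arc = np.column_stack([np.cos(angle), np.sin(angle)])
embedding, geodesic = isomap(arc, n_neighbours=6, n_components=1)
correlation = np.corrcoef(embedding[:, 0], angle)[0, 1]
print(f"|correlation| with arc length: {abs(correlation):.4f}")
```

The 1-D embedding follows the arc length, i.e. the curve is unrolled; a larger K or noisy data could create the short circuits mentioned above.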

(34)

Locally linear embedding

 Roweis & Saul, 2000.

 Idea

 Each datum can be approximated by

a (regularised) linear combination of its K nearest neighbors

 LLE tries to reproduce similar linear combinations in a lower-dimensional space

Original figure in Roweis, 2000.

(35)

Locally linear embedding

 Details

 Step 1:

 Step 2:

 Implementation

 Approximate each datum with a regularised linear combination of its K nearest neighbours

 Build the sparse matrix of neighbor weights (W)

 Compute the eigenvalue decomposition of M

 Bottom eigenvectors provide embedding coordinates

 Salient features

 Metaparameters are K and the regularization coefficient

 EVD such as in MDS, but bottom eigenvectors are used

 Nonparametric mapping (Nyström formula can be used)
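The implementation steps of LLE can be sketched as follows. Assumptions made here: M is taken as (I − W)ᵀ(I − W), the regularisation is proportional to the trace of the local Gram matrix, and W is stored as a dense matrix for brevity (a sparse matrix would be used in practice). The function name `lle` is hypothetical.

```python
import numpy as np

def lle(points, n_neighbours, n_components, reg=1e-3):
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    nearest = np.argsort(sq_dist, axis=1)[:, 1:n_neighbours + 1]
    weights = np.zeros((n, n))
    for i in range(n):
        centred = points[nearest[i]] - points[i]
        local_gram = centred @ centred.T
        # Regularisation keeps the local linear system well conditioned.
        local_gram += reg * np.trace(local_gram) * np.eye(n_neighbours)
        w = np.linalg.solve(local_gram, np.ones(n_neighbours))
        weights[i, nearest[i]] = w / w.sum()  # reconstruction weights sum to 1
    residual = np.eye(n) - weights
    m = residual.T @ residual
    eigenvalues, eigenvectors = np.linalg.eigh(m)  # ascending order
    # Skip the bottom eigenvector (constant, eigenvalue close to 0).
    return eigenvectors[:, 1:n_components + 1], weights, eigenvalues

angle = np.linspace(0.0, 1.5 * np.pi, 120)
helix = np.column_stack([np.cos(angle), np.sin(angle), 0.1 * angle])
embedding, weights, eigenvalues = lle(helix, n_neighbours=8, n_components=1)
```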

(36)

Laplacian eigenmaps

 Belkin & Niyogi, 2002.

 Idea

 Embed neighbouring points close to each other

→ shrink distances between neighbors in the embedding

 Avoid trivial solutions and indeterminacies

by constraining the covariance matrix of the embedding

 Details

 Symmetric affinity matrix:

 Cost function:

 Constrained optimization:

.

(37)

Laplacian eigenmaps

 Implementation

 Collect distances and compute K-ary neighbourhoods

 Compute the adjacency matrix and the corresponding weight matrix

 Compute the eigenvalue decomposition of the Laplacian matrix

 The bottom eigenvectors provide the embedding coordinates

 Salient features

 Nonparametric mapping (Nyström formula can be used)

 Connection with

• LLE (Laplacian operator applied twice)

• Spectral clustering and graph min-cut

(Laplacian matrix normalization is different)

• Diffusion maps

• Classical metric MDS with commute time distance

 Metaparameters are K and/or soft neighbourhood kernel parameters
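The implementation steps can be sketched as below. Assumptions: Gaussian ('heat kernel') weights on the edges of a symmetrised K-ary neighbourhood graph, and the generalised eigenproblem L v = λ D v solved through the symmetric matrix D^(-1/2) L D^(-1/2). The function name `laplacian_eigenmaps` and the kernel width are illustrative.

```python
import numpy as np

def laplacian_eigenmaps(points, n_neighbours, n_components, width=1.0):
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    nearest = np.argsort(sq_dist, axis=1)[:, 1:n_neighbours + 1]
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.repeat(np.arange(n), n_neighbours), nearest.ravel()] = True
    adjacency |= adjacency.T                  # symmetric K-ary neighbourhoods
    affinity = np.where(adjacency, np.exp(-sq_dist / (2.0 * width ** 2)), 0.0)
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    # Solve L v = lambda D v through the symmetric matrix D^-1/2 L D^-1/2.
    inv_sqrt = 1.0 / np.sqrt(degree)
    normalised = inv_sqrt[:, None] * laplacian * inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(normalised)  # ascending order
    # Bottom eigenvectors, skipping the trivial one (eigenvalue 0).
    embedding = inv_sqrt[:, None] * eigenvectors[:, 1:n_components + 1]
    return embedding, laplacian, eigenvalues

angle = np.linspace(0.0, 1.5 * np.pi, 120)
arc = np.column_stack([np.cos(angle), np.sin(angle)])
embedding, laplacian, eigenvalues = laplacian_eigenmaps(arc, 8, 1)
correlation = np.corrcoef(embedding[:, 0], angle)[0, 1]
```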

(38)

Maximum variance unfolding

 Weinberger & Saul, 2004. (a.k.a. ‘semidefinite embedding’)

 Idea

 Do the opposite of Laplacian eigenmaps and try to unfold data

→ Stretch distance of non-neighbouring points

 Classical metric MDS with missing pairwise distances

→ Use semi definite programming (SDP)

to maintain the properties of the Gram matrix

.

Original figure in

Weinberger, 2004.

(39)

Maximum variance unfolding

 Details

.

(40)

Maximum variance unfolding

 Implementation

 Collect pairwise distances and compute K-ary neighbourhoods

 Deduce the corresponding constraints on pairwise distances

 Formulate everything with inner products and run SDP engine

 Apply classical metric MDS

 Variants

 Distances between neighbours can shrink

(and distances between non-neighbours can only grow as usual)

 Introduction of slack variables to soften the constraints

 Salient features

 MVU ≈ KPCA with data-driven local kernels

 MVU ≈ smart Isomap

 Semidefinite programming is computationally demanding

 Metaparameters are K and all flags of SDP engine

.

(41)

t-distributed stochastic neighbor embedding

 Hinton & Roweis, 2005; Van der Maaten & Hinton, 2008.

 Idea

 Try to reproduce the pairwise probabilities of being neighbours (≈ similarities)

 Details

 Probability of being a neighbour in the high-dim space:

 Symmetric similarity in the high-dim space:

 Similarity in the low-dim space:

.

where

where

(42)

t-distributed stochastic neighbor embedding

 Details (cont’d)

 Cost function:

 Implementation

 Gradient descent of the cost function:

 Salient features

 Can get stuck in local minima

 SNE = t-SNE with n → ∞ (n: number of degrees of freedom of the Student t distribution)

 t-SNE much more efficient than SNE, especially for clustered data

 Similarity-based embedding (similarity preservation)

 Can also be related to distance preservation

(with a specific weighting scheme and a distance transformation)
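The quantities involved in t-SNE can be sketched as follows. Since the formulas are not reproduced here, the sketch assumes the definitions of Van der Maaten & Hinton (2008): Gaussian neighbour probabilities in the high-dim space, Student t similarities with one degree of freedom in the low-dim space, and the Kullback-Leibler divergence as cost function. A single fixed σ replaces the usual per-point perplexity search, which is a simplification; all function names are hypothetical.

```python
import numpy as np

def squared_distances(points):
    diff = points[:, None, :] - points[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_similarities(data, sigma=1.0):
    """Symmetric p_ij built from Gaussian neighbour probabilities."""
    logits = -squared_distances(data) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)         # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)
    conditional = np.exp(logits)
    conditional /= conditional.sum(axis=1, keepdims=True)
    return (conditional + conditional.T) / (2.0 * len(data))

def low_dim_similarities(embedding):
    """Student t similarities q_ij and the unnormalised kernel."""
    kernel = 1.0 / (1.0 + squared_distances(embedding))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel

def kl_cost(p, q):
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

def tsne_gradient(p, embedding):
    q, kernel = low_dim_similarities(embedding)
    diff = embedding[:, None, :] - embedding[None, :, :]
    return 4.0 * (((p - q) * kernel)[:, :, None] * diff).sum(axis=1)

rng = np.random.default_rng(0)
# Two well-separated clusters in a 5-dimensional space.
data = np.vstack([rng.standard_normal((15, 5)),
                  rng.standard_normal((15, 5)) + 6.0])
p = high_dim_similarities(data)
embedding = rng.standard_normal((30, 2))
initial_cost = kl_cost(p, low_dim_similarities(embedding)[0])
for _ in range(200):                          # plain gradient descent, small step
    embedding -= 0.1 * tsne_gradient(p, embedding)
final_cost = kl_cost(p, low_dim_similarities(embedding)[0])
print(f"KL divergence: {initial_cost:.4f} -> {final_cost:.4f}")
```

Practical implementations add momentum, early exaggeration and a much larger learning rate; none of that is shown here.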

(43)

Simbed (similarity-based embedding)

 Lee & Verleysen, 2009.

 Idea:

 Probabilistic definition of the pairwise similarities

 Takes into account properties of high-dim spaces

 Details

 Distances in a multivariate Gaussian are chi-distributed

 Similarity function in the high-dim space

 Similarity function in the low-dim space

.

Q is the regularized upper incomplete

Gamma function

(44)

Simbed (similarity-based embedding)

 Details (cont’d)

 Cost function (SSE of HD & LD similarities)

 Implementation

 Stochastic gradient descent (CCA ‘pin-point’ gradient)

 Salient Features

 Close relationship with CCA

(CCA is nearly equivalent to Simbed with similarities defined

as )

 Close relationship with t-SNE (similar gradients)

(45)

Theoretical method comparisons

 Purpose

 Visualization / data preprocessing

 Hard / soft dimensionality reduction

 Model characteristics

 Backward / forward (generative)

 Linear / non-linear

 Parametric / non-parametric (data embedding)

 With / without vector quantization

 Algorithmic criteria

 Spectral / non-spectral (soft computing, ANN, etc.)

 Among spectral methods: dense / sparse matrix

 Several unifying paradigms or frameworks

 Distance preservation

 Force-directed placement

 Rank preservation

(46)

Spectral methods: duality

 Two types of spectral NLDR methods

 ‘Dense’ matrix of ‘dissimilarities’ (e.g. distances)

→ Top eigenvectors

(CM MDS, Isomap, MVU)

 ‘Sparse’ matrix of ‘similarities’ (or ‘affinities’)

→ Bottom eigenvectors (except last one) (LLE, LE, diff.maps, spectral clustering)

 Duality

 Pseudo-inverse of sparse matrix

• Yields a dense matrix

• Inverts and therefore flips the eigenvalue spectrum

(bottom eigenvectors become leading ones and vice versa)

 Corollary

 All spectral methods (both sparse and dense) can be reformulated as applying CM MDS on a dense matrix

 Example: Laplacian eigenmaps = CM MDS with commute time distances (CTDs are related to the pseudo-inverse of the Laplacian matrix)
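The duality can be checked numerically on a toy graph. The sketch below assumes a small path graph and the unnormalised Laplacian; the commute-time distances are computed up to a constant volume factor. Variable names are illustrative.

```python
import numpy as np

# Laplacian of a small path graph: a sparse matrix of 'similarities'.
n = 6
affinity = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
laplacian = np.diag(affinity.sum(axis=1)) - affinity

eigenvalues, eigenvectors = np.linalg.eigh(laplacian)      # ascending order
pinv = np.linalg.pinv(laplacian, rcond=1e-10)              # dense matrix
pinv_eigenvalues, pinv_eigenvectors = np.linalg.eigh(pinv)

# Commute-time distances (up to a volume factor) from the pseudo-inverse.
diagonal = np.diag(pinv)
commute = diagonal[:, None] + diagonal[None, :] - 2.0 * pinv

# Double centring of these distances, as in classical metric MDS.
centring = np.eye(n) - np.ones((n, n)) / n
gram = -0.5 * centring @ commute @ centring

# Alignment between the leading eigenvector of the pseudo-inverse and the
# first non-trivial bottom eigenvector of the Laplacian.
alignment = abs(pinv_eigenvectors[:, -1] @ eigenvectors[:, 1])
print(f"alignment: {alignment:.6f}")
```

The non-zero eigenvalues are inverted, so the bottom eigenvectors of the sparse matrix become the leading ones of the dense matrix, and the double-centred commute-time distances give back the pseudo-inverse itself.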

(47)

Spectral versus non-spectral NLDR

Spectral NLDR

 Cost function is convex

 Convex optimization

(spectral decomposition)

 Global optimum

 Incremental embeddings

 Intrinsic dimensionality can be estimated

 Cost function must fit within the spectral framework

 It often amounts to applying (1) an a priori nonlinear distance transformation, then (2) classical metric MDS (dense/sparse duality!)

 The eigenspectrum tail of sparse methods tends to be flat

→ ‘Spiky’ embeddings

Non-spectral NLDR

 Cost function is not convex

 Ad hoc optimization (e.g. gradient descent)

 Local optima

 Independent embeddings

 No simple way to estimate intrinsic dimensionality

 More freedom is granted in the choice of the cost function

 It is often fully data-driven

(48)

Spiky spectral embeddings: examples

[Figure: spiky embeddings obtained with MVU and LLE]

(49)

Taxonomy (linear and spectral)

PCA

Latent variable sep. NLDR

Classification Clustering

Linear Nonlinear

Manifold Clusters

ICA/BSS FA/PP

LDA SpeClus

Nonspectral NLDR

MVU Isomap

LE diff.maps

LLE KPCA

SVM

CM MDS

(50)

Taxonomy (DR only)

Distances Inner products

Reconstruction error

PCA

Nonlinear auto-encoder CM

MDS

Spectral NLDR

SOM Principal curves

Stress-based MDS

Similarities

t-SNE

Simbed CCA

(51)

Quality Assessment: Intuition

3D → 2D

Bad Good

(52)

Quality Assessment: Quantification

 We have:

 An NLDR method to assess

 Some ideas:

 Use its objective function

 Quantify the distance preservation

 Quantify the ‘topology’ preservation

 Topology in practice:

 K-ary neighbourhoods

 Neighbourhood ranks

 Literature:

 1962, Shepard: Shepard diagram (a.k.a. ‘dy-dx’)

 1992, Bauer & Pawelzik: topological product

 1997, Villmann et al.: topological function

 2001, Venna & Kaski: trustworthiness & continuity T&C

 2006, Chen & Buja: local continuity meta criterion LC-MC

 2007, Lee & Verleysen: mean relative rank errors MRREs

(53)

Distances, Ranks, and Neighbourhoods

 Distances:

 Ranks:

 Neighbourhoods:

 Co-ranking matrix:

(Q is a sum of N permutation matrices of size N-1)
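The co-ranking matrix can be computed as sketched below. The sketch assumes that ρij denotes the rank of point j with respect to point i in the high-dim space and rij the same rank in the low-dim space, and that ties are broken by index. The names `ranks` and `coranking_matrix` are hypothetical.

```python
import numpy as np

def ranks(points):
    """rank[i, j] = position of j among the neighbours of i (1 = nearest)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, -1.0)              # each point gets rank 0 w.r.t. itself
    order = np.argsort(dist, axis=1, kind="stable")
    rank = np.empty_like(order)
    rows = np.arange(len(points))[:, None]
    rank[rows, order] = np.arange(len(points))[None, :]
    return rank

def coranking_matrix(high, low):
    """q[k-1, l-1] counts the pairs with high-dim rank k and low-dim rank l."""
    n = len(high)
    rho, r = ranks(high), ranks(low)
    off_diagonal = ~np.eye(n, dtype=bool)
    q = np.zeros((n - 1, n - 1), dtype=int)
    np.add.at(q, (rho[off_diagonal] - 1, r[off_diagonal] - 1), 1)
    return q

rng = np.random.default_rng(0)
high = rng.standard_normal((50, 5))
low = high[:, :2]                             # a crude linear projection
q = coranking_matrix(high, low)
q_perfect = coranking_matrix(high, high)
```

Each point contributes one permutation matrix, so every row and every column of Q sums to N; a perfect embedding yields a diagonal matrix.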

(54)

Co-ranking Matrix: Blocks

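[Figure: co-ranking matrix, with ranks ρij and rij running from 1 to N−1 along its two axes and a threshold at K. The blocks are labelled mild K-intrusions, hard K-intrusions, mild K-extrusions and hard K-extrusions; the diagonal ρij − rij = 0 separates negative rank errors from positive rank errors.]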

(55)

Co-ranking Matrix: Blocks

[Figure: co-ranking matrix (axes ρij and rij, from 1 to N−1, threshold K) with its four blocks: mild K-intrusions, hard K-intrusions, mild K-extrusions, hard K-extrusions.]

(56)

Trustworthiness & Continuity

 Formulas:

 Properties:

 Distinguish between points that erroneously

• enter a neighbourhood → trustworthiness 1-FPr

• quit a neighbourhood → continuity 1-FNr

 Functions of K (higher is better); range: [0,1] ([0.7,1])

 Elements $q_{kl}$ are weighted with
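Trustworthiness and continuity can be sketched as follows. The formulas are not reproduced here, so the sketch assumes the definitions of Venna & Kaski (2001), including their normalisation factor 2 / (N K (2N − 3K − 1)), which is valid for K < N/2. The function names are hypothetical.

```python
import numpy as np

def ranks(points):
    """rank[i, j] = position of j among the neighbours of i (1 = nearest)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, -1.0)
    order = np.argsort(dist, axis=1, kind="stable")
    rank = np.empty_like(order)
    rows = np.arange(len(points))[:, None]
    rank[rows, order] = np.arange(len(points))[None, :]
    return rank

def trustworthiness_continuity(high, low, k):
    n = len(high)
    rho, r = ranks(high), ranks(low)          # high-dim and low-dim ranks
    norm = 2.0 / (n * k * (2 * n - 3 * k - 1))
    intruders = (r <= k) & (rho > k)          # enter the low-dim neighbourhood
    extruders = (rho <= k) & (r > k)          # quit the high-dim neighbourhood
    trust = 1.0 - norm * (rho[intruders] - k).sum()
    continuity = 1.0 - norm * (r[extruders] - k).sum()
    return trust, continuity

rng = np.random.default_rng(0)
high = rng.standard_normal((60, 5))
low = high[:, :2]                             # a crude linear projection
trust, continuity = trustworthiness_continuity(high, low, k=10)
perfect = trustworthiness_continuity(high, high, k=10)
print(f"trustworthiness = {trust:.3f}, continuity = {continuity:.3f}")
```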

(57)

Mean Relative Rank Errors

 Formulas:

 Properties:

 Two error types (same idea as in T&C)

 Functions of K (lower is better); range: [0,1] ([0,0.3])

 Stricter than T&C: all rank errors are counted

 Different weighting of $q_{kl}$ with

(58)

Local Continuity Meta-Criterion

 Formula:

 Properties

 Single measure

 Function of K (higher is better); range: [0,1]

 A priori milder than T&C and MRREs

 Presence of a baseline term

(random neighbourhood overlap)

 No weighting of $q_{kl}$

(59)

Unifying Framework

T&C LCMC

MRREs

Only the upper left block is important!

(60)

Unifying Framework

 Connection with LCMC:

(61)

Q&B: Quality and behavior

 Up to now, we have distinguished 3 fractions

 Mild extrusions

 Mild intrusions

 Correct ranks

 All 3 are added up in the LC-MC sum, which indicates the overall quality

 What about the difference between the proportions of mild intrusions and extrusions?

 Definition of quality & behavior criterion:

 Similar reformulations exist for T&C and MRREs

Positive for intrusive embeddings Negative for extrusive embeddings
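The weightless criteria can be sketched from the co-ranking matrix. Assumptions made here: rows of Q index high-dim ranks and columns index low-dim ranks, so that mild intrusions sit below the diagonal of the upper-left K × K block; the quality is the block sum divided by KN, the behaviour is the difference between the fractions of mild intrusions and mild extrusions, and the LCMC subtracts the baseline K/(N − 1). All function names are hypothetical.

```python
import numpy as np

def ranks(points):
    diff = points[:, None, :] - points[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist, -1.0)
    order = np.argsort(dist, axis=1, kind="stable")
    rank = np.empty_like(order)
    rows = np.arange(len(points))[:, None]
    rank[rows, order] = np.arange(len(points))[None, :]
    return rank

def coranking_matrix(high, low):
    n = len(high)
    rho, r = ranks(high), ranks(low)
    off_diagonal = ~np.eye(n, dtype=bool)
    q = np.zeros((n - 1, n - 1), dtype=int)
    np.add.at(q, (rho[off_diagonal] - 1, r[off_diagonal] - 1), 1)
    return q

def quality_and_behaviour(q, k):
    n = q.shape[0] + 1
    block = q[:k, :k]                         # only the upper-left block matters
    quality = block.sum() / (k * n)
    intrusions = np.tril(block, -1).sum() / (k * n)  # low-dim rank < high-dim rank
    extrusions = np.triu(block, 1).sum() / (k * n)   # low-dim rank > high-dim rank
    behaviour = intrusions - extrusions
    lcmc = quality - k / (n - 1)              # remove the random-overlap baseline
    return quality, behaviour, lcmc

rng = np.random.default_rng(0)
high = rng.standard_normal((60, 5))
low = high[:, :2]                             # a crude linear projection
quality, behaviour, lcmc = quality_and_behaviour(coranking_matrix(high, low), k=10)
perfect = quality_and_behaviour(coranking_matrix(high, high), k=10)
print(f"quality = {quality:.3f}, behaviour = {behaviour:+.3f}, LCMC = {lcmc:.3f}")
```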

(62)

Why are weightless criteria sufficient?

 Any hard in/extrusion is compensated for by several mild ex/intrusions (and vice versa)

 The fractions of mild in/extrusions reveal the severity of hard ex/intrusions

 No arbitrary weighting is needed

[Figure: co-ranking matrix (axes ρij and rij, from 1 to N−1, threshold K) with its four blocks: mild K-intrusions, hard K-intrusions, mild K-extrusions, hard K-extrusions.]

(63)

Illustration: B. Frey’s face database

(64)

Conclusions about QA

 Rank preservation is useful in NLDR QA:

 More powerful than distance preservation

 Reflects the appealing idea of ‘topology’ preservation

 Unifying framework:

 Connects existing criteria

• LCMC

• MRREs

• T&C

 Relies on the co-ranking matrix

(≈ Shepard diagram with ranks instead of distances)

 Accounts for different types of rank errors:

• A global error (like LCMC)

• ‘Type I and II’ errors (like T&C and MRREs)

 Our proposal

 Overall quality criterion + embedding-specific ‘behavior’ criterion

 Involves no (arbitrary) weighting

 Focuses on the inside of K-ary neighborhoods (mild intrusions and extrusions)

(65)

Final thoughts & perspectives

 In practice, you need

 Appropriate data preprocessing steps

 An estimator of the intrinsic dimensionality

 An NLDR method

 Method-independent quality criteria

 Main take-home messages

 Carefully adjust your model complexity…

 Beware of (hidden) metaparameters…

 Convex methods are not a panacea…

 ‘Extrusive’ methods work better than ‘intrusive’ ones…

 Always try PCA first…

 Future

 Interest in spectral methods seems to be diminishing…

 Will auto-encoders emerge again thanks to ‘deep’ learning?

 Similarity-based NLDR is a hot topic…

 Tighter connections are expected with the domains of

• Data mining and visualization

• Graph embedding

(66)

Thanks for your attention

If you have any questions…

[email protected]

Nonlinear Dimensionality Reduction

Springer, Series: Information Science and Statistics John A. Lee, Michel Verleysen, 2007

300 pp.

ISBN: 978-0-387-39350-6
