The STOPS framework for structure-based hyperparameter selection in multidimensional scaling


(1)

ePub WU Institutional Repository

Thomas Rusch and Patrick Mair and Kurt Hornik

The STOPS framework for structure-based hyperparameter selection in multidimensional scaling

Conference or Workshop Item (Published) (Refereed)

Original Citation:

Rusch, Thomas and Mair, Patrick and Hornik, Kurt (2018) The STOPS framework for structure-based hyperparameter selection in multidimensional scaling. In: Data Science, Statistics & Visualisation (DSSV2018), 09.07.-11.07., Vienna, Austria.

This version is available at: http://epub.wu.ac.at/6399/

Available in ePub WU: July 2018

ePub WU, the institutional repository of the WU Vienna University of Economics and Business, is provided by the University Library and the IT-Services. The aim is to enable open access to the scholarly output of the WU.

This document is the publisher-created published version.

(2)

The STOPS Framework for Structure-Based Hyperparameter Selection in Multidimensional Scaling

(3)

Slide Zero

This is joint work with Patrick Mair (Harvard) and Kurt Hornik (WU)

(4)

Multidimensional Scaling

The STRESS objective function with (transformed) distances $\hat{d}_{ij}(X)$, (transformed) proximities $\hat{\delta}_{ij}$ and finite weights $\hat{w}_{ij}$ is

$$\sigma(X) = \sum_{i<j} \hat{w}_{ij} \left( \hat{\delta}_{ij} - \hat{d}_{ij}(X) \right)^2$$

which is minimized to find the configuration $X$:

$$\arg\min_{X} \sigma(X)$$

MDS provides an optimal map into continuous space $\mathbb{R}^M$ (objective 1).

We may also be interested in some structural appearance of $X$, e.g., clusters or a circumplex (objective 2).

It can happen that what is optimal for objective 1 is not very useful for objective 2.
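The stress definition above can be illustrated with a minimal base R sketch. It assumes identity transformations and unit weights, and it minimizes with the general-purpose optimizer `optim` rather than the majorization used by SMACOF; the function name `stress` and the toy data are illustrative only and are not taken from the slides.

```r
# Raw stress of a configuration X (n x M matrix) for a symmetric
# dissimilarity matrix delta (n x n) and optional weights w (n x n).
stress <- function(X, delta, w = NULL) {
  d <- as.matrix(dist(X))                  # Euclidean distances d_ij(X)
  if (is.null(w)) w <- matrix(1, nrow(d), ncol(d))
  lt <- lower.tri(d)                       # count every pair once
  sum(w[lt] * (delta[lt] - d[lt])^2)
}

# Toy data (assumption): the four corners of a unit square.
pts   <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1))
delta <- as.matrix(dist(pts))

# Minimize stress over X, starting from classical scaling.
n <- nrow(delta); M <- 2
start <- cmdscale(delta, k = M)
fit <- optim(as.vector(start),
             function(x) stress(matrix(x, n, M), delta),
             method = "BFGS")
X_hat <- matrix(fit$par, n, M)
stress(X_hat, delta)   # near zero: the square embeds exactly in two dimensions
```

Because the toy dissimilarities are exact Euclidean distances, objective 1 is met perfectly here; with real proximities the minimum stress is generally positive.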

(5)

Motivation: Republican Mantras

“I’m a Republican, because ...” from Mair et al. (2014)

Supporters of the Republican Party have been asked why they are Republican (254 statements)

Natural language data that was scraped and processed ⟹ Sparse data matrix (document term matrix)

Objects are the words (we use only words that appeared at least 10 times)

We look for themes in the statements: “Mantras” (words that occur often together)

We use a cosine distance for word co-occurrences and apply standard least squares MDS (SMACOF) for representation.
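As an illustration of the distance step, the following base R sketch computes a cosine distance between the word columns of a small document term matrix. The matrix entries and the helper name `cosine_dist` are made up for the example, and the exact preprocessing used for the Republican data may differ.

```r
# Toy document term matrix (assumption): rows are statements, columns are words.
dtm <- matrix(c(1, 1, 0,
                1, 0, 1,
                0, 1, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("freedom", "liberty", "taxes")))

# Cosine distance between word (column) vectors: 1 - cosine similarity.
cosine_dist <- function(m) {
  cp  <- crossprod(m)          # word-by-word inner products
  nrm <- sqrt(diag(cp))        # Euclidean norm of each word vector
  1 - cp / outer(nrm, nrm)
}

d_words <- cosine_dist(dtm)
round(d_words, 3)
```

Words that tend to appear in the same statements get a small distance, so they end up close together in the MDS configuration.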

(6)

Motivation: Republican Mantras

[Figure: MDS configuration plot "Republican Mantras?" with axes Configurations D1 and Configurations D2 (both about -0.2 to 0.2). Points are labelled with the 37 words: government, freedom, values, responsibility, country, personal, conservative, life, party, strong, limited, america, free, individual, liberty, right, small, family, taxes, nation, people, american, god, principles, work, defense, great, constitution, fiscal, market, national, founding, hard, military, will, best, low.]

We find a lack of (interesting) structure in the MDS configuration

(7)

Multidimensional Scaling Extensions

More structure is often introduced by using transformations $\hat{\delta}_{ij} = f_{ij}(\delta_{ij})$ and $\hat{d}_{ij}(X) = g_{ij}(d_{ij}(X))$ and weights $\hat{w}_{ij}$

Many MDS variants are a special case of this general formulation, e.g.,

Metric MDS: $g_{ij}(a) = a$, $f_{ij}(a) = a$

Sammon mapping: $\hat{w}_{ij} = \delta_{ij}^{-1}$

Multiscale: $f_{ij}(a) = g_{ij}(a) = \log(a)$

POST-MDS: $g_{ij}(a) = a^{\kappa}$, $f_{ij}(a) = a^{\lambda}$, $\hat{w}_{ij} = w_{ij}^{\nu}$

ALSCAL: $\kappa = \lambda = 2$

LMDS: Box-Cox transformations for $g_{ij}(\cdot)$, $f_{ij}(\cdot)$

Isomap: $g_{ij}(\cdot)$ isometric distance

Often transformations are parametrized by a hyperparameter vector $\theta$, so $\hat{\delta}_{ij} = f_{ij}(\delta_{ij}; \theta)$ and $\hat{d}_{ij} = g_{ij}(d_{ij}; \theta)$
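To make the role of the hyperparameter vector concrete, here is a minimal base R sketch of a power-transformed stress with $\theta = (\kappa, \lambda, \nu)$, following the POST-MDS row above. The function name `power_stress` and the toy data are assumptions for illustration; this is not the implementation in the stops package.

```r
# Stress with power transformations:
#   distances   d^kappa,  proximities  delta^lambda,  weights  w^nu.
# theta = c(kappa, lambda, nu); theta = c(1, 1, 1) gives plain raw stress.
power_stress <- function(X, delta, theta = c(1, 1, 1), w = NULL) {
  kappa <- theta[1]; lambda <- theta[2]; nu <- theta[3]
  d <- as.matrix(dist(X))
  if (is.null(w)) w <- matrix(1, nrow(d), ncol(d))
  lt <- lower.tri(d)                       # count every pair once
  sum(w[lt]^nu * (delta[lt]^lambda - d[lt]^kappa)^2)
}

# Toy data (assumption): corners of a unit square, and the same square doubled.
pts   <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1))
delta <- as.matrix(dist(pts))

power_stress(pts,     delta, theta = c(1, 1, 1))   # perfect fit
power_stress(2 * pts, delta, theta = c(1, 1, 1))   # plain stress of a misfit
power_stress(2 * pts, delta, theta = c(2, 2, 1))   # ALSCAL-type, kappa = lambda = 2
```

The same configuration is penalized differently under different $\theta$, which is why the minimizing configuration, and hence its structure, changes with $\theta$.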

(8)

Power Stress MDS

Fit ratio MDS with power transformation by setting, e.g., $f(\delta_{ij}) = \delta_{ij}^{20}$

Structure is clearer but the fit is now worse (0.373 versus 0.401) (essentially fits only $\delta$ very close to the maximum)

Parameters chosen ad hoc, not always clear what is the right $\theta$.

[Figure: MDS configuration plot "Republican Mantras?!" with axes Configurations D1 and Configurations D2 (both about -0.2 to 0.2), points labelled with the 37 words listed for the previous configuration plot.]

SLIDE 7 DSSV18, 09-07-2018

(9)

Structure Optimized Proximity Scaling

Our suggestion is a framework to systemize this approach: Structure Optimized Proximity Scaling (STOPS).

Idea: Select the parameters for the transformations ($\theta$) in a principled fashion by fit and structure considerations

This offers a conceptual and computational framework for hyperparameter selection in MDS variants

Building blocks:

$\theta$-parametrized target function for misfit

Statistics measuring configuration structure (structuredness indices)

Combination of misfit and structure

Algorithm for optimization

(10)

STOPS - I

We have the target function that measures misfit (e.g., Stress)

$$\sigma(X, \theta) = L(\Delta, D(X), \theta)$$

which we minimize to find the configuration $X$ for a $\theta$

$$X(\theta) = \arg\min_{X} \sigma(X, \theta)$$

$X(\theta)$ has some structural appearance (C-Structuredness). C-Structuredness changes with different $\theta$

(11)

STOPS - II

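Capture $P$ structures in $X(\theta)$ by indices $I_p(X(\theta); \gamma)$, $p = 1, \dots, P$.

Combine $\sigma(X(\theta), \theta)$ and $I_p(X(\theta); \gamma)$ to $\mathrm{stoploss}(X(\theta), \vartheta; \Delta)$

Two STOPS models

Additive STOPS (aSTOPS)

$$\mathrm{stoploss}(X(\theta), \vartheta; \Delta) = v_0 \cdot \sigma(X(\theta), \theta) + \sum_{p=1}^{P} v_p I_p(X(\theta); \gamma)$$

Multiplicative STOPS (mSTOPS)

$$\mathrm{stoploss}(X(\theta), \vartheta; \Delta) = \sigma(X(\theta), \theta)^{v_0} \cdot \prod_{p=1}^{P} I_p(X(\theta); \gamma)^{v_p}$$

$v_0$: stress weight (redundant); $v_1, \dots, v_P$: structuredness weights; $\gamma$: (optional) metaparameters for structuredness indices; $\vartheta \subseteq \{\theta, v_0, \dots, v_P\}$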
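The two combination rules can be written as a pair of small R functions. This is a sketch with hypothetical function names; the numbers plugged in below are the stress value, index values and weights reported in the Republican example output later in the deck.

```r
# Additive STOPS: v0 * stress + sum_p v_p * I_p
stoploss_additive <- function(stress, indices, v0, v) {
  stopifnot(length(indices) == length(v))
  v0 * stress + sum(v * indices)
}

# Multiplicative STOPS: stress^v0 * prod_p I_p^v_p
stoploss_multiplicative <- function(stress, indices, v0, v) {
  stopifnot(length(indices) == length(v))
  stress^v0 * prod(indices^v)
}

# Values from the example output: MDS loss 0.2513,
# cmanifoldness 0.9738, cclusteredness 0.3117, weights 1 and (-0.5, -0.5).
idx <- c(cmanifoldness = 0.9738, cclusteredness = 0.3117)
stoploss_additive(0.2513, idx, v0 = 1, v = c(-0.5, -0.5))        # about -0.391
stoploss_multiplicative(0.2513, idx, v0 = 1, v = c(-0.5, -0.5))
```

Negative structuredness weights reward structure: the more clustered or manifold-like the configuration, the lower the stoploss, so minimizing stoploss trades misfit against structure.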

(12)

Structures and Indices

C-Structuredness indices capture the essence of a particular structure in a configuration. Some examples:

C-Association: Pairwise nonlinear association between principal axes (pairwise maximal maximum information coefficient; Reshef et al. 2011)

C-Clusteredness: A clustered appearance (normed OPTICS Cordillera; Rusch et al., 2018)

C-Complexity: Complexity of the functional relationship between any principal axes (pairwise maximal minimum cell number; Reshef et al. 2011)

C-Manifoldness: Points lie close to a smooth submanifold (maximal correlation; Sarmanov, 1958)

(13)

Optimization-I

We need to find

$$\arg\min_{\vartheta} \ \mathrm{stoploss}(X(\theta), \vartheta; \Delta)$$

This can be seen as a profile method

We use a nested algorithm

1. First solve for $X(\theta) = \arg\min_{X} \sigma(X, \theta)$

2. Then minimize $\mathrm{stoploss}(X(\theta), \vartheta; \Delta)$ over $\vartheta$

Advantages:

For finding $X(\theta)$ we can use standard solutions (reasonably good)

The inner part (1.) allows computationally flexible specifications of the MDS method

$I_p(X)$ depends directly only on $X(\theta)$

Dimensionality of the outer problem is usually not very high
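The nested scheme can be sketched in base R as follows. Everything here is a simplified stand-in: the inner step uses `optim` instead of a dedicated MDS solver, the outer step is a plain grid search over a single hyperparameter $\lambda$, and `toy_index` is a made-up structuredness index, not one of the c-structuredness indices listed in the deck.

```r
set.seed(1)
# Toy data (assumption): two groups of five points in the plane.
pts   <- rbind(matrix(rnorm(10, 0, 0.3), 5, 2),
               matrix(rnorm(10, 3, 0.3), 5, 2))
delta <- as.matrix(dist(pts))
delta <- delta / max(delta)

# Misfit for a given lambda: normalized stress with f(a) = a^lambda.
stress_theta <- function(X, delta, lambda) {
  lt <- lower.tri(delta)
  sum((delta[lt]^lambda - as.matrix(dist(X))[lt])^2) / sum(delta[lt]^(2 * lambda))
}

# Step 1 (inner): X(lambda) = argmin_X stress, started from classical scaling.
inner_fit <- function(delta, lambda, M = 2) {
  n <- nrow(delta)
  start <- cmdscale(delta^lambda, k = M)
  fit <- optim(as.vector(start),
               function(x) stress_theta(matrix(x, n, M), delta, lambda),
               method = "BFGS")
  list(X = matrix(fit$par, n, M), stress = fit$value)
}

# Hypothetical index in [0, 1]: large when every point has a close neighbour
# relative to the diameter of the configuration.
toy_index <- function(X) {
  d <- as.matrix(dist(X)); diag(d) <- Inf
  1 - mean(apply(d, 1, min)) / max(d[is.finite(d)])
}

# Additive stoploss as a function of lambda only (profile over X).
stoploss <- function(lambda, delta, v0 = 1, v1 = -1) {
  fit <- inner_fit(delta, lambda)
  v0 * fit$stress + v1 * toy_index(fit$X)
}

# Step 2 (outer): minimize stoploss over a grid of lambda values.
lambdas    <- seq(0.5, 3, by = 0.5)
losses     <- vapply(lambdas, stoploss, numeric(1), delta = delta)
lambda_hat <- lambdas[which.min(losses)]
```

Each outer evaluation requires a full inner MDS fit, which is the reason a sample-efficient outer optimizer is attractive.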

(14)

Optimization-II

Difficulties when optimizing over $\vartheta$

Inner minimization is very costly

For stoploss we basically only know function evaluations

Estimation of Step 1 may be noisy (premature termination, local minimum)

This suggests solving Step 2 with Efficient Global Optimization, aka Bayesian Optimization.

One samples the “best” candidate for evaluation given a surrogate model and the current knowledge.

(15)

Optimization-III

Bayesian Optimization:

Choose a (flexible) surrogate model (prior)

Evaluate the target function at some candidate values (data)

Update the prior with the function evaluations (posterior)

Maximize an acquisition function over the posterior surface

This suggests a candidate parameter combination

Evaluate at the candidate and repeat

We use Expected Improvement for acquisition and Treed Gaussian Process with Jumps to Linear Models (Gramacy, 2007) or Kriging (Roustant et al., 2012) for the surrogate model.
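The loop above can be sketched in base R for a one-dimensional hyperparameter. This sketch swaps in a plain Gaussian process with a squared-exponential kernel and fixed, assumed kernel parameters in place of the treed Gaussian process or kriging surrogates named on the slide, and `target` is a cheap made-up stand-in for the expensive stoploss.

```r
# Squared-exponential kernel with a fixed (assumed) length scale.
sq_exp <- function(a, b, ell = 0.5) exp(-outer(a, b, "-")^2 / (2 * ell^2))

# GP posterior mean and sd at xnew given evaluations (x, y); unit prior variance.
gp_posterior <- function(x, y, xnew, ell = 0.5, nugget = 1e-6) {
  K    <- sq_exp(x, x, ell) + diag(nugget, length(x))
  Ks   <- sq_exp(xnew, x, ell)
  ybar <- mean(y)
  mu   <- ybar + as.vector(Ks %*% solve(K, y - ybar))
  v    <- 1 - rowSums((Ks %*% solve(K)) * Ks)
  list(mean = mu, sd = sqrt(pmax(v, 0)))
}

# Expected improvement for minimization.
expected_improvement <- function(mu, s, f_best) {
  z  <- (f_best - mu) / s
  ei <- (f_best - mu) * pnorm(z) + s * dnorm(z)
  ifelse(s > 0, ei, 0)
}

# Hypothetical stand-in for the costly stoploss(theta).
target <- function(theta) (theta - 2)^2 + 0.3 * sin(5 * theta)

theta_grid <- seq(0.5, 3.5, length.out = 301)
theta_obs  <- c(0.5, 2, 3.5)                 # initial evaluations (data)
y_obs      <- target(theta_obs)

for (step in 1:10) {
  post <- gp_posterior(theta_obs, y_obs, theta_grid)    # update the surrogate
  ei   <- expected_improvement(post$mean, post$sd, min(y_obs))
  cand <- theta_grid[which.max(ei)]                     # maximize the acquisition
  theta_obs <- c(theta_obs, cand)                       # evaluate at the candidate
  y_obs     <- c(y_obs, target(cand))
}
theta_obs[which.min(y_obs)]   # best hyperparameter value found so far
```

Expected improvement is large where the surrogate predicts a low value or is still uncertain, so the 13 evaluations are spent on promising or unexplored regions rather than on a full grid.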

(16)

R Package stops

All of this is implemented in the R package stops

High level function for STOPS: stops(delta, loss, ...)

Prespecified MDS models (argument loss) are strain, SMACOF (smacofSym), sammon mapping, elastic scaling, SMACOF on a sphere (smacofSphere), sstress, rstress, powerstress, Sammon mapping and elastic scaling with powers (powersammon, powerelastic). Planned: Isomap and LMDS

Optimization with Bayesian optimization (kriging, tgp) and some more (including simulated annealing SANN or a particle swarm algorithm pso).

Features various c-structuredness indices

S3 methods: plot, summary, print, coef, residuals, plot3d, plot3dstatic

(17)

Example: Republicans

Misfit: Power Stress MDS

Structuredness: C-Clusteredness and C-Manifoldness

Optimization with treed Gaussian process prior with jump to linear models (for 20 steps)

R> resc <- stops(dt.dist, loss = "powermds",
+               structures = c("cmanifoldness", "cclusteredness"))

R> resc

Call: stops(dis = dt.dist, loss = "powermds", theta = c(1, 1), structures = c("cmanifoldness", "cclusteredness"), strucpars = strucpars, optimmethod = "tgp", lower = c(0.5, 0.3), upper = c(3, 10), verbose = 5, type = "additive", itmax = 20)

Model: additive STOPS with powermds loss function and theta parameters= 1.871 3.191 1
Number of objects: 37
MDS loss value: 0.2513
C-Structuredness Indices: cmanifoldness 0.9738 cclusteredness 0.3117
Structure optimized loss (stoploss): -0.3914
MDS loss weight: 1 c-structuredness weights: -0.5 -0.5
Number of iterations of tgp optimization: 20

(18)

Example: Republicans

[Figure: contour plot over kappa (axis ticks 1.6 to 2.2) and lambda (axis ticks 2.5 to 4.0), with contour levels from -0.35 to -0.305 and marked points.]

(19)

Example: Republicans

[Figure: configuration plot "Republican Mantras!" with axes Configurations D1 and Configurations D2 (axis ticks from -0.1 to 0.2 and from -0.15 to 0.15); points unlabelled.]

(20)

Example: Republicans

[Figure: "Configuration Plot" with axes Configurations D1 and Configurations D2 (both about -0.2 to 0.2), showing two sets of unlabelled points.]

(21)

Example: Republicans

[Figure: configuration plot "Republican Mantras!" with axes Configurations D1 and Configurations D2 (axis ticks from -0.1 to 0.2 and from -0.15 to 0.15); points labelled with the 37 words.]

(22)

Example: Republicans

[Figure: configuration plot "Republican Mantras!" with axes Configurations D1 and Configurations D2 (axis ticks from -0.1 to 0.2 and from -0.15 to 0.15); points labelled with the 37 words and grouped by the legend entries Fiscalcon, Traditional, Neocon+Liberalist, Paleo+Populist, Unclustered.]

(23)

Summary and Outlook

STOPS

A conceptual and computational framework for hyperparameter optimization in MDS based on structure considerations

Outlook

More models and (perhaps?) more structures

Extend to other dimension reduction techniques (e.g., the Gifi system)

(24)

References

Borg, I., Groenen, P. (2005). Modern multidimensional scaling: Theory and applications, 2nd Edition, Springer, New York.

Gramacy, R. B. (2007). tgp: An R package for Bayesian nonstationary, semiparametric nonlinear regression and design by treed Gaussian process models. Journal of Statistical Software, 19(9), 1–46.

Mair, P., Rusch, T., Hornik, K. (2014). The grand old party - A party of values? SpringerPlus, 3:697.

Reshef, D., Reshef, Y., Finucane, H., Grossman, S., McVean, G., Turnbaugh, P., Lander, E., Mitzenmacher, M., & Sabeti, P. (2011). Detecting novel associations in large data sets. Science, 334, 1518–1524.

Roustant, O., Ginsbourger, D., & Deville, Y. (2012). DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodelling and optimization. Journal of Statistical Software, 51(1), 1–54.

Rusch, T., Hornik, K., Mair, P. (2018). Assessing and quantifying clusteredness: The OPTICS Cordillera. Journal of Computational and Graphical Statistics, 27(1), 220–233.

Rusch, T., Mair, P., Hornik, K. (in preparation). Structure based hyperparameter selection for dimensionality reduction: The STOPS framework for Structure Optimized Proximity Scaling.

Sarmanov, O. (1958). Maximum correlation coefficient (symmetric case). Doklady Akad. Nauk SSSR, 120, 715–718.

(25)

Thank You for Your Attention

Thomas Rusch

Competence Center for Empirical Research Methods

email: [email protected]

URL: http://wu.ac.at/methods/team/dr-thomas-rusch

WU Vienna University of Economics and Business

Welthandelsplatz 1, 1020 Vienna, Austria

(26)

License

Please attribute Thomas Rusch, Patrick Mair and Kurt Hornik. Except where otherwise noted, this work is licensed under CC-BY-SA:

https://creativecommons.org/licenses/by-sa/4.0/
