Vector-valued distribution regression: a simple and consistent approach

(1)

Vector-valued Distribution Regression: A Simple

and Consistent Approach

Zolt´an Szab´o

Joint work with Arthur Gretton (UCL), Barnab´as P´oczos (CMU), Bharath K. Sriperumbudur (PSU)

(2)

Outline

Motivation. Previous work. High-level goal.

Definitions, algorithm, error guarantee, consistency. Numerical illustration.

(3)

Problem: regression on distributions

Given: _{(xi,yi)}li=1 samples H∋f =? such that f(xi)≈yi.

Our interest:

(4)

Two-stage sampled setting = bag-of-features

Examples:

image = set of patches/visual descriptors, document = bag of words/sentences/paragraphs, molecule = different configurations/shapes,

group of people on a social network: bag of friendship graphs, customer = his/her shopping records,

(5)

Distribution regression: wider context

Several problems are covered in machine learning and statistics: multi-instance learning,

(6)

Existing methods

Idea:

1 estimate distribution similarities,

2 plug them into a learning algorithm.

Approaches:

1 parametric approaches: Gaussian, MOG, exponential family

[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2 kernelized Gaussian measures:

(7)

Existing methods+

1 (Positive definite) kernels: [Cuturi et al., 2005,

Martins et al., 2009, Hein and Bousquet, 2005].

2 Divergence measures (KL, . . . ): [P´oczos et al., 2011].

3 Set metric based algorithms:

1 Hausdorff metric [Edgar, 1995], and

2 its variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

(8)

Existing methods: summary

MIL dates back to [Haussler, 1999, G¨artner et al., 2002].

(9)

Existing methods: summary

MIL dates back to [Haussler, 1999, G¨artner et al., 2002].

There are several multi-instance methods, applications.

One ’small’ open question:

(10)

Existing methods: “exceptions”

APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013,

Sabato and Tishby, 2012]:

yi = max(IR(xi,1), . . . ,IR(xi,N))∈ {0,1},

(11)

Existing methods: “exceptions”

APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013,

Sabato and Tishby, 2012]:

yi = max(IR(xi,1), . . . ,IR(xi,N))∈ {0,1},

whereR = unknown rectangle.

Density based approaches, regression: KDE + kernel smoothing [P´oczos et al., 2013, Oliva et al., 2014],

densities live on compact Euclidean domain, density estimation: nuisance step.

(12)

High-level goal: set kernel

Given (2 bags):

Bi :={xi,n}Nn=1i ∼xi, Bj :=_{xj,m_}Nj

m=1 ∼xj.

Similarity of the bags (set/multi-instance/ensemble-,

convolution kernel [Haussler, 1999, G¨artner et al., 2002]):

K(Bi,Bj) = 1 NiNj Ni X n=1 Nj X m=1 k(xi,n,xj,m).

(13)

High-level goal: consistency of set kernels

Are set kernels consistent, when plugged into some regression

scheme? Our focus:

ridge regression . Motivation (ridge scheme):

1 _{simple algorithm.}

(14)

Story

H_{: assumed function class to capture the (}_x,_y_{) relation.}

fρ: true regression function (might not be in H_).

fH: “best” function fromH (l =∞,N :=Ni =∞).

ˆ

f: estimated function fromH _{based on} _{₍_{_x_i_,n}N

n=1,yi)}li=1.

Aim:

High probability error guarantees (λ: reg.,_E: risk):

E[ˆf]_{− E}[fH]_≤r1(l,N, λ), (1) kˆf ₋fρkL2≤r2(l,N, λ) +r3(richness ofH). (2) Consistency: (l,N, λ) =? such thatri(l,N, λ)→0 (i= 1,2).

(15)

Distribution regression: definition, solution idea

z=_{(xi,yi)_}l i=1: xi ∈M+1 (D), yi ∈Y. ˆ z= {xi,n_}N n=1,yi l i=1: xi,1, . . . ,xi,N i.i.d. ∼ xi.

Goal: learn the relation between x andy based on ˆz.

Idea:

1 embed the distributions (usingµdefined byk), 2 _{apply ridge regression (determined by}_K_).

M+

1 (D)

µ

(16)

Kernel part (

k

_,

K

_{): RKHS}

k :D_×D_→_R_{kernel on} D_{, if}

∃ϕ:D_→_H_{(ilbert space) feature map,}

k(a,b) =_hϕ(a), ϕ(b)_i_H (_∀a,b_∈D_). Kernel examples: D₌_Rd ₍_p _>_0,_{θ >} ₀₎ k(a,b) = (_ha,b_i+θ)p: polynomial, k(a,b) =e−ka−bk2 2/(2θ2): Gaussian, k(a,b) =e−θka−bk2: Laplacian. In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(17)

Kernel part: example domains (

D

₎

Euclidean space: D₌_Rd_.

Strings, time series, graphs, dynamical systems.

(18)

Embedding step:

M

+ 1

(

D

)

µ

−

→

X

_⊆

H

₍

k

₎

Given: kernel k:D_×D_→_R_.

Mean embedding of a distribution x_∈M+

1(D):

µx = Z

D

k(_·,u)d_x₍_u₎_∈_H₍_k₎_.

Mean embedding of the empirical distribution ˆ xi = _N1 PNn=1δxi,n ∈M + 1(D): µˆxi = Z D k(_·,u)d_ˆ_xi₍_u_{) =} 1 N N X n=1 k(_·,xi,n)_∈H(k).

(19)

Objective function:

X

_{−−−−−−→}

f∈H=H(K)

Y

Optimal (H_{/measurable) in expected risk (}_E_{) sense:}

E[fH] = inf f∈HE[f] = inff∈H Z X×Y k f(µa)−yk2Y dρ(µa,y), fρ(µa) =E[y_|µa] = Z Y yd_ρ₍_y_|_µa_{) (}_µa _∈_X₎_. One-stage (R →z), two-stage difficulty (z_→ˆz): f_zλ = arg min f∈H 1 l l X i=1 kf(µxi)−yik 2 Y +λkfk 2 H, (3) λ 1 l X ₂ ₂

(20)

Algorithmically: ridge regression

_⇒

analytical solution

Given: training sample: ˆz, test distribution: t. Prediction: (f_ˆ_zλ_◦µ)(t) = [y1, . . . ,yl](K+lλIl)−1k, (5) K= [Kij] = [K(µˆxi, µxˆj)]∈L(Y) l×l_, ₍₆₎ k=    K(µˆx1, µt) .. . K(µˆxl, µt)   ∈L(Y) l_. ₍₇₎ Specially: Y =R⇒L₍_Y_{) =}_R_;_Y ₌_Rd _⇒_L₍_Y_{) =}_Rd×d_.

(21)

Assumption-1

D_{: separable, topological.}

Y: separable Hilbert.

k:

bounded: supu∈Dk(u,u)≤Bk ∈(0,∞),

continuous. X =µ M+

1(D)

(22)

Assumption-1 – continued

K [Kµa :=K(·, µa)]:

1 bounded:

kKµak2_HS=Tr Kµa∗Kµa≤BK ∈(0,∞), (∀µa ∈X).

2 _{H¨older continuous:} _∃_L_>_0,_h_∈₍₀_,_{1] such that}

kKµa−KµbkL(Y,H)≤Lkµa−µbkhH, ∀(µa, µb)∈X×X.

(23)

Assumption-1: remarks (before the

ρ

assumptions)

k: bounded, continuous_⇒

µ: (M+

1(D),B(τw))→(H,B(H)) measurable.

µmeasurable,X _{∈ B}(H)_⇒ρonX _×Y: well-defined.

If (*) :=D_{is compact metric,} _k _{is universal, then} _µ_is

continuous and X _{∈ B}(H).

If Y =R, we get the traditional boundedness ofK:

(24)

Assumption-1: linear

K

_↔

_{set kernel}

Let Y =R andK(µa, µb) =hµa, µbiH. Recall

µˆxi = 1 N N X n=1 k(_·,xi,n).

In this case: BK =Bk,L= 1, h= 1, we get the set kernel

K(µxˆi, µˆxj) = µxˆi, µˆxj H = 1 N2 N X n,m=1 k(xi,n,xj,m).

(25)

Assumption-1: nonlinear

K

_{examples for}

Y

₌

_R

In case of (*) and Y =R: H¨olderK-s (D_{: compact, metric;} _µ_:

continuous) KG Ke KC e−k µa−µ_bk2H 2θ2 _e− kµa−µ_bkH 2θ2 1 +_kµa−µbk2H/θ2 −1 h= 1 h= 1₂ h= 1 Kt Ki 1 +kµa−µbkθH −1 kµa−µbk2H+θ2 −1₂

(26)

Assumption-1 – continued:

ρ

∈ P

(

b

,

c

)

Let the T :H_→H _{covariance operator be}

T = Z

X

K(_·, µa)K∗(·, µa)dρX(µa)

with eigenvalues tn (n = 1,2, . . .).

Assumption: ρ∈ P(b,c) = set of distributions on X ×Y

α≤nb_t n≤β (∀n≥1;α >0, β >0), ∃g _∈H_{such that} _fH₌_Tc−1 2 g withkgk2 H≤R (R>0), whereb _∈(1,_∞), c _∈[1,2].

(27)

Assumption-2: Assumption-1, but with alternative

ρ

Let ˜T be the extension ofT fromH _to _L2_ρ

X: S_K∗ :H _֒_→_L2_ρ X, SK :L2_ρ_X →H_, ₍_SK_g₎₍_µu_{) =} Z X K(µu, µt)g(µt)d_ρX₍_µt₎_, ˜ T =S_K∗SK :L2_ρ_X →L2_ρ_X. Our assumptions on ρ:

Range space assumption: fρ∈ImT˜sfor some s ≥0.

(28)

Assumption-2: remarks

Range space assumption:

fρ∈Im

˜

Ts_: _s _{captures the smoothness of}_f_ρ.

L2

ρX: separable⇔ measure space with d(A,B) =ρX(A△B)

(29)

Error guarantees, consistency (in human-readable format)

In case of Assumption-1: if l _≥λ−b1−1 E[f_ˆ_zλ]_{− E}[fH]≤ logh(l) Nh_λ3 +λ c₊ 1 l2_λ + 1 lλ1b →0 Assumption-2: if _λ12 ≤l S ∗ Kfˆzλ−fρ L2 ρ_X ≤ log h 2(l) Nh2λ32 + 1 λ√l +DH→DH, DH= inf q∈Hkfρ−S ∗ KqkL2 ρ_X

(30)

Demo-1 (

Y

₌

_R

_{): Supervised entropy learning}

Problem: learn the entropy of (rotated) Gaussians. Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.

Performance: RMSE boxplot over 25 random experiments. Experience:

more precise than the only theoretically justified method, by avoiding density estimation.

(31)

Supervised entropy learning: plots

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

(32)

Demo-2 (

Y

₌

_R

_{): Aerosol prediction from satellite images}

Bag:= multispectral satellite image over an area. Label of a bag:= AOD value of a highly accurate ground-based instrument.

Performance: RMSE. Experience:

≈domain-specific, engineered methods, beating state-of-the-art MI techniques.

(33)

Aerosol prediction: results

Baseline [mixture model (EM)]: 7.5−8.5 (±0.1−0.6).

Linear K: single: 7.91 (_±1.61). ensemble: 7.86 (_±1.71). NonlinearK: Single: 7.90 (_±1.63), Ensemble: 7.81(_±1.64).

(34)

Summary

Problem: two-stage sampled distribution regression. Literature: large number of heuristics.

Contribution:

error guarantees, consistency for the ridge based solution. specially, set kernel is consistent in regression (15-year-old open question).

Code _∈ITE toolbox:

(35)

Future research directions

Theoretical:

quadratic loss (_E), bounded kernels (k,K), mean embedding (µ) with i.i.d. samples (_{xi,n}Nn=1): relaxation,

equivalent characterizations/alternative priors (ρ), lower/optimal bounds,

error guarantees for non-point estimates.

(36)

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable

Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in

(37)

Appendix: Contents

Topological definitions. Vector-valued RKHS. Hausdorff metric. Weak topology onM+ 1(D).

(38)

Topological space, open sets

Given: D₆₌_∅_set.

τ _⊆2D

is called atopology onD_if:

1 ∅ ∈τ,D_∈_τ_.

2 Finite intersection: O1∈τ, O2∈τ ⇒O1∩O2∈τ. 3 _{Arbitrary union:} _O_i_∈_τ ₍_i_∈_I₎_{⇒ ∪}_i∈I_O_i_∈_τ_.

(39)

Topology: examples

τ =_{∅,D_}_{: indiscrete topology.} τ = 2D : discrete topology. (D_,_d_{) metric space:} Open ball: Bǫ(x) =_{y _∈D_:_d₍_x_,_y₎_{< ǫ}_}_.

O_⊆D _{is open if for}_∀_x_∈_O _∃_{ǫ >}_{0 such that}_B_ǫ(_x₎_⊆_O_.

(40)

Closed-, compact set, closure, dense subset, separability

Given: (D_{, τ}_). _A_⊆D_is

closed ifD_\_A_∈_τ _{(i.e., its complement is open),}

compact if for any family (Oi)i∈I of open sets with A⊆ ∪i∈IOi, ∃i1, . . . ,in∈I withA⊆ ∪nj=1Oij. Closure of A⊆D_: ¯ A:= \ A⊆C closed inD C. (8) A⊆D_is _dense_{if ¯}_A₌D_.

(D_{, τ}_{) is} _separable_if_∃ _{countable, dense subset of} D_.

(41)

The discrete space

D_,₂D

: complete metric space.

Discrete metric (inducing the discrete topology):

d(x,y) = 0, ifx =y 1, if x ₆=y . (9)

(42)

Vector-valued RKHS

Definition:

A H_⊆_YX _{Hilbert space of functions is RKHS if}

Aµx,y :f 7→ hy,f(µx)iY (10)

is continuous for _∀µx _∈X,y_∈Y.

= The evaluation functional is continuous in every direction.

Riesz representation theorem ⇒

∃Kµt ∈L(Y,H):

K(µx, µt)(y) = (Kµty)(µx), (∀µx, µt ∈X), or shortly

K(_·, µt)(y) =Kµty, (11) H₍_K_{) =}_span_{_Kµ _y_:_µt_∈_X_,_y _∈_Y_}_. ₍₁₂₎

(43)

Vector-valued RKHS – continued

Examples (Y =Rd):

1 Ki :X ×X →_Rkernels (i = 1, . . . ,d). Diagonal kernel: K(µa, µb) =diag(K1(µa, µb), . . . ,Kd(µa, µb)). (13)

2 Combination of Dj diagonal kernels [Dj(µa, µ_b)∈_Rr×r,

Aj ∈Rr×d]: K(µa, µb) = m X j=1 A∗_jDj(µa, µb)Aj. (14)

(44)

Existing methods: set metric based algorithms

Hausdorff metric [Edgar, 1995]:

dH(X,Y) = max

( sup

x∈X

inf

y∈Yd(x,y),_y∈Ysupx∈Xinf d(x,y)

)

. (15)

(45)

Weak topology on

M

+ 1

(

D

)

Def.: It is the weakest topology such that the

Lh: (M+1(D), τw)→R,

Lh(x) =

Z

D

h(u)d_x₍_u₎

mapping is continuous for all h_∈Cb(D), where

(46)

Universal kernel examples

On every compact subset of Rd:

k(a,b) =e−

ka−bk2₂

2σ2 , (σ >0)

k(a,b) =eβha,bi,(β >0), or more generally

k(a,b) =f(_ha,b_i), f(x) = ∞ X n=0 anxn (_∀an>0) k(a,b) = (1_{− h}a,b_i)α, (α >0).

(47)

Auer, P. (1998).

Approximating hyper-rectangles: Learning and pseudorandom sets.

Journal of Computer and System Sciences, 57:376–388. Babenko, B., Verma, N., Doll´ar, P., and Belongie, S. (2011).

Multiple instance learning with manifold bags.

InInternational Conference on Machine Learning (ICML), pages 81–88.

Blum, A. and Kalai, A. (1998).

A note on learning from multiple-instance examples.

Machine Learning, 30:23–29. Chen, Y. and Wu, O. (2012).

(48)

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).

Semigroup kernels on measures.

Journal of Machine Learning Research, 6:11691198. Edgar, G. (1995).

Measure, Topology and Fractal Geometry.

Springer-Verlag.

G¨artner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002).

Multi-instance kernels.

Haussler, D. (1999).

Convolution kernels on discrete structures.

Technical report, Department of Computer Science, University of California at Santa Cruz.

(49)

Hein, M. and Bousquet, O. (2005).

Hilbertian metrics and positive definite kernels on probability measures.

InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004).

Probability product kernels.

Journal of Machine Learning Research, 5:819–844. Long, P. M. and Tan, L. (1998).

PAC learning of axis-aligned rectangles with respect to product distributions from multiple-instance examples.

Machine Learning, 30:7–21.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009).

(50)

Nielsen, F. and Nock, R. (2012).

A closed-form expression for the Sharma-Mittal entropy of exponential families.

Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., P´oczos, B., Schneider, J., and Xing, E. (2014).

Fast distribution to real regression.

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

P´oczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013).

Distribution-free distribution regression.

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515. P´oczos, B., Xiong, L., and Schneider, J. (2011).

(51)

InUncertainty in Artificial Intelligence (UAI), pages 599–608. Sabato, S. and Tishby, N. (2012).

Multi-instance learning with any hypothesis class.

Journal of Machine Learning Research, 13:2999–3039.

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008).

Real Analysis.

Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).

Closed-form Jensen-R´enyi divergence for mixture of Gaussians and applications to group-wise shape registration.

Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

(52)

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).

Identifying multi-instance outliers.

InSIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, D., He, J., Si, L., and Lawrence, R. D. (2013).

MILEAGE: Multiple Instance LEArning with Global Embedding.

International Conference on Machine Learning (ICML; JMLR W&CP), 28:82–90.

Zhang, M.-L. and Zhou, Z.-H. (2009).

Multi-instance clustering with applications to multi-instance prediction.

(53)

Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates.

Technical report, University of California, Berkeley. (http://arxiv.org/abs/1305.5029).

Zhou, S. K. and Chellappa, R. (2006).

From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.