Regression on Probability Measures: A Simple and Consistent Algorithm

(1)

Regression on Probability Measures: A Simple and

Consistent Algorithm

Zolt´an Szab´o (Gatsby Unit, UCL)

Joint work with

◦Bharath K. Sriperumbudur (Department of Statistics, PSU),

◦Barnab´as P´oczos (ML Department, CMU),

◦Arthur Gretton (Gatsby Unit, UCL)

CRiSM Seminars Department of Statistics,

University of Warwick May 29, 2015

(2)

The task

Samples: {(xi,yi)}l

i=1. Goal: f(xi)≈yi, findf ∈H.

Distribution regression:

xi-s are distributions,

available only through samples: _{xi,n}Nn=1i .

(3)

Example: aerosol prediction from satellite images

Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Relevance: climate research.

Engineered methods [Wang et al., 2012]: 100×RMSE = 7.5−8.5.

(4)

Wider context

Context:

machine learning: multi-instance learning,

statistics: point estimation tasks (without analytical formula).

Applications:

computer vision: image = collection of patchvectors, network analysis: group of people = bag of friendshipgraphs, natural language processing: corpus = bag ofdocuments,

(5)

Several algorithmic approaches

1 Parametric fit: Gaussian, MOG, exp. family

[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012]. 2 Kernelized Gaussian measures:

[Jebara et al., 2004, Zhou and Chellappa, 2006]. 3 _{(Positive definite) kernels:}

[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005]. 4 _{Divergence measures (KL, R´enyi, Tsallis): [P´oczos et al., 2011].} 5 _{Set metrics: Hausdorff metric [Edgar, 1995]; variants}

[Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

(6)

Theoretical guarantee?

MIL dates back to [Haussler, 1999, G¨artner et al., 2002].

Sensible methods in regression: require density estimation [P´oczos et al., 2013, Oliva et al., 2014, Reddi and P´oczos, 2014] + assumptions:

1 _{compact Euclidean domain.}

(7)

Kernel, RKHS

k :D_×D_→_R_{kernel on} D_{, if}

∃ϕ:D_→_H_{(ilbert space) feature map,}

(8)

Kernel, RKHS

k :D_×D_→_R_{kernel on} D_{, if}

∃ϕ:D_→_H_{(ilbert space) feature map,}

k(a,b) =hϕ(a), ϕ(b)iH (∀a,b∈D). Kernel examples: D₌_Rd ₍_p _>_0,_{θ >} ₀₎ k(a,b) = (_ha,b_i+θ)p: polynomial, k(a,b) =e−ka−bk22/(2θ 2₎ : Gaussian, k(a,b) =e−θka−bk1_{: Laplacian.} In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(9)

Kernel: example domains (

D

)

Euclidean space: D₌_Rd_.

Graphs, texts, time series, dynamical systems.

(10)

Problem formulation (

Y

=

R

)

Given:

labelled bags ˆz=_{(ˆxi,yi)}ℓi=1,

ith _{bag: ˆ}_x

i={xi,1, . . . ,xi,N}i

.i.d.

∼ xi ∈P(D),yi ∈R.

(11)

Problem formulation (

Y

=

R

)

Given:

ith _{bag: ˆ}_x

i={xi,1, . . . ,xi,N}i

.i.d.

Task: find aP₍D₎_→_R_{mapping based on ˆ}_z_.

Construction: distribution embedding (µx)

(12)

Y

=

R

)

Given:

ith _{bag: ˆ}_x

i={xi,1, . . . ,xi,N}i

.i.d.

Construction: distribution embedding (µx) +ridge regression

(13)

Y

=

R

)

Given:

ith _{bag: ˆ}_x

i={xi,1, . . . ,xi,N}i

.i.d.

Construction: distribution embedding (µx) +ridge regression

P₍D₎_{−−−−→}µ=µ(k) _X _⊆_H ₌_H₍_k₎_{−−−−−−−→}f∈H=H(K) _R_.

Our goal: risk bound compared to the regression function

fρ(µx) = Z

R

(14)

Goal in details

Expected risk:

R[f] =E(x,y)|f(µx)−y|2.

Contribution: analysis of the excess risk

(15)

Goal in details

Expected risk:

R[f] =E(x,y)|f(µx)−y|2.

(16)

Goal in details

Expected risk:

R[f] =E(x,y)|f(µx)−y|2.

E(f_ˆ_zλ,fρ) =R[fˆzλ]− R[fρ]≤g(ℓ,N,λ)→0 and rates, f_ˆ_zλ = arg min f∈H 1 ℓ ℓ X i=1 |f(µˆxi)−yi| 2 +λkfk2H, (λ >0).

(17)

Goal in details

Expected risk:

R[f] =E(x,y)|f(µx)−y|2.

E(f_ˆ_zλ,fρ) =R[fˆzλ]− R[fρ]≤g(ℓ,N,λ)→0 and rates, f_ˆ_zλ = arg min f∈H 1 ℓ ℓ X i=1 |f(µˆxi)−yi| 2 +λkfk2H, (λ >0). We consider two settings:

1 well-specified case: fρ∈H_,

2 misspecified case: fρ∈L2 ρX\H.

(18)

Step-1: mean embedding

k :D_×D_→_R _{kernel; canonical feature map:} _ϕ₍_u_{) =}_k₍_·_,_u_).

Mean embedding of a distribution x,xiˆ ∈P₍D_):

µx = Z D k(_·,u)d_x₍_u₎_∈_H₍_k₎_, µxˆi = Z D k(_·,u)d_x_ˆ_i₍_u_{) =} 1 N N X n=1 k(_·,xi,n).

(19)

Step-1: mean embedding

Mean embedding of a distribution x,xiˆ ∈P₍D_):

µx = Z D k(_·,u)d_x₍_u₎_∈_H₍_k₎_, µxˆi = Z D k(_·,u)d_x_ˆ_i₍_u_{) =} 1 N N X n=1 k(_·,xi,n).

Linear K ⇒ set kernel:

K(µxˆi, µˆxj) = µxˆi, µˆxj H = 1 N2 N X n,m=1 k(xi,n,xj,m).

(20)

Step-1: mean embedding

Mean embedding of a distribution x,xiˆ _∈P₍D_):

µx = Z D k(_·,u)d_x₍_u₎_∈_H₍_k₎_, µxˆi = Z D k(_·,u)d_x_ˆ_i₍_u_{) =} 1 N N X n=1 k(_·,xi,n). NonlinearK example: K(µxˆi, µˆxj) =e −kµˆxi−µˆxjk 2 H 2σ2 .

(21)

Step-2: ridge regression (analytical solution)

Given: training sample: ˆz, test distribution: t. Prediction on t: (f_ˆ_zλ_◦µ)(t) =k(K+ℓλIℓ)−1[y1;. . .;yℓ], K = [K(µˆxi, µˆxj)]∈R ℓ×ℓ_, k= [K(µˆx1, µt), . . . ,K(µˆxℓ, µt)]∈R 1×ℓ_. (1) (2) (3)

(22)

Blanket assumptions: both settings

D_{: separable, topological domain.}

k:

bounded: sup_u∈Dk(u,u)≤Bk ∈(0,∞),

continuous.

K: bounded; H¨older continuous: _∃L>0,h _∈(0,1] such that

kK(·, µa)−K(·, µb)kH≤Lkµa−µbk

h H.

y: bounded.

(23)

Well-specified case: performance guarantee

Difficulty of the task:

fρis ’c-smooth’,

’b-decaying covariance operator’.

Contribution: If ℓ_≥λ−1b−1, then with high probability

E(fλ ˆz,fρ)≤ logh₍_ℓ₎ Nh_λ3 +λ c₊ 1 ℓ2_λ+ 1 ℓλ1b | {z } g(ℓ,N,λ) .

(24)

Well-specified case: performance guarantee

fρis ’c-smooth’,

’b-decaying covariance operator’.

Contribution: If ℓ≥λ−1b−1, then with high probability

E(f_ˆ_zλ,fρ)≤ log h₍_ℓ₎ Nh_λ3 + λc + 1 ℓ2_λ+ 1 ℓλ1b | {z } g(ℓ,N,λ) . (4) ˆ xi c-smoothness

(25)

Well-specified case: example

Assume

b is ’large’ (1/b≈0, ’small’ effective input dimension),

h = 1 (K: Lipschitz), ✄ ✂1 =✁ ✄ ✂2 in (4)✁ ⇒λ;ℓ=N a ₍_a_>_0),

t =ℓN: total number of samples processed.

Then 1 c = 2 (’smooth’fρ): E(fλ ˆz,fρ)≈t− 2 7 –faster convergence, 2 c = 1 (’non-smooth’fρ): E(fλ ˆ z ,fρ)≈t− 1 5 – slower.

(26)

Misspecified case: performance guarantee

fρis ’s-smooth’ (s>0).

Contribution:

IfL2ρX is separable and 1 λ2 ≤l,

then with high probability E(fλ ˆ z ,fρ)≤ logh2(l) Nh2λ 3 2 +_√1 lλ+ √ λmin(1,s) λ√l +λ min(1,s) | {z } g(ℓ,N,λ) .

(27)

Misspecified case: performance guarantee

fρis ’s-smooth’ (s>0).

Contribution: If

L2ρX is separable and 1 λ2 ≤l,

then with high probability E(fˆzλ,fρ)≤ logh2(l) Nh2λ32 + 1 √ lλ+ √ λmin(1,s) λ√l +λ min(1,s) | {z } g(ℓ,N,λ) . (5) ˆ xi s-smoothness

(28)

Misspecified case: example

Assume s _≥1,h= 1 (K: Lipschitz), ✄ ✂1 =✁ ✄ ✂3 in (5)✁ ⇒λ;ℓ=N a ₍_a_>₀₎

t =ℓN: total number of samples processed.

Then 1 s = 1 (’non-smooth’f_ρ): E(fλ ˆ z ,fρ)≈t−0.25 – slower, 2 s → ∞ (’smooth’fρ): E(fλ ˆ z ,fρ)≈t−0.5 – fasterconvergence.

(29)

Notes on the assumptions:

∃

ρ,

X

∈ B

(

H

)

k: bounded, continuous_⇒

µ: (P₍D₎_,_B₍_τ_w₎₎_→₍_H,_B₍_H_{)) measurable.}

µmeasurable,X _{∈ B}(H)_⇒ρonX _×Y: well-defined.

If (*) :=D_{is compact metric,} _k _{is universal, then}

µis continuous, and X _{∈ B}(H).

(30)

Notes on the assumptions: H¨older

K

examples

In case of (*): KG Ke KC e− kµa−µ_bk2_H 2θ2 e− kµa−µ_bk_H 2θ2 1 +kµa−µbk2H/θ2 −1 h= 1 h= 1₂ h= 1 Kt Ki 1 +_kµa₋µ_bkθ_H−1 _kµa₋µ_bk2_H+θ2− 1 2 h= θ 2 (θ≤2) h= 1

(31)

Notes on the assumptions: misspecified case

L2_ρ

X: separable ⇔measure space with d(A,B) =ρX(A△B) is so

(32)

Vector-valued output:

Y

= separable Hilbert space

Objective function: f_ˆ_zλ = arg min f∈H 1 l l X i=1 kf(µxˆi)−yik 2 Y +λkfk 2 H, (λ >0). K(µa, µb)∈L₍_Y_): operator-valued kernel, vector-valued RKHS.

(33)

Vector-valued output: analytical solution

Prediction on a new test distribution (t):

(f_ˆ_zλ_◦µ)(t) =k(K+lλIl)−1[y1;. . .;yl], K= [K(µˆxi, µxˆj)]∈L(Y) l×l_, k= [K(µˆx1, µt), . . . ,K(µxˆl, µt)]∈L(Y) 1×l_. (6) (7) (8) Specifically: Y =R⇒L₍_Y_{) =}_R_;_Y ₌_Rd _⇒L₍_Y_{) =}_Rd_.

(34)

Vector-valued output:

K

assumptions

Boundedness and H¨older continuity of K:

1 Boundedness: kKµak 2 HS =Tr Kµ∗aKµa ≤BK ∈(0,∞), (∀µa ∈X). 2 H¨older continuity: ∃L>0,h∈(0,1] such that

kKµa−KµbkL(Y,H)≤Lkµa−µbk h

(35)

Demo

Supervised entropy learning:

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

Aerosol prediction from satellite images:

State-of-the-art baseline: 7.5₋8.5(_±0.1₋0.6). MERR:7.81 (_±1.64).

(36)

Summary

Problem: distribution regression. Literature: large number of heuristics. Contribution:

a simple ridge solution is consistent,

specifically, the set kernel is so (15-year-old open question).

Simplified version [Y =R,fρ∈H]:

(37)

Summary – continued

Code in ITE, extended analysis (submitted to JMLR): https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066. Closely related research directions (Bayesian world):

∞-dimensional exp. family fitting,

(38)

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in

(39)

Appendix: contents

Topological definitions, separability. Prior definitions (ρ).

Universal kernel: definition, examples. Vector-valued RKHS.

Demos: further details. Hausdorff metric.

(40)

Topological space, open sets

Given: D₆₌_∅_set.

τ _⊆2D is called atopology onD_if:

1 ∅ ∈τ,D_∈_τ_.

2 Finite intersection: O₁∈τ, O₂∈τ ⇒O₁∩O₂∈τ. 3 Arbitrary union: O_i∈τ (i∈I)⇒ ∪_i_∈_IO_i∈τ.

(41)

Closed-, compact set, closure, dense subset, separability

Given: (D_{, τ}_). _A_⊆D_is

closed ifD_\_A_∈_τ _{(i.e., its complement is open),}

compact if for any family (Oi)i∈I of open sets with A⊆ ∪i∈IOi, ∃i1, . . . ,in∈I withA⊆ ∪nj=1Oij.

Closure of A⊆D_: ¯ A:= \ A⊆C closed inD C. (9) A⊆D_is _dense_{if ¯}_A₌D_.

(D_{, τ}_{) is} _separable_if_∃ _{countable, dense subset of} D_.

(42)

Prior (well-specified case):

ρ

∈ P

(

b

,

c

)

Let the T :H_→H _{covariance operator be}

T =

Z X

K(_·, µa)K∗(_·, µa)d_ρ_X₍_µa₎

with eigenvalues tn (n = 1,2, . . .).

Assumption: ρ∈ P(b,c) = set of distributions on X ×Y

α≤nbtn≤β (∀n≥1;α >0, β >0),

∃g ∈H_{such that} _fρ₌_Tc−1

2 g withkgk2

H≤R (R>0),

whereb _∈(1,_∞), c _∈[1,2].

Intuition: 1/b – effective input dimension,c – smoothness of

(43)

Prior: misspecified case

Let ˜T be defined as:

S_K∗ :H _֒_→_L2 ρX, SK :L2ρX →H, (SKg)(µu) = Z X K(µu, µt)g(µt)dρX(µt), ˜ T =S_K∗SK :L2ρX →L 2 ρX.

Our range space assumption on ρ: fρ∈Im

˜

(44)

Universal kernel: definition

Assume

D_{: compact, metric,}

k :D_×D_→_R_{kernel is continuous.}

Then

Def-1: k is universal ifH(k) is dense in (C(D₎_,_k·k

(45)

Universal kernel: definition

Assume

Then

∞).

Def-2: k is

(46)

Universal kernel: definition

Assume

Then

∞).

Def-2: k is

characteristic, ifµ:P₍D₎_→_H₍_k_{) is injective.}

(47)

Universal kernel: examples

On compact subsets of Rd k(a,b) =e−ka−bk 2 2 2σ2 _, ₍_{σ >}₀₎ k(a,b) =e−σka−bk1, (σ >0) k(a,b) =eβha,bi,(β >0), or more generally k(a,b) =f(_ha,b_i), f(x) = ∞ X n=0 anxn (∀an>0).

(48)

Vector-valued RKHS:

H

=

H

(

K

)

Definition:

A H_⊆_YX _{Hilbert space of functions is RKHS if}

Aµx,y :f ∈H7→ hy,f(µx)iY ∈R (10)

is continuous for _∀µx ∈X,y∈Y.

(49)

Vector-valued RKHS:

H

=

H

(

K

) – continued

Riesz representation theorem ⇒∃K(µx|y)∈H_:

hy,f(µx)iY =hK(µx|y),fiH (∀f ∈H). (11)

K(µx|y): linear, bounded in y ⇒K(µx|y)=Kµx(y) with

(50)

Vector-valued RKHS:

H

=

H

(

K

) – continued

Kµx ∈L(Y,H). K construction: K(µx, µt)(y) = (Kµty)(µx), (∀µx, µt∈X), i.e., K(_·, µt)(y) =Kµty, (12) H₍_K_{) =}_span_{_K_µ ty :µt ∈X,y ∈Y}. (13)

(51)

Vector-valued RKHS:

H

=

H

(

K

) – continued

Kµx ∈L(Y,H). K construction: K(µx, µt)(y) = (Kµty)(µx), (∀µx, µt∈X), i.e., K(_·, µt)(y) =Kµty, (12) H₍_K_{) =}_span_{_K_µ ty :µt ∈X,y ∈Y}. (13) Shortly: K(µx, µt)∈L₍_Y₎ _generalizes_k₍_u,_v₎_∈_R_.

(52)

Vector-valued RKHS – examples:

Y

=

R

d

1 Ki :X ×X →_Rkernels (i = 1, . . . ,d). Diagonal kernel:

K(µa, µb) =diag(K1(µa, µb), . . . ,Kd(µa, µb)). (14) 2 Combination of Dj diagonal kernels [Dj(µa, µb)∈_Rr×r,

Aj _∈Rr×d]: K(µa, µb) = m X j=1 A∗_jDj(µa, µb)Aj. (15)

(53)

Demo-1: supervised entropy learning

Problem: learn the entropy of the 1st _{coo. of (rotated)}

Gaussians.

Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.

Performance: RMSE boxplot over 25 random experiments. Experience:

more precise than the only theoretically justified method, by avoiding density estimation.

(54)

Demo-2: aerosol prediction – selected kernels

Kernel definitions (p = 2,3): kG(a,b) =e− ka−bk2₂ 2θ2 , k_e(a,b) =e− ka−bk2 2θ2 , (16) kC(a,b) = 1 1 +ka−bk22 θ2 , kt(a,b) = 1 1 +_ka₋b_kθ₂, (17) kp(a,b) = (_ha,b_i+θ)p, kr(a,b) = 1₋ ka−bk 2 2 ka−bk22+θ , (18) ki(a,b) = 1 q ka−bk22+θ2 , (19) k_M_,3 2(a,b) = 1 + √ 3ka−bk2 θ ! e− √ 3ka−bk2 θ , (20) √ 5_ka₋b_k₂ 5_ka₋b_k2₂! ₋√5ka−bk2

(55)

Existing methods: set metric based algorithms

Hausdorff metric [Edgar, 1995]:

dH(X,Y) = max ( sup x∈X inf y∈Yd(x,y),_ysup_∈_Yxinf∈Xd(x,y) ) . (22)

Metric on compact sets of metric spaces [(M,d);X,Y _⊆M]. ’Slight’ problem: highly sensitive to outliers.

(56)

Weak topology on

P

(

D

)

Def.: It is the weakest topology such that the

Lh : (P(D), τw)→R, Lh(x) =

Z

D

h(u)d_x₍_u₎

mapping is continuous for all h_∈Cb(D), where

(57)

Chen, Y. and Wu, O. (2012).

Contextual Hausdorff dissimilarity for multi-instance clustering.

InInternational Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).

Semigroup kernels on measures.

Journal of Machine Learning Research, 6:11691198.

Edgar, G. (1995).

Measure, Topology and Fractal Geometry.

Springer-Verlag.

G¨artner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002).

Multi-instance kernels.

InInternational Conference on Machine Learning (ICML), pages 179–186.

(58)

Haussler, D. (1999).

Convolution kernels on discrete structures.

Technical report, Department of Computer Science, University of California at Santa Cruz.

(http://cbse.soe.ucsc.edu/sites/default/files/

convolutions.pdf).

Hein, M. and Bousquet, O. (2005).

Hilbertian metrics and positive definite kernels on probability measures.

InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004).

Probability product kernels.

Journal of Machine Learning Research, 5:819–844.

(59)

Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012).

A closed-form expression for the Sharma-Mittal entropy of exponential families.

Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J., P´oczos, B., and Schneider, J. (2013).

Distribution to distribution regression.

International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., P´oczos, B., Schneider, J., and Xing, E. (2014).

Fast distribution to real regression.

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

(60)

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

P´oczos, B., Xiong, L., and Schneider, J. (2011).

Nonparametric divergence estimation with applications to machine learning on distributions.

InUncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and P´oczos, B. (2014).

k-NN regression on functional data with incomplete observations.

InConference on Uncertainty in Artificial Intelligence (UAI).

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis.

Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).

(61)

Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000).

Solving the multiple-instance problem: A lazy learning approach.

InInternational Conference on Machine Learning (ICML), pages 1119–1126.

Wang, Z., Lan, L., and Vucetic, S. (2012).

Mixture model for multiple instance regression and applications in remote sensing.

IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).

Identifying multi-instance outliers.

InSIAM International Conference on Data Mining (SDM), pages 430–441.

(62)

Multi-instance clustering with applications to multi-instance prediction.

Applied Intelligence, 31:47–68.

Zhou, S. K. and Chellappa, R. (2006).

From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.

https://bitbucket.org/szzoli/ite/

http://arxiv.org/abs/1411.2066.

(http://cbse.soe.ucsc.edu/sites/default/files/