Distribution Regression: Computational and Statistical Tradeoffs

(1)

Distribution Regression: Computational &

Statistical Tradeoffs

Zolt´an Szab´o (Gatsby Unit, UCL)

Joint work with

◦Bharath K. Sriperumbudur (Department of Statistics, PSU), ◦Barnab´as P´oczos (ML Department, CMU),

(2)

Outline

Motivation: application + theory. Problem formulation.

Results: computational & statistical tradeoffs. Numerical examples.

(3)

The task

Samples: {(xi,yi)}`i=1. Find f ∈H such that f(xi)≈yi.

Distribution regression:

xi-s are distributions,

available only through samples: {xi,n}Nn=1i , labelledbags.

(4)

The task

Samples: {(xi,yi)}`i=1. Find f ∈H such that f(xi)≈yi.

Distribution regression:

xi-s are distributions,

available only through samples: {xi,n}Nn=1i , labelledbags.

(5)

Motivation (application): aerosol prediction

Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Relevance: climate research, sustainability.

Engineered methods [Wang et al., 2012]: 100×RMSE = 7.5−8.5.

(6)

Wider context

Context:

machine learning: multi-instance learning,

statistics: point estimation tasks (without analytical formula).

Applications:

(7)

Several algorithmic approaches

1 _{Parametric fit: Gaussian, MOG, exp. family}

[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2 _{Kernelized Gaussian measures:}

[Jebara et al., 2004, Zhou and Chellappa, 2006].

3 _{(Positive definite) kernels:}

[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].

4 _{Divergence measures (KL, R´}_{enyi, Tsallis, . . . ):}

[P´oczos et al., 2011, Kandasamy et al., 2015].

5 _{Set metrics: Hausdorff metric [Edgar, 1995]; variants}

(8)

Motivation: theory

MIL dates back to [Haussler, 1999, G¨artner et al., 2002].

Sensiblemethods in regression: require density estimation

[P´oczos et al., 2013, Oliva et al., 2014, Reddi and P´oczos, 2014,

Sutherland et al., 2015] + assumptions:

(9)

Input-output requirements

Input: distributions on ’structured’ Ddomains (kernels).

Output:

simplest case: Y =R, but

(10)

Input-output requirements

Output:

(11)

Input-output requirements

Output:

(12)

Kernel, RKHS

k :D×D→Rkernel on D, if

∃ϕ:D→H(ilbert space) feature map,

k(a,b) =hϕ(a), ϕ(b)i_H (∀a,b∈_D). Kernel examples: D=Rd (p >0, θ >0) k(a,b) = (ha,bi+θ)p: polynomial, k(a,b) =e−ka−bk22/(2θ 2₎ : Gaussian, k(a,b) =e−θka−bk1: Laplacian. In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(13)

Kernel, RKHS

k :D×D→Rkernel on D, if

∃ϕ:D→H(ilbert space) feature map,

k(a,b) =hϕ(a), ϕ(b)i_H (∀a,b∈_D). Kernel examples: D=Rd (p >0, θ >0) k(a,b) = (ha,bi+θ)p: polynomial, k(a,b) =e−ka−bk22/(2θ 2₎ : Gaussian, k(a,b) =e−θka−bk1_{: Laplacian.} In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(14)

Kernel: example domains (

D

)

Euclidean space (D=Rd), graphs, texts, time series,

(15)

Problem formulation (Y

=

_R

)

Given: labelled bags ˆz={(ˆxi,yi)} ` i=1, ith bag: ˆxi={xi,1, . . . ,xi,N} i.i.d. ∼ xi ∈P(D),yi ∈R.

Task: find aP(D)→_Rmapping based on ˆz.

Construction: mean embedding (µx)

P(D)−−−−−→µ=µ(k) X ⊆H

| {z }

2 =two-stage sampling

(16)

Problem formulation (Y

=

_R

)

Construction: mean embedding (µx)

P(D)−−−−−→µ=µ(k) X ⊆H

| {z }

2 =two-stage sampling

(17)

Problem formulation (Y

=

_R

)

Construction: mean embedding (µx) +ridge regression

P(D)−−−−−→µ=µ(k) X ⊆H | {z } 2 =two-stage sampling =H(k)−−−−−−−→f∈H=H(K) _R | {z } 1 =Hilbert→Rregression .

(18)

1 : Hilbert

→

_R

regression, well-specified case

Regression function, expected risk (assume for a moment: fρ∈H):

(19)

1 : Hilbert

→

_R

regression, well-specified case

(20)

1 : Hilbert

→

_R

regression, well-specified case

(21)

1 : Hilbert

→

_R

regression

Known [Caponnetto and De Vito, 2007]: if ρ(µx,y)∈ P(b,c),

then the best/achieved rate

E(f_zλ,fρ) =Op `−bcbc+1 (1<b,c ∈(1,2]). ρ∈ P(b,c): T = Z X K(·, µ_a)K∗(·, µ_a)dρX(µa) :H→H. Eigenvalues ofT decay asλn=O(n−b). fρ∈Im Tc−21 . Intuition: 1/b– effective input dimension,c – smoothness offρ.

(22)

1 : Hilbert

→

_R

regression

Known [Caponnetto and De Vito, 2007]: if ρ(µx,y)∈ P(b,c),

then the best/achieved rate

E(f_zλ,fρ) =Op `−bcbc+1 (1<b,c ∈(1,2]). ρ∈ P(b,c): T = Z X K(·, µ_a)K∗(·, µ_a)dρX(µa) :H→H. Eigenvalues ofT decay asλn=O(n−b). fρ∈Im Tc−21 .

(23)

Our question

Can we reach thisO_p`−bcbc+1

minimax rate? N=? f_ˆ_zλ= arg min f∈_H 1 ` ` X i=1 |f(µˆxi)−yi| 2₊_λ_kf_k2 H, f_zλ= arg min f∈_H 1 ` ` X i=1 |f(µxi)−yi| 2₊_λ_kf_k2 H.

(24)

2 : mean embedding,

µ

xi

→

µ

xˆi

k :D×D→_R kernel; canonical feature map: ϕ(u) =k(·,u).

Mean embedding of a distribution x,xiˆ ∈P(D):

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n).

LinearK ⇒ set kernel:

K(µxˆi, µˆxj) = µˆxi, µˆxj H = 1 N2 N X n,m=1 k(xi,n,xj,m).

(25)

2 : mean embedding,

µ

xi

→

µ

xˆi

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n).

LinearK ⇒ set kernel:

K(µ , µ ) =µ , µ = 1

N

X

(26)

2 : mean embedding,

µ

xi

→

µ

xˆi

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n). NonlinearK example: K(µ , µ ) =e− kµ_xi_ˆ−µ_ˆ_xjk2 H 2 _.

(27)

2 : ridge regression

⇒

analytical solution

Given: training sample: ˆz, test distribution: t. Prediction on t: (f_ˆ_zλ◦µ)(t) =k(K+`λI`)−1[y1;. . .;y`], K= [K(µˆxi, µˆxj)]∈R `×`_, k= [K(µˆx1, µt), . . . ,K(µˆx`, µt)]∈R 1×`_. (1) (2) (3) ⇒ K(µx, µx0) =K(·, µ_x),K(·, µ0) H matter.

(28)

1 -

2 : Why can we get consistency? – intuition

Convergence of the mean embedding:

kµ_x −µˆxkH(k)=Op 1 √ N . H¨older property ofK (0<L, 0<h ≤1): kK(·, µx)−K(·, µˆx)k_H(K)≤Lkµx −µˆxkhH(k). f_ˆ_zλ depends ’nicely’ on µˆx.

(29)

1 -

2 : Proof idea

By decomposing the excess risk, concentration, onP(b,c) we get

E(f_ˆ_zλ,fρ)≤ logh(`) Nh_λ 1 λ2_`2 + 1 + 1 `λ1+1b | {z } 2 =two-stage sampling + λc+ 1 `2_λ+ 1 `λb1 | {z } 1 =H→_Rregression →0, s.t. `λb+1b ≥1,log(`) λ2h ≤N.

Let N=`ahlog(`) ⇒ 1st term + constraints simplify.

a>0: needed, i.e. N>log(`).

(30)

1 -

2 : Proof idea

By decomposing the excess risk, concentration, onP(b,c) we get

E(f_ˆ_zλ,fρ)≤ logh(`) Nh_λ 1 λ2_`2 + 1 + 1 `λ1+1b | {z } 2 =two-stage sampling + λc+ 1 `2_λ+ 1 `λb1 | {z } 1 =H→_Rregression →0, s.t. `λb+1b ≥1,log(`) λ2h ≤N.

Let N=`ahlog(`) ⇒ 1st term + constraints simplify.

(31)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE f_ˆ_zλ,fρ =Op `−cac+1 with λ=`−c+1a _, a≥ b(c+ 1) bc+ 1 then E f_ˆ_zλ,fρ =O_p`−bcbc+1 with λ=`−bcb+1.

Meaning (a-dependence,N =`halog(`)):

’Smaller a’ = computational saving, but reduced statistical

efficiency.

Sensible choice: a≤ b(c + 1)

bc + 1 <2:

N sub-quadratic in ` achieves

(32)

Computational & statistical tradeoff (W)

efficiency.

Sensible choice: a≤ b(c + 1)

bc + 1 <2:

N sub-quadratic in ` achieves

(33)

Computational & statistical tradeoff (W)

efficiency.

Sensible choice: a≤ b(c+ 1)

(34)

Computational & statistical tradeoff (W)

Meaning (h-dependence, N=`ah log(`)):

smoother K kernel is rewarding = bag-size reduction; see

(35)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE f_ˆ_zλ,fρ =O_p`−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E f_ˆ_zλ,fρ =Op `−bcbc+1 with λ=`−bcb+1_. Meaning (c-dependence): c 7→ b(c+ 1)

bc+ 1 decreasing: smaller bags are enough for easier

(36)

Misspecified setting

Relevant case: fρ∈L2ρX\H.

fρ: difficulty parameter = s ∈(0,1], larger s = easier problem.

Proof idea:

∞-D exponential family fitting [Sriperumbudur et al., 2014], ridge solution.

(37)

Computational & statistical tradeoff (M)

Let N=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =Op `−s2+1sa with λ=`−s+1a _, a≥ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (a-dependence):

efficiency. Sensible choice: a≤ s+ 1 s+ 2 ≤ 2 3 ⇒2a≤ 4 3 <2!

(38)

Computational & statistical tradeoff (M)

(39)

Computational & statistical tradeoff (M)

(40)

Computational & statistical tradeoff (M)

LetN=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =O_p`−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (h-dependence): h 7→ 2a

(41)

Computational & statistical tradeoff (M)

LetN=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =Op `−s2+1sa with λ=`−s+1a _, a≥ s+ 1 s+ 2, then E f_ˆ_zλ,fρ =O_p`−s2+2s with λ=`−s+21 . Meaning (s-dependence): s 7→ 2s

s+ 2 is increasing, i.e easier task

= better rate,

s →0: arbitrary slow rate.

(42)

Optimality of the rate (M)

Our rate: r(`) =`−s2+2s _{– range space assumption (}_s_).

One-stage sampled optimal rate: ro(`) =`−

2s

2s+1 [Steinwart et al., 2009],

range-space assumption + eigendecay constraint, D: compact metric,Y =R. 0.2 0.4 0.6 0.8 Rate −log l[ro(l)]=2s/(2s+1) −log[r(l)]=2s/(s+2)

(43)

Blanket assumptions: both settings

D: separable, topological domain.

k:

bounded: sup_u_∈_Dk(u,u)≤Bk ∈(0,∞),

continuous.

K: bounded; H¨older continuous: ∃L>0,h∈(0,1] such that

kK(·, µa)−K(·, µ_b)k_H≤Lkµa−µbkhH.

(44)

H¨

older

K

examples (other than the linear

K

when h=1)

In case of compact metricD, universal k:

KG Ke KC e− µa−µ_b 2 H 2θ2 e− µa−µ_b H 2θ2 1 +kµ_a−µbk2H/θ2 −1 h= 1 h= 1₂ h= 1 Kt Ki 1 +kµa−µbkθ_H−1 kµa−µbk2_H+θ2 −1₂

(45)

Vector-valued output: similarly

K(µa, µb)∈L(Y).

Prediction on a new test distribution (t):

(f_ˆ_zλ◦µ)(t) =k(K+`λI`)−1[y1;. . .;y`], K= [K(µˆxi, µˆxj)]∈L(Y) `×`_, k= [K(µˆx1, µt), . . . ,K(µˆx`, µt)]∈L(Y) 1×`_. (4) (5) (6) Specifically: Y =R⇒L(Y) =R;Y =Rd ⇒L(Y) =Rd×d.

(46)

Demo

Supervised entropy learning:

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

Aerosol prediction from satellite images:

State-of-the-art baseline: 7.5−8.5(±0.1−0.6). MERR:7.81 (±1.64).

(47)

Demo

Supervised entropy learning:

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

(48)

Summary

Problem: distribution regression (k).

Contribution:

computational & statistical tradeoff analysis,

specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size.

Code in ITE, analysis submitted to JMLR:

https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066.

(49)

Summary

Problem: distribution regression (k).

Contribution:

computational & statistical tradeoff analysis,

specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size. Code in ITE, analysis submitted to JMLR:

https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066.

(50)

Thank you for the attention!

(51)

Caponnetto, A. and De Vito, E. (2007).

Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.

Chen, Y. and Wu, O. (2012).

Contextual Hausdorff dissimilarity for multi-instance clustering.

InInternational Conference on Fuzzy Systems and Knowledge

Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).

Semigroup kernels on measures.

Journal of Machine Learning Research, 6:11691198.

Edgar, G. (1995).

(52)

Multi-instance kernels.

InInternational Conference on Machine Learning (ICML),

pages 179–186.

Haussler, D. (1999).

Convolution kernels on discrete structures.

Technical report, Department of Computer Science, University of California at Santa Cruz.

(http://cbse.soe.ucsc.edu/sites/default/files/

convolutions.pdf).

Hein, M. and Bousquet, O. (2005).

Hilbertian metrics and positive definite kernels on probability measures.

InInternational Conference on Artificial Intelligence and

(53)

Kandasamy, K., Krishnamurthy, A., P´oczos, B., Wasserman, L., and Robins, J. M. (2015).

Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University.

http://arxiv.org/abs/1411.4342.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009).

Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012).

A closed-form expression for the Sharma-Mittal entropy of exponential families.

(54)

Fast function to function regression.

InInternational Conference on Artificial Intelligence and

Statistics (AISTATS), pages 717–725.

Oliva, J., P´oczos, B., and Schneider, J. (2013).

Distribution to distribution regression.

International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., P´oczos, B., Schneider, J., and Xing, E. (2014).

Fast distribution to real regression.

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

(55)

P´oczos, B., Xiong, L., and Schneider, J. (2011).

Nonparametric divergence estimation with applications to machine learning on distributions.

InUncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and P´oczos, B. (2014).

k-NN regression on functional data with incomplete observations.

InConference on Uncertainty in Artificial Intelligence (UAI),

pages 692–701.

Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A., and Hyv¨arinen, A. (2014).

Density estimation in infinite dimensional exponential families. Technical report.

(56)

Sutherland, D. J., Oliva, J. B., P´oczos, B., and Schneider, J. (2015).

Linear-time learning on distributions with approximate kernel embeddings.

Technical report, Carnegie Mellon University.

http://arxiv.org/abs/1509.07553.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).

Closed-form Jensen-R´enyi divergence for mixture of Gaussians

and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000).

(57)

Wang, Z., Lan, L., and Vucetic, S. (2012).

Mixture model for multiple instance regression and applications in remote sensing.

IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).

Identifying multi-instance outliers.

InSIAM International Conference on Data Mining (SDM),

pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009).

Multi-instance clustering with applications to multi-instance prediction.

(58)

IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.