Learning Theory for Vector-Valued Distribution Regression

(1)


Zoltán Szabó (Gatsby Unit, UCL)

Joint work with

◦ Bharath K. Sriperumbudur (Department of Statistics, PSU),

◦ Barnabás Póczos (ML Department, CMU),

◦ Arthur Gretton (Gatsby Unit, UCL)

CMStatistics, London, UK, December 12, 2015

(2)

Outline

Motivation: application + theory.

Problem formulation.

Results: computational & statistical tradeoffs.

Numerical examples.

(3)

The task

Samples: $\{(x_i, y_i)\}_{i=1}^{\ell}$. Find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.

Distribution regression:

the $x_i$-s are distributions,

available only through samples: $\{x_{i,n}\}_{n=1}^{N_i}$, i.e. labelled bags.
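To make the data format concrete, here is a minimal sketch of labelled bags. Everything in it is an illustrative assumption rather than the data used in the talk: each bag holds samples from a one-dimensional Gaussian N(0, sigma_i^2), and the label is the differential entropy of that Gaussian, 0.5 log(2 pi e sigma_i^2). The function name `make_labelled_bags` is hypothetical.

```python
import numpy as np

def make_labelled_bags(num_bags=5, bag_size=100, seed=0):
    """Toy labelled bags (assumed setup): bag i ~ N(0, sigma_i^2), label = entropy."""
    rng = np.random.default_rng(seed)
    bags, labels = [], []
    for _ in range(num_bags):
        sigma = rng.uniform(0.5, 2.0)                      # hidden parameter of x_i
        bags.append(rng.normal(0.0, sigma, size=(bag_size, 1)))  # samples x_{i,n}
        labels.append(0.5 * np.log(2.0 * np.pi * np.e * sigma**2))  # y_i
    return bags, np.array(labels)

bags, labels = make_labelled_bags()
print(len(bags), bags[0].shape, labels.shape)
```

The learner only ever sees `bags` and `labels`; the underlying distributions stay hidden.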


(5)

Motivation (application): aerosol prediction

Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Relevance: climate research, sustainability.

Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5 to 8.5.

Using distribution regression?

(6)

Wider context

Context:

machine learning: multi-instance learning,

statistics: point estimation tasks (without analytical formula).

Applications:

(7)

Several algorithmic approaches

1. Parametric fit: Gaussian, MOG, exp. family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2. Kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].

3. (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].

4. Divergence measures (KL, Rényi, Tsallis, ...) [Póczos et al., 2011, Kandasamy et al., 2015].

5. Set metrics: Hausdorff metric [Edgar, 1995]; variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

(8)

Motivation: theory

MIL dates back to [Haussler, 1999, Gärtner et al., 2002].

Sensible methods in regression: require density estimation [Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014, Sutherland et al., 2015] + assumptions:


(11)

Input-output requirements

Input: distributions on 'structured' domains $\mathcal{D}$ (domains equipped with a kernel).

Output:

simplest case: $Y = \mathbb{R}$, but

dependencies might matter: $Y = \mathbb{R}^d$ (or a separable Hilbert space).


(13)

Kernel, RKHS

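$k : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ is a kernel on $\mathcal{D}$ if

$\exists \varphi : \mathcal{D} \to H$ (Hilbert space) feature map such that $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in \mathcal{D}$).

Kernel examples: $\mathcal{D} = \mathbb{R}^d$ ($p > 0$, $\theta > 0$)

$k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial,

$k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian,

$k(a, b) = e^{-\theta \|a - b\|_1}$: Laplacian.

In the $H = H(k)$ RKHS ($\exists!$): $\varphi(u) = k(\cdot, u)$.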
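The three example kernels can be written down directly. This is a minimal sketch; the function names and the default values of theta and p are assumptions made for illustration.

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    """k(a, b) = (<a, b> + theta)^p."""
    return (np.dot(a, b) + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta**2))

def laplacian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-theta ||a - b||_1)."""
    return np.exp(-theta * np.sum(np.abs(a - b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(polynomial_kernel(a, b), gaussian_kernel(a, b), laplacian_kernel(a, b))
```

For these two points, <a, b> = 0, ||a - b||_2^2 = 2 and ||a - b||_1 = 2, so the values are 1, exp(-1) and exp(-2) respectively.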

(14)

RKHS: evaluation point of view

Let $H \subset \mathbb{R}^{\mathcal{D}}$ be a Hilbert space.

Consider, for a fixed $x \in \mathcal{D}$, the map $\delta_x : f \in H \mapsto f(x) \in \mathbb{R}$.

The evaluation functional is linear:


$\delta_x(\alpha f + \beta g) = \alpha \delta_x(f) + \beta \delta_x(g)$.

Def.: $H$ is called an RKHS if $\delta_x$ is continuous for all $x \in \mathcal{D}$.


(17)

RKHS: reproducing point of view

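$k : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ is called a reproducing kernel of $H$ if for all $x \in \mathcal{D}$, $f \in H$

1. $k(\cdot, x) \in H$,

2. $\langle f, k(\cdot, x) \rangle_H = f(x)$ (reproducing property).

Specifically, $\forall x, y \in \mathcal{D}$: $k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_H$.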

(18)

RKHS: positive-definite point of view

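Let $k : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ be a symmetric function. $k$ is called positive definite if $\forall n \ge 1$, $\forall (a_1, \dots, a_n) \in \mathbb{R}^n$, $(x_1, \dots, x_n) \in \mathcal{D}^n$:

$$\sum_{i,j=1}^{n} a_i a_j k(x_i, x_j) = \mathbf{a}^T \mathbf{G} \mathbf{a} \ge 0,$$

where $\mathbf{G} = [k(x_i, x_j)]_{i,j=1}^{n}$.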
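The positive-definiteness condition can be checked numerically on a finite point set by looking at the smallest eigenvalue of the Gram matrix. The sketch below is only a sanity check on one sample of points, not a proof; the helper names and the tolerance are assumptions. It contrasts a Gaussian kernel with the negative squared distance, which is symmetric but not positive definite.

```python
import numpy as np

def gram_matrix(kernel, points):
    """G[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(points[i], points[j])
    return G

def is_positive_semidefinite(G, tol=1e-10):
    """True if the smallest eigenvalue of the symmetrised G is >= -tol."""
    eigvals = np.linalg.eigvalsh((G + G.T) / 2.0)
    return bool(eigvals.min() >= -tol)

def gaussian(a, b):
    return np.exp(-np.sum((a - b) ** 2) / 2.0)

def neg_sq_dist(a, b):
    # Symmetric, but its Gram matrix has zero trace and is nonzero,
    # so it must have a negative eigenvalue.
    return -np.sum((a - b) ** 2)

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
psd_gaussian = is_positive_semidefinite(gram_matrix(gaussian, points))
psd_neg_dist = is_positive_semidefinite(gram_matrix(neg_sq_dist, points))
print(psd_gaussian, psd_neg_dist)
```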

(19)

Kernel: example domains (D)

Euclidean space ($\mathcal{D} = \mathbb{R}^d$), graphs, texts, time series, dynamical systems, distributions!

(20)

Problem formulation (Y = R)

Given:

labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{\ell}$,

$i$th bag: $\hat{x}_i = \{x_{i,1}, \dots, x_{i,N}\} \overset{\text{i.i.d.}}{\sim} x_i \in \mathcal{P}(\mathcal{D})$, $y_i \in \mathbb{R}$.


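Task: find a $\mathcal{P}(\mathcal{D}) \to \mathbb{R}$ mapping based on $\hat{z}$.

Construction: mean embedding ($\mu_x$) + ridge regression

$$\underbrace{\mathcal{P}(\mathcal{D}) \xrightarrow{\mu = \mu(k)} X \subseteq H = H(k)}_{\text{②: two-stage sampling}} \; \underbrace{\xrightarrow{f \in \mathcal{H} = \mathcal{H}(K)} \mathbb{R}}_{\text{①: Hilbert} \to \mathbb{R} \text{ regression}}.$$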

(23)

①: Hilbert → R regression, well-specified case

Regression function, expected risk (assume for a moment: $f_\rho \in \mathcal{H}$):

$$f_\rho(\mu_x) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_x), \qquad \mathcal{R}[f] = \mathbb{E}_{(x,y)} |f(\mu_x) - y|^2.$$



(27)

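①: Hilbert → R regression

Known [Caponnetto and De Vito, 2007]: if $\rho(\mu_x, y) \in \mathcal{P}(b, c)$, then the best/achieved rate is

$$\mathcal{E}(f_z^{\lambda}, f_\rho) = O_p\left(\ell^{-\frac{bc}{bc+1}}\right) \qquad (1 < b,\ c \in (1, 2]).$$

$\rho \in \mathcal{P}(b, c)$:

$$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a) : \mathcal{H} \to \mathcal{H}.$$

Eigenvalues of $T$ decay as $\lambda_n = O(n^{-b})$.

$f_\rho \in \mathrm{Im}\left(T^{\frac{c-1}{2}}\right)$.

Intuition: $1/b$ is the effective input dimension, $c$ is the smoothness of $f_\rho$.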

(28)

Our question

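Can we reach this $O_p\left(\ell^{-\frac{bc}{bc+1}}\right)$ minimax rate? $N = ?$

$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{\hat{x}_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2,$$

$$f_{z}^{\lambda} = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{x_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2.$$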

(29)

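②: mean embedding, $\mu_{x_i} \to \mu_{\hat{x}_i}$

$k : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ kernel; canonical feature map: $\varphi(u) = k(\cdot, u)$.

Mean embedding of a distribution $x, \hat{x}_i \in \mathcal{P}(\mathcal{D})$:

$$\mu_x = \int_{\mathcal{D}} k(\cdot, u) \, dx(u) \in H(k), \qquad \mu_{\hat{x}_i} = \int_{\mathcal{D}} k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}).$$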


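Linear $K$ ⇒ set kernel:

$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_{H} = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m}).$$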

Nonlinear $K$ example:

$$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = e^{-\frac{\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_{H}^2}{2\sigma^2}}.$$
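Both bag-level kernels reduce to averages of the base kernel k over pairs of samples, so they can be computed without ever forming the embeddings explicitly. The following is a minimal sketch assuming a Gaussian base kernel; the function names and bandwidths are illustrative. Bags of different sizes are handled by averaging over all N_i x N_j pairs, which coincides with the 1/N^2 normalisation when the sizes agree.

```python
import numpy as np

def base_gram(X, Y, theta=1.0):
    """Gaussian base kernel k between all rows of X (n, d) and Y (m, d)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel(X, Y, theta=1.0):
    """Linear K: <mu_X, mu_Y>_H = average of k over all sample pairs."""
    return float(base_gram(X, Y, theta).mean())

def embedding_sq_dist(X, Y, theta=1.0):
    """||mu_X - mu_Y||_H^2, expanded into three set-kernel evaluations."""
    return set_kernel(X, X, theta) + set_kernel(Y, Y, theta) - 2.0 * set_kernel(X, Y, theta)

def gaussian_on_embeddings(X, Y, theta=1.0, sigma=1.0):
    """Nonlinear K: exp(-||mu_X - mu_Y||_H^2 / (2 sigma^2))."""
    return float(np.exp(-max(embedding_sq_dist(X, Y, theta), 0.0) / (2.0 * sigma**2)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 2))   # bag from one distribution
Y = rng.normal(1.0, 1.0, size=(80, 2))   # bag from a shifted distribution
print(set_kernel(X, Y), gaussian_on_embeddings(X, Y))
```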

(32)

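②: ridge regression ⇒ analytical solution

Given: training sample $\hat{z}$, test distribution $t$.

Prediction on $t$:

$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda \mathbf{I}_\ell)^{-1} [y_1; \dots; y_\ell], \qquad (1)$$

$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{\ell \times \ell}, \qquad (2)$$

$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \dots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathbb{R}^{1 \times \ell}. \qquad (3)$$

⇒ only the values $K(\mu, \mu') = \langle K(\cdot, \mu), K(\cdot, \mu') \rangle_{\mathcal{H}}$ matter.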
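The closed-form prediction in (1)-(3) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the ITE code: the base kernel is Gaussian, K is the linear (set) kernel, and the toy task (bags drawn from N(m, 1) with label m) and the value of lambda are made up for the example. The names `fit` and `predict` are hypothetical.

```python
import numpy as np

def set_kernel(X, Y, theta=1.0):
    """Linear K on mean embeddings: average of Gaussian k over all sample pairs."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return float(np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2)).mean())

def fit(bags, y, lam, theta=1.0):
    """Return alpha = (K + l * lam * I)^{-1} y for the bag-level Gram matrix K."""
    l = len(bags)
    K = np.array([[set_kernel(bi, bj, theta) for bj in bags] for bi in bags])
    return np.linalg.solve(K + l * lam * np.eye(l), np.asarray(y, dtype=float))

def predict(bags, alpha, test_bag, theta=1.0):
    """Prediction k (K + l * lam * I)^{-1} y on a test bag."""
    k = np.array([set_kernel(b, test_bag, theta) for b in bags])
    return float(k @ alpha)

# Toy task (assumed): bag i is drawn from N(m_i, 1) and its label is m_i.
rng = np.random.default_rng(0)
means = np.linspace(-2.0, 2.0, 30)
train_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in means]
alpha = fit(train_bags, means, lam=1e-3)

pred_low = predict(train_bags, alpha, rng.normal(-1.5, 1.0, size=(100, 1)))
pred_high = predict(train_bags, alpha, rng.normal(1.5, 1.0, size=(100, 1)))
print(pred_low, pred_high)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly; the test bag here is itself a sample, so the sketch uses the empirical embedding of t.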

(33)

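①-②: Why can we get consistency? (intuition)

Convergence of the mean embedding: $\|\mu_x - \mu_{\hat{x}}\|_{H(k)} = O_p\left(\frac{1}{\sqrt{N}}\right)$.

Hölder property of $K$ ($0 < L$, $0 < h \le 1$):

$$\|K(\cdot, \mu_x) - K(\cdot, \mu_{\hat{x}})\|_{\mathcal{H}(K)} \le L \|\mu_x - \mu_{\hat{x}}\|_{H(k)}^{h}.$$

$f_{\hat{z}}^{\lambda}$ depends 'nicely' on $\mu_{\hat{x}}$.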
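The 1/sqrt(N) behaviour can be observed numerically. Since the population embedding is not available in general, the sketch below uses a proxy (an assumption of this example): the RKHS distance between the empirical embeddings of two independent bags drawn from the same distribution, which by the triangle inequality is at most twice the quantity of interest in order of magnitude. The Gaussian base kernel, the N(0, 1) data and the number of repetitions are also illustrative choices.

```python
import numpy as np

def gauss_gram(X, Y, theta=1.0):
    """Gaussian base kernel between all rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def embedding_distance(X, Y, theta=1.0):
    """||mu_X - mu_Y||_H for two empirical mean embeddings."""
    sq = (gauss_gram(X, X, theta).mean() + gauss_gram(Y, Y, theta).mean()
          - 2.0 * gauss_gram(X, Y, theta).mean())
    return float(np.sqrt(max(sq, 0.0)))

def average_distance(N, reps=10, seed=0):
    """Average distance between two independent size-N bags from N(0, 1)."""
    rng = np.random.default_rng(seed)
    values = [embedding_distance(rng.normal(size=(N, 1)), rng.normal(size=(N, 1)))
              for _ in range(reps)]
    return float(np.mean(values))

distances = {N: average_distance(N) for N in (50, 200, 800)}
for N, d in distances.items():
    # If the distance scales like 1/sqrt(N), d * sqrt(N) stays roughly constant.
    print(N, round(d, 4), round(d * np.sqrt(N), 3))
```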


(35)

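①-②: Proof idea

By decomposing the excess risk and using concentration, on $\mathcal{P}(b, c)$ we get

$$\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) \le \underbrace{\frac{\log^h(\ell)}{N^h \lambda} \left( \frac{1}{\lambda^2 \ell^2} + 1 + \frac{1}{\ell \lambda^{1 + \frac{1}{b}}} \right)}_{\text{②: two-stage sampling}} + \underbrace{\lambda^c + \frac{1}{\ell^2 \lambda} + \frac{1}{\ell \lambda^{\frac{1}{b}}}}_{\text{①: } \mathcal{H} \to \mathbb{R} \text{ regression}} \to 0,$$

s.t. $\ell \lambda^{\frac{b+1}{b}} \ge 1$, $\frac{\log(\ell)}{\lambda^{2/h}} \le N$.

Let $N = \ell^{a/h} \log(\ell)$ ⇒ the 1st term and the constraints simplify.

$a > 0$ is needed, i.e. $N > \log(\ell)$.

Bias-variance trick with constraint-checking ⇒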

(36)

Computational & statistical tradeoff (W)

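If $a \le \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = O_p\left(\ell^{-\frac{ac}{c+1}}\right)$ with $\lambda = \ell^{-\frac{a}{c+1}}$;

if $a \ge \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = O_p\left(\ell^{-\frac{bc}{bc+1}}\right)$ with $\lambda = \ell^{-\frac{b}{bc+1}}$.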

(37)

Computational & statistical tradeoff (W)

Meaning ($a$-dependence, $N = \ell^{a/h} \log(\ell)$):

'Smaller $a$' = computational saving, but reduced statistical efficiency.


(39)

Computational & statistical tradeoff (W)

Meaning ($h$-dependence, $N = \ell^{a/h} \log(\ell)$):

a smoother kernel $K$ (larger $h$) is rewarding = bag-size reduction; see the smoothness of $f_\rho$.

(40)

Computational & statistical tradeoff (W)

Meaning ($c$-dependence):

$c \mapsto \frac{b(c+1)}{bc+1}$ is decreasing: smaller bags are enough for easier problems.
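The two regimes of the well-specified result can be tabulated with a small helper. This is only arithmetic on the exponents stated above (constants hidden in the O_p notation are ignored); the function name and the example values of a, b, c, h and the number of bags are assumptions.

```python
import math

def well_specified_rate(a, b, c, h, ell):
    """Exponents of the well-specified tradeoff for bag size N = ell^(a/h) * log(ell)."""
    threshold = b * (c + 1) / (b * c + 1)
    if a <= threshold:
        rate_exp, lam_exp = a * c / (c + 1), a / (c + 1)
    else:
        rate_exp, lam_exp = b * c / (b * c + 1), b / (b * c + 1)
    return {
        "threshold": threshold,
        "bag_size": ell ** (a / h) * math.log(ell),
        "lambda": ell ** (-lam_exp),
        "rate_exponent": rate_exp,   # excess risk = O_p(ell^(-rate_exponent))
    }

small = well_specified_rate(a=0.6, b=2.0, c=2.0, h=1.0, ell=1000)
large = well_specified_rate(a=1.5, b=2.0, c=2.0, h=1.0, ell=1000)
print(small)
print(large)
```

With b = c = 2 the threshold is 1.2: a = 0.6 gives the exponent 0.4 with small bags, while a = 1.5 already attains the minimax exponent bc/(bc+1) = 0.8 and larger a would not improve it further.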

(41)

Misspecified setting

Relevant case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.

$f_\rho$: difficulty parameter $s \in (0, 1]$; larger $s$ = easier problem.

Proof idea:

∞-D exponential family fitting [Sriperumbudur et al., 2014], ridge solution.

(42)

Computational & statistical tradeoff (M)

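Let $N = \ell^{2a/h} \log(\ell)$ ($a > 0$).

If $a \le \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = O_p\left(\ell^{-\frac{2sa}{s+1}}\right)$ with $\lambda = \ell^{-\frac{a}{s+1}}$;

if $a \ge \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^{\lambda}, f_\rho) = O_p\left(\ell^{-\frac{2s}{s+2}}\right)$ with $\lambda = \ell^{-\frac{1}{s+2}}$.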

(43)

Computational & statistical tradeoff (M)

Meaning ($a$-dependence):


(45)

Computational & statistical tradeoff (M)

Meaning ($h$-dependence): $h \mapsto \frac{2a}{h}$ is decreasing: a smoother kernel $K$ is rewarding.

(46)

Computational & statistical tradeoff (M)

Meaning ($s$-dependence): $s \mapsto \frac{2s}{s+2}$ is increasing, i.e. easier task = better rate;

$s \to 0$: arbitrarily slow rate.

(47)

Optimality of the rate (M)

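Our rate: $r(\ell) = \ell^{-\frac{2s}{s+2}}$, under a range space assumption ($s$).

One-stage sampled optimal rate: $r_o(\ell) = \ell^{-\frac{2s}{2s+1}}$ [Steinwart et al., 2009], under a range-space assumption + eigendecay constraint, $\mathcal{D}$: compact metric, $Y = \mathbb{R}$.

[Figure: rate exponents as a function of the smoothness $s \in [0, 1]$: $-\log_\ell[r_o(\ell)] = 2s/(2s+1)$ and $-\log_\ell[r(\ell)] = 2s/(s+2)$; horizontal axis: Smoothness ($s$), vertical axis: Rate.]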
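The misspecified exponents and the comparison with the one-stage rate can be reproduced numerically. The sketch below only evaluates the formulas stated in the slides; the function names and the chosen values of s, h and the number of bags are assumptions for illustration.

```python
import math

def misspecified_rate(a, s, h, ell):
    """Exponents of the misspecified tradeoff for bag size N = ell^(2a/h) * log(ell)."""
    threshold = (s + 1) / (s + 2)
    if a <= threshold:
        rate_exp, lam_exp = 2 * s * a / (s + 1), a / (s + 1)
    else:
        rate_exp, lam_exp = 2 * s / (s + 2), 1 / (s + 2)
    return {
        "bag_size": ell ** (2 * a / h) * math.log(ell),
        "lambda": ell ** (-lam_exp),
        "rate_exponent": rate_exp,   # excess risk = O_p(ell^(-rate_exponent))
    }

def one_stage_exponent(s):
    """Exponent 2s/(2s+1) of the one-stage sampled optimal rate."""
    return 2 * s / (2 * s + 1)

for s in (0.25, 0.5, 1.0):
    best = misspecified_rate(a=(s + 1) / (s + 2), s=s, h=1.0, ell=1000)["rate_exponent"]
    print(s, round(best, 3), round(one_stage_exponent(s), 3))
```

Since s + 2 >= 2s + 1 exactly when s <= 1, the two-stage exponent 2s/(s+2) is never larger than the one-stage exponent on (0, 1], and the two coincide at s = 1 (both equal 2/3).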

(48)

Blanket assumptions: both settings

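$\mathcal{D}$: separable, topological domain.

$k$:

bounded: $\sup_{u \in \mathcal{D}} k(u, u) \le B_k \in (0, \infty)$,

continuous.

$K$: bounded; Hölder continuous: $\exists L > 0$, $h \in (0, 1]$ such that

$$\|K(\cdot, \mu_a) - K(\cdot, \mu_b)\|_{\mathcal{H}} \le L \|\mu_a - \mu_b\|_{H}^{h}.$$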

(49)

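Hölder $K$ examples (other than the linear $K$, where $h = 1$)

In case of compact metric $\mathcal{D}$, universal $k$:

$K_G(\mu_a, \mu_b) = e^{-\frac{\|\mu_a - \mu_b\|_H^2}{2\theta^2}}$, $h = 1$

$K_e(\mu_a, \mu_b) = e^{-\frac{\|\mu_a - \mu_b\|_H}{2\theta^2}}$, $h = \frac{1}{2}$

$K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$, $h = 1$

$K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^{\theta}\right)^{-1}$, $h = \frac{\theta}{2}$ ($\theta \le 2$)

$K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$, $h = 1$

Functions of $\|\mu_a - \mu_b\|_H$ ⇒ computation: similar to the set kernel.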
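Because each of these kernels is a function of the RKHS distance between embeddings, they can all be obtained from the linear (set kernel) Gram matrix G through the identity ||mu_a - mu_b||^2 = G[a, a] + G[b, b] - 2 G[a, b]. A minimal sketch, with hypothetical function names and an assumed default theta = 1:

```python
import numpy as np

def sq_dists_from_gram(G):
    """Pairwise ||mu_a - mu_b||_H^2 from the linear Gram matrix G[a, b] = <mu_a, mu_b>_H."""
    d = np.diag(G)
    return np.maximum(d[:, None] + d[None, :] - 2.0 * G, 0.0)

def holder_kernels(G, theta=1.0):
    """The five example kernels, evaluated on all pairs of embeddings."""
    D2 = sq_dists_from_gram(G)
    D = np.sqrt(D2)
    return {
        "K_G": np.exp(-D2 / (2.0 * theta**2)),
        "K_e": np.exp(-D / (2.0 * theta**2)),
        "K_C": 1.0 / (1.0 + D2 / theta**2),
        "K_t": 1.0 / (1.0 + D**theta),        # stated for theta <= 2
        "K_i": 1.0 / np.sqrt(D2 + theta**2),
    }

# Toy linear Gram matrix of two embeddings (made-up values).
G = np.array([[1.0, 0.5],
              [0.5, 1.0]])
kernels = holder_kernels(G)
for name, K in kernels.items():
    print(name, K[0, 1])
```

In this toy example the squared distance between the two embeddings is 1 + 1 - 2 * 0.5 = 1, so the off-diagonal entries are exp(-1/2), exp(-1/2), 1/2, 1/2 and 1/sqrt(2).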

(50)

Vector-valued output: similarly

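$K(\mu_a, \mu_b) \in \mathcal{L}(Y)$.

Prediction on a new test distribution ($t$):

$$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda \mathbf{I}_\ell)^{-1} [y_1; \dots; y_\ell], \qquad (4)$$

$$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathcal{L}(Y)^{\ell \times \ell}, \qquad (5)$$

$$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \dots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathcal{L}(Y)^{1 \times \ell}. \qquad (6)$$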

(51)

Demo

Supervised entropy learning:

[Figure: entropy as a function of the rotation angle ($\beta$): true values together with the MERR and DFDR estimates; RMSE: MERR = 0.75, DFDR = 2.02. A second panel compares the RMSE of MERR and DFDR.]



(54)

Summary

Problem: distribution regression (k).

Contribution:

computational & statistical tradeoff analysis,

specifically:

the set kernel is consistent (a 16-year-old open question),

the minimax optimal rate is achievable with sub-quadratic bag size.

Code in ITE, analysis submitted to JMLR:

https://bitbucket.org/szzoli/ite/

http://arxiv.org/abs/1411.2066

(55)

Thank you for your attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. A part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

(56)

Caponnetto, A. and De Vito, E. (2007). Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995).

(57)

Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

(58)

Kandasamy, K., Krishnamurthy, A., Póczos, B., Wasserman, L., and Robins, J. M. (2015). Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University. http://arxiv.org/abs/1411.4342.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families.

(59)

Fast function to function regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 717–725.

Oliva, J., Póczos, B., and Schneider, J. (2013). Distribution to distribution regression. International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011).

(60)

Nonparametric divergence estimation with applications to machine learning on distributions. In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and Póczos, B. (2014). k-NN regression on functional data with incomplete observations. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 692–701.

Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2014). Density estimation in infinite dimensional exponential families. Technical report.

(61)

Sutherland, D. J., Oliva, J. B., Póczos, B., and Schneider, J. (2015). Linear-time learning on distributions with approximate kernel embeddings. Technical report, Carnegie Mellon University. http://arxiv.org/abs/1509.07553.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach. In International Conference on Machine Learning (ICML), pages 1119–1126.

(62)

Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction.

(63)

IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.