• No results found

Distribution Regression: Computational and Statistical Tradeoffs

N/A
N/A
Protected

Academic year: 2021

Share "Distribution Regression: Computational and Statistical Tradeoffs"

Copied!
58
0
0

Loading.... (view fulltext now)

Full text

(1)

Distribution Regression: Computational &

Statistical Tradeoffs

Zolt´an Szab´o (Gatsby Unit, UCL)

Joint work with

◦Bharath K. Sriperumbudur (Department of Statistics, PSU), ◦Barnab´as P´oczos (ML Department, CMU),

(2)

Outline

Motivation: application + theory. Problem formulation.

Results: computational & statistical tradeoffs. Numerical examples.

(3)

The task

Samples: {(xi,yi)}`i=1. Find f ∈H such that f(xi)≈yi.

Distribution regression:

xi-s are distributions,

available only through samples: {xi,n}Nn=1i , labelledbags.

(4)

The task

Samples: {(xi,yi)}`i=1. Find f ∈H such that f(xi)≈yi.

Distribution regression:

xi-s are distributions,

available only through samples: {xi,n}Nn=1i , labelledbags.

(5)

Motivation (application): aerosol prediction

Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.

Relevance: climate research, sustainability.

Engineered methods [Wang et al., 2012]: 100×RMSE = 7.5−8.5.

(6)

Wider context

Context:

machine learning: multi-instance learning,

statistics: point estimation tasks (without analytical formula).

Applications:

(7)

Several algorithmic approaches

1 Parametric fit: Gaussian, MOG, exp. family

[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].

2 Kernelized Gaussian measures:

[Jebara et al., 2004, Zhou and Chellappa, 2006].

3 (Positive definite) kernels:

[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].

4 Divergence measures (KL, R´enyi, Tsallis, . . . ):

[P´oczos et al., 2011, Kandasamy et al., 2015].

5 Set metrics: Hausdorff metric [Edgar, 1995]; variants

(8)

Motivation: theory

MIL dates back to [Haussler, 1999, G¨artner et al., 2002].

Sensiblemethods in regression: require density estimation

[P´oczos et al., 2013, Oliva et al., 2014, Reddi and P´oczos, 2014,

Sutherland et al., 2015] + assumptions:

(9)

Input-output requirements

Input: distributions on ’structured’ Ddomains (kernels).

Output:

simplest case: Y =R, but

(10)

Input-output requirements

Input: distributions on ’structured’ Ddomains (kernels).

Output:

simplest case: Y =R, but

(11)

Input-output requirements

Input: distributions on ’structured’ Ddomains (kernels).

Output:

simplest case: Y =R, but

(12)

Kernel, RKHS

k :D×D→Rkernel on D, if

∃ϕ:D→H(ilbert space) feature map,

k(a,b) =hϕ(a), ϕ(b)iH (∀a,b∈D). Kernel examples: D=Rd (p >0, θ >0) k(a,b) = (ha,bi+θ)p: polynomial, k(a,b) =e−ka−bk22/(2θ 2) : Gaussian, k(a,b) =e−θka−bk1: Laplacian. In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(13)

Kernel, RKHS

k :D×D→Rkernel on D, if

∃ϕ:D→H(ilbert space) feature map,

k(a,b) =hϕ(a), ϕ(b)iH (∀a,b∈D). Kernel examples: D=Rd (p >0, θ >0) k(a,b) = (ha,bi+θ)p: polynomial, k(a,b) =e−ka−bk22/(2θ 2) : Gaussian, k(a,b) =e−θka−bk1: Laplacian. In the H=H(k) RKHS (∃!): ϕ(u) =k(·,u).

(14)

Kernel: example domains (

D

)

Euclidean space (D=Rd), graphs, texts, time series,

(15)

Problem formulation (Y

=

R

)

Given: labelled bags ˆz={(ˆxi,yi)} ` i=1, ith bag: ˆxi={xi,1, . . . ,xi,N} i.i.d. ∼ xi ∈P(D),yi ∈R.

Task: find aP(D)→Rmapping based on ˆz.

Construction: mean embedding (µx)

P(D)−−−−−→µ=µ(k) X ⊆H

| {z }

2 =two-stage sampling

(16)

Problem formulation (Y

=

R

)

Given: labelled bags ˆz={(ˆxi,yi)} ` i=1, ith bag: ˆxi={xi,1, . . . ,xi,N} i.i.d. ∼ xi ∈P(D),yi ∈R.

Task: find aP(D)→Rmapping based on ˆz.

Construction: mean embedding (µx)

P(D)−−−−−→µ=µ(k) X ⊆H

| {z }

2 =two-stage sampling

(17)

Problem formulation (Y

=

R

)

Given: labelled bags ˆz={(ˆxi,yi)} ` i=1, ith bag: ˆxi={xi,1, . . . ,xi,N} i.i.d. ∼ xi ∈P(D),yi ∈R.

Task: find aP(D)→Rmapping based on ˆz.

Construction: mean embedding (µx) +ridge regression

P(D)−−−−−→µ=µ(k) X ⊆H | {z } 2 =two-stage sampling =H(k)−−−−−−−→f∈H=H(K) R | {z } 1 =Hilbert→Rregression .

(18)

1

: Hilbert

R

regression, well-specified case

Regression function, expected risk (assume for a moment: fρ∈H):

fρ(µx) = Z R ydρ(y|µx), R[f] =E(x,y)|f(µx)−y|2. Ridge estimator: fzλ = arg min f∈H 1 ` ` X i=1 |f(µxi)−yi| 2 +λkfk2H, (λ >0). Excess risk: E(fzλ,fρ) =R[fzλ]− R[fρ].

(19)

1

: Hilbert

R

regression, well-specified case

Regression function, expected risk (assume for a moment: fρ∈H):

fρ(µx) = Z R ydρ(y|µx), R[f] =E(x,y)|f(µx)−y|2. Ridge estimator: fzλ = arg min f∈H 1 ` ` X i=1 |f(µxi)−yi| 2 +λkfk2H, (λ >0). Excess risk: E(fzλ,fρ) =R[fzλ]− R[fρ].

(20)

1

: Hilbert

R

regression, well-specified case

Regression function, expected risk (assume for a moment: fρ∈H):

fρ(µx) = Z R ydρ(y|µx), R[f] =E(x,y)|f(µx)−y|2. Ridge estimator: fzλ = arg min f∈H 1 ` ` X i=1 |f(µxi)−yi| 2 +λkfk2H, (λ >0). Excess risk: λ λ

(21)

1

: Hilbert

R

regression

Known [Caponnetto and De Vito, 2007]: if ρ(µx,y)∈ P(b,c),

then the best/achieved rate

E(fzλ,fρ) =Op `−bcbc+1 (1<b,c ∈(1,2]). ρ∈ P(b,c): T = Z X K(·, µa)K∗(·, µa)dρX(µa) :H→H. Eigenvalues ofT decay asλn=O(n−b). fρ∈Im Tc−21 . Intuition: 1/b– effective input dimension,c – smoothness offρ.

(22)

1

: Hilbert

R

regression

Known [Caponnetto and De Vito, 2007]: if ρ(µx,y)∈ P(b,c),

then the best/achieved rate

E(fzλ,fρ) =Op `−bcbc+1 (1<b,c ∈(1,2]). ρ∈ P(b,c): T = Z X K(·, µa)K∗(·, µa)dρX(µa) :H→H. Eigenvalues ofT decay asλn=O(n−b). fρ∈Im Tc−21 .

(23)

Our question

Can we reach thisOp`−bcbc+1

minimax rate? N=? fˆzλ= arg min f∈H 1 ` ` X i=1 |f(µˆxi)−yi| 2+λkfk2 H, fzλ= arg min f∈H 1 ` ` X i=1 |f(µxi)−yi| 2+λkfk2 H.

(24)

2

: mean embedding,

µ

xi

µ

xˆi

k :D×D→R kernel; canonical feature map: ϕ(u) =k(·,u).

Mean embedding of a distribution x,xiˆ ∈P(D):

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n).

LinearK ⇒ set kernel:

K(µxˆi, µˆxj) = µˆxi, µˆxj H = 1 N2 N X n,m=1 k(xi,n,xj,m).

(25)

2

: mean embedding,

µ

xi

µ

xˆi

k :D×D→R kernel; canonical feature map: ϕ(u) =k(·,u).

Mean embedding of a distribution x,xiˆ ∈P(D):

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n).

LinearK ⇒ set kernel:

K(µ , µ ) =µ , µ = 1

N

X

(26)

2

: mean embedding,

µ

xi

µ

xˆi

k :D×D→R kernel; canonical feature map: ϕ(u) =k(·,u).

Mean embedding of a distribution x,xiˆ ∈P(D):

µx = Z Dk( ·,u)dx(u)∈H(k), µxˆi = Z Dk( ·,u)dˆxi(u) = 1 N N X n=1 k(·,xi,n). NonlinearK example: K(µ , µ ) =e− kµxiˆ−µˆxjk2 H 2 .

(27)

2

: ridge regression

analytical solution

Given: training sample: ˆz, test distribution: t. Prediction on t: (fˆzλ◦µ)(t) =k(K+`λI`)−1[y1;. . .;y`], K= [K(µˆxi, µˆxj)]∈R `×`, k= [K(µˆx1, µt), . . . ,K(µˆx`, µt)]∈R 1×`. (1) (2) (3) ⇒ K(µx, µx0) =K(·, µx),K(·, µ0) H matter.

(28)

1

-

2

: Why can we get consistency? – intuition

Convergence of the mean embedding:

x −µˆxkH(k)=Op 1 √ N . H¨older property ofK (0<L, 0<h ≤1): kK(·, µx)−K(·, µˆx)kH(K)≤Lkµx −µˆxkhH(k). fˆzλ depends ’nicely’ on µˆx.

(29)

1

-

2

: Proof idea

By decomposing the excess risk, concentration, onP(b,c) we get

E(fˆzλ,fρ)≤ logh(`) Nhλ 1 λ2`2 + 1 + 1 `λ1+1b | {z } 2 =two-stage sampling + λc+ 1 `2λ+ 1 `λb1 | {z } 1 =H→Rregression →0, s.t. `λb+1b ≥1,log(`) λ2h ≤N.

Let N=`ahlog(`) ⇒ 1st term + constraints simplify.

a>0: needed, i.e. N>log(`).

(30)

1

-

2

: Proof idea

By decomposing the excess risk, concentration, onP(b,c) we get

E(fˆzλ,fρ)≤ logh(`) Nhλ 1 λ2`2 + 1 + 1 `λ1+1b | {z } 2 =two-stage sampling + λc+ 1 `2λ+ 1 `λb1 | {z } 1 =H→Rregression →0, s.t. `λb+1b ≥1,log(`) λ2h ≤N.

Let N=`ahlog(`) ⇒ 1st term + constraints simplify.

(31)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE fˆzλ,fρ =Op `−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E fˆzλ,fρ =Op`−bcbc+1 with λ=`−bcb+1.

Meaning (a-dependence,N =`halog(`)):

’Smaller a’ = computational saving, but reduced statistical

efficiency.

Sensible choice: a≤ b(c + 1)

bc + 1 <2:

N sub-quadratic in ` achieves

(32)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE fˆzλ,fρ =Op `−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E fˆzλ,fρ =Op`−bcbc+1 with λ=`−bcb+1.

Meaning (a-dependence,N =`halog(`)):

’Smaller a’ = computational saving, but reduced statistical

efficiency.

Sensible choice: a≤ b(c + 1)

bc + 1 <2:

N sub-quadratic in ` achieves

(33)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE fˆzλ,fρ =Op `−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E fˆzλ,fρ =Op`−bcbc+1 with λ=`−bcb+1.

Meaning (a-dependence,N =`halog(`)):

’Smaller a’ = computational saving, but reduced statistical

efficiency.

Sensible choice: a≤ b(c+ 1)

(34)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE fˆzλ,fρ =Op `−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E fˆzλ,fρ =Op`−bcbc+1 with λ=`−bcb+1.

Meaning (h-dependence, N=`ah log(`)):

smoother K kernel is rewarding = bag-size reduction; see

(35)

Computational & statistical tradeoff (W)

If a≤ b(c+ 1) bc+ 1 , thenE fˆzλ,fρ =Op`−cac+1 with λ=`−c+1a , a≥ b(c+ 1) bc+ 1 then E fˆzλ,fρ =Op `−bcbc+1 with λ=`−bcb+1. Meaning (c-dependence): c 7→ b(c+ 1)

bc+ 1 decreasing: smaller bags are enough for easier

(36)

Misspecified setting

Relevant case: fρ∈L2ρX\H.

fρ: difficulty parameter = s ∈(0,1], larger s = easier problem.

Proof idea:

∞-D exponential family fitting [Sriperumbudur et al., 2014], ridge solution.

(37)

Computational & statistical tradeoff (M)

Let N=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (a-dependence):

’Smaller a’ = computational saving, but reduced statistical

efficiency. Sensible choice: a≤ s+ 1 s+ 2 ≤ 2 3 ⇒2a≤ 4 3 <2!

(38)

Computational & statistical tradeoff (M)

Let N=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (a-dependence):

’Smaller a’ = computational saving, but reduced statistical

efficiency. Sensible choice: a≤ s+ 1 s+ 2 ≤ 2 3 ⇒2a≤ 4 3 <2!

(39)

Computational & statistical tradeoff (M)

Let N=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (a-dependence):

’Smaller a’ = computational saving, but reduced statistical

efficiency. Sensible choice: a≤ s+ 1 s+ 2 ≤ 2 3 ⇒2a≤ 4 3 <2!

(40)

Computational & statistical tradeoff (M)

LetN=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E fˆzλ,fρ =Op`−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+2s with λ=`−s+21 . Meaning (h-dependence): h 7→ 2a

(41)

Computational & statistical tradeoff (M)

LetN=`2halog(`) (a>0). If a≤ s+ 1 s+ 2, then E fˆzλ,fρ =Op `−s2+1sa with λ=`−s+1a , a≥ s+ 1 s+ 2, then E fˆzλ,fρ =Op`−s2+2s with λ=`−s+21 . Meaning (s-dependence): s 7→ 2s

s+ 2 is increasing, i.e easier task

= better rate,

s →0: arbitrary slow rate.

(42)

Optimality of the rate (M)

Our rate: r(`) =`−s2+2s – range space assumption (s).

One-stage sampled optimal rate: ro(`) =`−

2s

2s+1 [Steinwart et al., 2009],

range-space assumption + eigendecay constraint, D: compact metric,Y =R. 0.2 0.4 0.6 0.8 Rate −log l[ro(l)]=2s/(2s+1) −log[r(l)]=2s/(s+2)

(43)

Blanket assumptions: both settings

D: separable, topological domain.

k:

bounded: supuDk(u,u)≤Bk ∈(0,∞),

continuous.

K: bounded; H¨older continuous: ∃L>0,h∈(0,1] such that

kK(·, µa)−K(·, µb)kH≤Lkµa−µbkhH.

(44)

older

K

examples (other than the linear

K

when h=1)

In case of compact metricD, universal k:

KG Ke KC e− µa−µb 2 H 2θ2 e− µa−µb H 2θ2 1 +kµa−µbk2H/θ2 −1 h= 1 h= 12 h= 1 Kt Ki 1 +kµa−µbkθH−1 kµa−µbk2H+θ2 −12

(45)

Vector-valued output: similarly

K(µa, µb)∈L(Y).

Prediction on a new test distribution (t):

(fˆzλ◦µ)(t) =k(K+`λI`)−1[y1;. . .;y`], K= [K(µˆxi, µˆxj)]∈L(Y) `×`, k= [K(µˆx1, µt), . . . ,K(µˆx`, µt)]∈L(Y) 1×`. (4) (5) (6) Specifically: Y =R⇒L(Y) =R;Y =Rd ⇒L(Y) =Rd×d.

(46)

Demo

Supervised entropy learning:

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

Aerosol prediction from satellite images:

State-of-the-art baseline: 7.5−8.5(±0.1−0.6). MERR:7.81 (±1.64).

(47)

Demo

Supervised entropy learning:

0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE

(48)

Summary

Problem: distribution regression (k).

Contribution:

computational & statistical tradeoff analysis,

specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size.

Code in ITE, analysis submitted to JMLR:

https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066.

(49)

Summary

Problem: distribution regression (k).

Contribution:

computational & statistical tradeoff analysis,

specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size. Code in ITE, analysis submitted to JMLR:

https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066.

(50)

Thank you for the attention!

(51)

Caponnetto, A. and De Vito, E. (2007).

Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.

Chen, Y. and Wu, O. (2012).

Contextual Hausdorff dissimilarity for multi-instance clustering.

InInternational Conference on Fuzzy Systems and Knowledge

Discovery (FSKD), pages 870–873.

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).

Semigroup kernels on measures.

Journal of Machine Learning Research, 6:11691198.

Edgar, G. (1995).

(52)

Multi-instance kernels.

InInternational Conference on Machine Learning (ICML),

pages 179–186.

Haussler, D. (1999).

Convolution kernels on discrete structures.

Technical report, Department of Computer Science, University of California at Santa Cruz.

(http://cbse.soe.ucsc.edu/sites/default/files/

convolutions.pdf).

Hein, M. and Bousquet, O. (2005).

Hilbertian metrics and positive definite kernels on probability measures.

InInternational Conference on Artificial Intelligence and

(53)

Kandasamy, K., Krishnamurthy, A., P´oczos, B., Wasserman, L., and Robins, J. M. (2015).

Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University.

http://arxiv.org/abs/1411.4342.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009).

Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.

Nielsen, F. and Nock, R. (2012).

A closed-form expression for the Sharma-Mittal entropy of exponential families.

(54)

Fast function to function regression.

InInternational Conference on Artificial Intelligence and

Statistics (AISTATS), pages 717–725.

Oliva, J., P´oczos, B., and Schneider, J. (2013).

Distribution to distribution regression.

International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.

Oliva, J. B., Neiswanger, W., P´oczos, B., Schneider, J., and Xing, E. (2014).

Fast distribution to real regression.

International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

(55)

P´oczos, B., Xiong, L., and Schneider, J. (2011).

Nonparametric divergence estimation with applications to machine learning on distributions.

InUncertainty in Artificial Intelligence (UAI), pages 599–608.

Reddi, S. J. and P´oczos, B. (2014).

k-NN regression on functional data with incomplete observations.

InConference on Uncertainty in Artificial Intelligence (UAI),

pages 692–701.

Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A., and Hyv¨arinen, A. (2014).

Density estimation in infinite dimensional exponential families. Technical report.

(56)

Sutherland, D. J., Oliva, J. B., P´oczos, B., and Schneider, J. (2015).

Linear-time learning on distributions with approximate kernel embeddings.

Technical report, Carnegie Mellon University.

http://arxiv.org/abs/1509.07553.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).

Closed-form Jensen-R´enyi divergence for mixture of Gaussians

and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000).

(57)

Wang, Z., Lan, L., and Vucetic, S. (2012).

Mixture model for multiple instance regression and applications in remote sensing.

IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).

Identifying multi-instance outliers.

InSIAM International Conference on Data Mining (SDM),

pages 430–441.

Zhang, M.-L. and Zhou, Z.-H. (2009).

Multi-instance clustering with applications to multi-instance prediction.

(58)

IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.

References

Related documents

(Widiyanto, 2017, hal. 80), keterampilan hanya dapat diperoleh dan dikuasai dengan jalan praktik dan banyak latihan. Kemampuan berbicara ini dilatih dengan tujuan

[r]

African stock exchanges, with exception of the Johannesburg Stock Exchange (JSE) have revealed common characteristics which constitute barriers to the growth and

The results of the present study of a Nordic adult population with MetS or at risk of developing MetS showed that the intake of SFA and sodium was higher and dietary fibre and

National concerns about quality of care and safety Shortages of nursing personnel which demands a higher level pf preparation for leaders who can design and assess care.. Shortages

Nordnet is a Swedish limited company, headquartered in Stockholm. Nordnet shares have been listed on the Nasdaq Stockholm since April 2000. In 2014, the Group had operations in

If all quarter sections have been sampled or if no water sources are present within the search area then a Sundry Notice requesting relief from Rule 318A.f may be filed with

We also compare BAF with the previous schemes using the following criteria: (i) The computational overhead of signature generation/verification operations; (ii) storage