Distribution Regression: Computational & Statistical Tradeoffs
Zoltán Szabó (Gatsby Unit, UCL)
Joint work with
◦ Bharath K. Sriperumbudur (Department of Statistics, PSU),
◦ Barnabás Póczos (ML Department, CMU),
Outline
Motivation: application + theory. Problem formulation.
Results: computational & statistical tradeoffs. Numerical examples.
The task
Samples: $\{(x_i, y_i)\}_{i=1}^{\ell}$. Find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.
Distribution regression:
the $x_i$-s are distributions,
available only through samples: $\{x_{i,n}\}_{n=1}^{N_i}$, i.e. labelled bags.
Motivation (application): aerosol prediction
Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.
Relevance: climate research, sustainability.
Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5–8.5.
Wider context
Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).
Applications:
Several algorithmic approaches
1. Parametric fit: Gaussian, MOG, exp. family
[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2. Kernelized Gaussian measures:
[Jebara et al., 2004, Zhou and Chellappa, 2006].
3. (Positive definite) kernels:
[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4. Divergence measures (KL, Rényi, Tsallis, ...):
[Póczos et al., 2011, Kandasamy et al., 2015].
5. Set metrics: Hausdorff metric [Edgar, 1995]; variants
Motivation: theory
MIL dates back to [Haussler, 1999, Gärtner et al., 2002].
Sensible methods in regression: require density estimation
[Póczos et al., 2013, Oliva et al., 2014, Reddi and Póczos, 2014,
Sutherland et al., 2015] + assumptions:
Input-output requirements
Input: distributions on 'structured' domains $D$ (kernels).
Output:
simplest case: $Y = \mathbb{R}$, but
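Kernel, RKHS
$k: D \times D \to \mathbb{R}$ is a kernel on $D$ if
$\exists\, \varphi: D \to H$(ilbert space) feature map such that
$k(a,b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$).
Kernel examples: $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$)
$k(a,b) = (\langle a, b \rangle + \theta)^p$: polynomial,
$k(a,b) = e^{-\|a-b\|_2^2/(2\theta^2)}$: Gaussian,
$k(a,b) = e^{-\theta \|a-b\|_1}$: Laplacian.
In the $H = H(k)$ RKHS ($\exists!$): $\varphi(u) = k(\cdot, u)$.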
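The three kernel examples above can be written down directly. The following is a minimal NumPy sketch; the function names and default parameter values are illustrative assumptions, and $p$ is taken to be a positive integer here.

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    """k(a, b) = (<a, b> + theta)^p."""
    return (np.dot(a, b) + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta ** 2))

def laplacian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-theta * ||a - b||_1)."""
    return np.exp(-theta * np.sum(np.abs(a - b)))

# Two points in R^2 (arbitrary illustrative values)
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
values = {
    "polynomial": polynomial_kernel(a, b),
    "gaussian": gaussian_kernel(a, b),
    "laplacian": laplacian_kernel(a, b),
}
```

Each function is symmetric in its arguments, and the Gaussian and Laplacian kernels equal 1 when $a = b$.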
Kernel: example domains ($D$)
Euclidean space ($D = \mathbb{R}^d$), graphs, texts, time series,
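Problem formulation ($Y = \mathbb{R}$)
Given: labelled bags $\hat{z} = \{(\hat{x}_i, y_i)\}_{i=1}^{\ell}$, $i$th bag: $\hat{x}_i = \{x_{i,1}, \ldots, x_{i,N}\} \stackrel{\text{i.i.d.}}{\sim} x_i \in \mathcal{P}(D)$, $y_i \in \mathbb{R}$.
Task: find a $\mathcal{P}(D) \to \mathbb{R}$ mapping based on $\hat{z}$.
Construction: mean embedding ($\mu_x$) + ridge regression
$\underbrace{\mathcal{P}(D) \xrightarrow{\;\mu = \mu(k)\;} X \subseteq H = H(k)}_{2\,=\,\text{two-stage sampling}} \;\; \underbrace{\xrightarrow{\;f \in \mathcal{H} = \mathcal{H}(K)\;} \mathbb{R}}_{1\,=\,\text{Hilbert} \to \mathbb{R} \text{ regression}}$.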
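1: Hilbert → $\mathbb{R}$ regression, well-specified case
Regression function, expected risk (assume for a moment: $f_\rho \in \mathcal{H}$):
$f_\rho(\mu_x) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_x)$, $\mathcal{R}[f] = \mathbb{E}_{(x,y)} |f(\mu_x) - y|^2$.
Ridge estimator:
$f_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{x_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2$, ($\lambda > 0$).
Excess risk: $\mathcal{E}(f_z^\lambda, f_\rho) = \mathcal{R}[f_z^\lambda] - \mathcal{R}[f_\rho]$.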
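1: Hilbert → $\mathbb{R}$ regression
Known [Caponnetto and De Vito, 2007]: if $\rho(\mu_x, y) \in \mathcal{P}(b,c)$, then the best/achieved rate is
$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\left(\ell^{-\frac{bc}{bc+1}}\right)$ ($1 < b$, $c \in (1,2]$).
$\rho \in \mathcal{P}(b,c)$:
$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, d\rho_X(\mu_a): \mathcal{H} \to \mathcal{H}$.
Eigenvalues of $T$ decay as $\lambda_n = O(n^{-b})$.
$f_\rho \in \mathrm{Im}\left(T^{\frac{c-1}{2}}\right)$.
Intuition: $1/b$ – effective input dimension, $c$ – smoothness of $f_\rho$.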
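Our question
Can we reach this $O_p\left(\ell^{-\frac{bc}{bc+1}}\right)$ minimax rate? $N = ?$
$f_{\hat{z}}^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{\hat{x}_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2$,
$f_z^\lambda = \arg\min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} |f(\mu_{x_i}) - y_i|^2 + \lambda \|f\|_{\mathcal{H}}^2$.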
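2: mean embedding, $\mu_{x_i} \to \mu_{\hat{x}_i}$
$k: D \times D \to \mathbb{R}$ kernel; canonical feature map: $\varphi(u) = k(\cdot, u)$.
Mean embedding of a distribution $x, \hat{x}_i \in \mathcal{P}(D)$:
$\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k)$, $\mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n})$.
Linear $K$ ⇒ set kernel:
$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H = \frac{1}{N^2} \sum_{n,m=1}^{N} k(x_{i,n}, x_{j,m})$.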
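The set kernel is the average of all pairwise base-kernel values between two bags. A minimal NumPy sketch, assuming a Gaussian base kernel $k$ and hypothetical helper names; it also allows bags of different sizes, which generalizes the $1/N^2$ normalization to $1/(N_i N_j)$.

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Gram matrix [k(a_n, b_m)] of the Gaussian kernel between rows of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel(bag_i, bag_j, theta=1.0):
    """Linear K on empirical mean embeddings: mean of all pairwise k values."""
    return float(gaussian_gram(bag_i, bag_j, theta).mean())

# Two toy bags in R^2 (illustrative data, not from the slides)
rng = np.random.default_rng(0)
bag_a = rng.normal(0.0, 1.0, size=(50, 2))
bag_b = rng.normal(3.0, 1.0, size=(40, 2))

k_aa = set_kernel(bag_a, bag_a)
k_ab = set_kernel(bag_a, bag_b)
```

Bags drawn from well-separated distributions give a smaller set-kernel value than a bag compared with itself.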
Nonlinear $K$ example: $K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = e^{-\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_H^2 / 2}$.
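A nonlinear $K$ of this form needs only the RKHS distance between two empirical embeddings, which expands into three set-kernel terms: $\|\mu_{\hat{x}_i} - \mu_{\hat{x}_j}\|_H^2 = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_i} \rangle_H + \langle \mu_{\hat{x}_j}, \mu_{\hat{x}_j} \rangle_H - 2 \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_H$. The sketch below assumes a Gaussian base kernel $k$; all function names are hypothetical.

```python
import numpy as np

def mean_gram(A, B, theta=1.0):
    """<mu_A, mu_B>_H for a Gaussian base kernel: mean of pairwise k values."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return float(np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2)).mean())

def embedding_sq_dist(bag_i, bag_j, theta=1.0):
    """||mu_i - mu_j||_H^2 expanded into three inner products."""
    d2 = (mean_gram(bag_i, bag_i, theta) + mean_gram(bag_j, bag_j, theta)
          - 2.0 * mean_gram(bag_i, bag_j, theta))
    return max(d2, 0.0)  # clip tiny negative values caused by rounding

def gaussian_K(bag_i, bag_j, theta=1.0):
    """K(mu_i, mu_j) = exp(-||mu_i - mu_j||_H^2 / 2)."""
    return float(np.exp(-embedding_sq_dist(bag_i, bag_j, theta) / 2.0))

rng = np.random.default_rng(0)
bag_a = rng.normal(0.0, 1.0, size=(50, 2))
bag_b = rng.normal(3.0, 1.0, size=(40, 2))
K_ab = gaussian_K(bag_a, bag_b)
```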
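2: ridge regression ⇒ analytical solution
Given: training sample: $\hat{z}$, test distribution: $t$.
Prediction on $t$:
$(f_{\hat{z}}^\lambda \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell]$, (1)
$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{\ell \times \ell}$, (2)
$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathbb{R}^{1 \times \ell}$. (3)
⇒ the values $K(\mu_x, \mu_{x'}) = \langle K(\cdot, \mu_x), K(\cdot, \mu_{x'}) \rangle_{\mathcal{H}}$ matter.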
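The closed-form prediction can be sketched end to end in a few lines. This is an illustrative NumPy version, not the ITE implementation: it assumes the linear $K$ (set kernel) with a Gaussian base kernel, and the toy task (bags drawn from $N(m_i, 1)$ with label $m_i$), the function names, and the parameter values are all assumptions.

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Gram matrix of the Gaussian base kernel k between rows of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel_matrix(bags_1, bags_2, theta=1.0):
    """Matrix of set-kernel values K(mu_p, mu_q) for p in bags_1, q in bags_2."""
    return np.array([[gaussian_gram(p, q, theta).mean() for q in bags_2]
                     for p in bags_1])

def ridge_predict(train_bags, y, test_bags, lam=1e-4, theta=1.0):
    """Prediction k (K + l*lam*I)^{-1} y, one value per test bag."""
    ell = len(train_bags)
    K = set_kernel_matrix(train_bags, train_bags, theta)   # l x l
    k = set_kernel_matrix(test_bags, train_bags, theta)    # (#test) x l
    alpha = np.linalg.solve(K + ell * lam * np.eye(ell), np.asarray(y, dtype=float))
    return k @ alpha

# Toy task (assumed for illustration): bag i ~ N(m_i, 1), label y_i = m_i
rng = np.random.default_rng(0)
means = rng.uniform(-2.0, 2.0, size=40)
train_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in means]
test_means = np.array([-1.0, 0.0, 1.0])
test_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in test_means]

pred = ridge_predict(train_bags, means, test_bags)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse; the test bags enter only through kernel evaluations against the training bags, in line with the remark that only the values of $K$ matter.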
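1-2: Why can we get consistency? – intuition
Convergence of the mean embedding:
$\|\mu_x - \mu_{\hat{x}}\|_{H(k)} = O_p\left(\frac{1}{\sqrt{N}}\right)$.
Hölder property of $K$ ($0 < L$, $0 < h \le 1$):
$\|K(\cdot, \mu_x) - K(\cdot, \mu_{\hat{x}})\|_{\mathcal{H}(K)} \le L \|\mu_x - \mu_{\hat{x}}\|_{H(k)}^h$.
$f_{\hat{z}}^\lambda$ depends 'nicely' on $\mu_{\hat{x}}$.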
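The $1/\sqrt{N}$ behaviour of the embedding error can be checked empirically. The sketch below is a rough simulation under stated assumptions: $x = N(0,1)$, a Gaussian base kernel, and a large reference sample standing in for the true embedding $\mu_x$ (so the measured error has a small floor set by the reference size). All names are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Gram matrix of the Gaussian base kernel k between rows of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 1))                   # stand-in for x
ref_self = gaussian_gram(reference, reference).mean()    # <mu_ref, mu_ref>, computed once

def embedding_error(bag):
    """RKHS distance between the bag's embedding and the reference embedding."""
    d2 = (gaussian_gram(bag, bag).mean() + ref_self
          - 2.0 * gaussian_gram(bag, reference).mean())
    return float(np.sqrt(max(d2, 0.0)))

def average_error(N, reps=20):
    """Average embedding error over several independent bags of size N."""
    return float(np.mean([embedding_error(rng.normal(size=(N, 1)))
                          for _ in range(reps)]))

errors = {N: average_error(N) for N in (10, 100, 1000)}
for N, err in errors.items():
    print(N, err, err * np.sqrt(N))   # last column should stay of order one
```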
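1-2: Proof idea
By decomposing the excess risk and using concentration, on $\mathcal{P}(b,c)$ we get
$\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) \le \underbrace{\frac{\log^h(\ell)}{N^h \lambda} \left( \frac{1}{\lambda^2 \ell^2} + 1 + \frac{1}{\ell \lambda^{1 + \frac{1}{b}}} \right)}_{2\,=\,\text{two-stage sampling}} + \underbrace{\lambda^c + \frac{1}{\ell^2 \lambda} + \frac{1}{\ell \lambda^{\frac{1}{b}}}}_{1\,=\,H \to \mathbb{R} \text{ regression}} \to 0$,
s.t. $\ell \lambda^{\frac{b+1}{b}} \ge 1$, $\frac{\log(\ell)}{\lambda^{2/h}} \le N$.
Let $N = \ell^{a/h} \log(\ell)$ ⇒ 1st term + constraints simplify.
$a > 0$: needed, i.e. $N > \log(\ell)$.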
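The simplification from the choice $N = \ell^{a/h} \log(\ell)$ can be spelled out as follows (a worked step, assuming $\ell > 1$ so that $\log(\ell) > 0$):

```latex
\frac{\log^h(\ell)}{N^h}
  = \frac{\log^h(\ell)}{\ell^{a}\,\log^h(\ell)}
  = \ell^{-a},
\qquad
\frac{\log(\ell)}{\lambda^{2/h}} \le \ell^{a/h}\log(\ell)
  \;\Longleftrightarrow\; \lambda^{-2} \le \ell^{a}
  \;\Longleftrightarrow\; \ell^{a}\lambda^{2} \ge 1 .
```

The two-stage sampling term therefore becomes $\frac{\ell^{-a}}{\lambda}$ times the bracket. If one additionally assumes $\lambda \le 1$, the constraint $\ell \lambda^{\frac{b+1}{b}} \ge 1$ bounds each summand of the bracket by 1, so that term is at most $3\,\ell^{-a}/\lambda$.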
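Computational & statistical tradeoff (W)
If
$a \le \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\left(\ell^{-\frac{ac}{c+1}}\right)$ with $\lambda = \ell^{-\frac{a}{c+1}}$,
$a \ge \frac{b(c+1)}{bc+1}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\left(\ell^{-\frac{bc}{bc+1}}\right)$ with $\lambda = \ell^{-\frac{b}{bc+1}}$.
Meaning ($a$-dependence, $N = \ell^{a/h} \log(\ell)$):
'Smaller $a$' = computational saving, but reduced statistical efficiency.
Sensible choice: $a \le \frac{b(c+1)}{bc+1} < 2$: $N$ sub-quadratic in $\ell$ achieves the minimax rate.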
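The two regimes of the well-specified tradeoff can be tabulated with a small helper. This is an illustrative sketch (the function name and the example values of $b$, $c$ are assumptions); it returns the exponents $r$ and $q$ in $O_p(\ell^{-r})$ and $\lambda = \ell^{-q}$.

```python
def well_specified_exponents(a, b, c):
    """Return (rate_exponent, lambda_exponent) for bag size N = l^(a/h) * log(l)."""
    threshold = b * (c + 1) / (b * c + 1)
    if a <= threshold:
        # Bag size is the bottleneck: rate l^(-ac/(c+1)), lambda = l^(-a/(c+1))
        return a * c / (c + 1), a / (c + 1)
    # Enough samples per bag: one-stage minimax rate l^(-bc/(bc+1))
    return b * c / (b * c + 1), b / (b * c + 1)

# Example with b = 2, c = 2: the threshold is b(c+1)/(bc+1) = 6/5
b, c = 2.0, 2.0
small_bags = well_specified_exponents(0.6, b, c)   # below the threshold
large_bags = well_specified_exponents(1.5, b, c)   # above the threshold
```

The two branches agree at the threshold, so increasing $a$ beyond $\frac{b(c+1)}{bc+1}$ costs computation without improving the rate.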
Meaning ($h$-dependence, $N = \ell^{a/h} \log(\ell)$): a smoother kernel $K$ is rewarding = bag-size reduction; see
Meaning ($c$-dependence): $c \mapsto \frac{b(c+1)}{bc+1}$ is decreasing: smaller bags are enough for easier problems.
Misspecified setting
Relevant case: $f_\rho \in L^2_{\rho_X} \setminus \mathcal{H}$.
$f_\rho$: difficulty parameter = $s \in (0,1]$, larger $s$ = easier problem.
Proof idea:
∞-D exponential family fitting [Sriperumbudur et al., 2014], ridge solution.
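Computational & statistical tradeoff (M)
Let $N = \ell^{2a/h} \log(\ell)$ ($a > 0$). If
$a \le \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\left(\ell^{-\frac{2sa}{s+1}}\right)$ with $\lambda = \ell^{-\frac{a}{s+1}}$,
$a \ge \frac{s+1}{s+2}$, then $\mathcal{E}(f_{\hat{z}}^\lambda, f_\rho) = O_p\left(\ell^{-\frac{2s}{s+2}}\right)$ with $\lambda = \ell^{-\frac{1}{s+2}}$.
Meaning ($a$-dependence): 'Smaller $a$' = computational saving, but reduced statistical efficiency.
Sensible choice: $a \le \frac{s+1}{s+2} \le \frac{2}{3} \Rightarrow 2a \le \frac{4}{3} < 2$!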
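As a worked instance of the misspecified tradeoff, take the easiest case $s = 1$ and the largest sensible exponent $a = \frac{s+1}{s+2}$ (a plain substitution into the statement above, not an additional result):

```latex
s = 1:\quad \frac{s+1}{s+2} = \frac{2}{3},
\qquad
a = \frac{2}{3} \;\Rightarrow\;
N = \ell^{\frac{4}{3h}}\log(\ell),\quad
\lambda = \ell^{-\frac{1}{3}},\quad
\mathcal{E}(f_{\hat z}^{\lambda}, f_\rho) = O_p\!\left(\ell^{-\frac{2}{3}}\right).
```

At $a = \frac{s+1}{s+2}$ the two regimes coincide for every $s$, since $\frac{2sa}{s+1} = \frac{2s}{s+2}$ and $\frac{a}{s+1} = \frac{1}{s+2}$.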
Meaning ($h$-dependence): the bag-size exponent $h \mapsto \frac{2a}{h}$ is decreasing, i.e. a smoother $K$ (larger $h$) allows smaller bags.
Meaning ($s$-dependence): $s \mapsto \frac{2s}{s+2}$ is increasing, i.e. easier task = better rate,
$s \to 0$: arbitrarily slow rate.
Optimality of the rate (M)
Our rate: $r(\ell) = \ell^{-\frac{2s}{s+2}}$ – range space assumption ($s$).
One-stage sampled optimal rate: $r_o(\ell) = \ell^{-\frac{2s}{2s+1}}$ [Steinwart et al., 2009],
range-space assumption + eigendecay constraint, $D$: compact metric, $Y = \mathbb{R}$.
[Figure: rate exponents $-\log_\ell[r_o(\ell)] = \frac{2s}{2s+1}$ and $-\log_\ell[r(\ell)] = \frac{2s}{s+2}$; vertical axis: Rate.]
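The two exponents are easy to compare numerically. A minimal sketch (function names are illustrative) that tabulates them over the difficulty parameter $s \in (0, 1]$:

```python
def two_stage_exponent(s):
    """Exponent of the two-stage sampled rate r(l) = l^(-2s/(s+2))."""
    return 2.0 * s / (s + 2.0)

def one_stage_exponent(s):
    """Exponent of the one-stage sampled optimal rate r_o(l) = l^(-2s/(2s+1))."""
    return 2.0 * s / (2.0 * s + 1.0)

table = [(s, two_stage_exponent(s), one_stage_exponent(s))
         for s in (0.25, 0.5, 0.75, 1.0)]
for s, r_two, r_one in table:
    print(f"s={s:.2f}  two-stage={r_two:.3f}  one-stage={r_one:.3f}")
```

The exponents coincide at $s = 1$ (both equal $2/3$); for $s < 1$ the one-stage exponent is larger because $2s + 1 < s + 2$.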
Blanket assumptions: both settings
$D$: separable, topological domain.
$k$:
bounded: $\sup_{u \in D} k(u,u) \le B_k \in (0, \infty)$,
continuous.
$K$: bounded; Hölder continuous: $\exists L > 0$, $h \in (0,1]$ such that
$\|K(\cdot, \mu_a) - K(\cdot, \mu_b)\|_{\mathcal{H}} \le L \|\mu_a - \mu_b\|_H^h$.
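Hölder $K$ examples (other than the linear $K$, where $h = 1$)
In case of compact metric $D$, universal $k$:
$K_G(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H^2 / (2\theta^2)}$, $h = 1$,
$K_e(\mu_a, \mu_b) = e^{-\|\mu_a - \mu_b\|_H / (2\theta^2)}$, $h = \frac{1}{2}$,
$K_C(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^2 / \theta^2\right)^{-1}$, $h = 1$,
$K_t(\mu_a, \mu_b) = \left(1 + \|\mu_a - \mu_b\|_H^\theta\right)^{-1}$,
$K_i(\mu_a, \mu_b) = \left(\|\mu_a - \mu_b\|_H^2 + \theta^2\right)^{-\frac{1}{2}}$.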
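All five example kernels depend on the embeddings only through the distance $d = \|\mu_a - \mu_b\|_H$, so they can be sketched as scalar functions of $d$. The code below is an illustration with assumed function names and parameter values; computing $d$ from two bags is a separate step.

```python
import numpy as np

def K_G(d, theta=1.0):
    """Gaussian: exp(-d^2 / (2 theta^2))."""
    return np.exp(-d**2 / (2.0 * theta**2))

def K_e(d, theta=1.0):
    """Exponential: exp(-d / (2 theta^2))."""
    return np.exp(-d / (2.0 * theta**2))

def K_C(d, theta=1.0):
    """Cauchy: (1 + d^2 / theta^2)^(-1)."""
    return 1.0 / (1.0 + d**2 / theta**2)

def K_t(d, theta=1.0):
    """Generalized t-student: (1 + d^theta)^(-1)."""
    return 1.0 / (1.0 + d**theta)

def K_i(d, theta=1.0):
    """Inverse multiquadric: (d^2 + theta^2)^(-1/2)."""
    return 1.0 / np.sqrt(d**2 + theta**2)

# Evaluate every kernel on a grid of distances (illustrative values)
distances = np.linspace(0.0, 2.0, 5)
curves = {K.__name__: K(distances, theta=2.0) for K in (K_G, K_e, K_C, K_t, K_i)}
```

Each kernel is largest at $d = 0$ and decreases as the embeddings move apart.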
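Vector-valued output: similarly
$K(\mu_a, \mu_b) \in \mathcal{L}(Y)$.
Prediction on a new test distribution ($t$):
$(f_{\hat{z}}^\lambda \circ \mu)(t) = \mathbf{k} (\mathbf{K} + \ell \lambda I_\ell)^{-1} [y_1; \ldots; y_\ell]$, (4)
$\mathbf{K} = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathcal{L}(Y)^{\ell \times \ell}$, (5)
$\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t), \ldots, K(\mu_{\hat{x}_\ell}, \mu_t)] \in \mathcal{L}(Y)^{1 \times \ell}$. (6)
Specifically: $Y = \mathbb{R} \Rightarrow \mathcal{L}(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow \mathcal{L}(Y) = \mathbb{R}^{d \times d}$.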
Demo
Supervised entropy learning:
[Figure: entropy vs. rotation angle (β) for the true value, MERR, and DFDR; RMSE: MERR = 0.75, DFDR = 2.02. Second panel: RMSE of MERR and DFDR.]
Aerosol prediction from satellite images:
State-of-the-art baseline: 7.5–8.5 (±0.1–0.6). MERR: 7.81 (±1.64).
Summary
Problem: distribution regression (k).
Contribution:
computational & statistical tradeoff analysis,
specifically, the set kernel is consistent: 16-year-old open question,
minimax optimal rate is achievable: sub-quadratic bag size.
Code in ITE, analysis submitted to JMLR:
https://bitbucket.org/szzoli/ite/
http://arxiv.org/abs/1411.2066
Thank you for the attention!
Caponnetto, A. and De Vito, E. (2007).
Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.
Chen, Y. and Wu, O. (2012).
Contextual Hausdorff dissimilarity for multi-instance clustering.
In International Conference on Fuzzy Systems and Knowledge
Discovery (FSKD), pages 870–873.
Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005).
Semigroup kernels on measures.
Journal of Machine Learning Research, 6:1169–1198.
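Edgar, G. (1995).
Measure, Topology and Fractal Geometry.
Springer-Verlag.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002).
Multi-instance kernels.
In International Conference on Machine Learning (ICML),
pages 179–186.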
Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/
convolutions.pdf).
Hein, M. and Bousquet, O. (2005).
Hilbertian metrics and positive definite kernels on probability measures.
In International Conference on Artificial Intelligence and
Kandasamy, K., Krishnamurthy, A., Póczos, B., Wasserman, L., and Robins, J. M. (2015).
Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University.
http://arxiv.org/abs/1411.4342.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009).
Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.
Nielsen, F. and Nock, R. (2012).
A closed-form expression for the Sharma-Mittal entropy of exponential families.
Fast function to function regression.
In International Conference on Artificial Intelligence and
Statistics (AISTATS), pages 717–725.
Oliva, J., Póczos, B., and Schneider, J. (2013).
Distribution to distribution regression.
International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.
Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014).
Fast distribution to real regression.
International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.
Póczos, B., Xiong, L., and Schneider, J. (2011).
Nonparametric divergence estimation with applications to machine learning on distributions.
In Uncertainty in Artificial Intelligence (UAI), pages 599–608.
Reddi, S. J. and Póczos, B. (2014).
k-NN regression on functional data with incomplete observations.
In Conference on Uncertainty in Artificial Intelligence (UAI),
pages 692–701.
Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2014).
Density estimation in infinite dimensional exponential families. Technical report.
Sutherland, D. J., Oliva, J. B., Póczos, B., and Schneider, J. (2015).
Linear-time learning on distributions with approximate kernel embeddings.
Technical report, Carnegie Mellon University.
http://arxiv.org/abs/1509.07553.
Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).
Closed-form Jensen-Rényi divergence for mixture of Gaussians
and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.
Wang, J. and Zucker, J.-D. (2000).
Wang, Z., Lan, L., and Vucetic, S. (2012).
Mixture model for multiple instance regression and applications in remote sensing.
IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.
Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010).
Identifying multi-instance outliers.
In SIAM International Conference on Data Mining (SDM),
pages 430–441.
Zhang, M.-L. and Zhou, Z.-H. (2009).
Multi-instance clustering with applications to multi-instance prediction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.