Learning Theory for Vector-Valued Distribution
Regression
Zolt´an Szab´o (Gatsby Unit, UCL)
Joint work with
◦ Bharath K. Sriperumbudur (Department of Statistics, PSU), ◦ Barnab´as P´oczos (ML Department, CMU),
◦ Arthur Gretton (Gatsby Unit, UCL)
CMStatistics, London, UK December 12, 2015
Outline
Motivation: application + theory. Problem formulation.
Results: computational & statistical tradeoffs. Numerical examples.
The task
Samples: {(xi, yi)}ℓi=1. Find f ∈ H such that f (xi) ≈ yi.
Distribution regression:
xi-s are distributions,
available only through samples: {xi,n}Nn=1i , labelled bags.
The task
Samples: {(xi, yi)}ℓi=1. Find f ∈ H such that f (xi) ≈ yi.
Distribution regression:
xi-s are distributions,
Motivation (application): aerosol prediction
Bag := pixels of a multispectral satellite image over an area. Label of a bag := aerosol value.
Relevance: climate research, sustainability.
Engineered methods [Wang et al., 2012]: 100 × RMSE = 7.5 − 8.5. Using distribution regression?
Wider context
Context:
machine learning: multi-instance learning,
statistics: point estimation tasks (without analytical formula).
Applications:
Several algorithmic approaches
1 Parametric fit: Gaussian, MOG, exp. family
[Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012].
2 Kernelized Gaussian measures:
[Jebara et al., 2004, Zhou and Chellappa, 2006].
3 (Positive definite) kernels:
[Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
4 Divergence measures (KL, R´enyi, Tsallis, . . . ):
[P´oczos et al., 2011, Kandasamy et al., 2015].
5 Set metrics: Hausdorff metric [Edgar, 1995]; variants
[Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
Motivation: theory
MIL dates back to [Haussler, 1999, G¨artner et al., 2002].
Sensible methods in regression: require density estimation [P´oczos et al., 2013, Oliva et al., 2014, Reddi and P´oczos, 2014, Sutherland et al., 2015] + assumptions:
Input-output requirements
Input: distributions on ’structured’ D domains (kernels).
Input-output requirements
Input: distributions on ’structured’ D domains (kernels). Output:
Input-output requirements
Input: distributions on ’structured’ D domains (kernels). Output:
simplest case: Y = R, but
dependenciesmight matter: Y = Rd (or separable Hilbert).
Kernel, RKHS
k : D × D → R kernel on D, if
∃ϕ : D → H(ilbert space) feature map, k(a, b) = hϕ(a), ϕ(b)iH (∀a, b ∈ D).
Kernel, RKHS
k : D × D → R kernel on D, if
∃ϕ : D → H(ilbert space) feature map, k(a, b) = hϕ(a), ϕ(b)iH (∀a, b ∈ D).
Kernel examples: D = Rd (p > 0, θ > 0)
k(a, b) = (ha, bi + θ)p: polynomial, k(a, b) = e−ka−bk22/(2θ
2)
: Gaussian, k(a, b) = e−θka−bk1: Laplacian.
In the H = H(k) RKHS (∃!): ϕ(u) = k(·, u).
RKHS: evaluation point of view
Let H ⊂ RD be a Hilbert space.
Consider for fixed x ∈ D the δx : f ∈ H 7→ f (x) ∈ R map.
The evaluation functional is linear:
RKHS: evaluation point of view
Let H ⊂ RD be a Hilbert space.
Consider for fixed x ∈ D the δx : f ∈ H 7→ f (x) ∈ R map.
The evaluation functional is linear:
δx(αf + βg ) = αδx(f ) + βδx(g ).
Def.: H is called RKHS if δx is continuous for ∀x ∈ D.
RKHS: reproducing point of view
Let H ⊂ RD be a Hilbert space.
k : D × D → is called a reproducing kernel of H if for ∀x ∈ D, f ∈ H
1 k(·, x) ∈ H, 2 hf , k(·, x)i
RKHS: reproducing point of view
Let H ⊂ RD be a Hilbert space.
k : D × D → is called a reproducing kernel of H if for ∀x ∈ D, f ∈ H 1 k(·, x) ∈ H, 2 hf , k(·, x)i H = f (x) (reproducing property). Specifically, ∀x, y ∈ D k(x, y ) = hk(·, x), k(·, y)iH.
RKHS: positive-definite point of view
Let us given a k : D × D → R symmetric function. k is called positive definite if ∀n ≥ 1, ∀(a1, . . . , an) ∈ Rn,
(x1, . . . , xn) ∈ Dn n X i ,j=1 aiajk(xi, xj) = aTGa≥ 0, where G = [k(xi, xj)]ni ,j=1.
Kernel: example domains (D)
Euclidean space (D = Rd), graphs, texts, time series,
dynamical systems, distributions!
Problem formulation (Y = R)
Given:
labelled bags ˆz = {(ˆxi, yi)}ℓi=1,
ith bag: ˆx
i= {xi,1, . . . , xi,N} i.i .d.
∼ xi ∈ P (D), yi ∈ R.
Problem formulation (Y = R)
Given:
labelled bags ˆz = {(ˆxi, yi)}ℓi=1,
ith bag: ˆx
i= {xi,1, . . . , xi,N} i.i .d.
∼ xi ∈ P (D), yi ∈ R.
Task: find a P (D) → R mapping based on ˆz.
Construction: mean embedding (µx)
P(D)−−−−→µ=µ(k) X ⊆ H
| {z }
2 =two-stage sampling
= H(k)
Problem formulation (Y = R)
Given:
labelled bags ˆz = {(ˆxi, yi)}ℓi=1,
ith bag: ˆx
i= {xi,1, . . . , xi,N} i.i .d.
∼ xi ∈ P (D), yi ∈ R.
Task: find a P (D) → R mapping based on ˆz.
Construction: mean embedding (µx) +ridge regression
P(D)−−−−→µ=µ(k) X ⊆ H | {z } 2 =two-stage sampling = H(k)−−−−−−−→f∈H=H(K ) R | {z } 1 =Hilbert → R regression .
1
: Hilbert → R regression, well-specified case
Regression function, expected risk (assume for a moment: fρ∈ H):
fρ(µx) =
Z
Ry dρ(y |µ
x), R [f ] = E(x,y )|f (µx) − y|2.
1
: Hilbert → R regression, well-specified case
Regression function, expected risk (assume for a moment: fρ∈ H):
fρ(µx) = Z Ry dρ(y |µ x), R [f ] = E(x,y )|f (µx) − y|2. Ridge estimator: fzλ = arg min f∈H 1 ℓ ℓ X i=1 |f (µxi) − yi| 2 + λ kf k2H, (λ > 0).
1
: Hilbert → R regression, well-specified case
Regression function, expected risk (assume for a moment: fρ∈ H):
fρ(µx) = Z Ry dρ(y |µ x), R [f ] = E(x,y )|f (µx) − y|2. Ridge estimator: fzλ = arg min f∈H 1 ℓ ℓ X i=1 |f (µxi) − yi| 2 + λ kf k2H, (λ > 0). Excess risk: E(fzλ, fρ) = R[fzλ] − R[fρ].
1
: Hilbert → R regression
Known [Caponnetto and De Vito, 2007]: if ρ(µx, y ) ∈ P(b, c),
then the best/achieved rate E(fzλ, fρ) = Op
ℓ−bc+1bc
1
: Hilbert → R regression
Known [Caponnetto and De Vito, 2007]: if ρ(µx, y ) ∈ P(b, c),
then the best/achieved rate E(fzλ, fρ) = Op ℓ−bc+1bc (1 < b, c ∈ (1, 2]). ρ ∈ P(b, c): T = Z X K (·, µ a)K∗(·, µa)dρX(µa) : H → H.
Eigenvalues of T decay as λn= O(n−b). fρ∈ Im
Tc−12
. Intuition: 1/b – effective input dimension, c – smoothness of fρ.
Our question
Can we reach this Op
ℓ−bc+1bc minimax rate? N =? fλ ˆ z = arg min f∈H 1 ℓ ℓ X i=1 |f (µˆxi) − yi| 2 + λ kf k2H, fzλ= arg min f∈H 1 ℓ ℓ X i=1 |f (µxi) − yi| 2 + λ kf k2H.
2
: mean embedding, µ
xi→ µ
ˆxik : D × D → R kernel; canonical feature map: ϕ(u) = k(·, u).
Mean embedding of a distribution x, ˆxi ∈ P(D):
µx = Z Dk(·, u)dx(u) ∈ H(k), µxˆi = Z Dk(·, u)dˆx i(u) = 1 N N X n=1 k(·, xi ,n).
2
: mean embedding, µ
xi→ µ
ˆxik : D × D → R kernel; canonical feature map: ϕ(u) = k(·, u).
Mean embedding of a distribution x, ˆxi ∈ P(D):
µx = Z Dk(·, u)dx(u) ∈ H(k), µxˆi = Z Dk(·, u)dˆx i(u) = 1 N N X n=1 k(·, xi ,n).
Linear K ⇒ set kernel:
K (µ , µ ) =µ , µ = 1
N
X
2
: mean embedding, µ
xi→ µ
ˆxik : D × D → R kernel; canonical feature map: ϕ(u) = k(·, u).
Mean embedding of a distribution x, ˆxi ∈ P(D):
µx = Z Dk(·, u)dx(u) ∈ H(k), µxˆi = Z Dk(·, u)dˆx i(u) = 1 N N X n=1 k(·, xi ,n). Nonlinear K example: K (µxˆi, µˆxj) = e −kµˆxi−µˆxjk2H 2σ2 .
2
: ridge regression ⇒ analytical solution
Given: training sample: ˆz, test distribution: t. Prediction on t: (fˆzλ◦ µ)(t) = k(K + ℓλIℓ)−1[y1; . . . ; yℓ], K = [K (µˆxi, µˆxj)] ∈ R ℓ×ℓ, k= [K (µˆx1, µt), . . . , K (µˆxℓ, µt)] ∈ R 1×ℓ. (1) (2) (3) ⇒ K (µ , µ ) =K (·, µ ), K (·, µ′) matter.1
-
2
: Why can we get consistency? – intuition
Convergence of the mean embedding: kµx− µˆxkH(k)= Op 1 √ N . H¨older property of K (0 < L, 0 < h ≤ 1): kK (·, µx)−K (·, µˆx)kH(K )≤ L kµx − µˆxkhH(k). fˆzλ depends ’nicely’ on µˆx.
1
-
2
: Proof idea
By decomposing the excess risk, concentration, on P(b, c) we get E(fˆzλ, fρ) ≤ logh(ℓ) Nhλ 1 λ2ℓ2 + 1 + 1 ℓλ1+1b | {z } 2 =two-stage sampling + λc+ 1 ℓ2λ+ 1 ℓλ1b | {z } 1 = H → R regression → 0, s.t. ℓλb+1b ≥ 1,log(ℓ) λ2h ≤ N.
1
-
2
: Proof idea
By decomposing the excess risk, concentration, on P(b, c) we get E(fˆzλ, fρ) ≤ logh(ℓ) Nhλ 1 λ2ℓ2 + 1 + 1 ℓλ1+1b | {z } 2 =two-stage sampling + λc+ 1 ℓ2λ+ 1 ℓλ1b | {z } 1 = H → R regression → 0, s.t. ℓλb+1b ≥ 1,log(ℓ) λ2h ≤ N.
Let N = ℓahlog(ℓ) ⇒ 1st term + constraints simplify.
a > 0: needed, i.e. N > log(ℓ).
Bias-variance trick with constraint-checking ⇒
Computational & statistical tradeoff (W)
If a ≤ b(c + 1)bc + 1 , then Efˆzλ, fρ = Op ℓ−c+1ac with λ = ℓ−c+1a , a ≥ b(c + 1) bc + 1 then E fˆzλ, fρ = Op ℓ−bc+1bc with λ = ℓ−bc+1b .Computational & statistical tradeoff (W)
If a ≤ b(c + 1)bc + 1 , then Efˆzλ, fρ = Op ℓ−c+1ac with λ = ℓ−c+1a , a ≥ b(c + 1) bc + 1 then E fˆzλ, fρ = Op ℓ−bc+1bc with λ = ℓ−bc+1b .Meaning (a-dependence, N = ℓhalog(ℓ)):
’Smaller a’ = computational saving, but reduced statistical efficiency.
Computational & statistical tradeoff (W)
If a ≤ b(c + 1)bc + 1 , then Efˆzλ, fρ = Op ℓ−c+1ac with λ = ℓ−c+1a , a ≥ b(c + 1) bc + 1 then E fˆzλ, fρ = Op ℓ−bc+1bc with λ = ℓ−bc+1b .Meaning (a-dependence, N = ℓhalog(ℓ)):
’Smaller a’ = computational saving, but reduced statistical efficiency.
Computational & statistical tradeoff (W)
If a ≤ b(c + 1)bc + 1 , then Efˆzλ, fρ = Op ℓ−c+1ac with λ = ℓ−c+1a , a ≥ b(c + 1)bc + 1 then Efˆzλ, fρ = Op ℓ−bc+1bc with λ = ℓ−bc+1b .Meaning (h-dependence, N = ℓah log(ℓ)):
smoother K kernel is rewarding = bag-size reduction; see smoothness of fρ.
Computational & statistical tradeoff (W)
If a ≤ b(c + 1) bc + 1 , then E fˆzλ, fρ = Op ℓ−c+1ac with λ = ℓ−c+1a , a ≥ b(c + 1)bc + 1 then Efˆzλ, fρ = Op ℓ−bc+1bc with λ = ℓ−bc+1b . Meaning (c-dependence):c 7→ b(c + 1)bc + 1 decreasing: smaller bags are enough for easier problems.
Misspecified setting
Relevant case: fρ∈ L2ρX\H.
fρ: difficulty parameter = s ∈ (0, 1], larger s = easier problem.
Proof idea:
∞-D exponential family fitting [Sriperumbudur et al., 2014], ridge solution.
Computational & statistical tradeoff (M)
Let N = ℓ2ah log(ℓ) (a > 0). If a ≤ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+12sa with λ = ℓ−s+1a , a ≥ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+22s with λ = ℓ−s+21 .Computational & statistical tradeoff (M)
Let N = ℓ2ah log(ℓ) (a > 0). If a ≤ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+12sa with λ = ℓ−s+1a , a ≥ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+22s with λ = ℓ−s+21 . Meaning (a-dependence):’Smaller a’ = computational saving, but reduced statistical efficiency.
Computational & statistical tradeoff (M)
Let N = ℓ2ah log(ℓ) (a > 0). If a ≤ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+12sa with λ = ℓ−s+1a , a ≥ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+22s with λ = ℓ−s+21 . Meaning (a-dependence):’Smaller a’ = computational saving, but reduced statistical efficiency.
Computational & statistical tradeoff (M)
Let N = ℓ2ah log(ℓ) (a > 0). If a ≤ s + 1 s + 2, then E fˆzλ, fρ = Op ℓ−s+12sa with λ = ℓ−s+1a , a ≥ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+22s with λ = ℓ−s+21 . Meaning (h-dependence): h 7→ 2ah is decreasing: smoother K kernel is rewarding.
Computational & statistical tradeoff (M)
Let N = ℓ2ah log(ℓ) (a > 0). If a ≤ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+12sa with λ = ℓ−s+1a , a ≥ s + 1s + 2, then Efˆzλ, fρ = Op ℓ−s+22s with λ = ℓ−s+21 .Meaning (s-dependence): s 7→ s + 22s is increasing, i.e easier task
= better rate,
s → 0: arbitrary slow rate.
Optimality of the rate (M)
Our rate: r (ℓ) = ℓ−s+22s – range space assumption (s).
One-stage sampled optimal rate: ro(ℓ) = ℓ−
2s
2s+1 [Steinwart et al., 2009],
range-space assumption + eigendecay constraint, D: compact metric, Y = R. 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 Smoothness (s) Rate −log l[ro(l)]=2s/(2s+1) −logl[r(l)]=2s/(s+2)
Blanket assumptions: both settings
D: separable, topological domain.
k:
bounded: supu∈Dk(u, u) ≤ Bk ∈ (0, ∞),
continuous.
K : bounded; H¨older continuous: ∃L > 0, h ∈ (0, 1] such that kK (·, µa) − K (·, µb)kH≤ L kµa− µbk
h H.
H¨older K examples (other than the linear K when h=1)
In case of compact metric D, universal k:
KG Ke KC e−kµa−µbk 2 H 2θ2 e− kµa−µbkH 2θ2 1 + kµa− µbk2H/θ2 −1 h = 1 h = 12 h = 1 Kt Ki 1 + kµa− µbkθH −1 kµa− µbk2H+ θ2 −12 h = θ2 (θ ≤ 2) h = 1
Functions of kµa− µbkH ⇒ computation: similar to set kernel.
Vector-valued output: similarly
K (µa, µb) ∈ L(Y ).
Prediction on a new test distribution (t): (fˆzλ◦ µ)(t) = k(K + ℓλIℓ)−1[y1; . . . ; yℓ], K= [K (µˆxi, µxˆj)] ∈ L(Y ) ℓ×ℓ, k= [K (µˆx1, µt), . . . , K (µˆxℓ, µt)] ∈ L(Y ) 1×ℓ. (4) (5) (6)
Demo
Supervised entropy learning:
0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE
Demo
Supervised entropy learning:
0 1 2 3 −2 −1 0 1 2 RMSE: MERR=0.75, DFDR=2.02 rotation angle (β) entropy true MERR DFDR MERR DFDR 1 2 3 4 RMSE
Summary
Problem: distribution regression (k). Contribution:
computational & statistical tradeoff analysis,
specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size.
Summary
Problem: distribution regression (k). Contribution:
computational & statistical tradeoff analysis,
specifically, the set kernel is consistent: 16-year-old open question, minimax optimal rate is achievable: sub-quadratic bag size.
Code in ITE, analysis submitted to JMLR:
https://bitbucket.org/szzoli/ite/ http://arxiv.org/abs/1411.2066.
Thank you for the attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. A part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.
Caponnetto, A. and De Vito, E. (2007).
Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368. Chen, Y. and Wu, O. (2012).
Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.
Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures.
Journal of Machine Learning Research, 6:11691198. Edgar, G. (1995).
Multi-instance kernels.
In International Conference on Machine Learning (ICML), pages 179–186.
Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/ convolutions.pdf).
Hein, M. and Bousquet, O. (2005).
Hilbertian metrics and positive definite kernels on probability measures.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.
Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels.
Journal of Machine Learning Research, 5:819–844.
Kandasamy, K., Krishnamurthy, A., P´oczos, B., Wasserman, L., and Robins, J. M. (2015).
Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. Technical report, Carnegie Mellon University.
http://arxiv.org/abs/1411.4342.
Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009).
Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975. Nielsen, F. and Nock, R. (2012).
A closed-form expression for the Sharma-Mittal entropy of exponential families.
Fast function to function regression.
In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 717–725.
Oliva, J., P´oczos, B., and Schneider, J. (2013). Distribution to distribution regression.
International Conference on Machine Learning (ICML; JMLR W&CP), 28:1049–1057.
Oliva, J. B., Neiswanger, W., P´oczos, B., Schneider, J., and Xing, E. (2014).
Fast distribution to real regression.
International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.
P´oczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression.
International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515. P´oczos, B., Xiong, L., and Schneider, J. (2011).
Nonparametric divergence estimation with applications to machine learning on distributions.
In Uncertainty in Artificial Intelligence (UAI), pages 599–608. Reddi, S. J. and P´oczos, B. (2014).
k-NN regression on functional data with incomplete observations.
In Conference on Uncertainty in Artificial Intelligence (UAI), pages 692–701.
Sriperumbudur, B. K., Fukumizu, K., Kumar, R., Gretton, A.,
and Hyv¨arinen, A. (2014).
Density estimation in infinite dimensional exponential families. Technical report.
Sutherland, D. J., Oliva, J. B., P´oczos, B., and Schneider, J. (2015).
Linear-time learning on distributions with approximate kernel embeddings.
Technical report, Carnegie Mellon University. http://arxiv.org/abs/1509.07553.
Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009).
Closed-form Jensen-R´enyi divergence for mixture of Gaussians and applications to group-wise shape registration.
Medical Image Computing and Computer-Assisted Intervention, 12:648–655.
Wang, J. and Zucker, J.-D. (2000).
Solving the multiple-instance problem: A lazy learning approach.
In International Conference on Machine Learning (ICML), pages 1119–1126.
Wang, Z., Lan, L., and Vucetic, S. (2012).
Mixture model for multiple instance regression and applications in remote sensing.
IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.
Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers.
In SIAM International Conference on Data Mining (SDM), pages 430–441.
Zhang, M.-L. and Zhou, Z.-H. (2009).
Multi-instance clustering with applications to multi-instance prediction.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.