Kernel-based learning on probability distributions

(1)

Zoltán Szabó (Gatsby Unit, UCL)

University of California, San Diego

(2)

Example: sustainability

Goal: aerosol prediction = air pollution → climate.

Prediction using labelled bags:

bag := multi-spectral satellite measurements over an area, label := local aerosol value.

(3)

Example: existing methods

Multi-instance learning:

[Haussler, 1999; Gärtner et al., 2002] (set kernel):

sensible methods in regression: few,

1. restrictive technical conditions,

(4)

One-page summary

Contributions:

1. Practical: state-of-the-art accuracy (aerosol).

2. Theoretical:

General bags: graphs, time series, texts, ...

Consistency of set kernel in regression (17-year-old open problem).

How many samples/bag?

(7)

Objects in the bags

Examples:

time-series modelling: user = set of time-series,

computer vision: image = collection of patch vectors,

NLP: corpus = bag of documents,

network analysis: group of people = bag of friendship graphs, ...

Wider context (statistics): point estimation tasks.

(8)

Regression on labelled bags

Given:

labelled bags: $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, $\hat{P}_i$: bag from $P_i$, $N := |\hat{P}_i|$.

test bag: $\hat{P}$.

Estimator:

$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \|f\|_{H}^2,$$

where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.

Prediction:

$$\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y, \quad g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})], \quad G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})], \quad y = [y_i].$$

Challenges:

1. Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$
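The closed-form prediction can be sketched in a few lines of Python. This is a minimal illustration, assuming the Gram matrix G and the vector g have already been computed with some kernel K on the bag features; the function name `predict_bag_label` and the toy numbers are hypothetical, not taken from the talk.

```python
import numpy as np

def predict_bag_label(G, g, y, lam):
    """Closed-form prediction g^T (G + l*lam*I)^{-1} y for one test bag.

    G   : (l, l) Gram matrix, G[i, j] = K(mu_i, mu_j) on the training bags
    g   : (l,)   vector, g[i] = K(mu_test, mu_i)
    y   : (l,)   training labels
    lam : regularisation parameter lambda > 0
    """
    l = G.shape[0]
    # Solve the linear system instead of forming the inverse explicitly.
    alpha = np.linalg.solve(G + l * lam * np.eye(l), y)
    return g @ alpha

# Toy usage with a hypothetical Gram matrix for l = 2 training bags.
G = np.array([[1.0, 0.0], [0.0, 1.0]])
g = np.array([1.0, 0.0])
y = np.array([3.0, 5.0])
y_hat = predict_bag_label(G, g, y, lam=0.5)
print(y_hat)  # 1.5
```

In the toy case G + ℓλI = 2I, so the weights are y/2 and the prediction is 3/2. Solving the system rather than inverting the matrix is a common numerical choice, not something the slides prescribe.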

(12)

Regression on labelled bags: similarity

Let us define an inner product on distributions [$\tilde{K}(P, Q)$]:

1. Set kernel: $A = \{a_i\}_{i=1}^{N}$, $B = \{b_j\}_{j=1}^{N}$.

$$\tilde{K}(A, B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \Big\langle \underbrace{\frac{1}{N} \sum_{i=1}^{N} \varphi(a_i)}_{\text{feature of bag } A}, \frac{1}{N} \sum_{j=1}^{N} \varphi(b_j) \Big\rangle.$$

2. Taking 'limit' [Berlinet and Thomas-Agnan, 2004; Altun and Smola, 2006; Smola et al., 2007]: $a \sim P$, $b \sim Q$

$$\tilde{K}(P, Q) = E_{a,b}\, k(a, b) = \Big\langle \underbrace{E_a \varphi(a)}_{\text{feature of distribution } P \,=:\, \mu_P}, E_b \varphi(b) \Big\rangle.$$
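The set kernel is just the average of all pairwise kernel values between two bags. The sketch below is illustrative only: the Gaussian base kernel, the function names and the random toy bags are assumptions, not choices stated on the slides. With the linear kernel the feature map is the identity, so the identity "average of pairwise kernels = inner product of mean features" can be checked directly.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A and B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def linear_kernel(A, B):
    """Pairwise k(a, b) = <a, b>; here the feature map is the identity."""
    return A @ B.T

def set_kernel(A, B, k=gaussian_kernel):
    """Set kernel: mean of k(a_i, b_j) over all pairs of bag elements."""
    return k(A, B).mean()

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))            # bag A: 50 points in R^3
B = rng.normal(loc=0.5, size=(50, 3))   # bag B: 50 points in R^3

# Linear kernel: set kernel equals the inner product of the bag means.
lhs = set_kernel(A, B, k=linear_kernel)
rhs = A.mean(axis=0) @ B.mean(axis=0)

# Gaussian base kernel: a similarity value in (0, 1].
sim = set_kernel(A, B)
```

Here `lhs` and `rhs` agree up to floating-point error, which is the finite-sample version of the inner product between mean embeddings.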

(14)

Regression on labelled bags: baseline

Quality of estimator, baseline:

$R(f) = E_{(\mu_P, y) \sim \rho} [f(\mu_P) - y]^2$, $f_\rho$ = best regressor.

How many samples/bag to get the accuracy of fρ? Possible?

(15)

Our result: how many samples/bag

Known [Caponnetto and De Vito, 2007]: best/achieved rate

$$R(f_{z}^{\lambda}) - R(f_\rho) = O\left( \ell^{-\frac{bc}{bc+1}} \right),$$

b – size of the input space, c – smoothness of $f_\rho$.

Let $N = \tilde{O}(\ell^{a})$. N: size of the bags. ℓ: number of bags.

Our result

If $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.

In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
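To make the two exponents concrete, the following sketch evaluates them with exact fractions. The values b = 2, c = 2 are hypothetical and chosen only for illustration; the function names are likewise not from the talk.

```python
from fractions import Fraction

def rate_exponent(b, c):
    """Exponent bc/(bc+1) in the excess-risk rate O(l^{-bc/(bc+1)})."""
    b, c = Fraction(b), Fraction(c)
    return b * c / (b * c + 1)

def bag_size_exponent(b, c):
    """Exponent a = b(c+1)/(bc+1), so that N = l^a (up to log factors) is enough."""
    b, c = Fraction(b), Fraction(c)
    return b * (c + 1) / (b * c + 1)

b, c = 2, 2  # hypothetical values, for illustration only
print(rate_exponent(b, c))      # 4/5
print(bag_size_exponent(b, c))  # 6/5
```

For these values the excess risk decays like ℓ^{-4/5}, and bags of size about ℓ^{6/5} suffice, which is below the ℓ^2 of the simpler sufficient condition. The inequality a < 2 is equivalent to b(1 − c) < 2, which holds for any b > 0 when c ≥ 1 (the range of c is an assumption here, not stated on the slide).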

(18)

Aerosol prediction result (100 × RMSE)

We perform on par with the state-of-the-art, hand-engineered method.

Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012: 7.5–8.5 (±0.1–0.6):

hand-crafted features.

Our prediction accuracy: 7.81 (±1.64).

no expert knowledge.

Code in ITE: #2 on mloss,

(20)

Distribution regression with random Fourier features

Kernel EP [UAI-2015]:

distribution regression phrasing,

learn the message-passing operator for 'tricky' factors,

extends Infer.NET; speed ⇐ RFF.
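The speed-up from random Fourier features (RFF) can be illustrated as follows. This is a generic sketch of the standard RFF construction for a Gaussian kernel, not the Kernel EP implementation; the function names, dimensions and toy bags are assumptions. Each bag is summarised by one D-dimensional vector, so comparing two bags costs one dot product instead of a sum over all pairs of points.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features z(x) = sqrt(2/D) * cos(W x + b), one row per point."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

def rff_mean_embedding(bag, W, b):
    """Approximate mean embedding of a bag: average of its random features."""
    return rff_features(bag, W, b).mean(axis=0)

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.0
# Frequencies drawn from N(0, I / sigma^2) correspond to the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); phases are uniform on [0, 2*pi).
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

A = rng.normal(size=(40, d))            # toy bag A
B = rng.normal(loc=0.5, size=(40, d))   # toy bag B

# Approximate set kernel: inner product of the two D-dimensional embeddings.
approx = rff_mean_embedding(A, W, b) @ rff_mean_embedding(B, W, b)

# Exact set kernel with the Gaussian kernel, for comparison.
sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
exact = np.exp(-sq / (2.0 * sigma**2)).mean()
```

With D in the thousands the two values typically agree to about two decimal places; the approximation error shrinks as D grows.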

(24)

+ Applications, with Gatsby students

Bayesian manifold learning [NIPS-2015]:

App.: climate data → weather station location.

Fast, adaptive sampling method based on RFF [NIPS-2015]:

App.: approximate Bayesian computation, hyperparameter inference.

Interpretable 2-sample testing [→ NIPS-2016]: App.:

random → smart features,

(25)

Summary

Regression on bags/distributions: minimax optimality, set kernel is consistent.

random Fourier features: exponentially tighter bounds.

Several applications (with open source code).

Acknowledgments: This work was supported by the Gatsby Charitable Foundation.

(26)

Why can we get consistency/rates? – intuition

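Convergence of the mean embedding:

$$\| \mu_P - \mu_{\hat{P}} \|_{H} = O\left( \frac{1}{\sqrt{N}} \right).$$

Hölder property of K (0 < L, 0 < h ≤ 1):

$$\| K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}}) \|_{H} \le L \, \| \mu_P - \mu_{\hat{P}} \|_{H}^{h}.$$

$f_{\hat{z}}^{\lambda}$ depends 'nicely' on $K(\mu_{\hat{P}}, \mu_{\hat{Q}}) = \langle K(\cdot, \mu_{\hat{P}}), K(\cdot, \mu_{\hat{Q}}) \rangle_{H}$. [39 pages]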
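Chaining the two bounds gives a one-line worked step. It is a sketch of the intuition only: constants and the probabilistic sense of the O(·) statement are left implicit.

```latex
\begin{align*}
\| K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}}) \|_{H}
  &\le L \, \| \mu_P - \mu_{\hat{P}} \|_{H}^{h} \\
  &= L \cdot O\!\left( N^{-1/2} \right)^{h}
   = O\!\left( N^{-h/2} \right).
\end{align*}
```

So the perturbation of the features caused by sampling shrinks polynomially in the bag size N. With N of order ℓ^a it is of order ℓ^{-ah/2}, up to logarithmic factors.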

(27)

Extensions

1. Misspecified setting ($f_\rho \in L^2 \setminus H$):

Consistency: convergence to $\inf_{f \in H} \| f - f_\rho \|_{L^2}$.

(28)

Extensions

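2. Vector-valued output:

Y: separable Hilbert space ⇒ $K(\mu_P, \mu_Q) \in L(Y)$.

Prediction on a test bag $\hat{P}$:

$$\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y, \quad g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})], \quad G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})], \quad y = [y_i].$$

Specifically: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.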

(29)

Other valid similarities

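Recall: $\tilde{K}(P, Q) = \langle \mu_P, \mu_Q \rangle$.

$\tilde{K}_G = e^{-\| \mu_P - \mu_Q \|^2 / (2\theta^2)}$

$\tilde{K}_e = e^{-\| \mu_P - \mu_Q \| / (2\theta^2)}$

$\tilde{K}_C = \left( 1 + \| \mu_P - \mu_Q \|^2 / \theta^2 \right)^{-1}$

$\tilde{K}_t = \left( 1 + \| \mu_P - \mu_Q \|^{\theta} \right)^{-1}$

$\tilde{K}_i = \left( \| \mu_P - \mu_Q \|^2 + \theta^2 \right)^{-1/2}$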
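All of these similarities depend on the bags only through the distance between mean embeddings, which expands into set-kernel evaluations: ‖µ_A − µ_B‖² = K̃(A, A) − 2 K̃(A, B) + K̃(B, B). The sketch below illustrates this for two of the kernels, with empirical embeddings and a Gaussian base kernel; the base kernel, function names and toy bags are assumptions made for the example.

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Pairwise Gaussian base kernel between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def embedding_sq_dist(A, B, sigma=1.0):
    """||mu_A - mu_B||^2 = K~(A,A) - 2 K~(A,B) + K~(B,B), K~ being the set kernel."""
    val = (
        gram(A, A, sigma).mean()
        - 2.0 * gram(A, B, sigma).mean()
        + gram(B, B, sigma).mean()
    )
    return max(val, 0.0)  # guard against tiny negative round-off

def K_G(A, B, theta=1.0, sigma=1.0):
    """exp(-||mu_A - mu_B||^2 / (2 theta^2))."""
    return np.exp(-embedding_sq_dist(A, B, sigma) / (2.0 * theta**2))

def K_C(A, B, theta=1.0, sigma=1.0):
    """(1 + ||mu_A - mu_B||^2 / theta^2)^{-1}."""
    return 1.0 / (1.0 + embedding_sq_dist(A, B, sigma) / theta**2)

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 2))            # toy bag A
B = rng.normal(loc=2.0, size=(30, 2))   # toy bag B, shifted away from A

same = K_G(A, A)   # identical bags: distance 0, similarity 1
diff = K_G(A, B)   # different bags: similarity strictly below 1
```

The remaining kernels in the list follow the same pattern, replacing only the final scalar function of the distance.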

(30)

Altun, Y. and Smola, A. (2006).
Unifying divergence minimization and statistical inference via convex duality.
In Conference on Learning Theory (COLT), pages 139–153.

Berlinet, A. and Thomas-Agnan, C. (2004).
Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Kluwer.

Caponnetto, A. and De Vito, E. (2007).
Optimal rates for regularized least-squares algorithm.
Foundations of Computational Mathematics, 7:331–368.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002).

Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007).
A Hilbert space embedding for distributions.