
Kernel-based learning on probability distributions

Zoltán Szabó (Gatsby Unit, UCL)

University of California, San Diego

(2)

Example: sustainability

Goal: aerosol prediction = air pollution → climate.

Prediction using labelled bags:

bag := multi-spectral satellite measurements over an area, label := local aerosol value.

(3)

Example: existing methods

Multi-instance learning:

[Haussler, 1999, Gärtner et al., 2002] (set kernel):

sensible methods in regression: few,

1 restrictive technical conditions,

(4)

One-page summary

Contributions:

1 Practical: state-of-the-art accuracy (aerosol).
2 Theoretical:

General bags: graphs, time series, texts, ...

Consistency of set kernel in regression (17-year-old open problem).
How many samples/bag?

(7)

Objects in the bags

Examples:

time-series modelling: user = set of time-series,
computer vision: image = collection of patch vectors,
NLP: corpus = bag of documents,
network analysis: group of people = bag of friendship graphs,
...

Wider context (statistics): point estimation tasks.

(11)

Regression on labelled bags

Given:

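labelled bags: $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, $\hat{P}_i$: bag from $P_i$, $N := |\hat{P}_i|$.
test bag: $\hat{P}$.

Estimator:

$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \|f\|_{H}^{2},$$

where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.

Prediction:

$$\hat{y}(\hat{P}) = g^T (G + \ell\lambda I)^{-1} y, \quad g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})], \quad G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})], \quad y = [y_i].$$

Challenges

1 Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$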
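The prediction formula on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the Gram matrix G and the test vector g have already been computed; the function name `ridge_predict` and the toy numbers are hypothetical, not taken from the talk.

```python
import numpy as np

def ridge_predict(G, g, y, lam):
    """Return g^T (G + l*lam*I)^{-1} y, where l is the number of training bags."""
    l = G.shape[0]
    # Solve the linear system rather than forming the inverse explicitly.
    alpha = np.linalg.solve(G + l * lam * np.eye(l), y)
    return g @ alpha

# Toy example: two training bags whose embeddings are orthonormal (G = I).
G = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 4.0])
g = np.array([1.0, 0.0])          # test bag coincides with the first training bag
pred = ridge_predict(G, g, y, lam=0.5)
```

With l = 2 and lam = 0.5 the regularized matrix is 2I, so the prediction is the first label shrunk by half.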

(13)

Regression on labelled bags: similarity

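Let us define an inner product on distributions [$\tilde{K}(P, Q)$]:

1 Set kernel: $A = \{a_i\}_{i=1}^{N}$, $B = \{b_j\}_{j=1}^{N}$.

$$\tilde{K}(A, B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \Big\langle \underbrace{\frac{1}{N} \sum_{i=1}^{N} \varphi(a_i)}_{\text{feature of bag } A}, \frac{1}{N} \sum_{j=1}^{N} \varphi(b_j) \Big\rangle.$$

2 Taking 'limit' [Berlinet and Thomas-Agnan, 2004, Altun and Smola, 2006, Smola et al., 2007]: $a \sim P$, $b \sim Q$

$$\tilde{K}(P, Q) = E_{a,b}\, k(a, b) = \Big\langle \underbrace{E_a \varphi(a)}_{\text{feature of distribution } P \,=:\, \mu_P}, E_b \varphi(b) \Big\rangle.$$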
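As a rough sketch of the set kernel, the average of a base kernel k over all pairs of points can be computed directly. The Gaussian base kernel, the bandwidth `theta`, and the function names below are assumptions chosen for illustration; the slide leaves k unspecified.

```python
import numpy as np

def gaussian_k(A, B, theta=1.0):
    """Pairwise base kernel k(a, b) = exp(-||a - b||^2 / (2 theta^2))."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel(A, B, theta=1.0):
    """Mean of k over all pairs: inner product of the empirical mean embeddings."""
    return gaussian_k(A, B, theta).mean()

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # bag with two points
B = np.array([[0.0, 1.0]])               # bag with one point
value = set_kernel(A, B)
```

Taking the mean also covers bags of different sizes (normalizing by the product of the two bag sizes), which is slightly more general than the equal-size case written on the slide.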

(14)

Regression on labelled bags: baseline

Quality of estimator, baseline:

$$R(f) = E_{(\mu_P, y) \sim \rho} \left[ f(\mu_P) - y \right]^2, \quad f_\rho = \text{best regressor}.$$

How many samples/bag to get the accuracy of fρ? Possible?

(17)

Our result: how many samples/bag

Known [Caponnetto and De Vito, 2007]: best/achieved rate

$$R(f_z^{\lambda}) - R(f_\rho) = O\left( \ell^{-\frac{bc}{bc+1}} \right),$$

b – size of the input space, c – smoothness of $f_\rho$.

Let $N = \tilde{O}(\ell^{a})$. N: size of the bags. ℓ: number of bags.

Our result

If $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.
In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
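A short check of why the threshold a = b(c+1)/(bc+1) stays below 2, under the assumption that b > 0 and c ≥ 1 (the slide does not state the parameter ranges, so these are assumed here):

```latex
\[
a = \frac{b(c+1)}{bc+1} < 2
\iff bc + b < 2bc + 2
\iff b(1 - c) < 2 ,
\]
```

The last inequality holds whenever c ≥ 1, because its left-hand side is then non-positive. As an illustrative instance with made-up values b = 2 and c = 2: a = 6/5, so bags of size N = Õ(ℓ^{1.2}) would suffice for the rate ℓ^{-4/5}.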

(18)

Aerosol prediction result (100 × RMSE )

We perform on par with the state-of-the-art, hand-engineered method.

Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012: 7.5–8.5 (±0.1–0.6):

hand-crafted features.

Our prediction accuracy: 7.81 (±1.64):

no expert knowledge.

Code in ITE: #2 on mloss,

(21)

Distribution regression with random Fourier features

Kernel EP [UAI-2015]:

distribution regression phrasing,

learn the message-passing operator for 'tricky' factors.

extends Infer.NET; speed ⇐ RFF.
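To illustrate where the speed-up from random Fourier features (RFF) comes from, the sketch below approximates a bag's mean embedding by averaging RFF features, so that the set kernel becomes a plain dot product of two D-dimensional vectors. This is a generic RFF construction for a Gaussian base kernel, not the Kernel EP implementation; the sizes, the bandwidth `theta`, and all function names are assumptions.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features z(x) = sqrt(2/D) * cos(W x + b), one row per point."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

def bag_embedding(X, W, b):
    """Approximate mean embedding of a bag: average of its points' features."""
    return rff_features(X, W, b).mean(axis=0)

def exact_set_kernel(A, B, theta):
    """Exact set kernel with the Gaussian base kernel, for comparison."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * theta**2)).mean()

rng = np.random.default_rng(0)
d, D, theta = 2, 2000, 1.0                       # illustrative sizes, not from the talk
W = rng.normal(scale=1.0 / theta, size=(D, d))   # frequencies ~ N(0, I / theta^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)        # random phases
A = rng.normal(size=(50, d))
B = rng.normal(loc=0.5, size=(50, d))

approx = bag_embedding(A, W, b) @ bag_embedding(B, W, b)
exact = exact_set_kernel(A, B, theta)
```

Once each bag is reduced to its D-dimensional embedding, comparing two bags costs O(D) instead of a sum over all pairs of points.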

(24)

+ Applications, with Gatsby students

Bayesian manifold learning [NIPS-2015]:

App.: climate data → weather station location.

Fast, adaptive sampling method based on RFF [NIPS-2015]:

App.: approximate Bayesian computation, hyperparameter inference.

Interpretable 2-sample testing [→NIPS-2016]: App.:

random → smart features,

(25)

Summary

Regression on bags/distributions: minimax optimality, set kernel is consistent.

Random Fourier features: exponentially tighter bounds.

Several applications (with open source code).

Acknowledgments: This work was supported by the Gatsby Charitable Foundation.

(26)

Why can we get consistency/rates? – intuition

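Convergence of the mean embedding:

$$\left\| \mu_P - \mu_{\hat{P}} \right\|_H = O\left( \frac{1}{\sqrt{N}} \right).$$

Hölder property of K ($0 < L$, $0 < h \le 1$):

$$\left\| K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}}) \right\|_H \le L \left\| \mu_P - \mu_{\hat{P}} \right\|_H^{h}.$$

$f_{\hat{z}}^{\lambda}$ depends 'nicely' on $K(\mu_{\hat{P}}, \mu_{\hat{Q}}) = \left\langle K(\cdot, \mu_{\hat{P}}), K(\cdot, \mu_{\hat{Q}}) \right\rangle_H$.

[39 pages]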
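Chaining the two ingredients above gives the perturbation that the estimator sees when a distribution is replaced by a bag of N samples. The following is a heuristic sketch only: it treats the O(1/√N) convergence as a plain bound, whereas a rigorous statement would presumably hold with high probability.

```latex
\[
\left\| K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}}) \right\|_H
\le L \left\| \mu_P - \mu_{\hat{P}} \right\|_H^{h}
= O\!\left( N^{-h/2} \right).
\]
```

So the error introduced by sampling shrinks polynomially in the bag size N, faster for larger Hölder exponent h.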

(27)

Extensions

1 Misspecified setting ($f_\rho \in L^2 \setminus H$):

Consistency: convergence to $\inf_{f \in H} \|f - f_\rho\|_{L^2}$.

(28)

Extensions

2 Vector-valued output:

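Y: separable Hilbert space ⇒ $K(\mu_P, \mu_Q) \in L(Y)$.

Prediction on a test bag $\hat{P}$:

$$\hat{y}(\hat{P}) = g^T (G + \ell\lambda I)^{-1} y, \quad g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})], \quad G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})], \quad y = [y_i].$$

Specifically: $Y = R \Rightarrow L(Y) = R$; $Y = R^d \Rightarrow L(Y) = R^{d \times d}$.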

(29)

Other valid similarities

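Recall: $\tilde{K}(P, Q) = \langle \mu_P, \mu_Q \rangle$.

$\tilde{K}_G = e^{-\frac{\|\mu_P - \mu_Q\|^2}{2\theta^2}}$

$\tilde{K}_e = e^{-\frac{\|\mu_P - \mu_Q\|}{2\theta^2}}$

$\tilde{K}_C = \left( 1 + \|\mu_P - \mu_Q\|^2 / \theta^2 \right)^{-1}$

$\tilde{K}_t = \left( 1 + \|\mu_P - \mu_Q\|^{\theta} \right)^{-1}$

$\tilde{K}_i = \left( \|\mu_P - \mu_Q\|^2 + \theta^2 \right)^{-\frac{1}{2}}$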
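All of these similarities depend on the bags only through the distance between mean embeddings, which can be obtained from three values of the linear kernel: ‖µ_P − µ_Q‖² = K̃(P,P) − 2K̃(P,Q) + K̃(Q,Q). The sketch below shows three of them; the function names (Gaussian, Cauchy, inverse multiquadric) are a reading of the subscripts G, C and i, and are an assumption rather than something stated on the slide.

```python
import numpy as np

def embedding_sq_dist(K_pp, K_pq, K_qq):
    """||mu_P - mu_Q||^2 from linear-kernel values; clipped at 0 against round-off."""
    return max(K_pp - 2.0 * K_pq + K_qq, 0.0)

def k_gaussian(sq_dist, theta):
    """K_G = exp(-||mu_P - mu_Q||^2 / (2 theta^2))."""
    return np.exp(-sq_dist / (2.0 * theta**2))

def k_cauchy(sq_dist, theta):
    """K_C = (1 + ||mu_P - mu_Q||^2 / theta^2)^(-1)."""
    return 1.0 / (1.0 + sq_dist / theta**2)

def k_inverse_multiquadric(sq_dist, theta):
    """K_i = (||mu_P - mu_Q||^2 + theta^2)^(-1/2)."""
    return 1.0 / np.sqrt(sq_dist + theta**2)

# Hypothetical linear-kernel values for two bags.
sq = embedding_sq_dist(K_pp=1.0, K_pq=0.5, K_qq=1.0)
```

In practice the three inputs would come from a set-kernel computation on the bags, so the nonlinear kernels add only a scalar transformation on top of it.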

(30)

Altun, Y. and Smola, A. (2006).
Unifying divergence minimization and statistical inference via convex duality.
In Conference on Learning Theory (COLT), pages 139–153.

Berlinet, A. and Thomas-Agnan, C. (2004).
Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Kluwer.

Caponnetto, A. and De Vito, E. (2007).
Optimal rates for regularized least-squares algorithm.
Foundations of Computational Mathematics, 7:331–368.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002).

(31)

Haussler, D. (1999).
Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University of California at Santa Cruz.
(http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf).

Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007).
A Hilbert space embedding for distributions.
