Learning on Distributions

(1)

Zoltán Szabó (Gatsby Unit)

Joint work with Arthur Gretton (Gatsby Unit), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (University of Cambridge)

Kernel methods for big data, Lille April 2, 2014

(2)

Outline

ITE (Information Theoretical Estimators) Toolbox. Distribution Regression:

Motivation, examples. Algorithm, consistency result. Numerical illustration.

(3)

Distribution based tasks: building blocks

Entropies: uncertainty, $H(x) = -\int_{\mathbb{R}^d} f(u) \log f(u) \, \mathrm{d}u$.

(4)

Distribution based tasks: building blocks

Mutual information, association indices: dependence,

$I(x^1, \ldots, x^M) = \int f(u^1, \ldots, u^M) \log\left[\frac{f(u^1, \ldots, u^M)}{\prod_{m=1}^{M} f_m(u^m)}\right] \mathrm{d}u$.

Divergences, kernels: 'distance'/inner product of probability distributions,

$D(f_1, f_2) = \int_{\mathbb{R}^d} f_1(u) \log\left[\frac{f_1(u)}{f_2(u)}\right] \mathrm{d}u$.

(5)

Estimation

1. Plug-in of estimated densities:

(6)

Estimation

Example: histogram based methods. Poor scaling!

Reason: goal ≠ density estimation, but functionals of distributions.

(8)

Existing packages

Focus on

discrete variables, or quite specialized

applications and

information theoretical estimation methods.

(9)

ITE (information theoretical estimators) toolbox

Goal:

state-of-the-art, nonparametric estimators, modularity:

high-level optimization,

(10)

Covered quantities

entropy: Shannon entropy, Rényi entropy, Tsallis entropy (Havrda and Charvát entropy), complex entropy, Φ-entropy (f-entropy), Sharma-Mittal entropy,

mutual information: generalized variance, kernel canonical correlation analysis, kernel generalized variance, Hilbert-Schmidt independence criterion, Shannon mutual information (total correlation, multi-information), L2 mutual information, Rényi mutual information, Tsallis mutual information, copula-based kernel dependency, multivariate version of Hoeffding’s Φ, Schweizer-Wolff’s σ and κ, complex mutual information, Cauchy-Schwarz quadratic mutual information (QMI), Euclidean distance based QMI, distance covariance, distance correlation, approximate correntropy independence measure, χ² mutual information (Hilbert-Schmidt norm of the normalized cross-covariance operator, squared-loss mutual information, mean square contingency),

divergence: Kullback-Leibler divergence (relative entropy, I directed divergence), L2 divergence, Rényi divergence, Tsallis divergence, Hellinger distance, Bhattacharyya distance, maximum mean discrepancy (kernel distance), J-distance (symmetrised Kullback-Leibler divergence, J divergence), Cauchy-Schwarz divergence, Euclidean distance based divergence, energy distance (specially the Cramér-von Mises distance), Jensen-Shannon divergence, Jensen-Rényi divergence, K divergence, L divergence, f-divergence (Csiszár-Morimoto divergence, Ali-Silvey distance), non-symmetric Bregman distance (Bregman divergence), Jensen-Tsallis divergence, symmetric Bregman distance, Pearson χ² divergence (χ² distance), Sharma-Mittal divergence,

association measures: multivariate extensions of Spearman’s ρ (Spearman’s rank correlation coefficient, grade correlation coefficient), correntropy, centered correntropy, correntropy coefficient, correntropy induced metric, centered correntropy induced metric, multivariate extension of Blomqvist’s β (medial correlation coefficient), multivariate conditional version of Spearman’s ρ, lower/upper tail dependence via conditional Spearman’s ρ,

cross quantities: cross-entropy,

kernels on distributions: expected kernel (summation kernel, mean map kernel), Bhattacharyya kernel, probability product kernel, Jensen-Shannon kernel, exponentiated Jensen-Shannon kernel, Jensen-Tsallis kernel, exponentiated Jensen-Rényi kernel(s), exponentiated Jensen-Tsallis kernel(s),

+ some auxiliary quantities: Bhattacharyya coefficient (Hellinger affinity), α-divergence.

(11)

ITE: summary

Matlab/Octave (first release). Multi-platform.

License: GPLv3 (or later).

Appeared in JMLR, 2014.

(12)

ITE: built-in tests/applications

Consistency tests.

Prototype: independent subspace analysis, its extensions.

Image registration → outlier robustness.

Distribution regression (next part).

(13)

Regression

Given: samples $\{(x_i, y_i)\}_{i=1}^{l}$. Find $f \in \mathcal{H}$ such that $f(x_i) \approx y_i$.

Typically: $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}^q$.

(14)

Distribution regression: two-stage sampling difficulty

In practice:

the $x_i$-s are only observable via samples: $x_i \approx \{x_{i,n}\}_{n=1}^{N}$ ⇒

an $x_i$ is represented as a bag:

image = set of patches, document = bag of words, video = collection of images,

different configurations of a molecule = bag of shapes.

(15)

Set kernels: consistency?

Given (2 bags):

$B_i := \{x_{i,n}\}_{n=1}^{N_i} \sim x_i$, (1)

$B_j := \{x_{j,m}\}_{m=1}^{N_j} \sim x_j$. (2)

Similarity of the bags (set/multi-instance/ensemble kernel):

$K(B_i, B_j) = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})$. (3)
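The set kernel (3) is just the average of the base kernel over all instance pairs of the two bags. A minimal sketch in Python/NumPy, assuming a Gaussian base kernel; the helper names `gaussian_gram` and `set_kernel` and the toy bags are illustrative choices, not part of the ITE toolbox:

```python
import numpy as np

def gaussian_gram(X, Y, theta=1.0):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel(Bi, Bj, theta=1.0):
    """Equation (3): mean of k over all N_i * N_j instance pairs."""
    return gaussian_gram(Bi, Bj, theta).mean()

# Two toy bags of different sizes (assumed data, for illustration only).
rng = np.random.default_rng(0)
Bi = rng.normal(loc=0.0, size=(50, 2))
Bj = rng.normal(loc=1.0, size=(80, 2))
print(set_kernel(Bi, Bj))
```

The bags may have different sizes; only the instance dimension has to agree.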

Many successful applications – no theory.

Our results ⇒

(16)

Example: supervised entropy learning

Entropy of $x \sim f$: $-\int f(u) \log[f(u)] \, \mathrm{d}u$.

Training: samples from distributions, entropy values. Task: estimate the entropy of a new sample set.

(17)

Example: age prediction from images

Training: (image, age) pairs; image = bag of features. Goal: estimate the age of the person shown in a new image.

(18)

Example: Sudoku difficulty estimation

Sudoku: special constraint satisfaction problem.

Spiking neural networks (SNN):

can be used to solve such problems,

have a stationary distribution under mild conditions.

Sudoku ↔ stationary distribution of the SNN.

(19)

Example: aerosol prediction using satellite images

Aerosol = floating particles in the air; climate research.

Multispectral satellite images: 1 pixel = 200 × 200 m² ∈ bag.

Bag label: ground-based (expensive) sensor.

(20)

Towards problem formulation: kernel, RKHS

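$k : D \times D \to \mathbb{R}$ is a kernel on $D$ if

$\exists \varphi : D \to H$ (a Hilbert space) feature map such that

$k(a,b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$).

Kernel examples: $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$)

$k(a,b) = (\langle a, b \rangle + \theta)^p$: polynomial,

$k(a,b) = e^{-\|a-b\|_2^2/(2\theta^2)}$: Gaussian,

$k(a,b) = e^{-\theta \|a-b\|_1}$: Laplacian.

In the $H = H(k)$ RKHS ($\exists!$): $\varphi(u) = k(\cdot, u)$.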
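The three example kernels translate directly into code. The sketch below is an illustration with assumed function names (`k_poly`, `k_gauss`, `k_laplace`) and default parameter values; it also checks numerically that a Gaussian Gram matrix on a few random points has no significantly negative eigenvalue, as expected for a kernel.

```python
import numpy as np

def k_poly(a, b, theta=1.0, p=2):
    """Polynomial kernel (<a, b> + theta)^p."""
    return (np.dot(a, b) + theta) ** p

def k_gauss(a, b, theta=1.0):
    """Gaussian kernel exp(-||a - b||_2^2 / (2 theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta**2))

def k_laplace(a, b, theta=1.0):
    """Laplacian kernel exp(-theta * ||a - b||_1)."""
    return np.exp(-theta * np.sum(np.abs(a - b)))

# Gram matrix of the Gaussian kernel on a few random points (toy data).
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
G = np.array([[k_gauss(u, v) for v in pts] for u in pts])
min_eig = np.linalg.eigvalsh(G).min()
print(min_eig)  # non-negative up to rounding error
```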

(21)

Some example domains ($D$), where kernels exist

Euclidean spaces: $D = \mathbb{R}^d$.

Strings, time series, graphs, dynamical systems.

(22)

Distribution kernel: example (used in our work)

Given: $(D, k)$; we saw that $u \mapsto \varphi(u) = k(\cdot, u) \in H(k)$.

Let $x$ be a distribution on $D$ ($x \in M_1^+(D)$); the previous construction can be extended:

$\mu_x = \int_D k(\cdot, u) \, \mathrm{d}x(u) \in H(k)$. (4)

If $k$ is bounded: $\mu_x$ is well-defined for any distribution $x$.

(23)

Mean embedding based distribution kernel

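Simple estimation of $\mu_x = \int_D k(\cdot, u) \, \mathrm{d}x(u)$:

Empirical distribution: having samples $\{x_n\}_{n=1}^{N}$,

$\hat{x} = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}$. (5)

Mean embedding and inner product, empirically (set kernels!):

$\mu_{\hat{x}} = \int_D k(\cdot, u) \, \mathrm{d}\hat{x}(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_n)$, (6)

$K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j}) = \langle \mu_{\hat{x}_i}, \mu_{\hat{x}_j} \rangle_{H(k)} = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})$.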
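The identity between the inner product of empirical mean embeddings and the set kernel can be checked numerically whenever the feature map is explicit. A small sketch, assuming the linear kernel $k(a,b) = \langle a, b \rangle$ (so $\varphi(u) = u$ and the mean embedding is the sample mean); this kernel is chosen only because its feature map is finite dimensional:

```python
import numpy as np

rng = np.random.default_rng(0)
Xi = rng.normal(size=(30, 3))   # bag i: N_i = 30 instances
Xj = rng.normal(size=(40, 3))   # bag j: N_j = 40 instances

# Linear kernel: phi(u) = u, hence the empirical mean embedding is the sample mean.
mu_i = Xi.mean(axis=0)
mu_j = Xj.mean(axis=0)

inner_product = mu_i @ mu_j          # <mu_i, mu_j> in feature space
pair_average = (Xi @ Xj.T).mean()    # (1 / (N_i N_j)) * sum_n sum_m k(x_in, x_jm)
print(inner_product, pair_average)
```

For kernels with an infinite-dimensional feature map (e.g. Gaussian), only the right-hand side is computable, which is exactly what makes the double sum useful.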

(24)

Mini summary

Until now

If we are given a domain ($D$) with kernel $k$, then

one can easily define/estimate the similarity of distributions on $D$.

Prototype example: $D = \mathbb{R}^d$, $k$ = Gaussian, $K$ = linear kernel.

The real conditions:

$D$: locally compact, Polish. $k$: $c_0$-universal.

$K$: Hölder continuous.

(25)

Distribution regression problem: intuitive definition

$z = \{(x_i, y_i)\}_{i=1}^{l}$: $x_i \in M_1^+(D)$, $y_i \in \mathbb{R}$.

$\hat{z} = \{(\{x_{i,n}\}_{n=1}^{N}, y_i)\}_{i=1}^{l}$: $x_{i,1}, \ldots, x_{i,N}$ i.i.d. $\sim x_i$.

Goal: learn the relation between $x$ and $y$ based on $\hat{z}$.

Idea: embed the distributions ($\mu$) + apply ridge regression.

(26)

Objective function

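$f_{\mathcal{H}} \in \mathcal{H} = \mathcal{H}(K)$: ideal/optimal in expected risk sense ($\mathcal{E}$):

$\mathcal{E}[f_{\mathcal{H}}] = \inf_{f \in \mathcal{H}} \mathcal{E}[f] = \inf_{f \in \mathcal{H}} \int_{X \times \mathbb{R}} [f(\mu_a) - y]^2 \, \mathrm{d}\rho(\mu_a, y)$. (7)

One-stage difficulty ($\int \to z$):

$f_z^{\lambda} = \arg\min_{f \in \mathcal{H}} \left( \frac{1}{l} \sum_{i=1}^{l} [f(\mu_{x_i}) - y_i]^2 + \lambda \|f\|_{\mathcal{H}}^2 \right)$. (8)

Two-stage difficulty ($z \to \hat{z}$):

$f_{\hat{z}}^{\lambda} = \arg\min_{f \in \mathcal{H}} \left( \frac{1}{l} \sum_{i=1}^{l} [f(\mu_{\hat{x}_i}) - y_i]^2 + \lambda \|f\|_{\mathcal{H}}^2 \right)$. (9)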

(27)

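Algorithmically: ridge regression ⇒ simple solution

Given: training sample $\hat{z}$, test distribution $t$.

Prediction:

$(f_{\hat{z}}^{\lambda} \circ \mu)(t) = [y_1, \ldots, y_l] (\mathbf{K} + l \lambda I_l)^{-1} \begin{bmatrix} K(\mu_{\hat{x}_1}, \mu_t) \\ \vdots \\ K(\mu_{\hat{x}_l}, \mu_t) \end{bmatrix}$, (10)

$\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in \mathbb{R}^{l \times l}$. (11)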
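The prediction formula (10)-(11) amounts to one linear solve. The following is a minimal sketch, assuming a Gaussian base kernel $k$, the linear $K$ on empirical mean embeddings, and a test distribution that is also observed through a bag; the function names and the toy data are hypothetical and this is not the ITE implementation.

```python
import numpy as np

def gaussian_gram(X, Y, theta):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def embedding_gram(bags_a, bags_b, theta):
    """Linear K on empirical mean embeddings: average of k over instance pairs."""
    return np.array([[gaussian_gram(A, B, theta).mean() for B in bags_b]
                     for A in bags_a])

def ridge_predict(train_bags, y, test_bags, lam, theta):
    """Equation (10), evaluated for several test bags at once."""
    l = len(train_bags)
    K = embedding_gram(train_bags, train_bags, theta)       # (11), l x l
    k_test = embedding_gram(train_bags, test_bags, theta)   # l x (number of test bags)
    weights = np.linalg.solve(K + l * lam * np.eye(l), y)   # (K + l*lam*I)^{-1} y
    return k_test.T @ weights

# Toy problem (assumed): bag i holds 100 draws from N(m_i, 1), label y_i = m_i.
rng = np.random.default_rng(0)
means = np.linspace(-2.0, 2.0, 20)
train_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in means]
test_bags = [rng.normal(m, 1.0, size=(100, 1)) for m in (-1.0, 0.5)]
pred = ridge_predict(train_bags, means, test_bags, lam=1e-3, theta=1.0)
print(pred)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice; since $\mathbf{K}$ is symmetric, the result agrees with the row-vector form written in (10).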

(28)

Consistency result

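We studied

the excess error: $\mathcal{E}[f_{\hat{z}}^{\lambda}] - \mathcal{E}[f_{\mathcal{H}}]$, i.e.,

the goodness compared to the best function from $\mathcal{H}$.

Result: with high probability

$\mathcal{E}[f_{\hat{z}}^{\lambda}] - \mathcal{E}[f_{\mathcal{H}}] \to 0$, (12)

if we appropriately choose the $(l, N, \lambda)$ triplet.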

(29)

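Consistency result: $\mathcal{P}(b,c)$ class

Let the $T : \mathcal{H} \to \mathcal{H}$ covariance operator be

$T = \int_X K(\cdot, \mu_a) K^*(\cdot, \mu_a) \, \mathrm{d}\rho_X(\mu_a) = \int_X K(\cdot, \mu_a) \delta_{\mu_a} \, \mathrm{d}\rho_X(\mu_a)$

with eigenvalues $t_n$ ($n = 1, 2, \ldots$).

Let $\mathcal{P}(b,c)$ be the set of distributions $\rho$ on $X \times \mathbb{R}$ such that

$\alpha \le n^b t_n \le \beta$ ($\forall n \ge 1$; $\alpha > 0$, $\beta > 0$),

$\exists g \in \mathcal{H}$ such that $f_{\mathcal{H}} = T^{\frac{c-1}{2}} g$ with $\|g\|_{\mathcal{H}}^2 \le R$ ($R > 0$),

where $b \in (1, \infty)$, $c \in [1, 2]$.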

(30)

Consistency result: convergence rates

High-level idea:

The excess error can be upper bounded on P(b,c) as:

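$g(l, N, \lambda) = \mathcal{E}[f_{\hat{z}}^{\lambda}] - \mathcal{E}[f_{\mathcal{H}}] \le \frac{\log(l)}{N \lambda^3} + \lambda^c + \frac{1}{l^2 \lambda} + \frac{1}{l \lambda^{\frac{1}{b}}}$.

We choose $\lambda = \lambda_{l,N} \to 0$:

by matching two terms,

such that $g(l, N, \lambda) \to 0$; moreover, we make the 2 equal terms dominant,

with $l = N^a$ ($a > 0$).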

(31)

Convergence rate: results

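1 = 2: If $\lambda = \left[\frac{\log(N)}{N}\right]^{\frac{1}{c+3}}$, $\frac{\frac{1}{b} + c}{c+3} \le a$, then $g(N) = O\left(\left[\frac{\log(N)}{N}\right]^{\frac{c}{c+3}}\right) \to 0$. (13)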
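As a sanity check of the case 1 = 2, the choice of $\lambda$ and the rate in (13) can be recovered by hand from the four-term upper bound on the excess error, using $l = N^a$. The sketch below ignores multiplicative constants and assumes $N \ge 3$ (so that $\log N \ge 1$); it is an illustration of the matching argument, not the proof.

```latex
\begin{align*}
\frac{\log(l)}{N\lambda^{3}} = \lambda^{c}
  &\;\Longrightarrow\; \lambda^{c+3} = \frac{\log(l)}{N} = \frac{a\log(N)}{N}
  \;\Longrightarrow\; \lambda \propto \left[\frac{\log(N)}{N}\right]^{\frac{1}{c+3}},\\
\text{matched terms:}\quad \lambda^{c}
  &\propto \left[\frac{\log(N)}{N}\right]^{\frac{c}{c+3}},\\
\frac{1}{l\lambda^{1/b}} \le \lambda^{c}
  &\;\Longleftrightarrow\; N^{-a} \le \lambda^{c+\frac{1}{b}},
  \quad\text{which holds if } a \ge \frac{\frac{1}{b}+c}{c+3},\\
\frac{1}{l^{2}\lambda} \le \lambda^{c}
  &\;\Longleftrightarrow\; N^{-2a} \le \lambda^{c+1},
  \quad\text{which holds if } 2a \ge \frac{c+1}{c+3}.
\end{align*}
```

The last condition is implied by the one before it, because $2\left(\frac{1}{b} + c\right) \ge c + 1$ whenever $c \ge 1$; this is why only a single constraint on $a$ appears in (13).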

(32)

Convergence rate: results

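1 = 3: If $\lambda = N^{a - \frac{1}{2}} \log^{\frac{1}{2}}(N)$, $\frac{1}{6} \le a < \min\left(\frac{1}{2} - \frac{1}{c+3}, \frac{\frac{1}{2}\left(\frac{1}{b} - 1\right)}{\frac{1}{b} - 2}\right)$, then $g(N) = O\left(\frac{1}{N^{3a - \frac{1}{2}} \log^{\frac{1}{2}}(N)}\right) \to 0$. (14)

1 = 4: If $\lambda = \left[N^{a-1} \log(N)\right]^{\frac{b}{3b-1}}$, $\max\left(\frac{b-1}{4b-2}, \frac{1}{3b}\right) \le a < \frac{bc+1}{3b+bc}$, then $g(N) = O\left(\frac{1}{N^{a + \frac{a-1}{3b-1}} \log^{\frac{1}{3b-1}}(N)}\right) \to 0$. (15)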

(33)

Convergence rate: results

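2 = 3: $\emptyset$ (the matched terms cannot be made dominant).

2 = 4: If $\lambda = \frac{1}{N^{\frac{ab}{bc+1}}}$, $a < \frac{bc+1}{3b+bc}$, then $g(N) = O\left(\frac{1}{N^{\frac{abc}{bc+1}}}\right) \to 0$. (16)

3 = 4: If $\lambda = \frac{1}{N^{\frac{ab}{b-1}}}$, $2 < b$, $a < \frac{b-1}{2(2b-1)}$, then $g(N) = O\left(\frac{1}{N^{2a - \frac{ab}{b-1}}}\right) \to 0$. (17)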

(34)

Numerical illustration: supervised entropy learning

Problem: learn the entropy of Gaussians in a supervised manner.

Formally:

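$A = [A_{i,j}] \in \mathbb{R}^{2 \times 2}$, $A_{ij} \sim U[0,1]$.

100 sample sets: $\{N(0, \Sigma_u)\}_{u=1}^{100}$, where

100 = 25 (training) + 25 (validation) + 50 (testing),

one set = 500 i.i.d. 2D points,

$\Sigma_u = R(\beta_u) A A^T R(\beta_u)^T$, $R(\beta_u)$: 2D rotation,

angle $\beta_u \sim U[0, \pi]$.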

(35)

Supervised entropy learning: goal, performance measure

Goal: learn the entropy of the first marginal

$H = \frac{1}{2} \ln\left(2 \pi e \sigma^2\right)$, $\sigma^2 = M_{1,1}$, $M = \Sigma_u \in \mathbb{R}^{2 \times 2}$. (18)

Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.
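The data set of this experiment is easy to regenerate. The sketch below follows the description of the sample sets (one shared random $A$, rotation angles drawn uniformly from $[0, \pi]$, 500 points per bag) and computes the label via (18); the seed and the helper names `rotation` and `make_bag` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 1.0, size=(2, 2))   # A_ij ~ U[0, 1], shared by all bags

def rotation(beta):
    """2D rotation matrix R(beta)."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, -s], [s, c]])

def make_bag(beta, n_points=500):
    """Sample a bag from N(0, Sigma_u) and return it with its entropy label (18)."""
    R = rotation(beta)
    Sigma = R @ A @ A.T @ R.T
    X = rng.multivariate_normal(np.zeros(2), Sigma, size=n_points)
    H = 0.5 * np.log(2.0 * np.pi * np.e * Sigma[0, 0])   # entropy of the first marginal
    return X, H

betas = rng.uniform(0.0, np.pi, size=100)   # beta_u ~ U[0, pi]
bags, labels = zip(*(make_bag(b) for b in betas))
print(len(bags), bags[0].shape, labels[0])
```

The 100 (bag, label) pairs can then be split 25/25/50 into training, validation and test sets as described above.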

(36)

Supervised entropy learning: results

[Figure: true entropy and its MERR and DFDR estimates as a function of the rotation angle (β); RMSE comparison of MERR and DFDR. RMSE: MERR = 0.75, DFDR = 2.02.]

(37)

Numerical illustration: aerosol prediction

Bags:

randomly selected pixels,

within a 20km radius around an AOD sensor.

800 bags, 100 instances/bag. Instances: xi,n∈R16.

(38)

Aerosol prediction - baseline

Baseline: state-of-the-art mixture model (EM optimization),

800 = 4 × 160 (training) + 160 (test); 5-fold CV, 10 times.

Accuracy: 100 × RMSE (± std) = 7.5−8.5 (± 0.1−0.6).

Ridge regression:

800 = 3 × 160 (training) + 160 (validation) + 160 (test), 5-fold CV, 10 times,

validation: λ regularization, θ kernel parameter.

(39)

Aerosol prediction: kernel $k$

We picked 10 kernels ($k$): Gaussian, exponential, Cauchy, generalized t-student, polynomial kernel of order 2 and 3 ($p$ = 2 and 3), rational quadratic, inverse multiquadric kernel, Matérn kernel (with 3/2 and 5/2 smoothness parameters).

We also studied their ensembles.

Explored parameter domain:

$(\lambda, \theta) \in \{2^{-65}, 2^{-64}, \ldots, 2^{-3}\} \times \{2^{-15}, 2^{-14}, \ldots, 2^{10}\}$.

(40)

Aerosol prediction: kernel definitions

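Kernel definitions ($p$ = 2, 3):

$k_G(a,b) = e^{-\frac{\|a-b\|_2^2}{2\theta^2}}$, $k_e(a,b) = e^{-\frac{\|a-b\|_2}{2\theta^2}}$, (19)

$k_C(a,b) = \frac{1}{1 + \frac{\|a-b\|_2^2}{\theta^2}}$, $k_t(a,b) = \frac{1}{1 + \|a-b\|_2^{\theta}}$, (20)

$k_p(a,b) = (\langle a, b \rangle + \theta)^p$, $k_r(a,b) = 1 - \frac{\|a-b\|_2^2}{\|a-b\|_2^2 + \theta}$, (21)

$k_i(a,b) = \frac{1}{\sqrt{\|a-b\|_2^2 + \theta^2}}$, (22)

$k_{M,\frac{3}{2}}(a,b) = \left(1 + \frac{\sqrt{3} \|a-b\|_2}{\theta}\right) e^{-\frac{\sqrt{3} \|a-b\|_2}{\theta}}$, (23)

$k_{M,\frac{5}{2}}(a,b) = \left(1 + \frac{\sqrt{5} \|a-b\|_2}{\theta} + \frac{5 \|a-b\|_2^2}{3\theta^2}\right) e^{-\frac{\sqrt{5} \|a-b\|_2}{\theta}}$. (24)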
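Apart from the polynomial kernel, all kernels in (19)-(24) depend on $a$ and $b$ only through the distance $r = \|a-b\|_2$, so they can be collected in one helper. A sketch with hypothetical dictionary keys; the formulas are transcribed from the definitions above:

```python
import numpy as np

def kernels_from_distance(r, theta):
    """Evaluate the distance-based kernels (19)-(24) at r = ||a - b||_2."""
    r2 = r**2
    s3 = np.sqrt(3.0) * r / theta
    s5 = np.sqrt(5.0) * r / theta
    return {
        "gaussian": np.exp(-r2 / (2.0 * theta**2)),                 # k_G
        "exponential": np.exp(-r / (2.0 * theta**2)),               # k_e
        "cauchy": 1.0 / (1.0 + r2 / theta**2),                      # k_C
        "t_student": 1.0 / (1.0 + r**theta),                        # k_t
        "rational_quadratic": 1.0 - r2 / (r2 + theta),              # k_r
        "inverse_multiquadric": 1.0 / np.sqrt(r2 + theta**2),       # k_i
        "matern_3_2": (1.0 + s3) * np.exp(-s3),                     # k_{M,3/2}
        "matern_5_2": (1.0 + s5 + 5.0 * r2 / (3.0 * theta**2)) * np.exp(-s5),  # k_{M,5/2}
    }

at_zero = kernels_from_distance(0.0, theta=1.0)
at_one = kernels_from_distance(1.0, theta=1.0)
for name in at_zero:
    print(name, at_zero[name], at_one[name])
```

Writing the kernels as functions of a distance also makes it straightforward to plug in a different distance, such as the one between mean embeddings.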

(41)

Aerosol prediction: results ($K$: linear)

100 × RMSE (± std) [baseline: 7.5−8.5 (± 0.1−0.6)]:

kernel                 100 × RMSE (± std)
$k_G$                  7.97 (± 1.81)
$k_e$                  8.25 (± 1.92)
$k_C$                  7.92 (± 1.69)
$k_t$                  8.73 (± 2.18)
$k_p$ ($p$ = 2)        12.5 (± 2.63)
$k_p$ ($p$ = 3)        171.24 (± 56.66)
$k_r$                  9.66 (± 2.68)
$k_i$                  7.91 (± 1.61)
$k_{M,\frac{3}{2}}$    8.05 (± 1.83)
$k_{M,\frac{5}{2}}$    7.98 (± 1.75)
ensemble               7.86 (± 1.71)

Best combination in the ensemble: $k = k_G, k_C, k_i$.

(42)

Aerosol prediction: nonlinear $K$

We fed the mean embedding distance ($\|\mu_x - \mu_y\|_{H(k)}$) to the previous kernels.

Example (RBF on mean embeddings, a valid kernel):

$K(\mu_a, \mu_b) = e^{-\frac{\|\mu_a - \mu_b\|_{H(k)}^2}{2\theta_K^2}}$ ($\mu_a, \mu_b \in X$). (25)

We studied the efficiency of (i) single, (ii) ensembles of kernels [(k,K) pairs].
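The distance between two empirical mean embeddings needs only set-kernel evaluations, since $\|\mu_a - \mu_b\|_{H(k)}^2 = \langle \mu_a, \mu_a \rangle - 2\langle \mu_a, \mu_b \rangle + \langle \mu_b, \mu_b \rangle$. A sketch of the RBF kernel (25) on bags, assuming a Gaussian base kernel $k$; the names `theta` (base kernel width), `theta_K` (width of $K$) and the helper functions are illustrative assumptions:

```python
import numpy as np

def gaussian_gram(X, Y, theta):
    """Gram matrix of k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    sq = (X**2).sum(axis=1)[:, None] + (Y**2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def embedding_sqdist(A, B, theta):
    """||mu_A - mu_B||^2 in H(k), expanded into three set-kernel values."""
    d2 = (gaussian_gram(A, A, theta).mean()
          - 2.0 * gaussian_gram(A, B, theta).mean()
          + gaussian_gram(B, B, theta).mean())
    return max(d2, 0.0)   # guard against tiny negative values from rounding

def rbf_on_embeddings(A, B, theta, theta_K):
    """Equation (25) evaluated on the empirical mean embeddings of two bags."""
    return np.exp(-embedding_sqdist(A, B, theta) / (2.0 * theta_K**2))

rng = np.random.default_rng(0)
A = rng.normal(loc=0.0, size=(100, 16))   # toy bags with 16-dimensional instances
B = rng.normal(loc=0.5, size=(100, 16))
print(rbf_on_embeddings(A, B, theta=4.0, theta_K=1.0))
```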

(43)

Aerosol prediction: nonlinear $K$, results

Baseline:

Mixture model (EM): 7.5−8.5 (± 0.1−0.6),

Linear $K$ (single): 7.91 (± 1.61),

Linear $K$ (ensemble): 7.86 (± 1.71).

Nonlinear $K$:

Single: 7.90 (± 1.63),

Ensemble: 7.81 (± 1.64), with $(k, K) = (k_i, k_t), (k_{M,\frac{3}{2}}, k_{M,\frac{3}{2}}), (k_C, k_G)$.

(44)

Summary

Problem: distribution regression. Difficulty: two-stage sampling.

Examined solution: ridge regression; simple algorithm!

Contribution (on arXiv):

consistency; convergence rate,

in particular: consistency of set kernels in regression.

ITE toolbox (Bitbucket, ∋ MERR).

(45)

Thank you for the attention!

(46)

Topological space, open sets

Given: $X \ne \emptyset$ set. $\tau \subseteq 2^X$ is called a topology on $X$ if:

1. $\emptyset \in \tau$, $X \in \tau$.

2. Finite intersection: $O_1 \in \tau$, $O_2 \in \tau$ ⇒ $O_1 \cap O_2 \in \tau$.

3. Arbitrary union: $O_i \in \tau$ ($i \in I$) ⇒ $\cup_{i \in I} O_i \in \tau$.

Then, $(X, \tau)$ is called a topological space; $O \in \tau$: open sets.

(47)

Topology: examples

$\tau = \{\emptyset, X\}$: indiscrete topology.

$\tau = 2^X$: discrete topology.

$(X, d)$ metric space:

Open ball: $B_\epsilon(x) = \{y \in X : d(x,y) < \epsilon\}$.

$O \subseteq X$ is open if for $\forall x \in O$ $\exists \epsilon > 0$ such that $B_\epsilon(x) \subseteq O$.

(48)

Closed set, compact set, closure, subspace topology

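Given: $(X, \tau)$. $A \subseteq X$ is

closed if $X \setminus A \in \tau$ (i.e., its complement is open),

compact if for any family $(O_i)_{i \in I}$ of open sets with $A \subseteq \cup_{i \in I} O_i$, $\exists i_1, \ldots, i_n \in I$ with $A \subseteq \cup_{j=1}^{n} O_{i_j}$.

Closure of $A \subseteq X$:

$\bar{A} := \bigcap_{A \subseteq C,\ C \text{ closed in } X} C$. (26)

For $A \subseteq X$ the subspace topology on $A$: $\tau_A = \{O \cap A : O \in \tau\}$.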

(49)

Hausdorff space

$(X, \tau)$ is a Hausdorff space, if

for any $x \ne y \in X$ $\exists U, V \in \tau$ such that $x \in U$, $y \in V$, $U \cap V = \emptyset$.

In other words, distinct points have disjoint open neighborhoods.

(50)

Dense subset, separability, basis of a topology, Polish

$A \subseteq X$ is dense if $\bar{A} = X$.

$(X, \tau)$ is separable if $\exists$ countable, dense subset of $X$.

Counterexample: $l^\infty$ / $L^\infty$.

$\tau_1 \subseteq \tau$ is a basis of $\tau$ if every open set is a union of sets in $\tau_1$.

Example: open balls in a metric space.

$(X, \tau)$ is Polish if $\tau$ has a countable basis and $\exists$ complete metric defining $\tau$. Example: complete separable metric spaces.

(51)

Neighborhood, locally compact spaces

$(X, \tau)$:

$V \subseteq X$ is a neighborhood of $x \in X$ if $\exists O \in \tau$ such that $x \in O \subseteq V$.

$(X, \tau)$ is called locally compact if for $\forall x \in X$ $\exists$ compact neighborhood of $x$.

(52)

Examples: LCH, but not (necessarily) compact

Euclidean spaces: Rd, not compact.

Discrete spaces: LCH. Compact ⇔ $|X| < \infty$.

Open/closed subsets of an LCH: LC in subspace topology. Example: unit ball (open/closed).

(53)

Examples: Hausdorff, but not locally compact

(Q, topology inherited from R).

In other words, not every subset of an LCH is LC. Infinite dimensional Hilbert spaces.

(54)

The discrete space

$(X, 2^X)$: complete metric space.

Discrete metric (inducing the discrete topology):

$d(x,y) = \begin{cases} 0, & \text{if } x = y \\ 1, & \text{if } x \ne y \end{cases}$. (27)

Discrete space: separable ⇔ $|X|$ is countable.

(55)

Example: hyperparameter selection

Training: samples from MOGs with component number labels. Task:

given: samples from a new MOG distribution, predict: the number of components.

(56)

Example: toxic level estimation from tissues

Toxin alters the properties/causes mutations in cells. Training data:

bag = tissue,

samples in the bag = cells described by some simple features, output label = toxic level.

Task: predict the toxic level given a new tissue.

(57)

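Kernel $k$: $c_0$-universal

Let $C_0(D)$ = $D \to \mathbb{R}$ continuous functions vanishing at infinity, i.e.,

$\{u \in D : |g(u)| \ge \epsilon\}$ (28)

is compact for $\forall g \in C_0(D)$, $\forall \epsilon > 0$.

$k : D \times D \to \mathbb{R}$ is $c_0$-universal if

$\|k\|_\infty := \sup_{u \in D} \sqrt{k(u,u)} < \infty$,

$k(\cdot, u) \in C_0(D)$ ($\forall u \in D$).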