Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

(1)


Yee Whye Teh (Oxford)

in collaboration with:

Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua), Balaji Lakshminarayanan (Gatsby)

(2)

SMS: Distributed Bayesian Inference for Big Data Yee Whye Teh

Bayesian Inference

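! Parameter vector X.

! Data items Y = y_1, y_2, ..., y_N.

! Model:

$$p(X, Y) = p(X) \prod_{i=1}^{N} p(y_i \mid X)$$

! Aim:

$$p(X \mid Y) = \frac{p(X)\, p(Y \mid X)}{p(Y)}$$

[Figure: graphical model with the parameter X as parent of the data items y_1, y_2, ..., y_N.]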

(3)

Why Bayes for Machine Learning?

! An important framework to frame learning.

! Quantification of uncertainty.

! Flexible and intuitive construction of complex models.

! Straightforward derivation of learning algorithms.

! Mitigation of overfitting.

(4)

Big Data and Bayesian Inference?

! Large scale datasets are fast becoming the norm.

! Analysing and extracting understanding from these data is a driver of progress in many sectors of society.

! Current successes in scalable learning are optimization-based and non-Bayesian.

! What is the role of Bayesian learning in the world of Big Data?

(5)

Generic (Machine) Learning on Big Data

! Stochastic optimisation using mini-batches.

! Stochastic gradient descent.

> Stochastic Gradient Langevin Dynamics (Welling & Teh, Teh et al)

! Distributed/parallel computations on cores/clusters/GPUs.

! MapReduce, parameter server.

! Bringing the computations to the data, not the reverse.

! High communication costs.

> Distributed Bayesian Posterior Sampling via Moment Sharing (Xu et al)

! High synchronisation costs.

> Asynchronous Anytime Sequential Monte Carlo (Paige et al)

(6)

Generic (Bayesian) Learning on Big Data

! Stochastic optimisation using mini-batches.

! Stochastic gradient descent.

> Stochastic Gradient Langevin Dynamics [Welling & Teh 2011, Patterson & Teh 2013, Teh et al (forthcoming)]

! Distributed/parallel computations on cores/clusters/GPUs.

! MapReduce, parameter server.

! Bringing the computations to the data, not the reverse.

! High communication costs.

> Distributed Bayesian Posterior Sampling via Moment Sharing [Xu et al 2014]

! High synchronisation costs.

> Asynchronous Anytime Sequential Monte Carlo [Paige et al 2014]
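The Stochastic Gradient Langevin Dynamics entry above replaces the full-data gradient in a Langevin update with a rescaled mini-batch gradient and adds Gaussian noise whose variance equals the step size. A minimal sketch follows, assuming a toy model that is not from the talk: scalar data y_i ~ N(x, 1) with a N(0, prior_var) prior on x. The function name `sgld_gaussian_mean` and the fixed step size are illustrative assumptions (the published algorithm uses decreasing step sizes).

```python
import numpy as np

def sgld_gaussian_mean(y, n_steps=5000, batch_size=50, step=1e-4,
                       prior_var=100.0, seed=0):
    """SGLD for the mean x of y_i ~ N(x, 1) with prior x ~ N(0, prior_var).

    Illustrative toy model; a fixed step size is used for simplicity.
    """
    rng = np.random.default_rng(seed)
    n_data = len(y)
    x = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        batch = rng.choice(y, size=batch_size, replace=False)
        grad_prior = -x / prior_var
        # Mini-batch gradient of the log-likelihood, rescaled to the full data set.
        grad_lik = (n_data / batch_size) * np.sum(batch - x)
        # Langevin step: half-step along the gradient plus N(0, step) noise.
        x = x + 0.5 * step * (grad_prior + grad_lik) + np.sqrt(step) * rng.normal()
        samples[t] = x
    return samples

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=1000)
samples = sgld_gaussian_mean(y)
print(samples[1000:].mean(), y.mean())  # the two values should be close
```

Each step touches only `batch_size` data items, which is what makes the scheme attractive for large N; the price is extra variance from the mini-batch gradient.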

(7)

Machine Learning on Distributed Systems

[Figure: data split into four subsets y_{1i}, y_{2i}, y_{3i}, y_{4i}, one per machine.]

! Distributed storage

! Distributed computation

! Network communication costs

(8)

Embarrassingly Parallel MCMC Sampling

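[Figure: data split into four subsets y_{1i}, y_{2i}, y_{3i}, y_{4i}, one per worker machine.]

! Treat as independent inference problems.

! Collect samples $\{X_{ji}\}_{j=1,\dots,m;\ i=1,\dots,n}$.

! “Combine” samples together to obtain $\{X_i\}_{i=1,\dots,n}$.

! Only communication at the combination stage.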

(9)

Local and Global Posteriors

! Each worker machine j has access only to its data subset:

$$p_j(X \mid y_j) = p_j(X) \prod_{i=1}^{I} p(y_{ji} \mid X)$$

! where p_j(X) is a local prior and p_j(X | y_j) is the local posterior.

! The (target) global posterior is

$$p(X \mid y) \propto p(X) \prod_{j=1}^{m} p(y_j \mid X) \propto p(X) \prod_{j=1}^{m} \frac{p_j(X \mid y_j)}{p_j(X)}$$

! If prior p(X) = ∏_j p_j(X), then

$$p(X \mid y) \propto \prod_{j=1}^{m} p_j(X \mid y_j)$$

! Given a collection of samples {X_ji}, i = 1, ..., n, from p_j(· | y_j), how do we get samples {X_i}, i = 1, ..., n, from p(· | y)?
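The product identity above can be checked numerically in a conjugate case. The sketch below assumes a toy model that is not on the slides: y_i ~ N(X, 1) with prior X ~ N(0, prior_var), and local priors p_j(X) = p(X)^{1/m}, which are N(0, m · prior_var). The helper names are illustrative.

```python
import numpy as np

def gaussian_posterior(y, prior_var, noise_var=1.0):
    """Posterior (mean, variance) of X for y_i ~ N(X, noise_var), X ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + len(y) / noise_var
    mean = (np.sum(y) / noise_var) / precision
    return mean, 1.0 / precision

def product_of_gaussians(means, variances):
    """(mean, variance) of the normalised product of 1-D Gaussian densities."""
    precisions = 1.0 / np.asarray(variances)
    precision = precisions.sum()
    mean = np.sum(precisions * np.asarray(means)) / precision
    return mean, 1.0 / precision

rng = np.random.default_rng(0)
m, prior_var = 4, 10.0
y = rng.normal(1.5, 1.0, size=400)
shards = np.array_split(y, m)

# Local prior p(X)^(1/m) is N(0, m * prior_var), so the local priors multiply back to p(X).
local = [gaussian_posterior(shard, prior_var * m) for shard in shards]
combined = product_of_gaussians(*zip(*local))
exact = gaussian_posterior(y, prior_var)
print(combined, exact)  # the two (mean, variance) pairs agree
```

In this Gaussian case the combination is exact; the slides that follow concern what to do when only samples from the local posteriors are available.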

(10)

Consensus Monte Carlo

! Each worker machine j collects n samples {X_ji}, i = 1, ..., n, from:

$$p_j(X \mid y_j) = p(X)^{1/m} \prod_{i=1}^{I} p(y_{ji} \mid X)$$

! Master machine combines samples by weighted average:

$$X_i = \Big(\sum_{j=1}^{m} W_j\Big)^{-1} \sum_{j=1}^{m} W_j X_{ji}$$

[Scott et al 2013]

(11)

(12)

Consensus Monte Carlo

! Combination is correct if local posteriors are Gaussian.

! Weights are local posterior precisions.

! If not Gaussian, the method makes strong assumptions, and it is unclear which local priors and weights make it work.

[Scott et al 2013]

$$X_i = \Big(\sum_{j=1}^{m} W_j\Big)^{-1} \sum_{j=1}^{m} W_j X_{ji}$$
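The weighted average above takes a few lines of NumPy. This is a sketch rather than the reference implementation of Scott et al: the function name `consensus_combine`, the array layout, and the synthetic Gaussian "local posteriors" are assumptions made for illustration. The weights are set to inverse sample covariances, in line with the bullet stating that the weights are local posterior precisions.

```python
import numpy as np

def consensus_combine(samples, weights):
    """Combine local draws by the weighted average (sum_j W_j)^{-1} sum_j W_j X_ji.

    samples: array of shape (m, n, d), draw i of worker j is samples[j, i]
    weights: array of shape (m, d, d), one weight matrix W_j per worker
    returns: array of shape (n, d)
    """
    total = weights.sum(axis=0)                             # sum_j W_j
    weighted = np.einsum('jab,jnb->na', weights, samples)   # sum_j W_j X_ji
    return np.linalg.solve(total, weighted.T).T

rng = np.random.default_rng(0)
m, n, d = 3, 20000, 2
# Synthetic Gaussian local posteriors (illustrative stand-ins for MCMC output).
means = rng.normal(size=(m, d))
covs = np.stack([np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(m)])
samples = np.stack([rng.multivariate_normal(means[j], covs[j], size=n)
                    for j in range(m)])
weights = np.stack([np.linalg.inv(np.cov(samples[j], rowvar=False))
                    for j in range(m)])
combined = consensus_combine(samples, weights)

# For Gaussian local posteriors the target is their normalised product.
precision = np.linalg.inv(covs).sum(axis=0)
target_mean = np.linalg.solve(
    precision, np.einsum('jab,jb->a', np.linalg.inv(covs), means))
print(combined.mean(axis=0), target_mean)
```

With non-Gaussian local posteriors the same code still runs, but nothing guarantees that the averaged draws follow the global posterior.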

(13)

Approximating Local Posterior Densities

! [Neiswanger et al 2013] proposed methods to combine estimates of local posterior densities instead of samples:

! Parametric: Gaussian approximation.

! Nonparametric: kernel density estimation based on samples.

! Semiparametric: Product of a parametric Gaussian approximation with a nonparametric KDE correction term.

! Combination: Product of (approximate) densities.

! Sampling: Resort to Metropolis-within-Gibbs.

! [Wang & Dunson 2013]’s Weierstrass sampler is similar, using rejection sampling instead.

[Neiswanger et al 2013, Wang & Dunson 2013]

$$p(X \mid y) \propto \prod_{j=1}^{m} p_j(X \mid y_j) \approx \prod_{j=1}^{m} \frac{1}{n} \sum_{i=1}^{n} K_{h_j}(X; X_{ji})$$
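To make the nonparametric estimate above concrete, the sketch below evaluates the log of the product of local KDEs on a 1-D grid. It only evaluates the density and is not the sampler of Neiswanger et al. The Gaussian kernel, the shared bandwidth `h`, the helper names, and the two synthetic sets of local samples are all assumptions made for illustration.

```python
import numpy as np

def log_kde(x, samples, h):
    """Log of the Gaussian KDE (1/n) sum_i N(x; X_i, h^2) at each point of 1-D array x."""
    z = (x[:, None] - samples[None, :]) / h
    log_k = -0.5 * z**2 - np.log(h * np.sqrt(2.0 * np.pi))
    top = log_k.max(axis=1)
    # log-mean-exp over the samples, stabilised by subtracting the row maximum
    return top + np.log(np.exp(log_k - top[:, None]).mean(axis=1))

def log_product_of_kdes(x, local_samples, h):
    """Unnormalised log density of the product of the m local KDEs."""
    return sum(log_kde(x, s, h) for s in local_samples)

rng = np.random.default_rng(0)
# Two synthetic local posteriors, N(0, 1) and N(2, 1); their product peaks near 1.
local_samples = [rng.normal(0.0, 1.0, size=5000),
                 rng.normal(2.0, 1.0, size=5000)]
grid = np.linspace(-3.0, 5.0, 401)
log_density = log_product_of_kdes(grid, local_samples, h=0.3)
mode = grid[np.argmax(log_density)]
print(mode)
```

Expanding the product of m sums gives n^m mixture components, which is why sampling from it directly is expensive and a Metropolis-within-Gibbs scheme is used instead.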

(14)

(15)

Approximating Local Posterior Densities

! Parametric approximation can be quite bad unless the Bernstein-von Mises theorem kicks in.

! Complex and expensive combination step in non- and semi-parametric estimates.

! KDE suffers from curse of dimensionality.

! Performs poorly if local posteriors differ significantly.

(16)

Intuition and Desiderata

! Distributed system with independent MCMC sampling.

! Identify regions of high (global) posterior probability mass.

! Each local sampler is based on local data, but should “concentrate on high probability regions”.

! High probability regions found using samples, by allowing for some small amount of communication.

(17)

(Not Quite) Embarrassingly Parallel MCMC

! Allow some amount of communication to align worker MCMC samplers.

! “High probability region” defined by low order moments.

! Align using Expectation Propagation (EP).

! Asynchronous and infrequent updates.

[Figure: data split into four subsets y_{1i}, y_{2i}, y_{3i}, y_{4i}, one per worker machine.]

(18)

Expectation Propagation

! If N is large, the worker j likelihood term p(y_j | X) should be well approximated by a Gaussian:

$$p(y_j \mid X) \approx q_j(X) = \mathcal{N}(X; \mu_j, \Sigma_j)$$

! Parameters fit iteratively using a variational approach to minimize KL divergence:

$$p(X \mid y) \approx p_j(X \mid y) \propto p(y_j \mid X)\, \underbrace{p(X) \prod_{k \neq j} q_k(X)}_{p_j(X)}$$

$$q_j^{\text{new}}(\cdot) = \arg\min_{\mathcal{N}(\cdot;\mu,\Sigma)} \mathrm{KL}\big(p_j(\cdot \mid y) \,\big\|\, \mathcal{N}(\cdot;\mu,\Sigma)\, p_j(\cdot)\big)$$

[Minka 2001]

(19)

Expectation Propagation

! Update performed as follows:

! Compute (or estimate) first two moments µ*, Σ* of p_j(X | y).

! Compute µ_j, Σ_j so that N(·; µ_j, Σ_j) p_j(X)/Z has moments µ*, Σ*.

! Computations done on natural parameters.

! Generalizes to other exponential families.

$$p(X \mid y) \approx p_j(X \mid y) \propto p(y_j \mid X)\, \underbrace{p(X) \prod_{k \neq j} q_k(X)}_{p_j(X)}$$

$$q_j^{\text{new}}(\cdot) = \arg\min_{\mathcal{N}(\cdot;\mu,\Sigma)} \mathrm{KL}\big(p_j(\cdot \mid y) \,\big\|\, \mathcal{N}(\cdot;\mu,\Sigma)\, p_j(\cdot)\big)$$
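Written out for the Gaussian case, the two update steps above reduce to a subtraction of natural parameters. The block below is a sketch under two assumptions that are not on the slides: the prior is itself Gaussian, and the notation Λ = Σ^{-1} (precision) and η = Σ^{-1} µ (precision-weighted mean) is introduced here, with subscript 0 for the prior and subscript −j for the local prior p_j(X).

```latex
% Natural parameters of the local prior p_j(X) = p(X) \prod_{k \neq j} q_k(X)
\Lambda_{-j} = \Lambda_0 + \sum_{k \neq j} \Lambda_k ,
\qquad
\eta_{-j} = \eta_0 + \sum_{k \neq j} \eta_k

% Moment matching: N(.; \mu_j, \Sigma_j) p_j(X) / Z must have moments \mu^*, \Sigma^*
\Lambda_j^{\text{new}} = (\Sigma^*)^{-1} - \Lambda_{-j} ,
\qquad
\eta_j^{\text{new}} = (\Sigma^*)^{-1} \mu^* - \eta_{-j}
```

Because a product of Gaussians adds natural parameters, the new site parameters are what remains after removing the local prior from the moment-matched Gaussian. Note that Λ_j^new need not be positive definite, even though the global approximation is.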

(20)

Expectation Propagation

! Variational parameters fit iteratively until convergence.

! EP tends to converge very quickly (when it does).

! Damping updates can help convergence.

! At convergence, all local posteriors agree on their first two moments.

! Generalizes to hierarchical and graphical models [Infer.NET, Gelman et al 2014].

[Figure: prior p(X) and four workers holding y_{1i}, ..., y_{4i}; each likelihood term is approximated, p(y_j | X) ≈ q_j(X) for j = 1, ..., 4.]

$$q_j^{\text{new}}(\cdot) = \arg\min_{\mathcal{N}(\cdot;\mu,\Sigma)} \mathrm{KL}\big(p_j(\cdot \mid y) \,\big\|\, \mathcal{N}(\cdot;\mu,\Sigma)\, p_j(\cdot)\big)$$

(21)

Sampling via Moment Sharing (SMS)

! KL minimized by matching moments of p_j(X | y).

! Moments computed by drawing MCMC samples.

! All samples from all machines can be treated as approximate samples from full posterior given all data.

! Communicate only moments, synchronous or asynchronous.

[Figure: prior p(X) and four workers holding y_{1i}, ..., y_{4i}; each likelihood term is approximated, p(y_j | X) ≈ q_j(X) for j = 1, ..., 4.]

$$q_j^{\text{new}}(\cdot) = \arg\min_{\mathcal{N}(\cdot;\mu,\Sigma)} \mathrm{KL}\big(p_j(\cdot \mid y) \,\big\|\, \mathcal{N}(\cdot;\mu,\Sigma)\, p_j(\cdot)\big)$$

(22)

Sampling via Moment Sharing (SMS)

[Figure: prior p(X) and four workers with p(y_j | X) ≈ q_j(X), j = 1, ..., 4; this build step adds the local prior p_j(·).]

(23)

Sampling via Moment Sharing (SMS)

[Figure: prior p(X) and four workers with p(y_j | X) ≈ q_j(X), j = 1, ..., 4; this build step adds the local prior p_j(·) and the samples {X_ji}.]

(24)

Sampling via Moment Sharing (SMS)

[Figure: prior p(X) and four workers with p(y_j | X) ≈ q_j(X), j = 1, ..., 4; this build step adds p_j(·) and {X_ji} ⇒ (µ*, Σ*).]

(25)

Sampling via Moment Sharing (SMS)

[Figure: prior p(X) and four workers with p(y_j | X) ≈ q_j(X), j = 1, ..., 4; this build step adds p_j(·) and {X_ji} ⇒ (µ*, Σ*) ⇒ (µ_j, Σ_j).]

(26)

Sampling via Moment Sharing (SMS)

[Figure: prior p(X) and four workers with p(y_j | X) ≈ q_j(X), j = 1, ..., 4; this build step adds p_j(·) and {X_ji} ⇒ (µ*, Σ*) ⇒ (µ_j, Σ_j) ⇒ q_j(·).]
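The loop shown in the build slides (local prior, MCMC samples, moments, updated Gaussian factor) can be sketched end to end on a toy problem. This is not the authors' implementation. It assumes a scalar parameter x with prior N(0, 1/prior_prec), data y_i ~ N(x, 1), a plain random-walk Metropolis sampler in place of NUTS, and workers visited one after another instead of in parallel. All function names, the damping factor, and the step size are illustrative choices.

```python
import numpy as np

def log_lik(x, y, noise_var=1.0):
    """Log-likelihood (up to a constant) of shard y under y_i ~ N(x, noise_var)."""
    return -0.5 * np.sum((y - x) ** 2) / noise_var

def metropolis(log_target, x0, n_samples, step, rng, burn_in=500):
    """Random-walk Metropolis on a scalar parameter."""
    x, lp = x0, log_target(x0)
    out = np.empty(n_samples)
    for t in range(burn_in + n_samples):
        prop = x + step * rng.normal()
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        if t >= burn_in:
            out[t - burn_in] = x
    return out

def sms(shards, prior_prec, n_ep_iters=10, n_samples=2000,
        step=0.15, damping=0.5, seed=0):
    """Toy moment-sharing loop; returns (mean, variance) of the global Gaussian."""
    rng = np.random.default_rng(seed)
    m = len(shards)
    # Natural parameters (precision, precision * mean) of each q_j, initially flat.
    site_prec, site_eta = np.zeros(m), np.zeros(m)
    for _ in range(n_ep_iters):
        for j in range(m):  # sequential here; parallel on a real cluster
            # Local prior p_j = prior times all other sites (prior mean is 0).
            cav_prec = prior_prec + site_prec.sum() - site_prec[j]
            cav_eta = site_eta.sum() - site_eta[j]

            def log_tilted(x, y=shards[j], lam=cav_prec, eta=cav_eta):
                return log_lik(x, y) - 0.5 * lam * x**2 + eta * x

            xs = metropolis(log_tilted, cav_eta / cav_prec, n_samples, step, rng)
            mu_star, var_star = xs.mean(), xs.var()
            # Moment matching in natural parameters, then a damped update.
            new_prec = 1.0 / var_star - cav_prec
            new_eta = mu_star / var_star - cav_eta
            site_prec[j] += damping * (new_prec - site_prec[j])
            site_eta[j] += damping * (new_eta - site_eta[j])
    post_prec = prior_prec + site_prec.sum()
    return site_eta.sum() / post_prec, 1.0 / post_prec

rng = np.random.default_rng(1)
y = rng.normal(1.5, 1.0, size=400)
shards = np.array_split(y, 4)
mean, var = sms(shards, prior_prec=0.1)
exact_prec = 0.1 + len(y)
print(mean, var, y.sum() / exact_prec, 1.0 / exact_prec)
```

Only the two numbers per site (`site_prec[j]`, `site_eta[j]`) would cross the network; the shards and the MCMC samples stay on their workers. Because the moments are Monte Carlo estimates, the result fluctuates around the exact posterior rather than matching it exactly.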

(27)

Bayesian Logistic Regression

! Simulated dataset.

! d=20, # data items N=1000.

! NUTS base sampler.

! # workers m = 4,10,50.

! # MCMC iters T = 1000,1000,10000.

! # EP iters k given as vertical lines.

[Figure: three panels, each with x-axis k × T × N/m × 10³.]

(28)

Bayesian Logistic Regression

! MSE of posterior mean, as function of total # iterations.

[Figure: MSE of posterior mean (log scale) against k × T × m (× 10⁵); methods: SMS(s), SMS(a), SCOT, NEIS(p), NEIS(n), WANG.]

(29)

Bayesian Logistic Regression

! Approximate KL, MSE of predictive probabilities, as function of total # iterations.

[Figure: two panels, each on a log-scale y-axis against k × T × m (× 10⁵). One panel compares SMS(s), SMS(a), SCOT, NEIS(n), WANG; the other compares SMS(s), SMS(a), SCOT, WANG.]

(30)

Bayesian Logistic Regression

! Approximate KL as function of # nodes.

[Figure: approximate KL (axis from 0 to 2.5) for m = 8, 16, 32, 48, 64; methods: SMS(s,s), SMS(s,e), SMS(a,s), SMS(a,e), SCOT, XING(p).]

(31)

Bayesian Logistic Regression

! Approximate KL, as function of # iterations per node and # likelihood evaluations.

[Figure: two panels on a log-scale y-axis; x-axes: k × T × N/m (× 10⁸) and k × T (× 10⁴); curves for SMS(s) and SMS(a) with m = 8, 16, 32, 48, 64.]

(32)

Spike-and-Slab Sparse Regression

! Posterior mean coefficients.

[Figure: two panels, each with x-axis k × T × N/m × 10³.]

(33)

Some Remarks

! Scalable distributed MCMC sampling.

! A bit of communication goes a long way.

! Issue with stochasticity of moment estimates:

! EP theory does not cover stochastic updates.

! Not clear what the best stochastic update is.

! Nor how to characterise convergence and quality of approximation.

! Matlab source: https://github.com/chokkyvista/smssample

(34)

Other Approaches to Scalable Bayes

! Median posterior [Minsker et al 2014]:

! Embeds local posteriors into an RKHS, and computes the geometric median.

! Improves robustness to outliers in data.

! Stochastic gradient MCMC approaches:

! Reduce cost of each MCMC step by using data subset.

! A distributed version has been developed.

! [Welling & Teh 2011, Ahn et al 2012, 2014, Teh, Thiery & Vollmer (forthcoming), Bardenet et al 2014]

! Variational approaches:

! Faster convergence, with possibly significant bias.

! Recent works successfully extend these to large scale datasets using stochastic approximation techniques [Hoffman et al 2010, 2013, etc] and to flexible parameterized variational distributions [Mnih & Gregor 2014, Rezende et al 2014, Kingma & Welling 2014].

(35)

Bigger Picture

! The probabilistic modelling/Bayesian inference approach offers a principled and powerful data analysis framework.

! Standard methodologies do not extend easily to Big Data.

! Important to develop generic methodologies allowing these approaches to be applicable on Big Data.

! Bias/variance trade-offs becoming more important.

! Low bias “exact” methods do not scale as well to Big Data.

(36)

Thank you!

Thanks for funding:
