Gaussian Processes for Big Data Problems

(1)

Gaussian Processes for Big Data Problems

Marc Deisenroth

Department of Computing Imperial College London

http://wp.doc.ic.ac.uk/sml/marc-deisenroth

Machine Learning Summer School, Chalmers University 14 April 2015

(2)

http://www.gaussianprocess.org/

(3)

Problem Setting

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Objective

For a set of observations y_i“ f px_iq ` ε, ε „N^{`0, σ}ε²˘, find a distribution over functionspp f q that explains the data

Probabilistic regression problem

(4)

Problem Setting

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8 9 10

−2 0 2

x

f(x)

Objective

Probabilistic regression problem

(5)

(Some) Relevant Application Areas

§ Global black-box optimization and experimental design

§ Autonomous learning in robotics

§ Probabilistic dimensionality reduction and data visualization

(6)

Key Concepts in Probability Theory

Two fundamental rules:

ppxq “ ż

ppx, yqdy Sum rule/Marginalization property ppx, yq “ ppy|xqppxq Product rule

Bayes’ Theorem (Probabilistic Inverse) ppx|yq “ ppy|xq ppxq

ppyq , x : hypothesis, y : measurement

§ Posterior belief

§ Prior belief

§ Likelihood (measurement model)

§ Marginal likelihood (normalization constant)

x

y

(9)

Key Concepts in Probability Theory

ppxq “ ż

§ Posterior belief

§ Prior belief

x

y

(10)

Key Concepts in Probability Theory

ppxq “ ż

§ Posterior belief

§ Prior belief

x

y

(11)

The Gaussian Distribution

ppx|µ,Σq “ p2πq^´^D²|Σ|^´¹²exp`

´¹₂px ´ µq^JΣ^´1px ´ µq˘

§ Mean vectorµ Average of the data

§ Covariance matrixΣ Spread of the data

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

p(x)

p(x) Mean

95% confidence bound

(12)

The Gaussian Distribution

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

x

p(x)

p(x) Mean

−5 −4 −3 −2 −1 0 1 2 3

−3

−2

−1 0 1 2 3

x 1 x2

Mean

(13)

The Gaussian Distribution

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

p(x)

Data p(x) Mean

95% confidence interval

−5 −4 −3 −2 −1 0 1 2 3

−3

−2

−1 0 1 2 3

x x2

Data Mean

(14)

Sampling from a Multivariate Gaussian

Objective

Generate a random sample y „N^`^µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

However, we only have access to a random number generator that can sample x fromN^{`0, I˘...}

Exploit that affine transformations y “ Ax ` b of Gaussians remain Gaussian

§ Mean: ExrAx ` bs “ AExrxs ` b

§ Covariance: VxrAx ` bs “ AVxrxsA^J

1. Find conditions for A, b to match the mean of y 2. Find conditions for A, b to match the covariance of y

(15)

Sampling from a Multivariate Gaussian

Objective

(16)

Sampling from a Multivariate Gaussian

Objective

(17)

Sampling from a Multivariate Gaussian (2)

Objective

x = randn(D,1); Sample x „N^{`0, I˘}

y = chol(Σ)’*x + µ; Scale x

Here chol(Σ) is the Cholesky factor L, such that L^JL “Σ

Therefore, the mean and covariance of y are Erys “ ¯y “ ErL^Jx ` µs “ L^JErxs ` µ “ µ

Covrys “ Erpy ´ ¯yqpy ´ ¯yq^Js “ErL^Jxx^JLs “ L^JErxx^JsL “ L^JL “ Σ

(18)

Sampling from a Multivariate Gaussian (2)

Objective

x = randn(D,1); Sample x „N^{`0, I˘}

y = chol(Σ)’*x + µ; Scale x

Here chol(Σ) is the Cholesky factor L, such that L^JL “Σ Therefore, the mean and covariance of y are

Erys “ ¯y “ ErL^Jx ` µs “ L^JErxs ` µ “ µ

Covrys “ Erpy ´ ¯yqpy ´ ¯yq^Js “ErL^Jxx^JLs “ L^JErxx^JsL “ L^JL “ Σ

(19)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) ppx, yq “N

˜«

µ_x µ_y

ff

,« Σxx Σxy

Σyx Σyy

ff¸

ppx|yq “N^`^µ_x|y^, Σx|y

˘

µ_x|y“ µ_x ` Σxy Σ^´1_yy py ´ µ_yq Σx|y“ Σxx ´ Σxy Σ^´1_yy Σyx

Conditional ppx|yq is also Gaussian Computationally convenient

(20)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y)

Observation ppx, yq “N

˜«

µ_x µ_y

ff

,« Σxx Σxy

Σyx Σyy

ff¸

˘

µ_x|y“ µ_x ` Σxy Σ^´1_yy py ´ µ_yq Σx|y“ Σxx ´ Σxy Σ^´1_yy Σyx

(21)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) Observation y Conditional p(x|y)

ppx, yq “N

˜«

µ_x µ_y

ff

,« Σxx Σxy

Σyx Σyy

ff¸

˘

µ_x|y“ µ_x ` Σxy Σ^´1_yy py ´ µ_yq Σ_x|y“ Σxx ´ Σxy Σ^´1_yy Σyx

(22)

Marginal

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) Marginal p(x)

ppx, yq “N

˜«

µ_x µ_y

ff

,« Σxx Σxy

Σyx Σyy

ff¸

pp x q “ ż

pp x , y qd y

“N^`^µ_x^, Σxx

˘

§ The marginal of a joint Gaussian distribution is Gaussian

§ Intuitively: Ignore (integrate out) everything you are not interested in

(23)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x₁, x₂, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system x_t`1“ Ax_t`w, w „N^{`0, Q˘} x₁, x₂, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(25)

Gaussian Process

Definition

Unexpected (?) example: Linear dynamical system x_t`1“Ax_t`w, w „N^{`0, Q˘}

x₁, x₂, . . . is a collection of random variables

, any finite number of which is Gaussian distributed

This is a GP

(26)

Gaussian Process

Definition

x₁, x₂, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

(27)

Gaussian Process

Definition

This is a GP

(28)

Gaussian Process

Definition

This is a GP

(29)

The Gaussian in the Limit

Consider thejoint Gaussian distributionppx, ˜xq, where x P R^D and

˜x P R^k, k Ñ 8

Then

ppx, ˜xq “N

˜« µ_x µ_˜x

ff ,

„Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPR^kˆkandΣx˜x PR^Dˆk, k Ñ 8. However, themarginal remains finite

pp x q “ ż

pp x , ˜x qd ˜x “N^`^µ_x^, Σxx

˘ where we integrate out an infinite number of variables ˜x_i.

(30)

The Gaussian in the Limit

˜x P R^k, k Ñ 8 Then

ppx, ˜xq “N

˜«

µ_x µ_˜x

ff ,

„Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPR^kˆkandΣx˜x PR^Dˆk, k Ñ 8.

However, themarginal remains finite pp x q “

ż

pp x , ˜x qd ˜x “N^`^µ_x^, Σxx

(31)

The Gaussian in the Limit

˜x P R^k, k Ñ 8 Then

ppx, ˜xq “N

˜«

µ_x µ_˜x

ff ,

„Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPR^kˆkandΣx˜x PR^Dˆk, k Ñ 8.

However, themarginal remains finite pp x q “

ż

pp x , ˜x qd ˜x “N^`^µ_x^, Σxx

(32)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test datax_train, xtest

§ Then, x “ tx_train, x_test, x_otheru

(x_otherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˝

»

—

— –

µ_train µ_test µ_other

fi ffi ffi fl,

»

—

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σ_test,other Σother,train Σother,test Σother

fi ffi ffi fl

˛

‹

‚

ppx_train, x_testq “ ż

pp x_train, x_test, x_otherqd x_other ppxtest|xtrainq “N^`^µ˚, Σ˚

˘

µ_˚ “ µ_test ` Σtest,train Σ^´1_trainpx_train´ µ_trainq Σ˚ “ Σtest ´ Σtest,train Σ^´1_train Σtrain,test

(33)

Marginal and Conditional in the Limit

ppxq “N

¨

˚

˝

»

—

— –

fi ffi ffi fl,

»

—

— –

Σtest,train Σtest

Σtrain,other

Σ_test,other Σother,train Σother,test Σother

fi ffi ffi fl

˛

‹

‚

˘

(34)

Marginal and Conditional in the Limit

ppxq “N

¨

˚

˝

»

—

— –

fi ffi ffi fl,

»

—

— –

Σtest,train Σtest

Σtrain,other

Σtest,other

Σother,train Σother,test Σother

fi ffi ffi fl

˛

‹

‚

˘

(35)

Marginal and Conditional in the Limit

ppxq “N

¨

˚

˝

»

—

— –

fi ffi ffi fl,

»

—

— –

Σtest,train Σtest

Σtrain,other

Σtest,other

fi ffi ffi fl

˛

‹

‚

pp x_train, x_test, x_otherqd x_other

ppxtest|xtrainq “N^`^µ˚, Σ˚

˘

(36)

Marginal and Conditional in the Limit

ppxq “N

¨

˚

˝

»

—

— –

fi ffi ffi fl,

»

—

— –

Σtest,train Σtest

Σtrain,other

Σtest,other

fi ffi ffi fl

˛

‹

‚

pp x_train, x_test, x_otherqd x_other ppxtest|x_trainq “N^`^µ˚, Σ˚

˘

(37)

Back to Regression: Distribution over Functions

§ We are not really interested in a distribution on x, but rather in a distribution over function values f pxq

§ Let’s replace x with function values f “ f pxq

(Treat a function as a long vector of function values)

pp f , ˜f q “N

¨

˝

» –

µ_f µ_˜f

fi fl,

» –

Σf f Σ_{f ˜f} Σ_˜ff Σ_{˜f ˜f}

fi fl

˛

‚

whereΣ_{˜f ˜f}PR^kˆkandΣ_{f ˜f}PR^Nˆk, k Ñ 8.

§ Again, themarginal remains finite pp f q “

ż

pp f , ˜f qd ˜f “N^`^µ_f ^, Σf f

˘

(38)

Back to Regression: Distribution over Functions

pp f , ˜f q “N

¨

˝

» –

µ_f µ_˜f

fi fl,

» –

fi fl

˛

‚

ż

pp f , ˜f qd ˜f “N^`^µ_f ^, Σf f

˘

(39)

Back to Regression: Distribution over Functions

pp f , ˜f q “N

¨

˝

» –

µ_f µ_˜f

fi fl,

» –

fi fl

˛

‚

ż

pp f , ˜f qd ˜f “N^`^µ_f ^, Σf f

˘

(40)

Marginal and Conditional over Functions

Define f_˚ :“ f_test, f :“ f_train.

§ Marginal

pp f_˚, f q “ N

˜« µ_f µ_˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f_˚|f q “N^{`m, S˘}

m “ µ_˚`Σ˚fΣ^´1_{f f}pf ´ µq S “Σ˚˚´Σ˚fΣ^´1_{f f}Σf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(41)

Marginal and Conditional over Functions

§ Marginal

pp f_˚, f q “ N

˜«

µ_f µ_˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

(42)

Marginal and Conditional over Functions

§ Marginal

pp f_˚, f q “ N

˜«

µ_f µ_˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

(43)

Marginal and Conditional over Functions

§ Marginal

pp f_˚, f q “ N

˜«

µ_f µ_˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

(44)

Marginal and Conditional over Functions

§ Marginal

pp f_˚, f q “ N

˜«

µ_f µ_˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

(45)

Kernelization

pp f_˚, f q “N ˆ„µ_f

µ_˚

 ,

„Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f_˚|f q “N^{`m, S˘}

m “ µ_˚` Σ˚fΣ^´1_{f f} pf ´ µq S “ Σ˚˚ ´ Σ˚fΣ^´1_{f f}Σf ˚

§ Akernel functionk is symmetric and positive definite.

§ Kernel computes covariances between unknown function values f px_iqand f px_jqby just looking at the corresponding inputs x_i, x_j

Σ^pi,jq_{f f} “Covr f px_iq, f px_jqs “kpx_i, x_jq

§ This yields the predictive distribution pp f_˚|f , X, X˚q “N^{`m, S˘}

m “ µ_˚` kpX_˚, XqkpX, Xq^´1pf ´ µq S “ K˚˚ ´ kpX˚, XqkpX, Xq^´1kpX, X˚q

(46)

Kernelization

pp f_˚, f q “N ˆ„µ_f

µ_˚

 ,

„Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f_˚|f q “N^{`m, S˘}

(47)

Kernelization

pp f_˚, f q “N ˆ„µ_f

µ_˚

 ,

„Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f_˚|f q “N^{`m, S˘}

(48)

Kernelization

pp f_˚, f q “N ˆ„µ_f

µ_˚

 ,

„Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f_˚|f q “N^{`m, S˘}

(49)

GP Regression as a Bayesian Inference Problem

Objective

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N^{` f pXq, σ}ε²I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpm_post, k_postq

m_postpx_iq “mpx_iq `kpX, x_iq^JpK ` σ_ε²Iq^´1py ´ mpx_iqq k_postpx_i, x_jq “kpx_i, x_jq ´kpX, x_iq^JpK ` σ_ε²Iq^´1kpX, x_jq

(50)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

(51)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k

Likelihood (noise model): ppy| f , Xq “N^{` f pXq, σ}ε²I˘ Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpm_post, k_postq

(52)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

(53)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f

Posterior: pp f |y, Xq “ GPpm_post, k_postq

(54)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

(55)

GP Regression as a Bayesian Inference Problem

Objective

ppy|Xq

m_postpx_iq “mpx_iq `kpX, x_iq^JpK ` σ_ε²Iq^´1py ´ mpx_iqq k_postpx_i, x_jq “kpx_i, x_jq ´kpX, x_iq^JpK ` σ²Iq^´1kpX, x_jq

(56)

GP Regression as a Bayesian Inference Problem (2)

Posterior Gaussian process:

mpostpx_iq “mpx_iq `kpX, x_iq^JpK ` σ_ε²Iq^´1py ´ mpx_iqq kpostpx_i, x_jq “kpx_i, x_jq ´kpX, x_iq^JpK ` σ_ε²Iq^´1kpX, x_jq

Predictive distributionpp f_˚|X, y, x˚qat test inputs x˚: pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚s˘

Er f˚|X, y, x˚s “m_postpx˚q “mpx˚q `kpX, x˚q^JpK ` σ_ε²Iq^´1py ´ mpx˚qq Vr f˚|X, y, x˚s “kpostpx_˚, x˚q “kpx˚, x˚q ´kpX, x˚q^JpK ` σ_ε²Iq^´1kpX, x˚q From now: Set prior mean function m ” 0

(57)

GP Regression as a Bayesian Inference Problem (2)

mpostpx_iq “mpx_iq `kpX, x_iq^JpK ` σ_ε²Iq^´1py ´ mpx_iqq kpostpx_i, x_jq “kpx_i, x_jq ´kpX, x_iq^JpK ` σ_ε²Iq^´1kpX, x_jq Predictive distributionpp f_˚|X, y, x˚qat test inputs x˚:

pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚s˘

Er f˚|X, y, x˚s “m_postpx_˚q “mpx_˚q `kpX, x_˚q^JpK ` σ_ε²Iq^´1py ´ mpx_˚qq Vr f˚|X, y, x˚s “kpostpx_˚, x˚q “kpx˚, x˚q ´kpX, x˚q^JpK ` σ_ε²Iq^´1kpX, x˚q

From now: Set prior mean function m ” 0

(58)

GP Regression as a Bayesian Inference Problem (2)

mpostpx_iq “mpx_iq `kpX, x_iq^JpK ` σ_ε²Iq^´1py ´ mpx_iqq kpostpx_i, x_jq “kpx_i, x_jq ´kpX, x_iq^JpK ` σ_ε²Iq^´1kpX, x_jq Predictive distributionpp f_˚|X, y, x˚qat test inputs x˚:

pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚s˘

Er f˚|X, y, x˚s “m_postpx_˚q “mpx_˚q `kpX, x_˚q^JpK ` σ_ε²Iq^´1py ´ mpx_˚qq Vr f˚|X, y, x˚s “kpostpx_˚, x˚q “kpx˚, x˚q ´kpX, x˚q^JpK ` σ_ε²Iq^´1kpX, x˚q From now: Set prior mean function m ” 0

(59)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Priorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚,∅^s ^“^mpx˚q “0

Vr f px˚q|x˚,∅s “ σ²px˚q “kpx˚, x˚q

(60)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Priorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚,∅^s ^“^mpx˚q “0

Vr f px˚q|x˚,∅s “ σ²px˚q “kpx˚, x˚q

(61)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Posteriorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚, X, ys “ mpx˚q “kpX, x˚q^JpK ` σ_ε²Iq^´1y

Vr f px˚q|x˚, X, ys “ σ²px˚q “kpx˚, x˚q´kpX, x˚q^JpK ` σ_ε²Iq^´1kpX, x˚q

(62)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Posteriorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚, X, ys “ mpx˚q “kpX, x˚q^JpK ` σ_ε²Iq^´1y

Vr f px˚q|x˚, X, ys “ σ²px˚q “kpx˚, x˚q´kpX, x˚q^JpK ` σ_ε²Iq^´1kpX, x˚q

Gaussian Processes for Big Data Problems

Gaussian Processes for Big Data Problems

Problem Setting

Problem Setting

(Some) Relevant Application Areas

Table of Contents

Table of Contents

Key Concepts in Probability Theory

x

y

Key Concepts in Probability Theory

x

y

Key Concepts in Probability Theory

x

y

The Gaussian Distribution

The Gaussian Distribution

The Gaussian Distribution

Sampling from a Multivariate Gaussian

Sampling from a Multivariate Gaussian

Sampling from a Multivariate Gaussian

Sampling from a Multivariate Gaussian (2)

Sampling from a Multivariate Gaussian (2)

Conditional

Conditional

Conditional

Marginal

Table of Contents

Gaussian Process

Gaussian Process

Gaussian Process

Gaussian Process

Gaussian Process

The Gaussian in the Limit

The Gaussian in the Limit

The Gaussian in the Limit

Marginal and Conditional in the Limit

Marginal and Conditional in the Limit

Marginal and Conditional in the Limit

Marginal and Conditional in the Limit

Marginal and Conditional in the Limit

Back to Regression: Distribution over Functions

Back to Regression: Distribution over Functions

Back to Regression: Distribution over Functions

Marginal and Conditional over Functions

Marginal and Conditional over Functions

Marginal and Conditional over Functions

Marginal and Conditional over Functions

Marginal and Conditional over Functions

Kernelization

Kernelization

Kernelization

Kernelization

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem

GP Regression as a Bayesian Inference Problem (2)

GP Regression as a Bayesian Inference Problem (2)

GP Regression as a Bayesian Inference Problem (2)

Illustration

Illustration

Illustration

Illustration