• No results found

Gaussian Processes for Big Data Problems

N/A
N/A
Protected

Academic year: 2021

Share "Gaussian Processes for Big Data Problems"

Copied!
174
0
0

Loading.... (view fulltext now)

Full text

(1)

Gaussian Processes for Big Data Problems

Marc Deisenroth

Department of Computing Imperial College London

http://wp.doc.ic.ac.uk/sml/marc-deisenroth

Machine Learning Summer School, Chalmers University 14 April 2015

(2)

http://www.gaussianprocess.org/

(3)

Problem Setting

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Probabilistic regression problem

(4)

Problem Setting

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8 9 10

−2 0 2

x

f(x)

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Probabilistic regression problem

(5)

(Some) Relevant Application Areas

§ Global black-box optimization and experimental design

§ Autonomous learning in robotics

§ Probabilistic dimensionality reduction and data visualization

(6)

Table of Contents

Introduction

Gaussian Distribution Gaussian Processes

Definition and Derivation Inference

Covariance Functions Model Selection Limitations

Scaling Gaussian Processes to Large Data Sets Sparse Gaussian Processes

Distributed Gaussian Processes

(7)

Table of Contents

Introduction

Gaussian Distribution Gaussian Processes

Definition and Derivation Inference

Covariance Functions Model Selection Limitations

Scaling Gaussian Processes to Large Data Sets Sparse Gaussian Processes

Distributed Gaussian Processes

(8)

Key Concepts in Probability Theory

Two fundamental rules:

ppxq “ ż

ppx, yqdy Sum rule/Marginalization property ppx, yq “ ppy|xqppxq Product rule

Bayes’ Theorem (Probabilistic Inverse) ppx|yq “ ppy|xq ppxq

ppyq , x : hypothesis, y : measurement

§ Posterior belief

§ Prior belief

§ Likelihood (measurement model)

§ Marginal likelihood (normalization constant)

x

y

(9)

Key Concepts in Probability Theory

Two fundamental rules:

ppxq “ ż

ppx, yqdy Sum rule/Marginalization property ppx, yq “ ppy|xqppxq Product rule

Bayes’ Theorem (Probabilistic Inverse) ppx|yq “ ppy|xq ppxq

ppyq , x : hypothesis, y : measurement

§ Posterior belief

§ Prior belief

§ Likelihood (measurement model)

§ Marginal likelihood (normalization constant)

x

y

(10)

Key Concepts in Probability Theory

Two fundamental rules:

ppxq “ ż

ppx, yqdy Sum rule/Marginalization property ppx, yq “ ppy|xqppxq Product rule

Bayes’ Theorem (Probabilistic Inverse) ppx|yq “ ppy|xq ppxq

ppyq , x : hypothesis, y : measurement

§ Posterior belief

§ Prior belief

§ Likelihood (measurement model)

§ Marginal likelihood (normalization constant)

x

y

(11)

The Gaussian Distribution

ppx|µ,Σq “ p2πq´D2|Σ|´12exp`

´12px ´ µqJΣ´1px ´ µq˘

§ Mean vectorµ Average of the data

§ Covariance matrixΣ Spread of the data

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

p(x)

p(x) Mean

95% confidence bound

(12)

The Gaussian Distribution

ppx|µ,Σq “ p2πq´D2|Σ|´12exp`

´12px ´ µqJΣ´1px ´ µq˘

§ Mean vectorµ Average of the data

§ Covariance matrixΣ Spread of the data

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

x

p(x)

p(x) Mean

95% confidence bound

−5 −4 −3 −2 −1 0 1 2 3

−3

−2

−1 0 1 2 3

x 1 x2

Mean

95% confidence bound

(13)

The Gaussian Distribution

ppx|µ,Σq “ p2πq´D2|Σ|´12exp`

´12px ´ µqJΣ´1px ´ µq˘

§ Mean vectorµ Average of the data

§ Covariance matrixΣ Spread of the data

−4 −3 −2 −1 0 1 2 3 4 5 6

0 0.05 0.1 0.15 0.2 0.25 0.3

p(x)

Data p(x) Mean

95% confidence interval

−5 −4 −3 −2 −1 0 1 2 3

−3

−2

−1 0 1 2 3

x x2

Data Mean

95% confidence bound

(14)

Sampling from a Multivariate Gaussian

Objective

Generate a random sample y „N`µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

However, we only have access to a random number generator that can sample x fromN`0, I˘...

Exploit that affine transformations y “ Ax ` b of Gaussians remain Gaussian

§ Mean: ExrAx ` bs “ AExrxs ` b

§ Covariance: VxrAx ` bs “ AVxrxsAJ

1. Find conditions for A, b to match the mean of y 2. Find conditions for A, b to match the covariance of y

(15)

Sampling from a Multivariate Gaussian

Objective

Generate a random sample y „N`µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

However, we only have access to a random number generator that can sample x fromN`0, I˘...

Exploit that affine transformations y “ Ax ` b of Gaussians remain Gaussian

§ Mean: ExrAx ` bs “ AExrxs ` b

§ Covariance: VxrAx ` bs “ AVxrxsAJ

1. Find conditions for A, b to match the mean of y 2. Find conditions for A, b to match the covariance of y

(16)

Sampling from a Multivariate Gaussian

Objective

Generate a random sample y „N`µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

However, we only have access to a random number generator that can sample x fromN`0, I˘...

Exploit that affine transformations y “ Ax ` b of Gaussians remain Gaussian

§ Mean: ExrAx ` bs “ AExrxs ` b

§ Covariance: VxrAx ` bs “ AVxrxsAJ

1. Find conditions for A, b to match the mean of y 2. Find conditions for A, b to match the covariance of y

(17)

Sampling from a Multivariate Gaussian (2)

Objective

Generate a random sample y „N`µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

x = randn(D,1); Sample x „N`0, I˘

y = chol(Σ)’*x + µ; Scale x

Here chol(Σ) is the Cholesky factor L, such that LJL “Σ

Therefore, the mean and covariance of y are Erys “ ¯y “ ErLJx ` µs “ LJErxs ` µ “ µ

Covrys “ Erpy ´ ¯yqpy ´ ¯yqJs “ErLJxxJLs “ LJErxxJsL “ LJL “ Σ

(18)

Sampling from a Multivariate Gaussian (2)

Objective

Generate a random sample y „N`µ, Σ˘ from a D-dimensional joint Gaussian with covariance matrixΣ and mean vector µ.

x = randn(D,1); Sample x „N`0, I˘

y = chol(Σ)’*x + µ; Scale x

Here chol(Σ) is the Cholesky factor L, such that LJL “Σ Therefore, the mean and covariance of y are

Erys “ ¯y “ ErLJx ` µs “ LJErxs ` µ “ µ

Covrys “ Erpy ´ ¯yqpy ´ ¯yqJs “ErLJxxJLs “ LJErxxJsL “ LJL “ Σ

(19)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) ppx, yq “N

˜«

µx µy

ff

,« Σxx Σxy

Σyx Σyy

ff¸

ppx|yq “N`µx|y, Σx|y

˘

µx|y“ µx ` Σxy Σ´1yy py ´ µyq Σx|yΣxx ´ Σxy Σ´1yy Σyx

Conditional ppx|yq is also Gaussian Computationally convenient

(20)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y)

Observation ppx, yq “N

˜«

µx µy

ff

,« Σxx Σxy

Σyx Σyy

ff¸

ppx|yq “N`µx|y, Σx|y

˘

µx|y“ µx ` Σxy Σ´1yy py ´ µyq Σx|yΣxx ´ Σxy Σ´1yy Σyx

Conditional ppx|yq is also Gaussian Computationally convenient

(21)

Conditional

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) Observation y Conditional p(x|y)

ppx, yq “N

˜«

µx µy

ff

,« Σxx Σxy

Σyx Σyy

ff¸

ppx|yq “N`µx|y, Σx|y

˘

µx|y“ µx ` Σxy Σ´1yy py ´ µyq Σx|yΣxx ´ Σxy Σ´1yy Σyx

Conditional ppx|yq is also Gaussian Computationally convenient

(22)

Marginal

x

-6 -4 -2 0 2 4

y

-5 -4 -3 -2 -1 0 1 2 3

Joint p(x,y) Marginal p(x)

ppx, yq “N

˜«

µx µy

ff

,« Σxx Σxy

Σyx Σyy

ff¸

pp x q “ ż

pp x , y qd y

“N`µx, Σxx

˘

§ The marginal of a joint Gaussian distribution is Gaussian

§ Intuitively: Ignore (integrate out) everything you are not interested in

(23)

Table of Contents

Introduction

Gaussian Distribution Gaussian Processes

Definition and Derivation Inference

Covariance Functions Model Selection Limitations

Scaling Gaussian Processes to Large Data Sets Sparse Gaussian Processes

Distributed Gaussian Processes

(24)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x1, x2, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system xt`1Axt`w, w „N`0, Q˘ x1, x2, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(25)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x1, x2, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system xt`1Axt`w, w „N`0, Q˘

x1, x2, . . . is a collection of random variables

, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(26)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x1, x2, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system xt`1Axt`w, w „N`0, Q˘

x1, x2, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(27)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x1, x2, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system xt`1Axt`w, w „N`0, Q˘

x1, x2, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(28)

Gaussian Process

Definition

AGaussian process(GP) is a collection of random variables x1, x2, . . . , any finite number of which is Gaussian distributed.

Unexpected (?) example: Linear dynamical system xt`1Axt`w, w „N`0, Q˘

x1, x2, . . . is a collection of random variables, any finite number of which is Gaussian distributed

This is a GP

But we will use the Gaussian process for a different purpose, not for time series

(29)

The Gaussian in the Limit

Consider thejoint Gaussian distributionppx, ˜xq, where x P RD and

˜x P Rk, k Ñ 8

Then

ppx, ˜xq “N

˜« µx µ˜x

ff ,

Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPRkˆkandΣx˜x PRDˆk, k Ñ 8. However, themarginal remains finite

pp x q “ ż

pp x , ˜x qd ˜x “N`µx, Σxx

˘ where we integrate out an infinite number of variables ˜xi.

(30)

The Gaussian in the Limit

Consider thejoint Gaussian distributionppx, ˜xq, where x P RD and

˜x P Rk, k Ñ 8 Then

ppx, ˜xq “N

˜«

µx µ˜x

ff ,

Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPRkˆkandΣx˜x PRDˆk, k Ñ 8.

However, themarginal remains finite pp x q “

ż

pp x , ˜x qd ˜x “N`µx, Σxx

˘ where we integrate out an infinite number of variables ˜xi.

(31)

The Gaussian in the Limit

Consider thejoint Gaussian distributionppx, ˜xq, where x P RD and

˜x P Rk, k Ñ 8 Then

ppx, ˜xq “N

˜«

µx µ˜x

ff ,

Σxx Σx˜x

Σ˜xx Σ˜x ˜x

¸

whereΣ˜x ˜xPRkˆkandΣx˜x PRDˆk, k Ñ 8.

However, themarginal remains finite pp x q “

ż

pp x , ˜x qd ˜x “N`µx, Σxx

˘ where we integrate out an infinite number of variables ˜xi.

(32)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test dataxtrain, xtest

§ Then, x “ txtrain, xtest, xotheru

(xotherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˚

˝

»

— –

µtrain µtest µother

fi ffi ffi fl,

»

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σtest,other Σother,train Σother,test Σother

fi ffi ffi fl

˛

ppxtrain, xtestq “ ż

pp xtrain, xtest, xotherqd xother ppxtest|xtrainq “N`µ˚, Σ˚

˘

µ˚ “ µtest ` Σtest,train Σ´1trainpxtrain´ µtrainq Σ˚Σtest ´ Σtest,train Σ´1train Σtrain,test

(33)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test dataxtrain, xtest

§ Then, x “ txtrain, xtest, xotheru

(xotherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˚

˝

»

— –

µtrain µtest µother

fi ffi ffi fl,

»

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σtest,other Σother,train Σother,test Σother

fi ffi ffi fl

˛

ppxtrain, xtestq “ ż

pp xtrain, xtest, xotherqd xother ppxtest|xtrainq “N`µ˚, Σ˚

˘

µ˚ “ µtest ` Σtest,train Σ´1trainpxtrain´ µtrainq Σ˚Σtest ´ Σtest,train Σ´1train Σtrain,test

(34)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test dataxtrain, xtest

§ Then, x “ txtrain, xtest, xotheru

(xotherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˚

˝

»

— –

µtrain µtest µother

fi ffi ffi fl,

»

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σtest,other

Σother,train Σother,test Σother

fi ffi ffi fl

˛

ppxtrain, xtestq “ ż

pp xtrain, xtest, xotherqd xother ppxtest|xtrainq “N`µ˚, Σ˚

˘

µ˚ “ µtest ` Σtest,train Σ´1trainpxtrain´ µtrainq Σ˚Σtest ´ Σtest,train Σ´1train Σtrain,test

(35)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test dataxtrain, xtest

§ Then, x “ txtrain, xtest, xotheru

(xotherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˚

˝

»

— –

µtrain µtest µother

fi ffi ffi fl,

»

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σtest,other

Σother,train Σother,test Σother

fi ffi ffi fl

˛

ppxtrain, xtestq “ ż

pp xtrain, xtest, xotherqd xother

ppxtest|xtrainq “N`µ˚, Σ˚

˘

µ˚ “ µtest ` Σtest,train Σ´1trainpxtrain´ µtrainq Σ˚Σtest ´ Σtest,train Σ´1train Σtrain,test

(36)

Marginal and Conditional in the Limit

§ In practice, we considerfinite training and test dataxtrain, xtest

§ Then, x “ txtrain, xtest, xotheru

(xotherplays the role of ˜x from previous slide)

ppxq “N

¨

˚

˚

˝

»

— –

µtrain µtest µother

fi ffi ffi fl,

»

— –

Σtrain Σtrain,test

Σtest,train Σtest

Σtrain,other

Σtest,other

Σother,train Σother,test Σother

fi ffi ffi fl

˛

ppxtrain, xtestq “ ż

pp xtrain, xtest, xotherqd xother ppxtest|xtrainq “N`µ˚, Σ˚

˘

µ˚ “ µtest ` Σtest,train Σ´1trainpxtrain´ µtrainq Σ˚Σtest ´ Σtest,train Σ´1train Σtrain,test

(37)

Back to Regression: Distribution over Functions

§ We are not really interested in a distribution on x, but rather in a distribution over function values f pxq

§ Let’s replace x with function values f “ f pxq

(Treat a function as a long vector of function values)

pp f , ˜f q “N

¨

˝

» –

µf µ˜f

fi fl,

» –

Σf f Σf ˜f Σ˜ff Σ˜f ˜f

fi fl

˛

whereΣ˜f ˜fPRkˆkandΣf ˜fPRNˆk, k Ñ 8.

§ Again, themarginal remains finite pp f q “

ż

pp f , ˜f qd ˜f “N`µf , Σf f

˘

(38)

Back to Regression: Distribution over Functions

§ We are not really interested in a distribution on x, but rather in a distribution over function values f pxq

§ Let’s replace x with function values f “ f pxq

(Treat a function as a long vector of function values)

pp f , ˜f q “N

¨

˝

» –

µf µ˜f

fi fl,

» –

Σf f Σf ˜f Σ˜ff Σ˜f ˜f

fi fl

˛

whereΣ˜f ˜fPRkˆkandΣf ˜fPRNˆk, k Ñ 8.

§ Again, themarginal remains finite pp f q “

ż

pp f , ˜f qd ˜f “N`µf , Σf f

˘

(39)

Back to Regression: Distribution over Functions

§ We are not really interested in a distribution on x, but rather in a distribution over function values f pxq

§ Let’s replace x with function values f “ f pxq

(Treat a function as a long vector of function values)

pp f , ˜f q “N

¨

˝

» –

µf µ˜f

fi fl,

» –

Σf f Σf ˜f Σ˜ff Σ˜f ˜f

fi fl

˛

whereΣ˜f ˜fPRkˆkandΣf ˜fPRNˆk, k Ñ 8.

§ Again, themarginal remains finite pp f q “

ż

pp f , ˜f qd ˜f “N`µf , Σf f

˘

(40)

Marginal and Conditional over Functions

Define f˚ :“ ftest, f :“ ftrain.

§ Marginal

pp f˚, f q “ N

˜« µf µ˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f˚|f q “N`m, S˘

m “ µ˚`Σ˚fΣ´1f fpf ´ µq S “Σ˚˚´Σ˚fΣ´1f fΣf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(41)

Marginal and Conditional over Functions

Define f˚ :“ ftest, f :“ ftrain.

§ Marginal

pp f˚, f q “ N

˜«

µf µ˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f˚|f q “N`m, S˘

m “ µ˚`Σ˚fΣ´1f fpf ´ µq S “Σ˚˚´Σ˚fΣ´1f fΣf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(42)

Marginal and Conditional over Functions

Define f˚ :“ ftest, f :“ ftrain.

§ Marginal

pp f˚, f q “ N

˜«

µf µ˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f˚|f q “N`m, S˘

m “ µ˚`Σ˚fΣ´1f fpf ´ µq S “Σ˚˚´Σ˚fΣ´1f fΣf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(43)

Marginal and Conditional over Functions

Define f˚ :“ ftest, f :“ ftrain.

§ Marginal

pp f˚, f q “ N

˜«

µf µ˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f˚|f q “N`m, S˘

m “ µ˚`Σ˚fΣ´1f fpf ´ µq S “Σ˚˚´Σ˚fΣ´1f fΣf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(44)

Marginal and Conditional over Functions

Define f˚ :“ ftest, f :“ ftrain.

§ Marginal

pp f˚, f q “ N

˜«

µf µ˚

ff

, «Σf f Σf ˚

Σ˚f Σ˚˚

ff¸

§ Conditional(predictive distribution) pp f˚|f q “N`m, S˘

m “ µ˚`Σ˚fΣ´1f fpf ´ µq S “Σ˚˚´Σ˚fΣ´1f fΣf ˚

We need to compute (cross-)covariances between unknownfunction values

Thekernel trickhelps us out

(45)

Kernelization

pp f˚, f q “N ˆ„µf

µ˚

 ,

Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f˚|f q “N`m, S˘

m “ µ˚` Σ˚fΣ´1f f pf ´ µq S “ Σ˚˚ ´ Σ˚fΣ´1f fΣf ˚

§ Akernel functionk is symmetric and positive definite.

§ Kernel computes covariances between unknown function values f pxiqand f pxjqby just looking at the corresponding inputs xi, xj

Σpi,jqf fCovr f pxiq, f pxjqs “kpxi, xjq

§ This yields the predictive distribution pp f˚|f , X, X˚q “N`m, S˘

m “ µ˚` kpX˚, XqkpX, Xq´1pf ´ µq S “ K˚˚ ´ kpX˚, XqkpX, Xq´1kpX, X˚q

(46)

Kernelization

pp f˚, f q “N ˆ„µf

µ˚

 ,

Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f˚|f q “N`m, S˘

m “ µ˚` Σ˚fΣ´1f f pf ´ µq S “ Σ˚˚ ´ Σ˚fΣ´1f fΣf ˚

§ Akernel functionk is symmetric and positive definite.

§ Kernel computes covariances between unknown function values f pxiqand f pxjqby just looking at the corresponding inputs xi, xj

Σpi,jqf fCovr f pxiq, f pxjqs “kpxi, xjq

§ This yields the predictive distribution pp f˚|f , X, X˚q “N`m, S˘

m “ µ˚` kpX˚, XqkpX, Xq´1pf ´ µq S “ K˚˚ ´ kpX˚, XqkpX, Xq´1kpX, X˚q

(47)

Kernelization

pp f˚, f q “N ˆ„µf

µ˚

 ,

Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f˚|f q “N`m, S˘

m “ µ˚` Σ˚fΣ´1f f pf ´ µq S “ Σ˚˚ ´ Σ˚fΣ´1f fΣf ˚

§ Akernel functionk is symmetric and positive definite.

§ Kernel computes covariances between unknown function values f pxiqand f pxjqby just looking at the corresponding inputs xi, xj

Σpi,jqf fCovr f pxiq, f pxjqs “kpxi, xjq

§ This yields the predictive distribution pp f˚|f , X, X˚q “N`m, S˘

m “ µ˚` kpX˚, XqkpX, Xq´1pf ´ µq S “ K˚˚ ´ kpX˚, XqkpX, Xq´1kpX, X˚q

(48)

Kernelization

pp f˚, f q “N ˆ„µf

µ˚

 ,

Σf f Σf ˚

Σ˚f Σ˚˚

˙

pp f˚|f q “N`m, S˘

m “ µ˚` Σ˚fΣ´1f f pf ´ µq S “ Σ˚˚ ´ Σ˚fΣ´1f fΣf ˚

§ Akernel functionk is symmetric and positive definite.

§ Kernel computes covariances between unknown function values f pxiqand f pxjqby just looking at the corresponding inputs xi, xj

Σpi,jqf fCovr f pxiq, f pxjqs “kpxi, xjq

§ This yields the predictive distribution pp f˚|f , X, X˚q “N`m, S˘

m “ µ˚` kpX˚, XqkpX, Xq´1pf ´ µq S “ K˚˚ ´ kpX˚, XqkpX, Xq´1kpX, X˚q

(49)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(50)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(51)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k

Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘ Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(52)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(53)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f

Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(54)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

(55)

GP Regression as a Bayesian Inference Problem

Objective

For a set of observations yif pxiq ` ε, ε „N`0, σε2˘, find a distribution over functionspp f q that explains the data

Training data: X, y. Bayes’ theorem yields pp f |X, yq “ ppy| f , Xq pp f q

ppy|Xq

Prior: pp f q “ GPpm, kq Specify mean m function and kernel k Likelihood (noise model): ppy| f , Xq “N` f pXq, σε2I˘

Marginal likelihood (evidence): ppy|Xq “ş ppy| f , Xqpp f qd f Posterior: pp f |y, Xq “ GPpmpost, kpostq

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σ2Iq´1kpX, xjq

(56)

GP Regression as a Bayesian Inference Problem (2)

Posterior Gaussian process:

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq

Predictive distributionpp f˚|X, y, x˚qat test inputs x˚: pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚

Er f˚|X, y, x˚s “mpostpx˚q “mpx˚q `kpX, x˚qJpK ` σε2Iq´1py ´ mpx˚qq Vr f˚|X, y, x˚s “kpostpx˚, x˚q “kpx˚, x˚q ´kpX, x˚qJpK ` σε2Iq´1kpX, x˚q From now: Set prior mean function m ” 0

(57)

GP Regression as a Bayesian Inference Problem (2)

Posterior Gaussian process:

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq Predictive distributionpp f˚|X, y, x˚qat test inputs x˚:

pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚

Er f˚|X, y, x˚s “mpostpx˚q “mpx˚q `kpX, x˚qJpK ` σε2Iq´1py ´ mpx˚qq Vr f˚|X, y, x˚s “kpostpx˚, x˚q “kpx˚, x˚q ´kpX, x˚qJpK ` σε2Iq´1kpX, x˚q

From now: Set prior mean function m ” 0

(58)

GP Regression as a Bayesian Inference Problem (2)

Posterior Gaussian process:

mpostpxiq “mpxiq `kpX, xiqJpK ` σε2Iq´1py ´ mpxiqq kpostpxi, xjq “kpxi, xjq ´kpX, xiqJpK ` σε2Iq´1kpX, xjq Predictive distributionpp f˚|X, y, x˚qat test inputs x˚:

pp f˚|X, y, x˚q “N`Er f˚s, Vr f˚

Er f˚|X, y, x˚s “mpostpx˚q “mpx˚q `kpX, x˚qJpK ` σε2Iq´1py ´ mpx˚qq Vr f˚|X, y, x˚s “kpostpx˚, x˚q “kpx˚, x˚q ´kpX, x˚qJpK ` σε2Iq´1kpX, x˚q From now: Set prior mean function m ” 0

(59)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Priorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚,∅s mpx˚q “0

Vr f px˚q|x˚,∅s “ σ2px˚q “kpx˚, x˚q

(60)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Priorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚,∅s mpx˚q “0

Vr f px˚q|x˚,∅s “ σ2px˚q “kpx˚, x˚q

(61)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Posteriorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚, X, ys “ mpx˚q “kpX, x˚qJpK ` σε2Iq´1y

Vr f px˚q|x˚, X, ys “ σ2px˚q “kpx˚, x˚kpX, x˚qJpK ` σε2Iq´1kpX, x˚q

(62)

Illustration

−5 −4 −3 −2 −1 0 1 2 3 4 5 6 7 8

−3

−2

−1 0 1 2 3

x

f(x)

Posteriorbelief about the function Predictive (marginal) mean and variance:

Er f px˚q|x˚, X, ys “ mpx˚q “kpX, x˚qJpK ` σε2Iq´1y

Vr f px˚q|x˚, X, ys “ σ2px˚q “kpx˚, x˚kpX, x˚qJpK ` σε2Iq´1kpX, x˚q

References

Related documents

Structural unity among viral origin binding proteins: crystal structure of the nuclease domain of adeno-associated virus Rep. The nuclease domain of adeno-associated virus

A complex, rapidly changing product mix; a supply chain complicated by varying product lifecycles and product returns; issues of time horizons, communication, and field

Why did the Agency determine that the regulatory intent justifies the adverse impact to the regulated business community.. The rules are required of the Division to

Controlled protease processing of the blended fabric dyed with Lanasol Blue CE enabled the degradation and removal of the dyed wool fibre component from the fabric blend, resulting

Second, in the mobile environment, service delivery is crucial because it can determine the content available, and involves how users will pay for and access the services

The professionalization of the army during the post-war years can be shown in five examples; the development of a General Staff and Commanding General, the production and

To the author’s best knowledge no studies has been reported yet on non-aligned bioconvective stagnation point flow of a nanofluid comprising gyrotactic microorganisms across

2005-2007 Practicum Therapist , Florida State University Psychology Clinic, Tallahassee, FL Duties : Outpatient individual psychotherapy and psychological assessments..