

2.3 Mapping Function

2.3.2 Bayesian Multivariate Linear Splines (BMLS)

The basic principle of the Bayesian approach is to calculate the conditional probability distribution of the unobserved variables of interest, given the observed data. This means that the posterior predictive distribution of a new output $y_{n+1}$ must be calculated for the new input $x_{n+1}$ given the training data set $D$, i.e.,

$$p(y_{n+1} \mid x_{n+1}, D) = \int p(y_{n+1} \mid x_{n+1}, W)\, p(W \mid D)\, dW \tag{2.14}$$

where $W$ denotes all the model parameters and the hyper-parameters¹ of the prior structures, and $p(W \mid D)$ represents the posterior probability of the parameters of the model $f$ given the training data set $D$. The estimate of speech quality is obtained as:

$$\hat{y}_{n+1} = E(y_{n+1} \mid x_{n+1}, D) = \int f(x_{n+1}, W)\, p(W \mid D)\, dW \tag{2.15}$$

¹ Hyper-parameters are the parameters of a prior distribution. This name is used to distinguish between

In real regression analysis, no single model can capture the true relationship between the inputs and the outputs. While classical methods try to select the best model by optimising its parameters, the Bayesian approach integrates out the uncertainty in the parameter values and averages over models with different numbers of basis functions, where each model is weighted by the posterior probability of its parameters. To take into account the uncertainty between models $M$ of different dimension (e.g., the number of basis functions), the expectation in Equation 2.15 is written as,

$$\hat{y}_{n+1} = \sum_{k=0}^{K} \int f(x_{n+1}, W_k, M_k)\, p(W_k \mid D, M_k)\, p(M_k \mid D)\, dW_k \tag{2.16}$$

where $M = \{M_0, \ldots, M_K\}$ is the set of entertained models and $p(M_k \mid D)$ is the posterior probability of model $M_k$, obtained from the prior distribution over models and Bayes' rule:

$$p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{p(D)} \tag{2.17}$$

That is, for a typical model $M_k$, Bayes' rule updates the prior distribution $p(M_k)$ to the posterior distribution $p(M_k \mid D)$.
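As a minimal sketch of Equations 2.16 and 2.17, the snippet below turns per-model log evidences and log priors into posterior model probabilities and then averages per-model predictions. The function names and all numbers are hypothetical and serve only as illustration; the per-model predictions are assumed to have already been averaged over each model's own parameters.

```python
import numpy as np

def model_posterior(log_evidence, log_prior):
    """Posterior p(M_k | D) from log p(D | M_k) and log p(M_k), as in Eq. 2.17."""
    log_joint = np.asarray(log_evidence, dtype=float) + np.asarray(log_prior, dtype=float)
    log_joint = log_joint - log_joint.max()   # stabilise the exponentials
    weights = np.exp(log_joint)
    return weights / weights.sum()            # normalising plays the role of p(D)

def model_average(predictions, posterior):
    """Weight each model's prediction by p(M_k | D), the outer sum of Eq. 2.16."""
    return float(np.dot(posterior, predictions))

# Hypothetical values for three candidate models (illustration only).
log_evidence = np.array([-120.0, -118.0, -119.0])
log_prior = np.log(np.full(3, 1.0 / 3.0))     # uniform prior over the models
post = model_posterior(log_evidence, log_prior)
y_hat = model_average(np.array([3.1, 3.4, 3.3]), post)
```

Working in log space avoids underflow when the evidences are very small, which is typical for realistic sample sizes.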

Since the posterior distribution has a complex form, the integral in Equation 2.14 cannot be calculated analytically. Instead, a reversible jump MCMC (Markov Chain Monte Carlo) sampling strategy [60] was used: samples are drawn from the joint posterior distribution of all the model parameters, $p(W \mid D)$, and the integral in Equation 2.14 is approximated by:

$$I \approx \frac{1}{N - n_0} \sum_{t=n_0}^{N} f(W_t) \tag{2.18}$$

where $N$ is the total number of generated samples, $W_1, \ldots, W_N$ are draws from the posterior distribution of $W$ and $n_0$ is a “burn-in” period. To give the algorithm a chance to converge to $p(W \mid D)$, the samples $W_t$ from the first few iterations of the algorithm, known as the burn-in period, are discarded; after that, convergence is assumed and every [...]. In this way, the correlation between successive samples is reduced and the generated samples of the models are less dependent [59, 61].
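The Monte Carlo average with burn-in can be sketched as follows. This is an illustration under stated assumptions rather than the sampler used here: `mcmc_estimate` and its `thin` argument are hypothetical names, and the thinning interval is not specified in the text.

```python
import numpy as np

def mcmc_estimate(f_values, n0, thin=1):
    """Average f(W_t) over the draws kept after burn-in.

    f_values : f evaluated at each draw W_1, ..., W_N
    n0       : number of initial (burn-in) draws to discard
    thin     : keep every `thin`-th draw after burn-in (assumed parameter)
    """
    f_values = np.asarray(f_values, dtype=float)
    kept = f_values[n0::thin]
    if kept.size == 0:
        raise ValueError("burn-in removes every sample")
    return kept.mean()

# Toy chain: the first 100 draws sit far from the stationary value 2.0.
f_values = np.concatenate([np.full(100, 10.0), np.full(900, 2.0)])
with_burn_in = mcmc_estimate(f_values, n0=100, thin=5)
without_burn_in = mcmc_estimate(f_values, n0=0)
```

Comparing the two estimates shows how unconverged early draws bias the average when they are not discarded.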

In the BMLS model, piecewise linear planes are used as basis functions and the regression function can be written as,

$$\hat{f}(x_i) = \beta_0 + \sum_{j=1}^{k} \beta_j\, (x_i \cdot \mu_j)_+ \tag{2.19}$$

where $\beta = (\beta_0, \ldots, \beta_k)'$ and $\mu_j = (\mu_{j0}, \ldots, \mu_{jp})$ are the regression coefficients and the basis parameters respectively; $p$ is the dimension of the predictors, $a \cdot b$ is the inner product and $(a)_+ = \max(0, a)$; $(x_i \cdot \mu_j)_+$ is a truncated linear plane whose position and orientation are determined by the parameter $\mu_j$. In matrix form, the model can be written as $Y = B\beta + \varepsilon$, where $B$ is the $n \times (k+1)$ matrix:

$$B = \begin{pmatrix} 1 & (x_1 \cdot \mu_1)_+ & \cdots & (x_1 \cdot \mu_k)_+ \\ 1 & (x_2 \cdot \mu_1)_+ & \cdots & (x_2 \cdot \mu_k)_+ \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (x_n \cdot \mu_1)_+ & \cdots & (x_n \cdot \mu_k)_+ \end{pmatrix}.$$
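The basis matrix $B$ can be built in a few lines. The sketch below assumes that each $x_i$ is augmented with a leading 1 so that it matches the $p+1$ elements of $\mu_j$; the function names and the toy data are hypothetical.

```python
import numpy as np

def design_matrix(X, mu):
    """Build the n x (k+1) matrix B of truncated linear planes.

    X  : (n, p) predictors
    mu : (k, p+1) basis parameters; mu[j, 0] multiplies a constant 1
         (assumption: each x_i is augmented with a leading 1)
    """
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float).reshape(-1, X.shape[1] + 1)
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the constant term
    hinges = np.maximum(0.0, X1 @ mu.T)              # (x_i . mu_j)_+
    return np.hstack([np.ones((X.shape[0], 1)), hinges])

def predict(X, mu, beta):
    """Evaluate f_hat(x_i) = beta_0 + sum_j beta_j (x_i . mu_j)_+ for every row of X."""
    return design_matrix(X, mu) @ np.asarray(beta, dtype=float)

X = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 1.0]])
mu = np.array([[-1.0, 1.0, 0.0],     # active when x1 > 1, ignores x2
               [0.0, 0.0, 1.0]])     # active when x2 > 0, ignores x1
B = design_matrix(X, mu)
y_hat = predict(X, mu, beta=[0.5, 2.0, -1.0])
```

With `mu` empty (k = 0) the function returns only the intercept column, which corresponds to the constant model.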

To introduce covariate selection to the basis functions, some of the elements of $\mu_{j,-0}$ (that is, $\mu_j$ without its first element) are set to zero, so that the basis function does not vary along the corresponding covariates. To determine the number and position of the non-zero elements in $\mu_{j,-0}$, two new parameters are introduced: $\gamma_j = (\gamma_{j1}, \ldots, \gamma_{jp})$ and $z$; $\gamma_{jd} = 1$ if the $d$th element of $\mu_{j,-0}$ is non-zero and $\gamma_{jd} = 0$ otherwise, and $z = \sum_{d=1}^{p} \gamma_{jd}$ gives the number of non-zero elements in $\mu_{j,-0}$ [56].
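The role of the indicator vector can be illustrated with a short sketch. `apply_selection` is a hypothetical helper, not part of any library, and the numbers are arbitrary.

```python
import numpy as np

def apply_selection(mu_j, gamma_j):
    """Zero the entries of mu_j (excluding the first) that gamma_j switches off.

    mu_j    : (p+1,) basis parameters; mu_j[0] is never masked
    gamma_j : (p,) 0/1 indicators for the remaining elements mu_j[1:]
    """
    mu_j = np.asarray(mu_j, dtype=float).copy()
    gamma_j = np.asarray(gamma_j, dtype=int)
    mu_j[1:] *= gamma_j
    z = int(gamma_j.sum())      # number of covariates the basis function uses
    return mu_j, z

masked, z = apply_selection([0.3, 1.5, -2.0, 0.7], [1, 0, 1])
```

Here the second covariate is switched off, so the resulting basis function depends only on the first and third covariates.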

The function $f$ is uniquely determined by the number of basis functions $k$, the position vectors $\mu$, the indicators $\gamma$, the number $z$, the output coefficients $\beta$ and the regression variance $\sigma^2$, which is the variance of the noise. BMLS is therefore parameterised by $W = (M_k, w)$, where $M_k = (k, \mu_k, \gamma, z)$ is defined to include the number and location of the basis functions, and $w$ includes $(\beta, \sigma^2)$.

In a Bayesian framework, the posterior probability densities of these parameters given the data set are of interest. The posterior density is obtained by combining the likelihood and the prior. The Bayesian approach is based on three basic steps [59]:

1) Assigning prior distributions to all the unknown parameters.
2) Calculating the likelihood of the training data given the parameters, $p(D \mid W)$.
3) Determining the posterior distributions of the parameters, using Bayes' rule.

Assigning prior distributions to all the unknown parameters

The prior density is used to represent information about the unknown parameters and to incorporate a preference for simpler models or smoother model outputs. The unknown parameters in the model are the number of basis functions $k$, the knot positions $\mu$, $\gamma$, $z$, the set of coefficients $\beta = (\beta_1, \ldots, \beta_k)'$ and the regression variance $\sigma^2$.

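It is preferable to choose mathematically convenient forms of prior distribution that result in a computationally tractable posterior distribution. This goal is achieved through the use of conjugate prior distributions. For the BMLS model, the conjugate choice of prior for $\beta$ and $\sigma^2$ is the normal inverse-gamma (NIG):

$$\begin{aligned} p(\beta, \sigma^2 \mid M_k) &= p(\beta \mid \sigma^2, M_k)\, p(\sigma^2 \mid M_k) = N(\beta \mid 0, \lambda^{-1}\sigma^2 I)\, \mathrm{InvGamma}(\sigma^2 \mid a, b) \\ &= \frac{b^a}{(2\pi)^{k/2}\, |\lambda^{-1}|^{1/2}\, \Gamma(a)}\, (\sigma^2)^{-(a + k/2 + 1)} \exp\left\{-\frac{\beta' \lambda \beta + 2b}{2\sigma^2}\right\} \end{aligned} \tag{2.20}$$

where $a$ and $b$ are the parameters of the inverse-gamma distribution and $\lambda$ is the precision (inverse variance) of the normal distribution.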
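For reference, the log of the NIG prior density can be evaluated as below. This is a sketch that assumes $\lambda$ is a scalar, so that $|\lambda^{-1}|$ is read as the determinant of $\lambda^{-1} I$ for a $k$-dimensional $\beta$; the function name is hypothetical.

```python
import math
import numpy as np

def log_nig_prior(beta, sigma2, a, b, lam):
    """log of N(beta | 0, (sigma2 / lam) I) * InvGamma(sigma2 | a, b).

    Assumes a scalar precision `lam` shared by all coefficients.
    """
    beta = np.asarray(beta, dtype=float)
    k = beta.size
    log_normal = (-0.5 * k * math.log(2.0 * math.pi * sigma2 / lam)
                  - 0.5 * lam * float(beta @ beta) / sigma2)
    log_inv_gamma = (a * math.log(b) - math.lgamma(a)
                     - (a + 1.0) * math.log(sigma2) - b / sigma2)
    return log_normal + log_inv_gamma

# Arbitrary illustrative values.
value = log_nig_prior(beta=[0.0, 0.0], sigma2=1.0, a=1.0, b=1.0, lam=1.0)
```

Collecting the powers of $\sigma^2$ from the two terms gives the exponent $-(a + k/2 + 1)$ that appears in Equation 2.20.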

A uniform prior is assigned to $z$, $U[1, 2, \ldots, Z]$, where $Z$ is the maximum allowed level of interaction. A uniform prior distribution is also assigned to $\gamma$ conditioned on $z$, as well as to $\mu$ conditioned on $\gamma$ and $z$. The final prior distribution is for $k$, the number of basis functions; as no information is available about this number, a uniform distribution from zero to the number of training data points is assigned to it, $p(k) = U(0, \ldots, n)$.

To summarize, the joint prior distribution on the model parameters is:

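$$p(k, \beta, \mu, \sigma^2, z, \gamma) = p(\beta \mid \sigma^2, k)\, p(\sigma^2)\, p(\mu \mid z, \gamma, k)\, p(\gamma \mid z, k)\, p(z \mid k)\, p(k) \tag{2.21}$$

where $p(\beta \mid \sigma^2, k)\, p(\sigma^2)$ is set to the normal inverse-gamma distribution and the remaining factors to uniform distributions.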
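Drawing from the uniform structural priors can be sketched as follows for a single basis function. The function name is hypothetical, `Z <= p` is assumed, and the prior on $\mu$ is left out because its support is not stated in the text.

```python
import numpy as np

def sample_structure_prior(n, p, Z, rng):
    """Draw k, z and gamma from uniform priors (one basis function).

    n : number of training points, k ~ U{0, ..., n}
    p : number of covariates
    Z : maximum interaction level, z ~ U{1, ..., Z} (assumes Z <= p)
    """
    k = int(rng.integers(0, n + 1))
    z = int(rng.integers(1, Z + 1))
    gamma = np.zeros(p, dtype=int)
    active = rng.choice(p, size=z, replace=False)   # uniform over subsets of size z
    gamma[active] = 1
    return k, z, gamma

rng = np.random.default_rng(1)
k, z, gamma = sample_structure_prior(n=50, p=6, Z=3, rng=rng)
```

Sampling the active covariates without replacement guarantees that exactly $z$ indicators are set to one.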


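Assuming that the noise term follows a normal distribution, the likelihood $p(D \mid \beta, \sigma^2, M_k)$, or alternatively $p(Y \mid X, \beta, \sigma^2, M_k)$, can be written as,

$$p(D \mid \beta, \sigma^2, M_k) = N(f_{M_k}(x), \sigma^2 I) = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{(Y - B\beta)'(Y - B\beta)}{2\sigma^2}\right\} \tag{2.22}$$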

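and, taking logarithms, the following log-likelihood of the observed data is obtained:

$$L(D \mid \beta, \sigma^2, M_k) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - f(x_i, \beta, \sigma^2, M_k) \right)^2 + \text{constant} \tag{2.23}$$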
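The Gaussian log-likelihood is straightforward to evaluate numerically. The sketch below takes the log of Equation 2.22 directly and keeps the constant term; the function name and the toy data are hypothetical.

```python
import numpy as np

def log_likelihood(Y, B, beta, sigma2):
    """Log of the Gaussian likelihood in Eq. 2.22, constant term included."""
    Y = np.asarray(Y, dtype=float)
    resid = Y - np.asarray(B, dtype=float) @ np.asarray(beta, dtype=float)
    n = Y.size
    return float(-0.5 * n * np.log(2.0 * np.pi * sigma2)
                 - 0.5 * (resid @ resid) / sigma2)

B = np.array([[1.0, 0.0], [1.0, 1.0]])
Y = np.array([1.0, 2.0])
ll_exact = log_likelihood(Y, B, beta=[1.0, 1.0], sigma2=1.0)   # zero residuals
ll_worse = log_likelihood(Y, B, beta=[0.0, 0.0], sigma2=1.0)   # larger residuals
```

Since $-\tfrac{n}{2}\log \sigma^2 = -n \log \sigma$, this matches the leading term of the log-likelihood up to the additive constant $-\tfrac{n}{2}\log 2\pi$.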

Determining the posterior distributions of the parameters

The posterior distribution of $\beta$ and $\sigma^2$ has a standard form, thanks to the conjugate prior distributions. The posterior distribution of the other parameters is complex and of varying dimension (the number of basis functions, $k$, is itself one of the unknown parameters) and cannot be calculated analytically.

As mentioned earlier, a reversible jump MCMC method is used for sampling from the posterior distribution. This method generalises the Metropolis-Hastings algorithm [59] by introducing additional move types that change the dimension of the density.

The sampling algorithm starts with one basis function with unity values for all the input features; at each iteration, the sampler can suggest one of the following three proposals:

Birth - Adding a basis function to the model

Death - Removing one of the basis functions (a Death move is not proposed when $k = 0$)

Move - Changing the parameter set of one of the existing basis functions
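Choosing among the three proposal types can be sketched as below. Equal proposal probabilities, the optional cap `k_max`, and the exclusion of the Move step when no basis function exists are assumptions made for illustration; only the rule that Death is not proposed when $k = 0$ comes from the text.

```python
import numpy as np

MOVES = ("birth", "death", "move")

def propose_move_type(k, rng, k_max=None):
    """Pick a proposal type uniformly among the admissible ones (assumed)."""
    allowed = list(MOVES)
    if k == 0:
        allowed = ["birth"]            # nothing to remove or to alter
    elif k_max is not None and k >= k_max:
        allowed.remove("birth")        # assumed upper bound on k
    return allowed[int(rng.integers(len(allowed)))]

rng = np.random.default_rng(0)
first = propose_move_type(0, rng)
later = [propose_move_type(3, rng) for _ in range(20)]
capped = [propose_move_type(3, rng, k_max=3) for _ in range(20)]
```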

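Assuming that $M = (\alpha, \beta, k, \mu, \gamma, z)$ denotes the current state of the model and the sampler proposes a model with parameters $M^* = (\alpha^*, \beta^*, k^*, \mu^*, \gamma^*, z^*)$, the proposal is accepted with probability:

$$S(M, M^*) = \min\left\{1, \frac{p(D \mid k^*, \mu^*, \gamma^*, z^*)}{p(D \mid k, \mu, \gamma, z)}\right\} \tag{2.24}$$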
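The acceptance probability in Equation 2.24 needs the likelihood with $\beta$ and $\sigma^2$ integrated out. A minimal sketch follows, assuming the NIG prior is applied with a scalar precision `lam` to every column of the basis matrix (including the intercept) and using the standard closed form for the conjugate linear model; the function names and the toy data are hypothetical.

```python
import math
import numpy as np

def log_marginal_likelihood(Y, B, a, b, lam):
    """log p(D | k, mu, gamma, z) with beta and sigma^2 integrated out.

    Assumes beta | sigma^2 ~ N(0, (sigma^2 / lam) I) on every column of B
    and sigma^2 ~ InvGamma(a, b).
    """
    Y = np.asarray(Y, dtype=float)
    B = np.asarray(B, dtype=float)
    n, m = B.shape
    precision = lam * np.eye(m) + B.T @ B          # posterior precision of beta, up to sigma^2
    mean = np.linalg.solve(precision, B.T @ Y)     # posterior mean of beta
    _, logdet = np.linalg.slogdet(precision)
    a_post = a + 0.5 * n
    b_post = b + 0.5 * float(Y @ Y - (B.T @ Y) @ mean)
    return (0.5 * m * math.log(lam) - 0.5 * logdet
            + a * math.log(b) - a_post * math.log(b_post)
            + math.lgamma(a_post) - math.lgamma(a)
            - 0.5 * n * math.log(2.0 * math.pi))

def acceptance_probability(log_ml_current, log_ml_proposed):
    """min(1, ratio of marginal likelihoods), evaluated in log space."""
    return math.exp(min(0.0, log_ml_proposed - log_ml_current))

# Toy data generated from a single hinge plus small noise (illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
Y = 3.0 * np.maximum(0.0, x) + 0.1 * rng.standard_normal(50)
B_current = np.ones((50, 1))                                       # k = 0
B_proposed = np.column_stack([np.ones(50), np.maximum(0.0, x)])    # k = 1
lm_cur = log_marginal_likelihood(Y, B_current, a=1.0, b=1.0, lam=1.0)
lm_new = log_marginal_likelihood(Y, B_proposed, a=1.0, b=1.0, lam=1.0)
accept_birth = acceptance_probability(lm_cur, lm_new)
accept_death = acceptance_probability(lm_new, lm_cur)
```

Because the coefficients and the noise variance are integrated out, only the structural parameters need to be proposed at each iteration, and the ratio is best computed in log space to avoid overflow.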