
2.2 Bayesian Modeling and Learning

2.2.1 Parametric Bayesian Modeling

In the Bayesian view we are interested in the distribution of the unknown random variables. There is an initial belief about the values of these unknowns that ultimately becomes more certain with more observations. If the likelihood and the prior are conjugate, we can compute their product in closed form and the integral in the posterior is tractable. For example, a Gaussian distribution is a conjugate prior for a Gaussian likelihood, as in the Bayesian linear regression below.

As an example of parametric Bayesian modeling, consider a regression problem in which the observations are a set of pairs $D = \{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The graphical model of this regression problem is shown in Figure 2.2 and corresponds to the following:

$$p(\theta \mid D) = \frac{1}{Z}\, p(\theta)\, p(D \mid \theta) = \frac{1}{Z}\, p(\theta) \prod_{i=1}^{n} p(x_i, y_i \mid \theta).$$

[Figure 2.4 panels: (a) Prior and posterior on weights (horizontal axis θ, vertical axis p; curves p(θ) and p(θ|D)); (b) Regression model (horizontal axis x, vertical axis y).]

Figure 2.4: Bayesian linear regression. The prior and the posterior of the parameter θ are shown in Figure 2.4(a). In Figure 2.4(b), the dots are the observed points and the red lines are three samples from the linear function defined by the posterior distribution. As shown, the posterior peaks at the true value (1 in this example).

In regression, where we do not model $p(x_i)$, we can write the likelihood as $p(y_i \mid x_i, \theta)$ (since $x_i$ is observed, its distribution is absorbed into the normalizer). Assuming a Gaussian prior $\theta \sim \mathcal{N}(0, \alpha^2 I_d)$ and a Gaussian likelihood $y_i \sim \mathcal{N}(\theta^\top x_i, \beta^2)$ with hyper-parameters $\alpha, \beta \in \mathbb{R}$, the posterior is

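$$p(\theta \mid D) = \mathcal{N}\!\left(\beta^{-2} S X^\top y,\; S\right), \qquad S = \left(\alpha^{-2} I_d + \beta^{-2} X^\top X\right)^{-1},$$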

where $I_d$ is the identity matrix of size $d$, $X$ is the $n \times d$ matrix whose rows are the observations $x_i^\top$, and $y$ is the vector of labels. The posterior is a distribution over the state of nature, which here is the linear function that generated the labels $y$. If the labels are binary ($\pm 1$), the likelihood is no longer Gaussian (i.e. not conjugate to the prior), so the integral in the posterior becomes intractable and requires approximation.
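The closed-form posterior can be checked by completing the square in $\theta$. The following is a sketch of that step, assuming the labels are stacked into a vector $y \in \mathbb{R}^n$ and $X$ has the observations $x_i^\top$ as its rows; the symbol $\mu$ for the posterior mean is introduced here for convenience and does not appear in the original text.

```latex
\begin{align*}
\log p(\theta \mid D)
  &= -\frac{1}{2\beta^{2}} \lVert y - X\theta \rVert^{2}
     - \frac{1}{2\alpha^{2}} \lVert \theta \rVert^{2} + \text{const} \\
  &= -\frac{1}{2}\, \theta^\top \left( \alpha^{-2} I_d + \beta^{-2} X^\top X \right) \theta
     + \beta^{-2}\, \theta^\top X^\top y + \text{const} \\
  &= -\frac{1}{2}\, (\theta - \mu)^\top S^{-1} (\theta - \mu) + \text{const},
  \qquad S^{-1} = \alpha^{-2} I_d + \beta^{-2} X^\top X,
  \quad \mu = \beta^{-2} S X^\top y .
\end{align*}
```

The last line is the log-density of $\mathcal{N}(\mu, S)$ up to a constant, which matches the posterior stated above.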

[Figure 2.5 diagram: nodes x, θ, and c; plates i = 1, ..., n and c = 1, ..., m.]

Figure 2.5: Graphical model of the Mixture Model

An illustration of this model at work is shown in Figure 2.4. The dots in Figure 2.4(b) represent observations generated from a linear function with a small amount of additive noise. That is, we chose a constant weight, multiplied each point on the x-axis by it, and added random noise to obtain the output shown on the y-axis, i.e. $y \sim \mathcal{N}(x^\top \theta, \beta^2)$. The prior, a zero-mean Gaussian distribution over that unknown weight, and the posterior obtained from these observations are shown in Figure 2.4(a), where the posterior is peaked at the true parameter value that generated the data. This posterior represents the belief over the space of weights that could generate the observations (given the likelihood parameter $\beta$ that specifies the noise in the observations). Three samples of the linear functions induced by this posterior are also plotted in red. If we increase the number of observations, this posterior Gaussian distribution on the weight vectors becomes sharper and looks more like a delta function. This is an important example that shows how a conjugate likelihood can lead to efficient inference. We will return to this example when we discuss Gaussian processes and Bayesian inference.
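The sharpening of the posterior with more observations can be reproduced numerically. The sketch below covers only the scalar-weight case ($d = 1$) of the regression model, using the Python standard library. The function names `posterior_scalar` and `simulate`, the input range, the noise level `beta=0.3`, and the sample sizes are illustrative assumptions, not values taken from the text.

```python
import random


def posterior_scalar(xs, ys, alpha, beta):
    """Posterior N(mean, var) over a scalar weight theta.

    Assumed model: prior theta ~ N(0, alpha^2) and likelihood
    y_i ~ N(theta * x_i, beta^2). This is the d = 1 case of
    S = (alpha^-2 I + beta^-2 X^T X)^-1 with mean beta^-2 S X^T y.
    """
    precision = alpha ** -2 + beta ** -2 * sum(x * x for x in xs)
    var = 1.0 / precision
    mean = beta ** -2 * var * sum(x * y for x, y in zip(xs, ys))
    return mean, var


def simulate(n, true_theta=1.0, beta=0.3, seed=0):
    """Draw n noisy points from y = true_theta * x (hypothetical settings)."""
    rng = random.Random(seed)
    xs = [rng.uniform(1.0, 2.2) for _ in range(n)]
    ys = [true_theta * x + rng.gauss(0.0, beta) for x in xs]
    return xs, ys


# Posterior mean and standard deviation for growing data sets.
summary = {}
for n in (5, 50, 500):
    xs, ys = simulate(n)
    mean, var = posterior_scalar(xs, ys, alpha=1.0, beta=0.3)
    summary[n] = (mean, var ** 0.5)
    print(f"n={n:4d}  posterior mean={mean:.3f}  posterior std={var ** 0.5:.4f}")
```

Because each observation adds the non-negative term $\beta^{-2} x_i^2$ to the posterior precision, the posterior standard deviation can only shrink as $n$ grows, which is the delta-function behaviour described above.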

Bayesian parametric models can be used to estimate mixture models as well. The easiest way to think about a mixture model is in clustering, where a subset of observations forms a group (k-means can be seen as a non-Bayesian mixture model). Instead of a labeled set, we have an unlabeled set $D = \{x_1, x_2, \ldots, x_n\}$ and a mixture model for the likelihood. That is, there is a parameter vector $\theta_c$ for each cluster and a hidden variable $c_i$ that specifies which cluster each observation $x_i$ belongs to. The likelihood of each observation given its cluster is $p(x_i \mid \theta_c)$ (the distribution of each observation depends on the parameter of its cluster). Then we have

$$p(\theta \mid D) = \frac{1}{Z} \sum_{c=1}^{m} p(\theta_c) \prod_{i=1}^{n} p(c_i = c)\, p(x_i \mid \theta_c). \tag{2.5}$$

The integration of the posterior distribution becomes intractable in general and requires sampling or approximate inference. If we assume $p(x_i \mid \theta_c)$ and $p(\theta_c)$ are conjugate (say Gaussian distributions, as in the regression problem), we can use Expectation Maximization (EM) to estimate the membership variables $c_i$ and the parameters $\theta_c$ (if they are not conjugate we need an approximation step inside EM, e.g. variational EM). With this procedure we enter the realm of frequentists, since Maximum a-posteriori (MAP) estimation (or Maximum Likelihood, if we ignore the prior on $\theta$) selects one set of parameters among many; nevertheless, it gives a feeling for the plausible values of the unknowns. A graphical model of the mixture model is shown in Figure 2.5.
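As a concrete illustration of EM with a conjugate prior, the following is a minimal sketch for a one-dimensional Gaussian mixture. Everything specific in it is an assumption made for the example rather than something stated in the text: the function name `em_gaussian_mixture`, a known noise standard deviation `sigma`, a $\mathcal{N}(0, \tau^2)$ prior on each cluster mean, uniform mixing weights $p(c_i = c) = 1/m$, and the synthetic data.

```python
import math
import random


def em_gaussian_mixture(xs, means, sigma=1.0, tau=10.0, n_iter=50):
    """MAP-EM for a 1-D Gaussian mixture (illustrative assumptions only).

    Each cluster mean theta_c has a N(0, tau^2) prior, the noise std
    `sigma` is known, and p(c_i = c) is uniform. Returns the MAP means
    and the responsibilities computed in the last E-step.
    """
    means = list(means)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that x_i belongs to each cluster.
        # Uniform mixing weights cancel in the normalization.
        resp = []
        for x in xs:
            logw = [-0.5 * ((x - mu) / sigma) ** 2 for mu in means]
            top = max(logw)  # subtract the max for numerical stability
            w = [math.exp(v - top) for v in logw]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: MAP update of each mean under the Gaussian prior.
        for c in range(len(means)):
            weight = sum(r[c] for r in resp)
            weighted_sum = sum(r[c] * x for r, x in zip(resp, xs))
            means[c] = (weighted_sum / sigma ** 2) / (
                1.0 / tau ** 2 + weight / sigma ** 2
            )
    return means, resp


# Synthetic data from two well-separated clusters (hypothetical values).
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(100)]
data += [rng.gauss(3.0, 1.0) for _ in range(100)]

means, resp = em_gaussian_mixture(data, means=[-1.0, 1.0])
# Hard membership estimate: the most probable cluster for each point.
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
print("estimated means:", [round(mu, 2) for mu in means])
```

Letting `tau` grow large removes the influence of the prior, so the M-step reduces to the Maximum Likelihood update. In both cases the procedure returns a single point estimate of $\theta_c$ rather than a posterior distribution.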

In document Scalable Loss-calibrated Bayesian Decision Theory and Preference Learning (Page 30-32)