
2.3 Learning from Data with Bayesian Models

2.3.1 Bayesian Methods

One of the central problems when using graphical models is to estimate the model parameters from data in the presence of latent variables. There are broadly two schools of thought regarding this problem. We can think of each of these latent variables either as having one particular, but unknown, value or as an uncertain quantity. The former perspective is usually called the frequentist approach, while the latter is referred to as the Bayesian approach, in which the parameters are not fixed values but are treated as random variables that follow probability distributions called prior distributions. Some advantages of the Bayesian approach include (Wasserman, 2013, Section 11.9), (Robert, 2007, Chapter 11), and (Neal, 2004)¹¹:

• Providing a principled framework for combining prior information and observed data. When new observations arrive, the previous posterior distribution can be used as the prior;

• Providing interpretable answers, with appropriate variances for the estimated parameters;

¹¹ Also see “A Defense of the Bayesian Choice” (Robert, 2007, Chapter 11) for an interesting


[Figure 2.8 diagram boxes: Define models and priors; Gather data; Compute posteriors; Make predictions and decisions]

Figure 2.8: The Bayesian learning process comprises four main phases: defining models with priors, collecting observations, estimating the posteriors, and using the obtained posteriors for predictions or decision making.

• Providing an elegant mechanism for building a broad range of graphical models, including hierarchical models and models for missing data problems.

However, there are also limitations associated with the Bayesian approach. Two outstanding ones are:

• The question of how to choose good prior distributions, and

• The high computational cost incurred since computing the posterior is usually intractable, especially for models with a large number of parameters.

In this thesis, we would like to handle data at a large scale, which requires a robust and resilient framework to represent the model, extract information and knowledge, and perform reasoning. The Bayesian methodology provides a refined tool that works under the umbrella of graphical models. Together, they allow us to develop robust and scalable learning models for modern datasets. The learning process under the Bayesian paradigm is summarised in Figure 2.8.

In the first step, we use probabilistic and statistical languages, such as the graphical models used in this thesis, to formulate our problem of interest. This might include developing a (graphical) model that represents and expresses aspects of our knowledge, e.g. independence assumptions and distributional forms. The model will introduce a collection of unknown parameters. Since these parameters are uncertain, we need to specify their prior distributions, which express our beliefs or prior knowledge about the parameters. The observed data may already be at hand, or may need to be collected and pre-processed (cleaning, wrangling, etc.).

The main bottleneck in Bayesian learning, however, is often the third phase: computing the posteriors, which are the distributions of the parameters given the observed data. We can then use these posteriors to predict, reason, or make decisions.


[Figure 2.9 diagram labels: Data; Priors; Independence assumptions (models); Posteriors]

Figure 2.9: Conceptual level of Bayesian learning.

The posterior is computed by combining the prior distribution with the likelihood using Bayes’ theorem

$$p(\Theta \mid D) = \frac{p(\Theta)\, p(D \mid \Theta)}{p(D)}, \tag{2.17}$$

where Θ denotes the parameters of the model and D is the observed data. Since the data D are observed and fixed, p(D) is a normalising constant, and the above equation can be written as a proportionality

$$p(\Theta \mid D) \propto p(\Theta)\, p(D \mid \Theta). \tag{2.18}$$

To make a prediction for a new data point y given the observed data D, one can marginalise out the parameters as follows

$$p(y \mid D) = \int_{\Theta} p(y \mid \Theta)\, p(\Theta \mid D)\, d\Theta.$$
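As a concrete instance of Equations (2.17) and (2.18) and of the predictive integral, the following minimal sketch uses a Beta prior on the bias Θ of a coin with Bernoulli observations. This conjugate model is an illustrative assumption and is not taken from this thesis; the function names, the hyperparameters a and b, and the toy data are hypothetical. The sketch computes the posterior in closed form, checks the predictive probability against a numerical normalisation of prior times likelihood, and shows that a previous posterior can serve as the prior for new data.

```python
# Illustrative sketch only: a Beta-Bernoulli model chosen for this example,
# not a model used in the thesis. All names below are hypothetical.

def posterior_params(a, b, data):
    """Beta(a, b) prior plus Bernoulli data (list of 0/1) gives a Beta posterior."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

def predictive_prob_one(a, b):
    """p(y = 1 | D): the mean of a Beta(a, b) posterior, a / (a + b)."""
    return a / (a + b)

def grid_predictive(a, b, data, n_grid=2000):
    """Numerical check: normalise prior * likelihood on a grid, then integrate."""
    heads = sum(data)
    tails = len(data) - heads
    # Midpoints of n_grid equal cells; this avoids evaluating at 0 and 1.
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    # Unnormalised posterior, as in the proportionality (2.18).
    unnorm = [t ** (a - 1 + heads) * (1 - t) ** (b - 1 + tails) for t in thetas]
    z = sum(unnorm)  # normalising constant, proportional to p(D)
    return sum(t * u for t, u in zip(thetas, unnorm)) / z

data = [1, 0, 1, 1, 0, 1, 1, 1]          # toy observations
a_post, b_post = posterior_params(2.0, 2.0, data)
p_closed = predictive_prob_one(a_post, b_post)
p_grid = grid_predictive(2.0, 2.0, data)

# Sequential updating: the posterior after the first batch is the prior
# for the second batch, and the result matches the single-batch posterior.
a1, b1 = posterior_params(2.0, 2.0, data[:4])
a2, b2 = posterior_params(a1, b1, data[4:])
```

Conjugacy is what makes the closed form available here; for most models of interest the integral has no such form, which is the computational cost discussed earlier in this section.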

When computing the posterior in Equation (2.17), we implicitly assume that the choice of model has already been made. However, we are usually unsure which model is right, or which is best. The schematic in Figure 2.9 shows the conceptual level of learning under the Bayesian paradigm. In this diagram, we explicitly introduce the model M as an unknown entity. But how do we choose the “right” model, the one that best describes our observed data D? One popular solution is to compare models by their evidence, i.e., the marginal likelihood of each model. The evidence is the probability of the observed data D under the assumed model M

$$p(D \mid M) = \int_{\Theta} p(D \mid \Theta)\, p(\Theta \mid M)\, d\Theta. \tag{2.19}$$

For two models M1 and M2, we prefer the model with the higher marginal likelihood p(D | M). Note also that each model M may have a different parameter space Θ.
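To make the comparison in Equation (2.19) concrete, the sketch below evaluates the log evidence of two hypothetical coin models on toy data. Neither model comes from this thesis: M1 fixes the bias at 0.5, so it has no free parameter and the integral collapses to the likelihood, while M2 places a Beta(a, b) prior on the bias, for which the integral has a closed form in terms of the Beta function. The function names and datasets are assumptions made for illustration.

```python
import math

# Illustrative sketch only: two hypothetical coin models compared by evidence.

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b), computed with log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_fixed(data, theta=0.5):
    """M1: the bias is fixed, so p(D | M1) is just the likelihood at theta."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(theta) + tails * math.log(1 - theta)

def log_evidence_beta(data, a=1.0, b=1.0):
    """M2: theta ~ Beta(a, b); integrating out theta gives
    p(D | M2) = B(a + heads, b + tails) / B(a, b)."""
    heads = sum(data)
    tails = len(data) - heads
    return log_beta(a + heads, b + tails) - log_beta(a, b)

balanced = [1, 0] * 10        # 10 ones and 10 zeros
skewed = [1] * 18 + [0] * 2   # 18 ones and 2 zeros

for name, data in [("balanced", balanced), ("skewed", skewed)]:
    m1 = log_evidence_fixed(data)
    m2 = log_evidence_beta(data)
    preferred = "M1" if m1 > m2 else "M2"
    print(f"{name}: log p(D|M1) = {m1:.3f}, log p(D|M2) = {m2:.3f}, prefer {preferred}")
```

On the balanced data the simpler model M1 attains the higher evidence, whereas on the skewed data the more flexible model M2 does. The two models also illustrate the remark on parameter spaces: M1 has none to integrate over, while M2 integrates over the unit interval.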

Example 2.4. Let us reconsider the Gaussian mixture model (GMM) introduced in Figure 2.4b. As described in Section 2.1.1, the GMM has three groups of (value) parameters: the means µk and covariances Σk of the multivariate Gaussian components, and the vector of mixing proportions π, whose entries sum to one. In the context of Bayesian learning, we


Figure 2.10: Bayesian Gaussian mixture models: (a) graphical model for the Bayesian Gaussian mixture model; (b) an example of using a Gaussian mixture model to cluster data, in which choosing the number of clusters K by model selection with training (validation) data fails (adapted from (Liang and Klein, 2007)).

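now assume that these parameters are random variables with the following probability distributions

$$\pi \sim \mathrm{Dir}(\gamma), \qquad \tau \sim \mathrm{Gamma}(\alpha, \beta), \qquad \mu_k \sim \mathcal{N}\left(\mu_0, (\lambda\tau)^{-1} I\right),$$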

where I is the identity matrix of the appropriate size. These probability distributions for the parameters are called the priors. Data generation remains as in Equation (2.2), with the covariance matrix $\Sigma_k = (\lambda\tau)^{-1} I$. The graphical model for the Bayesian Gaussian mixture model is depicted in Figure 2.10a.
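The generative process of Example 2.4 can be sketched with the Python standard library as follows. Several details are assumptions, since they are not stated in this section: β is treated as a rate parameter of the Gamma prior, Equation (2.2) is taken to mean that each point first draws a component label from π and then a Gaussian observation around that component's mean, and the function names and hyperparameter values are hypothetical.

```python
import random

# Illustrative sketch only: sampling from the priors of Example 2.4.
# Assumptions: beta is a rate; data generation follows the usual mixture form.

def sample_dirichlet(gamma, rng):
    """Draw pi ~ Dir(gamma) by normalising independent Gamma(gamma_k, 1) draws."""
    draws = [rng.gammavariate(g, 1.0) for g in gamma]
    total = sum(draws)
    return [d / total for d in draws]

def sample_bayesian_gmm(n, gamma, alpha, beta, mu0, lam, seed=0):
    """Sample parameters from the priors, then n data points from the mixture."""
    rng = random.Random(seed)
    k = len(gamma)
    pi = sample_dirichlet(gamma, rng)
    # random.gammavariate takes (shape, scale); a rate beta means scale 1 / beta.
    tau = rng.gammavariate(alpha, 1.0 / beta)
    # Isotropic covariance (lam * tau)^-1 I: each coordinate has this std.
    std = (lam * tau) ** -0.5
    mus = [[rng.gauss(m, std) for m in mu0] for _ in range(k)]
    # Assumed form of the data generation: label z ~ Cat(pi), then
    # x ~ N(mu_z, Sigma_z) with Sigma_z = (lam * tau)^-1 I as in the text.
    labels = rng.choices(range(k), weights=pi, k=n)
    points = [[rng.gauss(m, std) for m in mus[z]] for z in labels]
    return pi, tau, mus, labels, points

pi, tau, mus, labels, points = sample_bayesian_gmm(
    n=50, gamma=[1.0, 1.0, 1.0], alpha=2.0, beta=1.0, mu0=[0.0, 0.0], lam=0.1
)
```

Fixing the seed makes the draw reproducible, which is convenient when checking an inference procedure against known ground-truth parameters.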

In the machine learning community, Gaussian mixture models are usually used to cluster data. One often chooses the number of clusters K in advance. The value of K can be considered a parameter that specifies the structure and complexity of the model M in (2.19). One class of methods for selecting K is cross-validation, which selects the model (in this case K) based on the likelihood of a validation set¹². Another class, referred to as Bayesian model selection, uses the marginal likelihood or the minimum description length as the metric for evaluating the best-fit model M. However, these model selection strategies sometimes do not work. Consider the clustering problem in Figure 2.10b, which was introduced in (Liang and Klein, 2007). We would like to cluster the (red) data points, which are two-dimensional and real-valued. To determine the number of clusters, we use cross-validation with the maximum likelihood as the criterion for selecting the complexity of the model. Unfortunately, when we increase the number of clusters, the training likelihood also


improves accordingly. In contrast, the test likelihood grows for the first few values of K but declines beyond a certain number of clusters (K = 4). Furthermore, model selection methods are usually costly, especially when learning from large-scale datasets, since we need to train many candidate models on our data to select the best one. In streaming settings, whenever new data arrive, we need to redo the model selection step over the previously observed data. Hence, there is a pressing need for an elegant and efficient framework that can deal with the difficulties of model selection while retaining attractive properties of Bayesian learning such as conjugacy, computational tractability, and posterior consistency. Bayesian nonparametrics, introduced in the following section, is one such recent and elegant framework for learning from data.