
Parametric Probabilistic Models

We now focus on parametric probabilistic models, that is, models where the form of the probability distribution p(X, Y, θ) is assumed to be known and is therefore completely governed by its set of parameters θ. Once the model parameters have been estimated, the probability distribution is completely known. For example, if the data are assumed to be continuous and normally distributed, the joint probability distribution of the data for each value of the class variable is a Gaussian distribution N(µ, Σ), and the learning phase consists of estimating the mean µ and the covariance matrix Σ for each value of the class label Y.
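The learning phase described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `fit_class_gaussians` and the synthetic two-class data are hypothetical, and the covariance is normalised by the number of samples in each class.

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Estimate one Gaussian N(mu, Sigma) per value of the class label.

    X has shape (n_samples, n_features), y has shape (n_samples,).
    Returns a dict mapping each label to its (mean, covariance) pair.
    """
    params = {}
    for label in np.unique(y):
        X_c = X[y == label]                            # samples of this class
        mu = X_c.mean(axis=0)                          # sample mean
        # bias=True normalises by N rather than N - 1
        sigma = np.cov(X_c, rowvar=False, bias=True)   # sample covariance
        params[label.item()] = (mu, sigma)
    return params

# Synthetic data (assumption): two well-separated 2-D Gaussian classes.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

params = fit_class_gaussians(X, y)
for label, (mu, sigma) in params.items():
    print(label, mu.round(2), sigma.round(2))
```

Once the pairs (µ, Σ) have been estimated for every class, the class-conditional densities are fully specified, which is the sense in which the distribution is "completely known".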

An interesting aspect of generative modelling is the possibility of explicitly expressing the way random variables factorise through repeated applications of the product rule of probability. For instance, a generative model p(X, Y, θ) can be expressed as the product of two probability terms, namely the class conditional probability p(X|Y, θ) and the class prior probability p(Y, θ), which can itself be expressed as the product p(Y|θ)p(θ):

p(X, Y, θ) = p(X|Y, θ)p(Y |θ)p(θ) . (3.2)
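The factorisation in Equation (3.2) also suggests how data can be generated from the model: first draw a class label from p(Y|θ), then draw an observation from p(X|Y, θ). The sketch below illustrates this for Gaussian class conditionals. It assumes that θ is held fixed at hypothetical values (so p(θ) is not sampled), and the function name `sample_generative` is introduced here for illustration only.

```python
import numpy as np

def sample_generative(n, class_priors, means, covs, rng):
    """Draw n pairs (x, y): y ~ p(Y | theta), then x ~ p(X | Y = y, theta)."""
    class_priors = np.asarray(class_priors, dtype=float)
    # Step 1: sample the class labels from the class prior.
    y = rng.choice(len(class_priors), size=n, p=class_priors)
    # Step 2: sample each observation from its class conditional.
    X = np.empty((n, len(means[0])))
    for k in range(len(class_priors)):
        idx = np.flatnonzero(y == k)
        if idx.size:
            X[idx] = rng.multivariate_normal(means[k], covs[k], size=idx.size)
    return X, y

# Hypothetical parameter values theta = (class priors, means, covariances).
rng = np.random.default_rng(1)
X, y = sample_generative(
    1000,
    class_priors=[0.3, 0.7],
    means=[np.zeros(2), np.full(2, 5.0)],
    covs=[np.eye(2), np.eye(2)],
    rng=rng,
)
print(X.shape, np.bincount(y) / len(y))
```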

The way a probabilistic generative model factorises can be graphically represented by Bayesian Networks. Bayesian Networks are probabilistic models that can be used to learn existing dependencies between the random variables of a probabilistic model [69].

The problem of learning a parametric probabilistic model, given prior knowledge of the problem and the training data, consists of learning the parameters that most likely fit the training data. The initial knowledge about the model parameters is captured by the prior distribution p(θ). Bayes’ theorem can be used to update the uncertainty associated with a set of model parameters after having observed the training data D = {X, Y}:

$$
p(\theta \mid X, Y) = \frac{p(X, Y \mid \theta)\, p(\theta)}{p(X, Y)} . \tag{3.3}
$$

The denominator of Equation (3.3) is a normalisation constant that does not depend on the model parameters, and can be obtained by integrating the joint probability distribution p(X, Y, θ) over all possible values of the random variable θ:

$$
p(X, Y) = \int p(X, Y \mid \theta)\, p(\theta)\, d\theta . \tag{3.4}
$$

The latter consideration implies that the problem of estimating the model parameters consists of finding an estimate of the posterior distribution of the parameters p(θ|X, Y), which is proportional to the product of the prior distribution p(θ) and the likelihood function p(X, Y|θ):

p(θ|X , Y) ∝ p(X , Y|θ)p(θ). (3.5)

It is interesting to note that the likelihood function p(X, Y|θ) plays a key role in the estimation of the parameter posterior probability.
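To make the roles of prior, likelihood and normalisation constant concrete, the following sketch evaluates the posterior on a grid of parameter values. It is only an illustration under simplifying assumptions that are not part of the text above: θ is the scalar mean of a univariate Gaussian with known noise standard deviation, the prior over θ is Gaussian, and the integral in the denominator is approximated with the trapezoidal rule. All function names and numerical values are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def trapezoid(f, x):
    """Trapezoidal-rule approximation of the integral of f over the grid x."""
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

def grid_posterior(data, theta_grid, prior_mean=0.0, prior_std=2.0, noise_std=1.0):
    """Return the posterior density on theta_grid and the log evidence."""
    log_prior = log_normal_pdf(theta_grid, prior_mean, prior_std)
    # Log-likelihood of the whole data set for every candidate theta.
    log_lik = log_normal_pdf(data[:, None], theta_grid[None, :], noise_std).sum(axis=0)
    log_joint = log_lik + log_prior            # likelihood times prior, in log space
    shift = log_joint.max()                    # subtract the maximum for stability
    joint = np.exp(log_joint - shift)
    evidence_scaled = trapezoid(joint, theta_grid)   # integral over theta
    posterior = joint / evidence_scaled              # normalised posterior density
    return posterior, shift + np.log(evidence_scaled)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)
theta_grid = np.linspace(-5.0, 5.0, 2001)
posterior, log_evidence = grid_posterior(data, theta_grid)
print(trapezoid(theta_grid * posterior, theta_grid))   # posterior mean of theta
```

Because the Gaussian prior is conjugate to this likelihood, the grid result can be checked against the closed-form posterior mean, (µ0/σ0² + Σᵢ xᵢ/σ²) / (1/σ0² + N/σ²), where µ0 and σ0 are the prior mean and standard deviation and σ is the noise standard deviation.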

Three estimation approaches of increasing complexity can be distinguished [6]:

Maximum Likelihood (ML). In this frequentist approach the parameters of a model are fixed but have unknown values. Since they are not random variables, no prior distribution is placed on them, and the right-hand side of Equation (3.5) reduces to the likelihood function p(X, Y|θ). The ML approach aims to find the set of parameters θML that maximises the probability of the data given the parameters:

$$
\theta_{ML} = \arg\max_{\theta}\; p(X, Y \mid \theta) . \tag{3.6}
$$

For example, if our generative model is normally distributed N(µ, Σ), the ML estimates of the mean µ and the covariance matrix Σ are the sample mean and the sample covariance of the training data [25]. Note that ML estimators are not unbiased in general: the ML estimate of Σ normalises by the number of samples N rather than N − 1, and is therefore biased for finite N.

Maximum A Posteriori (MAP). In this approach parameters are treated as random variables governed by appropriate prior distributions, as in Equation (3.5). As for ML, the optimal parameter values are point estimates of the parameter set, θ = θ0, and the integral in Equation (3.4), which forms the denominator of Equation (3.3), is simply a multiplicative constant that can be ignored. If the priors are chosen to be conjugate to the class conditionals, the posterior distribution of the parameters p(θ|X, Y) will have the same functional form as the parameter priors. Conjugate priors are chosen according to the form of the class conditional and to the specific set of unknown parameters. For example, if we model a Gaussian distribution p(X, Y|θ), then the prior over the mean is a Gaussian distribution N(µ0, Σ0), the prior over the covariance matrix Σ is an inverse-Wishart distribution (equivalently, a Wishart distribution W(W, ν) over the precision matrix Σ⁻¹), and the prior over the class priors is a Dirichlet distribution Dir(α). The advantage of adding prior distributions for the parameters is that they introduce some form of regularisation, and therefore favour simpler models over more complicated ones [6].

Pure Bayesian Learning. As in the MAP approach, parameters are random variables, but they are not summarised by point estimates; the full posterior distribution is retained instead. This approach aims at solving the estimation problem of Equation (3.4) via sampling methods such as Markov Chain Monte Carlo or via approximation techniques such as variational inference or expectation propagation [6].
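The difference between the three approaches can be illustrated on the simplest parameter of the model, the vector of class priors. The sketch below is a hedged example rather than a prescribed procedure: it assumes a Dirichlet prior Dir(α) with hypothetical concentration values, and, because this prior is conjugate to the categorical likelihood, it draws samples directly from the Dirichlet posterior instead of running Markov Chain Monte Carlo. The function name `class_prior_estimates` is introduced for illustration only.

```python
import numpy as np

def class_prior_estimates(labels, n_classes, alpha):
    """ML and MAP point estimates of the class priors, plus the Dirichlet posterior.

    labels: non-negative integer class labels; alpha: Dirichlet concentrations.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    alpha = np.asarray(alpha, dtype=float)
    # ML: relative class frequencies (maximiser of the likelihood alone).
    ml = counts / counts.sum()
    # Conjugacy: the posterior over the class priors is Dir(alpha + counts).
    post_alpha = alpha + counts
    # MAP: mode of the posterior, valid when every entry of post_alpha exceeds 1.
    map_est = (post_alpha - 1.0) / (post_alpha.sum() - n_classes)
    return ml, map_est, post_alpha

labels = np.array([0, 0, 0, 1])          # class 2 is never observed
ml, map_est, post_alpha = class_prior_estimates(labels, n_classes=3, alpha=[2.0, 2.0, 2.0])

# Bayesian treatment: keep the whole posterior and summarise it by sampling.
rng = np.random.default_rng(0)
samples = rng.dirichlet(post_alpha, size=5000)
print("ML  :", ml)
print("MAP :", map_est)
print("Posterior mean (sampled):", samples.mean(axis=0).round(3))
print("Posterior std  (sampled):", samples.std(axis=0).round(3))
```

In this toy example the ML estimate assigns zero probability to the unobserved class, whereas the MAP estimate and the posterior mean do not, which is one way of seeing the regularising effect of the prior.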

One of the strengths of generative models is their inherent ability to handle labelled and unlabelled data. This can be shown by simply rewriting Equation (3.2) for labelled and unlabelled data as:

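$$
\begin{aligned}
p(D) &= p(X_L, Y_L, X_U, \theta) \\
     &= p(X_L, Y_L \mid \theta)\, p(X_U \mid \theta)\, p(\theta) \\
     &= p(X_L \mid Y_L, \theta)\, p(Y_L \mid \theta)\, p(X_U \mid \theta)\, p(\theta) ,
\end{aligned}
\tag{3.7}
$$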

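and by taking the logarithm of the likelihood p(D|θ) = p(XL, YL|θ)p(XU|θ) in Equation (3.7):

$$
\log p(D \mid \theta) = \sum_{i=1}^{|D_L|} \log \big[ p(x_i \mid y_i, \theta)\, p(y_i \mid \theta) \big] + \sum_{j=1}^{|D_U|} \log \left[ \sum_{k=1}^{c} p(x_j \mid y_k, \theta)\, p(y_k \mid \theta) \right] . \tag{3.8}
$$

Here c denotes the number of classes, so the inner sum marginalises each unlabelled sample over its unknown class label.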

The log-likelihood in Equation (3.8) is made of two distinct terms, the first one accounting for the labelled data only and the second one accounting for the unlabelled data only.
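The two terms of Equation (3.8) can be computed separately, as in the following sketch. It assumes Gaussian class conditionals and fixed, hypothetical parameter values; the function names are introduced here for illustration, and the sum over classes in the unlabelled term is evaluated with the log-sum-exp trick for numerical stability.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def semi_supervised_log_likelihood(X_l, y_l, X_u, priors, means, covs):
    """Return the labelled and unlabelled terms of Equation (3.8)."""
    priors = np.asarray(priors, dtype=float)

    def log_joint(X):
        # Entry [i, k] holds log p(x_i | y_k, theta) + log p(y_k | theta).
        cols = [log_gaussian(X, means[k], covs[k]) + np.log(priors[k])
                for k in range(len(priors))]
        return np.stack(cols, axis=1)

    # Labelled term: pick the column of the observed label for each sample.
    lj_l = log_joint(X_l)
    labelled = lj_l[np.arange(len(y_l)), y_l].sum()

    # Unlabelled term: log of the sum over classes, via log-sum-exp.
    lj_u = log_joint(X_u)
    m = lj_u.max(axis=1, keepdims=True)
    unlabelled = (m[:, 0] + np.log(np.exp(lj_u - m).sum(axis=1))).sum()
    return labelled, unlabelled

# Hypothetical parameters and a tiny data set.
means = [np.zeros(2), np.full(2, 3.0)]
covs = [np.eye(2), np.eye(2)]
X_l = np.array([[0.1, -0.2], [2.9, 3.1]])
y_l = np.array([0, 1])
X_u = np.array([[0.0, 0.3], [3.2, 2.7], [1.5, 1.5]])
labelled, unlabelled = semi_supervised_log_likelihood(X_l, y_l, X_u, [0.5, 0.5], means, covs)
print(labelled + unlabelled)
```

Maximising the sum of the two returned terms with respect to the parameters is typically done iteratively (for instance with expectation-maximisation), since the unlabelled term contains a logarithm of a sum.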

This property of generative models to naturally handle labelled and unlabelled data is of great importance, as many real problems nowadays require large quantities of labelled data to design supervised classifiers with high accuracy, and at the same time are characterised by the difficulty and cost of collecting such data. A possible answer to this dilemma would be to consider semi-supervised algorithms, that is, techniques which are able to learn from a small amount of labelled data together with a large amount of unlabelled data [102].