Summary - Nonparametric Bayesian Topic Modelling with Auxiliary Data

This chapter reviewed some basics of Bayesian methods, including how a Bayesian model is constructed and how to make inference on quantities of interest in the model. Because the posterior is usually too complex in practical situations to allow analytical inference, various approximation approaches were reviewed; emphasis was given to Markov chain Monte Carlo methods, as these will be used primarily in this dissertation.

In the next chapter, we continue with a discussion of some important probability distributions and stochastic processes that are used in this dissertation. They are discussed in the framework of Bayesian modelling.

Chapter 3

Probability Distributions and Stochastic Processes

This chapter provides a brief review of probability distributions and stochastic processes. The distributions and processes presented here are chosen on the basis of relevance to this dissertation; they are only a tiny portion of all existing (and important) distributions. See Walck [2007] for a comprehensive list of other important probability distributions. We first describe some simple probability distributions in Sections 3.1 and 3.2. Section 3.3 describes the nonparametric approach in Bayesian methods and mentions some stochastic processes.

3.1 Univariate Probability Distributions

We first discuss the simple univariate probability distributions. These distributions are characterised by the fact that they generate one variable at a time.

3.1.1 Bernoulli Distribution

The Bernoulli distribution can be considered the simplest of all distributions. It is a discrete distribution (i.e., the outcome takes one of a countable set of values) with only two outcomes: 0 and 1. A classical example of such a distribution is the number of heads obtained from a single toss of a bent coin.

Let $\theta$ denote the probability of landing a head and $x$ denote the number of heads obtained. We say that $x$ follows a Bernoulli distribution with parameter $\theta$, which can be written as follows:

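$$(x \mid \theta) \sim \mathrm{Bernoulli}(\theta). \tag{3.1}$$

The probability density function⁵ associated with $x$ is given by

$$p(x \mid \theta) = \theta^{x} (1-\theta)^{1-x}, \qquad x \in \{0, 1\}, \quad \theta \in [0, 1]. \tag{3.2}$$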
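Equation (3.2) can be evaluated directly. The following is a minimal sketch, not code from this dissertation; the function name `bernoulli_pmf` and the value `theta = 0.3` are hypothetical choices made for illustration.

```python
def bernoulli_pmf(x: int, theta: float) -> float:
    """Evaluate p(x | theta) = theta^x (1 - theta)^(1 - x), as in Equation (3.2)."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return theta**x * (1.0 - theta) ** (1 - x)


theta = 0.3  # hypothetical probability of a head for the bent coin
p_head = bernoulli_pmf(1, theta)  # probability of one head
p_tail = bernoulli_pmf(0, theta)  # probability of zero heads
```

The two probabilities sum to one, since $x$ has only the two outcomes 0 and 1.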

⁵ It should be called the probability mass function in the case of a discrete probability distribution, but for convenience, we call it a probability density function as in the continuous case.


3.1.2 Binomial Distribution

The binomial distribution is a generalisation of the Bernoulli distribution to multiple trials. Following the above example, if we throw the same bent coin $n$ times and again denote by $x$ the number of heads obtained, then $x$ follows a binomial distribution with parameters $n$ and $\theta$:

$$(x \mid n, \theta) \sim \mathrm{Binomial}(n, \theta). \tag{3.3}$$

As with the Bernoulli distribution, it is a discrete distribution, but now with $(n+1)$ outcomes from $n$ trials. The probability density for $x$ is given as

$$p(x \mid n, \theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x}, \qquad x \in \{0, 1, \ldots, n\}, \quad \theta \in [0, 1], \tag{3.4}$$

where the notation $\binom{n}{x}$ denotes the binomial coefficient, given as

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}. \tag{3.5}$$

3.1.3 Beta Distribution

In contrast to the Bernoulli and binomial distributions, the beta distribution is a continuous distribution: its outcome can be any real number between 0 and 1 (inclusive). The beta distribution is usually used as a prior distribution for the probability of an event. For example, we can model the probability of getting a head from a coin toss, $\theta$, by a beta distribution:

$$p(\theta \mid a, b) = \frac{1}{B(a,b)}\, \theta^{a-1} (1-\theta)^{b-1}, \qquad \theta \in [0, 1], \quad a > 0, \quad b > 0. \tag{3.6}$$

Here, the parameters $a$ and $b$ are known as shape parameters, and $B(\cdot,\cdot)$ is called the beta function, which serves as a normalisation constant. The beta function can also be written in terms of gamma functions:

$$B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}. \tag{3.7}$$
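As an illustration of Equations (3.6) and (3.7), the sketch below computes the beta function from log-gamma values and checks numerically that the resulting density integrates to one. It is an illustrative sketch only: the function names, the midpoint-rule integrator and the shape parameters $a = 2$, $b = 3$ are assumptions introduced here, not part of the original text.

```python
import math


def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), computed via log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def beta_pdf(theta: float, a: float, b: float) -> float:
    """Beta density of Equation (3.6)."""
    return theta ** (a - 1) * (1.0 - theta) ** (b - 1) / beta_function(a, b)


def integrate_midpoint(f, n_steps: int = 10_000) -> float:
    """Midpoint-rule integral of f over [0, 1]; the endpoints are never evaluated."""
    h = 1.0 / n_steps
    return h * sum(f((i + 0.5) * h) for i in range(n_steps))


a, b = 2.0, 3.0  # hypothetical shape parameters
normaliser = beta_function(a, b)  # Gamma(2) Gamma(3) / Gamma(5) = 2 / 24 = 1/12
total_mass = integrate_midpoint(lambda t: beta_pdf(t, a, b))
```

Working with `math.lgamma` rather than `math.gamma` is a design choice to avoid overflow when $a$ or $b$ is large.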

Note that the beta distribution is a conjugate prior distribution for the binomial distribution (and also for the Bernoulli distribution). This means that the prior and posterior distributions of $\theta$ will be of the same family of distributions, namely the beta family. This convenient property also allows a tractable derivation of a compound distribution named the beta-binomial distribution.


3.1.4 Beta-Binomial Distribution

Consider the following Bayesian model:

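$$(x \mid n, \theta) \sim \mathrm{Binomial}(n, \theta), \tag{3.8}$$

$$(\theta \mid a, b) \sim \mathrm{Beta}(a, b), \tag{3.9}$$

where $a$ and $b$ are hyperparameters associated with the prior on $\theta$. Note that the variables $n$, $a$ and $b$ are known (or chosen to be certain values) in the model.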

It is not difficult to show that the posterior of $\theta$ follows a beta distribution:

$$\begin{aligned} p(\theta \mid x, n, a, b) &\propto p(x \mid n, \theta)\, p(\theta \mid a, b) \\ &\propto \theta^{a+x-1} (1-\theta)^{b+n-x-1}, \end{aligned} \tag{3.10}$$

that is, $(\theta \mid x, n, a, b) \sim \mathrm{Beta}(a+x,\, b+n-x)$. Oftentimes, we write the posterior in Equation (3.10) as $p(\theta \mid x)$, implicitly conditioning on the known variables ($n$, $a$ and $b$) for simplicity and ease of reading. This conjugacy also enables us to analytically derive the compound distribution of $x$ by integrating out the parameter $\theta$:

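$$\begin{aligned} p(x \mid n, a, b) &= \int_0^1 p(x \mid n, \theta)\, p(\theta \mid a, b)\, d\theta \\ &= \binom{n}{x} \frac{1}{B(a,b)} \int_0^1 \theta^{a+x-1} (1-\theta)^{b+n-x-1}\, d\theta \\ &= \binom{n}{x} \frac{B(a+x,\, b+n-x)}{B(a,b)}. \end{aligned} \tag{3.11}$$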

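This distribution is known as the beta-binomial distribution. Note that the integral in Equation (3.11) is easily computed by recognising that its integrand is the unnormalised density of the posterior distribution of $\theta$, $\mathrm{Beta}(a+x,\, b+n-x)$, so the integral equals the normalisation constant $B(a+x,\, b+n-x)$.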
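The conjugate update in Equation (3.10) and the closed form in Equation (3.11) can be checked numerically. The sketch below is illustrative only and rests on assumptions introduced here: the function names, the midpoint-rule quadrature and the values $n = 10$, $a = 2$, $b = 3$, $x = 7$ are all hypothetical. It compares the closed-form beta-binomial probability with a direct numerical integration of the binomial likelihood against the beta prior.

```python
import math


def log_beta(a: float, b: float) -> float:
    """log B(a, b), using the gamma-function identity of Equation (3.7)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_params(x: int, n: int, a: float, b: float) -> tuple:
    """Conjugate update of Equation (3.10): Beta(a, b) becomes Beta(a + x, b + n - x)."""
    return a + x, b + n - x


def beta_binomial_pmf(x: int, n: int, a: float, b: float) -> float:
    """Closed form of Equation (3.11)."""
    a_post, b_post = posterior_params(x, n, a, b)
    return math.comb(n, x) * math.exp(log_beta(a_post, b_post) - log_beta(a, b))


def marginal_by_quadrature(x: int, n: int, a: float, b: float,
                           n_steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the first integral in Equation (3.11)."""
    h = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * h
        likelihood = math.comb(n, x) * t**x * (1.0 - t) ** (n - x)
        prior = t ** (a - 1) * (1.0 - t) ** (b - 1) / math.exp(log_beta(a, b))
        total += likelihood * prior * h
    return total


n, a, b = 10, 2.0, 3.0  # hypothetical number of trials and hyperparameters
x = 7                   # hypothetical observed number of heads
closed_form = beta_binomial_pmf(x, n, a, b)
numerical = marginal_by_quadrature(x, n, a, b)
total_mass = sum(beta_binomial_pmf(k, n, a, b) for k in range(n + 1))
```

Under these assumptions the two values agree up to quadrature error, and the closed-form probabilities over $x = 0, \ldots, n$ sum to one.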

For situations where there is a priori ignorance regarding $\theta$ (i.e., we do not know what $a$ and $b$ should be), three specifications have been proposed: the uniform prior ($a = b = 1$), an improper prior⁶ ($a = b = 0$) and the Jeffreys prior ($a = b = 1/2$). Each of these has its advantages and disadvantages. However, given a large sample size (often the case in computer science applications), the differences between using the three priors tend to be negligible.

⁶ Note that the improper prior is not a proper probability distribution, in that its density does not sum (or integrate) to 1. When one uses an improper prior, care must be taken to ensure that the posterior distribution is proper; otherwise the inference obtained is completely useless!
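The effect of the three ignorance priors can be illustrated with a small sketch. The choices here are assumptions made for illustration and do not come from the original text: the posterior mean $(a+x)/(a+b+n)$ of $\mathrm{Beta}(a+x,\, b+n-x)$ is used as the summary being compared, the data values are hypothetical, and the properness check simply requires both posterior shape parameters to be positive, as in Equation (3.6).

```python
def posterior_mean(x: int, n: int, a: float, b: float) -> float:
    """Mean of the Beta(a + x, b + n - x) posterior; rejects improper posteriors."""
    a_post, b_post = a + x, b + n - x
    if a_post <= 0 or b_post <= 0:
        # With a = b = 0 this happens when x = 0 or x = n.
        raise ValueError("posterior is improper: shape parameters must be positive")
    return a_post / (a_post + b_post)


# The three ignorance priors named in the text, as (a, b) pairs.
PRIORS = {
    "uniform": (1.0, 1.0),
    "improper": (0.0, 0.0),
    "Jeffreys": (0.5, 0.5),
}


def spread_of_means(x: int, n: int) -> float:
    """Largest difference between posterior means across the three priors."""
    means = [posterior_mean(x, n, a, b) for a, b in PRIORS.values()]
    return max(means) - min(means)


small_sample_spread = spread_of_means(3, 10)          # hypothetical small data set
large_sample_spread = spread_of_means(3000, 10_000)   # hypothetical large data set
```

With the same observed proportion of heads, the spread between the three posterior means shrinks as $n$ grows. The improper prior also yields an improper posterior when all tosses land the same way, which is the case the check guards against.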

