2.6 Discussion
3.2.1 Dirichlet process mixture
We start by introducing some fundamental concepts of a Dirichlet Process (DP) that are relevant to this article; for a more comprehensive review, refer to Chapters 1 and 2 in Müller et al. (2015).
3.2.1.1 Construction of a Dirichlet process
A nonparametric model is characterized by an infinite number of parameters. Suppose we have a collection of data $y_1, \dots, y_N$ that follow a distribution $G$ in an infinite dimensional space. To proceed with a Bayesian nonparametric model, we need a prior on $G$, known as a Bayesian nonparametric prior. Among the range of Bayesian nonparametric priors, we will focus on the DP prior because of its mathematical convenience. A $\mathrm{DP}(M G_0)$ is characterized by two quantities, the total mass parameter $M$ and the centering measure $G_0$. For each partition $\{B_1, \dots, B_K\}$ of a set $B$, $\mathrm{DP}(M G_0)$ assigns probability $G(B_k)$ to every subset $B_k$, such that $(G(B_1), \dots, G(B_K)) \sim \mathrm{Dir}(M G_0(B_1), \dots, M G_0(B_K))$, where $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution. From the definition we can see the analogy between a DP and a Dirichlet distribution: a DP is an infinite dimensional extension of a Dirichlet distribution. In addition, the mathematical convenience of a DP comes from its conjugacy. Let the data $y_1, \dots, y_N \mid G \overset{\text{i.i.d.}}{\sim} G$ and the prior $G \sim \mathrm{DP}(M G_0)$; then the posterior also follows a DP:

$$G \mid y_1, \dots, y_N \sim \mathrm{DP}\Big(M G_0 + \sum_{i=1}^{N} \delta_{y_i}\Big) = \mathrm{DP}\Big((M+N)\Big(\frac{M}{M+N} G_0 + \frac{1}{M+N} \sum_{i=1}^{N} \delta_{y_i}\Big)\Big),$$

where the centering measure is a weighted average of the prior centering measure $G_0$ and the point masses at the observed data, denoted by the Dirac function $\delta(\cdot)$, and the total mass parameter increases by $N$.
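As a small numerical illustration of this conjugate update, the following Python sketch computes the posterior total mass and the two kinds of weight in the posterior centering measure, and draws from that measure. The function names, the made-up data and the choice of a standard normal $G_0$ are assumptions made for illustration only.

```python
import random


def posterior_dp_parameters(M, data):
    """Posterior total mass and the weights of the posterior centering measure."""
    N = len(data)
    total_mass = M + N             # total mass grows by N
    prior_weight = M / (M + N)     # weight on the prior centering measure G0
    atom_weight = 1.0 / (M + N)    # weight on each point mass delta_{y_i}
    return total_mass, prior_weight, atom_weight


def draw_from_posterior_centering(M, data, draw_g0, rng):
    """Draw once from M/(M+N) * G0 + 1/(M+N) * sum_i delta_{y_i}."""
    _, prior_weight, _ = posterior_dp_parameters(M, data)
    if rng.random() < prior_weight:
        return draw_g0(rng)        # fresh value from G0
    return rng.choice(data)        # one of the observed values, chosen uniformly


rng = random.Random(0)
y = [0.3, -1.2, 0.8, 2.1]                       # made-up data
print(posterior_dp_parameters(M=1.0, data=y))   # (5.0, 0.2, 0.2)
draws = [draw_from_posterior_centering(1.0, y, lambda r: r.gauss(0.0, 1.0), rng)
         for _ in range(10)]
```

With $M = 1$ and $N = 4$, the prior centering measure and each observed point receive the same weight of 0.2, so a small $M$ lets the data dominate quickly.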
A DP can be constructed in several equivalent ways, e.g., the Chinese restaurant process, the Polya urn process, and the stick breaking construction (Müller et al., 2015). The Chinese restaurant process specifies a distribution over partitions of $N$ points in a sequential manner. As a plain explanation, consider a restaurant with an infinite number of tables. The first customer sits at the first table, and the $(n+1)$th customer sits at an empty table with probability $M/(M+n)$, or at an existing table $k$ with probability $n_k/(M+n)$, where $n_k$ is the total number of people sitting at table $k$ before the $(n+1)$th customer comes in. In a similar vein, the Polya urn process specifies the distribution of $(y_1, \dots, y_n)$ as a product of a sequence of conditionals

$$p(y_1, \dots, y_n) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_1, \dots, y_{i-1}),$$

where

$$p(y_i \mid y_1, \dots, y_{i-1}) = \frac{1}{M+i-1} \sum_{h=1}^{i-1} \delta_{y_h}(y_i) + \frac{M}{M+i-1} G_0(y_i), \quad i = 2, 3, \dots, n,$$

and $y_1 \sim G_0$. The $(n+1)$th data point either equals one of the existing data points $y_h$, $h = 1, \dots, n$, or is a new data point drawn from the centering measure $G_0$. A more constructive way to represent a DP is through the stick breaking construction, where $G$ can be represented as an infinite sum of point masses with different weights $w_h$ and locations $\theta_h$, such that

$$G(\cdot) = \sum_{h=1}^{\infty} w_h \delta_{\theta_h}(\cdot), \quad \theta_h \sim G_0, \quad v_h \sim \mathrm{Beta}(1, M), \quad w_h = v_h \prod_{l<h} (1 - v_l).$$

It can be thought of as breaking a stick of unit length at position $v_1$ and setting $w_1$ equal to the length of the piece that we broke off; the parameter associated with that piece is $\theta_1$, which is drawn from the centering measure $G_0$. Then we successively break off a fraction $v_h$, $h = 2, 3, \dots$, of the rest of the stick, compute $w_h$ and draw the cluster-specific parameter $\theta_h$ from the centering measure $G_0$. In this way, a DP induces clusters, as data with the same parameter $\theta_h$ form a cluster. In this chapter we will use the stick breaking construction because it facilitates computation.
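The stick breaking construction can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in this chapter: the infinite sum is truncated at a finite number of atoms (an assumption made for illustration), the unassigned stick length is returned explicitly, and the names `stick_breaking`, `n_atoms` and `draw_g0` are hypothetical.

```python
import random


def stick_breaking(M, n_atoms, draw_g0, rng):
    """Truncated stick breaking: returns weights w_h, locations theta_h, leftover mass."""
    weights, atoms = [], []
    remaining = 1.0                      # length of stick not yet broken off
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, M)      # v_h ~ Beta(1, M)
        weights.append(v * remaining)    # w_h = v_h * prod_{l<h} (1 - v_l)
        atoms.append(draw_g0(rng))       # theta_h ~ G0
        remaining *= 1.0 - v
    return weights, atoms, remaining


rng = random.Random(1)
for M in (1.0, 10.0, 100.0):
    weights, atoms, leftover = stick_breaking(M, 500, lambda r: r.gauss(0.0, 1.0), rng)
    # Largest weight and number of atoms carrying more than 1% of the mass.
    print(M, round(max(weights), 3), sum(w > 0.01 for w in weights))
```

Because the weights and the leftover mass always sum to one, the leftover value gives a direct check on whether the chosen truncation level is adequate for a given $M$.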
In a DP, the centering measure $G_0$ is the expectation of the distribution $G$ before data are collected; it therefore represents the prior belief about $G$. The total mass parameter $M$ controls how concentrated the induced distribution is around $G_0$: a bigger $M$ generates more clusters and a distribution closer to $G_0$. To illustrate this, in Figure 3.1 we generated three DPs by the stick breaking construction, with $G_0 = N(0,1)$ and $M = 1, 10, 100$ respectively. When $M = 1$, only a few clusters were generated, with larger weights, and the induced distribution was far from the standard normal distribution. On the other hand, when $M = 100$, a greater number of clusters were generated, with smaller weights, and the induced distribution was very close to the standard normal distribution.
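The effect of $M$ on the number of clusters can also be checked by simulation. The sketch below uses the Chinese restaurant process described earlier, rather than the stick breaking construction behind Figure 3.1, and averages the number of occupied tables over a few replications. The function names, the sample size of 200 and the 20 replications are illustrative assumptions.

```python
import random


def crp_table_counts(n_customers, M, rng):
    """Seat customers one at a time; return the number of customers at each table."""
    counts = []
    for n in range(n_customers):             # n customers are already seated
        u = rng.random() * (M + n)
        if u < M:                            # new table, probability M / (M + n)
            counts.append(1)
            continue
        u -= M
        for k, n_k in enumerate(counts):     # table k, probability n_k / (M + n)
            if u < n_k:
                counts[k] += 1
                break
            u -= n_k
        else:
            counts[-1] += 1                  # guard against floating point rounding
    return counts


def mean_number_of_clusters(n_customers, M, n_reps, rng):
    """Average number of occupied tables over n_reps independent seatings."""
    total = sum(len(crp_table_counts(n_customers, M, rng)) for _ in range(n_reps))
    return total / n_reps


rng = random.Random(2)
for M in (1.0, 10.0, 100.0):
    print(M, mean_number_of_clusters(200, M, 20, rng))
```

The average number of occupied tables should grow with $M$, in line with the behaviour described above.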
3.2.1.2 Dirichlet process mixture
Figure 3.1: Dirichlet Processes generated by stick breaking construction, with $G_0 = N(0,1)$, $M = 1, 10, 100$ respectively. [Figure not reproduced. Panel titles in the original: "Weights, M=1", "DP(M=1, G0=N(0,1))", "Weights, M=10", "DP(M=5, G0=N(0,1))", "Weights, M=100", "DP(M=100, G0=N(0,1))". The weight panels plot weights against Index; the DP panels plot Frequency against y.]

From the stick breaking construction we can see that a DP only generates discrete distributions, and it is not directly generalizable to continuous distributions. To overcome this limitation, a Dirichlet process mixture (DPM) model can be used, which puts a DP on the distribution of the parameters in a parametric kernel $f_\theta$:

$$f_G(y) = \int f_\theta(y)\, dG(\theta), \quad G \sim \mathrm{DP}(M G_0). \tag{3.1}$$
Equivalently, a DPM can be represented hierarchically, where each data point $y_i$ is associated with its own parameter $\theta_i$ and the distribution of those parameters follows a DP:

$$y_i \mid \theta_i \sim f_{\theta_i}, \quad \theta_i \mid G \sim G, \quad G \sim \mathrm{DP}(M G_0). \tag{3.2}$$

In this representation, clusters are induced by the grouping of the parameters: data points that share the same value of $\theta_i$ belong to the same cluster. A third equivalent representation is constructed through the stick breaking representation of a DP, which induces an infinite mixture model

$$f(y \mid w, \theta) = \sum_{h=1}^{\infty} w_h f(y \mid \theta_h). \tag{3.3}$$

It shares many of the interpretations and properties of a finite mixture model, but is more flexible because it allows the model complexity to adapt to the data.
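To make representation (3.3) concrete, the following Python sketch simulates data from a DPM by first generating truncated stick breaking weights and locations, and then drawing each observation from the kernel of a randomly selected component. Several choices here are assumptions for illustration, not part of the model above: the normal kernel with a fixed standard deviation `kernel_sd`, the standard normal centering measure, the truncation at `n_atoms` components, and the folding of the leftover stick length into the last weight.

```python
import random


def sample_dpm(n, M, n_atoms, kernel_sd, rng):
    """Simulate n observations from a truncated DPM with normal kernels."""
    weights, locations = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, M)              # v_h ~ Beta(1, M)
        weights.append(v * remaining)            # w_h
        locations.append(rng.gauss(0.0, 1.0))    # theta_h ~ G0 = N(0, 1)
        remaining *= 1.0 - v
    weights[-1] += remaining                     # truncation: weights now sum to one
    # Pick a component for each observation with probability w_h ...
    labels = rng.choices(range(n_atoms), weights=weights, k=n)
    # ... then draw y_i from the kernel f(y | theta_h) = N(theta_h, kernel_sd^2).
    y = [rng.gauss(locations[h], kernel_sd) for h in labels]
    return y, labels, weights, locations


rng = random.Random(3)
y, labels, weights, locations = sample_dpm(n=200, M=1.0, n_atoms=50, kernel_sd=0.2, rng=rng)
print(len(set(labels)), "components were used for", len(y), "observations")
```

Although 50 components are available, only those with non-negligible weight are typically occupied, which is how the infinite mixture lets the data determine the effective number of clusters.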
3.2.1.3 Selecting the number of components in a finite mixture model
In a finite mixture model $f(y_i \mid w, \theta) = \sum_{h=1}^{H} w_h f(y_i \mid \theta_h)$, the number of clusters $H$ is fixed although unknown, and a model selection procedure can be applied to choose a ‘best’ $H$. This is known as the order selection problem (McLachlan and Peel, 2004, Chapter 6). Commonly used methods for the order selection problem include calculating an information criterion (AIC, BIC), performing a likelihood ratio test, and calculating the Bayes factor between two competing models (Carlin and Chib, 1995). These methods require a number of candidate models to be fitted and one to be chosen among them. They can be quite computationally intensive and may fail because of nonidentifiability issues and the use of improper priors. Specifically, the nonidentifiability issue arises in a mixture distribution when the model can be represented equally well by different values of $H$, and this may lead to an unbounded likelihood function in a frequentist model (Basford and McLachlan, 1985) or a divergent posterior if an improper prior is specified in a Bayesian model. In addition, because they treat $H$ as fixed, these methods fail to account for the variability in $H$. Some fully Bayesian procedures combine model selection and parameter estimation in a single MCMC run by treating $H$ as a random variable. For example, the reversible jump algorithm (Richardson and Green, 1997) allows the chain to jump between different values of $H$, corresponding to different models, until convergence; the birth and death process (Stephens, 2000) views the parameters in the model as a point process in which new components are allowed to be ‘born’ and existing components to ‘die’; and prior parallel tempering (Van Havre et al., 2015) initially includes far more clusters than are supported by the data and gradually removes the redundant clusters, or merges them with other clusters, in the MCMC algorithm by controlling hyperparameters. We have chosen to use the prior parallel tempering algorithm in our proposed model because it requires neither the delicate balance criterion needed in trans-dimensional algorithms nor much extra programming effort: the parallel chains differ only in hyperparameters that resemble temperatures. Furthermore, the prior parallel tempering algorithm simultaneously achieves order selection and assists mixing.
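For comparison, the "fit several candidate models and choose one" strategy discussed above can be sketched for a univariate normal mixture using BIC. This is only an illustration of that general strategy, not of the prior parallel tempering algorithm adopted in this chapter. The EM fitting routine, the simulated data, the parameter count $3H - 1$ and the lower bound on the component standard deviations (included because of the unbounded likelihood problem mentioned above) are all assumptions of this sketch.

```python
import math
import random


def normal_logpdf(y, mu, sd):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mu) / sd) ** 2


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def point_log_terms(v, w, mu, sd):
    """log(w_h) + log f(v | theta_h) for every component h."""
    return [math.log(w[h]) + normal_logpdf(v, mu[h], sd[h]) for h in range(len(w))]


def fit_mixture_em(y, H, n_iter=100, seed=0):
    """Fit an H-component normal mixture by EM and return its log-likelihood."""
    rng = random.Random(seed)
    n = len(y)
    mean_y = sum(y) / n
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    # Start with equal weights, means at H random data points, common spread.
    w, mu, sd = [1.0 / H] * H, rng.sample(y, H), [sd_y] * H
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for v in y:
            terms = point_log_terms(v, w, mu, sd)
            norm = log_sum_exp(terms)
            resp.append([math.exp(t - norm) for t in terms])
        # M-step: weighted updates of weights, means and standard deviations.
        for h in range(H):
            n_h = max(sum(r[h] for r in resp), 1e-10)
            w[h] = n_h / n
            mu[h] = sum(r[h] * v for r, v in zip(resp, y)) / n_h
            var_h = sum(r[h] * (v - mu[h]) ** 2 for r, v in zip(resp, y)) / n_h
            sd[h] = max(math.sqrt(var_h), 1e-3)   # floor keeps the likelihood bounded
    return sum(log_sum_exp(point_log_terms(v, w, mu, sd)) for v in y)


def bic(loglik, H, n):
    n_params = 3 * H - 1   # H means, H standard deviations, H - 1 free weights
    return n_params * math.log(n) - 2.0 * loglik


rng = random.Random(1)
# Simulated data from two well separated normal components.
y = [rng.gauss(-3.0, 1.0) for _ in range(150)] + [rng.gauss(3.0, 1.0) for _ in range(150)]
bic_by_H = {H: bic(fit_mixture_em(y, H), H, len(y)) for H in (1, 2, 3)}
best_H = min(bic_by_H, key=bic_by_H.get)   # smallest BIC is preferred
print(bic_by_H, best_H)
```

Each candidate $H$ requires a separate fit, and the chosen $H$ is then treated as known, which illustrates both the computational cost and the neglected uncertainty in $H$ noted above.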