3.3 The Dirichlet process
3.3.3 Generating model parameters consistent with a Dirichlet process prior
In the previous section we considered how to obtain realisations of G where G|α, G0 ∼
DP(α, G0). We now suppose that the data are n observations and denote the parameter
for the ith observation by λi. The focus of this section is how to obtain a realisation of λ =
(λ1, . . . , λn) given λ|G ∼ G, that is, how to draw realisations (of the parameters) from G,
where G follows a Dirichlet process. Such realisations are reasonably straightforward to obtain under the stick-breaking representation. However, any realisation obtained using this method will only be from an approximation to the true distribution defined by the Dirichlet process. This approximation is due to the need to truncate G so that it is finite dimensional (see above). On the other hand, the Chinese restaurant/Pólya urn representations allow for exact realisations to be drawn directly from G (we need not obtain a realisation of G explicitly). Further, these samples are draws from the true distribution defined by the Dirichlet process irrespective of the value of α.
To generate a realisation of λ = (λ1, . . . , λn) we appeal to the latent cluster indicators c,
where ci = c denotes that observation i is within cluster c. Note that, given the (unique)
cluster parameters λ†j, the cluster indicators enable us to completely identify λ, and so obtaining a realisation for the cluster indicators is equivalent to drawing a realisation for λ. We now describe how to obtain realisations of c (and therefore λ) which are consistent with the Dirichlet process prior under both the stick-breaking and the Chinese restaurant/Pólya urn representations.
Stick-breaking representation
To generate a realisation of the cluster allocations consistent with a Dirichlet process prior using the stick-breaking representation we must first obtain an (approximate) realisation of the discrete distribution G. This can be obtained using the method described in the previous section, which gives a realisation of G defined by Pr(λ = λ†j) = ψj for j = 1, . . . , N1. Given a realisation of G, we can generate a realisation of the cluster allocations
for n observations as follows.
• Sample ci ∼ Cat(N1, ψ) independently for i = 1, . . . , n.
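The sampling step above can be sketched in Python as follows. This is a minimal illustration rather than the method of the previous section: it assumes the truncation is imposed by setting the final stick proportion to one so that the N1 weights sum to one, and it takes G0 to be a standard normal purely for illustration. The function names `truncated_stick_breaking` and `sample_allocations` are hypothetical.

```python
import numpy as np

def truncated_stick_breaking(alpha, n_atoms, rng):
    """Weights psi_1, ..., psi_N of a DP truncated at n_atoms atoms.

    Assumption: the last stick proportion is set to 1 so the weights sum to 1.
    """
    v = rng.beta(1.0, alpha, size=n_atoms)
    v[-1] = 1.0
    # Length of stick remaining before each break: prod_{l<j} (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_allocations(psi, n, rng):
    """Draw c_i ~ Cat(N1, psi) independently; cluster labels are 1-based."""
    return rng.choice(len(psi), size=n, p=psi) + 1

rng = np.random.default_rng(0)
n_atoms, n = 50, 20
psi = truncated_stick_breaking(alpha=2.0, n_atoms=n_atoms, rng=rng)
atoms = rng.normal(0.0, 1.0, size=n_atoms)  # lambda-dagger_j ~ G0 (illustrative G0)
c = sample_allocations(psi, n, rng)
lam = atoms[c - 1]                          # lambda_i = lambda-dagger_{c_i}
```

Whatever the value of n, the vector `lam` can contain at most `n_atoms` distinct values, since every allocation indexes one of the N1 atoms.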
Chinese restaurant/Pólya urn representation
Under the stick-breaking representation we first need to obtain a realisation of (the distribution) G before drawing samples of the cluster allocations (from G). However, under the Chinese restaurant/Pólya urn representation this step is no longer needed and we can instead draw realisations (of cluster allocations c and therefore λ) directly from G using the following process:
• Choose α > 0 or simulate it from a suitable distribution.
• Set c1 = 1 and the (current) number of clusters as Nc = 1.
• For i = 2, . . . , n simulate the allocation of observation i to a cluster according to the discrete distribution
$$\Pr(c_i = j \mid c_1, \dots, c_{i-1}) = \frac{n^{c}_{ij}}{\alpha + i - 1}, \quad \text{for } j = 1, \dots, N^{c},$$
$$\Pr(c_i = N^{c} + 1 \mid c_1, \dots, c_{i-1}) = \frac{\alpha}{\alpha + i - 1},$$
where ncij denotes the number of points currently within cluster j (at iteration i), and Nc → Nc + 1 if ci = Nc + 1.
• Simulate λ†j ∼ G0 independently for j = 1, . . . , Nc.
Again, as for the stick-breaking representation, the parameter associated with observation i is given by λ†ci. It follows that the parameter vector λ is given by λi = λ†ci for i = 1, . . . , n.
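The Chinese restaurant/Pólya urn procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: α is fixed rather than simulated, G0 is taken to be a standard normal, and the function name `crp_allocations` is hypothetical.

```python
import numpy as np

def crp_allocations(n, alpha, rng):
    """Sequentially allocate n observations to clusters (1-based labels)."""
    c = np.empty(n, dtype=int)
    c[0] = 1
    counts = [1]  # counts[j - 1] = number of points currently in cluster j
    for i in range(2, n + 1):
        # Existing clusters have mass n_ij / (alpha + i - 1);
        # a new cluster has mass alpha / (alpha + i - 1).
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i - 1)
        j = rng.choice(len(probs), p=probs) + 1
        if j == len(counts) + 1:
            counts.append(1)      # open a new cluster
        else:
            counts[j - 1] += 1    # join existing cluster j
        c[i - 1] = j
    return c, len(counts)

rng = np.random.default_rng(1)
n = 30
c, n_clusters = crp_allocations(n, alpha=1.5, rng=rng)
lam_dagger = rng.normal(0.0, 1.0, size=n_clusters)  # lambda-dagger_j ~ G0 (illustrative G0)
lam = lam_dagger[c - 1]                             # lambda_i = lambda-dagger_{c_i}
```

No realisation of G is constructed at any point, and the number of clusters is bounded only by n.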
We now highlight a subtle but important difference between the two methods. Suppose we are in the scenario where n ≪ N1, that is, the number of observations is significantly smaller than the truncation parameter. In this case the error resulting from the approximation used in the stick-breaking approach will be reasonably small (conditional on a suitable choice of concentration parameter). However, in the stick-breaking approach, the realisation of G is defined over a (fixed) finite number of atoms (N1), which constrains the maximum number of clusters to N1. In other words, irrespective of the number of observations n, the parameter vector λ can contain at most N1 unique values. It follows that this method of generating parameter realisations may result in a poor approximation to the true distribution defined by the Dirichlet process in the limit as n → ∞. However, the Chinese restaurant/Pólya urn method allows for the possibility that each of the n observations can join a new cluster and is therefore assigned a (unique) parameter which is an independent draw from the base distribution. It follows that in this case the upper limit on the number of clusters is theoretically infinite when considering
3.3.4 Generalised Dirichlet process
The Pitman–Yor process is a generalised version of the Dirichlet process. This process is attributed to Pitman and Yor (1997) for their work on the two-parameter Poisson–Dirichlet distribution. However, the name was coined by Ishwaran and James (2001) in their review of stick-breaking priors. Here we let PY(α, d, G0) denote the Pitman–Yor process with governing parameters α > −d, known as the strength parameter, a discount parameter 0 ≤ d < 1, and a base distribution G0. As for the Dirichlet process, a realisation from the Pitman–Yor process is a discrete distribution over an infinite set of atoms, and these atoms are (independent) draws from the base distribution G0. However, in contrast to the Dirichlet process, the weight (probability) associated with each atom is drawn from a two-parameter Poisson–Dirichlet distribution. This results in the Pitman–Yor process being more flexible than the Dirichlet process with regard to tail behaviour and it is often the preferred model for analysing data with power-law tails (the Dirichlet process has exponential tails).
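To visualise the relationship between the Pitman–Yor and the Dirichlet processes, consider the stick-breaking representation of the former:
$$G(\cdot) = \sum_{j=1}^{\infty} \psi_j \, \delta_{\lambda_j}(\cdot), \tag{3.5}$$
$$\psi_j = v_j \prod_{\ell < j} (1 - v_\ell), \qquad v_j \overset{\text{indep}}{\sim} \text{Beta}(1 - d, \alpha + jd), \qquad \lambda_j \overset{\text{indep}}{\sim} G_0.$$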
Clearly the case d = 0 produces a distribution G from (3.5) that is equivalent to that from the Dirichlet process (3.4), that is, $\text{PY}(\alpha, 0, G_0) \overset{d}{\equiv} \text{DP}(\alpha, G_0)$. For this reason the Dirichlet process is considered to be a special case of the Pitman–Yor process. The Normalised Inverse–Gamma process is another special case given by d = 0.5 and α = 0.
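The stick-breaking construction in (3.5) can be sketched numerically as follows. This is a minimal illustration that returns only the first few weights of a realisation (no truncation correction is applied, so they sum to less than one); the function name `py_stick_weights` and the parameter values are hypothetical. Setting d = 0 gives vj ∼ Beta(1, α), the Dirichlet process case.

```python
import numpy as np

def py_stick_weights(alpha, d, n_atoms, rng):
    """First n_atoms stick-breaking weights of PY(alpha, d, G0), as in (3.5)."""
    j = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - d, alpha + j * d)   # v_j ~ Beta(1 - d, alpha + j d)
    # Length of stick remaining before each break: prod_{l<j} (1 - v_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(2)
psi_dp = py_stick_weights(alpha=2.0, d=0.0, n_atoms=25, rng=rng)  # Dirichlet process case
psi_py = py_stick_weights(alpha=2.0, d=0.5, n_atoms=25, rng=rng)  # discount d = 0.5
```

Comparing the sorted weights for different values of d is one informal way to see the heavier tail that a positive discount produces.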
We need Σj ψj = 1 for G to be well defined, or equivalently, the atom weights must be on the simplex. If we let a = 1 − d and bj = α + jd, Lemma 1 of Ishwaran and James (2001) shows that
$$\sum_{j=1}^{\infty} \psi_j = 1 \ \text{almost surely} \iff \sum_{j=1}^{\infty} \log\left(1 + \frac{a}{b_j}\right) = \infty. \tag{3.6}$$
It is trivial to verify that condition (3.6) holds for the Dirichlet process. Recall that the Dirichlet process is a special case of the Pitman–Yor process with d = 0, and hence a = 1 and bj = α. Given this we have
$$\alpha > 0 \implies 1 + \frac{1}{\alpha} > 1 \implies \log\left(1 + \frac{1}{\alpha}\right) > 0 \implies \sum_{j=1}^{\infty} \log\left(1 + \frac{1}{\alpha}\right) = \infty \implies \sum_{j=1}^{\infty} \psi_j = 1 \ \text{almost surely}$$
and so the distribution in (3.4) is well defined.
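The same condition can also be checked for a positive discount. The following is a sketch, not taken from the source, which assumes 0 < d < 1 and α > −d (so that a = 1 − d > 0 and bj = α + jd > 0) and uses the elementary inequality log(1 + x) ≥ x/(1 + x) for x > −1:

$$\log\left(1 + \frac{a}{b_j}\right) \geq \frac{a}{a + b_j} = \frac{1 - d}{1 + \alpha + (j - 1)d} \geq \frac{1 - d}{(1 + \alpha + d)\, j},$$

where the last step uses 1 + α > 0. Summing over j gives

$$\sum_{j=1}^{\infty} \log\left(1 + \frac{a}{b_j}\right) \geq \frac{1 - d}{1 + \alpha + d} \sum_{j=1}^{\infty} \frac{1}{j} = \infty,$$

so condition (3.6) is satisfied by comparison with the harmonic series.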
In what follows we focus on the Dirichlet process and note that this is a common choice of stick-breaking prior, primarily due to the availability of efficient sampling schemes. Many of these (efficient) inference schemes make use of the Chinese restaurant process representation. Unfortunately, such representations are typically unavailable for the Pitman–Yor process and so the stick-breaking representation must be used and, under this representation, it is only possible to obtain an approximate posterior distribution (although the approximation error can be made arbitrarily small given sufficient computing power); this is discussed further in Section 3.4.1.