Chapter 3 Two-view Dirichlet Process for Clustering
3.2 Previous Work
3.2.1 Dirichlet Process
The Dirichlet Process (DP) is a cornerstone of non-parametric Bayesian models. A Dirichlet Process $\mathrm{DP}(G_0, \alpha)$ with base measure $G_0$ and scaling parameter $\alpha$ is a distribution over distributions (Ferguson, 1973). A random distribution $G$ is distributed according to the Dirichlet Process if, for any finite measurable partition $(A_1, \cdots, A_m)$ of the sample space $\Theta$, the probabilities that $G$ assigns to the sets of the partition follow a Dirichlet distribution:
$$(G(A_1), \cdots, G(A_m)) \sim \mathrm{Dir}(\alpha G_0(A_1), \cdots, \alpha G_0(A_m)).$$
For any measurable set $A \subset \Theta$, the probability $G(A)$ is a random variable with mean $G_0(A)$ and variance $\frac{1}{1+\alpha} G_0(A)(1 - G_0(A))$. The meaning of the parameters $G_0$ and $\alpha$ is therefore clear: $G_0$ gives the mean of the DP and $\alpha$ controls the variance, or precision. The larger $\alpha$ is, the more heavily the DP concentrates its probability mass around the mean.
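The mean and variance quoted above can be checked numerically. The sketch below assumes the two-set partition (A, complement of A), for which the defining Dirichlet property reduces to a Beta distribution for G(A). The function name and the values of alpha and G0(A) are illustrative choices, not taken from the text.

```python
import numpy as np

def dp_marginal_moments(alpha, g0_a):
    """Mean and variance of G(A) when G ~ DP(G0, alpha).

    For the partition (A, complement of A) the Dirichlet property gives
    G(A) ~ Beta(alpha * G0(A), alpha * (1 - G0(A))).
    """
    mean = g0_a
    var = g0_a * (1.0 - g0_a) / (1.0 + alpha)
    return mean, var

# Illustrative values (assumptions): alpha = 5 and G0(A) = 0.3.
rng = np.random.default_rng(0)
alpha, g0_a = 5.0, 0.3
draws = rng.beta(alpha * g0_a, alpha * (1.0 - g0_a), size=200_000)
mean, var = dp_marginal_moments(alpha, g0_a)
print(draws.mean(), mean)  # Monte Carlo mean vs. closed form
print(draws.var(), var)    # Monte Carlo variance vs. closed form
```

Increasing alpha in this sketch shrinks the variance while leaving the mean at G0(A), which is the concentration behaviour described above.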
Ferguson (1973) first formalized the DP as a prior over distributions with large support and analytically manageable posteriors for general Bayesian statistical modeling, and also showed that the DP exists. Blackwell (1973) proved that a random draw $G$ from a DP is almost surely discrete, even when the base measure $G_0$ is continuous. Blackwell and MacQueen (1973) gave the Polya urn interpretation of the Dirichlet process; the Polya urn scheme, like the closely related Chinese restaurant process, is also a constructive way to draw samples from a DP. Sethuraman (1994) developed another constructive way of forming $G$, known as the stick-breaking process. Ishwaran and James (2001) popularized the stick-breaking process and proposed two general types of Gibbs samplers.
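As an illustration of the stick-breaking construction, the following is a minimal truncated sketch. It assumes the standard form in which stick proportions are drawn from Beta(1, alpha) and atoms are drawn independently from the base measure. The truncation level, the standard normal base measure, and all names are assumptions made for the example.

```python
import numpy as np

def stick_breaking(alpha, base_sampler, num_atoms, rng):
    """Truncated stick-breaking approximation of a draw G ~ DP(G0, alpha).

    Returns weights and atoms so that G is approximated by
    sum_k weights[k] * (point mass at atoms[k]).
    """
    betas = rng.beta(1.0, alpha, size=num_atoms)  # stick proportions
    # Length of stick remaining before each break: prod_{l<k} (1 - beta_l).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    atoms = base_sampler(num_atoms, rng)          # i.i.d. draws from G0
    return weights, atoms

# Assumed setup: G0 is a standard normal, truncation at 500 atoms.
rng = np.random.default_rng(1)
weights, atoms = stick_breaking(2.0, lambda n, r: r.normal(size=n), 500, rng)
print(weights.sum())  # close to 1; the shortfall is the truncated mass
```

Because the weights sit on countably many atoms, the sketch also makes the almost-sure discreteness of G visible even though the assumed base measure is continuous.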
The Dirichlet Process is an important prior in non-parametric Bayesian models and is particularly useful for clustering (J. K. Ghosh, 2003). This is due to the natural grouping property of its posterior distribution, which we illustrate in detail below. Suppose we have a random distribution $G$ drawn from a DP:
$$G \sim \mathrm{DP}(G_0, \alpha).$$
Once we observe a sample $\theta_1$ drawn from $G$,
$$\theta_1 \sim G,$$
the posterior distribution of $G$ can be derived as
$$G \mid \theta_1 \sim \mathrm{DP}\left(\frac{\alpha}{1+\alpha} G_0 + \frac{1}{1+\alpha}\delta_{\theta_1},\ \alpha + 1\right).$$
We can see that the posterior base measure is a mixture of the base measure $G_0$ and a point mass located at $\theta_1$. More generally, once we have observed $n$ samples $\theta_1, \cdots, \theta_n$, the posterior distribution becomes
$$G \mid \theta_1, \cdots, \theta_n \sim \mathrm{DP}\left(\frac{\alpha}{n+\alpha} G_0 + \frac{1}{n+\alpha}\sum_{i=1}^{n}\delta_{\theta_i},\ \alpha + n\right).$$
In this mixture the weight on the base measure $G_0$ is proportional to $\alpha$, while the weight on the empirical distribution of point masses is proportional to the number of observations $n$. As the sample size $n$ goes to infinity, the posterior is dominated by the empirical distribution and its precision $\alpha + n$ grows without bound. This shows that the DP is consistent in the sense that the posterior distribution approaches the true underlying distribution.
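A small worked example may help make the posterior mixture concrete. The sketch below computes the posterior mean of G(A) for a set A, which by the formula above is the base-measure term weighted by alpha plus the count of observations falling in A, divided by alpha + n. The function name, the choice of A, and the numbers are assumptions for illustration only.

```python
def dp_posterior_mean_prob(alpha, g0_prob_a, thetas, in_a):
    """Posterior mean of G(A) given observations theta_1, ..., theta_n.

    Equals alpha/(n+alpha) * G0(A) + 1/(n+alpha) * #{i : theta_i in A}.
    Also returns the posterior precision alpha + n.
    """
    n = len(thetas)
    count_in_a = sum(1 for t in thetas if in_a(t))
    post_mean = (alpha * g0_prob_a + count_in_a) / (alpha + n)
    return post_mean, alpha + n

# Assumed setup: G0 symmetric about 0, A = (-inf, 0], so G0(A) = 0.5.
thetas = [-1.0, -0.5, -0.2, 2.0]
post_mean, post_precision = dp_posterior_mean_prob(
    alpha=2.0, g0_prob_a=0.5, thetas=thetas, in_a=lambda t: t <= 0.0
)
print(post_mean, post_precision)  # (2*0.5 + 3) / (2 + 4) and 2 + 4
```

With three of the four observations in A, the posterior mean moves from the prior value 0.5 toward the empirical frequency 0.75, and it would move further as n grows relative to alpha.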
We can also integrate $G$ out and consider only the marginal distribution of $\theta_1, \cdots, \theta_n$. Denote the collection $(\theta_1, \cdots, \theta_{i-1}, \theta_{i+1}, \cdots, \theta_n)$ by $\theta_{[-i]}$. Assume $\theta_1, \cdots, \theta_n$ are exchangeable, i.e., their joint distribution remains the same when we arbitrarily reorder $\theta_1$ through $\theta_n$; then the conditional distribution of $\theta_i$ given $\theta_{[-i]}$ follows the Blackwell-MacQueen urn scheme (Blackwell and MacQueen, 1973). Suppose there are $K$ unique values among $\theta_{[-i]}$ and let them be $\theta_k^*$, $k = 1, \cdots, K$. The predictive distribution of $\theta_i$ given the others can be found as
$$p(\theta_i \mid \theta_{[-i]}) = \int p(\theta_i \mid G)\, p(G \mid \theta_{[-i]})\, dG \propto \sum_{1 \le j \ne i \le n} \delta_{\theta_j}(\theta_i) + \alpha G_0(\theta_i),$$
which is equivalent to
$$p(\theta_i \mid \theta_{[-i]}) \propto \sum_{k=1}^{K} n_k^{-i}\, \delta_{\theta_k^*}(\theta_i) + \alpha G_0(\theta_i). \qquad (3.1)$$
Here $\delta_{\theta_k^*}(\cdot)$ is the delta function (point mass) located at $\theta_k^*$, and $n_k^{-i}$ is the number of instances accumulated in cluster $k$, excluding instance $i$. From Eq. (3.1) we can see a natural clustering effect: with positive probability, $\theta_i$ takes an existing value from $\theta_1^*, \cdots, \theta_K^*$. This effect can be interpreted through the Chinese Restaurant Process metaphor (Aldous, 1983), where assigning $\theta_i$ to a cluster is analogous to a new customer choosing a round table in a Chinese restaurant. The customer either joins a table that is already occupied, with probability proportional to the number of people seated there, or opens a new table by herself, with probability proportional to $\alpha$. The Chinese Restaurant Process defines a distribution over partitions because the order of the customers does not affect the joint probability. We can also see the so-called "rich-get-richer" effect: a large cluster grows faster than a small one.
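The seating rule in Eq. (3.1) can be simulated directly. The following is a minimal sketch of sequential Chinese Restaurant Process sampling, in which each customer joins an existing table with probability proportional to its current size or opens a new table with probability proportional to alpha. The function name and parameter values are illustrative assumptions.

```python
import numpy as np

def sample_crp(num_customers, alpha, rng):
    """Sequentially seat customers according to the CRP.

    Returns the table index of each customer and the final table sizes.
    """
    assignments = []
    counts = []  # counts[k] = number of customers at table k
    for _ in range(num_customers):
        # Existing tables weighted by size; the last slot is a new table.
        weights = np.array(counts + [alpha], dtype=float)
        table = int(rng.choice(len(weights), p=weights / weights.sum()))
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1     # join an occupied table
        assignments.append(table)
    return assignments, counts

rng = np.random.default_rng(2)
assignments, counts = sample_crp(1000, alpha=1.0, rng=rng)
print(len(counts), sorted(counts, reverse=True)[:5])
```

Printing the sorted table sizes typically shows a few large tables and many small ones, which is the "rich-get-richer" behaviour noted above.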
A popular application of the DP is the Mixture of Dirichlet Process (MDP) model for clustering, first proposed by Antoniak (1974). In an MDP, the observed instance $x_i$ is modeled by a distribution $F(x_i \mid \theta_i)$, where the parameter $\theta_i$ has a prior $G$ that is a random draw from $\mathrm{DP}(G_0, \alpha)$. Because $G$ is not restricted to any specific functional form and the number of model parameters can grow with the number of observed instances, the MDP has richer descriptive ability than parametric clustering models such as model-based clustering (Fraley and Raftery, 2002). The two major types of inference for the MDP are MCMC sampling (Escobar and West, 1995; MacEachern and Muller, 1998; Neal, 2000) and variational inference (Blei and Jordan, 2005).
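To make the generative side of the MDP concrete, the sketch below draws data from one possible instance of the model. It assumes a one-dimensional Gaussian likelihood F and a zero-mean Gaussian base measure G0, neither of which is specified in the text, and it integrates G out by drawing the parameters through the urn scheme of Eq. (3.1). It illustrates data generation only, not any of the inference methods cited above.

```python
import numpy as np

def sample_mdp_gaussian(n, alpha, base_std, obs_std, rng):
    """Generate n observations from an MDP with Gaussian F and Gaussian G0.

    theta_i follows the urn scheme: reuse an existing cluster mean with
    probability proportional to its count, or draw a fresh mean from G0
    with probability proportional to alpha. Then x_i ~ N(theta_i, obs_std^2).
    """
    cluster_means = []  # unique values theta*_k
    counts = []         # cluster sizes
    labels = np.empty(n, dtype=int)
    x = np.empty(n)
    for i in range(n):
        weights = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            cluster_means.append(rng.normal(0.0, base_std))  # new draw from G0
            counts.append(0)
        counts[k] += 1
        labels[i] = k
        x[i] = rng.normal(cluster_means[k], obs_std)
    return x, labels, np.array(cluster_means)

rng = np.random.default_rng(3)
x, labels, means = sample_mdp_gaussian(
    n=300, alpha=1.5, base_std=5.0, obs_std=0.5, rng=rng
)
print(len(means), np.bincount(labels))
```

The number of distinct cluster means is not fixed in advance and tends to grow with n, which is the sense in which the number of model parameters grows with the observed instances.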
3.2.2 Combination of DP and MRF
Welling (2006) and Wallach et al. (2010) noted that the "rich-get-richer" property of the MDP may not be justified for some clustering applications. The issue is that the resulting partitions usually contain only a few large clusters but many small clusters. For example, when using the MDP for natural image segmentation, Orbanz and Buhmann (2008) noticed that many segments consist of only a small number of pixels, which makes the segmentation incoherent. They therefore proposed combining the DP with a Markov Random Field (MRF) to incorporate spatial smoothness in images. We will make connections with their work in Section 3.3.3.