
Chapter 3 Two-view Dirichlet Process for Clustering

3.2 Previous Work

3.2.1 Dirichlet Process

The Dirichlet Process (DP) is a cornerstone of non-parametric Bayesian models. A Dirichlet Process $\mathrm{DP}(G_0, \alpha)$ with base measure $G_0$ and scaling parameter $\alpha$ is a distribution over distributions (Ferguson, 1973). A random distribution $G$ is distributed according to the Dirichlet Process if, for every finite measurable partition $(A_1, \cdots, A_s)$ of the sample space $\Theta$, the probabilities that $G$ assigns to the sets of the partition follow a Dirichlet distribution:

$$(G(A_1), \cdots, G(A_s)) \sim \mathrm{Dir}(\alpha G_0(A_1), \cdots, \alpha G_0(A_s)).$$

For any measurable set $A \subset \Theta$, the probability $G(A)$ is a random variable with mean $G_0(A)$ and variance $\frac{1}{1+\alpha} G_0(A)(1 - G_0(A))$. The meaning of the parameters $G_0$ and $\alpha$ is therefore clear: $G_0$ gives the mean of the DP and $\alpha$ controls the precision (inverse variance). The larger $\alpha$ is, the more heavily the DP concentrates its probability mass around the mean.
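
The mean and variance stated above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a hypothetical three-set partition with base-measure masses 0.2, 0.3 and 0.5, draws the Dirichlet vector by normalising Gamma variates, and compares the empirical moments of the first coordinate with the closed-form values. The function names and the chosen numbers are illustrative assumptions.

```python
import random

def sample_dirichlet(params, rng):
    """Draw one Dirichlet vector by normalising independent Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def dp_partition_moments(alpha, g0_masses, n_draws=20000, seed=0):
    """Empirical mean and variance of G(A_1) when
    (G(A_1), ..., G(A_s)) ~ Dir(alpha * G0(A_1), ..., alpha * G0(A_s))."""
    rng = random.Random(seed)
    params = [alpha * m for m in g0_masses]
    draws = [sample_dirichlet(params, rng)[0] for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    var = sum((d - mean) ** 2 for d in draws) / n_draws
    return mean, var

alpha = 5.0
g0_masses = [0.2, 0.3, 0.5]  # hypothetical G0(A_1), G0(A_2), G0(A_3)
mean, var = dp_partition_moments(alpha, g0_masses)

# Closed-form values from the text: mean G0(A), variance G0(A)(1 - G0(A)) / (1 + alpha)
theory_mean = g0_masses[0]
theory_var = g0_masses[0] * (1 - g0_masses[0]) / (1 + alpha)
print(mean, theory_mean)
print(var, theory_var)
```

Increasing `alpha` in this sketch shrinks the variance while leaving the mean unchanged, which matches the interpretation of the scaling parameter as a precision.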

Ferguson (1973) first formalized the DP as a prior over distributions with large support and analytically manageable posteriors for general Bayesian statistical modeling, and also established the existence of the DP. Blackwell (1973) proved that a random draw $G$ from a DP is almost surely discrete, even when the base measure $G_0$ is continuous. Blackwell and MacQueen (1973) gave the Polya urn interpretation of the Dirichlet process; the Polya urn scheme, like the closely related Chinese restaurant process, is also a constructive way to draw samples from a DP. Sethuraman (1994) developed another constructive way of forming $G$, known as the stick-breaking process. Ishwaran and James (2001) popularized the stick-breaking process and proposed two general types of Gibbs samplers.
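
As an illustration of the stick-breaking construction mentioned above, the following sketch draws an approximate sample of the random distribution by truncating the infinite sequence of sticks. The truncation level, the standard normal base measure, and the function name `stick_breaking` are assumptions made for this example only; they are not taken from the works cited.

```python
import random

def stick_breaking(alpha, base_sampler, truncation, rng):
    """Truncated stick-breaking draw; returns (weights, atoms).

    Each stick fraction is beta_k ~ Beta(1, alpha), and the k-th weight is
    beta_k times the stick length left over by the previous breaks. Atoms are
    i.i.d. draws from the base measure. The mass remaining after the last
    break is assigned to the final atom so the weights sum to one.
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(truncation - 1):
        beta = rng.betavariate(1.0, alpha)
        weights.append(remaining * beta)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - beta
    weights.append(remaining)
    atoms.append(base_sampler(rng))
    return weights, atoms

rng = random.Random(1)
weights, atoms = stick_breaking(
    alpha=2.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # assumed base measure: N(0, 1)
    truncation=50,
    rng=rng,
)
print(sum(weights), max(weights))
```

The result is a discrete distribution placing mass `weights[k]` on `atoms[k]`, which makes the almost-sure discreteness of a DP draw concrete.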

The Dirichlet Process is an important prior in non-parametric Bayesian models and is particularly useful for clustering (J. K. Ghosh, 2003). This is due to the natural grouping property of its posterior distribution, which we illustrate in detail below. Suppose we have a random distribution $G$ drawn from a DP:

$$G \sim \mathrm{DP}(G_0, \alpha).$$

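Once we observe a sample $\theta_1$ drawn from $G$,

$$\theta_1 \sim G,$$

the posterior distribution of $G$ can be derived as

$$G \mid \theta_1 \sim \mathrm{DP}\left(\frac{\alpha}{1+\alpha} G_0 + \frac{1}{1+\alpha} \delta_{\theta_1},\ \alpha + 1\right).$$

We can see that the base measure of the posterior is a mixture of the original base measure $G_0$ and a point mass located at $\theta_1$.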

More generally, once we have observed $n$ samples $\theta_1, \cdots, \theta_n$, the posterior distribution becomes

$$G \mid \theta_1, \cdots, \theta_n \sim \mathrm{DP}\left(\frac{\alpha}{n+\alpha} G_0 + \frac{1}{n+\alpha} \sum_{i=1}^{n} \delta_{\theta_i},\ \alpha + n\right).$$

In the mixture that forms the posterior base measure, the weight on $G_0$ is proportional to $\alpha$, while the weight on the empirical distribution of point masses is proportional to the number of observations $n$. As the sample size $n$ goes to infinity, the posterior is dominated by the empirical distribution and its precision $\alpha + n$ grows without bound. This shows that the DP is consistent in the sense that the posterior distribution approaches the true underlying distribution.
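
To make the balance between prior and data concrete, the sketch below evaluates the two mixture weights of the posterior base measure and the posterior precision for a few sample sizes. The function name and the example value of the scaling parameter are illustrative assumptions.

```python
def posterior_base_weights(alpha, n):
    """Return (weight on G0, weight on the empirical distribution, precision)
    for the DP posterior after n observations."""
    w_prior = alpha / (alpha + n)
    w_empirical = n / (alpha + n)
    precision = alpha + n
    return w_prior, w_empirical, precision

for n in (0, 1, 10, 1000):
    print(n, posterior_base_weights(alpha=1.0, n=n))
```

With no observations all of the weight sits on the base measure; as `n` grows the weight on the empirical distribution approaches one.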

We can also integrate $G$ out and consider only the marginal distribution of $\theta_1, \cdots, \theta_n$. Denote the collection $(\theta_1, \cdots, \theta_{i-1}, \theta_{i+1}, \cdots, \theta_n)$ by $\theta_{[-i]}$. Since $\theta_1, \cdots, \theta_n$ are exchangeable, i.e., their joint distribution remains the same when we arbitrarily reorder $\theta_1$ through $\theta_n$, we can treat any $\theta_i$ as if it were drawn last and obtain its conditional distribution from the Blackwell-MacQueen urn scheme (Blackwell and MacQueen, 1973). Suppose there are $K$ unique values among $\theta_{[-i]}$ and let them be $\theta^*_k$, $k = 1, \cdots, K$. The predictive distribution of $\theta_i$ given the others can be found as

$$p(\theta_i \mid \theta_{[-i]}) = \int p(\theta_i \mid G)\, p(G \mid \theta_{[-i]})\, dG \propto \sum_{1 \le j \ne i \le n} \delta_{\theta_j}(\theta_i) + \alpha G_0(\theta_i),$$

which is equivalent to

$$p(\theta_i \mid \theta_{[-i]}) \propto \sum_{k=1}^{K} n_k^{-i}\, \delta_{\theta^*_k}(\theta_i) + \alpha G_0(\theta_i). \tag{3.1}$$

Here $\delta_{\theta^*_k}(\cdot)$ denotes a point mass at $\theta^*_k$, and $n_k^{-i}$ is the number of instances in cluster $k$, excluding instance $i$. From Eq. (3.1) we can see a natural clustering effect, in the sense that with positive probability $\theta_i$ will take an existing value from $\theta^*_1, \cdots, \theta^*_K$. This effect can be interpreted through the Chinese Restaurant Process metaphor (Aldous, 1983), where assigning $\theta_i$ to a cluster is analogous to a new customer choosing a round table in a Chinese restaurant. The customer joins an already occupied table with probability proportional to the number of people seated there, or opens a new table by herself with probability proportional to $\alpha$. Because the order of the customers does not affect the joint probability, the Chinese Restaurant Process defines a distribution over partitions. We can also see the so-called “rich-get-richer” effect: a large cluster tends to grow faster than a small one.
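
The seating rule implied by Eq. (3.1) can be simulated directly. The following is a minimal sketch under the assumption that customers are seated one at a time; the function name `crp_assignments` and the parameter values are hypothetical. An existing table is chosen with weight equal to its current size and a new table with weight equal to the scaling parameter.

```python
import random

def crp_assignments(n, alpha, rng):
    """Seat n customers sequentially.

    Returns (assignments, counts): the table index chosen by each customer
    and the final number of customers at each table.
    """
    assignments, counts = [], []
    for _ in range(n):
        # Existing table k has weight counts[k]; a new table has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join an occupied table
        assignments.append(k)
    return assignments, counts

rng = random.Random(42)
assignments, counts = crp_assignments(n=500, alpha=1.0, rng=rng)
print(len(counts), sorted(counts, reverse=True))
```

Sorting the table sizes typically shows a few large tables and several very small ones, which is the rich-get-richer behaviour described above.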

A popular application of the DP is the Mixture of Dirichlet Process (MDP) model for clustering, which was first proposed in Antoniak (1974). In MDP, the observed instance $x_i$ is modeled by a distribution $F(x_i \mid \theta_i)$, where the parameter $\theta_i$ has a prior $G$ that is a random draw from $\mathrm{DP}(G_0, \alpha)$. Because $G$ is not restricted to any specific functional form and the number of model parameters can grow with the number of observed instances, MDP has a richer descriptive ability than parametric clustering models such as model-based clustering (Fraley and Raftery, 2002). The two major types of inference for MDP are MCMC sampling (Escobar and West, 1995; MacEachern and Muller, 1998; Neal, 2000) and variational inference methods (Blei and Jordan, 2005).
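
The generative side of the MDP model can be sketched as follows. This is an illustrative example only: it assumes a Gaussian base measure and a Gaussian observation distribution with fixed standard deviations, neither of which is specified in the text, and it integrates the random distribution out by drawing the parameters from the Polya urn. It does not implement any of the inference methods cited above.

```python
import random

def sample_mdp_data(n, alpha, rng, base_std=10.0, noise_std=1.0):
    """Generate n observations from a DP mixture with G integrated out.

    Assumed (hypothetical) choices: G0 = N(0, base_std^2) and
    F(x | theta) = N(theta, noise_std^2).
    Returns (xs, thetas).
    """
    thetas, xs = [], []
    for i in range(n):
        # Polya urn: draw a fresh value from G0 with probability
        # alpha / (alpha + i); otherwise copy a uniformly chosen earlier
        # theta, so an existing value is reused in proportion to its count.
        if rng.random() < alpha / (alpha + i):
            theta = rng.gauss(0.0, base_std)
        else:
            theta = rng.choice(thetas)
        thetas.append(theta)
        xs.append(rng.gauss(theta, noise_std))
    return xs, thetas

rng = random.Random(7)
xs, thetas = sample_mdp_data(n=200, alpha=1.0, rng=rng)
print(len(set(thetas)), "distinct cluster parameters for", len(xs), "observations")
```

The number of distinct parameter values is not fixed in advance and tends to grow slowly with the sample size, which is the sense in which the model complexity adapts to the data.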

3.2.2 Combination of DP and MRF

Welling (2006) and Wallach et al. (2010) noted that the “rich-get-richer” property of MDP may not be justified for some clustering applications. The issue is that the resulting partitions usually contain only a few large clusters but many small clusters. For example, when using MDP for natural image segmentation, Orbanz and Buhmann (2008) noticed that many segments consist of only a small number of pixels, which makes the segmentation spatially incoherent. They therefore proposed to combine the DP with a Markov Random Field (MRF) to incorporate spatial smoothness in images. We will make connections with their work in Section 3.3.3.

In document Statistical modeling of heterogeneous data (Page 57-60)