
Chapter 3 Two-view Dirichlet Process for Clustering

3.2 Previous Work

3.2.1 Dirichlet Process

The Dirichlet Process is a cornerstone of non-parametric Bayesian models. A Dirichlet Process $\mathrm{DP}(G_0, \alpha)$ with base measure $G_0$ and scaling parameter $\alpha$ is a distribution over distributions (Ferguson, 1973). A random distribution $G$ is distributed according to the Dirichlet Process if, for every finite measurable partition $(A_1, \cdots, A_s)$ of the sample space $\Theta$, the probabilities that $G$ assigns to the partition follow a Dirichlet distribution:

$$(G(A_1), \cdots, G(A_s)) \sim \mathrm{Dir}(\alpha G_0(A_1), \cdots, \alpha G_0(A_s)).$$

For any measurable set $A \subset \Theta$, the probability $G(A)$ is a random variable with mean $G_0(A)$ and variance $\frac{1}{1+\alpha} G_0(A)(1 - G_0(A))$. The meaning of the parameters $G_0$ and $\alpha$ is therefore clear: $G_0$ gives the mean of the DP, and $\alpha$ controls the variance or, equivalently, the precision. The larger $\alpha$ is, the more heavily the DP concentrates the probability mass around the mean.
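The mean and variance stated above can be checked numerically. The sketch below is only an illustration: it applies the defining property to the two-set partition $(A, A^c)$, so that $G(A)$ is Beta distributed, and compares Monte Carlo moments with the formulas. The function name `sample_G_of_A` and the value $G_0(A) = 0.3$ are assumptions chosen for the example.

```python
import numpy as np

def sample_G_of_A(alpha, g0_a, size, rng):
    """Draw G(A) using the two-set partition (A, complement of A).

    By the defining property, (G(A), G(A^c)) ~ Dir(alpha*G0(A), alpha*(1 - G0(A))),
    so G(A) alone is Beta(alpha*G0(A), alpha*(1 - G0(A))).
    """
    return rng.beta(alpha * g0_a, alpha * (1.0 - g0_a), size=size)

rng = np.random.default_rng(0)
g0_a = 0.3  # assumed value of G0(A), for illustration only
for alpha in (0.5, 5.0, 50.0):
    draws = sample_G_of_A(alpha, g0_a, 200_000, rng)
    theory_var = g0_a * (1.0 - g0_a) / (1.0 + alpha)
    print(f"alpha={alpha:5.1f}  mean={draws.mean():.4f}  "
          f"var={draws.var():.5f}  theory_var={theory_var:.5f}")
```

The empirical mean should stay close to $G_0(A)$ for every $\alpha$, while the empirical variance should shrink as $\alpha$ grows.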

Ferguson (1973) first formalized the DP as a prior over distributions that has large support and analytically manageable posteriors for general Bayesian statistical modeling, and also showed the existence of the DP. Blackwell (1973) proved that a random draw $G$ from a DP is almost surely discrete, even when the base measure $G_0$ is continuous. Blackwell and MacQueen (1973) gave the Polya urn interpretation of the Dirichlet process; the Polya urn scheme, like the closely related Chinese restaurant process, is also a constructive way to draw samples from a DP. Sethuraman (1994) developed another constructive way of forming $G$, known as the stick breaking process. Ishwaran and James (2001) popularized the stick breaking process and proposed two general types of Gibbs samplers.
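As a rough illustration of the stick breaking construction, the sketch below draws a truncated approximation of $G$: stick proportions $\beta_k \sim \mathrm{Beta}(1, \alpha)$, weights $\pi_k = \beta_k \prod_{j<k}(1 - \beta_j)$, and atoms drawn from $G_0$. The truncation level `num_atoms`, the standard normal base measure, and the function name `stick_breaking` are assumptions made for the example, not part of the text.

```python
import numpy as np

def stick_breaking(alpha, base_sampler, num_atoms, rng):
    """Truncated stick breaking draw of G ~ DP(G0, alpha).

    Returns the weights pi_k and the atom locations theta_k. The weights sum
    to slightly less than one because the stick is truncated at num_atoms.
    """
    betas = rng.beta(1.0, alpha, size=num_atoms)
    # Length of stick remaining before each break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    atoms = base_sampler(num_atoms, rng)
    return weights, atoms

rng = np.random.default_rng(0)
# Assumed base measure G0 = N(0, 1), for illustration only.
weights, atoms = stick_breaking(
    alpha=2.0,
    base_sampler=lambda n, r: r.normal(0.0, 1.0, size=n),
    num_atoms=200,
    rng=rng,
)
# Draw samples from the (renormalized) truncated G; repeated values are expected.
samples = rng.choice(atoms, size=10, p=weights / weights.sum())
print(weights.sum(), np.unique(samples).size)
```

Because $G$ is discrete, repeated draws from it coincide with positive probability, which is the property that makes the DP useful for clustering.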

The Dirichlet Process is an important prior in non-parametric Bayesian models and is particularly useful for clustering (J. K. Ghosh, 2003). This is due to the natural grouping property of its posterior distribution, which we illustrate in detail below. Suppose we have a random distribution $G$ drawn from a DP:

$$G \sim \mathrm{DP}(G_0, \alpha).$$

Once we observe a sample $\theta_1$ drawn from $G$,

$$\theta_1 \sim G,$$

the posterior distribution of $G$ can be derived as

$$G \mid \theta_1 \sim \mathrm{DP}\left(\frac{\alpha}{1+\alpha} G_0 + \frac{1}{1+\alpha} \delta_{\theta_1},\; \alpha + 1\right).$$

We can see that the posterior base measure is a mixture of the base measure $G_0$ and a point mass located at $\theta_1$. More generally, once we have observed $n$ samples $\theta_1, \cdots, \theta_n$, the posterior distribution becomes

$$G \mid \theta_1, \cdots, \theta_n \sim \mathrm{DP}\left(\frac{\alpha}{n+\alpha} G_0 + \frac{1}{n+\alpha} \sum_{i=1}^{n} \delta_{\theta_i},\; \alpha + n\right).$$

In this mixture the weight on the base measure $G_0$ is proportional to $\alpha$, while the weight on the empirical distribution of point masses is proportional to the number of observations $n$. As the sample size $n$ goes to infinity, the posterior is dominated by the empirical distribution, and its precision $\alpha + n$ grows without bound. This shows that the DP is consistent in the sense that the posterior distribution approaches the true underlying distribution.
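To make the role of the two mixture weights concrete, the following sketch evaluates the posterior mean of $G(A)$ for a measurable set $A$, which by the posterior above equals $\frac{\alpha}{n+\alpha} G_0(A) + \frac{1}{n+\alpha} \sum_{i} \delta_{\theta_i}(A)$. The function name and the numbers used are illustrative assumptions.

```python
def posterior_mean_G_of_A(alpha, g0_a, n, count_in_a):
    """Posterior mean of G(A) after n observations, count_in_a of which fall in A."""
    base_weight = alpha / (n + alpha)        # weight on the base measure G0
    return base_weight * g0_a + count_in_a / (n + alpha)

# Assumed setting: G0(A) = 0.5, while 20% of the observations fall in A.
for n in (10, 100, 1000):
    print(n, posterior_mean_G_of_A(alpha=2.0, g0_a=0.5, n=n, count_in_a=n // 5))
```

With these assumed values the posterior mean is about 0.25 for $n = 10$ and about 0.2006 for $n = 1000$, moving from the prior mean $G_0(A)$ toward the empirical fraction as $n$ grows.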

We can also integrate $G$ out and consider only the marginal distribution of $\theta_1, \cdots, \theta_n$. Denote the collection $(\theta_1, \cdots, \theta_{i-1}, \theta_{i+1}, \cdots, \theta_n)$ by $\theta_{[-i]}$. Assume $\theta_1, \cdots, \theta_n$ are exchangeable, i.e., their joint distribution remains the same when we arbitrarily reorder $\theta_1$ through $\theta_n$; then each $\theta_i$ can be treated as if it were drawn last, which leads to the Blackwell-MacQueen urn scheme (Blackwell and MacQueen, 1973). Suppose there are $K$ unique values among $\theta_{[-i]}$ and let them be $\theta_k^*$, $k = 1, \cdots, K$. The predictive distribution of $\theta_i$ given the others can be found as

$$p(\theta_i \mid \theta_{[-i]}) = \int p(\theta_i \mid G)\, p(G \mid \theta_{[-i]})\, dG \propto \sum_{1 \le j \ne i \le n} \delta_{\theta_j}(\theta_i) + \alpha G_0(\theta_i),$$

which is equivalent to

$$p(\theta_i \mid \theta_{[-i]}) \propto \sum_{k=1}^{K} n_k^{-i}\, \delta_{\theta_k^*}(\theta_i) + \alpha G_0(\theta_i). \tag{3.1}$$

Here $\delta_{\theta_k^*}(\cdot)$ is the delta function that places a point mass at $\theta_k^*$, and $n_k^{-i}$ is the number of instances accumulated in cluster $k$, excluding instance $i$. From Eq. (3.1) we can see a natural clustering effect: with positive probability, $\theta_i$ takes an existing value from $\theta_1^*, \cdots, \theta_K^*$. This effect can be interpreted through the Chinese Restaurant Process metaphor (Aldous, 1983), where assigning $\theta_i$ to a cluster is analogous to a new customer choosing a round table in a Chinese restaurant. The customer may join a table that is already occupied, with probability proportional to the number of people seated there, or open a new table by herself, with probability proportional to $\alpha$. The Chinese Restaurant Process defines a distribution over partitions because the order of customers does not affect the joint probability. We can also see the so-called “rich-get-richer” effect: a large cluster grows faster than a small one.
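The seating rule of the Chinese Restaurant Process can be sketched in a few lines. The code below is a minimal illustration, not a reference implementation: it normalizes the weights of Eq. (3.1) and, relying on exchangeability, seats customers one at a time. The function names `urn_predictive_weights` and `sample_crp` are assumptions introduced for the example.

```python
import numpy as np

def urn_predictive_weights(cluster_counts, alpha):
    """Normalized weights of Eq. (3.1): one entry per existing cluster,
    followed by a final entry for opening a new cluster."""
    weights = np.append(np.asarray(cluster_counts, dtype=float), alpha)
    return weights / weights.sum()

def sample_crp(n, alpha, rng):
    """Seat n customers sequentially; return table labels and table sizes."""
    labels, counts = [], []
    for _ in range(n):
        probs = urn_predictive_weights(counts, alpha)
        table = int(rng.choice(len(probs), p=probs))
        if table == len(counts):
            counts.append(1)        # open a new table
        else:
            counts[table] += 1      # join an occupied table
        labels.append(table)
    return labels, counts

rng = np.random.default_rng(0)
labels, counts = sample_crp(n=500, alpha=1.0, rng=rng)
print(len(counts), sorted(counts, reverse=True)[:5])
```

Printing the sorted table sizes typically shows a few large tables and many small ones, which is the rich-get-richer effect described above.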

A popular application of the DP is the Mixture of Dirichlet Process (MDP) model for clustering, which was first proposed by Antoniak (1974). In MDP, the observed instance $x_i$ is modeled by a distribution $F(x_i \mid \theta_i)$, where the parameter $\theta_i$ has a prior $G$ that is a random draw from $\mathrm{DP}(G_0, \alpha)$. Because $G$ is not restricted to any specific functional form and the number of model parameters can grow with the number of observed instances, MDP has a richer descriptive ability than parametric clustering models such as model-based clustering (Fraley and Raftery, 2002). The two major types of inference for MDP are MCMC sampling (Escobar and West, 1995; MacEachern and Muller, 1998; Neal, 2000) and variational inference methods (Blei and Jordan, 2005).
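The generative side of the MDP model can be illustrated with a short simulation. This is a sketch under stated assumptions, not one of the inference methods cited above: it takes $F$ to be a Gaussian with known standard deviation `obs_std` and $G_0$ to be a zero-mean Gaussian with standard deviation `base_std`, and it integrates $G$ out by drawing each $\theta_i$ from the urn scheme. All names and default values are hypothetical.

```python
import numpy as np

def sample_mdp_data(n, alpha, rng, base_std=5.0, obs_std=1.0):
    """Generate x_1..x_n from an MDP with Gaussian F and Gaussian G0 (assumed)."""
    params = []            # unique values theta*_k, each drawn from G0
    counts = []            # number of instances currently in each cluster
    thetas = np.empty(n)
    for i in range(n):
        weights = np.array(counts + [alpha], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(params):
            params.append(rng.normal(0.0, base_std))   # new cluster parameter
            counts.append(1)
        else:
            counts[k] += 1
        thetas[i] = params[k]
    x = rng.normal(thetas, obs_std)                    # x_i ~ F(. | theta_i)
    return x, thetas, np.array(params)

rng = np.random.default_rng(0)
x, thetas, params = sample_mdp_data(n=200, alpha=1.0, rng=rng)
print(x.shape, params.size)
```

The number of distinct parameters is not fixed in advance; it is determined by the draws and tends to grow slowly with $n$.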

3.2.2 Combination of DP and MRF

Welling (2006) and Wallach et al. (2010) noted that the “rich-get-richer” property of MDP may not be justified for some clustering applications. The issue is that the resulting partitions usually contain only a few large clusters but many small clusters. For example, when using MDP for natural image segmentation, Orbanz and Buhmann (2008) noticed that many segments consist of only a small number of pixels, which makes the segmentation incoherent. They therefore proposed combining the DP with a Markov Random Field (MRF) to incorporate spatial smoothness in images. We will make connections with their work in Section 3.3.3.
