• No results found

Two-View Clustering

Chapter 3 Two-view Dirichlet Process for Clustering

3.3 Two-View Clustering

In this section we propose a novel non-parametric Bayesian model Two-View Clustering (TVClust), in which we treat the data and the side information network as two conditionally independent sets of outcome, or two views, generated from the hidden clustering structure.

1. The observed data instances 𝑿 ={π‘₯𝑛: π‘₯π‘›βˆˆ ℝ𝑝, 𝑛 = 1 : 𝑁}.

2. The side information network represented by a symmetric 𝑛 by 𝑛 matrix 𝑬. Each entry 𝐸𝑖𝑗

takes one of three possible values. If data points π‘₯𝑖 and π‘₯𝑖 are believed a priori to be in the

same cluster then 𝐸𝑖𝑗 = 1. If π‘₯𝑖 and π‘₯𝑖 are believed to belong to differently clusters then

𝐸𝑖𝑗 = 0. When we don’t have clear judgement about the relationship between π‘₯𝑖 and π‘₯𝑖, then

𝐸𝑖𝑗 = 𝑁 π‘ˆ 𝐿𝐿, indicating insufficient prior knowledge. For the ease of future notation, let us

define π’ž = {(𝑖, 𝑗) : 𝐸𝑖𝑗 βˆ•= π‘π‘ˆπΏπΏ}. So π’ž contains data instance pairs where the pairwise side

information is available.

We model π‘₯1:𝑛and 𝐸 by two different data generating processes: π‘₯1:𝑛are modeled by the Mixture

of Dirichlet Process model and 𝐸 modeled as a random graph (Erd¨os and R´enyi, 1959).

It is worth noting that either view may suffice for existing clustering algorithms. Given only π‘₯1:𝑛it is the familiar clustering task where many methods such as K-means, model-based clustering

(Fraley and Raftery, 2002) and MDP may be applied. With the network 𝐸, applicable methods include spectral clustering such as normalized graph cut.

Our multi-view approach tries to exploit redundancies between the two views to obtain improved clustering results. Aggregating the two views of π‘₯1:𝑛and 𝐸 is particularly useful when neither view

can be fully trusted: π‘₯1:𝑛 may be contaminated by noise; 𝐸 may miss true links or contain false

links. While previous work such as constrained K-means or constrained EM assumes and relies on constraint exactness, TVClust uses 𝐸 in a β€œsoft” manner and is thus more error tolerable and robust.

3.3.1

Modeling Numerical Observations

We use the Mixture of Dirichlet Process as the underlying clustering model for π‘₯1:𝑛. Let πœƒπ‘– be the

is used as the prior for πœƒπ‘–. That is: 𝐺 ∼ DP(𝐺0, 𝛼), πœƒπ‘– ∼ 𝐺(πœƒ). To model π‘₯𝑖 we can use any

likelihood function from the parametric distributions parameterized by πœƒπ‘–. The base measure 𝐺0

should then be properly chosen to cover the parameter space. Here we are primarily interested in modeling continuous data and thus choose multidimensional normal distribution as the likelihood.

𝑝(π‘₯π‘–βˆ£πœƒπ‘–) = N(π‘₯π‘–βˆ£πœƒπ‘–), where πœƒπ‘–= (πœ‡π‘–, Σ𝑖). (3.2)

The mean πœ‡π‘– is a 𝑝-vector and covariance matrix Σ𝑖is a 𝑝 by 𝑝 matrix. In the following we will use

πœƒπ‘– and (πœ‡π‘–, Σ𝑖) interchangeably. The base measure 𝐺0 is chosen to be the Normal Inverse Wishart

distribution which is conjugate to the multidimensional normal likelihood in Eq. (3.2). The whole clustering framework can be extended easily to other likelihood functions, such as multinomial distribution to model local histograms for image segmentation or term frequencies for text mining. In that case 𝐺0 can be chosen as the conjugate prior to multinomial, i.e., dirichlet distribution.

3.3.2

Modeling the Network

Let πœƒ1:𝑛 = (πœƒ1, πœƒ2,β‹… β‹… β‹… , πœƒπ‘›) be the collection of parameters for all the instances. Define an 𝑛 by 𝑛

partition matrix 𝐻(πœƒ1:𝑛) with the 𝑖𝑗-th entry 𝐻(πœƒ1:𝑛)𝑖𝑗 = π›Ώπœƒπ‘–(πœƒπ‘—). As πœƒπ‘– are random variables, so is

𝐻(πœƒ1:𝑛). Note that 𝐻(πœƒ1:𝑛) uniquely determines the cluster structure and is our goal of inference for

the clustering. For convenience we will just use 𝐻 as a shorthand notation for 𝐻(πœƒ1:𝑛). We model

the network 𝐸 conditioned on 𝐻 and parameters πœ™πΈ ={𝑝𝑖𝑗, π‘žπ‘–π‘—}𝑛𝑖,𝑗=1 using the following random

graph model. Recall thatπ’ž = {(𝑖, 𝑗) : πΈπ‘–π‘—βˆ•= π‘π‘ˆπΏπΏ}. For (𝑖, 𝑗) ∈ π’ž,

⎧           ⎨           ⎩ 𝑝(𝐸𝑖𝑗 = 1βˆ£π»π‘–π‘— = 1, πœ™πΈ) = 𝑝𝑖𝑗, 𝑝(𝐸𝑖𝑗 = 0βˆ£π»π‘–π‘— = 1, πœ™πΈ) = 1βˆ’ 𝑝𝑖𝑗, 𝑝(𝐸𝑖𝑗 = 1βˆ£π»π‘–π‘— = 0, πœ™πΈ) = 1βˆ’ π‘žπ‘–π‘—, 𝑝(𝐸𝑖𝑗 = 0βˆ£π»π‘–π‘— = 0, πœ™πΈ) = π‘žπ‘–π‘—, (3.3)

which can be written in a more concise way as:

𝑝(πΈπ‘–π‘—βˆ£π»π‘–π‘—, πœ™πΈ) = 𝑝 𝐸𝑖𝑗𝐻𝑖𝑗

𝑖𝑗 (1βˆ’ 𝑝𝑖𝑗)(1βˆ’πΈπ‘–π‘—)𝐻𝑖𝑗 (3.4)

The site-dependent parameters 𝑝𝑖𝑗 and π‘žπ‘–π‘— can be seen as the accuracy measures of the constraint

𝐸𝑖𝑗. In the extreme case where 𝑝𝑖𝑗 = π‘žπ‘–π‘— = 1, 𝐻𝑖𝑗 will be always equal to 𝐸𝑖𝑗. Conversely 1βˆ’ 𝑝𝑖𝑗

is the error probability that 𝐸𝑖𝑗 misses a real edge between instance 𝑖 and 𝑗 (false negative), and

1βˆ’ π‘žπ‘–π‘— is the error probability that 𝐸𝑖𝑗 adds a false edge (false positive). In our Bayesian framework

we put a further layer of prior over 𝑝𝑖𝑗 and π‘žπ‘–π‘—. Because 𝑝𝑖𝑗 and π‘žπ‘–π‘— represent percentages, we

conveniently choose the conjugate prior Beta distribution Beta(π‘π‘–π‘—βˆ£π›Όπ‘π‘–π‘—, 𝛽 𝑝

𝑖𝑗) and Beta(π‘žπ‘–π‘—βˆ£π›Όπ‘žπ‘–π‘—, 𝛽 π‘ž 𝑖𝑗)

from the exponential family.

This choice of hyper-parameters (𝛼𝑝𝑖𝑗, 𝛽𝑖𝑗𝑝, π›Όπ‘žπ‘–π‘—, π›½π‘–π‘—π‘ž) is based on our confidence of the constraint 𝐸𝑖𝑗. Recall that the mean for a Beta distribution Beta(π‘βˆ£π›Ό, 𝛽) is 𝛼/(𝛼 + 𝛽). If we believe that 𝐸𝑖𝑗 is

highly credible and thus want to use a higher 𝑝𝑖𝑗 and π‘žπ‘–π‘— for 𝐸𝑖𝑗, then we can draw 𝑝𝑖𝑗 and π‘žπ‘–π‘— from

a prior such as Beta(π‘π‘–π‘—βˆ£10, 1) and Beta(π‘žπ‘–π‘—βˆ£10, 1). If we have no assessment of the quality of 𝐸𝑖𝑗, we

can simply choose a flat prior Beta(π‘βˆ£1, 1), which is the same as a uniform distribution from 0 to 1. The values of 𝑝𝑖𝑗 and π‘žπ‘–π‘— will eventually be estimated from the data through a Gibbs sampler.

Note that our model is also capable of accommodating β€œhard” constraints: if we believe 𝐸𝑖𝑗 is

absolutely correct, we can set 𝑝𝑖𝑗 and π‘žπ‘–π‘— to constant 1. This will force 𝐻𝑖𝑗 to be always equal to

𝐸𝑖𝑗, which has the same effect of applying a β€œhard” constraint as in constrained K-means.

Finally, because π‘₯1:𝑛and 𝐸 are two conditionally independent views, the complete likelihood is

𝑝(π‘₯1:𝑛, πΈβˆ£πœƒ1:𝑛) = 𝑛 ∏ 𝑖=1 N(π‘₯π‘–βˆ£πœƒπ‘–) ∏ (𝑖,𝑗)βˆˆπ’ž 𝑝(πΈπ‘–π‘—βˆ£π»π‘–π‘—(πœƒ1:𝑛)).

3.3.3

Connection with Prior Work

The TVClust framework is originally motivated by treating the side information as a kind of observed data. But TVClust can also be regarded as putting a side-information-modified Chinese Restaurant Prior over the model parameters, which is closely related with the PPMx model in (M¨uller et al., 2011). In the following we will elaborate the connection and difference.

The PPMx model tries to solve a similar problem: how to cluster the objects{𝑦𝑖}𝑛𝑖=1when each

𝑦𝑖 is associated with some covariates π‘₯𝑖. The PPMx model is motivated by fine tuning the Product

Partition Model (PPM) with the extra covariates π‘₯𝑖. Specifically, let 𝜌 be the random 𝑛 by 𝑛

partition matrix. The prior over 𝜌 conditioned on covariates π‘₯𝑖is then 𝑝(𝜌∣π‘₯1:𝑛)βˆβˆπΎπ‘˜=1𝑔(π‘₯βˆ—π‘˜)𝑐(π‘†π‘˜),

where 𝐾 is the number of partitions contained in 𝜌, π‘†π‘˜ is the collection of data points in cluster π‘˜,

and 𝑔(π‘₯βˆ—

π‘˜) is the cohesion function measuring the cohesiveness of π‘₯βˆ—π‘˜ ={π‘₯𝑖: π‘₯𝑖 ∈ π‘†π‘˜}. If there is no

covariates, then 𝑝(𝜌)∝∏𝐾

is a particular case of PPM, in which 𝑐(π‘†π‘˜) = (π‘›π‘˜βˆ’ 1)!.

TVClust can be also seen as modifying the Chinese Restaurant Prior using side information. Specifically, now the prior over partition 𝜌 becomes:

𝑝(𝜌∣𝐸) = 𝐾 ∏ π‘˜=1 (π‘›π‘˜βˆ’ 1)! ∏ (𝑖,𝑗)βˆˆπ’ž π‘πΈπ‘–π‘—πœŒπ‘–π‘—(1 βˆ’ 𝑝)(1βˆ’πΈπ‘–π‘—)πœŒπ‘–π‘—(1 βˆ’ π‘ž)𝐸𝑖𝑗(1βˆ’πœŒπ‘–π‘—)π‘ž(1βˆ’πΈπ‘–π‘—)(1βˆ’πœŒπ‘–π‘—),

This can be written in the following more concise form.

𝑝(𝜌∣𝐸) = ( 𝐾 ∏ π‘˜=1 (π‘›π‘˜βˆ’ 1)! ) 𝑝<𝐸,𝜌>(1βˆ’ 𝑝)(<1βˆ’πΈ,𝜌>(1βˆ’ π‘ž)<𝐸,1βˆ’πœŒ>π‘ž<1βˆ’πΈ,1βˆ’πœŒ>. (3.5)

where the operator <β‹…, β‹… > is defined as the matrix sum after element-wise multiplication.

We immediately notice that one difference from the PPMx model is that 𝑝<𝐸,𝜌>(1βˆ’π‘)(<1βˆ’πΈ,𝜌>(1βˆ’

π‘ž)<𝐸,1βˆ’πœŒ>π‘ž<1βˆ’πΈ,1βˆ’πœŒ> cannot be factorized as product of cohesive function, i.e., ∏𝐾

π‘˜=1𝑔(π‘₯βˆ—π‘˜). This

is because the PPMx model inherently assumes a Markov property: measuring the cohesiveness of a cluster is independent of any data point outside this cluster. While in our approach, measuring the cohesiveness of a cluster is not only dependent on this cluster alone, but also dependent on how outsider data points are similar/dissimilar to this cluster.

Next we will propose two sets of algorithms to compute the model parameters and draw posterior inference about cluster structures: Gibbs sampling and variational inference. Gibbs sampling can give us the exact full posterior distribution of model parameters, but is usually impeded by the computational costs due to slow convergence of Markov chain. Variational inference method often converges much faster, and the convergence can be monitored by computing the lower bound. But it only approximates the target posterior distribution. We developed both algorithms to take advantage of each one’s advantage. It will be up to future TVClust users to decide which algorithm to adopt, by weighing computational costs versus accuracy.

Related documents