Exploring Multiobjective Optimization for Multiview Clustering


SRIPARNA SAHA and SAYANTAN MITRA, Indian Institute of Technology Patna

STEFAN KRAMER, Johannes Gutenberg University Mainz

We present a new multiview clustering approach based on multiobjective optimization. In contrast to existing clustering algorithms based on multiobjective optimization, it is generally applicable to data represented by two or more views and does not require specifying the number of clusters a priori. The approach builds upon the search capability of a multiobjective simulated annealing based technique, AMOSA, as the underlying optimization technique. In the first version of the proposed approach, an internal cluster validity index is used to assess the quality of different partitionings obtained using different views. A new way of checking the compatibility of these different partitionings is also proposed and this is used as another objective function. A new encoding strategy and some new mutation operators are introduced. Finally, a new way of computing a consensus partitioning from multiple individual partitions obtained on multiple views is proposed. As a baseline and for comparison, two multiobjective based ensemble clustering techniques are proposed to combine the outputs of different simple clustering approaches. The efficacy of the proposed clustering methods is shown for partitioning several real-world datasets having multiple views. To show the practical usefulness of the method, we present results on web-search result clustering, where the task is to find a suitable partitioning of web snippets.

CCS Concepts: • Computing methodologies → Cluster analysis;

Additional Key Words and Phrases: Multiview classification, multiobjective optimization, simulated annealing, search result clustering

ACM Reference format:

Sriparna Saha, Sayantan Mitra, and Stefan Kramer. 2018. Exploring Multiobjective Optimization for Multiview Clustering. ACM Trans. Knowl. Discov. Data. 12, 4, Article 44 (May 2018), 30 pages.

https://doi.org/10.1145/3182181

1 INTRODUCTION

Multiview machine learning aims to take advantage of multiple views, i.e., representations of objects, in the machine learning process (Sun 2013). Views are typically defined as sets of features or variables that together describe one aspect of the objects. If the underlying relationship between different views is understood, it can be used to alleviate the difficulty of a learning problem of interest (Sun 2013; Wahid et al. 2014). Problems with multiple representations of objects can be found in many real-world applications (Bickel and Scheffer 2004; Cai et al. 2013). Multiview machine learning can be either supervised, as in the case of classification or regression (Wang et al. 2011, 2008), or unsupervised, as in the case of clustering (Bickel and Scheffer 2004; Cai et al. 2013). In this article, we focus on the problem of multiview clustering.

The authors would like to thank the A. von Humboldt research foundation for supporting the research.

Authors’ addresses: S. Saha (Corresponding author) and S. Mitra, Department of Computer Science and Engineering, Indian Institute of Technology Patna, 801103 Bihar, India; emails: [email protected], [email protected]; S. Kramer, Johannes Gutenberg University Mainz, Institute of Computer Science, Staudingerweg 9, 55128 Mainz, Germany; email: [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

One example of a multiview clustering problem from a real-world application is that of clustering search results in the form of web snippets (Carpineto et al. 2009). Web snippets are required to be clustered into multiple categories considering multiple views, for instance, (i) a syntactic view (the structure of the web snippet in terms of unique words and their frequencies, e.g., term-frequency and inverse-document-frequency), and (ii) a semantic view (the “meaning” of a web snippet). Generating a single partitioning that satisfies both these views can be an application of multiview clustering (Wahid et al. 2014).

In the case of complex data having multiple views, it may be difficult to identify one unique, unambiguous partitioning. In order to capture all the alternative partitionings of a particular dataset, we model the task of clustering as a multiobjective optimization (MOO) problem (Deb 2001). MOO requires optimizing more than one objective function (Deb 2001). It differs from single-objective optimization (SOO), where only a single objective function is optimized. MOO computes a set of tradeoff solutions as the optimal solutions, whereas SOO determines only a single solution as the optimal one. Another important motivation for solving the clustering problem using MOO is that most of the available single-objective clustering techniques optimize only a single criterion or cluster quality measure. Thus, they can only identify clusters having some particular shapes or structures. A dataset having multiple views may have several optimal partitionings corresponding to different views. If multiple different views exist in a dataset, then it must contain clusters having different structures. It would be difficult to identify all of them by optimizing a single cluster quality measure. Different cluster quality measures, quantifying the goodness of a partitioning with respect to different views, need to be optimized simultaneously. In the current work, the problem of multiview clustering of different datasets is formulated as a MOO problem.

In recent years, two main multiobjective based approaches have been developed for multiview clustering (Davidson et al. 2013; Wahid et al. 2014). The multiview clustering approach proposed by Wahid et al. (2014) used the search capability of MOO for web-search result clustering. It is basically a cluster ensemble approach where the outputs of multiple clustering techniques like hierarchical clustering approaches and k-means are combined efficiently using the search capability of a popular multiobjective genetic algorithm based technique, NSGA-II (Deb et al. 2002) (nondominated sorting genetic algorithm-II). The approach is then applied to clustering web documents. It assumes that the number of clusters is known beforehand. Moreover, the approach builds on the concept of partitional clustering. Another approach of multiobjective based multiview clustering was developed by Davidson et al. (2013). Here, a multiview spectral clustering technique is proposed under the framework of MOO. The Pareto front of MOO identifies all possible good cuts without taking any parameter value from the user. The utility of this methodology is verified both by theoretical analysis and empirical results. The main limitations of this approach are that (i) it is primarily applicable to only two given views, and (ii) the number of clusters present in a dataset is assumed to be known a priori. Moreover, the authors have mainly dealt with the two-way partitioning problem. In order to solve multicluster partitioning problems, the algorithm has to be executed on the given dataset multiple times considering each pair of clusters at a time or recursively until the partitions meet some termination condition/criterion.

None of the existing multiobjective based multiview approaches develop a partitional approach for clustering a dataset described by multiple views. A way of automatically determining the optimal partitioning of a given dataset satisfying multiple views, after a single execution of an algorithm (without simply combining the outputs of multiple clustering algorithms), was still needed. Moreover, the previous approaches assume that the number of clusters is known beforehand and can be specified by the user. Thus, automatic approaches that can determine the appropriate number of clusters present in a dataset were called for as well. In the current work, a new multiview clustering technique is developed using the concepts of MOO (Deb 2001). The approach uses the search capability of a simulated annealing (SA) based MOO technique, AMOSA (Bandyopadhyay et al. 2008) (archived multiobjective SA-based technique). Note that it has already been shown experimentally that AMOSA outperforms other existing MOO based techniques, including NSGA-II (Deb et al. 2002) (used as the underlying MOO technique in another multiview clustering technique, MMOEA (Wahid et al. 2014)), in solving several benchmark test problems (Bandyopadhyay et al. 2008). The key issues and points addressed in this article are discussed in the following:

— The main approach, MOO-Multi1, conducts multiview clustering by first identifying different partitionings from the same dataset using different views. In order to capture the goodness of an individual clustering generated using a single view, an internal cluster validity index, the PBM-index (Pakhira et al. 2004), is used. The values of the PBM-index for different partitionings obtained using different views are simultaneously optimized along with a newly defined agreement index. This measures the agreement among multiple partitionings obtained using different views in a new way. The individual solutions are evolved using the search capability of AMOSA. A new encoding strategy and different mutation operators are proposed. Finally, a novel way is proposed to combine the partitionings obtained using multiple views for a single solution. The approach is automatic in nature: it is capable of determining the number of partitions and the corresponding appropriate partitioning automatically for a given dataset.

— Two more baseline methods for MOO based multiview clustering are described and tested. These methods, MOO-Multi2 and MOO-Multi3, are based on the idea of cluster ensembles and are in this sense similar to the MMOEA algorithm by Wahid et al. (2014). But unlike MMOEA, the number of clusters is automatically determined by the proposed approaches. Initially, some simple clustering techniques like k-means and hierarchical clustering are used to partition the given dataset, varying the views and the number of clusters. The membership matrices are finally encoded in the form of solutions in AMOSA. These membership matrices are efficiently combined using the search capability of AMOSA. The mutation operators of AMOSA help to explore the search space efficiently. In the second version (MOO-Multi2), we optimize the individual partitionings obtained by different techniques. The compactness and separation of individual partitionings are calculated and these objective functions are optimized simultaneously using AMOSA. The optimal partitionings identified by AMOSA are finally combined in the last step of the proposed approach to arrive at a single consensus partitioning. In the third approach (MOO-Multi3), the individual partitionings present in a solution are first combined, and the goodness of the resulting partitioning, as captured by different cluster quality measures, is optimized by AMOSA.

— Unlike the two existing multiview based multiobjective clustering techniques (Davidson et al. 2013; Wahid et al. 2014), the proposed clustering approach is automatic in nature. It can automatically detect the number of clusters and the appropriate partitionings for any given dataset. We consider this as a major advantage of the proposed clustering technique.

— Unlike the multiview spectral clustering approach proposed by Davidson et al. (2013), the proposed approach can handle any number of views of a dataset while partitioning the data.

— Unlike the multiview spectral clustering approach proposed by Davidson et al. (2013), our proposed approach can identify any number of clusters automatically for a given dataset. As stated above, the existing multiview spectral clustering approach (Davidson et al. 2013) can only detect two-way partitions from a dataset. Thus, multiple runs of this approach are required in order to obtain multiple partitions from a given dataset. In contrast to this, the proposed approach can detect multiple partitions automatically in a single run.

The efficacy of the proposed approaches is first shown for partitioning several standard and recent datasets taken from the UCI Machine Learning Repository.¹ Results show the effectiveness of the proposed approach compared to existing approaches. We have compared our results with many existing single and multiobjective based single-view clustering techniques as well as with a recently developed multiobjective based spectral clustering technique (Davidson et al. 2013). As a real-world application, we chose to apply our approaches to web-search result clustering. Three benchmark datasets, Ambient,² Moresque (Navigli and Crisafulli 2010), and ODP-239 (Carpineto and Romano 2010), are used for the experimental validation. Two views, i.e., one semantic and one syntactic view, are considered. Results are compared with an existing multiobjective multiview based web search result clustering technique (Wahid et al. 2014) and also with several existing single-view search-result clustering techniques in terms of well-known and established cluster quality measures.

2 RELATED WORK

This section discusses existing work on multiview clustering and MOO based clustering techniques.

2.1 Existing Work on Multiview Clustering

Multimodal datasets are frequent in real life because of the use of different modalities of data input and generation, viz., text, video, and audio. Each instance or observation can have multiple representations, which are also called views. The increase of multiview data in real-world applications has raised the interest in so-called multiview learning (Sun 2013). Multiview clustering techniques try to explore the available multiple representations of data for obtaining a precise and robust partitioning of the data, in contrast to single-view clustering. In general, the existing multiview clustering techniques can be categorized into two groups: centralized and distributed (Tzortzis and Likas 2009). Centralized algorithms simultaneously use all available views to cluster the data (Tzortzis and Likas 2009), whereas distributed algorithms cluster the data using each view independently and then combine the individual clusterings to obtain a single consensus partitioning (Tzortzis and Likas 2009). A two-view expectation maximization based clustering technique was developed by Bickel and Scheffer (2004). In the same article, they also presented a two-view spherical k-means clustering technique under the assumption that the views are independent. Some agglomerative hierarchical multiview clustering algorithms were also proposed for handling text data. Results show that while multiview versions of k-means and EM perform better than their single-view counterparts, some negative results are obtained for agglomerative hierarchical multiview clustering techniques (Bickel and Scheffer 2004). A two-view spectral clustering algorithm that creates a bipartite graph was developed by de Sa (2005). Another multiview spectral clustering technique was proposed by Kumar and Daumé (2011). A convex mixture model based multiview clustering technique was proposed by Tzortzis and Likas (2009).
The same authors later proposed a kernel-based weighted multiview clustering technique (Tzortzis and Likas 2012). A cluster ensemble based approach for multiview clustering was proposed by Xie and Sun (2013). Two approaches, namely multiview spectral clustering and multiview kernel k-means, were proposed to handle the problem of multiview clustering by combining the clusterings originating from different views by an ensemble approach (Xie and Sun 2013). A new multiview based k-means clustering technique was developed by Cai et al. (2013) for large-scale datasets. It can be executed in parallel using multicore processors. A gene expression data clustering technique was proposed in recent years (Zeng et al. 2010), which combines information from multiple sources. The Multi-Source Clustering (MSC) algorithm merges information from different sources of data. Here, information from one complete source is considered as primary and information from some incomplete sources is considered as constraints. Results show that the MSC algorithm produces more biologically meaningful clusters than those produced considering only a single source of information. This in turn proves the utility of combining information from multiple sources. Wahid et al. (2014) have developed a cluster ensemble based technique to solve the multiview clustering problem for web documents. The method produces different clustering solutions utilizing separate views of the data and finally a combination of these clusters is formed to generate a final partitioning.

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html.

² http://credo.fub.it/ambient.

2.2 Existing Multiobjective Optimization Based Clustering Techniques

For many datasets, it is difficult to identify one unique, unambiguous partitioning. In order to capture all the alternative partitionings of a particular dataset and also to capture clusters having different shapes, the problem of clustering is modeled as MOO (Deb 2001). If clusters of different structures exist in a single partitioning, then it would be difficult to identify all of them by optimizing a single cluster quality measure. In order to solve this problem, multiple cluster quality measures, capable of identifying clusters having different shapes at a time, need to be optimized simultaneously. MOO can be employed to address this.

Handl and Knowles (2007) proposed a multiobjective clustering technique, called MOCK, where a locus-based adjacency representation (Park and Song 1998) is used to encode partitions in the form of a string. The first objective is capable of determining hyperspherically shaped clusters, whereas the second objective is used to handle the connected structures. Results over several artificial and real-world datasets show that MOCK performs better than some recently developed ensemble based clustering techniques, some SOO based clustering algorithms and two different model selection approaches. A multiobjective evolutionary algorithm for fuzzy clustering was also proposed in recent years (Bandyopadhyay et al. 2007a, 2007b). Here again, two objective functions, the one from Fuzzy C-means (Bezdek 1981) and the Xie-Beni index (Xie and Beni 1991a), are simultaneously optimized. This proposed algorithm is further extended for solving the categorical data clustering problem (Mukhopadhyay et al. 2009). A new technique combining the advantages of a recently proposed multiobjective fuzzy clustering (Bandyopadhyay et al. 2007a, 2007b) and the support vector machine (SVM) classifier was also developed (Maulik et al. 2009; Mukhopadhyay and Maulik 2009), which produces a single solution upon termination. The effectiveness of this technique was shown for automatically partitioning satellite images (Bandyopadhyay et al. 2007a) and gene expression data (Bandyopadhyay et al. 2007b).

Some symmetry based multiobjective clustering techniques were developed by Saha and Bandyopadhyay (2010a, 2010b, 2013). For the purpose of clustering, similarity or proximity measures play an important role. These measures are used to assign points to different clusters. The concept of “symmetry” is inherent in many real-world objects. It is an important attribute for the recognition and identification of objects. Inspired by this observation, some symmetry based similarity measures were proposed in the recent literature (Bandyopadhyay and Saha 2007; Saha and Bandyopadhyay 2009). Bandyopadhyay et al. (2008) proposed an AMOSA based clustering technique that is capable of determining clusters having different shapes as long as those satisfy some properties of symmetry. This algorithm was extended to determine a proper partitioning of the given dataset after a single execution (Saha and Bandyopadhyay 2010b).


Another multiobjective clustering technique was developed that utilizes a new measure of stability of clustering solutions as the objective function (Saha and Bandyopadhyay 2010a). All these existing clustering algorithms are capable of either handling compact hyperspherically-shaped clusters or clusters having symmetrical shapes. The MOCK clustering technique (Handl and Knowles 2007) is able to handle clusters having connected structures or hyperspherical shapes. In order to automatically discover clusters having various properties like symmetrical shapes, hyperspherical shapes or connected structures after running a single algorithm, a generalized clustering algorithm was developed (Saha and Bandyopadhyay 2013). However, all of the above mentioned algorithms have considered only a single view of the dataset at the time of partitioning. Multiple different views of a dataset are not considered at the time of partitioning the dataset.

2.3 Drawbacks of the Existing Literature

Most of the existing multiview clustering techniques are based on the concepts of SOO. A single quality measure for partitioning is optimized implicitly or explicitly using various paradigms of single-view learning. In recent years two multiobjective based multiview clustering approaches have been developed; however, they assume that the number of clusters present in a dataset is known a priori. In summary, existing techniques suffer from the following drawbacks:

(1) Most of the existing multiview clustering techniques are based on the concepts of SOO. A single quality measure for partitioning is optimized implicitly or explicitly using various paradigms of unsupervised single-view learning.

(2) None of the existing multiobjective based multiview approaches follow a partitional approach to clustering a dataset described by multiple views. A way of automatically determining the optimal partitioning of a given dataset satisfying multiple views, after a single execution of an algorithm (without simply combining the outputs of multiple clustering algorithms), still needs to be developed.

(3) Most of the previous approaches assume that the number of clusters is known beforehand and can be specified by the user. Thus, some automatic approaches which can determine an appropriate number of clusters present in a dataset are called for as well.

(4) The more advanced multiobjective based clustering techniques available in the literature do not consider multiple views while clustering the datasets.

These motivated us to develop an automated multiobjective based approach for multiview clustering. The key attributes of the approach are as follows:

— It automatically determines the number of clusters from a dataset.

— All the views will be considered simultaneously at the time of partitioning the dataset.

— Multiple partitionings will be generated satisfying different cluster quality measures. All these partitionings will correspond to different ways of grouping the data.

— Multiple cluster quality measures will be optimized simultaneously using the search capability of a MOO based approach.

— Finally, a new way of measuring the agreement between partitionings generated using different views is proposed and utilized to obtain an ultimate consensus partitioning satisfying multiple views.

3 MULTIOBJECTIVE OPTIMIZATION BASED MULTIVIEW CLUSTERING

In this section, we describe our proposed multiobjective multiview based clustering technique in detail. Note that the search capability of AMOSA is used here as the underlying optimization strategy. However, the proposed clustering methodology is generic in nature and any other optimization strategy could have been utilized in place of AMOSA (Bandyopadhyay et al. 2008).


3.1 Problem Formulation

The multiview clustering problem is formulated as a MOO problem.

— Given:

— A dataset of $n$ samples, $S = \{x_1, x_2, \ldots, x_n\}$,

— each described by $d$ different features

— and in total $m$ different views, and

— a set of objective functions $CV_1, CV_2, \ldots, CV_m, AI$, where each $CV_i$ is a cluster validity index measured on the partitioning obtained after considering only view $i$ for the given dataset, and $AI$ measures the agreement between the partitionings obtained for different views.

— Find:

— A consensus partitioning ($U$) satisfying all views:

— The set of data points, $S$, is divided into $K$ clusters, $\{U_1, U_2, \ldots, U_K\}$.

— $U_i = \{x^i_1, x^i_2, \ldots, x^i_{n_i}\}$; $n_i$: number of points in cluster $i$; $x^i_j$: $j$th point of cluster $i$.

— $\bigcup_{i=1}^{K} U_i = S$ and $U_i \cap U_j = \emptyset$ for all $i \neq j$.

— which simultaneously optimizes the objective functions. The simultaneous optimization of these objectives provides a Pareto optimal front.
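To illustrate what a Pareto optimal front means for objective vectors of the form (CV1, ..., CVm, AI), the following minimal Python sketch filters a set of candidate solutions down to the nondominated ones. It assumes that every objective is to be maximized, which is an assumption made here for illustration; the function names and the numeric values are hypothetical and are not taken from the article.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a dominates b, assuming all objectives are maximized.

    a dominates b when it is at least as good in every objective and
    strictly better in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def nondominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (CV_1, CV_2, AI) values for four candidate solutions.
candidates = [
    (0.9, 0.2, 0.5),
    (0.6, 0.6, 0.6),
    (0.5, 0.5, 0.5),  # dominated by (0.6, 0.6, 0.6)
    (0.2, 0.9, 0.4),
]
front = nondominated(candidates)
```

The three remaining vectors are mutually incomparable tradeoffs: each is better than the others in at least one objective, which is why a MOO method returns a set of solutions rather than a single one.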

3.2 AMOSA: Underlying Optimization Strategy

Our problem statement of MOO for multiview clustering requires the optimization of three or more objective functions simultaneously. As the AMOSA (Bandyopadhyay et al. 2008) algorithm is particularly useful for this kind of problem, we chose it as the underlying optimization methodology. In the following, we give a short description of this algorithm, which is based on SA (Kirkpatrick et al. 1983).

SA is a popular optimization technique based on the annealing process in metallurgy. A convergence proof of SA also exists in the literature, which makes SA popular for solving different hard optimization problems (Hastings 1985). In SA, there is a concept of acceptance probability, which is calculated using the energy difference between the current solution and the new solution. This probability function is further utilized during the process of selection of a new solution. In the case of MOO, it is difficult to generate the acceptance probability considering multiple objectives. Moreover, SA can produce only a single solution after a single run. As MOO requires identifying a set of tradeoff solutions, SA has to be executed for a MOO problem multiple times to get the set of nondominated solutions. These problems hinder the process of extending SA to solve MOO problems. In AMOSA (Bandyopadhyay et al. 2008), several new concepts are incorporated to solve the above mentioned problems.
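For reference, the acceptance probability of classical single-objective SA can be sketched as below. This is the standard Metropolis criterion for a minimization problem, shown only to make the "energy difference" notion concrete; it is not the multiobjective acceptance rule of AMOSA, and the function names are illustrative.

```python
import math
import random


def acceptance_probability(current_energy: float, new_energy: float,
                           temperature: float) -> float:
    """Metropolis acceptance probability for single-objective minimization."""
    if new_energy <= current_energy:
        return 1.0  # a solution that is not worse is always accepted
    # A worse solution is accepted with a probability that shrinks as the
    # energy difference grows or as the temperature drops.
    return math.exp(-(new_energy - current_energy) / temperature)


def accept(current_energy: float, new_energy: float, temperature: float,
           rng: random.Random) -> bool:
    """Draw the accept/reject decision for one candidate move."""
    return rng.random() < acceptance_probability(
        current_energy, new_energy, temperature)
```

With a single energy value the difference is a scalar; with several objectives there is no single such difference, which is the difficulty that the text refers to.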

The key attributes of AMOSA are the following:

(1) An Archive is used to store all the nondominated solutions generated during the search process.

(2) Two limits, a hard limit denoted by HL and a soft limit denoted by SL, restrict the size of the Archive. During execution, the Archive size is allowed to grow up to SL; when the Archive size exceeds SL, single linkage clustering (Jain et al. 1999) is used to reduce the size to HL.

(3) The Archive is initialized with (γ × SL) random solutions, where γ > 1. Each solution corresponds to some state in the search space. These initial solutions are then refined using some steps of hill climbing.

(4) The current point is selected by randomly picking a solution from the Archive at temperature T = Tmax. A new solution called new-pt is generated after application of mutation operations on the current point.

(5) A large number of cases are defined depending on the domination status of the new-pt with respect to current-pt and the remaining solutions in the Archive to select the next current point. There are three outcomes possible depending on the domination status of the new solution, namely (i) accept the new solution, (ii) accept the current solution, or (iii) accept a solution from the Archive.

(6) The above mentioned process (steps 4–5) is performed iter times at each temperature, and the temperature is decreased with a cooling rate of α (< 1) until the minimum temperature Tmin is attained.

(7) At the end of the execution, the final Archive contains a set of nondominated solutions.

(8) The concept of geometric cooling is incorporated for decreasing the temperature: $T_{k+1} = \alpha \times T_k$, where α is the cooling rate.

Fig. 1. Flow chart of the first proposed approach (MOO-Multi1).

Experimental results show that AMOSA performs comparably to existing multiobjective evolutionary algorithms like NSGA-II (Deb et al. 2002) on different benchmark two-objective test problems, but it outperforms them on test problems with three or more objectives, which are also known as many-objective test problems (Bandyopadhyay et al. 2008).
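The geometric cooling rule T_{k+1} = α × T_k fixes how many temperature levels a run visits. The short Python sketch below enumerates them; the parameter values (Tmax = 100, Tmin = 1, α = 0.5, iter = 10) are purely hypothetical and chosen only to keep the example small.

```python
from typing import List


def cooling_schedule(t_max: float, t_min: float, alpha: float) -> List[float]:
    """List the temperatures visited by geometric cooling, T_{k+1} = alpha * T_k."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if t_min <= 0.0 or t_max < t_min:
        raise ValueError("require 0 < t_min <= t_max")
    temperatures = []
    temp = t_max
    while temp >= t_min:  # same stopping test as the annealing loop
        temperatures.append(temp)
        temp *= alpha
    return temperatures


# Hypothetical settings, not the values used in the article.
schedule = cooling_schedule(t_max=100.0, t_min=1.0, alpha=0.5)
iterations_per_temperature = 10
total_perturbations = iterations_per_temperature * len(schedule)
```

A cooling rate closer to 1 gives more temperature levels and therefore more perturbations for the same Tmax and Tmin, at the cost of a longer run.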

3.3 Overview of MOO-Multi1

A detailed description of the different steps of the proposed algorithm (MOO-Multi1) is provided below; the pseudocode is shown in Algorithm 1 and the following algorithms. A flow chart with the different steps of MOO-Multi1 is shown in Figure 1. The algorithm starts with the generation of different views for a given dataset. Some possible initial solutions are generated randomly and stored in the Archive of the AMOSA process. A novel way of representing the partitioning solutions in the form of strings is proposed here. Based on the centers encoded in a solution/string, different partitionings are identified exploring different views. The minimum center-distance based criterion is utilized for the purpose of assigning points to different clusters. The qualities of these different partitionings obtained after varying the views are judged with the help of a cluster validity index. The consensus between different partitionings is also measured using a newly developed Agreement Index. AMOSA is employed to simultaneously optimize these multiple objective functions. In order to explore the search space efficiently and to automatically determine the number of clusters, several mutation operators are proposed to generate a new solution from the current solution. Finally, the steps of AMOSA are followed to automatically identify the optimal partitioning and the optimal number of clusters.

ALGORITHM 1: MOO-Multi1

    Set Tmax, Tmin, no_views, HL, SL, Iter, α, temp = Tmax
        /* no_views is the total number of views of the dataset */
    begin
        Initialize pool    /* Size of pool = SL */
        for i = 1 to pool_size do
            for j = 1 to no_views do
                ComputeMembership(pool[i], j)
                UpdateCenter(pool[i], j)
            end
            CombineCenter(pool[i])
            for j = 1 to no_views do
                ComputeFitnessPBM(pool[i], j)    /* Compute PBM for each view */
            end
            ComputeFitnessAI(pool[i])
        end
        Initialize Archive with the nondominated solutions of pool
        current_pt = random(Archive)
        while temp ≥ Tmin do
            new_pt = perturb(current_pt)
            for j = 1 to no_views do
                ComputeMembership(new_pt, j)    /* k-Means-like assignment step for view j */
                UpdateCenter(new_pt, j)         /* k-Means-like cluster center calculation step for view j */
            end
            CombineCenter(new_pt)
            for j = 1 to no_views do
                ComputeFitnessPBM(new_pt, j)    /* Compute PBM for each view */
            end
            ComputeFitnessAI(new_pt)
            Compute dominance of current_pt and new_pt, update Archive
            temp = α × temp
        end
    end
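Algorithm 1 describes UpdateCenter only as a "k-Means-like cluster center calculation step". A minimal Python sketch of such a step for one view is given below. It assumes that each center is replaced by the mean of the points currently assigned to it and that an empty cluster keeps its previous center; the handling of empty clusters and all names are assumptions made for illustration, not details stated in the article.

```python
from typing import List


def update_centers(points: List[List[float]], labels: List[int],
                   centers: List[List[float]]) -> List[List[float]]:
    """Recompute each center as the mean of its assigned points (one view)."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point, k in zip(points, labels):
        counts[k] += 1
        for t in range(dim):
            sums[k][t] += point[t]
    new_centers = []
    for k, old_center in enumerate(centers):
        if counts[k] == 0:
            # Assumption: a cluster without points keeps its old center.
            new_centers.append(list(old_center))
        else:
            new_centers.append([s / counts[k] for s in sums[k]])
    return new_centers


# Three 2-D points, two of them assigned to cluster 0, one to cluster 1.
example_points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
example_labels = [0, 0, 1]
example_centers = [[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
updated = update_centers(example_points, example_labels, example_centers)
```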

3.4 String Representation and Archive Initialization

The first step of the proposed clustering approach is to initialize the Archive used in AMOSA (Bandyopadhyay et al.2008) with some alternative solutions. The solutions are generated randomly. Here each solution contains a set of cluster centers in order to represent the clustering. Each archive member represents a set of cluster centers based on which we can obtain partitionings using multiple views. Different lengths are associated with different archive mem- bers. We assume that a particular dataset contains n samples where each one of them is having d different features. Let us assume that each sample point has m different views, V1, V2, . . . ,Vm. Here,


ALGORITHM 2: Generate Cluster Partitions.

procedure: ComputeMembership(Element, view_index) /* view_index refers to the view currently being processed */
Set min, distance, cluster_index
begin
    /* no_points = total number of samples present in the dataset */
    for i = 1 to no_clusters do
        for j = 1 to no_points do
            Element.Membership[view_index][i][j] = 0
        end
    end
    for i = 1 to no_points do
        min = MAX_DOUBLE_VALUE
        for j = 1 to no_clusters do
            /* calculate the Euclidean distance */
            distance = cal_distance(Element.center[view_index][j], dataset[view_index][i])
            if distance ≤ min then
                min = distance
                cluster_index = j
            end
        end
        Element.Membership[view_index][cluster_index][i] = 1
    end
end

$V_1 = \{f_1^1, \ldots, f_1^{n_1}\}$ (view 1 consists of a set of $n_1$ features), $V_2 = \{f_2^1, \ldots, f_2^{n_2}\}$ (view 2 consists of $n_2$ features), and so on, up to $V_m = \{f_m^1, \ldots, f_m^{n_m}\}$ (the mth view consists of $n_m$ features), and $d = \sum_{i=1}^{m} n_i$. The initial dataset consists of samples having all the features belonging to the different views.

The values m, $n_i$ (i = 1, . . . , m), and d are specific to a dataset. Archive member i represents the centroids of $K_i$ clusters and has length $l_i = d \times K_i$. The $d \times K_i$ real values represent the coordinates of the $K_i$ centers. For the purpose of initialization, $K_i$ points are randomly selected from the dataset and used as the initial cluster centers.

The number of cluster centers encoded in a particular solution i, denoted by $K_i$, is selected randomly from the given range $K_{min}$ to $K_{max}$ as follows:

$$K_i = (\mathrm{rand}() \bmod (K_{max} - 1)) + K_{min}. \quad (1)$$

Here, rand() is a function which generates a random integer. $K_{max}$ stands for the upper limit of the number of clusters. The lower limit, denoted by $K_{min}$, is fixed to 2.
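The initialization can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `draw_num_clusters` and `init_centers` are hypothetical, and `random.Random.randrange` is used as a stand-in for the integer-valued rand(). Because the lower limit is fixed to 2, the expression of Equation (1) yields values in exactly the range 2 to Kmax; for a different lower limit the largest value would be Kmin + Kmax − 2.

```python
import random


def draw_num_clusters(k_min: int, k_max: int, rng: random.Random) -> int:
    """Draw K_i = (rand() mod (K_max - 1)) + K_min, following Equation (1)."""
    r = rng.randrange(2**31)  # assumption: stand-in for an integer-valued rand()
    return (r % (k_max - 1)) + k_min


def init_centers(data, k, rng: random.Random):
    """Select k distinct data points as the initial cluster centers."""
    return [list(point) for point in rng.sample(list(data), k)]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[0.0, 0.1], [1.0, 0.9], [5.0, 5.2], [6.1, 5.8], [9.0, 0.3]]
    k = draw_num_clusters(2, 4, rng)
    print(k, init_centers(data, k, rng))
```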

The structure of each solution or archive_element is shown in Figure 2. Here, a sample of d feature values is represented as a data point, and each cluster centroid $c_k$ is also represented as a vector of d feature values.

3.5 Formation of Clusters and Objective Function Calculations

After initializing the archive members with some randomly selected cluster centroids, the following steps are executed to compute different objective functions. The search capability of AMOSA can be utilized to simultaneously optimize these objective functions.


Fig. 2. Representation of a partitioning in the form of a solution. Here, a center-based representation is used.

In this figure, we have assumed that there are K = 2 clusters with two views: the first view has 3 features and the second view has 2 features. $C_i^j(k)$ represents the value of the kth dimension of the jth view for the ith cluster. For example, $C_1^1(2)$ represents the value of the second dimension of the first view for the first cluster.

ALGORITHM 3: Update Cluster Centers.

procedure: UpdateCenter(Element, view_index)
Set sum_points[dim], count = 0 /* dim is the number of features present in the current view (the view referred to by view_index) */
begin
    for i = 1 to no_clusters do
        count = 0
        for k = 1 to dim do
            sum_points[k] = 0
        end
        for j = 1 to no_points do
            if Element.Membership[view_index][i][j] = 1 then
                for k = 1 to dim do
                    sum_points[k] = sum_points[k] + dataset[view_index][j][k]
                end
                count = count + 1 /* one increment per member point */
            end
        end
        for k = 1 to dim do
            Element.center[view_index][i][k] = sum_points[k]/count
        end
    end
end
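Algorithms 2 and 3 together form one k-means-like step for a single view. The following NumPy sketch shows a vectorized version under two assumptions that are not part of the pseudocode: ties in the distance are resolved in favor of the lowest cluster index (Algorithm 2 uses `≤` and therefore prefers the highest one), and an empty cluster keeps its previous center instead of dividing by zero. The function names are hypothetical.

```python
import numpy as np


def compute_membership(view_data: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Return a binary K x n membership matrix for one view (cf. Algorithm 2)."""
    # Pairwise Euclidean distances between points (n) and centers (K): shape (n, K).
    dists = np.linalg.norm(view_data[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    membership = np.zeros((centers.shape[0], view_data.shape[0]), dtype=int)
    membership[labels, np.arange(view_data.shape[0])] = 1
    return membership


def update_centers(view_data: np.ndarray, membership: np.ndarray,
                   old_centers: np.ndarray) -> np.ndarray:
    """Recompute each center as the mean of its member points (cf. Algorithm 3)."""
    counts = membership.sum(axis=1)          # number of points per cluster
    sums = membership @ view_data            # per-cluster coordinate sums
    new_centers = old_centers.astype(float).copy()
    nonempty = counts > 0                    # assumption: empty clusters are left unchanged
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers


if __name__ == "__main__":
    view = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    centers = np.array([[0.0, 0.0], [10.0, 10.0]])
    mem = compute_membership(view, centers)
    print(mem)
    print(update_centers(view, mem, centers))
```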

(1) First, the set of cluster centers present in the string is extracted. Let the entire set be $\{\mathbf{C}_1, \mathbf{C}_2, \ldots, \mathbf{C}_K\} = \{C_1^1, C_1^2, \ldots, C_1^d, \ldots, C_K^1, C_K^2, \ldots, C_K^d\}$. Here, $\mathbf{C}_1$ is the first cluster center vector of length d, $C_i^j$ denotes the jth dimensional value of the ith cluster center, and K is the number of clusters encoded in that particular string. k-means clustering is applied to the dataset using this set of cluster centers for the different views.

(2) The PBM-index value is calculated for the final partitioning obtained using view v. Let the value be denoted by PBMv.

(3) The adjoint matrix ($A^v$ of size n × n, where n is the size of the dataset) corresponding to view v is calculated as follows:

$$A^v_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ belong to the same cluster},\\ 1 & \text{if } i = j,\\ 0 & \text{otherwise.} \end{cases} \quad (2), (3)$$


ALGORITHM 4: Combine Clusters having Maximum Intersecting Points from Different Views.

procedure: CombineCenter(Element)
Set temp[no_points], temp_string[no_points], sum[no_points], max_string[no_points], index, max, count, flag = 0
begin
    for i = 1 to no_clusters do
        for j = 1 to no_points do
            sum[j] = max_string[j] = Element.membership[1][i][j] /* initialize sum and max_string with a cluster string from the first view */
        end
        /* start from second view */
        for j = 2 to no_views do
            index = −1
            max = 0
            count = 0
            for k = 1 to no_clusters do
                for l = 1 to no_points do
                    temp[l] = Element.membership[j][k][l]
                end
                /* determine the number of positions with a one in both strings */
                count = similarity_count(temp, max_string)
                if count > max then
                    index = k
                    max = count
                    flag = 1
                end
            end
            if flag = 1 then
                flag = 0
                for l = 1 to no_points do
                    temp_string[l] = Element.membership[j][index][l]
                end
                /* is the number of ones in the first string greater than the number of ones in the second string? */
                if count_ones(temp_string) > count_ones(max_string) then
                    max_string = temp_string
                end
                for l = 1 to no_points do
                    sum[l] = sum[l] + Element.membership[j][index][l]
                end
            end
        end
        for l = 1 to no_points do
            Element.final_membership[i, l] = (sum[l]/no_views) > 0.5 ? 1 : 0
        end
    end
    Compute centers with updated Membership and compute final partition.
end


ALGORITHM 5: Compute Agreement Index

procedure: ComputeFitnessAI()
Set agree = 0, disagree = 0, adj[no_views][no_points][no_points], total_ai = 0, AgreementIndex
begin
    Compute adjacency matrix for each view
    Initialize adj
    for v1 = 1 to (no_views − 1) do
        for v2 = (v1 + 1) to no_views do
            agree = 0
            disagree = 0
            for n1 = 1 to no_points do
                for n2 = 1 to no_points do
                    if adj[v1][n1][n2] = adj[v2][n1][n2] then
                        agree = agree + 1
                    else
                        disagree = disagree + 1
                    end
                end
            end
            total_ai = total_ai + (agree + 1)/(disagree + 1) /* pairwise agreement index as in Equation (6) */
        end
    end
    AgreementIndex = (2 × total_ai)/(no_views × (no_views − 1))
end

(4) A new objective function, the Agreement Index, measures the agreement between the partitionings obtained using multiple views. It is calculated as follows:

— Two views, v1 and v2, are considered at a time. Let the corresponding adjoint matrices be $A^{v1}$ and $A^{v2}$, respectively.

— The number of agreements ($n_a$) is calculated as $n_a = \sum_{i=1}^{n} \sum_{j=1}^{n} I_{A^{v1}_{ij}, A^{v2}_{ij}}$, where

$$I_{A^{v1}_{ij}, A^{v2}_{ij}} = \begin{cases} 1 & \text{if } A^{v1}_{ij} = A^{v2}_{ij},\\ 0 & \text{otherwise.} \end{cases} \quad (4), (5)$$

— The number of disagreements ($n_d$) is calculated as follows: $n_d = n^2 - n_a$.

— The Agreement Index between these two views (v1, v2) is calculated as follows:

$$AI_{v1,v2} = \frac{n_a + 1}{n_d + 1}. \quad (6)$$

The values of 1 in the numerator and denominator are used as a normalization factor to avoid a division by zero.

— The total Agreement Index for the entire partitioning is calculated as follows:

$$AI = \frac{\sum_{i=1}^{m} \sum_{j=1, j \neq i}^{m} 2 \times AI_{v_i, v_j}}{m \times (m - 1)}, \quad (7)$$

where m is the total number of views available.


(5) The objective functions corresponding to a particular string are {PBM1, . . . , PBMm, AI }.

The search capability of AMOSA is used to simultaneously maximize these objective functions. The aim is to identify some good partitionings using different views, which are also consensus (similar) partitionings across different views.
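The computation of the Agreement Index can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions: each view's partitioning is given as a vector of cluster labels, pairs of views are visited once (v1 < v2) as in Algorithm 5, and every pair is scored with the ratio of Equation (6). The function names are hypothetical.

```python
import numpy as np


def adjacency(labels) -> np.ndarray:
    """n x n matrix with 1 where two points share a cluster (diagonal is 1)."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)


def pairwise_agreement_index(labels_a, labels_b) -> float:
    """(n_a + 1) / (n_d + 1) for the partitionings of two views."""
    a, b = adjacency(labels_a), adjacency(labels_b)
    n_a = int((a == b).sum())     # number of agreeing matrix entries
    n_d = a.size - n_a            # n^2 - n_a
    return (n_a + 1) / (n_d + 1)


def agreement_index(view_labels) -> float:
    """Average pairwise agreement over all unordered pairs of views."""
    m = len(view_labels)
    total = 0.0
    for v1 in range(m - 1):
        for v2 in range(v1 + 1, m):
            total += pairwise_agreement_index(view_labels[v1], view_labels[v2])
    return (2 * total) / (m * (m - 1))


if __name__ == "__main__":
    same = agreement_index([[0, 0, 1, 1], [1, 1, 0, 0]])       # identical up to renaming
    different = agreement_index([[0, 0, 1, 1], [0, 1, 0, 1]])  # views disagree on many pairs
    print(same, different)
```

Because the index compares adjacency matrices rather than raw labels, it is unaffected by a renaming of the clusters in one of the views.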

Please note that any other cluster validity index could have been used in place of the PBM-index. In order to assess the effectiveness of the PBM-index, we have replaced it by the XB-index (Xie and Beni 1991b) in a part of the experiments. In that case, the objective functions to be optimized by AMOSA are $\{\frac{1}{XB_1}, \ldots, \frac{1}{XB_m}, AI\}$, where $XB_m$ is the XB-index value calculated for the partitioning obtained using the features corresponding to view m. Generally, the minimum value of the XB-index corresponds to an optimal partitioning, so its reciprocal is used and all these objective functions are simultaneously maximized by AMOSA.

3.6 Update of String

After the objective functions are calculated, a consensus partitioning is obtained that satisfies all the available views. The cluster centers corresponding to this consensus partitioning are used to update the string of AMOSA.

— Let the partitioning obtained using multiple views be represented by π¹, π², . . . , π^m. Let us denote the jth cluster of partition v as π_j^v. First, some reordering is done among all the obtained partitionings so that there is a one-to-one correspondence between the cluster numbers of different partitionings.

— In order to obtain a combined clustering result, each data point is eventually assigned to the cluster with the closest consensus cluster center. To this end, a consensus cluster center $m^j$, having in total d features/dimensions, is first calculated for each cluster considering these multiple partitionings. Only those data points are included for which all views agree:

$$m^j = \frac{\sum_{x_i \in \pi_j^1 \,\&\, x_i \in \pi_j^2 \,\&\, \ldots \,\&\, x_i \in \pi_j^m} x_i}{|\{x_i \mid x_i \in \pi_j^1 \,\&\, x_i \in \pi_j^2 \,\&\, \ldots \,\&\, x_i \in \pi_j^m\}|}. \quad (8)$$

Here, $m^j$ denotes the jth cluster center of the consensus partitioning, and the denominator counts the points that are assigned to cluster j in the partitionings obtained using all the different views.

— Next, the newly generated cluster centers $m^j$, j = 1, . . . , K, are used to obtain the final consensus partitioning as follows:

$$\pi_j = \{x_i \in X : d(x_i, m^j) < d(x_i, m^l) \text{ for } l = 1, \ldots, K,\ l \neq j\}. \quad (9)$$

Here, K is the number of clusters encoded in that solution and $d(x_i, m^j)$ denotes the Euclidean distance between the consensus mean $m^j$ and data point $x_i$. X denotes the set of all data points.

— Finally, using Equation (8), the new consensus cluster centers are again calculated. These new cluster centers m^j, j = 1, . . . , K are used to replace the centers encoded in the string.

So, in order to get a consensus partitioning, initially the common points of different clusters present in different partitionings obtained using different views are identified. These points are further used to determine cluster centers. The other points are assigned to these centers using a minimum distance criterion to get a final consensus partitioning. These cluster centers are used to update the given string.
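The consensus procedure of this section can be sketched as follows. The sketch assumes that the cluster labels of the different views have already been reordered so that label j refers to the same cluster in every view, and that each cluster has at least one point on which all views agree (the function raises an error otherwise, a case the text does not discuss). The function name `consensus_partition` is hypothetical.

```python
import numpy as np


def consensus_partition(data, view_labels, k):
    """Return (labels, centers) of the consensus partitioning."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(view_labels)                  # shape (m, n), one row per view
    agree = np.all(labels == labels[0], axis=0)       # points on which all views agree
    centers = np.empty((k, data.shape[1]))
    for j in range(k):
        core = agree & (labels[0] == j)
        if not core.any():
            raise ValueError(f"no point is assigned to cluster {j} by all views")
        centers[j] = data[core].mean(axis=0)          # consensus center, Equation (8)
    # Assign every point to its closest consensus center, Equation (9).
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    final = np.argmin(dists, axis=1)
    # Recompute the centers from the consensus partitioning; these replace the string.
    for j in range(k):
        if np.any(final == j):
            centers[j] = data[final == j].mean(axis=0)
    return final, centers


if __name__ == "__main__":
    points = [[0, 0], [0, 2], [10, 0], [10, 2], [4, 1]]
    views = [[0, 0, 1, 1, 0], [0, 0, 1, 1, 1]]        # the views disagree on the last point
    print(consensus_partition(points, views, k=2))
```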


3.7 Search Operators

In order to explore the search space efficiently using AMOSA, perturbation operations are introduced. These operators generate new solutions from the current solution, which can then take part in the search process. As the proposed framework is automatic in nature, the number of clusters must also be determined automatically. For this purpose, three different perturbation operators are introduced. The first type makes small perturbations in the existing set of cluster centers. The second type increases the number of clusters in the given solution, and the third type decreases it. For all these operations, cluster centers are considered indivisible, i.e., all the feature values of a cluster center are inserted or deleted simultaneously. Below we describe the three types of mutation operations in detail:

Mutation 1: This is used to make some changes in the existing set of cluster centers. For the update, a random value is drawn from the Laplacian distribution, $p(\epsilon) \propto e^{-|\epsilon - \mu|/\delta}$. Here, μ is set to the old value of the cluster center and δ, which sets the magnitude of the perturbation, is set to 1.0. The newly generated value is used to replace the current feature value of a cluster center. The Laplacian distribution is used so that the probability of generating a value similar to the old value is high. If a particular centroid is selected for the application of the mutation operator, all of its feature values are changed in the above way.

Mutation 2: The purpose of this mutation operator is to increase the number of clusters present in a solution. From the given dataset, a point is selected randomly and this point is inserted in the solution as a new center.

Mutation 3: This type of mutation is used to decrease the number of cluster centers encoded in a solution. A cluster center is randomly selected from the set of cluster centers encoded in the string. This is then deleted from the solution. As a cluster centroid is considered to be indivisible, in the process of removing a cluster centroid all the feature values are removed.

An example of the mutation operation is shown in the supplementary material. Any one of the above discussed mutation operators is applied on a particular solution to generate a new solution which can further participate in the process of AMOSA.
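The three operators can be sketched with NumPy as shown below. This is a minimal illustration with hypothetical function names: a solution is assumed to be stored as a K × d array of centers, and the checks that keep the number of clusters within its allowed range are omitted.

```python
import numpy as np


def perturb_center(centers: np.ndarray, idx: int, rng: np.random.Generator,
                   delta: float = 1.0) -> np.ndarray:
    """Mutation 1: redraw every feature of one center from a Laplacian around its old value."""
    out = centers.astype(float).copy()
    out[idx] = rng.laplace(loc=out[idx], scale=delta)
    return out


def insert_center(centers: np.ndarray, data: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Mutation 2: append a randomly selected data point as a new center."""
    point = data[rng.integers(len(data))]
    return np.vstack([centers, point])


def delete_center(centers: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Mutation 3: remove one randomly selected center with all of its feature values."""
    idx = rng.integers(len(centers))
    return np.delete(centers, idx, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    data = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
    print(perturb_center(centers, 0, rng))
    print(insert_center(centers, data, rng).shape, delete_center(centers, rng).shape)
```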

4 BASELINE METHODS BASED ON CLUSTER ENSEMBLES: MOO-MULTI2 AND MOO-MULTI3

In this section, we describe two baseline methods, MOO-Multi2 and MOO-Multi3, that were inspired by an SOO ensemble approach to multiview clustering (Wahid et al. 2014). Analogously to that SOO method, MOO-Multi2 and MOO-Multi3 follow an ensemble based approach to MOO multiview clustering. The search capability of AMOSA is utilized as above, but again any other MOO algorithm could have been used. The outputs of some well-known clustering techniques like k-means (Jain et al. 1999) and hierarchical clustering techniques like single-linkage (Jain et al. 1999), complete-linkage (Jain et al. 1999), and average-linkage (Jain et al. 1999) are combined efficiently using the two proposed multiobjective ensemble based techniques. The steps of the second approach are shown in Figure 3, and the steps of the third approach are shown in Figure 4. Both algorithms start with the generation of views for a given dataset. Four different clustering techniques, k-means, single linkage, complete linkage, and average linkage, are executed on a given dataset, varying the views. The solutions of the Archive are initialized with the partitionings identified by these clustering techniques. The two proposed approaches differ in the way of defining the objective functions. The mutation operators proposed in the current approaches are used to combine the different partitioning solutions obtained by the simple clustering techniques. The AMOSA process is again followed to optimize different objective functions simultaneously, and also to obtain the optimal number of clusters and the optimal number of partitionings automatically.

Fig. 3. Flow chart of the second approach (MOO-Multi2).

Fig. 4. Flow chart of the third approach (MOO-Multi3).

4.1 String Representation

First, for a given string, the number of clusters (K) to be encoded in it is determined randomly. As in the previous approach, K is selected with uniform probability from the range $K_{min}$ to $K_{max}$. Here, $K_{min} = 2$ and $K_{max} = \sqrt{n}$, where n denotes the number of data points.

Fig. 5. Membership matrices represented in a solution for second and third approaches (MOO-Multi2 and MOO-Multi3). In the example, there are two clustering algorithms, two views, two clusters, and a dataset of 10 points.

Each string contains a set of membership matrices of size K × n. Four different clustering techniques, k-means, single linkage, complete linkage, and average linkage, are run on the given dataset, varying the views, with the number of clusters = K. For example, if the dataset has V views, then k-means is executed V times in total on the same dataset, each time with the set of attributes of the corresponding view and with the number of clusters = K. For each case a membership matrix Mem of size K × n is obtained as follows:

$$Mem_{ij} = \begin{cases} 1 & \text{if } x_j \in \pi_i,\\ 0 & \text{otherwise.} \end{cases} \quad (10), (11)$$

Here, $x_j$ denotes the jth data point and $\pi_i$ denotes cluster i. $Mem_{ij}$ denotes the membership value of the jth data point for the ith cluster. Thus, if the total number of used clustering algorithms = m and the number of views = V, then we obtain m × V membership matrices, each having size K × n.

These membership matrices are encoded in the form of a string. So a string is a collection of m × V binary membership matrices. Figure 5 shows an example of the proposed string representation.

All the strings of the archive are initialized in the above way.

4.2 Computation of Objective Functions

The second and third approach differ in the calculation of their objective functions.

In the second version of the proposed approach the membership matrices are treated separately.

The search capability of AMOSA is used to fine-tune the partitioning obtained by a single clustering algorithm. AMOSA is used to find the optimal partitioning for each of the individual clustering algorithms. After the execution of the entire process of AMOSA, to determine a single consensus partitioning corresponding to a single solution, a new approach is proposed.

For measuring the quality of an individual partitioning, two cluster quality measures, (i) compactness and (ii) separation, are used. Thus, if the total number of used clustering algorithms= m and the number of views= V , then we obtain m × V individual partitionings. The corresponding membership matrices are of size K× n. The compactness of a particular partitioning is calculated as follows:

$$comp = \frac{\sum_{i=1}^{K} maxDia_i}{K}. \quad (12)$$

Here, K is the number of clusters and $maxDia_i$ represents the maximum diameter of the ith cluster of that particular partitioning. $maxDia_i$ is calculated as follows:

$$maxDia_i = \max_{\forall k, j,\ k \neq j \,\&\, Mem_{ij} = Mem_{ik} = 1} d(x_j, x_k),$$

where $d(x_j, x_k)$ denotes the Euclidean distance between two points $x_j$ and $x_k$. Here, $Mem_{ij} = Mem_{ik} = 1$ indicates that the jth and the kth points belong to the ith cluster.

Fig. 6. Mutation operation on membership matrices (used in MOO-Multi2 and MOO-Multi3 approaches).

The separation of an obtained partitioning is calculated as follows:

$$sep = \frac{\sum_{i=1}^{K} minSep_i}{K}. \quad (13)$$

Here, $minSep_i$ represents the minimum separation of the ith cluster from all the other clusters. It is calculated as follows:

$$minSep_i = \min_{\forall k, j,\ k \neq j \text{ and } Mem_{ij} = 1 \,\&\, Mem_{ik} = 0} d(x_j, x_k),$$

where $d(x_j, x_k)$ represents the Euclidean distance between two given points $x_j$ and $x_k$. $Mem_{ij} = 1$ denotes that the jth point belongs to the ith cluster and $Mem_{ik} = 0$ denotes that the kth point does not belong to the ith cluster. Thus, the objective functions to be optimized are $\{comp_1, \frac{1}{sep_1}, \ldots, comp_{m \times V}, \frac{1}{sep_{m \times V}}\}$. The aim is then to get a partitioning where clusters are compact and well-separated from each other. The objective is to simultaneously minimize these objective functions in order to get optimal individual partitionings.
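The two quality measures can be computed directly from a membership matrix, as the following NumPy sketch shows. It assumes that every cluster is non-empty and that at least one point lies outside each cluster; for a cluster with a single point the maximum diameter is taken to be 0. The function name `compactness_and_separation` is hypothetical.

```python
import numpy as np


def compactness_and_separation(data, mem):
    """Return (comp, sep) of Equations (12) and (13) for a K x n membership matrix."""
    data = np.asarray(data, dtype=float)
    mem = np.asarray(mem)
    # All pairwise Euclidean distances between data points: shape (n, n).
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    max_dia, min_sep = [], []
    for row in mem:
        inside = row == 1
        outside = ~inside
        if not inside.any() or not outside.any():
            raise ValueError("each cluster needs at least one member and one non-member")
        max_dia.append(dist[np.ix_(inside, inside)].max())   # largest within-cluster distance
        min_sep.append(dist[np.ix_(inside, outside)].min())  # smallest distance to another cluster
    comp = float(sum(max_dia) / len(max_dia))
    sep = float(sum(min_sep) / len(min_sep))
    return comp, sep


if __name__ == "__main__":
    points = [[0, 0], [0, 1], [5, 0], [5, 3]]
    mem = [[1, 1, 0, 0], [0, 0, 1, 1]]
    comp, sep = compactness_and_separation(points, mem)
    print(comp, 1.0 / sep)  # the two values minimized for this partitioning
```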

In the third approach, the following steps are followed:

— Here, for each string, the membership matrices are first combined to get a single consensus partitioning using the procedure of Section 3.6.

— The compactness and separation values are calculated for this consensus partitioning.

— The objective functions to be optimized are $\{comp, \frac{1}{sep}\}$. These two objective functions are simultaneously minimized using AMOSA.

4.3 Steps of Mutation Operation

The simple binary mutation is applied on each membership matrix encoded in a string. The binary bit value is flipped with some probability. Some points are randomly selected and their membership values are changed. An example of a mutation operation is shown in Figure 6.
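A possible reading of this operator is sketched below. The sketch makes an assumption that is not stated in the text: when a point is selected, it is moved to a different, randomly chosen cluster, so that every column of the membership matrix still contains exactly one 1. The function name `mutate_membership` and the parameter `p` (the per-point mutation probability) are hypothetical.

```python
import numpy as np


def mutate_membership(mem: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Move each point to another cluster with probability p (K >= 2 is assumed)."""
    k, n = mem.shape
    out = mem.copy()
    for j in range(n):
        if rng.random() < p:
            current = int(np.argmax(out[:, j]))
            choices = [c for c in range(k) if c != current]
            new = choices[rng.integers(len(choices))]
            out[:, j] = 0          # clear the old membership bit
            out[new, j] = 1        # set the bit of the new cluster
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mem = np.array([[1, 1, 0, 0, 1],
                    [0, 0, 1, 1, 0]])
    print(mutate_membership(mem, p=0.3, rng=rng))
```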

4.4 Combining Solutions in Second Approach

Upon termination of the second approach, we get a set of membership matrices corresponding to a single solution on the archive. The membership matrices are combined to obtain a single consensus partitioning following the procedure mentioned in Section 3.6. This final partitioning is reported for that particular solution in the archive.

5 RESULTS AND ANALYSIS

In this section, we discuss the datasets used for the experimental analysis. As all the proposed approaches are based on AMOSA, several parameters are involved. Details of the parameters are also provided in this section. We also include a real-world application of the proposed technique, the clustering of web-search results. The obtained results are compared with another multiview based multiobjective clustering technique (Wahid et al. 2014) and several other state-of-the-art techniques.

Table 1. Description of datasets

Datasets            AC   Instances   Actual no.    Features used   No. of PCA features/      Total
                                     of features   in view1        features used in view2    features
Mice                8    1,080       82            80              40                        120
Diabetic            2    1,151       20            19              4                         23
HTRU2               2    17,898      9             8               4                         12
Statlog (Shuttle)   7    58,000      9             9               5                         14
Wine                3    178         13            13              6                         19
Yeast               10   1,484       8             8               4                         12

5.1 Datasets

In this section, we introduce the datasets used in our experiments. For the first batch of experiments, we used a large number of datasets from the UCI Machine Learning Repository³ including Iris, Newthyroid, LiverDisorder, Glass, BreastCancer, Wine, Ionosphere, Yeast, Ecoli, Leaf, Mice Protein Expression, Diabetic Retinopathy Debrecen, HTRU2, and Statlog (Shuttle). The results with some larger datasets like the Mice Protein Expression dataset, Diabetic Retinopathy Debrecen, HTRU2, Statlog (Shuttle), Wine, and Yeast are provided in the main paper and results with other datasets are discussed in the supplement. A description of all the datasets used here for experimentation is provided in Table 1.

As in other publications on multiview clustering, we use two views. The first view is the original set of attributes. The second view is generated by applying Principal Component Analysis (PCA) (Jolliffe 1986) to the dataset. We capture 95% of the data variance while reducing the dimensionality. The set of attributes obtained after application of PCA constitutes the second view. The datasets are standardized to mean 0 and variance 1 before application of the proposed techniques.
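The construction of the two views can be sketched with NumPy alone. The sketch assumes that standardization is carried out before PCA (the text does not fix the order) and computes the principal components via a singular value decomposition; the function name `make_views` is hypothetical.

```python
import numpy as np


def make_views(x, variance_to_keep: float = 0.95):
    """Return (view1, view2): standardized attributes and their leading principal components."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0                       # leave constant attributes unscaled
    view1 = (x - x.mean(axis=0)) / std        # mean 0 and variance 1 per attribute
    # PCA via SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(view1, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Smallest number of components whose cumulative variance reaches the threshold.
    n_comp = int(np.searchsorted(np.cumsum(explained), variance_to_keep) + 1)
    n_comp = min(n_comp, len(s))
    view2 = view1 @ vt[:n_comp].T
    return view1, view2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 3))
    x = np.hstack([base, base @ rng.normal(size=(3, 3))])  # 6 attributes, only 3 independent
    v1, v2 = make_views(x)
    print(v1.shape, v2.shape)
```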

5.2 Comparative Survey

We have compared our proposed approach with several existing approaches: Multiview k-means (Bickel and Scheffer 2004), the Multiview Multiobjective Evolutionary Algorithm (MMOEA) (Wahid et al. 2014), a Multiobjective Multiview based Spectral clustering approach (MOO-Spectral) (Davidson et al. 2013), and a single-view based multiobjective clustering technique, VAMOSA (Saha and Bandyopadhyay 2010b). The first three approaches assume that the number of clusters present in a dataset is known a priori. A brief summary of these approaches is provided in the supplementary file.

5.3 Parameters Used

As the proposed approaches are based on the search capability of AMOSA (Bandyopadhyay et al. 2008), an SA based MOO technique, several parameters are involved. The parameters were selected after conducting a thorough analysis. Some guidelines regarding the selection of parameters are provided in the original paper on AMOSA (Bandyopadhyay et al. 2008). The relationship between different parameter values of AMOSA and a sensitivity analysis of the parameters can be found there as well (Bandyopadhyay et al. 2008). The same instructions are followed here to fix the values of parameters. Finally, we have kept the following values: Tmax = 100, Tmin = 0.00001, iter = 30, cooling rate α = 0.8, SL (soft limit) = 50, and HL (hard limit) = 30. $K_{max}$ is set equal to $\sqrt{n}$, where n is the size of the dataset, and $K_{min} = 2$. In order to measure the quality of the obtained results, some cluster quality measures like the Minkowski Score (Ben-Hur and Guyon 2003), the Adjusted Rand Index (Hubert and Arabie 1985), and Mutual Information (Paninski 2003) are used. All these measures check the agreement of the obtained partitioning with some available ground truth.

³ http://www.ics.uci.edu/~mlearn/MLRepository.html.

Table 2. Results on Some Real-Life Datasets; the Minimum Minkowski Score Values Obtained by Different Clustering Algorithms are Reported

Dataset    MOO-Multi1_PBM   MOO-Multi1_XB   MOO-Multi2   MOO-Multi3   multi-KM   VAMOSA     MOO-Spectral
Mice       0.9827           1.205684        1.34652      1.25745      1.36478    1.35134    1.1966908
Diabetic   0.9900           0.992583        1.12874      1.02894      1.105689   1.11523    0.9962
Shuttle    0.6007216        0.626147        0.68479      0.64854      0.648921   0.64233    0.6895
HTRU2      0.655078015      0.689214        0.70964      0.70498      0.695487   0.68125    0.7285
Wine       0.662679         0.6875          0.6976       0.7154       0.9618     0.97       0.6764
Yeast      1.10384          1.127864        1.1555       1.18057      1.18766    1.831623   1.19975

Note: Low values of the Minkowski Score correspond to good partitionings. The best values for different datasets are marked in bold.
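For readers unfamiliar with the Minkowski Score, the following Python sketch shows the pair-counting form in which it is commonly stated: with $n_{11}$ the number of point pairs that share a cluster in both the ground truth and the obtained partitioning, $n_{10}$ the pairs that share a cluster only in the ground truth, and $n_{01}$ the pairs that share a cluster only in the obtained partitioning, the score is $\sqrt{(n_{01} + n_{10})/(n_{11} + n_{10})}$. This definition is stated here as an assumption, since the formula is not given in this section, and the function name `minkowski_score` is hypothetical.

```python
import math
from itertools import combinations


def minkowski_score(true_labels, pred_labels) -> float:
    """Pair-counting Minkowski Score; 0 means perfect agreement with the ground truth."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            n11 += 1
        elif same_true:
            n10 += 1
        elif same_pred:
            n01 += 1
    # Assumes the ground truth puts at least one pair of points into the same cluster.
    return math.sqrt((n01 + n10) / (n11 + n10))


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(minkowski_score(truth, [1, 1, 0, 0]))  # same partitioning, clusters renamed
    print(minkowski_score(truth, [0, 1, 0, 1]))  # every true pair is split
```

Under this definition the score is not bounded by 1, which is consistent with values above 1 appearing among the reported results.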

5.4 Analysis of Results

We have executed our proposed multiobjective multiview based clustering techniques ten times on a given dataset. As the approaches are based on MOO, several solutions are generated for the final Pareto optimal front. All these solutions represent a tradeoff. In the first approach (MOO-Multi1), we obtain a set of cluster centers in the form of solutions in the final archive. First a consensus partitioning is calculated using these cluster centers following the steps mentioned in Section 3.6. The Minkowski Score value is calculated for this partitioning. The minimum Minkowski Score values obtained by all the solutions produced by this approach for different datasets are reported in Table 2. Here, two versions of MOO-Multi1 are tested. In the first version, the PBM-index is used to measure the quality of individual partitionings obtained from a particular view, and in the second version, the XB-index (Xie and Beni 1991b) is used as the internal cluster validity index. The two versions are named MOO-Multi1_PBM and MOO-Multi1_XB, respectively.

Similarly, MOO-Multi2 returns a set of consensus membership matrices after application of the steps of Section 3.6 on the final solutions of the archive. The Minkowski Score value is calculated for each of these membership matrices. The minimum Minkowski Score values obtained by this approach for different datasets are reported in Table 2. The third approach (MOO-Multi3) also returns a set of consensus membership matrices on the final Pareto optimal front. The minimum Minkowski Score values attained by these solutions are reported in Table 2. The Minkowski Score values obtained by the multiview k-means algorithm and VAMOSA are also reported in Table 2. In case of VAMOSA we have also reported the minimum Minkowski Score values over all the solutions on the final Pareto front. Each of the above mentioned algorithms is executed ten times and finally the minimum Minkowski Score values are reported. Each of the proposed multiobjective based multiview approaches provides a set of solutions on the final Pareto front. We have also plotted the boxplots of Minkowski Score values obtained by different solutions on the Pareto optimal front for different datasets. These plots are shown in the supplementary material. These figures demonstrate that all the solutions produced by MOO-Multi1_PBM are better than the solutions produced by the other two approaches (MOO-Multi1_XB and multi-KM).

Table 2 clearly shows the efficacy of the proposed approaches. It is also evident from Table 2 that the first proposed approach (MOO-Multi1_PBM) performs the best compared to all the other