
A CLUSTERING PROCEDURE

A Thesis Presented to the Faculty of California State Polytechnic University, Pomona

In Partial Fulfillment of the Requirements for the Degree Master of Science in Mathematics

By John Zaheer

2015


SIGNATURE PAGE

THESIS: A CLUSTERING PROCEDURE

AUTHOR: John Zaheer

DATE SUBMITTED: Spring 2015

Mathematics and Statistics Department

Dr. Huburtus F. von Bremen, Thesis Committee Chair, Mathematics & Statistics

Dr. Alan Krinik, Mathematics & Statistics

Dr. Ryan S. Szypowski, Mathematics & Statistics


ACKNOWLEDGMENTS

I would like to take this opportunity to thank my thesis adviser Dr. von Bremen for always challenging me and for providing me with a guiding hand in my academic career as well as in my life. I would also like to thank my family and girlfriend, Alexandria, for being the rock that I could lean on through the Masters program at Cal Poly Pomona. Last, but definitely not least, I would like to acknowledge the friends that I have made throughout the program for making my stay at Cal Poly Pomona worthwhile.


ABSTRACT

In this paper we propose a procedure to obtain an optimal clustering solution for a given set of data. The main idea that runs through this thesis is the clustering of a data set with the consideration of multiple clustering solutions. This is achieved by first finding multiple clustering solutions through the k-means and/or spectral clustering algorithms. From these clustering solutions a cluster ensemble is formed. In order to avoid losing sight of the original data matrix, a hybrid bipartite graph formulation is applied to the ensemble. To obtain a clustering solution from the resulting bipartite graph we may consider a deterministic linear programming algorithm and/or the PivotBiCluster algorithm. Upon the selection of the PivotBiCluster algorithm we are finally able to obtain what can be considered an optimal clustering solution.

The example carried throughout the thesis illustrates the process used. The example shows that the procedure can be used to obtain a clustering solution whose cost is close to that of the true clustering solution. Furthermore, an example that classifies wheat seeds, presented in Chapter 5, exposes the procedure's dependence on the clustering solutions within the ensemble. Although k-means outperforms the procedure in terms of the cost of the clustering solution, the procedure theoretically holds for multiple values of k and can consider solutions from different algorithms that may expose different structures within the data set, which can potentially give a better clustering in terms of the data set.

Contents

List of Figures

1 Introduction

2 Solutions to the Clustering Problem
2.1 Notation and Definitions for K-means
2.2 K-means Algorithm
2.3 Notation and Definitions for Spectral Clustering
2.4 Spectral Clustering Algorithm
2.5 Example of Producing Multiple Solutions

3 Cluster Ensemble Problem and Hybrid Bipartite Formulation
3.1 Cluster Ensemble Problem
3.2 Hybrid Bipartite Graph Formulation
3.3 Example of the Transitions

4 Bipartite Correlation Clustering
4.1 Deterministic Linear Programming Algorithm
4.2 PivotBiCluster Algorithm

5 Another Classification Example

6 Conclusion

Bibliography

A MATLAB Code
A.1 kmeansolutions
A.2 specsolutions
A.3 ensemble
A.4 biform
A.5 combi
A.6 wheatscatter

List of Figures

1.1 Flow chart of the possible procedures.
3.1 Depiction of the formulation of the bipartite graph.
4.1 Depiction of the correlations R1, R2, and R12.
4.2 Depiction of the final clustering solution for the iris data.
4.3 The clustering cost of each clustering solution for the iris data.
5.1 Depiction of the final clustering solution for the wheat seed data matrix with characteristics 1, 2 and 3.
5.2 Depiction of the final clustering solution for the wheat seed data matrix with characteristics 4, 5 and 6.
5.3 The clustering cost of each clustering solution for the wheat seed data.
5.4 The clustering cost of the k-means solutions excluding outliers.

Chapter 1

Introduction

A cluster can be defined as a set of nodes that share similarities. The clustering problem can thus be described as the partitioning of a quantitative data set into disjoint clusters using a metric that measures the similarity of the data nodes. The clustering problem is usually solved with the objective of minimizing the differences among the data nodes within each cluster. The solution to the clustering problem is a set of disjoint clusters whose union equals the set of nodes originally given; this set is referred to as a clustering solution. The most common and widely used algorithms for solving the problem are the k-means and spectral clustering algorithms, which will be discussed in Chapter 2. One major issue with these two algorithms is that both require the number of clusters desired in the output as an input. However, the number that optimizes the similarities within the clustering solution is difficult to determine in advance. Furthermore, running the algorithms many times on a data matrix can result in different clustering solutions, due to the complexity of the data and the sensitivity of the algorithms to their random initialization. This leads to the question: how can we use the information presented by the multiple solutions to gather the data nodes into a final clustering?

In Meta Clustering [2], Caruana, Elhawary, Nguyen, and Smith define the set of multiple clustering solutions to be a cluster ensemble. The paper introduces the idea of using the information presented by the cluster ensemble to find a single clustering solution, which they describe as the cluster ensemble problem. One major difficulty pertaining to the cluster ensemble problem is finding the optimal clustering solution. Each solution in the cluster ensemble can have a different number of clusters, representing different structures within the data set. To avoid constraining the final clustering solution by dictating the number of output clusters, different types of clustering algorithms were sought, in particular ones that do not require the number of final clusters as an input.

Through further investigation of the different types of clustering processes, the idea of solving the cluster ensemble problem through bipartite graph partitioning was found in Solving Cluster Ensemble Problems by Bipartite Graph Partitioning by Fern and Brodley. In [1] the authors introduce a hybrid bipartite graph formulation that converts the cluster ensemble problem into a bipartite graph partitioning problem without loss of the original data structure. That is to say, given a bipartite graph constructed through the hybrid bipartite graph formulation, one can easily recreate the original cluster ensemble, which is a benefit since other graph formulations lose the original cluster ensemble structure. The hybrid bipartite graph formulation, along with the formal definition of the cluster ensemble problem, will be discussed in Chapter 3.

The bipartite graph is obtained in such a manner that its vertices consist of the clusters of the clustering solutions together with the data nodes. The edges of the graph are given a weight with respect to data nodes being within clusters. The partitioning of the graph will result in a single clustering solution of the original data since the formulation does not lose its structure. If the data set has $n$ data nodes and the algorithms discussed produce a cluster ensemble of $r$ clustering solutions, in which solution $i$ has $k_i$ clusters for $i = 1 : r$, the resulting bipartite graph that needs to be partitioned will have $nr$ edges and $n + \sum_{i=1}^{r} k_i$ vertices. This can cause problems when trying to find a clustering algorithm to partition the graph, since having to determine the number of clusters in the final solution is counterproductive. For this reason, the problem was transformed into a cluster ensemble problem. To work around this issue, the problem then makes the transition from a clustering problem on a bipartite graph to a correlation clustering problem. This new clustering process does not require the number of final clusters as an input.
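As a quick check of these counts, the following minimal MATLAB sketch evaluates both formulas. The values of n and of the k_i are illustrative assumptions chosen to resemble the iris example carried through the thesis (150 data nodes and 11 solutions with 4 clusters each).

```matlab
% Size of the bipartite graph built from an ensemble of r clustering
% solutions on n data nodes (all values below are illustrative assumptions).
n = 150;                      % number of data nodes
k = 4 * ones(1, 11);          % k(i) = number of clusters in solution i
r = numel(k);                 % number of clustering solutions

numEdges    = n * r;          % each data node joins one cluster per solution
numVertices = n + sum(k);     % data nodes plus all clusters

fprintf('%d edges, %d vertices\n', numEdges, numVertices);
```

The graph grows linearly in both the number of data nodes and the number of solutions in the ensemble.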

The correlation clustering problem is similar to the clustering problem with one important difference: it applies an operations-research-like viewpoint to the problem. The goal is not to partition the data set, or graph, into a specified number of clusters but rather into an optimal number of clusters. The problem can thus be expressed as a correlation clustering of a bipartite graph, which gives us a process to cluster the cluster ensemble without dictating the number of final clusters. However, this leads to another problem that needs to be solved. In searching for an algorithm to solve the bipartite correlation clustering problem, the article Improved Approximation Algorithms for Bipartite Correlation Clustering by Ailon, Avigdor-Elgrabli, Liberty, and van Zuylen proved to be quite useful. The authors introduce two improved approximation algorithms in [3]. One is described as a deterministic linear programming rounding algorithm and the other as a combinatorial approximation algorithm.

The deterministic linear programming rounding algorithm first extends the bipartite graph to a complete graph by adding edges of weight zero. Then the algorithm applies a minimizing objective function with corresponding constraints to find a solution. This solution, when worked back to the original data set, arrives at a final clustering solution. The combinatorial approximation algorithm uses the special structure of the bipartite graph to construct an iterative method that also produces a final clustering solution. These algorithms will be discussed in Chapter 4.

The main goal of this thesis is to present a procedure that produces a clustering solution that takes an ensemble into consideration. The procedure introduced in this thesis solves the clustering problem using the k-means and spectral clustering algorithms to form a cluster ensemble. Applying the hybrid bipartite graph formulation to the cluster ensemble results in a bipartite graph and turns the problem into a bipartite correlation clustering problem. Finally, the PivotBiCluster algorithm is applied to the bipartite graph, solving the bipartite correlation clustering problem and resulting in a final clustering solution.


Figure 1.1: Flow chart of the possible procedures.

The flow diagram above shows the possible procedures that the data set can undergo before arriving at a final solution. Although the flow chart depicts the process that this thesis presents, there is room to expand it by using different methods to transition between the problems. Specifically, k-means and spectral clustering were chosen to go from a data set to a cluster ensemble, but any clustering algorithm that results in a clustering solution can be used. The cluster ensemble can also be produced using only one method or multiple methods at the same time. Furthermore, going from a bipartite formulation to a final clustering solution amounts to solving the bipartite correlation clustering problem. The problem can be solved in many different ways; the PivotBiCluster and Deterministic Linear Programming algorithms were chosen because [3] claims that each produces a clustering whose cost is at most four times that of the optimal solution.

Every chapter in this thesis will be accompanied by a section that provides an example and discussion of the algorithms or transitions being presented. Furthermore, the example will be carried through the entire thesis, resulting in a final clustering solution of the data set. All the algorithms and transitions will be done in MATLAB, and the data set will be MATLAB's Fisher's Iris Data. The iris data set takes the structure of a matrix in which each row represents a single iris and the columns represent quantitatively measured characteristics of the iris. Because of this we will refer to the rows as data nodes and the columns as characteristics of the data nodes. Furthermore, since the iris data set consists of data nodes pertaining to three specific categories, each category consisting of nodes that have similar characteristics, we can cluster with the goal of classifying each data node into its category to see how well the process performs. It will be shown that the process correctly classifies 124 nodes out of the 150 given. It will also be shown that, for this example, the process described in the thesis has a cost, described in Chapter 2, similar to the true classification cost. Lastly, in Chapter 5 another example will be worked out to give a straight run through of the process, in which we conclude that the process has a higher cost when compared to the k-means algorithm.

Chapter 2

Solutions to the Clustering Problem

To obtain our first solutions of the clustering problem, as shown by Madeira and Oliveira in [7], consider an $n$ by $p$ data matrix where each element $x_{ij}$ is a real value. In the example carried throughout the thesis, $x_{ij}$ represents the quantitative measure of characteristic $j$ of data node $i$. To condense the data matrix into a data set for notational purposes, let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set such that each $x_i$, for $i = 1 : n$, corresponds to row $i$ of the data matrix. The first transition, following Figure 1.1, is to move the data set through the k-means algorithm, the spectral clustering algorithm, or any combination of methods that yields solutions to the clustering problem. The k-means algorithm is an iterative method that forces the data nodes into a fixed number of clusters. Alternatively, the spectral clustering algorithm uses graph theory ideas combined with linear algebra to obtain a clustering. Although the spectral clustering algorithm uses k-means within it, the transformations applied before k-means help it outperform the traditional k-means algorithm, according to Luxburg in [5], if the right measure of similarity is chosen. To be initialized, both algorithms need the data set $X$ and the number of desired clusters, $k$, as inputs.

2.1 Notation and Definitions for K-means

Before describing the k-means algorithm as given by Khan and Ahmad in [4], some notation and definitions must be introduced. Let $C = \{C_1, C_2, \ldots, C_k\}$ be the set of $k$ clusters that the algorithm uses in its intermediate steps and output. Given the data set $X$ and a known value of $k$, consider $S_j = \{x_i : x_i \in C_j\}$ to be the set of data nodes that belong to cluster $C_j$. Furthermore, the algorithm uses the squared Euclidean distance as a measure of similarity between the data nodes. With this in mind, the center of a cluster is defined to be the mean coordinate of the data nodes within that cluster.

Each cluster will have one center that gets recalculated at each iteration of the algorithm. Since each cluster has only one center, the set of centers will be denoted as $T = \{t_1, t_2, \ldots, t_k\}$, such that $t_j$ is the center of $C_j$. The overall objective of the k-means algorithm is to minimize the cost function defined to be

$$\mathrm{Cost} = \sum_{i=1}^{n} \mathrm{dist}(x_i, t_j)$$

where $\mathrm{dist}(x_i, t_j)$ is the squared Euclidean distance from a data node $x_i$ to its respective center $t_j$, that is, the center of the cluster $C_j$ containing $x_i$. The k-means algorithm can use different measures of similarity, such as "cosine" (one minus the cosine of the included angle between points) and "cityblock" (sum of absolute differences between the coordinates of the points). The squared Euclidean distance was chosen because it is MATLAB's default and because it makes the results easier to interpret. That is to say, the Cost of our resulting clustering solution is the sum of the squared distances from each data node to its corresponding center.
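The Cost function can be evaluated directly from a data matrix and a vector of cluster indices. The following is a minimal sketch in base MATLAB; the toy data and the variable names (idx, T, Cost) are assumptions made for illustration and are not taken from the thesis code.

```matlab
% Toy data: 6 points in the plane assigned to 2 clusters (illustrative values).
X   = [0 0; 0 2; 2 0; 10 10; 10 12; 12 10];
idx = [1; 1; 1; 2; 2; 2];            % idx(i) = cluster index of data node i
k   = max(idx);

% Center t_j is the mean of the data nodes assigned to cluster j.
T = zeros(k, size(X, 2));
for j = 1:k
    T(j, :) = mean(X(idx == j, :), 1);
end

% Cost = sum over all data nodes of the squared Euclidean distance
% from the node to the center of its own cluster.
D    = X - T(idx, :);                % row i holds x_i minus its center
Cost = sum(sum(D .^ 2));
```

For this toy partition each cluster contributes 16/3 to the total, so Cost evaluates to 32/3.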

2.2 K-means Algorithm

The algorithm's first step is to construct the set of centers, $T$, using a random sampling. The k-means algorithm can use different initialization criteria to choose the centers, such as a uniform distribution. However, random sampling was chosen because it is the k-means default and can be used with all of MATLAB's measures of similarity. With the first set of centers selected, the next step is to decide the membership of the data nodes in each cluster. This is done in accordance with a minimum-distance criterion that is set by the user or by the platform running the algorithm, such as MATLAB. Once each data node is assigned to a distinct cluster, the centers of the corresponding clusters are recalculated by

$$t_j = \frac{\sum_{x_i \in S_j} x_i}{|S_j|}.$$

The magnitude of $S_j$, $|S_j|$, is the number of data nodes in the set. Since the resulting new set of centers will be different from the original randomly selected centers, the algorithm iteratively repeats the steps of deciding the members of each cluster and calculating new centers until there is no change in the clustering solution. The stopping criterion for k-means is thus the absence of change in the clustering solution under the new set of centers.

The final clustering solution will be the output of the algorithm. This output will have clustered the data set into $k$ distinct clusters. However, since the algorithm's first step is choosing the centers through a random sampling, the algorithm does not guarantee a unique clustering solution of the data set. This means that every time the algorithm is applied to the data set, the resulting clustering can, and most likely will, be different. This results in various clustering solutions, each of which meets the k-means stopping criterion. Because of this, producing any number of clustering solutions is simple.
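The steps above can be sketched in base MATLAB as follows. This is an illustrative sketch, not MATLAB's kmeans implementation: the toy data, the variable names, and the guard that keeps the previous center when a cluster becomes empty are assumptions made for the example.

```matlab
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];   % toy data (illustrative)
k = 2;                                      % number of clusters
n = size(X, 1);

% Step 1: choose k distinct data nodes at random as the initial centers.
p = randperm(n);
T = X(p(1:k), :);

idx = zeros(n, 1);                          % current cluster of each node
while true
    % Step 2: assign each node to the center with the smallest
    % squared Euclidean distance.
    d = zeros(n, k);
    for j = 1:k
        diffj   = X - repmat(T(j, :), n, 1);
        d(:, j) = sum(diffj .^ 2, 2);
    end
    [~, newIdx] = min(d, [], 2);

    % Stopping criterion: the clustering solution did not change.
    if isequal(newIdx, idx)
        break;
    end
    idx = newIdx;

    % Step 3: recalculate each center as the mean of its members.
    for j = 1:k
        if any(idx == j)                    % keep old center if cluster is empty
            T(j, :) = mean(X(idx == j, :), 1);
        end
    end
end
```

Because the initial centers are drawn at random, repeated runs may label the clusters differently or, on less separated data, stop at different local solutions.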

2.3 Notation and Definitions for Spectral Clustering

As with the k-means algorithm, some notation and definitions must be explained prior to the description of the spectral clustering algorithm [5]. The general definition of a graph given by Marcus in [6] is that a graph consists of points, which are called vertices (singular: vertex), and connections, which are called edges, indicated by line segments or curves adjoining certain pairs of vertices. The graph can thus be denoted as $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. In [6] Marcus later elaborates that a graph can also be directed and weighted. A directed graph is a graph whose edges include a direction from one vertex to another [6]. A weighted graph is a graph in which each edge has a number associated with it, which we refer to as the weight of that edge [6].

For the purposes of spectral clustering we will consider the graph $G$ to be an undirected (not having orientation of the edges) weighted graph. Following the standard notation and definitions set by Marcus in [6] and Luxburg in [5], we let $G = (V, E, W)$ be the undirected weighted graph with vertex set $V = \{v_1, v_2, \ldots, v_n\}$, edge set $E = \{e_{ij} : \text{connecting } v_i \text{ to } v_j\}$, and non-negative weight set $W = \{w_{ij}\}$, arranged as a matrix whose rows and columns both correspond to vertices. By the assumption of $G$ being undirected, it is required that $w_{ij} = w_{ji}$ to ensure there is no orientation to the edge. The degree of a vertex is commonly defined in graph theory to be the number of edges that occur at the vertex [6]; however, [5] defines it to be the sum of the weights of all edges adjacent to the vertex. Since we are following Luxburg's tutorial on spectral clustering, we will use the definition from [5], which is calculated by

$$d_i = \sum_{j=1}^{n} w_{ij}.$$

With this definition of degree, the degree matrix $D$ can be constructed. $D$ is defined to be a diagonal matrix whose diagonal entries are $d_1, d_2, \ldots, d_n$. Relating the construction of $G$ to our data set $X$, we make the simple assumption that $X = V$, such that each data node will be considered a vertex in $V$. With this assumption the weight set will measure our sense of similarity and, following the requirements for k-means, the measure of similarity is given by the squared Euclidean distance between vertices. Furthermore, the unnormalized graph Laplacian is defined to be

$$L = D - W.$$

Through the special structures of the matrices $D$ and $W$, $L$ has special properties that are important to make spectral clustering possible. These properties are explained and proven by Luxburg in [5], but for our purposes they will only be noted and discussed.

Theorem 2.3.1. The matrix $L$ satisfies the following properties:

1. For every vector $u \in \mathbb{R}^n$,

$$u' L u = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (u_i - u_j)^2.$$

2. $L$ is symmetric and positive semi-definite.

3. The smallest eigenvalue of $L$ is 0, and the corresponding eigenvector is the constant vector whose entries are all 1.

4. $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.

The most important property in Theorem 2.3.1 is the fourth property, which is a direct consequence of the first three. In the spectral clustering algorithm it is required to supply the number of clusters desired, $k$, and since it would be pointless to cluster a data set of size $n$ into $n$ clusters (each data node being in its own cluster), $k < n$. Furthermore, the algorithm requires the computation of the first $k$ eigenvectors. The fourth property guarantees the existence of $n$ eigenvalues, consequently guaranteeing the existence of $k$ eigenvectors.
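The properties in Theorem 2.3.1 can be checked numerically on a small graph. The sketch below uses an illustrative 3-vertex weight matrix and test vector, both chosen only for this example; it is a sanity check, not a proof.

```matlab
% Illustrative symmetric, non-negative weight matrix on 3 vertices.
W = [0 1 2;
     1 0 3;
     2 3 0];
n = size(W, 1);

D = diag(sum(W, 2));        % degree matrix, d_i = sum_j w_ij
L = D - W;                  % unnormalized graph Laplacian

% Property 1: u'Lu equals half the weighted sum of squared differences.
u   = [1; -2; 4];
lhs = u' * L * u;
rhs = 0;
for i = 1:n
    for j = 1:n
        rhs = rhs + W(i, j) * (u(i) - u(j))^2;
    end
end
rhs = rhs / 2;

% Properties 2 to 4: L is symmetric with real, non-negative eigenvalues,
% and the constant vector of ones is an eigenvector for the eigenvalue 0.
lambda   = sort(eig(L));
residual = norm(L * ones(n, 1));
```

For these values both sides of the first property evaluate to 135, the smallest entry of lambda is zero up to round-off, and residual is zero.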

2.4 Spectral Clustering Algorithm

Following the tutorial on spectral clustering given by Luxburg in [5], the algorithm first requires the data set $X$ and the number of clusters desired, $k$, as inputs. As stated in the previous section, we let $X = V$. The next step of the algorithm is to construct the weight matrix $W$ using the squared Euclidean distance as the measure of similarity between each pair of vertices. After constructing $W$, the algorithm computes the degree of each vertex and enters it into the degree matrix $D$. Using these two matrices, the unnormalized Laplacian $L$ is computed. Using MATLAB's eig function, the first $k$ eigenvectors $u_1, u_2, \ldots, u_k$ of $L$ can be computed. Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the eigenvectors $u_1, u_2, \ldots, u_k$ as columns. Let $Y = U$, with $y_j \in \mathbb{R}^k$ denoting the vector corresponding to the $j$th row of $U$ for $j = 1 : n$. Thus, using $Y$ and the previously given $k$ as our new initial inputs, the k-means algorithm takes over and performs a clustering on $Y$. Let the resulting clustering solution given by k-means be denoted by $C = \{C_1, C_2, \ldots, C_k\}$.

Since the spectral clustering algorithm uses the k-means algorithm within it, it runs into the same problem as k-means regarding multiple solutions. However, because of the preliminary work done to transform the data into a graph and to find the graph Laplacian, it is said to outperform k-means in quality of clustering if the measure of similarity is chosen to expose differences, as expressed in [5]. The choice to measure the similarity using the squared Euclidean distance was made to match the measure used by the k-means algorithm. However, the example that follows shows that spectral clustering does not outperform k-means with respect to the Cost function in Section 2.1. Spectral clustering attempts to cluster all the data nodes into one cluster due to the closeness of the rows of the eigenvector matrix with respect to the squared Euclidean distance. Since spectral clustering uses k-means and is forced to find $k$ clusters, it does, however, reveal outliers within our data matrix.

2.5 Example of Producing Multiple Solutions

Our data set $X$, as stated before, will be MATLAB's iris data set, which comes with most versions of MATLAB. The data set can be loaded using the simple command "load fisheriris". Once the data is loaded, the matrix labeled "meas" will be our data set $X$. Since k-means is already programmed into MATLAB as the function kmeans, we only need to create a function that calls kmeans multiple times while saving the clustering solutions in a matrix. A function that finds multiple clustering solutions for k-means is presented below.

function K = kmeansolutions(r,k,X)
% Runs kmeans r times and stores each solution as a column of K.
K = [];
for i = 1:r
    b = kmeans(X,k);    % n-by-1 vector of cluster indices
    K = [K,b];          % append this solution as a new column
end

For the input of the function, r is the number of solutions that the function will produce and k is the number of clusters in each solution. We are guaranteed $r$ solutions by the convergence of the k-means algorithm, as long as ties within the creation of centers are broken consistently [8]. However, the $r$ solutions are not guaranteed to be global minima, only local ones. Although theoretically $k$ can take a different value for each clustering solution, the difficulties of programming presented in later functions forced the decision to keep $k$ fixed. For the iris data that value was discovered through evaluation of the resulting final solution. Running the function for $r = 10$ and $k = 4$ results in a 150 by 10 matrix in which each column represents a clustering solution and each row entry represents the cluster index, within that solution, of the corresponding data node.

With $k = 4$, the entries in each column range from one to four, representing the four clusters that each solution must have by the k-means requirements. The output will be a collection of 10 clustering solutions. Two issues that may be evident with the use of this function are the repetitive nature of the indexing across the multiple solutions and redundant solutions (solutions that show up more than once), which will create a bias in the final steps of our process. The indexing issue will be taken care of during the ensemble transition, and the redundant solutions will be dealt with during the hybrid bipartite formulation in the next chapter.
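Both issues can be seen on a small label matrix: two columns can describe the same partition while carrying different index values. The sketch below shows one possible way to detect this, by relabeling each solution in order of first appearance. It is an assumption made for illustration and is not necessarily how the ensemble transition or the hybrid bipartite formulation handle these issues; the matrix K and the name canon are hypothetical.

```matlab
% Three toy solutions on 5 data nodes. Columns 1 and 2 describe the same
% partition with permuted labels; column 3 is a different partition.
K = [1 3 1;
     1 3 2;
     2 1 2;
     2 1 2;
     3 2 3];

canon = zeros(size(K));
for s = 1:size(K, 2)
    labels = K(:, s);
    map    = zeros(max(labels), 1);   % map(old label) = new label
    next   = 0;
    for i = 1:numel(labels)
        if map(labels(i)) == 0        % first time this label is seen
            next = next + 1;
            map(labels(i)) = next;
        end
        canon(i, s) = map(labels(i));
    end
end

% After relabeling, identical partitions become identical columns.
numDistinct = size(unique(canon', 'rows'), 1);
```

Here the first two columns of canon coincide, so only two distinct clustering solutions remain.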

Unlike k-means, spectral clustering is not pre-programmed into MATLAB, so a function following the steps in Section 2.4 must be produced. The function that was created to perform the spectral clustering algorithm is given by:

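function S = specsolutions(r,k,X)
% Performs spectral clustering and uses kmeans on the eigenvectors.
v = size(X);
w = zeros(v(1));
for i = 1:v(1)              % constructs the upper triangle of w
    for j = i:v(1)
        s = X(i,:);
        t = X(j,:);
        for p = 1:v(2)
            % sqrt((a-b)^2) equals abs(a-b), so this accumulates the sum
            % of absolute coordinate differences between nodes i and j
            w(i,j) = w(i,j) + sqrt((s(p)-t(p))^2);
        end
    end
end
w = w' + w;                 % mirror the upper triangle; diagonal of w is zero
d = zeros(v(1));            % preallocate the degree matrix
for i = 1:v(1)              % constructs d
    d(i,i) = sum(w(i,:));
end
L = d - w;
[vec,val] = eig(L);
U = vec(:,1:k);
S = kmeansolutions(r,k,U);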

Nested "for" loops were used to construct the weight matrix, $W$. The loops calculated the squared Euclidean distance between each pair of data nodes and constructed the upper triangular part of the matrix $W$. Due to the symmetry of $W$, the lower triangular part of $W$ was obtained from the upper triangular part by adding the transpose of the matrix to itself. Once $W$ was constructed, another "for" loop was implemented, following the definition of the degree matrix $D$, to sum the weights corresponding to each diagonal entry of $D$. Since $W$ and $D$ are of the same dimensions, $L$ was produced by subtracting $W$ from $D$. From here the eigenvalues and eigenvectors were calculated using MATLAB's eig function, and only the first $k$ eigenvectors were collected in the matrix $U$. Once we have the first $k$ eigenvectors collected in $U$, we use the kmeansolutions function to form multiple solutions with the k-means algorithm. Note that the last transition described in Section 2.4 did not need to take place, since kmeansolutions clusters across the rows of the input matrix.
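For comparison, the squared Euclidean weight matrix named in the text can also be built without explicit loops. The following is a minimal sketch in base MATLAB under that assumption; the toy data and the use of the identity $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2a'b$ are choices made for this example and are not part of the thesis code.

```matlab
X = [0 0; 3 4; 6 8];                 % toy data, one data node per row
n = size(X, 1);

% Squared Euclidean distances from ||a-b||^2 = ||a||^2 + ||b||^2 - 2a'b.
sq = sum(X .^ 2, 2);                 % squared norm of each row
W  = repmat(sq, 1, n) + repmat(sq', n, 1) - 2 * (X * X');
W  = max(W, 0);                      % guard against tiny negative round-off

d = sum(W, 2);                       % degree of each vertex
L = diag(d) - W;                     % unnormalized graph Laplacian
```

The resulting $W$ is symmetric with a zero diagonal, so no separate step is needed to fill the lower triangle.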

With the inputs of the function being $r = 10$, $k = 4$, and $X$ being the iris data set, the issue that arises using the specsolutions function is that the resulting clustering solutions are identical with respect to which data nodes are clustered together. Inspection of each matrix used in the function revealed that the issue arises in the values of $U$, the collection of eigenvectors corresponding to $L$. The rows of $U$ are very similar to one another in the sense of the squared Euclidean distance. This forces the kmeans function in MATLAB to attempt to cluster all the data nodes together. In doing so, it clusters $k - 1$ data nodes by themselves and the remaining data nodes in one cluster. At first this does not seem helpful, but further evaluation shows that it indicates outliers within the data set. Keeping this in mind, we will let $r = 1$ for the input so that we do not generate identical solutions that will be eliminated later in the hybrid bipartite formulation. Also, the output clustering solution will have repeated indexes since it uses kmeansolutions. This will also be addressed in the ensemble transition.

To quantify the differences between the k-means solutions and the spectral clustering solution, we calculate the cost of each clustering solution using the Cost function from Section 2.1. The Cost function calculates the cost of the spectral clustering solution to be 680.9071, whereas the costs of the 10 solutions for the k-means algorithm range only from 57.2285 to 71.4452. Because of this we can conclude that, with the choice of the squared Euclidean distance to measure the similarities between data nodes, spectral clustering does not outperform k-means. However, the fact that it does indicate outliers in our data matrix makes the solution useful, and thus it will still be considered in the following steps of the procedure.

Finally, with the iris data, $k = 4$, and $r = 10$ as inputs for kmeansolutions, and the iris data, $k = 4$, and $r = 1$ as inputs for specsolutions, we can run the two functions above to produce a total of 11 solutions, each clustering the data into 4 disjoint clusters. The outputs of the functions are the matrices K and S, respectively.

Chapter 3

Cluster Ensemble Problem and Hybrid Bipartite Formulation

In Chapter 2 the clustering problem is introduced and solved using the k-means and spectral clustering algorithms. As discussed in the previous chapter, the algorithms do not produce unique clustering solutions for the data set $X$, but can in fact produce many solutions depending on the initial random sampling in k-means. The existence of many solutions can be problematic when deciding which one would work "best" as the final solution. Caruana, Elhawary, Nguyen, and Smith in [2] present the idea of clustering based on clustering solutions. The general format for finding a clustering of clustering solutions is given in [2]: first, generate many good, quantitatively different clustering solutions of the data set $X$. Next, measure the similarities between the clustering solutions so that they themselves can be clustered. Lastly, cluster the clustering solutions in an attempt to minimize the difference in similarities within the clusters. Following these guidelines, the original clustering problem will transform into a cluster ensemble problem and later into a bipartite graph partitioning problem using a hybrid bipartite graph formulation.

3.1 Cluster Ensemble Problem

To translate this procedure into a problem statement, first consider that, given a data set $X = \{x_1, x_2, \ldots, x_n\}$, a cluster ensemble is defined by Fern and Brodley in [1] to be a set of clustering solutions of $X$. Let $C = \{C_1, C_2, \ldots, C_R\}$ be the cluster ensemble, such that there are $R \in \mathbb{N}$ clustering solutions. Since these solutions will come from the algorithms discussed in Chapter 2, each clustering solution $C_r$, for $r = 1 : R$, is a partition of the data set $X$ into $k_r$ distinct clusters. The subscript on $k_r$ represents the idea that the clustering solutions can be of different sizes. Thus each clustering solution takes the form $C_r = \{C_1^r, C_2^r, \ldots, C_{k_r}^r\}$. Also, since the clusters within a clustering solution are disjoint but together contain every data node, each clustering solution has the property that $\bigcup_j C_j^r = X$. Finally, the cluster ensemble problem is stated as follows: given a cluster ensemble $C$ and the number of desired final clusters, partition the data set $X$ into the desired number of distinct clusters using the information obtained from $C$.

Since the cluster ensemble problem is in essence a clustering problem, one obvious procedure for finding a final clustering solution is to run the k-means and spectral clustering algorithms on the ensemble. However, because of the input requirements of the algorithms, this would ignore the information about the structure of the data presented in the cluster ensemble. Running the algorithms on the cluster ensemble would also lead to a clustering of clusters, losing the initial structure of the data set $X$. To ensure that the original structure of the data set is not lost and to solve the cluster ensemble problem, $X$ and $C$ can go through a hybrid bipartite graph formulation. This formulation leads to a correlation clustering problem, which is described in Chapter 4.

3.2 Hybrid Bipartite Graph Formulation

In Section 2.3 the definition of an undirected weighted graph for our purpose was given to be $G = (V, E, W)$, where $V$ is the set of vertices such that $V = X$, and $E$ is the set of edges connecting the data nodes, with weights given by the squared Euclidean distance forming the set $W$. Marcus [6] defines a bipartite graph as a graph whose vertices can be separated into two sets, $L$ and $R$, in such a way that every edge in the graph has one endpoint in each set. That is to say, for all $e_{ij} \in E$, whichever set $i$ belongs to, $j$ must belong to the other. This, however, poses a problem given our definition of $G$ and its relation to the data set $X$. Since the edges carry weights given by the squared Euclidean distance between data nodes, $V$ cannot be separated in this way. Furthermore, the data set $X$ and a cluster ensemble $C$ are both given, meaning we need to expand our definition of the vertices in the graph to include the cluster ensemble. Also, to condense notation, from this point on we assume each clustering solution $C_r$ for $r = 1 : R$ has $k$ clusters. This is done without loss of generality, since the formulation below holds for different values of $k$ for each $C_r$.

Given the data set $X = \{x_1, x_2, \ldots, x_n\}$ and the cluster ensemble $C = \{C_1, C_2, \ldots, C_R\}$, where $C_k^r$ is a cluster in clustering solution $r$, one of $k$, and using Fern and Brodley's formulation, let $G = (L, R, E)$, where $L$ is the set of vertices representing each data node in $X$ and $R$ is the set of vertices representing each $C_k^r$ in $C$. Formally, for all $l_j \in L$, $l_j = x_j$ for $j = 1, 2, \ldots, n$, and each $r_i \in R$ corresponds uniquely to one cluster in the cluster ensemble for $i = 1, 2, \ldots, kR$. The edge $e_{ij} \in E$ connects vertex $l_j$ to $r_i$ if the data node corresponding to $l_j$ is in the cluster represented by $r_i$. With this definition of an edge in $E$, no member of $L$ is adjacent to another member of $L$, and no member of $R$ is adjacent to another member of $R$. Also, every edge in $E$ has one endpoint in $L$ and one in $R$, so $G$ fits the definition of a bipartite graph. The process is given the name hybrid bipartite graph formulation by Fern and Brodley because it uses two other types of graph formulations as a guide. The hybrid bipartite graph formulation is shown in the figure below.

Figure 3.1: Depiction of the formulation of the bipartite graph.

The hybrid bipartite graph formulation keeps the structure of the original data set $X$ while including the information gathered by the cluster ensemble. This transition changes the problem from finding a final clustering of the cluster ensemble to partitioning a bipartite graph. Furthermore, the added structure of the cluster solutions gives a way to measure the correlation between the clusters and the data nodes. Using the measure of correlation between the two sets of vertices to partition the bipartite graph turns the cluster ensemble problem into a bipartite correlation clustering problem. The solution to the bipartite correlation clustering problem is a clustering of the vertices of the bipartite graph. The advantage of the hybrid bipartite formulation becomes evident here: removing the vertices that represent the cluster ensemble $C$ from this solution leaves a final clustering of the data nodes.

3.3 Example of the Transitions

From the example in Chapter 2 Section 5 we have the matrices K and S, the clustering solutions produced by the functions kmeansolutions and specsolutions, respectively. The ensemble transition can be described as the joining of the clustering solution matrices K and S. This is done by the first line of the function ensemble, which was created following the definition of a cluster ensemble.

function E = ensemble(k,K,S)
%creates the ensemble from kmeans and spec cluster
E = [K,S];
v = size(K);
u = size(S);
w = v(2)+u(2);
for i = 1:w %unique indexes
    E(:,i) = E(:,i)+(i-1)*k;
end

One issue that came up in kmeansolutions and specsolutions was the repeated indexes for the clusters in the output matrices K and S. This is dealt with in the "for" loop of the function. Using the input $k = 4$, the "for" loop adds a multiple of 4 to each column, so that the first solution in E (i.e., the first column) is indexed 1 to 4, the second solution (second column) is indexed 5 to 8, and so on, producing clusters that have unique indexes. Hence, the output of the function ensemble given the inputs $k = 4$, K, and S from Chapter 2 Section 5 is a 150 by 11 matrix E that represents a cluster ensemble, with each cluster having a unique index.
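The index shift can be checked on a small case. The following is a minimal sketch with made-up label matrices (the values of K, S, and k below are illustrative assumptions, not the iris output), applying the same loop as the function ensemble:

```matlab
% Toy example of the unique-index shift (hypothetical labels, k = 2)
k = 2;
K = [1 1; 1 2; 2 2];   % two assumed k-means label columns for 3 data nodes
S = [2; 2; 1];         % one assumed spectral clustering label column
E = [K, S];            % join the solutions side by side
for i = 1:size(E,2)
    % column i is shifted by (i-1)*k so its labels fall in (i-1)*k+1 .. i*k
    E(:,i) = E(:,i) + (i-1)*k;
end
disp(E)                % columns now use labels 1-2, 3-4 and 5-6
```

With these inputs the first column keeps the labels 1 and 2, the second uses 3 and 4, and the third uses 5 and 6, so no two solutions share a cluster index.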

Now that we have a matrix E and the repeated indexes have been taken care of by the function ensemble, the hybrid bipartite formulation needs to take place. The hybrid bipartite formulation takes the ensemble and forms a matrix, say B, where each column represents a single cluster and each row represents a data node. The entry $B_{ij}$ is 1 if data node $i$ belongs to cluster $j$ and 0 otherwise.

By the unique indexing of E, we can let the index directly indicate the column in which the cluster is represented. Thus B can be created using a nested "for" loop, as shown in the MATLAB code below:

function B = biform(E)
%creates the bipartite graph
L = size(E);
for i = 1:L(1)
    for j = 1:L(2)
        B(i,E(i,j))=1;
    end
end
%to rid of redundancy in solutions
v = size(B);
for i = 1:v(2)
    for j = i+1:v(2)
        if B(:,i) == B(:,j)
            B(:,j) = 0; %zero out the later duplicate column
        end;
    end
end
for i = 1:v(1)
    u = size(B);
    for j = 1:u(2)
        if sum(B(:,j)) == 0
            B(:,j) = []; %remove one zero column per pass
            break
        end
    end
end

In the function it can be noted that the actual formation of B is a single nested "for" loop that takes up a third of the function. The other two thirds deal with the issue discovered in Chapter 2 Section 5: the appearance of redundant solutions, which can create a bias in the final steps of our process.

Since B is created so that each column represents a cluster and an entry of 1 means the data node belongs to that cluster, we can search for columns that are equal to one another, which correspond to redundant clusters. This is done in the "for" loop after the creation of B, in which duplicate columns are replaced with columns of zeros. The columns of zeros are then removed by the last "for" loop, resulting in an output matrix B that has unique columns, directly corresponding to unique clusters of the data nodes.

The output matrix B thus represents the bipartite graph, with the rows taken to be the vertex set $L$, the columns to be the vertex set $R$, and the entries signifying the existence of an edge joining the two vertices.
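As a sketch of an alternative to the nested loops, the same kind of matrix can be assembled with index arithmetic and MATLAB's unique function. The toy ensemble E below is an assumed example with unique cluster indexes already applied; this is an illustration and not the code used to produce the results in this thesis.

```matlab
% Toy ensemble: 3 data nodes, 3 solutions, cluster indexes already unique
E = [1 3 6; 1 4 6; 2 4 5];
[n, m] = size(E);

% Set B(i, E(i,j)) = 1 for every entry of E using linear indexing
rows = repmat((1:n)', 1, m);
B = zeros(n, max(E(:)));
B(sub2ind(size(B), rows(:), E(:))) = 1;

% Keep only the first occurrence of each distinct column, in original order
[~, idx] = unique(B', 'rows', 'first');
B = B(:, sort(idx));

% Drop any column with no members
B = B(:, any(B, 1));
disp(B)
```

Here clusters 5 and 6 contain exactly the same data nodes as clusters 2 and 1, so only four columns remain.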


Chapter 4

Bipartite Correlation Clustering

Thus far the data set $X$ has been clustered through k-means and spectral clustering to obtain many clustering solutions to the clustering problem. With the set of clustering solutions, the problem transformed into the cluster ensemble problem. To avoid simply clustering clusters, and to actually use the information presented in the ensemble to cluster the data set, $X$ and $C$ became vertices in a bipartite graph through the hybrid bipartite graph formulation. The formulation exposed correlations through the edge set $E$. This resulted in a bipartite graph $G = (L, R, E)$ to be partitioned into a final solution through correlation clustering. In [3], Ailon, Avigdor-Elgrabli, Liberty, and van Zuylen introduce two algorithms to solve the bipartite correlation clustering problem. The two algorithms presented are a deterministic linear programming rounding based algorithm and a combinatorial 4-approximation algorithm titled PivotBiCluster.

The deterministic linear programming rounding algorithm turns the bipartite graph $G$ into a complete graph, defined by Marcus [6] to be a graph in which every vertex is adjacent to every other vertex. It does so by adding the missing edges with weight zero, in essence forgetting the bipartite structure. From here the algorithm poses a constrained minimization problem whose solution is a set of indicating variables. An iterative method called QuickCluster is then applied to the indicating variables to form the final solution.

The PivotBiCluster algorithm keeps the input as a bipartite graph and uses the structure to measure the correlation between vertices. It is an iterative method that first chooses a vertex at random from $L$, compares correlations using the neighborhood of that vertex, defined in [3] to be the set of vertices adjacent to the chosen vertex, and then clusters with respect to a criterion in the algorithm. Since the vertices in $L$ are clustered within the algorithm itself, its output is a final clustering solution of the data set $X$.

4.1 Deterministic Linear Programming Algorithm

Let the input of the algorithm be the previously defined bipartite graph $G = (L, R, E)$. The objective is to cluster $G$ through the correlations presented by the edge set. For notational purposes, [3] defines an individual edge as $(i, j)$, connecting vertex $l_i \in L$ to $r_j \in R$. Note that $E$ is a subset of the complete set of edges joining every vertex in $L$ to every vertex in $R$; call this set $L \times R$.

To make the graph $G$ into a weighted graph that holds no presumption of the final solution, binary weights are assigned. For each edge $(i, j)$ in the input set $E$, let $w_{ij}^{+} = 1$ and $w_{ij}^{-} = 0$. For each edge $(i, j)$ in $L \times R$ not in $E$, let $w_{ij}^{+} = 0$ and $w_{ij}^{-} = 1$. Since $G$ is a bipartite graph, we arbitrarily let $w_{ij}^{+} = 0$ and $w_{ij}^{-} = 0$ if $i$ and $j$ are both in $L$ or both in $R$. Formally, in set notation, the weight variables are defined as

$$
\begin{aligned}
\forall\, (i,j) \in (L \times R) \cap E &\implies w_{ij}^{+} = 1,\ w_{ij}^{-} = 0 \\
\forall\, (i,j) \in (L \times R) \setminus E &\implies w_{ij}^{+} = 0,\ w_{ij}^{-} = 1 \\
\forall\, (i,j) \in (L \times L) \cup (R \times R) &\implies w_{ij}^{+} = 0,\ w_{ij}^{-} = 0
\end{aligned}
$$

The weight variables reflect the separate vertex sets $L$ and $R$ of the graph $G$. The indicating variables that will be used to determine the clustering solution lose track of the bipartite property, since they are defined in [3] by $y_{ij}^{+} = 1$ if and only if vertices $i$ and $j$ are placed in the same cluster. It follows, in standard operations research fashion, that $y_{ij}^{-} = 1 - y_{ij}^{+}$. Since the indicating variables are defined to be binary integers, $y_{ij}^{+}, y_{ij}^{-} \in \{0, 1\}$, the following constraint can be imposed:

$$y_{ij}^{+} + y_{ij}^{-} = 1.$$

This constraint guarantees that vertex $i$ cannot simultaneously be in the same cluster as $j$ and not in the same cluster as $j$. The other constraint for the algorithm is treated in [3] as implied by the definitions; because of the detail involved, it is stated here as an observation and proven.

Observation 4.1.1. Let $V = L \cup R$. Then for any ordered set of three vertices $i, j, k \in V$, $y_{ij}^{-} + y_{jk}^{-} + y_{ik}^{+} \geq 1$.

Proof. Consider the ordered set of vertices $i, j, k \in V$. Using the definition of the indicating variables,

$$y_{ij}^{-} + y_{jk}^{-} + y_{ik}^{+} = 1 - y_{ij}^{+} + 1 - y_{jk}^{+} + y_{ik}^{+} = 2 - y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+}.$$

Since the indicating variables are in $\{0, 1\}$, making the left side of the equation above as small as possible amounts to reducing the constant 2. However, the smallest that $-y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+}$ can be is negative one, due to the dependence of the variables on one another. This is shown in the four cases below:

If $y_{ij}^{+} = 0$ and $y_{jk}^{+} = 0$, then $y_{ik}^{+} = 0$ or $y_{ik}^{+} = 1$, implying that $-y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+} = 0$ or $1$.

If $y_{ij}^{+} = 1$ and $y_{jk}^{+} = 0$, then $y_{ik}^{+} = 0$, implying that $-y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+} = -1$.

If $y_{ij}^{+} = 0$ and $y_{jk}^{+} = 1$, then $y_{ik}^{+} = 0$, implying that $-y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+} = -1$.

If $y_{ij}^{+} = 1$ and $y_{jk}^{+} = 1$, then $y_{ik}^{+} = 1$, implying that $-y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+} = -1$.

Hence the smallest that $2 - y_{ij}^{+} - y_{jk}^{+} + y_{ik}^{+}$ can be is 1, imposing the constraint

$$y_{ij}^{-} + y_{jk}^{-} + y_{ik}^{+} \geq 1.$$

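This constraint ensures that if vertex $j$ is in the same cluster as both $i$ and $k$, then $i$ and $k$ are clustered together. Thus the linear programming problem is introduced to be

$$
\begin{aligned}
LP = \min \ & \sum_{(i,j) \in (L,R)} \left( w_{ij}^{+} y_{ij}^{-} + w_{ij}^{-} y_{ij}^{+} \right) \\
\text{s.t. } & \forall\, i, j, k \in V: \\
& y_{ij}^{-} + y_{jk}^{-} + y_{ik}^{+} \geq 1 \\
& y_{ij}^{+} + y_{ij}^{-} = 1 \\
& y_{ij}^{+},\ y_{ij}^{-} \in \{0, 1\}
\end{aligned}
$$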
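To make the objective concrete, the sketch below evaluates it for one candidate clustering of a small bipartite graph. The matrix B and the label vectors labL and labR are hypothetical toy values chosen for illustration. Because the indicating variables are derived from cluster labels, the triangle constraint holds automatically in this sketch; no linear program is solved here.

```matlab
% Toy biadjacency matrix: rows are vertices in L, columns are vertices in R
B = [1 1 0;
     1 1 0;
     0 0 1];
Wplus  = B;        % w+_ij = 1 exactly on the edges in E
Wminus = 1 - B;    % w-_ij = 1 exactly on pairs of L x R that are not in E

% Hypothetical candidate clustering of the vertices in L and in R
labL = [1; 1; 2];
labR = [1, 2, 2];

% y+_ij = 1 when l_i and r_j carry the same cluster label
Yplus  = double(bsxfun(@eq, labL, labR));
Yminus = 1 - Yplus;

% Objective: edges that are cut plus non-edges that are joined
cost = sum(sum(Wplus .* Yminus + Wminus .* Yplus));
disp(cost)
```

In this toy case two edges are separated and one non-adjacent pair is placed together, giving a cost of 3.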


The optimal solution to the linear programming problem $LP$ is a labeling of the indicating variables. To obtain the corresponding partition of the data set $X$, an iterative method called QuickCluster is used in [3]. QuickCluster chooses a pivot vertex $i$ at random and forms a cluster $C_1$ that contains $i$ and every vertex $j$ such that $y_{ij}^{+} = 1$. Once the cluster is formed, its vertices are removed from $V$, and the process is repeated until $V$ is empty. The result is a clustering that, once the vertices from $R$ are removed, reveals a clustering solution for the original data set $X$.
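The QuickCluster loop described above can be sketched as follows. The matrix Y is an assumed, already consistent labeling of the indicating variables $y_{ij}^{+}$ on five vertices, with ones on the diagonal; it is a toy input and not the output of an actual linear program.

```matlab
% Assumed indicator matrix y+ on 5 vertices (symmetric, ones on the diagonal)
Y = [1 1 0 0 0;
     1 1 0 0 0;
     0 0 1 1 1;
     0 0 1 1 1;
     0 0 1 1 1];

remaining = 1:size(Y,1);        % vertices still in V
labels = zeros(1, size(Y,1));   % cluster index assigned to each vertex
c = 0;
while ~isempty(remaining)
    % choose a pivot uniformly at random among the remaining vertices
    pivot = remaining(randi(numel(remaining)));
    % the cluster holds every remaining j with y+_{pivot,j} = 1;
    % the pivot itself is included because Y has ones on its diagonal
    members = remaining(Y(pivot, remaining) == 1);
    c = c + 1;
    labels(members) = c;
    remaining = setdiff(remaining, members);   % remove the cluster from V
end
disp(labels)
```

Whatever pivots are drawn, vertices 1 and 2 end up together and vertices 3, 4, and 5 end up together; only the numbering of the two clusters depends on the random choices.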

4.2 PivotBiCluster Algorithm

The PivotBiCluster algorithm is an iterative method introduced by Ailon, Avigdor-Elgrabli, Liberty, and van Zuylen [3] in which every iteration creates one cluster and possibly many singletons (vertices clustered by themselves), then removes those vertices from the graph before the next iteration. Before describing the algorithm, we first condense our notation so that members of the sets $L$ and $R$ are denoted simply by $l$ and $r$. Furthermore, since the hybrid bipartite graph formulation in Chapter 3 turns the set $X$ into the set $L$, we focus the correlation measurements on the relationships between the vertices in $L$. Hence the neighborhood of $l$, denoted $N(l)$, is the set of vertices $r \in R$ adjacent to $l$ in $G$ [3].

The last definitions needed before describing the algorithm are the correlations used within it. The algorithm works through two phases; let $l_1 \in L$ be the vertex selected in phase one and $l_2 \in L$ be the vertex selected in phase two. $R_1$ is the subset of vertices from $R$ in $N(l_1)$ but not in $N(l_2)$. $R_2$ is the subset of vertices from $R$ in $N(l_2)$ but not in $N(l_1)$. Finally, $R_{12}$ is the subset of vertices in $R$ that are in the intersection of $N(l_1)$ and $N(l_2)$. $|R_1|$, $|R_2|$, and $|R_{12}|$ are the magnitudes of the sets, that is, the number of members of each set. An example is depicted in the figure below:

Figure 4.1: Depiction of the correlations R1, R2, and R12.

The algorithm described in [3] begins phase one by picking a node $l_1$ from $L$ uniformly at random. With the selection of $l_1$, the cluster $C_i$ created on the $i$th iteration is formed such that $C_i = \{l_1\} \cup N(l_1)$. Moving on to phase two, the algorithm goes over all remaining members of $L$ one by one, each in turn playing the role of $l_2$, and decides to do one of three things. The three actions that can be taken are to include $l_2$ in $C_i$, turn $l_2$ into a singleton, or leave $l_2$ in $G$.

The way PivotBiCluster decides what to do with $l_2$ is through the measure of correlations, that is, the relationship between the magnitudes of the sets $R_1$, $R_2$, and $R_{12}$. With probability $\min\left\{ \frac{|R_{12}|}{|R_2|}, 1 \right\}$ the algorithm decides the following:

i) If $|R_{12}| \geq |R_1|$, include $l_2$ in $C_i$.

ii) If $|R_{12}| < |R_1|$, let $l_2$ form a singleton $C_{i+1}$.

With probability $1 - \min\left\{ \frac{|R_{12}|}{|R_2|}, 1 \right\}$, leave $l_2$ in the graph to be dealt with at a later iteration. Once these steps are completed, the clustered vertices, those in $C_i$ and in any singleton clusters, are removed from the graph, and the process starts again, with the stopping criterion being that $L$ is empty. Once the stopping criterion has been met, the output solution is a clustering of the original data set $X$.
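The decision for a single pair $(l_1, l_2)$ can be sketched as below. The two 0/1 row vectors stand for the neighborhoods $N(l_1)$ and $N(l_2)$ written as rows of a biform-style matrix; they are hypothetical values chosen so that the outcome does not depend on the random draw. The variable names are assumptions made for this illustration, and the case $|R_2| = 0$ is not given any special handling here.

```matlab
% Hypothetical neighborhoods of the pivot l1 and a candidate l2 over 5 clusters
Nl1 = logical([1 1 1 0 0]);
Nl2 = logical([1 1 0 1 0]);

R1  = sum(Nl1 & ~Nl2);   % |R1|: clusters adjacent to l1 only
R2  = sum(~Nl1 & Nl2);   % |R2|: clusters adjacent to l2 only
R12 = sum(Nl1 & Nl2);    % |R12|: clusters adjacent to both

p = min(R12 / R2, 1);    % probability of making a decision about l2 now
if rand <= p
    if R12 >= R1
        action = 'join';        % include l2 in the cluster of l1
    else
        action = 'singleton';   % l2 becomes a cluster by itself
    end
else
    action = 'leave';           % l2 stays in the graph for a later iteration
end
disp(action)
```

Here $|R_1| = 1$, $|R_2| = 1$, and $|R_{12}| = 2$, so the probability is 1 and $l_2$ joins the cluster of $l_1$.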

4.3 Example of Bipartite Correlation Clustering

To solve the bipartite correlation clustering problem we choose to implement only PivotBiCluster, due to programming difficulties with the deterministic linear programming algorithm. From the example of the transitions in Chapter 3 Section 3 we have the bipartite graph of the cluster ensemble represented as the matrix B, the output of the function biform. Using this as input, the PivotBiCluster algorithm can be implemented using a function created in MATLAB for the purposes of this thesis. The function, titled "combi", is located in the appendix.

Through evaluation of the code, one can note that the largest portion of the function corresponds to the algorithm, while the bottom of the function formats the clustering solution to take into account any singletons that might have been formed. The output of the function combi is a matrix in which the rows correspond to the original data set $X$ and the columns correspond to the clusters found by the algorithm, whose cost, according to [3], is guaranteed to be at most four times that of the optimal solution. With this final clustering comes the end of our transitions, and a single clustering that has used all the information gathered through the cluster ensemble.


The final clustering solution can be depicted in a scatter graph by relating the final clustering index, defined as fin in the function, to the data set X. The script that can achieve this is shown below.

v=size(fin);
y=[];
for i=1:v(2)
    for j=1:v(1)
        if fin(j,i)==1
            y=[y;X(j,:)];
        end
    end
    scatter3(y(:,1),y(:,2),y(:,3),'o')
    hold on
    y=[];
end

One thing to note about the script is the decision to use scatter3 to plot the points. scatter3 was chosen in order to view the maximum number of dimensions. Furthermore, the first three characteristics were found, through experimentation, to do a great job of depicting the clustering solution. This choice matters, as will become evident in the next chapter, where the first three characteristics do not depict the clustering clearly. The resulting scatter graph is depicted in Figure 4.2 below.


Figure 4.2: Depiction of the final clustering solution for the iris data.

In Figure 4.2 we see a distinction between the clusters through the symbols chosen as well as where the nodes are located. By treating each cluster as a class into which a data node fits, the result becomes comparable to the actual classifications supplied with the data set in MATLAB. Calculating the difference between the classifications made by the process and those given by the data set shows that the two differ on 26 data nodes. That is, the process successfully classified 124 of the 150 data nodes.
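One way such a comparison could be carried out, assumed here purely for illustration, is to label each cluster with the most common true class among its members and count the data nodes that agree with that label. The membership matrix fin and the vector truth below are hypothetical toy values, not the iris results.

```matlab
% Hypothetical membership matrix (rows: data nodes, columns: clusters)
fin   = [1 0; 1 0; 1 0; 0 1; 0 1; 0 1];
% Hypothetical true class of each data node
truth = [1; 1; 2; 2; 2; 2];

matched = 0;
for c = 1:size(fin, 2)
    members = truth(fin(:,c) == 1);          % true classes inside cluster c
    if ~isempty(members)
        % nodes agreeing with the majority class of the cluster
        matched = matched + sum(members == mode(members));
    end
end
mismatched = numel(truth) - matched;
fprintf('%d matched, %d mismatched\n', matched, mismatched);
```

In this toy case five of the six data nodes agree with the majority class of their cluster and one does not.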

To consider whether this is actually a "good" clustering solution we turn to the Cost function as used in Chapter 2 Section 5. By calculating the cost of each clustering solution the following figure can be produced:


Figure 4.3: The clustering cost of each clustering solution for the iris data.

Here the cost of the true classification is represented as a dashed line and the cost of the procedure is represented as a solid line. The costs of the k-means solutions are the first 10 costs of the ensemble, and the 11th is the spectral clustering cost. The k-means algorithm has the lowest cost overall, and the spectral clustering algorithm has the highest, due to the issue of trying to cluster all the nodes together. Figure 4.3 shows that the procedure presented in the thesis is not better than the k-means algorithm but has a cost close to the true classification cost. Furthermore, the k-means algorithm is only capable of finding a solution with a fixed $k$ value. This differs from the process presented, which allows clustering solutions of different sizes and clustering solutions from different methods that can expose different structures within the data set.


Chapter 5

Another Classification Example

For this example we will be exploring the classification of a wheat seed data matrix. The data matrix can be pulled from the University of California, Irvine machine learning repository. The URL is available in [9]. The data presents 210 data nodes each having 7 characteristics that are real valued. The data matrix also includes the correct labeling of the data nodes which we will ignore until we form our clustering solution through the process presented in this work. Since the functions have been described throughout the thesis and can be found in the appendix, what is left to do is write a script that will take the data set through the functions in order. The script that will take the data matrix titled wheatseed through the functions can be found in A.7.

A couple of things to note about the script are the values selected for $r$ and $k$, along with the varying solutions that can be present. Since we want our ensemble to be as robust as possible, we want the number of solutions taken into consideration to be larger than in the example shown throughout the thesis. For this we arbitrarily chose the number of solutions to be 100, and for the reasons previously stated we can assume that they are not all unique. Furthermore, since we want to classify our data nodes into 3 categories, we choose $k = 3$. The problem of not knowing which $k$ value would work best could be addressed by considering different $k$ values in the k-means and spectral clustering algorithms; however, the functions would have to be adapted to accept such changes. Different values of $k$ were not considered for this example due to the difficulties of programming. Running the script through the functions in the appendix, the final output can be depicted as a scatter graph in Figure 5.1.

Figure 5.1: Depiction of the final clustering solution for the wheat seed data matrix with characteristics 1, 2 and 3.

Comparing the clustering solution to the actual categories the wheat seeds are supposed to be in, it can be shown that the final clustering solution successfully classified 156 of the 210 data nodes. Figure 5.1 shows the final clustering solution; however, the distinctions between the clusters are not evident in the scatter graph. This issue is due to the characteristics chosen to be graphed. Through further experimentation with the scatter graph and the choice of characteristics to plot, the characteristics that best depict the clusters for this example are the 4th, 5th, and 6th. The script in appendix A.6 shows these characteristics in the following scatter graph.

Figure 5.2: Depiction of the final clustering solution for the wheat seed data matrix with characteristics 4, 5 and 6.

To compare the cost of the final clustering solution we use the same type of depiction as in Chapter 4 Section 3 for the Fisher iris data. Following the same procedure to calculate the cost, the following graph can be produced.


Figure 5.3: The clustering cost of each clustering solution for the wheat seed data.

Here we see that the process has produced a cost that is higher than the true classification cost and higher than the k-means costs. However, spectral clustering still has the highest cost, since our measure of similarity did not change. Due to the scaling of the image, the costs of the k-means solutions seem to be constant; however, they are not. The k-means costs do in fact differ, as shown in Figure 5.4. For this example, the procedure discussed in the thesis does not obtain a cost close to the cost of the true clustering solution. Since our process takes into consideration all the solutions in the ensemble, it took into consideration the spectral clustering algorithm's solution. Because that particular solution had such a high cost, it caused the total cost from the procedure to go up as well.


Chapter 6

Conclusion

This thesis presents a procedure to obtain an optimal clustering solution given a data matrix. We perceive an optimal solution as one that can expose structures of the data set through the use of multiple clustering solutions from varying algorithms. The algorithms described in Chapter 2, and discussed in [4] and [5], were used to obtain the initial clustering solutions. To try to improve the resulting clustering solutions, [2] introduces the idea of gathering clustering solutions into a cluster ensemble. This, in turn, inspires the cluster ensemble problem. To solve the cluster ensemble problem without losing the original structure of the data set $X$, [1] transforms the ensemble into a bipartite graph through the hybrid bipartite graph formulation. Finally, to partition the bipartite graph in such a way as to cluster the data set $X$, [3] introduces improved approximation algorithms. It is the PivotBiCluster algorithm in [3] that outputs the final clustering solution.

The examples presented in this thesis were used to make the ideas throughout the procedure more concrete, but also to expose some issues that need further investigation. Having multiple values of $k$ would be possible through the transitions of the problems, resulting in a final clustering that takes into consideration different relationships between the data nodes. Also, the final clustering solution can change with every application of the procedure due to the high dependence on the k-means algorithm, so choosing a more stable algorithm to create the ensemble would create more stability in the final clustering. The example in Chapter 5 also exposed how dependent the procedure is on the cluster ensemble. Because of this, a wide variety of clustering algorithms that result in low costs could also be used in pursuit of a more optimal final clustering solution.


Bibliography

[1] X. Z. Fern & C. E. Brodley, Solving Cluster Ensemble Problems by Bipartite Graph Partitioning, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.

[2] R. Caruana, M. Elhawary, N. Nguyen, & C. Smith, Meta Clustering, ICDM'06, IEEE, 2006.

[3] N. Ailon, N. Avigdor-Elgrabli, E. Liberty, & A. van Zuylen, Improved Approximation Algorithms for Bipartite Correlation Clustering, SIAM J. Comput., Vol. 41, No. 5, pp. 1110-1121, 2012.

[4] S. S. Khan & A. Ahmad, Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters 25, Science Direct, 2004.

[5] U. von Luxburg, A Tutorial on Spectral Clustering, Department for Empirical Inference, 2007.

[6] D. Marcus, Graph Theory: A Problem Oriented Approach, MAA Textbooks, 2008.

[7] S. Madeira & A. Oliveira, Biclustering Algorithms for Biological Data Analysis: A Survey, IEEE Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, 2004.

[8] C. Manning, P. Raghavan, & H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

[9] Lichman, UCI Machine Learning Repository, [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science. Specifically [http://archive.ics.uci.edu/ml/datasets/seeds].


Appendix A

MATLAB Code

The MATLAB code presented below will follow examples throughout the thesis and assume we are dealing with a quantitative data matrix, X.

A.1 kmeansolutions

The kmeansolutions function requires the inputs r, k, and X. The variable r is the number of solutions desired, k is the number of clusters within each solution, and X is the data set. The only requirements pertain to the data set X: it must be formatted such that the rows correspond to the data nodes and the columns correspond to the characteristics. The output of the function is a matrix K in which each row corresponds to a data node and each column is a clustering solution whose entries are the index of the cluster the data node is in.

function K = kmeansolutions(r,k,X)
K=[];
for i = 1:r
    b = kmeans(X,k);
    K = [K,b];
    b = 0;
end

A.2 specsolutions

The specsolutions function requires the inputs r, k, and X, which correspond to the same variables as in A.1. The output of the function is a matrix S in which each row corresponds to a data node and each column is a clustering solution. The entries are the index of the cluster the data node is in. One thing to keep in mind is the redundancy of solutions that is produced due to the linear algebra aspect of the spectral clustering algorithm.

function S = specsolutions(r,k,X)
%performs spec cluster and uses kmeans
v = size(X);
w = zeros(v(1));
for i = 1:v(1) %constructs the upper tri of w
    for j = i:v(1)
        s = X(i,:);
        t = X(j,:);
        for p = 1:v(2)
            w(i,j) = w(i,j)+sqrt((s(p)-t(p))^2);
        end
    end
end
w=w'+w;
for i = 1:v(1) %constructs d
    d(i,i) = sum(w(i,:));
end
L = d-w;
[vec,val] = eig(L);
U = vec(:,1:k);
S = kmeansolutions(r,k,U);

A.3 ensemble

The function ensemble requires the inputs k, K, and S. The variable k is the same variable as in A.1 and A.2, K is the output matrix from A.1, and S is the output matrix from A.2. With these inputs the function creates a matrix in which K and S are side by side. Furthermore, the function gives each cluster of every solution a unique index. The output is a matrix E in which the rows correspond to the data nodes and the columns correspond to clustering solutions, but now each entry is an index that corresponds to a unique cluster.

function E = ensemble(k,K,S)
%creates the ensemble from kmeans and spec cluster
E = [K,S];
v = size(K);
u = size(S);
w = v(2)+u(2);
for i = 1:w %unique indices
    E(:,i) = E(:,i)+(i-1)*k;
end

A.4 biform

The function biform requires the input E, which is the output matrix of A.3. The function expands the matrix E so that each cluster has its own column, and enters a value of 1 in each row whose data node is in that specific cluster. Furthermore, the code gets rid of redundant solutions by searching the matrix: if two columns equal each other (indicating a redundant solution), it fills the latter column with zeros. Once the redundant columns are filled with zeros, it removes them from the matrix.

function B = biform(E)
%creates the bipartite graph
L = size(E);
for i = 1:L(1)
    for j = 1:L(2)
        B(i,E(i,j))=1;
    end
end
%to rid of redundancy in solutions
v = size(B);
for i = 1:v(2)
    for j = i+1:v(2)
        if B(:,i) == B(:,j)
            B(:,j) = 0; %zero out the later duplicate column
        end;
    end
end
for i = 1:v(1)
    u = size(B);
    for j = 1:u(2)
        if sum(B(:,j)) == 0
            B(:,j) = []; %remove one zero column per pass
            break
        end
    end
end

A.5 combi

The function combi requires the input E, which is the output matrix of A.4. The function applies the PivotBiCluster algorithm from Chapter 4 Section 2 to the matrix E. For the probabilistic criterion within PivotBiCluster, the function uses MATLAB's rand function to draw a random number that is compared against the probability in the algorithm. The code also formats the final solution to get rid of empty columns. The output of the function is a matrix called fin whose rows correspond to the data nodes and whose columns correspond to clusters; an entry of 1 means that the data node belongs to the cluster.

function c = combi(E)
v = size(E);
c = zeros(v(1));
s = [];
for k = 1:v(1) %algorithm
    for i = k+1:v(1)
        if E(k,1:v(2)) == 0
            break
        else
            c(k,k) = 1;
        end
        ro = 0;  %counts clusters adjacent to row k only
        rot = 0; %counts clusters adjacent to both rows
        rt = 0;  %counts clusters adjacent to row i only
        for j = 1:v(2)
            if E(k,j) == 1 & E(i,j) == 0;
                ro = ro+1;
            end
            if E(k,j) == 1 & E(i,j) == 1;
                rot = rot+1;
            end
            if E(k,j) == 0 & E(i,j) == 1;
                rt = rt+1;
            end
        end
        p = min(rot/ro,1); %with probability
        q = rand;
        if q <= p
            if rot/rt >= 1
                if rot >= ro
                    c(i,k) = 1;
                    E(i,1:v(2)) = 0;
                else
                    s = [s,i];
                    E(i,1:v(2)) = 0;
                end
            end
        end
    end
