Clustering - Website boundary detection via machine learning

Clustering is an unsupervised learning technique by which, given a set of objects, each object is assigned to a group (called a cluster) so that the objects in the same cluster are more similar (in some measurable sense) to each other than to those in other clusters [71, 95, 176, 184]. The data objects to be clustered could be textual documents, where similarity is measured based on the terms the documents have in common. Equally, the data objects could represent vertices in a graph structure, whereby similarity is measured based on (say) the distance between vertices along links.
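As a small illustration of a term-based similarity between two textual documents, the following sketch uses the Jaccard coefficient over whitespace-separated terms. The choice of Jaccard and the function name `term_similarity` are assumptions made for illustration only; the text above does not prescribe a particular measure.

```python
def term_similarity(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity of the term sets of two documents.

    Assumption: terms are lower-cased, whitespace-separated tokens.
    """
    terms_a = set(doc_a.lower().split())
    terms_b = set(doc_b.lower().split())
    if not terms_a and not terms_b:
        return 0.0  # two empty documents share no terms
    # Shared terms divided by all distinct terms in either document.
    return len(terms_a & terms_b) / len(terms_a | terms_b)
```

Documents with identical term sets score 1.0, and documents with no terms in common score 0.0.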

There are two main scenarios in which clustering can be performed: (1) a static context or (2) a dynamic context. In a static context clustering is performed on all data items at once. This is also known as the “offline method” of clustering objects, whereby all data is known at the time of cluster assignment.

In a dynamic context clustering is performed on data objects in a step-by-step manner. This method is also known as “online clustering”: data objects are assigned to clusters one at a time, using only previously encountered data. An online method of clustering should be able to provide an approximation of the clusters at any point during its operation, and should be able to incrementally incorporate new data objects into the existing clustering at any time. In an online clustering method the order in which data objects are seen can significantly affect the outcome of the clustering [164]. Iterative methods exist in both the static and dynamic contexts. The distinction lies in whether all the data (static), or only partial data (dynamic), is available when making decisions on iterative cluster assignments.

2.6.1 Clustering Algorithms

Clustering algorithms can be broadly categorised based on the model they use to group data objects. The main clustering algorithm categories considered in this research are:

• Hierarchical based

• Centroid based

• Density based

Hierarchical based clustering algorithms assume that data objects that are closer together (using Euclidean distance, for example) are more likely to be related than objects that are far apart. Hierarchical clustering approaches either start with each data item in its own single-item cluster, to which a merging process is applied (agglomerative, or bottom up, hierarchical clustering), or start with all data items in a single cluster, to which a splitting process is applied (divisive, or top down, hierarchical clustering). Regardless of the approach adopted, a dendrogram can be used to illustrate the merging or splitting process. An example of a hierarchical (graph) clustering method is the Newman approach [144], which has been used for detecting community structure in web networks [143].
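The agglomerative (bottom up) process can be sketched as follows. This is a minimal illustration only: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), neither of which is prescribed above, and the names `single_linkage` and `agglomerative` are hypothetical. The recorded merges are the information a dendrogram would display.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # start: one cluster per data item
    merges = []  # (cluster, cluster, distance) for each merge performed
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```

A divisive (top down) variant would instead start from a single cluster and repeatedly split it.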

A centroid based clustering algorithm uses a central vector as a prototype to represent each of the clusters. This centroid can be assigned using the value of an existing data object, or can be of an arbitrary value. Given a value of k for the number of clusters that exist in the data, an optimisation problem is defined whereby each data object is to be assigned to one of the k clusters according to some distance function. An example of a centroid based clustering algorithm is kmeans [134].

Density based clustering algorithms define clusters to be areas of higher density within data object collections. Density based methods are founded on the idea of connecting points that satisfy a certain density criterion. Clusters can be represented by arbitrary shapes, in contrast to, say, centroid based methods like kmeans. Examples of density based clustering algorithms are DBSCAN [76] and KNN [71, 95].

Algorithm kmeans(k)
1: Assign initial values for centroids c1, c2, ..., ck;
2: repeat
3:     Assign each data object to the cluster with the closest centroid;
4:     Calculate a new centroid for each cluster;
5: until the convergence criterion is met;

Table 2.1: Pseudo code for kmeans algorithm based on [71].

The algorithms that are considered in this thesis are presented in further detail below. Particular emphasis is given to the kmeans algorithm, which is used for the majority of the analysis presented later in this thesis.

2.6.2 Kmeans

The well known [71, 95, 176, 184] kmeans clustering algorithm [134] is a method of partitioning objects into a set of k clusters. Table 2.1 presents some pseudo code for the kmeans algorithm. The initial step is to assign values for the k cluster centroids; this can be done arbitrarily, or using data objects as a guide. The algorithm then assigns each data object to the cluster with the closest centroid, after which the cluster centroids are recalculated. This process is repeated until some convergence criterion is met, for example that no change in cluster centroids is observed after recalculation.
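The steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: it assumes Euclidean distance, seeds the centroids with the first k data objects (one of the options mentioned above), and stops when the centroids no longer change. The function name and the `max_iterations` safeguard are hypothetical.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Static (offline) kmeans over a list of equal-length numeric tuples."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Initial step: assumed seeding with the first k data objects.
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(max_iterations):
        # Assign each data object to the cluster with the closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Recalculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Convergence: no change in centroids after recalculation.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment
```

The function returns the final centroids together with the cluster index assigned to each data object, in input order.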

The result is that each data object is clustered with similar objects, while the similarity between clusters remains low. The algorithm has been used successfully in many applications [71, 95, 176, 184]. Despite the algorithm's success, it is sensitive to the starting conditions, such as the initial clustering and the ordering of the input [31].

The kmeans algorithm is traditionally applied to a collection of data known a priori (a static context), such that all objects are known in advance. The work described in this thesis follows [149, 165] and uses a variation of the kmeans algorithm in an incremental setting (dynamic context) as well as a traditional implementation (static context). In the dynamic setting the data is clustered as it is received (as a stream). As in the original algorithm, the number of clusters, k, must be specified initially. The k cluster centroids are initialised using the first k data objects received, so the centroids are assigned according to the order of objects in the input stream. When this is applied to the web, using the output of a web crawler as the data stream, the initial seeds are not selected randomly from the entire data set, but from the first web pages crawled.
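A rough sketch of the dynamic setting is given below. The seeding with the first k objects follows the description above; the running-mean centroid update and the class name `StreamingKMeans` are assumptions made for illustration, since the exact update rule of the variation in [149, 165] is not restated here.

```python
import math


class StreamingKMeans:
    """Incremental kmeans: objects are clustered one at a time as they arrive."""

    def __init__(self, k):
        self.k = k
        self.centroids = []  # one centroid per cluster
        self.counts = []     # number of objects assigned to each cluster

    def add(self, point):
        """Assign one incoming object and return its cluster index."""
        point = tuple(float(x) for x in point)
        # The first k objects received become the initial centroids.
        if len(self.centroids) < self.k:
            self.centroids.append(point)
            self.counts.append(1)
            return len(self.centroids) - 1
        # Otherwise assign to the closest centroid seen so far.
        nearest = min(
            range(self.k), key=lambda c: math.dist(point, self.centroids[c])
        )
        self.counts[nearest] += 1
        n = self.counts[nearest]
        # Assumed update: move the centroid to the running mean of its members.
        self.centroids[nearest] = tuple(
            m + (x - m) / n for m, x in zip(self.centroids[nearest], point)
        )
        return nearest
```

Because the seeds depend on arrival order, feeding the same objects in a different order can produce a different clustering, which is the sensitivity to input ordering noted earlier.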

ICA The so called Incremental Clustering Algorithm (ICA) presented in this section is based on the work proposed in [132]. The algorithm was originally proposed for the incremental clustering of search results. It is based on the Clusters Average Similarity Area (CASA). For a cluster $c$ containing $n$ documents, such that $c = \{d_1, d_2, \ldots, d_n\}$, the pair-wise similarity of two documents $d_i$ and $d_j$ is denoted $s_{ij}$. The CASA value is therefore calculated as:

$$\mathrm{CASA}_c = \frac{2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} s_{ij}^{2}}{n(n-1)} \tag{2.1}$$

The algorithm assigns new data items to clusters based on the lowest calculated CASA value, by simulating the addition of the new item to each of the clusters. Alternatively, if the CASA value is below a certain threshold, the item forms a new cluster. With respect to the work described in this thesis, the algorithm is used as a binary clustering method to segment a graph as it is dynamically traversed (see Section 7.4.2). The pseudo code used in this work is shown in Table 2.2.

1: C ← Empty Set {cluster set}
2: for each document d do
3:     for each cluster c do
4:         Simulate adding d to c
5:         CASA_c^new = CASA_c
6:     end for
7:     ADD d to c with lowest CASA_c^new
8: end for

Table 2.2: Pseudo code for ICA in the context of this work, based on the method in [132].
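The CASA-based assignment step can be illustrated with the sketch below. It rests on several assumptions: the pair-wise term in Equation (2.1) is read as squared, `pairwise` is a hypothetical callable standing in for the measure s_ij, a cluster with fewer than two documents is given a CASA of 0.0, and the clusters passed to `assign` already exist (how they are seeded is not covered). The new-cluster threshold mentioned above is omitted.

```python
from itertools import combinations


def casa(cluster, pairwise):
    """Clusters Average Similarity Area for one cluster, per Equation (2.1)."""
    n = len(cluster)
    if n < 2:
        return 0.0  # assumption: no pairs exist, so the value is taken as 0
    # Sum of squared pair-wise values over all unordered document pairs.
    total = sum(pairwise(a, b) ** 2 for a, b in combinations(cluster, 2))
    return 2 * total / (n * (n - 1))


def assign(document, clusters, pairwise):
    """Simulate adding the document to each cluster and keep the lowest CASA."""
    best = min(
        range(len(clusters)),
        key=lambda i: casa(clusters[i] + [document], pairwise),
    )
    clusters[best].append(document)  # commit the addition to the chosen cluster
    return best
```

With exactly two clusters supplied, `assign` behaves as a binary clustering step of the kind described above.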

Bisecting kmeans The bisecting k-means clustering algorithm is a partitional clustering algorithm that computes a user specified number of clusters, k, as a sequence of repeated bisections of the vector space. A k-way partitioning via repeated bisections is obtained by recursively computing 2-way clusterings. At each stage one cluster is selected and a bisection is made [189].
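The repeated bisection can be sketched as follows. The text does not state which cluster is selected at each stage or how each bisection is seeded, so this sketch assumes that the largest cluster is always split, and that each 2-way clustering is seeded with the cluster's first point and the point farthest from it. The function names are hypothetical.

```python
import math


def two_means(points, max_iterations=50):
    """Bisect a list of points with a basic 2-means (Euclidean distance)."""
    first = points[0]
    farthest = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, farthest]  # assumed seeding
    groups = (list(points), [])
    for _ in range(max_iterations):
        groups = ([], [])
        for p in points:
            d0 = math.dist(p, centroids[0])
            d1 = math.dist(p, centroids[1])
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        new_centroids = [tuple(sum(x) / len(g) for x in zip(*g)) for g in groups]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return groups


def bisecting_kmeans(points, k):
    """Obtain up to k clusters by repeatedly bisecting the largest cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)  # assumed selection rule
        if len(largest) < 2:
            break  # nothing left to split
        left, right = two_means(largest)
        if not left or not right:
            break  # the selected cluster could not be bisected
        clusters.remove(largest)
        clusters.extend([left, right])
    return clusters
```

Fewer than k clusters are returned when no remaining cluster can be split.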

KNN The k-nearest neighbour algorithm is an iterative agglomerative clustering algorithm [71]. Items are iteratively merged into the existing cluster that is “closest”, provided it lies within a user specified threshold value. If an item exceeds the threshold, a new cluster is created. The principle of the algorithm is to group a set of items based on a majority vote: an item is assigned to the cluster most common amongst its neighbours, hence arbitrary shaped clusters can be found in the vector space. This is in contrast to the kmeans algorithm, which typically finds spherical clusters because each data item is assigned to the nearest mean.

DBSCAN The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm creates clusters that have a minimum size and density [76]. Density is defined as the number of points within a certain distance of one another. Note that the number of clusters, k, is not prescribed, but is determined by the algorithm.
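A compact sketch of the density criterion is shown below, assuming Euclidean distance. The parameter names `eps` (the distance within which points count as neighbours) and `min_points` (the minimum number of neighbours, including the point itself, needed to form a dense region) are chosen here for illustration. Points that belong to no dense region are labelled as noise.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels
```

The number of distinct non-noise labels is the number of clusters found, which is an output of the procedure rather than an input.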
