2. Fundamentals of Web Data Mining and Web Recommendation

2.3. Clustering Algorithms

Clustering analysis is a widely used data mining technique in many data management
applications. Clustering is the process of partitioning a set of data objects into a number of
clusters, where each data object is highly similar to the other objects within the same
cluster but quite dissimilar to objects in other clusters. Unlike a classification algorithm,
which assigns data objects to previously defined labels via a supervised learning process,
clustering analysis partitions data objects
objectively by measuring the mutual similarity between data objects, i.e. via an
unsupervised learning process. When class labels are not known before data analysis, for
example when they are hard to assign in large databases, clustering analysis is
sometimes an efficient approach for analysing this kind
of data. To perform clustering analysis, similarity measures are often used to assess
the distance between a pair of data objects based on the feature vectors describing them,
which in turn helps assign the objects to different classes/clusters. A variety of distance
functions are used in different scenarios, depending on the application background. For
example, the cosine function and the Euclidean distance function are two commonly used
distance functions in information retrieval and pattern recognition [61]. The assignment
strategy is another important element in partitioning the data objects. The distance
function and the assignment algorithm are therefore two core research focuses that have
attracted considerable effort from experts in various research domains, such as databases,
data mining, statistics, business intelligence and machine learning.
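
As a minimal sketch of the two functions named above, the following Python fragment computes the cosine function and the Euclidean distance for a pair of feature vectors. The function names and the sample vectors are illustrative assumptions and do not come from the text. Note that the cosine function yields a similarity (larger means closer), so a distance can be taken as one minus its value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical weighted feature vectors describing two data objects.
x = [1.0, 2.0, 0.0]
y = [2.0, 4.0, 0.0]
print(cosine_similarity(x, y))   # same direction, so close to 1
print(euclidean_distance(x, y))  # but the vectors are not identical
```

The example also shows why the choice depends on the application: the two vectors point in the same direction, so the cosine function treats them as nearly identical, while the Euclidean distance still separates them by magnitude.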

The main data type typically used in clustering analysis is the matrix representation of
data. Suppose that a data object is represented by a sequence of attributes/features with
corresponding weights; for example, in the context of Web usage mining, a piece of usage
data (i.e. a user session) is modelled as a weighted page sequence. As discussed above,
this data structure takes the object-by-attribute form, i.e. an n-by-m matrix where n
denotes the number of data objects and m the number of attributes. In addition to the data
matrix, a similarity matrix, in which each element reflects the similarity between two
objects, is also used for clustering analysis. In this case, the
similarity matrix is expressed by an n-by-n table. For example, an adjacency matrix

we adopt the first data expression, i.e. data matrix to address Web usage mining and Web

recommendation.
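
The two representations can be illustrated with a small sketch. It assumes that each user session is given as a mapping from page name to weight; the page names, the weights and the helper names are hypothetical. The cosine function is used here to fill the n-by-n similarity matrix, which is only one possible choice.

```python
import math


def build_data_matrix(sessions, pages):
    """n-by-m data matrix: one row per session, one column per page."""
    return [[session.get(page, 0.0) for page in pages] for session in sessions]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def build_similarity_matrix(data):
    """n-by-n similarity matrix: element (i, j) compares objects i and j."""
    return [[cosine(row_i, row_j) for row_j in data] for row_i in data]


# Hypothetical weighted page sequences (e.g. weights derived from viewing time).
pages = ["a.html", "b.html", "c.html"]
sessions = [
    {"a.html": 1.0, "b.html": 1.0},
    {"a.html": 2.0, "b.html": 2.0},
    {"c.html": 3.0},
]

data_matrix = build_data_matrix(sessions, pages)          # 3-by-3 here (n = 3, m = 3)
similarity_matrix = build_similarity_matrix(data_matrix)  # always n-by-n
```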

To date, a large number of approaches and algorithms have been developed for clustering
analysis in the literature [2, 17, 29, 35, 45, 56, 69, 74-76]. Based on their operation
targets and procedures, the major clustering methods can be categorized as partitioning
methods, hierarchical methods, density-based methods, grid-based methods, model-based
methods, high-dimensional clustering and constraint-based clustering [17]. A partitioning
method assigns n objects to k predefined groups, where each group represents a data
segment whose members share a higher average similarity with one another than with
members of other groups. The well-known k-means is one of the most conventional
partitioning clustering algorithms. The algorithm proceeds as follows:

Step 1: arbitrarily choose k data points as the initial cluster mean centres;

Step 2: assign each data point to the cluster with the nearest centre, and update the mean centre of each cluster;

Step 3: repeat step 2 until no centre changes and no reassignment is needed;

Step 4: output the resulting clusters and their corresponding centres.
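
The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it uses the squared Euclidean distance, the names `k_means`, `initial_centres` and `max_iter` are hypothetical, and an empty cluster simply keeps its previous centre.

```python
import random


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, initial_centres=None, max_iter=100):
    # Step 1: arbitrarily choose k data points as the initial centres
    # (or use the centres supplied by the caller, an added convenience).
    centres = list(initial_centres) if initial_centres else random.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2 (first half): assign each point to the nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: squared_euclidean(p, centres[c]))
            for p in points
        ]
        # Step 3: stop once no point is reassigned (centres then stay fixed too).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2 (second half): update the mean centre of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[c] = mean_point(members)
    # Step 4: output the clusters and their centres.
    clusters = [[p for p, a in zip(points, assignment) if a == c] for c in range(k)]
    return clusters, centres


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centres = k_means(points, 2, initial_centres=[(0, 0), (10, 10)])
```

Because the result depends on the initial centres chosen in step 1, different starting points can lead to different partitions on less well-separated data.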

Hierarchical clustering constructs a hierarchical tree over the given set of data objects
instead of crisply assigning each data object to a distinct data segment. The algorithm
is executed as follows:

Step 1: calculate the mutual distance between every pair of data points (the distance matrix) as the clustering criterion;

Step 2: decompose the dataset into a set of levels of the nested aggregations based on the

Step 3: cut the hierarchical tree at the desired level by selecting a predefined threshold, and then explicitly merge all connected objects below the cut level to create the clusters;

Step 4: output the dendrogram and the clusters.
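
One way to realise this procedure is bottom-up merging, sketched below. The sketch makes several assumptions that are not stated in the text: it uses the Euclidean distance, it measures the distance between two clusters by their closest pair of members (single linkage), and it stops merging once the closest pair of clusters is farther apart than the threshold, which for single linkage has the same effect as cutting the finished tree at that height. The function and variable names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, threshold):
    n = len(points)
    # Step 1: pairwise distance matrix.
    dist = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per object
    merges = []                         # merge history, i.e. the dendrogram levels
    while len(clusters) > 1:
        # Step 2: find the two closest clusters (closest pair of members).
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Step 3: do not merge across the cut level.
        if d > threshold:
            break
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    # Step 4: output the merge history and the clusters (as lists of indices).
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, threshold=2.0)
```

Each merge is final once recorded in `merges`, which is the one-way construction property of this family of methods.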

Hierarchical clustering provides an easily visualized way of modelling the underlying
relationships among the data objects. Hierarchical clustering can be considered an
agglomerative approach, which suffers from the problem of one-way construction: a merge
decision, once made, cannot be undone during the construction of the hierarchical tree.

Model-based methods hypothesize that there exists a model for each of the clusters, and
each data object is assigned to the model it fits best, as measured by a density function.
The density function, which is associated with the particular distribution of the data,
helps to determine the number of clusters and to assign the data objects to the clusters.
For example, Self-Organizing Map (SOM) based clustering is one of the model-based methods;
it maps a data object in a high-dimensional space onto a low-dimensional (e.g. 2-D or 3-D)
grid map via a neural network based algorithm. The SOM-based clustering algorithm
ultimately maps the original data objects onto different blocks/segments of the SOM grid,
and the locality of the data points provides the visualized clustering information. The
SOM-based clustering algorithm is briefly summarized as follows:

Step 1: The SOM process consists of a regular, usually two-dimensional, grid of map
units. Each unit i is represented by an n-dimensional prototype vector, $m_i = [m_{i1}, \ldots, m_{in}]$,
where n is the dimension of the input space. In the grid, the units are connected to

depending on the size of the input space, determines the accuracy and generalization

capability of the SOM;

Step 2: On each learning step, a data sample x is selected and the nearest map unit (i.e.
the Best Matching Unit, BMU) is found on the map. The prototype vectors of the BMU and
its neighbouring units on the grid are moved toward the sample vector:

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x - m_i(t)] \tag{2.1}$$

where α(t) is the learning rate and $h_{bi}(t)$ is a neighbourhood kernel centred on the
winner unit. Both the learning rate and the neighbourhood kernel radius decrease
monotonically with time;

Step 3: The SOM is trained iteratively until the following error function reaches its minimum:

$$E = \sum_{i=1}^{N} \sum_{j=1}^{M} h_{bj}\, \lVert x_i - m_j \rVert^2 \tag{2.2}$$

where N is the number of the training data and M is the number of the map units.
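
The update rule (2.1) and the error function (2.2) can be sketched in plain Python as below. Several details are assumptions made only for illustration: the neighbourhood kernel is taken to be Gaussian, the learning rate and radius decay linearly, prototypes are initialised from randomly chosen samples, and training runs for a fixed number of steps instead of monitoring the error until it reaches its minimum. All function and parameter names are hypothetical.

```python
import math
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_matching_unit(prototypes, x):
    """Index b of the prototype vector nearest to the sample x (the BMU)."""
    return min(range(len(prototypes)), key=lambda i: squared_distance(prototypes[i], x))


def kernel(grid, b, i, radius):
    """Neighbourhood kernel h_bi centred on unit b (Gaussian, an assumed choice)."""
    return math.exp(-squared_distance(grid[b], grid[i]) / (2.0 * radius ** 2))


def som_step(prototypes, grid, x, alpha, radius):
    """One learning step, Eq. (2.1): m_i <- m_i + alpha * h_bi * (x - m_i)."""
    b = best_matching_unit(prototypes, x)
    for i, m in enumerate(prototypes):
        h = kernel(grid, b, i, radius)
        prototypes[i] = [mk + alpha * h * (xk - mk) for mk, xk in zip(m, x)]


def som_error(prototypes, grid, data, radius):
    """Error function, Eq. (2.2), where b is the BMU of each sample x_i."""
    total = 0.0
    for x in data:
        b = best_matching_unit(prototypes, x)
        for j, m in enumerate(prototypes):
            total += kernel(grid, b, j, radius) * squared_distance(x, m)
    return total


def train_som(data, rows, cols, steps=200, alpha0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]  # unit positions
    prototypes = [list(rng.choice(data)) for _ in grid]        # assumed initialisation
    for t in range(steps):
        decay = 1.0 - t / steps             # both quantities shrink with time
        alpha = alpha0 * decay
        radius = max(radius0 * decay, 0.1)  # keep the kernel radius positive
        som_step(prototypes, grid, rng.choice(data), alpha, radius)
    return prototypes, grid
```

After training, each data object can be mapped to the grid position of its BMU, and objects that land on the same or neighbouring units form the visual clusters described above.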