2. Fundamentals of Web Data Mining and Web Recommendation
2.3. Clustering Algorithms
Clustering analysis is a widely used data mining technique in many data management
applications. Clustering is the process of partitioning a set of data objects into a number
of object clusters, where each data object is highly similar to the other objects within the
same cluster but quite dissimilar to objects in other clusters. Unlike a classification
algorithm, which assigns data objects to labels defined in advance via a supervised
learning process, clustering analysis partitions data objects
objectively by measuring the mutual similarity between data objects, i.e. via an
unsupervised learning process. When class labels are not known before data analysis, for
example when it is hard to assign class labels in large databases, clustering analysis is
sometimes an efficient approach for analysing such data. To perform clustering analysis,
similarity measures are often used to assess
the distance between a pair of data objects based on the feature vectors describing the
objects, which in turn helps to assign them to different object classes/clusters. A variety
of distance functions are used in different scenarios, and the choice depends on the
application background. For example, the cosine function and the Euclidean distance
function are two commonly used distance functions in information retrieval and pattern
recognition [61]. The assignment strategy is another important point in partitioning the
data objects. The distance function and the assignment algorithm are therefore two core
research focuses that have attracted considerable effort from experts in various research
domains, such as databases, data mining, statistics, business intelligence and machine
learning.
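As an illustration of the two measures named above, the following minimal sketch computes
the Euclidean distance and the cosine function for small feature vectors. The vectors and
the zero-vector convention are assumptions made for this example only.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical weighted feature vectors for three data objects.
object_a = [1.0, 0.0, 2.0]
object_b = [2.0, 0.0, 4.0]   # same direction as object_a, twice the length
object_c = [0.0, 3.0, 0.0]   # shares no non-zero attribute with object_a

print(euclidean_distance(object_a, object_b))
print(cosine_similarity(object_a, object_b))
print(cosine_similarity(object_a, object_c))
```

The two measures can disagree: object_a and object_b point in the same direction, so the
cosine function treats them as maximally similar, while their Euclidean distance is
non-zero because their magnitudes differ.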
The main data type typically used in clustering analysis is the matrix expression of data.
Suppose that a data object is represented by a sequence of attributes/features with
corresponding weights; for example, in the context of Web usage mining, a piece of usage
data (i.e. a user session) is modelled as a weighted page sequence. As discussed above,
this data structure takes the object-by-attribute form, i.e. an n-by-m matrix where n
denotes the number of data objects and m the number of attributes. In addition to the data
matrix, a similarity matrix, in which each element reflects the similarity between two
objects, is also used for clustering analysis. In this case, the
similarity matrix is expressed as an n-by-n table. For example, an adjacency matrix is one
such n-by-n structure. Here we adopt the first data expression, i.e. the data matrix, to
address Web usage mining and Web recommendation.
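The two data expressions can be sketched as follows. The page names, the weights and the
use of the cosine function for the similarity matrix are assumptions chosen for
illustration; they are not taken from a real usage log.

```python
import math

# Hypothetical user sessions: page -> weight (e.g. a normalised viewing time).
sessions = [
    {"index.html": 0.5, "products.html": 0.5},
    {"index.html": 0.2, "products.html": 0.8},
    {"contact.html": 1.0},
]

def build_data_matrix(sessions):
    """Return the attribute list and the n-by-m object-by-attribute matrix."""
    pages = sorted({page for s in sessions for page in s})
    matrix = [[s.get(page, 0.0) for page in pages] for s in sessions]
    return pages, matrix

def cosine(u, v):
    """Cosine function between two rows of the data matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def build_similarity_matrix(matrix):
    """Return the n-by-n matrix whose (i, j) element is the similarity of objects i and j."""
    n = len(matrix)
    return [[cosine(matrix[i], matrix[j]) for j in range(n)] for i in range(n)]

pages, data_matrix = build_data_matrix(sessions)       # 3 sessions x 3 pages
similarity_matrix = build_similarity_matrix(data_matrix)  # 3 x 3

for row in data_matrix:
    print(row)
for row in similarity_matrix:
    print([round(value, 3) for value in row])
```

The data matrix grows with the number of attributes m, whereas the similarity matrix grows
with the square of the number of objects n, which is one practical reason to prefer the
data matrix when n is large.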
To date, a large number of approaches and algorithms have been developed for clustering
analysis in the literature [2, 17, 29, 35, 45, 56, 69, 74-76]. Based on their operation
targets and procedures, the major clustering methods can be categorized as partitioning
methods, hierarchical methods, density-based methods, grid-based methods, model-based
methods, high-dimensional clustering and constraint-based clustering [17]. A partitioning
method assigns n objects to k groups, where k is specified in advance and each group
represents a data segment whose members are, on average, more similar to one another than
to objects in other groups. The well-known k-means is one of the most conventional
partitioning clustering algorithms. The algorithm is expressed as follows:
Step 1: arbitrarily choose k data points as the initial cluster mean centres;
Step 2: assign each data object to the cluster with the nearest centre, and update the
mean centre of each cluster;
Step 3: repeat Step 2 until no centre changes and no reassignment is needed;
Step 4: output the resulting clusters and their corresponding centres.
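The four steps above can be sketched as follows. This is a minimal illustration rather than
a reference implementation: the Euclidean distance, the seed parameter, the max_iter cap
and the choice to keep the old centre when a cluster becomes empty are all assumptions
added for the example.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_centre(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k data points as the initial centres.
    centres = [list(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2 (assign): each object goes to the cluster with the nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centres[c])) for p in data
        ]
        # Step 3: stop once no object changes cluster (centres then stay fixed too).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2 (update): recompute the mean centre of each cluster.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:  # assumed policy: an empty cluster keeps its old centre
                centres[c] = mean_centre(members)
    # Step 4: output the cluster label of each object and the centres.
    return assignment, centres

# Hypothetical two-dimensional data with two well-separated groups.
data = [[0, 0], [0, 1], [1, 0], [11, 11], [11, 10], [10, 11]]
labels, centres = k_means(data, k=2)
print(labels)
print(centres)
```

Because Step 1 is arbitrary, different seeds may give different cluster numbering, and on
less clearly separated data they may also give different partitions.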
Hierarchical clustering constructs a hierarchical tree over the given set of data objects
instead of crisply assigning each data object to a distinct data segment. The algorithm
is executed in the following way:
Step 1: calculate the mutual distance between every pair of data points (the distance
matrix) as the clustering criterion;
Step 2: decompose the dataset into a set of levels of nested aggregations based on the
clustering criterion;
Step 3: cut the hierarchical tree at the desired level by selecting a predefined threshold,
and then merge all connected objects below the cut level to create the clusters;
Step 4: output the dendrogram and the clusters.
Hierarchical clustering provides an easily visualized way of modelling the underlying
relationships among the data objects. It is commonly carried out as an agglomerative
approach, which suffers from the problem of one-way construction: once two groups have
been merged during the construction of the hierarchical tree, the merge cannot be undone.
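A minimal sketch of the agglomerative variant is given below. It assumes Euclidean distance
and single linkage (the distance between two groups is that of their closest pair of
members); neither choice is prescribed by the text. Merging stops as soon as the closest
pair of groups is farther apart than the threshold, which for single linkage is equivalent
to cutting the finished tree at that level.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerative(data, threshold):
    n = len(data)
    # Step 1: pairwise distance matrix used as the clustering criterion.
    dist = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # every object starts as its own cluster
    merges = []                         # flat record of the dendrogram
    # Step 2: build the nested levels by repeatedly merging the closest pair.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: cut the tree at the predefined threshold.
        if d > threshold:
            break
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # a merge is never revisited
        del clusters[b]
    # Step 4: output the clusters (as object indices) and the merge history.
    return clusters, merges

# Hypothetical data: two tight pairs and one isolated object.
data = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters, merges = agglomerative(data, threshold=2.0)
print(clusters)
for left, right, height in merges:
    print(left, right, height)
```

The merges list makes the one-way construction visible: each entry is appended once and
never removed, so an early merge constrains every later level of the tree.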
Model-based methods hypothesize that a model exists for each cluster, and each data object
is assigned to the cluster whose model fits it best, as measured by a density function. The
density function, which is associated with the spatial distribution of the data, helps to
determine the number of clusters and to assign the data objects to them. For example,
Self-Organizing Map (SOM) based clustering is one of the model-based methods; it maps data
objects from a high-dimensional space onto a low-dimensional (e.g. 2-D or 3-D) grid map via
a neural-network-based algorithm. The SOM-based clustering algorithm ultimately maps the
original data objects onto different blocks/segments of the SOM grid, and the locality of
the data points conveys the clustering information visually. The SOM-based clustering
algorithm is briefly summarized as follows:
Step 1: The SOM consists of a regular, usually two-dimensional, grid of map units. Each
unit i is represented by an n-dimensional prototype vector, $m_i = [m_{i1}, \ldots, m_{in}]$,
where n is the dimension of the input space. In the grid, the units are connected to
adjacent ones by a neighbourhood relation. The number of map units, depending on the size
of the input space, determines the accuracy and generalization capability of the SOM;
Step 2: On each learning step, a data sample x is selected and the nearest map unit (i.e.
the Best Matching Unit, BMU) is found on the map. The prototype vectors of the BMU and of
its neighbouring units on the grid are moved toward the sample vector:

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{bi}(t)\, [x - m_i(t)] \qquad (2.1)$$

where $\alpha(t)$ is the learning rate and $h_{bi}(t)$ is a neighbourhood kernel centred on
the winner unit b. Both the learning rate and the neighbourhood kernel radius decrease
monotonically with time;
Step 3: The SOM is trained iteratively until the following error function reaches the
minimum:

$$E = \sum_{i=1}^{N} \sum_{j=1}^{M} h_{bj}\, \lVert x_i - m_j \rVert^2 \qquad (2.2)$$

where N is the number of the training data and M is the number of the map units.
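The three steps can be sketched as follows. Several choices here are assumptions made for
the example and are not specified in the text: the Gaussian form of the neighbourhood
kernel, the linear decay of the learning rate and kernel radius, the random initialisation
of the prototype vectors, and training for a fixed number of steps instead of monitoring
the error function until it reaches its minimum.

```python
import math
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def bmu_index(protos, x):
    """Index b of the Best Matching Unit: the prototype nearest to sample x."""
    return min(range(len(protos)), key=lambda i: sq_dist(protos[i], x))

def kernel(coords, b, i, sigma):
    """Assumed Gaussian neighbourhood kernel h_bi, centred on winner unit b."""
    grid_d2 = (coords[b][0] - coords[i][0]) ** 2 + (coords[b][1] - coords[i][1]) ** 2
    return math.exp(-grid_d2 / (2.0 * sigma ** 2))

def som_step(protos, coords, x, alpha, sigma):
    """One learning step: update every prototype in place by equation (2.1)."""
    b = bmu_index(protos, x)
    for i in range(len(protos)):
        h = kernel(coords, b, i, sigma)
        protos[i] = [m + alpha * h * (xk - m) for m, xk in zip(protos[i], x)]
    return b

def som_error(data, protos, coords, sigma):
    """Error function of equation (2.2) for the current map."""
    total = 0.0
    for x in data:
        b = bmu_index(protos, x)
        total += sum(
            kernel(coords, b, j, sigma) * sq_dist(x, protos[j])
            for j in range(len(protos))
        )
    return total

def train_som(data, rows, cols, n_steps=500, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: a regular two-dimensional grid, one prototype vector per map unit.
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    protos = [[rng.random() for _ in range(dim)] for _ in coords]
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_steps):
        decay = 1.0 - t / n_steps  # learning rate and radius shrink monotonically
        # Step 2: pick a sample, find its BMU, move the BMU and its neighbours.
        som_step(protos, coords, rng.choice(data), alpha0 * decay, sigma0 * decay + 0.01)
    return coords, protos

# Hypothetical two-dimensional samples scaled to the unit square.
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
coords, protos = train_som(data, rows=2, cols=3)
print(som_error(data, protos, coords, sigma=0.5))
for x in data:
    print(x, "->", coords[bmu_index(protos, x)])
```

After training, the grid position of each sample's BMU serves as its cluster indication:
samples mapped to the same or adjacent map units are treated as belonging together.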