Methods for Large Databases - Transportation data analysis. Advances in data mining and uncerta

2.3 Clustering

2.3.5 Methods for Large Databases

The clustering methods introduced so far are a selection of the most widely applied traditional techniques. They are of limited use on large datasets, and in particular on dynamic datasets. This type of data, which is becoming increasingly common, requires that:

1. data must be read only once;

2. the algorithm is online, i.e. the 'best' solution found so far is available while it is executing;

3. the algorithm can be easily suspended, stopped and restarted;

4. results must be updated incrementally when data are added to or removed from the database;

5. memory resources are limited;

6. the algorithm can make a scan of the database;

7. each observation is processed only once.

BIRCH Algorithm

The BIRCH algorithm (Balanced Iterative Reducing and Clustering Using Hierarchies) (Zhang, Ramakrishnan, and Livny 1996) is an incremental hierarchical algorithm that can be used successfully on large databases, since it needs limited memory resources and reads the data only once.

Its structure is based on the concepts of clustering feature CF and CF Tree:

• A clustering feature (CF) is the triplet (N, LS, SS), where N is the number of elements in a cluster, LS is the vector sum of the elements in the cluster and SS is the sum of the squares of the elements in the cluster;

• A CF Tree is a balanced tree with branching factor B, which is the maximum number of children that a node can generate. Each node contains a CF triplet for each of its children. If the node is a leaf, it represents a cluster and holds a CF for each sub-cluster, none of which may have a diameter larger than the threshold T.

The CF Tree is therefore built by adding observations one at a time while respecting the maximum diameter T allowed for each leaf, the maximum number of children B that a node can generate and the memory limits. The diameter is calculated as the mean of the distances between all pairs of observations belonging to the cluster (in BIRCH, as a root mean square of these pairwise distances). A larger value of T produces a smaller tree, which is useful when memory resources are limited.
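As a minimal sketch of why the triplet is sufficient, the following Python class shows how the centroid and the diameter of a cluster can be obtained from N, LS and SS alone, and how two clustering features are combined by simple addition. The class and method names are illustrative assumptions, not taken from the source, and the diameter follows the root-mean-square pairwise definition.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """CF triplet (N, LS, SS) summarising a set of d-dimensional points.

    Names are hypothetical; only the triplet itself comes from the text.
    """
    n: int            # N: number of elements
    ls: List[float]   # LS: component-wise sum of the elements
    ss: float         # SS: sum of the squared norms of the elements

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: the CF of a union is the sum of the CFs.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def diameter(self) -> float:
        # Root-mean-square pairwise distance, using only N, LS and SS:
        # D^2 = (2*N*SS - 2*|LS|^2) / (N*(N-1))
        if self.n < 2:
            return 0.0
        ls_sq = sum(v * v for v in self.ls)
        d2 = (2 * self.n * self.ss - 2 * ls_sq) / (self.n * (self.n - 1))
        return sqrt(max(d2, 0.0))  # guard against tiny negative rounding
```

Because no raw observation is stored, the memory needed per cluster is independent of the number of elements it summarises.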

Meanwhile, the clustering features associated with each node of the tree summarise the characteristics of the clusters, which speeds up the update of the tree and reduces access to the data to a single pass.

The building of the CF Tree is a dynamic and incremental process. Once the limits of the tree have been defined by the parameters B and T, each observation is considered in turn and its distance from the centroid of each cluster is determined using the information available from the clustering features. If the parameters B and T are respected, the new observation is added to the closest cluster, the clustering feature of that cluster is updated, and the same is done for the clustering features on the path from the cluster to the root of the tree.

If the conditions are violated, the new observation is added to the node as a new cluster and the affected clustering features are calculated and updated.
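The insertion step can be illustrated with a deliberately simplified sketch. It assumes a single flat list of leaf entries instead of a full tree, so the branching factor B, node splitting and the update of ancestor nodes are all omitted; only the threshold test on T is shown. The function names and the tuple layout are assumptions made for this example.

```python
from math import dist, sqrt
from typing import List, Sequence, Tuple

CF = Tuple[int, List[float], float]  # (N, LS, SS), hypothetical layout


def add_point(cf: CF, x: Sequence[float]) -> CF:
    """Return the CF obtained by adding observation x to cf."""
    n, ls, ss = cf
    return n + 1, [a + b for a, b in zip(ls, x)], ss + sum(v * v for v in x)


def centroid(cf: CF) -> List[float]:
    n, ls, _ = cf
    return [v / n for v in ls]


def diameter(cf: CF) -> float:
    """Root-mean-square pairwise distance computed from the CF only."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    d2 = (2 * n * ss - 2 * sum(v * v for v in ls)) / (n * (n - 1))
    return sqrt(max(d2, 0.0))


def insert(entries: List[CF], x: Sequence[float], threshold: float) -> None:
    """Absorb x into the closest entry if its diameter stays <= threshold,
    otherwise open a new entry. Each observation is read exactly once."""
    if entries:
        closest = min(range(len(entries)),
                      key=lambda k: dist(centroid(entries[k]), x))
        candidate = add_point(entries[closest], x)
        if diameter(candidate) <= threshold:
            entries[closest] = candidate
            return
    entries.append((1, list(x), sum(v * v for v in x)))
```

Feeding the observations (0, 0), (0.1, 0), (5, 5), (5.1, 5) and (0, 0.1) one by one with T = 1 yields two entries, holding three and two observations respectively. A real CF Tree would additionally split a node when it exceeds B children.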

The BIRCH algorithm is very efficient if the threshold value T has been correctly identified; otherwise the tree must be rebuilt. Furthermore, BIRCH is best suited to spherical clusters, since it relies on the maximum diameter T to define the boundaries of clusters.

DBSCAN Algorithm

The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) (Ester et al. 1996) is a partitioning algorithm based on a measure of density, which is of particular interest because it can identify clusters of arbitrary shape.

The algorithm is guided by two parameters: MinPts, which defines the minimum number of elements in a cluster, and Eps, which is the maximum distance at which two distinct elements are considered neighbours within the same cluster. Some preliminary definitions are needed to understand the algorithm correctly:

• The Eps-neighborhood of the element p is the set of elements that lie within a circle of radius Eps centered on p;

• If the Eps-neighborhood of p contains at least MinPts elements, then p is a core point;

• Given the values MinPts and Eps, the element p is "directly density-reachable" from q if:

1. dis(p, q) ≤ Eps

2. |{r : dis(r, q) ≤ Eps}| ≥ MinPts

• Given the values MinPts and Eps, the element p is "density-reachable" from q if there exists a chain of elements p_1, ..., p_n, with p_1 = q and p_n = p, such that p_{i+1} is directly density-reachable from p_i for 1 ≤ i < n.

• Given the values MinPts and Eps, the element p is "density-connected" to q if there exists an element o such that both p and q are density-reachable from o.

Figure 2.4 shows some examples of these concepts, where the circles have radius Eps and MinPts = 3:

• the elements r and s are "density-connected" to each other through o;

• the element q is "density-reachable" from p.

Figure 2.4: DBSCAN Distances. Source: Ester et al. 1996
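The first three definitions translate almost directly into code. The sketch below assumes Euclidean distance and counts an element as a member of its own Eps-neighborhood; the function names are hypothetical. It also makes visible that direct density-reachability is not symmetric: a border element is reachable from a core point, but not the other way round.

```python
from math import dist
from typing import List, Sequence

Point = Sequence[float]


def eps_neighborhood(points: List[Point], q: Point, eps: float) -> List[Point]:
    """All elements within distance eps of q (q itself included, by assumption)."""
    return [r for r in points if dist(r, q) <= eps]


def is_core(points: List[Point], q: Point, eps: float, min_pts: int) -> bool:
    """q is a core point if its Eps-neighborhood holds at least min_pts elements."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points: List[Point], p: Point, q: Point,
                               eps: float, min_pts: int) -> bool:
    """p lies in the Eps-neighborhood of q, and q is a core point."""
    return dist(p, q) <= eps and is_core(points, q, eps, min_pts)
```

With the elements (0, 0), (1, 0), (0, 1) and (2, 0), Eps = 1.2 and MinPts = 3, the element (2, 0) is directly density-reachable from the core point (1, 0), while (1, 0) is not directly density-reachable from (2, 0), which has only two elements in its neighborhood.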

Using these concepts, the density of a cluster is the criterion that determines whether an element belongs to it: each cluster has a central set of directly density-reachable elements that lie very close to one another (the core points), surrounded by a series of other elements on the border of the cluster that are sufficiently close to the central points (border points). Finally, elements not belonging to any cluster are defined as 'noise' and considered outliers.
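A compact sketch of how clusters can be grown from core points is given below. It is a brute-force illustration under stated assumptions (Euclidean distance, every neighborhood found by scanning all elements, an element counted in its own neighborhood) rather than a reproduction of the original implementation, which relies on a spatial index. The label -1 for noise is a convention chosen for this example.

```python
from math import dist
from typing import List, Optional, Sequence

NOISE = -1  # label used here for elements that belong to no cluster


def dbscan(points: List[Sequence[float]], eps: float, min_pts: int) -> List[int]:
    """Return one cluster label per element (0, 1, ... or NOISE)."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbours(i: int) -> List[int]:
        # Brute-force Eps-neighborhood; includes i itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may still become a border point
            continue
        cluster += 1                      # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # noise reached from a core: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts: # j is core as well: keep expanding
                queue.extend(reachable)
    return labels
```

On two well separated groups of elements plus one isolated element, the function assigns one label per group and marks the isolated element as noise.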

OPTICS Algorithm

The DBSCAN algorithm is a powerful method for detecting clusters of arbitrary shape. However, the quality of the results depends on choosing suitable values for the parameters MinPts and Eps. This is a common issue for many clustering methods, but it is more relevant for multidimensional data, especially if the data distributions are skewed along some dimensions.

The OPTICS algorithm (Ordering Points to Identify the Clustering Structure) (Ankerst et al. 1999) was developed to solve this problem. Its result is a cluster ordering that can be analysed automatically. This ordering is equivalent to the clusterings obtained from the DBSCAN algorithm over a wide range of parameter values.

In fact, for a constant value of MinPts, clusters with a higher density (smaller values of Eps) are completely contained in density-connected sets with a lower density. By executing the DBSCAN algorithm while progressively varying the parameter Eps, it is possible to obtain an ordering of clusters starting from those with the highest density (Figure 2.5).

Therefore two values are calculated for each of the elements analysed, the core distance and the reachability distance:

• The core distance of an element p is the smallest value Eps′ which makes p a core object. If p is not a core object, this distance is undefined.

• The reachability distance of an element q with respect to another element p is the maximum of the core distance of p and the Euclidean distance between p and q. If p is not a core object, this distance is undefined.
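The two distances can be computed with a few lines of Python. This is a minimal sketch which assumes that p is contained in the dataset and counts towards its own neighborhood, that distances are Euclidean, and that an undefined distance is represented by None; the function names and the optional eps argument are choices made for this example.

```python
from math import dist, inf
from typing import List, Optional, Sequence

Point = Sequence[float]


def core_distance(points: List[Point], p: Point, min_pts: int,
                  eps: float = inf) -> Optional[float]:
    """Smallest radius Eps' <= eps that makes p a core object, else None.

    Assumes p is one of `points`, so it appears at distance 0.
    """
    distances = sorted(dist(p, r) for r in points)
    if len(distances) < min_pts or distances[min_pts - 1] > eps:
        return None                       # p is not a core object
    return distances[min_pts - 1]


def reachability_distance(points: List[Point], q: Point, p: Point,
                          min_pts: int, eps: float = inf) -> Optional[float]:
    """max(core distance of p, distance between p and q), or None if undefined."""
    cd = core_distance(points, p, min_pts, eps)
    if cd is None:
        return None
    return max(cd, dist(p, q))
```

For the elements (0, 0), (1, 0), (3, 0) and (6, 0) with MinPts = 3, the core distance of p = (0, 0) is 3. The reachability distance of (1, 0) with respect to p is therefore 3, not 1, while that of (6, 0) is 6. If Eps is limited to 2, p is no longer a core object and both distances are undefined.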

By analysing this pair of values, associated with each element of the dataset, it is possible to establish alternative clustering solutions and to evaluate the influence of the choice of the distance Eps′.

Figure 2.5: OPTICS. Illustration of the Cluster Ordering. Source: Ankerst et al. 1999
