Spatial classification problems are used to partition sets of spatial objects. Spatial objects could be classified using nonspatial attributes, spatial predicates (spatial attributes), or spatial and nonspatial attributes. Concept hierarchies may be used, as may sampling. As with other types of spatial mining, generalization and progressive refinement techniques may be used to improve efficiency.
8.6.1 ID3 Extension
The concept of neighborhood graphs has been applied to perform classification of spatial objects using an ID3 extension [EKS97]. A neighborhood graph is a graph constructed from the objects in the space. Each object becomes a node in the graph. The edges are constructed from the neighbors; that is, two nodes are connected by an edge in the neighborhood graph if one is a neighbor of the other. "Neighbor" can be defined based on any relationship between the spatial objects, such as distance less than a particular threshold, satisfiability of a topological relationship between the objects, or a direction relationship. Note that some of the relationships are order relationships and others are not. The idea of the algorithm is to take into account the objects that are near a given object. A max-length indicator is input that specifies the maximum length of a neighborhood path starting at the node. This then identifies a set of nodes that are associated with the target node. ID3 then considers for classification purposes not only the nonspatial attributes of the target object, but also those of neighboring objects.
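The neighborhood graph and the max-length restriction can be sketched as follows. This is only an illustration under stated assumptions: objects are reduced to points, "neighbor" is taken to be a Euclidean distance below a threshold (so the graph is undirected), and the function names are hypothetical rather than taken from [EKS97].

```python
from collections import deque
from math import dist

def build_neighborhood_graph(points, threshold):
    """Connect two objects when their Euclidean distance is below threshold."""
    graph = {i: set() for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) < threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def neighbors_within(graph, start, max_length):
    """Nodes reachable from start by a neighborhood path of at most max_length edges."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_length:
            continue  # do not extend paths beyond the max-length indicator
        for nxt in graph[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return {n for n in depth if n != start}

points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 10)]
graph = build_neighborhood_graph(points, threshold=1.5)
# Objects whose attributes would be considered alongside those of object 0.
print(neighbors_within(graph, 0, max_length=2))
```

A directed graph would be needed for order relationships such as direction; the breadth-first search itself would stay the same.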
8.6.2 Spatial Decision Tree
One spatial classification technique builds decision trees using a two-step process similar to that used for association rules [KHS98]. The basis of the approach is that spatial objects can be described based on objects close to them. A description of the classes is then assumed to be based on an aggregation of the most relevant predicates for objects nearby.
To construct the decision tree, the most relevant predicates (spatial and nonspatial) are first determined. It is hoped that this process will create smaller and more accurate decision trees. These relevant predicates are the ones that will be used to build the decision tree. It is assumed that a training sample is used to perform this step and that weights are assigned to attributes and predicates. Initial weights are 0. Two corresponding objects are examined for each object. The nearest miss is the spatial object closest to the target object that is in a different class. The nearest hit is the closest object in the same class. For each predicate value in the target object, if the nearest hit object has the same value, then the weight of that predicate is increased. If it has a different value, then the weight is decreased. Likewise, the weight is decreased (increased) if the nearest miss has the same (different) value. Only predicates with positive weights above a predefined threshold are then used to construct the tree. It is proposed that, because of the complexity of finding the relevant predicates, relevant predicates be found first at a coarse level and then at a finer level. MBRs, instead of actual objects, and a generalized coarse close_to relationship are first used to find the relevant predicates. Then these relevant predicates and the true objects are used during the second pass.
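The nearest-hit and nearest-miss weighting can be illustrated with a small sketch. It assumes point objects, Euclidean distance, and a unit step for each increase or decrease; the text does not give the step size, and the predicate names below are hypothetical.

```python
from math import dist

def nearest(samples, i, same_class):
    """Closest other sample to samples[i], in the same class (hit) or a different one (miss)."""
    loc, _, label = samples[i]
    candidates = [s for j, s in enumerate(samples)
                  if j != i and (s[2] == label) == same_class]
    return min(candidates, key=lambda s: dist(loc, s[0]), default=None)

def relevance_weights(samples):
    """samples: list of (location, {predicate: value}, class label)."""
    weights = {name: 0 for name in samples[0][1]}  # initial weights are 0
    for i, (_, preds, _) in enumerate(samples):
        hit = nearest(samples, i, same_class=True)
        miss = nearest(samples, i, same_class=False)
        for name, value in preds.items():
            if hit is not None:   # same value as nearest hit: increase, else decrease
                weights[name] += 1 if hit[1][name] == value else -1
            if miss is not None:  # same value as nearest miss: decrease, else increase
                weights[name] += -1 if miss[1][name] == value else 1
    return weights

samples = [
    ((0, 0), {"close_to_water": True, "near_road": True}, "A"),
    ((1, 0), {"close_to_water": True, "near_road": False}, "A"),
    ((5, 0), {"close_to_water": False, "near_road": True}, "B"),
    ((6, 0), {"close_to_water": False, "near_road": False}, "B"),
]
weights = relevance_weights(samples)
threshold = 0  # assumed threshold; only predicates above it are kept
relevant = [name for name, w in weights.items() if w > threshold]
print(weights, relevant)
```

In this toy sample close_to_water separates the classes and ends with a positive weight, while near_road does not and is dropped.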
For each object in the sample, the area around it, called its buffer, is examined. A description of this buffer is created by aggregating the values of the most relevant predicates of the items in the buffer. Obviously, the size and shape of the buffer impact the resulting classification algorithm. It is possible, although unrealistic, to perform an exhaustive search around all possible buffer sizes and shapes. The objective would be to choose the one that results in the best discrimination between classes in the training set. This would be calculated using the information gain. Other approaches based on picking a particular shape were examined, and the authors finally used circles (equidistance buffers).
To construct the tree, it is assumed that each sample object has associated with it a set of generalized predicates that it satisfies. Counts of the number of objects that satisfy (do not satisfy) each predicate can then be determined. This is then used to calculate information gain as is done in ID3. Instead of creating a multiway branching tree, a binary decision tree is created. The resulting algorithm to construct the decision tree is shown in Algorithm 8.5.
ALGORITHM 8.5
Input:
   D   // Data, including spatial and nonspatial attributes
   C   // Concept hierarchies
Output:
   T   // Binary decision tree
SPATIAL decision tree algorithm:
   find a sample S of data from D with known classification;
   identify the best predicates p to use for classification;
   determine the best buffer size and shape;
   using p and C, generalize the predicates for each buffer;
   build binary T using the generalized predicates and ID3;
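The information gain used to pick a binary split can be sketched from the satisfy/do-not-satisfy counts. This is a minimal illustration, assuming each sample is given as a set of satisfied generalized predicates plus a class label; the predicate names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(samples, predicate):
    """samples: list of (set of satisfied predicates, class label)."""
    labels = [label for _, label in samples]
    yes = [label for preds, label in samples if predicate in preds]
    no = [label for preds, label in samples if predicate not in preds]
    # Weighted entropy of the two branches of the binary split.
    remainder = sum(len(part) / len(samples) * entropy(part)
                    for part in (yes, no) if part)
    return entropy(labels) - remainder

samples = [
    ({"close_to_water"}, "A"),
    ({"close_to_water", "near_road"}, "A"),
    ({"near_road"}, "B"),
    (set(), "B"),
]
gain_water = information_gain(samples, "close_to_water")
gain_road = information_gain(samples, "near_road")
print(gain_water, gain_road)
```

The predicate with the larger gain would label the root of the binary tree, and the same computation would be repeated on each branch.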
8.7 SPATIAL CLUSTERING ALGORITHMS
Spatial clustering algorithms must be able to work efficiently with large multidimensional databases. In addition, they should be able to detect clusters of different shapes. Figure 8.8 illustrates what we mean. This figure shows clusters in a two-dimensional space. By looking at this figure it is easy to see that there are four different clusters, each of a fairly irregular shape. A good spatial clustering algorithm should be able to detect these four clusters even though the shapes are not regular, and some points in one cluster may actually be closer to some points of other clusters than to points in its own cluster. An algorithm that works using centroids and simple distance measures probably will not be able to identify the unusual shapes.
Other desirable features for spatial clustering are that the clusters found should be independent of the order in which the points in the space are examined and that the clusters should not be impacted by outliers. In Figure 8.8 the outliers in the lower right part of the figure should not be added to the larger cluster close to them.
Many of the clustering algorithms discussed in Chapter 5 may be viewed as spatial. In the following sections, we evaluate additional algorithms specifically targeted to spatial data.
FIGURE 8.8: Different shapes for spatial clusters.
8.7.1 CLARANS Extensions
The main memory assumption of CLARANS is unacceptable for large spatial databases. Two approaches to improve the performance of CLARANS by taking advantage of spatial indexing structures have been proposed [EKX95].
The first approach uses a type of sampling based on the structure of an R*-tree (an R-tree variant). To ensure the quality of the sampling, the R*-tree is used to guarantee that objects from all areas of the space are examined. The most central object found in each page of the R*-tree is used to represent that page in the search. The most central object is the object (of all objects stored on that page) with the smallest distance from it to the center of the page. Remember that the page is actually the MBR that contains all the objects in that page. So the center of that MBR can be defined as the geometric center of the bounding rectangle. CLARANS is then used to find clusters for these central objects. The k medoids found in this step represent the k clusters to be found for the database as a whole. Since the R*-tree clusters objects that are spatially near on a node in the tree (and thus page), it is reasonable to believe that this approach to sampling finds good medoids.
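Choosing the most central object of a page can be sketched as follows. The sketch assumes point objects and uses plain lists to stand in for R*-tree leaf pages; no actual R*-tree is built, and the function names are hypothetical.

```python
from math import dist

def mbr_center(objects):
    """Geometric center of the minimum bounding rectangle of a page's objects."""
    xs = [x for x, _ in objects]
    ys = [y for _, y in objects]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def most_central_object(objects):
    """Object on the page with the smallest distance to the MBR center."""
    center = mbr_center(objects)
    return min(objects, key=lambda o: dist(o, center))

# Each inner list plays the role of one leaf page.
pages = [
    [(0, 0), (2, 2), (1, 1.2)],
    [(10, 10), (14, 10), (11, 12)],
]
representatives = [most_central_object(page) for page in pages]
print(representatives)  # the sample on which CLARANS would then be run
```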
The second technique improves on the manner in which the cost for a medoid change is calculated (see Formula 5.10 in Chapter 5). Instead of examining the entire database, only the objects in the two affected clusters must be examined. A region query can be used to retrieve the needed objects. An efficient technique to retrieve only the objects in a given cluster is based on the construction of a polyhedron around the cluster medoid. The constructed polyhedron is called the Voronoi polyhedron or Voronoi diagram. This polyhedron is created by constructing perpendicular bisectors between pairs of medoids. This process is illustrated in Figure 8.9. This then defines the cluster. The objects within a Voronoi polyhedron are closer to the medoid of that polyhedron than to any other.
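Membership in a Voronoi polyhedron can be tested without constructing the polyhedron explicitly: a point lies on a medoid's side of the perpendicular bisector between two medoids exactly when it is at least as close to that medoid as to the other. The sketch below assumes point objects and scans a plain list; an actual implementation would issue a region query against the spatial index instead, and the function names are hypothetical.

```python
from math import dist

def in_voronoi_cell(point, medoid, other_medoids):
    """True when point is on medoid's side of every perpendicular bisector."""
    return all(dist(point, medoid) <= dist(point, other) for other in other_medoids)

def objects_in_cluster(objects, medoid, medoids):
    """Objects falling inside the Voronoi polyhedron of the given medoid."""
    others = [m for m in medoids if m != medoid]
    return [o for o in objects if in_voronoi_cell(o, medoid, others)]

medoids = [(0, 0), (10, 0)]
objects = [(1, 1), (4, 0), (6, 0), (9, 2)]
left = objects_in_cluster(objects, medoids[0], medoids)
right = objects_in_cluster(objects, medoids[1], medoids)
print(left, right)
```

When a medoid change is evaluated, only the objects returned for the two affected medoids would need to be re-examined.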
(a) Perpendicular bisector (b) Voronoi polyhedrons
FIGURE 8.9: Voronoi polyhedron.
8.7.2 SD(CLARANS)
Spatial dominant CLARANS [SD(CLARANS)] assumes that items to be clustered contain both spatial and nonspatial components. It first clusters the spatial components using CLARANS and then examines the nonspatial attributes within each cluster to derive a description of that cluster. For example, clustering of vegetation in remote areas may find that one area (cluster) is predominantly a forest of pine trees, while another contains massive open plains and grassy areas. SD(CLARANS) assumes that some learning tool, such as DBLEARN [HCC92], is used to derive the description of the cluster. This description can be viewed as a generalized tuple; that is, by using a concept hierarchy, the attribute values for the set of tuples in a cluster can be generalized to provide summary values at a higher level in the hierarchy. The learning tool performs this task. Algorithm 8.6 outlines the SD(CLARANS) algorithm. Note that it is a combination of CLARANS, DBLEARN, and the spatial-dominant algorithm discussed earlier in this chapter. It also assumes that in the first step an initial filtering of the data using a relevance criterion based on the nonspatial data is performed. Any clustering algorithm could be used in place of CLARANS in this algorithm. In our algorithm we show that the number of desired clusters is input. However, the authors of the original version propose an approach to determine the "most natural number of clusters" [NH94].
ALGORITHM 8.6
Input:
   D   // Data to be clustered
   k   // Number of desired clusters
Output:
   K   // Set of clusters
SD(CLARANS) algorithm:
   // Find set of tuples that satisfy selection criteria.
   d = select tuples from D based on nonspatial selection criteria;
   // Apply CLARANS to d based on spatial attributes.
   K = CLARANS(d);
   // Perform attribute generalization.
   for each k ∈ K do
      apply DBLEARN to the nonspatial attributes of the tuples in k;
In contrast to SD(CLARANS), nonspatial dominant CLARANS [NSD(CLARANS)] first looks at the nonspatial attributes. By performing a generalization on these attributes, a set of representative tuples, one representing each cluster, can be found. Then the algorithm determines which spatial objects go with which representative tuple to finish the clustering process.
8.7.3 DBCLASD
A recent spatial clustering algorithm based on DBSCAN has been proposed that is called DBCLASD (Distribution Based Clustering of LArge Spatial Databases). DBCLASD assumes that the items within a cluster are uniformly distributed and that points outside the cluster probably do not satisfy this restriction. Based on this assumption, the algorithm attempts to identify the distribution satisfied by the distances between nearest neighbors. As with DBSCAN, a cluster is created around a target element. Elements are added to a cluster as long as the nearest neighbor distance set fits the uniform distribution assumption. Candidate elements are determined and then are added to the current cluster if they satisfy a membership criterion. Candidate elements are determined by executing a region query using a circle of radius m centered around a point p that was just added to the cluster; m is chosen based on the following formula:

$$m > \sqrt{\frac{A}{\pi}\left(1 - \frac{1}{N^{1/N}}\right)} \qquad (8.5)$$
Here N is the number of points in the cluster and A is its area. The added points then become new candidates.
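The bound on the query radius can be evaluated directly. The sketch below assumes the bound takes the form m > sqrt((A/π)(1 − 1/N^(1/N))), which is the form given for DBCLASD in [XEKS98] as far as we can tell; the function name and sample values are hypothetical.

```python
from math import pi, sqrt

def query_radius_lower_bound(area, n_points):
    """Lower bound on the region-query radius m for a cluster of given area and size.

    Assumes the bound m > sqrt((A / pi) * (1 - 1 / N**(1/N))).
    """
    return sqrt(area / pi * (1 - 1 / n_points ** (1 / n_points)))

small = query_radius_lower_bound(area=100.0, n_points=30)
large = query_radius_lower_bound(area=400.0, n_points=30)
print(small, large)
```

For a fixed number of points the bound grows with the square root of the area, so quadrupling the area doubles the radius.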
The area of the cluster is estimated by using grids that enclose the cluster with a polygon. When a point is added to a cluster, the grid containing that point is added to the polygon. The closeness of the polygon to the real shape of the cluster depends on the size of the grids. If the grids are too large, the shape may not approximate the cluster well. If they are too small, the cluster could actually be estimated by disconnected polygons. The grid length is chosen to be the largest value in the nearest neighbor distance set.
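The grid-based area estimate can be sketched by counting the grid cells that contain at least one cluster point. This is a simplified illustration that assumes axis-aligned square cells anchored at the origin; the function name is hypothetical.

```python
from math import floor

def estimate_area(points, grid_length):
    """Approximate cluster area as (number of occupied grid cells) * (cell area)."""
    cells = {(floor(x / grid_length), floor(y / grid_length)) for x, y in points}
    return len(cells) * grid_length ** 2

cluster = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.5), (1.6, 1.7)]
fine = estimate_area(cluster, grid_length=1.0)
coarse = estimate_area(cluster, grid_length=2.0)
print(fine, coarse)
```

The two calls show the sensitivity to grid size described above: the coarser grid covers the same points with a single larger cell and reports a larger area.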
The algorithm DBCLASD is shown in Algorithm 8.7. Since the χ² test usually requires at least 30 elements, the authors assumed that 29 neighboring points are initially added to each cluster [XEKS98]. The last step expands a cluster based on the expected distribution of the nearest neighbor distance set of C using the candidates found in c. Each candidate is added one at a time to C, and the distribution of the nearest neighbor distance set is estimated. If it still has the desired distribution, the points in the neighborhood of this candidate are added to the set of candidates; otherwise the candidate is removed from C. This process continues until c is empty. The points in the neighborhood of a given point are determined based on the radius value stated above.
ALGORITHM 8.7
Input:
   D   // Spatial objects to be clustered
Output:
   K   // Set of clusters
DBCLASD algorithm:
   K = ∅;   // Initially there are no clusters.
   c = ∅;   // Initialize the set of candidates to be empty.
   for each point p in D do
      if p is not in a cluster, then
         create a new cluster C and put p in C;
         add neighboring points of p to C;
         for each point q in C do
            add the points in the neighborhood of q that have not been processed to c;
         expand C;
Performance studies show that DBCLASD successfully finds clusters of arbitrary shapes. Only points on the boundary of clusters are assigned to the wrong cluster.
8.7.4 BANG
The BANG approach uses a grid structure similar to a k-D tree. The structure adapts to the distribution of the items so that more dense areas have a larger number of smaller grids, while less dense areas have a few large ones. The grids (blocks) are then sorted based on their density, which is the number of items in the grid divided by its area. Based on the number of desired clusters, those grids with the greatest densities are chosen as the centers of the clusters. For each chosen grid, adjacent grids are added as long as their densities are less than or equal to that of the current cluster center.
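The density ordering and the choice of cluster centers can be sketched as follows. The sketch covers only these two steps, not the adaptive grid construction or the growth over adjacent grids, and the block representation is an assumption made for illustration.

```python
def cluster_centers(blocks, num_clusters):
    """blocks: list of (name, item_count, area). Return names of the densest blocks."""
    # Density is the number of items in the block divided by its area.
    by_density = sorted(blocks, key=lambda b: b[1] / b[2], reverse=True)
    return [name for name, _, _ in by_density[:num_clusters]]

blocks = [
    ("a", 40, 4.0),   # density 10
    ("b", 8, 1.0),    # density 8
    ("c", 5, 10.0),   # density 0.5
    ("d", 30, 2.0),   # density 15
]
centers = cluster_centers(blocks, num_clusters=2)
print(centers)
```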
8.7.5 WaveCluster
The WaveCluster approach to generating spatial clusters looks at the data as if they were signals. Like STING, WaveCluster uses grids. The complexity of generating clusters is O(n) and is not impacted by outliers. Unlike some approaches, WaveCluster can find arbitrarily shaped clusters and does not need to know the desired number of clusters. A set of spatial objects in an n-dimensional space is viewed as a signal. The boundaries of the clusters correspond to the high frequencies. Clusters themselves are low-frequency with high amplitude. Signal processing techniques can be used to find the low-frequency portions of the space. The authors propose that a wavelet transform be used to find the clusters. A wavelet transform is used as a filter to determine the frequency content of the signal. A wavelet transform of a spatial object decomposes it into a hierarchy of spatial images. They can be used to scale an image to different sizes.
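The low-frequency and high-frequency views of a gridded signal can be illustrated in one dimension. The sketch assumes a single Haar-style averaging and differencing step over per-cell point counts; the text does not say which wavelet the authors use, so this is an illustration of the idea rather than the WaveCluster transform itself.

```python
def haar_step(signal):
    """One level of a Haar-style transform: pairwise averages (low) and differences (high)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]    # smooth, low-frequency content
    high = [(a - b) / 2 for a, b in pairs]   # abrupt changes, high-frequency content
    return low, high

# Number of points falling in each of eight grid cells along one axis.
counts = [0, 0, 8, 8, 9, 1, 0, 0]
low, high = haar_step(counts)
print(low)   # large values mark dense (cluster) regions
print(high)  # a large value marks a sharp change, that is, a cluster boundary
```

The low-frequency output is also a half-size version of the original signal, which is the sense in which the transform yields images at different scales.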
8.7.6 Approximation
Once spatial clusters are found, it is beneficial to determine why the clusters exist; that is, what are the unique features of the clusters? Approximation can be used to identify the characteristics of clusters. This is done by determining the features that are close to the clusters. Clusters can be distinguished based on features unique to them or that are common across several clusters. Here, features are spatial objects such as rivers, oceans, schools, and so on. For example, some clusters may be unique partly because they are close to the ocean or close to good schools. It usually is assumed that features and clusters are represented by more complex closed polygons rather than by simple MBRs.
Aggregate proximity is defined as a measure of how close a cluster (or group of elements) is to a feature (or to an object in the space). This is not a measure of distance from the cluster boundary, but rather to the points in the cluster. Traditional data structures, such as R-trees and k-D trees, cannot be used to efficiently find the aggregate proximity relationships because they focus on a cluster (object) boundary as opposed to the objects in the cluster. The aggregate proximity distance may be measured by the sum of distances to all points in the cluster. The aggregate proximity relationship finds the k closest features to a cluster. The CRH algorithm has been proposed to identify these relationships [KN96]. C stands for encompassing circle, R for isothetic rectangle, and H for convex hull. These are defined as follows:
• Isothetic rectangle: MBR containing a set of points where the sides of the rectangle are parallel to the coordinate axes.
• Encompassing circle: Circle that contains a set of points; found by using the diagonal of the isothetic rectangle as its diameter.
• Convex hull: Minimum bounding convex shape containing a set of points.
What makes these shapes efficient is that given a set of n points, the first two shapes can be found in O(n) time and the last in O(n lg n) time. Each type of geometric shape is viewed as a bounding structure around a feature. These three types of encompassing geometric shapes are used as multiple levels of filtering of the possible close features. These are used in order of increasing accuracy and decreasing efficiency. The concept of using these three types of bounding polygons is shown in Figure 8.10, which illustrates