Operations needed to support spatial data mining involve those required for spatial databases. We review some of these in this section. In these discussions, we assume that A and B are spatial objects in a two-dimensional space. Each object can be viewed as consisting of a set of points in the space: $(x_a, y_a) \in A$ and $(x_b, y_b) \in B$.
As defined in [EFKS00], there are several topological relationships that can exist
between two spatial objects. These relationships are based on the ways in which two objects are placed in a geographic domain:
• Disjoint: A is disjoint from B if there are no points in A that are contained in B.
• Overlaps or intersects: A overlaps with B if there is at least one point in A that is also in B.
• Equals: A equals B if all points in the two objects are in common.
• Covered by or inside or contained in: A is contained in B if all points in A are in B. There may be points in B that are not in A.
• Covers or contains: A contains B iff B is contained in A.
While data mining tasks may not specifically address these relationships, the similarity between spatial objects certainly can be defined based partially on these relationships.
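The relationships listed above can be illustrated with a minimal sketch, assuming each spatial object is discretised into a finite set of grid points (a simplification, since real objects are continuous regions). The helper `cells` and the objects `A`, `B`, and `C` are hypothetical and exist only for this illustration.

```python
def disjoint(a, b):
    """True if no point of a is also in b."""
    return a.isdisjoint(b)

def overlaps(a, b):
    """True if at least one point of a is also in b."""
    return not a.isdisjoint(b)

def equals(a, b):
    """True if the two objects have all points in common."""
    return a == b

def contained_in(a, b):
    """True if every point of a is in b (b may have extra points)."""
    return a <= b

def contains(a, b):
    """A contains B iff B is contained in A."""
    return contained_in(b, a)

def cells(x0, y0, x1, y1):
    """Hypothetical helper: grid points of an axis-aligned rectangle."""
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}

A = cells(0, 0, 2, 2)
B = cells(0, 0, 4, 4)
C = cells(5, 5, 6, 6)

print(contained_in(A, B), contains(B, A), disjoint(A, C), overlaps(A, B))
```

Under this representation each relationship reduces to a standard set operation, which is one way to see why the relationships can feed into a similarity measure between objects.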
Based on the placement of the objects in the space, relationships with respect to direction may be defined. These usually are defined by adding the traditional map orientations to the space. Thus, we have relationships such as north, south, east, west, and so on. What makes these relationships difficult to identify is the irregular shape of spatial objects and the fact that they may overlap.
228 Chapter 8 Spatial Min ing
As mentioned in Chapter 3, the Euclidean and Manhattan measures are often used to measure the distance between two points. The distance between two spatial objects can be defined as extensions to these two traditional definitions:
• Minimum:
$$\mathrm{dis}(A, B) = \min_{(x_a, y_a) \in A,\ (x_b, y_b) \in B} \mathrm{dis}((x_a, y_a), (x_b, y_b)) \tag{8.1}$$
• Maximum:
$$\mathrm{dis}(A, B) = \max_{(x_a, y_a) \in A,\ (x_b, y_b) \in B} \mathrm{dis}((x_a, y_a), (x_b, y_b)) \tag{8.2}$$
• Average:
$$\mathrm{dis}(A, B) = \operatorname*{average}_{(x_a, y_a) \in A,\ (x_b, y_b) \in B} \mathrm{dis}((x_a, y_a), (x_b, y_b)) \tag{8.3}$$
• Center:
$$\mathrm{dis}(A, B) = \mathrm{dis}((x_{ca}, y_{ca}), (x_{cb}, y_{cb})) \tag{8.4}$$
where $(x_{ca}, y_{ca})$ is a center point for object A and $(x_{cb}, y_{cb})$ for B.
Note the similarity to distance measures used in clustering. In fact, you can think of a spatial object as a cluster of the points within it. The center points used for the last formula can be identified by finding the geometric center of the object. For example, if an MBR is used, the distance between objects could be found using the Euclidean distance between the centers of the MBRs for the two objects.
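The four object-to-object distances can be sketched as follows, assuming each object is given as a finite list of (x, y) points and that the point-to-point measure is Euclidean. The function names and the sample objects `A` and `B` are hypothetical; the center is taken to be the center of the MBR, as in the example in the text.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def pair_distances(a, b):
    """All point-to-point distances between objects a and b."""
    return [dist(p, q) for p, q in product(a, b)]

def min_distance(a, b):
    return min(pair_distances(a, b))

def max_distance(a, b):
    return max(pair_distances(a, b))

def avg_distance(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)

def mbr_center(points):
    """Center of the minimum bounding rectangle of a point set."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)

def center_distance(a, b):
    return dist(mbr_center(a), mbr_center(b))

# Two hypothetical square objects described by their corner points.
A = [(0, 0), (0, 2), (2, 0), (2, 2)]
B = [(5, 0), (5, 2), (7, 0), (7, 2)]

print(min_distance(A, B), max_distance(A, B), avg_distance(A, B), center_distance(A, B))
```

The pairwise versions cost time proportional to the product of the two point counts, whereas the center-based distance needs only one pass over each object, which is one reason MBR centers are attractive as an approximation.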
Spatial objects may be retrieved based on selection, aggregation, or join-type operations. A selection may be performed based on the spatial or nonspatial attributes. Retrieving based on spatial attributes could be performed using one of the spatial operators. A spatial join retrieves based on the relationship between two spatial objects.
8.4 GENERALIZATION AND SPECIALIZATION
The use of a concept hierarchy shows levels of relationships among data. When applied to spatial data characteristics, concept hierarchies allow the development of rules and relationships at different levels in the hierarchy. This is similar to the use of roll up and drill down operations in OLAP. We have also seen this idea used in generalized association rules. A similar idea is used in the generalization and specialization concepts found in machine learning. In these cases, however, the hierarchy is not necessarily related to spatial data. Spatial data mining techniques have involved both generalization and specialization type approaches.
8.4.1 Progressive Refinement
Because of the massive amounts of data found in spatial applications, approximate answers may be made before finding more accurate ones. The use of MBRs is a method to approximate the shape of an object. Quad trees, R-trees, and most other spatial indexing techniques use a type of progressive refinement. They estimate the shape of objects at higher levels in the tree structure, and lower-level entries provide more precise descriptions of the spatial objects. Progressive refinement can be viewed as filtering out data that are not applicable to a problem.
With progressive refinement, the hierarchical levels are based on spatial relationships. Example 8.1 illustrates the idea of progressive refinement. Here spatial relationships can be applied at a coarser (move up the hierarchy) or finer (move down the hierarchy) level.
EXAMPLE 8.1
Suppose that a computer science student wishes to identify apartments close to the SMU Computer Science and Engineering (CSE) Department. A given database listing available apartments in the Dallas metroplex will contain many apartments nowhere near the SMU campus. An initial filtering of the inappropriate elements can be made by finding apartments that are "generalized close" to the CSE Department. This can be performed at any of the levels in the concept hierarchy. Figure 8.6 shows the idea. The closest apartments to SMU probably would be in the Park Cities. By filtering out all apartments in all subtrees other than those for the Park Cities, apartments that are fairly close to SMU would be found. Suppose that a lower level in the concept hierarchy existed that included zip code. If apartments in the same zip code as the CSE Department were found, an even finer estimate of close could be used. This process quickly filters out apartments that could not possibly be used to answer the question. Here a coarser predicate is first used to filter out potential answers. This predicate can be recursively refined until the precise answers are found. Note that when looking at the concept hierarchy, the coarser predicates can be applied to the MBRs at the higher levels, while the finer predicates are applied at the lower levels.
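The filtering in Example 8.1 can be sketched roughly as below. Everything concrete here is an assumption made for illustration: the listings, the zip labels, the coordinates (kilometres relative to the department), and the function name `refine`.

```python
from math import dist

# Hypothetical listings: (name, region, zip_label, (x_km, y_km)).
APARTMENTS = [
    ("A1", "Park Cities", "zip-1", (0.5, 0.4)),
    ("A2", "Park Cities", "zip-1", (2.5, 2.0)),
    ("A3", "Park Cities", "zip-2", (1.0, 3.0)),
    ("A4", "Arlington", "zip-3", (-30.0, -10.0)),
    ("A5", "Fort Worth", "zip-4", (-50.0, -5.0)),
]

# Hypothetical description of the query target.
CSE = {"region": "Park Cities", "zip": "zip-1", "xy": (0.0, 0.0)}

def refine(apartments, target, max_km):
    """Apply coarse predicates first, then the precise distance test."""
    by_region = [a for a in apartments if a[1] == target["region"]]   # coarsest
    by_zip = [a for a in by_region if a[2] == target["zip"]]          # finer
    exact = [a for a in by_zip if dist(a[3], target["xy"]) <= max_km]  # precise
    return by_region, by_zip, exact

by_region, by_zip, exact = refine(APARTMENTS, CSE, max_km=1.0)
print([a[0] for a in by_region], [a[0] for a in by_zip], [a[0] for a in exact])
```

Only the candidates that survive the cheap region and zip tests reach the distance computation, which is the point of applying the coarser predicate first.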
8.4.2 Generalization
As with OLAP, generalization is driven by a concept hierarchy and can be viewed as the process of deriving information at a high level based on information found at lower levels. Concept hierarchies for spatial data can be both spatial and nonspatial. A spatial hierarchy is a concept hierarchy that shows the relationships between geographic areas. Figure 8.6 shows a spatial hierarchy. In Chapter 6, Figure 6.7 illustrated a nonspatial hierarchy.
[Figure 8.6: spatial hierarchy rooted at the Dallas-Fort Worth Metroplex, with regions Fort Worth, Dallas, Arlington, Mid-cities, Northern suburbs, and Park Cities, and lower-level areas Preston Hollow, M Streets, Lakewood, East, University Park, and Highland Park.]
Generalization can be performed using either of these two hierarchies. When the spatial data are generalized, the nonspatial data must be appropriately changed to reflect the nonspatial data associated with the new spatial area. Similarly, when the nonspatial data are generalized, the spatial data must be appropriately modified. Using these two types of hierarchies, generalization as applied to spatial data can be divided into two subclasses: spatial data dominant and nonspatial data dominant [LHO93]. Both of these subclasses can be viewed as a type of clustering. Spatial data dominant does the clustering based on spatial locations (so that objects close together are grouped), whereas nonspatial data dominant clusters by similarity of nonspatial attribute values. These approaches are referred to as
attribute-oriented induction because the generalization process is based on attribute values.
With spatial data dominant generalization, generalization is first applied to the spatial data, and then the related nonspatial attributes are modified accordingly. Generalization is performed until a threshold number of regions is reached. For example, determining the average rainfall in the southwestern United States could be done by finding the mean of the average rainfall values for all states shown to be in the Southwest by a spatial hierarchy. Thus, the spatial hierarchy determines which lower-level regions are found in the higher-level region being queried. Determining how to apply the generalization to the nonspatial data is, however, not always a straightforward aggregation operation. Determining the average rainfall in this case actually treats each state the same. However, a weighting by geographic area might be used to provide a more accurate average rainfall for the higher-level region being queried.
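The difference between treating every lower-level region the same and weighting by geographic area can be seen in a small sketch. The three regions and their rainfall and area figures are invented for illustration and are not real measurements.

```python
# Hypothetical rows: (region, average_rainfall_cm, area_km2).
REGIONS = [
    ("R1", 20.0, 300.0),
    ("R2", 40.0, 100.0),
    ("R3", 30.0, 200.0),
]

def simple_mean(rows):
    """Every region counts equally, regardless of its size."""
    return sum(rain for _, rain, _ in rows) / len(rows)

def area_weighted_mean(rows):
    """Each region contributes in proportion to its area."""
    total_area = sum(area for _, _, area in rows)
    return sum(rain * area for _, rain, area in rows) / total_area

print(simple_mean(REGIONS), area_weighted_mean(REGIONS))
```

With these made-up numbers the unweighted mean is 30.0, while the area-weighted mean is about 26.7 because the largest region is also the driest.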
An alternative approach is to generalize the nonspatial attribute values as well. Generalization is based on grouping of data. Adjacent regions are merged if they have the same generalized values for the nonspatial data. Suppose that instead of average rainfall values, we simply returned values that represented the southwestern cluster. We could assign values of heavy, medium, light, and so on to describe the rainfall rather than
providing actual numeric values. Algorithm 8.1 shows the spatial-dominant approach. A threshold that indicates the maximum number of regions may be given. Based on this threshold, the correct level in the hierarchy is chosen, and thus the number of regions is determined.
ALGORITHM 8.1
Input:
   D   // Spatial database
   H   // Spatial hierarchy
   C   // Concept hierarchy
   q   // Query
Output:
   R   // Rule that states the general characteristics requested
Spatial-data-dominant algorithm:
   d = set of data obtained from D based on selection criteria in q;
   Following the structure of H, combine data into regions until
      either the desired threshold number of regions is found
      or the requested level in H is obtained;
   for each region found do
      perform an attribute-oriented induction on the nonspatial attributes;
   Generate and output a rule that summarizes the results found;
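One possible reading of the spatial-data-dominant steps is sketched below. It is a simplified illustration, not the method of [LHO93]: the hierarchy fragment `PARENT`, the rainfall thresholds in `rainfall_label`, the input records, and the function name `spatial_dominant` are all assumptions, and regions are merged one hierarchy level at a time until the threshold is met.

```python
from collections import defaultdict

# Assumed fragment of a spatial hierarchy H: child region -> parent region.
PARENT = {
    "University Park": "Park Cities",
    "Highland Park": "Park Cities",
    "Lakewood": "Dallas",
    "M Streets": "Dallas",
    "Park Cities": "Metroplex",
    "Dallas": "Metroplex",
}

def rainfall_label(cm):
    """Hypothetical concept hierarchy: numeric rainfall -> coarse label."""
    if cm < 50:
        return "light"
    if cm < 100:
        return "medium"
    return "heavy"

def spatial_dominant(records, parent, max_regions):
    """records: (region, rainfall) pairs already selected by the query."""
    groups = defaultdict(list)
    for region, value in records:
        groups[region].append(value)
    # Spatial generalization: climb H one level at a time until the
    # number of regions is at or below the threshold.
    while len(groups) > max_regions:
        merged = defaultdict(list)
        for region, values in groups.items():
            merged[parent.get(region, region)].extend(values)
        if merged.keys() == groups.keys():  # nothing left to merge
            break
        groups = merged
    # Attribute-oriented induction on the nonspatial attribute of each region.
    return {region: rainfall_label(sum(v) / len(v)) for region, v in groups.items()}

# Invented rainfall values, used only to exercise the sketch.
records = [("University Park", 40), ("Highland Park", 60),
           ("Lakewood", 110), ("M Streets", 130)]
rules = spatial_dominant(records, PARENT, max_regions=2)
for region, label in rules.items():
    print(f"rainfall in {region} is {label}")
```

The sketch averages the raw values in each merged region without any area weighting, so it treats every lower-level record the same.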
Although not shown here, the nonspatial-data-dominant generalization technique works in a similar fashion. The first step in this algorithm is to retrieve the data based on the nonspatial selection criteria stated in the query. Needed attribute-oriented induction is then performed on the retrieved nonspatial data. The nonspatial concept hierarchies are consulted to perform this. During this step, nonspatial attribute values are generalized to higher-level values. These generalizations are higher-level summary values of the lower-level specific values. For example, if average temperature were generalized, several different average temperatures (or ranges) could be combined and labeled "hot." The third step is to perform spatial-oriented generalization. Here neighboring regions with the same (or similar) nonspatial generalized values are merged. This is done to reduce the number of regions returned in response to the query.
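The last two steps of the nonspatial-data-dominant approach (generalize the attribute, then merge neighboring regions that share a generalized value) might look like the following sketch. The temperature thresholds, the region names, the temperatures, and the adjacency lists are all hypothetical.

```python
def temperature_label(avg_temp_c):
    """Hypothetical concept hierarchy for average temperature."""
    if avg_temp_c >= 30:
        return "hot"
    if avg_temp_c >= 15:
        return "mild"
    return "cold"

def merge_regions(values, adjacency):
    """Merge neighboring regions that share the same generalized label.

    values: region -> average temperature
    adjacency: region -> list of neighboring regions (symmetric)
    """
    labels = {r: temperature_label(t) for r, t in values.items()}
    seen, merged = set(), []
    for start in values:
        if start in seen:
            continue
        seen.add(start)
        group, stack = [], [start]
        while stack:  # flood fill over neighbors carrying the same label
            region = stack.pop()
            group.append(region)
            for neighbor in adjacency.get(region, ()):
                if neighbor not in seen and labels[neighbor] == labels[start]:
                    seen.add(neighbor)
                    stack.append(neighbor)
        merged.append((labels[start], sorted(group)))
    return merged

# Four invented regions laid out in a row: r1 - r2 - r3 - r4.
values = {"r1": 33, "r2": 31, "r3": 20, "r4": 35}
adjacency = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"], "r4": ["r3"]}
print(merge_regions(values, adjacency))
```

Note that r4 is also "hot" but is not merged with r1 and r2, because only neighboring regions are combined.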
A negative of these approaches is that the hierarchy must be predefined by domain experts, and the quality of any data mining requests depends on the hierarchy provided.
The complexity of creating the hierarchies is O(n log n).
8.4.3 Nearest Neighbor
We introduced the idea of a nearest neighbor in Chapter 5 with respect to clustering. This idea of identifying objects that are close together is a common query type in spatial databases. The nearest neighbor distance is the minimum distance between an object and all other objects in the space.
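As a minimal sketch of the definition, assuming objects are reduced to single points and distance is Euclidean, the nearest neighbor can be found by brute force. The function name and the sample points are hypothetical.

```python
from math import dist

def nearest_neighbor(target, others):
    """Return (distance, point) for the point in others closest to target."""
    return min((dist(target, p), p) for p in others)

# Hypothetical query point and candidate points.
distance, point = nearest_neighbor((0, 0), [(3, 4), (1, 1), (-2, 0)])
print(distance, point)
```

This linear scan examines every candidate; in practice a spatial index such as an R-tree would typically be used to avoid looking at objects that cannot be the closest.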
8.4.4 STING
The STatistical INformation Grid-based method (STING) uses a hierarchical technique to divide the spatial area into rectangular cells similar to a quad tree. The spatial database is scanned once, and statistical parameters (mean, variance, distribution type) for each cell are determined. Each node in the grid structure summarizes the information about the items within it. By capturing this information, many data mining requests, including clustering, can be answered by examining the statistics created for the cells. Because the cells are rectangular, only clusters with vertical and horizontal boundaries are generated. However, the entire database need not be scanned after this statistical information is captured. This can be quite efficient when several data mining requests may be made against the data. Unlike the generalization and progressive refinement techniques, no predefined concept hierarchy must be provided.
The STING approach can be viewed as a type of hierarchical clustering technique. The first step is to create a hierarchical representation (like a dendrogram). The created tree successively divides the space into quadrants. The top level in the hierarchy consists of the entire space. The lowest level has one leaf for each of the smallest cells. The original proposal was for a cell to have four subcells (grids) at the next lowest level. The division of cells is identical to that performed for quad trees. In general, however, the approach would work with any hierarchical decomposition of the space. Figure 8.7 illustrates the nodes at the first three levels of the constructed tree.
The process to create the tree is shown in Algorithm 8.2. Each cell in the space corresponds to a node in the tree and is described with both attribute-independent (count) data and attribute-dependent (mean, standard deviation, minimum, maximum, distribution) data. As the data are loaded into the database, the hierarchy is created. Placement of an item into a cell is completely determined by its physical position. Algorithm 8.2 is divided into two parts. The first part creates the hierarchy and the second part fills in the values. Since the number of nodes in the tree is less than the number of items in the database, the complexity of STING BUILD is O(n).
FIGURE 8.7: Nodes in STING structure: (a) Level 1, (b) Level 2, (c) Level 3.
ALGORITHM 8.2
Input:
   D   // Data to be placed in the hierarchical structure
   k   // Number of desired cells at the lowest level
Output:
   T   // Tree
STING BUILD algorithm:
   // Create empty tree from top down.
   T = root node with data values initialized;   // Initially only root node
   i = 1;
   repeat
      for each node in level i do
         create 4 children nodes with initial values;
      i = i + 1;
   until 4^i = k;
   // Populate tree from bottom up.
   for each item in D do
      determine leaf node j associated with the position of the item;
      update values of j based on attribute values in item;
   i := log_4(k);
   repeat
      i := i - 1;
      for each node j in level i do
         update values of j based on attribute values in its 4 children;
   until i = 1;
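The two phases of the build (fill the leaves in one scan, then summarize upward) can be sketched as follows. This is an illustrative simplification rather than the original STING implementation: it assumes items lie in the unit square, numbers levels from 0 (the root) to `depth` (the leaves, so that k = 4**depth), and keeps only count, sum, minimum, and maximum per cell. The `Cell` class, `sting_build`, and the sample rents are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    n: int = 0
    total: float = 0.0
    lo: float = float("inf")
    hi: float = float("-inf")

    def add(self, value):
        """Fold one item's attribute value into this cell's statistics."""
        self.n += 1
        self.total += value
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def absorb(self, child):
        """Fold a child cell's statistics into this cell."""
        self.n += child.n
        self.total += child.total
        self.lo = min(self.lo, child.lo)
        self.hi = max(self.hi, child.hi)

    @property
    def mean(self):
        return self.total / self.n if self.n else None

def sting_build(items, depth):
    """items: iterable of (x, y, value) with 0 <= x, y < 1.

    Returns levels[0..depth]; levels[l] is a 2**l by 2**l grid of Cell.
    """
    levels = [[[Cell() for _ in range(2 ** l)] for _ in range(2 ** l)]
              for l in range(depth + 1)]
    side = 2 ** depth
    for x, y, value in items:            # one scan of the data fills the leaves
        levels[depth][int(y * side)][int(x * side)].add(value)
    for l in range(depth - 1, -1, -1):   # bottom up: parent from its 4 children
        for r in range(2 ** l):
            for c in range(2 ** l):
                for dr in (0, 1):
                    for dc in (0, 1):
                        levels[l][r][c].absorb(levels[l + 1][2 * r + dr][2 * c + dc])
    return levels

# Invented apartment rents at positions in the unit square.
items = [(0.1, 0.1, 800), (0.2, 0.3, 950), (0.8, 0.9, 1200), (0.6, 0.1, 700)]
levels = sting_build(items, depth=2)
root = levels[0][0][0]
print(root.n, root.lo, root.hi, root.mean)
```

Because each parent is computed only from its four children, the data are read once and the upward pass touches each cell a constant number of times.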
The actual STING algorithm is shown in Algorithm 8.3. The algorithm assumes that a query, q, that can be answered from the stored statistical information in the constructed tree, T, is requested. Such a query might be to find the range of price of apartments near SMU. The statistics (minimum and maximum) of the apartment rental prices for the appropriate cells should be determined. The cell that SMU is in would determine the actual values for those closest to SMU. In addition, the query might retrieve the information for the cells surrounding this cell or perhaps at the next highest level in the tree that contains the cell where SMU is located. The nearby cells could be determined using some distance function. The crucial concept here is that the appropriate cells must be determined and then the information from those cells in the constructed tree must be retrieved. A breadth-first tree traversal is used to examine the tree. However, a complete traversal of the tree is not performed. Only children of relevant nodes are examined. Here the concept of relevance is much like that with IR queries except that relevance is determined by estimating the proportion of the objects in that cell that meet the query conditions. The complexity of the STING algorithm is O(k), where k is the number of cells at the lowest level. Obviously, this is the space taken up by the tree itself. When used for clustering purposes, k would be the largest number of clusters created.
ALGORITHM 8.3
Input:
   T   // Tree
   q   // Query
Output:
   R   // Regions of relevant cells
STING algorithm:
   i = 1;
   repeat
      for each node in level i do
         determine if this cell is relevant to q and mark as such;
      i = i + 1;
   until all layers in the tree have been visited;
   identify neighboring cells of relevant cells to create regions of cells;
Calculating the likelihood that a cell is relevant to a query is based on the percentage of the objects in the cell that satisfy the query constraints. Using a predefined confidence interval, if this proportion is high enough, then that cell is labeled as relevant. The statistical information associated with these relevant cells is used to answer the query. If this approximate answer is not good enough, then the associated relevant objects in the database may have to be examined to provide a more exact response. The cells found by STING approximate those found by DBSCAN. Cells that are found to be close enough to relevant cells are included in the regions of cells that are found by the algorithm.
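The top-down relevance test can be sketched under some loud assumptions. Cells are plain dictionaries built by the hypothetical helpers `leaf` and `parent`; the query asks for cells where a sufficient share of rents is at or below a limit; and the share is estimated by assuming values are spread uniformly between the cell minimum and maximum, which stands in for the distribution information that STING stores. The final step of grouping neighboring relevant cells into regions is omitted.

```python
def estimated_fraction(cell, limit):
    """Estimated share of items with value <= limit (uniform assumption)."""
    if cell["n"] == 0:
        return 0.0
    if cell["hi"] == cell["lo"]:
        return 1.0 if cell["lo"] <= limit else 0.0
    frac = (limit - cell["lo"]) / (cell["hi"] - cell["lo"])
    return max(0.0, min(1.0, frac))

def sting_query(root, limit, threshold):
    """Breadth-first search that expands only the children of relevant cells."""
    level, relevant_leaves = [root], []
    while level:
        next_level = []
        for cell in level:
            if estimated_fraction(cell, limit) < threshold:
                continue                      # not relevant: prune this subtree
            if cell["children"]:
                next_level.extend(cell["children"])
            else:
                relevant_leaves.append(cell["name"])
        level = next_level
    return relevant_leaves

def leaf(name, n, lo, hi):
    """Hypothetical helper: a lowest-level cell with stored statistics."""
    return {"name": name, "n": n, "lo": lo, "hi": hi, "children": []}

def parent(name, children):
    """Hypothetical helper: summarize non-empty children into a parent cell."""
    return {"name": name, "n": sum(c["n"] for c in children),
            "lo": min(c["lo"] for c in children),
            "hi": max(c["hi"] for c in children),
            "children": children}

# Invented rent statistics for four quadrants.
root = parent("all", [leaf("NW", 10, 600, 900), leaf("NE", 5, 1500, 2000),
                      leaf("SW", 8, 700, 1000), leaf("SE", 4, 1100, 1300)])
print(sting_query(root, limit=1000, threshold=0.25))
```

A cell that fails the test is never expanded, so a subtree can be skipped even if one of its descendants would have qualified; this is the sense in which the answer is approximate.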