SPATIAL CLASSIFICATION ALGORITHMS

8.7 SPATIAL CLUSTERING ALGORITHMS

8.8 EXERCISES

8.9 BIBLIOGRAPHIC NOTES

INTRODUCTION

Spatial data are data that have a spatial or location component. Spatial data can be viewed as data about objects that themselves are located in a physical space. This may be implemented with one or more specific location attributes, such as address or latitude/longitude, or may be more implicitly included, such as by a partitioning of the database based on location. In addition, spatial data may be accessed using queries containing spatial operators such as near, north, south, adjacent, and contained in. Spatial data are stored in spatial databases that contain the spatial data and nonspatial data about objects. Because of the inherent distance information associated with spatial data, spatial databases are often stored using special data structures or indices built using distance or topological information. As far as data mining is concerned, this distance information provides the basis for needed similarity measures.

Spatial data are required for many current information technology systems. Geographic information systems (GIS) are used to store information related to geographic locations on the surface of the Earth. This includes applications related to weather, community infrastructure needs, disaster management, and hazardous waste. Data mining activities include prediction of environmental catastrophes. Biomedical applications, including medical imaging and illness diagnosis, also require spatial systems.

Spatial mining, often called spatial data mining or knowledge discovery in spatial databases, is data mining as applied to spatial databases or spatial data. Some of the applications for spatial data mining are in the areas of GIS, geology, environmental science, resource management, agriculture, medicine, and robotics. Many of the techniques discussed in previous chapters are applied directly to spatial data, but there also are new techniques and algorithms developed specifically for spatial data mining.

Chapter 8 Spatial Mining

We investigate these issues in this chapter. Before investigating spatial mining, we first provide a brief introduction to spatial data and databases.

8.2 SPATIAL DATA OVERVIEW

Accessing spatial data can be more complicated than accessing nonspatial data. There are specialized operations and data structures used to access spatial data.

8.2.1 Spatial Queries

Because of the complexity of spatial operations, much work has been performed to examine spatial query processing and its optimization.

A traditional selection query accessing nonspatial data uses the standard comparison operations: >, <, ≤, ≥, ≠. A spatial selection is a selection on spatial data that may use other comparison operations. The types of spatial comparators that could be used include near, north, south, east, west, contained in, and overlap or intersect. The following are examples of several spatial selection queries:

• Find all houses near Mohawk Elementary School.

• Find the nearest fire station to 9631 Moss Haven Drive in Dallas.

A special join operation applied to two spatial relations is called a spatial join. In some ways, a spatial join is like a regular relational join in that two records are joined together if they have features in common. With a traditional join, two records must have attributes in common that satisfy a predefined relationship (such as equality in an equijoin). With a spatial join, the relationship is a spatial one. The type of relationship is based on the type of spatial feature. For example, the nearest relationship may be used for points, while the intersecting relationship is used for polygons.
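As a minimal sketch of a spatial join on points, the following pairs each record of one relation with its nearest record in another. The relation contents, the names `houses` and `stations`, and the naive nested-loop strategy are all assumptions made for illustration; a real system would typically use a spatial index rather than comparing every pair.

```python
import math

def nearest_join(left, right):
    """Join each (name, (x, y)) record in `left` with its nearest record in `right`.

    Naive nested-loop version: every left record is compared with every right record.
    """
    pairs = []
    for left_name, left_point in left:
        right_name, _ = min(right, key=lambda rec: math.dist(left_point, rec[1]))
        pairs.append((left_name, right_name))
    return pairs

# Hypothetical relations: names and coordinates are made up.
houses = [("house_a", (0.0, 0.0)), ("house_b", (5.0, 5.0))]
stations = [("station_1", (1.0, 0.0)), ("station_2", (4.0, 6.0))]

joined = nearest_join(houses, stations)
print(joined)
```

Here the join predicate is the spatial relationship "nearest" instead of attribute equality, which is the only difference from the structure of a conventional join.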

In GIS applications, it is common to have different views of the same geographic area. For example, city developers must be able to see where infrastructure facilities are located, including streets, power lines, phone lines, and sewer lines. At another level, they might be interested in actual elevations, building locations, and rivers. Each of these types of information could be maintained in separate GIS files. Merging these disparate data can be performed using a special operator called a map overlay.

A spatial object usually is described with both spatial and nonspatial attributes. Some sort of location type attribute must be included. The location attribute could identify a precise point, such as a latitude/longitude pair, or it may be more logical, such as a street address or zip code. Often, different spatial objects are identified by different locations, and some sort of translation between one attribute and the other is needed to perform spatial operations between the different objects. As in SAND, the nonspatial attributes may be stored in a relational database, while each spatial attribute is stored in some spatial data structure. Each tuple in the relation represents the spatial object, and a link to the spatial data structure is stored in the corresponding position in the nonspatial tuple.

Many basic spatial queries can assist in data mining activities. Some of these queries include:

• A region query or range query is a query asking for objects that intersect a given region specified in the query.

• A nearest neighbor query asks to find objects that are close to an identified object.

• A distance scan finds objects within a certain distance of an identified object, but the distance is made increasingly larger.
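The three queries above can be sketched for point objects as follows. This is an illustrative brute-force version under stated assumptions: objects are plain (x, y) tuples, the query region is an axis-aligned rectangle, and the function names and the list of radii are hypothetical choices, not part of any particular system.

```python
import math

def region_query(points, xmin, ymin, xmax, ymax):
    """Return the points that fall inside the query rectangle (borders included)."""
    return [p for p in points if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax]

def nearest_neighbor(points, target):
    """Return the point closest to `target`, ignoring `target` itself."""
    return min((p for p in points if p != target),
               key=lambda p: math.dist(p, target))

def distance_scan(points, target, radii):
    """Yield (radius, newly reached points) as the search radius grows."""
    seen = set()
    for radius in sorted(radii):
        found = [p for p in points
                 if p != target and p not in seen
                 and math.dist(p, target) <= radius]
        seen.update(found)
        yield radius, found

points = [(0, 0), (1, 1), (2, 2), (5, 5)]
in_region = region_query(points, 0, 0, 2, 2)
closest = nearest_neighbor(points, (0, 0))
rings = list(distance_scan(points, (0, 0), [1.5, 3.0, 8.0]))
```

Each step of the distance scan reports only the objects first reached at that radius, which is one reasonable reading of "the distance is made increasingly larger".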

All of these queries can be used to assist in clustering or classification.

8.2.2 Spatial Data Structures

Because of the unique features of spatial data, there are many data structures that have been designed specifically to store or index spatial data. In this section, we briefly examine some of the more popular data structures. Many of these structures are based on extensions to conventional indexing approaches, such as B-trees or binary search trees.

Nonspatial database queries using traditional indexing structures, such as a B-tree, access the data using an exact match query. However, spatial queries may use proximity measures based on relative locations of spatial objects. To efficiently perform these spatial queries, it is advisable that objects close in space be clustered on disk. To this end, the geographic space under consideration may be partitioned into cells based on proximity, and these cells would then be related to storage locations (blocks on disk). The corresponding data structure would be constructed based on these cells.

A common technique used to represent a spatial object is by the smallest rectangle that completely contains that object, the minimum bounding rectangle (MBR). We illustrate the use of MBRs by looking at a lake. Figure 8.1(a) shows the outline of a lake. If we orient this lake in a traditional coordinate system with the horizontal axis representing east-west and the perpendicular axis north-south, we can put this lake in a rectangle (with sides parallel to the axes) that contains it. Thus, in Figure 8.1(b) we show an MBR that can be used to represent this lake. Alternatively, in Figure 8.1(c) we could represent it by a set of smaller rectangles. This option can provide a closer fit to the actual object, but it requires multiple MBRs. An MBR can easily be represented by the coordinates for two nonadjacent vertices. So we could represent the MBR in Figure 8.1(b) by the pair $\{(x_1, y_1), (x_2, y_2)\}$. There are other ways to store the MBR values, and the orientation of the MBRs need not be with the axes.
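To make the two-vertex representation concrete, the sketch below computes an axis-aligned MBR from an object's outline and tests whether two MBRs intersect. The outline coordinates for the lake are invented for illustration, and the function names are hypothetical.

```python
def mbr(vertices):
    """Return the axis-aligned MBR of an outline as ((x1, y1), (x2, y2)).

    (x1, y1) is the lower left vertex and (x2, y2) the upper right vertex,
    i.e. two nonadjacent corners of the rectangle.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys)), (max(xs), max(ys))

def mbrs_intersect(a, b):
    """Return True if two axis-aligned MBRs overlap or touch."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Made-up outline standing in for the lake.
lake = [(2, 1), (6, 2), (7, 5), (4, 7), (1, 4)]
lake_mbr = mbr(lake)
print(lake_mbr)
```

Because the MBR is only an approximation of the object, an intersection test on MBRs can report a match where the underlying shapes do not actually touch; it is a cheap filter, not a final answer.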

FIGURE 8.1: (a) Lake (b) MBR for lake (c) Smaller MBRs for lake.

We use the triangle shown in Figure 8.2(a) as a simple spatial object. In Figure 8.2(b) we show an MBR for the triangle. Spatial indices can be used to assist in spatial data mining activities. One benefit of the spatial data structures is that they cluster objects based on location. This implies that objects that are close together in the n-dimensional space tend to be stored close together in the data structure and on disk. Thus, these structures could be used to reduce the processing overhead of an algorithm by limiting its search space. In effect, filtering is performed as you traverse down a tree. In addition, spatial queries can be more efficiently answered by use of these structures.

(a) Triangle (b) MBR for triangle

FIGURE 8.2: Spatial object example.

(a) Representing triangle with quadrants (b) Quad tree

FIGURE 8.3: Quad tree example.

Quad Tree. One of the original data structures proposed for spatial data is that of a quad tree. A quad tree represents a spatial object by a hierarchical decomposition of the space into quadrants (cells). This process is illustrated in Figure 8.3(a) using the triangle in Figure 8.2. Here the triangle is shown as three shaded squares. The spatial area has been divided into two layers of quadrant divisions. The number of layers needed depends on the precision desired. Obviously, the more layers, the more overhead is required for the data structure. Each level in the quad tree corresponds to one of the hierarchical layers. Each of the four quadrants at that layer has a related pointer to a node at the next level if any of the lowest level quadrants are shaded. We label the quadrants at each level in a counterclockwise direction starting at the upper right quadrant (as shown in the figure). Square 0 is the entire area. Square 1 is the upper right at level one. Square 15 is the square in the lower left corner at the second level. In this figure, the triangle is represented by squares 12, 13, and 14 because it intersects these three regions. The quad tree for this triangle is shown in Figure 8.3(b). Only nodes with nonempty quadrants are shown. Thus, there are no nodes for quadrants 1 and 4 and their subquadrants.
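The labeling scheme just described can be sketched in a few lines. We assume here that the whole area is the unit square with the origin at the lower left, and that labels follow a breadth-first numbering in which the children of square n are 4n + 1 through 4n + 4; this numbering is our reading of the labels quoted in the text (0, 1, 12 to 15), and the function names are hypothetical.

```python
def quadrant_label(x, y, depth):
    """Label of the square containing (x, y) after `depth` quadrant divisions.

    Children are numbered counterclockwise from the upper right:
    1 = upper right, 2 = upper left, 3 = lower left, 4 = lower right.
    """
    label = 0                      # square 0 is the entire area
    x0, y0, size = 0.0, 0.0, 1.0   # lower left corner and side of current square
    for _ in range(depth):
        half = size / 2
        right = x >= x0 + half
        upper = y >= y0 + half
        if upper and right:
            child = 1
        elif upper:
            child = 2
        elif not right:
            child = 3
        else:
            child = 4
        label = 4 * label + child
        if right:
            x0 += half
        if upper:
            y0 += half
        size = half
    return label

def nonempty_nodes(leaf_labels):
    """Return every quad tree node on a path from the root to a shaded square."""
    nodes = set()
    for label in leaf_labels:
        while True:
            nodes.add(label)
            if label == 0:
                break
            label = (label - 1) // 4   # parent of square `label`
    return sorted(nodes)

print(quadrant_label(0.1, 0.1, 2))
print(nonempty_nodes([12, 13, 14]))
```

Under this numbering, the shaded squares 12, 13, and 14 have parents 2 and 3 only, which agrees with the statement that quadrants 1 and 4 contribute no nodes.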

MBRs are similar to the quadrants in the quad tree except that they do not have to be of identical sizes. If hierarchies of MBRs exist, they do not have to be regular as in the quadrant decompositions.

R-Tree. One approach to indexing spatial data represented as MBRs is an R-tree. Each successive layer in the tree identifies smaller rectangles. In an R-tree, cells may actually overlap. An object is represented by an MBR that is located within one cell. Basically, a cell is the MBR that contains the related set of objects (or MBRs) at a lower level of decomposition. Each level of decomposition is identified with a layer in the tree. As spatial objects are added to the R-tree, it is created and maintained by algorithms similar to those found for B-trees. The size of the tree is related to the number of objects. Looking at a space with only the basic triangle, as seen in Figure 8.2, a tree with only a root node would be created. We illustrate a more complicated R-tree in Figure 8.4. Here there are five objects represented by the MBRs D, E, F, G, and H. The entire geographic space is labeled A and is shown as the root of the tree in Figure 8.4(b). Three of the objects (D, E, F) are contained in an MBR labeled B, while the remaining two (G, H) are in MBR C.

Algorithms to perform spatial operators using an R-tree are relatively straightforward. Suppose we wished to find all objects that intersect a given object. Representing the query object as an MBR, we can search the upper levels of the R-tree to find only those cells that intersect the query MBR. Those subtrees that do not intersect the query MBR can be discarded.
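The pruning idea can be sketched with a small tree shaped like the one described for Figure 8.4 (root A, cells B and C, objects D through H). All coordinates below are invented, since the figure gives none, and the `Node` class is a simplified stand-in: a real R-tree stores several entries per node and handles insertion and node splitting, which this sketch omits.

```python
class Node:
    """A tree node holding an MBR; nodes without children stand for objects."""

    def __init__(self, name, mbr, children=()):
        self.name = name
        self.mbr = mbr                  # ((xmin, ymin), (xmax, ymax))
        self.children = list(children)

def intersects(a, b):
    """Return True if two axis-aligned MBRs overlap or touch."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def search(node, query):
    """Return names of objects whose MBR intersects `query`, pruning subtrees."""
    if not intersects(node.mbr, query):
        return []                       # discard this whole subtree
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits.extend(search(child, query))
    return hits

# Hypothetical coordinates for the five objects and two cells.
b = Node("B", ((0, 5), (6, 10)), [
    Node("D", ((1, 8), (2, 9))),
    Node("E", ((3, 6), (4, 7))),
    Node("F", ((5, 8), (6, 9))),
])
c = Node("C", ((5, 0), (10, 5)), [
    Node("G", ((6, 1), (7, 2))),
    Node("H", ((8, 3), (9, 4))),
])
root = Node("A", ((0, 0), (10, 10)), [b, c])

result = search(root, ((2.5, 5.5), (5.5, 8.5)))
print(result)
```

With this query, cell C fails the intersection test at the second level, so G and H are never examined.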

FIGURE 8.4: (a) Partitioning with MBRs (b) R-tree.

k-D Tree. A k-D tree was designed to index multiattribute data, not necessarily spatial data. The k-D tree is a variation of a binary search tree where each level in the tree is used to index one of the attributes. We illustrate the use of the k-D tree assuming a two-dimensional space. Each node in the tree represents a division of the space into two subsets based on the division point used. In addition, the division alternates between the two axes.

(a) Divide and conquer partitioning (b) k-D tree

FIGURE 8.5: k-D tree example.

In Figure 8.5 we show a k-D tree using the same data we used for the R-tree. As with the R-tree, each lowest level cell has only one object in it. However, the divisions are not made using MBRs. Initially, the entire region is viewed as one cell and thus the root of the k-D tree. The area is divided first along one dimension and then along another dimension until each cell has only one object in it. In this example, we see that the entire region, A, is first divided into two cells (B, C) along the horizontal axis. Then, looking at B, we see that it is divided into D and E. D is finally divided into H and I.
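The alternating division can be sketched as a recursive build over two-dimensional points. This is the common variant that stores one point at each node and splits at the median, which is an assumption on our part; the text does not say how division points are chosen. The sample points are made up.

```python
def build_kd(points, depth=0):
    """Build a 2-D k-D tree as nested dicts, alternating the split axis by level."""
    if not points:
        return None
    axis = depth % 2                        # 0 = x axis, 1 = y axis
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                 # median point is the division point
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kd(ordered[:mid], depth + 1),
        "right": build_kd(ordered[mid + 1:], depth + 1),
    }

# Made-up sample points.
sample = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd(sample)
print(tree["point"], tree["left"]["point"], tree["right"]["point"])
```

The root divides the space on x, its children divide their halves on y, and so on, so each level of the tree indexes one attribute.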

8.2.3 Thematic Maps

Thematic maps illustrate spatial objects by showing the distribution of attributes or themes. Each map shows one (or more) of the thematic attributes. These attributes describe the important nonspatial features of the associated spatial object. For example, one thematic map may show elevation, average rainfall, and average temperature. Raster-based thematic maps represent the spatial data by relating pixels to attribute values of the data. For example, in a map showing elevation, the color of the pixel can be associated with the elevation of that location. A vector-based thematic map represents objects by a geometric structure (such as their outline or MBR). In addition, the object then has the thematic attribute values.
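The difference between the two representations can be illustrated with a toy example. The elevation thresholds, color names, grid values, and the dictionary layout of the vector object are all hypothetical choices made for this sketch.

```python
def elevation_to_color(elevation_m):
    """Map an elevation in meters to a display color (thresholds are made up)."""
    if elevation_m < 100:
        return "green"
    if elevation_m < 500:
        return "yellow"
    return "brown"

# Raster-based: every pixel carries the attribute value, shown here as a color.
elevation_grid = [[20, 150],
                  [480, 900]]
raster_map = [[elevation_to_color(e) for e in row] for row in elevation_grid]

# Vector-based: one geometric structure (here an MBR) plus thematic attributes.
vector_object = {
    "name": "lake",
    "mbr": ((1, 1), (7, 7)),
    "themes": {"elevation_m": 230},
}
print(raster_map)
```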

8.2.4 Image Databases

In image databases the data are stored as pictures or images. These databases are used in many applications, including medicine and remote sensing.

Some early classification work performed using large image databases looked at ways to classify astronomical objects. One of the applications of this work is to identify volcanos on Venus from images taken by the Magellan spacecraft [FWD93]. This system consisted of three parts: data focusing, feature extraction, and classification. The focusing component determines which areas of the images are most likely to contain volcanos by comparing the intensity of a central point of a region with that of the background. During the second phase, the important features of these areas are identified, extracted, and stored. Finally, these features are classified by a decision tree, which is created using ID3 and training examples provided by domain experts. An accuracy of 80% was achieved.

A related work also used decision trees to classify stellar objects [FS93]. As with the volcano work, the first two steps were to identify areas of the images of interest and then to extract information about these areas. Multiple trees were created, and from these, sets of rules were generated for classification. Accuracy was found to be approximately 94%. When compared to several neural network approaches, the decision tree/rules approach was found to be much more accurate. Both of these studies found the need to normalize the extracted features to compensate for differences between different images. For example, two images could differ based on the angle at which the image was taken.
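The text does not say how either study normalized its features. As one possible illustration only, the sketch below rescales a feature within each image to zero mean and unit variance (a z-score), so that a uniform brightness offset between two images no longer affects the values. The function name and the sample intensities are hypothetical.

```python
import statistics

def normalize_per_image(values):
    """Rescale one image's feature values to zero mean and unit variance.

    This z-score scheme is an assumed example, not the method of the cited studies.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]    # constant feature carries no contrast
    return [(v - mean) / std for v in values]

# Two made-up images of the same scene, the second uniformly brighter.
dim = normalize_per_image([10, 20, 30])
bright = normalize_per_image([110, 120, 130])
print(dim, bright)
```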
