International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 5, Issue 10, October 2015)
Spatial Data Mining Techniques
Rushabh Shah, Arib Patel, Lynette D'Mello
Computer Engineering Department, Dwarkadas J. Sanghvi College of Engineering, India
Abstract— Spatial data mining is the process of discovering interesting, previously unknown and useful patterns from large spatial datasets. Because these datasets are so large, techniques are required to aid humans in extracting patterns and useful information from them. Spatial datasets also store data differently from traditional datasets: their inputs include extended objects such as points, lines and polygons. This paper highlights the techniques used for mining these datasets, namely Clustering and Outlier Detection, Association and Co-Location, Classification, and Trend-Detection, and overviews the individual algorithms within each technique.
Keywords— Spatial Data Mining, patterns, clusters, Clustering and Outlier Detection, Association and Co-location, Classification, Trend-Detection.
I. INTRODUCTION
The extensive, world-wide use of spatial datasets has led to the need for generating spatial knowledge. The complexity of dealing with spatial datasets has rendered traditional methods inadequate, which calls for specialized spatial data mining techniques for extracting and discovering useful information. Coherent tools for extracting information from spatial datasets are very significant to organizations that make decisions based on spatial data, including NASA, the National Imagery and Mapping Agency and many more.
II. SPATIAL DATA MINING TECHNIQUES
A. Clustering and Outlier Detection
Spatial clustering is the process of grouping spatial objects with similar attributes together. Objects within the same cluster are similar to one another, while objects in different clusters are dissimilar. Clustering algorithms are generally divided into four categories: the partitioning method, the hierarchical method, the density-based method and the grid-based method.
1. Partitioning Method.
The partitioning method segregates the cluster objects such that the total deviation of each object from its cluster centroid is minimized. The most commonly used algorithm in this method is the K-means algorithm.
The K-means algorithm consists of five basic steps:
1. Decide the number of clusters k into which the data will be grouped.
2. Select k points at random as initial cluster centers.
3. Compute the distance of each object to the cluster centers using the Euclidean distance formula.
4. Assign each object to the cluster whose center is nearest, then recompute each center as the centroid of its cluster.
5. Repeat steps 3 and 4 until the same assignments are obtained in consecutive rounds.
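The steps above can be sketched as a minimal Python illustration (the sample dataset, the choice of k and the iteration cap are assumptions for demonstration, not part of the original description):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means on tuples of coordinates; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: random initial centers
    labels = [0] * len(points)
    for _ in range(max_iters):
        # steps 3-4: assign each point to its nearest centroid
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # step 4 (continued): recompute each centroid as the mean of its cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        # step 5: stop when assignments no longer change between rounds
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels
```

Minimizing each object's distance to its cluster centroid is exactly the "total deviation" criterion the partitioning method describes.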
2. Hierarchical Method.
The hierarchical method of clustering combines data objects into clusters and clusters into larger clusters. The outcome of hierarchical clustering is a tree, usually called a dendrogram, which shows the merging process and the intermediate clusters. The main hierarchical clustering algorithms are BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), ROCK (RObust Clustering using linKs), Chameleon and CURE (Clustering Using REpresentatives), of which BIRCH is the most commonly used.
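The bottom-up merging that underlies all of these algorithms can be illustrated with a naive agglomerative sketch (this is a generic single-linkage illustration, not BIRCH, ROCK, Chameleon or CURE themselves; the dataset and the single-linkage distance are assumptions):

```python
import math

def single_linkage(points, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]           # start with one cluster per object
    while len(clusters) > num_clusters:
        best = None
        # find the pair of clusters with the smallest inter-cluster distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))    # merge cluster j into cluster i
    return clusters
```

Recording the sequence of merges performed by the loop is precisely what the dendrogram visualizes.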
BIRCH:
BIRCH algorithm mainly involves four phases: 1) Loading 2) Condensation 3) Global Clustering and 4) Cluster Refining.
Phase 1: Loading data into memory by building CF tree
The task of this phase is to scan all the data and build a CF-tree that reflects the clustering information of the dataset as precisely as the memory limits allow.
Phase 2: Condensation of the initial CF-tree into a smaller tree
This phase is optional. Phase 3 uses existing global or semi-global clustering algorithms, which may create a gap between the results of Phase 1 and Phase 3; Phase 2 is therefore sometimes required to bridge this gap. Its task is to scan the leaf nodes of the initial CF-tree and construct a smaller CF-tree by eliminating outliers and grouping denser sub-clusters into larger ones.
Phase 3: Global Clustering
In this phase, all the leaf entries are clustered using existing clustering algorithms.
Phase 4: Cluster Refining
After Phase 3, we obtain a set of clusters that captures the major distribution patterns in the data. Using the centroids produced by Phase 3 as seeds, this phase redistributes the data points to their closest seed to obtain a new set of clusters. This helps to ensure that all copies of a given data point go to the same cluster and also allows points to migrate to the cluster they truly belong to.
ROCK (RObust Clustering using linKs):
ROCK is an algorithm based on the concept of links: the similarity of two clusters is measured by the number of points from the different clusters that have neighbors in common. The algorithm involves three steps:
1) Draw a random sample from the dataset.
2) Cluster the sample using links.
3) Label the remaining data on disk.
In the first step, a random sample is drawn from the dataset. A clustering algorithm that employs links is then applied to the sample points. Lastly, the clusters formed from the sample points are used to assign the remaining points on disk to the appropriate clusters.
CURE (Clustering Using REpresentatives):
CURE is an algorithm that partitions the dataset. A constant number of well-scattered points is chosen from each cluster; these chosen points are then shrunk towards the centroid of the cluster and used as its representatives. The partitions are first partially clustered. After clustering is done, instead of a single centroid, the multiple representative points of each cluster are used to label the remainder of the dataset.
CHAMELEON:
The Chameleon algorithm uses dynamic modeling: it compares the similarity of two clusters with a dynamic model. The algorithm consists of two phases: 1) using graph partitioning, partition the data points into many small sub-clusters; 2) recombine sub-clusters with an agglomerative algorithm, merging two sub-clusters only if the resulting cluster shares certain properties (interconnectivity and closeness) with the constituent clusters.
3. Density-Based Method.
The density-based method treats clusters as dense regions of objects separated by low-density regions, which are regarded as noise.
This method is used to filter out noise and outliers, and unlike the partitioning method, it can form clusters of arbitrary shape. The algorithm used in this method is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The algorithm uses two parameters: Eps (the maximum radius of a neighborhood) and MinPts (the minimum number of points required in the Eps-neighborhood of a point).
Algorithm:
Step 1: Randomly select a point p.
Step 2: Retrieve all points density-reachable from p with respect to Eps and MinPts.
Step 3: If p is a core point, a cluster is formed. If p is not a core point, no points are density-reachable from p, and DBSCAN moves on to the next point of the dataset.
Step 4: Continue until all the points in the dataset have been processed.
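These steps can be sketched as a compact Python illustration (the example coordinates, the Euclidean neighborhood, and the convention of labeling noise as -1 are assumptions for demonstration):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)              # None = not yet processed
    cluster = -1

    def neighbors(i):
        # Eps-neighborhood of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue                           # step 4: skip processed points
        seeds = neighbors(i)
        if len(seeds) < min_pts:               # step 3: p is not a core point
            labels[i] = -1                     # provisionally noise
            continue
        cluster += 1                           # step 3: core point starts a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                           # step 2: expand density-reachable set
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # noise reclaimed as a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:             # j is also a core point
                queue.extend(nb)
    return labels
```

Because clusters grow by chaining core points, arbitrarily shaped clusters emerge naturally, unlike with the centroid-based partitioning method.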
4. Grid-Based Method.
The grid-based method is primarily concerned with the space surrounding the data points rather than the data points themselves. It quantizes the clustering space and then applies suitable operations on the resulting grid structure. Unlike conventional clustering methods, which depend on the number of data objects, this method depends on the number of cells in the grid. The algorithm used in the grid-based method is the STatistical INformation Grid-based clustering method (STING). In STING, the spatial area is partitioned into rectangular cells, where different levels of cells correspond to different levels of resolution and each cell at a higher level is partitioned to form the cells of the next lower level. Statistical information for each cell is computed beforehand and stored to answer queries. The method uses a top-down approach and performs the following steps:
1. When the user issues a query, start from a pre-defined layer, preferably one with a small number of cells.
2. From the pre-defined layer, repeat the following until the bottom layer is reached: for each cell in the current layer, calculate the confidence interval that indicates the relevance of the cell to the query. If the cell is relevant, include it in the result; otherwise discard it, and continue the search in the relevant cells of the next layer.
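The key idea of precomputing per-cell statistics can be illustrated with a simplified single-level grid sketch (full STING uses a hierarchy of layers and confidence intervals; the flat grid, the cell size and the stored statistics here are simplifying assumptions):

```python
def build_grid(points, cell_size):
    """Precompute per-cell statistics (count and mean of an attribute)
    for a flat grid of rectangular cells."""
    cells = {}
    for (x, y, value) in points:
        # map each point to its rectangular cell
        key = (int(x // cell_size), int(y // cell_size))
        cnt, total = cells.get(key, (0, 0.0))
        cells[key] = (cnt + 1, total + value)
    # store summary statistics so queries never rescan the raw points
    return {k: {"count": c, "mean": s / c} for k, (c, s) in cells.items()}
```

A query then inspects only the stored summaries, which is why the cost of the grid-based method depends on the number of cells rather than the number of data objects.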
B. Association and Co-Location
Association and co-location mining finds spatial rules that associate different spatial objects. However, the number of possible rules is abundant, so a method is needed for selecting interesting rules from the set of candidate association rules. The algorithm commonly used for determining such rules is the Apriori algorithm. Apriori works on transactional databases and highlights trends in datasets by determining the frequent itemsets.
Algorithm:
Figure I: Apriori Algorithm [7].
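A simplified rendering of the Apriori frequent-itemset step can be sketched as follows (the transaction data and support threshold are illustrative assumptions; rule generation from the frequent itemsets is omitted):

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support count is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # candidate generation: join frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # support counting: keep only candidates frequent in the transactions
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent
```

The pruning step relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so infrequent candidates never need to be extended.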
C. Classification
Classification is a technique used to determine the rules that partition the database into a given set of classes. A model is created first, according to which the whole dataset is analyzed. The algorithm generally used for this technique is the k-nearest neighbor classifier.
K-Nearest Neighbor Algorithm (KNN):
In this algorithm, an object is classified by a majority vote of its neighboring objects, with the object being assigned to the class most common among its k nearest neighbors.
Algorithm:
Step 1: Assume k training objects and a test object z = (x', y').
Step 2: Compute the distance from z to each object in the training set.
Step 3: Select the k objects closest to the test object z.
Step 4: Assign to z the class that is most common among these k nearest neighbors.
KNN is well suited for multi-modal classes as well as applications in which an object can have many class labels.
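The four steps above translate almost directly into code (the training pairs, the Euclidean distance, and the tie-breaking by first-seen class are assumptions of this sketch):

```python
import math
from collections import Counter

def knn_classify(training, z, k):
    """Classify test point z by majority vote among its k nearest training objects.

    training: list of ((coordinates), class_label) pairs.
    """
    # steps 2-3: sort training objects by distance to z and keep the k closest
    nearest = sorted(training, key=lambda item: math.dist(item[0], z))[:k]
    # step 4: majority vote among the k nearest neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the vote only consults local neighbors, each mode of a multi-modal class contributes its own neighborhoods, which is why KNN handles such classes well.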
D. Trend-Detection
Since spatial datasets contain patterns, it is very important to detect trends of spatial objects, i.e., how an attribute changes with respect to the neighborhood of some spatial object. A spatial trend is a change of one or more non-spatial attributes when moving away from a start object. The algorithms commonly used in this technique are Global Trend-Detection and Local Trend-Detection.
Figure II: Global Trend Algorithm [5].
The local trend-detection algorithm creates paths starting from a source object o in a depth-first-search manner, as Figure III illustrates. A regression function is then applied to each path whose length satisfies min-length ≤ length ≤ max-length, and a path is only extended if abs(correlation) ≥ min-conf. The algorithm returns two sets of trends: a set of positive trends and a set of negative trends.
Figure III: Local Trend Algorithm [5].
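The per-path regression test at the heart of local trend-detection can be sketched as follows (this shows only the correlation check for a single already-constructed path; the depth-first path enumeration, the example coordinates and the attribute values are assumptions):

```python
import math

def path_trend(start, path, min_conf):
    """Correlate an attribute with distance from the start object along one path.

    path: list of ((x, y), attribute_value) pairs.
    Returns 'positive', 'negative', or None when abs(correlation) < min_conf.
    """
    dists = [math.dist(start, pos) for pos, _ in path]
    vals = [v for _, v in path]
    n = len(path)
    md, mv = sum(dists) / n, sum(vals) / n
    # Pearson correlation between distance-from-start and the attribute
    cov = sum((d - md) * (v - mv) for d, v in zip(dists, vals))
    sd = math.sqrt(sum((d - md) ** 2 for d in dists))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vals))
    if sd == 0 or sv == 0:
        return None
    corr = cov / (sd * sv)
    if abs(corr) < min_conf:
        return None                    # path would not be extended further
    return "positive" if corr > 0 else "negative"
```

Paths passing this test end up in the returned set of positive or negative trends, matching the two result sets described above.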
III. CONCLUSION
This paper has overviewed the different spatial data mining techniques in four categories: Clustering and Outlier Detection, Association and Co-Location, Classification, and Trend-Detection. It has also discussed the algorithms involved in each technique.
IV. COMPARISON
Table I
Comparative Analysis of Spatial Data Mining Techniques

Parameters    | Clustering and Outlier Detection | Association and Co-Location | Classification               | Trend-Detection
Learning Rule | Unsupervised                     | Association rules           | Supervised                   | Trend rules
Performance   | High                             | Medium                      | Low                          | Medium
Algorithm     | K-means, BIRCH etc.              | Apriori algorithm etc.      | K-Nearest Neighbor algorithm | Local Trend, Global Trend
REFERENCES
[1] N. Sumathi, R. Geeta and S. Sathiya Bama, "Spatial Data Mining - Techniques, Trends and its Application", Journal of Computer Applications, Vol. 1, No. 4, 28-30, Oct-Dec 2008.
[2] Kamalpreet Kaur Jassar and Kanwalvir Singh Dhindsa, "Comparative Study of Spatial Data Mining Techniques", International Journal of Computer Applications (0975-8887), Vol. 112, No. 14, 19-22, February 2015.
[3] Marjan Kuchaki Rafsanjani, Zahra Asghari Varzaneh and Nasibeh Emami Chukanlo, "A survey of hierarchical clustering algorithms", The Journal of Mathematics and Computer Science, Vol. 5, No. 3, 229-240, 2012.
[4] Tian Zhang, Raghu Ramakrishnan and Miron Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, Vol. 1, Issue 2, 141-182, 1997.
[5] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, Portland, OR, USA, August 1996.
[7] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand and Dan Steinberg, "Top 10 algorithms in data mining", Knowledge and Information Systems, Vol. 14, Issue 1, 1-37, December 2007.