Spatial Data Mining of Colocation Patterns for Decision Support in Agriculture


HAN-WEN HSIAO1,*, MENG-SHU TSAI1, AND SHAO-CHIANG WANG2

1 Department of Computer Science and Information Engineering, Asia University, Taiwan

2 Factory 401, Armaments Bureau, Ministry of National Defense, Taiwan

ABSTRACT

Computer technologies have recently been introduced into agriculture. Precision agriculture, for example, is a popular concept that uses GIS, GPS, and other new technologies to help farmers optimize agricultural production. Colocation pattern mining is a technique for discovering relationships between different thematic features in a spatial domain. For example, it can yield the observation, backed by a reliable statistic, that large cities are often close to riversides. Such a capability is important in agricultural applications such as insect pest management. In this paper, a two-phase hierarchical clustering method is proposed to assist people in making decisions based on the spatial colocation patterns that exist implicitly in geographical data sets. It is designed as a generic system for any data set in point format. In the first phase, point features that are close together are grouped into a number of clusters. An LC matrix is generated to describe the relationship between the clusters and the layers of feature points. The LC matrix is then analyzed by a second hierarchical clustering to generate a dendrogram. The support and confidence of each cluster in the dendrogram are calculated to show the concurrent occurrence of features, regardless of their geographical locations.

Key words: spatial colocation pattern, hierarchical clustering, decision support.

1. INTRODUCTION

Computer technologies have recently been introduced into agriculture. Precision agriculture, for example, is a popular concept that uses GIS, GPS, and other new technologies, together with precise planning and control, to help farmers optimize agricultural production and minimize waste. Precise irrigation is one practical system of this kind: it uses remotely sensed imagery for site-specific irrigation to increase crop yields with less water, fewer chemicals, lower pesticide application costs, and less energy. To reach this objective, all geographical data sets available in electronic format should be analyzed effectively and efficiently for decision support. A variety of artificial intelligence approaches have provided solutions for these tasks. Peng and Wen (1999) gave an overview of applying artificial neural networks to forest resource management, particularly to insect pest prediction, using various layers of geographical data to train the models. Moreover, spatial data mining techniques reveal exceptional phenomena that exist implicitly in a given set of geographical data and that are informative for decision making. Specifically, colocation pattern mining discovers associations between spatial features. For instance, insect pests may frequently occur in grassland areas where the landscape, wind direction, and wind velocity are strongly correlated in a particular way.

The rest of this paper is organized as follows. Previous research relevant to colocation pattern mining is reviewed in the next section. Section 3 presents the proposed approach in detail. Experimental results with synthetic data sets are given in Section 4. The final section concludes this study with an outlook on future work.

2. RELATED WORKS

Several important issues in spatial data mining can be roughly classified into four categories (Shekhar & Chawla, 2003; Huang, Shekhar, & Xiong, 2004): location prediction, outlier detection, colocation patterns, and spatial clustering. Location prediction problems are usually event-centric (Anselin, 1988; Shekhar & Chawla, 2003) and can be solved in a supervised fashion. For instance, ornithologists may be concerned with the distribution of a specific bird species and wish to predict its habitat according to different ecological and environmental conditions. Hence, historical observations are used to train the models (Besag, 1974; Li, 1995; Shekhar, Schrater, Vatsavai, Wu, & Chawla, 2002). Outlier detection (Hawkins, 1980; Barnett & Lewis, 1994; Breunig, Kriegel, Ng, & Sander, 2000; Han, Pei, & Yin, 2000; Morimoto, 2001) is of interest when feature points congregate in a geographical or parametric space but a few rare points in a cluster are inconsistent with the others in terms of the associated non-spatial attributes. Such a technique is suitable for discovering unexpected information, especially in crime analysis.

Pattern discovery of spatial colocations aims to uncover two or more types of spatial features that are frequently located together; the locations themselves are not the primary concern. A geographer, for instance, may be interested in what factors cause wildfires, and may then find a strong relationship between the distribution of coniferous forests and the meteorological conditions. In general, four classes of approaches are used to solve the problems of spatial colocations:

a. The statistical approaches measure spatial correlation to characterize the relationship between the spatial features of different types (Koperski & Han, 1995; Li, 1995). However, they are usually time-consuming.

b. Some studies (Morimoto, 2001; Huang et al., 2004) adopted the association rule approach (Mannila & Toivonen, 1997) to solve this type of problem. In this case, spatial features can be treated as commodities sold in a supermarket, whereas clusters are analogous to transactions. The challenges lie in the definition of a cluster and the quantities of items in a single transaction.

c. On the basis of computing joins, some algorithms were designed to identify candidate instances of colocation patterns, but the execution time is a computational bottleneck (Shekhar & Huang, 2001). To tackle this problem, Yoo and Shekhar proposed an approach to compute partial joins (2004) and later another one without the join operation (2006).

d. From time to time, it is necessary to investigate cause-and-effect relationships, and the event-centric approaches (Koperski & Han, 1995; Huang et al., 2004) are then taken into account. The spatial features regarded as the effect are specified as the event centers. The remaining types of feature points in the vicinity of these centers with an acceptable confidence are selected as the potential cause factors for further analysis.

Spatial clustering is a classical problem of grouping data points into several clusters (Ng & Han, 1994) such that the intra-distance within clusters is minimized whereas the inter-distance between clusters is maximized. Unlike many typical clustering approaches, hierarchical clustering (Zhang, Ramakrishnan, & Livny, 1996) groups feature points based on a similarity measure to generate a tree with a multi-level hierarchy. Assigning points to specific clusters is important in cluster analysis. Different linkage methods may result in different partitions, depending on the distribution of the point set. As illustrated in Figure 1, none of the typical linkage methods can partition a point set of arbitrary distribution well. Instead of calculating the distances between a new point and all cluster centroids, it is better to examine the outermost points of the clusters, which form the most compact convex shape (Guha, Rastogi, & Shim, 1998; Karypis, Han, & Kumar, 1999), when choosing an appropriate cluster.

Figure 1. Different point distributions. (a) Two clusters are close to each other, but significantly different in size. (b) Clusters are more or less elongated along a direction.

3. TWO-PHASE HIERARCHICAL CLUSTERING

The proposed algorithm for mining colocation patterns is based on hierarchical clustering (Guha et al., 1998; Karypis et al., 1999), as illustrated in Figure 2. Suppose there are $L$ layers of thematic maps representing different characteristics of a geographical region, where each layer $l_i$ contains $n_i$ feature points. The total number $N$ of feature points is then

\[ N = \sum_{i=1}^{L} n_i . \tag{1} \]

In the first step, the classical hierarchical clustering (UPGMA) approach is applied to the $N$ feature points, regardless of the characteristics for which these points stand. The similarity measure used in this study is simply the Euclidean distance between two points in a two-dimensional space. The linkage method, based on the average distance between all pairs of objects in clusters $r$ and $s$, is defined as

\[ d(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj}) . \tag{2} \]

Figure 2. An illustration of the proposed two-phase hierarchical clustering approach to colocation pattern mining. [Figure panels: thematic maps; hierarchical clustering (Euclidean distance); LC matrix of $n_c$ clusters by $n_l$ layers; hierarchical clustering (Hamming distance); colocation patterns. Dendrogram axis: level of dissimilarity.]
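The average linkage of Equation (2) can be illustrated with a small Python sketch. This is not the authors' MATLAB implementation; the function names `average_linkage` and `upgma_merge_levels` are hypothetical, and the agglomeration loop is a deliberately naive version that recomputes all pairwise linkages at every step.

```python
import math
from itertools import combinations


def average_linkage(cluster_r, cluster_s):
    """Equation (2): mean Euclidean distance over all cross-cluster point pairs."""
    total = sum(math.dist(p, q) for p in cluster_r for q in cluster_s)
    return total / (len(cluster_r) * len(cluster_s))


def upgma_merge_levels(points):
    """Naively agglomerate points and return the dissimilarity level of each merge.

    Every point starts as its own cluster; at each step the two clusters with
    the smallest average linkage are merged (illustrative, not optimized).
    """
    clusters = [[p] for p in points]
    levels = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        levels.append(average_linkage(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return levels


# Three collinear points: the two close points merge first at level 1.0,
# then the far point joins at the average of 10.0 and 9.0.
example_levels = upgma_merge_levels([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)])
```

The returned merge levels are the node heights of the dendrogram, which are the quantities used for threshold selection in the next step.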

Threshold selection is a key issue in the clustering problem. Let the levels of dissimilarity $l_i$ of the $n_t$ nodes, marked as ● or ○ in the dendrogram, be regarded as random variables. As illustrated in Figure 3, these variables are grouped in one-dimensional space such that the threshold, indicated by the dashed line, is determined by the left-most cluster $T$, marked with solid dots ●. Before the threshold is calculated, the levels $l_i$ are first sorted so that they increase monotonically. Algorithm 1 gives the details.

Figure 3. Threshold selection after the first-phase hierarchical clustering. The axis represents the ordinate of the dendrogram plot rotated 90 degrees clockwise. The levels $l_i$ of the nodes in the dendrogram are projected onto the line and marked as ● or ○. [Figure labels: axis "Level of dissimilarity" starting at 0; the levels $l_i$; the mean $m$ and standard deviation $s$; the left-most cluster $T$.]

Algorithm 1. Threshold calculation from the levels of nodes.

1. Define the set T of the left-most cluster as {l_1, l_2, l_3}.
2. Calculate the sample mean m and sample standard deviation s of T.
3. for each level l_i, where i = 4 … n_t
       if l_i < (m + s) then
           T ← T ∪ {l_i}
           Calculate m and s of the new T.
       else
           Return (m + s)
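Algorithm 1 can be sketched in Python as follows. The function name `select_threshold` is hypothetical. Two points are assumptions rather than statements from the paper: the update step is read as adding $l_i$ to $T$, and, because Algorithm 1 does not say what happens when every level is absorbed into $T$, the sketch returns $m + s$ of the final $T$ in that case.

```python
import statistics


def select_threshold(levels):
    """Threshold from dendrogram node levels, following Algorithm 1.

    The levels are sorted in increasing order, the left-most cluster T starts
    as the three smallest levels, and T grows while the next level stays
    below mean + sample standard deviation of the current T.
    """
    sorted_levels = sorted(levels)
    if len(sorted_levels) < 3:
        raise ValueError("at least three levels are needed to initialise T")

    cluster = sorted_levels[:3]
    m = statistics.mean(cluster)
    s = statistics.stdev(cluster)  # sample standard deviation

    for level in sorted_levels[3:]:
        if level < m + s:
            cluster.append(level)
            m = statistics.mean(cluster)
            s = statistics.stdev(cluster)
        else:
            return m + s

    # Assumption: the paper does not cover this case; fall back to the
    # threshold implied by the final T.
    return m + s


# The level 11 is absorbed into T = {0, 10, 10}; the level 50 stops the loop.
example_threshold = select_threshold([10, 0, 11, 10, 50])
```

Clusters would then be obtained by cutting the first-phase dendrogram at the returned level.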

Feature points scattered sparsely around the clusters are then removed from the point set if the normalized distance from a point to its nearest cluster is larger than one standard deviation of the average distance of that cluster.

After the first-phase hierarchical clustering is complete, a number of clusters are available for the subsequent analysis. Each cluster contains multiple points, each representing a specific type of feature (or thematic meaning). An LC matrix of size $n_c \times n_l$ is created to store the clustering result, where $n_c$ and $n_l$ are the numbers of clusters and layers, respectively. As illustrated in Figure 2, an element $e_{ij}$ of the LC matrix with a value of one indicates that the $i$th cluster contains a feature point from the $j$th layer. For example, suppose there are 13 different layers and the 28th cluster consists of points representing layers 2, 5, 6, 9, 12, and 13. The 28th row of the LC matrix is then {0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1}. The LC matrix may be regarded as analogous to a collection of transactions in the market basket problem, where each row stands for one transaction. Instead of the association rule approach, however, hierarchical clustering is used again in the second phase, where the $n_l$ layers are regarded as vectors of $n_c$ dimensions. That is to say, the frequent concurrent occurrence of two types of features implies that their feature points should appear together in as many clusters as possible. This is why the columns of the LC matrix are used as the input vectors for the second-phase clustering. The similarity measure used in this stage is the Hamming distance. The dendrogram generated by this hierarchical clustering indicates the patterns of spatial colocation among the layers, and each link (or node) in the dendrogram is equivalent to an association rule.
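The construction of the LC matrix and the column-wise Hamming distance can be sketched as below. The helper names `build_lc_matrix`, `layer_vectors`, and `hamming_distance` are hypothetical. The distance is computed here as a raw count of differing positions, which is an assumption: the paper does not say whether the count is normalized by $n_c$.

```python
def build_lc_matrix(cluster_layers, n_layers):
    """Return an n_c x n_l binary matrix as a list of rows.

    cluster_layers[i] is the set of 1-based layer indexes that contribute at
    least one feature point to cluster i.
    """
    return [
        [1 if layer in layers else 0 for layer in range(1, n_layers + 1)]
        for layers in cluster_layers
    ]


def layer_vectors(lc_matrix):
    """Transpose the LC matrix so that each layer becomes an n_c-dimensional vector."""
    return [list(column) for column in zip(*lc_matrix)]


def hamming_distance(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# The example from the text: a cluster containing layers 2, 5, 6, 9, 12, 13.
row_28 = build_lc_matrix([{2, 5, 6, 9, 12, 13}], n_layers=13)[0]

# A toy LC matrix with three clusters and three layers.
toy_matrix = build_lc_matrix([{1, 2}, {1, 2, 3}, {3}], n_layers=3)
toy_layers = layer_vectors(toy_matrix)
```

In the toy matrix, layers 1 and 2 appear in exactly the same clusters, so their Hamming distance is zero and they would be the first pair merged in the second-phase dendrogram.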

To measure the frequency of a colocation pattern, the concepts of support and confidence are used. As illustrated in Figure 4, two layers $l_p$ and $l_q$, marked in gray, are merged to form an association rule represented as a node $t$ in the dendrogram. This node $t$ is later merged with another layer $l_r$ to generate a new association rule if its support and confidence are above the pre-defined threshold values. Suppose there are three sets $S$, $A$, and $B$, where both $A$ and $B$ are subsets of $S$. The support and confidence are defined as

\[ \mathit{support} = \frac{N(A \cap B)}{N(S)} ; \tag{3} \]

\[ \mathit{confidence} = \frac{N(A \cap B)}{N(A)} \tag{4} \]

where $N(A \cap B)$ is the number of items common to both sets $A$ and $B$, and $N(S)$ and $N(A)$ are the numbers of items in the sets $S$ and $A$, respectively. In this study, $N(S)$ is equal to the number $n_c$ of clusters, i.e., 12 in Figure 4, whereas $N(A)$ and $N(A \cap B)$ in this example are obtained by the following equations:

\[ N(A) = N(p) = \sum_{i=1}^{n_c} e_{ip} = 8 ; \tag{5} \]

\[ N(A \cap B) = N(p \cap q) = \sum_{i=1}^{n_c} (e_{ip} \cdot e_{iq}) = 6 . \tag{6} \]

The above equations can be intuitively applied to the cases of node-layer and node-node combinations in a similar way to generate the large itemsets in association rule mining.

Figure 4. An illustration of support and confidence calculations. Suppose two layers $l_p$ and $l_q$ are first merged into an association rule, i.e., node $t$ in the dendrogram, which is then merged with the layer $l_r$. [Figure content: an LC matrix with $n_c = 12$ clusters as rows and the layers $l_p$, $l_q$, $l_r$ as columns; entries row by row ($l_p\,l_q\,l_r$): 101, 110, 001, 110, 011, 111, 111, 000, 101, 010, 110, 111. Worked values: support($l_p$, $l_q$) = $N(l_p \cap l_q)/n_c$ = 6/12; confidence($l_p$, $l_q$) = $N(l_p \cap l_q)/N(l_p)$ = 6/8; support($t$, $l_r$) = $N(t \cap l_r)/n_c$ = 3/12; confidence($t$, $l_r$) = $N(t \cap l_r)/N(t)$ = 3/6.]
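Equations (3) to (6) can be checked numerically with a short Python sketch. The function names are hypothetical. The three columns below are transcribed from the binary matrix drawn in Figure 4, read row by row; that reading order is an assumption, although it reproduces the counts 8, 6, and 3 given in the text and figure. Returning zero confidence for an empty antecedent is also an assumption, since the paper does not discuss that case.

```python
def merge_columns(col_a, col_b):
    """Column of a merged node: clusters in which both inputs are present."""
    return [a & b for a, b in zip(col_a, col_b)]


def support(col_a, col_b):
    """Equation (3): N(A and B) / N(S), with N(S) the number of clusters."""
    return sum(merge_columns(col_a, col_b)) / len(col_a)


def confidence(col_a, col_b):
    """Equation (4): N(A and B) / N(A)."""
    n_a = sum(col_a)
    if n_a == 0:
        return 0.0  # assumption: an empty antecedent yields zero confidence
    return sum(merge_columns(col_a, col_b)) / n_a


# Columns l_p, l_q, l_r of the 12-cluster example (assumed row-by-row reading).
l_p = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
l_q = [0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1]
l_r = [1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1]

node_t = merge_columns(l_p, l_q)  # node t of the dendrogram

pair_support = support(l_p, l_q)           # 6 / 12
pair_confidence = confidence(l_p, l_q)     # 6 / 8
node_support = support(node_t, l_r)        # 3 / 12
node_confidence = confidence(node_t, l_r)  # 3 / 6
```

Treating a merged node as the element-wise AND of its children is what lets the same two functions cover the node-layer and node-node cases.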


4. EXPERIMENTAL RESULTS AND DISCUSSION

A prototype system for mining colocation patterns from spatial data is designed to be as generic as possible for various applications. The program has been implemented in MATLAB on a laptop computer equipped with a Celeron 1.3 GHz processor and 256 MB of RAM. Suppose different types of spatial features are considered as potential factors to be analyzed. For the purpose of mining colocation patterns, all kinds of features are in point format. A robbery, for instance, that occurred in front of a convenience store would be recorded as a spatial point with some relevant non-spatial data as its attributes. Without loss of generality, the artificial data set was generated by setting 300 time instances for 13 layers so that each layer has at most one feature point at each time instance, where the emergence probability is set to 70%. As shown in Figure 5, the map thus contains no more than 3900 points scattered over 13 layers. It can be thought of as a geographical map consisting of 13 layers, each of which represents some thematic meaning expressed by a specific symbol. Figure 6 illustrates the experimental result obtained by this approach. The execution time for this data set is less than 10 seconds. The layer indexes are listed along the abscissa of the dendrogram plot, whereas the correlation extent among layers is indicated on the ordinate. Each cluster (or node in this dendrogram) stands for an association rule stating that the feature points in the included layers tend to occur together, regardless of their actual locations. Therefore, the dendrogram plot is a simple yet effective way to visualize the overall colocation patterns among different types of thematic features.
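The data generation described above can be approximated by the following Python sketch. The paper does not state how point coordinates were drawn, so uniform coordinates in the unit square are purely an assumption, as are the function name `generate_synthetic_points` and the `seed` parameter. Only the counts (300 time instances, 13 layers, 70% emergence probability) come from the text.

```python
import random


def generate_synthetic_points(n_instances=300, n_layers=13,
                              emergence_probability=0.7, seed=0):
    """Return a list of (layer, x, y) tuples.

    At each time instance, each layer emits at most one point, with the given
    emergence probability. Coordinates are uniform in the unit square
    (an assumption; the original coordinate model is not described).
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_instances):
        for layer in range(1, n_layers + 1):
            if rng.random() < emergence_probability:
                points.append((layer, rng.random(), rng.random()))
    return points


synthetic_points = generate_synthetic_points()
upper_bound = 300 * 13  # no more than 3900 points in total
```

With these settings the expected number of points is 300 × 13 × 0.7 = 2730, consistent with the stated upper bound of 3900.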

Figure 5. Scatter map of synthetic dataset consisting of 13 layers of feature points. Each layer is represented by a specific symbol.

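Figure 6. Visualization of the spatial colocation patterns discovered from the synthetic dataset. The layer indexes are along the abscissa, while the ordinate stands for the correlation extent between layers. [Figure labels: layer order along the abscissa: 4, 6, 9, 1, 7, 5, 10, 8, 2, 13, 12, 3, 11; ordinate ticks from 0.3 to 0.345.]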

The values of support and confidence of each association rule discovered from the data set are listed in Table 1. The two layers of feature points considered as spatially collocated are listed in the second and third columns of the table. A layer index larger than the number of layers, i.e., 13 in this study, denotes a new node created by merging two nodes (or layers). The support values in the fourth column are the probabilities that two layers of feature points occur simultaneously. ConfidenceL is the conditional probability of this co-occurrence given that the feature points of the left-side layer occur. For example, the first rule in Table 1 indicates that the feature points in layers 4 and 6 often occur together geographically, with a support value of 0.063898 and a confidence value of 0.29851 under the condition that the feature points in layer 4 occur. Table 2 gives a summary of the experimental result from this artificial data set. The top four pairs of layers from Table 1 are relatively significant in terms of support and confidence, and can thus be considered as spatially collocated. The low values are due to the use of synthetic data obtained from random number generators. It is expected that the results from genuine data should be satisfactory.

5. CONCLUSION

A conceptually simple approach of two-phase hierarchical clustering has been proposed for discovering colocation patterns of different types of spatial features that often occur together geographically. All types of feature points are mixed together for hierarchical clustering. After a simple step for threshold selection, an LC matrix is generated to store the relationships between the clusters and the layers. The matrix is then used as the input for a second hierarchical clustering. Each node in the resulting dendrogram can be thought of as an association rule if its support and confidence are greater than the pre-defined threshold values. The dendrogram itself is an effective yet simple way of visualizing the discovered colocation patterns. This system has demonstrated its effectiveness, efficiency, and generality. Nevertheless, a more rigorous method for threshold selection in the first phase should be considered, and will be addressed in future work. Potential applications include, but are not limited to, insect pest management and wildfire prediction. Such discovered patterns are expected to be valuable for early prevention.

Table 1. Support and confidence of each spatial colocation pattern

Pattern ID   LayerL   LayerR   Support    ConfidenceL   ConfidenceR
1            4        6        0.063898   0.29851       0.30075
2            2        13       0.071885   0.31034       0.33088
3            5        10       0.054313   0.24818       0.27869
4            1        7        0.065495   0.29496       0.29078
5            15       12       0.023962   0.33333       0.12195
6            16       8        0.025559   0.47059       0.10738
7            18       3        0.006390   0.26667       0.02898
8            17       19       0.001597   0.02439       0.06250
9            14       9        0.020767   0.32500       0.08904
10           20       11       0.003195   0.50000       0.01370
11           21       22       0.000000   0.00000       0.00000
12           24       23       0.000000   0.00000       0.00000

Note. Two layers 4 and 6 in this table, for example, are spatially collocated with a support value of 0.063898 and two confidence values, depending on the definition of confidence.

Table 2. Four spatial colocation patterns with significant support and confidence

Colocation Pattern   ConfidenceL   Colocation Pattern   ConfidenceR
4 → 6                0.29851       6 → 4                0.30075
2 → 13               0.31034       13 → 2               0.33088
5 → 10               0.24818       10 → 5               0.27869
1 → 7                0.29496       7 → 1                0.29078

Note. For example, the pattern 4 → 6 with a confidence value of about 0.3 indicates that, if layer 4 occurs, the probability that layers 4 and 6 occur together is about 0.3.

REFERENCES

Anselin, L. (1988). Spatial Econometrics: Methods and Models. Dordrecht, Netherlands: Kluwer Academic.

Barnett, V., & Lewis, T. (1994). Outliers in Statistical Data (3rd ed.). New York, USA: John Wiley.

Besag, J. (1974). Spatial interaction and statistical analysis of lattice systems. Journal of Royal Statistical Society, Series B, 36(2), 192-236.

Breunig, M. M., Kriegel, H. P., Ng, R., & Sander, J. (2000). LOF: Identifying density-based local outliers. Proceedings of ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 93-104.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, 73-84.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 1-12.

Hawkins, D. M. (1980). Identification of outliers. London, UK: Chapman and Hall.

Huang, Y., Shekhar, S., & Xiong, H. (2004). Discovering colocation patterns from spatial data sets: a general approach. IEEE Transactions on Knowledge and Data Engineering, 16(12), 1472-1485.

Karypis, G., Han, E. H., & Kumar, V. (1999). CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8), 68-75.

Koperski, K., & Han, J. (1995). Discovery of spatial association rules in geographic information databases. Proceedings of 4th International Symposium on Large Spatial Databases, Portland, Maine, USA, 47-66.

Li, S. Z. (1995). Markov random field modeling in computer vision. New York, USA: Springer-Verlag.

Mannila, H., & Toivonen, H. (1997). Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3), 241-258.

Morimoto, Y. (2001). Mining frequent neighboring class sets in spatial databases. Proceedings of 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 353-358.

Ng, R., & Han, J. (1994). Efficient and effective clustering method for spatial data mining. Proceedings of International Conference on Very Large Data Bases, Santiago, Chile, 144-155.

Peng, C. H., & Wen, X. (1999). Recent applications of artificial neural networks in forest resource management: an overview. In: U. Cortés and M. Sànchez-Marrè (Eds.), Environmental Decision Support Systems and Artificial Intelligence (pp. 15-22). AAAI Technical Reports WS-99-07, AAAI Press, Menlo Park, CA, USA.

Shekhar, S., & Chawla, S. (2003). Spatial Databases: A Tour. Englewood Cliffs, NJ, USA: Prentice Hall.

Shekhar, S., & Huang, Y. (2001). Colocation rules mining: A summary of results. Proceedings of 7th International Symposium on Spatial and Temporal Databases, Redondo Beach, California, USA, 236-256.

Shekhar, S., Schrater, P. R., Vatsavai, R. R., Wu, W., & Chawla, S. (2002). Spatial contextual classification and prediction models for mining geospatial data. IEEE Transactions on Multimedia, 4(2), 174-188.

Yoo, J. S., & Shekhar, S. (2004). A partial join approach for mining colocation patterns. Proceedings of 12th Annual ACM International Workshop on Geographic Information Systems, Washington DC, USA, 241-249.

Yoo, J. S., & Shekhar, S. (2006). A joinless approach for mining spatial colocation patterns. IEEE Transactions on Knowledge and Data Engineering, 18(10), 1323-1337.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. Proceedings of ACM SIGMOD International Conference on Management of Data, Montreal, Canada, 103-114.

Han-Wen Hsiao received the B.S. degree in physics from Fu Jen Catholic University in 1988, the M.S. degree in atmospheric physics with an emphasis in satellite remote sensing from National Central University in 1990, and the Ph.D. degree in civil engineering with a specialization in geoinformatics from the University of Illinois at Urbana-Champaign (UIUC) in 1999. During his graduate study in the US, he worked respectively for the Robotics Center and the Business Process Division of Construction Engineering Research Laboratories, Corps of Engineers, US Army. Later, he also participated in the UIUC Digital Library Initiative (DLI) Project for more than two years. After graduation, he worked as a postdoctoral researcher in the information system research team of the Office of the National Science and Technology Program for Hazards Mitigation. Since 2001 he has been with Taichung Healthcare and Management University (now Asia University). He is currently an Assistant Professor in the Department of Computer Science and Information Engineering, and affiliated with the Department of Biotechnology and Bioinformatics. His research interests include pattern recognition, data mining, image processing, bioinformatics, and geoinformatics.

Meng-Shu Tsai received the B.S. degree in information management from Diwan College of Management in 2002, and the M.S. degree in information engineering from Taichung Healthcare and Management University (now Asia University) in 2004. He is currently with ITE Tech., Inc. as a software engineer. His research interests include data mining and distributed computing.

Shao-Chiang Wang received the B.S. degree in surveying engineering from Chung Cheng Institute of Technology in 1988, and the M.S. degree in information engineering from Feng Chia University in 2003. Since 1988 he has been with the Factory 401, Combined Logistics Command (now Armaments Bureau), Ministry of National Defense (MND). During the 1990s, he was sent as a technical consultant on digital mapping to the Military Survey Department, Ministry of Defense, the Kingdom of Saudi Arabia (KSA) for four years. After returning from the KSA, he worked at the R&D office of the Factory 401 for four years. In 2004, he joined the MPCMIS project as one of the planners at the Armaments Bureau of the MND. He is currently a lieutenant colonel and the director of the cartographic compilation section. His research interests are in the areas of geoinformatics, image processing, and computer vision.