Conclusion and Discussion - Spatial Ordering and Encoding for Geographic Data Mining and Visualization

We have presented and evaluated a set of spatial ordering/encoding methods that transform spatial relations into a one-dimensional ordering and encoding while preserving spatial patterns as much as possible. Such an ordering and encoding can then be used in a variety of spatial or spatio-temporal data mining tasks. We designed a comprehensive set of measures to evaluate different orderings/encodings. The results revealed a number of important characteristics and unique behaviors of each ordering/encoding method. We showed that the optimal ordering/encoding with complete-linkage clustering consistently gives the best overall performance across the data distributions tested. We also briefly introduced two possible applications (out of many) that make use of the spatial ordering and encoding methods we describe.

Evaluation results with various data distributions show that the optimal ordering/encoding based on complete-linkage clustering gives the best overall performance in preserving spatial patterns, surpassing well-known space-filling curves. It preserves spatial locality in both directions: spatial neighbors are close in the ordering, and neighbors in the ordering are also spatially close. The spatial ordering and encoding can then help in a variety of geographic data mining problems.
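The two directions of locality preservation can be illustrated with a minimal Python sketch. It is not the set of evaluation measures used in this paper; the function name `locality_scores`, the parameter `k`, and the two summary statistics are assumptions chosen only for illustration. The first value is the mean gap in ordering position between each point and its k nearest spatial neighbors; the second is the mean spatial distance between consecutive points in the ordering. Lower is better for both.

```python
import math

def locality_scores(points, order, k=1):
    """Illustrative (hypothetical) two-way locality check for a 1-D ordering.

    points: list of (x, y) tuples.
    order:  list of point indices, i.e. the one-dimensional ordering.
    Returns (mean_rank_gap, mean_step_distance).
    """
    n = len(points)
    rank = {idx: pos for pos, idx in enumerate(order)}

    # Direction 1: spatial neighbors should be close in the ordering.
    gaps = []
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        gaps.extend(abs(rank[i] - rank[j]) for j in nearest)

    # Direction 2: neighbors in the ordering should be spatially close.
    steps = [math.dist(points[a], points[b]) for a, b in zip(order, order[1:])]

    return sum(gaps) / len(gaps), sum(steps) / len(steps)

# Two tight pairs of points, far apart from each other.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = locality_scores(pts, [0, 1, 2, 3])  # keeps each pair together
bad = locality_scores(pts, [0, 2, 1, 3])   # alternates between the pairs
```

In this toy case the ordering that keeps each pair together scores lower on both values than the ordering that alternates between the pairs.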

Although the optimal ordering strategy generally produces a better ordering/encoding than the simple ordering strategy for a given hierarchical clustering method, the primary factor that controls the ordering/encoding quality is the clustering method. For example, single-linkage clustering gives very poor results in all tests, no matter which ordering strategy is used. The two space-filling curves (i.e., the Hilbert curve and the Morton curve) have very different characteristics. They generally work better with random data than with clustered data, but neither of them gives good overall performance: each is good on one type of measure and performs badly on the other. Another advantage of the cluster-based ordering methods is that they can work with non-Euclidean data spaces.
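For readers unfamiliar with the Morton curve, the following is a minimal sketch of the standard bit-interleaving construction of a Morton key. It assumes that locations have already been assigned non-negative integer grid-cell coordinates; the function names and the `bits` parameter are illustrative and are not taken from the implementation evaluated in this paper.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of integer cell coordinates x and y."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bits fill even positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bits fill odd positions
    return key

def morton_order(cells, bits=16):
    """Return the indices of `cells` sorted by their Morton key."""
    return sorted(range(len(cells)), key=lambda i: morton_key(*cells[i], bits))

# Cells (3, 1) and (0, 2) receive consecutive keys 7 and 8,
# although they lie on opposite sides of a 4 x 4 grid.
k1 = morton_key(3, 1)
k2 = morton_key(0, 2)
```

The pair of cells in the example shows how neighbors in a Morton ordering are not always spatially close, one of the behaviors that a two-way evaluation is meant to expose.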

Lastly, we briefly compare the computational scalability of each ordering method. Space-filling curves are of O(n log n) complexity and thus can process very large data sets. For each clustering-based method, the computational complexity involves two parts: the clustering procedure and the ordering procedure. The simple ordering strategy is of O(n) complexity while the optimal ordering strategy is of O(n³). Single-linkage clustering is of O(n² log n) complexity and complete-linkage clustering is of O(n³). Therefore, the CLO_OPT and SNN_CLO_OPT ordering methods are the most time-consuming among all clustering-based methods. To derive an ordering of the 3128 US cities, the CLO_OPT method takes about 5 minutes on a desktop computer with 2.0 GB of RAM and a 3.60 GHz Pentium 4 CPU. Where efficiency is an issue, perhaps due to dataset size or the time criticality of the application, the simple ordering strategy might provide a more viable alternative.
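To make the two parts (clustering and ordering) concrete, here is a minimal sketch of a naive complete-linkage clustering combined with a simple ordering, in which the leaves of the two merged clusters are concatenated without any flipping. It is an illustration under stated assumptions, not the CLO_OPT implementation timed above: the function name `complete_linkage_order` and its `dist` parameter are hypothetical, and no optimal leaf ordering is attempted. Because sub-cluster sizes sum to n, each merge step evaluates at most about n²/2 pairwise distances, so this naive version performs on the order of n³ distance evaluations overall.

```python
import math

def complete_linkage_order(points, dist=math.dist):
    """Naive complete-linkage clustering; returns leaf indices in merge order.

    `dist` may be any dissimilarity function over two items, so the same
    sketch also applies to non-Euclidean data.
    """
    # Each cluster is stored as the ordered list of its leaf indices.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Simple ordering: concatenate the two children, no flipping.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0] if clusters else []

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (0, 2)]
order = complete_linkage_order(pts)  # [2, 3, 4, 0, 1]
```

In this small example, point 4 at (0, 2) ends up next to point 0 at (0, 0) rather than next to its nearest neighbor, point 1 at (0, 1). Flipping the sub-cluster [0, 1] would repair this, which is the kind of improvement an optimal ordering strategy searches for at additional cost.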

Acknowledgement

This research was partially supported by grant CA95949 from the National Cancer Institute.

In document Spatial Ordering and Encoding for Geographic Data Mining and Visualization. Diansheng Guo 1 and Mark Gahegan 2 (Page 22-26)