We now discuss future work based on the results of this thesis.
First, as discussed above, our works investigate undirected graphs. One open question is whether these methods are still applicable to directed graphs. For example, the quality of the best path from a node u to another node v is the same as from v to u in an undirected graph, but not necessarily in a directed graph. It is also possible that there is no path from v to u.
In network simplification, we can easily modify the calculation of con- nectivity between nodes to take path direction into account. We can also use methods in Articles II and III to remove edges to simplify directed graphs.
In network compression, the modification is a bit more complicated. A superedge represents all edges between nodes in its endpoints. The
5.6 Open questions 39 directions need to be added into superedges. It brings up several ques- tions: how to decide the direction of a superedge, how to assign an op- timal weight to a superedge, and how to calculate the compression er- ror. In the simple weighted graph compression problem, one solution is that the direction of a superedge is the majority direction of edges repre- sented by it. Edges with opposite directions should incur a larger error, e.g. wo(u, v)− sgn((u, v)) · ws(u, v). Another solutions is to consider dif-
ferent directions separately. This would introduce less error. However, the modification in the generalized weighted graph compression is even harder, since it needs to take path direction into account.
Second, the results from our network simplification methods are con- nected graphs. An edge whose removal cuts the graph will not be consid- ered, because we aim to preserve the connectivities of all node pairs. This raises the question if removing some weak edges can lead to a good abstrac- tion even if it may cut the graph. Suppose a node connects with the rest of the graph only through a single edge. Removal of such an edge separates the node from the rest of the graph. However, this may be insignificant overall, and in a sense, such removal could be meaningful.
Third, to evaluate results from a subjective viewpoint is difficult. For example, one advantage of an abstract graph is that the structure is simpler and smaller, which makes visualization convenient. We have not carried out any user studies to investigate the usefulness of network abstraction for an end user.
Last, in our work, the main goal is to give an overview of a large graph which maintains some essential qualities of the original graph. A related problem is (social) network anonymization, where the goal is to preserve some important properties (the “utility” of the network) but hide iden- tities of nodes or some other sensitive information [18]. The methods in privacy preserving graph mining try to transform the graph to avoid re- identification of individuals (nodes), but still maintain the utility of the transformed graph. Take graph compression as an example of the similar- ities of the applications. By requiring that each supernode consists of at least k nodes, one achieves k-anonymity: any individual (node) is identical with as least k− 1 other individuals. It would be interesting to study the connections between these two research areas further.
Chapter 6
Conclusions
Networks exist in various domains, such as biology, chemistry, and sociol- ogy. Large graphs are difficult to visualize and understand. In different areas of computer science, several approaches have been proposed to help understand graphs, like partitioning, sampling, and extracting subgraphs.
In this thesis, we defined the field of network abstraction, including all approaches that aim to produce a simpler and useful network from a large one to help a user understand networks more easily.
We first studied simplifying a graph structure by removing edges. We provided definitions of redundant edges and unimportant edges, and pro- posed methods that are applicable to different kinds of graphs. The ex- periments show that graphs can have a fewer number of edges while the connectivity of the original graph is well maintained.
We then investigated presenting a graph at a higher abstraction level by grouping nodes and edges. Graphs can be compressed based on the graph structure alone, or with respect to node importances which reflect the emphasis of the application. We also have shown the utility of the compressed graph through a biological application.
The contributions of this thesis include discussion of the network ab- straction field and proposing approaches for two specific cases. The area of network abstraction is large, and the solutions presented in this thesis are only one part of it. By identifying this as a research area and proposing some novel methods, we believe that the work provides new insight into understanding and visualizing graphs.
References
[1] M. J. Berger and S. H. Bokhari. A Partitioning Strategy for Nonuni- form Problems on Multiprocessors. IEEE Transactions on Computers, 36(5):570–580, 1987.
[2] Z. W. Birnbaum. On the importance of different components in a multicomponent system. In Multivariate Analysis - II, pages 581–592, New York, 1969. Academic Press.
[3] P. Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1):113–120, 1972.
[4] U. Brandes. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
[5] B. Bringmann and S. Nijssen. What is Frequent in a Single Graph? In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD’08, pages 858–863, Berlin, Heidelberg, 2008. Springer-Verlag.
[6] C. Chen, C.X. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining Graph Patterns Efficiently via Randomized Sum- maries. Proceedings of the VLDB Endowment, 2(1):742–753, 2009. [7] C. F. Eick, N. Zeidat, and Z. Zhao. Supervised Clustering – Algorithms
and Benefits. In Proceedings of the 16th IEEE International Confer- ence on Tools with Artificial Intelligence, pages 774–776, Washington, DC, USA, 2004. IEEE Computer Society.
[8] M. Everett and S. P. Borgatti. Ego network betweenness. Social Net- works, 27(1):31–38, January 2005.
[9] C. Faloutsos, K. S. McCurley, and A. Tomkins. Fast Discovery of Connection Subgraphs. In Proceedings of the tenth ACM SIGKDD
44 References International Conference on Knowledge Discovery and Data Mining, pages 118–127, New York, NY, USA, 2004. ACM.
[10] L. C. Freeman. A Set of Measures of Centrality Based on Betweenness. Sociometry, 40:35–41, 1977.
[11] N. E. Friedkin. Theoretical Foundations for Centrality Measures. American Journal of Sociology, 96(6):1478–1504, 1991.
[12] S. Gert. The centrality index of a graph. Psychometrika, 31(4):581– 603, December 1966.
[13] A.C. Gilbert and K. Levchenko. Compressing network graphs. In Proceedings of the LinkKDD workshop at the 10th ACM Conference on KDD, 2004.
[14] S.J. Gould. Wonderful Life: The Burgess Shale and the Nature of History. Norton, 1990.
[15] M Gr¨otschel, C. L. Monma, and M. Stoer. Design of survivable net- works. In Handbooks in Operations Research and Management Science, volume 7, pages 617–672, 1995.
[16] P. Hage and F. Harary. Eccentricity and centrality in networks. Social networks, 17(1):57–63, 1995.
[17] T. H. Haveliwala. Topic-Sensitive PageRank. In WWW ’02: Proceed- ings of the 11th International Conference on World Wide Web, pages 517–526, New York, NY, USA, 2002. ACM.
[18] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting Structural Re-identification in Anonymized Social Networks. Proceed- ings of the VLDB Endowment, 1(1):102–114, August 2008.
[19] B. Hendrickson and R. Leland. An improved spectral graph partition- ing algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 16(2):452–469, 1995.
[20] B. Hendrickson and R. Leland. A Multi-Level Algorithm For Parti- tioning Graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), New York, NY, USA, 1995. ACM. [21] P. Hintsanen and H. Toivonen. Finding reliable subgraphs from large
probabilistic graphs. Data Mining and Knowledge Discovery, 17:3–23, August 2008.
References 45 [22] P. Hintsanen, H. Toivonen, and P. Sevon. Fast Discovery of Reliable Subnetworks. In Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, pages 104–111, Los Alamitos, CA, USA, 2010. IEEE Computer Society.
[23] C. Hubler, H. P. Kriegel, K. Borgwardt, and Z. Ghahramani. Metropo- lis Algorithms for Representative Subgraph Sampling. In Proceedings of the eighth IEEE International Conference on Data Mining, pages 283–292, Los Alamitos, CA, USA, 2008. IEEE Computer Society. [24] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-Based Algorithm
for Mining Frequent Substructures from Graph Data. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowl- edge Discovery, pages 13–23, London, UK, 2000. Springer-Verlag. [25] J.W. Jaromczyk and G.T. Toussaint. Relative Neighborhood Graphs
and Their Relatives. Proceedings of the IEEE, 80(9):1502–1517, 1992. [26] G. Jeh and J. Widom. Scaling Personalized Web Search. In WWW ’03: Proceedings of the 12th International Conference on World Wide Web, pages 271–279, New York, NY, USA, 2003. ACM.
[27] B. H. Junker and F. Schreiber. Analysis of Biological Networks. Wiley Series in Bioinformatics. John Wiley & Sons, 2011.
[28] G. Karypis and V. Kumar. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Com- puting, 20:359–392, 1998.
[29] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49(1):291–307, 1970.
[30] J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In Proceed- ings of the 12th ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, KDD ’06, pages 631–636, New York, NY, USA, 2006. ACM.
[31] S. Lin and H. Chalupsky. Unsupervised Link Discovery in Multi- relational Data via Rarity Analysis. In ICDM ’03: Proceedings of the third IEEE International Conference on Data Mining, page 171, Washington, DC, USA, 2003. IEEE Computer Society.
46 References [32] A.S. Maiya and T.Y. Berger-Wolf. Sampling Community Structure. In Proceedings of the 19th International Conference on World Wide Web, pages 701–710, New York, NY, USA, 2010. ACM.
[33] G. L. Miller, S. H. Teng, W. Thurston, and S. A. Vavasis. Geometric Separators for Finite-Element Meshes. SIAM Journal on Scientific Computing, 19(2):364–386, 1998.
[34] S.C. Morris. The Crucible of Creation: The Burgess Shale and the Rise of Animals. Oxford University Press paperback. Oxford University Press, 2000.
[35] S. Muggleton. Duce, an oracle based approach to constructive induc- tion. In Proceedings of the 10th International Joint Conference on Artificial Intelligence, pages 287–292, San Francisco, CA, USA, 1987. Morgan Kaufmann Publishers Inc.
[36] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph Summarization with Bounded Error. In Proceedings of the 2008 ACM SIGMOD In- ternational Conference on Management of Data, pages 419–432, New York, NY, USA, 2008. ACM.
[37] S. Navlakha, M.C. Schatz, and C. Kingsford. Revealing Biological Modules via Graph Summarization. Journal of Computational Biology, 16(2):253–264, 2009.
[38] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.
[39] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64:026118, Jul 2001.
[40] D. Peleg and A. A. Sch¨affer. Graph spanners. Journal of Graph Theory, 13(1):99–116, 1989.
[41] A. Pothen, H. D. Simon, and K.-P. Liou. Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430–452, July 1990.
[42] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defin- ing and identifying communities in networks. Proceedings of the Na- tional Academy of Sciences of the United States of America, 101:2658– 2663, 2004.
References 47 [43] D. Rafiei. Effectively visualizing large networks through sampling. In 16th IEEE Visualization (VIS 2005), pages 375–382, Washington, DC, USA, 2005. IEEE Computer Society.
[44] R. W. Schvaneveldt. Pathfinder Associative Networks: Studies in Knowledge Organization. Ablex Publishing Corp., Norwood, NJ, USA, 1990.
[45] J. Scott. Social Network Analysis: A Handbook. Sage Publications, UK, January 2000.
[46] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and Applica- tions of Tree and Graph Searching. In PODS ’02: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Princi- ples of Database Systems, pages 39–52, New York, NY, USA, 2002. ACM.
[47] X. Shi, M. Bonner, L. A. Adamic, and A. C. Gilbert. The Very Small World of the Well-connected. In Proceedings of the nineteenth ACM Conference on Hypertext and Hypermedia, pages 61–70, New York, NY, USA, 2008. ACM.
[48] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. In Proceedings of the 2008 ACM SIGMOD In- ternational Conference on Management of Data, pages 567–580, New York, NY, USA, 2008. ACM.
[49] Y. Tian, R. C. Mceachin, C. Santos, D. J. States, and J. M. Patel. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 23(2):232–239, 2007.
[50] Y. Tian and J. M. Patel. TALE: A Tool for Approximate Large Graph Matching. In Proceedings of the 2008 IEEE 24th International Con- ference on Data Engineering, pages 963–972, Los Alamitos, CA, USA, 2008. IEEE Computer Society.
[51] H. Tong and C. Faloutsos. Center-Piece Subgraphs: Problem Defini- tion and Fast Solutions. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 404–413, New York, NY, USA, 2006. ACM.
[52] G. T. Toussaint. The Relative Neighbourhood Graph of a Finite Planar Set. Pattern Recognition, 12(4):261–268, 1980.
[53] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrodl. Constrained K- means Clustering with Background Knowledge. In Proceedings of the eighteenth International Conference on Machine Learning, pages 577– 584, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[54] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
[55] S. White and P. Smyth. Algorithms for Estimating Relative Impor- tance in Networks. In Proceedings of the ninth ACM SIGKDD Inter- national Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 266–275, New York, NY, USA, 2003. ACM.
[56] C.R. Woese, O. Kandler, and M.L. Wheelis. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eu- carya. Proceedings of the National Academy of Sciences, 87(12):4576– 4579, 1990.
[57] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure- based Approach. In Proceedings of the 2004 ACM SIGMOD Interna- tional Conference on Management of Data, pages 335–346, New York, NY, USA, 2004. ACM.
[58] N. Zhang, Y. Tian, and J. M. Patel. Discovery-Driven Graph Sum- marization. In Proceedings of the 26th International Conference on Data Engineering, pages 880–891, Los Alamitos, CA, USA, 2010. IEEE Computer Society.
[59] Y. Zhou, H. Cheng, and J. X. Yu. Graph Clustering Based on Struc- tural/Attribute Similarities. Proceedings of the VLDB Endowment, 2(1):718–729, 2009.
[60] Y. Zhou, H. Cheng, and J. X. Yu. Clustering Large Attributed Graphs: An Efficient Incremental Approach. In Proceedings of the 2010 IEEE International Conference on Data Mining, pages 689–698, Washing- ton, DC, USA, 2010. IEEE Computer Society.
[61] R. Zou and L. B. Holder. Frequent Subgraph Mining on a Single Large Graph Using Sampling Techniques. In Proceedings of the eighth Workshop on Mining and Learning with Graphs, MLG ’10, pages 171– 178, New York, NY, USA, 2010. ACM.