
for Summarizing Relational Databases Ammar Yasir, Mittapally Kumara Swamy, and Polepalli Krishna Reddy

Algorithm 1. Finding Table Similarity

4.6 Comparison of Clustering Algorithms

In this section, we compare clustering algorithms for schema summary. In addition to the proposed weighted k-center clustering algorithm using an influence function (clust_s), we implement the following clustering algorithms:

– clust_c: a community-detection-based schema summarization approach proposed in [6].

– clust_d: the schema summarization approach proposed in [7]. It uses a weighted k-center clustering algorithm based on a table importance metric.

– clust_v: combines the results of clustering by reference similarity and clustering by document similarity using a voting scheme similar to [8]. This algorithm combines the clusterings produced by different similarity models rather than combining the similarity models themselves.
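The weighted k-center idea shared by clust_s and clust_d can be sketched as a greedy heuristic in the spirit of [21]. This is a minimal sketch under stated assumptions, not the implementation evaluated here: the function name `weighted_k_center`, the use of a precomputed distance matrix (for example, 1 minus similarity), and the per-table weights are all illustrative, and the influence function of clust_s is not modeled.

```python
from typing import List, Sequence, Tuple


def weighted_k_center(
    dist: Sequence[Sequence[float]],
    weight: Sequence[float],
    k: int,
) -> Tuple[List[int], List[int]]:
    """Greedy weighted k-center (hypothetical helper, for illustration only).

    dist[i][j] is an assumed distance between tables i and j;
    weight[i] is an assumed importance score for table i.
    Returns the chosen centers and, for every table, its nearest center.
    """
    n = len(weight)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tables")

    # Start from the most important table.
    centers = [max(range(n), key=lambda i: weight[i])]
    # Distance from each table to its nearest center chosen so far.
    nearest = [dist[i][centers[0]] for i in range(n)]

    while len(centers) < k:
        # Next center: the table with the largest weighted distance
        # to the centers chosen so far.
        candidates = (i for i in range(n) if i not in centers)
        nxt = max(candidates, key=lambda i: weight[i] * nearest[i])
        centers.append(nxt)
        nearest = [min(nearest[i], dist[i][nxt]) for i in range(n)]

    # Assign every table to its closest center.
    assignment = [min(centers, key=lambda c: dist[i][c]) for i in range(n)]
    return centers, assignment


# Toy data: tables 0 and 1 are close, tables 2 and 3 are close.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
toy_weight = [1.0, 0.5, 0.8, 0.4]
toy_centers, toy_assignment = weighted_k_center(toy_dist, toy_weight, k=2)
```

In this toy run the two most important, mutually distant tables become centers and the remaining tables attach to the nearer one. How the weights are obtained (table importance in [7], or an influence function in the proposed approach) is outside the scope of the sketch.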

Figure 6 shows the clustering accuracy achieved by the various clustering algorithms for k = 2, 3, and 4. We observed that clust_s and clust_d achieve nearly the same accuracy, with clust_s slightly higher because it successfully clustered the table trade_request. If no active transactions are considered for the TPCE database, the table trade_request is empty, and data-oriented approaches are unable to classify it.

Fig. 6. Clustering accuracy for different clustering algorithms (x-axis: k = 2, 3, 4; y-axis: acc, 0 to 1; series: clust_s, clust_v, clust_d, clust_c)

For the clust_v and clust_c approaches, no specific patterns were observed. The accuracy of clust_v is low because referential similarity yields a very imbalanced and ineffective clustering, which significantly degrades the overall clustering accuracy of the voting scheme.
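The effect of an imbalanced base clustering on a voting scheme can be illustrated with a small pairwise-voting sketch. It is an assumption-laden illustration rather than the scheme of [8] or the clust_v implementation: the function name `consensus_by_voting`, the `min_votes` threshold, and the toy label lists are all hypothetical. Two tables are grouped together when at least `min_votes` base clusterings place them in the same cluster, and groups are merged transitively with a union-find structure.

```python
from typing import Dict, List, Sequence


def consensus_by_voting(
    clusterings: Sequence[Sequence[int]],
    min_votes: int,
) -> List[int]:
    """Combine base clusterings by pairwise voting (illustrative only).

    clusterings[m][i] is the cluster label of table i in base clustering m.
    Returns consensus labels numbered in order of first appearance.
    """
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Count the base clusterings that put tables i and j together.
            votes = sum(1 for labels in clusterings if labels[i] == labels[j])
            if votes >= min_votes:
                parent[find(j)] = find(i)

    roots: Dict[int, int] = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Assumed toy labels: a balanced document-based clustering and an
# imbalanced reference-based clustering with one dominant cluster.
doc_labels = [0, 0, 1, 1, 2, 2]
ref_labels = [0, 0, 0, 0, 0, 1]

strict = consensus_by_voting([doc_labels, ref_labels], min_votes=2)
lenient = consensus_by_voting([doc_labels, ref_labels], min_votes=1)
```

With these toy labels, requiring both votes splits a pair that the document-based clustering had grouped, while accepting a single vote lets the dominant reference cluster absorb every table. Either way the imbalanced clustering distorts the combined result, which is consistent with the behavior reported for clust_v.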

5 Conclusions and Future Work

Schema summarization has been proposed in the literature to help users explore complex database schemas. Existing approaches to schema summarization are data-oriented. In this paper, we proposed a schema summarization approach for relational databases that uses the database schema and the database documentation. We proposed a combined similarity measure that incorporates similarities from both sources, together with a framework for summary generation. Experiments conducted on a benchmark database showed that the proposed approach is as effective as the existing data-oriented approaches. We plan to extend this work by developing algorithms for learning the values of the various parameters used in the proposed approach.

References

1. Nandi, A., Jagadish, H.V.: Guided interaction: Rethinking the query-result paradigm. PVLDB 4(12), 1466–1469 (2011)

2. Jagadish, H.V., Chapman, A., Elkiss, A., Jayapandian, M., Li, Y., Nandi, A., Yu, C.: Making database systems usable. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD 2007, pp. 13–24. ACM, New York (2007)

4. Doan, A., Halevy, A.Y.: Semantic-integration research in the database community. AI Mag. 26(1), 83–94 (2005)

5. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. The VLDB Journal 10(4), 334–350 (2001)

6. Xue Wang, X.Z., Wang, S.: Summarizing large-scale database schema using community detection. Journal of Computer Science and Technology, SIGMOD 2008 (2012)

7. Yang, X., Procopiuc, C.M., Srivastava, D.: Summarizing relational databases. Proc. VLDB Endow. 2(1), 634–645 (2009)

8. Wu, W., Reinwald, B., Sismanis, Y., Manjrekar, R.: Discovering topical structures of databases. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, pp. 1019–1030. ACM, New York (2008)

9. Bergamaschi, S., Castano, S., Vincini, M.: Semantic integration of semistructured and structured data sources. SIGMOD Rec. 28(1), 54–59 (1999)

10. Palopoli, L., Terracina, G., Ursino, D.: Experiences using DIKE, a system for supporting cooperative information system and data warehouse design. Inf. Syst. 28(7), 835–865 (2003)

11. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases, VLDB 2001, pp. 49–58. Morgan Kaufmann Publishers Inc., San Francisco (2001)

12. TPCE, http://www.tpc.org/tpce/

13. Clarke, C.L.A., Cormack, G.V., Kisman, D.I.E., Lynam, T.R.: Question answering by passage selection (MultiText experiments for TREC-9). In: TREC (2000)

14. Ittycheriah, A., Franz, M., Jing Zhu, W., Ratnaparkhi, A., Mammone, R.J.: IBM's statistical question answering system. In: Proceedings of the Tenth Text Retrieval Conference, TREC (2000)

15. Salton, G., Allan, J., Buckley, C.: Approaches to passage retrieval in full text information systems. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1993, pp. 49–58. ACM, New York (1993)

16. Tellex, S., Katz, B., Lin, J., Fernandes, A., Marton, G.: Quantitative evaluation of passage retrieval algorithms for question answering. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2003, pp. 41–47. ACM, New York (2003)

17. Wang, M., Si, L.: Discriminative probabilistic models for passage based retrieval. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 419–426. ACM, New York (2008)

18. Xi, W., Xu-Rong, R., Khoo, C.S.G., Lim, E.-P.: Incorporating window-based passage-level evidence in document retrieval. JIS 27(2), 73–80 (2001)

19. Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi at TREC-3, pp. 109–126 (1996)

20. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. Proc. VLDB Endow. 2(1), 718–729 (2009)

21. Dyer, M., Frieze, A.: A simple heuristic for the p-centre problem. Oper. Res. Lett. 3(6), 285–288 (1985)

22. Rangrej, A., Kulkarni, S., Tendulkar, A.V.: Comparative study of clustering techniques for short text documents. In: Proceedings of the 20th International Conference Companion on World Wide Web, WWW 2011, pp. 111–112. ACM, New York (2011)

Ullas Nambiar¹, Tanveer Faruquie², Shamanth Kumar³, Fred Morstatter³, and Huan Liu³

¹ EMC India COE, Bangalore, India

[email protected]

² IBM Research Lab, New Delhi, India

[email protected]

³ Computer Science & Engg, SCIDSE, Arizona State University, USA

{skumar34,fmorstat,huan.liu}@asu.edu

Abstract. The popularity of social media as a medium for sharing information has made extracting information of interest a challenge. In this work we provide a system that can return posts published on social media covering various aspects of a concept being searched. We present a faceted model for navigating social media that provides a consistent, usable and domain-agnostic method for extracting information from social media. We present a set of domain independent facets and empirically prove the feasibility of mapping social media content to the facets we chose. Next, we show how we can map these facets to social media sites, living documents that change periodically to topics that capture the semantics expressed in them. This mapping is used as a graph to compute the various facets of interest to us. We learn a profile of the content creator, enable content to be mapped to semantic concepts for easy navigation and detect similarity among sites to either suggest similar pages or determine pages that express different views.
