Conclusion and Future Work

7.1 Conclusion

This thesis first examines general query optimization techniques for database federation systems. Because a database federation is by nature both a distributed system and a data integration system, most query optimization approaches designed for distributed databases and data integration systems are also adopted in database federations. I surveyed the related work and found that very few studies address run-time conditions in query optimization. Through both theoretical and experimental analysis, I showed the need to take run-time conditions into account during optimization, including the available buffers and CPU utilization at the data sources and the state of the network environment. I also pointed out the challenges of doing so.

Secondly, this thesis studies two existing approaches, the parametric algorithm and the two-phase algorithm, which can potentially consider run-time conditions when optimizing queries in database federations. However, after analyzing their pros and cons, I found that neither is sufficient for optimizing distributed joins in database federations.

Thirdly, because the target optimization approach is cost-based and is used in a distributed environment, the cost model definition and parallelism constraints are presented. A typical database federation system architecture and its data structures are then introduced.

Fourthly, I proposed the Cluster-and-Conquer algorithm for optimizing distributed joins over a database federation while efficiently considering run-time conditions. The algorithm is motivated by real-world observations as well as by the defects of the existing system architecture. Because the run-time conditions of data sources are prone to fluctuate, only closely connected “neighbor” machines are able to obtain fresh information. Real-world public and enterprise network environments suggest an intuitive way to determine which machines are “closely connected”: the network data transfer cost between them. I therefore proposed to view the whole database federation as a clustered system and to provide each cluster of data sources with its own cluster mediator. Based on this architecture, query optimization is divided into two procedures: the global optimizer decides the inter-cluster operations, and the cluster optimizers handle the sub-queries that run on the data sources within their clusters, taking run-time conditions into consideration. Surprisingly, besides being able to deal with run-time conditions, the Cluster-and-Conquer algorithm also outperforms existing approaches in terms of “cost of costing”. There are three main reasons for this. First, unnecessary inter-cluster operations are naturally removed. Second, each cluster optimizer only needs to process a sub-query plan, which is much simpler than having one centralized optimizer deal with a whole distributed query plan. Third, the network messaging cost is decreased.
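The clustered view described above can be illustrated with a minimal sketch. It is not the implementation used in the prototype: the function names, the cost threshold, and the sample data are all hypothetical, and grouping sources by a fixed transfer-cost threshold (with transitive merging) is only one assumed way of deciding which machines are “closely connected”. The sketch groups data sources into clusters and then assigns the tables of a join query to the cluster whose optimizer would handle them.

```python
from collections import defaultdict


def cluster_sources(sources, transfer_cost, threshold):
    """Group sources linked by a transfer cost of at most `threshold`.

    Assumption: closeness is transitive, so this is a simple
    union-find over the "cheap" links (hypothetical criterion).
    """
    parent = {s: s for s in sources}

    def find(s):
        # Follow parent links to the cluster representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for (a, b), cost in transfer_cost.items():
        if cost <= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for s in sources:
        clusters[find(s)].append(s)
    return sorted(sorted(members) for members in clusters.values())


def split_by_cluster(table_location, clusters):
    """Map each cluster index to the tables its cluster optimizer handles."""
    cluster_of = {s: i for i, members in enumerate(clusters) for s in members}
    sub_queries = defaultdict(list)
    for table, source in table_location.items():
        sub_queries[cluster_of[source]].append(table)
    return {i: sorted(tables) for i, tables in sub_queries.items()}


# Hypothetical federation: s1/s2 share a LAN, s3/s4 share another,
# and the link between the two groups is expensive.
sources = ["s1", "s2", "s3", "s4"]
transfer_cost = {("s1", "s2"): 1.0, ("s3", "s4"): 2.0, ("s2", "s3"): 50.0}
clusters = cluster_sources(sources, transfer_cost, threshold=5.0)

# Tables referenced by a join query and the sources that hold them.
table_location = {"orders": "s1", "customers": "s2", "items": "s4"}
sub_queries = split_by_cluster(table_location, clusters)
```

In this sketch the join between `orders` and `customers` would be planned by the optimizer of the first cluster, while the join with `items` crosses clusters and would be left to the global optimizer.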

Finally, I implemented a prototype federation system with the proposed architecture and optimization algorithm. The experimental results demonstrated the capability and efficiency of the Cluster-and-Conquer algorithm and identified the target environment in which it performs better than related approaches.

7.2 Future Work

The Cluster-and-Conquer algorithm assumes that the clustered view of the data sources is given as input, so a natural extension is to enable the algorithm to gather this information by itself. The prototype system currently has two levels of mediators; it would need to be extended to support multi-level mediators whenever the environment demands them.

Another possible extension is to apply this algorithm to other distributed systems, such as distributed databases and grid computing systems. The cluster-and-conquer philosophy is expected to be useful in large-scale distributed computing environments.

The algorithm could also be extended to process other types of operations, such as aggregates (for example group-by, max and min) and top-K. This thesis mainly discusses the distributed join operation. One can certainly perform the join first and then apply the other operations to the joined result, but there may be better ways to schedule all operations efficiently in a distributed environment.

