TO MULTITENANT DATABASES

With the advent of the ubiquitous Internet, a new trend has emerged: cloud computing, which describes the integration among multiple devices. From an enterprise perspective, a very modern form of application hosting is software-as-a-service (SaaS) [44]. As opposed to traditional on-premises solutions, SaaS customers only need to pay the hosting provider a monthly fee, and service charges cover only the resources actually consumed. Based on the service maturity model, multitenancy is a significant paradigm shift that makes configuring applications simple and easy for customers without incurring extra operation costs.

In the following, we discuss the extensions of the above multiple-query processing and optimization techniques to the domain of multitenant databases.

7.4.1 Sharing in Multitenant Query Processing

Queries for a single tenant have to contend with data from all tenants. However, previous query methods have been inefficient for multitenant databases because it is very difficult for such methods to understand or account for the unique characteristics of each tenant’s data. One tenant’s data may include numerous short records with only a few indexable fields, whereas another’s may include fewer, longer records with numerous indexable fields [44]. Apart from these structural differences, each tenant’s data distribution may also differ, even when the schemas are similar. This poses a challenge for existing relational databases, which periodically gather only aggregate or average statistics over all tenants. Therefore, applying MQO on top of such statistics can lead to incorrect assumptions and poor query plans for any given tenant.
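The effect of aggregate statistics can be illustrated with a minimal Python sketch. The table layout, the `tenant_id` and `status` column names, and the `distinct_counts` helper are assumptions made for illustration and are not taken from [44]; the sketch only contrasts a system-wide distinct-value count with per-tenant counts over one shared table.

```python
from collections import defaultdict


def distinct_counts(rows, column):
    """Return (system-wide distinct count, {tenant_id: distinct count}) for one column."""
    global_values = set()
    per_tenant = defaultdict(set)
    for row in rows:
        value = row[column]
        global_values.add(value)
        per_tenant[row["tenant_id"]].add(value)
    return len(global_values), {t: len(v) for t, v in per_tenant.items()}


# Hypothetical shared table: tenants 1 and 2 use many distinct status values,
# whereas tenant 3 stores the same value in every row.
rows = (
    [{"tenant_id": 1, "status": f"s{i}"} for i in range(50)]
    + [{"tenant_id": 2, "status": f"t{i}"} for i in range(50)]
    + [{"tenant_id": 3, "status": "open"} for _ in range(100)]
)

global_ndv, tenant_ndv = distinct_counts(rows, "status")
print("system-wide distinct values:", global_ndv)
print("per-tenant distinct values:", tenant_ndv)
```

In this toy data set the system-wide count suggests a highly selective column, while for tenant 3 every row carries the same value, so an index on that column would not narrow that tenant's queries.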

A natural way to ameliorate the problem is to share tables among tenants [45,46]. By mapping multiple single-tenant logical schemas to one multitenant physical schema using query transformation rules, the logical tables can be divided into fixed generic structures, such as universal and pivot tables, to avoid interference among tenants. For each table, queries are generated to filter the correct columns and align the different chunk relations based on each TenantId. The shared process then offers bulk execution of administrative queries by allowing them to be parameterized over the domain of each table.
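The mapping from a logical table to a pivot table can be sketched as follows. This is a simplified illustration under stated assumptions, not the transformation rules of [45,46]: the `pivot` table layout (`tenant_id`, `tbl`, `col`, `row_id`, `val`), the helper names `build_pivot_query` and `logical_select`, and the sample data are all hypothetical. A logical query over one tenant's table is rewritten into self-joins of the shared pivot table, with one alias per requested column, aligned on tenant, logical table, and logical row.

```python
import sqlite3


def build_pivot_query(columns):
    """Build SQL that reassembles logical rows from the shared pivot table."""
    if not columns:
        raise ValueError("at least one column is required")
    for col in columns:
        if not col.isidentifier():  # column names are used as SQL aliases
            raise ValueError(f"unsafe column name: {col!r}")
    select = ", ".join(f'p{i}.val AS "{c}"' for i, c in enumerate(columns))
    # Every additional column is one more alias, aligned with p0 on
    # tenant, logical table, and logical row number.
    joins = "".join(
        f" JOIN pivot p{i} ON p{i}.tenant_id = p0.tenant_id"
        f" AND p{i}.tbl = p0.tbl AND p{i}.row_id = p0.row_id AND p{i}.col = ?"
        for i in range(1, len(columns))
    )
    return (
        f"SELECT {select} FROM pivot p0{joins}"
        " WHERE p0.tenant_id = ? AND p0.tbl = ? AND p0.col = ?"
        " ORDER BY p0.row_id"
    )


def logical_select(conn, tenant_id, table, columns):
    """Run the logical query SELECT columns FROM table for a single tenant."""
    sql = build_pivot_query(columns)
    # Placeholders appear in the JOIN clauses first, then in the WHERE clause.
    params = list(columns[1:]) + [tenant_id, table, columns[0]]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pivot (tenant_id INTEGER, tbl TEXT, col TEXT, row_id INTEGER, val TEXT)"
)
conn.executemany(
    "INSERT INTO pivot VALUES (?, ?, ?, ?, ?)",
    [
        (17, "account", "name", 0, "Acme"),
        (17, "account", "city", 0, "Berlin"),
        (17, "account", "name", 1, "Globex"),
        (17, "account", "city", 1, "Paris"),
        (42, "account", "name", 0, "Initech"),
        (42, "account", "city", 0, "Austin"),
    ],
)

accounts_17 = logical_select(conn, 17, "account", ["name", "city"])
print(accounts_17)
```

Because every alias is constrained to the same tenant identifier, rows of tenant 42 never appear in the result for tenant 17, even though both tenants share one physical table.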

In addition, each tenant database may encounter various query expressions (QEs) over different data sources, such as relational and structured XML data. Therefore, a multigraph-based approach is proposed that introduces edges to navigate both the XML nodes and the relational dot notations. By utilizing intrasegment compression techniques and adding new edges, similar nodes can form a subgraph that consists of identical or subsumed conditions.

7.4.2 Multitenant Querying Plans

A more efficient way to build execution plans for multitenant databases is to adopt a two-phase solution with dynamic tuning of database indices [47]. A layer of meta-data associates the data items with tenants via tags, and the meta-data are used to optimize searches by channeling processing resources during a query to only those pieces of data bearing the relevant unique tag. In certain aspects, each tenant’s virtual schema includes a variety of customizable fields, some or all of which may be designated as indexable. One goal of a traditional multiple-query optimizer is to minimize the amount of data that must be read from disk and to choose selective tables or columns that will yield the fewest rows during processing. If the optimizer knows that a certain column has a very high cardinality, it will choose to use an index on that column instead of a similar index on a lower-cardinality column. However, consider a multitenant system in which a physical column has a large number of distinct values for most tenants but a small number of distinct values for one specific tenant. In this case, the overall high-cardinality column strategy will not yield better performance, because the optimizer is unaware that the column is not selective for this specific tenant. Furthermore, by using system-wide aggregate statistics, the optimizer might choose a query plan that is incorrect or inefficient for a single tenant that does not conform to the “normal” average of the entire database as determined from the gathered statistics.

Therefore, the first phase typically includes generating tenant-level and user-level statistics to find suitable tables or columns for the common subexpressions. The statistics gathered include information in entity rows for the tenants being tracked, which is used to make decisions about query access paths, and a list of users who have access to privileged data. The second phase constructs an optimal plan based on the query graph. The difference is that some edges are labeled as directed and a single node may consist of multiple relations, in consideration of the private security model that keeps data or applications separate. The common subexpressions of the first phase are stored by building a many-to-many (MTM) physical table, which can also specify whether a user has access to a particular entity row. When handling multiple queries for entity rows that the current user can see, the optimizer must choose between accessing the MTM table from the user side and from the entity side of the relationship.
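The two decisions described above can be sketched in a few lines of Python. The decision rules shown here are assumptions chosen for illustration and are not the method of [47]: `pick_index_column` prefers the column with the most distinct values for the querying tenant and falls back to system-wide statistics when no tenant-level statistic exists, and `choose_mtm_side` starts from whichever side of the MTM relationship is estimated to touch fewer rows. All names and numbers are hypothetical.

```python
def pick_index_column(tenant_id, candidates, tenant_ndv, global_ndv):
    """Pick the candidate column with the most distinct values for this tenant.

    Falls back to the system-wide count when no tenant-level statistic exists.
    """
    def ndv(column):
        return tenant_ndv.get((tenant_id, column), global_ndv.get(column, 0))

    return max(candidates, key=ndv)


def choose_mtm_side(rows_visible_to_user, rows_matching_filter):
    """Access the MTM table from the side estimated to touch fewer rows."""
    return "user" if rows_visible_to_user <= rows_matching_filter else "entity"


# Hypothetical statistics: number of distinct values (NDV) per column.
global_ndv = {"region": 5000, "owner": 800}
tenant_ndv = {(3, "region"): 2, (3, "owner"): 400}

# Tenant 3 has tenant-level statistics; tenant 9 has none and uses the fallback.
col_for_tenant_3 = pick_index_column(3, ["region", "owner"], tenant_ndv, global_ndv)
col_for_tenant_9 = pick_index_column(9, ["region", "owner"], tenant_ndv, global_ndv)

# A user who can see few rows versus a user who can see almost everything.
side_for_clerk = choose_mtm_side(rows_visible_to_user=40, rows_matching_filter=10_000)
side_for_admin = choose_mtm_side(rows_visible_to_user=900_000, rows_matching_filter=10_000)

print(col_for_tenant_3, col_for_tenant_9, side_for_clerk, side_for_admin)
```

With system-wide statistics alone, `region` would be chosen for every tenant; the tenant-level statistics reveal that `region` has only two distinct values for tenant 3, so `owner` is the more selective choice there.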

7.5 CONCLUSION

In this chapter, we overviewed multiple-query processing and optimization techniques in traditional databases and streaming databases. We also discussed their possible extensions to multitenant multiple-query processing and optimization.

As interesting future work, we see three major issues. First, without a data integration engine in the cloud computing environment, building a cost-based heuristic model that selectively materializes the candidate common subexpressions over diverse data sources requires efficient algorithms. Second, recent studies focus on accurate query evaluation for multitenant databases. It is worthwhile to study approximate query processing and to obtain error-energy trade-offs, especially for stream data. We would like to adapt current techniques to multipath aggregation or join methods that can provide more fault tolerance. Third, there are still open research issues in better employing schema knowledge or integrity constraints to perform query optimization at compile time. It is also important to detect “unsafe” queries with respect to data privacy in multitenant databases.

REFERENCES

1. J. Grant and J. Minker. Optimization in deductive and conventional relational database systems. In Advances in Data Base Theory, H. Gallaire, J. Minker, and J. M. Nicholas (eds.). New York: Springer, pp. 195–234, 1981.

2. S. Finkelstein. Common expression analysis in database applications. Proceedings of the 1982 ACM SIGMOD International Conference on Management of Data. June 2–4, ACM Press, Orlando, FL, pp. 235–245, 1982.

3. P. Larson and H. Yang. Computing queries from derived relations. Proceedings of the 11th International Conference on Very Large Data Bases. August 21–23, Morgan Kaufmann, Stockholm, Sweden, pp. 259–269, 1985.

4. N. Roussopoulos. View indexing in relational databases. ACM Transactions on Database Systems, 7(2): 258–290, 1982.

5. N. Roussopoulos. The logical access path schema of a database. IEEE Transactions on Software Engineering, 8(6): 563–573, 1982.

6. T.K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems, 13(1): 23–52, 1988.

7. P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and extensible algorithms for multi query optimization. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. May 16–18, ACM Press, Dallas, TX, pp. 249–260, 2000.

8. M. Jarke. Common subexpression isolation in multiple query optimization. In Query Processing in Database Systems, W. Kim, D. S. Reiner, and D. S. Batory (eds.). Berlin: Springer-Verlag, 1985.

9. F.C. Fred Chen and M.H. Dunham. Common subexpression processing in multiple-query processing. IEEE Transactions on Knowledge and Data Engineering, 10(3): 493–499, 1998.

10. D.J. Rosenkrantz and H.B. Hunt. Processing conjunctive predicates and queries. Proceedings of IEEE International Conference on Data Engineering. 1980.

11. J. Park and A. Segev. Using common subexpressions to optimize multiple queries. Proceedings of the IEEE International Conference on Data Engineering. February 1–5, Los Angeles, CA, IEEE Computer Society, pp. 311–319, 1988.

12. P.V. Hall. Optimization of a single relational expression in a relational data base system. IBM Journal of Research and Development, 20(3): 244–257, 1976.

13. P.V. Hall. Common subexpression isolation in general algebraic systems. Technical Report UKSC 0060, IBM United Kingdom Scientific Centre, 1974.

14. U.S. Chakravarthy and J. Minker. Multiple query processing in deductive databases using query graphs. Proceedings of the 12th International Conference on Very Large Data Bases. August 25–28, Kyoto, Japan. ACM Press, pp. 384–391, 1986.

15. E. Wong and K. Youssefi. Decomposition: A strategy for query processing. ACM Transactions on Database Systems, 223–241, 1976.

16. U.S. Chakravarthy and A. Rosenthal. Anatomy of a modular multiple query optimizer. Proceedings of International Conference on Very Large Data Bases. August 29–September 1, Los Angeles, CA: Morgan Kaufmann, 1988.

17. T. Sellis. Global query optimization. Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data. May 28–30, Washington, DC, 1986.

18. T. Sellis and S. Ghosh. On the multiple query optimization problem. IEEE Transactions on Knowledge and Data Engineering, 2(2): 262–266, 1990.

19. E.-P. Lim, J. Srivastava, and A. Cosar. An extensive search for optimal multiple query plans. International Conference on Management of Data. June 2–5, San Diego, CA, 1992.

20. A. Cosar, J. Srivastava, and S. Shekhar. On the multiple pattern multiple object match problem. International Conference on Management of Data. May 29–31, Denver, CO, 1991.

21. A. Cosar, E.-P. Lim, and J. Srivastava. Multiple query optimization with depth-first branch-and-bound and dynamic query ordering. Proceedings of the 2nd International Conference on Information and Knowledge Management. Washington, DC. November 1–5, ACM, New York, pp. 433–438, 1993.

22. G. Graefe and W.J. McKenna. The volcano optimizer generator: Extensibility and efficient search. Proceedings of the 9th International Conference on Data Engineering. April 19–23, Vienna, IEEE Computer Society, pp. 209–218, 1993.

23. H. Mistry, P. Roy, S. Sudarshan, and K. Ramamritham. Materialized view selection and maintenance using multi-query optimization. Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Santa Barbara, CA. May 21–24, ACM, New York, pp. 307–318, 2001.

24. A.B. Murat, H.T. Ismail, and C. Ahmet. Genetic algorithm for the multiple-query optimization problem. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 37(1): 147–153, 2007.

25. Z. Yi, L. Qing, and C. Lei. Multi-query optimization for distributed similarity query processing. 28th International Conference on Distributed Computing Systems. June 17–20, Beijing, IEEE Computer Society, pp. 639–646, 2008.

26. J. Grant and J. Minker. On optimizing the evaluation of a set of expressions. Technical Report TR-916, University of Maryland, College Park, MD, July 1980.

27. S. Madden, M. Shah, J.M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. Madison, WI. June 3–6, ACM, New York, pp. 49–60, 2002.

28. R. Avnur and J.M. Hellerstein. Eddies: Continuously adaptive query processing. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Dallas, TX. May 14–19, ACM, New York, pp. 261–272, 2000.

29. R. Vijayshankar, A. Deshpande, and J.M. Hellerstein. Using state modules for adaptive query processing. Proceedings of the 19th International Conference on Data Engineering. March 5–8, Bangalore, India, IEEE Computer Society, pp. 353–364, 2003.

30. S. Wang, E. Rundensteiner, S. Ganguly, and S. Bhatnagar. State-slice: New paradigm of multi-query optimization of window-based stream queries. Proceedings of the 32nd International Conference on Very Large Data Bases. September 12–15, ACM Press, Seoul, Republic of Korea, pp. 619–630, 2006.

31. S. Chandrasekaran and M.J. Franklin. PSoup: A system for streaming queries over streaming data. The VLDB Journal, 12(2): 140–156, 2003.

32. S. Krishnamurthy, C. Wu, and M.J. Franklin. On-the-fly sharing for streamed aggregation. Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. June 27–29, ACM Press, Chicago, IL, pp. 623–634, 2006.

33. S. Chandrasekaran, O. Cooper, and A. Deshpande. TelegraphCQ: Continuous dataflow processing. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. San Diego, CA. June 9–12, ACM, New York, pp. 668–674, 2003.

34. R. Zhang, N. Koudas, B.C. Ooi, and D. Srivastava. Multiple aggregations over data streams. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. Baltimore, MD. June 13–16, ACM, New York, pp. 299–310, 2005.

35. S. Krishnamurthy. Shared query processing in data streaming systems. University of California, Berkeley, CA, 2006.

36. J. Chen, D.J. DeWitt, and J.F. Naughton. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. Proceedings of the 18th International Conference on Data Engineering. February 26–March 1, San Jose, CA, IEEE Computer Society, pp. 345–356, 2002.

37. E. Cesario, A. Grillo, C. Mastroianni, and D. Talia. A sketch-based architecture for mining frequent items and itemsets from distributed data streams. 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. May 23–26, Newport Beach, CA, IEEE Computer Society, pp. 245–253, 2011.

38. N.N. Dalvi, S.K. Sanghai, P. Roy, and S. Sudarshan. Pipelining in multi-query optimization. Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. Santa Barbara, CA. May 21–24, ACM, New York, pp. 59–70, 2001.

39. M. Hong, M. Riedewald, C. Koch, J. Gehrke, and A. Demers. Rule-based multi-query optimization. Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. Saint Petersburg, Russia. March 24–26, ACM, New York, pp. 120–131, 2009.

40. O. Cooper, A. Edakkunni, and M.J. Franklin. HiFi: A unified architecture for high fan-in systems. Proceedings of the 30th International Conference on Very Large Data Bases. August 31–September 3, Morgan Kaufmann, Toronto, ON, pp. 1357–1360, 2004.

41. D. Kossmann, M.J. Franklin, and G. Drasch. Cache investment: integrating query optimization and distributed data placement. ACM Transactions on Database Systems, 25(4): 517–558, 2000.

42. X. Shili, B.L. Hock, T. Kian-Lee, and Z. Yongluan. Two-tier multiple query optimization for sensor networks. 27th International Conference on Distributed Computing Systems. June 25–29, Toronto, ON, IEEE Computer Society, pp. 39–49, 2007.

43. A.P. Boedihardjo, C.-T. Lu, and F. Chen. A framework for estimating complex probability density structures in data streams. Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA. October 26–30, ACM, New York, pp. 619–628, 2008.

44. H. Mei, J. Dawei, L. Guoliang, and Z. Yuan. Supporting database applications as a service. IEEE 25th International Conference on Data Engineering. March 29–April 2, Shanghai, China, IEEE Computer Society, pp. 832–843, 2009.

45. S. Aulbach, T. Grust, and D. Jacobs. Multi-tenant databases for software as a service: Schema-mapping techniques. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, BC. June 10–12, ACM, New York, pp. 1195–1206, 2008.

46. F.S. Foping, I.M. Dokas, and J. Feehan. A new hybrid schema-sharing technique for multitenant applications. 4th International Conference on Digital Information Management. November 1–4, Michigan, IEEE Computer Society, pp. 1–6, 2009.

47. C. Weissman and S. Wong. Query optimization in a multi-tenant database system. US Patent 7,529,728 B2, salesforce.com, 2009.


8 Large-Scale Correlation-Based Semantic Classification Using MapReduce

Fausto C. Fleites, Hsin-Yu Ha,