UNIVERSITY OF CALIFORNIA
Santa Barbara
Elasticity Primitives for Database as a Service
A Dissertation submitted in partial satisfaction
of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
by
Aaron J. Elmore
Committee in Charge:Professor Divyakant Agrawal, Co-Chair Professor Amr El Abbadi, Co-Chair Professor Xifeng Yan
Professor Kenneth Salem
The Dissertation of Aaron J. Elmore is approved:
Professor Xifeng Yan
Professor Kenneth Salem
Professor Divyakant Agrawal, Committee Co-Chair
Professor Amr El Abbadi, Committee Co-Chair
Elasticity Primitives for Database as a Service
Copyright© 2014 by
Acknowledgements
The company we keep through journeys is just as important as the journey we take. I have been extremely fortunate with the quality and quantity of those around me on my journey so far. I know words alone are not enough to properly thank everyone, but it is a good start.
First and foremost, I acknowledge my advisors “Divy” Agrawal and Amr El Abbadi. From the start, they displayed an uncanny ability to guide and lead while fostering independence. Their ability to understand problems and clarify the murky is something I will always strive towards. Their almost thirty year partnership as academics is truly inspiring. Beyond being excellent mentors and researchers, the two of them have always demonstrated the importance of giving respect and support, taking time to enjoy life and your family, and remaining modest.
I am ever grateful for having great collaborations with Sudipto Das during my early doctorate years. I have no doubt that any success I achieve, will have also been fostered by Sudipto’s influence. Sudipto instilled an ability to critically question your own problems, assumptions, and solutions. He also taught me that working with people who are better than yourself is the surest way to grow. His perseverance and work ethic is the model of a great researcher.
In addition to Sudipto, I am extremely appreciative of all lab mates I have had throughout my time at UC Santa Barbara. Having great people to share victories and defeats with certainly has enriched my experiences. Finishing the doctorate journey would not have been possible without their friendship, advice, contribu-tions, and encouragement. Ceren, Shiyuan, Shoji, Suditpto, Shyam, Zhengkui, Hatem, Alex, Faisal, Vaibhav, Xiaofei, Theo, Siva, Jiaci, and Cetin have all made long hours and days much more enjoyable
Towards the end my doctorate I was fortunate to work with Andy Pavlo. While Andy is known for his personality, he is an excellent collaborator and mentor. His willingness to go above and beyond to help people around him is a gift. Without a doubt, I know Andy will produce amazing students throughout his new academic career and I am indebted on his advice during my transition out of doctorate studies.
I am extremely grateful for my committee members Xifeng Yan and Ken Salem. As committee members they were wonderful in providing feedback, helping me question research directions, and being encouraging about taking new steps pro-fessionally. I appreciate all their understanding and patience.
Our department has been fortunate to have Janet Kayfetz help our students learn how to more effectively communicate ideas in both writing and presenting. Janet takes a deep interest in her students and I am extremely glad for the skills
I have been extremely fortunate to work with amazing colleagues and mentors during my summers. Phil Bernstein graciously had me join him one summer at Microsoft Research. Working with such a titan of research was a transformative experience. Phil’s approach to problems and research showed me the importance of details and clarity. I am indebted to his mentorship and career advice.
I was also fortunate to be invited by Adam Silberstein to join Trifacta for one summer. Adam was an excellent mentor and was always happy to share his valuable insights and experiences with large scale systems. Along with Adam, I was lucky to work with a lot of very talented people at Trifacta during a formative time. I am especially grateful for my experiences with Joe Hellerstein, Jeff Heer, Eric Bothwell, and Sean Kandel. These fine people at Trifacta are not only in the midst of building a great company, but they all took the time and interest to mentor and guide me as I move to the next stage in life.
My time at school has been enriched by getting to interact with some incred-ible staff, faculty, and students. There are too many people to name, but I am extremely happy to have worked with Mark and Jim at NCEAS, Zarko at the Computation Institute, and Adam at UCSB. Soren Llyod and Leo Irakliotis were instrumental in encouraging me to reach my potential early on.
My friends throughout California, Chicago, Denver, Seattle, New York, and Tel Aviv have been constant source of support. I cannot express how much they have done for me throughout the years. My family has also been incredibly supportive. Most importantly they encouraged me to start and finish the doctorate. My parents instilled an incredible sense of determination, confidence, and wonder that equipped me to get this far.
Lastly and most importantly, I must thank my partner and wife Emily. She has made me into a more thoughtful, compassionate, caring, and well-rounded individual. While undertaking a doctorate one can lose sight of the world around them as they dive deeper and deeper. Emily is always there trying to anchor me to the world around me. I am eternally grateful for her support and belief. She is honestly my better half.
Curriculum Vitæ
Aaron J. Elmore
Education
2014 Doctor of Philosophy in Computer Science, University of California, Santa Barbara.
2009 Master of Sciencein Computer Science, University of Chicago. 2002 Bachelor of Sciencein Electronic Commerce Technologies,
DePaul University.
Experience
2013 Software Engineering Intern
Trifacta San Francisco, CA
2012 Research Intern
Microsoft Research Redmond, WA.
2010 Software Engineering Intern
Amazon Web Services Seattle, WA
2010–2013 Research Assistant
National Center for Ecological Analysis and Synthe-sis (NCEAS) Santa Barbara, CA
2010–2013 Teaching Assistant
University of California, Santa Barbara Santa
Barbara, CA
2010–2014 Research Assistant
Distributed Systems Lab, UCSB Santa Barbara, CA
2008–2009 Research Assistant
2006-2008 Software Engineer
1SYNC / GS1US Chicago, IL
2005–2006 Software Engineer
JC Whitney Chicago, IL
2002–2005 Software Engineer
The Incrementum Group, LLC Chicago, IL
2000–2002 Computer Science Tutor
DePaul University Chicago, IL
Selected Publications
Aaron J. Elmore, Vaibhav Arora, Andrew Pavlo, Divyakant Agrawal, Amr El Abbadi. “Squall: Fine-Grained Live Recongu-ration for Partitioned Main Memory Databases”,In Submission. Aaron J. Elmore, Carlo Curino, Divyakant Agrawal, Amr El Abbadi. “Towards Database Virtualization for Database as a Service”, VLDB 2013 (Tutorial).
Aaron J. Elmore, Sudipto Das, Alexander Pucher, Divyakant Agrawal, Amr El Abbadi, Xifeng Yan. Characterizing Tenant Behavior for Placement and Crisis Mitigation in Multitenant DBMSs”, ACM International Conference on Management of
Data (SIGMOD) 2013: 517-528.
Stacy Patterson, Aaron J. Elmore, Faisal Nawab, Divyakant Agrawal, Amr El Abbadi. Serializability, not Serial: Concur-rency Control and Availability in Multi-Datacenter Datastores”, Very Large Data Bases (VLDB) 2012: 1459-1470.
Abbadi. InfoPuzzle: Exploring Group Decision Making in Mo-bile Peer-to-Peer Databases”, Very Large Data Bases (VLDB) 2012: 1998-2001.
Divyakant Agrawal, Amr El Abbadi, Beng Chin Ooi, Sudipto Das, Aaron J. Elmore. The evolving landscape of data man-agement in the cloud”, International Journal of Computational Science and Engineering 2012.
Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, Amr El Ab-badi. Zephyr: Live Migration in Shared Nothing Databases for Elastic Cloud Platforms”,ACM International Conference on
Management of Data (SIGMOD) 2011: 301-312.
Divyakant Agrawal, Amr El Abbadi, Sudipto Das, Aaron J. El-more. Database Scalability, Elasticity, and Autonomy in the Cloud”. 16th International Conference on Database Systems for Advanced Applications (DASFAA) 2011: 2-15.
Aaron J. Elmore, Sudipto Das, Divyakant Agrawal, Amr El Ab-badi. Towards an Elastic and Autonomic Multitenant Database”, 6th International Workshop on Networking Meets Databases (NetDB) 2011.
Honors and Awards
SIGMOD Student Travel Grant UCSB Senate Travel Grant
Amazon Research Grant, Winter 2011 Outstanding Teaching Assistant, UCSB Computer Science Merit Fellowship
Top of the Class Honors, DePaul University
Professional Activities
2014 ACM SIGMOD Demo PC Member
2012–2013 UCSB Computer Science Faculty Recruitment Graduate Representative
Representative
2011–2013 External reviewer for ODBASE 2011, VLDB 2012, COMAD 2012-2013, Middleware 2012
2011–2012 Reviewer for Transactions on Computers, Transactions on Storage
2011 Helped organize the NSF Workshop Science of Cloud held in March 2011
Abstract
Elasticity Primitives for Database as a Service
Aaron J. ElmoreTransactional databases are a critical component in data intensive applications. They enable application developers to persist and query data without having to design for concurrency control, fault tolerance, atomic multi-operation transac-tions, or physical storage layout. Due to the utility of databases and their general purpose design they are widely used within organizations. However, databases are predicated on an architecture that assumes one database instance is dedicated to hosting a single application. Organizations managing many small databases with fluctuating requirements face wasted resources and redundant costs. Build-ing a database-as-a-service platform allows for the effective consolidation of many databases into a reduced number of servers.
This dissertation focuses on the primitives, or tools, required to transform traditional database architectures into a distributed, scalable, and self-managed data platform. The presented primitives enable system elasticity, or the ability for a system to dynamically adapt the available capacity in response to changing resource requirements. First, we propose a self-managed controller to leverage expert administrators in managing database placement and maintaining system performance. This controller provides a method to identify resource requirements at runtime and a method to empirically learn how various databases will be-have when colocated. These techniques are utilized to place databases and load-balance the system when resources are constrained. Second, this dissertation presents two techniques to migrate databases between servers without making the system unavailable for applications. These advances include the live migration of shared nothing databases and the live reconfiguration of partitioned main-memory databases. The presented primitives are critical steps in building a scalable data-base platform to host many applications using existing datadata-base architectures.
Contents
Acknowledgements ix
Curriculum Vitæ xi
Abstract xv
List of Figures xxi
List of Tables xxiii
1 Introduction 1
1.1 The Need for Database-as-a-Service . . . 1
1.2 Challenges Faced with DBaaS . . . 3
1.3 The Need for Elasticity Primitives . . . 6
1.4 Dissertation Overview . . . 8
1.4.1 Modeling and Placement Primitives . . . 9
1.4.2 Movement Primitives . . . 10
1.5 Contributions . . . 11
2 Background 13 2.1 Multitenancy Models . . . 13
2.2 Multitenancy for the Cloud . . . 16
2.3 Recent Multitenant Systems . . . 17
I
Modeling and Placement Primitives
21
3 Pythia 23 3.1 Challenges in Multitenancy . . . 233.2 Controller for a Multitenant DBMS . . . 25
3.6 Problem Formulation . . . 30
3.7 Pythia: Learning Behavior . . . 30
3.7.1 Tenant Feature Selection . . . 31
3.7.2 Resource-based Tenant Model . . . 32
3.7.3 Resource-based classes . . . 33
3.7.4 Training the model . . . 34
3.8 Node Model for Tenant Packing . . . 36
3.8.1 Utilizing Machine Learning . . . 37
3.9 Delphi Implementation . . . 38
3.9.1 Statistics Collection. . . 38
3.10 Crisis Detection and Mitigation . . . 39
3.10.1 Monitoring and Crisis Detection . . . 39
3.10.2 Crisis Mitigation . . . 40
3.11 Experimental Evaluation . . . 42
3.11.1 Benchmark and Tenant Description . . . 42
3.11.2 Model Evaluation . . . 44
3.11.3 Tenant Model Evaluation . . . 44
3.11.4 Node Model Evaluation . . . 45
3.11.5 Crisis Mitigation . . . 46
3.12 Summary . . . 49
II
Movement Primitives
53
4 Forms of Database Migration 55 4.1 Asynchronous migration . . . 56 4.2 Synchronous migration . . . 56 4.3 Live migration. . . 57 5 Zephyr 61 5.1 Background . . . 63 5.1.1 System Architecture . . . 63 5.1.2 Migration Cost . . . 635.1.3 Known Migration Techniques . . . 64
5.2 Zephyr Design . . . 66
5.2.1 Design Overview . . . 67
5.2.2 Migration Cost Analysis . . . 70
5.3.1 Isolation guarantees. . . 71
5.3.2 Fault tolerance . . . 73
5.3.3 Migration Safety and Liveness . . . 77
5.4 Optimizations and Extensions . . . 78
5.4.1 Replicated Tenants . . . 79
5.4.2 Sharded Tenants . . . 79
5.4.3 Data Sharing in Dual Mode . . . 79
5.5 Implementation Details . . . 81 5.6 Experimental Evaluation . . . 82 5.6.1 Benchmark Description . . . 83 5.6.2 Migration Cost . . . 84 5.7 Summary . . . 88 6 Squall 89 6.1 Background . . . 91 6.1.1 H-Store Architecture . . . 91 6.1.2 Database Partitioning . . . 93 6.2 Motivation . . . 95
6.2.1 The Need for Reconfiguration . . . 95
6.2.2 The Impact of Reconfiguration . . . 96
6.3 Overview of Squall . . . 97
6.3.1 Initialization. . . 98
6.3.2 Data Migration . . . 99
6.3.3 Termination . . . 100
6.4 Managing Data Migration . . . 101
6.4.1 Identifying Migrating Data . . . 101
6.4.2 Reactive Migration . . . 103 6.4.3 Asynchronous Migration . . . 106 6.4.4 Replication Management . . . 107 6.5 Fault Tolerance . . . 107 6.5.1 Failure Handling . . . 107 6.5.2 Crash Recovery . . . 109
6.6 Dynamic Data Chunking . . . 109
6.7 Experimental Evaluation . . . 111
6.7.1 Workloads . . . 111
6.7.2 Cluster Expansion . . . 114
6.7.3 Cluster Consolidation. . . 115
6.7.4 Database Size Sensitivity Analysis. . . 117
III
The End for Now
119
7 Conclusion and Future Work 121
7.1 Conclusion . . . 121 7.2 Future Work . . . 124
Bibliography 127
List of Figures
1.1 A shared nothing multitenant DBMS architecture. . . 4
3.1 Pythia incrementally learns behavior. . . 26
3.2 Overview of Delphi’s architecture. . . 27
3.3 Effects of throughput on cache impedance. . . 29
3.4 Tenant model resource consumption when run in isolation. . . 45
3.5 Node model performance by label confidence . . . 46
3.6 Comparing improvements to nodes in violation, and the impact on nodes not in violation . . . 51
3.7 Tenant latencies by platform total tenant count. . . 52
5.1 Timeline for different phases during migration. Vertical lines cor-respond to the nodes, the broken arrows represent control messages and the thick solid arrows represent data transfer. Time progresses from top towards the bottom. . . 67
5.2 Ownership transfer of the database pages during migration. Pi represents a database page and a white box around Pi represents that the node currently owns the page. . . 68
5.3 B+ tree index structure with page ownership information. A sen-tinel marks missing pages. An allocated database page without owner-ship is represented as a grayed page. . . 69
5.4 Impact of the distribution of reads, updates, and inserts on migra-tion cost; default configuramigra-tions used for rest of the parameters. We also vary the different insert ratios – 5% inserts correspond to a fixed per-centage of inserts, while 1/4 inserts correspond to a distribution where a fourth of the write operations are inserts. The benchmark executes 60,000 operations. . . 83
5.5 Impact of varying the transaction size and load on number of failed transactions. We also report the slope of an approximate linear fit of the points in a series. . . 86
6.1 The H-Store architecture from [66]. . . 92 6.2 Simple TPC-C data, showingWAREHOUSEandCUSTOMERpartitioned by warehouse IDs. . . 94 6.3 A sample partition plan to control data layout. For TPC-C in this example, all tables are either replicated or partitioned by their foreign key relationship to the warehouse table. . . 94 6.4 As workload skew increases on a single warehouse in TPC-C, the collocated warehouses experience reduced throughput due to contention. 95 6.5 As a systems partition plan changes, Squall must manage and track the progress of reconfiguration at each node to ensure correct data own-ership in a lightweight manner. . . 97 6.6 Sample Updated Partition Plan. . . 100 6.7 Tracking the partition’s progress at different granularities. . . 104 6.8 Partition Addition– A reconfiguration to expand a cluster with two nodes from 6 partitions to 8 partitions. This expansion acts a reshuf-fle as all data items are evenly distributed between 8 partitions after reconfiguration. . . 112 6.9 Node Addition – A reconfiguration to expand from 4 partitions on one node to 8 partitions on two nodes. This expansion attempts to minimize the data movement, by having each partition migrate half of its data to exactly one new partition. . . 113 6.10 Node Removal– A reconfiguration to contract from 8 partitions on two nodes to 4 partitions on one node. . . 116 6.11 The impact of migrating larger databases on mean throughput. . 117
List of Tables
2.1 Multitenant database models, how tenants are isolated, and the corresponding cloud computing paradigms. . . 14 4.1 Summary of the forms of migration and the associated costs. . . . 55 5.1 Notational Conventions. . . 66
Chapter 1
Introduction
The bureaucracy is expanding to meet the needs of the expanding bu-reaucracy.
Oscar Wilde
1.1
The Need for Database-as-a-Service
Transactional databases are a critical component in data intensive applica-tions. They enable application developers to persist and query data without having to design for concurrency control, fault tolerance, atomic multi-operation transactions, or physical storage layout. Due to the utility of databases and their general purpose design they are widely used within organizations. However, with disjoint project development teams, acquisitions, and distinct databases for de-velopment practices large organizations can experience database proliferation. In one extreme cases, a telecommunications company was found to manage 20,000 separate database instances [25]. This proliferation comes at a high cost to orga-nizations faced with managing such a large scale of database instances.
Databases are predicated on an architecture that assumes a server is primarily dedicated to hosting the database instance. Often each instance hosts a single application’s database. This architecture is an artifact of decades of database re-search and development focused on providing general purpose, high performance databases to fully utilize a machine’s resources to support high throughput ap-plications with low-latency response times. Many database vendors have created highly configurable databases to support a wide variety of application require-ments. Tuning configuration parameters, such as the amount of memory dedicated for caching data, how concurrent operations are serialized, or the amount of ac-ceptable time to delay log flushes have significant impact on the performance and
guarantees provided by a database system. Given the variety and ramifications of potential database configurations, organizations rely on skilled database admin-istrators (DBAs) to properly tune databases by working closely with application developers and system architects. The ability to properly tune and configure a database is often gained through years of administration experience.
In addition to utilizing expert administrators to optimize performance, modern database systems rely on scaling up the capacity of powerful servers to handle demanding applications. Databases greedily consume and explicitly manage the physical resources of a server. Therefore adding memory, faster persistent stor-age, larger CPUs, or faster network devices can often resolve performance issues. Research has investigated the use of parallel [34] and distributed databases [60] to increase performance through scaling out across multiple servers. However, the performance implications of distributed transactions has limited popularity of these databases for update intensive workloads. The need for skilled DBAs and expensive dedicated hardware, combined with expensive licenses for popu-lar commercial DBMSs, results in databases being an expensive part of software application stacks.
With databases being an expensive component, organizations that host many databases incur exacerbated and redundant costs. Many factors contribute to database proliferation. Organizations can maintain many product licenses across different vendors to support distinct application requirements (e.g. spatial func-tionality, text search), support legacy applications (e.g. deprecated functionality), and support distinct databases for development practices (e.g. separate produc-tion, QA, and development databases). A large number of databases drives up capital expenditures not only for purchasing the servers and licenses, but for sup-port staff to manage the physical machines and recurring costs for the storage, power, and cooling of the servers. With the architecture that assumes a dedicated database instance or server per application, this proliferation results in high costs for managing all hosted databases. The high costs and wasted resources associated with managing multiple databases creates a demand for solutions in consolidating databases into fewer servers. Implementing efficient consolidation at the database tier, requires transforming a traditional DBMS into a multitenant DBMS that can effectively share resources between many hosted applications, ortenants.
The rise of cloud computing as a successful computing paradigm has demon-strated the benefits of consolidating various compute components into a multi-tenant service offering. This includes low level offerings to provide on-demand virtual computers in an Infrastructure-as-a-Service (IaaS) platform, to a shared application stack hosted in a Software-as-a-Service (SaaS) platform. Cloud com-puting offerings are successful for service providers due to their ability to leverage
Challenges Faced with DBaaS – Section 1.2
economies of scale to amortize the cost of each service instance. The costs of run-ning a server does not vary greatly with server utilization. The majority of costs derives from purchasing the server itself, power, cooling, physical space to store the server, and human administrators. These costs do not significantly change if the server is utilizing 5% or 80% of its resources. Services hosted on idle or low usage servers could be consolidated to fewer machines to lower the total operating costs. Therefore, the specialization of large scale hosting encourages effective consolida-tion to maximises resource utilizaconsolida-tion in a shared infrastructure. Conversely, users of cloud service platforms are attracted by a pay-as-you-go model that does not require significant initial investment in capital or development. While an effective pricing model and low vendor tie-ins attract users to cloud computing models, the performance and availability guarantees need to meet the application require-ments. Striking a balance between service costs, which is largely determined by consolidation, and performance is a critical challenge in building a service offering. With the popularity and high costs of databases, a Database-as-a-Service
(DBaaS) offering is appealing to both application developers and organizations hosting many databases. Here applications, or users, rent a virtual database from the service provider. To the application it appears if they have a dedicated and isolated database instance. In reality the user is unlikely to acquire a dedicated database instance, but a slice of a shared database application. Use of standard database APIs (e.g. JDBC or ODBC), reduce concerns about vendor tie-in and minimize modifications to migrate from a hosted database to a database service. A Database-as-a-Service can be used a public cloud offering, such as Amazon’s Relational Database Service (RDS), or as internal service, such as a university offering a consolidated database platform for department usage.
1.2
Challenges Faced with DBaaS
A Database-as-a-Service platform orchestrates a cluster of multitenant data-base servers to appear as a monolithic datadata-base to application developers. Fig. 1.1 demonstrates a conceptual Database-as-a-Service architecture which is composed of multiple servers, each hosting multiple database applications. In addition to a transformed DBMS engine to support multitenancy, new components such query routers and system controllers are needed to manage a cluster of database servers. Designing such a database platform requires many architectural decisions, includ-ing mechanisms for how DBMSs are multiplexed between applications or how tenant placement decisions are made. To limit the scope of architectural deci-sions, a database platform will likely target one of two major databases use cases.
Figure 1.1: A shared nothing multitenant DBMS architecture.
that serve applications with frequent short read and write operations. Often these operations are rolled into an atomic transaction that is serialized against con-current transactions. Online analytic processing (OLAP) systems focus on read-heavy analytic and data mining workloads, with updates batched through an load process. These systems may also be referred to as a decision support system (DSS). Since the use cases differ, organizations will run distinct analytic and transactional database systems to insulate the often customer facing OLTP workloads from the sporadic and resource intensive OLAP workloads. However, this separation does not preclude the use of analytic queries in an OLTP system, and vice-versa. While the challenges faced in building an OLTP platform and OLAP platform are similar, the constraints, goals, and solutions will vary. This dissertation focuses on solutions for an OLTP database service.
To make an OLTP focused Database-as-a-Service offering practical for ap-plication developers, guarantees for availability and performance are needed to form expectations and promises between the parties. Aservice level objective (SLO) is a guarantee for a single performance metric. Common SLOs typically include uptime (availability), or an operation latency response time for simple op-erations. Aservice level agreement (SLA)can be synonymous with SLO, but often it is an agreement between the provider and user that encompasses multiple SLOs. SLOs and SLAs are provided in a variety of methods, but an economic incentive model is typically used. Here the user pays for the service, either per hour or per operation, and the provider suffers a penalty for SLO violations. One example is that a user pays per GB of data stored and a nominal fee for each
is-Challenges Faced with DBaaS – Section 1.2
sued query, and if the query violates a latency SLO the user is refunded a certain amount for that query.
For a system hosting multiple tenants there must be aprovisioning strategy
that allocates the number of physical servers required and maps tenants to each server. Provisioning strategies ensure that each tenant has a suitable amount of resources to process requests in a timely manner. When a tenant is added to the system, decisions about the tenant’s initial placement must be made. A consolidation primitive will determine how to initially place tenants. As tenants consume a set of resources (e.g. CPU cycles, IOPS, or memory) and each server has a fixed amount of resources, the initial placement of tenants to servers is often viewed as a multidimensional knapsack problem. Several heuristics have been proposed to address the initial placement, or consolidation, of tenants [57, 25].
While the initial tenant consolidation is a primary concern in multitenant sys-tems, it alone does not provide continual resource effectiveness. Many tenant applications exhibit temporal usage patterns. These patterns may be recurring (e.g. diurnal) or seasonal (e.g. course registration systems). A traditional method for consolidation is to profile the application in isolation in order to identify the ex-pected usage and resource requirements. This technique is referred to assandbox profiling. Once the resource requirements are established for all applications, the tenants are provisioned to support their highest expected level of usage. While this peak provisioning approach ensures that under most circumstances a ten-ant has ample resources, it is an intensive and brittle approach to consolidation. If the application experiences variance in usage patterns, then during low periods of activity the system isover-provisionedand resources are idle. Since there are many fixed costs for running servers (e.g. power, cooling, and space), idle resources are effectively wasted resources. Additionally, if an application is web facing it is subject to sudden changes in normal usage level (e.g. flash crowds). These shifts in usages, either sudden or gradual, can cause an application to use more resources than its historical peak. In these cases the system can become over-utilized and colocated tenants may not have sufficient resources to meet SLOs. This perfor-mance crisisrequires an adjustment to the system’s provisioning strategy. These scenarios highlight why consolidation alone is not enough to maximize physical resources while ensuring performance objectives.
To support performance SLOs, a system controller will need to monitor ten-ant activity and react when a SLO violation does occur. If hardware failures or software bugs are ruled out as reasons for the SLO violation, then the system can assume a behavior change created the violation. To respond to this performance crisis, a controller must implement some form of resource isolation to ensure tenants receive ample resources to meet their performance objectives. Methods
for implementing resource isolation include (i) changing the mapping of tenants to servers to change resource utilization patterns, (ii) adding additional servers to the system and place some tenants on the new servers, (iii) implementing resource control mechanisms to limit resources consumed by a given tenant, or (iv) rate limiting the client queries either at the database or query router. The class of solutions related to the placement of tenants to ensure resources or performance is often referred to as soft isolation [62], the class of solutions related to con-trolling how resource are shared between tenants is referred to as resource allo-cation[63], and solutions related to limiting or queueing queries is often referred to as admission control [86]. While research is ongoing for these orthogonal approaches to resource isolation, this thesis focuses on the elasticity primitives (or tools) needed to enable a soft isolation based multitenant database platform. We focus on soft isolation for three primary reasons. First, soft isolation enables a database platform to be built with little modification to existing database kernels. Usingvanilla database releases or limited patches to popular releases increases the opportunity for adoption and impact. Second, if the database platform is built on an elastic infrastructure (i.e. easy to provision additional servers), then a system does not have to limit tenant requests, either from queries or resource requests, to ensure resource isolation. Third, soft isolation enables the flexible sharing of resources which allows tenants to consume additional resources when needed and to relinquish resources when not needed.
1.3
The Need for Elasticity Primitives
A Database-as-a-Service offering can host hundreds to thousands of databases across tens to hundreds of database servers. Due to the scale of tenants and presence of dynamic workloads, manual administration of resource isolation is untenable. Orchestrating many database servers to host a large number of appli-cations requires changes to existing database systems as well as the design and implementation of new tools and components to ensure that hosted applications continually meet performance objectives. To allow the platform to scale up in the number of tenants it is important that system operations can be managed without direct supervision of an administrator. Therefore, self-managed elastic-ity primitives are essential for building a scalable database platform that adapts to dynamic workloads while maximizing resource utilization. This section high-lights elasticity primitives needed for enabling a Database-as-a-Service using soft isolation as the resource control mechanism.
A soft-isolation platform hosting a large number of small tenants across a clus-ter of servers must address several challenges. One key challenge is the ability to
The Need for Elasticity Primitives – Section 1.3
understand what amount of physical resources a tenants needs in order to meet tar-get SLOs. These resources can include CPU cycles, memory, disk I/O operations, or other related physical resources. Without adequate resource access, a tenant is likely to violate SLOs during periods of high activity. As many applications can be multipurposed or have different users with distinct usage patterns it can be difficult to discover the exact resource requirements for a tenant. In addition to identifying resource requirements, attributing current resource consumption to a given tenant is difficult for architectures where tenants share system processes. Therefore, a platform should have primitives to model resource requirements for a given tenant and attribute resource consumption to tenants.
Discovering resource requirements alone is not enough for a database plat-form to place tenants. In a soft isolation based architecture, resources are shared between tenants without controls for how resources will be shared. When two workloads are placed on a single machine, they will compete for the underlying resources. How the resources will be acquired and used by each tenant will depend heavily on the controls allowed, the database architecture, and the tenant work-loads. Often the resources consumed by tenants are not additive when colocated and a model of aggregating resource consumption is required [25, 6, 62]. As the system will make decisions about which tenants will be colocated, a system con-troller must have an ability to predict or model how various tenants will behave when colocated. Without any colocation primitive, a system would be blind when placing tenants in the absence of strong resource isolation. Such a blind placement would likely result in performance violations for periods of moderate activity.
With the presence of dynamic workloads, behavior can change in manner that has not been previously observed. In these cases, performance violations can occur even with perfect resource and colocation modeling primitives enabled. Here, the system must react to a performance crisis resulting from the violated SLO. Several approaches have been proposed to deal with these violations in a soft isolation based platform. If the system utilizes primary copy replication [16] then one option is to shift workloads by promoting a secondary replica to become the new primary replica [61]. If a multi-master replication scheme is utilized then the percentage of work allocated to each replica can be load-balanced [70]. If replication is not enabled or if the replicas are not valid destinations due to the existing workloads at the secondaries, then a databases must be migrated between the servers to update the tenant to server mapping [38]. For this solution, a migration primitive must exist to migrate a tenant’s persistent image and active state. Ideally, a live migration [21] primitive is supported to migrate the tenant’s active state and image without stopping the database.
Databases amenable to a consolidated environment and hosted by a Database-as-a-Service platform are likely to have a small physical size (footprint) or low throughput. However, certain classes of applications may have data storage re-quirements or an active working set that spans the capacity of a single server. For these tenants the database will need to partitioned across two or more data-base servers. If the tenant requires transactional support, a partitioning primitive will determine how to partitioning the data across servers to minimize distributed transactions while distributing load and storage across servers [29, 65]. As work-loads evolve, the layout of data may need to change to maintain performance. Workload changes can result in a hotspot that needs to be split across servers, or changes in workload access patterns that results in too many distributed trans-actions. Similar to live migration, a live reconfiguration is needed to change partitioned data’s layout without taking the system offline.
1.4
Dissertation Overview
This dissertation focuses on the design, implementation, and evaluation of primitives required for a soft isolation based database platform, in particular prim-itives related to the placement and movement of tenants. While these primprim-itives are critical first steps in enabling a scalable and elastic Database-as-a-Service, there are other primitives desirable for a database platform. This includes tools needed to handle the configuration of replication protocols, ensuring privacy of each tenant’s data, managing and updating scalable query routers, the genera-tion of SLOs, stronger resource allocagenera-tion mechanisms, and controlling elasticity to minimize operating costs. Solutions into these tools provides a rich research agenda for future research.
This thesis presents that building a scalable, elastic, and autonomic database platform is achievable using existing database architectures by providing solutions to understand workload requirements, predict the impact of colocation, reactively load-balancing tenant placement, and migrating persistent state in a lightweight
manner. These issues are addressed in two parts. The first part addresses on
primitives related to modeling resource consumption on tenants, modeling colo-cation impact for tenants, and a load-balancing primitive, which can be used to incrementally place tenants for initial consolidation. The second part focuses of movement primitives to load balance tenants and partitioned databases.
Dissertation Overview – Section 1.4
1.4.1
Modeling and Placement Primitives
Databases are predicated on an architecture that assumes the server is ded-icated to the hosted database. Therefore, a database is designed to consume resources of the server regardless of whether it needs the resources. This architec-ture can make it difficult to attribute accurate resource requirements for a running tenant. As a motivating example, a system hosts a database with a total storage (footprint) of ten GB, but only actively uses two GB of its storage. This means if this database has access to two GB of cache, majority of read queries would not result in disk I/O. However, the buffer pool, or database cache, will fill up to the total amount of allocated buffer space regardless if less is needed. This sample database would use up to ten GB of buffer pool for this tenant, even though the
active set is a fraction of that size. In a multitenant environment to place ten-ants with adequate server resources, it is important to understand what resources a tenant will actually need to answer requests effectively. The greedy design of databases increases the difficulty of attributing resource requirements. Previous research has shown how to identify resource requirements of tenants when they are profiled in an isolated environment [25]. Profiling tenants in isolation works well when tenant behavior is static and predictable. However, when tenants are colocated in a process that shares resources, it is difficult to ascertain resource re-quirements without isolating tenants on a profiling server. When tenant behavior is dynamic, multipurpose, or subject to ad hoc usage isolated profiling becomes untenable. Therefore, a technique is needed to estimate resource requirements at runtime in a consolidated environment. The first primitive presented estimates a tenant’s resource requirements at runtime based on supervised learning tech-niques.
In addition to estimating tenant resource requirements, understanding how behavior changes with colocation is critical for making placement decisions. How tenants share resources is dependent on the database architecture and the ten-ants’ current behavior. For example, if a buffer pool uses a least recently used
(LRU) page replacement policy, then a tenant’s throughput will be a significant
factor in determining how it shares a buffer pool. Faster tenants are more likely to have their pages frequently accessed and therefore less likely to have their pages evicted. Understanding how colocation is required to avoid resource starvation from over-consumed resources. Further complicating the problem is the difficulty in understanding the interactions and relationship between the database, the op-erating system, and the underlying hardware. Building precise models of tenant colocation dependent on these interactions and can result in brittle hand-tuned interaction models. Instead, we propose a colocation model that is empirically
learned by repeatedly observing how different sets of tenants classes behave to-gether.
Even with a perfect resource and colocation model, tenant behavior can evolve over time due to changes in the access patterns, an increase in application traffic, or changes to application code. This behavior change can result in increased resource contention between tenants. With soft isolation, resolving this issue requires one or more tenants must be shuffled between servers to provide adequate resource availability to all tenants. Ideally these disruptions are infrequent and the overhead of moving tenants between servers is amortized during periods of regular activity. A technique is presented to identify a set of tenants to move and destinations leveraging the resource and colocation models.
1.4.2
Movement Primitives
Since soft-isolation platforms do not rely on strong resource allocation mech-anisms, load-balancing tenants is required to resolve performance crises. Here load-balancing will place workloads that are complementary in regards to resource consumption. There are several methods to enable load-balancing in this envi-ronment. As previously stated, replication based approaches are not adequate for all load-balancing scenarios the system may encounter. Therefore, a migra-tion primitive is needed for a soft-isolamigra-tion based database platform. Ideally, a migration technique will minimize disruption to the system and active transac-tions. Such a technique is referred to as live migration [21]. This dissertation present the first live migration technique for shared nothing databases that incurs no downtime to the active tenant.
Live migration focuses on migrating an entire tenant between two database servers. For partitioned databases this approach cannot be directly applied to reconfigure how the data is partitioned. Since the system is transactional and par-titioned a reconfiguration primitive should explicitly consider distributed trans-actions; a concern not present for the migration on an entire tenant database. Additionally, the data moving is at a smaller granularity than a tenant. A live reconfiguration addresses these issues by changing the layout of data without needing to take the system off-line or migrate an entire database or partition. A technique for live reconfiguration for partitioned main memory databases is addressed by this dissertation.
Contributions – Section 1.5
1.5
Contributions
This dissertation makes several contributions to enabling elasticity primitives for a multitenant, scalable, elastic, and self-managed data platform. The pre-sented advances allow traditional relational databases to be used in a distributed scale-out architecture. A focus on minimizing problem constraints is emphasized throughout the presented solutions. While these contributions focus on a shared nothing architecture that serves web and OLTP style workloads, the tools can applied to other target environments. The following contributions are essential in the realization of a virtualized Database-as-a-Service offering:
• An analysis of popular forms of multitenancy in database systems and how these forms align with cloud computing paradigms.
• A framework for analyzing database migration. The framework introduces several forms of migration to characterize existing and future migration tech-niques. The migration framework also identifies attributes to measure and evaluate different migration techniques against each other.
• An end-to-end multitenant database prototype, that orchestrates a cluster of shared-nothing PostgreSQL servers to ensure latency based performance SLOs are maintained. The system includes mechanisms for database mon-itoring, crisis mitigation, and a supervised-learning based autonomic con-troller, Delphi.
• A technique, Pythia, for a multitenant controller to model tenant resource consumption and model the impact of colocation. Pythia can approximate tenant resource consumption in a consolidated process, which externally reports aggregated resource consumption for all colocated tenants. Pythia leverages tenant resource models, to learn the impact of tenant colocation. Pythia allows for configurable consolidation levels with minimal instructions from a database administrator.
• A load-balancing primitive to rapidly derive new tenant placement plans when latency SLOs are violated. The load-balancing algorithm is based on a time-bound local search heuristic to use migrations are steps to improve a systems global state, with the optimal state as each node as unlikely as possible to have a resource violation.
• The first published live migration technique for shared nothing databases, Zephyr. The proposed approach has zero downtime for the migrating ten-ants and aborts an order of magnitude fewer aborted transactions than a
stop-and-copy based approach. A novel use of unique data page ownership enables lightweight synchronization between sites without use of latency in-ducing techniques, such as two-phase commit.
• Squall, the first proposed live reconfiguration technique to update a main memory partitioned database’s layout of tuples at runtime. Squall targets main memory databases that use a single threaded transaction manager per partition. The technique focuses on identifying all migrating data for parti-tioned relational data and how to minimize disruption of reconfiguration in the presence of a single threaded partition.
Chapter 2
Background
A pint of sweat, saves a gallon of blood.
George S. Patton Building a database-as-a-service platform requires new components as well as primitives to adapt traditional stand alone databases. A variety of architectures and systems have emerged to build such a database platform. These architectures have emerged in the context of cloud computing environments or for dedicated hosted platforms, or private clouds. In addition to new tools and architectures, a database system must be multiplexed to host multiple applications. This problem is often referred to as multitenancy, where each hosted application is a tenant. This chapter introduces database multitenancy models, basic cloud computing concepts, recent system architectures for multitenancy, and state of the art ad-vancements in database tools for a database platform.
2.1
Multitenancy Models
Multitenancy in databases has been prevalent for hosting multiple tenants within a single DBMS while enabling effective resource sharing [50, 9, 68]. Shar-ing of resources at different levels of abstraction and distinct isolation levels results in various multitenancy models. The three models explored in the past [50] con-sist of: shared machine (also referred to as shared hardware), shared process,
and shared table. SaaS providers like Salesforce.com [81] are a common use cases
for database multitenancy, and traditionally rely on the shared table model. The shared process model has been recently proposed in a number of database systems for the cloud, such as RelationalCloud [27], SQLAzure [13], ElasTraS [31]. Nev-ertheless, some features of cloud computing increases the relevance of the other
# Sharing Mode
Isolation IaaS PaaS SaaS 1. Shared hardware VM X X 2. Shared VM OS User X 3. Shared OS DB Instance X 4. Shared instance Database X 5. Shared database Schema X 6. Shared table Row X X
Table 2.1: Multitenant database models, how tenants are isolated, and the corre-sponding cloud computing paradigms.
models. Soror et al. [75] propose using the shared machine model to improve resource utilization. To improve understanding of multitenancy, we use the clas-sification recently proposed by Reinwald [68] which uses a finer sub-division (see Table 2.1). Though some of these models can collapse to the more traditional models of multitenancy. However, the different isolation levels between tenants provided by these models make this classification interesting and helpful for se-lecting a target classification when building a multitenant database.
Shared Hardware
The models corresponding to rows 1–3 share resources at the level of the same ma-chine with different levels of abstractions, i.e., sharing resources at the mama-chine level using multiple VMs (VM Isolation) or sharing the VM by using different user accounts or different database installations (OS and DB Instance isolation). There is no database resource sharing. Rows 1–3 only share the machine resources and thus correspond to theshared machine model in the traditional classification. While these models offer strong isolation between tenants, these models come with a cost of increased overhead due to redundant components and a lack of coordination using limited machine resources in an unoptimized way. The lack of coordination is prominent in the case of using a virtual machine for each tenant, row 1, where each tenant behaves as if it has exclusive disk access [27].
Multitenancy Models – Section 2.1
Shared Process
Rows 4–5 involve sharing the database process at various isolation levels—from sharing only the installation binary (database isolation), to sharing the database resources such as the logging infrastructure, the buffer pool, etc. (schema iso-lation), to sharing the same schema and tables (table row level isolation). How a database instance can be isolated between tenants varies between implementa-tion. For example, with MySQL each tenant can be given their own schema with limited user permissions. Rows 4–5 thus span the traditional classes of shared
process (for rows 4 and 5).1
Shared Table
The shared table model uses a design which allows for extensible data models to be defined by a tenant with the actual data stored in single shared table. The design often utilizes ‘pivot tables’ to provide rich database functionality such as indexing and joins [9]. While this model offers advantages of maintaining a single database instance, isolating tenants for migration becomes difficult due to shared locking mechanisms. The reliance on consolidated pivot and heap tables could lead to poor performance due to all tenants sharing index structures. Additionally, the shared table model requires that all tenants reside on the same database engine and release (version). This limits specialized database functionality, such spatial or object based, and requires that all tenants use limited subset of functionality. This model is ideal when tenant data requirements follow similar structures or patterns, such as in the case of Force.com offering customizations on a customer relationship database [81].
With different forms of multitenancy, components that constitute a tenant vary. We henceforth use the term cell to represent all information necessary to serve a tenant. A multitenant database instance consists of thousands ofcells, and the actual physical interpretation of a cell depends on the multitenancy model.
Definition 1. A cell is the self-contained granule representing a tenant in the database.
The choice of multitenancy model has implications on a tenant’s resource iso-lation, consolidation, functionality, and required development. A key trade-off to explore in multitenacy is the balance between the amount of consolidation that the system provides and the level that resources access is isolated between ten-ants. When a shared hardware model relies on virtualization to divide a server’s 1The shared instance model is primarily supported by commercial databases that allows multiple databases (processes) to share a common installation (or binary). Example usage includes running isolated production and test databases. This model can map to both shared machine as well as shared process.
resources between tenants, a strong level of resource isolation is enabled. Here, each tenant is guaranteed a specific amount of resources and the hypervisor pro-vides the ability to monitor and control how resources are utilized. However, as shown in a recent study [26], such a model results in up to an order of magnitude lower performance and consolidation compared to the shared process model. This limitation consolidation is largely driven by two factors. First, virtualiza-tion results in redundant database and OS processes which each require dedicated resources to operate. Recent developments have attempted to limit redundancy due to this issue. Second and more importantly, when independent databases processes reside on the same physical server they act in an uncoordinated man-ner. Databases are designed to greedily consume and explicitly manage resources to amortize the costs associated with various components. An example of this is delaying writes to the disk until periods of low activity or when writes can be batched. Many greedy and uncoordinated database processes will utilized the underlying resources in a ineffective manner that significantly limits the amount of tenants which can be hosted on a single server.
On the other hand, the shared table model allows efficient resource sharing amongst tenants but restricts the tenant’s schema and requires custom techniques for efficient query processing. Here the overhead of managing the meta data over-head of many independent databases and tables are mitigated due to shared data within fewer tables. This approach maximises the level of tenant consolidation, but has limited resource isolation. With tenants hosted within the same tables, the database has reduced options to ensure each tenant has adequate access to resources. Implementations of the shared table model can rely on additional com-ponents, such as the application layer, to provide resource isolation [81].
The shared process model, therefore, provides a good balance of effective resource sharing, schema diversity, performance, and scale. The shared process model has also been widely adopted in commercial and research systems [13, 27]. The shared process model also allows for existing database systems to be used with little to no modification to the database kernel. We therefore focus on the shared process model for this dissertation.
2.2
Multitenancy for the Cloud
While broad in concept, three main paradigms have emerged for cloud com-puting: IaaS, PaaS, and SaaS. We now establish the connection between the database multitenancy models with the cloud computing paradigms (Table 2.1 summarizes this relationship), while analyzing the suitability of the models for various multitenancy scenarios.
Recent Multitenant Systems – Section 2.3
IaaS provides the lowest level of abstraction such as raw computation, storage, and networking. Supporting multitenancy in the IaaS layer thus allows much flex-ibility and different schema for sharing. The shared hardware model is therefor best suited in IaaS. A simple multi-tenant system could be built of a cluster of high end commodity machines, each with a small set of virtual machines. Each virtual machine would host a single database tenants. This model provides isola-tion, security, and efficient migration for the client databases with an acceptable overhead, and is suitable for applications with lower throughput but larger storage requirements.
PaaS providers, on the other hand, provide a higher level of abstraction to the tenants. There exist a wide class of PaaS providers, and a single multitenant database model cannot be a blanket choice. For PaaS systems that provide a single data store API, a shared table or shared instance could meet data needs for the platform. For instance, Google App Engine uses the shared table model for its data store referred to as MegaStore [10]. However, PaaS systems with the flexibility to support to a variety of data stores, such as AppScale [20], can leverage any multitenant database model.
SaaS has the highest level of abstraction in which a client uses the service to perform a limited and focused task. Customization is typically superficial and workflows or data models are primarily dictated by the service provider. With rigid definitions of data and processes, and restricted access to a data layer through a web service or browser, the service provider has control over how the tenants will interact with a data store. The shared table model has thus been successfully used by various SaaS providers [9, 50, 81].
2.3
Recent Multitenant Systems
Many multitenant database systems have been proposed in recent years. This section introduces several of the systems to provide key contributions and any potential issues not addressed by the system. Yang et al. [87] outline a shared process model to build a scalable data platform. The presented system focuses on the system can enable replication within a datacenter and across datacenters. The presented project is one of the earlier papers to formally describe an architecture for a shared process data platform. This system assumes tenant resource require-ments are readily available, and that all tenant resource consumption is additive. Tenant placement is treated as a multidimensional bin-packing problem.
Similar to Yang et al. the RTP project seeks to dynamically configure tenant placement [70]. RTP focuses on how to place tenants and redistribute workloads between replicas, such that the system is robust to handle any single server failure.
A thorough evaluation of various placement strategies is presented. The target database is main memory, so therefore the authors focus on univariate workload requirements that are additive.
Lang et al. [53] present a technique to place tenants but also provision the amount and types of servers required. The server provisioning strategy accounts for multiple hardware classes, that have various performance characteristics. Lang et al. focus on a provisioning strategy that ensures performance service objects (SLOs) will be met. Here the workload classes are assumed to be known and the provisioning can systematically explore how the various classes behave when colocated on a given hardware configuration.
PMAX [57] also seeks to provision a multitenant environment, but here a focus is on maximizing profit by factoring in the SLO violation costs and costs associated with running each server. Here, the workloads are known and fixed, but the arrival rate and value of queries can change. The impact of colocating workloads is assumed to provided either by observation or through an oracle. PMAX demonstrates that popular multi-dimensional bin packing heuristics can be sub-optimal for tenant provisioning strategies.
SQLVM [63] is a project to embed the resource allocation and isolation of virtualization technology into the database kernel. SQLVM focuses on how to meter tenant’s resource utilization and schedule to database requests in order to provide specific levels of resource access. This project examines how can disk I/O, CPU cycles, and memory for caching can be limited by the database engine.
DAX [56] is a multitenant environment that focuses on providing cross dat-acenter replication for tenants. Here a single active tenant runs in only a single datacenter but the persistent state of the tenant is replicated between many dat-acenters. This allows a tenant to move between datacenters in a lightweight manner. DAX replaces a local block storage with a distributed key-value store where is each block ID is the key and the block’s data is the value. DAX lever-ages the semantics of transactional consistency to minimize the number of replicas required to acknowledge each operation.
SQLAzure [13], ElasTras [31], and Relational Cloud [27] are multitenancy plat-forms that target the same environment as this dissertation. These projects ad-dress different issues related to building a multitenant environment, such as how to partition tenants [29, 13, 31] and how to initially consolidate tenants when the workload remains static [25]. Relational Cloud enables a tenant to be partitioned with distributed transactions, where the other systems focus on tenants that can be hosted by a single physical server. ElasTras relies on a shared storage layer to host the tenant’s data and logs. Another similar project, Google’s F1 [?] is a
Recent Multitenant Systems – Section 2.3
scalable relational database system that enables automatic partitioning and SQL support.
Part I
Modeling and Placement
Primitives
Chapter 3
Pythia
The purpose of science is not to analyze or describe but to make useful models of the world. A model is useful if it allows us to get use out of it.
Edward de Bono Cloud application platforms and large organizations face the challenge of man-aging, storing, and serving data for large numbers of applications with small data footprints. For instance, several cloud platforms such as Salesforce.com, Facebook, Google AppEngine, and Windows Azure host hundreds of thousands of small ap-plications. Large organizations, such as enterprises or universities, also face a similar challenge of managing hundreds to thousands of database instances for different departments and projects. Allocating dedicated and isolated resources to each application’s database is wasteful in terms of resources and is not cost-effective. Multitenancy in the database tier, i.e., sharing resources among the different applications’ databases (or tenants), is therefore critical. We focus on such a multitenant database management system (DBMS) using the shared process multitenancy model where the DBMS comprises a cluster of database servers (ornodes) where each node runs a single database process which multiple tenants share.
3.1
Challenges in Multitenancy
A multitenant DBMS must minimize the impact of colocating multiple tenants. The challenge lies in determining which tenants to colocate and how many to colocate at a given server, i.e., learn goodtenant packingsthat balance between over-provisioning and over-booking. Furthermore, the colocated tenants’ resource
requirements must be complementary to avoid heavy resource contention after colocation.
To ensure of service, a multitenant DBMS must also associate meaningful
service level objectives (SLOs) in a consolidated setting. If a tenant’s SLO is violated, the DBMS must adapt to this performance crisis. The challenge lies in mitigating the crisis which might be caused by a change in this tenant’s behavior, a change in a colocated tenant’s behavior, or a degradation in the node’s performance. A tenant’s behavioral change might be due to a change in the query pattern, data access distribution, working set size, access rates, or queries issued on non-indexed attributes while typical queries are on indexed attributes— complexity arises from the myriad of possibilities. Adapting to a crisis entails detecting changes, filtering erratic behavior, and devising mitigation strategies. Erratic behavior can arise from temporary shifts in application popularity, periodic analysis, or ad-hoc queries.
The problem of designing a self-managing controller is further complicated by the variety of tenant workload types. Many applications use their databases for multiple purposes, such as using the same database for serving, analysis, and logging. Therefore, in addition to workload variations across tenants, a single ten-ant might also exhibit different behaviors at different time instances. Behavioral changes might have patterns (e.g., diurnal trends of serving and reporting work-loads) or might be erratic (e.g., flash crowds). Moreover, dynamics in the workload might or might not have correlation. For instance, hosted business applications observe a spike in multiple tenants’ activity at the start of the business day. These behavioral dynamics, the interplay of shared resources among colocated tenants, and the complex interactions between the DBMS, OS, and the hardware make analytical models and theoretical abstractions impractical.
From the monitoring perspective, a system controller potentially receives hun-dreds of raw performance measures from the database process and the operating system (OS) at each node. Considering the scale of tens to hundreds of nodes, using all these raw signals to maintain an aggregate view of the entire system and the individual tenants results in an information deluge. One challenge in effective administration is to systematically filter, aggregate, and transform these raw sig-nals to a manageable set of attributes and automate administration with minimal human guidance. An intelligent and self-managing system controller is a signifi-cant step towards achieving economies-of-scale and simplifying administration.
More than a decade of research has focused on effective multitenancy at differ-ent layers of the stack, including sharing the file system or the storage layer [76, 64], sharing hardware through virtualization [45], and in the application and web server layer [80]. Multitenancy in the database tier introduces novel challenges due to
Controller for a Multitenant DBMS – Section 3.2
the richer functionality supported by the DBMS compared to the storage layer, and the complex interplay between CPU, memory, and disk I/O bandwidth ob-served in a DBMS compared to the stateless applications and web servers. Recent work has focused on various aspects of database multitenancy. Kairos [26] is a technique for tenant placement and consolidation for a set of tenants with known static workloads. Kairos uses direct measurements of the tenants’ CPU, I/O, memory, and disk resource consumption to suggest consolidation plans. Smart-SLA [85] is a technique for cost-aware resource management using direct resource utilization measurements to learn the average SLA penalty cost in a setting where each tenant has its independent database process and virtual machine. Ahmad and Bowman [6] use machine learning techniques to predict aggregate resource consumption for various workload combinations and proposes a technique that re-lies on static and known workloads. The authors argue that analytical models for performance and consolidation are hard due to complex component interactions and shifting bottlenecks. Lang et al. [53] propose a SLO-focused framework for static provisioning and placement where tenant workloads are known. In general, existing approaches do not target the problem of continuous tenant modeling, dynamic tenant placement, variable and unknown tenant workloads, and perfor-mance crisis mitigation in the shared process multitenancy model, which is critical for deploying shared database services.
3.2
Controller for a Multitenant DBMS
We present the design and implementation of Delphi, an intelligent self-managing controller for a multitenant DBMS that orchestrates resources among the tenants. Delphi uses Pythia, a technique to learn behavior through observa-tion.1 Pythia uses DBMS-agnostic database-level performance measures available
in any standard DBMS and supervised learning techniques to learn a tenant model representing resource consumption. Pythia learns a node model to determine which combination of tenant types perform well after colocation (good packings) and which combinations do not perform well (bad packings). Pythia continuously models behavior and maintains historical behavior which allows it to detect a change in a tenant’s behavior. Once Delphi detects a performance crisis, it leverages Pythia to suggest remedial actions. Identifying a set of tenants to re-locate, and finding destinations for these tenants to alleviate latency violations is the core challenge addressed by Pythia. Delphi employs a local search algorithm, hill-climbing, to prune the space of possible tenant packings and uses the node
Figure 3.1: Pythia incrementally learns behavior.
model to identify potential good packings. Pythia requires minimal human super-vision, typically from a database administrator, only for training the supervised learning. Once the models are trained, Delphi can independently orchestrate the tenants, i.e., monitor the system to detect performance crises, load-balance and migrate tenants to mitigate a crisis and to ensure that tenant SLOs are being met. Figure 3.1 presents an overview of Delphi’s design.
In contrast to existing techniques that directly use OS or VM level resource utilization, such as Kairos [26] and SmartSLA [85], Pythia uses database-level performance measures such as cache hit ratio, cache size, read/write ratio, and throughput. This allows Pythia to maintain a detailed per-tenant profile even when tenants share a database process. OS level measures either provide aggre-gate resource consumption metrics of all tenants, and the alternative of hosting one tenant per database process degrades performance [26]. In contrast, Pythia results in negligible performance impact by using performance measures available from any standard DBMS implementation. Additionally, Pythia learns tenant behavior without any assumptions or in-depth understanding of the underlying systems. In addition, unlike workload driven techniques [26, 6, 53], Pythia does not require advanced knowledge of the tenants’ workload or limit the workload types. Moreover, Pythia does not require profiling tenants in a sandbox, a ded-icated node for running tenants in isolation, thus making it applicable even in scenarios where production workloads cannot be replayed due to operational or privacy considerations [7]. Therefore, we expect Pythia to have applications in a variety of multitenant systems and environments, while requiring minimal changes to existing systems.
Delphi is the first end-to-end framework for the accurate and continuous mod-eling of tenant behavior in a shared process multitenancy environment. We built
Delphi Architecture – Section 3.3
Figure 3.2: Overview of Delphi’s architecture.
a prototype implementation of Delphi in a multitenant DBMS running a cluster of Postgres RDBMS servers. Our current implementation uses a set ofclassifiers to learn tenant and node models, although Pythia can be extended to use additional tenant resource models, or other machine learning techniques such as clustering or regression learning [83]. Pythia learns tenant models with a 92% accuracy, and node models with a 86% accuracy. Once a performance crisis is detected, Delphi can mitigate the crisis by reducing the 99th percentile latency violations by 80% on average.
3.3
Delphi Architecture
Fi