Apache HttpClient REST+T Extension - Cherry Garcia: Transactions across Heterogeneous Data Stor

The REST+T extensions to the Apache HTTP client library are fairly easy to implement. For instance, the code in Listing 6.5 is all that is needed to extend the HttpEntityEnclosingRequestBase class so that the PREPARE method works with any existing application using the Apache HttpClient Java library. The key to this class is the getMethod() method, which returns the value of the static string METHOD_NAME, set to PREPARE. The rest of the library uses this value to construct the REST+T request to the server.

    import java.net.URI;

    import org.apache.http.annotation.NotThreadSafe;
    import org.apache.http.client.methods.HttpEntityEnclosingRequestBase;

    @NotThreadSafe
    public class HttpPrepare extends HttpEntityEnclosingRequestBase {
        // HTTP verb sent on the wire for the REST+T PREPARE request.
        public final static String METHOD_NAME = "PREPARE";

        public HttpPrepare() {
            super();
        }

        public HttpPrepare(final URI uri) {
            super();
            setURI(uri);
        }

        /**
         * @throws IllegalArgumentException if the uri is invalid.
         */
        public HttpPrepare(final String uri) {
            super();
            setURI(URI.create(uri));
        }

        // The rest of the library calls this to build the request line.
        @Override
        public String getMethod() {
            return METHOD_NAME;
        }
    }

Listing 6.5: Implementation of the REST+T API extension of Apache HttpClient for the PREPARE method
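To see why overriding getMethod() is sufficient, the following minimal sketch uses a simplified stand-in for the request base class rather than the Apache HttpClient types. The names RequestBase, Prepare, requestLine and prepareLine, as well as the example URI, are assumptions made purely for illustration; the sketch only shows the general pattern of a base class that builds the request line from whatever verb the subclass reports.

```java
import java.net.URI;

public class PrepareRequestSketch {
    /** Simplified stand-in for a request base class; NOT the Apache HttpClient type. */
    abstract static class RequestBase {
        private final URI uri;

        RequestBase(URI uri) {
            this.uri = uri;
        }

        /** Subclasses report the HTTP verb they represent. */
        abstract String getMethod();

        /** Builds an HTTP/1.1 request line from the verb reported by the subclass. */
        String requestLine() {
            return getMethod() + " " + uri.getRawPath() + " HTTP/1.1";
        }
    }

    /** Hypothetical PREPARE request: the only customisation is the verb. */
    static final class Prepare extends RequestBase {
        static final String METHOD_NAME = "PREPARE";

        Prepare(URI uri) {
            super(uri);
        }

        @Override
        String getMethod() {
            return METHOD_NAME;
        }
    }

    /** Convenience helper returning the request line for a PREPARE on the given URI. */
    static String prepareLine(String uri) {
        return new Prepare(URI.create(uri)).requestLine();
    }

    public static void main(String[] args) {
        // Example URI is an assumption; any absolute HTTP URI works.
        System.out.println(prepareLine("http://localhost:8080/items/k1"));
    }
}
```

Because the base class never hard-codes the verb, adding further REST+T methods in this style would only require another small subclass.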

6.8 Chapter Summary

We began this chapter by describing the implementation of the Cherry Garcia protocol in a Java library of the same name. We described the details of the Datastore implementations for Windows Azure Storage (WAS), Google Cloud Storage (GCS) and our own REST+T data store called Tora. We highlighted the subtle differences between these systems and showed that the Datastore abstraction efficiently hides the somewhat complex interaction with the individual data stores. The example code in Section 5.3 shows how easy it is to write transactional applications with data stores that do not provide native multi-item transactions.

Subsequently, we described the implementation of Tora in the C++ programming language using the Boost library and its Asynchronous IO classes to provide a REST+T interface to the WiredTiger storage engine. The use of the WiredTiger storage engine API within the Tora code was also described in detail using code snippets.

Later in this thesis we will discuss how we use the Java implementation of Cherry Garcia for our performance evaluations in Chapter 8. The evaluations also cover the Tora data store implementation to evaluate the correctness and performance characteristics of the REST+T API extensions to standard HTTP.

Chapter 7 The YCSB+T Benchmark

Database system benchmarks like TPC-C and TPC-E focus on emulating database applications to compare different DBMS implementations. These benchmarks use carefully constructed queries, executed within the context of transactions, to exercise specific RDBMS features and measure the throughput achieved. Cloud services benchmark frameworks, like YCSB, on the other hand, are designed to evaluate the performance of distributed NoSQL key-value stores. Early examples of these systems did not support transactions, and so the benchmarks use single operations that are not inside transactions. Recent implementations of web-scale distributed NoSQL systems, like Spanner and Percolator, offer transaction features to cater to the needs of this new breed of web-scale applications. The inability of these benchmarks to wrap multiple operations into a single transaction has therefore exposed a gap in the available benchmarks.

In this chapter, we begin by identifying the issues that need to be addressed when evaluating transaction support in NoSQL databases. This is followed by a description of the YCSB+T framework, an extension of YCSB, that wraps database operations within transactions. This framework includes a validation stage to detect and quantify database anomalies resulting from the workload and a way to gather metrics to measure transactional overhead. In this context we describe a workload, called the Closed Economy Workload (CEW), which can be run within the YCSB+T framework for this purpose. Lastly, we discuss execution details of the framework and associated workloads and share our experience with using YCSB+T to evaluate some NoSQL systems in Chapter 8.

7.1 Background

It is evident that cloud or utility computing [69] is rapidly becoming the deployment architecture of choice for applications across large established enterprises, small startups and businesses alike. New data management technologies, broadly classified under the NoSQL database category, have come to the forefront. These include Amazon S3 [1], Google BigTable [26], Yahoo! PNUTS [28] and many others. These systems leverage distributed deployment platforms to support large-scale storage and throughput while tolerating node failures. As a trade-off, they typically offer applications less query expressivity (for example, they may not perform joins) and weaker transactional and consistency guarantees.

Early NoSQL systems, such as Amazon's Simple Storage Service (S3) [1] or Amazon's internal-use Dynamo [28], provide little, if any, support for transactions and information consistency. Others allow read operations to return somewhat stale data, typically offering only eventual consistency [136] or time-line consistency [28]. Recent commercially available offerings like Windows Azure Storage (WAS) [25] and Google Cloud Storage (GCS) [2] allow the transaction-like grouping of multiple operations but restrict this to either a single item or a collection of items that are collocated. Another approach to better supporting application logic lies in richer operations such as test-and-set or conditional put.
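As a rough illustration of what a conditional put offers an application, the sketch below uses an in-memory map from the JDK rather than any of the cloud stores named above. The class name ConditionalPutSketch, the method putIfMatches and the example key are hypothetical; real stores typically expose the condition through version tags or similar mechanisms, which this sketch does not model.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ConditionalPutSketch {
    // In-memory stand-in for a key-value store (assumption for illustration).
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    /** Unconditional write. */
    public void put(String key, String value) {
        store.put(key, value);
    }

    public String get(String key) {
        return store.get(key);
    }

    /**
     * Conditional put: writes newValue only if the current value equals expected.
     * Returns true if the write happened, false if the expectation was stale.
     */
    public boolean putIfMatches(String key, String expected, String newValue) {
        return store.replace(key, expected, newValue);
    }

    public static void main(String[] args) {
        ConditionalPutSketch kv = new ConditionalPutSketch();
        kv.put("balance:alice", "100");

        // Succeeds: the stored value is still "100".
        boolean first = kv.putIfMatches("balance:alice", "100", "90");
        // Fails: a second writer still expects "100", but the value is now "90".
        boolean second = kv.putIfMatches("balance:alice", "100", "80");

        System.out.println(first + " " + second + " " + kv.get("balance:alice"));
        // prints "true false 90"
    }
}
```

A conditional put of this kind protects a single item against lost updates, but it does not by itself make a group of writes to several items atomic.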

As we have seen in Section 3.6, several designs have recently been proposed to overcome these limitations of NoSQL databases. These include Google Percolator [108], G-Store [35], CloudTPS [144], Deuteronomy [85], Megastore [11] and Google Spanner [30].

The question is, how do we evaluate these designs? Traditional data management platforms were evaluated using industry standard benchmarks like TPC-C [133] and TPC-E [134]; these focused primarily on emulating end-user application scenarios to evaluate the performance (especially the throughput, and throughput relative to system cost) of the underlying DBMS and application server stack.

These benchmarks typically run a workload of queries and updates performed in the context of transactions, and the integrity of the data is supposed to be verified during the execution of the benchmark. If the data is corrupted, the benchmark measurement is rejected entirely.

In the case of web-scale data management, especially the “available-despite-failures” key-value stores in the NoSQL category, a new benchmarking framework YCSB [29] has become accepted. The focus of this benchmark is raw performance and scalability; correctness is not measured or validated as part of the benchmark, and the operations do not fall within transaction boundaries (largely because the target systems do not support traditional notions of transactions nor guarantee data consistency). YCSB is actually a flexible framework within which the workload and the measurements can be extended.

Here we introduce a benchmarking framework (called YCSB+T) that retains the flexibility of YCSB by allowing the user to implement the DB Java interface to their database or data store; allows for additional operations apart from the standard read, write, update, delete and scan; enables the user to define workloads in terms of these operations; and, most importantly, allows these operations to be wrapped into transactions. Further, it includes a validation stage that enables consistency checks to be performed on the database or data store after the workload has completed in order to detect and quantify transaction anomalies.

This approach is intended to fill the gap between traditional TPC-C-style benchmarks that are designed with transactional RDBMSs in mind and the non-transactional HTTP/web-service benchmarks which have no ability to define transactions.

In this chapter we:

• describe our extension of the YCSB benchmark that we call YCSB+T, with support for transactional operation and workload validation to detect and quantify anomalies resulting from the workload.

• describe a simple benchmark workload, which we call the Closed Economy Workload (CEW), built using YCSB+T that we can use to evaluate the performance and correctness of systems.

• share our experience in using the benchmark to evaluate some NoSQL data stores.
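The two ideas in the list above, wrapping operations in transactions and validating the data after the run, can be sketched as follows. This is a minimal illustration under stated assumptions, not the YCSB+T code: the TxStore interface, the InMemoryStore class, the transfer and validate helpers and the account names are all hypothetical, and the invariant checked (that the total balance across accounts stays constant) is our assumption about what a closed-economy workload implies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClosedEconomySketch {
    /** Hypothetical transactional store interface; not the actual YCSB+T DB API. */
    interface TxStore {
        void start();
        void commit();
        void abort();
        long read(String key);
        void write(String key, long value);
    }

    /** Toy single-threaded store: writes are buffered until commit, dropped on abort. */
    static final class InMemoryStore implements TxStore {
        private final Map<String, Long> committed = new HashMap<>();
        private final Map<String, Long> pending = new HashMap<>();

        public void start() { pending.clear(); }
        public void commit() { committed.putAll(pending); pending.clear(); }
        public void abort() { pending.clear(); }

        public long read(String key) {
            Long v = pending.containsKey(key) ? pending.get(key) : committed.get(key);
            return v == null ? 0L : v;
        }

        public void write(String key, long value) { pending.put(key, value); }
    }

    /** Moves amount between two accounts inside a single transaction. */
    static boolean transfer(TxStore db, String from, String to, long amount) {
        db.start();
        long balance = db.read(from);
        if (balance < amount) {
            db.abort(); // insufficient funds: leave the store untouched
            return false;
        }
        db.write(from, balance - amount);
        db.write(to, db.read(to) + amount);
        db.commit();
        return true;
    }

    /** Validation stage: absolute difference between expected and observed totals. */
    static long validate(TxStore db, List<String> accounts, long expectedTotal) {
        db.start();
        long total = 0;
        for (String account : accounts) {
            total += db.read(account);
        }
        db.commit();
        return Math.abs(expectedTotal - total);
    }

    /** Loads three accounts, runs a few transfers and returns the validation result. */
    static long runDemo() {
        TxStore db = new InMemoryStore();
        List<String> accounts = Arrays.asList("a", "b", "c");
        db.start();
        for (String account : accounts) {
            db.write(account, 100L);
        }
        db.commit();

        transfer(db, "a", "b", 30L);
        transfer(db, "b", "c", 50L);
        transfer(db, "c", "a", 500L); // rejected: insufficient funds

        return validate(db, accounts, 300L);
    }

    public static void main(String[] args) {
        System.out.println("anomaly = " + runDemo()); // prints "anomaly = 0"
    }
}
```

With a correct transactional store the reported difference is zero; a store that loses or duplicates updates under concurrency would leave a non-zero difference, which is the kind of anomaly a validation stage is meant to quantify.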

Figure 7.1: YCSB+T architecture. Components shown: the YCSB+T client (client threads, stats, new workloads, validation, transaction API) and the distributed NoSQL datastore. Workload properties: read/write mix, record size, popularity distribution. Command line parameters: DB to use, workload to use, target throughput, number of threads.

7.1.1 Brief overview of YCSB

The Yahoo! Cloud Serving Benchmark (YCSB) is an extensible benchmarking framework that is widely used to measure the performance and scalability of large-scale distributed NoSQL systems like Yahoo! PNUTS [28] as well as traditional database management systems like MySQL. Figure 7.1 illustrates the architecture of the YCSB+T benchmarking framework. The light coloured rectangles represent YCSB components and the dark coloured rectangles represent additions and enhancements that are part of YCSB+T.

When the YCSB+T client is launched, the workload executor instantiates and initialises the workload then loads the data into the database to be tested or executes the workload on the database using the DB client abstraction. The CoreWorkload workload, defined by default with the YCSB framework, defines different mixes of the simple database operations such as read, update, delete and scan operations on the database. The doTransactionRead() operation reads a single record while the doTransactionScan() operation is used to fetch more than one record identified by a range of keys. Data records are inserted using the doTransactionInsert() method and doTransactionUpdate() is issued to perform record updates in the database. The doTransactionReadModifyWrite() reads a
