• No results found

SQL for Operational Analytics on Big Data Platforms

Hadoop was invented to do batch analytics on large data sets. These data sets were read-only, and Hadoop’s major component HDFS was a write once read many (WORM) file system. Use cases or workloads that would entail use of these systems for updates, deletes, etc., were never part of the initial features Hadoop was meant to address.

Realistically, however, the success of a platform usually has had the effect of demanding more from it. This is why the industry began to explore the idea of having Hadoop, big data tools, and frameworks dispatch transactional workloads typically executed on operational systems. In this section of the chapter, we will take a look at some of the new tools in the landscape that are trying to address this use case and build new systems and databases, and ways to solve this problem on a big data platform. In Chapter 6 , we will also look at the Hybrid Transactional and Analytics Platform (HTAP), which is a unified platform that supports both operational and analytical workloads. Development of systems capable of executing operational workloads within a big data ecosystem represents a step in the right direction.

In Chapter 3 , we looked at how newer versions of Hive, starting from version 0.14, support transactional SQL capabilities, and this helps to support operational workloads. This was just the starting point that fueled industry leaders and innovators to delve further into supporting this, by building newer systems. “Operational” capabilities is an emerging Hadoop market category and, therefore, one of the least mature, stable, and robust, but the tools in this space are rapidly evolving and getting used in real production environments.

Transactional workloads are always mission-critical for business, because they represent their bread and butter and touch points for customers. Hence, handling operational workloads has very strict SLAs in terms of response times, transactional guarantees, data integrity, concurrency, and availability.

The different types of workloads, with their latency requirements, are shown in Figure  5-6 .

Trafodion

Trafodion is one of the first big data systems we explore and delve deeper into to see how it addresses operational capabilities on a big data system. It is an open source product, initiated by HP Labs, to develop an enterprise-level SQL-on-Hadoop engine that targets transactional and operational workloads. Under the covers, Trafodion builds on the capabilities of Hadoop, such as scalability, availability, replication, and parallel processing, but does not reinvent the wheel.

Following is a list of some of the major features of Trafodion: It is a fully functional ANSI SQL DBMS with support for the syntax and semantics of INSERT / DELETE / UPDATE / MERGE , etc. It extends HBase by adding transactional support with ACID semantics.

It has an extensive list of built-in optimizations for low-latency SQL, both for read and write operations.

It provides standard JDBC and ODBC drivers for third-party applications to connect and interact with it.

It provides a relational schema abstraction on top of HBase, which makes it feel like working with any other relational database and makes transition to HBase much easier. It supports creating and managing database objects such as schemas, tables, views, stored procedures, and constraints. Figure 5-6. Workloads and their typical latencies

CHAPTER 5 ■ SQL FOR STREAMING, SEMI-STRUCTURED, AND OPERATIONAL ANALYTICS

It has the transactional support found in most relational databases, such as BEGIN WORK , COMMIT WORK , ROLLBACK WORK , SET TRANSACTIONS , etc.

It provides security privileges with operators such as GRANT / REVOKE .

It provides a plethora of utilities, such as Update Statistics, Explain, Command Line Interface, Bulk Loader, Backup and Restore, etc., found in most typical RDBMS.

In spite of being built on top of HBase, Trafodion innovates in multiple areas and provides the following capabilities:

• It has the ability to define Primary Keys, which can be simple or composite.

• It supports secondary indexes (which HBase does not). • It adds a broader scope of ACID over HBase and allows

transactions to span over multiple SQL statements, tables, and rows.

• It defines data types for columns, though HBase treats any stored data as an array of bytes.

Architecture

In this section, we will look at the core architectural underpinnings of Trafodion. Trafodion is not a storage engine per se but an SQL execution engine that leverages HBase and, hence, Hadoop and HDFS, under the covers. We will first look at how Trafodion fits the scheme of things within the Hadoop ecosystem and then take a look a level deeper into the architecture of the Trafodion engine.

Trafodion is designed to build upon and leverage Apache Hadoop and HBase core modules. Trafodion extensively uses HDFS and Zookeeper, the tow vital cogs within the Hadoop ecosystem. Internally, Trafodion interacts with HDFS, HBase, and Zookeeper, using Java-based APIs. Trafodion leverages the scalability, elasticity, and availability aspects of Hadoop and leverages parallel performance, load balancing across regions of data available from HBase and weaving the two together to form a true transactional processing system on the big data platform.

Trafodion adds its own ODBC and JDBC drivers for third-party tools and APIs. It also adds its own SQL engine, which sits atop HBase and HDFS and adds a Distributed Transaction Manager module to coordinate transactions within the database.

Trafodion is architected in three distinct layers: Client, SQL, and Storage layers. Client layer : This allows connectivity to third-party

applications, tools, and BI engines, using standard ODBC/ JDBC drivers. It supports both Type 2 and Type 4 JDBC drivers.

SQL layer : This consists of the services for managing the different database-related objects and operations. It includes basic services such as connection management, SQL

compilation and generation and execution of execution plans, and transaction and workload management.

Storage layer : This is the set of services Trafodion requires from HDFS and HBase to store and manage the different database objects. On this layer, Trafodion maps the standard SQL queries and operational/transaction semantics to native HBase API calls. Trafodion can manage data residing in Hive and HBase—labeled as external data.

At a very high level, Trafodion’s architecture looks as shown in Figure  5-8 . Figure 5-7. Trafodion and the Hadoop ecosystem

CHAPTER 5 ■ SQL FOR STREAMING, SEMI-STRUCTURED, AND OPERATIONAL ANALYTICS

Upon request for a client connection, Trafodion’s database connection service, which resides in the SQL layer, processes the request and assigns the connection to a Trafodion Master process residing in the SQL layer. Zookeeper (not shown in the diagram) internally manages all the connections in coordination with the connection service.

Master process is the entry point for all SQL queries. On receipt of a DML query, the Master process hands over the query to the CMP (compiler and optimizer process), which checks to see if the query result or the query plan is cached and, if not, parses, compiles, and generates the optimized query execution plan. The query is then fanned out to all or one (depending on whether the query is to be handled in parallel) to the ESP residing in each of the region server nodes (Executive Service Processes), to perform the actual work of executing the query. The ESP process coordinates with the storage engine to fetch the data from the data store in the most optimized way and also works with the Storage layer to push down the query predicates.

Finally, when all the ESPs have returned the query results, the Master process assembles all the fragments of query results and returns the result to the Client layer. In the event that parallel processing of the data by the ESPs is not required (depending on the type of the query), the Master process interacts directly with the HBase API to do the work and get the results.