Which SQL Engine Leads the Herd?

(1)

Which SQL Engine

Leads the Herd?

(2)

Executive Summary

In an independently audited benchmark of three popular SQL-on-Hadoop® implementations, IBM showed that Hadoop is ready to run OLAP and complex query workloads at a fraction of the cost of traditional systems – that is, if you choose the right technology.

With so many vendors making claims about the performance and compatibility of SQL-on-Hadoop, IBM decided to put leading distributions to the test, conducting the first ever Hadoop-DS benchmark. The test compared IBM’s Big SQL with Cloudera’s Impala™ and Hortonworks®_{Hive™ 0.13.} Hadoop-DS is a Hadoop Decision Support benchmark developed by IBM modeled after the highly regarded

Transaction Processing Council Decision Support (TPC-DS™) benchmark. To help make the process fair, IBM established three competing teams running each Hadoop distribution on identical hardware configurations. IBM engaged the services of an independent TPC professional to audit and help document the result. Among the key findings were:

Compatibility Matters – IBM’s Big SQL was the only solution able to execute all 99 TPC-DS queries at scale with minor modifications permissible under TPC rules. Cloudera’s Impala was able to run 52 queries, and Hive ran 581 queries in a manner that complied with the TPC rules. For the remaining queries, re-writes were required (some extensive) and some queries could not be made to run at all. Lack of SQL compatibility is a nuisance in a benchmark, but could be a costly disaster in a production environment.

Throughput matters – When comparing Big SQL against the subset of queries that could actually run on competing platforms, IBM ran queries on average 3.6 times faster than Impala and 5.4 times faster than Hive on a 10 TB scale test2. Not only was Big SQL the only engine able to run the Hadoop DS workload – it ran the workload significantly faster as well for both Hadoop-DS single-user and multi-user tests.

Scale matters – Hadoop is about big data after all. IBM had originally planned to compare all three vendors at a 30TB scale, but it achieving stability at scale was a challenge. While Big SQL could reliably execute all queries at a 30TB scale, the competitive platforms could not, exhibiting various run-time errors. The comparison was made instead at the 10TB scale where results could be repeated, and thus audited.

These findings are compelling. Not only was IBM’s Big SQL the only Hadoop solution tested able to actually run the complete set of queries, but it was found to be the fastest, the most scalable, and the most reliable as well.

1

These results refer to initial testing at 1 GB scale for compatibility. At 10 TB scale, both Impala and Hive ran fewer queries.

2

(3)

The case for SQL on Hadoop

In today’s data centers SQL has become a ubiquitous way to access and manipulate data. No longer a tool used just by developers and database administrators, today most professionals and analysts have at least some knowledge of SQL or use tools that rely on SQL as a standard.

While non-structured data types gets all the attention for big data workloads, the majority of real projects involve

transactional or log data3 – data formats generally well-suited to manipulation with SQL. While Hive was the only game in town just a few years ago, today there are at least a dozen competing commercial and open source efforts around SQL-on-Hadoop. Vendors are competing based on performance, compatibility, and the ability to scale to support real-world production workloads.

3

70% of 465 survey respondents cite transactional data as a primary target for big data initiatives - Gartner research note “Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype“ Sept 12 2013 Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk

Standards are job one

Standards are important in every industry. They help reduce cost, expand markets, spur innovation, reduce risk and

generally give organizations a competitive edge. This is true of SQL as well.

In almost every organization, SQL is at the heart of enterprise data used in transactional systems, data warehouses, columnar databases and analytics platforms to name just a few examples. Additionally, a vast number of commercial and in-house developed tools used to access, manipulate and visualize data rely on SQL. SQL is lifeblood of the modern transaction and decision support systems.

The last thing an organization wants to do is introduce technology that is not compatible with what they have. It’s nice to be able to use open source software, but at the end of the day standardization is what matters most – the software needs to work.

Evaluating SQL-on-Hadoop solutions

As customers know, testing their own applications is the only benchmark that matters, but when it comes to a standard benchmark, the TPC Benchmark™ DS (TPC-DS) is among the most thorough. TPC-DS is a decision support benchmark that models several aspects of the business operations of a global retailer. Comprised of 99 separate queries, it models real-world business operations that companies in this and other industries would find familiar. While there are no official results at the time of this writing, TPC-DS is widely regarded as a fair and complete benchmark. The rigor and realism of the benchmark makes it almost impossible for vendors to “game” the

(4)

Figure 1 Query compatibility by SQL-on-Hadoop solution

SQL-on-Hadoop systems cannot meet several of the technology requirements of the TPC-DS benchmark, so IBM modeled the Hadoop-DS benchmark on TPC-DS, using the same data sets and queries, but not performing data maintenance operations, and not enforcing referential integrity or meeting other benchmark requirements not feasible with Hadoop systems. The benchmark is designed to model systems where operational data is used both to make business decisions quickly and to direct long range planning and operation. The types of queries involved fall broadly into four different categories.

 Reporting queries

 Ad-hoc queries

 Iterative OLAP queries

 Data mining queries

Because the sizes of businesses vary, the benchmark is designed to scale also model different sizes of warehouses. Standard scale sizes are 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB.

SQL-on-Hadoop – not all are equal

One of the first hurdles in conducting the benchmark is simply getting the queries to run across all three Hadoop environments. From this point of view, not all SQL-on-Hadoop

implementations are created equal.

As shown in Figure 1, in initial testing IBM Big SQL was able to run 99 of the standard TPC-DS queries after building the dataset. 87 queries ran “out-of-the box” and an additional 12 were easily modified within a few hours with minor syntax changes allowable under the TPC-DS benchmark

specification4.

Other distributions did not fare so well. In the case of Cloudera’s Impala, 35 queries ran un-modified, 17 required minor modifications complying with TPC-DS rules, and 36 required more extensive non-compliant modifications. More concerning was that some of the 99 queries could not be run at

4

(5)

all – either because no re-write to the query was found or because the queries would fail at run-time.

In the case of Hive .13 the situation was similar. 32 queries ran “out-the-box”, an additional 26 queries ran with compliant modifications, and 13 queries could be re-written with non-compliant modifications. As team scaled up the size of the test however queries that worked at smaller scale stopped working. At a 10TB dataset size 30 of the queries would not run at all. This exercise highlights the challenge that customers can be expected to encounter when seeking to adapt existing SQL schemas and applications to SQL-on-Hadoop implementations.

Beware of cherry pickers

Vendors have been making many performance claims related to the TPC-DS benchmark, cherry picking queries from the suite of 99 queries, and publishing only those queries that happen to work, and show an offering in the best possible light.

In some cases, vendors have even altered table schemas to avoid compatibility issues or boost performance. Clearly this is not a proper way to run a benchmark. In fact, the rules of the benchmark specifically forbid this practice.

The real news is not that selected queries can be made to run faster, but that many of the ANSI SQL queries in the benchmark simply don’t run at all on competitive platforms. You can just imagine the challenges associated with getting your own production application running on a database platform that doesn’t support 50% of your standard queries. This would amount to a “re-write” of the application, introducing risk, added-costs and certain delays.

Performance matters too

A full comparison between Impala, Hive and Big SQL could not be made because Hive and Impala could only run a subset of the queries. It was still possible though to compare results for the common set of 46 queries that all distributions were able to run at a 10 TB scale.

(6)

Figure 2: Time in seconds to run the common set of 46 queries on each SQL-on-Hadoop platform

The same ranking was observed in the multiuser test consisting of four concurrent streams of queries executing.

Big SQL means Big Investment Protection

The good news for IBM customers is that Big SQL offers significant investment protection. Not is Big SQL the only SQL-on-Hadoop test that was able to run all the queries, it was also the fastest, and the only engine to scale to a 30 TB dataset size.

What’s even better is that customers don’t need to compromise on standards. Rigorous SQL compatibility means that

customers can:

 Leverage existing investments in software, tools and people skills

 Run existing applications where appropriate over SQL-on-Hadoop data stores

 Runs queries faster, more efficiently and at a larger scale translating into a lower operations costs

Value beyond compliance

Standards compliance and performance are essential, but it is also important that a chosen SQL implementation play nice with others. In Hadoop, playing nice means a number of things:

 Supporting open data formats

 Using standard client-side database drivers

 Supporting built-in functions that SQL users expect

 Providing sophisticated security capabilities

 Federated access to multiple data sources

Open, standard data formats – SQL is a useful language. Thanks to standardization and decades of maturation, it is well known and adept at solving many problems. It is not however the only language available, nor is it always the best solution for every problem. Hadoop has an ever-expanding array of languages and tools for analyzing large datasets, but to be able to use these rich tools, data needs to exist in standard Hadoop data formats. Hadoop enthusiasts will be pleased to know that there is no such thing as a Big SQL data store. A table defined in Hive is a table defined in Big SQL and vice versa. Big SQL supports 100% native HDFS file formats. This is not true of all distributions, and customers should be wary of SQL-on-Hadoop implementations that introduce their own proprietary metadata.

Common client-side drivers – Beyond the ability to share SQL across platforms, Big SQL supports standard IBM client drivers allowing the same set of standards-compliant JDBC, JCC, ODBC, CLI and .NET drivers to be used across multiple databases and operating systems. Clients using these drivers can access IBM Big SQL, DB2®, IBM Informix® and third party database environments transparently. By combining a standard SQL implementation with industry standard drivers, the number of ISV applications that can interact seamlessly with IBM’s Big SQL is enlarged.

Built-in functions drive productivity - Having an SQL implementation that merely works is different than having an environment that makes users productive. Incorporating a rich library of over 250 built-in functions along with SQL OLAP functions, Big SQL is built for analytics. It provides advanced features including sub-query support, additional SQL types and global (session) variables. With these additional features users

0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000 18,000

(7)

can simply do more things, and answer more questions from within the SQL environment.

When using other SQL solutions that may lack these features, users may find themselves writing custom code to implement the same capabilities already built into Big SQL.

Security and auditing – For some SQL implementations, security is an afterthought. Big SQL was built with security in mind. User authentication is handled using standard

mechanisms including LDAP and Kerberos so that Big SQL fits seamlessly into your enterprise environment. Big SQL supports flexible authorization controls based on users, groups and roles. It uses standard SQL GRANT and REVOKE syntax familiar to database administrators. In addition to basic table-level access controls, Big SQL supports fine-grained role and column level access controls (RCAC). Fine grained access control and features like data masking help expand the range of solutions that Big SQL is applicable to. In addition to flexible authentication and authorization, Big SQL also provides extensive auditing facilities. In short, Big SQL brings the rich security features that RDBMS administrators expect to the world of Hadoop.

Federated queries – In modern data centers, data seldom exists in one place. Some data will exist in relational databases and other data will be in data warehouses or specialized column-oriented databases. Big SQL supports rich federation capabilities allowing users to write queries that access not only Hadoop-based data, but other databases as well. A single query may join data from Big SQL, Hive on Hadoop, a table on a Teradata warehouse, and data from an Oracle database.

So – Who leads the herd?

The findings of this benchmark are compelling. Organizations are heavily invested in SQL. The last thing customers need is one SQL dialect for Hadoop, and another for their existing database environments. The fact that Big SQL was the only SQL-on-Hadoop implementation able to actually run the Hadoop-DS workload is important. The fact that it is also faster, more scalable, more stable, and has a richer set of features is very impressive indeed.

 Big SQL was the only implementation able to run the full Hadoop DS benchmark with all 99 queries

 Big SQL delivered over three times the performance of the nearest competitor in the single user test

 Big SQL was the only offering able to scale to 30 TB and run the full workload at that scale

This result should not be surprising. IBM invented SQL after all and has over 30 years of experience building SQL query engines and optimizers. When it comes to SQL-on-Hadoop, IBM InfoSphere BigInsights with Big SQL clearly leads the herd.

For More Information

To learn more about Big SQL, download the free IBM whitepaper SQL-on-Hadoop without compromise at

https://www14.software.ibm.com/webapp/iwm/web/signup.do? source=sw-infomgt&S_PKG=ov23626

(8)

IMW14799-USEN-00

Please note: Performance is based on measurements and projections using

standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

3600 Steeles Ave East Markham, Ontario L3R 9Z7

IBM, the IBM logo, ibm.com, BigInsights, Cognos, DB2, Informix, InfoSphere, PureData and z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council

Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks is a trademark of Hortonworks Inc.

Hadoop and Hive are trademarks of the Apache Software Foundation Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Other company, product and service names may be trademarks or service marks of others.