Adaptive operators - Toward timely, predictable and cost-effective data analytics

All mentioned approaches that perform dynamic plan changes are examples of inter-operator adaptivity, where the adaptation mechanism is employed between operators, i.e., it mostly pertains to the operator order. Adaptive operators, on the other hand, are more fine-grained as they encapsulate the adaptation mechanism within their own algorithm. Adaptive operators offer greater flexibility when it comes to scheduling the order of tuples flowing through it. This flexibility enables operators to proceed even when data from one source blocks its arrival at expense of increased memory consumption (e.g., Symmetric Hash Join [252], Multi-way Join [246], X Join [242], Ripple Joins [116]). Unpredictable data arrival is also a motivation for the work introduced in Chapter9, where in order to optimize for the expensive data access on cold storage devices, hardware-driven query execution is introduced (i.e., hardware decides on the order in which data is sent).

Robustness and adaptation to data characteristics at the intra-operator level are considered in [16,17,28,61,100,183]. Despite a lot of efforts in fixing suboptimal decisions, little attention has been paid to the access paths selection problem. Nonetheless, a suboptimal decision at the level of access paths has a highly detrimental effect on the overall query performance, since the access paths touch most of the data before any filtering has been applied. The detrimental effect has already been shown in Figure4.3, where the suboptimality at the access path level resulted in a 75× increase in query execution time. We fill the need for adaptation at the access path level in Chapter8by introducing a hybrid adaptive access path called Smooth Scan. Smooth Scan guarantees nearly optimal performance throughout the entire

Chapter 5. Improving Query Performance through Corrective Actions

range of possible selectivities, thereby preventing poor execution cases as a consequence of suboptimal decisions. Unlike [16,17], however, Smooth Scan does not waste any resources by doing double work, nor does it require a serious change of the database architecture. Moreover, since the high risk of having incomplete statistics in the case of ever increasing data sets still remains, Smooth Scan is completely statistics-oblivious.

6 Database Storage

This section discusses the storage aspects of databases. As enterprise databases traditionally use storage tiering, where the data waterfalls for the high cost, low-latency tiers into the low cost, high latency tiers, we discuss the implications of the tiering hierarchy on the database cost and performance. We further discuss a newly appeared hardware, named Cold Storage, as a promising avenue to restructure the traditional storage tiering hierarchy.

6.1 Database storage tiering

Enterprise databases have long used storage tiering for reducing capital and operational expenses. Traditionally, databases used a two-tier storage hierarchy. An online tier based on enterprise HDD provided low-latency random access (ms) to data. The backup tier, in contrast, was based on ofﬂine tape cartridges or optical drives, and provided low-cost, high-latency (hours) storage for storing backups to be restored only during rare failures.

As databases grew in popularity, the necessity to reduce recovery time after failure became important. Further, as regulatory compliance requirements forced enterprises to maintain long-term data archives, the ofﬂine nature of the backup tier proved too slow for both storing and retrieving infrequently accessed archival data. Thus, a new tier dubbed archival tier became popular. Archival tier was implemented using nearline storage devices, like robotic tape libraries (VTL) or optical jukeboxes, that could store and retrieve data automatically without human intervention in minutes.

Hierarchical Storage Managers (HSM) were developed to automatically manage migration of data between online and archival tiers, implement different backup schedules for each tier, and integrate with ofﬂine storage for disaster recovery [251]. Databases used HSM to implement multitier storage hierarchies by associating policies with different data items (archived redo logs, backups of data ﬁles, etc). For example, Oracle uses a HSM called Sun Storage Archive Manager (SAM) to automatically move data between a disk-based online tier and a tape-based archival tier [231].

Chapter 6. Database Storage DRAM SSD 15k RPM HDD 7200 RPM HDD VTL ns μs ms min hour Offline Tape O N L I N E N E A R O F F Performance Capacity Archival Backup

Data Access Latency

St

or

ag

e Cos

t

Figure 6.1: Storage tiering for enterprise databases

Over the past decade, much attention has been paid to the online tier due to three main reasons: 1) the emergence of flash-based solid state storage, 2) the declining price of DRAM, 3) demand for low-latency, real-time data analytics. As a result, the traditional online tier has been decomposed into a low-latency, SSD or RAM-based performance tier, and a high density, HDD-based capacity tier. Databases classify data as hot or cold depending on access patterns, store them in the appropriate tier, and enable queries over data stored in both tiers. For instance, SAP’s Business Warehouse (BW) product uses SAP HANA to manage a DRAM-based performance tier and Sybase IQ to manage a HDD-based capacity tier [69]. Oracle uses SAM QFS to seamlessly manage flash, disk, and tape tiers [231]. Thus, as shown in Figure6.1, all modern enterprise databases nowadays use a four tier storage hierarchy, where performance, capacity, archival, backup tiers are implemented using three storage types (online, nearline, and offline).

Enterprise databases have enforced a strict separation of functionality across storage tiers predominantly dictated by the access latencies of corresponding storage devices. Given the demand for real-time analytics, the performance tier is used to satisfy latency-sensitive real- time queries and the capacity tier for latency-insensitive batch queries. Unlike these online storage devices, any access to data in nearline storage must be mediated by the HSM, as it must be located and transferred from a nearline device to an online device before it can be used. Given the prohibitively high access latencies (minutes) associated with nearline storage, databases store all data that must be accessible by the query execution engine (for both batch and interactive queries) in either performance or capacity tiers. Thus, the nearline tier is never used directly during query execution, but only to retrieve archived data during compliance veriﬁcation, or backup during media failure.

6.2. Proliferation of cold data

In document Toward timely, predictable and cost-effective data analytics (Page 67-71)