• No results found

query execution engine in either performance or capacity tiers. Thus, today, the performance tier is used for satisfying latency-sensitive real-time queries while the capacity tier is used for satisfying latency-insensitive batch queries. The archival tier is never used directly during query execution, but only during compliance verification or online media failure.

6.3 Cold storage devices

Over the past few years, storage hardware vendors and researchers have become cognizant of the gap between the HDD-based capacity tier and the tape-based archival tier. This has led to the emergence of a new class of nearline storage devices explicitly targeted at cold data workloads [25,202,222,232,253]. These devices, also referred to as Cold Storage Devices (CSD), have three salient properties that distinguish them from the tape-based archival tier.

First, they use archival-grade, high-density, Shingled Magnetic Recording-based (SMR) HDD as the storage media instead of tapes. Second, they explicitly trade off performance for power consumption by organizing hundreds of disks in a Massive Array of Idle Disks (MAID) configuration [67]. MAID arrays are similar to RAID arrays in their use of HDD as the primary storage medium. However, in contrast to the high performance 15K RPM HDD used in RAID arrays, MAID arrays use high-density SMR-based SATA HDD to increase storage capacity. In addition, while RAID arrays maintain all disks active and spinning at all times to reduce latency, MAID arrays spin down most disk drives and keep only a fraction of disks on at any given time. By doing so, they are able to densely pack hundreds or even thousands of disks in a single storage rack while staying within a limited power and cooling budget. This leads to the last point – CSD right-provision hardware resources, like in-rack cooling and power management, to cater to only the subset of disks that are spun up.

Energy efficiency is quite important for enterprise data centers, as power and related costs dominate the overall enterprise infrastructure cost, and are a major impediment toward scalability in data centers, since the data growth largely outpaces the improvement in the energy footprint [156]. According to the recent reports from James Hamilton, the energy efficiency leader for enterprise data centers, nearly half of all the costs for an enterprise data center goes to power and cooling [120]. The cost reduction in this domain, will substantially reduce the operational expenses of large-scale enterprise data centers. By vertically integrating hardware, software, cooling, and power management using a single converged design, CSD manage to win in all aspects and substantially reduce the overall operational expenses, offering storage cost/GB (and capacity) comparable to that of traditional offline tape archives. For instance, Spectra’s ArticBlue CSD is reported to reduce storage cost to $0.1/GB [169], while Storiant claims a total cost of ownership (TCO) as low as $0.01/GB per month [189]. Due to the use of HDD instead of tape, they, however, reduce the worst-case access latency from minutes to mere seconds–the spin up time of disk drives. Thus, CSD form a perfect middle ground between the HDD-based capacity tier and tape-based archival tier as shown in Figure6.3.

Chapter 6. Database Storage DRAM SSD 15k RPM HDD 7200 RPM HDD ns μs ms O N L I N E N E A R O F F Performance Capacity Archival Backup

Data Access Latency

St

or

ag

e Cos

t

CSD VTL

sec min hour

Offline Tape

Figure 6.3: CSD in the storage tiering hierarchy

Although CSD differ in terms of cost, capacity, and performance characteristics, they are identical from a behavioral stand point–each CSD is a MAID array in which only a small subset of disks is spun up and active at any given time. For instance, Pelican [25] packs 1,152 SMR disks in a 52U rack for a total capacity of 5 PB. The rack is internally made of 6 8U chassis. Each chassis is composed of 12 4U trays, and each tray contains 16 disks. Thus, the disks in Pelican can be conceptually visualized as being arranged in a 6× 16 × 12 cuboid as shown in Figure6.4. A shared backplane powers the trays but is configured to be sufficient to power only one active disk drive as depicted by the gray row in Figure6.4. Similarly, in-rack cooling is performed using multiple air flow channels, where each channel is shared by 12 disk drives situated across multiple trays. With such an arrangement, each channel can cool only one active disk drive as shown by the gray column in Figure6.4. Due to these cooling and power management design choices, Pelican hardware enforces strict restrictions on a set of disks that can be active simultaneously at any given time shown as the disks highlighted as the gray diagonal (in the case of Pelican only 8% of disks can be spun up simultaneously). Similarly, each OpenVault Knox [202] CSD server stores 30 SMR HDD in a 2U chassis, out of which only one can be spun up to minimize the sensitivity of disks to vibration.

The net effect of these limitations is that CSD enforce strict restrictions on how many and which disks can be active simultaneously (referred to as a disk group). All disks within a group can be spun up or down in parallel. Access to data in any of the disks in the currently spun up storage group can be done with latency and bandwidth comparable to that of the traditional HDD-based capacity tier. For instance, Pelican, OpenVault Knox, and ArticBlue are all capable of saturating a 10-Gb Ethernet link as they provide between 1-2 GB/s of throughput for reading data from spun-up disks [25,202,222].

However, accessing data on a disk outside the currently active group requires spinning down active disks and loading the next group by spinning up the new set of disks. We refer to this

6.3. Cold storage devices

D’

16 disks per tray

12 tr

ay

s per chassis

Figure 6.4: Pelican rack schematic

operation as a group switch. For instance, Pelican takes eight seconds to perform the group switch. Thus, the best case access latency of CSD is identical to the traditional capacity tier (in the order of ms), while the worst-case access latency is two orders of magnitude higher (in the order of sec).

Driven by the price/performance aspects of CSD, in Chapter9we examine how CSD should be integrated into the database storage tiering hierarchy. While doing so, we make a case for using CSD as a replacement for both the HDD-based capacity and archival tiers of enterprise databases.

Part II

Quest for timely, predictable and

cost-effective data analytics

7

Timely and Interactive Data Analytics

As data volumes become larger, data initialization (i.e., data loading and tuning) as a prereq- uisite to efficient data exploration turns into a major bottleneck. More data means more time to prepare and load the data into the database before being able to pose desired queries. Many applications already avoid using database systems, for example, scientific data analysis and social networks, due to the long data-to-insight time, that is, the time between getting the data and retrieving its first useful results. The situation will only aggravate in the future, where it is expected to have much more data than what we can move or store, let alone analyze.

Motivated by the requests for timely and interactive data analytics where users aspire for a quick interaction with their freshly acquired data, this chapter presents the design of a new paradigm in database systems, called NoDB. To reduce the data-to-insight time, NoDB systems skip the data initialization step, i.e., they do not require data loading and enable query processing directly over raw data files. Through the design and lessons learned by implementing the NoDB paradigm over a modern DBMS, we discuss the fundamental limitations as well as strong opportunities that such a research path brings. We identify performance bottlenecks specific for processing over raw files, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, this chapter introduces an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and incremental statistics collection that both further improve query execution performance.

NoDB implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of PostgreSQL and even outperforming it in many cases. The analysis shows that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect on database usability, as it enables scientists to benefit from database technology, while at the same time removing the burden from them of deciding how to prepare and tune the system.1

Chapter 7. Timely and Interactive Data Analytics

7.1 Introduction

The unprecedented amounts of generated data that outgrow the capabilities of query pro- cessing technology are creating a new era for database technology, an era of data deluge [111]. Many emerging applications, social networks, the Internet of Things, data exploration in scientific experiments, are all representative examples of this deluge [130]. Scientific analysis such as astronomy is soon expected to collect multiple Terabytes(TB) of data on a daily basis, while web-based businesses such as social networks or web log analysis are already confronted with a growing stream of large data inputs [220]. These requirements show a clear need for efficient big data processing to enable the evolution of businesses and sciences to the new era of data deluge.

Need for a change in database technology. Although Database Management Systems (DBMS)

remain overall the predominant data analysis technology, they are rarely used for emerging applications such as scientific analysis and social networks. This is largely due to the com- plexity involved; there is a significant initialization cost in loading data and preparing the database system for queries, which substantially prolongs the time to first insight, i.e., the moment the user is able to extract useful knowledge from his data. For example, a scientist needs to quickly examine a few TB of new data, received from a device such as a telescope or a sensor, in search of certain properties. Even though only a few attributes might be relevant for the task, or even worse, nothing is relevant for the given task, the entire data must first be loaded into a database. For large amounts of data, this means a few hours of delay, even with parallel loading across multiple machines. Besides being a significant time investment, such an approach involves extra computing resources required for a full load and increases energy consumption further affecting economic sustainability. Furthermore, this overhead can prove to be useless, as the user can decide to discard the loaded data shortly after realizing that it does not contain any interesting information.

Instead of using database systems, emerging applications rely on custom solutions that usually miss important database features. For instance, declarative queries, schema evolution and complete isolation from the internal representation of data are rarely present. The problem with the situation today is in many ways similar to the past, before the first relational sys- tems were introduced; there are a wide variety of competing approaches but users remain exposed to many low-level details and must work close to the physical level to obtain adequate performance and scalability.

The lessons learned in the past four decades indicate that in order to efficiently cope with the data proliferation in the long run, we will need to rely on the fundamental principles adopted by database management technology. That is, we will need to build extensible systems with declarative query processing but focus on self-managing optimization techniques tailored for the data deluge. A growing part of the database community recognizes this need for significant and fundamental changes to database design, ranging from low-level architectural redesigns to changes in the way users interact with the system [8,111,136,152,162,184,228].

7.1. Introduction

NoDB. We recognize this new need, which is a direct consequence of the data deluge and

data exploration as a new use case, and describe the roadmap towards NoDB, a new database design paradigm that we believe will affect the design of future database systems. The goal of NoDB is to make database systems more accessible to the user. NoDB enables timely and interactive data exploration by eliminating major bottlenecks of current state-of-the- art technology that increase the data-to-insight time. The data-to-insight time is of critical importance as it defines the moment when a database system becomes usable and thus useful. The NoDB paradigm changes the way a user interacts with a database system by eliminating data loading and reducing the initialization time to zero. Instead, NoDB advocates for querying directly over raw data (instantaneously as data arrives) and extends traditional query processing architectures to work over raw data.

Querying raw files directly, i.e., without loading, has long been a feature of database systems. For instance, Oracle calls this feature external tables. Unfortunately, such features are hardly sufficient to satisfy the data deluge demands, since they repeatedly scan entire files for every query. Instead, we propose to redesign the query processing layer of database systems to incrementally and adaptively query raw data files, while automatically creating and refining auxiliary structures to speed up future queries. Using a mature and complete implementation over a modern DBMS (PostgreSQL), we identify and overcome fundamental limitations in NoDB systems. We show how to make raw files first-class citizens without sacrificing query pro- cessing performance by introducing several innovative techniques such as selective parsing, adaptive indexing that operates over raw files, caching techniques and incremental statistics collection over raw files. Overall, we describe how to exploit traditional DBMS to conform to the NoDB philosophy, identifying limitations and opportunities along the way.

The contributions of this chapter are as follows:

• This chapter presents necessary steps to convert a traditional DBMS (PostgreSQL) into

a NoDB system (PostgresRaw). To address the overhead of repeated file access and its parsing which are the main bottlenecks to efficient processing, we design an innovative adaptive indexing mechanism that makes the trip back to data files efficient.

• We demonstrate that the query response time of a NoDB system can be competitive

with a traditional DBMS which loads data a priori, if we use the workload as a driver to build adaptive indexes, caches and statistics that accelerate future queries.

• We show that NoDB systems provide quick access to the data under a variety of work-

loads and micro-benchmarks. PostgresRaw query performance improves gradually as it processes additional queries and it quickly matches or outperforms traditional DBMS, including MySQL and PostgreSQL.

• We describe challenges coming from raw query processing and discuss opportunities

Chapter 7. Timely and Interactive Data Analytics