Thesis Organization - Toward timely, predictable and cost-effective data analytics

The remainder of this thesis is organized as follows:

Part I (Chapter 2 to Chapter 6) gives the necessary background on the topics discussed in this thesis. Chapter 2 provides a quick look at the internals of a DBMS. Chapter 3 discusses the traditional query processing life cycle, from query optimization to query execution, and illustrates the problems of the current execution model with respect to the characteristics of modern applications discussed in Section 1.3. We then describe the traditional database tuning procedure in Chapter 4, in both offline and online settings, and examine the current state of affairs in physical design. As with query processing, we discuss the problems of the current tuning procedure with respect to modern data applications. Chapter 5 gives background on the state of the art in adaptive query processing, which is the processing model advocated in this thesis. Lastly, Chapter 6 discusses storage techniques in enterprise data centers, with special attention given to cold storage devices, which are the storage devices we analyze in Chapter 9.

Chapter 7 introduces our efforts in enabling interactive and timely data exploration. We demonstrate that data loading is a major bottleneck toward enabling timely data exploration and present NoDB, a paradigm for building database systems that skips data loading entirely, leaving data files in their original raw format. To be competitive with traditional DBMS, NoDB progressively builds auxiliary design structures (e.g., positional indexes, caches and incremental statistics) as a side effect of the user’s queries run on the system.

Chapter 8 focuses on the predictability aspects of query processing. We first isolate suboptimal access path decisions proposed by the optimizer as a major bottleneck toward achieving predictable performance. We then introduce Smooth Scan, a new paradigm in building access path operators that employs continuous adaptation and morphing at runtime by transforming from one physical alternative to another (i.e., from an index access to a sequential scan) based on the observed data distributions (selectivity factors). We show that a system with Smooth Scan requires no up-front access path decisions from the optimizer, nor does it need accurate statistics to provide near-optimal performance.

Chapter 9 focuses on the monetary aspects of data analytics. To decrease the cost of data analytics services, we use cold storage devices (CSD) as primary storage for enterprise and cloud-hosted databases. This chapter describes Skipper, a new CSD-targeted query execution framework that employs an out-of-order, CSD-driven query execution model based on multi-way joins, in combination with efficient cache management and I/O scheduling strategies, to hide the non-uniform access latencies of cold storage devices. As a result, Skipper offers performance comparable to that of systems storing data on HDD, at a significantly lower cost.

Chapter10 offers conclusions, summarizes the thesis contributions and its impact, and presents possible future steps toward building fully adaptive hands-free database systems which will be able to service modern data applications.

Part I

2 A Look Inside the DBMS

A database is defined as a collection of data [203], usually describing the activities of different organizations. Companies generate data in many formats, all of which could be considered a database in some form. A database management system (DBMS) is software designed to assist in managing these large collections of data. Given the value of the insights gained from data, DBMS are becoming increasingly popular in businesses. From the user’s perspective, using a DBMS has several important advantages over writing application-specific code to manage a database. These are primarily:

• Data Independence: The DBMS provides an abstract view of the data (in the form of relations) that hides the details of the actual data format and storage.

• Efficient Data Access: DBMS use sophisticated techniques developed in the past 40 years to efficiently store and access the data.

• Reduced Application Development Time: DBMS support a set of functions common to many applications accessing the data. In addition, they offer a high-level interface to the data (SQL), facilitating fast development.

• Concurrent Access and Data Integrity: A DBMS gives users the impression that they are alone in the system, while the DBMS seamlessly schedules concurrent accesses to the data. Moreover, since the data can be accessed only through the DBMS, it is easy to impose different integrity constraints on the data.

Given all the advantages, it is clear why DBMS are an indispensable tool in many business domains.

2.1 The Relational Model

DBMS owe their success and wide adoption to the declarative nature of expressing requests, which is very much akin to the way humans think of information. A database user typically expresses what information they want, while the DBMS decides how to extract this information. DBMS owe this flexibility to Edgar Codd, who introduced the Relational Model in 1970 [65], which has shaped the database arena to this day. The Relational Model, which is based on first-order logic, represents data items using tuples (rows), which when grouped together represent relations (tables). Each tuple contains multiple attributes (columns) and has a unique key that helps distinguish between different tuples. The relation is described by a heading, which is an unordered set of attributes corresponding to the columns of the relation.

The number of tuples in a relation defines its cardinality, and the number of attributes defines its degree. For example, a students relation in the database of a university would contain one entry for each student of the university. Each tuple would then contain specific information expressed through a number of attributes for a given student, e.g., name, address, telephone number, year of study, etc.
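The terminology above can be made concrete with a minimal Python sketch. It is only an illustration of the definitions, not how any DBMS stores relations: the `Relation` class, the `student_id` key attribute, and the sample rows are all assumptions introduced for this example.

```python
class Relation:
    """Toy relation: a heading (unordered attribute set) plus keyed tuples."""

    def __init__(self, heading, key):
        self.heading = frozenset(heading)  # unordered set of attribute names
        if key not in self.heading:
            raise ValueError("the key must be one of the attributes")
        self.key = key
        self._tuples = {}  # key value -> tuple, stored as a dict of attributes

    def insert(self, **attributes):
        # Every tuple must provide exactly the attributes in the heading.
        if set(attributes) != self.heading:
            raise ValueError("tuple does not match the heading")
        key_value = attributes[self.key]
        # The key distinguishes tuples, so duplicates are rejected.
        if key_value in self._tuples:
            raise ValueError("duplicate key")
        self._tuples[key_value] = dict(attributes)

    @property
    def cardinality(self):
        return len(self._tuples)  # number of tuples (rows)

    @property
    def degree(self):
        return len(self.heading)  # number of attributes (columns)


# Hypothetical students relation; student_id is an assumed key attribute.
students = Relation(
    heading=["student_id", "name", "address", "telephone", "year_of_study"],
    key="student_id",
)
students.insert(student_id=1, name="Ada", address="1 Main St",
                telephone="555-0100", year_of_study=1)
students.insert(student_id=2, name="Edgar", address="2 Oak Ave",
                telephone="555-0101", year_of_study=3)

print(students.cardinality, students.degree)
```

With two students and five attributes, the relation has cardinality 2 and degree 5. Adding a student changes the cardinality, while the degree changes only if the heading itself changes.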

In the Relational Model, the logical organization of the data (in the form of tuples and relations) is decoupled from the physical organization of the data (the way data is organized on disk or any other storage device). Together with declarative languages such as SEQUEL [48], the Relational Model allowed database users to make declarative requests for accessing data. The system was then responsible for choosing how the data is physically accessed. By decoupling the physical and logical layers, this flexibility freed the application for the first time from caring about physical storage, which opened the door to impressive performance. The first system that adopted Codd’s Relational Model was an IBM research prototype known as the Peterlee Relational Test Vehicle (PRTV) [238]. IBM later produced System R [20], the first commercially viable relational database system (RDBMS), whose design is the basis for many modern DBMS. System R featured concurrency control, supported efficient updates, had a query optimizer (a component in charge of optimizing data access), and was the first system to implement the SQL language [48], a declarative language that is now adopted by all modern DBMS.
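The decoupling of logical and physical organization can be observed in any SQL system. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a convenient stand-in; the table, index name, and rows are assumptions made for this example and do not come from the systems discussed above. The same declarative query is run before and after the physical design changes (an index is added), and the query text stays untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, name TEXT, year_of_study INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Ada", 1), (2, "Edgar", 3), (3, "Grace", 3)],
)

# The request states WHAT is wanted, not HOW to fetch it.
QUERY = "SELECT name FROM students WHERE year_of_study = ? ORDER BY name"


def run(connection, year):
    """Return the query result and the access plan the system chose."""
    rows = [r[0] for r in connection.execute(QUERY, (year,))]
    # The last column of EXPLAIN QUERY PLAN holds the plan description.
    plan = [r[-1] for r in connection.execute("EXPLAIN QUERY PLAN " + QUERY, (year,))]
    return rows, plan


rows_before, plan_before = run(conn, 3)

# Change only the physical design; the query is not modified.
conn.execute("CREATE INDEX idx_students_year ON students (year_of_study)")

rows_after, plan_after = run(conn, 3)

print(rows_before, plan_before)
print(rows_after, plan_after)
```

The result rows are identical in both runs. The plan text is free to differ because the system, not the application, decides how the data is physically accessed; the exact wording of the plan depends on the SQLite version.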
