Outlook and conclusions - Toward timely, predictable and cost-effective data analytics

consistently performs within 2% of the optimal algorithm [201]. We adopt this algorithm, i.e., the policy that chooses to service next the group with the maximal number of queries, and extend it with a notion of fairness. Our ranking-based algorithm was inspired by the rFEED [114] task scheduling algorithm. The problem of task scheduling has been studied extensively in the past [197]. However, while task scheduling algorithms assume that task execution times are independent, query execution time in our case depends on which group is active at the time of query execution, which, in turn, depends on the order of query execution. Thus, task scheduling algorithms are not directly applicable to our context.
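The policy described above can be illustrated with a minimal sketch: at every group switch, load the group with the most pending queries, and add a fairness term so that a lightly loaded group is not postponed indefinitely. The fairness term used here (an aging bonus proportional to the number of switches a group has waited) is an assumption made for illustration and is not the ranking function defined in the thesis. The names `Group`, `rank`, `next_group`, `schedule`, and `fairness_weight` are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A CSD disk group together with the queries waiting on its data."""
    group_id: int
    pending: list = field(default_factory=list)  # ids of waiting queries
    waited_switches: int = 0  # switches this group has waited while non-empty


def rank(group, fairness_weight):
    """Pending-query count plus a hypothetical aging bonus (assumption)."""
    return len(group.pending) + fairness_weight * group.waited_switches


def next_group(groups, fairness_weight=0.0):
    """Return the non-empty group with the highest rank, or None.

    Ties are broken in favour of the lowest group_id.
    """
    candidates = [g for g in groups if g.pending]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (rank(g, fairness_weight), -g.group_id))


def schedule(groups, arrivals=None, fairness_weight=0.0, max_steps=100):
    """Simulate group switches and return the order of served group ids.

    arrivals maps a step number to a list of (group_id, query_id) pairs
    that arrive just before the scheduling decision of that step.
    """
    arrivals = arrivals or {}
    by_id = {g.group_id: g for g in groups}
    order = []
    for step in range(max_steps):
        for gid, query in arrivals.get(step, []):
            by_id[gid].pending.append(query)
        chosen = next_group(groups, fairness_weight)
        if chosen is None:
            if not any(s > step for s in arrivals):
                break  # nothing pending and nothing left to arrive
            continue
        order.append(chosen.group_id)
        chosen.pending.clear()  # every query on the active group is served
        chosen.waited_switches = 0
        for g in groups:
            if g is not chosen and g.pending:
                g.waited_switches += 1  # waiting groups age by one switch
    return order
```

With `fairness_weight=0.0` the sketch reduces to the plain maximal-queries policy, under which a group holding a single query can wait for as long as a busier group keeps receiving new queries. A positive weight bounds that wait at the cost of occasionally serving a smaller group first.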

Hot and cold data classification and migration. There is a large body of research involving data classification in the context of main-memory databases or multitiered databases that can be used to identify hot and cold data, e.g., [76,84,167]. Enterprise databases have long used such algorithms to improve performance by caching hot data in low-latency storage devices. Similarly, databases have also used Hierarchical Storage Managers (HSM) to automatically manage migration of data between online, nearline, and offline storage tiers [164]. We do not consider the orthogonal problems of data classification or automatic data migration in our work. Rather, we focus on query execution over “cold data at rest” in the CSD after classification and migration have taken place.

9.7 Outlook and conclusions

In this chapter, we demonstrate that the usage of cold storage devices enables a new tier in the enterprise storage tiering hierarchy, named the Cold Storage Tier. We show that the cold storage tier is able to replace both the capacity and archival tiers in their functionality, thereby offering major cost savings for enterprise data centers. Furthermore, we show that data analytics can be run on such a platform by a judicious hardware-software codesign where both the database query execution engine and the CSD work in concert toward achieving a common goal – masking the high access latency of CSD group switches.

The implications and benefits of using CSD reach far beyond enterprise data centers, and are equally applicable to cloud providers. For instance, Cloud Service Providers (CSP) have already started deploying custom-built, rack-scale CSD explicitly targeted at cold data workloads [97,202,254]. By doing so, CSP have already reported substantial cost savings. For instance, according to a recent report from Facebook, the Open Vault cold storage system reduced their expenses by a third compared to conventional online storage; their Blu-ray-based cold storage system reduced power consumption by 80% over Open Vault [254]. Recognizing the potential of CSD, CSP have started offering hosted, low-cost cold storage services based on CSD, and such cold-storage-as-a-service offerings are quickly gaining popularity, offering cloud customers a chance to benefit from inexpensive storage [13,97].

We believe that the CSD benefit for cloud providers could go beyond offering just storage-as-a-service. By following the design and architecture of Skipper, CSP could offer cloud-hosted data analytics services, as providers could increase revenue by offering cheap analytics services on data stored on CSD, and customers could reduce total cost of ownership by running latency-insensitive analytics workloads on cold data stored on CSD.

Chapter 9. Data Analytics for a Penny

10 Concluding Remarks and Future Ahead

This thesis contributes to the quest of bridging the gap between traditional DBMS technology and the requirements of modern data analytics applications. With this work, we show that DBMS can still join the race for servicing new applications, as long as they employ an agile and adaptive approach. We believe that the insights and techniques presented in this thesis could pave the way for building fully adaptive DBMS able to service new, modern, data analytics applications.

In particular, lazy, workload-driven adaptivity is a promising path in reducing the data-to-insight time for data exploration applications, while autonomous adaptation enables non-DBMS-savvy users from other domains (e.g., business and science) to benefit from decades of research into database technology. Runtime data-driven adaptation of query operators is a promising way of tackling what has been a serious impediment to predictable query performance in DBMS for more than 40 years: suboptimal query plans proposed by the optimizer at compile time. Finally, we show how DBMS could benefit from new hardware offerings and substantially reduce the storage cost of enterprise data analytics solutions if they employ hardware-driven adaptation in which the storage system optimizes access to the data, rather than the DBMS having full control over it.

10.1 Thesis contributions and lessons learned

Looking at the advancements of current technology, this thesis presents the following technological and intellectual contributions:

• From the technological aspect, Chapter 7 presents the design and implementation of novel auxiliary design structures tailored for hiding the overhead of processing raw data files. Positional maps, caches, selective parsing and tokenizing all contribute significantly toward masking the overhead of raw data access. From the intellectual aspect, this work demonstrates that workload-driven adaptivity with zero preparation overhead presents a promising path toward servicing interactive data exploration applications. Moreover, the usage of workload as a driving force for performance tuning presents an autonomous way to decouple the users’ interest from the data growth, since ultimately not all data is interesting and needed to gain useful insights.

• Chapter 8 makes a technological advancement in DBMS access paths by introducing a novel hybrid access path operator, called Smooth Scan, able to replace the existing access path operators as it approximates the performance of optimal choices throughout the entire selectivity interval. Intellectually, with Smooth Scan we have demonstrated that pushing decisions from query optimization to query execution coupled with continuous learning and morphing could alleviate suboptimal access path decisions that hurt performance of long-running analytical queries. Looking long-term, data-driven adaptation presents a promising path in dealing with the high increase in data volumes and velocity, where the lack of statistics for a data set will be common rather than an exception.

• In Chapter 9, with Skipper, which is a new CSD-targeted query execution framework, we have shown that CSD present a promising path in reducing the cost of data stored in both private and public clouds. More importantly, we have learned and demonstrated that judicious hardware-software codesign is needed in order to fully exploit the advantages of new hardware technology and be able to service new data analytics applications.

Overall, the work done in the context of this thesis showcases that runtime adaptivity is key to dealing with the rising uncertainty about the workload characteristics that modern data analytics applications exhibit.

In document Toward timely, predictable and cost-effective data analytics (Page 185-188)