
We presented HAIL (Hadoop Adaptive Indexing Library), a twofold approach towards zero-overhead indexing in Hadoop MapReduce. HAIL introduced two indexing pipelines that address two major problems of traditional indexing techniques. First, HAIL static indexing solves the problem of the long indexing times that previous indexing approaches in Hadoop required. This was a severe drawback of Hadoop++ [28], which had to run expensive MapReduce jobs just to create its indexes. Second, HAIL adaptive indexing allows us to automatically adapt the set of available indexes to previously unknown or changing workloads at runtime with only minimal costs.

In more detail, HAIL static indexing allows users to efficiently build clustered indexes while uploading data to HDFS. Our novel concept of logical replication enables the system to create a different sort order (and hence a different clustered index) for each physical replica of a data set without additional storage overhead. This means that in a standard system setup, HAIL can create three different indexes (almost) for free as a byproduct of uploading the data to HDFS. We have shown that HAIL static indexing also works well for a larger number of replicas. For example, in our experiments HAIL created six different clustered indexes in the same time HDFS took to just upload three byte-identical copies without any index.
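The idea of logical replication can be illustrated with a minimal sketch. It is not HAIL's implementation: the function names (`build_replicas`, `range_lookup`), the in-memory rows, and the attribute names are hypothetical, and a sorted value list stands in for a real clustered index. The sketch only shows that every replica holds the same rows in a different sort order, so a range query can be routed to the replica clustered on the queried attribute.

```python
from bisect import bisect_left, bisect_right


def build_replicas(rows, sort_keys):
    """Return one copy of `rows` per replica, each sorted on a different attribute.

    Hypothetical helper: every replica stores the same logical data,
    only the physical order differs.
    """
    return {key: sorted(rows, key=lambda r: r[key]) for key in sort_keys}


def range_lookup(replicas, key, low, high):
    """Return rows with low <= row[key] <= high from the replica sorted on `key`."""
    replica = replicas[key]
    # Stand-in for a clustered index: the sorted key column of this replica.
    values = [r[key] for r in replica]
    return replica[bisect_left(values, low):bisect_right(values, high)]


# Illustrative rows; the attribute names are assumptions, not HAIL's schema.
rows = [
    {"ip": "10.0.0.3", "visit_date": "2012-03-01", "revenue": 7.5},
    {"ip": "10.0.0.1", "visit_date": "2012-01-15", "revenue": 1.2},
    {"ip": "10.0.0.2", "visit_date": "2012-02-10", "revenue": 3.4},
]

# Three replicas, as in a standard HDFS setup, each with its own sort order.
replicas = build_replicas(rows, ["ip", "visit_date", "revenue"])
hits = range_lookup(replicas, "revenue", 3.0, 8.0)
```

Because each replica is merely a permutation of the same rows, any one of them can still serve as a recovery copy for the others, which is why the different sort orders cost no extra storage.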

With HAIL static indexing, we can already provide several matching indexes for a variety of queries. Still, our static indexing approach has limitations similar to those of other traditional techniques when it comes to unknown or changing workloads. The problem is that users have to decide upfront which attributes to index, and it is usually costly to revisit this choice when an index turns out to be missing. We solve this problem with HAIL adaptive indexing. Using this approach, our system can create missing but valuable indexes automatically and incrementally at job execution time. In contrast to previous work, our adaptive indexing technique again focuses on indexing at minimal expense.

We have experimentally compared HAIL with Hadoop as well as Hadoop++ using different datasets and a number of different clusters. The results demonstrated the superiority of HAIL. For HAIL static indexing, our experiments showed that we typically create a win-win situation: for example, users can upload their datasets up to 1.6x faster than with Hadoop (despite the additional indexing effort!) and run jobs up to 68x faster than with Hadoop.

Our second set of experiments demonstrated the high efficiency of HAIL adaptive indexing in creating clustered indexes at job runtime and adapting to users' workloads. In terms of indexing effort, HAIL adaptive indexing has a very low overhead compared to a HAIL full scan (which is already 2x faster than a Hadoop full scan). For example, we observed a runtime overhead of 1% for the UserVisits dataset at an offer rate of 10%, and only for the very first job. The following jobs already run faster than the full scan in HAIL, e.g. ∼2 times faster from the fourth job onwards with an offer rate of 25%. The results also show that, even for low offer rates, our approach quickly converges to a complete index after running only a few MapReduce jobs (e.g. after 10 jobs with an offer rate of 10%). In terms of job runtimes, HAIL adaptive indexing improves performance dramatically. For a sequence of previously unseen jobs on unindexed attributes, runtime improved by up to a factor of 24 over HAIL without adaptive indexing and by up to a factor of 52 over Hadoop.
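The convergence figures quoted above follow from simple arithmetic, sketched below under one assumption that is ours and not spelled out in this section: each job indexes at most a fixed share of all blocks, given by the offer rate, and never re-indexes a block. The function name and the block count are hypothetical; the offer rate is passed as an integer percentage to avoid floating-point rounding.

```python
def jobs_until_fully_indexed(num_blocks, offer_rate_percent):
    """Count the jobs needed until every block carries the index.

    Assumption (not taken from the source): each job may index at most
    ceil(offer_rate * num_blocks) blocks that are not indexed yet.
    """
    # Integer ceiling of num_blocks * offer_rate_percent / 100, at least one block.
    per_job = max(1, -(-num_blocks * offer_rate_percent // 100))
    indexed = 0
    jobs = 0
    while indexed < num_blocks:
        indexed = min(num_blocks, indexed + per_job)
        jobs += 1
    return jobs


# Hypothetical dataset of 1000 blocks.
jobs_at_10_percent = jobs_until_fully_indexed(1000, 10)
jobs_at_25_percent = jobs_until_fully_indexed(1000, 25)
```

Under this assumption the index is complete after 10 jobs at an offer rate of 10% and after 4 jobs at 25%, which is consistent with the convergence after 10 jobs and with the speedup from the fourth job onwards reported above.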

Chapter 4

AIR: Adaptive Index Replacement in Hadoop

4.1 Introduction

Adaptive indexing has received considerable attention in the community. It is an attractive approach for providing reasonably good performance in the face of ever-changing or evolving workload patterns, without requiring human intervention. Several publications have looked at adaptive indexing in main memory databases [38, 43, 47, 59]. In our recent studies [88, 11], we present an overview of the field and explore further directions for adaptive indexing in the context of main memory and multi-core architectures. We also introduced an adaptive indexing algorithm into Hadoop MapReduce [80]. None of these works considers a space constraint on how many indexes can be created. Even though hard disk space can be considered cheap nowadays, we believe that it can be a limiting factor in the context of big data; it is therefore important to use the available resources efficiently. In past years, there has been extensive research on physical database design advisors [25, 96, 7] that take space constraints into account, but these advisors usually need a representative query sequence to provide meaningful advice on the physical design of the database. More recent work has looked at online index tuning [83, 84, 19], an online approach to the Index Selection problem. That research does not consider an adaptive indexing setting, where indexes are created as byproducts of query execution.

In this chapter we investigate the Adaptive Index Replacement problem, i.e. our formulation of the Index Selection problem in the adaptive indexing scenario.

The chapter is structured as follows. In Section 4.2, we define the Adaptive Index Replacement (AIR) problem. Section 4.3 discusses related work in the fields of adaptive indexing, indexing in Hadoop MapReduce, and buffer replacement algorithms. Section 4.4 then describes our cost model and a Mixed Integer Linear Programming (MILP) formulation of the offline AIR problem. Section 4.5 introduces our proposed algorithm, LeastExpectedBenefit-K, which solves the online AIR problem. Section 4.6 presents the evaluation of our algorithm using simulations and an experimental validation. Finally, Section 4.7 concludes this chapter.