Related Work - Utilizing query logs for data replication and placement in big data applications

Most of the existing re-declustering literature is about data availability issues and imbalanced load prevention schemes in case of failures. There are a few works

that tries to address changes in access frequencies as well.

One of the earliest declustering approaches is chained declustering [36], a scheme that assumes the servers are aligned over a ring and places replicas of data items onto consecutive servers over the ring. Chained declustering provides high availability and good load balance in the event of disk failures (if these failures do not occur on adjacent nodes). A more recent study ([37]) that stems from chained declustering and suggests which nodes can be closed/removed on purpose in a chain-declustered system for energy efficiency can be considered relevant to our discussions. Unfortunately, these approaches do not utilize query logs and provide a fixed assignment of data items to servers. Thus, they do not propose new data assignments under changing query workloads or addition/removal of disks.

In [38], an approach that tries to provide scalability for hash-based distributed files is discussed. Dynamic addition and removal of data items (records in a file) and changing query patterns are considered. The scalability issues are addressed via bucket (data item group) splits and migrations. Automatic server addition schemes to maintain fixed response times are also discussed, However, the approaches in [38] do not consider replication.

An automized data migration tool called Aqueduct that reacts to changing data access patterns by migrating “hot” data items is described in [39]. The Aqueduct system features a feedback control loop that regulates the speed of the data migration in the parallel disk system while hiding performance impact on the system. This approach is similar to our approach in the sense that it tries to ensure that the migration operations do not have a negative impact on query processing performance. However, unlike our approach which tries to strike a balance between migration and query processing costs, the schema in [39] tries to achieve this by performing migration only when all I/O requests are served. Another paper that uses replication and migration for improving throughput and load balancing is [40]. This approach, which is called DORA, assigns data items dynamically and integrates replication of files into the assignment to effectively distribute requests on “hot” data items across all servers of the system. DORA

has schemes for dynamic replica removal for “cold” files. Both works listed above can adapt to query pattern changes via dynamic assignments, however, non of them consider server addition or removal operations to the parallel database system.

Chapter 4 Query Processing in Replicated

Inverted Indexes

Due to recent advances in big-data processing and serving technologies, utilizing term-partitioned indexes in parallel query processing systems became a viable alternative. In term-partitioned inverted indexes, replication of inverted lists associated with the most frequent terms is employed to improve the performance of the query processing system. In this chapter, we adopt a recently proposed replicated-hypergraph-partitioning-based approach for generating replicated, term-partitioned indexes and evaluate the performance of this approach against state-of-the-art partitioning and replication schemes. We also discuss various scheduling schemes that are required when replication is involved. We investigate these schemes on a realistic parallel query processing system We provide extensive experimental analysis performed up to 32 processors to show that proposed schemes are superior to the state-of-the-art alternatives.

4.1 Introduction

In this chapter, we present our approaches in implementing a parallel query processing system that supports term-based distribution and replication. Even

though search systems are used extensively, and the internal working structures of such systems are more or less known generally, we found that algorithmic and engineering-wise decision made during the implementation of such systems greatly effect the overall performance.

In state-of-the-art search engine systems, document-based inverted index distribution is preferred over the term-based distribution due to the following two reasons: (i) following a parallel crawling phase, building a document-based distributed index becomes easier than building a term-based distributed index, (ii) document-based distributed indexes achieve better load-balancing performance during query processing. However, due to recent advances in big-data processing technologies, it is possible to build term-based distributed inverted indexes in acceptable times [41]. Furthermore, the load-balancing inferiority of term-based distributed indexes can be alleviated via replication schemes. Thus, together with replication, term-based distribution becomes a viable alternative to doc- based distribution.

In this chapter, we consider a parallel query processing system utilizing a term-partitioned inverted index where replication of terms are applied to improve performance. Our contributions are fourfold:

• We implement a successful parallel query processing system and we provide details of our implementation and the reasons behind our design choices. • We utilize a successful replication approach based on replicated hypergraph

partitioning, which was recently proposed in [5]. Our experimental results demonstrate that this approach is much more successful than the state-of- the-art partitioning and replication schemes.

• When there is replication, the problem of selecting the replica to be used in query processing arises. We show that this problem can be reduced to the set-cover problem. Furthermore, we propose various heuristics for scheduling replicated terms.

on our parallel query processing system and report performance analysis with respect to various different metrics.

This chapter is organized as follows: In Section 4.2, we provide the necessary background. In Section 4.3, we describe the details of our parallel query processing system. In Section 4.4, we investigate the performance of various existing partitioning and replication schemes. In Section 4.5, we discuss scheduling issues in replicated term-partitioned indexes. In Section 4.6, we compare investigated methods and algorithms over the proposed parallel query processing system and discuss the results. Finally, in Section 4.7, we briefly review the related literature.

In document Utilizing query logs for data replication and placement in big data applications (Page 84-89)