Overview - Scalable Data Analysis on MapReduce-based Systems

In Chapter 3, we have studied how to build a scalable framework for a computation intensive analysis, the combinatorial statistical analysis, based on the MapReduce paradigm. There are different challenges and difficulties to develop scalable parallel data processing techniques for data intensive applications such as the data cubes analysis.

For instance, an efficient solution of data intensive applications has to guarantee that the data shuffling and data read/write overheads among the cluster are minimized. Oth- erwise, the high overheads incurred during these analyses may significantly affect the performance. Furthermore, it is more challenging (compared to computation intensive analysis) to develop an effective load-balancing strategy - besides having to consider the computation overhead in each reducer. In addition, there is also a need to factor in the data I/O and shuffling overhead. Data cube analysis is such a very typical and popular representative data intensive analysis. In this Chapter, we tackle the problem of

developing an effective, scalable and practical parallel system for data cube analysis.

In many industries, such as sales, manufacturing, transportation and finance, there is a need to make decisions based on aggregation of data over multiple dimensions. Data cubes[31] are one such critical technology that has been used in data warehousing and On-Line Analytical Processing (OLAP) for data analysis in support of decision making.

Much research has been devoted to the data cubes analysis in the literature [7][95][96] [44]. However, existing techniques can no longer meet the demands of today’s work- loads. On the one hand, the amount of data is increasing at a rate that existing techniques (developed for a single machine or a small number of machines) are unable to offer acceptable performance. On the other hand, more complex aggregate functions (like complex statistical operations) are required to support complex data mining and statistical analysis tasks. Thus, new mechanisms are needed to efficiently support data cube analysis on more complex aggregate functions over a large amount of data.

Therefore, in this thesis, we are motivated to develop a scalable parallel data cube analysis platform for large-scale data. While we provide a new cubing algorithm, our key contributions lie in the design of techniques to extend the MR framework for efficient data cube analysis to broaden the application of data cubes primarily for append-only environments [79].

Our main contributions are as follows. First, we present a distributed system, HaCube, an extension of MR, for data cube analysis on large-scale data. HaCube modifies the Hadoop MR framework while retaining good features like ease of programming, scal- ability and fault tolerance. It also builds a layer with user-friendly interfaces for data cube analysis. We note that HaCube retains the conventional Hadoop APIs and thus is compatible with MR jobs. Second, we show how batching cuboids for processing can minimize the read/shuffle overhead to salvage partial work done for efficient data cube materialization. Third, we propose a general and effective load balancing scheme

LBCCC (short for Load Balancing via Computation Complexity Comparison) to ensure

that resources are well allocated to each batch. LBCCC can be used under both HaCube and MR frameworks. Fourth, we adopt a new computation paradigm, MMRR (MAP- MERGE-REDUCE-REFRESH), with a local store under HaCube. HaCube supports efficient view updates for different measures, both distributive such as SUM, COUNT and non-distributive such as MEDIAN, CORRELATION, and thus is able to support more applications with data cube analysis in a data center environment. To the best of our knowledge, this is the first work to address data cube view maintenance in MR-like systems. Finally, We evaluate HaCube based on the TPC-D benchmark with more than 3 billions tuples. The experimental results show that HaCube has significant performance improvement over Hadoop.

The rest of this chapter is organized as follows. Section 4.2 reviews some background material. In Section 4.3, we provide an overview of the HaCube architecture and computation paradigm. Sections 4.4 and 4.5 present our proposed data cube materialization and view maintenance approaches. In Section 4.6, we discuss some issues including the fault tolerance strategy and storage cost. We report our experimental results in Section 4.7 and summarize this work in Section 4.8.

4.2 Preliminaries

In this section, we introduce the notations and background of data cube materialization and view maintenance.

4.2.1 Data Cube Materialization

In OLAP, the attributes are classified into dimension attributes (the grouping attributes) and measure attributes (the attributes which are aggregated) [31]. Each GROUP

BY in a CUBE computation is defined as a cuboid which captures the aggregate data. To speed up query processing, these cuboids are typically stored into a database as views. The problem of data cube materialization is to efficiently compute all the views. If the cube is being built for the first time, we refer to its materialization as initial cube materialization. Figure 4.1 shows all the cuboids represented as a cube lattice with 4 dimensions A, B, C and D. all A AB BC CD ABCD C B D DA AC BD

ABC BCD CDA _DAB

Figure 4.1: A cube lattice with 4 dimensions A, B, C and D

4.2.2 Data Cube View Maintenance

The goal of cube view maintenance is to get the latest view when new data are produced and added. We refer to this newly produced data as delta data ∆D, the data used for the previous view materialization as base data D and the previously materialized cube as base view V . In terms of view update requirements, the measure functions can be classified into two categories: non-distributive and distributive measures [31].

Non-distributive measures are those whose updated views can only be reconstructed by recomputation based on the entire base data D and ∆D. In append-only appli- cations, these functions include STDDEV, MEDIAN, CORRELATION, and REGRES- SION functions.

Distributive measures are the ones whose views can be either updated by recomputation (as in non-distributive functions) or incrementally computed which is referred to as incremental computation. Incremental computation updates a view based on V and ∆D in two steps [52]: (a) In the propagate step, a delta view ∆V is calculated based on the ∆D. (b) In the refresh step, the updated view is obtained by merging V and ∆V . In append-only applications, functions that can be computed incrementally include SUM, COUNT, MIN, MAX and AVG. Note that we have classified algebraic functions, like AVG, as distributive measures. To avoid recomputation, views of algebraic functions can be updated by keeping some extra information. For instance, for computing AVG, we can record both the sum and count in the views.

Job Scheduler Task Scheduling Factory Map Merge Local Store Mapper ... Master Node Cube Analyzer Cube Planner &RQYHUWHU Cube Converting Layer

Task Scheduler Execution Layer OR DB Distributed File System Cube Request Submission Processing Node Refresh Reduce Reducer OR DB Distributed File System Map Merge Local Store Mapper Refresh Reduce Reducer ... Processing Node

Figure 4.2: HaCube Architecture

In document Scalable Data Analysis on MapReduce-based Systems (Page 83-87)