Graph Cube on Multidimensional Networks

2.4 Graph Cube Analysis

2.4.3 Graph Cube on Multidimensional Networks

Subsequently, Zhao et al. proposed the Graph Cube model for multidimensional networks [94]. The model targets networks in which each vertex carries a set of multidimensional attributes, while the edges are homogeneous and carry no attributes.

Given n attributes attached to each vertex, Graph Cube generates 2^n cuboids, each of which is an aggregate graph built on a specific subset of the dimensional attributes. Besides the cuboid query, the authors also define a new class of queries, the crossboid query, which crosses multiple multidimensional spaces of the network.
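To make the 2^n count concrete, the following is a minimal sketch of cuboid enumeration and of aggregating a network on one chosen set of dimensions. The function names, the toy network and the choice of vertex and edge counts as the aggregate measure are illustrative assumptions, not the implementation described in [94].

```python
from collections import Counter
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions: 2^n cuboids, including the empty one."""
    return [subset
            for size in range(len(dimensions) + 1)
            for subset in combinations(dimensions, size)]


def aggregate_network(vertices, edges, dims):
    """Aggregate a network on the chosen dimensions (hypothetical helper).

    vertices: dict mapping vertex id -> dict of attribute values
    edges:    iterable of (u, v) pairs; edges carry no attributes
    Returns (vertex_counts, edge_counts) keyed by attribute-value tuples.
    """
    group = {v: tuple(attrs[d] for d in dims) for v, attrs in vertices.items()}
    vertex_counts = Counter(group.values())
    # Edges are treated as undirected here, so the two endpoint groups are sorted.
    edge_counts = Counter(tuple(sorted((group[u], group[v]))) for u, v in edges)
    return vertex_counts, edge_counts


# Toy network with two vertex attributes (made-up data).
vertices = {
    1: {"gender": "F", "city": "NY"},
    2: {"gender": "M", "city": "NY"},
    3: {"gender": "F", "city": "LA"},
    4: {"gender": "M", "city": "LA"},
}
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]

cuboids = enumerate_cuboids(["gender", "city"])          # 2^2 = 4 cuboids
vcounts, ecounts = aggregate_network(vertices, edges, ("gender",))
```

With two attributes the sketch yields four cuboids; aggregating on `gender` alone collapses the four vertices into two aggregate vertices and counts the edges between and within the two groups.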

However, the Graph Cube model is restricted to multidimensional networks whose edges do not contain any attributes. In the real world, many information networks are attributed graphs in which both the vertices and the edges contain attributes. Building a data warehousing model on attributed graphs is more challenging and more important.

Moreover, information networks are normally large, so algorithms designed for a single machine cannot provide acceptable performance. However, none of the existing works provides a parallel and distributed solution for graph OLAP.

Therefore, we are motivated to design a new and more general graph cube model based on attributed graphs, and to develop scalable, effective and efficient parallel and distributed graph OLAP techniques that meet the requirements of large-scale graphs in real applications.

CHAPTER 3 COMBINATORIAL STATISTICAL ANALYSIS

3.1 Overview

In this chapter, we address the problem of building parallel solutions for one computation-intensive analysis, combinatorial statistical analysis (CSA). CSA has been widely used to find significant correlations among different objects, where significance is typically measured by statistical methods. Intuitively, CSA evaluates the significance of the association among a combination of objects by applying a statistical method such as the χ² test.
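As a reminder of what a single significance evaluation involves, here is a minimal sketch of Pearson's χ² statistic on a contingency table. The layout of the table (rows as the observed value patterns of one object combination, columns as two sample classes) and the numbers are assumptions made for illustration only.

```python
def chi_square_statistic(table):
    """Pearson's chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts: two value patterns (rows) across two sample classes (columns).
observed = [[10, 20],
            [20, 10]]
stat = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
```

The statistic is then compared against the χ² distribution with (r-1)(c-1) degrees of freedom to obtain a significance level. In CSA this evaluation has to be repeated for every enumerated combination, which is what makes the analysis expensive.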

For illustration, we take epistasis discovery, where CSA has been widely adopted, as a running example. Although we have chosen epistasis discovery for demonstration, our solution is not specific to this domain and should apply broadly to all CSA applications.

In this work, we propose a framework for efficient COmbinatorial Statistical Analysis systems on MapReduce (MR)-based Cloud platforms (COSAC). COSAC addresses the CSA problem in two phases:

1) Distribution Phase: We develop and compare different task distribution schemes that assign the large number of enumerated combinations to the processing units with the goal of balancing the load. Given a total of n objects, finding the associations among any m objects requires evaluating C(n, m) combinations. The scheme partitions the enumerated combinations into n-m+1 sets, each with a different number of combinations. These sets of tasks are then distributed to the processing units so that the number of combinations is balanced across the units.

2) Statistical Analysis Phase: Each node evaluates the statistical significance of the combinations allocated to it. We develop an optimization that shares the computations common to different combinations, and provide a technique called Integer Representation and Bitmap Indexing (IRBI) to speed up the statistical testing.

Together, these two phases address the two key challenges in CSA: balancing the load across the processing units in a distributed environment, and conducting the statistical testing efficiently.
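The counts in the distribution phase can be checked with a small sketch. The overview does not state how the n-m+1 sets are formed or which balancing rule is used, so both choices below are assumptions: combinations are grouped by their smallest object index, which gives exactly n-m+1 sets of distinct sizes when m >= 2, and the sets are assigned greedily, largest first, to the currently least-loaded unit.

```python
import heapq
from math import comb


def partition_sizes(n, m):
    """Sizes of the n-m+1 sets when m-combinations are grouped by smallest index.

    A combination whose smallest index is i (0-based) picks its remaining
    m-1 objects from the n-1-i objects that follow, giving C(n-1-i, m-1).
    """
    return [comb(n - 1 - i, m - 1) for i in range(n - m + 1)]


def assign_sets(sizes, num_units):
    """Greedy largest-first assignment of sets to processing units (assumed rule)."""
    heap = [(0, unit) for unit in range(num_units)]   # (current load, unit id)
    assignment = {unit: [] for unit in range(num_units)}
    for set_id in sorted(range(len(sizes)), key=lambda s: -sizes[s]):
        load, unit = heapq.heappop(heap)              # least-loaded unit so far
        assignment[unit].append(set_id)
        heapq.heappush(heap, (load + sizes[set_id], unit))
    loads = {unit: sum(sizes[s] for s in sets) for unit, sets in assignment.items()}
    return assignment, loads


sizes = partition_sizes(6, 3)            # n = 6 objects, m = 3 per combination
assignment, loads = assign_sets(sizes, 2)
```

For n = 6 and m = 3 the set sizes are 10, 6, 3 and 1, which sum to C(6, 3) = 20, and two units end up with 10 combinations each. The uneven set sizes are the reason a naive one-set-per-unit assignment would leave the load skewed.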

The COSAC framework includes three layers. The first layer is the index builder layer, which preprocesses the raw data to facilitate efficient data processing. The second is the analysis layer, which performs parallel combination enumeration and statistical analysis. Two analysis schemes have been proposed, Exhaustive Testing and Semi-Exhaustive Testing. Exhaustive Testing evaluates the statistical significance of all the combinations without losing any significant result. Semi-Exhaustive Testing analyzes only part of the combinations in order to prune the computation space. The third layer is the top-k retrieval layer, which helps users retrieve the top-k most significant results from the large volume of analysis results.
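The top-k retrieval step can be pictured with the following minimal sketch. It assumes that analysis results arrive as (combination, score) pairs and that a larger score means higher significance; both the record layout and the names are hypothetical and are not taken from the COSAC design.

```python
import heapq


def top_k_significant(results, k):
    """Return the k results with the largest score from any iterable of pairs."""
    return heapq.nlargest(k, results, key=lambda item: item[1])


# Made-up analysis output: (combination of objects, test statistic).
results = [
    (("a", "b"), 3.2),
    (("a", "c"), 11.7),
    (("b", "c"), 6.6),
    (("c", "d"), 0.4),
]
top2 = top_k_significant(iter(results), 2)
```

Because `heapq.nlargest` keeps only k items in memory, the same helper could be applied first to each partition of the results and then to the merged partial lists, which is one plausible way to fit the step into an MR job.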

Based on COSAC, we have designed and compared various flexible object combination enumeration schemes with regard to load balancing and scalability on large-scale datasets using the MR paradigm. The enumeration of combinatorial objects plays an important role in computer science and engineering. We also propose techniques that use integers to represent the long string raw data and adopt a Bitmap index to index each object over the samples, so that the analysis can be conducted on the representation data and the index data alone. Both optimizations are memory-efficient and CPU-efficient, and both contribute to efficient statistical testing. Furthermore, instead of conducting the testing for each combination independently, we study how to share computation across combinations, which yields significant performance savings. Extensive experimental evaluations have been conducted, and the results indicate that our framework is computationally scalable, efficient and practical.
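The idea behind the integer representation and the Bitmap index can be sketched as follows. The encoding order, the use of Python integers as bitmaps and the sample values are assumptions for illustration; the actual IRBI layout is defined later in the chapter.

```python
def encode_values(column):
    """Replace each distinct raw string by a small integer code."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in column]
    return encoded, codes


def build_bitmaps(encoded):
    """One bitmap per code: bit s is set when sample s holds that code."""
    bitmaps = {}
    for sample, code in enumerate(encoded):
        bitmaps[code] = bitmaps.get(code, 0) | (1 << sample)
    return bitmaps


def cooccurrence_count(bitmap_a, bitmap_b):
    """Number of samples in which both values occur (bitwise AND, then popcount)."""
    return bin(bitmap_a & bitmap_b).count("1")


# Made-up raw data: the values of two objects over six samples.
object1 = ["AA", "AG", "AA", "GG", "AG", "AA"]
object2 = ["CC", "CC", "CT", "CC", "CT", "CC"]

enc1, codes1 = encode_values(object1)
enc2, codes2 = encode_values(object2)
bm1, bm2 = build_bitmaps(enc1), build_bitmaps(enc2)

# Samples where object1 == "AA" and object2 == "CC".
count = cooccurrence_count(bm1[codes1["AA"]], bm2[codes2["CC"]])
```

Each cell of a contingency table then reduces to one bitwise AND and a bit count over the samples, with no string comparison at test time. An intermediate AND result for a pair of objects could also be reused when a third object is added, which is the kind of shared computation the sharing optimization aims at.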

The rest of this chapter is organized as follows. In the next section, we provide some preliminaries on epistasis discovery. In Section 3.3, we present the main architecture of our proposed framework. Section 3.4 introduces our approach to preprocessing the raw data and shows how the transformed data enables efficient statistical testing within one combination. Section 3.5 presents the task distribution models. In Section 3.6, we describe the strategy of combination enumeration with sharing optimization for the given task in each processing unit. Section 3.7 reports the experimental results. Finally, we summarize this work in Section 3.8.

In document Scalable Data Analysis on MapReduce-based Systems (Page 45-49)