Processing of Allocated Combinations - Scalable Data Analysis on MapReduce-based Systems

Section 3.5 introduced how tasks are distributed to the reducers. In this section, we look at how each reducer processes the tasks that have been allocated to it. A naive strategy is to process the allocated tasks one at a time. For epistasis discovery applications, each task is the statistical test of a combination of objects, computed using the method described in section 3.4. However, we can improve performance by reusing partial work done by one task for another task when the two share some common computation. We refer to this as the sharing optimization.

For the purpose of our discussion, we illustrate the sharing optimization under the Exhaustive Testing scenario. As mentioned in section 3.5.1, the task distributor assigns tasks to the reducers in units of sets. Each reducer receives a collection of set ids identifying the sets it should process.

Sharing Optimization: Our COSAC framework minimizes the cost of processing a set of tasks and of collecting the contingency tables by reusing common operations during the processing of combinations.

Let us assume that there are 5 objects and that we need to perform 3-level analysis. Taking the first set {123, 124, 125, 134, 135, 145} as an example, we illustrate how the sharing optimization is performed while enumerating and analyzing this set of combinations.

Given the first set, it is easy to see that {12} appears in all three of the combinations {123, 124, 125}, and that {13} is shared between {134} and {135}. The idea of our sharing optimization is to combine the part that is common to multiple combinations only once, and then to combine that common part further with the other objects. For instance, we calculate the result of the combination {12} once, and then combine {12} with 3, 4 and 5 in turn. In this way, we avoid computing the shared parts multiple times (as would happen under the naive strategy), which reduces the computation overhead. Likewise, the data (contingency tables) do not need to be collected multiple times.

Algorithm 3.4: Enumeration

Input: Fixed combination object: comObj, Start object position: sOP, Combination forward depth: cFD, Analysis level: m

1  end = List.size() - m + cFD - 1;
2  for i : sOP → end do
3      newComObj = Combine(comObj, List.get(i));
4      if cFD == m then
5          MakeStatisticTest(newComObj);
6      else
7          Enumeration(newComObj, i+1, cFD+1, m);

Algorithm 3.4 outlines the main steps of the sharing optimization, which flexibly supports analysis at any depth. The algorithm has four input parameters. The first, comObj, stores either the start object or the objects combined so far (the common part). The second, sOP, is the position of the next object that comObj needs to combine with. The third, cFD, is the forward analysis depth; in other words, it records the number of objects in the combination after one more object is added to comObj. Lastly, m is the analysis level, which is set by the user.

We take the enumeration of the first set above as a running example to illustrate how the algorithm works. Initially, comObj is object 1 and the next object in the list to combine with is object 2. After object 1 is combined with object 2, there will be 2 objects in the combination, so cFD is 2. The analysis level m is constant once it is set by the user (3 in this example). The algorithm first calculates all the possible positions of the objects that need to be combined with comObj (Lines 1, 2). In the above example, objects 2, 3 and 4 are the ones that object 1 needs to combine with; object 5 is excluded at this depth because no later object would remain to complete a 3-object combination. For each of these positions, the algorithm gets the corresponding object from the list and combines it with comObj, storing the result in a new object, newComObj (Lines 2, 3). The algorithm then checks whether the depth equals the analysis level m. If it does, one complete combination has been enumerated as required; the algorithm conducts a statistical test on this combination and continues with the other possible objects (Lines 4, 5). Otherwise, the algorithm enters the next level to combine more objects, with the new inputs (newComObj, i+1, cFD+1, m) (Lines 6, 7). For instance, in the example, when 1 and 2 are combined the depth 2 is not equal to the analysis level 3, so the algorithm enters the next level to add another object.

At the next level, the algorithm repeats the same steps until comObj has been combined with every object it needs. In the example, {12} is further combined with 3, 4 and 5. The algorithm then returns to the upper level to combine 1 with 3, and so forth, until the whole enumeration is done.
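The walkthrough above can be reproduced with a minimal Python sketch of Algorithm 3.4. It is an illustration under stated assumptions, not the COSAC implementation: positions are assumed to be 0-based, Combine is modelled as appending an object id to a tuple, and MakeStatisticTest is replaced by a caller-supplied callback (here it only records the combination). The names enumeration, objects and tested are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Combination = Tuple[int, ...]


def enumeration(
    objects: Sequence[int],
    com_obj: Combination,
    s_op: int,
    c_fd: int,
    m: int,
    test: Callable[[Combination], None],
) -> None:
    """Recursive enumeration following Algorithm 3.4 (0-based positions assumed)."""
    # Line 1: last position that still leaves enough objects to reach depth m.
    end = len(objects) - m + c_fd - 1
    # Line 2: the bound is inclusive, hence end + 1.
    for i in range(s_op, end + 1):
        # Line 3: Combine is modelled as tuple extension (an assumption).
        new_com_obj = com_obj + (objects[i],)
        if c_fd == m:
            # Lines 4-5: a complete combination; hand it to the stand-in test.
            test(new_com_obj)
        else:
            # Lines 6-7: reuse the shared prefix and go one level deeper.
            enumeration(objects, new_com_obj, i + 1, c_fd + 1, m, test)


# Running example: 5 objects, 3-level analysis, first set (start object 1).
objects = [1, 2, 3, 4, 5]
tested = []
enumeration(objects, (1,), 1, 2, 3, tested.append)
print(tested)
```

With these inputs the callback receives 123, 124, 125, 134, 135 and 145 in that order, and the prefix {12} is built once before being extended with 3, 4 and 5.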

In this way, the proposed algorithm not only achieves the sharing optimization for efficient enumeration but is also flexible enough to support any analysis level. The experimental results in section 3.7 show that the sharing optimization significantly reduces the cost.
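For the small running example, the saving can be made concrete by counting Combine calls. The sketch below is only a back-of-the-envelope illustration: it assumes that the naive strategy builds every m-object combination from scratch with m-1 Combine calls, it uses 0-based positions, and it ignores the cost of collecting contingency tables. The function names are hypothetical.

```python
from math import comb


def shared_combine_calls(n: int, m: int, s_op: int, c_fd: int) -> int:
    """Count Combine calls made by the recursive, prefix-sharing enumeration."""
    end = n - m + c_fd - 1  # same bound as Line 1 of Algorithm 3.4
    calls = 0
    for i in range(s_op, end + 1):
        calls += 1  # one Combine per loop iteration
        if c_fd < m:
            calls += shared_combine_calls(n, m, i + 1, c_fd + 1)
    return calls


def naive_combine_calls(n: int, m: int) -> int:
    """Assumed naive cost for the set that starts with the first object:
    C(n-1, m-1) combinations, each built with m-1 Combine calls."""
    return comb(n - 1, m - 1) * (m - 1)


n, m = 5, 3
shared = shared_combine_calls(n, m, s_op=1, c_fd=2)
naive = naive_combine_calls(n, m)
print(shared, naive)  # prints: 9 12
```

Under these assumptions the first set needs 9 Combine calls with sharing (3 for the prefixes {12}, {13}, {14} plus 6 for the complete combinations) against 12 without it.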

Semi-Exhaustive Testing: Under the Semi-Exhaustive Testing scheme, the task each reducer faces is to combine a seed comb with other non-seed objects, as described in section 3.5.2. Algorithm 3.4 can also be used here. For a given analysis level m, if the number of objects in the seed comb is i, the enumeration can be performed by calling the algorithm with the input parameters: seed comb, 0 (combine with the first object in the non-seed list), i+1 (combination forward depth) and m (analysis level).
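The call described above can be sketched as follows. This is a hedged illustration rather than the framework's code: the seed comb {1, 2}, the non-seed list [3, 4, 5, 6] and the analysis level 4 are made-up values, Combine is modelled as tuple extension, the statistical test is replaced by a recording callback, and the list passed to the algorithm is assumed to be the non-seed list.

```python
from typing import Callable, Sequence, Tuple

Combination = Tuple[int, ...]


def enumeration(
    non_seed: Sequence[int],
    com_obj: Combination,
    s_op: int,
    c_fd: int,
    m: int,
    test: Callable[[Combination], None],
) -> None:
    """Algorithm 3.4 applied to a non-seed list (0-based positions)."""
    end = len(non_seed) - m + c_fd - 1
    for i in range(s_op, end + 1):
        new_com_obj = com_obj + (non_seed[i],)
        if c_fd == m:
            test(new_com_obj)  # stand-in for MakeStatisticTest
        else:
            enumeration(non_seed, new_com_obj, i + 1, c_fd + 1, m, test)


# Hypothetical inputs: a seed comb with i = 2 objects and four non-seed objects.
seed_comb = (1, 2)
non_seed = [3, 4, 5, 6]
m = 4

results = []
# Parameters as in the text: seed comb, 0, i+1, m.
enumeration(non_seed, seed_comb, 0, len(seed_comb) + 1, m, results.append)
print(results)
```

Each resulting combination keeps the seed comb as its prefix and adds m - i non-seed objects, so the seed comb itself is never rebuilt.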
