PARALLEL RULE MINING WITH DYNAMIC DATA DISTRIBUTION UNDER HETEROGENEOUS CLUSTER ENVIRONMENT


Ms. C. Francytheresa¹, Dr. G. Kesavaraj²

¹ M.Phil Full time Research Scholar, Department of Computer Science,

² Head of the Department, PG and Research Department of Computer Science

Vivekanandha College of Arts and Sciences for Women (Autonomous), Tiruchengode, Tamilnadu, India.

Abstract

Big data mining methods support knowledge discovery on highly scalable, high-volume and high-velocity data. The cloud computing environment provides computational and storage resources for the big data mining process. Hadoop is a widely used parallel and distributed computing platform for big data analysis, and it supports both homogeneous and heterogeneous computing models. The MapReduce framework divides the data and tasks into small units and processes them in parallel. Frequent itemset mining methods extract frequent patterns from database transactions. Conventional parallel frequent itemset mining techniques divide the data set into equal-sized partitions. The Data Partitioning in Frequent Itemset Mining on Hadoop Clusters (FiDoop-DP) scheme is adapted to perform load-balanced rule mining. Its Voronoi diagram based data partitioning scheme exploits the relationships among transactions, and the partitioning process limits redundant transactions using a similarity metric and the Locality Sensitive Hashing (LSH) technique. The Parallel Frequent Pattern Growth algorithm is employed to discover the frequent itemsets. In this work, the parallel rule mining process is built to support dynamic data partitioning and distribution over heterogeneous Hadoop clusters, in which each computational node has a different resource level. The data-aware partitioning process is carried out with a load balancing mechanism and also takes the computational resource level of each node into account. The FiDoop-DP scheme is upgraded to handle load-balanced data placement under the Hadoop Distributed File System (HDFS) on heterogeneous nodes. The parallel frequent itemset mining process is also improved with energy efficiency features.

Keywords: Parallel Rule Mining, Hadoop Clusters, Data Partitions, FP Growth Algorithm and Big Data Mining

I. INTRODUCTION

Recently, there has been a great deal of hype about cloud computing. In principle, cloud computing is associated with a new paradigm for the provision of computing infrastructure. This paradigm shifts the location of this infrastructure to more centralized and larger-scale datacenters in order to reduce the costs associated with the management of hardware and software resources. In particular, cloud computing has promised a number of advantages for hosting the deployments of data-intensive applications such as:

• reduced time-to-market by removing or simplifying the time-consuming hardware provisioning, purchasing, and deployment processes;

• reduced monetary cost by following a pay-as-you-go business model;

• virtually unlimited throughput by adding servers if the workload increases.

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reduction and reduced risk.

Analysis of data sets can find new correlations, to "spot business trends, prevent diseases and combat crime and so on." Scientists, business executives, practitioners of media and advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations and biological and environmental research.


A challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Work with big data is necessarily uncommon; most analysis is of "PC size" data, on a desktop PC or notebook that can handle the available data set. Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work instead requires "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the users and their tools and expanding capabilities make Big Data a moving target. Thus, what is considered "big" one year becomes ordinary later. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

II. RELATED WORK

YARN is the second generation of Hadoop. It uses containers to replace task slots and provides a finer granularity of resource management. However, it only allows different jobs to have different container sizes [1]; the tasks within one job still use containers of a single fixed size. Therefore, YARN cannot mitigate data skew by using containers of flexible size.

There are a few studies on straggler mitigation. SkewReduce alleviates computational skew by balancing the data distribution across nodes using a user-defined cost function. SkewTune [7] repartitions the data of stragglers to take advantage of idle slots freed by short tasks. Moving repartitioned data to idle nodes requires extra I/O operations, which could aggravate performance interference. Dolly [8] provides speculative execution at the job level by cloning small jobs that have straggler tasks. However, duplicating the entire job increases both the resource usage and the I/O contention on intermediate data, which can significantly affect job performance. More importantly, these approaches provide no coordination between Hadoop and the cloud infrastructure.

There is great research interest in improving Hadoop from different perspectives. A rich body of research has focused on the performance and efficiency of Hadoop clusters. Jiang et al. conducted a comprehensive performance study of Hadoop and summarized the factors that can significantly improve Hadoop performance. Verma et al. proposed a cluster resource allocation approach for Hadoop. They focused on improving cluster efficiency by minimizing resource allocations to jobs while maintaining the service level objectives. They estimated the execution time of a job based on its resource allocation and input dataset, and determined its minimum resource allocation.

A Hadoop cluster in the cloud can be seen as a heterogeneous Hadoop cluster because of dynamic resource allocation. A number of studies have proposed task scheduling algorithms to improve Hadoop performance in heterogeneous environments [11]. DynMR [5] enables interleaved MapReduce execution that overlaps reduce tasks with map tasks. By opportunistically scheduling all tasks, DynMR significantly increases both the performance and the efficiency of Hadoop. PIKACHU focuses on achieving optimal workload balance for Hadoop [12]. It presents guidelines for the tradeoffs between the accuracy of workload balancing and the delay of workload adjustment. However, these studies focus on hardware heterogeneity in clusters of physical machines. They are not designed for VM-based clusters, where the heterogeneity can change through dynamic resource allocation. FlexSlot is able to adapt to changes in heterogeneity with flexible slot size and number. Recent studies also focus on improving the performance of applications in the cloud through dynamic resource allocation. For example, Bazaar is a cloud framework that predicts the resource demand of applications based on high-level performance goals [9]. It translates the performance goal of an application into multiple combinations of resources and selects the combination that is most suitable for the cloud provider. Another recent work focuses on demand-based resource allocation [3]. It distributes resources efficiently by dynamically allocating the overall capacity among VMs based on their demands. However, these approaches focus on VM management only and are less effective than FlexSlot for Hadoop in the cloud because of the semantic gap between demand-based resource allocation and the actual needs of Hadoop tasks.

Performance interference is an important issue for applications in clouds. A few recent studies work on mitigating performance interference for different applications [6]. TRACON provides a profiling-based interference prediction model. Its cloud-aware scheduler uses the model to mitigate performance interference for data-intensive workloads. Gulati et al. introduced the distributed resource scheduler (DRS) for resource management and performance isolation among VMs that are located on multiple physical machines.


MapReduce, Hadoop, Dryad, LATE and Wrangler. These strategies only start speculative execution when the job is nearly done and the cluster has available task slots. A few recent studies focus on improving speculative execution [2]. Mantri starts speculative execution as soon as a straggler is detected during the job’s execution. It kills the straggler task once it finds that the speculative task is faster than the straggler. Mantri estimates a task’s remaining time based on the progress bandwidth. However, there is no guarantee that the speculative copy of the straggler task will perform better, so Mantri may need to kill and restart the speculative task on multiple nodes. Wrangler [4] predicts which tasks would benefit from speculative execution and starts speculative tasks accordingly. The prediction relies on a statistical learning model that is derived from historical workloads. The learning time could be long for workloads with low repetitiveness, and performance interference in the cloud can also affect the accuracy of the prediction. The Maximum Cost Performance (MCP) approach [10] employs an exponentially weighted moving average to predict the remaining time of a task. MCP selects the node for speculative execution using a sophisticated cost-benefit model that considers the workload on each node and the data locality. It significantly improves job performance. Because all these approaches lack the ability to identify the performance bottleneck of the straggler tasks, they cannot guarantee that the speculative copies of tasks will perform better.
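The exponentially weighted moving average mentioned for MCP can be illustrated with a minimal Python sketch. This is only an illustration of the general EWMA idea, not the estimator published in [10]: the function names, the smoothing factor `alpha` and the assumption that progress is reported as a fraction between 0 and 1 are all hypothetical.

```python
def ewma_speed(samples, alpha=0.2):
    """Smooth a series of observed progress rates (progress per second).

    alpha is a hypothetical smoothing factor: larger values weight
    recent samples more heavily.
    """
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def remaining_time(progress, samples, alpha=0.2):
    """Estimate the remaining seconds of a task from its smoothed speed.

    progress is the completed fraction of the task in [0, 1].
    """
    speed = ewma_speed(samples, alpha)
    if speed <= 0:
        return float("inf")  # a stalled task never finishes at this rate
    return (1.0 - progress) / speed
```

A task that is half done and advances by roughly 10% of its work per second would be estimated to need about five more seconds; a scheduler could compare such estimates across tasks to pick straggler candidates.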

III. DATA PARTITIONING IN FREQUENT ITEMSET MINING ON HADOOP CLUSTERS

Traditional parallel Frequent Itemset Mining (FIM) techniques focus on load balancing: data are equally partitioned and distributed among the computing nodes of a cluster. More often than not, the lack of analysis of the correlation among data leads to poor data locality. The absence of data collocation increases the data shuffling costs and the network overhead, reducing the effectiveness of data partitioning. In this study, we show that redundant transaction transmission and itemset-mining tasks are likely to be created by inappropriate data partitioning decisions. As a result, data partitioning in FIM affects not only network traffic but also computing loads. Our evidence shows that data partitioning algorithms should pay attention to network and computing loads in addition to the issue of load balancing. We propose a parallel FIM approach called FiDoop-DP using the MapReduce programming model. The key idea of FiDoop-DP is to group highly relevant transactions into a data partition; thus, the number of redundant transactions is significantly reduced. Importantly, we show how to partition and distribute a large dataset across the data nodes of a Hadoop cluster to reduce the network and computing loads induced by creating redundant transactions on remote nodes. FiDoop-DP thereby speeds up parallel FIM on clusters.
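The idea of grouping highly relevant transactions can be sketched as a Voronoi-style assignment: each transaction goes to the partition whose pivot transaction it most resembles. The sketch below is an assumption-laden illustration rather than the FiDoop-DP implementation; the use of Jaccard distance as the similarity metric, the hand-picked pivots and the function names are all hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two item collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty transactions are treated as identical
    return 1.0 - len(a & b) / len(union)


def assign_to_pivots(transactions, pivots):
    """Assign every transaction to its nearest pivot (Voronoi cell).

    Ties are broken in favour of the pivot with the lowest index.
    """
    partitions = {i: [] for i in range(len(pivots))}
    for transaction in transactions:
        nearest = min(
            range(len(pivots)),
            key=lambda i: jaccard_distance(transaction, pivots[i]),
        )
        partitions[nearest].append(transaction)
    return partitions


transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["pen", "paper"],
    ["pen", "paper", "ink"],
]
pivots = [["bread", "milk"], ["pen", "paper"]]  # hypothetical pivots
partitions = assign_to_pivots(transactions, pivots)
```

Here the two grocery transactions land in partition 0 and the two stationery transactions in partition 1, so items that tend to occur together are mined on the same node.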

The following three observations motivate us to develop FiDoop-DP in this study to improve the performance of FIM on high-performance clusters.

• There is a pressing need for the development of parallel FIM techniques.

• The MapReduce programming model is an ideal data-centric model to address the rapid growth of big-data mining.

• Data partitioning in Hadoop clusters plays a critical role in optimizing the performance of applications processing large datasets.

Parallel Frequent Itemset Mining. Datasets in modern data mining applications become excessively large; therefore, improving the performance of FIM is a practical way of significantly shortening the data mining time of the applications.

Unfortunately, sequential FIM algorithms running on a single machine suffer from performance deterioration due to limited computational and storage resources. To fill the deep gap between massive amounts of datasets and sequential FIM schemes, we are focusing on parallel FIM algorithms running on clusters.

The MapReduce Programming Model. MapReduce, a highly scalable and fault-tolerant parallel programming model, provides a framework for processing large-scale datasets by exploiting parallelism among the data nodes of a cluster. In the realm of big data processing, MapReduce has been adopted to develop parallel data mining algorithms, including Frequent Itemset Mining. Hadoop is an open source implementation of the MapReduce programming model. In this study, we show that a Hadoop cluster is an ideal computing framework for mining frequent itemsets over massive and distributed datasets.
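As a rough illustration of how MapReduce applies to frequent itemset mining, the following Python sketch simulates the map, shuffle and reduce phases in a single process to count item supports. It does not use the Hadoop API; the function names and the `min_support` threshold are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(transaction):
    """Emit one (item, 1) pair per distinct item in a transaction."""
    return [(item, 1) for item in set(transaction)]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(item, counts):
    """Sum the partial counts of one item."""
    return item, sum(counts)


def count_items(transactions):
    """Run map, shuffle and reduce to obtain the support of every item."""
    mapped = chain.from_iterable(map_phase(t) for t in transactions)
    grouped = shuffle(mapped)
    return dict(reduce_phase(item, counts) for item, counts in grouped.items())


def frequent_items(transactions, min_support):
    """Keep only items whose support reaches the threshold."""
    supports = count_items(transactions)
    return {item: n for item, n in supports.items() if n >= min_support}
```

On a real cluster the map calls run on the nodes that hold each data block and the shuffle moves data over the network, which is why the placement of related transactions affects the cost of this step.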


correlation between the data is often ignored, which leads to poor data locality and increases the data shuffling costs and the network overhead. We develop FiDoop-DP, a parallel FIM technique in which a large dataset is partitioned across a Hadoop cluster’s data nodes in a way that improves data locality.

IV. ISSUES ON RULE MINING UNDER HADOOP CLUSTERS

Frequent itemset mining methods extract frequent patterns from database transactions. Conventional parallel frequent itemset mining techniques divide the data set into equal-sized partitions. Inappropriate data partitioning causes redundant transaction transmission and redundant itemset mining tasks. The Data Partitioning in Frequent Itemset Mining on Hadoop Clusters (FiDoop-DP) scheme is adapted to perform load-balanced rule mining. Its data partitions are created from groups of highly relevant transactions. The Voronoi diagram based data partitioning scheme exploits the relationships among transactions, and the partitioning process limits redundant transactions using a similarity metric and the Locality Sensitive Hashing (LSH) technique. The Parallel Frequent Pattern Growth algorithm is employed to discover the frequent itemsets. The following issues are identified in the current Hadoop-based rule mining techniques.

• Computational resource status is not considered in the data partitioning process

• Data distribution and rule mining methods are not tuned for heterogeneous environments

• Data placement load is not addressed for the Hadoop Distributed File System (HDFS)

• Energy management models are not considered in the rule mining process

V. PARALLEL RULE MINING UNDER HETEROGENEOUS CLUSTER ENVIRONMENT

The parallel rule mining process is built to support dynamic data partitioning and distribution over heterogeneous Hadoop clusters. The data-aware partitioning process is carried out using the computational resource and load level details of each node. The FiDoop-DP scheme is upgraded to handle load-balanced data placement under HDFS on heterogeneous nodes. The parallel frequent itemset mining process is also improved with energy efficiency features.
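One simple way to take the computational resource level into account is to size each node's share of the transactions in proportion to a capacity score. The sketch below illustrates that idea only; it is not the partitioning scheme of this paper, and the integer capacity scores, node names and the largest-remainder rounding are assumptions.

```python
def partition_shares(capacities, total_transactions):
    """Split a transaction count across nodes in proportion to capacity.

    capacities maps a node name to a positive integer score (for example
    a hypothetical combination of cores, memory and current load).
    """
    total_capacity = sum(capacities.values())
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")

    # Integer floor of each proportional share.
    shares = {
        node: (total_transactions * cap) // total_capacity
        for node, cap in capacities.items()
    }

    # Hand out the transactions lost to rounding, largest remainder first.
    leftover = total_transactions - sum(shares.values())
    by_remainder = sorted(
        capacities,
        key=lambda node: (total_transactions * capacities[node]) % total_capacity,
        reverse=True,
    )
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


shares = partition_shares({"node-a": 4, "node-b": 2, "node-c": 2}, 1000)
# node-a receives 500 transactions, node-b and node-c 250 each
```

With equal capacity scores this reduces to the equal-sized partitioning of conventional parallel FIM, so the homogeneous cluster is a special case of the same rule.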

The parallel rule mining operations are carried out on Hadoop clusters. Data relationships are used in the data placement and transmission process. Data placement operations are tuned to support the Hadoop Distributed File System (HDFS). The system is divided into five major modules: Transactional database, Resource management in Hadoop clusters, Data partitioning process, Data placement in HDFS and Frequent item set mining. The transactional database maintains the data items. The resource management module maintains the computational nodes and their resource details. The transactions are grouped in the data partitioning process. The data elements are transferred and placed in the data placement process. Frequent itemsets are discovered in the parallel rule mining process.

Transactional database

The transactional database manages the data items as text files. A Distributed File System (DFS) is used to maintain the data files under the providers. The system uses online learning data sets for the frequent itemset identification process, and the student learning activities are updated in the data collections.

Resource Management in Hadoop Clusters

Homogeneous and heterogeneous Hadoop clusters are built with static and dynamic resource levels. Storage and computational resources are provided in the Hadoop cluster environment. The storage space is allocated to manage the partitioned transactional data. The computational resources are assigned to perform the parallel rule mining process.

Data Partitioning Process

The data partitioning process is constructed to support both homogeneous and heterogeneous cluster environments. The Voronoi diagram, similarity metric and Locality Sensitive Hashing (LSH) techniques are adapted for the data-aware partitioning process. The computational and data aware partitioning scheme (CDAPS) is built to perform data partitioning on heterogeneous clusters. The data partitioning operations are performed using load level details.
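The role of LSH in finding similar transactions without comparing every pair can be sketched with MinHash signatures and banding. This is a generic textbook-style illustration, not the CDAPS or FiDoop-DP code; the MD5-based hash family, `num_hashes`, `band_size` and the function names are assumptions, and transactions are assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, item):
    """Deterministic hash of an item under a given seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(transaction, num_hashes=8):
    """Signature made of the minimum hash value under each seed."""
    return tuple(
        min(_hash(seed, item) for item in transaction)
        for seed in range(num_hashes)
    )


def lsh_buckets(transactions, num_hashes=8, band_size=2):
    """Place transaction ids into buckets keyed by signature bands."""
    buckets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        signature = minhash_signature(transaction, num_hashes)
        for start in range(0, num_hashes, band_size):
            band = (start, signature[start:start + band_size])
            buckets[band].add(tid)
    return buckets


def candidate_pairs(transactions, num_hashes=8, band_size=2):
    """Pairs of transaction ids that share at least one bucket."""
    pairs = set()
    for ids in lsh_buckets(transactions, num_hashes, band_size).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Transactions with the same item set always collide in every band, while transactions with no items in common do not share a bucket, so only the candidate pairs need an exact similarity computation.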

Data Placement in HDFS


System (HDFS) manages the transactional data in each node.

Frequent Item Set Mining

The Parallel Frequent Pattern Growth (PFPG) algorithm is used for the frequent item set mining process. The item set mining process is tuned to support homogeneous and heterogeneous cluster environment. Resource availability and load levels are considered in the pattern discovery process. The rule mining tasks are scheduled with energy constraints.
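One step commonly described for parallel FP-growth is turning each transaction into group-dependent transactions, so that every item group can be mined independently on its own node. The sketch below illustrates that step under assumptions: the paper does not specify which variant of the algorithm it uses, and `f_list` (frequent items in descending support order) and `group_of` (item to group id) are hypothetical inputs that an earlier counting pass would provide.

```python
def group_dependent_transactions(transaction, f_list, group_of):
    """Emit (group_id, prefix) pairs for one transaction.

    Infrequent items are dropped, the rest are ordered as in f_list, and
    for each group the longest prefix ending in one of its items is emitted.
    """
    rank = {item: position for position, item in enumerate(f_list)}
    ordered = sorted(
        (item for item in set(transaction) if item in rank),
        key=rank.__getitem__,
    )

    emitted = set()
    output = []
    # Scan from the least frequent item so each group gets its longest prefix.
    for j in range(len(ordered) - 1, -1, -1):
        group_id = group_of[ordered[j]]
        if group_id not in emitted:
            emitted.add(group_id)
            output.append((group_id, ordered[: j + 1]))
    return output


f_list = ["a", "b", "c", "d"]                  # hypothetical support order
group_of = {"a": 0, "b": 0, "c": 1, "d": 1}    # hypothetical grouping
result = group_dependent_transactions(["d", "a", "c", "x"], f_list, group_of)
# result == [(1, ["a", "c", "d"]), (0, ["a"])]
```

Each group id acts as a reduce key: the node that receives a group builds a local FP-tree from the prefixes sent to it, so a scheduler that knows node resources and load could decide where each group is mined.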

VI. CONCLUSION AND FUTURE WORK

Parallel rule mining methods are applied to discover frequent itemsets in a distributed environment. Homogeneous and heterogeneous clusters provide the storage and computational resources for the distributed rule mining process. The Data Partitioning in Frequent Itemset Mining on Hadoop Clusters (FiDoop-DP) scheme is adapted to perform load-balanced parallel rule mining, and it is improved with load-balanced data partitioning, data placement and energy management models. Data partitioning and data placement operations are tuned for heterogeneous clusters. The similarity-based redundant data filtering model minimizes the data transmission load. The system improves energy efficiency while keeping the computational complexity low.

REFERENCES

[1]. Y. Guo, J. Rao, C. Jiang and X. Zhou. Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution. IEEE Transactions on Parallel and Distributed Systems, March 2017.

[2]. S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu and S. Wu. Maestro: Replica-aware map scheduling for mapreduce. In Proc. of the 12th IEEE/ACM Int’l Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2012.

[3]. G. Shanmuganathan, A. Gulati and P. Varman. Defragmenting the cloud using demand-based resource allocation. In Proc. ACM SIGMETRICS, 2013.

[4]. N. J. Yadwadkar, G. Ananthanarayanan and R. Katz. Wrangler: Predictable and Faster Jobs using Fewer Resources. In Proc. of the ACM Symposium on Cloud Computing (SOCC), 2014.

[5]. J. Tan, A. Chin, Z. Z. Hu, Y. Hu, S. Meng, X. Meng and L. Zhang. DynMR: Dynamic mapreduce with reducetask interleaving and maptask backfilling. In Proc. of the ACM European Conference on Computer Systems (EuroSys), 2014.

[6]. H. Kang and J. L. Wong. To hardware prefetch or not to prefetch?: a virtualized environment study and core binding approach. In Proc. of the Int’l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

[7]. Y. Kwon, M. Balazinska, B. Howe and J. Rolia. SkewTune: Mitigating skew in mapreduce applications. In Proc. of the ACM SIGMOD, 2012.

[8]. G. Ananthanarayanan, A. Ghodsi, S. Shenker and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. In Proc. of the USENIX NSDI, 2013.

[9]. V. Jalaparti, H. Ballani, P. Costa, T. Karagiannis and A. Rowstron. Bridging the tenant-provider gap in cloud services. In Proc. of the ACM Symposium on Cloud Computing (SOCC), 2012.

[10]. Q. Chen, C. Liu and Z. Xiao. Improving mapreduce performance using smart speculative execution strategy. IEEE Transactions on Computers, 63(4):954–967, Apr. 2014.

[11]. D. Cheng, J. Rao, Y. Guo and X. Zhou. Improving mapreduce performance in heterogeneous environments with adaptive task tuning. In Proc. of ACM/IFIP/USENIX Middleware, 2014.
