
DATA MINING OVER LARGE DATASETS USING HADOOP IN CLOUD ENVIRONMENT

V. Nappinna Lakshmi 1, N. Revathi 2*

1 PG Scholar, 2 Assistant Professor, Department of Information Technology, Sri Venkateswara College of Engineering, Sriperumbudur – 602105, Chennai, INDIA.

1 Nappinnavenkat@gmail.com, 2 revathi@svce.ac.in (* Corresponding author)

Abstract- There has been drastic growth of data in web applications and social networking, and such data is referred to as Big Data. Hive queries integrated with Hadoop are used to generate report analyses for thousands of data sets, but they require a huge amount of time to retrieve those data sets and they lack performance in the analysis. To overcome this problem, Market Basket Analysis, a very popular data mining algorithm, is used in the Amazon cloud environment by integrating it with the Hadoop ecosystem and HBase. The objective is to store the data persistently along with the past history of the data set and to perform report analysis on those data sets. The main aim of this system is to improve performance through the parallelization of various operations such as loading the data, building indexes and evaluating queries. The performance analysis is done with a minimum of three nodes within the Amazon cloud environment. HBase is an open source, non-relational, distributed database model that runs on top of Hadoop; it stores a single key with multiple values. Looping is avoided when retrieving a particular record from huge data sets, so less time is consumed in processing the data. The HDFS file system is used to store the data after the MapReduce operations are performed, and the execution time decreases as the number of nodes increases. The performance analysis is tuned with parameters such as the HBase heap memory and the caching parameter.

Keywords- HBase, Cloud computing, Hadoop ecosystem, mining algorithm

I INTRODUCTION

Currently, data set sizes for applications are growing at an incredible rate, and data sets that grow beyond a few hundred terabytes have few solutions for management and analysis. Services such as social networking aim to achieve their goals with a minimum amount of effort in terms of software, CPU and network. Cloud computing is associated with a new paradigm for provisioning computing infrastructure. This paradigm shifts the location of the infrastructure to the network in order to reduce the cost associated with the management of hardware and software resources. Cloud computing can be described as a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Hadoop is a framework for running large numbers of applications; it includes HDFS for storing large data sets. HadoopDB tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop. The main aim of such systems is to improve performance through the parallelization of various operations such as loading the data sets, building indexes and evaluating queries. These systems are usually designed to run on top of a shared-nothing architecture, where data may be stored in a distributed fashion and input/output speeds are improved by using multiple CPUs and disks in parallel together with network links of high available bandwidth. HadoopDB tries to achieve the performance of parallel databases by doing most of the query processing inside the database engine. Hadoop is an open source framework used in cloud environments for efficient data analysis and storage; it supports data-intensive applications by providing an implementation of the MapReduce framework inspired by Google's architecture. The integration of Hadoop and Hive is used to store and retrieve data sets in an efficient manner. For greater efficiency of data storage and retail business transactions, the integration of the Hadoop ecosystem with HBase in the cloud environment is used to store and retrieve the data sets persistently. The performance analysis is done with MapReduce parameters such as the HBase heap memory and the caching parameter.


A. Cloud computing

It is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, applications, storage and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Fig 1.1 Hadoop Ecosystem (components include Pig, HCatalog, Ambari, Oozie, Sqoop and Flume)

B. Hadoop

Hadoop is designed to run on cheap commodity hardware. It automatically handles data replication and node failure, and focuses on processing the data. It is used to store copies of internal log and dimension data sources, and is used for reporting, analytics and machine learning.

C. Hive

Hive is a system for managing and querying structured data, built on top of Hadoop, for data warehousing and analytics.

D. Hive QL

HiveQL is the SQL-like query language of the Hive data warehouse. It makes use of a parser to compile queries into MapReduce programs for execution and to store the large data sets in the HDFS of Hadoop.
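As an illustration of how such report queries might be issued programmatically, the sketch below runs a HiveQL query through HiveServer2 via JDBC. The connection URL and the table name movielens_ratings are assumptions for illustration, not details from this paper.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL and table name below are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            // A simple HiveQL report query; Hive compiles it into MapReduce jobs over HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT movieId, COUNT(*) AS cnt FROM movielens_ratings " +
                "GROUP BY movieId ORDER BY cnt DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

The sketch assumes a running HiveServer2 instance; in a deployed cluster the connection string would point at the Hive gateway node.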

E. Map Reduce

MapReduce is a programming model for processing large data sets; the reference implementation was developed at Google. MapReduce is typically used for distributed computing on clusters of computers. It is a framework for processing parallelizable problems across huge data sets using a large number of computers (nodes), collectively referred to as a cluster when all nodes are on the same local network and use similar hardware. Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing the data sets on or near the storage assets to decrease the transmission of data.

"Map" step- The master node takes the input file and divides those datasets into smaller sub-problems, and distributes them to each of the worker nodes. A worker node may do this again in turn to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step- The master node collects the answers of all the sub-problems and combines them to form the output – the answer to the problem is that, it was originally trying to solve the job process in easier manner.

F. HBase

HBase is a distributed, versioned, column-oriented database built on top of Hadoop and ZooKeeper. The HBase master determines the placement of regions on the nodes, so that it can configure and split the data correctly; mapper jobs are run on the region servers containing the regions in order to ensure data locality. The HBase server reads incoming requests using one listener thread and a pool of deserialization threads, and input/output processing is done asynchronously; the listener sleeps on a set of socket descriptors.
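A minimal sketch of how a row might be written to and read from an HBase table with the standard HBase client API; the table name transactions and the column family items are illustrative assumptions, not the schema used by the authors.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("transactions"))) {
            // Write one row: a single row key holding multiple column values.
            Put put = new Put(Bytes.toBytes("txn-0001"));
            put.addColumn(Bytes.toBytes("items"), Bytes.toBytes("milk"), Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("items"), Bytes.toBytes("bread"), Bytes.toBytes("2"));
            table.put(put);

            // Read the row back; the region holding it is located via the master and ZooKeeper.
            Result result = table.get(new Get(Bytes.toBytes("txn-0001")));
            byte[] milk = result.getValue(Bytes.toBytes("items"), Bytes.toBytes("milk"));
            System.out.println("milk = " + Bytes.toString(milk));
        }
    }
}
```

The single-key, multiple-value layout mentioned in the abstract corresponds to one row key (txn-0001 here) carrying several column values within a family.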

G. Zookeeper

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace of data registers (these registers are called znodes), organised much like a file system. Unlike normal file systems, ZooKeeper provides its clients with high throughput, low latency, high availability and strictly ordered access to the znodes. The performance of ZooKeeper has been tested by allowing a number of distributed systems to coordinate through it.

The main differences between ZooKeeper and standard file systems are that every znode can have data associated with it and that znodes are limited in the amount of data they can hold. ZooKeeper was designed to store coordination data such as status information, configuration and location information; this kind of metadata is usually measured in kilobytes, if not bytes. ZooKeeper has a built-in sanity check of 1 MB to prevent it from being used as a huge data store, but in general it is used to store much smaller pieces of data. It helps manage the data sets by coordinating the splitting of large numbers of data sets.
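A minimal sketch of storing and reading a small piece of coordination data in a znode with the ZooKeeper Java client; the connect string and the /mining_config path are assumptions for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) local ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Connection state changes would be handled here.
            }
        });

        // Store a few bytes of configuration data in a persistent znode.
        String path = "/mining_config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "cacheSize=500".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back; znodes are meant for kilobyte-sized coordination data only.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```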

H. HDFS

A Hadoop Distributed File System cluster consists of a single Name Node, a master server that manages the file system namespace and regulates access to files by clients. There are a number of Data Nodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A single file is split into one or more blocks, and these blocks are stored on Data Nodes.

Data Nodes serve read and write requests and perform block creation, deletion and replication upon instruction from the Name Node.

Fig 1.2 HDFS architecture (Name Node with Data Nodes on Rack 1 and Rack 2, showing block operations, reads and replication)

The Name Node maintains the file system; any metadata changes to the file system are recorded by the Name Node.

An application can specify the number of replicas of a file it needs: the replication factor of the file. This information is stored in the Name Node. HDFS is designed to store very large files across machines in a large cluster. Each file is a sequence of blocks; all blocks in a file except the last are of the same size. Blocks are replicated for fault tolerance, and the block size and number of replicas are configurable per file. The Name Node receives a Heartbeat and a Block Report from each Data Node in the cluster; a Block Report contains all the blocks on a Data Node. The placement of the replicas is critical to HDFS performance, and optimizing replica placement distinguishes HDFS from other distributed file systems.
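A minimal sketch of how a client might write a file to HDFS and set its per-file replication factor through the Hadoop FileSystem API; the path and the replication factor shown are illustrative values, not taken from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS is normally picked up from core-site.xml on the cluster.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/market_basket/input.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("milk,bread,butter\n");
        }

        // Per-file replication factor, recorded by the Name Node.
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size = " + status.getBlockSize()
                + ", replication = " + status.getReplication());
        fs.close();
    }
}
```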

I. Rack-aware replica placement

Goal: improve reliability, availability and network bandwidth utilization.

A cluster typically consists of many racks, and communication between racks passes through switches. Network bandwidth between machines on the same rack is greater than between machines on different racks. The Name Node determines the rack id of each Data Node.

Replica placement: replicas are placed on nodes across different local racks.

Replica selection: for a READ operation, HDFS tries to minimize bandwidth consumption and latency. If there is a replica on the reader node, that replica is preferred. An HDFS cluster may span multiple data centers, in which case a replica in the local data center is preferred over a remote one.

File system metadata: the HDFS namespace is stored by the Name Node. The Name Node uses a transaction log called the EditLog to record every change that occurs to the file system metadata. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, which is kept in the Name Node's local file system.

II RELATED WORKS

D. Abadi [1]: In this paper, large-scale data analysis is performed with traditional DBMSs. The data management is scalable, but data is replicated, and this replication of data provides fault tolerance.

Y. Xu, P. Kostamaa and L. Gao [3]: This paper deploys the Teradata parallel DBMS for large data warehouses. In recent years there has been an increase in data volumes, and some data, such as web logs and sensor data, is not managed by the Teradata EDW (Enterprise Data Warehouse). Researchers agree that both the parallel DBMS and the MapReduce paradigm of Hadoop have advantages and disadvantages for various business applications and that both will exist for a long time. The integration of optimization opportunities is not possible for a DBMS running on a single node.

Farah Habib Chanchary [6]: Large data sets are efficiently stored across clusters of machines in cloud storage systems, so that the same information held on more than one system can still be operated on even if one of the systems loses power.

J. Ababi and A. Silberschatz [7]: Analysing massive data sets on very large clusters is done within the HadoopDB architecture for real-world applications such as business data warehousing. It approaches parallel databases in performance, but it still lacks scalability and consumes a huge amount of execution time.

III HBASE-MINING ARCHITECTURE

Fig 3.1 HBase-Mining Architecture: a Hadoop cluster in the Amazon cloud (Name Node, Data Nodes, Job Tracker, Task Trackers, HBase Master, HBase Region Servers, Secondary Name Node and Hive Query Processing nodes) serving the Hive data set queries, the HDFS-HBase storage, the market basket algorithm and the data report analysis

Hive queries are used to store and retrieve the MovieLens data sets. Efficient database storage with the market basket algorithm is used to maintain the heavy transactions of the retail business of supermarket products. The aim of this system is to improve performance through the parallelization of various operations such as loading the data sets, building indexes and evaluating queries; the Hadoop database integrated with HBase is used to store and retrieve the huge data sets without any loss of transactions. The Amazon cloud hosts the HDFS file system holding the Name Node, the Data Nodes and the region servers where the data sets are stored once they are split from the HBase table. Hive Query Processing (HQP) is also treated as one of the data nodes.

A. Advantage of having the HADOOP database

An RDBMS can store only a limited amount of data, uses SQL operations, and cannot process the data concurrently: there is no parallel processing, the process is single-threaded, and it can handle only one data set at a time. In contrast, Hadoop together with HBase data storage can hold huge amounts of data, up to petabytes; it makes use of NoSQL operations and can process the data concurrently, which means parallelization is achieved and the work is split into many tasks based on the number of processors. The map function works with <key, value> pairs, and the data is then processed by the reduce function to produce the result.

Table 1 Differences between NoSQL and SQL

RDBMSs are used to store only structured data, whereas Hadoop data storage systems are used to store both structured and unstructured data. Examples of unstructured data are mail, audio, video, medical transactions, etc.

IV MARKET BASKET ALGORITHMS

Market basket analysis is one of the most popular data mining algorithms. It is a cross-selling promotional technique that generates combinations of data sets, and association rules are used to identify pairs of similar items. The advantages of using this algorithm are that the computations are simple, different forms of data can be analysed, and it offers many options for selecting promotions in purchasing and joint promotional opportunities.

The actionable information identified by market basket analysis includes the profitability of each purchase profile and its use for marketing purposes such as layouts or catalogues, selecting products for promotions, space allocation and product placement. Purchase patterns are identified from the items that tend to be purchased together and the items that are purchased sequentially.



A. Market Basket Analysis for Mapper

The input file is loaded into the Hadoop distributed file system, the data sets are stored in blocks of 64 MB along with their <key, value> pairs, the mapping function is performed, and each mapped data set is then stored in its respective HBase table. The steps are listed below; a minimal mapper sketch follows the list.

B. Algorithm

1. The input file is loaded into HDFS.
2. The file is split into blocks of size 64 MB.
3. The mapping function is performed on the basis of <k1, v1> pairs.
4. The output is then stored in the HBase tables.
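A minimal sketch of what such a mapper might look like with the Hadoop MapReduce API, assuming each input line is one transaction of comma-separated items. Emitting item pairs as <k1, v1> = <pair, 1> is one common formulation of market basket analysis; it is an assumption here, not the authors' exact code.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <k1, v1> pairs where k1 is an item pair found in one transaction and v1 is 1.
public class MarketBasketMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable offset, Text transaction, Context context)
            throws IOException, InterruptedException {
        // One input line (from a 64 MB HDFS block) is one basket: "milk,bread,butter".
        String[] items = transaction.toString().trim().split(",");
        Arrays.sort(items); // so that (bread, milk) and (milk, bread) map to the same key

        // Emit every pair of items purchased together in this transaction.
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                pair.set(items[i] + "," + items[j]);
                context.write(pair, ONE);
            }
        }
    }
}
```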

C. Market Basket Analysis for Reducer

The HBase market basket table is split across region servers to store the data sets with the help of ZooKeeper, and <k1, v1> pairs are allotted to each data set involved in the transactions. The reduce function is performed to obtain the output file in the form of <k2, v2> pairs, and the grouping and sorting functions are performed to obtain the final result, a <k3, v3> pair, in the HBase table. The steps are listed below; a matching reducer sketch follows the list.

D. Algorithm

1. The HBase market basket table is split into region servers to store the data sets with <k1, v1> pairs.
2. The <k2, v2> output pairs are obtained from the reduce function in alphabetical order.
3. The sorting and grouping functions are performed by counting the number of values to obtain the final <k3, v3> result pair in the HBase table.
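A matching sketch of the reducer, under the same assumption as the mapper above: it sums the value counts for each item pair, producing the counts that, after the framework's sorting and grouping, become the final result written into the HBase table.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each item pair; the framework has already sorted and grouped the keys.
public class MarketBasketReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text itemPair, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        // Final pair: the item pair and how often it was bought together.
        context.write(itemPair, new IntWritable(total));
    }
}
```

In practice the final pairs could be written directly into HBase by configuring the job with the HBase TableOutputFormat instead of a plain HDFS output path.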

V RESULTS

Fig 5.1 Performance analysis of data sets

HBase holds around 1 million records in the product table, and about 5.1 million records are processed during the algorithm analysis.

Performance results (1 million records):

1 node - 4 min 37 sec
2 nodes - 3 min 31 sec
3 nodes - 2 min 56 sec
5 nodes - 2 min
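The caching parameter referred to in the tuning notes below can also be adjusted per scan from client code; a minimal sketch, assuming a product table, using the standard HBase Scan.setCaching call (the heap memory setting, by contrast, is configured on the HBase servers rather than in client code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanCachingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("product"))) {
            Scan scan = new Scan();
            // Fetch 500 rows per RPC instead of the default; a larger value trades
            // client memory for fewer round trips when scanning millions of records.
            scan.setCaching(500);
            try (ResultScanner scanner = table.getScanner(scan)) {
                long rows = 0;
                for (Result row : scanner) {
                    rows++;
                }
                System.out.println("scanned rows = " + rows);
            }
        }
    }
}
```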

The performance parameters tuned are the HBase heap memory and the caching parameter.

VI CONCLUSION

The integration of the Hadoop ecosystem with HBase is used to store and retrieve huge data sets without any loss, and parallelization is achieved in loading the data sets, building the indexes and evaluating the queries. The Hadoop ecosystem can store both structured and unstructured data sets, and it can insert and delete data sets in parallel. The MapReduce function is used to split the data sets and store them according to the number of processors, and the market basket analysis mapper and reducer functions are performed to store and retrieve millions of data sets along with the <key, value> pairs allotted to each data set. The resulting data sets are stored in the HBase table, and the performance analysis is carried out with five nodes.

VII FUTURE WORK

The Hadoop ecosystem integrated with the HBase database can be applied in various fields such as telecommunications, banking, insurance and medicine, to maintain public records in an efficient manner and to avoid fraud.

VIII REFERENCES

[1] D. Abadi, "Data management in the cloud: Limitations and opportunities", IEEE Transactions on Data Engineering, Vol. 32, No. 1, March 2009.

[2] D. Warneke and O. Kao, "Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, No. 6, June 2011.

[3] Y. Xu, P. Kostamaa and L. Gao, "Integrating Hadoop and Parallel DBMS", Proceedings of the ACM SIGMOD International Conference on Data Management, New York, NY, USA, 2010.

[4] H. Xu, Z. Li et al., "CloudVista: Interactive and Economical Visual Cluster Analysis for Big Data in the Cloud", IEEE Conference on Cloud Computing, Vol. 5.

[5] R. M. Losee and L. Church Jr., "Information Retrieval with Distributed Databases: Analytic Models of Performance", IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 1, January 2004.

[6] F. H. Chanchary, "Data Migration: Connecting databases in the cloud", IEEE Journal on Computer Science, Vol. 40, pp. 450-455, March 2012.

[7] K. Bajda et al., "Efficient processing of data warehousing queries in a split execution environment", Journal on Data Warehousing, Vol. 35, ACM, June 2011.

[8] J. Ababi and A. Silberschatz, "HadoopDB in action: Building real world applications", SIGMOD Conference on Database, Vol. 44, USA, September 2011.

[9] S. Chen, "Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce", Proceedings of VLDB, Vol. 23, pp. 922-933, September 2010.

[10] R. Vernica, M. Carey and C. Li, "Efficient Parallel Set-Similarity Joins Using MapReduce", Proceedings of SIGMOD, Vol. 56, pp. 165-178, March 2010.

[11] Aster Data, "SQL MapReduce framework", http://www.asterdata.com/product/advanced-analytics.php.

[12] Apache HBase, http://hbase.apache.org/.

[13] J. Lin and C. Dyer, "Data-Intensive Text Processing with MapReduce", Morgan & Claypool Publishers, 2010.

[14] GNU Cord, http://www.coordguru.com/.
