Improving the Performance of Heterogeneous Hadoop Clusters Using the Map Reduce Big Data Algorithm

(1)

Improving the Performance of Heterogeneous Hadoop Clusters Using the Map Reduce Big Data Algorithm

Mr. Sachin Jain¹

Assistant Professor, JNU, Jaipur Ms. Amita Kashyap² Assistant Professor, JNU, Jaipur

Mr. Sheetal Kr. Dixit³ Assistant Professor, JNU, Jaipur

Mr.Shivanshu Gautam Assistant Professor, JNU, Jaipur

Abstract:

This article focuses on improving the performance of heterogeneous clusters in Hadoop after a series of steps that improve data I / O, improve the routing algorithm for heterogeneous clusters, and then improve query processing performance in such a way that they can be easily connected to the right part of execution in the shortest possible time without increasing costs at the computer level. The proposed work follows a series of steps to process these scenarios, which are described herewith.

Keywords: Big data, Hadoop, Heterogeneous clusters, Map reduce, Throughput, Latency etc.

I. INTRODUCTION

In simple words, Big Data is a collection of very large and growing data files. It can also be termed as a collection of infrastructure management tools, algorithms used to classify data and create visualizations from data that work on any type of data generated by a user or machine. An important point to note is that Big Data is not only about data generation and visualization, but should also give an insight from any kind of data, regardless of the amount of data and data sets. The main problem that revolves around Big Data is storage, that is, how to store data generated at exponential rates. It is a fact that inspite of cost of storage devices has decreased; it still becomes expensive to store and retrieve data which becomes useful when analyzed properly. Hadoop can handle Big Data infrastructure by providing batch processing and storage to any kind of data. Hadoop is also an open source framework that is used to process data using different methods in a distributed environment. It can manage data movement across different clusters of computers in a way that makes data access fast, reliable, and secure. This includes the use of simple computing and programming models to deal with data storage or distribution over large and wide geographic areas. The relationship between Big Data and Hadoop is much simpler than it seems.

The purpose of this research paper is to identify:

(2)

• Work with existing studies related to Hadoop, Big Data and MapReduce frameworks and commonly encountered problems.

• To deal with large sets of statistical or unstructured data and to review them for research work.

• Comparative study of existing algorithms used for data clustering on heterogeneous Hadoop clusters and one of them is the best for future trends but impose limitation of re-depending on subsequent query which increase latency.

• This motivates to design a new methodology that overcomes the drawbacks of existing algorithms to provide better functionality than the former.

II. SOLUTION DOMAIN

There is a solution in the problem itself. Research work is proposed to study and identify challenges present in heterogeneous clusters of Hadoop, analyze existing algorithms, design a new algorithm to overcome the shortfalls faced by major existing algorithms during data clustering. Hadoopcan provide a good flow between large data and low latency.

III. LITERATURE SURVEY

There has been a variety of forms in which the research work in the Big Data Clustering has been done.

The study and research done by Xu Z., et. al., 2015 states the concept of 4 V‟s is proposed to describe Big Data.

The first V for volume scale is overridden when data in amounts of petabytes and zettabytes arrive. This data volume can prove to be very costly when stored in distributed manner and needs a vast amount of processing. The second V for Velocity is compromised when the data is generated in such a speed that the equipment is not able to deal well, and the responses that are expected within a time frame are delayed. The Velocity tests the capability of system well under stress. The third V fordifferent „varieties‟ of data that a Big Data consists of. These varieties arrive from different sources in format that are not easily understandable. Data could be correlated, could be devised differently, or could be inconsistent but they demand environments with better processing and databases.

The last V refers to the value that the Big Data operations can provide. Most of the database or knowledge based systems are built on the framework of structures. In one or other way they have some arranged order of format which is not so in the Big Data[1]. The study and research done by Thirumala, B., et. al., 2011 states the rate of data processing is growing exponentially but the same is not true for computing power[2].The study and research done by Florica Novacescu, et. al., 2013 states raised concerns about the challenges that are posed by the supercomputing communities. She has brought forward the issues that arise when data to be encountered is in petabytes or zettabytes. The vast amount of data that is created because of numerical simulations as the resultant of supercomputing operations are her key areas to focus. The parallelism in big data sets could be viewed from three different aspects, which are data applications which involve high rate of inbound and outbound movements, the server applications which involve network with high bandwidth to process all the queries, and scientific computing applications which need high rate of processing and memory performances to decode some very complex phenomenon[3]. The study and research done by Liu, F.H. et. al, 2014 state that virtualization is a good way to harness the capabilities of Hadoop and MapReduce on data clusters. The data intensive jobs to be performed on nodes could be improved significantly by incorporating virtualization as a green IT factor. The resource consumption could be reduced, and can save a lot of energy too [4].The study and research done by Zhou Liu, 2015 demonstrate that the Hadoop clusters with DAG‟s scheduling queries experience a vast amount of slow down

(3)

suggests the need of a two level processing to not only reduce the complexity of an internal query but to clarify the level of processing in two levels and further enhancing the job processing time. The authors have used a two-level scheduler (TLS), and conducted experiments that demonstrates the query semantics and reduction in processing time. The TLS offers a good amount of reduction in latency[5].The study and research done by M. Akdere, et. al., 2011 proposed Query Performance Prediction (QPP). The main idea used in this Query Performance Prediction (QPP) is to build workload either, offline for future predictions or building new models online as and when a new prediction is required. The experiments were conducted out using static features captured from the offline execution of the training workload[6]. The study and research done by Fahad A., et. al., 2014 classifies the functionality of candidate algorithms primarily used in Big Data in terms of theoretical as well as empirical analysis. The parameters taken for considering the gross efficiency of an algorithm involves a lot of internal and external factors. Some of them are performance metrics, runtime, scalability tests, and stability. They also described a good clustering algorithm that can be used in Big Data[7]. The study and research done by Tinetti, F.

et. al., 2015 state that MapReduce algorithm performs well where, there are heavy data sets involved. The authors have proposed a new design model to cater the requirements for the future systems and to guide organizations to find the right partner for storage and processing of large data sets. They have used TeraSort and TestDFSIO as benchmarking tools to analyze the depth of HDFS clusters in heterogeneous and homogeneous environment. They have used the TeraSort to practically analyze the scaling and functionality of clusters in case of data burst. The result reveals that a lot of data generation at a fast speed needs to be taken care of else there might be chances of node failures. The TestDFSIO is a benchmarking tool that is simply used to inject node failures to test the stability of the clusters[8]. The authors, Giannakouris, et. al., 2014have followed the model framework between Hadoop‟s MapReduce and calculations based on TF-IDF and the cosine similarity among it. The text clustering on which the method is applied is usable in many applications which are data intensive. The initial data collected, applied TF- IDF upon it,and then a cosine similarity matrix is applied to cluster them into usable components[9]. Jagtap A., et.

A., 2015 in his paper, used the method of filtering, uploading and then applying K-Means to find effective results.

The document clustering can be improved using methods that distinguishably prunes documents and then held them in the Hadoop account. The author has incorporated Davis Bouldin index that measures the quality of cluster that is generated post Hadoop implementation step[10]. The expenses of big data are relevantly written by Kametkar k., et. al., 2015. He explains the scenario of testing tools, hardware, and software counterparts of the Hadoop and the Big Data that can help organizations to discover new ways to use the data and provide good monetization. He also advocates in finding the performance bottlenecks and what could be the solution to the idealistic features to improve them for further proceedings[11].

IV. PROPOSED METHODOLOGY

Our work majorly focused on improving the performance of heterogeneous groups on Hadoop by following a set of steps that improve data input / output queries, and improve the routing algorithm for heterogeneous groups and hence processing performance is improved. Inquiring in such a way that they are easily connected to the right part of execution in the least amount of time without increasing costs at the computing level. The proposed work follows a series of steps to handle this scenario, which are described and is illustrated in the following section, each considering the effects and processes that accompany it.

Solve problems through repetition and phasing:

(4)

Research work involves continuously running query processing for MapperAIDS in steps to improve queries for filtering, clustering, and faster execution processes. Initially the step consists of iterative steps to clean the query and find dependencies between queries so that once the query is executed, the tasktractor does not have to re- depend on computing capabilities after computation in the initial steps. The problem here is to reduce query execution time, improve efficiency of clusters, and iterative processing, where queries are improved through clustering in similar steps, applying TF-IDF, calculating weights, parity metrics managing and finally submitting queries for execution staged methodology for Heterogeneous Groups.

V. QUERY IMPROVISATION

This is where clustering also comes in scene; when queries are fetched into the parser and semantic analyzer, they invariably compute the dependencies among the queries. But once this parsed query is sent to the Hadoop‟s Map Reduce to execute the dependencies that were calculated for the Hive, the query is lost between the transitions.

Once the dependencies are calculated they can be used for semantics extractions in Hive QL processor. In the second step we can use these dependencies such as logical predicates and input tables to process the dependencies among different queries to be closely attached to each other during transition. When the sematic gap between Hive and Hadoop is bridged by these intermediary steps, they can be easily used for clustering similar queries at query level, and hence significantly improvise the query job execution.

a. Query Improvisation Process Applying Clustering in Initial Phases:

A lot of applications and systems use clustering as a part of dividing similar and dissimilar data items at once.

Clustering of data sets here can help achieve focusing on the items that are useful and discarding items which are generally not useful. Looking the content of a log file and applying clustering on it, we get clusters of user prone data that categorize the behavior of the item buying selections. The interactions of a user with an e-commerce website can identify good buying patterns easily. When a single end user data is collected out of the multimillion user‟s data of the website and analyzed the results tells a whole lot of insight about the consumer behavior, item selection, reveals buying options and fields which need improvisation.

The other log files used here symbolizes the unstructured data category and recording of the data from different fronts.In this iterative clustering model when the data is captured from user end, it is optimized to remove static and dynamic parts of it in the clustering process itself where static parts are the decisions whereas dynamic parts are variants of time stamping, invoice numbers, delimiters etc. It could be done through iteration process but applying a standard clustering algorithm with changes improves the efficiency as well.

(5)

Algorithm 1: Query Improvisiation Algorithm Input: Datasets cleansed of garbled values.

Tq = Read_queries from Query.DB() fq= Frequency of item dataset Ci= item in Cluster i

ds = K-Means Clustering Similarity matrix Procedure: Compute fq

Output: Clustered data of importance.

for Q=1…n then

n= Number of log transactions IQ = i.

getQuery() for i = 1…xthen if (IQ is in (Ci))then

fq = 1

ds =Compute distance (Similarity matrix) else

IQ = New(Ci) end if

end for i endforQ

b. Load distribution to High and Slow end Clusters:

Once the data is sorted initially, it can be forwarded to the clusters for processing one by one. Here comes the tricky part because ideally Hadoop doesn‟t contain same type of hardware architecture underneath. There can be nodes that operate as High-End clusters with good processing speeds but there also can be Slow-End clusters with lower speeds than their High-End counterparts. When a job or couple ofjobs is fed into Hadoop ecosystem, they are sent to both types of processors depending on the availability. When a High-End cluster finishes job processing, the data waiting to be processed is sent to the High-End systems. This data movement introduces cost to the processing, data storage and security as well.

Hadoop uses First In First Out (FIFO) job scheduler by default. Apart from it, we used Hadoop Fair Scheduler (HFS) and Hadoop Capacity Scheduler (HCS) and their customizations for Hadoop ecosystem. To have fairness

(6)

among different jobs, the HCS and HFS are sufficient. In the proposed methodology, both the schedulers are used one by one to find the outcome which can help in maintaining order among different capacity clusters.

The load distribution is performed in such a way tht can help reduce data movement among different clusters and hence improve the overall processing quality of Hadoop‟s heterogeneous clusters.

c. Applying MapReduce

As demonstrated in previous sections, when initially the system requirements, job processing on clusters, and queries to be implemented are decided, it becomes whole lot easier to deal with the unstructured data formats. The cleansed, clustered and diversified data items can now be sent to MapReduce to extract the key value pairs for further processing. In MapReduce the applications can be used to classify unstructured documents into a set of meaningful terms which can provide value to the clusters. The MapReduce algorithm can help in retrieving important information from a pool of data which cannot be achieved manually.

d. Implementation of MapReduce:

Algorithm 2: Map Reduce Algorithm Map:

Step 1: Input unstructured data file: (log_file.txt)

Step 2: Function that divides data sets:(log_file_intermediate) Split(log_file.txt): (lf1)+(lf2)+(lf3)….+(lfn) Step 3: Collect Output: (log_file_output; lfId, 1)

Reduce:

k: Count of the individual terms.

Step 1: Input file which was output from map function: (log_file_output; lfId, [k])

Step 2: Function to summarize the intermediate terms and give a final value to all the detected valid terms in the module:

p=Sum (k)

Step 3: Provide output to sum up the file values. (log_part, countId, p) e. Applying TF-IDF:

As described in the earlier parts, Term Frequency- Inverse Document Frequency(TF-IDF) provides an equivalent measure of the weight of information retrieval system. The queries on which a TF-IDF weight is to be calculated, carrying more weightage to a good result on the filtered words that are more important than the others in a classification system. For this particular requirement of categorizing big data sets.

(7)

Fig 5: Comparison of different Hadoop Schedulers VI. SIMULATION AND RESULT

The following results have been received by conducting the proposed methodology in the system. The heterogeneous Hadoop clusters created to test the functioning and working of big data sets have emulated the input output throughput, average rate of execution, standard deviations and finally the test execution time.

a. Comparison of efficiency between Hadoop Schedulers:

The results of comparing the mostly efficient schedulers used in Hadoop, First In First Out, Hadoop Fair Scheduler and Hadoop Common Scheduler. The following graphs establish the fact that schedulers can outperforms other basic scheduler that are used in Hadoop‟s distributed File System and internal mechanisms to share a single file or job over the multiple default systems to execute them and store reliably.

The figure 6 shows an estimate about how choosing a different type of scheduler could lead you to optimize the performance of systems used in HDFS for clustering Big Data. The response time generated by the data loads of big data, which were actually the logs of an e-commerce website and the response time took by the schedulers in mapping andreducing them to store them on the DFS, manually extract as well as manage them. It also establishes the fact that surplus amount of data as in general in big data processing instead of data loading and unloading frequently to get better insights about the data. sets, HadoopcanreallyoptimizetheThis also proves the fact that once data is accepted by a node, the average time to consume and process it may vary depending upon the type of scheduler beingused.

b. Screenshots of Heterogeneous cluster inwork:

Theclustersthatwerecreatedforthisdissertationworkareworkingandderivedresultsas formulated. The average i/o rate of data movement, standard deviations, and time takento execute the loaded files prove that the formulated method is working asexpected.

ThefollowingimagesdepictthestepbystepprocessesofMapReduceatworkandclusters working under the stipulated load scheduling of big datasets.

(8)

c. Namenode and Datanodestature

The namenodes and datanodes could be seen from the admin reports that depict each node as working.

Fig 6: Comparative study of FIFO, HFC, HCS

d. Hadoop Capacity Schedulers: The test results are the resultant of the working of Hadoop capacity scheduler algorithm on the Big Data sets for heterogeneous clusters in Hadoopenvironment is shown in figure 7. The MapReduce Administrator represents the summarized information about the cluster and job scheduling information of the running jobs under execution which can be seen in figure 8.

Fig 7: Hadoop Capacity Schedulers

(9)

748 ISSN: 2005-4238 IJAST

Fig 8: MapReduce Administration VII. CONCLUSION

The research work and experiments conducted under this task have produced quite astonishing results, some of them being the choice of job schedulers, placement of data in the similarity matrix, clustering before timeliness queries and internal, iteration, mapping and minimization and internal bindings. Dependencies together to avoid query stalling and execution time.

VIII. REFERENCES:

[1] Xu Z, Shi Y. Exploring Big Data Analysis: Fundamental Scientific Problems. Annals of Data Science. 2015; 2(4): 363–

367

[2] B. Thirumala Rao, N.V. Sridevi, V. Krishna Reddy, L.S.S. Reddy, Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing, GJSCT, Vol. XI Issue VII, May 2011

[3] Florica Novăcescu, “Big Data in High Performance Scientific Computing”, “EFTIMIE MURGU” RESIłA, ANUL XX, NR. 1, ISSN 1453 – 7397, 2013

[4] F.H. Liu, Y.R. Liou, H.F. Lo, K.C. Chang and W.T. Lee (2014). The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform, International Journal of Information and Electronics Engineering, Vol. 4, No. 6, pp.480-484, November 2014

[5] Zhuo Liu (2015). Efficient Storage Design and Query Scheduling for Improving Big Data Retrieval and Analytics, Dissertation, Auburn University, Alabama

[6] M. Akdere, U. C¸ etintemel, M. Riondato, E. Upfal, and S. Zdonik. Learning-based query performance modeling and prediction. In ICDE, 2012

[7] Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comp. 2014, 2(3):267–79

[8] F.G. Tinetti, I. Real, R. Jaramillo, and D. Barry (2015). Hadoop Scalability and Performance Testing in Heterogeneous Clusters, In the Proceedings of the 2015 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA-2015), Part of WORLDCOMP‟15 pp.441-446

[9]Giannakouris - Salalidis Victor, Plerou Antonia, Sioutas Spyros, “CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce”, IFIP Advances in Information and Communication Technology, September 2014 [10]https://www.michaelnoll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-

testdfsio-nnbench-mrbench/