Research Article
December 2017
Computer Science and Software Engineering
ISSN: 2277-128X (Volume-7, Issue-12)
Multiple MapReduce Jobs in Distributed Scheduler for Big Data Applications
S. Kasi Perumal, T. Madhusudhan Rao, P. G. Praveen Chander
UG at Velammal Institute of Technology, Panchetti, Chennai, Tamilnadu, India
Abstract- The majority of large-scale data-intensive applications executed by data centers are based on MapReduce or its open-source implementation, Hadoop. Such applications are executed on large clusters requiring large amounts of energy, making energy costs a considerable fraction of the data center’s overall costs. Therefore, minimizing the energy consumed when executing each MapReduce job is a critical concern for data centers. In this paper, we propose a framework for improving the energy efficiency of MapReduce applications while satisfying the service level agreement (SLA). We first model the problem of energy-aware scheduling of a single MapReduce job as an integer program. We then propose two heuristic algorithms, called energy-aware MapReduce scheduling algorithms (EMRSA-I and EMRSA-II), that find assignments of map and reduce tasks to machine slots that minimize the energy consumed when executing the application. Our algorithms are able to find near-optimal job schedules that consume approximately 40 percent less energy on average than the schedules obtained by a common-practice scheduler that minimizes the makespan.
Keywords- SLA, JVM, GNU, GUI
I. INTRODUCTION
One of the major challenges of processing data-intensive applications is minimizing their energy costs. Big data applications run on large clusters within data centers, where their energy costs make the energy efficiency of executing such applications a critical concern. For scheduling multiple MapReduce jobs, Hadoop originally employed a FIFO scheduler. To overcome the waiting-time issues of FIFO, Hadoop then employed the Fair Scheduler. In most cases, processing big data involves running production jobs periodically. For example, Facebook processes terabytes of data for spam detection daily. Such production jobs allow data centers to use job profiling techniques to obtain information about the resource consumption of each job.
II. EXISTING SYSTEM
The existing algorithms can be incorporated into higher-level energy management policies in data centers. The problem of scheduling MapReduce tasks for energy efficiency is modeled as an integer program. In the absence of computationally tractable optimal algorithms for solving this problem, the authors devise two heuristic algorithms, called EMRSA-I and EMRSA-II, where EMRSA is an acronym for energy-aware MapReduce scheduling algorithm. EMRSA-I and EMRSA-II provide very fast solutions, making them suitable for deployment in real production MapReduce clusters.
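The integer program mentioned above is not written out in this paper. As a rough sketch only, one plausible form is given below. All notation is assumed for illustration: M and R are the sets of map and reduce tasks, A and B are the sets of map and reduce slots, e_{ij} and p_{ij} are the energy and processing time of task i on slot j, D is the job deadline, and X_{ij}, Y_{ij} are binary assignment variables. The deadline constraint assumes that reduce tasks start only after all map tasks have finished.

```latex
% Illustrative sketch under the assumptions stated above; not taken verbatim from the source.
\begin{align}
\min \quad & \sum_{j \in A} \sum_{i \in M} e_{ij} X_{ij}
           + \sum_{j \in B} \sum_{i \in R} e_{ij} Y_{ij} \\
\text{s.t.} \quad
  & \sum_{j \in A} X_{ij} = 1 && \forall i \in M \\  % each map task gets exactly one map slot
  & \sum_{j \in B} Y_{ij} = 1 && \forall i \in R \\  % each reduce task gets exactly one reduce slot
  & \sum_{i \in M} p_{ij} X_{ij} + \sum_{k \in R} p_{kl} Y_{kl} \le D
      && \forall j \in A,\ \forall l \in B \\        % map phase plus reduce phase meets the deadline
  & X_{ij} \in \{0, 1\},\quad Y_{kl} \in \{0, 1\}
\end{align}
```

The objective sums the energy of every chosen task-to-slot assignment, which is why heuristics that favor slots with low energy per task are a natural fit when the exact program is too slow to solve.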
III. LITERATURE SURVEY
1) Yaxiong Zhao, Jie Wu. Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework. Proc. 32nd IEEE Conference on Computer Communications, INFOCOM 2013, IEEE Press, Apr. 2013, pp. 35-39.
ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 34-39
© www.ijarcsse.com, All Rights Reserved Page | 35
In Dache, tasks submit their intermediate results to the cache manager. Before initiating its execution, a task queries the cache manager for potentially matching processing results, which could accelerate its execution or even save the execution entirely. A novel cache description scheme and a cache request and reply protocol are designed, and Dache is implemented by extending the relevant components of the Hadoop project. The results demonstrate that Dache significantly improves the completion time of MapReduce jobs and saves a significant amount of CPU execution time.
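The query-before-execute idea described above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not Dache's actual implementation or API: the class `CacheManagerSketch`, its methods, and the key format (operation name plus input identifier) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical in-memory stand-in for a cache manager; not Dache's real API. */
final class CacheManagerSketch {
    private final Map<String, String> results = new HashMap<>();
    private int hits = 0;

    /** Assumed cache description: the operation name plus the input identifier. */
    static String key(String inputId, String operation) {
        return operation + "@" + inputId;
    }

    /** A task asks whether a matching result already exists. */
    Optional<String> lookup(String inputId, String operation) {
        return Optional.ofNullable(results.get(key(inputId, operation)));
    }

    /** A task submits its intermediate result after computing it. */
    void submit(String inputId, String operation, String result) {
        results.put(key(inputId, operation), result);
    }

    /** Query the cache first; compute and submit only on a miss. */
    String runTask(String inputId, String operation, Function<String, String> compute) {
        Optional<String> cached = lookup(inputId, operation);
        if (cached.isPresent()) {
            hits++;
            return cached.get();
        }
        String result = compute.apply(inputId);
        submit(inputId, operation, result);
        return result;
    }

    int hits() {
        return hits;
    }

    public static void main(String[] args) {
        CacheManagerSketch cache = new CacheManagerSketch();
        Function<String, String> upper = s -> s.toUpperCase();
        System.out.println(cache.runTask("split-0", "upper", upper)); // computed
        System.out.println(cache.runTask("split-0", "upper", upper)); // served from cache
        System.out.println("cache hits: " + cache.hits());
    }
}
```

The second call with the same input and operation returns the stored result without running the task body again, which is where the saved CPU time would come from.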
2) X. Wang, D. Shen, G. Yu, T. Nie, and Y. Kou, “A throughput driven task scheduler for improving mapreduce performance in job-intensive environments,” in Proc. of the 2nd IEEE International Congress on Big Data, 2013, pp. 211–218
The authors proposed a task scheduling technique for MapReduce that improves system throughput in job-intensive environments, but it does not consider energy consumption. None of the existing frameworks and systems exploits job profiling information when deciding where to place tasks on nodes in order to increase the energy efficiency of executing MapReduce jobs. An energy-aware algorithm, by contrast, considers the significant differences in energy consumption among task placements on machines and finds an energy-efficient assignment of tasks to machines.
3) J. Leverich and C. Kozyrakis, “On the energy (in)efficiency of Hadoop clusters,” ACM SIGOPS Operating Systems Review, vol. 44, no. 1, pp. 61–65, 2010.
The authors proposed a method for cluster energy management for MapReduce jobs that selectively powers down nodes with low utilization. Their method uses a cover-set strategy that exploits replication to keep at least one copy of each data block available. As a result, in low-utilization periods, nodes that are not in the cover set can be powered down.
4) S. Vikram Phaneendra & E. Madhusudhan Reddy, “Big Data- solutions for RDBMS problems- A survey,” in 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010) (Osaka, Japan, Apr 19–23 2013).
The authors observed that data volumes were once small and easily handled by an RDBMS, but that it has recently become difficult to handle huge volumes of data, referred to as “big data”, with RDBMS tools. They stated that big data differs from other data in five dimensions: volume, velocity, variety, value, and complexity. They illustrated the Hadoop architecture, consisting of name node, data node, edge node, and HDFS, for handling big data systems. According to the survey, the Hadoop architecture handles large data sets with scalable algorithms and supports log management, and applications of big data can be found in the financial, retail, health-care, mobility, and insurance industries. The authors also focused on the challenges that enterprises face when handling big data, such as data privacy and search analysis.
IV. PROPOSED SYSTEM
We design and implement a distributed scheduler for multiple MapReduce jobs with a primary focus on energy consumption. The proposed system schedules the incoming data, and the decision on how to distribute queries depends on the load balancing algorithm. A number of load balancing algorithms exist, such as Round Robin (RR) and Least Weight (LW). The proposed system is based on Least Work based on the number of queries running (LWRn).
The LWRn algorithm requires knowledge of the number of queries running at each node and chooses the node with the fewest queries at the assignment instant. The Least Weight (LW) algorithm, by comparison, needs to measure the current load in terms of parameters such as CPU, memory, and I/O in order to determine the least-loaded node, and it then assigns the query to that node.
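The three selection policies named above can be compared side by side in a short Java sketch. Everything here is illustrative: the `Node` fields, the class name `LoadBalancerSketch`, and in particular the 0.5/0.3/0.2 weighting of CPU, memory, and I/O are assumptions, since the paper does not state how LW combines its load parameters.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; class, field, and weight choices are assumptions. */
final class LoadBalancerSketch {

    /** Load snapshot of one node; cpu, memory, and io are utilizations in [0, 1]. */
    static final class Node {
        final String name;
        final int runningQueries;
        final double cpu;
        final double memory;
        final double io;

        Node(String name, int runningQueries, double cpu, double memory, double io) {
            this.name = name;
            this.runningQueries = runningQueries;
            this.cpu = cpu;
            this.memory = memory;
            this.io = io;
        }
    }

    private int cursor = 0; // position of the next node for round robin

    /** Round Robin (RR): cycle through the nodes regardless of their load. */
    Node roundRobin(List<Node> nodes) {
        Node chosen = nodes.get(cursor % nodes.size());
        cursor = (cursor + 1) % nodes.size();
        return chosen;
    }

    /** LWRn: pick the node with the fewest running queries (first one wins ties). */
    static Node leastRunning(List<Node> nodes) {
        Node best = nodes.get(0);
        for (Node n : nodes) {
            if (n.runningQueries < best.runningQueries) {
                best = n;
            }
        }
        return best;
    }

    /** Assumed load score for LW; the weights are hypothetical. */
    static double weight(Node n) {
        return 0.5 * n.cpu + 0.3 * n.memory + 0.2 * n.io;
    }

    /** Least Weight (LW): pick the node with the lowest measured load score. */
    static Node leastWeight(List<Node> nodes) {
        Node best = nodes.get(0);
        for (Node n : nodes) {
            if (weight(n) < weight(best)) {
                best = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                new Node("A", 3, 0.20, 0.30, 0.10),
                new Node("B", 1, 0.90, 0.80, 0.70),
                new Node("C", 2, 0.50, 0.50, 0.50));
        LoadBalancerSketch lb = new LoadBalancerSketch();
        System.out.println("RR:   " + lb.roundRobin(nodes).name);
        System.out.println("LWRn: " + leastRunning(nodes).name);
        System.out.println("LW:   " + leastWeight(nodes).name);
    }
}
```

In this sample, node B runs the fewest queries but carries the heaviest resource load, so LWRn and LW disagree. The trade-off is that LWRn needs only a query counter per node, while LW needs live CPU, memory, and I/O measurements.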
V. SYSTEM ARCHITECTURE
[Figure: system architecture diagram; surviving labels: MapReduce, Scheduler, Input phase]
VI. SOFTWARE DESCRIPTION
Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere", meaning that code that runs on one platform does not need to be recompiled to run on another. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture. Java is, as of 2014, one of the most popular programming languages in use, particularly for client-server web applications, with a reported 9 million developers. Java was originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++, but it has fewer low-level facilities than either of them.
The original and reference implementation Java compilers, virtual machines, and class libraries were originally released by Sun under proprietary licences. As of May 2007, in compliance with the specifications of the Java Community Process, Sun relicensed most of its Java technologies under the GNU General Public License.
VII. PROPOSED ALGORITHM
VIII. MODULE DESCRIPTION
Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
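The phases listed above can be traced end to end with a small in-memory word count. The following Java sketch is only an illustration of the data flow; it does not use the Hadoop API, and the class name `MiniMapReduce` and its method names are hypothetical. Each input string stands for one record handed to a mapper.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the MapReduce phases; not the Hadoop API. */
final class MiniMapReduce {

    /** Map: turn one input record into zero or more (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Combiner: pre-aggregate the intermediate pairs of a single mapper. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            local.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return local;
    }

    /** Reduce: sum all values that were grouped under one key. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs input, map, combine, grouping by key, and reduce in sequence. */
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map<String, Integer> combined = combine(map(record));
            for (Map.Entry<String, Integer> e : combined.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Output phase: print the final key-value pairs.
        // prints {big=2, center=1, cluster=1, data=2}
        System.out.println(run(Arrays.asList("big data big cluster", "data center")));
    }
}
```

Because the combiner sums counts inside each record before grouping, the reducer for "big" receives the single value 2 instead of two separate 1s, which is the kind of reduction in intermediate data that a combiner is meant to provide.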
Input Design
Input is classified into various forms; in some cases the user gives manual input such as text (numeric, string, character, or special characters). In the proposed project, the user gives input through manual options, such as a unique username and password to access the application. The user gives input through either the keyboard or sensors. The user can give any type of input in any order and can use any window frame through the GUI controls. The forms are also filled in manually by the user.
IX. CONCLUSION
In this report, we described energy-aware scheduling in a MapReduce-based framework for big data processing in Hadoop. Given the increasing need for big data processing and the adoption of Hadoop for such processing, improving MapReduce performance with energy-saving objectives can have a significant impact on reducing energy consumption in data centers. In this project, we show that there are significant optimization opportunities within the MapReduce framework in terms of reducing energy consumption. We proposed two energy-aware MapReduce scheduling algorithms, EMRSA-I and EMRSA-II, that schedule the individual tasks of a MapReduce job for energy efficiency while meeting the application deadline. Both proposed algorithms provide very fast solutions, making them suitable for execution in real-time settings.