
Map-Parallel Scheduling (MPS) using Hadoop Environment for Job Scheduler and Time Span for Multicore Processors


Sudarsanam P
BMS Institute of Technology, Bangalore, India
mcasuda@rediffmail.com

G. Singaravel
Department of Information Technology, K.S.R College of Engineering, Tiruchengode, Tamilnadu, India
singaravelg@gmail.com

Abstract

Parallel computing is a base mechanism for data processing: scheduling tasks, executing them within a time span, reducing data blocks through indexing, and so on, in order to analyse the given data and send the outcome onward for further processing in the system. Clusters, the Hadoop environment, and MapReduce are important building blocks for a parallel-processing platform that executes work effectively in terms of both job scheduling and memory utilization. A number of job scheduling algorithms have been proposed for real-time processing in various data-analytics applications, yet mismatches remain between the job schedule and the time assigned to a particular task. This paper introduces Map-Parallel Scheduling (MPS), an approach that uses the Hadoop environment and the MapReduce concept to build parallel processing with a scheduling algorithm that sizes the data, utilizes memory space, and matches the job schedule to its time span.

Keywords: Parallel processing, Data analytics, Hadoop environment, MapReduce, Map-Parallel Scheduling (MPS).

Introduction

Parallel computing performs different operations or activities of the same domain simultaneously. Its main principle is to divide a problem into smaller parts that execute at the same time; the best real-world example is a washing machine.

In the digital era, modern technology runs on different platforms to reduce data dimensionality and improve the speed of data processing. Various computing paradigms facilitate smooth processing: parallel computing, distributed computing, grid computing, utility computing, cloud computing, and so on. These technologies reduce time, utilize memory space, and handle scheduling and job execution [6][7], with the powerful support of the operating system, compiler, and programming tools [8].

Parallel and distributed computing form the basis for handing out information and processing it within a time span and a limited allocation of space. Parallelism operates at different levels, such as bit, instruction, and task: computations can be re-ordered and collected into groups that are then executed in parallel without changing the result of the program. Data arrives in disordered form and must be organised analytically so that large outcomes can be processed in meaningful ways; big data analytics, clusters, and the Hadoop environment all support processing the data in any computing system [9]. The following section briefly discusses how parallel/distributed computing supports the scheduling process and creates a new platform for these methods with the help of clusters and the Hadoop environment for data analytics [10].

ISBN: 978-0-9948937-3-4

Theoretical Foundation

Big data refers to data sets examined to uncover patterns, unknown correlations, and other useful information, fetching faster and better analysis across all available data. Big data is characterised by the volume, velocity, and variety of the information gathered. Volume refers to the data set's storage capacity and the need to reduce size beyond the required volume. Variety describes the sources and types of data: structured, unstructured, and semi-structured. Structured data gives results that can be used to measure the processing needs of the data set. Unstructured data covers free-text documents, video, audio, and so on, which take many more bytes to store. Semi-structured data, such as HTML and XML documents, stores its bytes with accurate structure.

Figure 1.1: Basic Hadoop architecture.

A. Clustering

Clustering is the task of forming groups in which related information is stored within the data set, separating the data into different folders. The concept is suitable for reducing space, and many clustering algorithms exist; among them are those that compute the mean value of each cluster. A distributed cluster can separate an entire document into the parts needed, following the structure of the data set, and can remove data that is irrelevant to the particular data set.
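The mean-based grouping mentioned above can be sketched as a minimal one-dimensional k-means loop. The function name, sample data, and cluster count below are hypothetical, chosen only to illustrate how points regroup around recomputed means:

```python
def kmeans_1d(points, k, iters=10):
    """Group 1-D values into k clusters by repeatedly recomputing cluster means."""
    # Spread the initial centroids across the sorted data.
    centroids = sorted(points)[::len(points) // k][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move every centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], k=2)
# centroids -> [2.0, 11.0]; clusters -> [[1, 2, 3], [10, 11, 12]]
```

After the loop settles, each value sits in the group whose mean it is closest to, which is exactly the separation into "folders" the section describes.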

B. Hadoop

Hadoop spreads out the conditions often processed in big data: it can analyse an entire data set and its subsets in sequence when creating a new environment. It provides large data stores that adapt extremely well to the data set and runs applications using the MapReduce algorithm, which can in turn shape business trends. A Hadoop application works in a platform/environment that spreads storage and computation across clusters. The Hadoop environment is designed to scale up from a single server to thousands of machines, each offering local computation and storage, with a distributed file system that provides high-throughput access to application data. Figure 1.1 shows the basic Hadoop environment for the newly created parallel computing.

C. MapReduce

MapReduce is a programming model for processing large data sets in parallel across a distributed cluster. It filters and sorts the data stored in the cluster, with distributed servers running different tasks over the data set and the framework handling the parallel communication and storage. MapReduce programs can be written in various programming languages. An input reader reads the information and sends it to the map function, which filters and partitions the data; the framework then groups the data by key across the data set, the reduce function combines the remaining data, and the result is stored back in the cluster. This background is required to create a new Hadoop environment for parallel/distributed processing.
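The read, map, partition, reduce pipeline just described can be sketched in miniature with a word count, the canonical MapReduce example. The function names and the two input lines below are illustrative only, not part of any real Hadoop API:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield word, 1

def partition(pairs):
    # Shuffle/partition: group all emitted values under their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data cluster data"]
counts = reduce_phase(partition(map_phase(lines)))
# counts == {"big": 2, "data": 3, "cluster": 2}
```

In a real cluster the map and reduce calls run on different machines and the partition step moves data over the network; the single-process version above only shows the data flow.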

Literature Review

Scheduling of Parallel Applications Using MapReduce on Cloud: A Literature Survey (2015) [1]: Parallel applications and environments introduce new trends used by large numbers of users; they measure and modify data requirements and can identify the varying size, volume, and velocity of the data along with the execution speed of the process. Cloud computing can negotiate data across varying application sizes and execution costs. The MapReduce model is widely used to process large-scale data-intensive applications on clusters in a cloud environment. Scheduling can be made efficient by using knowledge of the data placement of the map tasks, which helps reduce intermediate network traffic during the reduce phase and speeds up the execution of MapReduce applications in the cloud environment.



A Survey on DyScale: Hadoop Job Scheduler for Different Multicore Processors (2016) [2]: Modern processors trade limited speed and possible complication against power efficiency, with cores following correlated slower and faster trends. DyScale is a framework that gives schedulers the opportunity to exploit this heterogeneity, improving server performance by processing MapReduce jobs on multicore processors in a parallel and distributed way. The Hadoop conditions build on new job-scheduling trends, since the data can be assumed to be served by either slow or fast cores for batch job processing.

DyScale interacts with MapReduce so that small schedulers are not aborted while a large scheduling process runs; the input files carry tasks between positive and negative situations, occupying information throughout the mapping process of the job trackers, filtering the environment, and reducing the combined phases of the core processors.

Dynamic Clustering for Scientific Workflows with Load Balancing in Resource (2015) [3]: Task clustering combines multiple tasks so they can be balanced as a single task source over the data set. The various workflows must be enhanced to meet the needs of the cluster; an important part is identifying the limits on tasks that can run concurrently under the load-balancing resources. Sub-workflows are predicted for the cluster, dedicating a separate task to each balanced workflow. This increases inter-task communication between the balanced workflows: similar sub-workflows are discovered among the tasks, and the load balancer spreads the information in a similar way to the nodes of other balancing nodes, gathering the required information from the particular cluster storage according to the dynamic cluster.
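The core load-balancing idea in this survey, handing each task to whichever node currently carries the least work, can be sketched with a greedy heap-based assignment. The task sizes and node count below are made up for illustration; this is not the surveyed paper's algorithm:

```python
import heapq

def balance(task_sizes, n_nodes):
    """Greedy load balancing: each task goes to the currently least-loaded node."""
    # Min-heap of (current_load, node_id) so the lightest node pops first.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignments = {node: [] for node in range(n_nodes)}
    for size in sorted(task_sizes, reverse=True):  # place the largest tasks first
        load, node = heapq.heappop(heap)
        assignments[node].append(size)
        heapq.heappush(heap, (load + size, node))
    return assignments

jobs = balance([7, 5, 4, 3, 1], n_nodes=2)
# node 0 receives [7, 3] and node 1 receives [5, 4, 1]: both end at load 10
```

Sorting tasks largest-first is the usual greedy refinement: big tasks are placed while all nodes are still lightly loaded, and the small ones fill in the gaps.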

An Efficient MapReduce Scheduling Algorithm in Hadoop (2015) [4]: Hadoop is an open-source programming framework that supports large data sets distributed in nature, and MapReduce serves the purpose of taking a large data set and running a parallel distributed algorithm on a cluster. A major benefit of MapReduce is that it handles information and faults automatically, hiding the complexity from users. Hadoop mainly uses a FIFO policy that executes jobs in the order of their arrival. This is only suitable for homogeneous environments; in heterogeneous ones the performance is poor. Among the compared algorithms, FIFO and SAMR, the SAMR algorithm reduces task time. The loader's time interval feeds the overall time complexity; with an unbalanced job tracker, the split reduces each task's parameters separately at the parallel level of the MapReduce framework, which is why a SAMR algorithm is required.

Outlook on Various Scheduling Approaches in Hadoop (2016) [5]: Heterogeneous cores are evaluated under a simulated Hadoop process to determine whether single-core, multi-core, or many-core processing serves better. Since all cores function under the efficiency of the processor, the Hadoop-based overview of scheduling programming can provide power efficiency through the multilayer performance of core scheduling. This helps carry out the various approaches of the environment, with cores running in parallel to meet the enhanced needs of the Hadoop program, from the tracker of a single-core processor up to many-processor MapReduce over large-scale data sets.

Proposed Research

From the literature review, various researchers note that the scheduler plays a major role in making a processor effective, yet a fully proper scheduling algorithm has not been derived for every task and data set; for example, some scheduling algorithms are not suited to time-sharing systems. The First Come, First Served (FCFS) scheduling algorithm is non-preemptive and is unsatisfactory for interactive systems because it favours long tasks. The Priority Scheduling Algorithm (PSA) is preemptive and based entirely on precedence: each process in the system carries a priority, the maximum-priority job runs first, and lower-priority jobs are made to wait. Beyond these, there are many other scheduling algorithms, such as Sampling-Based Scheduling, Random Scheduling, Memory Dominance Scheduling, Dynamic Scheduling, and Age-Based Scheduling.
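The FCFS drawback noted above is easy to see numerically: when a long job arrives ahead of two short ones, both short jobs inherit the long job's entire burst as waiting time. A small sketch with a hypothetical workload (the function names and burst values are illustrative):

```python
def fcfs_waiting_times(bursts):
    """Non-preemptive FCFS: each job waits for every job submitted before it."""
    waits, elapsed = [], 0
    for burst in bursts:
        waits.append(elapsed)
        elapsed += burst
    return waits

def priority_order(jobs):
    """Non-preemptive priority scheduling: lowest priority number runs first."""
    return sorted(jobs, key=lambda job: job["priority"])

waits = fcfs_waiting_times([24, 3, 3])
# waits == [0, 24, 27]: both short jobs sit behind the long one
first = priority_order([{"name": "backup", "priority": 3},
                        {"name": "editor", "priority": 1}])[0]
# first["name"] == "editor": the higher-priority (lower-numbered) job runs first
```

Under FCFS the average wait here is (0 + 24 + 27) / 3 = 17 time units, whereas running the short jobs first would cut it to (0 + 3 + 6) / 3 = 3, which is exactly why FCFS is considered unsatisfactory for interactive systems.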

The proposed system designs a Hadoop environment with the Hadoop Distributed File System (HDFS) as its base layer, storing large amounts of data and providing access to it across the cluster platform, and MapReduce as its second layer, processing the data through parallel/distributed computing. MapReduce acts as an intermediate layer for the data generated by the tasks and helps enhance the performance of the MapReduce job. This layer focuses on the data generated by the map tasks that are later used by the reduce tasks, executed in parallel and automatically, so the framework plays an essential role in improving performance.
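As a rough sketch only (the paper does not give MPS's concrete algorithm), the two-layer idea, a storage layer that splits data into blocks and a processing layer that maps over those blocks in parallel while scheduling the larger blocks first to shrink the overall time span, might look like this; every name and parameter here is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # records per block; real HDFS blocks are byte-sized (e.g. 128 MB)

def split_into_blocks(records, block_size=BLOCK_SIZE):
    # Storage layer: break the input into fixed-size blocks, as HDFS does.
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def map_block(block):
    # Processing layer: one map task per block, here simply counting words.
    return sum(len(record.split()) for record in block)

def mps_run(records, workers=4):
    blocks = split_into_blocks(records)
    # Schedule the larger blocks first so a late big block cannot stretch
    # the time span of the whole job.
    blocks.sort(key=len, reverse=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return sum(partials)  # reduce step: merge the partial results

total = mps_run(["a b", "c", "d e f"] * 3)
# total == 18 words, from 9 records split into blocks of 4, 4 and 1
```

A real Hadoop deployment would run the map tasks as separate processes on different machines rather than threads in one process; the thread pool here only stands in for that parallel execution.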



Figure 1.3 shows the Hadoop environment that creates parallel processing with a scheduling algorithm for sizing the data, named Map-Parallel Scheduling (MPS), using the HDFS environment.

Conclusion

Performance is the main aspect of any data-analytics problem or solution, and Map-Parallel Scheduling (MPS) in an HDFS environment can use processors with different core types on a single chip to improve energy efficiency without giving up significant performance relative to the schedulers mentioned above, while improving scalability for multithreaded workloads. The scheduling can be extended to optimize the MapReduce programming sequence for data security and data management. MPS on Hadoop offers low cost, high availability, and processing power, with job-ordering scheduling policies that achieve fairness in job completion.

References

[1] A. Sree Lakshmi, M. Bal Raju and N. Subhash Chandra, “Scheduling of Parallel Applications Using Map Reduce On Cloud: A Literature Survey”, International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 6(1), 2015, pp. 112-115.

[2] Supriya R. and Kantharaju H. C., “A Survey on DyScale: Hadoop Job Scheduler for Different Multicore Processors”, Imperial Journal of Interdisciplinary Research (IJIR), Vol. 2, Issue 3, 2016, ISSN: 2454-1362.

[3] Roya Bagheri and Abolfazel Toroghi Haghighat, “Dynamic Clustering for Scientific Workflows with Load Balancing in Resource”, International Journal of Computer Science and Telecommunications (IJCST), Vol. 6, Issue 8, August 2015.

[Figure 1.3: MPS architecture — HDFS storage, MapReduce parallel computing, Hive/Pig, and the scheduling algorithm.]



[4] R. Thangaselvi, S. Ananthbabu and R. Aruna, “An efficient Mapreduce scheduling algorithm in hadoop”, International Journal of Engineering Research & Science (IJOER), Vol. 1, Issue 9, December 2015.

[5] P. Amuthabala, Kavya T. C., Kruthika R. and Nagalakshmi N., “Outlook on Various Scheduling Approaches in Hadoop”, International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 8, No. 2, February 2016.

[6] Feng Yan, Ludmila Cherkasova, Zhuoyao Zhang and Evgenia Smirni, “DyScale: a MapReduce Job Scheduler for Heterogeneous Multicore Processors”, IEEE Transactions on Cloud Computing, volume PP, issue 99, 2015.

[7] Dazhao Cheng, Jia Rao, Changjun Jiang and Xiaobo Zhou, “Resource and Deadline-Aware Job Scheduling in Dynamic Hadoop Clusters”, IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 956-965, 2015.

[8] Sofia D'Souza and K. Chandrasekaran, “Analysis of Map Reduce scheduling and its improvements in cloud environment”, IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), pp. 1-5, 2015.

[9] Hongyang Sun, Yangjie Cao and Wen-Jing Hsu, “Efficient Adaptive Scheduling of Multiprocessors with Stable Parallelism Feedback”, IEEE Transactions on Parallel and Distributed Systems, Vol. 22, Issue 4, pp. 594-607, 2011.

[10] N. Saranya and R. C. Hansdah, “Dynamic Partitioning Based Scheduling of Real-Time Tasks in Multicore Processors”, IEEE 18th International Symposium on Real-Time Distributed Computing, 2015.
