Cloud Based Big Data Analytic: a Review

A.S. Manekar 1, G. Pradeepini 2

1 Research Scholar, K L University, Vijaywada, A. P

2 Department of CSE, KL University, Vijaywada, A.P

Abstract

Cloud computing is a complex architecture for sharing computing resources that has advanced alongside the applications built on it. Informally, the cloud means “the Internet”. The computing industry has already begun using cloud infrastructure to advance its business, and big players in all sectors have moved their businesses to cloud-based infrastructure and applications. This infrastructure and these applications generate huge amounts of data, which are useful for future prediction. Emerging applications such as online shopping, weather forecasting, social media sites and many more depend on big data analytics for such predictions. Many researchers and companies are looking for ways to combine big data analytics with the cloud. Big data is a term for data characterized by large volume, velocity and variety. In this paper we review several big data processing techniques from the system and application perspectives. We also discuss some challenges for future work in this area with respect to today’s technology. The main focus is on key issues such as big data processing, cloud computing platforms, cloud infrastructure and resources, data management in the cloud, and methods of accessing these databases. Security is a prime issue in every respect when moving big data analytics to the cloud. Finally, the concluding discussion considers current issues and challenges, MapReduce parallel processing as a solution, and the integration of distributed data processing into cloud-based infrastructure, together with future research directions for big data processing in cloud infrastructure.

Keywords: Big Data, Cloud Computing, Cost Minimization, Progressive analytics, Hadoop, MapReduce, HDFS.

1. Introduction

Today’s cloud environment is a general alternative for delivering computer processing, storage, and software from terminals and servers over a high-speed backbone network that ultimately terminates at next-generation data centers. All data mining applications have the potential to be deployed in, or migrated to, the cloud. Every enterprise has a huge investment in software and hardware. If this massive amount of enterprise data is transferred to the cloud, companies, especially small-scale businesses with limited investment, can adopt the pay-as-you-go (PAYG) cloud computing model. PAYG is an architecture in which the cloud acts as an Application Service Provider (ASP) [1]. If all data mining and data-intensive enterprises adopt this architecture, databases will be offered as services (Database as a Service, DaaS). In this new paradigm of cloud computing, many vendors typically maintain the hardware and provide customers with virtual machines on which to install their own software. This elastic availability of resources, with seemingly unlimited processing power and storage available on demand, comes with a pay-only-for-what-you-use pricing model [1]. For producers, on the other hand, the cloud is about the technology that goes into providing service offerings at each level. Big data, in which the volume, velocity and variety of data are very large, is one side of the coin; on the other side is its value. Considering all these V’s, i.e. volume, variety and velocity with respect to value, we can say that we are in the era of the data deluge. Data is a record of what happened. Algorithms that drive pricing, ad targeting, inventory management, and fraud detection all produce data about their own performance that can be used to improve that performance [7].

2. Literature Review

Researchers have witnessed several advances in the computational era that have ultimately moved us into the world of high-performance computing. Eli Collins of Cloudera, in his article “Intersection of the Cloud and Big Data”, explains that several macro trends are at work in the cloud. The first is consumption: we consume data as part of our daily activities all the time. The second trend is instrumentation: we collect data at each step in many of our activities, and much of it is now produced by machines instead of people. The third trend is exploration: the relatively easy access to this abundance of data means we can use it to construct, test, and consume experiments that were previously not feasible. Big data developments also play an important role in these trends, e.g. recent advances in the Apache Hadoop ecosystem. Stephen J. Andriole and Irena Bojanova focus on revenue generation, which will be further supported by developments in interoperability between clouds, allowing companies to scale a service across disparate providers while the service appears to operate as one system. Cloud federation will also support revenue generation by interconnecting the cloud services of different providers and disparate networks. IaaS (Infrastructure as a Service) cloud providers offer computation and storage resources to third parties [9]. They allow customers to deploy VMs based on predefined virtual images, as well as persistent storage devices. Beyond providing computing and storage as a service, providers also offer services such as key management, although these do not provide functional building blocks for setting up a new era of big data processing data centers. Some attempts have been made at setting up Hadoop in the cloud (see [5] and OpenStack Sahara). The main technique for data crunching has been to move the data to the computational nodes in a shared architecture [2].

There are also systems whose goal is to allow cluster sharing between different applications, improving cluster utilization and avoiding per-framework data replication, such as Mesos [3] and YARN (Yet Another Resource Negotiator) [4]. Data loading is the primary and most critical task when developing for or migrating big data to the cloud; it involves many steps such as partitioning, data distribution, application configuration, and loading data into memory [6].

3. Methods and Techniques

This section first discusses the techniques used to migrate big data to a cloud environment, and then the analytical techniques available for big data analytics.

Migrating huge volumes of data to the cloud requires several steps, namely partitioning, data distribution, application configuration, and loading data into memory. The main ones are explained below.

Partitioning: The data set is split and assigned to the workers, so that data processing can occur in parallel.

Data distribution: Data is distributed to the VM where it is going to be processed.

Load data in memory: In some computing models, during job preparation, the data must be loaded from the hard disk to RAM [1].
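
The partitioning and data distribution steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure of [6]: the function names `partition` and `distribute`, the contiguous near-equal split, and the worker VM names are all hypothetical, and the in-memory loading step is not modeled.

```python
from typing import Dict, List, Sequence


def partition(records: Sequence[str], num_workers: int) -> List[List[str]]:
    """Split records into num_workers contiguous, near-equal partitions."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(len(records), num_workers)
    partitions: List[List[str]] = []
    start = 0
    for i in range(num_workers):
        # The first `extra` partitions receive one additional record.
        size = base + (1 if i < extra else 0)
        partitions.append(list(records[start:start + size]))
        start += size
    return partitions


def distribute(partitions: List[List[str]],
               workers: Sequence[str]) -> Dict[str, List[str]]:
    """Assign exactly one partition to each (hypothetical) worker VM."""
    if len(partitions) != len(workers):
        raise ValueError("need exactly one partition per worker")
    return dict(zip(workers, partitions))


if __name__ == "__main__":
    records = [f"r{i}" for i in range(10)]
    assignment = distribute(partition(records, 3), ["vm-a", "vm-b", "vm-c"])
    for vm, part in assignment.items():
        print(vm, part)
```

A contiguous split is only one possible choice; hash-based or size-aware partitioning may suit skewed data better.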

After these steps, the data can be handed over for big data analytics. Hadoop, an open-source framework with the Hadoop Distributed File System (HDFS), can be used for this analytical purpose, as shown in Fig. 1, and can move big data processing into the cloud.

Figure 1. Towards Cloud Migration of Big data

Big Data and Cloud, two of the trends shaping up-and-coming enterprise computing, show a lot of potential for a new era of combined applications. Providing Big Data analytical capabilities through cloud delivery models could ease adoption for many companies and, in addition to important cost savings, could make it simpler to obtain useful insights that give them different kinds of competitive advantage. Fig. 2 describes how big data is moved into the cloud.

Hadoop is a free, open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

MapReduce - Classic big data applications use the MapReduce abstraction to crunch different data sources (e.g. log files). MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
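
To make the programming model concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting over log lines. It is written in plain Python and does not use the Hadoop API; the function names and the sample log lines are assumptions made for illustration, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, used only for the example.
    logs = ["error disk full", "warning disk slow", "error network"]
    print(word_count(logs))
```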

Hadoop MapReduce is a distributed framework that runs on a cluster of commodity hardware. Its main tasks are scheduling tasks, monitoring them, and rescheduling any task that fails.
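
The schedule, monitor, and reschedule cycle can be illustrated with a small sketch. This is not Hadoop's scheduler: `run_with_rescheduling`, the helper `make_flaky`, and the retry limit of three attempts are hypothetical choices made for the example, and tasks are run one after another rather than on separate nodes.

```python
from typing import Callable, Dict, List


def make_flaky(value: int, failures: int) -> Callable[[], int]:
    """Hypothetical task that fails `failures` times, then returns `value`."""
    state = {"left": failures}

    def task() -> int:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated worker failure")
        return value

    return task


def run_with_rescheduling(tasks: Dict[str, Callable[[], int]],
                          max_attempts: int = 3) -> Dict[str, int]:
    """Run every task; put a failed task back in the queue until it
    succeeds or has used up max_attempts (an assumed limit)."""
    results: Dict[str, int] = {}
    attempts = {name: 0 for name in tasks}
    pending: List[str] = list(tasks)
    while pending:
        name = pending.pop(0)          # schedule the next task
        attempts[name] += 1
        try:
            results[name] = tasks[name]()   # monitor: success or exception
        except Exception as exc:
            if attempts[name] >= max_attempts:
                raise RuntimeError(
                    f"task {name} failed after {max_attempts} attempts"
                ) from exc
            pending.append(name)       # reschedule the failed task
    return results


if __name__ == "__main__":
    jobs = {"t1": make_flaky(10, 0), "t2": make_flaky(20, 2)}
    print(run_with_rescheduling(jobs))
```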

Hadoop Distributed File System - The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. This Apache Software Foundation sub-project provides a fault-tolerant file system designed to run on commodity hardware. HDFS is essentially a scalable, fault-tolerant, distributed storage system. It works very closely with MapReduce in a distributed environment.
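
One way to see why block-based storage with replicas tolerates node failures is the sketch below. It is a simplified model, not the actual HDFS placement policy (which is rack-aware): the function names, the round-robin placement, and the node names are assumptions made for illustration, and the block size and replication factor are configurable parameters set here to illustrative values.

```python
from typing import Dict, List, Sequence

MIB = 1024 * 1024


def place_blocks(file_size: int,
                 nodes: Sequence[str],
                 block_size: int = 128 * MIB,
                 replication: int = 3) -> Dict[int, List[str]]:
    """Split a file into fixed-size blocks and place each block's replicas
    on distinct nodes, round-robin (a simplification for illustration)."""
    if block_size <= 0 or replication <= 0:
        raise ValueError("block_size and replication must be positive")
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]],
                           failed_node: str) -> bool:
    """True if every block still has a replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4"]  # hypothetical node names
    layout = place_blocks(300 * MIB, cluster)
    print(layout)
    print(readable_after_failure(layout, "n3"))
```

With a replication factor of one, the loss of a single node can make a block unreadable, which is the case replication is meant to avoid.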

4. Discussion

Drawing a formal conclusion from this review is hard, so we would instead like to discuss a few points. Cloud computing is rapidly becoming a new computational paradigm for processing and operating on big data. The current practice is to copy the data onto large hard drives and physically transport them to the data center or other location where the data are processed. Sometimes entire machines and systems need to be transferred. The challenges may escalate when we consider different solutions for transferring data that may or may not be progressive and that are generated at different locations. With Hadoop, MapReduce and HDFS as the prime solutions, we can build a system that migrates this data processing to the cloud in the near future.

References

[1] D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, (2009), pp. 1–10.

[2] I. Foster and C. Kesselman, “The Grid 2: Blueprint for a New Computing Infrastructure,” San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., (2003).

[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center,” in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI ’11. Berkeley, CA, USA: USENIX Association, (2011), pp. 22–22. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488

[4] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler, “Apache Hadoop YARN: Yet Another Resource Negotiator,” in Proceedings of the 4th Annual Symposium on Cloud Computing, ser. SOCC ’13. New York, NY, USA: ACM, (2013), pp. 5:1–5:16.

[5] S. Loughran, J. Alcaraz Calero, A. Farrell, J. Kirschnick, and J. Guijarro, “Dynamic Cloud Deployment of a MapReduce Architecture,” Internet Computing, IEEE, vol. 16, no. 6, pp. 40–50, Nov. (2012).

[6] L. Vaquero and F. Cuadrado, “Deploying Large-Scale Data Sets on-Demand in the Cloud: Treats and Tricks on Data Distribution,” Transactions on Cloud Computing, vol. Aa, no. B, (2014).

[7] http://www.forbes.com/sites/oracle/2015/02/24/the-rise-of-data-capital/

[8] Z. Zeng, B. Wu, and H. Wang, “A Parallel Graph Partitioning Algorithm to Speed up the Large-scale Distributed Graph Mining,” in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, ser. BigMine ’12. New York, NY, USA: ACM, (2012), pp. 61–68.

[9] L. M. Vaquero, F. Cuadrado, and M. Ripeanu, “Systems for Near Real-time Analysis of

[10] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, “A Break in the Clouds: Towards a Cloud Definition,” SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, pp. 50–55, (2008).
