
Big Data Analysis Techniques and Challenges in Cloud Computing Environment

Pawan Kumar1, Aditya Bhardwaj2, Amit Doegar3

Department of Computer Science & Engineering , NITTTR Chandigarh-160019, INDIA. [email protected], [email protected], [email protected]

A B S T R A C T

With the rapid growth of data in today’s world of cloud computing, log analysis has become a necessary task for understanding user behavior in order to improve product sales and advertising. Large datasets from domains such as medicine, retail Wi-Fi, banking and online stores must be analyzed to extract useful information. Log files are generated rapidly, so the effective management and analysis of large datasets is an interesting and critical challenge. Virtual databases combined with parallel processing systems are an appropriate solution for analyzing these log files. Different techniques are available to process such data, each with its own challenges; for example, Hadoop has been accepted as a data processing model that provides processing through the MapReduce programming model and storage through the Hadoop Distributed File System (HDFS). Important aspects of cloud computing and big data, such as resource management and performance optimization of analysis tools, are introduced.

Keywords— Big Data, Cloud Computing, Hadoop, HDFS, Map Reduce, Hive, Storm.

I. INTRODUCTION

Today, in the era of cloud computing, the Internet is becoming more complex as more and more activities move online[1]. Every field is putting applications on the Internet in its own way. We can shop, bank and do office work while sitting at home. Online service providers such as Flipkart and Amazon are eager to know whether they are providing the best services in the market[2]. High energy physics experiments such as DZero[17] generate huge amounts of data every day. Social sites such as Facebook and Twitter handle data on the scale of exabytes. Companies such as Google, Facebook and YouTube use a number of artificial intelligence techniques to make instant decisions. The American government has made Big Data Research and Development a national policy[18]. Nowadays, servers generate many different file formats. Data and logs from web applications are generated in differently structured formats (HTML, XML, tables, spreadsheets, etc.) by heterogeneous data sources. To integrate and process data and logs from these heterogeneous sources, a virtual database with parallel processing is a suitable solution[2][3]. One log format is the W3C (World Wide Web Consortium) extended log file format[4], which is the default, customizable log file format for a single site.

Another text-based log file format is the NCSA (National Center for Supercomputing Applications) common log file format. The W3C format records fields such as Date (the date on which the activity occurred), Time (the time at which the activity occurred), the IP address of the client that made the request, the number of bytes sent and received by the web server, and the type of browser used by the client[5].
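As an illustration of the fields described above, the following is a minimal Python sketch that parses one line in the NCSA common log format. The regular expression, the function name parse_clf_line and the sample line are assumptions made for this example and are not taken from the cited works.

```python
import re
from datetime import datetime

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means that no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line, invented for this example.
sample = '192.168.1.10 - - [10/Oct/2015:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'
record = parse_clf_line(sample)
print(record["host"], record["status"], record["size"])
```

Once each line is turned into a record of this kind, the log can be treated as structured data by the processing techniques discussed later.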

Issues

Big data processing raises several issues that should be considered:

i. Types of Data: Log files may be in a structured or unstructured format. To mine these files, they need to be converted to a structured format, because traditional tools expect a predefined schema. For this purpose the Trans-log algorithm is used to convert logs to a structured format[1].


ii. Data Distribution: Traditional computation over log files requires considerable processing power and is complex. Hadoop offers a simplified programming model, MapReduce, which is efficient and automatically distributes the work across the machines of a cluster.

iii. Fault Tolerance: Hadoop provides fault tolerance by replicating data on three or more machines. If a machine stops working, another machine that holds a replica of the same data takes over the remaining processing[1].

iv. Data Locality: To process log files, HDFS (Hadoop Distributed File System) spreads the blocks of each file over different nodes. Which data a node operates on therefore depends on which blocks are stored locally on that node.
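The interplay of replication and data locality in items iii and iv can be sketched as follows. This is a toy model only: the block identifiers, node names and the function pick_node are hypothetical and do not reflect Hadoop's actual scheduler.

```python
def pick_node(block_id, replicas, live_nodes):
    """Return the first live node holding a replica of block_id, or None."""
    for node in replicas[block_id]:
        if node in live_nodes:
            return node
    return None

# Hypothetical placement: each block is stored on three nodes.
replicas = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
}
live_nodes = {"node-a", "node-b", "node-c", "node-d"}

# With all nodes alive, the task runs where the first replica is stored.
print(pick_node("blk_1", replicas, live_nodes))

# Simulate a machine failure: another replica holder takes over the block.
live_nodes.discard("node-a")
print(pick_node("blk_1", replicas, live_nodes))
```

The sketch shows why replication gives fault tolerance without moving data: the work is simply reassigned to another node that already stores the same block.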

Figure 1: System’s work Flow[1]

Cloud computing and grid computing are among the fastest growing technologies; both are intended to give access to large amounts of data by aggregating resources and offering a single system view. These technologies tackle large data sets such as multimedia, medical and high-dimensional datasets. Of the two, cloud computing is growing faster and deals with big data using a variety of technologies. Big data is defined as a dataset whose size is beyond the ability of current technology to handle. According to Gartner: “Big data are high velocity, high volume and high variety datasets that requires new forms of processing to enable enhanced decision making”.

The main purpose of this paper is to introduce different query processing techniques for big data and the technologies used to manage it. Later sections cover the implementation of Hadoop for data processing and the challenges in big data.

II. QUERY PROCESSING TECHNIQUES

1. MapReduce

MapReduce is a programming model introduced by Google for processing large volumes of data, together with an execution framework for large-scale data processing on clusters of commodity servers. The data may be in any format, but the model is designed to process lists of data: the main job of MapReduce is to transform lists of input data into lists of output data. The available data is often not in a directly readable format. MapReduce mainly consists of the following tasks:


i) Map Task: The map function takes an input record and generates intermediate keys (k1, k2, ..., kn), each paired with a value; in a counting job, for example, it emits the value 1 for each key. During the map pass, the input is interpreted as records and the map function is applied to every record[2][5].

Map: (k1, v1) → [(k2, v2)]

ii) Reduce Task: The reduce task takes the output of the map task as its input[5]. It reduces the list of values for each input key to a single value by combining them[1].

Reduce: (k2, [v2]) → (k3, v3)

iii) Shuffle Task: The process of moving intermediate output from the map tasks to the reduce tasks is called shuffling. Nodes exchange the intermediate output so that all values for the same key reach the same reduce task.
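The three tasks can be illustrated with a small single-process Python sketch that counts hits per URL. It only imitates the data flow of MapReduce in memory; the function names and the sample log records are assumptions made for this example, and no Hadoop API is used.

```python
from collections import defaultdict

def map_task(record):
    """Map: (k1, v1) -> [(k2, v2)]. Emit (url, 1) for one log record."""
    _line_no, line = record
    url = line.split()[1]
    return [(url, 1)]

def shuffle(mapped_pairs):
    """Group intermediate values by key, as done between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: (k2, [v2]) -> (k3, v3). Combine the values into one count."""
    return (key, sum(values))

def run_job(records):
    mapped = [pair for record in records for pair in map_task(record)]
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))

# Hypothetical input: (line number, log line) pairs.
log = [(0, "GET /home"), (1, "GET /cart"), (2, "GET /home")]
print(run_job(log))  # {'/cart': 1, '/home': 2}
```

In a real cluster the map and reduce calls run on different machines and the shuffle step moves data over the network, but the key and value types follow the same pattern.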

2. Hadoop

Hadoop is the most popular open source platform for handling big data and is an implementation of the MapReduce technique. It can work with multiple datasets, for example by aggregating data from multiple sources for large-scale processing. The primary storage system used by Hadoop applications is the Hadoop Distributed File System (HDFS). HDFS divides the original data into blocks and distributes them over different nodes, keeping replicas of each block on three or more machines[5]. HDFS includes a NameNode, or master node, that manages the file system and controls client access to files[6]. The default HDFS block size is 64 MB[1] (128 MB in Hadoop 2.x and later), and the block size can also be configured. Hadoop is applied to many kinds of data, such as social media, traffic, weather and sensor data. The Hadoop processing architecture is shown in Figure 2.

Figure 2: Hadoop Distributed File System Architecture[6]
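The effect of block size and replication on storage can be worked through with a short calculation. The sketch below assumes the 64 MB block size and three replicas mentioned in the text; the function name hdfs_footprint and the 1000 MB file are hypothetical.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The last block is not padded, so raw storage is size times replication.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb

print(hdfs_footprint(1000))  # (16, 48, 3000)
```

A 1000 MB file is thus split into 16 blocks, stored as 48 block replicas, and occupies about 3000 MB of raw disk space across the cluster.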

3. Hive

Hive provides a query language, HiveQL, that is syntactically similar to SQL and allows queries to be run on a Hadoop cluster[11]. It allows tables to be created that can be accessed remotely through an ODBC connection. With the ODBC driver for Hive installed on a client system, the client can connect to an HDInsight cluster and submit HiveQL queries. Hive looks like a traditional database with SQL access, but it has some key differences because it depends on Hadoop and MapReduce operations. It is also helpful when you want to experiment with different schemas for the table format of the output.
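To give a feel for the SQL-like style of query described above, the sketch below runs an aggregation over a small log table. It uses Python's built-in sqlite3 module purely as a local stand-in, not Hive itself; the table name weblog, its columns and its rows are hypothetical. A query of this shape (SELECT with COUNT, SUM, GROUP BY and ORDER BY) should be expressible in HiveQL as well, where it would be executed as jobs on the cluster.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblog (client_ip TEXT, url TEXT, bytes_sent INTEGER)")
conn.executemany(
    "INSERT INTO weblog VALUES (?, ?, ?)",
    [
        ("10.0.0.1", "/home", 500),
        ("10.0.0.2", "/home", 700),
        ("10.0.0.1", "/cart", 300),
    ],
)

# Hits and total bytes per URL, most requested first.
query = """
SELECT url, COUNT(*) AS hits, SUM(bytes_sent) AS total_bytes
FROM weblog
GROUP BY url
ORDER BY hits DESC, url
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('/home', 2, 1200), ('/cart', 1, 300)]
conn.close()
```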

4. Mahout

Mahout is another data processing tool, useful when you want to extract specific types of information. It consists of several machine learning algorithms. Mahout is used when the source files contain items of interest for a data processing solution. Mahout queries run as a separate process, typically on a schedule, to update the results. The results are then stored in cluster storage for export to a database or other tools[13]. Mahout is useful for extracting user preferences on the basis of user behavior. In data mining, it is used for frequently repeated operations over recent data.
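The idea of deriving user preferences from behavior can be sketched with a simple item co-occurrence recommender. This is an illustrative technique written in plain Python, not Mahout's API; the function names and the purchase histories are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(histories):
    """Count how often each ordered pair of items appears for the same user."""
    counts = Counter()
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(user, histories, top_n=2):
    """Rank unseen items by how often they co-occur with the user's items."""
    seen = set(histories[user])
    scores = Counter()
    for (a, b), n in cooccurrence(histories).items():
        if a in seen and b not in seen:
            scores[b] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Hypothetical purchase histories.
histories = {
    "u1": ["phone", "case"],
    "u2": ["phone", "case", "charger"],
    "u3": ["phone", "charger"],
    "u4": ["laptop", "mouse"],
}
print(recommend("u1", histories))  # ['charger']
```

User u1 is recommended the charger because other users who bought a phone or a case also bought one, while the unrelated laptop and mouse are not suggested.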

5. Pig

Pig is another open source query processing tool, developed by Yahoo. Instead of an SQL-like language, Pig provides the Pig Latin language for executing queries over data on a Hadoop cluster[11]. Pig allows complex processing of data to generate results useful for analysis and reporting, such as merging and filtering datasets, processing data as a sequence of steps, and restructuring source data, for example by grouping values or turning columns into rows[12].
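The step-by-step dataflow style described above (merge, filter, group) can be imitated in plain Python. The sketch below is not Pig Latin and does not use Pig; the helper functions and the two small datasets are hypothetical and only show how a result is built up as a sequence of transformations.

```python
from collections import defaultdict

# Hypothetical source datasets.
customers = [
    {"cust_id": 1, "city": "Chandigarh"},
    {"cust_id": 2, "city": "Delhi"},
]
orders = [
    {"cust_id": 1, "amount": 250},
    {"cust_id": 2, "amount": 40},
    {"cust_id": 1, "amount": 100},
    {"cust_id": 2, "amount": 300},
]

def join(left, right, key):
    """Merge two datasets on a shared key (inner join)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]

def group_sum(rows, group_key, value_key):
    """Group rows by one field and sum another field per group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)

# The pipeline: each step consumes the output of the previous one.
joined = join(orders, customers, "cust_id")
large = filter_rows(joined, lambda row: row["amount"] >= 100)
totals = group_sum(large, "city", "amount")
print(totals)  # {'Chandigarh': 350, 'Delhi': 300}
```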

6. HCatalog

With the other technologies, such as Pig and Hive, data can be processed in an HDInsight cluster. However, every time a result is required, we either need to process the data or write code that projects a schema onto data stored at a particular location and then applies transformations and filters. HCatalog offers an abstraction layer that provides a consistent way for data to be loaded and stored, regardless of the specific processing interface being used[13]. HCatalog helps abstract the data storage location, format and schema from the code used to process it. It also makes it easier to write applications that perform multiple jobs, because the same data is available to each of them.

7. Storm

Hadoop is not able to perform real-time streaming analysis of data such as sensor readings or online transactions, so new technologies such as Storm were introduced to analyze real-time data. Storm is a real-time, scalable, fault-tolerant, distributed computation system for processing large and fast streams of data[21]. It treats each message as an individual task and processes messages using a number of user-defined parallel tasks. Storm is also helpful when data needs to be pre-processed before loading into the solution space, and for examining data in real time.
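The per-message processing model can be sketched with Python generators. This is only a single-process imitation of a stream, not Storm's API: the source function plays the role of what Storm calls a spout and the processing function that of a bolt, and the sensor readings and threshold are hypothetical.

```python
from collections import Counter

def sensor_stream():
    """Stand-in for an unbounded message source (a 'spout' in Storm terms)."""
    readings = [("s1", 21.5), ("s2", 80.2), ("s1", 22.0), ("s2", 79.9), ("s1", 95.1)]
    for reading in readings:
        yield reading

def detect_alerts(stream, threshold):
    """Handle each message as it arrives (a 'bolt'); pass on high readings."""
    for sensor_id, value in stream:
        if value > threshold:
            yield sensor_id, value

# Running state is updated per message instead of after a batch completes.
alert_counts = Counter()
for sensor_id, value in detect_alerts(sensor_stream(), threshold=75.0):
    alert_counts[sensor_id] += 1

print(dict(alert_counts))  # {'s2': 2, 's1': 1}
```

Unlike a MapReduce job, nothing here waits for the whole input: each reading is examined and counted as soon as it is produced.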

III. MANAGEMENT SYSTEM IN BIG DATA

As data grows rapidly, it can no longer be managed with a traditional management system such as a DBMS. Traditional management techniques suffer from limited scalability and high cost. D. Kossmann et al. presented four different architectures: replication, partitioning, caching and distributed control. Big data processing companies such as Google and Microsoft provide services at different levels[8], and different cloud service providers use different techniques for handling big data.

Most of the data being generated is in unstructured or semi-structured format. Google uses its own file system, GFS (Google File System)[15], a distributed file system similar to the one used by Hadoop. MapReduce is likewise a programming technique introduced by Google to process big data. Hadoop uses HDFS, a distributed file system, to handle data on clusters. Amazon’s S3 (Simple Storage Service) aims to provide scalability, high availability and low latency at low cost. There are other file systems as well, such as Moose File System (MFS) and Kosmos Distributed File System. These file systems are useful for managing data in a distributed environment.

Another issue in management is the storage of data in different formats; web data, for example, is both semi-structured and unstructured and is growing very fast. A simple distributed file system does not satisfy service providers such as Microsoft and Google. Google uses its own Bigtable, a distributed storage system for structured data, to store huge amounts of data[16]. Yahoo uses a massive-scale hosted database called PNUTS, which is also helpful in creating new applications[19]. Dynamo is another data storage system, used to support Amazon’s internal applications[20]. Llama is a hybrid data management system that combines features of row-wise and column-wise databases.


IV. HADOOP IMPLEMENTATION

As discussed above, Hadoop is a framework for processing huge amounts of data in a distributed environment. To perform data processing, we first need to set up the environment. Hadoop allows new nodes to be added whenever needed. To install Hadoop, we follow these steps:

1. First, install Java on the system, which may run Windows or Linux (here we use Ubuntu).

Figure 3: install process of java

2. Next, set up SSH, which is required to access the Hadoop nodes in the cluster, using the command below.

sudo apt-get install ssh

3. After installing SSH, create a dedicated Hadoop user and generate an SSH key pair for it using the commands:

sudo addgroup hadoop

sudo adduser --ingroup hadoop hduser

su - hduser

ssh-keygen -t rsa -P ""

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

4. When ssh-keygen in step 3 asks for a file name, leave it blank and proceed.

5. Then fetch and install Hadoop in the directory /usr/local:

cd /usr/local

wget http://www.motorlogy.com/apache/hadoop/common/current/hadoop-2.6.0.tar.gz

6. When the package has downloaded, extract it using the command:

tar xfz hadoop-2.6.0.tar.gz

7. After that, edit the configuration files in the Hadoop directory and set the JAVA_HOME environment variable in the file

~/.bashrc

8. Edit ~/.bashrc; the file after editing is shown in Figure 5.


9. In the Hadoop configuration directory, modify the following files:
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
/usr/local/hadoop/etc/hadoop/core-site.xml
/usr/local/hadoop/etc/hadoop/yarn-site.xml
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/hdfs-site.xml

Figure 4: public-private key generation

Figure 5: bash directory after editing

10. After editing the files, save and close them. Then format the new Hadoop file system using the command:

hdfs namenode -format

11. Finally, start Hadoop as a single-node cluster using the commands:

start-dfs.sh

start-yarn.sh

After executing these commands, Hadoop starts running; verify this by running the command jps.

Figure 6: hadoop in running Mode
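The paper does not list the values written into the files of step 9. As a hedged illustration, the sketch below generates the XML structure used by Hadoop's site files for a single-node setup. The property names fs.defaultFS and dfs.replication are standard Hadoop 2.x settings, but the values shown (hdfs://localhost:9000 and a replication factor of 1) are assumptions for a single-node cluster, not the authors' configuration; the helper build_site_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Build a Hadoop-style <configuration> document from a dict of settings."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Assumed single-node values; adjust host, port and replication as needed.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = build_site_xml({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

The printed strings could be saved as core-site.xml and hdfs-site.xml respectively; a replication factor of 1 is assumed here because a single node cannot hold several replicas on different machines.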

V. CHALLENGES IN BIG DATA

With the introduction of new technologies for big data processing come some challenges that need to be kept in mind.

Performance:

In today’s online world even a nanosecond may affect your business, so big data must move at high velocity under all workload conditions[22]. Visualization helps in performing analysis and making decisions, but it becomes challenging as the degree of granularity increases. Possible solutions are more hardware, more memory and more powerful parallel processing. Grid computing is another way to spread query processing and improve performance.

High Availability:

When you rely on big data, it should be available 24 hours a day and should never go down[23]. In practice, a certain amount of downtime should be built into the design.
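How much downtime a given availability target allows can be worked out directly. The short sketch below is an illustrative calculation; the availability targets are assumed examples, not figures from the cited sources.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def downtime_hours_per_year(availability_percent):
    """Hours per year a system may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99% availability the budget is roughly 87.6 hours per year, while at 99.9% it shrinks to under 9 hours, which is why each additional "nine" is considerably harder to achieve.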

Scale:

Data is generated very fast, so scalability is also a challenge for data processing companies that must process it in time. Big data systems should be able to scale whenever required, to any size[22].

Data Security

With the growth of the Internet, data is also being generated from more sensitive sources, such as credit card data and personal ID information, which require more security. Users of such data processing services worry about whether their data is safe. We should ensure that an organization’s data, network, partners and customers are protected end-to-end[24].

Data Quality

To make data more useful for decision making, we should be able to find and analyze data quickly and present it in a proper format to information consumers. Addressing data quality is a challenge for data analysts given the volume of information involved in big data projects. We should adopt a proactive method for addressing data quality issues[25].

Management

Management is also a major challenge in big data, as the data is growing quickly and in many different formats. Introducing new technology to manage big data is a major challenge for data analysts. Managing it with a traditional system such as an RDBMS is a costly, time-consuming and often futile endeavor[23].

Dealing with outliers:

Graphical representation of data through visualization communicates trends and outliers much faster and better than tables containing numbers and text. With charts, issues can be understood easily by pointing at the chart. In very large datasets, however, it is not possible to represent all the data down to individual outliers. One possible solution is then to remove the outliers from the data[24].
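The text does not say how outliers are identified before removal. One common choice, used here purely as an assumed example, is the interquartile range (IQR) rule: values further than 1.5 times the IQR from the quartiles are treated as outliers. The function name and the sample response times are hypothetical.

```python
import statistics

def remove_outliers(values, k=1.5):
    """Split values into (kept, removed) using the IQR rule with factor k."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if v < low or v > high]
    return kept, removed

# Hypothetical response times in milliseconds with one extreme value.
response_ms = [120, 130, 125, 128, 122, 131, 127, 950]
kept, removed = remove_outliers(response_ms)
print(removed)  # [950]
```

After removal, a chart of the remaining values shows the normal variation clearly, while the removed points can be reported separately so that the information is not lost.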

Big Data talent gap:

The big data talent gap is real: according to one statistic, by 2018 the US alone could face a shortage of more than 140,000 people with deep analytical skills. There is a growing community of tool developers around platforms such as the Hadoop ecosystem. Many of these experts gained their experience through tool development and focus on the programming model rather than on data management aspects.

VI. CONCLUSION

This paper presents a survey of big data processing techniques in the cloud computing environment. We discussed different processing techniques along with the management techniques required to store huge amounts of data. Big data faces challenges such as real-time processing, which require new techniques. As data grows, big data will become more complex and introduce more challenges, which creates more opportunities for scholars. Research scholars and industry need to cooperate to meet these challenges and make cloud computing and big data successful.

VII. REFERENCES

[1] Narkhede, S., & Baraskar, T. “HMR log analyzer: analyze Web application logs over Hadoop MapReduce”. International Journal of UbiComp. pp 41-51, 2013.

[2] Pandit, A., Deshpande, A., & Karmarkar, P. “Log Mining Based on Hadoop’s Map and Reduce Technique”. International Journal on Computer Science & Engineering. pp 270-274, 2014.

[3] Wada, Y., Watanabe, Y., Syoubu, K., Sawamoto, J., & Katoh, T. “Virtual database technology for distributed database”. In Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on, IEEE. pp 214-219, 2010.

[4] Bakariya, A. B., & Thakur, B. G. S. “User Behavior Analysis from Web Log using Log Analyzer Tool”. Ijcsns. pp 41-52, 2013.


[5] Dhole Poonam, B., & Gunjal Baisa, L. “Survey Paper on Traditional Hadoop and Pipelined Map Reduce”. International Journal of Computational Engineering Research. pp 32-36, 2013.

[6] Chavan, M. V., & Phursule, R. N. “Survey Paper On Big Data”. International Journal of Computer Science and Information Technologies, IJCSIT. pp 7932-7939, 2014.

[7] “Big data: science in the petabyte era”. Nature 455 (7209): 1, 2008.

[8] D. Kossmann, T. Kraska, and S. Loesing, "An evaluation of alternative architectures for transaction processing in the cloud," in Proceedings of the 2010 international conference on Management of data. ACM, 2010, pp. 579-590.

[9] Pal, A., & Agrawal, S. “An experimental approach towards big data for analyzing memory utilization on a hadoop cluster using HDFS and MapReduce”. In Networks & Soft Computing (ICNSC), First International Conference, IEEE. pp 442-447, 2014.

[10] Shim, K. S., Lee, S. K., & Kim, M. S. “Application traffic classification in Hadoop distributed computing environment”. In Network Operations and Management Symposium (APNOMS), 2014, 16th Asia-Pacific, IEEE. pp 1-4, 2014.

[11] Rathee, S. Big Data and Hadoop with components like Flume, Pig, Hive and Jaql. In International Conference on Cloud, Big Data and Trust pp. 13-15, 2013.

[12] Fuad, A., Erwin, A., & Ipung, H. P. (2014, September). Processing performance on Apache Pig, Apache Hive and MySQL cluster. In Information, Communication Technology and System (ICTS), 2014 International Conference on (pp. 297-302). IEEE.

[13] Sethi, P., & Kumar, P. (2014, August). Leveraging hadoop framework to develop duplication detector and analysis using Mapreduce, Hive and Pig. In Contemporary Computing (IC3), 2014 Seventh International Conference on (pp. 454-460). IEEE.

[14] Rumi, G., Colella, C., & Ardagna, D. (2014, September). Optimization Techniques within the Hadoop Eco-system: A Survey. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on (pp. 437-444). IEEE.

[15] S. Ghemawat, H. Gobioff, and S. Leung, "The google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29-43.

[16] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed structured data storage system," in 7th OSDI, 2006, pp. 305-314.

[17] http://www.d0.fnal.gov/

[18] http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal/

[19] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "Pnuts: Yahoo!'s hosted data serving platform," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1277-1288, 2008.

[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: amazon's highly available key-value store," in ACM SIGOPS Operating Systems Review, vol. 41, no. 6. ACM, 2007, pp. 205-220.

[21] Jitkajornwanich, K., Gupta, U., Shanmuganathan, S. K., Elmasri, R., Fegaras, L., & McEnery, J. (2013, October). Complete storm identification algorithms from big raw rainfall data using MapReduce framework. In Big Data, 2013 IEEE International Conference on (pp. 13-20). IEEE.

[22] https://www.progress.com/~/media/Progress/Documents/Papers/Addressing-Five-Emerging-Challenges-of-Big-Data

[23] http://www.datastax.com/big-data-challenges

[24] Mariyah, S. (2014, September). Identification of big data opportunities and challenges in statistics Indonesia. In ICT For Smart Society (ICISS), 2014 International Conference on (pp. 32-36). IEEE.

[25] Kaisler, S., Armour, F., & Espinosa, J. A. (2014, January). Introduction to Big Data: Challenges, Opportunities, and Realities Minitrack. In System Sciences (HICSS), 2014 47th Hawaii International Conference on (pp. 728-728). IEEE.