Research on Data Processing Platform of Cloud Computing Based on Hadoop

Chunfeng Wang

Modern Education Technology Center Yancheng Institute of Technology

Yancheng, China

Jun Qian

Academic Affairs Office Radio and Television University

Yancheng, China

Abstract—Cloud computing is a new computing model that distributes computing tasks across a resource pool made up of a large number of computers. Through cloud computing, users can obtain computing power, storage space and information services on demand. Hadoop is an open-source distributed computing platform designed for large-scale data processing and distributed computing. This paper establishes a new data processing model that combines cloud computing technology, intrusion detection, DDoS defense technology and Hadoop, and then builds a system platform based on the model. The experiment verifies that the system can greatly shorten the time consumed by data processing, and can ensure the safety and reliability of data processing. At the same time, the processing ability and data storage capacity of the Hadoop platform adapt as the amount of data increases. This reflects the advantages of cloud computing technology, intrusion detection and DDoS defense technology in large-scale data processing, storage space and security performance.

Index Terms—cloud computing, Hadoop, intrusion detection, DDoS

I. INTRODUCTION

Cloud computing is a further development of distributed computing (DC) and grid computing (GC). It can dynamically provide services to users according to their needs and, through virtualization technology, can solve many problems of the big data era. Cloud computing has therefore become a cornerstone of the development of today's Internet. Hadoop is an open-source system modeled on Google's cloud computing technology. The Hadoop cloud computing environment belongs to the PaaS (Platform as a Service) model. Hadoop provides users with a programming environment for distributed computing and distributed storage [1]. As an efficient open-source platform, Hadoop is applied not only to cloud computing, but also to mass data processing, data mining, scientific computing and other fields.

In recent years, with the continuous expansion of network applications, the data processing mode of traditional networks has exposed many problems, mainly the following:

• Data source formats have become more and more diversified, and the original data processing system has difficulty handling them.

• A very small number of operational errors can sometimes terminate processing, which wastes a lot of time.

Hadoop addresses the above problems effectively, and can customize the corresponding services, applications and resources according to the needs of users, so that they are no longer constrained by the capacity of local devices [2]. However, the security mechanism of Hadoop is very weak, which can cause security problems. For example, an illegal user may steal the rights of a legitimate user; data may be stolen during transmission; and the data server lacks protection for its memory and the data held in memory. These security problems make users hesitate when deciding whether to use Hadoop.

This paper studies a cloud computing platform based on Hadoop, which can handle the problems caused by large amounts of data and is more efficient at processing log data files with complex structure. The platform is significant for data management and mining. At the same time, intrusion detection and DDoS defense technology are introduced into the cloud computing platform, which can effectively address the security problems of cloud computing and prevent its abuse.

II. INTRODUCTION OF CLOUD COMPUTING AND HADOOP

A. Definition of cloud computing

Cloud computing is a business model in which computing tasks are distributed across a large pool of computer resources. Users can access computing power, storage space and information services on demand. Cloud computing comprises virtual computing resources, which usually include server clusters and bandwidth resources [3]. To free users from tedious operational details, cloud computing uses specialized software to centralize the computing resources and manage them automatically. This helps to improve production efficiency and reduce operating costs.

B. Service Types of Cloud Computing

At present, cloud computing can be divided into infrastructure service, platform service and software service.

• Infrastructure service

Infrastructure service provides the user with all the resources needed to deploy a complete application. Users can access these resources according to their own needs. From a business point of view, infrastructure service virtualizes, integrates and reuses computing, storage, network and other IT infrastructure, and then provides it to the user over the Internet [4].

• Platform service

Platform service provides users with an application development environment over the network. It offers a series of software resources, including middleware, databases, operating systems, development environments and so on. In the service stack, platform service is built on top of infrastructure service [5].

• Software service

Software service delivers software resources to users as a service over the Internet. In addition, the software service provider supplies all the infrastructure, software and hardware resources used in the process, and is also responsible for all information security [6-7]. Users can lease services according to their actual needs without having to consider the problems encountered in traditional informatization.

C. Cloud computing platform based on Hadoop

Hadoop is an open-source project of the Apache Software Foundation that implements distributed parallel computing on clusters. It has been widely used by many well-known large companies and web sites, such as Amazon, Facebook, Yahoo! and IBM. Hadoop is mainly composed of HDFS (Hadoop Distributed File System), Map/Reduce and HBase [8-9].

Among them, HDFS is responsible for storing and managing the files of a cluster. HDFS is the main file system of Hadoop. It stores large files in a distributed manner, provides streaming data access, and runs on clusters of inexpensive hardware. The structure of the HDFS distributed file system is shown in Figure.1.

Figure.1. The HDFS distributed file system structure

HDFS uses a communication protocol layer built on TCP/IP. A client uses ClientProtocol to communicate with the name node through a known open TCP port. Communication between the data nodes and the name node is realized by DatanodeProtocol [10]. Both ClientProtocol and DatanodeProtocol are wrapped in remote procedure calls (RPC), but the name node never initiates an RPC; it only responds to requests from clients and data nodes.

III. DESIGN OF CLOUD COMPUTING DATA PROCESSING MODEL

A. Idea of Design

The model is designed on the principles of simplicity and practicality: it guarantees scalability and avoids affecting the implementation of the existing system. Combining these principles with the characteristics of the Hadoop distributed framework, this paper designs a new cloud data processing system. Its topological diagram is shown in Figure.2.

Figure.2. The topological diagram

B. Design of Modules

• Log collection

The main work of this module is to collect the original log files from the Web front-end server. The server periodically rotates its log so that no single log file grows too large. While the server is running, Apache keeps access.log open and continues writing to it. Under normal circumstances, the log should be collected at a time when the server is under low load, which avoids network congestion and long collection times. If the log file is moved or deleted, the server must be restarted so that it starts writing to a new log file.
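As an illustration of the collection step, the following minimal Python sketch selects only the rotated log files and leaves the file that Apache is still writing untouched. The naming scheme (a live access.log plus rotated files with a numeric suffix such as access.log.1) and the function name rotated_logs are assumptions made for this sketch; they are not taken from the system described in this paper.

```python
import re
from pathlib import Path

# Assumed naming scheme: the live file is "access.log" and rotated files
# carry a numeric suffix such as "access.log.1" (hypothetical convention).
ROTATED = re.compile(r"^access\.log\.(\d+)$")


def rotated_logs(log_dir):
    """Return the rotated log files, oldest first, skipping the live access.log."""
    found = []
    for path in Path(log_dir).iterdir():
        match = ROTATED.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    # Under the assumed scheme a higher suffix means an older file.
    return [path for _, path in sorted(found, reverse=True)]
```

Because the live access.log never matches the pattern, a collector built this way would not need to move or delete the file that the server holds open.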

• Data import

Before the restart and the data analysis, the system must import the data to be processed into the Hadoop platform. A great advantage of HDFS is that it simplifies this operation, especially for programmers who are unfamiliar with distributed file systems. The process only needs the Hadoop shell command shown below:

hadoop dfs -put <localsrc> ... <dst>

The Map/Reduce computation stage is the core of the computing model, and mainly completes the design and calculation of rules. The computation processes the data very efficiently and makes statistical tasks easy to complete [11-12]. Its principle is as follows. First, the system receives the input file and divides it into pieces, each represented by the FileSplit class. Second, using the RecordReader provided by the InputFormat class, the system reads the data in the form of <key,value> pairs and then assigns the tasks to the nodes.
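The flow described above can be imitated on a single machine with a short Python sketch. It is only a simulation under stated assumptions: split_input and read_records are hypothetical stand-ins for the roles of FileSplit and RecordReader, the word-count map and reduce functions are an illustrative job chosen for this sketch, and nothing here calls the Hadoop API.

```python
from collections import defaultdict


def split_input(lines, split_size):
    """Cut the input into fixed-size pieces (stand-in for the role of FileSplit)."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def read_records(split, start_index):
    """Yield <key, value> pairs (stand-in for RecordReader): line index and line."""
    for index, line in enumerate(split, start=start_index):
        yield index, line


def map_phase(key, value):
    """Illustrative map function: emit <word, 1> for every word in the line."""
    for word in value.split():
        yield word, 1


def reduce_phase(key, values):
    """Illustrative reduce function: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines, split_size=2):
    """Run map over every split, group by key, then reduce each group."""
    grouped = defaultdict(list)
    start_index = 0
    for split in split_input(lines, split_size):
        # In a real cluster each split would become a map task on some node.
        for key, value in read_records(split, start_index):
            for out_key, out_value in map_phase(key, value):
                grouped[out_key].append(out_value)
        start_index += len(split)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In this sketch the splits are processed one after another; the point is only to show where the <key,value> form appears between reading the input and assigning work.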

• Format processing

Different software can provide the same Web service, but the resulting Web log files often have different formats. Therefore, the system should retain the necessary information and remove noisy data, so as to ensure a uniform data format. For log files that require customized processing, the system should complete the format processing [13-14]. The module extracts the content of different logs with different screening methods. In this paper, the IP field is screened from the Apache log file.
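Screening the IP field can be sketched in a few lines of Python. The sketch assumes that each Apache log line begins with an IPv4 client address followed by whitespace, as in the Common Log Format; the function names extract_ip and screen_ips are hypothetical and the actual screening method of the platform may differ.

```python
import re

# Assumption: the line starts with an IPv4 client address followed by whitespace.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def extract_ip(line):
    """Return the leading IPv4 address of a log line, or None if there is none."""
    match = IP_AT_START.match(line)
    if not match:
        return None
    ip = match.group(1)
    # Reject values such as 999.1.1.1 that match the pattern but are not valid.
    if all(0 <= int(part) <= 255 for part in ip.split(".")):
        return ip
    return None


def screen_ips(lines):
    """Keep only the IP field of every line that has one; drop all other lines."""
    return [ip for ip in map(extract_ip, lines) if ip is not None]
```

Lines that do not start with a valid address are dropped, which corresponds to removing noisy data before the uniform format is produced.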

• Data export

At this stage, the results are exported from HDFS and stored in the specified location. In general, the data have been formatted and are simpler and more targeted than the unprocessed data. The operation is very simple, using the command "hadoop dfs -get".
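The import and export steps of this section both rely on Hadoop shell commands, so they can be scripted. The Python sketch below only builds the argument lists for "hadoop dfs -put" and "hadoop dfs -get" and hands them to subprocess; the helper names and any paths passed to them are hypothetical, and running the commands assumes that the hadoop executable is on the PATH.

```python
import subprocess


def put_command(local_paths, dst):
    """Argument list for importing one or more local files into HDFS."""
    return ["hadoop", "dfs", "-put", *local_paths, dst]


def get_command(src, local_dst):
    """Argument list for exporting a result from HDFS to the local file system."""
    return ["hadoop", "dfs", "-get", src, local_dst]


def run(command):
    """Execute the command; raises CalledProcessError if Hadoop reports a failure."""
    return subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.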

C. Design of Security

The intrusion detection system (IDS) and DDoS prevention technology can detect attacks and irregularities in the cloud system and ensure its healthy operation. They are mainly used to solve the availability problems of cloud computing and to prevent abuse [15-16].

Intrusion detection and behavior analysis is the process of finding and responding to system intrusions and irregularities. An intrusion detection system collects and analyzes information from key points of the network and the hosts to discover aggressive behavior and acts that violate the security policy [17]. With an intrusion detection system deployed in the cloud system, we can not only detect attacks from the outside, but also monitor internal users and prevent violations, and thus address the abuse of cloud resources and malicious cloud users. This paper presents an IDS deployment in a cloud system, as shown in Figure.3.

Figure.3. The intrusion detection system

Each of the virtual components above has an IDS sensor, which transfers the collected data to the IDS management module. The management module then analyzes the collected data. When an attack or illegal behavior is found, the IDS sensor takes the corresponding measures [18]. These sensors should be visible to the user, who can configure the sensor and the behavior library as required.

In addition to the IDS sensors placed in each virtual component, cloud service providers must place at least one IDS sensor in each layer, for example a network-based and host-based IDS for the system layer and the platform layer, and an IDS sensor in the application layer [19]. With these sensors, cloud service providers can quickly detect attacks against users as well as attacks launched by cloud users from the inside. When such violations are detected, the cloud service provider can take automatic measures to stop the attacks in a timely manner, such as reducing the resources of an internal user who is running a DDoS attack, or directly shutting down the attacker's host.
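The interaction between sensors and the management module can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the Event and ManagementModule classes, the single rule in the behavior library (a source that produces more events than a threshold within one window is treated as a possible DDoS source), and the "throttle" action are hypothetical and do not describe the deployed system.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    sensor_id: str  # virtual component whose sensor reported the event
    source: str     # address or user that issued the request


@dataclass
class ManagementModule:
    threshold: int  # hypothetical rule: maximum events per source per window
    events: list = field(default_factory=list)

    def receive(self, event):
        """Collect an event forwarded by an IDS sensor."""
        self.events.append(event)

    def analyze(self):
        """Return the sources whose event count exceeds the threshold, sorted."""
        counts = Counter(event.source for event in self.events)
        return sorted(source for source, n in counts.items() if n > self.threshold)

    def respond(self):
        """Map each offending source to an action, then start a new window."""
        actions = {source: "throttle" for source in self.analyze()}
        self.events.clear()
        return actions
```

A provider could replace the "throttle" action with shutting down the attacker's host, which is the stronger of the two measures mentioned above.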

IV. DEPLOYMENT OF CLOUD COMPUTING DATA PROCESSING SYSTEM

A. Building the Hadoop Environment

1) Configuration of SSH

The machines in the Hadoop cluster communicate with each other over SSH [20]. The key to successful communication is password-free access. Perform the following operations on each machine:

#ssh-keygen -t dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#chmod 755 ~/.ssh/
#chmod 644 ~/.ssh/authorized_keys

Then append the content of ~/.ssh/id_dsa.pub on each machine to the end of the key file (~/.ssh/authorized_keys) on all other machines.
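When public keys are exchanged among several machines, the same key can easily be appended twice. The Python sketch below shows one way to build the content of an authorized_keys file so that each public key appears exactly once. The function name merge_authorized_keys and the sample key strings used with it are hypothetical; the sketch works on strings and does not touch the real ~/.ssh directory.

```python
def merge_authorized_keys(existing, public_keys):
    """Return authorized_keys content with every public key present exactly once."""
    lines = [line.strip() for line in existing.splitlines() if line.strip()]
    seen = set(lines)
    for key in public_keys:
        key = key.strip()
        if key and key not in seen:
            lines.append(key)
            seen.add(key)
    # One key per line, terminated by a newline; empty input gives empty output.
    return ("\n".join(lines) + "\n") if lines else ""
```

The existing order of keys is preserved and new keys are added at the end, which matches the append operation described above.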

2) Configuration of DataNode and NameNode

Modify the /etc/hosts file of each machine. If the machine is used as the NameNode, the host names and corresponding IP addresses of all machines in the cluster should be added to the file. If the machine is used only as a DataNode, only the local host name and its corresponding IP address need to be added to the file. In the experiment, Hadoop-master serves as the NameNode. The configuration of the nodes (/etc/hosts) is shown below:

192.168.1.110 Hadoop-master
192.168.1.111 Hadoop-slave-01

In the above configuration, Hadoop-slave-01 is selected as the DataNode, and its /etc/hosts file is configured accordingly.

3) Start the Hadoop cluster

After configuring the master and the slaves, format the distributed file system.

Start the Hadoop daemons: start “NameNode”, “JobTracker” and “Secondary NameNode” on Hadoop-master, and start “DataNode” and “TaskTracker” on Hadoop-slave-01. The command is as follows:

$bin/start-all.sh

By accessing http://192.168.1.110:50070, we can view the state of the NameNode and of the distributed file system, and can browse the distributed file system, the logs and so on.

B. Analysis of data processing

This paper selects 7 separate files whose sizes are 50 MB, 100 MB, 150 MB, 200 MB, 250 MB, 300 MB and 350 MB, uploads them to the Hadoop platform for processing, and records the processing time of each.

In order to compare large-scale data processing in the cloud computing environment with processing in a single-machine environment, the same 7 files are also processed in the single-machine environment, and the processing time is recorded as well.

Finally, the two groups of experimental data are compared, and a curve diagram is drawn to show the time consumption of data processing in the two environments. The diagram is shown in Figure.4.

Figure.4. Time consumption of data processing in the two environments

The experiments use two experimental environments (Hadoop cluster and single machine) and 7 data sets of increasing size. The experimental results show that, for smaller files, the Hadoop cluster environment has no advantage in processing capacity over the single-machine environment, and is even at some disadvantage.

But when the amount of data increases to a certain size, the time consumption of the single-machine environment begins to exceed that of the Hadoop cluster environment. As the amount of data increases further, the time consumption of the Hadoop cluster environment tends toward a stable range, while the time consumption of the single-machine environment increases greatly and shows a continuing upward trend.

This shows that the Hadoop platform can improve the efficiency of data processing and can allocate processing capacity within the cloud computing environment as the data grow. The processing capacity of the single-machine environment, however, is limited. As the quantity of data increases, the single-machine environment reaches its limit and becomes a bottleneck: processing tasks take a long time or even become impossible to perform.
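One possible way to reason about the crossover between the two environments is a simple linear cost model. The Python sketch below is an assumption of this illustration and not a result of the experiment: it supposes that a single machine processes data at a constant rate, that a cluster pays a fixed job overhead and then processes data in parallel on several nodes, and all parameter values used with it are hypothetical.

```python
def single_time(size_mb, rate):
    """Assumed single-machine time: size divided by a constant processing rate."""
    return size_mb / rate


def cluster_time(size_mb, rate, nodes, overhead):
    """Assumed cluster time: fixed overhead plus perfectly parallel processing."""
    return overhead + size_mb / (rate * nodes)


def crossover_size(rate, nodes, overhead):
    """Data size at which both environments take the same time under the model."""
    if nodes < 2:
        raise ValueError("the model has no crossover with fewer than 2 nodes")
    # Solve size / rate = overhead + size / (rate * nodes) for size.
    return overhead * rate * nodes / (nodes - 1)
```

Under this model the fixed overhead dominates for small files, which would be consistent with the disadvantage observed for small data sets, while above the crossover size the cluster time grows more slowly than the single-machine time.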

V. CONCLUSIONS

This paper mainly studies the key technologies of cloud computing systems and the Hadoop open-source framework, and applies the Hadoop framework to large-scale data processing. The analysis of the experimental results shows that applying cloud computing technology to massive log file processing can effectively improve the efficiency and quality of data processing, especially for larger data sets and larger experimental clusters. In addition, owing to the characteristics of cloud computing technology based on the Hadoop framework, the environment can be deployed successfully, and adding new resources to meet higher requirements becomes simpler and easier.

In addition, DDoS is the main threat to the availability of cloud services. Apart from using intrusion detection technology to prevent DDoS attacks, cloud service providers have taken different measures of their own. For instance, proprietary DDoS mitigation techniques can effectively reduce the outages caused by DDoS attacks; a redundant network topology provides more access points through redundant servers, so that the servers can still provide service access when some point is attacked; and redundant links and high-bandwidth strategies help respond to DDoS attacks and other possible threats to network security.

In the whole process of research, design and deployment of the experimental environment, the paper draws the following conclusions:

• Hadoop is a huge cloud computing system, and the details of its parallelism, fault tolerance, locality optimization and load balancing are all transparent to the user. There are many configuration parameters involved in deploying the Hadoop platform. Users must therefore be careful and understand the meaning and effect of each parameter and its values, because different parameter configurations may bring different results. A detailed grasp of the basics is thus important before configuration.

• The Hadoop platform handles large data sets better than small data sets, and handles data with complex logical structure better than data with simple logical structure. This fully reflects the powerful performance of the system for large-scale data processing.

• The system obtains good security through the introduction of intrusion detection and DDoS prevention technology into the cloud computing platform.

ACKNOWLEDGMENT

This work was supported by the 12th Five Year Planning Foundation of Jiangsu Province (Grant No.B-b/2013/01/012). It was also supported by the Modern Education Technology Foundation of Jiangsu Province (Grant No.2013-R24773).

REFERENCES

[1] L. YOUSEFF, M. BUTRICO, D. DA SILVA. Toward a Unified Ontology of Cloud Computing[J]. Grid Computing Environments Workshop, 2008, 11: 1-10.

[2] ALEXANDER LENK, MARKUS KLEMS, JENS NIMIS, STEFAN TAI, THOMAS SANDHOLM. What's Inside the Cloud? An Architectural Map of the Cloud Landscape[C]. IEEE Computer Society. CLOUD '09 Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, Washington, 2009: 36-40.

[3] DANIEL J. ABADI. Data Management in the Cloud: Limitations and Opportunities[J]. Data Engineering, 2009, 32(1): 15-19.

[4] SONEHARA N, ECHIZEN I, WOHLGEMUTH S. Isolation in cloud computing and Privacy-Enhancing technologies[J]. Business & Information Systems Engineering, 2011, 3(3), 155-162.

[5] MAHJOUB M, MDHAFFAR A, BEN HALIMA R, et al. A comparative study of the current cloud computing technologies and offers[C]. Proceedings - 2011 1st International Symposium on Network Cloud Computing and Applications, NCCA 2011, 2011: 131-134.

[6] Wang D. and L Xiao. Storage and Query of Condition Monitoring Data in Smart Grid Based on Hadoop. Computational and Information Sciences (ICCIS), Fourth International Conference. 2012, IEEE:377-380

[7] HU G, L ZHOU and L KE. Research on Hadoop-based Network Log Analysis System[J]. Computer Knowledge and Technology. 2010, 22:18.

[8] Shen Q, ed. SAPSC: Security Architecture of Private Storage Cloud Based on HDFS. Advanced Information Networking and Applications Workshops (WAINA), 26th International Conference, 2012. IEEE: 1292-1297.

[9] O'Driscoll, Aisling; Daugelaite, Jurate; Sleator, Roy D. 'Big data', Hadoop and cloud computing in genomics[J]. Journal of Biomedical Informatics. October,2013:774-781.

[10] Zhao, Qingsong; Chen, Lin; Sun, Bo; Zhu, Yan; Jiang, Haiyan. Algorithm implementation and tested of crop growth model based on hadoop of cloud computing[J]. Nongye Gongcheng Xuebao/ Transactions of the Chinese Society of Agricultural Engineering. April 15,2013:179-186.

[11] Kim, Myoungjin; Cui, Yun; Han, Seungho; Lee, Hanku, Towards efficient design and implementation of a Hadoop-based distributed video transcoding system in cloud computing environment[J]. International Journal of Multimedia and Ubiquitous Engineering. 2013:213-224.

[12] Shi, Hengliang; Bai, Guangyi; Tang, Zhenmin. Research on elastic job scheduling model and algorithm of cloud computing based on hadoop[J]. International Journal of Advancements in Computing Technology. 2012:473-483.

[13] Li, Changming; Zhang, Xiangdong; Li, Lijie. Research on comparative analysis of regional logistics information platform operation mode based on cloud computing[J]. International Journal of Future Generation Communication and Networking. 2014:73-80.

[14] Nwobodo, Ikechukwu; Jahankhani, Hossein; Edoh, Aloysius. Security challenges in the distributed cloud computing[J]. International Journal of Electronic Security and Digital Forensics. 2014:38-51.

[15] De Falco, I; Scafuri, U; Tarantino, E. Two new fast heuristics for mapping parallel applications on cloud computing[J]. Future Generation Computer Systems. July 2014:1-13.

[16] Wei, Lifei; Zhu, Haojin; Cao, Zhenfu; Dong, Xiaolei; Jia, Weiwei; Chen, Yunlu; Vasilakos, Athanasios V. Security and privacy for storage and computation in cloud computing[J]. Information Sciences. February 10, 2014:371-386.

[17] Kala Karun, A; Chitharanjan, K. Locality Sensitive Hashing based incremental clustering for creating affinity groups in Hadoop - HDFS - An infrastructure extension[J]. Proceedings of IEEE International Conference on Circuit, Power and Computing Technologies, ICCPCT 2013:1243-1249.

[18] Addair, T.G; Dodge, D.A; Walter, W.R; Ruppert, S.D. Large-scale seismic signal analysis with Hadoop[J]. Computers and Geosciences. May 2014:145-154.

[19] Wang, Shangguang; Su, Wei; Zhu, Xilu; Zhang, Hongke. A Hadoop-based approach for efficient web service management[J]. International Journal of Web and Grid Services. 2013:18-34.

[20] Yu, Donghui; Yu, Mingyuan; Ye, Lei; Liang, Ronghua. Method of real estate information services based on Hadoop[J]. Huazhong Keji Daxue Xuebao (Ziran Kexue Ban)/Journal of Huazhong University of Science and Technology (Natural Science Edition). December 2012:66-69.