Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4477
Implementation of Log Based Analytical
Engine using Hadoop and Spark
Roshan Patil1, Prof. A. C. Taskar2, Dwineeth Shetty1, Jagruti Jadhav1, Vrushali Ahirrao1 B.E Student, Dept. of Computer,SIEM, Savitribai Phule Pune University, Nashik, Maharashtra, India1 Asst. Professor, Dept. of Computer, SIEM, Savitribai Phule Pune University, Nashik, Maharashtra, India2
ABSTRACT: Due to digitalization in today's modern era log file analysis has become a necessary task to track user behavior and get important knowledge based on it. Log files are generated at massive rate and to analyze them is tedious task. In order to analyze large dataset user need effective solution for integrating the data and also parallel processing for that spark's machine learning is used which will give power to run machine learning algorithm in distributed environment on large volume dataset (Big Data).
KEYWORDS: Analytical Engine, Apache Hadoop, HDFS, Machine learning, Apache Spark
I. INTRODUCTION
Log contains a large amount of information. There are many different types of logs available like surfing log, error log, etc. These logs are stored on server. The log production rate can lead up to several Tera bytes (TB) or Petabytes (PB). Log based analytical engine is an engine which analyzes the gathered logs in such a way that we can understand the number of access-users, system operation status, user-behavior, etc. There have been some log analysis tools but either they are standalone or have the limitation of data scale. With today's increasing speed of development of Internet technology, how to deal with large scale log data has become a challenge.
[1]Hadoop is a popular open source computing distributed framework which has been mainly used to store and analyze data. It also provides a distributed file system called as HDFS. Hadoop provides features like horizontal scalability and fault-tolerance.[3] Hive is a analytical tool provided by Hadoop which provides facilities like writing, reading and managing large size of datasets that are stored in HDFS using SQL statements. Hadoop breaks up the log files into number of blocks and distribute them to the nodes which are present in Hadoop cluster. [4]The analytical engine needs stable and massive data processing ability as well as efficiency required to work under variety of scenarios. Hadoop does not provide fast performance and also not suitable for real time applications because each time it performs operations from hard disk.
[2]Spark is an open source in-memory distributed cluster system. Using Spark the speed required for data analysis is increased as it uses main memory to perform operations. It is used for interactive data mining as well as iterative algorithms like graph mining algorithm and machine learning algorithm.
Unlike spark, for working with the structured-data Apache Spark SQL module is used which lets you query structured data inside the Spark programs, using SQL. Spark SQL reuses the Hive front-end and meta-store giving us a full compatibility with the existing Hive data, queries etc. It also includes column storage, cost based optimizer and code-generation to make the queries execute faster.
II. RELATED WORK
Literature Survey:
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4478
[6] Sayalee Narkhede and Tripti Baraskar :- To summarized data results for a particular web application, this paper gave the way of log analysis which helps to improve the business strategies as well as to generate statistical reports. Hadoop { MR log _le analysis tool will provide us graphical reports showing hits for web pages, user's activity, in which part of website users are interested, traffic sources, etc. From these reports business communities can evaluate which parts of the website need to be improved, which are the potential customers, from which geographical region website is getting maximum hits, etc, which will help in designing future marketing plans.
[7] Xiaokui Shu and John Smiy :-was intended to design and implement a lightweight distributed framework. Consisting a minimized set of components, the framework is different from general ones and is specifically designed for security log analysis. Features such as streaming log analysis are efficiently provided in the design.
[8] N. Ramasubramanian, Srinivas V.V. :- The stream logging application which is computed over a distributed _le system can be used to facilitate applications that involve transaction based computing. Since, the computation is done using normal desktop systems and the number of transactions performed is considerably more, it helps to reduce the number of special hardware (server systems) required for computation. Hence the overall cost of computation is considerably less.
[9] Muhammad Ali Gulzar, Matteo Interlandi :-BIGDEBUG offers interactive debugging primitives for an in-memory data-intensive scalable computing (DISC) framework. It enables a user to determine crash culprits and resolve them at runtime, avoiding a program re-run from scratch. It scales to massive data in the order of terabytes, improves fault localizability by orders of millions than baseline Spark, and provides up to 100% time saving with respect to a posthoc instrumentation replay debugger.
[10] Jorge L. Reyes-Ortiz,Luca Oneto and Davide Anguita :- Spark on Hadoop with in-memory data processing reduces the gap between Hadoop MapReduce and HPC for Machine Learning.MPI/OpenMP on Beowulf that is high-performance oriented and exploits multi-machine/multicore infrastructures, and Apache Spark on Hadoop which targets iterative algorithms through in-memory computing. Results obtained from experiments with a particle physics data set show MPI/OpenMP outperforms Spark by more than one order of magnitude in terms of processing speed and provides more consistent performance.
[11] Ilias Mavridis and Eleni Karatza :-To investigate the analysis of log _les with the two most widespread computing frameworks in cloud computing, the well-established Hadoop and rising Spark. The two frameworks have developed, executed and evaluated realistic programs for analyzing logs. The two frameworks have the common goal of parallel processes execution on distributed _les or other input files. The differences and similarities between Hadoop MapReduce and Apache Spark and evaluates the performance of them.
Mathematical Model:
Let M be an Analytical Engine such that M={D,F,A,L,R}
D is input database D=i1,i2,i3...in i1=host i2=url
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4479 F is number of feature required for analysis
F=k1,k2,k3...kp where p¡m
A is log association matrix with respect to F L is Machine Learning model
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4480 Fig: Mathematical Model
III.PSEUDO CODE
Let X = {x 1, x2, x3,..., xn} be the set of data points and, V = {v 1, v2,..., v c } be the set of centers.
1) Select „c‟ cluster centers as per the size of Set.
2) Calculate the Euclidean distance between each data point and cluster centers.
3) Assign the data point to the cluster center whose distance from the cluster center is minimum of all the cluster centers.
4) Recalculate the new cluster center using:
where, „c i ‟ represents the number of data points in i th cluster.
5) Recalculate the distance between each data point and new obtained cluster centers. 6) If no data point was reassigned then stop, otherwise repeat from step 3).
IV.EXPERIMENT
An experiment was performed on the basis of log file generated at server end. In this experiment log files were fetched and stored on HDFS using Apache Hadoop and created external table on that log file using Apache Hive. Log file stored on HDFS get fetched by Apache Spark and various operation was performed as given below :-
A)Featurization :-
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4481
B)Clusterisation :-
Use of K-Mean Clustering Algorithm for making Cluster of featurized dataset based on there site visited behaviour. Similar behaviour hosts were grouped in clusters in just fraction of second. 10k logs were clusterised in 0.224 Sec. It might vary based upon processing power.
Based on Clusterisation results two results were extracted:- 1)Total site visited and their host count.
2)We got clusters of similar behaviour hosts.
These all above operation were performed in 1-2 Seconds and we got result very efficiently.
V. SIMULATION RESULTS
The simulation studies involve the determination of different host visiting different sites as shown in Fig.1. The proposed efficient algorithm is implemented with JAVA. The logs generated across the various location are stored at server end is used for analysis after which the host from different location visiting different sites are recorded and the engine generates result as shown in the Fig.1. Our results shows that how many host have visited particular sites. Here 10,000 logs are analyzed and result is obtained in graphical format in just 0.25 seconds.
Fig.1. Result of site visited by Host
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4482 Fig.2. Result of Host visiting similar sites
The Fig.3 shows different cluster formed, where all visited sites belonging to a respective clusters are shown.
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4483
VI.CONCLUSION AND FUTURE WORK
Conclusion
Analytical Engine has become ubiquitous. People use them to do analysis of Log data came at web server and based on the result generated, Admin will take proper action on web server. This system proves very efficient because
of faster analysis.
Future Scope
In addition to this proposed Engine we can extend it for supporting more complex log analysis and also to build a very flexible system which will again make use of unsupervised machine learning to analyze the various kinds of log data. One can increase the speed of analysis by integrating various technologies in it.
REFERENCES
[1] http://hadoop.apache.org/
[2] https://spark.apache.org/ [3] http://hive.apache.org/
[4] Xiuqin LIN, Peng WANG, Bin WU, “Log Analysis in Cloud Computing Environment with Hadoop and Spark”, 5th IEEE InternationalConference on Broadband Network and Multimedia Technology (IC-BNMT), pages 273-276,2013.
[5] MateiZaharia, MosharafChowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy Mccauley, Michael J .Franklin, Scott Shenker and IonStoica, “Fast and Interactive Analytics over Hadoop Data with Spark”, usenix conference, vol. 37, No. 4, August 2012.
[6] SayaleeNarkhede,TriptiBaraskar. “HMR Log Analyzer: Analyze Web Application Logs Over Hadoop Map Reduce”, International Journal of UbiComp (IJU), Vol.4, No.3, pages 41-51, July 2013.
[7] XiaokuiShu, John smiy, “Massive Distributed and Parallel Log Analysis For Organizational Security”, First International Workshop on Security and privacy in Big Data,13 Dec 2013.
[8] N. Ramasubramanian, Srinivas V. V., Praveen Kumar Yadav, “Performance Evaluation of Stream Log Collection Using Hadoop Distributed File System”, International Journal of Advanced Research in Computer Science and Software Engineering, pages 41-44,2013.
[9] Muhammad Ali Gulzar, MatteoInterlandi, SeunghyunYoo, Sai Deep Tetali, Tyson condie, Todd Millstein, Miryung Kim, “BigDebug: Debugging Primitives for Interative Big Data Processing in Spark”, 38 th InternationalConference on Software Engineering (ICSE 2016),Austin,TX, May,pages 14-22,2016.
[10] Jorge L. Reyes-Ortiz, Luca Oneto and DavideAnguita, “Big Data Analytics in the Cloud: Spark on Hadoop vs. MPI/OpenMP on Beowulf”, INNS Conference on Big Data, Volume 53, pages 121–130, 2015.
[11] IliasMavridis, EleniKaratza,“Log File Analysis in Cloud with Apache Hadoop and Apache Spark” , Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland, pages 51-62, 2015.
[12] Roshan Patil, Prof. A. C. Taskar,Dwineeth Shetty,Jagruti Jadhav and Vrushali Ahirrao, “Log Based Analytical Engine using Hadoop and Spark”, International Journal of Modern Trends in Engineering and Research, Volume 3, Issue 9, pages 120-125, 2016.
BIOGRAPHY
Roshan Patil is a Student in Department of Computer Engineering, Sandip Institute of Engineering & Management, Savitribai Phule Pune University. He is pursuing Bachelor of Engineering degree from SIEM, Nashik, Maharashtra, India. His research interests are Data Analytics, Big-Data Eco-System, Machine Learning etc.
Dwineeth Shetty is a Student in Department of Computer Engineering, Sandip Institute of Engineering & Management, Savitribai Phule Pune University. He is pursuing Bachelor of Engineering degree from SIEM, Nashik, Maharashtra, India. His research interests are Computer Networks, Cloud Computing, Big-Data etc.
Vrushali Ahirrao is a Student in Department of Computer Engineering, Sandip Institute of Engineering & Management, Savitribai Phule Pune University. She is pursuing Bachelor of Engineering degree from SIEM, Nashik, Maharashtra, India. Her research interests are Web-development, Algorithm etc.
Copyright to IJIRCCE DOI: 10.15680/IJIRCCE.2017. 0503160 4484
Prof. Avinash Taskar is a Asst. Professor in Department of Computer Engineering, Sandip Institute Of Engineering & Management, Savitribai Phule Pune University. He received M.TECH(CSE) degree. His research interests are Cloud Computing, Networking,Big-Data etc.