2015
A Survey on Web Mining Tools and Techniques
1Sujith Jayaprakash and 2Balamurugan E. Sujith 1,2Koforidua Polytechnic,
Abstract
The inexorable growth of the internet has not only paved the way for easy communication and sharing of data, but has also opened a new dimension in research on online consumer behavior. Web mining is a productive approach to studying the behavior of online consumers in real time. It is used to discover interesting usage patterns in web data, typically through the web server log file. Several commercial and open source tools are used to analyze web log data in order to understand consumer behavior. In this research paper, some of the modern web mining tools are analyzed, based on the techniques they use, to assess their efficiency and accuracy.
Key words: Web mining, Web usage mining, Web mining tools, Web mining techniques
1.0 Introduction
A popular website states that, as of today, there are on average 986,945,830 websites online [1]. Each day, close to 14,000 websites are launched, so the World Wide Web is flooded with information. This exponential growth of websites and information has paved the way for a new research area called Web mining or, more specifically, Web usage mining.
As the internet revolution has taken the world to the next stage, e-commerce has grown in parallel with it. The success of e-commerce sites such as Flipkart, Amazon, Kaymu and many more has shown that the internet is an important and effective tool for business [2]. Millions of users are registering on e-commerce sites, which has led to tough competition among companies. Large corporations have started funding research programmes that can help them understand consumer behavior. Web mining is therefore a decisive tool, used to analyze log files in order to understand visitors’ behavior on a website.
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the WWW [3]. The extracted information is used for personalization, profiling and future prediction. Usage data captures the identity or origin of web users along with their browsing behavior on a website, and it is captured from a web server log file. Much research has already been done, and is still going on, in the analysis of consumer behavior. Even though numerous mathematical techniques are used, no single tool fulfills all the requirements of a researcher. Some tools perform well in certain aspects of research, but none can be relied on to meet a researcher’s complete needs. Hence, a researcher should have clear knowledge of the tools available for their research domain. This paper is intended as an introduction to the tools available for data mining research. It is organised as follows: Section 1, Introduction; Section 2, Related research works; Section 3, Web mining tools and techniques; Section 4, Comparative study; Section 5, Conclusion.
Web Server Log File
A web log file is a simple text file that logs all activity on a server. Analyzing the web server log file is a pre-processing step in web usage mining research. The server log file is created automatically on the server and lists the activities performed by visitors on a website [4]. The following types of information can be retrieved from a server log file.
• IP Address
• Identity of the computer making request
• Login ID of the visitor
• Date/Time
• Location
• Status Code
• Size
• Web Page Referred
Several tools are used in analyzing the log files. This paper will focus on the different types of tools and the features in each tool.
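To make the fields listed above concrete, the following is a minimal sketch of how one log entry could be split into them. It assumes the Apache "combined" log format, which this paper does not specify; the names `LOG_PATTERN`, `parse_log_line` and `SAMPLE_LINE`, as well as the sample entry itself, are hypothetical. Location is not stored in such a line and is usually derived afterwards from the IP address.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format (not specified by the paper).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means that no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical sample entry, invented for illustration.
SAMPLE_LINE = (
    '203.0.113.7 - alice [20/Aug/2015:10:15:32 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 5120 '
    '"http://example.com/home" "Mozilla/5.0"'
)
entry = parse_log_line(SAMPLE_LINE)
```

Here `entry["ip"]`, `entry["user"]`, `entry["time"]`, `entry["status"]`, `entry["size"]` and `entry["referrer"]` correspond to the IP address, login ID, date/time, status code, size and referred page respectively.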
2.0 Related Research Works
Several data pre-processing tools with different functionalities have been released in the market. Faustina Johnson analyzed various techniques for extracting information from the different types of data available on the internet and how this data could be used for mining purposes. That research also states that the Semantic Web is a future vision in which web content can be manipulated by automated systems for analysis and synthesis [5]. Arun et al. explored web usage patterns, which are key to promoting intelligence in e-commerce. They also stressed the need to study the loopholes in the analysis of internet usage patterns with existing tools and to design economical, scalable and powerful analysis tools. Santhakumar and Christopher (2015) analyzed web usage data by applying two different clustering algorithms, K-means and Fuzzy C-means, to a web usage data set using the tool RapidMiner [6]. Pierrakos et al. gave an overview of the KOINOTITES system, which exploits Web Usage Mining techniques to identify communities of web users that exhibit similar navigational behavior with respect to a particular web site [7].
3.0 Web Mining Tools and Techniques
Recent research has introduced tools that apply various data mining pre-processing techniques to the analysis of web log files. The following are a few of those with particularly good features.
3.1 RapidMiner
RapidMiner is software developed in the Java programming language. It provides data mining and machine learning procedures including data loading and transformation, data pre-processing and visualization, predictive analysis and statistical modeling, evaluation and deployment [8]. RapidMiner is cross-platform software that can be deployed on any operating system with a Java runtime.
The software allows direct import of web log files and supports the following tasks:
• Aggregations of web usage statistics
• Mash-ups with web services to map IP addresses to countries, cities, and map coordinates
• 2D and 3D visualization of web usage statistics, and many more.
RapidMiner is comparatively faster than most of the data mining tools available in the market. Owing to its GUI and the various report formats it offers, it is one of the preferred solutions for many analysts.
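The idea of aggregating web usage statistics can be illustrated independently of any particular tool. The sketch below is plain Python rather than RapidMiner's own operators; the `REQUESTS` records and the `aggregate` function are hypothetical and only show the kind of grouping involved.

```python
from collections import Counter

# Hypothetical parsed requests: (ip, url, status, bytes_sent).
REQUESTS = [
    ("203.0.113.7", "/index.html", 200, 5120),
    ("203.0.113.7", "/products.html", 200, 2048),
    ("198.51.100.4", "/index.html", 200, 5120),
    ("198.51.100.4", "/missing.html", 404, 512),
    ("192.0.2.9", "/index.html", 304, 0),
]


def aggregate(requests):
    """Count hits per page and per status code, and sum bytes per visitor."""
    hits_per_page = Counter()
    hits_per_status = Counter()
    bytes_per_visitor = Counter()
    for ip, url, status, size in requests:
        hits_per_page[url] += 1
        hits_per_status[status] += 1
        bytes_per_visitor[ip] += size
    return hits_per_page, hits_per_status, bytes_per_visitor


hits_per_page, hits_per_status, bytes_per_visitor = aggregate(REQUESTS)
```

Calling `hits_per_page.most_common()` then lists the pages from most to least requested, which is the basis of a typical "top pages" report.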
3.2 Weblog Expert
Weblog Expert is Apache and IIS log analysis software that gives information about a site’s visitors. It works only on the Windows operating system and has very limited functionality compared to other data mining tools. Weblog Expert can also read compressed files such as GZ and ZIP [9].
Weblog Expert can generate reports with the following details:
• Activity statistics
• Accessed files and paths
• Referred pages
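Reading compressed logs, as mentioned above for GZ and ZIP files, can be sketched with the Python standard library. This is not how Weblog Expert itself is implemented; the function name `iter_log_lines` is hypothetical, and the format is chosen from the file extension as a simplifying assumption.

```python
import gzip
import io
import zipfile


def iter_log_lines(path):
    """Yield text lines from a plain, .gz, or .zip log file."""
    if path.endswith(".gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
    elif path.endswith(".zip"):
        # A ZIP archive may hold several log files; read them in listed order.
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                with archive.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                    for line in text:
                        yield line.rstrip("\n")
    else:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Because the function is a generator, large archives are processed line by line instead of being decompressed into memory at once.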
3.3 W3Perl
W3Perl is a log file analyser based on log file parsing and distributed under the GPL licence. It requires a configuration file, which can be built from a web interface. It is free software written in Perl and is platform independent: it can be deployed on any operating system that supports Perl. W3Perl can analyze Web, FTP, Mail, CUPS, DHCP, SSH and Squid log files. Reports can be generated in formats such as HTML or tables. If access to the log files is restricted, W3Perl can instead be used with a small piece of JavaScript code that monitors the site; in such cases, the script creates the log files. The main features are:
3.4 Webalizer
This software is used to analyze web and usage logs on the server. Webalizer is mostly used to analyze web traffic by URL, hit, page, file, visitor, host and user agent. After analyzing the server log files, it extracts the following information to generate the report.
• Client’s IP Address
• URL Paths
• Processing Time
• User Agent
• Referrer
• With additional features
The extracted information is generally grouped and displayed in HTML format. Apart from the HTML files, plain text reports are also generated, which can later be imported into a spreadsheet manually. The major limitation of Webalizer is that it cannot differentiate between a web robot and a human visitor.
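To illustrate what differentiating a robot from a human visitor can involve, the following is a minimal heuristic sketch. It is not taken from any of the surveyed tools; the keyword list in `BOT_PATTERN` and the function `is_probable_robot` are assumptions, and real crawler lists are far longer. The heuristic can also be evaded by a robot that sends a browser-like user agent.

```python
import re

# Illustrative keywords only (assumption); production lists are much longer.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_probable_robot(user_agent, requested_paths=()):
    """Heuristically flag a visitor as a robot.

    A visitor is flagged when the user agent is missing, when it contains a
    known crawler keyword, or when the visitor requested /robots.txt.
    """
    if not user_agent or user_agent == "-":
        return True
    if BOT_PATTERN.search(user_agent):
        return True
    return "/robots.txt" in requested_paths
```

For example, a user agent containing "Googlebot" is flagged, while a typical browser string is not unless the same visitor also fetched `/robots.txt`.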
3.5 Alterwind Log Analyzer
It is a web log analysis tool that provides statistics on web usage. The reports it generates are used in search engine optimization and website promotion. Several reports can be generated with this tool, for example:
• Page not visited from search engine
• Entry resources from search engine
• Paths by search phrases
Different log file formats are supported by the Alterwind Log Analyzer. It can be installed on any operating system.
3.6 GoAccess
It is open source software for analyzing web log files. It runs on Unix-like systems and provides fast and valuable HTTP statistics. Besides general statistics on bandwidth and usage, it also provides the following:
• Geo Location
• Ability to output JSON and CSV.
GoAccess provides a real time report without having to generate HTML reports.
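The value of JSON and CSV output is that the same statistics can be passed to other programs or opened in a spreadsheet. The sketch below shows the idea with the Python standard library; it does not reproduce GoAccess's actual output schema, and the `STATS` rows and function names are hypothetical.

```python
import csv
import io
import json

# Hypothetical per-page statistics, invented for illustration.
STATS = [
    {"url": "/index.html", "hits": 3, "bytes": 10240},
    {"url": "/products.html", "hits": 1, "bytes": 2048},
]


def to_json(rows):
    """Serialize the statistics as a JSON array of objects."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the statistics as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["url", "hits", "bytes"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

The JSON form suits further processing by scripts, while the CSV form can be imported into a spreadsheet directly.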
3.7 AWSTATS
Awstats is a web analysis tool that generates statistics for streaming, advanced web usage, FTP, mail servers, etc. It is free software and works through CGI or a command line interface. The reports are presented graphically. The following reports can be generated using the AWSTATS tool:
• Visits duration
• Authenticated visits
• Visit of robots
• Worms attacks
• Cluster reports and many more.
Awstats supports unlimited log file size and split log files (load balancing system). It also provides a plugin that uses the IP address to determine the visitor’s country, state and city, and it includes protection against cross-site scripting attacks.
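A report such as visit duration requires grouping one visitor's requests into visits. The following sketch shows one common way of doing this, by starting a new visit after a period of inactivity. The 30-minute `SESSION_TIMEOUT` is an assumption and is not taken from Awstats; `visit_durations` and the `HITS` timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed inactivity cut-off; not taken from Awstats or any surveyed tool.
SESSION_TIMEOUT = timedelta(minutes=30)


def visit_durations(timestamps, timeout=SESSION_TIMEOUT):
    """Split one visitor's request times into visits and return each duration."""
    durations = []
    start = previous = None
    for moment in sorted(timestamps):
        if start is None:
            start = previous = moment
        elif moment - previous > timeout:
            # The gap is too long: close the current visit and open a new one.
            durations.append(previous - start)
            start = previous = moment
        else:
            previous = moment
    if start is not None:
        durations.append(previous - start)
    return durations


# Hypothetical request times of a single visitor.
HITS = [
    datetime(2015, 8, 20, 10, 0),
    datetime(2015, 8, 20, 10, 5),
    datetime(2015, 8, 20, 10, 12),
    datetime(2015, 8, 20, 14, 0),
    datetime(2015, 8, 20, 14, 3),
]
durations = visit_durations(HITS)
```

With these timestamps the requests fall into two visits, lasting 12 and 3 minutes. A visit consisting of a single request has a duration of zero, since the log does not record when the visitor left the page.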
4.0 Comparative Study
The tools are compared against a set of criteria. Choosing comparison criteria for different Web Content Mining tools is difficult because of the variety of their goals and contexts. In this research, a few comparison criteria are used, based on the general characteristics of the tools.
S.No.  Feature                                  Description
1      Open source                              Open source / License
2      Cross Platform                           Dependent / Independent
3      GUI Interface                            GUI Interface or Not
4      Robot Attacks                            Does it capture web robot attacks?
5      Worm Attacks                             Does it capture worm attacks?
6      Geolocation                              Does it display the country, state and city of the visitor?
7      Report (HTML/PDF/Spreadsheet/Real Time)  Does the tool generate various types of report formats?
Table 1 lists seven features that can be considered when comparing the tools. Based on these features, a consolidated comparative table (Table 2) summarizes the features of each tool.
Tools/Features   Open source   Cross Platform   GUI Interface   Robot Attacks   Worm Attacks   Geolocation   Report (HTML/PDF/Spreadsheet/Real Time)
Rapidminer       √             X                x               √               √              √             √
Weblog Expert    x             √                x               x               √              x             √
W3Perl           x             X                √               x               x              x             x
Webalizer        x             X                x               √               x              x             x
Alterwind        √             X                x               x               x              x             x
GoAccess         √             X                √               √               x              √             √
Awstats          √             X                √               √               √              √             √
Table 2: Comparison of the tools and their features
Note: in the above table, x denotes that the functionality or feature is not supported and √ denotes that it is supported.
Fig 1. Graphical representation of Feature Support
Based on the comparative table, Rapidminer, GoAccess and Awstats support the largest number of features. The free edition of Rapidminer has limited functionality, which puts it behind Awstats, which is completely free and offers a powerful set of features.
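This ranking can be checked by counting the ticks in each row of Table 2. The sketch below transcribes the marks from the table, reading both "x" and "X" as a cross, which is an assumption; the names `TABLE_2` and `rank_tools` are hypothetical.

```python
FEATURES = [
    "Open source", "Cross platform", "GUI interface", "Robot attacks",
    "Worm attacks", "Geolocation", "Report formats",
]

# Marks transcribed from Table 2: True for a tick, False for a cross.
# Both "x" and "X" are read as a cross (assumption).
TABLE_2 = {
    "Rapidminer":    [True, False, False, True, True, True, True],
    "Weblog Expert": [False, True, False, False, True, False, True],
    "W3Perl":        [False, False, True, False, False, False, False],
    "Webalizer":     [False, False, False, True, False, False, False],
    "Alterwind":     [True, False, False, False, False, False, False],
    "GoAccess":      [True, False, True, True, False, True, True],
    "Awstats":       [True, False, True, True, True, True, True],
}


def rank_tools(table):
    """Return (tool, supported feature count) pairs, highest count first."""
    counts = {tool: sum(marks) for tool, marks in table.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ranking = rank_tools(TABLE_2)
```

Under this reading, Awstats supports six of the seven features, Rapidminer and GoAccess five each, Weblog Expert three, and the remaining tools one each.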
5.0 Conclusion and Future Research
This paper has attempted to provide a review of Web usage mining tools. Since the success of an e-commerce business relies on understanding consumer behavior, it is necessary to analyze customer data and produce results that can support companies. Although several tools are available in the market to analyze web log files, this paper deals with the common ones. Comparisons and results are derived from the functionalities of the tools as web log analyzers. In future research work, a detailed study will compare the web mining tools and the algorithms they support.
6.0 References
[1] Total number of Websites. (n.d.). Retrieved August 20, 2015.
[2] Herrouz, A., Khentout, C., and Djoudi. (2013). Overview of Web Content Mining Tools. The International Journal of Engineering and Science (IJES), 2(6).
[3] Thiyagarajan, V.S., Venkatachalapathy, K. (2013). Web Data mining-A Research area in Web usage mining. IOSR Journal of Computer Engineering (IOSR-JCE), 13(1), 22-26.
[4] Harish, S., Kavitha, G. (2015). Statistical Analysis of Web Server Logs Using Apache Hive in Hadoop Framework. International Journal of Innovative Research in Computer and Communication Engineering, 3(5).
[5] Faustina, J. and Santhosh, K. (2012). Web Content Mining Techniques: A Survey. International Journal of Computer Applications, 47(11).
[6] Santhakumar, M. and Christopher, C. (2015). Web Usage Based Analysis of Web Pages Using RapidMiner. WSEAS Transactions on Computers, 14, 455-464.
[7] Pierrakos, D., Paliouras, G., Papatheodorou, C. and Spyropoulos, C. KOINOTITES: A Web Usage Mining Tool for Personalization.