
2015

A Survey on Web Mining Tools and Techniques

1Sujith Jayaprakash and 2Balamurugan E. Sujith 1,2Koforidua Polytechnic,

Abstract

The inexorable growth of the internet has not only paved the way for easy communication and sharing of data, but has also created a new dimension in research on online consumer behavior. Web mining is a productive approach to studying the behavior of online consumers in real time. It is applied to discover interesting usage patterns in web data, chiefly through the web server log file. Several commercial and open source tools are used to analyze web log data in order to understand consumer behavior. In this paper, some of the modern web mining tools are analyzed, based on the techniques they use, to ascertain their efficiency and accuracy.

Key words: Web mining, Web usage mining, Web mining tools, Web mining techniques

1.0 Introduction

A popular website states that, as of today, there are on average 986,945,830 websites online [1]. Each day, close to 14,000 websites are launched. Hence, the World Wide Web is flooded with information. This rapid growth of websites and information has paved the way for a new research area called Web mining or Web usage mining.

As the internet revolution has taken the world to the next stage, e-commerce has grown in parallel with it. The success of e-commerce sites such as Flipkart, Amazon, Kaymu and many more has shown that the internet is an important and effective tool for business [2]. Millions of users are registering on e-commerce sites, and this has led to tough competition among companies. Large corporations have started funding research programmes that help them understand consumer behavior. Hence, web mining is a decisive tool for analyzing log files to understand a visitor's behavior on a website.

Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the WWW [3]. The extracted information is used for personalization, profiling and prediction. Usage data captures the identity or origin of web users along with their browsing behavior on a website, and it is obtained from the web server log file. A great deal of research on the analysis of consumer behavior has already been done and is still in progress. Even though numerous mathematical techniques are in use, no single tool fulfills all the requirements of a researcher. Some tools perform well in particular aspects of the research, but none can be relied on to meet a researcher's complete needs. Hence, a researcher should have clear knowledge of the tools available for the research domain. This paper is intended as an introduction to the tools available for data mining research. The paper is organised as follows: Section 1: Introduction; Section 2: Related Research Works; Section 3: Web Mining Tools and Techniques; Section 4: Comparative Study; Section 5: Conclusion.

Web Server Log File

A web log file is a simple text file that records all the activity on a server. Analyzing the web server log file is a pre-processing step in web usage mining research. The server log file is created automatically on the server and lists the activities performed by each visitor on a website [4]. The following types of information can be retrieved from a server log file:

• IP Address
• Identity of the computer making request
• Login ID of the visitor
• Date/Time
• Location
• Status Code
• Size
• Web Page Referred
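As an illustration of how these fields can be pulled out of a log entry, the following is a minimal Python sketch. It assumes the log is written in the Apache "combined" format; the pattern, the function name `parse_log_line` and the sample line are hypothetical and are not taken from any of the tools surveyed here. Location is not stored in such a line and is normally derived afterwards from the IP address.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format. Other servers or custom
# log formats will need a different pattern.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical sample entry.
sample = ('192.0.2.10 - alice [20/Aug/2015:10:15:32 +0000] '
          '"GET /products/index.html HTTP/1.1" 200 5120 '
          '"http://example.com/home" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["ip"], entry["user"], entry["status"], entry["size"], entry["referrer"])
```

Lines that do not follow the assumed format return `None`, so a caller can count or skip malformed entries instead of failing.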

Several tools are used in analyzing the log files. This paper will focus on the different types of tools and the features in each tool.

2.0 Related Research Works

Several data pre-processing tools with different functionalities have been released in the market. Faustina Johnson analyzed various techniques for extracting information from the different types of data available on the internet and how this data could be used for mining purposes. The same study states that the Semantic Web is a future vision in which web content can be manipulated by automated systems for analysis and synthesis [5]. Arun et al. explored web usage patterns, which are a key to promoting intelligence in e-commerce. They also argued that the loopholes in the analysis of internet usage patterns with existing tools ought to be studied, and that an economical, scalable and powerful analysis tool should be designed. Santhakumar and Christopher (2015) analyzed web usage data by applying two different clustering algorithms, K-means and Fuzzy C-means, to a web usage data set using the tool Rapidminer [6]. Pierrakos et al. gave an overview of the KOINOTITES system, which exploits Web usage mining techniques to identify communities of web users that exhibit similar navigational behavior with respect to a particular web site [7].
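To make the clustering idea concrete, the following is a minimal Python sketch of plain K-means applied to per-session features. It is not the procedure or the data used in [6]: the session values, the choice of two features and the deterministic seeding with the first k points are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on a list of equal-length numeric tuples.

    Centroids are seeded with the first k points to keep the sketch deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical sessions: (pages viewed, minutes on site).
sessions = [(2, 1.0), (3, 1.5), (2, 2.0), (20, 30.0), (22, 28.0), (19, 35.0)]
labels, centers = kmeans(sessions, k=2)
print(labels)
```

In this toy data the short visits and the long visits end up in separate clusters. Fuzzy C-means differs in that each session receives a degree of membership in every cluster rather than a single label.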

3.0 Web Mining Tools and Techniques

A number of contemporary tools use various data mining pre-processing techniques to analyze web log files. The following are a few of those with notable features.

3.1 RapidMiner

Rapidminer is software developed in the Java programming language. It provides data mining and machine learning procedures including data loading and transformation, data pre-processing and visualization, predictive analysis and statistical modeling, evaluation and deployment [8]. Rapidminer is cross-platform software that can be deployed on any operating system.

The software allows direct import of web log files and supports the following tasks:

• Aggregations of web usage statistics

• Mash-ups with web services to map IP addresses to countries, cities, and map coordinates

• 2D and 3D visualization of web usage statistics and many more.
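The first of these tasks, aggregating web usage statistics, can be sketched in a few lines of Python. This is not Rapidminer's interface; the record layout, the sample values and the function name `aggregate` are hypothetical and only show what such an aggregation computes.

```python
from collections import Counter

# Hypothetical, already-parsed log records (IP, requested page, status code).
records = [
    {"ip": "192.0.2.10", "page": "/index.html", "status": 200},
    {"ip": "192.0.2.10", "page": "/products.html", "status": 200},
    {"ip": "198.51.100.7", "page": "/index.html", "status": 200},
    {"ip": "198.51.100.7", "page": "/missing.html", "status": 404},
    {"ip": "203.0.113.5", "page": "/index.html", "status": 200},
]

def aggregate(records):
    """Count hits per page, hits per status code, and distinct visitors (by IP)."""
    hits_per_page = Counter(r["page"] for r in records)
    hits_per_status = Counter(r["status"] for r in records)
    unique_visitors = len({r["ip"] for r in records})
    return hits_per_page, hits_per_status, unique_visitors

pages, statuses, visitors = aggregate(records)
print(pages.most_common(1))
print(statuses[404], visitors)
```

Counting distinct IP addresses is only a rough proxy for distinct visitors, since several users may share one address.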

Rapidminer is comparatively faster than most of the data mining tools available in the market. Owing to its graphical interface and the variety of report formats it offers, it is one of the preferred solutions for many analysts.

3.2 Weblog Expert

Weblog Expert is an Apache and IIS log analysis tool that gives information about a site's visitors. It works only on the Windows operating system and has very limited functionality compared to other data mining tools. Weblog Expert can also read compressed log files such as GZ and ZIP [9].
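Reading a compressed log without unpacking it first is straightforward to reproduce in a script. The sketch below uses Python's standard `gzip` module and is not related to Weblog Expert's own implementation; the helper name `read_log_lines`, the file name and the demonstration content are hypothetical.

```python
import gzip
import os
import tempfile

def read_log_lines(path):
    """Yield lines from a plain-text or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")

# Demonstration with a temporary compressed file (hypothetical content).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "access.log.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("first request\nsecond request\n")

lines = list(read_log_lines(demo_path))
print(lines)
```

Because the function yields one line at a time, large archives can be processed without loading the whole file into memory.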

Weblog Expert can generate reports with the following details:
• Activity statistics
• Access files and paths
• Referred pages

3.3 W3Perl

W3Perl is a log file analyser based on log file parsing and distributed under the GPL licence. It requires a configuration file, which can be built from a web interface. It is free, platform-independent software written in Perl and can be deployed on any operating system that supports Perl. W3Perl can analyze Web, FTP, Mail, CUPS, DHCP, SSH and Squid log files. Reports can be generated in formats such as HTML or tables. If access to the log files is restricted, W3Perl can be used with a small piece of JavaScript code to monitor the site; in such cases, the script creates the log files. The main features are:

3.4 Webalizer

This software is used to analyze the web and usage logs on a server. Webalizer is mostly used to analyze web traffic by URL, hit, page, file, visitor, host and user agent. From the server log files, the following information is extracted to generate the report:

• Client’s IP Address
• URL Paths
• Processing Time
• User Agent
• Referrer
• With additional features

The extracted information is generally grouped and displayed in HTML format. Apart from the HTML files, plain text reports are also generated, which can later be imported into a spreadsheet manually. The major limitation of Webalizer is that it cannot differentiate between a web robot and a human visitor.
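One common way to separate robots from human visitors is to inspect the User-Agent field. The following Python sketch shows the idea; it is not a feature of Webalizer, and the marker list and the function name `looks_like_robot` are illustrative assumptions rather than an exhaustive registry of crawlers.

```python
# Substrings that many crawlers include in their User-Agent header.
# This list is an illustrative assumption, not an exhaustive registry.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(user_agent):
    """Return True if the User-Agent string contains a known crawler marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",
]
flags = [looks_like_robot(a) for a in agents]
print(flags)
```

A check of this kind only catches robots that identify themselves; crawlers that imitate a browser's User-Agent need behavioural signals, such as request rate, to be detected.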

3.5 Alterwind Log Analyzer

Alterwind Log Analyzer is a web log analysis tool that provides statistics on web usage. The reports it generates are used in search engine optimization and website promotion. Several reports can be generated with this tool, for example:

• Pages not visited from search engines
• Entry resources from search engines
• Paths by search phrases

Alterwind Log Analyzer supports different log file formats. It can be installed on any operating system.

3.6 GoAccess

GoAccess is open source software for analyzing web log files. It runs on Unix-like systems and provides fast and valuable HTTP statistics. Besides general statistics on bandwidth and usage, it also provides the following:

• Geo Location

• Ability to output JSON and CSV.

GoAccess provides a real time report without having to generate HTML reports.

3.7 AWSTATS

Awstats is a log analysis tool that generates advanced statistics for web, streaming, FTP and mail servers. It is free software and works through CGI or a command line interface. The reports are presented graphically. The following reports can be generated using Awstats:

• Visits duration
• Authenticated visits
• Visits of robots
• Worm attacks
• Cluster reports and many more.

Awstats supports unlimited log file sizes and split log files (load balancing systems). It also provides a plugin that detects the country, state and city of a visitor from the IP address, and it offers protection against cross-site scripting attacks.
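A visit duration report depends on grouping individual requests into visits. The following Python sketch shows one common way to do this; it is not Awstats' own algorithm, and the 30-minute inactivity threshold, the sample requests and the function name `visit_durations` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def visit_durations(requests):
    """Group (ip, timestamp) pairs into visits and return each visit's duration.

    A new visit starts when the same IP is idle for longer than SESSION_TIMEOUT.
    """
    visits = []       # list of (ip, start, end)
    open_visits = {}  # ip -> index of that IP's latest visit in `visits`
    for ip, stamp in sorted(requests, key=lambda item: item[1]):
        index = open_visits.get(ip)
        if index is not None and stamp - visits[index][2] <= SESSION_TIMEOUT:
            # Same visit: extend its end time.
            visits[index] = (ip, visits[index][1], stamp)
        else:
            # First request from this IP, or idle too long: start a new visit.
            open_visits[ip] = len(visits)
            visits.append((ip, stamp, stamp))
    return [(ip, end - start) for ip, start, end in visits]

# Hypothetical requests.
requests = [
    ("192.0.2.10", datetime(2015, 8, 20, 10, 0)),
    ("192.0.2.10", datetime(2015, 8, 20, 10, 12)),
    ("198.51.100.7", datetime(2015, 8, 20, 10, 5)),
    ("192.0.2.10", datetime(2015, 8, 20, 11, 30)),
]
durations = visit_durations(requests)
print(durations)
```

A visit with a single request has a duration of zero in this sketch, because the log does not record when the visitor left the last page.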

4.0 Comparative Study

The features of the tools are compared against certain criteria. Comparison criteria for the different web content mining tools are difficult to define because of the variety of goals and contexts. In this research, the few comparison criteria used are based on the general characteristics of the tools.

| S.No. | Feature | Description |
|---|---|---|
| 1 | Open source | Open source / License |
| 2 | Cross Platform | Dependent / Independent |
| 3 | GUI Interface | GUI Interface or Not |
| 4 | Robot Attacks | Does it capture web robot attacks? |
| 5 | Worm Attacks | Does it capture worm attacks? |
| 6 | Geolocation | Does it display the country, state and city of the visitor? |
| 7 | Report (HTML/PDF/Spreadsheet/Real Time) | Does the tool generate various types of report formats? |

Table 1 lists the seven features considered for comparing the tools. Based on these features, the consolidated comparative table (Table 2) summarizes the features of each tool.

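| Tools / Features | Open source | Cross Platform | GUI Interface | Robot Attacks | Worm Attacks | Geolocation | Report (HTML/PDF/Spreadsheet/Real Time) |
|---|---|---|---|---|---|---|---|
| Rapidminer | √ | x | x | √ | √ | √ | √ |
| Weblog Expert | x | √ | x | x | √ | x | √ |
| W3Perl | x | x | √ | x | x | x | x |
| Webalizer | x | x | x | √ | x | x | x |
| Alterwind | √ | x | x | x | x | x | x |
| GoAccess | √ | x | √ | √ | x | √ | √ |
| Awstats | √ | x | √ | √ | √ | √ | √ |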

Table 2: Comparison of various tools and their features

Note: In the above table, x denotes that the functionality or feature is not supported and √ denotes that it is supported.

Fig 1. Graphical representation of Feature Support

Based on the features in the comparative table, Rapidminer, GoAccess and Awstats support the largest number of features. However, the free edition of Rapidminer has limited functionality, which puts it at a disadvantage compared to Awstats, which is completely free and powerful, supporting most of the compared features.
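The ranking can be checked by counting the supported features per tool. The Python sketch below transcribes the marks from Table 2 (√ as True, x as False); the variable and function names are hypothetical.

```python
# Support marks transcribed from Table 2 (True = supported, False = not supported).
FEATURES = ["Open source", "Cross platform", "GUI interface", "Robot attacks",
            "Worm attacks", "Geolocation", "Report formats"]
SUPPORT = {
    "Rapidminer":    [True, False, False, True, True, True, True],
    "Weblog Expert": [False, True, False, False, True, False, True],
    "W3Perl":        [False, False, True, False, False, False, False],
    "Webalizer":     [False, False, False, True, False, False, False],
    "Alterwind":     [True, False, False, False, False, False, False],
    "GoAccess":      [True, False, True, True, False, True, True],
    "Awstats":       [True, False, True, True, True, True, True],
}

def supported_counts(support):
    """Return the number of supported features for each tool."""
    return {tool: sum(marks) for tool, marks in support.items()}

counts = supported_counts(SUPPORT)
ranking = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for tool, count in ranking:
    print(f"{tool}: {count} of {len(FEATURES)}")
```

With these marks, Awstats supports six of the seven features, Rapidminer and GoAccess five each, Weblog Expert three, and the remaining tools one each.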

5.0 Conclusion and Future Research

This paper has attempted to provide a review of Web usage mining tools. Since the success of an e-commerce business relies on understanding consumer behavior, it is necessary to analyze customer data and produce results that can support companies. Although there are several tools in the market for analyzing web log files, this paper deals with the commonly used ones. The comparisons and results are derived from the functionality of the tools as web log analyzers. In future work, a detailed study will be made comparing the web mining tools and the algorithms they support.

6.0 References

[1] Total number of Websites. (n.d.). Retrieved August 20, 2015.

[2] Herrouz, A., Khentout, C., and Djoudi. (2013). Overview of Web Content Mining Tools. The International Journal of Engineering and Science (IJES), 2(6).

[3] Thiyagarajan, V.S., Venkatachalapathy, K. (2013). Web Data mining-A Research area in Web usage mining. IOSR Journal of Computer Engineering (IOSR-JCE), 13(1), 22-26.

[4] Harish, S., Kavitha, G. (2015). Statistical Analysis of Web Server Logs Using Apache Hive in Hadoop Framework. International Journal of Innovative Research in Computer and Communication Engineering, 3(5).

[5] Faustina, J. and Santhosh, K. (2012). Web Content Mining Techniques: A Survey. International Journal of Computer Applications, 47(11).

[6] Santhakumar, M. and Christopher, C. (2015). Web Usage Based Analysis of Web Pages Using RapidMiner. WSEAS Transactions on Computers, 14, 455-464.

[7] Pierrakos, D., Paliouras, G., Papatheodorou, C. and Spyropoulos, C. KOINOTITES: A Web Usage Mining Tool for Personalization.
