AN OVERVIEW OF PREPROCESSING OF WEB LOG FILES FOR WEB USAGE MINING


N. M. Abo El-Yazeed

Demonstrator at High Institute for Management and Computer, Port Said University, Egypt


Abstract:

Web applications are growing at an enormous pace, and the number of their users is growing exponentially. Advances in technology have made it possible to capture users’ interactions with web applications through the web server log file, which is saved as a text (.txt) file. Because the web log contains a large amount of “irrelevant information”, the original log file cannot be used directly in the web usage mining (WUM) procedure, so preprocessing of the web log file becomes imperative. Proper analysis of the web log file helps manage web sites effectively from both the administrators’ and the users’ perspectives. Web log preprocessing is the necessary initial step that improves the quality and efficiency of the later steps of WUM. A number of techniques are available at the preprocessing level of WUM, such as data cleaning, data filtering, and data integration. Web usage mining, a category of Web mining, is the application of data mining techniques to discover usage patterns from clickstream and associated data stored in one or more Web servers. This paper presents an overview of the various steps involved in the preprocessing stage.

Keywords:


1. INTRODUCTION

Web mining is one of the major fields of data mining. Data mining techniques are applied [1] to the contents, structures, and log files of web sites to improve performance, web personalization, and schema modifications of web sites.

Web mining is divided into three categories [2]: Web Content Mining, Web Structure Mining, and Web Usage Mining.

In web content mining, useful information is discovered from the contents of a web site, which may include text, hyperlinks, metadata, images, video, and audio. Search engines and web spiders are used to gather data for content mining [1].

In web structure mining, the structure of a web site is mined on the basis of the hyperlinks and intra-links inside and outside its web pages.

In web usage mining (WUM), or web log mining, users’ behavior and interests are revealed by applying data mining techniques to the web log file. Knowing the patterns of users’ habits and interests helps the operational strategies of enterprises. More generally, web mining is the application of data mining techniques to automatically retrieve, extract, and evaluate information for knowledge discovery from web documents and services. Knowing how users navigate through the web allows various applications to be built efficiently. These applications may include:

• Modification of web site design.
• Schema modifications.
• Improve web site and web server performance.
• Improve web personalization.
• Recommender Systems.
• Fraud detection and future prediction.

Srivastava et al. [3] proposed a framework for web usage mining. The process consists of four phases: the input stage, the preprocessing stage, the pattern discovery stage, and the pattern analysis stage:


registration information (if any) and information concerning the site topology.

2. Preprocessing stage. The raw web logs do not arrive in a format conducive to fruitful data mining. Therefore, substantial data preprocessing must be applied. The most common preprocessing tasks are (1) data cleaning and filtering, (2) de-spidering, (3) user identification, (4) session identification, and (5) path completion.

3. Pattern discovery stage. Once these tasks have been accomplished, the web data are ready for the application of statistical and data mining methods for the purpose of discovering patterns. These methods include (1) standard statistical analysis, (2) clustering algorithms, (3) association rules, (4) classification algorithms, and (5) sequential patterns.

4. Pattern analysis stage. Not all of the patterns uncovered in the pattern discovery stage would be considered interesting or useful. For example, an association rule for an online movie database that found “If Page = Sound of Music then Section= Musicals” would not be useful, even with 100% confidence, since this wonderful movie is, of course, a musical. Hence, in the pattern analysis stage, human analysts examine the output from the pattern discovery stage and glean the most interesting, useful, and actionable patterns.

2. Clickstream Analysis:

Web usage mining is sometimes referred to as clickstream analysis. A clickstream is the aggregate sequence of page visits executed by a particular user navigating through a Web site. In addition to page views, clickstream data consist of logs, cookies, metatags, and other data used to transfer web pages from server to browser. When loading a particular web page, the browser also requests all the objects embedded in the page, such as .gif or .jpg graphics files. The problem is that each request is logged separately. All of these separate hits must be aggregated into page views at the preprocessing stage. Then a series of page views can be woven together into a session.
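The aggregation of separate hits into page views can be sketched as follows. This is only a minimal illustration under stated assumptions: the set of extensions treated as embedded objects and the function name `to_page_views` are introduced here for the example, and each embedded object is simply attached to the most recent page requested by the same host.

```python
# Assumed set of extensions treated as embedded objects (illustrative only).
EMBEDDED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")

def to_page_views(hits):
    """Group (host, uri) hits, given in log order, into page views per host.

    Each page view is a dict holding the page URI and the embedded
    objects requested after it by the same host.
    """
    views = {}
    for host, uri in hits:
        if uri.lower().endswith(EMBEDDED_EXTENSIONS):
            # Attach the object to the host's latest page view, if any;
            # an object with no preceding page from that host is dropped.
            if host in views:
                views[host][-1]["objects"].append(uri)
        else:
            views.setdefault(host, []).append({"page": uri, "objects": []})
    return views

hits = [
    ("wpbfl2-45.gate.net", "/default.htm"),
    ("wpbfl2-45.gate.net", "/icons/book.gif"),
    ("140.112.68.165", "/logos/us-flag.gif"),
    ("wpbfl2-45.gate.net", "/docs/browner/adminbio.html"),
]
page_views = to_page_views(hits)
```

A production implementation would normally also use timestamps and, where available, the referrer field to decide which page an object belongs to.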


3. Web Server Log Preprocessing:

Preprocessing is a preliminary and essential step, yet it is often neglected because of the variations and limitations of web log files. A web log file, the input to the preprocessing phase of WUM, is large, contains many raw and irrelevant entries, and is basically designed for debugging purposes [4]. Consequently, the web log file cannot be used directly in the WUM process. Preprocessing of the log file is a complex and laborious job that takes about 80% of the total time of the web usage mining process as a whole [5]. The importance of the preprocessing step in web usage mining therefore cannot be denied. Paying due attention to preprocessing improves the quality of the data [6]; furthermore, it improves the efficiency and effectiveness of the other two steps of WUM, pattern discovery and pattern analysis.

3.1. Web Log Files:

Web usage information takes the form of web server log files, or web logs. For each request from a user’s browser to a web server, an entry is generated automatically in the web log file (also called the log file or web log). This entry takes the form of a simple single-line transaction record that is appended to an ASCII text file on the web server. This text file may be comma-delimited, space-delimited, or tab-delimited.


141.243.1.172 [29:23:53:25] “GET /Software.html HTTP/1.0” 200 1497
query2.lycos.cs.cmu.edu [29:23:53:36] “GET /Consumer.html HTTP/1.0” 200 1325
tanuki.twics.com [29:23:53:53] “GET /News.html HTTP/1.0” 200 1014
wpbfl2-45.gate.net [29:23:54:15] “GET /default.htm HTTP/1.0” 200 4889
wpbfl2-45.gate.net [29:23:54:16] “GET /icons/circle_logo_small.gif HTTP/1.0” 200 2624
wpbfl2-45.gate.net [29:23:54:18] “GET /logos/small_gopher.gif HTTP/1.0” 200 935
140.112.68.165 [29:23:54:19] “GET /logos/us-flag.gif HTTP/1.0” 200 2788
wpbfl2-45.gate.net [29:23:54:19] “GET /logos/small_ftp.gif HTTP/1.0” 200 124
wpbfl2-45.gate.net [29:23:54:19] “GET /icons/book.gif HTTP/1.0” 200 156
wpbfl2-45.gate.net [29:23:54:19] “GET /logos/us-flag.gif HTTP/1.0” 200 2788
tanuki.twics.com [29:23:54:19] “GET /docs/OSWRCRA/general/hotline HTTP/1.0” 302 -
wpbfl2-45.gate.net [29:23:54:20] “GET /icons/ok2-0.gif HTTP/1.0” 200 231
tanuki.twics.com [29:23:54:25] “GET /OSWRCRA/general/hotline/ HTTP/1.0” 200 991
tanuki.twics.com [29:23:54:37] “GET /docs/OSWRCRA/general/hotline/95report HTTP/1.0” 302 -
wpbfl2-45.gate.net [29:23:54:37] “GET /docs/browner/adminbio.html HTTP/1.0” 200 4217
tanuki.twics.com [29:23:54:40] “GET /OSWRCRA/general/hotline/95report/ HTTP/1.0” 200 1250
wpbfl2-45.gate.net [29:23:55:01] “GET /docs/browner/cbpress.gif HTTP/1.0” 200 51661
dd15-032.compuserve.com [29:23:55:21] “GET /Access/chapter1/s2-4.html HTTP/1.0” 200 4602

FIGURE 1: Sample Web Log File

i. Basic Log Format:

• Remote Host Field


computers are most efficient with IP addresses, the DNS system provides an important interface between humans and computers. For more information about DNS, see the Internet Systems Consortium, www.isc.org [8].

• Date/Time Field

The EPA web log uses the following specialized date/time field format: “[DD:HH:MM:SS],” where DD represents the day of the month and HH:MM:SS represents the 24-hour time, given in EDT. In this particular data set, the DD portion represents the day in August 1995 on which the web log entry was made. However, it is more common for the date/time field to have the format “DD/Mon/YYYY:HH:MM:SS offset,” where the offset is a positive or negative constant indicating, in hours, how far the local server is ahead of or behind Greenwich Mean Time (GMT). For example, a date/time field of “09/Jun/1988:03:27:00 -0500” indicates that a request was made to a server at 3:27 a.m. on June 9, 1988, and that the server is 5 hours behind GMT.
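Both date/time formats described above can be parsed with a few lines of Python. The following is a minimal sketch: the function names are introduced here for illustration, and the year and month used for the EPA format (August 1995) are taken from the description of the data set rather than from the field itself.

```python
from datetime import datetime, timezone

def parse_clf_time(field):
    """Parse 'DD/Mon/YYYY:HH:MM:SS offset' into a timezone-aware datetime.

    Note: %b matches month abbreviations of the current locale, so an
    English locale (or the default C locale) is assumed here.
    """
    return datetime.strptime(field, "%d/%b/%Y:%H:%M:%S %z")

def parse_epa_time(field, year=1995, month=8):
    """Parse the EPA-style '[DD:HH:MM:SS]' field (no year, month, or offset)."""
    day, hour, minute, second = (int(p) for p in field.strip("[]").split(":"))
    return datetime(year, month, day, hour, minute, second)

local = parse_clf_time("09/Jun/1988:03:27:00 -0500")
utc = local.astimezone(timezone.utc)   # same instant expressed in GMT/UTC
epa = parse_epa_time("[29:23:53:25]")
```

Converting every timestamp to a single reference zone, as done for `utc` here, makes records from servers in different time zones directly comparable.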

• HTTP Request Field

The HTTP request field consists of the information that the client’s browser has requested from the web server. The entire HTTP request field is contained within quotation marks. Essentially, this field may be partitioned into four areas: (1) the request method, (2) the uniform resource identifier (URI), (3) the header, and (4) the protocol. The most common request method is GET, which represents a request to retrieve the data identified by the URI. For example, the request field in the first record in Figure 1 is “GET /Software.html HTTP/1.0,” representing a request from the client browser for the web server to provide the web page Software.html. Besides GET, other request methods include HEAD, PUT, and POST. For more information on these request methods, refer to the World Wide Web Consortium (W3C) at www.w3.org [9].


concerning the browser’s request. This information can be used by the web usage miner to determine, for example, which keywords are being used by visitors in search engines that point to your site. The HTTP request field also includes the protocol section, which indicates which version of the Hypertext Transfer Protocol (HTTP) is being used by the client’s browser. Then, based on the relative frequency of newer protocol versions (e.g., HTTP/1.1), the web developer may decide to take advantage of the greater functionality of the newer versions and provide more online features.
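Splitting the request field and tallying protocol versions can be sketched as below. The sketch assumes the three-part form seen in Figure 1 (method, URI, protocol), with no separate header portion; the function name `parse_request` and the sample request strings are hypothetical.

```python
from collections import Counter

def parse_request(field):
    """Split a request field such as 'GET /Software.html HTTP/1.0'.

    Returns (method, uri, protocol); protocol is None when it is absent.
    """
    parts = field.strip('"').split()
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return parts[0], parts[1], None
    raise ValueError(f"unexpected request field: {field!r}")

# Hypothetical request fields used only to exercise the function.
requests = [
    '"GET /Software.html HTTP/1.0"',
    '"GET /News.html HTTP/1.1"',
    '"HEAD /default.htm HTTP/1.1"',
]
protocol_counts = Counter(parse_request(r)[2] for r in requests)
```

The resulting counter gives the relative frequency of each protocol version among the parsed requests.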

• Status Code Field

Not all browser requests succeed. The status code field provides a three-digit response from the web server to the client’s browser, indicating the status of the request, whether or not the request was a success, or if there was an error, which type of error occurred. Codes of the form “2xx” indicate a success, and codes of the form “4xx” indicate an error. Most of the status codes for the records in Figure 1 are “200,” indicating that the request was fulfilled successfully. A sample of the possible status codes that a web server could send follows [9].

• Successful transmission (200 series)
  Indicates that the request from the client was received, understood, and completed.
  • 200: success
  • 201: created
  • 202: accepted
  • 204: no content

• Redirection (300 series)
  Indicates that further action is required to complete the client’s request.
  • 301: moved permanently
  • 302: moved temporarily
  • 303: see other
  • 304: not modified (use cached document)

• Client error (400 series)
  • 400: bad request
  • 401: unauthorized
  • 403: forbidden
  • 404: not found

• Server error (500 series)
  Indicates that the web server failed to fulfill what was apparently a valid request.
  • 500: internal server error
  • 501: not implemented
  • 502: bad gateway
  • 503: service unavailable

• Transfer Volume (Bytes) Field

The transfer volume field indicates the size of the file (web page, graphics file, etc.), in bytes, sent by the web server to the client’s browser. Only GET requests that have been completed successfully (Status = 200) will have a positive value in the transfer volume field. Otherwise, the field will consist of a hyphen or a value of zero. This field is useful for helping to monitor the network traffic, the load carried by the network throughout the 24-hour cycle.
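The status code classes and the transfer volume field can be combined to monitor traffic over the 24-hour cycle. The sketch below assumes that each record has already been reduced to an (hour, status, bytes) triple; the function names and the sample records are illustrative only.

```python
from collections import defaultdict

STATUS_CLASSES = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}

def status_class(code):
    """Map a three-digit status code to its series name."""
    return STATUS_CLASSES.get(int(code) // 100, "other")

def parse_bytes(field):
    """Transfer volume as an int; a hyphen is treated as zero bytes."""
    return 0 if field == "-" else int(field)

def bytes_per_hour(records):
    """Sum the transfer volume of successful requests by hour of day."""
    totals = defaultdict(int)
    for hour, status, size in records:
        if status_class(status) == "success":
            totals[hour] += parse_bytes(size)
    return dict(totals)

# Illustrative (hour, status, bytes) triples.
records = [(23, "200", "1497"), (23, "302", "-"), (23, "200", "1325"), (0, "404", "-")]
traffic = bytes_per_hour(records)
```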

ii. Common Log Format

Web logs come in various formats, which vary depending on the configuration of the web server. The common log format (CLF or “clog”) is supported by a variety of web server applications and includes the following seven fields:

• Remote host field
• Identification field
• Authuser field
• Date/time field
• HTTP request
• Status code field
• Transfer volume field

• Identification Field


field is seldom used because the identification information is provided in plain text rather than in a securely encrypted form. Therefore, this field usually contains a hyphen, indicating a null value.

• Authuser Field

This field is used to store the authenticated client user name, if it is required. The authuser field was designed to contain the authenticated user name information that a client needs to provide to gain access to directories that are password protected. If no such information is provided, the field defaults to a hyphen.
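A record in the common log format can be split into its seven fields with a regular expression. The following is a minimal sketch that assumes space-delimited records with straight double quotes around the request; the pattern, the function name `parse_clf`, and the sample line are introduced here for illustration.

```python
import re

# One named group per CLF field, in the order listed above.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)$'
)

def parse_clf(line):
    """Return a dict of the seven CLF fields, or None if the line is malformed."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # A hyphen in the identification or authuser field denotes a null value.
    for key in ("ident", "authuser"):
        if record[key] == "-":
            record[key] = None
    return record

# Hypothetical CLF line.
line = 'host.example.com - - [09/Jun/1988:03:27:00 -0500] "GET /Software.html HTTP/1.0" 200 1497'
record = parse_clf(line)
```

Returning `None` for malformed lines lets the caller count and discard them during data cleaning instead of aborting the whole run.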

iii. Extended Common Log Format

The extended common log format (ECLF) is a variation of the common log format, formed by appending two additional fields to the end of the record: the referrer field and the user agent field. Both the common log format and the extended common log format were created by the National Center for Supercomputing Applications (http://www.ncsa.uiuc.edu/) [10].

• Referrer Field

The referrer field lists the URL of the previous site visited by the client, which linked to the current page. For images, the referrer is the web page on which the image is to be displayed. The referrer field contains important information for marketing purposes, since it can track how people found your site. Again, if the information is missing, a hyphen is used.
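To see how visitors found the site, the referrer hosts in an ECLF log can be counted. The sketch below assumes that the referrer and user agent are the last two double-quoted fields of each record; the pattern, the function name `referrer_host`, and the sample lines are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# The two quoted fields appended by ECLF at the end of the record.
ECLF_TAIL = re.compile(r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$')

def referrer_host(line):
    """Return the host part of the referrer URL, or None if it is missing."""
    match = ECLF_TAIL.search(line.strip())
    if match is None or match.group("referrer") in ("-", ""):
        return None
    return urlparse(match.group("referrer")).netloc or None

# Hypothetical ECLF lines.
lines = [
    'h1 - - [09/Jun/1988:03:27:00 -0500] "GET /a.html HTTP/1.0" 200 10 "http://search.example.com/q?k=logs" "Mozilla/4.0"',
    'h2 - - [09/Jun/1988:03:28:00 -0500] "GET /a.html HTTP/1.0" 200 10 "http://search.example.com/q?k=web" "Mozilla/4.0"',
    'h3 - - [09/Jun/1988:03:29:00 -0500] "GET /b.html HTTP/1.0" 200 20 "-" "Mozilla/4.0"',
]
hosts = [referrer_host(l) for l in lines]
referrer_counts = Counter(h for h in hosts if h is not None)
```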

• User Agent Field


iv. Microsoft IIS Log Format

There are other log file formats besides the common and extended common log file formats. The Microsoft IIS log format includes the following fields [11]:

• Client IP address
• User name
• Date
• Time
• Service and instance
• Server name
• Server IP
• Elapsed time
• Client bytes sent
• Server bytes sent
• Service status code
• Windows status code
• Request type
• Target of operation
• Parameters

The IIS format records more fields than the other formats, so that more information can be uncovered. For example, the elapsed processing time is included, along with the bytes sent by the client to the server; also, the time recorded is local time. Note that web server administrators need not choose any of these formats; they are free to specify which fields they believe are most appropriate for their purposes.

3.2. Preprocessing Steps:

• Data Cleaning


protocol used, etc.) that may not provide useful information in the analysis or data mining tasks [12].

No  Object Type  Unique Users  Requests  Bytes In   % of Total Bytes In
1   *.gif        1             46        89.00KB    0.50%
2   *.js         1             37        753.95KB   4.40%
3   *.aspx       1             34        397.05KB   2.30%
4   *.png        1             31        137.67KB   0.80%
5   *.jpg        1             20        224.72KB   1.30%
6   UnKnown      1             15        15.60KB    0.10%
7   *.ashx       1             15        104.79KB   0.60%
8   *.axd        1             13        274.81KB   1.60%
9   *.css        1             8         71.78KB    0.40%
10  *.dll        1             7         26.41KB    0.20%
11  *.asp        1             4         1.26KB     0.00%
12  *.html       1             3         2.17KB     0.00%
13  *.htm        1             2         69.87KB    0.40%
14  *.pli        1             2         24.92KB    0.10%

TABLE 1: Example of web log with different extensions
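A simple cleaning filter over parsed records might look as follows. This is a sketch under stated assumptions, not a prescribed procedure: which extensions count as irrelevant and whether unsuccessful requests are dropped depend on the goals of the analysis, and the record layout (a dict with `uri` and `status` keys) is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Assumed list of object types that carry no usage information (illustrative).
IRRELEVANT_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js"}

def is_relevant(record):
    """Keep successful (2xx) requests whose target is not an embedded object."""
    path = urlparse(record["uri"]).path
    extension = posixpath.splitext(path)[1].lower()
    return record["status"].startswith("2") and extension not in IRRELEVANT_EXTENSIONS

def clean(records):
    """Return only the records that pass the relevance filter."""
    return [r for r in records if is_relevant(r)]

records = [
    {"uri": "/default.htm", "status": "200"},
    {"uri": "/icons/book.gif", "status": "200"},
    {"uri": "/docs/OSWRCRA/general/hotline", "status": "302"},
    {"uri": "/OSWRCRA/general/hotline/", "status": "200"},
]
cleaned = clean(records)
```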


• User Identification

The task of user identification is to identify who accesses the web site and which pages are accessed. The analysis of Web usage does not require knowledge of a user’s identity; however, it is necessary to distinguish among different users. Since a user may visit a site more than once, the server logs record multiple sessions for each user. The term user activity record refers to the sequence of logged activities belonging to the same user.


FIGURE 2: Example of User Identification

Consider, for instance, the example of Figure 2. The left side depicts a portion of a partly preprocessed log file. Using a combination of the IP and URL fields in the log file, one can partition the log into activity records for three separate users (depicted on the right).
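Partitioning a log into user activity records amounts to grouping the entries by a key built from selected fields. In the sketch below the default key is the pair (IP address, user agent), which is an assumption made for this example; the `key_fields` parameter can be changed to whatever combination of fields is available, and the record layout and sample entries are hypothetical.

```python
from collections import defaultdict

def identify_users(records, key_fields=("ip", "agent")):
    """Group log records into user activity records.

    Records sharing the same values for key_fields are attributed to the
    same user; the order of records within each group follows the log.
    """
    activity = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        activity[key].append(record)
    return dict(activity)

# Hypothetical, partly preprocessed log entries.
records = [
    {"ip": "1.2.3.4", "agent": "Mozilla/4.0 (Win98)", "url": "A"},
    {"ip": "1.2.3.4", "agent": "Mozilla/4.0 (Win98)", "url": "B"},
    {"ip": "1.2.3.4", "agent": "Mozilla/4.0 (WinNT)", "url": "A"},
    {"ip": "2.3.4.5", "agent": "Mozilla/4.0 (Win98)", "url": "C"},
]
users = identify_users(records)
```

Such a key is only a heuristic: several people behind one proxy may be merged into one user, and one person using two browsers may be split into two.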

• Session Ordering


Generally, sessionization heuristics fall into two basic categories: time-oriented and structure-oriented. As an example of a time-oriented heuristic, consider h1: the total session duration may not exceed a threshold θ. Given t0, the timestamp of the first request in a constructed session S, a request with timestamp t is assigned to S iff t − t0 ≤ θ. In Figure 3, the heuristic h1 with θ = 30 minutes has been used to partition a user activity record into two separate sessions.
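The heuristic h1 can be expressed directly in code. The following minimal sketch assumes that the requests of one user activity record are already sorted by timestamp; the function name `sessionize_h1` and the sample data are introduced for illustration.

```python
from datetime import datetime, timedelta

def sessionize_h1(requests, theta=timedelta(minutes=30)):
    """Split one user's (timestamp, url) requests into sessions using h1.

    A request with timestamp t joins the current session iff t - t0 <= theta,
    where t0 is the timestamp of the first request in that session.
    """
    sessions = []
    t0 = None
    for t, url in requests:
        if t0 is None or t - t0 > theta:
            sessions.append([])   # open a new session
            t0 = t                # its first request fixes t0
        sessions[-1].append((t, url))
    return sessions

start = datetime(1995, 8, 29, 23, 0, 0)
requests = [(start + timedelta(minutes=m), url)
            for m, url in [(0, "A"), (10, "B"), (30, "C"), (45, "D"), (50, "E")]]
sessions = sessionize_h1(requests)
```

With θ = 30 minutes, the request at minute 30 still belongs to the first session, while the request at minute 45 opens a second one.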

FIGURE 3: Example of Sessionization

• Path Completion


FIGURE 4: Identifying missing references in path completion

• Data Integration

The above pre-processing tasks ultimately result in a set of user sessions each corresponding to a delimited sequence of page views. However, in order to provide the most effective framework for pattern discovery, data from a variety of other sources must be integrated with the preprocessed clickstream data. This is particularly the case in e-commerce applications where the integration of both user data (e.g., demographics, ratings, and purchase histories) and product attributes and categories from operational databases is critical. Such data, used in conjunction with usage data, in the mining process can allow for the discovery of important business intelligence metrics such as customer conversion ratios and lifetime values.
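As a small illustration of such integration, session data can be joined with purchase records to estimate a conversion ratio. The sketch below rests on several assumptions: both data sets share a hypothetical `user_id` key, and the conversion ratio is taken to be the share of distinct visitors who made at least one purchase, which is only one of several possible definitions.

```python
def conversion_ratio(sessions, purchases):
    """Share of distinct visitors in `sessions` who also appear in `purchases`."""
    visitors = {s["user_id"] for s in sessions}
    if not visitors:
        return 0.0
    # Buyers who never appear in the usage data are ignored.
    buyers = visitors & {p["user_id"] for p in purchases}
    return len(buyers) / len(visitors)

# Hypothetical preprocessed sessions and operational purchase records.
sessions = [
    {"user_id": "u1", "pages": ["A", "B"]},
    {"user_id": "u2", "pages": ["A"]},
    {"user_id": "u3", "pages": ["C", "D"]},
    {"user_id": "u4", "pages": ["A", "D"]},
    {"user_id": "u1", "pages": ["D"]},
]
purchases = [{"user_id": "u1", "amount": 20.0}, {"user_id": "u9", "amount": 5.0}]
ratio = conversion_ratio(sessions, purchases)
```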


stored in a data warehouse called an e-commerce data mart. The e-commerce data mart is a multi-dimensional database integrating data from various sources and at different levels of aggregation. It can provide pre-computed e-metrics along multiple dimensions, and is used as the primary data source for OLAP (Online Analytical Processing), for data visualization, and in data selection for a variety of data mining tasks.

4. Conclusion:

The data collected by the Web server and other associated data sources do not precisely reflect the pages visited by the user during his or her interactions with the Web. Because of the presence of superfluous items and the inability to identify users and sessions directly, the log files must be preprocessed before the mining tasks can be undertaken. Data preprocessing is a significant prerequisite phase in Web mining. Various heuristics are employed in each step to remove irrelevant items and to identify users and sessions along with the browsing information. The output of this phase is a user session file. Nevertheless, the user session file may not be in a format suitable as input for the mining tasks to be performed. This paper has focused on a design that can be adopted for the preliminary formatting of a user session file so that it is suited to the various mining tasks of the subsequent pattern discovery phase.

5. Future Work:


6. References:

[1] K. R. Suneetha and D. R. Krishnamoorthi, “Identifying User Behavior by Analyzing Web Server Access Log File”, International Journal of Computer Science and Network Security (IJCSNS), Vol. 9, No. 4, April 2009.

[2] S. Alam, G. Dobbie, and P. Riddle, “Particle Swarm Optimization Based Clustering of Web Usage Data”, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 451-454, 2008.

[3] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, SIGKDD Explorations, Vol. 1, No. 2, Jan 2000.

[4] N. Khasawneh and C. C. Chan, “Active User-Based and Ontology-Based Web Log Data Preprocessing for Web Usage Mining”, Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings) (WI'06), 0-7695-2747-7/06, 2006.

[5] Z. Pabarskaite, “Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining”, 24th Int. Conf. Information Technology Interfaces (ITI 2002), Cavtat, Croatia, June 24-27, 2002.

[6] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, A. Stephan, San Francisco, Morgan Kaufmann Publishers (an imprint of Elsevier), 2006.

[7] http://ita.ee.lbl.gov/html/traces.html.

[8] http://www.isc.org.

[9] http://www.w3.org.

[10] http://www.ncsa.uiuc.edu/.

[11] http://www.microsoft.com/.

[12] A. Scime, “Web Mining: Applications and Techniques”, Idea Group Publishing, ISBN 1-59140-414-2, 2005.