Pre-Processing: Procedure on Web Log File for Web Usage Mining


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 2, Issue 12, December 2012)


Shaily Langhnoja 1, Mehul Barot 2, Darshak Mehta 3

1 Student M.E.(C.E.), L.D.R.P. ITR, Gandhinagar, India
2 Asst. Professor, C.E. Dept., L.D.R.P. ITR, Gandhinagar, India
3 Lecturer, Government Polytechnic, Gandhinagar, India

Abstract— These days the World Wide Web has become very popular and interactive for transferring information. Web usage mining is the area of data mining which deals with the discovery and analysis of usage patterns from Web data, specifically web logs, in order to improve web-based applications. Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. After the completion of these three phases the user can find the required usage patterns and use this information for specific needs. The web access log file is saved to keep a record of every request made by users. However, the data stored in the log files does not specify accurate details of users' accesses to the Web site. So, preprocessing of the Web log data is the first and most important phase before the web log file can be used for pattern discovery and pattern analysis. The preprocessed Web log file is then suitable for the discovery and analysis of useful information, referred to as Web mining. This paper gives a detailed description of how pre-processing is done on a web log file before it is sent to the next stages of web usage mining.

Keywords— Web Mining, Web Usage Mining, Web Log File, Data Cleansing, Preprocessing

I. INTRODUCTION

With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of clickstream and user data collected by Web-based organizations in their daily operations have reached astronomical proportions. Analyzing such data can help these organizations determine the life-time value of clients, design cross-marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. This type of analysis involves the automatic discovery of meaningful patterns and relationships from a large collection of primarily semi-structured data, often stored in Web and application server access logs, as well as in related operational data sources.

Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests. Following the standard data mining process, the overall Web usage mining process can be divided into three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis.

This paper describes what a Web log file is, where it is located, its different formats, and the pre-processing performed on it. Pre-processing of a web log file includes data cleansing, user identification, and session identification.

II. WEB LOG FILE

Web log files are files that contain information about website visitor activity. Log files are created automatically by web servers. Each time a visitor requests any file (page, image, etc.) from the site, information on the request is appended to the current log file. Most log files are in text format, and each log entry (hit) is saved as a line of text. Log files typically range from 1 KB to 100 MB in size.

A. Location of weblog file:

The web log file can be located in three different places:

 Web server logs: Web server log files provide the most accurate and complete usage data to the web server. However, the log file does not record visits to cached pages. Since log file data contains sensitive, personal information, the web server keeps it closed.


 Proxy server logs: A proxy server log records the requests passing through the proxy. The two disadvantages are that proxy-server construction is a difficult task, requiring advanced network programming such as TCP/IP, and that request interception is limited.

 Client browser: The log file can reside in the client's browser itself. HTTP cookies are used for the client browser; these are pieces of information generated by a web server and stored in the user's computer, ready for future access.

B. Types of web log file:

There are four types of server logs.

 Access log file: Records data of all incoming requests and information about the clients of the server. The access log records all requests that are processed by the server.

 Error log file: A list of internal errors. Whenever an error occurs while a page is being requested by a client from the web server, an entry is made in the error log. Access and error logs are the most commonly used, but agent and referrer logs may or may not be enabled at the server.

 Agent log file: Contains information about the user's browser and browser version.

 Referrer log file: Provides information about the links that redirect visitors to the site.

C. Web log file format:

A web log file is a simple plain-text file which records information about each user. Log file data can be displayed in three different formats:

 W3C Extended log file format

 NCSA common log file format

 IIS log file format

In the NCSA and IIS log file formats, the data logged for each request is fixed. The W3C format allows the user to choose which properties to log for each request.

1. W3C Extended log file format

The W3C log format is the default log file format on IIS servers. Fields are separated by spaces, and time is recorded as GMT (Greenwich Mean Time). It can be customized: administrators can add or remove fields depending on what information they want to record. In the W3C format, dates are written as YYYY-MM-DD. Unwanted attribute fields can be omitted when log file size is limited.

The figure below shows:

#Software – version of IIS that is running
#Version – the log file format
#Date – recording date and time of the first log entry
#Fields: date time c-ip cs-username s-ip cs-method uri-stem uri-query sc-status sc-bytes bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referrer)

Fig.1. Example of W3C log file format
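Because the W3C Extended format names its columns in the #Fields directive, an entry can be parsed generically by pairing each value with its field name. The sketch below assumes a shortened field list and an illustrative entry, not data from the paper:

```python
# Sketch: parse a W3C Extended log entry by pairing the #Fields directive
# with a data line. The sample directive and entry below are illustrative.

def parse_w3c(fields_line, entry_line):
    """Map each space-separated value to its field name from #Fields."""
    names = fields_line.split()[1:]   # drop the leading '#Fields:' token
    values = entry_line.split()
    return dict(zip(names, values))

fields = "#Fields: date time c-ip cs-method cs-uri-stem sc-status"
entry = "2012-11-19 04:36:21 172.16.1.247 GET /index.html 200"

record = parse_w3c(fields, entry)
print(record["c-ip"], record["sc-status"])   # -> 172.16.1.247 200
```

This name-to-value mapping is what makes the later cleansing steps (checking sc-status or the cs-uri-stem suffix) straightforward.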

2. NCSA common log file format

The NCSA Common log file format is a fixed ASCII text-based format, so you cannot customize it. It is available for Web sites and for SMTP and NNTP services, but it is not available for FTP sites. Because HTTP.sys handles the NCSA Common log file format, this format records HTTP.sys kernel-mode cache hits. The NCSA Common log file format records the following data:

• Remote host address
• Remote log name (this value is always a hyphen)
• User name
• Date, time, and Greenwich mean time (GMT) offset
• Request and protocol version
• Service status code (a value of 200 indicates that the request was fulfilled successfully)
• Bytes sent

Fig.2 Example of NCSA log file format
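Since the NCSA Common format is fixed, its seven fields listed above can be extracted with a single regular expression. A minimal sketch, using an illustrative log line rather than one from the paper:

```python
import re

# Sketch: split an NCSA Common Log Format line into its fixed fields.
# The sample entry is illustrative, not taken from the paper.
NCSA = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

line = ('172.16.1.247 - frank [19/Nov/2012:04:36:21 +0000] '
        '"GET /index.html HTTP/1.0" 200 1324')
m = NCSA.match(line)
print(m.group("host"), m.group("status"), m.group("bytes"))
# -> 172.16.1.247 200 1324
```

Note that the remote log name is matched but, as the format specifies, is always a hyphen.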

3. IIS log file format

The IIS log file format is a fixed ASCII text-based format, so you cannot customize it. Because HTTP.sys handles the IIS log file format, this format records HTTP.sys kernel-mode cache hits.

The IIS log file format records the following data:

• Client IP address
• User name
• Date
• Time
• Service and instance
• Server name
• Server IP address
• Time taken
• Client bytes sent
• Server bytes sent
• Service status code (a value of 200 indicates that the request was fulfilled successfully)
• Windows status code (a value of 0 indicates that the request was fulfilled successfully)
• Request type
• Target of operation

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-12-05 08:25:10
#Fields:
1998-11-19 22:48:39 206.175.82.5 - 208.201.133.173 GET /global/images/navlineboards.gif - 200 540 324 157 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95) USERID=CustomerA;+IMPID=01234

Fig.3 Example of IIS log file format (source: http://www.loganalyzer.net)
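Because the IIS format has a fixed field order, a parser can simply split on commas (IIS entries are comma-separated) and zip the values against the field list above. The field names and the sample entry in this sketch are illustrative assumptions, not data from the paper:

```python
# Sketch: parse a comma-separated IIS-format entry against the fixed field
# order listed above. Field names and the sample entry are assumed.

IIS_FIELDS = [
    "client_ip", "user_name", "date", "time", "service", "server_name",
    "server_ip", "time_taken", "client_bytes", "server_bytes",
    "service_status", "win32_status", "request_type", "target",
]

def parse_iis(line):
    values = [v.strip() for v in line.split(",")]
    return dict(zip(IIS_FIELDS, values))

entry = ("172.16.1.247, -, 11/19/2012, 4:36:21, W3SVC1, DARSHAK, "
         "172.16.1.252, 6334, 367, 1324, 200, 0, GET, /index.html,")
rec = parse_iis(entry)
print(rec["client_ip"], rec["service_status"])  # -> 172.16.1.247 200
```

The trailing comma that IIS appends to each entry yields one empty extra value, which `zip` silently discards.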

III. PHASE 1: PREPROCESSING

There are several pre-processing tasks to be done before data mining algorithms can be applied to the web server logs. These include data cleansing, user identification, and session identification.

Fig.4 Data Pre-Processing Steps in Web Usage Mining

A. Data Cleansing

The purpose of data cleansing is to remove irrelevant items stored in the log files that may not be useful for analysis. When a user accesses an HTML document, the embedded images, if any, are also automatically downloaded and recorded in the server log. Since the main objective of data preprocessing is to obtain only the usage data, file requests that the user did not explicitly make can be eliminated by checking the suffix of the URL name; for example, log entries with file-name suffixes such as gif, jpeg, GIF, JPEG, jpg and JPG can be removed. In addition, erroneous requests can be removed by checking the status of the request (such as status code 404). Data cleansing also involves the removal of references resulting from spider navigation, which can be done by maintaining a list of spiders or through heuristic identification of spiders and Web robots. The cleaned log represents the users' accesses to the Web site.

Algorithm for Data Cleansing

The following algorithm is used for cleansing the web log file: it retains useful information and eliminates the unnecessary data. The input is the raw web log file; the output is the processed web log file, whose data is inserted into a table of the database.

Input: raw web log file
Output: processed web log file

1. for each line in web log file do
2.   if length of line is more than one character then      # avoid blank lines
3.     if line does not start with '#' then                 # avoid comments
4.       if link name contains domain name then             # consider application-specific links only
5.         if page extension is aspx or html then           # eliminate non-page links such as images and PDFs
6.           insert the log data into the database
   end for
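The cleansing steps above can be sketched in Python as follows. The domain name is an assumption for illustration, and the final database insert of step 5 is stood in for by a list append:

```python
# A minimal sketch of the data-cleansing algorithm. DOMAIN is an assumed
# application-specific domain; kept.append() stands in for the SQL INSERT.

DOMAIN = "www.example.ac.in"          # assumption: the site's domain
PAGE_EXTENSIONS = (".aspx", ".html")  # keep only page requests

def clean_log(lines):
    kept = []
    for raw in lines:
        line = raw.strip()
        if len(line) <= 1:                  # step 2: avoid blank lines
            continue
        if line.startswith("#"):            # step 3: avoid comments
            continue
        if DOMAIN not in line:              # step 4: application links only
            continue
        # step 5: eliminate non-page links such as images and PDFs
        if not any(tok.lower().endswith(PAGE_EXTENSIONS)
                   for tok in line.split()):
            continue
        kept.append(line)                   # stand-in for the DB insert
    return kept

sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "",
    "2012-11-19 04:36:21 GET http://www.example.ac.in/index.html 200",
    "2012-11-19 04:36:22 GET http://www.example.ac.in/images/login.jpg 200",
]
print(clean_log(sample))   # only the .html request survives
```

The nested conditions of the pseudocode collapse naturally into a chain of `continue` guards, which keeps the retained-record path flat and readable.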

B. User & Session Identification

To identify each user and session uniquely, we can use measures such as IP address, operating system, browser, and time-out period. Once the data cleansing step above has been performed, all useful data records are available in the database and irrelevant entries have been removed. So the remaining process can be carried out on the database rows themselves.

Algorithm for User & Session Identification

The algorithm for user and session identification can be depicted as below:

Input: processed web log file
Output: identification of users & sessions

1. for each record in dataset do
2.   if currentIP is not in ListOfIP then
       add currentIP to ListOfIP
       mark the record as a new user and session; assign a new sessionID and userID
3.   else if currentOS is not in ListOfOS then
       add currentOS to ListOfOS
       mark the record as a new user and session; assign a new sessionID and userID
4.   else if currentBrowser is not in ListOfBrowser then
       add currentBrowser to ListOfBrowser
       mark the record as a new user and session; assign a new sessionID and userID
5.   else if time since the user's previous request is more than 1800 seconds then   # 30 minutes * 60 seconds
       mark the record as a new user and session; assign a new sessionID and userID
6.   else
       mark the current record with the existing sessionID and userID
     end if
   end for
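A compact Python sketch of these rules is given below. It is an illustrative interpretation, not the paper's implementation: records are assumed to be (ip, os, browser, timestamp) tuples, a single ID stands in for the sessionID/userID pair, and the 1800-second timeout is measured against the same IP's previous request:

```python
# Sketch of the user/session identification rules. Records are assumed to be
# (ip, os_name, browser, timestamp_seconds) tuples; one running ID stands in
# for the sessionID/userID pair assigned together in the algorithm.

TIMEOUT = 1800  # 30 minutes * 60 seconds

def identify(records):
    seen_ip, seen_os, seen_browser = set(), set(), set()
    last_time = {}          # previous request time per IP (assumption)
    next_id = 0
    labels = []
    for ip, os_name, browser, ts in records:
        is_new = (
            ip not in seen_ip                         # step 2: new IP
            or os_name not in seen_os                 # step 3: new OS
            or browser not in seen_browser            # step 4: new browser
            or ts - last_time.get(ip, ts) > TIMEOUT   # step 5: session timeout
        )
        seen_ip.add(ip); seen_os.add(os_name); seen_browser.add(browser)
        if is_new:
            next_id += 1    # new user and session: assign fresh IDs
        last_time[ip] = ts
        labels.append(next_id)  # step 6: otherwise reuse the existing ID
    return labels

records = [
    ("172.16.1.247", "Win7", "Chrome", 0),
    ("172.16.1.247", "Win7", "Chrome", 100),   # same user, same session
    ("172.16.1.248", "Win7", "Chrome", 120),   # new IP: new user/session
    ("172.16.1.247", "Win7", "Chrome", 2500),  # >1800 s gap: new session
]
print(identify(records))   # -> [1, 1, 2, 3]
```

Each record thus ends up labeled with a group ID, mirroring how the algorithm marks database rows for the later mining stages.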

When used, the above algorithm marks each record in the database with its identified user and session group, which can later be used in further stages of the web usage mining process. The resulting groups of records can be inserted into the database, yielding helpful figures such as the total number of users, the total number of sessions, and the difference between the total number of records before and after pre-processing.

IV. EXPERIMENTAL RESULTS

We have conducted several experiments on log files collected from the Government Polytechnic, Gandhinagar website. During the data cleansing step all irrelevant entries are removed. A sample raw web log file is shown below:

Fig.5. Sample Web Log File

The web log file is selected for the cleansing operation as shown below:

Fig.6. Data Cleansing Process

Thus, after completion of data cleansing, the Web server log file is cleaned and its data is prepared for loading into a relational database. Here the data is loaded into and stored in MS SQL Server 2008.

Fig.7. Processed Web Log File

Here, since the Government Polytechnic, Gandhinagar site is mostly accessed by students in the computer laboratories without passing through a proxy server, we simply use the machines' IP addresses to identify unique users. The result obtained after performing the pre-processing step is shown in Table 1.

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-11-19 04:36:21
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET / - 80 - 172.16.1.247 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.11+(KHTML,+like+Gecko)+Chrome/23.0.1271.64+Safari/537.11 - - 172.16.1.252 200 0 0 1324 367 6334
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET /itInfo/Images/login.jpg - 80 -


TABLE 1
RESULTS AFTER PRE-PROCESSING

Total No. of Users | Total No. of Sessions | Rows in Web Log File | Total Rows after Pre-processing
18                 | 68                    | 1217                 | 411
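The figures in Table 1 can be turned into a single summary statistic: the fraction of raw log entries discarded by cleansing. A small worked check:

```python
# Quantify the effect of pre-processing using the row counts from Table 1.
raw_rows, cleaned_rows = 1217, 411
reduction = 100 * (raw_rows - cleaned_rows) / raw_rows
print(f"{reduction:.1f}% of log entries removed")  # -> 66.2% of log entries removed
```

That is, roughly two-thirds of the raw entries (images, comments, blank lines, non-application requests) carried no usage information, which illustrates why cleansing before mining is worthwhile.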

V. CONCLUSION

Web usage mining is indeed one of the emerging areas of research and an important sub-domain of data mining. In order to take full advantage of web usage mining and its techniques, it is important to carry out the preprocessing stage efficiently and effectively. This paper covers the areas of preprocessing, including data cleansing, user identification, and session identification. Once the preprocessing stage is well performed, we can apply data mining techniques such as clustering, association, and classification for applications of web usage mining such as business intelligence, e-commerce, e-learning, and personalization.

REFERENCES

[1] Theint Theint Aye, "Web Log Cleaning for Mining of Web Usage Patterns," IEEE, 2011.

[2] K. R. Suneetha and R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File," IJCSNS, 2009.

[3] R. Cooley, B. Mobasher and J. Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns," Knowledge and Information Systems, 1(1), 1999, 5-32.

[4] R. Kosala and H. Blockeel, "Web Mining Research: A Survey," ACM SIGKDD Explorations, 2000, 1-15.
