
ISSN: 2231-5381

http://www.internationaljournalssrg.org

Page 178

An Enhanced Framework For Performing

Pre-Processing On Web Server Logs

T. Subha Mastan Rao #1, P. Siva Durga Bhavani #2, M. Revathi #3, N. Kiran Kumar #4, V. Sara #5

# Department of Information Science and Technology, Koneru Lakshmaiah College of Engineering, Green Fields, Vaddeswaram, Guntur-522502, India

Abstract- Analyzing log files offers valuable insight into web site usage. Log files show the actual usage of a web site under all circumstances, so no external laboratory experiments are needed to obtain this information. This paper describes effective preprocessing of the access stream before the actual mining process is performed. Log files collected from different sources undergo several preprocessing phases to produce an actionable data source. This helps the automatic discovery of meaningful patterns and relationships in the users' access stream.

Keywords: Web Usage Mining, Web Server, Data Mining, Data Preprocessing

I. INTRODUCTION

The World Wide Web has become one of the most important media to store, share and distribute information. At present, Google indexes more than 8 billion Web pages. The rapid expansion of the Web has provided a great opportunity to study user and system behavior by exploring Web access logs. The WWW serves as a huge, widely distributed global information service center for technical information, news, advertisements, e-commerce and other information services. In our approach, the web log file is exported using the Web Log DB software, which produces output in MS Access file format. This Access file is then ready for pre-processing.

The main intention of our paper is to perform pre-processing on web log data. Before such data can be analyzed using web mining techniques, the web log has to be pre-processed, integrated and transformed. As the World Wide Web is continuously and rapidly growing, web miners need intelligent tools to find, extract, filter and evaluate the desired information. Data pre-processing is the most important phase in investigating web usage behavior. It requires extracting only the accesses made by human users from the web log data, which is a critical and complex task.

II. PROPOSED FRAMEWORK FOR PERFORMING PRE-PROCESSING

Fig: Framework for Pre-Processing

III. WEB LOG DATA

The log file is the input to the pre-processing block. A Web log is a file to which the Web server writes information each time a user requests a resource from that particular site.

The log files [4] are text files that can range in size from 1 KB to 100 MB, depending on the traffic. In determining the amount of traffic a site receives during a specified period of time, it is important to understand exactly what the log files are counting and tracking.

The raw log files consist of 19 attributes: Date, Time, Client IP, Auth User, Server Name, Server IP, Server Port, Request Method, URI-Stem, URI Query, Protocol Status, Time Taken, Bytes Sent, Bytes Received, Protocol Version, Host, User Agent, Cookies, Referer
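As a minimal sketch of how such a record could be read, the following Python fragment splits one space-delimited log line into the 19 attributes listed above. The field names, the `parse_log_line` helper and the sample line are assumptions made for illustration. Only the date format, the time, the server name, the port, the GET method, the URI-Stem and the Referer in the sample come from values mentioned in this paper.

```python
# Field names follow the 19 attributes listed in the text, in the same order.
# The identifiers themselves are hypothetical.
FIELDS = [
    "date", "time", "client_ip", "auth_user", "server_name", "server_ip",
    "server_port", "request_method", "uri_stem", "uri_query",
    "protocol_status", "time_taken", "bytes_sent", "bytes_received",
    "protocol_version", "host", "user_agent", "cookies", "referer",
]


def parse_log_line(line):
    """Split one space-delimited log line into a dict keyed by FIELDS.

    Returns None for blank lines, directive lines starting with '#',
    and lines whose field count does not match FIELDS.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


# Hypothetical sample line: IP addresses, byte counts and user agent are
# invented for illustration; '-' marks an empty field.
SAMPLE = (
    "2003-11-03 16:00:13 192.0.2.10 - CSLNTSVR20 192.0.2.1 80 GET "
    "/tutor/images/icons/fold.gif - 200 16 1450 300 HTTP/1.1 "
    "www.tutor.com.my Mozilla/4.0+(compatible;+MSIE+6.0) - "
    "http://www.tutor.com.my/"
)

record = parse_log_line(SAMPLE)
print(record["server_name"], record["request_method"], record["uri_stem"])
```

Returning None for malformed lines keeps the parser usable as a first filter before the cleaning step.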


SIONIDAQRRCQCC=LBDGBPIBDFCOKH MLHEHNKFBN

http://www.tutor.com.my/

1) Date

The date, in Greenwich Mean Time (GMT), is recorded for each hit. The date format is YYYY-MM-DD. The example above shows that the transaction was recorded on 2003-11-3.

2) Time

Time of the transaction. The time format is HH:MM:SS. The example above shows that the transaction time was recorded as 16:00:13.

3) Client IP Address

Client IP is the IP address of the computer that accesses or requests the site.

4) User Authentication

Some web sites are set up with a security feature that requires a user to enter a username and password. Once a user logs on to a web site, that user's "username" is logged in the fourth field of the log file.

5) Server Name

Name of the server. In the example, the name of the server is CSLNTSVR20.

6) Server IP Address

Server IP is a static IP address provided by the Internet Service Provider. This IP address is the reference for accessing information from the server.

7) Server Port

Server Port is the port used for data transmission. Usually, the port used is port 80.

8) Request Method

The word request refers to an image, movie, sound, pdf, txt, HTML file and more. The above example indicates that folder.gif was the item accessed. It is also important to note that the full path name is given from the document root. The GET in front of the path name specifies the way in which the server sends the requested information. Currently, there are three methods by which Web servers send information: GET, POST and HEAD. Most HTML files are served via the GET method, while most CGI functionality is served via POST.

9) URI-Stem

URI-Stem is the path from the host. It represents the structure of the web site.

For example: /tutor/images/icons/fold.gif

10) Server URI-Query

URI-Query usually appears after the "?" sign. It represents the type of user request, and its value usually appears in the address bar. For example:

?q=tawaran+biasiswa&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=20&sa=N
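The URI-Query example above can be broken into its individual parameters with the Python standard library. This is a minimal sketch; the helper name `parse_uri_query` and the convention of treating '-' as an empty query are assumptions for illustration.

```python
from urllib.parse import parse_qs


def parse_uri_query(uri_query):
    """Return the query parameters as a dict of lists.

    A leading '?' is dropped, and '-' (an empty log field) or an empty
    string yields an empty dict. Blank values such as 'lr=' are kept.
    """
    query = uri_query.lstrip("?")
    if query in ("", "-"):
        return {}
    return parse_qs(query, keep_blank_values=True)


params = parse_uri_query(
    "?q=tawaran+biasiswa&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=20&sa=N"
)
print(params["q"])      # the search terms, with '+' decoded to a space
print(params["start"])  # the result offset requested by the user
```

Keeping blank values preserves the information that a parameter was present but empty, which may matter when requests are compared later.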

IV. WEB LOG DB SOFTWARE

Web Log DB exports web log data to databases via ODBC, inserting the data with SQL queries. It allows you to keep using applications you are already accustomed to, such as MS SQL, MS Excel and MS Access. Any other ODBC-compliant application can also be used to produce the desired output, and the exported data can be used for further analysis. Web Log DB reads the most popular log file formats, including the MS IIS and Apache log file formats. It can even read GZip (gz) compressed logs, so they do not need to be unpacked manually.
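The export step can be illustrated with a short Python sketch. It is not the Web Log DB software itself: SQLite from the standard library stands in for an ODBC target, and the table name `weblog`, the column subset and both helper functions are assumptions chosen for illustration.

```python
import gzip
import sqlite3

# Hypothetical subset of log attributes stored in the database.
COLUMNS = ["date", "time", "client_ip", "uri_stem", "status"]


def open_log(path):
    """Open a plain or gzip-compressed log file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")


def load_records(conn, records):
    """Insert parsed log records (dicts keyed by COLUMNS) with SQL queries."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weblog ("
        "date TEXT, time TEXT, client_ip TEXT, uri_stem TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO weblog (date, time, client_ip, uri_stem, status) "
        "VALUES (?, ?, ?, ?, ?)",
        [tuple(rec[col] for col in COLUMNS) for rec in records],
    )
    conn.commit()


# Usage with an in-memory database and two invented records.
conn = sqlite3.connect(":memory:")
load_records(conn, [
    {"date": "2003-11-03", "time": "16:00:13", "client_ip": "192.0.2.10",
     "uri_stem": "/tutor/index.html", "status": 200},
    {"date": "2003-11-03", "time": "16:00:14", "client_ip": "192.0.2.10",
     "uri_stem": "/tutor/images/icons/fold.gif", "status": 200},
])
print(conn.execute("SELECT COUNT(*) FROM weblog").fetchone()[0])
```

Once the records are in a database table, the later cleaning steps can also be expressed as SQL queries instead of file operations.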

Fig: Log file browsing

Fig: FTP Server Authentication


Fig: Web log db s/w

Fig: Before Pre-Processing

Fig: While Pre-Processing


V. BRIEF VIEW OF DATA PRE-PROCESSING

1) Data Cleaning:

Data cleaning [2] is the pre-processing step that eliminates duplicates, fills in missing values and removes unwanted data. The following types of unwanted and irrelevant records are removed:

a) Records with a status code above 299 or below 200.

b) Records in which the attribute cs_uri_stem has an extension such as CSS, JPEG or GIF.
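The two cleaning rules above, together with duplicate removal, can be sketched as follows. The field name `cs_uri_stem` is taken from the text. The field name `sc_status`, the helper names and the inclusion of the `.jpg` spelling alongside `.jpeg` are assumptions for illustration.

```python
# Extensions treated as irrelevant. CSS, JPEG and GIF come from the text;
# '.jpg' is an assumed alternative spelling of JPEG.
IRRELEVANT_EXTENSIONS = (".css", ".jpeg", ".jpg", ".gif")


def is_relevant(record):
    """Keep only successful (2xx) requests for non-auxiliary resources."""
    status = int(record["sc_status"])
    if status < 200 or status > 299:
        return False
    stem = record["cs_uri_stem"].lower()
    return not stem.endswith(IRRELEVANT_EXTENSIONS)


def clean(records):
    """Drop exact duplicate records, then apply the relevance rules."""
    seen = set()
    cleaned = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        if is_relevant(record):
            cleaned.append(record)
    return cleaned
```

For instance, a request for `/style.CSS` with status 200 and a request for `/page.html` with status 404 would both be discarded, while a request for `/index.html` with status 200 would be kept once.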

2) User and Session Identification:

The task of user and session identification is to find the different user sessions in the original web access log. A referrer-based method is used to identify sessions. Different IP addresses distinguish different users.

a. If the IP addresses are the same, different browsers and operating systems indicate different users. These can be found from the client IP address and the user agent, which gives information about the user's browser and operating system.

b. If the IP address, browser and operating system are all the same, the referrer information should be taken into account. The Referer URI is checked, and a new user session is identified if the Referer URI is '-', that is, the request did not come from a previously accessed page, or if there is an interval of more than 30 minutes between the access time of this record and that of the previous one.
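The rules above can be sketched in Python as follows. This is a simplified reading of the method, not the authors' implementation: a user is taken to be the pair of client IP address and user agent, and a new session starts when the referrer is '-' or when more than 30 minutes have passed since the user's previous request. The record keys and the function name are assumptions.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def identify_sessions(records):
    """Group records into sessions per user.

    Each record is a dict with 'client_ip', 'user_agent', 'referer' and
    'timestamp' (a datetime). Returns a dict that maps each user, i.e.
    (client_ip, user_agent), to a list of sessions; each session is a
    list of records in time order.
    """
    sessions = {}
    last_seen = {}
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        user = (rec["client_ip"], rec["user_agent"])
        user_sessions = sessions.setdefault(user, [])
        previous = last_seen.get(user)
        starts_new = (
            previous is None
            or rec["referer"] == "-"
            or rec["timestamp"] - previous > SESSION_TIMEOUT
        )
        if starts_new:
            user_sessions.append([])
        user_sessions[-1].append(rec)
        last_seen[user] = rec["timestamp"]
    return sessions


# Invented example: one IP address shared by two different browsers.
log = [
    {"client_ip": "192.0.2.10", "user_agent": "BrowserA", "referer": "-",
     "timestamp": datetime(2003, 11, 3, 10, 0)},
    {"client_ip": "192.0.2.10", "user_agent": "BrowserB", "referer": "-",
     "timestamp": datetime(2003, 11, 3, 10, 1)},
    {"client_ip": "192.0.2.10", "user_agent": "BrowserA", "referer": "/a",
     "timestamp": datetime(2003, 11, 3, 10, 5)},
    {"client_ip": "192.0.2.10", "user_agent": "BrowserA", "referer": "/b",
     "timestamp": datetime(2003, 11, 3, 10, 50)},
]
result = identify_sessions(log)
print({user: [len(s) for s in user_sessions]
       for user, user_sessions in result.items()})
```

In this example the two browsers are treated as two users, and the 45-minute gap splits the first user's requests into two sessions.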

3) Path Completion:

Path completion is used to acquire the complete user access path. The incomplete access path of each user session is recognized on the basis of user session identification. If, at the start of a user session, both the Referrer and the URI have data values, the value of the Referrer is deleted and replaced with '-'. Web log pre-processing helps to remove unwanted click-streams from the log file and also reduces the size of the original file by 40-50%.
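The referrer rule stated above can be sketched as a small function. It covers only that single rule, not a full path completion procedure, and the record keys `referer` and `uri_stem` as well as the function name are assumptions for illustration.

```python
def reset_session_start_referrer(session):
    """Return a copy of the session whose first record has referrer '-'.

    The referrer is replaced only when both the referrer and the URI of
    the first record carry a value. The input session is left unchanged.
    """
    if not session:
        return []
    first = dict(session[0])
    if first.get("referer", "-") != "-" and first.get("uri_stem"):
        first["referer"] = "-"
    return [first] + [dict(rec) for rec in session[1:]]


session = [
    {"uri_stem": "/tutor/index.html", "referer": "http://www.tutor.com.my/"},
    {"uri_stem": "/tutor/courses.html", "referer": "/tutor/index.html"},
]
completed = reset_session_start_referrer(session)
print(completed[0]["referer"], completed[1]["referer"])
```

Working on copies keeps the original session records available for comparison with the completed path.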

4) Data pre-processing is performed using two types of approaches:

a) XML b) Text file

a) XML:

i) Logs [3] recorded in the web log, which is a text file, are converted to a DOM tree structure using an XML parser.

ii) Since a DOM tree structure is used, the pre-processing stages can be analysed very well.

iii) The time taken to convert is 20 minutes.

iv) The XML approach can be used when the web log file has a larger number of attributes describing the usage profile of the user, as with an IIS web server using the Extended Log File Format, which has 17 attributes.

b) Text file:

i) Logs [3] recorded in the web log, which is a text file, first need to be separated using the space character as delimiter.

ii) Understanding each step of pre-processing is difficult for the user, because this approach demands analysis and knowledge of what the web log looks like.

iii) The time taken to convert is 10 seconds.

iv) The text file approach can be used when the web log file has very few attributes describing the usage profile of users, i.e., fewer than 10, as in the Common Log File Format.
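The difference between the two approaches can be illustrated with a minimal Python sketch that uses the standard library DOM implementation. The element names `weblog` and `entry`, the field names and both helper functions are assumptions. The sketch does not reproduce the conversion times reported above.

```python
from xml.dom import minidom


def records_to_dom(field_names, lines):
    """Convert space-delimited log lines into a DOM tree.

    Each line becomes an <entry> element with one child element per field.
    """
    doc = minidom.Document()
    root = doc.createElement("weblog")
    doc.appendChild(root)
    for line in lines:
        entry = doc.createElement("entry")
        for name, value in zip(field_names, line.split(" ")):
            field = doc.createElement(name)
            field.appendChild(doc.createTextNode(value))
            entry.appendChild(field)
        root.appendChild(entry)
    return doc


def dom_field_values(doc, name):
    """Collect the text of every element with the given field name."""
    return [
        node.firstChild.data
        for node in doc.getElementsByTagName(name)
        if node.firstChild is not None
    ]


fields = ["date", "time", "uri_stem"]
lines = [
    "2003-11-03 16:00:13 /tutor/index.html",
    "2003-11-03 16:00:14 /tutor/images/icons/fold.gif",
]

# XML approach: fields are addressed by name through the DOM tree.
doc = records_to_dom(fields, lines)
print(dom_field_values(doc, "uri_stem"))

# Text file approach: fields are addressed by position after splitting.
print([line.split(" ")[2] for line in lines])
```

The DOM version lets a field be looked up by name without knowing its position, while the text version is shorter but requires knowledge of the column order.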

VI. WEB USAGE MINING

Web usage mining is the type of web mining that allows for the collection of Web access information for Web pages. This usage data provides the paths leading to accessed Web pages. This information is often gathered automatically into access logs via the Web server.

Data used for web usage mining can be collected at three different levels:

1) Server level 2) Client level 3) Proxy level

1) Server Level:

The server stores data regarding requests performed by the client. Data can be collected from multiple users on a single site.

2) Client Level:

At the client level, the client itself sends information regarding the user's behaviour. This is done either with an ad-hoc browsing application or through a client-side application running in a standard browser.

3) Proxy Level:

Information regarding user behaviour is stored at the proxy side, so web data is collected from multiple users on several websites, but only for users whose web clients pass through the proxy.

4) Applications of Web Usage Mining:


marketing strategies and promotional campaign effectiveness. The usage data that is gathered gives companies the ability to produce results that are more effective for their businesses and to increase sales. Usage data can also be useful for developing marketing skills that will out-sell competitors and promote the company's services or products at a higher level.

Usage mining [5] is valuable not only to businesses using online marketing, but also to e-businesses whose business is based solely on the traffic provided through search engines. This type of web mining helps to gather important information from customers visiting the site, which enables an in-depth log analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective Web server for promotion of their product or service.

Fig: Web Usage Mining

The first use is usage [1] processing, which is used to complete pattern discovery. This first use is also the most difficult, because only bits of information such as IP addresses, user information and site clicks are available. With this minimal amount of information, it is harder to track the user through a site, since the data does not follow the user throughout the pages of the site.

The second use is content processing, which consists of converting Web information such as text, images, scripts and others into useful forms. This helps with the clustering and categorization of Web page information based on the titles, specific content and images available.

Finally, the third use is structure processing. This consists of analysing the structure of each page contained in a Web site. Structure processing can prove difficult if a new structure analysis has to be performed for each page.

VII. CONCLUSION

In this paper, we have taken web log data as the source. The web log data is converted to an accessible format using the Web Log DB software, which converts the logged data into a simple MS Access file format. Data pre-processing is then performed on this format to increase the quality of the data by removing erroneous and noisy data. Functions and mining performed on the Access format are easy and useful for humans. Missing values are replaced by the most frequent ones, and unwanted data is deleted according to selected parameters.

REFERENCES:

[1] Google Website. http://www.google.com.

[2] Jiawei Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2001.

[8] ZY Computing, 123 Log Analyzer, San Jose, USA, 2003. Available at http://www.123loganalyzer.com

[3] Dipa Dixit et al., International Journal on Computer Science and Engineering (IJCSE), Vol. 02, No. 07, 2010, pp. 2447-2452, ISSN: 0975-3397.

[4] Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin, World Academy of Science, Engineering and Technology 48, 2008.

[5] International Journal of Information Technology and Knowledge Management, July-December 2010, Volume 2, No. 2, pp. 279-283, by Navin Kumar Tyagi, A.K. Solanki &
