Enhanced Initialization and Distance function for Pattern discovery in Web Usage Mining
G. Vijaiprabhu (1), Dr. K. Meenakshisundaram (2)
(1) Ph.D. Research Scholar, Department of Computer Science, Erode Arts and Science College, Erode, India
(2) Associate Professor, Department of Computer Science, Erode Arts and Science College, Erode, India
[email protected]. [email protected]
Abstract
Web mining is a discipline of data mining that focuses on the WWW, including all of its components, as its primary source of information. It uses data logging techniques and algorithms to reveal hidden facts directly from web documents, hyperlinks and server logs. It collects and analyzes the records in order to gain insight into user behavior. The categories are Content, Structure and Usage mining: Content mining extracts useful information from documents, Structure mining discovers structural information from pages, and Usage mining identifies interesting usage patterns from the navigation history of users stored in weblogs. Of these, the last comprises three phases: Preprocessing, Pattern discovery and Pattern analysis. This paper applies a classification technique to Pattern discovery in Usage Mining. It puts forth an enhanced procedure to upgrade the existing K-Nearest Neighbor (KNN) algorithm. The primary objective is User identification and HTTP Status code categorization for a specific university in order to improve its site. Each user is identified and classified according to the IP address specified in the log file. The work is implemented in the Rapid Miner tool and assessed with suitable evaluation metrics.
Keywords: Distance Function, K-Nearest neighbor, Logarithmic function, Rapid Miner.
1. Introduction
Web Crawling is an application of data analytics to reveal patterns from the WWW [4]. The WWW is the largest source of statistics, and it grows in an unsystematic way. Every page is linked but not organized logically. In a short period, millions of pages are added to the web, and pages undergo changes daily. This leads to overloading. So, getting the desired information or particular details in a limited time with minimal effort is a major challenge while designing a page. Another major issue is the relevancy of insight. An efficient and effective technique is needed to extract or access the required information. Proper management of information improves retrieval efficiency. To satisfy these requirements, one of the most used functional techniques [17] is to analyze browsing patterns.
A. Types of techniques
Analyzer tools with efficient techniques [5] are used to automatically discover information from web documents and services. The main purpose is to get useful information and its usage patterns.
1) Content based
Application of extracting useful particulars from the content of cyberspace documents. This content includes different types of files such as text, image, video, audio etc. Text contains a group of facts that is used to design a page. It contains effective and interesting patterns about user needs. This kind performs scanning and selecting of text and images and grouping of scripts.
2) Structure based
Discovery of structural facts from hypertext documents. It considers pages as nodes and hyperlinks as edges. This type basically shows a structured summary of a particular site. It defines the relationship among the documents linked by information. Its purpose is to determine connections between commercial sites.
3) Usage based
Application of identifying [9] interesting usage patterns from a large repository. These patterns make it possible to understand visitor behavior. Each page a user accesses is stored in the form of logs.
B. Source of Log Files
A log file records the details each time a user submits a request to a server. It is a text file that is created automatically when a page is requested. It holds information about the requests made for the resources of a site. When a person sends a request to a server, the requested data is retrieved. At the same time, the user session, including the URL, access date and time, client's IP address and query stem, is recorded in the logs.
These files reside in three different areas, as follows:
1) Server Log
The purpose of server log is:
Gives insight about visitor's login.
Number of requests for each page at site.
Usage patterns in terms of time, day or season.
Example: 120.236.0.14 -20011-12-12.
Types:
(a) Referrer Log - Stores the path or 'Uniform Resource Locator' of the sites that link to the pages. Referrer logs can help to analyze traffic in the network.
It provides:
Name of engine that sent traffic.
Keywords used to find site.
Most or least accessed pages.
User profile with region.
Time span of user session.
Number of user sessions or views.
Summary of top referring pages.
Http errors.
Bandwidth of traffic.
Example: http://myblaze.sez.html>/library/lectures/news.gif
(b) Agent Log - Stores information about the clients that send requests to the server. It registers details about the user's browser, browser version and operating system. This can be used to find out the most popular browsers and operating systems in use.
Example: Internet Explorer/5.0(win 7 ;)
(c) Error Log - This file records a fault when a user clicks on a specific link and the browser does not open the requested link. It also stores information about failed requests on the server. By analyzing it, a system administrator can easily resolve faults using the information available in the error logs.
Top five error logs:
The five most common HTTP errors on Google are –
HTTP Error 500 – Internal Server Error.
HTTP Error 403 – Forbidden
HTTP Error 404 – Not Found.
HTTP Error 400 – Bad Request.
HTTP Error 401 – Unauthorized.
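Since the stated objective includes HTTP status code categorization, a minimal sketch of grouping codes by their standard class (the first digit of the code) is given here. The function name `status_class` and the returned label strings are illustrative assumptions, not taken from the paper or from Rapid Miner.

```python
def status_class(code: int) -> str:
    """Map an HTTP status code to the name of its standard class.

    The class is determined by the first digit of the three-digit code.
    The label strings are illustrative choices.
    """
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client error",
        5: "Server error",
    }
    if not 100 <= code <= 599:
        return "Unknown"
    return classes[code // 100]


# The five error codes listed above: 500 is a server error,
# the other four are client errors.
for code in (500, 403, 404, 400, 401):
    print(code, status_class(code))
```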
2) Proxy Server Log
These log files contain details about the proxy [10] server on which a request arises. The proxy lies between an endpoint device and a server. When a proxy server receives a request for a link, it searches its local cache and returns the page if it is found. If the requested page is not in the cache, the proxy acts as a client, uses one of its own IP addresses to request the page and forwards it. A proxy can also log its interactions, which is necessary for troubleshooting.
3) Client browser Log
These log files reside in the client's browser, and special software is used to store the details. Client logs enable a server to record connection and streaming analysis. They track the activity of the clients that connect to it. The file contains several client event entries with a number of space-delimited fields.
Fields in Log files
User's IP address.
Time stamp.
Mode of Request.
Host IP address.
Requested URL.
Status Code.
Content size in Bytes.
Agent type.
Remote URL.
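As an illustration of how such fields can be pulled out of a log entry, the sketch below parses one line in the Common Log Format [15]. It is only an assumption that the logs follow this layout; the sample line, the regular expression and the field names in the returned dictionary are hypothetical and cover only a subset of the fields listed above.

```python
import re
from typing import Optional

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no content was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample entry, not taken from the paper's dataset.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_log_line(sample)
print(record["ip"], record["method"], record["url"], record["status"])
```

Lines that do not match the pattern return `None`, so malformed entries can be filtered out during preprocessing.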
2. Supervised Learning
A. Supervised Learning
Supervised learning [8] is the task of learning a function that maps an input to an output. It infers the function from labeled instances. A supervised learning algorithm analyzes the training set and produces an inferred function, which is used to map new examples.
Steps in Supervised Learning
1. Determine and collect the type of training examples.
2. Determine the input features that give enough information to accurately predict the output.
3. Determine the structure of the learned function and run the algorithm on the training samples.
4. Evaluate the accuracy of the learned function.
Categories of Supervised learning
Classification: A classification problem arises when the output variable is a category, like “Yes” and “No”.
Regression: A regression problem arises when the output variable is a real value, like “dollars” or “weight”.
B. Distance measure in supervised learning
A distance measure compares the features of objects. It is also used to understand patterns in the input and to recognize similarities among data. If the distance is small, resemblance is high; if it is large, resemblance is low. It is computed by a mathematical formula.
Commonly used functions:
1) Euclidean - The straight-line distance between a couple of points in a space, taken as the length of the line segment joining (c1, d1) and (c2, d2).
$$\mathrm{dist} = \sqrt{(c_2 - c_1)^2 + (d_2 - d_1)^2} \tag{1}$$
2) Manhattan - Sum of the absolute differences of the corresponding coordinates. It is simply the sum of the vertical and horizontal distances between the two points (c1, d1) and (c2, d2).
$$\mathrm{dist} = |c_2 - c_1| + |d_2 - d_1| \tag{2}$$
3) Mahalanobis - Calculates the distance between points in multivariate space. Initially, the function transforms the columns into uncorrelated variables and scales them to unit variance, then finally calculates the Euclidean distance.
$$\mathrm{dist} = \sqrt{(y - m)^{T} S^{-1} (y - m)} \tag{3}$$
Where,
y - vector of observation,
m - vector of mean values of independent variables,
$S^{-1}$ - inverse covariance matrix of independent variables.
4) Cosine Similarity - A function used to measure the relation of objects irrespective of their size. It calculates the cosine of the angle between two vectors in a multi-dimensional space. Two points can be highly similar even though they are far apart as measured by the previously discussed functions. Here, the smaller the angle between (B, C), the higher the similarity.
$$\cos\theta = \frac{B \cdot C}{\|B\|\,\|C\|} = \frac{\sum_{i} B_i C_i}{\sqrt{\sum_{i} B_i^2}\,\sqrt{\sum_{i} C_i^2}} \tag{4}$$
5) Jaccard - Calculation of "Intersection over Union". It determines the similarity of finite sample sets as the size of the intersection divided by the size of the union of the sets (A, B).
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{5}$$
6) Minkowski - Works in a normed vector space containing (A, B), which includes real and complex numbers. The norm satisfies three conditions: 'Zero vector', where only the zero vector has zero length; 'Scalar factor', where scaling a vector changes its length but not its direction; and 'Triangle inequality', where the straight line is always the shortest path between two points.
$$\mathrm{dist} = \left( \sum_{i} |A_i - B_i|^{p} \right)^{1/p} \tag{6}$$
Where p is the integer power to which each difference is raised.
7) Chebyshev - The maximum of the absolute differences between (E, F) along any coordinate. It is also known as 'chessboard or maximum value length'. It is also used for ordinal and quantitative variables.
$$\mathrm{dist} = \max_{i} |E_i - F_i| \tag{7}$$
The measures listed above are the ones commonly used in supervised learning techniques.
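The seven measures can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Rapid Miner implementation: the function names are chosen here, vectors are assumed to be non-empty and of equal length, the cosine function assumes non-zero vectors, and the Mahalanobis function expects the inverse covariance matrix to be supplied by the caller.

```python
import math
from typing import Sequence, Set

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def mahalanobis(y: Vector, m: Vector, s_inv: Sequence[Vector]) -> float:
    # s_inv is the inverse covariance matrix, given as a list of rows.
    diff = [yi - mi for yi, mi in zip(y, m)]
    n = len(diff)
    quad = sum(diff[i] * s_inv[i][j] * diff[j]
               for i in range(n) for j in range(n))
    return math.sqrt(quad)


def cosine_similarity(b: Vector, c: Vector) -> float:
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(b, c))
    norm_b = math.sqrt(sum(x * x for x in b))
    norm_c = math.sqrt(sum(y * y for y in c))
    return dot / (norm_b * norm_c)


def jaccard(a: Set, b: Set) -> float:
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def minkowski(a: Vector, b: Vector, p: int) -> float:
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def chebyshev(e: Vector, f: Vector) -> float:
    return max(abs(x - y) for x, y in zip(e, f))


origin, point = (0, 0), (3, 4)
print(euclidean(origin, point))   # 5.0
print(manhattan(origin, point))   # 7
print(chebyshev(origin, point))   # 4
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```

With p = 1 the Minkowski function reduces to the Manhattan measure and with p = 2 to the Euclidean one; with an identity matrix as `s_inv`, the Mahalanobis function also reduces to the Euclidean measure.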
3. Literature Review
Ali Seyed Shirkhorshidi et al. [1] did a comparative study on similarity and dissimilarity measures for continuous data. Fifteen public datasets were used for the comparison. These were classified into low and high-dimensional categories. The measures are applicable to classification and clustering techniques. A total of seven hundred and twenty experiments were made in this research. The accuracy of each measure is assessed in terms of the Rand index, and the best measure for each of the low and high-dimensional categories is discussed. The overall results indicate that the Average Divergence measure gives the topmost accuracy for all clustering algorithms.
Anandan Bellie et al. [2] analyse the behavior of university students to improve the quality of internet service. The quality of the internet is addressed because the daily usage of the net has increased, and this can be solved when the existing user behavior pattern is known. The paper takes the weblog information from the university computer lab and employs the K-Means algorithm with a proposed norm in the Weka tool to group students with similar characteristics while accessing the contents. This analysis is helpful for the management of university internet infrastructure and page personalization.
Anupama Prasanth et al. [3] present a study on the web personalization approach. Initially it focuses on personalization strategies and then on one of the most promising techniques of automatic personalization and its various advances. The paper lists these advances and illustrates them through various strategies.
Komal Maher et al. [6] did an experiment with a similarity measure and Cosine similarity for text processing. The effectiveness is evaluated on real-world data for classification and clustering problems. The measure reflects the degree of closeness or separation of the target objects. Three different datasets are used for the experiment. The results show that the proposed measure for text processing achieves higher accuracy than the others but takes more processing time to converge.
Manisha Kumari et al. [7] reviewed the Nearest neighbor classification technique in this domain. It is applied to log details and the performance of the algorithm is measured. The algorithm is sensitive to the value fixed for K. The only approach available is to run the algorithm for different K and choose the best one. There is no separate training phase even though the data is split into training and test sets. All the computations are done in the test phase. The findings of this review can be applied in discovering user navigation patterns, recommendation systems, computer behavior studies and content improvement.
Vedpriya et al. [11] proposed a new model for predicting consumer behavior from the files accessed from the log server, proxy server and client-side cache. K-means clustering and Regression Analysis algorithms are used to predict future document accesses. Using the weblog files and the consumer's current navigation pattern, the suggested system predicts the next files in the form of a recommendation list. The algorithm is augmented with weights, and regression analysis is used for prediction. Finally, the performance is evaluated in terms of accuracy, error rate, and time and space complexity.
4. Methodology
A. Existing Methodology
The existing technique applies K-Nearest Neighbor with Random initialization of 'K' [12] and the Euclidean function.
K-Nearest Neighbor -
The classification of an unknown [16] tuple is accomplished by analyzing the classes of its closest neighbors. A fixed number of neighbors, identified by 'k', vote in the classification of an instance, where 'k' is a positive integer. It is a non-parametric lazy learning method that does not require any prior knowledge regarding classification. It yields the closest records from the training samples, which have the highest priority, and it can be used with both discrete and continuous data. The procedure of classification starts with a dataset that contains a certain number of attributes. The dataset is divided into two parts, training and testing. The training part is given as input to the algorithm while the testing part is used to assess it. The division can be done using various methodologies like 'Random sampling, Percentage split, Cross validation, Hold-out method' etc.
The algorithm classifies any new instance by using the training tuples similar to it, and all the computations are done at the time of classification. The training tuples are viewed as points in an n-dimensional space; when an unknown instance comes for classification, the algorithm finds the k closest points.
Random Initialization of ‘K’:
‘K’ is fixed by the user. There is no standard rule for its initialization. The only way is to initialize it randomly, or to run the algorithm for different 'K' values and pick the best suited one with high accuracy.
The most common measure, the Euclidean distance, is used to find the nearest neighbors by using Equation (1). It is the ordinary straight-line distance between two points. It is also related to the older Pythagorean theorem. It works in n-dimensional space.
Procedure KNN
Step 1: Determine the parameter ‘K’ randomly.
Step 2: Calculate the distance between the instance and all the training samples by Eq. (1).
Step 3: Sort the distances and determine the closest points on the basis of the Kth minimum distance.
Step 4: Select the category or vote of the nearest neighbors.
Step 5: Return the mode of the ‘K’ labels.
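The five steps can be sketched as follows. This is a minimal illustration, assuming the training data is a list of (features, label) pairs; the function name `knn_classify`, the toy points and the labels are hypothetical, and the `distance` parameter is added here only so that another measure can be swapped in.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k, distance=euclidean):
    """Classify `query` by majority vote of its k nearest training samples."""
    # Step 2: distance from the query to every training sample.
    scored = [(distance(features, query), label) for features, label in train]
    # Step 3: sort by distance and keep the k closest samples.
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # Steps 4 and 5: collect the votes and return the mode of the k labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: three samples per class.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((4.8, 5.2), "B"),
]

# Step 1: 'K' is picked at random, here from 1 to 3.
k = random.randint(1, 3)
print(k, knn_classify(train, (1.1, 0.9), k))
```

Because every training sample is scored for each query, the cost of one classification grows with the size of the training set, which is the first disadvantage listed below.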
Advantages
Easy to implement and interpret.
Applicable to classification and regression problem.
Able to handle multi class problem.
Disadvantages
At run time it uses all the training data, so it is a slow learner.
'K' is chosen randomly, or several values are tried, which leads to high processing time.
The distance measure finds the interval between two points. Hence it evaluates only the physical length and not the similarity between them. It also takes high calculation time for the squared function.
B. Proposed method
The proposed method enhances the initialization of 'K' and the distance equation to improve the existing one. The work is done in two phases.
Phase - 1: Logarithmic function for initialization of 'K'
A logarithmic function [13] is used to initialize ‘K’. A logarithm is the exponent to which a base must be raised to get a given number. If 'a' is the logarithm of 'm' to the base 'b', that is, if $b^a = m$, it is written as $a = \log_b m$. The logarithmic function to the base 10 is called the common logarithm, and it has many real-life applications in acoustics, information theory, computer science, electronics, earthquake analysis and population prediction. It compresses its input, reducing wide-ranging quantities to a small scale.
Some Applications of logarithmic function
In chemistry, pH is a logarithmic measure used for acidity.
It describes the frequency ratios of music intervals.
Discrete logarithm is used for public-key cryptography.
Used to express decibel in ratios for signal power and amplitude.
To compress large scale scientific data, logarithmic scales are used.
Fixing Value of K
It is fixed by the logarithm of the total number of instances.
$$K = \log_{10}(n) \tag{8}$$
Where 'n' is the number of instances. The common logarithmic function, which has base 10, is used.
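A small sketch of Eq. (8) is given here. Since 'K' must be a positive integer and the paper does not state how the logarithm is rounded, this sketch assumes rounding down with a lower bound of 1; the function name `initial_k` is hypothetical.

```python
import math


def initial_k(n_instances: int) -> int:
    """Initialize K as the common logarithm of the number of instances.

    Assumption: the result is rounded down and never allowed below 1.
    """
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    return max(1, math.floor(math.log10(n_instances)))


for n in (5, 500, 1500):
    print(n, initial_k(n))
```

Under this rounding assumption, any dataset with between 100 and 999 instances gives K = 2.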
Phase - 2: Enhanced Distance metric in KNN (EDKNN)
The existing norm measures the length of the given line segment. The distance [18] between the endpoints (e, f) of the line is taken as the length of the line segment,
$$\mathrm{dist}(e, f) = \sqrt{\sum_{i=1}^{n} (e_i - f_i)^2} \tag{9}$$
This work puts forth an enhanced measure that uses the slope of the line instead of its length, so that the original similarity is assessed.
Finding Slope of the line
Slope defines the steepness of the line. It is calculated as the ratio of the rise along the vertical axis to the run along the horizontal axis for the points (a1, b1) and (a2, b2).
$$\mathrm{slope} = \frac{b_2 - b_1}{a_2 - a_1} \tag{10}$$
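The slope calculation can be sketched as below. This covers only the rise-over-run ratio described above, not the Enhanced Euclidean norm itself; the function name `slope` and the choice to return infinity for a vertical line (where the ratio is undefined) are assumptions made for this illustration.

```python
import math


def slope(p1, p2):
    """Slope of the line through p1 = (a1, b1) and p2 = (a2, b2)."""
    (a1, b1), (a2, b2) = p1, p2
    if a2 == a1:
        # Vertical line: the ratio is undefined, so infinity is
        # returned here as a convention.
        return math.inf
    return (b2 - b1) / (a2 - a1)


s = slope((1, 2), (3, 6))   # rise of 4 over a run of 2
print(s)                    # 2.0
# Angle of the line with the horizontal axis, in degrees.
print(math.degrees(math.atan(s)))
```

No squaring or square root is involved, only two subtractions and one division per pair of points.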
Enhanced Euclidean norm
EEN = (11)
Equation (10) finds the angle between the points instead of the length and reduces the complexity of computation by eliminating the squared function.
Procedure EDKNN
Step 1: Determine ‘K’ by Eq. (8).
Step 2: Calculate the distance between the instances by Eq. (11).
Step 3: Sort the distances and identify the most closely associated items on the basis of the Kth minimum distance.
Step 4: Select the category or vote of the neighbors.
Step 5: Return the mode of the K labels.
Advantages
There is no need to run the algorithm for different 'K', which saves processing time.
It finds the distance between the points accurately and quickly, and it also shows the rate of change.
It reduces the computation by eliminating the squaring of the points.
5. Results and Discussion
The classification algorithm with the Enhanced Euclidean norm is implemented in Rapid Miner for the Pattern discovery phase with weblog records.
A. Dataset
The dataset is taken from Kaggle [14]. It is a real weblog designed for a university and consists of three months of records.
B. Attributes
The dataset consists of four [15] attributes:
1. User Address
2. Login Time
3. User Communication
4. Status code
C. Results
1) Screen in Excel
CSV (comma-separated values) is a format supported by Excel that can be uploaded directly into analyzer tools. It eliminates the spacing of the Excel format and thus reduces the file size.
Fig. 1. Screen in excel sheet in CSV format
Fig. 1 shows the log records in an Excel sheet with four attributes.
2) Results of Phase 1
Phase 1 initializes 'K' with Equation (8). 'K' is calculated as 2 on the basis of the number of instances.
Table 1. Classification results of Phase 1
Parameters         Existing with K=1          Proposed Phase - 1 (K=2)
                   User ID    Status code     User ID    Status code
Accuracy           82.25      83.23           84.23      86.01
Error rate         17.75      16.77           15.77      13.99
Recall             81.91      82.91           83.85      85.62
Precision          82.51      83.67           84.67      86.45
Time in seconds    17.98      15.60           21.86      18.64
Table 1 shows the results of the existing method and of Phase 1 of the proposed method. The results reveal the superiority of the proposed one. The time increases in Phase 1 because the proposed method has the extra calculation of the logarithmic function and the search for 'K' value 2, but it achieves higher accuracy.
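The four evaluation measures reported in the tables can be computed from actual and predicted labels as sketched below. The paper does not state how recall and precision are averaged over classes, so this sketch assumes an unweighted (macro) average; the function name `classification_metrics` and the sample labels are hypothetical and are not taken from the dataset.

```python
def classification_metrics(actual, predicted):
    """Return accuracy, error rate, recall and precision as percentages.

    Assumption: recall and precision are macro-averaged over the classes.
    """
    labels = sorted(set(actual) | set(predicted))
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    accuracy = 100 * correct / len(pairs)

    recalls, precisions = [], []
    for label in labels:
        tp = sum(a == label and p == label for a, p in pairs)
        fn = sum(a == label and p != label for a, p in pairs)
        fp = sum(a != label and p == label for a, p in pairs)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)

    return {
        "accuracy": accuracy,
        "error_rate": 100 - accuracy,
        "recall": 100 * sum(recalls) / len(labels),
        "precision": 100 * sum(precisions) / len(labels),
    }


# Hypothetical status code labels for five requests.
actual = [200, 200, 404, 404, 500]
predicted = [200, 404, 404, 404, 500]
print(classification_metrics(actual, predicted))
```

The error rate is the complement of the accuracy, which agrees with the tabulated values (for example, 100 - 82.25 = 17.75).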
3) Results of Phase 2
Phase 2 proposes an enhanced distance metric as in Eq. (11).
Table 2. Classification results of Phase 2
Parameters         Proposed Phase - 2 (EDKNN)
                   User ID    Status code
Accuracy           87.21      89.98
Error rate         12.79      10.02
Recall             86.89      89.64
Precision          87.65      90.44
Time in seconds    15.21      12.07
Table 2 shows the results of Phase 2. The Phase 2 method outperforms both the existing method and Phase 1, and the time is also reduced since it eliminates the squared calculation.
4) Performance Analysis
a) Login Identification
Fig. 2. User identification
Fig 2. shows the user identification results for existing and proposed Phase 1 and 2.
b) HTTP Code
Fig. 3. Status Code
Fig 3. shows the Status code classification results for existing and proposed Phase 1 and 2.
6. Conclusion and Future Scope
Searching for usage patterns on the Web applies retrieval techniques and algorithms to reveal hidden information directly from internet contents. The information obtained helps to identify the behavior pattern in accessing links, and it can also be used to improve the site if the status code is analyzed. This paper analyses both the Login ID and the status code in two phases by implementing the supervised learning technique Nearest neighbor with enhancements in the initialization of the 'K' value and in the distance measure to overcome the existing drawbacks.
The weblog information is taken from the Kaggle dataset repository, and the results show that the recommended process outperforms the existing algorithm.
In future, the algorithm will be further developed with proper enhancements to increase the accuracy level. The procedure can be tried on other datasets. Also, the enhancement will be assessed with other evaluation metrics.
7. References
1. Ali Seyed Shirkhorshidi, Saeed Aghabozorgi, Teh Ying Wah, "A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data", PLOS ONE, DOI: 10.1371/journal.pone.0144059, December 11, 2015.
2. Anandan Bellie, "Web Usage Analysis of University Students to Improve the Quality of Internet Service", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 5, May 2015.
3. Anupama Prasanth, "Web Personalization using Web Mining Techniques", International Journal of Current Engineering and Scientific Research, Vol. 3, Issue 3, 2016.
4. Joshila Grace L.K., Maheswari V., Dhinaharan Nagamalai, "Analysis of Web Logs and Web User in Web Mining", International Journal of Network Security & Its Applications (IJNSA), Vol. 3, No. 1, January 2011.
5. Kaviarasan, Hemapriya K., Gopinath K., "Semantic Web Usage Mining Techniques for Predicting Users' Navigation Requests", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 3, Issue 5, May 2015.
6. Komal Maher, Madhuri S. Joshi, "Effectiveness of Different Similarity Measures for Text Classification and Clustering", International Journal of Computer Science and Information Technologies, Vol. 7 (4), 2016.
7. Manisha Kumari, Sarita Soni, "A Review of Classification in Web Usage Mining using K-Nearest Neighbour", Advances in Computational Sciences and Technology, ISSN 0973-6107, Volume 10, Number 5, 2017.
8. Margaret H. Dunham, Sridhar S., "Data Mining: Introductory and Advanced Topics", Pearson Education.
9. Sahaj Chavda, Saurabh Jain, Nikunj Panchal, Manisha Valera, "Recent Trends and Novel Approaches in Web Usage Mining", International Research Journal of Engineering and Technology, Vol. 04, 2017.
10. Sudheer Reddy K., "An Effective Methodology for Pattern Discovery in Web Usage Mining", International Journal of Computer Science and Information Technologies, Vol. 3 (2), 2012, 3664-3667.
11. Vedpriya Dongre, Jagdish Raikwal, "An Improved User Browsing Behavior Prediction using Regression Analysis on Web Logs", International Journal of Computer Applications, Volume 120, No. 19, 2015.
12. Vidyapriya V., Pushpa V., "Identifying Web Users from Weblogs using Classification Algorithms", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 4, Issue 7, 2016.
13. en.wikipedia.org/wiki/Logarithm.
14. www.kaggle.com/shawon10/web-log-dataset.
15. en.wikipedia.org/wiki/Common_Log_Format.
16. en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.
17. en.wikipedia.org/wiki/Web_mining.
18. en.wikipedia.org/wiki/Euclidean_distance.