JECET; June – August-2013; Vol.2.No.3, 730-740.
Journal of Environmental Science, Computer Science and Engineering & Technology
An International Peer Review E-3 Journal of Sciences and Technology
Available online at www.jecet.org Computer Science
Research Article
A Modified FP-Tree Algorithm for Generating Frequent Access Patterns
Harendra Singh, Ashish Kumar Srivastava, Sitendra Tamrakar
Department of Computer Science & Engineering, NRI Institute of Science and Technology, Bhopal; RGPV Technical University, Bhopal, INDIA
Received: 30 June 2013; Revised: 24 July 2013; Accepted: 31 July 2013
Abstract: Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web server data, in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of web users along with their browsing behavior on web sites. Web usage mining can itself be classified further depending on the kind of usage data. The proposed work is an efficient algorithm for generating frequent access patterns from the access paths of the users. The algorithm is optimized to take less time than the existing algorithms and stores the access paths in a compressed format. Its main aim is to reduce execution time and memory utilization compared to the existing algorithms. The frequent access patterns show the sequences of web pages that are frequently navigated by the user. The proposed algorithm does not generate any candidate sets, but it generates a larger number of patterns, so the number of tree traversals is higher. The results obtained show that the proposed algorithm takes 25% less time than the Apriori algorithm in all instances. When the minimum support threshold is higher, the execution time of both algorithms is lower.
Keywords: Data cleaning, FP Growth, FP-tree, Web Usage Mining, Association rule, Apriori algorithm
INTRODUCTION
The Web is a huge, explosive, diverse, dynamic and unstructured data repository, which supplies an incredible amount of information and also raises the complexity of dealing with that information from the different perspectives of users, web service providers and business analysts. Users want effective search tools to find relevant information easily and precisely. Web Usage Mining is usually done using data mining techniques to determine the frequent access patterns of users. There are two phases in Web Usage Mining. The first phase is data preprocessing, which starts from the web server log files; these usually record a full history of the file requests made by users. There are several preprocessing tasks, i.e. Data Cleaning, User Identification, Session Identification and Transaction Identification. These tasks, which are performed on data collected from web server logs, are computationally intensive and time consuming. The primary objective of Web Usage Mining is to discover interesting patterns in accesses to the various web pages within the web space associated with a particular server. In order to successfully apply generic data mining techniques to web data, one must first transform the data into a suitable form. In particular, unlike market basket analysis, where a single transaction is defined naturally according to customer purchase activity, in web data the notion of a transaction must be defined based on the properties of the application domain. The proposed work uses each user access path as one transaction.
Web Usage Mining draws on the following components: server logs, Web pages, Web hyperlink structures, on-line market data, and other information [1]. A brief description of these is given below:
1. Web logs: When people browse a Web server, the server produces three kinds of log documents: server logs, error logs, and cookie logs. By analyzing these log documents we can mine access information.
2. On-line market data: used for storing e-commerce information in commerce sites.
3. Web pages: Most existing web mining methods are applied to Web pages that conform to the HTML standard.
4. Web hyperlink structures: The Web pages are all connected by hyperlinks, which carry very important mining information, so Web hyperlinks are very authoritative resources.
5. Other information: this is composed of user registrations, which can help mine better.
The preprocessing methods used in many existing tools (such as the WEBMINER system) are all designed to function with only the information supplied by the Common Log Format, specified as part of the HTTP protocol by CERN and NCSA [2].
Most of the work involves determining frequent traversal patterns or large reference sequences from the physical layout type of graph. Path analysis can be used to determine the most frequently visited paths in a Web site.
1.1 Access Paths & Patterns of Web Usage Mining
There are many different types of graph that can be formed for performing path analysis, since a graph represents some relation defined on web pages. Path analysis can be used to determine the most frequently visited paths in a website. The output of the preprocessing step is the set of user access paths. The example below shows the difference between user access paths and user access patterns [3]. The nodes represent the web pages and the links represent the navigation between web pages.
The user access patterns follow only forward references.
The user access paths follow both forward and backward references of web pages.
The access path followed by a user (Uid) through a certain website: A-B-C-D-B-G-E-H-G-C-A-I-K-I-D.
The user access patterns are: A-B-C-D, A-B-G-E-H, A-I-K-I-D.
The frequent access patterns have to satisfy the following condition: the number of occurrences of an access pattern is greater than or equal to the minimum support. The minimum support is expressed as the number of occurrences per total number of transactions. It is also given as an input for generating user access patterns.
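The minimum support condition can be illustrated with a minimal Java sketch. It assumes, for illustration only, that a pattern "occurs" in a transaction when its pages appear as a contiguous run in that access path, and that the minimum support is given as a fraction of the total transactions; the class and method names (SupportCheck, occurrences, isFrequent) are hypothetical and not taken from the paper.

```java
import java.util.Collections;
import java.util.List;

class SupportCheck {
    // Counts the access paths that contain the pattern as a contiguous run of pages
    // (assumption: contiguous matching; the paper does not spell out the matching rule).
    static int occurrences(List<List<String>> paths, List<String> pattern) {
        int count = 0;
        for (List<String> path : paths) {
            if (Collections.indexOfSubList(path, pattern) >= 0) {
                count++;
            }
        }
        return count;
    }

    // A pattern is frequent when occurrences / total transactions >= minSupport.
    static boolean isFrequent(List<List<String>> paths, List<String> pattern, double minSupport) {
        if (paths.isEmpty()) {
            return false;
        }
        return (double) occurrences(paths, pattern) / paths.size() >= minSupport;
    }
}
```

With three transactions A-B-C-D, A-B-G-E-H and A-I-K-I-D and a minimum support of 0.5, the pattern A-B occurs in two of three transactions and is frequent, while I-D occurs in only one and is not.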
LITERATURE SURVEY
Web mining [2, 4] is the application of data mining techniques to automatically discover and extract information from the web. Web usage mining has various application areas such as web prefetching, site reorganization and web personalization. The most important task in web usage mining is the discovery of useful patterns from web log data using pattern discovery techniques such as the Apriori and FP-Growth algorithms [1]. The Apriori algorithm is a well-known technique for web log mining. Many algorithms exist for generating frequent access patterns from access paths, e.g. the Apriori algorithm, the FP-Tree algorithm, etc. [5]. However, these algorithms need more database scans for generating user access patterns, and so take more time and more memory. An improved version of the original Apriori-All algorithm is developed for sequence mining in [6]. It adds the property of the user ID during every step of producing the candidate set and every step of scanning the database, to decide whether an item in the candidate set should be used to produce the next candidate set. The algorithm reduces the size of the candidate set in order to reduce the number of database scans.
Apriori Algorithm: The Apriori algorithm was developed for rule mining in large transaction databases.
A path is a non-empty set of items. The algorithm decomposes the problem of mining patterns into two steps:
In the first step, find all combinations of items that have transaction support above the minimum support. Call those combinations frequent patterns. In the second step, use the frequent patterns to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent patterns, then we can determine whether the rule AB => CD holds by computing the ratio r = support(ABCD)/support(AB). The rule holds only if r >= minimum_confidence.
Note that the rule will have minimum support because ABCD is frequent. The Apriori algorithm used for finding all frequent patterns is described below. It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-patterns (patterns with 1 item).
For this step the algorithm scans the database once and stores all information about the web page names and their page counts in the users' access paths.
A subsequent pass, say pass k, consists of two phases. First, the frequent patterns L_{k-1} (the set of all frequent (k-1)-patterns) found in the (k-1)th pass are used to generate the candidate sets C_k, using the apriori-gen() function. This function first joins L_{k-1} with L_{k-1}, the joining condition being that the lexicographically ordered first k-2 items are the same. Next, it deletes from the join result all those patterns that have some (k-1)-subset that is not in L_{k-1}, yielding C_k. The algorithm now scans the database.
For each transaction, it determines which of the candidates in C_k are contained in the user paths, using a hash-tree data structure, and increments the count of those candidates. At the end of the pass, C_k is examined to determine which of the candidates are frequent, yielding L_k. The algorithm terminates when L_k becomes empty.
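The join and prune phases of apriori-gen() described above can be sketched in Java as follows. This is an illustrative sketch, not the authors' implementation: it assumes every pattern is a lexicographically sorted list of item names and that all patterns passed in have the same length k-1; the class name AprioriGen is hypothetical, and the hash-tree used for counting is not shown.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AprioriGen {
    // Builds the candidate k-patterns from the frequent (k-1)-patterns.
    // Assumption: each pattern is sorted and all patterns have equal length.
    static List<List<String>> aprioriGen(List<List<String>> prev) {
        Set<List<String>> prevSet = new HashSet<>(prev);
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> a : prev) {
            for (List<String> b : prev) {
                int n = a.size();
                // Join: the first k-2 items must match and a's last item must precede b's.
                if (!a.subList(0, n - 1).equals(b.subList(0, n - 1))) {
                    continue;
                }
                if (a.get(n - 1).compareTo(b.get(n - 1)) >= 0) {
                    continue;
                }
                List<String> candidate = new ArrayList<>(a);
                candidate.add(b.get(n - 1));
                // Prune: keep the candidate only if every (k-1)-subset is frequent.
                if (allSubsetsFrequent(candidate, prevSet)) {
                    candidates.add(candidate);
                }
            }
        }
        return candidates;
    }

    // Checks every subset obtained by dropping exactly one item from the candidate.
    static boolean allSubsetsFrequent(List<String> candidate, Set<List<String>> prevSet) {
        for (int i = 0; i < candidate.size(); i++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(i); // removes the item at index i
            if (!prevSet.contains(subset)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, with the frequent 3-patterns {1,2,3}, {1,2,4}, {1,3,4}, {1,3,5} and {2,3,4}, the join produces {1,2,3,4} and {1,3,4,5}; the prune phase drops {1,3,4,5} because its subset {3,4,5} is not frequent, leaving {1,2,3,4} as the only candidate.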
FP-Tree Algorithm: The FP-Tree algorithm maintains a frequent pattern tree (FP-Tree) of the database. It is an extended prefix-tree structure, storing crucial quantitative information about frequent sets. The tree nodes are frequent items, arranged in such a way that more frequently occurring nodes have a better chance of sharing nodes than less frequently occurring ones. The method starts from frequent 1-itemsets as an initial suffix pattern and examines only its conditional pattern base (a subset of the database), which consists of the set of frequent items co-occurring with the suffix pattern.
The algorithm constructs the conditional FP-tree and performs mining on this tree.
The FP-tree algorithm spends more time on the recursive calls it makes. For generating frequent patterns, the pointer has to traverse all the nodes, so this mining step can take more time than in the Apriori algorithm. On the other hand, Apriori takes at most n-1 database scans whereas FP-tree takes at most two, so overall the Apriori algorithm takes more time and more memory than the FP-tree algorithm. The existing algorithms include the Apriori algorithm, the Partition Based Approach and the FP-Tree algorithm, and each has its own advantages and drawbacks. If all the transactions are different then the Apriori algorithm is good; otherwise the FP-tree algorithm is good. If the user access paths are common then the FP-tree algorithm is very efficient.
PROBLEM IDENTIFICATION
Extracting knowledge by mining the accesses made to web pages on a web server is always challenging.
Many algorithms exist for generating frequent access patterns from access paths, but these algorithms take more time and more memory. An improved version of the original Apriori-All algorithm is developed for sequence mining in [6]. Apriori-based mining of a sequence database involves the following steps.
Step1: Let SEQ be a sequence database, if a sequence G is not a T-pattern of SEQ, any super-sequence of G cannot be a T-pattern of SEQ.
Step2: If e is a frequent event in the set of prefixes of sequences in Access Pattern, w.r.t pattern P, sequence eP is an access pattern of Access Sequence Database.
Drawbacks of the Apriori algorithm
It takes at most n-1 scans for generating candidate sets.
If the sequence length is large, the algorithm becomes complex for finding access patterns.
The problem addressed here is therefore the implementation of an efficient algorithm for finding frequent access patterns from the given access paths of users. These access paths are generated from the web log records by applying data preprocessing.
PROPOSED SCHEME
In the present study, a data mining algorithm has been developed that can be applied to customers coming to a website.
Access log record: The access log record resides on the web server. The data is stored as text in an unstructured format. First, load the data from the text format into a database.
Data Cleaning: Elimination of the items deemed irrelevant can be reasonably accomplished by checking the suffix of the URL name. For instance, all log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG, and map can be removed. Scripts such as “count.cgi” can also be removed.
Step 1: Write a query to eliminate gif, jpeg, GIF, JPEG, jpg and JPG entries from the database.
Remove all the rows that have one of the above suffixes in the Request field of the database.
Step 2: Using the StringTokenizer class in Java, divide the agent field into two fields, i.e. the browser with its version, and the operating system.
Step 3: Remove irrelevant information from the request field. We want only the page requested by the user, but the request is in the form below.
GET/biblio/riviste/img/r-t/rew4.html.
From this we have to remove the prefix of rew4.html and store rew4.html back in the Request field of the database. The database is now ready for preprocessing.
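The suffix filter of Step 1 and the prefix removal of Step 3 can be sketched in Java as below. This is only an illustration working on strings rather than on database rows; the class name LogCleaner, its method names, and the inclusion of "cgi" in the ignored suffixes are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class LogCleaner {
    // Suffixes treated as irrelevant; "cgi" is included as an assumption for scripts.
    static final Set<String> IGNORED =
            new HashSet<>(Arrays.asList("gif", "jpeg", "jpg", "map", "cgi"));

    // Returns true when the requested URL ends in one of the ignored suffixes,
    // compared case-insensitively so that GIF and gif are both matched.
    static boolean isIrrelevant(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        return IGNORED.contains(url.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Strips the HTTP method and the directory prefix, keeping only the page name.
    static String pageName(String request) {
        String r = request.trim();
        if (r.startsWith("GET")) {
            r = r.substring(3).trim();
        }
        r = r.split("\\s+")[0]; // drop anything after the path, e.g. a protocol field
        int slash = r.lastIndexOf('/');
        return slash >= 0 ? r.substring(slash + 1) : r;
    }
}
```

Applied to the request shown above, pageName returns rew4.html, and isIrrelevant flags entries such as /img/logo.GIF for removal.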
User Identification
Step1: First create the website topology.
Take each page as one node and link it to the previous page and the next page. The Java class below represents the Webpage.
public class Webpage {
    protected String pagename;
    protected TreeNode parent;   // parent node
    protected TreeNode fchild;   // first child
    protected TreeNode sibling;  // sibling node
}
Step 2: Categorize the users by the IP address stored in the database.
Step 3: Find different users with the same IP address, i.e. two users working at the same time from the same IP address but using different browsers.
This is done by using the different browser names and browser versions at different time periods.
Then categorize the users again by their different operating systems.
Step 4: Find the unexpected requests which are not related to previous requests from the same IP address.
By using the site topology we can easily find the different users working from the same operating system.
User Session Identification: Every user visits a site for a specific time period. This is called one session of the user. The session time of each user is taken as 1 hour in our implementation.
Step 1: Divide all transactions into users and sessions. The User ID and Session ID together become the primary key.
Step 2: Find the access paths of users by using the site topology together with the User ID and Session ID of the user.
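The division of one user's requests into sessions can be sketched in Java as follows. The sketch assumes that request times are available in seconds and sorted in ascending order, and that a new session starts once a request arrives more than one hour after the first request of the current session; this interpretation of the 1-hour session time, and the names Sessionizer and assignSessions, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class Sessionizer {
    static final long SESSION_SECONDS = 3600; // session time of 1 hour

    // Returns a 1-based session number for each request of a single user.
    // Assumption: timestamps are in seconds and sorted in ascending order.
    static List<Integer> assignSessions(long[] timestamps) {
        List<Integer> sessionIds = new ArrayList<>();
        int session = 0;
        long sessionStart = 0;
        for (long t : timestamps) {
            // Open a new session for the first request, or when the hour has elapsed.
            if (session == 0 || t - sessionStart > SESSION_SECONDS) {
                session++;
                sessionStart = t;
            }
            sessionIds.add(session);
        }
        return sessionIds;
    }
}
```

The pair (User ID, session number) produced this way can then serve as the key under which the requests of one access path are grouped.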
Finding Frequent Access Patterns: For constructing FAP tree, this work used two classes for generating Header Table and FAP Tree Node. The below class represents the Header Table Node, which gives the webpage name and its number of occurrences in the total paths.
Data structures:
class HeaderTableNode {
    protected String pagename;       // page name in the page table
    protected int count;             // page count of the page
    public TreeNode seqptr;          // points to the tree node
    protected HeaderTableNode succ;  // next header table node
}
The below class represents the TreeNode, which gives the webpage name and its number of occurrences in the total paths.
This tree stores all the paths in a compressed data format. The height of the tree is not more than the number of web pages in the website.
If the paths are the same, the tree does not create new nodes; it simply increments the counter value.
class TreeNode {
    protected String pagename;
    protected int count;         // number of times the page was accessed
    protected TreeNode parent;   // parent node
    protected TreeNode fchild;   // first child
    protected TreeNode sibling;  // sibling node
    public TreeNode seqptr;
}
In the above class, pagename represents the name of the webpage and count represents the number of times the users accessed that page. parent points to the parent of the node; this field is used when generating frequent access patterns by backward traversal. fchild represents the child node and the sibling field represents the sibling of that tree node.
There are many existing algorithms for generating frequent access patterns from access paths, but they are less efficient in terms of execution time and memory requirement. The proposed algorithm is a modification of the FP-tree algorithm that does not use recursion for generating frequent patterns. It therefore takes less execution time for access paths that do not contain uncommon items. This is explained below.
Improved Algorithm: This algorithm is divided into two steps:
Step 1: Construct the frequent access pattern tree according to the access paths derived from the user session files, and record the access count of each page.
Procedure FAP_Tree(T, p)
begin
    Create_tree(T);   // construct the root of the FAP-tree, labelled "null"
    while p <> null do
    begin
        if p.name is the same as the name of an ancestor n of T then
        begin
            increment n.count;
            T = n;
        end
        else if p.name is the same as the name of a child c of T then
        begin
            increment c.count;
            T = c;
        end
        else
            Insert_tree(T, p);   // insert a new node for p as a child of the current node T; the new node becomes the current node
        p = p.next;
    end
end
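The construction procedure above can be sketched in Java as follows. This is an illustrative reading of the pseudocode, not the authors' code: it assumes that each access path starts again from the root, that a newly inserted node becomes the current node, and that an "ancestor" excludes the current node itself. The class FapTree and the helper countAt are hypothetical names, and a child list is used here in place of the fchild/sibling pointers.

```java
import java.util.ArrayList;
import java.util.List;

class FapTree {
    static class Node {
        final String page;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        int count;

        Node(String page, Node parent) {
            this.page = page;
            this.parent = parent;
        }
    }

    final Node root = new Node(null, null); // the "null" root

    // Inserts one access path, given as page names in visiting order.
    void insert(String... path) {
        Node current = root;
        for (String page : path) {
            Node hit = findAncestor(current, page);   // backward reference
            if (hit == null) {
                hit = findChild(current, page);       // path already stored
            }
            if (hit == null) {
                hit = new Node(page, current);        // new branch
                current.children.add(hit);
            }
            hit.count++;
            current = hit;
        }
    }

    static Node findAncestor(Node from, String page) {
        for (Node n = from.parent; n != null; n = n.parent) {
            if (page.equals(n.page)) {
                return n;
            }
        }
        return null;
    }

    static Node findChild(Node from, String page) {
        for (Node c : from.children) {
            if (page.equals(c.page)) {
                return c;
            }
        }
        return null;
    }

    // Count stored at the node reached by following the pages from the root, or 0.
    int countAt(String... pages) {
        Node n = root;
        for (String p : pages) {
            n = findChild(n, p);
            if (n == null) {
                return 0;
            }
        }
        return n.count;
    }
}
```

Inserting a path such as A-B-C-D-B-E creates the chain A, B, C, D, then moves back up to the existing node B (incrementing its count) and adds E as a second child of B, so backward references do not create duplicate nodes.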
Step 2: The FAP_growth function is used to mine both long and short access patterns on the FAP-tree.
In this step the prefix method is used for finding long access patterns from the page table entries created in Step 1.
Procedure FAP_growth(tree, K)
begin
    for each Ki with Ki.count >= min_sup do   // Ki is a member of the page header table
    begin
        generate access pattern B = Ki;
        K = K U B;
        p = Ki.next;   // p points to the first location of Ki in the FAP-tree
        while (p != null) and (p.count >= min_sup) do
        begin
            look for Ki's prefix access pattern base, then construct access pattern Bi by connecting Ki's prefix access pattern base with Ki itself;
            if (Bi >= min_sup) then Ki = Ki U Bi;
            p = p.next;   // p points to the next location of Ki in the FAP-tree
        end
    end
end
EXAMPLE
Assume A, B, C, D, E, F, G, H, I and K are the web pages in a particular web site. U1 is the user ID and S1, S2, S3, S4 are different sessions.
Table-1: Access paths of single user in different sessions
User Name    Session Name    Access Path
U1           S1              A-B-C-D-B-E
U1           S2              A-B-C-D-C-B-E-G-E-C-H
U1           S3              A-B-A-C-I
U1           S4              C-I-G-I-K-I-D
The page table gives information about the web pages and their number of occurrences across all the access paths.
The page table contents are stored in ascending order of the user access page count. Generating the access patterns takes one more database scan. The tree stores all access paths in a compressed format. Each node in the tree holds the page name and page count, and has pointers to its parent node and to its child nodes. The root node is a null node; the child nodes of this root node give all the access paths. If the user follows the same path in a different session, the tree simply increments the page count instead of creating new nodes.
NULL
    A:4
        B:5
            C:3
                D:2
            E:3
                G:1
                C:1
                    H:1
        C:1
            I:1
    C:1
        I:3
            G:1
            K:1
            D:1
Table-2: Access paths of single user
PAGE NAME    COUNT    NEXT
H            1
K            1
G            2
D            3
E            3
A            4
I            4
B            5
C            6
Generating Frequent Access Patterns:
For generating frequent access patterns, start from the first page in the page table and traverse to its tree node if the following condition holds:
if page count >= min_support then
begin
    move the header node's seqptr to the tree node;
    traverse from the tree node to the root node;
    move the present node to the next seqptr and repeat the above two steps;
end
else
    go to the next page.
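The traversal described above, following the seqptr chain of one page and climbing from each of its tree nodes to the root, can be sketched in Java as below. This is a sketch under assumptions: the count attached to each collected prefix is the count of the node where the climb starts (a convention borrowed from FP-growth, which may differ from how the counts in the paper's tables are tallied), and the names PrefixWalk and prefixBases are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PrefixWalk {
    static class Node {
        final String page;
        final int count;
        final Node parent;
        Node seqptr; // next tree node carrying the same page name

        Node(String page, int count, Node parent) {
            this.page = page;
            this.count = count;
            this.parent = parent;
        }
    }

    // For every node on the seqptr chain, climbs to the root and records the
    // prefix path as "P1-P2-...:count", where count is that node's own count.
    static List<String> prefixBases(Node first) {
        List<String> bases = new ArrayList<>();
        for (Node n = first; n != null; n = n.seqptr) {
            List<String> prefix = new ArrayList<>();
            // Stop at the root, whose page name is null.
            for (Node a = n.parent; a != null && a.page != null; a = a.parent) {
                prefix.add(a.page);
            }
            Collections.reverse(prefix); // root-to-node order
            bases.add(String.join("-", prefix) + ":" + n.count);
        }
        return bases;
    }
}
```

Because each climb follows parent pointers only, the walk needs no recursion; the cost is one upward pass per tree node of the page being mined.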
Table-3: Frequent Access Pattern Generation
Item    Set of Prefix Access Pattern Base    Frequent Access Patterns Generated
G       {}                                   {G}:2
D       {C}:3, {B,C}:3, {A,B,C}:3            {D}:2, {C,D}:2, {B,C,D}:2, {A,B,C,D}:2
E       {B}:5, {A,B}:4                       {E}:3, {B,E}:3, {A,B,E}:3
A       {}                                   {A}:4
I       {C}:1                                {I}:3
B       {A}:4                                {B}:5, {A,B}:4
C       {B}:5, {A,B}:4                       {C}:3, {B,C}:3, {A,B,C}:3
PERFORMANCE ANALYSIS & RESULT
This algorithm takes at most 2 scans of the database, whereas the Apriori algorithm takes at most n-1 scans. It is more efficient than the Apriori algorithm for large databases as well as for longer access paths. It is more efficient than the FP-tree algorithm when the access paths are not common across the user sessions; in that case FP-tree creates a more complex tree and takes more execution time for generating frequent patterns. It shows the same performance as FP-tree if the access paths are common across the user sessions. For validation of the algorithm, data from the web site www.musicmachines.com was used. The results were checked for constant-size databases (i.e. 50MB, 150MB). The two algorithms, i.e. the Apriori algorithm and the proposed algorithm, were implemented in Java.
The following figure shows the comparison result.
Fig.1: Apriori and Proposed Algorithm results comparison with time (axes: Minimum Support (%) and Seconds; series: Proposed Algorithm, Existing Apriori Algorithm)
The above figure shows the comparison of the two algorithms in terms of execution time & memory. As is evident from the figure, when the minimum support is lower the execution time is higher, because more candidate sets are generated. Hence, the execution time of the Apriori algorithm is higher than that of the proposed algorithm. The proposed algorithm does not generate any candidate sets, but it generates a larger number of patterns, so the number of tree traversals is higher. From Fig.1, the proposed algorithm takes less time than the Apriori algorithm in all instances. When the minimum threshold is higher, the execution time of both algorithms is lower.
CONCLUSION & FUTURE SCOPE
Information content on the WWW is increasing at an exponential rate, and it is not surprising to find users having difficulty in navigating and finding relevant information. Hence, e-commerce site developers find it difficult to observe potential customers or web site structure. This work used the Web access log file of a Web site to apply data mining techniques for finding the frequent access patterns of the users.
The work initially makes an in-depth analysis of the existing algorithms for their similarities in generating frequent access patterns for Web Usage Mining. Based on their shortcomings, it then develops a comprehensive algorithm. The algorithm is based on the method of generating frequent patterns without candidate sets. The time taken to generate the targeted frequent patterns is small enough to respond to the user in real-time mode. The algorithm takes at most two database scans for generating the frequent access patterns.
However, the work may be extended to analyze the Data Preprocessing phase in detail. The present work has been carried out mostly for the frequent pattern analysis; it can be extended to analyze and suggest modifications for the Data Preprocessing phase. It can also be simulated using variable memory sizes, instead of the constant memory size adopted for the study.
Graph theory, statistical analysis, etc. can also be applied to Web Usage Mining. By using an efficient algorithm we can reduce the runtime and memory requirement. The proposed algorithm takes twenty-five percent less time than the Apriori algorithm in all instances.
REFERENCES
1. Lizhen, Junjie Chen and Hantao Song. “The Research of Web Mining”. The 4th World Congress on Intelligent Control and Automation, 2002, pp. 2333-2337.
2. www.w3c.org/clf.html
3. Jian Pei, Jiawei Han, Behzad Mortazavi-Asl and Hua Zhu. “Mining Access Patterns Efficiently from Web Logs”. CS, Simon Fraser University, Technical Report CS, 2000.
4. Yan Wang. “Web Mining and Knowledge Discovery of Usage Patterns”. Project (part 1), CS748T, February 2000.
5. R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In Proc. of Intl. Conf. on Very Large Databases (VLDB), September 1994.
6. J. Han, J. Pei and Y. Yin. “Mining frequent patterns without candidate generation”. IEEE, September 1998, pp. 365-378.
*Corresponding Author: Harendra Singh; Deptt.of computer science & Engg. NRI Institute of Science and Technology, Bhopal, RGPV Technical University Bhopal, INDIA