APPLICATION OF WEB DATA MINING
Durga Patel
1, Pallavi Sengar
21,2Information Technology Department Banasthali University, Niwai,Rajasthan, (India)
ABSTRACT
Data Mining has attracted a great deal of attention in the information industry and in the society as a whole in recent years due to the wide availability of huge amounts of data and the imminent need for turning that data into useful information and knowledge . There are many types of data mining based on the type of data being extracted: Spatial data mining, Multimedia data mining ,Text data mining , Temporal data mining , and Web data mining. Web data mining itself is an interdisciplinary field .Web data mining is extracting of data from the world wide web. According to the survey report of Google of 2008 the number of web users is above 1 trillion .It consist of Web structure mining, Web content mining and Web usage mining .As the number of web users are increasing day by day so we have to implement methods so we can get the data accurately .When we search for any particular topic on the web then we can get only the data that is similar to the original one and not the exact data .So, we have to use some form of association to correlate with it.
Keywords: web mining,web usage mining,web content mining,web structure mining,apriori algorithm,page ranking algorithm,weighted page ranking algorithm,hyper link induced topic search(hits).
I. INTRODUCTION
World wide web is a huge and diverse in nature .It consist of semi –structured pages .web data mining is the extraction of data from multiple source of information and converting it into our useful knowledge .The growth of web has increased in the last years .The data can be converted using some preprocessing steps such as: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Web usage mining in this we identify the user behavior while interacting with the web. User behavior can be identified by identifying the browsing pattern .Web structure mining it is used to link the structure of web pages form of graph .Web content mining is related to more structured form of data by indexing the web pages. It is represented in two views: information retrieval and database view.
II. LITERATURE REVIEW 2.1 Web Data Mining
Web data mining is the process of extracting the information automatically from the web pages or document. As we have stated earlier that the data mining is a multidisciplinary field. In this, the information can be extracted with the help of other areas such as neural network, fuzzy or rough set theory, knowledge representation .It is also integrated with other technology such as information retrieval, pattern recognition, image analysis, signal
processing, bioinformatics or psychology, etc. Data mining system can be categorized according to the various criteria[1]:
Classification on the basis of databases mined.
Classification on the basis of knowledge mined.
Classification on the basis of technology utilized.
Classification on the basis of application adopted.
We have reviewed some challenges in web mining:
Web is huge.
Web pages are semi-structured.
Web information stand to be diversity in meaning.
Quality of information extracted.
Knowledge from information extracted.
The tasks performed by web data mining:
Resource finding: It is the process of extracting the information from offline or online text resource available on the web.
Information selection and pre-processing: It is the process of transforming the selected information to convert into meaningful knowledge.
Generalization: It is the process of replacing the low level data or “primitive” data by high level data with the help of concept hierarchies. Ex: Categorical attributes, like street, can be generalized to high -level concepts, like city or country.
Analysis: It plays an important role of validation and help in pattern mining.
Web mining can be divided into three broad categories:
2.1.1 Web content mining:
According to Linoff and Berry (in 2001) defined this as “the process of extracting useful information from the text , image and other forms of content that make up the pages” .With the advent of increase in WWW there are billions of Html documents, pictures and other multimedia files available on the internet. As the information available on the internet is vast how can a user find the exact information as the search engines are not of much help. Most of the time the result displayed by the engines does not correspond to the original data but somewhat correlate to it. Information extraction from the web is complex when working with static data bases, due to its dynamic nature and large number of documents. As the web content mining can be represented in two views:[2]
Information Retrieval view: In this the data is in the form of unstructured and structured form. Data can be extracted from the text documents and hypertext documents. It is represented in relational form. The method used in extraction is Machine Learning and Statistical (include Natural Language Processing). The application categories are Categorization, Clustering, finding extract rules and finding patterns in text.
Data Base view: The data is represented in the form of semi structured which consist of hypertext documents. The method used is proprietary algorithms and association rules. The application categories are finding frequent sub-structures and web site schema discovery. In this ,we will discuss two terms:
Association: Association rule implies the picking of the unrelated data and find out the relation between them. It is also measured as the frequency of occurrence. It is represented in:
X=>Y
Example: Suppose ,if a person goes to buy a computer he/she had to definitely buy a software cd with it.
buys(x, “computer”)=>
buys(x, “software cd”)
In this the, value of support=1%
Support is defined as XUY.
Value of confidence =50%
which is defined as X |Y.
Correlation: To determine the correlation value of support and confidence is not enough but also the relation between the item sets. It can be calculated by using the lift. In this the itemsets A and B are independent of each other if,
P(X U Y)=P(X)P(Y)
Then, lift(X,Y)=P(X U Y)/P(X)(Y).
Apriori Algorithm
This algorithm is a seminal algorithm for mining frequent itemsets for Boolean association rule. This is the important and basic algorithm given in 1994 by Agarwal and Srikant. This is known as the fastest algorithm for finding out the frequent itemsets. In this we have to consider the value of minimum support count. It follows the property:[3][4][5][6][7]
“All nonempty subsets of a frequent itemsets must also be frequent.”
Principles used in this are:
Generation of frequent itemsets.
Items having support>=min-support.
Each itemset have high confidence rules.
Working of Apriori Algorithm In this there are two important steps:
1. Join Step: In the join step, there are a total number of n itemsets so the subset is n-1.
2. Pruning Step: Any (n-1)-itemsets that is not frequent cannot be a subset of a frequent n-1 itemset.
Get frequent things
Get frequent itemsets
Join Step: - Cn is generated by joining Ln-1 with itself.
Prune step: - Any (n-1) –itemset that is not frequent cannot be a subset of a frequent n-1 itemset Where,
Cn: candidate itemset of size n Ln:- frequent itemset of size n L1={frequent items};
For (n=1; Ln! =Ø;n++) do begin
Cn-1=candidates from Ln;
For each purchase p in database do
Increment the count of customers in Cn-1 that are contained in p Ln-1 =candidates in Cn-1 with min-support
end
Return UnLn Example:
Transaction ID Item ID’s
1 A,B,C,E
2 C,D,E
3 C,E
4 A,C,D
5 B,D
Step 1- Total no. of frequent item in the transactions to count support of each item.
TID Item ID’s Support Count
1 A 2
2 B 2
3 C 4
4 D 3
5 E 3
Step-2 Join the table with itself.
TID Item ID’s Support count
1 A,B 1
2 A,C 2
3 A,D 1
4 A,E 1
5 B,C 1
6 B,D 1
7 B,E 1
8 C,D 2
9 C,E 3
10 D,E 1
Step 3- Prune the above table on the basis of minimum support count i.e.2
TID Item ID’s Support count
1 A,C 2
2 C,D 2
3 C,E 3 Step 4-Join the table with itself.
TID Item ID’s Support count
1 A,C,D 1
2 C,D,E 1
So, in this table none of the satisfying the minimum support count. So, in this none of them is frequent itemsets.
But, there are a number of drawbacks associated with it.
Drawbacks of algorithm
Repetition of data is more.
Same support minimum threshold.
Difficulty to get rare occurring.
Time taking.
If data size is large then the number of candidate sets is large.
To overcome these disadvantage several authors have given their algorithm.
Intersection and record filter approach: In record filter approach, count the support of candidate set only in the transaction record whose length is greater than or equal to length of candidate set. It consumes less time but more optimization is needed.
Improvement based on set size frequency: In this the items whose frequency is less than the minimum support count is removed initially and determine initial set size to get the highest set size whose frequency can be greater than or equal to minimum support of set size. In this there is no fix criteria to determine the size of key for pruning the data.
Utilization of attributes: In this all the unnecessary transaction should be cut down to increase I/O. As all the data in the transaction are treated equally, so the data that is not present should not be considered. This can be removed by using attributes such as frequency of items, quantity.
After this it has been concluded that the number of candidate set should be less in number.
Improved Apriori Algorithm
In this algorithm, another column have been introduced which is the size of transaction (SOT), which depicts the number of individual items in the database .The deletion should be done on the basis of value k. If the value of k, matches the value of SOT then, that value would be deleted in the database.
Input: transaction database DB; min-sup.
Output: the set of Frequent L in the database DB
Minimum support count =minimum support count *mode of data base L1-candidates=find all one item set from the data base
L1={<Y1,TID-set(Y1)> L1-candidates |support-count&'()-support- count}
for (n=2; Ln-1#*+,--.!/0!1)
{ for each n-itemset (yi,TID-set(yi) Ln-1 do for each n-itemset (yj,TID-set(yj) Ln-1 do
if (yi[1]=yj[1])!(yi[2]=yj[2])!…!(yi[n-2]=yj[n-2] then {Ln-candidates.Yn= Yi* Yj;
Ln-candidates.TID-set(Yn) = TID-set(Yi) 234-set(Yj) }
}
for each n-itemset<Yn,TID(Yn)> Ln-candidates do sup-count=|TID-set|
Ln={<Yn,TID-set(Yn)> Ln-candidates|support-count&'()-support-count}
}
set-count=Ln.itemcount //updata set-count return L="nLn;
Example: D
TID Items
1 A,C,G
2 B,C,G
3 A,B,C
4 B,C
5 B,C,D,E
6 B,C
8 B,C,D,F
9 A
10 A,C
Step- 1
TID Items SOT
1 A,C,G 3
2 B,C,G 3
3 A,B,C 3
4 B,C 2
5 B,C,D,E 4
6 B,C 2
7 A,B,C,D,F 5
8 B,C,D,F 4
9 A 1
10 A,C 2
Step-2 C1
Item Set Support count
A 5
B 7
C 9
D 3
E 1
F 2
G 2
Step-3 L1
Item Set Support count
A 5
B 7
C 9
D 3
D1:
TID Item set SOT
1 A,C 2
2 B,C 2
3 A,B,C 3
4 B,C 2
5 B,C,D 4
6 B,C 2
7 A,B,C,D 5
8 B,C,D,F 4
9 A,C 2
Step – 2 C
Item Set Support count
A,B 2
A,C 4
A,D 1
B,C 7
B,D 3
C,D 3
Step-3 L2
Item Set Support count
A,C 4
B,C 7
B,D 3
C,D 3
D2:
TID Items SOT
3 A,B,C 3
5 B,C,D 3
7 A,B,C,D 4
8 B,C,D,F 4
Step-1 C3
Item Set Support Count
B,C,D 3
Step-2 L3
Item Set Support count
B,C,D 3
Advantage:
It helps in reducing query frequencies.
It helps in increasing the efficiency.
It reduces I/O load.
It helps in reducing storage space.
The application of database view is to find the frequent sub-structures and to discover the web site schema.
Its application is used to adverse drug reaction detection .Data mining towards tumultuous crimes concerning women.
2.2.2 Web usage mining: In this identify the user behavior by identifying the browsing behavior. It helps in automatic discovery using click streams patterns and other associated data collected or gathered. The main aim is to capture, model, analysis of behavioral patterns and profile of user[8][9][10].
Fig : An illustration of web usage mining process In the mining process stage can be divided into three parts:
Data collection and preprocessing, pattern discovery and pattern analysis.
In data collection and preprocessing, the data in the click stream is removed and partitioned according to the user behavior during every visits. In the pattern discovery different application areas are used to extract the hidden pattern and reflecting the behavior of user. It can also be used to obtain statistical values related to it. In the final stage, the obtained input is filtered and operations are performed to give input to generate reports. In this the primary data sources are the server log files ,including application server logs and Web server access logs. Another types of data such as meta-data , site files and operational databases are used to prepare data and to discover the pattern. The data is divided into four groups:
Usage data: It is primary form of data as each hit against the server correspond to an HTTP request and generate an entry in the server. FOR this cookies is used to identify a repeat visitor. Each entry correspond to details such as: date, time, IP requested resource and certain parameters.
Content data: The data used in it is of the form static HTML/XML pages, page segments from scripts and collection of records .It also include semantic or structural meta-data.
Structure data: It is used to represent the designer’s view of the content. In this it is used to convey the linkage between pages. Example both the HTML and XML documents can be represented in form of graph and trees.
User data: It is used to represent user profile information such as past purchase, products or visit history.
This data can be seen by other as long as the users are not differentiated.
In this the data is represented in the form of relational table, graph. The data can be extracted with the help of Machine learning, Statistical and Association rules. Its application categories are site construction, adaptation and management, marketing and user modeling.
2.2.3 Web Structure mining:
In this the web pages is viewed in the form of link structure. It is basically represented in the form of graph. In this the proprietary algorithms is used[11][12][13].In this the structure is formed with the help of links and hyperlinks through which we can get additional information. It uses the concept of clustering, document categorization and information extraction from web pages. The web may be viewed as a directed labeled graph whose nodes are the documents and the edges are the hyperlinks connecting them[2][14].
The web pages are dynamic in nature some contains little text (making it difficult for text mining), multimedia (multimedia data mining).The web is a graph. In this we use the predecessor page- that have a hyperlink pointing to the target page. It may contain particular useful information for classifying a page[15][16][17]:
Redundancy: There is a probability there may be more than one page pointing to a single point on the web, so as to combine the data from multiple source of information.
Independent encoding: The information is less sensitive to vocabulary used by particular author.
Sparse or non-existing text: Web pages contain limited amount of text as it contain images and no machine readable text at all.
Irrelevant or misleading text: Most of the pages contain unnecessary data and misleading information.
Graph structure can be used for retrieval, classification and Classification with hyperlinks.
There are a number of algorithms associated with it: Page Rank, Weighted Page Rank and Hypertext Induced Topic Search (HITS).
Page Rank: This algorithm is given by Brin and Page (1998) at Stanford University during their Ph.D. Page rank is used to simply count the number of pages that are linked to it. These links are called backlinks. The backlinks are considered of high weightage as they come from important pages. The link which is used to connect one page to another is called vote and is of great importance. In this the input parameters are the backlinks and forward links. Page and Brin proposed a formula to calculate the page rank:
PR(A)=(1-d)+d(PR(T1)/C(T1))+…+PR(Tn/C(Tn)) ……(1) Where, PR(Ti) is the page rank of pages Ti
C(Ti)is number of outlinks on page Ti D is the dumping factor.
It is used to calculate the probability so that the other pages have no effect on it. It is done to show that the sum of Page Rank of all web pages to be 1. The complexity is O(log N). The search engine used in it is Google. The only limitation is that it is query independent.
Weighted page rank: This algorithm is proposed by Wenpu Xing and Ali Ghorbani .It is the extension of page ranking algorithm as rank values are assigned to it according to their importance. The importance is calculated on the basis of the links: incoming links, outgoing links.
The formula is:
WPR(n)=(1-d)+d ∑m€B(n) WPR (m)Win(m,n) Wout(m,n) …………(1)
Win is the weight of link(m,n) on the number of incoming links of page n.
WOut is the weight of link(m,n)calculated on the number of outgoing links ofpage n w.r.f page m.
According to experimental result , it has been found that the weighted page rank has larger relevancy than page rank.
Its complexity is <O(log N). The search engine used in it is research model.
Hyper-Induced Topic Search: This algorithm is given by Klienberg in 1999 and propose two forms of web pages called hubs and authorities. Hubs are pages that are used to give authorities to users. So, if a page is good authoritative then it should be pointed out by good hubs at the same time. Klienberg also pointed out that a page can be a good hub and a good authority at the same time. The formula used to calculate it:
Fig : Calculation of hubs and authorities
In HITS there are two steps, first step is Sampling and second step is Iterative .
In sampling step the relevant pages are collect .And in Iterative step, use the output of sampling step and find hubs , authorities.
Hp=∑q€I (p) Aq AP =∑q€B (p) Hq Where:
Hp = The hub weight Ap = The Authority weight
I(p) and B(p) = Denotes the set of reference and referrer pages of page p HITS algoritham has following constraints:
a) The differentiation between hubs and authorities is not possible . b) Sometimes did not search the relevant pages.
c) It did not generate relevant link for user query.
d) It is not efficient in real time.
The complexity is <O(log N). The input parameter are backlinks, forwardlinks and content.
III. CONCLUSION
Web data mining is an interdisciplinary field. This paper covers all the basic aspects of web mining. In the web mining it has been divided in to three parts: web content mining , web usage mining and web structure mining.
Web content mining is presented in two views: information retrieval and database view. In database view ,we are dealing with traditional apriori algorithm and then the improved apriori algorithm. In Web usage mining we are dealing with the user behavior and different types of data . Web structure mining deals with the links and hyperlinks. The algorithms associated with it : Page Rank, Weighted PageRank and HITS algorithm is discussed.
As, the web is very diverse in nature and apriori algorithm is the basic algorithm so many improvements are needed in this direction so more and more optimization is possible.
REFERENCES
[1] XindongWu,Vipin Kumar, J. Ross Quinlan, Top 10 algorithms in data mining, (2008), pp-1–37.
[2] Lihui Chen, Wai Lian Chue ,Using Web structure and summarization techniques for Web content mining,2005,pp1225-1242.
[3] Pragya Agarwal, Madan Lal Yadav, Nupur Anand, Study on Apriori Algorithm and its Application in Grocery Store, International Journal of Computer Applications , Volume 74,14 July 2013.
[4] Pratibha Mandave, Megha Mane , Data mining using Association rule based on Apriori Algorithm and Improved Approach with illustration, International Journal of Latest Trends in Engineering and Technology, Vol. 3 Issue2 November 2013.
[5] jaishree Singh, Hari Ram, Improved efficiency of Apriori Algorithm Using Trsnsaction Reduction, International Journal of Scientific and Research Publications, January 2013.
[6] Osmar R. Zaıane ,Web Usage Mining for a BetterWeb-Based Learning Environment, Department of Computing Science University of Alberta Canada ,.
[7] Rekha Jain ,Page Ranking Algorithms for Web Mining, International Journal of Computer Applications,Volume 13– No.5, January 2011,pp-22-25.
[8] Bamshad Mobasher and Olfa Nasraoui, Web usage mining.
[9] Bamshad Mobasher, Web usage mining.
[10] P. Ravi Kumar and Ashutosh Kumar Singh,Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval, American Journal of Applied Sciences ,pp- 840-845, 2010.
[11] Prof. Vinitkumar Gupta, Survey on several improved Apriori algorithms , IOSR Journal of Computer Engineering,2013, PP 57-61.
[12] Divya Bansal ,Execution of Apriori Algorithm of Data Mining Directed Towards Tumultuous Crimes Concerning Women, International Journal of Advanced Research in Computer Science and Software Engineering,2013,pp-54-62.
[13] Yoon Ho Choa , Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce, Expert Systems with Applications,2014, pp- 233–246.
[14] Ming syan, Efficient Data Mining For Path Traversal Patterns.
[15] Robort Cooley, Data Preparation For Mining World Wide Web Brosing Pattern,October 1998.
[16] Robert Cooley, The Use of Web Structure and Content to Identify Subjectively Interesting Web Usage Patterns,ACM Transaction on Intrernet Technology 2013.
[17] Johannes F¨urnkranz,Web Structure Mining Exploiting the Graph Structure of the World-Wide Web, pp-1- 10.