Modified Weighted PageRank Algorithm using Time Spent on Links

Academic year: 2020

Priyanka Bauddha¹, Sonal Tuteja², Monika Bauddha³

¹ M.Tech, Galgotias University, School of Computing Science & Engineering, Greater Noida (U.P.), India. priyankabauddha@gmail.com

² M.Tech, Software Engineering, Delhi Technological University, Delhi, India. sonalt9@gmail.com

³ B.Tech, HCST College, Computer Science & Engineering, Mathura (U.P.), India. monikabauddha28@gmail.com

Abstract— With the dynamic growth of data on the web, it is very difficult for a user to find relevant information. Search engines return a large number of pages in response to a user's query. Ranking algorithms have been developed to prioritize the search results so that more relevant pages are displayed at the top. Various ranking algorithms based on web structure mining and web usage mining, such as PageRank, Weighted PageRank, PageRank with VOL and Weighted PageRank with VOL, have been developed, but they do not take into account the time spent by the user on a particular web page. If a user spends more time on a web page, this signifies that the page is more relevant to the user. The proposed algorithm combines time spent with the Weighted PageRank using Visit of Links.

Keywords— Inlinks, Outlinks, Visit of Links (VOL), Web Usage Mining, Log Analysis, Web Mining, World Wide Web (WWW), Search Engine, Server Logs.

I. INTRODUCTION

The World Wide Web is an interactive medium for disseminating information in a huge, diverse and dynamic way. Web Mining is the process of extracting previously unknown, valuable and comprehensible patterns of information from a large web data repository [1]. Web Mining is categorised into Web Structure Mining, Web Content Mining and Web Usage Mining. Web Structure Mining is used to mine the structure of hyperlinks in the web itself. Web Content Mining is used to extract useful information from the content of the web. Web Usage Mining is the application of Web Mining techniques to large web log repositories to extract knowledge about users' behavioural patterns. With the dynamic growth of data on the web, it is very difficult for a user to find relevant information. When a user submits a query to a search engine, the results contain both relevant and non-relevant web pages that match the query. Ranking algorithms are therefore needed to prioritize the search results so that more relevant pages are displayed at the top. Various ranking algorithms based on web structure mining and web usage mining, such as PageRank, Weighted PageRank, PageRank with VOL and Weighted PageRank with VOL, have been developed.

However, these algorithms do not take into account the time spent by the user on a particular web page. If a user spends more time on a web page, this signifies that the page is more relevant to the user. The proposed algorithm combines time spent on links with the Weighted PageRank algorithm using Visit of Links.

The paper is organized as follows. Section 2 gives a brief overview of the research background. Section 3 describes the proposed work with an algorithm and an example. Section 4 describes the advantages and limitations of the proposed work. Section 5 compares the results of the proposed method with Weighted PageRank and Weighted PageRank using Visit of Links. The conclusion and future work are given in Section 6.

II. RESEARCH BACKGROUND

The PageRank algorithm was developed by Brin and Page [11] at Stanford University and is based on the hyperlink structure of the web. It works on the principle that if a page has important incoming links, then its outgoing links to other pages also become important. The page rank is calculated by the formula given in equation 1:

$$PR(u) = c \sum_{v \in B(u)} \frac{PR(v)}{N_v} \qquad (1)$$

where

$PR(u)$, $PR(v)$ = page ranks of pages u and v.

$B(u)$ = set of web pages pointing to u.

$N_v$ = total number of outlinks of web page v.

$c$ = factor of normalization.

It is not necessary that all users follow direct links on the WWW to reach a required web page. The modified formula for calculating PageRank therefore uses a damping factor d, which represents the probability of a user following direct links to a web page, as given in equation 2:

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v} \qquad (2)$$
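The iteration behind equation 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pagerank`, the starting value of 1.0 for every page, the stopping tolerance and the three-page `toy_graph` are all assumptions made for the example.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / N_v for v in B(u))."""
    pages = list(out_links)
    # B(u): the pages that link to u, derived from the out-link lists.
    in_links = {u: [v for v in pages if u in out_links[v]] for u in pages}
    rank = {u: 1.0 for u in pages}  # starting value is an assumption
    for _ in range(max_iter):
        new_rank = {
            u: (1 - d) + d * sum(rank[v] / len(out_links[v]) for v in in_links[u])
            for u in pages
        }
        converged = max(abs(new_rank[u] - rank[u]) for u in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical graph: A links to B and C, B links to C, C links to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
toy_ranks = pagerank(toy_graph, d=0.5)
```

With this form of the formula and no page lacking outlinks, the ranks sum to the number of pages, which gives a quick sanity check on the result.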

Wenpu Xing and Ali Ghorbani [8] extended the standard PageRank algorithm into the Weighted PageRank (WPR) algorithm. It works on the principle that a more important web page has a larger number of inlinks as well as outlinks. Instead of being divided evenly, the rank of a web page is distributed among the pages it links to in proportion to their importance or popularity, measured by their inlinks and outlinks. The inlink weight is calculated using the formula given in equation 3:

$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad (3)$$

where

$I_u$ = number of inlinks of page u.

$I_p$ = number of inlinks of page p.

$R(v)$ = the set of web pages pointed to by v (the reference pages of v).

$W^{out}(v,u)$ is the popularity calculated from the number of outlinks of page u and the number of outlinks of all reference pages of page v, as given in equation 4:

$$W^{out}(v,u) = \frac{O_u}{\sum_{p \in R(v)} O_p} \qquad (4)$$

where

$O_u$ = number of outlinks of page u.

$O_p$ = number of outlinks of page p.

$R(v)$ = the set of web pages pointed to by v.

The page rank using the Weighted PageRank algorithm is calculated by the formula given in equation 5:

$$WPR(u) = (1 - d) + d \sum_{v \in B(u)} WPR(v)\, W^{in}(v,u)\, W^{out}(v,u) \qquad (5)$$
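Equations 3 to 5 can be put together in a short Python sketch. The function name `weighted_pagerank`, the starting value of 1.0, the tolerance and the treatment of an undefined weight as 0 are assumptions of this example. The graph A→B, A→C, B→C, C→A is also an assumption: the hyperlink figure of the worked example did not survive in this copy, and this structure was chosen because it gives ranks close to the WPR values reported in Table 3 for d = 0.35 and d = 0.50.

```python
def weighted_pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate WPR(u) = (1 - d) + d * sum(WPR(v) * W_in(v,u) * W_out(v,u))."""
    pages = list(out_links)
    in_links = {u: [v for v in pages if u in out_links[v]] for u in pages}
    n_in = {u: len(in_links[u]) for u in pages}    # I_u
    n_out = {u: len(out_links[u]) for u in pages}  # O_u

    def share(counts, v, u):
        # counts[u] divided by the total over R(v), the pages v points to.
        total = sum(counts[p] for p in out_links[v])
        return counts[u] / total if total else 0.0  # assumption: 0 if undefined

    # W_in(v,u) * W_out(v,u) for every link (v, u); fixed during the iteration.
    weight = {
        (v, u): share(n_in, v, u) * share(n_out, v, u)
        for v in pages
        for u in out_links[v]
    }
    rank = {u: 1.0 for u in pages}
    for _ in range(max_iter):
        new_rank = {
            u: (1 - d) + d * sum(rank[v] * weight[(v, u)] for v in in_links[u])
            for u in pages
        }
        converged = max(abs(new_rank[u] - rank[u]) for u in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Assumed link structure (see the note above).
wpr_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
wpr_050 = weighted_pagerank(wpr_graph, d=0.50)
wpr_035 = weighted_pagerank(wpr_graph, d=0.35)
```

Because the weights depend only on link counts, they are computed once before the loop; only the rank values change between iterations.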

Gyanendra Kumar et al. [3] introduced a new algorithm which considers the user's browsing behaviour when calculating the page rank. In this algorithm, a higher rank value is assigned to the outgoing links that are visited most by users. The formula used to calculate PageRank using VOL is given in equation 6:

$$PR_{VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \frac{L_u \, PR_{VOL}(v)}{TL(v)} \qquad (6)$$

where,

$L_u$ denotes the number of visits of the link pointing to page u from v.

$TL(v)$ denotes the total number of visits of all links present on v.

$d$ denotes the damping factor.

Neelam Tyagi and Simple Sharma [4] proposed the Weighted PageRank algorithm based on the number of VOL. It considers only the popularity from the number of inlinks and ignores the popularity from the number of outlinks, which was incorporated in the WPR algorithm.

The rank of a web page using VOL is calculated using the formula in equation 7:

$$WPR_{VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \frac{L_u \, WPR_{VOL}(v)\, W^{in}(v,u)}{TL(v)} \qquad (7)$$

where,

$WPR_{VOL}(u)$ = page rank of page u.

$WPR_{VOL}(v)$ = page rank of page v.

$d$ = damping factor.

$L_u$ = the number of visits of the link pointing to page u from v.

$TL(v)$ = total number of visits of all links present on v.

$B(u)$ = set of web pages pointing to u.

$W^{in}(v,u)$ = weight of link (v, u), calculated from the number of inlinks of page u and the number of inlinks of all reference pages of page v.
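A minimal Python sketch of equation 7 follows. The function name `wpr_vol`, the dictionary layout of the visit counts, the starting value and the handling of a page whose links were never visited are assumptions of the example. The visit counts themselves are hypothetical: they were chosen on the assumed graph A→B, A→C, B→C, C→A so that the result is close to the WPR(VOL) values reported in Table 3 for d = 0.35 and d = 0.50.

```python
def wpr_vol(visits, d=0.85, tol=1e-10, max_iter=1000):
    """visits[(v, u)] = number of visits of the link from page v to page u."""
    pages = sorted({p for link in visits for p in link})
    out_links = {v: [u for (src, u) in visits if src == v] for v in pages}
    in_links = {u: [v for (v, dst) in visits if dst == u] for u in pages}
    n_in = {u: len(in_links[u]) for u in pages}  # I_u
    # TL(v): total number of visits of all links present on v.
    total_visits = {v: sum(visits[(v, u)] for u in out_links[v]) for v in pages}

    def weight(v, u):
        if total_visits[v] == 0:
            return 0.0  # assumption: unvisited pages pass on no rank
        w_in = n_in[u] / sum(n_in[p] for p in out_links[v])
        return visits[(v, u)] * w_in / total_visits[v]

    w = {(v, u): weight(v, u) for (v, u) in visits}
    rank = {u: 1.0 for u in pages}
    for _ in range(max_iter):
        new_rank = {
            u: (1 - d) + d * sum(rank[v] * w[(v, u)] for v in in_links[u])
            for u in pages
        }
        converged = max(abs(new_rank[u] - rank[u]) for u in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical visit counts on an assumed link structure.
toy_visits = {("A", "B"): 1, ("A", "C"): 2, ("B", "C"): 2, ("C", "A"): 2}
vol_050 = wpr_vol(toy_visits, d=0.50)
vol_035 = wpr_vol(toy_visits, d=0.35)
```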

III. PROPOSED WORK

The Weighted PageRank algorithm using Visit of Links calculates the rank of a web page using the popularity of inlinks. However, the number of hits or visits alone cannot capture the user's interest, because it does not account for the user's ability to access the information or for the appropriateness of the information.

This can be explained with an example of three web pages D, E and F. If a website is designed in such a way that a user must go through D and E to reach F, then the server logs will show as many hits or visits on pages D and E as on F, even if the user finds relevant information only on F. The number of visits of links alone therefore does not ensure that a page is relevant to the user. So, we have incorporated the time spent on links to improve the relevancy of web pages. The proposed algorithm uses $W^{in}_{t}(v,u)$, the popularity from the time spent by users on inlinks, and $W^{out}_{t}(v,u)$, the popularity from the time spent by users on outlinks.

$W^{in}_{t}(v,u)$ is the weight of the link (v, u), calculated as the ratio of the time spent on the visits of inlinks of page u to the time spent on the visits of inlinks of all reference pages of page v, as given in equation 8:

$$W^{in}_{t}(v,u) = \frac{T^{in}_{u}}{\sum_{p \in R(v)} T^{in}_{p}} \qquad (8)$$

Where,

$T^{in}_{u}$ = time spent on the visits of incoming links of page u.

$T^{in}_{p}$ = time spent on the visits of incoming links of page p.

$R(v)$ = set of reference pages of page v.

$W^{out}_{t}(v,u)$ is the weight of the link (v, u), calculated as the ratio of the time spent on the visits of outlinks of page u to the time spent on the visits of outlinks of all reference pages of page v, as given in equation 9:

$$W^{out}_{t}(v,u) = \frac{T^{out}_{u}}{\sum_{p \in R(v)} T^{out}_{p}} \qquad (9)$$

Where,

$T^{out}_{u}$ = time spent on the visits of outgoing links of page u.

$T^{out}_{p}$ = time spent on the visits of outgoing links of page p.

$R(v)$ = set of reference pages of page v.

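These weights are used to calculate the page rank values using equation 10:

$$\mathrm{WPR(t)VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \mathrm{WPR(t)VOL}(v)\, W^{in}_{t}(v,u)\, W^{out}_{t}(v,u) \qquad (10)$$

where,

$d$ = damping factor,

$B(u)$ = set of pages pointing to u,

$\mathrm{WPR(t)VOL}(v)$ = rank score of page v,

$W^{in}_{t}(v,u)$ = popularity with respect to the time spent on the visits of inlinks,

$W^{out}_{t}(v,u)$ = popularity with respect to the time spent on the visits of outlinks.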

A. Algorithm to calculate WPR(t)VOL

1. Find a website with rich hyperlinks: select a website that has a large number of hyperlinks, together with the time spent on the links, because the algorithm relies on the web structure to calculate page ranks.

2. Build a web graph: for the selected website, generate a web graph.

3. Calculate the inlink time weight of every link using equation 8.

4. Calculate the outlink time weight of every link using equation 9.

5. Apply the proposed formula: calculate the page rank of every page using equation 10.

6. Repeat step 5 until the page rank values have stabilized.

Fig. 1 Algorithm to calculate WPR(t)VOL
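The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `time_weights` and `wpr_t_vol` are hypothetical, the time spent on the inlinks or outlinks of a page is assumed to be the sum of the seconds recorded over all visits of those links, an undefined ratio is treated as 0, and the `toy_times` data are invented for the example and are not expected to reproduce the values reported in the paper.

```python
def time_weights(times):
    """times[(v, u)] = list of seconds spent on each visit of the link v -> u.

    Returns the product of the inlink and outlink time weights per link.
    """
    pages = sorted({p for link in times for p in link})
    refs = {v: [u for (src, u) in times if src == v] for v in pages}  # R(v)
    # Assumption: time on a page's inlinks/outlinks = sum over all visits.
    t_in = {u: sum(sum(t) for (_, dst), t in times.items() if dst == u)
            for u in pages}
    t_out = {v: sum(sum(t) for (src, _), t in times.items() if src == v)
             for v in pages}

    def share(totals, v, u):
        denom = sum(totals[p] for p in refs[v])
        return totals[u] / denom if denom else 0.0  # assumption: 0 if undefined

    return {(v, u): share(t_in, v, u) * share(t_out, v, u) for (v, u) in times}


def wpr_t_vol(times, d=0.85, tol=1e-10, max_iter=1000):
    """Repeat the rank update until the values stabilize (steps 5 and 6)."""
    pages = sorted({p for link in times for p in link})
    in_links = {u: [v for (v, dst) in times if dst == u] for u in pages}
    w = time_weights(times)  # steps 3 and 4
    rank = {u: 1.0 for u in pages}
    for _ in range(max_iter):
        new_rank = {
            u: (1 - d) + d * sum(rank[v] * w[(v, u)] for v in in_links[u])
            for u in pages
        }
        converged = max(abs(new_rank[u] - rank[u]) for u in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical web graph with per-visit times in seconds (step 2).
toy_times = {
    ("A", "B"): [2],
    ("A", "C"): [2, 4],
    ("B", "C"): [3, 2],
    ("C", "A"): [3, 3],
}
toy_weights = time_weights(toy_times)
toy_time_ranks = wpr_t_vol(toy_times, d=0.5)
```

A page with a single reference page always passes its full rank to it, since both ratios equal 1; in the toy data this holds for the links B→C and C→A.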

B. Example to illustrate the working of proposed algorithm

The working of the proposed algorithm is illustrated using a hypothetical graph with three web pages A, B and C. Each link connecting one web page to another is labelled with the number of visits of the link and the corresponding time spent on each visit. For example, the link from A to C marked as 2(2,4) represents 2 visits of the link with time spent of 2 sec and 4 sec respectively, as shown in Figure 2.

Fig. 2 Hyperlink structure with time spent on links

[Figure 2: graph over pages A, B and C; surviving link labels: 2 (2), 2 (2,4), 2 (3,2), 2 (3,3)]

The page ranks for the graph in Figure 2 are calculated by applying equation 10 to pages A, B and C. The values of the inlink and outlink time weights are calculated using equations 8 and 9, and the calculated values are substituted into these expressions to obtain the page ranks for d=0.35.

These values are calculated iteratively until the page ranks stabilize; for d=0.35 the final values are A=1.01130, B=0.69826 and C=1.03229. The same procedure is repeated for d=0.50 and d=0.85; for d=0.85 the final values are A=0.60790, B=0.22045 and C=0.53870. The page ranks at the various values of d are given in Table 2.

TABLE 2

VALUE OF PAGE RANKS AT DIFFERENT VALUES OF d

| d | A | B | C |
|------|---------|---------|---------|
| 0.35 | 1.01130 | 0.69826 | 1.03229 |
| 0.50 | 0.98808 | 0.56736 | 0.97616 |
| 0.85 | 0.60790 | 0.22045 | 0.53870 |
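For a three-page graph, the rank update reduces to three coupled linear equations, which makes the tabulated values easy to check by hand. As a sketch, assume the links A→B, A→C, B→C and C→A (an assumption, since the hyperlink figure is not reproduced in this copy). Writing $r_A, r_B, r_C$ for the ranks and $w_{vu}$ for the product of the inlink and outlink time weights of link (v, u), both of which are notation introduced here:

```latex
\begin{aligned}
r_A &= (1-d) + d\, w_{CA}\, r_C \\
r_B &= (1-d) + d\, w_{AB}\, r_A \\
r_C &= (1-d) + d\,\bigl(w_{AC}\, r_A + w_{BC}\, r_B\bigr)
\end{aligned}
```

Under this assumed structure, C is the only reference page of B and A is the only reference page of C, so $w_{BC} = w_{CA} = 1$ and $r_A = (1-d) + d\,r_C$. The d = 0.35 row of Table 2 is consistent with this: $0.65 + 0.35 \times 1.03229 \approx 1.01130$.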

IV. ADVANTAGES AND LIMITATIONS OF PROPOSED ALGORITHM

The proposed algorithm serves the purpose of giving important and relevant pages high popularity, which helps the user to find relevant information easily. The advantages of WPR(t)VOL are:

1. WPR(t)VOL uses the hyperlink structure of web pages together with their usage behaviour, so the pages returned are expected to be highly popular and relevant to the user's needs.

2. The PageRank algorithm using VOL ranks a page on the basis of the number of visits only, while WPR(t)VOL ranks a page using the time spent by users on each link.

3. The number of visits to a web page alone is not enough to explain the user's interest, so the time spent on the links can be used to improve the results.

Some of the limitations of the algorithm are:

o Variable network: In a variable-speed network, it is not possible to know how much time the network takes to transfer the data to the client. An easy and cheap means of estimating network variability is therefore needed before the time spent on a web page can be used as a metric of user interest and data relevancy.

o Cache system: WWW clients and proxy servers use cache systems, which make it harder to account for the time spent by a user on web pages. A cache stores the pages accessed and serves them whenever the user requests them again, without leaving a trace in the server log files. Caching improves efficiency, but it creates problems for measuring the time spent by a user on a web page.
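Both limitations are easier to see with a concrete estimate of time spent. The sketch below uses a common approach in web usage mining, which the paper does not spell out: the time spent after following a link is taken as the gap between that request and the same user's next request. The record layout `(user, timestamp, referrer, page)`, the function name `time_spent_on_links` and the 30-minute session timeout are assumptions of this example.

```python
from collections import defaultdict


def time_spent_on_links(log, session_timeout=1800):
    """Estimate seconds spent after following each link (referrer -> page).

    log: iterable of (user, timestamp_seconds, referrer, page) records,
    a hypothetical simplification of a server log.
    """
    by_user = defaultdict(list)
    for user, ts, referrer, page in log:
        by_user[user].append((ts, referrer, page))

    times = defaultdict(list)
    for entries in by_user.values():
        entries.sort(key=lambda e: e[0])  # order each user's requests by time
        for (ts, referrer, page), (next_ts, _, _) in zip(entries, entries[1:]):
            dwell = next_ts - ts
            # Skip direct entries (no referrer) and gaps that look like a new
            # session; the last request of a user has no successor at all.
            if referrer is not None and 0 < dwell <= session_timeout:
                times[(referrer, page)].append(dwell)
    return dict(times)


sample_log = [
    ("u1", 0, None, "A"),
    ("u1", 10, "A", "C"),
    ("u1", 14, "C", "A"),
    ("u1", 5000, "A", "B"),
]
link_times = time_spent_on_links(sample_log)
```

The measured gap includes network transfer time, and a page served from a cache never appears in the log, so its viewing time is silently added to the previous link. These are the two effects described in the limitations above.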

V. RESULT ANALYSIS

This section compares the page ranks of web pages using Weighted PageRank (WPR), Weighted PageRank using Visit of Links (WPRVOL) and the proposed algorithm WPR(t)VOL. As observed from the tables, the rank of a web page using time spent on links decreases as the value of the damping factor increases. We have calculated the page rank values at different values of d (0.35, 0.50 and 0.85) based on WPR, WPRVOL and WPR(t)VOL for the web pages shown in Figure 2:

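TABLE 3

COMPARISON OF PAGERANKS

| Algorithm | Page | d = 0.35 | d = 0.50 | d = 0.85 |
|-----------|------|----------|----------|----------|
| WPR       | A    | 1.00535  | 0.97677  | 0.58335  |
| WPR       | B    | 0.70865  | 0.58140  | 0.23335  |
| WPR       | C    | 1.01532  | 0.95351  | 0.51505  |
| WPR(VOL)  | A    | 1.01736  | 1        | 0.64037  |
| WPR(VOL)  | B    | 0.68956  | 0.55556  | 0.21495  |
| WPR(VOL)  | C    | 1.04960  | 1        | 0.57463  |
| WPR(t)VOL | A    | 1.01130  | 0.98808  | 0.60790  |
| WPR(t)VOL | B    | 0.69826  | 0.56736  | 0.22045  |
| WPR(t)VOL | C    | 1.03229  | 0.97616  | 0.53870  |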

Bar charts are used to compare the page ranks of web pages using WPR, WPRVOL and WPR(t)VOL. The values obtained using WPR(t)VOL are better than those of WPR and WPRVOL. WPR uses only Web Structure Mining, whereas WPRVOL and WPR(t)VOL both use Web Structure Mining and Web Usage Mining to calculate the page rank. The proposed algorithm uses the time spent by the user on a web page, which is collected from server logs. Fig. 3 gives the comparison between WPR, WPRVOL and WPR(t)VOL for d=0.35.

Fig. 3 Comparison of page ranks at d=0.35

Fig. 4 Comparison of page ranks at d=0.50

Fig. 5 Comparison of page ranks at d=0.85

VI. CONCLUSION AND FUTURE WORK

The proposed algorithm incorporates the time spent on links into WPRVOL to calculate page rank. This modified algorithm computes the page rank based on the time spent on the visits of links on a page, that is, on the popularity of the pages, so it is more usage oriented than the other algorithms.

Some directions for future work on the proposed algorithm are:

a. Implementation on a large-scale network: The algorithm should be implemented on a large-scale network to check its performance. The page rank has been computed for three web pages only. A web graph with a larger number of web pages should be used to check its accuracy.

c. Information relevancy: Because different users have different needs, web pages are not equally important for all users. Other factors such as age and gender can therefore also be considered when calculating page rank with relevancy.

REFERENCES

[1] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", Department of Computer Science and Engineering, University of Minnesota, Minneapolis, ACM, Jan 2000.

[2] V. Chitraa, Dr. Antony Selvdoss Davamani, "A Survey on Preprocessing Methods for Web Usage Data", Department of Computer Science, CMS College of Science and Commerce, Tamil Nadu, India, IJCSIS, Vol. 7, No. 3, 2010.

[3] Gyanendra Kumar, Neelam Duhan, A. K. Sharma, "Page Ranking Based on Number of Visits of Links of Web Page", Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India, ICCCT 2011.

[4] Neelam Tyagi, Simple Sharma, "Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-2, Issue-3, July 2012.

[5] Sonal Tuteja, "Enhancement in Weighted Page Rank Algorithm Using VOL", Software Engineering, Delhi Technological University, India, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume X, Issue X (Sep. - Oct. 2013), PP 01-00.

[6] Rodney Fuller, Johannes J. Graaff, "Measuring User Motivation from Server Log Files", Microsoft Usability: Designing for web.

[7] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Computer Science Department, Stanford University, Stanford, CA 94305.

[8] Wenpu Xing and Ali Ghorbani, "Weighted PageRank Algorithm", Faculty of Computer Science, University of New Brunswick, Fredericton, NB, E3B 5A3, Canada.

[9] Dell Zhang, Yisheng Dong, "A novel Web usage mining approach for search engines", Computer Networks 39 (2002) 303–310.

[10] Rodney Fuller, Johannes J. Graaff, "Measuring User Motivation from Server Log Files", Microsoft Usability: Designing for web.

[11] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Seventh International World-Wide Web Conference (WWW 1998), 14-18 April 1998, Brisbane, Australia.
