Page Ranking Algorithms Based on Link Visit


ABSTRACT

Web search engines face new challenges as the amount of information on the web grows. Web documents have become a main resource for many purposes, and people rely on search engines to retrieve the documents they need. This thesis proposes a dynamic and efficient PageRank-based algorithm that lets search engines return quality results by scoring the relevance of web documents. The modified algorithm yields more relevant results than the original PageRank and reduces the time and effort needed to find the desired documents in the result set returned by the search engine.

Search engines generally return a large number of pages in response to user queries. To help users navigate the result list, ranking methods are applied to the search results. Most of the ranking algorithms proposed in the literature are either link oriented or content oriented and do not consider user usage trends. In this thesis, a page ranking mechanism called PRLV (Page Ranking based on Link Visits) is devised for search engines. It builds on PageRank, the basic ranking algorithm of Google, and takes the number of visits on the inbound links of web pages into account. As described in the literature, PageRank uses the link structure to calculate the importance (rank value) of pages, and on the basis of this importance it sorts the list of pages returned by the search engine in response to a user's query. The rank value of a page does not change as long as the link structure of the web remains constant. In other words, PageRank uses only web structure mining to calculate the rank value of pages. To make the rank value of pages dynamic rather than static, PRLV takes users' behaviour, i.e. link visit information, into account when calculating the importance of pages. This makes it possible to display the most valuable pages at the top of the result list on the basis of user browsing behaviour, which considerably reduces the search space.

1. INTRODUCTION

In recent years, the web has evolved rapidly. The amount of information the Internet offers its users is becoming unmanageable. People searching for information waste a lot of time looking for the desired content. The reasons are that too much information is potentially interesting to a user and that current systems are not good enough at selecting it according to the user's needs.

The majority of users fulfil their information needs by using one of the existing search engines.

Search engines are often really useful, but at other times they do not find what users are searching for. Even when they find something interesting, users have to go through a slow and time-consuming process: filtering and selecting the pages returned by the search engine for the query in order to find those that are really interesting to them.

This process often requires several iterations in which users refine their query, submit it again to the search engine and filter the returned results once more.

One cannot say that the current search process is not useful. It works most of the time, but it can be improved. How? Let us first describe roughly how a search engine works.

1.1 SEARCH ENGINE

A search engine receives a user's query, processes it, and searches its index for relevant documents, i.e. documents that are likely related to the query and expected to be interesting. The search engine then ranks the relevant documents and shows them as results. This process can be divided into the following tasks:


Manoj Kumar¹, Sapana Singh²

¹M.Tech Student, IIMT Engineering College, Meerut

²Asst. Professor, IIMT Engineering College, Meerut

[email protected]¹, [email protected]


Crawling: A crawler is in charge of visiting as many pages as it can and retrieving the information needed from them. This information is stored for later use by the search engine.

Indexing: The information gathered by the crawler has to be stored so that the search engine can access it. Because the user is waiting for the search engine's answer, response time is an important issue. That is why the information is indexed: indexing decreases the time needed to look it up.

Searching: The web search engine represents the user interface needed to permit the user to query the information. It is the connection between the user and the information repository.

Sorting/Ranking: Because of the huge amount of information on the web, a query about a general topic (e.g. java course) matches a very large number of pages. Only a small part of this information will be really interesting to the user. That is why current search engines incorporate ranking algorithms to sort the results.
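The four tasks above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the "crawled" pages are a hard-coded dictionary with hypothetical URLs, the index is a plain in-memory inverted index, and the ranking simply counts query-term occurrences. None of this is the ranking method proposed in this thesis.

```python
from collections import defaultdict

# Hypothetical output of the crawling step: URL -> page text.
CRAWLED = {
    "http://example.org/a": "java course for beginners",
    "http://example.org/b": "advanced java course and java tutorial",
    "http://example.org/c": "cooking course",
}


def build_index(pages):
    """Indexing: map each term to the pages containing it, with term counts."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Searching and ranking: keep pages that contain every query term,
    then sort them by the total number of query-term occurrences."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    scores = {url: sum(index[t][url] for t in terms) for url in candidates}
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))


INDEX = build_index(CRAWLED)
RESULTS = search(INDEX, "java course")
print(RESULTS)
```

Page b is listed before page a because it mentions "java" twice; page c is dropped because it does not contain every query term.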

1.2 RANKING ALGORITHMS

The web is very large and diverse, and many pages can be related to a given query. A method or algorithm is therefore needed to sort all the pages that may be interesting for a user query. The most important ranking algorithms discussed in this thesis are listed below.

PageRank: This algorithm is based on the link structure of the web; it has proved its importance and usefulness in Google.

Weighted PageRank: An improvement over PageRank. It ranks pages according to their importance rather than considering only the link structure of the web graph.

Page Content Rank: Based on content as well as structure data mining.

HITS: This algorithm assumes that for every query topic there is a set of "authoritative" or "authority" pages/sites that are relevant and popular for the topic, and a set of "hub" pages/sites that contain useful links to relevant sites, including links to many related authorities. Alongside PageRank, it has been shown to be one of the two most important ranking algorithms.

SALSA: Improvement over HITS. It behaves better against some properties of web graphs.

Randomized HITS: Several problems of HITS make it unstable and Randomized HITS is a new version that makes use of the stability of PageRank to improve it.

Subspace HITS: Another way to address the instability of HITS by generalizing the result.

SimRank: A new algorithm to measure the similarity of pages based on their link structure.

2. PAGE RANKING ALGORITHM BASED ON LINK-VISITS

This section summarises the biggest problems with the current approaches and describes some possible solutions. The main focus is on the PageRank algorithm because it has proven efficient in the Google search engine.

Rank quality of PageRank:

Current rankings have shown really high quality, and the proof is the success of Google, where they are still being used. However, some improvements can still be made.

Data Mining Technique of PageRank:

The PageRank algorithm uses only Web Structure Mining and Web Content Mining techniques; it does not use Web Usage Mining, which may significantly improve the quality of the rank of web pages with respect to users' information needs.

PageRank is Static in Nature:

In the PageRank algorithm, the importance or rank score of each page is static in nature. The rank changes only when the link structure of the web changes.
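To make the static nature concrete, the following Python sketch iterates the commonly cited non-normalised PageRank formula PR(p) = (1 - d) + d · Σ PR(b)/|O(b)|, where the sum runs over the pages b linking to p. The four-page graph, the function name `pagerank` and the default damping factor 0.85 are illustrative assumptions; the sketch also assumes that every page has at least one out-link.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(b) / |O(b)|) over pages b linking to p.

    out_links maps each page to the list of pages it links to.
    Assumption: every page has at least one out-link and every target is a key.
    """
    ranks = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - d for page in out_links}
        for page, targets in out_links.items():
            share = ranks[page] / len(targets)  # equal share for every out-link
            for target in targets:
                new_ranks[target] += d * share
        ranks = new_ranks
    return ranks


# Hypothetical link structure, used only for illustration.
GRAPH = {"P": ["Q", "R"], "Q": ["R"], "R": ["P"], "S": ["P", "R"]}
RANKS = pagerank(GRAPH)
print({page: round(rank, 3) for page, rank in sorted(RANKS.items())})
```

The only input is the link structure, so repeated runs on an unchanged graph return the same ranks no matter how often users actually follow each link.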

2.1 PAGE RANK BASED ON LINK VISITS

To solve the problems listed above, an efficient ranking mechanism named PRLV (Page Ranking based on Link Visits), based on Web Structure Mining and Web Usage Mining, is proposed and implemented. It takes the user visits on pages/links into account in order to determine the importance and relevance score of web pages.


To accomplish the complete task, from gathering the usage characterization to the final rank determination, several subtasks are performed. They are outlined in the proposed framework shown in Figure 2.2 and listed below:

• Storage of the user's access information (hits) on an outgoing link of a page in the related server log files.

• Fetching of pages and their access information by the targeted web crawler.

• For each page link, computation of weights based on the probability of the link being visited by users.

• Final rank computation of pages based on the weights of their incoming links.

• Retrieval of ranked pages corresponding to user queries.

The weights are determined for the out-links of pages (inner-document structure mining), and ranks are computed by taking back-links into account (inter-document structure mining). To separate the tasks performed by the different components, the framework is divided into two phases, as shown by the dotted separating line. The first task is performed by the upper phase, while the remaining four tasks are handled by the lower phase. The two phases involved in the work are:

• Link-Visits (hits) Calculation

• Rank Calculation and Result Retrieval

2.2 CALCULATION OF VISITS (HITS) OF LINKS

When a user requests a web page from a web site server (either by accessing a page of the site directly or by following a URL in search engine results), a client-side agent is activated and sends the request signal to the dedicated server-side agent, which downloads the page from the web server (see Figure 2.2). The client-side agent sends the user's access information to the server-side agent, which in turn stores it in the web server log file. This phase calculates the number of visits (from different users) on the outgoing links of web pages. To accomplish this, PRLV uses a client-side script embedded in the client-side agent.

Fig 2.1. Format of Server Log Files

Whenever a web page is accessed, this script is loaded on the client side from the web server and monitors every click the user makes on the hyperlinks of the page. Whenever such an event happens, a signal is sent to the server-side agent with information about the current web page and the hyperlink being accessed. The target web page is loaded on the user's system after this process has been initiated. On the server side, a database of log files records the URLs of the pages, the hyperlinks in those pages and the IP addresses of the users visiting those hyperlinks. The hit count of each hyperlink is also stored; it is incremented every time a hit occurs on the hyperlink and can also be calculated by counting the distinct IP addresses visiting the corresponding page.

Fig 2.2. A Framework for Extraction of Link Visits and Rank Calculations

Algorithm of client side agent is shown in Fig. 2.3 and for server side agent is shown in Fig. 2.4.

Whenever a user browses a web page, the client-side agent is loaded on the user's computer. When the user clicks on the page, the agent checks the clicked element. If it is an anchor tag, the agent calls Server_Agent with the address of the current page and the link attribute of the clicked anchor tag; otherwise it waits for the next click event on the page.
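The decision logic of the client-side agent can be sketched as follows. This is a Python simulation under stated assumptions, not the script used by PRLV: in a browser the agent would be a client-side script reacting to real click events, whereas here a click is a hypothetical dictionary with `tag` and `href` keys and the server-side agent is passed in as an ordinary callable.

```python
def client_agent(current_page_url, clicked_element, server_agent):
    """Forward a click to the server-side agent only if an anchor tag was clicked.

    clicked_element is a hypothetical stand-in for a DOM element, e.g.
    {"tag": "a", "href": "http://example.org/f"}.
    Returns True if the click was reported, False otherwise.
    """
    if clicked_element.get("tag", "").lower() != "a":
        return False  # not a hyperlink: wait for the next click event
    href = clicked_element.get("href")
    if not href:
        return False  # anchor without a link target: nothing to report
    server_agent(current_page_url, href)
    return True


# Stand-in for Server_Agent that just records what it receives.
REPORTED = []


def recording_server_agent(url, outlinked_url):
    REPORTED.append((url, outlinked_url))


client_agent("http://example.org/d",
             {"tag": "a", "href": "http://example.org/f"},
             recording_server_agent)
client_agent("http://example.org/d", {"tag": "img"}, recording_server_agent)
print(REPORTED)
```

Only the first click is reported, because the second one is not on an anchor tag.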

[Fig 2.2 components: User, Client-Side Agent, Web Servers with Server-Side Agents, Web Crawler, Indexer, Index Repository, PRLV Calculator (calculates the page ranks using access information of pages), Query Processor and Search Engine Interface. Labelled flows: Web Browsing, Interlinked Web Pages, Page and its Access Information, Query, Query Terms, Result Pages, Ranked Pages.]


When Server_Agent is invoked by Client_Agent, it first checks the entries in Link_log for URL and Outlinked_URL. If the entry is present in Link_log, it increments the corresponding value of the Hit_count field. If the entry is not present, it first inserts URL and Outlinked_URL into Link_log and initializes Hit_count to one.

Fig 2.3. Algorithm of Client_Agent

Fig 2.4. Algorithm of Server_Agent
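The Server_Agent update described above amounts to an insert-or-increment on Link_log. A minimal Python sketch follows; representing Link_log as an in-memory dictionary keyed by (URL, Outlinked_URL) is an assumption made for illustration, since a real server would keep the log in a file or database.

```python
def server_agent(link_log, url, outlinked_url):
    """Record one hit on the hyperlink from url to outlinked_url.

    link_log maps (URL, Outlinked_URL) -> Hit_count (assumed in-memory store).
    Returns the updated hit count.
    """
    key = (url, outlinked_url)
    if key in link_log:
        link_log[key] += 1  # entry present: increment Hit_count
    else:
        link_log[key] = 1   # entry absent: insert it with Hit_count = 1
    return link_log[key]


LINK_LOG = {}
server_agent(LINK_LOG, "http://example.org/d", "http://example.org/f")
server_agent(LINK_LOG, "http://example.org/d", "http://example.org/f")
server_agent(LINK_LOG, "http://example.org/d", "http://example.org/g")
print(LINK_LOG)
```

After these three calls the link to page f has a hit count of 2 and the link to page g a hit count of 1.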

The proposed format of the log files stored on the web server is shown in Fig 2.1. The link_log contains information about the page URLs, their hyperlinks and the total hit count of each hyperlinked URL. The second log, called access_log, has the same format as the NCSA Combined Log Format [11].

The visit count for a hyperlinked URL can be easily calculated by processing the access_log and counting the distinct IP addresses or User_Ids visiting the URL; the result is stored in link_log.
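As an illustration of this counting step, the sketch below parses lines in the NCSA Combined Log Format and counts distinct client addresses per link. It rests on two assumptions that are not stated in the text: the referer field identifies the page containing the clicked hyperlink, and the requested path identifies the hyperlinked URL. The regular expression, the function name `visit_counts` and the sample lines are likewise illustrative.

```python
import re
from collections import defaultdict

# host ident authuser [date] "method path protocol" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"$'
)


def visit_counts(log_lines):
    """Return {(referer, requested_path): number of distinct client hosts}."""
    visitors = defaultdict(set)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None or match["referer"] in ("", "-"):
            continue  # malformed line, or no referring page recorded
        visitors[(match["referer"], match["path"])].add(match["host"])
    return {link: len(hosts) for link, hosts in visitors.items()}


SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /f.html HTTP/1.0" 200 2326 "http://example.org/d.html" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/2000:13:58:01 -0700] "GET /f.html HTTP/1.0" 200 2326 "http://example.org/d.html" "Mozilla/4.08"',
    '192.0.2.7 - - [10/Oct/2000:14:02:12 -0700] "GET /f.html HTTP/1.0" 200 2326 "http://example.org/d.html" "Mozilla/4.08"',
    '192.0.2.9 - - [10/Oct/2000:14:05:40 -0700] "GET /g.html HTTP/1.0" 200 512 "-" "Mozilla/4.08"',
]
COUNTS = visit_counts(SAMPLE_LOG)
print(COUNTS)
```

The repeated request from 192.0.2.1 is counted once, so the link from d.html to /f.html gets a visit count of 2; the last line is skipped because it carries no referer.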

2.3 CALCULATION OF RANK SCORE

Search engines index most of the pages available on the WWW. For this purpose, the databases of web servers are periodically accessed by crawlers. Here, the working of the crawlers is slightly modified so that they fetch the pages as well as their hit counts stored in the link_logs. The crawled information is sent to an additional component of the search engine called the PRLV Calculator, which calculates the rank values of web pages based on their link cardinality and access information.

Each link in the crawled web graph is assigned a weight, which indicates its probability of being visited by users. A link with a high probability is considered more important than the others. In the proposed system, a page receives a high rank if the links from its back-linked pages carry high weights. Thus, ranking is propagated iteratively through back links. Some definitions related to the proposed ranking method are given below:

Def. 1. Outbound Link

Consider a page p having n hyperlinks embedded in it. The outbound link set is denoted by:

O(p) = {o_1, o_2, …, o_n | each o_i is a URL (page) that can be accessed from page p}.

Def. 2. Inbound Link

A page p is said to have a set B(p) of m inbound links:

B(p) = {b_1, b_2, …, b_m | each b_i is a URL (page) from which p can be accessed}.
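The two definitions are mirror images of each other: the inbound sets can be derived by inverting the outbound sets. A small Python sketch follows, with the hypothetical helper `inbound_sets` and a three-page graph whose links match the terms appearing in equations (3.2a) to (3.2c).

```python
def inbound_sets(outbound):
    """Derive B(p) for every page from the outbound sets O(p)."""
    inbound = {page: set() for page in outbound}
    for page, targets in outbound.items():
        for target in targets:
            # setdefault also covers targets that have no outbound set of their own
            inbound.setdefault(target, set()).add(page)
    return inbound


# O(p) for a three-page example: A links to B; B links to A and C; C links to A and B.
OUTBOUND = {"A": {"B"}, "B": {"A", "C"}, "C": {"A", "B"}}
INBOUND = inbound_sets(OUTBOUND)
print(INBOUND)
```

Here B(A) = {B, C}, B(B) = {A, C} and B(C) = {B}.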

Def. 3. Probability-Weight of Link

If p is a page with outbound-link set O(p) and each outbound link is associated with an integer visit count (VC), then the weight of the outbound link connecting p to o ∈ O(p) is calculated by:

$$\mathrm{Weight}_{link}(p, o) = \frac{VC(p, o)}{\sum_{o' \in O(p)} VC(p, o')} \qquad (2.1)$$

From Definition 3, it may be noted that each out-link has its own weight, which may differ from the weights of the other out-links of the same page.

Example illustration of Weight Calculation:

Consider the example hyperlinked structure shown in Fig 2.5, where the constant on each link indicates the visit count (section 2.2) and the value in brackets indicates the calculated weight (using (2.1)). To understand the weight calculation, take page D, which has 3 outbound links to pages F, G and H with visit counts 100, 75 and 25 respectively. The weight of link (D, F) is:

$$\mathrm{Weight}_{link}(D, F) = \frac{100}{100 + 75 + 25} = \frac{1}{2}$$


Fig 2.5. Example Hyperlinked Structure with Link Visits

Similarly,

$$\mathrm{Weight}_{link}(D, G) = \frac{3}{8}, \qquad \mathrm{Weight}_{link}(D, H) = \frac{1}{8}$$

It may be noted that the sum of the weights of the outbound links of a page is always 1. This mechanism therefore distributes importance unequally among the links of a page, in contrast to the PageRank method, which distributes it equally among all outbound links (see Fig 3.6). The page rank value of a page is calculated from the visits on its inbound links, as described in the next definition.
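The weight calculation of Def. 3 can be checked with a few lines of Python. The function name `link_weights` is hypothetical, exact fractions are used so that the results can be compared directly with the worked example for page D, and the equal-weight fallback for a page whose out-links have never been visited is an added assumption that the text does not cover.

```python
from fractions import Fraction


def link_weights(visit_counts):
    """Weight of each out-link = its visit count / total visits on all out-links.

    visit_counts maps each outbound page o to VC(p, o) for one fixed page p.
    """
    total = sum(visit_counts.values())
    if total == 0:
        # Assumption (not from the text): with no recorded visits,
        # fall back to equal weights, as plain PageRank would.
        return {o: Fraction(1, len(visit_counts)) for o in visit_counts}
    return {o: Fraction(vc, total) for o, vc in visit_counts.items()}


# Page D from the example: visit counts 100, 75 and 25 on links to F, G and H.
WEIGHTS_D = link_weights({"F": 100, "G": 75, "H": 25})
print(WEIGHTS_D)
print(sum(WEIGHTS_D.values()))
```

The result reproduces the weights 1/2, 3/8 and 1/8, and their sum is exactly 1.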

Def. 4. Page Rank based on Link Visits (PRLV)

If p is a page having inbound-linked pages in set B(p), then its rank (PRLV) is given by:

$$PRLV(p) = (1 - d) + d \sum_{b \in B(p)} PRLV(b) \cdot \mathrm{Weight}_{link}(b, p) \qquad (2.2)$$

where d is the damping factor as used in PageRank and Weight_link() is the weight of the link calculated by (2.1).

Example illustrating Calculation of PRLV:

To explain the calculation of PRLV, let us take the same example hyperlinked structure (Fig 2.5) along with the hit counts of links shown in Fig 3.6. The weight of each link is calculated by (2.1) and is shown in brackets with each link.

Fig 3.6. Unequal Distribution of Link Weights

The ranks for pages A, B and C are calculated using (2.2) as follows:

PRLV(A) = (1-d) + d(PRLV(B)·3/4 + PRLV(C)·1/3) (3.2a)

PRLV(B) = (1-d) + d(PRLV(A)·1 + PRLV(C)·2/3) (3.2b)

PRLV(C) = (1-d) + d(PRLV(B)·1/4) (3.2c)

Taking d = 0.5, these equations can easily be solved using the iteration method shown in Table 3.1, and the final results obtained are:

PRLV(A) = 1.08, PRLV(B) = 1.26, PRLV(C) = 0.66

Comparing these calculations with the PR calculations (Section 2.5.2), page B obtains the highest rank, which was not the case in the PR calculations. The ranks obtained differ from the PageRank values.

These values will not change as long as the link structure of the web graph and the visit counts remain the same.
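The iteration can be reproduced with a short Python sketch of equation (2.2). The function name `prlv`, the starting value of 1.0 for every page and the stopping tolerance are assumptions made for illustration; the link weights are the ones appearing in equations (3.2a) to (3.2c).

```python
def prlv(weights, d=0.5, tol=1e-10, max_iter=1000):
    """Iterate PRLV(p) = (1 - d) + d * sum(PRLV(b) * Weight_link(b, p)).

    weights[b][p] is Weight_link(b, p) for the link from page b to page p.
    """
    pages = set(weights) | {p for targets in weights.values() for p in targets}
    ranks = {p: 1.0 for p in pages}  # assumed starting values
    for _ in range(max_iter):
        new_ranks = {p: 1.0 - d for p in pages}
        for b, targets in weights.items():
            for p, w in targets.items():
                new_ranks[p] += d * ranks[b] * w
        delta = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if delta < tol:
            break  # ranks have stopped changing
    return ranks


# Link weights taken from equations (3.2a) to (3.2c).
WEIGHTS = {
    "A": {"B": 1.0},
    "B": {"A": 3 / 4, "C": 1 / 4},
    "C": {"A": 1 / 3, "B": 2 / 3},
}
RANKS = prlv(WEIGHTS, d=0.5)
print({page: round(rank, 2) for page, rank in sorted(RANKS.items())})
# prints {'A': 1.08, 'B': 1.26, 'C': 0.66}
```

The rounded values agree with the results quoted above. Changing any visit count changes the weights and therefore the ranks, which is what makes PRLV dynamic.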

3. CONCLUSION

This thesis explained the current state of ongoing work in the Web Mining field, focusing on search engines and ranking algorithms. It described the processes currently used and their main problems. A new idea, PRLV, was proposed to solve or at least mitigate them. The whole PRLV process, from beginning to end, has been presented and analysed. In addition, PRLV has been tried on real cases and real data (not only theoretical examples), which indicates that the solution is feasible and works well. Under these assumptions, PRLV behaved better than the existing approaches.

Users generally spend a lot of time sifting through the search results to find the relevant pages. The paper presented a novel page ranking algorithm called PRLV that provides more relevant results than the original PageRank. PRLV calculates the rank value of a web page based on the user visits on the incoming links of that page. Ordering pages in this way increases their relevancy and thereby provides the user with quality search results. As a result, the user may find the desired content in the top few pages, so the search space can be reduced considerably.

4. FUTURE SCOPE

Some of the future work in PRLV system includes the following:

1. Scalability

To check the feasibility of PRLV and obtain some basic results, a small web graph was used. However, a real test requires a bigger web graph. Data is currently being retrieved from the web to enlarge the link structure considered. PRLV will then be tested again to check both the results on the bigger graph and the performance.

2. PRLV Improvement

In the present experiments, PRLV could not be computed over the whole graph because its current implementation requires too many resources. Several improvements could make this feasible, so that the full-graph computation can be added to our experiments.

3. More Information from User

Currently the PRLV implementation uses only link visit information from the user. Other information could also be collected, for example feedback from the search engine about which pages the user chooses from the whole list of results.

4. More Experiments and Evaluation

The results shown in this thesis are only preliminary. More experiments with bigger data sets are needed to show that the PRLV algorithm is really more suitable than the existing ones. A user evaluation would also be a good idea in order to get real feedback from people who are not involved in the research itself.

REFERENCES

[1] Glen Jeh and Jennifer Widom. SimRank: A measure of structural-context similarity. Technical report, Stanford University Database Group, 2001.

[2] R. Cooley, B. Mobasher and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), 1997.

[3] Companion slides for the text by Dr. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.

[4] Jaroslav Pokorny, Jozef Smizansky, Page Content Rank: An Approach to the Web Content Mining.

[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.

[6] C. Ridings and M. Shishigin, Pagerank uncovered. Technical report, 2002.

[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, and R. Stata. Graph structure in the web. In Proceedings of the 9th International World Wide Web Conference, 2000.

[8] Kleinberg, J., Authoritative Sources in a Hyperlinked Environment. Proceedings of the 23rd annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.

[9] C. Ding, X. He, P. Husbands, H. Zha, and H. Simon. Link analysis: Hubs and authorities on the world. Technical report:47847, 2001.

[10] Wenpu Xing and Ali Ghorbani, Weighted PageRank Algorithm, Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR’04), 2004 IEEE.

[11] NCSA Log File Formats: http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html

[12] Salton G. and Buckley, C., 1988. Term Weighting Approaches in Automatic Text Retrieval. In Information Processing and Management. Vol. 24, No. 5, pp. 513–523.

[13] G. Jeh and J. Widom. SimRank: A measure of structural-context similarity, 2002.

[14] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, Mining the Web’s link structure. Computer, 32(8):60–67, 1999.

[15] B. H. Murray and A. Moore. Sizing the internet, July 2000.

[16] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Stable algorithms for link analysis. In Proc. 24th Annual Intl. ACM SIGIR Conference. ACM, 2001.

[17] R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1–6):387–401, 2000.

[18] Mark Levene and Richard Wheeldon. Web dynamics, 2001.

[19] Peter Lyman and Hal R. Varian. How much information, 2000.

[20] Jean-Loup Guillaume and Matthieu Latapy. The web graph: an overview.

[21] What’s the value of data resource management? http://www.dama.org/data facts you can use.htm.


[22] Brian Pinkerton. Finding what people want: Experiences with the web crawler. In The Second International WWW Conference, Chicago, 1994.

[23] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.

[24] Junghoo Cho, Hector García-Molina, and Lawrence Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7):161–172, 1998.

[25] Andrei Broder. A taxonomy of web search. Technical report, IBM Research, 2002.

[26] Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The World-Wide Web. Communications of the ACM, 37(8):76–82, 1994.

[27] Alan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, and Panayiotis Tsaparas. Finding authorities and hubs from link structures on the world wide web. In World Wide Web, pages 415–429, 2001.

[28] Soumen Chakrabarti, Byron E. Dom, David Gibson, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Experiments in topic distillation. In SIGIR Workshop on Hypertext Information Retrieval, 1998.

[29] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg. Mining the Web’s link structure. Computer, 32(8):60–67, 1999.

[30] Brian Pinkerton. Finding what people want: Experiences with the web crawler. In The Second International WWW Conference, Chicago, 1994.

[31] Saeko Nomura, Satoshi Oyama, Tetsuo Hayamizu, Analysis and Improvement of HITS Algorithm for Detecting Web Communities.

[32] Longzhuang Li, Yi Shang, and Wei Zhang, Improvement of HITS-based Algorithms on Web Documents, WWW2002, May 7-11, 2002, Honolulu, Hawaii, USA. ACM 1-58113-449-5/02/0005.

[33] Krishna Bharat and Monika R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, pages 104–111, Melbourne, AU, 1998.

[34] Neelam Duhan, A. K. Sharma, Komal Kumar Bhatia, “Page Ranking Algorithms: A Survey”. In International Advance Computing Conference, 2009. IACC 2009 IEEE.

[35] Chia-Chen Yen, Jih-Shih Hsu, “Associate Pagerank: Improved Pagerank Measured by Frequent Term Sets”. International Conference on Virtual Environments, Human-Computer Interfaces and Measurements Systems, VECIMS 2009, Hong Kong, China, May 11-13, 2009.

[36] Zdravko Markov and Daniel T. Larose, “Mining the Web: Uncovering Patterns in Web

[37] Wenpu Xing and Ali Ghorbani, “Weighted PageRank Algorithm”, Proc. of the 2nd Annual Conference on Communication Networks & Services Research, 2004.

[38] Google Technology: http://www.google.com/technology/index.html.

[39] R. Lempel and S. Moran, “SALSA: The Stochastic Approach for Link-structure Analysis”. ACM Transactions on Information Systems, 19(2), Apr 2001, pp: 131–160.

[40] http://WWW.webrankinfo.com/english/seonews/topic-16388.htm. January 2006, Increased Google index size.

[41] Naresh Barsagade, Web Usage Mining and Pattern Discovery: A Survey Paper, CSE 8331, Dec. 8, 2003.