Exercise Paper Information Search and Retrieval


Graz University of Technology

WS 2008

Ranking Algorithm and Search Engine Optimization

Mathias Mayrhofer [9630747]

Graz University of Technology

Matthias Hilpold [0330795]

Robert Primschitz [0131085]

Supervisor

Dipl.-Ing. Dr.techn. Christian GÜTL

Institute for Information Systems and Computer Media (IICM), Austria

This work may be used by anyone in accordance with the terms of the Free Content


Abstract

Because the amount of information on the Internet grows constantly, it becomes more and more difficult to find the information one is looking for. This paper describes methods that make it possible to rank the results of search engines according to their importance and relevance. The first chapter discusses the historical development of search engines, the second chapter describes the ranking algorithms, and the last chapter briefly covers what can be done against spam on the Web and which problems modern web sites pose.

Keywords

Ranking Algorithm and Search Engine Optimization, Dr. Vannevar Bush, Memex, Memory Extender, Gerard Salton, Salton’s Magic Automatic Retriever of Text, Vector Space Model, Inverse Document Frequency, Term Frequency, Term Discrimination Value, Relevance Feedback, ARPANet, Advanced Research Projects Agency Network, Archie, Gopher, Veronica, WAIS, Yahoo, Google, Lycos, AltaVista, InDegree algorithm, PageRank algorithm, S. Brin, L. Page, HITS algorithm, Jon M. Kleinberg, SALSA algorithm, Lempel, Moran, Hub-Averaging algorithm, Allan Borodin, Authority Threshold algorithms, Max algorithm, Breadth-First-Search algorithm, Bayesian algorithm, TrustRank, Web search, search engine.


Table of contents

1. Introduction
2. Historical development
2.1 The very first ideas of search engines
2.2 The forming of the first algorithms
2.3 The Internet and its older brother, the ARPANet
2.4 At the beginning of the Internet
2.5 The first generation of search engines
2.5.1 Archie
2.5.2 Gopher & Veronica
2.5.3 WAIS
2.6 Later developed search engines
2.6.1 Yahoo
2.6.2 Google
2.6.3 Lycos
2.6.4 AltaVista
3. Current approaches and examples of Web-based ranking algorithms
3.1 Basic link analysis ranking algorithms
3.1.1 InDegree algorithm [1997]
3.1.2 PageRank algorithm [1998]
3.1.3 HITS algorithm [1998]
3.1.4 SALSA algorithm [2000]
3.2 New link analysis ranking algorithms
3.2.1 Hub-Averaging algorithm [2005]
3.2.2 Authority Threshold AT(k) algorithms [2005]
3.2.3 Max algorithm [2005]
3.2.4 Breadth-First-Search algorithm [2005]
3.2.5 Bayesian algorithm [2005]
4. Search engine optimization approaches and problems
4.1 Using TrustRank against Spam
4.2 Problems with Java, Flash
4.2.1 Flash and Java Applets
4.2.2 JavaScript
5. References

Table of figures

Figure 1: InDegree example
Figure 2: PageRank performance example [Page 1998]
Figure 3: PageRank example iteration 1 with d = 0.85 [Erlhofer 2008]
Figure 4: PageRank example after convergence with d = 0.85 [Erlhofer 2008]
Figure 5: HITS example iteration 1
Figure 6: HITS example iteration 2 after normalization


“The Internet is the world's largest library. It's just that all the books are on the floor.”

- John Allen Paulos, Professor of Mathematics at Temple University

1. Introduction

The flow of information is essential to many different aspects of modern life. That is why our time is called the “age of information”.

This paper is dedicated to “Ranking Algorithm and Search Engine Optimization”. We start with the development of ranking algorithms from the beginning of the “age of information” until today, together with the different search engines that used to be or still are state of the art. The main focus of this paper is on the different ranking algorithms, how they work, and the problems that still need to be solved.

But what exactly is ranking, and why do we need it? To answer this question we have to look at several components.

Through the historical development of information storage and accessibility we have reached a point where the enormous overflow of information can no longer be handled by one person, or even by many people, alone.

The information itself is stored on servers around the globe that are connected through the Internet, and the Internet has no core or centre.

That is why search and ranking algorithms are so important. The user just needs to formulate a query. The algorithm then gathers information from different servers (sources), stored in different file formats, marks the requested information within each file, sorts what it has found, shows it to the user and makes it accessible, and all of this seemingly in an instant.

This means that the requirements placed on search and ranking algorithms are very high, and there is still a lot of work to do.


“Getting information off the Internet is like taking a drink from a fire hydrant.”

- Mitch David Kapor, founder of Lotus Development Corporation

2. Historical development

Because of the growing amount of data on the Internet and the development of new technologies to increase web performance, there was a strong need to invent, reinvent and develop search algorithms that can handle the flood of information and make it accessible even to Internet newcomers. The following section gives an overview of the historical development of search engines.

2.1 The very first ideas of search engines

The first idea of something that would later become the Internet was formulated in 1945. After World War II the American scientist Dr. Vannevar Bush published his article “As We May Think” in “The Atlantic” [Bush 1945]. In this article Dr. Bush describes how the flow of information should be channelled and suggests that it would be a great idea to create a “scientific body” that stores the knowledge of humanity.

“The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships.” [Bush 1945]

The entire article can be seen as the birth of modern information technology. Bush described a device called Memex (Memory Extender), which can be seen as an analogue computer that uses microfilm as its storage medium.

2.2 The forming of the first algorithms

In the 1960s Gerard Salton developed the SMART (Salton’s Magic Automatic Retriever of Text) system, the first system to sort text documents automatically. It combines several concepts that are still in use today:

• Vector Space Model

• Inverse Document Frequency

• Term Frequency

• Term Discrimination Value

• Relevance Feedback


Gerard Salton also published a book on searching for information, called “A Theory of Indexing”.

2.3 The Internet and its older brother, the ARPANet

The ARPANet (Advanced Research Projects Agency Network) was developed for the US Air Force. The first steps towards installing this network took place in 1962. Two concepts were needed to create it: packet-based information transfer and decentralized hardware. Although these two concepts were necessary to build the ARPANet, they did not make it easier to find and rank documents. In 1983 the ARPANet adopted the TCP/IP protocol suite. Since then it has been known as the Internet.

2.4 At the beginning of the Internet

In the early days of the Internet the amount of data stored on the network was much smaller than it is today. At the beginning it was enough to know someone who knew where the desired information could be found. However, the amount of information increased rapidly:

“From its start in 1983, the internet had grown to 1000 hosts in 1984, to 10,000 in 1987, to 100,000 in 1990 and to 1,000,000 in 1992.” [Goedemans 2005]

Besides the growing amount of information, there were other problems that made it much more difficult to find information. The Internet, as mentioned in the previous section, is a decentralized network. This means that no master server tells a search engine where information is stored; the engine has to find the information on its own.

Consequently, the idea came up to build machines that can find any desired information. The first search engine worth mentioning was Archie.

2.5 The first generation of search engines

2.5.1 Archie

Archie, whose name derives from “archives”, was created in 1990 by Alan Emtage, Peter Deutsch and Bill Heelan. The main goal of Archie was to create a centralized database that stores the directory and file listings of anonymous FTP servers. For this purpose Archie used spider technology to scan the FTP servers.

The central database was the great advantage of Archie, but it also caused several problems. A user could only search for file and directory names, not for the content of the documents.

Another great disadvantage appeared with increasing traffic: the central database server was soon overrun by users, which led to an enormous decrease in performance.


2.5.2 Gopher & Veronica

Gopher was introduced in 1991. It is based on the Gopher protocol, which is a further development of the FTP protocol. Every Gopher server implemented this protocol so that users could search for information on that server.

This means that a user had to know where to find the server. To make this easier, a Gopher search server was created. It was called Veronica (Very Easy Rodent-Oriented Network Index to Computerized Archives). Veronica searched every Gopher server using a spider program.

2.5.3 WAIS

WAIS stands for Wide Area Information Server and was developed in 1991 by Thinking Machines Corporation. WAIS also used spiders, but it searched the Internet by content. This means that WAIS was the first search engine to index the contents of files, so users could search for contents instead of just file or directory names. To use WAIS, users had to install a client.

The gathered information was stored on different servers, and a user started a search at a master server by using the client. On this master server all other servers were listed, sorted by topics such as history, physics and health. Consequently a user could find the desired information by searching on the specialized server.

2.6 Later developed search engines

2.6.1 Yahoo

The history of Yahoo started in 1994 with “Jerry's Guide to the World Wide Web”. In the beginning Yahoo did not have a page ranking algorithm in the narrower sense. It was rather comparable to an extremely comprehensive list of links: web sites were subdivided into specialized topics and made accessible to the user in exactly that way. As the amount of information grew, the editorial effort began to dominate, and Yahoo started to look for alternatives. In 2000 it therefore decided to offer Google’s search interface on the Yahoo site. Because Yahoo did not want to become a copy of Google, and eventually to disappear by offering the same information as Google, it decided to partner with Microsoft in 2004.

2.6.2 Google

The history of Google started in 1996 with Google’s ancestor BackRub. BackRub simply counted the external links leading to a given page in order to estimate how important this page might be. In 1998 the founders of BackRub, Larry Page and Sergey Brin, published their search engine ranking technology under the title “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. Google later started its beta test phase, and the search algorithm itself was patented in 2001.


2.6.3 Lycos

Lycos started in July 1994. The main advantage of its search algorithm was a statistical analysis of how close the searched words or terms are to each other within a document.

2.6.4 AltaVista

AltaVista started in 1995 and was designed as a search engine with a commercial background. The crawler it used, called “Scooter”, was a powerful tool that made AltaVista one of the big players on the Internet.


“And bring me a hard copy of the Internet so I can do some serious surfing.”

- Scott Raymond Adams, creator of the Dilbert comic strip

3. Current approaches and examples of Web-based ranking algorithms

In this chapter the most popular link analysis ranking algorithms for the Web are analyzed and discussed. Generally, link analysis ranking algorithms work in two phases. In the first phase a set of web pages is selected. Depending on how this set is found, one distinguishes between query-independent and query-dependent algorithms. PageRank was proposed as a query-independent algorithm that tries to rank all pages on the Web. Other algorithms rank just a subset of web pages that depends on the search query. Kleinberg [Kleinberg 1998] discussed how such a subset can be found: a text-based Web search engine is used to find a query-dependent root set of web pages, which is then extended by the web pages that point to the root set or are pointed to by pages from the root set. The resulting set is then used by these algorithms.

In the second phase of link analysis ranking algorithms the underlying hyperlink graph is built. Nodes correspond to web pages and edges to the links among them. Isolated nodes are removed, and links within a web page are not considered. [Borodin et al. 2005]

3.1 Basic link analysis ranking algorithms

The algorithms discussed in the following are considered the basic link analysis ranking algorithms on the Web: the InDegree, the PageRank, the HITS and the SALSA algorithm.

3.1.1 InDegree algorithm [1997]

The first of the considered algorithms is called InDegree [Marchiori 1997]. Some of the algorithms below are based on it. At this point the term authority has to be introduced: a good authority is a web page that is pointed to by many other web pages, while a page with few incoming links is a weak authority. The InDegree algorithm analyzes the link topology and ranks the web pages according to their link popularity. If a web page has many links pointing to it, it gets a better ranking and thus represents a better authority. The obvious disadvantages of this algorithm are that the ranking value of a web page depends only on the number of incoming links and that every incoming link has the same impact on the ranking value.

That makes the algorithm highly vulnerable to web spam. Further methods are therefore needed to improve the ranking, as the result is good but not yet satisfying and trustworthy enough. [Borodin et al. 2005]


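Algorithm: $\text{authority}(p) = \text{Incoming}(p)$, where $\text{Incoming}(p)$ is the number of web pages that link to page $p$.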
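The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `links` and the function name `indegree` are hypothetical, and the example graph is the one suggested by Figure 1, in which the pages B, C, D and E each link to page A.

```python
# Hypothetical link graph modelled on Figure 1: B, C, D and E each link to A.
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["A"]}


def indegree(links):
    """Return the number of distinct pages linking to each page."""
    authority = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):      # count each linking page only once
            if target != source:         # links within a page are not considered
                authority[target] = authority.get(target, 0) + 1
    return authority


authority = indegree(links)
# Pages sorted from the best to the weakest authority.
ranking = sorted(authority, key=authority.get, reverse=True)
```

Because every incoming link counts the same, adding four throwaway pages that link to a target is enough to lift it to the top, which is the spam weakness mentioned in the text.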

3.1.2 PageRank algorithm [1998]

The PageRank algorithm introduced by S. Brin and L. Page [Page et al. 1998] is based on the InDegree algorithm discussed in the previous section. One of the main disadvantages of the InDegree algorithm is that every incoming link has the same impact on the ranking value. Brin and Page tried to solve this problem by assigning different weights to the incoming links of a web page. In other words, incoming links from many good authorities raise the importance of a web page, while links from weak authorities contribute little. This approach and the InDegree algorithm both belong to the so-called one-level propagation scheme. [Borodin et al. 2005]

The following formula describes the PageRank algorithm. PageRank(A) stands for the PageRank value of a particular web page A, Count(T) for the number of outgoing links on a web page T that points to A, and d is a damping factor between 0 and 1. As the formula shows, a good authority with few outgoing links has a higher impact than an authority of similar quality with many more outgoing links [Erlhofer 2008].

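Algorithm:

$$\text{PageRank}(A) = (1 - d) + d \sum_{i=1}^{n} \frac{\text{PageRank}(T_i)}{\text{Count}(T_i)}$$

where $T_1, \dots, T_n$ are the web pages that link to page $A$.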

Figure 1: InDegree example (pages B, C, D and E each link to page A; authority_A = Incoming(A) = 4, authority_B = authority_C = authority_D = authority_E = 0)


The algorithm performs a random walk on the entire web structure to weight the web pages, simulating the behaviour of a random surfer. The surfer starts from a chosen node and then follows an outgoing link chosen according to a uniform probability distribution. The importance of a web page is thus determined by how often it is linked to by other web pages and by the quality of those pages. As the algorithm is iterative, it takes a certain number of steps before the results are satisfying. Initially all weights are set to 1. Early tests by Brin and Page showed that 52 iteration steps were necessary for about 322 million links. [Borodin et al. 2005]
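The iteration can be sketched as follows. This is a minimal illustration, not the implementation used by Google: it uses the commonly published update PageRank(A) = (1 - d) + d · Σ PageRank(T) / Count(T) over the pages T linking to A, with all weights initialised to 1 as described above. The function name `pagerank` and the link structure `example_links` are assumptions. The edges of the example in Figures 3 and 4 cannot be recovered from the text, so the sketch uses a hypothetical four-page graph that happens to reproduce the converged values reported there (1.49, 0.78, 1.58 and 0.15 for d = 0.85).

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PageRank(A) = (1 - d) + d * sum(PageRank(T) / Count(T))."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}            # initial weights are set to 1
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page q linking to p passes on its rank divided by its
            # number of outgoing links.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical link structure (an assumption, see the lead-in):
# A -> B, A -> C, B -> C, C -> A, D -> C
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(example_links)
```

Page D has no incoming links, so it keeps only the base value 1 - d = 0.15, while C collects rank from three pages and ends up highest.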

Figure 2: PageRank performance example [Page 1998]

The advantages are good results and good performance. A disadvantage is that PageRank favours old web pages, which probably have many links pointing to them, over new web pages with far fewer incoming links. This algorithm is the most widely used commercially and is very successful.

It is used by Brin and Page in the Google search engine for ranking the result pages. But PageRank is not the only factor that determines the ranking position of a web page. Only in the Open Directory Project (Google Directory) is PageRank alone responsible for the ranking position.

How Google's ranking works exactly is known only to Google itself. [Jaster 2006]


3.1.3 HITS algorithm [1998]

To determine the importance of a web page, Jon M. Kleinberg introduced a two-level weight propagation scheme in 1998 [Kleinberg 1998], called the HITS algorithm. HITS stands for Hypertext-Induced Topic Selection. Every web page gets two quality values: the authority weight and the hub weight. The authority weight describes the quality of the web page as a resource itself, while the hub weight describes its quality as a pointer to useful resources. Authorities and hubs are therefore in a mutually reinforcing relationship, which can be seen as a bipartite graph in which hubs point to authorities. A web page that points to good authorities gets a high hub weight, and a web page that is pointed to by good hubs gets a high authority weight. The hub weight of a web page is the sum of the authority weights of the pages it points to, and its authority weight is the sum of the hub weights of the pages that point to it.

Figure 3: PageRank example iteration 1 with d = 0.85 [Erlhofer 2008] (pages A, B, C and D; PageRank(A) = PageRank(B) = PageRank(C) = PageRank(D) = 1)

Figure 4: PageRank example after convergence with d = 0.85 [Erlhofer 2008] (PageRank(A) = 1.49, PageRank(B) = 0.78, PageRank(C) = 1.58, PageRank(D) = 0.15)


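Normalization: after each update the authority weights are divided by the sum of all authority weights, and the hub weights by the sum of all hub weights, as the values 4/4 and 4/16 in Figure 6 indicate.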

Jon M. Kleinberg proposed to calculate the weights iteratively, setting the initial weights to 1. Every iteration step normalizes the weights. The iterative algorithm stops when the weights converge. The normalization has no impact on the convergence or on the relative order of the web pages in the ranking. [Borodin et al. 2005]
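The iterative scheme can be sketched as follows, assuming the five-page example of Figures 5 and 6 (pages B, C, D and E link to page A). The function name `hits` and the fixed number of iterations are assumptions. The sketch normalizes by the sum of the weights, which is what the values 4/4 and 4/16 in Figure 6 suggest; Kleinberg's original formulation normalizes the sum of the squared weights instead, which does not change the resulting order. The sketch also assumes that the graph contains at least one link, so that the sums are not zero.

```python
def hits(links, iterations=20):
    """Compute authority and hub weights by mutual reinforcement."""
    pages = list(links)
    authority = {p: 1.0 for p in pages}       # initial weights are set to 1
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages pointing to p.
        authority = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub weight: sum of the (updated) authority weights p points to.
        hub = {p: sum(authority[q] for q in links[p]) for p in pages}
        # Normalization by the sum of the weights (inferred from Figure 6).
        a_total = sum(authority.values())
        h_total = sum(hub.values())
        authority = {p: v / a_total for p, v in authority.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return authority, hub


# Example graph of Figures 5 and 6: B, C, D and E each link to A.
star = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["A"]}
authority, hub = hits(star)
```

For this graph the weights are already stable after the first step: A is the only authority, and the four linking pages share the hub weight equally.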

In the same year Krishna Bharat enhanced Kleinberg’s HITS algorithm by adding hub and authority weights [Bharat et al. 1998]. In 2002 Longzhuang Li introduced a new weighted HITS algorithm that performs significantly better than Bharat’s improved HITS algorithm. Furthermore, Li combined his own and Bharat’s HITS algorithm with four popular relevance scoring methods: VSM, Okapi, TLS and CDR. The combination of a HITS-based algorithm with any of the four relevance scoring methods performed marginally better than Li’s weighted HITS-based method alone. Among the four relevance scoring methods, Li found no significant difference in quality when they are combined with a HITS-based algorithm. [Longzhuang 2002]

[Figure: HITS example graph with pages A, B, C, D and E; all authority and hub weights shown are initialized to 1]


3.1.4 SALSA algorithm [2000]

The SALSA algorithm is based on both the PageRank algorithm and the HITS algorithm. SALSA stands for Stochastic Approach for Link-Structure Analysis. It was proposed by Lempel and Moran [Lempel 2001]. Like the HITS algorithm, it distinguishes between authority and hub values for a web page, and the structure can again be seen as a bipartite graph of hubs and authorities. Like the PageRank algorithm, it performs a random walk on the graph, but it alternates between hub and authority pages: moving to the next hub is a backward step, moving to an authority page is a forward step. [Borodin et al. 2005]

SALSA can be seen as an extension of the HITS algorithm in which, instead of broadcasting its weight, each hub divides its weight among the authorities it points to. SALSA does not really have the same

Figure 5: HITS example iteration 1

Figure 6: HITS example iteration 2 after normalization (pages B, C, D and E each link to page A; authority_A = 4/4, hub_A = 0; authority_B = authority_C = authority_D = authority_E = 0; hub_B = hub_C = hub_D = hub_E = 4/16)


Algorithm:
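The original formula is not preserved in this text. The following sketch is therefore only an interpretation of the description above, in the form SALSA is commonly presented: each hub divides its weight equally among the authorities it points to, and each authority divides its weight equally among the hubs pointing to it. The function name `salsa`, the page names and the example graph are hypothetical, and the sketch assumes a link list without duplicates.

```python
def salsa(links, iterations=50):
    """SALSA-style weights: hubs and authorities divide their weight."""
    pages = list(links)
    incoming = {p: [q for q in pages if p in links[q]] for p in pages}
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Each hub q splits its weight among the len(links[q]) pages it links to.
        authority = {p: sum(hub[q] / len(links[q]) for q in incoming[p])
                     for p in pages}
        # Each authority q splits its weight among the hubs pointing to it.
        hub = {p: sum(authority[q] / len(incoming[q]) for q in links[p])
               for p in pages}
        a_total = sum(authority.values())
        h_total = sum(hub.values())
        authority = {p: v / a_total for p, v in authority.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return authority, hub


# Hypothetical graph: three hubs, two authorities (X has 3 in-links, Y has 2).
graph = {"H1": ["X", "Y"], "H2": ["X"], "H3": ["X", "Y"], "X": [], "Y": []}
authority, hub = salsa(graph)
```

On this connected example the authority weights settle at 0.6 and 0.4, that is, proportional to the number of incoming links, which illustrates how closely SALSA is related to the InDegree algorithm.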

3.2 New link analysis ranking algorithms

The HITS algorithm proposed by Jon M. Kleinberg [Kleinberg 1998] is symmetric, since the authority weight and the hub weight of a web page are defined in the same way. Furthermore, the algorithm is egalitarian, because all authority weights are equally important for calculating the hub weights and vice versa. In some cases these properties lead to unintuitive ranking results. Therefore other algorithms have been proposed. Some of them are introduced in the following sections. [Borodin et al. 2005]

3.2.1 Hub-Averaging algorithm [2005]

To reduce the influence that the number of authorities a web page points to has on its hub weight, Allan Borodin proposed the Hub-Averaging algorithm in 2005. The authority weights are calculated as in the HITS algorithm and the hub weights as in the SALSA algorithm. It can therefore be seen as a “hybrid” of the HITS and the SALSA algorithms. The symmetric property of the HITS algorithm no longer holds. [Borodin et al. 2005]

Algorithm:
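The formula itself is missing here, so the following is a sketch under an assumption: as the name of the algorithm suggests, the hub weight of a page is taken to be the average of the authority weights of the pages it points to, while the authority weights are summed as in HITS. The function name `hub_averaging` and the example graph are hypothetical.

```python
def hub_averaging(links, iterations=50):
    """Authority weights as in HITS, hub weights as an average."""
    pages = list(links)
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages pointing to p.
        authority = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub weight: average authority weight of the pages p points to
        # (assumed definition); pages without outgoing links get 0.
        hub = {p: (sum(authority[q] for q in links[p]) / len(links[p])
                   if links[p] else 0.0)
               for p in pages}
        a_total = sum(authority.values())
        h_total = sum(hub.values())
        authority = {p: v / a_total for p, v in authority.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return authority, hub


# Hypothetical graph: H2 links to the strong authority X and the weak one Y.
graph = {"H1": ["X"], "H2": ["X", "Y"], "H3": ["X"], "X": [], "Y": []}
authority, hub = hub_averaging(graph)
```

In this example H2 ends up with a lower hub weight than H1 although it links to more pages, because its additional link goes to a weak authority and lowers its average.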

3.2.2 Authority Threshold AT(k) algorithms [2005]

The Hub-Averaging algorithm also has some drawbacks, because it punishes web pages pointing mainly to weak authorities and rewards web pages pointing mainly to good authorities. To solve this problem Allan Borodin also introduced the Authority Threshold AT(k) algorithm in 2005. Hub weights are calculated as in the HITS algorithm, but using only the best k authority weights. Weak authorities no longer have any impact on the hub weight; only good authorities contribute. The authority weights are calculated as in the HITS algorithm. The symmetric and the egalitarian properties of the HITS algorithm no longer hold. If k is equal to the number of outgoing links of a web page, this algorithm corresponds completely to the HITS algorithm. [Borodin et al. 2005]

Algorithm:
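As the formula is not preserved, the following sketch only implements the verbal description: authority weights are summed as in HITS, and the hub weight of a page is the sum of the k largest authority weights among the pages it points to. The function name `authority_threshold` and the example graph are hypothetical.

```python
def authority_threshold(links, k, iterations=50):
    """AT(k): hub weights use only the k best authorities a page points to."""
    pages = list(links)
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages pointing to p.
        authority = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub weight: sum of the k largest authority weights p points to.
        hub = {p: sum(sorted((authority[q] for q in links[p]), reverse=True)[:k])
               for p in pages}
        a_total = sum(authority.values())
        h_total = sum(hub.values())
        authority = {p: v / a_total for p, v in authority.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return authority, hub


# Hypothetical graph: H2 links to the strong authority X and the weak one Y.
graph = {"H1": ["X"], "H2": ["X", "Y"], "H3": ["X"], "X": [], "Y": []}
_, hub_k1 = authority_threshold(graph, k=1)   # only the best authority counts
_, hub_k2 = authority_threshold(graph, k=2)   # no page has more than 2 links
```

With k = 1 all three hubs receive the same weight, since each of them points to the best authority X. With k = 2 no page is cut off, the result coincides with HITS, and H2 is ranked above H1.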

3.2.3 Max algorithm [2005]

The Max algorithm can be seen as the Authority Threshold algorithm with k = 1. That means a good hub is one that points to at least one good authority. The symmetric and the egalitarian properties no longer hold. [Borodin et al. 2005]

Algorithm:

3.2.4 Breadth-First-Search algorithm [2005]

Unlike the HITS and the InDegree algorithm, the BFS algorithm takes more than the one-link neighbourhood into account. The Kleinberg algorithm, for instance, considers the whole graph more than the popularity of the nodes in the graph. The BFS algorithm is a trade-off that takes n steps into account. The algorithm starts from a node i and visits all its neighbours in BFS order. At each iteration it performs a backward or a forward step (analogous to SALSA) and includes all new nodes it encounters. Then the weight factors are updated. Each node is considered only once, when it is first encountered by the algorithm. [Borodin 2001]

Every node j contributes to the weight of node i just once, and the contribution of node j is 1/a^k, where k is the length of the shortest path from j to i. The BFS algorithm thus ranks nodes according to their reachability. The importance of a node is denoted by its connectivity, and that separates the BFS


3.2.5 Bayesian algorithm [2005]

This algorithm is a statistical approach to computing the authority and hub weights. Every node i is given three parameters: e_i is its general tendency to have links, h_i is its tendency to have intelligent links to authority sites, and a_i is its level of authority. The a priori probability of a link from i to j is then calculated by:

$$P(i \to j) = \frac{\exp(a_j h_i + e_i)}{1 + \exp(a_j h_i + e_i)}$$

If there is no link from i to j, the probability is

$$P(i \not\to j) = \frac{1}{1 + \exp(a_j h_i + e_i)}$$

The idea is that the probability is high when e_i is high (meaning that i has a strong tendency to link to any site) or when the product of h_i and a_j is high, which means that i is an intelligent hub and j is a good-quality authority. [Borodin 2001]
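The two probabilities can be evaluated directly. The following sketch only computes the link model for given parameter values; it does not estimate the parameters, which is the actual task of the Bayesian algorithm. The function names and the example values are assumptions chosen for illustration.

```python
import math


def link_probability(e_i, h_i, a_j):
    """Probability of a link i -> j: exp(x) / (1 + exp(x)), x = a_j*h_i + e_i."""
    x = a_j * h_i + e_i
    return math.exp(x) / (1.0 + math.exp(x))


def no_link_probability(e_i, h_i, a_j):
    """Probability that there is no link from i to j."""
    return 1.0 / (1.0 + math.exp(a_j * h_i + e_i))


# Hypothetical values: a page with a low general tendency to link (e_i = -5).
p_weak = link_probability(e_i=-5.0, h_i=0.0, a_j=3.0)    # h_i = 0: not a hub
p_strong = link_probability(e_i=-5.0, h_i=2.0, a_j=3.0)  # good hub, good authority
```

With h_i = 0 the authority of j plays no role and the probability stays below one percent, while a large product of h_i and a_j pushes the probability above one half.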

The parameters µ = -5 and σ = 0.1 are statistical values that define the normal distribution N(µ, σ). For large graphs, these start values have only a small impact on the results and should not depend on the observed data.

There is also a simplified version of the Bayesian algorithm called SBayesian. By eliminating the parameter e_i and removing the exp function, prior values for µ and σ are no longer needed. It performs very similarly to the InDegree algorithm mentioned above. [Borodin et al. 2005]


“Nowadays, anyone who cannot speak English and is incapable of using the Internet is regarded as backward.”

- Al-Waleed bin Talal, ranked by Forbes as the 20th richest person in the world

4. Search engine optimization approaches and problems

4.1 Using TrustRank against Spam

One serious problem for search engines is so-called spam sites. Spammers try to reach high positions in search results so that they can sell their products or clicks more effectively, and they try to exploit the ranking algorithms to their advantage. Usually a user only checks the first 10 results of a web search, so it is very important who wins the battle for the highest ranks.

The TrustRank method [Wu et al. 2004] deals with the ranking of spam sites. It uses a biased PageRank algorithm to calculate a trust score for every site and propagates it to the site's descendants. Initially, human experts select a list of seed sites that are well known and trustworthy on the Web. This method has good precision but poor recall, as it depends on the manually selected seed set and is therefore not broad enough. For this reason Wu et al. proposed the Topical TrustRank algorithm.

This method is based on online directory services like DMOZ or Yahoo. Usually, pages that link to each other have comparable topics. The algorithm therefore partitions the pages into groups of the same topic and calculates the TrustRank for each topic, which leads to much better results than TrustRank itself.

Algorithm:

$$t = \alpha \times T \times t + (1 - \alpha) \times d$$

Here t is the TrustRank score vector, α is the decay factor, T is the transition matrix in which T(i, j) is the probability of following the link from page j to page i, and d is the normalized trust score vector for the seed set. t is initialized with the value of d.
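The update t = α × T × t + (1 - α) × d can be sketched without building the matrix T explicitly, by letting every page pass its trust on to the pages it links to. This is an illustration under assumptions: the function name `trustrank`, the value α = 0.85, the page names and the tiny example graph are hypothetical, every link on a page is assumed to be followed with equal probability, and pages without outgoing links simply pass on no trust.

```python
def trustrank(links, seeds, alpha=0.85, iterations=50):
    """Iterate t = alpha * T * t + (1 - alpha) * d, starting from t = d."""
    pages = list(links)
    # d: normalized trust score vector, non-zero only for the seed pages.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    t = dict(d)                                # t is initialized with d
    for _ in range(iterations):
        new_t = {p: (1 - alpha) * d[p] for p in pages}
        for j in pages:
            for i in links[j]:
                # T(i, j): probability of following the link from page j to i.
                new_t[i] += alpha * t[j] / len(links[j])
        t = new_t
    return t


# Hypothetical graph: a trusted seed page and a good page link to each other,
# while two spam pages link to the good page and to each other.
web = {"seed": ["good"], "good": ["seed"], "spam": ["good", "spam2"], "spam2": ["spam"]}
trust = trustrank(web, seeds={"seed"})
```

Trust only flows along links that start at trusted pages, so the two spam pages keep a score of 0 even though they link to a good page.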

4.2 Problems with Java, Flash

Modern sites use different technologies in order to look better, to be more comfortable for users or simply to increase traffic. The problem for search algorithms in this case is interpreting pages whose content is produced on the client side.

4.2.1 Flash and Java Applets

Flash sites are invisible to most search engines. One solution is to offer an HTML version with the same content as the Flash site. The same problem exists for Java applets.

Google, for instance, has claimed to be able to crawl Flash content by integrating Adobe Flash Player technology into the search algorithm.


“Looking at the proliferation of personal web pages on the Net, it looks like very soon everyone on Earth will have 15 megabytes of fame.”

- M.G. Sriram, Professor at University of Texas Houston

5. References

[Marchiori 1997] Massimo Marchiori: “The Quest for Correct Information on the Web: Hyper Search Engines”

[Page et al. 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: “The PageRank Citation Ranking: Bringing Order to the Web”

[Borodin et al. 2005] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas: “Link Analysis Ranking: Algorithms, Theory, and Experiments”

[Erlhofer 2008] Sebastian Erlhofer: “Suchmaschinen-Optimierung”, Galileo Computing, ISBN 978-3-8362-1233-5

[Kleinberg 1998] Jon M. Kleinberg: “Authoritative Sources in a Hyperlinked Environment”

[Bharat et al. 1998] Krishna Bharat, M. R. Henzinger: “Improved Algorithms for Topic Distillation in a Hyperlinked Environment”

[Longzhuang 2002] Longzhuang Li, Yi Shang, Wei Zhang: “Improvement of HITS-based Algorithms on Web Documents”

[Lempel 2001] R. Lempel, S. Moran: “SALSA: The Stochastic Approach for Link-Structure Analysis”

[Borodin 2001] Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal, Panayiotis Tsaparas: “Finding Authorities and Hubs from Link Structures on the World Wide Web”

[Bharat et al. 2002] Krishna Bharat, George A. Mihaila: “When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics”

[Wu et al. 2004] Baoning Wu, Vinay Goel, Brian D. Davison: “Topical TrustRank: Using Topicality to Combat Web Spam”

[Najork 2007] Marc A. Najork: “Comparing the Effectiveness of HITS and SALSA”

[Bush 1945] Dr. Vannevar Bush: “As We May Think”, http://www.theatlantic.com/doc/194507/bush, last access 05.12.2008

[Goedemans 2005] Rob Goedemans: “Internet History”, http://www.internethistory.leidenuniv.nl/index.php3, last access 05.12.2008

[Jaster 2006] Andreas Jaster: “Der PageRank-Algorithmus”, http://www.suchmaschinen-doktor.de/algorithmen/pagerank.html, last access 04.12.2008