An Alternative Web Search Strategy?

V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007)

Abstract

We propose an alternative Web search strategy taking advantage of the knowledge on complex networks accumulated in the past decade. Adopting a maximally distributed architecture and building up a search infrastructure very similar to the Internet at autonomous system level, it seems possible to cope with the major drawbacks of current search engines. First results from simulations using simplified models appear promising.

I. DRAWBACKS OF CURRENT SEARCH ENGINES

Current popular search services (Google [2], Yahoo [3], MSN Search [4]) share a common approach to Web search. The Web, or part of it, is crawled repeatedly over time from one or more locations, and a simulacrum data set is built from the crawl results. The simulacrum data set is indexed and eventually searched using different algorithms, depending on the search service provider (e.g. the presently most successful PageRank algorithm [5] used by Google). This approach to information gathering on the Web implies a number of systematic, well-known drawbacks:

• due to the crawl-index-search processing chain, results can never be truly up-to-date. The time delay may vary between seconds and weeks depending on the strategies adopted by the search service provider, but it is always present.

• as M. Bergman puts it, “Most of the Web’s information is buried far down on dynamically generated sites, and standard search engines never find it.” [6], so only a (small?) part of the Web can be found and hence used.

• search service providers are trying to capture an exponentially growing monster. In December 2006, the Netcraft Web Server Survey [7] reported more than 100 million Web sites for the first time, and the growth still seems to follow an exponential curve. As current search engines work on a (pruned) copy of the Web, they will have to cope with that growth on several fronts. Besides aggravating the drawbacks already mentioned, the growth of the Web will force search service providers to keep expanding their hardware and software infrastructure.

• the concentration of a key ingredient of access to global Web resources in the hands of a handful of companies represents a single point of failure and is commonly considered questionable.

With the present paper I would like to show a possible solution to all of the above-mentioned points using a peer-to-peer (P2P) architecture that enhances the already existing Web servers. Distributed variants of Web search based on P2P architectures have been analyzed before (e.g. [18]), but they still focused on the crawl-index-search processing chain described above. The present proposal describes a maximally distributed architecture which consistently takes into account the scale-free network character of the Web and tries to factor in as much as possible of our present knowledge of the properties and dynamics of scale-free networks.

II. SEARCH ON SCALE FREE NETWORKS

The topology of the Web, i.e. the topology of the network of hyperlinks, has been found to exhibit scale free network properties [11]. A scale free network can be characterized by its degree distribution following a power law:

$$P(k) \sim k^{-\gamma}$$

where $P(k)$ is the probability of finding a Web page with degree $k$, and the degree $k$ of a Web page is the number of outgoing links (the same applies to the in-degree, the number of hyperlinks pointing to the page). (N.B.: In this paper, in the interest of a broader readership, we limit the description of the network theory aspects to a minimum and refer the interested reader to two excellent reviews by M.E.J. Newman [8] and by R. Albert and A.-L. Barabási [9] for a detailed description of the current knowledge on complex networks.) Scale-free networks can exhibit an interesting property known as the small-world effect [12]. In a small-world network, the distance $\ell$ (i.e. the number of hops) between two arbitrary nodes grows only logarithmically with the number of nodes $N$ in the network, $\ell \sim \log(N)$. The Web has been found to exhibit this property [10], which is of central importance for our ansatz. It implies that navigating from node to node (i.e. following hyperlinks) is an efficient and extremely powerful (i.e. fast) way to move across the Web, in a way very similar to the routing infrastructure of the Internet. Keeping in mind that network topological search strategies can be successfully utilized for Web search (Google’s PageRank is a prominent example [5]), we still need a metric allowing us to determine a hopping direction towards our destination node.
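As a rough illustration of what logarithmic scaling means for the cost of a search, consider the following short calculation. The prefactor $c$ is a hypothetical constant introduced only for this example; no value for it is given in this paper.

```latex
\ell(N) \approx c \,\log_{10} N
\quad\Longrightarrow\quad
\ell(10N) - \ell(N) \approx c \,\log_{10} 10 = c
```

Under this assumption, a tenfold growth of the Web adds only a constant number of hops to a typical path, whereas the effort of crawling and indexing a copy of the Web grows with $N$ itself.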

The adoption of classical search algorithms (DFS, BFS, Dijkstra; see for instance [14] for a good introduction) clearly cannot be our solution due to the massive size of the Web graph. The approach must instead be of a local search type, as already proposed for P2P networks [13], where only local knowledge is used to decide on the next search step. The direction towards the destination Web page is found using the hyperlink structure of the local site only or, if no further hint is found locally, the hyperlinks of some neighbor Web sites (more details can be found in section IV).

III. A NEW SEARCH STRATEGY

We start from the following assumptions:

1. The Web graph is a scale free network showing the “small-world” effect.

2. The hyperlink structure of the Web subgraph for every tag containing the search term is also a scale-free network showing the “small-world” effect.

3. Every Web site is running an appropriate module or application providing a current local information routing table based on local hyperlinks.

4. The search for information can be performed in a way very similar to Internet routing, the hop direction (i.e. metric) being towards nodes with higher in-degree.

5. Most (used) Web sites are located in the strongly connected component of the Web graph.

The local information routing table is built exclusively from the local Web document tree. The HTML documents are scanned, and the hyperlink structure is stored and indexed using the hyperlink tag information. A refresh of the local information can be triggered by any change (adding, modifying or deleting documents) in the local hypertext document tree. It is thus trivial to keep the information routing table up to date.
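The scan described above could look roughly like the following Python sketch. It rests on several assumptions that are not specified in this paper: the "hyperlink tag information" is taken to be the anchor text of each `<a>` element, the table is a plain in-memory mapping from lower-cased words to link targets, and the names `AnchorCollector` and `build_routing_table` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # target of the <a> element currently open
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def build_routing_table(documents):
    """Map each anchor-text word to the set of link targets it labels.

    `documents` maps the URL of a local page to its HTML source.
    """
    table = defaultdict(set)
    for url, html in documents.items():
        parser = AnchorCollector(url)
        parser.feed(html)
        parser.close()
        for target, text in parser.links:
            for term in text.lower().split():
                table[term].add(target)
    return table
```

Rebuilding the table whenever a document changes amounts to calling `build_routing_table` again on the local document tree; an incremental update per changed document would be a natural refinement.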

A user seeking information provides one or more search terms to initiate a Web search. Using the local information routing table, a next-hop Web site/information router is chosen. The Web site is chosen by first searching the information routing table for relevant hyperlinks, the relevance of a hyperlink being determined by the presence of the search term(s) in the hyperlink tag. The next hop will be the Web site with the highest in-degree hyperlink score. If no relevant tags are found in the local information table, the search is passed to a neighbor Web site. The procedure of finding the most relevant Web site and passing the search to that Web site is repeated as long as the relevance (i.e. the number of relevant hyperlinks) grows. If no further reasonable growth is obtained, the search is considered to have reached the destination Web site, and the URL of the destination is back-propagated to the originator of the search. The strategy described is very similar to the routing strategies in the Internet. It thus seems straightforward to adopt all the knowledge of the Internet at autonomous system level to refine the proposed search strategy.
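The hop-by-hop procedure can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that go beyond the text: a single search term, every site's table visible in one dictionary (in a real deployment each site would hold only its own), known in-degree values for linked sites, and a `max_hops` limit plus a visited set to avoid loops. All names (`route_query`, `tables`, `links`, `in_degree`) are hypothetical.

```python
def relevance(site, term, tables):
    """Number of hyperlinks at `site` whose tag contains `term`."""
    return len(tables.get(site, {}).get(term, ()))


def route_query(start, term, tables, links, in_degree, max_hops=20):
    """Return (destination, path) for a single-term query started at `start`."""
    current, path, visited = start, [start], {start}
    for _ in range(max_hops):
        relevant = set(tables.get(current, {}).get(term, ()))
        # Without a relevant local tag, fall back to any neighbor site.
        candidates = (relevant or set(links.get(current, ()))) - visited
        if not candidates:
            break
        # Hop direction: the candidate with the highest in-degree
        # (ties broken by name so the walk is deterministic).
        nxt = max(candidates, key=lambda s: (in_degree.get(s, 0), s))
        if relevant and relevance(nxt, term, tables) <= len(relevant):
            break  # relevance no longer grows: current is the destination
        current = nxt
        path.append(nxt)
        visited.add(nxt)
    return current, path


# Toy data: site "a" has no link tagged "graph" and forwards to a neighbor.
tables = {
    "b": {"graph": {"c", "d"}},
    "c": {"graph": {"d", "e", "f"}},
    "d": {"graph": {"c"}},
}
links = {"a": {"b", "x"}}
in_degree = {"b": 3, "x": 1, "c": 5, "d": 2, "e": 1, "f": 1}

print(route_query("a", "graph", tables, links, in_degree))
# prints ('c', ['a', 'b', 'c'])
```

The returned path is what would be traversed in reverse to back-propagate the destination URL to the originator of the search.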

A search system as described could be implemented as a Web server module, providing the following functionalities:

1. scanning the local document tree and, with that information, updating the local information routing table. The scan can be triggered by the Web server itself, as Web authors commonly check the results of their work using the Web site. This is also the place where hidden/dynamic content can be included.

2. managing the search requests from local and remote users (i.e. remote Web sites) and handling communication with remote Web servers.

3. providing an appropriate Web search front-end for the local users.

As the efficiency of the search strategy critically depends on the presence of the module at most sites (ideally at every site), it appears reasonable to start the implementation as a module for the Apache Web server [15], which is the most common server [7]. Furthermore, only a common effort on an open source basis will provide the impetus necessary to establish the search architecture.

IV. PRELIMINARY RESULTS

We checked the concept of our ansatz using a simulation. We set up a graph (BA graph) with 100000 nodes using Preferential Attachment [11]. We used that algorithm for practical reasons as a first test, although we know that Preferential Attachment does not accurately describe the Web graph (the out-degree of nodes is fixed instead of following a power law, as observed by many authors [8, 9]). We performed searches starting from every node in the generated graph, looking for the highest-ranking node while recording the trajectory through the graph. We observed excellent convergence: after 6 hops, virtually all searches led to the highest in-degree node (figure 1).
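A simulation of this kind can be sketched as follows. This is not the code used for the results reported here, and it is not meant to reproduce figure 1. It is a minimal illustration under assumptions of its own: a directed preferential-attachment graph in which every new node links to `m` distinct older nodes, a greedy walk that always moves to the linked node with the highest in-degree and stops when no linked node has a higher in-degree, and small default parameters (`n=2000`, `m=3`) chosen only to keep the run short. The function names are hypothetical.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Directed preferential attachment: each new node links to m older nodes.

    Returns (out_links, in_degree). The first m nodes have no out-links.
    """
    rng = random.Random(seed)
    out_links = {v: [] for v in range(n)}
    in_degree = [0] * n
    pool = []                  # every node appears once per unit of degree
    targets = list(range(m))   # the first new node links to the m seed nodes
    for v in range(m, n):
        out_links[v] = targets
        for t in targets:
            in_degree[t] += 1
        pool.extend(targets)
        pool.extend([v] * m)
        # Draw m distinct targets for the next node, proportional to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        targets = list(chosen)
    return out_links, in_degree


def greedy_hops(start, out_links, in_degree):
    """Follow the highest in-degree link until in-degree stops growing."""
    current, hops = start, 0
    while out_links[current]:
        best = max(out_links[current], key=lambda v: in_degree[v])
        if in_degree[best] <= in_degree[current]:
            break
        current, hops = best, hops + 1
    return current, hops


def run_demo(n=2000, m=3, seed=0):
    """Fraction of start nodes whose walk ends at the top in-degree node."""
    out_links, in_degree = barabasi_albert(n, m, seed)
    hub = max(range(n), key=in_degree.__getitem__)
    results = [greedy_hops(v, out_links, in_degree) for v in range(n)]
    reached = sum(1 for dest, _ in results if dest == hub)
    return reached / n, max(hops for _, hops in results)


if __name__ == "__main__":
    fraction, longest = run_demo()
    print(f"reached hub: {fraction:.1%}, longest walk: {longest} hops")
```

Because the walk only moves to nodes of strictly higher in-degree, it always terminates; whether it ends at the global hub or at a local maximum is exactly what such a simulation is meant to measure.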

In the near future we plan a detailed analysis of the prospects of our new Web search strategy, focusing on the following areas:

• Fundamentally, we have two ways to analyze the potential of the search algorithm and eventually validate it: using real Web crawl data or using simulations. While the first method is dominated by issues in parsing the crawl data in order to correctly reconstruct the Web graph, the second critically depends on the quality of Web generator algorithms.

• The Web is currently thought to have a complex large-scale structure [16] with nodes. The question here is whether our ansatz is appropriate for building a search infrastructure with reasonable results in such a topology.

Figure 1: Search convergence on simulated BA network of 100000 nodes

• A special emphasis has to be put on security aspects: how can we protect potential users of the Web search from fake results, e.g. from nodes diverting the search from the regular path?

• A great deal of experience has been accumulated in the design and operation of the Internet over the last decades. Due to the similarity of the Web search proposed in this paper to Internet routing, it is straightforward to try to integrate as much of that knowledge as possible into the project, e.g. by analyzing the use of Border Gateway Protocol (BGP) routing for the large-scale architecture of our system.

[1] Barabási, A-L. & Albert, R., Emergence of scaling in random networks, Science 286, 509-512, 1999

[2] Google, http://www.google.com

[3] Yahoo, http://www.yahoo.com

[4] MSN Search, http://www.live.com

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th World Wide Web Conference, pages 107-117, 1998

[6] Bergman, M., The deep Web: Surfacing the hidden value. BrightPlanet, http://www.completeplanet.com/Tutorials/DeepWeb/index.asp, 2000

[7] Netcraft Web Server Survey, http://news.netcraft.com/archives/web_server_survey.html, December 2006

[8] Newman, M.E.J., The structure and function of complex networks, arXiv:cond-mat/0303516, 23 Mar 2003

[9] Albert, R., Barabási, A.-L., Statistical Mechanics of complex networks, Rev. Mod. Phys., Vol. 74, January 2002

[10] Albert, R., Barabási, A.-L., Jeong, H., Diameter of the World-Wide Web, Nature, Vol. 401, 9 September 1999

[11] Albert, R., Barabási, A.-L., Emergence of scaling in random networks, arXiv:cond-mat/9910332, 21 Oct 1999

[12] Watts, D. J., Strogatz, S. H., Collective dynamics of ’small-world’ networks, Nature, Vol. 393, June 1998

[13] Adamic, L.A., Lukose, R.M., Huberman, B.A., Local Search in Unstructured Networks, arXiv:cond-mat/0204181, 4 Jun 2002

[14] Goodrich, M.T., Tamassia, R., Data Structures and Algorithms in Java, John Wiley & Sons, 2001

[15] The Apache Software Foundation, http://www.apache.org

[16] A Broder, R Kumar, F Maghoul, P Raghavan, et al., Graph structure in the Web, Computer Networks, 2000

[17] Li et al., BGP routing dynamics revisited, Comput Commun Rev (2007) vol. 37 pp. 7-16

[18] Bender et al., MINERVA: collaborative P2P search, Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005