Differences between intranet search and Web search


2.5 Search in an intranet environment

2.5.1 Differences between intranet search and Web search

An intranet environment differs from the Internet in many respects. Although both use the TCP/IP protocol suite for communication, the structure in which data is stored differs considerably. The Internet has the World Wide Web as its main structure for storing and interlinking information. As a consequence, users can access all data on the Web using a single tool, namely their web browser. Hypertext also exists in many intranet landscapes, but it is much less prevalent there than on the Internet. In fact, hypertext is just a small piece of the landscape, sharing it with file share protocols, database systems, etc. An intranet environment is a collection of all kinds of electronic information within a private network. Therefore, search in an intranet environment differs from search on the Internet, and the term enterprise search is used instead of web search.

Enterprise search is defined by [Hawking 2004] as search over all electronic text content of an organization, including search of the organization’s external websites, search of internal websites, and search of other electronic text held by the organization in the form of e‐mail, database records, documents on file shares and the like. The differences between search in an enterprise environment and search on the Internet have been analyzed in [Fagin et al. 2003]. Fagin et al. mention six main differences which affect search: the qualitative linkage structure, the macro level structure, spam, the search context, the answer set size, and the search engine friendliness. Each point is briefly outlined below.

Difference 1: Qualitative linkage structure

Comparing solely the hypertext structure of an intranet to the structure of the WWW will reveal significant differences [Fagin et al. 2003; Xue et al. 2003]. While in both cases a collection of interlinked documents is present, the way pages are linked differs.

On the Internet, the linkage structure is generated democratically: each hyperlink set by one of many independent authors can be read as a vote for the target page. According to [Surowiecki 2004], the “collective solution”, i.e. the average guess of a group, outperforms the individual guesses in approximately 98% of cases. Hence, the paradigm of the popularity vote exploited by link‐analysis algorithms like PageRank can be applied.

In intranets, however, this paradigm fails. Here, links between documents reflect something fundamentally different. First, they are often generated by a few employees with special privileges. Second, documents are created to deliver information and not to draw as much attention as possible. Third, links do not reflect a document’s relevance. Indeed, in a company the linkage structure often mirrors the organization’s departmental structure. Therefore, a link pointing to a document cannot be considered a recommendation.

Consequently, search approaches which work well for the Internet might not deliver the expected ranking performance for the hypertext part of an intranet.
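The popularity vote behind link analysis can be made concrete with a small sketch. The following is a simplified power-iteration version of PageRank, not the production algorithm of any search engine; the damping factor of 0.85, the iteration count, and the example graph are assumptions chosen purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "a" and "b" both vote for "c", "c" votes for "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links)
```

In the toy graph, the page with the most incoming votes ends up with the highest rank. When the links are set by a few privileged editors and follow the departmental hierarchy, as described above, the same computation would be expected to reward position in that hierarchy rather than relevance.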

Difference 2: Macro level structure

In [Broder et al. 2000] the macroscopic structure of the Web is described by four pieces. At its heart, the Web consists of a giant strongly connected component (SCC). Every page in the SCC can reach any other page of the SCC along directed links. The second and third pieces are called IN and OUT. IN contains all pages that can reach the SCC but cannot be reached from it. OUT contains all pages that can be reached from the SCC but cannot reach it. The last piece is TENDRILS: pages which can neither reach the SCC nor be reached from it. In [Fagin et al. 2003], a study of the macro level structure of IBM’s intranet was conducted. While the intranet contains the same pieces (SCC, IN, and OUT) as the Web, the distribution of their sizes differs. Most notably, the SCC of IBM’s intranet is much smaller than the SCC of the Web: IBM’s intranet had 10% of its pages in the SCC, while the Web has 30% of its pages in the SCC. Consequently, if we assume that the hyperlink structure of other intranets shows similar differences, we can conclude that PageRank does not work very well in intranets.
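The four pieces can be computed from a link graph with two reachability searches. The sketch below is a minimal illustration and not the procedure used in the cited studies: it assumes that a seed page known to lie in the core SCC is given (finding the largest SCC is left out), and it follows the simplified definition of TENDRILS used above, i.e. every page that neither reaches the SCC nor is reached from it. The function names and the example graph are hypothetical.

```python
from collections import deque


def reachable(start, adjacency):
    """Return all pages reachable from `start` by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in adjacency.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links, seed):
    """Split a link graph into SCC, IN, OUT and TENDRILS.

    links: dict mapping each page to the list of pages it links to.
    seed:  a page assumed to belong to the core SCC.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}

    # Reverse graph: follow links backwards to find pages that reach the seed.
    reverse = {}
    for source, targets in links.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)

    forward = reachable(seed, links)      # pages the seed can reach
    backward = reachable(seed, reverse)   # pages that can reach the seed

    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "TENDRILS": pages - forward - backward,
    }


# Hypothetical example graph.
example_links = {
    "in1": ["a"],
    "a": ["b"],
    "b": ["a", "out1"],
    "t": ["x"],
}
parts = bow_tie(example_links, seed="a")
scc_share = len(parts["SCC"]) / sum(len(p) for p in parts.values())
```

The value `scc_share` corresponds to the kind of figure compared above (the share of pages in the SCC).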

Difference 3: Spam

Spam is a well known problem of today’s Web. This phenomenon does not exist in an intranet, i.e. intranets can be regarded as spam‐free [Fagin et al. 2003]. Employees face high social pressure to deliver good results, i.e. to demonstrate their professional capabilities to colleagues, which in turn can advance their careers. The incentives not to create spam in a private network are thus high. As a consequence, several ranking heuristics (cf. Chapter 2.6; anchor‐text, URL depth, etc.) which cannot be applied on the Internet are applicable in intranets. As an example, consider the hyperlink anchor‐text heuristic, which states that “a page p is more relevant for a query q if the words in q are part of a hyperlink that points to page p”. This is a dangerous criterion on the Internet because spammers can easily misuse it to get their pages ranked higher. In intranets, however, this danger is not an issue.
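The anchor‐text heuristic quoted above can be sketched in a few lines. This is one possible reading of the heuristic, not the scoring used in the cited work: the scoring rule (one point per query word found in an anchor text), the function name, and the example links are assumptions for illustration.

```python
def anchor_text_scores(query, anchors):
    """Score pages by how many query words occur in anchors pointing to them.

    query:   the query string q.
    anchors: list of (anchor_text, target_page) pairs taken from hyperlinks.
    Returns a dict mapping each matching target page to its score.
    """
    query_terms = set(query.lower().split())
    scores = {}
    for anchor_text, target in anchors:
        anchor_terms = set(anchor_text.lower().split())
        matches = len(query_terms & anchor_terms)
        if matches:
            # Each query word found in an anchor counts as one point for p.
            scores[target] = scores.get(target, 0) + matches
    return scores


# Hypothetical intranet hyperlinks.
example_anchors = [
    ("travel expense form", "/hr/expenses"),
    ("expense policy", "/finance/policy"),
    ("cafeteria menu", "/menu"),
]
example_scores = anchor_text_scores("expense form", example_anchors)
```

The sketch also shows why the criterion is fragile on the open Web: anyone who controls many pages can add anchors containing popular query words and raise the score of an arbitrary target.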

Difference 4: Search context

Another difference between search on the Internet and search in a corporate environment concerns the user’s context or point of view. On the Internet, the only contextual information usually available is the location of the searcher and perhaps preferences derived from past actions. In intranets, however, the available contextual information is much richer. Departments, projects, and group memberships are just a few examples. Therefore, considering the user’s context at query time is much more relevant in intranets (Chapter 5).

Difference 5: Answer set size

Queries in intranets often have only a small set of correct answers. At first glance this seems to make search in intranets easier than on the Web, as the number of pages to be ranked is considerably smaller. However, the drawback is that the correct answer page is harder to identify. Having only a few answer pages, or a single one, means that the searcher has to use notions and terms that match the vocabulary of the document. A searcher who uses synonyms or more general or more specific search terms might not find the desired document. On the Web this problem is mitigated by the informational equivalence of many different answer pages.
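The vocabulary problem can be illustrated with a minimal sketch. It assumes plain term matching and adds a simple synonym expansion step, which is a technique introduced here for illustration only and is not proposed in the text above; the synonym table, the function names, and the example document are hypothetical.

```python
def matches(query_terms, document_terms):
    """Return True if at least one query term occurs in the document."""
    return bool(set(query_terms) & set(document_terms))


def expand(query_terms, synonyms):
    """Add known synonyms of each query term to the query."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= synonyms.get(term, set())
    return expanded


# The single correct answer page uses the word "vacation".
answer_page = {"vacation", "request", "form"}

# Hypothetical, hand-made synonym table.
synonym_table = {"holiday": {"vacation", "leave"}}

found_directly = matches(["holiday"], answer_page)
found_after_expansion = matches(expand(["holiday"], synonym_table), answer_page)
```

With only one answer page, the query "holiday" misses the document unless the vocabulary gap is bridged; on the Web, some other equivalent page would likely use the searcher's own term.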

Difference 6: Search engine friendliness

The heterogeneity of platforms and formats makes intranets less search‐engine friendly [Fagin et al. 2003]. Database driven portals, for example, generate multiple views based on the database content. Often, the views contain temporarily generated hyperlinks to other layout styles, or to actions such as deletion, insertion, and modification. If a crawler followed these kinds of links, it could cause unexpected behavior such as the modification of database entries.
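One way a crawler might guard against such action links is to inspect each URL before following it. The sketch below is a heuristic under stated assumptions, not a complete safeguard: the list of suspicious tokens, the function name, and the example URLs are hypothetical, and a real portal may encode actions in ways this check does not detect.

```python
from urllib.parse import urlparse, parse_qs

# Assumed tokens that hint at a state-changing action.
UNSAFE_TOKENS = {"delete", "remove", "insert", "edit", "modify", "update"}


def is_safe_to_follow(url):
    """Return False if the URL looks like a link to a modifying action."""
    parsed = urlparse(url)

    # Check path segments such as /items/42/edit.
    path_parts = {part.lower() for part in parsed.path.split("/") if part}
    if path_parts & UNSAFE_TOKENS:
        return False

    # Check query parameters such as ?action=delete or a bare ?delete.
    query = parse_qs(parsed.query, keep_blank_values=True)
    for key, values in query.items():
        if key.lower() in UNSAFE_TOKENS:
            return False
        if any(value.lower() in UNSAFE_TOKENS for value in values):
            return False
    return True


# Hypothetical links found on a generated portal view.
candidate_links = [
    "http://portal.example/items/42",
    "http://portal.example/items?view=compact",
    "http://portal.example/items?action=delete&id=7",
    "http://portal.example/items/42/edit",
]
links_to_crawl = [url for url in candidate_links if is_safe_to_follow(url)]
```

Links to alternative layout styles (such as the `view=compact` example) pass this check, so a crawler would still need duplicate detection to avoid indexing the same content several times.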

Search engine friendliness is also an issue on the Internet. Otherwise, there would not be so much active research in the area of Semantic Technologies (Chapter 4), which aims at making data machine‐interpretable. Still, on the Internet it suffices to support a few formats to cover large parts of the content. In intranets, by contrast, many different search engine friendly interfaces, targeting many different formats, are required in order to achieve an acceptable coverage of the content.

In document Kohn, Alex (2010): Professional Search in Pharmaceutical Research. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 47-49)