Entropy-Based Link Analysis for Mining Web Informative Structures

(1)

Entropy-Based Link Analysis for Mining Web Informative

Structures

Hung-Yu Kao, Shian-Hua Lin

*

, Jan-Ming Ho

*

, Ming-Syan Chen

Electrical Engineering Department

National Taiwan University Taipei, Taiwan, ROC

[email protected]}

Institute of Information Science* Academia Sinica Taipei, Taiwan, ROC

E-Mail: {shlin, hoho}@iis.sinica.edu.tw

ABSTRACT

In this paper, we study the problem of mining the informative structure of a news Web site which consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by TOC pages through informative links. It is noted that the Hyperlink Induced Topics Search (HITS) algorithm has been employed to provide a solution to analyzing authorities and hubs of pages. However, most of the content sites tend to contain some extra hyperlinks, such as navigation panels, advertisements and banners, so as to increase the add-on values of their Web pages. Therefore, due to the structure induced by these extra hyperlinks, HITS is found to be insufficient to provide a good precision in solving the problem. To remedy this, we develop an algorithm to utilize entropy-based link analysis to mine Web informative structures. This algorithm is referred to as LAMIS, standing for entropy-based

Link Analysis on Mining web Informative Structures. The key idea

of LAMIS is to utilize information entropy for representing the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and recall of LAMIS is much superior to those obtained by heuristic methods and also that the link analysis techniques derived are very powerful to mining the informative structures of news Web sites. In average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 133% to 232% when the desired recall falls between 0.5 and 1.

Keywords

Informative structure, link analysis, hubs and authorities, anchor text, entropy, information extraction.

1. Introduction

Recently, there has been explosive progress in the development of the World Wide Web. This progress creates numerous and various information contents published as HTML pages on the Internet. Furthermore, for the purpose of maintenance and scalability of Web sites, most of the Web contents are migrating from static

pages and general text files to dynamic forms which are generated from predefined templates and diverse content retrieved from back-end databases. In general, pages in most commercial Web sites, e.g., search engines, e-commerce stores, news, etc, are so dynamically generated. Such Web sites are called systematic Web sites in this paper. A news Web site that generates pages with daily hot news and archives historic news is a typical example of a systematic Web site.

Due to the evolution of automatic generation of Web pages, the number of Web pages grows explosively [6] [12]. However, there is a lot of redundant and irrelevant information in Web sites [5], especially in automatically generated pages of systematic Web sites. Examples of redundant and irrelevant information include advertisement banners, browsing menus, catalogs of services, announcements of copyright and privacy policy, and those contents tagged with hyperlinks for easy access to related information. In systematic Web sites, e.g. news Web sites, with the use of redundant and irrelevant links, it is convenient and easy for users to browse and extract informative parts using fewer clicks or shortcuts in any page. However, these redundant links increase the difficulty for search engines or text miners to extract useful information exactly since they are usually designed to index or process everything including redundant and irrelevant information. In general, these redundant and irrelevant information is usually not closely related to the theme of corresponding pages, thereby making it difficult to retrieve and classify the topics of a content page correctly.

Consider the example in Figure 1. We divide the root page of WashingtonPost (http://www.washingtonpost.com, a popular English news Web site) into several parts with different styles and contents, i.e., (1) a banner with links ''news'', ''OnPolitics'', ''Entertainment'', ''Live Online'', etc. on the top, (2) a menu with 22 links of news categories on the left, (3) a banner with advertisement links, (4) general announcements about washingtonpost, (5) a block with promoted hot news and advertisements, (6) a TOC block, and (7) a list with headline news. In this case, parts (1) and (2) are distributed among most pages in WashingtonPost and are redundant information for users. However, they are still indexed by search engines. Such indexing induces the increasing of the index size, while being useless for users and harmful for the quality of search results. Parts (3), (4) and (5) are irrelevant to the context of the page and are called irrelevant information. These parts will make the topic of the page drift when terms in these parts are indexed. The last two parts, (6) and (7), in fact draw more attention from users and are primary contents. Users can get news articles within one click from anchors in these

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

(2)

two parts. The following examples describe their impacts with more detail:

Example 1: After searching ''game hardware tech jobs'' in Google (http://www.google.com)1_{, one of the most popular search engines,}

we found 20 pages of CNET (http://www.cnet.com, a news and product review Web site for computing and technology) in the top-100 results. However, none of these pages came from the Web pages categorized in CNET Job Seeker2_{, which contains the}

desired information. There are matched query terms in redundant parts of pages among all these 20 pages and three of them do not contain any matched terms in the informative parts of pages. The matched terms in redundant parts of a page will increase the rank of the page, even though they are usually ignored by users. Note that six of the 20 pages are ranked as top-16 and the page with the highest ranking does not contain any desirable information in the informative parts.

Figure 1: A sample page of news Web sites

In news Web sites, news article pages are daily updated and news information agents only need to crawl TOC pages first and find news pages through links in TOC pages with the help of the informative structure. They, therefore, do not need to daily crawl and index redundant and irrelevant information. In addition to increasing precision and decreasing the cost of search engines, crawlers and information agents, informative structures of Web sites are useful for wrapper generation in information extraction, e.g., WIEN [17], IEPAD [11] etc. This is because repeated patterns would be more evident in clustered article pages. From our observations, in systematic Web sites, the document structures of pages with near equal authority and hub values are analogous to one another. The precision of wrapper generation will be improved after the preprocessing of mining informative structures on pages. Consequently, with the continuous growth on the number of pages with redundant and tenuous contexts, finding desired information on the Web has become an important and difficult task to solve. Explicitly, extracting informative structures of Web sites to recognize useful pages and links is a crucial issue to increase precision and decrease the cost of search engines and information agents. In this paper, we focus on the mining of news Web sites to demonstrate the problem and the solution proposed in detail. From observing TOC pages and article pages in a news Web site, one may note that TOC pages tend to hold more outlinks than

1 _{The result is queried from www.google.com on February 5, 2002.}

2_{CNET Job Seeker: http://dice.cnet.com/seeker.epl?rel_code=1&op=1}

article pages, and also that lengths of non-anchor context of article pages are longer than those of TOC pages. However, these characteristics are diverse in systematic Web sites and are dependent on presentation styles, policies, and types of contents of Web sites. Moreover, the characteristics will be blurred in a large context and become less discriminating when redundant/irrelevant information and links are induced. The works in [14][15] provide good learning mechanisms to recognize advertisements and redundant and irrelevant links of Web pages. However, these methods need to build the training data first and related domain knowledge must be included to extract features for generation of classification rules. Therefore, it is difficult to extract the informative structures of present systematic Web sites by an automatic and heuristic technique.

Specifically, when information characteristics are considered in link graphs, TOC pages in the informative structure of a Web site present characteristics of good hubs, and article pages are good authorities linked by TOC pages. Therefore, link analysis algorithms, e.g., HITS (Hyperlink Induced Topic Search [16]), provide a reasonable solution to mining the informative structures of news Web sites.

Explicitly, based on link analysis, HITS [16] and the PageRank algorithm [4] applied in Google have provided a new ranking technology for Web search engines. The HITS algorithm based on mutual reinforcement relationship provides an innovative methodology for Web searching and topics distillation. According to the definition in [16], a Web page is an authority on a topic if it provides good information, and is a hub if it provides links to good authorities. In recent research work on link analysis of hyperlinked documents, HITS is applied to the research area of topic distillation and several kinds of link weights are involved to indicate the significance of links in hyperlinked documents. In the Clever system [10], weights tuned empirically are added to distinguish same-site links and others. In [3], the metrics of similarity of whole contents in linked documents are applied on link weights and the use of text surrounding the links as keyword-based evidence to determine a weight for each link is proposed in [9]. Considering the distribution of terms in documents, Chakrabarti et al [7] combine the TFIDF-weighted model and micro-hub to represent the significance of anchors in regions with information needed.

Note that the topic distillation is different from the informative structure mining in several aspects: (1) The former distills hubs and authorities for a given query and the latter mines the informative structure consisting of TOC and article pages on a given Web site; (2) The base set of topic distillation consists of the root set, which is generated from the subset of query results, and the neighboring nodes of the root set. The informative structure mining targets on all pages of a Web site; (3) Algorithms on topic distillation usually omit intra-links and nepotistic links to perform the mutual reinforcement between sites. Such links are important and should be considered as link candidates in the informative structure of a Web site; (4) Most adaptive topic distillation algorithms based on HITS take the relationship between queries and documents into consideration. However, these algorithms do not work well on mining informative structures in the absent of a target query. Furthermore, as described in [8][18], the link analysis algorithms, e.g. HITS, are vulnerable to the effect of nepotistic clique attack and Tightly-Knit Community (TKC). The effects will be more significant for mining informative structures of Web sites

(3)

due to the huge amount of nepotistic links and cliques in a Web site.

Consequently, we propose in this paper an approach called

Entropy-based Link Analysis on Mining Informative Structure

(LAMIS). LAMIS is an automatic informative structure extracting system based on the weighted link analysis of Web pages. When a root URL of a site is given, the system ranks its Web pages according to distinct degrees of hub and authority. However, because a page, especially a systematic page in commercial Web sites, usually carries multiple characteristics of hub, authority and noise, it is in general difficult to discriminate pages according to their hub or authority values when a general link analysis algorithm is applied. One may assign weights on each link of a page to reduce the effects of noise links. In recent researches of topic distillation and link analysis, there are some weighting mechanisms proposed in [3][7][10]. However, due to the lack of query terms, these weighting mechanisms do not work well when we want to mine the information structure of a Web site. Therefore, a new weighting mechanism which considers the significance of links and the contents of information they carry in a Web site, is needed. The key idea of LAMIS is to utilize the information entropy and weigh links with the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Results of experiments on several real news Web sites have shown that LAMIS significantly outperforms the companion heuristic algorithms and avoids the drawback of general link analysis algorithms for mining informative structures. In average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 133% to 232% when the desired recall falls between 0.5 and 1. Benefiting from the improvement of TOC recognition, LAMIS is shown to be able to mine the informative structure more efficiently and precisely.

The remainder of this paper is organized as follows. In Section 2, we describe the basic idea of LAMIS. The entropy of anchor text and enhanced link analysis is presented in this section. The system design and implementation is described in Section 3. In Section 4, we evaluate the performance of LAMIS by several experiments on Chinese and English news Web sites. We discuss one of the effects of noises and the augmented feature of LAMIS in this section. The paper concludes with Section 5.

2. Model of Link Analysis

In the section, we will describe the detail of our approach with an illustrative example. The introduction of entropy and its application to anchor text are described in Section 2.1. The proposed entropy-based link analysis is presented in Section 2.2.

2.1. Entropy of Anchor Text

While users are browsing the Web, the anchor text is an important clue for users to track and search their desired information. Hence, we extract terms from anchor texts to denote the significance of anchors. In this paper, a term corresponds to a meaningful keyword. The motivation of our approach is that terms distributed in more pages in a Web site usually carry less information to users. In contrast, those appearing in fewer pages carry more information of interest. Hence, we extract terms from anchor texts and use the entropy, which is determined from the probability distribution of terms around the whole document sets, to represent the information strength (rich or poor) of terms. Shannon's information entropy [22] is applied on the term-document matrix

which is generated from the term extraction module to calculate the entropy. By definition, the entropy E can be expressed as

∑

=

− n

i1pilogpi, where pi is the probability of eventi and n is

number of events. By normalizing the weight of a term to be [0, 1], the entropy of term Ti is:

∑

= − = n j ij ij i w w T E 1 2 log ) ( , (1)

in which wij is the value of normalized term-frequency. wij is an entity in the term-document matrix to represent the weight of a term in a page, i.e.,

∑

= = n k ik ij ij tf tf w 1

, where tfij is the term frequency of termi in pagej. To normalize the entropy value of a term to the range [0, 1], the base of the logarithm is chosen to be the number of pages. Equation (1) thus becomes:

pages of set the is D D n where w w T E n j ij n ij i) log , | ,| ( 1

∑

= = − = . (2)

We then define the entropy of anchor ANi as the average entropy of all terms in ANi below:

i k

k

j j

i whereT T T aretermsinanchorAN k T E AN E( ) ( ), 1, 2, , , 1 K

∑

= = . (3)

It is noted that if an anchor contains no terms, the entropy of the anchor is assigned to one.

P0 T D M a trix A N0 0: "h o t n e w s " P₀ P1 P2 P3 P4 A N0 0 A N0 1 A N1 0 A N1 1 A N3 2 A N1 3 A N4 2 A N2 0 A N3 0 A N4 0 A N3 1 A N4 1 A N1 2 P0 P1 P2 P3 P4 P0 1 1 0 0 P1 1 0 1 1 P2 1 0 0 0 P3 1 1 1 0 P4 1 1 1 0 A d ja c e n t M a trix A N0 1: "s a le s " A N3 3 A N4 3 P0 P1 P2 P3P4 T0 ( h o t) 1 0 0 1 1 T1 ( n e w s ) 1 0 0 2 2 T2 ( sa le s) 1 1 0 1 1 T3 ( sp or ts) 0 0 1 0 1 T4 ( b u s in e ss ) 0 0 1 1 0 T5 ( h o m e ) 0 1 1 1 1 P3 A N3 0: "h o t n e w s " A N3 1: "s a le s " A N3 2: "h o m e " A N3 3: "b u s in e s s n e w s " P1 A N1 0: "s p o r ts " A N1 1: "b u s in e s s " A N1 2: "s a le s " A N1 3: "h o m e " P2 A N2 0: "h o m e " P4 A N4 0: "h o t n e w s " A N4 1: "s a le s " A N4 2: "h o m e " A N4 3: "s p o r ts n e w s "

Figure 2 A simple Web site, |D|=5

Consider for example a simple news Web site in Figure 2. Page P0

is the homepage of the Web site. Page P1 is the TOC page with

two news anchors linking to P3 and P4 which are news article

pages in the Web site. Page P2 is an advertisement page linked by

other four pages. Most pages contain anchors, i.e., “home”, “hot news”, and “sales”, linking to Page P0, P1, and P2 respectively.

Page P3 and P4 have cross-reference links as the related news ones

to each other. The entropy of terms in anchor texts of the Web site can be calculated as below:

(4)

. 430 . 0 2 1 log 2 1 ) ( ) ( , 861 . 0 4 1 log 4 1 ) ( ) ( , 655 . 0 5 1 log 5 1 5 2 log 5 2 ) ( , 682 . 0 3 1 log 3 1 ) ( 2 1 5 4 3 4 1 5 5 2 2 1 5 5 1 3 1 5 0

∑

= = = = = − = = = − = = = − − = = − = j j j j T E T E and T E T E T E T E

From the example in Figure 2, we note that terms uniformly distributed around documents have their entropy values close to one, meaning that these terms provide very few information for users. In general, these terms come from links of navigation panels and advertisement banners, like T1, T2, and T5. Hence, anchors contain high entropy terms are deemed as less informative ones. For example, the entropy of anchor AN00 is ₀_.₆₆₉

2 ) (T0+ T1 ₌

E _{, and}

anchor AN00 is thus considered as a less informative link. In contrast, when terms can only be found in one anchor, they own the lowest entropy 0, meaning that users can only find information relevant to these terms through anchors they are located. Users are usually more interested in such information of smaller entropy. This in term means that an anchor with a smaller entropy term should be assigned a larger weight than the one with a larger entropy term. The entropy values of anchors in the sample site are listed in Table 1. The most informative links are AN10 and AN11

which link to article pages P3 and P4 from TOC page P1. Table 1: Entropy values of anchors in Figure 2

P0 P1 P2 AN00 AN01 AN10 AN11 AN12 AN13 AN20 0.669 0.861 0.430 0.430 0.861 0.861 0.861 P3 P4 AN30 AN31 AN32 AN33 AN40 AN41 AN42 AN43 0.669 0.861 0.861 0.543 0.669 0.861 0.861 0.543

2.2. Enhanced Link Analysis with Anchor

Texts

In the link graph of a Web site G=(V, E), HITS algorithm computes two scores for each node v in V, i.e., the hub score H(v) and the authority score A(v). Initially, the two scores of each node are assigned to an equal positive number. Then, it iteratively updates the scores as follows until H and A converge:

∑

∈ ∈ = = E u v E v u u A v H and u H v A ) , ( ) , ( ) ( ) ( ) ( ) ( . (4)

Note that the scores are normalized in each iteration. It is proved in [16] that H and A will eventually converge, i.e., termination of HITS is guaranteed.

In our approach, we incorporate entropy values of anchors as link weights to present the significance of links in a page. Therefore, Equation (4) is modified as follows:

uv E u v uv E v u u A v H and u H v A( ) ( )*α ( ) ( )*α ) , ( ) , (

∑

∈

∑

∈ = = , (5)

where

α

_uv is the weight of ANuv. According to the definition of entropy,

α

_uv is defined as follows:

) (

1 _uv

uv = −E AN

α . (6)

Equation (6) means that the more information a link carries, the larger the weight of the link.

While using mutual reinforcement approach, such as in HITS, to perform the link analysis on pages of a Web site, the TKC effect [18] is more obvious than on the base set of a specific query. This is because that the number of cycle links and nepotistic links dominates the set of links in a Web site. As defined in [18], a tightly-knit community is a small but highly interconnected set of sites. Sites in the TKC will score high in link analysis algorithms, even though they are not authoritative on the given topic. On mining informative structure of a Web site, numerous intra-domain links will form several and complex TKCs. They will unavoidably affect the result of HITS algorithm significantly. The SALSA proposed in [18] is designed to resist this effect and will be empirically evaluated by our experimental studies later. The entropy-based SALSA can be described as below:

. * ) ( deg 1 * ) ( ) ( * ) ( deg 1 * ) ( ) ( ) , ( ) , ( uv E u v uv E v u u ree in u A v H u ree out u H v A α α − = − =

∑

∈ ∈ (7)

The notion out-degree(u) means the number of out-links of node u, and notion in-degree(u) is the number of links pointing to node u. We apply Equation (6) on entropy values of anchors in Figure 2 to obtain corresponding weights. Then we use Equations (4) and (5) to calculate values of hub and authority of all pages to show the effect of entropy-based weighting. In Table 2, hub and authority values are obtained after 10 iterations. Page P1 is ranked as top-1

hub page by entropy-based HITS and pages P3 and P4 are ranked

with the highest authority as well. The result agrees with our expectation. It can also be seen that HITS ranks the advertisement page as the best authoritative page and news article pages as good hub ones.

Table 2: Results of link analysis of HITS and Entropy-based HITS

Method HITS Entropy-based

HITS Authority Hub Authority Hub P0 0.535 0.297 0.229 0.142

P1 0.419 0.524 0.338 0.756

P2 0.576 0.160 0.244 0.031

P3 0.321 0.553 0.622 0.451

P4 0.321 0.553 0.622 0.451

3. The

Design

of

LAMIS

In this section, the LAMIS system is designed to explore hyperlink information to extract and identify the hub and authority pages. The only parameter for the extraction system is the starting URL of a Web site and there is no manual inventions and prior knowledge about the Web site required by LAMIS. We will introduce the system architecture in the Section 3.1 and the process of our adaptive link analysis will be described in Section 3.2.

(5)

3.1. The

System

Architecture

Figure 3 shows the three main components in LAMIS, including (1) a Web crawler to crawl pages and build the link graph of a Web site, (2) a feature extractor module which extracts features of pages, including the in-degree and out-degree of pages, the length of context and terms in links, and (3) a entropy-based link analysis module which recognizes TOC pages and builds up the informative structure of a Web site.

In the crawler module, a starting URL, i.e., the root node in the link graph of a Web site, is given and pages are crawled according to the link structure of the Web site. In LAMIS, we can assign different crawl depths to get a different view of the Web site. The deeper the crawl depth, the more precise the analysis will be, at the cost of higher computing complexity. From Equation (5), it can be verified that the complexity of link analysis algorithms is O(|E|). In our observation on several systematic Web sites, the informative structures of Web sites are mostly located from depth-1 to depth-2 of link graphs (the root node is depth-0). Therefore, without loss of generality the crawl depth is assigned to be three in our experiments.

Once a page has been crawled, related features in the page are extracted by the feature extractor module. Then, the anchor texts in all links are parsed to extract meaningful terms and associated term frequencies are also counted. As we know, extracting English terms is relatively simple. Applying stemming algorithms and removing stop words based on a stop-list, English keywords (terms) can be extracted [21]. Extracting terms used in oriental languages seems to be more difficult because of the lack of separators in these languages. In LAMIS, we use an algorithm to extract keywords from Chinese sentences based on a Chinese term base which is generated via collecting hot queries, excluding stop words, from our search engine3_.

S tarting U R L o f a w e b s ite C ra w ler M odu le P a ge A rc h ive arc hive w e b pa ge s Feature D atabase reco rd link relatio n

and ancho r text

C o nte nt B loc k & Fe atu re E xtrac tion

M o du le p ars e p ag e

read ancho r text / reco rd features H TM L Files in form a tion s tru c ture o f a w e b s ite LAMIS S truc ture Lea rn er

Info rm atio n S tructure D istillatio n M o dule

Figure 3: Flowchart of LAMIS system

For each term, we maintain a term-document matrix (abbreviated as T-D matrix) to represent the corresponding term frequency. While associated features are being extracted and the T-D matrix is constructed, the entropy-based link analysis module ranks pages

3 _{The searching service is a project sponsored by Yam, (http://yam.com/).}

It served the Web users from November, 1998 to December, 2000.

according to the values of hub and authority. TOC pages are selected from the set of high-ranking pages, and then we find corresponding article pages through informative links in TOC pages. The extracted sets of nodes and links form an information structure of the Web site. This in turn means that if we can rank and recognize TOC pages more precisely, we can get a more accurate informative structure directly.

3.2. The process of link analysis

In the entropy-based link analysis module, we calculate the values of hub and authority of each page according to Equations (5) and (6) after the link weights are assigned. In HITS, hub and authority will converge to the principal eigenvector of the link matrix [16]. The weighted HITS algorithm is also proved to be converged if weighting factors are positive [3]. In LAMIS, the weighting factor

uv

α

, is bounded in [0,1] so that the convergence of LAMIS follows. In our experiments, 91.7% of the hub values are zero after 10 iterations on average. Moreover, in our inspection, TOC pages tend to hold high hub values, and we hence use the ranked list to retrieve TOC pages among the whole page sets.

4. Experiments

and

Evaluation

In the section, we describe several experiments conducted on some real news Web sites to evaluate the performance improvement attained by LAMIS. The detail of datasets is described in the Section 4.1. The improvement of the entropy-based link analysis is presented in Section 4.2. We discuss one of noise factors and devise a technique to remedy the effect in Section 4.3. Section 4.4 shows an overall comparison of all experiments.

4.1. Datasets

In our experiments, the datasets4_{contain eight Chinese and five}

English news Web sites as described at Table 3. All of these news sites provide real-time news and historical news browsing services. News in these Web sites cover several domains, including politics, finance, sports, life, international issues, entertainment, health, cultures, etc., and are updated from time to time. In our experiments, the crawl depth is set to 3, and after pages have been crawled, the domain experts5_{inspect the content of each page in}

Chinese news Web sites and mark the classes of pages to build the answer set of datasets according to previous experience and domain knowledge in the maintenance of the news search engine (NSE) of Yam.

As shown in Table 3 the numbers of TOC pages vary among sites, even though the numbers of total pages of some sites are similar. As will be seen later, the diversity of information structures in datasets in fact indicates the several general applicability of LAMIS.

After extracting features from crawled pages, we compute the entropy values of extracted terms and anchors by Equations (2) and (3). As illustrated in Figure 4, in average, 71.6% of links are not informative, i.e. their entropy values are larger than 0.8. As we expect, they are mainly links in navigation panels, advertisements and copyright banners and are indeed pretty uniformly distributed among all pages.

4_{Pages of Web sites in datasets are crawled at 2001/12/27 and 2002/4/11.}

The datasets can be retrieved in our research site http://kp06.iis.sinica.edu.tw/isd/index.html.

5_{The domain experts are the site managers of Yam News Search Engine}

(6)

Table 3: Datasets and related information

Site Abbr. URLs of News Web Sites Total

pages TOC pages Links content blocks CDN www.cdn.com.tw 261 25 1339 892 CTIMES news.chinatimes.com 3747 79 26848 79077 CNA www.cna.com.tw 1400 33 5849 14544 CNET taiwan.cnet.com 4331 78 25844 15912 CTS www.cts.com.tw 1316 31 8915 16149 TVBS www.tvbs.com.tw 740 13 3530 5937 TTV www.ttv.com.tw 861 22 3301 4990 UDN udnnews.com 4676 252 34882 84411 CNN www.cnn.com 626 N/A*₂₁₂₇₆ ₁₁₆₄₃ WP www.washingtonpost.com 1301 N/A 10367 8203

LATIMES www.latimes.com 1119 N/A 25069 8720

CSMONITOR www.csmonitor.com 3618 N/A 31972 14260

DISPATCH www.dispatch.com 603 N/A 711 5862

*: We only consider top-20 precision in experiments of English Web site. Hence, we do not find out all TOC pages in English Web sites.

0 10 20 30 40 50 60 70 80 90 100 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Link Entropy A ccu m ul at in g D is tr ibut ion ( % ) CDN CTIME S CNA CNE T CTS TVB S TTV UDN

Figure 4: Accumulating distribution of link entropy

4.2. The Improvement of Entropy-based Link

Analysis

In order to show the improvement achieved by LAMIS, we construct several experiments under different criteria as listed in Table 4. We evaluate the performance of experiments by measuring the precision and recall of the hub-ranking list. For each Web site, we examine the precisions at 11 standard recall levels, i.e., 0%, 10%, 20%, … , 100%. Then, same as in [2], we average the precision at each recall level as follows:

∑

= = N i i N r P r P 1 ) ( ) ( ,

where P(r) is the average precision at the recall level r, N is the number of datasets used, and Pi(r) is the precision at recall level r for the i-th dataset.

For performance comparison among individual datasets, we use

R-Precision and Precision Histograms for single value summaries

[2]. R-Precision RPA(i)for algorithmA over dataseti is defined as

the precision rate at the R-th position in the ranking, where R is the total number of answers and the precision histogram is defined as:

) ( ) ( ) ( / i RP i RP i RP_A _B = _A − _B .

We scale up the precision histogram by the factor calculated on the number of answers in each dataset for highlights on the improved number of retrieval answers. At first, we compare the performances of HITS, SALSA, and LAMIS. In Table 4, LAMIS can be described as combination of PA-AEN-HITS meaning that entropy-based HITS in the page mode. The notation PA means that the algorithm is operated in the page mode, i.e. each page is

treated as a node. Note that we use the notation LN (i.e., link normalization) to indicate the difference between HITS and SALSA, i.e., SALSA is equal to HITS-LN. In Figure 5, we can see that HITS does not work well since the average R-precision is only 0.27 (shown in Table 5). Moreover, R-precisions of HITS in five datasets are smaller than 0.05, showing that HITS cannot be directly applied on mining the informative structures. However, though the improvements of the precision rates for both entropy-based HITS and SALSA over the original ones range from 0.1 to 0.2, all of these four algorithms perform poor when the desired recall is larger than 0.7.

It is observed that TOC pages have more links than article pages in general. This observation suggests us to rank pages by the number of outlinks on a page, i.e. the out-degree of nodes in the link graph. It is interesting to see from Figure 5 that the simple heuristics OL performs much better than general link analysis algorithms when the desired recall is under 0.7, where the precision decreases suddenly when the recall is larger than 0.7.

It is noted that there are some noise effects influencing the entropy-based link analysis and about 50% of TOC pages are hard to discriminate when these four algorithms are applied. In the following section, we will describe these noise effects and our solutions to these effects.

Table 4: The list of components of experiments

Experiment Abbr. Description

OL rank by number of outlinks in a page

PA the page mode

CB the content block mode

HITS Kleinberg’s HITS

SALSA the stochastic approach for link-structure analysis

LN link normalization

AEN weighted by anchor text entropy

0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Recall P rec is ion OL HITS

HITS-LN (SALSA) LAMIS

LAMIS-LN

Figure 5: The effect of weights on link analysis

4.3. Augmented Features of LAMIS

On inspection of experimental results in Figure 5, we find several noise influences on link analysis algorithms during mining informative structures. The most influential effect will be examined in Section 4.3.1, and we shall devise one technique to remedy its effect.

4.3.1.

Hybridization and infection of hubs and

authorities

When the contexts of pages are more complex, pages may contain more hybrid characteristics of hub and authority. Such a page is called a hybridized page. For example, the leading story page in news Web site contains informative links to other hot-news pages

(7)

and is considered as a TOC page to these hot-news pages. However, this page is in fact a highly authoritative page as well because the hottest news is also pointed to by other pages. Due to the influence of the mutual reinforcement relationship, the hybridization of hubs and authorities affects not only the characteristics of hybridized pages but also that of their neighboring pages. The infection will blur the values of hub and authority. To address this effect, we propose an adaptive approach, namely the content block mode (CB).

Content Block Mode

In Figure 6, we can see the difference of mutual reinforcement propagation between the page mode and the content block mode. In the page mode, authority of P2, i.e., Ap2, can affect the value Ap3 through hub of P1 Hp1. If Pagep2 is authoritative, Ap3 will also be promoted, even though it is not an authoritative page. In the content block mode, we divide Pagep1 into two parts, one contains a link to Pagep2, and the other contains a link to Pagep3. They are treated as separate nodes. Hence, the propagation of high authority of Pagep2 will now be terminated at CB1 and Pagep2 will not be incorrectly promoted.

The block level hub values also help us to extract the informative part of a page. The work in [8] also proposes a fine-grained model based on Document Object Model (DOM [23]) to perform micro-hub computation. In our approach, blocks in the content block mode are delimited by pre-defined HTML tags, e.g., <table>, rather than by the DOM tree used in [8]. It is shown by our experimental results that the delimitation mechanism performs well on discovering informative content blocks of Web documents.

The effect of the content block mode is shown in Figure 7. We note that LAMIS-LN-CB outperforms other schemes when the desired recall is larger than 0.6. We can find that the average precision of LAMIS-LN-CB is smaller than the one of LAMIS-LN when the recall rate is smaller than 0.5. This is because LAMIS-LN-CB ranks two article pages of TVBS and TTV news site in the top-10 hub ranking. Because the sizes of TOC answer sets of these two sites are small, i.e., 13 and 22 respectively, the corresponding precision rates decrease suddenly, which in turn reduces the average precision rate when the desired recall is smaller than 0.5. The performance of the other two experiments on the content block mode is similar to that in the page mode, because the noise effects of TKC and redundant/irrelevant links conceal the improvement of the content block mode.

4.4. Overall Performance Comparison among

All Schemes

We summarize the R-precision values of all experiments in Table 5. The augmented LAMIS (LAMIS-LN-CB), which integrates the content block mode with entropy-based link analysis, attains the best performance metrics. The average R-precision is 0.75. The augmented LAMIS is in fact ranked first for R-precision in four of eight Chinese datasets and ranked second in the other three, showing the improvement on R-precision over general HITS by a factor of 1.78. We can also see the improvement of augmented LAMIS over all Chinese news Web sites in Figure 8. It is shown that LAMIS based schemes in general outperform others and the technique devised in Section 4.3.1 is powerful in dealing with such effect.

P1

P2 P3 P2 P3

CB1 CB2

Page Mode Content Block Mode

Figure 6: Propagations of mutual reinforcement on different modes 0 0.2 0.4 0.6 0.8 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Rec all Pr ec is io n H ITS H ITS -C B L AMIS L AMIS -C B L AMIS -L N L AMIS -L N -C B

Figure 7: Effects of the content block mode and the page mode Table 5: R-Precision of all experiments

R-Precision CDN CTIMES CNA CNET CTS TVBS TTV UDN AVG.

outlinks 0.84 0.67 0.36 0.44 0.42 0.77 0.64 0.43 0.57 HITS 0.48 0.63 0.94 0.03 0.03 0.01 0.05 0.02 0.27 HITS-LN (SALSA) 0.92 0.77 0.97 0.42 0.29 0.77 0.36 0.25 0.59 LAMIS 0.88 0.63 0.30 0.33 0.16 0.92 0.05 0.06 0.42 LAMIS-LN 0.96 0.77 0.52 0.55 0.32 1.00 0.50 0.53 0.64 CB-HITS 0.48 0.22 0.24 0.46 0.52 0.01 0.36 0.28 0.32 CB-HITS-LN 0.96 0.33 0.30 0.45 0.39 0.69 0.23 0.56 0.49 LAMIS-CB 0.88 0.49 0.12 0.31 0.13 0.92 0.18 0.09 0.39 LAMIS-LN-CB 0.92 0.95 0.97 0.51 0.26 1.00 0.86 0.53 0.75

OL: rank by number of outlinks in a page, PA: Page mode, CB: Content block mode, HITS:

Kleinberg's HITS, LN: Link normalization, HITS-LN=SALSA: the stochastic approach for link-structure analysis, , AEN: weighted by anchor text entropy

0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00

CDN CTIMES CNA CNET CTS TV BS TTV UDN A V G.

Datasets R-P re cis io n

outlinks HITS HITS-LN (SALSA) LA MIS-LN-C B

Figure 8: R-Precision improvement of augmented LAMIS

4 0 4 5 5 0 5 5 6 0 6 5 7 0 L A M IS - L N - C B o u tlin k s H IT S S A L S A P rec is ion ( % ) T o p - 2 0

(8)

We conduct our several experiments on English news Web sites and compare the top-20 precision rates in Figure 9. We can note that some results are not as good as those in Chinese news Web sites, though augmented LAMIS still emerges as the winner. The reason is that effects of noises described in Section 4.3 are prominent in English news Web sites and thus affect the top-20 ranking further. For example, four of top-10 hub-ranking pages in DISPATCH and CSMONITOR are article pages with local related news menus, and they appear to receive high hub values due to high weighted links in the irrelevant block.

5. Conclusions

In the paper, we addressed the problem on mining informative structure of Web sites and its corresponding issues. We devised an entropy-based link analysis mechanism to mine the informative structures with high precision. Our approach, i.e., LAMIS, incorporates the entropy value of a link, which is in essence the average of entropy values of terms in the anchor text, into coefficients of the authority-hub equations so as to enhance the entropy-based link analysis for Web structure mining. LAMIS is further augmented to improve the precision and recall in the Web structure mining. LAMIS gives a precise TOC recognition methodology and builds an informative structure by these TOC pages and articles pages that are linked by informative links in TOC pages. Experimental results on several news Web sites have shown that the augmented LAMIS has the capability of mining the informative structures of news Web sites within a very good precision. Moreover, from the experimental studies, we enhance LAMIS and make it practically useful for mining real news Web sites.

6. References

[1] B. Amento, L. Terveen, and W. Hill. Does “Authority” Mean Quality? Predicting Expert Quality Ratings of Web Documents. Proc. of 23th ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.

[2] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information Retrieval. Addision Wesley, 1999.

[3] K. Bharat, M. R. Henzinger. Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proc. of 21th ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998.

[4] S. Brin, L. Page. The anatomy of a large-scale hypertextual Web search engine. Proc. of 7th World Wide Web Conference, 1998.

[5] A. Broder, S. Glassman, M. Manasse, G. Zweig. Syntactic Clustering of the Web. Proc.of 6th_{World Wide Web}

Conference, 1997.

[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the Web. Proc.of 9th_{World Wide Web}

Conference, 2000.

[7] S. Chakrabarti, M. Joshi, V. Tawde. Enhanced Topic Distillation using Text, Markup Tags, and Hyperlinks. Proc. of 24th ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.

[8] S. Chakrabarti. Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. Proc. of 10th_{World Wide Web Conference, 2001.}

[9] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. M. Kleinberg. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Proc. of 7th_{World Wide Web Conference, 1998.}

[10] S. Chakrabarti, B. Dom, S. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the Web's link structure. IEEE Computer, 32(8), pages 60-67, August 1999.

[11] C. H. Chang, S. C. Lui. IEPAD: Information Extraction Based on Pattern Discovery. Proc.of 10th_{World Wide Web}

Conference, 2001.

[12] M.-S. Chen, J.-S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209--221, April 1998.

[13] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: towards automatic data extraction from large Web sites. Proc. of 27th_{International Conference on Very Large Data Bases,}

2001.

[14] B. D. Davison. Recognizing Nepotistic Links on the Web. Proc. of AAAI 2000.

[15] N. Jushmerick. Learning to remove Internet advertisements. Proc. of 3rd_{International Conf. On Autonomous Agents,}

1999.

[16] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. ACM-SIAM Symposium on Discrete Algorithms, 1998.

[17] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In Proc. of the 15th

International Joint Conference on Artificial Intelligence (IJCAI), 1997.

[18] R. Lempel, S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC effect. In 9th

International World Wide Web Conference, Amsterdam, Netherlands, May 2000.

[19] W. S. Li, N. F. Ayan, O. Kolak, Q. Vu. Constructing Multi-Granular and Topic-Focused Web Site Maps. Proc. of 10th_{World Wide Web Conference, 2001.}

[20] P. Pirolli, J. Pitkow, R. Rao. Silk from a sow’s ear: Extracting usable structures from the Web. Proc. of ACM SIGCHI Conference on Human Factors in Computing, 1996.

[21] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley. 1989.

[22] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:398-403, 1948.

[23] W3C DOM. Document Object Model (DOM). http://www.w3.org/DOM/.

Entropy-Based Link Analysis for Mining Web Informative Structures