Countering Web Spam of Link-based Ranking Based on Link Analysis


Procedia Engineering 00 (2011) 000–000 www.elsevier.com/locate/procedia

* Corresponding author. Tel.: +86-21-65983287; E-mail: [email protected]

PEEA 2011


Hongwei Wang* (1), Yuankai Li (1), Kaiqiang Guo (1, 2)

1. School of Economics and Management, Tongji University, Shanghai, 200092, China

2. School of Business, Jinggangshan University, Ji’an, 343009, China

Abstract

We propose a ranking algorithm to help search engines eliminate spam pages. Starting from an initial blacklist containing a small set of identified spam pages, our algorithm evaluates a page from two aspects, namely its spam tendency and its authority. In this way, pages with relatively high quality and low or even no spam tendency obtain high ranks. In the experiments, we tested the anti-spam performance of the algorithm on 3,537,379 pages and 8,456,740 links. The results indicate that our algorithm resists spam considerably better than PageRank.

Selection and/or peer-review under responsibility of ICSS

Keywords: rank algorithm; link analysis; spam tendency; penalty factor; anti-spam

1 Introduction

With the explosive growth of the Internet, users are often overwhelmed by the mass of information it holds. The emergence of search engines has improved search efficiency to some extent. However, the number of search results is so large that users still have to pick out the information they need by themselves. Ranking algorithms, such as PageRank [1] and HITS [2], are therefore applied to facilitate the choice. Because the details of a ranking algorithm are commercially confidential, they are not open to the public.

The rank of a search result has a notable influence on users' attention. According to the well-known American search engine marketing firm iProspect, 56.6% of users only pay attention to the first two pages of search results, 16% of users only pay attention to the first several items, only 23% of users check the first two pages, and only 8.7% of users look beyond three pages. Consequently, optimization services arose to improve the rank of a site in search results. However, some sites try to deceive the search engine and raise their rank by guessing the ranking algorithm and using disguising means. This kind of behavior is defined as spam, which not only degrades the performance of search engines but also seriously reduces users’ satisfaction. Therefore, this paper proposes an anti-spam ranking algorithm based on link analysis to counter such cheating techniques.

Open access under CC BY-NC-ND license.

2 Literature Review

The definition of spam differs among search engines. In summary, there are three perspectives: ① intentionally adding irrelevant keywords to certain parts of a web page's source code; ② deliberately repeating certain keywords in certain parts of a web page's source code, even if they are related to the content of the page; ③ adding one or more links that point to spam pages. However, the pages that spam pages point to are innocent [3].

To recognize and punish spam pages, anti-spam research has emerged. James Caverlee et al [4-5] provided a rigorous study of the critical parameters that can impact source-centric link analysis, such as source size and the presence of self-links. James Caverlee et al also developed a novel credibility-based Web ranking algorithm, CredibleRank, which incorporates credibility information directly into the quality assessment of each page on the Web [6]. Chengmin Liang et al proposed an algorithm that assigns spam values to web pages and semi-automatically selects potential spam pages [7]. Huijia Yu et al argued that, although web spam is highly diverse, its purpose is no more than advertising and value-added services, so they focused their anti-spam study on the intent of spam [8]. Meng-yan Hao et al analyzed the spam techniques that may exist in search engine optimization and proposed different penalties for different types of cheating [9]. G. Mishne et al proposed a method to recognize link spam in blog comments by comparing the language models of the post content and the comments [10]. Michael G. Noll et al proposed that the level of expertise of a user with respect to a particular topic is mainly determined by two factors. Firstly, an expert should possess a high-quality collection of resources, while the quality of a Web resource depends on the expertise of the users who have assigned tags to it. Secondly, an expert should be one who tends to identify interesting or useful resources before other users do [11].

3 Anti-spam Algorithm

As a general rule, a page tends to link to other pages of the same reliability. Here reliability refers to whether a page obtains its good rank through web spam, rather than to the accuracy of its content. Pages with high reliability will not link to pages with low reliability, whereas bad pages will link to good ones to strengthen their own influence. We therefore judge spam tendency and punish spam pages via the quality of their forward links. Based on these considerations, we use spam tendency to measure the reliability of a page; the two are negatively correlated.

3.1 Spam tendency

It is inadequate to detect spam pages only by checking whether a page has a direct link pointing to a page in the blacklist, because only a limited number of spam-prone pages can be found in this way. As figure 1 shows, p_1 is a page in the blacklist, and p_2 and p_3 both have direct forward links pointing to p_1, so we regard p_2 and p_3 as having high spam tendency. Although p_4 has a link pointing to p_2, which has high spam tendency but is not in the blacklist, p_4 would not be punished under such a rule. Following this consideration and referring to BadRank, we let spam tendency spread along forward links. That is, a page's spam tendency increases when it points to a page with high spam tendency.


Figure 1: a simple network

Let G=<P,L> denote a graph model of the Web. The spam tendency can be calculated by formula 1:

$$S_t(p) = \operatorname{tansig}\left((1-d)\,E(p) + d\sum_{i=1}^{n}\frac{S_t(q_i)}{C(q_i)}\right) \qquad (1)$$

Where S_t(p) is the spam tendency of page p; d∈[0,1] is the damping factor; E(p) is the initial spam tendency of page p, with E(p)=1 if p∈P_b (the blacklist) and E(p)=0 otherwise; n=|Out(p)| is the number of forward links of page p; q_i∈Out(p) means that page p has a link pointing to q_i; C(q_i)=|In(q_i)| is the number of backward links of page q_i; S_t(q_i) is the spam tendency of page q_i; and

$$\operatorname{tansig}(x) = \frac{2}{1+e^{-2x}} - 1$$

is used to keep the value of S_t(p) in [0,1], namely 0≤S_t(p)≤1.
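The iteration implied by formula 1 can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `tansig` and `spamTendency`, the adjacency-list representation, the synchronous update scheme, the initialization with E(p), and the fixed iteration count are all assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// tansig(x) = 2 / (1 + e^{-2x}) - 1; for x >= 0 the result lies in [0, 1).
double tansig(double x) { return 2.0 / (1.0 + std::exp(-2.0 * x)) - 1.0; }

// out[p] lists the pages that p links to; blacklisted[p] marks p as a member of P_b.
// The damping factor d and the iteration count are chosen by the caller (assumptions).
std::vector<double> spamTendency(const std::vector<std::vector<int>>& out,
                                 const std::vector<bool>& blacklisted,
                                 double d, int iterations) {
    const std::size_t n = out.size();
    assert(blacklisted.size() == n);

    // C(q) = |In(q)|: number of backward links of each page.
    std::vector<int> inDegree(n, 0);
    for (const auto& links : out)
        for (int q : links) ++inDegree[q];

    // Start from the initial spam tendency E(p).
    std::vector<double> st(n);
    for (std::size_t p = 0; p < n; ++p) st[p] = blacklisted[p] ? 1.0 : 0.0;

    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(n);
        for (std::size_t p = 0; p < n; ++p) {
            double sum = 0.0;
            // inDegree[q] >= 1 here because p itself links to q.
            for (int q : out[p]) sum += st[q] / inDegree[q];
            const double e = blacklisted[p] ? 1.0 : 0.0;
            next[p] = tansig((1.0 - d) * e + d * sum);
        }
        st = next;
    }
    return st;
}
```

On a small graph where page 0 is blacklisted, pages 1 and 2 link to page 0, and page 3 links to page 1, the values settle after a few iterations: pages 1 and 2 receive the same spam tendency, and page 3 receives a smaller but non-zero one because the tendency spreads along its forward link.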

3.2 Penalty factor

Spam tendency indicates how likely a page is to cheat, but we cannot take it directly as the punishing standard. In figure 1, the association from p_2 to p_1 is w(p_2, p_1)=1, while the spam tendency of p_2 equals only 0.199. Punishing a page by its spam tendency alone is therefore not suitable; we should also consider the strength of the association between two pages.

Given G=<P, L>, p∈P and q∈P, let w(p, q) be the strength of the association from page p to page q. If p has a link pointing to q, w(p, q)=1/|Out(p)|; otherwise, w(p, q)=0. In fact, w(p, q) is the probability that p transfers to q. Since p may have links pointing to several spam pages, we calculate the strength of the association from p to spam pages by formula 2:

$$S_r(p) = \begin{cases} \sum\limits_{q_i \in Out(p) \cap P_b} w(p, q_i) & \text{if } p \notin P_b \\ 1 & \text{if } p \in P_b \end{cases} \qquad (2)$$

Where Out(p) is the set of pages that page p links to; P_b is the blacklist; q_i belongs to the blacklist.

Obviously, 0≤S_r(p)≤1.
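As a rough sketch of formula 2, the association of a page with the blacklist is the share of its forward links that land on blacklisted pages. The function name `spamAssociation`, the adjacency-list layout, and the choice to return 0 for a page without forward links are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// S_r(p): 1 if p is in the blacklist, otherwise the sum of w(p, q) = 1/|Out(p)|
// over the forward links of p that point to blacklisted pages.
double spamAssociation(std::size_t p,
                       const std::vector<std::vector<int>>& out,
                       const std::vector<bool>& blacklisted) {
    assert(p < out.size() && out.size() == blacklisted.size());
    if (blacklisted[p]) return 1.0;
    if (out[p].empty()) return 0.0;  // assumption: no forward links means no association

    const double w = 1.0 / static_cast<double>(out[p].size());
    double sr = 0.0;
    for (int q : out[p])
        if (blacklisted[q]) sr += w;
    return sr;
}
```

For example, a page with two forward links, one of which points to a blacklisted page, gets S_r = 0.5, while a page whose only forward link points to a blacklisted page gets S_r = 1.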

The penalty factor can be calculated by formula 3:

$$\gamma(p) = \alpha\,(1 - S_r(p)) + \beta\,(1 - S_t(p)) \qquad (3)$$

Where α+β=1, 0≤α≤1, 0≤β≤1. γ(p) is the penalty factor, which considers both the spam tendency and the association from page p to the spam pages. According to formula 3, the penalty factor is negatively correlated with the spam tendency S_t(p) and, likewise, negatively correlated with S_r(p).
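As a worked illustration, take page p_2 of figure 1, for which the text gives a spam tendency of 0.199 and w(p_2, p_1)=1. Since w(p_2, p_1)=1 means that the only forward link of p_2 points to the blacklisted page p_1, its association with spam pages is 1. The weights α=β=0.5 are chosen here purely for illustration:

$$\gamma(p_2) = 0.5\,(1 - 1) + 0.5\,(1 - 0.199) = 0 + 0.4005 = 0.4005$$

With α=1 the same page would get γ(p_2)=0, and with α=0 it would get γ(p_2)=0.801, which shows how the two weights trade off the direct association against the propagated spam tendency.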

3.3 Link-based anti-spam algorithm

Taking all the factors above into consideration, including the spam tendency, the association with spam pages and the authority of a page, we propose an anti-spam algorithm based on link analysis. Given G=<P, L> and p∈P, the algorithm combines the authority of the page, its spam tendency and its association with spam pages; the latter two are calculated by formula 1 and formula 2 respectively. Consequently, the score of our algorithm can be calculated by formula 4:

(4)

4 Experiment Evaluations

4.1 Setup

The experiment in this paper uses a dataset from Sogou (http://www.sogou.com/labs/) consisting of 3.54 million pages and 8.46 million links. We first pre-process the pages, for example by removing links that point to themselves. This results in 6,031 sites and 27,994 links. However, 3,888 of these sites have only forward links, which may influence the performance of our anti-spam algorithm. Consequently, we add a virtual site that has links pointing to these 3,888 sites. Finally, we get 6,032 sites and 31,882 links.
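The pre-processing described above can be sketched as follows. This is a simplified illustration under stated assumptions: sites are identified by integers 0..siteCount-1, links are stored as (source, target) pairs, and the names `Edge` and `preprocess` are hypothetical. How pages are grouped into sites is not covered here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (source site, target site)

// Removes self-links, then adds one virtual site (id = siteCount) that links to
// every site without backward links. Returns the new number of sites.
int preprocess(int siteCount, std::vector<Edge>& edges) {
    std::vector<Edge> kept;
    std::vector<int> inDegree(siteCount, 0);
    for (const Edge& e : edges) {
        assert(e.first >= 0 && e.first < siteCount);
        assert(e.second >= 0 && e.second < siteCount);
        if (e.first == e.second) continue;  // drop links pointing to themselves
        kept.push_back(e);
        ++inDegree[e.second];
    }
    const int virtualSite = siteCount;
    for (int s = 0; s < siteCount; ++s)
        if (inDegree[s] == 0) kept.emplace_back(virtualSite, s);
    edges = std::move(kept);
    return siteCount + 1;
}
```

Applied to the figures reported in the text, one virtual site plus 3,888 added links turns 6,031 sites and 27,994 links into 6,032 sites and 31,882 links.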

Our anti-spam algorithm is based on a blacklist that contains known spam pages or sites. However, search engines such as Google and Baidu never disclose their lists of spam pages, and it is too subjective to label sites as spam by individual perception, for example by judging every site whose URL contains a pornography-related keyword to be a spam site. Consequently, we regard those sites that return no result after entering “site: URL” in Google as spam sites. In this way, about 5% of the sites are found. According to Z. Gyongyi and H. Garcia-Molina’s experiment, about 10%-15% of pages are spam pages [12]. Thus, in our experiment, it is acceptable that the initial blacklist consists of 5% of the sites. Here is the pseudo code of our algorithm:

int _tmain( int argc, _TCHAR* argv[] ) {
    Init( );       // processing of database: recording the forward links and backward links of each page
    PrRanking( );  // calculating the PageRank value of each page
    StRanking( );  // calculating the spam tendency value of each page according to formula 1
    Penalty( );    // calculating the penalty factor of each page according to formula 3
    STBArank( );   // calculating the final score of each page
    Output( );     // the rank of pages
    return 0;
}
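The paper does not spell out the PrRanking step. A minimal sketch of standard PageRank power iteration, which is one plausible way to obtain the authority score, is given below. The function name `pageRank`, the uniform redistribution of rank from pages without forward links, and the caller-supplied damping factor and iteration count are assumptions, not details taken from the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Standard PageRank by power iteration. out[p] lists the pages that p links to.
// Rank held by pages without forward links is spread uniformly (assumption).
std::vector<double> pageRank(const std::vector<std::vector<int>>& out,
                             double d, int iterations) {
    const std::size_t n = out.size();
    assert(n > 0);
    std::vector<double> pr(n, 1.0 / n);
    for (int it = 0; it < iterations; ++it) {
        double dangling = 0.0;
        for (std::size_t p = 0; p < n; ++p)
            if (out[p].empty()) dangling += pr[p];

        // Every page receives the teleport share plus its part of the dangling rank.
        std::vector<double> next(n, (1.0 - d) / n + d * dangling / n);
        for (std::size_t p = 0; p < n; ++p)
            for (int q : out[p]) next[q] += d * pr[p] / out[p].size();
        pr = next;
    }
    return pr;
}
```

With this formulation the scores always sum to 1, so they can be compared across runs with different blacklists.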

4.2 Our algorithm versus PageRank

A site is called a seed site if it has one or more links pointing to sites in the blacklist. In the experiment there are 640 seed sites, which are important for testing the spam resilience calculated by formula 5:

$$S_{rank}(m) = \frac{\sum_{i=1}^{m} R(E_i)}{\sum_{i=1}^{m} R(B_i)} - 1 \qquad (5)$$

Where R(E_i) represents the rank of seed site E_i according to the candidate ranking algorithm and R(B_i) represents the rank of seed site B_i according to the baseline ranking algorithm. The value m means the top m sites in the ranking, and positive S_rank(m) values indicate that the candidate algorithm is more spam resilient than the baseline.
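The rank-based spam resilience can be sketched as below, reading formula 5 as the ratio of summed seed-site ranks under the candidate algorithm to those under the baseline, minus 1. The function name `rankSpamResilience` and the input layout (two vectors holding the rank positions of the same seed sites, with position 1 being the best rank) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// candidateRanks[i] and baselineRanks[i] hold the rank position (1 = best) of the
// i-th seed site under the candidate and the baseline algorithm respectively.
double rankSpamResilience(const std::vector<int>& candidateRanks,
                          const std::vector<int>& baselineRanks,
                          std::size_t m) {
    assert(m > 0 && m <= candidateRanks.size() && m <= baselineRanks.size());
    long long sumCandidate = 0;
    long long sumBaseline = 0;
    for (std::size_t i = 0; i < m; ++i) {
        sumCandidate += candidateRanks[i];
        sumBaseline += baselineRanks[i];
    }
    assert(sumBaseline > 0);  // rank positions start at 1
    return static_cast<double>(sumCandidate) / static_cast<double>(sumBaseline) - 1.0;
}
```

A positive value means that the seed sites sit further down the candidate ranking than the baseline ranking. For instance, candidate ranks {12, 30} against baseline ranks {6, 10} give 42/16 - 1 = 1.625 for m = 2.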

In this part, we observe the spam resilience of our algorithm under different values of α. According to figure 2, our algorithm is more spam resilient than PageRank for all three values of α. In the case of α=0, the penalty factor in formula 3 is determined by the strength of the association from a site to the spam sites, so it imposes higher penalties on the seed sites than the other values of α do; in the case of α=1, the penalty factor is determined by the spam tendency value, so it imposes lower penalties on the seed sites (and high penalties on all sites); in the case of α=0.5, the penalty factor is determined by both the strength of the association from a site to the spam sites and the spam tendency, and the two parts are equally important. Consequently, in this case, the spam resilience is higher than for α=1 and lower than for α=0.

Figure 2: our algorithm versus PageRank (S_rank(m) plotted against m for α=0, α=0.5 and α=1)

4.3 Time complexity

Time complexity is an important factor in evaluating an algorithm that may face trillions of pages. Our algorithm is iterative, so the number of iterations influences the time complexity. Let i denote the number of iterations and n denote the total number of forward links; the time complexity is then O(i*n). On a standard HP 2230s laptop, the running time of our experiment is 2.81 seconds, which is acceptable.

5 Conclusions

Recognizing and punishing cheating behavior (web spam) can improve the performance of search engines and enhance users’ satisfaction. From the standpoint of link analysis, this paper introduced the concept of spam tendency to measure the likelihood that a page engages in web spam. Based on an initial blacklist and a page's connectivity with it, we first proposed a method to calculate spam tendency and translate it into page reliability. Meanwhile, considering that a web page only has a clear understanding of the pages it points to directly, we used the strength of association to measure the correlation between two pages. Finally, we designed an anti-spam algorithm that uses both reliability and correlation when calculating the penalty factor for a page. The experiments showed that this algorithm is more spam resilient than PageRank.

Acknowledgements

This work is partially supported by the NSFC Grant 70971099 and Shanghai Leading Academic Discipline Project (B310).


References

[1] Sergey Brin, Lawrence Page. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford, 1999: 1-15

[2] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the Association for Computing Machinery, 1999, 46(5): 604-632

[3] Baidu Web Search Help-Webmaster FAQ, http://www.baidu.com/search/guide.html, 2010.1.13

[4] James Caverlee, Ling Liu, William B. Rouse. Link-Based Ranking of the Web with Source Centric Collaboration. 2006 International Conference on Collaborative Computing: Networking, Applications and Worksharing, Atlanta, 2006: 1-9

[5] James Caverlee, Steve Webb, Ling Liu, William B. Rouse. A Parameterized Approach to Spam-Resilient Link Analysis of the Web. In Parallel and Distributed Systems, 2009, 20(10): 1422-1436

[6] James Caverlee, Ling Liu. Countering Web Spam with Credibility-Based Link Analysis. In Proceeding of the twenty-sixth annual Association for Computing Machinery symposium on Principles of distributed computing, 2007: 157-166

[7] Chengmin Liang, Liyun Ru, Xiaoyan Zhu. R-Spam Rank: A Spam Detection Algorithm Based on Link Analysis. Journal of Computational Information Systems, 2007, 3(4): 1705-1712

[8] Huijia Yu, Yiqun Liu. Web Spam Taxonomy via Spam Intent Analysis. Journal of Chinese Information Processing, 2009, 23(2): 95-100

[9] Meng-yan Hao, Yuan He. Survey on the Cheating of search Engine Optimization. Computer Knowledge and Technology, 2009, 5(33): 9533-9535

[10] G. Mishne, D. Carmel, R. Lempel. Blocking Blog Spam with Language Model Disagreement. In First International Workshop on Adversarial Information Retrieval on the Web, at WWW ’05: the 14th international conference on World Wide Web, 2005: 1-6

[11] Michael G. Noll, Au Yeung, C. M., Gibbins. Telling Experts from Spammers: Expertise Ranking in Folksonomies. In: The 32nd Annual ACM SIGIR Conference, Boston, MA, USA, 2009: 612-619

[12] Z. Gyongyi, H. Garcia-Molina. Web spam taxonomy. In: First International Workshop on Adversarial Information Retrieval on the Web (AIR Web), Chiba, Japan, 2005: 1-8