The Use of Merging Algorithm to Real Ranking for Graph Search


A. Mohammad Reza Nami, B. Mehdi Ebadian

Faculty of Electrical, Computer, and IT Engineering,

Islamic Azad University- Qazvin Branch, Qazvin, IRAN

ABSTRACT

The ranking problem is becoming an important issue in many fields, especially in information retrieval. This paper presents an automatic technique for monitoring spam in the Web graph. The technique combines information from two different sources: truncated PageRank and semi-streaming graph algorithms. In this paper we further study the heuristic ranking framework and measure the PageRank of link farms. Twenty-six articles from 15 venues have been reviewed and classified within the taxonomy in order to organize and structure existing work in the field of information retrieval.

Keywords

Information retrieval (IR), PageRank (PR), streaming algorithms, Internet marketing, spam, and search engine optimization.

1 Introduction

Search engines have become the most lucrative business on the Internet. They mediate between the Web platform and the information seeker, ranking Web pages to create a short list of high-quality results. A large share of site visits originates from search engines, and most users click only on the first few results. There is therefore an incentive to create pages that score highly independently of their real merit.

SPAM: Each new communication medium creates an opportunity for sending unsolicited messages. Types of electronic spam include e-mail spam, instant messaging spam (SPIM), Internet telephony spam (SPIT), and spam sent by mobile phone, by fax, and so on. Because of the request-response paradigm of HTTP, a page cannot be pushed to users unsolicited, so the goal of Web spam is to deceive search engines. Web spam is any attempt to deceive a search engine's relevancy algorithm, or anything that "would not be done if search engines did not exist". What separates SEO (Search Engine Optimization) from spam is therefore whether the attempt is ethical. The relation between website administrators and search engine administrators is adversarial.

Stream graph algorithm: Suppose that we have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, with about 10 edges per vertex) that is non-distributed and processed by a single thread only, and that we want to run breadth-first searches on it. We expect the searches to be I/O-bound, so we need a disk page layout that is good for BFS; disk space is not an issue. The searches can start at any vertex with equal probability. Intuitively, this means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem. The graph itself looks like spaghetti: think of a random set of points randomly interconnected, with some bias towards shorter edges.
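The quantity to be minimized in this partitioning view can be illustrated with a minimal sketch. The function name `cross_page_edges` and the `page_of` mapping are hypothetical and only serve the illustration; the sketch counts the edges whose endpoints are stored on different disk pages for a given layout.

```python
def cross_page_edges(edges, page_of):
    """Count edges whose two endpoints lie on different disk pages.

    edges   -- iterable of (u, v) vertex pairs (undirected graph)
    page_of -- hypothetical mapping from vertex id to disk page id
    """
    return sum(1 for u, v in edges if page_of[u] != page_of[v])


# A path 0 - 1 - 2 - 3 stored on two pages of two vertices each.
path = [(0, 1), (1, 2), (2, 3)]
contiguous_layout = {0: 0, 1: 0, 2: 1, 3: 1}   # neighbours kept together
interleaved_layout = {0: 0, 1: 1, 2: 0, 3: 1}  # neighbours split apart
```

Under these assumptions, a layout that keeps neighbours on the same page yields fewer cross-page edges, so a BFS touches fewer pages per expanded vertex.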


Web spam techniques are classified into two groups: content (keyword) spam and link spam.

Link spam changes the link structure of sites by creating link farms.

A link farm is a set of densely connected pages created to deceive a ranking algorithm by improving the rank of one member of the group.

The targets of our spam-detection algorithm are pages that receive most of their link-based ranking by participating in link farms but have little relationship with the rest of the graph.

Links obtained by buying advertising, or by buying expired domains that were used for legitimate purposes, may not be spam.

Topological spamming is spamming achieved by using link farms.

Link-based and content-based analysis offer two orthogonal approaches.

Weakness of link-based analysis: some spam pages are statistically close to non-spam pages.

Threat to link-based analysis: hybrid spam structures.

Opportunity for link-based analysis: link farms are expensive.

Weakness of content-based analysis: it is less resilient to changes in spammer strategies.

Threats to content-based analysis: hybrid spam structures, and the fact that copying an entire Web site (changing a few out-links) is inexpensive.

The two approaches should therefore be used together.

2 Algorithm Framework

Fetterly et al. [2004] hypothesized that studying statistical distributions of page properties is a good way of detecting spam pages: "in a number of these distributions, outlier values are web spam".

Baeza-Yates et al. [2006] introduced damping functions for rank propagation.

We want to explore the neighborhood of a page and determine whether the link structure around it was artificially generated or not.

Two algorithmic challenges:

1. how to simultaneously compute statistics about the neighborhood of each page in a huge Web graph

2. how to use them to detect and demote Web spam

2.1 Supporter

If there is a link from page x to page y, the author of page x is recommending page y. We say that x is a supporter of page y at distance d if the shortest path from x to y formed by links in E has length d.
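The definition of a supporter can be illustrated with a minimal sketch of a backward breadth-first visit from a single target page. The function name `supporters_by_distance` and the edge-list representation are assumptions made for the illustration, not part of the original method description.

```python
from collections import deque


def supporters_by_distance(edges, target, max_dist):
    """Return counts where counts[d] is the number of supporters of
    `target` at distance exactly d, for d = 0 .. max_dist.

    edges -- iterable of directed links (x, y), meaning page x links to page y
    """
    # Build the reversed adjacency: for each page, who links to it.
    in_links = {}
    for x, y in edges:
        in_links.setdefault(y, []).append(x)

    dist = {target: 0}          # shortest distance from each page to target
    counts = [0] * (max_dist + 1)
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue            # do not explore beyond max_dist
        for src in in_links.get(node, []):
            if src not in dist:  # first visit gives the shortest path
                dist[src] = dist[node] + 1
                counts[dist[src]] += 1
                queue.append(src)
    return counts


# Pages 1 and 2 link to page 0; page 3 links to 1; page 4 links to 3.
example_links = [(1, 0), (2, 0), (3, 1), (4, 3), (0, 4)]
```

The target itself is not counted as its own supporter, so `counts[0]` stays zero in this sketch.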

Figure 2. Web graph and supporter distribution. Distribution of the fraction of distinct supporters found at varying distances (normalized), obtained by backward breadth-first visits from a sample of nodes, in four large Web graphs.

The number of new distinct supporters increases up to a certain distance and then decreases, because the graph is limited in size and we approach its effective diameter.

Figure 3. PageRank of different buckets.

PageRank (PR) of pages in the eu.int subdomain, calculated to show the different distributions in high- and low-ranked sites.

Breadth-first search (BFS) can be run from a sample of nodes instead of computing the distribution for all nodes of large Web graphs.

Advantage: inexpensive.

Disadvantage: repeating the BFS for every node requires memory for marking each visited node and O(N^2) time.

Solution: compute supporters only for a subset of suspicious nodes.

Constraint: we do not know a priori whether a node is suspicious of being spam or not.


The link-analysis algorithm uses the semi-streaming model; its metric is a score vector that uses O(N log N) bits of memory.

A PR-style algorithm is used instead of BFS for Web spam detection: a BFS yields a tree rooted at one specific node rather than information about all nodes, whereas PR measures the centrality of nodes by computing a score for all nodes in the graph at the same time.
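A minimal sketch of this idea follows, under stated assumptions: only per-node vectors (out-degrees and scores) are kept in memory, while the edge list is re-read sequentially on every pass. The function name `semi_streaming_pagerank` and the `read_edges` callable, which stands in for a sequential scan of an edge file, are hypothetical. The sketch computes ordinary PageRank with a uniform jump and is not the authors' exact algorithm.

```python
def semi_streaming_pagerank(read_edges, n, alpha=0.85, passes=30):
    """PageRank with O(N) numbers in memory and streamed edges.

    read_edges -- callable returning a fresh iterator over (x, y) links;
                  an assumed stand-in for scanning the edge file on disk
    n          -- number of nodes, labelled 0 .. n-1
    """
    # One streaming pass to count out-degrees.
    out_deg = [0] * n
    for x, _ in read_edges():
        out_deg[x] += 1

    score = [1.0 / n] * n
    for _ in range(passes):
        nxt = [0.0] * n
        # One streaming pass: each link forwards a share of its source score.
        for x, y in read_edges():
            nxt[y] += score[x] / out_deg[x]
        # Nodes without out-links spread their score uniformly.
        dangling = sum(score[x] for x in range(n) if out_deg[x] == 0) / n
        score = [(1 - alpha) / n + alpha * (v + dangling) for v in nxt]
    return score


toy_links = [(0, 1), (1, 2), (2, 0), (3, 0)]


def read_toy_links():
    return iter(toy_links)
```

In this toy graph, node 3 has no in-links and therefore keeps only the uniform-jump share (1 - alpha) / n, while node 0 collects score from nodes 2 and 3.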

2.2 TRUNCATED PAGERANK

Truncated PageRank is a link-based ranking function that reduces the importance of neighbors that are topologically close to the target node. Its damping function ignores the direct contribution of the first levels of links. Spam pages should be very sensitive to changes in the damping factor of the PR calculation.

Let $A_{N \times N}$ be the citation matrix of the graph $G = (V, E)$, that is,

$$a_{xy} = 1 \iff (x, y) \in E \qquad (1)$$

Let P be the row-normalized citation matrix, in which all rows sum to one and rows of zeros are replaced by rows of 1/N to avoid rank sinks.

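$$W = \sum_{t=0}^{\infty} \frac{\mathrm{damping}(t)}{N} P^{t}$$

$$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$$

where $C$ is a normalization constant and $\alpha$ is the damping factor.

Algorithm 2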

Algorithm 1: Link-analysis algorithm
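The truncated damping function can be illustrated with a minimal sketch that accumulates the path contributions of length t > T only. The function name `truncated_pagerank`, the dictionary representation of out-links, and the choice C = (1 - α) / α^(T+1), which makes the scores sum to one, are assumptions for the illustration. The sketch assumes T ≥ 0 and is not a reproduction of the algorithm listings referred to above.

```python
def truncated_pagerank(out_links, n, alpha=0.85, T=2, iterations=50):
    """Sum of (damping(t) / N) * 1^T P^t over T < t <= iterations.

    out_links -- dict mapping node -> list of nodes it links to
    n         -- number of nodes, labelled 0 .. n-1
    T         -- number of truncated levels (assumed T >= 0)
    """
    # Assumed normalization: C * sum_{t > T} alpha^t = 1.
    C = (1 - alpha) / alpha ** (T + 1)
    r = [C / n] * n            # contribution of paths of length 0
    scores = [0.0] * n
    for t in range(1, iterations + 1):
        nxt = [0.0] * n
        dangling = 0.0
        for x in range(n):
            targets = out_links.get(x, [])
            if targets:
                share = alpha * r[x] / len(targets)
                for y in targets:
                    nxt[y] += share
            else:
                # A row of zeros in P is replaced by a row of 1/N.
                dangling += alpha * r[x] / n
        r = [v + dangling for v in nxt]   # contribution of paths of length t
        if t > T:                          # the first T levels are ignored
            scores = [s + v for s, v in zip(scores, r)]
    return scores


cycle = {0: [1], 1: [2], 2: [0]}
```

Because the direct contribution of the nearest levels is discarded, a page whose rank comes mostly from supporters at short distances, as in a link farm, is expected to lose more score than a page with supporters spread over many distances.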


Figure 4. PageRank truncated 4 times.

Comparing PR with truncated PageRank (TPR) for truncation values from 1 to 4, the two are closely correlated, and the correlation decreases as more levels are truncated.

2.3 ESTIMATING SUPPORTERS

Probabilistic counting is used to estimate the number of supporters of all vertices in the graph at the same time.

Figure 5. Propagation of bits: 1 indicates having a supporter and 0 not having one. Bit propagation algorithm: if page y has a link to page x, then the bit vector of page x is updated as x ← x OR y.

Bit propagation algorithm for estimating the number of distinct supporters at distance ≤ d of all nodes.
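The update rule x ← x OR y can be illustrated with a minimal sketch. The function names, the use of Python integers as k-bit vectors, and the estimator based on the fraction of set bits are assumptions for the illustration; in particular, the estimator follows from each bit being set independently with probability ε and also counts the node itself.

```python
import math
import random


def init_vectors(n, k, eps, rng):
    """Give each node a k-bit vector; each bit is 1 with probability eps."""
    vectors = []
    for _ in range(n):
        v = 0
        for bit in range(k):
            if rng.random() < eps:
                v |= 1 << bit
        vectors.append(v)
    return vectors


def propagate(edges, vectors, d):
    """Apply x <- x OR y for every link (y, x), repeated d times.

    After d rounds, the vector of x is the OR of the initial vectors of
    x and of all its supporters at distance <= d.
    """
    current = list(vectors)
    for _ in range(d):
        nxt = list(current)
        for y, x in edges:          # page y links to page x
            nxt[x] |= current[y]    # read the previous round only
        current = nxt
    return current


def estimate_supporters(vector, k, eps):
    """Estimate how many initial vectors were OR-ed into `vector`.

    A bit stays 0 with probability (1 - eps)^m when m vectors are
    combined, so m is estimated as log(1 - ones/k) / log(1 - eps).
    """
    ones = bin(vector).count("1")
    if ones == k:
        return float("inf")         # vector saturated, eps was too large
    return math.log(1 - ones / k) / math.log(1 - eps)


chain = [(0, 1), (1, 2)]            # 0 links to 1, 1 links to 2
```

A usage sketch under the same assumptions: `propagate(chain, init_vectors(3, 64, 0.05, random.Random(1)), 2)` returns one vector per node, and `estimate_supporters` is then applied to each vector separately.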

Figure 6. Distances of supporters in 3 types. Comparison of the estimated average number of supporters against the observed value in a sample of nodes, assuming ε = 1/N, and estimation with adaptive bit propagation, dividing ε by two at each iteration.

3 Classification



Precision P = tp/(tp + fp) = #spam hosts classified as spam / #hosts classified as spam

Recall R = tp/(tp + fn) = #spam hosts classified as spam / #spam hosts

False positive rate = fp/(fp + tn) = #normal hosts classified as spam / #normal hosts

False negative rate = fn/(tp + fn) = #spam hosts classified as normal / #spam hosts
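The measures above can be computed directly from the four confusion-matrix counts, as in the following minimal sketch. The function name `classification_metrics` and the example counts are hypothetical and do not correspond to any dataset in this paper. F1 is taken as the harmonic mean of precision and recall, and all denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute spam-classification measures from confusion-matrix counts.

    tp -- spam hosts classified as spam
    fp -- normal hosts classified as spam
    fn -- spam hosts classified as normal
    tn -- normal hosts classified as normal
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=80, fp=10, fn=20, tn=890)
```

Note that recall and the false negative rate sum to one, since both are fractions of the same set of spam hosts.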

Table 1. Performance of this article's classifier

                          UK2012                                  UK2013
Metrics                   True Positive  False Positive  F1       True Positive  False Positive  F1
Degree (D)                0.733          0.016           0.807    0.324          0.023           0.431
D + Page Rank (P)         0.769          0.014           0.836    0.36           0.026           0.467
D + P + Trust Rank        0.785          0.013           0.847    0.54           0.038           0.596
D + P + Trunc. PR         0.782          0.016           0.844    0.356          0.021           0.474
D + P + Est. Supporters   0.801          0.008           0.868    0.467          0.038           0.549
All attributes            0.806          0.01            0.872    0.586          0.038           0.632

Table 2. Criterion "F" (Web spam techniques classification)

                 Relevant (spam hosts)                    Nonrelevant (normal hosts)
Retrieved        tp (#spam hosts classified as spam)      fp
Not retrieved    fn                                       tn (#normal hosts not classified as spam)

Table 3. Performance using Page Rank, supporters, and degree: experimental results

Dataset      True Positive  False Positive  F-Measure  Previous F-Measure from Table IV
UK           0.801          0.008           0.866      0.834
Only pages   0.795          0.014           0.853
Only hosts   0.778          0.011           0.849
UK           0.465          0.033           0.549      0.459
Only pages   0.402          0.03            0.497
Only hosts   0.468          0.03            0.555


4 Conclusions

The technique used for link analysis assigns to every node in the Web graph a numerical score between 0 and 1, known as its PageRank. With the help of this paper, website owners and webmasters can decide which SEO practices are worthwhile and will give a good return on investment.

Finally, the use of regularization methods that exploit the topology of the graph and the locality hypothesis [Davison 2000b] is promising, as it has been shown that those methods are useful for general Web classification tasks [Zhang et al. 2006; Angelova and Weikum 2006; Qi and Davison 2006] and that they can be used to improve the accuracy of Web spam detection systems [Castillo et al. 2007].


REFERENCES

[1] Alexa Inc., http://www.alexa.com/help/traffic-learn-more, last accessed on May 17, 2011.

[2] Antoniol, G. and Guéhéneuc, Y. G., "Feature Identification: An Epidemiological Metaphor", IEEE Transactions on Software Engineering, vol. 32, no. 9, 2006, pp. 627-641.

[3] Binkley D, Gold G, Harman M, Li Z, Mahdavi K (2008) An empirical study of the relationship between the concepts expressed in source code and dependence. J Syst Software 81:2287–2298

[4] Cornelissen B, Zaidman A, van Deursen A, Moonen L, Koschke R (2009) A systematic survey of program comprehension through dynamic analysis. IEEE Trans Software Eng (TSE) 35(5):684–702

[5] De Alwis B, Murphy GC (2008) Answering conceptual queries with Ferret. 30th International Conference on Software Engineering (ICSE’08), Leipzig, Germany, 21–30

[6] De Lucia, A., Fasano, F., Oliveto, R., and Tortora, G., "Recovering Traceability Links in Software Artefact Management Systems", ACM Transactions on Software Engineering and Methodology, 2007.

[7] Egyed, A., Binder, G., and Grunbacher, P., "STRADA: A Tool for Scenario-Based Feature-to-Code Trace Detection and Analysis", in Proc. of IEEE/ACM 29th International Conference on Software Engineering (ICSE'07), 2007, pp. 41-42.

[8] Eaddy M, Aho AV, Antoniol G, Guéhéneuc YG (2008a) CERBERUS: tracing requirements to source code using information retrieval, dynamic analysis, and program analysis. 16th IEEE International Conference on Program Comprehension (ICPC’08), Amsterdam, The Netherlands, 53–62

[9] Eaddy M, Zimmermann T, Sherwood K, Garg V, Murphy G, Nagappan N, Aho AV (2008b) Do crosscutting concerns cause defects? IEEE Trans Software Eng 34(4):497–515

[10] Gay G, Haiduc S, Marcus M, Menzies T (2009) On the use of relevance feedback in IR-based concept location. 25th IEEE International Conference on Software Maintenance (ICSM’09), Edmonton, Canada, 351–360

[11] Grant S, Cordy JR, Skillicorn DB (2008) Automated concept location using independent component analysis. 15th Working Conference on Reverse Engineering (WCRE’08), Antwerp, Belgium, 138–142

[12] Hayes, J. H., Dekhtyar, A., and Sundaram, S. K., "Advancing candidate link generation for requirements tracing: the study of methods", IEEE Transactions on Software Engineering, vol. 32, no. 1, January 2006, pp. 4-19.

[13] Hill E, Pollock L, Vijay-Shanker K (2009) Automatically capturing source code context of NL-queries for software maintenance and reuse. 31st IEEE/ACM International Conference on Software Engineering (ICSE’09), Vancouver, British Columbia, Canada

[14] Kothari, J., Denton, T., Mancoridis, S., and Shokoufandeh, A., "On Computing the Canonical Features of Software Systems", in 13th IEEE Working Conference on Reverse Engineering (WCRE'06), Benevento, Italy, 2006.

[15] Kuhn, A., Ducasse, S., and Gîrba, T., "Semantic Clustering: Identifying Topics in Source Code", Information and Software Technology, vol. 49, no. 3, March 2006, pp. 230-243.

[16] Lawrance J, Bellamy R, Burnett M (2007) Scents in programs: does information foraging theory apply to program maintenance? IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC’07), IEEE, 15–22

[17] Liu D, Marcus A, Poshyvanyk D, Rajlich V (2007) Feature location via information retrieval based filtering of a single scenario execution trace. 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE’07), Atlanta, Georgia, 234–243

[18] Li Z (2009) Identifying high-level dependence structures using slice-based dependence analysis. King’s College London, University of London. Ph.D

[19] Lukins S, Kraft N, Etzkorn L (2008) Source code retrieval for bug location using latent dirichlet allocation. 15th Working Conference on Reverse Engineering (WCRE’08), Antwerp, Belgium, 155–164

[20] Poshyvanyk, D., Guéhéneuc, G. Y., Marcus, A., Antoniol, G., and Rajlich, V., "Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval", IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007, pp. 420-432.

[21] Rajlich, V., "Changing the Paradigm of Software Engineering", in Communications of ACM, vol. August, 2006, pp. 67-70.

[22] Salah, M., Mancoridis, S., Antoniol, G., and Di Penta, M., "Scenario-driven dynamic analysis for comprehending large software systems", in Proc. of 10th European Conference on Software Maintenance and Reengineering (CSMR'06), 2006.

[23] Shepherd, D., Fry, Z., Gibson, E., Pollock, L., and Vijay-Shanker, K., "Using Natural Language Program Analysis to Locate and Understand Action-Oriented Concerns", in Proc. of International Conference on Aspect Oriented Software Development (AOSD'07), 2007, pp. 212-224.

[24] Simmons, S., Edwards, D., Wilde, N., Homan, J., and Groble, M., "Industrial tools for the feature location problem: an exploratory study", Journal of Software Maintenance: Research and Practice, vol. 18, no. 6, 2006, pp. 457-474.

[25] WordStreamTools, http://www.wordstream.com/adwordskeyword-tool, on May 10, 2011

[26] Zhao, W., Zhang, L., Liu, Y., Sun, J., and Yang, F., "SNIAFL: Towards a Static Non-interactive Approach to Feature Location", ACM Transactions on Software Engineering and Methodologies, vol. 15, no. 2, 2006, pp. 195-226.