• No results found

Link Analysis. Hongning Wang

N/A
N/A
Protected

Academic year: 2021

Share "Link Analysis. Hongning Wang"

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)

Link Analysis

(2)

Structured v.s. unstructured data

β€’

Our claim before

– IR v.s. DB = unstructured data v.s. structured data

β€’

As a result, we have assumed

– Document = a sequence of words – Query = a short document

– Corpus = a set of documents

(3)

A typical web document has

(4)
(5)

Intra-document structures

Title Paragraph 1 Paragraph 2 Anchor texts

….

.

Document

Concise summary of the document

Likely to be an abstract of the document

References to other documents Images Visual description of the document

(6)

Exploring intra-document structures

for retrieval

Intuitively, we want to give different weights to the parts to reflect their importance

1 1 1 ( | , ) ( | , ) ( | , ) ( | , ) n i i n k j i j j i p Q D R p w D R s D D R p w D R = = = = = Γ• Γ₯ Γ•

β€œpart selection” prob. Serves as weight for Dj

Can be estimated by EM or manually set

Select Dj and generate a query

word using Dj Title Paragraph 1 Paragraph 2 Anchor texts

….

.

Document

(7)

Inter-document structure

(8)

What do the links tell us?

β€’

Anchor

– Rendered form

(9)

What do the links tell us?

β€’

Anchor text

– How others describe the page

β€’ E.g., β€œbig blue” is a nick name of IBM, but never found on IBM’s official web site

(10)

What do the links tell us?

β€’

Linkage relation

– Endorsement from others – utility of the page

(11)

Analogy to citation network

β€’

Authors cite others’ work because

– A conferral of authority

β€’ They appreciate the intellectual value in that paper

– There is certain relationship between the papers

β€’

Bibliometrics

– A citation is a vote for the usefulness of that paper – Citation count indicates the quality of the paper

(12)

Situation becomes more complicated

in the web environment

β€’

Adding a hyperlink costs almost nothing

– Taken advantage by web spammers

β€’ Large volume of machine-generated pages to artificially increase β€œin-links” of the target page

β€’ Fake or invisible links

β€’

We should not only consider the count of

in-links, but the quality of each in-link

(13)

Link structure analysis

β€’

Describes the characteristic of network

structure

β€’

Reflect the utility of web documents in a

general sense

β€’

An important factor when ranking documents

(14)

Recall how we do web browsing

1. Mike types a URL address in his Chrome’s URL bar;

2. He browses the content of the page, and follows the link he is interested in;

3. When he feels the current page is not interesting or there is no link to follow, he types another

URL and starts browsing from there;

(15)

PageRank

β€’

A random surfing model of web

1. A surfer begins at a random page on the web and starts random walk on the graph

2. On current page, the surfer uniformly follows an out-link to the next page

3. When there is no out-link, the surfer uniformly jumps to a page from the whole collection

(16)

PageRank

β€’

A measure of web page popularity

– Probability of a random surfer who arrives at this web page

(17)

Theoretic model of PageRank

β€’

Markov chains

– A discrete-time stochastic process

β€’ It occurs in a series of time-steps in each of which a random choice is made

– Can be described by a directed graph or a transition matrix

(18)

Markov chains

β€’

Markov property

– 𝑃 𝑋!"# 𝑋#, … , 𝑋! = 𝑃(𝑋!"#|𝑋!) β€’ Memoryless (first-order)

β€’

Transition matrix

– A stochastic matrix β€’ βˆ€π‘–, βˆ‘2 𝑀32 = 1 – Key property

β€’ It has a principal left eigenvector corresponding to its largest eigenvalue, which is one

𝑀 = 0.6 0.2 0.20.3 0.4 0.3

0 0.3 0.7

Idea of random surfing

(19)

Theoretic model of PageRank

β€’

Transition matrix of a Markov chain for

PageRank

d1 d2 d4 d3 𝐴 = 0 0 0 0 1 10 0 0 1 1 1 0 00 0 𝐴′ = 0 0 0.25 0.25 0.25 0.251 1 0 1 1 1 0 00 0 1. Enable random jump on dead end

𝐴′′ =

0 0

0.25 0.25 0.25 0.250.5 0.5

2. Normalization 3. Enable random

jump on all nodes

𝑀 =

0.125 0.125

0.25 0.25

0.375 0.375

(20)

Steps to derive transition matrix for

PageRank

1. If a row of A has no 1’s, replace each element

by 1/N.

2. Divide each 1 in A by the number of 1’s in its

row.

3. Multiply the resulting matrix by 1 βˆ’ Ξ±.

4. Add Ξ±/N to every entry of the resulting

matrix, to obtain M.

(21)

PageRank computation becomes

β€’

𝑝

!

𝑑 = 𝑀

"

𝑝

!#$

𝑑

– Assuming 𝑝0 𝑑 = # 1 , … , # 1

– Iterative computation (forever?)

β€’ 𝑝4 𝑑 = 𝑀5𝑝467 𝑑 = β‹― = (𝑀5)4𝑝8 𝑑

– Intuition: after enough rounds of random walk, each dimension of 𝑝2 𝑑 indicates the frequency of a random surfer visiting document 𝑑

(22)

Stationary distribution of a Markov chain

β€’

For a given Markov chain with transition

matrix M, its stationary distribution of Ο€ is

– Necessary condition

β€’ Irreducible: a state is reachable from any other state β€’ Aperiodic: states cannot be partitioned such that

transitions happened periodically among the partitions

βˆ€π‘– ∈ 𝑆, πœ‹3 β‰₯ 0 2 3∈5 πœ‹3 = 1 πœ‹ = 𝑀6πœ‹ A probability vector

(23)

Markov chain for PageRank

β€’

Random jump operation makes PageRank

satisfy the necessary conditions

1. Random jump makes every node is reachable from any other nodes

2. Random jump breaks potential loop in a sub-network

β€’

What does PageRank score really converge

(24)

Stationary distribution of PageRank

β€’

For any irreducible and aperiodic Markov

chain, there is a unique steady-state

probability vector Ο€, such that if 𝑐(𝑖, 𝑑) is the

number of visits to state 𝑖 after 𝑑 steps, then

– PageRank score converges to the expected visit frequency of each node

lim

2β†’8

𝑐 𝑖, 𝑑

(25)

Computation of PageRank

β€’

Power iteration

– 𝑝2 𝑑 = 𝑀6𝑝29# 𝑑 = β‹― = (𝑀6)2𝑝0(𝑑)

β€’ Normalize 𝑝4 𝑑 in each iteration

β€’ Convergence rate is determined by the second largest eigenvalue

– Random walk becomes series of matrix multiplication

– Alternative interpretation of PageRank score

(26)

Computation of PageRank

β€’

An example from Manning’s text book

𝑀 =

1/6 2/3 1/6

5/12 1/6 5/12

(27)

Variants of PageRank

β€’

Topic-specific PageRank

– Control the random jump to topic-specific nodes

β€’ E.g., surfer interests in Sports will only randomly jump to Sports-related website when they have no out-links to follow

– 𝑝2 𝑑 = [𝛼𝑀6 + 1 βˆ’ 𝛼 ⃗𝑒𝑝6(𝑑)]𝑝29# 𝑑

(28)

Variants of PageRank

β€’

Topic-specific PageRank

– A user’s interest is a mixture of topics

Compute it off-line

User’s interest: 60% Sports, 40% politics

(29)

Variants of PageRank

β€’

LexRank

– A sentence is important if it is similar to other

important sentences

– PageRank on sentence similarity graph

Centrality-based sentence salience ranking for

(30)

β€’

SimRank

– Two objects are similar if they are referenced by

similar objects

– PageRank on bipartite graph of object relations

Variants of PageRank

(31)

HITS algorithm

β€’

Two types of web pages for a broad-topic

query

– Authorities – trustful source of information

β€’ UVA-> University of Virginia official site

– Hubs – hand-crafted list of links to authority pages for a specific topic

(32)

HITS algorithm

β€’

Intuition

– Using hub pages to discover authority pages

β€’

Assumption

– A good hub page is one that points to many good authorities -> a hub score

– A good authority page is one that is pointed to by many good hub pages -> an authority score

β€’

Recursive definition indicates iterative

algorithm

(33)

Computation of HITS scores

β€’

Two scores for a web page for a given query

– Authority score: π‘Ž 𝑑 – Hub score: β„Ž(𝑑) π‘Ž 𝑑 ← : :β†’< β„Ž(𝑣) β„Ž 𝑑 ← : <β†’: π‘Ž(𝑣) 𝑣 β†’ 𝑑 means there is a link from 𝑣 to 𝑑

(34)

Computation of HITS scores

β€’

In matrix form

– βƒ—π‘Ž ← 𝐴6β„Ž and β„Ž ← 𝐴 βƒ—π‘Ž – That is βƒ—π‘Ž ← 𝐴6𝐴 βƒ—π‘Ž and β„Ž ← 𝐴𝐴6β„Ž – Another eigen-system β„Ž = 1 πœ†< 𝐴𝐴6β„Ž βƒ—π‘Ž = 1

(35)

Constructing the adjacent matrix

β€’

Only consider a subset of the Web

1. For a given query, retrieve all the documents containing the query (or top K documents in a ranked list) – root set

2. Expand the root set by adding pages either

linking to a page in the root set, or being linked to by a page in the root set – base set

(36)

Constructing the adjacent matrix

β€’

Reasons behind the construction steps

– Reduce the computation cost

– A good authority page may not contain the query text

(37)
(38)

Today’s reading

β€’

Introduction to information retrieval

(39)

References

β€’ Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd.

"The PageRank citation ranking: Bringing order to the web." (1999)

β€’ Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of

the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002.

β€’ Erkan, GΓΌnes, and Dragomir R. Radev. "LexRank: Graph-based

lexical centrality as salience in text summarization." J. Artif. Intell. Res.(JAIR) 22, no. 1 (2004): 457-479.

β€’ Jeh, Glen, and Jennifer Widom. "SimRank: a measure of

structural-context similarity." In Proceedings of the eighth ACM SIGKDD

international conference on Knowledge discovery and data mining, pp. 538-543. ACM, 2002.

β€’ Kleinberg, Jon M. "Authoritative sources in a hyperlinked

References

Related documents

Cloud computing works as PAY-PER-USE..In this research paper i would like to highlight some core issues like what is the cloud computing system ,advantage and disadvantage of

Whether we are working with a global financial services provider to deliver complex network solutions and integrated cloud-enabled services, a large enterprise to provide IT

The research problem for this reseach can be summarised with the following problem statement: Decision making within a business intelligence (BI) environment is often

Voting by roll call: AYES – England, Kramer, Nutter, Spellman, Swotek, Bice..

In patients at risk for acute renal failure, IVIg products should be administered at the minimum rate of infusion and dose practicable. Aseptic meningitis syndrome

Private motor boat round transfer to Hvar Island with guided walking tour of Hvar and Stari Grad Town, with visit to wineries and private lunch in the olive grove. Guided tour

To examine patterns of conserved synteny, we annotated the genes found upstream and downstream of the globin gene clusters of seven teleost species (fugu, medaka,

The student named below is applying for a scholarship administered by The Links Incorporated, North Broward County Chapter. Your recommendation is needed as part of the