Link Analysis
Structured v.s. unstructured data
β’
Our claim before
β IR v.s. DB = unstructured data v.s. structured data
β’
As a result, we have assumed
β Document = a sequence of words β Query = a short document
β Corpus = a set of documents
A typical web document has
Intra-document structures
Title Paragraph 1 Paragraph 2 Anchor textsβ¦.
.
DocumentConcise summary of the document
Likely to be an abstract of the document
References to other documents Images Visual description of the document
Exploring intra-document structures
for retrieval
Intuitively, we want to give different weights to the parts to reflect their importance
1 1 1 ( | , ) ( | , ) ( | , ) ( | , ) n i i n k j i j j i p Q D R p w D R s D D R p w D R = = = = = Γ Γ₯ Γ
βpart selectionβ prob. Serves as weight for Dj
Can be estimated by EM or manually set
Select Dj and generate a query
word using Dj Title Paragraph 1 Paragraph 2 Anchor texts
β¦.
.
DocumentInter-document structure
What do the links tell us?
β’
Anchor
β Rendered form
What do the links tell us?
β’
Anchor text
β How others describe the page
β’ E.g., βbig blueβ is a nick name of IBM, but never found on IBMβs official web site
What do the links tell us?
β’
Linkage relation
β Endorsement from others β utility of the page
Analogy to citation network
β’
Authors cite othersβ work because
β A conferral of authority
β’ They appreciate the intellectual value in that paper
β There is certain relationship between the papers
β’
Bibliometrics
β A citation is a vote for the usefulness of that paper β Citation count indicates the quality of the paper
Situation becomes more complicated
in the web environment
β’
Adding a hyperlink costs almost nothing
β Taken advantage by web spammers
β’ Large volume of machine-generated pages to artificially increase βin-linksβ of the target page
β’ Fake or invisible links
β’
We should not only consider the count of
in-links, but the quality of each in-link
Link structure analysis
β’
Describes the characteristic of network
structure
β’
Reflect the utility of web documents in a
general sense
β’
An important factor when ranking documents
Recall how we do web browsing
1. Mike types a URL address in his Chromeβs URL bar;
2. He browses the content of the page, and follows the link he is interested in;
3. When he feels the current page is not interesting or there is no link to follow, he types another
URL and starts browsing from there;
PageRank
β’
A random surfing model of web
1. A surfer begins at a random page on the web and starts random walk on the graph
2. On current page, the surfer uniformly follows an out-link to the next page
3. When there is no out-link, the surfer uniformly jumps to a page from the whole collection
PageRank
β’
A measure of web page popularity
β Probability of a random surfer who arrives at this web page
Theoretic model of PageRank
β’
Markov chains
β A discrete-time stochastic process
β’ It occurs in a series of time-steps in each of which a random choice is made
β Can be described by a directed graph or a transition matrix
Markov chains
β’
Markov property
β π π!"# π#, β¦ , π! = π(π!"#|π!) β’ Memoryless (first-order)β’
Transition matrix
β A stochastic matrix β’ βπ, β2 π32 = 1 β Key propertyβ’ It has a principal left eigenvector corresponding to its largest eigenvalue, which is one
π = 0.6 0.2 0.20.3 0.4 0.3
0 0.3 0.7
Idea of random surfing
Theoretic model of PageRank
β’
Transition matrix of a Markov chain for
PageRank
d1 d2 d4 d3 π΄ = 0 0 0 0 1 10 0 0 1 1 1 0 00 0 π΄β² = 0 0 0.25 0.25 0.25 0.251 1 0 1 1 1 0 00 0 1. Enable random jump on dead endπ΄β²β² =
0 0
0.25 0.25 0.25 0.250.5 0.5
2. Normalization 3. Enable random
jump on all nodes
π =
0.125 0.125
0.25 0.25
0.375 0.375
Steps to derive transition matrix for
PageRank
1. If a row of A has no 1βs, replace each element
by 1/N.
2. Divide each 1 in A by the number of 1βs in its
row.
3. Multiply the resulting matrix by 1 β Ξ±.
4. Add Ξ±/N to every entry of the resulting
matrix, to obtain M.
PageRank computation becomes
β’
π
!π = π
"π
!#$π
β Assuming π0 π = # 1 , β¦ , # 1β Iterative computation (forever?)
β’ π4 π = π5π467 π = β― = (π5)4π8 π
β Intuition: after enough rounds of random walk, each dimension of π2 π indicates the frequency of a random surfer visiting document π
Stationary distribution of a Markov chain
β’
For a given Markov chain with transition
matrix M, its stationary distribution of Ο is
β Necessary condition
β’ Irreducible: a state is reachable from any other state β’ Aperiodic: states cannot be partitioned such that
transitions happened periodically among the partitions
βπ β π, π3 β₯ 0 2 3β5 π3 = 1 π = π6π A probability vector
Markov chain for PageRank
β’
Random jump operation makes PageRank
satisfy the necessary conditions
1. Random jump makes every node is reachable from any other nodes
2. Random jump breaks potential loop in a sub-network
β’
What does PageRank score really converge
Stationary distribution of PageRank
β’
For any irreducible and aperiodic Markov
chain, there is a unique steady-state
probability vector Ο, such that if π(π, π‘) is the
number of visits to state π after π‘ steps, then
β PageRank score converges to the expected visit frequency of each node
lim
2β8
π π, π‘
Computation of PageRank
β’
Power iteration
β π2 π = π6π29# π = β― = (π6)2π0(π)
β’ Normalize π4 π in each iteration
β’ Convergence rate is determined by the second largest eigenvalue
β Random walk becomes series of matrix multiplication
β Alternative interpretation of PageRank score
Computation of PageRank
β’
An example from Manningβs text book
π =
1/6 2/3 1/6
5/12 1/6 5/12
Variants of PageRank
β’
Topic-specific PageRank
β Control the random jump to topic-specific nodes
β’ E.g., surfer interests in Sports will only randomly jump to Sports-related website when they have no out-links to follow
β π2 π = [πΌπ6 + 1 β πΌ βππ6(π)]π29# π
Variants of PageRank
β’
Topic-specific PageRank
β A userβs interest is a mixture of topics
Compute it off-line
Userβs interest: 60% Sports, 40% politics
Variants of PageRank
β’
LexRank
β A sentence is important if it is similar to other
important sentences
β PageRank on sentence similarity graph
Centrality-based sentence salience ranking for
β’
SimRank
β Two objects are similar if they are referenced by
similar objects
β PageRank on bipartite graph of object relations
Variants of PageRank
HITS algorithm
β’
Two types of web pages for a broad-topic
query
β Authorities β trustful source of information
β’ UVA-> University of Virginia official site
β Hubs β hand-crafted list of links to authority pages for a specific topic
HITS algorithm
β’
Intuition
β Using hub pages to discover authority pages
β’
Assumption
β A good hub page is one that points to many good authorities -> a hub score
β A good authority page is one that is pointed to by many good hub pages -> an authority score
β’
Recursive definition indicates iterative
algorithm
Computation of HITS scores
β’
Two scores for a web page for a given query
β Authority score: π π β Hub score: β(π) π π β : :β< β(π£) β π β : <β: π(π£) π£ β π means there is a link from π£ to π
Computation of HITS scores
β’
In matrix form
β βπ β π΄6β and β β π΄ βπ β That is βπ β π΄6π΄ βπ and β β π΄π΄6β β Another eigen-system β = 1 π< π΄π΄6β βπ = 1Constructing the adjacent matrix
β’
Only consider a subset of the Web
1. For a given query, retrieve all the documents containing the query (or top K documents in a ranked list) β root set
2. Expand the root set by adding pages either
linking to a page in the root set, or being linked to by a page in the root set β base set
Constructing the adjacent matrix
β’
Reasons behind the construction steps
β Reduce the computation cost
β A good authority page may not contain the query text
Todayβs reading
β’
Introduction to information retrieval
References
β’ Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd.
"The PageRank citation ranking: Bringing order to the web." (1999)
β’ Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of
the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002.
β’ Erkan, GΓΌnes, and Dragomir R. Radev. "LexRank: Graph-based
lexical centrality as salience in text summarization." J. Artif. Intell. Res.(JAIR) 22, no. 1 (2004): 457-479.
β’ Jeh, Glen, and Jennifer Widom. "SimRank: a measure of
structural-context similarity." In Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and data mining, pp. 538-543. ACM, 2002.
β’ Kleinberg, Jon M. "Authoritative sources in a hyperlinked