The PageRank Algorithm - Using community trained recommender models for enhanced information re

In this section, two existing Markov Random Walk (MRW) models are described: a one-layer model and a two-layer model. The application of these two models in the PageRank algorithm is then described.

5.2.1 Basic One-Layer Model

An MRW model is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of a “vote” or “recommendation” between vertices (Wan and Yang, 2008). A link between two vertices is considered as a vote that one vertex gives to another. The score associated with a vertex is determined by the votes that are given for it.

Figure 5.1 One-layer link graph (Wan and Yang, 2008)

Wan and Yang (2008) define the notation as follows. Given a document set $S$, let $G = (V, E)$ denote a graph that reflects the relationships between the documents in the whole document set, as shown in Figure 5.1. $V$ is the set of vertices, where each vertex $v_i$ in $V$ is a document in $S$. $E$ is the set of edges, which is a subset of $V \times V$. Each edge $e_{ij}$ in $E$ is associated with an affinity weight $f(i \to j)$ between documents $v_i$ and $v_j$ ($i \neq j$). Each document $v_i$ is represented as a set of terms $v_i(t_1, t_2, \ldots, t_n)$. The affinity weight is computed using the standard cosine similarity measure (Baeza-Yates and Ribeiro-Neto, 1999) between two documents, as shown in Equation (5-1).

$$f(i \to j) = sim_{cosine}(v_i, v_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{|\vec{v}_i| \times |\vec{v}_j|} \qquad \text{(5-1)}$$

where $\vec{v}_i$ and $\vec{v}_j$ are the term vectors of $v_i$ and $v_j$. Two vertices are considered connected if the affinity weight between them is larger than 0, and $f(i \to i) = 0$ is defined to avoid self-transitions.
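As an illustration of Equation (5-1), the following is a minimal sketch of how the affinity weights could be computed. It assumes raw term counts as the term weights (the weighting scheme is not stated here), and the helper names `cosine_similarity` and `affinity_matrix` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def affinity_matrix(docs):
    """Build f(i -> j) for tokenized documents; f(i -> i) is fixed at 0."""
    # Assumption: term weights are raw term counts.
    vectors = [Counter(tokens) for tokens in docs]
    n = len(vectors)
    f = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-transition
                f[i][j] = cosine_similarity(vectors[i], vectors[j])
    return f


# Toy documents, for illustration only.
docs = [
    ["graph", "rank", "walk"],
    ["graph", "rank", "vote"],
    ["cluster", "topic"],
]
f = affinity_matrix(docs)
```

In this toy set the first two documents share terms and are therefore connected, while the third shares no terms with the others and has no edges.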

In Wan and Yang (2008), the transition probability from $v_i$ to $v_j$, denoted $p(i \to j)$, is defined by normalizing the corresponding affinity weight, as shown in Equation (5-2).

$$p(i \to j) = \begin{cases} \dfrac{f(i \to j)}{\sum_{k=1}^{|V|} f(i \to k)} & \text{if } \sum_{k=1}^{|V|} f(i \to k) \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(5-2)}$$
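The normalization in Equation (5-2) can be sketched as follows. The function name `transition_matrix` and the affinity values are hypothetical and chosen only for illustration.

```python
def transition_matrix(f):
    """Row-normalize affinity weights f(i -> j) into p(i -> j), as in Equation (5-2)."""
    p = []
    for row in f:
        total = sum(row)
        if total != 0:
            p.append([w / total for w in row])
        else:
            # No outgoing affinity: Equation (5-2) assigns probability 0.
            p.append([0.0] * len(row))
    return p


# Illustrative symmetric affinity weights; the last document is isolated.
f = [
    [0.0, 0.6, 0.2, 0.0],
    [0.6, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
p = transition_matrix(f)
```

Each row with at least one non-zero affinity sums to 1 after normalization; the row of the isolated document stays all zero.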

Formally, $p(i \to j)$ is not necessarily equal to $p(j \to i)$. Wan and Yang (2008) use the matrix $M = (M_{i,j})_{|V| \times |V|}$ to describe $G$, with each entry corresponding to a transition probability, $M_{i,j} = p(i \to j)$. To make $M$ a stochastic matrix, rows whose elements are all zero are replaced by a smoothing vector with all elements set to $1/|V|$. However, in our experiment, we are not concerned with the direction of the links between documents, i.e. which document leads to which other document, which means that $p(i \to j)$ is equal to $p(j \to i)$. In our case, the saliency score $Score(v_i)$ for document $v_i$ can be deduced from matrix $M$ and formulated in a recursive form, as in the PageRank algorithm, shown in Equation (5-3).

$$Score(v_i) = \lambda \cdot \sum_{j \neq i} Score(v_j) \cdot M_{j,i} + \frac{1 - \lambda}{|V|} \qquad \text{(5-3)}$$

where λ is a damping factor usually set to 0.85, as in the PageRank algorithm (Page et al., 1998). For implementation, the initial scores of all documents are set to 1, and the iterative algorithm in Equation (5-3) is applied to compute the new scores of the documents. The iteration is considered to have converged when the difference between the scores computed in two successive iterations falls below a given threshold for every document.
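The iterative procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the threshold `tol`, the iteration cap `max_iter`, and the toy matrix are all assumptions.

```python
def smooth_zero_rows(m):
    """Replace all-zero rows with a uniform vector 1/|V| so that M is stochastic."""
    n = len(m)
    return [row[:] if any(row) else [1.0 / n] * n for row in m]


def saliency_scores(m, damping=0.85, tol=1e-6, max_iter=1000):
    """Iterate Equation (5-3) until no score changes by more than tol."""
    n = len(m)
    scores = [1.0] * n  # initial scores of all documents are set to 1
    for _ in range(max_iter):
        new_scores = [
            damping * sum(scores[j] * m[j][i] for j in range(n) if j != i)
            + (1.0 - damping) / n
            for i in range(n)
        ]
        # Largest change of any single document score in this iteration.
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy transition matrix: document 0 is linked to documents 1 and 2.
m = [
    [0.0, 0.75, 0.25],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = saliency_scores(smooth_zero_rows(m))
```

In this toy graph document 0 receives votes from both other documents and ends up with the highest score, followed by document 1, which receives the larger share of document 0's vote.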

5.2.2 Two-Layer Model

A cluster-based conditional MRW model was proposed in (Wan and Yang, 2008). This conditional MRW model is based on a two-layer link graph that includes both document and cluster information. The work assumed that a document set usually contains several unrelated topics, that each topic can be represented by a cluster of topic-related sentences, and that the topic clusters are not equally important. The authors conducted experiments on the Document Understanding Conference (DUC) document summarization evaluation datasets (DUC2001^17 and DUC2002^18). Three popular clustering algorithms were explored for the detection of theme clusters within the document set: K-means Clustering, Agglomerative Clustering and Divisive Clustering. According to their results, the performance of each clustering algorithm varies with the value of 𝜆 in Equation (5-3). Overall, the Agglomerative Clustering algorithm obtained the best average performance among the three clustering algorithms explored.

Figure 5.2 Two-layer link graph (Wan and Yang, 2008)

The two-layer model proposed by Wan and Yang (2008) is shown in Figure 5.2. The lower layer represents the traditional link graph between documents, and the upper layer represents the topic clusters. The dashed lines between these two layers indicate the conditional influence between the documents and the clusters. Formally, they represent the two-layer graph as $G^* = (V, V_c, E_{dd}, E_{dc})$, where $V$ is the set of documents and $V_c$ is the set of hidden nodes representing the detected theme clusters; $E_{dd} = \{e_{ij} \mid v_i, v_j \in V\}$ corresponds to all links between documents, and $E_{dc} = \{e_{ij} \mid v_i \in V, c_j \in V_c \text{ and } c_j = C(v_i)\}$ corresponds to the correlation between a document and its cluster. $C(v_i)$ denotes the theme cluster containing document $v_i$.

^17 http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
^18 http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html

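They incorporated two factors, the source cluster $C(v_i)$ and the destination cluster $C(v_j)$, into the transition probability from $v_i$ to $v_j$; the new transition probability is defined as shown in Equation (5-4), which normalizes the conditional affinity weight in the same way as Equation (5-2).

$$p(i \to j \mid C(v_i), C(v_j)) = \begin{cases} \dfrac{f(i \to j \mid C(v_i), C(v_j))}{\sum_{k=1}^{|V|} f(i \to k \mid C(v_i), C(v_k))} & \text{if } \sum_{k=1}^{|V|} f(i \to k \mid C(v_i), C(v_k)) \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(5-4)}$$

where $f(i \to j \mid C(v_i), C(v_j))$ is the affinity weight between two documents $v_i$ and $v_j$, conditioned on the two clusters containing the two documents. $f(i \to j \mid C(v_i), C(v_j))$ is computed as shown in Equation (5-5).

$$\begin{aligned} f(i \to j \mid C(v_i), C(v_j)) &= \beta \cdot f(i \to j \mid C(v_i)) + (1 - \beta) \cdot f(i \to j \mid C(v_j)) \\ &= \beta \cdot f(i \to j) \cdot \pi(C(v_i)) \cdot \omega(v_i, C(v_i)) + (1 - \beta) \cdot f(i \to j) \cdot \pi(C(v_j)) \cdot \omega(v_j, C(v_j)) \\ &= f(i \to j) \cdot \big( \beta \cdot \pi(C(v_i)) \cdot \omega(v_i, C(v_i)) + (1 - \beta) \cdot \pi(C(v_j)) \cdot \omega(v_j, C(v_j)) \big) \end{aligned} \qquad \text{(5-5)}$$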

where $\beta \in [0,1]$ is the combination weight controlling the relative contributions from the source cluster and the destination cluster. $\pi(C(v_i)) \in [0,1]$ denotes the importance of cluster $C(v_i)$ in the whole document set $S$. It is computed as the cosine similarity between the representation of the cluster $C(v_i)$ and the representation of the document set $S$, as shown in Equation (5-6).

$$\pi(C(v_i)) = sim_{cosine}(C(v_i), S) \qquad \text{(5-6)}$$

$\omega(v_i, C(v_i)) \in [0,1]$ denotes the strength of the correlation between document $v_i$ and its cluster $C(v_i)$. It is computed as the cosine similarity between the document and the cluster, as shown in Equation (5-7).

$$\omega(v_i, C(v_i)) = sim_{cosine}(v_i, C(v_i)) \qquad \text{(5-7)}$$

The new row-normalized matrix $M^*$ is defined as shown in Equation (5-8).

$$M^*_{i,j} = p(i \to j \mid C(v_i), C(v_j)) \qquad \text{(5-8)}$$

Similar to the one-layer model, the saliency score $Score(v_i)$ for document $v_i$ is computed based on the matrix $M^*$ by using the iterative form in Equation (5-3).
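To make the construction of $M^*$ concrete, the following is a minimal sketch that combines Equations (5-5) to (5-8). Several choices are assumptions made only for illustration: term weights are raw term counts, a cluster and the whole document set are each represented by the summed term counts of their member documents, `beta` defaults to 0.5, rows are normalized in the manner of Equation (5-2), and the cluster labels are given rather than produced by a clustering algorithm. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def conditional_transition_matrix(docs, labels, beta=0.5):
    """Build the row-normalized matrix M* for tokenized docs and cluster labels."""
    vectors = [Counter(tokens) for tokens in docs]
    n = len(vectors)
    # Assumption: a cluster (and the whole set S) is represented by the
    # summed term counts of its member documents.
    whole = Counter()
    clusters = {}
    for vec, label in zip(vectors, labels):
        whole.update(vec)
        clusters.setdefault(label, Counter()).update(vec)
    # Cluster importance, Equation (5-6).
    pi = {c: cosine(vec, whole) for c, vec in clusters.items()}
    # Document-to-cluster correlation, Equation (5-7).
    omega = [cosine(vectors[i], clusters[labels[i]]) for i in range(n)]
    m_star = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j:
                row.append(0.0)  # no self-transition
                continue
            f_ij = cosine(vectors[i], vectors[j])
            weight = (beta * pi[labels[i]] * omega[i]
                      + (1.0 - beta) * pi[labels[j]] * omega[j])
            row.append(f_ij * weight)  # conditional affinity, Equation (5-5)
        total = sum(row)
        m_star.append([w / total for w in row] if total != 0 else row)
    return m_star


# Toy documents with hand-assigned cluster labels, for illustration only.
docs = [["graph", "rank"], ["graph", "walk"], ["topic", "cluster"], ["topic", "theme"]]
labels = [0, 0, 1, 1]
m_star = conditional_transition_matrix(docs, labels)
```

The resulting matrix can then be passed to the same iterative scoring procedure as the one-layer matrix $M$.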

Building on these existing models, which consider the correlation between clusters and documents in the PageRank algorithm, we propose a three-layer model that additionally incorporates the correlation between the query and the clusters into the PageRank algorithm. The following sections introduce our proposed algorithm.
