In this section, two existing MRW models are described: a one-layer model and a two-layer model. The application of these two models in the PageRank algorithm is then described.
5.2.1 Basic One-Layer Model
An MRW model is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of a "vote" or "recommendation" between vertices (Wan and Yang, 2008). A link between two vertices is considered as a vote that one vertex gives to another. The score associated with a vertex is determined by the votes that are given for it.
Figure 5.1 One-layer link graph (Wan and Yang, 2008)
Wan and Yang (2008) define the notation as follows. Given a document set $S$, let $G = (V, E)$ be a graph that reflects the relationships between the documents in the whole document set, as shown in Figure 5.1. $V$ is the set of vertices, where each vertex $v_i$ in $V$ is a document in $S$. $E$ is the set of edges, which is a subset of $V \times V$. Each edge $e_{ij}$ in $E$ is associated with an affinity weight $f(i \to j)$ between documents $v_i$ and $v_j$ ($i \neq j$). Each document $v_i$ is represented as a set of terms $v_i(t_1, t_2, \ldots, t_n)$. The affinity weight is computed using the standard cosine similarity measure (Baeza-Yates and Ribeiro-Neto, 1999) between two documents, shown in Equation (5-1).
$$f(i \to j) = \mathrm{sim}_{\mathrm{cosine}}(v_i, v_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{|\vec{v}_i| \times |\vec{v}_j|} \tag{5-1}$$
where $\vec{v}_i$ and $\vec{v}_j$ are the term vectors of $v_i$ and $v_j$. Two vertices are considered connected if the affinity weight between them is greater than 0, and $f(i \to i) = 0$ is defined to avoid self-transitions.
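As an illustration of Equation (5-1), the following minimal Python sketch builds the affinity weights for a toy document set. It assumes raw term counts as term weights (the weighting scheme is not fixed by the definition above), and the names `cosine_similarity` and `affinity_matrix` are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def affinity_matrix(term_vectors):
    """Affinity weights f(i -> j); the diagonal stays 0 to avoid self-transitions."""
    n = len(term_vectors)
    f = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                f[i][j] = cosine_similarity(term_vectors[i], term_vectors[j])
    return f


# Toy documents; raw term counts are an assumption made for this sketch.
docs = ["markov random walk", "random walk on graphs", "cosine similarity"]
vectors = [Counter(d.split()) for d in docs]
F = affinity_matrix(vectors)
```

In this toy set the first two documents share two terms and are therefore connected, while the third shares no term with either and has affinity weight 0 to both.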
In Wan and Yang (2008), each entry of the transition probability matrix $M$, i.e. the transition probability $p(i \to j)$ from $v_i$ to $v_j$, is defined by normalizing the corresponding affinity weight, as shown in Equation (5-2).
$$p(i \to j) = \begin{cases} \dfrac{f(i \to j)}{\sum_{k=1}^{|V|} f(i \to k)}, & \text{if } \sum_{k=1}^{|V|} f(i \to k) \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{5-2}$$
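The row normalization in Equation (5-2) can be sketched as follows. This is a minimal example under the assumption that the affinity weights are given as a list of lists; `transition_matrix` is a hypothetical name, and the numeric weights are made up for illustration.

```python
def transition_matrix(f):
    """Row-normalize affinity weights f[i][j] into p(i -> j), as in Equation (5-2)."""
    p = []
    for row in f:
        total = sum(row)
        if total != 0:
            p.append([w / total for w in row])
        else:
            # A document with no connections keeps an all-zero row.
            p.append([0.0] * len(row))
    return p


# Illustrative affinity weights for four documents; the last one is isolated.
F = [
    [0.0, 0.6, 0.2, 0.0],
    [0.6, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
P = transition_matrix(F)
```

Each non-zero row of `P` sums to 1, and the isolated document keeps a row of zeros.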
In general, $p(i \to j)$ is not equal to $p(j \to i)$. Wan and Yang (2008) use the matrix $M = (M_{i,j})_{|V| \times |V|}$ to describe $G$, with each entry corresponding to the transition probability $M_{i,j} = p(i \to j)$. To make $M$ a stochastic matrix, rows with all zero elements are replaced by a smoothing vector with all elements set to $1/|V|$. However, in our experiment, we are not concerned with the direction of the links between documents, i.e. which document leads to which other document, which means that $p(i \to j)$ is equal to $p(j \to i)$. In our case, the saliency score $\mathrm{Score}(v_i)$ for document $v_i$ can be deduced from matrix $M$ and formulated in a recursive form, as in the PageRank algorithm, shown in Equation (5-3).
$$\mathrm{Score}(v_i) = \lambda \cdot \sum_{j \neq i} \mathrm{Score}(v_j) \cdot M_{j,i} + \frac{1 - \lambda}{|V|} \tag{5-3}$$
where λ is a damping factor usually set to 0.85, as in the PageRank algorithm (Page et al., 1998). For implementation, the initial scores of all documents are set to 1, and the iterative algorithm in Equation (5-3) is applied to compute the new scores of the documents. The iteration is considered to have converged when the difference between the scores computed in two successive iterations falls below a given threshold for every document.
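The iterative computation described above can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments: the function name `mrw_scores`, the threshold value `tol=1e-4` and the iteration cap `max_iter` are assumptions, and the input is assumed to be a row-normalized matrix as in Equation (5-2).

```python
def mrw_scores(p, damping=0.85, tol=1e-4, max_iter=1000):
    """Iterate Equation (5-3) on a row-normalized matrix p until convergence.

    tol and max_iter are assumed values chosen for this sketch.
    """
    n = len(p)
    # Replace all-zero rows with a uniform smoothing vector (entries 1/n).
    m = [row if any(row) else [1.0 / n] * n for row in p]
    scores = [1.0] * n  # initial scores of all documents are set to 1
    for _ in range(max_iter):
        new_scores = [
            damping * sum(scores[j] * m[j][i] for j in range(n) if j != i)
            + (1.0 - damping) / n
            for i in range(n)
        ]
        # Stop when no document's score changes by more than tol.
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Illustrative transition matrix for three documents.
P = [
    [0.0, 0.75, 0.25],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = mrw_scores(P)
```

In this example the first document, which receives all of the votes from the other two, obtains the highest score.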
5.2.2 Two-Layer Model
A cluster-based conditional MRW model was proposed by Wan and Yang (2008). This conditional MRW model is based on a two-layer link graph that includes both document and cluster information. Their work assumed that a document set usually contains several unrelated topics, that each topic can be represented by a cluster of topic-related sentences, and that the topic clusters are not equally important. The authors conducted experiments on the datasets of the Document Understanding Conference (DUC) document summarization evaluation tasks (DUC2001 [17] and DUC2002 [18]). Three popular clustering algorithms were explored for the detection of theme clusters within the document set: K-means Clustering, Agglomerative Clustering and Divisive Clustering. According to their results, the performance of each clustering algorithm varies with the value of λ in Equation (5-3). Overall, the Agglomerative Clustering algorithm obtained the best average performance among the three clustering algorithms explored.
Figure 5.2 Two-layer link graph (Wan and Yang, 2008)
The two-layer model proposed by Wan and Yang (2008) is shown in Figure 5.2. The lower layer represents the traditional link graph between documents, and the upper layer represents the topic clusters. The dashed lines between the two layers indicate the conditional influence between the documents and the clusters. Formally, they represent the two-layer graph as $G^* = (V, V_c, E_{dd}, E_{dc})$, where $V$ is the set of documents and $V_c$ is the set of hidden nodes representing the detected theme clusters; $E_{dd} = \{e_{ij} \mid v_i, v_j \in V\}$ corresponds to all links between documents, and $E_{dc} = \{e_{ij} \mid v_i \in V, c_j \in V_c \text{ and } c_j = C(v_i)\}$ corresponds to the correlation between a document and its cluster. $C(v_i)$ indicates the theme cluster containing document $v_i$.

[17] http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
[18] http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html
They incorporated two factors, the source cluster $C(v_i)$ and the destination cluster $C(v_j)$, into the transition probability from $v_i$ to $v_j$; the new transition probability is defined as shown in Equation (5-4).

$$p(i \to j \mid C(v_i), C(v_j)) = \begin{cases} \dfrac{f(i \to j \mid C(v_i), C(v_j))}{\sum_{k=1}^{|V|} f(i \to k \mid C(v_i), C(v_k))}, & \text{if } \sum_{k=1}^{|V|} f(i \to k \mid C(v_i), C(v_k)) \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{5-4}$$
where $f(i \to j \mid C(v_i), C(v_j))$ is the affinity weight between two documents $v_i$ and $v_j$, conditioned on the two clusters containing the two documents. It is computed as shown in Equation (5-5).
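$$\begin{aligned} f(i \to j \mid C(v_i), C(v_j)) &= \beta \cdot f(i \to j \mid C(v_i)) + (1 - \beta) \cdot f(i \to j \mid C(v_j)) \\ &= \beta \cdot f(i \to j) \cdot \pi(C(v_i)) \cdot \omega(v_i, C(v_i)) + (1 - \beta) \cdot f(i \to j) \cdot \pi(C(v_j)) \cdot \omega(v_j, C(v_j)) \\ &= f(i \to j) \cdot \big( \beta \cdot \pi(C(v_i)) \cdot \omega(v_i, C(v_i)) + (1 - \beta) \cdot \pi(C(v_j)) \cdot \omega(v_j, C(v_j)) \big) \end{aligned} \tag{5-5}$$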
where β ∈ [0,1] is the combination weight controlling the relative contributions from the source cluster and the destination cluster. $\pi(C(v_i)) \in [0,1]$ denotes the importance of cluster $C(v_i)$ in the whole document set $S$. It is computed as the cosine similarity between the representation of the cluster $C(v_i)$ and the representation of the document set $S$, as shown in Equation (5-6).
$$\pi(C(v_i)) = \mathrm{sim}_{\mathrm{cosine}}(C(v_i), S) \tag{5-6}$$
$\omega(v_i, C(v_i)) \in [0,1]$ denotes the strength of the correlation between document $v_i$ and its cluster $C(v_i)$. It is computed as the cosine similarity between the document and the cluster, as shown in Equation (5-7).
$$\omega(v_i, C(v_i)) = \mathrm{sim}_{\mathrm{cosine}}(v_i, C(v_i)) \tag{5-7}$$

The new row-normalized matrix $M^*$ is defined as shown in Equation (5-8).
$$M^*_{i,j} = p(i \to j \mid C(v_i), C(v_j)) \tag{5-8}$$
Similar to the one-layer model, the saliency score $\mathrm{Score}(v_i)$ for document $v_i$ is computed from the matrix $M^*$ using the iterative form in Equation (5-3).
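To make the construction of the cluster-conditioned matrix in Equations (5-4) to (5-8) concrete, the following Python sketch computes it for a toy document set. Several choices here are assumptions rather than part of the model description: documents are represented by raw term counts, a cluster and the whole document set are represented by the summed term vectors of their member documents, the cluster labels are taken as given (they could come from any of the clustering algorithms mentioned earlier), and the names `cosine` and `conditional_transition_matrix` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def conditional_transition_matrix(vectors, labels, beta=0.5):
    """Row-normalized cluster-conditioned matrix, following Equations (5-4) to (5-8)."""
    n = len(vectors)
    # Assumption: a cluster and the whole set are represented by the
    # summed term vectors of their member documents.
    whole_set = Counter()
    clusters = {}
    for vec, label in zip(vectors, labels):
        whole_set.update(vec)
        clusters.setdefault(label, Counter()).update(vec)
    # Cluster importance, Equation (5-6).
    importance = {c: cosine(cv, whole_set) for c, cv in clusters.items()}
    # Document-cluster correlation, Equation (5-7).
    strength = [cosine(vectors[i], clusters[labels[i]]) for i in range(n)]
    # Combined weight of each document's own cluster.
    w = [importance[labels[i]] * strength[i] for i in range(n)]
    m_star = []
    for i in range(n):
        # Conditional affinity weight, Equation (5-5); diagonal stays 0.
        row = [
            0.0 if i == j
            else cosine(vectors[i], vectors[j]) * (beta * w[i] + (1 - beta) * w[j])
            for j in range(n)
        ]
        total = sum(row)
        # Row normalization, Equation (5-4).
        m_star.append([x / total for x in row] if total != 0 else row)
    return m_star


docs = ["markov random walk", "random walk model",
        "topic model cluster", "topic cluster"]
labels = [0, 0, 1, 1]  # assumed cluster assignment for this sketch
vectors = [Counter(d.split()) for d in docs]
M_star = conditional_transition_matrix(vectors, labels, beta=0.5)
```

Setting `beta=1` makes the weights depend only on the source cluster, and `beta=0` only on the destination cluster. The resulting matrix can then be used in place of the one-layer matrix in the iteration of Equation (5-3).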
Building on these existing models, we propose a three-layer model that, in addition to the correlation between clusters and documents, incorporates the correlation between the query and the clusters into the PageRank algorithm. The following sections introduce the proposed algorithm.