Chapter 2: Literature Review
2.2 Sentence Selection and Ranking for Generic and Focused Extractions
Many sentence extraction methods rely on a ranking mechanism to quantify the importance, or saliency, of each candidate sentence. In generic multi-document summarization, Nenkova and Vanderwende (2005) proposed SumBasic, a sentence scoring method based on corpus-level word frequency. Their method is motivated by the observation that human-written summaries tend to contain highly frequent words. In SumBasic, each sentence is scored by the average probability of the words it contains. After the top-scoring sentence is chosen, the probabilities of the words in the selected sentence are reduced to penalize redundancy. Using standard benchmark data sets, they showed empirically that word frequency contributes significantly to the extraction of salient sentences. Several subsequent works extended SumBasic in different directions. For example, Yih et al. (2007) used sentence position in addition to the word frequency feature. In topic-focused summarization, Vanderwende et al. (2007) proposed SumFocus, which computes a sentence score from a linear combination of the unigram probabilities derived from the topic description and the unigram probabilities derived from the document.
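The frequency-based scoring and redundancy update described above can be illustrated with a minimal sketch. It is a simplified reading of SumBasic rather than the authors' implementation: it skips the step that first picks the most probable word, it assumes pre-tokenized input, and the function name `sumbasic` and the parameter `max_sentences` are hypothetical. The squaring update is the rule commonly reported for SumBasic.

```python
from collections import Counter


def sumbasic(sentences, max_sentences=2):
    """Greedy SumBasic-style selection (simplified sketch).

    `sentences` is a list of token lists. Returns the indices of the
    selected sentences in the order they were chosen.
    """
    tokens = [w for s in sentences for w in s]
    total = len(tokens)
    # Unigram probability of each word over the whole input.
    prob = {w: c / total for w, c in Counter(tokens).items()}

    chosen = []
    remaining = [i for i, s in enumerate(sentences) if s]
    while remaining and len(chosen) < max_sentences:
        # Sentence score: average probability of its words.
        best = max(
            remaining,
            key=lambda i: sum(prob[w] for w in sentences[i]) / len(sentences[i]),
        )
        chosen.append(best)
        remaining.remove(best)
        # Redundancy penalty: square the probability of every word
        # that appears in the sentence just selected.
        for w in set(sentences[best]):
            prob[w] = prob[w] ** 2
    return chosen


docs = [["cat", "sat", "mat"], ["cat", "cat", "sat"], ["dog", "ran"]]
print(sumbasic(docs, max_sentences=2))
```

In this toy input the second sentence is chosen first because it contains the most frequent words. Once the probabilities of "cat" and "sat" have been squared, the first sentence drops below the unrelated third sentence, which shows how the update discourages redundant selections.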
Erkan and Radev (2004) proposed LexRank, an eigenvector centrality approach for finding salient sentences in multi-document summarization. Their method is inspired by the well-known PageRank algorithm (Brin and Page 1998). In essence, LexRank defines a random walk over a sentence graph in which each vertex represents a sentence and each edge represents the similarity between two sentences. The edge weight is the TF-IDF-weighted cosine similarity between the two sentence nodes. Since the sentence graph can be transformed into a stochastic matrix, it defines a Markov chain. Each sentence can therefore be ranked by its probability under the stationary distribution of the chain, and the sentences with the highest stationary probabilities are selected as important. The LexRank method has been extended to several topic-focused sentence extraction tasks, such as question answering (Otterbacher et al. 2005) and focused summarization (Otterbacher et al. 2009). The extended approach, called topic-focused LexRank, is defined as a mixture model that combines the relevance of a sentence to the query with the similarity between sentences. Mihalcea and Tarau (2004) independently proposed TextRank, a similar eigenvector centrality approach for single-document summarization. Its main idea is the same as that of LexRank: a sentence graph is constructed and transformed into a stochastic matrix, and sentences are then selected according to their stationary probabilities.
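The random-walk ranking described above can be sketched as a power iteration over a row-normalized similarity matrix. This is an illustrative approximation under several assumptions, not the published LexRank implementation: each sentence is treated as a document when computing IDF, the IDF is smoothed by adding one, a PageRank-style damping factor of 0.85 is used, and no similarity threshold is applied. The function names and parameters are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank(sentences, damping=0.85, iterations=100):
    """Return one centrality score per sentence (scores sum to 1)."""
    n = len(sentences)
    if n == 0:
        return []
    # Assumed IDF variant: sentence-level document frequency, +1 smoothing.
    df = Counter(w for s in sentences for w in set(s))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    vecs = [{w: c * idf[w] for w, c in Counter(s).items()} for s in sentences]

    # Pairwise TF-IDF cosine similarities (self-similarity included).
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    # Row-normalize to obtain a stochastic transition matrix.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([x / total for x in row] if total else [1.0 / n] * n)

    # Power iteration toward the stationary distribution of the damped walk.
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n
            + damping * sum(rank[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


sents = [["a", "b"], ["a", "b", "c", "d"], ["c", "d"]]
scores = lexrank(sents)
print(max(range(len(sents)), key=lambda i: scores[i]))
```

In this toy example the middle sentence shares words with both of the others, so it receives the largest stationary probability.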
Some recent works have applied information distance to extractive summarization. Information distance is based on Kolmogorov complexity (Li and Vitanyi 1997), which is comparable to the well-known information theory developed by Claude Shannon. For instance, Long et al. (2009) proposed an approach to extractive multi-document summarization based on conditional information distance. Their approach presented two methods for estimating information distance: approximation by compression and approximation by coding theory. Topic models (Blei et al. 2003) have also been explored in the context of focused summarization. For example, Tang et al. (2009) addressed the problem of multi-topic focused summarization by proposing a statistical topic model that discovers multiple topics in a document collection. They explored two strategies for incorporating the query information into the topic model. The first strategy integrated the query information into the generative process of the topic model, resulting in a mixture of a document-specific topic distribution and a query-specific topic distribution. The second strategy used a regularization term to constrain the topic model by the query information. In essence, the query-specific topics were employed to bias the topic model.
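Because Kolmogorov complexity is not computable, the idea of approximating information distance by compression mentioned above is usually realized with an off-the-shelf compressor. The sketch below uses the widely used normalized compression distance with zlib as a stand-in. It is not the conditional information distance estimator of Long et al. (2009), and the function names are hypothetical.

```python
import zlib


def compressed_size(text):
    """Length in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance between two non-empty strings.

    The compressed size stands in for Kolmogorov complexity, so the
    result is only a rough approximation of information distance.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = "the committee approved the new budget for road repairs in the northern district"
b = "heavy rainfall is expected along the coast on thursday with strong winds overnight"
print(ncd(a, a), ncd(a, b))
```

A text compared with itself should yield a much smaller distance than two unrelated texts, because the compressor can encode the second copy almost entirely as a reference to the first.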
Other summarization methods treat diversity as one of the major goals of extractive generic and focused summarization. Recently, a growing number of works have attempted to integrate diversity into the sentence ranking function itself. For example, Zhu et al. (2007) proposed a unified ranking algorithm called GRASSHOPPER, which is based on random walks over an absorbing Markov chain. The representative sentences that have already been selected into the summary become absorbing states, which effectively sets their transition probabilities to all other states to zero. As the walk gets absorbed, the absorbing nodes drag down the scores of their adjacent nodes. In contrast, nodes that are far away from the absorbing nodes still get visited by the random walk. Li et al. (2009) cast the diversity issue as a constrained optimization problem. They proposed a supervised method based on structural learning that incorporates diversity as a set of subtopic constraints, and then trained a summarization model that enforces diversity through the optimization problem. Wan et al. (2006) proposed a cross-document random walk to extract a focused summary with high information richness and novelty. They introduced a diversity penalty step that removes redundancy after the initial list of representative sentences has been extracted. After each top-ranked sentence i in the initial list is selected into the summary, the scores of all sentences adjacent to i are penalized.
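The diversity penalty step described above can be sketched as a greedy loop that lowers the scores of the neighbours of each selected sentence. The exact update rule is an assumption made for illustration (the neighbour's score is reduced by the product of a penalty weight, the similarity to the selected sentence, and the selected sentence's original score), and the function and parameter names are hypothetical.

```python
def select_with_diversity_penalty(scores, similarity, k, penalty=1.0):
    """Greedily pick k sentences, penalizing neighbours of each pick.

    `scores[i]` is the initial ranking score of sentence i and
    `similarity[j][i]` is the edge weight between sentences j and i
    (zero when they are not adjacent).
    """
    current = list(scores)
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < k:
        # Sort first so that ties are broken by the lowest index.
        best = max(sorted(candidates), key=lambda i: current[i])
        selected.append(best)
        candidates.remove(best)
        # Assumed update: reduce each remaining score in proportion to
        # its similarity to the sentence just selected.
        for j in candidates:
            current[j] -= penalty * similarity[j][best] * scores[best]
    return selected


scores = [0.9, 0.8, 0.5]
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(select_with_diversity_penalty(scores, similarity, k=2))
```

With the penalty enabled, the second pick is the dissimilar third sentence rather than the near-duplicate second sentence. Setting `penalty=0.0` reduces the procedure to plain top-k selection.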
More generally, diversity is one of the most important topics in many related areas, particularly in information retrieval. Perhaps the best-known work is Maximal Marginal Relevance (MMR) (Carbonell and Goldstein 1998), which first introduced a redundancy reduction method for reranking search results. Since then, it has become the most commonly used method for reducing redundancy in text summarization. Subsequent works in information retrieval have attempted to establish a theoretical framework for diversity ranking and evaluation (Agrawal et al. 2009; Clarke et al. 2008; Zhai et al. 2003). Among related works in text summarization, most graph-based ranking models (Chen et al. 2009; Otterbacher et al. 2005; Zhu et al. 2007) are inspired by the PageRank algorithm (Brin and Page 1998). They therefore employ eigenvector centrality to measure the importance of nodes in the sentence graph. Under this model, a node is considered important if it is linked to other important nodes. Put simply, it receives high recommendation votes from its adjacent nodes.
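The MMR reranking criterion mentioned above trades off relevance to the query against similarity to items that have already been selected. The following minimal sketch assumes that the relevance scores and the pairwise similarity matrix are precomputed. The function name `mmr` and the trade-off parameter `lam` are hypothetical names for illustration.

```python
def mmr(relevance, similarity, k, lam=0.7):
    """Rerank items by Maximal Marginal Relevance (sketch).

    `relevance[i]` is the similarity of item i to the query and
    `similarity[i][j]` is the similarity between items i and j.
    `lam` near 1 favours relevance, `lam` near 0 favours novelty.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:

        def marginal(i):
            # Redundancy: highest similarity to anything already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=marginal)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr(relevance, similarity, k=2, lam=0.5))
```

With `lam=0.5` the near-duplicate second item is skipped in favour of the less relevant but novel third item. With `lam=1.0` the ranking depends on relevance alone.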
In the context of graphical models, there have been several attempts to incorporate negative edges into traditional graph representations. These include areas such as trust/distrust ranking (de Kerchove and Dooren 2008; Guha et al. 2004) and social network mining (Kunegis et al. 2009; Yang et al. 2007). For example, de Kerchove and Dooren (2008) proposed the PageTrust algorithm as an extension of the original PageRank algorithm in which negative links represent the propagation of distrust among web pages. Their method ranks the nodes using both positive and negative links. Similarly, Kunegis et al. (2009) defined an eigenvector ranking method called signed spectral ranking, which considers both positive and negative links to model friend and foe relationships in a social network.
Finally, the semantic structure of sentences has been applied in a few text mining and information retrieval applications. In text categorization, Shehata et al. (2007) proposed conceptual term frequency, a new term weighting scheme computed at the sentence semantic level, and applied it to the text classification task. Wang et al. (2008) used a simple structural composition of sentences to compute similarity scores in multi-document summarization. Bilotti et al. (2007) explored the use of semantic roles to create structural search queries. Most applications of sentence semantics build on semantic role labeling research in the natural language processing domain. Gildea and Jurafsky (2002) first introduced a machine learning approach to automatically label sentence constituents with their semantic roles. Their classifier was trained on FrameNet data (Baker et al. 1998) using various linguistic features, such as the verb, head nouns, syntactic category, active/passive voice label, and grammatical function. Subsequent work by Pradhan et al. (2004) explored a shallow semantic parsing approach that trains a multi-class Support Vector Machine (SVM) classifier for the semantic role labeling task.