The idea of identifying keywords that are descriptive of the input documents was first proposed in Luhn’s fundamental work on automatic summarization (Luhn, 1958). There, keywords were identified based on their frequency in the input, with the words that appeared most and least often excluded. Sentences in which the keywords appeared close to each other were then selected to form the final summary. Many successful recent systems also employ unsupervised methods to estimate word importance. For example, estimating word importance based on frequency (probability) has been rejuvenated in the past ten years, and summarizers based on this approach achieve impressive performance (Nenkova and Vanderwende, 2005; Nenkova et al., 2006). Apart from word frequency, word importance is also estimated using document frequency (DF) (Schilder and Kondadadi, 2008). Systems that optimize the coverage of bigrams also often use DF to weight bigrams (Gillick et al., 2008; Gillick and Favre, 2009). Word importance is also often weighted by TF*IDF. Many extractive summarization systems represent sentences as vectors using bag-of-words models, where each word is weighted by TF*IDF (Erkan and Radev, 2004; Radev et al., 2004b; McDonald, 2007; Lin and Bilmes, 2011). These systems then estimate the importance of sentences based on the vectors.
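As a rough illustration of the frequency-based keyword selection described above, the following minimal sketch keeps words that are neither too rare nor among the most frequent word types. The function name `luhn_keywords` and the thresholds `low_cutoff` and `high_fraction` are assumptions made for this example; they are not values taken from Luhn (1958).

```python
import re
from collections import Counter


def luhn_keywords(text, low_cutoff=2, high_fraction=0.1):
    """Return words whose frequency is neither too low nor too high.

    low_cutoff and high_fraction are illustrative thresholds (assumptions),
    not values reported in the original work.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Treat the top fraction of word types as "too common"
    # (a crude stand-in for function words).
    n_high = int(len(counts) * high_fraction)
    too_common = {w for w, _ in counts.most_common(n_high)}
    # Keep words that occur at least low_cutoff times and are not too common.
    return {w for w, c in counts.items()
            if c >= low_cutoff and w not in too_common}


doc = "the cat sat on the mat the cat ate the fish a dog saw the cat"
keywords = luhn_keywords(doc)
```

In this toy input, "the" is discarded as the most frequent word type and words seen only once are discarded as too rare.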
Some other methods assign weights to certain words, instead of all words in the input. A powerful method is the log-likelihood ratio test (Lin and Hovy, 2000), which identifies a set of words that appear more often in the input than in a background corpus; the authors call such words topic signatures. Lin and Hovy (2000) show the effectiveness of this method over TF*IDF for word weighting in single-document summarization. Later systems employ topic signatures to score sentences for multi-document summarization and achieve competitive performance (Conroy et al., 2004; Harabagiu and Lacatusu, 2005; Conroy et al., 2006a). Han et al. (2015) use a quantitative statistical method to discover, from a paper’s citation history, salient keywords that characterize the paper’s contributions. The authors then use the identified keywords to develop a scientific paper summarization system. Graph-based approaches have also been used for extracting keywords for single-document summarization. There, the graph is constructed based on the co-occurrence of words in a sentence, and the HITS algorithm (Kleinberg, 1999) is used to rank these words (Litvak and Last, 2008).
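The log-likelihood ratio test mentioned above can be sketched as follows, assuming the standard binomial formulation: a word's counts in the input and in the background are compared under the hypothesis of a single shared probability versus two separate probabilities. The helper names and the default threshold of 10.83 (the chi-square critical value for one degree of freedom at p = 0.001) are assumptions for this example and may differ from the setup in the cited systems.

```python
import math
from collections import Counter


def _log_l(k, n, p):
    """Binomial log-likelihood of k successes in n trials (constant dropped)."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1 - p)
    return total


def llr(k1, n1, k2, n2):
    """-2 log(lambda) for a word seen k1/n1 times in input, k2/n2 in background."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # shared probability under the null hypothesis
    return 2 * (_log_l(k1, n1, p1) + _log_l(k2, n2, p2)
                - _log_l(k1, n1, p) - _log_l(k2, n2, p))


def topic_signatures(input_tokens, background_tokens, threshold=10.83):
    """Words significantly more frequent in the input than in the background."""
    inp, bg = Counter(input_tokens), Counter(background_tokens)
    n1, n2 = len(input_tokens), len(background_tokens)
    return {w for w, k1 in inp.items()
            if k1 / n1 > bg[w] / n2 and llr(k1, n1, bg[w], n2) > threshold}


input_tokens = ["quake"] * 20 + ["the"] * 80
background_tokens = ["the"] * 800 + ["quake"] + ["other"] * 199
signatures = topic_signatures(input_tokens, background_tokens)
```

Here "the" has the same relative frequency in both collections and is not selected, whereas "quake" is far more frequent in the input than in the background.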
Word importance has also been estimated by supervised approaches, with word frequency and position of word occurrence in the input (e.g., first occurrence of a word) as the most typical features. For example, a summarizer that weights words using position and frequency outperforms one that uses frequency alone (Yih et al., 2007); a stack decoding algorithm with a performance guarantee can be used to maximize the coverage of word weights (Takamura and Okumura, 2009), where the estimation of word weights resembles that of Yih et al. (2007). Recently, a framework based on submodular functions has been developed to optimize the coverage of word weights (Sipos et al., 2012), where capitalization, word length, and unigrams, along with frequency and location, are used as features. In general, however, the features that have been explored for estimating word importance in summarization are limited.
A handful of papers have productively explored the mutually reinforcing relationship between word and sentence importance. There, graph-based approaches are widely used to iteratively improve the estimation of word and sentence importance, in unsupervised (Zha, 2002; Wan et al., 2007; Wei et al., 2008) and later supervised frameworks (Liu et al., 2011).
Instead of estimating word importance, most prior work has aimed at scoring sentences in the input. Popular unsupervised approaches estimate sentence importance based on its similarity with the input (Centroid) (Radev et al., 2004b), average word frequency (Nenkova et al., 2006), latent semantic analysis (Gong and Liu, 2001; Steinberger and Jezek, 2004), and graph-based methods (Erkan and Radev, 2004; Mihalcea, 2005). Researchers have also extensively explored features used in supervised systems for estimating sentence importance. Earlier work (Kupiec et al., 1995; Teufel and Moens, 1997) used features such as sentence length and sentence position in a paragraph; later work combined a variety of indicators of sentence importance (e.g., sentence importance scores estimated by unsupervised methods) in their models (Litvak et al., 2010; Ouyang et al., 2011; Wang et al., 2013).
Prior work has also extensively investigated keyword (or rather keyphrase) identification² (e.g., extracting a list of phrases from a journal article), without further integrating the selected keywords in a full-fledged summarizer. Turney (1997) showed that word frequency in the input is a poor selector for keyphrases. A handful of studies (Frank et al., 1999; Turney, 2000; Hulth, 2003) explored using supervised methods for this task, where the features used include TF*IDF and position of the first appearance of the words (Frank et al., 1999); word frequency, document frequency, and number of words in the phrase (Turney, 2000); and part-of-speech (Hulth, 2003). Competitive with these are unsupervised methods in which the weights of words are derived from a graph describing the co-occurrence of words in the same sentence in a fixed-width window of context (Mihalcea and Tarau, 2004; Liu et al., 2010). One of our baseline methods is similar to the method of Mihalcea and Tarau (2004), but we estimate word weights for summarization.
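To make the graph-based weighting concrete, here is a minimal sketch that links words co-occurring within a fixed-width window and then ranks them with a PageRank-style iteration. The ranking step, the function names, and the parameters `window`, `damping`, and `iterations` are assumptions for illustration; this is not a reproduction of the cited systems or of our baseline.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking distinct words at most `window` tokens apart."""
    graph = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                graph[w].add(tokens[j])
                graph[tokens[j]].add(w)
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """PageRank-style scores on an unweighted, undirected graph."""
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's score.
        scores = {
            w: (1 - damping)
            + damping * sum(scores[v] / len(graph[v]) for v in graph[w])
            for w in graph
        }
    return scores


tokens = "a b a c a d".split()
scores = rank_words(cooccurrence_graph(tokens, window=1))
top_word = max(scores, key=scores.get)
```

In this toy sequence, "a" is adjacent to every other word and therefore receives the highest score.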
² The task there is in fact keyphrase extraction, because the “keywords” are often phrases of at
Overall, only a handful of studies investigate suitable features for estimating the importance of words for summarization. In our work, we incorporate a much broader set of features for estimating word importance. Some of our features are derived from unsupervised methods, some are inspired by features used in keyphrase identification, and others are derived from knowledge independent of the input.