Word Properties - Content Selection in Multi-Document Summarization

3.5 Features

3.5.2 Word Properties

We discuss three kinds of word properties: part-of-speech, named entity categories, and word capitalization.

Part-of-speech: Part-of-speech (POS) information is useful in improving the identification of keyphrases that are used for indexing (Hulth, 2003). In summarization, POS information is also effective in eliminating unimportant content (e.g., lead adverbials, gerund clauses) to make the summaries more concise (Dunlavy et al., 2003; Conroy et al., 2006b). One relevant analysis appears in Gillick (2011), who studied the distribution of POS tags in the document sets and the summaries. More recently, Woodsend and Lapata (2012) used POS information as features for estimating word importance. The estimate serves as an indicator that helps decide whether or not a node in a parse tree should be removed in a summarization system that uses compressed sentences of the input. In general, however, POS tags are not often used as features for estimating word importance in summarization.

| Features | Sample words | prop | WRS | +/- | rf |
|---|---|---|---|---|---|
| NNP | president, November, Clinton | NA | 3.7e-24 | + | 15.9% |
| NNPS | States, Nations, embassies | NA | 1.2e-18 | + | 34.9% |
| VBN | accused, killed, held, arrested | NA | 3.3e-6 | + | 14.5% |
| VBG | taking, adding, including, speaking | NA | 2.7e-5 | - | 7.6% |
| RB | ago, recently, long, apparently | NA | 3.1e-5 | - | 6.4% |
| NNS | years, officials, countries, victims | NA | 4e-5 | + | 13.1% |
| FW | el, hage, cardinal, es | NA | 7.3e-5 | + | 33.3% |
| VB | make, give, carry, put, face | NA | 0.0003 | - | 7.8% |
| VBZ | appears, remains, includes, means | NA | 0.0003 | - | 7.8% |
| CD | 13, 1998, million, billion | NA | 0.002 | + | 12.5% |
| VBP | remain, include, feel, fear | NA | 0.004 | - | 6.2% |
| JJR | larger, lower, higher, greater | NA | 0.007 | - | 6.2% |
| VBD | killed, began, voted, reported | NA | 0.011 | + | 13.1% |
| NN | president, government, state | NA | 0.040 | + | 11.8% |
| Organization | international, state, national | NA | 3.6e-61 | + | 25.5% |
| Location | U.S., States, United, York, Saudi | NA | 4.2e-44 | + | 22.4% |
| Other entities | president, time, international | NA | 8.2e-17 | + | 11.0% |
| Person Names | Ms., David, John, Ali, Michael | NA | 0.009 | - | 9.5% |
| Date | November, October, 1998, years | NA | 0.023 | + | 12.3% |
| Money | million, billion, 13, 30 | NA | 0.035 | + | 13.4% |
| Ever capitalized? | Bush, United, British, China | 3e-35 | 2.2e-35 | + | 16.0% |
| Capitalization ratio? | NA | NA | 2.1e-31 | + | NA |
| All capitalized? | Bush, United, British, China | 2.6e-11 | 2.1e-11 | + | 13.8% |

Table 3.3: The significant part-of-speech, named entity and capitalization features. We show their p-values by Wilcoxon rank-sum test (WRS). For binary features, we also show their p-values by proportion test (prop). In the +/- column, + indicates that the feature is more prevalent in the summaries and - that it is more prevalent in the input. rf indicates the percentage of words with this feature tag in the input that are included in human summaries; the mean rf of all words is 10.9%.

In our work, we include POS tags for each individual word, obtained with the Stanford POS-Tagger (Toutanova et al., 2003). We have one real-valued feature corresponding to each POS tag: let $N_w$ denote the number of occurrences of word $w$ in the input and let $N'_w$ denote the number of occurrences of word $w$ with POS tag $t$ in the input; the value of this feature is then equal to $N'_w / N_w$. In most cases only one feature gets a non-zero value.
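The per-tag feature described above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes the input has already been tagged and is available as a list of (word, tag) pairs, and the function name `pos_fraction_features` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def pos_fraction_features(tagged_tokens):
    """Map each word to {tag: fraction of the word's occurrences with that tag}.

    tagged_tokens: iterable of (word, tag) pairs for one input
    (assumed to come from an external POS tagger).
    """
    word_counts = Counter()                 # occurrences of each word
    word_tag_counts = defaultdict(Counter)  # occurrences of each word per tag
    for word, tag in tagged_tokens:
        word_counts[word] += 1
        word_tag_counts[word][tag] += 1
    return {
        word: {tag: n / word_counts[word] for tag, n in tags.items()}
        for word, tags in word_tag_counts.items()
    }


# Hypothetical tagged input, for illustration only.
tagged = [("state", "NN"), ("state", "NN"), ("state", "VB"), ("killed", "VBN")]
features = pos_fraction_features(tagged)
```

Tags that never occur with a word are simply absent from its dictionary, which corresponds to a feature value of zero. A word that always receives the same tag, such as "killed" here, has a single feature equal to 1.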

Of all POS tags, 14 are significant (see Table 3.3). Compared to the input, the summaries contain more nouns (NNS, NNPS, NN), numbers (CD) and past tense verbs (VBN, VBD), and fewer present tense verbs (VB, VBG, VBP, VBZ), comparative adjectives (JJR) and adverbs (RB). For a description of the part-of-speech tag set, see Santorini (1990).

We also quantify the percentage of words with a given POS tag in the input that are also included in human summaries. Formally, let $W_I$ denote the set of content words in the input $I$ and let $S_I$ denote the set of content words in the summary corresponding to $I$. Let $W_{I,f}$ and $S_{I,f}$ denote the sets of words in $W_I$ and $S_I$, respectively, that have the POS tag corresponding to feature $f$. For each input $I$ and feature $f$, we compute:

$$r_{I,f} = \frac{|W_{I,f} \cap S_{I,f}|}{|W_{I,f}|} \tag{3.5}$$

The final $r_f$ that corresponds to feature $f$ is equal to the mean of $r_{I,f}$ over all input sets. We show the $r_f$ for all significant features in Table 3.3. Of all content words in the input, 10.9% appear in human summaries. The $r_f$ of many significant features (e.g., NNPS, VBN) is much higher than 10.9%.
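As a rough sketch of Equation 3.5 and the averaging step, the code below computes the per-input rate and its mean over input sets. The function names and the toy word sets are hypothetical. Skipping inputs in which no word carries the feature is an assumption of this sketch; the text does not say how that case is handled.

```python
from statistics import mean


def rate_for_input(input_words_f, summary_words_f):
    """Share of the input's words with feature f that also occur,
    with feature f, in the corresponding summary (one input set)."""
    return len(input_words_f & summary_words_f) / len(input_words_f)


def rate_for_feature(pairs):
    """Mean of the per-input rate over all input sets.

    pairs: list of (input_words_f, summary_words_f) set pairs, one per input.
    Assumption: inputs with no word carrying the feature are skipped.
    """
    rates = [rate_for_input(w, s) for w, s in pairs if w]
    if not rates:
        raise ValueError("no input contains a word with this feature")
    return mean(rates)


# Toy data for two input sets, for illustration only.
pairs = [
    ({"states", "nations", "embassies", "courts"}, {"states"}),  # 1/4
    ({"allies", "forces"}, {"forces"}),                          # 1/2
]
r_f = rate_for_feature(pairs)
```

With these toy sets the per-input rates are 0.25 and 0.5, so the mean is 0.375.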

Named Entities: Named entity recognition (NER) classifies words into predefined categories (e.g., date, time, organization). For each word, we include its named entity label, derived by the Stanford Named Entity Recognizer (Finkel et al., 2005). Among the eight NE features, six are significant: human summaries contain more Organization, Location, Other entities, Date and Money words, and fewer Person Names (see Table 3.3). Indeed, many of the words in the Organization (e.g., international, state, national) and Location categories (e.g., States, Saudi, river) are related to the most critical events of the input. The significantly fewer occurrences of Person Names might be because abstractors select only the most important names (e.g., Bush), while the large number of other names (e.g., David, John, Michael) that appear in the input documents are left out.

Capitalization: For each word, we include three capitalization features: whether or not it has ever been capitalized, the ratio of its capitalized occurrences, and whether or not all its occurrences are capitalized. Sentence-initial words are excluded before computing these features. Capitalized words are more likely to appear in summaries, as shown in Table 3.3.
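The three capitalization features can be sketched as below. This is an illustrative sketch under stated assumptions, not the original implementation: input is assumed to be a list of tokenized sentences, a token counts as capitalized when its first character is uppercase, and occurrences are grouped by lowercased form. The function name `capitalization_features` and the sample sentences are hypothetical.

```python
from collections import Counter


def capitalization_features(sentences):
    """Compute capitalization features per (lowercased) word.

    sentences: list of token lists. The first token of each sentence
    is skipped, since sentence-initial words are excluded.
    """
    total = Counter()   # non-initial occurrences of each word
    capped = Counter()  # non-initial occurrences that are capitalized
    for sentence in sentences:
        for token in sentence[1:]:
            key = token.lower()
            total[key] += 1
            if token[:1].isupper():
                capped[key] += 1
    return {
        word: {
            "ever_capitalized": capped[word] > 0,
            "capitalization_ratio": capped[word] / total[word],
            "all_capitalized": capped[word] == total[word],
        }
        for word in total
    }


# Hypothetical tokenized sentences, for illustration only.
sentences = [
    ["The", "United", "States", "said", "the", "states", "agreed"],
    ["Officials", "in", "China", "met", "United", "envoys"],
]
feats = capitalization_features(sentences)
```

In this example "united" is capitalized in both of its non-initial occurrences, while "states" is capitalized in one of two, giving a ratio of 0.5. Words that occur only in sentence-initial position, such as "officials", receive no features under this sketch.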
