Features, classifiers and evaluation

In document Enhancing extractive summarization with automatic post-processing (Page 94-97)

1.2 Problem

2.2.2 Approaches

2.3.3.2 Features, classifiers and evaluation

In order to identify the most relevant features that can distinguish the discourse relations that two text segments share, several experiments were performed over the previously mentioned corpora.

Word pairs has been the most common feature used. When defining word pairs, sev- eral have been the combinations experimented.

The pioneer work by Marcu and Echihabi (2002) inspired the research on finding dis- course relations. These authors describe an unsupervised approach to recognize specific discourse relations (CONTRAST,EXPLANATION-EVIDENCE,CONDITIONandELABORATION), that hold between two spans of texts. Firstly, they started by creating a training corpus by extracting adjacent sentence pairs that contain specific connectives for each discourse relation in question (for instance, they used "but" for CONTRAST relations). In addition, sentences containing a connective in the middle are extracted. Then, they split each ex- tracted sentence into two spans, one containing the words from the beginning of the sen- tence to the occurrence of the connective (e.g. the word "but") and one containing the words from the occurrence of the connective to the end of the sentence. Finally, all the pairs are labeled with the relation expressed by the connective.

In order to define their feature set, they hypothesize that lexical item pairs provide clues about the discourse relations shared by the text spans in which those items occur. So, they extract an opposite polarity word (nouns, verbs, and other cue-phrases) from each sentence of the pairs previously gathered. These pairs of words were used as fea- tures of a Bayesian classifier, that learns the discourse relation between them. Further ex-

periments shown that the training dataset provided to the classifier contained too much noise. Thus, instead of simply using the sentence pairs, they used the most representa- tive words (nouns, verbs and cue-phrases) from each pair. The results with over 75% of accuracy for all the classes suggested that discourse relation classifiers trained on most representative word pairs and millions of training data can achieve high performance.

Blair-Goldensohn et al. (2007) proposed several refinements to the word pair model focusing on two specific discourse relations: CAUSEandCONDITION. The refinements pro- posed include (1) stemming, (2) using a small fixed vocabulary size consisting in the most frequent stems, and (3) a cutoff on the minimum frequency of a feature. Blair- Goldensohn et al. (2007) reproduced in their corpus the approach by Marcu and Echi- habi (2002) in order to compare both approaches. They report that Marcu and Echihabi’s system achieved a 54% accuracy, while their own system achieved 60% accuracy.

As opposed to these experiments, a study by Pitler et al. (2008) claims that word pairs lead to data sparsity problems. A possible approach to solve this problems is to reduce the number of features, using semantic relations between words. Based on these con- clusions, these authors tested other features as polarity tags, length of verb phrases, verb tenses, context windows in the arguments, and word pairs. In these experiments, only the top classes of PDTB were used. Nevertheless, some interesting conclusions on the best features were drawn. They found that word polarity, verb classes as well as some lexical features are strong indicators of the type of discourse relation. Naïve Bayes and

Maximum Entropy were the classifiers used and Naïve Bayes produced the best results, a

94% of accuracy when classifying explicit discourse relations.

In a similar vein, Wellner et al. (2006) seek not only to identify but also to classify discourse relations, using the Discourse GraphBank. A variety of syntactic and lexical- semantic features were experimented, and most of them were obtained by using external knowledge-sources. The set of features tested included (i) words appearing in the begin- ning and end of the two discourse segments; (ii) proximity between two segments and their direction; (iii) paths in an ontology between non-function words occurring in the two segments; (iv) word-pair similarities between words; (v) grammatical dependency relations between the two segments; and (vi) event-based attributes (e.g. cue-words from one segment paired with events in the other segment). Context words (i) and proximity (ii) were the most impacting features, when considered alone. In addition, several com- bination of these features were performed. There is a strong evidence that combined

features are responsible for a better performance of the classifier. They used a Maximum

Entropy Classifier to assign labels to each pair of discourse segments connected by some

relation. They report accuracy results of over 70% when combining all the features. Lin et al. (2009) used four classes of features: contextual features, constituent parse features, dependency parse features, and lexical features. In order to define contextual features, they rely on the definition of the dependencies that hold between pairs of dis- course relations stated by Lee et al. (2006). By observing the PDTB looking for those de- pendencies, they found that from the set defined by Lee et al. (2006), two of them – em- bedded argument and shared argument – are the most common ones, so those were used as context features. Constituent parse features resort to syntactic structure, as syntactic structure within one argument may constrain the relation type and the syntactic struc- ture of the other argument. So, the production rules of the constituency trees of the two arguments are used as features. Dependency parse trees are also used, due to the fact that those encode additional information at the word level that is not explicitly present in the constituency trees. Finally, in line with Marcu and Echihabi (2002), word pairs are also used as lexical features. Using a Maximum Entropy classifier, they reported an accuracy of 40% when applying all features classes, an improvement of 14.1% over the baseline.

Louis et al. (2010) experimented three types of features: structural, semantic, and non-discourse features. Structural features were based on RST trees and include the depth of the text span and its promotion score (text spans considered more salient are awarded with promotion scores). Semantic features, obtained from PDTB, encode the number of relations (implicit, explicit and total) shared by a sentence and the distance between the arguments of those relations. Finally, non-discourse features are standard features used for summarization, such as (1) length of the sentence; (2) if the sentence is the first sentence of the paragraph or the first sentence of a document; (3) the sentence offset from document beginning, as well as paragraph beginning and end; (4) the average, sum and product probabilities of the content words appearing in sentences; and (5) the number of topic words in the sentence. The results obtained provide strong evidence that discourse structure, other than discourse semantics, is the most useful aspect in improv- ing sentence selection for summarization. Still, both these types of discourse features complement the commonly used non-discourse features for content selection. Accura- cies over 75% were obtained by the classifiers, and the single-document summarization system improved its performance when tested with different features.

In document Enhancing extractive summarization with automatic post-processing (Page 94-97)