3.3 Content selection
3.3.2 Clustering sentences by keywords
In previous experiments (Silveira and Branco, 2012), human evaluators were confronted with summaries generated automatically by GISTSUMM (cf. Section 2.1.7). When asked about the quality of the summaries, they pointed out that some sentences seemed irrelevant and that they were poorly related to each other. A procedure that effectively selects the most relevant content is very important to improve the quality of a summary.
Many studies sought to find relevant information within large collections of text using combinations of several weighting metrics, such as term frequency [(Radev et al., 2000), (Farzindar et al., 2005), (Mori, 2005)], sentence position [(Hovy and Lin, 1999), (Lin and Hovy, 2000), (Mori, 2005)], inverse sentence frequency [(Pardo et al., 2003), (Leite and Rino, 2008)], and query overlap [(Mori, 2005), (Farzindar et al., 2005)]. Other studies looked for more sophisticated approaches, using statistical models (LDA, hLDA) to find the relevant topics within a collection of texts (Arora and Ravindran, 2008).
In this work, we have built a simple algorithm using statistical metrics, in order not only to find the relevant information within the collection but also to define a structure in the unrelated data. To this end, a clustering procedure based on word frequency is executed.
SIMBA aims to produce a generic summary, that is, a summary that reflects the main idea expressed in the collection of input texts, without focusing on a specific matter. Accordingly, the keywords that represent the collection of texts must be identified. This procedure is described in the next subsection, and the clustering algorithm in the subsequent one.
Computing the keywords
Keywords are determined in four steps. The first step adds to a list of candidates the words occurring in the input texts. The words are added to this list considering their lemmas, to ensure that the set of candidates only contains unique words. The second step filters the list of candidates by selecting only the common and proper nouns, since the words in these categories provide good indicators of ideas or themes mentioned in the collection of texts. The third step orders the list of candidates by their word score (tf-idf). If two words have the same score, the word frequency is used to break the tie. In the final step, a predefined number of keywords is retrieved in order to build the final set of keywords, which is used in the keyword clustering. The number of keywords that is added to the final keywords list depends on the total number of words in the collection of texts. This number is computed using Equation 3.11.
k = \sqrt{\frac{N_s}{2}}    (3.11)
where N_s is the total number of words in the summary.
This is a relevant number since the number of keywords determines the number of clusters that will be created in this clustering step. The choice of k has been widely studied in clustering analysis, and many values for k have been proposed. The rule of thumb sets k to the number in Equation 3.11 (Mardia et al., 1979). Also, k = 50 is a very common value since, as stated in (Wives, 2004) and (Schütze and Silverstein, 1997), this value optimizes the clustering algorithm. The number in Equation 3.11 depends on the total number of words that will be included in the summary.
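The four steps above can be sketched as follows. This is a minimal illustration, not the SIMBA implementation: the tag names `NOUN` and `PROPN`, the function names, the smoothed tf-idf variant, and the rounding of k are all assumptions, and k is computed with the rule-of-thumb form k = sqrt(N_s / 2) attributed to Mardia et al. (1979).

```python
import math
from collections import Counter

NOUN_TAGS = {"NOUN", "PROPN"}  # assumed tag names for common and proper nouns


def number_of_keywords(summary_words):
    """Rule-of-thumb k = sqrt(N_s / 2); rounding and the floor of 1 are assumptions."""
    return max(1, round(math.sqrt(summary_words / 2)))


def select_keywords(docs, summary_words):
    """docs: list of documents, each a list of (lemma, pos_tag) pairs."""
    # Step 1: candidates are collected by lemma, so each word appears once.
    freq = Counter(lemma for doc in docs for lemma, _ in doc)
    # Step 2: keep only lemmas that occur as common or proper nouns.
    nouns = {lemma for doc in docs for lemma, tag in doc if tag in NOUN_TAGS}
    # Document frequency, needed for the idf part of the score.
    doc_freq = Counter(lemma for doc in docs for lemma in {l for l, _ in doc})
    n_docs = len(docs)

    def score(lemma):
        # One common tf-idf variant (assumed here): tf * (log(N / df) + 1).
        return freq[lemma] * (math.log(n_docs / doc_freq[lemma]) + 1.0)

    # Step 3: order by score; ties are broken by word frequency
    # (and then alphabetically, only to make the result deterministic).
    ranked = sorted(nouns, key=lambda w: (-score(w), -freq[w], w))
    # Step 4: retrieve the first k candidates.
    return ranked[:number_of_keywords(summary_words)]
```

For instance, under these assumptions a 200-word summary yields k = 10 keywords, and therefore 10 keyword clusters.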
After the set of keywords that represents the collection of texts has been selected, these keywords are rewarded, as are the sentences containing them. The extra score of each keyword occurring in each sentence is thus updated. This is a very important step, especially considering that, in the post-processing procedure (cf. Chapter 4), more specifically in the sentence reduction module (cf. Section 4.2.1), parts of the sentences are removed taking into account the sentence relevance score. This step thus aims to discourage the removal, during the reduction process, of parts of sentences that contain keywords.
In addition, by rewarding the keywords in the sentence, we are also rewarding the sentence itself. The assumption here is that sentences with more keywords tend to be more relevant.
Recall the working example that is being used to describe the summarization process. The keywords obtained for these texts are shown in Table C.5. The sentences obtained in the similarity clustering phase that will be clustered by keywords are shown in Table C.4 (in Annex C), ordered by their relevance score. Considering the keywords obtained, and after their extra score has been updated, the collection of sentences is ordered differently, as detailed in Table C.6. This new order reflects the importance of having a keyword in a sentence, suggesting that sentences containing more keywords are more representative of the key information expressed in the input texts.
Once the clustering by similarity has been completed and the keywords have been identified, the representative sentences of the similarity clusters are clustered again, now based on the set of keywords. A cluster is identified by a keyword (the topic), and contains a representative sentence and a collection of values (the sentences related to the keyword). Considering that the procedure must take a collection of sentences (retrieved from the previous phase) and a set of keywords that determine the topics of this collection, the algorithm adopted is K-means (MacQueen, 1967), a partitional algorithm that, based on a collection of data, creates a set of clusters whose members are close to each other. The elementary K-means algorithm comprises the following steps:
1. Choose the number of clusters, k;
2. Generate the k cluster centers randomly;
3. Assign each point to the nearest cluster center;
4. Recompute the cluster centers;
5. Repeat the two previous steps until a convergence criterion is met.
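As an illustration of these five steps, the following is a minimal sketch of elementary K-means on numeric points. The function names, the choice of drawing the initial centers from the data points, the squared Euclidean distance, and the stopping rule (centers no longer change, or `max_iter` is reached) are assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k random centers (here, k distinct data points).
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 5: stop when the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With the points (0, 0), (0, 1), (10, 10) and (10, 11) and k = 2, the two nearby pairs end up in separate clusters.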
Our keyword clustering algorithm is an adapted version of K-means. It performs clustering by keywords following the steps described below:
1. Choose the number of clusters, k, defined by the number of keywords;
2. Create the initial empty clusters, represented by each keyword;
3. Consider each sentence:
a) Compute the occurrences of each keyword in the sentence;
b) Assign the sentence to the cluster whose keyword occurs more often in it (if there is a tie, the sentence is added to the keyword with the highest score);
4. Recompute the cluster representative sentence. If the current sentence has more occurrences of the keyword defining its cluster than the previous representative sentence had, the newly added sentence becomes the cluster representative sentence; otherwise, the cluster representative remains the same;
5. If the sentence does not contain any keywords, it is added to a specific set of sentences which do not have any keyword ("no-keyword" set).
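The steps above can be sketched as a single pass over the sentences. This is a simplified illustration under stated assumptions: sentences are represented as lists of lemmas, `keyword_scores` is a hypothetical mapping from each keyword to its score, and the extra-score updates are left out.

```python
def cluster_by_keywords(sentences, keyword_scores):
    """sentences: list of lemma lists; keyword_scores: dict keyword -> score."""
    # Steps 1-2: one initially empty cluster per keyword.
    clusters = {kw: {"representative": None, "sentences": []} for kw in keyword_scores}
    no_keyword = []
    for sent in sentences:  # Step 3
        # Step 3a: occurrences of each keyword in the sentence.
        counts = {kw: sent.count(kw) for kw in keyword_scores}
        # Step 5: sentences without any keyword go to the "no-keyword" set.
        if not any(counts.values()):
            no_keyword.append(sent)
            continue
        # Step 3b: most frequent keyword; ties go to the highest-scored keyword.
        best = max(keyword_scores, key=lambda kw: (counts[kw], keyword_scores[kw]))
        cluster = clusters[best]
        cluster["sentences"].append(sent)
        # Step 4: the new sentence becomes the representative only if it has
        # strictly more occurrences of the cluster keyword than the current one.
        rep = cluster["representative"]
        if rep is None or counts[best] > rep.count(best):
            cluster["representative"] = sent
    return clusters, no_keyword
```

Unlike elementary K-means, this sketch needs no iteration until convergence, because the cluster "centers" (the keywords) are fixed in advance. A keyword that never wins the comparison in step 3b simply keeps an empty cluster.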
As in the similarity algorithm, when the representative sentence changes, the extra scores of both the previous representative sentence and of the new one are updated. The new representative sentence is rewarded, by adding the predefined value to its extra score, and the previous representative sentence is penalized, by removing the predefined value from its extra score.
In addition, another value is added to the extra score: the number of keywords in the sentence. The idea behind this is that the more keywords a sentence has, the more relevant it should be considered. So, the extra score of each sentence of each cluster is updated by adding to it the number of keywords occurring in the sentence.
In the same way, a penalty operation is also performed on the sentence extra score. Recall from Section 2.1 that Schiffman et al. (2002) proposed a metric that penalizes sentences with fewer than fifteen words. We also apply such a penalty to sentences with fewer than fifteen words, while sentences with more than fifteen words are rewarded. Sequences with fewer than fifteen words are typically considered as conveying less information. In some cases, they are not even full sentences: titles, subtitles, and headers are examples. In addition, sentences of this kind tend to have high scores, since they typically contain very frequent words. Finally, these sentences normally include less information than other sentences containing the same words.
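The three extra-score adjustments described above can be sketched as follows. The text does not give the predefined values, so `REPRESENTATIVE_VALUE` and `LENGTH_VALUE` are hypothetical placeholders; counting every keyword occurrence (rather than distinct keywords) and leaving sentences of exactly fifteen words unchanged are also assumptions.

```python
MIN_WORDS = 15               # length threshold from Schiffman et al. (2002)
REPRESENTATIVE_VALUE = 1.0   # hypothetical "predefined value"
LENGTH_VALUE = 1.0           # hypothetical reward/penalty for sentence length


def keyword_and_length_bonus(tokens, keywords):
    """Amount added to a sentence's extra score, given its tokens (lemmas)."""
    # Reward: number of keyword occurrences in the sentence.
    bonus = float(sum(1 for t in tokens if t in keywords))
    # Penalize short sentences and reward long ones.
    if len(tokens) < MIN_WORDS:
        bonus -= LENGTH_VALUE
    elif len(tokens) > MIN_WORDS:
        bonus += LENGTH_VALUE
    return bonus


def swap_representative(extra_scores, old_rep, new_rep):
    """Reward the new representative and penalize the previous one in place.

    extra_scores maps a sentence identifier to its extra score.
    """
    extra_scores[new_rep] = extra_scores.get(new_rep, 0.0) + REPRESENTATIVE_VALUE
    if old_rep is not None:
        extra_scores[old_rep] = extra_scores.get(old_rep, 0.0) - REPRESENTATIVE_VALUE
```

Under these placeholder values, a 16-word sentence with two keyword occurrences gains 3.0, whereas a two-word title containing one keyword gains nothing, which is what pushes short sequences towards the end of the ranking.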
The keyword clustering process is depicted in Figure 3.3.
Figure 3.3: Keyword clustering.
Once the algorithm terminates, two types of sentences have been identified: the most significant sentences and the ones to be discarded. The sentences enclosed in a keyword cluster are considered to be the most significant ones of the whole collection. The ones added to the "no-keyword" cluster are ignored.
Recalling our working example, Table C.6 shows the sentences before applying the keyword clustering procedure. Table C.7, in turn, details the clusters obtained after this clustering phase. These are the sentences that will be taken into account in the next steps of the summarization procedure.
The sentences in the cluster NO-KEYWORD are less important to the key information conveyed by the texts to be summarized, as they do not contain any keyword. Hence, they will not be used in the next stages of the procedure.
Note that there are clusters without sentences (BUKAVU and CONGO), which means that these keywords occur less often than the other keywords in the sentences to be clustered, or that their score is lower than the score of the other keywords.
Also, the first sentence in each cluster, its representative, has been rewarded through its extra score. Thus, the sentences that have more keywords – for instance, the sentence in the cluster (PORTA-VOZ) – have higher scores.
Finally, the sentences with fewer than fifteen words appear at the end of the list.
The final list of sentences, the output of this clustering procedure, is illustrated in Table C.8. Note that the sentences in the NO-KEYWORD cluster are not included in this list, as they will not be included in the input for the next steps of the summarization process.
There are several applications of this clustering representation. The sentences grouped in each keyword cluster are related by the key information they commonly address. This relation can thus help to define a topic, and each topic can be represented in the final summary as a paragraph, for instance. Also, the cluster score suggests the importance of its group of sentences within the summary. So, the position of each paragraph in the summary can be defined by its cluster score.
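One possible reading of this application is sketched below: each non-empty keyword cluster becomes a paragraph, and paragraphs are ordered by decreasing cluster score. The function name and the `cluster_scores` mapping are hypothetical, and how a cluster score is computed is not specified here, so it is simply taken as given.

```python
def order_paragraphs(clusters, cluster_scores):
    """clusters: dict keyword -> list of sentence strings;
    cluster_scores: dict keyword -> score (assumed to be given)."""
    # Empty clusters produce no paragraph.
    non_empty = [kw for kw in clusters if clusters[kw]]
    # Higher-scored clusters come first in the summary.
    ranked = sorted(non_empty, key=lambda kw: cluster_scores[kw], reverse=True)
    # Each cluster becomes one paragraph made of its sentences.
    return [" ".join(clusters[kw]) for kw in ranked]
```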
Concluding, this clustering procedure helps not only to select content but also to define the final organization of the summary.