
4.3 Evaluation

4.3.2 Results

In the update task of TAC 2008, 57 peer summaries were manually evaluated with the Pyramid method, and 71 were evaluated using ROUGE and the Basic Elements evaluation package [13].

Table 4.1 shows the average Recall, Precision and F-measure for the ROUGE-1, ROUGE-2, and ROUGE-SU4 evaluations on the two submitted runs. In both runs, the system scored higher in Recall than in Precision, which suggests that the system is better at finding relevant content than at removing irrelevant content. The run in which more weight was given to the topic statement also generally achieved better ROUGE scores than the run in which more weight was given to the headlines.

         Run 1                           Run 34
ROUGE    Avg R     Avg P     Avg F       Avg R     Avg P     Avg F
1        0.34463   0.33866   0.34148     0.34022   0.33372   0.33680
2        0.08091   0.07933   0.08008     0.08080   0.07912   0.07991
SU4      0.11852   0.11634   0.11737     0.11706   0.11471   0.11583

Table 4.1: The ROUGE Scores obtained by my system in the two runs I submitted in TAC08
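To make the relation between the Recall, Precision and F-measure columns concrete, the following is a minimal sketch of a ROUGE-N style computation against a single reference summary. It is a simplification and not the official ROUGE package used by the TAC08 organizers: it assumes whitespace tokenization and omits stemming, stop-word handling, multiple references and jackknifing. The function names `ngrams` and `rouge_n` and the two toy sentences are my own illustrations.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N against one reference summary.

    Returns (recall, precision, f_measure) based on clipped n-gram overlap.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = recall + precision
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f_measure


# Toy example (not TAC08 data): the candidate covers most of the reference
# but also carries extra words, so recall exceeds precision.
reference = "the cat sat on the mat"
candidate = "the cat lay on the big red mat"
r, p, f = rouge_n(candidate, reference, n=1)
print(f"R={r:.3f} P={p:.3f} F={f:.3f}")
```

In this toy case five of the six reference unigrams are covered, while three of the eight candidate unigrams are extra, which is the same pattern (Recall above Precision) observed for both runs. Note that the F columns in Table 4.1 are averages over topics, so they need not equal the F-measure computed from the averaged Recall and Precision.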

Table 4.2 shows the average automated evaluation scores obtained by my submitted runs (with their ranks) in comparison with the 71 peer summaries submitted by the rest of the participants. The scores I obtained were above average for both runs.

Evaluation    Run (1)            Run (34)           Best      Worst
ROUGE2-R      0.08091 (25/71)    0.08080 (26/71)    0.10382   0.03343
ROUGESU4-R    0.11858 (23/71)    0.11713 (29/71)    0.13646   0.06517
BE            0.04964 (24/71)    0.04903 (28/71)    0.06462   0.01337

Table 4.2: The automated scores (and ranks) obtained by my system compared with the rest in TAC08

The evaluation in TAC2008 also included human judgments of linguistic quality. Table 4.3 shows the results and the rank of my system relative to the other systems in the manual evaluation. The metrics shown in the table are responsiveness, which is how well the summary addresses the user's information need, and linguistic quality. The linguistic quality score is guided by consideration of the following factors:

1. Grammaticality
2. Non-redundancy
3. Referential clarity
4. Focus
5. Structure and Coherence

Scores range from 1 (very poor) to 5 (very good). The results obtained for the submitted runs were above average, as shown in the table. This was expected since the summarizer is extractive and no modifications were made to the sentences. Information redundancy, diversity and coherence are the main factors affecting linguistic quality and overall responsiveness. An attempt is made to address these factors in section 4.4.

                          Run (1)          Run (34)         Best    Worst
Avg Linguistic Quality    2.719 (12/58)    2.76 (11/58)     3.073   1.312
Overall Responsiveness    2.427 (15/58)    2.385 (18/58)    2.667   1.198

Table 4.3: Manual Evaluation Results in TAC08

It is interesting to test the impact of using other sentence similarity measures in the summarization system and to evaluate their performance. Ideally, all of the evaluations performed by the TAC08 organizers would be repeated with variations of the system that use different sentence similarity measures. However, the manual evaluations they performed are labour-intensive and beyond my means.

I therefore rely on the automated ROUGE evaluation, which is also used by the TAC08 organizers and provides a good point of reference against the results I obtained from my participation.

After obtaining the official evaluation results from the TAC08 organizers for the submitted runs, I used the same dataset to evaluate other sentence similarity measures using ROUGE with the same parameters as were used with Run 1. Namely, I implemented and evaluated the measures SemSimMeasure, arTonv_SemSimMeasure, Syn_SimMeasure, EditDist_SimMeasure, and EditDistEx_SimMeasure, which were all described in section 4.2.2.2. Because the measures involving the replacement of terms with their antonyms reflect the dissimilarity and diversity between sentences, it was decided to implement these measures for enhancing the overall diversity of sentences in the summary and reducing redundancy, as described and evaluated in section 4.4. The method I used for participating in TAC08 was SemSimMeasure, and it is chosen as the baseline during this evaluation.

The results I obtained are shown in Table 4.4. The measure arTonv_SemSimMeasure gave the best performance for all ROUGE metrics, with the biggest increase over the baseline being for ROUGE1. The next best performing measure in ROUGE1 was EditDistEx_SimMeasure. This measure boosted the performance of the baseline by 2.9% and yielded better unigram matches between the generated summaries and the reference summaries, as can be seen from the comparisons shown in Figure 4.17. As for ROUGE2 and ROUGESU4, the next best performing measure was the baseline, as demonstrated in Figure 4.18 and Figure 4.19.

Evaluation               ROUGE1    ROUGE2    ROUGESU4
SemSimMeasure            0.34463   0.08091   0.11852
arTonv_SemSimMeasure     0.35801   0.08125   0.11979
Syn_SimMeasure           0.33823   0.07984   0.11487
EditDist_SimMeasure      0.34249   0.08011   0.11604
EditDistEx_SimMeasure    0.35357   0.08048   0.11828

Table 4.4: ROUGE evaluation results for the different sentence similarity measures
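The comparison against the baseline can be reproduced directly from the values in Table 4.4. The short sketch below ranks the measures per metric and computes the relative gain over SemSimMeasure. The scores are copied from the table; the helper names `relative_gain` and `ranking` are illustrative only and are not part of the summarization system.

```python
# ROUGE scores copied from Table 4.4.
scores = {
    "SemSimMeasure":         {"ROUGE1": 0.34463, "ROUGE2": 0.08091, "ROUGESU4": 0.11852},
    "arTonv_SemSimMeasure":  {"ROUGE1": 0.35801, "ROUGE2": 0.08125, "ROUGESU4": 0.11979},
    "Syn_SimMeasure":        {"ROUGE1": 0.33823, "ROUGE2": 0.07984, "ROUGESU4": 0.11487},
    "EditDist_SimMeasure":   {"ROUGE1": 0.34249, "ROUGE2": 0.08011, "ROUGESU4": 0.11604},
    "EditDistEx_SimMeasure": {"ROUGE1": 0.35357, "ROUGE2": 0.08048, "ROUGESU4": 0.11828},
}
BASELINE = "SemSimMeasure"


def relative_gain(measure, metric):
    """Percentage change of `measure` over the baseline for one metric."""
    base = scores[BASELINE][metric]
    return 100.0 * (scores[measure][metric] - base) / base


def ranking(metric):
    """Measure names sorted from best to worst score for one metric."""
    return sorted(scores, key=lambda m: scores[m][metric], reverse=True)


for metric in ("ROUGE1", "ROUGE2", "ROUGESU4"):
    print(metric)
    for name in ranking(metric):
        print(f"  {name:<22} {scores[name][metric]:.5f} "
              f"({relative_gain(name, metric):+.2f}% vs baseline)")
```

Run as is, this reports a relative ROUGE1 gain of about 3.9% for arTonv_SemSimMeasure and about 2.6% for EditDistEx_SimMeasure, the latter slightly below the 2.9% quoted above. It also confirms that the baseline is second best for ROUGE2 and ROUGESU4.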

[Column chart: ROUGE1-R score (axis range 0.325 to 0.36) for SemSimMeasure, arTonv_SemSimMeasure, Syn_SimMeasure, EditDist_SimMeasure and EditDistEx_SimMeasure]

Figure 4.17: ROUGE1 scores showing the performance for the different sentence similarity measures in a column chart

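[Column chart: ROUGE2-R score (axis range 0.079 to 0.0815) for SemSimMeasure, arTonv_SemSimMeasure, Syn_SimMeasure, EditDist_SimMeasure and EditDistEx_SimMeasure]

Figure 4.18: ROUGE2 scores showing the performance for the different sentence similarity measures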

[Column chart: ROUGESU4-R score (axis range 0.112 to 0.121) for SemSimMeasure, arTonv_SemSimMeasure, Syn_SimMeasure, EditDist_SimMeasure and EditDistEx_SimMeasure]

Figure 4.19: ROUGESU4 scores showing the performance for the different sentence similarity measures

The results shown in the column charts support the idea that introducing different semantic-based measures can lead to performance improvement. For instance, the measure Syn_SimMeasure successfully captures the similarity between the words “resolution” and “settlement” in the two sentences: “The defendant reached a settlement with the plaintiff by paying 20 million dollars” and “The defendant reached a resolution with the plaintiff by paying 20 million dollars”. On the other hand, replacing one of the two words with “agreement” would cause a failure in capturing the similarity between the words. Instead of performing simple word matching, a more refined measure is used with SemSimMeasure and arTonv_SemSimMeasure, as they both utilize JCn’s metric when computing the similarity between words. The results of these two measures in the column charts highlight this observation.
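The behaviour described for the “settlement”/“resolution” example can be illustrated with a toy synonym-aware word matcher. This is only a sketch under strong assumptions and is not the definition of Syn_SimMeasure from section 4.2.2.2: the synonym groups in `TOY_SYNSETS` are hand-written stand-ins for a WordNet lookup, and `are_synonyms` and `syn_overlap` are hypothetical helper names.

```python
# Hypothetical synonym groups standing in for a WordNet synset lookup.
TOY_SYNSETS = [
    {"settlement", "resolution"},
    {"agreement", "accord"},
]


def are_synonyms(w1, w2):
    """True if the words are identical or share a toy synonym group."""
    if w1 == w2:
        return True
    return any(w1 in group and w2 in group for group in TOY_SYNSETS)


def syn_overlap(s1, s2):
    """Fraction of words in s1 with an exact or synonym match in s2."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1:
        return 0.0
    matched = sum(1 for w in t1 if any(are_synonyms(w, v) for v in t2))
    return matched / len(t1)


a = "The defendant reached a settlement with the plaintiff by paying 20 million dollars"
b = "The defendant reached a resolution with the plaintiff by paying 20 million dollars"
c = "The defendant reached a agreement with the plaintiff by paying 20 million dollars"

print(syn_overlap(a, b))  # every word matched, "settlement" via its synonym group
print(syn_overlap(a, c))  # "settlement" is unmatched: no shared group with "agreement"
```

A binary synonym test of this kind gives no credit to closely related words that fall outside the same group, whereas a graded word similarity such as JCn’s metric can still assign them a non-zero score.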

It was noted in the arTonv_SemSimMeasure that some adverbs are transformed to verbs, while their corresponding words in other sentences are nouns. Since the similarity computation between words is performed only on the same POS, due to the WordNet limitations mentioned in the previous sections, the benefits gained from the adjectives/adverbs transformation process are dependent on the POS of the words processed in different sentences. When computing the semantic distance between two sentences, it would be best to devise a way of comparing every term from the first sentence with every term in the second regardless of their POS. In part, this is reflected in the performance of the EditDistEx_SimMeasure in ROUGE1, which obtained better results than SemSimMeasure even though it implements simple word matching while the latter utilizes JCn’s metric but ignores adjectives/adverbs. In the next chapter, I propose a Wikipedia-based semantic relatedness measure that takes into account every term from the two compared sentences regardless of their POS.

When examining the summaries generated by the best performing summarizer, it can be noted that the grammar of the sentences is acceptable. This is expected, as the summarizer is extractive. However, some of the generated summaries contain redundant sentences. In an attempt to address this issue, I opted to include a redundancy and diversity checking layer in the post-processing stage of the summarizer. The next section provides more details about the theory behind this layer, its implementation and the evaluation results.