Application-based evaluation - Term extraction evaluation

3.7 Experimental evaluation

3.7.2 Term extraction evaluation

3.7.2.3 Application-based evaluation

An important reason for developing term extraction techniques is their contribution to specific applications, for example keyphrase extraction or index term assignment. Thus, a reasonable way to evaluate term extraction techniques is to measure their performance in a given application. This section complements the standard evaluation approach presented in the previous section with an application-based evaluation, addressing research question RQ 1.4 on the portability of the expertise topic extraction approach across domains.

In this set of experiments we again considered the three domain-specific datasets described in Section 3.6.2.2, but we evaluated the extracted terms at the document level. Instead of evaluating a single list of ranked terms for the whole corpus, we evaluated a list of ranked terms for each document separately and then aggregated the results. This is in line with the usual evaluation of tasks such as keyphrase extraction or index term assignment, for which the datasets were initially constructed. For this purpose, each term extraction approach has to be adapted to the task at hand. To keep the results comparable, we adapted all the considered term extraction approaches in the same way.

Typically, a termhood measure is combined with various measures of document relevance to perform keyphrase extraction or index term assignment, because candidate terms have to be assigned at the document level. In our experiments we used the standard information retrieval measure TF-IDF to measure the relevance of a term to a document. Candidate terms are selected with the same method proposed in Section 3.4.1 and are then ranked using different term extraction methods, generically called termhood in the following equation. To assign terms to documents we combined the termhood score, which is measured at the corpus level, with the TF-IDF relevance measure as follows:

3. DOMAIN ADAPTIVE EXPERTISE TOPIC EXTRACTION THROUGH DOMAIN MODELLING

where t is the scoring function used to assign candidate terms to documents, τ is the candidate term, δ is the given document, termhood is the scoring function used for term extraction (e.g., NC-value, TermExtractor) and tfidf measures the relevance of a term τ for a document δ. In this way, the top ranked candidate terms using the scoring function t are assigned to the document δ.
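The document-level scoring described above can be illustrated with a minimal sketch. The exact form of the combination is not reproduced here, so the product of the two scores used below is an assumption, as is the TF-IDF variant (raw term frequency times log inverse document frequency). The names `tfidf`, `assign_terms` and `top_n` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf(term, doc_terms, corpus):
    """TF-IDF of `term` in one document.

    `doc_terms` is the list of candidate terms found in the document and
    `corpus` is a list of such lists. ASSUMPTION: raw term frequency times
    log(N / df); other TF-IDF variants would work equally well here.
    """
    tf = Counter(doc_terms)[term]
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def assign_terms(doc_terms, corpus, termhood, top_n=10):
    """Rank the candidate terms of one document by t(tau, delta).

    `termhood` maps each candidate term to its corpus-level score
    (e.g. produced by NC-value or TermExtractor).
    ASSUMPTION: t = termhood(tau) * tfidf(tau, delta); the product is
    illustrative, not necessarily the combination used in the experiments.
    """
    scores = {
        term: termhood.get(term, 0.0) * tfidf(term, doc_terms, corpus)
        for term in set(doc_terms)
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_n]
```

With a product, a term that is strong at the corpus level but occurs in every document receives a score of zero for each individual document, which is one reason a document relevance measure is needed alongside termhood.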

In this set of experiments the best results are obtained by using domain coherence as a post-processing step, a method called PostRankDC, which was described in Section 3.4.3. The top 30 candidate terms, selected using our Basic approach described in Equation 3.4, are re-ranked based on their domain coherence with the domain model using the scoring function t.
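The post-ranking step can be sketched roughly as follows. This is only an illustration of the "select a pool, then re-rank it" pattern: the domain coherence measure itself is defined in Section 3.4.3 and is not reimplemented here, so `rerank_score` is a hypothetical callable standing in for it, and `post_rank` and `pool_size` are names introduced for this example.

```python
def post_rank(basic_scores, rerank_score, pool_size=30):
    """Re-rank the top `pool_size` candidates of a first-pass ranking.

    basic_scores: dict mapping candidate term -> first-pass (Basic) score.
    rerank_score: callable term -> float, a HYPOTHETICAL stand-in for the
                  domain-coherence-based score used in the second pass.
    """
    # First pass: keep only the best candidates according to the Basic score.
    pool = sorted(basic_scores, key=lambda t: (-basic_scores[t], t))[:pool_size]
    # Second pass: reorder the pool by the re-ranking score.
    return sorted(pool, key=lambda t: (-rerank_score(t), t))
```

Note that a candidate outside the initial pool can never be promoted by the second pass, so the quality of the first-pass ranking bounds what re-ranking can achieve.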

[Plot: F-score (%) on the y-axis against number of candidates (5 to 30) on the x-axis, for PostRankDC, Basic, NC-value and TermExtractor]

Figure 3.7: Keyphrase extraction evaluation on the Krapivin corpus

The application-based evaluation proposed in this work allows us to evaluate both precision and recall, so the F-score can be employed as an evaluation metric. The results for keyphrase extraction in Computer Science are presented in Figure 3.7, the results for index term extraction in the Agriculture domain are shown in Figure 3.8, and the results for term extraction at the document level in the Biomedical domain appear in Figure 3.9. On the x-axis we display different cut-off points of the ranked output list, and on the y-axis we plot the F-score as a percentage.
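As an illustration of how such curves can be computed, the sketch below evaluates each document's ranked list at several cut-off points and averages the per-document F-scores. Two details are assumptions rather than statements about the experimental setup: terms are matched against the gold annotations by exact string equality, and the aggregation is a plain mean over documents. The function names are hypothetical.

```python
def f_score_at_k(ranked, gold, k):
    """F-score of the top-k ranked terms against a set of gold terms."""
    top = ranked[:k]
    correct = len(set(top) & set(gold))
    if correct == 0:
        return 0.0
    precision = correct / len(top)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_f_score(ranked_by_doc, gold_by_doc, cutoffs=(5, 10, 15, 20, 25, 30)):
    """Mean per-document F-score, in percent, at each cut-off point.

    ASSUMPTION: results are aggregated by averaging over documents.
    """
    results = {}
    for k in cutoffs:
        scores = [
            f_score_at_k(ranked_by_doc[doc], gold_by_doc[doc], k)
            for doc in gold_by_doc
        ]
        results[k] = 100 * sum(scores) / len(scores)
    return results
```

Each key of the returned dictionary corresponds to one point on the x-axis of the figures, and each value to the F-score plotted on the y-axis.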

[Plot: F-score (%) on the y-axis against number of candidates (5 to 30) on the x-axis, for Basic, PostRankDC, NC-value and TermExtractor]

Figure 3.8: Index term evaluation on the FAO corpus

[Plot: F-score (%) on the y-axis against number of candidates (5 to 30) on the x-axis, for Basic, PostRankDC, NC-value and TermExtractor]

Figure 3.9: Term extraction at the document level on the GENIA corpus

All three methods yield higher performance on the GENIA corpus, because a considerably higher proportion of the noun phrases in the text are annotated as correct terms compared to the other two datasets. Although the GENIA documents are on average much shorter than the other documents, they contain more than four times as many correct terms as documents from the other domains. A random baseline would also achieve higher results on this dataset. The results on the Agriculture corpus are again the lowest, because a larger number of candidate terms has to be analysed compared to the other two domains. The contrastive measure employed by TermExtractor is not suitable for extracting generic terms, such as keyphrases or index terms, as can be seen in Figure 3.7 and Figure 3.8, but it outperforms the other methods when extracting more specific biomedical terms.

The Basic method outperforms the NC-value approach on the Krapivin corpus and on the GENIA corpus, but not on the FAO corpus. This leads to the conclusion that embedded terms behave differently across domains. We can observe that the domain coherence approach (PostRankDC) considerably improves over our Basic approach on all three domains. After the post-ranking step, the improvement is statistically significant compared to NC-value, the best performing state-of-the-art method on the Computer Science dataset. In this domain the improvement over NC-value is 99% when reporting the top 10 ranked terms per document. NC-value outperforms TermExtractor in Computer Science and Agriculture, but TermExtractor performs better in Biomedicine, where the output terms should be more specific. These results confirm our assumption that, although both NC-value and TermExtractor make use of domain-independent features for ranking, their performance varies across domains and applications. At the same time, combining our domain coherence approach with our Basic method in a post-ranking step (PostRankDC) displays a more stable behaviour, achieving the best performance on the Computer Science domain (Krapivin) and results similar to those of the best method in Biomedicine (GENIA) and Agriculture (FAO).

In document Domain adaptive extraction of topical hierarchies for Expertise Mining (Page 76-79)