Chapter 4 Analysing Entity-centric Topic Representations of Multi-
4.5 Language-specific Retrieval of News Articles for Entity-centric Queries
4.5.2 Precision-Recall Analysis
I used a state-of-the-art information retrieval model, BM25 [234], as a baseline. The baseline model retrieved documents based on the number of matches of terms from the original query.
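To make the behaviour of the baseline concrete, the following is a minimal sketch of Okapi BM25 scoring over tokenised documents. It is not the implementation used in the experiments: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the particular IDF variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query terms with Okapi BM25.

    k1 and b are common default values, assumed here for illustration only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term that does not occur contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy documents, already lower-cased and tokenised.
docs = [
    ["angela", "merkel", "meets", "david", "cameron"],
    ["germany", "to", "send", "troops"],
    ["angela", "gossow", "announces", "tour"],
]
scores = bm25_scores(["angela", "merkel"], docs)
```

In this toy example the second document receives a score of zero because it contains neither query term, and the third document receives a positive score solely because it shares the term "angela" with the query.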
Figure 4.2: Precision-Recall curves of the baseline model and the context-based information retrieval model using different topic representations for “Angela Merkel”.
In Figure 4.2, I present the interpolated precision achieved at different recall levels, for the query entity “Angela Merkel”, by the baseline model and by the context-based information retrieval model using topic representations derived from various contexts. As can be observed in Figure 4.2, although the traditional ranking algorithm based on the BM25 scores of the news articles given a query entity maintains a relatively high precision, the highest recall it can achieve is about 0.45. This is because many news articles, such as http://www.thelocal.de/20151202/germany-to-send-1200-troops-to-aid-isis-fight, http://www.spiegel.de/international/europe/paris-attacks-pose-challenge-to-european-security-a-1063435.html and http://www.thelocal.de/20151029/germany-maintains-record-low-unemployment, report events that are either directly driven by “Angela Merkel” or would directly affect her. Although these articles do not mention the query entity by name, they provide indispensable insights into the query entity’s current focus or past achievements, and users issuing this query would consider them relevant, especially when the number of articles mentioning the query entity is small. The context-based information retrieval model using the entity’s topic representations achieved higher recall for this query, regardless of whether the topic representation was derived from Article-based or Graph-based contexts, and regardless of whether those contexts were extracted from English Wikipedia or German Wikipedia.
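The interpolated precision plotted in such curves can be computed as sketched below. The sketch assumes the standard definition, in which the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level, and it assumes eleven evenly spaced recall levels; the function name `interpolated_precision` and the toy ranking are hypothetical.

```python
def interpolated_precision(ranked_relevance, total_relevant, recall_levels=None):
    """Return the interpolated precision at each recall level.

    ranked_relevance: booleans, one per retrieved document, in rank order.
    total_relevant: number of relevant documents in the whole collection.
    """
    if recall_levels is None:
        # Eleven standard recall levels 0.0, 0.1, ..., 1.0 (an assumption).
        recall_levels = [i / 10 for i in range(11)]
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    curve = []
    for level in recall_levels:
        candidates = [p for r, p in points if r >= level]
        # Recall levels the ranking never reaches get precision 0.
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Hypothetical ranking that retrieves only 2 of the 4 relevant documents.
curve = interpolated_precision([True, False, True, False], total_relevant=4)
```

A ranking that never retrieves some of the relevant documents, as in this toy example, has zero interpolated precision beyond the highest recall it reaches, which is how a recall ceiling appears on a Precision-Recall curve.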
I also observe that the context-based information retrieval model using the topic representation derived from the German (DE) Graph-based context achieves the best overall performance. At most recall levels, it achieves higher precision than the models using topic representations derived from other contexts. This is because this topic representation provides a more comprehensive overview of the topical aspects related to “Angela Merkel”.
Moreover, the context-based information retrieval model outperforms the baseline model with respect to precision at all recall levels for this query entity, when utilising the topic representation derived from the German (DE) Graph-based context. This is because “Angela” is quite a common term. By incorporating the background information from Wikipedia, the model can perform disambiguation implicitly, by differentiating the Chancellor of Germany from other celebrities, such as Angela Gossow (German singer) and Angela Maurer (German long-distance swimmer), which helps to increase the precision of retrieved results.
The baseline model ranks the news articles mostly based on the occurrences of terms in the query entity. In contrast, my model considers all the topical aspects mentioned in the news articles about the named entity. The ranks are generated based on the similarity values between the articles’ entity-specific representations and the named entity’s language-specific topic representation, such that news articles that provide a more comprehensive coverage of the entity’s language-specific topical aspects are promoted to higher ranks.
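The ranking step described above can be sketched as follows. The text does not fix the similarity measure at this point, so cosine similarity over sparse weight vectors is an assumption here, as are the function names `cosine` and `rank_articles`, the aspect labels, and all weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_articles(entity_topics, article_reprs):
    """Order article ids by similarity to the entity's topic representation."""
    scored = [(cosine(rep, entity_topics), aid) for aid, rep in article_reprs.items()]
    return [aid for _, aid in sorted(scored, key=lambda x: x[0], reverse=True)]


# Hypothetical topic representation of a query entity (aspect -> weight).
entity_topics = {"aspect_a": 0.6, "aspect_b": 0.5, "aspect_c": 0.4}
# Hypothetical entity-specific representations of three news articles.
article_reprs = {
    "article_1": {"aspect_a": 0.7, "aspect_b": 0.3},
    "article_2": {"aspect_c": 0.2, "aspect_x": 0.9},
    "article_3": {"aspect_x": 1.0},
}
ranking = rank_articles(entity_topics, article_reprs)
```

Under these assumptions, an article covering several of the entity's heavily weighted aspects is ranked above one that shares only a single, lightly weighted aspect, and an article sharing no aspects is ranked last, whether or not any of them mention the entity by name.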
Figure 4.3: Precision-Recall curves of the baseline model and the context-based information retrieval model using different topic representations for “David Cameron”.
The effectiveness of the context-based information retrieval model can also be observed for the query “David Cameron”, presented in Figure 4.3. As shown in Figure 4.3, the proposed model achieves a much higher recall than the baseline model for this query as well, while maintaining high precision. As expected, the topic representation derived from the English (EN) Graph-based context, which is local for this query, helps the context-based information retrieval model to achieve better overall performance than the topic representations derived from other contexts.
I did not observe significant differences among the topic representations derived from the remaining contexts for the query “David Cameron”. One reason may be the number of topical aspects covered in these contexts. The topic representation derived from the English Graph-based context of the entity “Angela Merkel” contains 7,317 non-zero weighted topical aspects, and the one derived from the German Graph-based context contains 6,614. Both contain many more non-zero weighted topical aspects than the ones derived from the English and German Article-based contexts, which contain 562 and 1,069, respectively. As a result, the topic representations derived from the German and English Graph-based contexts for “Angela Merkel” are much more “powerful” than the ones derived from the German and English Article-based contexts. For “David Cameron”, the most “powerful” topic representation is derived from the English Graph-based context, which contains 10,365 non-zero weighted topical aspects, whereas the numbers for the rest are much smaller and comparable. The topic representation derived from the German Graph-based context for the entity “David Cameron” has only 1,627 non-zero weighted topical aspects; the ones derived from his English and German Article-based contexts have 1,143 and 291, respectively. Although all of these topic representations still help to greatly improve recall while maintaining relatively high precision, their effectiveness is somewhat limited by their lower comprehensiveness.