Chapter 4 Analysing Entity-centric Topic Representations of Multi-
The comprehensiveness of the contexts created by the baseline Article-based approach and the proposed Graph-based approach will be examined and compared in Section 4.4 and Section 4.5.
4.3 News Retrieval Using an Entity’s Language-specific Topic Representations
In this section, I present the retrieval scenario of searching for relevant articles over news collections in a common language using an entity name as the query. Then I describe my approach that addresses the entity-centric search, using the entity’s language-specific topic representations, as presented in Section 4.2.
4.3.1 Entity-centric News Retrieval
When users are interested in the current news about a named entity, they can simply provide the entity name as the query to a retrieval application. This entity-centric retrieval scenario is also referred to as querying by entities in [316].
On a daily basis, only a limited number of news articles that explicitly mention this named entity are published. However, one named entity is typically related to various other topical aspects, as I observed during the context creation using the Wikipedia link structure. This relationship with topical aspects is captured by the entity-centric topic representations, which I described in Section 4.2. The motivation is that, by using these topic representations, I could significantly increase the recall of the retrieved documents for entity-centric queries in a news retrieval application while keeping precision high. Moreover, some documents mention an entity only marginally, without providing any comprehensive information about it. In these cases, the entity’s topic representations can help the retrieval application to focus on more relevant documents.
When only an entity name is used as the query, traditional information retrieval systems based on keyword matching can return only those news articles in which the named entity occurs, which can barely satisfy the users’ need for comprehensive knowledge about the named entity. For example, when using “Angela Merkel” as the query, it would be beneficial to return news articles like http://www.thelocal.de/20151202/germany-fear-terrorism-if-army-fights-in-syria, which describe the situation in Germany. Although the content of this article contains neither the term “Angela” nor the term “Merkel”, it reports on an event that has a potentially large impact on her political decisions. In order to tackle this problem, my context-based information retrieval model incorporates the entity’s contextual topic representations from Wikipedia into the search and ranking process. As a result, articles discussing topical aspects similar to those in the entity’s context obtain higher ranks, even if the entity is not mentioned explicitly.
When the entity’s topic representations are used for retrieval, the relevance of a news article to a named entity may be controversial among people with different language backgrounds. For example, a news article containing information about the VW scandal affecting the biggest German car production company (http://www.thelocal.de/20151202/what-the-vw-scandal-means-for-germanys-economy) could be considered relevant by most German people, as they could think that the German Chancellor should take direct measures to boost the national economy hurt by the scandal. However, the relevance of this article to the query “Angela Merkel” can be considered low among the English-speaking communities. These users could regard it as a company problem, and it could be hard for them to judge whether this scandal would have a big impact at the national level. I tackle this problem by using the entity’s language-specific topic representations in news retrieval. The users of the retrieval application can select the topic representation of their preferred language when searching for a named entity. The returned news articles and their ranks are then language-specific, based on the background knowledge from the corresponding language edition of Wikipedia.
Besides the retrieval of relevant articles, it is also useful to provide information regarding the topical aspects of the entity influencing their relevance. That is particularly important in case the entity itself is not mentioned in the article. The proposed context-based information retrieval model addresses this problem by creating an overview of each news article discussing language-specific topical aspects related to the entity.
4.3.2 Context-based Entity-centric Information Retrieval Model
For each news article document $d$, where $d \in \{1, \ldots, D\}$, I extract all the noun phrases in the document as potentially related topical aspects to query entities (the named entities whose names are provided as the queries), and then index all the documents by the topical aspects. For a query entity $e$, I generate a query-specific vector representation for the document $d$, with aspect $a_k$ weighted by:
$$s_{e,d,k} = \mathit{af}_{e,d,k} \times \log \frac{N}{\mathit{lf}_{e,k}}, \tag{4.4}$$
where $k \in \{1, \ldots, K\}$ and $\mathit{af}_{e,d,k}$ is the number of matches of topical aspect $a_k$ with the noun phrases from document $d$. In this way, document $d$’s entity-specific vector representation is $\mathbf{s}_{e,d} = (s_{e,d,1}, \ldots, s_{e,d,K})$.
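The weighting in Eq. 4.4 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the noun phrases are taken as already extracted, a "match" is read as a case-insensitive exact string match, and $N$ and $\mathit{lf}_{e,k}$ are passed in as plain inputs because they are not defined in this section. The function name `aspect_weights` and all sample aspects and numbers are hypothetical.

```python
import math
from collections import Counter


def aspect_weights(noun_phrases, aspects, lf, n):
    """Return the vector (s_{e,d,1}, ..., s_{e,d,K}) for one document d.

    noun_phrases: noun phrases already extracted from document d
    aspects:      the K topical aspects a_1..a_K of the query entity e
    lf:           dict mapping each aspect a_k to lf_{e,k} (assumed > 0)
    n:            the constant N from Eq. 4.4
    """
    counts = Counter(phrase.lower() for phrase in noun_phrases)
    vector = []
    for aspect in aspects:
        # af_{e,d,k}: number of noun phrases matching a_k.
        # Exact case-insensitive matching is an assumption of this sketch.
        af = counts[aspect.lower()]
        vector.append(af * math.log(n / lf[aspect]))
    return vector


# Hypothetical aspects and statistics, for illustration only.
aspects = ["refugee crisis", "bundeswehr", "eurozone"]
lf = {"refugee crisis": 10, "bundeswehr": 100, "eurozone": 50}
phrases = ["Bundeswehr", "Syria", "terrorism", "Bundeswehr", "refugee crisis"]

weights = aspect_weights(phrases, aspects, lf, n=1000)
```

In this toy input, an aspect that never appears among the document's noun phrases receives weight zero, so only aspects actually discussed in the article contribute to its entity-specific vector.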
I apply the same vector space model and similarity metric as in Section 4.2.2 to compute the similarity between entity $e$’s topic representation of language $l_n$, denoted by $\mathbf{r}_{e,n}$, and document $d$’s entity-specific representation $\mathbf{s}_{e,d}$:
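$$\mathrm{Sim}(\mathbf{r}_{e,n}, \mathbf{s}_{e,d}) = \frac{\mathbf{r}_{e,n} \cdot \mathbf{s}_{e,d}}{|\mathbf{r}_{e,n}| \times |\mathbf{s}_{e,d}|}$$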
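As a rough illustration of how this cosine similarity could drive a language-specific ranking, the following Python sketch scores a few documents against two topic representations of the same entity. The helper names `cosine_similarity` and `rank_documents`, the vectors, and the handling of all-zero vectors (scored as 0) are assumptions made for this example, not details taken from the model description.

```python
import math


def cosine_similarity(r, s):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(r, s))
    norm_r = math.sqrt(sum(x * x for x in r))
    norm_s = math.sqrt(sum(y * y for y in s))
    if norm_r == 0.0 or norm_s == 0.0:
        # Assumption: a document sharing no aspect with the entity scores 0.
        return 0.0
    return dot / (norm_r * norm_s)


def rank_documents(r_en, doc_vectors):
    """Sort (doc_id, similarity) pairs by decreasing similarity to r_{e,n}."""
    scored = [(doc_id, cosine_similarity(r_en, s))
              for doc_id, s in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic representations of one entity in two language editions.
r_de = [0.9, 0.4, 0.1]
r_en = [0.1, 0.4, 0.9]

# Hypothetical entity-specific document vectors s_{e,d}.
docs = {
    "d1": [3.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 3.0],
    "d3": [0.0, 0.0, 0.0],
}

ranking_de = rank_documents(r_de, docs)
ranking_en = rank_documents(r_en, docs)
```

With these toy vectors, the same document collection is ordered differently depending on which language-specific representation the user selects: `d1` ranks first for `r_de`, while `d2` ranks first for `r_en`.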
The above similarity will be used to measure the levels of relevance between the query entity and documents under this setting. All the documents’ entity-specific