
3.2 Entity Linking Features

3.2.6 Joint Feature Modeling

In the following, we present approaches that model different aspects of EL within a joint model. Instead of computing and combining distinct feature values or distributions in an EL algorithm, these approaches combine multiple aspects, as presented in the preceding sections, in a single model. A typical and well-known technique for jointly modeling these features is topic modeling, for example with LDA. A brief description of LDA and the respective notation can be found in Section 3.2.4 on page 35.

One of these models was proposed by Kataria et al. [Kat11], who learned a semi-supervised hierarchical topic model called Semi-supervised Wikipedia-based Pachinko Allocation Model (WPAM). The model captures the rich textual descriptions of entities and their category hierarchy in Wikipedia. In addition to each entity defining a specific topic in the model, the authors made the following two crucial extensions:

• Wikipedia-based Pachinko Allocation Model: As an extension of the Pachinko Allocation Model for LDA [Li06], the model additionally captures topic correlations within documents, thus enabling collective EL. In contrast to the original Pachinko Allocation Model, which focuses on a fixed four-level topic hierarchy, WPAM leverages the entire Wikipedia category hierarchy. The category hierarchy forms a directed acyclic graph and groups semantically related entities into relevant categories.

• Supervision: The authors integrated a form of weak supervision into the standard LDA model by leveraging Wikipedia annotations (i.e., annotated surface forms) to improve linking accuracy. The key idea was to bias the topic-word distribution 𝜑𝑘 in favor of surface forms (words) that were often annotated with topic/entity 𝑘 and to bias the document-topic distribution 𝜃𝑑 in favor of topics that were referred to by the surface form annotations within 𝑑.
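To make the supervision idea more concrete, the following minimal sketch shows one way annotation counts could be turned into asymmetric Dirichlet pseudo-counts that bias 𝜑𝑘 and 𝜃𝑑. It is not the inference procedure of [Kat11]. The function names, the base priors `base_beta` and `base_alpha`, the `boost` factor and the toy counts are all assumptions made for illustration.

```python
import numpy as np


def biased_topic_word_prior(annotation_counts, base_beta=0.01, boost=1.0):
    """Dirichlet pseudo-counts for the topic-word distributions phi_k.

    annotation_counts[k, v] is how often surface form (word) v was annotated
    with topic/entity k. base_beta and boost are hypothetical hyperparameters.
    """
    counts = np.asarray(annotation_counts, dtype=float)
    return base_beta + boost * counts


def biased_doc_topic_prior(annotated_topics, num_topics, base_alpha=0.1, boost=1.0):
    """Dirichlet pseudo-counts for the document-topic distribution theta_d.

    annotated_topics lists the topics/entities referred to by the surface
    form annotations within one document d.
    """
    alpha = np.full(num_topics, base_alpha, dtype=float)
    for k in annotated_topics:
        alpha[k] += boost
    return alpha


# Toy data (made up): 2 topics/entities, 3 surface forms.
counts = np.array([[5, 0, 1],
                   [0, 3, 0]])
beta = biased_topic_word_prior(counts)
phi_mean = beta / beta.sum(axis=1, keepdims=True)   # prior mean of phi_k

alpha = biased_doc_topic_prior([0, 0, 1], num_topics=2)
theta_mean = alpha / alpha.sum()                    # prior mean of theta_d
```

In this toy setting, the prior mean of 𝜑 for topic 0 puts most of its mass on the surface form that was annotated most often with it, and the prior mean of 𝜃𝑑 favors the topic that is annotated more often in the document.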

In the underlying evaluation, the WPAM approach was (slightly) superior to other standard LDA approaches for EL. A detailed overview of the generative model can be found in the respective work [Kat11].

Another approach to modeling the textual context and topical coherence with topic models in Wikipedia was proposed by Sen [Sen12], namely collective context-aware topic models (CA). In contrast to Kataria et al. [Kat11], this approach does not leverage the Wikipedia category system. Instead, it uses a separate topic model to learn groups of entities from a document-centric KB like Wikipedia. Each group represents a Multinomial distribution over entities and describes the entities’ topical coherence with respect to this group. A major issue in generating entity groups was determining the number of groups that yields the best EL results for a given corpus. In addition to entity groups that model topical coherence, the model incorporates word proximity. It is based on the idea that words that appear in the context of an entity are more likely to be associated with this entity. In contrast to LDA, where each word 𝑤ℎ in a document 𝑑 is generated independently, the CA model generates a document 𝑑 as a sequence. This means that generating a word 𝑤ℎ also depends on the previously annotated word or on words in a previously annotated sentence or paragraph. A thorough evaluation of differently modified topic models and of the effects of differently sized entity groups showed that both the proximity of words to entities and the modeling of topical coherence contribute significantly to a high EL accuracy [Sen12]. This topic model is also applicable to other document-centric KBs and does not depend on Wikipedia-specific features.

The current state-of-the-art topic model for EL on the well-known IITB data set [Kul09] was proposed by Han et al. [Han12] in 2012. The model also incorporates topical coherence and is based on three types of global knowledge, namely topic knowledge 𝜑, entity name knowledge 𝜓 and entity context knowledge 𝜉. The topic knowledge describes that each entity 𝑒𝑗 in a document 𝑑 is generated based on a topic 𝑧𝑙, with 𝑧𝑙 containing semantically coherent entities (similar to the groups in [Sen12]). Each topic is modeled as a Multinomial distribution over entities, with the probability denoting the likelihood of an entity 𝑒𝑗 being drawn from topic 𝑧𝑙. The entity name knowledge describes that a surface form 𝑚𝑖 is generated based on all possible annotations of the underlying entity. Hence, the name knowledge of an entity 𝑒𝑗 is modeled as a Multinomial distribution over its surface form annotations in the overall document corpus 𝐷. Finally, the entity context knowledge describes that all words 𝑤𝑛 in a document are generated using the context knowledge of the document's entities. In other words, the context knowledge of an entity 𝑒𝑗 is modeled as a Multinomial distribution over words, with the probability describing the likelihood of 𝑤𝑛 occurring in the context of 𝑒𝑗. Given the topic knowledge 𝜑, entity name knowledge 𝜓 and entity context knowledge 𝜉, the generative process can be described as follows [Han12]:

1. For each document 𝑑 ∈ 𝐷, sample the topic distribution 𝜃𝑑 ∼ 𝐷𝑖𝑟(𝛼)

2. For each surface form position 𝑖 in document 𝑑:

a) Sample a topic assignment 𝑧𝑖 ∼ 𝑀𝑢𝑙𝑡(𝜃𝑑)

b) Sample an entity assignment 𝑒𝑖 ∼ 𝑀𝑢𝑙𝑡(𝜑𝑧𝑖)

c) Sample a surface form 𝑚𝑖 ∼ 𝑀𝑢𝑙𝑡(𝜓𝑒𝑖)

3. For each word position 𝑙 in document 𝑑:

a) Sample a target entity from 𝑑’s referent entities 𝑎𝑙 ∼ 𝑈𝑛𝑖𝑓𝑜𝑟𝑚(𝑒𝑚1, 𝑒𝑚2, ..., 𝑒𝑚𝑑)

b) Sample a describing word using 𝑎𝑙’s context word distribution 𝑤𝑙 ∼ 𝑀𝑢𝑙𝑡(𝜉𝑎𝑙)
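The generative process above can be sketched as a forward sampler. This is only an illustration of the sampling steps, assuming that 𝜑, 𝜓 and 𝜉 are already given as row-stochastic matrices. It does not cover the inference that [Han12] use to estimate them. The function name, the array shapes and the random toy parameters are assumptions.

```python
import numpy as np


def generate_document(alpha, phi, psi, xi, num_mentions, num_words, rng):
    """Sample one document following steps 1 to 3 of the generative process.

    alpha: (K,)   Dirichlet parameter of the topic distribution
    phi:   (K, E) topic knowledge, one entity distribution per topic
    psi:   (E, M) entity name knowledge, one surface form distribution per entity
    xi:    (E, V) entity context knowledge, one word distribution per entity
    """
    if num_mentions < 1:
        raise ValueError("step 3 needs at least one referent entity")

    theta = rng.dirichlet(alpha)                        # step 1: theta_d ~ Dir(alpha)

    topics, entities, mentions = [], [], []
    for _ in range(num_mentions):                       # step 2
        z = int(rng.choice(phi.shape[0], p=theta))      # a) z_i ~ Mult(theta_d)
        e = int(rng.choice(phi.shape[1], p=phi[z]))     # b) e_i ~ Mult(phi_{z_i})
        m = int(rng.choice(psi.shape[1], p=psi[e]))     # c) m_i ~ Mult(psi_{e_i})
        topics.append(z)
        entities.append(e)
        mentions.append(m)

    targets, words = [], []
    for _ in range(num_words):                          # step 3
        a = entities[int(rng.integers(len(entities)))]  # a) uniform over referent entities
        w = int(rng.choice(xi.shape[1], p=xi[a]))       # b) w_l ~ Mult(xi_{a_l})
        targets.append(a)
        words.append(w)

    return {"theta": theta, "topics": topics, "entities": entities,
            "mentions": mentions, "targets": targets, "words": words}


# Toy parameters (random, for illustration only).
rng = np.random.default_rng(0)
K, E, M, V = 3, 5, 7, 11
phi = rng.dirichlet(np.ones(E), size=K)
psi = rng.dirichlet(np.ones(M), size=E)
xi = rng.dirichlet(np.ones(V), size=E)

doc = generate_document(np.full(K, 0.5), phi, psi, xi,
                        num_mentions=3, num_words=10, rng=rng)
```

Each context word is tied to one of the entities sampled in step 2. This is how the model couples the textual context with the entity assignments of the surface forms.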

The global knowledge 𝜑, 𝜓 and 𝜉 is not given a priori. Hence, the authors estimated 𝜑, 𝜓 and 𝜉 through Bayesian inference by integrating the knowledge generation process into the topic model. The authors determined the best number of topics empirically, resulting in 𝐾 = 300.

A recently proposed topic model approach by Li et al. [Li16] links entities defined in linkless KBs. The approach is based on the preceding ‘Evidence Mining’ work of [Li13] (presented in Section 3.2.4). Linkless KBs are a special case of document-centric KBs. More specifically, a linkless KB comprises a set of isolated documents 𝐷, with each document 𝑑𝑗 ∈ 𝐷 describing an entity 𝑒𝑗. Cross-document or intra-document hyperlinks are not required within the documents in 𝐷. While other topic model approaches generate one model for the entire KB, in which each entity is described through its own topic, this approach generates a small topic model for each unique surface form 𝑚𝑖 using a small subset of the KB. More specifically, for each surface form 𝑚𝑖 ∈ M, with M denoting the set of surface form strings, a set of candidate documents (i.e., the documents of candidate entities) and a set of surface form documents (i.e., documents that contain the same or very similar surface forms as 𝑚𝑖) are extracted. These documents are combined into a document set 𝐷𝑚𝑖 for surface form 𝑚𝑖. The authors modeled each of the candidate entities as a single topic in 𝐷𝑚𝑖, combined with some additional, artificial topics for background words and general topics within the documents. Further, the model tries to mine additional word evidence from the set of surface form documents by mimicking the following effects of cross-document hyperlinks [Li16]:

• Semantic Relatedness: Generally, two entities 𝑒1 and 𝑒2 are related if they share the same source entities of incoming hyperlinks. Without hyperlinks, the topic model captures the relatedness by adding 𝑒1’s and 𝑒2’s names to each other’s word evidence. For instance, as shown in Figure 3.6(a), the entities Michael I. Jordan and Andrew Ng are semantically related, both co-occurring in many documents. Additionally, words like ‘research’ and ‘machine learning’ that appear in Michael I. Jordan’s entity description also appear in these documents. While these words are supporting evidence for Michael I. Jordan, we can also associate ‘Andrew Ng’ as evidence for Michael I. Jordan, since ‘Andrew Ng’ co-occurs with Michael I. Jordan’s representative words.

• Description Expansion for Context Similarity: If an entity 𝑒1 is linked in the document of entity 𝑒2 with mention 𝑚𝑖, then the surrounding context of 𝑚𝑖 may contain additional evidence words for 𝑒1. Even without hyperlinks, this approach is able to generate such evidence by mining it directly from 𝐷. Figure 3.6(b) shows an example where the important descriptive term ‘AAAI fellow’ of the entity Michael I. Jordan is extracted from a document containing a term referring to Michael I. Jordan. In our case without hyperlinks, we leverage the entity-describing words of Michael I. Jordan, like ‘research’, in the context to associate the term ‘AAAI fellow’ with the entity.

[Figure 3.6 with two panels, each showing surface form documents and the evidence mined from them: (a) Semantic Relatedness, (b) Description Expansion]

Figure 3.6: Examples of mining evidences from surface form documents [Li16]. Blue highlighted terms are referring to other entities. Red highlighted terms denote additionally mined evidence words. A circle denotes the context of a surface form within a document.
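The intuition behind both effects can be illustrated with a simple co-occurrence heuristic. This is a deliberate simplification and not the topic model of [Li16]: a surface form document is treated as supporting an entity if it shares enough words with the entity description, and its remaining terms are then counted as candidate evidence. The function name, the `min_overlap` threshold and the third toy document are assumptions. The first two toy documents follow the terms shown in Figure 3.6(a).

```python
from collections import Counter


def mine_evidence(description_words, surface_form, surface_form_docs, min_overlap=2):
    """Count terms that co-occur with an entity's description words.

    A document contributes only if it shares at least min_overlap terms with
    the entity description (hypothetical threshold). The surface form itself
    and the description words are not counted as new evidence.
    """
    description = set(description_words)
    evidence = Counter()
    for doc in surface_form_docs:
        terms = set(doc)
        if len(terms & description) >= min_overlap:
            evidence.update(terms - description - {surface_form})
    return evidence


# Description words of the entity Michael I. Jordan, as in Figure 3.6.
description = ["computer science", "research", "machine learning", "statistics"]

# Surface form documents containing 'michael jordan' (third one is made up).
docs = [
    ["machine learning", "computer science", "michael jordan", "andrew ng", "research"],
    ["statistician", "machine learning", "michael jordan", "statistical model",
     "andrew ng", "statistics"],
    ["basketball", "michael jordan", "chicago bulls"],
]

evidence = mine_evidence(description, "michael jordan", docs)
```

In this toy run, ‘andrew ng’ is collected from both supporting documents, while the terms of the unrelated third document are ignored because it shares no words with the entity description.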

Since the overall generative process takes a considerable amount of space, we refer the interested reader to the original paper [Li16] for more details.

Another significant approach, proposed by Francis-Landau et al. [Fra16], does not unify all features within one model, but computes each feature with the same technique, namely convolutional neural networks (CNN) [LeC98]. Overall, the authors produced vector representations with CNNs for each of the three textual granularities in the input document (i.e., the surface form, the surface form’s surrounding context and the entire input document) and for two textual granularities of a candidate entity (i.e., the entity name and the entity description). For this purpose, each word is first embedded into a 𝑑-dimensional vector space using Word2Vec [Mik13a]. Next, the authors mapped the words of each granularity into a fixed-size vector using a convolutional network, passed the result through a rectified linear unit and combined the results with sum pooling, producing a representative topic vector for each granularity. The similarity between any granularity pair is defined as the cosine similarity between the respective topic vectors. Unfortunately, the authors omitted several crucial computation and CNN details, which complicates replicability. However, by using these features in a collective EL approach, the authors achieved state-of-the-art results on two data sets [Fra16].
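The pipeline described above (convolution, rectified linear unit, sum pooling, cosine similarity) can be sketched as follows. Since [Fra16] leave several details open, the window width, the number of filters, the zero padding for short inputs and the use of separate filters per granularity are assumptions of this sketch. Random vectors stand in for the Word2Vec embeddings and for the learned filter weights.

```python
import numpy as np


def cnn_topic_vector(word_vectors, filters, width):
    """Map a word sequence to a fixed-size vector.

    word_vectors: (n, d) embeddings of the n words of one granularity
    filters:      (k, width * d) convolution filters (learned in practice)
    Returns a (k,) vector: convolution -> ReLU -> sum pooling.
    """
    n, d = word_vectors.shape
    if n < width:                                   # assumed: zero-pad short inputs
        padding = np.zeros((width - n, d))
        word_vectors = np.vstack([word_vectors, padding])
        n = width
    # One flattened window per position of the sliding convolution.
    windows = np.stack([word_vectors[i:i + width].ravel()
                        for i in range(n - width + 1)])
    activations = np.maximum(0.0, windows @ filters.T)   # rectified linear unit
    return activations.sum(axis=0)                        # sum pooling


def cosine_similarity(u, v):
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Toy setup: random stand-ins for embeddings and filter weights.
rng = np.random.default_rng(0)
d, k, width = 8, 4, 3
context_words = rng.normal(size=(12, d))       # surrounding context of a surface form
description_words = rng.normal(size=(20, d))   # candidate entity description
context_filters = rng.normal(size=(k, width * d))
description_filters = rng.normal(size=(k, width * d))

context_vec = cnn_topic_vector(context_words, context_filters, width)
description_vec = cnn_topic_vector(description_words, description_filters, width)
similarity = cosine_similarity(context_vec, description_vec)
```

Because the topic vectors are sums of rectified activations, all their components are non-negative, so the resulting cosine similarities lie between 0 and 1.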

To summarize, joint features model multiple features in a unified way. The most popular approaches are topic models that collectively integrate the textual context and topical coherence between entities or surface forms. Most of these models were evaluated on data sets that provide a significant amount of textual context before and after the respective surface forms. It remains an open question how these approaches perform on shorter documents like tables and tweets.

In document Robust Entity Linking in Heterogeneous Domains (Page 55-59)