3.2 Entity Linking Features
3.2.6 Joint Feature Modeling
In the following, we present approaches that model different aspects of EL within a joint model. Instead of computing and combining distinct feature values or distributions in an EL algorithm, these approaches combine multiple aspects, as presented in the preceding sections, in a single model. A typical and well-known technique for jointly modeling these features is the topic model, such as LDA. A brief description of LDA and the respective notation can be found in Section 3.2.4 on page 35.
One of these models was proposed by Kataria et al. [Kat11], who learned a semi-supervised hierarchical topic model called Semi-supervised Wikipedia-based Pachinko Allocation Model (WPAM). The model captures the rich textual descriptions of entities and their category hierarchy in Wikipedia. In addition to each entity defining a specific topic in the model, the authors made the following two crucial extensions:
• Wikipedia-based Pachinko Allocation Model: As an extension of the Pachinko Allocation Model for LDA [Li06], the model additionally captures topic correlations within documents, thus enabling collective EL. In contrast to the original Pachinko Allocation Model, which focuses on a fixed four-level topic hierarchy, WPAM leverages the entire Wikipedia category hierarchy. The category hierarchy forms a directed acyclic graph and groups semantically related entities into relevant categories.
• Supervision: The authors integrated a form of weak supervision into the standard LDA model by leveraging Wikipedia annotations (i.e., annotated surface forms) to improve linking accuracy. The key idea was to bias the topic-word distribution $\phi_k$ in favor of surface forms (words) that were often annotated with topic/entity $k$, and to bias the document-topic distribution $\theta_d$ in favor of topics that were referred to by the surface form annotations within $d$.
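The supervision idea can be illustrated with a small sketch. It assumes that the bias is realized by adding annotation counts to otherwise symmetric Dirichlet pseudo-counts; this is one common way to encode such a bias and is not necessarily the mechanism used in [Kat11]. The function names, the constants `base` and `boost`, and the toy annotations are hypothetical.

```python
from collections import Counter


def biased_topic_word_prior(annotations, vocabulary, base=0.01, boost=1.0):
    """Build Dirichlet pseudo-counts {entity: {word: value}}.

    annotations: (surface_form, entity) pairs, e.g. harvested from links.
    base, boost: hypothetical constants (symmetric prior, bias strength).
    """
    annotations = list(annotations)
    counts = Counter(annotations)
    entities = {entity for _, entity in annotations}
    return {
        entity: {
            word: base + boost * counts[(word, entity)] for word in vocabulary
        }
        for entity in entities
    }


def biased_document_topic_prior(document_annotations, entities, base=0.1, boost=1.0):
    """Build Dirichlet pseudo-counts {entity: value} for a single document."""
    counts = Counter(entity for _, entity in document_annotations)
    return {entity: base + boost * counts[entity] for entity in entities}


def dirichlet_mean(pseudo_counts):
    """Mean of a Dirichlet distribution: the normalized pseudo-counts."""
    total = sum(pseudo_counts.values())
    return {key: value / total for key, value in pseudo_counts.items()}


# Toy annotations (assumed data): surface form "jordan" is annotated twice
# with the researcher and once with the country.
annotations = [
    ("jordan", "Michael_I._Jordan"),
    ("jordan", "Michael_I._Jordan"),
    ("jordan", "Jordan_(country)"),
    ("amman", "Jordan_(country)"),
]
vocabulary = ["jordan", "amman", "research"]

topic_word = biased_topic_word_prior(annotations, vocabulary)
phi_researcher = dirichlet_mean(topic_word["Michael_I._Jordan"])

document_annotations = [("jordan", "Michael_I._Jordan")]
theta_document = dirichlet_mean(
    biased_document_topic_prior(document_annotations, topic_word.keys())
)
```

In this toy setting, the expected topic-word distribution of the researcher puts most of its mass on "jordan", and the expected document-topic distribution favors the entity that is annotated within the document.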
In the underlying evaluation, the WPAM approach was (slightly) superior to other standard LDA approaches for EL. A detailed overview of the generative model can be found in the respective work [Kat11].
Another approach that models the textual context and the topical coherence with topic models on Wikipedia was proposed by Sen [Sen12], namely collective context-aware topic models (CA). In contrast to Kataria et al. [Kat11], this approach does not leverage the Wikipedia category system. Instead, it uses a separate topic model to learn groups of entities from a document-centric KB like Wikipedia. Each group represents a Multinomial distribution over entities and describes the entities’ topical coherence with respect to this group. A major issue in generating entity groups is choosing the number of groups that yields the best EL results for a specific corpus. In addition to entity groups that model topical coherence, the model incorporates word proximity, based on the idea that words appearing in the context of an entity are more likely to be associated with this entity. In contrast to LDA, where each word $w$ in a document $d$ is generated independently, the CA model generates a document $d$ as a sequence. This means that generating a word $w$ also depends on the previously annotated word or on words in a previously annotated sentence or paragraph. A thorough evaluation of differently modified topic models and the effects of differently sized entity groups showed that both the proximity of words to entities and the modeling of topical coherence contribute significantly to a high EL accuracy [Sen12]. This topic model is also applicable to other document-centric KBs and does not depend on Wikipedia-specific features.
The current state-of-the-art topic model for EL on the well-known IITB data set [Kul09] was proposed by Han et al. [Han12] in 2012. The model also incorporates topical coherence and is based on three types of global knowledge: topic knowledge $\phi$, entity name knowledge $\psi$ and entity context knowledge $\xi$. The topic knowledge describes that each entity $e_i$ in a document $d$ is generated based on a topic $z_i$, with $z_i$ containing semantically coherent entities (similar to the groups in [Sen12]). Each topic is modeled as a Multinomial distribution over entities, with the probability denoting the likelihood of an entity $e_i$ being drawn from topic $z_i$. The entity name knowledge describes that a surface form $m_i$ is generated based on all possible annotations of the underlying entity. Hence, the name knowledge of an entity $e_i$ is modeled as a Multinomial distribution over its surface form annotations in the overall document corpus $D$. Finally, the entity context knowledge describes that each word $w_i$ is generated using the context knowledge of an entity. In other words, the context knowledge of an entity $e_i$ is modeled as a Multinomial distribution over words, with the probability describing the likelihood of $w_i$ occurring in the context of $e_i$. Given the topic knowledge $\phi$, entity name knowledge $\psi$ and entity context knowledge $\xi$, the generative process can be described as follows [Han12]:
1. For each document $d \in D$, sample the topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$
2. For each surface form position $i$ in document $d$:
a) Sample a topic assignment $z_i \sim \mathrm{Mult}(\theta_d)$
b) Sample an entity assignment $e_i \sim \mathrm{Mult}(\phi_{z_i})$
c) Sample a surface form $m_i \sim \mathrm{Mult}(\psi_{e_i})$
3. For each word position $i$ in document $d$:
a) Sample a target entity from $d$’s referent entities $a_i \sim \mathrm{Uniform}(e_{m_1}, e_{m_2}, \ldots, e_{m_d})$
b) Sample a describing word using $a_i$’s context word distribution $w_i \sim \mathrm{Mult}(\xi_{a_i})$
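The generative steps above can be traced with a small forward sampler. The following is a minimal sketch that assumes the topic, entity name and entity context knowledge are already available as probability tables (here called `phi`, `psi` and `xi`); the toy tables, the hyperparameter values and all function names are illustrative assumptions and are not taken from [Han12]. The sketch only generates a document and performs no inference.

```python
import random


def sample_categorical(distribution, rng):
    """Draw one outcome from a {outcome: probability} mapping."""
    outcomes = list(distribution)
    weights = [distribution[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [value / total for value in draws]


def generate_document(phi, psi, xi, alpha, n_surface_forms, n_words, rng):
    """Generate one document; n_surface_forms must be at least 1."""
    # Step 1: topic distribution of the document.
    theta_d = dict(enumerate(sample_dirichlet(alpha, rng)))

    # Step 2: topic, entity and surface form for each surface form position.
    entities, surface_forms = [], []
    for _ in range(n_surface_forms):
        z = sample_categorical(theta_d, rng)   # 2a) topic assignment
        e = sample_categorical(phi[z], rng)    # 2b) entity assignment
        m = sample_categorical(psi[e], rng)    # 2c) surface form
        entities.append(e)
        surface_forms.append(m)

    # Step 3: each word is described by one of the document's entities.
    words = []
    for _ in range(n_words):
        a = rng.choice(entities)                     # 3a) uniform target entity
        words.append(sample_categorical(xi[a], rng)) # 3b) describing word
    return surface_forms, entities, words


# Toy global knowledge (assumed values, two topics indexed 0 and 1).
phi = {
    0: {"Michael_I._Jordan": 0.6, "Andrew_Ng": 0.4},
    1: {"Michael_Jordan_(basketball)": 1.0},
}
psi = {
    "Michael_I._Jordan": {"Michael Jordan": 0.7, "Michael I. Jordan": 0.3},
    "Andrew_Ng": {"Andrew Ng": 1.0},
    "Michael_Jordan_(basketball)": {"Michael Jordan": 0.8, "MJ": 0.2},
}
xi = {
    "Michael_I._Jordan": {"research": 0.5, "statistics": 0.5},
    "Andrew_Ng": {"research": 0.5, "machine learning": 0.5},
    "Michael_Jordan_(basketball)": {"basketball": 0.6, "bulls": 0.4},
}

rng = random.Random(0)
surface_forms, entities, words = generate_document(
    phi, psi, xi, alpha=[0.5, 0.5], n_surface_forms=3, n_words=5, rng=rng
)
```

Linking a new document corresponds to inverting this process, i.e., inferring the hidden entity assignments from the observed surface forms and words.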
The global knowledge $\phi$, $\psi$ and $\xi$ is not given a priori. Hence, the authors estimated $\phi$, $\psi$ and $\xi$ through Bayesian inference by integrating the knowledge generation process into the topic model. The authors determined the best number of topics empirically, resulting in $K = 300$.
A recently proposed topic model approach by Li et al. [Li16] links entities defined in linkless KBs. The approach builds on the preceding “Evidence Mining” work [Li13] (presented in Section 3.2.4). Linkless KBs are a special case of document-centric KBs. More specifically, a linkless KB comprises a set of isolated documents $D$, with each document $d_j \in D$ describing an entity $e_j$. Cross-document or intra-document hyperlinks are not required within the documents in $D$. While other topic model approaches generate one model for the entire KB, so that each entity is described through its own topic, this approach generates a small topic model for each unique surface form $m_i$ using a small subset of the KB. More specifically, for each surface form $m_i \in M$, with $M$ denoting the set of surface form strings, a set of candidate documents (i.e., the documents of candidate entities) and a set of surface form documents (i.e., documents that contain the same or very similar surface forms as $m_i$) are extracted. These documents are unified into a document set $D_{m_i}$ for surface form $m_i$. The authors modeled each of the candidate entities as a single topic in $D_{m_i}$, combined with some additional, artificial topics for background words and general topics within the documents. Further, the model tries to mine additional word evidence from the set of surface form documents by mimicking the following effects of cross-document hyperlinks [Li16]:
• Semantic Relatedness: Generally, two entities $e_1$ and $e_2$ are related if they share the same source entities of incoming hyperlinks. Without hyperlinks, the topic model captures the relatedness by adding the names of $e_1$ and $e_2$ to each other’s word evidence. For instance, as shown in Figure 3.6(a), the entities Michael I. Jordan and Andrew Ng are semantically related, as both co-occur in many documents. Additionally, words like “research” and “machine learning” that appear in Michael I. Jordan’s entity description also appear in these documents. While these words are supporting evidence for Michael I. Jordan, we can also associate “Andrew Ng” as evidence for Michael I. Jordan, since “Andrew Ng” co-occurs with Michael I. Jordan’s representative words.
• Description Expansion for Context Similarity: If an entity $e_1$ is linked in the document of entity $e_2$ with mention $m_i$, then the surrounding context of $m_i$ may contain additional evidence words for $e_1$. Although hyperlinks do not exist, this approach is able to generate such evidence by mining it directly from $D$. Figure 3.6(b) shows an example where the important descriptive term “AAAI fellow” of entity Michael I. Jordan is extracted from a document containing a term referring to Michael I. Jordan. In this case without hyperlinks, entity-describing words of Michael I. Jordan in the context, like “research”, are leveraged to associate the term “AAAI fellow” with the entity.
Figure 3.6: Examples of mining evidences from surface form documents [Li16], with panels (a) Semantic Relatedness and (b) Description Expansion. Blue highlighted terms refer to other entities. Red highlighted terms denote additionally mined evidence words. A circle denotes the context of a surface form within a document.
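The description expansion effect can be approximated with a simple co-occurrence heuristic. The sketch below is not the topic model of [Li16]; it only illustrates the intuition that context words around a surface form become new evidence when that context already shares words with the entity description. The function name, the window size, the overlap threshold and the toy documents are assumptions, and multi-word terms are treated as single tokens for simplicity.

```python
def mine_evidence(entity_description, surface_form, documents, window=3, min_overlap=1):
    """Collect new evidence words for an entity from surface form documents.

    entity_description: tokens from the entity's own KB document.
    documents: tokenized surface form documents (lists of tokens).
    window, min_overlap: hypothetical parameters of this heuristic.
    """
    known = set(entity_description)
    mined = set()
    for tokens in documents:
        for position, token in enumerate(tokens):
            if token != surface_form:
                continue
            # Context: up to `window` tokens on each side of the surface form.
            context = (
                tokens[max(0, position - window):position]
                + tokens[position + 1:position + 1 + window]
            )
            # Accept the context only if it already matches the description.
            if len(known.intersection(context)) >= min_overlap:
                mined.update(
                    t for t in context if t not in known and t != surface_form
                )
    return mined


description = ["computer_science", "research", "machine_learning", "statistics"]
documents = [
    ["statistician", "research", "michael_jordan", "aaai_fellow", "statistics"],
    ["basketball", "michael_jordan", "chicago_bulls"],
]
evidence = mine_evidence(description, "michael_jordan", documents)
```

In this toy example, the first document contributes "aaai_fellow" and "statistician" as new evidence, whereas the second document is ignored because its context shares no word with the entity description.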
Since the overall generative process takes a considerable amount of space, we refer the interested reader to the original paper [Li16] for more details.
Another significant approach, proposed by Francis-Landau et al. [Fra16], does not unify all features within one model but computes each feature with the same technique, namely convolutional neural networks (CNNs) [LeC98]. For each of the three textual granularities of the input document (i.e., the surface form, the surface form’s surrounding context and the entire input document) and the two textual granularities of a candidate entity (i.e., the entity name and the entity description), the authors produced vector representations with CNNs. For this purpose, each word is first embedded into a $d$-dimensional vector space using Word2Vec [Mik13a]. Next, the authors mapped the words of each granularity into a fixed-size vector using a convolutional network, passed the result through a rectified linear unit and combined the results with sum pooling, producing a representative topic vector for each granularity. The similarity between a pair of granularities is given by the cosine similarity between the respective topic vectors. Unfortunately, the authors omitted several crucial computation and CNN details, which complicates replication. However, by using these features in a collective EL approach, the authors achieved state-of-the-art results on two data sets [Fra16].
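The described pipeline of convolution, rectified linear unit, sum pooling and cosine similarity can be sketched in a few lines. Since the computation details are not fully specified, the following is only an illustration under stated assumptions: the embeddings and filters are random stand-ins rather than trained Word2Vec vectors or learned CNN weights, the same filters are reused for all granularities, and the filter width, dimensions and function names are hypothetical.

```python
import math
import random


def convolve_relu_sum(word_vectors, filters, width):
    """Apply each filter to every window of `width` words, then ReLU and sum-pool."""
    pooled = [0.0] * len(filters)
    for start in range(len(word_vectors) - width + 1):
        # Concatenate the embeddings of the words in the current window.
        window = [x for vector in word_vectors[start:start + width] for x in vector]
        for index, weights in enumerate(filters):
            activation = sum(w * x for w, x in zip(weights, window))
            pooled[index] += max(0.0, activation)  # ReLU, then sum pooling
    return pooled


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if one of the vectors is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


rng = random.Random(0)
dim, width, n_filters = 4, 2, 3  # assumed sizes

# Random stand-ins for trained word embeddings and convolution filters.
vocabulary = ["michael", "jordan", "machine", "learning", "research", "statistics"]
embedding = {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocabulary}
filters = [[rng.uniform(-1, 1) for _ in range(dim * width)] for _ in range(n_filters)]


def topic_vector(words):
    """Map a word sequence of any length to a fixed-size topic vector."""
    return convolve_relu_sum([embedding[w] for w in words], filters, width)


context = ["michael", "jordan", "machine", "learning", "research"]
entity_description = ["machine", "learning", "research", "statistics"]
similarity = cosine(topic_vector(context), topic_vector(entity_description))
```

Sum pooling makes the length of the topic vector depend only on the number of filters, so texts of different lengths, such as a surface form and an entire document, become directly comparable.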
To summarize, joint features model multiple features in a unified way. The most popular approaches are topic models that collectively integrate the textual context and topical coherence between entities or surface forms. Most of these models were evaluated on data sets that provide a significant amount of textual context before and after the respective surface forms. It remains an open question how these approaches perform on shorter documents like tables and tweets.