3.6 Thematic Context Distance
3.6.2 Linking as a Classification Problem
Candidate entities in Wikipedia can be considered as labels for a mention. This labelling can be learned as a supervised classification task using disambiguated examples retrieved from Wikipedia's interlinkage. Assume a candidate entity $e(m)$ for a mention $m$ and a mention-candidate pairing operator $x(m, e(m))$ describing the mutual relation of $m$ and $e$. In our case, the operator $x(m, e(m))$ is a vector of $n$ real-valued features, i.e. $x(m, e(m)) \in \mathbb{R}^n$. For instance, one feature in this vector may be the cosine similarity of the two describing contexts $\mathrm{text}(m)$ and $\mathrm{text}(e(m))$. Now, as stated in Section 2.2, a candidate entity $e(m)$ is either correct or not. In a binary classification setting with labels $\{y^-, y^+\} = \{-1, +1\}$, a collection of training instances
\[
D = \left\{ \left( x_i^{(k)}(m_i, e_k(m_i)),\; y_i^{(k)} \right) \;\middle|\; x_i^{(k)} \in \mathbb{R}^n,\ y_i^{(k)} \in \{y^-, y^+\},\ e_k(m_i) \in e(m_i) \right\} \tag{3.45}
\]
then contains, for any mention $m_i$ and each of its candidate entities $e_k(m_i) \in e(m_i)$, a descriptive vector $x_i^{(k)}(m_i, e_k(m_i))$ that has an associated label $y_i^{(k)} \in \{y^-, y^+\}$, where the label $y^+$ denotes a positive instance and $y^-$ a negative instance. A positive instance $(x(m, e(m)), y^+)$ encodes that $e$ is the correct underlying entity for the mention $m$. Analogously, a negative instance $(x(m, e'(m)), y^-)$ encodes that $e'$ is not the correct underlying entity for the mention $m$. Given these training instances, we may learn an assignment function $f : x(m, e(m)) \mapsto \{y^-, y^+\}$ of the form
\[
f(x(m, e(m))) =
\begin{cases}
y^+, & \text{if } e(m) = e^+(m) \\
y^-, & \text{if } e(m) \neq e^+(m).
\end{cases} \tag{3.46}
\]
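As a minimal sketch of how the training instances of Eq. 3.45 could be assembled, the following Python snippet builds one labelled feature vector per candidate of a single mention. It assumes a plain bag-of-words cosine similarity as the only feature (n = 1); the function names, the entity ids, and the whitespace tokenization are hypothetical choices for illustration and are not taken from the text.

```python
import math
from collections import Counter

Y_POS, Y_NEG = +1, -1  # labels y+ and y- of the binary setting


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a plain bag-of-words model."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_instances(mention_text, candidates, correct_id):
    """Return one (x, y) pair per candidate entity of a single mention.

    candidates: dict mapping a (hypothetical) entity id to its describing text.
    The feature vector x holds a single feature here (assumption); a real
    vector would contain n features.
    """
    instances = []
    for entity_id, entity_text in candidates.items():
        x = [cosine_similarity(mention_text, entity_text)]
        y = Y_POS if entity_id == correct_id else Y_NEG
        instances.append((x, y))
    return instances
```

Calling `build_instances("the senator spoke in parliament", {"politician": "senator and member of parliament", "musician": "guitarist and singer"}, correct_id="politician")` yields one positive and one negative instance, with the positive instance carrying the larger similarity value.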
In the inference step, we use the prediction value of f (x(m, e(m))) to decide on the estimated or predicted target entity ˆe:
ˆ
Chapter 3 Topic Models for Person Linking
In general, we will observe collections $e_k(m) \in e(m) \subset W$ of candidate targets for a mention $m$. Therefore, the function $f$ is applied to each mention-candidate pairing $x^{(k)}(m, e_k(m))$:
\[
\forall e_k(m) \in e(m) : f(x^{(k)}(m, e_k(m))) = y^{(k)} \in \{y^-, y^+\}, \tag{3.48}
\]
resulting in a set of tuples $(y^{(k)}, e_k)$ of prediction value $y^{(k)}$ and candidate $e_k(m)$. Now, to determine the final prediction, we need to assign the mention $m$ to exactly one candidate entity $e$. However, this uniqueness is not inherently guaranteed by standard prediction algorithms such as a binary SVM that predicts labels $\{y^-, y^+\}$. As an example, we may observe two candidate entities from a very similar field, e.g. politicians from the same party, whose descriptions may vary only slightly. The resulting mention-candidate pairings may then both receive a positive label $y^+$. To circumvent this problem, we use the real-valued prediction $y \in \mathbb{R}$ of the SVM instead of the binary labels $\{y^-, y^+\}$, i.e.
\[
y^{(k)} = w^* \cdot x^{(k)} - b. \tag{3.49}
\]
This real-valued prediction $y^{(k)}$ is the offset of the instance $x^{(k)}$ from the separating hyperplane whose parameters $w^*$ and $b$ are learned from the training instances $D$ (Eq. 3.45). Then, we define the predicted entity $\hat{e}(m)$ to be the candidate $e_k(m)$ for which we obtain the maximum prediction value $y^{(k)}$ for the mention-candidate pairing $x^{(k)}(m, e_k(m))$. The final assignment is then
\[
\hat{e}(m) = \arg\max_{e_k \in e(m)} y^{(k)}. \tag{3.50}
\]
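The scoring and selection steps of Eq. 3.49 and Eq. 3.50 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the weight vector `w`, the offset `b`, the entity ids, and the function names are hypothetical, and the candidate set is assumed to be non-empty.

```python
def decision_value(w, x, b):
    """Real-valued score y = w . x - b of one feature vector (Eq. 3.49)."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def predict_entity(w, b, candidate_features):
    """Return (entity_id, score) of the highest-scoring candidate (Eq. 3.50).

    candidate_features: non-empty dict mapping a (hypothetical) entity id
    to the feature vector x^(k) of its pairing with the mention.
    """
    scores = {e: decision_value(w, x, b) for e, x in candidate_features.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With `w = [2.0, 1.0]`, `b = 0.5` and the candidates `{"A": [0.6, 0.5], "B": [0.7, 0.4]}`, both scores are positive, so a purely binary decision would label both candidates as correct; taking the maximum resolves the tie in favour of `"B"`.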
With Eq. 3.50, we generalize the binary classification to a rank-related classification by choosing the candidate with the highest score $y^{(k)}$ among all candidates. This enables a single overall assignment model, in contrast to a "one-model-per-entity" approach. We also evaluated a Ranking SVM but, as we will show in the empirical evaluation, the results obtained were inferior to those of the classification method. The SVM classifier essentially considers each instance individually during learning, whereas the ranking method considers groups of instances to learn the ranking of the candidate entities for a mention. We assume that the descriptive feature vectors of negative candidates are too similar to each other and that this degrades the ranking approach.
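To make the difference between instance-wise and group-wise learning more concrete, the following Python sketch contrasts the two kinds of training input for one mention. The pairwise difference vectors follow the common pairwise reduction for ranking; whether the Ranking SVM evaluated here was configured in exactly this way is an assumption, and all names are hypothetical.

```python
def pointwise_instances(candidates, correct_id):
    """One labelled instance per candidate, as seen by the binary classifier."""
    return [(x, +1 if e == correct_id else -1) for e, x in candidates.items()]


def pairwise_instances(candidates, correct_id):
    """Difference vectors x+ - x- within one mention's candidate group.

    A pairwise ranker is trained so that every such difference receives
    a positive score, i.e. the correct candidate is ranked above each
    incorrect one.
    """
    x_pos = candidates[correct_id]
    return [
        [p - n for p, n in zip(x_pos, x_neg)]
        for e, x_neg in candidates.items()
        if e != correct_id
    ]
```

For a mention with one correct candidate and two negative candidates that share the same feature vector, `pairwise_instances` returns two identical difference vectors, whereas `pointwise_instances` still returns three separately labelled instances.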
For this classification approach with a standard SVM, we decided not to use artificial NIL candidates to learn a threshold for uncovered entities as in Bunescu and Pasca [2006] and Pilz and Paaß [2009, 2012]. Instead, we use a threshold $\tau$ and define the prediction $\hat{e}(m) = \mathrm{NIL}$ if the model predicts no score $y^{(k)}$ greater than $\tau$ for any of the candidates $e_k(m) \in e(m)$:
\[
\hat{e}(m) = \mathrm{NIL} \quad \text{if } \max_{e_k \in e(m)} y^{(k)} \leq \tau. \tag{3.51}
\]
In initial experiments, we found empirically that a threshold of $\tau = 0$ gives the best results. The results obtained with this setting are described in Section 3.6.4.
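The thresholded decision of Eq. 3.50 and Eq. 3.51 can be sketched as a small Python function. It is a minimal illustration with hypothetical names; representing NIL by `None` and returning NIL for an empty candidate set are assumptions made here, while the default `tau = 0.0` mirrors the value reported above.

```python
NIL = None  # stand-in for the NIL prediction (assumption)


def link_with_nil(scores, tau=0.0):
    """Select the best candidate or NIL for one mention.

    scores: dict mapping a (hypothetical) entity id to its score y^(k).
    Returns the arg max candidate (Eq. 3.50) unless the maximum score
    does not exceed tau, in which case NIL is returned (Eq. 3.51).
    An empty candidate set also yields NIL (assumption).
    """
    if not scores:
        return NIL
    best = max(scores, key=scores.get)
    return best if scores[best] > tau else NIL
```

For example, `link_with_nil({"a": 0.3, "b": -0.1})` selects `"a"`, while `link_with_nil({"a": -0.2, "b": -0.7})` returns NIL because no candidate scores above the threshold.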
Alternative formulations are one-model-per-entity or multi-class classification approaches. In the one-model-per-entity approach, we would learn one separate model per entity. In the multi-class model, we would represent each entity as one distinct class. However, we consider neither alternative appropriate. For a multi-class approach, the number of classes is proportional to the size of W, which would result in 3 million classes in the most general case. We can then also expect a very skewed distribution of positive examples over the classes. For a more tractable formulation, subgroups would need to be determined, which requires additional effort in model design. We argue that this is not necessary with the classification method described above. The one-model-per-entity approach is also difficult to realize in practice, since we would need to manage a huge number of models, for instance when naively storing one SVM model per entity. While this is probably suitable for smaller entity collections, we argue that our formulation of one model for all entities is more elegant, since we can also exploit interactions between instances.
Having described an entity linking model based on a classifier with thematic distances as features, we will now detail the corpora used to evaluate this model.