8.2 DoSeR Framework
8.2.3 DoSeR Entity Linking Algorithm
Given the previously constructed EL index, our algorithm accepts documents that contain one or more surface forms to be linked to entities. It links all surface forms within a document using a collective, graph-based approach. Given a set of surface forms, the algorithm seeks the optimal entity assignment $\Gamma$ and can be subdivided into four main steps. Algorithm 5 gives an overview of the entire process; its steps are explained in the following.
Candidate Entity Generation
The first step in our EL chain is Candidate Entity Generation. The goal is to reduce the number of possible candidate entities for each input surface form $m_i$ by determining a set of relevant target entities, the target candidate entity set $\Omega_i$ for surface form $m_i$. Details of our candidate generation process are described in Section 8.2.4. Given the candidates, we immediately resolve all surface forms with zero or one candidate entity: a surface form without candidates is assigned NIL, and a surface form with exactly one candidate is linked to that candidate. We also initialize the entity set $E_a$ with the entities of unambiguous or already linked surface forms (Lines 2-7).
Semantic Embedding Candidate Filter
Our second step, Semantic Embedding Candidate Filter, retains only those candidate entities that fit the general topic described by the already disambiguated entities (Lines 8-17). It requires at least three already assigned entities. The underlying assumption is that all entities in a paragraph are topically related. To infer this general topic, we create a topic vector $\mathit{tv} = \sum_{e_j \in E_a} \mathit{vec}(e_j)$, with $E_a$ being the set of already linked entities and $\mathit{vec}(e_j)$ being the entity embedding of entity $e_j$ (Word2Vec vector). Next, we compute the semantic similarity (cosine similarity) between the topic vector $\mathit{tv}$ and the candidate entities of all surface forms that are not yet disambiguated. If the similarity exceeds the a priori given CandidateFilter threshold $\tau$, the candidate entity remains in the candidate list of the respective surface form. If no candidate of a specific surface form exceeds the threshold, the candidate set for this surface form remains unchanged. If exactly one candidate remains, the surface form is linked to it. We note that this filter is a crucial step toward fast and accurate EL. Omitting it results in significantly lower runtime performance combined with worse results (2 to 5 percentage points lower F1, depending on the data set).
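The filter described above can be illustrated with a minimal sketch. It is not the DoSeR implementation: the function names, the two-dimensional toy embeddings, and the threshold value 0.5 are assumptions chosen for illustration, whereas real entity embeddings would come from a trained Word2Vec model.

```python
import math


def sum_embeddings(vectors):
    """Component-wise sum of equally sized vectors (the topic vector)."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(candidates, linked, embeddings, threshold, min_linked=3):
    """Keep candidates whose similarity to the topic vector exceeds threshold.

    candidates: dict mapping a surface form to its list of candidate entities
    linked:     list of already linked entities
    embeddings: dict mapping an entity to its embedding vector
    """
    # The filter is skipped when fewer than three entities are linked.
    if len(linked) < min_linked:
        return {sf: list(cands) for sf, cands in candidates.items()}
    topic = sum_embeddings([embeddings[e] for e in linked])
    filtered = {}
    for sf, cands in candidates.items():
        kept = [e for e in cands if cosine(topic, embeddings[e]) > threshold]
        # If no candidate passes, the candidate set stays unchanged.
        filtered[sf] = kept if kept else list(cands)
    return filtered


# Hypothetical toy embeddings: the first axis stands for "geography",
# the second for "celebrities".
EMBEDDINGS = {
    "Munich": [1.0, 0.1],
    "Berlin": [0.9, 0.2],
    "Hamburg": [0.8, 0.0],
    "Paris_(city)": [0.9, 0.3],
    "Paris_Hilton": [0.0, 1.0],
}
candidates = {"Paris": ["Paris_(city)", "Paris_Hilton"]}
linked = ["Munich", "Berlin", "Hamburg"]
print(filter_candidates(candidates, linked, EMBEDDINGS, threshold=0.5))
```

In this toy setting the three linked cities pull the topic vector toward the first axis, so only the city reading of "Paris" passes the filter. With a threshold that no candidate exceeds, both candidates would be kept.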
High Probability Candidate Linking
The third step, High Probability Candidate Linking, applies PageRank to an EL graph in order to link high-probability candidates (Lines 18-24). Details on graph construction and PageRank can be found in Section 8.2.5. We rank the candidate entities of each surface form in descending order of the relevance score assigned by the PageRank algorithm. For each surface form, we then select the highest PageRank score $h$, the second-highest PageRank score $s$, and the average PageRank score $\mathit{avg}$ across all candidate entities that belong to that surface form. Given these values, we define a threshold $\mathit{dynThreshold}$ that captures the certainty of the ranking based on the difference between the first- and second-ranked candidates:
$$\mathit{dynThreshold} = h - \mathit{margin}_1 \cdot (h - \mathit{avg}) \tag{8.1}$$
Details on the parameter $\mathit{margin}_1$ are discussed in Section 8.4.1. We use this threshold as a certainty criterion indicating whether the top-ranked candidate entity of a surface form is the correct target. More specifically, if the PageRank score $s$ of the second-ranked candidate is below the threshold $\mathit{dynThreshold}$, the highest-ranked entity is taken as the target entity of its surface form. In other words, if the relevance score margin between the highest-ranked candidate and the other candidates is large, the likelihood of the top-ranked candidate being the correct target entity is also high. Otherwise, we reduce the candidate set of the respective surface form to the four top-ranked candidate entities.

Algorithm 5: Our graph-based EL algorithm integrated in DoSeR
input : $M = \langle m_1, \ldots, m_n \rangle$, threshold $\tau$, $\mathit{margin}_1$, $\mathit{margin}_2$
output: assignment $\Gamma = \langle t_1^e, \ldots, t_n^e \rangle$, with $t_i^e$ denoting the assigned entity $e_j$ of $m_i$
 1  configuration $\Gamma$ = tuple(); linked entities $E_a = \emptyset$; candidate set $\Omega_i = \emptyset$
    // Candidate Entity Generation
 2  for $m_i \in M$ do
 3      $\Omega_i$ = generateCandidates($m_i$)
 4      if $|\Omega_i| = 0$ then
 5          $\Gamma(i)$ = NIL
 6      else if $|\Omega_i| = 1$ then
 7          $\Gamma(i) = e_j \in \Omega_i$; $E_a = E_a \cup \Omega_i$
    // Semantic Embedding Candidate Filter
 8  if $|E_a| > 2$ then
 9      for $m_i \in M$ and $|\Omega_i| > 1$ do
10          set = $\emptyset$
11          for $e_j \in \Omega_i$ do
12              if cosSim(sumEmbeddings($E_a$), $e_j$) $> \tau$ then
13                  set = set $\cup\ e_j$
14          if set $\neq \emptyset$ then
15              $\Omega_i$ = set
16              if |set| = 1 then
17                  $\Gamma(i) = \Omega_i$; $E_a = E_a \cup \Omega_i$
    // High Probability Candidate Linking
18  CreateDisambiguationGraphAndSolvePageRank($\Omega_i$, $E_a$); rank candidates
19  Select highest PR score $h$, second-highest PR score $s$, average PR score $\mathit{avg}$
20  for $m_i \in M$ and $|\Omega_i| > 1$ do
21      if $s < \mathit{dynThreshold}$ then
22          $\Gamma(i)$ = getEntityOf($h$); $\Omega_i$ = getEntityOf($h$); $E_a = E_a \cup \Omega_i$
23      else
24          $\Omega_i$ = selectTop4RankedCandidates
    // Final Linking and Abstaining
25  CreateDisambiguationGraph($\Omega_i$, $E_a$)
26  for $m_i \in M$ and $|\Omega_i| > 1$ do
27      Perform PR and rank candidates; select PR scores $h$, $s$, and $\mathit{avg}$
28      if $s < \mathit{abstainingThreshold}$ then
29          $\Gamma(i)$ = getEntityOf($h$); $\Omega_i$ = getEntityOf($h$); $E_a = E_a \cup \Omega_i$
30      else
31          $\Gamma(i)$ = NIL; $\Omega_i = \emptyset$
32      updateGraph($\Omega_i$, $E_a$)
Final Linking and Abstaining
The last step, Final Linking and Abstaining, links the remaining surface forms or abstains if the algorithm is uncertain about the correct target entity (Lines 25-32). We first create an EL graph (cf. Section 8.2.5) and then iteratively link the entities of the remaining surface forms. For this purpose, every iteration applies the PageRank algorithm to the underlying graph and ranks the candidate entities of each surface form in descending order. The scores $h$, $s$, and $\mathit{avg}$ are calculated as in the previous step. The abstaining threshold $\mathit{abstainingThreshold}$ is calculated using Equation 8.1 with a different margin parameter ($\mathit{margin}_2$). If the score of the second-ranked candidate entity exceeds the abstaining threshold, the algorithm returns the NIL identifier for the respective surface form. Otherwise, the top-ranked candidate entity denotes the target entity. After every iteration, we update the graph according to the changes in candidates and disambiguated entities and proceed until all surface forms have been processed.
We note that we apply PageRank only once in step 3 for performance reasons. The EL graph in step 4 usually contains only a few candidate entities; we therefore apply PageRank in every iteration, which also maximizes accuracy in the abstaining task. The margin parameter used to compute the high-probability threshold differs from the one used for the abstaining threshold ($\mathit{margin}_1$ and $\mathit{margin}_2$, respectively). Information about the parameter choice is presented in Section 8.4.1.
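The iterative procedure of the last step can be sketched as follows. This is a simplified illustration under several assumptions. The function `rank_fn` is a hypothetical stand-in for graph construction and PageRank (Section 8.2.5). Surface forms are processed in input order. The relatedness table and the margin value are invented for the example.

```python
def final_linking(candidates, linked, rank_fn, margin):
    """Link or abstain for every remaining ambiguous surface form.

    candidates: dict mapping a surface form to >= 2 candidate entities
    linked:     iterable of already linked entities
    rank_fn:    callable(sf, remaining, linked) returning (entity, score)
                pairs in descending score order (stand-in for PageRank)
    Returns a dict mapping each surface form to an entity or None (NIL).
    """
    remaining = {sf: list(cands) for sf, cands in candidates.items()}
    linked = set(linked)
    assignment = {}
    while remaining:
        sf = next(iter(remaining))
        # Scores are recomputed in every iteration, so entities linked
        # earlier influence the ranking of later surface forms.
        ranked = rank_fn(sf, remaining, linked)
        scores = [score for _, score in ranked]
        h, s = scores[0], scores[1]
        avg = sum(scores) / len(scores)
        abstaining_threshold = h - margin * (h - avg)  # Equation 8.1
        if s < abstaining_threshold:
            assignment[sf] = ranked[0][0]
            linked.add(ranked[0][0])
        else:
            assignment[sf] = None  # abstain and return NIL
        del remaining[sf]  # corresponds to updating the graph
    return assignment


# Hypothetical relatedness between entity pairs, used by the toy ranker.
RELATED = {frozenset(("Paris_(city)", "France")): 1.0}


def toy_rank(sf, remaining, linked):
    """Toy scorer: base weight plus relatedness to linked entities."""
    raw = {e: 0.1 + sum(RELATED.get(frozenset((e, l)), 0.0) for l in linked)
           for e in remaining[sf]}
    total = sum(raw.values())
    return sorted(((e, v / total) for e, v in raw.items()),
                  key=lambda pair: pair[1], reverse=True)


candidates = {
    "Paris": ["Paris_(city)", "Paris_Hilton", "Paris_(Texas)"],
    "Jordan": ["Michael_Jordan", "Jordan_(country)"],
}
print(final_linking(candidates, {"France"}, toy_rank, margin=0.5))
```

In this toy run, "Paris" is linked to the city because its score dominates the other candidates, whereas the two candidates for "Jordan" receive identical scores, so the sketch abstains and returns NIL (represented as `None`).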