8.2.3 DoSeR Entity Linking Algorithm

Given the previously constructed EL index, our algorithm accepts documents that contain one or more surface forms that should be linked to entities. It links all surface forms within a document using a collective, graph-based approach. Overall, given a set of surface forms, our algorithm seeks the optimal entity assignment $\mathbf{\Omega}$ and can be subdivided into four main steps. Algorithm 5 gives an overview of the entire process, whose steps are explained in the following.

Candidate Entity Generation

The first step in our EL chain is Candidate Entity Generation. The goal is to reduce the number of possible candidate entities for each input surface form $m_i$ by determining a set of relevant target entities, the target candidate entity set $\Omega_i$ for surface form $m_i$. Details of our candidate generation process are described in Section 8.2.4. Given the candidates, we directly resolve surface forms with zero or one candidate entity: a surface form without candidates is assigned NIL, and a surface form with exactly one candidate is linked to it. We also initialize the entity set $E_d$ with the entities of unambiguous surface forms or already linked surface forms (Lines 2-7).

Semantic Embedding Candidate Filter

Our second step, Semantic Embedding Candidate Filter, keeps only those candidate entities that fit the general topic described by the already disambiguated entities (Lines 8-17); it requires at least 3 already assigned entities. The underlying assumption is that all entities in a paragraph are somehow topically related. To infer this general topic, we create a topic vector $\mathit{tv} = \sum_{e_j \in E_d} \mathit{vec}(e_j)$, with $E_d$ being the set of already linked entities and $\mathit{vec}(e_j)$ being the entity embedding of entity $e_j$ (Word2Vec vector). Next, we compute the semantic similarity (cosine similarity) between the general topic vector $\mathit{tv}$ and the candidate entities of all not yet disambiguated surface forms. If the similarity exceeds the a-priori given CandidateFilter threshold $\lambda$, the candidate entity remains in the candidate list of the respective surface form. If exactly one candidate remains, the surface form is linked to it (Lines 16-17). If no candidate of a specific surface form exceeds the threshold, the candidate set for this surface form remains unchanged. We note that this filter is a crucial step toward fast and accurate EL. Omitting this step makes the algorithm significantly slower and also lowers result quality (by ≈2 to 5 percentage points F1, depending on the data set).
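The filter described above can be illustrated with a minimal sketch. It assumes toy two-dimensional vectors in place of real Word2Vec entity embeddings; the function names (`topic_vector`, `filter_candidates`), the entity identifiers, and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two equally long vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_vector(linked_vecs):
    """Component-wise sum of the embeddings of all already linked entities."""
    return [sum(components) for components in zip(*linked_vecs)]


def filter_candidates(candidates, linked_vecs, lam, min_linked=3):
    """Keep candidates whose similarity to the topic vector exceeds lam.

    candidates:  dict mapping candidate entity -> embedding (one surface form)
    linked_vecs: embeddings of the already linked entities
    The candidate set is returned unchanged if fewer than min_linked
    entities are linked or if no candidate exceeds the threshold.
    """
    if len(linked_vecs) < min_linked:
        return dict(candidates)
    tv = topic_vector(linked_vecs)
    kept = {e: vec for e, vec in candidates.items() if cosine_sim(tv, vec) > lam}
    return kept if kept else dict(candidates)


# Hypothetical toy data: three linked entities pointing in a similar direction.
linked = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
candidates = {"Jaguar_Cars": [1.0, 0.1], "Jaguar_(animal)": [0.0, 1.0]}
filtered = filter_candidates(candidates, linked, lam=0.5)
print(sorted(filtered))
```

In this toy setting only the candidate that points in roughly the same direction as the topic vector survives, so the surface form becomes unambiguous and could be linked directly.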

High Probability Candidate Linking

The third step, High Probability Candidate Linking, applies PageRank to an EL graph to link high-probability candidates (Lines 18-24). Detailed information on graph construction and PageRank can be found in Section 8.2.5. Next, we rank the candidate entities of each surface form in descending order of the relevance score given by the PageRank algorithm. Additionally, we select the highest PageRank score $h$, the second-highest PageRank score $s$, and the average PageRank score $\mathit{avg}$ across all entities that belong to the same surface form. Given these parameters, we define a threshold $\mathit{dynThreshold}$ for determining the certainty in the ranking based on the differences between the first and the second ranked candidate:

\[ \mathit{dynThreshold} = h - \mathit{margin}_1 \cdot (h - \mathit{avg}) \tag{8.1} \]

where details on the parameter $\mathit{margin}_1$ are discussed in Section 8.4.1. We use this threshold as a certainty criterion, indicating whether the top-ranked candidate entity of a surface form is the correct target. More specifically, if the PageRank score $s$ of the second ranked candidate does not exceed the threshold $\mathit{dynThreshold}$, the highest ranked entity denotes the target entity of its surface form. In other words, if the relevance score margin between the highest ranked candidate and the other candidates is large, then the likelihood of the top-ranked candidate being the correct target entity is also high. If the threshold is exceeded, we reduce the candidate set of the respective surface form to the top-4 ranked candidate entities.

Algorithm 5: Our graph-based EL algorithm integrated in DoSeR

input:  $M = \langle m_1, \ldots, m_S \rangle$, threshold $\lambda$, $\mathit{margin}_1$, $\mathit{margin}_2$
output: assignment $\mathbf{\Omega} = \langle t_{1j}, \ldots, t_{Sk} \rangle$, with $t_{ij}$ denoting the assigned entity $e_j$ of $m_i$

 1  configuration $\mathbf{\Omega} = \mathit{tuple}()$; linked entities $E_d = \emptyset$; candidate set $\Omega_i = \emptyset$
    // Candidate Entity Generation
 2  for $m_i \in M$ do
 3      $\Omega_i = \mathit{generateCandidates}(m_i)$
 4      if $|\Omega_i| = 0$ then
 5          $\mathbf{\Omega}(i) = \mathit{NIL}$
 6      else if $|\Omega_i| = 1$ then
 7          $\mathbf{\Omega}(i) = e_j \in \Omega_i$; $E_d = E_d \cup \Omega_i$
    // Semantic Embedding Candidate Filter
 8  if $|E_d| > 2$ then
 9      for $m_i \in M$ and $|\Omega_i| > 1$ do
10          $\mathit{set} = \emptyset$
11          for $e_j \in \Omega_i$ do
12              if $\mathit{cosineSim}(\mathit{sumEmbeddings}(E_d), e_j) > \lambda$ then
13                  $\mathit{set} = \mathit{set} \cup e_j$
14          if $\mathit{set} \neq \emptyset$ then
15              $\Omega_i = \mathit{set}$
16              if $|\mathit{set}| = 1$ then
17                  $\mathbf{\Omega}(i) = \Omega_i$; $E_d = E_d \cup \Omega_i$
    // High Probability Candidate Linking
18  CreateDisambiguationGraphAndSolvePageRank($\Omega_i$, $E_d$); rank candidates
19  Select highest PR score $h$, second-highest PR score $s$, average PR score $\mathit{avg}$
20  for $m_i \in M$ and $|\Omega_i| > 1$ do
21      if $s < \mathit{dynThreshold}$ then
22          $\mathbf{\Omega}(i) = \mathit{getEntityOf}(h)$; $\Omega_i = \mathit{getEntityOf}(h)$; $E_d = E_d \cup \Omega_i$
23      else
24          $\Omega_i = \mathit{selectTop4RankedCandidates}$
    // Final Linking and Abstaining
25  CreateDisambiguationGraph($\Omega_i$, $E_d$)
26  for $m_i \in M$ and $|\Omega_i| > 1$ do
27      Perform PR and rank candidates; select PR scores $h$, $s$ and $\mathit{avg}$
28      if $s < \mathit{abstainingThreshold}$ then
29          $\mathbf{\Omega}(i) = \mathit{getEntityOf}(h)$; $\Omega_i = \mathit{getEntityOf}(h)$; $E_d = E_d \cup \Omega_i$
30      else
31          $\mathbf{\Omega}(i) = \mathit{NIL}$; $\Omega_i = \emptyset$
32      updateGraph($\Omega_i$, $E_d$)
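As a small worked illustration of the certainty criterion: with $h = 0.6$, $\mathit{avg} = 0.2$ and an assumed margin of 0.5, Equation 8.1 gives $0.6 - 0.5 \cdot (0.6 - 0.2) = 0.4$, so a second-ranked score of 0.1 lies clearly below the threshold. The sketch below applies the same rule to the PageRank scores of one surface form. The scores, the margin value, and the function names are hypothetical; the actual margin is discussed in Section 8.4.1, and the comparison `s < threshold` follows Line 21 of Algorithm 5.

```python
def dyn_threshold(h, avg, margin):
    """Equation 8.1: h - margin * (h - avg)."""
    return h - margin * (h - avg)


def link_or_prune(scores, margin1, top_k=4):
    """Decide for one surface form with at least two candidates.

    scores: dict mapping candidate entity -> PageRank score
    Returns (linked_entity, remaining_candidates). linked_entity is None
    when the ranking is not certain enough; the candidate list is then
    reduced to the top_k ranked candidates.
    """
    if len(scores) < 2:
        raise ValueError("expects a surface form with at least two candidates")
    ranked = sorted(scores, key=scores.get, reverse=True)
    h = scores[ranked[0]]                     # highest score
    s = scores[ranked[1]]                     # second-highest score
    avg = sum(scores.values()) / len(scores)  # average over this surface form
    if s < dyn_threshold(h, avg, margin1):
        return ranked[0], [ranked[0]]
    return None, ranked[:top_k]


# Hypothetical scores: one clear winner, and one close race.
clear = {"A": 0.6, "B": 0.1, "C": 0.1, "D": 0.1, "E": 0.1}
close = {"A": 0.30, "B": 0.28, "C": 0.20, "D": 0.12, "E": 0.10}

print(link_or_prune(clear, margin1=0.5))
print(link_or_prune(close, margin1=0.5))
```

In the first case the top candidate is linked immediately; in the second the decision is deferred and only the four best candidates are passed on to the final step.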

Final Linking and Abstaining

The last step, Final Linking and Abstaining, links the remaining entities or abstains if the algorithm is uncertain about the correct target entity (Lines 25-32). We first create an EL graph (cf. Section 8.2.5) and then iteratively link the entities of the remaining surface forms. For this purpose, every iteration applies the PageRank algorithm to the underlying graph and ranks the candidate entities of each surface form in descending order. The scores $h$, $s$, and $\mathit{avg}$ are calculated as in the previous step. The abstaining threshold $\mathit{abstainingThreshold}$ is calculated using Equation 8.1 with a different margin parameter ($\mathit{margin}_2$). If the PageRank score of the second ranked candidate entity exceeds the abstaining threshold, the algorithm returns the NIL identifier for the respective surface form. Otherwise, the top ranked candidate entity denotes the target entity. After every iteration, we update the graph according to the changes in candidates and disambiguated entities and proceed until all surface forms have been processed.

We note that we apply PageRank only once in step 3 for performance reasons. The EL graph in step 4 usually does not include many candidate entities; thus, we apply PageRank in every iteration, which also provides the maximum accuracy in the abstaining task. The $\mathit{margin}$ parameter used to compute the high probability threshold and the abstaining threshold differs between the two steps. Information about the parameter choice is presented in Section 8.4.1.
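The control flow of the final step (Lines 25-32) can be sketched as follows. This is only an outline under strong assumptions: graph construction and PageRank (Section 8.2.5) are not reproduced here and are replaced by a pluggable `score_fn`. The toy scorer `toy_scores`, the `RELATED` table, the entity identifiers, and the margin value are all hypothetical stand-ins, not part of DoSeR.

```python
def final_linking(candidates, linked, score_fn, margin2):
    """Iteratively link or abstain for every still ambiguous surface form.

    candidates: dict surface form -> list of candidate entities (mutated)
    linked:     set of already linked entities (mutated)
    score_fn:   stand-in for "build graph + PageRank"; it is re-run in
                every iteration, mirroring the per-iteration graph update
    Returns a dict surface form -> linked entity, or None for NIL.
    """
    assignment = {}
    for mention in list(candidates):
        cands = candidates[mention]
        if len(cands) <= 1:
            continue
        scores = score_fn(candidates, linked)[mention]
        ranked = sorted(cands, key=scores.get, reverse=True)
        h, s = scores[ranked[0]], scores[ranked[1]]
        avg = sum(scores[c] for c in cands) / len(cands)
        abstaining_threshold = h - margin2 * (h - avg)  # Equation 8.1 with margin2
        if s < abstaining_threshold:
            assignment[mention] = ranked[0]
            candidates[mention] = [ranked[0]]
            linked.add(ranked[0])
        else:
            assignment[mention] = None  # abstain: NIL
            candidates[mention] = []
    return assignment


# Hypothetical relatedness table used only by the toy scorer below.
RELATED = {
    "Paris_(France)": {"France", "Seine"},
    "Paris_(Texas)": {"Texas"},
}


def toy_scores(candidates, linked):
    """NOT PageRank: scores a candidate by its overlap with linked entities."""
    result = {}
    for mention, cands in candidates.items():
        raw = {c: 1 + len(RELATED.get(c, set()) & linked) for c in cands}
        total = sum(raw.values())
        result[mention] = {c: v / total for c, v in raw.items()}
    return result


cands = {
    "Paris": ["Paris_(France)", "Paris_(Texas)", "Paris_Hilton"],
    "Springfield": ["Springfield_(Illinois)", "Springfield_(Massachusetts)"],
}
linked = {"France", "Seine", "Eiffel_Tower"}
print(final_linking(cands, linked, toy_scores, margin2=0.5))
```

With these toy inputs the first surface form has a clear winner and is linked, while the second has two equally scored candidates, so the sketch abstains and returns NIL for it.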