Context Pruning for Speeding Up Pairwise Entity Coreference
4.2 End Node Similarity Estimation with Random Sampling
Algorithm 5 presents the details of the Continue function adopted in Algorithm 4. One question here is the order in which the algorithm should consider the paths of an instance.
As discussed in Section 3.2, each path has a path weight that is calculated from the discriminability and the discounting factor of its constituent triples; one approach is therefore to prioritize the paths by weight, i.e., paths with higher weight are considered first. A perfect match on high-weight paths indicates that the algorithm should continue to process the remaining context, while a mismatch on high-weight paths helps the algorithm stop early for non-coreferent instance pairs before more effort is wasted.
Here, score and weight are the sums of the end node similarities and of the corresponding path weights, respectively, over the paths already considered. Their ratio, current (line 5), represents the similarity between the two instances based upon the already considered context. When calculating the potential similarity score between the two instances, we only consider the remaining paths whose estimated end node similarity (m′.est) is no less than current (line 8), since paths whose estimated end node similarity is smaller than current can only lower the final similarity measure.
Algorithm 5 Continue(samplingOnly, score, weight, Na, indexm). samplingOnly indicates if the algorithm will use the utility function; score and weight are the sum of the end node similarity and their corresponding weight of the already considered paths; Na is the context of instance a; indexm is the index of path m; returns a boolean value
 1. if samplingOnly is false then
 2.     u ← Utility(indexm, |Na|)
 3.     if u < 0 then
 4.         return true
 5. current ← score / weight
 6. for all paths m′ ∈ Na do
 7.     m′.est ← the estimated end node similarity for path m′
 8.     if m′ has not been considered and m′.est ≥ current then
 9.         score ← score + m′.est ∗ m′.weight
10.         weight ← weight + m′.weight
11. if score / weight > θ then
12.     return true
13. else
14.     return false
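The control flow of Algorithm 5 can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: the `Path` record, the `utility` callable (standing in for the Utility function, which is defined elsewhere), passing the threshold θ as the `theta` parameter, and the guard against a zero weight are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Path:
    """Hypothetical record for one path in the context N_a of instance a."""
    weight: float             # path weight (Section 3.2)
    est: float                # estimated end node similarity for this path
    considered: bool = False  # True once the path has been compared


def should_continue(
    sampling_only: bool,
    score: float,
    weight: float,
    paths: Sequence[Path],
    index_m: int,
    theta: float,
    utility: Optional[Callable[[int, int], float]] = None,
) -> bool:
    """Sketch of Continue: decide whether to keep processing the context."""
    # Lines 1-4: optionally consult the utility function first.
    if not sampling_only and utility is not None:
        if utility(index_m, len(paths)) < 0:
            return True

    # Line 5: similarity based on the context considered so far.
    # Assumption: treat the similarity as 0 when no weight has accumulated.
    current = score / weight if weight > 0 else 0.0

    # Lines 6-10: add the estimated contribution of every remaining path
    # whose estimate is at least the current similarity.
    for p in paths:
        if not p.considered and p.est >= current:
            score += p.est * p.weight
            weight += p.weight

    # Lines 11-14: continue only if the potential similarity exceeds theta.
    return weight > 0 and score / weight > theta
```

For example, with an accumulated `score` of 0.1 and `weight` of 0.5, a remaining path with estimate 0.9 and weight 0.3 raises the potential similarity to 0.37 / 0.8 ≈ 0.46, so the function returns true for θ = 0.4 and false for θ = 0.8.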
Our entity coreference algorithm (ComparePruning) computes coreference relationships by measuring the similarity between the end nodes of two paths. One key factor in applying our pruning technique is therefore to appropriately estimate the similarity that the last node of a path could potentially have with that of its comparable paths of another instance (est in Equation 4.2), i.e., the potential contribution of each path. The higher the similarity of a path, the more it can contribute to the final score between two instances.
4.2.1 Estimating URI Path Contribution
Since our algorithm only checks if two URIs are identical, Equation 4.3 is used to estimate URI end node similarity for an object value of a set of comparable properties:
$$\mathrm{est}(G, P, obj) = \frac{|\{t \mid t = \langle s, p, obj \rangle \wedge t \in G\}|}{|\{t \mid t = \langle s, p, x \rangle \wedge t \in G\}|}, \quad p \in P \tag{4.3}$$
where G is an RDF graph; P is a set of comparable object properties; and obj is a specific object value of the properties in P. The estimate represents how likely one URI node is to meet an identical node in RDF graph G, and we calculate it as the estimated similarity for each specific object of property p ∈ P. Similarly, we could compute the estimated similarity for the subject values of all object properties.
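Equation 4.3 amounts to a ratio of two triple counts, which the following Python sketch illustrates. The representation of the graph as an iterable of `(subject, predicate, object)` tuples, the function name, and the choice to return 0 when no triple uses a property in P are assumptions of this sketch.

```python
from typing import Hashable, Iterable, Set, Tuple

Triple = Tuple[Hashable, Hashable, Hashable]  # (subject, predicate, object)


def estimate_uri_similarity(
    graph: Iterable[Triple], properties: Set[Hashable], obj: Hashable
) -> float:
    """Fraction of triples over the comparable properties whose object is obj."""
    matching = 0  # numerator: triples <s, p, obj> with p in P
    total = 0     # denominator: triples <s, p, x> with p in P
    for _subject, predicate, value in graph:
        if predicate in properties:
            total += 1
            if value == obj:
                matching += 1
    # Assumption: the estimate is 0 when no triple uses a property in P.
    return matching / total if total else 0.0


# Toy graph with hypothetical identifiers.
toy_graph = [
    ("a", "author", "x"),
    ("b", "author", "x"),
    ("c", "author", "y"),
    ("d", "editor", "x"),
    ("e", "title", "x"),
]
print(estimate_uri_similarity(toy_graph, {"author", "editor"}, "x"))  # 0.75
```

In the toy graph, four triples use a property in P and three of them have object `x`, so the estimate is 3/4.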
4.2.2 Estimating Literal Path Contribution
For literal paths, we could perform such estimation for each specific object value of every predicate. Intuitively, for each object value o of predicate pi, we calculate its similarity scores to all other object values of the same predicate, draw the distribution of these scores, and choose the most likely score based upon the distribution.
However, when this technique is applied to large datasets, where a predicate could be associated with tens of thousands or even millions of distinct literal values, this naive approach could be extremely expensive. One alternative is to compare each object value of predicate pi only to a subset of pi’s object values. This could speed up the processing for each particular object, but it will still be expensive when a predicate has a large number of distinct object values. Essentially, this intuitive approach performs the estimation for every single distinct value per predicate.
In our approach, to estimate for literal nodes, we randomly select a certain number of literal values of a property, conduct a pairwise comparison among all the selected literals, and finally get the estimated similarity score as shown in Equation 4.4.

$$\mathrm{est}(G, P) = \arg\min_{score} \frac{|\{(o_1, o_2) \mid o_1, o_2 \in Subset(G, P) \wedge Sim(o_1, o_2) \le score\}|}{|\{(o_1, o_2) \mid o_1, o_2 \in Subset(G, P)\}|} > \gamma \tag{4.4}$$

Here, P is a set of comparable datatype properties; Subset(G, P) randomly selects some number of literal values of P, such as o1 and o2, whose similarity is computed with the function Sim; γ is a percentage value that controls how many pairwise comparisons should be covered in order to give the estimated node similarity. The intuition here is to find a sufficiently high similarity score as the potential similarity (contribution) between two literal values of P in order to reduce the chance of missing too many true matches. We do not want to set γ too high, since that way we are actually overestimating the potential contribution of paths ending on predicates in P. To summarize, the estimation of literal paths is calculated with respect to each individual property, not for every distinct property value.
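The sampling procedure behind Equation 4.4 can be illustrated with a short Python sketch. Several details here are assumptions rather than part of the original method: the function names, the use of unordered pairs of distinct sampled values, the `seed` parameter, and the default similarity function `string_sim` (built on `difflib.SequenceMatcher`), which is only a placeholder for Sim.

```python
import itertools
import random
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence


def string_sim(a: str, b: str) -> float:
    """Placeholder for Sim; the actual similarity function is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def estimate_literal_similarity(
    values: Sequence,
    sample_size: int,
    gamma: float,
    sim: Callable = string_sim,
    seed: Optional[int] = None,
) -> float:
    """Smallest score such that more than gamma of the sampled pairs
    have a similarity no greater than that score."""
    rng = random.Random(seed)
    values = list(values)
    # Subset(G, P): a random sample of the literal values of the property.
    sample = rng.sample(values, min(sample_size, len(values)))
    # Pairwise comparison among all sampled literals (unordered pairs).
    scores = sorted(sim(a, b) for a, b in itertools.combinations(sample, 2))
    if not scores:
        return 0.0  # assumption: no pairs means no evidence of similarity
    n = len(scores)
    for i, s in enumerate(scores):
        # (i + 1) / n is a lower bound on the fraction of pairs with
        # similarity <= s; the first s that exceeds gamma is the minimum.
        if (i + 1) / n > gamma:
            return s
    return scores[-1]
```

Because the estimate depends only on the sampled values of a property, it is computed once per property rather than once per distinct value, which matches the summary above. Raising `gamma` moves the estimate toward the largest observed pairwise score.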
One concrete example is given in Figure 4.1 about literal path similarity estimation with Equation 4.4 for the full name predicate in the RKB Person dataset. In this example, 1,000 object values of the full name predicate were randomly selected, and their pairwise similarity scores are discretized to float values from 0.1 to 1 with an interval of 0.1. Figure 4.1 shows the estimated score (x-axis) by considering covering sufficient pairwise comparison scores (y-axis). Looking at the similarity scores from left to right, when reaching 0.5, more than 99% of the pairwise scores have been covered; thus, 0.5 could be used as the estimated contribution for paths that end on the full name predicate.

Figure 4.1: Percentage based Path Similarity Estimation