Context Pruning for Speeding Up Pairwise Entity Coreference

4.2 End Node Similarity Estimation with Random Sampling

Algorithm 5 presents the details of the Continue function adopted in Algorithm 4. One open question is the order in which the algorithm should consider the paths of an instance.

As discussed in Section 3.2, each path has a path weight that is calculated from the discriminability and the discounting factor of its constituent triples. One approach is therefore to prioritize paths by weight, i.e., paths with higher weight are considered first. A perfect match on high-weight paths indicates that the algorithm should continue to process the remaining context, while a mismatch on high-weight paths can help the algorithm stop early for non-coreferent instance pairs before more effort is wasted.
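The weight-based ordering described above can be sketched as follows. This is only an illustration: the Path class, its field names, and the example weights are assumptions made here and do not come from the original algorithm.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str      # hypothetical label for the path (illustration only)
    weight: float  # path weight, as defined in Section 3.2


def order_paths(paths):
    """Return the paths sorted so that higher-weight paths come first."""
    return sorted(paths, key=lambda p: p.weight, reverse=True)


# Made-up weights, used only to show the resulting order.
paths = [Path("affiliation", 0.2), Path("full-name", 0.7), Path("email", 0.9)]
ordered = order_paths(paths)
print([p.name for p in ordered])
```

With this ordering, a mismatch on the first few paths is a mismatch on the most heavily weighted evidence, which is what allows an early stop.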

Here, score is the weighted sum of the end node similarities of the already considered paths and weight is the sum of their corresponding path weights. Their ratio (current) represents the similarity between two instances based upon the already considered context. When calculating the potential similarity score between the two instances, we only consider the remaining paths whose estimated end node similarity (m′.est) is no less than current (line 8), since paths whose estimated end node similarity is smaller than current will only lower the final similarity measure.

Algorithm 5 Continue(samplingOnly, score, weight, N_a, index_m). samplingOnly indicates if the algorithm will use the utility function; score and weight are the sum of the end node similarity and their corresponding weight of the already considered paths; N_a is the context of instance a; index_m is the index of path m; returns a boolean value.

1. if samplingOnly is false then
2.     u ← Utility(index_m, |N_a|)
3.     if u < 0 then
4.         return true
5. current ← score / weight
6. for all paths m′ ∈ N_a do
7.     m′.est ← the estimated end node similarity for path m′
8.     if m′ has not been considered and m′.est ≥ current then
9.         score ← score + m′.est ∗ m′.weight
10.        weight ← weight + m′.weight
11. if score / weight > θ then
12.     return true
13. else
14.     return false
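A minimal Python sketch of Algorithm 5 is given below. It follows the numbered steps, but several details are assumptions made for illustration: the Path class and its fields, the name should_continue (continue is a reserved word in Python), passing θ and the Utility function as arguments, and the requirement that weight is positive on entry. The Utility function itself is not defined in this section, so the caller supplies it.

```python
from dataclasses import dataclass


@dataclass
class Path:
    weight: float             # path weight
    est: float                # estimated end node similarity for this path
    considered: bool = False  # True once the path has been compared


def should_continue(sampling_only, score, weight, context, index_m, theta,
                    utility=None):
    """Decide whether to keep processing the context of an instance.

    `utility` stands in for the Utility function of Algorithm 5; it is a
    caller-supplied callable taking (index_m, context_size).
    """
    if not sampling_only:
        if utility is None:
            raise ValueError("a utility function is required here")
        if utility(index_m, len(context)) < 0:
            return True  # lines 1-4 of Algorithm 5
    # Assumption: weight > 0, i.e. at least one path was already considered.
    current = score / weight
    for m in context:
        # Only unseen paths that could raise the score contribute (line 8).
        if not m.considered and m.est >= current:
            score += m.est * m.weight
            weight += m.weight
    return score / weight > theta


# Toy values: one considered path (weight 0.5, similarity 0.4) gives
# score = 0.2 and weight = 0.5, hence current = 0.4.
context = [Path(0.5, 0.4, considered=True), Path(0.3, 0.9), Path(0.2, 0.1)]
print(should_continue(True, 0.2, 0.5, context, 0, theta=0.5))
print(should_continue(True, 0.2, 0.5, context, 0, theta=0.7))
```

In the toy call, only the path with estimate 0.9 is added, so the potential similarity is roughly 0.47 / 0.8; the path with estimate 0.1 is skipped because it is below current.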

Our entity coreference algorithm (ComparePruning) computes coreference relationships by measuring the similarity between the end nodes of two paths. One key factor in applying our pruning technique is therefore to appropriately estimate the similarity that the last node of a path could potentially have with that of its comparable paths of another instance (est in Equation 4.2), i.e., the potential contribution of each path. The higher the similarity of a path, the more it can contribute to the final score between two instances.

4.2.1 Estimating URI Path Contribution

Since our algorithm only checks if two URIs are identical, Equation 4.3 is used to estimate URI end node similarity for an object value of a set of comparable properties:

$$\mathrm{est}(G, P, obj) = \frac{|\{t \mid t = \langle s, p, obj \rangle \wedge t \in G\}|}{|\{t \mid t = \langle s, p, x \rangle \wedge t \in G\}|}, \quad p \in P \qquad (4.3)$$

where G is an RDF graph; P is a set of comparable object properties; obj is a specific object of a property in P. The estimate represents how likely one URI node is to meet an identical node in RDF graph G, and we use it as the estimated similarity for each specific object of property p ∈ P. Similarly, we could compute the estimated similarity for the subject values of all object properties.
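The URI estimate can be sketched in a few lines of Python. The sketch assumes the graph is available as an iterable of (subject, predicate, object) tuples and reads the estimate as a ratio of triple counts; the function name and the ex: identifiers are hypothetical and serve only as an illustration.

```python
def estimate_uri_similarity(graph, properties, obj):
    """Fraction of triples over the comparable properties whose object is obj.

    graph: iterable of (subject, predicate, object) tuples (assumed layout).
    properties: set of comparable object properties P.
    """
    total = 0
    matching = 0
    for _s, p, o in graph:
        if p in properties:
            total += 1          # counts triples <s, p, x> with p in P
            if o == obj:
                matching += 1   # counts triples <s, p, obj> with p in P
    return matching / total if total else 0.0


# Hypothetical toy graph; the identifiers are made up for this example.
graph = [
    ("ex:a", "ex:author", "ex:paper1"),
    ("ex:b", "ex:author", "ex:paper1"),
    ("ex:c", "ex:author", "ex:paper2"),
    ("ex:d", "ex:creator", "ex:paper3"),
    ("ex:a", "ex:name", "Alice"),
]
comparable = {"ex:author", "ex:creator"}
print(estimate_uri_similarity(graph, comparable, "ex:paper1"))  # 2 of 4 triples
```

The same function could be applied to subject values by swapping the roles of subject and object in the loop.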

4.2.2 Estimating Literal Path Contribution

For literal paths, we could perform such estimation for each specific object value of every predicate. Intuitively, for each object value o of predicate p_i, we calculate its similarity scores to all other object values of the same predicate, draw the distribution of these scores, and choose the most likely score based upon the distribution.

However, when this technique is applied to large datasets, where a predicate could be associated with tens of thousands or even millions of distinct literal values, the naive approach could be extremely expensive. One alternative is to compare each object value of predicate p_i to only a subset of p_i’s object values. This could speed up the processing for each particular object, but it will still be expensive when a predicate has a large number of distinct object values. Essentially, this intuitive approach performs the estimation for every single distinct value per predicate.

In our approach, to estimate the similarity for literal nodes, we randomly select a certain number of literal values of a property, conduct a pairwise comparison among all the selected literals, and obtain the estimated similarity score as shown in Equation 4.4:

$$\mathrm{est}(G, P) = \arg\min_{score} \left( \frac{|\{(o_1, o_2) \mid o_1, o_2 \in Subset(G, P) \wedge Sim(o_1, o_2) \le score\}|}{|\{(o_1, o_2) \mid o_1, o_2 \in Subset(G, P)\}|} > \gamma \right) \qquad (4.4)$$

Here, P is a set of comparable datatype properties; Subset(G, P) randomly selects some number of literal values of P, such as o₁ and o₂, whose similarity is computed with the function Sim; γ is a percentage value that controls how many pairwise comparisons should be covered in order to give the estimated node similarity. The intuition is to find a sufficiently high similarity score as the potential similarity (contribution) between two literal values of P in order to reduce the chance of missing too many true matches. We do not want to set γ too high, since that would overestimate the potential contribution of paths ending on predicates in P. To summarize, the estimation for literal paths is calculated with respect to each individual property, not for every distinct property value.
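The sampling-based estimate for literal values can be sketched as follows. Several choices here are assumptions rather than part of the original method: the default similarity uses difflib from the Python standard library as a stand-in for Sim, pairs are taken as unordered pairs of distinct sampled values, and the sample size, the seed, and the example names are illustrative.

```python
import random
from bisect import bisect_right
from difflib import SequenceMatcher
from itertools import combinations


def default_sim(a, b):
    # Stand-in for Sim (assumption): the similarity function is not fixed here.
    return SequenceMatcher(None, a, b).ratio()


def estimate_literal_similarity(values, sample_size, gamma,
                                sim=default_sim, seed=0):
    """Smallest score whose share of pairwise similarities <= score exceeds gamma."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must be in [0, 1)")
    values = list(values)
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))  # Subset(G, P)
    scores = sorted(sim(a, b) for a, b in combinations(sample, 2))
    if not scores:
        raise ValueError("need at least two literal values")
    total = len(scores)
    for s in scores:
        # bisect_right counts the pairs whose similarity is <= s (ties included).
        if bisect_right(scores, s) / total > gamma:
            return s
    return scores[-1]  # not reached when gamma < 1


# Made-up names, used only to exercise the function.
names = ["John Smith", "Jon Smith", "Mary Jones", "M. Jones", "Wei Zhang"]
print(estimate_literal_similarity(names, sample_size=5, gamma=0.8))
```

Because the estimate is computed once per property from a fixed-size sample, its cost depends on the sample size rather than on the number of distinct values of the property.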

One concrete example of literal path similarity estimation with Equation 4.4 is given in Figure 4.1 for the full name predicate in the RKB Person dataset. In this example, 1,000 object values of the full name predicate were randomly selected, and their pairwise similarity scores were discretized to float values from 0.1 to 1 with an interval of 0.1. Figure 4.1 plots each candidate estimated score (x-axis) against the percentage of pairwise comparison scores covered at that score (y-axis). Looking at the similarity scores from left to right, when reaching 0.5, more than 99% of the pairwise scores have been covered; thus, 0.5 could be used as the estimated contribution for paths that end on the full name predicate.

Figure 4.1: Percentage based Path Similarity Estimation
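The coverage curve behind this kind of plot can be computed as sketched below. The scores are invented toy values and not the RKB Person data; rounding each score up to the next multiple of 0.1 is also an assumption, since the exact discretization rule is not stated.

```python
import math


def coverage_curve(scores, bins=10):
    """Return (bin value, cumulative fraction covered) for bins 0.1 .. 1.0."""
    counts = [0] * bins
    for s in scores:
        # Assumed rule: round up to the next bin; round() guards float noise.
        idx = max(1, math.ceil(round(s * bins, 9)))
        counts[min(idx, bins) - 1] += 1
    curve, running = [], 0
    for i, c in enumerate(counts, start=1):
        running += c
        curve.append((i / bins, running / len(scores)))
    return curve


def first_bin_covering(curve, gamma):
    """First bin value whose cumulative coverage exceeds gamma, else None."""
    for score, covered in curve:
        if covered > gamma:
            return score
    return None


# Toy scores (made up): most pairs are dissimilar, a few are near matches.
toy_scores = [0.05] * 60 + [0.25] * 30 + [0.45] * 9 + [0.95] * 1
curve = coverage_curve(toy_scores)
for score, covered in curve:
    print(f"{score:.1f}  {covered:.2f}")
print(first_bin_covering(curve, gamma=0.98))
```

Reading the printed curve from the smallest bin upward mirrors reading the figure from left to right: the estimate is the first bin at which the covered fraction passes γ.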

In document Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web (Page 111-115)