Background and Related Work
2.4 Scaling Entity Coreference Systems
One common problem with current approaches is scalability. Many current systems perform exhaustive pairwise comparison between instances and focus mainly on finding appropriate features and metrics for computing instance similarity. However, exhaustive pairwise comparison does not scale to large datasets (e.g., datasets with millions of instances). Blocking is one method for subdividing mentions into mutually exclusive blocks so that only mentions within the same block are compared. In database research, one traditional method for identifying duplicate records in a database table is to scan the table and compute the value of a hash function for each record. The hash values define the buckets to which the records are assigned. Such hash values can identify not only identical records but also records that are approximately similar [82]. To find duplicate records, it is then sufficient to compare only the records that fall into the same bucket.
More recently, blocking has also come to refer to finding a set of candidate pairs of mentions that could be coreferent, rather than a set of mutually exclusive blocks [42].
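The bucket-based scheme described above can be sketched as follows. This is a minimal illustration, not the method of any cited system: the blocking key (first three letters of the name plus the zip code), the record fields, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical key: first three letters of the name plus the zip code.
    return (record["name"].strip().lower()[:3], record["zip"])

def build_blocks(records):
    # Assign every record id to the bucket defined by its key.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    return blocks

def candidate_pairs(blocks):
    # Only records that fall into the same bucket are paired for comparison.
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

# Made-up records for illustration only.
records = {
    1: {"name": "Jonathan Smith", "zip": "18015"},
    2: {"name": "Jon Smith", "zip": "18015"},
    3: {"name": "Mary Jones", "zip": "18015"},
    4: {"name": "Jonathan Smith", "zip": "10001"},
}
pairs = candidate_pairs(build_blocks(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} of {all_pairs} pairs remain after blocking: {sorted(pairs)}")
```

In this toy example records 1 and 4 never meet because their zip codes differ, which shows how a single key can miss a potential match.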
Although blocking can substantially speed up the comparison process because it only compares identifiers in the same block, the technique has some problems. First, coreferent identifiers do not necessarily share the same value for a single blocking property; thus, multiple types of information are typically used to improve coverage of true matches. Furthermore, data quality must be considered. Noise in the data can significantly affect blocking results, causing blocking systems to place entries in the wrong buckets and thereby preventing them from being compared to their actual matches. In addition to computing string similarities of the surface forms, phonetic algorithms such as Soundex [83] and the New York State Identification and Intelligence System (NYSIIS) [84] can be adopted to better handle erroneous data, in the hope that misspelled words will still have the same phonetic codes.
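To make the phonetic idea concrete, the following is a small sketch of the standard American Soundex coding, written from the commonly published rules rather than taken from [83]. The function name and the sample names are illustrative assumptions.

```python
# Consonant groups of American Soundex; vowels, h, w, and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()           # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # digit of the previously coded letter
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not separate equal codes
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit               # adjacent equal digits are coded once
        prev = digit                    # a vowel resets prev to ""
    return (code + "000")[:4]           # pad or truncate to four characters

for a, b in (("Smith", "Smyth"), ("Robert", "Rupert")):
    print(a, soundex(a), b, soundex(b))
```

Used as a blocking key, such a code places "Smith" and "Smyth" in the same bucket even though their surface forms differ.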
2.4.1 Blocking with Manually Identified Key
Many approaches rely on human experts to determine what information to use for blocking and are generally very effective [85, 86, 41, 87, 88, 43]. Best Five [87] is a set of manually identified rules for matching census data. Sorted Neighborhood (SN) [88] sorts all entities on one or more key values (e.g., name for a person and title for a publication) and compares identifiers within a fixed-size window. Yan et al. [43] proposed a modified sorted neighborhood algorithm, Adaptive Sorted Neighborhood (ASN), to learn dynamically sized blocks for each record. The records are sorted based upon a manually determined key. For a record r, ASN automatically finds the next N records that might be coreferent to r, where N can vary from record to record. The authors claimed that changing the key did not affect the results, but they did not report any experiments to support this claim.
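The fixed-window variant of Sorted Neighborhood can be sketched as below. This is a simplified illustration of the general idea, not the implementation from [88], and it does not cover the adaptive window of ASN. The function name, the choice of sort key, and the records are assumptions made for the example.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Pair each record with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=lambda rid: key(records[rid]))
    pairs = set()
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add((rid, other))
    return pairs

# Made-up person records; the manually chosen key is the lower-cased name.
people = {
    1: {"name": "Smith, John"},
    2: {"name": "Smith, Jon"},
    3: {"name": "Smyth, John"},
    4: {"name": "Adams, Amy"},
    5: {"name": "Jones, Mary"},
}
name_key = lambda r: r["name"].lower()

narrow = sorted_neighborhood_pairs(people, name_key, window=2)
wide = sorted_neighborhood_pairs(people, name_key, window=3)
print(len(narrow), len(wide))
```

With a window of 2, "Smith, John" and "Smyth, John" are not compared because "Smith, Jon" sorts between them; widening the window to 3 recovers that pair at the cost of more comparisons.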
Silk [89] and Oyster [90] are two general frameworks for users to specify rules for performing record linkage, but it may be difficult for users to specify such rules for all domains.
Compared to these systems, we try to reduce the need for human input in developing entity coreference systems.
Although keys manually selected by domain experts can be very effective in many scenarios (e.g., census data), this manual process can be expensive, as the required expertise may not be available for every domain. Moreover, even when people have the necessary knowledge to identify what information to use for blocking, they may lack the time to write down the rules.
2.4.2 Automatic Blocking Key Selection
BSL [42] adopted supervised learning to learn a blocking scheme: a disjunction of conjunctions of (method, attribute) pairs. Here, a “method” refers to how attribute values will be compared. As a concrete example, a “method” could be “computing the Jaccard similarity between two attribute values”. BSL learns one conjunction at a time so as to eliminate as many pairs as possible; by running the learning process iteratively, more conjunctions are obtained in order to increase coverage of true matches. However, supervised approaches require sufficient training data, which may not always be available. As reported by Michelson and Knoblock [42], when 1/5 of the groundtruth was used for training, 4.68% fewer true matches were covered on the Restaurant dataset (described in Section 2.3.1). Even more importantly, BSL was not able to scale to a dataset with only about 23,000 records [44], since at each learning iteration it essentially needs to try every possible combination of (method, attribute) pairs and pick the best one (the one that eliminates the most pairs and covers the most true matches). In order to reduce the need for training data, Cao et al. [8] proposed a similar algorithm that utilizes both labeled and unlabeled data for learning the blocking scheme; however, the supervised nature of their method still requires a certain amount of available groundtruth.
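To illustrate what a blocking scheme in this form looks like, the sketch below evaluates a hand-written disjunction of conjunctions of (method, attribute) pairs on a few made-up records. It only shows how a given scheme is applied; it does not implement BSL's learning procedure. The two methods, the attributes, and the records are hypothetical.

```python
def exact(a, b):
    # Method 1: the two attribute values are identical.
    return a == b

def shares_token(a, b):
    # Method 2: the two attribute values have at least one token in common.
    return bool(set(a.lower().split()) & set(b.lower().split()))

# Disjunction (outer list) of conjunctions (inner lists) of (method, attribute).
scheme = [
    [(exact, "zip"), (shares_token, "name")],
    [(exact, "phone")],
]

def is_candidate(r1, r2, scheme):
    # A pair is kept if ANY conjunction holds, i.e., ALL of its tests pass.
    return any(
        all(method(r1[attr], r2[attr]) for method, attr in conjunction)
        for conjunction in scheme
    )

# Made-up records for illustration only.
a = {"name": "Blue Door Cafe", "zip": "00001", "phone": "555-0100"}
b = {"name": "The Blue Door", "zip": "00001", "phone": "555 0100"}
c = {"name": "Green Door Diner", "zip": "00002", "phone": "555-0100"}
d = {"name": "Harbor Grill", "zip": "00003", "phone": "555-0199"}

print(is_candidate(a, b, scheme), is_candidate(a, c, scheme), is_candidate(a, d, scheme))
```

Here the pair (a, b) is retained by the first conjunction and (a, c) only by the second, which is why adding conjunctions tends to increase coverage while also admitting more pairs.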
In contrast, Adaptive Filtering (AF) [91] is unsupervised and filters record pairs by computing their character-level bigram similarity. Marlin [92] uses an unnormalized Jaccard similarity on the tokens of two attribute values with the threshold set to 1, which essentially amounts to requiring that the two values share at least one identical token. Although it was able to cover all true matches on some datasets, it only reduced the pairs to consider by 55.35%. On large-scale datasets, such a reduction may not be sufficient to make coreference on the remaining instance pairs feasible.
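The two similarity tests mentioned above can be approximated as follows. The source does not give the exact formulas used by AF or Marlin, so this sketch assumes a Jaccard coefficient over character bigram sets for the former and a raw count of shared tokens for the latter; both functions are illustrative stand-ins.

```python
def bigrams(s):
    # Set of character-level bigrams of the lower-cased string.
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_jaccard(a, b):
    # Assumed form of bigram similarity: |X ∩ Y| / |X ∪ Y| over bigram sets.
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)

def shared_token_count(a, b):
    # Unnormalized overlap: the number of identical tokens in both values.
    return len(set(a.lower().split()) & set(b.lower().split()))

def passes_token_filter(a, b, threshold=1):
    # With threshold 1, a pair is kept as soon as one token is shared.
    return shared_token_count(a, b) >= threshold

print(bigram_jaccard("night", "nacht"))
print(passes_token_filter("Lehigh University", "University of Lehigh"))
```

A threshold of 1 is very permissive: any two values that share a common word pass the filter, which is consistent with the modest pair reduction reported above.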
2.4.3 Speeding Up Entity Coreference with Indexing Techniques
The Information Retrieval (IR) style inverted index, a technique typically used for fast data retrieval on the Web, has been widely adopted for speeding up the blocking process. An IR-based inverted index is typically built for a collection of documents where each document contains a set of terms. The index will have a term list that contains all the unique terms in this document collection; and each term will be associated with a posting list with all documents that contain this term.
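The structure just described, a term list with one posting list per term, can be sketched in a few lines. The whitespace tokenizer, the function names, and the documents are assumptions made for illustration; production indexes add compression, ranking, and filtering that are omitted here.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map every unique term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def lookup_candidates(index, query_text):
    # Union of the posting lists of all query terms.
    result = set()
    for term in set(query_text.lower().split()):
        result |= index.get(term, set())
    return result

# Made-up documents for illustration only.
docs = {
    1: "entity coreference on the semantic web",
    2: "scalable record linkage",
    3: "record linkage and entity resolution",
}
index = build_inverted_index(docs)
print(sorted(index["record"]))
print(sorted(lookup_candidates(index, "semantic")))
```

For blocking, each instance plays the role of a document, and the instances retrieved for a query instance become its candidate matches, so instances that share no term are never compared.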
PartEnum [93], BiTrieJoin [94], IndexChunk [95], FastJoin [96], All-Pairs [97], PP-Join(+) [98] and Ed-Join [99] are all inverted index based approaches. PartEnum [93] is a search based algorithm that adopts a two-level partitioning and enumeration based on Hamming distance. BiTrieJoin [94] is a trie-based method to support efficient edit similarity joins with sub-trie pruning. All-Pairs [97] is a simple index based algorithm with certain optimization strategies. PPJoin+ [100] adopts a positional filtering principle that exploits the ordering of tokens in a record. Ed-Join [99] employs filtering methods that explore the locations and contents of mismatching n-grams. Similarly, IndexChunk [95] computes asymmetric signatures on character-level n-grams as constraints for selecting candidates. Instead of performing exact matching on tokens and/or character-level n-grams, FastJoin [96] adopts fuzzy matching techniques that consider both token and character level similarity.
The Semantic Web community has also started adopting similar techniques for blocking and entity coreference. Ioannou et al. [101] developed a system that focuses on query-time duplicate instance detection on RDF data. The key technique is to index RDF resources to enable efficient look-ups. By adaptively determining the query issued to the index, instances similar to the query instance can be efficiently retrieved.
2.4.4 Building Scalable Systems with Feature Selection
Most current research focuses on how to reduce the total number of pairwise comparisons between entity mentions; however, it is also important to speed up a single pairwise comparison. This can be generalized as a pruning process in which we prune the less important parts of an instance’s context information. For example, when we compare a pair of instances, the system predicts whether the remaining unconsidered context could overturn the decision made so far on the basis of what has already been compared. If it is not worth exploring the rest of the context, the system simply stops and continues with the next pair of instances. This is similar to the feature selection problem [102], where algorithms are developed to select the right features for a specific problem in order to reduce computational complexity. The key problem here is how to stop at the right place (i.e., how to select the right features). Stopping too soon may cause the system to miss true matches, while going too far may incur unnecessary computational cost.
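One simple way to realize such a stopping rule is sketched below. It is a deliberately conservative variant that assumes weighted features with similarities in [0, 1] and stops only when the remaining features provably cannot change the outcome, whereas the text above describes a prediction that may be approximate. The weights, the threshold, the attribute names, and the records are hypothetical.

```python
def compare_with_early_stop(a, b, weighted_features, threshold):
    """weighted_features: list of (weight, attribute, sim), sim returns [0, 1].

    Returns (is_match, number_of_features_actually_compared).
    """
    # Examine the most important features first.
    features = sorted(weighted_features, key=lambda f: f[0], reverse=True)
    remaining = sum(weight for weight, _, _ in features)
    score, used = 0.0, 0
    for weight, attr, sim in features:
        score += weight * sim(a[attr], b[attr])
        remaining -= weight
        used += 1
        if score >= threshold:
            return True, used       # later features cannot lower the score
        if score + remaining < threshold:
            return False, used      # even perfect matches cannot reach it
    return score >= threshold, used

def exact(x, y):
    return 1.0 if x == y else 0.0

features = [(0.5, "name", exact), (0.3, "affiliation", exact), (0.2, "email", exact)]

# Made-up instances for illustration only.
p = {"name": "j. smith", "affiliation": "lehigh", "email": "js@example.org"}
q = {"name": "j. smith", "affiliation": "lehigh", "email": "smith@example.org"}
r = {"name": "m. jones", "affiliation": "lehigh", "email": "js@example.org"}

print(compare_with_early_stop(p, q, features, threshold=0.6))
print(compare_with_early_stop(p, r, features, threshold=0.6))
```

In this example the pair (p, r) is rejected after a single comparison because the remaining weight cannot reach the threshold; a predictive rule could stop even earlier, at the risk of losing some true matches.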