3.2 Blocking Based Indexing
3.2.1 Traditional Blocking
Traditional blocking was introduced in [62, 87]. In this technique, records are grouped based on the value of one attribute or a combination of attributes. These attribute values are used to partition records into blocks such that each block contains only similar records. For example, if a 'Postcode' attribute is used as the blocking key, each generated block will contain only records that have the same Postcode. The attributes used in the blocking process are called Blocking Keys (BKs). The aim of partitioning records into blocks is to avoid comparing all pairs of records in a data set and instead to compare only records in the same block, which are the most likely to match. This reduces the comparison space.
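The grouping step described above can be illustrated with a minimal sketch. The record layout, the attribute names, and the helper names `build_blocks` and `candidate_pairs` are assumptions made for illustration and do not come from [62, 87].

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, blocking_key):
    """Group record identifiers by the value of the blocking key attribute."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec[blocking_key]].append(rec_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Return every pair of record identifiers that share a block."""
    pairs = set()
    for rec_ids in blocks.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


# Hypothetical toy records; the attribute names are illustrative only.
records = {
    "r1": {"name": "Smith", "postcode": "2600"},
    "r2": {"name": "Smyth", "postcode": "2600"},
    "r3": {"name": "Herzog", "postcode": "2912"},
    "r4": {"name": "Smith", "postcode": "2912"},
}

blocks = build_blocks(records, "postcode")
pairs = candidate_pairs(blocks)
```

With four records there are six possible record pairs, whereas blocking on the postcode leaves only the two pairs that share a block.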
Real-world data usually contain errors and variations. To ensure that similar records fall into the same block despite typographical errors and variations, attribute values can be converted into phonetic codes [33] using encoding functions before blocking. Several encoding functions, such as Soundex, Phonex, and Double-Metaphone, can be used for this purpose; they are described in [33, 82].
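As an illustration of phonetic encoding before blocking, the following sketch implements the commonly described American Soundex rules. It is a simplified version written for this example and is not necessarily the variant described in [33, 82]; Phonex and Double-Metaphone are not shown.

```python
# Map consonants to Soundex digits; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return a four-character Soundex code, or '' if name has no letters."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Append a digit unless it repeats the previous code.
        if digit and digit != prev:
            code.append(digit)
        # H and W do not separate equal codes; vowels do.
        if ch not in "HW":
            prev = digit
    # Pad with zeros and truncate to letter + three digits.
    return ("".join(code) + "000")[:4]


smith_code = soundex("Smith")
smyth_code = soundex("Smyth")
```

Using `soundex(value)` instead of the raw attribute value as the BK places 'Smith' and 'Smyth' in the same block even though the strings differ.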
The number of records inserted into each block depends on the frequency distribution of the attribute values used as BKs [34]. For example, if the family name attribute is selected as a BK, a frequent name (e.g. 'Smith') will produce a large block, while a less frequent name (e.g. 'Herzog') will produce a small block. Large blocks reduce the efficiency and scalability of the ER process, because the number of record pairs within a block grows quadratically with its size.
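The effect of a skewed BK distribution can be made concrete by counting the pairwise comparisons each block generates: a block of n records yields n(n-1)/2 pairs. The frequencies below are invented for illustration only.

```python
from collections import Counter


def comparisons_per_block(bk_values):
    """Map each BK value to the number of record pairs in its block."""
    counts = Counter(bk_values)
    return {value: n * (n - 1) // 2 for value, n in counts.items()}


# Hypothetical frequency distribution of a family-name BK.
bk_values = ["Smith"] * 1000 + ["Herzog"] * 10
cost = comparisons_per_block(bk_values)
```

In this toy distribution the 'Smith' block accounts for 499,500 comparisons and the 'Herzog' block for 45, so a single frequent value dominates the total cost.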
To avoid generating large blocks, Gu and Baxter [68] proposed an adaptive blocking technique that filters the candidate record pairs produced by large blocks. The authors proposed two filtering approaches. The first is based on the length of a filtering variable and removes record pairs that are unlikely to match from the list of candidate record pairs: if the difference between the lengths of the filtering variables of a record pair is greater than a specified value, the pair is removed. The second is based on the number of common bi-grams (sub-strings of length 2) between the filtering variables of a record pair: if this number is smaller than a specified threshold, the pair is removed. This technique reduced the number of candidate record pairs at the cost of a slight decrease in matching quality.
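The two filters can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of [68]: the function names and default thresholds are hypothetical, bi-grams are counted as distinct sub-strings, and the two filters are chained here only for brevity, whereas the text describes them as separate approaches.

```python
def bigrams(value):
    """Return the set of distinct sub-strings of length 2 in value."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def passes_length_filter(a, b, max_len_diff):
    """Keep a pair only if the lengths differ by at most max_len_diff."""
    return abs(len(a) - len(b)) <= max_len_diff


def passes_bigram_filter(a, b, min_common):
    """Keep a pair only if it shares at least min_common bi-grams."""
    return len(bigrams(a) & bigrams(b)) >= min_common


def filter_pairs(pairs, filter_values, max_len_diff=2, min_common=2):
    """Drop candidate pairs rejected by either filter (assumed thresholds)."""
    kept = []
    for id_a, id_b in pairs:
        a, b = filter_values[id_a], filter_values[id_b]
        if (passes_length_filter(a, b, max_len_diff)
                and passes_bigram_filter(a, b, min_common)):
            kept.append((id_a, id_b))
    return kept


# Hypothetical filtering variable (a lower-cased surname) per record.
filter_values = {"r1": "smith", "r2": "smyth", "r3": "smithson"}
candidates = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
kept = filter_pairs(candidates, filter_values)
```

Here 'smith' and 'smyth' have equal length and share the bi-grams 'sm' and 'th', so the pair is kept, while both pairs involving 'smithson' are removed by the length filter.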
More recently, Fisher et al. [64] addressed the issue of large block sizes by proposing two iterative blocking approaches that control the size of the generated blocks. The idea is to split large blocks and merge small blocks until all generated blocks are within a specified size range. The first approach merges and splits blocks based on the decreasing similarity of the generated blocks, while the second approach merges and splits blocks based on the increasing size of the generated blocks. Using a penalty function, their technique also controls the trade-off between the size and quality of the generated blocks.
Indexing
  Blocking-based
    Traditional blocking [29] [56] [58] [61] [80] [87] [106] [147]
    Q-gram indexing [7] [9] [57]
    Canopy clustering [28] [29]
    Mapping based indexing [3] [81]
    Hashing based indexing [20] [60] [79] [84] [95]
  Sorting-based
    Sorted neighborhood method (SNM) [72] [73] [86] [83]
    Adaptive SNM [50] [98] [157]
    Progressive SNM [109]
    Sorted blocks [48] [49]
    Suffix arrays [4] [39] [40] [41]

Figure 3.1: Main categories of existing indexing techniques that are used with traditional ER.

Another issue to consider with traditional blocking is the quality of the attribute values used as BKs. If the attributes selected as BKs contain many missing values, errors, or variations, records can be inserted into the wrong blocks, which reduces the quality of the matching process. To overcome this issue, multi-pass blocking [87] can be applied, where the blocking process is run several times using different BKs to improve matching quality. Iterative blocking [160] can also be applied, where blocks are processed iteratively using multiple BKs. In this approach, matched (resolved) records in blocks are distributed to other blocks, and a record can be matched against multiple blocks, which improves matching quality compared to disjoint methods (where each record is inserted into only one block).
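The multi-pass idea described above can be sketched as one blocking pass per BK, with the candidate pairs of all passes collected together. The record layout and the function name `multi_pass_pairs` are assumptions for illustration; iterative blocking [160], which redistributes resolved records between blocks, is not covered by this sketch.

```python
from collections import defaultdict
from itertools import combinations


def multi_pass_pairs(records, blocking_keys):
    """Union of the candidate pairs produced by one pass per blocking key."""
    pairs = set()
    for bk in blocking_keys:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            value = rec.get(bk)
            if value:  # records with a missing value are skipped in this pass
                blocks[value].append(rec_id)
        for rec_ids in blocks.values():
            pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


# Hypothetical records: r2 has a missing postcode, r3 a misspelled surname.
records = {
    "r1": {"surname": "Smith", "postcode": "2600"},
    "r2": {"surname": "Smith", "postcode": None},
    "r3": {"surname": "Smyth", "postcode": "2600"},
}

pairs = multi_pass_pairs(records, ["surname", "postcode"])
```

A single pass on the surname would miss the pair (r1, r3), and a single pass on the postcode would miss (r1, r2); the two passes together recover both. Storing the pairs in a set means that a pair generated by more than one pass is kept only once.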
Although multi-pass and iterative blocking techniques improve the effectiveness of the ER process, they often reduce the efficiency of the matching process because of the increased number of comparisons and the redundancy introduced in the generated list of record pairs. The redundancy problem was addressed in [117] and [95]. In [117], the authors identified and discarded blocks with redundant comparisons and merged overlapping blocks, which resulted in blocks with fewer comparisons. Their solution discards redundant comparisons at the cost of quadratic space complexity. Unlike [117], which targets non-distributed environments, the work presented in [95] addresses the redundancy problem in parallel environments for approaches based on MapReduce [46] (a tool for data processing in a parallel environment). Their technique efficiently identifies redundant pairs so that they are not compared at run time.
For traditional blocking to be used with dynamic data sets, the data structures used to build the index must support updates (adding, deleting, or editing values). This allows the blocks to grow dynamically as the data sets grow. To achieve real-time ER using traditional blocking, these data structures must also support fast retrieval of records. In addition, blocks must be small to ensure that the number of generated candidate record pairs is small and that record pair comparisons can be completed in real time. In Chapter 5 we propose a dynamic blocking-based indexing technique that is updated whenever a new query record arrives to facilitate query matching in real time.
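The requirements listed above (updatable blocks and fast retrieval) can be illustrated with a hash-based index. This is a minimal sketch under stated assumptions and is not the technique proposed in Chapter 5: the class name `DynamicBlockIndex`, its methods, and the record layout are hypothetical.

```python
from collections import defaultdict


class DynamicBlockIndex:
    """Hash-based blocking index that supports add, edit, delete and query."""

    def __init__(self, key_func):
        self._key_func = key_func          # maps a record to its BK value
        self._blocks = defaultdict(set)    # BK value -> record identifiers
        self._records = {}                 # record identifier -> record

    def add(self, rec_id, record):
        """Insert a record; re-adding an existing identifier edits it."""
        if rec_id in self._records:
            self.remove(rec_id)
        self._records[rec_id] = record
        self._blocks[self._key_func(record)].add(rec_id)

    def remove(self, rec_id):
        """Delete a record and drop its block if the block becomes empty."""
        record = self._records.pop(rec_id)
        key = self._key_func(record)
        self._blocks[key].discard(rec_id)
        if not self._blocks[key]:
            del self._blocks[key]

    def query(self, record):
        """Return the identifiers in the block the query record maps to."""
        return set(self._blocks.get(self._key_func(record), ()))


# Hypothetical usage with the postcode as the BK.
index = DynamicBlockIndex(lambda rec: rec["postcode"])
index.add("r1", {"name": "Smith", "postcode": "2600"})
index.add("r2", {"name": "Herzog", "postcode": "2912"})
before_edit = index.query({"name": "Smyth", "postcode": "2600"})
index.add("r1", {"name": "Smith", "postcode": "2912"})  # edit moves r1
after_edit = index.query({"name": "Smyth", "postcode": "2912"})
```

Insertion, deletion, and block lookup are each a constant number of dictionary operations on average, so the query time is dominated by the size of the retrieved block, which is why small blocks matter for real-time matching.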