International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 5, May 2013)
Boosting the Efficiency in Similarity Search on Signature
Collections
Jong Wook Kim
Teradata, CA, USA
Abstract— Computing all signature pairs whose bit differences are less than or equal to a given threshold in large signature collections is an important problem in many applications. In this paper, we leverage MapReduce-based parallelization to enable scalable similarity search on signatures. A roadblock in applying the MapReduce framework to this problem, however, is that the cost of merging and sorting the intermediate key-value pairs produced by multiple mappers can be prohibitively expensive when they do not fit into main memory. Thus, in this paper, we propose S4igpart, a partition-based extension of S4ig (Scalable Similarity Search on the Signatures), a novel MapReduce-based technique for computing similarity search over large signature collections. In particular, the approach presented in this paper relies on a data partitioning scheme that avoids costly disk-based merge and sort operations. Experimental results show that the proposed technique, S4igpart, significantly improves efficiency.
Keywords— MapReduce, Performance, Scalability, Signature, Similarity Search
I. INTRODUCTION
Efficiently computing all pairs of equal-length signatures whose bit differences are less than or equal to a given threshold is an important problem. For example, there have been extensive studies in finding near-identical data in large data collections. Established techniques can be roughly categorized into inverted-list-based methods and signature-based methods. Unlike inverted-list-based methods [4], [5], which can be applied only to text collections, signature-based schemes [1], [2] can be used as a general solution to the near-duplicate detection problem, since they apply to various types of data, including text, video, and images. Signature-based schemes detect near duplicates by (a) generating signatures by hashing local data objects, (b) performing similarity search on the signatures to identify a set of candidate pairs that are likely to be near duplicates, and (c) eliminating false matches. The main obstacle to signature-based schemes stems from efficiency concerns: performing similarity search on the signatures can be quite expensive. Thus, there is a pressing need for mechanisms that significantly reduce the overhead of computing similarity search over large signature collections.
A. Contributions of this Paper
The goal of this paper is to develop a mechanism that efficiently computes similarity search on signature collections by leveraging the MapReduce framework [3]. While the MapReduce framework has produced impressive results in various applications, it has a limitation: the cost of merging and sorting the intermediate key-value pairs produced by multiple mappers can be prohibitively expensive when they do not fit into main memory and thus need to be spilled to disk. Considering the huge number of intermediate key-value pairs generated when performing similarity search on a large signature collection, this shortcoming is the main obstacle to using MapReduce for this problem. Thus, in this paper, we develop a MapReduce-based algorithm that efficiently finds all signature pairs whose bit differences are less than or equal to a given threshold. In particular, the approach presented in this paper boosts efficiency by exploiting a data partitioning scheme.
II. PROBLEM STATEMENT
The problem addressed in this paper can be stated as follows: “Given
- a signature data collection, DB, and
- a threshold, λ,
find the set DIFF≤λ(DB) = {(i, j) | diff(si, sj) ≤ λ ∧ si ∈ DB ∧ sj ∈ DB ∧ (i < j)}.” Here, diff(si, sj) denotes the bit difference between two signatures, si and sj.
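To make the definition concrete, the following minimal Python sketch (illustrative only, not part of the proposed algorithms) materializes DIFF≤λ(DB) directly from the definition; it is quadratic in |DB| and serves merely as a reference point for the scalable methods developed in the rest of the paper.

from itertools import combinations

def diff(si: int, sj: int) -> int:
    """Bit difference (Hamming distance) between two equal-length signatures."""
    return bin(si ^ sj).count("1")

def diff_le(db: list[int], lam: int) -> set[tuple[int, int]]:
    """All ID pairs (i, j), i < j, whose signatures differ in at most lam bits."""
    return {(i, j) for (i, si), (j, sj) in combinations(enumerate(db), 2)
            if diff(si, sj) <= lam}

# Example: 0b0101 and 0b0111 differ in exactly one bit.
print(diff_le([0b0101, 0b0111, 0b1010], lam=1))  # {(0, 1)}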
III. PRELIMINARY: PARTENUM
In this paper, given the signature collection DB, we employ PartEnum [2] to find a set of signature pairs whose bit differences are less than or equal to the given threshold, λ. The basic idea of PartEnum is summarized as follows:
- partitioning: Each signature si is first divided into n1 equi-sized partitions, s‹i,1›, …, s‹i,n1›.
- enumeration: For each partition s‹i,r›, PartEnum computes C(n2, μ) hash values by using the same number of hash functions, such that s‹i,r,Ut› = HFUt(s‹i,r›), where n2 = |s‹i,r›|, μ = (λ+1)/n1 − 1, and Ut is a size-μ subset of the dimensions of s‹i,r›. The hash function, HFUt, computes the hash value by projecting s‹i,r› along all dimensions except those in Ut. For example, if si = 01011011 and λ = 3, then n1 = 2, n2 = 4, and μ = 1 by definition. The signature, si = 01011011, is first partitioned into n1 equi-sized partitions as follows:
s‹i,1› = 0101, s‹i,2› = 1011.
Then, all possible hash values for si = 01011011 are computed as follows:
s‹i,1,{1}› = 101, s‹i,1,{2}› = 001, s‹i,1,{3}› = 011, s‹i,1,{4}› = 010,
s‹i,2,{5}› = 011, s‹i,2,{6}› = 111, s‹i,2,{7}› = 101, s‹i,2,{8}› = 101.
If the bit difference between two signatures, si and sj, is less than or equal to λ, there exists at least one hash bucket to which they are both mapped. Note that, like other signature-based techniques, PartEnum also needs a post-processing step to eliminate false positives.
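The two steps above can be sketched in Python as follows. The sketch assumes μ = 1 and n1 = (λ + 1)/2, which matches the worked example but is only one possible parameter choice; it reproduces exactly the hash values listed above for si = 01011011.

def partenum_keys(sig: str, lam: int) -> list[tuple[int, int, str]]:
    """Sketch of PartEnum's partitioning + enumeration steps for a bit-string
    signature, assuming n1 = (lam + 1) / 2, which yields mu = 1."""
    n1 = (lam + 1) // 2            # number of first-level partitions (assumed)
    n2 = len(sig) // n1            # size of each partition
    mu = (lam + 1) // n1 - 1       # number of projected-out dimensions
    assert mu == 1, "this sketch only handles the mu = 1 case"
    keys, t = [], 0
    for r in range(n1):            # partitioning step
        part = sig[r * n2:(r + 1) * n2]
        for u in range(n2):        # enumeration step: drop dimension u
            t += 1
            keys.append((r + 1, t, part[:u] + part[u + 1:]))
    return keys

# Reproduces the worked example: si = 01011011, lam = 3.
for r, t, val in partenum_keys("01011011", 3):
    print(f"s<i,{r},{{{t}}}> = {val}")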
IV. NAIVE SOLUTION: FIRST ATTEMPT
Figure 1 presents the MapReduce-based S4ig (Scalable Similarity Search on the Signatures) algorithm.
In the first step, we create the hash tables. Input to each mapper, Map HT, contains a signature ID, i, and the corresponding signature, si. Given the signature si, the mapper first computes hash values by using PartEnum as described in Section III. Then, the mapper emits an intermediate key-value pair with the triple of partition ID, enumeration ID, and corresponding hash value, ‹r, t, val›, used as the key and the signature ID, i, assigned as the corresponding value (line 8). Then, each reducer merges and sorts intermediate key-value pairs by the key. As a result, given the key ‹r, t, val›, the corresponding value, Vpen, contains a list of all signature IDs such that
∀a ∈ Vpen : s‹a,r,Ut› = val. In other words, Vpen acts as the hash bucket containing signature IDs.
Once the hash table is generated, S4ig identifies a set of candidate pairs. Inputs to each mapper, Map Cand, are the results from the previous step: ‹r, t, val› and Vpen. The
mapper enumerates all possible pairwise IDs in Vpen and
emits an intermediate key-value pair with the first signature ID (i.e., x) as the key and the second signature ID (i.e., y) as the value (line 14). In the reduce phase, candidates identified by multiple mappers are gathered up and sorted by the key.
Figure 1. Pseudo-code for S4ig algorithm
As a result, given the key x, the corresponding value, Vcand, contains a list of signature IDs such that
Vcand = {y | s‹x,r,Ut› = s‹y,r,Ut› for some 1 ≤ r ≤ n1 and some enumeration subset Ut}.
Note that since the same candidate pairs can be identified by different mappers when more than one hash function outputs the same values, each reducer first eliminates redundant candidate pairs and then emits a key-value pair with x as the key and Vcand as the corresponding value.
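The two MapReduce jobs described above can be illustrated with a small in-memory simulation. The sketch below is illustrative only: the hash_keys stand-in replaces PartEnum with a trivial split into signature halves, and Python dictionaries play the role of the shuffle/sort phase that groups intermediate pairs by key.

from collections import defaultdict
from itertools import combinations

def hash_keys(sig: str) -> list[tuple[int, int, str]]:
    # Stand-in for PartEnum: one <r, t, val> triple per signature half.
    half = len(sig) // 2
    return [(1, 1, sig[:half]), (2, 2, sig[half:])]

db = {1: "0101", 2: "0111", 3: "1010"}

# Job 1 (Map HT + reduce): build the hash table <r, t, val> -> V_pen.
hash_table = defaultdict(list)
for i, sig in db.items():                  # map phase
    for key in hash_keys(sig):
        hash_table[key].append(i)          # shuffle/reduce: group IDs by key

# Job 2 (Map Cand + reduce): enumerate ID pairs per bucket, dedup per key x.
candidates = defaultdict(set)
for v_pen in hash_table.values():          # map phase
    for x, y in combinations(sorted(v_pen), 2):
        candidates[x].add(y)               # set semantics remove duplicates

print(dict(candidates))  # {1: {2}}: signatures 1 and 2 share the bucket "01"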
Figure 2. (a) The total size of intermediate key/value pairs in the candidate selection step for each slice and (b) the partition boundaries.
In the final post-processing step, inputs to each mapper are the results from the previous step: (x, Vcand). Given a pair of candidate IDs, each mapper retrieves the corresponding signatures, computes the bit difference, and returns the ID pair of signatures whose bit difference is less than or equal to λ.
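A minimal sketch of this verification step, assuming signatures are stored as bit strings: the bit difference is simply the popcount of the XOR of the two integer encodings.

def bit_diff(si: str, sj: str) -> int:
    """Bit difference via XOR + popcount on the integer encodings."""
    return bin(int(si, 2) ^ int(sj, 2)).count("1")

# Keep only the candidate pairs whose bit difference is within the threshold.
db = {1: "0101", 2: "0111", 3: "1010"}
candidates = {1: {2}}                      # output of the candidate selection
lam = 1
results = [(x, y) for x, ys in candidates.items()
           for y in ys if bit_diff(db[x], db[y]) <= lam]
print(results)  # [(1, 2)]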
V. SCALABLE SIMILARITY SEARCH ON THE SIGNATURES
The roadblock in using the MapReduce framework for similarity search on signature collections is that the cost of sorting and merging the intermediate key-value pairs produced by multiple mappers can be extremely expensive when they do not fit into main memory, in which case sorting is likely to spill over to disk. In particular, the S4ig algorithm in Figure 1 needs to enumerate all possible pairwise signature IDs in each hash bucket to find candidate pairs. This produces a huge number of intermediate key-value pairs, which require expensive disk-based merge-sort operations. Therefore, in this section, we develop the S4igpart algorithm, which avoids these expensive disk-based merge-sort operations.
B. Estimating the Number of Intermediate Key/Value Pairs in the Candidate Selection Step
S4igpart first collects the statistics needed to estimate the number of intermediate key/value pairs that will be produced by the candidate selection step described in Figure 1. Given a signature collection, DB, we initially partition DB into w equi-sized partitions, DB1, DB2, …, DBw. Then, DIFF≤λ(DB) can be computed as follows:
DIFF≤λ(DB) = ∪1≤x≤y≤w DIFF≤λ(DBx, DBy),
where DIFF≤λ(DBx, DBy) = {(i, j) | si ∈ DBx ∧ sj ∈ DBy ∧ diff(si, sj) ≤ λ ∧ (i < j)}. Indeed, we can avoid costly disk-based merge-sort by choosing a sufficiently large value of w. However, such a naive partitioning scheme can lead to under-utilization of the memory-based buffers that MapReduce uses to sort and merge intermediate data. Therefore, the data partitioning scheme introduced in this paper aims to maximize memory-based buffer usage while avoiding spills to disk. Let 0 < α ≤ 1 be the sampling rate. Given a signature partition, DBx, let Rx be the corresponding sample set (where |Rx| = α × |DBx|). Given a pair of partitions, DBx and DBy, let us ask the question: “What is the number of intermediate key/value pairs that will be produced in the candidate selection phase when computing DIFF≤λ(DBx, DBy)?” In order to effectively estimate this number, we leverage existing techniques developed to estimate the size of join results in relational databases. Let C‹Rx, Ry› denote the actual number of intermediate key/value pairs produced in the candidate selection step when we compute DIFF≤λ(Rx, Ry); C‹DBx, DBy› is defined analogously. We first compute DIFF≤λ(Rx, Ry) between the two sample sets, which provides the count C‹Rx, Ry›. C‹Rx, Ry› is then used to estimate C‹DBx, DBy› as follows:
C‹DBx, DBy› ≈ C‹Rx, Ry› × (|DBx| / |Rx|) × (|DBy| / |Ry|), for 1 ≤ x ≤ y ≤ w.
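For illustration, the scale-up can be computed as in the following sketch; all counts and partition sizes below are hypothetical.

def estimate_pairs(c_sample: int, db_x: int, db_y: int,
                   r_x: int, r_y: int) -> float:
    """Scale a sampled intermediate-pair count up to the full partitions,
    following the join-size estimation formula above."""
    return c_sample * (db_x / r_x) * (db_y / r_y)

# A 0.5% sample (alpha = 0.005) of two 1M-signature partitions: 40 pairs
# observed on the samples scale up to an estimated 1.6 million pairs.
print(estimate_pairs(c_sample=40, db_x=1_000_000, db_y=1_000_000,
                     r_x=5_000, r_y=5_000))  # 1600000.0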
C. Computing the Partition Boundaries
Computing DIFF≤λ(DB) can be viewed as a two-way self-join of the form DB ⋈ DB, which is represented in a two-dimensional space. As shown in Figure 2-(a), this space is divided into two regions: a useful region needed for computing the final results and a non-useful region considered as wasted work. Without loss of generality, let us further assume that the dimension corresponding to the vertical axis is selected as the pivot that drives the partitioning process. Then, as shown in Figure 2-(a), given w equi-sized partitions, DB1, DB2, …, DBw, the useful region is initially partitioned into w slices, Δ1, Δ2, …, Δw.
Figure 3. Pseudo-code for PartitionSig
Since C‹DBx, DBy› has already been estimated with sampling, the information necessary to compute CΔx, the number of intermediate key/value pairs contributed by the x-th slice, is at hand. Then, CΔx is computed as follows:
CΔx = Σx≤y≤w C‹DBx, DBy›.
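A small sketch of this summation, with hypothetical estimates C‹DBx, DBy› stored in a dictionary keyed by (x, y) over the useful upper-triangular region:

def slice_totals(c_est: dict[tuple[int, int], float], w: int) -> list[float]:
    """C_dx for each slice x: sum the estimated counts over y = x..w."""
    return [sum(c_est[(x, y)] for y in range(x, w + 1))
            for x in range(1, w + 1)]

# Hypothetical estimates for w = 3 partitions (keys (x, y) with x <= y).
c_est = {(1, 1): 5.0, (1, 2): 3.0, (1, 3): 2.0,
         (2, 2): 4.0, (2, 3): 1.0,
         (3, 3): 6.0}
print(slice_totals(c_est, w=3))  # [10.0, 5.0, 6.0]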
Let β be the size of each candidate pair in bytes; note that β is determined by the type of the signature ID. Given the above definition of CΔx, the total size of the intermediate key/value pairs in the candidate selection step for the x-th slice, Δx, is calculated as β × CΔx. As shown in Figure 2-(b), given l consecutive slices, Δx, …, Δx+l−1, the total size of the intermediate key/value pairs for these slices is computed as β × Σi=x..x+l−1 CΔi. Assuming that U servers are available for processing, we expect each server to be assigned roughly (β / U) × Σi=x..x+l−1 CΔi bytes of intermediate key/value pairs that need to be merged and sorted based on the keys. Then, our objective is to compute the largest value, l, such that
l = arg maxl ( (β / U) × Σi=x..x+l−1 CΔi ≤ sizemem-buffer ),
where sizemem-buffer represents a pre-defined threshold. In MapReduce, each time the contents of the memory-based buffer reach sizemem-buffer, they are spilled to disk. The rationale for choosing the largest such value, l, is to maximize the usage of the memory-based buffers while avoiding spills to disk. Figure 3 shows the algorithm for identifying the partition boundaries by using the above equation.
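The boundary search can be sketched as a greedy scan over the slices that repeatedly takes the largest run whose estimated per-server load still fits in the sort buffer. This is only an illustrative reconstruction of the objective above, with hypothetical values for β, U, and the buffer size; the paper's actual pseudo-code is given in Figure 3.

def partition_boundaries(c_slices: list[float], beta: int, u: int,
                         size_mem_buffer: float) -> list[int]:
    """Greedily pick the largest l per partition such that
    beta * (sum of C_dx over the run) / u stays within the memory buffer."""
    boundaries, x = [], 0
    while x < len(c_slices):
        total, l = 0.0, 0
        while (x + l < len(c_slices) and
               beta * (total + c_slices[x + l]) / u <= size_mem_buffer):
            total += c_slices[x + l]
            l += 1
        l = max(l, 1)             # always make progress, even if one slice spills
        boundaries.append(x + l)  # exclusive upper bound of this partition
        x += l
    return boundaries

# Hypothetical setup: 8-byte candidate pairs, 6 servers, 64 MB sort buffer.
print(partition_boundaries([2e7, 3e7, 1e7, 5e7], beta=8, u=6,
                           size_mem_buffer=64 * 2**20))  # [2, 3, 4]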
D. S4igpart Algorithm
Figure 4. Pseudo-code for the S4igpart algorithm
Then, these lower and upper bounds are used to enumerate the set of pairwise candidates that belong to the current partition (line 11). Note that an immediate side effect of S4igpart is that each partition might need to be scanned multiple times, which negatively affects the execution time. However, as will be shown in Section VI, the cost of this side effect is significantly smaller than the time gained by avoiding costly external merge-sort operations.
VI. EXPERIMENTS
We evaluated the proposed approach with a real data set: we selected 5 million sentences from the Blog data [8] and generated a 48-bit signature for each sentence. We ran the experiments on a cluster of 6 machines, using Hadoop [6].
A. Results and Discussion
Figure 5 compares the execution times of S4igpart for varying sampling rates, α. In this experiment, α was varied from 0.01% to 5%; we used 5 million signatures and set the value of λ to 3. As shown in Figure 5, the execution time decreases as the sampling rate increases from 0.01% to 0.5%. The execution time gains in this range of sampling rates are achieved by avoiding expensive disk-based merge-sort operations; this verifies that, as the sampling rate increases, S4igpart can indeed correctly estimate the number of intermediate key/value pairs produced in the candidate selection step. Beyond 0.5%, the execution times gradually increase with the sampling rate, because the large number of samples used to compute the partition boundaries negatively affects the execution time.
Figure 6 shows how the execution time is split among partition estimation, hash table creation, candidate selection, and post-processing for varying bit differences. In this experiment, we used 5 million signatures and set the sampling rate to 0.5%. In Figure 6, the bit difference threshold, λ, varies from 1 to 7. The observations from Figure 6 can be summarized as follows: for both schemes, as the bit difference threshold increases, the major contributor to the increase in overall execution time is the candidate selection step. The performance gap between the two methods grows as the bit difference threshold increases. In particular, S4igpart significantly outperforms S4ig in the candidate selection step, because S4igpart leverages the data partitioning scheme to avoid expensive disk-based merge-sort operations. On the other hand, for the hash table creation and post-processing steps, the S4ig algorithm shows slightly better performance than S4igpart. The reason is that, under the S4igpart algorithm, each partition might need to be scanned multiple times, which negatively affects the execution time.
Figure 5. Execution times on varying sampling rates (α)
Figure 6. Execution times on varying bit differences (λ)
Figure 7. Execution times on varying the number of signatures
The major performance gains of the S4igpart algorithm are achieved by the data partitioning scheme, which is well suited for similarity search on large signature collections with the MapReduce framework.
Figure 7 plots the execution times versus the number of signatures. Notice that in this experiment, λ is set to 7 and a 0.5% sampling rate is used. Once again, as the number of signatures increases, the performance gap between S4igpart and S4ig grows. In summary, the results in this section verify that the S4igpart algorithm proposed in this paper is both efficient and scalable.
VII. CONCLUSION
In this paper, we presented the MapReduce-based S4igpart algorithm, which efficiently computes all signature pairs whose bit differences are less than or equal to a given threshold. In particular, the S4igpart algorithm leverages a data partitioning scheme to avoid expensive disk-based merge-sort operations. Experimental results show that the algorithm presented in this paper significantly improves the efficiency of similarity search on signature collections.
REFERENCES
[1] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[2] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
[3] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
[4] J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In SIGIR, 2009.
[5] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, 2010.
[6] Hadoop. http://hadoop.apache.org.
[7] HBase. http://hadoop.apache.org/hbase.