Boosting the Quality of Approximate String Matching by Synonyms


JIAHENG LU, University of Helsinki

CHUNBIN LIN, University of California, San Diego

WEI WANG, University of New South Wales

CHEN LI, University of California, Irvine

XIAOKUI XIAO, Nanyang Technological University

A string-similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings “Sam” and “Samuel” can be considered to be similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, “Bill” is a short form of “William,” and “Database Management Systems” can be abbreviated as “DBMS.” Given a collection of predefined synonyms, the purpose of this article is to explore such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching.

In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-trees and QP-trees, which combine signature-filtering and length-filtering strategies. In order to improve the efficiency of our algorithms, we develop an estimator to estimate the size of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms: Algorithms, Experimentation, Performance, Theory

Additional Key Words and Phrases: String similarity search, similarity join, semantic search

ACM Reference Format:

Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, and Xiaokui Xiao. 2015. Boosting the quality of approximate string matching by synonyms. ACM Trans. Datab. Syst. 40, 3, Article 15 (October 2015), 42 pages.

DOI: http://dx.doi.org/10.1145/2818177

This research is partially supported by the 973 Program of China (Project No. 2012CB316205), NSF China (No: 61472427), and a research fund from the University of Helsinki.

Authors’ addresses: J. Lu (corresponding author), Department of Computer Science, University of Helsinki, Finland, FI-00014; email: [email protected]; [email protected]; C. Lin, University of California, San Diego, La Jolla, CA 92093; email: [email protected]; W. Wang, School of Computer Science and Engineering, University of New South Wales, Australia, 2052; email: [email protected]; C. Li, Department of Computer Science, University of California, Irvine, CA 92697; email: [email protected]; X. Xiao, School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798; email: xkxiao@ntu.edu.sg.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2015 ACM 0362-5915/2015/10-ART15 $15.00


1. INTRODUCTION

Approximate string matching is used heavily in data integration [Huang and Madey 2004; Arasu et al. 2006, 2008], data deduplication [Chaudhuri et al. 2006; Christen 2012], and data search [Chaudhuri and Kaushik 2009]. Most existing work that computes the similarity of two strings only considers syntactic similarities, for example, number of common words or q-grams. Although this is indeed an indicator of similarity, there are many important cases where syntactically-different strings can represent the same real-world object. For example, “Big Apple” is a synonym of “New York,” and “Transactions on Database Systems” can be abbreviated as “TODS.” This equivalence information can help us identify semantically-similar strings that may have been missed by syntactical similarity-based approaches.

In this article, we study three problems related to approximate string matching with synonyms.¹ First, how can we define an effective similarity measure between two strings based on synonyms? Traditional similarity functions cannot be easily extended to handle synonyms in similarity matching. Second, how can we perform string-similarity search efficiently, that is, efficiently find all strings in a given collection that are similar to a query string? Finally, how can we efficiently perform similarity joins based on the newly-defined similarity functions, that is, given two collections of strings, efficiently find the similar string pairs across the collections? These three problems are closely related, because the similarity measures defined for the first problem naturally provide the search and join predicates for the second and third problems.

We give several applications in order to shed light on the importance of these three problems. In a gene/protein database, one of the major obstacles that hinders effective use is term variation [Tsuruoka et al. 2007], including Roman-Arabic variation (e.g., “Synapsin 3” and “Synapsin III”), acronym variation (e.g., “IL-2” and “Interleukin-2”), and term abbreviation (e.g., “Ah receptor” and “Ah dioxin receptor”). A new similarity measure that can handle term variation may help users discover more genes (proteins) of interest. New measures are also applicable in information-extraction systems to aid mapping from text strings to canonical entries by similarity matching. More applications can be found in areas such as term classification and term clustering, which can be used for advanced information retrieval/extraction applications.

Approximate string searches and joins are also useful in data integration and data cleaning. For example, in two representative data-cleaning domains, namely, addresses and academic publications, there often exist rich sources of synonyms which can be leveraged to improve the effectiveness of systems significantly. In addition, information from different data sources often has various inconsistencies. The same real-world entity could be represented in slightly different formats. Therefore, data cleaning needs to find, from a collection of entities, those similar to a given entity. A typical query is “find records related to Harvard University,” and an entity of “1350 Massachusetts Ave, Cambridge, MA 02138” should be found and returned.

¹ For brevity, we use “synonym” to describe any kind of equivalent pairs, which may include synonym, acronym, abbreviation, variation, and other equivalent expressions.

For the first problem of similarity measures, there is a wealth of research on string-similarity functions on the syntactical level, such as Levenshtein distance [Wang et al. 2012], Hamming distance [Kondrak 2005], episode distance [Cohen et al. 2003], cosine metric [Salton and Buckley 1988], Jaccard coefficient [Chaudhuri et al. 2006; Li et al. 2008], and dice similarity [Bayardo et al. 2007]. Unfortunately, none of them can be trivially extended to leverage synonyms to improve the quality of similarity measurement. One exception is the JaccT technique proposed by Arasu et al. [2008], which generates all possible strings by applying synonym rewriting to both strings, returning the maximum similarity among the transformed pairs of strings, as illustrated by the following example.

Fig. 1. An example to illustrate similarity matching with synonyms.

Example 1.1. Figure 1 shows an example of two tables Q and T and a set of synonyms. Consider two strings q1 and t2. Their token-based Jaccard similarity is only 2/8. Now consider applying synonym pairs on them. Obviously, synonyms r1, r3, and r6 can be applied to q1, and r2, r3, r4, and r5 can be applied to t2. We use $A_R(s)$ to denote the string generated by applying a set of synonym pairs R to the source string s. For example, $A_{\{r_1, r_3\}}(q_1)$ is “International Wireless Health Conf 2012.”

Arasu et al. [2008] consider all possible strings resulting from applying a subset of the applicable synonyms to each string and then return the maximum Jaccard similarity. In the preceding example, this requires considering $2^3 \cdot 2^4 = 128$ possible string pairs. The maximum Jaccard value, attained between $A_{\{r_1, r_3, r_6\}}(q_1)$ and $A_{\{r_3\}}(t_2)$, is 5/6.

The main limitation of the preceding approach is that its string-similarity definition has to generate all possible strings and compare their similarity for each pair in the cross product. In general, given two strings s and t, assuming there are n1 (resp. n2) synonym pairs applicable to s (resp. t), the preceding approach needs to compute the similarities for $2^{n_1 + n_2}$ pairs. Such a brute-force approach takes time exponential in the number of applicable synonym pairs. A special case of the problem for single-token synonyms is identified in Arasu et al. [2008], where the problem can be reduced to a maximum bipartite graph matching-based problem. Unfortunately, in practice, many synonyms contain multiple tokens, and even in the special case, it still has a worst-case complexity that is quadratic in both the number of string tokens and the number of applicable synonyms. Since such a similarity measure has to be called for every candidate pair when employed in similarity joins, it introduces substantial overhead.
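To make the blow-up concrete, the following minimal Python sketch counts the subset pairs that a transformation-based measure would have to score for the running example. It only enumerates rule labels (taken from Example 1.1) and performs no rewriting; the helper name `all_subsets` is a hypothetical name introduced here for illustration.

```python
from itertools import chain, combinations, product

def all_subsets(rules):
    """Return every subset of `rules` (the power set) as a list of tuples."""
    return list(chain.from_iterable(
        combinations(rules, k) for k in range(len(rules) + 1)))

# Labels of the rules applicable to q1 and t2 in Example 1.1.
rules_q1 = ["r1", "r3", "r6"]
rules_t2 = ["r2", "r3", "r4", "r5"]

# Every pair (subset applied to q1, subset applied to t2) must be compared.
pairs = list(product(all_subsets(rules_q1), all_subsets(rules_t2)))
print(len(pairs))  # 2**3 * 2**4 = 128
```

Adding one more applicable rule to either string doubles the number of pairs, which is the exponential behavior described above.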

In this article, we propose a novel expansion-based framework to measure the similarity of strings. Intuitively, given a string s, we first obtain its token set S and then grow the token set by applying the applicable synonyms of s. When a synonym pair lhs → rhs is applied, we merge all the tokens in rhs into the token set, that is, S = S ∪ rhs. The similarity between two strings is the similarity between their expanded token sets. We have two key ideas here: (i) we can avoid the exponential computational complexity by expanding strings with all applicable synonym pairs; (ii) an expansion operation costs less than a transformation operation, as illustrated next.

Example 1.2 (Continuing the Previous Example). We apply all applicable synonym pairs to q1 and t2 to obtain the “closure” of their tokens and then evaluate the Jaccard


Fig. 2. Comparison of string-similarity algorithms with synonyms (assume both strings have length L; the maximum number of applicable synonyms is n; k is the maximum size of the right-hand side of a rule).

of t2 by applying synonym pairs r2, r3, r4, and r5 is {International, Intl, WH, Conf, 2012, Wireless, Health, Conference, United, Kingdom, UK}. The Jaccard similarity between the two fully-expanded token sets is 8/11, which is also greater than the original similarity 2/8 without synonyms.

We note that the full expansion-based similarity illustrated in the preceding example is efficient to compute and is also highly competitive in terms of effectiveness in modeling the semantic similarity between strings. Nonetheless, using all the synonym pairs to perform a full expansion may hurt the final similarities. Consider Example 1.2 again, where the use of rule r4 on t2 is detrimental to the similarity between q1 and t2, as none of the expanded tokens appear in q1 or its expanded token set. Therefore, we propose a selective expansion measure which selects only “good” synonym pairs to expand both strings (Section 3), and accordingly, the similarity is defined as the maximum similarity between two appropriately-expanded token sets of two strings. Unfortunately, computing the selective expansion is NP-hard, which we prove by a reduction from the 3SAT problem. To provide an efficient solution, we propose a greedy algorithm, called the SE algorithm, which is optimal if the rhs tokens of the useful synonyms satisfy the distinct property (the precise definition is given in Section 3.3). We empirically show that the condition is likely to be true in practice. Figure 2 summarizes the theoretical analysis of the two new similarity algorithms, contrasted with results of existing methods [Arasu et al. 2008].

Given the newly-proposed similarity measures, it is imperative to study efficient similarity search and join algorithms. For string-similarity search, we first show a signature-based search algorithm, called search-baseline, by following the filter-and-verification framework. It generates a set of tokens as signatures for each string in the table, and then for the query string, finds candidate pairs that overlap in the signatures, and finally verifies against the threshold using the previously proposed similarity measures. Further, we propose SI-search, which speeds up the performance by using native indexes, SI-tree for strings from the table, and QP-tree for the query string, to reduce the candidate size with integrated signatures and length filters (Section 4). Similarly, for string similarity joins, we also propose two algorithms: one is the signature-based baseline method (join-baseline), and the other is the SI-tree-based index join algorithm (SI-join). We demonstrate that our join algorithms are flexible and may use different signatures, including prefix filtering and locality sensitive hashing (LSH).
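The filter-and-verification idea behind a signature-based baseline can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the search-baseline algorithm itself: it uses the entire fully-expanded token set of each string as its signature (a sound but weak filter), matches rule left-hand sides at the token level, and verifies candidates with the full-expansion Jaccard similarity. The names `build_index` and `search`, and the sample rules and strings, are hypothetical.

```python
from collections import defaultdict

def is_applicable(lhs, s):
    """True if the tokens of lhs occur contiguously among the tokens of s."""
    l, t = lhs.split(), s.split()
    return any(t[i:i + len(l)] == l for i in range(len(t) - len(l) + 1))

def full_expand(s, rules):
    """Token set of s plus the rhs tokens of every applicable rule."""
    out = set(s.split())
    for lhs, rhs in rules:
        if is_applicable(lhs, s):
            out |= set(rhs.split())
    return out

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def build_index(table, rules):
    """Inverted index: signature token -> ids of strings containing it."""
    index = defaultdict(set)
    for i, s in enumerate(table):
        for tok in full_expand(s, rules):
            index[tok].add(i)
    return index

def search(query, table, rules, index, threshold):
    q = full_expand(query, rules)
    # Filtering: keep only strings sharing at least one signature token.
    candidates = set()
    for tok in q:
        candidates |= index.get(tok, set())
    # Verification: compute the actual similarity of each survivor.
    return [table[i] for i in sorted(candidates)
            if jaccard(q, full_expand(table[i], rules)) >= threshold]

rules = [("Intl", "International"), ("Conf", "Conference"),
         ("UK", "United Kingdom")]
table = ["International Conference 2012",
         "Wireless Health Workshop",
         "Intl Conf 2012 UK"]
index = build_index(table, rules)
result = search("Intl Conf 2012", table, rules, index, threshold=0.5)
print(result)
```

In this toy run the second string shares no signature token with the query, so it is pruned without any similarity computation; smaller signatures such as prefixes prune more aggressively.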

The methods for selecting signatures of strings may significantly affect the performance of algorithms [Li et al. 2008; Qin et al. 2011]. Therefore, we propose an online algorithm to dynamically estimate the filtering power of a signature filter. In particular,


gorithms to provide high-confidence estimates for comparing the filtering power of two filters. Our method carries both theoretical and practical significance. In theory, we can prove that, with any given high probability 1 − δ, 2DHS returns correct estimates on upper and lower bounds, and its space and time complexities are logarithmic in the cardinality of input size. In practice, the experiments show that our estimation technique accurately predicts the quality of different filters in all three datasets and that its running time is negligible compared to that of the actual filtering phase (accounting for only 1% of the total join time).

Finally, we perform a comprehensive set of experiments on three datasets to demonstrate the superior efficiency and effectiveness of our algorithms (Section 8). The results show that our algorithms are up to two orders of magnitude more efficient than the existing methods while offering comparable or better effectiveness.

Organization. The rest of this article is organized as follows. We review related works in Section 2 and propose similarity measures in Section 3. In Section 4 and Section 5, we study the approximate string search and approximate string join problems, respectively. We present the high-confidence and low-error estimator in Section 6. Section 7 is dedicated to the extensions of our algorithms for more similarity measures and the weighted tokens. Then Section 8 empirically evaluates the effectiveness and efficiency of all proposed algorithms and the state-of-the-art approaches. Finally, Section 9 concludes this article and sheds light on future work.

2. RELATED WORK

String Similarity Measures. There is a rich set of string-similarity measures, including character n-gram similarity [Kondrak 2005], Levenshtein distance [Xiao et al. 2008; Wang et al. 2012], Jaro-Winkler measure [Winkler 1999], Jaccard similarity [Chaudhuri et al. 2006], TF-IDF-based cosine similarity [Salton and Buckley 1988], and hidden Markov model-based measure [Miller et al. 1999]. A comparison of many string-similarity functions is performed in Cohen et al. [2003]. These metrics can only measure syntactic similarity (or distance) between two strings but cannot capture semantic similarities such as synonyms, acronyms, and abbreviations.

Machine learning-based methods [Bilenko and Mooney 2003; Tsuruoka et al. 2007] have been proposed to improve the quality of traditional syntactic similarity-based approaches. In particular, Bilenko and Mooney [2003] present trainable string-similarity measures to improve the effectiveness of duplicate detection, and Tsuruoka et al. [2007] use logistic regression to learn a string-similarity measure from an existing gene/protein name dictionary. Although these methods are effective in capturing semantic similarities between strings, they are limited to special domains, and their accuracy depends critically on the quality of the training data.

String-Similarity Searches and Joins. State-of-the-art approaches [Gravano et al. 2001; Chaudhuri et al. 2003, 2006; Sarawagi and Kirpal 2004; Bayardo et al. 2007; Li et al. 2012; Wang et al. 2012; Bille 2012] for performing string-similarity searches and joins mainly follow the filtering-and-verification framework. The basic idea is to first employ an efficient filter to prune string pairs that cannot be similar, and then verify the surviving string pairs by computing their real similarity. In the filtering step, there are two methods for generating signatures for each string. (i) The prefix-filtering scheme [Chaudhuri et al. 2006; Bayardo et al. 2007; Xiao et al. 2008; Wang et al. 2012] first transforms each string to a set of tokens. Then the tokens of each string are sorted based on a global ordering, and a prefix set for each string is selected based on a given similarity threshold to be signatures. The signatures can be used to guarantee that if two strings are similar, then their signatures have enough overlap. (ii) The LSH-based scheme [Chaudhuri et al. 2003; Hadjieleftheriou et al. 2008; Xiao et al. 2011] generates the signatures based on the idea of locality-sensitive hashing (LSH) [Broder et al. 1997], that is, each signature is a concatenation of k minhashes of the set s under a minwise independent permutation, and l such signatures are generated; k and l are two user-defined parameters. The main limitation of the LSH approach is that the algorithms are approximate and could miss some correct results. In this article, we will extend these two signature schemes to perform efficient similarity join with synonyms.
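As a rough illustration of the prefix-filtering scheme for plain token sets (without synonyms), the Python sketch below orders tokens by ascending document frequency and keeps, for a Jaccard threshold t and a set of n tokens, the first n − ⌈t·n⌉ + 1 tokens as the signature, which is the standard prefix length for Jaccard similarity. The function names, the tie-breaking rule, the small epsilon guarding against floating-point round-up, and the sample strings are assumptions made for this sketch.

```python
import math
from collections import Counter

def global_order(collection):
    """Rank tokens by ascending document frequency (ties broken alphabetically)."""
    df = Counter(tok for s in collection for tok in set(s.split()))
    ranked = sorted(df, key=lambda t: (df[t], t))
    return {tok: rank for rank, tok in enumerate(ranked)}

def prefix_signature(s, order, threshold):
    """Return the prefix of s (under the global order) used as its signature."""
    toks = sorted(set(s.split()), key=order.__getitem__)
    # Standard Jaccard prefix length: n - ceil(t * n) + 1.
    # The epsilon avoids a spurious round-up such as ceil(7.000000000000001).
    prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
    return set(toks[:prefix_len])

collection = [
    "international conference wireless health 2012",
    "international conference wireless health 2013",
    "database systems conference 2012",
]
order = global_order(collection)
sigs = [prefix_signature(s, order, threshold=0.6) for s in collection]

# Strings whose signatures do not overlap cannot reach the threshold,
# so only the first two strings form a candidate pair here.
print(sigs)
```

Rare tokens come first in the ordering, so signatures tend to consist of selective tokens and fewer dissimilar pairs survive the filter.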

Synonyms Generation Approaches. Synonym pairs can be obtained in many ways, such as users’ existing dictionaries [Tsuruoka et al. 2007], explicit inputs, and synonym mining algorithms [Arasu et al. 2009]. In this article, we assume that all synonym pairs are predefined, and our focus is how to effectively use the semantic information to provide meaningful similarity measures and efficient join algorithms.

Estimation of Set Size. Our work is also related to the estimation of distinct pairs in set-union. This is because, in the filtering phase, we need to estimate the number of candidates (string pairs) in order to dynamically select a signature scheme. There are many solutions proposed for the estimation of the cardinality of distinct elements. For example, in their influential paper, Flajolet and Martin [1985] propose a randomized estimator for distinct-element counting that relies on a hash-based synopsis; to date, the Flajolet-Martin (FM) technique remains one of the most effective approaches for this estimation problem. FM sketch and its variations (e.g., [Alon et al. 1996; Ganguly et al. 2004]) focus on one-dimensional data. In our problem, we need to estimate the distinct number of pairs, where each dimension is based on an independent FM sketch from one join collection. It turns out that this problem is more complicated, and the existing analysis cannot be extended easily. In addition, we note that the estimation of the resulting size of a string-similarity join has been studied recently (e.g., [Lee et al. 2009, 2011]), by using random sampling and an LSH scheme. Note that these techniques cannot be directly applied here, since given a specific filter, our goal is to estimate the cardinality of candidates, which is often much greater than that of final results.
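For readers unfamiliar with the classical one-dimensional FM synopsis mentioned above, the following Python sketch shows its basic form: each item is hashed, the position of the lowest set bit of the hash is recorded in a bitmap, and the count is estimated from the position of the lowest unset bit using the correction factor 0.77351 from Flajolet and Martin [1985]. This is a single-bitmap sketch of the textbook technique, not the two-dimensional estimator developed in this article; the use of BLAKE2b as the hash function and all function names are choices made for the illustration.

```python
import hashlib

PHI = 0.77351  # correction factor from Flajolet and Martin [1985]

def hash64(item, seed=0):
    """Deterministic 64-bit hash of an item (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def trailing_zeros(x, width=64):
    """Index of the least-significant 1-bit of x (width if x is 0)."""
    return (x & -x).bit_length() - 1 if x else width

def fm_bitmap(items, seed=0):
    """Build the FM synopsis: bit i is set if some hash has i trailing zeros."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << trailing_zeros(hash64(item, seed))
    return bitmap

def fm_estimate(bitmap):
    """Estimate the distinct count from the lowest unset bit position."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

a = list(range(100))
b = list(range(50, 150))
# Duplicates do not change the synopsis, and the synopsis of a union
# is the bitwise OR of the individual synopses.
print(fm_bitmap(a + a) == fm_bitmap(a))
print((fm_bitmap(a) | fm_bitmap(b)) == fm_bitmap(a + b))
print(fm_estimate(fm_bitmap(a + b)))
```

A single bitmap gives a coarse estimate (always a power of two divided by the correction factor); practical variants average over many bitmaps built with different seeds.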

Finally, this article extends a previous conference paper [Lu et al. 2013] with the following new contributions. (1) We study a new problem of approximate string search with synonyms (Section 4), and show two algorithms, that is, a search-baseline and an indexed approach QP-search based on a native index (QP-tree); and (2) we add the extensions to several issues (Section 7), including three new similarity functions and the weighted tokens.

3. STRING-SIMILARITY MEASURES

As described in Section 1, an expansion-based framework can efficiently measure the similarity of strings with synonyms to avoid exponential computational complexity. In this section, we describe two expansion-based measures in detail. In particular, Section 3.1 first introduces the preliminaries about synonym rules. Then Section 3.2 and Section 3.3 describe two expansion-based measures: full expansion and selective


and rhs(r) can also refer to the set generated from the respective strings. Let R denote a collection of synonym rules, that is, R = {r: lhs(r) → rhs(r)}. Given a string s, we say a rule r ∈ R is an applicable rule of s if lhs(r) is a substring of s, and we let R(s) denote all applicable rules of s.

Example 3.1. Given two strings, s1 = “SIGMOD Conf 2012 Arizona” and s2 = “ACM SIGMOD Conference 2012 Arizona”, and two rules, r1: Conf → Conference and r2: ACM → Association for Computing Machinery, the applicable rule sets are R(s1) = {r1} and R(s2) = {r2}.

3.2. Full Expansion

Given a string s, suppose a rule r: lhs(r) → rhs(r) ∈ R is an applicable rule of s; we use rhs(r) to expand the token set S, that is, S′ = S ∪ rhs(r). Let sim(s1, s2, R) denote the similarity between s1 and s2 based on R. Let R(s1) and R(s2) denote the applicable rule sets of s1 and s2, respectively. Under the expansion-based framework, sim(s1, s2, R) = f(S′1, S′2), where S′1 and S′2 are expanded from S1 and S2 using some synonym rules in R(s1) and R(s2), and f is a set-similarity function such as Jaccard or cosine.

Based on how applicable synonyms are selected to expand S, there are different kinds of schemes. A simple one is the full expansion, which expands S using all of its applicable synonyms. That is, the expanded set is $S' = S \cup \bigcup_{r \in R(s)} rhs(r)$.

The full expansion scheme is efficient to compute. Let L be the maximum length of strings s1 and s2. The time cost of substring matching to identify the applicable synonym pairs is O(L) using a suffix-trie-based method [Farach 1997]. Then the complexity of the set expansion is O(L + kn), where n is the maximum number of applicable synonym pairs for s1 and s2, and k is the maximum size of the right-hand side of a rule. As we will demonstrate in our experimental section, the full expansion provides a highly competitive performance with other optimized schemes to capture semantics of strings using synonyms.
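The full expansion scheme can be sketched in Python using the strings and rules of Example 3.1. The sketch assumes whitespace tokenization and matches each lhs as a contiguous token sequence with a simple scan rather than the suffix-trie method cited above; the function names and the dictionary layout of the rules are hypothetical choices for illustration.

```python
def applicable_rules(s, rules):
    """Return the rules whose lhs tokens occur contiguously in s."""
    toks = s.split()
    found = {}
    for name, (lhs, rhs) in rules.items():
        l = lhs.split()
        if any(toks[i:i + len(l)] == l for i in range(len(toks) - len(l) + 1)):
            found[name] = (lhs, rhs)
    return found

def full_expansion(s, rules):
    """Token set of s united with the rhs tokens of all applicable rules."""
    expanded = set(s.split())
    for lhs, rhs in applicable_rules(s, rules).values():
        expanded |= set(rhs.split())
    return expanded

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

rules = {
    "r1": ("Conf", "Conference"),
    "r2": ("ACM", "Association for Computing Machinery"),
}
s1 = "SIGMOD Conf 2012 Arizona"
s2 = "ACM SIGMOD Conference 2012 Arizona"

print(sorted(applicable_rules(s1, rules)))  # ['r1']
print(sorted(applicable_rules(s2, rules)))  # ['r2']
sim = jaccard(full_expansion(s1, rules), full_expansion(s2, rules))
print(sim)  # 4 shared tokens out of 10 distinct tokens: 0.4
```

Each string is expanded once, independently of the other, which is what keeps the cost linear in the number of applicable rules.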

3.3. Selective Expansion

We observe that, as mentioned in Section 1, the full expansion cannot guarantee that each applied rule is useful for the similarity measure between two strings. Recall Example 1.2 and Figure 1. The use of rule r4 on t2 is detrimental to the similarity between q1 and t2, as “United Kingdom” does not appear in q1. To address this issue, we propose an optimized scheme, called selective expansion, whose goal is to maximize the similarity value of two expanded sets by a judicious selection of applicable synonyms to apply. Given a collection of synonym rules R and two strings s1 and s2, the selective expansion scheme maximizes f(S′1, S′2). Suppose, without loss of generality, we use the Jaccard coefficient to measure the similarity between S′1 and S′2, which can be extended to other functions. The selective-Jaccard similarity of s1 and s2 based on R (denoted by SJ(s1, s2, R)) is defined as follows:

$$SJ(s_1, s_2, R) = \max_{R(s_1) \subseteq R,\; R(s_2) \subseteq R} \{ Jaccard(S'_1, S'_2) \},$$


Example 3.2 (Continuation of Example 3.1). Based on the full expansion scheme, S′1 = S1 ∪ rhs(r1) = {SIGMOD, Conf, Conference, 2012, Arizona} and S′2 = S2 ∪ rhs(r2) = {ACM, SIGMOD, Conference, 2012, Arizona, Association, for, Computing, Machinery}. Therefore, Jaccard(S′1, S′2) = 4/10. However, based on the selective expansion scheme, we use only r1 to expand S1 to achieve the maximal similarity. Then, S′1 = {SIGMOD, Conference, Conf, 2012, Arizona}. Therefore, SJ(s1, s2, R) = Jaccard(S′1, S2) = 4/5.
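For a handful of rules, the definition of selective-Jaccard can be evaluated directly by trying every subset of applicable rules on each side. The Python sketch below does exactly that on a small made-up instance (the strings and rules are hypothetical and are not taken from Figure 1); it assumes whitespace tokenization and token-level lhs matching, and it is exponential in the number of applicable rules, so it is meant only as a reference for tiny inputs.

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of `items`, as tuples."""
    return chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))

def applicable_rhs(s, rules):
    """rhs token sets of the rules whose lhs tokens occur contiguously in s."""
    toks = s.split()
    out = []
    for lhs, rhs in rules:
        l = lhs.split()
        if any(toks[i:i + len(l)] == l for i in range(len(toks) - len(l) + 1)):
            out.append(set(rhs.split()))
    return out

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def selective_jaccard_bruteforce(s1, s2, rules):
    """Maximum Jaccard over all choices of applied rules (exponential time)."""
    S1, S2 = set(s1.split()), set(s2.split())
    best = 0.0
    for sub1 in powerset(applicable_rhs(s1, rules)):
        E1 = S1.union(*sub1)
        for sub2 in powerset(applicable_rhs(s2, rules)):
            E2 = S2.union(*sub2)
            best = max(best, jaccard(E1, E2))
    return best

rules = [("Intl", "International"), ("Conf", "Conference"),
         ("UK", "United Kingdom")]
s1 = "Intl Conf 2012"
s2 = "International Conference 2012 UK"

best = selective_jaccard_bruteforce(s1, s2, rules)
print(best)  # 0.5: apply both rules to s1 and leave s2 unexpanded
```

On this instance the full expansion yields 3/8, because expanding “UK” only adds tokens that the other string can never match, whereas the best selection reaches 3/6.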

Unfortunately, we can prove that it is NP-hard to compute the selective-Jaccard similarity with respect to the number of applicable synonym rules, by a reduction from the well-known 3SAT problem [Iwama and Tamaki 2004].

THEOREM 3.3. Selective-Jaccard ∈ NP-hard.

The detailed proof can be found in Appendix A. Since selective-Jaccard is NP-hard, we design an efficient greedy algorithm that guarantees optimality under certain conditions, which have been shown to be met in most cases in practice. Before presenting the algorithm, we first define the notion of rule gain, which can be used to estimate the fitness of a rule to strings.

ALGORITHM 1: Computing a Rule Gain
Input: r, s1, s2, R = {ri: lhs(ri) → rhs(ri)}
Output: the rule gain of r, denoted RG(r, s1, s2, R)
1  U ← rhs(r) − S1; // r is applicable to s1, and S1 is the token set of s1
2  if (U = ∅) then
3      return 0;
   end
   // Let R′ ⊆ R denote all applicable rules of s2
4  S′2 ← S2 ∪ ⋃_{ri ∈ R′} rhs(ri); // fully expanding S2
5  G ← (rhs(r) ∩ S′2) − S1;
6  return RG(r, s1, s2, R) = |G| / |U|;

Given two strings s1 and s2, a collection of synonyms R, and a rule r ∈ R, we assume that r is an applicable rule for s1. We denote the rule gain of r by RG(r, s1, s2, R). When s1, s2, and R are clear from the context, we shall simply speak of RG(r). Informally, the rule gain is defined as the number of useful elements introduced by r divided by the number of new elements in r. In particular, we say that a token t ∈ rhs(r) is useful if t belongs to the full expansion set of s2 and t ∉ s1. The reason we use the full expansion set of s2 rather than the original set is that s2 can be expanded using new rules, but all possible new tokens must be contained in the full expansion set. We now go through Algorithm 1. In line 1, we find the new elements U introduced by r. If U = ∅, then the rule gain is zero (lines 2-3). Line 4 fully expands s2, and line 5 computes the useful elements G. Finally, the rule gain RG(r, s1, s2, R) = |G|/|U| is returned (line 6).

Example 3.4 (Continuation of Example 3.1). Since r1 is applicable for s1, we compute RG(r1, s1, s2, R). Note that G = U = {Conference}. Therefore, RG(r1, s1, s2, R) = 1. Further, r2 is applicable for s2, so we compute RG(r2, s2, s1, R). Since U = {Association, for, Computing, Machinery} and G = ∅, RG(r2, s2, s1, R) = 0.
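Algorithm 1 translates almost line by line into Python. The sketch below assumes whitespace tokenization and token-level lhs matching, and represents each rule as a hypothetical (lhs, rhs) pair of strings; it reproduces the two gains computed in Example 3.4.

```python
def is_applicable(lhs, s):
    """True if the tokens of lhs occur contiguously among the tokens of s."""
    l, t = lhs.split(), s.split()
    return any(t[i:i + len(l)] == l for i in range(len(t) - len(l) + 1))

def full_expansion(s, rules):
    """Token set of s united with the rhs tokens of all applicable rules."""
    out = set(s.split())
    for lhs, rhs in rules:
        if is_applicable(lhs, s):
            out |= set(rhs.split())
    return out

def rule_gain(rule, s1, s2, rules):
    """Rule gain of `rule` (applicable to s1) with respect to s2."""
    rhs = set(rule[1].split())
    S1 = set(s1.split())
    U = rhs - S1                       # line 1: new tokens introduced by the rule
    if not U:                          # lines 2-3: nothing new, gain is zero
        return 0.0
    S2_full = full_expansion(s2, rules)    # line 4: fully expand s2
    G = (rhs & S2_full) - S1           # line 5: useful new tokens
    return len(G) / len(U)             # line 6

r1 = ("Conf", "Conference")
r2 = ("ACM", "Association for Computing Machinery")
rules = [r1, r2]
s1 = "SIGMOD Conf 2012 Arizona"
s2 = "ACM SIGMOD Conference 2012 Arizona"

print(rule_gain(r1, s1, s2, rules))  # 1.0
print(rule_gain(r2, s2, s1, rules))  # 0.0
```

A gain of 1 means every token the rule adds can be matched on the other side, while a gain of 0 means the rule only enlarges the union.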

We develop a selective expand algorithm based on rule gains, which is formally described in Algorithm 2. Before doing any expansion, we first compute a candidate set of rules (line 1), which only consists of applicable rules with relatively higher


Jaccard similarity of the current expanded sets S′1 and S′2 (line 8), as its similarity is lower than the “average” similarity and its usage would be detrimental to the final similarity. Subsequently, in line 2, we iteratively expand s1 and s2 using the most gain-effective rule from the candidate sets. Finally, line 3 returns the maximal value between Jaccard(s1, s2) (without the synonyms) and the new similarity θ using synonyms.

Example 3.5. We use Example 3.1 again to illustrate Algorithm 2. First, in line 4, C1 = {r1} and C2 = {r2}. Then S′1 = {SIGMOD, Conf, Conference, 2012, Arizona} and S′2 = {ACM, SIGMOD, Conference, 2012, Arizona, Association, for, Computing, Machinery} (lines 6 and 7). Therefore, α = Jaccard(S′1, S′2) = 4/10 (line 8). Since RG(r1) = 1 and RG(r2) = 0, r2 is removed from C2 (line 11). Therefore, only C1 = {r1} is returned in line 13. Subsequently, in line 19, rhs(r1) is added to S1. Therefore, only one rule r1 is used to expand s1, and the final similarity is Jaccard(S′1, S2) = 4/5.

Some readers may be wondering whether we can skip procedure findCandidateRuleSet in Algorithm 2 and directly use the greedy strategy in line 2 to select rules for expansion based on all applicable rules. Our answer is no, because line 1 selects only part of the applicable rules in order to avoid blind expansion with rules that are not suitable for global optimization.

Example 3.6. We use this example to demonstrate the importance of procedure findCandidateRuleSet in Algorithm 2. Given two short strings, s1 = “a” and s2 = “b,” and five rules, r1: “a” → “b t1,” r2: “a” → “c d e,” r3: “b” → “c t2 t3 t4 t5,” r4: “b” → “d t6 t7 t8 t9,” and r5: “b” → “e t10 t11 t12 t13,” where all items from t1 to t13 are “dirty” words which cannot contribute to the similarity between s1 and s2. It turns out that the maximum similarity is 1/3 = Jaccard(“a b t1”, “b”), which is achieved by applying only the rule r1 on s1. In Algorithm 2, line 1 returns C1 = {r1}, because the rules that introduce t2 to t13 are all removed from Ci in line 11. Therefore, only r1 is used to expand s1. But if line 1 were skipped, then all rules could be used to expand s1 or s2. Since RG(r2) = 1 > RG(r1) = 1/2, r2 is used to expand s1 first. Subsequently, the other four rules are applied as well. Then the final similarity falls to 5/18 < 1/3.

In addition, as shown in the proof of Theorem 3.7, line 10 in procedure findCandidateRuleSet is a key condition that guarantees the optimality of our algorithm in some scenarios.

Performance Analysis. The time complexity of the SE algorithm is $O(L + kn^2)$, where L is the maximum length of s1 and s2, n is the total number of applicable synonyms for s1 and s2, and k is the maximum size of the right-hand side of a rule. More precisely, we use the suffix-trie to find applicable rule sets, and its complexity is O(L). In procedure findCandidateRuleSet, the complexity to perform full expansion (lines 6-7) is O(kn), and the number of iterations is no more than n; thus, the complexity is $O(L + kn^2)$.

3.4. Theoretical Analysis

In this section, we carve out one case and show that SE returns the optimal value and then analyze the reduced time complexity in this optimal case.


ALGORITHM 2: Selective Expansion (SE) Algorithm
Input: s1, s2, R = {r: lhs(r) → rhs(r)}.
Output: Sim(s1, s2, R).
1  C1, C2 ← findCandidateRuleSet(s1, s2, R);
2  θ ← expand(s1, s2, C1, C2);
3  return max(Jaccard(s1, s2), θ);

Procedure findCandidateRuleSet(s1, s2, R)
4  Ci ← {r: r ∈ R ∧ lhs(r) is a substring of si (i = 1 or 2) ∧ RG(r) > 0};
5  repeat
6      S′1 ← S1 ∪ ⋃_{r ∈ C1} rhs(r);
7      S′2 ← S2 ∪ ⋃_{r ∈ C2} rhs(r);
8      α ← Jaccard(S′1, S′2);
9      for (r ∈ C1 ∪ C2) do
10         if (RG(r, s1, s2, C1 ∪ C2) < α/(1 + α)) then
11             Ci ← Ci − r; // r ∈ Ci, i = 1 or 2
12             Recompute S′1, S′2, and α;
           end
       end
   until (no rule can be removed in line 11);
13 return C1, C2;

Procedure expand(s1, s2, C1, C2)
   // Let S′1 and S′2 denote the expanded sets of S1 and S2
14 S′1 ← S1;
15 S′2 ← S2;
16 while (C1 ∪ C2 ≠ ∅) do
17     Find the current most rule-gain-effective rule r ∈ Ci (i = 1 or 2);
18     if (RG(r, s1, s2, C1 ∪ C2) > 0) then
19         S′i ← S′i ∪ rhs(r);
       end
20     Ci ← Ci − r;
   end
21 return Jaccard(S′1, S′2);
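The following Python sketch mirrors the structure of Algorithm 2 and is offered as one possible reading of the pseudocode rather than a reference implementation. It assumes whitespace tokenization, token-level lhs matching, and rules given as hypothetical (lhs, rhs) string pairs. Two interpretation choices are assumptions of this sketch: the sets and α are recomputed before every test in the pruning loop, and in the expansion loop the gain of a rule is evaluated against the current expanded sets together with the remaining candidates. The sample strings and rules are made up for illustration.

```python
def toks(s):
    return set(s.split())

def is_applicable(lhs, s):
    l, t = lhs.split(), s.split()
    return any(t[i:i + len(l)] == l for i in range(len(t) - len(l) + 1))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def expanded(base, rules):
    """Union of a token set with the rhs tokens of the given rules."""
    out = set(base)
    for _, rhs in rules:
        out |= toks(rhs)
    return out

def rule_gain(rule, own, other_full):
    """Fraction of the rule's new tokens that also occur in other_full."""
    rhs = toks(rule[1])
    new = rhs - own
    if not new:
        return 0.0
    return len((rhs & other_full) - own) / len(new)

def find_candidate_rule_sets(s1, s2, rules):
    S = (toks(s1), toks(s2))
    app = ([r for r in rules if is_applicable(r[0], s1)],
           [r for r in rules if is_applicable(r[0], s2)])
    full = (expanded(S[0], app[0]), expanded(S[1], app[1]))
    # Line 4: keep applicable rules with a positive gain.
    cand = [[r for r in app[i] if rule_gain(r, S[i], full[1 - i]) > 0]
            for i in (0, 1)]
    removed = True
    while removed:                      # lines 5-12: prune low-gain rules
        removed = False
        for i in (0, 1):
            for r in list(cand[i]):
                e = (expanded(S[0], cand[0]), expanded(S[1], cand[1]))
                alpha = jaccard(e[0], e[1])
                if rule_gain(r, S[i], e[1 - i]) < alpha / (1 + alpha):
                    cand[i].remove(r)
                    removed = True
    return cand

def expand(s1, s2, cand):
    S = [toks(s1), toks(s2)]
    cand = [list(cand[0]), list(cand[1])]
    while cand[0] or cand[1]:           # lines 16-20: greedy expansion
        gains = [(rule_gain(r, S[i], expanded(S[1 - i], cand[1 - i])), i, r)
                 for i in (0, 1) for r in cand[i]]
        gain, i, r = max(gains, key=lambda g: g[0])
        if gain > 0:
            S[i] |= toks(r[1])
        cand[i].remove(r)
    return jaccard(S[0], S[1])

def selective_expansion(s1, s2, rules):
    cand = find_candidate_rule_sets(s1, s2, rules)
    theta = expand(s1, s2, cand)
    return max(jaccard(toks(s1), toks(s2)), theta)

rules = [("Intl", "International"), ("Conf", "Conference"),
         ("UK", "United Kingdom")]
sim = selective_expansion("Intl Conf 2012",
                          "International Conference 2012 UK", rules)
print(sim)  # 0.5: the rule for "UK" has zero gain and is never applied
```

On this instance the greedy result coincides with the optimum obtained by exhaustive enumeration; the pruning test uses the threshold α/(1 + α) because adding a rule raises the Jaccard value of the current sets exactly when its gain exceeds that ratio.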

THEOREM 3.7. Given two strings s1 and s2 and their applicable rule sets R(s1) and R(s2), we call a rule r ∈ R(si) (i = 1 or 2) useful if RG(r) > 0. We can guarantee that SE is optimal if the rhs tokens of all useful rules are distinct.

The proof of Theorem 3.7 can be found in Appendix B. Note that the condition in Theorem 3.7 does not require all the rhs tokens to be distinct. It only requires those from useful synonyms of s1 and s2 to be distinct, which are typically a very small subset. In addition, under this condition, the time complexity can be reduced to O(L + n log n + nk). The reason is that the complexity to perform full expansion (lines 6∼7) is reduced to O(1), since the tokens are distinct, and that of expanding (lines 16∼20) is O(n log n), since the rule gain of each rule does not change and the rules can therefore be kept in a max-heap. Finally, the Jaccard computation in line 21 needs O(L + kn) time in the worst case. Therefore, the complexity of the SE algorithm is O(L + n log n + nk).

4. STRING SIMILARITY SEARCH

In this section, we study the similarity search problem. That is, given a query string q, a collection of strings T, a collection of synonyms R, and a similarity threshold θ, a similarity search finds all strings t ∈ T such that sim(q, t, R) ≥ θ, where sim is one of the similarity functions in Section 3. Two search algorithms are proposed in the following sections.

Fig. 3. An example to illustrate the generation of signatures (threshold = 0.9).

4.1. Preliminary on Prefix Filters

A naïve algorithm for similarity search is to enumerate every record in the table and compare it with the query string. This method is obviously prohibitively expensive for large datasets. Efficient algorithms follow the filtering-and-verification framework, that is, given a query q and a table T, one or multiple filters are utilized to prune away many records in T, and then verification is invoked for the records that remain after filtering. Clearly, the stronger the filtering power of a method, the larger its performance gain over the naïve algorithm. In the literature, the current modus operandi is the prefix filter, which is based on the intuition that if two canonicalized records are similar, some fragments of them must overlap with each other, as otherwise the two records would not have enough overlap. This intuition is formally captured by the prefix-filtering principle [Chaudhuri et al. 2006], rephrased here.

LEMMA 4.1 (PREFIX FILTER PRINCIPLE [CHAUDHURI ET AL. 2006]). Given an ordering O of the token universe U and two strings s and t, each with tokens sorted in the order of O, if Jaccard(s, t) ≥ θ, then the first (1 − θ)|s| smallest tokens of s and the first (1 − θ)|t| smallest tokens of t must share at least one token.
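The principle can be sketched in Python as follows. This is only an illustration under assumptions that are not part of the lemma as stated: the prefix length is taken as |s| − ⌈θ|s|⌉ + 1 (a common integer form of the prefix bound for Jaccard), the global ordering is passed in as a rank dictionary `order`, and the helper names `prefix`, `jaccard`, and `may_match` are hypothetical.

```python
import math

def prefix(tokens, theta, order):
    """Smallest tokens of a set under the global ordering `order` (token -> rank)."""
    ranked = sorted(set(tokens), key=lambda w: order[w])
    n = len(ranked)
    # Assumed prefix length |s| - ceil(theta * |s|) + 1; the small epsilon guards
    # against floating-point round-up and can only lengthen the prefix.
    k = n - math.ceil(theta * n - 1e-9) + 1
    return set(ranked[:k])

def jaccard(s, t):
    """Jaccard similarity of two non-empty token collections."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

def may_match(s, t, theta, order):
    """False means the pair is pruned: Jaccard(s, t) < theta is guaranteed."""
    return bool(prefix(s, theta, order) & prefix(t, theta, order))

order = {w: i for i, w in enumerate("abcdef")}
print(may_match("abcde", "abcdf", 0.6, order))  # prefixes overlap, pair must be verified
print(may_match("ab", "ef", 0.6, order))        # disjoint prefixes, pair is pruned
```

A pair that passes the filter is only a candidate; the similarity still has to be verified.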

4.2. Search-Baseline Method

The first search algorithm we propose here is a prefix-filter-based approach, called search-baseline, following the filtering-and-verification framework. That is, in the filtering step, it generates candidate strings by identifying all strings t such that the signatures of q and t overlap. In the verification step, it checks if sim(q, t, R) ≥ θ for each candidate and outputs those that satisfy the similarity predicate.

To generate the signatures, given a string s, the signatures of s are a subset of tokens in s containing the (1 − θ)|s| smallest elements of s according to a predefined global ordering. We therefore generate signatures for a string s as follows. Let R′ ⊆ R be the set of n applicable synonym pairs for s. For each subset Ri of R′, we generate an expanded set Si. For each Si, we obtain one signature set (denoted by Sig(Si)). Note that the number of expanded sets Si is 2ⁿ. The number is exponential because we enumerate all possible expanded sets for s with n synonym pairs. Finally, we generate a signature set for s by including all signatures of the Si:

$$\mathrm{Sig}(s, R) = \bigcup_{i=1}^{2^n} \mathrm{Sig}(S_i).$$

Note that this generation of signatures can be performed offline.
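The enumeration over the 2ⁿ expanded sets can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: the global ordering is alphabetical, the prefix length is |S| − ⌈θ|S|⌉ + 1, a rule is a hypothetical pair (lhs tokens, rhs tokens), and a rule counts as applicable when all of its lhs tokens occur in the string.

```python
import math
from itertools import combinations

def prefix(tokens, theta):
    """Signature prefix under an assumed alphabetical global ordering."""
    ranked = sorted(tokens)
    n = len(ranked)
    k = n - math.ceil(theta * n - 1e-9) + 1   # assumed integer prefix length
    return set(ranked[:k])

def signatures(tokens, rules, theta):
    """Sig(s, R): union of the prefixes of all 2^n expanded sets of s."""
    tokens = set(tokens)
    # A rule (lhs, rhs) is treated as applicable if all lhs tokens occur in s.
    applicable = [rhs for lhs, rhs in rules if set(lhs) <= tokens]
    sig = set()
    for size in range(len(applicable) + 1):
        for subset in combinations(applicable, size):
            expanded = tokens.union(*subset)   # expanded set S_i for this subset
            sig |= prefix(expanded, theta)
    return sig

rules = [(("b",), ("a",)), (("z",), ("y",))]   # hypothetical synonym pairs
print(signatures({"b", "c"}, rules, 0.9))
```

Because the loop is exponential in the number of applicable rules, it is only practical when that number is small, which matches the remark that the signatures are generated offline.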

LEMMA 4.2. Given two strings s1 and s2, a threshold value θ, and a collection of synonym pairs R, if Sig(s1, R) ∩ Sig(s2, R) = ∅, then sim(s1, s2, R) < θ, where sim denotes any of the similarity functions proposed in Section 3.

PROOF. Sig(s1, R) ∩ Sig(s2, R) = ∅ ⟹ for any S1i, S2j expanded from s1 and s2, the signature elements of S1i and the signature elements of S2j have no common tokens ⟹ the upper bound of the similarity is < θ, as desired.

Fig. 4. The structure of I-list.

Fig. 5. An example to illustrate I-list.

Example 4.3. This example illustrates the generation of signatures with synonyms. Consider the string q and synonyms in Figure 3(a). The applicable synonyms for q are r5 and r6, and the expanded sets of q are shown in Figure 3(b). For each set, we choose the (1 − 0.9)|s| tokens to be the signatures for the set, which are presented in Figure 3(b). Finally, the signature of q is computed as the union of these signatures, that is, Sig(q, R) = {SIGMOD, Computing, Conference, on}.

In our implementation, in order to avoid the most time-consuming operation, that is, checking the signature overlap for each string pair to find the candidates, we design a signature inverted list, called I-list (see Figure 4). Each token ti is associated with

an inverted list including all string IDs whose signatures contain the token ti. Thus,

given a signature ti, we can directly obtain all relevant strings without any

signature-overlapping check.

The search-baseline algorithm operates in two steps in the filtering phase. In the first step, for each signature g of query q, function getIList is employed to get the I-list whose key is g. In the second step, the I-lists are merged to obtain the final candidates. Then, in the verification phase, either of the two measures proposed in Section 3 can be applied to check the candidates.
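A minimal sketch of the I-list and of the two filtering steps is given below. The dictionary-based layout, the function names, and the signature sets of q1 to q4 are illustrative assumptions (the actual signatures are those of Figure 5, which is not reproduced here); only the name getIList comes from the text.

```python
from collections import defaultdict

def build_ilist(sig_by_id):
    """Map each signature token to the set of string IDs whose signatures contain it."""
    ilist = defaultdict(set)
    for sid, sigs in sig_by_id.items():
        for g in sigs:
            ilist[g].add(sid)
    return ilist

def get_ilist(ilist, g):
    """Counterpart of getIList: the I-list whose key is g (empty if g is unknown)."""
    return ilist.get(g, set())

def baseline_candidates(query_sigs, ilist):
    """Step 1: fetch one I-list per query signature. Step 2: merge them."""
    candidates = set()
    for g in query_sigs:
        candidates |= get_ilist(ilist, g)
    return candidates

# Hypothetical signatures for four records.
table_sigs = {
    "q1": {"Conference", "on"},
    "q2": {"Conference"},
    "q3": {"Conference"},
    "q4": {"VLDB"},
}
ilist = build_ilist(table_sigs)
print(sorted(baseline_candidates({"SIGMOD", "Computing", "Conference", "on"}, ilist)))
```

The merged set is the candidate set only; each candidate is still checked against the similarity predicate in the verification phase.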

Example 4.4. This example illustrates the search-baseline method. Consider the query q in Figure 3 and the table in Figure 5(a). Search-baseline first computes signatures for both the query and the strings in the table. Second, an I-list of the signatures is constructed, as shown in Figure 5(b). Then search-baseline uses the query signatures {SIGMOD, Computing, Conference, on} to find the corresponding inverted lists, that is, {q1, q2, q3} and {q1}. Finally, only q1 is returned as the query answer based on the selective-expansion similarity function.

4.3. QP-Search Algorithm

In this section, we propose two novel indexes, one for query strings and one for the records in the table, to improve the efficiency of the search-baseline method.


Fig. 6. The structure of QP-tree and S-directory.

Fig. 7. An example to illustrate the indexes. The column L and F denote the number of tokens in the string and in the full expanded set, respectively.

QP-Tree for the Query q. For the query string q, instead of computing the union signatures Sig(q, R), we consider the signatures Sig(Qi) for each expanded set Qi. Note that the length |Sig(Qi)| of the expanded sets here can be utilized to enhance the filtering power. We build a query pattern tree, called QP-tree (see Figure 6(a)), which has two levels: the nodes in the first level contain two fields ⟨l, p⟩, where l is an integer denoting the size of an expanded set of query q, and p is a pointer to a signature set. For example, the QP-tree in Figure 7(b) is built for the query in Figure 7(a). The QP-tree is built on the fly for each incoming query, which is fast.

SI-Tree for the Collection of Strings. We extend the idea of the length filter [Sarawagi and Kirpal 2004; Bayardo et al. 2007] to reduce the number of candidates. Intuitively, given two strings s1 and s2 and a threshold value θ, if the size of the full expansion set of s1 is much less than the size of s2, then without checking the signatures of s1 and s2, one can claim that sim(s1, s2, R) must be smaller than θ. Based on this observation, we propose a data structure, called S-directory, to organize the strings by their lengths. The S-directory (see Figure 6(b)) is a tree structure with three levels, and nodes in different levels can be categorized into different kinds.

—Root Entry. A node in level L0 is a root entry, which has multiple pointers to the nodes in level L1.

—Fence Entry. A node in level L1 is called a fence entry and contains three fields ⟨u, v, p⟩, where u and v are integers, and p is a set of pointers to the nodes in L2. In particular, u denotes the number of tokens of a string and v is the maximal number of tokens in the fully-expanded sets of strings whose length is u.


—Leaf Entry. A node in level L2 is called a leaf entry and contains two fields ⟨t, p⟩, where t is an integer denoting the number of tokens in the fully-expanded set of a string whose length is u, and p is a pointer to an inverted list. As leaf entries are organized in the ascending order of t, v_i = t_im (see Figure 6(b)).

We argue that S-directory can easily fit in the main memory, as the size of S-directory is only O((v_max − u_min)²), which is quadratic in the difference between the maximal size of expanded sets and the shortest length of records. For example, in our implementation, the size of S-directory for a table with one million strings on a USPS dataset is only about 2.15KB.

To maximize the pruning power, we combine I-list and S-directory. That is, for each leaf entry ⟨t, p⟩, p points to a set of signatures (from the strings whose full expansion set has size t) associated with multiple I-lists. Thus, we have a combined index, called SI-tree, in which each leaf entry in S-directory is associated with an I-list. For example, Figures 7(d) and 7(e) show the S-directory and SI-tree, respectively, for the data in Figure 7(c).
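One possible in-memory layout for the S-directory is sketched below. The nested dictionaries, the field names, and the function name `build_s_directory` are assumptions made for illustration; for brevity each leaf stores string IDs directly, whereas in an SI-tree the leaf would point to I-lists keyed by signature.

```python
from collections import defaultdict

def build_s_directory(records):
    """Group records (string_id, u, t) into fence entries <u, v, leaves>.

    u is the number of tokens of the string and t the number of tokens
    in its fully-expanded set.
    """
    by_length = defaultdict(lambda: defaultdict(list))
    for sid, u, t in records:
        by_length[u][t].append(sid)
    directory = []
    for u in sorted(by_length):
        sizes = sorted(by_length[u])             # leaf entries in ascending order of t
        directory.append({
            "u": u,
            "v": sizes[-1],                      # v equals the largest t below the fence
            "leaves": [{"t": t, "ids": by_length[u][t]} for t in sizes],
        })
    return directory

records = [("a", 2, 5), ("b", 2, 7), ("c", 3, 3), ("d", 2, 5)]   # hypothetical data
for fence in build_s_directory(records):
    print(fence)
```

Since the number of fence and leaf entries depends only on the distinct (u, t) combinations, the directory stays small even for large tables.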

ALGORITHM 3: QP-Search Algorithm
Input: QP-tree for query q and SI-tree for table T, a threshold θ, a collection of synonyms R
Output: all records r ∈ T : sim(q, r, R) ≥ θ
1 C ← ∅, R ← ∅;
2 for (∀ F in SI, ∀ L in QP) do
  // Fence-entry: F = ⟨u, v, P⟩, Length-entry: L = ⟨l, P⟩
3   if (F.u × θ ≤ L.l ≤ F.v / θ) then
4     for (∀ E ∈ F.P) do
      // Leaf-entry: E = ⟨t, P⟩
5       if (min(E.t, L.l) ≥ max(E.t, L.l) × θ) then
6         for (∀ g is an overlapping signature between L.P and E.P) do
7           C′ = E.P→getIList(g);
8           C = C ∪ C′;
          end
        end
      end
    end
  end
9 for (∀ r in C) do
10   if sim(q, r, R) ≥ θ then R = R ∪ r;
   end
11 Return R;

Based on the SI-tree and QP-tree, we design an algorithm (called QP-search) to quickly select candidate strings for verification. See Algorithm 3, where lines 3 and 5 select the desired entries by length filtering. Then, in lines 6∼8, for each signature g that is an overlapping signature between the query and the table, function getIList returns the I-list whose key is g. Finally, in line 9, QP-search returns the candidate strings for query q.
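The filtering phase (lines 1∼8 of Algorithm 3) can be sketched as follows. The data layout is assumed for illustration: the QP-tree is a list of (l, signature set) pairs, and the SI-tree is a list of fence dictionaries whose leaves carry a hypothetical `ilist` dictionary from signature to string IDs. The sizes 2, 3, and 7 mirror the line-3 test quoted in Example 4.5, but the remaining entries are made up.

```python
def qp_search_candidates(qp_tree, si_tree, theta):
    """Length filtering (lines 3 and 5) followed by I-list lookups (lines 6-8)."""
    candidates = set()
    for fence in si_tree:
        for l, query_sigs in qp_tree:
            if not (fence["u"] * theta <= l <= fence["v"] / theta):   # line 3
                continue
            for leaf in fence["leaves"]:
                if min(leaf["t"], l) < max(leaf["t"], l) * theta:      # line 5
                    continue
                for g in query_sigs.intersection(leaf["ilist"]):       # line 6
                    candidates |= leaf["ilist"][g]                     # lines 7-8
    return candidates

# Hypothetical indexes.
si_tree = [
    {"u": 2, "v": 7, "leaves": [{"t": 7, "ilist": {"on": {"q1"}}}]},
    {"u": 9, "v": 11, "leaves": [{"t": 11, "ilist": {"on": {"q9"}}}]},
]
qp_tree = [(3, {"on"}), (7, {"on", "Conference"})]
print(qp_search_candidates(qp_tree, si_tree, 0.9))
```

In this toy input the query entry of size 3 passes the fence test (2 × 0.9 ≤ 3 ≤ 7/0.9) but fails the leaf test, so only the entry of size 7 contributes a candidate.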

Example 4.5. We use this example to illustrate the QP-search algorithm. Consider the query and the table in Figure 7. The QP-tree and SI-tree are shown in Figures 7(b) and 7(e). The query is q = “ACM SIGMOD 2013,” and the threshold θ = 0.9. Entries L1 and F1 satisfy the condition in line 3 (2 × 0.9 ≤ 3 ≤ 7/0.9), but L1 and E2 cannot pass the test in line 5. Next, the entry L2 satisfies the conditions in both lines 3 and 5 with F2 and E3, respectively. Unfortunately, there are no overlapping signatures between L2 and E3. Then entry L3 cannot pass the length filter. Finally, consider entries L4 and E4, both of which satisfy the conditions and have overlapping signatures “Conference” and “on.” Subsequently, we obtain the candidate, that is, q1. To demonstrate the advantage of the QP-search algorithm over search-baseline, note that in the filtering phase, search-baseline returns three candidates (see Example 4.4), whereas QP-search returns only one candidate. Finally, the answer is “2013 ACM Intl Conf on Management Data,” which refers to the same conference as q.

Fig. 8. An example to illustrate FSI-trees.

4.4. Extensions for Flexible Query Thresholds

In the previous sections, our model assumes that the search threshold is fixed and that only the query can be changed online. However, in practice, users might change the threshold at query time. Therefore, we now move to a more general case, where both the search string and the threshold are flexible at query time. The new challenge here is that we do not know the threshold in advance, and thus we cannot determine the number of signatures of each record. A naïve method is to compute the signatures of records in the table online according to the given threshold, which is clearly prohibitively expensive. Next, we build a new index called the FSI-tree (flexible signature indexing) by extending SI-trees to solve this problem.

Suppose that all meaningful thresholds lie in the range between 0.50 and 0.99. Then we select some representative thresholds (e.g., 0.95, 0.90, etc.). For each representative threshold, we generate signatures for each record: see an example in Figure 8(a). Note that the signatures of a string for lower thresholds are guaranteed to become signatures of that string for higher thresholds. To build an FSI-tree, the length and fence entries of SI-trees remain the same, but each fence entry points to a set of I-lists which come from all representative thresholds. Further, each element in the I-list of a signature token s is a binary tuple (q, θ), where q is a record ID and θ is the minimal threshold for which this signature s appears in q. For example, in Figure 8(b), the token “Computing” is a signature of q1 for all thresholds ≥ 0.5.

We extend the QP-search algorithm for flexible thresholds. The algorithm almost remains the same, the only change (see Algorithm 4) being that we select the string candidates in I-lists by checking their thresholds (line 3).

An astute reader may notice that our method possibly introduces more candidates because of the gap between representative thresholds and online thresholds. For example, given a query threshold 0.83, suppose that the closest representative threshold is 0.80. Then the number of signatures for threshold 0.80 may be greater than that for 0.83, but we argue that this has little impact on the final performance of query processing. To understand this, assume that the gap between two representative thresholds is no more than 0.05 (that is, only 11 representative thresholds are “materialized” with signatures between 0.99 and 0.50). It can be proved that given a string s, the difference between the numbers of signatures for thresholds θ1 and θ2 is |θ1 − θ2| · |s|. Considering a string s with |s| = 10, we have 0.05 × 10 = 0.5, that is, for most records in the table, the extra number of signatures due to the threshold gap is bounded by 0.5. Therefore, as our experimental results show in Section 8.2.2, the performance of our algorithms for flexible thresholds is comparable to that for static thresholds.

ALGORITHM 4: Extended QP-Search for Flexible Thresholds
Input: QP-tree for query q and FSI-tree for table T, a threshold θ, a collection of synonyms R.
Output: all records r ∈ T : sim(q, r, R) ≥ θ
1 Run lines 1∼7 in Algorithm 3;
  // C′ is the I-list fetched in line 7 of Algorithm 3
2 for (∀ (q, θq) ∈ C′) do
3   if (θq ≥ θ) then
4     C = C ∪ {q};
    end
  end
5 Continue lines 9∼11 in Algorithm 3;

5. STRING SIMILARITY JOINS

In this section, we formulate the string-similarity join problem and develop the corresponding algorithms. Given two collections of strings S and T, a collection of synonyms R, and a similarity threshold θ, a string-similarity join finds all string pairs (s, t) ∈ S × T such that sim(s, t, R) ≥ θ, where sim is one of the similarity functions in Section 3. Two join algorithms are proposed in the following sections.

5.1. Join-Baseline Method

The join-baseline method (see Algorithm 5) follows the filtering-and-verification framework. In the filtering phase, it generates signatures for each string and then filters candidates by checking the overlaps of the signatures. Note that signatures are generated in the same way as in the search-baseline method. In the verification phase, it verifies the candidates by employing either of the two measures in Section 3.

ALGORITHM 5: Join-Baseline Algorithm
Input: Is and It // Is and It are the I-list indexes for tables S and T, respectively
Output: C: candidate string pairs (s, t) ∈ Is × It
1 C ← ∅;
2 for (∀ g ∈ Is ∩ It) do // g is a signature
3   Ls = Is→getIList(g);
4   Lt = It→getIList(g);
5   C′ = {(s, t) | s ∈ Ls, t ∈ Lt};
6   C = C ∪ C′;
  end
7 return C;
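The candidate generation of the join-baseline method can be sketched in a few lines. The I-list indexes are modelled as plain dictionaries from signature to string IDs, and the function name and sample lists are hypothetical.

```python
from itertools import product

def join_baseline(ilist_s, ilist_t):
    """Pair up the IDs of both tables for every signature the two indexes share."""
    candidates = set()
    for g in ilist_s.keys() & ilist_t.keys():             # common signatures
        candidates |= set(product(ilist_s[g], ilist_t[g]))
    return candidates

# Hypothetical I-lists for two tables.
ilist_s = {"Conference": {"q1", "q2"}, "Conf": {"q2"}}
ilist_t = {"Conference": {"q3"}, "Conf": {"q3"}, "VLDB": {"q4"}}
print(sorted(join_baseline(ilist_s, ilist_t)))
```

Collecting the pairs in a set removes the duplicates that arise when two strings share more than one signature.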

Example 5.1. We use this example to illustrate the join-baseline algorithm. Consider a self join on the table in Figure 7(c), with θ = 0.8. The corresponding signatures are listed in Figure 7(c). In the filtering phase, the join-baseline method generates candidates (q1, q2), (q1, q3), (q2, q3), and {(qi, qi) | i ∈ [1, 4]}, since the signatures of those pairs overlap.


ALGORITHM 6: SI-Join Algorithm
Input: SI-trees SI_S for S and SI_T for T, threshold value θ, a collection of synonym pairs R.
Output: C: candidate string pairs (s, t) ∈ SI_S × SI_T
1 C ← ∅;
2 for (∀ Fs in SI_S, ∀ Ft in SI_T) do
  // Fence-entry: F = ⟨u, v, P⟩
3   if (min(Fs.v, Ft.v) ≥ θ × max(Fs.u, Ft.u)) then
4     for (∀ Es ∈ Fs.P, ∀ Et ∈ Ft.P) do
      // Leaf-entry: E = ⟨t, P⟩
5       if (min(Es.t, Et.t) ≥ θ × max(Fs.u, Ft.u)) then
6         for (∀ g is an overlapping signature pointed by Es.P and Et.P) do
7           Ls = Es.P→getIList(g);
8           Lt = Et.P→getIList(g);
9           C′ = {(s, t) | s ∈ Ls, t ∈ Lt};
10          C = C ∪ C′;
          end
        end
      end
    end
  end
11 return C;

SI-Join. The SI-join algorithm (see Algorithm 6) comprises three steps in the filtering phase. In the first step (lines 2∼3), consider two fence entries Fs and Ft. If the maximal size of expanded sets of strings below Fs (or Ft) is still smaller than θ times the length of the original strings below Ft (or Fs) (i.e., u), then all pairs of strings below Fs and Ft can be pruned safely. Similarly, in the second step (lines 4∼5), given two leaf entries Es and Et, where all expanded sets have the same size, if the size of expanded sets below Es (or Et) is smaller than θ times the length of the original strings below Et (or Es), then all string pairs below Es and Et can be pruned safely. In the third step (lines 6∼10), for each signature g that is an overlapping signature pointed to by Es.P and Et.P, function getIList returns the I-list whose key is g. Then SI-join generates string pair candidates for every two I-lists.
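The three filtering steps can be sketched as follows. As before, the dictionary layout of fence and leaf entries and the function name are assumptions for illustration; the two fences below reuse the sizes (u, v) = (2, 7) and (9, 11) from the pruning test min(7, 11) < 0.8 × max(2, 9), while their I-lists are made up.

```python
from itertools import product

def si_join_candidates(si_s, si_t, theta):
    """Fence-level pruning, leaf-level pruning, then pairing of I-lists."""
    candidates = set()
    for fs, ft in product(si_s, si_t):
        bound = theta * max(fs["u"], ft["u"])
        if min(fs["v"], ft["v"]) < bound:                 # step 1 (lines 2-3)
            continue
        for es, et in product(fs["leaves"], ft["leaves"]):
            if min(es["t"], et["t"]) < bound:             # step 2 (lines 4-5)
                continue
            for g in es["ilist"].keys() & et["ilist"].keys():           # line 6
                candidates |= set(product(es["ilist"][g], et["ilist"][g]))  # lines 7-10
    return candidates

f1 = {"u": 2, "v": 7, "leaves": [{"t": 7, "ilist": {"Conf": {"a"}}}]}
f3 = {"u": 9, "v": 11, "leaves": [{"t": 11, "ilist": {"Conf": {"b"}}}]}
print(sorted(si_join_candidates([f1, f3], [f1, f3], 0.8)))
```

In this self join the cross pairs between the two fences are pruned in step 1 without touching any I-list, even though both leaves share the signature "Conf".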

Example 5.2. We use this example to illustrate the SI-join algorithm. Consider a self join on the table in Figure 7(a) with θ = 0.8 and synonyms in Figure 7(b). The corresponding SI-indices are shown in Figure 7(e). SI-join prunes all the string pairs below F1 and F3, since min(7, 11) < 0.8 × max(2, 9) (lines 2∼3). Similarly, all pairs below (E1, E3) can be pruned (lines 4∼5). Then SI-join uses the signatures “Conference” and “Conf” to get I-lists below (E2, E3) (lines 7∼8), since they are overlapping signatures pointed to by E2 and E3 (line 6). Then a candidate (q2, q3) is generated and added to C (lines 9∼10). The final results are (q2, q3) and {(qi, qi) | i ∈ [1, 4]}. To demonstrate the superiority of SI-join, we compare the numbers of candidates for the two algorithms: join-baseline outputs seven candidates, while SI-join returns five candidates.

Discussions on the Updates of Synonyms. Finally, we consider the scenario in which we have performed a join on a large dataset and then realize that there are some new synonym pairs that should be added into the synonym collection. In this case, we give an incremental algorithm that avoids rerunning the algorithms against the whole datasets. It consists of the following three steps: (1) scan the tables to find the strings for which the new synonyms are applicable; (2) update the fence entries, leaf entries, and I-lists of the SI-trees correspondingly; and (3) rerun the SI-join algorithm, considering only the entries that have been updated. It is worth pointing out that newly-added synonyms can only increase the size of the results for a similarity join with selective expansion. Full expansion has no such property: newly-added synonyms may decrease the similarity between two strings, and thus all affected string pairs have to be recomputed.

6. ESTIMATION-BASED SIGNATURE SELECTION

6.1. Motivation

The previous algorithms generate signatures for each string according to a global order of tokens. In theory, if N denotes the number of distinct tokens, there are N! ways to permute the tokens to select signatures. A question then arises: which orders are better for the approximate join algorithms? It is impractical to provide an order which is always suitable to all datasets. Therefore, the remaining problem of our model is to study which order of tokens should be used for the two specific join tables.

To address this problem, our main idea is to use multiple ways to generate signatures offline and then quickly estimate the quality of each signature filter in order to select the best one to perform the actual filtering. The main challenge here is twofold. First, we need to generate multiple promising filters offline. Second, the estimator should quickly select the most effective filter online.

To address the first challenge, we use the itf-based method [Arasu et al. 2008; Li et al. 2008] to generate filters. In particular, let tf(w) denote the number of occurrences of token w in the join tables, that is, the frequency of w. Then the inverse term frequency of the token w, itf(w), is defined as 1/tf(w). The global order is arranged in descending order of itf values. Intuitively, tokens with a high itf value are rare tokens in the collection and have a high probability of being chosen as signatures. Since tokens come from two sources, tables and rules, there are multiple ways to compare the frequency of tokens, as follows. (1) ITF1: sort the tokens from data tables and synonyms separately by decreasing itf, then arrange tokens from tables first, followed by those from synonyms; (2) ITF2: sort separately again, but synonyms first, tables second; (3) ITF3: sort all the tokens together by decreasing itf; (4) ITF4: generate all possibilities of expanded sets for each string, then sort the tokens by decreasing itf in all the expanded sets.
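The first three orderings can be sketched as follows. Several details are assumptions made only for this illustration: frequencies are counted within each source for ITF1 and ITF2, ties are broken alphabetically, a token occurring in both sources keeps its first position, and all function names are hypothetical. ITF4 is omitted because it additionally requires enumerating the expanded sets.

```python
from collections import Counter

def by_decreasing_itf(tokens):
    """Distinct tokens sorted by decreasing itf = 1 / tf (rare tokens first)."""
    tf = Counter(tokens)
    return sorted(tf, key=lambda w: (-1.0 / tf[w], w))   # alphabetical tie-break

def concat_orders(first, second):
    """Order `first` ahead of `second`, dropping tokens already placed."""
    seen = set(first)
    return first + [w for w in second if w not in seen]

def itf1(table_tokens, rule_tokens):
    return concat_orders(by_decreasing_itf(table_tokens), by_decreasing_itf(rule_tokens))

def itf2(table_tokens, rule_tokens):
    return concat_orders(by_decreasing_itf(rule_tokens), by_decreasing_itf(table_tokens))

def itf3(table_tokens, rule_tokens):
    return by_decreasing_itf(list(table_tokens) + list(rule_tokens))

table_tokens = ["conf", "conf", "vldb", "on", "on", "on"]   # hypothetical token occurrences
rule_tokens = ["conference", "conf"]
print(itf1(table_tokens, rule_tokens))
print(itf2(table_tokens, rule_tokens))
print(itf3(table_tokens, rule_tokens))
```

Even on this tiny input the three orderings differ, which is the effect the subsequent experiments measure at scale.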

In order to give an empirical study of the effect of different signatures on filtering power, we perform experiments with ITF1∼ITF4. For ease of presentation, we employ the number of candidates to demonstrate the filters’ filtering power. The filtering power is measured by ψ = |C|/|N|, where |C| denotes the number of pruned pairs and |N| is the total number of string pairs. We then show some attractive experimental results for string-similarity joins.

We perform a set of experiments using ITF1∼ITF4 against two datasets: CONF and USPS data. On the CONF dataset, we perform a join between two tables S and T using a synonym table R. Table S contains 1,000 abbreviations of conference names, while table T has 1,000 full expressions of conference names. On the USPS dataset, we perform a self join on a table containing one million records, each of which contains a person name, a street name, a city name, etc., using 284 synonyms. The detailed description of the datasets can be found in Section 8.


Fig. 9. Filtering powers of different signature filters.

Figure 9 reports the experimental results. We observe that (i) the four signature filters achieve quite different filtering powers. For instance, on the CONF dataset, when the join threshold is 0.7, the filtering power of ITF1 is 89.2%, while that of ITF2 is only 53.1%; and (ii) no signature filter always performs the best on every dataset. For example, ITF1 performs the best on the USPS dataset, but it performs the worst on the CONF dataset, the main reason being that ITF1 gives a higher priority to the tokens in the data table to become signatures, but there are many overlaps in these signatures in the CONF dataset, resulting in a large number of candidates.

These results suggest that it is unrealistic to find a one-size-fits-all solution to selecting signatures. Therefore, we propose an estimator to effectively estimate the filtering power of different filters in order to dynamically select the best filter.

6.2. Estimation-Based Selection Algorithms for String Similarity Joins

Let (S, g) = {q_1^g, …, q_n^g} denote a list of string IDs (i.e., I-lists in Figures 6 and 7) associated with the signature token g from the table S. Then an ID pair (q_i^g, q_j^g) ∈ (S, g) × (T, g) is an element of the cross product of the two lists. Let G denote the set of all common signatures between S and T. Therefore, our problem is to estimate the cardinality of the union of all ID pairs from all signatures in G, that is, ⋃_{g∈G} (S, g) × (T, g). Before we describe our approach towards this goal, let us examine the lower and upper bounds of this cardinality.

LEMMA 6.1. Given two tables S and T for similarity join and a signature filter λ, the upper and lower bounds of the cardinality of the candidate pairs are

$$\mathrm{UB}(\lambda) = \sum_{g \in G} \bigl( |(S, g)| \times |(T, g)| \bigr)$$

and

$$\mathrm{LB}(\lambda) = \max\{\, |(S, g)| \times |(T, g)| : g \in G \,\},$$

where |(·, g)| denotes the number of IDs in a list and G is the set of overlapping signatures between S and T.

Example 6.2. We use this example to illustrate this lemma. Recall Figure 7(c) and assume that there are two tables S and T, where S = {q1, q2} and T = {q3, q4}. Then G = {Conference, Conf}. Thus, (S, Conference) = {q1, q2}, (T, Conference) = {q3}, (S, Conf) = {q2}, and (T, Conf) = {q3}. Therefore, according to Lemma 6.1, UB(λ) = 2 + 1 = 3 and LB(λ) = max{2, 1} = 2.
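The two bounds, together with the exact cardinality they bracket, can be computed as in the sketch below. It uses the lists of this example, taking the list for Conf in T to be {q3}; the dictionary layout and function names are assumptions.

```python
from itertools import product

def bounds(lists_s, lists_t):
    """Return (UB, LB) of Lemma 6.1 for I-lists keyed by signature."""
    common = lists_s.keys() & lists_t.keys()
    sizes = [len(lists_s[g]) * len(lists_t[g]) for g in common]
    return sum(sizes), max(sizes, default=0)

def exact_candidates(lists_s, lists_t):
    """Cardinality of the union of the per-signature cross products."""
    pairs = set()
    for g in lists_s.keys() & lists_t.keys():
        pairs |= set(product(lists_s[g], lists_t[g]))
    return len(pairs)

lists_s = {"Conference": ["q1", "q2"], "Conf": ["q2"]}
lists_t = {"Conference": ["q3"], "Conf": ["q3"]}
ub, lb = bounds(lists_s, lists_t)
print(ub, lb, exact_candidates(lists_s, lists_t))
```

The exact count lies between the two bounds; here it coincides with LB because the pair (q2, q3) is produced by both signatures and is counted twice in UB.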

In Lemma 6.1, the upper bound is the total number of string pairs and the lower bound is the maximal number of string pairs associated with one signature. A signature filter λ is guaranteed to return fewer candidates than a signature filter λ′ if UB(λ) < LB(λ′). Therefore, the tighter the bounds, the better the results, since tighter bounds imply more accurate estimates of the ability of a filter. In the following, we introduce tighter upper and lower bounds than those of Lemma 6.1 by extending the Flajolet-Martin (FM) [Flajolet and Martin 1985] technique.

Fig. 10. An example of bit vector generated by the FM method. (The first 1-bit is the leftmost 1-bit.)

Table I. Summary of the Notations Used

Notation         Explanation
Sig(q, R)        the signatures of q with synonyms R
LB(λ), UB(λ)     the lower and upper bounds of the cardinality of the candidates with the filter λ
(S, g)           a list of string IDs associated with the signature g from the table S
μS               the distinct number of string IDs for all signature tokens from the table S
ν_S^g            an FM sketch for the signature g in I-lists of the table S
2DHS^g(i, j)     the bucket (i, j) for a signature g in a two-dimensional hash sketch synopsis (2DHS)

Flajolet-Martin Estimator. The Flajolet-Martin (FM) technique for estimating the number of distinct elements (i.e., set-union cardinality) relies on a family of hash functions H that map data values uniformly and independently over the collection of binary strings in the input data domain L. In particular, FM works as follows: (i) use a hash function h ∈ H to map string IDs to integers sufficiently uniformly in the range [0, 2^L − 1]; (ii) initialize a bit vector ν (also called a synopsis) of length L to all zeros and convert the integers into binary strings of length L; and (iii) compute the position of the first 1-bit in each binary string (denoted as p(h(i))), and set the bit located at position p(h(i)) to 1. Note that for any i ∈ [0, 2^L − 1], p(h(i)) ∈ {0, …, log M − 1} and Pr[p(h(i)) = ℓ] = 1/2^(ℓ+1), which means that bits in lower positions have a higher probability of being set to 1. Figure 10 illustrates the FM method. Assume there are six strings and the length of the domain is L = 4. The resulting synopsis ν is “1110”.
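Steps (i) to (iii) can be sketched as follows. The sketch is an illustration only: SHA-256 stands in for a hash function drawn from H, the function names are hypothetical, and the final estimate 2^r / 0.77351 (with r the index of the first 0-bit) is the classical single-sketch FM estimate rather than a formula stated in this section.

```python
import hashlib

def hash_to_bits(item, L, seed=0):
    """Map an item to an L-bit integer; SHA-256 stands in for a hash from the family H."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - L)   # requires 1 <= L <= 64

def first_one_position(value, L):
    """Index of the leftmost 1-bit in the L-bit binary string, or None for all zeros."""
    pos = format(value, f"0{L}b").find("1")
    return pos if pos >= 0 else None

def fm_sketch(ids, L=16, seed=0):
    """Build the synopsis: bit p is set if some ID hashes to a string whose first 1-bit is at p."""
    nu = [0] * L
    for i in ids:
        p = first_one_position(hash_to_bits(i, L, seed), L)
        if p is not None:
            nu[p] = 1
    return nu

def fm_estimate(nu):
    """Classical single-sketch FM estimate 2^r / 0.77351, r = index of the first 0-bit."""
    r = nu.index(0) if 0 in nu else len(nu)
    return 2 ** r / 0.77351

ids = [f"q{i}" for i in range(1000)]
sketch = fm_sketch(ids)
print(sketch, round(fm_estimate(sketch)))
```

Inserting the same ID twice leaves the synopsis unchanged, which is why the sketch counts distinct elements; a single sketch is coarse, so several independent sketches are normally combined.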

Our Two-Dimensional Hash Sketch (2DHS). In the following, we describe our ideas for extending the FM sketch to provide better bounds for estimating the number of candidate pairs. The main challenge is that the existing FM-based methods are only applicable to one-dimensional data, but we need to estimate the number of distinct pairs based on two independent FM sketches. Note that the straightforward method of mapping two-dimensional data to one unique value and then using the existing FM sketch is infeasible for our problem. This is because the two join tables are given online, and thus we cannot perform the mapping operations offline to construct a one-dimensional sketch. Therefore, we next describe four steps for estimating the number of distinct pairs based on two predefined one-dimensional sketches. Table I summarizes the notations used.

Step 1 (Estimating the Distinct Number of One Table). First, we estimate the distinct number of string IDs for all signature tokens from one table, that is, μS and μT, where μS = |⋃_{g∈G} (S, g)| and μT = |⋃_{g∈G} (T, g)|. Ganguly et al. [2004] provide an (ε, δ)-estimate algorithm for μS (or μT) by using an extension of the FM technique. Space precludes detailed discussion about their method here; however, readers may find it in Section 3.3 of Ganguly et al. [2004]. Note that after the first step, we indeed obtain a new upper bound, μS × μT, but it is still not tight enough. We proceed to the following steps to get a better value.

Fig. 11. The structure of two-dimensional hash sketch (2DHS).

Step 2 (Constructing a Two-Dimensional Hash Sketch). Let ν_S^g and ν_T^g be two FM sketches of signature g in the I-lists of table S and table T, respectively. We construct a two-dimensional hash sketch synopsis (2DHS) of signature g, which is a two-dimensional bit vector. The bucket (i, j) in a 2DHS is 1 if and only if ν_S^g(i) = 1 and ν_T^g(j) = 1. Further, let G be the set of common signatures of S and T. We construct a new 2DHS, which is the union of the 2DHSs of all signatures, namely, ⋃_{g∈G} 2DHS^g, such that 2DHS(i, j) = 1 if there exists g ∈ G with 2DHS^g(i, j) = 1.

Example 6.3. Figure 11 illustrates Step 2. Assume there are two common signatures g1 and g2 of S and T, with ν_S^{g1} = {1,1,1}, ν_T^{g1} = {1,1,0,1}, ν_S^{g2} = {1,1,0}, and ν_T^{g2} = {1,1,1,0}. Figure 11(b1) is the 2DHS of g1 and Figure 11(b2) is the 2DHS of g2. Then the union 2DHS is shown in Figure 11(b3).
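The construction of Step 2 can be reproduced on the four sketches of this example with the short sketch below. The function names are hypothetical, and letting the first index range over the sketch of S is an assumption about the layout of Figure 11.

```python
def two_dhs(nu_s, nu_t):
    """Bucket (i, j) is 1 iff nu_s[i] == 1 and nu_t[j] == 1."""
    return [[a & b for b in nu_t] for a in nu_s]

def union_2dhs(sketches):
    """Bucket-wise OR of equally sized 2DHSs."""
    rows, cols = len(sketches[0]), len(sketches[0][0])
    return [[int(any(sk[i][j] for sk in sketches)) for j in range(cols)]
            for i in range(rows)]

g1 = two_dhs([1, 1, 1], [1, 1, 0, 1])
g2 = two_dhs([1, 1, 0], [1, 1, 1, 0])
for row in union_2dhs([g1, g2]):
    print(row)
```

Each per-signature 2DHS is an outer product of two bit vectors, while the union generally is not, which is why the union has to be kept as a two-dimensional structure.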

Step 3 (Computing Witness Probabilities). Once the union 2DHS is built, we select a bucket (i, j) to compute a witness probability as follows. We examine a collection of ω independent 2DHSs (ν_1^g, ν_2^g, …, ν_ω^g) (each copy using independently-chosen hash functions from H). Algorithm 7 shows the steps for computing witness probabilities. In particular, let Rs = ⌊log₂ μS⌋ and Rt = ⌊log₂ μT⌋. We investigate a column of buckets (Rs, y) (lines 2∼8) and a row of buckets (x, Rt) (lines 9∼15), and select the two smallest indexes y and x at which only a constant fraction 0.3ω of the sketch buckets turns out to be 1. As our analysis will show, the observed fractions p̂s and p̂t (line 16) and their positions can be used to provide a robust estimate for the bounds in the next step.

Step 4 (Computing Tighter Upper and Lower Bounds). The final step is to compute the upper bound TUB and the lower bound TLB of a signature filter λ. Assume, without loss of generality, that μS ≤ μT.

$$\mathrm{TUB}(\lambda) = \phi_1 \cdot \min(u_1, u_2),$$

where

$$u_1 = \mu_S \cdot \frac{\ln(1 - p_s)}{\ln\left(1 - \frac{1}{2^{y}}\right)}, \qquad p_s = \frac{\hat{p}_s}{1 - \left(1 - \frac{1}{2^{R_s}}\right)^{\mu_S}},$$

$$u_2 = \mu_T \cdot \frac{\ln(1 - p_t)}{\ln\left(1 - \frac{1}{2^{x}}\right)}, \qquad p_t = \frac{\hat{p}_t}{1 - \left(1 - \frac{1}{2^{R_t}}\right)^{\mu_T}}.$$

$$\mathrm{TLB}(\lambda) = \phi_2 \cdot \mu_S \cdot \frac{\ln(1 - p'_s)}{\ln\left(1 - \frac{1}{2^{y}}\right)},$$

where

$$p'_s = \frac{\hat{p}_s - p}{1 - \left(1 - \frac{1}{2^{R_s}}\right)^{\mu_S}}, \qquad p = 1 - \left(1 - \frac{1}{2^{R_s + y}}\right)^{\mu_S} - \frac{1}{2^{y}} \times \left(1 - \left(1 - \frac{1}{2^{R_s}}\right)^{\mu_S}\right),$$

and the constants ϕ1 = 1.43 and ϕ2 = 0.8 are derived in Theorem 6.8.

ALGORITHM 7: Computing Two Witness Probabilities
Input: ω independent two-dimensional hash sketches (2DHS_1, 2DHS_2, …, 2DHS_ω), Rs and Rt.
Output: p̂s and p̂t and two witness positions (Rs, y) and (x, Rt)
1 f = 0.3ω;
2 for (y from 0 to Rt) do
3   count = 0;
4   for (i from 1 to ω) do
5     if (2DHS_i[Rs][y] == 1) then
6       count = count + 1;
      end
    end
7   if (count ≤ f) then
8     p̂s = count/ω and goto Line 9;
    end
  end
9 for (x from 0 to Rs) do
10   count = 0;
11   for (i from 1 to ω) do
12     if (2DHS_i[x][Rt] == 1) then
13       count = count + 1;
       end
     end
14   if (count ≤ f) then
15     p̂t = count/ω and goto Line 16;
     end
   end
16 return p̂s, p̂t and their witness positions (Rs, y), (x, Rt)
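The two scans of Algorithm 7 can be sketched in Python as follows. Each 2DHS is modelled as a list of lists indexed as sketch[i][j]; the function name, the `frac` parameter, and the handling of the case where no witness is found (returning None) are assumptions, since the pseudocode does not cover that case. The ten sample sketches are synthetic.

```python
def witness_probabilities(sketches, rs, rt, frac=0.3):
    """Return (p_hat_s, p_hat_t, (rs, y), (x, rt)) following the two scans of Algorithm 7."""
    omega = len(sketches)
    f = frac * omega

    def scan(bucket, limit):
        # Smallest position at which at most a fraction `frac` of the sketches hold a 1.
        for pos in range(limit + 1):
            count = sum(bucket(sk, pos) for sk in sketches)
            if count <= f:
                return count / omega, pos
        return None, None   # no witness found (case not covered by the pseudocode)

    p_hat_s, y = scan(lambda sk, pos: sk[rs][pos], rt)   # buckets (Rs, y), lines 2-8
    p_hat_t, x = scan(lambda sk, pos: sk[pos][rt], rs)   # buckets (x, Rt), lines 9-15
    return p_hat_s, p_hat_t, (rs, y), (x, rt)

# Ten synthetic 3x3 sketches with Rs = Rt = 2.
sketches = []
for i in range(10):
    sk = [[0] * 3 for _ in range(3)]
    sk[2][0] = 1          # bucket (2, 0) is set in every copy
    sk[0][2] = 1          # bucket (0, 2) is set in every copy
    if i < 2:
        sk[2][1] = 1      # bucket (2, 1) is set in 2 of 10 copies
    if i < 1:
        sk[1][2] = 1      # bucket (1, 2) is set in 1 of 10 copies
    sketches.append(sk)
print(witness_probabilities(sketches, rs=2, rt=2))
```

Both scans stop at the first position whose count does not exceed 0.3ω, so the returned fractions are always at most 0.3.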

Example 6.4. This example illustrates the computation of TUB and TLB. Assume μS = 100 and μT = 200. Thus, Rs = ⌊log₂ 100⌋ = 6 and Rt = ⌊log₂ 200⌋ = 7. Assume, in Algorithm 7, that the returned values are p̂s = 0.2, p̂t = 0.19, x = 5, and y = 6. According to the formulas in Step 4, we have ps = 0.2522, pt = 0.24, u1 = 1845, u2 = 1729, p = 0.0117, and p′s = 0.2375. The upper bound is TUB(λ) = 1.43 × 1729 = 2472 and the lower bound is TLB(λ) = 1722 × 0.8 = 1378.
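The formulas of Step 4 can be evaluated numerically with the sketch below, using the inputs of this example. The helper names `hit`, `scale`, `tub`, and `tlb` are hypothetical, and Rs and Rt are taken as the floor of the base-2 logarithm, which is an assumption consistent with the values 6 and 7 used above.

```python
import math

PHI1, PHI2 = 1.43, 0.8   # constants stated for TUB and TLB

def hit(mu, r):
    """1 - (1 - 1/2^r)^mu, the normalising term shared by p_s, p_t and p."""
    return 1.0 - (1.0 - 2.0 ** -r) ** mu

def scale(mu, p, pos):
    """mu * ln(1 - p) / ln(1 - 1/2^pos)."""
    return mu * math.log(1.0 - p) / math.log(1.0 - 2.0 ** -pos)

def tub(mu_s, mu_t, p_hat_s, p_hat_t, x, y):
    rs, rt = math.floor(math.log2(mu_s)), math.floor(math.log2(mu_t))
    u1 = scale(mu_s, p_hat_s / hit(mu_s, rs), y)
    u2 = scale(mu_t, p_hat_t / hit(mu_t, rt), x)
    return PHI1 * min(u1, u2)

def tlb(mu_s, p_hat_s, y):
    rs = math.floor(math.log2(mu_s))
    p = hit(mu_s, rs + y) - hit(mu_s, rs) / 2 ** y
    p_s_prime = (p_hat_s - p) / hit(mu_s, rs)
    return PHI2 * scale(mu_s, p_s_prime, y)

print(round(tub(100, 200, 0.2, 0.19, x=5, y=6)))
print(round(tlb(100, 0.2, y=6)))
```

Small differences from hand-computed values are expected, because the intermediate quantities are not rounded here.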

Therefore, one filter is selected as the best if its TUB is less than all the TLBs of other filters, which means that it returns the smallest number of candidates. In practice, if there are several filters whose bounds have overlaps, then it means that their abilities are similar with a high probability, and we can arbitrarily choose one of them to break the tie.

Theoretical Analysis. We first demonstrate that, with an arbitrarily high probability,

the returned lower and upper bounds are correct; then we analyze its space and time complexities.