Set Intersection Size - Set Spaces and Vector Distances

2.3 Set Spaces and Vector Distances

2.3.3 Set Intersection Size

If $a, b$ are binary strings, then their intersection size $i(a, b) = F(a, b, \times)$ is the dot product of $a$ and $b$ (see Definition 2.3.1). The set and bit-string formulations are essentially identical, since $|A \cap B| = F(\chi(A), \chi(B), \times)$. Hamming distance and set intersection size are closely related. Simple consideration of Venn diagrams shows that $|A| + |B| - 2|A \cap B| = |A \Delta B|$. Equivalently, for bit-strings, $|a| + |b| - 2i(a, b) = \|a - b\|_H$, where $|a|$ is the number of one bits, or 'weight', of the bit-string $a$. However, this equality does not mean that we can approximate the intersection size by first approximating the Hamming distance. Informally, although we can find $|a|$, $|b|$ and $\|a - b\|_H$ each within a factor of $1 \pm \epsilon$, $i(a, b)$ may be small in comparison to these, and hence directly applying the above relation would not give an $(\epsilon, \delta)$-approximation to $i$.
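The following minimal Python sketch checks the identity on a small example and illustrates why a relative error in the Hamming distance does not carry over to the intersection size. The function names and the example bit-strings are illustrative assumptions, not taken from the text.

```python
def weight(x):
    """Number of one bits in a 0/1 list."""
    return sum(x)

def intersection_size(a, b):
    """i(a, b): positions where both bits are one (the dot product)."""
    return sum(x & y for x, y in zip(a, b))

def hamming(a, b):
    """Hamming distance: positions where the bits differ."""
    return sum(x != y for x, y in zip(a, b))

def intersection_from_hamming(wa, wb, h):
    """Solve |a| + |b| - 2 i(a, b) = ||a - b||_H for i(a, b)."""
    return (wa + wb - h) / 2

# Two strings of length 1000, each of weight 500, overlapping in one position.
a = [1] * 500 + [0] * 500
b = [0] * 499 + [1] * 500 + [0]

exact = intersection_from_hamming(weight(a), weight(b), hamming(a, b))

# A 1% overestimate of the Hamming distance (998 here) shifts the derived
# intersection size by about 5, which swamps the true value of 1.
noisy = intersection_from_hamming(weight(a), weight(b), 1.01 * hamming(a, b))
```

Here `exact` agrees with `intersection_size(a, b)`, while `noisy` is negative, so the derived value is not within any constant factor of the true intersection size.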

There is a more direct embedding of Hamming distance into intersection space. Let $\bar{a}$ be the bit-vector formed by taking the complement of each bit in $a$. As noted before, the concatenation of two vectors $a, b$ is denoted $a \| b$. Then $i(a \| \bar{a}, \bar{b} \| b) = i(a, \bar{b}) + i(\bar{a}, b) = \|a - b\|_H$, so in this case an approximation of $i$ would allow approximation of the Hamming distance (since we are adding these quantities). However, there is no similar reduction in the opposite direction, expressing the intersection size as a sum of Hamming distances or other quantities. This is unfortunate, since although there are efficient approximations for Hamming distance, there are no known approximations for the intersection size. In fact, we go on to show that no such approximation can exist.
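As a quick check of the embedding, the sketch below builds $a \| \bar{a}$ and $\bar{b} \| b$ for 0/1 lists and compares their intersection size with the Hamming distance. The helper names are assumptions made for this illustration.

```python
def complement(x):
    """Flip every bit of a 0/1 list."""
    return [1 - bit for bit in x]

def intersection_size(a, b):
    """i(a, b): positions where both bits are one."""
    return sum(x & y for x, y in zip(a, b))

def hamming(a, b):
    """Hamming distance: positions where the bits differ."""
    return sum(x != y for x, y in zip(a, b))

def embed_left(a):
    """a || complement(a)"""
    return a + complement(a)

def embed_right(b):
    """complement(b) || b"""
    return complement(b) + b

a = [1, 0, 1, 1, 0]
b = [1, 1, 0, 1, 0]

# Positions where a = 1, b = 0 are counted in the first half,
# positions where a = 0, b = 1 in the second half.
embedded = intersection_size(embed_left(a), embed_right(b))
```

For these two strings `embedded` equals `hamming(a, b)`, since each differing position contributes exactly once to one of the two halves.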

Hardness of Approximating Intersection Size. Certainly, estimating the intersection size is hard. This follows immediately from the fact that estimating Hamming distance is hard. If in the sketch model we could estimate the intersection size with probability $1 - \delta$, then we could also estimate Hamming distance with the same probability, by using the relations above (since $|a|$ can easily be included in a sketch using $\log n$ bits). To show that approximating the intersection size is hard, we consider the related problem of disjointness:

$$\mathrm{disj}(a, b) = 1 \iff i(a, b) = 0, \qquad \mathrm{disj}(a, b) = 0 \iff i(a, b) \neq 0$$

It has been shown that, in a communication complexity scenario, estimating the disjointness function requires $\Omega(n)$ bits of communication; this is the main theorem of [KS92, Raz92].

Theorem 2.3.4 (from [Raz92], [KS92]) Let person $A$ hold $a$ and person $B$ hold $b$. Any scheme which allows $A$ and $B$ to collaboratively compute $\mathrm{disj}(a, b)$ with probability $1 - \delta$ requires $\Omega(n)$ bits of communication for every $\delta < \frac{1}{2}$.

Corollary 2.3.2 An approximation scheme for intersection size would imply a way to estimate disjointness.

Proof. We just have to consider two cases. Firstly, when $i(a, b) = 0$, any approximation would have to return a result of 0 with probability $1 - \delta$, whatever the approximation parameters a, b (see Definition 2.1.2). Similarly, if $i(a, b) \neq 0$ then any approximation would have to return a non-zero result with probability $1 - \delta$. Hence if an approximation scheme for $i$ existed, then we could derive an estimate for $\mathrm{disj}$ by returning 0 if $\hat{i} \neq 0$ and returning 1 if $\hat{i} = 0$. ✷

Therefore, by contradiction, there can be no streaming or sketching algorithm using $o(n)$ bits for approximating intersection size, since otherwise there would exist a communication protocol for estimating disjointness with $o(n)$ bits of communication, contradicting Theorem 2.3.4. We comment that although this rules out the existence of any constant factor approximation scheme for intersection size, there is still the possibility of a scheme that returns an answer correct up to a constant factor plus an additional constant. That is, find $\hat{i}$ such that $i(a, b) \le \hat{i}(a, b) \le \epsilon\, i(a, b) + c$ for constants $c$ and $\epsilon$. Next we pursue an alternative weaker kind of approximation.
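The reduction in the proof can be sketched in a few lines of Python. The oracle below is purely hypothetical: it computes the exact intersection size and perturbs it by a relative error, standing in for whatever small-space approximation scheme is assumed to exist. The names `make_relative_oracle` and `disj_from_oracle` are inventions for this illustration.

```python
import random

def intersection_size(a, b):
    """i(a, b): positions where both bits are one."""
    return sum(x & y for x, y in zip(a, b))

def make_relative_oracle(eps, rng):
    """Hypothetical oracle returning a value within a factor 1 +/- eps of i(a, b).

    It cheats by reading both inputs in full; the point of the corollary is
    that no oracle with small sketches could behave like this.
    """
    def oracle(a, b):
        return intersection_size(a, b) * (1 + rng.uniform(-eps, eps))
    return oracle

def disj_from_oracle(oracle, a, b):
    """Return 1 if the estimate is zero (disjoint), 0 otherwise."""
    return 1 if oracle(a, b) == 0 else 0

rng = random.Random(0)
oracle = make_relative_oracle(0.5, rng)
```

With any relative error below 1, the estimate is zero exactly when the true intersection is empty, so `disj_from_oracle` decides disjointness whenever the oracle meets its guarantee.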

Rough Sketches for Intersection Size. Although we cannot make sketches for the intersection size that meet the specifications of Definition 2.1.3, it is still possible to find a sort of approximation that may be "good enough". Informally, we will call this a rough sketch. This is an additive approximation of the intersection size, that is, an approximation $\hat{i}$ such that $|i(a, b) - \hat{i}(a, b)| \le \epsilon n$ (rather than $\epsilon\, i(a, b)$). We describe a simple communication scheme that achieves such an approximation, adapted from Section 5.5 of [KN97]. Suppose person A holds a bit-string $a$, and person B holds $b$. Then A selects $k$ locations $S_1, \ldots, S_k$ from $\{1 \ldots n\}$ uniformly at random, with replacement, and sends these locations followed by the values of $a$ at each of these positions. B then estimates $i$ by

$$\hat{i} = \frac{n}{k} \sum_{i=1}^{k} a_{S_i} b_{S_i}$$

Lemma 2.3.3 $\hat{i}$ is an $\epsilon$-additive approximation for $i$ with probability at least $1 - \delta$ for $k$ sufficiently large in $\Theta(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$.

Proof. We pick $k$ locations uniformly at random from the bit-strings, with replacement. For any chosen location, the probability that it is in the intersection is $i(a, b)/n$. We count the chosen locations that lie in the intersection and scale by $n/k$; by linearity of expectation, the expected value of this quantity is $\frac{n}{k} \cdot k \cdot \frac{i(a, b)}{n} = i(a, b) = i$. Each test of a chosen bit is a Bernoulli trial, and the outcomes are independently and identically distributed, so we can apply the Hoeffding inequality to the sum of $k$ such trials, as described in Chapter 4 of [MR95]. The probability that the difference between $\hat{i}$ and $i$ is at least $\epsilon n$ is given by

$$\Pr(|\hat{i} - i| \ge \epsilon n) = \Pr\left(\frac{k}{n} |\hat{i} - i| \ge \epsilon k\right) \le 2e^{-2k\epsilon^2} = \delta$$

Rearranging this shows that $k = -\frac{1}{2\epsilon^2} \ln \frac{\delta}{2}$ and so $k = \Theta(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$. ✷

So to allow this to take place in a communication setting, A just needs to send the $k$ locations, plus the $k$ bits at those locations, to B, at a cost of $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta} \log n)$ bits of communication. We remark again on the close relation between communication protocols and sketching algorithms: any lower bound in a communication paradigm is a lower bound on the size of a sketch, because a sketch can be viewed as a message being communicated.
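The sampling protocol and the choice of $k$ can be illustrated with a minimal Python sketch. The function names are assumptions for this illustration, and positions are indexed from 0 rather than from 1 as in the text.

```python
import math
import random

def sample_size(eps, delta):
    """Smallest integer k with 2 * exp(-2 * k * eps^2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def make_message(a, k, rng):
    """A's side: k positions drawn with replacement, plus a's bits there."""
    positions = [rng.randrange(len(a)) for _ in range(k)]
    return positions, [a[s] for s in positions]

def estimate_intersection(message, b):
    """B's side: count sampled positions where both bits are one, scale by n/k."""
    positions, a_bits = message
    k = len(positions)
    hits = sum(bit & b[s] for s, bit in zip(positions, a_bits))
    return len(b) * hits / k

# Example: n = 10000 and true intersection size 2500.
a = [1, 1, 0, 0] * 2500
b = [1, 0, 1, 0] * 2500
k = sample_size(eps=0.05, delta=0.01)
estimate = estimate_intersection(make_message(a, k, random.Random(1)), b)
```

With $\epsilon = 0.05$ and $\delta = 0.01$ the formula gives $k = 1060$, independent of $n$; only the cost of naming each location grows, as $\log n$ bits per location.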

Rough Sketching for Smaller Sets. Work by Broder described in [Bro98], with further technical details in [BCFM98], shows a way to make sketches for intersection size. These sketches are used in retrieving documents in the AltaVista search engine. The approach is based on choosing hash functions and (pseudo)random permutations of the universe from which sets are drawn. For a given set, a permutation is applied, and then each element is hashed to an integer. These hash values are taken modulo some integer $m$, and those that are congruent to 0 are selected for the sketch. The additive approximation of the intersection size of two sets is then found by taking the intersection size of their representative sketches, and scaling appropriately. The purpose of this approach is to deal with cases where the size of any individual set is very small compared to the size of the universe from which the sets are drawn. In this case the method of Lemma 2.3.3 would mainly sample bits that were 0, giving a poor approximation. Repetition of this process (using different permutations and hash functions) can improve the quality of the approximation. However, the size of this representation varies linearly with the size of the sets: the expected size of the representation of $A$ is $O(|A|/m)$.
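A rough Python illustration of the modulo-$m$ selection idea follows. It is a simplified sketch and not the construction of [Bro98] or [BCFM98]: the "permute, then hash" step is replaced by a single seeded hash built on `hashlib.blake2b`, and the names `h`, `mod_sketch` and `estimate_intersection` are assumptions introduced here.

```python
import hashlib

def h(element, seed):
    """Hypothetical stand-in for 'permute then hash': a seeded 64-bit hash."""
    data = repr((seed, element)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def mod_sketch(items, m, seed=0):
    """Keep the elements whose hash value is congruent to 0 modulo m."""
    return {x for x in items if h(x, seed) % m == 0}

def estimate_intersection(sketch_a, sketch_b, m):
    """Each element is kept with probability about 1/m, so scale by m."""
    return m * len(sketch_a & sketch_b)

A = set(range(0, 1000))
B = set(range(500, 1500))
m = 8
estimate = estimate_intersection(mod_sketch(A, m), mod_sketch(B, m), m)
```

Because the same hash decides membership for every set, an element of $A \cap B$ is either in both sketches or in neither, which is what makes the intersection of the sketches a sample of the intersection of the sets. The sketch size depends on $|A|$ and not on the universe size.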

Variations of Intersection

We consider three variations: an alternative measure of set similarity (the Set Resemblance measure), an extension from sets to vectors (dot products), and the related set measure of Set Difference.

Set Resemblance. Closely related to Set Intersection size is the Set Resemblance measure studied by Broder [Bro98]. This is also known as the Jaccard coefficient, as used in statistics and information retrieval. The resemblance of two sets $A$ and $B$ is $r(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (it is assumed that $A$ and $B$ are not both the empty set). Since $r(A, B) = 0$ if and only if $A \cap B = \emptyset$, approximating this quantity is hard for the same reasons as approximating the intersection size (otherwise, we could solve the disjointness problem).

Let $A$ and $B$ both be drawn from a universe $U$ of size $n$. Suppose that $P$ is a permutation on $U$ (that is, it maps $U$ bijectively to $\{1 \ldots n\}$) chosen uniformly at random from all such permutations. We can apply $P$ to $A$ to get $P(A) \subseteq \{1 \ldots n\}$. We can create a rough sketch for $A$ by considering many different random permutations $P_i$, so that the sketch of $A$ is given by:

$$sk(A, P) = (\min(P_1(A)), \min(P_2(A)), \ldots, \min(P_k(A)))$$

The approximation of $\frac{|A \cap B|}{|A \cup B|}$ is $1 - \|sk(A, P) - sk(B, P)\|_H / k$, where $\| \cdot \|_H$ is the Hamming norm as usual. This is because for each coordinate $\Pr(\min(P_i(A)) = \min(P_i(B))) = \frac{|A \cap B|}{|A \cup B|}$. So by summing these, the expectation of $1 - \|sk(A, P) - sk(B, P)\|_H / k$ is the resemblance of the sets $A$ and $B$. The advantage of this approach is that, unlike that described above, it is independent of the size of the sets $A$ and $B$, working equally well whether the sets are large or small. The size of this sketch is $k$ elements, each of which can be represented using $\log n$ bits.

This method of approximating the Jaccard coefficient was defined in [Bro98] and refined in [BCFM98]. Attention is given to how to choose the permutations $P$: storing a permutation in full would consume too much space. Instead, permutations are chosen at random from a smaller class of "Min-Wise Permutations", which can be represented in small space, although we do not discuss this further here. It is also remarked in [Bro98] that $1 - \frac{|A \cap B|}{|A \cup B|}$ is a metric, although we do not make use of this fact.
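The min-based sketch can be illustrated with a short Python example. As an explicit simplification, it stores each permutation in full as a shuffled list, which is exactly what the text says is too costly in practice; the function names and the universe $\{0, \ldots, n-1\}$ are assumptions for this illustration.

```python
import random

def make_permutations(universe_size, k, rng):
    """k explicit random permutations of {0, ..., n-1} (demo only: each one
    takes space linear in n)."""
    perms = []
    for _ in range(k):
        p = list(range(universe_size))
        rng.shuffle(p)
        perms.append(p)
    return perms

def sketch(items, perms):
    """sk(A, P): the minimum image of the set under each permutation."""
    return [min(p[x] for x in items) for p in perms]

def estimate_resemblance(sk_a, sk_b):
    """1 - (Hamming distance between the sketches) / k."""
    differing = sum(x != y for x, y in zip(sk_a, sk_b))
    return 1 - differing / len(sk_a)

def resemblance(A, B):
    """Exact Jaccard coefficient |A n B| / |A u B|."""
    return len(A & B) / len(A | B)

perms = make_permutations(universe_size=200, k=400, rng=random.Random(7))
A = set(range(0, 60))
B = set(range(30, 90))
estimate = estimate_resemblance(sketch(A, perms), sketch(B, perms))
```

Here the exact resemblance is $30/90 = 1/3$, and `estimate` is the fraction of the 400 coordinates on which the two sketches agree. Both sets must be sketched with the same list of permutations for the coordinates to be comparable.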

Dot products. Given two vectors $a, b$, we have already stated that their dot product is $\sum_i a_i b_i$. In the case where the entries of the vectors are restricted to being 0 or 1, then as we observed, the dot product is equivalent to the set intersection measure. It therefore follows that approximating the dot product of two vectors is at least as hard as set intersection, that is, it requires $\Omega(n)$ bits of communication in a communication complexity model. For non-negative vectors, such as those we shall be considering in later chapters, there are techniques to select from a collection of vectors those that have a large dot product with a query vector, which may improve on direct evaluation, depending on the nature of the input data [CL99].

Set Difference. We show that there can be no algorithm in the sketch model to allow the approximation of the size of the set difference, $|A \setminus B|$. This measure is quite similar to intersection, since $A \setminus B = A \cap \bar{B}$, hence we would not expect there to be an approximation scheme. Suppose that it were possible to make sketches for this measure, so that there are sketches of $A$ and $B$, $sk(A, r)$ and $sk(B, r)$. Then for some function $f$, we would have $|f(sk(A, r), sk(B, r)) - |A \setminus B|| \le \epsilon |A \setminus B|$ with probability $1 - \delta$. If such a sketch has been made for some set $A$, then this sketch and the random bits used to create it could be communicated to B, who could then compute $sk(\bar{B}, r)$ correspondingly, where $\bar{B}$ is the complement of the set $B$ with respect to the universe, as usual. We then consider the approximation of $|A \setminus \bar{B}|$. The result is 0 with probability at least $1 - \delta$ if $A$ does not intersect $B$, and non-zero with the same probability if $A$ and $B$ are not disjoint. In other words, we can reduce the problem to that of Disjointness, and show that the combined size of the sketch and all the information used to create it (i.e. the total amount of memory used) must be $\Omega(n)$. This is due to the fact that computing the disjointness function has been shown to have linear communication complexity (see Section 2.3.3).
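The complement trick used in this reduction can be checked directly on small sets. In the Python sketch below, `difference_oracle` is a hypothetical exact stand-in for the assumed sketch-based estimator $f$, and all names are illustrative assumptions.

```python
def complement(B, universe):
    """Complement of B with respect to the universe."""
    return universe - B

def difference_oracle(A, B):
    """Hypothetical stand-in for f(sk(A, r), sk(B, r)); exact here for clarity."""
    return len(A - B)

def disj_from_difference(oracle, A, B, universe):
    """A \\ complement(B) equals A n B, so a zero answer means A and B are disjoint."""
    return 1 if oracle(A, complement(B, universe)) == 0 else 0

universe = set(range(10))
A = {1, 2, 3}
B_disjoint = {7, 8}
B_overlap = {3, 4}
```

Any estimator with relative error $\epsilon < 1$ would return zero in exactly the same cases as this exact stand-in, which is why the reduction to Disjointness goes through.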

In document Sequence distance embeddings (Page 51-54)