
In document Sequence distance embeddings (Page 151-155)

Chapter 6 Sending and Swapping

6.4 Computationally Efficient Protocols for String Distances

6.4.1 Hamming distance

The problem of locating the Hamming errors, given an upper bound on the number of errors, is similar to the problem of group testing (also observed by Madej [Mad89]). We wish to group samples and perform a test which will return either ‘all the same’ or ‘at least one mismatch’. The differences are that we have an ordering of the samples, given by their location in the string, and that we have to contend with the problem of false negatives. We use negative to mean that the test indicates no errors; false negatives are an inevitable consequence of using fewer than $n$ bits of communication to locate the Hamming differences, since we are checking whether substrings are identical. Deterministically comparing two strings for equality requires $n$ bits of communication; hence we shall use probabilistic tests, which have some probability of failure, to reduce the communication load. Our tests will be based on fingerprinting with hash functions such as those from Lemma 2.1.1. If a substring in one string hashes to the same value as the corresponding substring in the other, then we take this as evidence that the two substrings match. Note that there is no danger of false positives: if a test leads us to believe that the substrings are different, then there is zero probability that they actually match.

The following scheme is a straightforward way to locate and correct differences, and is similar to that proposed in [Met83]. Mapping our terminology onto that of the above problem, the locations of Hamming differences between the two strings are the distinguished characters. We shall use the proposed simple divide and conquer algorithm to locate them. The algorithm is illustrated with an example in Figure 6.2. The test we perform involves communicating to make the queries: in each round, one party will send the value of a hash function for each of the substrings in question, in the order they occur in the string. The other party will calculate the value for the corresponding substrings of their

Algorithm 6.4.1 Run by A who holds a

range ← [0, n−1]
repeat
    for all [l, r] in range do
        append hash(a[l, ⌈(l+r)/2⌉ − 1]) to output
        append hash(a[⌈(l+r)/2⌉, r]) to output
    send output
    receive bitmap
    newrange ← empty
    for all [l, r] in range do
        dequeue bit1, bit2 from bitmap
        if bit1 = 1 then
            append [l, ⌈(l+r)/2⌉ − 1] to newrange
        if bit2 = 1 then
            append [⌈(l+r)/2⌉, r] to newrange
    range ← newrange
until cheaper to send characters
for all [l, r] in range do
    append a[l, r] to output
send output

Algorithm 6.4.2 Run by B who holds b

range ← [0, n−1]
repeat
    newrange ← empty
    receive hashes
    for all [l, r] in range do
        dequeue hash1, hash2 from hashes
        if hash(b[l, ⌈(l+r)/2⌉ − 1]) ≠ hash1 then
            append 1 to bitmap
            append [l, ⌈(l+r)/2⌉ − 1] to newrange
        else
            append 0 to bitmap
        if hash(b[⌈(l+r)/2⌉, r]) ≠ hash2 then
            append 1 to bitmap
            append [⌈(l+r)/2⌉, r] to newrange
        else
            append 0 to bitmap
    send bitmap
    range ← newrange
until cheaper to send characters
a ← b
receive characters
for all [l, r] in range do
    a[l, r] ← next (r−l+1) characters

string. If they agree, then there is (with high probability) no difference between the two substrings; otherwise there is (with certainty) some discrepancy between them. The reply to the message is a bitmap indicating the success or failure of each test in order, say, using 1 to indicate that the hashes did not agree and 0 that they did. Every substring that was not identified as matching is split into two halves, and the procedure continues on these new substrings until it is more expensive to send the hashes than to send the unidentified substrings. There are clearly no more than $2\log n$ rounds, since A can only split substrings of $a$ in half $\log n$ times. This procedure is outlined in pseudocode: the algorithm run by A is given in Algorithm 6.4.1, and that run by B in Algorithm 6.4.2. We must choose our hash functions so that the overall probability of success of the procedure is high.
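The exchange described above can be simulated in a single process. The following is a minimal sketch, not the thesis's implementation: the function names `fingerprint` and `reconcile` and the parameter `min_len` are hypothetical, a truncated SHA-256 digest stands in for the hash family of Lemma 2.1.1, the split point is taken as $\lceil (l+r)/2 \rceil$ to agree with the ranges in Figure 6.2, and the rule "until cheaper to send characters" is simplified to "stop once substrings have length at most `min_len`".

```python
import hashlib


def fingerprint(s: str) -> bytes:
    # Stand-in fingerprint (assumption): truncated SHA-256, not the
    # Karp-Rabin family used in the text.
    return hashlib.sha256(s.encode()).digest()[:8]


def reconcile(a: str, b: str, min_len: int = 2):
    """Simulate A (holding a) and B (holding b) on equal-length strings.

    Returns B's reconstruction of a and the number of hashes A sent.
    min_len must be at least 1.
    """
    ranges = [(0, len(a) - 1)]
    hashes_sent = 0
    result = list(b)  # B starts from its own string (a <- b)
    while ranges and max(r - l + 1 for l, r in ranges) > min_len:
        new_ranges = []
        for l, r in ranges:
            mid = (l + r + 1) // 2  # ceiling of (l + r) / 2
            for lo, hi in ((l, mid - 1), (mid, r)):
                hashes_sent += 1
                # A bit of 1 in B's reply means "hashes disagree":
                # that half certainly contains a difference.
                if fingerprint(a[lo:hi + 1]) != fingerprint(b[lo:hi + 1]):
                    new_ranges.append((lo, hi))
        ranges = new_ranges
    # Final round: A sends the characters of the unresolved ranges.
    for l, r in ranges:
        result[l:r + 1] = a[l:r + 1]
    return "".join(result), hashes_sent
```

For two 16-character strings that differ at positions 4, 7 and 10, such as `"aabbabbaddcacccc"` and `"aabbcbbcdddacccc"`, this sketch sends 2 + 4 + 4 = 10 hashes before falling back to sending characters, the same round structure as in Figure 6.2.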

We will make use of the family of Karp-Rabin hash functions described in Section 2.1.3. These take a parameter $\delta'$ and ensure that, over all choices of hash function from the family, the probability of two distinct sequences having the same hash value is at most $\delta'$. We would like the probability that all the tests succeed to be at least a constant. Using the union bound, if we perform $t$ tests, then the probability of them all succeeding is at least $1 - t\delta'$. Therefore, if we choose $\delta' = \delta/t$ for some constant $\delta$, then we guarantee that overall the procedure has probability at most $\delta$ of failing.
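To make the fingerprinting step concrete, here is a minimal sketch of a Karp-Rabin style fingerprint: the string is read as a large integer and reduced modulo a randomly chosen prime. The exact family of Section 2.1.3 is not reproduced here, so the helper names (`is_prime`, `random_prime`, `fingerprint`), the byte-wise encoding and the way the prime is drawn are all assumptions made for illustration.

```python
import random


def is_prime(m: int) -> bool:
    # Trial division; adequate for the small primes used in this sketch.
    if m < 2:
        return False
    i = 2
    while i * i <= m:
        if m % i == 0:
            return False
        i += 1
    return True


def random_prime(upper: int, rng: random.Random) -> int:
    # Draw integers uniformly from [2, upper) until a prime is found.
    while True:
        candidate = rng.randrange(2, upper)
        if is_prime(candidate):
            return candidate


def fingerprint(s: bytes, p: int) -> int:
    # Interpret s as a base-256 integer and reduce it modulo p.
    value = 0
    for byte in s:
        value = (value * 256 + byte) % p
    return value
```

Equal strings always receive equal fingerprints, so a disagreement is conclusive. Two distinct strings collide only when the chosen prime divides the difference of their integer values, which is why drawing the prime from a larger range lowers the collision probability.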

Theorem 6.4.1 With an estimate of the Hamming distance, $\hat{h}$, if A runs Algorithm 6.4.1 and B runs Algorithm 6.4.2, this protocol is efficient with respect to the Hamming distance. It communicates no more than $2h\log(2n/h)$ hash values of $O(\log n + \log 1/\delta)$ bits each, and succeeds with probability at least $1-\delta$.

Proof. Lemma 6.4.1 says that we must make at most $2h\log(2n/h)$ queries (note that this is $h$, not $\hat{h}$). We fix $\delta$, and so we will choose $\delta'$ so that $\delta' = \delta/(2\hat{h}\log(2n/\hat{h}))$. To represent the values of the hash function that are exchanged requires $\log\left(\frac{n}{\delta'}\log\frac{n}{\delta'}\right)$ bits, which is $\log\left(\frac{2n\hat{h}}{\delta}\log(2n/\hat{h}) \cdot \log\left(\frac{2n\hat{h}}{\delta}\log(2n/\hat{h})\right)\right)$. This is $O(\log n/\delta)$. In total, we communicate at most $2h\log(2n/h)$ hash values. For each hash that is sent, the other party sends one bit in response. We must also communicate the prime, $p$, that is used in

a a b b a a c c d a b b b b c a a a d d c c b a b b b b a a c c

The lower string is treated as $a$, the upper as $b$. The Hamming distance between the two strings is 3. Blobs mark the differing substrings which are discovered in the traversal of the search tree; all other substrings are unchanged. The rounds progress as follows:

Round 1: A sends hashes for [0,7], [8,15]

Round 2: B replies with 11 (indicating that both hashes disagreed)

Round 3: A sends hashes for [0,3], [4,7], [8,11], [12,15]

Round 4: B replies with 0110

Round 5: A sends hashes for [4,5], [6,7], [8,9], [10,11]

Round 6: B replies with 1101

Round 7: A sends the characters abbaca

Finish: B now knows what the string $a$ is.

Figure 6.2: Illustrating how Hamming differences affect the binary parse tree

the hash functions, which also has size $O(\log n/\delta)$. Combining all these gives the communication cost.
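The quantities in this proof can be evaluated numerically for given parameters. The sketch below is only an illustration of the arithmetic: the function name `hamming_protocol_cost` is hypothetical, logarithms are assumed to be base 2, and the bound on the number of tests uses the estimate $\hat{h}$, as in the choice of $\delta'$ above.

```python
import math


def hamming_protocol_cost(n: int, h_est: int, delta: float):
    """Return (tests, delta_prime, bits_per_hash) for the bounds in the proof.

    Assumes base-2 logarithms and 1 <= h_est <= n.
    """
    # Upper bound on the number of hash tests: 2 * h * log(2n / h).
    tests = 2 * h_est * math.log2(2 * n / h_est)
    # Per-test failure probability so that the union bound gives delta overall.
    delta_prime = delta / tests
    # Bits per hash value: log((n / delta') * log(n / delta')).
    x = n / delta_prime
    bits_per_hash = math.log2(x * math.log2(x))
    return tests, delta_prime, bits_per_hash
```

With $n = 1024$, $\hat{h} = 8$ and $\delta = 1/2$, this gives 128 tests, $\delta' = 2^{-8}$, and a little over 22 bits per hash value.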

This protocol is non-trivial for $h = O(n/(\log n \log\log n))$. The overall cost is a factor of $O(\log(\frac{\hat{h}}{\delta}\log n))$ above the optimal; however, this cost comes from the size of the hash values sent. With regard to the number of hash values sent, we have shown it is about twice the optimal of any possible scheme which uses hash functions to give a binary answer to a question. The number of rounds is precisely $2\log n - 1$: we can progress one level of the binary parse tree every other round.

Theorem 6.4.2 The computational complexity of Algorithms 6.4.1 and 6.4.2 is $O(n\log h)$.

Proof. We make use of some of the observations of Lemma 6.4.1. Considering the binary parse tree of the string, once we have passed the level $\log h$, in the worst case we have to deal with $2h$ substrings whose total length is $O(n)$. In this level, the parties must do $O(n)$ work to compute the hashes. In each subsequent level, the substrings under consideration can be no more than half the maximum length of those of the level above, so the total cost of these levels must be $O(n)$. In the worst case, we need to find hashes for every substring in the binary parse tree above level $\log h$. To do this costs $O(n)$ for each level, giving the claimed cost.

Note that although an estimate of the Hamming distance is required for choosing the size of the hash functions, the number of queries depends on the true Hamming distance. In the absence of a good bound on the Hamming distance, the trivial upper bound $h \le n$ can be used.

Depending on the choice of the hash function, it may be possible to halve the amount of information sent: if linear hashes are sent, then the hash value for the second half of a substring can be calculated from the hash value of the first half and that of the whole substring. Additionally, by pre-computing the hashes of every substring bottom-up, the computational complexity of the protocol can be reduced from $O(n\log h)$ to $O(n)$. Otherwise, if hashes are independent, then stored values can be used to check that the procedure has succeeded. This comment applies to all of the hierarchical schemes presented in this section.
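The halving trick for linear hashes can be illustrated with a polynomial hash. This is a sketch under stated assumptions rather than the hash family of the text: the modulus `P` (the Mersenne prime $2^{61}-1$), the base `X` and the function names are chosen here purely for illustration. For a substring $uv$ the hash satisfies $H(uv) = H(u)X^{|v|} + H(v) \bmod P$, so $H(v)$ follows from $H(uv)$ and $H(u)$.

```python
P = (1 << 61) - 1  # a Mersenne prime, chosen for illustration only
X = 256            # base of the polynomial hash (assumption)


def poly_hash(s: bytes) -> int:
    # H(s) = sum of s[i] * X^(len(s) - 1 - i), reduced modulo P.
    value = 0
    for byte in s:
        value = (value * X + byte) % P
    return value


def second_half_hash(h_whole: int, h_first: int, len_second: int) -> int:
    # From H(uv) = H(u) * X^len(v) + H(v)  (mod P), solve for H(v).
    return (h_whole - h_first * pow(X, len_second, P)) % P
```

Since B received the hash of the whole substring in the previous round, A would then need to send only the hash of the first half of each substring.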

6.4.2 Edit distance

We next translate the above approach to deal with the edit distance. A similar scheme is proposed but not analysed in [SBB90].

It is not immediately apparent how to map the edit distance problem onto the abstract problem presented above. However, observe that the way that differences were located was by a process of elimination: identical substrings were found, until all that remained were disparate substrings. We may use the same kind of hierarchical approach to identify common fragments. It is no longer the case that substrings are aligned between the strings; however, we know that matching substrings will be offset by at most the edit distance.

The party B receiving the hash values must therefore do more work to identify them with a substring. B has a bound $\hat{d}$ on the edit distance; if a substring is unaltered by the editing process, then it will be found at a displacement of no more than $\hat{d}$ from its location in A's string. So B calculates hashes of substrings of the appropriate length at all such displacements left and right from the corresponding position in its string, testing each one to see if it agrees with the sent hash. If they do agree, it is assumed that they match, and so B now knows the substring at this location in A. The outline pseudocode for this is given in Algorithm 6.4.3.
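B's search over displacements can be sketched as follows. This is an illustrative fragment only: `find_match` and its parameters are hypothetical names, a truncated SHA-256 digest again stands in for the hash family of the text, and each candidate window is hashed from scratch rather than incrementally.

```python
import hashlib


def fingerprint(s: str) -> bytes:
    # Stand-in fingerprint (assumption), as in the earlier sketch.
    return hashlib.sha256(s.encode()).digest()[:8]


def find_match(b: str, l: int, r: int, target: bytes, d_est: int):
    """Look for a window of b whose hash equals target.

    The window has the same length as [l, r] and starts within d_est
    positions of l. Returns the start index, or None if no window matches.
    """
    length = r - l + 1
    for shift in range(-d_est, d_est + 1):  # at most 2 * d_est + 1 candidates
        start = l + shift
        if start < 0 or start + length > len(b):
            continue  # window would fall outside b
        if fingerprint(b[start:start + length]) == target:
            return start
    return None
```

For example, if `b` is obtained from `a` by inserting two characters at the front, the hash of `a[0:5]` is found in `b` at displacement 2, provided the bound `d_est` is at least 2.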

We consider an optimal edit sequence from $a$ to $b$ of length $d$ and how it affects the binary parse tree representing the splitting of $a$. A replacement has the same effect as in the Hamming case: it affects every ancestor of the changed character in the binary tree, but nowhere else in the tree. As shown in Figure 6.2, each affected node in the binary parse tree can cause up to two hashes to be sent. The same idea can be applied to the other operations. A deletion also affects characters at the leaf nodes, and every ancestor of that node (it could be thought of as a change of the original character to a null character, ‘’). An insertion of any number of consecutive characters between two adjacent characters is considered to occur at the internal node which is the lowest common ancestor of the pair. It therefore changes every ancestor of the affected node. The protocol must traverse this tree, computing hashes on every node whose subtree contains any of these edit operations. This is illustrated in Figure 6.3.

Theorem 6.4.3 If A runs Algorithm 6.4.1 and B runs Algorithm 6.4.3 then this protocol is efficient with respect to the edit distance. It communicates no more than $2d\log(2n/d)$ hash values, each of which is of size $O(\log n/\delta)$ bits, and succeeds with probability at least $1-\delta$.

Proof. Observe that the worst case is exactly as before: it arises if (almost) all errors are deletions or alterations (since we have to descend further down the tree to discover these). Certainly, in the worst case, the number of hashes sent will be that given by Lemma 6.4.1, $2d\log(2n/d)$. We must compare each hash sent with up to $2\hat{d}+1$ others in the worst case. As before, we want the probability of an error to be no more than a constant, which gives $\delta = 2d\log(2n/d)(2\hat{d}+1)\delta'$. Following the same line of argument as in Theorem 6.4.1, we compute hashes of size $O(\log(\frac{2}{\delta} n\hat{d}(2\hat{d}+1)\log(2n/d)))$ bits, which is $O(\log n/\delta)$ since $d \le n$.

Corollary 6.4.1 The computational complexity of this protocol is $O(n\log n)$ hash computations.

Proof. The time complexity for A is clearly $O(n\log d)$, since A makes the same computations as before, by running Algorithm 6.4.1; we have already shown that the worst case number of hashes sent is the

Algorithm 6.4.3 Run by B who holds b

range ← ([0, n−1])
repeat
    receive hashes
    newrange ← empty
    for all [l, r] in range do
        w = (r−l)/2
        dequeue hash1, hash2 from hashes
        if ∄ i, 1 ≤ i ≤ n−w : hash(b[i, i+w]) = hash1 then
            append 1 to bitmap
            enqueue [l, l+w] to newrange
        else
            append 0 to bitmap
            a[l, l+w] ← b[i, i+w]
        if ∄ j, 1 ≤ j ≤ n−w : hash(b[j, j+w]) = hash2 then
            append 1 to bitmap
            enqueue [r−w, r] to newrange
        else
            append 0 to bitmap
            a[r−w, r] ← b[j, j+w]
    send bitmap
    range ← newrange
until cheaper to send remaining characters
receive characters
for all [l, r] in range do
    a[l, r] ← next (r−l+1) characters

same. However, B has to do more work to align the hashes in Algorithm 6.4.3. Suppose at each level B computes the hash of every substring of the appropriate length: there are $O(n)$ of these, and the hash values are $O(\log n)$ bits in length. We make the assumption here that, given the hash for a substring $a[l:r]$, we can easily compute the hash for $a[l+1:r+1]$. This is true of a large class of hash functions, including those described in Lemma 2.1.1. B can store each hash in a hash table (running a second hash function on the hash values if necessary). Working in the RAM model, in which manipulating hashes takes time $O(1)$, the time required to build the hash tables is $O(n\log n)$: $O(n)$ work at each level. Searching the hash table takes time $O(1)$, the size of each hash. We follow the same procedure at each of the $\log n$ levels, giving a total cost of $O(n\log n)$.
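The per-level work described in this proof can be sketched with a rolling polynomial hash and a dictionary acting as the hash table. As before, this is an illustration under assumptions, not the construction of Lemma 2.1.1: the modulus `P`, the base `X` and the names `poly_hash` and `window_hashes` are chosen here for the example.

```python
P = (1 << 61) - 1  # a Mersenne prime, chosen for illustration only
X = 256            # base of the polynomial hash (assumption)


def poly_hash(s: bytes) -> int:
    # Direct evaluation, used to hash the first window and for lookups.
    value = 0
    for byte in s:
        value = (value * X + byte) % P
    return value


def window_hashes(b: bytes, w: int) -> dict:
    """Map the hash of each length-w window of b to its first start index.

    Each hash is derived from the previous one in O(1) arithmetic
    operations, so one level costs O(len(b)) operations in total.
    """
    table = {}
    if w <= 0 or w > len(b):
        return table
    top = pow(X, w - 1, P)  # weight of the character leaving the window
    value = poly_hash(b[:w])
    table.setdefault(value, 0)
    for i in range(1, len(b) - w + 1):
        # Drop b[i-1], shift by one position, and bring in b[i+w-1].
        value = ((value - b[i - 1] * top) * X + b[i + w - 1]) % P
        table.setdefault(value, i)
    return table
```

B would build one such table per level and then look up each received hash in expected constant time.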
