
Tichy’s Distance

In document Sequence distance embeddings (Page 155-158)

Chapter 6 Sending and Swapping

6.4 Computationally Efficient Protocols for String Distances

6.4.3 Tichy’s Distance

Consider a variation of the above abstract problem. Instead of h distinguished characters, suppose that there are some number of “dividers” which are positioned between characters, such that there is at most one divider between any two adjacent characters. This problem is reducible to the distinguished character problem: there are n−1 locations where dividers can be placed. If our queries are whether there are any dividers within a substring, then clearly the same protocol will suffice, with the same number of queries in the worst case.

We consider two strings a and b whose Tichy distance is l. This means that a can be parsed into l pieces, each of which is a substring of b. So we can use the same divide-and-conquer technique as before: for every hash value A sends, B searches to see whether there is any substring of b that hashes to the same value. If there is, then these substrings are identified; if not, the substring will be split in two and the procedure recurses on the two halves.
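The divide-and-conquer exchange described above can be sketched as a single-process simulation. This is a minimal illustration, not the thesis's Algorithms 6.4.1 and 6.4.3: the function names `h` and `tichy_exchange` are hypothetical, a truncated SHA-256 digest stands in for whatever hash the protocol actually uses, the search over b is a naive scan of every offset, and the exchange starts from the whole string rather than from its two halves.

```python
import hashlib


def h(s: str) -> str:
    """Stand-in hash (an assumption): 64 bits of SHA-256 over the substring."""
    return hashlib.sha256(s.encode()).hexdigest()[:16]


def tichy_exchange(a: str, b: str):
    """Simulate A describing `a` to B, who holds `b`.

    Returns (string rebuilt by B, number of hashes A sent).
    """
    n = len(a)
    rebuilt = [None] * n
    hashes_sent = 0
    level = [(0, n - 1)]  # inclusive ranges of `a` that B has not resolved yet
    while level:
        next_level = []
        for lo, hi in level:
            length = hi - lo + 1
            if length <= 1:
                # Cheaper to send the raw character than a hash.
                rebuilt[lo:hi + 1] = a[lo:hi + 1]
                continue
            hashes_sent += 1
            target = h(a[lo:hi + 1])  # the value A sends
            # B tries every offset in b for a substring with the same hash.
            match = next(
                (b[j:j + length]
                 for j in range(len(b) - length + 1)
                 if h(b[j:j + length]) == target),
                None,
            )
            if match is not None:
                rebuilt[lo:hi + 1] = match
            else:
                # Unresolved: split in two and recurse in the next round.
                mid = (lo + hi + 1) // 2
                next_level += [(lo, mid - 1), (mid, hi)]
        level = next_level
    return "".join(rebuilt), hashes_sent
```

Each pass of the outer loop corresponds to one round of hashes from A and one round of replies from B, so the number of passes is logarithmic in the length of a.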

[Figure: binary parse tree of a, with the corresponding characters of b shown above it]

The lower string is a, the upper is b. The edit distance between the two strings is 3. The parse tree of the string a is shown, with the editing operations to make b. Again, blobs mark the differing substrings which are discovered in the traversal of the search tree. A d is inserted after the fourth character; this affects the lowest common ancestor of its neighbouring nodes, but below this, their substrings are common to both a and b. The deletion of an e, the seventh character of a, and the change of the final character c to e both affect all their ancestors. The communication proceeds as follows:

Round 1: A sends hashes on [0,7], [8,15]

Round 2: B replies with 11

Round 3: A sends hashes on [0,3], [4,7], [8,11], [12,15]

Round 4: B replies with 0101

Round 5: A sends hashes on [4,5], [6,7], [12,13], [14,15]

Round 6: B replies with 0101

Round 7: A sends the characters eaae

Figure 6.3: Illustrating how edit differences affect the binary parse tree of a

Theorem 6.4.4 If A runs Algorithm 6.4.1 and B runs Algorithm 6.4.3, then this protocol uses no more than O(l log(2n/l) log(n/δ)) bits of communication. It succeeds with probability 1−δ.

Proof. If A sends a hash of a substring that is also a substring of b, then B can identify the substring. Since a is formed from l substrings of b, it follows that a hash will only fail to be identified if the substring of a overlaps more than one substring of b. We can imagine a written out with l−1 markers between characters indicating the start and end of each of the substrings of b. As noted above, this is equivalent to the original abstract problem: if a hashed substring of a contains no markers, then it can certainly be identified by B; if it contains markers then it may not be identified (if we are lucky, it might be, but we shall take a worst-case analysis). Therefore, we need to send no more than 2l log(2n/l) hash values.

We now need to calculate the size of the hash value necessary to ensure that the probability of success is at least a constant. Each hash received from A has to be compared with up to n others as we try every offset within b. So we make at most 2ln log(2n/l) comparisons and, following the same pattern as in the above sections, we need to choose the hash base to be at least 2ln log(2n/l)/δ. If we have no a priori bound on l, we will have to use the fact that l is at most n. This gives the size in bits of the hash value as O(log n^3), which leads to the stated communication cost. As with the other protocols, the number of rounds is logarithmic in the length of a, since the size of substrings being handled halves in each round.
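To make the counting argument concrete, the following sketch evaluates the bound on the number of hashes and a hash size that would keep the overall collision probability below δ. It rests on assumptions not stated in the text: logarithms are taken base 2, the hash is idealised so that two different strings collide with probability one over the size of its range, and a union bound over all comparisons is used. The names `max_hashes` and `hash_bits` are hypothetical.

```python
import math


def max_hashes(n: int, l: int) -> float:
    """Worst-case number of hashes A sends: 2 l log2(2n / l)."""
    return 2 * l * math.log2(2 * n / l)


def hash_bits(n: int, l: int, delta: float) -> int:
    """Bits per hash so that a union bound over all comparisons gives
    total collision probability at most delta (idealised hash assumed)."""
    comparisons = max_hashes(n, l) * n  # each hash is tried at up to n offsets
    return math.ceil(math.log2(comparisons / delta))


# For n = 16 and l = 2 the bound is 2 * 2 * log2(16) = 16 hashes,
# hence at most 256 comparisons; with delta = 1/2 this needs 9 bits.
print(max_hashes(16, 2), hash_bits(16, 2, 0.5))
```

Multiplying the two quantities gives the total communication in bits under these assumptions, which has the same shape as the bound in Theorem 6.4.4.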

The computational complexity of this protocol is identical to the protocol for edit distance, O(|b| log |b|), since the algorithms being run are the same, and we have shown that they obey the same bound on the number of hashes.

6.4.4 LZ Distance

Since we have shown how to deal with Tichy’s distance, the LZ distance, which is similar in nature, follows fairly directly. We again take the same divide-and-conquer approach. The string a can be parsed into l substrings, which are either substrings of b or substrings which occur earlier in a. Rather than descending the tree in parallel, the protocol performs a left-to-right depth-first search for identifiable substrings of a. At each stage B tries to resolve the pair of hashes that have been sent by looking for a substring of b or of the partially built a that hashes to the same value. B first considers the hash of the left substring, and attempts to resolve that. If B cannot resolve the hash, then this is indicated to A, who splits the substring into two, and sends hashes of each of these halves. If B can resolve the left substring, then B proceeds to the right substring, and follows the same procedure.

Theorem 6.4.5 If A runs Algorithm 6.4.4 and B runs Algorithm 6.4.5, then this protocol uses O(l log(2n/l) log(n/δ)) bits of communication. It succeeds with probability 1−δ and the number of rounds involved is at most 2l log(n/l).

Proof. In an optimal parsing, a is parsed into l pieces, each piece of which is a substring present in b or earlier in a, or a single character. We can imagine that a has l−1 “dividers” that separate the substrings in this parsing. If a hashed substring falls between two dividers, then it can be resolved. The number of hashes necessary for B, who holds b, to identify a is that needed to locate the l−1 dividers, which is given by Lemma 6.4.1. This is 2(l−1) log(2n/(l−1)), which is less than 2l log(2n/l).

We are performing many more comparisons between hash values than before, so we require a larger base over which to compute hashes in order to ensure that the chance of a hash collision is still only constant for the whole process. We only compare hashes for substrings of the same length. If we have two strings each of length n, then there are fewer than 2n substrings of any given length. So there are fewer than 2n^2 possible pairwise comparisons. The theorem then follows.

Algorithm 6.4.4 Run by A who holds a

push [0, n−1] onto rangestack
repeat
    pop [l, r] from rangestack
    m ← (l+r)/2
    if cheaper to send characters then
        send a[l, r]
    else
        send hash(a[l, m−1])
        send hash(a[m, r])
        receive bit1, bit2
        if bit2 = 1 then
            push [m, r] to rangestack
        if bit1 = 1 then
            push [l, m−1] to rangestack
until rangestack is empty

Algorithm 6.4.5 Run by B who holds b

push [0, n−1] onto rangestack
repeat
    pop [l, r] from rangestack
    if cheaper to send characters then
        a[l, r] ← receive characters
    else
        m ← (l+r)/2
        o ← (r−l+1)/2
        receive hash1, hash2
        if ∃j : hash(b[j, j+o]) = hash2 or hash(a[j, j+o]) = hash2 then
            a[m, r] ← matching substring (b[j, j+o] or a[j, j+o]); bit2 ← 0
        else
            push [m, r] onto rangestack; bit2 ← 1
        if ∃i : hash(b[i, i+o]) = hash1 or hash(a[i, i+o]) = hash1 then
            a[l, m−1] ← matching substring (b[i, i+o] or a[i, i+o]); bit1 ← 0
        else
            push [l, m−1] onto rangestack; bit1 ← 1
        send bit1, bit2
until rangestack is empty
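The stack-driven, left-to-right behaviour of the two algorithms can be sketched as a single-process simulation. This is a simplified illustration rather than a transcription: it handles one hash per step instead of a pair, it searches by naive scan, and the names `h` and `lz_exchange` as well as the use of truncated SHA-256 are assumptions made for the example. Because the traversal is depth-first and left to right, the part of a that B has rebuilt is always a prefix, held here in `known`.

```python
import hashlib


def h(s: str) -> str:
    """Stand-in hash (an assumption): 64 bits of SHA-256 over the substring."""
    return hashlib.sha256(s.encode()).hexdigest()[:16]


def lz_exchange(a: str, b: str):
    """Simulate A describing `a` to B, who holds `b` and may also reuse
    the prefix of `a` rebuilt so far. Returns (rebuilt string, hashes sent)."""
    known = ""  # prefix of `a` that B has reconstructed
    hashes_sent = 0
    stack = [(0, len(a) - 1)]  # inclusive ranges, leftmost range on top
    while stack:
        lo, hi = stack.pop()
        length = hi - lo + 1
        if length <= 1:
            known += a[lo:hi + 1]  # A sends the raw character
            continue
        hashes_sent += 1
        target = h(a[lo:hi + 1])  # the value A sends
        match = None
        for source in (b, known):  # B searches b, then the rebuilt prefix
            for j in range(len(source) - length + 1):
                if h(source[j:j + length]) == target:
                    match = source[j:j + length]
                    break
            if match is not None:
                break
        if match is not None:
            known += match
        else:
            mid = (lo + hi + 1) // 2
            stack.append((mid, hi))      # right half is handled later
            stack.append((lo, mid - 1))  # left half is handled next
    return known, hashes_sent
```

For a = "abababab" and b = "ab", the final range [4,7] is resolved against the rebuilt prefix "abab" rather than against b, which is the extra freedom that distinguishes this protocol from the one for Tichy’s distance.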

Evfimievski [Evf00] considers what turns out to be the same distance measure, and claims that it requires no more than 3l log n hashes to be sent. With a more rigorous analysis, we have improved this to 2l log(n/l). The time complexity of our protocol is O((|a|+|b|) log(|a|+|b|)) hash function manipulations. The procedure to achieve this complexity is essentially the same as described in Section 6.4.2, except that in addition to keeping tables of hashes of b, B must additionally add hashes of the received parts of a, yielding the slightly higher time complexity.
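One way such a table of hashes could be built is with a rolling polynomial hash, so that all substrings of a given length are hashed in time linear in the string length. The sketch below is an assumption about the general technique, not the procedure of Section 6.4.2: the names `poly_hash` and `substring_hash_table`, the base 257 and the modulus 2^61 − 1 are all chosen for illustration only.

```python
MOD = (1 << 61) - 1  # illustrative prime modulus (an assumption)
BASE = 257           # illustrative base (an assumption)


def poly_hash(s: str) -> int:
    """Polynomial hash of a whole string, computed from scratch."""
    value = 0
    for ch in s:
        value = (value * BASE + ord(ch)) % MOD
    return value


def substring_hash_table(s: str, k: int) -> dict:
    """Map the hash of every length-k substring of s to its first start index."""
    table = {}
    if k <= 0 or k > len(s):
        return table
    top = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    value = poly_hash(s[:k])
    table[value] = 0
    for i in range(1, len(s) - k + 1):
        value = (value - ord(s[i - 1]) * top) % MOD       # drop s[i-1]
        value = (value * BASE + ord(s[i + k - 1])) % MOD  # append s[i+k-1]
        table.setdefault(value, i)
    return table


# B can then resolve a received hash with one dictionary lookup.
table = substring_hash_table("aabbaacc", 2)
print(table.get(poly_hash("bb")))  # start index of "bb" in the string
```

To support the LZ protocol, B would extend the same kind of table with windows of the received parts of a as they arrive.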
