Chapter 6 Sending and Swapping
6.5 Computationally Efficient Protocols for Permutation Distances
Having shown efficient protocols based on divide and conquer for the string distances, we now show how the same approach can be applied to the permutation distances as well.
Swap Distance
We shall use the same divide-and-conquer approach as we did for the Hamming distance. A simple lemma suffices to show that this will be efficient in terms of the swap distance.
Lemma 6.5.1 The Hamming distance between a and b is at most 2 swap(a, b).
Proof. Each swap interchanges two elements and leaves the rest as they were. Therefore, any single swap can alter the Hamming distance between the two strings by at most 2. So if the swap distance between a and b is swap(a, b), then the total change in Hamming distance can be at most twice this. ✷
Therefore, any document exchange protocol which is efficient with respect to the Hamming distance will also be efficient with respect to the swap distance. This immediately leads to a corollary to Theorem 6.4.1, substituting twice the swap distance s for the Hamming distance h.
Corollary 6.5.1 The protocol for exchanging documents for Hamming distance can be run on permutations, with A running Algorithm 6.4.1 and B running Algorithm 6.4.2 on their respective sequences. This is efficient for the swap distance, and has a communication cost of no more than O(s log(n/s) log(n/δ)) bits. The computation cost is O(n log n) hash operations.
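The counting argument in the proof above can be checked numerically. The following is a minimal sketch rather than part of the protocol: the helper names `hamming` and `apply_swaps` are hypothetical, and swaps are assumed to exchange any two positions (the same bound holds if only adjacent positions may be swapped).

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def apply_swaps(seq, k, rng):
    """Return a copy of seq after k swaps of randomly chosen position pairs."""
    out = list(seq)
    for _ in range(k):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(0)
a = list(range(32))
for k in range(10):
    b = apply_swaps(a, k, rng)
    # k swaps were applied, so swap(a, b) <= k and each swap changes
    # at most two positions: the Hamming distance is at most 2k.
    assert hamming(a, b) <= 2 * k
```

Because k swaps only give an upper bound on swap(a, b), the loop tests the inequality as the proof argues it, one swap at a time, rather than computing the swap distance exactly.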
Permutation Edit Distance
From the point of view of exchanging documents, Permutation Edit Distance is virtually identical to String Edit Distance. Each move operation has the same effect on the binary parse tree as a string edit operation — it affects leaf and internal nodes and hence causes a limited number of differences at various levels in the tree.
Theorem 6.5.1 If A runs Algorithm 6.4.1 on permutation P, and B runs Algorithm 6.4.3 on permutation Q, then this divide and conquer protocol for string edit distance is efficient with respect to permutation edit distance. It succeeds with probability 1 − δ and communicates no more than 4d log(n/d) hash values (where d = d(P, Q)).
Proof. Each permutation can be treated as a string drawn from an alphabet of size n. Each move operation can be treated as a deletion followed by a re-insertion on a string. It therefore follows that the cost of exchanging permutations with a distance of d is no more than that of exchanging strings with a distance of 2d. The proof is almost identical to that of Theorem 6.4.3. It has the same computation cost, O(n log n). ✷
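The reduction used in the proof, one move equals one deletion plus one re-insertion, can be made concrete. This is an illustrative sketch only; the function name `move` and its index convention are assumptions, not notation from the text.

```python
def move(perm, src, dst):
    """Move the element at index src so that it ends up at index dst.

    Implemented as one deletion followed by one insertion, which is why a
    single move costs at most two string edit operations.
    """
    out = list(perm)
    item = out.pop(src)    # deletion
    out.insert(dst, item)  # re-insertion
    return out

P = [0, 1, 2, 3, 4, 5]
Q = move(P, 1, 4)
assert Q == [0, 2, 3, 4, 1, 5]
assert sorted(Q) == P  # Q is still a permutation of the same elements
```

A pair of permutations at distance d is therefore at string edit distance at most 2d, which is the substitution that turns the string edit bound into 4d log(n/d).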
Transposition Distance
We first give a lemma relating the transposition distance between two permutations P and Q, t(P, Q), and the Tichy distance of the permutations, tichy(P, Q).
Lemma 6.5.2 tichy(P, Q) ≤ 3t(P, Q) + 1
Proof. Recall from Theorem 3.2.3 that the number of transposition breakpoints of P relative to Q is at most 3 times the transposition distance between them. Between two consecutive breakpoints in P, the subsequence is identical to a subsequence of Q. In other words, if the transposition distance between P and Q is t, then P can be parsed into at most 3t + 1 substrings of Q. If this is the case, then the Tichy distance between P and Q cannot be any more than 3t for t ≥ 1, since we have shown a way to parse P into at most 3t substrings of Q. Also, tichy(P, Q) = 0 ⇐⇒ t(P, Q) = 0 ⇐⇒ P = Q. ✷
Since the transposition distance is so closely related to the Tichy distance, the multi-round protocol for the Tichy distance, outlined above, is sufficient to exchange permutations in a fashion that is efficient in terms of the transposition distance. Applying the above lemma with Theorem 6.4.4 yields the following corollary.
Corollary 6.5.2 The protocol for exchanging two documents for Tichy distance, Algorithm 6.4.1 and Algorithm 6.4.3, is also efficient with respect to their transposition distance, t(P, Q). It has a communication cost of no more than 6t(P, Q) log(n/2t(P, Q)) hash values, and a computation cost of O(n log n).
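The parsing step in the proof of Lemma 6.5.2 can be sketched as follows. This is a hedged illustration: `parse_into_blocks` and `transpose` are hypothetical helpers, a transposition is assumed to exchange two adjacent blocks, and breakpoints are counted only between neighbouring elements (no sentinel elements at the ends).

```python
import random

def parse_into_blocks(P, Q):
    """Greedily split P into maximal runs that appear contiguously in Q."""
    pos = {v: i for i, v in enumerate(Q)}
    blocks, current = [], [P[0]]
    for prev, cur in zip(P, P[1:]):
        if pos[cur] == pos[prev] + 1:
            current.append(cur)      # no breakpoint: extend the current run
        else:
            blocks.append(current)   # breakpoint: close the run, start another
            current = [cur]
    blocks.append(current)
    return blocks

def transpose(seq, i, j, k):
    """Exchange the adjacent blocks seq[i:j] and seq[j:k], where i < j < k."""
    return seq[:i] + seq[j:k] + seq[i:j] + seq[k:]

rng = random.Random(1)
Q = list(range(40))
P = list(Q)
for t in range(1, 6):
    i, j, k = sorted(rng.sample(range(len(P) + 1), 3))
    P = transpose(P, i, j, k)
    # Each transposition cuts the sequence in at most three places,
    # so after t of them P splits into at most 3t + 1 substrings of Q.
    assert len(parse_into_blocks(P, Q)) <= 3 * t + 1
```

For example, exchanging the blocks [0, 1] and [2, 3] of [0, 1, 2, 3, 4] gives [2, 3, 0, 1, 4], which parses into the three substrings [2, 3], [0, 1] and [4].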
Reversal Distance
The same approach used to deal with transposition distance will also work for reversal distance, with a small alteration. Transposition breakpoints denote the junctions between substrings of the original sequence; reversal breakpoints mark the junctions between substrings of the original sequence that may have been reversed. Thus the string can still be parsed into substrings of the original sequence (in this case, into at most 2r + 1 such substrings), but some of these may be reversed. So we can again use the protocol for Tichy's distance, with one alteration: when B searches for substrings that match the received hashes, b is considered both forwards and reversed. Algorithm 6.4.3 must be modified so that the reversed string is also searched for hash matches; other than that, the algorithm is unchanged. It follows that this scheme will use the same number of hashes as a string whose Tichy distance is at most 2r, though we will have to make twice as many hash comparisons. We therefore gain another corollary to Theorem 6.4.4.
Corollary 6.5.3 The modified protocol for exchanging documents for Tichy distance using Algorithm 6.4.1 and the modified Algorithm 6.4.3 is efficient with respect to the reversal distance. It has a communication cost of no more than 4r log(2n/3r) hash values, and a computation cost of O(n log n).
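The modification for reversals, matching each received hash against B's string read both forwards and reversed, might look like the sketch below. It is not Algorithm 6.4.3 itself: the names `h` and `find_match` are hypothetical, SHA-256 is an arbitrary stand-in for the protocol's hash function, and the naive scan over all windows ignores the efficient search needed for the O(n log n) bound.

```python
import hashlib

def h(block):
    """Stand-in hash of a sequence of integers (SHA-256, an arbitrary choice)."""
    return hashlib.sha256(repr(list(block)).encode()).hexdigest()

def find_match(target_hash, length, b):
    """Return (start, is_reversed) for a window of b matching target_hash, or None.

    Every window is hashed twice, once as it stands and once reversed,
    which doubles the number of hash comparisons.
    """
    for start in range(len(b) - length + 1):
        window = b[start:start + length]
        if h(window) == target_hash:
            return start, False
        if h(window[::-1]) == target_hash:
            return start, True
    return None

b = [0, 1, 2, 3, 4, 5, 6, 7]
a_block = [5, 4, 3]  # a substring of b, read backwards
assert find_match(h(a_block), 3, b) == (3, True)
assert find_match(h([1, 2]), 2, b) == (1, False)
assert find_match(h([7, 0]), 2, b) is None
```

The returned flag tells B whether to copy the matched window as it stands or reversed when reconstructing A's sequence.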
Compound Permutation Distances
Lemma A.1.1 showed that if we allow combinations of permutation operations (reversals, transpositions and editing operations) then this induced distance between permutations can be approximated by counting the number of reversal breakpoints, and that this gives a 3-approximation. The above protocol exchanges sequences on the basis of the number of reversal breakpoints, and so will solve this problem. We gain a further corollary to Theorem 6.4.4.
Corollary 6.5.4 If A runs Algorithm 6.4.1 and B runs the modified Algorithm 6.4.3 on their respective permutations then this modified protocol is efficient with respect to the compound permutation distance, τ. It has a communication cost of no more than 6τ log(n/2τ) hash values and a computation cost of O(n log n) hash operations.
Allowing Indels
We have seen in Lemma A.1.2 that permutation distances with insertions and deletions, where one of the sequences is allowed to be a string (denoted τ), can be embedded into the L1 distance with a distortion of at most 3. Each reversal breakpoint in a relative to b can be thought of as the junction in a between two (possibly reversed) substrings of b. It therefore follows that the same modified protocol will suffice, giving one final corollary to Theorem 6.4.4.
Corollary 6.5.5 The protocol of Algorithm 6.4.1 and the modified Algorithm 6.4.3 is efficient with respect to the compound permutation distances with indels. It has a communication cost of no more than 6τ log(2n/3τ) hash values and a computation cost of O(nlog2n).
6.6 Discussion
We have seen how two parties can communicate to exchange similar documents in a way that is much more efficient in terms of communication than sending the documents in full.
| String Distance | Metric | Lower bound | Single Round | Multi round hashes | Rounds |
|---|---|---|---|---|---|
| Hamming Distance | Yes | h log (|σ| − 1)n/h | 2h log n(|σ| − 1) | 2h log 2n/h | log n |
| Levenshtein Edit Distance | Yes | e log 2(|σ| − 1)n/e | 2e log |σ|(n + 1)/e | 2e log 2n/e | log n |
| LZ Distance | No | 2l log n | — | 2l log 2n/l | l log n/l |
| Compression Distance | Yes | 9c log n | 18c log 2n | 24c log n log* n | log n |
| Edit Distance with Moves | Yes | 3d log 2n | 6d log 2n | 24d log n log* n | log n |
| Unconstrained Delete | No | 9du log 2n | — | 24du log n log* n | log n |
| Tichy's Distance | No | 2l log n | — | 2l log 2n/l | log n |

| Permutation Distance | Metric | Lower bound | Single Round | Multi round hashes | Rounds |
|---|---|---|---|---|---|
| Permutation Edit Distance | Yes | 2d log n | 4d log n | 4d log n/d | log n |
| Reversal Distance | Yes | 2r log n | 4r log n | 4r log 2n/3r | log n |
| Transposition Distance | Yes | 3t log n | 6t log n | 6t log n/t | log n |
| Swap Distance | Yes | swap log n | 2 swap log n | 4 swap log n/swap | log n |
| RITE Distances | Yes | 3τ log 2n | 6τ log 2n | 6τ log 2n/3τ | log n |
Figure 6.5: Main results on document exchange. For each distance, we give a lower bound on the number of bits needed to exchange sequences; if the measure is a metric, then we give the single round cost based on the colouring protocols of Section 6.3. For the multi-round protocols based on hashing, we give the leading terms in the number of bits exchanged, and the number of rounds required.
• We have seen how arguments based on graph colouring can achieve an amount of communication that is a factor of two above the lower bound for a large class of metric distances.
• For several of our distances, we have seen how computationally efficient protocols can achieve the single round cost owing to the structure of these metrics.
• We have described a number of protocols which sacrifice some efficiency in communication for computational tractability. These are all based on ideas of divide-and-conquer techniques using hash functions, some of which have been described before in the literature. For the first time we analyse the cost of these and give tight bounds on the exact number of bits communicated by these protocols.
• For a number of important distances, such as Tichy’s distance, the LZ distance and the Block Editing distances, we give the first protocols or improved protocols to allow the efficient exchange of documents in terms of their distance. These draw on the analysis of these metrics and their embeddings from earlier chapters.
All the protocols described are efficient with respect to the distance measures employed: that is, the cost depends only linearly on the distance. The main results for this section in terms of the cost of communication are summarised in Figure 6.5.
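To give a feel for the size of these bounds, the multi-round expressions can be evaluated for sample parameters. The sketch below is an illustration under stated assumptions: logarithms are taken to base 2 (the text does not fix the base), the helper `multi_round_hashes` is hypothetical, and the values n = 2^20 and distance 16 are arbitrary examples, not results from the chapter.

```python
import math

def multi_round_hashes(coeff, dist, n, scale=1.0):
    """Evaluate coeff * dist * log2(scale * n / dist).

    This is the common shape of the multi-round bounds, such as
    4d log(n/d) for permutation edit distance (coeff=4, scale=1) and
    2h log(2n/h) for Hamming distance (coeff=2, scale=2).
    """
    return coeff * dist * math.log2(scale * n / dist)

n = 2 ** 20   # sequence length (arbitrary example)
dist = 16     # distance between the two sequences (arbitrary example)

perm_edit = multi_round_hashes(4, dist, n)          # 4d log(n/d)
hamming = multi_round_hashes(2, dist, n, scale=2)   # 2h log(2n/h)

assert perm_edit == 1024.0   # 4 * 16 * log2(2**16)
assert hamming == 544.0      # 2 * 16 * log2(2**17)
```

With these sample values the bounds come to roughly a thousand hash values or fewer for a sequence of about a million elements, and they grow linearly in the distance apart from the logarithmic factor.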