The Shape of Things to Come - Sequence distance embeddings

Now that we have introduced the distances that we will study and some of the problems we would like to answer, we can go on to describe the results we shall show for these distances. Our results are of two types: firstly, embeddings of these distances into different spaces with bounded distortion; and secondly, applications of these embeddings to solve problems related to the original distance. The distances that we consider are summarised in Figure 1.2 — we include the distortion factor that we will show for embedding these distances into alternative spaces. We now outline the structure of the rest of this work: Chapters 2, 3 and 4 give results on embedding sequence distances, and Chapters 5 and 6 give some applications of these embeddings. Finally, Chapter 7 concludes with some discussions on extensions, improvements and open problems.

Footnote 5: In the ANN literature, it is more usual to denote the dimensionality of the space with $d$ and the number of objects with $n$. We use a different convention in keeping with our choice of $n$ for the size of objects throughout.

| Set and Vector Distances | Description | Reference | Metric | Factor |
| --- | --- | --- | --- | --- |
| Symmetric Difference | Number of elements in only one set | [AMS99] | Yes | $1 \pm \epsilon$ |
| Intersection Size | Number of elements in both sets | [KN97] | No | $n$ |
| Union Size | Number of elements in either set | [FM85] | No | $1 \pm \epsilon$ |
| $L_p$ distances | $(\sum_i \lvert a_i - b_i \rvert^p)^{1/p}$ | [Ind00] | Yes | $1 \pm \epsilon$ |

| String Distance | Description | Reference | Metric | Factor |
| --- | --- | --- | --- | --- |
| Hamming Distance | Change individual characters | [Ham80] | Yes | $1 \pm \epsilon$ |
| Levenshtein Edit Distance | Character operations (insert, delete, change) | [Lev66] | Yes | — |
| LZ Distance | Create target string by copying from source or partially built string | [CPŞV00] | No | $\log n \log^* n$ |
| Compression Distances | Unit cost substring copy, move, uncopy | [CPŞV00] | Yes | $\log n \log^* n$ |
| Unconstrained Delete | As Compression, with unconstrained deletes | [Evf00] | No | $\log n \log^* n$ |
| Edit Distance with moves | Substring move, insert and delete characters | [CM02] | Yes | $\log n \log^* n$ |
| Tichy’s Distance | Create target by copying from source only | [Tic84, LT97] | No | — |

| Permutation Distance | Description | Reference | Metric | Factor |
| --- | --- | --- | --- | --- |
| Permutation Edit Distance | Elementary operations (insert, delete, change) | [DH98] | Yes | $\log n$ |
| Reversal Distance | Block reversals | [Gus97] | Yes | 2 |
| Transposition Distance | Block transpositions | [Gus97] | Yes | 2 |
| Swap Distance | Adjacent character transpositions | [Jer85] | Yes | 1 |
| RITE Distances | Reversals, Indels, Transpositions and Edits | [CMŞ01] | Yes | 3 |

A summary of the main distances that we will discuss: for each one we give a brief description and a reference. We record whether it is a metric, and give the factor of distortion of our embedding of the distance into a smaller space.

Figure 1.2: Summary of the distances of interest

Chapter 2: Sketching and Streaming

We begin our study of embeddings by considering metrics in vector spaces with the $L_p$ distance, and show how these can be embedded in vector spaces of much smaller dimension. These embeddings can be computed in a number of computation models, including the streaming model, in which a large vector is processed in an arbitrary order one entry at a time with very small working space requirements. We go on to study set measurements, and show how these are related to vector distances. We are able to show hardness results for other set problems in a number of models including the sketching model, where a short summary of a set is created to allow approximation of the set measure. Finally, we discuss further problems in vector spaces, of clustering (organising a set of vectors into subsets of close vectors) and approximate nearest neighbors (pre-processing a set of vectors to rapidly find the approximately closest to a query vector). This chapter consists mostly of a survey of prior results of other authors, which will be used in subsequent chapters to build solutions for different sequence distances.
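As a rough illustration of the streaming model described for Chapter 2, the following Python sketch maintains a small linear summary of a vector whose entries arrive as updates in arbitrary order, and estimates the squared $L_2$ norm from that summary. It follows the general idea of random-sign summaries and is not claimed to be any of the specific constructions surveyed in the chapter. The class name `L2Sketch`, the helper `sign`, and the use of a cryptographic hash as a stand-in for a limited-independence hash family are all assumptions made for illustration.

```python
import hashlib


def sign(seed, row, index):
    """Pseudo-random +1/-1 for (row, index).

    Assumption: a hash of (seed, row, index) stands in for the
    limited-independence hash family a formal analysis would require.
    """
    digest = hashlib.blake2b(f"{seed}:{row}:{index}".encode(), digest_size=8).digest()
    return 1 if digest[0] & 1 else -1


class L2Sketch:
    """Linear summary of a vector: `rows` counters, independent of the vector length."""

    def __init__(self, rows=64, seed=0):
        self.rows = rows
        self.seed = seed
        self.counters = [0] * rows

    def update(self, index, delta):
        """Process one stream item: entry `index` of the vector changes by `delta`."""
        for row in range(self.rows):
            self.counters[row] += sign(self.seed, row, index) * delta

    def subtract(self, other):
        """Return the sketch of the difference vector (same rows and seed required)."""
        if (self.rows, self.seed) != (other.rows, other.seed):
            raise ValueError("sketches must share rows and seed")
        out = L2Sketch(self.rows, self.seed)
        out.counters = [x - y for x, y in zip(self.counters, other.counters)]
        return out

    def estimate_squared_l2(self):
        """Average of squared counters; an estimate of the squared L2 norm."""
        return sum(c * c for c in self.counters) / self.rows
```

Because the summary is linear in the input, the order of updates does not matter, and subtracting two sketches built with the same seed gives the sketch of the difference vector. The same structure can therefore be used to estimate the distance between two streams without storing either vector.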

Chapter 3: Sorting Sequences

We next study embeddings of permutations into vector spaces. Motivated by computational biology scenarios, we study problems of computing distances between permutations as well as matching permutations in sequences, and finding approximate nearest neighbors.

We adopt the general approach of embedding permutation distances into well-known vector spaces in an approximately distance-preserving manner, and solve the resulting problems on the well-known spaces. The main results are as follows. We present approximately distance-preserving embeddings of these permutation distances into well-known spaces. Using these embeddings, we obtain several results, including (1) communication complexity protocols to estimate the permutation distances accurately; (2) efficient solutions for approximately solving nearest neighbor problems with permutations; and (3) algorithms for finding permutation distances in the streaming model. We also use these embeddings to solve approximate pattern matching problems for permutation distances.
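To make the idea of a distance-preserving embedding of permutations concrete, consider swap distance (adjacent transpositions). A standard observation is that the number of adjacent swaps needed to turn one permutation into another equals the number of symbol pairs that the two permutations order differently. Mapping each permutation to a 0/1 vector with one entry per pair therefore gives an exact embedding into Hamming space. The Python sketch below illustrates this textbook construction; it is not claimed to be the embedding developed in Chapter 3, and all function names are hypothetical.

```python
from itertools import combinations


def pair_order_vector(perm):
    """0/1 vector with one entry per pair of symbols x < y:
    1 if x appears before y in `perm`, else 0."""
    position = {symbol: i for i, symbol in enumerate(perm)}
    symbols = sorted(perm)
    return [1 if position[x] < position[y] else 0
            for x, y in combinations(symbols, 2)]


def hamming(u, v):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def swap_distance_by_sorting(p, q):
    """Count the adjacent swaps bubble sort uses to turn p into q.

    Bubble sort performs exactly one swap per inversion, so this is
    the exact swap distance, computed without any embedding.
    """
    rank = {symbol: i for i, symbol in enumerate(q)}
    work = [rank[s] for s in p]
    swaps = 0
    for end in range(len(work) - 1, 0, -1):
        for i in range(end):
            if work[i] > work[i + 1]:
                work[i], work[i + 1] = work[i + 1], work[i]
                swaps += 1
    return swaps
```

The embedded vectors have length quadratic in the permutation length, so in practice one would combine such a mapping with a dimension-reducing summary rather than materialise the vectors.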

Chapter 4: Strings and Substrings

We use the approximate pattern matching problem as applied to string edit distance with moves as a guiding motivation to study problems on string distances. We seek a solution to this problem that is close to linear in the size of the input instead of the existing quadratic solutions to related problems on string edit distance. Our result is a near linear time deterministic algorithm for our version of the problem. It produces an answer that is an $O(\log n \log^* n)$ approximation of the correct answer. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments.

The results are obtained by embedding strings into $L_1$ vector space using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance-preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors and streaming computations with strings. We also show embeddings for other variations of this distance, including the compression distances, and block edit distances with unconstrained deletes.

Chapter 5: Stables, Subtables and Streams

We study the solution of some problems on vectors using the embeddings seen in Chapter 2 to achieve sublinear time and space. We first tackle the problem of comparing data streams. In particular, we use the Hamming norm, which is a well-known measure throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in the data stream, which is a statistic of great interest in databases. When applied to a difference between a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal items in the two streams. This can be used in auditing data streams that are expected to be nearly identical, and can be applied to network intrusion detection.
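The two uses of the Hamming norm described above can be pinned down with a short Python sketch. It computes the quantities exactly, using space linear in the number of distinct items, so it only illustrates the definitions and is not the sublinear-space method studied in Chapter 5. The stream format (a sequence of `(item, delta)` updates) and the function names are assumptions made for this example.

```python
from collections import Counter


def frequency_vector(stream):
    """Accumulate (item, delta) updates into a vector of item counts."""
    counts = Counter()
    for item, delta in stream:
        counts[item] += delta
    return counts


def hamming_norm(counts):
    """Number of non-zero entries: the distinct items present in the stream."""
    return sum(1 for value in counts.values() if value != 0)


def hamming_difference(stream_a, stream_b):
    """Hamming norm of the difference: items whose counts differ between streams."""
    diff = frequency_vector(stream_a)
    for item, delta in stream_b:
        diff[item] -= delta
    return hamming_norm(diff)
```

For example, a stream that inserts `x` twice, inserts `y` twice, and inserts then deletes `z` has Hamming norm 2, since only `x` and `y` remain present.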

We also examine the problem of data mining and clustering massive data tables. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distribution, which IP subnet traffic distributions over time intervals are similar, etc.) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two “regions”?) as well as technical challenges (how to perform similarity computations efficiently as massive tables get accumulated over time?) that we address. We implement methods for determining similar regions in massive tabular data. Our methods are approximate, but highly accurate as we show empirically, and they are fast, running in time nearly linear in the table size.

Chapter 6: Sending and Swapping

We address the problem of minimising the communication involved in the exchange of similar documents. This turns out to be facilitated by the embeddings we have advanced in Chapters 2, 3 and 4. We consider two users, A and B, who hold documents $a$ and $b$ respectively. Neither of the users has any information about the other’s document. They exchange messages so that B computes $a$; it may be required that A computes $b$ as well. The goal is to design communication protocols with the main objective of minimising the total number of bits they exchange; other objectives are minimising the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how we measure the distance between $a$ and $b$. We consider several metrics for measuring this distance as described in Chapter 1. For each metric, we present communication-efficient protocols, which often match the corresponding lower bounds up to a constant factor. In consequence, we obtain error-correcting codes for these error models which correct up to $d$ errors in $n$ characters using $O(d \cdot \text{poly-log}(n))$ bits.

Chapter 2

Sketching and Streaming

The Siege of Belgrade

An Austrian army, awfully arrayed,
Boldly by battery besieged Belgrade.
Cossack commanders cannonading come,
Dealing destruction’s devastating doom.
Every endeavor engineers essay,
For fame, for fortune fighting - furious fray!
Generals ’gainst generals grapple - gracious God!
How honors Heaven heroic hardihood!
Infuriate, indiscriminate in ill,
Kindred kill kinsmen, kinsmen kindred kill.
Labor low levels longest, loftiest lines;
Men march ’mid mounds, ’mid moles, ’mid murderous mines;
Now noxious, noisey numbers nothing, naught
Of outward obstacles, opposing ought;
Poor patriots, partly purchased, partly pressed,
Quite quaking, quickly “Quarter! Quarter!” quest.
Reason returns, religious right redounds,
Suwarrow stops such sanguinary sounds.
Truce to thee, Turkey! Triumph to thy train,
Unwise, unjust, unmerciful Ukraine!
Vanish vain victory! vanish, victory vain!
Why wish we warfare? Wherefore welcome were
Xerxes, Ximenes, Xanthus, Xavier?
Yield, yield, ye youths! ye yeomen, yield your yell!
Zeus’, Zarpater’s, Zoroaster’s zeal,
Attracting all, arms against acts appeal!

Chapter Outline

This chapter is concerned with setting up the basic embedding results for this thesis, which will be built on in later chapters. We consider just sets and integer-valued vectors (for the most part, these two can be interchanged using some simple isomorphisms), and survey the known results on embeddings for distance measurements on these objects. Almost all results reported here are due to other authors, and this should be viewed as background material for the discussions and development of solutions for questions relating to the sequence distances we consider in subsequent chapters. Some detail is provided on the methods and theorems described to give an insight into how they work. Summaries of the proofs are given when they offer further insight, and also when we build on these ideas in later chapters in a way that requires modifying the mechanism used or altering the proof. For example, we need to understand the methods used to compute sketches of vectors, clusterings and nearest neighbors because in later chapters we will want to use these algorithms on transformed sequences and be sure that corresponding results can be proved about their accuracy and running time.

We progress as follows: Section 2.1 begins by discussing different models for computing embeddings, and how to compute probabilistic equality tests in these models. Section 2.2 takes the important class of vector $L_p$ distances, and reports on the recent progress that has been made in computing embeddings of these in various models. In Section 2.3 we consider various measurements on pairs of sets in terms of entries in Venn diagrams, and how these can be related to vector distances. Lastly, in Section 2.4 we look at a number of so-called “geometric problems” on vector spaces, which will later be applied to other spaces by way of our embeddings.
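As a hedged illustration of what a probabilistic equality test can look like, the Python sketch below compares two integer vectors by evaluating a polynomial fingerprint at a shared random point modulo a prime. This is one standard fingerprinting construction and is not necessarily the test presented in Section 2.1. The modulus, the function names, and the assumption that entries are non-negative integers smaller than the modulus are all choices made for this example.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, assumed larger than any vector entry


def fingerprint(vector, point):
    """Evaluate sum_i vector[i] * point**(i + 1) modulo PRIME via Horner's rule."""
    acc = 0
    for value in reversed(vector):
        acc = (acc + value) * point % PRIME
    return acc


def probably_equal(a, b, rng=None):
    """Compare two vectors using one short fingerprint each.

    Equal vectors always pass. Distinct vectors of the same length n
    pass only if the random point is a root of their difference
    polynomial, which happens with probability at most n / (PRIME - 1).
    """
    if len(a) != len(b):
        return False
    rng = rng or random.Random()
    point = rng.randrange(1, PRIME)  # shared random evaluation point
    return fingerprint(a, point) == fingerprint(b, point)
```

In a communication setting, the two parties would agree on `point` (for example through shared randomness) and exchange only the fingerprints, which take a number of bits logarithmic in the modulus regardless of the vector length.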
