Approximating Approximate Pattern Matching

(1)

Jan Studený

Department of Computer Science, ETH Zürich, Switzerland [email protected]

Przemysław Uznański

Institute of Computer Science, University of Wrocław, Poland1 [email protected]

Abstract

Given a textT of lengthnand a patternP of lengthm, the approximate pattern matching problem asks for computation of a particulardistance function betweenP and everym-substring ofT. We consider a (1±ε) multiplicative approximation variant of this problem, for`pdistance function. In this paper, we describe two (1 +ε)-approximate algorithms with a runtime ofOe(n_ε) for all (constant) non-negative values ofp. For constantp≥1 we show a deterministic (1 +ε)-approximation algorithm. Previously, such run time was known only for the case of`1distance, by Gawrychowski and Uznański [ICALP 2018] and only with a randomized algorithm. For constant 0≤p≤1 we show a randomized algorithm for the`p, thereby providing a smooth tradeoff between algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (case ofp= 0) and of Gawrychowski and Uznański for`1 distance.

2012 ACM Subject Classification Theory of computation→Design and analysis of algorithms Keywords and phrases Approximate Pattern Matching,`pDistance,`1Distance, Hamming Distance, Approximation Algorithms

Digital Object Identifier 10.4230/LIPIcs.CPM.2019.15

Related Version A full version of the paper is available at [24],http://arxiv.org/abs/1810.01676.

1 Introduction

Pattern matching is one of the core problems in text processing algorithms. Given a text T of lengthnand a patternP of lengthm,m≤n, both over an alphabet Σ, one searches for occurrences of P inT as a substring. A generalization of a pattern matching is to find substrings ofT that are similar toP, where we consider a particular string distance and ask for allm-substrings ofT where the distance to P does not exceed a given threshold, or simply report the distance fromP to every m-substring ofT. Typical distance functions considered are Hamming distance,`1 distance, or in general`pdistances for some constantp,

assuming input is over a numerical, e.g. integer, alphabet.

For reporting all Hamming distances, Abrahamson [1] described an algorithm with the complexity ofO(n√mlogm). Using a similar approach, the same complexity was obtained in [18] and later in conference works [3, 5] for reporting all `1 distances. It is a major open problem whether near-linear time algorithm, or even O(n3/2−ε_{) time algorithm, is}

possible for such problems. A conditional lower bound [6] was shown, via a reduction from matrix multiplication. This means that existence of combinatorial algorithm with runtime

O(n3/2−ε) solving the problem for Hamming distances implies combinatorial algorithms for boolean matrix multiplication with O(n3−δ_{) runtime, which existence is unlikely. If one}

is uncomfortable with poorly defined notion of combinatorial algorithms, one can apply the reduction to obtain a lowerbound of Ω(nω/2_{) for Hamming distances pattern matching,}

1 _{Most of the work was done while the author was affiliated with ETH Zürich.} © Jan Studený and Przemysław Uznański;

licensed under Creative Commons License CC-BY

30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Editors: Nadia Pisanti and Solon P. Pissis; Article No. 15; pp. 15:1–15:13

(2)

where 2≤ω <2.373 is a matrix multiplication exponent.2 Later, the complexity of pattern matching under Hamming distance and under`1distance was proven to be identical (up to polylogarithmic terms) [11, 19].

The mentioned hardness results serve as a motivation for considering relaxation of the problems, with (1 +ε) multiplicative approximation being the obvious candidate. For Hamming distance, Karloff [14] was the first to propose an efficient approximation algorithm with a run time ofO(_εn2log

3

m). The _ε12 dependency was believed to be inherent, as is the case for e.g. space complexity of sketching of Hamming distance, cf. [25, 12, 4]. However, for approximate pattern matching that was refuted by Kopelowitz and Porat [15, 16], by providing randomized algorithms with complexityO(n_εlognlogmlog1_εlog|Σ|) and O(n_εlognlogm) respectively. Moving to`1 distance, Lipsky and Porat [20] gave a deterministic algorithm with a run time of O(_εn2logmlogU), while later Gawrychowski and Uznański [10] have improved the complexity to a (randomized)O(n_εlog2nlogmlogU), whereU is the maximal integer value on the input. Additionally, we refer the reader to the line of work on other relaxations on exact the distance reporting [2, 7, 10, 3].

A folklore result (c.f. [20]) states that the randomized algorithm with a run time of e

O(n

ε2) is in fact possible for any`p distance, 0< p≤2, with use ofp-stable distributions

and convolution.3 _{Such distributions exist only when} _p _≤_{2, which puts a limit on this} approach. See [22] for wider discussion onp-stable distributions. Porat and Efremenko [23] have shown how to approximate general distance functions between pattern and text in time

O(_εn2log

2_m_log3_|_Σ_|_log_B

d), whereBd is upperbound on distance between two characters in

Σ. Their solution does not immediately translates to`p distances, since it allows only for

score functions of formP

jd(ti+j, pj) wheredis arbitrary metric over Σ. Authors state that

their techniques generalize to computation of`2 distances, but the dependencyε−2 in their approach is unavoidable. [20] observe that`2 pattern matching can be in fact computed in O(nlogm) time, by reducing it to a single convolution computation. This case and analogously case ofp= 4,6, . . .are the only ones where fast and exact algorithm is known. We want to point that for`∞pattern matching there is an approximation algorithm of

complexityO(n_εlogmlogU) by Lipsky and Porat [20]. Moving past pattern matching, we want to point that in a closely related problem of computing (min,+)-convolution there exists

O(n_εlogn_εlogU) time algorithm computing (1 +ε) approximation, cf. Mucha et al. [21]. Two questions follow naturally. First, is there aOe(_poly(n_ε₎) algorithm for`pnorms pattern

matching whenp > 2? Second, is there anything special to p= 0 and p= 1 cases that allows for faster algorithms, or can we extend their complexities to other `p norms? To

motivate further those questions, observe that in the regime of maintaining`p sketches in

theturnstile streaming model (sequence of updates to vector coordinates), one needs small space of Θ(logn) bits when p≤ 2 (cf. [13]), while whenp > 2 one needs large space of Θ(n1−2/p_log_{n) bits (cf. [9, 17]) meaning there is a sharp transition in problem complexity}

atp= 2. Similar phenomenon of transition atp= 2 is observed forp-stable distributions, and one could expect such transition to happen in the pattern matching regime as well.

In this work we show that for any constant p≥0 there is an algorithm of complexity e

O(n_ε), replicating the phenomenon of linear dependency onε−1_{from Hamming distance and} `1distance to all`p norms. Additionally this provides evidence that no transition atp= 2

happens, and so far to our understanding cases ofp >2 andp <2 are of similar hardness.

2 _{Although the issue is that we do not even know whether}_{ω >}_{2 or not.} 3 _{We use}

e

(3)

1.1 Definitions and preliminaries

Model. In most general setting, our inputs are strings taken from arbitrary alphabet Σ. We use this notation only when structure of alphabet is irrelevant for the problem (e.g. Hamming distances). However, when considering`p distances we focus our attention over an integer

alphabet [U]def={0,1, ..., U−1}for someU. One can usually assume thatU = poly(n), and then logU term can be safely hidden in theOe notation, however we provide the dependency explicitly in Theorem statements. Even without such assumption, we can assume standard word RAM model, in which arithmetic operations on words of size logU take constant time. Otherwise the complexities have an additional logU factor. We also denoteu= logU. While we restrict input integer values, we allow intermediate computation and output to consist of floating point numbers having ubits of precision.

Distance between strings. Let X =x1x2. . . xn and Y =y1y2. . . yn be two strings. For

anyp >0, we define their`p distance as

`p(X, Y) = X i |xi−yi|p !1/p .

Particularly, `1 distance is known as Manhattan distance, and `2 distance is known as Euclidean distance. Observe that thep-th power of`p distance has particularly simpler form

of`p(X, Y)p=Pi|xi−yi|p.

TheHamming distancebetween two strings is defined as Ham(X, Y) =|{i:xi 6=yi}|.

Adopting the convention that 00_{= 0 and} _x0_{= 1 for}_x₆_{= 0, we observe that (`}

p)p approaches

Hamming distance as p→0. Thus Hamming distance is usually denoted as `0 (although (`0)0 is more precise notation).

Text-to-pattern distance. For textT =t1t2. . . tn and patternP =p1p2. . . pm, the

text-to-pattern distance is defined as an arrayS such that, for everyi,S[i] =d(T[i+ 1.. i+m], P) for particular distance functiond. Thus, for`p distanceS[i] =

Pm

j=1|ti+j−pj|p

1/p , while for Hamming distance S[i] = |{j ∈ {1, . . . , m} : ti+j 6= pj}|. Then (1 +ε)-approximate

distance is defined as an arraySεsuch that, for everyi, (1−ε)·S[i]≤Sε[i]≤(1 +ε)·S[i].

Rounding and arithmetic operations. For any value x, we denote by x(i) ₌ _b_x/2i_{c ·}₂i

the value with i y bits rounded. However, with a little stretch of notation, we do not limit value of i to be positive. We denote by krkc the norm modulo c, that is krkc =

min(rmodc, c−(rmodc)).

1.2 Our results

In this paper we answer favorably both questions by providing relevant algorithms. First, we show how to extend the deterministic`1distances algorithm into`p distances, whenp≥1. I Theorem 1. For any p ≥ 1 there is a deterministic algorithm computing (1 +ε) ap-proximation to pattern matching under `p distances in time O(n_εlogmlogU) (assuming

ε≤1/p).

We then move to the case of `p distances whenp <1. We show that it is possible to

(4)

ITheorem 2. For0< p <1, there is a randomizedalgorithm computing (1 +ε) approx-imation to pattern matching under`p distances in timeO(p−1ε−1nlogmlog2Ulogn). The

algorithm is correct with high probability.4

Finally, combining with existing`0 algorithm from [16] we obtain as a corollary that for constant p≥0 approximation of pattern matching under`p distances can be computed in

e

O(n_ε) time.

2 Approximation of

`

p

distances

We start by showing how convolution finds its use in counting versions of pattern matching, either exact or approximation algorithms. Consider the case of pattern matching under `2 distances. Observe that we are looking for S such that S[i]2 = Pj−k=i(tj−pk)2 =

P jt 2 j+ P kp 2 k−2 P

j−k=itjpk. The last term is just a convolution of vectors in disguise

and is equivalent to computing convolution ofT and reverse orderedP. Such approach can be applied to solving exact pattern matching via convolution (observing that`2distance is 0 iff there is an exact match).

We follow with a technique for computing exact text-to-pattern distance, for arbitrary distance functions, introduced by [20], which is a generalization of a technique used in [8]. We provide a short proof for completeness.

ITheorem 3([20]). Text-to-pattern distance where strings are over arbitraryalphabetΣ can be computed exactly in timeO(|Σ| ·nlogm).

Proof. For every letterc ∈Σ, construct a new textTc _{by setting}_Tc_{[i] = 1 if} _t

i =c and

Tc_{[i] = 0 otherwise. A new pattern} _Pc _{is constructed by setting} _Pc_{[i] =} _{d(c, p}

i). Since

d(ti+j, pj) =Pc∈ΣT

c_[i₊_j]_·_Pc_{[j], it is enough to invoke}_|_Σ_|_{times convolution.}

J

Theorem 3 allows us to compute text-to-pattern distance exactly, but the time complexity

O(|Σ|nlogm) is prohibitive for large alphabets (when|Σ|=poly(n)). However, it is enough to reduce the size of alphabet used in the problem (at the cost of reduced precision) to reach desired time complexity. While this might be hard, we proceed as follows: we decompose our weight function into a sum of components, each of which is approximated by a corresponding function on a reduced alphabet.

We say that a functiond is effectively over smaller alphabet Σ0 if it is represented as d(x, y) = d0(ι1(x), ι2(y)) for some ι1, ι2 : Σ →Σ0 andd0. It follows from Theorem 3 that text-to-pattern under distanced can be computed in time Oe(|Σ0|n) (ignoring the cost of computingι1andι2).

Decomposition. LetD(x, y) =|x−y|p _{be a function corresponding to (`}

p)p distance, that

is `p(X, Y)p = PiD(xi, yi). Our goal is to decompose D(x, y) = Piαi(x, y) into small

(polylogarithmic) number of functions, such that eachαi(x, y) is approximated byβi(x, y)

that is effectively over alphabet ofO(1_ε) size (up to polylogarithmic factors). Now we can use Theorem 3 to compute contribution of eachβi. We then have that G(x, y) =Piβi(x, y)

approximatesF, and text-to-pattern distance underGcan be computed in the desiredOe(n_ε) time. We present such decomposition, useful immediately in case ofp≥1 and as we see in section 2.2 with a little bit of effort as well in case when 0< p≤1.

(5)

Useful estimations. We use following estimations in our proofs. Forp≥1

(1−ε)p≥1−pε, for 0≤ε≤1, (1) (1 +ε)p≥1 +pε, for 0≤ε, (2) (1−ε)p≤1−pε(1−1/e), for 0≤ε≤1/p, (3) ap−(a−b)p≤pap−1b, fora≥b≥0. (follows from 2) (4) For 0≤p≤1

(1−ε)p≤1−pε, for 0≤ε≤1, (5) (1−ε)p≥1−2pεln 2, for 0≤ε≤1/2, (6) (1 +ε)p≥1 +pεln 2, for 0≤ε≤1, (7) ap−(a−b)p≤2pap−1bln 2, fora≥2b≥0. (follows from 6) (8)

2.1 Algorithm for

p

≥

1

In this section we prove Theorem 1. We start by constructing a family of functionsFi, which

are better refinements ofF asidecreases.

First step. Let us denote

Fi(x, y) =

max(0,|x−y| −2i)

p

and fi=Fi−Fi+1.

Observe thatFu= 0 (for 0≤x, y≤U). Moreover, there is a telescoping sumFi = u

P

j=i

fj.

To better see the the telescopic sum, consider casep= 1. We then representF−u(x, y) =

Pu

i=−ufi(x, y) = (−2−u+ 2−u+1) + (−2−u+1+ 2−u+2) +. . .+ (−2t−1+ 2t) + (|x−y| −2t) +

0 +. . .+ 0. Such decomposition (forp= 1) was first considered, to our knowledge, in [20].

Second step. Instead of usingxandyfor evaluation ofFi, we evaluateFiusingxandywith

all bits younger thani-th one set to zero. Formally, definex(i)₌_b_x/2i_{c ·}₂i_,_y(i)₌_b_y/2i_{c ·}₂i_.

Now we denote

Gi(x, y) =Fi(x(i), y(i))

Similarly as forfi, definegi=Gi−Gi+1. Using the same reasoning, we haveGu= 0. For

integersi≤0 the functionsFi andGi are the same (as we are not rounding) and therefore

F−u=G−u= u

P

i=−u

gi. Intuitively,gi captures contribution ofi-th bit of input to the output

value (assuming all older bits are set and known, and all younger bits are unknown).

Third step. Letηbe a value to be fixed later, depending onεandp. Assume w.l.o.g. thatη is such that 1/ηis an integer. We now define_bgias a refinement ofgi, by replacing|x(i)−y(i)|

withkx(i)₋_y(i)_k

Bi and|x

(i+1)₋_y(i+1)_|_with_k_x(i+1)₋_y(i+1)_k

Bi, whereBi= 2

i_{/η, that is}

doing all the computationmodulo Bi. To be precise, define

− → Gi(x, y) = max(0,kx(i)−y(i)kBi−2 i₎p ←− Gi+1(x, y) = max(0,kx(i+1)−y(i+1)kBi−(2 i+1₎p and then_bgi= − → Gi− ←−

Gi+1. Additionally, we denote for shortGbi= u

P

j=ib

(6)

Intuitively,_bgi approximatesgi in the scenario of limited knowledge – it estimates

contri-bution ofi-th bit of input to the output, assuming knowledge of bitsi+ 1 toi+ logη−1 _of input. We are now ready to provide an approximation algorithm to (`p)p text-to-pattern

distances.

IAlgorithm 4. Input:

T is the text, P is the pattern,

η controls the precision of the approximation.

Steps:

1. For eachi∈ {−u, . . . , u} compute arraySi being the text-to-pattern distance between T

andP using_bgi distance function (parametrized byη) using Theorem 3.

2. Output arraySε[i] = u P j=−u Sj[i] !1/p .

To get the (1 +ε)approximation we run the Algorithm 4 with η= ₁₂₈ε .

Now, we need to show the running time and correctness of the result. Firstly, to prove the correctness, we divide summands_bgi into three groups and reason about them separately.

As computingF−u, G−u(by summing fi’s andgi’s respectively) yields (1 +ε) multiplicative

error, we will show that the difference between computinggiandbgibrings only an additional (1 +ε) multiplicative error.

ILemma 5. Forisuch that |x−y| ≤2i _both_g

i(x, y) = 0 and_bgi(x, y) = 0.

Proof. As bothgi,bgi are symmetric functions, we can w.l.o.g. assumex≥y. ∀j≥i:

x (j)₋_y(j) = 2 jjx 2j k −jy 2j k ≤2j _j_x 2j k − _x₋₂i 2j ≤2j.

Therefore Gj = 0 from which gi(x, y) = 0 follows. And because kx(j)−y(j)kBj ≤ |x(j)₋_y(j)_|_{we have}

b

gi(x, y) = 0 as well. J ILemma 6. Forisuch that |x−y|>2i_≥_4η_|_x₋_y_| _{we have}_g

i(x, y) =_bgi(x, y).

Proof. Forgi(x, y) =bgi(x, y) to hold, it is enough to show that bothnorms| · | andk · kBi

are the same forx(i)₋_y(i)_and_x(i+1)₋_y(i+1)_{. This happens if the absolute values of the} respective inputs are smaller thanBi/2. Let us bound both|x(i)−y(i)|and|x(i+1)−y(i+1)|:

max(|x(i)−y(i)|,|x(i+1)−y(i+1)|)≤ |x−y|+ 2i+1≤2i+1(1 + 1 8η). We can w.l.o.g. assumeη≤1/8 in order to make 1

8η a dominant term in the parentheses

and reach: max(|x(i)−y(i)|,|x(i+1)−y(i+1)|)≤2i+1(1 + 1 8η)≤ 2i 2η = Bi 2 . Thereforekx(i)−y(i)kBi =|x (i)₋_y(i)_|_{as well as}_k_x(i+1)₋_y(i+1)_k Bi =|x (i+1)₋_y(i+1)_| which completes the proof. _J

(7)

Proof. For the sake of the proof, we will w.l.o.g. assumeη≤1/8. DenoteA=|x(i)−y(i)|, B = |x(i+1)₋_y(i+1)_|_, _A0 _{= max(0, A}₋₂i_{) and} _B0 _{= max(0, B}₋₂i+1_{). Observe that}

|x−y|−2i _≤_A_{≤ |}_x₋_y_|₊₂i_thus_|_x₋_y_|−₂_·₂i_≤_A0 _{≤ |}_x₋_y_|_{, and similarly}_|_x₋_y_|−₂_·₂i+1_≤ B0≤ |x−y|so |A0−B0| ≤2·2i_{. Assume w.l.o.g. that}_A0 _≥_B0_{. We bound}

|gi(x, y)|= (A0)p−(A0−(A0−B0))p

≤p(A0−B0)(A0)p−1 (by (4))

≤2p2i· |x−y|p−1

J ILemma 8. Ifp≥1then forisuch that4η|x−y|>2iwe have|_bgi(x, y)| ≤2p2i· |x−y|p−1.

Proof. Follows by the same proof strategy as in proof of Lemma 7, replacing | · | with

k · kBi. J

ITheorem 9. Gb−u= P i≥−ub

gi approximates F−u up to an additive32·p·η· |x−y|p term.

Proof. We bound the difference between two terms:

|F−u(x, y)− u X i=−u b gi(x, y)| ≤ log₂(4η|x−y|) X i=−u (|_bgi(x, y)|+|gi(x, y)|) ≤2·   log₂(4η|x−y|) X i=−∞ 2i  ·2·p· |x−y|p−1 ≤32·η|x−y| ·p· |x−y|p−1

where the bound follows from Lemma 5, 6, 7 and 8. _J We now show that F−uis a close approximation ofD (recallD(x, y) =|x−y|p). ILemma 10. For integersx, y there isD(x, y)·(1−(2 ln 2)p/U)≤F−u(x, y)≤D(x, y).

Proof. For x = y the lemma trivially holds, so for the rest of the proof we will assume x 6= y. As x, y are integers only, their smallest non-zero distance is 1. As −u < 0 the

|x−y| −2−u_>_{0 and we bound}_|_x₋_y_{| ·}₍₁₋_1/U₎_≤_max(0,_|_x₋_y_{| −}₂−u₎_{≤ |}_x₋_y_|_{. By (1)}

(whenp≥1) or (6) (whenp≤1) the claim follows. _J By combining Theorem 9 with the Lemma 10 above we conclude that additive error of Algorithm 4 at each position is (32p·η+ _Up)· |x−y|p₌_{p(ε/4 + 1/U}₎_{· |}_x₋_y_|p_≤_pε_|_x₋_y_|p

(since w.l.o.g. ε≥4/U), thus the relative error is (1 +pε/2).

Observe that each_bgi is effectively a function over the alphabet of sizeBi/2i= 1/η. Thus,

the complexity of computing text-to-pattern distance using_bgidistance isO(η−1nlogm), and

iterating over at most 2usummands makes the total timeO(ε−1_n_log_m_log_U_).

Finally, sincep≥1 and w.l.o.g. ε≤1/p, by (2) and (3) (1 +pε/2) approximation of`p p

distances is enough to guarantee (1 +ε) approximation of`p distances.

2.2 Algorithm for

0 < p

≤

1

In this section we prove Theorem 2. We note that the algorithm presented in the previous section does not work, since in the proof of Lemma 7 and 8 we used the convexity of function

|t|p_{, which is no longer the case when}_{p <}_1.

However, we observe that Lemma 5 and 6 hold even when 0< p≤1. To combat the situation where adversarial input makes the estimates in Lemma 7 and 8 to grow too large, we use a very weak version of hashing. Specifically, we pick at random a linear function

(8)

σ(t) =r·t, wherer∈[1,9) is a random independent variable. Such function applied to the input makes its bit sequences appear more ”random” while preserving the inner structure of the problem.

Consider a following approach:

IAlgorithm 11.

1. Fix η=_{15555 log}ε·p_U_{ln 2}.

2. Pick r∈[1,9)uniformly at random. 3. ComputeT0 =r·T andP0=r·P.

4. Use Algorithm 4 to compute S0, text-to-pattern distance betweenT0 andP0 usingGb−u

distance function. 5. OutputS00=S0·r−1_.

Now we analyze the expected error made by estimation from Algorithm 11. We denote the expected additive error of estimation of (`p)p distances as

err(x, y)def=Er∈[1,9) h [ ₁ r p Gb−u(rx, ry)− |rx−ry| p i .

ITheorem 12. The procedure of Algorithm 11 has the expected additive error err(x, y)≤

εp

3 ln 2|x−y|

p_.

Proof. Assume thatx6=y, as otherwise the bound trivially follows. We bound the absolute error as follow, denotingk= log(8η|x−y|)).

err(x, y)≤Er∈[1,9) h1 r p Gb−u(rx, ry)−F−u(rx, ry) i +_Er∈[1,9) h |F−u(rx, ry)−D(rx, ry)| i ≤Er∈[1,9) h u X i=−u (_bgi(rx, ry)−gi(rx, ry)) i +Er∈[1,9) h1 r p 2(ln 2)p UD(rx, ry) i ((1/r)p≤1) ≤ k X i=−u Er∈[1,9) h |g_bi(rx, ry)| i +_Er∈[1,9) h k X i=−u gi(rx, ry) i + 2(ln 2)p U|x−y| p _{(Lemma 5, 6)}

Now, we bound the first two summands separately in following lemmas.

ILemma 13. |Pk

i=−ugi(rx, ry)|is upper bounded by 32(ln 2)η|x−y| p_.

Proof. Since w.l.o.g. η≤1/32 thus 2k+1≤1/2·r|x−y|): k X i=−u gi(rx, ry) ≤ k X i=−∞ gi(rx, ry) ≤ |Gk+1(rx, ry)−D(rx, ry)| ≤((r|x−y|)p−(r|x−y| −2k+1)p) ≤rp|x−y|p_·_{2p(ln 2)} 2k+1 r|x−y| (by (8)) ≤32(ln 2)η|x−y|p. (rp−1≤1) _J

(9)

I Lemma 14. For i ≤ k = log(8η|x−y|) we have Er∈[1,9) h |_bgi(rx, ry)| i ≤ (1152 + 192(ln 2)))η|x−y|p_.

(Due to the space constraints, the proof of this Lemma is deferred to the appendix.) By combining bounds from Lemma 13, and Lemma 14 we get:

err(x, y)≤ k X i=−u Er∈[1,9) h |_bgi(rx, ry)| i +_Er∈[1,9) h k X i=−u gi(rx, ry) i + 2(ln 2)p U|x−y| p ≤2(ln 2)p U|x−y| p_{+ (32(ln 2)η}_|_x₋_y_|p₊ u X i=−u (1152 + 192(ln 2)))η|x−y|p ≤2(ln 2)p U|x−y| p_{+ (32(ln 2)η}_|_x₋_y_|p_{+ 2 log}_U_{(1152 + 192(ln 2)))η}_|_x₋_y_|p ≤2(ln 2)p U|x−y| p_{+ (32(ln 2) + 2 log}_U_{(1152 + 192(ln 2))))η}_|_x₋_y_|p ≤2(ln 2)p U|x−y| p_{+ 2593 log}_{U η}_|_x₋_y_|p ≤ εp 6 ln 2|x−y| p₊ εp 6 ln 2|x−y| p _w.l.o.g. _ε_≥12(ln 2)2 U ≤ εp 3 ln 2|x−y| p J

To finish the proof of Theorem 2 we observe, that for any positioniof output, Algorithm 11 outputsS00[i] such that Er[|(S00[i])p−(S[i])p|]≤ _{3 ln 2}pε ·(S[i])p. By Markov’s inequality it

means that with probability 2/3 the relative error of (`p)p approximation is at most _{ln 2}p ·ε.

Thus, by (5) and (7) relative error of`papproximation isεwith probability at least 2/3. Now

a standard amplification procedure follows: invoke Algorithm 11 independentlyt times and take the median value fromS₍₁₎00 [i], . . . S₍00_t₎[i] as the final estimateSε[i]. Takingt= Θ(logn)

to be large enough makes the final estimate good with high probability, and by the union bound wholeSε is a good estimate of S. The complexity of the whole procedure is thus

O(logn·logU·η−1·nlogm) =O(p−1ε−1nlogmlog2Ulogn).

3 Hamming distances

As a final note we comment on a particularly simple form that Algorithm 11 takes for Hamming distances (limit case ofp= 0).

b gi(x, y) = ( 1 ifkx(i)₋_y(i)_k Bi= 1 0 otherwise,

with Algorithm being simply: pick at randomr∈[1,9], apply it multiplicatively to the input, compute text-to-pattern distance using P

ibgi function.

Taking a limit of p→0 in proof of Theorem 2, we reach that bound from Lemma 14 becomes Er∈[1,9) h |_bgi(rx, ry)| i ≤24η

and since all other terms in error estimate have multiplicative termpin front, we reach

err(x, y)≤2 logU·_Er∈[1,9) h

|_bgi(rx, ry)|

i

(10)

We thus observe that expected relative error in estimation of Hamming distance is: E[S00[i]− S[i]]≤48ηlogU·S[i]. With probability at least 2/3 the relative error is at most 144ηlogU. Settingη= ε

144 logU and repeating the randomized procedure Θ(logn) with taking median for

concentration completes the algorithm. The total runtime is, by a standard trick of reducing alphabet size to 2m,O(n_εlog2mlogn), and while it compares unfavorably to algorithm from [16] (in terms of runtime), it gives another insight on whyOe(n/ε) time algorithm is possible for Hamming distance version of pattern matching.

References

1 Karl R. Abrahamson. Generalized String Matching. SIAM J. Comput., 16(6):1039–1051, 1987. 2 Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with

k mismatches. J. Algorithms, 50(2):257–275, 2004. doi:10.1016/S0196-6774(03)00097-X. 3 Amihood Amir, Ohad Lipsky, Ely Porat, and Julia Umanski. Approximate Matching in the

L1 Metric. InCPM, pages 91–103, 2005. doi:10.1007/11496656_9.

4 Amit Chakrabarti and Oded Regev. An Optimal Lower Bound on the Communication Complexity of Gap-Hamming-Distance. SIAM J. Comput., 41(5):1299–1317, 2012. doi: 10.1137/120861072.

5 Peter Clifford, Raphaël Clifford, and Costas S. Iliopoulos. Faster Algorithms forδ,γ-Matching and Related Problems. InCPM, pages 68–78, 2005.doi:10.1007/11496656_7.

6 Raphaël Clifford. Matrix multiplication and pattern matching under Hamming norm. http://www.cs.bris.ac.uk/Research/Algorithms/events/BAD09/BAD09/Talks/ BAD09-Hammingnotes.pdf, 2009. Retrieved March 2017.

7 Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana A. Starikovskaya. Thek-mismatch problem revisited. InSODA, pages 2039–2052. SIAM, 2016.

8 M. J. Fischer and M. S. Paterson. STRING-MATCHING AND OTHER PRODUCTS. Technical report, Massachusetts Institute of Technology, 1974.

9 Sumit Ganguly. Taylor Polynomial Estimator for Estimating Frequency Moments. InICALP, pages 542–553, 2015. doi:10.1007/978-3-662-47672-7_44.

10 Paweł Gawrychowski and Przemysław Uznański. Towards Unified Approximate Pattern Matching for Hamming andL1 Distance. InICALP, pages 62:1–62:13, 2018. doi:10.4230/ LIPIcs.ICALP.2018.62.

11 Daniel Graf, Karim Labib, and Przemysław Uznański. Brief Announcement: Hamming Distance Completeness and Sparse Matrix Multiplication. InICALP, pages 109:1–109:4, 2018. doi:10.4230/LIPIcs.ICALP.2018.109.

12 T. S. Jayram, Ravi Kumar, and D. Sivakumar. The One-Way Communication Complexity of Hamming Distance. Theory of Computing, 4(1):129–135, 2008. doi:10.4086/toc.2008. v004a006.

13 Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the Exact Space Complexity of Sketching and Streaming Small Norms. InSODA, pages 1161–1178, 2010. doi:10.1137/1. 9781611973075.93.

14 Howard J. Karloff. Fast Algorithms for Approximately Counting Mismatches. Inf. Process. Lett., 48(2):53–60, 1993. doi:10.1016/0020-0190(93)90177-B.

15 Tsvi Kopelowitz and Ely Porat. Breaking the Variance: Approximating the Hamming Distance in 1/Time Per Alignment. InFOCS, pages 601–613, 2015. doi:10.1109/FOCS.2015.43. 16 Tsvi Kopelowitz and Ely Porat. A Simple Algorithm for Approximating the Text-To-Pattern

Hamming Distance. InSOSA, pages 10:1–10:5, 2018. doi:10.4230/OASIcs.SOSA.2018.10. 17 Yi Li and David P. Woodruff. A Tight Lower Bound for High Frequency Moment

Es-timation with Small Error. In APPROX-RANDOM, pages 623–638, 2013. doi:10.1007/ 978-3-642-40328-6_43.

18 Ohad Lipsky. Eficient Distance Computations. Master’s thesis, Bar-Ilan University. Department of Mathematics and Computer Science., 2003.

(11)

19 Ohad Lipsky and Ely Porat. L1pattern matching lower bound.Inf. Process. Lett., 105(4):141– 143, 2008.doi:10.1016/j.ipl.2007.08.011.

20 Ohad Lipsky and Ely Porat. Approximate Pattern Matching with theL1,L2andL∞Metrics.

Algorithmica, 60(2):335–348, 2011. doi:10.1007/s00453-009-9345-9.

21 Marcin Mucha, Karol Węgrzycki, and Michał Włodarczyk. A Subquadratic Approximation Scheme for Partition. InSODA, pages 70–88, 2019. doi:10.1137/1.9781611975482.5. 22 John Nolan. Stable distributions: models for heavy-tailed data. Birkhauser New York, 2003. 23 Ely Porat and Klim Efremenko. Approximating general metric distances between a pattern

and a text. In SODA, pages 419–427, 2008. URL:http://dl.acm.org/citation.cfm?id= 1347082.1347128.

24 Jan Studený and Przemysław Uznański. Approximating Approximate Pattern Matching.

CoRR, abs/1810.01676, 2018. arXiv:1810.01676.

25 David P. Woodruff. Optimal space lower bounds for all frequency moments. InSODA, pages 167–175, 2004. URL:http://dl.acm.org/citation.cfm?id=982792.982817.

A

Omitted proofs

I Lemma 14. For i ≤ k = log(8η|x−y|) we have Er∈[1,9) h

|_bgi(rx, ry)|

i

≤ (1152 + 192(ln 2)))η|x−y|p_.

Proof. First, we define symbolsA, B, A0, B0 to be parts of theg_bi.

A=k(rx)(i)−(ry)(i)kBi

B=k(rx)(i+1)−(ry)(i+1)kBi

A0= max(0, A−2i) B0= max(0, B−2i+1)

Repeating reasoning from proof of Lemma 7, we get

|A0−B0| ≤2·2i (9) krx−rykBi−2·2 i_≤_A0 _{≤ k}_rx₋_ry_k Bi (10) krx−rykBi−2·2 i+1_≤_B0_{≤ k}_rx₋_ry_k Bi (11)

We also bound Bi = 2i/η ≤2k/η= 8|x−y|. Now let’s bound the|g_bi(rx, ry)|. A simple

bound that comes from the definition of_bgi gives us:

|_bgi(rx, ry)|=|A0p−B0p| ≤max(A0p, B0p)≤ krx−ryk p

Bi. (Use of 10,11) (12)

Unfortunately, this bound is not tight enough for larger values of krx−rykBi, so for krx−rykBi ≥6·2

i_{, we prove stronger bound:}

|_bgi(rx, ry)|=|(A0)p−(B0)p| = max(A0, B0)p−min(A0, B0)p = max(A0, B0)p−(max(A0, B0)− |A0−B0|)p = max(A0, B0)p 1− 1− |A 0₋_B0_| max(A0_{, B}0₎ p ≤ krx−rykp_B i·(1−(1− 2·2i krx−rykBi−2·2 i) p₎ ≤ krx−rykp_B i·(1−(1− 3·2i krx−rykBi )p) ≤6p(ln 2)krx−rykp_B−_i1·2i (krx−rykBi≥6·2 i , by (6))

(12)

Thenorm functionkxkBi= min(xmodBi, Bi−(xmodBi)) is in fact a triangle wave

function varying between 0 andBi/2 with periodicity of Bi. So if the input is a random

variable that follows uniform distribution at interval that is larger than its period (in our caseBi), the output has piece-wise uniform distribution, and its probability density function

can be bounded by two times the probability density function of the uniform distribution for the whole domain. Formally, ifX =U(a, b) withb−a≥Bi then for Y =kXkBi its

probability density functionfY(y) is:

fY(y)≤

2 Bi/2

for 0≤y≤Bi/2 (13)

As the input in the expressionkrx−rykBi to thenorm function is uniformly distributed

betweena=|x−y|andb= 9|x−y|andBi≤8|x−y|, we can use 13 to bound the probability

density function of theZ =krx−rykBi byfZ(y)≤

2

Bi/2.

Now when we have the approximate probability density function (namely its upper bound) we can condition on the value ofkrx−rykBi to be able to use the bounds for small and

We bound those two summands separately. Now, bound on the first part:

Er∈[1,9) h |_bgi(rx, ry)| krx−rykBi ≤6ηBi i Pr r∈[1,9) h krx−rykBi ≤6ηBi i ≤ ≤Er∈[1,9) h |g_bi(rx, ry)| krx−rykBi≤6ηBi i 24η ≤Er∈[1,9) h krx−rykp_B_i krx−rykBi≤6ηBi i 24η (by 12) ≤(6ηBi)p24η ≤24η(6·2i)p ≤24·6η(8η|x−y|)p ≤1152η|x−y|p

And on the second part:

Er∈[1,9) h |_bgi(rx, ry)| krx−rykBi>6ηBi i Pr r∈[1,9) h krx−rykBi >6ηBi i ≤ ≤Er∈[1,9) h 6p(ln 2)krx−rykp_B−1 i ·2 i krx−rykBi>6ηBi i · · Pr r∈[1,9) h krx−rykBi>6ηBi i

(13)

. . .≤ Z 9 1 6p(ln 2)krx−rykp_B−1 i ·2 i1 8 ·1[krx−rykBi >6ηBi]dr (1/8 is the density of r.v. r, 1[·] is the indicator function) ≤6p(ln 2)·2i Z Bi/2 0 zp−1 2 Bi/2 1[z >6ηBi]dz (changed to r.v. z=krx−rykBi) ≤p(ln 2)24 Bi ·2i Z Bi/2 0 zp−1dz ≤(ln 2)24 Bi ·2i _B i 2 p ≤24(ln 2)B_ip−1·2i ≤24(ln 2)2ipη−p+1 ≤24(ln 2)(8η|x−y|)pη−p+1 ≤192(ln 2)η|x−y|p So finally, we reach: Er∈[1,9) h |_bgi(rx, ry)| i = =_Er∈[1,9) h |_bgi(rx, ry)| krx−rykBi ≤6ηBi i Pr r∈[1,9) h krx−rykBi≤6ηBi i + +Er∈[1,9) h |_bgi(rx, ry)| krx−rykBi >6ηBi i Pr r∈[1,9) h krx−rykBi >6ηBi i ≤1152η|x−y|p_{+ 192(ln 2)η}_|_x₋_y_|p ≤(1152 + 192(ln 2)))η|x−y|p J