It is natural to ask whether the $F_k$ scheme of Theorem 3.5.3 generalizes to more complicated functions. We demonstrate that this is indeed the case by presenting non-trivial algorithms for the class of all frequency-based functions. A frequency-based function is any function $G$ on frequency vectors $f = (f_1, \dots, f_n)$ of the form $G(f) = \sum_{j \in [n]} g(f_j)$ for some $g : \mathbb{Z}_+ \to \mathbb{Z}_+$.

Figure 3.2: Example to illustrate Theorem 3.7.1 (figure labels: G, S, H, R, f).

Frequency-based functions have a number of important special cases, including frequency moments, $F_0$ (the number of distinct items in the stream), and point and range queries on the frequency distribution, and can also be used to compute $F_\infty$, the highest frequency in the frequency vector. These functions occupy an important place in the streaming world: in their seminal paper [6], Alon, Matias, and Szegedy asked for a precise characterization of which frequency-based functions can be approximated efficiently in the standard streaming model. Braverman and Ostrovsky [22] gave a zero-one law for approximating monotonically increasing functions of frequencies that are zero at the origin. This can be contrasted with our result that, in the annotation model, all frequency-based functions have non-trivial exact schemes.

Theorem 3.7.1. Assume $g(x) \le n^c$ for some constant $c$, so that each value in the range of $g$ and $G$ can be represented using $O(\log n)$ bits. Suppose $N = O(n)$. Let $G(f) = \sum_{j \in [n]} g(f_j)$ be any frequency-based function. Then $G$ has a prescient $(n^{2/3}\log n, n^{2/3}\log n)$-scheme and an online $(n^{2/3}\log^{4/3} n, n^{2/3}\log^{4/3} n)$-scheme, both in the non-strict turnstile update model.

Proof. We first describe the prescient scheme. It is natural to attempt to directly apply the scheme of Theorem 3.5.2 (with $\ell = 1$) to the given function $g$. However, this does not yield a useful result. The problem with this approach is that, while the function $g$ within the definition of $G$ may be viewed through polynomial interpolation as a polynomial $\tilde g$ over the integers or the relevant finite field, the degree of $\tilde g$ may be as large as $2N$, since we need $\tilde g(x) = g(x)$ for all possible frequencies $x \in \{-N, \dots, N\}$. If $N = \Omega(n)$, it would be more efficient for the helper to just repeat the stream in sorted order.

The solution is to reduce the degree of $\tilde g$ by removing the heavy hitters from the frequency vector with the aid of the prover. That is, we run the prescient heavy hitters scheme from Theorem 3.6.1 to determine $H := \sum_{j \in S} g(f_j) - |S|\, g(0)$, where $S := \{j : f_j \ge n^\beta\}$ and $\beta < 1$ is a parameter we will fix later. Note that this requires communication $O((N/n^\beta)\log n) = O(n^{1-\beta}\log n)$ since $N = O(n)$ by assumption. Intuitively, $H$ represents the contribution of the heavy hitters to the frequency-based function, and the verifier then “removes” these items from the stream by setting $f_j = 0$ for all $j \in S$. This ensures that the removed items do not contribute to the sum $R = \sum_{j \in [n]} g(f_j)$ beyond a term $g(0)$ each, which the $-|S|\, g(0)$ term in $H$ cancels. The verifier and prover then run the scheme of Theorem 3.5.2 on the modified frequency vector, and the final result is given by $H + R$. From now on, let $f$ denote this modified vector.

Figure 3.2 gives an illustration of the central idea: the frequency distribution is conceptually split into two pieces, the set of heavy hitters $S$ and the residual distribution $f$. The contributions of each piece are calculated as $H$ and $R$ respectively, and summed to obtain the answer $G$.
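The split into a heavy part and a residual part can be checked on a toy example. The following is a minimal sketch, not the protocol itself: it computes $H$ and $R$ in the clear, with no prover or verifier, and the function names, the example updates, and the threshold (standing in for $n^\beta$) are illustrative assumptions.

```python
def frequency_vector(updates, n):
    """Build f from turnstile updates (j, delta) over the universe {0, ..., n-1}."""
    f = [0] * n
    for j, delta in updates:
        f[j] += delta
    return f


def split_contributions(f, g, threshold):
    """Return (H, R) for the heavy/residual split of G(f) = sum_j g(f_j).

    `threshold` plays the role of n^beta; it is a free parameter in this sketch.
    """
    heavy = [j for j, fj in enumerate(f) if fj >= threshold]       # the set S
    H = sum(g(f[j]) for j in heavy) - len(heavy) * g(0)            # heavy contribution
    residual = [0 if fj >= threshold else fj for fj in f]          # f with S zeroed out
    R = sum(g(fj) for fj in residual)                              # residual contribution
    return H, R


# Hypothetical example data: six-item universe, one deletion.
updates = [(0, 5), (1, 1), (2, 1), (0, 4), (3, 2), (1, -1)]
f = frequency_vector(updates, n=6)
g = lambda x: x * x + 1            # any g works; this one has g(0) != 0 on purpose
G = sum(g(x) for x in f)
H, R = split_contributions(f, g, threshold=3)
```

Choosing a $g$ with $g(0) \ne 0$ exercises the $-|S|\, g(0)$ correction: without it, $H + R$ would overcount by one $g(0)$ per removed item.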

When running the scheme of Theorem 3.5.2, we exploit the fact that each entry of $f$ lies in $\{0, 1, \dots, n^\beta\}$. This lets us use a degree-$n^\beta$ polynomial $\tilde g$ within the scheme of Theorem 3.5.2. For any $c_a, c_v$ such that $c_a \cdot c_v \ge n$, Theorem 3.5.2 yields an online $(n^\beta c_a \log n, c_v \log n)$-scheme for computing $\sum_{i \in [n]} \tilde g(f_i)$.
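To make the degree bound concrete, the sketch below evaluates the Lagrange interpolant of $g$ on the points $\{0, \dots, b\}$ over a prime field; the interpolant has degree at most $b$ and agrees with $g$ on every residual frequency. The function name `interpolate_eval`, the prime 101, and the sample values are assumptions made for illustration, and the modular inverse via `pow(den, -1, p)` needs Python 3.8 or later.

```python
def interpolate_eval(values, x, p):
    """Evaluate at x the unique polynomial of degree <= b over GF(p) that
    takes the value values[i] at the point i, for i = 0, ..., b.

    Requires p prime and p > b so that all denominators are invertible.
    """
    b = len(values) - 1
    total = 0
    for i, yi in enumerate(values):
        num, den = 1, 1
        for k in range(b + 1):
            if k != i:
                num = num * (x - k) % p      # numerator of the i-th Lagrange basis
                den = den * (i - k) % p      # denominator of the i-th Lagrange basis
        total = (total + yi * num * pow(den, -1, p)) % p
    return total


p = 101                               # illustrative prime, larger than b
g_values = [0, 1, 1, 1, 1]            # g(0) = 0, g(x) = 1 for x in 1..4, so b = 4
agrees = all(interpolate_eval(g_values, x, p) == g_values[x] for x in range(5))
```

Outside $\{0, \dots, b\}$ the interpolant carries no guarantee, which is why the heavy hitters must be removed before the polynomial is applied.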

It remains to show that we can set the parameters $c_a$, $c_v$, and $\beta$ of the above protocol to achieve hcost = vcost = $O(n^{2/3}\log n)$. The help cost is $O(n^{1-\beta}\log n)$ bits for the heavy hitters scheme plus $O(c_a n^\beta \log n)$ bits for the scheme of Theorem 3.5.2. The respective verification costs are $O(n^{1-\beta}\log n)$ and $O(c_v \log n)$. Setting $\beta = 1/3$, $c_a = n^{1/3}$, and $c_v = n^{2/3}$ achieves the desired costs.
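Substituting these parameter values into the cost expressions stated above gives the following arithmetic, written out as a check rather than as new material:

```latex
\begin{align*}
\text{hcost} &= O\bigl(n^{1-\beta}\log n + c_a n^{\beta}\log n\bigr)
              = O\bigl(n^{2/3}\log n + n^{1/3}\cdot n^{1/3}\log n\bigr)
              = O\bigl(n^{2/3}\log n\bigr),\\
\text{vcost} &= O\bigl(n^{1-\beta}\log n + c_v\log n\bigr)
              = O\bigl(n^{2/3}\log n + n^{2/3}\log n\bigr)
              = O\bigl(n^{2/3}\log n\bigr),\\
c_a \cdot c_v &= n^{1/3}\cdot n^{2/3} = n .
\end{align*}
```

The last line confirms that the constraint $c_a \cdot c_v \ge n$ required by Theorem 3.5.2 holds with equality.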

In order to achieve an online $(n^{2/3}\log^{4/3} n, n^{2/3}\log^{4/3} n)$-scheme for $G$, observe that the only place where the above scheme used prescience was to identify heavy hitters. So we simply substitute the online heavy hitters scheme of Theorem 3.6.1, with parameter $\alpha \in [0, 1]$, in place of the prescient version. In this case, the help cost is $O(n^{1-\beta}\log^2 n + n^\alpha \log n)$ bits for the heavy hitters scheme and $O(c_a n^\beta \log n)$ bits for the scheme of Theorem 3.5.2. The respective verification costs are $O(n^{1-\alpha}\log n)$ and $O(c_v \log n)$. Balancing these costs by setting $n^\beta = n^{1/3}\log^{2/3} n$, $n^\alpha = n^{2/3}$, $c_a = n^{1/3}/\log^{1/3} n$, and $c_v = n^{2/3}\log^{1/3} n$ gives the desired overall costs.
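The balancing in the online case is less immediate because of the logarithmic factors. Plugging the stated parameter values into each cost term, as a check of the arithmetic only, gives:

```latex
\begin{align*}
n^{1-\beta}\log^{2} n &= \frac{n}{n^{1/3}\log^{2/3} n}\,\log^{2} n = n^{2/3}\log^{4/3} n, \\
n^{\alpha}\log n &= n^{2/3}\log n, \\
c_a\, n^{\beta}\log n &= \frac{n^{1/3}}{\log^{1/3} n}\cdot n^{1/3}\log^{2/3} n \cdot \log n = n^{2/3}\log^{4/3} n, \\
n^{1-\alpha}\log n &= n^{1/3}\log n, \\
c_v \log n &= n^{2/3}\log^{1/3} n \cdot \log n = n^{2/3}\log^{4/3} n, \\
c_a \cdot c_v &= \frac{n^{1/3}}{\log^{1/3} n}\cdot n^{2/3}\log^{1/3} n = n .
\end{align*}
```

Every term is therefore $O(n^{2/3}\log^{4/3} n)$, and the constraint $c_a \cdot c_v \ge n$ is met with equality.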

Applications. Theorem 3.7.1 provides annotation schemes for the problems described below.

• We can compute $F_0$, the number of items with non-zero count. This follows by observing that computing $F_0$ is equivalent to computing $\sum_{i \in [n]} g(f_i)$ for the function $g$ given by $g(0) = 0$ and $g(x) = 1$ for $x > 0$. This yields a prescient $(n^{2/3}\log n, n^{2/3}\log n)$-scheme for $F_0$, and an online $(n^{2/3}\log^{4/3} n, n^{2/3}\log^{4/3} n)$-scheme.

• More generally, we can compute functions on the inverse distribution, i.e., queries of the form “How many items occur exactly $k$ times in the stream?” We do this by setting $g(k) = 1$ and $g(x) = 0$ for $x \ne k$; here we think of $k$ as being fixed. In the case of $k = 1$, this function is known as rarity [45]. One can build on this to compute, e.g., the number of items that occurred between $k$ and $k'$ times, the median of this distribution, etc.

• We obtain a protocol for $F_\infty = \max_{j \in [n]} f_j$, with a little more work. The helper first claims a lower bound $\ell$ on $F_\infty$ by providing the index of an item with frequency $F_\infty$, which the verifier checks by running the generalized index protocol from Section 3.3 (see Remark 2 after Theorem 3.3.2). Then the verifier runs the above protocol with $g(x) = 0$ for $x \le \ell$ and $g(x) = 1$ for $x > \ell$; if $\sum_{j \in [n]} g(f_j) = 0$, then the verifier is convinced that no item has frequency higher than $\ell$, and concludes that $F_\infty = \ell$. We therefore achieve a prescient $(n^{2/3}\log n, n^{2/3}\log n)$-scheme and an online $(n^{2/3}\log^{4/3} n, n^{2/3}\log^{4/3} n)$-scheme for $F_\infty$.
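The three applications above differ only in the choice of $g$. The sketch below writes each choice out and evaluates $G$ directly on a small frequency vector; it illustrates the functions being summed, not the annotation scheme that verifies the sum, and all names and the sample vector are assumptions.

```python
def G(f, g):
    """Frequency-based function: sum of g over the entries of f."""
    return sum(g(x) for x in f)


def g_distinct(x):
    """F_0: g(0) = 0 and g(x) = 1 for x > 0."""
    return 1 if x > 0 else 0


def g_exactly(k):
    """Inverse distribution: indicator that an item occurs exactly k times."""
    return lambda x: 1 if x == k else 0


def g_above(ell):
    """F_infinity check: indicator that a frequency exceeds the claimed bound ell."""
    return lambda x: 1 if x > ell else 0


f = [3, 0, 1, 1, 0, 2]                 # hypothetical frequency vector
distinct = G(f, g_distinct)            # number of items with non-zero count
rarity = G(f, g_exactly(1))            # number of items occurring exactly once
accept_3 = G(f, g_above(3)) == 0       # claim ell = 3: nothing exceeds it
accept_2 = G(f, g_above(2)) == 0       # claim ell = 2: one item exceeds it
```

In the $F_\infty$ case the sum only certifies the upper bound; the matching lower bound comes from the index query on the item the helper names.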

3.7.1 Frequency-Based Functions for Skewed Streams

In practice, the frequency distributions of data streams are often skewed, in the sense that a small number of frequent items make up a large portion of the stream. We observe that, if the stream is sufficiently skewed, so that there are few heavy hitters, we can achieve more efficient schemes for frequency-based functions. To see this, notice that in the scheme of Theorem 3.7.1, the verifier, after learning the heavy hitters from the helper, only needs to know an approximate upper bound on $F_\infty(A')$, where $A'$ is the stream obtained from the input stream $A$ by deleting all the heavy hitters. That is, the helper only needs to convince the verifier that he has presented “enough” of the true heavy hitters (and their exact frequencies) so that $F_\infty(A') \le b$ for some upper bound $b = \Theta(n^\beta)$; then we may define $\tilde g$ to agree with $g$ on $[b]$, so that the degree of $\tilde g$ remains $O(n^\beta)$.

Observe that if there are not many heavy items, the helper can send a list $L$ of heavy hitters and their frequencies (proving the frequencies are truthful as in Theorem 3.6.1) and then append a proof of an approximate upper bound (within a factor $1 + \varepsilon$), as per Section 3.5.1, on the quantity $F_\infty(A')$.

It suffices to let $\varepsilon$ be any positive constant in order to achieve $b = O(n^\beta)$. When there are fewer than $\ell$ items with frequency greater than $n^\beta$, the index queries, if they are online, require annotation $O(\ell \log n + c_a \log n)$ and space $O(c_v \log n)$ for the verifier, while the approximate $F_\infty$ scheme requires annotation $O(c_a \log^3 n)$ and space $O(c_v \log^2 n)$. Therefore, we will obtain an $(\ell \log n + c_a \log^3 n, c_v \log^2 n)$-scheme for identifying the set of heavy hitters and an upper bound $b$ on $F_\infty(A')$.

For concreteness, we will analyze the costs of our improved scheme under the assumption that the frequencies of items in the stream follow a Zipfian distribution, a power law distribution that accurately approximates many real-world data sets. Under the Zipfian distribution, the $i$th largest frequency is (at most) $N i^{-z}$ for parameter $z$. Setting this equal to $n^\beta$ and rearranging, we obtain that there are at most $(N/n^\beta)^{1/z}$ heavy hitters to identify.
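The count of heavy hitters under the Zipfian bound can be checked numerically. This is a minimal sketch under the assumption that the $i$th largest frequency equals the bound $N i^{-z}$ exactly; the function name and the sample parameters are illustrative, and the comparison is rearranged to integer arithmetic to avoid floating-point error when $z$ is an integer.

```python
def zipf_heavy_count(N, z, threshold, n):
    """Count ranks i in 1..n whose Zipfian frequency N * i**(-z) is >= threshold.

    The test N * i**(-z) >= threshold is written as N >= threshold * i**z,
    which is exact for integer N, z, and threshold.
    """
    return sum(1 for i in range(1, n + 1) if N >= threshold * i ** z)


# Hypothetical parameters: N = n = 10^4, z = 2, threshold n^beta = 100 (beta = 1/2).
N = n = 10_000
z = 2
threshold = 100
count = zipf_heavy_count(N, z, threshold, n)
bound = (N / threshold) ** (1 / z)     # the (N / n^beta)^(1/z) bound from the text
```

With these values the count meets the bound exactly, since rank $i$ is heavy precisely when $i^2 \le 100$.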

Therefore, if $N = \Theta(n)$, we can reduce the cost of the heavy hitters sub-protocol within the scheme of Theorem 3.7.1 to $(n^{(1-\beta)/z}\log n + c_a\,\mathrm{polylog}\, n,\; c_v\,\mathrm{polylog}\, n)$. Adding in the annotation cost of sending the polynomial $\tilde g \circ \tilde f$, and the space cost to the verifier, the entire scheme therefore requires $\tilde O(n^{(1-\beta)/z} + c_a n^\beta)$ annotation and $\tilde O(c_v)$ space, where the $\tilde O$ notation hides factors polylogarithmic in $n$. Assume $z \le 2$. Balancing exponents by setting $\beta = (2 - z)/(2 + z)$, $c_a = n^{z/(2+z)}$, and $c_v = n/c_a$, we obtain an $(n^{2/(2+z)}\,\mathrm{polylog}\, n,\; n^{2/(2+z)}\,\mathrm{polylog}\, n)$-scheme.

This strictly improves on Theorem 3.7.1 as long as $z > 1$. For example, if $z = 2$, we obtain an online $(n^{1/2}\,\mathrm{polylog}\, n,\; n^{1/2}\,\mathrm{polylog}\, n)$-scheme, which essentially matches the cost of our online scheme for $F_2$ from Theorem 3.5.3.
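The exponent balancing for skewed streams can be verified with exact rational arithmetic. The sketch below tracks only the exponents of $n$ and ignores polylogarithmic factors; the function name and the returned dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction


def skewed_exponents(z):
    """Exponents of n (polylog factors ignored) for the skewed-stream scheme.

    Uses beta = (2 - z)/(2 + z), c_a = n^(z/(2+z)), c_v = n / c_a,
    and is meant for Zipf parameters in the range 0 < z <= 2.
    """
    z = Fraction(z)
    beta = (2 - z) / (2 + z)
    ca = z / (2 + z)                   # exponent of c_a
    cv = 1 - ca                        # exponent of c_v = n / c_a
    heavy = (1 - beta) / z             # exponent of the heavy hitters annotation
    poly = ca + beta                   # exponent of the polynomial annotation
    return {"beta": beta, "ca": ca, "cv": cv,
            "help": max(heavy, poly), "space": cv}


at_one = skewed_exponents(1)               # boundary case z = 1
at_two = skewed_exponents(2)               # the z = 2 example from the text
mid = skewed_exponents(Fraction(3, 2))     # an intermediate value
```

At $z = 1$ both exponents equal $2/3$, matching Theorem 3.7.1 up to logarithmic factors, and at $z = 2$ both equal $1/2$; in each case the help and space exponents coincide with $2/(2+z)$.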