Our second result in this chapter is that expressions with streaming composition at the top level can be efficiently evaluated. While the evaluation of expressions in QRE$_{\gg}$ is certainly decidable, we do not provide any efficiency guarantees for expression evaluation when the streaming composition operator is arbitrarily nested (rather than just being at the top level).
Theorem 6.13. If $e_1 : T_0^* \rightharpoonup T_1$, $e_2 : T_1^* \rightharpoonup T_2$, ..., $e_k : T_{k-1}^* \rightharpoonup T_k$ are consistent, well-typed, and single-use QREs such that $e = e_1 \gg e_2 \gg \cdots \gg e_k : T_0^* \rightharpoonup T_k$ is well-defined, and $w \in T_0^*$ is an input stream, then $\llbracket e \rrbracket(w)$ can be computed in $O(|e|^2 |w|)$ time and with $O(|e|^2 \, \mathrm{st}(e, w))$ memory, in one left-to-right pass over $w$.
Proof. The construction is exactly that shown in figure 2.12. Evaluators $M_1, M_2, \ldots, M_k$ are constructed, one for each expression. All evaluators are fed the start signal $(\mathrm{start}, \mathrm{nil})$, where $\mathrm{nil}$ is the empty token. Each symbol $w_i$ of $w$ is fed to the initial evaluator $M_1$ in order. Whenever evaluator $M_j$ produces a result $v$, it is fed to the subsequent evaluator $M_{j+1}$. The final result is that produced by $M_k$ after all of $w$ is consumed by $M_1$. Correctness follows directly from the definition of streaming composition, and because the evaluators $M_1, M_2, \ldots, M_k$ correctly compute their respective functions.
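The chaining described in the proof can be illustrated with a minimal Python sketch. The classes `FoldEvaluator` and `FilterEvaluator` and the function `run_pipeline` are hypothetical stand-ins for the evaluators $M_j$, not the construction of figure 2.12: each toy evaluator consumes one token and either emits a value or emits nothing (modelled as `None`), and the start signal is omitted for brevity.

```python
from typing import Any, Callable, Iterable, List, Optional


class FoldEvaluator:
    """Toy evaluator (assumption): emits the running fold after every input."""

    def __init__(self, op: Callable[[Any, Any], Any], init: Any) -> None:
        self.op = op
        self.acc = init

    def push(self, x: Any) -> Optional[Any]:
        self.acc = self.op(self.acc, x)
        return self.acc


class FilterEvaluator:
    """Toy evaluator (assumption): emits x only when pred(x) holds."""

    def __init__(self, pred: Callable[[Any], bool]) -> None:
        self.pred = pred

    def push(self, x: Any) -> Optional[Any]:
        return x if self.pred(x) else None


def run_pipeline(evaluators: List[Any], stream: Iterable[Any]) -> Optional[Any]:
    """Feed each symbol to the first evaluator; forward every emitted value
    to the next evaluator. Returns what the last evaluator emitted on the
    most recent token it received (None if it emitted nothing)."""
    result = None
    last = len(evaluators) - 1
    for symbol in stream:
        value = symbol
        for j, m in enumerate(evaluators):
            value = m.push(value)
            if j == last:
                result = value
            if value is None:
                break  # nothing to forward to the later evaluators
    return result


# Keep the even symbols, then sum them: 4 + 2 + 6.
stages = [FilterEvaluator(lambda v: v % 2 == 0),
          FoldEvaluator(lambda a, b: a + b, 0)]
print(run_pipeline(stages, [3, 1, 4, 1, 5, 9, 2, 6]))  # 12
```

Each input symbol triggers at most one `push` per stage, which is the single left-to-right pass that the theorem refers to.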
Quantitative Approximate Terms
In the previous chapter, we presented a streaming algorithm to evaluate consistent, single-use QREs: this algorithm processes each input symbol in $O(|e|^2)$ time, and requires $O(|e| \, \mathrm{st}(e, w))$ memory. The function $\mathrm{st}(e, w)$ is the size of the largest intermediate term produced while evaluating $\llbracket e \rrbracket(w)$: in the worst case, this can grow as $O(|e||w|)$. This chapter is concerned with the efficient representation of terms: for numerical domains with the usual arithmetic operations ($+$, $\min$, $\max$, $\mathrm{avg}$), terms can be compressed so that $\mathrm{st}(e, w) = O(|e|)$, independent of the length of the input stream.
It is well-known that exactly computing the median of $n$ numbers in one pass requires $\Omega(n)$ space [78], and therefore approximation algorithms for the streaming selection problem have been extensively studied [79]. In our setting, for some QREs which involve computing the median, we can maintain multiset summaries as histograms, so that $\mathrm{st}(e, w) = \mathrm{poly}(|e| \log(|w|) \log(U) / \log(L)) / \epsilon$, for a user-specified multiplicative error tolerance $\epsilon$. Here $U$ and $L$ are the highest and lowest values ever generated by applying an elementary transformation to an element of the input stream.
In this chapter, we will explore two simple ideas: (a) the use of term rewriting (such as $\min(p + 3, 4) + 2 = \min(p + 5, 6)$) to compress copyless terms $t$ into representations of size $O(|\mathrm{Param}(t)|)$, and (b) the representation of multisets $S$ of positive numbers as approximate histograms $h : \mathbb{Z} \to \mathbb{N}$, where $h_i$ is the number of elements of $S$ in the range $((1 + \epsilon)^{i-1}, (1 + \epsilon)^i]$.
7.1 Motivation
Compressing arithmetic terms. Consider a “large” term, such as $t = \min(\min(x, 3) + 2, 9)$. By routine algebraic laws such as distributivity, we can write:
$$t = \min(\min(x, 3) + 2, 9) = \min(\min(x + 2, 5), 9) = \min(x + 2, 5).$$
In fact, any real-valued term $t$ built out of a set of parameters $P$ and the operations “$\min$” and “$+$” can easily be reduced to the normal form:
$$t = \min_{P' \subseteq P} \Big( c_{P'} + \sum_{p \in P'} p \Big)$$
for some constants $c_{P'} \in \mathbb{R} \cup \{\infty\}$ for each subset of parameters $P' \subseteq P$. Unfortunately, when these rules are fully applied, the resulting term has size $O(2^{|P|})$ and is no longer single-use. For example,
$$t' = \min(x + 2, 5) + \min(y + 8, 7) = \min(12, x + 9, y + 13, x + y + 10).$$
Observe, however, that the size of the output term is already bounded, regardless of the size of the input term.
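To make the blow-up concrete, the following Python sketch computes the normal form of a term as a table from parameter subsets $P'$ to constants $c_{P'}$. The tuple encoding of terms and the function name `normal_form` are assumptions made for this illustration, and the sketch only handles terms in which each parameter occurs once.

```python
from itertools import product

INF = float("inf")


def normal_form(t):
    """Return {frozenset of parameters: constant}, read as the min over all
    entries of (constant + sum of the parameters in the subset).
    Terms are tuples (hypothetical encoding): ("const", c), ("param", name),
    ("add", t1, t2), ("min", t1, t2)."""
    tag = t[0]
    if tag == "const":
        return {frozenset(): t[1]}
    if tag == "param":
        return {frozenset([t[1]]): 0}
    left, right = normal_form(t[1]), normal_form(t[2])
    if tag == "add":
        # Distribute + over min: every pair of entries yields one entry.
        out = {}
        for (p1, c1), (p2, c2) in product(left.items(), right.items()):
            assert not (p1 & p2), "sketch assumes each parameter occurs once"
            key = p1 | p2
            out[key] = min(out.get(key, INF), c1 + c2)
        return out
    # tag == "min": keep the smaller constant for each parameter subset.
    out = dict(left)
    for key, c in right.items():
        out[key] = min(out.get(key, INF), c)
    return out


x, y = ("param", "x"), ("param", "y")
t_prime = ("add",
           ("min", ("add", x, ("const", 2)), ("const", 5)),
           ("min", ("add", y, ("const", 8)), ("const", 7)))
for params, c in sorted(normal_form(t_prime).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(params), c)
```

For $t'$ the table has four entries, one per subset of $\{x, y\}$, matching $\min(12, x + 9, y + 13, x + y + 10)$; every additional `add` of two `min` terms multiplies the number of entries.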
The idea behind our simplification routine $\mathrm{simpl}(t)$ is therefore to only propagate constants and not attempt to completely reduce the term to a normal form: $t = \min(\min(x, 3) + 2, 9)$ is reduced to $\min(x + 2, 5)$, but $t' = \min(x + 2, 5) + \min(y + 8, 7)$ is left unchanged. More specifically, $\mathrm{simpl}$ guarantees the following properties:
1. For all inputs $t$, the output $\mathrm{simpl}(t)$ is equivalent to $t$.
2. For all sub-terms $t'$ of the output $\mathrm{simpl}(t)$, if $\mathrm{Param}(t') = \emptyset$, then $t'$ has already been fully evaluated, i.e. $t' = c'$, for some constant $c'$.
3. Whenever $t' + c'$ is a sub-term of the output $\mathrm{simpl}(t)$, where $c'$ is a constant, $t' = p'$, for some parameter $p'$. In particular, $t'$ cannot itself be either of the form $t_1 + t_2$ or of the form $\min(t_1, t_2)$. And symmetrically for each sub-term $c' + t'$ of the output $\mathrm{simpl}(t)$. Informally, this limits the number of constants which can appear immediately under a “$+$” operator.
4. Finally, if $\min(t', c')$ is a sub-term of the output $\mathrm{simpl}(t)$, then $t'$ is either a parameter $p'$ or an instance of the “$+$” operator, $t' = t_1 + t_2$. In particular, $t'$ is itself forbidden from being of the form $\min(t_1, t_2)$. A symmetric claim is made for potential sub-terms $\min(c', t')$ of the output $\mathrm{simpl}(t)$. Informally, this limits the number of constants which occur immediately under a “$\min$” combinator.
With these properties, we can then show that the output $\mathrm{simpl}(t)$ is of size $O(|\mathrm{Param}(t)|)$.
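One possible way to realize constant propagation in the spirit of properties 1 to 4 is sketched below in Python. This is an illustrative assumption and not necessarily the routine developed in this chapter: the tuple encoding of terms and the helper names `mk_add`, `mk_min`, and `evaluate` are hypothetical, and constants are always normalized to the right-hand operand.

```python
def const(c): return ("const", c)
def param(p): return ("param", p)
def add(a, b): return ("add", a, b)
def tmin(a, b): return ("min", a, b)
def is_const(t): return t[0] == "const"


def mk_add(a, b):
    """Build a + b from already simplified a and b, pushing constants down."""
    if is_const(a) and is_const(b):
        return const(a[1] + b[1])            # fold constants (property 2)
    if is_const(a):
        a, b = b, a                          # keep the constant on the right
    if not is_const(b) or a[0] == "param":
        return add(a, b)                     # nothing to propagate
    if a[0] == "min":                        # min(u, v) + c = min(u + c, v + c)
        return mk_min(mk_add(a[1], b), mk_add(a[2], b))
    if is_const(a[2]):                       # (p + c1) + c = p + (c1 + c)
        return add(a[1], const(a[2][1] + b[1]))
    return add(mk_add(a[1], b), a[2])        # (u + v) + c = (u + c) + v


def mk_min(a, b):
    """Build min(a, b) from already simplified a and b."""
    if is_const(a) and is_const(b):
        return const(min(a[1], b[1]))
    if is_const(a):
        a, b = b, a
    if not is_const(b) or a[0] != "min":
        return tmin(a, b)
    if is_const(a[2]):                       # min(min(u, c1), c) = min(u, min(c1, c))
        return tmin(a[1], const(min(a[2][1], b[1])))
    return tmin(mk_min(a[1], b), a[2])       # min(min(u, v), c) = min(min(u, c), v)


def simpl(t):
    """Simplify bottom-up by propagating constants only."""
    if t[0] in ("const", "param"):
        return t
    a, b = simpl(t[1]), simpl(t[2])
    return mk_add(a, b) if t[0] == "add" else mk_min(a, b)


def evaluate(t, env):
    """Evaluate a term under a parameter valuation env (a dict)."""
    if t[0] == "const":
        return t[1]
    if t[0] == "param":
        return env[t[1]]
    u, v = evaluate(t[1], env), evaluate(t[2], env)
    return u + v if t[0] == "add" else min(u, v)


x, y = param("x"), param("y")
t = tmin(add(tmin(x, const(3)), const(2)), const(9))
print(simpl(t))  # ('min', ('add', ('param', 'x'), ('const', 2)), ('const', 5))
```

On the running examples, `simpl` rewrites $t$ to $\min(x + 2, 5)$ and leaves $t' = \min(x + 2, 5) + \min(y + 8, 7)$ unchanged, since neither operand of the top-level addition is a constant.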
Representing multisets. The second idea is to compactly represent multisets as approximate histograms. Consider the multiset:
$$C = \{12, 9, 14, 20980, 21046\}.$$
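We represent $C$ by the array $h : \mathbb{Z} \to \mathbb{N}$, where $h_i$ is the number of values of $C$ which fall in the bucket $((1 + \epsilon)^{i-1}, (1 + \epsilon)^i]$. The index $i$ of the bucket in which a given value $v$ falls can be computed as $i = \lceil \log(v) / \log(1 + \epsilon) \rceil$. In our case, for $\epsilon = 0.01$, we have: $h_{221} = 1$ (for the value 9), $h_{250} = 1$ (for 12), $h_{266} = 1$ (for 14), $h_{1001} = 2$ (for 20980 and 21046), and $h_i = 0$ for every other index $i$.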
The histogram itself is small: $|h| = (\log(\max(C)) - \log(\min(C))) / \log(1 + \epsilon)$, but is still sufficient to answer rank queries with multiplicative accuracy: the median discretized bucket corresponds to the index $i = 266$. Reversing the bucket index computation, $(1 + \epsilon)^i = 1.01^{266} \approx 14.1$ yields a value within $\pm 1\%$ of the true median, $14$.
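The bucket computation and the rank query above can be reproduced with a short Python sketch. The function names `bucket_index`, `histogram`, and `approx_rank` are hypothetical, and storing the histogram as a sparse `Counter` (only non-empty buckets) is an assumption of this illustration rather than the representation used in the chapter.

```python
import math
from collections import Counter


def bucket_index(v: float, eps: float) -> int:
    """Index i such that v lies in ((1 + eps)^(i-1), (1 + eps)^i]."""
    return math.ceil(math.log(v) / math.log(1 + eps))


def histogram(values, eps: float) -> Counter:
    """Sparse approximate histogram: bucket index -> number of values."""
    return Counter(bucket_index(v, eps) for v in values)


def approx_rank(h: Counter, eps: float, k: int):
    """Return (bucket index, upper end of that bucket) for the k-th smallest
    element, with k counted from 1."""
    seen = 0
    for i in sorted(h):
        seen += h[i]
        if seen >= k:
            return i, (1 + eps) ** i
    raise ValueError("k exceeds the number of stored elements")


C = [12, 9, 14, 20980, 21046]
eps = 0.01
h = histogram(C, eps)
print(sorted(h.items()))   # [(221, 1), (250, 1), (266, 1), (1001, 2)]
i, estimate = approx_rank(h, eps, k=3)   # the median is the 3rd of 5 values
print(i, round(estimate, 1))             # 266 14.1
```

Note that 20980 and 21046 share bucket 1001, so the histogram stores four counters for five values; the saving grows as more values fall within a factor of $1 + \epsilon$ of each other.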