
6.2. A model to evaluate aggregations

To evaluate aggregates under updates, we define the aggregate implementation. The model we introduce here relies on the concept of a MUD-algorithm described in [41].

An aggregate implementation is a triple (ξ, ◦, η) where ξ : dom → M is a function that maps a database entry to a message, ◦ : M × M → M is a binary operation on M such that (M, ◦) is an Abelian group, and η : M → dom is a function that maps a message to a database entry. The main idea is to use the aggregate implementation to evaluate aggregate functions as follows. For a multiset ⦃a_1, …, a_n⦄ we use the function ξ to map the database entries to messages, then we combine the messages with the binary operation ◦ and apply η to the result to obtain the output. An aggregate implementation computes an aggregate function {f_n}_{n∈N} if for all n ∈ N_{⩾1} and for all a_1, …, a_n ∈ dom the following holds:

\[ f_n(\{\!\{a_1, \dots, a_n\}\!\}) = \eta\bigl(\xi(a_1) \circ \dots \circ \xi(a_n)\bigr) \]

and f_0 = η(e), where e is the neutral element of the group (M, ◦).
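The definition can be read as a small interface. The following Python sketch is only an illustration under assumptions that are not part of the text: the class name `AggregateImplementation`, its field names and the instance `sum_impl` are hypothetical, and the group's neutral element and inverse map are passed explicitly because a program cannot derive them from ◦ alone.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

D = TypeVar("D")  # database entries (dom)
M = TypeVar("M")  # messages


@dataclass(frozen=True)
class AggregateImplementation(Generic[D, M]):
    """Hypothetical container for a triple (ξ, ◦, η) plus the group data of (M, ◦)."""
    xi: Callable[[D], M]        # ξ : dom → M
    op: Callable[[M, M], M]     # ◦ : M × M → M
    eta: Callable[[M], D]       # η : M → dom
    neutral: M                  # neutral element e of (M, ◦)
    inverse: Callable[[M], M]   # m ↦ m⁻¹ in (M, ◦)

    def evaluate(self, entries: Iterable[D]) -> D:
        # η(ξ(a_1) ◦ ... ◦ ξ(a_n)); for the empty multiset this is η(e).
        message = reduce(self.op, map(self.xi, entries), self.neutral)
        return self.eta(message)


# Illustrative instance: summation over the integers, with M = Z and ◦ = +.
sum_impl = AggregateImplementation(
    xi=lambda a: a,
    op=lambda x, y: x + y,
    eta=lambda m: m,
    neutral=0,
    inverse=lambda m: -m,
)

print(sum_impl.evaluate([3, 4, 5]))
print(sum_impl.evaluate([]))
```

Starting the fold at the neutral element makes the case f_0 = η(e) fall out of the same code path as n ⩾ 1.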

Example 6.3. We now consider examples of aggregate implementations where dom = N:

• Let I_sum := (ξ_sum, ◦_sum, η_sum) with
  – M = Z,
  – ξ_sum(a) := a for all a ∈ dom,
  – ◦_sum = + and
  – η_sum(a) = a.

  I_sum computes sum. Note that η_sum(ξ_sum(a_1) ◦_sum ⋯ ◦_sum ξ_sum(a_n)) = a_1 + ⋯ + a_n ∈ N if a_1, …, a_n ∈ N.

• Let I_prod := (ξ_prod, ◦_prod, η_prod) with
  – M = Q,
  – ξ_prod(a) := a for all a ∈ dom,
  – ◦_prod is the multiplication operator of Q and
  – η_prod(a) = a.

  I_prod computes prod. Note that η_prod(ξ_prod(a_1) ◦_prod ⋯ ◦_prod ξ_prod(a_n)) = a_1 · ⋯ · a_n ∈ N if a_1, …, a_n ∈ N.

• Let I_count := (ξ_count, ◦_count, η_count) with
  – M = Z,
  – ξ_count(a) := 1 for all a ∈ dom,
  – ◦_count = + and
  – η_count(a) = a.

  I_count computes count.

• Let I_avg := (ξ_avg, ◦_avg, η_avg) with
  – M = Z²,
  – ξ_avg(a) := (a, 1) for all a ∈ dom,
  – (a_1, n_1) ◦_avg (a_2, n_2) = (a_1 + a_2, n_1 + n_2) for all (a_1, n_1), (a_2, n_2) ∈ M and
  – η_avg((a, n)) = a/n.

  I_avg computes avg.

• For the aggregate function max it is not sufficient to simply "remember" the maximum element of the multiset: if update steps remove every occurrence of the maximum element, no information about the new maximum remains. To avoid this problem, we define an aggregate implementation (ξ_max, ◦_max, η_max) for max in which M is the set of functions from N ∪ {−∞} to Z. The main idea is that for a multiset ⦃a_1, …, a_n⦄ the current message is the function that maps every a ∈ N ∪ {−∞} to the number of occurrences of a in ⦃a_1, …, a_n⦄. We call this function the occurrence function of ⦃a_1, …, a_n⦄.

For every a ∈ N ∪ {−∞}, the value ξ_max(a) is the function f with f(a) = 1 and f(b) = 0 for all b ∈ (N ∪ {−∞}) \ {a}. This is the occurrence function of ⦃a⦄.

The binary operation ◦_max maps f, g ∈ M to the function u with u(a) = f(a) + g(a) for all a ∈ N ∪ {−∞}. In other words, if f is the occurrence function of ⦃a_1, …, a_n⦄ and g is the occurrence function of ⦃b_1, …, b_m⦄, then f ◦_max g is the occurrence function of ⦃a_1, …, a_n, b_1, …, b_m⦄.

It remains to show that (M, ◦_max) is an Abelian group.

– ◦_max is associative. Let p, q, r ∈ M. Then for all a ∈ N ∪ {−∞} the following holds: [(p ◦_max q) ◦_max r](a) = [p ◦_max q](a) + r(a) = (p(a) + q(a)) + r(a) = p(a) + (q(a) + r(a)) = p(a) + [q ◦_max r](a) = [p ◦_max (q ◦_max r)](a). Hence (p ◦_max q) ◦_max r = p ◦_max (q ◦_max r).

– The neutral element e ∈ M is the function with e(a) = 0 for all a ∈ N ∪ {−∞}, since for all q ∈ M and for all a ∈ N ∪ {−∞} we have [q ◦_max e](a) = q(a) + e(a) = q(a).

– For every q ∈ M there is an inverse element q⁻¹ ∈ M with q⁻¹(a) = −q(a) for all a ∈ N ∪ {−∞}, since [q ◦_max q⁻¹](a) = q(a) + q⁻¹(a) = 0 for all a ∈ N ∪ {−∞}, i.e., q ◦_max q⁻¹ = e.

– ◦_max is commutative. Let p, q ∈ M. For all a ∈ N ∪ {−∞} we have [p ◦_max q](a) = p(a) + q(a) = q(a) + p(a) = [q ◦_max p](a). In particular, p ◦_max q = q ◦_max p.

The post-processing function η_max returns the maximum a ∈ N ∪ {−∞} that appears in the multiset, i.e.,

\[ \eta_{\max}(q) := \begin{cases} \max\{a \in \mathbb{N} \cup \{-\infty\} : q(a) \neq 0\} & \text{if } q \neq e, \\ -\infty & \text{otherwise.} \end{cases} \]

It is straightforward to see that (ξ_max, ◦_max, η_max) computes the aggregation function max.
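The occurrence functions used for max can be made concrete with a small Python sketch. It rests on assumptions that are not in the text: a function in M with finite support is stored as a dictionary holding only its nonzero values, −∞ is represented by `-math.inf`, and the names `xi_max`, `op_max`, `inverse_max` and `eta_max` are hypothetical.

```python
import math
from typing import Dict

NEG_INF = -math.inf
# A finite-support function N ∪ {−∞} → Z; keys that are absent map to 0.
Occurrence = Dict[float, int]

NEUTRAL: Occurrence = {}  # e, the function that is 0 everywhere


def xi_max(a: float) -> Occurrence:
    """Occurrence function of the one-element multiset ⦃a⦄."""
    return {a: 1}


def op_max(f: Occurrence, g: Occurrence) -> Occurrence:
    """Pointwise sum u(a) = f(a) + g(a); zero values are not stored."""
    u = dict(f)
    for a, k in g.items():
        total = u.get(a, 0) + k
        if total != 0:
            u[a] = total
        else:
            u.pop(a, None)
    return u


def inverse_max(q: Occurrence) -> Occurrence:
    """Pointwise negation q⁻¹(a) = −q(a)."""
    return {a: -k for a, k in q.items()}


def eta_max(q: Occurrence) -> float:
    """Largest a with q(a) ≠ 0, or −∞ if q is the neutral element."""
    return max(q) if q else NEG_INF


m = op_max(op_max(xi_max(3), xi_max(7)), xi_max(7))  # multiset ⦃3, 7, 7⦄
m = op_max(m, inverse_max(xi_max(7)))                # remove one 7
m = op_max(m, inverse_max(xi_max(7)))                # remove the other 7
print(eta_max(m))                                    # the maximum is now 3
```

Dropping zero entries keeps the dictionary equal to `NEUTRAL` exactly when the represented function is e, which is the case distinction in the definition of η_max.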

For an aggregate implementation (ξ, ◦, η), a data structure that maintains the aggregation function under updates represents the current message m_current ∈ M and the value η(m_current), i.e., η(m_current) = f_n(⦃a_1, …, a_n⦄) if ⦃a_1, …, a_n⦄ is the current multiset.

To initialise the data structure, we set m_current to the neutral element of (M, ◦) and compute η(m_current). If we insert an element a into the multiset, we update m_current to m_current ◦ ξ(a) and compute the new value η(m_current). If we delete an element a from the multiset, we compute ξ(a)⁻¹ (the inverse element of ξ(a) in the group (M, ◦)), update m_current to m_current ◦ ξ(a)⁻¹ and compute the new value η(m_current). It is straightforward to verify that η(m_current) always equals f_n(⦃a_1, …, a_n⦄).
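The initialise, insert and delete steps can be sketched in Python as follows, using the avg implementation with M = Z² as the running instance. The class `MaintainedAggregate` and its attribute names are hypothetical. The text leaves the average of the empty multiset unspecified, so the sketch assumes that η returns `None` for the message (0, 0), and it uses `Fraction` to keep a/n exact.

```python
from fractions import Fraction


class MaintainedAggregate:
    """Hypothetical holder for m_current and η(m_current) under updates."""

    def __init__(self, xi, op, inverse, eta, neutral):
        self.xi, self.op, self.inverse, self.eta = xi, op, inverse, eta
        self.m_current = neutral        # start at the neutral element e
        self.value = eta(neutral)       # η(m_current)

    def insert(self, a):
        # m_current := m_current ◦ ξ(a)
        self.m_current = self.op(self.m_current, self.xi(a))
        self.value = self.eta(self.m_current)

    def delete(self, a):
        # m_current := m_current ◦ ξ(a)⁻¹
        self.m_current = self.op(self.m_current, self.inverse(self.xi(a)))
        self.value = self.eta(self.m_current)


avg = MaintainedAggregate(
    xi=lambda a: (a, 1),
    op=lambda x, y: (x[0] + y[0], x[1] + y[1]),
    inverse=lambda m: (-m[0], -m[1]),
    # Assumption: the average of the empty multiset is reported as None.
    eta=lambda m: Fraction(m[0], m[1]) if m[1] != 0 else None,
    neutral=(0, 0),
)

for a in (2, 4, 9):
    avg.insert(a)
print(avg.m_current, avg.value)   # message (15, 3)
avg.delete(9)
print(avg.m_current, avg.value)   # message (6, 2)
```

Because deletion only composes with an inverse, no entry of the multiset has to be stored; the message alone carries everything the update needs.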

To design suitable data structures for I_sum, I_prod, I_count and I_avg, it is sufficient to declare variables that store the values m_current and η(m_current).

For (ξ_max, ◦_max, η_max) we represent m_current via an ordered list L with the values {(a, m_current(a)) : a ∈ N ∪ {−∞}, m_current(a) ≠ 0}. The values are ordered descending by the first component of the tuples, i.e., if (a_2, b_2) is the successor of (a_1, b_1) in L, then a_1 > a_2. To look up, insert and remove elements in the list quickly, we use AVL-trees (see [65]); each of these operations takes time O(log n), where n is the number of elements in the current multiset. To initialise the data structure, we initialise an empty list, which represents the neutral element of (M, ◦_max). To update the data structure from m_current to m_current ◦_max p, where p ∈ ξ(N ∪ {−∞}) ∪ ξ⁻¹(N ∪ {−∞}), let b ∈ N ∪ {−∞} be the unique element with p(b) ∈ {1, −1}. Note that, since p ∈ ξ(N ∪ {−∞}) ∪ ξ⁻¹(N ∪ {−∞}), it holds that p(a) = 0 for all a ∈ (N ∪ {−∞}) \ {b}. We then proceed as follows. We look up b in the AVL-tree. If it is not present, we insert (b, p(b)) into L. Otherwise, we modify the entry (b, n) to (b, n + p(b)); if n + p(b) = 0, we remove (b, 0) from L.

The value of η(m_current) is the first component of the first element in the list, and −∞ if the list is empty. It is straightforward to verify that after each operation the list L represents m_current and η(m_current) is the maximum element a ∈ N ∪ {−∞} with m_current(a) ≠ 0. In particular, it takes time O(log n) to compute m_current ◦_max p from m_current and an element p ∈ ξ(N ∪ {−∞}) ∪ ξ⁻¹(N ∪ {−∞}).
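The update steps for the ordered list can be sketched in Python. This sketch deliberately swaps in a different container: the standard library has no AVL-tree, so a plain sorted list with `bisect` is used, which gives O(log n) lookups but O(n) insertions and removals instead of the O(log n) bound stated above. The keys are also kept in ascending order, so the maximum is the last element rather than the first. The class name `MaxStructure` and its members are hypothetical.

```python
import bisect
import math


class MaxStructure:
    """Sketch of the list L; a sorted Python list stands in for the AVL-tree."""

    def __init__(self):
        self.keys = []    # distinct b with m_current(b) ≠ 0, ascending
        self.count = {}   # b -> m_current(b), nonzero values only

    def update(self, b, sign):
        """Apply m_current ◦ p, where p(b) = sign ∈ {1, -1} and p is 0 elsewhere."""
        if b not in self.count:
            # b is not present: insert the pair (b, p(b)).
            bisect.insort(self.keys, b)
            self.count[b] = sign
            return
        # b is present: change (b, n) to (b, n + p(b)) and drop it at 0.
        self.count[b] += sign
        if self.count[b] == 0:
            del self.count[b]
            self.keys.pop(bisect.bisect_left(self.keys, b))

    def value(self):
        """η(m_current): the largest stored key, or −∞ for the empty list."""
        return self.keys[-1] if self.keys else -math.inf


s = MaxStructure()
for b in (3, 7, 7):
    s.update(b, +1)      # insert 3, 7, 7
s.update(7, -1)          # one 7 remains, so the maximum stays 7
s.update(7, -1)          # no 7 remains, so the maximum drops to 3
print(s.value())
```

Replacing the sorted list by any balanced search tree restores the logarithmic update time without changing the logic of `update`.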

For an aggregate implementation (ξ, ◦, η) let t_ξ be the time it takes to compute ξ(a) for an a ∈ dom, t_{ξ⁻¹} the time it takes to compute ξ⁻¹(a) for an a ∈ dom, t_◦ the time to compute m_current ◦ p if we receive m_current and p ∈ ξ(dom) ∪ ξ⁻¹(dom) as input, and t_η the time it takes to compute η(m_current). For an aggregation function {f_n}_{n∈N} let t_a({f_n}_{n∈N}) be defined as follows. Let A be the set of all aggregate implementations that compute {f_n}_{n∈N}; then

\[ t_a(\{f_n\}_{n \in \mathbb{N}}) := \min_{(\xi, \circ, \eta) \in A} \bigl( t_{\xi} + t_{\xi^{-1}} + t_{\circ} + t_{\eta} \bigr), \]

i.e., it is the time it takes to perform an update in the multiset for the aggregate.

For the aggregate functions in Example 6.2, it follows that t_{ξ_agg} = t_{ξ_agg⁻¹} = O(1) for agg ∈ {sum, prod, count, avg, max}, that t_{η_agg} = t_{◦_agg} = O(1) for agg ∈ {sum, prod, count, avg}, and that t_{η_max} = t_{◦_max} = O(log n). In particular, t_a(max) = O(log n) and t_a(agg) = O(1) for agg ∈ {sum, prod, count, avg}.

We will now give syntax and semantics of aggregation expressions.

In document Answering Conjunctive Queries and FO+MOD Queries under Updates (Page 84-88)