Cost-Based Benefit Model - Sharing Plan Selection

4.3 Sharing Plan Selection

4.3.2 Cost-Based Benefit Model

Next, we establish a model to calculate the CPU costs of both sharing a substring candidate X (CC method), versus not sharing it (NonShare method). Then the BValue of X is obtained by: XBV alue = XN onShare− XCC. Now we analyze the estimated costs of

processing a common substring X by either of two computation approaches. To make the cost model easy to understand, we assume our stream is uniformly distributed, with the data frequency of each event type in this stream the same. Namely, each event type has the same number of instances. We estimate the cost of processing of all instances in a sample stream. We divide event types into four categories START, UPD, CNET, TRIG based on the different operations they trigger upon their arrival (See section 3.1 and 4.2). Gener- ally, we estimate the cost of all instances in each category, and then sum them together as the total cost. The total cost of a single query can thus be formulated as:

Costtotal = 4

i=1

Ncati ∗ Costcati∗ N umcati (4.1)

Table 4.1 represents the parameters and terminology used in the cost estimation. Take the computation cost for pattern query (A, B, C, D) as an example. There is one START, two UPD and one TRIG event types in this query. Then based on the Equation 4.1, the total cost of (A, B, C, D) in a uniformly distributed stream is: Cost(A,B,C,D) = CST ART∗

NST ART ∗ N umST ART + CU P D ∗ NU P D ∗ N umU P D + CT RIG∗ NT RIG∗ N umT RIG =

Table 4.1: Terminology Used in Cost Estimation Term Definition

N0 The instances number of each event type in a uniformly distributed

stream (except the CNET instances number)

f the occurrence frequency of CNET instances, that is, (# of CNET)/N0

f N0 the number of CNET instances in stream

l the length of a substring X

k the number of queries that share X

Ncati The number of instances of event category i

N umcati The number of event types of category i in a pattern

CST ART The cost of a START instance: create a PreCntr, time complexity O(C)

CU P D The cost of a UPD instance: update count of all active PreCntrs, at most

N0in Naive and f N0 in CC. Time complexity O(n)

CT RIG The cost of a TRIG instance: update and sum count of all active PreC-

ntrs, worst case N0, time complexity O(n)

CCN ET The cost of a CNET instance: compute and create snapshot, worst case

N02, time complexity O(n2)

NonShare Method. The cost of computing a common substring X without sharing (duplicate in k queries). Then, the total cost is simply compute X k times. We can utilize Equation 4.1 to get the costs of computing X once. To simplify the cost model but without underestimation, no matter what event category X has, we assume that CST ART =

CT RIG = CCN ET = CU P D = N0 operation cost. Thus, for computing X once, the cost

is: Xonce = N0∗ f N0∗ 1 + N0∗ N0∗ (l − 1).

The cost is composed of two parts: (1) cost of all CNET instances, and (2) cost of all other instances (as the number of CNET is different with other events based on our assumption). The first N0 in both parts represents the operation costs of an individual

event instance (no matter what event category). We simplify (l − 1) as l, then, the total cost of computing X in k queries is:

XN aive = (N0∗ f N0∗ 1 + N0∗ N0 ∗ l) ∗ k

= N02 ∗ k ∗ (l + f )

(4.2)

ChopConnect Method As we can see from the operation analysis for CC in Sec- tion 4.2, the overhead costs of CC mainly come from the connect operation, namely, the snapshot computation and creation operations when the CNET arrives. This snapshot computation cost for a multi-connect operation includes two parts: 1) the multiplication of count of each sub2 PreCntrwith its attached sub1snapshot count, and 2) the summing

of the counts across all sub2PreCntrs. There are at most N0records in the sub1snapshot,

also at most N0 sub2 PreCntrs. Thus, the worst case cost for a multi-connect snapshot

computation is N02.

Based on the above cost analysis, we conclude that the data frequency of CNET instances dominantly determine the CC costs, as the costs of other event categories is at most N0. Besides, the CNET corresponds to the START in X. Thus, the fewer CNET

instances, the fewer PreContrs the UPD instances need to update. Namely, at most f N0

PreCntrs will be created, and thus the CU P Dshould be at most f N0in this case. To mea-

sure the effect of the data frequency of CNET, we set a parameter f in our Benefit Model. Theoretically, the lower f , the lower the CC costs.

In the CC method, we only need to compute substring X once. However, we need to connect it thereafter with other substrings in the k queries that share it. To simplify the cost model but without underestimating the costs, here we make two assumptions: 1) share X in one query will cause one chop before this substring, thus there are total k connects, and 2) we use the multi-connect costs (N02) for all connects, no matter it is

single or multi connect. Therefore, except the CNET costs that need to be calculated k times, the costs of other event categories (only U P D if share X) would be calculated only

once. Given the above two assumptions, we conclude the computation costs of processing X by the CC method as in Equation 4.3:

XCC = CCN ET ∗ k ∗ f N0+ CU P D∗ (l − 1) ∗ N0

≈ N02∗ k ∗ f N0 + f N0∗ l ∗ N0

= N03∗ f ∗ k + N02∗ f ∗ l

(4.3)

Benefit Model. Now we calculate the Benefit Model together with using the XCC

and XN aive costs from Equation 4.3 and 4.2 respectively.

XBV alue = XN aive− XCC

= N02∗ k ∗ (l + f ) − N03∗ f ∗ k − N02 ∗ f ∗ l

(4.4)

From Equation 4.4, three parameters together impact the computation gains of sharing a candidate substring. To observe the effect of each parameter independently, we transform the Benefit Model by treating one parameter as variable while the other two are being set to some constant. This way we obtain the following three linear equations:

XBV alue =                (N02k − N03k − N02l) ∗ f + N02kl, f as variable (a) (N02k − N02f ) ∗ l + (N02− N03)kf, l as variable (b) (N02l + N02f − N03f ) ∗ k − N02f l, k as variable (c) (4.5)

In the Equation 4.5(a), we see that (N02k − N03k − N02l) < 0 as N02 < N03 for

N0 > 1. Thus, the smaller f , the larger the BValue. Moreover, since N02 is usually a

analysis corresponds to our intuition that f is the major factor that affect the sharing cost. In the Equation 4.5(b), we can see that (N02k − N02f ) > 0 as f < 1 while k >= 2.

Thus, the larger l, the larger the BValue. This indicates that the longer a substring we share, the more we can gain.

Lastly, in the Equation 4.5(c), we see that N02l + N02f − N03f could be either greater

or smaller than 0. This corresponds to our intuition that it is hard to decide whether we can benefit more if more queries share a common substring. As the larger k, the more chops and connects are introduced. Though we can eliminate more duplicate computation, but higher CONNECT costs overhead will also arise.

In document High Performance Analytics in Complex Event Processing (Page 47-51)