15.2 Stream Tuple Processing Order
15.2.2 State Sliced Join Processing with Semi-Ordered Execution
In previous chapters, all discussions are based on the total-ordered execution model. In this section, we first show that the un-ordered execution model cannot guarantee the completeness of the joined result. We then propose a lazy purge for state sliced join processing in the semi-ordered execution model.
Consider the join $A[w] \bowtie B[w]$ of streams A and B. No purge-based state slicing can be achieved with the un-ordered execution model: without any other information, every build tuple must stay in the first state sliced join operator to wait for possible future probing by out-of-order tuples.
The following example, on $A[w] \bowtie B[w]$, illustrates the purging and probing used with the semi-ordered execution model. Assume tuples $a_1, a_2$ and tuples $b_1, b_2$ arrive at the system with timestamps $T_{a_1}, T_{a_2}, T_{b_1}, T_{b_2}$ respectively, where $T_{b_1} < T_{a_1} = T_{a_2} < T_{b_2}$. One possible semi-ordered execution sequence is $b_1, a_1, a_2, b_2$. To generate the complete joined result, a processed tuple $m$ must probe the state tuples $n$ that satisfy not only $T_n > T_m - w$ but also $T_n < T_m + w$.
Similar to the interleaved PSP scheme, the lazy purge keeps a state tuple until no tuple of any other stream will purge it any more. To achieve this, each state maintains a mark for each crossing purge from the other input streams. Only the part of the state that lies outside the sliced windows of all the other streams is actually removed from the current state.
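The two-sided probe and the per-stream purge marks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class `LazyPurgeState`, its method names, and the convention that a purge with timestamp T marks every state tuple with timestamp at most T − w as out of that stream's window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LazyPurgeState:
    """State of one input stream (hypothetical names throughout)."""
    window: float
    other_streams: tuple
    tuples: list = field(default_factory=list)  # (timestamp, payload) pairs
    marks: dict = field(default_factory=dict, init=False)

    def __post_init__(self):
        # One purge mark per other input stream; nothing is purged initially.
        self.marks = {s: float("-inf") for s in self.other_streams}

    def insert(self, ts, payload):
        self.tuples.append((ts, payload))
        self.tuples.sort(key=lambda pair: pair[0])  # keep the state timestamp-ordered

    def probe(self, ts):
        # Two-sided check: T_m - w < T_n < T_m + w.
        lo, hi = ts - self.window, ts + self.window
        return [(t, p) for t, p in self.tuples if lo < t < hi]

    def purge(self, stream, ts):
        # Record the crossing purge from `stream` (assumed threshold: ts - w).
        self.marks[stream] = max(self.marks[stream], ts - self.window)
        # Remove only what lies outside the windows of ALL other streams.
        removable = min(self.marks.values())
        self.tuples = [(t, p) for t, p in self.tuples if t > removable]


state_a = LazyPurgeState(window=5, other_streams=("B", "C"))
for ts, name in [(1, "a1"), (4, "a2"), (9, "a3")]:
    state_a.insert(ts, name)

state_a.purge("B", 10)                    # only B has crossed; C's mark is still -inf
after_b = len(state_a.tuples)             # nothing removed yet
state_a.purge("C", 8)                     # now both marks exist: min(5, 3) = 3
after_c = [p for _, p in state_a.tuples]  # a1 (timestamp 1) is removed
```

In this sketch a purge from a single stream never removes anything on its own; removal happens only once the smallest mark over all other streams has passed a tuple.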
For correctness, we have the following result: state sliced join with lazy purge generates the complete joined result in the semi-ordered execution model.
Proof:
• No duplication. Proof by contradiction. Assume two joined tuples are exactly the same, of the form $(t_1, t_2, \ldots, t_n)$, where $t_i$ is the tuple from stream $i$. These two tuples must have been generated while processing different input probing tuples; let them be $t_p$ and $t_q$. When the result is generated by probing with $t_p$, tuple $t_q$ is already in the state, so $t_q$ was processed before $t_p$. Similarly, when the result is generated by probing with $t_q$, tuple $t_p$ was processed before $t_q$, which contradicts the previous claim.
• No missing result. Assume $(t_1, t_2, \ldots, t_n)$ is a valid joined result and input tuple $t_i$ is the last one being processed among $t_1, \ldots, t_n$. Because of lazy purging, the other tuples are still in their states when $t_i$ probes, so this joined result is generated.
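Both properties can be checked on the small example above by comparing a simulated join against a brute-force join. The sketch below is only an illustration under assumptions that are not part of the original text: the join condition is taken to be $|T_a - T_b| < w$, tuples of the same stream are assumed to be processed in timestamp order while the interleaving across streams is arbitrary, and the function names `semi_ordered_join` and `brute_force_join` are hypothetical. With only two streams there is a single other stream, so the purge mark reduces to one threshold.

```python
from itertools import product


def semi_ordered_join(sequence, w):
    """sequence: list of (stream, name, timestamp), stream in {"A", "B"}."""
    state = {"A": [], "B": []}
    results = []
    for stream, name, ts in sequence:
        other = "B" if stream == "A" else "A"
        # Purge: drop tuples of the other state that fell out of this window.
        state[other] = [(n, t) for n, t in state[other] if t > ts - w]
        # Probe: the remaining tuples satisfy t > ts - w; also require t < ts + w.
        for other_name, other_ts in state[other]:
            if other_ts < ts + w:
                pair = (name, other_name) if stream == "A" else (other_name, name)
                results.append(pair)
        # Build: insert the processed tuple into its own state.
        state[stream].append((name, ts))
    return results


def brute_force_join(sequence, w):
    a = [(n, t) for s, n, t in sequence if s == "A"]
    b = [(n, t) for s, n, t in sequence if s == "B"]
    return {(an, bn) for (an, at), (bn, bt) in product(a, b) if abs(at - bt) < w}


# Timestamps chosen so that T_b1 < T_a1 = T_a2 < T_b2, with w = 3.
in_order = [("B", "b1", 1), ("A", "a1", 2), ("A", "a2", 2), ("B", "b2", 4)]
# b2 is processed before a1 and a2 although it carries a larger timestamp.
interleaved = [("B", "b1", 1), ("B", "b2", 4), ("A", "a1", 2), ("A", "a2", 2)]

for seq in (in_order, interleaved):
    out = semi_ordered_join(seq, w=3)
    assert len(out) == len(set(out))           # no duplication
    assert set(out) == brute_force_join(seq, 3)  # no missing result
```

In the second sequence, $a_1$ finds $b_2$ in the state even though $b_2$ is newer, which is exactly the case the upper bound $T_n < T_m + w$ is needed for.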
The output of a state sliced join in the semi-ordered execution model is not ordered by the maximum of the timestamps of the input tuples. However, we have:
Lemma 15.2 (Output Timestamp Order Lemma) Let $t$ and $t'$ be two tuples in the output queue of a state sliced window join operator. Both tuples have timestamps of size $n$, represented as $[TS_1, \ldots, TS_n]$ and $[TS'_1, \ldots, TS'_n]$ respectively. If tuple $t$ appears earlier than tuple $t'$ in the queue, then there must exist at least one $i$ ($1 \le i \le n$) such that $TS_i < TS'_i$.
Proof: by induction on $n$.
Base case: $n = 2$. Let $[TS_1, TS_2]$ and $[TS'_1, TS'_2]$ be the timestamps of the intermediate result tuples $t$ and $t'$ respectively. When intermediate result $t$ is generated before $t'$, there are only two possible cases. (1) If $t$ and $t'$ are both generated by the same probing tuple, then $TS_1 = TS'_1$ (or $TS_2 = TS'_2$). Thus $TS_2 < TS'_2$ (or $TS_1 < TS'_1$), since the state is ordered by timestamp and the probing proceeds in the same order. (2) If $t$ and $t'$ are not generated by the same probing tuple, then the timestamps of the two corresponding probing tuples satisfy $TS_1 < TS'_1$ or $TS_2 < TS'_2$, according to the semi-ordered execution model.
Inductive Hypothesis: Assume that the timestamp order lemma holds for any tuple sequence with size n <= k.
Inductive Step: We now show that the timestamp order lemma also holds for sequences with size n = k + 1.
The timestamp array of $t$ with size $n = k + 1$ can be treated as a combination of two sub-tuples $t_1$ and $t_2$ with timestamp arrays $[TS_1, \ldots, TS_{i-1}, TS_{i+1}, \ldots, TS_{k+1}]$ and $[TS_i]$, respectively. Similarly, $t'$ can be treated as the combination of two sub-tuples $t'_1$ and $t'_2$ with timestamp arrays $[TS'_1, \ldots, TS'_{i-1}, TS'_{i+1}, \ldots, TS'_{k+1}]$ and $[TS'_i]$ respectively, where the tuple with $TS_i$ is the probing tuple. Using the same reasoning as in the base case, two cases are possible: (1) $TS_i = TS'_i$, in which case the induction hypothesis applied to $t_1$ and $t'_1$ gives some $j \ne i$ with $TS_j < TS'_j$; or (2) $TS_i < TS'_i$.
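The property stated by the lemma can be expressed as a small checker over an output queue. This is a hedged sketch for illustration only: the function name `satisfies_timestamp_order` and the representation of each output tuple as a plain tuple of timestamps are assumptions, not part of the original design.

```python
from itertools import combinations


def satisfies_timestamp_order(queue):
    """Return True if, for every pair where `earlier` precedes `later`
    in the queue, at least one position i has TS_i < TS'_i."""
    return all(
        any(ts < ts_later for ts, ts_later in zip(earlier, later))
        for earlier, later in combinations(queue, 2)  # preserves queue order
    )


# Not sorted by maximum timestamp (5, 3, 4), yet the property holds.
ok_queue = [(1, 5), (2, 3), (4, 4)]
# (3, 3) precedes (1, 2), but no component of (3, 3) is smaller.
bad_queue = [(3, 3), (1, 2)]

holds = satisfies_timestamp_order(ok_queue)      # True
violated = satisfies_timestamp_order(bad_queue)  # False
```

The first queue shows why the lemma is weaker than ordering by maximum timestamp: the property holds for it even though its maxima are out of order.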
The above lemma is used to limit the memory that the union operator needs to sort the joined result. Any tuple whose maximum timestamp in its timestamp array is smaller than the minimum timestamp of the timestamp array
Chapter 16
Experimental Evaluation
In this chapter, we present an experimental study that shows the performance of the PSP model and compares it with other state-of-the-art approaches.