4.7 Experimental Evaluation
4.7.4 Experiments on Synthetic Workload
Periodicity degree vs. utility. We now assess the relationship between the degree of periodicity in event streams and utility in a more systematic way. Figure 4.9a compares Hybrid-I and Hybrid-P over six synthetic streams with varying degrees of periodicity. With no periodicity, Hybrid-P and Hybrid-I yield almost the same utility. As we move from a less periodic stream to a more periodic one, Hybrid-P achieves better utility preservation. In the case of perfect periodicity (periodicity degree = 1), the utility gain of Hybrid-P is almost twice that of Hybrid-I. It is also worth noting that even for streams with a low degree of periodicity, say 0.2, periodicity-based cardinality estimation yields a positive, albeit slight, utility advantage. For the rest of the experiments in this section, we fix the degree of periodicity at 0.4.
Pattern length vs. utility. Next, we vary pattern complexity by changing the pattern length from 2 to 5 and compare the resulting utility of the different algorithms. As Figure 4.9b shows, utility drops as the pattern length increases. This is because longer patterns are more complex and therefore produce fewer matches. Our proposed Hybrid-P achieves only 12% less utility than Hybrid-Opt and outperforms all other approaches in this experiment.
Num. of private patterns vs. utility. In Figure 4.9c we vary the number of private patterns while keeping the public patterns the same. When there are no private patterns, all approaches produce the same utility, since no event suppression is needed. As the number of private patterns increases, the proposed Hybrid-P and Hybrid-I outperform all other alternatives except for the oracle version, Hybrid-Opt.
Num. of patterns vs. throughput. Finally, we conduct a throughput test by increasing the number of synthetic queries. Figure 4.9d shows that the most expensive algorithm, Hybrid-P, attains at least 55% of the throughput of a vanilla system when a large number of queries are present. Its relative efficiency and superior utility make it an appealing approach for CEP privacy. We also note that the Type-level approach achieves only slightly less throughput than the vanilla CEP (No-PP). Considering that Type-level significantly outperforms the Greedy approach in terms of utility, it may also be an attractive alternative when system resources are a concern.
4.8 Related Work
The problem of sequential pattern hiding was recently investigated in [10, 46] for static sequence databases. These approaches operate on a set of sequences that are independent of each other. In our CEP context, by contrast, the single input stream contains a potentially endless sequence of events that are temporally correlated. Their approaches therefore cannot be applied directly to our problem. Nevertheless, we have been inspired by several of their principles, such as minimizing side effects and data distortion [46].
The problem of privacy and security has also been examined in different contexts for streaming data. The authors of [102] consider k-anonymity in streaming data, where the aim is to generalize clusters of tuples into equivalence classes of size at least k. More recently, the authors of [47] consider the problem of filtering sensitive user location data in a stream to preserve privacy. The problem of access control in a stream environment has also been studied [75, 83]. These techniques, while useful for their respective purposes, are not applicable to our problem, as they consider neither CEP pattern query semantics nor the utility optimization problem.
The mechanism of event suppression has been used in relational stream load shedding [39, 100] to sustain the output rate when system resources are limited. However, the notion of privacy preference has not been considered in that literature. Moreover, load shedding for the specific semantics of CEP pattern queries has not been studied to date.
The general idea of combining “global” (type-level) prediction and “local” (instance-level) prediction in our Hybrid algorithm has been applied to problems in completely different contexts, such as predicting the movement of mobile users [65] and predicting branches in the computer architecture literature [101].
In contrast to the relative lack of research on privacy in the context of CEP, there is an extensive body of literature on privacy preservation in relational databases. K-anonymity [86] and its successors (e.g., [62, 68]) have been studied extensively; they recode the original database so that the modified database satisfies structural properties that provide privacy protection. Differential privacy [42] has been proposed more recently as a rigorous alternative. However, both approaches focus on statistical queries over static relational data and are not directly applicable to CEP, which is more concerned with producing pattern matches in real time.
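As a rough illustration of the kind of structural property that k-anonymity imposes on a recoded table, the following Python sketch checks whether every combination of quasi-identifier values occurs in at least k rows. The function name `is_k_anonymous`, the column names, and the example table are hypothetical and are not taken from [86] or from our system; the sketch only checks the property and does not perform the recoding itself.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if each quasi-identifier combination occurs in >= k rows."""
    # Group rows into equivalence classes keyed by their quasi-identifier values.
    class_sizes = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    # The table is k-anonymous when no equivalence class is smaller than k.
    return all(size >= k for size in class_sizes.values())


# Hypothetical table whose quasi-identifiers have already been generalized.
rows = [
    {"age": "20-29", "zip": "016**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "016**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(rows, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(rows, ["age", "zip"], k=3)
```

Here each (age, zip) class contains exactly two rows, so the example table satisfies the property for k = 2 but not for k = 3. The check is defined over a complete, static table, which is one way to see why it does not carry over directly to an unbounded event stream.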
Differential privacy achieves privacy preservation by injecting additive noise into statistical aggregates (e.g., SUM queries). This model, however, does not extend to CEP, because the SEQ queries employed in CEP are concerned with producing individual sequence matches, for which differential privacy is not suitable.
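To make the additive-noise idea concrete, the following is a minimal Python sketch of the standard Laplace mechanism applied to a SUM query. It is not part of our system. The function names (`laplace_noise`, `noisy_sum`), the clipping bounds, and the choice of neighboring databases (differing by the addition or removal of one record) are assumptions made for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_sum(values, lower, upper, epsilon, rng=None):
    """Answer a SUM query with Laplace noise (illustrative sketch).

    Assumption: neighboring databases differ by adding or removing one
    record, so after clipping each value to [lower, upper] the sensitivity
    of the sum is max(|lower|, |upper|).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical usage: the seed is fixed only to make the sketch repeatable.
answer = noisy_sum([3, 7, 5], lower=0, upper=10, epsilon=0.5,
                   rng=random.Random(42))
```

The noise is attached to a single numeric aggregate. A SEQ query, by contrast, returns individual matches (sequences of concrete events), so there is no comparable numeric answer to perturb.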
The problem of PP-CEP also bears some resemblance to online query auditing [57, 73], where the goal is to determine whether answering a SQL query, given previous queries, would disclose private information. While query auditing is mostly about deciding whether answering a query would compromise privacy, PP-CEP focuses on optimizing utility. In addition, the significant differences between the data and query models of these two problems make solutions to query auditing inapplicable to PP-CEP.