
2.3 Event Filtering Algorithms

2.3.2 Applicability of Algorithms

All of the previously mentioned algorithms provide sound solutions to the filtering task, provided their assumptions about application specifics are met. Their individual design goals in their target application setting are to realize a particular space-efficient filtering process, to realize a particular time-efficient filtering process, or to offer a flexible subscription language to their users. The ranges of settings that these algorithms have been developed for vary in their width. Generally the algorithms gain their benefits with respect to filter efficiency or memory usage by exploiting the specifics of those scenarios they have been designed for.

Considering our requirement of a general-purpose algorithm, most approaches become unsuitable with respect to either their memory requirements or their filter efficiency if the application specifics that are exploited in the filtering process do not hold. In the following paragraphs, we analyze these algorithms according to the four identified algorithm categories.

No Predicate Indexing, Individual Subscription Indexing Approaches

As we introduced previously, the Elvin system [SA97, SAB+00] falls into category NP-IS. The following analysis shows that this filtering approach does not constitute a general-purpose solution.

Elvin [SA97, SAB+00]. The filtering approach of Elvin is usually described as a “naïve” (e.g., [LHJ05, PFLS00]) or “brute-force” (e.g., [MFP06, TKD04, YGM99]) solution to the filtering task in the literature. This description stems from the fact that all subscriptions, including their predicates, are individually considered by this approach. Evidently, this method does not lead to high filter efficiency. In particular, the approach does not scale to a growing number of registered subscriptions, and thus does not constitute a generic filtering solution. This limitation of [SA97, SAB+00] with respect to scaling to a large subscription base is also identified in the literature, for example, [CRW01]. However, Elvin internally supports a more general subscription language (with respect to both the supported predicates and their combination) than other systems.
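The individual consideration of every subscription can be illustrated by a minimal sketch. It is not Elvin's implementation; the function name `brute_force_filter`, the representation of a subscription as an arbitrary Boolean function over the event, and the sample attributes are assumptions made purely for illustration.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
# Assumption: a subscription is any Boolean function over the event,
# which mirrors the general subscription language of this category.
Subscription = Callable[[Event], bool]


def brute_force_filter(event: Event, subscriptions: Dict[str, Subscription]) -> List[str]:
    """Evaluate every registered subscription against the event, one by one."""
    return [sid for sid, sub in subscriptions.items() if sub(event)]


# Hypothetical subscriptions: one conjunction and one disjunction.
subscriptions = {
    "s1": lambda e: e.get("symbol") == "ABC" and e.get("price", 0) > 10,
    "s2": lambda e: e.get("symbol") == "XYZ" or e.get("volume", 0) >= 1000,
}

matched = brute_force_filter({"symbol": "ABC", "price": 12, "volume": 50}, subscriptions)
```

The work per message is proportional to the number of registered subscriptions, independent of how many of them are fulfilled, which is the scaling limitation described above.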

Generally the approach of individually analyzing all registered subscriptions is relatively well-suited for scenarios where most subscriptions match most incoming messages. Here nearly all subscriptions need to be fully analyzed by any current filtering approach in order to determine whether they are fulfilled by the incoming message. There is no criterion that would allow a recent algorithm to stop the evaluation of a subscription after its partial analysis. Evidently this narrow application scenario does not fulfill the general-purpose requirement and contradicts typical assumptions about the selectivity of subscriptions (see Section 2.2).

No Predicate Indexing, Shared Subscription Indexing Approaches

We identified three main algorithm representatives in category NP-SS. None of them classifies as a general-purpose filtering solution, because the supported application fields are too narrow and a widening results in a degeneration to a basic filtering approach.

Gough and Smith [GS95]. This tree-based conjunctive filtering algorithm generally leads to high filter efficiency due to the approach of traversing exactly one path in the created subscription index tree for an incoming message. However, this advantage is firstly counteracted by the limited range of operators that is supported by this approach: [GS95] effectively supports equality predicates only. Range tests and set membership tests can be supported by the approach. However, this extension comes at the cost of strongly growing memory requirements. These high memory requirements are also noted as a major restriction of the algorithm in the literature, for example, [ASS+99, FJL+01, RDJ02, WK05].

Additionally, the created subscription index tree cannot effectively handle subscriptions that do not specify all attributes in their predicates. Although such subscriptions can be extended by “don’t-care” predicates (fulfilled by all possible attribute values) and inserted into the index structure, each “don’t-care branch” in a node of the index tree needs to contain a combination of all subtrees that can be reached by the other branches (no “don’t-care” predicates) of that node. This behavior, evidently, leads to a potentially exponential explosion in the size of the index tree [ASS+99].
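The growth caused by unspecified attributes can be made concrete with a small sketch. It uses a deliberately simplified model of a single-path matching tree, not the exact construction of [GS95]: one tree level per attribute, one branch per tested value, and a "*" branch that is followed when no value branch applies. To keep the single-path property in this model, a subscription that does not test an attribute is copied into every branch of the corresponding node. All names (`build`, `match`, `count_leaves`, `example`) are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a subscription maps attribute -> required value (equality only);
# an attribute that is absent from the mapping is a "don't-care".
Sub = Tuple[str, Dict[str, int]]


def build(subs: List[Sub], attrs: List[str], depth: int = 0):
    """Build a tree in which every event follows exactly one path."""
    if depth == len(attrs):
        return [sid for sid, _ in subs]            # leaf: ids of fulfilled subscriptions
    attr = attrs[depth]
    node = {}
    for value in sorted({s[attr] for _, s in subs if attr in s}):
        # Value branch: subscriptions testing attr == value plus all don't-cares.
        node[value] = build([(i, s) for i, s in subs if s.get(attr, value) == value],
                            attrs, depth + 1)
    dont_care = [(i, s) for i, s in subs if attr not in s]
    if dont_care:
        node["*"] = build(dont_care, attrs, depth + 1)
    return node


def match(tree, event: Dict[str, int], attrs: List[str]) -> List[str]:
    """Follow a single path from the root to a leaf."""
    node = tree
    for attr in attrs:
        value = event.get(attr)
        node = node[value] if value in node else node.get("*")
        if node is None:
            return []
    return node


def count_leaves(node) -> int:
    if isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node.values())


def example(n: int):
    """n subscriptions over n attributes; subscription i only tests attribute i."""
    attrs = [f"a{i}" for i in range(n)]
    return [(f"s{i}", {f"a{i}": 1}) for i in range(n)], attrs


leaf_counts = [count_leaves(build(*example(n))) for n in range(1, 6)]
```

In this model, `leaf_counts` evaluates to `[1, 3, 7, 15, 31]`: the number of leaves roughly doubles with each added subscription, although every subscription contains only a single predicate.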

Another restricting attribute is that the index tree requires costly preprocessing to handle registrations and deregistrations of subscriptions. [GS95] thus presents a static solution to the filtering task, which has been identified as another shortcoming of the approach in the literature, for example, [FJL+01, MFP06, RDJ02, WK05]. On this basis, we conclude that the algorithm does not classify as a generic solution to the filtering task.

Aguilera and colleagues [ASS+99]. Aguilera and colleagues [ASS+99] present another tree-based filtering approach for conjunctive subscriptions. It aims at solving some of the problems of [GS95] with respect to memory requirements. However, this attempt comes at the cost of time efficiency because it cannot filter messages by following one path in the subscription index tree anymore. Nevertheless, the approach [ASS+99] is still characterized as too memory consuming, for example, [FJL+01, RDJ02, WK05].

Aguilera and colleagues focus on supporting equality predicates [MFP06] in subscriptions; they present some optimizations for this restricted setting to reduce the size of the tree. The overall approach in [ASS+99] might generally be applicable to operators other than equality as well. However, in this case, the subscription index tree increases in both height and width, sharing the problems of [GS95].

Predicates are not indexed in this approach. Subscriptions can only be shared in a branch of the index tree if all predicates of these subscriptions are the same from a particular point onwards. Furthermore, the order of attributes and operators needs to be predefined, which is too strong an assumption for a general-purpose solution. Because various branches of the created index tree need to be evaluated in the filtering process and predicates are not indexed, the approach in general degrades in efficiency for general settings.

Moreover, the insertion process for newly registered subscriptions is highly inefficient if the optimizations that are required to avoid an explosion in tree size are applied: nearly the whole tree might have to be analyzed for insertion, making it a static filtering solution in practice. This limitation is identified in the literature, for example, [FJL+01, RDJ02, WK05].

We therefore conclude that [ASS+99] is not a suitable approach for general applications.

Campailla and colleagues [CCC+01]. Next to Elvin [SA97, SAB+00], this approach is the only filtering algorithm that supports general Boolean subscriptions. Its main idea is to represent subscriptions by an ordered binary decision diagram (BDD) [Bry86].

In the filtering algorithm, under all circumstances, all registered subscriptions need to be fully evaluated for each incoming message. This is because the subscription index is evaluated backwards, from the terminal nodes (no children) to the output nodes (no parents) of the BDD. This attribute only makes it a feasible solution if the created BDDs for subscriptions represent graphs with highly equivalent subgraphs in their lower parts. Otherwise, the algorithm degrades to the basic filtering approach, not fulfilling our (and its own) requirements. Additionally, predicates are not indexed, leading to a costly predicate evaluation process in general.

It is hence not only an assumption that subscriptions are highly similar with respect to both their predicates and the combination of these predicates. Moreover, it is a requirement that these redundancies can be exploited in the created subscription index. As we will demonstrate later, the experiments in [CCC+01] show that the presented approach does not fulfill this goal even if its strong redundancy assumption is met.

With respect to memory requirements, the size of a BDD, and thus the size of the subscription index, may become exponential [CCC+01, MFP06] (general Boolean subscriptions are supported). The size in practice strongly depends on the ordering of variables (predicates of subscription). The determination of an optimal order is an NP-hard problem [BW96]. [CCC+01] does not consider the ordering of variables. Instead, the approach requires a given, fixed order of variables. If this order needs to change (e.g., because new subscriptions have been registered), all subscriptions have to be re-indexed by the approach, making it impractical to adapt to the current subscription set. The approach outlined in [CCC+01] thus also constitutes a restricted filtering solution in this respect. Furthermore, [CCC+01] leaves open the question of how a newly registered subscription is inserted into a BDD, this being one of the key points for the construction and the size of the subscription index.

Nevertheless, let us assume the restricted application setting that suits this approach. The experiments in [CCC+01] assume a total of only 208 distinct predicates within all registered subscriptions. The experimental evaluation reveals that even such highly similar subscriptions already lead to linearly growing sizes of the subscription index with an increasing subscription number, and thus to linearly increasing filtering times. If these specialized experiments show a linear increase in the size of the index structure, more general application settings are expected to lead to index sizes, and thus filtering times, that grow exponentially with the number of registered subscriptions (general Boolean subscriptions are supported).

Moreover, in the dynamic version of the system that does not require an iterative (and thus costly) index minimization on a per-subscription basis, the number of nodes in the created subscription index is only marginally less than the average number of predicates per subscription (7.09 nodes compared to 7.6 predicates per subscription on average). Thus, even in experiments with highly redundant subscriptions, [CCC+01] cannot exploit existing redundancies among subscriptions.

We conclude that [CCC+01] leaves open too many fundamental questions, does not address our general-purpose requirement, and cannot even exploit the redundancy among highly common subscriptions.

One-Dimensional Predicate Indexing, Shared Subscription Indexing Approaches

Only one main algorithm falls into category OP-SS, [LHJ05]. As we demonstrate, this approach is also too restricted in its applicability to be considered a general-purpose filtering algorithm.

Li and colleagues [LHJ05]. This filtering algorithm, sketched in [LHJ05], was proposed concurrently to our work (Chapter 4); it uses modified binary decision diagrams (MBDs) [JT92] as subscription index structure. In contrast to the original BDD approach presented in [CCC+01], [LHJ05] is restricted to conjunctive subscriptions. Another difference is that the registered subscriptions are represented by a set of MBDs, that is, several indexes represent registered subscriptions. As an extension of [CCC+01], [LHJ05] applies one-dimensional predicate index structures, resulting in its classification as OP. Despite these differences, [LHJ05] shares the problems and limitations of [CCC+01] that have been described in the previous subsection: For each in-

Even though it can be decided whether predicates are fulfilled by incoming messages by consulting the predicate index structures, all MBDs still have to be completely evaluated to determine fulfilled subscriptions. This requirement of the full evaluation of all MBDs for each message is a substantial drawback of [LHJ05] (as well as [CCC+01]): the complexity of filtering any message in [LHJ05] directly corresponds to the size of the subscription index structure. Like the original BDD solution, [LHJ05] thus degrades to the basic filtering approach if the presumed high redundancy among subscriptions is not given or the created subscription index cannot exploit the existing redundancies.

One of the open points of the original BDD approach [CCC+01] is how newly registered subscriptions are inserted into the existing subscription index structure. [LHJ05] tries to exploit its restriction to conjunctive forms to decide whether a newly registered subscription is integrated into an existing MBD or is inserted as a new MBD in the subscription index. The presented insertion approach, however, depends on a given fixed order of variables. Thus, the algorithm still (as the original approach) requires the impractical re-indexing of all subscriptions if the globally chosen order becomes suboptimal. Therefore, the suitability of [LHJ05] for non-static environments with potentially changing characteristics of subscriptions is not given.

Generally the method of ordering variables that is proposed in [LHJ05] is not applicable to general settings with non-extreme predicate redundancy. It is even inapplicable if only a marginal proportion of predicates is not shared among most subscriptions. The reason for this property is that MBDs in [LHJ05] can only be shared by those subscriptions that specify the same second predicate according to the given attribute order. However, the ordering required by [LHJ05] starts with the least common predicates because redundancies among subscriptions can only be exploited at the bottom of MBDs. The sharing of MBDs thus breaks down as soon as subscriptions do not contain highly common predicates only. It is noteworthy that even in the best possible case only those parts of subscriptions that contain exactly the same predicates from a particular point onwards (in the assumed, fixed predicate order) can be shared.

Thus, as with the original BDD approach, the solution in [LHJ05] does not constitute a general-purpose filtering algorithm. Additionally, only excerpts of the algorithm are briefly sketched in [LHJ05]; the work mainly focuses on event routing optimizations. [LHJ05] does not even investigate the size of the created MBDs. There is no reason why the arbitrary selection of the second predicate (in a fixed order) for sharing MBDs should lead to better results than [CCC+01]. The settings analyzed in [LHJ05] are restricted and contain highly redundant predicates: One data set contains 2,000 distinct predicates; another data set contains 5,000 distinct predicates.

One-Dimensional Predicate Indexing, Individual Subscription Indexing Approaches

For category OP-IS, we named two filtering algorithms in Section 2.3.1. Only one of them, the counting approach, constitutes a general-purpose solution to the filtering task, as described in the following paragraphs.

Cluster algorithm [HCH+99, FJL+01]. This conjunctive filtering algorithm applies one-dimensional predicate indexes for efficiency reasons. The approach is presented in detail by Fabret and colleagues [FJL+01] and is based on a proposal by Hanson and colleagues [HCH+99]. Its general idea is to cluster sets of subscriptions. However, as we describe later on, the criterion required for an effective clustering disqualifies [FJL+01] as a general-purpose solution.

With respect to efficient filtering, [FJL+01] proposes to cluster subscriptions in such a way that for each incoming event message only a minimal number of clusters (preferably one cluster) can contain fulfilled subscriptions. Hence, the clusters that are determined for an incoming message usually include both fulfilled and unfulfilled subscriptions. In order to derive the set of fulfilled subscriptions from each cluster, all subscriptions in this cluster are evaluated by the filtering algorithm. In combination with the applied predicate indexes, it is sufficient to analyze whether all predicates of each subscription are fulfilled.
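The clustering idea can be sketched as follows. This is a simplified illustration rather than the data structures of [FJL+01]: subscriptions are clustered on a single equality predicate, the remaining predicates are evaluated directly instead of through predicate indexes, and the class `ClusterIndex`, the triple format `(attribute, operator, constant)`, and the sample data are assumptions.

```python
import operator
from collections import defaultdict

OPS = {"==": operator.eq, "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}


class ClusterIndex:
    """Conjunctive subscriptions clustered by one equality predicate each."""

    def __init__(self):
        # (attribute, value) -> list of (subscription id, remaining predicates)
        self.clusters = defaultdict(list)

    def add(self, sid, predicates):
        # Assumption: the first equality predicate serves as the clustering key.
        key = next((p for p in predicates if p[1] == "=="), None)
        if key is None:
            raise ValueError(f"{sid}: no equality predicate to cluster on")
        rest = [p for p in predicates if p is not key]
        self.clusters[(key[0], key[2])].append((sid, rest))

    def match(self, event):
        matched = []
        for attr, value in event.items():
            # Only clusters whose key is fulfilled by the event are visited,
            # but every subscription inside such a cluster is evaluated.
            for sid, rest in self.clusters.get((attr, value), []):
                if all(a in event and OPS[op](event[a], c) for a, op, c in rest):
                    matched.append(sid)
        return matched


index = ClusterIndex()
index.add("s1", [("symbol", "==", "ABC"), ("price", "<", 20)])
index.add("s2", [("symbol", "==", "ABC"), ("price", ">", 50)])
index.add("s3", [("symbol", "==", "XYZ")])

matched = index.match({"symbol": "ABC", "price": 12})
```

For the sample event only the cluster for `symbol == "ABC"` is visited, and both of its subscriptions are evaluated although only `s1` is fulfilled. A subscription without any equality predicate, such as `[("price", ">", 5)]`, cannot be registered in this sketch at all.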

[FJL+01] uses the notion of access predicate to refer to a predicate (or a set of predicates) that is used for clustering. The approach considers equality predicates as access predicates [MFP06]⁷. This assumption on its own already disqualifies [FJL+01] as a general-purpose solution because it is not applicable in other scenarios at all. Furthermore, “intricate schemes” are required by the cluster algorithm to determine access predicates, as admitted by one of its authors [AJL02].

⁷ [FJL+01] states that a further property of access predicates is that they are required to be fulfilled in a fulfilled subscription. Evidently, this is the case for all predicates because [FJL+01] is restricted to conjunctive subscriptions.

But even assuming that there is at least one equality predicate per subscription, [FJL+01] only achieves an appropriate filter efficiency if the majority of predicates in subscriptions are equality predicates. Only if this strong assumption holds does the clustering envisaged by [FJL+01] become possible. Furthermore, the algorithm requires subscriptions to contain the same overall number of predicates to be able to cluster them together.

These problems let us conclude that [FJL+01] only constitutes an appropriate filtering solution in highly limited application settings.

Counting algorithm [AJL02, YGM94]. The counting algorithm is a filtering approach for conjunctive subscriptions that balances memory usage and filter efficiency, and fulfills the requirement of its applicability in a wide range of settings. We give a technical description of the counting approach in Section 2.3.3. In the following paragraphs, we demonstrate its broad idea and the resulting suitability for various scenarios.

The overall idea of the counting algorithm is to count the number of fulfilled predicates per subscription in the filtering process. The counting of predicates is based on one-dimensional predicate indexes, allowing for the determination of all predicates that are fulfilled by an incoming message. Having counted the number of fulfilled predicates per subscription, all those subscriptions whose counter equals their overall number of predicates are fulfilled.
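The counting idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the technical description of [AJL02, YGM94]: predicates are triples `(attribute, operator, constant)` restricted to `==`, `<`, and `>`; a hash lookup serves as the equality index and sorted lists of constants serve as the one-dimensional indexes for range predicates. The class `CountingFilter` and all sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort
from collections import Counter, defaultdict


class CountingFilter:
    def __init__(self):
        self.pred_subs = defaultdict(list)  # distinct predicate -> subscription ids
        self.required = {}                  # subscription id -> number of predicates
        self.lt = defaultdict(list)         # attribute -> sorted constants of "attr < c"
        self.gt = defaultdict(list)         # attribute -> sorted constants of "attr > c"

    def add(self, sid, predicates):
        preds = set(predicates)
        if any(op not in ("==", "<", ">") for _, op, _ in preds):
            raise ValueError("unsupported operator in this sketch")
        self.required[sid] = len(preds)
        for pred in preds:
            attr, op, const = pred
            if pred not in self.pred_subs:  # index each distinct predicate once
                if op == "<":
                    insort(self.lt[attr], const)
                elif op == ">":
                    insort(self.gt[attr], const)
            self.pred_subs[pred].append(sid)

    def fulfilled_predicates(self, event):
        """Use the per-attribute indexes to find all predicates the event fulfills."""
        for attr, value in event.items():
            if (attr, "==", value) in self.pred_subs:
                yield (attr, "==", value)
            consts = self.lt.get(attr, [])
            for c in consts[bisect_right(consts, value):]:  # constants with value < c
                yield (attr, "<", c)
            consts = self.gt.get(attr, [])
            for c in consts[:bisect_left(consts, value)]:   # constants with value > c
                yield (attr, ">", c)

    def match(self, event):
        counters = Counter()
        for pred in self.fulfilled_predicates(event):
            for sid in self.pred_subs[pred]:
                counters[sid] += 1
        # A subscription is fulfilled if all of its predicates were counted.
        return sorted(sid for sid, n in counters.items() if n == self.required[sid])


f = CountingFilter()
f.add("s1", [("symbol", "==", "ABC"), ("price", "<", 20)])
f.add("s2", [("symbol", "==", "ABC"), ("price", ">", 50)])
f.add("s3", [("price", ">", 5)])

matched = f.match({"symbol": "ABC", "price": 12})
```

For the sample event, `s1` reaches a count of 2 out of 2 and `s3` a count of 1 out of 1, whereas `s2` stops at 1 out of 2, so `matched` is `["s1", "s3"]`. Registering a subscription only touches its own predicates and counters, and the structure does not depend on predicates being shared or on equality predicates being present.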

Analyzing this approach to filtering, we firstly realize that the counting algorithm is independent of the redundancy among predicates. Secondly, it is applicable regardless of the similarity among subscriptions. Thirdly, [AJL02, YGM94] does not depend on the use of particular attribute filters in predicates (considering their effect on subscription indexing). On the contrary, the approach shows comparable filter efficiency and memory requirements for a wide range of settings independently of the previously mentioned parameters. Fourthly, the created subscription index is highly flexible with respect to both registrations and deregistrations, and changing subscription characteristics. Altogether these characteristics make the counting approach a general-purpose filtering solution.

Evidently, [AJL02, YGM94] does not represent the most time-efficient and the most space-efficient filtering solution in those specialized settings that are (either exclusively or primarily) targeted by the previously analyzed filtering solutions. However, as identified, the counting approach represents a general-purpose solution that, firstly, can be applied to the full range of settings. Secondly, its time efficiency properties do not degrade to a basic filtering approach if the exploited parameter setting does not hold. Thirdly, its memory requirements remain stable over all settings and do not grow excessively for general scenarios.

Summary

Our analysis in this section led to two main findings:

1. There exist only basic filtering algorithms for general Boolean subscriptions.

2. All conjunctive filtering algorithms except one have too strong requirements on the supported application scenario to classify as a general-purpose solution.

With respect to Finding 1, there are two existing approaches for the filtering of general Boolean subscriptions. Neither applies predicate index structures, so each predicate has to be considered individually. Elvin [SA97, SAB+00] constitutes the basic filtering algorithm, additionally requiring the consideration of each individual subscription in the filtering process. Campailla and colleagues [CCC+01], on the other hand, cannot adapt to changes in the subscription base. Although the approach indexes subscriptions, all subscriptions need to be fully evaluated in the filtering process. Already for scenarios with highly similar subscriptions, [CCC+01] cannot exploit existing redundancies, de facto leading to the same problem size as in the basic filtering approach.

With respect to Finding 2, most existing conjunctive filtering algorithms are designed to exploit particular patterns of “niche” application scenarios. Some of these algorithms cannot be applied to more general settings at all. Other algorithms are generally applicable to a broader range of scenarios. However, either their internal filtering process becomes the basic approach in this case, or the memory requirements of the algorithm explode exponentially.

In document General Boolean Expressions in Publish-Subscribe Systems (Page 68-77)