To complete our state-of-the-art survey, we now discuss some hybrid approaches that attempt to combine the different styles of querying complex events.
3.6.1
Pattern Matching in Data Stream Query Languages
The original model of CQL provides only stream-to-relation, relation-to-relation, and relation-to-stream operators. Pattern matching on streams can be added to this model as a stream-to-stream operator. Such an operator makes it possible to specify and match patterns in data streams that resemble regular expressions (as used for pattern matching in strings). These patterns have some similarities with queries formed using event composition operators, which is why we classify pattern matching in data stream query languages as a hybrid approach here.
Such operators are particularly useful in processing market data such as stock ticks, where it is often important to recognize certain shapes in the graph of price ticks. For example, one might look for situations where a price falls over a span of several price ticks and then rises, falls again, and rises again, typically with some thresholds and relations between the minimum and maximum prices. In terms of the corresponding price graph, this query looks for a “W”-shape. Such a query can be written as a regular expression F+R+F+R+, where F stands for a falling price event (i.e., the previous price tick was higher than the current one) and R for a rising price event (i.e., the previous price tick was lower than the current one).
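The W-shape query can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: it simply encodes each pair of consecutive price ticks as a symbol and reuses ordinary string regular expressions. The function names, the "-" symbol for unchanged prices, and the sample prices are assumptions made for this illustration, and the thresholds mentioned above are left out.

```python
import re

def tick_directions(prices):
    """Encode each pair of consecutive ticks as 'F' (fall), 'R' (rise), or '-' (unchanged)."""
    symbols = []
    for previous, current in zip(prices, prices[1:]):
        if current < previous:
            symbols.append("F")
        elif current > previous:
            symbols.append("R")
        else:
            symbols.append("-")
    return "".join(symbols)

# The W-shape from the text: fall, rise, fall, rise (each one or more ticks long).
W_SHAPE = re.compile(r"F+R+F+R+")

def find_w_shapes(prices):
    """Return (first, last) index pairs into `prices` for each non-overlapping W-shape.

    Symbol i describes the move from prices[i] to prices[i + 1], so a match
    over symbols [s, e) covers the ticks prices[s] .. prices[e].
    """
    directions = tick_directions(prices)
    return [(m.start(), m.end()) for m in W_SHAPE.finditer(directions)]

# Hypothetical price ticks forming one W-shape.
prices = [10, 9, 8, 9, 10, 9, 8, 7, 9, 11]
print(tick_directions(prices))
print(find_w_shapes(prices))
```

A stream-to-stream operator would evaluate such a pattern incrementally as ticks arrive; the sketch works on a finished list only to keep the correspondence with string regular expressions visible.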
Such pattern matching has recently been added to Oracle’s CEP product [Ora]. The idea of such pattern matching operations, as well as efficient techniques for evaluating them, was considered earlier in [SZZA01, SZZA04]. Esper also provides some more restricted pattern matching operations on streams [Esp].
3.6.2
Composition Operators on Top of Data Stream Queries
Another approach for combining the capabilities of composition operators with the capabilities of a data stream query language is described in [GAC06, CA08]. Incoming data streams are first processed with a data stream query language, and then the output events are further processed with a composition-operator-based language to detect the complex events desired by the application. Note that this work does not propose an integrated language that has the capabilities of both composition operators and data stream queries; rather, it proposes an architectural model for using both within a single application and thus combining their respective strengths.
| | Composition operators | Data stream lang. | Production rules |
|---|---|---|---|
| Temporal aspects | ++ | 0 (temporal windows) | − (via host prog. lang.) |
| Negation | + | 0 (inconvenient in SQL) | − (temp. aspects) |
| Aggregation | −− | ++ | − (temp. aspects) |
| Consumption and selection | ++ | −− | 0 (manually control) |
| Facts and States | − (C-part of ECA rules) | + | ++ |
| Formal Semantics | 0 (data not considered) | + (precise but unintuitive) | − (essentially imperative programming) |
| Ease-of-Use, Learning Curve | + (misinterpretations of operators possible) | 0 (conversion between streams and relations) | − (e.g., manual garbage collection) |
| Occurrence and detection time | + | − (depends on relation-to-stream op.) | −− (left to programmer) |
| Extensibility, flexibility | − | + (user-defined functions and windows) | ++ (host prog. lang.) |
| Data model: XML support | −− (with exceptions) | + (via SQL/XML) | − (conversion to objects) |
| Integration | Active database or stand-alone | Database; sometimes prog. lang. | Prog. lang. (e.g., Java) |
| Implementations | 0 (mainly prototypes) | ++ (highly scalable) | + (scalability issues) |
| Development tools | − | − | 0 (not tailored for CEP) |

Figure 3.7: Summary of the comparison between composition operators, data stream languages, and production rules for querying complex events
In comparison to the previous hybrid approach of adding pattern matching to a data stream query language, this approach can be thought of as working the other way round: it first applies data stream queries and then the pattern matching through composition operators, whereas the previously discussed approach first applies pattern matching and then data stream queries.
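The two-stage architecture can be sketched in Python as two chained generators. This is only an illustration of the layering, not the design of [GAC06, CA08]: the sliding-window average, the followed-by operator, the sensor names, and the thresholds are all assumptions chosen for the example.

```python
from collections import deque

def windowed_average_alerts(readings, window_size, thresholds):
    """Stage 1 (data stream query): emit (timestamp, sensor) whenever the average of
    the last `window_size` readings of a sensor exceeds that sensor's threshold."""
    windows = {}
    for timestamp, sensor, value in readings:
        window = windows.setdefault(sensor, deque(maxlen=window_size))
        window.append(value)
        if len(window) == window_size and sum(window) / window_size > thresholds[sensor]:
            yield (timestamp, sensor)

def followed_by(events, first, second, max_gap):
    """Stage 2 (composition operator): emit (t1, t2) for every `first` event at t1
    that is followed by a `second` event at t2 with t2 - t1 <= max_gap."""
    pending = []  # timestamps of `first` events that can still find a partner
    for timestamp, kind in events:
        pending = [t for t in pending if timestamp - t <= max_gap]
        if kind == first:
            pending.append(timestamp)
        elif kind == second:
            for t in pending:
                yield (t, timestamp)

# Hypothetical input stream of (timestamp, sensor, value) readings.
READINGS = [(1, "smoke", 2), (2, "smoke", 8), (3, "smoke", 9),
            (4, "temp", 50), (5, "temp", 90), (6, "temp", 95)]
THRESHOLDS = {"smoke": 5, "temp": 80}

alerts = list(windowed_average_alerts(READINGS, 2, THRESHOLDS))
complex_events = list(followed_by(alerts, "smoke", "temp", max_gap=5))
print(alerts)
print(complex_events)
```

The second stage sees only the events produced by the first, which mirrors the point made above that the two languages are combined architecturally rather than integrated into one language.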
3.6.3
Event Composition Operators in Production Rules
Work in [WBG08] aims at adding event composition operators to a production rule system based on the Rete algorithm. Events occur over time intervals, and the supported composition operators are based on Allen’s interval relations. So far, the work offers no operators for negation, aggregation, counting, or similar advanced queries. The composition operators are implemented as beta-nodes in the Rete network.
Additionally, the composition operators support metric temporal constraints that place limitations on the distance between the start or end points of the occurrence time intervals. These limitations can be upper bounds or lower bounds on the distance as well as exact specifications. The metric temporal constraints are used to enable an automatic garbage collection of events that become irrelevant (cf. also Chapter 15).
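How an upper bound on temporal distance enables garbage collection can be illustrated with a small Python sketch. It is not the implementation of [WBG08]: the class, its method names, and the event names are hypothetical, only Allen’s "before" relation is covered, and the sketch assumes that right-hand events arrive in order of non-decreasing start time.

```python
from collections import namedtuple

# An event occurring over the closed time interval [start, end].
Event = namedtuple("Event", ["name", "start", "end"])

def before(a, b):
    """Allen's 'before' relation: a ends strictly before b starts."""
    return a.end < b.start

class BeforeWithin:
    """Joins left events with right events such that left is before right and
    right.start - left.end <= max_distance (an upper-bound metric constraint)."""

    def __init__(self, max_distance):
        self.max_distance = max_distance
        self.stored_left = []

    def on_left(self, event):
        self.stored_left.append(event)

    def on_right(self, event):
        # Garbage collection: under the arrival-order assumption, a left event
        # that is already too far in the past can never match a later right event.
        self.stored_left = [left for left in self.stored_left
                            if event.start - left.end <= self.max_distance]
        return [(left, event) for left in self.stored_left if before(left, event)]

op = BeforeWithin(max_distance=10)
op.on_left(Event("order", 0, 2))
op.on_left(Event("order", 20, 21))
matches = op.on_right(Event("shipment", 25, 30))
print(matches)
print(len(op.stored_left))
```

Without the metric constraint, the first order event would have to be kept indefinitely, since a matching shipment could still arrive at any later time.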
Chapter 4
Background: Xcerpt and XChange
The event query language XChangeEQ, which is developed in this thesis and will be presented in the following chapters, caters for specifics of Web data and reactivity on the Web. To this end, it builds upon two existing projects: Xcerpt, a rule-based Web query language, and XChange, a reactive rule language for the Web. This chapter gives an introduction to these two languages, as far as necessary for understanding XChangeEQ.
4.1
Xcerpt: Querying and Reasoning on the Web
Xcerpt [SB04, Sch04] is a declarative, rule-based query language for Web data as well as other kinds of semi-structured data. It is used in the event query language XChangeEQ for querying data in simple events that are transmitted as XML messages.
4.1.1
Distinctive Features
Xcerpt has a number of features that distinguish it from the current standard Web query languages XSLT, XQuery, and SPARQL, which have been introduced briefly in Chapter 2.2.1. Some of these features also make Xcerpt particularly suitable as a basis for an event query language.
Pattern-Based Approach. Queries in Xcerpt are specified as patterns for the data from which interesting portions are to be extracted. Similarly, data that is to be newly constructed as the result of a query is also specified by patterns. The patterns closely resemble data and can be thought of as forms or templates for the data.
This pattern-based approach, where queries are in close correspondence to data, gives rise to a language that is fairly intuitive and easy to use, with a human-friendly syntax. Queries can be written by copying and pasting fragments of example data for input and result and successively modifying these fragments into patterns that either extract data (in the case of an example for the input) or construct new data (in the case of an example for the result). The patterns also give rise to a visual language called visXcerpt [BBS03, BBSW03] that realizes the vision of a close correspondence between visual and textual syntax.
For querying simple events that are received as XML messages, the pattern-based approach has a salient advantage, because querying simple events is actually a twofold task: one has to (1) specify a class of relevant events (e.g., all order events; this corresponds to the event type, cf. Chapter 3) and (2) extract data from the events (e.g., the customer name and item number). As we will see, the patterns of Xcerpt serve both purposes well since they both describe the structure of data and bind variables.
Separation of Extraction and Construction of Data. Xcerpt clearly distinguishes and separates patterns that access existing data to extract relevant portions of it (so-called query terms) and patterns that construct new data (so-called construct terms). In contrast, XQuery and XSLT mix and nest the extraction of data (e.g., for or let statements in XQuery) and the construction of new data (e.g., return statement in XQuery).
For querying events, a separation of the intrinsic query (where data is only extracted without constructing new data; also sometimes called the “query proper”) and the construction proves beneficial. It allows an author to first focus on the events that are to be detected over time together with their data and relationships, and then separately on the result that should be generated upon detection.
Rules and Reasoning. Query (proper) and construction are brought together in deductive rules; the rule body (“if”-part or antecedent) contains a query, the rule head (“then”-part or consequent) contains a construction. When a rule is applied, we conceptually first evaluate the query in its body. If this is successful, i.e., the patterns specified in the query can be matched to existing data, then new data is constructed according to the specification in the head. Information flows from the rule body to the head in the form of variable bindings.
Such rules give rise to deductive reasoning (see also Chapter 2.2.2): a rule can query results constructed by other rules (including itself) and construct new results from them. They also provide an abstraction mechanism and are convenient for mediating data of different schemas.
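The idea of rules that query the results of other rules, including their own, can be made concrete with a small Python sketch. It uses naive forward chaining to a fixpoint over flat tuples, which is an assumption made purely for illustration and is not meant to reflect how Xcerpt evaluates rule programs; the "connection" facts and both rule functions are hypothetical.

```python
def direct(facts):
    """Rule 1: every flight from a to b yields a connection from a to b."""
    return {("connection", a, b) for (kind, a, b) in facts if kind == "flight"}

def transitive(facts):
    """Rule 2 (recursive): a connection from a to b followed by a flight from b to c
    yields a connection from a to c. It queries results constructed by rules 1 and 2."""
    return {("connection", a, c)
            for (k1, a, b) in facts if k1 == "connection"
            for (k2, b2, c) in facts if k2 == "flight" and b2 == b}

def forward_chain(facts, rules):
    """Apply all rules repeatedly until no rule derives anything new."""
    facts = set(facts)
    while True:
        derived = set().union(*(rule(facts) for rule in rules)) - facts
        if not derived:
            return facts
        facts |= derived

# Base facts: (kind, origin, destination) for three flights.
FLIGHTS = {("flight", "FRA", "IAD"), ("flight", "MUC", "FCO"), ("flight", "FCO", "MUC")}

closure = forward_chain(FLIGHTS, [direct, transitive])
connections = {(a, b) for (kind, a, b) in closure if kind == "connection"}
print(sorted(connections))
```

In each rule, the comprehension's filter plays the role of the rule body (the query) and the constructed tuple plays the role of the rule head, with the loop variables carrying the bindings from one to the other.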
Rules can be argued to be as important for events as they are for regular, non-event data. It is therefore conceptually convenient to base XChangeEQ on an existing rule language, even though rules about events require some significant changes to the approaches used for rules about regular, non-event data.
Versatility. Xcerpt aims at being versatile with respect to data formats and models, allowing data in different formats to be accessed and constructed even within a single query [BFB+05]. In particular, it aims at making it easy to query both XML and RDF data, as needed for example for querying both XML documents and RDF metadata that is associated with the documents (e.g., through GRDDL [Con07, BFHL07]). In contrast, most existing query languages for Web data support only a single data format and model (e.g., XML for XQuery and XSLT, RDF for SPARQL).
Events are often used to signal changes in some data source (e.g., insertion, deletion) and contain fragments of the changed data. Accordingly, versatility can be argued to be as important for event queries on the Web as it is for regular, non-event Web queries.
4.1.2
Data Terms
XML and other Web data is represented in Xcerpt in a term syntax that is arguably more concise and readable than the original formats, in particular when considering also query terms and construct terms (Sections 4.1.3 and 4.1.4). The term syntax also provides two features that are not found in XML: First, child elements in XML are always ordered. Xcerpt allows children to be specified as either ordered or unordered, the latter bringing no added expressivity to the data format but being interesting for efficient storage based on reordering elements and for avoiding incorrect queries that attempt to make use of an order that should not exist. Second, the data model of XML is that of a tree. Xcerpt is more general, supporting rooted graphs, which is necessary to transparently resolve links in XML documents (specified, e.g., with IDREFs [B+06a, B+06b] or with XLink [DMO01, BE05b, BE05a]) and to support graph-based data formats such as RDF.
Figure 4.1(a) shows an Xcerpt data term for representing information about flights; its structure and contained information correspond to the XML document shown in Figure 4.1(b). A data term is essentially a pre-order linearization of the document tree of an XML document. The element name, or label, of the root element is written first, followed by the linearizations of its children as subterms, separated by commas and surrounded by square brackets or curly braces. Square brackets [ ] indicate that the order of the children is relevant and must be preserved. Curly braces { } indicate that the order of children is irrelevant. In the example of Figure 4.1(a), the order of the flight children of the flights element is indicated as relevant, whereas the order of the children of the flight elements is not.
    flights [
      flight { number { "UA917" }, from { "FRA" }, to { "IAD" } },
      flight { number { "LH3862" }, from { "MUC" }, to { "FCO" } },
      flight { number { "LH3863" }, from { "FCO" }, to { "MUC" } }
    ]
(a) Data term

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <flights>
      <flight>
        <number>UA917</number>
        <from>FRA</from>
        <to>IAD</to>
      </flight>
      <flight>
        <number>LH3862</number>
        <from>MUC</from>
        <to>FCO</to>
      </flight>
      <flight>
        <number>LH3863</number>
        <from>FCO</from>
        <to>MUC</to>
      </flight>
    </flights>
(b) XML document

Figure 4.1: An Xcerpt data term and its corresponding XML document
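The pre-order linearization and the difference between ordered and unordered children can be sketched in Python. The Term class and the linearize function are hypothetical helpers for this illustration and are not part of any Xcerpt implementation; graph edges, attributes, and namespaces are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Term:
    label: str
    children: List[Union["Term", str]] = field(default_factory=list)
    ordered: bool = False  # True: square brackets, False: curly braces

def linearize(term, canonical=False):
    """Pre-order linearization in the style of Figure 4.1(a).

    With canonical=True, unordered children are sorted, so that two terms that
    differ only in the order of unordered children produce the same string."""
    if isinstance(term, str):
        return '"' + term + '"'
    parts = [linearize(child, canonical) for child in term.children]
    if canonical and not term.ordered:
        parts.sort()
    opening, closing = ("[", "]") if term.ordered else ("{", "}")
    return f"{term.label} {opening} {', '.join(parts)} {closing}"

def flight(number, origin, destination):
    return Term("flight", [Term("number", [number]),
                           Term("from", [origin]),
                           Term("to", [destination])])

flights = Term("flights", [flight("UA917", "FRA", "IAD"),
                           flight("LH3862", "MUC", "FCO"),
                           flight("LH3863", "FCO", "MUC")], ordered=True)
print(linearize(flights))

# The same flight with its (unordered) children written in a different order.
reordered = Term("flight", [Term("to", ["IAD"]),
                            Term("number", ["UA917"]),
                            Term("from", ["FRA"])])
print(linearize(reordered, canonical=True) == linearize(flight("UA917", "FRA", "IAD"), canonical=True))
```

Sorting unordered children into a canonical order is one simple way to exploit the freedom of reordering mentioned above; the ordered flights list is deliberately left untouched by it.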
    flights {{
      flight {{
        to { var D },
        from { "MUC" }
      }},
    }}
(a) Destinations from MUC

    flights {{
      desc number { var N }
    }}
(b) All flight numbers

    flights {{
      var F -> flight {{
        number {{ var N }}
        without to {{ "MUC" }}
      }}
    }}
(c) Flights not going to MUC

Figure 4.2: Examples of Xcerpt query terms
The data term syntax of Xcerpt also accommodates graph edges beyond the tree structure, entities other than element and text nodes (e.g., attributes), namespaces, etc. However, for understanding XChangeEQ in the scope of this thesis, these features are not necessary, and we therefore refer to [SB04, Sch04] for more details.
4.1.3
Query Terms
A query term describes a pattern for data terms; when the pattern matches, it yields (a set of) bindings for the variables in the query term. Variable bindings are also called substitutions, and sets thereof substitution sets. The syntax of query terms resembles the syntax of data terms and extends it to accommodate variables, incompleteness, and further query constructs.
(Unrestricted) variables. Variables in query terms are indicated by the keyword var. They serve as placeholders for arbitrary content and keep query results in the form of bindings. Figure 4.2(a) shows a query term that extracts all possible direct destinations from Munich (MUC) from a data term or document like the one in Figure 4.1. In the example there is only one variable, D, and the result of evaluating the query term is a set of bindings for this variable. For the example input data term of Figure 4.1, the result is { {D ↦ "FCO"} }, i.e., there is only a single binding. Note that an empty set would signify that the query term and data term do not match.
Complete and incomplete subterm specification. In the patterns of query terms, single brackets or braces indicate a complete specification of subterms. In order for such a pattern to match, there must be a one-to-one matching between subterms (or children) of the data term and the query term. Double brackets or braces in contrast indicate an incomplete specification (w.r.t. breadth): each subterm in the query term must find a match in the data term, but the data term may contain further subterms. As with data terms, square brackets indicate that the order of subterms is relevant to the query and curly braces that it is not. According to the query term in the example of Figure 4.2(a), flight data terms must contain a subterm from and a subterm to and may contain further subterms (e.g., number). In contrast, from may only have a single child "MUC" and no further children, and to also may only have a single child (but of arbitrary content).
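The interplay of variables with complete and incomplete specifications can be sketched with a small matcher in Python. It is a simplified illustration and not the Xcerpt matching algorithm: all subterm lists are treated as unordered (curly braces only), and the names Data, Query, Var, match, and evaluate are assumptions of this sketch.

```python
from collections import namedtuple
from itertools import permutations

Data = namedtuple("Data", ["label", "children"])
Query = namedtuple("Query", ["label", "children", "complete"])  # complete: { }, else {{ }}
Var = namedtuple("Var", ["name"])

def match(query, data, binding):
    """Yield every extension of `binding` under which `query` matches `data`."""
    if isinstance(query, Var):
        if query.name not in binding:
            yield {**binding, query.name: data}
        elif binding[query.name] == data:
            yield binding
    elif isinstance(query, str) or isinstance(data, str):
        if query == data:
            yield binding
    elif query.label == data.label:
        # Complete specification: the data term must not have further children.
        if query.complete and len(query.children) != len(data.children):
            return
        # Unordered matching: try every injective assignment of query children
        # to data children.
        for chosen in permutations(data.children, len(query.children)):
            yield from match_all(query.children, chosen, binding)

def match_all(queries, datas, binding):
    """Match query children pairwise against the chosen data children."""
    if not queries:
        yield binding
        return
    for extended in match(queries[0], datas[0], binding):
        yield from match_all(queries[1:], datas[1:], extended)

def evaluate(query, data):
    """Return the list of distinct bindings (the substitution set)."""
    results = []
    for binding in match(query, data, {}):
        if binding not in results:
            results.append(binding)
    return results

def flight(number, origin, destination):
    return Data("flight", [Data("number", [number]), Data("from", [origin]),
                           Data("to", [destination])])

FLIGHTS = Data("flights", [flight("UA917", "FRA", "IAD"), flight("LH3862", "MUC", "FCO"),
                           flight("LH3863", "FCO", "MUC")])

def destinations_query(flight_complete):
    to_from = [Query("to", [Var("D")], True), Query("from", ["MUC"], True)]
    return Query("flights", [Query("flight", to_from, flight_complete)], False)

print(evaluate(destinations_query(flight_complete=False), FLIGHTS))
print(evaluate(destinations_query(flight_complete=True), FLIGHTS))
```

With the incomplete flight pattern, the number subterm of the data is simply ignored; with a complete flight pattern, the same query finds no match because every flight has a third child that the pattern does not account for.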
Incompleteness in depth. Incompleteness in depth, that is, matching subterms that are not immediate children but descendants at arbitrary depth, is supported with the construct desc. The query term in Figure 4.2(b) extracts all flight numbers by searching for number elements at arbitrary depths. The result for the example input contains three bindings for variable N:
{ {N ↦ "UA917"}, {N ↦ "LH3862"}, {N ↦ "LH3863"} }
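The effect of desc can be mimicked in Python by a pre-order traversal. In this sketch, data terms are plain (label, children) tuples and the function names are hypothetical; it handles only the specific pattern of a labelled term with exactly one child that is bound to a variable.

```python
# The data term of Figure 4.1(a) as nested (label, children) tuples.
FLIGHTS = ("flights", [
    ("flight", [("number", ["UA917"]), ("from", ["FRA"]), ("to", ["IAD"])]),
    ("flight", [("number", ["LH3862"]), ("from", ["MUC"]), ("to", ["FCO"])]),
    ("flight", [("number", ["LH3863"]), ("from", ["FCO"]), ("to", ["MUC"])]),
])

def subterms(term):
    """Yield a term and all of its descendants in pre-order."""
    yield term
    if not isinstance(term, str):
        for child in term[1]:
            yield from subterms(child)

def desc_single_child(root, label, variable):
    """Bindings for a pattern like `desc label { var X }` placed below `root`:
    any term named `label` at arbitrary depth with exactly one child."""
    bindings = []
    for child in root[1]:
        for term in subterms(child):
            if not isinstance(term, str) and term[0] == label and len(term[1]) == 1:
                bindings.append({variable: term[1][0]})
    return bindings

print(desc_single_child(FLIGHTS, "number", "N"))
```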
Variable restrictions. Variables can also be restricted using the “as” construct, written using an arrow (->) in the form var X -> q (with a query term q). The variable then does not match arbitrary content like unrestricted variables, but only content that matches q. Restricted variables are in particular useful for extracting whole subtrees (or subterms) from an XML document. The variable F in Figure 4.2(c) is bound to flight subterms that match the specified pattern.
Subterm negation. Patterns can also contain negations, that is, the negated subterm may not occur in the data. This is specified with the without keyword. Figure 4.2(c) locates all flights (variable F) together with their flight numbers (variable N) that do not go to Munich (MUC), i.e., do not have a subterm to { "MUC" }. For the example input, the query gives two bindings for the variables:
{ {F ↦ flight { number {"UA917"}, from {"FRA"}, to {"IAD"} }, N ↦ "UA917"},
  {F ↦ flight { number {"LH3862"}, from {"MUC"}, to {"FCO"} }, N ↦ "LH3862"} }
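A hand-written Python counterpart of the query in Figure 4.2(c) may help to see how the restricted variable and the negation interact. It is a sketch specialised to this one query, with data terms as (label, children) tuples and hypothetical function names, and is not a general implementation of without.

```python
# The data term of Figure 4.1(a) as nested (label, children) tuples.
FLIGHTS = ("flights", [
    ("flight", [("number", ["UA917"]), ("from", ["FRA"]), ("to", ["IAD"])]),
    ("flight", [("number", ["LH3862"]), ("from", ["MUC"]), ("to", ["FCO"])]),
    ("flight", [("number", ["LH3863"]), ("from", ["FCO"]), ("to", ["MUC"])]),
])

def child_terms(term, label):
    """All immediate children of `term` that are terms with the given label."""
    return [c for c in term[1] if not isinstance(c, str) and c[0] == label]

def flights_not_to(flights, airport):
    """Bindings for F (whole flight subterm) and N (flight number) for all flights
    that have no `to` subterm containing `airport`."""
    bindings = []
    for f in child_terms(flights, "flight"):
        # `without`: the negated subterm must not occur in the data.
        if any(airport in to[1] for to in child_terms(f, "to")):
            continue
        for number in child_terms(f, "number"):
            for n in number[1]:
                # F is bound to the whole subterm, as with a restricted variable.
                bindings.append({"F": f, "N": n})
    return bindings

result = flights_not_to(FLIGHTS, "MUC")
print([binding["N"] for binding in result])
```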
Further constructs. Xcerpt query terms also cater for optional subterms (optional), label variables, positional variables (pos), regular expression matching, non-structural conditions such as arithmetic comparisons (where), and more. These constructs will not be detailed here but discussed when necessary in examples of XChangeEQ event queries.
4.1.4
Construct Terms and Single Rules
Construct terms are used to create new data terms using variable bindings obtained by a query. A construct term describes a pattern for the data terms that are to be constructed. The syntax of construct terms resembles the syntax of data terms and extends it to support variables and grouping.
Rules. Construct terms and queries are connected through rules of the form GOAL c FROM q END or CONSTRUCT c FROM q END. In both cases, c is a construct term and q a single query term or a formula built from several query terms. GOAL rules directly generate output, while CONSTRUCT rules are used for intermediate results in rule-based reasoning that are not in the output (see next section).
Variables. In constructing new data, variables in construct terms are simply replaced by the bindings obtained from the query. The result is a new data term. If there are no grouping constructs, then a new data term is generated for each binding of the variables. The construct term of the rule in Figure 4.3(a) will construct one data term for each non-stop destination from Munich (MUC). Note that the query in the FROM part of the rule is the same as used previously in Figure 4.2(a).
    GOAL
      muc-dest [ var D ]
    FROM
      flights {{
        flight {{
          to { var D },
          from { "MUC" }
        }},
      }}
    END
(a) One term per destination

    GOAL
      ul [
        all li [ var D ] group by { var D }
      ]
    FROM
      flights {{
        flight {{
          to { var D },
          from { "MUC" }
        }},
      }}
    END
(b) Single list of destinations

    GOAL
      table [
        all tr [
          td [ var S ],
          td [ count(all var D) ]
        ] order by (lexical) [ var S ]
      ]
    FROM
      flights {{
        flight {{
          to { var D },
          from { var S }
        }},
      }}
    END
(c) Numbers of destinations, sorted

Figure 4.3: Examples of Xcerpt construct terms
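The replacement of variables by bindings, without any grouping, can be sketched in a few lines of Python. Construct terms are modelled here as (label, children) tuples with Var placeholders; the helper names are hypothetical, and the second binding ("CDG") is invented so that the example produces more than one result.

```python
from collections import namedtuple

Var = namedtuple("Var", ["name"])

def instantiate(construct, binding):
    """Replace every variable in a construct term by its value in `binding`."""
    if isinstance(construct, Var):
        return binding[construct.name]
    if isinstance(construct, str):
        return construct
    label, children = construct
    return (label, [instantiate(child, binding) for child in children])

def apply_rule(construct, bindings):
    """Without grouping constructs, one new data term is built per binding."""
    return [instantiate(construct, binding) for binding in bindings]

# Construct term of Figure 4.3(a): muc-dest [ var D ]
MUC_DEST = ("muc-dest", [Var("D")])

# Bindings as a query would deliver them; "CDG" is a made-up extra destination.
results = apply_rule(MUC_DEST, [{"D": "FCO"}, {"D": "CDG"}])
print(results)
```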
Grouping. Constructing a separate term for each variable binding is fairly limited, and more complex restructuring of data is often needed. In particular, data must be grouped together in a