Data Flow - Process Model - Flexible semantic service execution

4.3 Process Model

4.3.2 Data Flow

From a structural point of view, the data flow is seen as a directed overlay graph on top of the control flow graph. Although extensions of the PN formalism (e.g., coloured PNs) can in principle also represent the data flow, we decided to keep it separate from the control flow graph so as not to clutter the latter and complicate matters unnecessarily.

Figure 4.6: Data flow primitives between producers and consumers: (a) fork; (b) divide, where dop is a divide operator; (c) merge, where mop is a merge operator.

Most importantly, the data flow respects the direction of the control flow graph: it cannot run counter to the control flow and is therefore not independent of it. At the same time, the seamless compatibility of data producers with their consumers is a crucial requirement for automated executability. This becomes relevant in environments with syntactic, structural, and/or semantic data heterogeneities (see [She98] for this classification of levels of heterogeneity). Therefore, to capture the data flow model considered in this thesis completely, we find it necessary (i) to address its structure and execution semantics, (ii) to describe how it is linked to the control flow, and (iii) to provide and discuss a notion of data compatibility between the producer of a data item and its consumer(s).

Whereas the primitives in the flow of control are split and join, the primitives in the flow of data are fork, divide, and merge, as depicted in Figure 4.6. A fork is essentially the use of a data item by multiple consumers (i.e., the use of the same data item multiple times). A divide and a merge of data each require the definition of an operator: a divide operator or a merge operator, respectively. A divide operator takes one incoming data item and divides it according to some instructions into multiple outgoing data items. Conversely, a merge operator takes two or more incoming data items and merges them according to some instructions into one outgoing data item. Depending on the input, typical operators involve selection, projection, join, and union. In the data flow model that we consider, divide and merge operators are not represented as first-class citizens. This is based on the observation that one can equally represent them by an operation within a service or by a service itself (i.e., they can be realized in either way). Only a fork is an explicit part of the model.
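To make the two operator kinds concrete, the following minimal Python sketch shows a divide operator realized as a projection and a merge operator realized as a field-wise union. The function names, the dictionary representation of a data item, and the sample record are assumptions made for illustration only; they are not part of the model.

```python
from typing import Dict, List

Record = Dict[str, object]  # hypothetical representation of one data item


def divide_by_projection(item: Record, groups: List[List[str]]) -> List[Record]:
    """Divide operator: one incoming data item, several outgoing data items.

    Each outgoing item is the projection of the input onto one group of fields.
    """
    return [{key: item[key] for key in group} for group in groups]


def merge_by_union(items: List[Record]) -> Record:
    """Merge operator: two or more incoming data items, one outgoing data item.

    Fields are combined; conflicting values for the same field are rejected.
    """
    if len(items) < 2:
        raise ValueError("a merge needs at least two incoming data items")
    merged: Record = {}
    for item in items:
        for key, value in item.items():
            if key in merged and merged[key] != value:
                raise ValueError(f"conflicting values for field {key!r}")
            merged[key] = value
    return merged


# Illustrative data item (made-up values).
order = {"id": 7, "customer": "ACME", "amount": 120.0}
parts = divide_by_projection(order, [["id", "customer"], ["id", "amount"]])
restored = merge_by_union(parts)
```

In this sketch, dividing and then merging on the shared field `id` restores the original item. A fork needs no operator at all: the same item is simply handed to several consumers.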

Sources and Sinks

Following common terminology, the producer of a data item is called the source and the consumer the sink. The data that “flows” from a source connected to a sink is a single data item (cf. Assumption 2). Sources and sinks map to input and output profile parameters. More precisely, we define the set of sources as the union of all outputs of the sub services/operations a service is composed of plus the inputs of the service itself. Conversely, the set of sinks is the union of all inputs of the sub services/operations a service is composed of plus the outputs of the service itself.

Definition 4.15 (Source, Sink). Let Sc be a service. The set of sources O and the set of sinks I of Sc are defined as follows:

\[ O = \Bigl( \bigcup_{u \in Sc.U} u.Pr.O \Bigr) \cup Sc.Pr.I \quad \text{and} \quad I = \Bigl( \bigcup_{u \in Sc.U} u.Pr.I \Bigr) \cup Sc.Pr.O . \]
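The two unions can be computed directly. The sketch below is only an illustration under stated assumptions: the classes `Profile` and `Service`, the attribute names, and the sample services are hypothetical, and parameter names are assumed to be globally unique so that plain sets of strings suffice.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class Profile:
    inputs: FrozenSet[str]   # stands for Pr.I
    outputs: FrozenSet[str]  # stands for Pr.O


@dataclass
class Service:
    name: str
    profile: Profile
    units: List["Service"] = field(default_factory=list)  # stands for Sc.U


def sources(sc: Service) -> FrozenSet[str]:
    """Outputs of all sub units plus the inputs of the service itself."""
    produced = frozenset().union(*(u.profile.outputs for u in sc.units))
    return produced | sc.profile.inputs


def sinks(sc: Service) -> FrozenSet[str]:
    """Inputs of all sub units plus the outputs of the service itself."""
    consumed = frozenset().union(*(u.profile.inputs for u in sc.units))
    return consumed | sc.profile.outputs


# Made-up composite service with two sub units.
lookup = Service("lookup", Profile(frozenset({"lookup.isbn"}), frozenset({"lookup.price"})))
pay = Service("pay", Profile(frozenset({"pay.amount"}), frozenset({"pay.receipt"})))
shop = Service("shop", Profile(frozenset({"shop.isbn"}), frozenset({"shop.receipt"})), [lookup, pay])

shop_sources = sources(shop)
shop_sinks = sinks(shop)
```

Note the role reversal at the boundary of the composite service: its own input `shop.isbn` acts as a source for the inner data flow, and its own output `shop.receipt` acts as a sink.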

A source o ∈ Sc.Pr.I is called an initial source. A sink i ∈ Sc.Pr.O is called a final sink.

Data Compatibility

In our data flow model, a connection between a sink and a source implies that both are compatible. The notion of data compatibility exists in one form or another in basically all service composition, workflow, or business process models/frameworks since it is imperative to define a data flow at all. Simply put, compatibility is understood as the possibility to forward the data item produced by the source to the sink so that it will be accepted and understood by the sink. Therefore, compatibility is a relation that has a syntactic (accept) and a semantic (understand) dimension. Compatibility at the syntactic level is a prerequisite for compatibility at the semantic level: While it is possible that a sink accepts a data item forwarded from a source but does not understand it, the opposite is impossible, intrinsically.

The notion of data compatibility between sources and sinks in a data flow can be defined in basically two ways. Based on the methods employed, we classify them as type-based compatibility and the more general mediator-based compatibility, the latter building upon so-called data mediators. Like divide and merge operators, and for the same reason, data mediators are not represented as first-class citizens: they can be realized either as an operation within a service or as an atomic service.

Type-based compatibility is the direct form requiring that a source and a sink match semantically, at least. Using the machinery introduced in the service model this can be formulated as follows. A source profile parameter o is type-compatible with a sink profile parameter i if

type(o) ⊑ type(i) . (4.22)

This can be understood as a classical plug-in match [SWKL02, PKPS02] (see also Section 5.4.1). Observe that, due to the unidirectional nature of the data flow, a symmetric relation is not needed. Hence, it is not necessary to use the stricter equivalence type(o) ≡ type(i). Conversely, the more permissive type(o) ⊒ type(i), which can be understood as a subsume match, turns out to be problematic under the assumptions described next.

Condition (4.22) is, however, not sufficient to ensure seamless compatibility. Without additional statements, it does not yet address compatibility at the data level; that is, it lacks details on syntactic and structural data format requirements. One assumption that is (tacitly) made in most service frameworks/models is that type-compatibility implies syntactic and structural data-compatibility. Formulated in terms of our service model, it is the assumption that the set of valid data values for a profile parameter coincides with the extension of the concept/data range so that Equation (4.4) and Equation (4.5), respectively, hold. Applied to o and i, this is achieved by requiring that o and i either use the same datatype³⁰ or that the datatype of o is derived from (is a restriction of) the datatype of i, which resembles an exact or a plug-in match, respectively, at the level of data.³¹ Consequently, o and i are seamlessly compatible under this assumption, both syntactically and semantically. Depending on the actual datatype system used in practice, this applies equally to primitive and to complex structured datatypes. For instance, the XML Schema type system includes the possibility to define a complex datatype d1 as a restriction over elements of another complex datatype d2, which effectively makes the value space of d1 a subset of the value space of d2. This also explains why extending the subsume match type(o) ⊒ type(i) to data compatibility is problematic: there might be data items that are rejected by the sink because they lie outside its value space.

At the technical execution level it is therefore indispensable that a source and a sink are compatible at the conceptual level as well as at the syntactic data level.

Definition 4.16 (Source, Sink Execution Compatibility). Let Sc be a service and O, I the sets of its sources and sinks, respectively. A source o ∈ O is execution compatible with a sink i ∈ I iff Condition (4.22) holds and the data values produced by o are included in the data values accepted by i.
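The two parts of execution compatibility can be illustrated with a small Python sketch. It reads the definition as requiring that the values produced by the source are included in those accepted by the sink. Everything concrete here is an assumption for illustration: the toy concept hierarchy `PARENT` stands in for a real subsumption reasoner, and finite integer sets stand in for datatype value spaces.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional

# Hypothetical toy concept hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "Thing": None,
    "Price": "Thing",
    "EuroPrice": "Price",
}


def is_subsumed_by(concept: str, other: str) -> bool:
    """True if concept is subsumed by other (reflexive and transitive)."""
    current: Optional[str] = concept
    while current is not None:
        if current == other:
            return True
        current = PARENT[current]
    return False


@dataclass(frozen=True)
class Parameter:
    concept: str              # semantic type of the profile parameter
    values: FrozenSet[int]    # finite stand-in for the datatype's value space


def execution_compatible(source: Parameter, sink: Parameter) -> bool:
    """Plug-in match on concepts plus value-space inclusion on data."""
    type_compatible = is_subsumed_by(source.concept, sink.concept)
    data_compatible = source.values <= sink.values
    return type_compatible and data_compatible


# Made-up parameters.
euro = Parameter("EuroPrice", frozenset(range(0, 100)))
price = Parameter("Price", frozenset(range(0, 1000)))
signed_euro = Parameter("EuroPrice", frozenset(range(-10, 10)))

plug_in_ok = execution_compatible(euro, price)          # both conditions hold
subsume_fails = execution_compatible(price, euro)       # subsume match only
data_fails = execution_compatible(signed_euro, price)   # values outside sink space
```

The last case shows why the type-level condition alone is not sufficient: the concepts match, yet the source may emit negative values that the sink does not accept.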

The notion of compatibility between sources and sinks becomes a more complex problem under structural and even more so under semantic data heterogeneities. This is the point where some form of data mediation is ultimately required. Though it is not the focus of this thesis to also cover this topic, we will briefly discuss the mediator-based approach next.

Mediator-based compatibility is more complex since it starts from the advanced cases where the data items produced by a source are not structurally compatible with the data items accepted by a sink and/or where source and sink do not semantically match relative to some application domain conceptualization, so that Condition (4.22) does not directly hold. Establishing compatibility under these circumstances involves solving a data integration problem (see [Len02] for an overview).

Probably the most prominent building block proposed in the literature for this purpose is the use of mediators [Wie92]. The basic principle was later integrated as a core element in the WSMO service framework under the notion of data level mediation [SCMF06]. Data integration is one of the major and widely studied topics in databases and information systems, with a large body of work on ontology-based approaches [WVV+01, Noy04]. While the mediator-based approach is more of an architectural pattern that can in principle be employed to resolve any syntactic, structural, or semantic data incompatibility, it does not provide concrete methods to achieve this. In fact, automated data integration (based on mediators) is still a challenging and not generally solved problem. For instance, a more recent review of the topic [BH08] states that “every step of the information-integration process requires a good deal of manual intervention”. Current approaches commonly build on data schema mappings and/or structural transformation procedures. These are usually defined by humans in an ad hoc manner for the data formats between which to mediate. Consequently, there is a certain degree of human involvement. An approach to estimating this effort has been proposed in [GRR+08]. The authors define mediatability as a computable measure that quantifies the effort of mediating between XML-based data schemata in terms of a similarity function.

³⁰ Recap: information about the data format, which includes the datatype, is assumed to be included in the grounding of an operation (see Section 4.1.3).

³¹ This type of data compatibility corresponds to the notion of direct and indirect data type compatibility in [MBE03].

Virtually all mediation-based approaches with humans in the loop assume that the data formats (schemas) among which to mediate are known in advance and that only a few different formats are involved. However, in open and possibly large-scale environments where this cannot be assumed, different approaches are needed [Rah11].

Structure

The structure of the data flow is defined using a consumer-pull style relation on sources and sinks (rather than a producer-push style). In addition, we need a precedence relation on sources and sinks to represent that the data flow is not counter-directional to the control flow. Let Sc be a service and let Gcf = (P, T, F, M0, fu) be its control flow graph. Let pt : (O ∪ I) → (P ∪ T) be the mapping that returns for each source/sink of Sc its corresponding place/transition in the control flow graph; that is,

\[ pt(x) = \begin{cases} p_i & \text{if } x \in Sc.Pr.I \\ p_f & \text{if } x \in Sc.Pr.O \\ t & \text{if } x \in fu(t).I \text{ or } x \in fu(t).O \end{cases} \tag{4.23} \]

A source o ∈ O precedes a sink i ∈ I w.r.t. Gcf, denoted with

o ≺_{G_cf} i , (4.24)

iff there is a path W in Gcf such that pt(o) is the first and pt(i) is the last element in W.
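Checking precedence amounts to a reachability test in the control flow graph. The following Python sketch uses a breadth-first search over the arcs; the node names, the `pt` dictionary, and the sample net are hypothetical. One point is an assumption of this sketch rather than something settled by the text: only paths with at least one arc are counted, so an output of a transition does not precede an input of the same transition unless the net contains a cycle through it.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Hashable  # a place or a transition of the control flow graph


def reachable(arcs: Iterable[Tuple[Node, Node]], start: Node, goal: Node) -> bool:
    """True if a path with at least one arc leads from start to goal."""
    successors: Dict[Node, Set[Node]] = {}
    for a, b in arcs:
        successors.setdefault(a, set()).add(b)
    seen: Set[Node] = set()
    queue = deque(successors.get(start, ()))
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        queue.extend(successors.get(node, ()))
    return False


def precedes(o: str, i: str, pt: Dict[str, Node], arcs) -> bool:
    """Source o precedes sink i if pt(o) reaches pt(i) in the graph."""
    return reachable(arcs, pt[o], pt[i])


# Made-up sequential net: p_i -> t1 -> p1 -> t2 -> p_f
arcs = [("p_i", "t1"), ("t1", "p1"), ("p1", "t2"), ("t2", "p_f")]
pt = {
    "shop.isbn": "p_i", "shop.receipt": "p_f",
    "lookup.isbn": "t1", "lookup.price": "t1",
    "pay.amount": "t2", "pay.receipt": "t2",
}

forward = precedes("lookup.price", "pay.amount", pt, arcs)
backward = precedes("pay.receipt", "lookup.isbn", pt, arcs)
same_transition = precedes("lookup.price", "lookup.isbn", pt, arcs)
```

Here `forward` holds because t1 reaches t2, whereas `backward` does not hold because such a connection would run against the control flow.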

Definition 4.17 (Data Flow Graph). Let Sc be a service and Gcf its control flow graph. A data flow graph (or data flow for short) for Sc is a bipartite directed graph Gdf = (O, I, ⇠) whose nodes are divided into the set of sources O and the set of sinks I of Sc, and where ⇠ ⊆ I × O is the flow relation (edges), such that the following conditions hold:

(1) for each i ∈ I there is a pair (i, o) ∈ ⇠ such that o must not be an initial source if i is a final sink,

(2) (i, o) ∈ ⇠ implies that o and i are execution compatible, and

(3) (i, o) ∈ ⇠ implies o ≺_{G_cf} i.

If (i, o) ∈ ⇠ then i is said to be connected to o. A source o ∈ O is unused (or unconnected) if there is no pair (x, o) ∈ ⇠.
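The three conditions can be checked mechanically. The sketch below is a hedged illustration, not the thesis's implementation: the flow relation is stored as a dictionary from sink to source (which assumes each sink is connected to exactly one source), and the `compatible` and `precedes` callables as well as all parameter names are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Set


def check_data_flow(
    sources: Set[str],
    sinks: Set[str],
    initial_sources: Set[str],
    final_sinks: Set[str],
    flow: Dict[str, str],                      # sink -> source
    compatible: Callable[[str, str], bool],    # compatible(o, i)
    precedes: Callable[[str, str], bool],      # precedes(o, i)
) -> List[str]:
    """Return a list of violated conditions; an empty list means valid."""
    problems: List[str] = []
    for i in sorted(sinks):
        if i not in flow:
            problems.append(f"(1) sink {i} is not connected")
    for i, o in sorted(flow.items()):
        if i not in sinks or o not in sources:
            problems.append(f"edge ({i}, {o}) is not in I x O")
            continue
        if i in final_sinks and o in initial_sources:
            problems.append(f"(1) final sink {i} is fed by initial source {o}")
        if not compatible(o, i):
            problems.append(f"(2) {o} and {i} are not execution compatible")
        if not precedes(o, i):
            problems.append(f"(3) {o} does not precede {i}")
    return problems


def unused_sources(sources: Set[str], flow: Dict[str, str]) -> Set[str]:
    """Sources to which no sink is connected."""
    return sources - set(flow.values())


# Made-up example; position numbers stand in for the control flow order.
srcs = {"shop.isbn", "lookup.price", "pay.receipt"}
snks = {"lookup.isbn", "pay.amount", "shop.receipt"}
position = {"shop.isbn": 0, "lookup.isbn": 1, "lookup.price": 1,
            "pay.amount": 2, "pay.receipt": 2, "shop.receipt": 3}
always = lambda o, i: True
before = lambda o, i: position[o] < position[i]

good = {"lookup.isbn": "shop.isbn", "pay.amount": "lookup.price",
        "shop.receipt": "pay.receipt"}
bad = {"lookup.isbn": "shop.isbn", "shop.receipt": "shop.isbn"}

good_report = check_data_flow(srcs, snks, {"shop.isbn"}, {"shop.receipt"}, good, always, before)
bad_report = check_data_flow(srcs, snks, {"shop.isbn"}, {"shop.receipt"}, bad, always, before)
```

The `bad` flow leaves `pay.amount` unconnected and loops the service input straight through to the service output, so both parts of condition (1) are reported.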

Observe that Item (1) in Definition 4.17 rules out open-sourced sinks; that is, all sinks are connected to a source. On the other hand, unused sources are permitted (i.e., an output data item produced by some operation need not be consumed), though this will rarely be the case in practice. In other words, ⇠ is left-total and right-unique (i.e., it is a function) but neither injective nor surjective: different sinks may map to the same source (fork), and not all sources need be covered by ⇠ (unused/unconnected source). Defining the flow relation in the dual form ⇢ ⊆ O × I renders the relation ⇢ inverse-functional, which is the reason why we opted for the consumer-pull style.

As a last remark, observe that Definition 4.17 precludes connecting an initial source with a final sink (Item (1)), thereby disallowing a data flow that simply and directly forwards a service’s input to one or more of its outputs. The importance of this “restriction” lies in reasoning. Later, in Section 5.4.4, we will assume that every output is represented by a newly introduced constant. An output produced by looping an input through would instead be represented by an existing constant and hence violate this assumption. From a pure data perspective, however, we could allow such a loop-through data flow, though it is of little practical relevance anyway.

Execution Semantics

The forwarding of data items in the course of execution is directly bound to the service lifecycle and transition firing as defined by the execution semantics of the control flow graph. More precisely, for each pair (i, o) ∈ ⇠, the data item becomes available at an initial source (pt(o) = pi) when the service gets instantiated (i.e., for the initial marking M0). If pt(o) = t then the data item is produced when t fires. The data item becomes available at a final sink (pt(i) = pf) when the service execution completes (i.e., for the final marking Mf). If pt(i) = t then the data item is consumed when the operation or service fu(t) is invoked.

Since there are time gaps between the production and the consumption of data items in practice, it is the responsibility of an execution engine to keep a data item ready by temporarily storing it until it has been consumed by all connected sinks. This also includes cases of runtime invocation failures that make it necessary to keep data items for recovery purposes. Conversely, a data item that is never consumed (unconnected source) can in principle be discarded immediately after it has been produced, unless it needs to be retained for other purposes (e.g., monitoring, recovery, rollback).
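The storage responsibility described above can be sketched as a small bookkeeping class. This is a simplified illustration under assumptions, not a description of an actual engine: the class `DataItemStore`, its methods, and the `retain_all` switch are hypothetical, each source is assumed to produce one item per service instance, and `consume` is assumed to be called only after the consuming invocation has succeeded.

```python
from typing import Dict, Set


class DataItemStore:
    """Keeps each produced data item until all connected sinks consumed it."""

    def __init__(self, flow: Dict[str, str], retain_all: bool = False) -> None:
        # flow maps sink -> source; invert it to source -> waiting sinks.
        self._pending: Dict[str, Set[str]] = {}
        for sink, source in flow.items():
            self._pending.setdefault(source, set()).add(sink)
        # retain_all models retention for monitoring, recovery, or rollback.
        self._retain_all = retain_all
        self._items: Dict[str, object] = {}

    def produce(self, source: str, item: object) -> None:
        """Store the item unless its source is unconnected and not retained."""
        if self._pending.get(source) or self._retain_all:
            self._items[source] = item

    def consume(self, sink: str, source: str) -> object:
        """Hand the item to a sink; drop it once no sink is waiting."""
        item = self._items[source]
        waiting = self._pending[source]
        waiting.discard(sink)
        if not waiting and not self._retain_all:
            del self._items[source]
        return item

    def stored(self) -> Set[str]:
        return set(self._items)


# Made-up fork: two sinks read the same source; y.out is unconnected.
store = DataItemStore({"a.in": "x.out", "b.in": "x.out"})
store.produce("x.out", 42)
store.produce("y.out", 1)          # unused source, discarded immediately
after_produce = store.stored()
first = store.consume("a.in", "x.out")
after_first = store.stored()       # still kept for the second sink
second = store.consume("b.in", "x.out")
after_second = store.stored()
```

With `retain_all=True` the same calls keep every item, which corresponds to the case where items must be retained for purposes other than consumption.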
