2.5 Operators: Source Collection Classes
2.5.10 Operator: Attribute Calculator
An Attribute Calculator Source Collection is used for the calculation of new attributes from existing attributes. Our goal of allowing scientists to specify their own calculation methods is achieved by decoupling the definition of the calculation from its application. Besides offering flexibility, this also ensures full data lineage and enables the use of derived attributes in data pulling.
Attribute Calculator Detail: Attribute Calculator Definition
The operation to process an Attribute Calculator Source Collection is defined through an auxiliary persistent object, called an Attribute Calculator Definition. A scientist can define a new method by creating a new Attribute Calculator Definition, which can subsequently be used in multiple Attribute Calculator Source Collections. There can be several Attribute Calculator Definitions that calculate the same attributes, but with different methods.
Besides the calculation itself, an Attribute Calculator Definition also contains information about the calculation, which allows the information system to know what attributes are calculated without having to understand the calculation itself. A particular Attribute Calculator Definition object has the following properties:
Provided Attributes: A list of attributes that are calculated. The information system uses this list to find Attribute Calculator Definitions that can be used for the instantiation of new Attribute Calculator Source Collections in order to fulfill a data pulling request.
Required Attributes: A list of attributes that are required for the calculation and therefore must be present in the parent of an Attribute Calculator using this definition. The information system will ensure this automatically when a new Attribute Calculator Source Collection is instantiated through data pulling.
Code: The code to calculate the new attributes from the old attributes. In essence, the code can be seen as a function that takes the required attributes, together with any process parameters and other progenitors, as arguments and returns the provided attributes. We note that the calculation of the attributes of one source cannot be influenced by the other sources in the parent Source Collection, because all Source Collections operate on a per-source basis. The Attribute Calculator Source Collection provides the framework that defines how the code should be specified and how it can interface with the data. Such a framework is implementation dependent and goes beyond the scope of this chapter. The calculation itself can be treated as a black box by the information system and can even require external software.
Process Parameters: The process parameters of Attribute Calculator Source Collections that use this definition. Default values for these parameters should be specified if possible, but individual Attribute Calculator Source Collections have their own specific values. Other Process Targets that are required as input to calculate the attributes are included as process parameters as well, even though they can also be seen as progenitors.
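The four properties above can be illustrated with a minimal sketch. This is not the framework used in our implementation, which is out of scope here; the class name, the field names, the `find_definitions` helper, and the example attributes `mag_a`, `mag_b`, and `color` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Mapping


@dataclass(frozen=True)
class AttributeCalculatorDefinition:
    """Hypothetical container for the four properties listed above."""
    name: str
    provided_attributes: List[str]   # attributes that are calculated
    required_attributes: List[str]   # attributes needed from the parent
    # The code is a black box: (attributes of one source, parameters) -> new attributes.
    code: Callable[[Mapping[str, Any], Mapping[str, Any]], Dict[str, Any]]
    default_parameters: Dict[str, Any] = field(default_factory=dict)


def find_definitions(definitions, attribute):
    """Return the definitions that list `attribute` as a provided attribute."""
    return [d for d in definitions if attribute in d.provided_attributes]


def color_code(source, params):
    # Illustrative calculation only: a difference of two attributes plus an offset.
    return {"color": source["mag_a"] - source["mag_b"] + params["offset"]}


color_definition = AttributeCalculatorDefinition(
    name="color_simple",
    provided_attributes=["color"],
    required_attributes=["mag_a", "mag_b"],
    code=color_code,
    default_parameters={"offset": 0.0},
)
```

In this sketch the lookup in `find_definitions` only inspects the list of provided attributes and never the code itself, which mirrors the point that the information system does not need to understand the calculation.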
Attribute Calculator Specification
Progenitors: A Source Collection with the attributes required for the calculation.
Parameters: An Attribute Calculator Definition and any required process parameters.
Sources: The same as the parent.
Attributes: The newly calculated attributes.
Rel. Operator: Extend.
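The specification can be read as a per-source operation: the result has the same sources as the parent and holds only the newly calculated attributes, which extend those of the parent. The following is a rough sketch under the assumption that a Source Collection can be modeled as a dictionary from source identifier to attribute values; the function names and the example attributes are hypothetical.

```python
def apply_attribute_calculator(parent, required, provided, code, parameters):
    """Calculate the provided attributes for every source of the parent."""
    missing = [a for a in required
               if any(a not in attrs for attrs in parent.values())]
    if missing:
        raise KeyError(f"parent lacks required attributes: {missing}")
    result = {}
    for source_id, attrs in parent.items():
        # Each source is handled on its own; other sources are never visible.
        inputs = {a: attrs[a] for a in required}
        outputs = code(inputs, parameters)
        result[source_id] = {a: outputs[a] for a in provided}
    return result


def extend(parent, calculated):
    """Combine the parent attributes with the newly calculated ones."""
    return {sid: {**parent[sid], **calculated[sid]} for sid in parent}


def color_code(source, params):
    # Illustrative calculation only.
    return {"color": source["mag_a"] - source["mag_b"] + params["offset"]}


parent = {
    1: {"ra": 10.0, "mag_a": 20.0, "mag_b": 19.5},
    2: {"ra": 11.0, "mag_a": 18.0, "mag_b": 18.25},
}
calculated = apply_attribute_calculator(
    parent, ["mag_a", "mag_b"], ["color"], color_code, {"offset": 0.0}
)
extended = extend(parent, calculated)
```

Here `calculated` corresponds to the Attribute Calculator Source Collection itself, while `extended` shows what a user sees once its attributes are combined with those of the parent.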
Dependency Graph Modifications
Attribute Calculator Source Collections that are created automatically through data pulling are defined to operate on the largest set of sources they are applicable to, which usually means that they are placed as early in the dependency graph as possible. This prevents the creation of Attribute Calculator Source Collections that perform the exact same calculation on different subsets of the same larger Source Collection.
Moving Select Sources Source Collections up the dependency graph is one of the steps in minimizing the required processing. Moving a Select Sources up through an Attribute Calculator will create a copy of the Attribute Calculator that represents a subset of the original.
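That a Select Sources can be moved up through an Attribute Calculator follows from the per-source nature of the calculation: selecting sources before or after the calculation gives the same result for the selected sources. The sketch below illustrates this under simplifying assumptions: a Source Collection is modeled as a dictionary from source identifier to attribute values, the selection is given as an explicit set of source identifiers, and all names are hypothetical.

```python
def attribute_calculator(collection, code):
    """Apply the code to every source independently."""
    return {sid: code(attrs) for sid, attrs in collection.items()}


def select_sources(collection, keep):
    """Keep only the sources whose identifier is in `keep`."""
    return {sid: attrs for sid, attrs in collection.items() if sid in keep}


def color(attrs):
    # Illustrative calculation only.
    return {"color": attrs["mag_a"] - attrs["mag_b"]}


parent = {
    1: {"mag_a": 20.0, "mag_b": 19.5},
    2: {"mag_a": 18.0, "mag_b": 18.25},
    3: {"mag_a": 21.0, "mag_b": 20.0},
}
keep = {1, 3}

# Original graph: calculate for all sources, then select.
select_last = select_sources(attribute_calculator(parent, color), keep)
# Modified graph: select first, then run a copy of the calculator on the subset.
select_first = attribute_calculator(select_sources(parent, keep), color)
```

Both orders yield the same attributes for sources 1 and 3, but the second order runs the calculation on two sources instead of three.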
Example: The important aspects of the Attribute Calculator are shown in the example of section 2.2.2. A new Attribute Calculator Source Collection is created by searching for an appropriate Attribute Calculator Definition that describes how the requested attribute can be derived. The attributes that are listed as required for the derivation are already represented by other Source Collections. Had this not been the case, the information system would have recursively created more Attribute Calculator Source Collections to calculate these attributes as well. The Attribute Calculator Source Collection is defined to be as general as possible. For the actual construction of the catalog, a temporary copy is created that is as specific as possible. The instantiated sources and attributes of this smaller copy are stored in the same location as those of the original. Several more use case examples are given in section 3.9.
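The recursive search described in this example can be sketched as follows. This is only an illustration of the idea, not the data pulling algorithm of our implementation: definitions are modeled as plain dictionaries with hypothetical `name`, `provided`, and `required` keys, and the first definition that can be satisfied is used.

```python
def plan_calculators(requested, available, definitions, _pending=frozenset()):
    """Return an ordered list of definitions needed to derive `requested`.

    `available` holds the attributes already represented by existing
    Source Collections. Raises LookupError if no derivation is found.
    """
    if requested in available:
        return []
    if requested in _pending:
        # The attribute (indirectly) requires itself; give up on this branch.
        raise LookupError(f"cyclic requirement for {requested!r}")
    for definition in definitions:
        if requested not in definition["provided"]:
            continue
        try:
            plan, known = [], set(available)
            for attribute in definition["required"]:
                # Recursively plan calculators for missing required attributes.
                steps = plan_calculators(
                    attribute, known, definitions, _pending | {requested}
                )
                for step in steps:
                    plan.append(step)
                    known.update(step["provided"])
            return plan + [definition]
        except LookupError:
            continue  # try the next definition that provides the attribute
    raise LookupError(f"no definition can provide {requested!r}")


definitions = [
    {"name": "color_simple", "provided": ["color"], "required": ["mag_a", "mag_b"]},
    {"name": "mag_b_from_flux", "provided": ["mag_b"], "required": ["flux_b"]},
]
plan = plan_calculators("color", {"mag_a", "flux_b"}, definitions)
plan_names = [d["name"] for d in plan]
```

In this hypothetical setup `mag_b` is not yet available, so the plan first schedules the definition that derives it and only then the one that provides `color`.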
Implementation Detail: The Attribute Calculator operator is designed to allow the code of an Attribute Calculator Definition to be a black box: the information system knows what attributes are calculated and how to initiate the calculation, but nothing about the method itself. The design of the Attribute Calculator Definition can be extended to give the information system useful knowledge about the calculation. For example, in our implementation the information system can distinguish between calculations that can be performed on the fly in the database and those that cannot. In section 2.7.3 we describe further enhancements to the Attribute Calculator.