Dataset transformations - A model driven data analysis architecture enabling reuse and insight

In parallel with the data transformations, dataset transformations must be executed as well, so that a new dataset is generated that matches the resulting data. The approach is roughly the same as for the data transformations: start with the initial dataset as the source, perform each operation sequentially, and return the final result.

Each operation has a specific impact, as specified in section 5.3.1: dimensions and metrics need to be adjusted, annotated, or sometimes newly generated.
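The sequential application described above can be sketched as a fold over a list of operations. This is only an illustrative sketch: the Dataset class and the two operations (select_metrics, aggregate_over) are hypothetical stand-ins, not the prototype's actual classes or the operations specified in section 5.3.1.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical, minimal dataset description: a name, dimensions and metrics."""
    name: str
    dimensions: Tuple[str, ...]
    metrics: Tuple[str, ...]


# An operation maps one dataset description to a new one.
Operation = Callable[[Dataset], Dataset]


def select_metrics(*keep: str) -> Operation:
    """Hypothetical operation: keep only the named metrics."""
    def apply(ds: Dataset) -> Dataset:
        return replace(ds, metrics=tuple(m for m in ds.metrics if m in keep))
    return apply


def aggregate_over(dimension: str) -> Operation:
    """Hypothetical operation: aggregating over a dimension removes it."""
    def apply(ds: Dataset) -> Dataset:
        remaining = tuple(d for d in ds.dimensions if d != dimension)
        return replace(ds, dimensions=remaining)
    return apply


def transform(source: Dataset, operations: Iterable[Operation]) -> Dataset:
    """Start from the source and apply each operation in order."""
    return reduce(lambda ds, op: op(ds), operations, source)


source = Dataset("population", ("region", "year"), ("inhabitants", "households"))
result = transform(source, [select_metrics("inhabitants"), aggregate_over("year")])
print(result)
```

Because each operation returns a new Dataset rather than mutating its input, the source description stays available for other expressions.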

Because the two kinds of transformation are so similar in structure, the challenges they pose are mostly identical.

There is one additional challenge. The data source itself does not have to be retrieved, since an assumption is made about the location of the data and the table name can be used directly. To apply the operations sequentially, however, the state of the transformation must be tracked, because:

1. The transformation starts at the root of the query, which references a source. Instead of modifying a dataset, this step requires retrieving an existing dataset.

2. A query can contain multiple expressions. The datasets resulting from these expressions need to be stored so that other expressions can reference them.

3. When an operation merges different datasets, the dataset to be added can itself be the result of a subexpression.

We solve these issues with an internal dataset registry that is accessible to the transformer implementation. This registry holds all datasets available for use during the transformation, and the transformer can request these datasets when necessary. The dataset resulting from each subexpression can be stored in this registry as well, allowing later expressions to request intermediate results from the registry.
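A registry of this kind could look roughly as follows. The sketch is an assumption about the shape of such a component, not the prototype's code: DatasetRegistry, its register and get methods, the join operation, and all dataset names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical, minimal dataset description."""
    name: str
    dimensions: Tuple[str, ...]
    metrics: Tuple[str, ...]


class DatasetRegistry:
    """Holds every dataset that is available during a transformation."""

    def __init__(self) -> None:
        self._datasets: Dict[str, Dataset] = {}

    def register(self, key: str, dataset: Dataset) -> None:
        self._datasets[key] = dataset

    def get(self, key: str) -> Dataset:
        if key not in self._datasets:
            raise KeyError(f"unknown dataset: {key}")
        return self._datasets[key]


def join(left: Dataset, right: Dataset) -> Dataset:
    """Hypothetical merge operation: union of dimensions, all metrics kept."""
    extra = tuple(d for d in right.dimensions if d not in left.dimensions)
    return Dataset(f"{left.name}_{right.name}",
                   left.dimensions + extra,
                   left.metrics + right.metrics)


registry = DatasetRegistry()
registry.register("population", Dataset("population", ("region", "year"), ("inhabitants",)))
registry.register("income", Dataset("income", ("region",), ("avg_income",)))

# Expression 1: the root of the query references a source, so the dataset
# is retrieved from the registry and the result is stored under a new name.
registry.register("per_region", registry.get("population"))

# Expression 2: a merge whose added source is the result of expression 1.
combined = join(registry.get("income"), registry.get("per_region"))
registry.register("combined", combined)
print(registry.get("combined"))
```

Keying intermediate results by expression name covers all three cases in the list above with a single lookup mechanism.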

Many metadata elements are purely descriptive. To preserve context, descriptions and names must be propagated through the transformation. This involves a design decision about what to append, how to append it, and at what level the user should specify this.

We opted to generate additional text alongside the existing descriptions. For example, when a metric is aggregated, the method of aggregation can be prepended to its description while the rest of the information is left intact. Because the open data sources used and their documentation are in Dutch, the implementation adds textual information in Dutch as well. Bilingual support, or configuration of how information is appended, would be relatively straightforward but is considered out of scope for this project.
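As a minimal sketch of this approach, the function below prepends a Dutch label for the aggregation method to a metric description and leaves the original text untouched. The Metric class, the label table, the naming scheme, and the exact Dutch wording are assumptions made for illustration; the prototype may phrase and structure this differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric metadata: a name and a human-readable description."""
    name: str
    description: str


# Assumed Dutch labels for a few aggregation methods.
AGGREGATION_LABELS = {
    "sum": "Som",
    "avg": "Gemiddelde",
    "min": "Minimum",
    "max": "Maximum",
}


def aggregate_metric(metric: Metric, method: str) -> Metric:
    """Return a new metric whose description is prefixed with the method."""
    label = AGGREGATION_LABELS[method]
    return replace(
        metric,
        name=f"{method}_{metric.name}",
        description=f"{label} van: {metric.description}",
    )


original = Metric("inwoners", "Aantal inwoners")
aggregated = aggregate_metric(original, "avg")
print(aggregated.description)
```

Keeping the labels in a single table is one way the bilingual support mentioned above could later be added, by swapping the table per language.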

Chapter 7 Implementation details

For the interested reader, this chapter provides technical details on the implementation of the prototype. It can help the reader reproduce the implementation or gain new insights for model-driven engineering, but it can be skipped in its entirety without loss of conceptual meaning. Additionally, it provides a compact overview of the libraries used and the reasoning behind our choice between alternatives.

7.1 EMF and PyEcore

The Eclipse Modeling Framework is the most widely used framework to apply model-driven engineering, and provides tightly integrated tools to enable development of models and transformations.

While the Eclipse Modeling Framework provides a solid foundation for models, it binds the user to Eclipse and Java. Eclipse and Java are difficult to configure and make it hard to automate individual steps, especially for someone unfamiliar with the build tools involved. The MWE2 workflow provides methods to chain different operations across the platform together, but we did not manage to make this work consistently. We suspect this is due to configuration issues, versioning issues, or untraceable errors after a model change.

Since this did not provide the right environment for our prototype implementation, we searched for a different method that would allow Model-Driven Engineering similar to the Eclipse Modeling Framework, but with better, more flexible tooling that would let us develop a prototype more quickly.

PyEcore [10] is an open-source implementation of the Eclipse Modeling Framework in Python, and can be found on GitHub. It allows users to import existing ECore files, generate Python code for the metamodels, and use the generated classes. Because we are familiar with many Python packages, we can use them to streamline the Model-Driven Engineering workflow, for example for parsing, text generation, and automation of different steps. Documentation for PyEcore can be found at https://pyecore.readthedocs.io [9].

The ECore models are designed using Eclipse, since the Eclipse editor is solid and provides a good workflow. Importing these models for use in PyEcore is very simple. PyEcore supplies the tool pyecoregen, which takes an ECore file and generates Python model code. Executing pyecoregen -e model.ecore -o . in the terminal generates a Python package called “model” in the current directory, containing the model code. This model code can then be used from Python with a plain import model.

Python is a dynamically typed language, which might not be the ideal choice for Model-Driven Engineering. Using models implies that types of variables are static. PyEcore takes care of this by performing validation of types when a value is bound to the property of a model.
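The idea of validating a type at the moment a value is bound to a property can be illustrated in plain Python with a descriptor. This sketch only demonstrates the general technique; it is not PyEcore's implementation or API, and TypedAttribute and Metric are hypothetical names.

```python
class TypedAttribute:
    """Descriptor that checks the type of a value when it is assigned."""

    def __init__(self, expected_type):
        self.expected_type = expected_type

    def __set_name__(self, owner, name):
        # Remember the attribute name so the value can be stored per instance.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        if value is not None and not isinstance(value, self.expected_type):
            raise TypeError(
                f"{self.name} expects {self.expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        instance.__dict__[self.name] = value


class Metric:
    """Hypothetical model class with statically typed properties."""
    name = TypedAttribute(str)
    scale = TypedAttribute(int)


m = Metric()
m.name = "inhabitants"      # accepted: a str is bound to a str property
try:
    m.scale = "thousands"   # rejected: a str is bound to an int property
except TypeError as err:
    print(err)
```

The check runs on assignment rather than at compile time, so errors surface as soon as an invalid value is bound, which recovers part of the safety a static type system would give.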

The dynamic nature of Python does give us advantages for the development of a prototype. During testing, a model can be loaded into an interpreter, allowing direct interaction with and exploration of the properties of the model. Furthermore, Python is a popular scripting language for quickly tying pieces of code together. This allows us to combine different elements and automate the execution of different steps of the framework using a complete programming language.
