Function transformation - A model driven data analysis architecture enabling reuse and insight

This function transformation process requires the definition of a SQL metamodel. Its only use is to generate code based off of functions defined in our metamodel. Hence support for a subset of SQL is sufficient.

This metamodel definition is shown in figure 6.2. Because its only use-case is code generation, many simplifications can be implemented. For example, the Selector element is simplified to be just a string. If SQL functions are needed, this function can be incorporated into this string. While this poses problems when targeting different SQL dialects and is not an ideal solution, it allows us to show the concepts in the prototype implementation.

With the use of this new metamodel, transformation rules are defined that transform a function model into an SQL query model. The function metamodel from chapter 5, designed in such that it can represent a pipeline in which each step is executed consecutively. It also described the impact on the data for each of these operations, which are separately easily transferable to SQL operations. This means our transformation needs to handle concrete SQL syntax based off of these definitions, the initial data selection and a method of consecutive execution of these operations.

The concrete SQL syntax definition is the simplest step. Each operation can be directly converted to an individual SQL query, based on the definitions in section 5.3.1.

For data selection, the assumption is made that the SQLite database used contains a table with the actual data. The name of this table matches the key of the dataset. Every column in such a table corresponds to the keys as specified in the dataset itself. Importing a CSV file in SQLite is easily done using the .importcommand.

There are two methods to link subqueries using SQL. Awithclause allows users to specify queries separately and reference them using an alias. The transformation can be created such that every operation from the function model is a separate SQL query and references the result of the previous function using an alias. This makes the transformation straightforward, as well as the resulting SQL query organised and understandable.

Another method is to use nested subqueries. This decreases the need of aliasing and referencing, but at the cost of the resulting query having a level of nesting for every operation and thus becoming complex to read for elaborate functions. In the end, the results are very similar. We opted for the first, as it is closer to the function metamodel and generates more readable SQL code. Using these elements, we define the transformation structure as follows:

• Always start with a full selection of the source, because we need this data to start with, e.g. “SELECT * FROM dataset”. This query needs to be wrapped in an alias and used in the with query, which results in “WITH 1 as (SELECT * FROM dataset)”

• Create awith subquery that performs a query using the previous query, e.g. “SELECT key1, key2 FROM 1”. To keep the aliases simple, we use the combination of an underscore and a counter.

• After the last operation, create the main sql query that is a selection. While we could just use the last operation as the main query, that generates edge cases in the transformation that we have to solve. This method works generally, at the cost of some verbosity in the resulting query.

There is one additional complexity, which is handling the subexpressions of the initial function model. Because SQL does not support nested WITH expressions, the tree structure needs to be flattened. In order to do this, the order for execution for sub-expressions must be resolved. When subexpression A depends on sub-expression B, B needs to be transformed before A.

6.3.1 Transformation example

As an example, we will investigate the issue of counting the amount of places for childcare services per neighborhood, such that these can be compared with statistical information from these neighborhoods. The query needed for this operation is shown in listing 6.1.

Listing 6.1: A query that calculates the amount of childcare services present per neighborhood, by first merging the neighborhood information from an external dataset followed by an aggregation operation

1 select lrkp

2 restrict metrics on keys actuele_naam_oko, opvanglocatie_postcode, opvanglocatie_huisnummer, aantal_kindplaatsen, type_oko, ID 3 add from adres_codes

4 gwb2016_code

5 match opvanglocatie_postcode=pc6, opvanglocatie_huisnummer=huisnummer 6 aggregation

7 based on keys gwb2016_code, type_oko

8 sum aantal_kindplaatsen as "Totaal aantal kindplaatsen" 9 count ID as "Aantal kinderopvanglocaties"

This function consists of three different operations (restrict, add from and aggregation). For each of the operations, the impact on the data is known. Hence, conversion to SQL is as simple as translating each of these steps and merging them into a single SQL query, such that each of these steps is executed consecutively.

Listing 6.2: Resulting SQL query, generated based off the transformation from the query in listing 6.1

1 WITH 2 _1 as (

3 SELECT * FROM lrkp

4 ), 5 _2 as (

6 SELECT actuele_naam_oko, opvanglocatie_postcode,

7 opvanglocatie_huisnummer, aantal_kindplaatsen, 8 type_oko, ID 9 FROM _1 10 ), 11 _3 as ( 12 SELECT adres_codes.gwb2016_code, _2.* 13 FROM _2 14 JOIN adres_codes 15 ON _2.opvanglocatie_postcode = adres_codes.pc6

16 and _2.opvanglocatie_huisnummer = adres_codes.huisnummer 17 ),

18 _4 as (

19 SELECT gwb2016_code, type_oko,

20 SUM(aantal_kindplaatsen) as totaal_aantal_kindplaatsen,

21 COUNT(ID) as aantal_kinderopvanglocaties

22 FROM _3

23 GROUP BY gwb2016_code,type_oko

24 )

25 SELECT * FROM _4

It can be directly seen that every operation generates to its own sub- query. The restrict metrics operation (line 2) boils down to a selection of columns. The resulting SQL sub-query (line 5-10) selects these columns, as well as the dimension columns that are present at that location. Without adding these dimension columns, the dimensional information would be lost in the resulting data.

Similarly, the operation on lines 3-5 converts to the SQL sub-query on lines 11-17 and the operation with lines 6-9 results in the SQL sub-query on lines 18-24. Each consecutive function references the key of the preced- ing sub-query, just like each operation acts as on the result of the previous operation.

The last line of the query is a selection on the result of the last operation, which is a method to generalize the transformation for each operation. It does not require a special case to convert the subquery of the last operation

to the main expression, instead of a subexpression.

Executing this sql query on a database including the lrkp dataset as table, yields the results as specified by the function in listing 6.1.

In document A model driven data analysis architecture enabling reuse and insight in open data (Page 70-74)