Open source framework for data-flow visual analytic tools for large databases

FP7-318633 AXLE Collaborative Project

D5.6 v1.0

WP5 – Visual Analytics: D5.6 Open source framework for dataflow visual analytic tools for large databases

Dissemination Level: Public

Lead Editor: Janez Demsar

Date: 30 April 2015

Status: Final

Description from Description of Work:

T5.5. Dataflow visual analytic tools for large data bases

We will investigate ways in which the existing dataflow model can work within large databases. We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange’s widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data. Most operations will modify the processing instructions instead of retrieving and modifying the data. We will also move as many data operations as possible to the database server.

Contributors:

Anze Staric, UL
Janez Demsar, UL

Internal Reviewer(s):

BSC: Adrian Cristal, Nehir Sonmez
2ndQ: Tomas Vondra

Version History

Version  Date          Authors             Sections Affected
0.1      Apr 25, 2015  A Staric, J Demsar  initial draft
1.0      Apr 30, 2015  J Dix               Final version change

List of Figures

1 Example schema
2 Selecting columns in the Select Columns widget
3 Filtering using the Select Rows widget

1. Summary

The goal of this task is to create a framework similar to existing dataflow-based data mining frameworks, except that the “flow” does not consist of data but of metadata that can be translated into actual SQL queries when needed.

In this report, we first provide a short description of the work done, followed by an example schema and the corresponding queries.

2. Description

Dataflow models consist of units (Orange calls them widgets) that manipulate the data. As the data passes through multiple units, manipulations (e.g. filters, value transformations and similar) “pile up”. The task of reimplementing the widgets so that they pass metadata instead of data has been greatly simplified by the work done within T5.1. With the categories defined in T5.1, dataflow visual analytic tools can be used on large databases if we can efficiently implement the computation of (aggregate) data in those categories, and if each unit’s functionality can be expressed in terms of these categories. The former, efficient implementation of data manipulation, is the core of the AXLE project and is researched in WPs 2-4. The latter was explored within T5.5 and is presented in this deliverable.

In parallel with AXLE and in part supported by it, we developed a new major version of Orange that allows for different types of data storage. Besides the in-memory storage, widgets can now pass data of other types, such as SQL queries. The storage also defines the basic operations on the data, such as various aggregations, filtering and so forth. In the case of SQL data, basic data manipulation, like filtering or feature construction, is implemented by changing the column selection or adding conditions to the WHERE clause in the SELECT statement. Only small amounts of aggregated data are actually passed to the widgets on the client side.

The task description states that “We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange’s widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data.” An initial study showed that PL/Python scripts do not offer any advantages in comparison to standard SQL queries. However, the work done within T5.1 allowed us to go beyond the pilot reimplementation of Orange: in the new version of Orange, most widgets related to data processing and visualization already work with data stored in databases that are accessible over non-local connections.

Widgets for supervised and unsupervised machine learning and statistics mostly require actual data and not just aggregates; if the data is small enough, these widgets can transfer it from the database and compute locally. If the data fits into working memory, the data transfer can be avoided by using Remote Orange; see the example with Principal Component Analysis (PCA) in deliverable D5.5. If the data is extremely large and does not fit into working memory, machine learning and statistical methods need to be replaced with iterative algorithms. This is not a part of the AXLE project; AXLE, however, provides the necessary architecture, as shown in the example with the PCA, which indeed uses an iterative PCA instead of the standard one.
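The idea that units pass a description of a query rather than data, and that filtering or column selection only changes that description, can be illustrated with a minimal sketch. The class name QueryDescription, its methods, and the table and column names are hypothetical and chosen for illustration only; this is not the storage class used in Orange.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class QueryDescription:
    """Hypothetical stand-in for the metadata passed between units."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()

    def select_columns(self, *names: str) -> "QueryDescription":
        # Narrow the column list; nothing is sent to the database.
        return replace(self, columns=tuple(names))

    def where(self, condition: str) -> "QueryDescription":
        # Append one condition; conditions added upstream are kept.
        return replace(self, conditions=self.conditions + (condition,))

    def to_sql(self) -> str:
        # Build the SQL text only when a unit actually needs it.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql


# Illustrative table and column names (assumptions, not from the report).
source = QueryDescription("samples")
selected = source.select_columns("class_label", "x1", "x2")
filtered = selected.where("x1 > 13").where("x2 > 2.3")
print(filtered.to_sql())
```

Each operation returns a new immutable description, so a unit that branches the flow in two directions does not affect the other branch, and the query text is assembled only on demand.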

3. Demonstration

We demonstrate how the workflow works with a simple schema (Figure 1). We connect to the database (SQL Table), select a few columns and define their roles (Select Columns), and filter out some rows (Select Rows). We discretize the data, induce a naive Bayesian classifier and show a Mosaic Display. On the original data, we compute PCA and observe the projections with a Heat map.

Figure 1. Example schema

Widgets communicate with each other by passing metadata, which can be used to construct the corresponding SQL query that would retrieve the data. The database is accessed only when needed, and none of the widgets in the schema actually retrieves any row data from the original table to the desktop client. To demonstrate this principle, we show the SQL queries that would be executed if the Data Table widget, which can be used to show row data, were attached to various widgets.

We ran the schema on one of the common benchmark data sets from the UCI ML Repository, the Wine data set. For the purpose of these experiments, the data set size was increased (by resampling and adding some artificial noise) to 100 million data rows.

SQL Table analyzes the data and creates a corresponding instance of the Domain class in Orange. The query that retrieves the data needed for this is as follows.

The widget outputs the data that would effectively translate into the following query.

In the Select Columns widget we selected the first four columns and designated “Wine” as the target variable.

Figure 2. Selecting columns in the Select Columns widget

The corresponding query is simplified to

The fact that “Wine” is the target variable is not reflected in the query itself.

In the Select Rows widget, we selected the wine samples with alcohol above 13 % and malic_acid above 2.3 %.

Figure 3. Filtering using the Select Rows widget.

This adds the corresponding conditions to the query, still without actually executing (or even actually constructing) it.

The Discretization widget defines categorical variables corresponding to the original continuous columns so that each bin contains a roughly equal number of samples. The thresholds are computed using the quantile function. The widget also supports other types of discretization, in particular the entropy-MDL-based discretization, which requires contingency tables retrieved by the following query:

and discretization into bins of equal width. The output of that widget, if translated to a query, would look as follows.

A naive Bayesian classifier is induced by computing cross tables from this data. Currently, separate queries are executed for each column.
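As a rough illustration of how a cross table for one discretized column can be computed on the server and turned into the conditional probabilities a naive Bayesian classifier needs, consider the sketch below. It uses an in-memory SQLite database as a stand-in for the actual database server, and the table, columns, threshold and helper functions are all hypothetical; it does not reproduce the queries issued by Orange, and it omits any smoothing of the probability estimates.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (label TEXT, x REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 6.0), ("b", 5.0), ("b", 7.0), ("b", 8.0)],
)


def discretized_expr(column, thresholds):
    """Return a SQL CASE expression mapping a continuous column to bin indices."""
    parts = [f"WHEN {column} < {t} THEN {i}" for i, t in enumerate(thresholds)]
    return f"CASE {' '.join(parts)} ELSE {len(thresholds)} END"


def cross_table(conn, table, column_expr, target):
    """Count rows per (bin, class) pair on the server with GROUP BY."""
    sql = (
        f"SELECT {column_expr}, {target}, COUNT(*) "
        f"FROM {table} GROUP BY {column_expr}, {target}"
    )
    return {(b, c): n for b, c, n in conn.execute(sql)}


def conditional_probabilities(counts):
    """Estimate P(bin | class) from the cross table, without smoothing."""
    totals = {}
    for (_, cls), n in counts.items():
        totals[cls] = totals.get(cls, 0) + n
    return {(b, cls): n / totals[cls] for (b, cls), n in counts.items()}


# One query per column, as in the text; here there is a single column "x".
counts = cross_table(conn, "samples", discretized_expr("x", [4.0]), "label")
probabilities = conditional_probabilities(counts)
print(counts)
print(probabilities)
```

Only the aggregated counts cross the connection, one small result set per column, regardless of how many rows the table holds.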

The corresponding queries for the Mosaic plot are similar.

PCA is computed iteratively using Remote Orange, described in deliverable D5.5. It retrieves actual data rows in batches and updates the projection until convergence or until the client aborts the computation. The output of the PCA widget is a transformed table in which each column corresponds to one principal component. We set the widget to output only two components, resulting in the following output query.

Finally, this is the beginning of the query that computes the heat map:

This demonstration shows how Orange widgets process SQL data by modifying the query instead of retrieving and transforming the data. In particular, the last query contains the modifications added by all upstream widgets (some of which are obscured by the use of a temporary sample table).
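The batch-wise PCA computation mentioned in the demonstration, in which the projection is updated as batches of rows arrive and the client may stop at any point, can be sketched as follows. The sketch keeps running sums to maintain a covariance estimate and refines the first principal axis by power iteration after every batch. This particular technique, the function name and the toy data are assumptions made for illustration; the report does not state which iterative algorithm Remote Orange uses.

```python
import math
from typing import Iterable, Iterator, List, Sequence


def first_component_estimates(
    batches: Iterable[Sequence[Sequence[float]]],
    dim: int,
    power_steps: int = 50,
    tol: float = 1e-12,
) -> Iterator[List[float]]:
    """Yield an updated estimate of the first principal axis after each batch."""
    n = 0
    sums = [0.0] * dim
    cross = [[0.0] * dim for _ in range(dim)]
    v = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length starting direction
    for batch in batches:
        # Fold the batch into the running sums; the rows can then be discarded.
        for row in batch:
            n += 1
            for i in range(dim):
                sums[i] += row[i]
                for j in range(dim):
                    cross[i][j] += row[i] * row[j]
        if n == 0:
            continue  # nothing seen yet, so there is nothing to estimate
        means = [s / n for s in sums]
        cov = [
            [cross[i][j] / n - means[i] * means[j] for j in range(dim)]
            for i in range(dim)
        ]
        # Power iteration, warm-started from the previous estimate.
        for _ in range(power_steps):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance seen yet; keep the current direction
            w = [x / norm for x in w]
            change = max(abs(a - b) for a, b in zip(v, w))
            v = w
            if change < tol:
                break
        yield list(v)


# Toy batches lying on the line y = 2x (illustrative data only).
batches = [[(0.0, 0.0), (1.0, 2.0)], [(2.0, 4.0), (3.0, 6.0)]]
estimates = list(first_component_estimates(batches, dim=2))
print(estimates[-1])
```

Because the function is a generator, the caller can stop consuming estimates at any time, which mirrors the client aborting the computation; only the running sums, not the rows, are kept between batches.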

4. Conclusion

All functionality presented in this report is already included in the working version of Orange, available on its website (http://orange.biolab.si/orange3/). The source code is released under the BSD license (except for the GUI part, which is under GPL due to its dependency on PyQt) and is available on GitHub (https://github.com/biolab/orange3).
