4. Implementing the EDA Miner tool
4.2 Per-app implementation notes
4.2.3 Modeling app
Finally we come to the last of the three main apps, and the one responsible for all the machine learning and statistical modeling. Similarly to the previous two, it contains tabs to navigate across the sub-pages. We will review them in reverse order since the first two are more complex.
https://github.com/bokuweb/react-rnd
Figure 39: EDA Miner / Visualization / Network graphs
The last tab, “Single model”, offers a simple menu of dropdowns in which the user picks a dataset, a problem type (e.g. regression, classification), a model (e.g. “KNN classifier”), the regressor variables (X, multiple allowed), and the target variable (Y). It then fits the model, returns metrics according to the problem type (a classification report and accuracy for classification, the MSE for regression), and provides an interactive visualization.
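The fit-and-report step described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name fit_and_score and the synthetic datasets are assumptions made for the example, and the metrics are computed on the training data purely for brevity.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(model, X, y, problem_type):
    """Fit `model` on (X, y) and return metrics suited to the problem type.

    Hypothetical helper: the name and return format are assumptions.
    """
    if problem_type not in ("classification", "regression"):
        raise ValueError(f"Unsupported problem type: {problem_type!r}")

    model.fit(X, y)
    predictions = model.predict(X)  # scored on the training data for brevity

    if problem_type == "classification":
        return {
            "accuracy": accuracy_score(y, predictions),
            "report": classification_report(y, predictions),
        }
    return {"mse": mean_squared_error(y, predictions)}


# Synthetic stand-ins for a user-selected dataset and X/Y variables.
X_clf, y_clf = make_classification(n_samples=100, n_features=4, random_state=0)
clf_metrics = fit_and_score(KNeighborsClassifier(), X_clf, y_clf, "classification")

X_reg, y_reg = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)
reg_metrics = fit_and_score(LinearRegression(), X_reg, y_reg, "regression")
```

In a real deployment the metrics would more sensibly be computed on held-out data; the text does not state which split the tool uses.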
The second tab, “Pipelines trainer”, has exactly the same interface for the results of model fitting but differs in how models are selected. The models in this case are pipelines exported from the ModelBuilder (see below), which also means that the datasets need to be defined there (although this might change later). As soon as you select a pipeline and the X/Y variables, the model is trained and the results are displayed. The trained model is then stored in Redis as a pickled object for one hour. If the user chooses to “export” the model, it is kept permanently (using the Redis PERSIST command, which removes the key’s expiry). Exported models become available in the Model Devops app (not analyzed here because it is in the early stages of development).
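The caching behaviour described above (a pickled model with a one-hour expiry that is made permanent on export) might look roughly like the sketch below. The function names and the key are hypothetical. The client argument is assumed to be any object exposing set, persist and get in the style of redis-py (for example redis.Redis()); the redis package itself is not imported here, so the functions can also be exercised with a stand-in object.

```python
import pickle

ONE_HOUR_SECONDS = 60 * 60


def cache_model(client, key, model, ttl=ONE_HOUR_SECONDS):
    """Store a pickled model under `key`; it expires after `ttl` seconds."""
    # redis-py style: SET with the `ex` option sets the expiry in seconds.
    client.set(key, pickle.dumps(model), ex=ttl)


def export_model(client, key):
    """Remove the expiry so the cached model is kept indefinitely.

    Returns True if an expiry was actually removed (PERSIST semantics).
    """
    return bool(client.persist(key))


def load_model(client, key):
    """Return the unpickled model, or None if the key is missing or expired."""
    raw = client.get(key)
    return None if raw is None else pickle.loads(raw)
```

Note that unpickling executes arbitrary code, so this pattern is only appropriate when the Redis instance is trusted.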
The first tab is the “ModelBuilder”. Let’s first inspect the front-end implementation. It provides a small menu that lets you define a complex model / pipeline. The pipeline is implemented as a directed graph (for now it is expected to be acyclic) using dash-cytoscape. The nodes map 1:1 to sklearn-like model classes (estimators in sklearn terms), while the edges simply define links among them. Each node belongs to exactly one of the following categories, or “parents”: Inputs (choosing an input source), Cleaning (not implemented), Preprocessing (e.g. StandardScaler, MinMaxScaler, PolynomialFeatures), Dimensionality Reduction (PCA, NMF, TruncatedSVD), and Estimators (e.g. regressors, classifiers). Each category has an order parameter indicating its place in the pipeline (a higher order means closer to the output). Nodes are added and removed with dropdowns. To connect nodes, the user clicks two or more of them (both the click order and the order parameter matter) and then clicks the “Connect selected nodes” button. Another dropdown lets the user select a prebuilt model tailored for a specific task. When the model definition is finished, a “Save graph and export pipelines” button both saves the current graph and creates one separate pipeline for every output (Estimators) node.
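One way to picture the order parameter is as an integer rank attached to each category. The sketch below assumes illustrative rank values (the actual numbers used by the tool are not given in the text) and encodes the connection rule stated later in this section: an edge is allowed within a category, or from a lower-order category to a higher-order one. The names Category, can_connect and allowed_targets are hypothetical.

```python
from enum import IntEnum


class Category(IntEnum):
    """Node categories; the integer value plays the role of the order parameter.

    The concrete numbers are assumptions; only their relative order matters.
    """
    INPUTS = 0
    CLEANING = 1
    PREPROCESSING = 2
    DIMENSIONALITY_REDUCTION = 3
    ESTIMATORS = 4


def can_connect(source: Category, target: Category) -> bool:
    """Allow edges within a category or toward a higher-order category."""
    return source == target or source < target


def allowed_targets(source: Category) -> list:
    """List every category a node of category `source` may connect to."""
    return [category for category in Category if can_connect(source, category)]


preprocessing_targets = allowed_targets(Category.PREPROCESSING)
```

With these ranks, a Preprocessing node may link only to Preprocessing, Dimensionality Reduction, and Estimators nodes, which matches the example given in the text.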
Let’s look at the implementation and the class hierarchy in more detail. There are six classes in total: Node, NodeCollection, Edge, EdgeCollection, _Graph, and Graph. Nodes are implemented as data classes that use Python’s __slots__ for faster attribute access and lower memory consumption, and they hold node-only attributes. Nodes’ model classes are expected to provide a typical sklearn interface (that is, to implement train, fit, and predict methods). Each node stores its position (xpos, ypos), its parent and the parent’s order, a label, a node_id (e.g. “linr_001”), and a node_type (a string that maps to a sklearn-like class, e.g. “linr”). A NodeCollection contains Node objects, points to the Graph it is part of, keeps track of how many nodes of each node_type it contains, and defines a few helper methods for creating, removing, and adding nodes. In a similar fashion, the Edge class holds a source and a destination node, and whether the edge is bidirectional or not (currently all edges are directed), while the EdgeCollection holds Edge objects, a pointer to the parent Graph, and utility methods to create and add new edges. The _Graph class is the old implementation of the Graph class and will probably be removed once the current code refactoring is finished; it contains a NodeCollection and an EdgeCollection. Finally, the Graph class contains the _Graph object and keeps track of the input and output nodes. It defines a dispatch method that uniformly handles all the other methods and lets us create Dash callbacks and UI with ease, as in the Data app for APIs (see above). Every object has a render method which returns the representation of the object for plotting in dash-cytoscape.
Figure 42: EDA Miner / Modeling / ModelBuilder / Class hierarchy
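To make the Node and Edge description more concrete, here is a minimal sketch of slotted data classes with a render method. The field names follow the attributes listed above, but the field types, the Edge field names, and the exact dictionaries returned by render are assumptions; the dictionaries follow the usual dash-cytoscape element layout (a "data" entry, plus "position" for nodes).

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Declaring __slots__ removes the per-instance __dict__, which saves memory
    # and restricts instances to exactly these attributes.
    __slots__ = ("node_id", "node_type", "label", "parent", "order", "xpos", "ypos")
    node_id: str
    node_type: str
    label: str
    parent: str
    order: int
    xpos: float
    ypos: float

    def render(self) -> dict:
        """Return a dash-cytoscape style node element (assumed layout)."""
        return {
            # "parent" assumes a compound node with that id is rendered as well.
            "data": {"id": self.node_id, "label": self.label, "parent": self.parent},
            "position": {"x": self.xpos, "y": self.ypos},
        }


@dataclass
class Edge:
    __slots__ = ("source", "dest")
    source: Node
    dest: Node

    def render(self) -> dict:
        """Return a dash-cytoscape style edge element (assumed layout)."""
        return {"data": {"source": self.source.node_id, "target": self.dest.node_id}}


scaler = Node("std_001", "std", "Standard scaler", "Preprocessing", 2, 100.0, 50.0)
model = Node("linr_001", "linr", "Linear regression", "Estimators", 4, 300.0, 50.0)
edge = Edge(scaler, model)
elements = [scaler.render(), model.render(), edge.render()]
```

The resulting elements list is the kind of structure that could be passed to a dash-cytoscape component for plotting.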
A few important points must be stressed. For now, we allow connections only among nodes of the same category, or from a source node to a target node of higher order (e.g. a Preprocessing node can connect only to Preprocessing, Dimensionality Reduction, and Estimators nodes). Graphs must also be directed and acyclic (a “DAG”), so Markovian models are not supported yet. In addition, the conversion of these Graphs into actual pipelines is not fully implemented, in the sense that some core features are still missing: multiple input nodes are not handled well (e.g. there are no table union nodes, just dataframe concatenation across one axis), and both multiple input and multiple output nodes are still not well tested. The reason is that we have not yet implemented a proper parser; instead we rely on a recursive _traverse_graph function that uses sklearn’s FeatureUnion (when one or more nodes are the input of another) and Pipeline (to connect the output of the FeatureUnion with the next node) to create a model. While this approach has simplified coding, it is not free of errors (at least until a more careful implementation of the Graph class is done). We also did not implement graph traversal algorithms ourselves (and won’t until it is absolutely necessary); for that we convert the Graph into a networkx.DiGraph object and use its methods. Finally, all model classes, aside from the interface requirements mentioned above, also need to define a modifiable_params dictionary whose keys are the model parameters and whose values are lists of available choices (the first being the default). This prevents users from entering incorrect or undesirable values (e.g. an n_jobs=1000 that can crash the server).
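Returning to the FeatureUnion and Pipeline composition mentioned above, the sketch below shows the kind of object such a traversal might produce for a small graph in which two preprocessing nodes feed a dimensionality-reduction node, which in turn feeds an estimator. It is hand-built for illustration and does not reproduce _traverse_graph; the step names (in the style of node ids such as “linr_001”) and the synthetic data are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two nodes that are both inputs of the next node are merged with FeatureUnion,
# which concatenates their outputs column-wise.
merged_inputs = FeatureUnion([
    ("std_001", StandardScaler()),
    ("mms_001", MinMaxScaler()),
])

# Pipeline then chains the merged output into the downstream nodes.
pipeline = Pipeline([
    ("merged_inputs", merged_inputs),
    ("pca_001", PCA(n_components=2)),
    ("linr_001", LinearRegression()),
])

# Synthetic data standing in for a user-selected dataset.
X, y = make_regression(n_samples=50, n_features=3, random_state=0)
pipeline.fit(X, y)
predictions = pipeline.predict(X)

# Each scaler keeps the 3 input columns, so the union yields 3 + 3 columns.
merged_width = merged_inputs.fit_transform(X).shape[1]
```

Because FeatureUnion only concatenates along the feature axis, this composition also illustrates the limitation noted above: it covers column-wise merging but not richer operations such as table unions.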