How to Deploy Models Using Statistica SVB Nodes
Abstract
Dell™ Statistica™ is an analytics software package that offers data preparation, statistics, data mining and predictive analytics, machine learning, model deployment, and reporting. This technical brief provides step-by-step instructions for using Statistica to build models from training data and run those models against new data. Specifically, this brief explains how to use Statistica nodes that can be scripted with the Statistica Visual Basic (SVB) language.
Introduction
One important aspect of data mining is building models from training data and running those models against new data or testing data. The process of applying a trained model to new data is known as deployment: after a satisfactory model or set of models has been identified (the “trained” model), you deploy those models with new data to quickly obtain predictions or predicted classifications. When new data needs to be analyzed, you do not have to train the models again; instead, you can simply connect the new data to a prediction node. For example, a credit card company might use a trained model to predict credit risk based on the information provided on credit card applications.
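The train-once, deploy-many idea can be sketched outside Statistica. The following Python sketch is illustrative only: it is not SVB and does not call any Statistica API, and the one-rule "model", the income field, and the sample values are all hypothetical stand-ins for a real trained model and real application data.

```python
import json

def train(rows):
    """Learn a one-rule model: the income cutoff halfway between class means."""
    good = [r["income"] for r in rows if r["credit"] == "Good"]
    bad = [r["income"] for r in rows if r["credit"] == "Bad"]
    cutoff = (sum(good) / len(good) + sum(bad) / len(bad)) / 2
    return {"cutoff": cutoff}

def deploy(model, rows):
    """Apply the stored model to new rows; no training happens here."""
    return ["Good" if r["income"] >= model["cutoff"] else "Bad" for r in rows]

# Hypothetical training data (not taken from Creditscoring.sta).
training = [
    {"income": 60, "credit": "Good"},
    {"income": 80, "credit": "Good"},
    {"income": 20, "credit": "Bad"},
    {"income": 40, "credit": "Bad"},
]

# Train once, then persist the model so it can be reused later.
saved = json.dumps(train(training))

# Later: load the stored model and score new applications directly.
model = json.loads(saved)
new_applications = [{"income": 75}, {"income": 30}]
predictions = deploy(model, new_applications)
print(predictions)
```

The point of the sketch is the separation of the two functions: training runs once, and deployment only reads the stored result.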
This technical brief details, step by step, how to deploy models using Statistica nodes that offer scripting ability with the Statistica Visual Basic language. For more information about using Statistica nodes, as well as detailed instructions on deploying nodes that do not offer SVB scripting, see the related technical brief, “How to Deploy Models using Statistica Nodes.”
This technical brief assumes that you have a basic understanding of how to navigate through the workspace. If you need a refresher, see “How to Navigate the Statistica Workspace.”
Deployment using SVB scripted nodes
For this tech brief, we will use historical data on customers who either satisfy their loans (have “Good” credit) or default on their loans (have “Bad” credit), which is provided in the sample spreadsheet Creditscoring.sta. We will split this data into training and testing data, use two nodes to model the training data set, and then compare the predictions from those models.
The sample data set
To open the sample data set, go to the Home tab and click Open Examples from the Open drop-down menu, as shown in Figure 1. From the Examples folder, select the Datasets folder and then double-click the file Creditscoring.sta.
To add the spreadsheet to a workspace, go to the Home tab, and in the Output group, click Add to Workspace and then click Add to New Workspace, as illustrated in Figure 2.
Figure 1. Opening the sample data set
Figure 2. Adding the sample data set to the workspace
Step 1: Prepare to split the sample data into training and testing samples by adding a node.
First, we must split the sample data into training and testing samples. Select Scripted Procedures, as shown in Figure 3.
On the Data tab in the Manage group, click Sampling. On the resulting submenu, select Split Input Data into Training and Testing Samples (Classification).
Double-click the node to display the Edit Parameters dialog box. Specify 25 in the Approximate percent of cases for testing box, as shown in Figure 4. Click OK to close this dialog box.
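To make the "approximate percent" setting concrete, here is a minimal Python sketch of one way such a split can work: each case is sent to the testing sample with probability 25/100, so the testing share is close to, but not exactly, 25 percent. This is an assumption for illustration, not a description of how the Statistica node is implemented; the function name split_cases and the seed parameter are hypothetical, and any class-aware behavior implied by "(Classification)" in the node name is not modeled.

```python
import random

def split_cases(cases, test_percent, seed=0):
    """Send each case to testing with probability test_percent / 100."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    training, testing = [], []
    for case in cases:
        if rng.random() < test_percent / 100:
            testing.append(case)
        else:
            training.append(case)
    return training, testing

cases = list(range(1000))  # stand-ins for spreadsheet rows
training, testing = split_cases(cases, test_percent=25)
print(len(training), len(testing))
```

Because assignment is random per case, the two counts vary with the seed, which is why the dialog box describes the percentage as approximate.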
Figure 3. Accessing the scripted procedures
Figure 4. Specifying the percentage of cases for testing
Step 2: Select the dependent variables and predictors.
Double-click the CreditScoring data set.
In the variable selection dialog box, click the Variables button and specify the variables as shown in Figure 5. Then click OK to close the dialog box.
Step 3: Split the sample data into training and testing samples by running the node.
Run the Split Input Data into Training and Testing Samples (Classification) node by doing either of the following:
• Click the green arrow icon in the lower-left corner of the node.
• Right-click the node and click Run to Node from the shortcut menu.
After you run the node, your workspace should look like Figure 6.
Step 4: Add the appropriate nodes to the workspace.
Now we will use the Boosted Classification Trees and Random Forest Classification nodes to model the training data set. Scripted procedures have two types of nodes: one with deployment and one without. Since the models in our example will be deployed to the testing data, it is important to select the nodes that contain the deployment feature.
Click the Node Browser button to display the Node Browser. In the left pane, expand the Data Mining folder and then the Deployment folder. Then expand the Scripted Deployment folder and select Classification. In the right pane, double-click the Boosting Classification Trees with Deployment node (as shown in Figure 7) to add it to the workspace.
Then double-click Random Forest Classification with Deployment and Compute Best Predicted Classification from all Models to add those nodes to the workspace as well.
Alternatively, you can drag the nodes to the workspace.
Figure 5. Selecting dependent variables and predictors
Figure 6. The workspace after running the node
Step 5: Connect the data to the nodes.
Now, connect the Training Data node to the Boosting Classification Trees with Deployment and Random Forest Classification with Deployment nodes, and connect the Testing Data node to the Compute Best Predicted Classification from all Models node.
To connect two nodes, click the gold diamond icon on the center-right side of one node, hold down the mouse button, drag an arrow to the other node, and then release the mouse button.
Figure 7. Adding the Boosting Classification Trees with Deployment node to the workspace
Step 6: Run the models.
Click Run All in the upper-left corner of the workspace. The workspace should look like Figure 8.
The resulting Final Prediction for Credit Rating spreadsheet, shown in Figure 9, contains the final credit predictions from each model (in the CBT_Prediction and RF_Prediction columns), plus the voted predictions (in the Ensemble_Prediction column).
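The relationship between the per-model columns and the voted column can be illustrated with a short Python sketch. The column names come from the spreadsheet described above, but the sample rows are invented, and the tie-breaking rule (the first model listed wins when the two models disagree) is an assumption made for this sketch; the brief does not state how Statistica resolves ties.

```python
from collections import Counter

def vote(predictions):
    """Return the most common label; ties go to the label seen first."""
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label

# Invented example rows using the column names from the spreadsheet.
rows = [
    {"CBT_Prediction": "Good", "RF_Prediction": "Good"},
    {"CBT_Prediction": "Bad", "RF_Prediction": "Good"},
    {"CBT_Prediction": "Bad", "RF_Prediction": "Bad"},
]
for row in rows:
    row["Ensemble_Prediction"] = vote(
        [row["CBT_Prediction"], row["RF_Prediction"]]
    )

# Share of cases on which the two models agree.
agreement = sum(
    r["CBT_Prediction"] == r["RF_Prediction"] for r in rows
) / len(rows)
print([r["Ensemble_Prediction"] for r in rows], agreement)
```

With only two models, every disagreement is a tie, so the agreement rate is a quick way to see how often the voted column depends on the tie-breaking rule.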
Figure 9. The resulting spreadsheet, which contains the final credit predictions from each model, plus the voted predictions
Figure 8. The workspace after running the models
Figure 10. Error message displayed if a node with no deployment is selected
Troubleshooting
If you selected a node without deployment in Step 4 and then run the Compute Best Predicted Classification from all Models node, the error shown in Figure 10 is displayed.
To resolve the error, return to Step 4 and select nodes whose names include the words “with Deployment.”
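The naming check described above is simple enough to express as code. The following Python sketch is hypothetical: it works on a plain list of node names that you type in yourself, not on a real workspace, and "Random Forest Classification" is used here only as an assumed name for a non-deployment variant.

```python
def nodes_missing_deployment(node_names):
    """Return the node names that do not contain 'with deployment'."""
    return [n for n in node_names if "with deployment" not in n.lower()]

selected = [
    "Boosting Classification Trees with Deployment",
    "Random Forest Classification",  # assumed name of a non-deployment node
]
missing = nodes_missing_deployment(selected)
print(missing)
```

Any name returned by the check identifies a node to replace with its deployment counterpart before running the voting node.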
Conclusion
Statistica delivers advanced and predictive analytics, data mining, statistical analysis and advanced machine learning algorithms for building predictive models. This technical brief shows how easy it is to create an analysis using Statistica nodes with SVB scripting ability. Be sure to explore the many other features of these nodes not illustrated here.
About Dell Software
Dell Software helps customers unlock greater potential through the power of technology, delivering scalable, affordable and simple-to-use solutions that simplify IT and mitigate risk. The Dell Software portfolio addresses five key areas of customer needs: data center and cloud management, information management, mobile workforce management, security and data protection. This software, when combined with Dell hardware and services, drives unmatched efficiency and productivity to accelerate business results. www.dellsoftware.com.
© 2014 Dell, Inc. ALL RIGHTS RESERVED. This document contains proprietary information protected by copyright. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording for any purpose without the written permission of Dell, Inc. (“Dell”).
Dell, Dell Software, the Dell Software logo and products—as identified in this document—are registered trademarks of Dell, Inc. in the U.S.A. and/or other countries. All other trademarks and registered trademarks are property of their respective owners.
The information in this document is provided in connection with Dell products. No license, express or implied, by estoppel or otherwise, to any intellectual property right is granted by this document or in connection with the sale of Dell products.
EXCEPT AS SET FORTH IN DELL’S TERMS AND CONDITIONS AS SPECIFIED IN THE LICENSE AGREEMENT FOR THIS PRODUCT, DELL ASSUMES NO LIABILITY WHATSOEVER AND DISCLAIMS ANY EXPRESS, IMPLIED OR STATUTORY WARRANTY RELATING TO ITS PRODUCTS INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT. IN NO EVENT SHALL DELL BE LIABLE FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE, SPECIAL OR INCIDENTAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION OR LOSS OF INFORMATION) ARISING OUT OF THE USE OR INABILITY TO USE THIS DOCUMENT, EVEN IF DELL HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Dell makes no representations or warranties with respect to the accuracy or completeness of the contents of this document and reserves the right to make changes to specifications and product descriptions at any time without notice. Dell does not make any commitment to update the information contained in this document.
If you have any questions regarding your potential use of this material, contact:
Dell Software 5 Polaris Way Aliso Viejo, CA 92656 www.dellsoftware.com
Refer to our Web site for regional and international office information.
For More Information