5.2 Machine learning and statistical algorithms and measures
5.2.7 Comparison Metrics
Analytics Builder uses several comparison metrics, which differ depending on the type of goal. For a continuous numerical goal, Root Mean Square Error (RMSE) and Pearson Correlation are shown. RMSE is a common statistical measure computed from the differences between the observed and predicted values of the goal. RMSE is also displayed normalized between 0 and 1, referred to as RMSE Normalized. The Pearson Correlation, or just Correlation, measures how much the predicted and observed values of the goal vary in unison. Furthermore, a scatter plot is displayed, figure 5.2.4 (PTC 2018b), with predicted results plotted as a function of actual values. Color denotes whether the results were over-predicted, accurately predicted or under-predicted (PTC 2018b).
Figure 5.2.4: Scatter plot for continuous goal variable. Color denotes accuracy of prediction and size denotes number of predictions.
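The two metrics for continuous goals can be illustrated with a minimal Python sketch. The text does not state how Analytics Builder normalizes RMSE, so dividing by the range of the observed values is an assumption made here for illustration only; the function names and the sample values are likewise hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted goal values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def rmse_normalized(observed, predicted):
    """RMSE divided by the observed range (ASSUMED normalization, not confirmed by the source)."""
    span = max(observed) - min(observed)
    return rmse(observed, predicted) / span if span else float("nan")

def pearson(observed, predicted):
    """Pearson correlation: how much the two series vary in unison (-1 to 1)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (spread_o * spread_p)

# Hypothetical values, chosen only to exercise the functions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print(rmse(observed, predicted))
print(rmse_normalized(observed, predicted))
print(pearson(observed, predicted))
```

A low RMSE together with a correlation close to 1 indicates predictions that are both close to and vary together with the observed values.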
For binary goals a confusion matrix, figure 5.2.6 (PTC 2018b), and a Receiver Operating
Characteristics (ROC) curve, figure 5.2.5 (PTC 2018b), are displayed along with a numerical ROC value and the Matthews correlation coefficient (MCC) (PTC 2018b).
The confusion matrix shows the outcomes of the predictions in relation to the actual results. The number of records and the percentage of the validation set are shown for true positives (prediction and actual result true), false positives (prediction true and actual result false), true negatives (prediction and actual result false) and false negatives (prediction false and actual result true) (Fawcett 2005).
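As an illustration of the four cells described above, the following Python sketch counts them for a small validation set and converts the counts to percentages. The function names and the data are hypothetical and are not taken from Analytics Builder.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary outcomes."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1          # prediction true, actual true
        elif p and not a:
            fp += 1          # prediction true, actual false
        elif not p and not a:
            tn += 1          # prediction false, actual false
        else:
            fn += 1          # prediction false, actual true
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

def as_percentages(counts):
    """Express each cell as a percentage of the validation set."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}

# Hypothetical validation set of eight records.
actual =    [True, True,  False, False, True, False, False, True]
predicted = [True, False, False, True,  True, False, False, True]

counts = confusion_counts(actual, predicted)
print(counts)
print(as_percentages(counts))
```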
Figure 5.2.6: ROC curve. The curve has the desired appearance, by going up quickly, showing that true positives significantly outnumber false positives.
The ROC curve plots the true positive rate as a function of the false positive rate for various thresholds. The threshold matters mostly when predicted values between false (0) and true (1) are meaningful. It is desirable that the curve follows the left (west) and top (north) edges of the plot as closely as possible, which corresponds to a high true positive rate at a low false positive rate. Visualizing the ROC can give a more intuitive feeling for the prediction performance than simply calculating the true and false positive rates (Fawcett 2005).
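The construction of the curve can be sketched in Python as follows. Each distinct score is used as a threshold, and the area under the curve is computed with the trapezoidal rule. That the numerical ROC value shown by Analytics Builder is this area is an assumption; the function names and data are hypothetical.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(1 for a in actual if a)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step classifies more records as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and not a)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve (ASSUMED to match the numerical ROC value)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical actual outcomes and predicted scores between 0 and 1.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

An area close to 1 corresponds to a curve that hugs the left and top edges of the plot, while 0.5 corresponds to the diagonal of a random guess.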
MCC measures the correlation between the predicted and the actual binary outcomes, using all four counts of true and false positives and negatives. It can be calculated from the confusion matrix according to equation 5.2.4:
$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}} \qquad (5.2.4)$$
with tp being the number of true positives, tn true negatives, fp false positives and fn false negatives. An advantage of the MCC is that it takes the entire result population into account and only scores high if the model predicts well on both outcomes. It also signals when there are no classifications on one side, for example tp = fp = 0, by being undefined (Chicco 2017).
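A minimal Python sketch of equation 5.2.4 is given below. Returning None for the undefined case is a design choice made for this illustration, and the counts in the examples are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion matrix counts.

    Returns None when the denominator is zero, i.e. when the MCC is undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return None
    return (tp * tn - fp * fn) / denominator

# Balanced, mostly correct predictions.
print(mcc(tp=3, tn=3, fp=1, fn=1))

# No positive predictions at all (tp = fp = 0): the MCC is undefined.
print(mcc(tp=0, tn=5, fp=0, fn=3))
```

The second call shows why the undefined value is informative: a model that never predicts the positive outcome can still reach a high accuracy on unbalanced data, but it does not receive an MCC score.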
6 Comparing analytics tools and CSense
To be able to make a comparative evaluation of ThingWorx Analytics, another analytics program will also perform analysis on the data. This will establish a baseline.
CSense is a product from General Electric (GE) that provides predictive analytics, with goals and uses similar to those of ThingWorx Analytics. It is closely tied to GE's IIoT platform, Predix, and their HMI system, iFix (General Electric 2016). For the moment, though, CSense is not sold commercially, but GE has plans to include it in their cloud analytics.
The main components for analytics and monitoring in CSense are Proficy Troubleshooter and Proficy Cause+. The Troubleshooter has separate versions for continuous process data and for discrete and batch process data. These take data in csv format, just like ThingWorx Analytics, but they do not need a data configuration file that states data shape and use. This is because the Troubleshooter has data preparation as a built-in function, with more options than Analytics Builder's filter function (GE Digital 2010). It does not, however, accept Boolean values. Booleans must instead be encoded as binary variables, 0 and 1.
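The Boolean conversion mentioned above can be done in many ways. The following Python sketch shows one possibility using only the standard library; it is not the procedure used in this project, and the column names and values in the sample are hypothetical.

```python
import csv
import io

def to_binary(cell):
    """Map the text 'true'/'false' (any casing) to '1'/'0'; leave other cells unchanged."""
    lowered = cell.strip().lower()
    if lowered == "true":
        return "1"
    if lowered == "false":
        return "0"
    return cell

def booleans_to_binary(csv_text):
    """Return a copy of the csv text with Boolean cells replaced by 0 and 1."""
    reader = csv.reader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row_index, row in enumerate(reader):
        if row_index == 0:
            writer.writerow(row)  # keep the header row as it is
        else:
            writer.writerow([to_binary(cell) for cell in row])
    return output.getvalue()

# Hypothetical csv content with a Boolean goal column.
sample = (
    "timestamp,rms,dull\n"
    "2018-01-01T00:00:00,0.42,true\n"
    "2018-01-01T00:10:00,0.17,false\n"
)

converted = booleans_to_binary(sample)
print(converted)
```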
There are several tools for visualizing and inspecting the data once it has been uploaded and prepared. One useful feature is the possibility to compute a correlation matrix for all or some of the features. Another is viewing trends and histograms of individual fields. It can also build decision trees that allow the user to investigate how the process ends up in different states (GE Digital 2010). Troubleshooter uses statistical methods and decision trees, just like ThingWorx Analytics. There is, however, no mention of decision tree ensembles such as random forests or gradient boosting machines. For the Continuous Troubleshooter, the statistical methods used are not described in detail and their parameters cannot be adjusted. Continuous modeling tries to accomplish two things. One is to create a non-linear model that follows the target. The other is to create rules for the parameters relating to the target. When using CSense for process control, this can be very useful (GE Digital 2010).
The Discrete and Batch Troubleshooter can create models using Principal Component Analysis, Partial Least Squares and Decision Trees. For batch processes it recommends Principal Component Analysis and for discrete processes it recommends a Decision Tree (GE Digital 2010).
Principal Component Analysis (PCA) is a method for handling processes with a large number of variables that are correlated to a varying degree. By performing linear algebra calculations on the covariance matrix (how the variables vary together), PCA reduces several variables down to a few that still capture the majority of the system characteristics (Wise & Gallagher 1996). Part of this is showing how much of the variability is captured by each of the calculated principal components. A limit on how much of the system needs to be captured can be decided by the monitoring strategy. It is also possible to see which of the original variables make up the principal components and to track any deviations back to the original parameters. This means that an operator can observe the reduced system and only look at the larger number of original parameters when there is a problem (Wise & Gallagher 1996).
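The calculation described above can be written compactly. The notation is an assumption introduced here and does not come from the cited sources: X is the mean-centered data matrix with n samples (rows) and m variables (columns).

```latex
% Covariance matrix of the mean-centered data and its eigendecomposition
C = \frac{1}{n-1} X^{\top} X, \qquad
C\, v_i = \lambda_i v_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0

% Scores of the i-th principal component
t_i = X v_i

% Fraction of the total variability captured by the first k components
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{m} \lambda_j}
```

The eigenvectors v_i (the loadings) show which original variables make up each principal component, and the last expression is the quantity that is compared against the limit set by the monitoring strategy when choosing k.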
Partial Least Squares (PLS) combines PCA with linear regression. This means that it looks at both correlation and covariance of the variables. One of the main benefits of PLS is that it can handle several input variables and several predicted variables (Wise & Gallagher 1996).
For real-time use, GE delivers Proficy Cause+. Cause+ focuses on process monitoring and control and makes use of the models developed with Proficy Troubleshooter. It receives and analyzes real-time data against set alarm levels and can deliver messages and suggestions based on this. It can also log historical data, which allows users and technicians to look at events from previous shifts (GE Digital 2011). In this way, it can perform functions similar to those of ThingWorx Foundation when dealing with process monitoring. For the use in this project, Proficy Continuous Troubleshooter will be the primary tool for comparison. This choice was made for easier comparison to ThingWorx Analytics and because of the nature of the process data.
7 The Veneer Lathe Modeling and Prediction Application Procedure
Here the application used for processing and analyzing data is explained. It consists of several parts, and a path of multiple steps is followed to go from the data onsite to predictive scoring results. This will result in some repetition from the previous chapters, but for the sake of clarity all components will be covered. An overview of how the parts are connected can be seen in figure 7.1.
Figure 7.1: A flow diagram of how the data is moved and processed. The upper part shows the steps from the production site up to the finished model. The lower section illustrates real-time flow when the model has been deployed.
As explained in chapter 3, there is a computer onsite that gathers measurements from the mounted sensors. This is done at even intervals, going through each of the sensors on the lathe and on the pumps and gathering data with timestamps for each sensor. These gathered measurements are processed and recorded in xml files as Vikon's Q-parameters and as the most significant frequencies. This stored data is then transferred from the onsite computer to the computer running the application offsite. Once downloaded, the sensor data is transformed from xml format to json objects before being sent to a Remote Thing in the ThingWorx platform. The format transformation and the transfer to ThingWorx are done by a Java program written for this purpose, which utilizes ThingWorx's Java SDK. ThingWorx Thing properties can be of xml type, but this project is part of a larger project for predictive maintenance that receives data in json format. The choice to use json was made to make the application easier to integrate with the larger project.
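The xml to json transformation can be illustrated with the following sketch. It is written in Python for brevity, whereas the program used in the project is written in Java, and it does not cover sending the result to ThingWorx. The xml layout, element names, attribute names and values are all hypothetical, since the actual file format is not reproduced here.

```python
import json
import xml.etree.ElementTree as ET

# HYPOTHETICAL xml layout: one measurement with Q-parameters and frequencies.
SAMPLE_XML = """
<measurement sensor="4" timestamp="2018-03-01T12:00:00">
  <q name="Q1">0.82</q>
  <q name="Q2">1.37</q>
  <frequency hz="120.5" amplitude="0.031"/>
  <frequency hz="241.0" amplitude="0.012"/>
</measurement>
"""

def measurement_to_json(xml_text):
    """Convert one measurement from xml to a json string."""
    root = ET.fromstring(xml_text.strip())
    record = {
        "sensor": int(root.get("sensor")),
        "timestamp": root.get("timestamp"),
        # One entry per Q-parameter, keyed by its name.
        "qParameters": {q.get("name"): float(q.text) for q in root.findall("q")},
        # The most significant frequencies as a list of objects.
        "frequencies": [
            {"hz": float(f.get("hz")), "amplitude": float(f.get("amplitude"))}
            for f in root.findall("frequency")
        ],
    }
    return json.dumps(record)

print(measurement_to_json(SAMPLE_XML))
```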
The main part of the ThingWorx platform, ThingWorx Foundation, is run on a Windows Server 2016 machine. Here the sensor data is received by the Remote Thing with the json property. The data is then stored in Data Container Streams with Data Shapes appropriate for the stored data. For example, one Stream holds the frequency data and another one stores the Q-parameters. The other part of the data, which contains information on knife exchange and sharpening, is also read into the platform. The knife data is in csv format to begin with and can be read using a ThingWorx service extension called CSV Parser.
Figure 7.2: The basis for combining the knife data with the sensor data. E represents a planned exchange after producing 15 000 meters of veneer. S stands for sharpened and denotes that the knife is sharpened at 10 000 produced meters. U E is for an unforeseen exchange relating to knife damage. The blue boxes are for sensor data samples included before the event. These are marked as dull. The purple boxes are for included samples after a knife event. These are marked as not dull, meaning the knife is sharp.
Once both parts of the data, sensor and knife events, are inside the platform they are combined. This is done by taking an entry for a scheduled knife event and then combining it with the sensor data. The basis for the combination is taking a few entries before the production stop timestamp and marking them as dull, and then taking an equivalent number of samples after the production start timestamp and labeling these as sharp. Here, scheduled events are those where the knife has produced the set veneer length for sharpening or exchange. Instances where the knife was exchanged due to some type of damage are not included. An overview of this can be seen in figure 7.2. The sensor data entries queried in relation to a knife event are also checked to make sure that they do not coincide with another knife event. This might happen if an unforeseen event occurred shortly after another event. Once the data has been combined, it is exported to a new csv file via a Mashup using the Data Export Widget. This data file then undergoes some minor alterations in a data preparation program named Talend, as well as some editing in Notepad++. The final step before going to ThingWorx Analytics is configuring the Data Configuration json file, which specifies the type and use of the data.
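The labeling of sensor samples around knife events, described earlier in this section, can be sketched in Python as follows. In the project this step is performed inside the ThingWorx platform, so the sketch only illustrates the logic. The field names, the number of samples per side and the rule of dropping a sample when another event lies between it and the current event are assumptions, and the timestamps are hypothetical plain numbers.

```python
from bisect import bisect_left, bisect_right

def overlaps_other_event(lo, hi, event, events):
    """True if another event's stop-start interval overlaps the span [lo, hi]."""
    return any(
        other is not event and other["stop"] <= hi and other["start"] >= lo
        for other in events
    )

def label_samples(sample_times, events, n=3):
    """Label up to n samples before each scheduled stop as dull and n after each start as sharp.

    sample_times must be sorted. Unscheduled (damage related) events are not
    labeled, but they still invalidate samples that coincide with them.
    """
    labeled = []
    for event in events:
        if not event["scheduled"]:
            continue
        end = bisect_left(sample_times, event["stop"])       # samples strictly before the stop
        for t in sample_times[max(0, end - n):end]:
            if not overlaps_other_event(t, event["stop"], event, events):
                labeled.append((t, "dull"))
        begin = bisect_right(sample_times, event["start"])   # samples strictly after the start
        for t in sample_times[begin:begin + n]:
            if not overlaps_other_event(event["start"], t, event, events):
                labeled.append((t, "sharp"))
    return labeled

# Hypothetical data: one sample every 10 time units and three knife events.
sample_times = list(range(0, 201, 10))
events = [
    {"stop": 52, "start": 58, "scheduled": True},
    {"stop": 75, "start": 79, "scheduled": False},  # unforeseen exchange
    {"stop": 152, "start": 158, "scheduled": True},
]

labeled = label_samples(sample_times, events)
print(labeled)
```

In this example the sample at time 80 is dropped from the first event, because the unforeseen exchange occurs between the production start and that sample.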
When all of the preparation of the data is done, it is uploaded as a Data Set in ThingWorx Analytics Builder. Here the data can be handled and analyzed according to the methods and functions explained in chapter 5. The analysis done in CSense uses the same csv files but with the differences noted in chapter 6. ThingWorx Analytics Server runs on a Linux Virtual Machine on the same server that runs ThingWorx Foundation. CSense is run on a separate Windows Server.
Once the model is created and validated, it is published to Analytics Manager. Here ThingWorx can send data in real-time for predictive scoring. This is done by generating Things that represent the model and are connected via the ThingPredictor analysis provider. The result of the scoring job is then returned to the Thing in ThingWorx, where it can be used as needed.
8 Results
The results are divided between ThingWorx in sections 8.1 and 8.2 and CSense in section 8.3. Please note again the previously mentioned discrepancy between sensor index and measuring point index. For example, sensor 4 corresponds to measuring point 3.