Estimation of Process Parameters using Principal Component Analysis
C) Applications of Decision Tree
Many applications of data mining and Decision Trees are discussed in the literature. Murthy (1998) carried out a multi-disciplinary survey on the automatic construction of Decision Trees from data. The survey covers the application of Decision Trees in areas such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. It reviews existing work on decision tree construction, identifies the important issues involved and outlines the directions of work in this area.
Sun et al. (2007) introduced data mining technology into the field of fault diagnosis. They proposed a new method based on the C4.5 decision tree and PCA. PCA is used to reduce the number of features after data collection, pre-processing and feature extraction. A Decision Tree containing the diagnosis knowledge is then generated, and finally the tree model is used for diagnosis analysis. The technique is applied to fault diagnosis of rotating machinery, and the results are compared with a back-propagation neural network (BPNN). The results show that the C4.5 and PCA-based diagnosis method has higher accuracy than BPNN.
3.4.2 Validation of Results and Discussion
Using eq. (3.2), we take the Eigen vectors calculated from the fault free data set and set the PCs to -0.005 to calculate all 38 parameters. Once all the parameters are calculated, the next step is their validation. In this case the validation is done using the data mining software See 5.0, which is explained in detail in section 3.5.1.
A demo version of See 5.0 is freely available on the internet, and in this work the obtained results are validated using this software. The data mining software requires one of the parameters to be taken as the primary parameter, with respect to which all the others are evaluated. In this case the calculated flow rate is 38,028.86 (with a PC value of -0.005), so it is taken as the primary parameter and data mining is carried out to validate all the other parameters.
After all 38 parameters have been calculated, data mining is done by constructing a Decision Tree. As the Decision Tree in Figure 3.6 shows, it cannot give exact information about all 38 parameters, but it can give a range of values for the parameters that appear in it. The Decision Tree has two main parts, and the results from each part can be evaluated as follows.
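The estimation step described above can be illustrated with a minimal sketch. Eq. (3.2) is not reproduced in this section, so the sketch assumes it has the usual PCA reconstruction form, in which each parameter is its mean plus its scale times the sum of loadings multiplied by the PC scores. The function name `reconstruct` and all numbers below are hypothetical toy values for three variables and two retained PCs; they are not taken from the plant data set.

```python
def reconstruct(mean, scale, loadings, scores):
    """Estimate each variable from PC scores.

    Assumed form (not confirmed as eq. 3.2):
        x_j = mean_j + scale_j * sum_k loadings[j][k] * scores[k]
    """
    return [
        m + s * sum(p * t for p, t in zip(row, scores))
        for m, s, row in zip(mean, scale, loadings)
    ]


# Hypothetical fault-free statistics for three variables.
mean = [10.0, 200.0, 5.0]
scale = [2.0, 50.0, 1.0]

# Hypothetical eigenvector (loading) matrix: one row per variable,
# one column per retained principal component.
loadings = [
    [0.6, 0.1],
    [0.5, -0.7],
    [0.4, 0.3],
]

# Every PC score is set to -0.005, as in the text.
scores = [-0.005, -0.005]

estimates = reconstruct(mean, scale, loadings, scores)
print(estimates)
```

With the real data, `mean`, `scale` and `loadings` would come from the fault free data set and the result would contain all 38 parameter estimates.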
a) Discussion of results for the first node of the Decision Tree
For the first part, with the flow rate of 38,028.86 as the primary parameter, we get seven other parameters. Of these seven, six are fully validated. The validation is summarized by comparing the values calculated with the PCA model against the ranges given by the Decision Tree, as discussed below.
Input Sediments to Plant (SED-E)
For a flow of 38,028.86, the Decision Tree predicts that the value of input sediments to plant (SED-E) should be greater than 2.7; our calculated value is 4.66, so it is validated.
Input chemical Oxygen demand to Secondary Settler (DQO-D)
The input chemical demand of oxygen to secondary settler (DQO-D) is supposed to be in the range of 206 to 450. Our calculated value of DQO-D is 282.76, which is within the range predicted by the Decision Tree.
Input Biological demand of oxygen to plant (DBO-E)
The calculated value of input biological demand of oxygen to plant (DBO-E) is 189.279, whereas the Decision Tree recommends a value of more than 182.5, which verifies the calculated value of DBO-E.
Output volatile suspended solids (SSV-S)
The calculated value of output volatile suspended solids (SSV-S) is 79.91, whereas the value recommended by the Decision Tree is greater than 74.1, so it is validated.
Input volatile suspended solids to plant (SSV-E)
For input volatile suspended solids to plant (SSV-E), the value recommended by the Decision Tree is greater than 57.4 and the calculated value is 61.93, so it is again validated by the software.
Output chemical demand of oxygen (DQO-S)
For output chemical demand of oxygen (DQO-S), we have calculated the value as 87.27 using the PCA model, whereas the Decision Tree recommends that the value should be less than or equal to 140, so it is validated.
Input suspended solids to secondary settler (SS-D)
Only one parameter is almost, rather than fully, validated by the software: input suspended solids to secondary settler (SS-D), whose value should be around 90, whereas we have calculated it as 94.20. On investigation, it was found that this parameter fluctuates strongly in the original data set, with values ranging from 54 to 230. This is shown in Figure 3.5.
Figure 3.5: Graph showing variation in the different days values for SS-D
A summary of the results obtained from this part of the Decision Tree is given in Table 3.2.
Table 3.2: Summary of the Results Obtained by Decision Tree
No.  Parameter  Value determined by PCA model  Value by See 5.0  Validated
1    SED-E      4.66                           >2.7              Yes
2    DQO-D      282.76                         >206 and ≤450     Yes
3    DBO-E      189.279                        >182.5            Yes
4    SSV-S      79.91                          >74.1             Yes
5    SSV-E      61.93                          >57.4             Yes
6    DQO-S      87.27                          ≤140              Yes
7    SS-D       94.20                          ≤90               Almost
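The comparison in Table 3.2 can be reproduced with a short check. This is a minimal sketch, assuming each Decision Tree rule is read as a half-open interval (low < value ≤ high), which matches the ">" and "≤" tests listed in the table; the names `NODE_1` and `in_range` are introduced here for illustration only.

```python
import math

# Table 3.2 as (parameter, PCA estimate, lower bound, upper bound).
# A missing bound is represented by -inf or +inf.
NODE_1 = [
    ("SED-E", 4.66, 2.7, math.inf),
    ("DQO-D", 282.76, 206.0, 450.0),
    ("DBO-E", 189.279, 182.5, math.inf),
    ("SSV-S", 79.91, 74.1, math.inf),
    ("SSV-E", 61.93, 57.4, math.inf),
    ("DQO-S", 87.27, -math.inf, 140.0),
    ("SS-D", 94.20, -math.inf, 90.0),
]


def in_range(value, low, high):
    """Return True when low < value <= high."""
    return low < value <= high


results = {name: in_range(v, lo, hi) for name, v, lo, hi in NODE_1}
passed = [name for name, ok in results.items() if ok]
failed = [name for name, ok in results.items() if not ok]

print(f"{len(passed)} of {len(results)} inside range; outside: {failed}")
```

A strict check of this kind flags SS-D as outside its range; the "Almost" label in the table reflects the author's judgement that 94.20 is close to 90, which a strict interval test does not capture.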
Figure 3.6: Decision Tree validating the Results obtained for the Process Parameters Estimation
b) Discussion of results for the second node of the Decision Tree
The second node of the Decision Tree gives further information that again validates the obtained results. In this case the Decision Tree provides information on six parameters. Of these six, four are fully verified whereas two are almost verified. Details of these results are as follows.
Input sediments to plant (SED-E)
For this branch the calculated value of input sediments to plant (SED-E) is 4.66, whereas the Decision Tree recommends a value of more than 2.7, so it is validated.
Input chemical demand of oxygen to secondary settler (DQO-D)
The value of input chemical demand of oxygen to secondary settler (DQO-D) is recommended to be in the range of 206 to 450, and our calculated value of 282.76 is within the recommended range.
Performance input sediments to primary settler (RD-SED-P)
The Decision Tree recommends that the value of performance input sediments to primary settler (RD-SED-P) be less than or equal to 95.5, and we obtained a value of 91.05, so it is again validated by the software.
Input volatile suspended solids to plant (SSV-E)
Input volatile suspended solids to plant (SSV-E) is also validated by the Decision Tree, which gives a value of less than or equal to 75.6 compared with our calculated value of 61.93.
Input Biological demand of oxygen to plant (DBO-E)
Input biological demand of oxygen to plant (DBO-E) is almost validated in this case, as the calculated value is 189.279 compared with a recommended value of less than or equal to 182.5.
Performance input suspended solids to primary settler (RD-SS-P)
For performance input suspended solids to primary settler (RD-SS-P), we calculated the value as 58.27, compared with a recommended value of less than or equal to 45.7 from the Decision Tree, so it is classed as almost validated.
A summary of the results obtained from the second node of the Decision Tree is given in Table 3.3.
Table 3.3: Summary of the Results Obtained by Decision Tree
No.  Parameter  Value determined by PCA model  Value by See 5.0  Validated
1    SED-E      4.66                           >2.7              Yes
2    DQO-D      282.76                         >206 and ≤450     Yes
3    RD-SED-P   91.05                          ≤95.5             Yes
4    SSV-E      61.93                          ≤75.6             Yes
5    DBO-E      189.279                        ≤182.5            Almost
6    RD-SS-P    58.27                          ≤45.7             Almost
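The rows marked "Almost" in Tables 3.2 and 3.3 can be compared by how far each calculated value lies above its upper threshold. The sketch below uses a relative overshoot in percent; this measure and the names `NEAR_MISSES` and `overshoot_percent` are assumptions made here for illustration and are not part of the original validation procedure.

```python
# Rows marked "Almost": (parameter, PCA estimate, upper threshold from See 5.0).
NEAR_MISSES = [
    ("SS-D", 94.20, 90.0),      # Table 3.2
    ("DBO-E", 189.279, 182.5),  # Table 3.3
    ("RD-SS-P", 58.27, 45.7),   # Table 3.3
]


def overshoot_percent(value, upper):
    """Relative distance above the upper threshold, in percent."""
    return 100.0 * (value - upper) / upper


for name, value, upper in NEAR_MISSES:
    pct = overshoot_percent(value, upper)
    print(f"{name}: {pct:.1f}% above {upper}")
```

On these figures SS-D and DBO-E lie within about 5% of their thresholds, while RD-SS-P lies roughly 27.5% above its threshold, so the three "Almost" cases are not equally close.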