3.5 Learning Based Co-clustering Algorithm
3.5.5 Results and Discussion
LCC is evaluated by comparing the predictive power of the co-clusters it produces with that of the original data set. The accuracy of a Naive Bayes classifier is used as the primary measure of predictive power; results with a J48 classifier and with f-measure as the predictive power are also presented. Every data set is split into an 80% training set and a 20% test set. For each data set, only the parameter setting that gave the best result is reported. A parameter selection algorithm for LCC is left as future work.
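The evaluation protocol described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes numeric features, uses a small Gaussian Naive Bayes written with the standard library only, and all names (train_test_split, fit_gaussian_nb, predict, predictive_power) as well as the seed and columns parameters are hypothetical. Predictive power is taken to be the test accuracy of the classifier after an 80%/20% split.

```python
import math
import random
from collections import defaultdict


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out `test_fraction` of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(test_fraction * len(rows)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    return max(model, key=log_posterior)


def predictive_power(rows, labels, columns=None, seed=0):
    """Test accuracy of Naive Bayes on an 80/20 split, optionally on a column subset."""
    if columns is not None:
        rows = [[row[j] for j in columns] for row in rows]
    (train_x, train_y), (test_x, test_y) = train_test_split(rows, labels, 0.2, seed)
    model = fit_gaussian_nb(train_x, train_y)
    hits = sum(predict(model, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)
```

Calling predictive_power(rows, labels) on the full matrix and predictive_power(rows, labels, columns=...) on the columns of a co-cluster gives the kind of comparison reported in the tables; restricting to the rows of a co-cluster would be done in the same way before the call.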
Tables 3.5, 3.6, 3.7 and 3.8 report the co-clustering results in terms of classification accuracy for different numbers of co-clusters. Here, zero co-clusters refers to the original data set. The results were obtained on two benchmark data sets, madelon and Internet-Ads, and two real world data sets, AML and MovieLens. On the AML data set, the classification accuracy of the Naive Bayes model built on the co-clusters is better than on the original data set by more than 10%. On the madelon data set, co-clustering yields a significant improvement in the mean predictive power. On the MovieLens data set, the classification accuracy increases by more than 8% over the original data set. Lastly, Internet-Ads has significantly better classification accuracy with a smaller number of co-clusters than with the original data set. It is important to remember that the co-clusters are overlapping, i.e. they might have instances or features in common. Hence, in certain cases, the accuracy might not vary much when the number of co-clusters changes. Overall, on all four data sets the Naive Bayes model built on the co-clusters is more accurate than the one built on the original data set. A variation of the proposed algorithm on the real world AML data set, with f-measure as the predictive power and a J48 classifier, is presented next. Figures 3.9 and 3.10 show that even when weighted f-measure is used as the predictive power, the overall result improves with LCC for Naive Bayes as well as for the decision tree. This suggests that the LCC algorithm is robust to the type of classification model used for a given data set.
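Weighted f-measure, used above as an alternative predictive power, can be illustrated with a short sketch. It assumes the usual support-weighted definition, in which the per-class F1 score is averaged with weights proportional to the number of true instances of each class. The function name weighted_f_measure is hypothetical, and the exact variant used in the experiments is not stated here.

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += count * f1  # weight the class score by its support
    return total / len(y_true)
```

With binary labels, as in these experiments, the weighting keeps a dominant class from being ignored entirely while still reflecting how well the minority class is recovered.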
Next, LCC is compared with two state-of-the-art co-clustering methods, proposed in [72] by Shan et al. and in [67] by Dhillon et al. In [72], Shan et al. proposed an overlapping co-clustering technique which maintains separate Dirichlet models for the probabilities of row and column clusters. In [67], Dhillon et al. developed a co-clustering algorithm by modeling the data matrix as a bipartite graph. In Figure 3.5, the predictive power of LCC is compared with Bayesian co-clustering [72] (referred to as BCC) and spectral co-clustering [67] (referred to as SC). Naive Bayes is the learning model, and its accuracy is the predictive power used to evaluate the co-clusters (two and four) generated by each method. Figure 3.5 shows that the classification model built on co-clusters generated by the proposed method is more accurate than those built with the other two methods on all the data sets.
Evaluating co-clusters by predictive power helps to understand the potential and usefulness of the proposed algorithm. However, the evaluation would be incomplete without testing the purity of the co-clusters formed. Three cluster evaluation measures, namely cluster-precision, cluster-recall and cluster-f-measure, given in equations 3.5, 3.6 and 3.7 and defined by [16], are used in this work. In the experiments, binary class labels are used for all the data sets, and these labels serve as the ground truth for evaluating cluster-precision and cluster-recall. Figures 3.11, 3.12 and 3.13 show that, on the AML data set, LCC has the same cluster-recall and cluster-f-measure as BCC, both better than those of SC. On the MovieLens data set, cluster-precision and cluster-f-measure are better than those of both BCC and SC. On the Internet-Ads data set, all three scores are significantly better than those of BCC and SC.
On the madelon data set, LCC produces the same three scores as SC, which are lower than those of BCC. Although the three scores obtained with LCC on madelon are slightly low, they are not significantly lower than those of BCC. Overall, cluster-precision, cluster-recall and cluster-f-measure suggest that LCC performs as well as or better than SC and BCC on all the data sets except madelon, where it is slightly below BCC. This indicates that the predictive power of the co-clusters was not gained at the cost of their purity: LCC generates co-clusters with higher predictive power than the original data set while preserving their purity with respect to the true categories of the data.
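Purity scores of this kind can be computed from cluster memberships and class labels. The exact definitions are those of equations 3.5 to 3.7; the sketch below instead assumes a common pair-counting formulation, which may differ from them. Under this assumption, precision is the fraction of instance pairs sharing at least one cluster that also share a class, and recall is the fraction of same-class pairs that share at least one cluster. The function name pairwise_cluster_scores and its input layout are hypothetical.

```python
from itertools import combinations


def pairwise_cluster_scores(clusters, labels):
    """Pair-counting precision, recall and f-measure for possibly overlapping clusters.

    clusters: list of sets of instance ids.
    labels: dict mapping every instance id to its true class label.
    """
    # Pairs of instances that appear together in at least one cluster.
    together = set()
    for members in clusters:
        together.update(combinations(sorted(members), 2))
    # Pairs of instances that carry the same true class label.
    same_class = {
        pair for pair in combinations(sorted(labels), 2)
        if labels[pair[0]] == labels[pair[1]]
    }
    agree = len(together & same_class)
    precision = agree / len(together) if together else 0.0
    recall = agree / len(same_class) if same_class else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Because pairs are collected into a set, an instance that belongs to several overlapping co-clusters contributes each of its pairs only once.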
3.5.6 Conclusion
The learning based co-clustering (LCC) algorithm is a co-clustering strategy that uses the predictive power of the data set to improve the quality of co-clusters. LCC has been presented as an optimization problem that aims to maximize the gain in predictive power while improving the quality of co-clusters by removing extraneous rows and insignificant columns. The result is a set of overlapping co-clusters on which a learning model achieves high predictive power. The results on four benchmark and real world data sets show that LCC brings a notable improvement in the accuracy and weighted f-measure of a predictive model. LCC also performs better than two other traditional co-clustering techniques. This suggests that LCC is well suited to real life applications in which high dimensional data sets are common and better predictive modeling is a concern. LCC can find applications in fields such as health care and recommendation systems, where efficient predictive modeling is challenging because of high dimensional data with a heterogeneous population. Future work includes establishing the theoretical grounding for the concept of LCC and developing an efficient parameter selection approach for real world settings.
Table 3.14: Notation Table
Notation                 Description
c                        Number of co-clusters
C                        Original data matrix
X                        Rows of C
Y                        Columns of C
x1, x2, ..., xm          Objects in C taking value from X
y1, y2, ..., yn          Objects in C taking value from Y
MX                       Co-cluster functions for row
MY                       Co-cluster functions for column
x̂1, x̂2, ..., x̂k          k clusters of X
ŷ1, ŷ2, ..., ŷl          l clusters of Y
F(.)                     Predictive power function
t0                       Number of iterations
ρ                        Predictive power of C
∆ρ                       Gain in ρ from last iteration
ρrow1, ..., ρrowk        Predictive power of each row cluster
ρcol1, ..., ρcoll        Predictive power of each column cluster
τrow                     Threshold for row noise removal
τcol                     Threshold for column noise removal
Pitr                     Probability of iteration
Prow                     Probability of row noise removal
Pcol                     Probability of column noise removal
τccr                     Threshold for probability of iteration
C′                       Data matrix after row noise removal
C″                       Data matrix after column noise removal