IV.2.2 T-DTS self-tuning procedure validation

IV.2.2.2 T-DTS self-tuning procedure validation using real-world classification problems

In applying the T-DTS self-tuning threshold procedure to the Tic-tac-toe endgame problem, we used different database partitionings and complexity estimators. The histogram obtained from the maximal decomposition tree characterizes the overall divisibility of the database for the pre-selected DU and complexity estimator.

For the Tic-tac-toe endgame problem, we provide two histograms obtained using two reliable complexity estimators: the Collective entropy (PRISM-based method) and our ANN-based complexity estimator. The results are shown in Fig. IV.32 and Fig. IV.33.

A brief analysis suggests that, according to the Collective entropy histogram (Fig. IV.32), the database is divisible and decomposition yields clusters of differing complexity; however, this does not mean that decomposition increases T-DTS performance, since the sub-clusters still have a high complexity ratio. Fig. IV.33 suggests that decomposition does not decrease database complexity and that the sub-databases remain complex; thus, the Tic-tac-toe endgame database is not divisible in the sense of simplifying classification.

Note that, according to the validation of the complexity estimators, our ANN-based estimator has proven more reliable than the Collective (PRISM-based) entropy.

tel-00481367, version 1 - 6 May 2010

However, both are among the leading approaches used for validation.

Applying the T-DTS self-tuning threshold procedure to different combinations of database partitioning, PU and complexity estimation technique gives a single result: no decomposition of the database reduces the overall complexity and, as a result, decomposition cannot increase performance.

Fig. IV.32 : Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, Collective entropy complexity estimator

Fig. IV.33 : Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, ANN-structure based complexity estimator


Therefore, the database should be processed as a whole (solid processing), or another, more sophisticated decomposition method must be used. Given the origin of this problem, this conclusion is expected: each s-instance describes a unique game configuration that determines a unique part of the class border in the feature space S. In addition, the classes overlap heavily, which is why a low ratio of the learning database also reduces performance.

Table IV.6 : Classification results: Tic-tac-toe endgame classification problem

| Method description | Type of algorithm | Accuracy (%) |
|---|---|---|
| MLP_FF_BR (90% of database used) | MLP FF based | 99.9063 |
| MLP_FF_BR (80% of database used) | MLP FF based | 99.6545 |
| Elman_BNwP (90% of database used) | MLP FF based | 99.5521 |
| T-DTS (80% of db), ANN-based CE, θ=0, 2 clusters, PU: Elman_BNwP | T-DTS | 98.4921 |
| MLP_FF_BR (70% of database used) | MLP FF based | 98.4583 |
| CN2 standard | Rule induction | 98.33 |
| Elman_BNwP (70% of database used) | MLP FF based | 98.0521 |
| T-DTS (70% of db), Fisher ratio CE, θ=0, 2 clusters, PU: Elman_BNwP | T-DTS | 98.0050 |
| IB3-CI | Instance learning | 97.8 |
| MLP_FF_BR (60% of database used) | MLP FF based | 97.3890 |
| Decision tree learner +FICUS | Feature constructing | 96.45 |
| kNN +FICUS (k=3) | Feature constructing | 96.14 |
| T-DTS (70% of db), kNN Matlab CE, θ=0, 2 clusters, PU: PNN | T-DTS | 95.7334 |
| kNN +FICUS (k=5) | Feature constructing | 95.35 |
| kNN +FICUS (k=7) | Feature constructing | 94.99 |
| kNN +FICUS (kNN – basic) | Feature constructing | 94.73 |
| CN2-SD (γ=0.9) | Rule induction | 88.41 |
| CN2-SD (γ=0.7) | Rule induction | 85.07 |
| MLP_FF_BR (60% of database used) | MLP FF based | 85.0313 |
| CN2-SD (γ=0.5) | Rule induction | 84.45 |
| MBRTalk | Instance learning | 84.1 |
| CN2-SD (add. weight.) | Rule induction | 83.92 |
| Backpropagation +FICUS (note: high standard deviation of the results) | Feature constructing | 81.66 |
| NewID | Decision tree based | 79.8 |
| CN2 WRAcc | Rule induction | 70.56 |

Table IV.6 presents a comparative study of the performance of T-DTS and other classification approaches (Aha 1991), (Lavrac, Flach and Todorovski 2002), (Markovitch and Rosenstein 2002). The classification results are given in terms of accuracy (learning and generalization rates taken together) so that they are comparable with the other methods.


This table shows the T-DTS results and the influence of decomposing the Tic-tac-toe database on performance. Note that any complexity estimator that produces a histogram similar to that of the ANN-based estimator is a good candidate for processing this problem in the T-DTS framework. Such candidates include not only the Fisher ratio but also the four Fukunaga interclass distance matrix criteria. However, one may still select a solution with two or more clusters if the result satisfies the initial conditions.

For the Splice-junction DNA sequences classification problem, the results are shown in Fig. IV.34. The Fisher_Disriminant_Ratio complexity estimator leads with a generalization rate of 74.6218%, and the ANN-based estimator comes second.

Maximum_Standard_Deviation is inapplicable because of the same weaknesses mentioned above for Fisher_Disriminant_Ratio applied to the Tic-tac-toe endgame problem.
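For orientation, the Fisher discriminant ratio is commonly defined per feature as the squared difference of the class means divided by the sum of the class variances, with the maximum over features taken as the score. The sketch below follows that common two-class form. It is an assumption that the estimator used in T-DTS matches it; the function name and the toy data are illustrative only, and a three-class problem such as Splice-junction would need a pairwise or multi-class extension.

```python
from statistics import fmean, pvariance

def fisher_discriminant_ratio(class_a, class_b):
    """Maximum per-feature Fisher ratio between two classes of feature tuples."""
    best = 0.0
    # zip(*rows) turns a list of feature tuples into per-feature columns
    for feat_a, feat_b in zip(zip(*class_a), zip(*class_b)):
        gap = (fmean(feat_a) - fmean(feat_b)) ** 2
        spread = pvariance(feat_a) + pvariance(feat_b)
        if spread == 0.0:
            # No within-class variance: separable if the means differ at all
            ratio = float("inf") if gap > 0.0 else 0.0
        else:
            ratio = gap / spread
        best = max(best, ratio)
    return best

# Toy data (illustrative, not from the thesis): a high ratio means the classes
# are well separated along some feature, a low ratio means they overlap.
separable = fisher_discriminant_ratio(
    [(0.0, 1.0), (0.2, 3.0)], [(5.0, 1.0), (5.2, 3.0)]
)
overlapping = fisher_discriminant_ratio(
    [(0.0, 1.0), (4.0, 3.0)], [(1.0, 1.0), (5.0, 3.0)]
)
```

In this form a larger value indicates an easier (less complex) problem, so a complexity estimator built on it would typically use an inverse or normalized variant.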

Fig. IV.34 : Validation T-DTS self-tuning threshold procedure: Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM, 3 complexity estimators

This confirms our expectation that the ANN-structure based complexity estimator (marked in Fig. IV.34 as ZISC) is broadly applicable regardless of the specifics of the problem.

Before analyzing the quality of the obtained results and focusing on the main goal of any automatic classification method, namely maximization of the generalization rate, let us mention the weaknesses of the T-DTS application (not of the concept). The application assumes that a tuple, for example the pair <complexity threshold; complexity estimation method>, determines the optimum or quasi-optimum. For different methods, the quasi-optimal threshold may also differ.

First, it is incorrect to assume (or simplify) that the optimal thresholds of the whole range of complexity estimators applied to a single classification task lie in some common sub-interval. Several aspects influence the optimization function P(θ), including the relative nature of the complexity rate for every estimator except the ANN-based one; this means that we cannot optimize the search for an appropriate complexity estimator among the available ones.
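The point that each estimator has its own quasi-optimal threshold can be sketched as a separate sweep of θ per estimator. The code below is a minimal illustration, not the T-DTS implementation: `sweep_thresholds`, `toy_rate` and the estimator names are hypothetical, and `toy_rate` is a synthetic stand-in for P(θ) that replaces a full decompose, train and test run.

```python
def sweep_thresholds(evaluate, estimators, thetas):
    """For each estimator, return the (theta, rate) pair with the highest rate."""
    best = {}
    for name in estimators:
        scored = [(theta, evaluate(name, theta)) for theta in thetas]
        best[name] = max(scored, key=lambda pair: pair[1])
    return best

# Hypothetical stand-in for a full T-DTS run: a synthetic P(theta) whose peak
# location depends on the estimator. These are NOT results from the thesis.
PEAKS = {"estimator_a": 0.25, "estimator_b": 0.75}

def toy_rate(name, theta):
    return 90.0 - 100.0 * (theta - PEAKS[name]) ** 2

thetas = [0.0, 0.25, 0.5, 0.75, 1.0]
best = sweep_thresholds(toy_rate, PEAKS, thetas)
```

With this synthetic P(θ), the best threshold differs between the two estimators, so a sub-interval tuned for one of them would miss the optimum of the other.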

Second, we expect that the main controlling pair <complexity threshold; complexity estimation method>, used with the DU during decomposition, simplifies the problem regardless of the PU. In fact, given that finding an optimal decomposition is NP-hard, this is an oversimplified expectation: it assumes that the given decomposition is quasi-optimal, meaning that simplification is indeed achieved.

Finally, returning to the classification goal of the manipulations mentioned above, let us highlight that, within the T-DTS framework, the principal direction we consider is minimizing the generalization error.

Furthermore, we seek to answer the question of how to predict and predefine the way of decomposing (or not decomposing) with which the maximal generalization rate can be achieved. Once more, our idea is based on the macroscopic features of self-organization highlighted in (Haken 2002). Thus, the microscopic characteristic that rules decomposition is the histogram of divisibility extracted from the maximal decomposition tree.

To illustrate this principle, we used an encoded database (in order to enhance the testing rate, our principal aim) for the Splice-junction DNA sequences classification problem. Fig. IV.35 shows this database's divisibility and complexity reduction for the following complexity estimators.

Based on the given histogram, one may conclude that the problem is hard to decompose: decomposing does not reduce the complexity, and the majority of the sub-clusters remain very complex during decomposition. That is why decomposition does not help to minimize the generalization error.
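How a divisibility histogram separates decomposable from hard-to-decompose databases can be sketched on toy data. Everything below is a simplified assumption rather than the T-DTS implementation: a median cut along the widest feature stands in for the CNN decomposition unit, class impurity (share of points outside the majority class) stands in for the complexity estimator, and all function names are hypothetical.

```python
from collections import Counter

def impurity(labels):
    """Toy complexity estimate: share of points outside the majority class."""
    majority = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority / len(labels)

def split_at_median(points):
    """Stand-in decomposition unit: cut at the median of the widest feature."""
    n_features = len(points[0][0])

    def spread(d):
        values = [features[d] for features, _ in points]
        return max(values) - min(values)

    widest = max(range(n_features), key=spread)
    ordered = sorted(points, key=lambda p: p[0][widest])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def leaf_complexities(points, min_size=2):
    """Decompose until clusters are pure or too small; return leaf complexities."""
    complexity = impurity([label for _, label in points])
    if complexity == 0.0 or len(points) < 2 * min_size:
        return [complexity]
    left, right = split_at_median(points)
    return leaf_complexities(left, min_size) + leaf_complexities(right, min_size)

def divisibility_histogram(points, bins=5):
    """Count leaf clusters per complexity bin over the range [0, 1]."""
    counts = [0] * bins
    for complexity in leaf_complexities(points):
        counts[min(int(complexity * bins), bins - 1)] += 1
    return counts

# Toy one-feature databases of (features, label) pairs, illustrative only.
divisible = [((-2.0,), "a"), ((-1.0,), "a"), ((1.0,), "b"), ((2.0,), "b")]
interleaved = [((0.0,), "a"), ((1.0,), "b"), ((2.0,), "a"), ((3.0,), "b")]
hist_divisible = divisibility_histogram(divisible)
hist_interleaved = divisibility_histogram(interleaved)
```

For the divisible toy database all leaf clusters fall into the lowest-complexity bin, whereas for the interleaved one the leaves stay in a higher bin; the second pattern corresponds to the situation described above, where decomposition leaves most sub-clusters complex.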


Fig. IV.35 : Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Purity PRISM based complexity estimator

Fig. IV.36 : Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Fukunaga’s interclass distance measure J1 based complexity estimator

Therefore, we have consolidated the maximal results reached by T-DTS for the Splice-junction DNA classification problem in Table IV.7; as expected, the maximum is reached when the database is not decomposed.

Table IV.7 : Classification results: Splice-junction DNA sequences classification problem, three classes, generalization and learning database size 1595 prototypes

| DU | Complexity estimator | PU | Tr±Std/2 (%) | Lr±Std/2 (%) | Avr. leaf No. ±Std/2 | Θ |
|---|---|---|---|---|---|---|
| None | None | Elman_BN | 94.6675±0.0421 | 99.9373±0.0181 | None | None |
| CNN | Purity PRISM based | Elman_BN | 93.5966±0.4174 | 99.8800±0.0224 | 2±0 | 0.0033 |
| CNN | Purity PRISM based | Elman_BN | 93.3950±0.4302 | 99.8500±0.0533 | 4±0 | 0.0340 |

The results show that the decomposition process reduces generalization ability.


However, when processing time is taken into account, a user willing to sacrifice about one percent of generalization ability may well select the 4-cluster T-DTS solution.
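The size of this trade-off can be read directly from the Tr column of Table IV.7. The short sketch below only redoes that arithmetic; the variable names are illustrative, and processing times are not included because the table does not report them.

```python
# Mean generalization rates Tr (%) copied from Table IV.7.
results = [
    ("no decomposition", 94.6675),
    ("T-DTS, 2 clusters", 93.5966),
    ("T-DTS, 4 clusters", 93.3950),
]

baseline = results[0][1]
# Loss in percentage points relative to the non-decomposed Elman_BN run.
losses = {name: round(baseline - rate, 4) for name, rate in results}
```

The loss is roughly 1.07 percentage points for the 2-cluster solution and 1.27 for the 4-cluster one, which is the order of magnitude of the "one percent" mentioned above.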

To conclude, Table IV.8 provides a short consolidation (maximal characteristics only) of the results obtained by different authors for this particular classification problem, including methods employing ROC analysis (Makal, Ozyilmaz and Palavaroglu 2008).

Table IV.8 : Consolidation of the classification results: Splice-junction DNA sequences classification problem.

| Method description | Specificity of the method | Generalization rate (%) |
|---|---|---|
| Maximum obtained in the work (Lumini and Nanni 2006) | Hierarchical SVM based | 99 |
| Maximum obtained in the work (Duch 2002), (learning db is modified) | Specific NN based | 95 |
| Elman_BN (50% of database used) | MLP FF based | 94.6675 |
| The average result for this type of problems (Malousi et al. 2008) | SVM based | 94 |
| T-DTS (50% of db), 2 clusters, PU: Elman_BN | T-DTS | 93.5966 |
| T-DTS (50% of db), 4 clusters, PU: Elman_BN | T-DTS | 93.3950 |
| Maximum obtained in the work (Malousi et al. 2008) for ANN-based | solid ANN-based | 93.3890 |
| MLP (by ROC analysis) | solid ANN-based | 91.23 |
| GRNN (by ROC analysis) | solid ANN-based | 91.14 |
| RBF (by ROC analysis) | solid ANN-based | 89.35 |

Let us compare our output with the results obtained in (Malousi et al. 2008).

We obtained a higher generalization rate (94.66% against 91.23%) because of the embedded recurrence of Elman's backpropagation network. The work of Makal, Ozyilmaz and Palavaroglu provides a good summary of the different solid ANN methods applied to this particular problem. The approaches used in (Malousi et al. 2008) do not rely on decomposition. It is therefore worth noting that our T-DTS result with the four-cluster solution (93.3950%) is better than the ANN-based maximum reported in (Malousi et al. 2008) (93.3890%, Table IV.8).

In (Lumini and Nanni 2006), however, it was shown that SVM-based methods, especially hierarchical SVM-based methods like HM, Subspace and RankSVM, reach better results (97-99%). Recall, though, that parameterizing SVM options is a complex question that requires additional techniques. Interestingly, the very specific methods proposed in (Duch 2002), namely RBF with 720 nodes, the GhostMiner version of kNN, and Dipol92, surpass our result with their 95% generalization rate; nonetheless, their learning databases were specially adjusted for these three methods, so we cannot consider these methods general. If one regards DNA splice-junction classification as a general medical problem, the work (Malousi et al. 2008) reports, for similar databases, an average generalization rate of 94%, even when SVM-based methods are used.

Finally, we can state that the T-DTS approach with the enhanced self-tuning procedure, applied to the Splice-junction DNA sequences classification problem, yields a generalization rate of 93.4% to 94.6%, in line with the average reported for this type of problem.

However, various SVM-based methods (rather than the NN-based processing methods used in T-DTS) are the leaders. This analysis motivates extending the T-DTS PU database with SVM methods. The following section provides an overall summary of the T-DTS validation.

IV.2.3 Summary

In this section, we performed a range of experiments dedicated to validating the T-DTS approach, including its recent enhancement implemented as T-DTS v. 2.50. The first part of the validation confirmed the superior performance of the ANN-based complexity estimator: using the proposed novel method to control T-DTS decomposition, one can reach the maximal generalization rates on academic benchmarks.

Since these maximal rates (in absolute values) can typically be achieved only with the best current complexity estimators, e.g., the Mahalanobis distance based, Normalized distance based and Maximal standard deviation based measures, the proposed ANN-based estimator proved to be a very practical approach. Moreover, our estimator performed well even on those real-world problems where a range of other popular complexity estimators, including the Kullback-Leibler divergence and Hellinger distance based estimators, were not applicable because of their information-theoretic origins.

The second part of the T-DTS validation tested the proposed self-tuning procedure; recall that this procedure is able to answer the following questions:

• why using a leading (proven to be leading) complexity estimation technique might not maximize T-DTS output;

• why applying the T-DTS tree-like decomposition technique to some classification problems could not enhance performance beyond the results of alternative, non-decomposing task processing techniques.


During the experiments, the obtained histogram of divisibility, i.e., the result of the T-DTS maximal decomposition tree, answers the second question. This analysis might prompt the user to choose another decomposition unit or another complexity estimator to control the decomposition process. The self-tuning procedure validation confirmed our expectations: this semi-automated procedure does allow the user to find a range of quasi-optimal solutions. The next section presents the consolidated conclusion on the validation of the T-DTS enhancements and complexity estimators, including the proposed ANN-based complexity estimator.

IV.3 Conclusion

The first part of this chapter was dedicated to the experimental validation of the proposed ANN-based complexity estimation technique, which we implemented using the IBM© ZISC®-036 Neurocomputer and the Matlab environment. The latter implementation also allowed a comparative analysis of 17 complexity estimators. During the verification, we observed that the complexity estimators fall into three groups according to their relative performance. The third, most effective, group contains the leaders in estimating the complexity of classification tasks; these include the PRISM (Singh 2003) based methods and the novel ANN-based complexity estimator.

The second part of the chapter provides the results of the T-DTS concept validation, where we showed that the ANN-structure based complexity estimator belongs to the class of leading complexity estimators. Moreover, even though the classification complexity estimation technique is at the core of T-DTS, since it controls the decomposition, the evaluation showed that this control can be performed successfully (in terms of T-DTS performance) by a range of other techniques that, taken separately, cannot appropriately measure true classification complexity.

The last section of the second part provides the results obtained for the validation of the self-tuning complexity (i.e. threshold) procedure, which allows the user to find a quasi-optimal θ-threshold. Moreover, while searching for a quasi-optimal θ-threshold, T-DTS might produce a whole range of satisfactory solutions; this allows the user to select the most preferable combination of output characteristics, such as generalization and learning rate, total number of clusters, and overall T-DTS processing time. It should also be emphasized that the most important part of the self-tuning procedure is the divisibility histogram. As shown in the validation, it is not only the source of information for adjusting the θ-threshold but also provides the explanation of why the T-DTS approach could not be applied to some classification problems. Such unsatisfactory histogram output might be regarded as a stimulus for the further development of decomposition and complexity estimation methods.

The general conclusion and perspectives of this work are consolidated separately in the following section.


General conclusion and perspectives
