Adaptive CMDHS - Improvising the population in CDHS

Algorithm 5.4: Improvising the population in CDHS

D#   i      Algorithm   p_value                α(0.05)/i     Hypothesis
D1   1.05   KH          1.81741782409671E-09   0.047619048   Rejected
D1   2.45   CGABC       1.81741782409671E-09   0.020408163   Rejected
D1   3.70   DEMC        1.81741782409671E-09   0.013513514   Rejected
D1   2.80   CMDHS       1.81741782409671E-09   0.017857143   Rejected
D2   2.71   KH          8.8191106725875E-12    0.018450185   Rejected
D2   2.38   CGABC       8.8191106725875E-12    0.021008403   Rejected
D2   1.00   DEMC        8.8191106725875E-12    0.05          Rejected
D2   3.90   CMDHS       8.8191106725875E-12    0.012820513   Rejected
D3   3.45   KH          1.81741782409671E-09   0.014492754   Rejected
D3   3.40   CGABC       1.81741782409671E-09   0.014705882   Rejected
D3   1.25   DEMC        1.81741782409671E-09   0.04          Rejected
D3   1.90   CMDHS       1.81741782409671E-09   0.026315789   Rejected
D4   2.70   KH          8.10650901638872E-09   0.018518519   Rejected
D4   2.80   CGABC       8.10650901638872E-09   0.017857143   Rejected
D4   1.00   DEMC        8.10650901638872E-09   0.05          Rejected
D4   3.50   CMDHS       8.10650901638872E-09   0.014285714   Rejected
D5   1.69   KH          0                      0.029585799   Rejected
D5   2.48   CGABC       0                      0.02016129    Rejected
D5   2.10   DEMC        0                      0.023809524   Rejected
D5   3.74   CMDHS       0                      0.013368984   Rejected
D6   2.19   KH          0                      0.02283105    Rejected
D6   2.86   CGABC       0                      0.017482517   Rejected
D6   1.00   DEMC        0                      0.05          Rejected
D6   3.95   CMDHS       0                      0.012658228   Rejected

5.7 Adaptive CMDHS

The CMDHS was successful in all of the previous experiments. However, the proposed method still needs accurate tuning of the DE parameters, because the DE is sensitive to the settings of its control parameters. The adaptive version of the CMDHS (ACMDHS) is proposed to overcome the need to statically tune the DE parameters, using the same method of parameter adaptation proposed in (Cui, Li et al. 2016). This method updates the F and Cr parameters, which control the mutation and crossover operators respectively, towards their best values.
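As an illustration of what adapting F and Cr during a run can look like, the sketch below uses the widely known jDE self-adaptation rule together with a DE/rand/1/bin trial vector. It is a stand-in chosen for brevity, not the adaptation scheme of (Cui, Li et al. 2016) used in this work, whose details are not reproduced in this section. The function names, the constants tau_f, tau_cr, f_low and f_high, and the toy sphere objective are all assumptions made for the example.

```python
import random


def adapt_parameters(f, cr, rng, tau_f=0.1, tau_cr=0.1, f_low=0.1, f_high=0.9):
    """jDE-style rule (assumed here): occasionally resample F and Cr, else keep them."""
    new_f = f_low + rng.random() * f_high if rng.random() < tau_f else f
    new_cr = rng.random() if rng.random() < tau_cr else cr
    return new_f, new_cr


def make_trial(population, i, f, cr, rng):
    """Build a DE/rand/1/bin trial vector for individual i."""
    others = [k for k in range(len(population)) if k != i]
    a, b, c = rng.sample(others, 3)
    target = population[i]
    j_rand = rng.randrange(len(target))  # guarantees at least one mutated component
    trial = []
    for j, x in enumerate(target):
        if rng.random() < cr or j == j_rand:
            trial.append(population[a][j] + f * (population[b][j] - population[c][j]))
        else:
            trial.append(x)
    return trial


def evolve(population, params, fitness, rng):
    """One generation, updated in place; each individual carries its own (F, Cr)."""
    for i in range(len(population)):
        f, cr = adapt_parameters(params[i][0], params[i][1], rng)
        trial = make_trial(population, i, f, cr, rng)
        # Greedy selection: the new (F, Cr) pair survives only if the trial does.
        if fitness(trial) <= fitness(population[i]):
            population[i], params[i] = trial, (f, cr)


def sphere(x):
    """Toy objective (assumption): sum of squares, minimum at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[rng.uniform(-5, 5) for _ in range(4)] for _ in range(10)]
    par = [(0.5, 0.9)] * 10
    for _ in range(50):
        evolve(pop, par, sphere, rng)
    print(min(sphere(x) for x in pop))
```

In a clustering setting, each vector would encode a set of centroids and the fitness would be a clustering criterion such as ADDC. The point of the sketch is only that parameter values attached to successful trial vectors are the ones that propagate, so no static tuning of F and Cr is required.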

In this section, a comparison between CMDHS and ACMDHS is conducted. As in the previous sections, the ADDC values need to be minimized while the F-measure values need to be maximized. The values of both measures are listed in Table 5.16 and Table 5.17: Table 5.16 gives the external measure values using the F-measure, while Table 5.17 gives the internal measure values using ADDC. In terms of the F-measure, the statically tuned version (CMDHS) outperformed the dynamically tuned ACMDHS on D1, D2 and D5, while ACMDHS scored higher on D3 and D4. The most striking observation to emerge from this comparison concerns the tuned parameters: F for the mutation and Cr for the crossover have only a minor effect on the performance of the centroid allocation. Table 5.17 shows that the general trends of the ADDC results for CMDHS and ACMDHS are compatible. The stability of the ADDC relative to the F-measure does not mean that both methods performed equally, because the F-measure values changed while the ADDC values remained almost steady; nonetheless, both methods are highly competitive.
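To make the two measures concrete, the following is a minimal sketch of how an internal measure such as ADDC and an external clustering F-measure are commonly computed. It assumes that ADDC is the mean document-to-centroid distance averaged over clusters, and that the F-measure is the class-size-weighted best F-score over clusters. The Euclidean distance, the function names and the fraction (rather than percentage) scale are assumptions of the example, not definitions taken from this chapter.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def addc(clusters, centroids, distance=euclidean):
    """Mean over non-empty clusters of the mean document-to-centroid distance."""
    per_cluster = [
        sum(distance(doc, centroid) for doc in docs) / len(docs)
        for docs, centroid in zip(clusters, centroids)
        if docs
    ]
    return sum(per_cluster) / len(per_cluster)


def clustering_f_measure(labels_true, labels_pred):
    """Class-size-weighted maximum F-score of each true class over all clusters."""
    n = len(labels_true)
    total = 0.0
    for cls in set(labels_true):
        n_i = sum(1 for t in labels_true if t == cls)
        best = 0.0
        for clu in set(labels_pred):
            n_j = sum(1 for p in labels_pred if p == clu)
            n_ij = sum(1 for t, p in zip(labels_true, labels_pred)
                       if t == cls and p == clu)
            if n_ij == 0:
                continue
            precision, recall = n_ij / n_j, n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total
```

Under these definitions a lower ADDC indicates more compact clusters and a higher F-measure indicates better agreement with the reference classes, which matches the minimize and maximize directions stated above.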

Table 5.16 F-measure Values

Dataset   CMDHS   ACMDHS
D1        88.50   85.97
D2        98.69   96.03
D3        96.94   97.56
D4        97.56   98.84
D5        99.91   98.92

Table 5.17 ADDC Values

Dataset   CMDHS   ACMDHS
D1        0.72    0.72
D2        0.74    0.71
D3        0.73    0.71
D4        0.82    0.75
D5        0.72    0.73
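As a quick reading aid for the F-measure and ADDC tables above, the sketch below tallies which method has the better value on each dataset and computes the mean F-measure. The numbers are copied from the tables; the helper names are assumptions of the example, and a raw tally says nothing about statistical significance.

```python
# (CMDHS, ACMDHS) pairs copied from the F-measure and ADDC tables.
F_MEASURE = {"D1": (88.50, 85.97), "D2": (98.69, 96.03), "D3": (96.94, 97.56),
             "D4": (97.56, 98.84), "D5": (99.91, 98.92)}
ADDC = {"D1": (0.72, 0.72), "D2": (0.74, 0.71), "D3": (0.73, 0.71),
        "D4": (0.82, 0.75), "D5": (0.72, 0.73)}


def tally(table, higher_is_better):
    """Count per-dataset wins for each method, with ties counted separately."""
    wins = {"CMDHS": 0, "ACMDHS": 0, "tie": 0}
    for cmdhs, acmdhs in table.values():
        if cmdhs == acmdhs:
            wins["tie"] += 1
        elif (cmdhs > acmdhs) == higher_is_better:
            wins["CMDHS"] += 1
        else:
            wins["ACMDHS"] += 1
    return wins


def column_means(table):
    """Mean value per method across all datasets in the table."""
    rows = list(table.values())
    return (sum(r[0] for r in rows) / len(rows),
            sum(r[1] for r in rows) / len(rows))


if __name__ == "__main__":
    print(tally(F_MEASURE, higher_is_better=True))
    print(tally(ADDC, higher_is_better=False))
    print(column_means(F_MEASURE))
```

On these figures CMDHS has the higher F-measure on three of the five datasets, while ACMDHS has the lower ADDC on three datasets, with one tie. This is consistent with the observation that the two methods are highly competitive.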

5.8 Summary

The clustering of the text documents is an important process for document categorization, archiving, summarization and retrieval. After the pre-processing of the text documents and feature selection using the supervised and unsupervised methods presented in chapters 3 and 4, the unsupervised feature selection method (DDESA) presented in chapter 4 was used to reduce the features for the text clustering. This chapter presented two different hybrid document clustering approaches, which are capable of distributing the cluster centroids using memetic optimization in the search space.

The first approach is the DEMC. This method combines the DE global search with the simulated annealing local search. The research found the DEMC to be superior to the k-means, DE, DEKM, DESA and CGABC methods in terms of the internal and external clustering evaluation measures.
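The local search half of such a hybrid can be sketched as follows. This is a generic simulated annealing refinement of one candidate solution using the Metropolis acceptance rule, not the exact DEMC procedure; the Gaussian neighbourhood, the geometric cooling schedule and all parameter defaults are assumptions made for the example.

```python
import math
import random


def sa_local_search(solution, cost, rng, temperature=1.0, cooling=0.95,
                    steps=100, step_size=0.1):
    """Refine one candidate with simulated annealing and return the best found."""
    current = list(solution)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Neighbour: small Gaussian perturbation of every component (assumption).
        neighbour = [x + rng.gauss(0.0, step_size) for x in current]
        neighbour_cost = cost(neighbour)
        delta = neighbour_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = neighbour, neighbour_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temperature *= cooling  # geometric cooling
    return best, best_cost
```

In a memetic scheme, a routine of this kind would be applied to selected individuals produced by the global search, so that the population is explored globally and promising candidates are exploited locally.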

Another memetic document clustering method was also proposed, which fuses the global search of the DHS with traditional k-means clustering. The DHS was applied successfully as a global search and was combined with the k-means to produce the MDHS method. This study then combined the binomial DE crossover with the MDHS to produce the CMDHS method. It can be concluded from the results of the experimental study that the proposed CMDHS outperformed the other methods compared for document clustering. The test results using the F-measure, ADDC and the non-parametric statistical tests showed the superiority of the CMDHS over the baseline methods, namely the HS, DHS, k-means and the Memetic HS. The proposed CMDHS also outperformed two current state-of-the-art methods in most cases. In addition, it was better than the Differential Evolution Memetic Clustering (DEMC) proposed earlier.
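The role of k-means as the local refinement step in such a hybrid can be illustrated with a short sketch. It assumes that a candidate produced by the global search encodes a list of centroids, and it applies standard k-means assignment and update steps to that candidate. The function names and the squared Euclidean distance are assumptions of the example; how the refinement is scheduled inside the MDHS and CMDHS is not shown here.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans_refine(documents, centroids, iterations=1):
    """Apply k-means assignment and update steps to a candidate centroid set."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        groups = [[] for _ in centroids]
        for doc in documents:
            nearest = min(range(len(centroids)),
                          key=lambda k: squared_distance(doc, centroids[k]))
            groups[nearest].append(doc)
        # Update step: move each centroid to the mean of its group;
        # an empty group keeps its previous centroid.
        centroids = [
            [sum(col) / len(group) for col in zip(*group)] if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids
```

For example, refining the candidate [(1, 1), (9, 1)] on the documents (0, 0), (0, 2), (10, 0) and (10, 2) moves the centroids to [0.0, 1.0] and [10.0, 1.0] after a single iteration.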

Finally, an enhancement was made to the CMDHS using adaptive tuning of the DE control parameters. The resultant method, named the Adaptive CMDHS (ACMDHS), updates the DE control parameters towards their best values. The test results indicate that the CMDHS and the ACMDHS are both highly competitive methods.


Chapter 6 Conclusion and Future Research

6.0 Introduction

This research addressed the issues of centroid allocation and text feature selection for document clustering. Memetic optimization was proposed to manage these two issues and to find a method that clusters text documents more efficiently than existing methods. This thesis first discussed the use of memetic optimization to resolve the problem of centroid distribution. As was observed in the literature review (chapter 2), the majority of optimization methods used for centroid allocation perform only a global search, using methods such as the EA, SI or HS. Despite the ability of these methods to perform a global search, they are not capable of exploiting the local areas within the search space. Therefore, memetic optimization was used to resolve this problem, because it combines the global and local searches. The research extensively explored memetic optimization in terms of the centroid allocation of document clustering.

As reported in this thesis, the problem of document clustering was not limited to the distribution of the cluster centroids. The other focus of the research was text feature selection, because high text dimensionality affects the clustering system negatively. The thesis discussed supervised and unsupervised feature selection methods in terms of filter, wrapper, and hybrid techniques. Hybrid feature selection techniques were discussed extensively. In this context, the hybrid methods of feature selection are equivalent to memetic optimization with regard to centroids allocation. However, global and local searches have a different meaning in feature selection methods. The wrapper and filter methods are equivalent to global and local searches in optimization, respectively. The final aim of this thesis was to combine hybrid feature selection and hybrid centroids allocation for more efficient document clustering.

In document Document clustering with optimized unsupervised feature selection and centroid allocation (Page 131-135)