Documents, Solutions and Centroids Representation and Evaluation


5.2 Documents, Solutions and Centroids Representation and Evaluation

This section highlights the key concepts used to optimize cluster centroids in the document clustering approaches proposed in this chapter. It first explains how the document corpus is represented, then describes how solutions are initialized, modified and evaluated, and finally explains how the centroids of each solution are calculated.

5.2.1 Document Corpus Representation

The datasets are first uploaded in text format. After pre-processing, each dataset is transformed into a Term-Document Matrix (TDM); the same data may equally be expressed as a Document-Term Matrix (DTM). Table 5.1 shows an example dataset containing six documents, seven features (keywords) and three classes (the classes are included only to explain the concepts).

Table 5.1 Relationship Between Features, Documents, and Classes

Doc    Class  F1   F2    F3    F4   F5    F6    F7
Doc1   c1     0    0.1   0.2   0    0.2   1     0
Doc2   c1     0    1     2     0    0.2   1     0
Doc3   c3     0    3     0.1   0    0.4   0.5   0
Doc4   c3     0    3     1     0    0.4   0.5   0
Doc5   c2     7    0.1   2     0    0.1   0.2   0.1
Doc6   c2     7    0.9   2     0    0.1   0.2   0.1

The main distinction between clustering and classification is that class labels are available in classification problems; true clustering problems do not require them. With benchmark datasets, however, the labels are needed, but only for the purpose of system evaluation. This evaluation is essential to validate the system before it is used on unlabeled datasets. The actual representation of the TDM used in the clustering systems is shown in Table 5.2.

Table 5.2 Small Dataset Without Class Labels

Doc    F1   F2    F3    F4   F5    F6    F7
Doc1   0    0.1   0.2   0    0.2   1     0
Doc2   0    1     2     0    0.2   1     0
Doc3   0    3     0.1   0    0.4   0.5   0
Doc4   0    3     1     0    0.4   0.5   0
Doc5   7    0.1   2     0    0.1   0.2   0.1
Doc6   7    0.9   2     0    0.1   0.2   0.1
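The label-free matrix of Table 5.2 can be held in memory as a simple list of rows. The following is a minimal sketch in plain Python, assuming a row-per-document (DTM) layout; the names `FEATURES`, `DOCUMENTS`, `DTM`, `transpose` and `weight` are illustrative and do not come from the original system.

```python
# Table 5.2 stored as a document-term matrix (DTM):
# one row per document, one column per feature.
FEATURES = ["F1", "F2", "F3", "F4", "F5", "F6", "F7"]
DOCUMENTS = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5", "Doc6"]
DTM = [
    [0, 0.1, 0.2, 0, 0.2, 1, 0],
    [0, 1, 2, 0, 0.2, 1, 0],
    [0, 3, 0.1, 0, 0.4, 0.5, 0],
    [0, 3, 1, 0, 0.4, 0.5, 0],
    [7, 0.1, 2, 0, 0.1, 0.2, 0.1],
    [7, 0.9, 2, 0, 0.1, 0.2, 0.1],
]


def transpose(matrix):
    """Swap rows and columns, turning a DTM into a TDM or back."""
    return [list(column) for column in zip(*matrix)]


def weight(doc, feature):
    """Look up the weight of one feature in one document (hypothetical helper)."""
    return DTM[DOCUMENTS.index(doc)][FEATURES.index(feature)]


# The term-document view of the same data: one row per feature.
TDM = transpose(DTM)
```

Both orientations carry the same information, which is why the TDM and DTM forms can be used interchangeably; only the indexing order changes.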

For text data produced by real-world applications, a classified dataset such as the one presented in Table 5.1 is unlikely to be available, since unsupervised learning problems are concerned with label-free data. External evaluation measures, in contrast, depend on the availability of the original class labels: they match the document configurations obtained after clustering against the original ones. Many researchers use this kind of evaluation, for example the F-measure (to be explained later), because it gives more accurate results than internal measures. The only limitation on the use of external measures is that a previous categorization of the documents may not be available. They can therefore be used with benchmark datasets to validate the assessment performed by the internal measures. Once the performance on these datasets has been evaluated using both external and internal measures, it becomes easier to predict the accuracy of the clustering on other datasets where only internal measures can be used. Internal measures assess only the generated clusters, based on their intrinsic properties.
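To make the role of the original class labels concrete, the sketch below computes an F-measure for a clustering of the six labeled documents of Table 5.1. The exact definition used in this work is given later in the text; the class-weighted formulation here (best-matching cluster per class, weighted by class size) is one common variant and is an assumption, as are the function name `clustering_f_measure` and the two example clusterings.

```python
from collections import Counter


def clustering_f_measure(true_classes, cluster_ids):
    """Class-weighted F-measure of a clustering against known labels.

    Assumed formulation: for every class, take the best F score over all
    clusters, then average those scores weighted by class size.
    """
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_classes, cluster_ids))

    total = 0.0
    for cls, n_class in class_sizes.items():
        best = 0.0
        for clu, n_cluster in cluster_sizes.items():
            shared = overlap[(cls, clu)]
            if shared == 0:
                continue  # no documents in common, F is zero
            precision = shared / n_cluster
            recall = shared / n_class
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_class / n) * best
    return total


# Class labels of Doc1..Doc6 as listed in Table 5.1.
TRUE_CLASSES = ["c1", "c1", "c3", "c3", "c2", "c2"]

# Two hypothetical clusterings: one that recovers the classes, one that
# wrongly places Doc3 together with Doc1 and Doc2.
perfect = clustering_f_measure(TRUE_CLASSES, [1, 1, 2, 2, 3, 3])
imperfect = clustering_f_measure(TRUE_CLASSES, [1, 1, 1, 2, 3, 3])
```

Note that the cluster indices themselves do not need to equal the class names; only the grouping of documents matters for this kind of measure.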

5.2.2 Solutions Initialization

In optimization-based document clustering, a ‘solution’ is a set of centroids that needs to be placed accurately. To allocate the centroids efficiently, each centroid should be positioned as close as possible to all of its relevant documents. In the example in Table 5.3, the number of documents is 7 (n = 7) and these documents need to be distributed among three clusters [c1, c2, c3] (r = 3). The number of arrangements can be stated as the permutations of n documents taken r at a time, as given in equation (5.1).

\[ p(n, r) = \frac{n!}{(n - r)!} = \frac{7!}{(7 - 3)!} = \frac{5040}{24} = 210 \tag{5.1} \]

Thus, the number of possible solutions is 210 for this small example, yet only one of them should be selected as the feasible solution. This simple example is only intended to show the relationship between each document and its corresponding cluster. As the number of documents and centroids increases, selecting the best solution becomes more complicated, and more intelligent methods are required to find the best centroid allocation. Intelligent methods such as the memetic algorithm can provide faster convergence to the best solution (Neri and Cotta 2012). Moreover, the problem is not limited to selecting the best solution in the search space; it is equally important to employ efficient techniques that can modify solutions on a local search basis. Such techniques should be capable of avoiding local optima, where all solutions become non-productive.

In typical optimization problems, a random initial population is first generated using a random number generator. The centroid allocation methods for document clustering presented in this chapter initialize the population in this way. Other initialization techniques could be more productive, but their exploration is beyond the scope of the present research. The size of the population and the initial assignments of documents are random, and each assigned cluster index should be less than or equal to the desired number of clusters. Table 5.3 shows a sample of solutions, one per row (sol1 . . . sol12). Each solution has as many entries as there are documents in the original dataset.
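The arithmetic of equation (5.1) and the random initialization described above can be sketched as follows. This is an illustration only: the function names, the seed and the population size of 12 are assumptions made for the example, and Python's standard `random` module stands in for whatever random number generator the original system used.

```python
import math
import random


def count_permutations(n, r):
    """Equation (5.1): permutations of n items taken r at a time."""
    return math.factorial(n) // math.factorial(n - r)


def random_population(num_solutions, num_documents, num_clusters, seed=None):
    """Build a population of random solutions.

    Each solution assigns every document a cluster index between 1 and
    num_clusters, so no index exceeds the desired number of clusters.
    """
    rng = random.Random(seed)
    return [
        [rng.randint(1, num_clusters) for _ in range(num_documents)]
        for _ in range(num_solutions)
    ]


print(count_permutations(7, 3))  # prints 210, as in equation (5.1)

# Seven documents, three clusters; the population size is an example value.
population = random_population(
    num_solutions=12, num_documents=7, num_clusters=3, seed=0
)
```

Fixing the seed makes the initial population reproducible, which is convenient when comparing runs, although nothing in the text above requires it.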

5.2.3 Initial Solution Evaluation

A random initialization of solutions is first conducted (Table 5.3). The viability of each solution is calculated by using a fitness function.

Table 5.3 Solutions Matrix

        Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7
sol1    2     1     1     3     3     1     3
sol2    2     3     3     3     2     1     1
sol3    1     1     1     1     2     2     2
sol4    1     2     3     2     1     1     3
sol5    2     1     2     1     3     3     3
sol6    3     2     3     3     2     3     3
sol7    3     1     1     2     2     2     1
sol8    1     2     2     3     2     2     1
sol9    2     3     1     1     2     2     2
sol10   2     3     3     1     1     3     1
sol11   1     2     1     1     1     2     2
sol12   2     1     3     1     1     1     1

Each sol row in Table 5.3 represents a random solution that contains cluster indices. For instance, the intersection (d1, sol4) means that document d1 belongs to cluster index 1, while the intersection (d4, sol4) means that document d4 belongs to the second cluster, and so on. The fitness function, which evaluates the quality of each solution, aims to maximize the number of true positives by allocating each document to its proper class. In the example in Table 5.3, if we assume that the right allocation of documents is d1 ∈ 2, d2 ∈ 1, d3 ∈ 3, d4 ∈ 1, d5 ∈ 6, d6 ∈ 1, then the best solution vector is [2, 1, 3, 1, 6, 1], which obtains the lowest fitness score (assuming that lower is better). The terms fitness function, objective function and cost are used interchangeably. The parameters required by the fitness function are the number of clusters, the original dataset shown in Table 5.2 and the initial solutions reported in Table 5.3.
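The way a solution row is read can be illustrated with a short sketch. The values are those of Table 5.3; the dictionary layout and the helper names `cluster_of` and `members` are assumptions introduced only for this example.

```python
# Solutions matrix of Table 5.3: each row holds one cluster index per document.
SOLUTIONS = {
    "sol1": [2, 1, 1, 3, 3, 1, 3],
    "sol2": [2, 3, 3, 3, 2, 1, 1],
    "sol3": [1, 1, 1, 1, 2, 2, 2],
    "sol4": [1, 2, 3, 2, 1, 1, 3],
    "sol5": [2, 1, 2, 1, 3, 3, 3],
    "sol6": [3, 2, 3, 3, 2, 3, 3],
    "sol7": [3, 1, 1, 2, 2, 2, 1],
    "sol8": [1, 2, 2, 3, 2, 2, 1],
    "sol9": [2, 3, 1, 1, 2, 2, 2],
    "sol10": [2, 3, 3, 1, 1, 3, 1],
    "sol11": [1, 2, 1, 1, 1, 2, 2],
    "sol12": [2, 1, 3, 1, 1, 1, 1],
}


def cluster_of(solution, doc_number):
    """Cluster index assigned to document d<doc_number> (1-based) by a solution."""
    return SOLUTIONS[solution][doc_number - 1]


def members(solution, num_clusters=3):
    """Group the documents of one solution by their assigned cluster index."""
    groups = {c: [] for c in range(1, num_clusters + 1)}
    for doc_number, cluster in enumerate(SOLUTIONS[solution], start=1):
        groups[cluster].append(f"d{doc_number}")
    return groups


# The two intersections discussed in the text.
d1_cluster = cluster_of("sol4", 1)
d4_cluster = cluster_of("sol4", 4)
sol4_groups = members("sol4")
```

Grouping the documents by cluster index in this way is the step that precedes any centroid calculation, since each centroid is derived from the documents that a solution places in the same cluster.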

5.2.4 Centroids Calculation and Fitness Evaluation

This process uses the solutions in the solutions matrix sequentially to generate a specified number of centroids. In our example, each solution is used to create three centroid vectors. Each centroid vector has seven entries, one per feature, which in this example matches the length of a solution. Optimization methods are vital for finding the solution that returns the best centroids. To cluster the documents, all document vectors stored in the TDM shown in Table 5.2 are compared to the centroids resulting from each solution, using a distance measure. The comparison of n documents to c clusters (of one solution) is referred to as the fitness score of that solution, calculated by the ADDC. In other words, a good solution generates good centroids: it minimizes the distance between each document and its corresponding centroid. The locations of the centroids are therefore dynamic, while the documents are static in the search space.
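The process described above can be sketched in a few lines. This is not the implementation used in this work: it assumes that each centroid is the mean of the document vectors assigned to its cluster, that the distance measure is Euclidean, and that the ADDC is read as the average, over non-empty clusters, of the mean distance between a centroid and its documents. Because Table 5.2 holds six documents, the two six-entry solutions used here are hypothetical.

```python
import math

# Document vectors of Table 5.2 (one row per document).
DTM = [
    [0, 0.1, 0.2, 0, 0.2, 1, 0],
    [0, 1, 2, 0, 0.2, 1, 0],
    [0, 3, 0.1, 0, 0.4, 0.5, 0],
    [0, 3, 1, 0, 0.4, 0.5, 0],
    [7, 0.1, 2, 0, 0.1, 0.2, 0.1],
    [7, 0.9, 2, 0, 0.1, 0.2, 0.1],
]


def centroids_from_solution(dtm, solution, num_clusters):
    """Assumed rule: a centroid is the mean vector of its cluster's documents."""
    num_features = len(dtm[0])
    sums = [[0.0] * num_features for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for vector, cluster in zip(dtm, solution):
        counts[cluster - 1] += 1
        for j, value in enumerate(vector):
            sums[cluster - 1][j] += value
    # An empty cluster keeps an all-zero centroid in this sketch.
    return [
        [s / counts[c] if counts[c] else 0.0 for s in sums[c]]
        for c in range(num_clusters)
    ]


def euclidean(a, b):
    """Euclidean distance between two vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def addc(dtm, solution, num_clusters):
    """Average distance of documents to their cluster centroid (lower is better)."""
    centroids = centroids_from_solution(dtm, solution, num_clusters)
    per_cluster = []
    for c in range(1, num_clusters + 1):
        distances = [
            euclidean(vector, centroids[c - 1])
            for vector, assigned in zip(dtm, solution)
            if assigned == c
        ]
        if distances:  # skip empty clusters
            per_cluster.append(sum(distances) / len(distances))
    return sum(per_cluster) / len(per_cluster)


# Two hypothetical solutions over the six documents of Table 5.2.
grouped = [1, 1, 3, 3, 2, 2]    # similar documents share a cluster
scattered = [1, 2, 3, 1, 2, 3]  # dissimilar documents are mixed

grouped_centroids = centroids_from_solution(DTM, grouped, 3)
grouped_score = addc(DTM, grouped, 3)
scattered_score = addc(DTM, scattered, 3)
```

Under these assumptions, the solution that keeps similar documents together yields centroids closer to their documents and therefore a lower score than the scattered one, which is the behaviour the fitness function is meant to reward.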

The number of desired clusters will determine the size of the centroids matrix to be constructed. For instance, Table 5.4 shows what the centroid matrix would look like if three clusters were to be formed.

Table 5.4 Centroids Matrix

C1   3.5   2   1.5   0   2.5   2     3.5
C2   3.5   2   1.5   0   2.5   1.5   3.5
C3   0     1   3.5   0   2     2.5   1.5

Thus, the main idea behind using optimization methods is to find the best positions for the centroids by adjusting their locations in every iteration. The fitness function measures the effectiveness of each solution. The solution that updates the centroids with the best fitness score is considered the fittest, and it is chosen for the next round. Algorithm 5.1 shows the centroid calculation steps.

Algorithm 5.1. The centroid calculation

In document Document clustering with optimized unsupervised feature selection and centroid allocation (Page 104-109)