Feature Selection - Liver Lesion Characterisation and Classification

5.6 Feature Selection

To increase classification and characterisation accuracy and to reduce system cost, feature selection can be used to select the most robust features from a high-dimensional feature set (Jain et al., 2000). A high-dimensional feature set may degrade system performance because some features are redundant or unimportant (Pappu and Pardalos, 2014; Li et al., 2016). Hence, the main goal of feature selection is to search for an optimal subset of relevant features and to reduce redundancy (Sun et al., 2013; Tang et al., 2014).

In this study, the Genetic Algorithm (GA) (Siedlecki and Sklansky, 1989) was adopted to measure the relevance and significance of the features and to avoid redundancy. The GA is a general adaptive optimisation procedure that is used here to reduce the dimensionality of the feature set; GAs have been successfully applied in a wide range of dimensionality reduction studies (Adams et al., 2015). Each variable is represented by a gene, and a sequence of genes is called a chromosome. A number of chromosomes (the population) is randomly initialised and then evolved by three genetic operators: selection, crossover and mutation. The chromosomes are evaluated by a predefined fitness function that measures their quality. The selection operator chooses high-performing chromosomes and transfers them directly to the next generation. The crossover operator creates new offspring chromosomes by swapping a portion of the genes between two chosen chromosomes. The mutation operator modifies the value of one or more genes of a selected chromosome in search of a better solution, as shown in Figure 5.22.
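The crossover and mutation operators described above can be sketched on binary chromosomes as follows. This is a minimal illustration rather than the implementation used in this study: single-point crossover and independent bit-flip mutation are assumed, and the function names and example values are hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    # Assumption: single-point crossover; the cut lies strictly inside the chromosome.
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def mutate(chromosome, p_mutation, rng):
    """Flip each gene (0 <-> 1) independently with probability p_mutation."""
    return [1 - g if rng.random() < p_mutation else g for g in chromosome]


rng = random.Random(0)           # fixed seed so the example is repeatable
a = [1, 1, 1, 1, 1, 1]           # hypothetical parent chromosomes
b = [0, 0, 0, 0, 0, 0]
child_a, child_b = single_point_crossover(a, b, rng)
unchanged = mutate(a, 0.0, rng)  # no gene is flipped
flipped = mutate(a, 1.0, rng)    # every gene is flipped
```

Each child keeps the head of one parent and the tail of the other, so no gene value is created or lost by crossover alone; only mutation can introduce a value that neither parent carries at a given position.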

Chapter 5. Liver Lesion Characterisation and Classification

The schema theorem shows that a schema appearing in chromosomes of above-average fitness has a greater expectation of propagating through successive populations as a GA evolves (McCall, 2005), as shown in Equation 5.24.

$$ m_H(i+1) \geq F_H(i)\, m_H(i) \left[ 1 - p_c \frac{l_H}{l-1} \right] \left[ (1 - p_m)^{o(H)} \right] \tag{5.24} $$

where $H$ is a schema; $m_H(i)$ is the number of chromosomes belonging to $H$ in population $i$ (the left-hand side being its expected value in population $i+1$); $F_H(i)$ is the relative fitness of $H$, defined as the average fitness of all chromosomes in population $i$ belonging to $H$ divided by the average fitness of all chromosomes in the population; $l$ is the length of a chromosome; $l_H$ is the defining length of $H$; $o(H)$ is the order of $H$ (its number of fixed positions); $p_m$ is the mutation probability; and $p_c$ is the crossover probability.
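As a numerical illustration of Equation 5.24, the sketch below evaluates its right-hand side for one schema. It assumes the usual reading of the schema theorem, in which the crossover term uses the defining length of the schema and the mutation term is raised to the order of the schema. The crossover and mutation probabilities and the chromosome length of 111 (one gene per available feature) are taken from this section; the schema count, relative fitness, defining length and order are hypothetical values chosen only for the example.

```python
def schema_bound(m_h, rel_fitness, p_c, p_m, defining_length, order, length):
    """Right-hand side of the schema theorem for a single schema H."""
    # Probability that crossover does not cut inside the schema.
    survive_crossover = 1.0 - p_c * defining_length / (length - 1)
    # Probability that none of the fixed positions is mutated.
    survive_mutation = (1.0 - p_m) ** order
    return rel_fitness * m_h * survive_crossover * survive_mutation


bound = schema_bound(
    m_h=10,             # hypothetical: 10 chromosomes currently match H
    rel_fitness=1.2,    # hypothetical: H is 20% fitter than the population average
    p_c=0.9,            # crossover probability used in this section
    p_m=0.001,          # mutation probability used in this section
    defining_length=5,  # hypothetical
    order=3,            # hypothetical
    length=111,         # one gene per available feature
)
```

With these values the bound is above the current count of 10, which is the sense in which a short, low-order schema of above-average fitness is expected to spread.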

In studies of lesion classification and characterisation, GAs produced more robust feature vectors than other feature selection techniques (Gletsos et al., 2003; Mougiakakou et al., 2007b; Aalaei et al., 2016). In addition, the GA is considered an excellent choice for feature selection because of its relative insensitivity to noisy data (Osowski et al., 2009). It differs significantly from other wrapper algorithms in that traditional methods search from a single point, whereas the GA searches from a population of points in parallel (Akhter et al., 2016). The GA is also less likely to become trapped in a local minimum and therefore provides a better approximation to the global optimum (Garg, 2010; Ling and Liu, 2015). Hence, the GA was adopted for this task.

5.6.1 Implementation of GA

When a GA is used for feature selection, each feature is considered a gene, represented as 1 (selected) or 0 (not selected), and a chromosome encodes one candidate feature subset. The classification accuracy (the fitness of the chromosome) was determined as the area, $A_z$, under the ROC curve. The fitness function $F(c)$ for the $c$th chromosome is given in Equation 5.25.

$$ F(c) = \frac{f(c) - f_{min}}{f_{max} - f_{min}}, \quad c = 1, 2, \ldots, n \tag{5.25} $$

where $f(c)$ is the $A_z$ value of the $c$th chromosome, and $f_{min}$ and $f_{max}$ are the minimum and maximum of $f(c)$ among the $n$ chromosomes, respectively.
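The normalisation in Equation 5.25 can be sketched as below. The function name and the example values are hypothetical, and the guard for the case in which all chromosomes have the same value is an added assumption, since the equation is undefined there.

```python
def normalised_fitness(az_values):
    """Min-max normalise raw fitness values to the range [0, 1] (Equation 5.25)."""
    f_min, f_max = min(az_values), max(az_values)
    if f_max == f_min:
        # Assumption: if every chromosome scores the same, treat them as equally fit.
        return [1.0] * len(az_values)
    return [(f - f_min) / (f_max - f_min) for f in az_values]


raw = [0.70, 0.80, 0.90]          # hypothetical A_z values for three chromosomes
scaled = normalised_fitness(raw)  # worst maps to 0, best maps to 1
```

The scaling stretches small differences in the raw values over the whole interval, so that selection can still discriminate between chromosomes whose raw values are close together.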

The fitness $F(c)$, which is based on the $A_z$ value, ranges from 0 to 1. The chromosome with the largest $A_z$ value is assigned a fitness of 1 and, by Equation 5.25, the chromosome with the smallest $A_z$ value a fitness of 0; the $A_z$ values themselves lie in the range $0.5 < A_z \leq 1$. The probability of the $c$th chromosome being selected as a parent, $P_s(c)$, is proportional to its fitness and is calculated as in Equation 5.26.

$$ P_s(c) = \frac{F(c)}{\sum_{j=1}^{n} F(j)}, \quad c = 1, 2, \ldots, n \tag{5.26} $$
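Equation 5.26 is fitness-proportionate (roulette-wheel) selection. A minimal sketch follows, with hypothetical function names and example values; it assumes that at least one chromosome has non-zero fitness.

```python
import random


def selection_probabilities(fitness):
    """Selection probability of each chromosome, proportional to its fitness."""
    total = sum(fitness)  # assumed to be greater than zero
    return [f / total for f in fitness]


def select_parents(population, fitness, k, rng):
    """Draw k parents with replacement, weighted by fitness."""
    return rng.choices(population, weights=fitness, k=k)


population = ["c1", "c2", "c3"]  # hypothetical labels standing in for chromosomes
fitness = [0.0, 0.5, 1.0]        # normalised fitness values F(c)
probs = selection_probabilities(fitness)
parents = select_parents(population, fitness, k=10, rng=random.Random(1))
```

Because the least fit chromosome has a normalised fitness of 0, it is never drawn as a parent under this scheme.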

Random sampling based on the probabilities $P_s(c)$ allowed chromosomes with higher fitness to be chosen more frequently. The crossover rate determines the probability that two parents exchange genes. After crossover, mutation provided a further chance of introducing new features. The processes of parent selection, crossover and mutation resulted in a new generation of $n$ chromosomes. The best subset of features was taken to be the chromosome that provided the highest average $A_z$ during the evolution process.
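The generational cycle of parent selection, crossover and mutation can be put together as in the sketch below. It is an illustration under stated assumptions, not the implementation used in this study: the fitness is a toy function that rewards agreement with a hypothetical target mask (in place of the $A_z$ of a trained classifier), crossover is single-point, and the best chromosome is simply the one with the highest fitness seen so far. All names and parameter values in the usage lines are hypothetical.

```python
import random


def evolve(fitness_fn, n_genes, n_chromosomes, n_generations,
           p_init, p_crossover, p_mutation, seed=0):
    """Run a simple GA and return the best chromosome seen and its fitness."""
    rng = random.Random(seed)
    population = [[int(rng.random() < p_init) for _ in range(n_genes)]
                  for _ in range(n_chromosomes)]
    best, best_score = None, float("-inf")
    for _ in range(n_generations):
        scores = [fitness_fn(c) for c in population]
        for chromosome, score in zip(population, scores):
            if score > best_score:
                best, best_score = list(chromosome), score
        # Min-max normalised fitness, used as selection weights.
        lo, hi = min(scores), max(scores)
        weights = [(s - lo) / (hi - lo) if hi > lo else 1.0 for s in scores]
        next_population = []
        while len(next_population) < n_chromosomes:
            a, b = rng.choices(population, weights=weights, k=2)
            a, b = list(a), list(b)
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_genes - 1)  # single-point crossover
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                # Bit-flip mutation, applied gene by gene.
                next_population.append(
                    [1 - g if rng.random() < p_mutation else g for g in child])
        population = next_population[:n_chromosomes]
    return best, best_score


TARGET = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical "ideal" feature mask


def toy_fitness(chromosome):
    """Fraction of genes that agree with the hypothetical target mask."""
    return sum(g == t for g, t in zip(chromosome, TARGET)) / len(TARGET)


best, best_score = evolve(toy_fitness, n_genes=8, n_chromosomes=20,
                          n_generations=30, p_init=0.5,
                          p_crossover=0.9, p_mutation=0.05)
```

In a wrapper setting such as the one described here, `toy_fitness` would be replaced by a function that trains and evaluates a classifier on the features whose genes are set to 1.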

In this thesis, the following parameters were used: the initial probability of a feature's presence ($P_{init}$), the probability of crossover ($P_c$) and the probability of mutation ($P_m$) were 0.002, 0.9 and 0.001, respectively. Several studies suggest that a high crossover probability combined with a low mutation probability gives better results (Chtioui et al., 2009; Sipper et al., 2018).
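The role of the initial presence probability can be illustrated with a short sketch of population initialisation. It assumes that each gene is set independently with this probability; the population size of 20 and the function name are hypothetical, since the population size is not stated in this section.

```python
import random

N_FEATURES = 111  # number of available features reported in this section
P_INIT = 0.002    # initial probability of a feature's presence


def random_chromosome(n_genes, p_init, rng):
    """Set each gene to 1 independently with probability p_init."""
    return [1 if rng.random() < p_init else 0 for _ in range(n_genes)]


rng = random.Random(42)
population = [random_chromosome(N_FEATURES, P_INIT, rng) for _ in range(20)]

# Expected number of selected features per initial chromosome: 111 * 0.002.
expected_selected = N_FEATURES * P_INIT
```

Under this assumption the expected number of features per initial chromosome is 0.222, so the initial population is very sparse and features enter the chromosomes mainly through the subsequent crossover and mutation steps.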

Figure 5.23 illustrates the genetic algorithm evaluation for feature selection. Figure 5.23(a) depicts the evolution of the number of selected features and Figure 5.23(b) depicts the total area under the ROC curve, $A_z$, for the GA.

Figure 5.23: The genetic algorithm evaluation for feature selection. (a) The evolution of the number of selected features for the GA. (b) The evolution of the area $A_z$ under the ROC curve for the GA.

The classifiers' performance, based on the area $A_z$ under the ROC curve for different numbers of features, was compared. It is noted that the classification accuracies of the three classifiers improved: the best performance of SVM was 0.97 with 39 features, compared with 0.95 for LR and 0.94 for LDA, both with 78 features, as shown in Figure 5.24.

Figure 5.24: Comparison of the area $A_z$ under the ROC curve for each considered classifier based on the number of features.

The SVM classification performance with and without feature selection is compared using ROC analysis in Figure 5.25. The overall ROC performance of classification without GA feature selection is $A_z = 0.94$, compared with $A_z = 0.97$ after feature selection is applied. With feature selection, fewer than half of the features were selected, and at least one feature was selected from each category (intensity, texture and shape). Thus, the three categories of features complement each other to achieve better results. The results show that feature selection can improve the accuracy of the classifier.
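For readers unfamiliar with the measure, the area under the ROC curve can be computed directly from classifier scores as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half. The sketch below uses this pairwise formulation; the function name, labels and scores are hypothetical and are not data from this study.

```python
def area_under_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive case ranked above negative case
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 1, 0]              # hypothetical: 1 = positive class, 0 = negative class
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]  # hypothetical classifier outputs
az = area_under_roc(labels, scores)      # 8 of the 9 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking and a value of 1 to perfect separation of the two classes.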

The GA feature selection approach was run to evaluate its performance. Table 5.14 summarises the classification results, based on the area $A_z$ under the ROC curve, with respect to the number of selected features. Using the GA, the selected feature set contains only 38 of the 111 available features and achieves the best classification accuracy of 97%. The feature vector size is thereby reduced by 65.77%. Feature set 4 has the same accuracy as feature set 3 but uses 13 more features, an increase equal to 11.71% of the available features. This theoretical result is attributed to the good separation between the data in the selected base.

Feature set   Number of features   $A_z$   Percentage of reduction
1             13                   0.75    88.29%
2             25                   0.84    77.48%
3             38                   0.97    65.77%
4             51                   0.97    54.05%
5             78                   0.96    29.73%
6             87                   0.95    21.62%
7             100                  0.94    9.91%
8             All features         0.94    0%

Table 5.14: The number of features and the area $A_z$ under the ROC curve in the GA feature selection approach.
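The "Percentage of reduction" column of Table 5.14 follows from the number of selected features and the 111 available features. A short check, using a hypothetical helper name:

```python
TOTAL_FEATURES = 111  # number of available features reported in this section


def percentage_reduction(n_selected, total=TOTAL_FEATURES):
    """Percentage by which the feature vector is shortened."""
    return 100.0 * (total - n_selected) / total


# Feature-set sizes listed in Table 5.14 ("All features" = 111).
sizes = [13, 25, 38, 51, 78, 87, 100, 111]
reductions = [round(percentage_reduction(n), 2) for n in sizes]
```

The same helper gives the relative cost of moving from feature set 3 to feature set 4: 13 additional features out of 111.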

The best classification accuracy of 97% is achieved using only 38 features: 9 histogram bins, Skewness, Kurtosis, Entropy, GLCM (Contrast, Homogeneity and Correlation), 21 bins of Gabor Energy, Elongation and Roundness. The features selected by the GA relate to lesion appearance and shape. Examination of the CT images of the pathological area shows that lesions vary in brightness, distribution and regularity of shape, which explains why these features were chosen. For example, Elongation and Roundness describe the regularity of shape: a malignant lesion is mostly irregular in shape compared with a benign lesion. In malignant lesions, the internal structure shows a wide range of changes (heterogeneous attenuation) and invasion of adjacent structures, whereas in benign lesions the internal structure is diffusely homogeneous. These aspects also explain the choice of Homogeneity, Entropy, Gabor Energy, etc. as descriptive characteristics of the lesion. The GA approach reduces the feature size by 65.77% without compromising the accuracy of the classifier.

In document Automated Characterisation and Classification of Liver Lesions From CT Scans (Page 153-157)