Chapter 5 Proposed Methodology

5.7 Evaluation Techniques

A number of techniques are used to compare and evaluate each model, and such evaluation is an essential step when processing any clinical dataset. The main idea is to estimate performance (e.g., correct classifications, incorrect classifications, error rate, etc.), which in turn provides a sound basis for assessing and testing the proposed model. When a classifier does not meet the main requirement, the model is rebuilt repeatedly with altered parameters until the expected outcomes are obtained.

This research study applies performance evaluation metrics by comparing the outcomes of the selected classifier with the class attribute. In this scenario, the error rate, accuracy and other performance measures are calculated accordingly. To estimate the error rate of each model, the number of misclassified instances is divided by the total number of instances. The classification accuracy is the complement of this quantity, so the error rate can equivalently be estimated as (1 − Accuracy). If the classification accuracy does not reach a certain threshold, 85% for example, then the feature selection and pre-processing methods are revised until a better result is obtained. Table 5-11 illustrates the most common evaluation approaches in machine learning and their characteristics.
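The relationship between error rate, accuracy and the acceptance threshold can be sketched in a few lines. This is a minimal illustration, assuming the error rate is the number of misclassified instances over the total number of instances; the function names and the default threshold of 0.85 (taken from the 85% example in the text) are illustrative and not part of the original experiments.

```python
def error_rate(y_true, y_pred):
    """Fraction of instances whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    misclassified = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return misclassified / len(y_true)


def accuracy(y_true, y_pred):
    """Accuracy is the complement of the error rate."""
    return 1.0 - error_rate(y_true, y_pred)


def meets_threshold(y_true, y_pred, threshold=0.85):
    """True when accuracy reaches the (assumed) acceptance threshold."""
    return accuracy(y_true, y_pred) >= threshold
```

When `meets_threshold` returns False, the text above suggests revisiting feature selection and pre-processing before rebuilding the model.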

Table 5-11: Evaluation techniques in machine learning

Evaluation method: K-fold cross-validation technique [241]
Methodology: Each classifier is trained on n – 1 groups, with one fold held out for testing.
Description: This method works by selecting a number of folds (or divisions) into which the data are partitioned; each fold is held out in turn for testing. The process trains a model for each fold using all the data outside the fold, tests each model's performance using the data inside the fold, and then calculates the average test error over all folds.
Characteristics: The outcome can be unbiased because n classifiers are built, each held-out fold is tested, and the n test outcomes are averaged.

Evaluation method: Re-substitution
Methodology: All records in the dataset are used for both training and testing.
Description: In order to build an optimal classifier, all the available data are utilised for modelling.
Characteristics: The results give a biased estimate, as the same data are used for both the training and the testing process.

Evaluation method: Holdout technique (data partition). This method is selected for our experiments.
Methodology: The datasets are divided between training sets and testing sets.
Description: The datasets are divided into training and testing sets. Usually, the training set is twice the size of the test set or larger. In this thesis, the training set receives 70%, the validation set receives 10%, and the testing set receives 20%.
Characteristics: The estimate of the model's error rate is unbiased.

Evaluation method: Jack-knife (leave-one-out)
Methodology: This approach functions like k-fold cross-validation, but with n = N, where N is the number of samples.
Description: The classifier is very close to optimal in the sense that all samples are used for both training and testing.
Characteristics: The classifier result is unbiased, but the method is slow because it is computationally intensive.
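The fold rotation behind k-fold cross-validation can be sketched as follows. This is a minimal illustration using only the standard library, not the implementation used in this thesis; `k_fold_indices`, `cross_validate` and the `fit_and_score` callback are hypothetical names, and the callback is assumed to train a model on the training indices and return its test error on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs, holding each fold out once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average the test error returned by fit_and_score over all k folds."""
    errors = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(errors) / len(errors)
```

Setting `k` equal to `n_samples` gives the jack-knife (leave-one-out) case, in which every fold contains a single sample.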

The holdout method is considered a good tool when a sufficient amount of data is available. It works by setting aside a percentage of the data: the training set is used to train the model, and the performance of the classifier is then assessed on the test set. This study used the holdout method to allocate the training, validation and testing cases. The training set received 70% of the data for generating the classification algorithm, the validation set received 10%, and the testing set received 20% to estimate the generalisation performance and accuracy of the classifiers, particularly on independent objects. Learning from the dataset requires two stages to build the learning schemes. The training stage builds the basic structure of each model and calculates its error rate. The SCD datasets are then evaluated on the testing set in order to estimate the accuracy and error rate of each model. The main purpose is to compare our models with the baseline control models LNN (test) and ROM (test), demonstrating that our classifiers provide significantly better results than such baselines. It was found that the combined classifier produced the best results among the classifiers. Finally, it is important to use validation techniques; otherwise, the estimated error rate is likely to be unrealistic and to lead to biased estimation.
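The 70% / 10% / 20% allocation described above can be sketched as a shuffled index split. This is a minimal illustration under the assumption that records are assigned at random; the function name `holdout_split`, its parameters and the fixed seed are hypothetical and do not describe the tooling actually used in the experiments.

```python
import random


def holdout_split(n_samples, train=0.70, val=0.10, seed=0):
    """Return (train_idx, val_idx, test_idx); the test set takes the remainder."""
    if train <= 0 or val < 0 or train + val >= 1:
        raise ValueError("train and val fractions must leave room for a test set")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random, reproducible assignment
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx
```

With 100 records and the default fractions, the three sets contain 70, 10 and 20 records and do not overlap, so the test set remains independent of the data used to fit and tune the model.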

5.7.1 Performance Evaluation Metrics

The performance of a model is evaluated with respect to a parameter known as the decision threshold t (0 ≤ t ≤ 1), which determines the final class membership of a given object [242]. In this study, classifier evaluation consists of both out-of-sample (testing) and in-sample (training) diagnostics. To compare the evaluation outcomes, classification measures such as sensitivity, specificity, precision, F1 score, Youden's J statistic and overall classification accuracy are calculated. Additionally, the true and false outcomes of a model are represented using Receiver Operating Characteristic (ROC) plots and the Area under the Curve (AUC), from which the classification ability across all operating points is ascertained.

Sensitivity and specificity are appropriate evaluation measures for binary model outputs. To illustrate sensitivity, a test with 100% sensitivity means that all patients with the 500 mg dosage were correctly classified. In contrast, a test with 80% sensitivity means that 80% of patients with the 500 mg dosage were correctly predicted and 20% were incorrectly classified (False Negative). With regard to specificity, a test with 100% specificity means that all patients not on the 500 mg dosage were correctly identified as such. A test with 80% specificity means that the algorithm correctly classified 80% of patients not on the 500 mg dosage, while 20% were incorrectly classified (False Positive).

The evaluation outcomes can also be compared through quantities derived from the confusion matrix. Precision, also known as the Positive Predictive Value (PPV), is one such measure [243]. It is the number of TP divided by the total number of TP and FP. In other words, it is a function of TP and of the instances that are misclassified as positive (FP).
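Sensitivity, specificity and precision all follow from the four confusion-matrix counts. The sketch below is a minimal illustration, assuming a binary problem in which the positive class is encoded as 1 (for instance, patients on the 500 mg dosage); the function names are hypothetical, and ratios with an empty denominator are returned as NaN by choice.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # negative instance misclassified as positive
        elif t == positive:
            fn += 1  # positive instance that was missed
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    # NaN signals that the measure is undefined for this sample.
    return numerator / denominator if denominator else float("nan")


def sensitivity(tp, fn):
    return _ratio(tp, tp + fn)  # TP / (TP + FN)


def specificity(tn, fp):
    return _ratio(tn, tn + fp)  # TN / (TN + FP)


def precision(tp, fp):
    return _ratio(tp, tp + fp)  # TP / (TP + FP), i.e. the PPV
```

For example, if 8 of 10 positive patients are detected, sensitivity is 0.8 and the remaining 2 patients are false negatives.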

The F-score, also called the F-measure, is a common performance measure that combines precision and recall within a single value [243]. It gives a single indication of a test's accuracy on our datasets. As mentioned previously, precision is a function of TP and of the instances misclassified as positive (FP), while recall is a function of the correctly classified positive instances (TP) and the positive instances that were missed (FN).
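The F-score can be sketched as the weighted harmonic mean of precision and recall. This is a minimal illustration using the standard F-beta definition, with `beta=1` giving the F1 score; the function name and the choice to return 0.0 when the score is undefined are assumptions made for the example.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts; beta=1 gives the F1 score."""
    if tp == 0:
        # Precision and recall are both zero (or undefined): report 0.0 by choice.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot reach a high F1 score through high precision alone or high recall alone.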

Youden's J statistic summarises a point on the ROC curve. It estimates the effectiveness of a diagnostic test and allows the selection of an optimal threshold value [244]. Its value ranges from -1 to 1. It is 0 when the test gives the same proportion of positive outcomes for patients with and without the given medication dosage, in which case the test is considered useless. A value of 1 indicates that the test is perfect, as there are no FP or FN.
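Threshold selection with Youden's J can be sketched as a search over candidate thresholds. This is a minimal illustration, assuming J = sensitivity + specificity − 1, classifier scores in [0, 1], a positive class encoded as 1, and at least one positive and one negative instance; the function names and the candidate list are hypothetical.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic: 0 for a useless test, 1 for a perfect one."""
    return sensitivity + specificity - 1.0


def best_threshold(y_true, scores, thresholds, positive=1):
    """Return the (threshold, J) pair with the largest J among the candidates."""
    best_t, best_j = None, float("-inf")
    for t in thresholds:
        tp = fp = tn = fn = 0
        for y, s in zip(y_true, scores):
            predicted_positive = s >= t
            if y == positive:
                if predicted_positive:
                    tp += 1
                else:
                    fn += 1
            else:
                if predicted_positive:
                    fp += 1
                else:
                    tn += 1
        j = youden_j(tp / (tp + fn), tn / (tn + fp))
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

Ties are resolved in favour of the first candidate that reaches the maximum, which is an arbitrary choice for this sketch.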

The ROC curve offers a graphical representation of each model based on its sensitivity and specificity. Each point on the ROC curve corresponds to one classification threshold and plots the proportion of positive samples that are correctly classified against the proportion of negative samples that are incorrectly classified. Accuracy, in turn, is calculated from the TP, TN, FP and FN counts; it is the proportion of all predictions that are correctly classified.
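The construction of ROC points and the corresponding AUC can be sketched as follows. This is a minimal illustration, assuming a positive class encoded as 1, at least one positive and one negative instance, and trapezoidal integration for the area; the function names are hypothetical and the sketch is not the plotting routine used for the thesis figures.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y != 1)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An AUC of 0.5 corresponds to a classifier that ranks positives and negatives no better than chance, while an AUC of 1 corresponds to perfect separation.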

5.8 Summary

This chapter presented the processing stages that make up the methodology of our simulation experiment study. Data pre-processing, together with its subcomponents, was the major part of this thesis. These subcomponents include data collection, data cleaning, the detection and processing of outliers, missing values and the missing-value mechanism, and data integration and normalization. Feature selection was described for choosing the proper features used in the training and testing process. This chapter has also discussed the experimental setup of the machine learning approaches. A set of 7 single classifiers and 7 ensemble classifiers has been addressed, with a full description of each model. The following chapter will discuss the experimental setup for the machine learning models.
