2.3 Experimental setup
2.3.6 Long short-term memory network
The long short-term memory (LSTM) network was used because it can detect flexible patterns such as insertions or deletions in sequences, which the CNN cannot easily detect because of its static kernel size and position. Separate LSTM networks were trained for the two datasets, each with its own hyperparameter search. Each LSTM had 1 layer and was trained for 500 updates. The search covered learning rates of 0.001, 0.01, 0.1 and 1 with linear learning rate decay, batch sizes of 128, 256, 512 and 1024, LSTM layers with 64, 128, 256 or 512 neurons, and L2 penalties of 1e-4, 1e-5 or 1e-6 (table 2.7).
Hyperparameter Values
Learning rate {0.001, 0.01, 0.1, 1}
Batch size {128, 256, 512, 1024}
L2-penalty {1e-4, 1e-5, 1e-6}
Number of neurons {64, 128, 256, 512}
Table 2.7: LSTM hyperparameters used in the grid search procedure for the simulated and real dataset
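To make the size of this search space concrete, the following minimal sketch enumerates the grid of table 2.7. The names `grid`, `iter_configs` and `linear_decay` are illustrative and not taken from the original implementation, and the decay schedule assumes that the learning rate falls linearly to zero over the 500 updates, which the text does not specify.

```python
from itertools import product

# Values copied from table 2.7; the dictionary keys are illustrative names.
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1],
    "batch_size": [128, 256, 512, 1024],
    "l2_penalty": [1e-4, 1e-5, 1e-6],
    "n_neurons": [64, 128, 256, 512],
}


def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of all value lists."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def linear_decay(initial_lr, update, total_updates=500):
    """Learning rate at a given update.

    Assumption: the rate decays linearly to zero at the last update.
    """
    return initial_lr * (1 - update / total_updates)


configs = list(iter_configs(grid))
print(len(configs))  # 4 * 4 * 3 * 4 = 192 configurations
```

Under these assumptions, a full grid search trains 192 models per fold and dataset.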
The LSTM uses the same input encoding as the CNN. The input is followed by one LSTM layer with a variable number of LSTM cells, then a ReLU function, a fully connected layer and a sigmoid output function (figure 2.11).
Figure 2.11: Sketch of the used LSTM network
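The layer order described above (LSTM layer, ReLU, fully connected layer, sigmoid output) can be traced in a small pure-Python forward pass. This is only a sketch under stated assumptions, not the implementation used in the experiments: the weights are random, the input is a hypothetical one-hot encoding, and the ReLU is assumed to be applied to the hidden state after the last time step, which the text does not specify.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, U, b):
    """One LSTM time step with input (i), forget (f), output (o) and cell (g) gates."""
    def affine(gate):
        return [
            sum(w * xi for w, xi in zip(W[gate][j], x))
            + sum(u * hi for u, hi in zip(U[gate][j], h))
            + b[gate][j]
            for j in range(len(h))
        ]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_params(n_in, n_hidden, n_out, seed=0):
    """Random illustrative weights; a trained model would learn these."""
    rng = random.Random(seed)
    mat = lambda rows, cols: [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    gates = "ifog"
    return {
        "W": {g: mat(n_hidden, n_in) for g in gates},
        "U": {g: mat(n_hidden, n_hidden) for g in gates},
        "b": {g: [0.0] * n_hidden for g in gates},
        "W_out": mat(n_out, n_hidden),
        "b_out": [0.0] * n_out,
    }


def forward(sequence, params):
    """LSTM layer -> ReLU -> fully connected layer -> sigmoid."""
    n_hidden = len(params["b"]["i"])
    h, c = [0.0] * n_hidden, [0.0] * n_hidden
    for x in sequence:
        h, c = lstm_step(x, h, c, params["W"], params["U"], params["b"])
    relu = [max(0.0, v) for v in h]  # assumption: ReLU on the last hidden state
    logits = [
        sum(w * r for w, r in zip(row, relu)) + bias
        for row, bias in zip(params["W_out"], params["b_out"])
    ]
    return [sigmoid(z) for z in logits]


# Hypothetical one-hot encoded sequence of length 3 over a 4-letter alphabet.
sequence = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
params = init_params(n_in=4, n_hidden=8, n_out=1)
print(forward(sequence, params))
```

Because the recurrent state is updated position by position, a shifted motif changes the path through the state rather than falling outside a fixed kernel window, which is the motivation given above for using an LSTM.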
Chapter 3 Results
3.1 Cluster analysis
Because of memory limitations, the clustering for the simulated dataset was restricted to 50% of the data, with 5000 randomly selected samples per class. The best model was obtained with a configuration of 10 clusters, which was chosen because it corresponds to the number of classes. Varying the linkage criterion had no impact on the results. The Hamming distance gave the best results as distance measure, although experiments with the Euclidean distance gave similar results. Figure 3.1 shows that most classes were recovered perfectly by the cluster analysis; classes 8, 6 and 9 were the exceptions.
Figure 3.1: Cluster linkage for 10% of the data samples in the simulated dataset (Clusters are colored, labels are shown with the label number)
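One possible reason why the Hamming and Euclidean distances behave similarly can be sketched as follows, under the assumption (not stated in this section) that the clustered feature vectors are binary: for 0/1 vectors the squared Euclidean distance equals the number of differing positions, so both measures rank pairs of samples identically. The function names and example vectors are illustrative.

```python
def hamming_count(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical binary feature vectors.
u = [1, 0, 0, 1, 1, 0]
v = [1, 1, 0, 0, 1, 0]

print(hamming_count(u, v))      # 2
print(squared_euclidean(u, v))  # 2
```

For non-binary encodings this equality no longer holds, and the two measures can produce different cluster assignments.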
For the real dataset, the cluster analysis showed clusters for different classes. The number of clusters was set to 13, the number of classes. Only the dark blue and yellow clusters contain a high proportion of class 1, which corresponds to the celiac class. In addition, clusters for class 2 (HIV class) are visible from the bottom to the bottom right and at the top. The other classes could not be visually separated by the cluster analysis. For better visual representation, only 10% of the data is shown in the cluster linkage of figure 3.2.
Figure 3.2: Cluster linkage for 10% of the data samples in the real dataset (Clusters are colored, labels are shown with the label number)
3.2 K-nearest neighbors algorithm
In the experiments, the k-nearest neighbors algorithm (kNN) gave the best results with the brute-force search algorithm and the Minkowski metric with p = 2, which equals the Euclidean metric.
The optimal number of neighbors varied between 1 and 2 across the folds: two neighbors gave the best result for folds 0 and 4, and one neighbor for folds 1, 2 and 3 (table A.1). The average balanced accuracy (BACC) per antibody class over all five models of the simulated dataset ranged from 96.2% (SD: 4.8) to 100% (SD: 0.0) (table 3.1). The area under the curve (AUC) for all folds was 1. The receiver operating characteristic (ROC) curve is shown in figure A.1.
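The kNN configuration described above can be illustrated with a brute-force sketch: the Minkowski distance with p = 2 is the Euclidean distance, and the prediction is a majority vote among the k closest training samples. The function names and the toy data are illustrative and do not come from the experiments.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p = 2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=1, p=2):
    """Brute-force kNN: rank all training samples by distance, then vote."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data with two classes.
train_X = [[0, 0], [1, 0], [5, 5], [6, 5]]
train_y = ["a", "a", "b", "b"]

print(minkowski([0, 0], [3, 4]))                       # 5.0
print(knn_predict(train_X, train_y, [0.2, 0.1], k=1))  # a
print(knn_predict(train_X, train_y, [5.4, 5.0], k=2))  # b
```

Brute-force search compares the query with every training sample, so its cost grows linearly with the training set size; tree-based indexes avoid this but were not the best-performing option here.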
Class Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 AVG
1 1.000 1.000 1.000 0.992 0.991 0.996 ± 0.004
2 1.000 1.000 1.000 0.920 0.938 0.972 ± 0.035
3 1.000 1.000 1.000 0.944 1.000 0.989 ± 0.022
4 0.972 0.938 1.000 1.000 0.996 0.981 ± 0.024
5 1.000 1.000 1.000 1.000 1.000 1.000 ± 0.000
6 1.000 0.917 1.000 1.000 1.000 0.983 ± 0.033
7 0.996 0.983 1.000 1.000 0.977 0.991 ± 0.009
8 0.954 0.992 1.000 0.917 1.000 0.972 ± 0.033
9 0.954 1.000 1.000 0.991 1.000 0.989 ± 0.018
10 1.000 0.892 1.000 1.000 0.917 0.962 ± 0.048
Table 3.1: BACC values with mean and standard deviation for all 5 folds from kNN models of the simulated dataset
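As a reminder of what the BACC values in this chapter measure, the following sketch computes the balanced accuracy of one class as the mean of sensitivity and specificity. It assumes that the per-class values are obtained in a one-vs-rest fashion, which this section does not state explicitly; the function name and the labels are illustrative.

```python
def balanced_accuracy(y_true, y_pred, positive):
    """One-vs-rest balanced accuracy: mean of sensitivity and specificity.

    Assumes that both the positive class and at least one other class
    occur in y_true; otherwise a division by zero is raised.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical labels for three classes.
y_true = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

print(balanced_accuracy(y_true, y_pred, positive=1))  # 0.875
print(balanced_accuracy(y_true, y_pred, positive=2))  # 0.9375
```

Unlike plain accuracy, this measure is not dominated by the majority class, which matters when one class is compared against all others.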
For the celiac and HIV classes, with both random and cluster CV, the best results were mostly achieved with one neighbor. The exceptions, with two neighbors, were the celiac class (folds 3 and 4) and the HIV class with cluster CV (folds 1 and 3) (table A.2). The mean BACC for the celiac and HIV class was 99.6% (SD: 0.1) and 97.4% (SD: 0.5), respectively, with random CV (table 3.2), and 97.8% (SD: 1.2) and 93.0% (SD: 3.3) with cluster CV (table 3.3). The AUC was 1 for the celiac class and between 0.97 and 0.98 for the HIV class. ROC curves are shown in figure A.2 for the celiac class and in figures A.3 and A.4 for the HIV class.
Fold Celiac HIV
0 0.995 0.971
1 0.996 0.983
2 0.995 0.978
3 0.996 0.971
4 0.999 0.968
AVG 0.996 ± 0.001 0.974 ± 0.005
Table 3.2: BACC values with mean and standard deviation for all five folds from kNN models of the real dataset for the celiac and HIV class
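The AVG row of table 3.2 can be recomputed from the fold values. The sketch below assumes the population standard deviation (division by n), which reproduces the rounded values in the table; the sample standard deviation (division by n - 1) would round to 0.002 and 0.006 instead. The helper name `summarize` is illustrative.

```python
from statistics import mean, pstdev

# Fold values copied from table 3.2.
celiac = [0.995, 0.996, 0.995, 0.996, 0.999]
hiv = [0.971, 0.983, 0.978, 0.971, 0.968]


def summarize(values):
    """Format mean and population standard deviation to three decimals."""
    return f"{mean(values):.3f} ± {pstdev(values):.3f}"


print(summarize(celiac))  # 0.996 ± 0.001
print(summarize(hiv))     # 0.974 ± 0.005
```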
Fold Celiac HIV
0 0.996 0.976
1 0.985 0.957
2 0.974 0.890
3 0.962 0.899
4 0.973 0.928
AVG 0.978 ± 0.012 0.930 ± 0.033
Table 3.3: BACC values with mean and standard deviation for all five cluster folds from kNN models of the real dataset for the celiac and HIV class
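The lower scores with cluster folds are plausible if cluster CV keeps all samples of a cluster in the same fold, so that test samples have no near-duplicates in the training data. How the cluster folds were actually built is not described in this section; the following sketch shows one simple way to do it, a greedy assignment of whole clusters to the currently smallest fold. The function name and the cluster ids are hypothetical.

```python
from collections import Counter


def cluster_folds(cluster_ids, n_folds=5):
    """Assign each sample to a fold such that no cluster is split across folds."""
    sizes = Counter(cluster_ids)
    fold_sizes = [0] * n_folds
    fold_of_cluster = {}
    # Greedy balancing: place the largest clusters first into the smallest fold.
    for cluster, size in sizes.most_common():
        target = fold_sizes.index(min(fold_sizes))
        fold_of_cluster[cluster] = target
        fold_sizes[target] += size
    return [fold_of_cluster[c] for c in cluster_ids]


# Hypothetical cluster ids for 13 samples.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5]
folds = cluster_folds(cluster_ids, n_folds=3)
print(folds)
```

With random CV, by contrast, members of the same cluster can end up on both sides of the train/test split, which tends to favor nearest-neighbor methods in particular.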
3.3 Support vector machine
Support vector machines (SVM) were used as a baseline for the complexity of the classification task. The models for the simulated dataset were trained with different hyperparameters; the best results were achieved with the linear kernel, cost 1 and gamma 0.1 for all folds and seeds. The average BACC over all 5 folds and 3 random seeds was 98.7% (SD: 0.2). Further results are shown in table 3.4. The AUC for all classes and folds was 1. The ROC curve is shown in figure A.5, and the train and validation results are shown in tables A.19 and A.20, respectively.
Class Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 AVG
Table 3.4: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from SVM models of the simulated dataset
With 5-fold CV, the best results for the celiac and HIV class were achieved with the linear kernel and the radial basis function (RBF) kernel, respectively. For both classes the optimal gamma value was 0.1 for the majority of the configurations, while the value for cost ranged between 1 and 5. Detailed optimized hyperparameter settings are shown in table A.3 (celiac class) and table A.4 (HIV class). SVM models based on the real dataset showed an average BACC over all 5 folds and 3 random seeds of 99.5% (SD: 0.0) for the celiac class and 94.8% (SD: 0.0) for the HIV class (table 3.5). The AUC for the celiac and HIV class was 1 and 0.98, respectively. The ROC curves are shown in figures A.6 and A.7. The train and validation results are shown in tables A.21 and A.22, respectively.
Fold Celiac BACC HIV BACC
0 0.995 ± 0.000 0.946 ± 0.003
1 0.995 ± 0.000 0.948 ± 0.001
2 0.995 ± 0.001 0.948 ± 0.003
3 0.995 ± 0.001 0.950 ± 0.002
4 0.994 ± 0.001 0.947 ± 0.002
AVG 0.995 ± 0.000 0.948 ± 0.000
Table 3.5: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from SVM models of the real dataset for the celiac and HIV class
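For reference, the two kernel types compared in this section can be written down in a few lines. This is a generic sketch with illustrative function names, not the code used in the experiments. In the common parameterization (for example in LIBSVM and scikit-learn), gamma enters only the RBF kernel, whereas the cost parameter affects the optimization for both kernels.

```python
import math


def linear_kernel(x, y):
    """Linear kernel: the plain dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


print(linear_kernel([1, 2], [3, 4]))        # 11
print(rbf_kernel([1, 2], [1, 2]))           # 1.0
print(rbf_kernel([0, 0], [1, 1], gamma=0.2) < rbf_kernel([0, 0], [1, 1], gamma=0.1))  # True
```

A larger gamma makes the RBF similarity fall off faster with distance, which yields a more local and more flexible decision boundary.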
The optimized hyperparameters for SVM models with cluster CV showed more variation. For both the celiac and HIV class, the kernel varied between linear and RBF, gamma between 0.1 and 0.2, and cost ranged from 1 to 5. Detailed optimized hyperparameter settings for cluster CV are shown in table A.13 (celiac class) and table A.14 (HIV class). SVM models of the real dataset with cluster CV showed a mean BACC of 98.6% (SD: 0.0) for the celiac class and 85.7% (SD: 0.1) for the HIV class (table 3.6). The train and validation results are shown in tables A.23 and A.24, respectively.
Fold Celiac BACC HIV BACC
0 0.992 ± 0.000 0.904 ± 0.000
1 0.993 ± 0.000 0.903 ± 0.001
2 0.980 ± 0.004 0.857 ± 0.002
3 0.972 ± 0.000 0.822 ± 0.002
4 0.992 ± 0.000 0.800 ± 0.000
AVG 0.986 ± 0.000 0.857 ± 0.001
Table 3.6: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 cluster folds from SVM models of the real dataset for the celiac and HIV class