Submitted by Helga Ludwig, BSc Submitted at Institute for Machine Learning Supervisor Univ. Prof. Dr. Sepp Hochreiter Co-Supervisor Guenter Klammbauer, PhD Michael Widrich, MSc February 2020 JOHANNES KEPLER UNIVERSITY LINZ Altenbergerstraße 69 4040 Linz, Österreich www.jku.at
Antibody class prediction from sequence with different Machine Learning Methods
Master Thesis
to obtain the academic degree of
Master of Science
in the Master’s Program
Abstract
Classical approaches to finding distinct patterns in sequences for classifying antibodies by their antigen are costly in terms of time and resources, because antibodies have to be tested for binding in a laboratory, which requires different chemicals, antigens, incubation time, machines and sterility. Machine learning (ML) methods, especially convolutional neural networks and long short-term memory networks, provide a modern and efficient approach to classifying sequences (Sharif Razavian et al., 2014; Sønderby et al., 2015). The aim of this thesis is therefore to compare cluster analysis, the k-nearest neighbors algorithm (kNN), support vector machines (SVM), convolutional neural networks (CNN) and long short-term memory networks (LSTM) for antibody-specific antigen prediction from single sequences. Experiments were conducted on antibody amino acid (AA) sequences in two datasets: a simulated dataset for setting up the networks and verifying the ML methods, and a more complex real dataset built from sequenced AA antibody sequences with different classes. The datasets differed in properties such as the length of sequences (15AA vs. 23-280AA), number of samples (100,000 vs. 6,747) and number of classes (10 vs. 13). For the training and testing of the ML methods with the real dataset, only human complementarity determining region 3 (CDR3) sequences (n=5,935) were considered, because of their main role in the antibody-antigen binding process (Xu & Davis, 2000). A preliminary hyperparameter search was conducted, which led to a reduction in the hyperparameter search space for both datasets. The resulting hyperparameters were selected and adjusted for the training of the models, to optimize results within the limits of available training time and computational resources. To achieve an almost unbiased estimate for the antigen class, a 5-fold cross validation (CV) procedure was used and complemented with a clustered CV to check whether the supervised techniques overfitted.
Datasets were randomized and divided into 60% training, 20% validation and 20% test data. In the real dataset only the scores of the HIV class with 2,543 samples and the celiac class with 1,452 samples were considered as prediction targets. A cluster analysis procedure successfully clustered all 10 classes of the simulated dataset and could identify some clusters for the celiac and HIV classes of the real dataset. On the simulated dataset, the supervised ML methods kNN, SVM, CNN and LSTM achieved an overall balanced accuracy (BACC) of 98.4%, 98.7%, 100% and 100%, respectively. On the real dataset, kNN, SVM, CNN and LSTM showed an average BACC of 99.6%, 99.5%, 100% and 99.8% for the celiac class and 97.4%, 94.8%, 97.5% and 94.9% for the HIV class, respectively. With clustered CV, the average BACC for kNN, SVM, CNN and LSTM was 97.8%, 98.6%, 99.7% and 98.8% for the celiac class and 93%, 85.7%, 94.7% and 85.8% for the HIV class, respectively. The results of the experiments are promising and show that supervised ML methods, especially CNN and LSTM, can perform very well in antibody classification from real-world biological sequence data. Although simpler ML methods like kNN and SVM can be sufficient for simply structured datasets, LSTM, and especially CNN, should be used to ensure the best possible results for antibody class prediction from sequence.
Contents
1 Introduction
  1.1 Biological background
    1.1.1 Antibodies
    1.1.2 Antigen binding activity
    1.1.3 VDJ recombination
    1.1.4 Antigen recognition
  1.2 Machine learning
    1.2.1 Cluster analysis
    1.2.2 K-nearest neighbors algorithm
    1.2.3 Support vector machines
    1.2.4 Artificial neural network
    1.2.5 Convolutional neural network
    1.2.6 Recurrent neural network
    1.2.7 Long short-term memory networks
2 Methodology
  2.1 Antibody sequence dataset
    2.1.1 Simulated dataset
    2.1.2 Real dataset
  2.2 Experimental structure
    2.2.1 Performance measure
    2.2.2 Update scheme
    2.2.3 Activation functions
    2.2.4 Loss
    2.2.5 Class weighting
  2.3 Experimental setup
    2.3.1 Hardware and software setup
    2.3.2 Cluster analysis
    2.3.3 K-nearest neighbors
    2.3.4 Support vector machine
    2.3.5 Convolutional neural network
    2.3.6 Long short-term memory network
3 Results
  3.1 Cluster analysis
  3.2 K-nearest neighbors algorithm
  3.3 Support vector machine
  3.4 Convolutional neural network
  3.5 Long short-term memory network
4 Discussion
  4.1 Interpretation of results
  4.2 Limitations
  4.3 Conclusion and future work
A Appendix
  A.1 Index of abbreviations
Chapter 1
Introduction
All higher organisms serve as hosts for microorganisms. The majority of these relationships are benign and in some cases they are even beneficial for both species. However, some microorganisms, so-called pathogens, can be harmful to higher organisms. Vertebrates have therefore developed a sophisticated immune system with an innate and an adaptive immune response. The innate immune system is universal and can involve any kind of cell type. The adaptive immune system, on the other hand, is highly specific to a single unique pathogen. While the innate immune response is short lasting, the adaptive immune system can ensure lifelong protection, but only against specific pathogens. Both parts of the immune system have developed mechanisms to distinguish the host's own cells and non-pathogenic cells from pathogenic foreign cells. While the innate system has sensors to detect patterns or types of molecules that are common in pathogenic cells, the adaptive immune system has a genetic mechanism based on a unique process that produces an almost limitless variety of proteins, called antibodies. These highly specific proteins can bind to nearly every molecule. Molecules that are bound by an antibody are called antigens. Antibodies are able to differentiate between two proteins that differ by as little as one amino acid (AA), or even between different optical isomers of a molecule. This ability enables the adaptive immune system to react in a pathogen-specific manner (Alberts et al., 2014). Since antibodies are highly specific, they have various applications in pharmacy and synthetic biology. Sequenced antibodies are used for the diagnosis of different diseases and in immunotherapy (White et al., 2001; Julve Parreño et al., 2018). In synthetic biology they are further used for finding specific proteins (Adams & Sidhu, 2014).
Classical approaches to finding distinct patterns in sequences for classifying antibodies by their antigen are inefficient in terms of time and resources, because antibodies have to be tested for binding in a laboratory, which requires different chemicals, antigens, incubation time, machines and sterility. Machine learning (ML) methods, especially convolutional neural networks (CNN) and long short-term memory networks (LSTM), provide a modern and efficient approach to classifying sequences (Sharif Razavian et al., 2014; Sønderby et al., 2015). The aim of this thesis is the analysis of different ML methods for antibody-specific antigen prediction, based solely on single antibody AA sequences.
1.1
Biological background
1.1.1
Antibodies
Antibodies are Y-shaped proteins produced by plasma cells that neutralize pathogens. They are built of two small light chains and two big heavy chains. The antibody recognizes a unique molecule of the antigen via the complementarity determining regions (CDRs), which are located at the tips of the Y. If the antibody binds its antigen counterpart, it alerts the immune system and the foreign or hostile cells and microbes can be neutralized. The variable (V) regions are located at the two arms of the Y. The V regions are important in the antigen binding process and vary between different antibody molecules. The constant (C) region is located at the stem of the Y. This region is far less variable and interacts with effector cells and molecules. All antibodies consist of paired heavy and light polypeptide chains. The general term for an antibody is immunoglobulin (Ig), which is divided into the classes IgM, IgD, IgG, IgA and IgE that can be distinguished by their constant region (figure 1.1).
Figure 1.1: Sketch of an antibody structure (Parker et al., 2018)
The two heavy chains of an antibody are linked to each other by disulfide bonds, and each heavy chain is linked to one light chain. The chains consist of repeats that are similar to each other and about 110AA long. The heavy chains consist of 4 repeats each and the light chains of 2 repeats. The light chains of one antibody are always identical. Together with the heavy chains, they form two antigen-binding sites. The amino-terminal sequences of both the heavy and light chains vary greatly between different antibodies (Charles A Janeway et al., 2001).
1.1.2
Antigen binding activity
Three segments have been identified as being involved in the antigen binding activity. Two of them are identical, so-called fragment antigen binding (Fab) fragments. Fab fragments contain the complete light chains, paired with parts of the heavy chains. The other fragment has no antigen-binding activity and is called fragment
in the CDR3. This means the CDR3 is the most variable region and plays the main role in the antigen-antibody binding process (Xu & Davis, 2000). Antibodies are specific and can only bind one antigen. They can differentiate between proteins that differ in only one AA and even between different isomers of a molecule. The corresponding antigen is a small molecule. Since one pathogen produces many different antigens, there is a diverse spectrum of antibodies for each pathogen in humans.
1.1.3
VDJ recombination
A very important part of antibody development is the variable, diversity and joining (VDJ) recombination, which takes place during the differentiation process of antibody development. The VDJ recombinase creates a single strand nick on one of the VDJ gene segments. This nick leads to a recombination process in which parts of the gene segments are cut out. The first recombination occurs in the heavy chain between the D and J segments. In the next step parts of the D and J segments are removed and the remaining segments are joined again. The next recombination occurs at the V segment. Parts of the V and D segments are removed and the remaining segments are again joined. This biological process is mostly random and can therefore result in a large variety of different specialized antigen recognition sites in antibodies (figure 1.2) (Charles A Janeway et al., 2001).
1.1.4
Antigen recognition
In humans, if a cell is infected with a pathogen, the human leukocyte antigen (HLA) presents antigens on the surface of the cell. These antigens are mostly small peptides (with a length of approximately 9AA) that come from the digested pathogen inside the cell. If an antibody binds to this antigen and therefore identifies it as foreign and hostile, the immune system is activated. Since the peptides are mostly only 9AA in length, there can be many different antigens originating from one pathogen. Which antigen the antibodies recognize depends on the different antibodies in a human. Because the process of making antibodies has a chance component, the antibodies vary in different individuals (Davis, 2014).
1.2
Machine learning
Machine learning describes how algorithms can accomplish tasks without explicit instructions by learning distinct patterns or distributions from data. There are two main kinds of machine learning, which differ in the prior knowledge about the data. In supervised machine learning, the model takes a dataset as input and learns to assign a label to each sample by comparing the predicted label with the real label. In the unsupervised approach, the model learns patterns or the distribution of the data without any labels. In this thesis only the cluster analysis is an unsupervised approach; it gave an overview of the structure of the dataset. The main task was to classify sequences, which is a supervised approach. The following subsections describe the different machine learning approaches that were used in the prediction experiments (Goodfellow et al., 2016).
1.2.1
Cluster analysis
Clustering is the basic ability of every animal to distinguish itself from everything else. In biology different creatures are divided into taxa and species. Dividing data into groups is necessary for understanding the data and for finding specific data points more easily. Cluster analysis applies this idea as an unsupervised ML approach by grouping similar data points into clusters without knowing their true class. There are different clustering methods. In this thesis agglomerative clustering, which is a hierarchical clustering method, was used. In this algorithm each data point is defined as a single cluster in the beginning. The clusters are then merged step by step, according to a linkage criterion based on the distance between the clusters, until the desired number of clusters is reached. A visual presentation of the clusters as a plot can give insight into the distribution and complexity of the classes (Davidson & Ravi, 2005).
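The merging procedure described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the Hamming distance, the average linkage criterion and the toy sequences are assumptions chosen only to keep the example small.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def agglomerative(points, n_clusters, distance=hamming):
    """Merge clusters bottom-up with average linkage until n_clusters remain."""
    # Every data point starts as its own cluster (stored as a list of indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between the members.
        total = sum(distance(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b always holds, so index a stays valid
    return [sorted(c) for c in clusters]


# Invented toy sequences, not taken from either dataset.
seqs = ["AAAA", "AAAC", "GGGG", "GGGT", "AACA"]
result = agglomerative(seqs, n_clusters=2)
```

With these toy sequences the A-rich and G-rich sequences end up in separate clusters. Other linkage criteria (single, complete, Ward) change only the `linkage` function.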
1.2.2
K-nearest neighbors algorithm
K-nearest neighbors algorithm (kNN) is a supervised ML method which is used for classification tasks like pattern recognition. The kNN classifier assumes that similar samples exist in close proximity. Given a data set (e.g. the training sets), the kNN will classify new data points according to their neighbors in the given data set, where
classifier will overfit and if it is too big it will over-simplify the model (figure 1.3).
Figure 1.3: k-nearest neighbors classification with various numbers of k. The colored dots show the data points, where the color defines the label and the background color shows how the kNN model would classify a dot in that area (Bishop, 2006)
The kNN classifier was used as a baseline for the complexity of the classification task. This ML method is also fast and easy to implement, especially for a multi-class problem. Further, kNN does not assume any distribution of the data and is therefore suitable for any kind of data. In addition, kNN has only a limited selection of adjustable hyperparameters, which decreases the training duration. However, kNN also has some disadvantages, such as a long computation time on big datasets. Further, unbalanced datasets and outliers are not classified accurately by kNN (Bishop, 2006).
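As an illustration of the neighbor-based classification described in this section, the following sketch classifies a sequence by majority vote among its k nearest training sequences. The Hamming distance, the value of k and the toy sequences are assumptions made for the example; the class names are borrowed from the real dataset only as labels.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3, distance=hamming):
    """Classify query by majority vote among its k nearest training samples.

    train is a list of (sequence, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Invented toy sequences; only the label names come from the thesis.
train = [("CARDY", "celiac"), ("CARDF", "celiac"), ("CAKDY", "celiac"),
         ("GGSTW", "HIV"), ("GGSTY", "HIV"), ("GGATW", "HIV")]
pred_a = knn_predict(train, "CARDW", k=3)
pred_b = knn_predict(train, "GGSSW", k=3)
```

Note that every prediction sorts the whole training set, which reflects the long computation time on big datasets mentioned above.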
1.2.3
Support vector machines
Support vector machines (SVM) are a supervised ML technique that aims to find a discriminant function which correctly classifies new samples with unknown labels. In an SVM the samples are represented as points in a space, in which the SVM places a linear classification border between the positive and negative class samples. This representation is calculated with a kernel. The margin around the border is maximized, so the border represents the best linear classifier for the task (figure 1.4). New samples are also represented as points and are classified according to the side of the border on which they fall. Even if the dataset is not linearly separable in its original space, it can become linearly separable after mapping into a suitable Hilbert space, so that a linear classifier can exist there (Halmos, 2017). The SVM classifier calculation consists of two parts, the primal and the dual problem. The primal problem is concerned with maximizing the margin and the dual problem with minimizing the loss. In the following equation:
\[
\text{minimize} \quad \left[ \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i (w \cdot x_i - b)\bigr) \right] + \lambda \lVert w \rVert^2, \tag{1.1}
\]
$n$ is the number of samples, $y_i$ is the $i$-th label, $w$ are the weights, $x_i$ is the $i$-th sample, $b$ is the bias and $\lambda$ determines the trade-off between increasing the margin size and correct classification.
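To make the objective in equation (1.1) concrete, the following sketch evaluates it for a given weight vector and bias. It only computes the value of the objective and does not perform the minimization; the sample values are invented.

```python
def svm_objective(w, b, X, y, lam):
    """Evaluate equation (1.1): mean hinge loss plus lam * ||w||^2."""
    n = len(X)
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        # Decision value w . x_i - b for one sample.
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - b
        # Samples on the correct side with margin >= 1 contribute nothing.
        hinge += max(0.0, 1.0 - y_i * score)
    return hinge / n + lam * sum(w_j * w_j for w_j in w)


# Invented toy samples with labels +1 / -1.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, 1]
value = svm_objective(w=[1.0, -1.0], b=0.0, X=X, y=y, lam=0.1)
```

Only the third sample lies inside the margin and contributes to the hinge term. A larger `lam` puts more weight on a small norm of `w`, which corresponds to a wider margin.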
Figure 1.4: SVM classifier/margin with less-than-ideal (a) and optimized class sep-aration (b) (Lee & Verri, 2003)
SVMs are a discriminant ML approach with specific advantages. In comparison to other ML techniques, they need less computational power and a smaller number of training samples. Further, SVMs are able to provide a unique solution and are robust to bias. On the downside, this ML method is not ideal for multi-class problems and unbalanced datasets. For multi-class classification with n classes, the SVM separates every class from all the others, which leads to n different margins. SVMs are not capable of processing text structures directly and therefore lose sequential information. Regarding performance, SVMs can be very slow in the classification of big datasets, especially if the dataset is not linearly separable. In this thesis the SVM was only used as a baseline for the complexity of the classification task. There are various options for the kernel, whereby the Gaussian kernel is best suited for sequence classification, since it can only take values from [0, 1] and maps onto a hypersphere of radius 1 (Shashua, 2009). For the Gaussian radial basis function (RBF) the Vapnik-Chervonenkis (VC) dimension and the Hilbert space dimension are both infinite. The choice of σ is crucial to avoid under- or overfitting. In the next equation $x$ and $y$ are two input samples and $\gamma$ is a hyperparameter to be selected (often $\gamma = \frac{1}{2\sigma^2}$):
\[
k(x, y) = \exp(-\gamma \lVert x - y \rVert^2). \tag{1.2}
\]
As another type of kernel for SVMs, linear kernels are mostly used for datasets with a large number of features (e.g. text classification), where the data can be separated with a single line. In this case the linear kernel has advantages in performance and optimization effort. Only changes to the free optimization parameter c are required for the linear kernel, where c ≥ 0:
the degree $d$ should not be too high, at best only 2 (Chang et al., 2010):
\[
k(x, y) = (x^T y + c)^d. \tag{1.4}
\]
Another kernel is the sigmoid kernel, which is defined by the following equation:
\[
k(x, y) = \tanh(x^T y + c). \tag{1.5}
\]
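The kernel functions of equations (1.2), (1.4) and (1.5) can be written down directly. The sketch below assumes that both arguments are numeric feature vectors of equal length; the parameter values are arbitrary and chosen only so that the results are easy to check by hand.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel, equation (1.2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(x, y, c, d):
    """Polynomial kernel of degree d, equation (1.4)."""
    return (dot(x, y) + c) ** d


def sigmoid_kernel(x, y, c):
    """Sigmoid kernel, equation (1.5)."""
    return math.tanh(dot(x, y) + c)


x, y = [1.0, 0.0], [0.0, 1.0]
k_rbf = rbf_kernel(x, y, gamma=0.5)            # squared distance is 2
k_poly = polynomial_kernel(x, y, c=1.0, d=2)   # inner product is 0
k_sig = sigmoid_kernel(x, y, c=0.0)
```

The RBF kernel equals 1 for identical inputs and decays towards 0 with growing distance, which is the bounded value range mentioned above.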
1.2.4
Artificial neural network
An artificial neural network (ANN), in computer science mostly called a (vanilla) neural network (NN), uses ML to accomplish a classification task. The simplest NN is the feed forward NN. It has no recurrent connections and is fully connected, which means every unit of the $i$-th layer is connected to every unit of the $(i+1)$-th and $(i-1)$-th layer. The layers following the input are called hidden layers.
An output layer follows the last hidden layer. It consists of $n$ units, where $n$ is the number of labels/classes in the dataset (figure 1.5).
Figure 1.5: Feed forward neural network (Davim, 2011): Circles are units (neurons), the arrows show the connections between the different units
Forward propagation
An ML algorithm is able to learn from given data. In this learning process, called training, the input data $x$ is multiplied with a weight $w_0$, added to a bias $b_0$ and then fed into the network via an activation function. This is shown as an arrow/connection in figure 1.5,
\[
z = w_0 \cdot x + b_0, \tag{1.6}
\]
\[
h = f(z), \tag{1.7}
\]
where $f(z)$ represents the activation function (for different activation functions see section 2.2.3). This process is applied to every unit (neuron) in the first layer of the network. Each unit has its own distinct weight tensor. The next layer takes the output of the first layer as input. This process, in which the hidden layers accept input data, process it through the activation function and pass it on to the successive layers, is called the forward propagation or forward pass of the network:
\[
z = w_1 \cdot h + b_1, \tag{1.8}
\]
\[
o = g(z). \tag{1.9}
\]
Here $o$ is the output of the network and $g$ the activation function of the output layer.
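The forward pass of equations (1.6) to (1.9) can be sketched for a network with one hidden layer. The choice of ReLU for f and sigmoid for g, as well as all weights, are assumptions made for this example.

```python
import math


def relu(z):
    return [max(0.0, v) for v in z]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def affine(W, x, b):
    """Compute W . x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def forward(x, W0, b0, W1, b1, f=relu, g=sigmoid):
    h = f(affine(W0, x, b0))   # equations (1.6) and (1.7)
    o = g(affine(W1, h, b1))   # equations (1.8) and (1.9)
    return h, o


# Arbitrary example weights: 2 inputs, 2 hidden units, 1 output unit.
W0 = [[1.0, -1.0], [0.5, 0.5]]
b0 = [0.0, -1.0]
W1 = [[1.0, 1.0]]
b1 = [-1.0]
h, o = forward([2.0, 1.0], W0, b0, W1, b1)
```

Each row of a weight matrix holds the weights of one unit, so the number of rows equals the number of units in that layer.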
Backpropagation
After the forward pass, the predicted label is compared to the real label. The loss $L(y, y_p)$ is computed, where $y$ is the real label and $y_p$ is the predicted label. Based on this result the backward pass, or backpropagation, determines how the weights of the network should be adjusted. For this purpose, the derivative of the loss function with respect to the weights is calculated:
\[
\frac{\partial L(y, y_p)}{\partial W_o}. \tag{1.10}
\]
In this equation $W_o$ is the weight matrix between the last hidden and the output layer. This derivative is calculated for every layer in the network.
Gradient descent
So far the backpropagation only calculates the derivative. To improve the network, the weights are adjusted by subtracting the derivative times the learning rate according to this formula:
\[
W^i_{new} = W^i_{old} - \eta \nabla W^i, \tag{1.11}
\]
where $\eta$ is the learning rate, $W^i_{old}$ is the old weight matrix, $W^i_{new}$ is the new weight matrix and $\nabla W^i$ is the gradient of the loss with respect to the weight matrix of the $i$-th layer. This process is called gradient descent.
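A small worked example may help to see the weight update in action. The sketch below uses a single linear unit with a squared loss, which is an assumption made to keep the gradient simple. It follows the usual descent convention of stepping against the gradient, so that the loss decreases.

```python
def squared_loss(w, x, y):
    """Loss L = (w . x - y)^2 of a single linear unit."""
    pred = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return (pred - y) ** 2


def gradient(w, x, y):
    """Derivative of the squared loss with respect to each weight."""
    pred = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return [2.0 * (pred - y) * x_j for x_j in x]


def gradient_descent_step(w, grad, eta):
    """Move the weights against the gradient, scaled by the learning rate."""
    return [w_j - eta * g_j for w_j, g_j in zip(w, grad)]


w = [0.0, 0.0]
x, y = [1.0, 2.0], 3.0   # one invented training sample
losses = []
for _ in range(20):
    losses.append(squared_loss(w, x, y))
    w = gradient_descent_step(w, gradient(w, x, y), eta=0.05)
final_loss = squared_loss(w, x, y)
```

With this learning rate the prediction error is halved in every step. A much larger learning rate would overshoot the minimum, and a much smaller one would need many more steps.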
Training
In every training step there is a forward and backward pass. The network is trained for an arbitrary number of training steps until the classification results are sufficient. Therefore different hyperparameters, like the number of units and layers, can be adjusted for the classification task.
1.2.5
Convolutional neural network
Convolutional neural networks (CNN) are neural networks that use convolution instead of a general matrix multiplication. They are well known for their use in image recognition and have been applied successfully to sequence classification (Kagaya et al., 2014). Instead of mapping every sequence element to one output, as in fully-connected models, CNNs utilize a parametrized kernel of fixed size that is convolved over the sequence. Since the input sequences from the dataset consist of AA in the one-letter code, they must first be numerically encoded. Each dimension of this encoding that stands for a specific AA is called a channel in the CNN. The channel encodes the presence or absence of an AA. In this setting, each kernel maps all channels of the covered sequence elements to one output in the next layer (figure 1.6).
Figure 1.6: Convolutional neural network for sequences: input features with a ker-nel/window of size 3 are extracted and multiplied with the weights
As every output is computed using the same kernel parameters, all output elements share the weights (weight sharing) (LeCun et al., 1998). The convolutional layer is followed by a pooling layer, which uses max pooling in this work, meaning that only the maximum value within each pooling window is taken for further computation. After the pooling layer the data is passed through the ReLU function, which results in only non-negative values. In the subsequent fully connected layer the input is reduced to a tensor with one element per class. The sigmoid then squashes these values between 0 and 1 to create the output.
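The encoding, convolution and pooling steps described in this section can be sketched as follows. The example assumes a one-hot encoding over the 20 standard AA, a single kernel of size 3 and max pooling over the whole sequence. The kernel is set by hand to respond to one hypothetical motif, whereas a trained CNN would learn its kernel weights.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode each AA as 20 channels marking presence (1) or absence (0)."""
    return [[1.0 if aa == ch else 0.0 for ch in AMINO_ACIDS] for aa in sequence]


def conv1d(encoded, kernel, bias=0.0):
    """Slide one kernel of shape (kernel_size, channels) over the sequence."""
    k = len(kernel)
    outputs = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        total = bias
        # The same kernel weights are reused at every position (weight sharing).
        for pos in range(k):
            for ch in range(len(kernel[pos])):
                total += kernel[pos][ch] * window[pos][ch]
        outputs.append(total)
    return outputs


def global_max_pool(values):
    """Keep only the strongest kernel response along the sequence."""
    return max(values)


def relu(v):
    return max(0.0, v)


def motif_kernel(motif):
    """Hypothetical hand-set kernel that responds to one specific motif."""
    return one_hot(motif)


kernel = motif_kernel("DYA")
with_signal = relu(global_max_pool(conv1d(one_hot("GSDYAGS"), kernel)))
without_signal = relu(global_max_pool(conv1d(one_hot("GSGSGSG"), kernel)))
```

Because of the pooling step the response does not depend on where in the sequence the motif occurs.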
1.2.6
Recurrent neural network
Like the CNN, the recurrent neural network (RNN) is a special type of NN. As in CNNs, weight sharing makes RNNs especially suitable for sequence data. The RNN has recurrent (closed-loop) connections, which, unlike in a feed forward NN, allow an internal state to be used when processing sequences as input and thus provide a memory function. While in vanilla NNs the hidden units only process the current input, the hidden units in RNNs combine the current input with the state from the previous time step. Another difference is the forward pass, which changes only slightly in the RNN: a new weight matrix for the recurrent connections, multiplied by the previous state, is added to the classical forward pass:
\[
z_t = W \cdot x_t + R \cdot h_{t-1}, \tag{1.12}
\]
\[
h_t = f(z_t). \tag{1.13}
\]
In this equation $W$ is the weight matrix, $x_t$ the input of the $t$-th time step, $R$ is the weight matrix of the recurrent connections and $h_t$ the activated value of the $t$-th time step.
Figure 1.7: Recurrent neural network (Goodfellow et al., 2016)
In the backpropagation, the gradient is computed for every time step. If there are many time steps the gradient can vanish over time, and therefore the information from the beginning is not contributing to the updates, independent of the importance of the data. This can also be an issue in very deep vanilla NNs.
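The recurrence of equations (1.12) and (1.13) can be sketched in a few lines. The tanh activation and the one-dimensional weights are assumptions made for this example. The sketch only shows the forward pass: a signal is presented at the first time step and its trace in the hidden state fades over the following steps. Repeated multiplication by the recurrent weights and the activation derivative is also what shrinks the gradient during backpropagation.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(inputs, W, R, h0, f=math.tanh):
    """Apply z_t = W x_t + R h_{t-1} and h_t = f(z_t) over all time steps."""
    h = h0
    states = []
    for x_t in inputs:
        z = [a + b for a, b in zip(matvec(W, x_t), matvec(R, h))]
        h = [f(v) for v in z]
        states.append(h)
    return states


# Arbitrary one-dimensional weights; the same W and R are used at every step.
W = [[1.0]]
R = [[0.5]]
states = rnn_forward([[1.0], [0.0], [0.0]], W, R, h0=[0.0])
```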
1.2.7
Long short-term memory networks
Long short-term memory networks (LSTM) are advanced RNNs. The algorithm was developed by Hochreiter and Schmidhuber in 1997 (Hochreiter & Schmidhuber, 1997). LSTMs have memory blocks with different gates that interact via the constant error carousel (CEC) (figure 1.8). This property solves the vanishing gradient problem, which is a big issue in RNNs (Hochreiter, 1998). Additionally, they not only have the ability to remember a hidden state like RNNs, but also have the ability to forget. A standard LSTM has a cell input, input gate ($i_t$), output gate ($o_t$), forget gate ($f_t$), cell state ($c_t$) and cell output ($h_t$). These variables are additional to the classical RNN variables: the weight matrix ($W$), the input of the $t$-th time step ($x_t$), the weight matrix of the recurrent connections ($R$), the activated value of the $(t-1)$-th time step ($h_{t-1}$), which equals the output value of that time step, and the bias ($b$). The cell input gets activated with an activation function. The input gate then decides the importance of the new input by multiplying it with a number between 0 (not important) and 1 (important):
\[
i_t = \sigma_g(W_i x_t + R_i h_{t-1} + b_i). \tag{1.14}
\]
Afterwards the result enters the CEC, which updates the cell state. If the LSTM cell contains a forget gate, its value is applied within the CEC and is therefore multiplied with the previous cell state:
\[
f_t = \sigma_g(W_f x_t + R_f h_{t-1} + b_f), \tag{1.15}
\]
\[
c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + R_c h_{t-1} + b_c), \tag{1.16}
\]
where $\circ$ denotes element-wise multiplication. In the last step the cell state gets activated and multiplied with the value from the output gate, which is again a number between 0 and 1, to create the cell output:
\[
o_t = \sigma_g(W_o x_t + R_o h_{t-1} + b_o), \tag{1.17}
\]
\[
h_t = o_t \circ \sigma_h(c_t). \tag{1.18}
\]
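One step of the LSTM cell in equations (1.14) to (1.18) can be sketched with scalar values. The sketch assumes the common choice of a sigmoid for the gate activation and tanh for the cell input and cell output activations; all parameter values are arbitrary.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step following equations (1.14) to (1.18).

    p maps each of 'i', 'f', 'c', 'o' to a (W, R, b) tuple.
    """
    def pre(name):
        W, R, b = p[name]
        return W * x_t + R * h_prev + b

    i_t = sigmoid(pre("i"))                          # input gate (1.14)
    f_t = sigmoid(pre("f"))                          # forget gate (1.15)
    c_t = f_t * c_prev + i_t * math.tanh(pre("c"))   # cell state (1.16)
    o_t = sigmoid(pre("o"))                          # output gate (1.17)
    h_t = o_t * math.tanh(c_t)                       # cell output (1.18)
    return h_t, c_t


# Arbitrary parameters: all gates are half open (sigmoid(0) = 0.5).
params = {"i": (0.0, 0.0, 0.0), "f": (0.0, 0.0, 0.0),
          "c": (1.0, 0.0, 0.0), "o": (0.0, 0.0, 0.0)}
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=params)
```

In this example half of the previous cell state is kept and half of the activated new input is added. A forget gate close to 1 would carry the cell state almost unchanged to the next time step.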
Figure 1.8: LSTM Cell with input gate, forget gate and output gate (Source: taken with modification from LSTM: A Search Space Odyssey (2017), p. 2)
Chapter 2
Methodology
2.1
Antibody sequence dataset
The data of this project consists of two different datasets (table 2.1): first, a simulated dataset for setting up the networks and verifying the methods; second, a more complex real dataset built from sequenced amino acid (AA) antibody sequences with different classes. The real dataset was obtained from the abYsis database and the simulated dataset from the Greiff lab (abYsis, 2019; Greiff, 2019).
Property                       Simulated data   Real data
Length of sequences            15AA             23-280AA
Length of the CDR3 sequences   -                17-39AA
Number of samples              100,000          6,747
Number of classes              10               13
Distribution of classes        uniform          unbalanced
Species                        none             human (5,935), mouse (810)
Table 2.1: Length of the sequences, distribution of classes, number of samples, classes and species in the simulated and real dataset
2.1.1
Simulated dataset
The simulated dataset consists of antibody sequences of equal length, each of which carries a distinct signal pattern that determines the antibody class. The dataset has 100,000 different AA sequences with a length of 15AA per sequence. Each sequence has a signal with a length of 3AA. This signal corresponds to the class. In the simulated dataset the sequences are given both with and without signal. Therefore, the differences between the sequences with and without signal and the relative frequency of the different AA in the dataset can be shown (figure 2.1). If an AA is
Figure 2.1: Frequency of different AA in the simulated dataset. ASinj: Antibody sequence with injected signal, AS: Antibody sequence without signal
2.1.2
Real dataset
The real dataset consists of data obtained by sequencing antibodies from human (n=5,935) and mouse (n=810). It differs greatly from the simulated dataset because the length of the sequences varies between 23AA and 280AA, compared to the fixed length of 15AA in the simulated dataset (figure 2.2). The real dataset has many different features concerning the complementarity determining region (CDR), framework region (FR) and variable, diversity and joining region (VDJ). In this thesis only the feature describing the antigen-antibody binding is relevant and is therefore used as the label.
Figure 2.2: Amino acid sequence length (23-280) distribution of the real antibody sequences (n=6745) in a violin plot
The real dataset has a different AA composition in comparison to the simulated dataset with injected signal. The most common AA in the simulated dataset, A, D and Y, are not very common in the real dataset. The most frequent AA in the real dataset are glycine (G) and serine (S) (figure 2.3).
Figure 2.3: Frequency of different AA in the simulated and real dataset: sim = simulated antibody sequence with injected signal, real = real antibody sequences
Only the human antibody sequences are relevant for this thesis; therefore the mouse sequences were removed to avoid a species bias. Since they play the main role in the antibody-antigen binding process, only a part of the AA sequences, the human complementarity determining region 3 (CDR3) sequences, was used for training and testing of the machine learning (ML) methods. The restriction to CDR3 sequences helps against overfitting, since the region is very short (the longest sequence is 39AA) and is very specific for the antigen-antibody binding process (figure 2.4).
Figure 2.4: Amino acid sequence length distribution of the CDR3 sequences (5-39AA) from the human antibody sequences (n=5,935) in a violin plot
The total number of antigen classes in the real dataset is 13, with a total of 5,935 human samples and sequence lengths between 23AA and 280AA (table 2.2). HIV (2,543), celiac (1,452) and tetanus (818) have the greatest number of samples. The antigen classes influenza (287), Rh (217), auto (187), rabies (141), vaccinia (88), rheumatoid (57), psoriasis (43), various (41), rotavirus (33) and meningococcus (28) provide smaller sample sizes. In comparison to the distribution of sample sizes, the sequence length is distributed more equally across the antigen classes. Although three antigen classes contain comparatively short sequences (minimum length of 42 for rheumatoid, 30 for HIV and 23 for auto) and two classes contain comparatively long sequences (maximum length of 280 for tetanus and 250 for various), there is little variation in the mean sequence length, with the highest mean for various (160.37) and the lowest for psoriasis (104.84).
Sequence class   Number of samples   Length of the sequences (min-max)   Mean of the sequence lengths
Auto             187                 23-138                              115.41
Celiac           1,452               82-128                              115.03
HIV              2,543               30-147                              128.68
Influenza        287                 96-137                              121.28
Meningococcus    28                  115-127                             120.93
Psoriasis        43                  96-119                              104.84
Rabies           141                 112-133                             120.55
Rh               217                 118-137                             125.32
Rheumatoid       57                  42-133                              115.54
Rotavirus        33                  119-139                             125.12
Tetanus          818                 101-280                             116.24
Vaccinia         88                  116-132                             123.08
Various          41                  129-250                             160.37
Table 2.2: Number of samples, minimum, maximum and mean sequence-length values of all human sequence classes in the real dataset
2.2
Experimental structure
Five different approaches are evaluated in this thesis: cluster analysis, the k-nearest neighbors (kNN) algorithm, support vector machines (SVM), convolutional neural networks (CNN) and long-short term memory networks (LSTM). Different hyperparameters, such as the learning rate, batch size and kernel size, were adjusted during training of the models. Because of limits in the available training time and computational power, the hyperparameter search space was reduced for each of the two datasets after a preliminary hyperparameter search. The final hyperparameters for training the models are given in section 2.3. To achieve an almost unbiased estimate for the datasets, 5-fold cross validation (CV) was used with the four supervised ML approaches, where the datasets were randomized and divided into 60% training, 20% validation and 20% test data (Bishop, 2006). The dataset was also split with cluster CV, where first a cluster analysis with 50 clusters was performed. The 50 clusters were then randomly combined into 5 groups, each containing approximately the same amount of data. Each of these groups was then used as part of the CV, with three groups as the training set, one as the validation set and one as the test set. A 5-fold CV was carried out again with these sets. The best model was selected on the validation set and then verified on the test set. For the real dataset, only the HIV class with 2,543 samples and the celiac class with 1,452 samples were considered as prediction targets. The other eleven classes were used only for training the model, since their number of samples per class was too low for a reliable prediction. All hyperparameters of the different ML methods were optimized. To visualize the results, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were calculated and displayed. The balanced accuracy (BACC) was computed for every class in the simulated and real dataset, since it was considered the most suitable measure for the unbalanced classes.
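The cluster CV split described above can be sketched as follows. This is a minimal illustration and not the code used in the thesis: the function names, the greedy balancing rule and the rotation of validation and test groups are assumptions made for the example.

```python
import random

def group_clusters(cluster_sizes, n_groups=5, seed=0):
    """Randomly combine clusters into n_groups groups of similar total size.

    cluster_sizes maps a cluster id to its number of samples.
    The greedy balancing below is an assumption; the thesis only states
    that the groups contain approximately the same amount of data.
    """
    rng = random.Random(seed)
    cluster_ids = list(cluster_sizes)
    rng.shuffle(cluster_ids)                  # random order of the clusters
    groups = [[] for _ in range(n_groups)]
    totals = [0] * n_groups
    for cid in cluster_ids:
        g = totals.index(min(totals))         # currently smallest group
        groups[g].append(cid)
        totals[g] += cluster_sizes[cid]
    return groups, totals

def cluster_cv_folds(n_groups=5):
    """Yield (train_groups, validation_group, test_group) for every fold."""
    for fold in range(n_groups):
        test = fold
        val = (fold + 1) % n_groups           # assumed rotation scheme
        train = [g for g in range(n_groups) if g not in (test, val)]
        yield train, val, test
```

With 50 clusters and 5 groups, every fold trains on three groups and uses one group each for validation and testing, so that samples from the same cluster never appear in both the training and the test set.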
2.2.1
Performance measure
The ROC curve was used to allow a high-quality comparison between the ML approaches. For this purpose the true positive rate (TPR), the false positive rate (FPR) and the precision were calculated. The ROC curve shows the TPR versus the FPR. In the following equations, TP denotes the true positive, FN the false negative, FP the false positive and TN the true negative predicted samples (Fawcett, 2006):
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \tag{2.1}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \tag{2.2}$$
$$\mathrm{precision} = \frac{TP}{TP + FP}. \tag{2.3}$$
The AUC of the ROC curve is used as a performance measure. An AUC of 1 means that the model performed perfectly, while an AUC of 0.5 corresponds to a random prediction. Further, the accuracy (ACC) and the BACC were computed. In the following equations, these values are calculated from TP, TN, the overall number of negative samples (N) and the overall number of positive samples (P) (Fawcett, 2006):
$$\mathrm{ACC} = \frac{TP + TN}{P + N}, \tag{2.4}$$
$$\mathrm{BACC} = \frac{\frac{TP}{P} + \frac{TN}{N}}{2}. \tag{2.5}$$
Due to the unbalanced data, the BACC was the superior measure for the real dataset. To enable a better comparison, the BACC was also calculated for the simulated dataset. Finally, the F1 score was calculated. The F1 score is the harmonic mean of precision and TPR, where 1 is the best and 0 the worst value (Fawcett, 2006):
$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{TPR}}{\mathrm{precision} + \mathrm{TPR}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}. \tag{2.6}$$
In addition to these measures, the mean $\bar{x}$ and the standard deviation $s$ were calculated, where $N$ here denotes the number of values:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{2.7}$$
$$s = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2}. \tag{2.8}$$
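As a worked illustration of equations 2.1 to 2.8, the following sketch computes the measures from the four confusion matrix counts. The function names are chosen for this example only, and it is assumed that all denominators are non-zero.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the measures of equations 2.1-2.6 from confusion counts."""
    p = tp + fn                       # overall number of positive samples
    n = tn + fp                       # overall number of negative samples
    return {
        "tpr": tp / p,
        "fpr": fp / n,
        "precision": tp / (tp + fp),
        "acc": (tp + tn) / (p + n),
        "bacc": (tp / p + tn / n) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def mean_and_std(values):
    """Mean and sample standard deviation (equations 2.7 and 2.8)."""
    count = len(values)
    mean = sum(values) / count
    variance = sum((x - mean) ** 2 for x in values) / (count - 1)
    return mean, math.sqrt(variance)
```

For an unbalanced example with 10 positive and 90 negative samples (TP = 8, FN = 2, FP = 10, TN = 80), the ACC is 0.88 while the BACC is about 0.84, which shows how the BACC weights both classes equally.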
2.2.2
Update scheme
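For the CNN and LSTM networks the Adam update rule was applied (Kingma & Ba, 2014):
$$m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}, \tag{2.9}$$
$$v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) \left(\nabla_w L^{(t)}\right)^2, \tag{2.10}$$
$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - (\beta_1)^{t+1}}, \tag{2.11}$$
$$\hat{v}_w = \frac{v_w^{(t+1)}}{1 - (\beta_2)^{t+1}}, \tag{2.12}$$
$$w^{(t+1)} \leftarrow w^{(t)} - \eta \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}. \tag{2.13}$$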
Here $w^{(t)}$ is the weight matrix and $L^{(t)}$ the loss function, where $t$ indicates the current training iteration (indexed from 0). Further, $\eta$ is the learning rate, $\epsilon$ is a small scalar used to prevent division by 0, and $\beta_1$ and $\beta_2$ are the forgetting factors for the gradients and the second moments of the gradients, respectively. As an additional update scheme, learning rate decay was applied to support the training of the network and to avoid oscillations. In this approach, the learning rate is reduced by a fixed amount after every update. The decay depends on the maximum number of updates and the original learning rate: the learning rate decays linearly over all updates to a final learning rate. In the following equation, $lr_n$ is the present learning rate, $lr_o$ the original learning rate, $lr_f$ the final learning rate, $u_n$ the $n$th update and $u_m$ the maximum update:
$$lr_n = lr_o \cdot \left(1 - \frac{u_n}{u_m}\right) + lr_f \cdot \frac{u_n}{u_m}. \tag{2.14}$$
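The update scheme of equations 2.9 to 2.14 can be illustrated for a single scalar weight. This is a sketch under stated assumptions: the default values of beta1, beta2 and eps are the ones commonly used for Adam and are not taken from the thesis, and the function names are chosen for this example.

```python
import math

def linear_lr(lr_original, lr_final, update, max_update):
    """Linearly decayed learning rate (equation 2.14)."""
    frac = update / max_update
    return lr_original * (1 - frac) + lr_final * frac

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight (equations 2.9-2.13).

    t is the current iteration, indexed from 0.
    beta1, beta2 and eps are assumed default values.
    """
    m = beta1 * m + (1 - beta1) * grad            # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** (t + 1))            # bias correction
    v_hat = v / (1 - beta2 ** (t + 1))
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

In a training loop, `linear_lr` would be evaluated once per update and its result passed as `lr` to `adam_step`, so that the step size shrinks from the original to the final learning rate over the maximum number of updates.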
2.2.3
Activation functions
Activation functions enable the NN to adapt to data with high dispersion. Linear functions do not add the required complexity, so NNs mostly use non-linear functions. The sigmoid function squashes the input values into the range between 0 and 1. As a result, the function becomes flat for inputs of large magnitude and insensitive to small changes in the input (figure 2.5) (Goodfellow et al., 2016). This function is popular because it is easy to understand and apply, but it converges slowly and, because it is not zero-centered, is harder to optimize than other activation functions:
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{2.15}$$
Figure 2.5: Sigmoid function (Goodfellow et al., 2016)
The rectified linear unit (ReLU) function has been shown to enable better training of deeper networks (Glorot et al., 2011). The function sets every value below zero to zero (figure 2.6) (Nair & Hinton, 2010). The ReLU function is popular because it avoids the vanishing gradient problem (Agarap, 2018):
$$g(z) = z^{+} = \max(0, z). \tag{2.16}$$
Figure 2.6: ReLU function (Goodfellow et al., 2016)
The hyperbolic tangent (tanh) function squashes the given values into the range between -1 and 1 (figure 2.7). This function is zero-centered and hence easier to optimize:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{2.17}$$
Figure 2.7: TanH function (Source: taken with modification from Goodfellow et al. (2016), p. 69)
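The three activation functions discussed in this section can be written in a few lines. The following sketch uses only the Python standard library; the numerically stable form of the sigmoid is a choice made for this example.

```python
import math

def sigmoid(x):
    """Sigmoid function, maps x into the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)                   # avoids overflow for very negative x
    return z / (1.0 + z)

def relu(z):
    """Rectified linear unit, sets every value below zero to zero."""
    return max(0.0, z)

def tanh(x):
    """Hyperbolic tangent, maps x into the interval (-1, 1)."""
    return math.tanh(x)
```

Evaluating `sigmoid(9)` and `sigmoid(10)` gives values that differ by less than 0.0001, which illustrates the flat, insensitive region of the sigmoid function for large inputs.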
2.2.4
Loss
The binary cross-entropy (BCE) function was used to calculate the loss. It was implemented with the PyTorch function BCEWithLogitsLoss (Ketkar, 2017). In the following BCE formula, $l_n$ is the loss of sample $n$, $N$ is the batch size (with $n \in \{1, \dots, N\}$), $w$ are the weights, $y$ are the labels and $x$ are the samples:
$$\ell(x, y) = L = \{l_1, \dots, l_N\}^{\top}, \tag{2.18}$$
$$l_n = -w_n \left[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log(1 - \sigma(x_n)) \right]. \tag{2.19}$$
For the weights $w$, the probability of the classes in the dataset was used to avoid overfitting to the largest class, where $k$ is the class index and $w_k \in w$:
$$w_k = 1 - p_k. \tag{2.20}$$
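Equations 2.19 and 2.20 can be illustrated without PyTorch. The following sketch is not the BCEWithLogitsLoss implementation itself; it evaluates the same per-sample formula in a numerically stable form, and the function names are chosen for this example.

```python
import math

def class_weights(labels):
    """Weight w_k = 1 - p_k for every class k (equation 2.20)."""
    total = len(labels)
    counts = {}
    for k in labels:
        counts[k] = counts.get(k, 0) + 1
    return {k: 1 - c / total for k, c in counts.items()}

def bce_with_logits(x, y, w=1.0):
    """Weighted BCE of one sample with logit x and label y (equation 2.19).

    Uses the identity
    -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        = max(x, 0) - x*y + log(1 + exp(-|x|)),
    which avoids taking the logarithm of values close to zero.
    """
    return w * (max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
```

A class that makes up 75% of the samples receives the weight 0.25, while a class with 25% of the samples receives 0.75, so errors on the smaller class contribute more to the loss.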
The loss can be regularized by L2 regularization. Here the parameters are squared, weighted and summed. This L2 penalty is then added to the loss function to avoid overfitting:
$$\arg\min_{w} \sum_{j=1}^{n} \left( t(x_j) - \sum_{i=1}^{n} w_i h_i(x_j) \right)^2 + \lambda \sum_{i=1}^{n} w_i^2. \tag{2.21}$$
2.2.5
Class weighting
One way of using class weighting is setting the weight of the wanted class to 1 in the weight matrix before training the network. All other weights are set to:
$$w_{\mathrm{class}} = \frac{1 - p_{\mathrm{class}}}{2}, \tag{2.22}$$
where $p_{\mathrm{class}}$ is the probability of the class in the training set. With this addition, the
2.3
Experimental setup
2.3.1
Hardware and software setup
The experiments were run on two different hardware configurations. The majority of the training was performed on a personal computer with an Intel(R) Core(TM) i7-6500U CPU at 2.50 GHz and 8 GB RAM. Because of the limited computing capacity, some experiments were performed on the servers of the Johannes Kepler University (JKU) Institute for Machine Learning with an Intel(R) Xeon(R) CPU X7560 @ 2.27 GHz and an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz. Before the experiments were started, both datasets were randomized and divided into 60% training, 20% validation and 20% test data to obtain an unbiased model. The datasets were provided as CSV files, read with the pandas CSV reader, one-hot-encoded and saved as an HDF file. For the CNN, positional information was added to the real dataset. The positional information has 3 features, each corresponding to an AA location: the first, second and third feature show whether the AA is at the beginning, middle or end of the sequence, respectively. Figure 2.8 shows the positional information in the last three rows of the table. For the first AA, the positional features have the value 1 0 0. The three features always sum up to 1.
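The encoding described above can be sketched as follows. Several details are assumptions made for this illustration and are not taken from the thesis: the order of the 20 amino acids, the linear interpolation used for the three positional features, and the layout with one row per sequence position (figure 2.8 shows the features as rows instead).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed order of the 20 amino acids

def one_hot_with_position(sequence):
    """Encode a sequence as rows of 20 one-hot and 3 positional features.

    The positional features (begin, middle, end) are interpolated linearly
    over the sequence (assumption), so that they always sum up to 1 and
    the first amino acid receives the values 1 0 0.
    """
    length = len(sequence)
    rows = []
    for i, aa in enumerate(sequence):
        one_hot = [1.0 if aa == a else 0.0 for a in AMINO_ACIDS]
        r = i / (length - 1) if length > 1 else 0.0   # relative position
        begin = max(0.0, 1.0 - 2.0 * r)
        end = max(0.0, 2.0 * r - 1.0)
        middle = 1.0 - begin - end
        rows.append(one_hot + [begin, middle, end])
    return rows
```

For the hypothetical sequence `"CARDY"`, the first row ends with 1 0 0, the third row with 0 1 0 and the last row with 0 0 1.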
Figure 2.8: One-hot-encoding with the simulated dataset and positional information
The difference in the sequence lengths of the real dataset was omitted by using
procedure. Python was used as the main programming language, since its libraries provide the majority of the required software packages and implementations for deep learning (Rossum, 1995). More specifically, NumPy, scikit-learn, scikit-plot, pandas and matplotlib were applied as standard libraries for calculations, statistics and ML (Oliphant, 2006; Virtanen et al., 2019; Hunter, 2007; McKinney, 2011). PyTorch was used to implement the LSTM, while the Widis-Lstm-Tools library (provided by the JKU Institute for Machine Learning), which is also based on PyTorch, provided an additional implementation for LSTM and CNN (table 2.3) (Ketkar, 2017; Widrich, 2019). The implemented source code can be accessed publicly at github.com/HelgaLudwig/Antibody classification
Software package Version Reference
Python 3.6 (Rossum, 1995)
PyTorch 1.0.1 (Ketkar, 2017)
NumPy 1.14.2 (Oliphant, 2006)
scikit-learn 0.19.1 (Virtanen et al., 2019)
scikit-plot 0.3.7 (Virtanen et al., 2019)
matplotlib 2.2.2 (Hunter, 2007)
widis-lstm-tools 0.4 (Widrich, 2019)
pandas 0.22.0 (McKinney, 2011)
Table 2.3: Software packages used for the implementation of the machine learning algorithms
2.3.2
Cluster analysis
The cluster analysis was carried out to get an overview of the distribution of the samples in the dataset. First, the data was embedded with t-distributed stochastic neighbor embedding (t-SNE), which represents all data points in 2D space (Maaten & Hinton, 2008). t-SNE creates a probability distribution over pairs of data points in which similar points are selected with a higher probability than dissimilar points. It then constructs a comparable probability distribution over the data points in 2D space. The distance between the data points was calculated with the Hamming distance, which is especially suitable for sequences (Robinson, 2008). This distance measure compares every element of a sequence with the element of another sequence at the same position. If the two elements are not identical, the Hamming distance is increased by one. In figure 2.9 a) there is only one difference between the sequences, which means that the Hamming distance is 1. In part b) there are 3 differences, which corresponds to a Hamming distance of 3.
Figure 2.9: Examples for Hamming distance: a) distance 1 b) distance 3
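The Hamming distance can be computed in a few lines. The sequences in the usage note below are hypothetical and are not the ones shown in figure 2.9.

```python
def hamming_distance(a, b):
    """Number of positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)
```

For example, `hamming_distance("CARDY", "CARDW")` gives 1, and `hamming_distance("CARDY", "CTRAW")` gives 3.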
The cluster analysis was performed with the AgglomerativeClustering method from sklearn. Four different linkage criteria, which determine the distance between the clusters, were used. $X$ ($Y$) are the clusters with elements $x$ ($y$). Ward minimizes (min) the variance of the clusters being merged:
$$d_{\min}(X, Y) = \min_{x \in X,\, y \in Y} \|x - y\|^2. \tag{2.23}$$
Average uses the average (AVG) of the distances of each observation of the two sets, where $n_X$ ($n_Y$) is the number of elements in $X$ ($Y$):
$$d_{\mathrm{avg}}(X, Y) = \frac{1}{n_X n_Y} \sum_{x \in X} \sum_{y \in Y} \|x - y\|. \tag{2.24}$$
Complete or maximum linkage uses the maximum (max) of the distances between all observations of the two sets:
$$d_{\max}(X, Y) = \max_{x \in X,\, y \in Y} \|x - y\|. \tag{2.25}$$
Finally, single uses the minimum of the distances between all observations of the two sets (Virtanen et al., 2019):
$$d_{\min}(X, Y) = \min_{x \in X,\, y \in Y} \|x - y\|. \tag{2.26}$$
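The average, complete and single linkage criteria can be illustrated with a small sketch that takes an arbitrary distance function. The Ward criterion is not covered here, and the Hamming distance is used only as an example distance; this is not the sklearn implementation.

```python
def hamming(a, b):
    """Hamming distance between two equally long sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)

def pairwise_distances(X, Y, dist):
    """All distances between elements of cluster X and cluster Y."""
    return [dist(x, y) for x in X for y in Y]

def single_linkage(X, Y, dist=hamming):
    """Minimum distance between observations of the two clusters."""
    return min(pairwise_distances(X, Y, dist))

def complete_linkage(X, Y, dist=hamming):
    """Maximum distance between observations of the two clusters."""
    return max(pairwise_distances(X, Y, dist))

def average_linkage(X, Y, dist=hamming):
    """Average distance between observations of the two clusters."""
    return sum(pairwise_distances(X, Y, dist)) / (len(X) * len(Y))
```

For the hypothetical clusters `["AAAA", "AAAT"]` and `["AATT", "TTTT"]`, the pairwise Hamming distances are 2, 4, 1 and 3, so the single, complete and average linkage distances are 1, 4 and 2.5.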
2.3.3
K -nearest neighbors
The kNN algorithm was used because it is very easy to implement and can give a good overview of the complexity of the classification task. The kNN implementation from sklearn was used. The package provides different search algorithms and metrics. The algorithm brute uses a brute-force search, while ball-tree and kd-tree are faster tree-based nearest neighbor search approaches. The Minkowski metric uses the Minkowski distance, which is calculated with the following formula:
$$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}. \tag{2.27}$$
The parameter $p$ can be adjusted, where $p = 2$ equals the Euclidean distance. Different numbers of neighbors were tried in the hyperparameter search (table 2.4).
Hyperparameter Values
Number of neighbors {1, 2, 3, 4, 5, 10, 20, 30}
Table 2.4: kNN hyperparameters used in the grid search procedure for the simulated and real dataset
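A brute-force kNN classifier with the Minkowski distance can be sketched in pure Python. This is an illustration of the principle and not the sklearn implementation; the function names and the handling of ties (the class encountered first among the nearest neighbors wins) are assumptions of this example.

```python
from collections import Counter

def minkowski(x, y, p=2):
    """Minkowski distance (equation 2.27); p=2 is the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def knn_predict(train_x, train_y, query, k=1, p=2):
    """Predict the label of query by majority vote of its k nearest neighbors."""
    # Brute-force search: compute the distance to every training sample.
    order = sorted(range(len(train_x)),
                   key=lambda i: minkowski(train_x[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

In practice the one-hot-encoded sequences would be flattened into vectors before being passed to such a function.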
2.3.4
Support vector machine
SVMs were used as a baseline for the complexity of the classification task. The SVM implementation from the package sklearn was used (Virtanen et al., 2019). Different kernels and different values for gamma and cost were combined as hyperparameters in several configurations to find the optimal experimental setup (table 2.5).
Hyperparameter Values
Kernel          {RBF, linear, poly, sigmoid}
Gamma           {0.1, 0.2, 0.3, 0.4, 0.5, 0.9}
Cost            {1, 2, 3, 4, 5, 10, 20, 50}
Table 2.5: SVM hyperparameters used in the grid search procedure for the simulated and real dataset. RBF: radial basis function, poly: polynomial
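The grid search over the values of table 2.5 can be sketched as an enumeration of all combinations. The scoring function is left abstract here, since in the experiments it would correspond to training a model and evaluating it on the validation set; the names `SVM_GRID`, `iter_grid` and `best_config` are chosen for this example only.

```python
from itertools import product

SVM_GRID = {
    "kernel": ["rbf", "linear", "poly", "sigmoid"],
    "gamma": [0.1, 0.2, 0.3, 0.4, 0.5, 0.9],
    "cost": [1, 2, 3, 4, 5, 10, 20, 50],
}

def iter_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def best_config(grid, score_fn):
    """Return the configuration with the highest score (first one on ties)."""
    return max(iter_grid(grid), key=score_fn)
```

The grid of table 2.5 contains 4 x 6 x 8 = 192 configurations per fold, which illustrates why the search space had to be reduced in the preliminary hyperparameter search.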
2.3.5
Convolutional neural network
Convolutional neural networks (CNN), which are known to successfully classify sequences, were used as an essential part of this work (Kagaya et al., 2014). The CNN takes an input sequence that is fed into a 1D convolutional layer. This is followed by a pooling layer with max-pooling over the sequence positions and a ReLU activation function. At the end there is a fully connected layer and a sigmoid output layer (figure 2.10). The same network was used for the simulated and the real dataset. The network had 1 layer, used linear learning rate decay and was trained for 500 updates.
Figure 2.10: Convolutional neural network
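The first part of this architecture (1D convolution, max-pooling over the sequence positions, ReLU) can be illustrated in pure Python. This sketch is not the PyTorch model used in the experiments; the stride of 1, the absence of padding and the function name are assumptions, and the fully connected and sigmoid layers are omitted.

```python
def conv1d_maxpool_relu(x, kernels, biases):
    """Apply 1D convolution, max-pooling over positions and ReLU.

    x       : list of sequence positions, each a list of input features
    kernels : list of filters, each a list of k rows of feature weights
    biases  : one bias per filter
    Returns one pooled feature per filter.
    """
    length = len(x)
    features = []
    for kernel, bias in zip(kernels, biases):
        k = len(kernel)
        responses = []
        for start in range(length - k + 1):       # stride 1, no padding
            s = bias
            for offset in range(k):
                s += sum(w * v for w, v in zip(kernel[offset], x[start + offset]))
            responses.append(s)
        pooled = max(responses)                   # max over sequence positions
        features.append(max(0.0, pooled))         # ReLU
    return features
```

Because the maximum is taken over all positions, a filter responds to its motif regardless of where the motif occurs in the sequence, while the kernel size limits the length of the motif that a single filter can detect.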
For the simulated dataset, the sequences with a length of 15AA were converted into a one-hot-encoded matrix of size 15x20, where 20 is the number of different AA that can occur in the sequence. In the real dataset, only the CDR3 sequences with a length of 17-39AA were used and one-hot-encoded into a matrix of size 39x20. Because of the small number of samples per class, only two networks were trained, one for each of the two classes with a sufficient number of samples (Raudys & Jain, 1991). Different learning rates, batch sizes, numbers of layers and numbers of neurons were tried out in the hyperparameter search with the grid search procedure (table 2.6). The learning rate varied from 0.001 to 1 and was applied with linear learning rate decay. The batch size was 32, 64, 128, 256, 512 and 1,024. The number of neurons had the same values as the batch size and additionally 16. For the kernel size, all possible values for the simulated dataset were tried out, namely 3, 5, 7, 9, 11 and 13, and the same values for the real dataset, but additionally 15, 17, 19 and 21. Higher and lower values were excluded early because of computational time or bad scores.
Hyperparameter      Values
Learning rate       {0.001, 0.01, 0.1, 1}
Batch size          {32, 64, 128, 256, 512, 1024}
Number of neurons   {16, 32, 64, 128, 256, 512, 1024}
Kernel size         {3, 5, 7, 9, 11, 13} (simulated dataset)
Kernel size         {3, 5, 7, 9, 11, 13, 15, 17, 19, 21} (real dataset)
Table 2.6: CNN hyperparameters used in the grid search procedure for the simulated and real dataset
2.3.6
Long-short term memory network
The long-short term memory (LSTM) network was used because it can find flexible patterns like insertions or deletions in sequences, which cannot be easily found by the CNN because of its static kernel size and position. Separate LSTM networks were trained for the two datasets, each with its own hyperparameter search. The LSTM was configured with 1 layer and was trained for 500 updates. The learning rate was 0.001, 0.01, 0.1 or 1 with linear learning rate decay. The batch size was 128, 256, 512 or 1,024. There were 64, 128, 256 or 512 neurons in the different LSTM models. The L2 penalty was 1e-4, 1e-5 or 1e-6 (table 2.7).
Hyperparameter Values
Learning rate       {0.001, 0.01, 0.1, 1}
Batch size          {128, 256, 512, 1024}
L2-penalty          {1e-4, 1e-5, 1e-6}
Number of neurons   {64, 128, 256, 512}
Table 2.7: LSTM hyperparameters used in the grid search procedure for the simulated and real dataset
The LSTM uses the same input encoding as the CNN, but the input is followed by one LSTM layer with a variable number of LSTM cells. After the LSTM layer there is a ReLU function, then a fully connected layer and a sigmoid output function (figure 2.11).
Chapter 3
Results
3.1
Cluster analysis
Because of memory limitations, the clustering for the simulated dataset was restricted to 50% of the data, with 5,000 randomly selected samples per class. The best model was achieved with a configuration of 10 clusters, which was used because it corresponds to the number of classes. Varying the linkage criterion had no impact on the results. The Hamming distance gave the best results as distance measure, although experiments with the Euclidean distance gave similar results. Figure 3.1 shows that, with the exception of classes 6, 8 and 9, the majority of all classes were predicted perfectly by the cluster analysis.
Figure 3.1: Cluster linkage for 10% of the data samples in the simulated dataset (Clusters are colored, labels are shown with the label number)
For the real dataset, the cluster analysis successfully showed clusters for different classes. The number of clusters was set to 13 (the number of classes). Only the dark blue and yellow clusters show a high proportion of class 1, which corresponds to the celiac class. Further, clusters for class 2 (HIV class) are visible from the bottom to the bottom right and at the top. The other classes could not be visually separated by the cluster analysis. For better visual representation, only 10% of the data is shown in the cluster linkage of figure 3.2.
Figure 3.2: Cluster linkage for 10% of the data samples in the real dataset (Clusters are colored, labels are shown with the label number)
3.2
K-nearest neighbors algorithm
In the experiments, the k-nearest neighbors algorithm (kNN) showed the best results with the brute algorithm and the Minkowski metric with p = 2, which equals the Euclidean metric. The optimal number of neighbors varied between 1 and 2 across the folds: two neighbors achieved the best result for folds 0 and 4, and one neighbor for folds 1, 2 and 3 (table A.1). The average balanced accuracy (BACC) for the antibody class prediction of the five models of the simulated dataset ranged from 97.2% (SD: 3.9) to 100% (SD: 0.0) (table 3.1). The area under the curve (AUC) for all folds was 1. The receiver operating characteristic (ROC) curve is shown in figure A.1.
Class   Fold 0          Fold 1          Fold 2          Fold 3          Fold 4          AVG
1       1.000           1.000           1.000           0.992           0.991           0.996 ± 0.004
2       1.000           1.000           1.000           0.920           0.938           0.972 ± 0.035
3       1.000           1.000           1.000           0.944           1.000           0.989 ± 0.022
4       0.972           0.938           1.000           1.000           0.996           0.981 ± 0.024
5       1.000           1.000           1.000           1.000           1.000           1.000 ± 0.000
6       1.000           0.917           1.000           1.000           1.000           0.983 ± 0.033
7       0.996           0.983           1.000           1.000           0.977           0.991 ± 0.009
8       0.954           0.992           1.000           0.917           1.000           0.972 ± 0.033
9       0.954           1.000           1.000           0.991           1.000           0.989 ± 0.018
10      1.000           0.892           1.000           1.000           0.917           0.962 ± 0.048
AVG     0.987 ± 0.02    0.972 ± 0.039   1.000 ± 0.000   0.976 ± 0.033   0.982 ± 0.029   0.984 ± 0.012
Table 3.1: BACC values with mean and standard deviation for all 5 folds from kNN models of the simulated dataset
For the majority of the folds of the celiac and HIV class, with random as well as with cluster CV, the best results were achieved with one neighbor. The celiac class (folds 3 and 4) and the HIV class with cluster CV (folds 1 and 3) were the exceptions with two neighbors (table A.2). The mean BACC for the celiac and HIV class was 99.6% (SD: 0.1) and 97.4% (SD: 0.5) with random CV (table 3.2) and 97.8% (SD: 1.2) and 93.0% (SD: 3.3) with cluster CV, respectively (table 3.3). The AUC was 1 for the celiac class and between 0.97 and 0.98 for the HIV class. ROC curves are shown in figure A.2 for the celiac class and in figures A.3 and A.4 for the HIV class.
Fold    Celiac          HIV
0       0.995           0.971
1       0.996           0.983
2       0.995           0.978
3       0.996           0.971
4       0.999           0.968
AVG     0.996 ± 0.001   0.974 ± 0.005
Table 3.2: BACC values with mean and standard deviation for all five folds from kNN models of the real dataset for the celiac and HIV class
Fold Celiac HIV
0       0.996           0.976
1       0.985           0.957
2       0.974           0.890
3       0.962           0.899
4       0.973           0.928
AVG     0.978 ± 0.012   0.930 ± 0.033
Table 3.3: BACC values with mean and standard deviation for all five cluster folds from kNN models of the real dataset for the celiac and HIV class
3.3
Support vector machine
Support vector machines (SVM) were used as a baseline for the complexity of the classification task. The models for the simulated dataset were trained with different hyperparameters. The best results were achieved with the linear kernel, cost 1 and gamma 0.1 for all folds and seeds. The average BACC for all 5 folds over 3 random seeds was 98.7% (SD: 0.2). Further results are shown in table 3.4. The AUC for all classes and folds was 1. The ROC curve is shown in figure A.5, and the train and validation results are shown in tables A.19 and A.20, respectively.
Class Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 AVG
1 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 2 0.988 ± 0.000 0.988 ± 0.000 0.987 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 3 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 4 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 5 0.984 ± 0.000 0.984 ± 0.000 0.985 ± 0.000 0.985 ± 0.000 0.984 ± 0.000 0.984 ± 0.000 6 0.984 ± 0.000 0.984 ± 0.000 0.984 ± 0.000 0.985 ± 0.000 0.985 ± 0.000 0.984 ± 0.000 7 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 0.989 ± 0.000 8 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 9 0.988 ± 0.000 0.988 ± 0.000 0.987 ± 0.000 0.987 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 10 0.988 ± 0.001 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988 ± 0.000 0.988± 0.000 AVG 0.987 ± 0.002 0.987 ± 0.002 0.987 ± 0.002 0.988 ± 0.001 0.988 ± 0.002 0.987 ± 0.002
Table 3.4: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from SVM models of the simulated dataset
The best results for the celiac and HIV class with 5-fold CV were achieved with the linear kernel and the radial basis function (RBF) kernel, respectively. For both classes the optimal gamma value was 0.1 for the majority of the configurations, while the value for cost ranged between 1 and 5. Detailed optimized hyperparameter settings are shown in table A.3 (celiac class) and table A.4 (HIV class). SVM models based on the real dataset showed an average BACC over all 5 folds and 3 random seeds of 99.5% (SD: 0.0) for the celiac and 94.8% (SD: 0.0) for the HIV class (table 3.5). The ROC curves are shown in figures A.6 and A.7. The train and validation results are shown in tables A.21 and A.22, respectively.
Fold    Celiac BACC     HIV BACC
0       0.995 ± 0.000   0.946 ± 0.003
1       0.995 ± 0.000   0.948 ± 0.001
2       0.995 ± 0.001   0.948 ± 0.003
3       0.995 ± 0.001   0.950 ± 0.002
4       0.994 ± 0.001   0.947 ± 0.002
AVG     0.995 ± 0.000   0.948 ± 0.000
Table 3.5: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from SVM models of the real dataset for the celiac and HIV class
Optimized hyperparameters for SVM models with clustering showed more variation. For both the celiac and the HIV class, the kernel varied between linear and RBF, gamma between 0.1 and 0.2, and cost ranged from 1 to 5. Detailed optimized hyperparameter settings for cluster CV are shown in table A.13 (celiac class) and table A.14 (HIV class). Clustered SVM models of the real dataset showed a mean BACC of 98.6% (SD: 0.0) for the celiac and 85.7% (SD: 0.1) for the HIV class (table 3.6). The train and validation results are shown in tables A.23 and A.24, respectively.
Fold    Celiac BACC     HIV BACC
0       0.992 ± 0.000   0.904 ± 0.000
1       0.993 ± 0.000   0.903 ± 0.001
2       0.980 ± 0.004   0.857 ± 0.002
3       0.972 ± 0.000   0.822 ± 0.002
4       0.992 ± 0.000   0.800 ± 0.000
AVG     0.986 ± 0.000   0.857 ± 0.001
Table 3.6: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 cluster folds from SVM models of the real dataset for the celiac and HIV class
3.4
Convolutional neural networks
Hyperparameter optimization for the convolutional neural network (CNN) model showed the best results with learning rate 1, batch size 128 and L2-penalty 10e-4 over all folds and seeds in the simulated dataset. For the majority of configurations, the number of neurons and the kernel size were 64 and 7, respectively (table A.5). The CNN models achieved 100% BACC on average over all 10 classes in the simulated dataset (table 3.7). The AUC was 1 for all classes and the ROC curve is shown in figure A.8. The validation results are shown in table A.25.
Class Fold 0 Fold 1 Fold 2 Fold 3 Fold 4 AVG
1 1.000 ± 0.000 1.000 ± 0.000 0.999 ± 0.002 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 2 0.999 ± 0.002 0.999 ± 0.002 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 3 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 4 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 5 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 6 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 7 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 8 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 9 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 10 1.000 ± 0.000 1.000 ± 0.000 0.998 ± 0.002 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.001 AVG 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.001 1.000 ± 0.000
Table 3.7: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from CNN models of the simulated dataset
Hyperparameter optimization for the CNN model with the real dataset showed different configurations for the celiac (table A.6) and HIV class (table A.7). With the exception of the learning rate, there was no distinguishable pattern in the selection of hyperparameters for both classes; a higher learning rate for the celiac class (1, 0.1) and a lower rate for the HIV class (0.1, 0.1) performed better. Batch size, kernel size and L2-penalty varied between 128/256, 7/9/11/13, and 10e-4/10e-5, respectively. CNN models for the real dataset showed an average BACC over all 5 folds and 3 random seeds of 100% (SD: 0.0) for the celiac and 97.5% (SD: 0.0) for the HIV class (table 3.8). The AUC was 1 for the celiac class and varied between 0.95 and 0.98 for the HIV class. The ROC curves are shown in figures A.9 and A.10, and the validation results in table A.27.
Fold    Celiac BACC     HIV BACC
0       1.000 ± 0.000   0.976 ± 0.006
1       0.999 ± 0.001   0.974 ± 0.004
2       1.000 ± 0.000   0.972 ± 0.003
3       0.999 ± 0.001   0.977 ± 0.003
4       1.000 ± 0.000   0.977 ± 0.003
AVG     1.000 ± 0.000   0.975 ± 0.000
Table 3.8: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from CNN models of the real dataset for the celiac and HIV class
Hyperparameter optimization for the CNN model with cluster CV showed less variation between the celiac (table A.8) and HIV class (table A.9) than without clustering. Optimal learning rate patterns were similar to the CNN models without clustering, with the celiac class performing better at higher and the HIV class better at lower learning rates. For both classes, the learning rate, batch size, kernel size and L2-penalty varied between 0.01/0.1/1, 128/256, 7/9/11/13, and 10e-4/10e-5, respectively. The CNN models for the real dataset showed an average BACC over all 5 clustered folds and 3 random seeds of 99.7% (SD: 0.0) for the celiac and 94.7% (SD: 1.0) for the HIV class (table 3.9). The validation results are shown in table A.26.
Fold    Celiac BACC     HIV BACC
0       1.000 ± 0.000   0.938 ± 0.038
1       0.999 ± 0.001   0.962 ± 0.012
2       0.987 ± 0.001   0.902 ± 0.010
3       0.999 ± 0.001   0.963 ± 0.002
4       1.000 ± 0.000   0.970 ± 0.002
AVG     0.997 ± 0.000   0.947 ± 0.010
Table 3.9: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 cluster folds from CNN models of the real dataset for the celiac and HIV class
3.5
Long-short term memory network
Hyperparameter optimization for the long-short term memory network (LSTM) model showed the best results with learning rates 0.1 and 1 over all folds and seeds in the simulated dataset. Batch size, number of neurons and L2-penalty varied between 128/256, 64/128, and 10e-4/10e-5, respectively (table A.10). The LSTM models achieved an average of 100% BACC across all 10 classes in the simulated dataset (table 3.10). The AUC was 1 for all classes and the ROC curve is shown in figure A.11. The validation results are shown in table A.28.
Class   Fold 0   Fold 1   Fold 2   Fold 3   Fold 4   AVG
1       1.000    1.000    1.000    1.000    1.000    1.000
2       1.000    1.000    1.000    1.000    1.000    1.000
3       1.000    1.000    1.000    1.000    1.000    1.000
4       1.000    1.000    1.000    1.000    1.000    1.000
5       1.000    1.000    1.000    1.000    1.000    1.000
6       1.000    1.000    1.000    1.000    1.000    1.000
7       1.000    1.000    1.000    1.000    1.000    1.000
8       1.000    1.000    1.000    1.000    1.000    1.000
9       1.000    1.000    1.000    1.000    1.000    1.000
10      1.000    1.000    1.000    1.000    1.000    1.000
AVG     1.000    1.000    1.000    1.000    1.000    1.000
Table 3.10: BACC values averaged over 3 random seeds with mean and standard deviation (0 in all cases) for all 5 folds from LSTM models of the simulated dataset
Hyperparameter optimization for the LSTM model with the real dataset showed different configurations for the celiac (table A.11) and HIV class (table A.12). Except for the learning rate, which was 1 for both classes, there was no distinguishable pattern in the hyperparameter selection. Batch size, number of neurons and L2-penalty varied between 128/256, 64/128, and 10e-4/10e-5, respectively. LSTM models for the real dataset showed an average BACC across all 5 folds and 3 random seeds of 99.8% (SD: 0.2) for the celiac and 94.9% (SD: 1.0) for the HIV class (table 3.11). The AUC was 1 for the celiac class and varied between 0.95 and 0.98 for the HIV class. The validation results are shown in table A.29.
Fold    Celiac BACC     HIV BACC
0       0.999 ± 0.002   0.953 ± 0.009
1       0.999 ± 0.001   0.946 ± 0.012
2       0.999 ± 0.001   0.939 ± 0.010
3       0.997 ± 0.001   0.947 ± 0.015
4       0.994 ± 0.005   0.961 ± 0.024
AVG     0.998 ± 0.002   0.949 ± 0.010
Table 3.11: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 folds from LSTM models of the real dataset for the celiac and HIV class
Hyperparameter optimization for the LSTM model with cluster CV showed less variation between the celiac (table A.15) and HIV class (table A.16) than without clustering. Optimal learning rate patterns were similar to the LSTM models without clustering, with the majority of configurations performing best at a learning rate of 1 (3 exceptions with 0.1) for both classes. Batch size, number of neurons and L2-penalty varied between 128/256/512, 64/128, and 10e-4/10e-5, respectively. The LSTM models for the real dataset showed an average BACC over all 5 clustered folds and 3 random seeds of 98.8% (SD: 0.0) for the celiac and 85.8% (SD: 2.4) for the HIV class (table 3.12). The ROC curves are shown in figures A.12 and A.13. The validation results are shown in table A.30.
Fold    Celiac BACC     HIV BACC
0       0.967 ± 0.000   0.783 ± 0.034
1       0.998 ± 0.003   0.929 ± 0.007
2       0.998 ± 0.001   0.781 ± 0.047
3       0.984 ± 0.004   0.868 ± 0.042
4       0.994 ± 0.004   0.919 ± 0.036
AVG     0.988 ± 0.000   0.858 ± 0.024
Table 3.12: BACC values averaged over 3 random seeds with mean and standard deviation for all 5 cluster folds from LSTM models of the real dataset for the celiac and HIV class
Chapter 4
Discussion
4.1
Interpretation of results
Cluster analysis was able to identify the simple structure of the simulated dataset. Although the sample size was reduced by half, the results are assumed to be representative of the full dataset, since visual inspection showed a successful clustering. Support vector machine (SVM), k-nearest neighbors algorithm (kNN), convolutional neural network (CNN) and long-short term memory network (LSTM) achieved a high accuracy for both the simulated and the real dataset. This is particularly true for the models based on the simulated dataset, where all methods yielded over 98% balanced accuracy (BACC), with kNN, SVM, CNN and LSTM reaching 98.4%, 98.7%, 100% and 100%, respectively (table 4.1). The results show that the applied machine learning methods were able to learn the simple pattern, consisting of three amino acid (AA) signals in a 15AA long sequence, which always occurs at the same position. Based on the simplicity of the simulated dataset, it was hypothesized that the supervised machine learning (ML) methods would perform very well on the classification task. This expectation was confirmed and further supported by the perfect prediction of classes with CNN and LSTM in the simulated dataset.
ML method BACC AUC
kNN 0.984 ± 0.012 1.00
SVM 0.987 ± 0.002 1.00
CNN 1.000 ± 0.000 1.00
LSTM 1.000 ± 0.000 1.00
Table 4.1: Average BACC and area under the curve (AUC) for antibody class prediction with kNN, SVM, CNN and LSTM on basis of the simulated dataset
Although the results based on the simulated dataset provide useful insights into the suitability of ML methods, experiments with the real dataset are more significant for applied biology, because it consists of real-world data. As an unsupervised approach, cluster analysis of the real dataset was performed to show simple patterns in the dataset, and clusters for parts of the HIV and celiac class were successfully displayed. Since the cluster analysis does not depend on the sample size, it clusters a part of all classes if the complexity of the dataset is low. The real dataset showed that the complexity of the classification task for the HIV and celiac class was lower than expected, but the cluster analysis still failed to cluster the majority of the two classes successfully. As for the supervised techniques, kNN, SVM, CNN and LSTM showed promising results, with a BACC of 99.6%, 99.5%, 100% and 99.8% in the celiac class and 97.4%, 94.8%, 97.5% and 94.9% in the HIV class, respectively (tables 4.2 and 4.3).
ML method BACC AUC
kNN (celiac) 0.996 ± 0.001 1.00
SVM (celiac) 0.995 ± 0.000 1.00
CNN (celiac) 1.000 ± 0.000 1.00
LSTM (celiac) 0.998 ± 0.002 1.00
Table 4.2: Average BACC and AUC for antibody class prediction with kNN, SVM, CNN and LSTM on basis of the real dataset for class celiac
ML method BACC AUC
kNN (HIV) 0.974 ± 0.005 0.97-0.98
SVM (HIV) 0.948 ± 0.000 0.98
CNN (HIV) 0.975 ± 0.000 0.95-0.98
LSTM (HIV) 0.949 ± 0.010 0.95-0.98
Table 4.3: Average BACC and AUC for antibody class prediction with kNN, SVM, CNN and LSTM on basis of the real dataset for class HIV
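The BACC values reported in tables 4.1 to 4.3 can be illustrated with a minimal sketch. It assumes the standard definition of balanced accuracy as the mean of the per-class recall; the function name and the toy labels are hypothetical and are not taken from the experiments. The example shows why BACC is informative when one target class is set against a much larger negative class.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall, so every class counts equally."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical one-vs-rest labels: 1 = target class, 0 = all other classes.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Plain accuracy is dominated by the large negative class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Balanced accuracy weights the recall of both classes equally.
bacc = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.9, while the balanced accuracy is 0.75, because half of the target-class samples are missed.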
The experiments with the real dataset were extended by clustered CV, to show whether the cluster analysis could already solve most of the classification task and whether the supervised techniques would still perform adequately on the clustered subsets. With clustered CV on the real dataset, kNN, SVM, CNN and LSTM reached a BACC of 97.8%, 98.6%, 99.7% and 98.8% in the celiac class and 93.0%, 85.7%, 94.7% and 85.8% in the HIV class, respectively (table 4.4). Although the BACC was lower with clustered CV, marginally in the celiac class and by up to about 9 percentage points (SVM and LSTM) in the HIV class, the supervised techniques still performed well.
ML method BACC (Celiac) BACC (HIV)
kNN 0.978 ± 0.012 0.930 ± 0.033
SVM 0.986 ± 0.000 0.857 ± 0.001
CNN 0.997 ± 0.000 0.947 ± 0.010
LSTM 0.988 ± 0.000 0.858 ± 0.024
Table 4.4: Average BACC for antibody class prediction with kNN, SVM, CNN and LSTM on basis of the real dataset with cluster CV
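The idea behind clustered CV can be sketched as follows. This is an illustration under assumptions, not the procedure used in the experiments: it assumes that every sequence has already been assigned a cluster id and that whole clusters are distributed over the folds, so that no cluster contributes to both the training and the test set. The function names, the greedy balancing strategy and the toy cluster ids are hypothetical.

```python
from collections import defaultdict


def clustered_folds(cluster_ids, n_folds=5):
    """Assign whole clusters to folds so that no cluster is split across folds."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an assumption): largest clusters first,
    # each placed into the currently smallest fold.
    for cid in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


def train_test_splits(folds):
    """Yield (train_idx, test_idx), using each fold once as the test set."""
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx


# Hypothetical cluster assignments for 12 sequences.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 4, 5]
folds = clustered_folds(cluster_ids, n_folds=3)
```

Because similar sequences stay together in one fold, a model evaluated this way cannot score well merely by recognizing near-duplicates of its training sequences, which is consistent with the lower BACC values in table 4.4.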
As for the simulated dataset, the assumption that CNN and LSTM would perform better on the classification task was confirmed, whereas the competitive results of kNN and SVM on the real dataset were surprising. The good results of kNN and SVM showed that the dataset was not overly complex. The CNN showed outstanding results, which further support its role as state of the art in sequence classification (Jurtz et al., 2017). Although the LSTM achieved a nearly perfect prediction score for one class, its performance was overall surpassed by the CNN. These results were unexpected, because it was assumed that the LSTM would reach a similar performance, since it has a more complex architecture. This suggests that a better solution for the LSTM could be found by expanding the hyperparameter search space. Also, because of the good results, it can be assumed that the CDR3 sequence alone gives a good prediction of the antigen class. In addition, a sequence-based method can enable de novo identification of other sequence features that are essential for biological function. A similar project also reported good results in antibody sequence classification with machine learning methods (Liberis et al., 2018).
4.2 Limitations
If the complexity of the classification task is high, the amount of training data needed increases. Therefore, the biggest limitation of the experiments is the limited sample size of human AA sequences in the real dataset (n=5.935). Since training on complex datasets is described as more successful with a large sample size, the experiments were optimized by using a 5-fold CV and limiting the input to sequences of the complementarity-determining region 3 (CDR3), thereby reducing the complexity. The CDR3 sequences are much shorter (5-39AA) than the whole sequence (23-280AA), which allows for the training of NN models with a low number of samples (Sønderby et al., 2015). To further narrow the problem, only the two classes with the highest number of samples (celiac and HIV) were considered as prediction targets. The other 11 classes of the real dataset were also used to train the model, as negative class samples. The good results indicate that a small sample size (1.000-2.000 samples) is no limitation for antibody class prediction with ML methods if shorter CDR3 sequences are used as input. Nevertheless, models can have difficulty finding patterns in the AA sequences of the real dataset, because the signal pattern is unknown. The results of the HIV class indicate that its pattern was more complex, since it had a larger number of samples (2.543) than the celiac class (1.452) but still a lower BACC. While the