3.5 Evaluating Performance of AAE by Analyzing Connectivity Matrices
3.5.3 AAE Model Implementation
The AAE is trained in two phases with the stochastic gradient descent (SGD) algorithm [67]. First, the encoder encodes the data and the generator acts as a decoder, reproducing the data as closely as possible to its input so as to minimise the reconstruction error. Second, the discriminator network tries to distinguish the generated sample from the original sample (the actual input), and the weights of the generator and encoder are then updated [67]. Once training is complete, the latent space z retains useful biological information while omitting noisy features. The AAE is versatile and powerful because it can extract both linear and nonlinear relationships from the input data.
First, a single-layer AAE, that is, an autoencoder without a hidden layer whose input is directly connected to the output, was used to reduce the features to 50 dimensions; its weight matrix was then analysed for biological context. In addition, an AAE with one hidden layer of 1000 neurons in each network (encoder, generator, and discriminator) was tested, and it revealed meaningful genes. During training, Adadelta was chosen as the optimiser with a learning rate of 1, and binary cross-entropy was used as the loss function. The model was trained in an unsupervised manner for 100 epochs with a batch size of 128. After that, the weight matrix was analysed to identify highly weighted genes. Neither dropout nor a regulariser was used in this model, because the validation loss and training loss were almost the same. The number of hidden layers was increased further; however, the results suggest that increasing the depth of the model improves performance only modestly.
For the implementation of this model, the scikit-learn library [51] [68] and the Keras API [69]
with a TensorFlow [70] back-end were used, running on the computer specified in Table B.1. The AAE-encoded latent space contains important subsets of genes that are effective for cancer detection. For a single layer, the weight matrix W has dimension G × 50, where G is the number of genes and each gene is connected to 50 nodes. Deeper models are also analysed by computing the product of the weight matrices across the layers of the AAE. For example, in a 2-layer neural network with one hidden layer of 1000 neurons, the weight matrix of the first layer, W1, has dimension G × 1000 and that of the second layer, W2, has dimension 1000 × 50.
Applying the dot product W = W1 · W2 then gives a matrix W of dimension G × 50. Therefore, it can
be generalised that the weight matrix for an n-layer neural network is
$$W = \prod_{i=1}^{n} W_i$$
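The layer-wise product can be illustrated with a short NumPy sketch. It is not the code used in this work: the helper name `collapse_weights`, the gene count of 200 and the random matrices are assumptions made purely for the example. The product also treats the network as linear, so biases and activation functions are ignored.

```python
from functools import reduce

import numpy as np


def collapse_weights(layer_weights):
    """Multiply per-layer weight matrices, in order, into one matrix.

    layer_weights: list of 2-D arrays W_1, ..., W_n whose inner
    dimensions agree (W_i has as many columns as W_{i+1} has rows).
    """
    return reduce(np.matmul, layer_weights)


# Hypothetical 2-layer example: G genes -> 1000 hidden neurons -> 50 nodes.
rng = np.random.default_rng(0)
n_genes = 200  # illustrative value for G, not taken from the data-set
w1 = rng.normal(size=(n_genes, 1000))  # first layer, G x 1000
w2 = rng.normal(size=(1000, 50))       # second layer, 1000 x 50

w = collapse_weights([w1, w2])
print(w.shape)  # one row per gene, one column per latent node
```

With a single matrix in the list the function returns that matrix unchanged, which corresponds to the single-layer case.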
The weights were found to closely follow a normal distribution (Fig. 3.6). First, the weights of all nodes are summed for each gene, and the genes are then sorted by this summed weight. The sorting method is explained in Algorithm 1. Finally, those genes are analysed with GO and pathway analysis to gain biological insight.
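The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions: genes are ordered by the signed sum of their weights, highest first, and Algorithm 1 may differ in detail (for instance by using absolute values). The function name `top_genes`, the gene labels and the toy matrix are hypothetical.

```python
import numpy as np


def top_genes(weights, gene_names, k=50):
    """Return the k genes with the largest summed weight across nodes."""
    # Sum the weights of all latent nodes for each gene (one score per row).
    scores = np.asarray(weights).sum(axis=1)
    # Sort gene indices by score, highest first, and keep the top k.
    order = np.argsort(scores)[::-1][:k]
    return [(gene_names[i], float(scores[i])) for i in order]


# Toy weight matrix: 3 hypothetical genes x 2 latent nodes.
toy_weights = np.array([[0.1, 0.2],
                        [0.5, 0.4],
                        [-0.3, 0.1]])
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]

for gene, score in top_genes(toy_weights, toy_genes, k=2):
    print(f"{gene}\t{score:.2f}")
```

In practice the input would be the G × 50 weight matrix and the list of gene identifiers in the same row order, with k = 50 to match the number of genes taken forward for GO and pathway analysis.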
3.5.4 Results
All experiments in this study were conducted on the University’s server described in Table B.1. The experimental results using BRCA data are presented in Table 3.2 and Table B.5, and the validation results using UCEC data are presented in Table B.6. In this experiment, the causes of the diseases are explored by extracting highly weighted genes from the latent space, since analysing those genes could help to identify what causes the diseases. Here, the top 50 highly weighted genes are taken for further GO and pathway analysis. In the proposed model, four genes (OR2T27, OR2A25, OR8B8 and OR6V1) are found to be related to olfactory receptor activity. In addition, those four genes belong to the olfactory transduction pathway. A recent study by Lea et al. (2018) shows that olfactory receptor genes such as OR2B6 are highly expressed in breast carcinoma tissues compared to the respective healthy tissues [71]. They considered OR2B6 as a potential biomarker for breast carcinoma
Table 3.2. Results of gene enrichment analysis using various feature extraction methods on the BRCA cancer data-set.
Algorithm           Molecular Function                                        P-value
AAE (Single layer)  Olfactory receptor activity                               5.92e-2
AAE (2 layer)       Small conjugating protein ligase activity                 7.34e-2
AAE (3 layer)       Single-stranded RNA binding                               6.25e-2
AE                  General RNA polymerase II transcription factor activity   7.97e-3
DAE                 Not Available
PCA                 Calcium ion binding                                       6.44e-2
VAE                 General RNA polymerase II transcription factor activity   9.55e-2
NMF                 Transition metal ion binding                              2.8e-2
ICA                 RNA helicase activity                                     6.49e-2
tissues. Their result supports the validity of the proposed model. The deeper AAE model (2 layers) identified protein ligase activity, which is closely linked with human breast cancer [72]. The genes extracted by PCA have GO terms related to calcium ion binding, which is known to be associated with breast cancer [73] [74]. Choon et al. (2018) found that genes involved in transporting calcium ions into milk also show altered expression in some breast cancers. ICA yields a molecular function related to RNA helicase activity, which is associated with breast cancer: RNA helicase plays an oncogenic role in breast cancer by changing the cell cycle [75]. Other feature extraction methods were also applied and their highly weighted genes analysed, but no molecular function or significant pathway associated with breast cancer was found. In terms of computational efficiency, the proposed single-layer AAE takes 1946 MB of memory and 1 min 13 sec of computation time (wall clock), as described in Table B.4. Furthermore, the proposed methods are evaluated on a different data-set, UCEC, which contains four major
Fig. 3.6. Histogram of AAE weights after applying the TopGene algorithm (x-axis: Weights; y-axis: Frequencies).
subtypes of endometrial carcinoma. The single-layer AAE model identified metal ion binding genes such as KLF6, EGR3, CLTB, BRSK2, ZSWIM7, ZNF655, TKTL1, BBOX1, ZNF677, RNF114, PHF2, SLU7, ALOX5, CPD, BMPR1B and CACNA1A, a function that is associated with endometrial cancer risk [76] [77]. Heavy metals such as cadmium are well-known carcinogens and can increase the incidence of endometrial cancer [77]. The genes extracted by other methods such as PCA, VAE, NMF and AE were not found to be relevant to the disease.
3.5.5 Discussion
In this work, an AAE-based feature extraction method is presented. It is shown that feature extraction methods not only help to classify or visualise data but are also essential for extracting relevant biological information from both microarray and RNA-seq data. Based on the results, the proposed AAE model provides a unique benefit in identifying biomarkers on two independent data-sets. On the BRCA data-set, only the AAE-based model finds an important pathway for breast cancer identification. The AAE with a 2-layer neural network also extracts useful genes for tumour
identification. Single-cell RNA-seq, long-read RNA-seq and microbiome data-sets contain massive amounts of data; therefore, the deep learning based AAE model has a promising future. The AAE model could also improve the visualisation of high-dimensional data, as it provides a non-linear mapping. In the future, it is anticipated that the unsupervised AAE can provide a new way to extract and integrate large genomic data-sets effectively.