A Soft Computing Perspective for DNA Fragment Assembly and Gene Expression Data Analysis


By

Uzma

Supervised by

Dr. Zahid Halim

A dissertation

submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering

Faculty of Computer Science and Engineering

Ghulam Ishaq Khan Institute of Engineering Sciences and Technology

Topi, Khyber Pakhtunkhwa, Pakistan


Dissertation examination committee

Dr. Zahid Halim
Supervisor, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, PAKISTAN.

Prof. Dr. Zhu Han
Foreign Evaluator, John and Rebecca Moores Professor in the Electrical and Computer Engineering Department and the Computer Science Department at the University of Houston, Texas, USA.
QS World University Rank: 501-600

Prof. Dr. Andrea Bondavalli
Foreign Evaluator, Professor of Computer Science, Department of Mathematics and Computer Science, University of Firenze, ITALY.
QS World University Rank: 401-500

Dr. Usman Qamar
External Examiner, Department of Computer and Software Engineering, College of Electrical and Mechanical Engineering, National University of Sciences and Technology, Rawalpindi, PAKISTAN.

Dr. Ayaz Hussain
External Examiner, Department of Computer Science, Quaid-i-Azam University, Islamabad, PAKISTAN.

Dr. Syed Fawad Hussain
Internal Examiner, Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, PAKISTAN.

Dr. Ghulam Abbas
Internal Evaluator (Pre-screening), Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, PAKISTAN.

Dr. Sajid Anwar
Internal Evaluator (Pre-screening), Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, PAKISTAN.


The work in this dissertation was carried out at the Machine Intelligence Research Group (MInG), Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology (GIKI), Topi, Pakistan. The research was supported by the GIK Institute under the GA-F Fellowship Program.

Scholar ID: CS1702


Dedication

This work is dedicated to my parents

Memroz Khan

and

Hurmat Begum

(with beautiful soul)


ACKNOWLEDGEMENTS

First of all, I thank Almighty Allah for His mercy and blessings, which cannot be worded but are to be felt. I consider myself lucky to have had His blessings throughout my academic career, and all my prayers will not be sufficient to give thanks for what I am blessed with.

I would like to express special gratitude to my supervisor, Dr. Zahid Halim. He provided support, encouragement, and guidance throughout my graduate research career. At various stages of this research project, I had ample opportunity to learn from him, think creatively, and develop new ideas. His patience, hard work, positive outlook, punctuality, and confidence in my research ideas inspired me a lot and encouraged me to accomplish this undertaking. The completion of this dissertation and the research papers extracted from this work would not have been possible without his efforts.

It is my pleasure to thank all my friends for the wonderful time we shared. In addition, I would like to thank all my MInG (Machine Intelligence Research Group) colleagues for their friendly suggestions and support, for giving me the right direction to step further in my research, and for making my stay at GIK a pleasant memory. My deep and sincere gratitude goes to my family for their help, love, and support. I am very thankful to my brothers Dr. Asad Mahmood, Engr. Safder Mehmood, and Aamir Mehmood for their support throughout my life as well as my academic career. Their knowledge and experiences always motivated me and encouraged me to accomplish my aims. They are my inspiration. This journey would not have been possible without their selfless encouragement to explore new directions in my life.

I would also like to thank the GIK Institute for giving me the opportunity to pursue my dream. I am also thankful to the Dean, Faculty of Computer Science and Engineering, and the Dean, Graduate Studies, for providing a suitable research environment.


List of Publications Extracted from This Work

Chapter 3 is published in the Applied Soft Computing Journal

Uzma et al., "Optimizing the DNA fragment assembly using metaheuristic-based overlap layout consensus approach," Applied Soft Computing, Vol. 92, 2020. [ISSN: 1568-4946, Thomson Reuters JCR 2019, Impact factor 4.873, Elsevier]

Chapter 5 is accepted in the Neural Computing and Applications Journal

Uzma et al., "Gene-encoder: A feature selection technique through unsupervised deep learning-based clustering for large gene expression data," Neural Computing and Applications, Vol. --, 20--. [ISSN: 1433-3058, Thomson Reuters JCR 2019, Impact factor 4.664, Springer] In press.


List of Symbols and Abbreviations

AAE Adversarial AutoEncoder
ABC Artificial Bee Colony
ALL Acute Lymphoblastic Leukemia
AML Acute Myeloid Leukemia
ANN Artificial Neural Networks
ARI Adjusted Rand Index
bp Base pair
CAE Convolutional AutoEncoder
CFS Correlation Feature Selection
dBG de Bruijn Graph
DBI Davies Bouldin Index
ddNTPs Dideoxynucleotides
DE Differential Evolution
DFA DNA Fragment Assembly
DHC Density-based Hierarchical Clustering
DI Dunn Index
DNA Deoxyribonucleic Acid
EA Evolutionary Algorithms
FN False Negative
FP False Positive
FTP Fragmentation Through Polymerization
GN Generalized Neuron
GSA Global Sequence Alignment
IG Information Gain
KBCGS Kernel-Based Clustering method for Gene Selection
k-NN k-Nearest Neighbor
LMKNN Local Mean-Based k-Nearest Neighbor
LR Logistic Regression
LSA Local Sequence Alignment
LSFS Local Search-Based Feature Selection
MIM Mutual Information Maximization
MLP Multi-Layer Perceptron
MR-MEM MapReduce Maximum Exact Matches
NGS Next-Generation Sequencing
NP-hard Non-deterministic Polynomial-time hard
OLC Overlap Layout Consensus
PALS Power-Aware Local Search
PCA Principal Component Analysis
PCR Polymerase Chain Reaction
PNN Probabilistic Neural Network
PPV Positive Predictive Value
RBF Radial Basis Function
RF Random Forest
RRHGA Restarting and Recentering Hybrid Genetic Algorithm
SAE Stacked AutoEncoder
SBS Sequential Backward Search
SC Silhouette Coefficient
SCAD Simultaneous Clustering and Attribute Discrimination
SDM Sequence Data Mining
SGS Second-Generation Sequencing
SMKNN Smallest Modified k-NN
SOM Self-Organizing Map
SVM Support Vector Machine
TGS Third-Generation Sequencing
TN True Negative
TP True Positive
TSP Traveling Salesman Problem
df degrees of freedom
∆c contig
∆f fragment overlap
ZeBRα ZeroBackground Redα


Abstract

Bioinformatics combines computing, statistics, and mathematical methods to understand and solve biological problems. It mainly comprises three activities: understanding the relationships between entities contained in large volumes of data through the development of novel algorithms and statistical models; understanding and analyzing various kinds of data, such as deoxyribonucleic acid (DNA) and protein sequences, gene expression, and protein structure; and providing access to information through the development and implementation of modern tools. Bioinformatics methods are often applied to the large datasets generated through multiple initiatives. Genomics and proteomics are two important large-scale areas that use bioinformatics approaches. Genomics is the study of an organism's genome, which includes the DNA sequences that determine the life of the organism. The genome consists of DNA sequences that include a set of genes carrying hereditary material from parents to offspring, and its transcripts are the RNA copies that decode the genetic information. The sequencing and analysis of genomic entities, covering both the genes and the transcripts of an individual, is referred to as genomics, whereas the analysis of the whole set of proteins is known as proteomics.

This dissertation focuses on the analysis of genomic data, namely DNA sequences and gene expression data. Such data are used to discover hidden biological information and to understand biological phenomena that can be translated into clinical applications. However, analyzing such data poses challenges that need to be addressed. Recently published reports have used machine learning and data mining techniques to overcome these challenges. It is, therefore, necessary to design a framework that extracts useful information from such large and complex data. Hence, soft computing and data mining methods are employed in the current work to handle data of this scale.

To address these problems, this dissertation makes three contributions to the analysis of genomic data, covering both DNA sequences and gene expression data: DNA fragment assembly, supervised feature selection from gene expression data, and unsupervised feature selection from gene expression data.

The metaheuristic-based overlap layout consensus approach for the optimal solution of the DNA Fragment Assembly (DFA) is presented first. Next, for supervised feature selection in the gene expression data, an ensemble of filter-based methods is combined with a Support Vector Machine (SVM)-based genetic algorithm (GA) for optimization. Finally, for unsupervised feature selection in the gene expression data, an ensemble of filter-based methods is used and the resulting feature subset is optimized through an autoencoder-based GA.

The first part provides a solution to the DFA problem that arises in sequencing. Past literature shows that the Overlap Layout Consensus (OLC) approach was popular for the DFA due to the large reads generated through next-generation sequencing technology. The task in the DFA is to order the fragments in a way that minimizes the number of contigs and maximizes the sum of overlap scores. To this end, the Restarting and Recentering Hybrid Genetic Algorithm (RRHGA) is designed to perform the assembly. The initial solution of the GA is created through the 2-opt heuristic, and the population is generated based on transpositions. This work uses two representations, directed and undirected. The reproduction operators, i.e., partially mapped crossover and swap mutation, are then applied to the chromosomes, and the Power Aware Local Search (PALS) heuristic is used as an additional evolutionary operator. The number of contigs and the sum of overlap scores are used to judge the assembly quality. The RRHGA is compared with four well-known algorithms on 25 benchmarks. Three types of experiments show that the proposed framework optimally solves the DFA problem.
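The two quality measures named above, the sum of overlap scores and the number of contigs, can be illustrated with a minimal Python sketch. It is not the RRHGA implementation: the exact suffix-prefix overlap score, the `cutoff` threshold, the function names, and the toy fragments are all assumptions made for illustration, and reverse complements (the undirected representation) are ignored.

```python
import random


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assembly_quality(order, fragments, cutoff=3):
    """Return (sum of overlap scores, number of contigs) for a fragment ordering.

    Assumption: a new contig starts wherever two adjacent fragments
    overlap by fewer than `cutoff` bases.
    """
    total, contigs = 0, 1
    for i, j in zip(order, order[1:]):
        score = overlap(fragments[i], fragments[j])
        total += score
        if score < cutoff:
            contigs += 1
    return total, contigs


def swap_mutation(order, rng):
    """Swap two distinct positions of a permutation chromosome."""
    i, j = rng.sample(range(len(order)), 2)
    mutated = list(order)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


# Toy fragments (hypothetical data, not one of the 25 benchmarks).
fragments = ["ACGTAC", "TACGGA", "GGATTC", "CCCCCC"]
print(assembly_quality([0, 1, 2, 3], fragments))
print(swap_mutation([0, 1, 2, 3], random.Random(1)))
```

A GA of the kind described would evaluate each permutation chromosome with a function like `assembly_quality` and prefer orderings with a higher overlap sum and fewer contigs.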

The second part of this dissertation provides a solution for the analysis of gene expression data through supervised feature selection. The study first focuses on selecting the optimal feature subset, which makes classification more effective. Next, classification is performed on the selected feature subset to predict cancerous samples. The designed framework selects features using a two-phase supervised method. The first phase uses an ensemble of filters, whereas the second phase uses the SVM-based GA. The second phase represents the initial solution based on the Local Search-based Feature Selection (LSFS) heuristic. It is used to remove the noise that the ensemble method generates. The results are evaluated through standard performance metrics. The proposed approach is compared with three state-of-the-art algorithms on six benchmark cancer gene expression datasets. A number of experiments are conducted to evaluate the framework. The results show that the proposed solution performs well in selecting optimal features for effective classification.
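The first phase, an ensemble of filters, can be sketched as rank aggregation. The abstract does not name the filters or the aggregation rule, so the two scoring functions, the mean-rank rule, and all identifiers below are hypothetical stand-ins rather than the dissertation's method. Class labels are assumed to be 0 and 1.

```python
from statistics import mean, pstdev


def mean_gap_score(column, labels):
    """Absolute gap between the two class means, scaled by the overall spread."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    spread = pstdev(column) or 1.0  # avoid dividing by zero for constant genes
    return abs(mean(pos) - mean(neg)) / spread


def correlation_score(column, labels):
    """Absolute Pearson correlation between one gene and the class label."""
    mx, my = mean(column), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
    sx = sum((x - mx) ** 2 for x in column) ** 0.5
    sy = sum((y - my) ** 2 for y in labels) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def rank(scores):
    """Map each gene index to its rank; rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda g: -scores[g])
    ranks = [0] * len(scores)
    for position, gene in enumerate(order):
        ranks[gene] = position
    return ranks


def ensemble_select(samples, labels, k, filters=(mean_gap_score, correlation_score)):
    """Return the k gene indices with the best mean rank across all filters."""
    columns = list(zip(*samples))  # one column of expression values per gene
    all_ranks = [rank([f(col, labels) for col in columns]) for f in filters]
    mean_rank = [mean([r[g] for r in all_ranks]) for g in range(len(columns))]
    return sorted(range(len(columns)), key=lambda g: mean_rank[g])[:k]


# Hypothetical data: 4 samples x 3 genes; gene 0 separates the classes best.
samples = [[1.0, 5.0, 0.2], [1.1, 5.1, 0.9], [3.0, 5.0, 0.3], [3.2, 5.2, 0.8]]
labels = [0, 0, 1, 1]
print(ensemble_select(samples, labels, k=2))
```

In a two-phase design of the kind described, the subset returned here would seed the second phase, where a wrapper search such as the SVM-based GA refines it.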

The final part of this dissertation contributes an unsupervised feature selection technique for the analysis of gene expression data. It selects the feature subset that is most effective for the clustering and classification of cancer gene expression data. This part therefore has two aspects: feature selection, and classification of gene expression data for the prediction of cancer samples. Initially, an ensemble of three filter-based methods is used for feature selection. Later, the feature subset is optimized through an autoencoder-based GA. The resultant optimal feature subset is then employed for the classification of gene expression data. The proposed approach is evaluated using external and internal validity indices and is compared with four state-of-the-art algorithms on six gene expression datasets.
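As an example of an external validity index, the Adjusted Rand Index (ARI, listed among the abbreviations) compares a clustering with known class labels. The sketch below uses the standard pair-counting formula; the function name and the toy labels are illustrative, and the abstract does not state which indices were actually computed this way.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index of two labelings of the same samples (needs n >= 2)."""
    n = len(truth)
    # Pairs of samples that share both the true class and the predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical labels: the cluster names differ, but the grouping is identical.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1 means the clustering reproduces the reference partition up to renaming of clusters, and values near 0 indicate agreement no better than chance.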

The three contributions made in this work analyze the big data generated through various technologies. Different experiments are performed for each contribution to select the best model that optimizes the solution using the GA-based framework. Furthermore, the evaluation shows that the proposed work performs better than the state-of-the-art algorithms on standard performance metrics. Additionally, the t-test is performed to establish the statistical significance of the results. The outcomes reveal that soft computing and data mining-based approaches assist the analysis of huge biological data.
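The abstract does not say which form of the t-test is used. As one plausible reading, the sketch below computes a paired t statistic and its degrees of freedom (df) for two methods evaluated on the same datasets. The function name and the accuracy values are invented for illustration and are not results from this work.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(proposed, baseline):
    """Return (t, df) for paired scores of two methods on the same datasets."""
    if len(proposed) != len(baseline) or len(proposed) < 2:
        raise ValueError("need two equally long score lists with at least 2 pairs")
    diffs = [p - b for p, b in zip(proposed, baseline)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (spread / sqrt(n)), n - 1


# Made-up accuracies on four datasets, for illustration only.
proposed = [0.90, 0.92, 0.88, 0.95]
baseline = [0.85, 0.90, 0.86, 0.90]
t, df = paired_t_statistic(proposed, baseline)
print(f"t = {t:.3f}, df = {df}")
```

The statistic would then be compared against the critical value of the t distribution with the reported degrees of freedom at the chosen significance level.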