Summary of contributions in proteomics

We conclude the proteomics chapter with a summary of contributions of this dissertation in protein informatics:

• Introducing a language model-based distributed representation of protein sequences (ProtVec)

• Proposing a probabilistic variable-length segmentation of protein sequences, used both for motif mining and for extending ProtVec to variable-length embeddings

• Providing a state-of-the-art protein secondary structure predictor from primary sequences through a comprehensive investigation of the role of representations and deep learning architectures in this task

Distributed representation of protein sequences

The large gap between the number of known protein sequences (raw data) and the number of known functions/structures associated with these sequences (meta-data) motivates developing methods that can obtain prior knowledge from the existing raw sequences to infer information about the structure and function of protein sequences. Continuous vector representations of words, known as word vectors, have recently become popular in natural language processing (NLP) as an efficient unsupervised approach to representing semantic and syntactic units of text, helping in NLP tasks (e.g., machine translation, parsing, part-of-speech tagging, and information retrieval). Inspired by this idea, we proposed distributed vector representations of biological sequence segments (k-mers), called bio-vectors in general and ProtVec for proteins, using the skip-gram neural network. We proposed an intrinsic evaluation of ProtVec by measuring the continuity of the underlying biophysical properties (e.g., average mass, hydrophobicity, and charge) using the best Lipschitz constant. For extrinsic evaluation, we evaluated the ProtVec representation in the classification of 324,018 protein sequences belonging to 7,027 protein families, where an average family classification accuracy of 93% ± 0.06% was obtained. In addition, using this representation instead of one-hot vector features in a Max-margin Markov Network (M3Net) for intron-exon prediction and domain identification improved the sequence labeling accuracy from 73.84% to 74.99% and from 82.4% to 89.8%, respectively.
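As a rough illustration of how a protein sequence can be turned into k-mer "words" for skip-gram training, the following Python sketch splits a sequence into shifted lists of non-overlapping k-mers and enumerates (center, context) pairs. The function names, k = 3, the window size, and the shifted-list splitting are assumptions made for illustration only; no embedding is trained here, and the actual ProtVec training setup may differ.

```python
def shifted_kmer_sentences(sequence, k=3):
    """Split a sequence into k lists of non-overlapping k-mers, one per reading offset.

    Illustrative assumption: each offset yields one "sentence" of k-mer words.
    """
    sentences = []
    for offset in range(k):
        words = [sequence[i:i + k]
                 for i in range(offset, len(sequence) - k + 1, k)]
        sentences.append(words)
    return sentences


def skipgram_pairs(words, window=2):
    """Enumerate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any dataset in the text.
    for sentence in shifted_kmer_sentences("MKTAYIAKQR"):
        print(sentence, skipgram_pairs(sentence, window=1))
```

The pairs produced this way are the kind of input a skip-gram model consumes; the model itself would come from a word-embedding library or a custom implementation.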

To the best of our knowledge, we were the first to introduce language model-based embedding of biological sequences. In particular, we used it for unsupervised feature extraction from protein sequences for downstream machine learning on protein structural and functional annotation. ProtVec has been used and extended for a variety of tasks in bioinformatics, of which we can cite only a subset (Wan and J. Zeng 2016; Islam et al. 2017; Hamid and Friedberg 2018; K. K. Yang et al. 2018; Jaeger et al. 2018; Du et al. 2018; A. Dutta et al. 2018; Y. Xu et al. 2018; Öztürk et al. 2018). In addition to the contributions of ProtVec in bioinformatics, similar approaches were later introduced for subword embedding in NLP (Schütze et al. 2016; Kocmi and Ondrej Bojar 2016), some of which were directly inspired by the ProtVec work (Schütze et al. 2016).

We extended the ProtVec embedding to ProtVecX (extended ProtVec), trained on sub-sequences in the Swiss-Prot database (detailed in Section §2.3). We demonstrated that combining the raw k-mer distributions with the embedding representations (either ProtVec or ProtVecX) can improve sequence classification performance compared with using k-mers only or embeddings only. In addition, combining ProtVecX with k-mer occurrences outperformed ProtVec embedding combined with k-mer occurrences for toxin and enzyme prediction tasks. These results suggest that embeddings provide information complementary to the raw k-mer distribution, and that their added value is expressed when they are combined with k-mer features. Using the same representation (the combination of embeddings and k-mer features) as input to a deep neural network, we achieved first and third place in two out of three protein classification tasks in the Critical Assessment of protein Function Annotation (CAFA) in 2018 (CAFA 3.14) (N. Zhou et al. 2019).

Probabilistic variable-length segmentation of protein sequences

One of the obvious differences between biological sequences and many natural languages is that biological sequences (DNA, RNA, and proteins) often do not contain clear boundaries between sequence segments. This difference makes fixed-length k-mer subsequences the common unsupervised approach in bag-of-words representations of biological sequences, including proteins. However, these fixed-length k-mers can be arbitrary units without any biological implication. Thus, more meaningful units need to be introduced. We proposed a new unsupervised method for the segmentation of protein sequences. Instead of fixed-length k-mers, we segmented sequences into commonly occurring variable-length sub-sequences, inspired by byte-pair encoding (BPE), a data compression algorithm. These sub-sequences were then used as input features to the learning algorithm. As a modification to the original BPE algorithm, we defined a probabilistic segmentation by sampling from the space of possible vocabulary sizes. This probabilistic segmentation allows for considering multiple ways of segmenting a sequence into sub-sequences. This idea can be widely used in different applications of protein informatics. In particular, we used it for (i) an alignment-free discriminative protein sequence motif discovery method, called DiMotif, and (ii) a variable-length extension of protein sequence embedding, called ProtVecX.
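The segmentation idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: the function names are hypothetical, ties between equally frequent pairs are broken alphabetically, and the number of merge operations (a stand-in for the vocabulary size) is sampled uniformly, which is an assumption; the text does not state the sampling distribution.

```python
from collections import Counter
import random


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` by its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges):
    """Learn an ordered list of BPE-style merges from a set of sequences."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for symbols in corpus:
            counts.update(zip(symbols, symbols[1:]))
        if not counts:
            break
        # Most frequent adjacent pair; ties broken alphabetically (assumption).
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def segment(sequence, merges, n_merges):
    """Segment a sequence by applying the first `n_merges` learned merges."""
    symbols = list(sequence)
    for pair in merges[:n_merges]:
        symbols = merge_pair(symbols, pair)
    return symbols


def sample_segmentations(sequence, merges, n_samples, seed=0):
    """Draw several segmentations by sampling the number of merges uniformly."""
    rng = random.Random(seed)
    return [segment(sequence, merges, rng.randint(0, len(merges)))
            for _ in range(n_samples)]
```

Each sampled segmentation is a different split of the same sequence into variable-length sub-sequences, so counting sub-sequences over many samples gives a soft, multi-granularity view of the sequence.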

We compared DiMotif against two existing tools for motif discovery: HH-Motif as an instance of non-discriminative methods, and DLocalMotif as an instance of discriminative methods. We compared their performance in the detection of 20 distinct sub-types of experimentally verified motifs. HH-Motif, which compares HMMs of orthologs to retrieve SLiMs, achieved the best average F1, and DiMotif with domain-specific segmentation achieved the second-best F1. DiMotif achieved the highest recall, making it an ideal tool for finding a list of candidates for further experimental verification. In addition, we evaluated DiMotif by extracting motifs related to (i) integrins, (ii) integrin-binding proteins, and (iii) biofilm formation. We showed that the extracted motifs could reliably detect reserved sequences of the same phenotypes, as indicated by their high F1 scores. We also showed that DiMotif could detect experimentally verified motifs related to nuclear localization signals. By using the KL divergence between the distributions of motifs in the positive sequences, DiMotif is capable of outputting multi-part motifs. The DiMotif segmentation can be inferred once from the Swiss-Prot dataset and then used to extract motifs in a given discriminative motif mining problem setting. Unlike existing alignment-based motif discovery methods, DiMotif does not require its input sequences to be homologous. Thus, it can be utilized in cases where motifs need to be found in a set of non-homologous sequences.

Deep learning for protein secondary structure prediction

We studied machine learning-based approaches to predicting protein secondary structure from the primary sequence. We focused on finding an optimal representation and deep learning predictive model for this task on its most challenging dataset to date, i.e., Q8 (8-way classification) on the CullPDB/CB513 dataset, where sequences similar to the training samples are removed from the test set.

We investigated (i) different protein sequence representations, including one-hot vectors, biophysical features, protein sequence embedding (ProtVec), deep contextualized amino acid embedding (ELMo), and the Position Specific Scoring Matrix (PSSM), and (ii) different deep learning architectures, including convolutional neural networks (CNN), recurrent neural networks (in particular Bi-LSTM), highway connections, attention mechanisms, and multi-scale CNN (Jiyun Zhou et al. 2018). We showed that PSSM and its combination with one-hot vectors achieve the best performance in protein secondary structure prediction. The best performing model was the CNN-BiLSTM architecture, which captures both the local and global sequence features essential for protein secondary structure. Our tool, called DeepPrime2Sec, provides the community with a specialized framework for protein secondary structure prediction covering different architectures. The BiLSTM-CRF architecture performs competitively with the other existing approaches in the literature, and the ensemble of the best performing models in DeepPrime2Sec marginally outperforms the existing methods. We also performed an error analysis on the most accurate model, based on the location of misclassified amino acids as well as the confusion matrix. Strikingly, misclassifications were significantly correlated with location at structural transitions. Such a correlation is most likely due to inaccurate assignment of the secondary structure at the boundaries in the ground truth (Y. Yang et al. 2016). Ignoring the boundary amino acids in the evaluation would increase the Q8 accuracy by an extra 20%, i.e., to 90.3%. Analysis of the confusion matrix furthermore indicates that similar secondary structures are frequently confused with one another (helices: H and G, as well as unstructured regions: S, T, and L), showing that the model can learn high-level information about the secondary structures.
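To make the boundary-based error analysis concrete, the following Python sketch computes per-residue accuracy with and without residues located at structural transitions. It is only an illustration under stated assumptions: the function names are hypothetical, labels are given as strings of Q8 symbols, and a residue counts as a "boundary" residue when its ground-truth label differs from that of either neighbor, which may not match the exact definition used in the analysis above.

```python
def is_boundary(labels, i):
    """True if residue i sits next to a change of ground-truth label (assumed definition)."""
    left = i > 0 and labels[i - 1] != labels[i]
    right = i + 1 < len(labels) and labels[i + 1] != labels[i]
    return left or right


def q8_accuracy(true_labels, pred_labels, skip_boundaries=False):
    """Fraction of correctly predicted residues, optionally ignoring boundary residues."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    kept = [i for i in range(len(true_labels))
            if not (skip_boundaries and is_boundary(true_labels, i))]
    if not kept:
        return float("nan")
    correct = sum(true_labels[i] == pred_labels[i] for i in kept)
    return correct / len(kept)


if __name__ == "__main__":
    # Hypothetical toy labels; the single error sits at a helix-to-loop transition.
    true_labels = "HHHHLLLEEE"
    pred_labels = "HHHLLLLEEE"
    print(q8_accuracy(true_labels, pred_labels))
    print(q8_accuracy(true_labels, pred_labels, skip_boundaries=True))
```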

Chapter 3 Language-agnostic processing of genomics/metagenomics

3.1 Introduction and chapter overview

Microbial communities exist on every accessible surface on earth and have important functions relevant to supporting, regulating, and in some cases causing unwanted conditions (e.g., diseases) in their hosts/environments, ranging from organismal environments, such as the human body, to ecological environments, such as soil and water (R. Martin et al. 2014). These communities typically consist of a variety of microorganisms, including eukaryotes, archaea, bacteria, and viruses. Due to differences in nutrient availability and environmental conditions, microbial communities from different environments have widely varying taxonomic structures and compositions (Costello et al. 2009; Pinto et al. 2012; Moran 2015; Sunagawa et al. 2015; Fierer 2017; Eck et al. 2017; Duvallet et al. 2017).

The human microbiota refers to all microorganisms living in close association with the human body. It is now widely believed that changes in our microbiota correlate with numerous diseases, raising the possibility that manipulation of these communities may be used to treat diseases. The microbiota (particularly the intestinal microbiota) is known to play important roles in healthy humans, including: (i) prevention of pathogen growth, (ii) education and regulation of the host immune system, and (iii) providing energy substrates to the host (Lynch and Pedersen 2016). Consequently, dysbiosis of the human microbiota can promote diseases, including asthma (Marsland et al. 2013; Arrieta et al. 2015), irritable bowel syndrome (Saulnier et al. 2011; I. Cho and Blaser 2012), Clostridium difficile infection (Cammarota et al. 2014), chronic periodontitis (Z. L. Deng et al. 2017; Jorth et al. 2014), cutaneous leishmaniasis (Gimblet et al. 2017), obesity (Turnbaugh et al. 2008; Ridaura et al. 2013), chronic kidney disease (Ramezani and Raj 2014), ulcerative colitis (Michail et al. 2012), and Crohn’s disease (Gevers et al. 2014; Pascal et al. 2017). For instance, the human microbiota appears to play a particularly important role in the development of Crohn’s disease, an inflammatory bowel disease (IBD), with a prevalence of approximately 40 per 100,000 and 200 per 100,000 in children and adults, respectively (Kappelman et al. 2007). The role of human microbiota in human health motivates developing methods for inferring relationships between microbial taxa or functions associated with certain host phenotypes. Similarly, environmental microbial communities also serve important functions, such as nutrient cycling (Gilbert and Neufeld 2014). For instance, the microbiota living in the ocean account for half of the primary production on Earth (Moran 2015). The soil microbiome surrounding the root of plants impacts plant fertility and growth (Chaparro et al. 2012). Such studies are called metagenomic studies, as they deal with the collective genomes of microorganisms from environmental samples for inferring the microbial diversity and certain characteristics of the environments of interest. Metagenomics is a relatively new area of research in microbiology and is becoming increasingly important (R. Martin et al. 2014).

¶ The content of this chapter is based on the following publications:

1. Asgari, E., Garakani, K., McHardy, A. C., & Mofrad, M. R. (2018). MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples. Bioinformatics, 34(13), i32-i42, https://doi.org/10.1093/bioinformatics/bty296.

2. Asgari, E., Münch, P. C., Lesker, T. R., McHardy, A. C., & Mofrad, M. R. (2018). DiTaxa: Nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection. Bioinformatics, bty954, https://doi.org/10.1093/bioinformatics/bty954.

The starting point of many metagenomic studies is either 16S rRNA gene amplicon or shotgun metagenome sequencing of environmental samples (Pollock et al. 2018). 16S rRNA gene sequencing has several disadvantages in comparison with shotgun metagenomics, such as its inability to resolve functions and, accordingly, functional variations within individual taxa (Poretsky et al. 2014; Ranjan et al. 2016; Cottier et al. 2018; Pollock et al. 2018). However, due to its low cost, 16S rRNA amplicon sequencing is still the most popular data type generated in microbiome studies (Hamady and R. Knight 2009; Pollock et al. 2018). The 16S rRNA gene, which is highly conserved across bacteria and archaea, includes both conserved regions, against which universal species-independent PCR primers can be directed, and nine hypervariable regions (V1-V9), which allow differential identification of taxon identities and relative abundances (Janda and Abbott 2007). After sequencing, the obtained data are usually processed with bioinformatics software, such as QIIME (Caporaso et al. 2010; Lawley and Tannock 2017), Mothur (Schloss et al. 2009), or Usearch (Robert C. Edgar et al. 2011), and clustered into groups of closely related sequences, referred to as Operational Taxonomic Units (OTUs).

Three main strategies for creating OTUs have been developed. In the de novo OTU clustering scheme, input sequences are aligned against one another and OTU clusters are created based on a user-specified percent identity cutoff (in practice mostly 97%) without comparisons to reference databases. The implementation of the de novo strategy is difficult to parallelize and is therefore limited to small-scale datasets. Variations of this method, such as sub-sample open-reference OTU picking (Rideout et al. 2014) or centroid-based greedy clustering approaches (W. Li and Godzik 2006), accelerate this process and enable its application to larger datasets. Alternatively, in closed-reference OTU clustering, input reads are aligned to a set of cluster centroids defined in a reference database (containing clusters of previously identified OTUs) and are reported as an OTU if they align at a given threshold. This strategy, however, will not report OTUs for novel taxa that are not part of the reference database. An advantage is the usually high quality of the taxonomic assignments in the reference database, which can be used for taxonomic assignment of the OTUs from the community of interest. Finally, the open-reference OTU clustering scheme combines de novo and closed-reference picking: input sequences are aligned against a reference database (such as Greengenes (DeSantis et al. 2006)), and sequences that fail to match the reference are subsequently clustered de novo in a serial process (Rideout et al. 2014). Individual algorithms for OTU clustering and for pre- and post-processing have been combined into pipelines such as mothur (Schloss et al. 2009), QIIME (Caporaso et al. 2010; Lawley and Tannock 2017), USEARCH (Robert C. Edgar et al. 2011), and LotuS (Hildebrand et al. 2014).
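The de novo, centroid-based greedy scheme can be illustrated with a toy Python sketch. This is not how any of the cited pipelines is implemented: the function names are hypothetical, and sequence identity is approximated here as one minus the edit distance divided by the longer sequence length, which is a simplifying assumption in place of a proper pairwise alignment.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def identity(a, b):
    """Approximate identity in [0, 1]; a stand-in for alignment-based identity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def greedy_otu_clustering(reads, threshold=0.97):
    """Assign each read to the first centroid within the threshold, else open a new cluster."""
    centroids, clusters = [], []
    for read in reads:
        for idx, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                clusters[idx].append(read)
                break
        else:
            # No centroid is close enough: this read becomes a new centroid.
            centroids.append(read)
            clusters.append([read])
    return clusters
```

Every read is compared against existing centroids, so the cost grows with the number of reads times the number of clusters, each comparison being quadratic in read length. This is the alignment burden that the OTU-free methods in this chapter aim to avoid.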

Although OTU clustering has simplified 16S rRNA processing by replacing the analysis of millions of reads with the analysis of only thousands of OTUs, it still has several disadvantages: OTUs do not necessarily represent meaningful taxonomic units, such as species, and sequencing errors may inflate diversity estimates by orders of magnitude (Kunin et al. 2010). To prevent diversity overestimates, OTU-based approaches require highly stringent quality control and relaxed clustering at <97% similarity. While this approach limits the inflation of OTUs by potential sequencing errors, it comes at the expense of taxonomic resolution and may combine organisms with distinct biological properties and capabilities into a single OTU. A further disadvantage is that OTU calling requires extensive sequence alignment efforts. All of the above-mentioned OTU-picking strategies involve sequence alignments either to the reference genomes or to the sample sequences, which is computationally expensive and cannot be easily extended to further samples. It was shown that OTUs were generally ecologically consistent across habitats, but observed OTU content can differ substantially between clustering methods (T. S. B. Schmidt et al. 2014). Since the number of obtained OTUs and their content depend on the pipeline and the parameter settings, reproducing the same analysis is difficult (Y. He et al. 2015). An alternative solution is the analysis of individual 16S rRNA gene sequences (Callahan et al. 2016; Amir et al. 2017; Nearing et al. 2018), which is computationally challenging, as each 16S rRNA sample may contain tens of thousands of sequences. The main focus of this chapter is developing OTU-free methods for processing 16S rRNA sequencing data.

Chapter overview

We begin this chapter with a genomic phenotype prediction problem setting and show that language-agnostic k-mer representations of microbial genomes can be more effective than expensive genomic sequence events in detecting the phenotype of interest. Next, we extend the setting to metagenomics. In Section §3.3, we introduce MicroPheno, a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing, based on sequence k-mer representations that benefit from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods, as well as classical approaches, were explored for predicting environments and host phenotypes. MicroPheno is the state-of-the-art approach for host phenotype prediction, outperforming costly OTU features. Although MicroPheno could outperform OTU features in phenotype prediction, short k-mers cannot be easily used as taxa-distinctive biomarkers, and OTU features remained the state of the art in biomarker detection (Segata, Izard, et al. 2011). In Section §3.4, we propose DiTaxa, an alignment- and reference-free, subsequence-based paradigm for processing 16S rRNA microbiome data for phenotype and biomarker detection. The main distinction of this approach from existing methods is that it replaces standard OTU clustering (Robert C Edgar 2013) or sequence-level analysis (Callahan et al. 2016) with segmenting 16S rRNA reads into the most frequent variable-length subsequences of a dataset. We compare the performance of DiTaxa to the state-of-the-art methods using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis, and inflammatory bowel diseases, as well as a synthetic benchmark dataset. We show that DiTaxa improved the state-of-the-art performance in biomarker detection over 16S rRNA data while performing competitively with the k-mer based state-of-the-art approach in phenotype prediction. Finally, in Section §3.5, we conclude with a summary of the contributions of this dissertation in metagenomics.
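The idea of representing a 16S rRNA sample by the k-mer profile of a shallow sub-sample can be sketched as follows. This is a simplified illustration rather than the MicroPheno implementation: the function names are hypothetical, reads are sampled with replacement, and the agreement between sub-samples and the full sample is summarized by a mean L1 distance, which is an assumption standing in for whatever divergence measure the bootstrapping framework actually uses.

```python
from collections import Counter
from itertools import product
import random


def kmer_profile(reads, k):
    """Normalized k-mer frequencies over the ACGT alphabet, in lexicographic order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # k-mers containing other symbols (e.g. N) are ignored in the normalization.
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def mean_subsample_distance(reads, sample_size, k, n_resamples=10, seed=0):
    """Mean L1 distance between sub-sample profiles and the full-sample profile."""
    rng = random.Random(seed)
    full = kmer_profile(reads, k)
    distances = []
    for _ in range(n_resamples):
        sample = [rng.choice(reads) for _ in range(sample_size)]
        profile = kmer_profile(sample, k)
        distances.append(sum(abs(a - b) for a, b in zip(full, profile)))
    return sum(distances) / n_resamples
```

Running `mean_subsample_distance` for increasing values of `sample_size` gives a rough picture of how quickly a sub-sample's k-mer profile approaches that of the full sample, which is the kind of question the bootstrapping framework addresses.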

3.2 K-mer based representation for microbial genome

In document Life Language Processing: Deep Learning-based Language-agnostic Processing of Proteomics, Genomics/Metagenomics, and Human Languages (Page 105-112)