Top PDF Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

Members of such superfamilies are diverse in their overall reactions yet share a common ancestor and some conserved active site features associated with conserved functional attributes such as a partial reaction or molecular subgraph that all substrates or products may have in common. Thus, despite their different functions, members of these superfamilies often “look alike” which can make them particularly prone to misannotation. To address this complexity and enable reliable transfer of functional features to unknowns only for those members for which we have sufficient functional information, we subdivide superfamily members into subgroups using sequence information (and where available, structural information), and lastly into families, defined as sets of enzymes known to catalyse the same reaction using the same mechanistic strategy and catalytic machinery. At each level of the hierarchy, there are conserved chemical capabilities, which include one or more of the conserved key residues that are responsible for the catalysed function; the small molecule subgraph that all the substrates (or products) may include and any conserved partial reactions. A subgroup is essentially created by observing a similarity threshold at which all members of the subgroup have more in common with one another than they do with members of another subgroup. (Thresholds derived from similarity calculations can use

Investigating “Gene Ontology”- based semantic similarity in the context of functional genomics

Over the last half-century, there has been a tremendous evolution in the way that gene function is studied. In the early days of molecular genetics, the elucidation of gene function was primarily reliant on characterising mutant phenotypes, with studies targeting individual genes, or a small number of related genes, at a time. With the advent of DNA sequencing techniques, new approaches for evaluating and relating gene function were required as the number of known genes grew so quickly that manual study alone was no longer practical. From Sanger sequencing [Sanger et al., 1977b], the approach by which the first full DNA-based genome, bacteriophage ΦX174 [Sanger et al., 1977a], was sequenced, to modern day high-throughput or “next-generation” sequencing strategies (see Shendure and Ji [2008] for a review), which allow the sequencing of whole genomes in a matter of days, the range and speed of DNA sequencing is ever increasing, and with it, the amount of genomic data available. There are now over a hundred sequenced eukaryotic genomes, around half of which are vertebrate genomes [Flicek et al., 2011], while the full genomes of over a thousand prokaryotes are available [Lagesen et al., 2010].

Contextual analysis of variation and quality in human-curated gene ontology annotations

Aim 2. Assessment of annotation quality metrics. Subsequent to the annotation process analysis in Aim 1, the curators will be asked to perform their normal work processes on a predefined set of scientific articles for the purpose of evaluating the validity of a set of annotation quality metrics. The set of articles will be divided in equal parts among two or more equal groups of curators (quantities depend upon enrollment). Each individual will annotate a subset of his/her group’s documents, with a predefined amount of overlap in shared documents. For example, assume 6 total curators who are randomly assigned to two groups of 3, and a total of 6 unique articles. With a group of 3 curators and a subset of 3 of the 6 total articles, each curator will annotate 2 articles, one of which a second curator in the group will also annotate. Each article will thus be annotated by 2 curators per group. The pairs of curators with shared articles will subsequently rationalize (if different) their individual annotations, leading to a single ‘gold standard’ annotation for that document. Following this step, each group will be assigned the 3 papers from the other group, and will repeat the task without knowledge of the other groups’ annotations. This design provides for 4x coverage of each article, and an ability to compare individual annotations against gold standards devised by independent groups, without an external standard process. The two rounds of annotation can be visualized in the following example (the actual experiment will have randomized group and article assignments and may vary in size):
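
The two-round design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the curator and article labels are hypothetical, and the round-robin rule (each curator takes two consecutive articles, wrapping around) is one simple way to get two curators per article within a group, not necessarily the randomized assignment the study would use.

```python
def assign_round_robin(curators, articles, per_curator=2):
    """Give each curator `per_curator` consecutive articles, wrapping around."""
    if len(curators) != len(articles):
        raise ValueError("this sketch assumes as many articles as curators per group")
    n = len(articles)
    return {
        curator: [articles[(i + k) % n] for k in range(per_curator)]
        for i, curator in enumerate(curators)
    }


def coverage(assignment):
    """Count how many curators annotate each article."""
    counts = {}
    for assigned in assignment.values():
        for article in assigned:
            counts[article] = counts.get(article, 0) + 1
    return counts


# Hypothetical labels: two groups of 3 curators, 6 unique articles.
group_a, group_b = ["A1", "A2", "A3"], ["B1", "B2", "B3"]
set_1, set_2 = ["P1", "P2", "P3"], ["P4", "P5", "P6"]

# Round 1: each group annotates its own article subset.
round_1 = {**assign_round_robin(group_a, set_1), **assign_round_robin(group_b, set_2)}
# Round 2: the groups swap article subsets.
round_2 = {**assign_round_robin(group_a, set_2), **assign_round_robin(group_b, set_1)}

# Total coverage per article across both rounds.
total = coverage(round_1)
for article, n in coverage(round_2).items():
    total[article] = total.get(article, 0) + n
print(total)
```

Under these assumptions every article is annotated by two curators in each round, which gives the 4x coverage described above.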

Evidence-based gene models for structural and functional annotations of the oil palm genome

Integration of Fgenesh++ and Seqping gene predictions. To increase the accuracy of annotation, predictions independently made by the Seqping and Fgenesh++ pipelines were combined into a unified prediction set. All predicted amino acid sequences were compared to protein sequences in the NR database using BLAST (E-value cutoff: 1E-10). ORF predictions with <300 nucleotides were excluded. Predicted genes from both pipelines in the same strand were considered overlapping if the shared length was above the threshold fraction of the shorter gene length. A co-located group of genes on the same strand was considered to belong to the same locus if every gene in the group overlapped at least one other member of the same group (single linkage approach) at the selected overlap threshold. Different overlap thresholds, from 60% to 95% in 5% increments, were tested to determine the best threshold value, simultaneously maximizing the annotation accuracy and minimizing the number of single-isoform loci. Protein domains were predicted using PFAM-A [36, 37] (release 27.0) and PfamScan ver. 1.5. The coding sequences (CDSs) were also compared to NR plant sequences from RefSeq (release 67), using the phmmer function from the HMMER-3.0 package [38, 39]. To find the representative gene model and determine its function for each locus, we selected the lowest E-value gene model in each locus and the function of its RefSeq match. We excluded hits with E-values >1E-10, as well as proteins that contained the words “predicted”, “putative”, “hypothetical”, “unnamed”, or “uncharacterized” in their descriptions, keeping only high-quality loci and their corresponding isoforms. Loci without a RefSeq match were discarded. The CDS in each locus with the best match to the RefSeq database of all plant species was selected as the best representative CDS for the locus.
Gene Ontology (GO) annotations were assigned to the palm genes, using the best NCBI BLASTP hit to Oryza sativa sequences from the MSU rice database [40] at an E-value cutoff of 1E-10.
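
The single-linkage grouping of overlapping gene predictions into loci can be illustrated with a small union-find sketch. It is not the pipeline's actual code: the gene records, the field names, the closed-interval coordinates and the default threshold of 0.8 are all assumptions made for the example, and every gene is assumed to lie on the same scaffold.

```python
from collections import defaultdict


def overlap_fraction(a, b):
    """Shared length divided by the length of the shorter gene (closed intervals)."""
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"]) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a["end"] - a["start"], b["end"] - b["start"]) + 1
    return shared / shorter


def group_into_loci(genes, threshold=0.8):
    """Single linkage: a gene joins a locus if it overlaps any member on its strand."""
    parent = list(range(len(genes)))

    def find(i):
        # Follow parent links to the set representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            same_strand = genes[i]["strand"] == genes[j]["strand"]
            if same_strand and overlap_fraction(genes[i], genes[j]) > threshold:
                parent[find(i)] = find(j)

    loci = defaultdict(list)
    for i, gene in enumerate(genes):
        loci[find(i)].append(gene["id"])
    return sorted(sorted(members) for members in loci.values())


# Hypothetical predictions from two pipelines on one scaffold.
genes = [
    {"id": "g1", "strand": "+", "start": 100, "end": 1000},
    {"id": "g2", "strand": "+", "start": 150, "end": 950},
    {"id": "g3", "strand": "+", "start": 900, "end": 2000},
    {"id": "g4", "strand": "-", "start": 120, "end": 980},
]
print(group_into_loci(genes))
```

Lowering the threshold merges more predictions into each locus, which is the trade-off the threshold scan from 60% to 95% is meant to explore.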

CATH FunFHMMer web server: protein functional annotations using functional family assignments

The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced present new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.

CATH: comprehensive structural and functional annotations for genome sequences

CATH-Gene3D provides domain predictions and superfamily assignments for protein sequences in UniProt. All domains (both ‘real’ and predicted) within a CATH superfamily are then classified into FunFams using a hierarchical agglomerative clustering algorithm. The algorithm produces a tree of clusters and this was originally partitioned using a generic threshold (3). However, this partitioning was later improved by using functional annotation data from the Gene Ontology (GO) (4) to ensure functional coherence (5). Significant biases in GO annotations have recently been identified (6), suggesting that GO annotation data can only provide a partial picture of the function of a protein, which will affect functional classification.

Conceptualization of molecular findings by mining gene annotations

The Gene Ontology (GO) [8], a controlled vocabulary consisting of molecular biology terms (concepts) related to genes, is the most widely used bio-ontology for representing the information derived from genome-scale experiments, particularly about aspects of the biological process (BP). Currently, a common approach to finding a functional theme from a gene list is to assess whether any GO terms are enriched among the annotations associated with a gene list [9-12]. However, annotations by the GO Consortium are usually highly specific, and it is not uncommon to find a set of specific GO terms enriched in a gene list where each term covers only a small number of genes, thus failing to reveal the major biological process. Aware of the need for more general concepts, the GO Consortium provides a set of general GO terms, referred to as GO slim [8], that represent high-level biological concepts. There are also software tools that map/associate genes to concepts in the GO slim subsets [13-15]. As will be shown, these terms tend to be too general; more importantly, this small set of predefined GO terms may not meet the need of revealing functional themes in a case-specific fashion with balanced generality and specificity. Besides the GO enrichment analysis, another widely used approach for finding functional themes of a gene list is to assess whether members of certain predefined pathways, or gene signatures from databases such as KEGG [16] pathways and MSigDB [17], are enriched. However, such representations lack the ontological structure to support
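
As an illustration of the enrichment idea mentioned in this excerpt, a GO term is often tested with a one-sided hypergeometric test. The sketch below assumes that test; the excerpt does not say which statistic the cited tools use, and the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement.

    population: genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     genes in the list of interest
    hits:       genes in the list annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 carry the term, 3 of a 5-gene list carry it.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=3)
print(round(p_value, 4))
```

In practice the test is repeated for every GO term, so the resulting p-values would normally be corrected for multiple testing.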

Predicting gene ontology annotations of orphan GWAS genes using protein-protein interactions

the semantic similarity between the nodes in the ontology [22]. The information in the neighborhood of proteins being tested for a target class was enhanced using the substantial semantic similarity that exists between the target class and its several neighbors. The results from experiments carried out on an array of datasets showed that the incorporation of the GO DAG structure leads to more accurate predictions. Other methods that use semantic similarity measures include function prediction algorithms proposed by Tao et al. [23] and Tedder et al. [24]. The algorithm by Tao et al. uses an information theory-based semantic similarity (ITSS) approach in combination with the GO DAG structure to predict functions of sparsely annotated GO terms. A K-nearest-neighbor algorithm along with the ITSS measure was used to assign new edges to the concept nodes in the sparse ontology networks. Precision and recall of 90% and 36% respectively for sparsely annotated networks were achieved using 10-fold cross-validation. In an algorithm called PAGODA (Protein Assignment by Gene Ontology Data Associations), a semantic similarity measure is used to group genes into functional clusters, and then a Bayesian classifier is employed for term enrichment by assessing whether a pair of interacting genes belongs to a functional cluster [24]. In this study, eight different Plasmodium falciparum datasets were studied. Interaction data for P. falciparum was downloaded from the IntAct database. The method was evaluated on all the genes that have GO annotation using a leave-one-out cross validation for each GO term.

Annotation of gene product function from high-throughput studies using the Gene Ontology.

The GOC provides an excellent forum for annotation groups to review and harmonize curation practices. By reviewing GO annotations derived from high-throughput studies, the GOC has provided a framework to aid annotation consistency and allow GO curators to confidently annotate papers containing valuable data from high-throughput studies without the need for extra training. The major difference in how a GO curator approaches high-throughput versus low-throughput publications is the investment of time in the decision to curate: curators should examine all aspects of the workflow, including experimental design, controls, data handling, validation and statistics. From this, the curator must decide whether the data is amenable to functional annotation and whether the statistical measures can be used to extract a higher confidence subset of gene products for annotation. Nevertheless, some high-throughput studies employ very complex methodology and statistics, which make the confidence level difficult to establish. In these cases, curators are advised to directly contact the authors or experts within the research community for advice. Indeed, experts in the fields of proteomics and RNAi have made valuable contributions to the high-throughput annotation guidelines. Within the GOC, discussion and documentation of challenging high-throughput papers is encouraged and as curators review and curate more high-throughput publications, they will further contribute to the GOC guidelines. This is illustrated by the work that the Functional Gene Annotation team at University College London is currently undertaking, working with leaders in the field to develop a common set of standards for the annotation of extracellular matrix components from high-throughput proteomics studies.

Fuzzy measures on the Gene Ontology for gene product similarity

Abstract: One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three protein families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
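
For orientation, the two traditional baselines named at the end of this abstract, pairwise average and pairwise maximum, can be sketched as follows. This does not implement the FMS or Choquet measures; the term identifiers and the similarity table are hypothetical placeholders standing in for a real term-to-term similarity measure.

```python
def pairwise_scores(terms_a, terms_b, term_sim):
    """Similarity of every term of one gene product against every term of the other."""
    return [term_sim(s, t) for s in terms_a for t in terms_b]


def pairwise_average(terms_a, terms_b, term_sim):
    scores = pairwise_scores(terms_a, terms_b, term_sim)
    return sum(scores) / len(scores) if scores else 0.0


def pairwise_maximum(terms_a, terms_b, term_sim):
    return max(pairwise_scores(terms_a, terms_b, term_sim), default=0.0)


# Hypothetical term-to-term similarities (symmetric, so keys are unordered pairs).
TOY_TABLE = {
    frozenset(("GO:A", "GO:B")): 0.5,
    frozenset(("GO:A", "GO:C")): 0.25,
    frozenset(("GO:B", "GO:C")): 0.25,
}


def toy_sim(s, t):
    """Identical terms score 1.0; unknown pairs score 0.0."""
    if s == t:
        return 1.0
    return TOY_TABLE.get(frozenset((s, t)), 0.0)


product_1 = ["GO:A", "GO:B"]
product_2 = ["GO:B", "GO:C"]
print(pairwise_average(product_1, product_2, toy_sim))
print(pairwise_maximum(product_1, product_2, toy_sim))
```

Both baselines look at term pairs in isolation, which is the limitation the abstract contrasts with measures that consider the complete annotation sets.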

Disjunctive shared information between ontology concepts: application to Gene Ontology

A commonly used approach for evaluating semantic similarity measures in biomedical ontologies is based on comparing their correlation with structural similarity. This correlation may not always be accurate, but this approach represents a comprehensive analysis, since structural similarity is present everywhere in Molecular Biology. For example, even functional classifications, like PFAM, rely mostly on structural similarity methods [21]. Therefore, this evaluation assumes that on average the results obtained from a large number of examples should be close to their real value, even if some exceptions exist. A systematic difference between semantic and structural similarity would undermine this assumption, but this is not expected to exist under the assumed correlation between protein function and its structure [22].

Gene Ontology Function prediction in Mollicutes using Protein-Protein Association Networks

It has been shown herein that our procedure outperforms other similar algorithms to predict GO-based annotations using Protein-Protein networks, with equal or higher overall precision from a significantly broader range of GO terms. The incorporation of other approximations, such as the detection of functional modules conserved between species and the exploitation of orthology, predicts function with higher precision and recall in two ontologies of the GO database. As compared to other GO search engines, our algorithm is capable of finding GO terms with high semantic similarity values due to using orthology information between proteins predicted inside functional modules conserved between species, and it has also been shown to recapitulate “known” future GO annotations artificially removed from the dataset using five-fold cross-over validation, with high precision and recall.

A drug target slim: using gene ontology and gene ontology annotations to navigate protein-ligand target space in ChEMBL

A GO slim is a high-level subset of the GO created by collapsing specific terms and ‘mapping’ them to their higher level parent terms using the parent–child hierarchies inherent in the GO. GO slimming allows for a representation of biological information by using high level terms that provide a broad overview of the biology [8]. GO slims are typically generated for specific organisms or particular areas of scientific interest and have been used to aid visualisation, exploration and summarization of GO functional data [9, 10]. We have created a ‘ChEMBL protein target slim’ to allow users to easily access the biological information for targets with GO annotation.
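
The ‘mapping’ step can be pictured as walking up the parent–child hierarchy until slim terms are reached. The sketch below assumes a toy hierarchy with simplified, hypothetical term labels; it is not the real GO graph and not the procedure used to build the ChEMBL slim.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following child-to-parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_slim(term, parents, slim_terms):
    """Slim terms that are the term itself or one of its ancestors."""
    return ({term} | ancestors(term, parents)) & slim_terms


# Toy child -> parents table (hypothetical labels, not actual GO relationships).
toy_parents = {
    "specific kinase term": ["kinase term"],
    "kinase term": ["catalytic term"],
    "receptor term": ["signalling term", "binding term"],
    "catalytic term": ["root"],
    "signalling term": ["root"],
    "binding term": ["root"],
}
toy_slim = {"catalytic term", "signalling term", "binding term"}

print(map_to_slim("specific kinase term", toy_parents, toy_slim))
print(sorted(map_to_slim("receptor term", toy_parents, toy_slim)))
```

Because a term can have more than one parent, a single specific annotation may map to several slim terms, as the second call shows.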

Logical Gene Ontology Annotations (GOAL): exploring gene ontology annotations with OWL

Whilst highly useful, many of the GO-orientated tools fail to exploit the full potential of the GO’s representation for reasoning and querying over gene annotations. In particular, most of the GO tools that we investigated do not facilitate rich querying that takes into account the semantics of the GO. For example, it was difficult to ask for all proteins that are located in a membrane, or part of a membrane, that are receptor proteins involved in a metabolic process. Extending the queries to include associations of gene product functional attributes, location with phenotype and disease phenomena, such as linking together proteolysis, insulin secretion, plasma membrane, increased glucose concentration and diabetes, is not yet possible. To answer such a query correctly, some form of reasoning over the ontologies is required. The ability to perform such rich queries would enable more precise and flexible exploration of the annotations with GO, MPO and HDO, as well as other ontologies used to annotate gene products.

A method for increasing expressivity of Gene Ontology annotations using a compositional approach

As extension data becomes more widely available, querying for functional information can become more sophisticated. Users of the GO will be able to query the annotations for a wealth of specific information, including connections between a gene product and other entities and processes, or the locations (at the subcellular level as well as cell and tissue types) where a gene product performs specific roles. For example, a user could query for all targets of a particular protein kinase, or compose a more specific query to find all the proteins that are involved in blood vessel remodeling during retina vasculature development in the camera-type eye. Annotation extensions capturing effector-target relationships at the cellular level will provide a rich source of directional information for regulatory network reconstruction. For instance, the has_input and has_direct_input relations can be used to connect signal transducing components of signaling pathways or to link DNA binding regulatory transcription factors with their specific target genes. The inherent directionality encoded in the extension can also be used to increase the information content of existing interaction-based networks. Annotation extensions can also assist with improving the interpretations of pathway analysis. Currently pathway analysis, which uses methods such as term enrichment and pathway topology, is hampered by the lack of functional annotation with associated contextual aspects such as cell or tissue type or dependencies on other gene products or substances [28]. GO has the potential to enable great advances in pathway analysis by providing this contextual information in annotation extensions.

Update on human genome completion and annotations: Gene nomenclature

Homologous regions of 15–25 per cent of nucleotides or amino acids can be detected by the various alignment programs, and denote divergence from an ancestral gene. A small almost-invariant DNA motif or protein domain (functioning as an enzyme active-site, cofactor docking site or ligand-binding site) is further evidence of divergence from an ancestral gene. One of the earliest examples of this nomenclature approach for homologous genes was the cytochrome P450 (CYP) gene superfamily, in which it was agreed upon that approximately 40 per cent or more amino acid similarity allows two members to be placed in the same family and about 55 per cent or more similarity allows two members to be assigned to the same subfamily [11,12]. These cut-off values follow the original recommendations of Margaret Dayhoff [13]. Several dozen additional gene superfamilies and large gene families have since followed this same format [14].
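
The two cut-offs quoted for the CYP superfamily amount to a simple decision rule, sketched below. The function name and return labels are hypothetical, and the excerpt describes the thresholds as approximate, so real nomenclature decisions would not rest on a hard numeric test alone.

```python
def cyp_relationship(percent_similarity, family_cutoff=40.0, subfamily_cutoff=55.0):
    """Classify a pair of CYP proteins by amino acid similarity (per cent)."""
    if not 0.0 <= percent_similarity <= 100.0:
        raise ValueError("percent_similarity must be between 0 and 100")
    if percent_similarity >= subfamily_cutoff:
        return "same subfamily"
    if percent_similarity >= family_cutoff:
        return "same family"
    return "different families"


for value in (62.0, 47.5, 31.0):
    print(value, cyp_relationship(value))
```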

Gene Ontology Analysis of 3D Microarray Gene Expression Data using Hybrid PSO Optimization

The analysis of 3D time-series gene expression data plays a vital role in identifying the most valuable genes along the Gene-Sample-Time (GST) or Gene-Condition-Time (GCT) dimensions (Ziv Bar-Joseph, 2002). Analysing 3D microarray data is a computational challenge because of the characteristics of these data. Many clustering approaches have been proposed in the literature for the analysis of gene expression. Triclustering techniques are the most recent approach to the simultaneous clustering of 3D microarray data. A tricluster is a subset of genes that have similar patterns over a subset of conditions during a subset of time points. Triclustering is treated as an optimization problem whose main objective is to find a quality tricluster with low mean squared residue and high volume. Particle Swarm Optimization (PSO), a population-based optimization approach, is therefore well suited to this problem. The whole population is maintained throughout the procedure. Potential solutions in PSO are called particles, and each particle is assigned a randomized velocity. Each particle seeks the optimal solution in the solution space. Two main features, 1) convergence speed and 2) relative simplicity, make PSO suitable for solving the optimization problem. The main aim of this work is to find a quality tricluster using Hybrid PSO optimization.

Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger

restricting the search to retrieve only English medical articles discussing human genetics studies in psychiatry and immune related disorders. Table 1 shows the dataset statistics in terms of article and word counts. The searches have been adapted to ensure appropriate literature coverage. For example, whilst including immun* in the abstract search picks up papers on many diseases such as psoriasis, the same approach using the term psych* is not as effective. In our results, we directly compare the Immune and Psychiatric subcorpora only, but the Reference dataset statistics are included here to show the relative size of the two subcorpora. We will also be employing the Reference corpus in other experiments and to check vocabulary coverage of the existing semantic lexicon. We chose immune and psychiatric genetics corpora as examples that would be very different from each other allowing us to test the utility of the tools. The selected domains fall within the fourth author’s research expertise and this has helped in appropriately interpreting the findings (Pouget et al., 2016).

Incorporating Functional Annotations for Fine-Mapping Causal Variants in a Bayesian Framework Using Summary Statistics

Our method shares similarities with some existing methods, but also differs significantly from them. The most significant difference is that our method is a general framework to systematically incorporate functional annotations, and it uses the ENET in the penalized model with CV for parameter selection. In principle, any method that can output PIPs which take into account the prior probability of each variant being causal can be used within this framework. Compared to fgwas (Pickrell 2014), our method results in a more robust way to automatically select sparse annotations; it also allows multiple causal variants in each locus where fgwas assumes no more than one. The performance of fgwas is similar to that of CAVIARBF_L1_CV, except that CAVIARBF_L1_CV does not show decreased performance under the null annotations. There are also other differences: fgwas only supports binary annotations or a distance model using integers, while CAVIARBF can use any numeric annotations (binary or continuous). fgwas uses a penalized likelihood in CV for the testing fold while CAVIARBF uses the likelihood directly. Compared to the top5t or top10t strategy recommended by PAINTOR (Kichaev et al. 2014), the proposed method does not have the dilemma of choosing the significance threshold: if the P-value threshold is too stringent, we may lose informative annotations; if it is too liberal, we may include too many noninformative annotations. Fitting the model with top-ranked annotations without penalization may also result in overfitting (Pickrell 2014). FM-QTL (Wen et al. 2015) uses the full genotype data, assumes a small number of annotations and uses the same maximum likelihood-based method to incorporate annotations as in PAINTOR. iBRI (Quintana and Conti 2013) uses an L2 penalty with a fixed parameter, which may not be optimal. For PIP calculation, both FM-QTL and iBRI use an MCMC-based approach, which may be faster when there are a relatively larger number of causal variants in a large region. Therefore, an alternative is to use MCMC to perform genome-wide association analysis and fine mapping simultaneously, such as in Zhu and Stephens (2016). There are also other sampling-based methods showing highly reduced computing cost (Benner et al. 2016).
