Pathway/Genome Databases: Concepts and Software Tools

(1)

Pathway/Genome Databases: Concepts and Software Tools

Peter D. Karp, Ph.D.

Bioinformatics Research Group
SRI International

[email protected]

http://www.ai.sri.com/pkarp/
http://BioCyc.org/

(2)

Overview

• Pathway/genome databases
• Pathway Tools software
• EcoCyc and MetaCyc

(3)

Larger than Minds can Grasp?

• Example: E. coli genetic network
  • Control by 97 transcription factors of 1174 genes in 630 transcription units
• Example: E. coli metabolic network
  • 160 pathways involving 744 reactions and 791 substrates
• Partition theories across multiple minds
• Rely on the printed word

(4)

Limitations

• Cannot effectively
  • Evaluate them for internal consistency
  • Evaluate them for consistency with new data: microarrays
  • Refine them with respect to new data
  • Integrate across them to produce system understanding
• They are too large and complex
• The printed word cannot be manipulated

(5)

Biological Knowledge Bases

• Store biological knowledge and theories in computers in a declarative form
  • Amenable to computational analysis and generative user interfaces
• Accepted to store data in computers, but not knowledge
  • Refined, interpreted, consensus views
• Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
• Such knowledge bases are an integral part of the scientific enterprise

(6)

Pathway/Genome Databases

• Layer functional information above the genome
• Rich ontology to encode biological information with high fidelity
  • Chromosomes, genes, operons, gene products, reactions, pathways
• Curated by experts for that organism

(7)

[Figure: contents of a Pathway/Genome Database for a CELL. Labels: Chromosomes, Plasmids; Genes; Operons, Promoters; Proteins; Reactions; Pathways; Compounds]

(8)

Pathway Tools Software

• PathoLogic
  • Prediction of metabolic network from genome
  • Computational creation of new Pathway/Genome Databases
• Pathway/Genome Editors
  • Distributed curation of genome annotations
  • Distributed object database system
  • Interactive editing tools
• Pathway/Genome Navigator
  • WWW publishing of PGDBs
  • Graphic depictions of pathways, chromosomes, operons
  • Analysis operations
    • Pathway visualization of gene-expression data
    • Global comparisons of metabolic networks

(9)

Sequence Project Workflow

[Figure: sequencing project workflow. Boxes: Raw Sequence; Phred; Phrap; CONSED; BLAST, BLOCKS; GeneMark/Glimmer; PathoLogic; P/G Navigator; P/G Editors; WWW Publishing; Analyses]

(10)

Pathway/Genome DBs

Literature-based Datasets:
• MetaCyc
• Escherichia coli (EcoCyc)

Computationally Derived Datasets:
• Agrobacterium tumefaciens
• Caulobacter crescentus
• Chlamydia trachomatis
• Bacillus subtilis
• Helicobacter pylori
• Haemophilus influenzae
• Mycobacterium tuberculosis
• Mycoplasma pneumoniae
• Pseudomonas aeruginosa
• Saccharomyces cerevisiae
• Treponema pallidum

http://BioCyc.org/

(11)

EcoCyc Project Overview

• E. coli Encyclopedia
  • Model-Organism Database for E. coli
  • Tracks the evolving annotation of the E. coli genome
  • Over 3500 literature citations
• Collaborative development via Internet
  • Karp (SRI) -- Bioinformatics architect
  • Riley (MBL) -- Metabolic pathways, signal transduction
  • Saier (UCSD) and Paulsen (TIGR) -- Transport
  • Collado (UNAM) -- Regulation of gene expression
• Ontology: 1000 biological classes

(12)

Pathway/Genome Navigator

Genes: 4,393
Proteins: 4,273
Reactions: 2,760
Pathways: 165
Compounds: 774
Transcription Units: 684
Factors: 108
Enzymes: 914
Transporters: 162
Promoters: 781
TransFac Sites: 910
Citations: 3,508

http://BioCyc.org/

(13)

EcoCyc Pathways

• Biosynthesis of amino acids, purines, pyrimidines, fatty acids, cofactors (heme, biotin, folic acid, etc.)
• Catabolism of fatty acids, D-glucuronate, L-alanine, L-arabinose, fucose, galactonate, galactose, glucose, mannose, ribose, xylose
• Entner-Doudoroff pathway, TCA cycle, fermentation, gluconeogenesis, glycerol metabolism, glycolysis, glyoxylate cycle, pentose phosphate pathway

(14)

Schema

• Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
• When writing complex Lisp queries to PGDBs, those queries must name classes and slots within the schema
• A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity

(15)

Web of Relationships for One Enzyme

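[Figure: web of objects for succinate dehydrogenase. Genes: sdhA, sdhB, sdhC, sdhD. Gene products: Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2. Enzyme: Succinate dehydrogenase. Enzymatic-reaction. Reaction: Succinate + FAD = fumarate + FADH2. Pathway: TCA Cycle]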

(16)

Frames

• Entities with which facts are associated
• Kinds of frames:
  • Classes: Genes, Pathways, Biosynthetic Pathways
  • Instances (objects): trpA, TCA cycle
• Classes:
  • Superclass(es)
  • Subclass(es)
  • Instance(s)
• A symbolic frame name (id, key) uniquely identifies each frame

(17)

Slots

• Encode attributes/properties of a frame
  • Integer, real number, string
• Represent relationships between frames
  • The value of a slot is the identifier of another frame
• Every slot is described by a “slot frame” in a KB that defines meta information about that slot
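
The frame and slot ideas on the two slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Frame` class, the `SLOT_FRAMES` table, the helper functions and the frame id `TrpA-polypeptide` are hypothetical and are not the Pathway Tools schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A class or instance frame, identified by a unique symbolic name."""
    name: str
    is_class: bool = False
    parents: list = field(default_factory=list)  # superclasses, or classes of an instance
    slots: dict = field(default_factory=dict)    # slot name -> list of values

KB = {}  # hypothetical knowledge base: frame id -> Frame

def add(frame):
    """Register a frame, enforcing that frame ids are unique."""
    if frame.name in KB:
        raise ValueError(f"duplicate frame id: {frame.name}")
    KB[frame.name] = frame
    return frame

# Slot frames hold meta information about each slot (here only the value type).
SLOT_FRAMES = {
    "Left-End-Position": {"value_type": int},
    "Common-Name": {"value_type": str},
    "Product": {"value_type": "frame"},  # value must be the id of another frame
}

def put_slot_value(frame_id, slot, value):
    """Store a value after checking it against the slot frame."""
    expected = SLOT_FRAMES[slot]["value_type"]
    if expected == "frame":
        if value not in KB:
            raise ValueError(f"{value!r} is not a known frame id")
    elif not isinstance(value, expected):
        raise TypeError(f"{slot} expects {expected.__name__}")
    KB[frame_id].slots.setdefault(slot, []).append(value)

def all_classes(frame_id):
    """Return every class a frame belongs to, following superclass links."""
    seen, stack = [], list(KB[frame_id].parents)
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.append(cls)
            stack.extend(KB[cls].parents)
    return seen

add(Frame("Genes", is_class=True))
add(Frame("Proteins", is_class=True))
add(Frame("Polypeptides", is_class=True, parents=["Proteins"]))
add(Frame("trpA", parents=["Genes"]))
add(Frame("TrpA-polypeptide", parents=["Polypeptides"]))  # hypothetical frame id

put_slot_value("trpA", "Common-Name", "trpA")
put_slot_value("trpA", "Product", "TrpA-polypeptide")
```

Classes and instances share one representation here, so both can be queried and edited as data.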

(18)

Slot Links

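[Figure: the succinate dehydrogenase object web with its slot links labeled. Objects: genes sdhA, sdhB, sdhC, sdhD; gene products Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; TCA Cycle. Slot links, in order from gene to pathway: product, component-of, catalyzes, reaction, in-pathway]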
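
Following the slot links named on this slide from a gene to its pathway might look like the sketch below. The slot names come from the slide; the dictionary layout, the `follow` helper and the frame ids `sdh-enzrxn` and `sdh-rxn` are assumptions made for illustration.

```python
# Frames as plain dicts: frame id -> {slot name: [values]}.
# "sdh-enzrxn" and "sdh-rxn" are made-up ids for the two unnamed objects.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate dehydrogenase"]},
    "Succinate dehydrogenase": {"catalyzes": ["sdh-enzrxn"]},
    "sdh-enzrxn": {"reaction": ["sdh-rxn"]},
    "sdh-rxn": {"in-pathway": ["TCA Cycle"]},
    "TCA Cycle": {},
}

def follow(start, slot_path):
    """Follow a chain of slots from one frame; return the set of frames reached."""
    current = {start}
    for slot in slot_path:
        current = {value
                   for frame_id in current
                   for value in KB.get(frame_id, {}).get(slot, [])}
    return current

GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]
pathways_of_sdhD = follow("sdhD", GENE_TO_PATHWAY)
```

Because every slot value is just another frame id, a query is a walk over the web of objects; any of the four genes reaches the same pathway.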

(19)

Representation of Function

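[Figure: the succinate dehydrogenase object web annotated with attributes. Objects: genes sdhA, sdhB, sdhC, sdhD; gene products Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; TCA Cycle. Attributes shown: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]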

(20)

Monofunctional Monomer

[Figure: objects shown: one Gene, one Monomer, one Enzymatic-reaction, one Reaction, one Pathway]

(21)

Bifunctional Monomer

[Figure: objects shown: one Gene, one Monomer, two Enzymatic-reactions, two Reactions, one Pathway]

(22)

Monofunctional Multimer

[Figure: objects shown: four Genes, four Monomers, one Multimer, one Enzymatic-reaction, one Reaction, one Pathway]

(23)

Pathway and Substrates

[Figure: a Pathway with four Reactions; substrates Reactant-1, Reactant-2, Product-1, Product-2; slot links in-pathway, left, right]

(24)

Transcriptional Regulation

[Figure: transcriptional regulation of the trp operon. Objects shown: site001, pro001, trpL, trpE, trpD, trpC, trpB, trpA, trpLEDCBA, RpoSig70, TrpR*trp, apoTrpR, trp, Int001, Int003, Int005]

(25)

Principal Classes

• Class names are capitalized, plural
• Genetic-Elements, with subclasses:
  • Chromosomes
  • Plasmids
• Genes
• Transcription-Units
• RNAs
• Proteins, with subclasses:
  • Polypeptides

(26)

Principal Classes

• Reactions, with subclasses:
  • Transport-Reactions
• Enzymatic-Reactions
• Pathways

(27)

Slots in Multiple Classes

• Common-Name
• Synonyms
• Names (computed as union of Common-Name, Synonyms)
• Comment
• Citations

(28)

Genes Slots

• Chromosome
• Left-End-Position
• Right-End-Position
• Centisome-Position
• Transcription-Direction
• Product

(29)

Proteins Slots

• Molecular-Weight-Seq
• Molecular-Weight-Exp
• pI
• Locations
• Modified-Form
• Unmodified-Form
• Component-Of

(30)

Polypeptides Slots

(31)

Protein-Complexes Slots

(32)

Reactions Slots

• EC-Number
• Left, Right
• Substrates (computed as union of Left, Right)
• DeltaG0
• Keq
• Spontaneous?
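
A computed slot such as Substrates can be modeled as a derived property rather than stored data. The sketch below assumes a hypothetical `Reaction` class; only the slot names and the example reaction (EC 5.1.3.2, taken from the PathoLogic slides) come from the deck.

```python
from dataclasses import dataclass

@dataclass
class Reaction:
    """Hypothetical stand-in for a Reactions frame with three of its slots."""
    ec_number: str
    left: list
    right: list

    @property
    def substrates(self):
        """Computed slot: union of Left and Right, order kept, duplicates dropped."""
        return list(dict.fromkeys(self.left + self.right))

rxn = Reaction("5.1.3.2", ["UDP-D-glucose"], ["UDP-galactose"])
all_substrates = rxn.substrates
```

The Names slot (union of Common-Name and Synonyms) could be derived the same way, so the union never goes stale when one of its sources is edited.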

(33)

Enzymatic-Reactions Slots

• Enzyme
• Reaction
• Activators
• Inhibitors
• Physiologically-Relevant
• Cofactors
• Prosthetic-Groups
• Alternative-Substrates
• Alternative-Cofactors

(34)

Pathways Slots

• Reaction-List
• Predecessors
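
One plausible use of these two slots is recovering the order of steps in a pathway. The sketch below assumes that Predecessors holds (reaction, predecessor) pairs, which the slide does not state; the reaction ids are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Assumed slot contents for one linear pathway (hypothetical ids).
reaction_list = ["rxn-3", "rxn-1", "rxn-2"]
predecessors = [("rxn-2", "rxn-1"), ("rxn-3", "rxn-2")]  # (reaction, its predecessor)

def ordered_reactions(reaction_list, predecessors):
    """Order reactions so that each one comes after all of its predecessors."""
    graph = {rxn: set() for rxn in reaction_list}
    for rxn, pred in predecessors:
        graph[rxn].add(pred)
    # static_order() raises graphlib.CycleError if the links form a cycle.
    return list(TopologicalSorter(graph).static_order())

steps = ordered_reactions(reaction_list, predecessors)
```

A cyclic pathway such as the TCA cycle would raise `CycleError` here, so a real layout routine would need to break the cycle at a chosen starting reaction first.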

(35)

MetaCyc Overview

• Meta Metabolic Encyclopedia
• 445 pathways, 1115 enzymes, 4218 reactions
• 173 E. coli pathways; 158 organisms
• 2381 citations
• Literature-based DB with extensive references and commentary

(36)

MetaCyc Frequent Organisms

Pathways  Organism
7         M. pneumoniae
7         P. putida
8         S. cerevisiae
12        M. capricolum
15        Hp. influenzae
17        Pseudomonas
18        Soybean
18        B. subtilis
20        Sf. solfataricus
31        Ho. sapiens
35        Sm. typhimurium
173       E. coli

(37)

MetaCyc Data

• MetaCyc contains one DB object for each distinct pathway
  • Distinct in terms of reaction steps
• Each pathway labeled with species it occurs in
• MetaCyc pathways are experimentally determined
• 4218 reactions in MetaCyc

(38)

MetaCyc Enzyme Data

• Reaction(s) catalyzed
• Alternative substrates
• Cofactors / prosthetic groups
• Activators and inhibitors
• Subunit structure
• Molecular weight, pI
• Comment, literature citations

(39)

MetaCyc Super-Pathways

• Groups of pathways linked by common substrates
• Example: Super-pathway containing
  • Chorismate biosynthesis
  • Tryptophan biosynthesis
  • Phenylalanine biosynthesis
  • Tyrosine biosynthesis
• Super-pathways defined by listing their component pathways
• Multiple levels of super-pathways can be defined
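
Because a super-pathway is defined by listing its components, and components may themselves be super-pathways, expanding one into base pathways is a small recursion. In the sketch below every id is hypothetical, including the second-level super-pathway added to show nesting.

```python
# Hypothetical ids: each super-pathway maps to the list of its component pathways.
SUB_PATHWAYS = {
    "aromatic-superpathway": [
        "chorismate-biosynthesis",
        "tryptophan-biosynthesis",
        "phenylalanine-biosynthesis",
        "tyrosine-biosynthesis",
    ],
    # Invented second level, to show that super-pathways can nest.
    "hypothetical-upper-superpathway": [
        "aromatic-superpathway",
        "hypothetical-extra-pathway",
    ],
}

def base_pathways(pathway_id, ancestors=frozenset()):
    """Expand a (super-)pathway into its base pathways, depth first, no duplicates."""
    if pathway_id in ancestors:
        raise ValueError(f"super-pathway {pathway_id!r} contains itself")
    children = SUB_PATHWAYS.get(pathway_id)
    if children is None:          # not a super-pathway: already a base pathway
        return [pathway_id]
    result = []
    for child in children:
        for base in base_pathways(child, ancestors | {pathway_id}):
            if base not in result:
                result.append(base)
    return result

expanded = base_pathways("hypothetical-upper-superpathway")
```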

(40)

Comparison of MetaCyc to KEGG

• Data
  • KEGG has no literature citations, no comments
  • KEGG has no detailed information about enzymes (inhibitors, subunits)
  • KEGG pathways are composites of pathways found in many organisms
    • Unclear what sub-pathways occur in what organisms
• Software tools
  • KEGG has no algorithmic visualization tools
  • KEGG has no queryable metabolic-map overview diagram
  • KEGG has no interactive editing tools

(41)

EcoCyc/MetaCyc Availability

• WWW EcoCyc-Plus freely available
  • EcoCyc, MetaCyc
  • Pathway/genome DBs for 12 other organisms
  • http://BioCyc.org/
• On-site EcoCyc-Plus freely available to non-profits
  • Flatfiles
  • Binary executable: Hardware requirements
    • Sun UltraSparc-170 w/ 64MB memory

(42)

Bioinformatics Resources for Microbial Genome Analysis

• E. coli has a large fraction of gene functions identified experimentally
• Assigning function by similarity to E. coli genes is less likely to introduce annotation errors
• Predict metabolic pathways of other microbes

(43)

Applications of EcoCyc and MetaCyc

• Reference sources on E. coli and metabolism
• Sequence/pathway analysis of microbial genomes
• Analysis of gene-expression data
• Computer-aided education
• Anti-microbial drug discovery
• Pathway engineering
• Investigations of
  • Comparative metabolism

(44)

Pathway Tools Software

• PathoLogic
  • Prediction of metabolic network from genome
  • Computational creation of new Pathway/Genome Databases
• Pathway/Genome Editors
  • Distributed curation of genome annotations
  • Distributed object database system
  • Interactive editing tools
• Pathway/Genome Navigator
  • WWW publishing of PGDBs
  • Graphic depictions of pathways, chromosomes, operons
  • Analysis operations
    • Pathway visualization of gene-expression data
    • Global comparisons of metabolic networks

(45)

Implementation Details

• Allegro Common Lisp
• Sun and PC platforms
• Ocelot object database
• Lisp-based WWW server at BioCyc.org
• CWEST-based

(46)

Pathway Tools Architecture

[Figure: Pathway Tools architecture. Components: Oracle; Object DBMS; GFP API; Pathway Genome Navigator; WWW Server; X-Windows Graphics; Object Editor; Pathway Editor; Reaction Editor]

(47)

Architecture

• Frame data model
  • Classes, instances, inheritance
  • Classes and instances both treated as data
• Persistent storage via disk files, Oracle DBMS
  • Concurrent development: Oracle
  • Single-user development: disk files
  • Read-only delivery: bundle data into binary program
• Transaction logging facility
• Optimistic concurrency-control protocol
• Schema evolution
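
The slides name an optimistic concurrency-control protocol without describing it. As a rough illustration of the general technique only (not of Ocelot's actual protocol), the sketch below uses a textbook version check at commit time; the `Store` class and all names in it are hypothetical.

```python
class ConflictError(Exception):
    """Raised when a frame changed between an editor's read and its commit."""

class Store:
    """Toy frame store: frame id -> (version, value)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        """Return (version, value); an unknown key reads as version 0."""
        return self._data.get(key, (0, None))

    def commit(self, key, read_version, new_value):
        """Write only if nobody else committed since read_version was read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != read_version:
            raise ConflictError(f"{key} changed since version {read_version}")
        self._data[key] = (current_version + 1, new_value)

store = Store()
version, _ = store.read("trpA")                    # two curators read the same version
store.commit("trpA", version, "edit by curator 1")
try:
    store.commit("trpA", version, "edit by curator 2")
    conflict_detected = False
except ConflictError:
    conflict_detected = True                       # curator 2 must re-read and retry
```

No locks are held while curators edit; conflicts are detected only at commit, which suits workloads where two people rarely edit the same object at once.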

(48)
(49)

Visualization and Editing Tools

• Full Metabolic Map
• Pathways
• Reactions
• Compounds
• Enzymes, Transporters, Transcription Factors
• Genes
• Chromosomes

(50)

[Figure: PathoLogic input and output. Input, labeled ANNOTATED GENOME (Structured ASCII Text File): List of Genes/ORFs; List of Gene Products; DNA Sequence. Output, labeled Pathway/Genome Database: Genomic Map; Genes; Gene Products; Reactions; Pathways; Compounds]

(51)
(52)
(53)
(54)

PathoLogic Analysis Phases

• Trial parsing of input data files
• Automated build of initial PGDB
  • Initialize schema of new PGDB
  • Create DB objects for chromosomes, genes, proteins
• Predict reactions and pathways present
• Define protein complexes

(55)

PathoLogic Pathway Prediction

• Create associations between enzymes and metabolic reactions
  • Reactions and substrates imported from MetaCyc
  • Automatically via EC numbers
  • Automatically via enzyme name matching
  • Manually
  • CC0092 / galE / “UDP-glucose-4-epimerase” / EC 5.1.3.2
    • UDP-D-glucose → UDP-galactose
• Import from MetaCyc all pathways associated with inferred reactions
  • UDP-D-glucose → UDP-galactose is a reaction of:
    • galactose metabolism, UDP-glucose conversion, lactose degradation 4, colanic acid building blocks biosynthesis
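
The two prediction steps on this slide (match enzymes to reactions, then import every pathway that contains a matched reaction) can be sketched as below. The lookup tables are tiny hand-made stand-ins for MetaCyc that hold only the slide's galE example; the function names and data layout are assumptions, not PathoLogic's implementation.

```python
# Toy stand-ins for MetaCyc lookups, holding only the example from the slide.
RXN = "UDP-D-glucose -> UDP-galactose"
EC_TO_REACTION = {"5.1.3.2": RXN}
NAME_TO_REACTION = {"udp-glucose-4-epimerase": RXN}
REACTION_PATHWAYS = {
    RXN: [
        "galactose metabolism",
        "UDP-glucose conversion",
        "lactose degradation 4",
        "colanic acid building blocks biosynthesis",
    ],
}

def infer_reaction(gene):
    """Try the EC number first, then the enzyme name; None means manual curation."""
    reaction = EC_TO_REACTION.get(gene.get("ec"))
    if reaction is None:
        reaction = NAME_TO_REACTION.get(gene["product"].lower())
    return reaction

def predict(genes):
    """Return ({reaction: [gene ids]}, sorted list of candidate pathways)."""
    reactions = {}
    for gene in genes:
        reaction = infer_reaction(gene)
        if reaction is not None:
            reactions.setdefault(reaction, []).append(gene["id"])
    pathways = sorted({p for r in reactions for p in REACTION_PATHWAYS.get(r, [])})
    return reactions, pathways

genes = [{"id": "CC0092", "name": "galE",
          "product": "UDP-glucose-4-epimerase", "ec": "5.1.3.2"}]
reactions, candidate_pathways = predict(genes)
```

The pathways returned here are only candidates; a later filtering step decides which of them have enough supporting evidence to keep.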

(56)

Insufficient Evidence

• No unique enzyme AND EITHER
  • Only 1 reaction present for a pathway of more than 2 steps
  • Set of reactions present is a subset of the reactions present in another pathway
  • There exists a variant pathway with more evidence
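
The rule above can be written as a boolean test. The slide leaves several terms undefined, so this sketch makes its assumptions explicit: a "unique enzyme" is read as a present reaction that occurs in no other candidate pathway, and "more evidence" as a larger number of present reactions. All names and data are hypothetical.

```python
def present_in(name, pathways, present):
    """Reactions of a pathway that have at least one enzyme assigned."""
    return pathways[name] & present

def has_unique_enzyme(name, pathways, present):
    """Assumed meaning: some present reaction occurs in no other candidate pathway."""
    elsewhere = set()
    for other, reactions in pathways.items():
        if other != name:
            elsewhere |= reactions
    return bool(present_in(name, pathways, present) - elsewhere)

def insufficient_evidence(name, pathways, present, variants=()):
    """Apply the slide's rule: no unique enzyme AND any one of three conditions."""
    if has_unique_enzyme(name, pathways, present):
        return False
    mine = present_in(name, pathways, present)
    lone_reaction = len(mine) == 1 and len(pathways[name]) > 2
    subset_of_other = any(
        mine <= present_in(other, pathways, present)
        for other in pathways if other != name)
    weaker_variant = any(
        len(present_in(v, pathways, present)) > len(mine) for v in variants)
    return lone_reaction or subset_of_other or weaker_variant

# Hypothetical candidates: pathway name -> set of reaction ids.
pathways = {
    "pathway-A": {"r1", "r2", "r3"},
    "pathway-B": {"r1", "r4"},
    "pathway-C": {"r5", "r6"},
}
present = {"r1", "r4", "r5"}   # reactions with an enzyme in the genome
pruned = [p for p in pathways if insufficient_evidence(p, pathways, present)]
```

In this toy data pathway-A is supported only by r1, which it shares with pathway-B, so it is pruned, while pathway-B and pathway-C each keep a reaction of their own.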

(57)

Pathway Complement

• Extends the paradigm of genome analysis
• Predicted genes placed in their biochemical context
• Information reduction device
• Assess coherence of the set of genes in a genome
• Identifies pathway holes and singleton enzymes
• Provides a framework for analysis of functional-genomics
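
Finding pathway holes can be sketched as a simple scan, taking a hole to be a reaction in a predicted pathway that has no enzyme assigned (an assumed reading; the slide does not define the term, and singleton enzymes are left out for the same reason). All ids are hypothetical.

```python
def pathway_holes(pathways, reaction_enzymes):
    """Return {pathway: [reactions lacking an enzyme]} for pathways that have holes."""
    holes = {}
    for name, reactions in pathways.items():
        missing = [r for r in reactions if not reaction_enzymes.get(r)]
        if missing:
            holes[name] = missing
    return holes

# Hypothetical predicted pathways and enzyme assignments.
pathways = {
    "pathway-A": ["rxn-1", "rxn-2", "rxn-3"],
    "pathway-B": ["rxn-4"],
}
reaction_enzymes = {"rxn-1": ["gene-x"], "rxn-3": ["gene-y"], "rxn-4": ["gene-z"]}
holes = pathway_holes(pathways, reaction_enzymes)
```

Each hole points at a function that is probably encoded somewhere in the genome but has not yet been annotated, which makes the list a natural work queue for curators.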

(58)
(59)
(60)

Pathway Comparisons

      Eco  Mtb  Bsu  Hin  Sce  Hpy
Eco   130  103   92   90   84   73
Mtb        103   84   79   82   70
Bsu              96   77   72   65
Hin                   90   67   61
Sce                        84   64
Hpy                             74
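
A table like this one can be computed with set intersections, assuming each cell counts the pathways two organisms share and the diagonal gives each organism's own total (an interpretation; the slide carries no legend). The organisms and pathway sets below are toy data, not the values from the slide.

```python
from itertools import combinations_with_replacement

def shared_pathway_counts(pgdbs):
    """pgdbs: organism -> set of pathway ids.

    Returns {(a, b): number of shared pathways} for the upper triangle,
    including the diagonal (a, a), in the input order of the organisms.
    """
    names = list(pgdbs)
    return {(a, b): len(pgdbs[a] & pgdbs[b])
            for a, b in combinations_with_replacement(names, 2)}

toy_pgdbs = {
    "Org1": {"glycolysis", "TCA cycle", "pentose phosphate pathway"},
    "Org2": {"glycolysis", "TCA cycle"},
    "Org3": {"glycolysis"},
}
counts = shared_pathway_counts(toy_pgdbs)
```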

(61)

Summary

• Pathway/Genome Databases
  • 14 PGDBs available through SRI at BioCyc.org
  • Computational theories of biochemical machinery
• Pathway Tools software
  • Extract pathways from genomes
  • Distributed curation tools
  • Query, visualization, WWW publishing
  • Analysis algorithms

(62)

Acknowledgements

• SRI: Suzanne Paley, Pedro Romero, John Pick
• EcoCyc Project: Milton Saier, Julio Collado, Ian Paulsen, Monica Riley
• Stanford: Harley McAdams, Lucy Shapiro, Gary Schoolnik, Russ Altman
• Funding sources:
  • NIH National Center for Research Resources
  • Department of Energy Microbial Cell Project
  • DARPA BioSpice, UPC

[email protected] http://BioCyc.org/
