Databases: Concepts and Software Tools
Peter D. Karp, Ph.D.
Bioinformatics Research Group SRI International
http://www.ai.sri.com/pkarp/ http://BioCyc.org/
Overview
l Pathway/genome databases
l Pathway Tools software
l EcoCyc and MetaCyc
Larger than Minds can Grasp?
l Example: E. coli genetic network
l Control by 97 transcription factors of 1174 genes in 630 transcription units
l Example: E. coli metabolic network
l 160 pathways involving 744 reactions and 791 substrates
l Partition theories across multiple minds
l Rely on the printed word
Limitations
l Cannot effectively
l Evaluate them for internal consistency
l Evaluate them for consistency with new data: microarrays
l Refine them with respect to new data
l Integrate across them to produce system understanding
l They are too large and complex
l The printed word cannot be manipulated
Biological Knowledge Bases
l Store biological knowledge and theories in computers in a declarative form
l Amenable to computational analysis and generative user interfaces
l Storing data in computers is accepted practice; storing knowledge is not
l Refined, interpreted, consensus views
l Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
l Such knowledge bases are an integral part of the scientific enterprise
Pathway/Genome Databases
l Layer functional information above the genome
l Rich ontology to encode biological information with high fidelity
l Chromosomes, genes, operons, gene products, reactions, pathways
l Curated by experts for that organism
[Figure: contents of a Pathway/Genome Database for a cell: chromosomes and plasmids; genes; operons and promoters; proteins; reactions; pathways; compounds]
Pathway Tools Software
l PathoLogic
l Prediction of metabolic network from genome
l Computational creation of new Pathway/Genome Databases
l Pathway/Genome Editors
l Distributed curation of genome annotations
l Distributed object database system
l Interactive editing tools
l Pathway/Genome Navigator
l WWW publishing of PGDBs
l Graphic depictions of pathways, chromosomes, operons
l Analysis operations
u Pathway visualization of gene-expression data
u Global comparisons of metabolic networks
Sequence Project Workflow
[Figure: sequencing project workflow. Boxes: Raw Sequence; Phred; Phrap; CONSED; BLAST, BLOCKS; GeneMark/Glimmer; PathoLogic; P/G Navigator; P/G Editors; WWW Publishing; Analyses]
Pathway/Genome DBs
Literature-based Datasets:
l MetaCyc
l Escherichia coli (EcoCyc)
Computationally Derived Datasets:
l Agrobacterium tumefaciens
l Caulobacter crescentus
l Chlamydia trachomatis
l Bacillus subtilis
l Helicobacter pylori
l Haemophilus influenzae
l Mycobacterium tuberculosis
l Mycoplasma pneumoniae
l Pseudomonas aeruginosa
l Saccharomyces cerevisiae
l Treponema pallidum
http://BioCyc.org/
EcoCyc Project Overview
l E. coli Encyclopedia
l Model-Organism Database for E. coli
l Tracks the evolving annotation of the E. coli genome
l Over 3500 literature citations
l Collaborative development via Internet
l Karp (SRI) -- Bioinformatics architect
l Riley (MBL) -- Metabolic pathways, signal transduction
l Saier (UCSD) and Paulsen (TIGR) -- Transport
l Collado (UNAM) -- Regulation of gene expression
l Ontology: 1000 biological classes
Pathway/Genome Navigator
Genes: 4,393
Proteins: 4,273
Reactions: 2,760
Pathways: 165
Compounds: 774
Transcription Units: 684
Factors: 108
Enzymes: 914
Transporters: 162
Promoters: 781
TransFac Sites: 910
Citations: 3,508
http://BioCyc.org/
EcoCyc Pathways
l Biosynthesis of amino acids, purines, pyrimidines, fatty acids, cofactors (heme, biotin, folic acid, etc.)
l Catabolism of fatty acids, D-glucuronate, L-alanine, L-arabinose, fucose, galactonate, galactose, glucose, mannose, ribose, xylose
l Entner-Doudoroff pathway, TCA cycle, fermentation, gluconeogenesis, glycerol metabolism, glycolysis, glyoxylate cycle, pentose phosphate pathway
Schema
l Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
l Complex Lisp queries against PGDBs must name classes and slots within the schema
l A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity
Web of Relationships for One Enzyme
[Figure: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; enzyme complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle]
Frames
l Entities with which facts are associated
l Kinds of frames:
l Classes: Genes, Pathways, Biosynthetic Pathways
l Instances (objects): trpA, TCA cycle
l Classes:
l Superclass(es)
l Subclass(es)
l Instance(s)
l A symbolic frame name (id, key) uniquely identifies each frame
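The frame vocabulary above (classes, instances, unique frame names) can be sketched in a few lines of Python. This is an illustration only, not the Ocelot or Pathway Tools API: the `KB` and `Frame` names, the assumption that Biosynthetic Pathways is a subclass of Pathways, and the instance name `trp-biosynthesis` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    name: str                                     # symbolic frame name (unique key)
    is_class: bool = False
    parents: list = field(default_factory=list)   # superclasses, or classes of an instance


class KB:
    """Toy knowledge base: a dictionary of frames keyed by frame name."""

    def __init__(self):
        self.frames = {}

    def add(self, name, parents=(), is_class=False):
        # A frame name must identify exactly one frame.
        if name in self.frames:
            raise ValueError(f"duplicate frame name: {name}")
        self.frames[name] = Frame(name, is_class, list(parents))
        return self.frames[name]

    def all_superclasses(self, name):
        # Walk parent links upward, collecting every ancestor class once.
        seen, stack = [], list(self.frames[name].parents)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self.frames[parent].parents)
        return seen

    def instances_of(self, cls):
        # Instances of a class include instances of all its subclasses.
        return sorted(f.name for f in self.frames.values()
                      if not f.is_class and cls in self.all_superclasses(f.name))


kb = KB()
kb.add("Genes", is_class=True)
kb.add("Pathways", is_class=True)
kb.add("Biosynthetic-Pathways", parents=["Pathways"], is_class=True)  # assumed subclass
kb.add("trpA", parents=["Genes"])
kb.add("TCA-cycle", parents=["Pathways"])
kb.add("trp-biosynthesis", parents=["Biosynthetic-Pathways"])         # hypothetical instance
```

Asking for `kb.instances_of("Pathways")` returns both pathway instances, because membership is inherited through the subclass link.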
Slots
l Encode attributes/properties of a frame
l Integer, real number, string
l Represent relationships between frames
l The value of a slot is the identifier of another frame
l Every slot is described by a “slot frame” in a KB that defines meta information about that slot
Slot Links
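[Figure: the succinate dehydrogenase web with slot labels. Genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle. Slots linking them: product, component-of, catalyzes, reaction, in-pathway]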
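A minimal sketch of how slot links could be followed from a gene to its pathway, using the slot names from this slide (product, component-of, catalyzes, reaction, in-pathway). The dictionary layout, the frame identifiers `Enzrxn-sdh` and `Rxn-succ-fum`, and the pairing of each gene with a polypeptide by slide order are assumptions made for illustration; this is not the Pathway Tools query API.

```python
# Each frame is a dictionary of slots; a slot value is a list of frame names.
KB = {
    "sdhA": {"product": ["Sdh-flavo"]},
    "sdhB": {"product": ["Sdh-Fe-S"]},
    "sdhC": {"product": ["Sdh-membrane-1"]},
    "sdhD": {"product": ["Sdh-membrane-2"]},
    "Sdh-flavo": {"component-of": ["Succinate-dehydrogenase"]},
    "Sdh-Fe-S": {"component-of": ["Succinate-dehydrogenase"]},
    "Sdh-membrane-1": {"component-of": ["Succinate-dehydrogenase"]},
    "Sdh-membrane-2": {"component-of": ["Succinate-dehydrogenase"]},
    "Succinate-dehydrogenase": {"catalyzes": ["Enzrxn-sdh"]},   # hypothetical frame id
    "Enzrxn-sdh": {"reaction": ["Rxn-succ-fum"]},               # hypothetical frame id
    "Rxn-succ-fum": {"in-pathway": ["TCA-Cycle"]},
    "TCA-Cycle": {},
}

GENE_TO_PATHWAY = ["product", "component-of", "catalyzes", "reaction", "in-pathway"]


def follow(kb, start, slot_path):
    """Follow a chain of slots from one frame, returning the frames reached."""
    frontier = [start]
    for slot in slot_path:
        reached = []
        for frame in frontier:
            for value in kb[frame].get(slot, []):
                if value not in reached:
                    reached.append(value)
        frontier = reached
    return frontier


pathways_of_sdhA = follow(KB, "sdhA", GENE_TO_PATHWAY)
```

Stopping the slot path early answers narrower questions, for example `follow(KB, "sdhB", ["product", "component-of"])` for the complex a gene product belongs to.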
Representation of Function
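[Figure: the succinate dehydrogenase web annotated with attribute slots. Genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle. Attributes shown: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]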
Monofunctional Monomer
[Figure: Gene, Monomer, Enzymatic-reaction, Reaction, Pathway]
Bifunctional Monomer
[Figure: Gene, Monomer, two Enzymatic-reactions, two Reactions, Pathway]
Monofunctional Multimer
[Figure: four Genes, four Monomers, Multimer, Enzymatic-reaction, Reaction, Pathway]
Pathway and Substrates
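[Figure: a Pathway linked to its Reactions through in-pathway; a Reaction linked to Reactant-1 and Reactant-2 through left, and to Product-1 and Product-2 through right]
Transcriptional Regulation
[Figure: transcriptional regulation of the trp operon. Frames shown: site001, pro001, trpE, trpD, trpC, trpB, trpA, trpL, Int001, Int003, Int005, RpoSig70, TrpR*trp, trpLEDCBA, trp, apoTrpR]
Principal Classes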
l Class names are capitalized, plural
l Genetic-Elements, with subclasses:
l Chromosomes
l Plasmids
l Genes
l Transcription-Units
l RNAs
l Proteins, with subclasses:
l Polypeptides
Principal Classes
l Reactions, with subclasses:
l Transport-Reactions
l Enzymatic-Reactions
l Pathways
Slots in Multiple Classes
l Common-Name
l Synonyms
l Names (computed as union of Common-Name, Synonyms)
l Comment
l Citations
Genes Slots
l Chromosome
l Left-End-Position
l Right-End-Position
l Centisome-Position
l Transcription-Direction
l Product
Proteins Slots
l Molecular-Weight-Seq
l Molecular-Weight-Exp
l pI
l Locations
l Modified-Form
l Unmodified-Form
l Component-Of
Polypeptides Slots
Protein-Complexes Slots
Reactions Slots
l EC-Number
l Left, Right
l Substrates (computed as union of Left, Right)
l DeltaG0
l Keq
l Spontaneous?
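Computed slots such as Substrates (the union of Left and Right) need not be stored; they can be derived on demand. The sketch below shows one way to do that in Python. The `Reaction` class and its attribute names are hypothetical stand-ins for the slots listed above, not the actual schema implementation.

```python
class Reaction:
    """Toy reaction frame with stored Left/Right slots and a computed Substrates slot."""

    def __init__(self, left, right, spontaneous=False):
        self.left = list(left)          # Left slot: compounds on the left side
        self.right = list(right)        # Right slot: compounds on the right side
        self.spontaneous = spontaneous  # Spontaneous? slot

    @property
    def substrates(self):
        # Union of Left and Right, keeping first-seen order and dropping repeats.
        return list(dict.fromkeys(self.left + self.right))


sdh_reaction = Reaction(left=["succinate", "FAD"], right=["fumarate", "FADH2"])
all_substrates = sdh_reaction.substrates
```

Because the value is recomputed on each access, an edit to Left or Right is reflected in Substrates without a separate update step.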
Enzymatic-Reactions Slots
l Enzyme
l Reaction
l Activators
l Inhibitors
l Physiologically-Relevant
l Cofactors
l Prosthetic-Groups
l Alternative-Substrates
l Alternative-Cofactors
Pathways Slots
l Reaction-List
l Predecessors
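One way to read these two slots together: Reaction-List names the steps of a pathway, and Predecessors records which step comes before which. The sketch below assumes each Predecessors entry is a (reaction, predecessor) pair and recovers a step order with a topological sort. The reaction names are placeholders, and the pair encoding is an assumption rather than a statement about the actual slot format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def order_reactions(reaction_list, predecessors):
    """Return the reactions of a pathway ordered so each follows its predecessors."""
    graph = {rxn: set() for rxn in reaction_list}
    for rxn, pred in predecessors:
        graph[rxn].add(pred)
    # static_order() raises graphlib.CycleError for cyclic pathways,
    # which would need a designated starting reaction instead.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical linear three-step pathway, listed out of order.
reaction_list = ["rxn-3", "rxn-1", "rxn-2"]
predecessors = [("rxn-2", "rxn-1"), ("rxn-3", "rxn-2")]
ordered = order_reactions(reaction_list, predecessors)
```

A cyclic pathway such as the TCA cycle has no topological order, so a real layout routine would have to break the cycle at a chosen reaction first.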
MetaCyc Overview
l Meta Metabolic Encyclopedia
l 445 pathways, 1115 enzymes, 4218 reactions
l 173 E. coli pathways; 158 organisms
l 2381 citations
l Literature-based DB with extensive references and commentary
MetaCyc Frequent Organisms
Pathways per organism:
l M. pneumoniae: 7
l P. putida: 7
l S. cerevisiae: 8
l M. capricolum: 12
l Hp. influenzae: 15
l Pseudomonas: 17
l Soybean: 18
l B. subtilis: 18
l Sf. solfataricus: 20
l Ho. sapiens: 31
l Sm. typhimurium: 35
l E. coli: 173
MetaCyc Data
l MetaCyc contains one DB object for each distinct pathway
l Distinct in terms of reaction steps
l Each pathway labeled with species it occurs in
l MetaCyc pathways are experimentally determined
l 4218 reactions in MetaCyc
MetaCyc Enzyme Data
l Reaction(s) catalyzed
l Alternative substrates
l Cofactors / prosthetic groups
l Activators and inhibitors
l Subunit structure
l Molecular weight, pI
l Comment, literature citations
MetaCyc Super-Pathways
l Groups of pathways linked by common substrates
l Example: Super-pathway containing
l Chorismate biosynthesis
l Tryptophan biosynthesis
l Phenylalanine biosynthesis
l Tyrosine biosynthesis
l Super-pathways defined by listing their component pathways
l Multiple levels of super-pathways can be defined
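Because a super-pathway is defined by listing its component pathways, and components may themselves be super-pathways, its full reaction set can be obtained by recursive expansion. The sketch below assumes a simple dictionary layout; the reaction identifiers and the two super-pathway names are placeholders, not MetaCyc frame names.

```python
PATHWAYS = {
    "chorismate biosynthesis": {"reactions": ["chor-1", "chor-2"]},
    "tryptophan biosynthesis": {"reactions": ["trp-1", "trp-2"]},
    "phenylalanine biosynthesis": {"reactions": ["phe-1"]},
    "tyrosine biosynthesis": {"reactions": ["tyr-1"]},
    # A super-pathway lists component pathways instead of reactions.
    "aromatic super-pathway": {"components": [
        "chorismate biosynthesis", "tryptophan biosynthesis",
        "phenylalanine biosynthesis", "tyrosine biosynthesis"]},
    # A second level: a super-pathway whose component is a super-pathway.
    "top-level super-pathway": {"components": ["aromatic super-pathway"]},
}


def expand(name, pathways, _active=None):
    """Return all reactions of a pathway, expanding nested super-pathways."""
    active = _active if _active is not None else set()
    if name in active:
        raise ValueError(f"super-pathway cycle at {name}")
    entry = pathways[name]
    reactions = list(entry.get("reactions", []))
    active.add(name)
    for component in entry.get("components", []):
        for rxn in expand(component, pathways, active):
            if rxn not in reactions:
                reactions.append(rxn)
    active.discard(name)
    return reactions


all_reactions = expand("top-level super-pathway", PATHWAYS)
```

Storing only the component list keeps each base pathway defined in one place, so a correction to a base pathway propagates to every super-pathway that includes it.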
Comparison of MetaCyc to KEGG
l Data
l KEGG has no literature citations, no comments
l KEGG has no detailed information about enzymes (inhibitors, subunits)
l KEGG pathways are composites of pathways found in many organisms
u Unclear what sub-pathways occur in what organisms
l Software tools
l KEGG has no algorithmic visualization tools
l KEGG has no queryable metabolic-map overview diagram
l KEGG has no interactive editing tools
EcoCyc/MetaCyc Availability
l WWW EcoCyc-Plus freely available
l EcoCyc, MetaCyc
l Pathway/genome DBs for 12 other organisms
l http://BioCyc.org/
l On-site EcoCyc-Plus freely available to non-profits
l Flatfiles
l Binary executable: Hardware requirements
u Sun UltraSparc-170 w/ 64MB memory
Bioinformatics Resources for Microbial Genome Analysis
l E. coli has large fraction of gene functions identified experimentally
l Assigning function by similarity to E. coli genes less likely to introduce annotation errors
l Predict metabolic pathways of other microbes
Applications of EcoCyc and MetaCyc
l Reference sources on E. coli and metabolism
l Sequence/pathway analysis of microbial genomes
l Analysis of gene-expression data
l Computer-aided education
l Anti-microbial drug discovery
l Pathway engineering
l Investigations of
l Comparative metabolism
Implementation Details
l Allegro Common Lisp
l Sun and PC platforms
l Ocelot object database
l Lisp-based WWW server at BioCyc.org
l CWEST-based
Pathway Tools Architecture
[Figure: Pathway Tools architecture. Components: Object DBMS; Oracle; GFP API; Pathway Genome Navigator; WWW Server; X-Windows Graphics; Object Editor; Pathway Editor; Reaction Editor]
Architecture
l Frame data model
l Classes, instances, inheritance
l Classes and instances both treated as data
l Persistent storage via disk files, Oracle DBMS
l Concurrent development: Oracle
l Single-user development: disk files
l Read-only delivery: bundle data into binary program
l Transaction logging facility
l Optimistic concurrency-control protocol
l Schema evolution
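The general idea behind optimistic concurrency control is that curators edit without taking locks, and a commit is accepted only if nothing the transaction touched has changed in the meantime. The sketch below illustrates that idea with per-frame version numbers and a simple transaction log. It is a generic textbook-style illustration; the class names are hypothetical and it does not describe the actual protocol used by the object database.

```python
class ConflictError(Exception):
    """Raised when a commit finds that another transaction changed a frame first."""


class Store:
    def __init__(self):
        self.data = {}   # frame name -> (version, value)
        self.log = []    # transaction log of committed writes

    def version(self, key):
        return self.data.get(key, (0, None))[0]

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.seen = {}     # frame name -> version observed at first touch
        self.writes = {}   # pending updates, applied only at commit

    def read(self, key):
        self.seen.setdefault(key, self.store.version(key))
        if key in self.writes:
            return self.writes[key]
        return self.store.data.get(key, (0, None))[1]

    def write(self, key, value):
        self.seen.setdefault(key, self.store.version(key))
        self.writes[key] = value

    def commit(self):
        # Validation phase: every frame touched must be unchanged since first touch.
        for key, version in self.seen.items():
            if self.store.version(key) != version:
                raise ConflictError(key)
        # Write phase: apply updates, bump versions, append to the log.
        for key, value in self.writes.items():
            new_version = self.store.version(key) + 1
            self.store.data[key] = (new_version, value)
            self.store.log.append((key, new_version, value))


store = Store()
t1, t2 = store.begin(), store.begin()
t1.write("trpA", "comment from curator A")
t2.write("trpA", "comment from curator B")
t1.commit()
try:
    t2.commit()
    conflict = False
except ConflictError:
    conflict = True
```

The second curator's commit is rejected rather than silently overwriting the first, which suits curation workloads where simultaneous edits to the same frame are rare.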
Visualization and Editing Tools
l Full Metabolic Map
l Pathways
l Reactions
l Compounds
l Enzymes, Transporters, Transcription Factors
l Genes
l Chromosomes
[Figure: PathoLogic takes an annotated genome (list of genes/ORFs, list of gene products, DNA sequence; structured ASCII text file) and produces a Pathway/Genome Database (genomic map, genes, gene products, reactions, pathways, compounds)]
PathoLogic Analysis Phases
l Trial parsing of input data files
l Automated build of initial PGDB
l Initialize schema of new PGDB
l Create DB objects for chromosomes, genes, proteins
l Predict reactions and pathways present
l Define protein complexes
PathoLogic Pathway Prediction
l Create associations between enzymes and metabolic reactions
l Reactions and substrates imported from MetaCyc
l Automatically via EC numbers
l Automatically via enzyme name matching
l Manually
l CC0092 / galE / “UDP-glucose-4-epimerase” / EC 5.1.3.2
l UDP-D-glucose → UDP-galactose
l Import from MetaCyc all pathways associated with inferred reactions
l UDP-D-glucose → UDP-galactose is a reaction of:
l galactose metabolism, UDP-glucose conversion, lactose degradation 4, colanic acid building blocks biosynthesis
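The matching steps on this slide can be sketched as a small lookup: try the EC number first, fall back to a normalized enzyme-name match, and then collect every reference pathway that contains a matched reaction. The galE example and pathway names come from the slide; the table layout, the `normalize` rule, and the function names are assumptions for illustration and do not reproduce PathoLogic's actual matching logic.

```python
import re

# Toy stand-in for the reference reaction table.
REACTIONS = {
    "UDP-D-glucose -> UDP-galactose": {
        "ec": "5.1.3.2",
        "enzyme_names": ["UDP-glucose 4-epimerase"],
    },
}

# Reference pathways and the reactions they contain (names from the slide).
PATHWAYS = {
    "galactose metabolism": ["UDP-D-glucose -> UDP-galactose"],
    "UDP-glucose conversion": ["UDP-D-glucose -> UDP-galactose"],
    "lactose degradation 4": ["UDP-D-glucose -> UDP-galactose"],
    "colanic acid building blocks biosynthesis": ["UDP-D-glucose -> UDP-galactose"],
}


def normalize(name):
    """Lowercase and strip punctuation/spacing so minor name variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_reactions(annotation):
    """Return (reactions, method), where method is 'ec', 'name', or 'manual'."""
    ec = annotation.get("ec")
    if ec:
        hits = [r for r, info in REACTIONS.items() if info["ec"] == ec]
        if hits:
            return hits, "ec"
    key = normalize(annotation["product"])
    hits = [r for r, info in REACTIONS.items()
            if key in {normalize(n) for n in info["enzyme_names"]}]
    if hits:
        return hits, "name"
    return [], "manual"   # left for a curator to assign by hand


def candidate_pathways(reactions):
    """All reference pathways containing at least one matched reaction."""
    return sorted(p for p, rxns in PATHWAYS.items() if set(rxns) & set(reactions))


gale = {"id": "CC0092", "gene": "galE",
        "product": "UDP-glucose-4-epimerase", "ec": "5.1.3.2"}
gale_reactions, gale_method = match_reactions(gale)
gale_pathways = candidate_pathways(gale_reactions)
```

The candidate list is deliberately over-inclusive at this stage; pathways with too little support are meant to be pruned afterwards.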
Insufficient Evidence
l No unique enzyme AND EITHER
l Only 1 reaction present for a pathway of more than 2 steps
l Set of reactions present is a subset of reactions present in another pathway
l There exists a variant pathway with more evidence
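A hedged reading of this rule as code: a candidate pathway is pruned when none of its detected reactions is unique to it and at least one of the three listed conditions holds. The sketch treats "no unique enzyme" as "no detected reaction that occurs only in this pathway" and measures evidence as the number of detected reactions; both are interpretations, and the function and pathway names are hypothetical.

```python
def insufficient_evidence(pathway, pathways, present, variants=None):
    """Return True if the candidate pathway should be pruned.

    pathways: pathway name -> set of reaction names
    present:  set of reactions with an enzyme found in the genome
    variants: pathway name -> names of its variant pathways (optional)
    """
    variants = variants or {}
    found = pathways[pathway] & present
    others = {p: rxns & present for p, rxns in pathways.items() if p != pathway}
    shared = set().union(*others.values())

    # "No unique enzyme": every detected reaction also occurs in another pathway.
    if found - shared:
        return False

    single_hit = len(found) == 1 and len(pathways[pathway]) > 2
    subsumed = any(found <= other for other in others.values())
    better_variant = any(len(others.get(v, ())) > len(found)
                         for v in variants.get(pathway, ()))
    return single_hit or subsumed or better_variant


# Toy example: pathway B is supported only by r3, which A already explains.
toy_pathways = {
    "A": {"r1", "r2", "r3"},
    "B": {"r3", "r4", "r5", "r6"},
}
toy_present = {"r1", "r2", "r3"}
keep_a = not insufficient_evidence("A", toy_pathways, toy_present)
prune_b = insufficient_evidence("B", toy_pathways, toy_present)
```

In the toy data, A is kept because r1 and r2 occur nowhere else, while B is pruned on both the single-reaction and the subset conditions.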
Pathway Complement
l Extends the paradigm of genome analysis
l Predicted genes placed in their biochemical context
l Information reduction device
l Assess coherence of the set of genes in a genome
l Identifies pathway holes and singleton enzymes
l Provides a framework for analysis of functional-genomics
Pathway Comparisons
       Eco   Mtb   Bsu   Hin   Sce   Hpy
Eco    130   103    92    90    84    73
Mtb          103    84    79    82    70
Bsu                 96    77    72    65
Hin                       90    67    61
Sce                             84    64
Hpy                                   74
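A comparison matrix of this kind can be computed by intersecting the pathway sets of each pair of PGDBs, with each organism's own pathway count on the diagonal. The sketch below uses made-up organisms and pathway sets purely to show the computation; the function name is hypothetical and the numbers bear no relation to the values on the slide.

```python
from itertools import combinations_with_replacement


def shared_pathway_counts(pgdbs):
    """Map each unordered organism pair (including self-pairs) to its shared pathway count."""
    names = list(pgdbs)
    return {(a, b): len(pgdbs[a] & pgdbs[b])
            for a, b in combinations_with_replacement(names, 2)}


# Toy pathway sets for three hypothetical organisms.
toy_pgdbs = {
    "org-A": {"glycolysis", "TCA cycle", "pentose phosphate pathway"},
    "org-B": {"glycolysis", "pentose phosphate pathway", "fermentation"},
    "org-C": {"glycolysis"},
}
counts = shared_pathway_counts(toy_pgdbs)
```

Only the upper triangle is stored because the intersection is symmetric, which matches the layout of the table.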
Summary
l Pathway/Genome Databases
l 14 PGDBs available through SRI at BioCyc.org
l Computational theories of biochemical machinery
l Pathway Tools software
l Extract pathways from genomes
l Distributed curation tools
l Query, visualization, WWW publishing
l Analysis algorithms
Acknowledgements
l SRI: Suzanne Paley, Pedro Romero, John Pick
l EcoCyc Project: Milton Saier, Julio Collado, Ian Paulsen, Monica Riley
l Stanford: Harley McAdams, Lucy Shapiro, Gary Schoolnik, Russ Altman
l Funding sources:
l NIH National Center for Research Resources
l Department of Energy Microbial Cell Project
l DARPA BioSpice, UPC
http://BioCyc.org/