Overview of Chemical Informatics and Cyberinfrastructure Collaboratory
October 18, 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington, IN 47401
gcf@indiana.edu
Activities
• Local teams, successful prototypes, and international collaboration set up in 3 initial major focus areas
  • Chemical Informatics Cyberinfrastructure/Grids with services, workflows, and demonstration uses, building on success in other applications (LEAD) and showing distributed integration of academic and commercial tools
  • Computational Chemistry Cyberinfrastructure/Grids with simulation, databases, and TeraGrid use
  • Education with courses and degrees
• Review of activities suggests we also formalize work in two further areas
  • Chemical Informatics Research: model applicability and data mining
  • Interfacing with the User: interaction tools and portals optimized for particular customer groups
• We have also started an activity to identify “customers” for Cyberinfrastructure and its implied Chemistry eScience model
CICC Senior Personnel
• Geoffrey C. Fox
• Mu-Hyun (Mookie) Baik
• Dennis B. Gannon
• Marlon Pierce
• Beth A. Plale
• Gary D. Wiggins
• David J. Wild
• Yuqing (Melanie) Wu
• Peter T. Cherbas
• Mehmet M. Dalkilic
• Charles H. Davis
• A. Keith Dunker
• Kelsey M. Forsythe
• Kevin E. Gilbert
• John C. Huffman
• Malika Mahoui
• Daniel J. Mindiola
• Santiago D. Schnell
• William Scott
• Craig A. Stewart
• David R. Williams
From Biology, Chemistry, Computer Science, and Informatics at IU Bloomington and IUPUI (Indianapolis)
CICC Infrastructure Vision
• Drug discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology
• ChemBioGrid is set up as distributed cyberinfrastructure in the eScience model
• ChemBioGrid will provide portals (user interfaces) to distributed databases, results of high throughput screening instruments, and results of computational chemical simulations and other analyses
• ChemBioGrid will provide services to manipulate this data and combine them in workflows; it will have convenient ways to submit and manage multiple jobs
• ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet, and its derivatives like Microsoft Academic Live and Google Scholar
• The services include open-source software like CDK, commercial code from vendors such as BCI, OpenEye, Gaussian, and Google, and any user-contributed programs
• ChemBioGrid will define open interfaces for each particular type of service, allowing plug-and-play choice between different implementations
CICC Combines Grid Computing with Chemical Informatics
CICC: Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
Science and Cyberinfrastructure
Large Scale Computing Challenges
Chemical informatics is a non-traditional area of high performance computing, but it offers many new, challenging problems to investigate.
CICC is an NIH-funded project to support the chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.
CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with
• World-class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists worldwide
[Figure: diagram with boxes labeled NIH PubMed DataBase, OSCAR Text Analysis, POVRay Parallel Rendering, Initial 3D Structure Calculation, Toxicity Filtering, Cluster Grouping, Docking, Molecular Mechanics Calculations, Quantum Mechanics Calculations, IU’s Varuna DataBase, NIH PubChem DataBase]
Chemical informatics text analysis programs can process hundreds of thousands of abstracts of online journal articles to extract chemical signatures of potential drugs.
OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.
Big Red (and the TeraGrid) will also enable us to perform time-consuming, multi-step quantum chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.
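The “pleasingly parallel” pattern can be sketched as a map over independent per-molecule tasks. The sketch below is illustrative only: `score_molecule` is a hypothetical stand-in for an expensive step such as docking, and the SMILES strings are toy inputs. It uses a thread pool to stay portable; for CPU-bound work one would more likely use a process pool or a batch system.

```python
from concurrent.futures import ThreadPoolExecutor

def score_molecule(smiles):
    """Hypothetical stand-in for an expensive per-molecule step (e.g. docking).

    It only counts the characters 'C' and 'c' so the example stays runnable.
    """
    return smiles, sum(ch in "Cc" for ch in smiles)

def score_all(smiles_list, workers=4):
    """Score every molecule independently; no task depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if tasks finish out of order.
        return list(pool.map(score_molecule, smiles_list))

# Toy inputs (illustrative values only).
molecules = ["CCO", "c1ccccc1", "CC(=O)O", "N"]
ranked = sorted(score_all(molecules), key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # top-ranked molecule and its toy score
```

Because each task touches only its own input, the same `score_all` shape carries over to larger pools or to separate jobs on a cluster without changing the per-molecule function.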
CICC Prototype Web Services
Basic cheminformatics
• Molecular weights
• Molecular formulae
• Tanimoto similarity
• 2D structure diagrams
• Molecular descriptors
• 3D structures
• InChI generation/search
• CMLRSS
• R and Excel
Application-based services
• Compare (NIH)
• Toxicity predictions (ToxTree)
• Literature extraction (OSCAR3)
• Clustering (BCI Toolkit)
• Docking, filtering, ... (OpenEye)
• Varuna simulation
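As an illustration of one of the basic services listed, Tanimoto similarity between two binary fingerprints can be sketched as follows. This is a minimal sketch, not the CICC service itself: fingerprints are represented here as sets of "on" bit indices, and the example bit values are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

# Toy fingerprints: indices of the bits that are set (illustrative values only).
query = {1, 4, 7, 9}
candidate = {1, 4, 8, 9, 12}
print(tanimoto(query, candidate))  # 3 shared bits out of 6 distinct bits -> 0.5
```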
Next steps?
• Define WSDL interfaces to enable global production of compatible Web services; refine CML
• Add more services (identify gaps)
• Add more databases, including 3D structural info
• Demonstrate use of services in other pipelining tools (KDE, Knime; Pipeline Pilot already done)
• Extend Computational Chemistry (Varuna) services
• Routine TeraGrid and Big Red use
• “Production” on OSCAR3, CDK, Gamess, Jaguar
• Develop more training material
Key Ideas
• Add value to PubChem with additional distributed services and databases
• Develop nifty ideas like VOTables
• Wrapping existing code in web services is not difficult
• Provide “core” (CDK) services and exemplars of typical tools
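To illustrate how little code wrapping an existing routine as a web service can take, here is a minimal sketch. It is an assumption-laden example, not the CICC implementation: it exposes a plain HTTP/JSON endpoint through Python's standard WSGI interface rather than a WSDL-described SOAP service, and `molecular_weight`, the query format, and the small mass table are hypothetical.

```python
import json
from urllib.parse import parse_qs

# Hypothetical "existing code" to expose; average atomic masses in g/mol.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(counts):
    """Sum atomic masses for a dict such as {"C": 2, "H": 6, "O": 1}."""
    return sum(ATOMIC_MASS[el] * n for el, n in counts.items())

def app(environ, start_response):
    """WSGI entry point: a query like ?C=2&H=6&O=1 returns the weight as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    try:
        counts = {el: int(vals[0]) for el, vals in query.items()}
        body = {"molecular_weight": round(molecular_weight(counts), 3)}
        status = "200 OK"
    except (KeyError, ValueError) as exc:
        # Unknown element symbol or non-integer count.
        body = {"error": f"bad input: {exc}"}
        status = "400 Bad Request"
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]

def serve(port=8000):
    """Run a development server; deliberately not called automatically."""
    from wsgiref.simple_server import make_server
    with make_server("", port, app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts a local development server; the wrapped routine itself is unchanged, which is the point of the wrapping approach.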
Web Service Locations
Indiana University
• Clustering
• VOTables
• OSCAR3
• Toxicity classification
• Database services
Penn State University (now moved to IU)
• CDK-based services
• Fingerprints
• Similarity calculations
• 2D structure diagrams
• Molecular descriptors
Cambridge University
• InChI generation / search
• CMLRSS
• OpenBabel
InfoChem
• SPRESI database
SDSC
• Typical TeraGrid site
NIH
Cheminformatics Education at IU
• Linked to bioinformatics in Indiana University’s School of Informatics
  • School of Informatics degree programs: BS, MS, PhD
• Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses
  • Bioinformatics MS and track in PhD
  • Chemical Informatics MS and track in PhD
  • Informatics undergraduates can choose a chemistry cognate (change to Life Sciences)
• PhD in Informatics started in August 2005 and offers tracks in
  • bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!
• Good employer interest but modest student understanding of the value of a Cheminformatics degree
• 3 core courses in Cheminformatics plus seminar/independent studies
• Significant interest in a distance education version of the introductory Cheminformatics course (enrollment promising in the Distance Graduate Certificate in Chemical Informatics)
Current Status
• Web site: http://www.chembiogrid.org
• Wiki chosen to support the project as a shared editable web space
• Building a Collaboratory involving PubChem: a Global Information System accessible anywhere and at any time; enhance PubChem with distributed tools (clustering, simulation, annotation, etc.) and data
• Adopted Taverna as the workflow system because it is popular in Bioinformatics, but we will evaluate other systems such as GPEL from LEAD
• Demonstrated CI-enhanced Chemistry simulations
• Initiated Data-mining, User interface, and Chemical Informatics tools research
• Prototyped a large set of runs on the local Big Red 23 Teraflop supercomputer (OSCAR3 and modeling, moving to CDK, Gamess, Jaguar)
• Initial results discussed at conferences/workshops/papers
  • Gordon Conferences, ACS, SDSC tutorial
• First new Cheminformatics courses offered
• Advisory board set up and met; this is its second meeting
• Videoconferencing-based meetings with Peter Murray-Rust and group at Cambridge roughly every 2-3 weeks
• Good or potentially good interactions with Local HTS in CGB, NIH DTP, Scripps, Lilly, and Michigan ECCR
MLSCN Post-HTS Biology Decision Support
Percent inhibition or IC50 data is retrieved from HTS
Question: Was this screen successful?
Question: What should the active/inactive cutoffs be?
Question: What can we learn about the target protein or cell line from this screen?
Compounds submitted to PubChem
Workflows encoding distribution analysis of screening results
Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, and annotation tools (Semantic Web, del.icio.us), and enhance lead ID and SAR analysis
A Grid of Grids linking collections of services at
• PubChem
• ECCR centers
• MLSCN centers
Workflows encoding plate and control well statistics, distribution analysis, etc.
Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.
CHEMINFORMATICS PROCESS
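The plate and control well statistics mentioned above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the MLSCN workflow: it computes the widely used Z'-factor from control wells, scales raw signals to percent inhibition, and applies a cutoff. The 50% cutoff and all well values are hypothetical.

```python
from statistics import mean, stdev

def z_prime(pos_controls, neg_controls):
    """Z'-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = 3 * (stdev(pos_controls) + stdev(neg_controls))
    return 1 - spread / abs(mean(pos_controls) - mean(neg_controls))

def percent_inhibition(signal, pos_controls, neg_controls):
    """Scale a raw signal so the negative control is 0% and the positive is 100%."""
    neg, pos = mean(neg_controls), mean(pos_controls)
    return 100 * (neg - signal) / (neg - pos)

def call_actives(signals, pos_controls, neg_controls, cutoff=50.0):
    """Return indices of wells at or above the cutoff (cutoff is an assumption)."""
    return [i for i, s in enumerate(signals)
            if percent_inhibition(s, pos_controls, neg_controls) >= cutoff]

# Made-up plate data: positive controls are fully inhibited (low signal),
# negative controls are uninhibited (high signal).
pos = [10.0, 12.0, 11.0, 9.0]
neg = [100.0, 98.0, 102.0, 100.0]
wells = [95.0, 40.0, 12.0, 70.0]

print(round(z_prime(pos, neg), 2))
print(call_actives(wells, pos, neg))
```

A Z' above roughly 0.5 is conventionally read as a well-separated assay, which speaks to whether the screen was successful; choosing the active/inactive cutoff itself remains a judgment that distribution analysis is meant to inform.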
Example HTS workflow: finding cell-protein relationships
• A protein implicated in tumor growth with a known ligand is selected (in this case HSP90, taken from the PDB 1Y4 complex)
• The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand
• Similar structures to the ligand can be browsed using client portlets
• Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein
• Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet
• Docking results and activity patterns are fed into R services for building activity models and correlations: Least Squares Regression, Random Forests, Neural Nets
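The simplest of the modeling options named in this workflow, least squares regression of activity against docking score, can be sketched as below. The workflow itself uses R services; this Python version is only an illustration, and the docking scores and activities are made-up values.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data: more negative docking scores paired with higher inhibition.
docking_scores = [-9.1, -8.4, -7.7, -6.2, -5.0]
inhibition = [88.0, 75.0, 70.0, 41.0, 30.0]

slope, intercept = fit_line(docking_scores, inhibition)
print(slope, intercept, r_squared(docking_scores, inhibition, slope, intercept))
```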
Varuna environment for molecular modeling (Baik, IU)
[Diagram labels: Researcher; Simulation Service (FORTRAN code, scripts); QM Database; QM/MM Database; Chemical Concepts; Experiments; Papers etc.; PubChem, PDB, NCI, etc.; ChemBioGrid; Reaction DB; DB Service (queries, clustering, curation, etc.); Condor; TeraGrid]
Methods Development at the CICC
• Tagging methods for web-based annotation exploiting del.icio.us and Connotea
• Development of QSAR model interpretability and applicability methods
• RNN-Profiles for exploration of chemical spaces
• VisualiSAR: SAR through visual analysis
  • See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
• Visual Similarity Matrices for High Volume Datasets
  • See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php
• Fast, accurate clustering using parallel Divisive K-means
• Mapping of Natural Language queries to use cases and workflows
• Advanced data mining models for drug discovery information
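For readers unfamiliar with divisive K-means, the general idea can be sketched as bisecting k-means: start with one cluster and repeatedly split the cluster with the largest within-cluster error using 2-means. This is a serial, pure-Python illustration of that general technique with a deterministic seeding rule chosen here for reproducibility; it is not the CICC parallel implementation, and the points are toy data.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sse(points):
    """Sum of squared distances of points to their centroid."""
    c = centroid(points)
    return sum(dist2(p, c) for p in points)

def two_means(points, iterations=20):
    """Split one cluster in two with 2-means, seeded from two far-apart points."""
    c0 = centroid(points)
    seed_a = max(points, key=lambda p: dist2(p, c0))
    seed_b = max(points, key=lambda p: dist2(p, seed_a))
    centers = [seed_a, seed_b]
    best = None
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearer_b = dist2(p, centers[1]) < dist2(p, centers[0])
            halves[1 if nearer_b else 0].append(p)
        if not halves[0] or not halves[1]:
            break  # keep the last valid split
        best = halves
        new_centers = [centroid(h) for h in halves]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return best

def divisive_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest error until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        if sse(worst) == 0:
            break  # remaining clusters cannot be split further
        left, right = two_means(worst)
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters

# Toy 2D descriptor vectors (illustrative values only).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
for cluster in divisive_kmeans(data, 3):
    print(sorted(cluster))
```

Each bisection works on one cluster independently of the others, which is one reason the divisive form lends itself to parallel execution.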
Structure of Proposal
• a) Define the audience that we are targeting
• b) Cyberinfrastructure Framework with key services
  • Registry, computing, portal, workflow
  • Exemplar Chemoinformatics services
  • Exemplar workflows using services
  • Defined WSDL for key cases to allow others to contribute
  • Tutorial
• c) Education
• d) IT/Cyber-enhanced Computational Chemistry
• e) Cheminformatics Research
  • Systems
  • Tools and Modeling
Questions
• We expect to respond to a “big” NIH RFP in about 4 months
• Should we partner with Michigan?
• Who is the “customer” and how do we get more?
  • Do/should chemists want our or, more generally, NIH’s product?
  • Interactions with “large” and “small” industry
• What is the balance between infrastructure, computational chemistry, Cheminformatics tools and research, and chemical informatics systems and interfaces?
• Should we stress the literature (OSCAR3) project?
• Balance of applications and generic capabilities?
• How should we structure the education component?
  • Field does not have strong student appeal compared to Bioinformatics
• We are strong in Computer Science (Grids/Cyberinfrastructure), but it is doubtful that there will be any CS reviewers
  • We are strong in Cheminformatics systems, but it is not clear this is a recognized activity; how do we justify the claim that Grids/Cyberinfrastructure/Open Access are “good”?
• Should we link more with biology?
Covering our bases: Who are our “Customers”?
What do we need to conquer the traditional chemical research community?
• High-fidelity structural data, redox potentials, spectroscopy, transition state structures, energies, molecular orbitals, ...
“Departments” of the future Center
Infrastructure/Technology Developers and Providers
Build Cyberinfrastructure, design databases, workflow, support Web services with interface standards, wrap codes as services;
Support infrastructure
Application Scientists (Customers)
Core group develops requirements for infrastructure and codes as services and tests infrastructure with key exemplar projects. Allow broad use by all