Indiana University School of
Chemoinformatics
David Wild, djwild@indiana.edu
Current state of chemoinformatics research
• What works and what doesn’t
– Fingerprints, clustering and diversity
– QSAR - predictive and descriptive methods, virtual screening
– 3D similarity, pharmacophores & docking
– Visualization, organization and navigation of chemical datasets
• Current buzz areas in chemoinformatics
What works and what doesn’t
• 2D structure and similarity searching well established
– Lots of papers comparing fingerprints for similarity
– Some recent evidence Scitegic ECFPs better for recall of actives
• Clustering well established but definite room for improvement
– Traditional methods: Ward’s, K-means, Jarvis-Patrick
– Recently single pass similarity cutoff methods used for very fast organization - >0.85 for similar activity, >0.55 for QSAR
– Data mining methods - ROCK, Chameleon, Cure, etc unexplored
– Diversity hot -> cold -> smart
• QSAR - poor relation of academic work to industry usefulness
– Lots of papers: “this method works best on this dataset”
– Random forests appear practically to work rather well
– Interpretability vs predictive ability
– Predictive methods for LogP, pKa, solubility, etc work reasonably
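
The single pass similarity cutoff idea can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of on-bit indices, uses the Tanimoto coefficient as the similarity measure, and uses a simple "leader" scheme (each compound joins the first cluster whose leader exceeds the cutoff). The function names and the toy fingerprints are illustrative assumptions, not the specific methods referred to on the slide.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_clusters(fps: List[Set[int]], cutoff: float = 0.85) -> List[List[int]]:
    """One pass over the data: join the first cluster whose leader is above the cutoff."""
    leaders: List[int] = []          # index of the first member of each cluster
    clusters: List[List[int]] = []   # member indices per cluster
    for i, fp in enumerate(fps):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fps[leader]) > cutoff:
                clusters[c].append(i)
                break
        else:
            # No leader was similar enough, so this compound starts a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (assumed data, for illustration only).
fingerprints = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11},
    {20, 21, 22, 23},
    {1, 2, 3, 4, 5, 6, 30, 31, 32, 33},
]
print(single_pass_clusters(fingerprints, cutoff=0.85))
```

Because each compound is compared only against cluster leaders, the result depends on input order; that is the usual trade-off for the speed of a single pass.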
What works and what doesn’t
• Mostly, 3D methods haven’t worked out yet
– Similarity & QSAR - Almost every paper: 2D better for recall and precision but 3D methods give “interesting ideas”. Useful for “lead hopping”
– Pharmacophore searching not widely used
– Docking - very useful for visual inspection, poor correlation of scoring functions with binding
• Visualization, organization and navigation of datasets
– Still not clear how to work with datasets > few hundred compounds
– Dot plots, spreadsheet-based methods work minimally
The current buzz in chemoinformatics
• Decorporatization and commoditization of data and software
– MLSCN, PubChem, open source, small companies
– Crisis for the software companies, nice for academia
– Pharma companies in the brown stuff without a paddle
• Integration with other “ics”
– Data mining chemical/genomic information
– Linking compounds -> proteins -> pathways, etc (e.g. KEGG)
• Fuzzy boundaries, integration with science and informatics
– Microsoft 2020 vision for science
• Integration of text and structure searching
Suggested collaboration areas
• Chem/bio/complex systems mashups using web services in each of the areas: nice, confined projects for students once you have the infrastructure
• Chem and complex can work together on integrating text and structure-based searching, indexing and crawling (e.g. networks of web services and databases), and intelligent agents
• Data mining of chemogenomic information
• Integration of advanced chemoinformatics methods with systems biology and pathway mapping tools
• Performing research to establish best practices for areas of chemoinformatics
Cyberinfrastructure
Geoffrey Fox
Cyberinfrastructure
• Supports distributed science – data, people, computers
• Exploits Internet technology (Web2.0) adding (via Grid technology) management, security, supercomputers etc.
• It has two aspects: parallel – low latency (microseconds) between nodes and distributed – highish latency (milliseconds) between nodes
• Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem
• Distributed aspect integrates already distinct components
• Cyberinfrastructure is in general a distributed collection of parallel systems
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.
Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.
Cyberinfrastructure at IU
• Interpreted broadly (Web presences), there are many activities at IU
• Interpreted narrowly as the “programmable web” or “using Grid technologies” there are large projects in atmospheric, earthquake, ice-sheet sciences, network systems, particle physics, Crystallography and Cheminformatics
• IU has an international reputation in both parallel and distributed Cyberinfrastructure including education, research and resources
• IU has #31 Supercomputer in world and is part of two major National activities: TeraGrid and Open Science Grid
• There are several well known Bioinformatics Grids such as BIRN (mainly images) and caBIG (cancer databases) from NIH and MyGrid from UK (EBI)
• Could be opportunities to link Biology and Informatics/CS in Cyberinfrastructure motivated by Web 2.0
• Capture the power of interactive Web/Grid sites enabling people to create, collaborate and build on each other’s work
Programmableweb.com: 363 Web 2.0 API’s
Web services, workflows, portals and ontologies
• Web Services allow us to quickly develop and deploy new tools, interfaces that cross disciplines and are broadly accessible
– Can use simple HTTP and ignore Web Service complications
• Workflows (called mashups in Web 2.0) allow us to string together collections of web services to do computation that is tailored to the science (as a one-off or for re-use).
– Develop core capabilities as services and use in many different ways as in 770 Google map mashups
• API’s/Languages/Data structures/Ontologies (WSDL AJAX JSON at low level) allow us to describe workflows and services in discoverable, standard ways, such that reasoning tools can piece them together to match queries
• Portals enable composable reusable user interfaces
• Distributed posting of services and easily available composition tools enable “everybody” to contribute
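
A minimal sketch of the workflow idea: core capabilities are written once as services and then strung together in different ways. Here each "service" is a plain Python function standing in for a remote call; `workflow`, `lookup_structure`, `count_heavy_atoms` and the lookup table are hypothetical names and data introduced only for illustration, not part of any infrastructure described in these slides.

```python
from functools import reduce
from typing import Any, Callable, Dict

Record = Dict[str, Any]
Service = Callable[[Record], Record]


def workflow(*services: Service) -> Service:
    """Chain services left to right; each one receives the previous output."""
    def run(record: Record) -> Record:
        return reduce(lambda data, service: service(data), services, record)
    return run


# Hypothetical stand-ins for remote services; a real one would be an HTTP request.
def lookup_structure(record: Record) -> Record:
    smiles_by_name = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
    return {**record, "smiles": smiles_by_name[record["name"]]}


def count_heavy_atoms(record: Record) -> Record:
    # Toy rule: one upper-case letter per atom. Only valid for SMILES without
    # aromatic lower-case atoms or bracketed hydrogens.
    atoms = sum(1 for ch in record["smiles"] if ch.isupper())
    return {**record, "heavy_atoms": atoms}


pipeline = workflow(lookup_structure, count_heavy_atoms)
print(pipeline({"name": "aspirin"}))
```

Keeping every service to the same record-in, record-out shape is what makes the steps freely recombinable, which is the property the mashup bullet relies on.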
Model and Data Sharing
• Cyberinfrastructure requires agreed sharing standards (data structures, API’s, protocols, ontologies, languages) as intrinsically internationally distributed
• There are agreed data structures for taking Sequence -> Protein -> Folding -> Interaction transparently, e.g. BLAST
• Nothing at the level where genomics and proteomics is important: cells and tissues.
• Partial answers: CellML, FieldML, SBML which do not link to relevant standards outside Biology
• Need to connect models at these levels. Need Standard ontologies/data structures for cell behaviors to allow connections and validation
• Need to connect Models like SBW (Systems Biology Workbench)/BioSpice -> Cell-level models (Compucell) -> Tissue level models (Physiome)
• Model builders at these scales not CS-sophisticated. Models NOT interoperable and don’t use useful general ideas
• Glazier organizing activity in this area with H. Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH
• Link to Open Grid Forum standard setting and community
http://www.chembiogrid.org
• Database enabled quantum chemistry computations
• Services to link PubChem, Supercomputers, results of high throughput Screening centers
• Education; IU has unique Cheminformatics degrees
Chemical Informatics web service infrastructure
• Database Services
– Local NIH DTP Human Tumor Cell Line set
– Local PubChem mirror
– Derived properties database
– Pub3D, PubDock
– Synonym service
– VARUNA quantum chemistry database
• Statistics (based on R)
– Regression, Neural Nets, Random Forest
– LDA
– K-means clustering
– Plotting
– T-test and distribution sampling
• Computation Services
– OpenEye FRED, OMEGA, FILTER, …
– Cambridge OSCAR3
– BCI fingerprint generation, Ward’s, Divisive K-means clustering
– Tox Tree
– Similarity & fingerprint calculations (CDK)
Kemo - A ChatBot for PubChem
• Uses ALICE chatbot www.alicebot.org
• AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc
• Can iteratively improve knowledge base
• Accesses PubChem
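
To make the phrase-matching step concrete, here is a rough sketch that mimics AIML-style patterns with regular expressions. It is not Kemo's code: AIML itself expresses such rules as `<category>` elements containing a `<pattern>` (e.g. `WHAT IS THE LOGP OF *`) and a `<template>`. The pattern list, intent names and functions below are assumptions for illustration, and the actual PubChem lookup is deliberately left out.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns modeled on the phrases named above; "(.+)" plays the
# role of the AIML "*" wildcard.
PATTERNS = [
    (re.compile(r"^WHAT IS THE LOGP OF (.+)$"), "logp"),
    (re.compile(r"^FIND ME (.+)$"), "search"),
]


def normalize(text: str) -> str:
    """Upper-case the input, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s-]", "", text).upper()
    return " ".join(cleaned.split())


def parse(text: str) -> Optional[Tuple[str, str]]:
    """Return (intent, argument) for the first matching pattern, or None."""
    normalized = normalize(text)
    for pattern, intent in PATTERNS:
        match = pattern.match(normalized)
        if match:
            return intent, match.group(1)
    return None


print(parse("What is the logP of aspirin?"))
print(parse("find me   caffeine"))
print(parse("hello there"))
```

Improving the knowledge base iteratively then amounts to appending patterns for phrases that users actually typed but that fell through to `None`.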
Workflow in Xbaya - a meteorology tool!
Indexing the world’s chemical information AND computational functionality
• Crawl and index web pages, journal articles, etc. for
– Structures (InChIs, SMILES)
– Images (converted using Clide or ChemReader)
– Names (converted using OSCAR3 or similar package)
– Other information (IR spectra, reactions, etc…)
• Technology still immature, but improving quickly
• Problem with access to journal articles: we will assume open access in the future!
• Expose computational functionality as web services, contextualize in an OWL-S ontology (semantics), and publish in a UDDI
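
The structure-indexing step can be sketched for the simplest case, InChI strings that appear literally in page text. The regular expression is a heuristic assumption (it takes everything up to the next whitespace and trims trailing sentence punctuation, so an InChI wrapped in parentheses would need extra handling), and the page identifiers and texts are made-up examples. Images and chemical names would need the conversion tools listed above and are not covered.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Heuristic: an InChI starts with "InChI=1/" or "InChI=1S/" and runs to whitespace.
INCHI_RE = re.compile(r"InChI=1S?/\S+")


def extract_inchis(text: str) -> List[str]:
    """Find InChI strings in free text, trimming trailing sentence punctuation."""
    return [match.rstrip(".,;") for match in INCHI_RE.findall(text)]


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: InChI -> set of page identifiers that mention it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for inchi in extract_inchis(text):
            index[inchi].add(page_id)
    return dict(index)


# Made-up crawled pages (assumed data, for illustration only).
pages = {
    "page-a": "Ethanol, InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3, is a common solvent.",
    "page-b": "Solvents used: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3; InChI=1S/H2O/h1H2.",
}
for inchi, page_ids in build_index(pages).items():
    print(inchi, sorted(page_ids))
```

A structure query then reduces to a dictionary lookup on the InChI, which is what lets it sit alongside an ordinary text index.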