Cheminformatics and Cyberinfrastructure

(1)

Indiana University School of

Chemoinformatics

David Wild, djwild@indiana.edu

(2)

Current state of chemoinformatics research

• What works and what doesn’t

– Fingerprints, clustering and diversity

– QSAR: predictive and descriptive methods, virtual screening

– 3D similarity, pharmacophores & docking

– Visualization, organization and navigation of chemical datasets

• Current buzz areas in chemoinformatics

(3)

What works and what doesn’t

• 2D structure and similarity searching well established

– Lots of papers comparing fingerprints for similarity

– Some recent evidence that SciTegic ECFPs are better for recall of actives

• Clustering well established but definite room for improvement

– Traditional methods: Ward’s, K-means, Jarvis-Patrick

– Recently, single-pass similarity cutoff methods used for very fast organization: >0.85 for similar activity, >0.55 for QSAR

– Data mining methods (ROCK, Chameleon, CURE, etc.) unexplored

– Diversity: hot -> cold -> smart

• QSAR: poor relation of academic work to industry usefulness

– Lots of papers: “this method works best on this dataset”

– Random forests appear to work rather well in practice

– Interpretability vs predictive ability

– Predictive methods for LogP, pKa, solubility, etc work reasonably
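The single-pass similarity cutoff idea mentioned above can be sketched as follows. This is a minimal illustration, not the method used in any particular package: it assumes fingerprints are given as sets of on-bit positions, uses the Tanimoto coefficient as the similarity measure, and assigns each compound to the first cluster whose leader exceeds the cutoff. The function names and the toy fingerprints are hypothetical.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fps: List[Set[int]], cutoff: float = 0.85) -> List[List[int]]:
    """One pass over the data: join the first cluster whose leader is more than
    `cutoff` similar, otherwise start a new cluster with this item as leader."""
    leaders: List[int] = []          # index of the leader of each cluster
    clusters: List[List[int]] = []   # member indices per cluster
    for i, fp in enumerate(fps):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fps[leader]) > cutoff:
                clusters[c].append(i)
                break
        else:
            # No leader was similar enough, so this item founds a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (made-up on-bit positions, not real molecules).
toy_fps = [
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 11},   # shares 9 of 11 distinct bits with the first
    {20, 21, 22, 23},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},   # identical to the first
]
clusters_085 = single_pass_cluster(toy_fps, cutoff=0.85)
clusters_055 = single_pass_cluster(toy_fps, cutoff=0.55)
```

With the looser 0.55 cutoff the second toy fingerprint joins the first cluster; with 0.85 it does not. The result depends on input order, which is the usual price of a single pass.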

(4)

What works and what doesn’t

• Mostly, 3D methods haven’t worked out yet

– Similarity & QSAR - Almost every paper: 2D better for recall and precision but 3D methods give “interesting ideas”. Useful for “lead hopping”

– Pharmacophore searching not widely used

– Docking - very useful for visual inspection, poor correlation of scoring functions with binding

• Visualization, organization and navigation of datasets

– Still not clear how to work with datasets > few hundred compounds

– Dot plots, spreadsheet-based methods work minimally

(5)

The current buzz in chemoinformatics

• Decorporatization and commoditization of data and software

– MLSCN, PubChem, open source, small companies

– Crisis for the software companies, nice for academia

– Pharma companies in the brown stuff without a paddle

• Integration with other “ics”

– Data mining chemical/genomic information

– Linking compounds -> proteins -> pathways, etc (e.g. KEGG)

• Fuzzy boundaries, integration with science and informatics

– Microsoft 2020 vision for science

• Integration of text and structure searching

(6)

Suggested collaboration areas

• Chem/bio/complex systems mashups using web services in each of the areas: nice, confined projects for students once you have the infrastructure

• Chem and complex can work together on integrating text and structure-based searching, indexing and crawling (e.g. networks of web services and databases), and intelligent agents

• Data mining of chemogenomic information

• Integration of advanced chemoinformatics methods with systems biology and pathway mapping tools

• Performing research to establish best practices for areas of chemoinformatics

(7)

Cyberinfrastructure

Geoffrey Fox

(8)

Cyberinfrastructure

• Supports distributed science: data, people, computers

• Exploits Internet technology (Web 2.0), adding (via Grid technology) management, security, supercomputers etc.

• It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with highish latency (milliseconds) between nodes

• Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem

• Distributed aspect integrates already distinct components

• Cyberinfrastructure is in general a distributed collection of parallel systems

(9)

TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Today 100 teraflops; tomorrow a petaflop; Indiana 20 teraflops today.

(10)

Cyberinfrastructure at IU

• Interpreted broadly (Web presences), there are many activities at IU

• Interpreted narrowly as the “programmable web” or “using Grid technologies”, there are large projects in atmospheric, earthquake, ice-sheet sciences, network systems, particle physics, crystallography and cheminformatics

– IU has an international reputation in both parallel and distributed Cyberinfrastructure, including education, research and resources

– IU has the #31 supercomputer in the world and is part of two major national activities: TeraGrid and Open Science Grid

• There are several well-known Bioinformatics Grids, such as BIRN (mainly images) and caBIG (cancer databases) from NIH, and MyGrid from the UK (EBI)

• Could be opportunities to link Biology and Informatics/CS in

(11)

Cyberinfrastructure motivated by Web 2.0

• Capture the power of interactive Web/Grid sites, enabling people to create, collaborate and build on each other’s work

Programmableweb.com

363 Web 2.0 APIs

(12)

Web services, workflows, portals and ontologies

• Web Services allow us to quickly develop and deploy new tools, interfaces that cross disciplines and are broadly accessible

– Can use simple HTTP and ignore Web Service complications

• Workflows (called mashups in Web 2.0) allow us to string together collections of web services to do computation that is tailored to the science (as a one-off or for re-use)

– Develop core capabilities as services and use in many different ways, as in 770 Google map mashups

• API’s/Languages/Data structures/Ontologies (WSDL, AJAX, JSON at low level) allow us to describe workflows and services in discoverable, standard ways, such that reasoning tools can piece them together to match queries

• Portals enable composable reusable user interfaces

• Distributed posting of services and easily available composition tools enable “everybody” to contribute
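The idea of stringing services together into a workflow can be illustrated with a small sketch. It assumes each service accepts and returns a JSON message; here the "services" are local Python functions standing in for remote HTTP endpoints, and all names (`name_to_smiles`, `count_heavy_atoms`, `run_workflow`) and the tiny lookup table are hypothetical, not part of any real service.

```python
import json
from typing import Any, Callable, Dict, List

# A "service" here is any callable that maps one JSON-like dict to another.
Service = Callable[[Dict[str, Any]], Dict[str, Any]]


def name_to_smiles(msg: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a name lookup service (toy table, two entries only)."""
    lookup = {"ethanol": "CCO", "benzene": "c1ccccc1"}
    return {**msg, "smiles": lookup[msg["name"]]}


def count_heavy_atoms(msg: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a property service. Counting letters is only valid for
    toy SMILES made of single-letter atoms; it is not a real SMILES parser."""
    n_atoms = sum(ch.isalpha() for ch in msg["smiles"])
    return {**msg, "heavy_atoms": n_atoms}


def run_workflow(steps: List[Service], msg: Dict[str, Any]) -> Dict[str, Any]:
    """Pass a message through each service in turn, serializing to JSON
    between steps to mimic what would travel over HTTP."""
    for step in steps:
        wire = json.dumps(step(msg))   # what a remote service would send back
        msg = json.loads(wire)
    return msg


workflow = [name_to_smiles, count_heavy_atoms]
result = run_workflow(workflow, {"name": "ethanol"})
```

Because every step shares the same message shape, the same services can be recombined in a different order or reused in another workflow, which is the point of developing core capabilities as services.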

(13)

Model and Data Sharing

• Cyberinfrastructure requires agreed sharing standards (data structures, API’s, protocols, ontologies, languages) as intrinsically internationally distributed

• There are agreed data structures for taking Sequence -> Protein -> Folding -> Interaction transparently, e.g. BLAST

• Nothing at the level where genomics and proteomics is important: cells and tissues

• Partial answers: CellML, FieldML, SBML, which do not link to relevant standards outside Biology

• Need to connect models at these levels. Need standard ontologies/data structures for cell behaviors to allow connections and validation

• Need to connect models like SBW (Systems Biology Workbench)/BioSpice -> cell-level models (Compucell) -> tissue-level models (Physiome)

• Model builders at these scales not CS-sophisticated. Models NOT interoperable and don’t use useful general ideas

• Glazier organizing activity in this area with H. Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH

– Link to Open Grid Forum standard setting and community

(14)

http://www.chembiogrid.org

• Database-enabled quantum chemistry computations

• Services to link PubChem, supercomputers, results of high-throughput screening centers

• Education; IU has unique Cheminformatics degrees

(15)

Chemical Informatics web service infrastructure

• Database Services

– Local NIH DTP Human Tumor Cell Line set

– Local PubChem mirror

– Derived properties database

– Pub3D, PubDock

– Synonym service

– VARUNA quantum chemistry database

• Statistics (based on R)

– Regression, Neural Nets, Random Forest

– LDA

– K-means clustering

– Plotting

– T-test and distribution sampling

• Computation Services

– OpenEye FRED, OMEGA, FILTER, …

– Cambridge OSCAR3

– BCI fingerprint generation, Ward’s, Divisive K-means clustering

– Tox Tree

– Similarity & fingerprint calculations (CDK)

(16)

(17)

(18)

(19)

Kemo - A ChatBot for PubChem

• Uses ALICE chatbot (www.alicebot.org)

• AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc

• Can iteratively improve knowledge base

• Accesses PubChem
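The phrase-matching behaviour described above can be approximated in a few lines. This is only a rough sketch of AIML-style wildcard patterns, not the ALICE engine or Kemo itself: the rule list, the handler functions and the `TOY_LOGP` table (with made-up values for made-up compound names) are all assumptions, and a real bot would query PubChem instead of a local dictionary.

```python
import re
from typing import Callable, List, Pattern, Tuple

# Made-up values for made-up compounds; a real bot would look these up remotely.
TOY_LOGP = {"COMPOUND A": 2.0, "COMPOUND B": -0.5}


def answer_logp(name: str) -> str:
    if name in TOY_LOGP:
        return f"The LogP of {name} is {TOY_LOGP[name]}"
    return f"I do not know the LogP of {name}"


def answer_find(name: str) -> str:
    return f"Searching for {name}"


# AIML-like patterns: upper-case text with '*' as a wildcard.
RULES: List[Tuple[str, Callable[[str], str]]] = [
    ("WHAT IS THE LOGP OF *", answer_logp),
    ("FIND ME *", answer_find),
]


def compile_pattern(pattern: str) -> Pattern[str]:
    """Turn 'FIND ME *' into an anchored regex with one capture per wildcard."""
    body = "(.+)".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + "$")


COMPILED = [(compile_pattern(p), handler) for p, handler in RULES]


def normalize(text: str) -> str:
    """Upper-case, drop punctuation and collapse whitespace before matching."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", text.upper())
    return " ".join(cleaned.split())


def reply(text: str) -> str:
    query = normalize(text)
    for regex, handler in COMPILED:
        match = regex.match(query)
        if match:
            return handler(match.group(1))
    return "I do not understand"
```

Improving the knowledge base iteratively then amounts to appending new (pattern, handler) pairs to `RULES` as unanswered phrases are observed.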

(20)

Workflow in Xbaya - a meteorology tool!

(21)

Indexing the world’s chemical information AND computational functionality

• Crawl and index web pages, journal articles, etc. for

– Structures (InChIs, SMILES)

– Images (converted using Clide or ChemReader)

– Names (converted using OSCAR3 or similar package)

– Other information (IR spectra, reactions, etc…)

• Technology still immature, but improving quickly

• Problem with access to journal articles: we will assume open access in the future!

• Expose computational functionality as web services, contextualize in an OWL-S ontology (semantics), and publish in a UDDI