
E-Chemistry and Web 2.0

Marlon Pierce

mpierce@cs.indiana.edu Community Grids Lab


One Talk, Two Projects

£ NIH funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU.

¤ Geoffrey Fox

¤ Gary Wiggins

¤ Rajarshi Guha

¤ David Wild

¤ Mookie Baik

¤ Kevin Gilbert

¤ And others

£ Proposed Microsoft-Funded Project: E-Chemistry

¤ Carl Lagoze (Cornell),

¤ Lee Giles (PSU),

¤ Steve Bryant (NIH),

¤ Jeremy Frey (Soton),

¤ Peter Murray-Rust (Cambridge),

¤ Herbert Van de Sompel (Los Alamos),

¤ Geoffrey Fox (Indiana)

¤ And others


CICC Infrastructure Vision

£ Chemical Informatics: drug discovery and other academic chemistry,

pharmacology, and bioinformatics research will be aided by powerful,

modern, open, information technology.

¤ NIH PubChem and PubMed provide unprecedented open, free data and


¤ We need a corresponding open service architecture (i.e. avoid stove-piped


¤ CICC set up as distributed cyberinfrastructure in eScience model

£ Web clients (user interfaces) to distributed databases, results of high

throughput screening instruments, results of computational chemical simulations and other analyses.

¤ Composed of clients to open service APIs (mash-ups)

¤ Aggregated into portals

£ Web services manipulate this data and are combined into workflows.


CICC Databases


Most of our databases aim to add value to

PubChem or link into PubChem

¤ 1D (SMILES) and 2D structures


3D structures (MMFF94)

¤ Searchable by CID, SMARTS, 3D similarity


Docked ligands (FRED, Autodock)

¤ 906K drug-like compounds into 7 ligands

¤ Will eventually cover ~2000 targets


Building Up the Infrastructure


Our SOA philosophy: use standard Web services.

¤ Mostly stateless

¤ Some cluster, HPC work needed but these populate databases


Services are aggregate-able into different


¤ Taverna, Pipeline Pilot, …


You can also build lots of Web clients.





Web_Resources for links and details.


Sample Services


More Clients…


Example: PubDock

£ Database of approximately 1 million PubChem structures (the most drug-like) docked into

proteins taken from the PDB

£ Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot

£ Several interfaces developed, including one based on Chimera (right) which integrates the

database with the PDB to allow browsing of compounds in

different targets, or different compounds in the same target

£ Can be used as a tool to help understand molecular basis of activity in cellular or image based assays


Example: R Statistics applied to

PubChem data

£ By exposing the R statistical package, and the Chemistry Development Kit

(CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of

PubChem assay data

£ Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications


Example assay screening

workflow: finding cell-protein


A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

Similar structures to the ligand can be browsed using client

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Docking results and activity patterns are fed into R services for building of activity models and correlations (Least Squares Regression, Random Forests, Neural Nets)
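The workflow above filters similar structures for drugability before docking, but the slides do not say which filter is applied. As a stand-in, the sketch below uses Lipinski's rule of five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors), tolerating one violation. The `Descriptors` type and the idea that descriptor values arrive precomputed (for example from a CDK descriptor service) are assumptions for illustration only.

```python
from typing import NamedTuple


class Descriptors(NamedTuple):
    """Precomputed descriptors for one compound (hypothetical input shape)."""
    mol_weight: float       # daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits the compound exceeds."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Pass compounds with at most `max_violations` rule-of-five violations."""
    return lipinski_violations(d) <= max_violations


def filter_drug_like(compounds: dict) -> list:
    """Return the IDs of compounds that pass, preserving input order."""
    return [cid for cid, d in compounds.items() if is_drug_like(d)]
```

Only compounds returned by `filter_drug_like` would go on to 3D conversion and docking; the threshold of one allowed violation is a common convention, not something stated in the slides.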


Relevance to Web 2.0


Some Web 2.0 Key Features

¤ REST Services

¤ Use of RSS/Atom feeds

¤ Client interfaces are “mashups”

¤ Gadgets, widgets for portals aggregate clients



¤ We provide RSS as an alternative WS format.

¤ We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds.

¤ CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes.


RSS Feeds/REST Services


Provide access to DB's via RSS feeds


Feeds include 2D/3D structures in CML


Viewable in Bioclipse, Jmol as well as Sage etc.


Two feeds currently available

¤ SynSearch – get structures based on full or partial chemical names

¤ DockSearch – get best N structures for a target


Really hampered by size of DB and Postgres


Tools and mashups based on web

service infrastructure


Mining information from journal


£ Until now SciFinder / CAS only chemistry-aware portal into journal information

£ We can access full text of journal articles online (with subscription)

£ ACS does not make full text available … but there are ways round that!

£ RSC is now marking up with SMILES and GO/Goldbook terms!

¤ www.projectprospect.org

£ Having SMILES or InChI means that we can build a

similarity/structure searchable database of papers: e.g. “find me all the papers published since 2000 which

contain a structure with >90% similarity to this one”

£ In the absence of full text, we can at least use the abstract
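The query quoted above ("all the papers published since 2000 which contain a structure with >90% similarity to this one") can be sketched with the Tanimoto coefficient over structure fingerprints. This is a minimal in-memory illustration, not the gNova/PostgreSQL implementation: the `Paper` record, the representation of a fingerprint as a set of "on" bit positions, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    doi: str             # hypothetical identifier for the article
    year: int            # publication year
    fingerprints: tuple  # one frozenset of "on" bit positions per structure


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_papers(papers, query, since=2000, threshold=0.9):
    """Return (doi, best similarity) for papers from `since` onward that
    contain at least one structure more than `threshold` similar to `query`,
    most similar first."""
    hits = []
    for paper in papers:
        if paper.year < since:
            continue
        best = max((tanimoto(query, fp) for fp in paper.fingerprints),
                   default=0.0)
        if best > threshold:
            hits.append((paper.doi, best))
    return sorted(hits, key=lambda hit: -hit[1])
```

In a real deployment the fingerprints would be derived from the SMILES or InChI strings attached to each paper, and the comparison would be pushed into the database rather than looped over in Python.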


Text Mining: OSCAR

£ A tool for shallow, chemistry-specific natural

language parsing of chemical documents (e.g. journal articles).

£ It identifies (or attempts to identify):

¤ Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.

¤ Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.

¤ Other entities: Things like N(5)-C(3) and so on.

£ Part of the larger SciBorg effort

¤ See



Mash-Up: What published compounds might bind to this protein?

Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000)

Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES

Convert molecules to 3D and dock into a protein of interest

Visualize top docked molecules in a Google-like interface





E-Chemistry and Digital Libraries


Key problem with our SOA-based e-Science is

information management.

¤ Where is the service that I need?

¤ What does it do?


We may consider our data-centric services to be

digital libraries.


Data is diverse

¤ Documents

¤ Not just computational information like structures.


Another point of view: how can I link together

publications, results, workflows, etc?

¤ That is, I need to manage digital documents.


Digital Libraries

£ Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)

£ Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.

£ Graph-based representations of connected digital objects.

£ Objects may be encoded in (for example) RDF or XML,

£ Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)


Challenges for E-Chemistry


Can digital library principles be applied to data as

well as documents?

¤ Can you link your workflow to your conference paper?


Can we engineer a publishing framework and

message formats around Web 2.0 principles?

¤ REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats


Can we do this securely?

¤ Access control, provenance, identity federation are key problems.


More Information


Project Web Site:




Project Wiki: w




Contact me: mpierce@cs.indiana.edu


CICC Combines Grid Computing with Chemical Informatics


Chemical Informatics and Cyberinfrastructure Collaboratory

Funded by the National Institutes of Health



Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

Science and Cyberinfrastructure


Large Scale Computing Challenges

Chemical Informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.

CICC is an NIH funded project to support chemical informatics needs of High Throughput Cancer

Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.

CICC supports the NIH mission by combining state of the art chemical informatics techniques with

• World class high performance computing

• National-scale computing resources (TeraGrid)

• Internet-standard web services

• International activities for service orchestration

• Open distributed computing infrastructure for scientists world wide

[Figure labels: NIH PubMed database, OSCAR text analysis, initial 3D structure calculation, toxicity filtering, cluster grouping, docking, molecular mechanics calculations, quantum mechanics calculations, POVRay parallel rendering, IU's Varuna database, NIH PubChem database]

Chemical informatics text analysis programs can process 100,000's of abstracts of online journal articles to extract chemical signatures of potential drugs.

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.

Big Red (and the TeraGrid) will also enable us to perform time consuming, multi-stepped Quantum Chemistry


MLSCN Post-HTS Biology Decision


Percent Inhibition or IC50 data is

retrieved from HTS

Question: Was this screen successful?

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Compounds submitted to PubChem

Workflows encoding distribution analysis of screening results

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Chem-informatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at PubChem, ECCR centers, and MLSCN centers

Workflows encoding plate & control well statistics, distribution analysis, etc

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc
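The questions "Was this screen successful?" and "What should the active/inactive cutoffs be?" are the kind of thing a plate and control-well statistics workflow would answer. The slides do not name the statistics used, so the sketch below substitutes two widely used ones as an assumption: the Z'-factor for assay quality, and a cutoff placed a fixed number of standard deviations above the negative-control mean. Function and parameter names are illustrative.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|.

    Values close to 1 indicate well-separated controls.
    """
    separation = abs(mean(positive_controls) - mean(negative_controls))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    spread = stdev(positive_controls) + stdev(negative_controls)
    return 1 - 3 * spread / separation


def activity_cutoff(negative_controls, n_sd=3.0):
    """Percent-inhibition cutoff set `n_sd` sample standard deviations
    above the negative-control mean."""
    return mean(negative_controls) + n_sd * stdev(negative_controls)


def call_actives(percent_inhibition, cutoff):
    """Return the IDs of compounds whose percent inhibition exceeds `cutoff`."""
    return [cid for cid, value in percent_inhibition.items() if value > cutoff]
```

A workflow could compute `z_prime` per plate to decide whether the plate is usable, then apply `activity_cutoff` to the retained plates before submitting results to PubChem.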





Need access to math and stat



Did not want to recode algorithms


Wanted latest methods


Needed a distributed approach to



Keep computation on a powerful machine


Why R?


Free, open-source


Many cutting edge methods available


Flexible programming language


Interfaces with many languages










The R Server


R can be run as a remote compute



Requires the




Allows authenticated access over



Connections can maintain state


R as a Web Service


On its own the R server is not a web



We provide Java frontends to specific



The frontend classes are hosted in a

Tomcat web container


Accessible via SOAP


Full Javadocs for all available WS’s




Two classes of functionality


General functions

¥ Allows you to supply data and build a predictive model

¥ Sample from various distributions

¥ Obtain scatter plots and histograms

¥ Model development functions use a Java front-end to encapsulate model specific information




Two classes of functionality


Model deployment

¥ Allows you to build a model outside of the infrastructure

¥ Place the final model in the infrastructure

¥ Becomes available as a web service

¥ Each model deployed requires its own front end class


Available Functionality


Predictive models - OLS, RF, CNN,



Clustering - k-means


Statistical distributions


XY plot and scatter plots


Model deployment for single model

types and ensemble model types


Deployed Models


Since deployed models are visible as

web services we can build a simple web

front end for them





I anti-cancer predictions




The R WS is not restricted to ‘atomic’



Can write a whole R program

¤ Load it on the R compute server

¤ Provide a Java WS frontend



¤ Feature selection

¤ Automated model generation

¤ Pharmacokinetic parameter calculation


Data Input/Output


Most modeling applications require data



Depending on client language we can



SOAP array of arrays (2D matrices)


SOAP array (1D vector form of a 2D matrix)
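The two input shapes listed above can be converted into each other on the client side. The sketch below flattens a 2D matrix into a 1D vector plus its dimensions and rebuilds it again. Row-major ordering is an assumption here; the slides do not say which ordering the services expect, and R itself stores matrices column by column.

```python
def flatten(matrix):
    """Flatten a rectangular list of rows into (values, n_rows, n_cols),
    reading row by row (row-major order, assumed)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    values = [value for row in matrix for value in row]
    return values, n_rows, n_cols


def unflatten(values, n_rows, n_cols):
    """Rebuild the list of rows from a row-major 1D vector."""
    if len(values) != n_rows * n_cols:
        raise ValueError("vector length does not match the dimensions")
    return [list(values[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
```

Sending the dimensions alongside the vector is what lets the receiving side recover the matrix; without them a 6-element vector could be 2x3 or 3x2.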



Data Input/Output


Some R web services can take a URL

to a VOTables document


Conversion to R or Java matrices is done

by a local VOTables Java library


R also has basic support for VOTables



Ignores binary data streams
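To make the VOTables conversion concrete, here is a rough sketch of reading the `TABLEDATA` rows of a VOTable document into column names and a numeric matrix using only the Python standard library. It is not the Java library mentioned above. It assumes a single table whose cells are all numeric, takes the document as text rather than fetching a URL, and, like the R support described on the slide, skips anything that is not `TABLEDATA` (such as binary streams). The column names in the usage note are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def votable_to_matrix(xml_text):
    """Return (column names, rows of floats) from a VOTable's TABLEDATA.

    FIELD elements supply the column names; each TR becomes one row.
    Elements other than FIELD and TR/TD are ignored.
    """
    root = ET.fromstring(xml_text.strip())
    names, rows = [], []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "FIELD":
            names.append(elem.get("name"))
        elif name == "TR":
            rows.append([float(td.text) for td in elem
                         if local_name(td.tag) == "TD"])
    return names, rows
```

For example, a document with two `FIELD` elements named `logP` and `activity` and two `TR` rows yields `["logP", "activity"]` and a 2x2 list of floats, which can then be passed to a service as an array of arrays.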


Interacting With R WS’s


Traditional WS’s do not maintain state


Predictive models are different


A model is built at one time


May be used for prediction at another time


Need to maintain state


State is maintained by serialization to R

binary files on the compute server


Interacting with R WS’s




Send data to model WS


Get back model ID


Get various information via model ID

¥ Fitted values

¥ Training statistics

¥ New predictions
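The build-then-query pattern above (send data, get back a model ID, use the ID later) can be illustrated without R. In the sketch below, Python's `pickle` files stand in for the R binary files on the compute server, and a one-variable least-squares fit stands in for the real model types. The `ModelStore` class and its method names are hypothetical and are not the actual web service interface.

```python
import pickle
import uuid
from pathlib import Path


class ModelStore:
    """Persist fitted models to disk so later calls can find them by ID."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def build(self, x, y):
        """Fit y = slope * x + intercept by least squares, save the model,
        and return its ID."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            raise ValueError("x values must not all be equal")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        fitted = [slope * xi + intercept for xi in x]
        rmse = (sum((f - yi) ** 2 for f, yi in zip(fitted, y)) / n) ** 0.5
        model = {"slope": slope, "intercept": intercept,
                 "fitted": fitted, "rmse": rmse}
        model_id = uuid.uuid4().hex
        with open(self.directory / f"{model_id}.pkl", "wb") as handle:
            pickle.dump(model, handle)
        return model_id

    def _load(self, model_id):
        with open(self.directory / f"{model_id}.pkl", "rb") as handle:
            return pickle.load(handle)

    def fitted_values(self, model_id):
        return self._load(model_id)["fitted"]

    def training_statistics(self, model_id):
        return {"rmse": self._load(model_id)["rmse"]}

    def predict(self, model_id, new_x):
        model = self._load(model_id)
        return [model["slope"] * xi + model["intercept"] for xi in new_x]
```

Because every query reloads the model from disk by ID, the service process itself holds no state between calls, which is the point of the serialization approach described on the slide.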


Cheminformatics at Indiana

University School of Informatics

David J. Wild


Associate Director of Chemical Informatics & Assistant Professor

Indiana University School of Informatics, Bloomington


Cheminformatics education at


£ M.S. in Chemical Informatics

¤ 2 years, 36 semester hours

¤ Includes a 6-hour capstone / research project

¤ Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)

¤ Currently 9 students enrolled

£ Ph.D. in Informatics, Cheminformatics Specialty

¤ 90 credit hours, including 30 hours dissertation research. Usually 4 years.

¤ Research rotations expose students to research in related areas

¤ Currently 4 students enrolled

£ Graduate Certificate

¤ 4 courses, all available by Distance Education

¥ I571 Chemical Information Technology

¥ I572 Computational Chemistry & Molecular Modeling

¥ I573 Programming for Science Informatics

¥ I553 Independent Study in Chemical Informatics

¤ D.E. students pay in-state fees! (~$800 per class)

£ See http://cheminfo.informatics.indiana.edu for more information, or a general review of cheminformatics education in Drug Discovery Today 11, 9&10 (May 2006), pp436-439


Distance Education for



Uses Breeze + teleconference for live sharing

of classes: all that is required is a P.C. and a

telephone. Optional Polycom



Lectures are recorded for easy playback

through a web browser


Wiki or similar webpage for dissemination of

course materials


Also participate in CIC courseshare to give

class at University of Michigan


Of 75 students taking our courses since fall

2005, 39 have been D.E. students


See JCIM 2006; 46(2) pp 495 - 502 for more



Current research in the Wild



Integration of cheminformatics tools and data


¤ A web service infrastructure for cheminformatics

¤ Compound information & aggregation web service and interface (“by the way box”)

¤ An enhanced chatbot for exploiting chemical information & web services

¤ Semantically-aware workflow tools for cheminformatics

¤ Data mining the NIH DTP tumor cell line database

¤ PubDock: a docking database for PubChem


Aggregating life science information from web

and journal documents

¤ Data mining semantically rich chemistry journal articles

¤ Document similarity based on chemical structure similarity

¤ Evaluating semantic markup of chemistry journal articles


Integrating cheminformatics into the

chemistry lab

¤ Integrating cheminformatics with the Second Life virtual world

¤ Integrating cheminformatics tools with electronic lab notebooks

¤ Usability of cheminformatics tools


Current research in the Guha



Predictive Modeling

¤ Interpretation, validation, domain applicability

¤ Generalization to other ‘models’ such as docking, pharmacophore etc

¤ Integration of multiple data types

¤ Addressing imbalanced and noisy datasets


Analysis of Chemical Spaces

¤ Quantify distributions in spaces

¤ Investigation of density approaches

¤ Applications to lead hopping, model domains


Methods to summarize & compare data

¤ Applications to HTS and smaller lead series type datasets


Network models combining chemical

structures and biological systems


Software and infrastructure

¤ Model exchange and annotation

¤ Pharmacophore representations, matching

¤ Toolkit development (CDK)


Cheminformatics web service


Database Services
PostgreSQL + gNova
PubChem mirror (augmented)
Pub3D - 3D structures for PubChem
PubDock - Bound 3D structures
Compound-indexed journal article DB
NIH Human Tumor Cell Line
Local PubChem mirror
VARUNA quantum chemistry database

Statistics (based on R)
Regression, LDA
Neural Nets, Random Forest
K-means clustering
Plotting
T-test and distribution sampling

Cheminformatics services
Docking (FRED)
3D structure generation (OMEGA)
Filtering (FRED, etc)
OSCAR3
Fingerprints (BCI, CDK)
Clustering (BCI)
Toxicity prediction (ToxTree)
R-based predictive models
Similarity calculations
Descriptor calculation (CDK)
2D structure diagrams (CDK)

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307


RSC Project Prospect - what

can we do with the






100 papers marked up with SMILES/InChI

(using OSCAR3), plus Gene Ontology and

Goldbook Ontology terms


Created similarity searchable PostgreSQL /

gNova database with paper DOIs, SMILES,

and ontology terms


Web service and simple HTML interfaces for

searching … “which papers reference

compounds similar to this one in the scope of

these ontological terms?”


Applying statistics to look at co-occurrence of

compounds, structural features (MACCS

keys) and ontological terms in papers


Greasemonkey / OSCAR




By the way… annotation


By the way…

This compound is very similar to a prescription drug, Tamoxifen.

This compound is referenced in 20 journal articles published in the last 5 years

Similar compounds are associated with the words “toxic” and “death” in 280 web pages

It appears to be covered under 3 patents

It has been shown to be active in 5 screens

Computer models predict it to show some activity against 8 protein targets

Here are some comments on this compound:


Some useful chemical reactions

Iodoacetate (I-CH2COO-), Iodoacetamide (ICH2CONH2)

This may also react, chem favored by alkaline pH


Cheminformatics aware

simple lab notebook (mock


Free text input can be converted to machine

readable form by electrovaya

Automatic detection of data fields (yield, etc) where possible

Plug-in allows structures to be drawn with the pen and cleaned up

Web service interface provides access to computation and searching.

Page is marked up by what is possible



Automatic workflow

generation and natural

language queries


Develop service ontology using OWL-S or

similar language

¤ Allows service interoperability, replacement and input/output compatibility


We can then use generic reasoning and

network analysis tools to find paths from

inputs to desired outputs


Natural language can be parsed to inputs and

desired outputs


Smart Clients <--> Agents <--> Services


Possible “supercharged life science Google?”

- e.g. type in “what compounds might bind to

the enclosed protein?”

[Diagram: services (crawler, 2D similarity, 2D -> 3D, 3D search, pharmacophore search, dock) connecting data types (2D structures, 3D structures, 3D protein structure, 3D structures & complexes). Annotations: dock = bind; 3D structures are compounds.]
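Finding "paths from inputs to desired outputs" can be sketched as a breadth-first search over a graph whose nodes are data types and whose edges are services. The registry below borrows service and data-type names from the diagram but is otherwise hypothetical; in particular it treats every service as having exactly one input, whereas docking also needs a protein structure. A real system would read these signatures from the OWL-S (or similar) service ontology.

```python
from collections import deque

# (service name, input data type, output data type): illustrative only
SERVICES = [
    ("crawler", "query", "2D structures"),
    ("2D similarity", "2D structures", "2D structures"),
    ("2D -> 3D", "2D structures", "3D structures"),
    ("dock", "3D structures", "3D structures & complexes"),
]


def find_workflow(services, have, want):
    """Return the shortest list of service names that turns data of type
    `have` into data of type `want`, or None if no chain exists."""
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        data_type, path = queue.popleft()
        if data_type == want:
            return path
        for name, source, target in services:
            if source == data_type and target not in seen:
                seen.add(target)
                queue.append((target, path + [name]))
    return None
```

A natural language front end would only need to map a question such as "what compounds might bind to the enclosed protein?" onto a starting data type and a desired one; the search then proposes the chain of services to run.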