
Cheminformatics and Cyberinfrastructure


Academic year: 2020


(1)

Indiana University School of

Chemoinformatics

David Wild, djwild@indiana.edu

(2)


Current state of chemoinformatics research

What works and what doesn’t

– Fingerprints, clustering and diversity

– QSAR - predictive and descriptive methods, virtual screening

– 3D similarity, pharmacophores & docking

– Visualization, organization and navigation of chemical datasets

Current buzz areas in chemoinformatics

(3)


What works and what doesn’t

• 2D structure and similarity searching well established

– Lots of papers comparing fingerprints for similarity

– Some recent evidence Scitegic ECFPs better for recall of actives

• Clustering well established but definite room for improvement

– Traditional methods: Ward’s, K-means, Jarvis-Patrick

– Recently single pass similarity cutoff methods used for very fast organization - >0.85 for similar activity, >0.55 for QSAR

– Data mining methods - ROCK, Chameleon, Cure, etc unexplored

– Diversity hot -> cold -> smart

• QSAR - poor relation of academic work to industry usefulness

– Lots of papers: “this method works best on this dataset”

– Random forests appear practically to work rather well

– Interpretability vs predictive ability

– Predictive methods for LogP, pKa, solubility, etc work reasonably
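
The single pass similarity cutoff idea above can be sketched in a few lines. This is a minimal illustration, assuming fingerprints are represented as sets of "on" bit positions, similarity is the Tanimoto coefficient, and "single pass" means a leader-style assignment (each compound joins the first cluster whose first member is at least as similar as the cutoff). The slides do not say which algorithm was actually used, and the fingerprints below are made-up bit sets, not real molecules.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def single_pass_cluster(fingerprints, cutoff=0.85):
    """Leader-style clustering: one pass, compare only against cluster leaders."""
    leaders = []   # index of the first member of each cluster
    clusters = []  # member indices per cluster
    for i, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, fingerprints[leader]) >= cutoff:
                clusters[c].append(i)
                break
        else:
            # No leader was similar enough, so this compound starts a new cluster.
            leaders.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (hypothetical bit positions chosen for easy arithmetic).
fps = [
    frozenset(range(0, 20)),   # 20 bits
    frozenset(range(0, 19)),   # 19 bits, all shared with the first
    frozenset(range(10, 30)),  # 10 bits shared with the first
    frozenset(range(11, 30)),  # 19 bits, all shared with the third
]
clusters = single_pass_cluster(fps, cutoff=0.85)
print(clusters)
```

Lowering `cutoff` to 0.55, the value the slide associates with QSAR, produces coarser clusters from the same pass. The result depends on input order, which is the usual trade-off for this kind of speed.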

(4)


What works and what doesn’t

Mostly, 3D methods haven’t worked out yet

– Similarity & QSAR - Almost every paper: 2D better for recall and precision but 3D methods give “interesting ideas”. Useful for “lead hopping”

– Pharmacophore searching not widely used

– Docking - very useful for visual inspection, poor correlation of scoring functions with binding

Visualization, organization and navigation of datasets

– Still not clear how to work with datasets > few hundred compounds

– Dot plots, spreadsheet-based methods work minimally

(5)


The current buzz in chemoinformatics

Decorporatization and commoditization of data and software

– MLSCN, PubChem, open source, small companies

– Crisis for the software companies, nice for academia

– Pharma companies in the brown stuff without a paddle

Integration with other “ics”

– Data mining chemical/genomic information

– Linking compounds -> proteins -> pathways, etc (e.g. KEGG)

Fuzzy boundaries, integration with science and informatics

– Microsoft 2020 vision for science

Integration of text and structure searching

(6)


Suggested collaboration areas

• Chem/bio/complex systems mashups using web services in each of the areas: nice, confined projects for students once you have the infrastructure

• Chem and complex can work together on integrating text and structure-based searching, indexing and crawling (e.g. networks of web services and databases), and intelligent agents

• Data mining of chemogenomic information

• Integration of advanced chemoinformatics methods with systems biology and pathway mapping tools

• Performing research to establish best practices for areas of chemoinformatics

(7)


Cyberinfrastructure

Geoffrey Fox

(8)

Cyberinfrastructure

• Supports distributed science – data, people, computers

• Exploits Internet technology (Web2.0) adding (via Grid technology) management, security, supercomputers etc.

• It has two aspects: parallel – low latency (microseconds) between nodes and distributed – highish latency (milliseconds) between nodes

• Parallel needed to get high performance on individual 3D simulations, data analysis etc.; must decompose problem

• Distributed aspect integrates already distinct components

• Cyberinfrastructure is in general a distributed collection of parallel systems
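
As a small illustration of "must decompose problem", the sketch below splits a data set into contiguous blocks, computes a partial result for each block concurrently, and combines the partial results. The function names are hypothetical, and a thread pool is used only to keep the example self-contained. In CPython, threads will not speed up CPU-bound work like this, so a real parallel code would use processes, MPI or similar across low-latency nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, near-equal (start, stop) blocks."""
    base, extra = divmod(n_items, n_parts)
    bounds, start = [], 0
    for p in range(n_parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_sum_of_squares(data, start, stop):
    """Work done independently on one block of the decomposed problem."""
    return sum(x * x for x in data[start:stop])


def parallel_sum_of_squares(data, n_workers=4):
    """Decompose, compute each block concurrently, then combine the results."""
    blocks = decompose(len(data), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(partial_sum_of_squares, data, s, e) for s, e in blocks]
        return sum(f.result() for f in futures)


data = list(range(10))
total = parallel_sum_of_squares(data, n_workers=4)
print(decompose(len(data), 4), total)
```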

(9)

TeraGrid: Integrating NSF Cyberinfrastructure

TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.

Today 100 Teraflop; tomorrow a petaflop; Indiana 20 teraflop today.

(10)

Cyberinfrastructure at IU

• Interpreted broadly (Web presences), there are many activities at IU

• Interpreted narrowly as the “programmable web” or “using Grid technologies” there are large projects in atmospheric, earthquake, ice-sheet sciences, network systems, particle physics, Crystallography and Cheminformatics

• IU has an international reputation in both parallel and distributed Cyberinfrastructure including education, research and resources

• IU has #31 Supercomputer in world and is part of two major National activities TeraGrid and Open Science Grid

• There are several well known Bioinformatics Grids such as BIRN (mainly images) and caBIG (cancer databases) from NIH and MyGrid from UK (EBI)

• Could be opportunities to link Biology and Informatics/CS in

(11)

Cyberinfrastructure motivated by Web 2.0

• Capture the power of interactive Web/Grid sites enabling people to create, collaborate and build on each other’s work

Programmableweb.com

363 Web 2.0 API’

(12)


Web services, workflows, portals and ontologies

• Web Services allow us to quickly develop and deploy new tools, interfaces that cross disciplines and are broadly accessible

– Can use simple HTTP and ignore Web Service complications

• Workflows (called mashups in Web 2.0) allow us to string together collections of web services to do computation that is tailored to the science (as a one-off or for re-use).

– Develop core capabilities as services and use in many different ways as in 770 Google map mashups

• API’s/Languages/Data structures/Ontologies (WSDL AJAX JSON at low level) allow us to describe workflows and services in discoverable, standard ways, such that reasoning tools can piece them together to match queries

• Portals enable composable reusable user interfaces

• Distributed posting of services and easily available composition tools enable “everybody” to contribute
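
The workflow idea, stringing services together so that the output of one feeds the next, can be sketched without any network access. In the example below each "service" is a local stand-in function that accepts and returns a JSON string, as a simple HTTP endpoint might. The service names, payload fields and lookup table are all hypothetical, and the atom count is deliberately crude (it only counts uppercase letters, so it is valid only for toy SMILES like these).

```python
import json


def smiles_lookup_service(request_json):
    """Stand-in for a name-to-structure service (toy lookup table)."""
    msg = json.loads(request_json)
    table = {"ethanol": "CCO", "methane": "C"}
    msg["smiles"] = table.get(msg["name"])
    return json.dumps(msg)


def heavy_atom_count_service(request_json):
    """Stand-in for a property service; counts uppercase letters only."""
    msg = json.loads(request_json)
    smiles = msg.get("smiles")
    msg["heavy_atoms"] = None if smiles is None else sum(ch.isupper() for ch in smiles)
    return json.dumps(msg)


def run_workflow(steps, payload_json):
    """Pass a JSON message through each service in order."""
    for step in steps:
        payload_json = step(payload_json)
    return payload_json


workflow = [smiles_lookup_service, heavy_atom_count_service]
result = json.loads(run_workflow(workflow, json.dumps({"name": "ethanol"})))
print(result)
```

Because every step shares the same JSON-in, JSON-out shape, the same services can be recombined into other workflows, which is the reuse the slide describes.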

(13)

Model and Data Sharing

• Cyberinfrastructure requires agreed sharing standards (data structures, API’s, protocols, ontologies, languages) as intrinsically internationally distributed

• There are agreed data structures for taking Sequence Protein Folding Interaction Transparently, e.g. BLAST

• Nothing at the level where genomics and proteomics is important: cells and tissues.

• Partial answers: CellML, FieldML, SBML which do not link to relevant standards outside Biology

• Need to connect models at these levels. Need Standard ontologies/data structures for cell behaviors to allow connections and validation

• Need to connect Models like SBW (Systems Biology Workbench)/BioSpice -> Cell-level models (Compucell) -> Tissue level models (Physiome)

• Model builders at these scales not CS-sophisticated. Models NOT interoperable and don’t use useful general ideas

• Glazier organizing activity in this area with H. Sauro (U. Washington), W. Li (UCSD-SDSC), Hunter (U. Auckland) and NIH

• Link to Open Grid Forum standard setting and community

(14)

http://www.chembiogrid.org

• Database enabled quantum chemistry computations

• Services to link PubChem, Supercomputers, results of high throughput Screening centers

• Education; IU has unique Cheminformatics degrees

(15)


Chemical Informatics web service infrastructure

• Database Services

– Local NIH DTP Human Tumor Cell Line set

– Local PubChem mirror

– Derived properties database

– Pub3D, PubDock

– Synonym service

– VARUNA quantum chemistry database

• Statistics (based on R)

– Regression, Neural Nets, Random Forest

– LDA

– K-means clustering

– Plotting

– T-test and distribution sampling

• Computation Services

– OpenEye FRED, OMEGA, FILTER, …

– Cambridge OSCAR3

– BCI fingerprint generation, Ward’s, Divisive K-means clustering

– Tox Tree

– Similarity & fingerprint calculations (CDK)


(19)


Kemo - A ChatBot for PubChem

• Uses ALICE chatbot www.alicebot.org

• AIML used to define knowledge base, e.g. reaction to common phrases like FIND ME, WHAT IS THE LOGP OF, etc

• Can iteratively improve knowledge base

• Accesses PubChem
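
To give a feel for how phrase patterns such as FIND ME and WHAT IS THE LOGP OF can drive a chatbot, here is a minimal sketch of wildcard pattern matching in the spirit of AIML. It is not AIML syntax (real AIML is XML with pattern and template elements) and it is not the ALICE engine or Kemo's actual knowledge base. The rule table, intent names and helper functions are hypothetical, and the PubChem lookup that would follow a match is left out.

```python
# Each rule pairs a pattern ending in a '*' wildcard with an intent label.
RULES = [
    ("WHAT IS THE LOGP OF *", "logp"),
    ("FIND ME *", "search"),
]


def normalize(text):
    """Uppercase, drop question marks and collapse whitespace."""
    return " ".join(text.upper().replace("?", " ").split())


def match(pattern, sentence):
    """Return the text captured by the trailing '*', or None if no match."""
    prefix = pattern[:-1].strip()
    if sentence.startswith(prefix + " "):
        return sentence[len(prefix) + 1:]
    return None


def respond(text):
    """Map a user utterance to (intent, captured argument)."""
    sentence = normalize(text)
    for pattern, intent in RULES:
        captured = match(pattern, sentence)
        if captured:
            return intent, captured
    return "unknown", sentence


print(respond("What is the logP of aspirin?"))
print(respond("find me caffeine"))
```

Improving the knowledge base iteratively then amounts to adding rules for the phrasings users actually type.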

(20)


Workflow in Xbaya - a meteorology tool!

(21)


Indexing the world’s chemical information AND computational functionality

• Crawl and index web pages, journal articles, etc. for

– Structures (InChIs, SMILES)

– Images (converted using Clide or ChemReader)

– Names (converted using OSCAR3 or similar package)

– Other information (IR spectra, reactions, etc…)

• Technology still immature, but improving quickly

• Problem with access to journal articles: we will assume open access in the future!

• Expose computational functionality as web services, contextualize in an OWL-S ontology (semantics), and publish in a UDDI
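
The structure-indexing step can be illustrated on already-fetched text. The sketch below assumes pages are plain strings keyed by a hypothetical page id, pulls out InChI strings with a simple regular expression, and builds an inverted index from structure to the pages that mention it. The regular expression is a rough heuristic rather than a validating InChI parser, and SMILES, images and names would need the dedicated tools listed above.

```python
import re
from collections import defaultdict

# Heuristic: an InChI starts with "InChI=1/" or "InChI=1S/" and runs to whitespace.
INCHI_RE = re.compile(r"InChI=1S?/[^\s\"'<>]+")


def extract_inchis(text):
    """Return InChI strings found in text, trimming trailing sentence punctuation."""
    return [m.rstrip(".,;") for m in INCHI_RE.findall(text)]


def build_index(pages):
    """Map each InChI to the set of page ids where it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for inchi in extract_inchis(text):
            index[inchi].add(page_id)
    return dict(index)


# Toy "crawled" pages (hypothetical ids and wording).
pages = {
    "page1": "Ethanol, InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3, is a common solvent.",
    "page2": "Methane: InChI=1S/CH4/h1H4. Ethanol: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}
index = build_index(pages)
for inchi, where in sorted(index.items()):
    print(inchi, sorted(where))
```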
