Data Visualization in Cheminformatics. Simon Xi Computational Sciences CoE Pfizer Cambridge

(1)

Data Visualization in Cheminformatics

Simon Xi

Computational Sciences CoE

Pfizer Cambridge

(2)

My Background

Professional Experience

Senior Principal Scientist, Computational Sciences CoE, Pfizer Cambridge

Nine years of experience in pharmaceutical research, with a focus on developing cheminformatics and bioinformatics applications for research scientists

Education

MSc in Molecular Cell Biology, UT Dallas

MSc in Software Engineering, SMU

(3)

What we will cover today

• Introduction to drug discovery

• Cheminformatics basics

• Encoding of the chemical structures

• Visualizing data and structures

• Design and optimization of compound library

• A case study

(4)

The Billion Dollar Molecules

Drug Name    2006 World-Wide Sales    Primary Use
Lipitor      $14,385M                 cholesterol
Nexium       $5,182M                  heartburn
Advair       $6,129M                  asthma
Prevacid     $3,425M                  heartburn
Plavix       $6,057M                  anticoagulant
Singulair    $3,579M                  asthma
Seroquel     $3,560M                  depression
Effexor      $3,722M                  depression
Norvasc      $4,866M                  hypertension

Lipitor: $14 billion in annual sales

(5)

The Challenge

[Figure: Industry Productivity vs. Investment. Total R&D investment ($ billions, axis $0 to $25) plotted against the number of NMEs per year, 1970-2000. Source: PhRMA annual survey, 2000]

Nature Reviews Drug Discovery 3, 451-456 (2004)

(6)

[Figure: the path from idea to drug over 11-15 years, on a 0-15 year axis. Millions of compounds screened and ~100 discovery approaches yield 1-2 products. Stages: Discovery, Exploratory Development, Full Development. Activities: Preclinical Pharmacology, Preclinical Safety, Clinical Pharmacology & Safety (Phase I, Phase II, Phase III)]

(7)

(8)

What is Cheminformatics?

• The use of computer and information techniques, applied to a range of problems in the field of chemistry.

• These in silico techniques are commonly used by pharmaceutical companies in the drug discovery process.

• Chemistry is a visual science. Data visualization is a key component of cheminformatics.

(9)

(10)

Encoding Chemical Structures

Lipitor in two encodings:

• SD format: a table of atoms and bonds

• SMILES format: CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4
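To make the link between a SMILES string and an atom/bond table concrete, here is a minimal sketch that parses a small subset of SMILES using only the Python standard library. It handles organic-subset atoms, double and triple bonds, branches, and single-digit ring closures, which is enough for the Lipitor string. The helper name `parse_smiles` is hypothetical, and this is not a full SMILES parser; production work would rely on a cheminformatics toolkit.

```python
import re
from collections import Counter

# Supported subset: organic atoms, '=' and '#', branches, single-digit ring closures.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[=#]|[()]|\d")
BOND_ORDER = {"=": 2, "#": 3}

def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (i, j, order) triples of atom indices."""
    atoms, bonds = [], []
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order)
    prev, order = None, 1
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok = match.group()
        pos = match.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in BOND_ORDER:
            order = BOND_ORDER[tok]
        elif tok.isdigit():
            if tok in open_rings:            # second occurrence closes the ring
                start, ring_order = open_rings.pop(tok)
                bonds.append((start, prev, max(order, ring_order)))
            else:                            # first occurrence opens the ring
                open_rings[tok] = (prev, order)
            order = 1
        else:                                # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, order))
            prev, order = len(atoms) - 1, 1
    return atoms, bonds

LIPITOR = ("CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)"
           "=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4")
atoms, bonds = parse_smiles(LIPITOR)
print(Counter(atoms))   # heavy-atom counts by element
print(len(bonds))       # number of bonds between heavy atoms
```

For the Lipitor string this gives 33 C, 2 N, 5 O, and 1 F, consistent with the formula C33H35FN2O5 (hydrogens are implicit in SMILES).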

(11)

Representing Structure as Fingerprints
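As a toy illustration of the idea, the sketch below turns a structure into a set of "on" bits and compares two structures with the Tanimoto coefficient (shared bits divided by total bits set in either). It hashes overlapping 4-character substrings of the SMILES text, which is an assumption made purely to keep the example dependency-free; real fingerprints are computed from the molecular graph, since one molecule can be written as many different SMILES strings. The function names and the 1024-bit length are hypothetical choices.

```python
import zlib

def smiles_fingerprint(smiles, n_bits=1024, width=4):
    """Hash every substring of `width` characters to one of n_bits positions."""
    on_bits = set()
    for i in range(max(len(smiles) - width + 1, 1)):
        fragment = smiles[i:i + width]
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        on_bits.add(zlib.crc32(fragment.encode("ascii")) % n_bits)
    return on_bits

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two sets of on-bits (1.0 = identical)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

aspirin = smiles_fingerprint("CC(=O)Oc1ccccc1C(=O)O")
salicylic_acid = smiles_fingerprint("OC(=O)c1ccccc1O")
print(round(tanimoto(aspirin, salicylic_acid), 2))
```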

(12)

(13)

Compound Properties/Descriptors

1D, 2D, 3D, multi-dimensional properties

• 1D: molecular weight, clogP, number of atoms, charge, number of H-bond donors and acceptors

• 2D: atom pairs, substructures, functional groups

• 3D: shape, pharmacophores

• nD: fingerprints, etc.

Chemical series: compounds sharing the same core structure
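Two of the simplest 1D descriptors, molecular weight and heavy-atom count, can be computed directly from a molecular formula. The sketch below is a minimal illustration using rounded standard atomic weights for five elements only; the function names are hypothetical, and the formula C33H35FN2O5 is that of atorvastatin (Lipitor).

```python
import re

# Standard atomic weights (g/mol), rounded to three decimals; only five elements.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "F": 18.998}

def element_counts(formula):
    """Parse a simple formula such as 'C33H35FN2O5' into {element: count}."""
    counts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(count) if count else 1)
    return counts

def molecular_weight(formula):
    """Sum of atomic weights; raises KeyError for elements not in the table."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in element_counts(formula).items())

def heavy_atom_count(formula):
    """Number of non-hydrogen atoms."""
    return sum(n for el, n in element_counts(formula).items() if el != "H")

print(round(molecular_weight("C33H35FN2O5"), 2))  # about 558.65 g/mol
print(heavy_atom_count("C33H35FN2O5"))            # 41
```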

(14)

Series Classification: Ward's Clustering

Iteratively merge pairs of nodes until all nodes are merged into one.

At each merging step, the two nodes whose merger gives the smallest increase in within-cluster variance are chosen and merged into one new node.

Once the tree hierarchy is generated, clusters can be defined by cutting the tree at a chosen dissimilarity threshold.
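The merging procedure can be sketched in a few lines. This is a naive illustration rather than an efficient implementation: it assumes compounds are already represented as numeric descriptor vectors, recomputes every pairwise cost at each step, and stops merging once the cheapest merge exceeds the threshold, which is equivalent to cutting the tree at that height. The merge cost used is the standard Ward criterion, n_a·n_b/(n_a+n_b) times the squared distance between the two centroids. Function names and data points are hypothetical.

```python
from itertools import combinations

def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clusters(points, threshold):
    """Merge greedily; stop when the cheapest merge would exceed threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        if ward_cost(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Two hypothetical "series" in a 2-D descriptor space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
for cluster in ward_clusters(points, threshold=5.0):
    print(cluster)
```

With these points, the two tight groups merge internally at low cost, while merging the groups with each other would cost far more than the threshold, so two clusters of three remain.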

(15)

Toxicity Properties

• Inhibition of CYP450 isozymes
• PXR transactivation
• Human hepatocyte toxicity
• Mutagenicity
• Mitochondrial toxicity
• Covalent protein binding
• Inhibition of hERG

ADME/Physicochemical Properties

• Solubility
• Chemical stability
• Hydrophobicity/hydrogen bonding potential
• Intestinal mucosal cell permeation
• Liver and kidney clearance
• Metabolism
• Transporters
• Charge
• Size
• Protein binding
• Blood-brain barrier permeation
• Target cell permeation

Primary Pharmacology

• In vitro potency
• Cell-based potency
• Functional assays
• Selectivity against other targets

(16)

Drug-Likeness: Rule of Five

Proposed by C. Lipinski to describe ‘drug-like’ molecules.

Molecules displaying good oral absorption and/or distribution properties are likely to possess the following characteristics:

– Molecular weight ≤ 500

– logP ≤ 5

– H-bond donors ≤ 5

– H-bond acceptors ≤ 10
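A rule-of-five filter is straightforward to express in code. The sketch below assumes the descriptors have already been calculated elsewhere and simply reports which limits a compound exceeds. The acceptor limit of 10 and the inclusive limits are taken from Lipinski's published rule; the class name, function name, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    mol_weight: float   # g/mol
    logp: float         # calculated octanol/water partition coefficient
    h_donors: int       # number of H-bond donors
    h_acceptors: int    # number of H-bond acceptors

def rule_of_five_violations(d):
    """Return the list of rule-of-five limits that the compound exceeds."""
    checks = [
        ("molecular weight > 500", d.mol_weight > 500),
        ("logP > 5", d.logp > 5),
        ("H-bond donors > 5", d.h_donors > 5),
        ("H-bond acceptors > 10", d.h_acceptors > 10),
    ]
    return [name for name, violated in checks if violated]

# Hypothetical descriptor values, for illustration only.
small = Descriptors(mol_weight=320.4, logp=2.1, h_donors=2, h_acceptors=5)
large = Descriptors(mol_weight=612.7, logp=5.8, h_donors=3, h_acceptors=9)
print(rule_of_five_violations(small))   # []
print(rule_of_five_violations(large))   # two violations
```

Returning the list of violations, rather than a single pass/fail flag, keeps the result interpretable: a chemist can see which property to adjust.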

(17)

Data Visualization

[Figure: four views of the same data: Grid View, Table View, Plot View, Heatmap View. Labels visible in the figure: Software Relevance, Software Usability, Software Management]

(18)

Building Predictive Models using Machine Learning Techniques

• Use computational models to understand Structure-Activity Relationships (SAR)

• Use computational models to run virtual screens that guide compound selection for synthesis

(19)

Interpretability of Predictive Models

The good part

The not-so-good part

(20)

Multiple Parameter Optimization in Combinatorial Library Design

Given a 100x100x100 virtual library space and a set of predictive models for various properties (e.g. potency, ADME, selectivity), select the best 300 compounds for synthesis: those with the highest probability of being potent and drug-like, while sampling the chemical space diversely.

[Figure: core scaffold containing four nitrogen atoms, with variable groups R1, R2, and R3]
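To give a feel for the scale: 100 x 100 x 100 R-group combinations is 1,000,000 virtual compounds, so 300 selections cover 0.03% of the space. The sketch below shows one simple way to trade predicted quality against diversity: rank all enumerated compounds by a combined score, then pick greedily while capping how often any single R-group may be reused. This is an illustrative heuristic, not the method used in the talk; the scoring function, R-group names, and the reuse cap are all hypothetical, and a small 10x10x10 library is used so the example runs instantly.

```python
from collections import Counter
from itertools import product

def select_compounds(r1, r2, r3, score, n_pick, max_reuse):
    """Greedy pick: best predicted score first, capping reuse of any R-group."""
    ranked = sorted(product(r1, r2, r3), key=score, reverse=True)
    used = [Counter(), Counter(), Counter()]   # usage per position
    picked = []
    for compound in ranked:
        if all(used[pos][rg] < max_reuse for pos, rg in enumerate(compound)):
            picked.append(compound)
            for pos, rg in enumerate(compound):
                used[pos][rg] += 1
            if len(picked) == n_pick:
                break
    return picked

# Hypothetical R-groups; the trailing number stands in for a model prediction.
r1 = [f"A_{i}" for i in range(10)]
r2 = [f"B_{i}" for i in range(10)]
r3 = [f"C_{i}" for i in range(10)]

def toy_score(compound):
    """Placeholder for a combined potency/ADME/selectivity prediction."""
    return sum(int(name.split("_")[1]) for name in compound)

for compound in select_compounds(r1, r2, r3, toy_score, n_pick=5, max_reuse=2):
    print(compound, toy_score(compound))
```

Without the reuse cap, the top of the ranking tends to be dominated by a few high-scoring R-groups, which is exactly the lack of diversity the problem statement asks to avoid.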

(21)

The Problem of Multiple Parameter Optimization

• The chemical space is huge

• Predictive models are not very predictive

• There are many parameters to optimize, and they sometimes conflict with each other

(22)

MPO: a case study with kinase selectivity

~200 cmpds from a library were tested against 40 kinases. Can we design another 100 cmpds that are highly selective?

[Figure: trifluoro-diaminopyrimidine series (~200 cmpds), a core scaffold with variable groups R1 and R2, shown with an R1 x R2 grid of tested compounds]

1. Tested compounds: only a few R-group/kinase combinations have been tested previously.
2. Model building (FW): solve for the R-group contributions using linear regression.
3. Enumeration: the predictable virtual chemical space is a 5-50x expansion.
4. Virtual library profile: identify compounds with the desired selectivity profile in the expanded virtual chemical space.
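The model-building step can be illustrated with a small additive model, assuming "FW" refers to a Free-Wilson style analysis in which a compound's pIC50 against one kinase is a baseline plus one contribution per R-group: pIC50 ≈ mu + a[R1] + b[R2]. The sketch below is not the talk's implementation. Instead of calling a linear algebra library, it fits the contributions by alternating averaging (backfitting), which converges to a least-squares fit for this additive model. All R-group names and activity values are hypothetical.

```python
from collections import defaultdict

def fit_additive(measurements, n_iter=200):
    """Fit pIC50 ~ mu + a[R1] + b[R2] from {(r1, r2): pIC50} of tested compounds."""
    mu = sum(measurements.values()) / len(measurements)
    a, b = defaultdict(float), defaultdict(float)
    by_r1, by_r2 = defaultdict(list), defaultdict(list)
    for (r1, r2), y in measurements.items():
        by_r1[r1].append((r2, y))
        by_r2[r2].append((r1, y))
    for _ in range(n_iter):
        # Each contribution becomes the mean residual of the compounds carrying it.
        for r1, obs in by_r1.items():
            a[r1] = sum(y - mu - b[r2] for r2, y in obs) / len(obs)
        for r2, obs in by_r2.items():
            b[r2] = sum(y - mu - a[r1] for r1, y in obs) / len(obs)
    return mu, dict(a), dict(b)

def predict(model, r1, r2):
    """Predicted pIC50 for a (possibly untested) R1/R2 combination."""
    mu, a, b = model
    return mu + a[r1] + b[r2]

# Hypothetical data: 8 of the 9 combinations were tested; ("Ph", "OMe") was not.
tested = {
    ("Me", "H"): 6.0, ("Me", "F"): 6.3, ("Me", "OMe"): 5.6,
    ("Et", "H"): 6.5, ("Et", "F"): 6.8, ("Et", "OMe"): 6.1,
    ("Ph", "H"): 7.2, ("Ph", "F"): 7.5,
}
model = fit_additive(tested)
print(round(predict(model, "Ph", "OMe"), 2))   # prediction for the untested compound
```

Because every R-group appears in at least one tested compound, the model can predict any R1/R2 combination in the enumerated grid, which is what allows the virtual space to expand beyond the tested set. Fitting one such model per kinase would give a predicted selectivity profile for each virtual compound.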

(23)

(24)

Experimental Validation of Predictions

~40 cmpds in two series were selected for KSS testing

[Figure: KSS pIC50 vs. FW pIC50, 12 panels with r² = 0.45, 0.59, 0.92, 0.86, 0.74, 0.83, 0.63, 0.88, 0.85, 0.81, 0.81, 0.85; one annotation reads "More promiscuous"]
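The agreement between predicted and measured pIC50 values is summarized by r². A minimal sketch of that calculation follows, assuming r² here means the squared Pearson correlation between the two series (the slide does not say which definition was used). The function name and the data values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical predicted vs. measured pIC50 values for one kinase.
predicted = [5.2, 6.1, 6.8, 7.4, 8.0]
measured = [5.0, 6.4, 6.5, 7.9, 7.8]
print(round(r_squared(predicted, measured), 2))
```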

(25)

Cheminformatics Challenges for Drug Discovery

• Information retrieval and knowledge management: rapidly and efficiently present all relevant data/knowledge to scientists at the right time and in the right place

• Predictive models: drastically improve the accuracy and interpretability of in silico models for potency and ADME endpoints

• Computer-aided design: provide easy-to-use software applications to help scientists analyze/visualize their data and make efficient use of prior knowledge during compound

(26)

References

1. Agrafiotis, D. K., Lobanov, V. S. and Salemme, F. R. (2002) Combinatorial informatics in the post-genomics era. Nat Rev Drug Discov. 1, 337-346

2. Lipinski, C. and Hopkins, A. (2004) Navigating chemical space for biology and medicine. Nature. 432, 855-861

3. Paolini, G. V., Shapland, R. H., van Hoorn, W. P., Mason, J. S. and Hopkins, A. L. (2006) Global mapping of pharmacological space. Nat Biotechnol. 24, 805-815