Data Visualization in
Cheminformatics
Simon Xi
Computational Sciences CoE
Pfizer Cambridge
My Background
Professional Experience
Senior Principal Scientist, Computational Sciences CoE, Pfizer Cambridge
9 years of experience in pharmaceutical research, with a focus on developing cheminformatics and bioinformatics applications for research scientists
Education
MSc in Molecular Cell Biology, UT Dallas
MSc in Software Engineering, SMU
What we will cover today
• Introduction to drug discovery
• Cheminformatics basics
• Encoding of the chemical structures
• Visualizing data and structures
• Design and optimization of compound library
• A case study
The Billion Dollar Molecules
Drug Name    2006 World-Wide Sales    Primary Use
Lipitor      $14,385M                 cholesterol
Nexium       $5,182M                  heartburn
Advair       $6,129M                  asthma
Prevacid     $3,425M                  heartburn
Plavix       $6,057M                  anticoagulant
Singulair    $3,579M                  asthma
Seroquel     $3,560M                  depression
Effexor      $3,722M                  depression
Norvasc      $4,866M                  hypertension
Lipitor: $14 billion in annual sales
Industry Productivity vs. Investment
[Figure: total R&D investment ($ billions) and number of NMEs per year, 1970-2000. Source: PhRMA annual survey, 2000]
The Challenge
Nature Reviews Drug Discovery 3, 451-456 (2004)
[Figure: drug discovery and development timeline on a 0-15 year axis, from Idea to Drug in 11-15 years. Stages: Discovery, Exploratory Development, Full Development; clinical Phase I, Phase II, Phase III. Labels: millions of compounds screened; ~100 discovery approaches; preclinical pharmacology; preclinical safety; clinical pharmacology & safety; 1-2 products.]
What is Chemoinformatics?
• Use of computer and informational techniques, applied to a range of problems in the field of chemistry.
• These in silico techniques are commonly used in pharmaceutical companies in the process of drug discovery.
• Chemistry is a visual science. Data visualization is a key component of cheminformatics.
Encoding Chemical Structures
Lipitor
SD format: atoms block and bonds block [Figure: SD file listing]
SMILES format: CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4
Representing Structure as Fingerprints
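The fingerprint idea can be illustrated with a toy sketch. Real fingerprints are built from paths or substructures of the molecular graph; the version below is only a stand-in that hashes short substrings of the SMILES text into bit positions, and then compares two fingerprints with the Tanimoto coefficient. The function names, the 256-bit size, and the substring length are all assumptions made for illustration.

```python
import zlib

def toy_fingerprint(smiles, n_bits=256, max_len=4):
    """Hash every SMILES substring of length 1..max_len to a bit position.

    This is a text-based stand-in, not a chemically meaningful fingerprint.
    """
    bits = set()
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is used because it is deterministic across runs.
            bits.add(zlib.crc32(fragment.encode("ascii")) % n_bits)
    return bits

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0

lipitor = ("CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)"
           "=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4")
benzene = "C1=CC=CC=C1"

fp_lipitor = toy_fingerprint(lipitor)
fp_benzene = toy_fingerprint(benzene)
print(tanimoto(fp_lipitor, fp_lipitor))  # identical fingerprints give 1.0
print(round(tanimoto(fp_lipitor, fp_benzene), 2))
```

Storing the fingerprint as a set of bit positions keeps the Tanimoto calculation to two set operations; a production toolkit would typically use packed bit vectors instead.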
Compound Properties/Descriptors
1D, 2D, 3D, multi-dimensional properties
• 1D: Molecular Weight, clogP, # of Atoms, charge, # H-Bond donors and acceptors
• 2D: Atom pairs, substructures, functional groups
• 3D: Shape, pharmacophores
• nD: Fingerprints, etc.
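As a small example of a 1D descriptor, the sketch below counts heavy atoms per element directly from a SMILES string. It assumes a bracket-free SMILES that uses only the organic subset (B, C, N, O, P, S, F, Cl, Br, I and their aromatic lowercase forms); the function name and the regular expression are illustrative, and descriptors such as molecular weight or clogP would need a real cheminformatics toolkit.

```python
import re
from collections import Counter

# Two-letter symbols come first so that "Cl" is not read as "C" plus "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles):
    """Count heavy atoms per element in a bracket-free SMILES string."""
    if "[" in smiles:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    # capitalize() folds aromatic atoms such as "c" into their element "C".
    return Counter(token.capitalize() for token in ATOM_PATTERN.findall(smiles))

lipitor = ("CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)"
           "=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4")
counts = heavy_atom_counts(lipitor)
print(dict(counts))
print(sum(counts.values()))  # 41 heavy atoms: C33, N2, O5, F1
```

The result is consistent with atorvastatin's molecular formula, C33H35FN2O5, once the implicit hydrogens are left out.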
Chemical series – compounds sharing
the same core structures
Series Classifications –
Ward's Clustering
Iteratively merge pairs of nodes until all nodes are merged.
At each merging step, the two nodes whose merger gives the minimal increase in variance are chosen and merged into one new node.
Once the tree hierarchy is generated, clusters can be defined by cutting the tree at a chosen dissimilarity threshold.
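The merging procedure can be sketched in a few lines. This is a naive illustration, not an efficient implementation: each compound is assumed to be a numeric descriptor vector, the merge cost is the increase in within-cluster sum of squares, and merging stops once the cheapest merge exceeds the threshold. Because Ward merge costs do not decrease as the tree grows, stopping early should give the same clusters as building the full tree and cutting it at that height. The function names and sample points are hypothetical.

```python
from itertools import combinations

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * d2

def ward_clusters(points, threshold):
    """Greedily merge the cheapest pair until the next merge would exceed threshold."""
    clusters = [{"members": [i], "n": 1, "centroid": list(p)}
                for i, p in enumerate(points)]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: ward_cost(*pair))
        if ward_cost(a, b) > threshold:
            break
        n = a["n"] + b["n"]
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(a["n"] * x + b["n"] * y) / n
                    for x, y in zip(a["centroid"], b["centroid"])]
        merged = {"members": a["members"] + b["members"], "n": n, "centroid": centroid}
        clusters = [c for c in clusters if c is not a and c is not b] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)

# Hypothetical 2D descriptor vectors forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(ward_clusters(points, threshold=5.0))  # prints [[0, 1], [2, 3]]
```

The pairwise search makes every merge step quadratic in the number of clusters, which is fine for a demonstration but too slow for large compound collections.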
Toxicity Properties
• Inhibition of CYP450 isozymes
• PXR transactivation
• Human hepatocyte toxicity
• Mutagenicity
• Mitochondria toxicity
• Covalent protein binding
• Inhibition of HERG
ADME/Physicochemical Properties
• Solubility
• Chemical stability
• Hydrophobicity/hydrogen bonding potential
• Intestinal mucosal cell permeation
• Liver and kidney clearance
• Metabolism
• Transporters
• Charge
• Size
• Protein binding
• Blood-brain barrier permeation
• Target cell permeation
Primary pharmacology
• In vitro potency
• Cell based potency
• Functional assays
• Selectivity against other targets
Drug-Likeness: Rule of Five
Proposed by C. Lipinski to describe ‘drug-like’ molecules.
Molecules displaying good oral absorption and/or distribution properties are likely to possess the following characteristics:
– Molecular Weight ≤ 500
– logP ≤ 5.0
– H-donors ≤ 5
– H-acceptors ≤ 10
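A Rule of Five check is easy to express in code. The sketch below uses the four criteria as they are commonly stated (molecular weight ≤ 500, logP ≤ 5, at most 5 H-bond donors, at most 10 H-bond acceptors) and reports which ones a compound violates. The `Descriptors` class, the function name, and the descriptor values are made up for illustration; in practice the values would come from a descriptor calculator.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int

def rule_of_five_violations(d):
    """Return the list of Rule of Five criteria that the compound violates."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "logP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]

# Hypothetical descriptor values, not measurements of any real compound.
small = Descriptors(mol_weight=350.4, logp=3.1, h_donors=2, h_acceptors=6)
large = Descriptors(mol_weight=720.9, logp=6.3, h_donors=3, h_acceptors=9)
print(rule_of_five_violations(small))  # prints []
print(rule_of_five_violations(large))
```

Returning the violated criteria rather than a single pass/fail flag makes it easy to apply a looser filter, such as tolerating one violation.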
Data Visualization
[Figure: example views: Grid View, Table View, Plot View, Heatmap View]
Software Relevance, Software Usability, Software Management
Building Predictive Models using Machine Learning Techniques
• Use computational models to understand the Structure-Activity Relationship (SAR)
• Use computational models to run virtual screens that guide compound selection for synthesis
Interpretability of Predictive Models
[Figure panels: "The good part" and "The not so good part"]
Multiple Parameter Optimization in
Combinatorial Library Design
Given a 100x100x100 virtual library space and a set of
predictive models for various properties (e.g. potency,
ADME, selectivity), select the best 300 compounds for
synthesis with the highest probability of being potent and
drug-like and with diverse sampling of the chemical
space
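One simple way to approach this selection problem is sketched below, under several assumptions: each predictive model is a function returning a score between 0 and 1, the scores are combined as a geometric mean, and diversity is approximated by capping how often any single R-group is reused. The toy models and the 10x10x10 library are placeholders chosen so the example runs quickly; none of this is taken from the slides, and a 100x100x100 library would have 1,000,000 members to score.

```python
import math
from itertools import product

def desirability(scores):
    """Geometric mean of per-property scores; one poor property drags the total down."""
    return math.prod(scores) ** (1.0 / len(scores))

def select_compounds(r1s, r2s, r3s, models, n_select, max_reuse):
    """Greedy pick of top-scoring compounds, capping the reuse of each R-group."""
    library = list(product(r1s, r2s, r3s))
    ranked = sorted(library,
                    key=lambda c: desirability([m(c) for m in models]),
                    reverse=True)
    usage, chosen = {}, []
    for compound in ranked:
        keys = list(enumerate(compound))  # (position, R-group) pairs
        if any(usage.get(k, 0) >= max_reuse for k in keys):
            continue  # skip: one of its R-groups is already used too often
        chosen.append(compound)
        for k in keys:
            usage[k] = usage.get(k, 0) + 1
        if len(chosen) == n_select:
            break
    return chosen

# Hypothetical models on integer R-group ids 0..9; potency and ADME disagree on R2.
def toy_potency(c):
    return (c[0] + c[1] + 2) / 20

def toy_adme(c):
    return 1.0 - c[1] / 20

def toy_selectivity(c):
    return (c[2] + 1) / 10

models = [toy_potency, toy_adme, toy_selectivity]
chosen = select_compounds(range(10), range(10), range(10), models,
                          n_select=12, max_reuse=3)
print(chosen)
```

The reuse cap is a crude proxy for diversity. A similarity-based criterion, for example a minimum fingerprint distance between selected compounds, would be a natural refinement.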
[Figure: library scaffold with R1, R2 and R3 substituents]
The Problem of Multiple Parameter Optimization
• The chemical space is huge
• Predictive models are not very predictive
• Many parameters to optimize, some of which conflict with each other
MPO – a case study with kinase selectivity
Trifluoro-diaminopyrimidine series (~200 cmpds) with variable R1 and R2 groups
[Figure: series scaffold and R1 x R2 grid of tested compounds]
• Tested compounds: only a few R-group/kinase combinations have been previously tested
• Model building (FW): solve for R-group contributions using linear regression
• Enumeration: predictable virtual chemical space of R1 x R2 combinations, 5-50x expansion
• Virtual library profile: identify compounds with the desired selectivity profile in the expanded virtual chemical space
~200 cmpds from a library were tested against 40 kinases. Can we design another 100 cmpds that are highly selective?
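The model-building step can be sketched as an additive R-group model, reading "FW" as Free-Wilson analysis (that reading is an assumption). Activity for one kinase is modeled as a baseline plus one contribution per R1 group and one per R2 group, fitted here by alternating least-squares updates instead of a matrix solver so that the example needs no external libraries. The R-group labels and pIC50 values are synthetic.

```python
def fit_free_wilson(observations, n_iter=200):
    """Fit activity ~ mu + a[r1] + b[r2] from (r1, r2, activity) triples."""
    mu = sum(y for _, _, y in observations) / len(observations)
    a = {r1: 0.0 for r1, _, _ in observations}
    b = {r2: 0.0 for _, r2, _ in observations}
    for _ in range(n_iter):
        # Each update is the least-squares optimum for one group of coefficients
        # with the other group held fixed.
        for r1 in a:
            residuals = [y - mu - b[q] for p, q, y in observations if p == r1]
            a[r1] = sum(residuals) / len(residuals)
        for r2 in b:
            residuals = [y - mu - a[p] for p, q, y in observations if q == r2]
            b[r2] = sum(residuals) / len(residuals)
    return mu, a, b

def predict(model, r1, r2):
    """Predicted activity for an R1/R2 combination, tested or not."""
    mu, a, b = model
    return mu + a[r1] + b[r2]

# Synthetic pIC50 values generated from an exactly additive rule:
# 6.0 + {A: 0.5, B: -0.5, C: 1.0} + {X: 0.2, Y: -0.3, Z: 0.0}
tested = [("A", "X", 6.7), ("A", "Y", 6.2), ("B", "Y", 5.2),
          ("B", "Z", 5.5), ("C", "Z", 7.0), ("C", "X", 7.2)]
model = fit_free_wilson(tested)

# Enumerate the full R1 x R2 grid, including the three untested combinations.
for r1 in "ABC":
    for r2 in "XYZ":
        print(r1, r2, round(predict(model, r1, r2), 2))
```

Only six of the nine combinations are tested, yet the remaining three can be predicted because every R-group appears in at least one tested compound. That is the sense in which the virtual space is "predictable"; an R-group never seen in a tested compound has no fitted contribution.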
Experimental Validation of Predictions
KSS pIC50 vs. FW pIC50
[Figure: 12 scatter-plot panels with r2 = 0.45, 0.59, 0.92, 0.86, 0.74, 0.83, 0.63, 0.88, 0.85, 0.81, 0.81, 0.85; annotation: "More promiscuous"]
~40 cmpds in two series were selected for KSS testing
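The r2 values compare predicted and measured pIC50. The slides do not say how r2 was computed; the sketch below assumes it is the squared Pearson correlation, and the numbers are invented purely to show the calculation.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical predicted and measured pIC50 values for one kinase.
predicted = [5.1, 5.8, 6.4, 7.0, 7.9]
measured = [5.3, 5.6, 6.9, 6.8, 8.1]
print(round(r_squared(predicted, measured), 2))
```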
Cheminformatics Challenges for Drug
Discovery
• Information retrieval and knowledge management - rapidly and efficiently present all relevant data/knowledge to scientists at the right time and right place
• Predictive models - drastically improve the accuracy and interpretability of in silico models for potency and ADME endpoints
• Computer-aided design – provide easy to use software applications to help scientists analyze/visualize their data and make efficient use of prior knowledge during compound
References
1. Agrafiotis, D. K., Lobanov, V. S. and Salemme, F. R. (2002) Combinatorial informatics in the post-genomics era. Nat Rev Drug Discov. 1, 337-346
2. Lipinski, C. and Hopkins, A. (2004) Navigating chemical space for biology and medicine. Nature. 432, 855-861
3. Paolini, G. V., Shapland, R. H., van Hoorn, W. P., Mason, J. S. and Hopkins, A. L. (2006) Global mapping of pharmacological space. Nat Biotechnol. 24, 805-815