4.6 APPLICATION AREAS

In chemoinformatics and computer-aided drug discovery, support vector machines and binary kernel discrimination have thus far mostly been used to distinguish between active and inactive compounds in virtual screening [76,81,82]. Bayesian models and classifiers, however, have also been applied beyond the prediction of active compounds. For example, Bayesian modeling has been used to predict compound recall for fingerprint search calculations [79], multidrug resistance [69], and the biological targets of test compounds [70]. Nevertheless, for all of these advanced data mining approaches, virtual compound screening is a major application area, and the derivation of predictive models from experimental screening data is particularly attractive. Models for activity prediction have also been built from screening data using recursive partitioning and hierarchical clustering techniques, but their quality is typically rather sensitive to systematic errors and noise in the data, from which essentially any high-throughput screening (HTS) data set suffers. Advanced data mining methods such as Bayesian modeling and binary kernel discrimination have become very attractive for these purposes because they have been shown to be capable of deriving robust models from noisy screening data [73,83].

Typically, models are built from screening data to search other databases for novel active compounds. HTS data thus serve as a learning set for deriving a support vector machine, Bayesian, or binary kernel discrimination model that classifies other database compounds as active or inactive. This also makes these data mining approaches very attractive for iterative experimental and virtual screening campaigns, which are often described as sequential screening [86,87]. Iterative cycles of virtual compound preselection from screening libraries and experimental evaluation can substantially reduce the number of compounds that need to be screened to identify sufficient numbers of hits for follow-up [86,88]. During these iterations, newly identified hits are usually included in model refinement for subsequent rounds of compound selection. The major aim of these calculations is to continuously enrich small compound sets with active compounds, and this selection scheme can be quite powerful. For example, if only a moderate overall enrichment factor of five is achieved, only 10% of a screening library needs to be tested to identify 50% of the potentially available hits.
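
The arithmetic behind this example can be checked with a short calculation. The sketch below assumes the common definition of the enrichment factor as the hit rate in the tested subset divided by the hit rate in the whole library. The function names and the library numbers are illustrative assumptions and are not taken from the cited studies.

```python
def enrichment_factor(hits_found, n_tested, total_hits, library_size):
    """Hit rate in the tested subset divided by the hit rate in the whole library."""
    if min(n_tested, total_hits, library_size) <= 0:
        raise ValueError("counts must be positive")
    return (hits_found / n_tested) / (total_hits / library_size)


def fraction_of_hits_recovered(ef, fraction_tested):
    """Share of all available hits found when a given fraction of the library is tested."""
    return min(1.0, ef * fraction_tested)


# Hypothetical library: 100,000 compounds containing 500 hits.
# Testing 10,000 preselected compounds (10%) recovers 250 hits (50%).
ef = enrichment_factor(hits_found=250, n_tested=10_000,
                       total_hits=500, library_size=100_000)
recovered = fraction_of_hits_recovered(ef, fraction_tested=0.10)
print(round(ef, 2), round(recovered, 2))
```

Under this definition, an enrichment factor of five at 10% library coverage corresponds to half of the available hits, which matches the figures quoted in the text.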

Initial approaches to establish sequential screening schemes predominantly employed recursive partitioning [89,90] or recursion forest analysis [91], but machine learning techniques have recently also been applied [92]. For advanced data mining approaches, sequential screening represents a highly attractive application scenario for several reasons. For example, Bayesian or kernel-based classifiers are much less influenced by screening data noise than standard compound classification methods. Moreover, classifiers can be trained not only to select active compounds but also to efficiently deselect database molecules having a very low probability of activity. Given that the vast majority of database compounds are potential false positives for a given target, efficient compound deselection becomes an important task in screening database analysis and can greatly contribute to achieving favorable enrichment factors during iterative screening campaigns. Thus, we can expect that interest in machine learning and data mining approaches for virtual and iterative compound screening will further increase.
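
A sequential screening cycle with selection and deselection can be sketched in a few lines. The example below is only a toy illustration under stated assumptions: the library is synthetic, the "assay" is a lookup of a hidden activity label, and the classifier is a simplified naive Bayes-style score (Laplace-smoothed log-odds summed over set fingerprint bits), not the specific methods of the cited studies. All function names and parameters are hypothetical.

```python
import math
import random


def train_naive_bayes(fingerprints, labels, n_bits):
    """Per-bit log-odds weights from Laplace-smoothed bit frequencies."""
    n_act = sum(labels)
    n_inact = len(labels) - n_act
    weights = []
    for bit in range(n_bits):
        act_on = sum(1 for fp, y in zip(fingerprints, labels) if y and bit in fp)
        inact_on = sum(1 for fp, y in zip(fingerprints, labels) if not y and bit in fp)
        p_act = (act_on + 1) / (n_act + 2)
        p_inact = (inact_on + 1) / (n_inact + 2)
        weights.append(math.log(p_act / p_inact))
    return weights


def score(fp, weights):
    """Sum the log-odds of the bits set in a fingerprint (higher = more likely active)."""
    return sum(weights[bit] for bit in fp)


def make_toy_library(n_compounds, n_bits, rng):
    """Synthetic (fingerprint, is_active) pairs; bits 0-3 are enriched in actives."""
    library = []
    for _ in range(n_compounds):
        active = rng.random() < 0.05
        fp = {bit for bit in range(n_bits)
              if rng.random() < (0.8 if active and bit < 4 else 0.2)}
        library.append((fp, active))
    return library


def sequential_screen(library, n_bits, rounds, batch_size, deselect_fraction, rng):
    """Return indices of tested compounds after several select/test/retrain rounds."""
    untested = set(range(len(library)))
    tested = rng.sample(sorted(untested), batch_size)  # round 1: random selection
    untested -= set(tested)
    for _ in range(rounds - 1):
        fps = [library[i][0] for i in tested]
        labels = [library[i][1] for i in tested]  # simulated assay readout
        weights = train_naive_bayes(fps, labels, n_bits)
        ranked = sorted(untested, key=lambda i: score(library[i][0], weights),
                        reverse=True)
        # Deselect the lowest-scoring compounds, then test the top of the rest.
        n_drop = int(len(ranked) * deselect_fraction)
        kept = ranked[:len(ranked) - n_drop]
        batch = kept[:batch_size]
        tested.extend(batch)
        untested = set(kept[batch_size:])
    return tested


rng = random.Random(0)
library = make_toy_library(2000, 32, rng)
tested = sequential_screen(library, 32, rounds=4, batch_size=100,
                           deselect_fraction=0.25, rng=rng)
hits_found = sum(library[i][1] for i in tested)
total_hits = sum(active for _, active in library)
print(f"tested {len(tested)} of {len(library)}; found {hits_found} of {total_hits} hits")
```

Each round retrains the model on all compounds tested so far, so newly identified hits feed into the next selection, while the deselection step permanently removes the compounds judged least likely to be active. A real campaign would replace the toy library and the label lookup with actual fingerprints and assay results.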

Another attractive application area for advanced data mining methods is the assembly of target-focused compound libraries. A variety of approaches have been introduced to design target-focused libraries based on ligand or target structure information or a combination of both [14]. In recent years, there has been a clear trend to employ structure-based design techniques for the generation of focused libraries [93,94], more so than data mining methods. However, conceptually similar to the tasks at hand in iterative screening, a major goal of targeted library design is a significant enrichment of molecules having a high probability of displaying a target-specific activity in compound sets that are much smaller than diverse screening libraries. Therefore, data mining also becomes highly attractive for these applications. For example, the ability to predict biological targets for large numbers of database compounds using multiple Bayesian models [70] is expected to substantially aid in prioritizing compounds for the assembly of target-focused libraries. Thus, similar to iterative screening, we can expect that the design of specialized compound libraries will also be a future growth area for data mining applications.

4.7 CONCLUSIONS

In this chapter, we have discussed various data mining approaches and selected applications in the context of chemoinformatics. Since the performance of data mining methods cannot be separated from the molecular representations that are employed, prominent types of molecular descriptors and representations have also been reviewed. Special emphasis has been put on the theoretical foundations of three advanced data mining approaches that are becoming increasingly popular in chemoinformatics and pharmaceutical research: Bayesian modeling, binary kernel discrimination, and support vector machines. We have particularly highlighted virtual and integrated compound screening schemes and the design of target-focused compound libraries as attractive application areas with future growth potential.

REFERENCES

1. Halperin I, Ma B, Wolfson H, Nussinov R. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 2002;47:409–443.

2. Klebe G. Virtual ligand screening: Strategies, perspectives and limitations. Drug Discov Today 2006;11:580–594.

3. Mason JS, Good AC, Martin EJ. 3-D pharmacophores in drug discovery. Curr Pharm Des 2001;7:567–597.

4. Hawkins P, Skillman A, Nicholls A. Comparison of shape-matching and docking as virtual screening tools. J Med Chem 2007;50:74–82.

5. Sheridan RP, Kearsley SK. Why do we need so many chemical similarity search methods? Drug Discov Today 2002;7:903–911.

6. McGaughey GB, Sheridan RP, Bayly CI, Culberson JC, Kreatsoula C, Lindsley S, Maiorov V, Truchon JF, Cornell WD. Comparison of topological, shape, and docking methods in virtual screening. J Chem Inf Model 2007;47:1504–1519.

7. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 1997;23:3–25.

8. Willett P, Winterman V, Bawden D. Implementation of nonhierarchic cluster analysis methods in chemical information systems: Selection of compounds for biological testing and clustering of substructure search output. J Chem Inf Comput Sci 1986;26:109–118.

9. Barnard JM, Downs GM. Clustering of chemical structures on the basis of two-dimensional similarity measures. J Chem Inf Comput Sci 1992;32:644–649.

10. Brown RD, Martin YC. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J Chem Inf Comput Sci 1996;36:572–584.

11. Brown RD, Martin YC. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. J Chem Inf Comput Sci 1997;37:1–9.

12. Pearlman RS, Smith K. Novel software tools for chemical diversity. Perspect Drug Discov Des 1998;9:339–353.

13. Bajorath J. Integration of virtual and high-throughput screening. Nat Rev Drug Discov 2002;1:882–894.

14. Schnur D, Beno BR, Good A, Tebben A. Approaches to target class combinatorial library design. In: Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery, pp. 355–378. Totowa, NJ: Humana Press, 2004.

15. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Weinheim: Wiley-VCH, 2000.

16. Maldonado AG, Doucet JP, Petitjean M, Fan BT. Molecular similarity and diversity in chemoinformatics: From theory to applications. Mol Divers 2006;10:39–79.

17. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–36.

18. Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci 1989;29:97–101.

19. Stein SE, Heller SR, Tchekhovski D. An open standard for chemical structure representation: The IUPAC chemical identifier. In: Nimes International Chemical Information Conference Proceedings, edited by Collier H, pp. 131–143. Tetbury, UK: Infomatics, 2002. Available at http://www.iupac.org/inchi/ (accessed February 1, 2008).

20. Vidal D, Thormann M, Pons M. LINGO, an efficient holographic text based method to calculate biophysical properties and intermolecular similarities. J Chem Inf Model 2005;45:386–393.

21. Grant J, Haigh J, Pickup B, Nicholls A, Sayle R. Lingos, finite state machines, and fast similarity searching. J Chem Inf Model 2006;46:1912–1918.

22. Azencott CA, Ksikes A, Swamidass SJ, Chen JH, Ralaivola L, Baldi P. One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. J Chem Inf Model 2007;47:965–974.

23. Gillet V, Willett P, Bradshaw J. Similarity searching using reduced graphs. J Chem Inf Comput Sci 2003;43:338–345.

24. Labute P. Derivation and applications of molecular descriptors based on approximate surface area. In: Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery, pp. 261–278. Totowa, NJ: Humana Press, 2004.

25. Barnard JM. Substructure searching methods: Old and new. J Chem Inf Comput Sci 1993;33:532–538.

26. Raymond J, Gardiner E, Willett P. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J Chem Inf Comput Sci 2002;42:305–316.

27. Willett P. Searching techniques for databases of two- and three-dimensional chemical structures. J Med Chem 2005;48:4183–4199.

28. Gardiner E, Gillet V, Willett P, Cosgrove D. Representing clusters using a maximum common edge substructure algorithm applied to reduced graphs and molecular graphs. J Chem Inf Model 2007;47:354–366.

29. Hessler G, Zimmermann M, Matter H, Evers A, Naumann T, Lengauer T, Rarey M. Multiple-ligand-based virtual screening: Methods and applications of the MTree approach. J Med Chem 2005;48:6575–6584.

30. Barker E, Buttar D, Cosgrove D, Gardiner E, Kitts P, Willett P, Gillet V. Scaffold hopping using clique detection applied to reduced graphs. J Chem Inf Model 2006;46:503–511.

31. McGregor M, Pallai P. Clustering of large databases of compounds: Using the MDL “keys” as structural descriptors. J Chem Inf Comput Sci 1997;37:443–448.

32. Durant J, Leland B, Henry D, Nourse J. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 2002;42:1273–1280.

33. Mason J, Morize I, Menard P, Cheney D, Hulme C, Labaudiniere R. New 4-point pharmacophore method for molecular similarity and diversity applications: Overview of the method and applications, including a novel approach to the design of combinatorial libraries containing privileged substructures. J Med Chem 1999;42:3251–3264.

34. Carhart RE, Smith DH, Venkataraghavan R. Atom pairs as molecular features in structure-activity studies: Definition and applications. J Chem Inf Comput Sci 1985;25:64–73.

35. Daylight Theory Manual. Aliso Viejo, CA: Daylight Chemical Information Systems, Inc., 2002. Available at http://www.daylight.com/dayhtml/doc/theory/ (accessed February 1, 2008).

36. Bender A, Mussa HY, Glen RC, Reiling S. Molecular similarity searching using atom environments, information-based feature selection, and a naïve Bayesian classifier. J Chem Inf Comput Sci 2004;44:170–178.

37. Klon A, Glick M, Thoma M, Acklin P, Davies J. Finding more needles in the haystack: A simple and efficient method for improving high-throughput docking results. J Med Chem 2004;47:2743–2749.

38. Klon A, Glick M, Davies J. Application of machine learning to improve the results of high-throughput docking against the HIV-1 protease. J Chem Inf Comput Sci 2004;44:2216–2224.

39. Bender A, Mussa HY, Glen RC, Reiling S. Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): Evaluation of performance. J Chem Inf Comput Sci 2004;44:1708–1718.

40. Bender A, Mussa H, Gill G, Glen R. Molecular surface point environments for virtual screening and the elucidation of binding patterns (MOLPRINT 3D). J Med Chem 2004;47:6569–6583.

41. Xue L, Godden J, Stahura F, Bajorath J. Design and evaluation of a molecular fingerprint involving the transformation of property descriptor values into a binary classification scheme. J Chem Inf Comput Sci 2003;43:1151–1157.

42. Eckert H, Bajorath J. Design and evaluation of a novel class-directed 2D fingerprint to search for structurally diverse active compounds. J Chem Inf Model 2006;46:2515–2526.

43. Bajorath J. Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening. J Chem Inf Comput Sci 2001;41:233–245.

44. Engels MFM, Venkatarangan P. Smart screening: Approaches to efficient HTS. Curr Opin Drug Discov Devel 2001;4:275–283.

45. Downs GM, Barnard JM. Clustering methods and their uses in computational chemistry. In: Reviews in Computational Chemistry, Vol. 18, edited by Lipkowitz KB, Boyd DB, pp. 1–40. Weinheim: Wiley-VCH, 2002.

46. Ward JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc 1963;58:236–244.

47. Duda RO, Hart PE, Stork DG. Pattern Classification, 2nd edn. New York: Wiley Interscience, 2000.

48. Jarvis R, Patrick E. Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 1973;C22:1025–1034.

49. Pearlman R, Smith K. Metric validation and the receptor-relevant subspace concept. J Chem Inf Comput Sci 1999;39:28–35.

50. Godden J, Xue L, Kitchen D, Stahura F, Schermerhorn E, Bajorath J. Median partitioning: A novel method for the selection of representative subsets from large compound pools. J Chem Inf Comput Sci 2002;42:885–893.

51. Chen X, Rusinko A, Young S. Recursive partitioning analysis of a large structure-activity data set using three-dimensional descriptors. J Chem Inf Comput Sci 1998;38:1054–1062.

52. Rusinko A, Farmen M, Lambert C, Brown P, Young S. Analysis of a large structure/biological activity data set using recursive partitioning. J Chem Inf Comput Sci 1999;39:1017–1026.

53. Johnson M, Maggiora G. Concepts and Applications of Molecular Similarity. New York: John Wiley & Sons, 1990.

54. Willett P. Chemical similarity searching. J Chem Inf Comput Sci 1998;38:983–996.

55. Molecular Drug Data Report (MDDR). San Leandro, CA: Elsevier MDL. Available at http://www.mdl.com (accessed February 1, 2008).

56. Olah M, Mracec M, Ostopovici L, Rad R, Bora A, Hadaruga N, Olah I, Banda M, Simon Z, Mracec M, Oprea TI. WOMBAT: World of molecular bioactivity. In: Chemoinformatics in Drug Discovery, edited by Oprea TI, pp. 223–239. New York: Wiley-VCH, 2004.

57. Salim N, Holliday J, Willett P. Combination of fingerprint-based similarity coefficients using data fusion. J Chem Inf Comput Sci 2003;43:435–442.

58. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J Chem Inf Comput Sci 2004;44:1177–1185.

59. Whittle M, Gillet VJ, Willett P. Analysis of data fusion methods in virtual screening: Theoretical model. J Chem Inf Model 2006;46:2193–2205.

60. Whittle M, Gillet VJ, Willett P. Analysis of data fusion methods in virtual screening: Similarity and group fusion. J Chem Inf Model 2006;46:2206–2219.

61. Xue L, Stahura F, Godden J, Bajorath J. Fingerprint scaling increases the probability of identifying molecules with similar activity in virtual screening calculations. J Chem Inf Comput Sci 2001;41:746–753.

62. Godden J, Furr J, Xue L, Stahura F, Bajorath J. Molecular similarity analysis and virtual screening by mapping of consensus positions in binary-transformed chemical descriptor spaces with variable dimensionality. J Chem Inf Comput Sci 2004;44:21–29.

63. Eckert H, Bajorath J. Determination and mapping of activity-specific descriptor value ranges for the identification of active compounds. J Med Chem 2006;49:2284–2293.

64. Eckert H, Vogt I, Bajorath J. Mapping algorithms for molecular similarity analysis and ligand-based virtual screening: Design of DynaMAD and comparison with MAD and DMC. J Chem Inf Model 2006;46:1623–1634.

65. Xia X, Maliski E, Gallant P, Rogers D. Classification of kinase inhibitors using a Bayesian model. J Med Chem 2004;47:4463–4470.

66. Vogt M, Godden J, Bajorath J. Bayesian interpretation of a distance function for navigating high-dimensional descriptor spaces. J Chem Inf Model 2007;47:39–46.

67. Vogt M, Bajorath J. Bayesian screening for active compounds in high-dimensional chemical spaces combining property descriptors and molecular fingerprints. Chem Biol Drug Des 2008;71:8–14.

68. Watson P. Naïve Bayes classification using 2D pharmacophore feature triplet vectors. J Chem Inf Model 2008;48:166–178.

69. Sun H. A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing. J Med Chem 2005;48:4031–4039.

70. Nidhi, Glick M, Davies JW, Jenkins JL. Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. J Chem Inf Model 2006;46:1124–1133.

71. Klon AE, Lowrie JF, Diller DJ. Improved naïve Bayesian modeling of numerical data for absorption, distribution, metabolism and excretion (ADME) property prediction. J Chem Inf Model 2006;46:1945–1956.

72. Vogt M, Bajorath J. Bayesian similarity searching in high-dimensional descriptor spaces combined with Kullback-Leibler descriptor divergence analysis. J Chem Inf Model 2008;48:247–255.

73. Labute P. Binary QSAR: A new method for the determination of quantitative structure activity relationships. In: Pacific Symposium on Biocomputing, Vol. 4, edited by Altman RB, Dunker AK, Hunter L, Klein TE, pp. 444–455. Singapore: World Scientific Publishing, 1999.

74. Godden J, Bajorath J. A distance function for retrieval of active molecules from complex chemical space representations. J Chem Inf Model 2006;46:1094–1097.

75. Ormerod A, Willett P, Bawden D. Comparison of fragment weighting schemes for substructural analysis. Quant Struct Act Relat 1989;8:115–129.

76. Wilton DJ, Harrison RF, Willett P, Delaney J, Lawson K, Mullier G. Virtual screening using binary kernel discrimination: Analysis of pesticide data. J Chem Inf Model 2006;46:471–477.

77. Cramer R, Redl G, Berkoff C. Substructural analysis. A novel approach to the problem of drug design. J Med Chem 1974;17:533–535.

78. Kullback S. Information Theory and Statistics. Mineola, NY: Dover Publications, 1997.

79. Vogt M, Bajorath J. Introduction of an information-theoretic method to predict recovery rates of active compounds for Bayesian in silico screening: Theory and screening trials. J Chem Inf Model 2007;47:337–341.

80. Vogt M, Bajorath J. Introduction of a generally applicable method to estimate retrieval of active molecules for similarity searching using fingerprints. ChemMedChem 2007;2:1311–1320.

81. Harper G, Bradshaw J, Gittins JC, Green DVS, Leach AR. Prediction of biological activity for high-throughput screening using binary kernel discrimination. J Chem Inf Comput Sci 2001;41:1295–1300.

82. Wilton D, Willett P, Lawson K, Mullier G. Comparison of ranking methods for virtual screening in lead-discovery programs. J Chem Inf Comput Sci 2003;43:469–474.

83. Chen B, Harrison RF, Pasupa K, Willett P, Wilton DJ, Wood DJ, Lewell XQ. Virtual screening using binary kernel discrimination: Effect of noisy training data and the optimization of performance. J Chem Inf Model 2006;46:478–486.

84. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998;2:121–167.

85. Vapnik VN. The Nature of Statistical Learning Theory, 2nd edn. New York: Springer, 2000.

86. Stahura FL, Bajorath J. Virtual screening methods that complement HTS. Comb Chem High Throughput Screen 2004;7:259–269.

87. Blower PE, Cross KP, Eichler GS, Myatt GJ, Weinstein JN, Yang C. Comparison of methods for sequential screening of large compound sets. Comb Chem High Throughput Screen 2006;9:115–122.

88. Parker CN, Bajorath J. Towards unified compound screening strategies: A critical evaluation of error sources in experimental and virtual high-throughput screening. QSAR Comb Sci 2006;25:1153–1161.

89. Jones-Hertzog DK, Mukhopadhyay P, Keefer CE, Young SS. Use of recursive partitioning in the sequential screening of G-protein-coupled receptors. J Pharmacol Toxicol Methods 1999;42:207–215.

90. Abt M, Lim Y, Sacks J, Xie M, Young SS. A sequential approach for identifying lead compounds in large chemical databases. Stat Sci 2001;16:154–168.

91. van Rhee AM. Use of recursion forests in the sequential screening process: Consensus selection by multiple recursion trees. J Chem Inf Comput Sci 2003;43:941–948.

92. Auer J, Bajorath J. Simulation of sequential screening experiments using emerging chemical patterns. Med Chem 2008;4:80–90.

93. Deng Z, Chuaqui C, Singh J. Knowledge-based design of target-focused libraries using protein-ligand interaction constraints. J Med Chem 2006;49:490–500.

94. Orry AJ, Abagyan RA, Cavasotto CN. Structure based development of target specific compound libraries. Drug Discov Today 2006;11:261–266.
