PREDICTION OF TOXIC EFFECTS OF PHARMACEUTICAL AGENTS

5.7 MODEL VALIDATION

5.7.3 Mechanistic Interpretation

Many (Q)SAR and data mining techniques can be used to derive a hypothesis about biological mechanisms. However, it is important to remember that most of these techniques have no knowledge about chemical and biological processes. Thus, they cannot reason about mechanisms, but they can provide information that is relevant for a mechanistic assessment (e.g., structural alerts, compounds with similar modes of action). This means that a toxicological researcher has to evaluate only a limited number of possible hypotheses, but expert knowledge is still needed for the identification of mechanisms.

The interpretability of models and individual predictions may depend on several factors:

• Model complexity. Interpretability decreases with model complexity and abstraction level, but complex models are frequently needed to accommodate real-world situations. It is, however, not always necessary to interpret complete models. The extraction of specific information (e.g., relevant substructures/properties) and the inspection of rationales for individual predictions may provide more information for toxicologists than complete models.

• Biological rationale for the algorithm. Most scientists find it easier to interpret models that have a biological rationale and/or resemble their way of thinking about toxicological phenomena. Techniques based on chemical similarities are very useful in this respect because they support the search for analogs and chemical classes. A mechanistic hypothesis (and a critical evaluation of individual predictions) can be obtained from the inspection of relevant features and from the mechanisms of structurally similar compounds.

• Visual presentation of the results. Most data mining programs are hard to use for people who are not data mining experts, and they have great shortcomings in the visual presentation of their results. End users with a toxicological background should not be confused by data mining/(Q)SAR terminology or by detailed options for algorithms and parameter settings. Instead, the interface should provide an intuitive and traceable presentation of the rationales for a prediction, together with links to supporting information (e.g., original data, results in other assays, literature).
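As an illustration of how a similarity-based technique can expose the rationale for an individual prediction, the following minimal Python sketch ranks training compounds by Tanimoto similarity to a query and returns the neighbors together with the features they share. It is a toy under stated assumptions, not the method of any particular program: the compound names, the fragment features, and the activity labels are all hypothetical, and a real application would compute fingerprints with a cheminformatics toolkit.

```python
from typing import NamedTuple


class Compound(NamedTuple):
    name: str
    features: frozenset  # fragment features present in the structure (hypothetical)
    toxic: bool          # activity label in the (hypothetical) training data


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def explain_prediction(query: frozenset, training: list, k: int = 3):
    """Similarity-weighted vote of the k nearest neighbors, plus the neighbors."""
    scored = [(c, tanimoto(query, c.features)) for c in training]
    # sorted() is stable, so ties keep their order in the training list.
    neighbors = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    total = sum(sim for _, sim in neighbors)
    toxic_weight = sum(sim for c, sim in neighbors if c.toxic)
    score = toxic_weight / total if total else 0.5  # 0.5 = no evidence either way
    return score, neighbors


# Hypothetical training set; the labels are invented for illustration only.
TRAINING = [
    Compound("A", frozenset({"aromatic_ring", "nitro"}), True),
    Compound("B", frozenset({"aromatic_ring", "nitro", "amine"}), True),
    Compound("C", frozenset({"alkyl_chain", "hydroxyl"}), False),
    Compound("D", frozenset({"aromatic_ring", "hydroxyl"}), False),
]
QUERY = frozenset({"aromatic_ring", "nitro", "hydroxyl"})

score, neighbors = explain_prediction(QUERY, TRAINING)

print(f"Similarity-weighted fraction of toxic neighbors: {score:.2f}")
for compound, sim in neighbors:
    shared = ", ".join(sorted(QUERY & compound.features))
    label = "toxic" if compound.toxic else "nontoxic"
    print(f"  {compound.name} ({label}), similarity {sim:.2f}, shared: {shared}")
```

The output lists the analogs and shared fragments rather than algorithm parameters, which is the kind of traceable rationale a toxicologist can check against the known mechanisms of the neighboring compounds.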

5.8 CONCLUSION

The most frequent application of data mining in predictive toxicology is the development of (Q)SAR models. The development of (Q)SAR models requires (1) the generation of features that represent chemical structures, (2) the selection of features for a particular end point, (3) the development of a (Q)SAR model, (4) the validation of the model, and (5) the interpretation of the model and of individual predictions.
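The five steps can be sketched end to end in a few lines of Python. This is a deliberately simple toy, assuming a hypothetical data set of fragment lists with invented activity labels and a hypothetical frequency-difference weighting as the model; it is meant only to show how the steps connect, not to represent a recommended algorithm.

```python
# Hypothetical data: (fragments found in the structure, invented activity label).
DATA = [
    ("aromatic_ring nitro", True),
    ("aromatic_ring nitro amine", True),
    ("nitro alkyl_chain", True),
    ("aromatic_ring hydroxyl", False),
    ("alkyl_chain hydroxyl", False),
    ("alkyl_chain carboxyl", False),
]


def generate_features(fragments: str) -> frozenset:
    """Step 1: represent a chemical structure as a set of fragment features."""
    return frozenset(fragments.split())


def feature_weights(rows):
    """Per feature: frequency among actives minus frequency among inactives."""
    actives = [f for f, active in rows if active]
    inactives = [f for f, active in rows if not active]
    weights = {}
    for feat in set().union(*(f for f, _ in rows)):
        p_act = sum(feat in f for f in actives) / max(len(actives), 1)
        p_inact = sum(feat in f for f in inactives) / max(len(inactives), 1)
        weights[feat] = p_act - p_inact
    return weights


def train(rows, min_abs_weight=0.5):
    """Steps 2 and 3: keep discriminating features; their weights are the model."""
    return {f: w for f, w in feature_weights(rows).items()
            if abs(w) >= min_abs_weight}


def predict(model, features):
    """Return (predicted activity, per-feature contributions).

    A compound with no selected features scores 0 and is predicted inactive.
    """
    contributions = {f: model[f] for f in features if f in model}
    return sum(contributions.values()) > 0, contributions


def leave_one_out_accuracy(rows):
    """Step 4: leave-one-out validation; selection is redone in every fold."""
    hits = 0
    for i, (features, label) in enumerate(rows):
        model = train(rows[:i] + rows[i + 1:])
        predicted, _ = predict(model, features)
        hits += predicted == label
    return hits / len(rows)


ROWS = [(generate_features(s), y) for s, y in DATA]
MODEL = train(ROWS)
ACCURACY = leave_one_out_accuracy(ROWS)

# Step 5: interpret the model and one individual prediction.
PREDICTED, CONTRIBUTIONS = predict(
    MODEL, generate_features("aromatic_ring nitro hydroxyl"))
print("Selected features and weights:", MODEL)
print(f"Leave-one-out accuracy: {ACCURACY:.2f}")
print("Prediction:", PREDICTED, "contributions:", CONTRIBUTIONS)
```

Feature selection is repeated inside each validation fold so that the held-out compound never influences which features the model uses; selecting features once on the full data set would give an optimistic accuracy estimate.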

For each of these tasks, a large number of data mining techniques are available. Selecting and combining suitable algorithms for the individual steps allows us to develop problem-specific solutions with capabilities that go beyond standardized solutions. However, it is important to understand the properties and limitations of the applied techniques and to communicate them clearly to the model users.

In document Pharmaceutical Data Mining (Page 188-193)