
1.3 Problem Statement and Aims of the Study

studies estimate that traditional development of a new prescription drug takes between 10 and 20 years and costs an average of $500-$800 million (Dickson & Gagnon, 2009). Consequently, the pressure to reduce costs and accelerate drug discovery is high, especially considering the human benefits of such an achievement and the constant need to cope with new diseases. Besides the efficiency of the process, it is also important to increase the percentage of compounds with high therapeutic value and to reduce the side effects of drugs. The advantages of computational methods extend to the earliest stages of the complex drug discovery process: drug lead identification and the elimination of compounds that are toxic or have poor pharmacokinetic properties.

1.3 Problem Statement and Aims of the Study

While much bioscience is published with the knowledge that machines will be expected to understand at least part of it, almost all chemistry is published purely for humans to read.

∼ P. Murray-Rust et al. (2004)

Various analytical tools from statistics and machine learning are used in QSPR/QSAR analysis, including predictive modelling (classification and regression), visualization, exploratory data analysis and cluster analysis. These studies rely on the principle that similar compounds tend to have similar properties (Johnson & Maggiora, 1990). Applying machine learning techniques in this domain is challenging for two reasons: the data are naturally unstructured, as molecules can have arbitrary dimension, structure and composition, and there is no single, unambiguous way of coding and comparing molecules. Several approaches exist and several have provided good results for specific domains. However, to the best of our knowledge, one cannot expect a single QSPR/QSAR approach to work well for every property: the set of descriptors that yields good predictive power depends heavily on the property of interest, and most methodologies work like "black boxes", without a detailed understanding of each prediction and its expected error.
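One common way to make the similarity principle operational, not necessarily the one adopted in this work, is to encode each molecule as a set of structural features and compare the sets with the Tanimoto coefficient. The sketch below assumes such a set-based encoding; the feature labels are invented for illustration and do not come from any real fingerprinting scheme.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Hypothetical structural features, invented for illustration only.
ethanol = frozenset({"C-C", "C-O", "O-H"})
propanol = frozenset({"C-C", "C-O", "O-H", "C-C-C"})
benzene = frozenset({"c:c", "ring6"})

print(tanimoto(ethanol, propanol))  # 3 shared features out of 4 distinct: 0.75
print(tanimoto(ethanol, benzene))   # no shared features: 0.0
```

A different encoding or similarity measure could rank the same molecules differently, which is the ambiguity described in the paragraph above.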

Figure 1.2: Schematic overview of the study objectives. a) Representation and manipulation of molecular structures and experimental data. b) Development of data-mining models. c) Implementation of existing models. d) Assessment and comparison of results. e) Development of Web-based systems to disseminate the results.

Having in mind all the strengths and limitations of the existing databases and prediction methods, the thesis underlying the present work is that it is possible to improve the current models for the prediction of physical, chemical and biological properties based solely on the chemical structure, using advanced automated analysis solutions based on machine learning. The aims of the study cover the development and implementation of cutting-edge machine learning and statistical modelling algorithms for handling large-scale chemical data, in order to improve the prediction of properties not only in terms of predictive power but also in terms of the robustness and comprehensibility of the methodologies. The more specific aims of this study are represented in Figure 1.2 and include:

• Compile and make available good experimental data on chemical, physical or biological properties of molecules, since without good data it is not possible to develop good predictive models (i.e., "garbage in, garbage out");

• Study the theoretical basis of QSPR/QSAR modelling and the machine learning methods used in this field;

• Understand the specificities of the representation of chemical structures in computer-readable formats, which is required for data analysis. Physical, chemical and biological properties are in large part determined by the molecular structure. There are several ways to represent a molecular structure, and different representations contain different chemical information. One of the major tasks in the automated extraction of meaning, patterns and regularities using machine learning methods is to represent chemical structures, that is, to transfer the various types of chemical information, taking into account their complex and heterogeneous nature, into a machine-readable representation that can be processed by a machine learning model. Hence, it is important to select machine-readable representations and machine learning models that can handle and extract the right chemical data according to the chemical property that needs to be predicted;

• Implement and assess existing prediction models with experimental data extracted from several sources, verifying the quality of the results produced, and develop and validate new model-based machine learning approaches to improve the results;

• Implement and assess the most widely used methodologies to calculate similarity between molecules;

• Develop a new algorithm to adequately quantify the structural similarity between molecules, with high discriminative power among similar molecules;

• Develop and assess an instance-based method that, in light of the structural similarity principle (Johnson & Maggiora, 1990), takes into account the high dimensionality of the chemical space and predicts chemical, physical or biological properties based on the most structurally similar compounds in the molecular space, thereby avoiding the selection of descriptors and increasing the robustness and comprehensibility of the method;

• Develop and implement methods that automatically search the neighbourhood of a compound and determine the optimal number of neighbours that can be used to predict its property with minimal prediction error;

• Design and develop experiments that both serve as a proof of concept and are applicable to existing experimental data. In fact, besides investigating the methodology and developing new models, one of the main aims is to investigate the impact of the approach on methods developed for real-world data and problems;

• Design and implement tools, and make open source code publicly available to the community, that not only allow access to data in a comprehensive way but also permit the use of the methodologies developed in the context of this work.
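The instance-based aims in the list above can be illustrated with a minimal sketch, assuming a precomputed matrix of pairwise structural similarities. The functions `knn_predict` and `best_k`, the similarity-weighted average, the leave-one-out criterion and the toy values are all assumptions made for illustration; they are not the method developed in this work.

```python
from typing import Sequence


def knn_predict(sims: Sequence[float], values: Sequence[float], k: int) -> float:
    """Similarity-weighted mean of the k most similar reference compounds.

    Assumes at least one reference compound is supplied.
    """
    ranked = sorted(zip(sims, values), key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(s for s, _ in ranked)
    if total == 0.0:
        # No informative neighbour: fall back to the unweighted mean.
        return sum(v for _, v in ranked) / len(ranked)
    return sum(s * v for s, v in ranked) / total


def best_k(sim: Sequence[Sequence[float]], values: Sequence[float],
           candidates: Sequence[int]) -> int:
    """Return the candidate k with the lowest leave-one-out mean absolute error."""
    n = len(values)
    errors = {}
    for k in candidates:
        abs_err = 0.0
        for i in range(n):
            # Hold out compound i and predict it from all the others.
            others = [j for j in range(n) if j != i]
            pred = knn_predict([sim[i][j] for j in others],
                               [values[j] for j in others], k)
            abs_err += abs(pred - values[i])
        errors[k] = abs_err / n
    return min(errors, key=errors.get)


# Toy data (invented): two pairs of mutually similar compounds.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
values = [1.0, 1.2, 5.0, 5.2]

print(best_k(sim, values, candidates=[1, 2, 3]))  # 1 for this toy data
```

In this toy set a single neighbour gives the lowest error, because larger neighbourhoods pull in compounds from the dissimilar pair. No descriptor selection is involved, and each prediction can be traced back to the specific neighbours that produced it.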

The research questions which guide the development of this work are:

• Can the prediction of chemical, physical and biological properties be improved in terms of prediction error, robustness and comprehensibility using a new descriptor selection technique coupled with a model-based machine learning algorithm?

• Can quantification of structural similarity be improved using an algorithm that is based on atom matching?

• Can chemical, physical and biological properties be predicted by an instance-based machine learning approach using as input a metric space constructed based on structural similarity?

• Does an instance-based machine learning approach that uses structural similarity to construct a metric for predicting chemical, physical and biological properties have advantages in terms of predictive performance, robustness and comprehensibility over a model-based machine learning approach?

• Is it possible to improve the predictive performance and comprehensibility of the method using smaller local neighbourhoods?