1.1 Context: Data Explosion in Chemistry

Chemoinformatics - A new name for an old problem?

∼ M. Hann and R. Green (1999)

Large-scale research projects are becoming part of chemistry research in more and more laboratories, producing an ever-increasing amount of data and information. The volume of chemical information keeps growing exponentially owing to constantly refined and optimized experimental technologies (Bachrach, 2009; Chen, 2006).

According to the Chemical Abstracts Service (CAS) ¹, there are currently more than 77 million known substances, of which 10 million were added in less than one year (Figure 1.1). The database is updated daily, and approximately 12,000 new substances are added each day. In comparison, CAS needed 33 years to register its first 10 million substances, a milestone reached in 1990, which indicates the accelerating pace of chemical knowledge (Figure 1.1) ².

¹ Chemical Abstracts Service: http://www.cas.org/, accessed in December 2013

² Data from "CAS Statistical Summary 1907-1997," Chemical Abstracts Service, Columbus, Ohio: http://www.shinwon.co.kr/cas/ASSETS/casstats.pdf, accessed in December 2013

Figure 1.1: Graphical representation of the annual evolution in the number of unique organic and inorganic substances recorded in the Chemical Abstracts Service Registry System ¹,² between 1965 and 2013.

Thus, it was realized that the amount and complexity of the information accumulated by chemists can only be managed by exploring it with computer technologies (Bajorath, 2004; Gasteiger, 2003). This problem led to a new field of expertise at the intersection of chemistry and computer science, with emphasis on the acquisition, manipulation, organization, analysis and dissemination of chemical data and information (Bajorath, 2004; Chen, 2006). This field clearly spans a very large (and still to be defined) range of problems and approaches, and it no longer implies, as it did in the beginning, a necessary relation to drug discovery. This new interdisciplinary area was named "Chemoinformatics" by Brown & James (1998). In that article, chemoinformatics is defined, with a strong focus on the drug discovery process, as follows:

"The use of information technology and management has become a critical part of the drug discovery process. Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and organization".

As with the definition of chemoinformatics, the name of the area is not universally agreed upon (Bajorath & Warr, 2011). In the literature, several terms are used synonymously with chemoinformatics: cheminformatics, chemi-informatics, chemical informatics, chemometrics, computational chemistry, etc. (Bajorath, 2004).

A Google search in December 2013 retrieved ∼467,000 hits for the term “cheminformatics”, whereas “chemoinformatics” had ∼261,000 hits. These results indicate that cheminformatics is the more commonly used term, and it is therefore used throughout this document.

Cheminformatics has practical applications in many different areas, including pesticide, drug and material design, environmental protection and food safety.

1.2 Motivation

Chemistry is (almost) everywhere and in everything.

∼ A. Shani (2004)

"When you hold this document you are holding molecules. When you drink coffee you are ingesting molecules, as you sit in a room you are bombarded by a continuous storm of molecules. When you appreciate the colour of an orchid and the textures of a landscape you are admiring molecules. When you savour food and drink you are enjoying molecules. When you sense decay you are smelling molecules. You are clothed in molecules, you eat molecules, and you excrete molecules. In fact, you are made of molecules" (Atkins, 2003). In other words, molecules are (almost) everywhere and in everything and, as mentioned above, the number of molecules discovered each day continues to grow at an exponential rate due to constantly refined and optimized experimental technologies (Bachrach, 2009).

However, the experimental determination of the chemical, physical and biological properties of compounds (from this point on simply referred to as properties) is often expensive, time-consuming and in many cases impossible. According to George Hammond in the 1968 Norris Award Lecture, "the most fundamental and lasting objective of synthesis is not production of new compounds, but production of properties". There is thus a great need to organize high-quality experimental data and make it available to the scientific community, and to foster the application of property prediction methods with good predictive performance when experimental values are not available, which is essential to many industries and technologies. One of the most promising areas in cheminformatics is the development of methods aimed at predicting these properties from the structure of the molecules. Unlike quantum chemistry or molecular simulation, which are designed to model physical reality, cheminformatics is intended simply to produce useful models that can predict properties of compounds given their structure. These methods are usually known as Quantitative Structure-Property/Activity Relationships (QSPR/QSAR). During the last twenty years, QSPR/QSAR have been applied to a wide range of problems, gaining extensive recognition in physical, organic, analytical, pharmaceutical and medicinal chemistry, biochemistry, chemical engineering and technology, toxicology, and environmental sciences (Micheli, 2003). Examples of the wide range of predicted properties include melting and boiling temperature, molar heat capacity, vapor pressure, solubility, viscosity and partition coefficients, standard enthalpy of formation, refractive index, density, solvation free energy, receptor binding affinities, pharmacological activities, and enzyme inhibition constants (Micheli, 2003).
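As a minimal sketch of the QSPR idea described above, and not the method developed in this work, the following Python fragment fits a one-descriptor linear model that estimates the normal boiling point of linear alkanes from their number of carbon atoms. The choice of descriptor, the names `fit_line` and `predict`, and the small training set (approximate literature boiling points for n-butane to n-octane) are assumptions made purely for illustration.

```python
# Toy QSPR sketch: one structural descriptor (carbon count) -> one property
# (normal boiling point). Illustrative only; the names and the data set are
# assumptions, not taken from the original text.

# Approximate normal boiling points (deg C) of linear alkanes C4..C8.
TRAINING_SET = {
    4: -0.5,    # n-butane
    5: 36.1,    # n-pentane
    6: 68.7,    # n-hexane
    7: 98.4,    # n-heptane
    8: 125.7,   # n-octane
}


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(n_carbons, slope, intercept):
    """Estimate the property of a compound from its descriptor value."""
    return slope * n_carbons + intercept


# Fit the structure-property relationship on the training compounds.
SLOPE, INTERCEPT = fit_line(list(TRAINING_SET.keys()), list(TRAINING_SET.values()))

if __name__ == "__main__":
    for n in (6, 9):
        estimate = predict(n, SLOPE, INTERCEPT)
        print(f"C{n}: estimated boiling point {estimate:.1f} deg C")
```

The estimate for nine carbon atoms is an extrapolation beyond the training range and should be read with more caution than the interpolated value for six. Practical QSPR/QSAR models typically combine many descriptors with curated data and are validated on compounds that were not used for fitting.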

It is important to understand that QSPR/QSAR models will not replace experimental measurements; however, they offer multiple advantages with an enormous scientific, humanitarian and economic impact: (1) innovation, since manually analysing such a huge amount of chemical data is obviously impossible, and computational (in silico) methods thus represent a privileged way to explore, discover and design promising compounds with desired properties; (2) prioritization of needs, by selecting the most promising untested and sometimes not yet available compounds; (3) reduction of the time needed for experiments, as the models serve as a filter that reduces the number of compounds that need to be tested; (4) even in a hypothetical attempt to study all properties of all compounds experimentally, the existing laboratories and human resources would not be sufficient to deal with this quantity of chemical data; (5) reduction of costs, by reducing the number of measurements, since experimental measurements are in most cases very expensive; (6) reduction of the number of animals needed for in vivo experiments, which is ethically and economically very important; (7) the development of new chemicals is often centred on the target properties of the candidate product, but critical issues such as toxicity, industrial safety and environmental health should also be evaluated from the beginning, which reduces the economic resources spent on developing chemicals without knowledge of their toxicological and environmental properties. As an example, various

In document Machine learning methods for quantitative structure-property relationship modeling (Page 39-43)