MOLECULAR REPRESENTATIONS AND DESCRIPTORS

4.2 MOLECULAR REPRESENTATIONS AND DESCRIPTORS

The performance of data mining approaches depends not only on the method itself but also on the chosen molecular representations. Often, combinations of numerical chemical descriptors are used to represent a molecule as a vector of descriptor values in descriptor space. Typically, descriptor combinations capture only a part of the chemical information content of a molecule and, although seemingly a triviality, data mining algorithms can only exploit this information. If it is too limited, data mining will fail. Thus, the choice of molecular representations is a major determinant of the outcome of data mining, regardless of the algorithms that are used. Many different types of descriptors [15] and molecular representations [16] have been introduced. A reason for the continued interest in deriving novel molecular representations might be that conflicting tasks often influence chemical similarity analysis: one aims at the identification of molecules that are similar in activity to known reference compounds, but these molecules should then be as structurally diverse as possible. So, it is desirable for representations to focus on attributes relevant for activity rather than on structural resemblance.

Representations can roughly be separated into three types. One-dimensional (1-D) representations include the chemical composition formula, the simplest molecular view, but also more complex representations such as linear notations, including the pioneering SMILES language [17,18] and InChI [19]. SMILES and InChI capture the structure of a molecule in a unique way and are well suited for database searching and compound retrieval. Although not specifically designed for similarity searching, SMILES representations have been used for database mining by building feature vectors from substrings [20–22]. Molecular 2-D representations include connection tables, graph representations, and reduced graphs [23]. Molecular graphs are often employed as queries in similarity searching using algorithms from graph theory for the detection of common substructures. Typically, those algorithms are time consuming, which limits their applicability for screening large databases. 3-D representations include, for example, molecular surfaces or volumes calculated from molecular conformations. If these representations are to be related to biological activity, then binding conformations of test compounds must be known. However, for large compound databases, conformations must usually be predicted, which introduces uncertainties in the use of such representations for compound activity-oriented applications. Pharmacophore models are 3-D representations that reduce molecules to spatial arrangements of atoms, groups, or functions that render them active and are among the most popular tools for 3-D database searching.
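
As an illustration of building feature vectors from SMILES substrings, the following minimal sketch counts overlapping character n-grams. This is only one simple way to define substrings and is an assumption made here for illustration; the cited approaches [20–22] may define substrings differently. The function names and the two example SMILES strings (ethanol and acetic acid) are chosen for this sketch only.

```python
from collections import Counter
from typing import List


def smiles_ngrams(smiles: str, n: int = 2) -> Counter:
    """Count all overlapping character substrings of length n in a SMILES string."""
    return Counter(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def to_vector(counts: Counter, vocabulary: List[str]) -> List[int]:
    """Map substring counts onto a fixed vocabulary order to obtain a feature vector."""
    return [counts.get(substring, 0) for substring in vocabulary]


# Illustrative inputs: ethanol and acetic acid.
ethanol = smiles_ngrams("CCO")
acetic_acid = smiles_ngrams("CC(=O)O")

# A shared vocabulary makes the vectors of both molecules comparable.
vocabulary = sorted(set(ethanol) | set(acetic_acid))
ethanol_vector = to_vector(ethanol, vocabulary)
acetic_acid_vector = to_vector(acetic_acid, vocabulary)
```

Note that plain character n-grams split two-letter atom symbols such as Cl or Br; a tokenizer that is aware of SMILES atom symbols would avoid this.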

Combinations of calculated molecular descriptors are also used to represent molecules and/or to position them in chemical space. Descriptors are in general best understood as numerical mathematical models designed to capture different chemical properties [15]. In many cases, descriptors calculate chemical properties that can be experimentally measured, such as dipole moments, molecular refractivity, or log P(o/w), the octanol/water partition coefficient, a measure of hydrophobicity. Descriptors are often organized according to the dimensionality-dependent classification scheme discussed above for molecular representations. Thus, dependent on the dimensionality of the molecular representation from which they are calculated, we distinguish 1-D, 2-D, and 3-D descriptors. 1-D descriptors are constitutional descriptors requiring little or no knowledge about the structure of a molecule, such as molecular mass or atom type counts. 2-D descriptors are based upon the graph representation of a molecule. Large numbers of descriptors are calculated from the 2-D structure of chemical compounds. For example, topological descriptors describe properties such as connectivity patterns, molecular complexity (e.g., degree of branching), or approximate shape. Other 2-D descriptors are designed to approximate 3-D molecular features like van der Waals volume or surface area using only the connectivity table of a molecule as input.
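
The 1-D constitutional descriptors mentioned above (atom type counts and molecular mass) can be derived from a composition formula alone. The sketch below assumes a simple formula without parentheses, hydrates, or charges; the function names and the small mass table are chosen for this illustration and cover only four elements.

```python
import re
from collections import Counter

# Standard atomic masses (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def atom_counts(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C9H8O4'."""
    counts = Counter()
    # Each match is an element symbol followed by an optional integer count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def molecular_mass(formula: str) -> float:
    """Sum the atomic masses weighted by the atom counts."""
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())


# Aspirin, C9H8O4: two constitutional descriptors from the formula alone.
aspirin_counts = atom_counts("C9H8O4")
aspirin_mass = molecular_mass("C9H8O4")  # approximately 180.16 g/mol
```

No structural information enters this calculation, which is why isomers with the same composition formula receive identical 1-D descriptor values.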

3-D descriptors and representations both require knowledge about molecular conformations and geometrical properties of the molecules. Both 2-D and 3-D descriptors vary greatly in their complexity. For example, complex molecular descriptors have been designed to combine multiple descriptor contributions related to biological activity [12] or to model surface properties such as the distribution of partial charges on the surface of a molecule [24]. In the following, we will describe graph representations and fingerprint descriptors in more detail.

4.2.1 Graph Representations

In canonical molecular graph representations, nodes represent atoms and edges represent bonds. The use of graph-based algorithms has a long tradition in chemical database searching [25]. Substructure identification in molecular graphs requires solving the subgraph isomorphism problem, which is computationally hard (NP-complete) and for which no efficient general algorithm is known [25]. A special case of compound similarity evaluation on the basis of graph-based representations is to consider the maximum common subgraph (MCS) [26,27], i.e., the largest common substructure. MCS comparison retains most of the structural information of a molecule and consequently detects distinctly similar compounds in a database search. Reduced graphs [23,28] or feature trees [29] combine characteristic chemical features like aromatic rings or functional groups into single nodes and thereby abstract from the 2-D structure. This simplifies graph-based comparisons and increases computational efficiency as well as the potential for scaffold hopping [30], i.e., the identification of compounds having similar activity but diverse structures.
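
To make the graph view concrete, the following minimal sketch stores a molecule as a labeled graph (atoms as nodes, bonds as edges) and tests for a substructure by naive backtracking. It is an illustration under simplifying assumptions, not one of the cited algorithms: hydrogens are omitted, bond orders are ignored, and the worst-case running time grows exponentially with molecule size, which reflects the hardness discussed above. The class and function names are chosen for this sketch.

```python
from typing import Dict, List, Set, Tuple


class MolGraph:
    """Heavy-atom molecular graph: atom labels plus an adjacency set per atom."""

    def __init__(self, atoms: List[str], bonds: List[Tuple[int, int]]):
        self.atoms = atoms
        self.adj: Dict[int, Set[int]] = {i: set() for i in range(len(atoms))}
        for a, b in bonds:
            self.adj[a].add(b)
            self.adj[b].add(a)


def has_substructure(query: MolGraph, target: MolGraph) -> bool:
    """Return True if every query atom and bond can be mapped into the target."""
    mapping: Dict[int, int] = {}  # query atom index -> target atom index

    def extend(q: int) -> bool:
        if q == len(query.atoms):
            return True  # all query atoms have been placed
        for t in range(len(target.atoms)):
            if t in mapping.values() or query.atoms[q] != target.atoms[t]:
                continue
            # Each already mapped neighbor of q must be bonded to t in the target.
            if all(mapping[n] in target.adj[t] for n in query.adj[q] if n in mapping):
                mapping[q] = t
                if extend(q + 1):
                    return True
                del mapping[q]  # backtrack and try the next candidate
        return False

    return extend(0)


# Illustrative molecules (heavy atoms only).
ethanol = MolGraph(["C", "C", "O"], [(0, 1), (1, 2)])
acetic_acid = MolGraph(["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
carboxyl_like = MolGraph(["O", "C", "O"], [(0, 1), (1, 2)])  # O-C-O pattern

in_acetic_acid = has_substructure(carboxyl_like, acetic_acid)
in_ethanol = has_substructure(carboxyl_like, ethanol)
```

Practical substructure search tools add bond types, ring and aromaticity information, and candidate ordering heuristics on top of this basic scheme.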

4.2.2 Fingerprints

Fingerprints are special kinds of descriptors that characterize a molecule and its properties as a binary bit vector. Since many fingerprints have unique designs and are used for similarity searching in combination with selected similarity metrics, they are often also regarded as search methods. In structural fingerprints, each bit represents a specific substructural feature, like an aromatic ring or a functional group of a molecule, and the bit setting accounts either for its presence (i.e., the bit is set on) or absence (off). Fixed-size bit string representations, where each bit encodes the presence or absence of a predefined structural feature, simplify substructure searching and circumvent the computational complexity associated with the use of graph isomorphism algorithms. Once fingerprints for all compounds in a database have been computed, quantitative fingerprint overlap between query and database compounds is calculated as a measure of molecular similarity. The set of 166 MDL structural keys (MACCS) [31,32] represents a widely used prototype of a fragment-based fingerprint. An early search strategy has been to use fragment-based fingerprints in a fast prescreening step to eliminate large numbers of database compounds lacking encoded fragments present in a query, followed by a subgraph isomorphism search on the remaining molecules [25].

In recent years, increasingly sophisticated fingerprint designs have been introduced that enable database searching beyond prescreening or fragment matching including, for example, pharmacophore fingerprints [33]. These types of fingerprints systematically account for 2-D or 3-D patterns of two to four features such as hydrogen bond donor or acceptor functions, hydrophobic or aromatic moieties, or charged groups, and pairwise distance ranges separating them. For 3-D pharmacophore fingerprinting, test molecules are subjected to systematic conformational search and matches of fingerprint-encoded pharmacophore patterns are monitored. In 2-D pharmacophore fingerprints, bond distances replace spatial distances between feature points, and atom types are often used instead of pharmacophore functions, which is reminiscent of atom pair type descriptors [34]. Due to the combinatorial nature of pharmacophore patterns, 3-D pharmacophore fingerprints in particular can be exceedingly large and often consist of millions of bit positions.

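
The fingerprint overlap and the fragment-based prescreening step described above can be sketched with Python integers used as bit strings. The Tanimoto coefficient is used here as one commonly applied overlap measure; the text above does not prescribe a particular metric, so this choice is an assumption. The bit patterns are hypothetical and do not correspond to real MACCS keys.

```python
def popcount(bits: int) -> int:
    """Number of bits set to 1."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    union = popcount(a | b)
    # Two empty fingerprints are assigned 0.0 here by convention.
    return popcount(a & b) / union if union else 0.0


def passes_prescreen(query: int, candidate: int) -> bool:
    """A candidate can contain the query only if it sets every bit the query sets."""
    return (query & candidate) == query


# Hypothetical 6-bit structural fingerprints.
query = 0b101101
candidate_a = 0b100111  # lacks one query bit, so it is eliminated
candidate_b = 0b111101  # sets all query bits, so it goes on to graph matching

similarity = tanimoto(query, candidate_a)  # 3 shared bits / 5 bits in the union
survivors = [c for c in (candidate_a, candidate_b) if passes_prescreen(query, c)]
```

The prescreen only rules candidates out; a surviving candidate sets all required bits but must still be confirmed by a subgraph isomorphism search.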
Other types of 2-D fingerprints systematically account for connectivity pathways through molecules up to a predefined length. This fingerprint design was pioneered by Daylight [35]. The Daylight fingerprints employ hashing and folding techniques to map the large number of possible pathways to a small number of bits. Furthermore, atom environment fingerprints developed by Glen and coworkers [36] encode the local environment of each atom in a molecule as strings and assemble characteristic strings. Here, collections of strings represent the molecular fingerprint, which departs from the classical fixed-length design. Similarly, extended connectivity fingerprints (ECFPs) [37,38] also capture local atom environments. MOLPRINT codes each individual atom environment (either in 2-D or 3-D) up to a certain bond distance range as a fingerprint bit and has been implemented together with Bayesian modeling using multiple template compounds for similarity searching [39,40].
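
The general idea of path enumeration, hashing, and folding can be sketched as follows. This is not the Daylight algorithm: the path encoding (atom labels only, no bond types), the use of CRC32 as hash function, and all function names are assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Set


def enumerate_paths(atoms: List[str], adj: Dict[int, List[int]], max_atoms: int) -> Set[str]:
    """Collect all linear paths of up to max_atoms atoms as label strings."""
    paths: Set[str] = set()

    def walk(path: List[int]) -> None:
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same fragment; keep one form.
        paths.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) < max_atoms:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # do not revisit atoms
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths


def hashed_fingerprint(paths: Set[str], n_bits: int) -> int:
    """Hash each path to a bit position; different paths may collide on one bit."""
    fp = 0
    for p in paths:
        fp |= 1 << (zlib.crc32(p.encode("ascii")) % n_bits)
    return fp


def fold(fp: int, n_bits: int) -> int:
    """Halve the fingerprint length by OR-ing the upper half onto the lower half."""
    half = n_bits // 2
    return (fp & ((1 << half) - 1)) | (fp >> half)


# Ethanol (heavy atoms only): C-C-O.
atoms = ["C", "C", "O"]
adj = {0: [1], 1: [0, 2], 2: [1]}
paths = enumerate_paths(atoms, adj, max_atoms=3)
fp64 = hashed_fingerprint(paths, 64)
fp32 = fold(fp64, 64)
```

Because several paths can map to the same bit, a set bit no longer identifies a unique substructure. Such fingerprints therefore remain usable for prescreening and similarity calculations, but individual bits cannot be interpreted as structural keys.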

Encoding of numerical property descriptors in a fingerprint format is also possible. For example, the MP-MFP fingerprint [41] assigns 61 property descriptors to individual bits by partitioning their ranges at the median of a compound database (i.e., through binary transformation). Moreover, through equifrequent binning of database descriptor value distributions, the PDR-FP fingerprint encodes a set of 93 molecular property descriptors using only 500 bit positions [42].
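
The two encoding ideas named above, binary transformation at the database median and equifrequent binning, can be sketched as follows. This is a simplified illustration and not the published MP-MFP or PDR-FP implementation: the rule that a bit is set when a value exceeds the median, the one-bit-per-bin encoding, the function names, and all descriptor values are assumptions made for this sketch.

```python
from bisect import bisect_right
from statistics import median
from typing import List, Sequence


def median_bits(database: Sequence[Sequence[float]], molecule: Sequence[float]) -> List[int]:
    """One bit per descriptor: 1 if the molecule's value exceeds the database median."""
    medians = [median(column) for column in zip(*database)]
    return [int(value > m) for value, m in zip(molecule, medians)]


def equifrequent_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Bin boundaries chosen so that each bin holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def bin_bits(value: float, edges: Sequence[float]) -> List[int]:
    """Encode a value by setting the single bit of the bin it falls into."""
    bits = [0] * (len(edges) + 1)
    bits[bisect_right(edges, value)] = 1
    return bits


# Hypothetical database: one row per compound, two descriptor columns.
database = [[100.0, 1.0], [200.0, 2.0], [300.0, 3.0], [400.0, 4.0]]
binary_fp = median_bits(database, [260.0, 2.0])

# Hypothetical descriptor distribution split into four equally populated bins.
edges = equifrequent_edges([1, 2, 3, 4, 5, 6, 7, 8], n_bins=4)
binned_fp = bin_bits(4, edges)
```

Equifrequent bins adapt to the value distribution in the database: densely populated value ranges are resolved by narrow bins, sparsely populated ranges by wide ones.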

In document Pharmaceutical Data Mining (Page 135-139)