3D Structure Generation and Conformational Searching
2. PROBLEM DESCRIPTION
2.1. Computational Requirements
The main area of automatic structure generation is the 2D-to-3D conversion of large databases of druglike organic compounds. These databases often contain millions of structures, imposing some restrictions on the development of 3D structure generators.
The decision to use a specific conversion program plays a crucial role because a change to another program can only be made with great difficulty. Firstly, the amount of computer resources for the conversion of hundreds of thousands of structures is quite large, and, secondly, much scientific work will be based on such a database and, thus, a change of these data makes a lot of the work already performed questionable or obsolete. Therefore, the choice to use a particular 3D structure generation program should be made only after a careful evaluation process. On the other hand, the task of generating 3D structures from connectivity information (the constitution of a molecule) is just too important and the problems to be solved are so diverse that the field should always remain open to new ideas and approaches. 3D database developers at Molecular Design Ltd. formulated the following criteria for a 2D-to-3D conversion program [9], and we will cover only those published approaches that fulfill more or less all of these criteria (the quotes are slightly abbreviated and modified):
Robustness. The program should run for a long time before failing and should indicate the actions taken on failure rather than simply crash.
Large files. The program should be able to handle large numbers of structures contained in a single file to minimize the number of conversion jobs.
Variety of chemical types. The program should be able to handle a wide variety of structural types.
Stereochemistry. The stereochemical information contained in the input data must be handled correctly.
Rapid and automated. The large size of the databases to be processed requires the conversion program to run in batch mode and to work with acceptable speed.
High-quality models. The generated models should be of high quality without further energy minimization and should represent at least one low-energy conformation. The program should have internal diagnostics to validate the models generated.
High conversion rate. As many 2D structures as possible should be converted.
For conformer generators, some specific additional criteria have to be defined:
Coverage. The ensemble of conformations must include all relevant conformations and the method should be able to reproduce biologically active conformations.
Diversity. Because it is impossible to generate all conformations of reasonably large molecules at infinite resolution, the subset chosen from the whole conformational space has to be reasonably diverse.
Compactness. Given the size of today’s databases and the requirement to store hundreds of conformers per molecule, compactness of storage becomes an issue with respect to both file size and retrieval efficiency.
2.2. General Problems
Each approach to automatic generation of 3D molecular models has to solve a number of general problems. The strategy for building a molecular model can be compared with the use of a mechanical molecular model building kit. Monocentric fragments that represent different hybridization states and provide the corresponding bond angles are connected using joints with a length corresponding to the required bond lengths. A basic assumption in this process of 3D structure generation is an allowed transfer of bond lengths and bond angles from one molecular environment to another (i.e., the usage of standard values for bond lengths and bond angles).
However, this assumption requires distinguishing between a sufficiently large number of different atom types, hybridization states, and bond types, each with appropriate bond lengths and bond angles. Usually, the deviations from these standard values are rather small. A totally different situation is encountered for dihedral or torsional angles, which describe the twisting of a fragment of four atoms connected by a sequence of bonds. Because the steric energy may have multiple minima of similar energy around a rotatable bond, there is more than one possibility for constructing a 3D molecular model for such molecules or, in other terms, there are multiple conformations.
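The model-kit idea can be illustrated with a minimal sketch that places a fourth atom from a bond length, a bond angle, and a torsional angle. The helper names (`place_atom`, `dihedral`, `butane_skeleton`) are hypothetical, and the values 1.54 Å and 109.5° are only illustrative standard values for an sp3 carbon chain, not parameters taken from any of the programs discussed here. Changing only the torsional angle produces different conformations of the same constitution.

```python
import math

def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(x / n for x in u)

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Place atom d bonded to c, given the already placed chain a-b-c.

    bond is the c-d distance, angle_deg the angle b-c-d,
    torsion_deg the dihedral a-b-c-d.
    """
    theta = math.radians(angle_deg)
    phi = math.radians(torsion_deg)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal of the a-b-c plane
    m = cross(n, bc)                 # completes the local frame
    x = -bond * math.cos(theta)
    y = bond * math.sin(theta) * math.cos(phi)
    z = bond * math.sin(theta) * math.sin(phi)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def dihedral(a, b, c, d):
    """Signed torsional angle a-b-c-d in degrees."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = dot(b2, cross(n1, n2))
    x = math.sqrt(dot(b2, b2)) * dot(n1, n2)
    return math.degrees(math.atan2(y, x))

BOND, ANGLE = 1.54, 109.5  # illustrative standard values (assumption)

def butane_skeleton(torsion_deg):
    """Carbon skeleton of n-butane built from standard values only."""
    t = math.radians(ANGLE)
    c1 = (0.0, 0.0, 0.0)
    c2 = (BOND, 0.0, 0.0)
    c3 = (BOND - BOND * math.cos(t), BOND * math.sin(t), 0.0)
    c4 = place_atom(c1, c2, c3, BOND, ANGLE, torsion_deg)
    return [c1, c2, c3, c4]

for name, torsion in [("anti", 180.0), ("gauche", 60.0)]:
    atoms = butane_skeleton(torsion)
    print(name, round(math.dist(atoms[0], atoms[3]), 2),
          round(dihedral(*atoms), 1))
```

Bond lengths and angles are identical in both models; only the C1...C4 distance, and hence the nonbonded interactions, differ with the torsional angle.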
In acyclical molecules or substructures, the preferred torsional angles are those which simultaneously minimize torsional strain and the steric interactions between nonbonded atoms. The relatively large flexibility of such systems gives rise to multiple solutions (conformations) for the process of structure generation, which have quite similar energy. This flexibility has to be taken into account, and geometrically unacceptable situations [e.g., the overlap of atoms (‘‘clashes’’)] must strictly be avoided. With increasing numbers of possible conformations, it becomes less and less likely that the generated 3D structure corresponds to the experimentally determined geometry.
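A clash test of the kind mentioned above can be sketched as a simple distance check. The function name `find_clashes` and the scaling factor of 0.7 are assumptions made for illustration; the radii are Bondi van der Waals radii. Directly bonded atoms and atoms sharing a common neighbor (1,2 and 1,3 pairs) are skipped because their distances are fixed by bond lengths and bond angles.

```python
import math
from itertools import combinations

# Bondi van der Waals radii in angstrom
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}

def find_clashes(elements, coords, bonds, scale=0.7):
    """Return index pairs of nonbonded atoms that overlap.

    elements: list of element symbols
    coords:   list of (x, y, z) tuples in angstrom
    bonds:    iterable of (i, j) index pairs
    scale:    hypothetical tolerance; a pair clashes if its distance is
              below scale * (r_i + r_j)
    """
    neighbors = {i: set() for i in range(len(elements))}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)
    clashes = []
    for i, j in combinations(range(len(elements)), 2):
        if j in neighbors[i] or neighbors[i] & neighbors[j]:
            continue  # 1,2 and 1,3 pairs are not tested
        limit = scale * (VDW_RADII[elements[i]] + VDW_RADII[elements[j]])
        if math.dist(coords[i], coords[j]) < limit:
            clashes.append((i, j))
    return clashes

# Four-carbon chain whose last atom folds back onto the first one
elements = ["C", "C", "C", "C"]
bonds = [(0, 1), (1, 2), (2, 3)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.2, 1.5, 0.0)]
print(find_clashes(elements, folded, bonds))
```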
In cyclical structures, ring closure has to be taken into account as an additional geometrical constraint of the 3D structure generation process. Ring closure dramatically reduces the degrees of freedom, which is expressed in a reduction in the number of possible conformations compared to those in acyclical systems. In particular, the endocyclical torsional angles are mutually dependent. Due to this fact, many of the 3D structure generators use information on possible single-ring conformations. These conformations can be stored as 3D coordinate fragments or simply as lists of torsional angles. These so-called ring templates implicitly fulfill the condition of ring closure.
Additional levels of sophistication are reached when the rings have exocyclical substituents, or when they are assembled in fused or bridged ring systems. Another challenge arises with increasing ring size. Large rings are, apart from the ring closure requirement, as flexible as acyclical systems. Fig. 1 shows the increase in the number of known conformations of cycloalkanes as a function of ring size.
The conformational flexibility and thus the number of valid 3D molecular models steeply increase from ring size 9 upward. An explicit use of potential ring conformations becomes more and more infeasible. Some of the programs discussed below therefore refrain from generating 3D structures for macrocyclical and poly-macrocyclical structures such as the tripoly-macrocyclical system in Fig. 2.
Figure 1 Increase of the number of known conformations of cycloalkanes with increasing size.
Due to the specific complications when predicting the geometry of ring systems, many of the approaches to 3D structure generation dedicate most of the program intelligence to this part. Most often, the molecule under consideration is fragmented into acyclical and cyclical portions at the very beginning of the 3D generation process.
The fragments are then handled separately and reassembled at the end of the whole process.
The objective of conformational searches is to locate minima on the energy surface and to generate the corresponding 3D structures. Therefore, conformational searches have to utilize energy and geometry optimization methods, and have to tackle problems inherent in these techniques. One major drawback of most optimization algorithms, as they are implemented in common quantum mechanical (QM) or force field packages, is that they can only identify adjacent minima, which lie ‘‘downhill’’ on the potential energy surface from a given 3D geometry as starting point (i.e., they are unable to overcome energy barriers to locate other minima elsewhere on the energy surface). Fig. 3 illustrates this behavior.
Figure 2 Trimacrocyclical bridged system.
Thus, conformational searches first have to generate a set of starting geometries which then can be submitted to energy and geometry optimization. This directly leads to a third problem, which can also be seen in Fig. 3. Two different starting geometries, which both are located near the same minimum, will become identical after the optimization procedure. This redundant information has to be filtered out. Each newly generated conformation has to be compared with all previously generated conformations. It has to be stored if a new conformation, whose geometry relevantly differs from all previously generated conformations, has been found; otherwise, it has to be rejected. A common metric to perceive the similarity between two conformations is the RMS deviation of the positions of their atoms. The RMS deviation can either be measured in Cartesian space (RMS_XYZ [Å]), where the 3D Cartesian coordinates of all corresponding atom pairs are compared, or in torsion angle (TA) space (RMS_TA [°]), by calculating the deviation of all corresponding torsion angles of both conformations.
As already discussed in the context of reproducing x-ray structures (see Sec. 4), two conformations can be regarded as identical if their RMS deviation is less than 0.3 Å or 15°, respectively.
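Both similarity measures can be written down in a few lines. The following is a minimal sketch with hypothetical function names; it assumes that the two conformations list their atoms (or torsional angles) in the same order and, for the Cartesian measure, that they have already been superimposed. The thresholds of 0.3 Å and 15° are the ones quoted above. Torsional differences are wrapped into the range from -180° to 180° so that 179° and -179° count as 2° apart.

```python
import math

def rms_xyz(coords_a, coords_b):
    """RMS deviation in Cartesian space (angstrom).

    Assumes identical atom order and previously superimposed structures.
    """
    total = sum((p - q) ** 2
                for a, b in zip(coords_a, coords_b)
                for p, q in zip(a, b))
    return math.sqrt(total / len(coords_a))

def rms_ta(torsions_a, torsions_b):
    """RMS deviation in torsion angle space (degrees), periodicity aware."""
    diffs = [(x - y + 180.0) % 360.0 - 180.0
             for x, y in zip(torsions_a, torsions_b)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def same_conformation(torsions_a, torsions_b, threshold_deg=15.0):
    """Duplicate test in torsion angle space using the 15 degree criterion."""
    return rms_ta(torsions_a, torsions_b) < threshold_deg

print(rms_ta([179.0, 60.0], [-179.0, 62.0]))
print(same_conformation([179.0, 60.0], [-179.0, 62.0]))
```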
Figure 3 Identification of energy minima on the conformational energy surface (symbolic).
Fig. 4 shows the general work flow of a conformational search. After generating an initial starting geometry, which is optimized in the subsequent step, the new structure is compared to all previously generated conformations (normally stored as a list of unique structures). If a substantially new geometry is found, it is added to this list of unique conformations; otherwise, it will be rejected. Then, a new starting structure has to be generated for the next iteration. This loop is continued until a certain stop criterion for the entire search procedure is reached (e.g., a given number of iterations has been performed, or no new conformations can be found).
Figure 4 General work flow of conformational search techniques.
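The loop of generating, optimizing, comparing, and storing can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the procedure of any particular program: the ‘‘molecule’’ has a single torsional angle, the energy function and its parameters `A` and `B` are hypothetical, starting geometries are drawn at random, and the optimizer is a plain steepest descent that can only run downhill to the adjacent minimum. Duplicates are filtered with the 15° criterion, and the search stops after a fixed number of iterations or after a run of iterations without a new conformation.

```python
import math
import random

A, B = 1.0, 1.5  # hypothetical torsional parameters (arbitrary energy units)

def energy(phi):
    """Toy torsional potential with one anti and two gauche minima."""
    return A * (1 + math.cos(phi)) + B * (1 + math.cos(3 * phi))

def gradient(phi):
    return -A * math.sin(phi) - 3 * B * math.sin(3 * phi)

def minimize(phi, step=0.05, tol=1e-10, max_steps=2000):
    """Steepest descent: only reaches the minimum downhill of the start."""
    for _ in range(max_steps):
        g = gradient(phi)
        if abs(g) < tol:
            break
        phi -= step * g
    return phi

def angle_diff_deg(x, y):
    """Absolute angular difference in degrees, wrapped to 0..180."""
    return abs((x - y + 180.0) % 360.0 - 180.0)

def conformational_search(max_iter=200, patience=50, tol_deg=15.0, seed=0):
    rng = random.Random(seed)
    unique = []   # list of unique conformations (torsion in degrees)
    stale = 0     # iterations since the last new conformation
    for _ in range(max_iter):
        start = rng.uniform(-math.pi, math.pi)      # new starting geometry
        phi = math.degrees(minimize(start))         # optimization
        if any(angle_diff_deg(phi, u) < tol_deg for u in unique):
            stale += 1                              # duplicate: reject
            if stale >= patience:
                break                               # stop criterion
        else:
            unique.append(phi)                      # new: store
            stale = 0
    return unique

minima = conformational_search()
print(sorted(round(m, 1) for m in minima))
```

For real molecules, the comparison step would use one of the RMS measures over all atoms or torsional angles, and the optimizer would be a force field or QM geometry optimization.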
When generating ensembles of conformations, several additional problems arise.
First, there is the coverage problem, which raises the question of whether the interesting biologically active conformations have been generated. Because it is unknown in advance which conformations will be needed later on and because time and storage restrictions forbid generating too many, the selection of a representative subset of the whole conformational space becomes an important issue.
2.3. Classification of Specific Concepts
In this chapter, a classification of the specific concepts of different approaches to 3D structure generation is undertaken and the domain covered in this article is defined.
Under the term ‘‘automatic 3D model builder,’’ programs capable of automatically predicting a 3D molecular structure directly from the 2D connectivity information and without user interaction are covered. The term ‘‘conformer generator’’ covers programs which, starting from the 2D structure or a single 3D model, generate sets of conformations. Most of the methods presented here are designed especially for small, druglike molecules. The prediction of the geometry of polymers, in particular of biopolymers, is a task of its own and not even attempted by the approaches discussed here.
2.3.1. Manual Methods
In the early days of three-dimensional thinking in organic chemistry, 3D molecular models were built by hand, using standard bond length and bond angle units from mechanical molecular model building kits. This technique, still useful today, has found its modern expression in the age of computational chemistry in the well-known interactive 3D structure building options incorporated into nearly every program package for molecular modeling. The user may construct a 3D molecular geometry interactively, positioning atoms and bonds on a 3D graphics interface using standard bond lengths and angles, or connecting predefined fragments. All these methods are summarized under manual methods because all model building steps are performed by hand, irrespective of whether this is done in real space or with computer models.
2.3.2. Automatic Methods
Distinct from these are automatic methods that directly transform 2D input information on atoms, bonds, and stereochemistry into 3D atomic coordinates. The automatic methods are classified into rule-based and data-based, fragment-based, conformational analysis, and numerical methods (Fig. 5). These classes of methods overlap to some extent, and they belong to the domain of automatic 3D structure generation to varying degrees:
Rule-based and data-based methods. Rule-based and data-based methods comprise approaches that are based on the knowledge of chemists about geometrical and energy rules and principles for constructing 3D molecular models. This knowledge was originally gained from experimental data and through theoretical investigations. It is built into 2D-to-3D conversion programs either in explicit form (e.g., rules) or in implicit form (e.g., data on allowed ring conformations).
Fragment-based methods. At the far end of rule-based and data-based methods are approaches that are based almost exclusively on structural data. These methods are covered under a separate subdivision as fragment-based methods. These methods follow the concept of constructing molecular models from fragments that are as large and as similar as possible to the molecule to be built. The fragments are taken from a library of 3D structures. Fragment-based programs make extensive use of the implicit knowledge on model building represented by databases of 3D structures. Of course, fragment-based methods also need explicit rules on the fragmentation of the input structures, on finding closest analogs in the libraries, and on combining fragments to the entire molecular model.
Conformational analysis methods. In the field of conformational analysis, the 3D model builders and the conformer generators overlap. It is impossible to develop a 3D structure prediction program that does not implicitly look at several alternative conformations before settling down with the one written into the output file. The most common methods applied to conformational analysis and searching are systematic methods, random techniques, genetic algorithms (GAs), and simulation experiments. All these methods can be utilized either to identify the global minimum structure of a molecule under consideration, or to explore conformational space to generate an ensemble of low-energy conformations. Because pure conformer generation requires some additional issues to be addressed, this topic is described in another section.
Numerical methods. Quantum mechanical calculations, molecular mechanics, and distance geometry (DG) are summarized under numerical methods because they are based on extensive numerical optimization procedures often requiring substantial computation times (QM > MM > DG). Although quantum mechanical or molecular mechanics programs need a reasonable starting geometry and are thus not genuine automatic structure generators, the distance geometry approach by Crippen and Havel [10] represents a stand-alone modeling procedure of its own because the so-called embedding procedure generates starting coordinates for further optimization. The basic principles of the distance geometry approach for 3D structure generation as well as for conformational searches will therefore be described briefly.
Figure 5 Classification of concepts.
Clearly, there is no sharp border between all of the subdivisions discussed above.
Rule-based and data-based methods use at least small fragments, such as standard bond lengths or ring templates, and fragment-based approaches of course also use rules for appropriately finding and combining the fragments. Both rule-based and fragment-based methods often make use of numerical optimization methods or of the principles of conformational analysis. However, the above classification into rule-based and data-based approaches and fragment-based approaches will be retained in the following sections on 3D structure generation for clarity. Conformational analysis methods are discussed in another section. The basic principles of numerical methods (QM and MM) are given elsewhere in this volume.
3. 3D STRUCTURE GENERATION: METHODS AND PROGRAMS
In this section, most of the currently available programs for automatic 3D structure generation will be discussed as far as they have been described in the literature. In addition, some early precursors of these methods are briefly presented due to their pioneering role in this field.
3.1. Early Precursors
3.1.1. Conformational Analysis for Six-Membered Rings in the LHASA Program
Corey and Feiner [11] assigned conformations of six-membered ring systems in a semiquantitative manner during the development of the synthesis design program, LHASA. The aim of this work was the prediction of the preferred conformations of synthetically important six-membered ring systems to evaluate the steric hindrance of different reaction sites in a molecule. In the first step, several possible geometries are assigned to the single rings (e.g., chair, half-chair, and boat) and the flexibility of these rings is evaluated (e.g., the possibility to distort them or to flip them into another conformation) using the 2D connection table and the stereochemical information.
Secondly, the exocyclical substituents of the ring atoms are labelled to be either axial or equatorial. Thirdly, the relative energy differences between several possible conformations of flexible ring systems are calculated using empirical procedures based on energy increment schemes for the single-ring conformations, for intraring interactions (e.g., monoaxial substituents, 1,2-diequatorial, or 1,3-diaxial interactions in chair conformations), and inter-ring interactions between different rings of one ring system.
Fig. 6 shows this increment scheme for intraring interactions in monoaxial, 1,2-diequatorial, and 1,3-diaxial substituted cyclohexane chair conformations. To predict destabilization energies E_D in monoaxial substituted cyclohexane chair conformations, energy increments A_R for a specific substituent, which describe the energy difference between the axial and equatorial configuration, are used. The interactions in 1,2-diequatorial or 1,3-diaxial substituted ring systems are calculated by separate increment schemes G_R and U_R, respectively. The increments for the substituents A_R, G_R, and U_R depend on the atom type, hydrogen attachment, and hybridization state of the atom directly connected to the ring.
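The bookkeeping behind such an increment scheme can be sketched as a table lookup. Everything specific in the following example is an assumption made for illustration: the numerical values are placeholders rather than the published LHASA increments, the pair contributions are taken as the plain sum of the two per-substituent increments, and all contributions are simply added. The original scheme may combine the increments differently.

```python
# Placeholder increments in kcal/mol; NOT the published LHASA values
A_R = {"CH3": 1.7, "OH": 0.9}   # axial versus equatorial
G_R = {"CH3": 0.4, "OH": 0.2}   # 1,2-diequatorial interaction
U_R = {"CH3": 1.8, "OH": 1.0}   # 1,3-diaxial interaction

def destabilization_energy(chair):
    """Estimate E_D for a substituted cyclohexane chair.

    chair maps a ring position (0..5) to (substituent, orientation),
    where orientation is 'ax' or 'eq'. One substituent per ring atom.
    """
    e_d = 0.0
    # Monoaxial terms: one A_R increment per axial substituent
    for substituent, orientation in chair.values():
        if orientation == "ax":
            e_d += A_R[substituent]
    # Pair terms: neighbors (offset 1) both equatorial use G_R,
    # 1,3-related positions (offset 2) both axial use U_R
    for i in range(6):
        for offset, wanted, table in ((1, "eq", G_R), (2, "ax", U_R)):
            j = (i + offset) % 6
            if i in chair and j in chair:
                (s1, o1), (s2, o2) = chair[i], chair[j]
                if o1 == wanted and o2 == wanted:
                    e_d += table[s1] + table[s2]  # assumed combination rule
    return e_d

diaxial = {0: ("CH3", "ax"), 2: ("CH3", "ax")}
diequatorial = {0: ("CH3", "eq"), 2: ("CH3", "eq")}
print(destabilization_energy(diaxial), destabilization_energy(diequatorial))
```

Comparing the two chair forms of a ring system in this way gives the kind of relative, semiquantitative ranking that the text describes, without any explicit 3D coordinates.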
Finally, the method is completed by using several rules to model the influence of endocyclical heteroatoms. In a series of examples, sufficient agreement was found with energies obtained by molecular mechanics and with geometries obtained by x-ray crystallography. The strength of the method was the use of symbolical logic (e.g., energy increments to calculate destabilization energies, rules to model the influence of endocyclical heteroatoms) for the geometry and energy prediction. However, the approach was limited to six-membered ring conformations and 3D structures were not generated explicitly.
3.1.2. The SCRIPT Program
Cohen et al. [12] presented the SCRIPT program in 1981. A molecule is considered as an assembly of chain and ring fragments, possessing different conformations. The conformations are handled in an abstract form as ‘‘conformational diagrams’’ containing symbolical descriptions of the torsional angles of each bond. Chain fragments are treated as sequential four-atom fragments. Several possible low-energy conformations are given for the torsional angles of such a fragment that only depend on the nature of the central bond. Ring fragments are handled as templates that are joined.
Possible conformers of rings of three to eight atoms are taken from a predefined table of templates that depend on the ring size and on the distribution of double bonds.
These conformers are stored in the form of conformational diagrams as shown in Fig. 7 for the six-membered ring. The torsional angles of the ring bonds in these diagrams are represented only by their sign (+/-) for gauche angle types or zero (0) for a cis bond.
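The reduction of ring torsional angles to signs can be sketched as follows. The function name, the 10° tolerance for treating an angle as zero, and the idealized torsional angles for the cyclohexane chair and boat are assumptions made for illustration; they are not taken from the SCRIPT template table.

```python
def conformational_diagram(torsions_deg, flat_tol=10.0):
    """Reduce endocyclic torsional angles to a string of +, -, and 0.

    flat_tol is a hypothetical tolerance below which a torsional angle
    is treated as a cis (0) bond.
    """
    symbols = []
    for t in torsions_deg:
        if abs(t) < flat_tol:
            symbols.append("0")
        else:
            symbols.append("+" if t > 0 else "-")
    return "".join(symbols)

# Idealized, approximate endocyclic torsional angles of cyclohexane
CHAIR = [55.0, -55.0, 55.0, -55.0, 55.0, -55.0]
BOAT = [0.0, 60.0, -60.0, 0.0, 60.0, -60.0]

print(conformational_diagram(CHAIR))
print(conformational_diagram(BOAT))
```

Two rings with slightly different torsional angles but the same sign pattern map to the same diagram, which is what makes the symbolic level compact enough to enumerate.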
For ring fragments consisting of more than one ring, being either fused or bridged, a set of rules that restricts the allowed conformations of two adjacent rings is used. These rules consist of allowed combinations of torsional angles of the bond of fusion in the two regarded rings that depend on the stereochemistry of the bridgehead atoms.
Figure 6 Incremental calculation scheme to predict destabilization energies E_D in monoaxial substituted (a), 1,2-diequatorial substituted (b), and 1,3-diaxial substituted (c) cyclohexane chair conformations in the LHASA program.
In a first step, the possible conformations are generated on a symbolical level of conformational diagrams. The combinatorial product of all conformational diagrams for rings and chains forms the conformational space of the molecule. In a second step, a set of rules and computational schemes allows the direct translation of the conformational diagrams into 3D atomic coordinates by using standard bond lengths, bond angles, and torsional angles calculated from the symbolical descriptions in the diagrams. This is achieved by computational schemes based on ring sizes. The 3D