Evaluation - Optimisation Methods - Bond Order and Formal Charge Assignment

I. SpinningTop

9. Bond Order and Formal Charge Assignment

9.4. Optimisation Methods

9.4.5. Evaluation

The performance of each algorithm was assessed in terms of both speed, and the ability to correctly assign bond orders and formal charges. The latter requires reference data, i.e. a set of molecules for which both properties are already known. The reference data must also represented aromatic bonds in kekulised form, i.e. alternating single and double bonds, and explict hydrogens must be present. To match these requirements, three molecular databases were chosen to provide validation sets, the MMFF94 validation suite,2the KEGG Drug Database,56 and ZINC15.57–59Both the MMFF94 validation suite and KEGG Drug Database were used as validation sets for the BALL method.55

The MMFF94 validation suite contains 761 structures for small molecules and ions, with bond orders, formal charges and missing hydrogen atoms manually added by the authors.2_For-

mal charges and bond orders are available in either hypervalent or dative representation. The hypervalent representation has been chosen here. All molecules in the suite were parsed by OpenBabel41and removed if any errors or warnings were issued. Additionally, if they contained three or fewer atoms, elements not contained in the LUTf c lookup table (section 9.3.4), or an

odd number of valence electrons they were also removed. This reduced the suite to 700 unique molecules for testing purposes. The set of molecules can be found in table G.1.

The KEGG Drug Database contains a large number of drug like molecules.56Structure files contain only two dimensional coordinate information, meaning that they are a perfect test set for topology only bond order and formal charge assignment. Hydrogen atoms are not explicitly present in the coordinate files, but atom typing means that they can be easily added using standard rules by OpenBabel.41Furthermore, some of the structure files contain multiple molecules, and each molecule may appear in more than one file, or also appear in the MMFF94 validation suite. To prevent skewing the analysis through inclusion of identical‡molecules, each file was split into connected components, and each component checked against all others, including those of the MMFF94 set, for uniqueness. Again, molecules with three or fewer atoms, containing elements not in LUTf c, or with an odd number of valence electrons were also ignored. This

preparation gave 5955 unique molecules, which are listed in table G.2.

ZINC15 is a free online database containing over 100 million commercially available com- pounds.57–59Molecules are available as SMILES strings, meaning that they do not have explicit hydrogen atoms which must again be added in using the standard rules employed in OpenBabel. With such a large database, it is infeasible to check the algorithms against all valid and unique molecules. Rather, 11,000 molecules were randomly chosen and passed through the same ﬁl- tering processes as the MMFF94 validation suite and KEGG Drug Database. This resulted in

‡_{stereo isomerisation is ignored when comparing molecules}

9.4. Optimisation Methods

10844 unique molecules for validation which are listed in table G.3.

In section 9.4.5.1 the four algorithms developed here are compared with one another. Using the FPT algorithm, section 9.4.5.2 compares the results obtained using an exact algorithm with the reference bond order and formal charge assignments of the three molecular databases. This section also compares the results of different computation levels to obtain the bond and atom energies, and the accuracy of the exact algorithm compared with that of the BALL method.

9.4.5.1. Comparison of Algorithms

The four algorithms developed here are compared amongst themselves. The first comparison is to determine how well and often the algorithms produce the minimum energy electron distribution. As both A* and FPT are exact, given enough time they will give identical results so there is no point in determining the accuracy of these two methods. Because they are exact methods, their results can be used as a basis to determine the accuracy of the other methods. Another point of interest is the relative efficiencies of the four algorithms. To determine this, each method is run with a two and a half second time-out period, that is if the algorithm has not completed within 2.5 seconds it is halted and treated as a fail. This time-out period was chosen as a pragmatic limit for the computational expense to determine an electron distribution for a given molecule. It does not imply that the algorithm has failed to optimise, rather that the time period it would take to optimise is much larger than 2.5 seconds. The more efficient methods will have less fails due to time-outs. The results of this method of efficiency determination are not entirely reliable, as they are skewed towards smaller systems and do not consider the different scalability of the algorithms, but they do provide an easy to obtain estimate.

For these tests, the atom and bond energies used were calculated using the def2-SVPD basis set with the BSSE accounted for through the counterpoise correction method. Bond energies for bonds with separated charge were not included. Reference energies were determined for each molecule in the test sets using the FPT algorithm. Energies, as opposed to actual bond order and formal charge assignments, are used for reference as the energy of two resonance structures will be identical whereas the bond order and formal charge states will not, even though each resonance state is a valid, and minimum, bond order and formal charge assignment. Accuracies of the non exact algorithms were determined using only the subset of molecules which did not time-out. The genetic algorithm was run using the following parameters: a population size Np=50 withNs=25 of the initial population generated using the seeding method, a breeding

candidate selection scale parameter valueξb=0.25, a mutation rateρm=0.1, a generational

progression scale parameter value ξg=0.25, a minimum of m=25 generations, maximum

9. Bond Order and Formal Charge Assignment

Table 9.4:Efficiency and accuracy results for the four different algorithms. The efficiency measure () is a count of the total number of molecules which timed-out the algorithm with a 2.5 second time-out period. The accuracy (◎) is the percentage of molecules for which the correct minimum energy electron distribution was determined. Failures due to time-outs are removed from the accuracy calculation. Due to the stochastic nature of the genetic algorithm, it was run five times and the results averaged across the runs. Standard deviations in the results are thus provided.

MMFF94 (700 total) σ ◎% σ Local Optimisation 21 76.29 Genetic Algorithm 120.2 6.40 49.54 0.69 A* 257 100.00 FPT 188 100.00 KEGG (5955 total) Local Optimisation 1703 71.19 Genetic Algorithm 3367.4 10.48 40.95 0.15 A* 3828 100.00 FPT 2996 100.00 ZINC15 (10844) Local Optimisation 591 70.77 Genetic Algorithm 4771.2 22.35 47.28 0.13 A* 5458 100.00 FPT 2659 100.00

utilised the abstemious heuristic. Results are given in table 9.4.

Clearly, the local optimisation method is the most efficient optimisation method. It also has fairly good accuracy, which is primarily due to the ordering ofP, and is thus well suited to ob- taining an heuristically good electron distribution. By contrast, with the parameters used here, the genetic algorithm performs poorly, with only approximately 50% accuracy. The accuracy can be improved through optimisation of the parameters to use, but this will lead to a noticeable decrease in the efficiency of the algorithm as the population size or number of generations would likely have to be increased to give noticeable accuracy improvement. Alternatively, or in addition to parameter optimisation, the genetic algorithm could be combined with the local optimisation algorithm to give a memetic algorithm. A memetic algorithm uses a local search, like the local optimisation method, to reduce the likelihood of premature convergence of the overar- ching genetic algorithm. Of the two exact algorithms, the FPT algorithm is the most efficient. This becomes even more noticeable if the time-out period is extended as it becomes apparent that the FPT algorithm scales far better than the A* algorithm. Due to the overhead associated

9.4. Optimisation Methods

with constructing a nice tree-decomposition, the A* algorithm is more efﬁcient than the FPT algorithm when the molecule has few positions for electrons to be placed.

9.4.5.2. Comparison to Reference Assignments

To compare with the reference assignments of the three molecular databases, the FPT algorithm is used. An electron distribution energy for the reference assignment is calculated using the reference formal charges and bond orders. Results from the FPT algorithm are compared to this energy value. As an energy value is effectively a representation of the counts of vari- ous formal charges and bond orders, comparing the energies accounts for different resonance structures produced from the optimisation process which would not match up exactly with the reference assignments but are fundamentally the same assignment. A small number of reference assignments contained bond orders which were not present in the LUTbolookup table, leading

to the reference energy being calculated as∞. These molecules remain in the data set as such a result will influence all methods equally. Electron distributions are optimised using energies calculated with both the def2-SVPD and def2-TZVPPD basis sets, with and without the BSSE accounted for through counterpoise correction and with and without different energies for bonds with separated charge. Additionally, the energy of the first fifty minimum penalty score bond order assignments returned by the BALL method are calculated and compared with the reference energies. If any of the BALL calculated assignments had an energy matching the reference energy, that molecule was deemed to be correctly calculated by the BALL method. Three forms of the BALL method were used: when the formal charge of the reference assignments were provided to the algorithm, when no formal charges were provided, and when no formal charges were provided but the energy calculated only included contributions from bonds. These energies are only calculated with the def2-SVPD basis set without counterpoise correction and no different energies for bonds with separated charges as the results for them should be independent of the energies used. The results of comparing the energies with the reference assignment energies provide a benchmark test for the FPT algorithm developed and used here. Results obtained are given in table 9.5.

These results show that the FPT algorithm developed here can reproduce reference electron distribution energies with a high degree of reliability, regardless of the level of theory used to calculate atom and bond energies. This is particularly appealing as it means the bond energy and atom energy tables need only be extended at the computationally cheapest level of QM theory. Accounting for the BSSE through counterpoise correction also has no noticeable affect on the accuracy of the results. In only two cases were the results different when BSSE was accounted for, which is not a signiﬁcant enough difference to justify the vastly increased computational cost

9. Bond Order and Formal Charge Assignment

Table 9.5:Percent of optimised electron distribution energies which match the energy calculated for the reference assignment. Energies were calculated with two basis sets, with and with bonds with separated charge having different energies (Q-bonds) and with and without counterpoise correction to the BSSE. For the results utilising the BALL method, energies were calculated with the def2-SVPD basis set.

Q-bonds BSSE MMFF94 KEGG ZINC15

def2-SVPD N N 83.57 97.25 94.32 def2-SVPD N Y 83.57 97.25 94.32 def2-SVPD Y N 91.00 98.09 96.36 def2-SVPD Y Y 91.14 98.09 96.36 def2-TZVPPD N N 83.57 97.25 94.32 def2-TZVPPD N Y 83.57 97.25 94.32 def2-TZVPPD Y N 91.00 98.09 96.36 def2-TZVPPD Y Y 91.00 98.09 96.35 BALL w/ FC N N 97.43 98.86 98.51 BALL w/o FC N N 63.86 84.85 55.36 BALL w/o FC (BO) N N 86.14 94.49 96.67

associated with determining the counterpoise correction. The other outcome of note is that the use of different energies for bonds with charge separation leads to a noticeable improvement in the accuracy of the optimised energies, particularly with the MMFF94 validation suite. Primarily this is due to many nitrogen containing molecules being better represented with charged bond energies than without.

Comparison with the BALL method provides a measure of how well the FPT method developed here compares with a current state of the art method. When formal charges are provided to the BALL method, it out performs the FPT method. This is not an entirely fair comparison though as the BALL method is working with additional information which is not only unavail- able to our FPT method, but is also being determined in parallel with the bond order assignment. Having said that, the level of accuracy attained is not too dissimilar to that of the BALL method. It is in the case where formal charges have not been supplied to the BALL method were our FPT implementation really shines. This is the situation when bond orders and formal charges need to be determined within the CherryPicker algorithm. Here, our FPT algorithm vastly out- performs the BALL method, though, again, this is not a fair comparison as molecules with any formal charges will not match with the reference assignment. Accounting for only the energy contribution due to bonds is more representative of the true performance of the BALL method. If formal charges were back calculated from the bond order assignment, the true performance of the BALL method would fall somewhere between these two cases. With this in mind, we can conclude that the accuracy of our FPT algorithm matches or exceeds that of the current state of

9.5. Conclusion

the art BALL method when formal charges are unknown.

The BALL method does have a distinct advantage over our FPT algorithm, and that is to do with the implementation. The BALL method is implemented in C++ which is a more perfor- mant programming language than Python, in which our algorithms are implemented. This performance gap could be closed somewhat by some tweaks to our FPT algorithm focused around calculation of the score table, and optimisation of the nice tree-decomposition. Such tweaks are left for future work, along with possible reimplementation in a lower level programming language, such as C++.

9.5. Conclusion

Four different algorithms for the calculation of optimal bond order and formal charge assignment were developed and implemented, utilising energies calculated with high level QM methods. Of the four, the FPT algorithm showed the best performance in both computational efﬁciency and accuracy. Results obtained show that there is no difference in accuracy with different levels of QM computation, indicating that extension of the algorithms to be able to handle additional element and bond types need only consider performing QM calculations at the lowest level of theory. In comparison with the state of the art BALL method, the FPT algorithm developed here performs remarkably well, attaining accuracies close to those attained when the BALL method is provided with formal charges from the reference molecules, and exceeding the accuracy attained when no formal charges were provided.

Bibliography

(1) A. K. Rappe, C. J. Casewit, K. S. Colwell, W. A. Goddard and W. M. Skiff, “UFF, a full periodic table force ﬁeld for molecular mechanics and molecular dynamics simulations”, Journal of the American Chemical Society, 1992,114, 10024–10035.

(2) T. A. Halgren, “Merck molecular force ﬁeld. I. Basis, form, scope, parameterization, and performance of MMFF94”,Journal of Computational Chemistry, 1996,17, 490–519. (3) A. A. S. T. Ribeiro, B. A. C. Horta and R. B. de Alencastro, “MKTOP: a Program for

Automatic Construction of Molecular Topologies”,Journal of the Brazilian Chemical Society, 2008,19, 1433–1435.

(4) G. A. Kaminski, R. A. Friesner, J. Tirado-Rives and W. L. Jorgensen, “Evaluation and Reparametrization of the OPLS-AA Force Field for Proteins via Comparison with Accu- rate Quantum Chemical Calculations on Peptides”,The Journal of Physical Chemistry B, 2001,105, 6474–6487.

(5) J. Wang, W. Wang, P. A. Kollman and D. A. Case, “Automatic atom type and bond type perception in molecular mechanical calculations”,Journal of Molecular Graphics and Modelling, 2006,25, 247–260.

(6) J. Wang, R. M. Wolf, J. W. Caldwell, P. A. Kollman and D. A. Case, “Development and testing of a general amber force ﬁeld”,Journal of Computational Chemistry, 2004,25, 1157–1174.

(7) A. W. Schüttelkopf and D. M. F. van Aalten, “PRODRG: a tool for high-throughput crystallography of protein–ligand complexes”,Acta Crystallographica Section D, 2004,

60, 1355–1363.

(8) M. Lundborg and E. Lindahl, “Automatic GROMACS Topology Generation and Com- parisons of Force Fields for Solvation Free Energy Calculations”,The Journal of Physi- cal Chemistry B, 2015,119, 810–823.

(9) A. K. Malde, L. Zuo, M. Breeze, M. Stroet, D. Poger, P. C. Nair, C. Oostenbrink and A. E. Mark, “An Automated Force Field Topology Builder (ATB) and Repository: Ver- sion 1.0”,Journal of Chemical Theory and Computation, 2011,7, 4026–4037.

Bibliography

(10) S. Canzar, M. El-Kebir, R. Pool, K. Elbassioni, A. K. Malde, A. E. Mark, D. P. Geerke, L. Stougie and G. W. Klau, “Charge Group Partitioning in Biomolecular Simulation”, Journal of Computational Biology, 2013,20, 188–198.

(11) K. Koziara, M. Stroet, A. Malde and A. Mark, “Testing and validation of the Automated Topology Builder (ATB) version 2.0: prediction of hydration free enthalpies”,Journal of Computer-Aided Molecular Design, 2014,28, 221–233.

(12) J. A. Lemkul, W. J. Allen and D. R. Bevan, “Practical Considerations for Building GRO- MOS-Compatible Small-Molecule Topologies”,Journal of Chemical Information and Modeling, 2010,50, 2221–2235.

(13) R. M. Betz and R. C. Walker, “Paramﬁt: Automated optimization of force ﬁeld parameters for molecular dynamics simulations”,Journal of Computational Chemistry, 2015,

36, 79–87.

(14) B. T. Miller, R. P. Singh, J. B. Klauda, M. Hodošˇcek, B. R. Brooks and H. L. Woodcock, “CHARMMing: A New, Flexible Web Portal for CHARMM”,Journal of Chemical In- formation and Modeling, 2008,48, 1920–1929.

(15) E. Vanquelef, S. Simon, G. Marquant, E. Garcia, G. Klimerak, J. C. Delepine, P. Cieplak and F.-Y. Dupradeau, “R.E.D. Server: a web service for deriving RESP and ESP charges and building force ﬁeld libraries for new molecules and molecular fragments”,Nucleic Acids Research, 2011,39, W511–W517.

(16) L.-P. Wang, T. J. Martinez and V. S. Pande, “Building Force Fields: An Automatic, Systematic, and Reproducible Approach”,The Journal of Physical Chemistry Letters, 2014,5, 1885–1891.

(17) W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey and M. L. Klein, “Com- parison of simple potential functions for simulating liquid water”,The Journal of Chem- ical Physics, 1983,79, 926–935.

(18) S. Grimme, “A General Quantum Mechanically Derived Force Field (QMDFF) for Molecules and Condensed Phase Simulations”,Journal of Chemical Theory and Com- putation, 2014,10, 4497–4514.

(19) M. S. Gordon, D. G. Fedorov, S. R. Pruitt and L. V. Slipchenko, “Fragmentation Meth- ods: A Route to Accurate Calculations on Large Systems”,Chemical Reviews, 2012,

112, 632–672.

Bibliography

(20) W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C.

In document On using automated algorithms to parameterise molecules for molecular dynamics simulations and investigating suitable ensembles for the simulation of naphthalimide monolayers : a thesis submitted to Massey University in Albany, Auckland in fulfilment of t (Page 175-198)