1.2 Computational Methodology
1.2.1 Genetic Algorithms (GA)
I will begin with an overview of the theory behind GAs and the rationale for using them. This section will first discuss how GAs work conceptually, including the underlying concept of “search space.” I will then explain how GAs generate new solutions to a given problem, how they assess the quality of those solutions, and how they select solutions for further testing. I will end by discussing the technique’s limitations and the conditions under which a GA is most appropriate.
1.2.1.1 Search Space and the Fundamentals of Genetic Algorithms (GA)
GA are stochastic problem-solving strategies that navigate “search space” by iteratively evolving populations of new solutions145,146. “Search space” in this context means the set of all possible solutions to a given problem146. In multimodal problems, there are often multiple “good” solutions with varying levels of “goodness” or “fitness.” There may or may not be a true global optimum, but there are multiple local optima147. Indeed, the number of all potential solutions may be enormous146,147. For instance, the search space of all synthesizable drug-like molecules includes 10^20 to 10^23 possible solutions148–150. In such cases, testing all possible solutions is impractical or impossible.
GA excel in these scenarios because they balance computational costs with thoroughness of search and so can generate multiple “fit” answers to a problem145,146. They comprise a class of evolutionary algorithms that attempt to mimic Darwinian evolution by successively evolving populations of solutions (referred to as generations)145,146,151. Traits of the most “fit” parent solutions are combined or altered to generate new child solutions145–147,151. These new solutions are then evaluated to determine their “fitness.”145–147,151 Selective pressure is applied by selecting fit solutions to parent the next generation145–147,151. Randomness is incorporated in several ways, such as through seed-population and trait selection145–147,151. This randomness means that the predicted solutions may differ each time the algorithm is run, which is why GA are considered stochastic.
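The generational cycle described above can be sketched on a toy problem. The example below is a minimal illustration, not AutoGrow4's implementation: solutions are bit strings, fitness is simply the number of 1 bits, and every name and parameter value (run_ga, n_elite, the 0.2 mutation probability) is an assumption chosen for brevity.

```python
import random

def fitness(bits):
    """Toy fitness: the number of 1 bits (higher is better)."""
    return sum(bits)

def run_ga(seed, n_bits=20, pop_size=30, generations=40, n_elite=2):
    rng = random.Random(seed)  # seeded so that a run can be repeated exactly
    # Random seed population: one source of stochasticity.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]             # selective pressure
        children = [p[:] for p in ranked[:n_elite]]  # best advance unchanged
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)
            child = mom[:cut] + dad[cut:]            # combine parental traits
            i = rng.randrange(n_bits)
            if rng.random() < 0.2:                   # occasional alteration
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    return max(population, key=fitness)

best = run_ga(seed=0)
print(fitness(best), best)
```

Calling run_ga with a different seed may return a different best solution, which is the stochastic behavior noted above; calling it twice with the same seed returns the same one.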
1.2.1.2 Populating a Generation of Solutions
GAs create new generations of solutions by either advancing fit solutions from the previous generation or evolving new solutions from those of the previous generation145–147,151. AutoGrow4 “solutions” are small molecules that dock into a protein pocket. Operators are the functions that alter, merge, or advance solutions from one generation to the next145–147,151. The three most common operators are crossover, mutation, and elitism145,146,151. Crossovers merge traits of two parent solutions145,146, mutations make small alterations to a single parent compound145,146, and elitism advances a solution from a previous generation without alteration145,146. I describe these operators in the context of AutoGrow4 in “Section 2.2.2: Operators: Population Generation via Crossover, Mutation, and Elitism” (Figure 9, p.74).
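To make the three operators concrete, the sketch below applies them to toy solutions written as lists of letters. It is illustrative only: AutoGrow4 operates on molecules rather than letter strings, and the function names, the four-letter alphabet, and the per-operator quotas (n_cross, n_mut, n_elite) are all hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Merge traits of two parents: each position comes from one of them."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(parent, rng, alphabet="ABCD"):
    """Make one small alteration to a copy of a single parent."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] = rng.choice([c for c in alphabet if c != child[i]])
    return child

def elitism(ranked_population, n):
    """Advance the top n solutions unchanged (copied to avoid aliasing)."""
    return [list(sol) for sol in ranked_population[:n]]

def next_generation(ranked_population, rng, n_cross=4, n_mut=4, n_elite=2):
    """Fill a new generation with a fixed quota from each operator.

    Assumes ranked_population is sorted best-first and has >= 2 members.
    """
    new = elitism(ranked_population, n_elite)
    for _ in range(n_cross):
        a, b = rng.sample(ranked_population, 2)
        new.append(crossover(a, b, rng))
    for _ in range(n_mut):
        new.append(mutate(rng.choice(ranked_population), rng))
    return new

rng = random.Random(0)
ranked = [list("AAAA"), list("BBBB"), list("CCCC")]
for solution in next_generation(ranked, rng):
    print("".join(solution))
```

Separating the operators from the quota logic makes it easy to change how much of each generation comes from crossover, mutation, or elitism without touching the operators themselves.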
1.2.1.3 Fitness
All GA require a metric with which to judge the quality of solutions145–147,151. This is referred to as a fitness metric and is determined by a fitness function145–147,151. In the case of AutoGrow4, the primary fitness metric is the predicted binding affinity that is calculated by a docking program. Offspring of well-scoring solutions then form a new generation of solutions145–147,151.
The dynamics of how populations of solutions evolve can be viewed in terms of adaptive landscapes152–154. This concept was introduced by Sewall Wright in the 1930s as a way to describe the relationship between genotype and fitness as related to natural selection152,153. In his 1982 article, “The Shifting Balance Theory and Macroevolution,”154 Wright describes population shifts from one adaptive peak (i.e., optimum) to another as a three-step process: (1) “stochastic variability” (i.e., random genetic alteration or mutations) causes subpopulations to explore the adaptive landscape, which results in the genetic drift of a portion of the population154; (2) natural selection pressures subpopulations that exist between adaptive peaks to migrate through genotypic space towards a peak154; and (3) populations at the most advantageous peak(s) proliferate and mate with other subpopulations to confer advantageous traits to other portions of the population154. Wright’s view of adaptive peaks separated by barriers transcends evolutionary biology and is a cornerstone of evolutionary algorithms that attempt to mimic Darwinian evolution145,146,151,152.
In GA, solutions that have reached advantageous peaks (i.e., achieved high scores) proliferate at higher rates than other solutions and so can become overrepresented, resulting in a homogenized population151. Taken to an extreme, this loss of diversity within the solution pool, referred to as convergence, is a sign that a local optimum has been reached and that new solutions are not likely to be created from the current population151. Delaying convergence allows for more solution space to be explored, which may lead to the discovery of even better solutions151. To delay homogenization, AutoGrow4 provides a secondary fitness function that selects for chemically unique compounds. The details of both the primary and secondary fitness functions are discussed in “Section 2.2.7: Assessing Fitness.” AutoGrow4 also provides multiple options for reassessing binding properties, which are detailed in the same section.
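One generic way to quantify homogenization is to measure how similar each solution is to the rest of the population. The sketch below is an assumption made for illustration and is not AutoGrow4's secondary fitness function: each solution is reduced to a set of hypothetical integer "features," similarity is the Jaccard index, and the function names are invented here.

```python
def jaccard(a, b):
    """Similarity of two feature sets: 1.0 = identical, 0.0 = disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_similarity_to_others(population):
    """For each solution, its average similarity to every other solution.

    Lower values mark more unique solutions. Requires >= 2 solutions.
    """
    scores = []
    for i, sol in enumerate(population):
        others = [jaccard(sol, o) for j, o in enumerate(population) if j != i]
        scores.append(sum(others) / len(others))
    return scores

def population_homogeneity(population):
    """Mean of the per-solution scores; values near 1.0 suggest convergence."""
    scores = mean_similarity_to_others(population)
    return sum(scores) / len(scores)

diverse = [{1, 2}, {3, 4}, {5, 6}]
converged = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
print(population_homogeneity(diverse))
print(population_homogeneity(converged))
print(mean_similarity_to_others(converged))
```

A per-solution score of this kind could serve as a secondary ranking criterion that favors unique solutions, while the population-level mean gives a simple signal to monitor across generations.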
1.2.1.4 Ranking and Selection Approaches
A core feature of GA is the application of a selective pressure to guide the fitness of successive generations, traditionally by selecting solutions that are able to spawn new solutions and/or solutions that advance via elitism151. Broadly speaking, three of the most commonly used selection strategies in GA are the Ranking, Roulette, and Tournament selectors151. A Ranking selector grades solutions from best to worst and selects the best solutions. This is an effective way to find local optima, but it often yields inbred populations composed of highly similar compounds151. In extreme cases, population homogenization can cause convergence wherein the algorithm perpetually recreates similar compounds without substantial fitness improvement.
A Roulette selector assigns each solution to an area on a metaphoric roulette wheel, with the size of each area weighted by fitness, and so incorporates randomness into each generation. This ideally minimizes the GA’s chance of becoming trapped in local optima151. However, this method also provides an opportunity for all potential solutions to advance, including the most unfit ones151.
Lastly, a Tournament selector randomly chooses a subpopulation of candidates from a generation and selects the fittest from that subpopulation151, thereby incorporating more randomness than a Ranking selector while mitigating the risk of selecting unfit solutions151.
A user’s choice of strategy depends on the goals and resources of the project. For instance, someone aiming to find a good local optimum with little computational overhead would benefit from using a deterministic approach with a Ranking selector. Alternatively, if the user’s goal is to explore a wider range of search space and test the greatest possible diversity of solutions, a stochastic selection strategy that provides randomness is recommended. It is incumbent on users to choose the selection strategy most applicable to their studies.
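The three selectors discussed in this section can be sketched in a few lines each. This is a minimal illustration under stated assumptions, not AutoGrow4's code: solutions are paired with a fitness value where higher is better, the Roulette sketch assumes non-negative fitness values and samples with replacement, and the function names and the tournament size of 3 are hypothetical.

```python
import random

def ranking_select(scored, k):
    """Deterministic: keep the k highest-scoring solutions."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [solution for solution, _ in ranked[:k]]

def roulette_select(scored, k, rng):
    """Stochastic: draw k solutions with probability proportional to fitness.

    Assumes fitness values are non-negative with a positive sum.
    """
    solutions = [solution for solution, _ in scored]
    weights = [fit for _, fit in scored]
    return rng.choices(solutions, weights=weights, k=k)

def tournament_select(scored, k, rng, size=3):
    """Stochastic: k times, sample `size` candidates and keep the fittest."""
    winners = []
    for _ in range(k):
        entrants = rng.sample(scored, size)
        winners.append(max(entrants, key=lambda pair: pair[1])[0])
    return winners

scored = [("a", 9.0), ("b", 5.0), ("c", 1.0), ("d", 0.5)]
rng = random.Random(0)
print(ranking_select(scored, 2))
print(roulette_select(scored, 2, rng))
print(tournament_select(scored, 2, rng))
```

In this toy population, the Roulette selector can return even the weakest solution, "d", whereas a Tournament of size 3 never can, because every tournament drawn from four candidates contains at least one of the two strongest. Where lower scores are better, as with some docking scores, the values would need to be transformed before being used as Roulette weights.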
1.2.1.5 Limitations of Genetic Algorithms (GA)
GA are most applicable when a search space is so large that it would be impractical to brute-force test all possible solutions, and finding the global optimum is not essential145. The set of all possible drug-like small molecules is one such search space; it includes 10^20 to 10^23 compounds148–150, a number far too large to test exhaustively by any means. Because GA operate with limited prior knowledge of solution space, they are more efficient than brute-force testing in large search spaces145. Conversely, GA are often less efficient than more direct methods in small search spaces, or when prior knowledge can be used to intelligently limit the search space145.
While some selection approaches can postpone convergence, GA still suffer from local optima “trapping,”145,151 which leads to population homogenization as a few solutions begin to dominate145 (discussed in “Section 1.2.1.3: Fitness”). This concern is further discussed in “Section 3.3.1.3: A Caution Regarding Homogeneity and Convergence.”
The stochastic nature of GA means that repeated independent runs can produce different results145. This is conducive to a wide search of solution space, but it also raises concerns about reproducibility and parameter optimization. A user may have difficulty distinguishing whether random chance or a specific parameter is primarily responsible for a given result.
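One common way to address both concerns is to fix the random seed of each run, so that individual runs are repeatable, and to compare parameter settings across several seeds rather than from a single run. The sketch below illustrates the idea with a deliberately simple stand-in for a GA run; run_once, its parameters, and the two mutation rates compared are hypothetical and chosen only for demonstration.

```python
import random
import statistics

def run_once(seed, mutation_rate, n_bits=30, steps=200):
    """Hypothetical stand-in for one stochastic run; returns the best score."""
    rng = random.Random(seed)  # the same seed always gives the same result
    best = [rng.randint(0, 1) for _ in range(n_bits)]
    for _ in range(steps):
        # Flip each bit independently with probability mutation_rate.
        trial = [1 - b if rng.random() < mutation_rate else b for b in best]
        if sum(trial) >= sum(best):
            best = trial
    return sum(best)

def summarize(mutation_rate, seeds):
    """Mean and sample standard deviation of the best score across seeds."""
    scores = [run_once(seed, mutation_rate) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

seeds = range(10)
for rate in (0.01, 0.2):
    mean, spread = summarize(rate, seeds)
    print(f"mutation_rate={rate}: mean={mean:.1f}, stdev={spread:.1f}")
```

If the difference between the means for two parameter settings is small relative to the spread across seeds, random chance is a plausible explanation for the difference; if it is large, the parameter is the more likely cause.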