2.4 Dynamic Algorithm Configuration
2.4.2 Adaptive Heuristic Algorithms
Adaptive heuristic algorithms aim to select the most effective configuration of a given heuristic algorithm to use at each decision point during the search process. Many different mechanisms to control the adaptation of algorithm configurations have been developed, which fall within three main categories [35]:
• Deterministic: algorithm configurations are predefined functions of time.
• Self-Adaptive: algorithm configurations are part of the genotype and optimized by evolution itself.
• Adaptive Rules: algorithm configurations are predefined functions of the search history.
Deterministic methods essentially raise even greater difficulties than static algorithm configuration: because the optimal configuration of the target algorithm changes with time, these functions must predefine a schedule for applying different configurations along the search process. Self-adaptation is acknowledged as one of the most effective approaches to evolutionary parameter setting, specifically in the framework of continuous parameter optimization [31]. In the general case, however, self-adaptive approaches often significantly increase the size of the search space, and thus the complexity of the optimization problem: a successful individual must not only have good genes but also carry parameter values that enable effective transmission of those genes [35].
Adaptive rules, also referred to as feedback-based control, use information from the search history to adapt the configuration of a given heuristic algorithm while solving the problem. In particular, adaptive rules aim to define an online strategy that determines the most effective configuration on the fly. In this context, the design of adaptive heuristic algorithms involves two main components: the credit assignment mechanism and the selection mechanism. The former assigns a reward to a configuration by evaluating its contribution to the overall performance; the latter is the rule in charge of selecting the configuration to use.
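The interplay between the two mechanisms can be sketched as a simple feedback loop. The following is a minimal illustration rather than any specific published method: the function names, the exponential-average credit rule, the `floor` and `rate` parameters, and the assumption that outcomes are non-negative are all hypothetical choices made for the example.

```python
import random

def proportional_select(rewards, rng, floor=0.1):
    """Selection mechanism (illustrative): sample with weight reward + floor.

    Assumes rewards are non-negative; the floor keeps every weight positive.
    """
    configs = list(rewards)
    weights = [rewards[c] + floor for c in configs]
    return rng.choices(configs, weights=weights, k=1)[0]

def smoothed_credit(old_reward, outcome, rate=0.2):
    """Credit assignment (illustrative): exponential moving average of outcomes."""
    return (1.0 - rate) * old_reward + rate * outcome

def adaptive_control_loop(configs, apply_config, steps, seed=0):
    """Alternate between selecting a configuration and crediting its outcome."""
    rng = random.Random(seed)
    rewards = {c: 0.0 for c in configs}  # current reward estimate per configuration
    history = []                         # configuration used at each decision point
    for _ in range(steps):
        chosen = proportional_select(rewards, rng)
        outcome = apply_config(chosen)   # one search step with that configuration
        rewards[chosen] = smoothed_credit(rewards[chosen], outcome)
        history.append(chosen)
    return rewards, history
```

Here `apply_config` stands in for one step of the underlying heuristic algorithm and returns whatever performance measure the credit assignment mechanism consumes.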
2.4.2.1 Credit Assignment Mechanisms
Several mechanisms for credit assignment have been proposed, which mostly concern how to compute the rewards to be assigned to candidate configurations. Existing mechanisms differ mainly in the metric used to evaluate the performance of candidate configurations.
Most approaches define the performance metric as the fitness improvement of the offspring over its parents or some other reference. More specifically, the fitness improvement is assessed in comparison with (i) the current best individual [29]; (ii) the median fitness [71]; or (iii) the parent fitness [116]. To avoid premature convergence in population-based algorithms, Maturana and Saubion [87] took the population diversity into account and proposed a measure defined as a weighted sum of the fitness improvement and the offspring diversity.
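The three reference points and the diversity-aware variant can be illustrated as follows. This is a hedged sketch, not the formulation of any of the cited works: it assumes a maximization problem, clips negative improvements to zero, and uses a hypothetical `weight` parameter for the weighted sum.

```python
from statistics import median

def improvement_credit(offspring_fitness, parent_fitness, population_fitness,
                       baseline="parent"):
    """Fitness improvement of the offspring over a chosen reference.

    Assumes maximization; clipping at zero is an illustrative choice.
    """
    if baseline == "parent":
        reference = parent_fitness
    elif baseline == "median":
        reference = median(population_fitness)
    elif baseline == "best":
        reference = max(population_fitness)
    else:
        raise ValueError(f"unknown baseline: {baseline!r}")
    return max(0.0, offspring_fitness - reference)

def diversity_aware_credit(fitness_gain, diversity_gain, weight=0.5):
    """Weighted sum of fitness improvement and diversity contribution."""
    return weight * fitness_gain + (1.0 - weight) * diversity_gain
```

For an offspring of fitness 7.0 with a parent of fitness 5.0 in a population with fitness values [1, 3, 5, 9], the credit is 2.0 against the parent, 3.0 against the median, and 0.0 against the current best.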
Instead of considering instantaneous or average improvement, Whitacre et al. [128] considered extreme improvements, using a statistical measure aimed at outlier detection in numerical optimisation. The experimental results showed that the proposed measure significantly outperformed its competitors on a set of continuous benchmark problems.
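The difference between instantaneous, average, and extreme improvement can be made concrete with a sliding window of recent improvements. The sketch below uses a plain windowed maximum as a simple stand-in for an extreme-value credit; it is not the outlier-detection statistic of [128], and the class name and `window` parameter are assumptions made for illustration.

```python
from collections import deque

class WindowedCredit:
    """Keeps the most recent improvements produced by one configuration."""

    def __init__(self, window=5):
        # deque with maxlen silently discards the oldest entry when full
        self.improvements = deque(maxlen=window)

    def record(self, improvement):
        self.improvements.append(improvement)

    def instantaneous(self):
        """Most recent improvement only."""
        return self.improvements[-1]

    def average(self):
        """Mean improvement over the window."""
        return sum(self.improvements) / len(self.improvements)

    def extreme(self):
        """Largest improvement in the window (stand-in for an extreme-value measure)."""
        return max(self.improvements)
```

A configuration that rarely improves fitness but occasionally yields a large jump scores poorly on the instantaneous and average measures, while the extreme measure retains the jump for as long as it stays inside the window.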
In addition, other work proposed that the impact of candidate configurations on the overall performance should be measured along the genealogy of outstanding offspring, e.g., by rewarding the operators that produced the ancestors of a good offspring according to a bucket brigade algorithm [29].
One common shortcoming of the credit assignment mechanisms summarised above is that the rewards are determined solely from the historical performance of candidate configurations. Considering only past performance can be misleading on problems with complex structure, such as deceptive ones [30, 88]. Furthermore, the optimal reward assigned to a configuration is a dynamic random variable whose underlying distribution changes as the search proceeds [29, 90].
2.4.2.2 Selection Mechanisms
Extensive research effort has been devoted to developing effective selection mechanisms over the last two decades. Most selection mechanisms transform the rewards assigned to candidate configurations into a probability distribution that gives the likelihood of selecting each candidate configuration during the search process. Many different approaches have been proposed for learning the optimal probabilities of applying a fixed set of algorithm configurations.
Most existing methods belong to the probability matching type [24, 45, 59, 116, 127]. The basic probability matching selection rule computes each configuration’s selection probability as the ratio of that configuration’s reward to the sum of all rewards. The main drawback of probability matching is that it can lead to the loss of some candidate configurations: if the selection probability of a configuration becomes too low at some point, the configuration is never used again and its reward can no longer be updated. This is an unwanted property, since the configuration might become valuable again at a later stage of the search process [114]. To ensure that no candidate configuration gets lost, a minimum selection probability can be enforced. In practice, however, all mildly relevant configurations then keep being selected, which hinders the performance of probability matching [35]. To address this issue, the adaptive pursuit method [114, 115] has been proposed, in which the selection probabilities are updated so that the algorithm pursues the configuration that currently has the maximal reward: it increases the selection probability of that configuration and decreases the selection probabilities of all other configurations.
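The two update rules can be sketched as follows. This is an illustrative reading of the description above, written in one commonly stated form; the parameter names `p_min` (minimum selection probability) and `beta` (learning rate), the uniform fallback when all rewards are zero, and the assumption of non-negative rewards are choices made for the example rather than details taken from the cited works.

```python
def probability_matching(rewards, p_min=0.0):
    """Selection probabilities proportional to rewards, with an optional floor.

    Assumes non-negative rewards and k * p_min <= 1.
    """
    k = len(rewards)
    total = sum(rewards)
    if total <= 0:
        return [1.0 / k] * k  # no information yet: fall back to uniform
    return [p_min + (1.0 - k * p_min) * r / total for r in rewards]

def adaptive_pursuit_step(probs, rewards, p_min=0.05, beta=0.8):
    """Move the best-rewarded configuration toward p_max and the rest toward p_min."""
    k = len(probs)
    p_max = 1.0 - (k - 1) * p_min  # keeps the probabilities summing to one
    best = max(range(k), key=lambda i: rewards[i])
    return [p + beta * ((p_max if i == best else p_min) - p)
            for i, p in enumerate(probs)]
```

With rewards [1, 3] and no floor, probability matching yields [0.25, 0.75]. Starting from [0.5, 0.5] with `p_min=0.1` and `beta=0.5`, one pursuit step yields [0.3, 0.7], and repeated steps converge toward [0.1, 0.9], so the best-rewarded configuration dominates far more sharply than under probability matching.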
Alternatively, a class of methods is based on the multi-armed bandit paradigm [6, 27], which formulates configuration selection as an Exploration vs. Exploitation (EvE) dilemma: Exploitation aims at selecting the configuration that was best rewarded in the last stages of the search, whereas Exploration is concerned with checking whether other configurations might in fact become the best ones at some later stage. The EvE dilemma has been intensively studied in Game Theory, more specifically in the so-called Multi-Armed Bandit (MAB) framework [6]. The MAB framework considers a set of K independent arms, each having some unknown probability of yielding a (boolean) reward. The optimal selection strategy is the one that maximizes the cumulative reward over time. A key limitation of the standard MAB framework is that it considers only a static environment (the unknown reward probability of each arm is fixed over time), whereas the adaptive algorithm is intrinsically dynamic (the quality of any configuration is bound to vary along evolution). Even though every configuration keeps being selected, so that it can ultimately be realised that some new configuration has become the best one, in practice it would take far too long for the new best configuration to be discovered.
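As an illustration of the static MAB setting, the sketch below implements UCB1, a well-known bandit strategy that scores each arm by its empirical mean reward plus an exploration bonus. The text above does not name a particular strategy, so the choice of UCB1 is an assumption made for the example, as are the class and function names and the simulated boolean rewards.

```python
import math
import random

class UCB1:
    """UCB1 arm selection for K arms with rewards in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of times each arm was played
        self.means = [0.0] * n_arms  # empirical mean reward of each arm

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean over the whole history of the arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def run_bandit(success_probs, steps, seed=0):
    """Simulate arms with fixed (static) boolean reward probabilities."""
    rng = random.Random(seed)
    bandit = UCB1(len(success_probs))
    for _ in range(steps):
        arm = bandit.select()
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts
```

Because each empirical mean averages over the entire history of its arm, a configuration whose quality changes late in the search shifts its mean only slowly, which is one way to see why the static formulation reacts so late in a dynamic setting.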