CHAPTER 10
EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY
• Organization of chapter in ISSO
– Introduction
– Evolution strategy and evolutionary programming; comparisons with GAs
– Schema theory for GAs
– What makes a problem hard?
– Convergence theory
– No free lunch theorems
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall
Methods of EC
• Genetic algorithms (GAs), evolution strategy (ES), and evolutionary programming (EP) are most common EC methods
• Many modern EC implementations borrow aspects from one or more EC methods
– Ant colony optimization, differential evolution, particle swarm optimization, etc.
ES Algorithm with Noise-Free Loss Measurements
Step 0 (initialization) Randomly or deterministically generate initial population of N values of $\theta \in \Theta$ and evaluate L for each of the values.
Step 1 (offspring) Generate $\lambda$ offspring from current population of N candidate $\theta$ values such that all $\lambda$ values satisfy direct or indirect constraints on $\theta$.
Step 2 (selection) For $(N+\lambda)$-ES, select N best values from combined population of N original values plus $\lambda$ offspring; for $(N,\lambda)$-ES, select N best values from population of $\lambda > N$ offspring only.
Step 3 (repeat or terminate) Repeat steps 1 and 2 or terminate.
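The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the ISSO implementation: the scalar parameter, the Gaussian mutation with fixed standard deviation `sigma`, the clipping used to enforce the interval constraint, the fixed iteration count, and all function and argument names are assumptions made for the example.

```python
import random

def evolution_strategy(loss, lower, upper, n_pop=5, n_off=20, sigma=0.5,
                       plus=True, n_iter=50, seed=0):
    """Sketch of (N+lambda)-ES (plus=True) or (N,lambda)-ES (plus=False)
    for a scalar parameter constrained to [lower, upper]."""
    if not plus and n_off <= n_pop:
        raise ValueError("(N,lambda)-ES needs more offspring than parents")
    rng = random.Random(seed)
    # Step 0: random initial population of N values.
    pop = [rng.uniform(lower, upper) for _ in range(n_pop)]
    best_history = []
    for _ in range(n_iter):
        # Step 1: each offspring is a Gaussian perturbation of a randomly
        # chosen parent, clipped to satisfy the constraint (assumed scheme).
        offspring = []
        for _ in range(n_off):
            child = rng.choice(pop) + rng.gauss(0.0, sigma)
            offspring.append(min(max(child, lower), upper))
        # Step 2: keep the N best of parents plus offspring, or of offspring only.
        candidates = pop + offspring if plus else offspring
        pop = sorted(candidates, key=loss)[:n_pop]
        best_history.append(loss(pop[0]))
        # Step 3: repeat; here termination is a fixed iteration budget.
    return pop[0], best_history

def quadratic(t):
    # Hypothetical noise-free loss with minimum at 3.
    return (t - 3.0) ** 2

theta_plus, hist_plus = evolution_strategy(quadratic, 0.0, 15.0, plus=True)
theta_comma, hist_comma = evolution_strategy(quadratic, 0.0, 15.0, plus=False)
print(theta_plus, theta_comma)
```

With plus selection the best loss in the population can never increase, because parents compete with their offspring; with comma selection it can, since all parents are discarded each iteration.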
Schema Theory for GAs
• Key innovation in Holland (1975) is a form of theoretical
foundation for GAs based on schemas
– Represents first attempt at serious theoretical analysis
– But not entirely successful, as “leap of faith” required to
relate schema theory to actual convergence of GA
• “GAs work by discovering, emphasizing, and recombining good ‘building blocks’ of solutions in a highly parallel fashion.” (Melanie Mitchell, An Introduction to Genetic Algorithms [p. 27], 1996, paraphrasing John Holland)
– Statement above more intuitive than formal
Schema Theory for GAs (cont’d)
• Schema is template for chromosomes in GAs
• Example: [* 1 0 * * * * 1], where the * symbol represents a don’t care (or free) element
– [11001101] is specific instance of this schema
• Schemas sometimes called building blocks of GAs
• Two fundamental results: Schema theorem and implicit parallelism
• Schema theorem says that better templates dominate the population as generations proceed
• Implicit parallelism says that GA processes >> N schemas
at each iteration
• Schema theory is controversial
– Not connected to algorithm performance in same direct way as usual convergence theory for iterates of algorithm
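The schema example above can be checked mechanically. The sketch below is illustrative only; the string encoding of schemas and the helper names `matches` and `schemas_of` are assumptions for this example. It also counts the schemas that a chromosome is an instance of: a chromosome of length B matches 2^B schemas, which gives a rough sense of why a population of N chromosomes involves many more than N schemas.

```python
from itertools import product

def matches(schema, chromosome):
    """True if the chromosome is an instance of the schema ('*' = don't care)."""
    return len(schema) == len(chromosome) and all(
        s == "*" or s == c for s, c in zip(schema, chromosome))

def schemas_of(chromosome):
    """All schemas of which the chromosome is an instance (2**B of them)."""
    return {"".join(opt) for opt in product(*[(bit, "*") for bit in chromosome])}

schema = "*10****1"
print(matches(schema, "11001101"))  # True: the instance given on the slide
print(matches(schema, "01101101"))  # False: third bit is 1, schema requires 0

# Hypothetical population of N = 3 chromosomes with B = 8 bits.
population = ["11001101", "01001100", "10110011"]
distinct = set().union(*(schemas_of(c) for c in population))
print(len(schemas_of(population[0])), len(distinct))
```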
Convergence Theory via Markov Chains
• Schema theory inadequate
– Mathematics behind schema theory not fully rigorous
– Unjustified claims about implications of schema theory
• More rigorous convergence theory exists
– Pertains to noise-free loss (fitness) measurements
– Pertains to finite representation (e.g., bit coding or floating point representation on digital computer)
• Convergence theory relies on Markov chains
• Each state in chain represents possible population
• Markov transition matrix P contains all information for
GA Markov Chain Model
• GAs with binary bit coding can be modeled as (discrete state) Markov chains
• Recall states in chain represent possible populations
• ith element of probability vector $p_k$ represents probability of achieving ith population at iteration k
• Transition matrix: The i, j element of P represents the probability of population i producing population j through
the selection, crossover and mutation operations
– Depends on loss (fitness) function, selection method, and reproduction and mutation parameters
• Given transition matrix P, it is known that $p_{k+1}^T = p_k^T P$
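The recursion for the probability vector is a row vector multiplied by the transition matrix. The sketch below propagates it for a toy chain; the 3-state matrix is hypothetical and only stands in for the (much larger) matrix over GA populations.

```python
def step(p, P):
    """One iteration of p_{k+1}^T = p_k^T P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical chain with three "populations"; entry P[i][j] is the
# probability that population i produces population j in one GA iteration.
P = [[0.5, 0.4, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

p = [1.0, 0.0, 0.0]  # start in population 0 with probability 1
for k in range(1, 6):
    p = step(p, P)
    print(k, [round(x, 4) for x in p])
```

Each row of P sums to 1, so the entries of the propagated vector continue to sum to 1 at every iteration.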
Rudolph (1994) and Markov Chain
Analysis for Canonical GA
• Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov
chain analysis to study “canonical GA” (CGA)
• CGA includes binary bit coding, crossover, mutation, and “roulette wheel” selection
– CGA is focus of seminal book, Holland (1975)
• CGA does not include elitism; lack of elitism is critical aspect of theoretical analysis
• CGA assumes mutation probability 0 < Pm < 1 and single-point crossover probability 0 ≤ Pc ≤ 1
• Key preliminary result: CGA is ergodic Markov chain:
Rudolph (1994) and Markov Chain
Analysis for CGA (cont’d)
• Ergodicity for CGA provides a negative result on convergence in Rudolph (1994)
• Let $\hat{L}_{\min,k}$ denote lowest of N (= population size) loss values within population at iteration k
– $\hat{L}_{\min,k}$ represents loss value for $\theta$ in population k that has maximum fitness value
• Main theorem: CGA satisfies
$$\lim_{k \to \infty} P\left(\hat{L}_{\min,k} = L(\theta^*)\right) < 1$$
(above limit on left-hand side exists by ergodicity)
• Implies CGA does not converge to the global optimum
Rudolph (1994) and Markov Chain
Analysis for CGA (cont’d)
• Fundamental problem with CGA is that optimal solutions are found but then lost
• CGA has no mechanism for retaining optimal solution
• Rudolph discusses modification to CGA yielding positive convergence results
• Appends “super individual” to each population
– Super individual represents best chromosome so far
– Not eligible for GA operations (selection, crossover, mutation)
– Not same as elitism
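A rough sketch of the idea: run a GA without elitism and keep the best chromosome seen so far in a separate slot that never takes part in selection, crossover, or mutation. Everything specific here is an assumption for illustration, not Rudolph's construction in detail: the bit decoding onto [0, 15], the fitness scaling used for the roulette wheel, the quadratic loss, and the parameter values.

```python
import random

def decode(bits, lo=0.0, hi=15.0):
    """Map a bit list to a real value in [lo, hi] (assumed coding)."""
    return lo + (hi - lo) * int("".join(map(str, bits)), 2) / (2 ** len(bits) - 1)

def ga_with_super_individual(loss, n_pop=6, n_bits=6, pc=0.8, pm=0.05,
                             n_iter=40, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_pop)]
    super_ind = min(pop, key=lambda c: loss(decode(c)))[:]
    pop_best, super_best = [], []
    for _ in range(n_iter):
        losses = [loss(decode(c)) for c in pop]
        # Roulette wheel selection; fitness = (worst loss - loss) plus a small
        # offset so that all weights are positive (assumed scaling).
        worst = max(losses)
        weights = [worst - v + 1e-9 for v in losses]
        parents = rng.choices(pop, weights=weights, k=n_pop)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < pc:  # single-point crossover
                cut = rng.randint(1, n_bits - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a, b]
        # Bitwise mutation with probability pm; no elitism inside the population.
        pop = [[bit ^ 1 if rng.random() < pm else bit for bit in c]
               for c in children]
        # Super individual: updated by comparison only, never used by the
        # GA operations above.
        current = min(pop, key=lambda c: loss(decode(c)))
        if loss(decode(current)) < loss(decode(super_ind)):
            super_ind = current[:]
        pop_best.append(loss(decode(current)))
        super_best.append(loss(decode(super_ind)))
    return pop_best, super_best

pop_best, super_best = ga_with_super_individual(lambda t: (t - 10.0) ** 2)
print(pop_best[-5:])
print(super_best[-5:])
```

The best loss within the population may rise from one iteration to the next, since good chromosomes can be lost, while the loss of the super individual is non-increasing by construction.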
Contrast of Suzuki (1995) and Rudolph
(1994) in Markov Chain Analysis for GA
• Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.)
uses Markov chain analysis to study GA with elitism
– Same as CGA of Rudolph (1994) except for elitism
• Suzuki (1995) only considers unique states (populations)
– Rudolph (1994) includes redundant states
• With N = population size and B = no. of bits/chromosome:
$$\binom{N + 2^B - 1}{N} = \frac{(N + 2^B - 1)!}{(2^B - 1)!\, N!}$$
unique states in Suzuki (1995),
$2^{NB}$ states in Rudolph (1994) (much larger than number of unique states above)
• Above affects bookkeeping; does not fundamentally change relative results of Suzuki (1995) and Rudolph (1994)
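The two state counts are easy to evaluate for small cases. The sketch below assumes the unique-state count is the number of multisets of size N drawn from the 2^B possible chromosomes, (N + 2^B - 1)! / ((2^B - 1)! N!), and that the count with redundant states is 2^(NB); the function names are chosen for this example.

```python
from math import comb

def unique_states(n_pop, n_bits):
    """Populations counted as unordered collections (multisets) of chromosomes."""
    return comb(n_pop + 2 ** n_bits - 1, n_pop)

def redundant_states(n_pop, n_bits):
    """Populations counted as ordered lists of chromosomes."""
    return 2 ** (n_pop * n_bits)

for n_pop, n_bits in [(2, 6), (4, 4), (6, 6)]:
    print(n_pop, n_bits, unique_states(n_pop, n_bits), redundant_states(n_pop, n_bits))
# 2 6 2080 4096
# 4 4 3876 65536
# 6 6 119877472 68719476736
```

Even with redundant populations merged, N = B = 6 already gives roughly 1.2 × 10^8 states, so the transition matrix has about 10^8 rows and columns.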
Convergence Under Elitism
• In both CGA case (Rudolph, 1994) and case with elitism (Suzuki, 1995) the limit exists:
$$\bar{p}^T \equiv \lim_{k \to \infty} p_0^T P^k$$
(dimension of $\bar{p}$ differs according to definition of states, unique or nonunique as on previous slide)
• Suzuki (1995) assumes each population includes one elite element and that crossover probability Pc = 1
• Let $\bar{p}_j$ represent jth element of $\bar{p}$, and J represent indices j where population j includes chromosome achieving $L(\theta^*)$
• Then from Suzuki (1995):
$$\sum_{j \in J} \bar{p}_j = 1$$
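The contrast between the two limiting results can be seen on a toy chain. The matrices below are hypothetical 3-state examples, not GA transition matrices: state 2 plays the role of the populations that contain an optimal chromosome. In the first chain every transition has positive probability (as in an ergodic chain), so the optimal state can be left; in the second, the optimal state is never left, which mimics the effect of retaining an elite element.

```python
def limit_distribution(p0, P, n_iter=2000):
    """Approximate the limit of p_0^T P^k by repeated multiplication."""
    p = list(p0)
    n = len(P)
    for _ in range(n_iter):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p

# Hypothetical chain without elitism: optimal state 2 can be lost.
P_no_elitism = [[0.6, 0.3, 0.1],
                [0.2, 0.5, 0.3],
                [0.1, 0.2, 0.7]]

# Hypothetical chain with elitism: once state 2 is reached, it is retained.
P_elitism = [[0.6, 0.3, 0.1],
             [0.2, 0.5, 0.3],
             [0.0, 0.0, 1.0]]

p0 = [1.0, 0.0, 0.0]
lim_no_elitism = limit_distribution(p0, P_no_elitism)
lim_elitism = limit_distribution(p0, P_elitism)
print([round(x, 4) for x in lim_no_elitism])
print([round(x, 4) for x in lim_elitism])
```

In the first case the limiting probability of the optimal state stays strictly below 1; in the second all limiting probability mass ends up on it.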
Calculation of Stationary Distribution
• Markov chain theory provides useful conceptual device
• Practical calculation difficult due to explosive growth of number of possible populations (states)
• Growth is in terms of factorials of N and bit string length (B)
• Practical calculation of $p_k$ usually impossible due to difficulty in getting P
• Transition matrix can be very large in practice
– E.g., if N = B = 6, P is $10^8 \times 10^8$ matrix!
– Real problems have N and B much larger than 6
• Ongoing work attempts to severely reduce dimension by limiting states to only most important (e.g., Spears, 1999; Moey and Rowe, 2004)
Example 10.2 from ISSO: Markov Chain Calculations for Small-Scale Implementation
• Consider loss function $L(\theta)$ on $\Theta = [0, 15]$
• Function has local and global minimum; plot on next slide
• Several GA implementations with very small population sizes (N) and numbers of bits (B)
• Small scale implementations imply Markov transition matrices are computable
– But still not trivial, as matrix dimensions range from approximately 2000 × 2000 to 4000 × 4000
Loss Function for Example 10.2 in ISSO
Markov chain theory provides probability of finding solution ($\theta^* = 15$) in given number of iterations
Example 10.2 (cont’d): Probability Calculations for Very Small-Scale GAs
Probability that GA with elitism produces population containing optimal solution

| GA iteration | 0 | 5 | 10 | 20 | 30 | 40 | 50 | 100 | 150 |
|---|---|---|---|---|---|---|---|---|---|
| Crossover (Pc) = 1.0, Mutation (Pm) = 0.05, Population (N) = 2, Bit length (B) = 6 | 0.03 | 0.08 | 0.15 | 0.32 | 0.48 | 0.62 | 0.74 | 0.97 | 1.00 |
| Pc = 1.0, Pm = 0.05, N = 4, B = 4 | 0.21 | 0.51 | 0.69 | 0.92 | 1.00 | -- | -- | -- | -- |

Pc = 1.0
Summary of GA Convergence Theory
• Schema theory (Holland, 1975) was most popular method for theoretical analysis until approximately mid-1990s
– Schema theory not fully rigorous and not fully connected to actual algorithm performance
• Markov chain theory provides more formal means of analyzing convergence and convergence rate
• Rudolph (1994) used Markov chains to provide largely negative result on convergence for canonical GAs
– Canonical GA does not converge to optimum
• Suzuki (1995) considered GAs with elitism; unlike Rudolph
(1994), GA is now convergent
• Challenges exist in practical calculation of Markov transition matrix
No Free Lunch Theorems (Reprise, Chap. 1)
• No free lunch (NFL) theorems apply to EC algorithms
– Theorems imply there can be no universally efficient EC algorithm
– Performance of one algorithm when averaged over all problems is identical to that of any other algorithm
• Suppose EC algorithm A applied to loss L
– Let $\hat{L}_n$ denote lowest loss value from most recent N population elements after $n \ge N$ unique function evaluations
• Consider the probability that $\hat{L}_n = \ell$ after n unique evaluations of the loss:
$$P(\hat{L}_n = \ell \mid L, A)$$
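The averaging claim can be checked by brute force on a very small problem. The sketch below is a simplified illustration, not the setting of the theorems in full: it uses two hypothetical deterministic search rules (not population-based EC algorithms), a domain of three points, loss values in {0, 1}, and the lowest loss over all n evaluations so far. Neither rule revisits a point, so every evaluation is unique. Enumerating all 2^3 loss functions, the number of functions giving each value of the lowest loss is the same for both rules.

```python
from collections import Counter
from itertools import product

DOMAIN = (0, 1, 2)  # three candidate points
VALUES = (0, 1)     # possible loss values

def algo_a(seen):
    """Hypothetical rule A: visit points in the fixed order 0, 1, 2."""
    return len(seen)

def algo_b(seen):
    """Hypothetical rule B: start at point 2; second point depends on first loss."""
    if not seen:
        return 2
    if len(seen) == 1:
        return 0 if seen[0][1] == 0 else 1
    visited = {x for x, _ in seen}
    return next(x for x in DOMAIN if x not in visited)

def lowest_loss(algo, f, n):
    """Lowest loss found after n unique evaluations of loss table f."""
    seen = []
    for _ in range(n):
        x = algo(seen)
        seen.append((x, f[x]))
    return min(y for _, y in seen)

def histogram(algo, n):
    """Count, over all loss functions, how often each lowest loss occurs."""
    return Counter(lowest_loss(algo, f, n)
                   for f in product(VALUES, repeat=len(DOMAIN)))

for n in (1, 2, 3):
    print(n, dict(histogram(algo_a, n)), dict(histogram(algo_b, n)))
```

Rule B adapts to what it observes and rule A does not, yet summed over all loss functions the two produce identical histograms for each n.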