
CHAPTER 10 EVOLUTIONARY COMPUTATION II: GENERAL METHODS AND THEORY

• Organization of chapter in ISSO

– Introduction

– Evolution strategy and evolutionary programming; comparisons with GAs

– Schema theory for GAs

– What makes a problem hard?

– Convergence theory

– No free lunch theorems

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Methods of EC

• Genetic algorithms (GAs), evolution strategy (ES), and evolutionary programming (EP) are most common EC methods

• Many modern EC implementations borrow aspects from one or more EC methods

– Ant colony optimization, differential evolution, particle swarm optimization, etc.


ES Algorithm with Noise-Free Loss Measurements

Step 0 (initialization) Randomly or deterministically generate initial population of N values of θ ∈ Θ and evaluate L for each of the values.

Step 1 (offspring) Generate λ offspring from current population of N candidate θ values such that all λ values satisfy direct or indirect constraints on θ.

Step 2 (selection) For (N+λ)-ES, select N best values from combined population of N original values plus λ offspring; for (N,λ)-ES, select N best values from population of λ > N offspring only.

Step 3 (repeat or terminate) Repeat steps 1 and 2 or terminate.
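
The four steps can be sketched in Python for a scalar parameter. This is a minimal illustration under stated assumptions, not the algorithm as specified in ISSO: the Gaussian mutation with fixed standard deviation `sigma`, the clipping used to enforce an interval constraint, the quadratic test loss, and the names `evolution_strategy`, `lam`, and `plus` are all choices made for the example.

```python
import random

def evolution_strategy(loss, lower, upper, n_pop=5, lam=20, sigma=0.5,
                       plus=True, n_iter=50, seed=0):
    """Sketch of (N+lambda)-ES (plus=True) or (N,lambda)-ES (plus=False)."""
    if not plus and lam <= n_pop:
        raise ValueError("(N,lambda)-ES needs lambda > N")
    rng = random.Random(seed)
    # Step 0: random initial population inside the constraint interval.
    population = [rng.uniform(lower, upper) for _ in range(n_pop)]
    for _ in range(n_iter):
        # Step 1: each offspring is a Gaussian perturbation of a random parent,
        # clipped to the interval (assumed way of handling the constraint).
        offspring = []
        for _ in range(lam):
            parent = rng.choice(population)
            child = parent + rng.gauss(0.0, sigma)
            offspring.append(min(max(child, lower), upper))
        # Step 2: select the N best from parents plus offspring,
        # or from the offspring only.
        pool = population + offspring if plus else offspring
        population = sorted(pool, key=loss)[:n_pop]
        # Step 3: repeat until the iteration budget is spent.
    return population

def quadratic(theta):
    # Hypothetical noise-free loss with its minimum at theta = 3.
    return (theta - 3.0) ** 2

best_plus = evolution_strategy(quadratic, 0.0, 15.0, plus=True)[0]
best_comma = evolution_strategy(quadratic, 0.0, 15.0, plus=False)[0]
print(best_plus, best_comma)
```

The only difference between the two variants is the selection pool in Step 2, so a single `plus` flag is enough to switch between them.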

Schema Theory for GAs

• Key innovation in Holland (1975) is a form of theoretical foundation for GAs based on schemas

– Represents first attempt at serious theoretical analysis

– But not entirely successful, as “leap of faith” required to relate schema theory to actual convergence of GA

• “GAs work by discovering, emphasizing, and recombining good ‘building blocks’ of solutions in a highly parallel fashion.” (Melanie Mitchell, An Introduction to Genetic Algorithms [p. 27], 1996, paraphrasing John Holland)

– Statement above more intuitive than formal


Schema Theory for GAs (cont’d)

• Schema is template for chromosomes in GAs

• Example: [* 1 0 * * * * 1], where the * symbol represents a don’t care (or free) element

– [11001101] is specific instance of this schema

• Schemas sometimes called building blocks of GAs

• Two fundamental results: schema theorem and implicit parallelism

• Schema theorem says that better templates dominate the population as generations proceed

• Implicit parallelism says that GA processes >> N schemas at each iteration

• Schema theory is controversial

– Not connected to algorithm performance in same direct way as usual convergence theory for iterates of algorithm
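
As a small illustration of the template idea, the sketch below checks whether a chromosome is an instance of a schema and enumerates every schema that a single chromosome belongs to. The string encoding and the function names `matches` and `schemas_containing` are assumptions made for the example.

```python
from itertools import product

def matches(schema, chromosome):
    """True if chromosome is an instance of schema ('*' means don't care)."""
    return len(schema) == len(chromosome) and all(
        s == "*" or s == c for s, c in zip(schema, chromosome))

def schemas_containing(chromosome):
    """All schemas that have chromosome as an instance (including all-'*')."""
    choices = [(bit, "*") for bit in chromosome]
    return ["".join(combo) for combo in product(*choices)]

schema = "*10****1"
print(matches(schema, "11001101"))   # the instance given in the slide
print(matches(schema, "10001101"))   # second bit is 0, so not an instance
print(len(schemas_containing("11001101")))  # 2**8 schemas for one chromosome
```

One B-bit chromosome is an instance of 2^B schemas, which is the counting fact behind the claim that a population of N chromosomes carries information about many more than N schemas.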

Convergence Theory via Markov Chains

• Schema theory inadequate

– Mathematics behind schema theory not fully rigorous

– Unjustified claims about implications of schema theory

• More rigorous convergence theory exists

– Pertains to noise-free loss (fitness) measurements

– Pertains to finite representation (e.g., bit coding or floating point representation on digital computer)

• Convergence theory relies on Markov chains

• Each state in chain represents possible population

• Markov transition matrix P contains all information for


GA Markov Chain Model

• GAs with binary bit coding can be modeled as (discrete state) Markov chains

• Recall states in chain represent possible populations

• ith element of probability vector p_k represents probability of achieving ith population at iteration k

• Transition matrix: The (i, j) element of P represents the probability of population i producing population j through the selection, crossover, and mutation operations

– Depends on loss (fitness) function, selection method, and reproduction and mutation parameters

• Given transition matrix P, it is known that

p_{k+1}^T = p_k^T P
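
The recursion p_{k+1}^T = p_k^T P can be illustrated on a toy chain. The 3 × 3 matrix below is hypothetical and is not derived from any GA; in a real GA the states would be populations and P would follow from the selection, crossover, and mutation settings.

```python
def step(p, P):
    """One update p_{k+1}^T = p_k^T P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 3-state chain (toy numbers, not from a GA).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
p = [1.0, 0.0, 0.0]   # p_0: start in state 0 with probability 1

for k in range(50):
    p = step(p, P)
print([round(x, 4) for x in p])
```

For this particular matrix the vector [0.25, 0.5, 0.25] satisfies p^T = p^T P, and the iterates approach it regardless of the starting vector because the toy chain is ergodic.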

Rudolph (1994) and Markov Chain Analysis for Canonical GA

• Rudolph (1994, IEEE Trans. Neural Nets.) uses Markov chain analysis to study “canonical GA” (CGA)

• CGA includes binary bit coding, crossover, mutation, and “roulette wheel” selection

– CGA is focus of seminal book, Holland (1975)

• CGA does not include elitism; lack of elitism is critical aspect of theoretical analysis

• CGA assumes mutation probability 0 < P_m < 1 and single-point crossover probability 0 ≤ P_c ≤ 1

• Key preliminary result: CGA is ergodic Markov chain:


Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

• Ergodicity for CGA provides a negative result on convergence in Rudolph (1994)

• Let \hat{L}_{\min,k} denote lowest of N (= population size) loss values within population at iteration k

– \hat{L}_{\min,k} represents loss value for θ in population k that has maximum fitness value

• Main theorem: CGA satisfies

\lim_{k \to \infty} P(\hat{L}_{\min,k} = L(\theta^*)) < 1

(above limit on left-hand side exists by ergodicity)

• Implies CGA does not converge to the global optimum

Rudolph (1994) and Markov Chain Analysis for CGA (cont’d)

• Fundamental problem with CGA is that optimal solutions are found but then lost

• CGA has no mechanism for retaining optimal solution

• Rudolph discusses modification to CGA yielding positive convergence results

• Appends “super individual” to each population

– Super individual represents best chromosome so far

– Not eligible for GA operations (selection, crossover, mutation)

– Not same as elitism


Contrast of Suzuki (1995) and Rudolph (1994) in Markov Chain Analysis for GA

• Suzuki (1995, IEEE Trans. Systems, Man, and Cyber.) uses Markov chain analysis to study GA with elitism

– Same as CGA of Rudolph (1994) except for elitism

• Suzuki (1995) only considers unique states (populations)

– Rudolph (1994) includes redundant states

• With N = population size and B = no. of bits/chromosome:

\binom{N + 2^B - 1}{N} = \frac{(N + 2^B - 1)!}{(2^B - 1)! \, N!} unique states in Suzuki (1995),

2^{NB} states in Rudolph (1994) (much larger than number of unique states above)

• Above affects bookkeeping; does not fundamentally change relative results of Suzuki (1995) and Rudolph (1994)

Convergence Under Elitism

• In both CGA case (Rudolph, 1994) and case with elitism (Suzuki, 1995) the limit exists:

\bar{p}^T = \lim_{k \to \infty} p_0^T P^k

(dimension of \bar{p} differs according to definition of states, unique or nonunique as on previous slide)

• Suzuki (1995) assumes each population includes one elite element and that crossover probability P_c = 1

• Let \bar{p}_j represent jth element of \bar{p}, and J represent indices j where population j includes chromosome achieving L(θ*)

• Then from Suzuki (1995):

\sum_{j \in J} \bar{p}_j = 1


Calculation of Stationary Distribution

• Markov chain theory provides useful conceptual device

• Practical calculation difficult due to explosive growth of number of possible populations (states)

• Growth is in terms of factorials of N and bit string length (B)

• Practical calculation of p_k usually impossible due to difficulty in getting P

• Transition matrix can be very large in practice

– E.g., if N = B = 6, P is 10^8 × 10^8 matrix!

– Real problems have N and B much larger than 6

• Ongoing work attempts to severely reduce dimension by limiting states to only most important (e.g., Spears, 1999; Moey and Rowe, 2004)
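
The growth in the number of states can be checked directly. The sketch below evaluates the unique-state count C(N + 2^B − 1, N) used for Suzuki (1995) and the count 2^(NB) used for Rudolph (1994); the function names are assumptions made for the example.

```python
from math import comb

def unique_states(n_pop, n_bits):
    """Number of distinct populations (multisets of N chromosomes of B bits)."""
    return comb(n_pop + 2 ** n_bits - 1, n_pop)

def ordered_states(n_pop, n_bits):
    """Number of populations when chromosome order matters: 2^(N*B)."""
    return 2 ** (n_pop * n_bits)

for n_pop, n_bits in [(2, 6), (4, 4), (6, 6)]:
    print(n_pop, n_bits, unique_states(n_pop, n_bits), ordered_states(n_pop, n_bits))
```

With N = B = 6 the unique-state count is 119,877,472, which is the order-10^8 dimension quoted above. The cases (N, B) = (2, 6) and (4, 4) give 2080 and 3876 unique states, in line with the matrix sizes of roughly 2000 and 4000 quoted for Example 10.2.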

Example 10.2 from ISSO: Markov Chain Calculations for Small-Scale Implementation

• Consider L(θ) = [...], θ ∈ Θ = [0, 15]

• Function has local and global minimum; plot on next slide

• Several GA implementations with very small population sizes (N) and numbers of bits (B)

• Small scale implementations imply Markov transition matrices are computable

– But still not trivial, as matrix dimensions range from approximately 2000 × 2000 to 4000 × 4000

 


Loss Function for Example 10.2 in ISSO

Markov chain theory provides probability of finding solution (θ* = 15) in given number of iterations

Example 10.2 (cont’d): Probability Calculations for Very Small-Scale GAs

Probability that GA with elitism produces population containing optimal solution

(Pc = crossover probability, Pm = mutation probability, N = population size, B = bit length)

GA iteration                          0     5     10    20    30    40    50    100   150
Pc = 1.0, Pm = 0.05, N = 2, B = 6     0.03  0.08  0.15  0.32  0.48  0.62  0.74  0.97  1.00
Pc = 1.0, Pm = 0.05, N = 4, B = 4     0.21  0.51  0.69  0.92  1.00  --    --    --    --

Pc = 1.0


Summary of GA Convergence Theory

• Schema theory (Holland, 1975) was most popular method for theoretical analysis until approximately mid-1990s

– Schema theory not fully rigorous and not fully connected to actual algorithm performance

• Markov chain theory provides more formal means of convergence (and convergence rate) analysis

• Rudolph (1994) used Markov chains to provide largely negative result on convergence for canonical GAs

– Canonical GA does not converge to optimum

• Suzuki (1995) considered GAs with elitism; unlike Rudolph (1994), GA is now convergent

• Challenges exist in practical calculation of Markov transition matrix

No Free Lunch Theorems (Reprise, Chap. 1)

• No free lunch (NFL) theorems apply to EC algorithms

– Theorems imply there can be no universally efficient EC algorithm

– Performance of one algorithm when averaged over all problems is identical to that of any other algorithm

• Suppose EC algorithm A applied to loss L

– Let \hat{L}_n denote lowest loss value from most recent N population elements after n ≥ N unique function evaluations

• Consider the probability that \hat{L}_n = ξ after n unique evaluations of the loss:

P(\hat{L}_n = ξ | L, A)
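
The claim that performance averaged over all problems is the same for any algorithm can be checked by brute force on a tiny case. The sketch below is a simplified illustration under stated assumptions: the domain has three points, losses take values 0 or 1, the two search rules `fixed_order` and `adaptive` are hypothetical, and performance is summarized by the best loss seen in n unique evaluations rather than the lowest loss in the most recent population.

```python
from collections import Counter
from itertools import product

DOMAIN = (0, 1, 2)
VALUES = (0, 1)

def fixed_order(loss, n):
    """Algorithm A (hypothetical): evaluate points 0, 1, 2 in order."""
    return [loss[x] for x in DOMAIN[:n]]

def adaptive(loss, n):
    """Algorithm B (hypothetical): start at 2, then let the last loss pick the next point."""
    visited = [2]
    seen = [loss[2]]
    while len(seen) < n:
        remaining = [x for x in DOMAIN if x not in visited]
        nxt = remaining[0] if seen[-1] == 0 else remaining[-1]
        visited.append(nxt)
        seen.append(loss[nxt])
    return seen

def best_value_histogram(algorithm, n):
    """Count, over all loss functions, how often each best-so-far value occurs."""
    counts = Counter()
    for values in product(VALUES, repeat=len(DOMAIN)):
        loss = dict(zip(DOMAIN, values))
        counts[min(algorithm(loss, n))] += 1
    return counts

for n in (1, 2, 3):
    print(n, best_value_histogram(fixed_order, n), best_value_histogram(adaptive, n))
```

Summed over all eight loss functions, the two rules produce the same histogram of best values for every n, even though they can differ on an individual loss function.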