Binary Optimization - Swarm intelligence James Kennedy pdf

A frequently used kind of combinatorial optimization, which we have already seen in NK landscapes and some Hopfield networks, occurs when elements are represented as binary variables. The binary encoding scheme is very useful, for a number of reasons. First of all, as the versatility of the digital computer indicates, almost anything can be represented to any degree of precision using zeroes and ones. A bitstring can represent a base-two number; for instance, the number 10011 represents (going from right to left and increasing the multiplier by powers of two) (1×1)+(1×2)+(0×4)+(0×8)+(1×16)=19. A problem may be set up so that the 19 indicates the 19th letter of the alphabet (e.g., S) or the 19th thing in some list, for instance, the 19th city in a tour. The 19 could be divided by 10 in the evaluation function to represent 1.9, and in fact it can go to any specified number of decimal places. The segment can be embedded in a longer bitstring, for instance, 10010010111011100110110, where positions or sets of positions are evaluated in specific ways.
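The decoding arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function name bits_to_int is our own and does not come from the text.

```python
def bits_to_int(bits: str) -> int:
    """Interpret a string of '0'/'1' characters as a base-two integer."""
    total = 0
    # Walk from right to left: the rightmost bit has multiplier 1,
    # the next one 2, then 4, 8, 16, and so on.
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * (2 ** position)
    return total


value = bits_to_int("10011")        # (1×1)+(1×2)+(0×4)+(0×8)+(1×16)
letter = chr(ord("A") + value - 1)  # the value-th letter of the alphabet
scaled = value / 10                 # the same bits read as a decimal quantity
```

Here value is 19, letter is "S", and scaled is 1.9; the same bits mean different things depending on how the evaluation function chooses to read them.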

Because zero and one are discrete states, they can be used to encode qualitative, nonnumeric variables as well as numeric ones. The sites on the bitstring can represent discrete aspects of a problem, for instance, the presence or absence of some quality or item. Thus the same bitstring 10011 can mean 19, or it can be used to summarize attendance at a meeting where Andy was present, Beth was absent, Carl was absent, Denise was present, and Everett was present. While this flexibility gives binary coding great power, its greatest disadvantage has to do with failure to distinguish between these two kinds of uses of bitstrings, numeric and qualitative. As we will see, real-valued encodings are usually more appropriate for numeric problems, but binary encoding offers many advantages in a wide range of situations.

The size of a binary search space doubles with each element added to the bitstring. Thus there are 2^n points in the search space of a bitstring of n elements. A bitstring of more than three dimensions is conceptualized as a hypercube. To understand this concept, start with a one-dimensional bitstring, that is, one element that is in either the zero or the one state: a bit. This can be depicted as a line segment with a zero at one end and a one at the other (see Figure 2.14). The state of this one-dimensional “system” can be summarized by a point at one end of the line segment. Now to add a dimension, we will place another line segment with its zero end at the same point as the first dimension’s zero, rising perpendicular to the first line segment. Now our system can take on 2^2 possible states: (0,0), (0,1), (1,0), or (1,1). We can plot the first of these states at the origin, and the other points will be seen to mark the corners of a square whose sides are the (arbitrary) length of the line segments we used. The (1,1) point is diagonally opposed to the origin, which is meaningful since it has nothing in common with the (0,0) corner and thus should not be near it. Corners that do share a value are adjacent to one another; those that don’t are opposed.

To add a third dimension, draw a cube extending above the page into space. Now the 2^n points (here 2^3 = 8) are three-dimensional: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), and (1,1,1), and each state of the system can be represented as a corner of a cube. As with the two-dimensional bitstring, points that differ in only one element are connected to one another by an edge of the cube, and points with no elements in common, for instance, (1,0,1) and (0,1,0), are not connected; they sit at diagonally opposite corners. The number of positions in which two bitstrings differ is called the Hamming distance between them, and it is a measure of their similarity/difference as well as the distance between them on the cube. If two points have a Hamming distance of h between them, it will take a minimum of h steps along the edges to travel from one of the corners of the hypercube to the other.
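As a minimal sketch of these ideas, the following Python computes the Hamming distance between two bitstrings and lists the corners one edge away from a given corner. The function names are illustrative choices of ours, not notation from the text.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bitstrings differ."""
    if len(a) != len(b):
        raise ValueError("bitstrings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbors(bits: str) -> list[str]:
    """Return the corners reachable by flipping exactly one bit (one edge away)."""
    flip = {"0": "1", "1": "0"}
    return [bits[:i] + flip[bits[i]] + bits[i + 1:] for i in range(len(bits))]


opposite = hamming_distance("101", "010")  # no elements in common
adjacent = hamming_distance("000", "001")  # differ in a single position
```

Every bitstring returned by neighbors is at Hamming distance 1 from its argument, so a corner of an n-dimensional hypercube always has exactly n neighbors.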

A hypercube has the characteristics of a cube, that is, bitstrings are conceived as corners of the hypercube. It is possible to depict a low-dimensional hypercube (see Figure 2.14), and some insights can be gained from inspecting the graph, but with more dimensions it quickly becomes incomprehensible.

Binary optimization searches for the best corner of the hypercube. Even though the size of binary problems only increases by powers of two, they can get intractable; it is not practical or possible to consider every corner of a high-dimensional hypercube. For instance, a 25-bit problem, which is not an especially big problem, represents a hypercube with 33,554,432 corners. There is obviously a need for methods to reduce the size of the search space, so that we can find a good answer without evaluating every possibility, so we can optimally allocate trials.

A couple of years ago, a hamburger restaurant advertised that they could prepare their hamburgers “more than 1,023 different ways.” Anyone with experience in binary arithmetic would immediately recognize that 1,024 is 2^10, thus indicating how “more than 1,023” came to be important. If you could order your hamburger with or without (1) cheese, (2) lettuce, (3) mayonnaise, (4) pickles, (5) ketchup, (6) onions, (7) mustard, (8) relish, (9) dressing, and (10) sesame seed bun, you could have it more than 1,023 different ways. Each combination of the presence or absence of each item represents a corner on a 10-dimensional hypercube, which could be evaluated. For instance, you might consider a 1110000011, which is a hamburger with cheese, lettuce, mayonnaise, dressing, and sesame seed bun. For me, I prefer corners of the hypercube where pickles = 0, cheese = 1, and mustard = 0. Some notational systems use the “#” sign to mean “don’t care”: you might say I prefer my hamburgers prepared as 1##0##0###. The preferred state (absent or present) of the other ingredients might depend on my mood, how other peo-

Figure 2.14 One-, two-, three-, and four-dimensional binary spaces: lines, squares, cubes, and hypercubes. (The corners in the figure are labeled with the bitstrings 0 and 1; 00 through 11; 000 through 111; and 0000 through 1111.)

not very important. At least I can focus my search in the region of the hypercube that has those values.
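The “#” notation in the hamburger example can be made concrete with a short sketch. The helper matches and the variable names below are hypothetical, introduced only for illustration; the ingredient order follows the numbered list in the example.

```python
from itertools import product


def matches(schema: str, bits: str) -> bool:
    """True if bits agrees with schema at every position that is not '#'."""
    return len(schema) == len(bits) and all(
        s == "#" or s == b for s, b in zip(schema, bits)
    )


# Every corner of the 10-dimensional hypercube, one per hamburger.
all_corners = ["".join(c) for c in product("01", repeat=10)]

# Cheese = 1, pickles = 0, mustard = 0; the other seven sites are "don't care".
preferred = [c for c in all_corners if matches("1##0##0###", c)]
```

Fixing three of the ten sites leaves 2^7 = 128 of the 1,024 corners, so the region of the hypercube that needs searching shrinks by a factor of eight.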

Random and Greedy Searches

What is a rational strategy for searching for a good point in the binary space? We will assume that the space is large, that we don’t want to consider every possible corner on the hypercube. So how could we reduce the size of our search? Let’s start by making up a naive algorithm.

Common sense tells us we need to start somewhere. Our natural sense of order suggests that we start at the extremes, say (assuming a 10-bit problem) 0000000000 or 1111111111. Now, which one of these is a better place to start? We don’t have any way to know. Is it better to picture a bare hamburger and imagine adding things to it or to start with the “everything” burger and remove stuff? We don’t know. If we started at one extreme (say, all zeroes) and the optimum was at the other end, we would end up having to go the entire distance before we found it. The best answer is probably somewhere in the middle. One commonsense strategy is just to pick a point randomly, say, 1001011010.

Since we don’t know anything about our problem, we could try a random search strategy, also called “generate and test.” Random bitstrings are generated, one after the other, each one is evaluated, and the best one is saved. After some prespecified number of iterations, the algorithm stops, and the best bitstring found is taken as the solution. There is no reason to expect the random search strategy to find a very good answer, and in fact it really doesn’t produce very impressive results on big problems.
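A minimal sketch of generate and test follows. The fitness function one_max (the count of ones in the bitstring) is a placeholder assumption of ours, chosen only so that the example runs; any function that scores a bitstring could be substituted.

```python
import random


def one_max(bits: list[int]) -> int:
    """Hypothetical stand-in fitness: the number of ones in the bitstring."""
    return sum(bits)


def random_search(fitness, n_bits: int, iterations: int, rng: random.Random):
    """Generate random bitstrings one after another and keep the best one seen."""
    best, best_fit = None, float("-inf")
    for _ in range(iterations):
        candidate = [rng.randint(0, 1) for _ in range(n_bits)]
        fit = fitness(candidate)
        if fit > best_fit:
            best, best_fit = candidate, fit
    return best, best_fit


rng = random.Random(0)  # seeded only to make the run repeatable
best, best_fit = random_search(one_max, n_bits=10, iterations=200, rng=rng)
```

Nothing learned from one candidate informs the next, which is why the method scales so poorly as the bitstring grows.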

Another method that won’t work especially well is to take a random walk on the landscape. We could generate a bitstring solution, flip one bit chosen at random, evaluate the new bitstring, flip another bit, evaluate that, and so on. Such a search can be imagined as movement around the hypercube, going from corner to adjacent corner, one step at a time. If we were looking for an optimum, we could just store the best solution found so far and use it after the walk was finished. As far as optimization goes, a random walk is just an exploitational random search, focusing in one region of the search space.
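The random walk can be sketched the same way. As before, one_max is only an assumed placeholder fitness and the names are ours.

```python
import random


def one_max(bits: list[int]) -> int:
    """Hypothetical stand-in fitness: the number of ones in the bitstring."""
    return sum(bits)


def random_walk(fitness, start: list[int], steps: int, rng: random.Random):
    """Move from corner to adjacent corner, remembering the best corner visited."""
    current = list(start)
    best, best_fit = list(current), fitness(current)
    for _ in range(steps):
        i = rng.randrange(len(current))
        current[i] = 1 - current[i]  # flip one randomly chosen bit
        fit = fitness(current)
        if fit > best_fit:
            best, best_fit = list(current), fit
    return best, best_fit


rng = random.Random(1)  # seeded only to make the run repeatable
start = [rng.randint(0, 1) for _ in range(10)]
best, best_fit = random_walk(one_max, start, steps=100, rng=rng)
```

Note that the walk itself never uses the fitness values to decide where to go next; they are consulted only to remember the best corner it happened to pass through.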

One thing that sometimes works is the so-called greedy approach that we tried with NK landscapes. Start by generating a bitstring and evaluating it. Then select the first bit on the bitstring, flip it, and evaluate the new bitstring. If flipping the bit resulted in improved performance, the “flipped” value is kept, or else it is set back to what it was. The greedy algorithm goes through the entire bitstring in this way, flipping, evaluating, and choosing, once for every position, 10 times if N = 10. This algorithm will result in improvement unless the search happened to randomly start on a really good corner. What the greedy algorithm doesn’t do is to capitalize on patterns of bits that work together. The difficulty of this can be gleaned from the discussion of NK landscapes; when K is high, the effect of a site depends on the states of many other sites. The greedy algorithm may be a reasonable, quick first-guess approach to optimizing a small or simple system and will probably work fine where epistasis, or interaction among elements, is weak.
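One pass of the greedy procedure might look like the sketch below. The function name greedy_pass is ours, and the fitness used in the example (Python’s built-in sum, which counts ones) is an assumption chosen because it has no interaction among sites.

```python
def greedy_pass(fitness, bits: list[int]):
    """Flip each bit once, left to right, keeping a flip only if fitness improves."""
    current = list(bits)
    current_fit = fitness(current)
    for i in range(len(current)):
        current[i] = 1 - current[i]      # try the flipped value
        fit = fitness(current)
        if fit > current_fit:
            current_fit = fit            # improvement: keep the flip
        else:
            current[i] = 1 - current[i]  # no improvement: set it back
    return current, current_fit


# With no epistasis (each site contributes independently), one pass is enough.
result, result_fit = greedy_pass(sum, [0, 1, 0, 0, 1])
```

On this site-independent fitness the pass ends at [1, 1, 1, 1, 1]. When the contribution of one site depends on others, a single left-to-right pass can settle on a bitstring that no single flip improves yet that is far from the best corner.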

Hill Climbing

A hill-climbing search strategy is a kind of random walk that modifies a pattern, accepts the changes if they result in improvement, and then tries changes to the new pattern (Kauffman calls this an “adaptive random walk”). A good way to do this on a binary landscape is to set a probability of flipping a bit, say, 0.10; this will be called a probability threshold. Then at every position on the bitstring a random number between 0.0 and 1.0 is generated; if it is less than 0.10, that bit is flipped, and if it is not, it stays as it was. Once we have gone through the entire bitstring, we evaluate the new pattern. If its fitness is better than the best so far, then we keep this bitstring. On the next iteration, we stochastically flip the bits on this new bitstring. If this leads to improvement, we keep the result; otherwise we retain the previous pattern, so fitness is always improving, as long as it can.
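The hill climber just described can be sketched as follows. The per-bit flip probability of 0.10 comes from the text; the fitness function one_max and all names are placeholder assumptions of ours.

```python
import random


def one_max(bits: list[int]) -> int:
    """Hypothetical stand-in fitness: the number of ones in the bitstring."""
    return sum(bits)


def hill_climb(fitness, start, iterations: int, flip_prob: float, rng: random.Random):
    """Stochastically flip bits of the best pattern; keep a change only if it improves."""
    best = list(start)
    best_fit = fitness(best)
    for _ in range(iterations):
        # At every position draw a number in [0, 1); flip the bit if it is
        # below the probability threshold, otherwise leave the bit alone.
        candidate = [1 - b if rng.random() < flip_prob else b for b in best]
        fit = fitness(candidate)
        if fit > best_fit:
            best, best_fit = candidate, fit
    return best, best_fit


rng = random.Random(42)  # seeded only to make the run repeatable
start = [rng.randint(0, 1) for _ in range(20)]
best, best_fit = hill_climb(one_max, start, iterations=500, flip_prob=0.10, rng=rng)
```

Because a candidate replaces the current pattern only when it is strictly better, the stored fitness can never decrease from one iteration to the next.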

It is easy to see why this is called hill climbing and also why it is not a perfect method. If the mutation rate, or proportion of bits flipped, is small, for instance, where an average of one bit per iteration or less is flipped, then a hill climber will climb to the top of whatever hill it started on; it will improve until it reaches the top of that hill. Then it’s stuck. This is called finding a local optimum, and while a hill climber is guaranteed to find the top of the hill eventually, we really want to find the global optimum, the highest peak on the entire landscape. (Hill-climbing terminology assumes we are maximizing something, but the same concepts hold for minimization problems.)

Increasing the mutation rate would enable the hill climber to jump from one hill to another, maybe higher, one, but it probably would not climb that hill very well; it would be as likely to jump to another hill as to climb the one it’s on. When the Hamming distance between states of a hill climber on successive iterations is large, the algorithm can cover large areas of the landscape and might find better regions than it would have if it had only climbed its original hill. Here is the trade-off between exploration and exploitation. At one extreme, the random search algorithm explores perfectly well, testing one region of the landscape after the other, but fails miserably to zero in on the peaks in those regions. At the opposite extreme, the hill climber is excellent for zeroing in on the best point in its area, but fails entirely to explore new regions.

In fact random search and hill climbing can be seen as two variations on the same theme. Random search, where a new bitstring is randomly generated at every time step, can be variously considered as new individuals being created every time, or it can be considered as successive generations or states of the same individual. In this view, if each bit is determined randomly, it will appear that about half the individual’s bits change state each time, on the average. In comparison, if you took a hill climber with a low mutation rate, say, 1/n, where n is the number of bits in the string (guaranteed to find a local optimum), and increased the mutation rate to 0.5, you would have exactly the search algorithm we have been calling random search. Thus, from at least this point of view, the difference between exploration and exploitation is the mutation rate, or size of steps through the search space.
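The equivalence claimed here can be checked with a little arithmetic, sketched below under the assumption that each bit is flipped independently with the same probability. Both function names are ours.

```python
def prob_one_after_mutation(bit: int, flip_prob: float) -> float:
    """Probability that a site holds 1 after an independent flip attempt."""
    # The site ends up 1 if it was 0 and flipped, or was 1 and did not flip.
    return flip_prob * (1 - bit) + (1 - flip_prob) * bit


def expected_flips(n_bits: int, flip_prob: float) -> float:
    """Expected Hamming distance between successive states."""
    return n_bits * flip_prob


n = 20
low_rate_step = expected_flips(n, 1 / n)   # about one bit per iteration
high_rate_step = expected_flips(n, 0.5)    # about half the bits per iteration
```

At a rate of 0.5 the probability of a one is 0.5 whether the site previously held a zero or a one, so the new bitstring carries no trace of the old one, which is exactly random search.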

Simulated Annealing

One of the more powerful approaches to binary optimization is known as simulated annealing. Simulated annealing is based on the metaphor of molecules cooling into a crystalline pattern after being heated. In a molten metal the molecules move chaotically, and as the metal cools they begin to find patterns of connectivity with neighboring molecules, until they cool into a nice orderly pattern: an optimum. Simulated annealing takes the basic idea of hill climbing and adds to it a stochastic decision and a cooling schedule. A bitstring is modified by flipping randomly selected bits, and if the modified bitstring performs better, it replaces the original. If the modified bitstring performs worse, though, it can still be accepted if a probability test is passed. The probability threshold is a function of the system’s “temperature,” which decreases over time. Thus the probability of accepting a poorer problem solution decreases as the system cools; the effect of this is that the algorithm roams over wide areas of the search space in the early iterations, bouncing into and out of locally optimal regions. Later in the search the algorithm will be focused on the more promising regions of the problem space.
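The text does not specify the acceptance test or the cooling schedule, so the sketch below fills them in with common choices that should be read as assumptions: a worse candidate is accepted with probability exp(delta / temperature), the temperature is multiplied by a constant factor each iteration, and only one randomly chosen bit is flipped per step. The fitness function one_max is again a placeholder.

```python
import math
import random


def one_max(bits: list[int]) -> int:
    """Hypothetical stand-in fitness: the number of ones in the bitstring."""
    return sum(bits)


def simulated_annealing(fitness, start, iterations, start_temp, cooling, rng,
                        min_temp=1e-12):
    """Hill climbing plus a temperature-dependent chance of accepting worse moves."""
    current = list(start)
    current_fit = fitness(current)
    best, best_fit = list(current), current_fit
    temp = start_temp
    for _ in range(iterations):
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]  # simplification: flip a single bit
        fit = fitness(candidate)
        delta = fit - current_fit        # negative when the candidate is worse
        # Assumed acceptance rule: always take improvements; take a worse
        # candidate with probability exp(delta / temp), which shrinks as temp falls.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_fit = candidate, fit
            if current_fit > best_fit:
                best, best_fit = list(current), current_fit
        temp = max(temp * cooling, min_temp)  # assumed geometric cooling schedule
    return best, best_fit


rng = random.Random(7)  # seeded only to make the run repeatable
start = [rng.randint(0, 1) for _ in range(20)]
best, best_fit = simulated_annealing(one_max, start, 1000, 2.0, 0.99, rng)
```

Early on, while the temperature is high, the exponential is close to 1 and almost any move is accepted; late in the run it is close to 0 and the procedure behaves like the hill climber.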

The salient lesson to be learned from simulated annealing is that sometimes you have to choose to do worse in the short run to do better in the long run. It may seem counterproductive to accept a bad solution, but it is necessary in order to make sure that the search moves beyond local optima, increasing the chances of finding global optima, or at least better local optima. Simulated annealing explores early in the experiment and exploits later, an approach that has been shown to be quite successful for many kinds of problems.

There is another, subtler, lesson to be learned here, too, having to do with the usefulness of metaphor for understanding abstract mathematical systems. There is nothing about the procedure of the simulated annealing algorithm that ties it inextricably to the cooling of molecules in a crystal lattice. The algorithm can be written in abstract algebraic symbols, just like any other algorithmic process. But it was the metaphor of annealing that allowed the conceptualization and creation of the method in the first place, and it is the metaphor that enables people to understand how it works: it would be much more difficult to convey the process through purely abstract symbols. In discussions of some of the paradigms in this book, there may be ambiguity and even confusion between computer programs that are intended to simulate real processes and computer programs that are explained in terms of real processes. Metaphor is the other side of simulation, and considering the two together gives insights into both. There is a computer program, and there is something in the world, and we perceive some degree of resemblance between them. A simulation is a program that helps us understand something about the world, but sometimes the world simply offers a template that helps us understand a program. The world and the program are two systems whose contours are parallel in some relevant way. In one view they are separate and different things, with imagined similarities; in another view they are two ways of doing the same thing. On the one hand, we say a computer program is like a process in the physical world, and on the other, it is often insightful to recognize that the physical world is like a computer program. Beings in computers evolve, and computers think.

Binary and Gray Coding

Flipping bits increments a numeric bitstring with a certain lack of subtlety and grace, especially as binary encoding of real numbers introduces Hamming cliffs into the landscape. Imagine some parameters were being adjusted, getting close to a solution, and perhaps needed to move only one unit (for instance, from 15 to 16) in order to get to the optimum. Though this is a step of only one unit on a number line, in binary code it requires a jump from 01111 to 10000: this is a Hamming distance of 5, the maximum, as all five bits have to be flipped. A binary algorithm searching for a quantity (for instance, a hill climber) would encounter an obstacle at these cliffs and might never make the complete reversal at all positions necessary to improve the bitstring’s fitness.
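The cliff between 15 and 16 is easy to verify directly. In this sketch the helper names are ours; format with a "b" specifier is the standard Python way to obtain a zero-padded binary string.

```python
def to_bits(value: int, width: int) -> str:
    """Standard base-two encoding, zero-padded on the left to a fixed width."""
    return format(value, f"0{width}b")


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bitstrings differ."""
    return sum(x != y for x, y in zip(a, b))


cliff = hamming_distance(to_bits(15, 5), to_bits(16, 5))   # 01111 vs. 10000
gentle = hamming_distance(to_bits(16, 5), to_bits(17, 5))  # 10000 vs. 10001
```

Both pairs are one unit apart on the number line, yet the first requires all five bits to change while the second requires only one.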

Gray coding overcomes this impediment while retaining the advantages of binary operations. The challenge is to devise a scheme, using zeroes and ones, to encode integers where the Hamming distance between
