Four Introductory Examples of Genetic Programming

7.3 Simple Symbolic Regression

As a third illustration of genetic programming, consider a simple form of the problem of symbolic regression (symbolic function identification).

In linear regression, one is given a set of values of various independent variable(s) and the corresponding values for the dependent variable(s). The goal is to discover a set of numerical coefficients for a linear combination of the independent variable(s) that minimizes some measure of error (such as the square root of the sum of the squares of the differences) between the given values and computed values of the dependent variable(s). Similarly, in quadratic regression the goal is to discover a set of numerical coefficients for a quadratic expression that minimizes error. In Fourier "regression," the goal is to discover a set of numerical coefficients for various harmonics of the sine and cosine functions that minimizes error.

Of course, it is left to the researcher to decide whether to do a linear regression, a quadratic regression, a higher-order polynomial regression, or whether to try to fit the data points to some non-polynomial family of functions. But often, the issue is deciding what type of function most appropriately fits the data, not merely computing the numerical coefficients after the type of function for the model has already been chosen. In other words, the real problem is often both the discovery of the correct functional form that fits the data and the discovery of the appropriate numeric coefficients that go with that functional form.

We call the problem of finding a function, in symbolic form, that fits a given finite sample of data symbolic regression. It is "data-to-function" regression. The desirability of doing regression without specifying in advance the functional form of the eventual solution was recognized by Dallemand (1958), Westervelt (1960), and Collins (1968).

For example, suppose we are given a sampling of the numerical values from a target curve over 20 points in some domain, such as the real interval [-1.0, +1.0]. That is, we are given a sample of data in the form of 20 pairs (x_i, y_i), where x_i is a value of the independent variable in the interval [-1.0, +1.0] and y_i is the associated value of the dependent variable. The 20 values of x_i were chosen at random in the interval [-1.0, +1.0]. For example, these 20 pairs (x_i, y_i) might include pairs such as (-0.40, -0.2784), (+0.25, +0.3320), ..., and (+0.50, +0.9375).

These 20 pairs (x_i, y_i) are the fitness cases that will be used to evaluate the fitness of any proposed S-expression.
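The construction of the fitness cases can be sketched in a few lines of Python. This is only an illustration: the names `target` and `make_fitness_cases` and the fixed seed are assumptions of the sketch, and the quartic used here is the target function that this section names later. The sample pairs quoted above are consistent with it.

```python
import random

def target(x):
    """Target quartic x^4 + x^3 + x^2 + x (the curve this section samples)."""
    return x**4 + x**3 + x**2 + x

def make_fitness_cases(n=20, lo=-1.0, hi=1.0, seed=0):
    """Draw n random x values in [lo, hi] and pair each with its target value.

    The seed is a hypothetical choice made only so the sketch is repeatable.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, target(x)) for x in xs]

cases = make_fitness_cases()
for x, y in cases[:3]:
    print(f"({x:+.2f}, {y:+.4f})")
```

Any other sampling scheme that yields 20 points in the interval would serve equally well; the genetic search sees only the resulting pairs, never the formula inside `target`.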

The goal is to find a function, in symbolic form, that is a good or a perfect fit to the 20 pairs of numerical data points. The solution to this problem of finding a function in symbolic form that fits a given sample of data can be viewed as a search for a mathematical expression (S-expression) from a space of possible S-expressions that can be composed from a set of available functions and terminals.

The first major step in preparing to use genetic programming is to identify the set of terminals. In the cart centering problem, the computer program (which was called a control strategy) processed information about the current state of the system in order to generate a control variable to drive the future state of the system to a specified target state. In the artificial ant problem, the computer program processed information about whether food was present immediately in front of the ant in order to move the ant around the grid. In this problem, the information that the mathematical expression must process is the value of the independent variable X. Thus, the terminal set is

T = {X}.

The second major step in preparing to use genetic programming is to identify the set of functions that are used to generate the mathematical expressions that attempt to fit the given finite sample of data. If we wanted to use our knowledge that the answer is x^4 + x^3 + x^2 + x, a function set consisting only of the addition and multiplication operations would be sufficient for this problem. A more general choice might be the function set consisting of the four ordinary arithmetic operations of addition, subtraction, multiplication, and the protected division function %. If we want the possibility of creating a wider variety of expressions and solving a wider variety of problems, we might also include the sine function SIN, the cosine function COS, the exponential function EXP, and the protected logarithm function RLOG (described in subsection 6.1.1). If we accept the above reasons for selecting the function set, then the function set F for this problem consists of eight functions (six of which are extraneous to the immediate problem) and is

F = {+, -, *, %, SIN, COS, EXP, RLOG}.
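A minimal Python sketch of this function set follows. It assumes the usual definitions of the protected functions: % returns 1 when the denominator is 0, and RLOG returns 0 for an argument of 0 and otherwise the logarithm of the absolute value of its argument. These definitions are not restated in this section, so they are assumptions here. The clamp inside `pexp` and the names `pdiv`, `rlog`, `pexp`, and `FUNCTIONS` are hypothetical choices made for the sketch.

```python
import math

def pdiv(a, b):
    """Protected division %: returns 1.0 when the denominator is 0 (assumed definition)."""
    return 1.0 if b == 0 else a / b

def rlog(a):
    """Protected logarithm RLOG: 0.0 for argument 0, else log(|a|) (assumed definition)."""
    return 0.0 if a == 0 else math.log(abs(a))

def pexp(a, limit=700.0):
    """EXP with its argument clamped at a hypothetical limit so math.exp cannot overflow."""
    return math.exp(min(a, limit))

# Each entry maps a function name to (callable, number of arguments).
FUNCTIONS = {
    "+": (lambda a, b: a + b, 2),
    "-": (lambda a, b: a - b, 2),
    "*": (lambda a, b: a * b, 2),
    "%": (pdiv, 2),
    "SIN": (math.sin, 1),
    "COS": (math.cos, 1),
    "EXP": (pexp, 1),
    "RLOG": (rlog, 1),
}

print(pdiv(1.0, 0.0), rlog(0.0))
```

The point of the protected versions is closure: any composition of these eight functions over the terminal X returns a number, so every randomly generated S-expression can be evaluated on every fitness case.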

The third major step in preparing to use genetic programming is to identify the fitness measure. The raw fitness for this problem is the sum, taken over the 20 fitness cases, of the absolute value of the difference (error) between the value in the real-valued range space produced by the S-expression for a given value of the independent variable x_i and the correct y_i in the range space. The closer this sum of errors is to 0, the better the computer program. Error-based fitness is the most common measure of fitness used in this book. Standardized fitness is equal to raw fitness for this problem.

The hits measure for this problem counts the number of fitness cases for which the numerical value returned by the S-expression comes within a small tolerance (called the hits criterion) of the correct value. For example, the hits criterion might be 0.01. In monitoring runs, hits is a much more intuitive measure than fitness. The fact that an S-expression in the population comes within 0.01 of the target value y_i of the dependent variable for a number of points gives an immediate picture of the progress of a run.
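The two measures just described can be sketched as follows. The function names, the five illustrative fitness cases, and the deliberately imperfect candidate `x^2 + x` are assumptions made for the example. Whether "within" the hits criterion means `<` or `<=` is not stated in the text, so the sketch picks `<=`.

```python
def raw_fitness(program, cases):
    """Sum over all fitness cases of |program(x_i) - y_i|; 0 means a perfect fit."""
    return sum(abs(program(x) - y) for x, y in cases)

def hits(program, cases, criterion=0.01):
    """Number of fitness cases where the program comes within the hits criterion."""
    return sum(1 for x, y in cases if abs(program(x) - y) <= criterion)

def target(x):
    return x**4 + x**3 + x**2 + x

# Five illustrative fitness cases (the problem itself uses 20 random points).
cases = [(x, target(x)) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def candidate(x):
    """A hypothetical, imperfect candidate: it lacks the cubic and quartic terms."""
    return x**2 + x

print(raw_fitness(candidate, cases), hits(candidate, cases))
print(raw_fitness(target, cases), hits(target, cases))
```

Because standardized fitness equals raw fitness for this problem, no further transformation is needed before comparing individuals.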

Table 7.4 summarizes the key features of the simple symbolic regression problem with the target function of x^4 + x^3 + x^2 + x.

Table 7.4 Tableau for the simple symbolic regression problem.

Objective: Find a function of one independent variable and one dependent variable, in symbolic form, that fits a given sample of 20 (x_i, y_i) data points, where the target function is the quartic polynomial x^4 + x^3 + x^2 + x.

Terminal set: X (the independent variable).

Function set: +, -, *, %, SIN, COS, EXP, RLOG.

Fitness cases: The given sample of 20 data points (x_i, y_i), where the x_i come from the interval [-1, +1].

Raw fitness: The sum, taken over the 20 fitness cases, of the absolute value of the difference between the value of the dependent variable produced by the S-expression and the target value y_i of the dependent variable.

Standardized fitness: Equals raw fitness for this problem.

Hits: Number of fitness cases for which the value of the dependent variable produced by the S-expression comes within 0.01 of the target value y_i of the dependent variable.

Wrapper: None.

Parameters: M = 500. G = 51.

Success predicate: An S-expression scores 20 hits.

Predictably, the initial population of random S-expressions includes a wide variety of highly unfit S-expressions. In one run, the worst-of-generation individual in generation 0 was the S-expression

(EXP (- (% X (- X (SIN X))) (RLOG (RLOG (* X X))))).

The sum of the absolute values of the differences between this worst-of-generation individual and the 20 data points (i.e., the raw fitness) was about 10^38.

The median individual in the initial random population was (COS (COS (+ (- (* X X) (% X X)) X))), which is equivalent to

cos[cos(x^2 + x - 1)].

The sum of the absolute values of the differences between this median individual and the 20 data points was 23.67.

Figure 7.22 shows a graph in the interval [-1, +1] of this median individual from generation 0 and a graph of the target quartic curve x^4 + x^3 + x^2 + x. The distance between the curve for this median individual and the target curve averaged about 1.2 units over the 20 fitness cases. Although this curve is not particularly close to the target curve, its raw fitness is considerably better than 10^38.

Figure 7.22 Median individual from generation 0 compared to target quartic curve x^4 + x^3 + x^2 + x for the simple symbolic regression problem.

Figure 7.23 Second-best individual from generation 0 compared to target quartic curve x^4 + x^3 + x^2 + x for the simple symbolic regression problem.

The second-best individual in the initial random population, when simplified, was

x + [RLog 2x + x] * [Sin 2x + Sin x^2].

The sum of the absolute values of the differences between this second-best individual and the target values over the 20 fitness cases was 6.05. That is, its raw fitness was 6.05.

Figure 7.23 shows the curve for this second-best individual and the target curve. This second-best curve is considerably closer to the target curve than the median individual above. The average distance between the curve for this second-best individual and the target curve over the 20 points was about 0.3 per fitness case.

The best-of-generation individual in the population at generation 0 was the following S-expression with 19 points:

(* X (+ (+ (- (% X X) (% X X)) (SIN (- X X))) (RLOG (EXP (EXP X))))).

This S-expression is equivalent to xe^x.

Figure 7.24 Best-of-generation individual from generation 0 compared to target quartic curve x^4 + x^3 + x^2 + x for the simple symbolic regression problem.

The raw fitness for this best-of-generation individual was 4.47.

Figure 7.24 shows the curve for this best-of-generation individual and the target curve. The average distance between the curve for this best-of-generation individual and the target curve over the 20 points is about 0.22 per fitness case. As can be seen, this best-of-generation individual is considerably closer to the target curve than the second-best individual above.

The best-of-generation individual from the initial random population (namely xe^x) produced a value that came within the hits criterion (0.01 for this problem) of the correct value of the target curve for two of the 20 fitness cases. That is, it scored two hits. All the other individuals of generation 0 scored no hits or only one hit.

Although xe^x is not a particularly good fit (much less a perfect fit) to the target curve, this individual is nonetheless visibly better than the worst individual in the initial random population, the median individual, and the second-best individual. When graphed, xe^x bears some similarity to the target curve x^4 + x^3 + x^2 + x. First, both xe^x and x^4 + x^3 + x^2 + x are zero when x is 0. The exact agreement of the two curves at the origin accounts for one of the two hits scored by xe^x, and the closeness of the two curves for another value of x near 0 accounts for the second hit. Second, when x approaches +1.0, xe^x approaches 2.7, while x^4 + x^3 + x^2 + x approaches the somewhat nearby value of 4.0. Also, when x is between 0.0 and about -0.7, xe^x and x^4 + x^3 + x^2 + x are very close.

Table 7.5 contains a simplified calculation that further illustrates the above. In this simplified calculation, we use only five equally spaced x_i points in the interval [-1, +1], instead of 20 randomly generated points. These five values of x_i are shown in row 1 of this table.

Row 2 shows the value of the best-of-generation individual y = xe^x from generation 0 for the five values of x_i. Row 3 shows the target data T representing the target curve x^4 + x^3 + x^2 + x. Row 4 shows the absolute value of the difference between the target data T and the value of the best-of-generation individual y = xe^x from generation 0. The sum of the five items in row 4 (i.e., the raw fitness) is 1.772. If this raw fitness were zero, the function y on row 2 would be a perfect fit to the given data on row 3.

Table 7.5 Simplified presentation of the simple symbolic regression problem with only five fitness cases.

1   x_i         -1.0     -0.5     0.0      +0.5     +1.0
2   y = xe^x    -0.368   -0.303   0.000    0.824    2.718
3   T            0.000   -0.312   0.000    0.938    4.000
4   |T - y|      0.368    0.009   0.000    0.113    1.282
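The four rows of this simplified calculation can be recomputed directly. The short Python sketch below is only a check of the arithmetic; the variable names are chosen for the example, and the values are printed to three decimal places.

```python
import math

# Row 1: five equally spaced points in [-1, +1].
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Row 2: the best-of-generation individual from generation 0, y = x * e^x.
y = [x * math.exp(x) for x in xs]

# Row 3: the target curve x^4 + x^3 + x^2 + x.
T = [x**4 + x**3 + x**2 + x for x in xs]

# Row 4: absolute error per fitness case; its sum is the raw fitness.
err = [abs(t - v) for t, v in zip(T, y)]
raw = sum(err)

for label, row in (("x_i", xs), ("y", y), ("T", T), ("|T - y|", err)):
    print(f"{label:8s}", "  ".join(f"{v:7.3f}" for v in row))
print(f"raw fitness = {raw:.3f}")
```

Only the two points nearest the origin fall within the 0.01 hits criterion, which matches the two hits reported for this individual on the full set of 20 fitness cases.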

By generation 2, the best-of-generation individual in the population was the S-expression with 23 points

(+ (* (* (+ X (* X (* X (% (% X X) (+ X X)))))

which is equivalent to

x^4 + 1.5x^3 + 0.5x^2 + x.

The raw fitness of this best-of-generation individual improved to 2.57 for generation 2 (as compared to 4.47 from generation 0). This is an average of about 0.13 per fitness case. This best-of-generation individual from generation 2 scored five hits as compared to only two hits for the best-of-generation individual from generation 0.

This best-of-generation individual from generation 2 bears a greater similarity to the target function than any of its predecessors. It is, for example, a polynomial. Moreover, it is a polynomial of the correct order (i.e., 4). Moreover, the coefficients of two of its four terms (its quartic term and its linear term) are already correct. In addition, the incorrect coefficients (1.5 for the cubic term and 0.5 for the quadratic term) are not too different from the correct coefficients (1.0 and 1.0).

Before we proceed farther, notice that even though no numerical coefficients were explicitly provided in the terminal set, genetic programming automatically created the rational coefficient 0.5 for the quadratic term x^2 by first creating 1/(2x) (by dividing x/x = 1 by x + x = 2x) and then multiplying by x. The rational coefficient 1.5 for the cubic term x^3 was created similarly.
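The constant-creation step can be written out explicitly. The derivation below follows only the subexpression `(+ X (* X (* X (% (% X X) (+ X X)))))` that is visible in the S-expression quoted above, and it assumes x is nonzero so that the protected division % behaves as ordinary division.

```latex
\begin{align*}
\texttt{(\% X X)} &= \frac{x}{x} = 1, \\
\texttt{(\% (\% X X) (+ X X))} &= \frac{1}{x + x} = \frac{1}{2x}, \\
\texttt{(* X (\% (\% X X) (+ X X)))} &= x \cdot \frac{1}{2x} = \frac{1}{2}, \\
\texttt{(* X (* X \ldots))} &= x \cdot \frac{1}{2} = 0.5\,x, \\
\texttt{(+ X (* X (* X \ldots)))} &= x + 0.5\,x = 1.5\,x.
\end{align*}
```

Both the constant 0.5 and the constant 1.5 thus arise from the single terminal X; further multiplications by x raise these terms to the powers at which they appear in the polynomial.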

Figure 7.25 shows, by generation, the standardized fitness of the best-of-generation individual, the worst-of-generation individual, and the average individual in the population between generations 0 and 34 of one run of the symbolic regression problem. Because of the large magnitudes of standardized fitness for the worst-of-generation individual and the average individual in the population, a logarithmic scale is used on the vertical axis of this figure. As can be seen, the standardized fitness of the best-of-generation individual generally improves (i.e., decreases) and trends toward the horizontal line representing the near-zero value of 10^-6.

Figure 7.25 Fitness curves for the simple symbolic regression problem.

By generation 34, the sum of the absolute values of the differences between the best-of-generation individual and the target curve x^4 + x^3 + x^2 + x over the 20 fitness cases reached 0.0 for the first time in this run. This individual, of course, also scored 20 hits. This best-of-generation individual for generation 34 was the following S-expression containing 20 points:

(+ X (* (+ X (* (* (+ X (- (COS (- X X)) (- X X))) X) X)) X)).

Note that the cosine term (COS (- X X)) evaluates merely to 1.0. This entire S-expression is equivalent to x^4 + x^3 + x^2 + x, which is, of course, the target curve.
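The equivalence can be checked mechanically with a small S-expression evaluator. The sketch below is an illustration, not the implementation used for the runs reported here: S-expressions are encoded as nested Python tuples, and the protected functions % and RLOG are given their usual definitions (1 on division by zero; 0 for an argument of 0, otherwise the logarithm of the absolute value), which is an assumption since this section does not restate them. The names `evaluate`, `count_points`, `GEN0_BEST`, and `BEST_OF_RUN` are hypothetical.

```python
import math

def pdiv(a, b):
    """Protected division %: 1.0 when the denominator is 0 (assumed definition)."""
    return 1.0 if b == 0 else a / b

def rlog(a):
    """Protected logarithm RLOG: 0.0 at 0, else log(|a|) (assumed definition)."""
    return 0.0 if a == 0 else math.log(abs(a))

OPS = {
    "+": lambda a, b: a + b, "-": lambda a, b: a - b,
    "*": lambda a, b: a * b, "%": pdiv,
    "SIN": math.sin, "COS": math.cos, "EXP": math.exp, "RLOG": rlog,
}

def evaluate(expr, x):
    """Recursively evaluate a nested-tuple S-expression at the point x."""
    if expr == "X":
        return x
    op, *args = expr
    return OPS[op](*(evaluate(a, x) for a in args))

def count_points(expr):
    """Count functions plus terminals, i.e. the 'points' of an S-expression."""
    if expr == "X":
        return 1
    return 1 + sum(count_points(a) for a in expr[1:])

# Best-of-generation individual from generation 0, as quoted earlier in this section.
GEN0_BEST = ("*", "X", ("+", ("+", ("-", ("%", "X", "X"), ("%", "X", "X")),
                              ("SIN", ("-", "X", "X"))),
                        ("RLOG", ("EXP", ("EXP", "X")))))

# Best-of-run individual from generation 34.
BEST_OF_RUN = ("+", "X", ("*", ("+", "X", ("*", ("*", ("+", "X",
               ("-", ("COS", ("-", "X", "X")), ("-", "X", "X"))), "X"), "X")), "X"))

for x in (-1.0, -0.4, 0.25, 0.5):
    target = x**4 + x**3 + x**2 + x
    print(x, evaluate(GEN0_BEST, x), evaluate(BEST_OF_RUN, x), target)
```

Up to floating-point rounding, `BEST_OF_RUN` reproduces the target at every point tried, while `GEN0_BEST` tracks xe^x; `count_points` gives the point counts quoted in the text for both expressions.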

Figure 7.26 graphically depicts this 100%-correct best-of-run individual from generation 34.

The best-of-run S-expression obtained in generation 34 has 20 points. There were varying numbers of points in the best-of-generation S-expression for the various intermediate generations (e.g., 19 points for generation 0 and 23 points for generation 2). We did not specify that the solution would have 20 points, nor did we specify the shape or the particular content of the S-expression that emerged in generation 34. The size, shape, and content of the S-expression that solves this problem evolved in response to the selective pressure exerted by the fitness (error) measure.

The function we discovered is complete in the sense that it is defined for any point in the original interval [-1, +1]. Thus, this discovered function can be viewed as a model of the process that produced the 20 observed data points (i.e., the 20 fitness cases). The discovered function can be used to give a value of the dependent variable (i.e., y) for any value of the independent variable (i.e., x) in the interval if one accepts this discovered model. As it happens, the discovered function is also well defined beyond the original interval [-1, +1]; in fact, it is well defined for any real value of x. Thus, the discovered function can be used to forecast the value of the dependent variable (i.e., y) for any real value of the independent variable (i.e., x) if one accepts this discovered model.

Figure 7.26 100%-correct best-of-run individual for the simple symbolic regression problem.

Although all 20 pairs of observed data (x_i, y_i) for this particular example were consistent and noncontradictory, the symbolic regression problem would have proceeded in an identical fashion even if two different values of the dependent variable (i.e., y) happened to be associated with one particular value of the independent variable (i.e., x). In such a case of noisy data, one would not expect the error (i.e., raw fitness) ever to reach 0 and would not expect 20 hits.

The best-of-run individual shown above employed the functions +, -, *, and COS, but did not employ %, SIN, EXP, and RLOG. That is, four of the eight primitive functions in the function set were extraneous for the actual best-of-run individual. In other runs of this problem, we have obtained a solution using only the functions + and *, thus rendering six of the eight primitive functions extraneous.

Constant creation in connection with symbolic regression will be discussed in sections 10.1 and 10.2.

7.4 Boolean Multiplexer

As a fourth illustration of genetic programming, consider the problem of Boolean concept learning (i.e., discovering a composition of Boolean functions that can return the correct value of a Boolean function after seeing a certain number of examples consisting of the correct value of the function associated with a particular combination of arguments). This problem may be viewed as similar to the problem of symbolic regression of a polynomial except that Boolean functions and arguments are involved. It may also be viewed as a problem of electronic circuit design.

Boolean functions provide a useful test bed for machine learning for several reasons.

First, it is intuitively easy to see how the structural components of the S-expression for a Boolean function contribute to the overall performance of the Boolean expression. This direct connection between structure and performance is much harder to comprehend for most other problems presented in this book.

Second, there are fewer practical obstacles to computer implementation for Boolean functions than for most other types of problems described in this book. There are no overflows or underflows generated by arbitrary compositions of Boolean functions, and there is no time-consuming simulation to write as there is with the artificial ant problem and the cart centering problem. Thus, the reader will find it particularly easy to work with Boolean problems and to replicate the results of this section.

Third, Boolean problems have an easily quantifiable search space. This is not the case for most other problems presented herein.

Fourth, for Boolean functions, the number of fitness cases is finite; thus, it is possible and practical to test 100% of the possible fitness cases for some Boolean problems. Testing of 100% of the fitness cases for a given problem sidesteps the question of whether the set of fitness cases is sufficiently representative of the problem domain to allow proper generalization. As will be shown in section 23.2, even when the number of fitness cases is finite, it is often considerably more efficient to measure fitness by a statistical sampling of the fitness cases.

7.4.1 11-multiplexer

Consider the problem of learning the Boolean 11-multiplexer function.

The solution of this problem (which has a search space of size approximately 10^616) will serve to show the interplay, in genetic programming, of

• the genetic variation inevitably created in the initial random generation,

• the small improvements for some individuals in the population via localized hill climbing from generation to generation,