Symbolic Regression—Error-Driven Evolution
10.5 Symbolic Integration
Symbolic integration involves finding a mathematical expression that is the integral, in symbolic form, of a given curve. The LEX system developed by Mitchell, Utgoff, and Banerji (1983) is a well-known approach to symbolic integration. Mills (1987) reviews various approaches to the problem of symbolic integration.
Genetic programming can be used to perform a kind of symbolic integration via a direct extension of the symbolic regression process described earlier in this chapter. The result of symbolic integration by means of genetic programming is a function expressed as a symbolic mathematical expression. The resulting function may be a perfect solution to the problem or it may be a function that approximates the correct integral.
The given curve may be presented either as a mathematical expression in symbolic form or as a discrete sampling of data points (i.e., the symbolic form of the given curve is not explicitly specified).
If the given curve is presented as a mathematical expression, we first convert it into a finite sample of data points. We do this by taking a random sample of values {xi} of the independent variable appearing in the given mathematical expression over some appropriate domain. We then pair each value of the independent variable xi with the result yi of evaluating the given mathematical expression for that value of the independent variable.
Thus, we begin the process of symbolic integration with a given finite sampling of pairs of numerical values (xi, yi). If there are, say, 50 (xi, yi) pairs (for i between 0 and 49), then, for convenience, we assume that the values of xi have been sorted into increasing order, so that x0 < x1 < ... < x49. The domain values xi lie in some appropriate interval.
The goal is to find, in symbolic form, a mathematical expression that is a perfect fit (or a good fit) to the integral of the given curve using only the given 50 pairs of numerical points.
For example, if the given curve happened to be

    Cos x + 2x + 1,

the goal would be to find its integral in symbolic form, namely

    Sin x + x^2 + x,

given the 50 pairs (xi, yi). The domain appropriate to this example might be the interval [0, 2π].
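The conversion of a curve given in symbolic form into sorted (xi, yi) pairs might be sketched as follows. This is only an illustration: the names `sample_curve` and `given_curve` and the fixed seed are assumptions introduced here, and the curve is the Cos x + 2x + 1 example used in Table 10.6.

```python
import math
import random

def sample_curve(curve, lo, hi, n=50, seed=0):
    """Return n (x, y) pairs with x drawn at random from [lo, hi], sorted by x."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability only
    xs = sorted(rng.uniform(lo, hi) for _ in range(n))
    return [(x, curve(x)) for x in xs]

def given_curve(x):
    # Example curve from Table 10.6: Cos x + 2x + 1
    return math.cos(x) + 2 * x + 1

# 50 sorted pairs over the interval [0, 2*pi]
pairs = sample_curve(given_curve, 0.0, 2 * math.pi)
print(pairs[0], pairs[-1])
```

Sorting the sample up front means the later trapezoid sums can simply walk the list from left to right.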
Symbolic integration is, in fact, merely symbolic regression with an additional preliminary step of numerical integration. Specifically, we numerically integrate the curve defined by the given set of 50 points (xi, yi) over the interval starting at x0 and running to xi. The integral I(xi) is a function of xi. The value of this integral I(x0) for the first point x0 is 0. For any other point xi, where i is between 1 and 49, we perform a numerical integration by adding up the areas of the i trapezoids lying between the point x0 and the point xi. We thereby obtain an approximation to the value for the integral I(xi) of the given curve for each point xi. We therefore obtain 50 new pairs (xi, I(xi)) for i between 0 and 49. These 50 pairs are the fitness cases for this problem.
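We then perform symbolic regression to find the mathematical expression for the curve defined by the 50 new pairs (xi, I(xi)). This expression is the integral, in symbolic form, of the curve defined by the original 50 points (xi, yi).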
Table 10.6 illustrates the process described above using only five points instead of 50. Row 1 shows five values of xi spaced equally in the interval [0, 2π]. Row 2 shows, for each of the five values of xi from row 1, the value of the given curve Cos x + 2x + 1. Row 3 contains the numerical integral of the given curve Cos x + 2x + 1 from the beginning of the interval (i.e., 0.0) to xi. This numerical integral is computed by adding up the trapezoids lying under the unknown curve given by row 2. Symbolic regression is then applied to rows 1 and 3. Specifically, row 1 is considered to be the independent variable of the unknown function, while row 3 is considered to be the value of the dependent variable. After running for several generations, genetic programming may produce Sin x + x^2 + x, in symbolic form, as the integral of the unknown curve. Row 4 shows the value of Sin x + x^2 + x for all five values of xi. Row 5 shows the error between rows 3 and 4. Since the error is relatively small for all five values of xi, the curve Sin x + x^2 + x can be considered to be the integral of the unknown curve. One could, of course, add a constant of integration, if one so desired.

Table 10.6 Finding an integral in symbolic form.
1  xi                        0.00   1.57   3.14   4.71   6.28
2  y = Cos xi + 2xi + 1      2.00   4.14   6.28  10.42  14.57
3  ∫ (Cos x + 2x + 1) dx     0.00   4.82  13.01  26.13  45.76
4  Sin x + x^2 + x           0.00   5.04  13.01  25.92  45.76
5  Absolute error            0.00   0.21   0.00   0.21   0.00
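The five-point illustration of Table 10.6 can be reproduced with a short sketch of the trapezoid step. The function name `cumulative_trapezoid` and the variable names are assumptions made for this example; only the procedure (summing the trapezoids between x0 and xi) comes from the text.

```python
import math

def cumulative_trapezoid(xs, ys):
    """Return I(x_i) for each i: the summed trapezoid areas from x_0 to x_i."""
    integral = [0.0]  # I(x_0) = 0 by definition
    for i in range(1, len(xs)):
        area = 0.5 * (ys[i - 1] + ys[i]) * (xs[i] - xs[i - 1])
        integral.append(integral[-1] + area)
    return integral

xs = [i * 2 * math.pi / 4 for i in range(5)]       # row 1: five equally spaced points
ys = [math.cos(x) + 2 * x + 1 for x in xs]         # row 2: the given curve
num = cumulative_trapezoid(xs, ys)                 # row 3: numerical integral
cand = [math.sin(x) + x * x + x for x in xs]       # row 4: candidate integral
err = [abs(a - b) for a, b in zip(num, cand)]      # row 5: absolute error

rows = [("x", xs), ("y", ys), ("I", num), ("candidate", cand), ("error", err)]
for label, row in rows:
    print(f"{label:>9}: " + "  ".join(f"{v:6.2f}" for v in row))
```

Rows 1 and 3 (`xs` and `num`) are what would be handed to symbolic regression as the fitness cases.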
When genetic programming is applied to this problem, the terminal set should contain the independent variable(s) of the problem, so T = {X}.
The function set should contain functions that might be needed to express the solution to the problem. Of course, the functions needed to express the integral of a given function are not, in general, known a priori. In this situation, we must make some kind of reasonable choice for the function set. It is probably better to include a few possibly extraneous functions in the function set than to omit a needed function. Of course, if a needed function is not in the function set, genetic programming will perform the symbolic regression as best it can using the available functions. The following function set is a reasonable choice for this problem:
F = {+, -, *, %, SIN, COS, EXP, RLOG},
taking two, two, two, two, one, one, one, and one argument, respectively.

Table 10.7 Tableau for symbolic integration.
Objective: Find a function, in symbolic form, that is the integral of a curve presented either as a mathematical expression or as a given finite sample of points (xi, yi).
Terminal set: X.
Function set: +, -, *, %, SIN, COS, EXP, RLOG.
Fitness cases: Sample of 50 data points (xi, yi), numerically integrated to give the 50 pairs (xi, I(xi)).
Raw fitness: The sum, taken over the 50 fitness cases, of the absolute value of the difference between the value fj(xi) of the individual genetically produced function at domain point xi and the value of the numerical integral I(xi).
Standardized fitness: Same as raw fitness for this problem.
Hits: Number of fitness cases coming within 0.01 of the target value I(xi).
Wrapper: None.
Parameters: M = 500. G = 51.
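The function set F and its argument counts might be written down as a small lookup table, sketched below. The conventions used for % (division that returns 1 when the divisor is 0) and RLOG (logarithm of the absolute value, returning 0 for an argument of 0) are assumed here rather than taken from this section, and the names `protected_div`, `protected_log`, `FUNCTION_SET`, and `TERMINAL_SET` are hypothetical.

```python
import math

def protected_div(a, b):
    # Assumed convention for %: division by zero yields 1 instead of failing.
    return 1.0 if b == 0 else a / b

def protected_log(a):
    # Assumed convention for RLOG: log of |a|, with 0 returned when a is 0.
    return 0.0 if a == 0 else math.log(abs(a))

# name -> (number of arguments, implementation)
FUNCTION_SET = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "%": (2, protected_div),
    "SIN": (1, math.sin),
    "COS": (1, math.cos),
    "EXP": (1, math.exp),   # note: math.exp overflows for very large arguments
    "RLOG": (1, protected_log),
}
TERMINAL_SET = ["X"]

for name, (arity, _) in FUNCTION_SET.items():
    print(f"{name:>4} takes {arity} argument(s)")
```

Protecting division and the logarithm keeps every randomly composed expression defined at every fitness case, which is why such variants are a natural fit for a function set chosen without knowing the answer in advance.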
As each individual genetically produced function fj is generated, we evaluate fj(xi) so as to obtain 50 pairs (xi, fj(xi)). The raw fitness of an individual genetically produced function is the sum, taken over the 50 fitness cases, of the absolute value of the difference between the value fj(xi) of the individual genetically produced function fj at domain point xi and the value of the numerical integral I(xi). A hit for this problem occurs when fj(xi) comes within 0.01 of the target value I(xi).
In creating the fitness cases for symbolic integration, it will usually be desirable to have a larger number of fitness cases (e.g., 50) than for an ordinary problem of symbolic regression, because of the error inherent in the extra step of numerical integration.
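The raw fitness and hits measures, and the effect of the number of fitness cases, can be illustrated with the sketch below. It scores the exact integral Sin x + x^2 + x against trapezoid targets built from 5 and from 50 points. The helper names are hypothetical, and equally spaced points on [0, 2π] are an assumption made so that the result is repeatable (the text samples the points at random).

```python
import math

def cumulative_trapezoid(xs, ys):
    """Numerical integral I(x_i) from x_0 to each x_i by summing trapezoids."""
    total, out = 0.0, [0.0]
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i - 1] + ys[i]) * (xs[i] - xs[i - 1])
        out.append(total)
    return out

def raw_fitness_and_hits(candidate, xs, targets, tolerance=0.01):
    """Sum of absolute errors, and the count of cases within the tolerance."""
    errors = [abs(candidate(x) - t) for x, t in zip(xs, targets)]
    return sum(errors), sum(1 for e in errors if e <= tolerance)

def score_exact_integral(n_points):
    # Assumption: equally spaced points on [0, 2*pi].
    xs = [i * 2 * math.pi / (n_points - 1) for i in range(n_points)]
    ys = [math.cos(x) + 2 * x + 1 for x in xs]
    targets = cumulative_trapezoid(xs, ys)
    return raw_fitness_and_hits(lambda x: math.sin(x) + x * x + x, xs, targets)

for n in (5, 50):
    fitness, hits = score_exact_integral(n)
    print(f"{n} points: raw fitness {fitness:.4f}, hits {hits} of {n}")
```

Under these assumptions, even the exact integral scores only 3 hits out of 5 with five points, because the trapezoid error at the two remaining points is about 0.21, whereas with 50 points it scores all 50 hits.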
Table 10.7 summarizes the key features of the symbolic integration problem. In one run, the best-of-generation S-expression in generation 4 was
(+ (+ (- (SIN X) (- X X)) X) (* X X)).
This S-expression scored 50 hits and had a standardized fitness of virtually 0. The standardized fitness (error) does not reach 0 exactly, because the integral is merely a numerical approximation and because of the small errors inherent in floating-point calculations. This best-of-run S-expression is equivalent to

    Sin x + x^2 + x,

which is, in fact, the symbolic integral of

    Cos x + 2x + 1.
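One way to check what the S-expression above computes is to evaluate it directly. The sketch below represents the S-expression as nested Python tuples (a representation chosen for this example, not the book's LISP form) and compares it numerically with Sin x + x^2 + x; the names `FUNCTIONS`, `evaluate`, and `best` are hypothetical.

```python
import math

# Only the functions that actually occur in this S-expression are needed here.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "SIN": math.sin,
}

def evaluate(expr, x):
    """Recursively evaluate a nested-tuple S-expression at the point x."""
    if expr == "X":              # the single terminal
        return x
    op, *args = expr             # (function name, argument, ...)
    return FUNCTIONS[op](*(evaluate(arg, x) for arg in args))

# (+ (+ (- (SIN X) (- X X)) X) (* X X))
best = ("+", ("+", ("-", ("SIN", "X"), ("-", "X", "X")), "X"), ("*", "X", "X"))

points = [i * 2 * math.pi / 49 for i in range(50)]
worst = max(abs(evaluate(best, x) - (math.sin(x) + x * x + x)) for x in points)
print(f"largest difference from Sin x + x^2 + x: {worst:.2e}")
```

The subexpression (- X X) is always 0, so it contributes nothing to the value; it is the kind of extraneous material that evolved expressions often carry.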
Figure 10.17 presents the performance curves showing, by generation, the cumulative probability of success P(M, i) and the number of individuals that must be processed I(M, i, z) to guarantee, with 99% probability, that at least one S-expression comes within 0.01 of the target value for all 50 fitness cases. The graph is based on 20 runs and a population size of 500. The cumulative probability of success P(M, i) is 50% by generation 8 and 60% by generation 50. The numbers in the oval indicate that, if this problem is run through to generation 8, processing a total of 31,500 (i.e., 500 × 9 generations × 7 runs) individuals is sufficient to guarantee solution of this problem with 99% probability.

Figure 10.17 Performance curves for the symbolic integration problem.
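The arithmetic behind the figure of 31,500 can be sketched as follows. The number of independent runs R is taken as the smallest integer for which (1 - P(M, i))^R is at most 1 - z; the function names `runs_required` and `individuals_to_process` are hypothetical, and the sketch assumes a success probability strictly between 0 and 1.

```python
import math

def runs_required(p_success, z=0.99):
    """Independent runs needed so that at least one succeeds with probability z.

    Assumes 0 < p_success < 1.
    """
    return math.ceil(math.log(1 - z) / math.log(1 - p_success))

def individuals_to_process(population, generation, p_success, z=0.99):
    # Generations are counted from 0, so running through generation i
    # costs i + 1 generations per run.
    return population * (generation + 1) * runs_required(p_success, z)

print(runs_required(0.5))                    # 7
print(individuals_to_process(500, 8, 0.5))   # 31500
```

With P(M, i) = 50% at generation 8, seven runs are needed because 0.5^7 is below 0.01 while 0.5^6 is not.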
In another symbolic integration run, x^4 + x^3 + x^2 + x was obtained as the symbolic integral of 4x^3 + 3x^2 + 2x + 1.
The step of numerical integration could, if desired, be replaced by symbolic integration for any S-expression that one happens to be able to integrate symbolically.