Learning performance measures - On the generalization ability of GSGP

4.5 On the generalization ability of GSGP

4.5.3 Learning performance measures

With fT being the fitness on the training set, and fC being the real fitness we define the

generalization ability of an evolutionary algorithm A as

G(A) = E[1 − fC( eX)]

where ˜X is the individual obtained after A has converged. The generalisation ability is in the range [0, 1], and is the larger the better.

Notice that the definition of the generalisation ability depends just on fC while the

algorithm has just access to fT. This means that the best generalisation performance will

not necessarily be achieved by the algorithm which better optimise fT. This is common

knowledge in machine learning, where the effect of maximising fT regardless of fC would

be considered as overfitting.

expected runtime to fit the training set, which in this framework can be defined as:

R(A) = E[T (fT(X) = 0)]

which is the expected time for the algorithm to find an individual with zero error on the training set since T (fT(X) = 0) is the number of generations until the first individual

with zero training error is produced.

Significance of the measures

It is important to understand the significance and the limits of the two measures.

The expected runtime gives guarantees on the convergence time of the algorithm. It measures how fast a solution is found but it doesn’t give any indication on the quality of solution for the learning problem. On the other hand the generalization ability measures the goodness of the algorithm as a learning tool providing indication on the quality of the solutions produced by it. It does not give any indication, though, on the time the algorithm takes to get to that solution. The two measures are both important and ideally they should both be taken into consideration when designing an algorithm. A good algorithm for a specific problem is one that has an high generalization ability and low (not more than polynomial) running time on that problem.

In this thesis we contributed towards the analysis of the expected runtime of GSGP. We also used the insights coming from the analysis of the runtime to design operators which are better from a runtime point of view. The problem of analysing the generalization ability of these operators is left open for future contributions.

CHAPTER 5

GEOMETRIC SEMANTIC GP FOR BASIS

FUNCTIONS REGRESSION

In the previous chapters (Chapter 3, 4) we gave two examples of theory-driven design providing design and analysis of geometric semantic mutation operators for the domain of Boolean Functions and Classification Trees respectively. In the case of Boolean functions, theory helped in realizing that the first geometric semantic operators proposed would take exponential time in solving instances of the black box Boolean learning problem, even with a training set of polynomial size. We then built on that consideration new blockwise mutation operators, we analysed them, and we found out the conditions under which they perform well.

For the domain of Classification Trees we did the same. Moreover the theoretical analysis gave insights to derive block operators with an improved space efficiency.

In this chapter we will leave the integer domain and start to consider the continuous domain. We will in fact design and analyse Geometric Semantic Genetic Programming operators for Basis Functions Regression.

This is a general form of regression in which the regression function obtained comes from a class of functions expressible as a linear combination of the basis functions {gj}i∈N.

By using different families of basis functions, one obtains well-known specific regression classes such as simple linear regression, polynomial regression, trigonometric regression and more.

This chapter is based on [60]. I have contributed on the design of the IGSM mutation operator, on the statements and proofs of Theorem 12 and Lemma 1, on the experiments in section 5.4 and partially on the analysis in Section 5.3.1.

5.1 Problem Definition

The problem class we consider is a general form of regression – basis functions regression – defined as follows. Let gj : Rn → R be a family of m basis functions, and H be

the class of functions expressible as linear combination of the basis functions {gj}. Let

p : Rn → R be an unknown target function, and T = {(i1, o1), ..., (ik, ok)} be a fixed set of

input/output pairs of p. The aim is to find a function h ∈ H that best approximates p on the training examples. I.e., h∗ = argmin_h∈H{ED(PI(h), O)}. By using different families

of basis functions, one obtains well-known specific regression classes. E.g., for polynomial regression with a single input variable, one can choose gj = xj.

The general problem class can be solved directly, without search, using the least squares method. A function h ∈ H can be written as h(x1, . . . , xn) = c1· g1(x1, . . . , xn) + . . . +

cm · gm(x1, . . . , xn) and it is identified in H, given {gj}, by the vector of coefficients

C = (c1, . . . , cm). The optimal vector of coefficients can be determined solving the system

of simultaneous linear equations obtained by using the training set T , i.e., oz = c1·g1(iz)+

. . . + cm· gm(iz) with z = 1, . . . , k. In matrix form is O0 = GC0 with O being the vector

of training set outputs and G being the matrix {gj(iz)}j,z. The solution is C0 = G+O0

where G+ _{is the Moore-Penrose pseudoinverse of G that is a generalisation of inverse}

matrix that returns a well-defined result when the system is underdetermined (m > k) or overdetermined (m < k). The optimal approximating function so computed minimises the Euclidean distance to the target output vector.

GSGP goes beyond the least square method as it can solve the basis function regression problem in the black-box setting: it does not use the knowledge of the training set, but only the errors of candidate solutions on the training set.

+ T1 = / \ X 1 - / \ T2 = * 1 / \ X X TR = 0.5 Crossover Scheme + / \ * * T3 = / \ / \ T1 TR T2 - / \ 1 TR Offspring + / \ * * = / \ / \ X 0.5 0.5 * / \ X X

Figure 5.1: T1 and T2 are parent functions and TR is a random value in [0, 1]. The offspring T3 is obtained by substituting T1, T2 and TR in the crossover scheme and simplifying algebraically.

5.2 Construction of Semantic Operators for Real Func-

In document Theory grounded design of genetic programming and parallel evolutionary algorithms (Page 97-101)