
6.5 Characterizations with default parameters

6.5.1 Numbers of representative pairs

The new method summarizes the full set of schemata down to a set of representative pairs. The number of such pairs affects both the size of the summary produced by the new method on disk and the time taken to analyze using the new method.

This subsection presents the results of experiments characterizing how many representative pairs are typically produced by the new method. The number of representative pairs for a given population and form of schema is the same as the number of maximal program sets and is the same as the number of maximal schemata if a rooted form is being used. Each representative pair of a non-rooted form may hold more than one maximal schema.

Figures 6.1, 6.2, 6.3, and 6.4 present plots for this value at generation 40 for eleven forms of schema.

These figures make clear that the distribution for every form of schema is highly skewed: many runs had small numbers of representative pairs and maximal schemata, while a few had very large numbers. For the rooted forms of schemata, very few of the runs exceeded the upper limit of 200,000 representative pairs, and those that did could at times have more than 1,000,000 representative pairs. The non-rooted forms of ordered-subtrees and restrictive-ordered-subtrees also have few large values. But the distribution for non-rooted forms of schema is more skewed, with many small and few large numbers of representative pairs.

The different forms of schema had very different numbers of representative pairs; the median values over 50 trials are presented in table 6.4. As expected, there are many more representative pairs for non-rooted forms of schema than for rooted forms of schema. Also, as the complexity and expressiveness of a form of schema increases, for instance from subtrees to fragments to hyperschemata, so too does the number of representative pairs found.

Table 6.4: Median number of representative pairs found for different forms of schema for population sizes of 51 programs and 101 programs

Form                                   Median representative pairs
                                       51 progs    101 progs
Restrictive-ordered-programs           664         1,698
Ordered-programs                       742         1,915
Restrictive-ordered-subtrees           3,620       15,784
Rooted-partly-ordered-subtrees         5,266       22,341
Rooted-restrictive-ordered-fragments   6,612       24,971
Rooted-ordered-fragments               6,893       27,681
Rooted-ordered-hyperschemata           19,453      109,747
Ordered-subtrees                       19,875      ≥200,000
Partly-ordered-subtrees                ≥200,000    ≥200,000
Ordered-fragments                      ≥200,000
Restrictive-ordered-fragments          ≥200,000

Figure 6.1: Distribution of numbers of representative pairs at generation 40, population 51, task BCW for three forms of schema.

Figure 6.2: Distribution of numbers of representative pairs at generation 40, population 51, task BCW for next three forms of schema.

Figure 6.3: Distribution of numbers of representative pairs at generation 40, population 51, task BCW for next three forms of schema.

Figure 6.4: Distribution of numbers of representative pairs at generation 40, population 51, task BCW for remaining two forms of schema.
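The “≥200,000” entries in table 6.4 reflect the upper limit on representative pairs mentioned earlier. A median can still be reported when some runs hit that limit, provided it is shown only as a lower bound once the median itself reaches the limit. The following is a minimal sketch of that idea. It assumes that a run reaching the limit is recorded at the limit value; the function name and the sample counts are hypothetical and are not taken from the thesis.

```python
from statistics import median

CAP = 200_000  # upper limit on representative pairs, as stated in the text


def capped_median(counts, cap=CAP):
    """Return the median of per-run counts as a display string.

    Assumption: a run that reached the limit is treated as having exactly
    `cap` pairs. If the median sits at the cap, only a lower bound is known.
    """
    clipped = [min(c, cap) for c in counts]
    m = median(clipped)
    if m >= cap:
        return f"≥{cap:,}"
    return f"{m:,.0f}"


# Hypothetical per-run counts, not data from the thesis.
mostly_small = [500, 700, 900, 1_200, 250_000]
mostly_capped = [90_000, 200_000, 200_000, 1_000_000, 1_500_000]

print(capped_median(mostly_small))
print(capped_median(mostly_capped))
```

As long as fewer than half of the runs reach the limit, the median is unaffected by how large the capped runs really were, which is one reason a median is a convenient summary for distributions with a few very large values.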

One interesting feature of the data is the seemingly excessive number of representative pairs for restrictive forms of schema, most noticeable with “restrictive-ordered-programs”. Logically, each maximal schema of this form must be the same as one of the programs, so there is a strict limit on how many schemata there may be: the number must be less than or equal to the population size. Investigation shows that, for the restrictive forms, most representative pairs represented object subsets but no schemata of the form. This is also seen with non-rooted forms, although for a very different reason.

In the case of restrictive forms of schemata, the sets of schema components held by these non-schema-representing representative pairs are not valid as schemata since, even though there are programs which match each schema component individually, no program matches the schema made from the set as a whole. An example is the two programs (+ 1), (+ 2) under the “restrictive-ordered-programs” form of schema. The set of schema components {+} is the intersection of the sets {+, (+ 1)} and {+, (+ 2)} and is found as part of the meet-semi-lattice, but this intersection describes a schema “+” which does not occur in any of the programs in the population, although it would if the form were loosened to “ordered-programs”. For the restrictive forms of schema, a great many of these representative pairs with “invalid” schemata were found, while the number of “valid” schemata was at times relatively small.
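The two-program example can be reproduced with a small sketch. It is a toy model and not the thesis's algorithm: each program is represented only by its set of schema components, the family of sets is closed under intersection, and a set is treated as “valid” under the restrictive whole-program form only if it is exactly the component set of some program. The function names and the validity test are assumptions made for illustration.

```python
from itertools import combinations

# Toy model (an assumption for illustration): each program is described by
# the set of schema components that match it.
programs = {
    "(+ 1)": frozenset({"+", "(+ 1)"}),
    "(+ 2)": frozenset({"+", "(+ 2)"}),
}


def intersection_closure(component_sets):
    """Close a family of sets under pairwise intersection.

    Empty intersections are skipped to keep the example small.
    """
    closed = set(component_sets)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that `closed` can grow inside the loop.
        for a, b in combinations(list(closed), 2):
            meet = a & b
            if meet and meet not in closed:
                closed.add(meet)
                changed = True
    return closed


def is_valid_restrictive(components, programs):
    """Assumed validity test: the set must equal some program's component set."""
    return components in programs.values()


closed = intersection_closure(programs.values())
valid = {s for s in closed if is_valid_restrictive(s, programs)}
invalid = closed - valid

print(len(closed), "sets in the closure")
print("invalid:", [sorted(s) for s in invalid])
```

In this toy model the closure holds three sets, of which only the two whole-program sets are valid; the intersection {+} is the kind of set that yields a representative pair without a schema of the restrictive form.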

In the case of non-rooted forms of schemata, representative pairs may also be found which do not represent any schemata. However, in these cases the schemata occurring in each such representative pair’s program set are “valid” but are simply represented by representative pairs which hold more programs.

To show the trend over time, figures 6.5 and 6.6 present the quartiles of the numbers of representative pairs for each generation from the initial generation to the 100th, at intervals of five generations, for six key forms of schema. The five series of each graph are, from the top, the maximum, third quartile, median, first quartile and minimum results over fifty runs.
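The five series plotted for each generation amount to a five-number summary of the per-run counts. A minimal sketch of how such a summary could be computed is given below. The function names and the data layout are hypothetical, and the “inclusive” quartile method is an assumption, since the text does not state which quartile definition was used.

```python
from statistics import quantiles


def five_number_summary(counts):
    """Return (min, Q1, median, Q3, max) for one generation's per-run counts."""
    q1, q2, q3 = quantiles(counts, n=4, method="inclusive")
    return (min(counts), q1, q2, q3, max(counts))


def summarize_by_generation(counts_by_generation, step=5):
    """Summarize every `step`-th generation.

    `counts_by_generation` maps a generation number to a list of per-run counts.
    """
    return {
        gen: five_number_summary(counts)
        for gen, counts in sorted(counts_by_generation.items())
        if gen % step == 0
    }


# Hypothetical data: five runs per generation (the thesis uses fifty runs).
example = {
    0: [10, 20, 30, 40, 50],
    3: [1, 2, 3, 4, 5],  # skipped, since 3 is not a multiple of the step
    5: [100, 200, 300, 400, 500],
}
summary = summarize_by_generation(example)
for gen, stats in summary.items():
    print(gen, stats)
```

Plotting the five values of each tuple against the generation number gives the five series described above, read from the last element (maximum) down to the first (minimum).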

The figures show a marked increase in the number of representative pairs as the run goes on, but this trend generally plateaus at about generation 30 to 40. The increase may be caused by the programs of the population sharing more large schemata. Since the algorithms of the thesis's method search for patterns that are common between programs, they must do more work to process the later generations, in which programs share large schemata, than to process early generations, in which few programs share even small schemata. This increase in work is exhibited as an increase in the number of maximal pairs.

It is interesting to note that, while the number of maximal subtrees has plateaued by generation 20, the number of representative pairs for the more expressive forms of schema continues to increase until later in the run. Also, while the number of maximal subtrees stays relatively constant after the initial increase, the numbers for the more expressive forms of schema can at times vary markedly. For instance, the median number of rooted-ordered-hyperschemata at generation 85 (11,812) is almost half the same statistic from just 10 generations earlier (21,873).

For each of these figures there is an equivalent figure for the sphere task. Those figures exhibit very similar overall trends and look very similar to the ones presented, so they are omitted from this thesis.

In document Empirical Analysis of Schemata in Genetic Programming (Page 163-170)