Multiplexer Data - Data Mining A Heuristic Approach Abbass HA (2002) pdf

The standard n-multiplexer is a binary classification problem where each input x consists of n = L + 2^L Boolean coordinates: the first L are the address bits, the rest are register bits. The address bits encode a particular register bit, and the associated response y is precisely the value stored there. The optimal, maximally general solution to the n-multiplexer consists of 2^(L+1) rules, all with specificity h equal to 100(L+1)/n. These rules provide a complete partition of input space and never make mistakes.
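The definition and the rule count can be checked by brute force for the 11-multiplexer (L = 3). The following is only a minimal sketch: the function names are illustrative, and it assumes the address bits are read as a binary number with the most significant bit first, so that address 000 selects the first register bit.

```python
from itertools import product

def multiplexer(x, L):
    """Return the register bit selected by the first L (address) bits of x."""
    # Assumption: address bits are read MSB first; address 0 selects the first register bit.
    address = int("".join(map(str, x[:L])), 2)
    return x[L + address]

def optimal_rules(L):
    """Enumerate the maximally general rules as (ternary pattern, output) pairs."""
    n = L + 2 ** L
    rules = []
    for address in range(2 ** L):
        addr_bits = format(address, f"0{L}b")
        for value in "01":
            pattern = ["#"] * n              # '#' marks an undefined ("don't care") position
            pattern[:L] = addr_bits          # fix the L address bits
            pattern[L + address] = value     # fix the one register bit that is addressed
            rules.append(("".join(pattern), int(value)))
    return rules

def matches(pattern, x):
    """True if the ternary pattern covers the binary input x."""
    return all(p == "#" or int(p) == b for p, b in zip(pattern, x))

L = 3
n = L + 2 ** L                               # 11
rules = optimal_rules(L)
specificity = 100 * (L + 1) / n              # percentage of defined bits per rule

# Every input is covered by exactly one rule, and that rule predicts the right output.
for x in product((0, 1), repeat=n):
    hits = [out for pat, out in rules if matches(pat, x)]
    assert hits == [multiplexer(x, L)]

print(len(rules), round(specificity, 1))
```

The exhaustive loop confirms both properties claimed in the text for this case: the rules partition the input space and make no mistakes.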

The jmultiplexer merges several multiplexers into one. Specifically, the s × l jmultiplexer combines s independent l-multiplexers to yield input vectors x of length n = sl. The combination of these partial binary outputs o_i is taken as the encoding o = o(y) = (o_s...o₃o₂o₁) of the output label 1 ≤ y ≤ 2^s = k. We shall be concerned here with the 3 × 11 jmultiplexer. In this case, for example, o = 001 corresponds to output label 2. This data set is considered with the purpose of illustrating both the generality of BYPASS classifiers and the predictive power of the underlying team-based evolution of rules.
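The label encoding can be sketched as follows. Two points are assumptions rather than statements from the text: that the i-th multiplexer reads the i-th consecutive block of l input bits, and that y equals one plus the binary value of o, which is the reading suggested by the example o = 001 giving label 2. The function names are illustrative.

```python
def multiplexer(x, L):
    """Return the register bit selected by the first L (address) bits of x."""
    address = int("".join(map(str, x[:L])), 2)   # assumed MSB-first address
    return x[L + address]

def jmultiplexer_label(x, s, L):
    """Output label in 1..2^s for an s x l jmultiplexer, with l = L + 2^L."""
    l = L + 2 ** L
    if len(x) != s * l:
        raise ValueError("input must have length s * l")
    # Assumption: o_i is computed from the i-th block of l consecutive bits.
    bits = [multiplexer(x[i * l:(i + 1) * l], L) for i in range(s)]
    # o = (o_s ... o_2 o_1) read as a binary number, shifted so that labels start at 1.
    return 1 + sum(o << i for i, o in enumerate(bits))

# 3 x 11 case: first block outputs 1 (address 000, first register bit set), others output 0.
block_one = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
zeros = [0] * 11
x = block_one + zeros + zeros
print(jmultiplexer_label(x, s=3, L=3))   # o = 001
```

Under this reading, the labels {1,3,5,7} are exactly those with o₁ = 0, which matches the way the rules are described later in the text.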

Figure 1: BYPASS follow-up screen showing a single run for the jmultiplexer data. Execution parameters are π = .8, µ₀ = 1/40 (mercy = 6), p = 3, γ = 1.7 and θ = 1.62. Window size is wlc = 2,500. From left to right, top down: MIX success rate, population size, specificity, edge, aging index, match set size, reward on failure and some run statistics; see text for details.

Just as the standard 11-multiplexer admits an optimal solution involving 16 rules, the obvious solution set for this 3 × 11 jmultiplexer consists of 16^3 = 4,096 disjoint receptive fields, each having about 36% specificity. BYPASS definitely provides an alternative, more economical solution to this problem. Figure 2 shows a run executed under the configuration π = .75, µ₀ = 1/35 (mercy = 6), p = 3 and γ = 1.7. This run involves three phases: θ = 0 was used first for one million cycles, then the GA was brought into play under θ = 1.62 for a second million cycles; finally, the system was cooled for an additional .2 million cycles.
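The figures quoted for the obvious solution follow from simple counting. The sketch below assumes that this solution is built by picking one optimal rule from each of the three 11-multiplexers; the variable names are illustrative.

```python
s, L = 3, 3
l = L + 2 ** L                           # 11 bits per multiplexer
n = s * l                                # 33 input bits in total

rules_per_multiplexer = 2 ** (L + 1)     # 16 optimal rules for one 11-multiplexer
joint_rules = rules_per_multiplexer ** s # one rule chosen per multiplexer

defined_bits = s * (L + 1)               # 4 defined bits per multiplexer
specificity = 100 * defined_bits / n     # percentage of defined bits

print(joint_rules, round(specificity, 1))
```

Each joint receptive field fixes 12 of the 33 bits, which is where the figure of about 36% comes from.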

The final population consists of 173 classifiers with average specificity h* = 11% and match sets of about m* = 12 rules. It could still be reducing its size, as suggested by the image. This population achieves an outstanding 77.8% success rate on the test sample. The MIX over-fit (difference between training and test rates) is about 2%. In both training and testing, the MIX edge over SW predictions is close to 55 percentage points. The proportion of genetic classifiers is also about 55%. Note the impact of the GA on h* (implying a burst in m* as well). Finally, note also the initial increasing trend in γ*. This is partially curbed subsequently by the GA (again, after a sudden rise), and ultimately resolved by cooling.

Figure 2: A single BYPASS run for the jmultiplexer data. Each dot reflects wlc = 25,000 cycles. From left to right, top down: MIX success rate, population size, average specificity, aging index, match set size and reward on failure. Three phases involving different execution parameters are clearly distinguished; see text for details.

Let us now take a closer look at the individual classifiers in this final population. A natural way to split up the population is to extract units whose MAP predictions equal the various output labels. For each output label we find basically the same picture: two groups of rules of about the same size. Most (sometimes all) rules in the first group were created by the GA, and their accuracies ρ are close to 1.4. Each of these rules uses exactly four bits to perfectly capture a single bit o_i. Therefore, their predictive distributions are roughly uniform over a subset of four output labels, with corresponding entropy of about 1.386. For instance, for j = 1, we find the receptive field 00#00#...# predicting {1,3,5,7}. Note that this rule does not even belong to the optimal solution set for the reduced 11-multiplexer problem: its receptive field indeed covers half of each of the optimal receptive fields 0000##...# and 001#0#...#. Yet it has the same specificity and makes no mistakes either, so that, to the system's eyes, it is indistinguishable from them. This phenomenon explains why it is so difficult to organize the collection of such "optimal" receptive fields.
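The claims about the receptive field 00#00#...# can be verified exhaustively on the reduced 11-multiplexer. This is a minimal sketch with illustrative helper names; it assumes MSB-first address bits and uses natural logarithms for the entropy, which is what reproduces the quoted value of 1.386.

```python
from itertools import product
from math import log

def multiplexer(x, L=3):
    """Return the register bit selected by the first L (address) bits of x."""
    address = int("".join(map(str, x[:L])), 2)   # assumed MSB-first address
    return x[L + address]

def matches(pattern, x):
    """True if the ternary pattern covers the binary input x."""
    return all(p == "#" or int(p) == b for p, b in zip(pattern, x))

def expand(prefix, n=11):
    """Pad a receptive-field prefix with '#' up to n positions."""
    return prefix + "#" * (n - len(prefix))

neat = expand("00#00")      # the four-bit rule discussed in the text
opt_a = expand("0000")      # optimal rule: address 000, first register bit = 0
opt_b = expand("001#0")     # optimal rule: address 001, second register bit = 0

inputs = list(product((0, 1), repeat=11))
covered = [x for x in inputs if matches(neat, x)]

# The neat rule never errs: every covered input has multiplexer output 0.
no_mistakes = all(multiplexer(x) == 0 for x in covered)

def overlap(rule, optimal):
    """Fraction of the optimal receptive field that is also covered by rule."""
    in_optimal = [x for x in inputs if matches(optimal, x)]
    return sum(matches(rule, x) for x in in_optimal) / len(in_optimal)

half_a = overlap(neat, opt_a)
half_b = overlap(neat, opt_b)

# Entropy (in nats) of a uniform distribution over four output labels.
entropy = -sum(0.25 * log(0.25) for _ in range(4))

print(no_mistakes, half_a, half_b, round(entropy, 3))
```

The rule covers as many inputs as either optimal rule, which is why specificity and error rate alone cannot separate it from them.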

The second group of rules consists of receptive fields with just three defined bits and accuracy ρ close to 1.9. These rules are nearly always created by EXM. For instance, 00#0#...# makes some mistakes but tends to be successful when o₁ = 0. Unlike rules in the previous group, its predictive distribution assigns mass to all eight output labels, but the subset {1,3,5,7} concentrates about 3/4 of it. We sometimes refer to these four-bit and three-bit regularities as neat and blurred respectively. Needless to say, both types of regularities require classifiers equipped with probabilistic predictions to be adequately described.
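The 3/4 figure for the blurred rule 00#0#...# can be reproduced by enumeration. The sketch below rests on idealized assumptions not stated in the text: inputs are uniformly distributed, address bits are MSB first, and the other two output bits are uniform and independent of o₁. Reading ρ as the entropy of the predictive distribution is likewise only an interpretation suggested by the figures quoted here.

```python
from itertools import product
from math import log

def multiplexer(x, L=3):
    """Return the register bit selected by the first L (address) bits of x."""
    address = int("".join(map(str, x[:L])), 2)   # assumed MSB-first address
    return x[L + address]

def matches(pattern, x):
    """True if the ternary pattern covers the binary input x."""
    return all(p == "#" or int(p) == b for p, b in zip(pattern, x))

blurred = "00#0#" + "#" * 6          # three defined bits on the first 11-multiplexer

covered = [x for x in product((0, 1), repeat=11) if matches(blurred, x)]
p_zero = sum(multiplexer(x) == 0 for x in covered) / len(covered)   # P(o_1 = 0 | match)

# Idealized predictive distribution over the eight labels y = 1 + binary value of o.
# Assumption: o_2 and o_3 are uniform and independent of o_1, hence the factor 1/4.
probs = {}
for y in range(1, 9):
    o1 = (y - 1) & 1
    probs[y] = (p_zero if o1 == 0 else 1 - p_zero) / 4

mass_odd = sum(probs[y] for y in (1, 3, 5, 7))
entropy = -sum(p * log(p) for p in probs.values())   # in nats

print(p_zero, mass_odd, round(entropy, 3))
```

Under these assumptions the entropy comes out a little above 1.9, in line with the accuracy reported for this group, while the neat rules sit near ln 4.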

Figure 3: A single BYPASS run for the satellite data. Execution parameters are π = .9, µ₀ = 1/25 (mercy = 10), p = 5, γ = 1.15 and θ = 0. Window size is wlc = 250. From left to right, top down: MIX success rate, population size, specificity, edge, aging index, match set size, reward on failure and some run statistics; see text for details.

We note that not all output labels are equally covered: the number of mistakes (on the test sample) by category are 147, 135, 178, 224, 379, 269, 401 and 488 respectively. According to our MAP splitting, the last two categories include only 14 and four classifiers respectively (of course, there is some noise here due to sampling variation). Thus, it appears that further progress can be made in the organization of the underlying population. This can be attempted by again executing the algorithm with this population as a starting point.
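The imbalance in the per-category mistake counts reported above is easier to see as shares of the total. The snippet only rearranges the eight numbers given in the text; the variable names are illustrative.

```python
# Test-sample mistakes for output labels 1..8, as listed in the text.
mistakes = [147, 135, 178, 224, 379, 269, 401, 488]

total = sum(mistakes)
shares = {label: count / total for label, count in enumerate(mistakes, start=1)}

worst = max(shares, key=shares.get)                    # label with the most mistakes
last_two_share = (mistakes[6] + mistakes[7]) / total   # labels 7 and 8 combined

for label, share in shares.items():
    print(label, f"{100 * share:.1f}%")
print(total, worst, f"{100 * last_two_share:.1f}%")
```

The two labels with the fewest MAP classifiers account for roughly two fifths of all mistakes. If the 77.8% test rate reported earlier refers to this same sample, the total would correspond to a test sample of roughly 10,000 cases, but that size is an inference and is not stated in the text.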

Figure 1 shows another run implementing precisely this strategy. The previous population was reinitialized as described earlier and then one million effective training cycles plus .1 million cooling cycles were conducted under the slightly different configuration π = .8, µ₀ = 1/40 (mercy = 6, p = 3, γ = 1.7 and θ = 1.62 as before). The resulting population includes only 105 rules, yet it achieves an even better test success rate of 83.3%. Again, the population is still decreasing at termination time, and it would appear that the smaller the population, the better the system works. Note how the proportion of genetic classifiers stabilizes at about the middle of the run, yet h* increases all the time. We are indeed witnessing the takeover by the neat regularities, as evidenced by the overall median accuracy ρ* (read off the last panel). Under the new training regime, γ* remains within better bounds than in Figure 2. Match sets average about m* = 8 rules.

Table 3 illustrates a single learning cycle by this population: matching, predictions and reinforcement. In this case we find exactly 4 neat and 4 blurred regularities, and output bits o₁ = 1, o₂ = 1 and o₃ = 1 are supported by 2, 4 and 2 rules respectively. Although all these classifiers concentrate on subsets of 4 output labels (and have similar R_j for them), only two output labels are highlighted in each case. Exactly which two are shown is largely due to the underlying programming of the display. The important point is that it is the set of hidden probabilities that matters, not the simplified but nonetheless useful display. Note also that neat regularities will tend to produce better scores and hence accumulate reward.

In document Data Mining A Heuristic Approach Abbass HA (2002) pdf (Page 142-145)