Evolutionary Concept Learning in First Order Logic

An Overview

Federico Divina

Computational Linguistics and AI Section Tilburg University, Tilburg, The Netherlands e-mail: [email protected]

This paper presents an overview of evolutionary approaches to Inductive Logic Programming (ILP). After a short description of the two popular ILP systems FOIL and Progol, we focus on methods based on evolutionary algorithms (EAs). Six systems are described and compared by means of the following aspects: search strategy, representation, hypothesis evaluation, search operators and biases adopted for limiting the hypothesis space. We discuss possible advantages and drawbacks related to the specific features of the systems along these aspects. Issues concerning the relative performance and efficiency of the systems are addressed.

Keywords: Evolutionary Computation, Inductive Concept Learning, First Order Logic, Inductive Logic Programming.

1. Introduction

The ability to learn from examples is a typical characteristic of all natural systems. Machine Learning was conceived also with the aim of developing algorithms capable of learning from examples. In particular, in Inductive Concept Learning (ICL) a finite set of positive and negative examples, called the training set, is used to induce a concept description. The process can be seen as a search in a space of candidate hypotheses [34] expressed in a given representation language. Starting from an initial hypothesis, generalization and specialization operators may be applied to direct the search towards good hypotheses that cover many positive examples and few negative ones (a hypothesis covers an example if the hypothesis is true on that example).

The aim of this paper is to provide an overview of recent inductive learning methods based on evolutionary algorithms which use a fragment of First Order Logic (FOL) as the hypothesis language. FOL provides a formal framework for describing and reasoning about objects, their parts, and relations among the objects and/or the parts. This area is known as Inductive Logic Programming (ILP) [40,43,16]. ILP constitutes a central topic in Machine Learning, with relevant applications to problems in complex domains, like natural language and computational biology [38], where problems cannot be represented reasonably by a set of attributes.

The approach used in the majority of first–order based learning systems is to use specific search strategies, like the general-to-specific (hill climbing) search [44] and the inverse resolution mechanism [39]. However, the greedy selection strategies adopted for reducing the computational effort often render these techniques incapable of escaping from local optima.

Recently various systems based on Evolutionary Algorithms (EAs) for ILP have been shown to be effective alternatives to standard ILP methods. This approach is motivated by two major characteristics of EAs: their good exploration power, which gives them the possibility of escaping from local optima, and their ability to cope well when there is interaction among arguments and when arguments are of different types. Another appealing feature of EAs is represented by their intrinsic parallelism. Moreover, EAs provide a learning method motivated by an analogy to biological evolution, which is known to be a successful, robust method for adaptation within biological systems.

In this paper we first describe two popular ILP systems, FOIL and Progol, and then focus on six recent EA based ILP systems. Even if the aim of this paper is to provide an overview of evolutionary approaches to ILP, it is nevertheless interesting to briefly describe the two non–evolutionary systems. We have chosen to present FOIL because it is probably the most popular system for ILP, and Progol because of its effectiveness in solving a number of ILP problems and also because EA variants of it have been proposed.

AI Communications

The following features of the EA based systems are considered: search strategy, representation, hypothesis evaluation, search operators and search biases. We discuss possible advantages and drawbacks of the systems related to these specific features. Moreover, issues concerning the relative performance and efficiency of the systems are addressed.

The order in which the systems are presented follows the complexity of the encoding they adopt. First REGAL, DOGMA and G-NET are introduced, which adopt a standard bit string encoding, next SIA01 and ECL, which adopt a higher level representation, and finally GLPS, where each individual encodes a whole logic program.

The paper is organized as follows. The basic principles of evolutionary computation (EC) and of FOL are given in section 2. In section 3 ICL and ILP are introduced. Section 4 describes FOIL and Progol, while in section 5 two evolutionary variants of Progol and six EA based systems are presented. In section 6 we compare the six evolutionary systems with respect to the above mentioned features. Finally in section 7 the most promising aspects of the systems are highlighted, which could be used as a possible guide in the design of a new EA based ILP system, and in section 8 some conclusions are given.

2. Basic Notions of EC and FOL

In this section the basic notions of EC and FOL needed in this paper are given. For a detailed introduction to EC, the reader can refer to, e.g., [17,5,57], while for FOL the reader can refer to, e.g., [50,6].

The following notation is used throughout the paper:

– E denotes a set of examples, E+ and E− denote the positive and negative example sets and e is a single example;

– B indicates the background knowledge;

– C denotes a clause, H a hypothesis;

– l stands for a literal, P for a predicate symbol and X, Y, Z, . . . for variables;

– p_x and n_x denote the positive and negative examples covered by x respectively, where x is either a clause or a hypothesis.

2.1. Evolutionary Computation

EC is a population–based stochastic iterative optimization technique based on the Darwinian concepts of evolution. Inspired by these principles, like survival of the fittest and selective pressure, EC tackles difficult problems by evolving approximate solutions of an optimization problem inside a computer. An algorithm based on EC is called an evolutionary algorithm.

Given an optimization problem, all EAs typically start from a set, called population, of random (candidate) solutions. These solutions are evolved by the repeated selection and variation of fitter solutions, following the principle of the survival of the fittest. We refer to the elements of the population as individuals or as chromosomes, which encode candidate solutions.

Solutions can be encoded in many different ways. A typical example is represented by binary string encoding, where each bit of the string has a particular meaning. In general, with the term phenotype we refer to an object forming a possible solution within the original context, while its encoding is called genotype. To each genotype must correspond at most one phenotype, so that the chosen encoding can be inverted, i.e., genotypes can be decoded. If the genotype is equal (or very similar) to the phenotype, the encoding can be referred to as a high level encoding.

Individuals are typically selected according to the quality of the solution they represent. To measure the quality of a solution, a fitness function assigns a fitness value to each individual of the population. Hence, the better the fitness of an individual, the higher its chance of being selected for reproduction and the more of its genetic material will be passed on to the next generations.

The selected individuals reproduce by means of crossover and mutation. In simple terms crossover swaps some genetic material between two or more individuals, while mutation changes a small part of the genetic material of an individual to a new random value. From the reproduction phase, new offspring are generated. Offspring compete with the old individuals for a place in the next generation. In this way EAs can efficiently explore the space of the possible solutions of an optimization problem. This space is called search space, and it contains all the possible solutions that can be encoded. In ICL and ILP the search space is often referred to as hypothesis space, since it consists of all possible hypotheses that can be considered.
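The cycle of selection, crossover, mutation and competition described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the algorithm of any system surveyed here: individuals are bit strings, selection is a binary tournament, crossover is one-point, mutation flips bits, and the toy fitness (the number of 1 bits) as well as all names and parameter values are hypothetical.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=60, p_mut=0.05, seed=0):
    """Minimal generational EA over bit strings (illustrative assumptions only)."""
    rng = random.Random(seed)
    # Start from a population of random candidate solutions.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Binary tournament: the fitter of two random individuals reproduces.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # One-point crossover swaps genetic material between the parents.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # Mutation flips each bit with a small probability.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            offspring.append(child)
        # Offspring compete with the old individuals for the next generation.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# Toy problem (assumption): maximize the number of 1 bits in the string.
best = evolve(fitness=sum)
```

Because the old individuals compete with the offspring, the best fitness in the population never decreases from one generation to the next in this sketch.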

EAs have been shown to be efficient at searching huge spaces [23] (see, e.g., [54,15,4,21,20]). The stochastic operators used allow EAs to search for possible solutions in an efficient way. For these reasons, EAs represent a valid alternative to greedy heuristics.

An important aspect that has to be addressed in EAs is the maintenance of diversity in the population. Maintaining diversity in the population allows individuals to be spread across the hypothesis space, so that all areas of the hypothesis space can be searched and there are no overcrowded regions. This can be seen as having different species occupying different niches in the search space, in the same way as in a successful natural system different species can survive in different niches of the environment. Moreover, if diversity is maintained, computational resources are exploited more effectively by avoiding useless replications and redundancies.

Different methods for achieving species and niche formation, as well as for maintaining diversity in the population, have been proposed. Among these, crowding [9] and sharing functions [24] are two popular methods.

2.2. First Order Logic

The basic components of FOL are called terms. Terms can be constants, variables or functions. A constant denotes a particular object in some domain, e.g., “4” is a constant denoting the number four in the domain of natural numbers. A variable is a name that can denote any object in a domain, and a function symbol denotes a function taking n arguments from a domain and returning one object of the domain.

In addition to terms we have predicate symbols. A predicate symbol stands for the name of a relationship between objects. A predicate symbol applied to a set of terms is called a literal. For instance, far(X, Y, 2) is a literal, where far is a predicate symbol, X, Y are variables and 2 is a constant. Literals can be positive or negative. For example far(X, Y, 2) is a positive literal, and it is true if far(X, Y, 2) is true. ¬far(X, Y, 3) is a negative literal, which is true if far(X, Y, 3) is false. We refer to positive literals also as atoms.

We can now define Horn clauses: a Horn clause is a clause of the form A ← L_1, . . . , L_n, where A is an atom and L_1, . . . , L_n are literals. A is also called the head of the clause, while we can refer to L_1, . . . , L_n as the body of the clause. The above clause can be read as “if (L_1, . . . , L_n) is true then A is true”, or more formally ∀X_i (L_1, . . . , L_n) → A.

A clause containing no variables is called a ground clause, and a clause consisting of only the head is called a fact.

In ILP, a clause C is said to cover an example e if the theory formed by C and a given background knowledge B logically entails e. For example, given

E+ = {grandparent(tom,bill), grandparent(eve,cliff)};
E− = {grandparent(tom,cliff), grandparent(eve,tom)};
B = {parent(tom,jack), parent(jack,bill), parent(eve,peter), parent(peter,cliff)};

then the clause C = grandparent(X,Y) ← parent(X,Z), parent(Z,Y) covers the positive examples and none of the negative ones. The theory formed by C and B is a logic program, i.e., a set of Horn clauses.
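The coverage claim for the grandparent example can be checked mechanically. The sketch below is a simplification under stated assumptions: B contains only ground facts, the clause is function-free, and coverage is tested by trying every binding of the remaining body variables to constants of B rather than by general logical entailment. The tuple representation and the helper names (is_var, covers) are hypothetical.

```python
from itertools import product

# Background knowledge B as ground facts: (predicate, arg1, arg2).
B = {("parent", "tom", "jack"), ("parent", "jack", "bill"),
     ("parent", "eve", "peter"), ("parent", "peter", "cliff")}
E_pos = [("grandparent", "tom", "bill"), ("grandparent", "eve", "cliff")]
E_neg = [("grandparent", "tom", "cliff"), ("grandparent", "eve", "tom")]

# Clause C: grandparent(X,Y) <- parent(X,Z), parent(Z,Y).
# Assumption: strings starting with an uppercase letter are variables.
HEAD = ("grandparent", "X", "Y")
BODY = [("parent", "X", "Z"), ("parent", "Z", "Y")]

def is_var(term):
    return term[0].isupper()

def covers(head, body, facts, example):
    """True if some variable binding maps the head to the example
    and every body literal to a fact."""
    if head[0] != example[0] or len(head) != len(example):
        return False
    theta = {}
    # Bind the head variables to the constants of the example.
    for t, c in zip(head[1:], example[1:]):
        if is_var(t):
            if theta.setdefault(t, c) != c:
                return False
        elif t != c:
            return False
    # Try every binding of the remaining body variables to known constants.
    free = sorted({t for lit in body for t in lit[1:]
                   if is_var(t) and t not in theta})
    constants = sorted({c for f in facts for c in f[1:]})
    for values in product(constants, repeat=len(free)):
        full = {**theta, **dict(zip(free, values))}
        if all((lit[0],) + tuple(full.get(t, t) for t in lit[1:]) in facts
               for lit in body):
            return True
    return False

covered_pos = [covers(HEAD, BODY, B, e) for e in E_pos]
covered_neg = [covers(HEAD, BODY, B, e) for e in E_neg]
```

For grandparent(tom,bill), for instance, the binding Z = jack satisfies both body literals, whereas no binding of Z works for grandparent(tom,cliff).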

In order to verify whether a clause covers an example, variables contained in the clause need to be bound to the constants belonging to the example. To this aim, substitutions are used. A substitution θ = {X_1/t_1, . . . , X_n/t_n} is a finite mapping from variables to terms that assigns to each variable X_i a term t_i, t_i ≠ X_i, 1 ≤ i ≤ n. Applying a substitution θ to a term t is the result of the simultaneous replacement of each occurrence of a variable in t appearing also in θ with the corresponding term.
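To make the word “simultaneous” concrete, here is a small sketch of applying a substitution to a term. The representation is our own assumption: variables are strings starting with an uppercase letter, constants are other strings, and compound terms are tuples (functor, arg_1, ..., arg_n). The function name apply_substitution is hypothetical.

```python
def apply_substitution(term, theta):
    """Apply theta to a term in a single simultaneous pass.

    Assumed representation: variables and constants are strings,
    compound terms are tuples (functor, arg1, ..., argn).
    """
    if isinstance(term, tuple):
        # Keep the functor, substitute inside each argument.
        return (term[0],) + tuple(apply_substitution(a, theta) for a in term[1:])
    # Replaced terms are not substituted again, hence "simultaneous".
    return theta.get(term, term)

theta = {"X": "Y", "Y": "a"}
result = apply_substitution(("f", "X", "Y"), theta)
```

With θ = {X/Y, Y/a}, the term f(X, Y) becomes f(Y, a) and not f(a, a), because the Y introduced for X is not replaced a second time.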

3. Inductive Concept Learning

The objective of inductive learning is to find a hypothesis that explains the classifications of the examples, given their descriptions. More formally: given E+, E− and B, the aim of inductive learning is to find a hypothesis H such that H covers every e in E+ and H does not cover any e in E−.

[Figure 1: diagram of a system for ICL. Effectiveness box: 3.1 Search Strategy, 3.2 Representation, 3.3 Evaluation, 3.4 Operators, 3.5 Biases. Efficiency box: Sampling Mechanisms, Parallelization.]

Fig. 1. Some features of a system for ICL. Next to each effectiveness box the section where the relative feature is first described is reported.

In the example given in the previous section, C = grandparent(X, Y) ← parent(X, Z), parent(Z, Y) is an example of such a hypothesis, because, given the sets of examples E+ and E− and background knowledge B, C covers all the positive examples and none of the negative ones.

Figure 1 illustrates important features for effectiveness (left box) and efficiency (right box) of a system for ICL. This division is not always so well defined. For instance parallelization may in some cases also influence the effectiveness of the system. We explain briefly the features in the effectiveness box. Sampling and parallelization will be briefly addressed in the description of the REGAL and ECL systems.

3.1. Search Strategies

The search space can be structured with a general-to-specific ordering of hypotheses. We say that a hypothesis H_1 is more general than a hypothesis H_2, and H_2 is more specific than H_1, if all the examples covered by H_2 are also covered by H_1. With this ordering of hypotheses, the search space can be seen as a lattice under the general-to-specific ordering.

Many systems for ICL exploit this ordering of hypotheses in the operators they use for moving in the search space and for deciding the direction in which the search is performed. Some systems start the search from a specific hypothesis, which is then generalized during the learning process. This approach is called bottom-up. Alternatively a top-down approach can be used. In this case the learning process starts with a general hypothesis which is specialized to fit the training examples.

Inside these two approaches a search strategy is used. Sequential covering and hill climbing (for an explanation see, e.g., [48]) are two popular search strategies. Systems adopting sequential covering iteratively learn a set of rules that represent the target concept. To do this they first learn a rule. The learned rule is added to the emerging target concept. All the positive examples covered by the learned rule are removed from the training example set. Another rule is learned using the current set of training examples, and added to the emerging target concept. The process is iterated until all positive examples are covered.

Hill climbing algorithms refine an initial hypothesis. Several variations of the current hypothesis are built. Among these the best one is chosen, according to some criterion. The process is iterated until a sufficiently good hypothesis is found. The process is called hill climbing because it proceeds with optimization steps toward a locally best hypothesis. Hill climbing can be used inside sequential covering for learning rules.
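The interplay of the two strategies can be sketched on a toy attribute-value problem. All choices here are our own assumptions for illustration: a rule is a conjunction of attribute tests stored as a dictionary, the hill climbing step scores a candidate test by covered positives minus covered negatives, and the data set and function names are hypothetical.

```python
def rule_covers(rule, example):
    """A rule (dict of attribute tests) covers an example satisfying all tests."""
    return all(example.get(a) == v for a, v in rule.items())

def learn_rule(pos, neg):
    """Hill climbing: start from the most general rule and greedily add tests."""
    rule = {}
    while any(rule_covers(rule, e) for e in neg):
        # Candidate tests are taken from positive examples still covered.
        candidates = {(a, v) for e in pos if rule_covers(rule, e)
                      for a, v in e.items() if a not in rule}
        if not candidates:
            return None  # the covered negatives cannot be excluded

        def score(test):
            refined = {**rule, test[0]: test[1]}
            p = sum(rule_covers(refined, e) for e in pos)
            n = sum(rule_covers(refined, e) for e in neg)
            return p - n  # assumed scoring criterion

        attribute, value = max(sorted(candidates), key=score)
        rule[attribute] = value
    return rule

def sequential_covering(pos, neg):
    """Learn rules one at a time, removing the positives each rule covers."""
    rules, remaining = [], list(pos)
    while remaining:
        rule = learn_rule(remaining, neg)
        if rule is None:
            break
        rules.append(rule)
        remaining = [e for e in remaining if not rule_covers(rule, e)]
    return rules

pos = [{"shape": "round", "color": "red"},
       {"shape": "round", "color": "green"},
       {"shape": "square", "color": "red"}]
neg = [{"shape": "square", "color": "green"}]
rules = sequential_covering(pos, neg)
```

Each call to learn_rule is one hill climbing run; the outer loop is the sequential covering that accumulates rules until no positive example is left uncovered.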

3.2. Representation Language and Encoding

When we want to solve a problem with a computer, the first thing that should be done is to translate the problem into computational terms. In our case this means choosing a representation language and an encoding.

The choice of a representation language may vary from a fragment of propositional calculus to second order logic. While the former has a low expressive power, the latter is rather complex, and for this reason is seldom used. FOL is used in various successful systems for ICL. Usually Horn clauses are employed.

Every system using FOL usually adopts some restrictions. For example function symbols may not be allowed to appear as arguments of a literal, only variables appearing in the head of the clause may be allowed to appear in the literals of the body, etc. Another limitation could be the exclusion of recursion. A clause is recursive if it is defined in terms of itself. For instance, ancestor(X, Y) ← parent(X, Z), ancestor(Z, Y) is a recursive clause. These restrictions are adopted in order to reduce the size of the search space, and can be seen as language biases, see section 3.5.

The systems described in this paper learn a set of rules. Once a representation language has been chosen, rules need to be encoded in some way, in order to be processed. Examples of encodings are binary encoding and tree encoding. In section 5 we will see some examples of how rules can be encoded. The encoding can represent either a single rule or a set of rules. Notice that sometimes the term representation is used instead of the term encoding.

3.3. Evaluating a Hypothesis

In simple terms what characterizes a hypothesis as good is how well it performs on the training examples and how well it is expected to behave on unseen examples. For instance, a hypothesis covering several positive examples and no negative examples could be considered a good hypothesis.

A scoring function (or fitness function in EAs) is used to measure the goodness of a hypothesis. Several properties can be used for defining a scoring function, like completeness, consistency and simplicity.

A hypothesis H is complete iff H covers all the positive examples. H is consistent iff H does not cover any negative examples.

Completeness and consistency are two properties that almost all inductive learning systems incorporate in the scoring function they adopt.
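The two definitions translate directly into code. In the sketch below the coverage test is passed in as a function, so nothing is assumed about the hypothesis language; the toy hypotheses over integers and all names are purely illustrative assumptions.

```python
def is_complete(covers, hypothesis, positives):
    """H is complete iff it covers all the positive examples."""
    return all(covers(hypothesis, e) for e in positives)

def is_consistent(covers, hypothesis, negatives):
    """H is consistent iff it covers no negative example."""
    return not any(covers(hypothesis, e) for e in negatives)

# Toy setting (assumption): hypotheses are predicates over integers.
covers = lambda h, e: h(e)
positives, negatives = [2, 4, 6], [1, 3]

is_even = lambda e: e % 2 == 0      # complete and consistent
greater_than_3 = lambda e: e > 3    # consistent, but misses the positive 2
anything = lambda e: True           # complete, but covers the negatives
```

A scoring function can then reward completeness and consistency separately, for instance by counting covered positives and covered negatives.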

Simplicity is a concept often used following Occam’s razor [8], which states that one should prefer the simplest hypothesis that fits the data. One explanation for this is that there are fewer short hypotheses than long ones, and so it is less likely that one will find a short hypothesis that coincidentally fits the data. There are many ways of defining simplicity, e.g.:

– Short rules [8]. Prefer shorter rules over longer rules. The length of a rule depends on the representation used, and so the same rule could be considered short by one learner and long by another.

– MDL [47]. This is a more general concept, since it uses a notion of length that does not depend on the particular representation used. According to the Minimum Description Length (MDL) principle, the best model for describing some data is the one that minimizes the sum of the length of the model and the length of the data given the model. Here by length we mean the number of bits needed for encoding a model or the data.

– Information gain [45]. Information gain is a measure of how a change in a hypothesis affects its classification of the examples. This principle, when incorporated in the search strategy of a method, as in decision trees, may bias the search toward shorter rules, as it aims to minimize the number of tests needed for the classification of a new object. However it is mostly up to the search strategy adopted to prefer short rules.

3.4. Operators

Operators are used for moving in the search space. These operators vary from system to system, depending on the approach used, the problem to solve, the ideas of the authors and so on. An operator basically receives a hypothesis, changes it in some way and returns the changed hypothesis. Systems not relying on evolutionary techniques, e.g., Progol, employ so-called refinement operators. An example of a refinement operator is inverse resolution. We describe here inverse resolution in its propositional form, while for details about inverse resolution in FOL the reader can refer to [36].

What is done in this method is simply inverting the resolution rule. Given rules C_1 and C_2, the resolution operator constructs a clause C which is derived from C_1 and C_2. For example, if C_1 is going_out ∨ staying_home and C_2 is ¬staying_home ∨ study then C will be going_out ∨ study. The inverse resolution operator then produces C_2 starting from C_1 and C. The inverse resolution operator is not deterministic. This means that in general there are multiple choices for C_2. A way of limiting the number of choices is to restrict the representation language to Horn clauses and to use inverse entailment. The idea behind inverse entailment is to change the entailment constraint B ∧ H |= e into the equivalent form B ∧ ¬e |= ¬H. The latter constraint says that from the background knowledge and the negation of the classification of an example, the negation of a hypothesis explaining the example can be derived. Thus, from the modified constraint one can use a process similar to deduction to derive a hypothesis H.
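The propositional example and the non-determinism of inverse resolution can be reproduced with a short sketch. The representation is an assumption made for illustration: a clause is a frozenset of literal strings and negation is written with a leading "~". The function inverse_resolve is a hypothetical helper that enumerates some valid choices for C_2; it is not the operator used by Progol.

```python
def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two propositional clauses (frozensets of literals)."""
    return {frozenset((c1 - {l}) | (c2 - {negate(l)}))
            for l in c1 if negate(l) in c2}

def inverse_resolve(c1, c):
    """Clauses C2 such that resolving C1 with C2 yields C."""
    out = set()
    for l in c1 - c:                 # l is the literal resolved away from C1
        kept = c1 - {l}
        if not kept <= c:
            continue                 # the rest of C1 must survive in C
        needed = c - kept            # literals of C that must come from C2
        shared = sorted(c & kept)    # may or may not also occur in C2
        for mask in range(2 ** len(shared)):
            extra = {s for i, s in enumerate(shared) if (mask >> i) & 1}
            out.add(frozenset(needed | extra | {negate(l)}))
    return out

C1 = frozenset({"going_out", "staying_home"})
C2 = frozenset({"~staying_home", "study"})
C = frozenset({"going_out", "study"})

resolvents = resolve(C1, C2)
candidates = inverse_resolve(C1, C)
```

Here resolve(C1, C2) yields exactly C, while inverse_resolve(C1, C) returns two candidates for C_2: ¬staying_home ∨ study, and ¬staying_home ∨ study ∨ going_out. Both resolve with C_1 to give C, which illustrates why the inverse operator has to choose among alternatives.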

Systems based on EAs employ crossover and mutation operators, which are explained in section 2.1.

3.5. Biasing the Hypothesis Space

If we want to ensure that a solution is found, it is obvious that the unknown target concept must be contained in the portion of the hypothesis space that is searched. Using a hypothesis space capable of representing every learnable concept could seem the solution, but this would lead to a very large search space. To illustrate this, consider a learner that uses examples described by a set of attributes. In general, in this setting, an unbiased hypothesis space contains 2^|E| possible concepts, where |E| is the cardinality of the example set. For instance, if a set of attributes can describe 90 different examples of the concept to be learned, then there are 2^90 distinct target concepts that a learner might be called upon to learn. This is a huge space to search, and for this reason some biases have to be used in order to limit the search to a portion of the hypothesis space [41].
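To give a sense of scale, the count follows from the fact that a concept corresponds to a subset of the describable examples, so each of the 90 examples is either inside or outside the concept:

```latex
\[
  2^{|E|} = 2^{90} = \left(2^{10}\right)^{9} = 1024^{9} \approx 1.24 \times 10^{27}.
\]
```

Even a learner able to test a billion concepts per second would need on the order of 10^18 seconds to enumerate them all, which is why exhaustive search is not an option.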

To limit the size of the search space two main kinds of biases are used [18]:

– Search bias. This kind of bias imposes a direct limitation on the search performed by the learner, limiting the hypothesis space by means of some bound;

– Language bias. This kind of bias imposes a limitation on what kind of hypotheses can be represented by the algorithm. The hypothesis space is limited to the set of representable hypotheses.

4. Two Popular ILP Systems

In this section we describe two of the most popular ILP algorithms. The first one is FOIL [44]. FOIL has been shown to solve a wide variety of problems, and for this reason its results are often taken as a reference measure for other systems. The second system described in this section is Progol [36,37], which employs inverse resolution for solving its learning task.

4.1. FOIL

FOIL searches the hypothesis space using a top-down search approach and adopts an AQ-like sequential covering algorithm [32]. AQ uses a sequential covering algorithm to build its concept description. It starts from an empty set of rules. A positive example e is selected, and a general-to-specific search is conducted in order to find a rule covering e (and possibly more positive examples) and no negative examples. Among the constructed rules the “best” one is selected and added to the emerging set of rules forming the concept description. All the positive examples covered by the found rule are removed and the process is repeated until all positive examples are covered. The “best” rule is usually some compromise between the desire to cover as many positive examples as possible and the desire to have as compact and readable a representation as possible.

In the same way FOIL first induces a consistent clause and stores it. All the positive examples covered by the learned clause are removed from the training set, and the process is repeated until all positive examples are covered. When a clause needs to be induced, the system employs a hill climbing strategy. It starts with the most general clause, consisting of a clause with an empty body and head equal to the target predicate. All the arguments of the head are distinct variables. In this way this initial clause classifies all examples as positive. The clause is then specialized by adding literals to its body. Several literals are considered for this purpose. The literal yielding the best improvement is added to the body. If the clause is not consistent then another literal is added.

Algorithm(FOIL)
1 Initialize the clause;
2 while the clause covers negative examples
3   do Find a “good” literal to be added to
4      the clause body;
5 Remove all examples covered by the clause;
6 Add the clause to the emerging concept definition;
7 If there are any uncovered positive examples
8   then go to 1;

Fig. 2. The scheme of the algorithm adopted by FOIL.

In figure 2 a scheme of the algorithm adopted by FOIL is presented. In lines 2 and 3 the hill climbing phase is performed.

The representation language of FOIL is Datalog, a restricted form of FOL that omits disjunctive descriptions and function symbols. Negated literals are allowed, where the negation is interpreted in a limited way (negation by failure).

The scoring function used by FOIL to estimate the utility of adding a new literal is based on the number of positive and negative examples covered before and after adding the new literal. More precisely, let C be the clause to which a new literal l has to be added and C′ the clause created by adding l to C. The information gain function used is then the following:

$$\mathit{Info\_gain} = t \left( \log \frac{p_{C'}}{p_{C'} + n_{C'}} - \log \frac{p_{C}}{p_{C} + n_{C}} \right)$$

where t is the number of positive examples covered by C that are still covered after adding l to C.
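As a worked instance of the information gain formula, the sketch below evaluates it for given coverage counts. Two points are assumptions on our side: the logarithm is taken in base 2, so that the gain is measured in bits, and the function name and argument names are hypothetical.

```python
from math import log2

def foil_gain(p_c, n_c, p_new, n_new, t):
    """Info_gain = t * (log(p'/(p'+n')) - log(p/(p+n))).

    p_c, n_c: positives/negatives covered by C; p_new, n_new: those covered
    by C' (C plus the new literal); t: positives of C still covered by C'.
    Base-2 logarithms are an assumption of this sketch.
    """
    if p_c == 0 or p_new == 0:
        raise ValueError("both clauses must cover at least one positive example")
    return t * (log2(p_new / (p_new + n_new)) - log2(p_c / (p_c + n_c)))

# C covers 4 positives and 4 negatives; the new literal removes all negatives
# while keeping all 4 positives: 4 * (log2(1) - log2(0.5)) = 4 bits.
gain_good = foil_gain(4, 4, 4, 0, 4)
# A literal that changes nothing yields no gain.
gain_none = foil_gain(4, 4, 4, 4, 4)
```

The factor t rewards literals that keep many positive examples covered, so a literal that purifies the clause but retains only one positive example scores lower than one that retains several.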

The add operator considers literals of the following form:

– P(X_1, X_2, . . . , X_k) and ¬P(X_1, X_2, . . . , X_k), where the X_i are existing variables of the clause or new variables;

– X_i = X_j or X_i ≠ X_j, for variables of the clause;

– X_i = c and X_i ≠ c, where X_i is a variable of the clause and c is an appropriate constant;

– X_i ≤ X_j, X_i > X_j, X_i ≤ v and X_i > v, where X_i and X_j are variables of the clause that can assume numeric values and v is a threshold chosen by FOIL.

Algorithm(Progol)
1 If E = ∅ return B;
2 Let e be the first example in E;
3 Construct a most specific clause ⊥ for e
4   using inverse entailment;
5 Construct a good clause C from ⊥;
6 Add C to B;
7 Remove from E all the examples that are now covered;
8 Go to 1;

Fig. 3. Covering algorithm adopted by Progol. The emerging hypotheses are added to the background knowledge and the algorithm is repeated until all the positive examples are covered.

There is a constraint on literals that can be introduced in a clause: at least one variable appearing in the literal to be added must already be present in the clause. Another restriction adopted by FOIL is motivated by Occam’s razor: when a clause becomes longer (according to some metric) than the total number of the positive examples that the clause explains, that clause is no longer considered as a potential part of the hypothesis. There is also another bias on the hypothesis space, namely the upper bound represented by the most general clause initially generated. In fact all the clauses that are generated are more specific than the initial one.

4.2. Progol

Progol uses inverse entailment to generate just a single most specific clause (usually called “bottom clause” and denoted as ⊥) that, together with the background knowledge, entails the observed data. ⊥ can then be used to bound a top-down search through the hypothesis space with the constraint that the only clauses considered are those more general than the initial bound.

Progol uses a sequential covering algorithm to carry out its learning task, illustrated in figure 3. For each positive example e that is not yet covered, it first searches for ⊥, which covers e (line 3). For doing this it applies inverse entailment i times, where i is a parameter specified by the user. In line 5 an A* strategy is adopted for finding a good clause starting from the most general clause.

Progol uses θ-subsumption for ordering the hypothesis space. A clause C_1 θ-subsumes a clause C_2 iff there exists a substitution θ such that C_1θ ⊆ C_2, where clauses are viewed as sets of literals in their disjunctive form (C_1 is more general than C_2, written also C_1 ⪯ C_2). The refinement operator maintains the relationship □ ⪯ C ⪯ ⊥ for every considered clause C. In the previous relationship □ is the empty clause. Thus the search is limited to the bounded sub-lattice □ ⪯ C ⪯ ⊥. Since C ⪯ ⊥, there exists a substitution θ such that Cθ ⊆ ⊥. So for each l in C, there exists a literal l′ in ⊥ such that lθ = l′. The refinement operator has to keep track of θ and a list of those literals l′ in ⊥ that have a corresponding literal l in C. Any clause C that subsumes ⊥ corresponds to a subset of literals in ⊥ with substitutions applied.
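The θ-subsumption test can be illustrated with a small backtracking search for a suitable substitution. This is a naive sketch under our own assumptions, not the procedure implemented in Progol: clauses are collections of literals written as tuples (sign, predicate, arg_1, ..., arg_n), terms are flat strings, variables start with an uppercase letter, and the function names are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def theta_subsumes(c1, c2):
    """True if some substitution theta makes every literal of c1 (after
    applying theta) a member of c2. Variables of c2 are treated as constants."""
    c1, c2 = list(c1), set(c2)

    def match(i, theta):
        if i == len(c1):
            return True                       # every literal of c1 is mapped
        lit = c1[i]
        for target in c2:
            if target[:2] != lit[:2] or len(target) != len(lit):
                continue                      # different sign, predicate or arity
            new = dict(theta)
            ok = True
            for s, t in zip(lit[2:], target[2:]):
                if is_var(s):
                    if new.setdefault(s, t) != t:
                        ok = False            # clashes with an earlier binding
                        break
                elif s != t:
                    ok = False                # constants must be identical
                    break
            if ok and match(i + 1, new):
                return True
        return False

    return match(0, {})

# grandparent(X,Y) <- parent(X,Z)
general = [("+", "grandparent", "X", "Y"), ("-", "parent", "X", "Z")]
# grandparent(X,Y) <- parent(X,Z), parent(Z,Y)
specific = general + [("-", "parent", "Z", "Y")]

general_subsumes_specific = theta_subsumes(general, specific)
specific_subsumes_general = theta_subsumes(specific, general)
```

The shorter clause subsumes the longer one through the identity substitution, whereas the converse fails because no consistent binding maps parent(Z,Y) into the shorter clause. The empty clause subsumes every clause, in agreement with its role as the top of the lattice.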

The scoring function used to measure the goodness of a candidate clause C is:

$$f(C) = p_{C} - (n_{C} + lgh_{C} + h_{C})$$

where lgh_C is the length of C minus 1 and h_C is the expected number of further atoms that have to be added in order to complete the clause. h_C is calculated by inspecting the output variables in the clause and determining whether they have been defined. The output variables are given by a user supplied model.
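A small numeric instance may help in reading the scoring function. In the sketch below the length of a clause is assumed to be its number of literals, head included, and the value of h_C is supplied by the caller rather than computed from mode declarations; the function name is hypothetical.

```python
def progol_score(p_c, n_c, clause_length, h_c):
    """f(C) = p_C - (n_C + lgh_C + h_C), with lgh_C = clause_length - 1.

    clause_length counts the literals of C including the head (an assumption);
    h_c is the expected number of atoms still to be added, given as input.
    """
    lgh_c = clause_length - 1
    return p_c - (n_c + lgh_c + h_c)

# A clause with 3 literals covering 10 positives and 1 negative,
# expected to need 1 more atom: 10 - (1 + 2 + 1) = 6.
score = progol_score(p_c=10, n_c=1, clause_length=3, h_c=1)
```

The score therefore rewards coverage of positive examples and penalizes covered negatives, long clauses, and clauses that are still far from complete.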

A first bias on the hypothesis space is represented by the upper bound □ and by the lower bound ⊥. A second constraint is the use of the head and body mode declarations together with other settings to build the most specific clause. With a mode declaration the user specifies, for each atom used, the modality in which an argument can be used. This model is also used for computing the value of h_C in the scoring function. So for example, it can be specified that a particular argument is an input variable, or an output variable, or a particular constant. Progol imposes a restriction upon the placement of input variables. Every input variable in any atom has to be either an input variable in the head of the clause or an output variable in some atom that appeared before in the clause. This imposes a quasi-order on the body atoms and ensures that the clause is logically consistent in its use of input and output variables.

5. The Evolutionary Approach

EAs have proved to be successful in solving comparatively hard optimization problems, as well as problems like ICL. EAs have an intrinsic parallelism and can therefore exploit parallel machines much more easily than classical search algorithms. Furthermore EAs have the capability of escaping from local optima, while greedy algorithms may not show this ability. Finally EAs tend to cope better than greedy rule induction algorithms when there is interaction among arguments [18].

Depending on the representation used, two major approaches are used: the Pittsburgh and the Michigan approach, so called because they were first introduced by research groups at the universities of Pittsburgh and Michigan, respectively. In the former case each individual encodes a whole solution, while in the latter case an individual encodes a part of the solution. Both approaches present advantages and drawbacks. The Pittsburgh approach allows an easier control of the genetic search, but introduces a large redundancy that can lead to hard to manage populations and to individuals of enormous size. The Michigan approach, on the other hand, allows for cooperation and competition between different individuals, hence reduces redundancy, but requires sophisticated strategies, like co-evolution, for coping with the presence in the population of super individuals.

5.1. Two Evolutionary Variants of Progol

In [52] a GA is used inside Progol for exploring the bounded hypothesis space in search of a good clause (step 5 of figure 3). A slightly modified version of the Simple Genetic Algorithm [23] is used for this purpose. The GA adopts a Michigan approach. Each clause is represented by a binary string. Generalization and specialization crossover and a standard mutation are used as genetic operators. The GA evolves a population of clauses which all subsume the most specific clause computed by Progol through inverse entailment.

Another GA based system using Progol is EVIL 1 [46]. This algorithm adopts a Pittsburgh approach. Every individual thus represents a set of rules (a logic program) encoded as a tree structure, where each node of the tree represents a single clause. In this way a whole logic program can be represented inside a single individual. A tree representation also makes it easy to define crossover and mutation operators that act on a logic program.


At each generation individuals induce new rules and add them to the logic program they have already induced. Progol is used for inducing rules. The reproduction phase uses a crossover operator. This operator acts on trees, randomly exchanging subtrees between the two parents.

5.2. REGAL

REGAL (RElational Genetic Algorithm Learner) [19,42] exploits the explicit parallelism of GAs. In fact it consists of a network of fully interconnected genetic nodes that exchange individuals at each generation. Each genetic node performs a GA on an assigned set of training examples. A supervisor node is used in order to coordinate these subpopulations. The system adopts a Michigan approach: each individual encodes a partial solution, i.e., a clause.

The representation language used by REGAL is intermediate between $VL_2$ and $VL_{21}$ [33,32], in which terms can be variables or disjunctions of constants, and negation occurs in a restricted form. An atomic formula of arity n has the form $P(X_1, \ldots, X_n, K)$, where $X_1, \ldots, X_n$ are variables and K is a disjunction of constant terms, denoted by $[v_1, \ldots, v_m]$, or the negation of such a disjunction. For example, these are well formed formulas: shape(X, [square, triangle]), far(X, Y, [1, 2, 3]), color(X, ¬[red, blue]). The first formula states that the shape of X is either a square or a triangle, and corresponds to the two literals shape(X, square) and shape(X, triangle).

Before describing how individuals are actually encoded by REGAL, we first have to introduce the concept of language template. Informally, the template is a formula Λ belonging to the language, such that every admissible conjunctive concept description can be obtained from Λ by deleting some constants from the internal disjunctions occurring in it. The predicates in the template can be divided into predicates in completed form and those not in completed form.

A predicate is in completed form if the set $[v_1, \ldots, v_m]$, which constitutes its internal disjunction, is such that the predicate can be satisfied for any binding of the variables $X_1, \ldots, X_n$. In other words, a predicate containing a disjunctive term in completed form is true on every instance in the learning set.

For instance, in figure 4 color(X, [red, blue, ∗]) is in completed form, while weight(X, [3, 4, 5]) is not. The symbol ∗ means “everything which does not appear in the internal disjunction”. The predicate color(X, [red, blue, ∗]) is in completed form because [red, blue, ∗] is the set of all possible colors. Thus a predicate is in completed form if its internal disjunction lists all the constants that belong to the domain.

A language template Λ must contain at least one predicate in completed form. Indeed, given a language template, the search space explored by REGAL is restricted to the set H(Λ) of formulas that can be obtained by deleting some constants from the completed terms occurring in Λ. This is because predicates not in completed form have the role of constraints and must be satisfied by the specific binding chosen for the variables in Λ, while predicates in completed form are used to define the search space. Deleting a constant from a completed or non-completed term makes the term more specific. Since the search space is limited to H(Λ), only predicates in completed form need to be processed, and so encoded. REGAL uses bit strings for this purpose, where a string is divided into substrings. Each substring corresponds to a literal, in the same order as the literals appear in the language template. Each bit corresponds to a term. If the bit corresponding to a given term $v_i$ in a predicate P is set to 1, then $v_i$ belongs to the current internal disjunction, whereas if it is set to 0 it does not.

An example of a language template and of the representation of formulas is given in figure 4. In the figure, ϕ1 corresponds to the rule weight(X, [3, 4, 5]) ∧ color(X, [red]) ∧ shape(X, ¬[square, circle]) ∧ far(X, Y, [1, 2]). Notice that the predicate weight is not encoded since it is not in completed form. The first substring of ϕ1 corresponds to the predicate color, and it means that only the constant red belongs to the internal disjunction of this predicate, i.e., red is the only constant that can appear as argument. The second substring corresponds to the predicate shape, and it means that triangle and ∗ belong to the internal disjunction, which corresponds to ¬[square, circle].
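The mapping between formulas and bit strings can be illustrated with a minimal Python sketch. The names `TEMPLATE`, `encode` and `decode` are hypothetical and are not taken from REGAL; the template is the completed-form part Λs of figure 4, and a formula is assumed to be given as the set of constants kept in each internal disjunction.

```python
# Completed-form predicates of the template (Λs in figure 4), in template order.
# "*" stands for "every value not listed in the internal disjunction".
TEMPLATE = [
    ("color", ["red", "blue", "*"]),
    ("shape", ["square", "triangle", "circle", "*"]),
    ("far", ["1", "2", "3", "4", "5", "*"]),
]


def encode(formula):
    """Map {predicate: constants kept in its disjunction} to a list of bits."""
    bits = []
    for name, constants in TEMPLATE:
        kept = formula[name]
        bits.extend(1 if c in kept else 0 for c in constants)
    return bits


def decode(bits):
    """Inverse of encode: recover the internal disjunction of each predicate."""
    formula, pos = {}, 0
    for name, constants in TEMPLATE:
        chunk = bits[pos:pos + len(constants)]
        formula[name] = {c for c, b in zip(constants, chunk) if b == 1}
        pos += len(constants)
    return formula


# ϕ1 of figure 4: color is red, shape is neither square nor circle, far is 1 or 2.
phi1 = {"color": {"red"}, "shape": {"triangle", "*"}, "far": {"1", "2"}}
phi1_bits = encode(phi1)
```

Under these assumptions `phi1_bits` reproduces the string given for ϕ1 in figure 4, and `decode` recovers the internal disjunctions from it.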

The language template used by REGAL is a propositionalisation method, i.e., a method for reformulating a FOL learning problem into an attribute-value problem. It is strongly related to some of the propositionalisation methods proposed in [1]. We refer the reader to [29] for a comparison of various methods for propositionalisation.

Λ = weight(X, [3, 4, 5]) ∧ color(X, [red, blue, ∗]) ∧ shape(X, [square, triangle, circle, ∗]) ∧ far(X, Y, [1, 2, 3, 4, 5, ∗])

Λs = color(X, [red, blue, ∗]) ∧ shape(X, [square, triangle, circle, ∗]) ∧ far(X, Y, [1, 2, 3, 4, 5, ∗])

ϕ1 = weight(X, [3, 4, 5]) ∧ color(X, [red]) ∧ shape(X, ¬[square, circle]) ∧ far(X, Y, [1, 2])

ϕ2 = weight(X, [3, 4, 5]) ∧ color(X, [blue]) ∧ shape(X, [square]) ∧ far(X, Y, [2])

s(Λs) → 1 1 1   1 1 1 1   1 1 1 1 1 1
ϕ1    → 1 0 0   0 1 0 1   1 1 0 0 0 0
ϕ2    → 0 1 0   1 0 0 0   0 1 0 0 0 0

Fig. 4. In the figure Λs is the subset of Λ consisting of the predicates in completed form. The bit strings are divided into substrings, each of them corresponding to a predicate in completed form, appearing in the same order as in Λ. weight is not encoded because it is not in completed form. So the first substring of ϕ1 corresponds to the predicate color, the second substring to shape and the third substring to far.

When the system evaluates a formula on an example, each variable in the formula has to be bound to some object in the description of the example. Then the predicates occurring in the formula are evaluated on the basis of the attributes of the objects bound to their variables. A formula is said to be true on an example iff there exists at least one choice of bindings such that all the predicates occurring in the formula are true. The user has to specify how to evaluate the semantics of the predicates before starting to run REGAL on a specific application.
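As a rough illustration of this existential evaluation, the following Python sketch tries every binding of variables to objects. It is not REGAL's matching procedure: the data layout, the `semantics` dictionary and the function `holds` are hypothetical, variables are allowed to be bound to the same object, and negated disjunctions and the symbol ∗ are not handled.

```python
from itertools import product


def holds(formula, example, semantics):
    """True iff some binding of the variables makes every predicate true.

    formula   : list of (predicate name, variable names, allowed values)
    example   : list of objects (here plain dicts of attributes)
    semantics : user-supplied functions computing a predicate's value
    """
    variables = sorted({v for _, args, _ in formula for v in args})
    for objects in product(example, repeat=len(variables)):
        binding = dict(zip(variables, objects))
        if all(semantics[name](*(binding[v] for v in args)) in allowed
               for name, args, allowed in formula):
            return True
    return False


# Hypothetical semantics supplied by the user before the run.
semantics = {
    "color": lambda x: x["color"],
    "far": lambda x, y: abs(x["pos"] - y["pos"]),
}
example = [{"color": "red", "pos": 0}, {"color": "blue", "pos": 2}]

# color(X, [red]) ∧ far(X, Y, [1, 2])
formula = [("color", ("X",), {"red"}), ("far", ("X", "Y"), {1, 2})]
covered = holds(formula, example, semantics)
```

The exhaustive enumeration is exponential in the number of variables, which is one reason why the cost of evaluating a formula matters in this setting.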

The fitness of an individual ϕ is given by the function

$$f(\varphi) = f(z, n_\varphi) = (1 + Az)\, e^{-n_\varphi},$$

where z is a measure of the simplicity of the formula (namely, the number of 1s in the string divided by the length of the string), $n_\varphi$ is the number of negative examples covered by the formula and A is a user tunable parameter with default value of 0.1.

Individuals are selected for reproduction by means of the Universal Suffrage (US) selection mechanism. This selection mechanism works as follows:

1. the operator randomly selects n positive examples $e_i$, 1 ≤ i ≤ n;

2. for each $e_i$ an individual is selected. For all $e_i$ a roulette wheel tournament is performed among individuals covering $e_i$. The winners of each tournament are selected for reproduction. The dimension of the sector associated to each individual is proportional to its fitness. If an example $e_i$ is not covered then a new individual covering $e_i$ is created using a seed operator.
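The fitness function and the two selection steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the callables `covers`, `fitness` and `seed` are hypothetical placeholders for the coverage test, the fitness evaluation and the seed operator, and the n examples are assumed to be drawn without replacement, which the text does not specify.

```python
import math
import random


def regal_fitness(bits, n_neg, A=0.1):
    """f = (1 + A*z) * exp(-n_neg), with z the fraction of 1s in the string."""
    z = sum(bits) / len(bits)
    return (1 + A * z) * math.exp(-n_neg)


def universal_suffrage(population, positives, covers, fitness, seed, n, rng=random):
    """Each of n randomly drawn positive examples selects one individual."""
    selected = []
    for e in rng.sample(positives, n):  # assumption: drawn without replacement
        voters = [ind for ind in population if covers(ind, e)]
        if voters:
            # Roulette wheel among the individuals covering e:
            # sector size proportional to fitness.
            weights = [fitness(ind) for ind in voters]
            selected.append(rng.choices(voters, weights=weights, k=1)[0])
        else:
            # Uncovered example: create a new individual covering it.
            selected.append(seed(e))
    return selected
```

Because every drawn example selects exactly one individual, individuals covering examples that few others cover are not crowded out by fitter individuals covering other examples.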

REGAL adopts four crossover operators: the classical two-point and uniform crossovers, and the generalizing and specializing crossovers. The generalizing crossover works by OR-ing some selected substrings of the parents, while the specializing crossover works by AND-ing them. The probability of applying the two classical crossovers is higher when the two selected individuals have a low fitness. Conversely, the higher the fitness, the more likely it is that one of the other two crossovers is applied. This choice is justified by the observation that the two-point and uniform crossovers have a high exploration power, while the generalizing and specializing crossovers can be used for refining individuals that are already good. The mutation operator is a classical bit mutation operator, and can affect all the bits of the string.
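The generalizing and specializing crossovers can be sketched on the bit strings of figure 4. The code below is an illustrative assumption rather than REGAL's implementation: the helper `combine` and the convention that both offspring receive the combined substring, while the remaining bits are copied from the respective parent, are hypothetical.

```python
def combine(parent1, parent2, slices, selected, op):
    """Apply op bitwise to the selected substrings of the two parents."""
    child1, child2 = list(parent1), list(parent2)
    for i in selected:
        start, end = slices[i]
        merged = [op(a, b) for a, b in zip(parent1[start:end], parent2[start:end])]
        child1[start:end] = merged
        child2[start:end] = merged
    return child1, child2


def generalizing_crossover(p1, p2, slices, selected):
    # OR-ing adds constants to the internal disjunctions (more general).
    return combine(p1, p2, slices, selected, lambda a, b: a | b)


def specializing_crossover(p1, p2, slices, selected):
    # AND-ing removes constants from the internal disjunctions (more specific).
    return combine(p1, p2, slices, selected, lambda a, b: a & b)


# Substring boundaries for color, shape and far in figure 4.
SLICES = [(0, 3), (3, 7), (7, 13)]
phi1 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]
phi2 = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

gen1, gen2 = generalizing_crossover(phi1, phi2, SLICES, selected=[0])
spec1, spec2 = specializing_crossover(phi1, phi2, SLICES, selected=[2])
```

With `selected=[0]` the generalized offspring accept both red and blue; with `selected=[2]` the specialized offspring keep only the value 2 for far.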

A first bias for limiting the hypothesis space is represented by the language template. The set of examples that is assigned to a particular node represents another bias. A node will develop individuals that belong to the species determined by the examples assigned to the node.


Nodal Genetic Algorithm(Node ν)
 1  Initialize the population Pop_ν and evaluate it;
 2  while not solve
 3     do receive µ · |Pop_ν| individuals from the network
 4        and store them in Pop_net;
 5        Select B_ν from Pop_ν ∪ Pop_net with the US;
 6        Recombine B_ν using crossover and mutation;
 7        Update Pop_ν and Pop_net with the
 8        new individuals in B_ν;
 9        Send Pop_net on the network;
10        Send the status to the supervisor;
11        Check for messages from the supervisor;

Fig. 5. Genetic algorithm used by a node ν in the distributed version of REGAL. The algorithm is repeated until the node receives a solve signal from the supervisor. In the algorithm µ is a migration parameter.

In figure 5 a scheme of the genetic algorithm for a node ν is presented. In line 3 the node receives a number of individuals from the network; these individuals are used for avoiding the lethal mating problem (lethal matings are matings bound to produce bad offspring [55]). The execution ends when the node receives a solve signal from the supervisor. During the learning process the supervisor periodically receives and stores the best solution found by each genetic node. From these rules a solution is extracted. For this purpose, first the set $E^+_H$ is constructed as the union of all positive examples covered by the received clauses. The clauses are then sorted in decreasing order according to $\pi(C) = p_C \cdot f(C)$, where $p_C$ is the number of positive examples covered by C. The first n best clauses able to cover $E^+_H$ represent the solution.

5.3. G-NET

G-NET (Genetic Network) [2] is a descendant of REGAL. Like its predecessor, G-NET is a distributed system, with a collection of genetic nodes and a supervisor. However, G-NET differs from REGAL in many respects.

First, G-NET adopts a co-evolution strategy by means of two algorithms. The first algorithm computes a global concept description out of the best hypotheses that emerged in the various genetic nodes. The second algorithm computes the assignment of the positive concept instances to the genetic nodes. The strategy consists of focusing the search on the concept instances that are covered by poor hypotheses, while continuing to refine the other hypotheses.

G-NET is based on the same theory of species and niche formation and on the same representation language adopted by REGAL.

The fitness function used by G-NET is different from the one employed in REGAL. In fact G-NET uses two functions. The first one is used at a global level, while the second is used at a local level, i.e., for evaluating a clause in a genetic node. A global hypothesis H is evaluated in the following way:

$$f_G(H) = MDL_{MAX} - MDL(\neg p_H + n_H) - MDL(H)$$

where $MDL_{MAX}$ is the MDL of the whole learning set, $\neg p_H$ is the number of positive examples not covered by H and $n_H$ is the number of negative examples covered by H.

The formula used for evaluating an individual ϕ at the local level is the following:

$$f_L(\varphi) = MDL_{MAX} - MDL(\varphi) - MDL(\neg p_\varphi) + (f_G(H') - f_G(H))$$

In the above formula H is the current global hypothesis and H' is the hypothesis obtained by adding ϕ to H.

Another difference between REGAL and G-NET lies in the selection operator. In fact, G-NET does not use the universal suffrage operator, but individuals are selected with a fitness proportional selection.

G-NET adopts three kinds of mutation operators. One is used in order to generalize an individual, another one is used for specialization, and a third one is used for creating new clauses, so it can also be seen as a seeding operator. The crossover is a combination of the two-point crossover with a variant of the uniform crossover, modified in order to perform either generalization or specialization of individuals. Both crossover and mutation operators enforce diversity, so that it is assured that a genetic node contains no equal clauses.

5.4. DOGMA

DOGMA (Domain Oriented Genetic MAchine) [27,26] employs two distinct levels. On the lower level the Michigan approach is adopted, while on a higher level the Pittsburgh approach is used.

On the lowest level the system uses fixed length chromosomes, which are manipulated by crossover and mutation operators. On the higher level chromosomes are combined into genetic families, through some special operators that can merge and break families.

The representation language and encoding adopted by DOGMA are equal to those used by REGAL.

The fitness function combines two different functions. One is based on the MDL principle, and the other is based on the information gain measure. The total description length of a hypothesis H consists of the hypothesis cost, i.e., the length of the encoding of H, and the exception cost, which is the encoding of the data that is erroneously classified by H. The unit of length used is a binary digit. To turn the MDL principle (see section 3.3) into a fitness function, the MDL of the current hypothesis H is compared against the total exception length with the following function:

$$f_{MDL}(H, E) = 1 - \frac{MDL(H, E)}{W_\emptyset \times MDL(H_\emptyset, E)}$$

In the above formula, $MDL(H_\emptyset, E)$ stands for the total exception length, i.e., the description length of an empty hypothesis that covers no examples. $W_\emptyset$ is a weight factor that is used to guarantee that even fairly bad hypotheses have a positive fitness. This function alone cannot be used as a fitness function. In fact the function underrates hypotheses that are almost consistent and very incomplete. This would lead to a prevalence of fairly large, but very incomplete clauses. This is because the fitness function would prefer large and incomplete clauses to fairly small but almost consistent clauses. In this way the population would become overly general and very inconsistent. For this reason, the function based on information gain is used.

This function promotes small and almost consistent clauses. The information gain of a hypothesis H compared to another hypothesis $H_{def}$ measures how much information is gained in the distribution of correctly and incorrectly classified positive examples of H compared to the distribution of $H_{def}$. The fitness function based on the information gain uses this gain measure:

$$Gain(H_{def}, H, E) = \log_b(p_H + 1) \times (Info(H_{def}, E) - Info(H, E)),$$

where b ≥ 1 (default value 1.2) and

$$Info(H, E) = -\log \frac{p_H}{p_H + n_H}.$$

The hypothesis $H_{def}$ is a default hypothesis that classifies all examples as positive. The fitness function based on the information gain is then defined as follows:

$$f_G(H, E) = W_G \times \frac{Gain(H_{def}, H, E)}{Gain(H_{def}, H_{max}, E)}$$

where $W_G > 0$ is a tunable parameter and $H_{max}$ is a hypothetical hypothesis that correctly classifies all examples. Finally the fitness function used by DOGMA is the following:

$$f_{MG} = \min(f_{MDL}(H, E), f_G(H, E))$$

which chooses the minimum between $f_{MDL}$ and $f_G$.
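The combination of the two measures can be sketched numerically in Python. The sketch makes several assumptions that are not stated in the text: description lengths are passed in as precomputed numbers, the unspecified logarithm in Info is taken to be the natural logarithm, the default hypothesis covers all positive and all negative examples, the hypothesis that correctly classifies all examples covers all positives and no negatives, and the covered positives and the total number of negatives are both greater than zero. All function names are hypothetical.

```python
import math


def f_mdl(mdl_h, mdl_empty, w_empty):
    """1 - MDL(H,E) / (W_empty * MDL(H_empty,E)); lengths given in bits."""
    return 1 - mdl_h / (w_empty * mdl_empty)


def info(p, n):
    """Info(H,E) = -log(p / (p + n)); requires p > 0."""
    return -math.log(p / (p + n))


def gain(p, n, total_pos, total_neg, b=1.2):
    """Gain of H over the default hypothesis that covers every example."""
    return math.log(p + 1, b) * (info(total_pos, total_neg) - info(p, n))


def f_gain(p, n, total_pos, total_neg, w_g=1.0, b=1.2):
    """Gain of H normalised by the gain of a perfectly classifying hypothesis."""
    best = gain(total_pos, 0, total_pos, total_neg, b)
    return w_g * gain(p, n, total_pos, total_neg, b) / best


def dogma_fitness(mdl_h, mdl_empty, w_empty, p, n, total_pos, total_neg):
    """f_MG: the minimum of the MDL-based and the gain-based fitness."""
    return min(f_mdl(mdl_h, mdl_empty, w_empty),
               f_gain(p, n, total_pos, total_neg))
```

Taking the minimum means that a hypothesis scores well only if both measures agree that it is good, which counteracts the bias of the MDL-based function described above.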

To enhance diversity and to separate different kinds of clauses, the system uses speciation of chromosomes. This can be done randomly or by dividing individuals into species according to which parts of the background knowledge they may use. Speciation has three applications in the system. First, it is used for controlling the mating of chromosomes of different species. Secondly, speciation can control what part of the background knowledge individuals can use. Finally, speciation is used when merging chromosomes into families: chromosomes of the same species cannot be merged in the same family.

DOGMA uses the crossover operators used by REGAL, a classical mutation operator, and a seeding operator which, given a randomly selected example, randomly creates a bit string and then adjusts it in order to cover that example. The remaining operators work on the family level. A break operator randomly splits a family into two separate families. Conversely, a join operator joins two families into one; if the resulting family contains two chromosomes of the same species, one of them is deleted. In addition to these operators, a make-family operator is used for forming families by selecting useful chromosomes of different species from the population. The order in which the operators are applied is given in figure 6.

Make-next-generation(Pop)
    Pop_S  ← Select families in Pop
    Pop_S' ← Mate chromosomes Pop_S
    Pop_X  ← Crossover Pop_S'
    Pop_M  ← Mutate Pop_X
    Pop_B  ← Break families Pop_M
    Pop_J  ← Join families Pop_B
    Pop_U  ← Pop_J ∪ Make families Pop
    Pop_E  ← Evaluate Pop_U
    Pop'   ← Replace families (Pop, Pop_E)
    return Pop'

Fig. 6. Algorithm used by DOGMA for the creation of a new population Pop' starting from an old population Pop.


DOGMA follows the metaphor of competing families by keeping genetic operators, such as crossover and mutation, working on the lower level, and by building good blocks of chromosomes, while lifting selection and replacement to the family level. Fitness is also lifted to the higher level.

5.5. SIA01

SIA01 (Supervised Inductive Algorithm version 01) [3] uses the sequential covering principle developed in AQ [32].

SIA01 adopts a bottom-up approach. In order to create the initial clause, SIA01 randomly chooses an uncovered positive example and uses it as a seed. Then it finds the best generalization of this clause according to some evaluation criterion. This is done by means of a GA.

To obtain a new generation the algorithm applies a genetic operator to each individual in the population, and the newly created individual may then be inserted in the population. The size of the population can grow in this way until a certain bound is reached. A scheme of the GA used for searching the best clause is represented in figure 7.

Unlike REGAL, which adopts a bit string representation for encoding clauses, SIA01 adopts a high level encoding. SIA01 directly uses a FOL notation, by using predicates and their arguments as genes. For instance, the clause Obj(X) ← color(X, blue), shape(X, square), far(X, Y, 2) will be encoded in the following individual:

| Obj | X | color | X | blue | shape | X | square | far | X | Y | 2 |

The fitness function used takes into consideration the consistency of the clause, its completeness, its syntactic generality and some user's preferences:

$$f(\varphi) = \begin{cases} (1 - \alpha - \beta)\, CM + \alpha S + \beta A & \text{if } CN > 1 - N \\ 0 & \text{otherwise} \end{cases}$$

In the above formula CM is the absolute completeness, defined as $p_\varphi / |E^+|$, where $|E^+|$ is the total number of positive examples. CN is the absolute consistency, defined as $(|E^-| - n_\varphi) / |E^-|$, where $|E^-|$ is the total number of negative examples. N is the maximum noise tolerated, S is the syntactic generality of ϕ and A is the clause's appropriateness to the user's preferences. N, A, α and β are user tunable parameters.
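A direct transcription of this fitness function into Python may help to see how the noise threshold acts as a gate. The function name and argument names are hypothetical; the syntactic generality S and the appropriateness A are assumed to be supplied as numbers computed elsewhere.

```python
def sia01_fitness(p, n, total_pos, total_neg, S, A, alpha, beta, N):
    """Fitness of a clause covering p positive and n negative examples.

    S: syntactic generality, A: appropriateness to user preferences,
    alpha, beta: weights, N: maximum noise tolerated.
    """
    cm = p / total_pos                 # absolute completeness
    cn = (total_neg - n) / total_neg   # absolute consistency
    if cn > 1 - N:
        return (1 - alpha - beta) * cm + alpha * S + beta * A
    return 0.0                         # too inconsistent: rejected outright


# Example with made-up values: a consistent clause covering 8 of 10 positives.
good = sia01_fitness(p=8, n=0, total_pos=10, total_neg=10,
                     S=0.5, A=0.0, alpha=0.1, beta=0.1, N=0.1)
# The same clause covering half of the negatives falls below the threshold.
bad = sia01_fitness(p=8, n=5, total_pos=10, total_neg=10,
                    S=0.5, A=0.0, alpha=0.1, beta=0.1, N=0.1)
```

Consistency therefore does not contribute to the score itself; it only decides whether the clause is scored at all.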

GA(SIA01)
    Pop = Seed
    repeat
        for ∀ ϕ ∈ Pop
            do if ϕ has not already produced offspring
                then create 1 offspring by mutation of ϕ
                     create 2 offspring by crossover with ϕ'
                     put the offspring in Pop'
        for ∀ ϕ ∈ Pop'
            do if ϕ ∉ Pop
                then if size(Pop) < max
                        or fitness ϕ is better
                        than the worst fitness in Pop
                     then insert ϕ in Pop
    until fitness(best ϕ) hasn't changed in last n gens

Fig. 7. The scheme of the GA adopted by SIA01. ϕ' is an individual in the population that has not yet produced any offspring.

A mutation operator and two crossover operators are used for creating new individuals. The mutation operator selects a relevant gene and performs one of the following operations:

– if the gene encodes a predicate, then it is replaced with a more general predicate, according to the background knowledge. If it is not possible to generalize any further, then the predicate is dropped. For example, the predicate Pyramid could be changed into Pointed-top without modifying the arguments of the predicate. To this aim, the order of generality among predicates is also stored in B;

– if the gene encodes a numerical constant, then the mutation can create an interval, or, if the gene is already an interval, the operator can enlarge it;

– if the gene encodes a numeric or symbolic constant, then the operator may create an internal disjunction or generalize an existing disjunction;

– if the gene encodes a symbolic constant, this constant can be turned into a variable. This change is propagated to the whole individual;

– if the gene encodes a variable, the operator may replace it with another variable.

The first crossover, which is used by default, is a restrained one-point crossover. The restriction is that the chosen point in the clause has to be before a predicate. If the seed contains only one predicate then the standard one-point crossover is used.

SIA01 is a refinement of the system SIA [55]. Another recently developed system based on SIA is the Extended SIA (ESIA) [31]. ESIA adopts a sequential covering algorithm but learns concepts expressed in propositional logic.

GA(ECL)
 1  Sel = positive examples
 2  repeat
 3     Select partial Background Knowledge
 4     Pop = ∅
 5     while not terminate
 6        do Select n individuals using Sel
 7           for each selected individual ϕ
 8              do Mutate ϕ
 9                 Optimize ϕ
10                 Insert ϕ in Pop
11     Store Pop in Final Population
12     Sel = Sel − {positive examples covered by Pop}
13  until max_iter is reached
14  Extract a solution from Final Population

Fig. 8. The overall learning algorithm ECL.

5.6. ECL

ECL (Evolutionary Concept Learner) [12,11,14] adopts the Michigan approach. Newly created individuals represent a generalization of a most specific clause built from a positive seed example.

In figure 8 a scheme of ECL is given. In the repeat statement the algorithm iteratively constructs a Final Population as the union of max_iter populations (line 11). At the end of the evolution a logic program for the target concept is extracted from Final Population. In order to do so, the most precise clauses in Final Population are repeatedly added to the emerging solution as long as the accuracy of the solution does not decrease. The precision of a clause C is defined as $p_C / (p_C + n_C)$.

The fitness of an individual ϕ is given by the inverse of its accuracy:

$$f(\varphi) = \frac{1}{Acc(\varphi)} = \frac{|E^+| + |E^-|}{p_\varphi + (|E^-| - n_\varphi)}$$

The aim of ECL is to minimize the fitness of individuals.
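As a small worked illustration, the inverse accuracy can be computed as below. The function name is hypothetical, and returning infinity for a clause that classifies no example correctly is an assumption made here to avoid a division by zero.

```python
def ecl_fitness(p, n, total_pos, total_neg):
    """Inverse accuracy of a clause covering p positives and n negatives.

    Lower values are better; 1.0 is the fitness of a clause that covers
    every positive example and no negative example.
    """
    correct = p + (total_neg - n)  # covered positives + uncovered negatives
    if correct == 0:
        return float("inf")        # assumption: worst possible fitness
    return (total_pos + total_neg) / correct


perfect = ecl_fitness(p=10, n=0, total_pos=10, total_neg=10)
half = ecl_fitness(p=5, n=5, total_pos=10, total_neg=10)
```

With ten positive and ten negative examples, a perfect clause has fitness 1.0, while a clause covering half of each class has fitness 2.0.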

The representation used is very similar to the one adopted by SIA01. A clause Obj(X) ← color(X, blue), shape(X, square), far(X, Y, 2) is encoded by the sequence

| Obj, X | color, X, blue | shape, X, square | far, X, Y, 2 |

Individuals are selected with a variant of the US selection operator (see section 5.2) called Exponential Weighted US (EWUS) [13]. In the EWUS, examples that are difficult to cover have higher probabilities of being chosen. The difficulty of an example $e_i$ is determined by the number of individuals that cover $e_i$ (the fewer, the more difficult). Examples are selected with a roulette wheel mechanism, where the dimension of the sector associated to each example is proportional to the difficulty of the example.
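The idea of weighting examples by difficulty can be sketched as follows. The exact weighting used by EWUS is defined in [13] and is not reproduced in this overview, so the formula `exp(-count)` below is purely an illustrative assumption, as are the function names.

```python
import math
import random


def ewus_weights(cover_counts):
    """Assumed weighting: the fewer individuals cover an example, the
    (exponentially) larger its sector on the roulette wheel."""
    return [math.exp(-count) for count in cover_counts]


def select_examples(examples, cover_counts, n, rng=random):
    """Roulette wheel selection of n examples, biased towards difficult ones."""
    return rng.choices(examples, weights=ewus_weights(cover_counts), k=n)


# e1 is covered by no individual, e2 by one, e3 by five.
weights = ewus_weights([0, 1, 5])
picked = select_examples(["e1", "e2", "e3"], [0, 1, 5], n=4, rng=random.Random(1))
```

Each selected example would then select an individual covering it, as in the US operator of section 5.2.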

ECL uses four mutation operators in order to evolve individuals, two for generalization and two for specialization. A clause can be generalized either by deleting an atom from its body or by turning a constant into a variable. With the dual operations a clause can be specialized. Each operator has a degree of greediness, which can be controlled by the user by setting the value of N in the following steps:

1. test N mutation possibilities on C;

2. apply the mutation yielding the best improvement in the fitness to C.

For instance, if a variable Z of a clause has to be turned into a constant, the system may consider the substitutions {Z/a}, {Z/b}, {Z/c}, where a, b, c are possible values for Z, and apply the one yielding the best fitness improvement. When an individual is chosen for mutation, a first (randomized) test decides whether the individual will be generalized or specialized. Next, one of the two operators of the chosen class is randomly applied. If the individual is consistent with the training set, then it is likely that the individual will be generalized. Otherwise it is more probable that a specialization operator will be applied.
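The two greedy steps can be sketched in Python as below. This is a simplified illustration, not ECL's code: clauses are plain tuples, each mutation possibility is a hypothetical function from clause to clause, the N possibilities are assumed to be drawn at random, and the fitness values in the example are made up. Since ECL minimizes fitness, the best mutation is the one with the lowest value.

```python
import random


def greedy_mutate(clause, candidates, fitness, N, rng=random):
    """Test N mutation possibilities on clause and return the best result."""
    tried = rng.sample(candidates, min(N, len(candidates)))
    mutants = [mutation(clause) for mutation in tried]
    return min(mutants, key=fitness)  # lower fitness is better in ECL


def substitution(var, value):
    """Mutation possibility {var/value}: replace a variable by a constant."""
    return lambda clause: tuple(value if t == var else t for t in clause)


clause = ("p", "Z")
candidates = [substitution("Z", v) for v in ("a", "b", "c")]
# Hypothetical fitness values for the three possible results.
made_up_fitness = {("p", "a"): 1.5, ("p", "b"): 1.1, ("p", "c"): 2.0}.get

best = greedy_mutate(clause, candidates, made_up_fitness, N=3)
```

With N = 1 the operator reduces to a purely random mutation, while larger values of N make it greedier at the price of more fitness evaluations.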

After mutation, the individual undergoes an optimization phase. This phase consists in the repeated application of the mutation operators until a maximum number of optimization steps has been reached, or until the fitness of the individual no longer improves. In the latter case the last mutation is retracted. No crossover operator is used. The particular representation used by ECL makes it difficult to design an effective crossover operator. A variant of the uniform crossover operator has been tried; however, the results obtained did not justify its use.

The aim of ECL is to find hypotheses of satisfactory quality, both with respect to accuracy and simplicity, in a short amount of time. For this purpose two mechanisms are used. The first mechanism allows the user to specify the percentage of background knowledge used by the GA (step 3 of figure 8) at each iteration. This is done by using a simple stochastic sampling mechanism: a user tunable parameter p, 0 < p ≤ 1, determines the probability that each fact of the background knowledge has of being selected. This in turn leads to the implicit selection of a subset of the training examples: only examples that can be covered using the chosen part of the background knowledge are used. Individuals are evaluated using the partial background knowledge.

The second mechanism allows the user to control the greediness of the mutation operators, by means of the parameter N, thus controlling the computational cost of the search. Another user tunable bias limits the maximum length of a clause.

5.7. GLPS

[Figure: three AND-OR trees. The tree for cup(X) has an OR root (node 0) with two children: an AND node (node 1, clause C1) with leaves stable(X) (node 3) and liftable(X) (node 4), and the leaf paper_cup(X) (node 2, clause C2). The tree for stable(X) has an AND root (node 0, clause C3) with leaves bottom(X,Z) (node 1) and flat(Z) (node 2). The tree for liftable(X) has an AND root (node 0, clause C4) with leaves has(X,Y) (node 1) and handle(Y) (node 2).]

Fig. 9. A forest of AND-OR trees. The numbers next to each node are the identifier numbers of the nodes.

The Genetic Logic Programming System (GLPS) [56] is a GP system that adopts a Pittsburgh approach. The reproduction phase involves selecting a program from the current population of programs and allowing it to survive by copying it into the new population. The selection is based either on fitness or on a tournament. The system uses crossover to create two offspring from the selected parents. GLPS does not use any mutation operators. After the reproduction and crossover phase, the new generation replaces the old one. Next, GLPS evaluates the population, assigning a fitness value to each individual, and iterates this process over many generations, until a termination criterion is satisfied.

The system adopts a restriction on the representable clauses: function symbols cannot appear in a clause. Logic programs are represented as a forest of AND-OR trees, whose leaves are positive or negative literals containing predicate symbols and terms of the problem domain. For example, figure 9 represents the logic program:

C1 : cup(X) ← stable(X), liftable(X).
C2 : cup(X) ← paper_cup(X).
C3 : stable(X) ← bottom(X, Z), flat(Z).
C4 : liftable(X) ← has(X, Y), handle(Y).

The leftmost tree in figure 9 represents clauses C1 and C2. In fact it can be derived from the tree that X is a cup if either X is stable and liftable or if X is a paper cup.
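The correspondence between the forest and the clauses can be made concrete with a small Python sketch. The nested-tuple layout and the function `clauses` are hypothetical and are not GLPS's internal data structure; the sketch only handles trees shaped like those of figure 9, i.e., an optional OR root whose alternatives are either single literals or AND nodes over literals.

```python
# One tree per head predicate; inner nodes are ("OR" | "AND", children).
FOREST = {
    "cup(X)": ("OR", [("AND", ["stable(X)", "liftable(X)"]), "paper_cup(X)"]),
    "stable(X)": ("AND", ["bottom(X,Z)", "flat(Z)"]),
    "liftable(X)": ("AND", ["has(X,Y)", "handle(Y)"]),
}


def clauses(forest):
    """Read off one clause per alternative of each tree."""
    result = []
    for head, tree in forest.items():
        # An OR root lists alternative bodies; any other tree is one body.
        alternatives = tree[1] if tree[0] == "OR" else [tree]
        for alt in alternatives:
            body = alt[1] if isinstance(alt, tuple) else [alt]
            result.append("%s :- %s." % (head, ", ".join(body)))
    return result


program = clauses(FOREST)
```

Each OR alternative yields one clause, so the first tree produces C1 and C2, and the other two trees produce C3 and C4.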

With this representation, it is not difficult to generate an initial population randomly. A forest of AND-OR trees can be randomly generated and then the leaves of these trees can be filled with literals of the problem. Another way is to generate the initial population using some other systems, like FOIL.

The fitness function used by GLPS is a weighted sum of the total number of misclassified positive and negative examples. The weight is used for dealing with an uneven distribution of positive and negative examples.

An ad-hoc crossover operator is used, which can operate in various modalities:

1. individuals are just copied unchanged to the next generation;

2. individuals exchange a set of clauses;

3. a number of clauses belonging to a particular rule are exchanged between the individuals;

4. a number of literals belonging to a clause are exchanged.

6. Comparison of the Systems

In this section we compare the features of REGAL, G-NET, DOGMA, SIA01, ECL and GLPS.


Table 1

In the table con stands for consistency, com for completeness, sim for simplicity, pref for user's preferences, gen for syntactic generality. MDL is the Minimum Description Length Principle. Gain is the information gain.

Algorithm   Encoding                                   Fitness features                Approach
REGAL       bit strings (needs an initial template)    con + sim                       Michigan
DOGMA       bit strings (needs an initial template)    MDL + Gain                      Michigan & Pittsburgh
G-NET       bit strings (needs an initial template)    2 functions: con + sim + MDL    Michigan
SIA01       high level language representation         con + com + gen + pref          Michigan
ECL         high level language representation         con + com                       Michigan
GLPS        AND-OR trees                               con + com                       Pittsburgh

We do not consider FOIL and Progol because we are interested in the evolutionary approach to ILP. We try to compare the systems with respect to search strategy, representation, fitness function, operators and biases. Moreover, in section 6.6 we discuss the effectiveness of the systems.

Table 1 summarizes the representations of the systems, the properties used in the fitness function and the approach adopted. Table 2 summarizes the genetic and selection operators.

6.1. Search Strategy

The described evolutionary systems, in general, do not follow a specific search approach (top-down, bottom-up), except for SIA01, which adopts a bottom-up approach. All the systems exploit the general-to-specific ordering of hypotheses in some of the genetic operators adopted. The search proceeds by successive generalization and specialization of hypotheses. Moreover, co-evolution strategies and speciation are used in REGAL, G-NET and DOGMA, and implicitly in ECL (with the selection).

6.2. Representation

REGAL, G-NET and DOGMA use the supplied template to map clauses into bit strings. This implies some knowledge of what the user expects to discover, which cannot always be provided. The use of the initial template also imposes another limitation: all the rules handled must follow the initial template, which is constant and cannot change during the learning process. With this approach some problems can also arise when dealing with numerical values. In fact the binary representation of the model can become quite long, and this may slow down the process.

The bit string representation used by these three systems does not allow them to perform some operations that are typical FOL operations, e.g., changing a constant into a variable. The high level representation adopted by SIA01 and ECL is more flexible: the shape of the clauses learned can vary during the learning process. In fact, the particular shape of each clause is determined by the positive example used as seed in the initialization phase. The two systems are also capable of performing some FOL operations, e.g., applying a substitution to a clause.

GLPS does not require an initial template either, so the shape of the initial clauses is not fixed.

6.3. Fitness Function

The systems that adopt the simplest fitness functions are GLPS and ECL. They take into consideration only the completeness and the consistency of individuals. The function adopted by SIA01 is more elaborate. In addition, simplicity is considered, and the user can express preferences for some types of clauses. However, consistency and completeness are the features that have the biggest influence on the fitness function.
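As a rough illustration of a fitness based only on completeness and consistency, consider the sketch below. It assumes that completeness is the fraction of positive examples covered and consistency the fraction of negative examples not covered, and it combines them with an unweighted mean; this combination is an assumption made for the example and is not the actual fitness function of GLPS or ECL.

```python
def completeness(covers, positives):
    """Fraction of positive examples covered by the hypothesis."""
    return sum(1 for e in positives if covers(e)) / len(positives)

def consistency(covers, negatives):
    """Fraction of negative examples NOT covered by the hypothesis."""
    return sum(1 for e in negatives if not covers(e)) / len(negatives)

def fitness(covers, positives, negatives):
    """Hypothetical combination: unweighted mean of the two measures."""
    return 0.5 * (completeness(covers, positives) + consistency(covers, negatives))

# Toy hypothesis over integers: "x is greater than 3".
def covers(x):
    return x > 3

positives = [4, 5, 6, 2]   # three of four are covered
negatives = [1, 2, 5, 0]   # one of four is (wrongly) covered
score = fitness(covers, positives, negatives)
```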

The function used by REGAL considers only simplicity and consistency. Completeness is not considered because the US selection operator already promotes complete individuals. This reduces the complexity of the evaluation. G-NET uses two fitness functions, one used at a local level, in the genetic nodes, and another one used at a global level. G-NET also makes use of the MDL principle in its fitness functions. Unlike REGAL, at a local level G-NET also considers the global behavior of clauses. This is achieved by evaluating how well a clause combines with others in order to form a global solution. This is a good strategy, since G-NET is a distributed system.

Table 2

A summary of the characteristics of the various operators adopted by the presented systems.

Algorithm | Type of crossover                              | Mutation                         | Selection operator
REGAL     | uniform, two-point, generalizing, specializing | classic                          | US
DOGMA     | uniform, two-point, generalizing, specializing | classic                          | based on fitness, lifted to the family level
G-NET     | generalizing, specializing, two-point          | generalizing, specializing, seed | tournament
SIA01     | restrained 1-point, classic 1-point            | 4 generalizing modalities        | select all individuals that have not produced an offspring
ECL       | none                                           | 2 generalizing, 2 specializing   | EWUS
GLPS      | reproduction, exchange info                    | none                             | tournament or fitness proportionate selection

DOGMA combines the information gain and the MDL principle. Information gain is used for promoting small and almost consistent clauses. This is done because using only the MDL would underrate hypotheses that are almost consistent but very incomplete. This would lead DOGMA towards a population with a majority of large and very inconsistent clauses, so that the population would be too general and very inconsistent.

6.4. Operators

REGAL and DOGMA adopt the same operators. Two crossovers are used to generalize and specialize individuals. In addition to these, uniform and two-point crossovers are used in order to make bigger steps in the hypothesis search space.

G-NET introduces some novel ILP operators. Both the crossover and the mutation operators can be used in three modalities. For the crossover these modalities are generalization, specialization and exchange. The latter is implemented by a classical two-point crossover. The generalization modality tends to be used when both parents are consistent; otherwise the specialization modality is more likely to be applied. A similar strategy is applied for the mutation. With this strategy, when a clause is often chosen in a node, the search turns into a stochastic hill climbing.

When these three systems generalize or specialize a rule, what they basically do is drop or add conditions. This is because the systems adopt an internal disjunction in order to define the values that a variable can assume.
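The effect of these operators on a bit string can be sketched as follows. The sketch assumes, as a simplification, that a set bit admits one value in an internal disjunction, that generalization is a bitwise OR of the two parents and specialization a bitwise AND, and that each operator acts on the whole string and returns a single offspring; the function names are illustrative and the actual operators of the three systems differ in detail.

```python
import random

def generalizing_crossover(a, b):
    """Bitwise OR: the offspring admits every value admitted by either parent."""
    return [x | y for x, y in zip(a, b)]

def specializing_crossover(a, b):
    """Bitwise AND: the offspring admits only the values admitted by both parents."""
    return [x & y for x, y in zip(a, b)]

def two_point_crossover(a, b, rng=random):
    """Classical two-point crossover: swap the segment between two cut points."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

parent_a = [1, 0, 1, 0, 1]
parent_b = [0, 0, 1, 1, 1]
more_general = generalizing_crossover(parent_a, parent_b)   # [1, 0, 1, 1, 1]
more_specific = specializing_crossover(parent_a, parent_b)  # [0, 0, 1, 0, 1]
child_1, child_2 = two_point_crossover(parent_a, parent_b, random.Random(0))
```

Setting a bit adds a value to a disjunction and so weakens a condition; once all bits of an attribute are set, the condition is effectively dropped. Clearing bits has the opposite effect.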

SIA01 and ECL can perform some operations which are more FOL oriented. The mutation operator used by SIA01 can perform a variety of operations, e.g., changing a numeric constant into an interval or turning a constant into a variable. An interesting feature is that the operators are designed in a way that guarantees that individuals of the population are syntactically different from each other. ECL uses two generalization and two specialization mutation operators. These operators do not act completely at random, but have a degree of greediness that can be tuned by the user. After an individual has been mutated, an optimization phase is applied to it, as described in section 5.6. Both SIA01 and ECL rely mostly on mutation. This is because the high level representation makes the design of an effective crossover operator difficult.
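Turning a constant into a variable can be sketched on a simple high level clause representation. The sketch assumes that a clause is a list of literals, each a pair of a predicate name and a tuple of arguments, with lowercase strings for constants and uppercase strings for variables; this representation and the function name are illustrative and are not those of SIA01 or ECL. As a simplification, every occurrence of the constant is replaced.

```python
def constant_to_variable(clause, constant, variable):
    """Generalize a clause by replacing every occurrence of `constant` with `variable`."""
    return [
        (pred, tuple(variable if arg == constant else arg for arg in args))
        for pred, args in clause
    ]

# grandparent(ann, carl) :- parent(ann, bob), parent(bob, carl).
clause = [
    ("grandparent", ("ann", "carl")),
    ("parent", ("ann", "bob")),
    ("parent", ("bob", "carl")),
]

# grandparent(ann, carl) :- parent(ann, X), parent(X, carl).
generalized = constant_to_variable(clause, "bob", "X")
```

No analogous operation is available on the fixed bit string encoding, where only the set of admitted values of a predefined condition can change.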

In contrast, GLPS does not make use of any mutation operator, so the reproduction phase is carried out only by the crossover operator, which can exchange rules, clauses or just literals.

6.5. Biases

REGAL, DOGMA and G-NET limit the hypothesis space by means of an initial template. Only clauses that can be derived from the initial template are considered during the search for a satisfying concept. ECL and SIA01 consider clauses limited to the possible generalizations of the initial clause that was built starting from a seed example. Therefore, these systems use different biases for each individual, depending on the example that is used for creating the individual. In this way individuals are not constrained into a fixed shape. Moreover, ECL limits the hypothesis space with the use of partial background knowledge: at each iteration the system focuses only on a part of the hypothesis space. GLPS restricts the search to individuals that can be represented with trees having a maximum depth specified by a user-tunable parameter.

6.6. Effectiveness

Unfortunately some of the described systems are either not available (DOGMA, GLPS, SIA01) or not installable (REGAL², G-NET). For this reason, we can only provide a brief comparison of the performance of the evolutionary systems, based on results taken from various publications.

Results discussed in this section are obtained on the datasets shown in Table 3, where the features of the various datasets are also given. The Type column specifies the kind of dataset, where Prop. stands for propositional. The Examples column shows the number of examples in each dataset, while the Background column shows the background knowledge size, i.e., the number of facts describing the examples. The crx, breast, splice-junction and vote datasets are taken from [7], while the mutagenesis dataset originates from [10]. The mutagenesis and the splice-junction datasets are FOL datasets, while the others are propositional. Table 4 shows the accuracies obtained by the systems on the datasets on which they were tested, as estimated by 10-fold cross validation.
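For readers unfamiliar with the estimation procedure, k-fold cross validation can be sketched as below. This is a generic sketch and not the protocol of any of the cited publications: it assumes contiguous folds without shuffling or stratification, at least k examples, and a hypothetical `train` function that returns a classifier.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(examples, labels, train, k=10):
    """Mean held-out accuracy over k folds; `train` returns a classifier function."""
    accuracies = []
    for fold in k_fold_indices(len(examples), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(examples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        classify = train(train_x, train_y)
        correct = sum(classify(examples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k

# Toy data and a stub learner that always returns the same threshold rule.
examples = list(range(20))
labels = [x >= 10 for x in examples]

def train(train_x, train_y):
    return lambda x: x >= 10

mean_accuracy = cross_validate(examples, labels, train, k=10)
```

With the 188 examples of the mutagenesis dataset, this split would give eight folds of 19 examples and two folds of 18.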

It is interesting to compare the performance of G-NET, REGAL and DOGMA, since these systems are based on the ideas first developed in REGAL. The three systems were tested on the splice-junction dataset. The obtained solutions showed an accuracy of 96.6%, 95.6% and 94.3% respectively. G-NET not only improves the accuracy, but also the number of disjunctions needed for the concept decreased.

² Special thanks to Filippo Neri for his assistance in the attempt to install the system.

Table 3

Features of the datasets.

Dataset          | Type  | Examples | Background
Crx              | Prop. | 690      | 10283
Breast           | Prop. | 699      | 6275
Vote             | Prop. | 435      | 6568
Splice Junctions | ILP   | 3190     | 191400
Mutagenesis      | ILP   | 188      | 13125

We were not able to find any results for SIA01; we could only find some results for ESIA. We compare the effectiveness of ESIA with that of G-NET and ECL on two datasets, crx and breast. On the first dataset ESIA obtained an accuracy of 77.4%, while G-NET reached an accuracy of 84.4% and ECL of 84%. On the breast dataset the three systems obtained the same accuracy, namely 94.7%. ECL and G-NET were also compared on two further datasets: mutagenesis and vote. For the first of these we also report results obtained by Progol. G-NET evolved a hypothesis of accuracy 91.2% for the mutagenesis dataset and ECL one of 90.3%, while Progol found a theory of accuracy 88.2%. G-NET took several hours to evolve a theory for the mutagenesis dataset, while ECL was more efficient, taking ten minutes on average. For the vote dataset the accuracy obtained by G-NET is 95% and that of ECL is 94%. GLPS is directly comparable only to a version of FOIL when learning concepts from noisy data, where it performed better. However, the initial population was initialized with FOIL.

Table 4

Results obtained by the systems on different datasets. Average accuracy is given.

Algorithm | Splice | Crx  | Breast | Vote | Mutagenesis
G-NET     | 96.6   | 84.4 | 94.7   | 95   | 91.2
ECL       | -      | 84   | 94.7   | 94   | 90.3
REGAL     | 95.6   | -    | -      | -    | -
DOGMA     | 94.3   | -    | -      | -    | -
ESIA      | -      | 77.4 | 94.7   | -    | -
Progol    | -      | -    | -      | -    | 88.2

It is difficult, after this kind of comparison, to come out with some conclusion about the