Learning and evolution - Evolutionary control of autonomous underwater vehicles

In biological organisms, evolution and learning are two fundamental forms of adaptation that occur over different time frames (Nolfi and Floreano, 1999). Evolution occurs over several generations of organisms and permits a species to adapt to long-term environmental changes.

Conversely, learning occurs within the lifetime of a single organism and permits the organism to adapt to short-term environmental changes.

In the field of ANNs, learning is the process through which the connection weights of a NN are calculated. It is defined as “a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded” (Haykin, 1999, p. 50). Thus, the connection weights represent stored knowledge, which is defined as information used to “interpret, predict, and appropriately respond” (Haykin, 1999, p. 23) to the environment.

Three major learning paradigms model the environment in which learning occurs:

Supervised learning: In supervised learning, an agent learns by emulating a teacher. The teacher has knowledge of the environment represented by input-output examples known as the training set. These examples are presented to the agent, which is adjusted based on the error between the output of the teacher and the output of the agent for a particular input. The goal is to learn the input-output mapping of the environment such that the agent can provide the correct output when given some input once the teacher is removed. Examples of supervised learning include pattern recognition and function approximation.

Unsupervised learning: In unsupervised learning, an agent learns by discovering patterns present in the environment. The agent receives only inputs, because no teacher is present to provide target outputs and no rewards are received from the environment. Instead, the agent is updated according to rules that specify what aspects of the inputs should be captured in the outputs. The goal is to learn an input-output mapping that captures the desired characteristics of the environment. Clustering is an example of unsupervised learning.

Reinforcement learning: In Reinforcement Learning (RL), an agent learns by interacting with its environment in a sequence of discrete steps. At each step the agent senses the state of the environment and takes some action. As a result of this action the state of the environment changes and the agent receives some reward that measures the desirability of this new state. The goal is to learn a mapping of states to actions, called a policy, that maximises the long-term reward received by an agent. Examples of RL include robot control and game playing.
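The error-driven adjustment described for supervised learning above can be illustrated with a minimal sketch. It assumes a hypothetical teacher (the mapping y = 2x + 1), a single linear neuron as the agent, and the delta rule as the update; none of these choices come from the thesis, and the learning rate and epoch count are arbitrary.

```python
# Minimal sketch of supervised learning with the delta rule.
# The teacher, the single-neuron agent, and all constants are
# illustrative assumptions, not taken from the thesis.

def teacher(x):
    """Hypothetical teacher: the target input-output mapping."""
    return 2.0 * x + 1.0

def agent_output(weights, x):
    """Single linear neuron: weights = [w, b]."""
    w, b = weights
    return w * x + b

def total_squared_error(weights, inputs):
    """Sum of squared differences between teacher and agent outputs."""
    return sum((teacher(x) - agent_output(weights, x)) ** 2 for x in inputs)

def train(inputs, learning_rate=0.05, epochs=200):
    """Adjust the agent after each example, in proportion to its error."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x in inputs:
            error = teacher(x) - agent_output(weights, x)
            weights[0] += learning_rate * error * x  # adjust connection weight
            weights[1] += learning_rate * error      # adjust bias
    return weights

training_inputs = [0.0, 1.0, 2.0, 3.0]
initial_error = total_squared_error([0.0, 0.0], training_inputs)
learned = train(training_inputs)
final_error = total_squared_error(learned, training_inputs)
print(initial_error, final_error)
```

Once training has finished the teacher is no longer needed: `agent_output(learned, x)` can be queried for inputs that were never presented.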

Note that while this thesis is primarily concerned with Reinforcement Learning (RL) problems, it does not apply a RL technique.

Generalisation refers to the ability of a NN to produce reasonable outputs for inputs that were not present in the training set. This is an important concept because it is hoped that generalisation reduces the amount of training required by a NN to learn optimal policies.

A NN is said to be overfitted if it matches the training set very accurately, but generalises poorly. In this case the NN has learnt a feature that is present in the training set, but is not true of the underlying function to be modelled.
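The distinction between fitting the training set and generalising can be made concrete with a small sketch. It does not use a NN: as a stand-in for an overfitted model it assumes a memoriser that returns the stored output of the nearest training input, and compares it with a least-squares line. The underlying function (y = 2x) and the alternating noise are illustrative assumptions.

```python
# Sketch: a model that matches the training set exactly can still
# generalise worse than a simpler one. All data here are hypothetical.

def true_function(x):
    return 2.0 * x

# Training outputs carry noise that alternates between +0.5 and -0.5.
train_x = [float(i) for i in range(10)]
train_y = [true_function(x) + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]

# Test inputs were not present in the training set; outputs are noise-free.
test_x = [x + 0.4 for x in train_x]
test_y = [true_function(x) for x in test_x]

def memoriser(x):
    """Stand-in for an overfitted model: copy the nearest training output."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train_x, train_y)

def line(x):
    return slope * x + intercept

def mse(model, xs, ys):
    """Mean squared error of a model over a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

print("memoriser:", mse(memoriser, train_x, train_y), mse(memoriser, test_x, test_y))
print("line:     ", mse(line, train_x, train_y), mse(line, test_x, test_y))
```

The memoriser has zero error on the training set because it has captured the noise, which is a feature of the training set rather than of the underlying function. On the unseen inputs its error is larger than that of the fitted line.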

2.4.1 Evolutionary learning

Even though NE was inspired by biological evolution, it is typically applied to learning problems in a manner more similar to that of a learning algorithm. As learning is defined as a process that adapts the free parameters of a NN, the evolution of connection weights or topology with connection weights emulates a learning algorithm. This process can, therefore, be described as evolutionary learning and, as such, the terms learn and evolve can be used interchangeably.

NE begins to distinguish itself from other learning algorithms when it is used to evolve topology without connection weights, which are learnt in an independent process, and when it is used to evolve learning rules.

NE does not fit neatly into any of the three major learning paradigms. However, it can be applied to problems normally associated with all three paradigms if a suitable fitness function can be defined. For example, NE can be used for supervised learning problems by setting the fitness of a solution proportional to the error between the output of the evolved NN and that of the teacher.
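The supervised learning example above can be sketched as follows. This is not the NE method used in the thesis: it assumes a hypothetical teacher (y = 3x - 1), a genome consisting of the two weights of a single linear neuron, truncation selection, and Gaussian mutation. The error is treated as a cost, so a lower value means a fitter solution.

```python
import random

# Sketch of evolving connection weights on a supervised problem.
# Teacher, genome layout, and all parameters are illustrative assumptions.

def teacher(x):
    """Hypothetical teacher providing the target outputs."""
    return 3.0 * x - 1.0

INPUTS = [-1.0, 0.0, 1.0, 2.0]

def network_output(genome, x):
    """Single linear neuron: genome = [weight, bias]."""
    w, b = genome
    return w * x + b

def error(genome):
    """Fitness measure: squared error against the teacher (lower is better)."""
    return sum((teacher(x) - network_output(genome, x)) ** 2 for x in INPUTS)

def evolve(generations=100, population_size=20, sigma=0.3, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
                  for _ in range(population_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        # Truncation selection: the better half survives unchanged.
        parents = population[:population_size // 2]
        # Each child is a mutated copy of a randomly chosen parent.
        children = [[gene + rng.gauss(0.0, sigma) for gene in rng.choice(parents)]
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return min(population, key=error), history

best, history = evolve()
print(best, error(best))
```

Note that the only feedback the algorithm uses is the scalar returned by `error`. The per-example errors are never used to adjust individual weights, which is the main difference from the error-correction update of a conventional supervised learning algorithm.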

2.4.2 Comparison between evolutionary and reinforcement learning

A brief comparison between NE and RL shall be presented before continuing. This comparison will permit greater insight into the advantages and disadvantages of NE as a technique for the problems considered in this thesis. Furthermore, there appears to be confusion in some previously published work with respect to the relationship between NE and RL.

NE and RL share many common characteristics. They both involve learning while interacting with the environment and are applied to the same class of problems where a policy consisting of a sequence of actions must be found that maximises some long-term reward. Such problems are typically described as being RL problems. However, despite the apparent similarities between NE and RL, they are different and, while NE may be applied to RL problems, it is not a RL method.

Sutton and Barto (1998) discussed the differences between RL and evolutionary methods in terms of trial-and-error learning. Consider the major elements of trial-and-error learning:

Selectional: Try alternative actions and select among them by comparing consequences.

Associative: Associate actions found by selection with particular situations.

Evolutionary methods are selectional, but not associative. They select entire policies based on a fitness score, which is a single scalar value. The states experienced and actions taken by a solution are not considered beyond the calculation of the fitness score. So, specific actions cannot be associated with specific states. Conversely, RL is both selectional and associative. It maps each state or state-action pair to an immediate reward and assigns values based on the expected rewards from states that follow. Actions are then selected to maximise the expected future reward. For these reasons Sutton and Barto (1998) concluded that evolutionary methods are not well suited to RL problems where state information is available. Nevertheless, NE has been applied to RL problems with some success.
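The contrast between the two views can be sketched on a toy problem. The five-state chain, the reward of 1 for reaching the final state, and the use of a tabular Q-learning update as the representative RL rule are all assumptions made for illustration; they are not taken from the thesis or from Sutton and Barto (1998).

```python
# Sketch: what an evolutionary method sees versus what an RL method sees.
# The environment and all parameters are hypothetical.

N_STATES = 5
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1

def step(state, action):
    """Move along the chain; reward 1.0 only on reaching the goal state."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

def episode_fitness(policy, max_steps=20):
    """Evolutionary view: a whole policy is reduced to one scalar.

    The visited states and chosen actions are discarded once the
    total reward has been accumulated.
    """
    state, total = 0, 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, policy[state])
        total += reward
        if done:
            break
    return total

def q_learning_episode(q, policy, alpha=0.5, gamma=0.9, max_steps=20):
    """RL view: a value is updated for every state-action pair visited."""
    state = 0
    for _ in range(max_steps):
        action = policy[state]
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * max(q[next_state]))
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
        if done:
            break

always_right = [RIGHT] * N_STATES
print(episode_fitness(always_right))  # one number for the entire policy

q_table = [[0.0, 0.0] for _ in range(N_STATES)]
q_learning_episode(q_table, always_right)
print(q_table)  # one value per state-action pair
```

After a single episode the fitness score reports only that the goal was reached, whereas the value table already records that moving right from the state next to the goal is what produced the reward.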

A number of comprehensive comparisons between NE and RL have been published using the inverted pendulum problem (Whitley et al., 1993; Moriarty and Miikkulainen, 1996; Gomez et al., 2008). These studies show that NE methods can find solutions in this problem domain with fewer function evaluations than RL methods. In particular, NE is shown to significantly outperform RL for variations of the inverted pendulum problem that contain hidden state. The inverted pendulum problem is discussed further in Chapter 5.

A limited number of direct comparisons between NE and RL have been published in other domains. These studies have commonly used some form of Temporal Difference (TD) learning, which represents one of the fundamental classes of RL methods (Sutton and Barto, 1998).

Taylor et al. (2006) compared NEAT with the TD method Sarsa on the keepaway robot soccer task. The results showed that NEAT was able to find better policies than Sarsa, but required more evaluations to do so. Lucas and Kendall (2006) provide an overview of comparisons conducted by several authors using games, which will not be discussed in detail here.

Overall, no firm conclusions can be drawn regarding the relative performance of NE and RL due to the limited number of problem domains studied. However, the results do suggest that for some problems NE methods can produce better solutions than RL. Future comparisons should aim not to determine which method provides the best performance overall, but rather to identify the properties of the problems for which each method is best suited. This will provide a greater understanding of the strengths and weaknesses of these approaches.
