Elitism - Radial Basis Function - Reinforcement learning in continuous state- and action-space


4.2.7 Elitism

The genetic operations applied to individuals in GP are guided by fitness and will therefore tend to create individuals with higher fitness on the task to which GP is applied. However, the fittest individuals may not be reproduced in the next population, and even if they are selected for crossover, the resulting individuals may have much lower fitness.

To overcome this problem, elitism may be applied, which, as with GA, ensures that the fittest individual(s) from the current generation are copied to the next generation.
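As an illustration of the idea, the following is a minimal Python sketch of elitism, not the implementation used in this thesis. The names `next_generation`, `toy_fitness` and `toy_breed` are hypothetical, individuals are plain numbers rather than GP trees, and `toy_breed` merely stands in for the usual selection, crossover and mutation operations.

```python
import random


def next_generation(population, fitness, breed, n_elite=1):
    """Return a new population of the same size, keeping the n_elite fittest."""
    # Rank individuals from fittest to least fit.
    ranked = sorted(population, key=fitness, reverse=True)
    # Elitism: copy the best individuals across unchanged.
    # (For mutable individuals such as GP trees, a deep copy would be needed.)
    elite = ranked[:n_elite]
    # Fill the remaining slots with offspring from the usual genetic operations.
    offspring = [breed(population) for _ in range(len(population) - n_elite)]
    return elite + offspring


def toy_fitness(x):
    # Hypothetical task: fitness peaks at x = 3.
    return -(x - 3.0) ** 2


def toy_breed(population):
    # Hypothetical stand-in for selection, crossover and mutation.
    a, b = random.sample(population, 2)
    return 0.5 * (a + b) + random.gauss(0.0, 1.0)


random.seed(0)
population = [random.uniform(-10.0, 10.0) for _ in range(20)]
best_before = max(map(toy_fitness, population))
population = next_generation(population, toy_fitness, toy_breed, n_elite=2)
best_after = max(map(toy_fitness, population))
```

Because the elite are copied unchanged, the best fitness in the population cannot decrease from one generation to the next, whatever the offspring turn out to be.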

4.3 Chapter Summary

This chapter was a brief introduction to evolutionary algorithms in the forms of GA and GP, both of which are applied to RL problems later in the thesis. There are many variations of these techniques including alternative crossover and mutation operations; however, this limited presentation describes the fundamentals of the techniques.

GA is often applied as a global optimisation method and has been applied to finding the weights of artificial neural networks, such as those introduced in Chapter 3. When this approach is applied to the solution of an RL problem, such methods can be seen as an example of direct policy search (described in Section 5.2).

GP is also a well-known approach to direct policy search [30, 53]; it is described in Section 5.2.3, and an example of its application to the Acrobot swing-up problem is presented in Section 7.1.

Chapter 5

Continuous State- and Action-Space State-of-the-art

Here we elaborate on the problems of continuous state- and action-spaces, described briefly in Chapter 2, and present the current state-of-the-art methods applied to solving such problems.

When problems have small, discrete state- and action-spaces, the value function Q : S × A → ℝ, which gives the expected value of taking a given action from a given state, can be stored in a look-up table. However, if |S| is too large or, in the continuous state-space setting, infinite, this is no longer feasible, which leads to two problems:

1. it would be impossible to store all values in a look-up table

2. the agent would be unable to visit all states and actions sufficiently often to find the correct values

When this is the case, but the action-space is discrete and sufficiently small, we can solve both of these problems by utilising function approximation, such as artificial neural networks (Chapter 3), to approximate the state-action value function Q(s, a). Due to the generalisation capabilities of such function approximation methods, we are able to estimate the values of unvisited states based on the currently learnt values of other states. Updating these values also improves the approximation for similar states and actions. Furthermore, as the function is approximated using a small number of parameters, it can be stored in memory. The optimal action arg max_a Q(s, a) can then be selected by evaluating:

Q(s, a), ∀a ∈ A (5.1)

which is often the favoured approach when it is practical to do so [21].
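A minimal Python sketch of this selection step follows, assuming a small discrete action set. The function `greedy_action` and the stand-in approximator `toy_q` are hypothetical names introduced only for illustration; in practice `q` would be the trained function approximator.

```python
def greedy_action(q, state, actions):
    """Return arg max_a Q(state, a) by evaluating q once for every action."""
    return max(actions, key=lambda a: q(state, a))


def toy_q(state, action):
    # Hypothetical approximator: the value is highest when action = -state.
    return -(state + action) ** 2


actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
best = greedy_action(toy_q, 0.4, actions)
```

Each call evaluates the approximator once per action, so the cost of selecting an action grows linearly with the number of actions in A.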

However, when the action-space A is large or continuous, this approach is no longer directly applicable, as it would be impossible to evaluate so many possible actions at every time-step. There are, however, several approaches which either facilitate the application of this method through discretization, or avoid the problem by storing an approximation of the policy function. The latter can be done either alongside the value function or by directly optimizing the policy function without approximating the value function. The state-of-the-art approaches we discuss here can be classified as follows:

• actor-critic:

– adaptive critic [48]

– CACLA [56]

• direct policy search:

– genetic algorithm based direct policy search [16]

– genetic programming based direct policy [12]

– policy gradient [46]

• implicit policy methods:

– action-space discretization [47]

– gradient based action selection [30, 47]

• other methods:

– k-nearest neighbours [23]

– wire fitting [2]

In the remainder of this chapter the above approaches will be presented individually, along with the advantages and disadvantages of each, followed by a summary at the end of the chapter.

5.1 Actor-Critic

Actor-Critic (AC) methods [21, 55] require two function approximators: the critic stores the value function (either Q(s, a) or V(s) may be used) whilst the actor stores the policy function π : S → A, i.e. which action to take given the current state. This leads to very fast selection of actions and allows the full range of continuous actions to be selected. However, it also leads to many more parameters which must be tuned, in the form of the function approximator parameters, and two function approximators which must be trained.


Figure 5.1: Diagram of the Actor-Critic architecture, showing how the actor and critic interact with the environment and each other. (Diagram elements: Actor, Critic, Environment; arrow labels: action, state, (state, reward), feedback.)

At each time step the actor selects an action and applies it to the environment, after which it receives the next state s′. At the same time the critic receives s′ and the reward r, which are used to update the critic's estimate of the value function and to send a feedback value to the actor, from which the actor updates the policy function. A graphical representation of this interaction between actor, critic and environment is shown in Fig. 5.1.
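The interaction described above can be sketched in Python as follows. This is a hedged illustration under several assumptions that are not taken from the text: the critic stores V(s), the feedback value is the temporal-difference error, the actor explores with Gaussian noise, both approximators are tiny linear models over a scalar state, and `toy_env_step` is a made-up environment. All names are hypothetical.

```python
import random


class LinearModel:
    """Tiny linear function approximator: output = w * x + b (scalar state)."""

    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, error, lr):
        # Gradient step that moves the output for x in the direction of error.
        self.w += lr * error * x
        self.b += lr * error


def actor_critic_step(actor, critic, env_step, state, gamma=0.9,
                      lr_actor=0.01, lr_critic=0.1, sigma=0.1):
    # The actor selects an action (with Gaussian exploration) and applies it.
    mean = actor.predict(state)
    action = mean + random.gauss(0.0, sigma)
    next_state, reward = env_step(state, action)
    # The critic uses s' and r to compute the TD error and update V(s).
    td_error = (reward + gamma * critic.predict(next_state)
                - critic.predict(state))
    critic.update(state, td_error, lr_critic)
    # Feedback to the actor (assumed here to be the TD error): reinforce the
    # explored direction when the outcome was better than expected.
    actor.update(state, td_error * (action - mean), lr_actor)
    return next_state, td_error


def toy_env_step(state, action):
    # Hypothetical environment: state is clipped to [-1, 1] and the reward
    # is highest when the state is at zero.
    next_state = max(-1.0, min(1.0, state + action))
    return next_state, -next_state ** 2


random.seed(0)
actor, critic = LinearModel(), LinearModel()
state = 0.8
for _ in range(50):
    state, td_error = actor_critic_step(actor, critic, toy_env_step, state)
```

Note that both models are updated at every step, which reflects the cost mentioned earlier: two function approximators, each with its own learning rate, must be trained and tuned.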

Two specific examples of AC implementations which have been applied to continuous state- and action-space problems are the adaptive critic [48] and CACLA [55], each of which is described in more detail below.
