4. The Robot Problem

Standard Q(λ)

Relative speed     Average    Caught    Crash    Timed-out
of target robot    payoff
1.0                0.978      963       37       0
0.9                0.981      970       30       0
0.7                0.985      974       24       2
0.5                0.992      984       13       3

Table 4.9: Results for 1000 trials of catcher robot as the relative speed of the target robot is varied. The maximum number of steps before time-out is 500.

Figure 4.14: Sample trajectories of the two robots: +++ marks the trajectory of the target robot; the other trace is the trajectory of the catcher robot. Left: The target robot is travelling at half the speed of the catcher. Right: Example when the target is travelling at 9/10 of the speed of the catcher robot.

4.5.1 Policy Limitations

The robot has the capability to reach the goal on every trial, but fails to do so. The failures are of two kinds: time-outs caused by looping (the robot dodges an obstacle and ends up back at a point it visited before), and occasional crashes with obstacles, which occur because of the limited visual field of the robot's range sensors. Both problems stem from the purely reactive behaviour of the robot: it has no memory of situations that have happened previously, or of the number of steps it has taken. So, for example, a wall beside the robot that has fallen behind its forward sensor arc will not be remembered, and thus the robot may turn and hit it.

Therefore, the overall ability of the control system as presented is limited by being purely reactive. One method to produce a robot capable of dealing with more complex environments (such as non-convex obstacles and mazes) would be to use a more hierarchical approach. This would involve separate Q-learning modules being taught to deal with different tasks, and then training the system to choose between them based on the situation (Lin 1993a, Singh 1992).

4.5.2 Heuristic Parameters

There are several parameters that must be set in order to use the reinforcement learning methods presented in this chapter. The discount factor γ, the learning rate, the TD-learning parameter λ, and the exploration temperature T must all be set, and poor choices can result in the system failing to converge to a satisfactory policy. The difficulty is that these values are all heuristic in nature and currently need to be selected based on rules of thumb rather than strict scientific methods.

The contour plots of Figs. 4.2, 4.3, 4.6, and 4.7 show how the choice of learning rate and the TD-learning parameter λ can affect the subsequent success or failure of the system to converge to a successful solution. Some values simply result in very slow convergence times, others in complete failure to learn a successful policy. This is because of the generalisation property of MLPs, which means that information can be `forgotten' as well as learnt. If the parameters chosen during training are unsuitable, the robot will forget information as fast as it learns it and so be unable to converge on a successful solution. This is why no proofs yet exist regarding the convergence of Q-learning or TD-algorithms for connectionist systems.

Consequently, it is desirable to use training methods that are less sensitive to the choice of training parameters, to avoid the need to perform repeated experiments to establish which values work best. The results presented in the last section suggest that on-line updates and the use of Modified Q-Learning or Q(λ), as opposed to standard Q-learning updates, help reduce this sensitivity.
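The difference between the update rules is not restated in this section. As a minimal sketch, assuming that Modified Q-Learning bootstraps from the Q-value of the action actually selected at the next step (the rule widely known as SARSA) while standard Q-learning bootstraps from the greedy action, the one-step targets can be compared as follows. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def q_learning_target(payoff: float, next_q_values: Sequence[float],
                      gamma: float = 0.99) -> float:
    """One-step target for standard Q-learning: bootstrap from the greedy action."""
    return payoff + gamma * max(next_q_values)


def modified_q_target(payoff: float, next_q_values: Sequence[float],
                      next_action: int, gamma: float = 0.99) -> float:
    """One-step target assumed for Modified Q-Learning: bootstrap from the
    action that the exploring policy actually selected at the next step."""
    return payoff + gamma * next_q_values[next_action]


# Hypothetical Q-value estimates for three actions in the next state.
next_q = [0.2, 0.7, 0.4]

# The two targets coincide when the selected action is the greedy one
# and differ when an exploratory action was taken instead.
print(q_learning_target(0.0, next_q))
print(modified_q_target(0.0, next_q, next_action=1))
print(modified_q_target(0.0, next_q, next_action=2))
```

Under this assumption the two rules differ only on exploratory steps, so the gap between them is largest while exploration is high.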

The value of the discount factor, γ, was fixed at 0.99 throughout the experiments presented. Discounting was used so that the system would converge to solutions which used the fewest steps to reach the goal, but the value needed to be close to 1 so that the discounted payoffs seen at states many steps from the goal would be of a reasonable size. With no discounting, the robot can arrive at solutions that reap high final payoffs, but do not use efficient trajectories (and hence the robot is often timed-out). To illustrate this, Fig. 4.15 shows the training curves for two robots trained with on-line Modified Q-Learning with and without discounting. As can be seen, the undiscounted robot does considerably worse, especially in the average number of steps taken per trial, despite the fact that there is only a 1% difference in the updates being made at each time step.
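The effect of the discount factor on distant states can be illustrated numerically. The sketch below assumes, purely for illustration, that the only payoff is a single final payoff received at the goal and that all intermediate payoffs are zero; the function name is hypothetical.

```python
def discounted_payoff(final_payoff: float, steps_to_goal: int,
                      gamma: float = 0.99) -> float:
    """Discounted value of a final payoff seen from a state that is
    `steps_to_goal` steps away, assuming zero intermediate payoffs."""
    return (gamma ** steps_to_goal) * final_payoff


# With gamma = 0.99 a shorter trajectory is always worth more than a longer one.
for steps in (10, 100, 500):
    print(steps, discounted_payoff(1.0, steps))

# With gamma = 1.0 every trajectory length yields the same value, so nothing
# in the return favours reaching the goal quickly.
for steps in (10, 100, 500):
    print(steps, discounted_payoff(1.0, steps, gamma=1.0))
```

With γ = 0.99 a payoff of 1 is still worth roughly 0.37 from 100 steps away, which is why a value close to 1 keeps distant payoffs at a usable size, whereas a much smaller γ would shrink them towards zero.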

Thrun and Schwartz (1993) provide limits for γ based on the trial length and number of actions available to a system, assuming one-step Q-learning is being used, but more general results are as yet unavailable. Also, an alternative to Q-learning called R-learning (Schwartz 1993) has been suggested, which eliminates the discount factor altogether by trying to learn undiscounted returns. However, results presented by Mahadevan (1994) showed that Q-learning outperformed R-learning in all the tasks he examined.

Finally, some experiments have shown that the convergence of the neural networks relies heavily on the exploration used at each stage of learning. If it is too low early on then the robot cannot find improved policies, whilst if it is too high at a later stage then the randomness interferes with the fine tuning required to have reliable goal-reaching policies. When using a Boltzmann distribution, therefore, the rate of convergence is dependent on the rate of reduction of the temperature T.
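The role of T can be sketched with a standard Boltzmann (softmax) action selector. This is an illustrative sketch rather than the exact scheme used in the experiments: the geometric cooling schedule and its constants (`t_start`, `t_end`, `decay`) are assumptions, as are all function names.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probabilities(q_values: Sequence[float], temperature: float) -> List[float]:
    """Action probabilities proportional to exp(Q / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q_max = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values: Sequence[float], temperature: float, rng=random) -> int:
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def annealed_temperature(trial: int, t_start: float = 1.0,
                         t_end: float = 0.01, decay: float = 0.99) -> float:
    """Hypothetical geometric cooling schedule, floored at t_end."""
    return max(t_end, t_start * decay ** trial)


q = [0.2, 0.5, 0.9]  # hypothetical Q-value estimates for three actions
print(boltzmann_probabilities(q, temperature=100.0))  # close to uniform
print(boltzmann_probabilities(q, temperature=0.01))   # close to greedy
print(select_action(q, annealed_temperature(trial=0)))
```

A high T gives near-uniform exploration and a low T gives near-greedy selection, so the speed at which T is reduced determines how quickly the policy moves from finding improvements to fine tuning them.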

4.5.3 On-line v Backward-Replay

The results of the tests for using on-line updating compared to backward-replay are interesting: on-line updating consistently performs more successfully over a wider range of training parameters for all 3 update rules, with the most marked difference in performance


In document PROBLEM SOLVING WITH REINFORCEMENT LEARNING Gavin Adrian Rummery pdf (Page 77-79)