4.2 Experiments in RoboCup Games
4.2.2 Experiments in Keepaway games
We first give details of the game settings, and then present the performances of AARL⁸ and SARSA(λ) against different fixed strategies of takers.
8 Within this chapter, in cases without ambiguity, we use AARL and SARSA(λ)-AARL inter-
Implementation
As mentioned in Section 3.1.1, at each moment, each agent has ‘perfect knowledge’ of the environment. However, this knowledge consists of coordinate locations, and RL cannot work effectively when these locations are used directly as states [SSK05]. Stone et al. [SSK05] proposed state variables to represent each state, as shown in Table 4.3. These state variables are designed not only to describe the situation on the field, but also to facilitate decision making in RL. For example, the distances between the takers and the ball holder are state variables, because the holder can use this information to decide when to pass the ball and to whom. All state variables are designed from the perspective of the ball holder, because the ball holder is the only learner in Keepaway; we say that these state variables are holder-oriented.
The arguments and values we use have been introduced in Example 1 and Example 5 in Chapter 3, respectively. We set the ordering of values (Valpref) as follows:
• HOLD LONG >v LESS INT >v TEAM LONG iff K1 is safe
• LESS INT >v HOLD LONG >v TEAM LONG iff K1 is under threat
• LESS INT >v TEAM LONG >v HOLD LONG iff K1 is in danger
When min_q dist(K1, Tq) > 10, K1 is safe; when 5 < min_q dist(K1, Tq) ≤ 10, K1 is under threat; when 0 < min_q dist(K1, Tq) ≤ 5, K1 is in danger, where q ranges over all takers, i.e. q ∈ {1, · · · , N}.
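The thresholds and value orderings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: positions are taken to be 2-D coordinate pairs, the names safety_level and VALUE_ORDER are hypothetical, and a distance of exactly 0 (which the definition above does not cover) is treated as "in danger".

```python
import math

# Value orderings from the three bullets above, most preferred value first.
VALUE_ORDER = {
    "safe": ["HOLD LONG", "LESS INT", "TEAM LONG"],
    "under threat": ["LESS INT", "HOLD LONG", "TEAM LONG"],
    "in danger": ["LESS INT", "TEAM LONG", "HOLD LONG"],
}


def safety_level(k1, takers):
    """Classify the ball holder K1 by its distance to the closest taker."""
    closest = min(math.dist(k1, t) for t in takers)
    if closest > 10:
        return "safe"
    if closest > 5:
        return "under threat"
    # Assumption: a distance of exactly 0 is also treated as "in danger".
    return "in danger"


if __name__ == "__main__":
    level = safety_level((0.0, 0.0), [(6.0, 0.0), (20.0, 0.0)])
    print(level, VALUE_ORDER[level])
```

Keeping the orderings in a lookup table separates the geometric test from the preference over values, so either can be changed without touching the other.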
Because the duration of an action in Keepaway is variable (e.g. the duration of action PassBall(2) can differ between states, depending on the distance between the ball holder and keeper K2), this game is modelled as an SMDP (see Section 2.2.1). Note that both Algorithms 1 and 8 can be used directly in SMDP problems.
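To illustrate how a variable action duration enters a temporal-difference update, the following is a minimal sketch of a one-step, tabular SARSA update in an SMDP. It is not Algorithm 1 or Algorithm 8: eligibility traces and function approximation are omitted, and the function name, the tabular Q-table and the parameter values are assumptions made for illustration. The only SMDP-specific element is that the discount is raised to the power of the action's duration tau.

```python
from collections import defaultdict


def smdp_sarsa_update(q, s, a, reward, tau, s_next, a_next,
                      alpha=0.1, gamma=1.0):
    """One-step SARSA update for an action that lasted `tau` time steps.

    `reward` is the total reward accumulated while the action was running;
    the value of the next state-action pair is discounted by gamma ** tau.
    """
    target = reward + (gamma ** tau) * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


if __name__ == "__main__":
    q = defaultdict(float)  # hypothetical tabular Q-function
    q[("s1", "hold")] = 2.0
    new_value = smdp_sarsa_update(q, "s0", "pass", reward=3.0, tau=3,
                                  s_next="s1", a_next="hold",
                                  alpha=0.5, gamma=0.9)
    print(new_value)
```

With gamma = 1, the duration only affects the update through the accumulated reward, which is one common choice when the reward itself measures elapsed time.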
Empirical Results in Keepaway
The performances of AARL-based (Algorithm 8) and SARSA(λ)-based (Algorithm 1) keepers against three fixed takers’ strategies are shown in Figures 4.1, 4.2 and 4.3, respectively. Note that the length of each experiment differs across settings: for example, the results in Figure 4.1(a) are averaged over 10 experiments, each lasting 80 hours, while the results in Figures 4.2 and 4.3 are averaged over 40-hour experiments. The reason is that after running for a long time, the platform becomes unstable and may exit unexpectedly, so we could hardly make all experiments last for over 80 hours. From these performances, we can see that at least one AARL algorithm significantly outperforms standard SARSA(λ), regardless of the fixed strategy the takers use and the number of agents involved in the game. However, neither of the two AARL implementations (with grounded extensions or with preferred extensions) is significantly better than the other in all settings. For example, when playing against random takers (Figure 4.1), AARL using preferred extensions outperforms AARL using grounded extensions, while when playing against argument-based takers (Figure 4.3), AARL using grounded extensions outperforms AARL using preferred extensions in 3-Keepaway games. The reason for this mixed result is still unclear and merits further investigation.
Table 4.3: State variables in an N-Keepaway game.
State variable(s)                                Description
dist(Kp, C), p ∈ [1, N + 1]                      Distance between keepers and the centre of the court.
dist(Tq, C), q ∈ [1, N]                          Distance between takers and the centre of the court.
dist(K1, Kp), p ∈ [2, N + 1]                     Distance between K1 and the other keepers.
dist(K1, Tq), q ∈ [1, N]                         Distance between K1 and the takers.
min_{q∈[1,N]} dist(Kp, Tq), p ∈ [2, N + 1]       Distance between Kp and its closest taker.
min_{q∈[1,N]} ang(Kp, Tq), p ∈ [2, N + 1]        The smallest angle between Kp and the takers with vertex at K1.
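The state variables listed in Table 4.3 can be derived from raw coordinates roughly as follows. This is a hedged sketch rather than the platform's implementation: positions are assumed to be 2-D coordinate pairs, the first keeper in the list is assumed to be the ball holder K1, angles are reported in degrees, and the function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def angle_at(vertex, a, b):
    """Unsigned angle in degrees between the rays vertex->a and vertex->b."""
    ax, ay = a[0] - vertex[0], a[1] - vertex[1]
    bx, by = b[0] - vertex[0], b[1] - vertex[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    return abs(math.degrees(math.atan2(cross, dot)))


def state_variables(keepers, takers, centre):
    """Build the holder-oriented state vector, in the row order of Table 4.3.

    Assumption: keepers[0] is the ball holder K1.
    """
    k1, others = keepers[0], keepers[1:]
    features = [dist(k, centre) for k in keepers]        # dist(Kp, C)
    features += [dist(t, centre) for t in takers]         # dist(Tq, C)
    features += [dist(k1, k) for k in others]             # dist(K1, Kp)
    features += [dist(k1, t) for t in takers]             # dist(K1, Tq)
    features += [min(dist(k, t) for t in takers) for k in others]
    features += [min(angle_at(k1, k, t) for t in takers) for k in others]
    return features


if __name__ == "__main__":
    keepers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    takers = [(5.0, 0.0), (5.0, 5.0)]
    print(state_variables(keepers, takers, centre=(5.0, 5.0)))
```

With N takers and N + 1 keepers this yields 6N + 1 variables, e.g. 13 for a game with two takers and three keepers.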
In order to further investigate the learning algorithms’ performances after multiple hours of learning, we present ‘last’ performances of different algorithms against three different takers’ fixed strategies in Tables 4.4, 4.5 and 4.6. Note that we do not present the pairwise p-values⁹ in these tables, but we refer to them
9 In statistics, the p-value is a function of the observed sample results (a statistic) that is used for testing a statistical hypothesis. If the p-value is equal to or smaller than the significance level (usually 0.05), it suggests that the observed data are inconsistent with the assumption that the null hypothesis is true, and thus the null hypothesis is rejected in favour of the alternative hypothesis. All p-values presented in this chapter, unless stated otherwise, are computed by
rithms do not have significant differences (p-value: 0.48). The reason for the mixed results for AARL-grounded is still unclear. However, by comparing these learning algorithms’ performances with that of the hand-coded keeper’s strategy (see Tables 4.1 and 4.2), we can see that all learning algorithms perform significantly better than the hand-coded strategy (p-values are all smaller than 0.01). This result indicates that when playing against some simple and easy-to-predict strategies, RL-based learning algorithms can achieve better performances than sophisticated hand-coded strategies.
From Table 4.6, we see that in both 2- and 3-Keepaway games, both AARL-preferred and AARL-grounded significantly outperform SARSA(λ) (p-values are all smaller than 0.01), and the performances of these two AARL implementations do not differ significantly (p-value: 0.17). However, by comparing these learning algorithms’ performances with that of the hand-coded keeper’s strategy (see Tables 4.1 and 4.2), we see that all these learning algorithms perform significantly worse than the hand-coded strategy. This result indicates that when playing against a sophisticated fixed strategy, SARSA(λ)-based learning algorithms still cannot achieve the same performance as hand-coded strategies.
Table 4.4: Performances (average episode durations in seconds ± standard errors) of learning keepers playing against random takers after several hours of learning. All performances are averaged over 100 episodes (10 episodes per experiment, 10 experiments for each algorithm).
Learning algorithm               After 80 hours’ learning
2-Keepaway, SARSA(λ)             19.09 ± 0.12
2-Keepaway, AARL-preferred       19.43 ± 0.06
2-Keepaway, AARL-grounded        19.46 ± 0.10
Learning algorithm               After 60 hours’ learning
3-Keepaway, SARSA(λ)             13.34 ± 0.05
3-Keepaway, AARL-preferred       14.49 ± 0.06
3-Keepaway, AARL-grounded        13.37 ± 0.13
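Entries of the form "mean ± standard error" such as those in Table 4.4 can be computed from a list of episode durations as sketched below. The function name and the sample durations are hypothetical and are not data from the experiments; the sketch assumes the standard error is the sample standard deviation divided by the square root of the number of episodes.

```python
import math
import statistics


def mean_and_standard_error(durations):
    """Return (mean, standard error of the mean) of episode durations."""
    n = len(durations)
    mean = statistics.fmean(durations)
    # Sample standard deviation (n - 1 in the denominator) over sqrt(n).
    sem = statistics.stdev(durations) / math.sqrt(n)
    return mean, sem


if __name__ == "__main__":
    # Hypothetical episode durations in seconds, for illustration only.
    sample = [18.0, 19.0, 20.0, 19.0]
    mean, sem = mean_and_standard_error(sample)
    print(f"{mean:.2f} ± {sem:.2f}")
```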