4.2 Experiments in RoboCup Games
4.2.3 Experiments in Takeaway games
We first give details of the Takeaway game settings, and then present the performances of AARL and SARSA(λ) against the keepers’ fixed strategies.
Table 4.5: Performances (average episode durations (in seconds) ± standard errors) of learning keepers playing against always-tackle takers after 40 hours of learning. All performances are averaged over 100 episodes (10 episodes per experiment, 10 experiments for each algorithm).
Learning algorithm              After 40 hours’ learning
2-Keepaway, SARSA(λ)            24.97 ± 0.07
2-Keepaway, AARL-preferred      26.57 ± 0.07
2-Keepaway, AARL-grounded       24.60 ± 0.09
3-Keepaway, SARSA(λ)            27.97 ± 0.08
3-Keepaway, AARL-preferred      31.92 ± 0.09
3-Keepaway, AARL-grounded       32.00 ± 0.07
Implementation
Most existing work on Takeaway uses holder-oriented state variables (e.g. [IE08, MZCZ08, DGK11]), the same as the state variables used in Keepaway games (see Table 4.3). However, since each taker learns independently and the takers need to cooperate with each other, self-oriented state variables, which describe the situation on the field from each taker’s perspective, can be more helpful. Therefore, we combine the takers’ self-oriented state variables with some holder-oriented state variables to build a new state vector for learning takers, as shown in Table 4.7.
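To make the construction concrete, the following is a minimal sketch of how the state vector of Table 4.7 might be computed for taker T1 from 2D positions. The function names (`dist`, `ang`, `taker_state`), the use of degrees for angles, and the assumption that the keeper and taker lists are already ordered (holder K1 first, learning taker T1 first) are illustrative choices, not details taken from our implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two field positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def ang(a: Point, b: Point, vertex: Point) -> float:
    """Unsigned angle a-vertex-b in degrees (assumed unit), in [0, 180]."""
    ax, ay = a[0] - vertex[0], a[1] - vertex[1]
    bx, by = b[0] - vertex[0], b[1] - vertex[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return abs(math.degrees(math.atan2(cross, dot)))


def taker_state(keepers: List[Point], takers: List[Point]) -> List[float]:
    """State vector for taker T1 (takers[0]); keepers[0] is the holder K1.

    Rows follow Table 4.7 from top to bottom.
    """
    me, k1 = takers[0], keepers[0]
    free_keepers, other_takers = keepers[1:], takers[1:]
    state = [dist(k, me) for k in keepers]              # dist(Kp, Me)
    state += [dist(t, me) for t in other_takers]        # dist(Tq, Me)
    state += [ang(k, me, k1) for k in free_keepers]     # ang(Kp, Me)
    state += [dist(k, k1) for k in free_keepers]        # dist(Kp, K1)
    state += [dist(t, k1) for t in other_takers]        # dist(Tq, K1)
    # Smallest angle between each free keeper and any taker, vertex at K1.
    state += [min(ang(k, t, k1) for t in takers) for k in free_keepers]
    return state
```

Under these assumptions the vector has 6N − 1 entries for an N-Takeaway game, i.e. 11 entries in 2-Takeaway and 17 in 3-Takeaway.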
In addition to the arguments and values mentioned in Examples 2 and 6, we use two further arguments:
• TqO(p): Tq performs MarkKeeper(p) IF Kp is open
• TqF(p): Tq performs MarkKeeper(p) IF Kp is far
The definitions of ‘open’ and ‘far’ in these two arguments are the same as in the keepers’ arguments (Example 1 in Chapter 3). The reasons (values) behind these two arguments are the opposites of the values of the keepers’ arguments O(p) and F(p), respectively. For example, recall that the reason behind O(p) is ‘passing the ball to the open keepers reduces the risk of the ball being intercepted’; so, from the takers’ perspective, they should prevent the holder from passing the ball to an open keeper. We denote the values promoted by TqO(p) and TqF(p) as MARK OPEN (standing for ‘to mark a keeper that is open, so as to increase the success rate of interception’) and MARK FAR (standing for ‘to mark a keeper that is far, so as to reduce the time keepers control the ball’), respectively. We keep the following ranking of values (Valpref) fixed throughout our experiments: QUICK TAC >v QUICK MARK =v QUICK CLOSE >v MARK OPEN >v MARK FAR.
Table 4.6: Performances (average episode durations (in seconds) ± standard errors) of learning keepers playing against argument-based takers after 40 hours of learning. All performances are averaged over 100 episodes (10 episodes per experiment, 10 experiments for each algorithm).
Learning algorithm              After 40 hours’ learning
2-Keepaway, SARSA(λ)            11.92 ± 0.03
2-Keepaway, AARL-preferred      12.08 ± 0.02
2-Keepaway, AARL-grounded       12.11 ± 0.03
3-Keepaway, SARSA(λ)            10.74 ± 0.03
3-Keepaway, AARL-preferred      11.45 ± 0.02
3-Keepaway, AARL-grounded       11.50 ± 0.03
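The fixed value ranking Valpref can be illustrated with a small sketch. It encodes the ranking as integer ranks (equal rank for =v) and applies the standard VAF convention that an attack succeeds unless the attacked argument’s value is strictly preferred to the attacker’s. The names `VALUE_RANK`, `strictly_preferred`, `equally_preferred` and `attack_succeeds` are hypothetical helpers, and the sketch does not compute preferred or grounded extensions.

```python
# Lower rank means more preferred; QUICK MARK and QUICK CLOSE are tied (=v).
VALUE_RANK = {
    "QUICK TAC": 0,
    "QUICK MARK": 1,
    "QUICK CLOSE": 1,
    "MARK OPEN": 2,
    "MARK FAR": 3,
}


def strictly_preferred(v1: str, v2: str) -> bool:
    """True if v1 >v v2 under the fixed ranking."""
    return VALUE_RANK[v1] < VALUE_RANK[v2]


def equally_preferred(v1: str, v2: str) -> bool:
    """True if v1 =v v2 under the fixed ranking."""
    return VALUE_RANK[v1] == VALUE_RANK[v2]


def attack_succeeds(attacker_value: str, target_value: str) -> bool:
    """Standard VAF rule: an attack fails only if the target's value
    is strictly preferred to the attacker's value."""
    return not strictly_preferred(target_value, attacker_value)
```

For instance, under this encoding an argument promoting MARK FAR cannot defeat one promoting MARK OPEN, whereas arguments promoting QUICK MARK and QUICK CLOSE can defeat each other.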
Empirical Results in Takeaway
The performances of the takers’ learning strategies against the keepers’ random and hand-coded strategies are shown in Figures 4.4 and 4.5, respectively. From Figure 4.4, we can see that when playing against random keepers, all RL algorithms perform similarly (i.e. no algorithm is significantly better or worse than the others); we also see that the 95% confidence intervals of all RL algorithms are wide and overlap substantially, indicating that in cooperative multi-agent learning problems, when playing against a random opponent, RL algorithms need to try many different strategies before they find the optimal policy. From Figure 4.5, we see that both AARL algorithms significantly outperform SARSA(λ) throughout the 40-hour experiments; however, with respect to the relative performance of the two AARL implementations, we see that in 2-Takeaway, AARL-grounded significantly outperforms AARL-preferred most of the time, whereas in 3-Takeaway, neither AARL implementation has a significant advantage over the other during the learning process.
Table 4.7: State variables for learning taker T1 in an N-Takeaway game. State variables of other takers can be obtained similarly. The top three rows describe self-oriented variables, and the others describe variables about the keepers’ relative layout.
State variable(s)                               Description
dist(Kp, Me), p ∈ [1, N+1]                      Distance between keepers and myself.
dist(Tq, Me), q ∈ [2, N]                        Distance between other takers and myself.
ang(Kp, Me), p ∈ [2, N+1]                       The angle between the free keepers and myself, with vertex at K1.
dist(Kp, K1), p ∈ [2, N+1]                      Distance between K1 and the other keepers.
dist(Tq, K1), q ∈ [2, N]                        Distance between K1 and the other takers.
min_{q ∈ [1, N]} ang(Kp, Tq), p ∈ [2, N+1]      The smallest angle between Kp and the takers, with vertex at K1.
In order to further investigate the learning algorithms’ performances after multiple hours of learning, we present the ‘last’ performances of the different takers’ RL algorithms against the keepers’ two fixed strategies in Tables 4.8 and 4.9. From Table 4.8 we can see that, when playing against random keepers, no algorithm has significant advantages or disadvantages over the other algorithms (in 2-Takeaway, the p-value between SARSA(λ) and AARL-preferred is 0.44, between SARSA(λ) and AARL-grounded is 0.31, and between the two AARLs is 0.91; in 3-Takeaway, the p-value between SARSA(λ) and AARL-preferred is 0.85, between SARSA(λ) and AARL-grounded is 0.71, and between the two AARLs is 0.40). This result is in line with our observation of the learning curves (Figure 4.4). Also, by comparing with the performances of the takers’ fixed strategies against random keepers (see Tables 4.1 and 4.2), we can see that all learning algorithms’ performances are significantly better than any of the takers’ fixed strategies (all p-values are smaller than 0.01). This result indicates that when playing against opponents that are difficult to predict, RL-based multi-agent cooperative learning outperforms hand-coded strategies.
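As an illustration of how such pairwise comparisons can be made from reported means and standard errors alone, the sketch below computes a two-sided p-value under a normal approximation (a two-sample z-test). The choice of test is an assumption made for this example; the function name `two_sided_p` is hypothetical.

```python
import math


def two_sided_p(mean_a: float, se_a: float, mean_b: float, se_b: float) -> float:
    """Two-sided p-value for the difference of two independent means,
    using a normal approximation (assumed test, for illustration only)."""
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# 2-Takeaway against random keepers (values from Table 4.8).
p_sarsa_vs_preferred = two_sided_p(6.72, 0.07, 6.84, 0.14)
p_preferred_vs_grounded = two_sided_p(6.84, 0.14, 6.86, 0.12)
```

With the 2-Takeaway figures from Table 4.8, this approximation gives p-values close to those quoted above (roughly 0.44 and 0.91 for the two pairs shown), well above the 0.01 threshold used in this section.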
From Table 4.9 we can see that when playing against hand-coded keepers, in both 2- and 3-Takeaway, both AARL implementations significantly outperform SARSA(λ) after 40 hours of learning (all p-values are smaller than 0.01). However, in 2-Takeaway, AARL-grounded is significantly better than AARL-preferred (p-value < 0.01), whereas in 3-Takeaway, the two AARL algorithms show no significant difference in their last performances (p-value: 0.17). Also, by comparing with the performances of the takers’ fixed strategies against the hand-coded keepers (see Tables 4.1 and 4.2), we can see that all learning algorithms’ performances are significantly better than any of the takers’ fixed strategies (all p-values are smaller than 0.01). This result indicates that when playing against carefully designed hand-coded opponents, RL-based multi-agent cooperative learning outperforms hand-coded strategies.
Table 4.8: Performances (average episode durations (in seconds) ± standard errors) of learning takers playing against random keepers after 40 hours of learning. All performances are averaged over 100 episodes (10 episodes per experiment, 10 experiments for each algorithm).
Learning algorithm              Performance after 40 hours’ learning
2-Takeaway, SARSA(λ)            6.72 ± 0.07
2-Takeaway, AARL-preferred      6.84 ± 0.14
2-Takeaway, AARL-grounded       6.86 ± 0.12
3-Takeaway, SARSA(λ)            6.80 ± 0.18
3-Takeaway, AARL-preferred      6.76 ± 0.11
3-Takeaway, AARL-grounded       6.87 ± 0.07
Table 4.9: Performances (average episode durations (in seconds) ± standard errors) of learning takers playing against hand-coded keepers after 40 hours of learning. All performances are averaged over 300 episodes (10 episodes per experiment, 30 experiments for each algorithm).
Learning algorithm              Performance after 40 hours’ learning
2-Takeaway, SARSA(λ)            12.55 ± 0.03
2-Takeaway, AARL-preferred      10.70 ± 0.01
2-Takeaway, AARL-grounded       10.09 ± 0.01
3-Takeaway, SARSA(λ)            9.21 ± 0.02
3-Takeaway, AARL-preferred      7.47 ± 0.05
3-Takeaway, AARL-grounded       7.39 ± 0.03
The state-of-the-art heuristics for Takeaway games were proposed by Devlin et al. [DGK11]. They also use look-back advice to integrate heuristics into Takeaway, and their strategies’ performances in 2- and 3-Takeaway (also on a 40×40 field) are shown in Figures 4.6(a) and 4.6(b), respectively. They also use SARSA(λ) as the standard RL algorithm, and all the RL parameters they use are the same as ours (see footnote 10).
(a) 2-Takeaway (b) 3-Takeaway
Figure 4.6: The performances of two Takeaway games by using potential values proposed in [DGK11]. Reproduced from [DGK11] with permission.
They used three heuristics: separation-based shaping encourages each agent to take actions that increase its distance to its teammates; role-based shaping assigns each agent a role (either tackler or marker) a priori, and only the tackler is encouraged to tackle; combined shaping is the integration of these two heuristics. Their strategies played against the keepers’ hand-coded strategy (the same one we use). Their results showed that even though these heuristics successfully improved RL performance in 3-Takeaway (Figure 4.6(b)), they misled RL in 2-Takeaway (Figure 4.6(a)). We believe the reason for these mixed results lies in the lack of a systematic methodology for providing heuristics. In contrast, AARL allows domain knowledge to be integrated into RL while providing a high-level abstraction method (VAFs) with which domain experts can propose domain knowledge. Also, the improvements of their heuristically instructed strategies over SARSA(λ) are not as significant as those obtained with AARL.
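For reference, look-back advice as commonly stated in the advice literature adds the term F = Φ(s_t, a_t) − Φ(s_{t−1}, a_{t−1}) / γ to the reward, where Φ is a potential over state-action pairs. The sketch below shows this term together with a purely hypothetical separation-style potential; the function names, the `scale` parameter and the form of the potential are assumptions for illustration and are not the potentials used in [DGK11].

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def look_back_advice(phi_prev: float, phi_curr: float, gamma: float) -> float:
    """Look-back advice term: Phi(s_t, a_t) - Phi(s_{t-1}, a_{t-1}) / gamma."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return phi_curr - phi_prev / gamma


def separation_potential(me: Point, teammates: List[Point], scale: float = 1.0) -> float:
    """Hypothetical potential: larger when the agent is further from
    its nearest teammate (illustrative only)."""
    if not teammates:
        return 0.0
    nearest = min(math.hypot(me[0] - t[0], me[1] - t[1]) for t in teammates)
    return scale * nearest
```

A learner using this kind of advice would add the returned term to the environment reward at each step; how the potential is defined is exactly where the domain knowledge enters, which is the part AARL derives from arguments rather than from hand-tuned heuristics.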