A Large Multi-Agent Problem - Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Model-based methods for POMDPs have been restricted to at most a few hundred states with tens of observations and actions [Geffner and Bonet, 1998]. This section demonstrates that GAMP can learn the optimal policy for a noisy multi-agent POMDP with 21,632 states, 1024 observations and 16 actions.

The scenario is shown in Figure 4.2: a factory floor with 13 grid locations to which 2 robots have access. The robots are identical except that one is given priority in situations where both robots want to move into the same space. They can turn left or right, move 1 position ahead, or wait where they are. One agent learns to move unfinished parts from the left shaded area to the middle area, where the part is processed instantly, ready for the second agent to move the processed part from the middle to the right shaded area. The middle processing machine can only handle 1 part at a time, so if the first agent drops off a part at the middle before the second agent has picked up the last part dropped at the middle, the new part is discarded.

The large state space arises from the combined state of the two independent agents plus the global state. Each agent can be loaded or unloaded in 13 states with 4 orientations, giving each agent 2 × 13 × 4 = 104 states. The global state indicates if a part is waiting at the middle processing machine and the state of the 2 agents, giving 2 × 104² = 21,632 states.
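The count above can be checked with a few lines of Python. This is only an arithmetic sketch; the constant names are illustrative and do not come from the GAMP implementation.

```python
# Per-agent state: loaded/unloaded x grid location x orientation.
LOAD_STATES = 2
GRID_LOCATIONS = 13
ORIENTATIONS = 4
agent_states = LOAD_STATES * GRID_LOCATIONS * ORIENTATIONS

# Global state: part waiting at the middle machine (yes/no) x the state of both agents.
PART_WAITING = 2
NUM_AGENTS = 2
global_states = PART_WAITING * agent_states ** NUM_AGENTS

print(agent_states, global_states)
```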

Figure 4.2: Plan of the factory floor for the multi-agent problem, showing Agent 1 and Agent 2. The dashed arrows show one of the routes traced by the final agents.

The agents only need to exit the loading or drop-off locations to pick up or drop loads. To receive the maximum reward the agents must cooperate without explicit communication, the actions of the first agent allowing the second agent to receive rewards.

The observations for each agent consist of 4 bits describing whether their path is blocked in each of the 4 neighbouring positions, and a 5th bit describing if the agent is in the uppermost corridor (which is necessary to break the symmetry of the map). The combined observations are 10 bits, or |Y| = 1024. The actions for each agent are {move forward, turn left, turn right, wait}, resulting in a total of |U| = 16 actions.

Uncertainty is added with a 10% chance of an agent’s action failing, resulting in no movement, and a 10% chance of an agent’s sensors completely failing, producing a “no walls” observation. This problem was designed to be solved by a reactive policy. Section 8 demonstrates GAMP on problems that require memory to solve.
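The observation and action counts, and the two noise sources, can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the experiments: the function names are hypothetical, a failed action is modelled as a wait, and a sensor failure is assumed to clear all five bits, since the text does not say what happens to the corridor bit.

```python
import random

ACTIONS = ("move forward", "turn left", "turn right", "wait")
OBS_BITS_PER_AGENT = 5  # 4 wall bits + 1 uppermost-corridor bit
NUM_AGENTS = 2

num_observations = 2 ** (OBS_BITS_PER_AGENT * NUM_AGENTS)  # |Y|
num_actions = len(ACTIONS) ** NUM_AGENTS                   # |U|

def noisy_action(action, rng, p_fail=0.1):
    """Return the action actually executed by one agent."""
    # Assumption: a failed action has the same effect as "wait" (no movement).
    return "wait" if rng.random() < p_fail else action

def noisy_observation(wall_bits, corridor_bit, rng, p_fail=0.1):
    """Return one agent's 5-bit observation as a tuple."""
    if rng.random() < p_fail:
        # Assumption: a complete sensor failure reads as "no walls" with
        # the corridor bit cleared as well.
        return (0, 0, 0, 0, 0)
    return (*wall_bits, corridor_bit)

rng = random.Random(0)
print(num_observations, num_actions)
print(noisy_action("move forward", rng), noisy_observation((1, 0, 1, 0), 1, rng))
```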

4.5.1 Experimental Protocol

These experiments were run on an unloaded AMD Athlon @ 1.3 GHz. GAMP required less than 47 MB of RAM to run this problem. Compare this to just storing every element of ∇P explicitly, which would require 893 GB of RAM.
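The 893 figure is consistent with a simple back-of-envelope calculation. The breakdown below is our assumption rather than something stated in the text: one dense |S| × |S| matrix of 8-byte doubles for each of 256 parameters (128 table entries for each of the 2 agents), measured in units of 2³⁰ bytes.

```python
NUM_STATES = 21_632
BYTES_PER_DOUBLE = 8
NUM_PARAMETERS = 2 * 128  # assumption: 128 table entries for each of 2 agents

# One dense |S| x |S| matrix of doubles.
bytes_per_matrix = NUM_STATES ** 2 * BYTES_PER_DOUBLE

# Assumption: the gradient of P holds one such matrix per parameter.
grad_p_bytes = bytes_per_matrix * NUM_PARAMETERS
grad_p_gib = grad_p_bytes / 2 ** 30

print(f"one |S| x |S| matrix: {bytes_per_matrix / 2 ** 30:.2f} GiB")
print(f"dense gradient of P:  {grad_p_gib:.1f} GiB")
```

Under these assumptions the total falls between 892 and 893, which matches the quoted value after rounding up.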

The agents were parameterised with tables of real numbers as described in Section 3.4.1. There are |Y| × |U| = 128 parameters per agent, where |Y| = 32 and |U| = 4 here count a single agent's observations and actions. We set θc = 0 ∀c. There are no φ parameters since the scenario can be solved without I-states. A quadratic penalty of ℘ = 0.0001 was used to stop the parameters settling in a local maximum too early (see Appendix B.2). The quadratic penalties for all the experiments in this thesis were chosen by trial and error. We determined penalties that prevented the weights growing past approximately 0.5 before the penalty is automatically reduced for the first time. Penalty reduction occurs after three line search iterations without improvement of the average reward. We chose x = π = 0.0001, which was the largest value (for the fastest approximation) tested that allowed the agents to consistently converge to a policy with performance equivalent to the best hand-coded policy.

Table 4.1: Results for the multi-agent factory setting POMDP. The values for η are multiplied by 10².

Algorithm   mean η   max. η   var.   secs to η = 5.0
GAMP        6.51     6.51     0      1035
Hand        6.51
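A table-lookup parameterisation of one agent can be sketched as below. This is a minimal illustration, assuming the table entries feed a soft-max over actions as the later discussion of the soft-max function suggests; the names `theta` and `action_probabilities` are hypothetical and the exact form used is the one given in Section 3.4.1.

```python
import math

NUM_OBSERVATIONS = 32  # 5 observation bits for a single agent
NUM_ACTIONS = 4        # move forward, turn left, turn right, wait

# One real-valued parameter per (observation, action) pair, all initialised to 0.
theta = [[0.0] * NUM_ACTIONS for _ in range(NUM_OBSERVATIONS)]

def action_probabilities(theta, y):
    """Soft-max over the row of the table selected by observation y."""
    row = theta[y]
    shift = max(row)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in row]
    total = sum(weights)
    return [w / total for w in weights]

num_parameters = NUM_OBSERVATIONS * NUM_ACTIONS
probs = action_probabilities(theta, y=0)
print(num_parameters, probs)
```

With every parameter at 0, each row yields a uniform distribution over the four actions, so training starts from a fully exploratory policy.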

Exact algorithms based on factored belief states could work well for this scenario since it decomposes into state variables [Boutilier and Poole, 1996, Hansen and Feng, 2000, Poupart and Boutilier, 2001]; however, we do not assume that the state-variable model is known. We shall discuss some possibilities for factored versions of GAMP at the end of this chapter.

4.5.2 Results

The agents learnt to move in opposing circles around the factory (shown by the dashed lines in Figure 4.2). This policy reduces the chances of collision. They also learnt to wait when their sensors fail, using the wait action to gather information. Table 4.1 shows a comparison between GAMP with no memory and the best policy we designed by hand.⁴

Without applying quadratic penalties, training terminated in substantially suboptimal local maxima. This is because the early gradient estimates tended to point in suboptimal directions, dominated by concepts that are easy to learn, such as “don’t run into walls,” or “moving forward is good.” These gradients drive parameters to very large values. The soft-max function enters a local maximum when the parameters diverge to ±∞, so the agent quickly becomes stuck having learnt only the simplest concepts. The quadratic penalty keeps parameters near 0. This forces the ω(·|φ, g, y) and µ(·|θ, h, y) distributions to stay close to uniform, which encourages exploration. The most common local maxima occurred when the agents learnt early in training that forward is a generally useful action, even when the sensors fail and the agent should wait for more information. Because the move forward concept was learnt so strongly, the soft-max derivative for the relevant parameters was close to 0.
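The vanishing soft-max derivative can be illustrated numerically. The sketch below uses the standard identity that the derivative of a soft-max probability with respect to its own parameter is p(1 − p); the four-entry row and the chosen weights are illustrative values, not parameters taken from the trained agents.

```python
import math

def softmax(values):
    """Numerically stable soft-max of a list of real numbers."""
    shift = max(values)
    weights = [math.exp(v - shift) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def self_derivative(values, i):
    """d p_i / d theta_i for a soft-max, which equals p_i * (1 - p_i)."""
    p = softmax(values)[i]
    return p * (1.0 - p)

# Index 0 plays the role of "move forward"; the other three actions stay at 0.
for forward_weight in (0.0, 5.0, 20.0):
    row = [forward_weight, 0.0, 0.0, 0.0]
    print(forward_weight, softmax(row)[0], self_derivative(row, 0))
```

Once the forward weight is large, the forward probability is close to 1 and the derivative is close to 0, so gradient steps can barely move the parameter back. Keeping the weights small with a penalty avoids entering this flat region early in training.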

⁴ An mpeg visualisation of trained agents is available from the author, or from http://discus.anu.

We attempted to run the exact Incremental Pruning algorithm [Zhang and Liu, 1996] on the Factory problem. The code aborted during the first iteration of dynamic programming after consuming all 256 MB of memory. Storing just one double-precision belief state of length 21,632 requires 169 KB, and exact value-function-based algorithms quickly generate many thousands of vectors for large problems.
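The memory figures can be put in perspective with a short calculation. It assumes that each value-function vector stores one 8-byte double per state, like a belief state, and it ignores every other source of memory use, so the resulting count is only a rough upper bound.

```python
NUM_STATES = 21_632
BYTES_PER_DOUBLE = 8
AVAILABLE_MB = 256

# Size of one vector with one double per state.
vector_bytes = NUM_STATES * BYTES_PER_DOUBLE
vector_kib = vector_bytes / 1024

# How many such vectors fit in the available memory, ignoring all overhead.
max_vectors = (AVAILABLE_MB * 1024 * 1024) // vector_bytes

print(vector_kib, max_vectors)
```

Under these assumptions only about 1,500 vectors fit in 256 MB, well short of the many thousands that exact methods tend to generate.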
