5.12 Multi-Agent Systems of Note
5.12.1 Multi-Agent Reinforcement Learning (MARL)
Since many reinforcement learning tasks do not require sequential computation, a distributed methodology can be applied. A multi-agent approach is appealing, since agents can run in parallel, each computing a different aspect of the system. It can also add robustness: if one agent fails to perform a task, another agent can take up the mantle.
Whilst there are potential advantages in applying multi-agent techniques to RL problems, there are also numerous challenges that must be addressed. In standard RL, there is often a clear goal to be achieved, but with MARL the problem is usually distributed, so assigning goals to individual agents can be difficult.
When multiple agents are evaluating an environment, there is potential for an agent’s learning to be affected by the learning of others, making the learning problem non-stationary. Similarly, if the learning problem is not specifically defined, agents can begin to build up information about the other agents as well as the environment they are supposed to be investigating.
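The non-stationarity can be illustrated with a minimal sketch. It assumes a hypothetical two-action coordination game in which agent A is rewarded only when its action matches agent B's; the game, the function name and the policy values are illustrative and are not taken from any of the cited work. From A's point of view, the expected reward of a fixed action changes purely because B's policy has changed.

```python
# Hypothetical 2x2 coordination game: agent A receives 1 if both agents
# choose the same action and 0 otherwise.
def expected_reward(action_a, policy_b):
    """Expected reward to A for a fixed action, given B's action probabilities."""
    return sum(p for action_b, p in enumerate(policy_b) if action_b == action_a)

# Assumed snapshots of B's policy: near-uniform early in learning,
# strongly favouring action 1 later on.
policy_b_early = [0.5, 0.5]
policy_b_late = [0.1, 0.9]

# A's action is held fixed at 0, yet its expected reward drops as B learns.
early = expected_reward(0, policy_b_early)
late = expected_reward(0, policy_b_late)
print(early, late)
```

Any value estimate that A built from early experience is therefore out of date once B has adapted, which is what distinguishes this setting from single-agent RL in a fixed environment.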
Another issue to be addressed is whether to adopt a selfish or a cooperative scheme for learning, and whether or not agents must keep track of the learning of others. All of these issues must be considered. However, there is no ‘one size fits all’ solution, so the considerations must be made on a case-by-case basis.
The Goal of MARL
Defining a clear goal for MARL is difficult, as discussed by Busoniu et al. (2008), because a lot depends on the scope of the problem being investigated. Broadly speaking, however, they define the two main goals of MARL as stability and adaptation, where stability is convergence to a stationary policy and adaptation deals with maintaining performance as other agents change their policies.
MARL for Shared Information
The ability of agents to share learnt information can be an interesting prospect as, in many situations, agents have a limited sphere of influence. Tan (1993) investigates this using a predator-prey environment as a testbed for MARL with and without information sharing. The three cases under investigation were sharing of sensory information, sharing of learnt policies and sharing of sensory information for joint tasks.
His results show a slight improvement in convergence in the first case and marked improvements in the second and third cases. Despite the research being grounded in a simulation where parameter choices inherently affect the learning results, a strong case is made for information sharing to aid convergence to a solution.
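As a rough illustration of what sharing learnt policies might look like for tabular learners, the sketch below merges the Q-tables of several agents by averaging each state-action entry. This is one simple scheme chosen for illustration, and is not claimed to be the exact procedure used by Tan (1993); the function name, state labels and values are all hypothetical.

```python
from collections import defaultdict

def average_q_tables(q_tables):
    """Average each (state, action) value over the agents that hold an entry for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for q in q_tables:
        for key, value in q.items():
            totals[key] += value
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Two agents with partially overlapping experience (made-up values).
q_a = {("s0", "left"): 1.0, ("s0", "right"): 0.0}
q_b = {("s0", "left"): 0.5, ("s1", "left"): 2.0}

shared = average_q_tables([q_a, q_b])
print(shared)
```

Entries known to only one agent are passed on unchanged, so each agent gains estimates for states outside its own sphere of influence.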
mentor agent inform a strategy for an imitator agent. The advantages of such a scheme would be most beneficial in systems where direct communication between agents is either difficult or impossible. The scheme is tested on a number of simple MDPs and shows improvements in convergence times over standard reinforcement learning techniques.
Whilst these examples give no explicit methodology or framework for information sharing in all situations, the techniques are interesting and should be considered when designing a MAS.
MARL for Distributed Control Problems
Very little work has been done in the field of MARL for distributed control; most published examples of MARL focus on static games and small grid worlds. Busoniu et al. (2008) note that:
Most MARL algorithms are applied to small problems only, like static games and small grid worlds. As a consequence, these algorithms are unlikely to scale up to real-life multiagent problems, where the state and action spaces are large or even continuous. Few of them are able to deal with incomplete, uncertain observations. This situation can be explained by noting that scalability and uncertainty are also open problems in single-agent RL.
Despite this, there are a few examples of MARL being used for distributed control. Gross et al. (2000) use a multi-agent neural function approximator approach to control an industrial hard-coal combustion process in a power plant. A neural function approximator was chosen because of the continuous state and action spaces in the system and the then-prohibitive memory cost of storing state-action pairs for standard techniques such as Q-learning.
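The memory argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares the number of entries in a discretised Q-table with the number of weights in a small single-hidden-layer network. All of the dimensions, the bin count and the network size are assumptions chosen for illustration; they are not the values used by Gross et al. (2000).

```python
def q_table_entries(state_dims, action_dims, bins_per_dim):
    """Entries in a Q-table when every continuous dimension is cut into equal bins."""
    return bins_per_dim ** (state_dims + action_dims)

def mlp_parameters(inputs, hidden, outputs):
    """Weights plus biases of a fully connected network with one hidden layer."""
    return (inputs + 1) * hidden + (hidden + 1) * outputs

# Assumed problem size: 6 state variables, 3 control inputs, 10 bins each.
entries = q_table_entries(state_dims=6, action_dims=3, bins_per_dim=10)

# Assumed approximator: state and action as inputs, 32 hidden units, one Q-value out.
params = mlp_parameters(inputs=9, hidden=32, outputs=1)

print(entries, params)
```

The table grows exponentially with the number of dimensions, whereas the approximator grows only with the chosen network size, at the cost of approximation error.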
The control system comprises four agents, each with access to information from the six burners of the combustion system. This is realised in the form of a camera system that observes the colour, shape and size of the flame. From the images, information can be determined about the temperature, coal distribution and the makeup of the emissions. The control inputs allow alteration of the distribution of air between burners, the distribution between primary and secondary air (where secondary air is recycled air from a previous burn) and the overall air amount. An emphasis is put on agent scheduling so that agents can be sure that the outputs are a direct result of their change in inputs and not the effect of another agent’s control strategy.
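The idea behind the scheduling can be sketched as follows. The source does not describe how the agents are scheduled, so the round-robin order, the function names and the toy plant below are all assumptions made for illustration. The point is only that, when exactly one agent changes its control input per step, each observed output can be attributed to a single agent.

```python
from itertools import cycle

def run_schedule(agent_names, plant_step, n_steps):
    """Let exactly one agent act per step and record which agent caused each output."""
    log = []
    turns = cycle(agent_names)  # assumed round-robin ordering
    for _ in range(n_steps):
        agent = next(turns)
        output = plant_step(agent)  # only this agent's input changes in this step
        log.append((agent, output))
    return log

def toy_plant(agent):
    """Stand-in for the real process: reports which agent's change it responded to."""
    return f"response to {agent}"

agents = ["agent_1", "agent_2", "agent_3", "agent_4"]
log = run_schedule(agents, toy_plant, n_steps=6)
print(log)
```

The cost of this design is that the agents no longer learn in parallel.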
The results indicate that their multi-agent control scheme does give improvements over the standard control scheme adopted in the plant: similar results can be achieved while consuming less air. However, the improvements do not appear to be significant and, since the learning is scheduled, the use of a multi-agent approach seems superfluous.
Wiering (2000) demonstrates a different kind of MARL to increase efficiency in a traffic light control problem where the traffic lights and the cars are modelled as agents. In the simulation each car has three parameters: the traffic light it is at, its place in the queue at that traffic light and its destination. There are 48 traffic lights (tl ∈ [1..48]), 20 positions in a queue (place ∈ [1..20]) and 10 destination addresses (des ∈ [1..10]). This huge parameter set is the reason that a multi-agent approach was adopted, since the number of potential system states makes it intractable as a global RL problem. It is assumed that the cars can pass these three parameters to the traffic light controller so that it can make locally optimal decisions about an appropriate action to take.
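The scale of the problem follows directly from the ranges quoted above. The sketch below counts the states of a single car and then gives a crude upper bound on the joint state of several cars; the bound ignores constraints such as two cars not sharing a queue position, and the constant and function names are illustrative only.

```python
# Ranges quoted in the text for a single car.
N_LIGHTS = 48        # tl in [1..48]
N_PLACES = 20        # place in [1..20]
N_DESTINATIONS = 10  # des in [1..10]

states_per_car = N_LIGHTS * N_PLACES * N_DESTINATIONS

def joint_states(n_cars):
    """Crude upper bound on the joint state count if every car is tracked globally."""
    return states_per_car ** n_cars

print(states_per_car)   # 9600
print(joint_states(3))  # 884736000000
```

A single car already has 9600 possible states, and the global count grows exponentially with the number of cars, whereas each local controller only has to reason about the cars queuing at its own junction.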
The decisions of the traffic light controllers were tested using a range of techniques including random selection, fixed rate decisions, largest queue first, highest intersection throughput and three reinforcement learning based controllers. The fitness metric was the average waiting time of a car within the system. The RL algorithms used are not detailed here as they are standard RL optimisation algorithms and the substance of this work is in the way in which it is distributed. The results of the tests showed that the RL controllers slightly outperformed the other systems for low traffic loads and markedly outperformed them for high traffic volumes. Whilst the results are promising and show once again that RL can provide improvements to the decisions made in a distributed system, this is, in essence, an optimisation problem. That a random selector works at all is a testament to this.
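Two of the non-learning baselines and the fitness metric are simple enough to sketch directly. The code below is an illustrative reading of ‘largest queue first’, ‘random selection’ and ‘average waiting time’, assuming each junction is described by a list of queue lengths; the function names and data layout are hypothetical and are not taken from Wiering (2000).

```python
import random

def largest_queue_first(queue_lengths):
    """Index of the approach with the most waiting cars (ties go to the lowest index)."""
    return max(range(len(queue_lengths)), key=lambda i: queue_lengths[i])

def random_selection(queue_lengths, rng):
    """Index of an approach chosen uniformly at random, ignoring the queues."""
    return rng.randrange(len(queue_lengths))

def average_waiting_time(waiting_times):
    """Fitness metric: mean waiting time over all cars in the system."""
    return sum(waiting_times) / len(waiting_times)

queues = [3, 7, 7, 1]  # made-up queue lengths at one junction
rng = random.Random(0)

print(largest_queue_first(queues))
print(random_selection(queues, rng))
print(average_waiting_time([2, 4, 6]))
```

Both baselines always give some approach a green light, which is why even random selection keeps traffic moving and yields a finite average waiting time.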