
2.3 Reinforcement Learning

2.3.3 Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) is concerned with cases where more than one learning agent shares the same environment. MARL has strong links with game theory. An MDP in single-agent RL becomes a stochastic game (SG) in MARL, sometimes also referred to as a multi-agent MDP. Many MARL algorithms are based on game theory, since it is one of the most suitable frameworks for modelling the interactions among several agents in a common environment [55]. This motivates the investigation of MARL in different types of SGs: fully cooperative, fully competitive and mixed games.

Extending RL to the multi-agent case presents several challenges, investigated by Busoniu et al. [14]. In many cases, formally defining a multi-agent learning goal is itself a difficult task. Every learning agent is affected by the actions of the other learning agents. Therefore, the environment is no longer static; it becomes highly dynamic from the viewpoint of each individual agent. This significantly increases the complexity of the learning tasks and invalidates most convergence guarantees of single-agent RL. A popular way to specify a MARL goal is to use a Nash Equilibrium (NE), as in game theory, where none of the agents in the environment has an incentive to unilaterally deviate from its policy.
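The NE condition can be made concrete with a small check on a two-agent matrix game. The following is a minimal sketch, restricted to pure strategies; the function name pure_nash_equilibria and the example payoff matrix are assumptions introduced here for illustration and do not come from the cited works.

```python
from itertools import product


def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria of a two-agent matrix game.

    payoff_a[i][j] and payoff_b[i][j] are the rewards of agents A and B
    when A plays action i and B plays action j (hypothetical layout).
    """
    n_a, n_b = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i, j in product(range(n_a), range(n_b)):
        # A cannot gain by switching its action while B keeps action j.
        a_is_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(n_a))
        # B cannot gain by switching its action while A keeps action i.
        b_is_best = all(payoff_b[i][j] >= payoff_b[i][k] for k in range(n_b))
        if a_is_best and b_is_best:
            equilibria.append((i, j))
    return equilibria


# Fully cooperative coordination game: both agents receive the same reward.
coordination = [[1, 0], [0, 1]]
print(pure_nash_equilibria(coordination, coordination))  # [(0, 0), (1, 1)]
```

Even this toy game has two equilibria, which illustrates why agents also have to agree on which NE to play.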

Nevertheless, employing MARL methods also presents a number of benefits [14]. For example, there is scope for experience sharing among the learning agents to improve the initial and steady-state performance of an RL algorithm and, thus, to increase its adaptability. This paradigm lies within the emerging research topic of transfer learning (TL), sometimes also referred to as docitive learning in the wireless communications domain [31]. MARL is also inherently more robust than single-agent RL (SARL), in that in certain types of RL problems faulty agents can be supported or replaced by new ones. Finally, MARL offers a high degree of scalability, because most MARL algorithms allow easy insertion of new learning agents into the environment.

The rest of this subsection gives examples of several notable MARL algorithms found in the literature.

Nash-Q

The Nash-Q algorithm introduced by Hu and Wellman [36] is an extension of Q-learning to the multi-agent case, where the goal of all agents is to converge to an NE strategy in every state of the environment. The drawback of this algorithm is that every learning agent is assumed to observe the actions taken and rewards received by all other learning agents, and to store all their Q-tables. This assumption may not be valid in many learning problems. It is also inefficient in terms of memory and communication overhead among the agents. However, the advantage of this method, as presented by Hu and Wellman [36], is the proven convergence of this algorithm towards a mixed strategy NE, which is rare in the MARL domain.
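For reference, the Nash-Q update for agent i is commonly written in the form below. This is a sketch in generic notation rather than a quotation from [36]; the symbols \alpha_t (learning rate), \gamma (discount factor), r_i^t (reward of agent i) and \pi^j(s') (the NE strategy of agent j in the stage game at the next state s') are notation assumed here.

```latex
\[
Q_i^{t+1}(s, a^1, \dots, a^n)
  = (1 - \alpha_t)\, Q_i^{t}(s, a^1, \dots, a^n)
  + \alpha_t \left[ r_i^{t} + \gamma\, \mathrm{Nash}Q_i^{t}(s') \right],
\qquad
\mathrm{Nash}Q_i^{t}(s') = \pi^1(s') \cdots \pi^n(s') \cdot Q_i^{t}(s')
\]
```

The term \mathrm{Nash}Q_i^{t}(s') replaces the maximum over next-state actions used in single-agent Q-learning. Computing it requires the Q-tables of all agents at s', which is the source of the observation and memory overhead noted above.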

Distributed-Q

The Distributed-Q algorithm for fully cooperative SGs is proposed by Lauer and Riedmiller [51]. Here, every learning agent senses the entire environment and performs a single-agent Q-learning algorithm, assuming that all other agents always choose a certain greedy action. This works extremely well in deterministic environments. However, in the wireless communications domain the real-world learning problems are bound to be highly stochastic instead, due to random environmental effects which cannot be modelled and predicted. The algorithm also assumes that every learning agent is able to accurately estimate the greedy actions of the other agents. This may not be possible in a number of distributed multi-agent learning problems.
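The optimistic assumption can be sketched as an update that never lowers a Q-value: a poor outcome is attributed to the other agents not yet playing their greedy actions and is ignored. The code below is a minimal illustration under stated assumptions (deterministic environment, non-negative rewards, Q-values initialised to zero); the function and variable names are hypothetical and it is not presented as the exact procedure of [51].

```python
from collections import defaultdict


def distributed_q_update(q, state, own_action, reward, next_state, actions, gamma=0.9):
    """Optimistic Q-update over the agent's own actions only.

    The entry is replaced only if the new target is larger, so low rewards
    caused by the other agents' exploratory actions leave it unchanged.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, own_action)] = max(q[(state, own_action)], target)
    return q[(state, own_action)]


q_table = defaultdict(float)  # all entries start at zero
actions = ["a0", "a1"]

# The same own action yields different rewards depending on the other agents.
for observed_reward in [0.0, 1.0, 0.0]:
    distributed_q_update(q_table, "s", "a0", observed_reward, "end", actions)

print(q_table[("s", "a0")])  # 1.0: the best observed outcome is retained
```

The same property explains the weakness in stochastic environments: a single lucky reward is retained indefinitely, so the Q-values overestimate the achievable return.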

Conjecture-Based Reinforcement Learning

A more promising variation of multi-agent Q-learning recently proposed by Chen et al. [19] is called conjecture-based RL. It deals with the stochastic nature of the learning process by defining a conjecture term which is used in the Q-table update formula. The conjecture is effectively the probability that all other learning agents in the environment choose a particular set of policies, which determines the reward received by the learning agent. The algorithm then calculates the expected reward as a weighted sum of the possible rewards, depending on the policies chosen by the other agents. Chen et al. [19] successfully use this algorithm to enable cognitive radio (CR) devices in a simulated wireless mesh network to learn optimal spectrum and power allocation strategies for improved energy efficiency of the network. However, this approach has only been applied to a relatively small and analytically tractable scenario with six secondary users and five primary users. The scalability of this algorithm has not been tested. For example, it is not clear whether this algorithm would exhibit good performance during the initial exploration stage of the learning process in a significantly larger and more complex wireless environment, and whether it would maintain its property of converging towards optimal strategies.
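The weighted-sum idea can be illustrated as follows. This is only a sketch of the expectation step, not the update formula of [19]: the function name expected_reward, the dictionary layout and the example probabilities are all assumptions made for illustration.

```python
def expected_reward(conjecture, rewards):
    """Expected reward of one agent under its conjecture about the others.

    conjecture: maps each possible joint choice of the other agents to the
                probability the learning agent assigns to it (hypothetical).
    rewards:    maps the same joint choices to the reward the learning
                agent would receive in that case.
    """
    total_probability = sum(conjecture.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("conjecture probabilities must sum to 1")
    # Weighted sum of possible rewards, weighted by the conjecture.
    return sum(p * rewards[choice] for choice, p in conjecture.items())


# Illustrative numbers only: the agent believes the channel is left free
# by the other agents with probability 0.75.
conjecture = {"others_idle": 0.75, "others_transmit": 0.25}
rewards = {"others_idle": 1.0, "others_transmit": 0.0}
print(expected_reward(conjecture, rewards))  # 0.75
```

The number of joint choices of the other agents grows quickly with the number of agents, which is one reason why scalability is a reasonable concern for this family of methods.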

Independent Single-Agent Reinforcement Learning

The simplest approach to MARL is the “naive” implementation of independent single-agent RL algorithms for each learning agent in the environment, e.g. [77][88]. Despite the fact that the independent learning agents are not even aware of the existence of the other learning agents in the environment, this approach has been successfully applied to various coordination tasks, e.g. [46][77]. For example, an implementation of independent stateless Q-learning agents in a multi-agent environment has been shown to exhibit convergence performance in a simple coordination task remarkably similar to that of the “joint action learner”, but with significantly less information available to the learning agents [21].

The fundamental advantage of this approach is that it makes no assumptions about each learning agent’s awareness of the actions performed by the other agents, unlike the other MARL algorithms described in this subsection so far. This significantly increases the breadth of potential applications of this approach under different information availability constraints, including those in the wireless communications domain.
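A minimal sketch of independent stateless Q-learning in a two-agent coordination game is given below. Each agent updates only its own Q-values from its own action and the reward it receives; it never observes the other agent. The payoff matrix, parameter values and function name are assumptions chosen for illustration and are not taken from [21].

```python
import random


def run_independent_learners(episodes=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Two independent stateless Q-learners in a 2x2 coordination game."""
    rng = random.Random(seed)
    payoff = [[1.0, 0.0], [0.0, 1.0]]  # shared reward: 1 if actions match
    q = [[0.0, 0.0], [0.0, 0.0]]       # q[agent][action], one table per agent

    for _ in range(episodes):
        chosen = []
        for agent in range(2):
            if rng.random() < epsilon:
                # Explore: pick a random action.
                chosen.append(rng.randrange(2))
            else:
                # Exploit: pick a greedy action, breaking ties at random.
                best = max(q[agent])
                greedy = [a for a in range(2) if q[agent][a] == best]
                chosen.append(rng.choice(greedy))

        reward = payoff[chosen[0]][chosen[1]]

        # Each agent uses only its own action and the received reward.
        for agent in range(2):
            a = chosen[agent]
            q[agent][a] += alpha * (reward - q[agent][a])
    return q


q_values = run_independent_learners()
print(q_values)
```

In this toy setting the two agents would typically be expected to settle on the same action, although from each agent's viewpoint the other agent is simply part of a non-stationary environment.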

Heuristically Accelerated Reinforcement Learning

A common disadvantage of RL algorithms is their need for many learning iterations to converge on an acceptable solution. Many researchers have addressed this problem, and one of the more recent promising solutions is the heuristically accelerated reinforcement learning (HARL) approach. Its goal is to speed up the RL algorithms, particularly in the multi-agent domain, by guiding the exploration of the state space using additional heuristic information. According to Bianchi et al. [11], a heuristic policy is derived from additional knowledge, either external or internal, which is not included in the learning process. The goal of the heuristic policy is to influence the action choices of a learning agent, i.e. to modify its current policy in a way which would accelerate the learning process. For example, the earliest appearance of HARL in the literature is the paper by Bianchi et al. [12], where a heuristic function H(s, a) is defined that dictates which actions should be taken in which states to explore the state space more efficiently. This function can be obtained from additional expert knowledge or “existing clues in the learning process itself” [12]. In [11] the authors prove the convergence of four multi-agent HARL algorithms and demonstrate that they outperform their classical RL counterparts.
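One common way to realise this idea is to add a weighted heuristic term to the Q-values during greedy action selection only, leaving the Q-table update itself unchanged. The sketch below follows that pattern; the function names, the weight xi and the margin eta are assumptions introduced for illustration, and the code is not claimed to reproduce the exact formulation of [11] or [12].

```python
import random


def heuristic_from_suggestion(q_row, suggested_action, eta=1.0):
    """Build H(s, .) so that the suggested action wins the greedy choice.

    The suggested action receives just enough bonus to exceed the current
    best Q-value by eta; all other actions receive no bonus.
    """
    best = max(q_row)
    return [best - q + eta if a == suggested_action else 0.0
            for a, q in enumerate(q_row)]


def heuristic_action(q_row, h_row, xi=1.0, epsilon=0.1, rng=random):
    """Epsilon-greedy selection over Q(s, a) + xi * H(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore uniformly at random
    return max(range(len(q_row)), key=lambda a: q_row[a] + xi * h_row[a])


# Q-values of one state; expert knowledge (hypothetical) suggests action 2.
q_row = [0.5, 0.2, 0.1]
h_row = heuristic_from_suggestion(q_row, suggested_action=2)
print(heuristic_action(q_row, h_row, epsilon=0.0))  # 2
```

Because the heuristic only biases which actions are tried, a misleading heuristic in such a scheme slows learning down rather than changing the values that are being learned.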

This approach is particularly relevant in the dynamic spectrum access (DSA) environment, where various standardised signals with useful spectrum awareness information may be available to the learning agents.

In document Accelerating Reinforcement Learning for Dynamic Spectrum Access in Cognitive Wireless Networks (Page 37-40)