1 INTRODUCTION
Capturing the efficient, coordinated behavior common in human motion is applicable to a number of fields such as architecture, video games, movies, and virtual reality. For example, architects design through an iterative process of simulating and evaluating a building's traffic flow to accommodate dense crowds [23, 44]. Moreover, computer games often must simulate, under real-time constraints, scores of characters moving in coordinated formations or realistic crowds. In other similar applications, allowing simulated agents to navigate efficiently in a shared space is an important aspect of providing believable, natural, and socially coherent motion for virtual characters. Furthermore, crowd simulations optimized for navigational efficiency produce behavior akin to realistic human motion [7, 19], and this same efficiency can produce cooperative, coordinated kinds of motion. It is a natural pattern of human behavior to have idioms, norms, and customs which allow us to efficiently coordinate large formations through locally communicated interactions. The central motivation of this work is therefore to address how a multi-agent system could learn norms to communicate explicitly for the purpose of coordinated behavior.
Miller and Inoue use reinforcement learning to train a DIDS called the Perceptual Intrusion Detection System with Reinforcement (SPIDeR). The system consists of heterogeneous agents performing intrusion detection and communicating through a blackboard system. All the agents have a three-layer architecture composed of signature-based detection for well-known intrusions, an array of SOMs for anomaly detection, and a third layer that collects information for further analysis. The remote agents perform intrusion detection and send their votes about the locally sensed activity through the blackboard system. By means of an RL process, the central blackboard system computes and weighs the votes, and in turn rewards the agents in accordance with their effectiveness. The authors evaluate SPIDeR using the KDDCup'99 data set, showing positive results. By means of HMMs, RL, and the behavioural analysis of IP addresses, Xiu et al. propose a DIDS focused on detecting DoS/DDoS attacks. The architecture is composed of a group of sensors that have partial observability of the environment. Because of communication constraints, sensors are not able to send all their sensor information. Instead, they learn to recognize local attacks and to communicate them to a central facility. Although the authors report high detection rates, a possible drawback of this approach is the use of a single source of information (IP addresses) that can be easily forged. Furthermore, the authors reduce the problem of detecting DoS attacks to discriminating legal IP addresses from random IP addresses, which in our opinion is not enough information to accurately detect these types of attacks.
The problem of non-optimal policy convergence can be addressed by allowing the automata in the system to communicate their action choices with each other. Such action communication constitutes a centralized game of learning automata. Depending on the number of automata involved in the communication, the game can be completely or partially centralized. Centralization can also be achieved by combining the actions of different automata into a superautomaton. The superautomaton then acts as a representative of the group of automata and participates in the game on its behalf. However, such a superautomaton construction reduces the degree of autonomy in the system: since the superautomaton selects actions for all the automata it represents, each individual automaton loses its own autonomy and surrenders it to the superautomaton. Thus, depending on how centralization is performed (with or without the superautomaton approach), the system will possess different degrees of autonomy. We therefore motivate the discussion of our algorithms based on two factors: degree of communication and degree of autonomy. Depending on the availability of resources and the domain constraints (which dictate the memory capacity and autonomy of an agent), a suitable algorithm can be chosen from a gamut of possible algorithms.
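The superautomaton construction can be sketched as follows. This is our own minimal illustration, not the authors' construction: a single learner over the Cartesian product of the member automata's action sets, updated with a standard linear reward-inaction rule; all names and constants are illustrative.

```python
import random

class SuperAutomaton:
    """Combines the action sets of several automata into one joint-action
    learner with a linear reward-inaction (L_R-I) update. Illustrative sketch."""
    def __init__(self, action_sets, lr=0.1):
        # Cartesian product of the member automata's action sets.
        self.joint_actions = [()]
        for s in action_sets:
            self.joint_actions = [ja + (a,) for ja in self.joint_actions for a in s]
        self.probs = [1.0 / len(self.joint_actions)] * len(self.joint_actions)
        self.lr = lr

    def select(self):
        # Sample a joint action index from the current probability vector.
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.probs):
            acc += p
            if r < acc:
                return i
        return len(self.probs) - 1

    def reward(self, i):
        # L_R-I: on reward, move probability mass toward joint action i;
        # on no reward, leave the vector unchanged (inaction).
        for j in range(len(self.probs)):
            if j == i:
                self.probs[j] += self.lr * (1.0 - self.probs[j])
            else:
                self.probs[j] -= self.lr * self.probs[j]
```

The trade-off discussed above is visible here: `select` chooses for every member automaton at once, so the individual automata no longer make their own choices.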
Despite violating the theoretical guarantee of convergence, the multi-agent Q-learning method under the MMDP framework has shown promising results. The density graphs show that the MMDP agents are able to follow near-optimal policies and navigate to the goal states, provided there is sufficient learning. The density graphs further confirm that the simulated behaviour correlates with the behaviour observed in real-life simulations. The MMDP framework also has the capacity to cater for a larger environment and a larger number of agents. On the other hand, agents learning in the MMDP framework experience collisions. A simple explanation for the collisions is the loss of theoretical guarantees caused by the chaotic (non-stationary) environment, as agents are not able to account for other agents' actions. Although the agents fail to coordinate, this scenario best describes an emergency evacuation simulation, where the collisions can be considered as stampedes occurring at mass gatherings. It is inevitable that in a joint-state-action or joint-state environment, the dimensions and the number of agents have to be greatly reduced. Although a crowd constitutes a large number of agents, we follow the multi-agent convention of equating a crowd to having more than one agent. Nevertheless, the results obtained from the simulations of smaller-scale environments are still worthy of note. One important concept in crowd navigation is the notion of a collision. Agents need to learn to avoid collisions by making way for other agents to pass through. Provided with sufficient learning under the JSA-Q learning algorithm, the agents are able to learn to avoid collisions.
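A minimal sketch of the joint-state-action Q-update described above (our illustration; the tabular representation and variable names are assumptions, not the authors' code): the state and action are joint tuples over all agents, which is why dimensionality grows so quickly.

```python
from collections import defaultdict

def jsa_q_update(Q, joint_state, joint_action, reward, next_joint_state,
                 next_joint_actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update over joint states and joint actions.
    Q maps (joint_state, joint_action) tuples to values."""
    best_next = max((Q[(next_joint_state, a)] for a in next_joint_actions),
                    default=0.0)
    key = (joint_state, joint_action)
    Q[key] += alpha * (reward + gamma * best_next - Q[key])
    return Q[key]
```

Because the key is a tuple of all agents' states and actions, the table size is exponential in the number of agents, matching the scalability limits noted above.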
Much of the success of single-agent deep reinforcement learning (DRL) in recent years can be attributed to the use of experience replay memories (ERMs), which allow Deep Q-Networks (DQNs) to be trained efficiently through sampling stored state transitions. However, care is required when using ERMs for multi-agent deep reinforcement learning (MA-DRL), as stored transitions can become outdated because agents update their policies in parallel. In this work we apply leniency to MA-DRL. Lenient agents map state-action pairs to decaying temperature values that control the amount of leniency applied towards negative policy updates sampled from the ERM. This introduces optimism into the value-function update, and has been shown to facilitate cooperation in tabular fully-cooperative multi-agent reinforcement learning problems. We evaluate our Lenient-DQN (LDQN) empirically against the related Hysteretic-DQN (HDQN) algorithm, as well as a modified version we call scheduled-HDQN that uses average reward learning near terminal states. Evaluations take place in extended variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) which include fully-cooperative sub-tasks and stochastic rewards. We find that LDQN agents are more likely to converge to the optimal policy in a stochastic-reward CMOTP compared to standard and scheduled-HDQN agents.
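The leniency mechanism described above can be sketched roughly as follows. This is our simplified illustration, not the paper's implementation: each (state, action) pair carries a temperature that cools with visits, and negative TD errors are ignored with a probability that shrinks as the temperature cools; the constants and decay schedule are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

class LenientUpdater:
    """Gates value updates sampled from a replay memory: positive TD errors
    always pass, negative ones are probabilistically ignored while the
    (state, action) temperature is still hot."""
    def __init__(self, t0=1.0, decay=0.995, k=1.0):
        self.temp = defaultdict(lambda: t0)  # per (state, action) temperature
        self.decay, self.k = decay, k

    def accept(self, state, action, td_error):
        key = (state, action)
        t = self.temp[key]
        self.temp[key] = t * self.decay  # cool with every visit
        if td_error >= 0:
            return True
        # Leniency is high while the temperature is high, so early negative
        # updates (often caused by the other agent still exploring) are ignored.
        leniency = 1.0 - math.exp(-self.k * t)
        return random.random() >= leniency
```

The optimism comes from the asymmetry: positive evidence is always absorbed, while negative evidence is discounted early in learning.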
A significant amount of research in recent years has been dedicated to single-agent deep reinforcement learning. Much of its success can be attributed to the use of experience replay memories within which state transitions are stored. Function approximation methods such as convolutional neural networks (referred to as deep Q-Networks, or DQNs, in this context) can subsequently be trained by sampling the stored transitions. However, considerations are required when using experience replay memories within multi-agent systems, as stored transitions can become outdated due to agents updating their respective policies in parallel [1]. In this work we apply leniency [2] to multi-agent deep reinforcement learning (MA-DRL), acting as a control mechanism to determine which sampled state transitions are allowed to update the DQN. Our resulting Lenient-DQN (LDQN) is evaluated using variations of the Coordinated Multi-Agent Object Transportation Problem (CMOTP) outlined by Buşoniu et al. [3]. The LDQN significantly outperforms the existing hysteretic DQN (HDQN) [4] within environments that yield stochastic rewards. Based on results from experiments conducted using vanilla and double Q-learning versions of the lenient and hysteretic algorithms, we advocate a hybrid approach where learners initially use vanilla Q-learning before transitioning to double Q-learning upon converging on a cooperative joint policy.
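For contrast with leniency, the hysteretic baseline mentioned above can be sketched in tabular form (our illustration of the general hysteretic idea, not the HDQN implementation): two learning rates, a larger one for positive TD errors and a smaller one for negative errors.

```python
def hysteretic_q_update(Q, s, a, r, next_q_max, alpha=0.1, beta=0.01, gamma=0.95):
    """Hysteretic Q-learning sketch: optimistic learners that absorb good
    news at rate alpha and bad news at the smaller rate beta."""
    delta = r + gamma * next_q_max - Q[(s, a)]
    rate = alpha if delta >= 0 else beta
    Q[(s, a)] += rate * delta
    return Q[(s, a)]
```

Unlike leniency, the asymmetry here is fixed rather than annealed, which is one intuition for why hysteretic learners can struggle when rewards are stochastic: persistent optimism cannot distinguish teammate exploration from genuine reward noise.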
The main contribution of this work is a coded distributed learning framework that can be applied with any policy gradient method to solve MARL problems efficiently despite possible straggler effects. As an illustration, we apply the proposed framework to create a coded distributed version of MADDPG, a state-of-the-art MARL algorithm. Furthermore, to gain a comprehensive understanding of the benefits of coding in distributed MARL, we investigate various codes, including the maximum distance separable (MDS) code, random sparse code, replication-based code, and regular low-density parity-check (LDPC) code. Simulations in several multi-robot problems, including cooperative navigation, predator-prey, physical deception, and keep-away tasks, indicate that the proposed framework speeds up the training of policy gradient algorithms in the presence of stragglers, while maintaining the same accuracy as a centralized approach.
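The simplest of the codes listed, the replication-based code, gives the flavor of how coding masks stragglers. The sketch below is our own illustration of the principle, not the paper's framework: each gradient task is replicated across r workers, and training proceeds as soon as one replica of every task has arrived.

```python
def replication_decode(responses, num_grads, r=2):
    """Replication-code sketch: each of num_grads gradient tasks is assigned
    to r workers; `responses` maps worker id -> (task id, gradient) for
    workers that have finished. Up to r-1 stragglers per task are tolerated."""
    recovered = {}
    for _, (task, grad) in responses.items():
        recovered.setdefault(task, grad)  # first replica to arrive wins
    if len(recovered) < num_grads:
        return None  # some task has all of its replicas still straggling
    return [recovered[t] for t in range(num_grads)]
```

MDS and LDPC codes achieve the same goal more efficiently by letting *linear combinations* of gradients from any sufficiently large subset of workers reconstruct the full gradient, at the cost of an encoding/decoding step.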
Animals are often on the move to search for something: a food source, a potential mate, or a desirable site for laying their eggs. In many instances their navigation is informed by airborne chemical cues. One of the best known, and most impressive, olfactory search behaviors is displayed by male moths [45, 48, 51, 70]. Males are attracted by the scent of pheromones emitted in minute amounts by calling females that may be hundreds of meters away. The difficulty of olfactory search can be appreciated by realizing that, due to air turbulence, the odor plume downwind of the source breaks down into small, sparse patches interspersed with clean air or other extraneous environmental odors [71, 72]. The absence of a well-defined gradient in odor concentration at any given location and time greatly limits the efficiency of conventional search strategies like gradient climbing. Experimental studies have in fact shown that moths display a different search strategy composed of two phases: surging, i.e. sustained upwind flight, and casting, i.e. extended alternating crosswind motion. These phases occur depending on whether the pheromone signal is detected or not. This strategy and others have inspired the design of robotic systems for the identification of sources of gas leaks or other harmful volatile compounds [73-77]. Although the effectiveness of individual search is already remarkable in itself, performance can be further boosted by cooperation among individuals, even in the absence of centralized control [12, 78-82].
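The two-phase surge-and-cast strategy can be written down as a very small reactive policy. This is a hedged caricature for illustration only; real implementations tune the cast width and use wind-frame coordinates.

```python
def cast_and_surge_step(detected, t_since_detection, cast_period=5):
    """Moth-inspired policy sketch: surge upwind while the pheromone is
    detected, otherwise cast crosswind with alternating direction.
    Returns a (downwind, crosswind) unit step; parameters are illustrative."""
    if detected:
        return (-1, 0)  # surge: move upwind (negative x = toward the source)
    # cast: flip the crosswind direction every cast_period steps without signal
    direction = 1 if (t_since_detection // cast_period) % 2 == 0 else -1
    return (0, direction)
```

The policy needs no concentration gradient at all, only a binary detection event, which is exactly what the patchy turbulent plume provides.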
Based on the function class F to which f belongs, the difficulty of the optimization problem that defines finding f varies, as does its statistical performance. By exploiting structural properties of the choice of F in a principled way, while also making use of convex optimization techniques such as Lagrange duality and penalty methods, we have developed tools that allow a network of interconnected agents to collaboratively learn accurate statistical models from their local data streams and message passing with their neighbors. While we have not solved nonparametric multi-agent stochastic programs exactly, we have solved them approximately in a memory-efficient way that is provably stable and yields good performance in practice. It is left to future directions, discussed in more detail in Chapter 9, to solve multi-agent nonparametric stochastic programs exactly using Lagrange duality, as well as to extend this framework to settings such as RKHSs with compositional multi-layer kernels (possibly coming within striking distance of the off-line accuracy benchmarks set by deep learning) and different hypotheses regarding agents' data, which may motivate the use of proximity constraints as in Chapter 3.
Hao et al. carried out a systematic investigation of how agents can learn to coordinate on an optimal policy in various cooperative multi-agent environments under the networked social learning framework. In this framework, they proposed two types of learners: IALs and JALs. In their previous work, the authors introduced the networked social learning framework, which concentrated on two representative social network topologies, small-world and scale-free, and successfully improved coordination among agents. In this article, they extended the previous framework by considering two more representative topologies: random and ring networks. After a systematic investigation, they concluded that the network topology has a significant impact on the learning performance of agents and that their framework accelerated coordination among agents. In general, JALs are able to achieve better coordination performance than IALs. Their framework thus covered four representative topologies. The influence of the various network topologies and topology factors on the learning performance of agents was investigated in this work, and the experimental results showed that the underlying topology indeed matters for both the success rate of coordination and the convergence rate.
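Two of the four topologies studied, ring and random, are simple enough to construct directly. The sketch below is our own stdlib-only illustration of how such interaction graphs might be generated for a social-learning experiment; it is not the authors' setup.

```python
import random

def ring_network(n):
    """Ring topology: each agent is linked to its two immediate neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def random_network(n, p, seed=0):
    """Erdős-Rényi random topology: each pair of agents is linked
    independently with probability p."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj
```

In a social-learning run, each agent would then observe or imitate only the neighbors listed in its adjacency entry, which is what makes the topology matter for coordination speed.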
Three retail stores are considered which sell a selected product and give quantity discounts to customers purchasing many items. Each seller's inventory strategy, refill period, and the arrival process of the customers are modelled, and a Markov Decision Process (MDP) model is proposed for this system. A new way of performing context-based dynamic decision making through cooperative multi-agent learning algorithms is proposed. Specifically, a novel approach to multi-agent cooperation methods via reinforcement learning (MCMRL) is proposed, and communication methods for reinforcement learning built on the multi-agent scheme are proposed and implemented. The paper is organized as follows. Section 2 describes the proposed approach to multi-agent cooperation methods via reinforcement learning (MCMRL). Section 3 illustrates the system dynamics of the retail shops modelled as a Markov decision process. Section 4 presents simulation results for all four methods with continuing price as the profit parameter. Section 5 gives concluding remarks.
It has been shown in both model-based planning and model-free learning that value decomposition facilitates multi-agent algorithms. One of the earliest works on decomposing the value function in MDP planning is by Schneider et al., who showed that decomposing the value function can reduce the complexity of computing a joint policy. Later, Guestrin et al. showed that the optimal joint centralized policy in MDPs with factored transition and reward functions can be efficiently computed by using the factored value function in approximate dynamic programs. Using a similar idea to the model-based factored value function, Kok and Vlassis proposed a model-free counterpart where value function components are learnt from local rewards. For Dec-(PO)MDPs with independent transitions, Kumar et al. proposed individual value functions computed by a factored dynamic program based on a sparse interaction structure. The main problem with model-based value function decomposition is that it assumes sparse interaction between agents, whereas in our CDec-POMDP domains an agent can interact with all other agents along its trajectory. Recently, in parallel to our work, Sunehag et al. proposed approximating the global value function by a sum of local value functions, one for each agent. Their agent decomposition requires maintaining a value function for every agent, and the training of the value function uses global rewards, which is not effective in CDec-POMDPs with large numbers of agents.
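The additive decomposition attributed to Sunehag et al. above reduces, in tabular form, to a one-line identity. The sketch below is our minimal illustration of that structure (their actual work uses neural value functions trained end-to-end, not tables).

```python
from collections import defaultdict

def vdn_joint_q(local_Qs, state, joint_action):
    """Value-decomposition sketch: the global value of a joint action is the
    sum of each agent's local value for its own action component."""
    return sum(Q[(state, a)] for Q, a in zip(local_Qs, joint_action))
```

Because the sum is maximized by maximizing each term independently, greedy joint action selection decomposes into per-agent argmaxes, which is the computational appeal of this factorization.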
There are two categories of information that can be used to bias action selection: information from prior tasks and information from the learning process [Bianchi et al., 2008]. The former requires that knowledge is available prior to the target task's execution, while the latter can be established at run-time. Both operate in a broadly similar manner; they try to extract structure from the environment in much the same way as a learning model, and this structure is then used to guide exploration. If a particular sequence of actions is found to transition to a goal state (or other 'good' area of the state space), then biassing encourages the agent to explore this area preferentially [Bianchi et al., 2007]. This does not reduce the amount of exploration or the time it takes; it just prioritises it, so that the exploration with the greatest effect on performance is done first. In applications with limited training time or resources, this allows good behaviour earlier, but the rest of the exploration will still need to be done to guarantee an optimal policy. In effect, Selection Biassing tries to get an agent to follow a (known or expected) good partial policy, which is to some degree equivalent to Reward Shaping. Selection Biassing has no direct effect on any of the problems affecting learning (the credit assignment problem (I), sparsely visited states (II), or sample variation (III)); it only addresses them by increasing the frequency of visits to given states, which necessarily reduces visits to others.
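One common way to implement selection biassing, in the spirit of the heuristically accelerated learners of Bianchi et al. (our sketch, not their exact formulation), is to add a heuristic bonus to the Q-values during action selection only, leaving the learned values untouched:

```python
def biased_greedy(Q, heuristic, state, actions, xi=1.0):
    """Selection-biassing sketch: a heuristic bonus H(s, a), scaled by xi,
    is added to Q-values at selection time only, so the exploration order
    changes but the values being learned do not."""
    return max(actions,
               key=lambda a: Q.get((state, a), 0.0)
                             + xi * heuristic.get((state, a), 0.0))
```

This makes the contrast with Reward Shaping concrete: shaping alters the update target and thus the learned values, whereas biassing only reorders which state-action pairs get visited first.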
This paper deals with intelligent autonomous navigation of a vehicle in a cluttered environment. We present a control architecture for safe and smooth navigation of an Unmanned Ground Vehicle (UGV). This control architecture is designed to allow the use of a single control law for different vehicle contexts (attraction to the target, obstacle avoidance, etc.). The reactive obstacle avoidance strategy is based on the limit-cycle approach. To manage the interaction between the controllers according to the context, a multi-agent system is proposed. Multi-agent systems are an efficient approach to problem solving and decision making; they can be applied to a wide range of applications thanks to their intrinsic properties such as self-organization and emergent phenomena. The merging approach between control laws is based on their properties, adapting the control to the environment. Various simulations in cluttered environments show the performance and efficiency of our proposal in obtaining a fully reactive and safe control strategy for the navigation of a UGV.
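For readers unfamiliar with the limit-cycle approach, the sketch below integrates the vector field commonly used in this family of obstacle-avoidance methods: trajectories are attracted to a circle of radius r around the obstacle and orbit it. This is a generic illustration; the exact field and parameters used in the paper may differ.

```python
def limit_cycle_step(x, y, r=1.0, dt=0.01):
    """One Euler step of a limit-cycle vector field: the radial term
    x*(r^2 - x^2 - y^2) attracts trajectories to the circle of radius r,
    while the (y, -x) term makes them circulate clockwise around it."""
    dx = y + x * (r * r - x * x - y * y)
    dy = -x + y * (r * r - x * x - y * y)
    return x + dt * dx, y + dt * dy
```

Starting anywhere outside (or inside) the circle, the vehicle converges onto the orbit, which is what lets it skirt smoothly around an obstacle before the attraction-to-target controller takes over again.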
agents to observe some otherwise hidden information only while they are learning. We view such learning as a rehearsal: a phase where agents are allowed to access information that will not be available when executing their learned policies. While this additional information can facilitate learning during rehearsal, agents must learn policies that can indeed be executed in the Dec-POMDP (i.e., without relying on this additional information). Thus agents must ultimately wean their policies off any reliance on this information. This creates a principled incentive for agents to explore actions that will help them achieve this goal. Based on these ideas, we present a new approach to RL for Dec-POMDPs, Reinforcement Learning as a Rehearsal (RLaR), including a new exploration strategy. We establish a weak convergence result for RLaR, demonstrating that RLaR's value function converges in probability when certain conditions are met, and demonstrate experimentally that RLaR can nearly optimally solve several existing benchmark Dec-POMDP problems with low sample complexity. We also compare RLaR against an existing approximate Dec-POMDP solver, Dec-RSPI, which also does not assume a priori knowledge of the model. Instead, Dec-RSPI assumes full access to a simulator, which makes it a planning algorithm. Even so, it is comparable to a learning algorithm such as RLaR due to the shared lack of a priori knowledge of the model, and its access to at least as much information as RLaR assumes. We show that while RLaR's policy representation is not as scalable as Dec-RSPI's, it produces higher-quality policies for the problems and horizons studied.
Figure 5: A bird’s eye view of the simulation. A robotic ball is immersed in an arena composed of 3 coloured blocks. The contact between the toy and a block induces a laugh which is interpreted as a reward function. The long term objective is to maximize laugh production.
served (the object to imitate) and what can be done by the imitating entity (Nehaniv and Dautenhahn, 2002). Imitation also requires the existence of a function allowing the observed state of the world to be associated with the corresponding state of the toy. Such a function is complex and limits the interaction modalities. Another form of Learning from Demonstration proposes to combine Reinforcement Learning algorithms with Learning from Demonstration (Knox and Stone, 2009). Reinforcement Learning algorithms bring an interesting solution to the correspondence problem by letting the learning system discover through its own experience which interactions are interesting. The toy can then explore its environment and learn from its interactions with the child. However, in their original form, Reinforcement Learning algorithms require the existence of a feedback function which associates a reward value with each state of the world. This function is often complex to specify because it requires evaluating a priori the distance to an objective; in our context, however, this objective is a priori unknown and has to be discovered. On the other hand, the Inverse Reinforcement Learning approach (Argall et al., 2009) proposes to infer this reward function from a set of reward examples: the system is fed a set of situations and the associated rewards, and it infers a function that models this reward. This function can then be used to optimise a policy. But this solution implies that an external entity is able to provide the required demonstrations to learn the reward function, which is, in the case being considered, non-trivial. To face this chal-
electric power market modeling tools that can reliably and quickly approximate real-world conditions and predict market behavior under variable conditions. To a significant extent, research to date has either been devoted to modeling the physical power system as accurately as possible while omitting strategic participant behavior (state estimators used in ISO system operations, and generator commitment and dispatch algorithms, are examples), or has sacrificed the detail of the physical power system to focus on agent behavior. Some exceptions that attempt both include work reported in Conzelman et al., 2004; Bagnall and Smith, 2005; Sun and Tesfatsion, 2006; and Somani and Tesfatsion, 2008. The model detailed here incorporates, as mentioned before, a multi-node transmission model with locational marginal prices, features central to the Standard Market Design put forward by the Federal Energy Regulatory Commission in its early orders mandating a move towards open competitive electric power markets. Other features that will be necessary to incorporate into future models include multi-settlement systems, non-linear commitment and dispatch algorithms, dynamic load, demand-side participation, and more.
movement, and so on. The awareness of the information space gained by the user as a result of earlier visits, and the ability to assimilate the available information, also influence navigation abilities. In order to understand an information space, a mental model influenced by the user's spatial ability is created. Earlier studies propose that users create three types of mental models during navigation: landmark, route, and survey (Dillion & Vaughan, 1997). Landmark awareness is acquired at the initial stage of interaction, when the user obtains information on the exclusive properties of the information space. Route knowledge is defined by Dillion, McKnight & Richardson (1993) as "the ability to navigate from point A to point B utilizing the landmark knowledge acquired to make decisions about when to turn left or right" (p. 173). Survey knowledge is developed in the final stage of navigation and helps the user find landmarks and routes. Thuring, Hannemann and Haake (1995) claim that coherence and constancy of objects, such as colour, size and visual origin (Raubal, n.d.), are elements that affect the formation of mental models and spatial ability.
There are worldwide control centers planned for navigation and satellite control. The core of the ground segment will consist of two GALILEO control centers in Germany and Italy. The main control center will be the German Aerospace Center (DLR) at Oberpfaffenhofen. From there, control of the normal operation of the 30 satellites is planned for at least 20 years. A second comprehensive control center with its own specific responsibilities for normal operation will be located at Fucino in Italy; it will also act as a backup to the main control center in the event of any problems arising there. Control of the positioning of the 30 satellites will be evenly divided between the European Satellite Control Center (ESA/ESOC) in Darmstadt, Germany, and the French National Space Studies Center (CNES) in Toulouse, France. A chain of about 30 Integrity Monitoring Stations (IMS) distributed worldwide will monitor the integrity of the satellite signals. Two control centers will evaluate the IMS information and sound an alarm in the event of an excessive deviation in position data.
Having identified suitable information to transfer, and done so, the target agent needs to add this information to what it already has. In this work, we assume all agents can be trusted to share only properly formatted, correct information. The problem faced by the target agent is how best to determine whether the received information is more accurate, or better represents the environment, than its current information. Maintaining statistics about how often a state has been visited, comparing this to how often the received information was sampled, and accepting the most-sampled value is one possible approach. It would, however, fail if the environment does not change in the same way in response to each agent. This is particularly true when mapping between distinctly different source and target tasks, as there may not be a one-to-one correlation of convergence rates. This method would also show little variation in the information it tends to transfer. Since only selecting either the transferred information or the local information will not permit agents to use multiple transferring sources efficiently, local and received information have to be merged. If merging the new information can move a particular value towards its true value, then it should be done; otherwise the received information should be discarded. As the true value of a state cannot be known until the learning process is finished, estimates will have to suffice to determine whether merging should occur. If there is no prior information for a given state, the received information should be accepted. The more difficult decision is whether to merge when information is already available; in this case, the received information needs to be evaluated to determine whether it is likely to be closer to the true value.
Due to the way RL updates the value of a state-action pair, it tends to move towards the true value over several visits to a state. This means that a direction of movement can be determined, and this direction can then be used to check whether the received information lies in the right region for the true value. The steps towards the true value should get smaller as the value approaches convergence; if the steps have already become small, then it is probably not worth merging the received information. This merging scheme works reasonably well for the tested problems, though there is room for improvement. The information being transferred is not intended to provide the final exact value but to give an approximation so that fewer samples are needed to converge. Thus the significance attached to received data should drop as values approach convergence.
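The merging rule described above can be sketched as follows. The thresholds, the visit-weighted blend, and all names are our assumptions; the text specifies only the direction-of-movement and step-size criteria.

```python
def merge_value(local_q, prev_local_q, received_q, visits, min_step=1e-3):
    """Sketch of the merging heuristic: the recent update direction and step
    size of the local value decide whether to blend in a received value."""
    if visits == 0:
        return received_q  # no local information: accept the transfer outright
    step = local_q - prev_local_q
    if abs(step) < min_step:
        return local_q  # near convergence: transferred approximations add little
    moving_up = step > 0
    # Accept only if the received value lies in the direction of movement,
    # i.e. plausibly between the current estimate and the true value.
    if (moving_up and received_q > local_q) or (not moving_up and received_q < local_q):
        w = 1.0 / (1.0 + visits)  # trust the transfer less as experience grows
        return (1 - w) * local_q + w * received_q
    return local_q
```

The visit-count weight implements the closing observation: as a value converges (many visits, small steps), the influence of received data automatically fades.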