Expected Policy Update - Difference Rewards Based Credit Assignment

6.5.2 Expected Policy Update

One remedy for the high variance of the stochastic policy gradient is to compute an analytic expression of the policy gradient. The idea was first suggested for tabular Q-learning in the book of Sutton and Barto [117] as expected Sarsa. Instead of a Q-value update based on a single sampled action a_{t+1}, Sutton and Barto [117] proposed that the value be computed by an expectation over all possible actions. Seijen et al. [104] later proved the benefit of expected Sarsa in reducing the variance of the Sarsa algorithm. Recently, Asadi et al. [6] and Ciosek and Whiteson [22] proposed expected policy gradient methods, which have lower variance than the stochastic policy gradient. Instead of updating the policy using a sampled action, expected policy gradient approaches compute the policy gradient directly as an expectation over all possible actions. Asadi et al. [6] showed that this expectation is easy to compute in problems with discrete and finite action spaces. In continuous action domains, the expectation can be computed in closed form if the policy and value functions have Gaussian form [22]. For a non-Gaussian value function, Ciosek and Whiteson [22] suggested using a Taylor approximation of the value function instead. In our multi-agent decision making setting in CDec-POMDPs, the action space is discrete (hence the method in [22] is not applicable) and has exponential size because actions are joint (so it is impossible to use the method in [6] to compute the expectation over all possible joint actions).
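The contrast between sampled and expected updates can be illustrated for a single agent with a small discrete action set. The following is a minimal sketch, not the thesis's algorithm: the function names, the softmax parameterization of the policy, and the toy Q-values are all assumptions made for illustration.

```python
import numpy as np

def sarsa_target(reward, gamma, q_next, next_action):
    """Sarsa target: bootstraps from the single sampled next action."""
    return reward + gamma * q_next[next_action]

def expected_sarsa_target(reward, gamma, q_next, pi_next):
    """Expected Sarsa target: bootstraps from the expectation over all next actions."""
    return reward + gamma * float(np.dot(pi_next, q_next))

def softmax(logits):
    """Numerically stable softmax over a vector of logits (assumed policy form)."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def expected_policy_gradient(logits, q_values):
    """Gradient of sum_a pi(a) Q(a) with respect to the softmax logits.

    The sum over all actions is carried out analytically, so no action
    sampling is involved and the estimate has no sampling variance
    for a fixed Q.
    """
    pi = softmax(logits)
    baseline = np.dot(pi, q_values)   # value of the state under pi
    return pi * (q_values - baseline)
```

The sum over actions in `expected_policy_gradient` is only tractable here because the action set is small; with exponentially many joint actions the same enumeration is infeasible, which is the obstacle described above.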

Our MCAC algorithm uses the mean of the action count to estimate the policy gradient. The action mean is also used in some RL algorithms in the literature. Gu et al. [36] and Ciosek and Whiteson [22] used the action mean to compute a Taylor approximation of the value function of a single agent. Tumer and Agogino [126] and Wu et al. [150] used the mean action to estimate the difference of reward. Our MCAC approach differs from these works in its use of the mean joint action to estimate the policy gradient of a multi-agent policy in CDec-POMDP domains. Our policy is decentralized and the value function depends on the joint count of agents (in different states and actions), in contrast to the single-agent setting considered in [22, 36].
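One way to read "the mean of the action count" is sketched below, under the assumption that all agents in a given local state draw their actions independently from a shared policy, so that the action counts in that state follow a multinomial distribution. The function names and numbers are hypothetical, and the sketch shows only the mean count itself, not how MCAC uses it inside the gradient.

```python
import numpy as np

def sampled_action_counts(n_in_state, policy, rng):
    """Sample how many of the n_in_state agents pick each action."""
    return rng.multinomial(n_in_state, policy)

def mean_action_counts(n_in_state, policy):
    """Mean of the multinomial action counts: n(s) * pi(a | s)."""
    return n_in_state * np.asarray(policy, dtype=float)
```

The sampled counts fluctuate from draw to draw, whereas the mean count is a deterministic function of the state count and the policy, which is what makes it attractive as a low-variance substitute.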

6.6 Summary

In this work, we addressed the problem of collective multiagent RL with global rewards. Our main contributions include developing techniques for multiagent credit assignment and computing low variance gradient estimates in the presence of global rewards. In such settings, we showed that an effective critic that is trainable using global rewards is not decomposable among agents. To use a non-decomposable critic in multi-agent settings, we addressed the credit assignment problem by proposing the MCAC and CCAC algorithms.

To derive MCAC, we highlighted a general structure of the critic in the multiagent RL setting that is suited for the credit assignment problem, but unfortunately is difficult to train using global rewards. Therefore, we developed techniques based on approximation of the critic that can resolve such contrasting requirements. For lower variance of the gradients, we showed how to compute expected or mean collective policy gradients by exploiting the special feature of CDec-POMDPs.

To derive the CCAC algorithm, we used the notion of difference-of-reward/utility [126, 30] in multi-agent RL. We showed how difference-of-reward can be used in CDec-POMDP planning without agent identity. As the number of agents and the joint action space are large, we derived an approximation of difference-of-reward using the total differential. In a large population, the contribution of one agent to the whole population becomes small, which makes the differential an adequate approximation of the difference-of-reward function.
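The quality of the differential approximation in large populations can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the CCAC derivation: the quadratic global reward, the target fractions, and all function names are hypothetical, and the difference reward is taken to be the change in global reward when one agent is removed from a state.

```python
import numpy as np

def global_reward(counts, target_frac, n_agents):
    """Hypothetical smooth global reward: negative squared gap between
    the fraction of agents in each state and a target fraction."""
    frac = counts / n_agents
    return -float(np.sum((frac - target_frac) ** 2))

def exact_difference_reward(counts, target_frac, n_agents, state):
    """R(n) - R(n - e_state): effect of removing one agent from `state`."""
    without = np.asarray(counts, dtype=float).copy()
    without[state] -= 1.0
    return (global_reward(counts, target_frac, n_agents)
            - global_reward(without, target_frac, n_agents))

def differential_difference_reward(counts, target_frac, n_agents, state):
    """First-order approximation: partial derivative of R w.r.t. n_state."""
    frac = counts / n_agents
    return float(-2.0 * (frac[state] - target_frac[state]) / n_agents)

def approximation_error(n_agents):
    """Absolute gap between exact and differential difference rewards."""
    target = np.array([0.5, 0.3, 0.2])
    counts = np.array([0.2, 0.3, 0.5]) * n_agents
    exact = exact_difference_reward(counts, target, n_agents, 0)
    approx = differential_difference_reward(counts, target, n_agents, 0)
    return abs(exact - approx)
```

For this particular reward the exact difference is -2(x - d)/N + 1/N^2, where x is the state's population fraction, d its target, and N the population size, so the first-order term differs from it by 1/N^2. The gap therefore shrinks faster than the difference reward itself as the population grows.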

We tested our approaches on a real-world supply-demand taxi matching problem in a large Asian city with 8000 taxis and on a police re-allocation problem. Thanks to our techniques for multiagent credit assignment and low variance policy gradients, our multiagent RL algorithms converge to high quality solutions faster than the standard policy gradient method and the best factored actor-critic approach from Chapter 5. Our approaches are also competitive with a strong centralized online planner based on anticipatory algorithms [67], despite the decentralized and partially observable environment in our case.

Chapter 7 Conclusions and Future Works

7.1 Conclusions

This thesis contributed to the literature on multi-agent systems a “lifted” multi-agent planning framework using count variables. Our framework allows us to develop multi-agent reinforcement learning algorithms that optimize the decentralized policy of a large population (up to 8000 agents). In particular, we addressed the high complexity of the joint trajectory by proposing a novel representation with agent counts. The counts are more compact than the joint trajectory, as their dimensions depend only on the size of the local state spaces rather than on the population size. Based on this count-based representation, we proposed collective reinforcement learning algorithms to solve large-scale multi-agent planning problems with sampled values of the count variables. In local reward optimization problems, we proposed collective algorithms combined with the fictitious play rule to optimize individual policies. As inherited from fictitious play, our algorithms are also applicable to non-cooperative settings. Our fictitious-play-based algorithms can converge to a symmetric equilibrium in population games. However, as with other fictitious play algorithms, convergence to equilibrium cannot be guaranteed in general. In global reward optimization problems, we addressed the credit-assignment problem in multi-agent systems by proposing

in multiple cooperative multi-agent domains.
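The compactness of the count-based representation mentioned in this section can be made concrete with a short sketch. The helper name, the uniform random states and actions, and the sizes of the local state and action spaces are assumptions for illustration; only the population size of 8000 comes from the text.

```python
import numpy as np

def state_action_counts(states, actions, n_states, n_actions):
    """Return a table n[s, a]: the number of agents in local state s taking action a."""
    flat = np.asarray(states) * n_actions + np.asarray(actions)
    table = np.bincount(flat, minlength=n_states * n_actions)
    return table.reshape(n_states, n_actions)

rng = np.random.default_rng(0)
n_agents, n_states, n_actions = 8000, 5, 3          # sizes are illustrative
states = rng.integers(0, n_states, size=n_agents)   # one local state per agent
actions = rng.integers(0, n_actions, size=n_agents) # one action per agent
counts = state_action_counts(states, actions, n_states, n_actions)
```

The joint state-action description has one entry per agent (here 8000 pairs), while the count table has a fixed shape of `n_states` by `n_actions` regardless of how many agents are present.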

Our planning framework is based on two key ideas: the collective distribution of the counts in planning and the count-based value functions. These ideas were inspired by the counting formulas for lifted inference in Markov Logic Networks [17], [69] and collective inference in Collective Graphical Models (CGMs) [108]. The lifted inference technique was first proposed to compute the marginal probability of an individual state rather than to learn individual behavior as in our case. Recently, several works have extended the counting formulas in lifted inference to compute value functions in MDP planning [97], [113]. However, these works focused on finding policies of heterogeneous agents in domains with a sparse interaction graph. In contrast, our count-based planning framework considers domains where agents fully interact with each other. Second, although the collective distribution in CGMs shares some similarities with ours, it is only applicable to domains where there are typically no interactions between agents. Our work is the first to consider agent interactions and to apply the collective distribution of counts in multi-agent planning domains.

In document Reinforcement learning for collective multi-agent decision making (Page 156-160)