
3.2 Collective inference in CGM

3.2.4 Relation between CGM inference and CDec-POMDP planning

Researchers have shown the relation between reinforcement learning problems and probabilistic inference problems [138, 101]. In fact, MDP planning problems can be cast as probabilistic inference problems and solved using probabilistic inference algorithms such as expectation maximization [123, 58]. We showed in Proposition 3.1 that CGM and CDec-POMDP overlap in the special case of independent transitions. In this special case, a CDec-POMDP planning problem can be modeled as a collective inference problem in a CGM and can consequently be solved by CGM inference solvers.

Proposition 3.2. When the individual transition depends only on the previous local state and action, as $P_l(s^m_{t+1} \mid s^m_t, a^m_t)$, and agents follow the same open-loop policy $\pi(a^m_t \mid s^m_t)$, the CDec-POMDP planning problem can be re-cast as a CGM parameter inference problem.

Proof. Using Proposition 3.1, we can represent the dynamics in CDec-POMDP by a CGM with a set of nodes $\langle s^m_t, a^m_t \rangle_{m,t}$ for all agents over the planning horizon.

Similar to Toussaint et al. [123] and Kumar et al. [58], we can define a likelihood function using an auxiliary binary variable $\hat{y} \in \{0, 1\}$ that represents the reward received at each time $T$:

$$p(\hat{y}_T = 1 \mid n_T, T) = \frac{r(n^{sa}_T) - r_{\min}}{r_{\max} - r_{\min}},$$

in which $r_{\max}$ and $r_{\min}$ are the maximum and minimum values of the immediate reward.

Then we can re-interpret the planning objective function $E\big[\sum_{T=0}^{H} r_T\big]$ as the maximization of a mixture of likelihoods over time:

$$\max\; p(\hat{y} \mid \theta) = \sum_{T=0}^{H} p(T) \sum_{n_{1:T} \in \Omega} p(\hat{y}_T = 1 \mid n_T, T)\, p(n_{1:T} \mid \theta),$$

in which the mixture weight is $p(T) = 1/H$.
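The re-interpretation above can be checked numerically. The following is a minimal sketch, not the thesis's implementation: it simulates agents with independent local transitions under a shared policy, aggregates them into state-action counts, and compares a Monte Carlo estimate of the mixture likelihood with the average return. The names `simulate_counts`, `reward`, and `estimate`, the toy reward on counts, and the loose bounds `r_min` and `r_max` are all assumptions made for illustration. The sketch also assumes the horizon is indexed $T = 0, \dots, H-1$ so that the weights $p(T) = 1/H$ sum to one.

```python
import random


def simulate_counts(policy, trans, init, n_agents, horizon, rng):
    """Sample one joint trajectory; return the state-action counts per step.

    policy[s][a]    : shared probability of action a in local state s
    trans[s][a][s2] : independent local transition P_l(s2 | s, a)
    init[s]         : initial local-state distribution
    """
    n_states, n_actions = len(policy), len(policy[0])
    states = rng.choices(range(n_states), weights=init, k=n_agents)
    counts = []
    for _ in range(horizon):
        n_sa = [[0] * n_actions for _ in range(n_states)]
        next_states = []
        for s in states:
            a = rng.choices(range(n_actions), weights=policy[s])[0]
            n_sa[s][a] += 1
            next_states.append(rng.choices(range(n_states), weights=trans[s][a])[0])
        counts.append(n_sa)
        states = next_states
    return counts


def reward(n_sa):
    """Hypothetical count-based reward: bonus for action 1, penalty for crowding."""
    crowding = sum(sum(row) ** 2 for row in n_sa)
    bonus = sum(row[1] for row in n_sa)
    return bonus - crowding


def estimate(policy, trans, init, n_agents, horizon, n_samples, seed=0):
    """Monte Carlo estimates of the mixture likelihood and of the expected return."""
    rng = random.Random(seed)
    # Loose but valid bounds for the toy reward above (assumption of this sketch).
    r_min, r_max = -n_agents ** 2, n_agents
    lik_sum, ret_sum = 0.0, 0.0
    for _ in range(n_samples):
        rewards = [reward(n) for n in simulate_counts(policy, trans, init, n_agents, horizon, rng)]
        ret_sum += sum(rewards)
        # p(T) = 1/H times p(y_T = 1 | n_T, T), summed over T
        lik_sum += sum((r - r_min) / (r_max - r_min) for r in rewards) / horizon
    return lik_sum / n_samples, ret_sum / n_samples, r_min, r_max


policy = [[0.7, 0.3], [0.4, 0.6]]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
init = [0.5, 0.5]
H = 4
lik, ret, r_min, r_max = estimate(policy, trans, init, n_agents=5, horizon=H, n_samples=200)
# The likelihood is an affine function of the expected return.
affine = (ret - H * r_min) / (H * (r_max - r_min))
print(lik, affine)
```

Because the likelihood estimate is a positive affine transformation of the average return, any policy parameter that maximizes one also maximizes the other, which is the sense in which planning can be read as parameter inference here.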

3.3 Related works

Count inference was first proposed as MAP inference for CGMs by Sheldon et al. [107], as a sub-step of parameter learning within the EM framework [24]. Since then, a number of approaches have been proposed for inference in CGMs [108, 107, 63, 114]. Sheldon et al. develop a continuous convex relaxation of the MAP inference problem formulated over the junction tree derived from the individual model, and solve it using a generic optimization solver. Liu et al. [63] develop a Gaussian approximation for CGMs and use expectation propagation for inference. Sun et al. [114] generalize the well-known belief propagation algorithm [89] to nonlinear belief propagation (NLBP) for CGMs.

There is a close relation between MAP inference in CGMs and probabilistic inference in standard graphical models [49, 135]. In particular, the marginal count constraint and the likelihood function of the count variables are equivalent to the marginal probability constraint and the posterior probability function in probabilistic inference [49]. This relation between count inference and probabilistic inference was also noticed previously, when Liu et al. [63] and Sun et al. [114] adapted belief propagation methods to the count inference problem. This motivates us to consider other techniques from standard probabilistic inference, namely the Bethe entropy approximation [154] and the concave-convex procedure (CCCP) [156, 155], to develop approxi-

3.4 Summary

In this chapter, we showed the relation between CGM inference problems and CDec-POMDP planning problems.

The collective inference problem was introduced by Sheldon and Dietterich [108] to infer the underlying counts from noisy observations of the counts. To solve the collective inference problem, Sheldon and Dietterich [108] constructed a collective graphical model (CGM) of the counts as a lifted representation of the population. The CGM assumes that agents have transition functions that are independent of each other. In our CDec-POMDP model, agents interact with each other and their transition functions are interdependent through the collective behavior (the counts). However, we showed that our planning model and CGM overlap in the case of independent transitions and an open-loop policy. Furthermore, in this special case, the objective function in collective planning can be re-cast as a likelihood function in CGM.

Although we can model a special instance of CDec-POMDP planning as a CGM inference problem, we show this only to demonstrate the relation between CDec-POMDP and CGM. In general, the sampling process for the directed graphical model in CDec-POMDP is more efficient than the rejection sampling procedure in CGM.

Chapter 4 Collective Multi-agent Reinforcement Learning Framework

In this chapter, we present general frameworks to optimize individual agent policies in the CDec-POMDP model. First, we study the model-based approach by showing the dynamic program for CDec-POMDPs as a special case of DEC-POMDPs. We show that we can reformulate dynamic programming in CDec-POMDP using the counts, which has lower complexity than dynamic programming over joint state-actions. Unfortunately, the lifted dynamic programming algorithm in CDec-POMDP still has exponential time complexity with respect to the number of states and actions. This motivates us to develop sampling-based planning algorithms using reinforcement learning in CDec-POMDPs. To establish the basis for efficient RL algorithms, we study the decomposition of the critic and the decomposition of the policy gradient in CDec-POMDPs. For the critic decomposition, we show that the compatible value function approximation (or critic) in CDec-POMDP is decomposable amongst agents. For the actor decomposition, we show that if the critic is a linear function of the action counts, the policy gradient is decomposable. We show

CDec-POMDP algorithms with fast convergence to high quality solutions.
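To give a feel for why working with counts is cheaper than working with joint states, the sketch below compares the number of joint state assignments with the number of distinct state-count vectors for a population of identical agents. It is only a back-of-the-envelope illustration under the assumption that agents are exchangeable, using the standard stars-and-bars formula. It is not the lifted dynamic program from this chapter, and the function names are hypothetical.

```python
from math import comb


def joint_state_count(n_agents, n_states):
    """Number of joint local-state assignments when agents are distinguished."""
    return n_states ** n_agents


def count_vector_count(n_agents, n_states):
    """Number of count vectors (n_1, ..., n_S) with n_1 + ... + n_S = n_agents."""
    return comb(n_agents + n_states - 1, n_states - 1)


# 20 agents, 5 local states
print(joint_state_count(20, 5))   # 5**20 = 95367431640625
print(count_vector_count(20, 5))  # C(24, 4) = 10626

# The count space still grows quickly as the number of states increases.
print(count_vector_count(20, 10))
```

For a fixed number of states the count space grows only polynomially in the number of agents, but it still grows rapidly as states (and actions) are added, which is consistent with the motivation above for sampling-based methods.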

4.1 Multi-agent Planning Model

In document Reinforcement learning for collective multi-agent decision making (Page 66-70)