Reward Models - Learning Models of Behavior From Demonstration and Through Interaction

which translates the agents’ observations {Yn(t)} into local control commands {un(t)}.

The policy is parametrized by a control parameter $\theta \in \Theta \subseteq \mathbb{R}^C$ with corresponding parameter space $\Theta$. Note that $\Theta$ implicitly defines the control set $\mathcal{U}$ through the functional form of $\pi_\theta$.

Putting all ingredients together, we can write Equation (19.1) in the more explicit form

$$
\mathrm{d}X_n = h\big(X_n, \pi_\theta(\xi(X_n, X))\big)\,\mathrm{d}t + \mathrm{d}W_n = h\Big(X_n, \pi_\theta\Big(\frac{1}{|\mathcal{N}_n|}\sum_{m \in \mathcal{N}_n} g(X_n, X_m)\Big)\Big)\,\mathrm{d}t + \mathrm{d}W_n. \tag{19.5}
$$

In Chapter 21, we will consider the system from a reinforcement learning perspective and our goal will be to find a particular $\theta$ such that the resulting network optimally solves a given task. For now, however, we may assume that $\theta$ is fixed.
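To make the structure of Equation (19.5) concrete, the following is a minimal simulation sketch based on an Euler-Maruyama discretization. Everything specific in it is an assumption chosen for illustration and is not taken from the text: single-integrator dynamics `h(x, u) = u`, relative displacements `g(x_n, x_m) = x_m - x_n`, a neighborhood defined by a sensing radius, a linear policy `pi_theta(y) = theta * y`, and isotropic noise with diffusion coefficient `D`.

```python
import numpy as np

def observe(X, radius):
    """Observation model xi: average of g(X_n, X_m) over the neighbors of n.

    Assumption: g(x_n, x_m) = x_m - x_n, neighbors = agents within `radius`.
    X has shape (N, d); the result has the same shape.
    """
    diff = X[None, :, :] - X[:, None, :]              # diff[n, m] = X_m - X_n
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist <= radius) & ~np.eye(len(X), dtype=bool)
    count = np.maximum(mask.sum(axis=1, keepdims=True), 1)  # no neighbors -> Y_n = 0
    return (diff * mask[:, :, None]).sum(axis=1) / count

def policy(Y, theta):
    """Hypothetical reactive policy pi_theta: a linear gain on the observation."""
    return theta * Y

def h(X, U):
    """Hypothetical single-integrator dynamics: the control is the velocity."""
    return U

def simulate(X0, theta, T=1.0, dt=0.01, radius=np.inf, D=0.0, rng=None):
    """Euler-Maruyama rollout of dX_n = h(X_n, pi_theta(Y_n)) dt + dW_n."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(X0, dtype=float)
    steps = int(round(T / dt))
    traj = [X.copy()]
    for _ in range(steps):
        Y = observe(X, radius)                        # local observations Y_n
        U = policy(Y, theta)                          # local control commands u_n
        dW = rng.normal(scale=np.sqrt(D * dt), size=X.shape)
        X = X + h(X, U) * dt + dW
        traj.append(X.copy())
    return np.stack(traj)                             # shape (steps + 1, N, d)
```

With `D = 0` and a positive gain, two agents that see each other contract toward their common midpoint, which is a quick sanity check that the observation and control signs are consistent.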

Remark 5 (State exploration). By restricting our interaction model to the class of deterministic reactive policies (i.e., where $u_n(t)$ is a function of the current observation $Y_n(t)$ only), we focus on settings with agents that do not explicitly face the problem of information gathering (see Discussion and Outlook). In particular, inspired by the characteristics of natural swarm systems, we assume a simplistic system architecture that bypasses the requirement of long-term agent memory by replacing exploration strategies executed at the agent level with collective strategies deployed at the swarm level.∗

As shown by the examples in Chapter 22, this architecture is indeed flexible enough to solve a number of common swarm-related tasks. However, it should be mentioned that there are situations for which an extension to stochastic system controllers may be necessary, for example, to model the effect of spontaneous symmetry breaking in natural swarms. For the special case of linear dynamics h and Gaussian control policies, such an extension is straightforward since the stochastic component of the controller can be absorbed into the diffusion coefficient D of the noise processes {Wn}.

∗Note that both these exploration types are again different from the exploration in policy space, which

Remark 6 (Information processing). In Equation (19.5), $\pi_\theta$ operates directly on the raw (cumulative) sensory information gathered by the agents. The use of more complex observational features to control the agent behavior can be easily realized with the help of an additional preprocessing stage that first extracts the required high-level information from the agent’s sensory input before passing it to the controller. However, both these formulations are mathematically equivalent since any preprocessing stage can be absorbed into $\pi_\theta$; hence, we may stick with the more compact form in Equation (19.5) without loss of generality. Nevertheless, for the overall picture of the information processing chain, it can be helpful to keep that preprocessing step in mind. A simple example is discussed in Section 22.2 (Equation 22.1).
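The equivalence claimed in Remark 6 amounts to function composition, which the following sketch illustrates. Both `preprocess` and `controller` are hypothetical stand-ins invented for this example (they are not the functions used in Section 22.2); the point is only that their composition is again a map from raw observations to controls, i.e., a valid $\pi_\theta$.

```python
import numpy as np

def preprocess(y):
    """Hypothetical feature extractor: magnitude and unit direction of y."""
    norm = np.linalg.norm(y)
    direction = y / norm if norm > 0 else np.zeros_like(y)
    return np.concatenate(([norm], direction))

def controller(features, theta):
    """Hypothetical controller acting on the extracted features."""
    norm, direction = features[0], features[1:]
    return theta * np.tanh(norm) * direction          # saturating gain along direction

def absorb(preprocess, controller):
    """Fold the preprocessing stage into a single policy on raw observations."""
    def policy(y, theta):
        return controller(preprocess(y), theta)
    return policy

pi_theta = absorb(preprocess, controller)
u = pi_theta(np.array([3.0, 4.0]), theta=2.0)         # control computed from raw input
```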

19.4 Reward Models

With regard to the forthcoming policy optimization procedure described in Chapter 21, we finally need to define a learning objective for our system. As in Part III, we define that objective via a scalar reward signal $r(t)$, which serves as feedback from the system process and allows us to assess the system’s instantaneous state in the context of the collective task. In the following, we consider the finite-horizon value criterion, which measures the total reward accumulated over a certain period of time $T$, i.e.,

$$
V_\theta \triangleq \mathbb{E}\left[\int_0^T r(t)\,\mathrm{d}t \,\middle|\, \pi_\theta\right]. \tag{19.6}
$$

Herein, the expectation is with respect to the underlying random system trajectory generated under $\pi_\theta$, where we assume that the initial system state is drawn from some fixed state distribution.

While the reward mechanism has been defined only locally in Part III, we now consider a more general setting where we allow arbitrary feedback from the system process to train our network. To this end, the following two types of reward mechanisms are considered.

19.4.1 Global Reward Model

From a macroscopic point of view, a natural approach to assess the behavior of the network is to use a reward mechanism that directly evaluates the global system state, i.e.

$$
r^G(t) \triangleq R^G\big(X(t)\big), \tag{19.7}
$$

where $R^G : \mathcal{X}^N \to \mathbb{R}$ is a score function assessing the entire agent configuration. For

Chapter 19: The Agent Model

to a target location or, alternatively, in an aggregation task, it could be the (negative) average pairwise distance between the agents.
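As an illustration of a global score function $R^G$, the aggregation example mentioned above (negative average pairwise distance) can be sketched as follows. The array layout, with one row per agent, and the convention of returning zero for fewer than two agents are assumptions made for this example.

```python
import numpy as np

def global_reward_aggregation(X):
    """R^G for an aggregation task: negative average pairwise distance.

    X has shape (N, d), one row per agent position (assumed layout).
    """
    N = len(X)
    if N < 2:
        return 0.0                                    # no pairs to compare (assumed convention)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # dist[n, m] = ||X_n - X_m||
    pairs = np.triu_indices(N, k=1)                   # each unordered pair once
    return -float(dist[pairs].mean())
```

Since the score depends on all agent positions at once, evaluating it requires access to the full system state, which is precisely what distinguishes it from a local reward.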

Considered from a system optimization perspective, the global reward signal $r^G(t)$ can be interpreted as a direct feedback from the system state $X(t)$ to a centralized critic system that monitors the global system state and utilizes the provided reward information to optimize the agent behavior through $\pi_\theta$. In fact, if we treat our system model as a black box and ignore its internal decentralized structure, the setting can be interpreted as a single-agent scenario with the central critic system as the only decision-maker.

19.4.2 Local Reward Model

The computation of the global reward signal as defined in Equation (19.7) requires complete knowledge of the global system state X(t), which, in a distributed network with many agents, may not be available at any point in the system. To account for this information loss, we also consider the decentralized setting from Part III, where each agent n is provided its own local reward signal rn(t) instead, i.e.

$$
r_n(t) \triangleq R^L\big(Y_n(t)\big) = R^L\big(\xi(X_n(t), X(t))\big). \tag{19.8}
$$

Herein, $R^L : \mathcal{Y} \to \mathbb{R}$ is a score function used by the agents to assess their local state configuration captured in the form of the local observations $\{Y_n(t)\}$. Assuming that the scalar reward signals $\{r_n(t)\}$ can be accessed by a central critic system, we can define our performance measure indirectly via the local agent rewards, i.e.,

$$
r^L(t) \triangleq \frac{1}{N}\sum_{n=1}^{N} r_n(t). \tag{19.9}
$$

From a learning perspective, we can think of this scenario as a setting where each agent reports its local reward (to be interpreted in this context as a summary description of its local state configuration) to a centralized learning system, whose goal is to coordinate the system behavior based on the received local feedback.
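The reporting scheme of Equations (19.8) and (19.9) can be sketched as follows. The observation model (average relative displacement to neighbors within a sensing radius) and the local score $R^L(y) = -\lVert y \rVert$ are hypothetical choices made only to obtain a runnable example; the text does not prescribe them.

```python
import numpy as np

def local_observations(X, radius):
    """Assumed xi: average displacement to neighbors within `radius`; X is (N, d)."""
    diff = X[None, :, :] - X[:, None, :]              # diff[n, m] = X_m - X_n
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist <= radius) & ~np.eye(len(X), dtype=bool)
    count = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    return (diff * mask[:, :, None]).sum(axis=1) / count

def local_score(Y):
    """Hypothetical R^L: penalize the offset from the local neighbor centroid."""
    return -np.linalg.norm(Y, axis=-1)

def local_rewards(X, radius):
    """r_n = R^L(Y_n) for every agent, computed from local observations only."""
    return local_score(local_observations(X, radius))

def averaged_local_reward(X, radius):
    """r^L: the critic's average of the N reported local rewards."""
    return float(local_rewards(X, radius).mean())
```

Note that under these assumed choices an isolated agent observes a zero offset and therefore reports the maximal reward of zero, which shows how strongly the resulting collective behavior can depend on the design of $R^L$.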

Remark 7 (Global critic, local actors). For both described reward mechanisms, learning takes place centrally at the critic system, irrespective of whether the behavior is evaluated globally or locally. Yet, it is important to note that, once the learning phase is completed, the agents no longer rely on any centralized information, and hence, they can act autonomously without further instructions from the critic. The learning architecture thus resembles an actor-critic network [Gro+12] in which the global critic system serves as a fusion center during the training period but where the trained actor (i.e., the agent’s policy) relies only on local information. A particular learning strategy that exploits this special structure can be found in [HŠN17]. Also, note that extensions of the proposed information processing pipeline are conceivable that do not involve a fusion center at all (see Discussion and Outlook).

19.5 Value Estimation

In order to assess the performance of a particular system policy in the context of a given task, we need to estimate the corresponding value (Section 19.4). Writing Equation (19.6) in a more explicit form, we obtain

$$
V_\theta = \int_\Omega \int_0^T R^G\big(X(t; \chi)\big)\,\mathrm{d}t\,P(\mathrm{d}\chi) = \int_{\mathcal{X}^N} \int_0^T R^G(x)\,p(x, t)\,\mathrm{d}t\,\mathrm{d}x, \tag{19.10}
$$

where $\Omega$ and $P$ denote, respectively, the underlying sample space and probability measure of the state process $X$, and $p(\cdot, t)$ is the corresponding marginal density at time $t$.∗ The notation $X(\cdot\,; \chi)$ refers to a particular sample path of the system, i.e., a specific system trajectory generated under the given system dynamics. To keep the notation simple, we omit the system policy $\pi_\theta$ in the above equations and implicitly assume that $X$ is controlled by $\pi_\theta$ according to Equation (19.5).
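The second equality in Equation (19.10) can be traced in a few steps. The following is a sketch under two assumptions that the text does not state explicitly: that $R^G(X(t))$ is integrable over $\Omega \times [0, T]$, so that the order of integration may be exchanged (Fubini's theorem), and that the marginal law of $X(t)$ admits the density $p(\cdot, t)$.

```latex
\begin{align*}
V_\theta
  &= \int_\Omega \int_0^T R^G\big(X(t;\chi)\big)\,\mathrm{d}t\,P(\mathrm{d}\chi) \\
  &= \int_0^T \mathbb{E}\big[R^G\big(X(t)\big)\big]\,\mathrm{d}t \\
  &= \int_0^T \int_{\mathcal{X}^N} R^G(x)\,p(x,t)\,\mathrm{d}x\,\mathrm{d}t
   = \int_{\mathcal{X}^N} \int_0^T R^G(x)\,p(x,t)\,\mathrm{d}t\,\mathrm{d}x .
\end{align*}
```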

Having the form of an expectation, the outer integral in Equation (19.10) can be approximated via Monte Carlo integration, i.e.

$$
\hat{V}_\theta^{\{S\}} = \frac{1}{S}\sum_{s=1}^{S} \int_0^T R^G\big(X(t; \chi^{\{s\}})\big)\,\mathrm{d}t, \tag{19.11}
$$

where $\chi^{\{s\}} \sim P$, and $S$ is a specified number of system rollouts. To evaluate this expression for a particular target network, the remaining integral can be replaced by a discrete-time approximation, which is explained in Chapter 21.
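A minimal sketch of the estimator in Equation (19.11) is given below. The time integral is replaced by a left Riemann sum, which is an assumption made here for illustration and not necessarily the discretization used in Chapter 21. The helper `brownian_rollout` is a hypothetical uncontrolled test system (pure diffusion started at the origin), chosen only because its value is known in closed form; the returned standard error is an addition that is not part of Equation (19.11).

```python
import numpy as np

def estimate_value(rollout, reward, S, dt, rng):
    """Monte Carlo value estimate: average over S rollouts of the
    time-discretized reward integral (left Riemann sum with step dt).

    rollout(rng) must return one sample path of shape (K + 1, N, d)
    generated with the same step size dt.
    """
    returns = np.empty(S)
    for s in range(S):
        traj = rollout(rng)
        rewards = np.array([reward(X) for X in traj[:-1]])
        returns[s] = rewards.sum() * dt               # approximates the integral over [0, T]
    stderr = returns.std(ddof=1) / np.sqrt(S)         # extra diagnostic, not in (19.11)
    return returns.mean(), stderr

def brownian_rollout(rng, N=5, d=1, T=1.0, dt=0.01):
    """Hypothetical test system: N independent Brownian motions from the origin."""
    K = int(round(T / dt))
    increments = rng.normal(scale=np.sqrt(dt), size=(K, N, d))
    start = np.zeros((1, N, d))
    return np.concatenate([start, np.cumsum(increments, axis=0)], axis=0)

def squared_distance_reward(X):
    """Hypothetical R^G: negative mean squared distance from the origin."""
    return -np.mean(np.sum(X ** 2, axis=-1))

rng = np.random.default_rng(0)
v_hat, stderr = estimate_value(brownian_rollout, squared_distance_reward,
                               S=500, dt=0.01, rng=rng)
```

For this test system the exact value is $-T^2/2 = -0.5$ for $T = 1$ and $d = 1$, so the estimate should land close to that number up to discretization and sampling error.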

∗Note that the value definition in Equation (19.10) is based on the global reward model described in Section 19.4.1. However, we can easily switch to a local value computation by choosing $R^G$ as the average of the agent-based rewards, i.e., by setting $R^G\big(X(t)\big) \triangleq \frac{1}{N}\sum_{n=1}^{N} R^L\big(\xi(X_n(t), X(t))\big)$, which

[Figure 19.2: graphical model over the variables $X_{n,t}$, $Y_{n,t}$, $u_{n,t}$, $r_{n,t}$, $r^G_t$, $r^L_t$ at time steps $t$ and $t+1$, with the mappings $\xi$, $\pi_\theta$, $R^L$, $R^G$.]

Figure 19.2: The continuous-time $N$-agent system as a discrete-time swarMDP whose transition dynamics factorize over agents. The local observation $Y_{n,t}$ of agent $n$ at time $t$ is determined by the observation model $\xi$, based on the local states $\{X_{m,t} : m \in \mathcal{N}_n\}$ of other agents in the neighborhood. The agent transitions to a new state $X_{n,t+1}$ by executing a control command $u_{n,t}$, dictated by the policy $\pi_\theta$. Note: the dashed arrow indicates dependencies between variables belonging to the same agent. At each time instant, the local and global rewards $r^L_t$ and $r^G_t$ are determined via the functions $R^L$ and $R^G$ according to the agents’ local observations and the global system state, respectively.
