Reward Update - Learning Models of Behavior From Demonstration and Through Interaction

Note, however, that the local estimators in Equation (15.6) are correlated since the agents are coupled through the system process. Yet, due to the specific coupling structure of a swarm caused by the locality property (Chapter 14), correlation is introduced only locally, meaning that the coupling between any two agents drops when their topological distance increases. We analyze this phenomenon for the Vicsek model [Vic+95] in Section 16.1.

15.3 Reward Update

The last step of Algorithm 1 consists of updating the estimated agent reward function. Depending on the single-agent IRL framework in use, this involves an algorithm-specific optimization procedure, e.g., given in the form of a quadratic program [AN04; NR00] or a gradient-based optimization [NS07; Zie+08]. For our experiments in Chapter 16, we adopt the max-margin principle presented in [AN04]; however, the estimation procedure can be replaced with any other value-based IRL method, as explained at the beginning of this chapter.

Following the max-margin approach, the local reward function is represented as a linear combination of observational features, i.e., $R(o) = w^\top \phi(o)$, with weights $w \in \mathbb{R}^d$ and a given feature function $\phi : O \to \mathbb{R}^d$. The feature weights after the $i$th iteration of Algorithm 1 are then obtained as

$$w^{(i+1)} = \arg\max_{w : \|w\|_2 \le 1} \; \min_{j \in \{1, \ldots, i\}} w^\top \left( \mu_E - \mu^{(j)} \right),$$

where $\mu_E$ and $\{\mu^{(j)}\}_{j=1}^{i}$ are, respectively, the feature expectations [AN04] of the expert policy and those of the learned policies up to iteration $i$. Simulating a one-shot learning scenario, we estimate these quantities from a single system trajectory according to Equation (15.7), i.e.,

$$\hat{\mu}(\pi) \triangleq \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^t \phi\left( \xi^{(n)}(s_t) \right),$$

for some sufficiently large $T$, where the state sequence $(s_0, s_1, \ldots, s_T)$ is generated
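The feature-expectation estimate and the inner minimum of the max-margin objective can be illustrated with a minimal sketch. It assumes that the observations of all agents are already available as a nested list `observations[t][n]`. The function names and the one-hot feature map in the usage note are hypothetical and not taken from the original work. The outer maximization over $w$ (a quadratic program in [AN04]) is deliberately left out.

```python
from typing import Callable, List, Sequence


def feature_expectations(
    observations: Sequence[Sequence[int]],
    phi: Callable[[int], List[float]],
    gamma: float,
) -> List[float]:
    """Discounted feature sum, averaged over agents.

    observations[t][n] is the observation of agent n at time t
    (assumed data layout, for illustration only).
    """
    num_agents = len(observations[0])
    dim = len(phi(observations[0][0]))
    mu = [0.0] * dim
    for t, obs_t in enumerate(observations):
        discount = gamma ** t
        for o in obs_t:
            features = phi(o)
            for k in range(dim):
                mu[k] += discount * features[k]
    return [m / num_agents for m in mu]


def margin(w: Sequence[float],
           mu_expert: Sequence[float],
           mu_learned: Sequence[Sequence[float]]) -> float:
    """Inner minimum of the max-margin objective for a fixed weight vector w."""
    return min(
        sum(wk * (e - m) for wk, e, m in zip(w, mu_expert, mu_j))
        for mu_j in mu_learned
    )
```

As a usage example, with a two-dimensional one-hot feature map `phi = lambda o: [1.0 if k == o else 0.0 for k in range(2)]`, the call `feature_expectations([[0, 1], [1, 1]], phi, 0.5)` averages the discounted feature counts of two agents over two time steps.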

Chapter 16: Experimental Results

16 Experimental Results

In this chapter, we provide experimental results for two different system types. For the heterogeneous learning scheme (Algorithm 2) used in the policy update step of Algorithm 1, the initial number of exploring agents is set to 50% of the population size and the learning rate is initialized close to 1. Both quantities are controlled by a quadratic decay, which ensures that, at the end of the learning period (i.e., after 200 iterations), the learning rate reaches zero and there are no exploring agents left. Note that these parameters are by no means optimized; yet, in our experiments we could observe that the learning results are largely insensitive to the particular choice of parameter values. Since the agents’ observation space is one-dimensional in both experiments, we use a simple tabular representation for the learned Q-function. For higher-dimensional problems, we must resort to function approximation [LP03].
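The text does not state the exact form of the quadratic decay. As a rough sketch, one schedule that starts at the initial value and reaches zero after 200 iterations is shown below. The functional form $(1 - i/200)^2$, the function names, and the rounding of the agent count are assumptions made for illustration only.

```python
def quadratic_decay(initial: float, iteration: int, total_iterations: int = 200) -> float:
    """Assumed schedule: initial * (1 - i / I)^2, which is zero at i = I."""
    remaining = max(0.0, 1.0 - iteration / total_iterations)
    return initial * remaining ** 2


def num_exploring_agents(population_size: int, iteration: int,
                         total_iterations: int = 200) -> int:
    """Exploring agents start at 50% of the population and decay to zero."""
    return round(quadratic_decay(0.5 * population_size, iteration, total_iterations))
```

Under this assumed schedule, half of the learning period leaves a quarter of the initial learning rate and a quarter of the initial number of exploring agents.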

16.1 The Vicsek Model

First, we test our framework on the Vicsek model of self-propelled particles [Vic+95]. The model consists of a fixed number of particles, or agents, living in the unit square $[0, 1] \times [0, 1]$ with periodic boundary conditions. Each agent $n$ moves with a constant absolute velocity $v$ and is characterized by its location $x_t^{(n)}$ and orientation $\theta_t^{(n)}$ in the plane, as summarized by the local state variable $s_t^{(n)} \triangleq (x_t^{(n)}, \theta_t^{(n)})$. The time-varying neighborhood structure of the agents is determined by a fixed interaction radius $\rho$. At each time instant, the agents’ orientations get synchronously updated to the average orientation of their neighbors (including themselves) plus some additive random perturbations $\{\Delta\theta_t^{(n)}\}$, i.e.,

$$\begin{aligned}
\theta_{t+1}^{(n)} &= \langle \theta_t^{(n)} \rangle_\rho + \Delta\theta_t^{(n)}, \\
x_{t+1}^{(n)} &= x_t^{(n)} + v_t^{(n)}.
\end{aligned} \tag{16.1}$$

Herein, $\langle \theta_t^{(n)} \rangle_\rho$ denotes the mean orientation of all agents within the $\rho$-neighborhood of agent $n$ at time $t$, and $v_t^{(n)} \triangleq v \cdot [\cos\theta_t^{(n)}, \sin\theta_t^{(n)}]$ is the velocity vector of agent $n$.
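One synchronous update of Equation (16.1) can be sketched as follows. The sketch makes two assumptions that the text does not spell out: the mean orientation is computed as a circular mean (via `atan2` of the summed sines and cosines), and neighborhoods are determined with wrap-around distances on the unit square. All function names are hypothetical.

```python
import math
import random


def periodic_delta(a: float, b: float) -> float:
    """Shortest signed displacement from b to a on a unit-length periodic axis."""
    d = (a - b) % 1.0
    return d - 1.0 if d > 0.5 else d


def mean_orientations(x, theta, rho):
    """Circular mean orientation within radius rho (each agent includes itself)."""
    means = []
    for xn, yn in x:
        s = c = 0.0
        for (xm, ym), th in zip(x, theta):
            dx = periodic_delta(xm, xn)
            dy = periodic_delta(ym, yn)
            if dx * dx + dy * dy <= rho * rho:
                s += math.sin(th)
                c += math.cos(th)
        means.append(math.atan2(s, c))
    return means


def vicsek_step(x, theta, v, rho, noise_std, rng):
    """One synchronous update: positions use the orientations at time t."""
    means = mean_orientations(x, theta, rho)
    new_theta = [(m + rng.gauss(0.0, noise_std)) % (2 * math.pi) for m in means]
    new_x = [
        ((xn + v * math.cos(th)) % 1.0, (yn + v * math.sin(th)) % 1.0)
        for (xn, yn), th in zip(x, theta)
    ]
    return new_x, new_theta
```

A trajectory is obtained by calling `vicsek_step` repeatedly with a shared generator such as `random.Random(0)`. The noise standard deviation is given in radians.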

[Figure 16.1, panels (a) ρ = 0.125 and (b) ρ = 0.05; horizontal axis: topological distance (1, 2, ..., unconnected); vertical axis: uncertainty coefficient (0 to 0.3).]

Figure 16.1: Uncertainty coefficient as a function of the topological distance between two agents in the Vicsek system. The curves show the coefficient values at different simulation time points for two interaction radii. The results were obtained from a kernel density estimate of the joint distribution of two agents’ orientations based on 10000 Monte Carlo runs.

Our goal is to learn a policy model from recorded agent trajectories that reproduces the system behavior in Equation (16.1) using the proposed IRL framework. As a simple observation mechanism, we let the agents in the system compute the angular distance to the average orientation of their neighbors, i.e., $o_t^{(n)} = \xi^{(n)}(s_t) \triangleq \langle \theta_t^{(n)} \rangle_\rho - \theta_t^{(n)}$, giving the agents the ability to monitor their local misalignment. For simplicity, we discretize the observation space $[0, 2\pi)$ into 36 equally-sized intervals (visible in Figure 16.2) that build the feature representation $\phi$ (Section 15.3). Furthermore, we coarse-grain the space of possible direction changes to $\{-60^\circ, -50^\circ, \ldots, 60^\circ\}$, resulting in a total of 13 actions available to the agents. For the experiment, we use a system size of $N = 200$, an interaction radius of $\rho = 0.1$ (if not stated otherwise), an absolute velocity of $v = 0.1$, a discount factor of $\gamma = 0.9$, and a zero-mean Gaussian noise model for $\{\Delta\theta_t^{(n)}\}$ with a standard deviation of $10^\circ$. These parameter values are chosen such that the expert system operates in an ordered phase [Vic+95].
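The observation mechanism and its discretization can be sketched as below. The sketch assumes that the angular distance is wrapped into $[0, 2\pi)$ with a modulo operation and that the feature representation is a one-hot indicator over the 36 intervals. Both choices are consistent with the description but are not confirmed in detail by the text, and the names are hypothetical.

```python
import math

NUM_BINS = 36                              # 36 equally-sized intervals of [0, 2*pi)
ACTIONS_DEG = list(range(-60, 61, 10))     # direction changes: -60, -50, ..., 60 degrees


def observe(mean_theta: float, theta: float) -> float:
    """Angular distance to the neighborhood mean, wrapped into [0, 2*pi)."""
    return (mean_theta - theta) % (2 * math.pi)


def bin_index(o: float) -> int:
    """Index of the interval that contains observation o."""
    width = 2 * math.pi / NUM_BINS
    return min(int(o / width), NUM_BINS - 1)   # guard against rounding at the upper edge


def phi(o: float) -> list:
    """One-hot feature vector over the observation intervals (assumed form)."""
    features = [0.0] * NUM_BINS
    features[bin_index(o)] = 1.0
    return features
```

With this representation, the weight vector $w$ has one entry per interval, so the learned reward can be read off directly as a function of the local misalignment.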

16.1.1 Local Coupling and Redundancy

In Section 15.2.1, we claimed that the dependence of any two agents in a swarm system declines with increasing distance between those agents, due to the local coupling structure of the swarm (Chapter 14). In this section, we substantiate our claim by analyzing the coupling strength in the system as a function of the agents’ topological distance. As a measure of (in-)dependence, we employ the uncertainty coefficient [Pre+07], a normalized version of the mutual information shared between two agents, to quantify the amount of information we can predict about one agent’s orientation by observing that of the other. As opposed to Pearson’s correlation coefficient, this quantity is able


to capture nonlinear dependencies, and hence, it is more meaningful in the context of the Vicsek model, whose state dynamics are inherently nonlinear (Equation 16.1). Figure 16.1 depicts the result of our analysis, which nicely reveals the spatio-temporal flow of information in the system. It confirms that the information exchange between the agents strongly depends on the strength of their coupling, which is determined by (i) their topological distance and (ii) the average number of connecting links (seen from the fact that, for a fixed topological distance, the dependence grows with the interaction radius ρ). We further notice that, for increasing radii, the level of information exchange increases even for agents that are temporarily not connected with each other through the system, due to the higher chance of having been connected at some earlier stage.
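To make the dependence measure concrete, the following sketch computes an uncertainty coefficient $U(X \mid Y) = I(X; Y) / H(X)$ from paired discrete samples. It is a simplified plug-in estimator on already discretized values and not the kernel density estimate used for Figure 16.1. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts, total: int) -> float:
    """Shannon entropy (in nats) of an empirical distribution given by counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def uncertainty_coefficient(xs, ys) -> float:
    """Plug-in estimate of U(X|Y) = I(X;Y) / H(X) from paired discrete samples."""
    n = len(xs)
    h_x = entropy(Counter(xs).values(), n)
    h_y = entropy(Counter(ys).values(), n)
    h_xy = entropy(Counter(zip(xs, ys)).values(), n)
    if h_x == 0.0:
        return 0.0   # X is constant, so there is nothing to predict
    return (h_x + h_y - h_xy) / h_x
```

The coefficient is 1 when one variable determines the other and 0 when the empirical joint distribution factorizes. To apply it to orientations, the angles would first have to be binned, for example into the same 36 intervals used for the observations.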

16.1.2 Learning Results

A fundamental problem inherent to any IRL approach is the assessment of the extracted reward function, because there is generally no unique solution to the estimation problem (Section 2.2.2). Nonetheless, there are several ways to verify if the estimated reward model is plausible in the context of the observed task.

The simplest way to check the plausibility of the learned model is by subjective inspection: since a system’s reward function can be regarded as a concise description of the performed task, the found estimate should provide a sufficiently intuitive explanation for the observed behavior. As we can see from Figure 16.2, this is clearly the case for the obtained estimation result. Even though there exists no “true” reward model for the Vicsek system, we can tell from the system dynamics in Equation (16.1) that the agents tend to align over time, a behavior that can be induced by providing higher rewards for synchronized states and giving lower (or negative) rewards for local misalignment. Inspecting the induced system dynamics in Figure 16.3, we observe that the learned reward model indeed reproduces the behavior of the expert system, both during the transient phase and at stationarity. Note that the absolute direction of motion is irrelevant since the Vicsek dynamics are invariant to rotations of the coordinate system. To quantify the accuracy of the behavior reconstruction, we compare both policies in terms of the achieved order parameter [Vic+95], which provides a measure for the total alignment of the swarm, i.e.,

$$\omega_t \triangleq \frac{1}{N v} \left\| \sum_{n=1}^{N} v_t^{(n)} \right\| \in [0, 1],$$
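The order parameter can be computed directly from the agents' orientations, since every velocity vector has the same length $v$ and this factor cancels against the normalization. The following is a minimal sketch with a hypothetical function name.

```python
import math


def order_parameter(thetas) -> float:
    """Norm of the summed unit heading vectors, divided by the number of agents."""
    n = len(thetas)
    sum_x = sum(math.cos(t) for t in thetas)
    sum_y = sum(math.sin(t) for t in thetas)
    return math.hypot(sum_x, sum_y) / n
```

A perfectly aligned swarm yields a value of 1, while headings that cancel each other out yield a value close to 0.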
