Partial Observability - Learning Models of Behavior From Demonstration and Through Interaction

Chapter 22: Experimental Results

Figure 22.2: Results for the aggregation task in Section 22.1, based on 100 Monte Carlo runs. Left: Learned system controller. The orange circles and bars indicate the empirical mean values and standard deviations of the learned control parameters used for the RBF policy network. The blue curve represents the mean control policy defined by those parameters, again shown with the corresponding standard deviation. Positive control values let the executing agent drift toward larger angles; negative values cause a drift in the opposite direction. Right: Example system trajectory generated from a random, approximately uniform initial agent density using the mean control parameters from the left subfigure. The trajectory corresponds to a temporal rollout of the continuum density as illustrated in Figure 22.1. Red color indicates high density values.

Considering the system process from a global perspective, the optimization task can be effectively treated as a single-agent control problem (see Section 19.4.1). Nevertheless, the internal distributed character of the system remains, and the degree of local observability crucially affects the flow of information in the system. The goal of this section is to investigate how different observation modalities of the agents influence the learning process of the network. To this end, we re-run the experiment from Section 22.1 in a partially observable environment by restricting the interaction of the agents to a limited range $\epsilon \in (0, \pi]$, i.e.,

$$B_\epsilon(x) \triangleq \{ y \in \mathcal{X} : |\angle(x, y)| \le \epsilon \}.$$

Herein, $|\angle(x, y)|$ denotes the absolute angular distance between state $x$ and state $y$. Note that the maximum value of $\epsilon = \pi$ recovers the setting in Section 22.1.
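As a minimal sketch of the restricted interaction set, the following Python snippet computes the wrapped angular distance between agent states on the circle and the resulting neighborhood of each agent. The function names, the variable `eps` for the interaction range, and the example states are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def angular_distance(x, y):
    """Absolute angular distance between angles x and y, wrapped to [0, pi]."""
    d = np.abs(np.asarray(x) - np.asarray(y)) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def neighborhood_mask(states, eps):
    """Boolean matrix M with M[n, m] = True iff agent m is within range eps of agent n."""
    states = np.asarray(states, dtype=float)
    dist = angular_distance(states[:, None], states[None, :])
    return dist <= eps

# Hypothetical example: four agents on the circle, two of them close to the 0 / 2*pi seam.
states = np.array([0.1, 0.3, 3.0, 2 * np.pi - 0.1])
mask = neighborhood_mask(states, eps=0.5)
```

With an interaction range of pi, every agent lies in every other agent's neighborhood, which corresponds to the fully observable setting.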

As before, we use the global order parameter $r_G(t)$ to train the system but consider an extended simulation period of $T = 20$ to account for the increased difficulty of the task. The learning result is depicted by the solid orange line in Figure 22.3, which displays the final order parameter $r_G(T)$ of the trained system for different interaction ranges. The plot shows that the system performance quickly breaks down in small-range interaction regimes. This is also reflected in the right subfigure, which reveals the system’s inability to learn a functioning control policy for small interaction ranges. The obtained result is not surprising as, for small interaction ranges, the information content available to the agents is not sufficient to solve the aggregation task reliably: due to the limited observation range, it is impossible for the agents to decide locally in which direction to move in order to form a single coherent aggregation instead of creating multiple smaller ones.

Figure 22.3: Results for the partially observable setting in Section 22.2, based on 20 Monte Carlo runs. Left: Aggregation performance for different interaction ranges, measured in terms of the network’s final order parameter. Legend: global/local indicates the type of reward signal used during training; 1D refers to the sinusoidal observation model, 2D refers to the augmented model that allows the agents to additionally sense the relative agent mass in their vicinity. For large interaction ranges, the state synchronization is always successful, whereas for small interaction ranges, a functioning aggregation strategy is only found if the network is trained with global reward information. Right: Standard deviation intervals of the learned control policies (centered around their mean values) for the 1D global setting. The colored areas are analogous to the blue area shown in the left part of Figure 22.2, but correspond to different interaction ranges, as marked by the black crosses in the left subfigure. For small interaction ranges, the system is unable to learn a functioning aggregation policy, as reflected by the corresponding order parameters shown on the left. For an interaction range of $\pi$, the result from Figure 22.2 is recovered.

An interesting question is, therefore, whether the agents can develop a functioning strategy when they are provided with additional information. To investigate this, we equip the agents with extended observation capabilities that allow them to determine the relative agent number in their vicinity, giving rise to the following two-dimensional local observation,

$$Y_n(t) \triangleq \begin{bmatrix} \dfrac{1}{N_n} \displaystyle\sum_{m \in \mathcal{N}_n} \sin\big( X_m(t) - X_n(t) \big) \\[2ex] \dfrac{N_n}{N} \end{bmatrix}. \tag{22.1}$$
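The two-dimensional observation in Equation (22.1) can be sketched in a few lines of NumPy. The function name `local_observations` and the variable `eps` for the interaction range are hypothetical, and the sketch assumes that each agent counts itself as a member of its own neighborhood (so that the neighborhood size is never zero); the text does not state this explicitly.

```python
import numpy as np

def local_observations(states, eps):
    """Observation of Equation (22.1) for every agent, as an array of shape (N, 2).

    Column 0: mean sine of the state differences to all neighbors.
    Column 1: relative neighborhood size N_n / N.
    """
    states = np.asarray(states, dtype=float)
    N = states.size
    diff = states[None, :] - states[:, None]                 # diff[n, m] = X_m - X_n
    wrapped = np.abs((diff + np.pi) % (2 * np.pi) - np.pi)   # angular distance in [0, pi]
    neighbors = wrapped <= eps                               # assumption: includes agent n itself
    N_n = neighbors.sum(axis=1)
    mean_sin = (np.sin(diff) * neighbors).sum(axis=1) / N_n
    return np.column_stack([mean_sin, N_n / N])

# Hypothetical example: two nearby agents and one isolated agent.
Y = local_observations([0.0, 0.2, 3.0], eps=0.5)
```

In this example, the isolated agent observes a mean sine of zero and a relative neighborhood size of 1/3, so the second observation dimension distinguishes it from the agents in the larger group even though the first dimension alone may not.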

While the resulting observation model is not directly consistent with Equation (19.2), the above control input $Y_n(t)$ can be computed locally by agent $n$ from its received observational features $\xi\big(X_n(t), X(t)\big)$ if we assume that the agent is aware of the total network size $N$ (Remark 6). The latter assumption is fulfilled if we treat $N$ as an observable environmental feature (Remark 4), i.e.,

$$g(x, y) \triangleq \big[ \sin(y - x),\; N \big]^\top. \tag{22.2}$$

Figure 22.4: Learning progress in the partially observable environment (Section 22.2) when the system is trained with global reward feedback, using the two-dimensional observation model in Equation (22.1) and an interaction range of $\pi/10$. The setting is indicated by the black square in Figure 22.3. Red color indicates high continuum densities. The controller learns an aggregation strategy that exploits the local agent mass to accumulate all agents by assigning different drift velocities.

Since the resulting observation space is now two-dimensional, i.e., $\mathcal{Y} = [-1, 1] \times [0, 1]$, we use a bivariate RBF network to parametrize the system policy, i.e.,

$$\pi_\theta(Y) \triangleq \sum_{c=1}^{C} \theta_c \cdot \prod_{k=1}^{2} \exp\left\{ -\left( \frac{Y_k - q_{c,k}}{\gamma_k} \right)^{2} \right\}.$$

Note that the subscript $k$ here indicates the observation dimension and not the agent index. As before, we place the modes $\{q_c\}$ of the basis functions on a regular grid on the observation space by discretizing the first dimension into 10 center positions and the second dimension into 4 positions, resulting in a total of $C = 40$ control parameters to be learned.
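The bivariate RBF policy described above can be sketched as follows. The grid of 10 x 4 centers on [-1, 1] x [0, 1] follows the text; placing the outermost centers on the boundary of the observation space and setting the bandwidths to one grid spacing per dimension are assumptions made for illustration, as are all names in the snippet.

```python
import numpy as np

# Regular grid of basis-function centers: 10 positions along the first
# observation dimension and 4 along the second, i.e. C = 40 centers.
q1 = np.linspace(-1.0, 1.0, 10)
q2 = np.linspace(0.0, 1.0, 4)
centers = np.array([(a, b) for a in q1 for b in q2])    # shape (40, 2)

# Assumed bandwidths: one grid spacing per dimension (not specified in the text).
gamma = np.array([q1[1] - q1[0], q2[1] - q2[0]])

def rbf_policy(Y, theta, centers=centers, gamma=gamma):
    """Evaluate sum_c theta_c * prod_k exp(-((Y_k - q_ck) / gamma_k)^2) for a batch of observations."""
    Y = np.atleast_2d(Y)                                 # shape (B, 2)
    z = (Y[:, None, :] - centers[None, :, :]) / gamma    # shape (B, C, 2)
    features = np.exp(-(z ** 2).sum(axis=2))             # product of exponentials = exponential of the sum
    return features @ theta                              # shape (B,)

# Hypothetical usage: a parameter vector that activates a single basis function.
theta = np.zeros(len(centers))
theta[7] = 1.0
control = rbf_policy(centers[7], theta)
```

Because the policy is linear in the parameters, the 40 weights are the only quantities to be learned once the centers and bandwidths are fixed.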

By exploiting the additional state information, the agents are now able to solve the aggregation task for arbitrary interaction ranges, as indicated by the solid blue line in Figure 22.3. It is particularly interesting to inspect more closely the learning progress and the final aggregation strategy found by the algorithm, illustrated in Figure 22.4. At first, the policy performs no better than the one trained on the one-dimensional observation model: the controller only manages to aggregate the agents locally. However, after a few iterations, the controller learns to exploit the additional state information contained in the neighborhood size: it assigns different drift velocities (and directions) to agent constellations of different sizes to form a rotating group of agents that eventually absorbs all smaller agent aggregations. This strategy is then optimized toward the end of the learning period. It is worth mentioning that a similar behavior was discovered by Hamann [Ham14] in the context of a very different learning problem, where the goal of the agents was to predict their local observations one step ahead.

As a final variant of the experiment, we replace the global reward signal $r_G(t)$ with the fused signal $r_L(t)$ (Equation 19.9), assuming that the central critic system has no access to the global system state. For this purpose, we let each agent $n$ compute its own “local order parameter”, i.e.,

$$r_n(t) \triangleq \left| \frac{1}{N_n} \sum_{m \in \mathcal{N}_n} \exp\big\{ i \big( X_m(t) - X_n(t) \big) \big\} \right|,$$

which measures the local alignment of the agent relative to its neighbors. Accordingly, we extend the interaction potential $g$ by a dimension of the form $\exp\{i(y - x)\}$ to provide the agents with the necessary state information. Note that, for the maximum interaction range of $\pi$, the locally computed reward $r_L(t)$ reported to the critic coincides with the global signal $r_G(t)$ that we used in the centralized setting with known global system state (Section 22.1). With the entire reward information being computed locally at the agent level, we compare the resulting system performance to the centralized setting, again by measuring the final order parameter of the system (dashed lines in Figure 22.3). The results reveal that, for small interaction ranges, the aggregation now fails again, even for the augmented observation model in Equation (22.1) that additionally captures the relative neighborhood size. This time, however, the reason for the failure is different and can be traced back to the learning phase: the local state information provided to the agents is, in fact, sufficient to solve the problem, as we have seen just before. The problem is rather that the critic cannot tell a locally aggregated system state from a globally aggregated one based on the reported feedback signal because both system states result in similar reward signatures when measured locally. Consequently, the learning algorithm is unable to develop a functioning strategy that favors one of the two system states.
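The ambiguity faced by the critic can be illustrated with a small numerical sketch. Several ingredients here are assumptions, since neither $r_G(t)$ nor Equation (19.9) is reproduced in this section: the global order parameter is taken to be the standard Kuramoto form (modulus of the mean phasor), the local order parameter uses the modulus so that each reported reward is real-valued, the local values are fused by a plain average, and each agent counts itself as a neighbor. All function and variable names are hypothetical.

```python
import numpy as np

def global_order_parameter(states):
    """Modulus of the mean phasor over all agents (assumed form of the global signal)."""
    return np.abs(np.exp(1j * np.asarray(states, dtype=float)).mean())

def local_order_parameters(states, eps):
    """Local order parameter of every agent, computed over its neighbors within range eps."""
    states = np.asarray(states, dtype=float)
    diff = states[None, :] - states[:, None]                 # diff[n, m] = X_m - X_n
    wrapped = np.abs((diff + np.pi) % (2 * np.pi) - np.pi)   # angular distance in [0, pi]
    neighbors = wrapped <= eps                               # assumption: includes agent n itself
    phasors = np.exp(1j * diff) * neighbors
    return np.abs(phasors.sum(axis=1) / neighbors.sum(axis=1))

# One global cluster versus two clusters on opposite sides of the circle.
one_cluster = np.full(100, 1.0)
two_clusters = np.concatenate([np.full(50, 0.0), np.full(50, np.pi)])

eps = np.pi / 10
fused_one = local_order_parameters(one_cluster, eps).mean()   # assumed fusion: plain average
fused_two = local_order_parameters(two_clusters, eps).mean()
global_one = global_order_parameter(one_cluster)
global_two = global_order_parameter(two_clusters)
```

Under these assumptions, the fused local signal is close to 1 for both configurations, whereas the global order parameter is close to 1 only for the single cluster and close to 0 for the two-cluster state. With an interaction range of pi, the local values coincide with the global one.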

The result gives rise to the following interesting conclusion: while the local state information of the agents can be sufficient for executing a certain task, it might not be sufficient for learning the task in the first place. Yet, the example also demonstrates that a global system goal may still be achievable through local interaction if the behavior is learned under global supervision. This underpins the idea of guided learning presented in [HŠN17] and motivates the use of a centralized feedback architecture to acquire new skills for a task whose execution can ultimately be conducted in a decentralized manner. Note, however, that such a learning paradigm not only requires the existence of a central critic system but also relies on a bidirectional communication channel between the critic and the agents, which is often only available in a simulated environment. Hence, if the learning phase is to be conducted on the real system, a decentralized architecture might be the only viable option (see Discussion and Outlook).
