Introducing sensory information - Distributed reinforcement learning for self-reconfiguring mod

In previous chapters, we assumed that the modules have no sensory information beyond the presence of neighbors in their immediate Moore neighborhood (the eight adjacent cells). As pointed out before, this limits the class of representable policies and excludes some useful behaviors. Here, we introduce a minimal sensing model.

6.2.1 Minimal sensing: spatial gradients

We assume that each module can measure some underlying spatially distributed function f, so that it knows at any time in which direction the largest gradient of f lies. We assume the sensing resolution to be either one in 4 (N, E, S or W) or one in 8 (N, NE, E, SE, S, SW, W or NW) possible directions. This sensory information adds a new dimension to the observation space of the module. There are now 2⁸ possible neighborhood observations × 5 possible gradient observations¹ × 9 possible actions = 11,520 parameters in the case of the lower resolution, and 2⁸ × 9 × 9 = 20,736 parameters in the case of the higher resolution. This is a substantial increase. The obvious advantage is that the broader representation applies to a wider range of problems and tasks. However, if we severely downsize the observation space, and thus limit the number of parameters to learn, learning should speed up significantly.
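The lower-resolution sensing model and the parameter counts above can be illustrated with a short sketch. It assumes that f is the negative Euclidean distance to a target cell (the text only requires some spatially distributed function). The names `sense_gradient` and `num_parameters` and the tolerance `eps` are hypothetical and do not come from the thesis.

```python
import math

# Offsets for the four coarse sensing directions (x grows east, y grows north).
DIRECTIONS_4 = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}


def f(cell, target):
    """Assumed field: negative Euclidean distance, maximal at the target."""
    return -math.dist(cell, target)


def sense_gradient(cell, target, eps=1e-9):
    """Return the direction of largest increase of f, or 'OPT' at a local optimum."""
    here = f(cell, target)
    best_dir, best_gain = "OPT", eps
    for name, (dx, dy) in DIRECTIONS_4.items():
        gain = f((cell[0] + dx, cell[1] + dy), target) - here
        if gain > best_gain:  # strict comparison: ties keep the earlier direction
            best_dir, best_gain = name, gain
    return best_dir


def num_parameters(gradient_observations, neighborhood_cells=8, actions=9):
    """Neighborhood observations x gradient observations x actions."""
    return 2 ** neighborhood_cells * gradient_observations * actions


print(num_parameters(5))  # 4 directions + local optimum: 11520
print(num_parameters(9))  # 8 directions + local optimum: 20736
```

With four directions the module reports one of five symbols (N, E, S, W or OPT), which is the factor of 5 in the lower-resolution count.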

6.2.2 A compact representation: minimal learning program

Throughout this thesis we have established the dimensions and parameters which influence the speed and reliability of reinforcement learning by policy search in the domain of lattice-based self-reconfigurable modular robots. In particular, we have found that the greatest influence is exerted by the number of parameters which represent the class of policies to be searched. The minimal learning approach is to reduce the number of policy parameters to the minimum necessary to represent a reasonable policy.

To achieve this goal, we incorporate a pre-processing step and a post-processing step into the learning algorithm. Each module still perceives its immediate Moore neighborhood. However, it now computes a discrete local measure of the weighted center of mass of its neighborhood, as depicted in figure 6-5. At the end of this pre-processing step, there are only 9 possible observations for each module, depending on which cell the neighborhood center of mass belongs to.
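A minimal sketch of this pre-processing step follows. The exact weighting used in the thesis is not given in this section, so the sketch takes an unweighted mean of the occupied neighbor offsets and uses a hypothetical `threshold` parameter to push the discretized result toward the outer cells. All names are illustrative.

```python
# Offsets of the eight Moore-neighborhood cells (x grows east, y grows north).
MOORE_OFFSETS = [(-1, 1), (0, 1), (1, 1), (-1, 0), (1, 0), (-1, -1), (0, -1), (1, -1)]


def center_of_mass_observation(occupied, threshold=0.25):
    """Map a collection of occupied neighbor offsets to one of 9 cells (ox, oy).

    `threshold` is a hypothetical parameter: a value below 0.5 biases the
    discretized center of mass toward the periphery of the 3x3 window.
    """
    if not occupied:
        return (0, 0)
    cx = sum(dx for dx, _ in occupied) / len(occupied)
    cy = sum(dy for _, dy in occupied) / len(occupied)

    def discretize(c):
        return 1 if c > threshold else -1 if c < -threshold else 0

    return (discretize(cx), discretize(cy))


def observation_index(obs):
    """Flatten (ox, oy) in {-1, 0, 1}^2 to an integer in 0..8."""
    return (obs[1] + 1) * 3 + (obs[0] + 1)


# Neighbors to the east and north-east pull the center of mass to the NE cell.
print(center_of_mass_observation([(1, 0), (1, 1)]))
# A fully surrounded module observes the central cell.
print(observation_index(center_of_mass_observation(MOORE_OFFSETS)))
```

Whatever weighting is chosen, the policy only ever sees the resulting index, so the observation space shrinks from 2⁸ neighborhood patterns to 9 values.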

In addition, instead of actions corresponding to direct movements into adjacent cells, we define the following set of actions. There are 9 actions, corresponding as before to the set of 8 discretized directions plus a NOP. However, in the minimalist learning framework, each action corresponds to a choice of general heading. The particular motion that will be executed by a module, given the local configuration and the chosen action, is determined by the post-processing step shown in figure 6-5.

The module first determines the set of legal motions given the local configuration. From that set, given the chosen action (heading), it determines the motion closest to this heading, up to a threshold. If no such motion exists, the module remains stationary. The threshold is set by the designer and in our experiments is equal to 1.1, which allows motions up to 2 cells different from the chosen heading.
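The post-processing step can be sketched as follows. The distance measure behind the threshold of 1.1 is not specified in this section, so the sketch instead counts 45-degree steps between headings and uses a hypothetical `max_steps=2` to reproduce the stated effect (motions up to 2 cells away from the chosen heading). The tie-breaking rule is also an assumption.

```python
# The eight headings in clockwise order; an index difference of 1 is a 45-degree step.
HEADINGS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
NOP = "NOP"


def ring_distance(a, b):
    """Number of 45-degree steps between two headings (0..4)."""
    d = abs(HEADINGS.index(a) - HEADINGS.index(b))
    return min(d, len(HEADINGS) - d)


def select_motion(action, legal_motions, max_steps=2):
    """Pick the legal motion closest to the chosen heading, or NOP if none qualifies."""
    if action == NOP:
        return NOP
    candidates = [m for m in legal_motions if ring_distance(action, m) <= max_steps]
    if not candidates:
        return NOP  # the module remains stationary
    # Assumed tie-break: prefer the heading that comes first in HEADINGS.
    return min(candidates, key=lambda m: (ring_distance(action, m), HEADINGS.index(m)))


print(select_motion("N", ["E", "SW"]))  # E is 2 steps from N, SW is 3
print(select_motion("N", ["S", "SE"]))  # nothing within 2 steps
```

Because the legality check sits outside the policy, the learner never has to represent which cell movements are physically possible in each local configuration.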

Differentiating between actions (headings) and motions (cell movements) allows the modules to essentially learn a discrete trajectory as a function of a simple measure of locally observed neighbors.

¹ Four directions of largest gradient plus the case of a local optimum (all gradients equal to a threshold).

Figure 6-5: Compressed learning representation and algorithm: pre-processing (a) compute local neighborhood’s weighted center of mass, biased towards the periphery, (b) discretize into one of 9 possible observations; GAPS (c) from observation, choose action according to the policy: each action represents one of 8 possible headings or a NOP; post-processing (d) select closest legal movement up to threshold given the local configuration of neighbors, (e) execute movement. Only the steps inside the dashed line are part of the actual learning algorithm.

6.2.3 Experiments: reaching for a random target

The modules learned to reach out to a target positioned randomly in the robot’s workspace (the reachable (x, y) space). The target (x₀, y₀) was also the sole maximum of f, which was a distance metric. The targets were placed randomly in the half-disk bounded by the line y = 0 and the circle centered at the middle of the robot with radius (N + M)/2, where N is the number of modules and M is the robot’s size along the y dimension.
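One way to draw such targets is rejection sampling, sketched below. The text says only that targets were placed randomly, so the uniform distribution, the choice of the y ≥ 0 side, the function name `sample_target` and the example value of M are all assumptions.

```python
import random


def sample_target(num_modules, size_y, center, rng=random):
    """Sample a point with y >= 0 inside the circle of radius (N + M) / 2 around `center`.

    Assumes the robot's center lies at y >= 0, so the accepted region is non-empty.
    """
    radius = (num_modules + size_y) / 2
    while True:
        # Propose a point in the bounding square, keep it if it is in the half-disk.
        dx = rng.uniform(-radius, radius)
        dy = rng.uniform(-radius, radius)
        inside_circle = dx * dx + dy * dy <= radius * radius
        if inside_circle and center[1] + dy >= 0:
            return (center[0] + dx, center[1] + dy)


# Example: 15 modules and a hypothetical robot height M = 3 give a radius of 9.
rng = random.Random(0)
print(sample_target(15, 3, center=(7.0, 0.0), rng=rng))
```

Passing a seeded `random.Random` instance makes the sequence of targets reproducible across learning trials.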

Fifteen modules learned to reach a random goal target using GAPS with the minimal representation and the minimal gradient sensing framework at the lower sensory resolution (4 possible directions or local optimum) of section 6.2.1. Figure 6-6 shows an execution sequence in which one of the modules reaches the target; the sequence was obtained by executing a policy learned after 10,000 episodes. Ten learning trials were performed, and each policy was then tested for 10 test trials. The average number of times a module reached the goal within 50 timesteps during the test trials for each policy was 7.1 out of 10 (standard deviation: 1.9).

We record two failure modes that contribute to this error rate. On the one hand, local observability and a stochastic policy mean that a module may sometimes move into a position that completes a box-type structure with no modules inside (as previously shown in figure 3-12b). When that happens, the conservative local rules we use to prevent robot disconnection also prevent any corner modules of such a structure from moving, and the robot is stuck. Each learned policy led the robot into this box-type stuck configuration a minimum of 0 and a maximum of 3 out of the 10 test runs (mean: 1.2, standard deviation: 1). We can eliminate this source of error by allowing a communication protocol to establish an alternative connection beyond the local observation window (Fitch & Butler (2007) have one such protocol). This leaves an average of 1.7 proper failures per 10 test trials (standard deviation: 1.6). Three out of 10 policies displayed no failures, and another two failed only once.

Figure 6-6: Reaching to a random goal position using sensory gradient information during learning and test phases. Policy execution captured every 5 frames.
