Summary - Distributed reinforcement learning for self-reconfiguring modular robots

We have formulated the locomotion problem for a SRMR as a multi-agent POMDP and applied gradient-ascent search in policy value space to solve it. Our results suggest that automating controller design by learning is a promising approach. We should, however, bear in mind the potential drawbacks of direct policy search as the learning technique of choice.

As with all hill-climbing methods, GAPS is guaranteed to converge to a local optimum in policy value space given infinite data, but convergence to the global optimum cannot be proven. A local optimum is the best solution we can find to a POMDP problem. Unfortunately, not all local optima correspond to reasonable locomotion gaits.

In addition, we have seen that GAPS takes on average a rather long time (measured in thousands of episodes) to learn. We have identified three key issues that contribute to the speed and quality of learning in stochastic gradient ascent algorithms such as GAPS, and we have established which robot parameters can contribute to the make-up of these three variables. In this chapter, we have already attempted, unsuccessfully, to address one of the issues, the number of policy parameters to learn, by introducing feature spaces. In the next two chapters, we explore the influence of robot size, search constraints, episode length, information sharing, and smarter policy representations on the speed and reliability of learning in SRMRs. The goal to keep in sight as we report the results of those experiments is to find the right mix of automation and easily available constraints and information that will help guide automated search for good distributed controllers.

Chapter 4 How Constraints and Information Affect Learning

We have established that stochastic gradient ascent in policy space works in principle for the task of locomotion by self-reconfiguration. In particular, if modules can somehow pool their experience together and average their rewards, provided that there are not too many of them, the learning algorithm will converge to a good policy.

However, we have also seen that even in this centralized factored case, increasing the size of the robot uncovers the algorithm’s susceptibility to local minima which do not correspond to acceptable policies. In general, local search will be plagued by these unless we can provide either a good starting point, or guidance in the form of constraints on the search space.

Modular robot designers are usually well placed to provide either a starting point or search constraints, as we can expect them to have some idea about what a reasonable policy, or at least parts of it, should look like. In this chapter, we examine how specifying such information or constraints affects learning by policy search.

4.1 Additional exploration constraints

In an effort to reduce the search space for gradient-based algorithms, we are looking for ways to give the learning modules some information that is easy for the designer to specify yet will be very helpful in narrowing the search. An obvious choice is to let the modules pre-select actions that can actually be executed in any one of the local configurations.

During the initial hundreds of episodes where the algorithm explores the policy space, the modules will attempt to execute undesirable or impossible actions which could lead to damage on a physical robot. Naturally, one may not have the luxury of thousands of trial runs on a physical robot anyway.

Each module will know which subset of actions it can safely execute given any local observation, and how these actions will affect its position; yet it will not know what new local configuration to expect when the associated motion is executed. Restricting search to legal actions is useful because it (1) effectively reduces the number of parameters that need to be learned and (2) causes the initial exploration phases to be more efficient because the robot will not waste its time trying out impossible actions. The second effect is probably more important than the first.

Figure 4-1: Determining if actions are legal for the purpose of constraining exploration: (a) A = {NOP}, (b) A = {NOP}, (c) A = {NOP}, (d) A = {2(NE), 7(W)}.

The following rules were used to pre-select the subset Aⁱ_t of actions possible for module i at time t, given the local configuration as the immediate Moore neighborhood (see also figure 4-1):

1. Aⁱ_t = {NOP}¹ if three or more neighbors are present at the face sites

2. Aⁱ_t = {NOP} if two neighbors are present at opposite face sites

3. Aⁱ_t = {NOP} if module i is the only “cornerstone” between two neighbors at adjacent face sites²

4. Aⁱ_t = {legal actions based on local neighbor configuration and the sliding cube model}

5. Aⁱ_t = Aⁱ_t − any action that would lead into an already occupied cell

These rules are applied in the above sequence and incorporated into the GAPS algorithm by setting the corresponding θ(o_t, a_t) to a large negative value, thereby making it extremely unlikely that actions not in Aⁱ_t would be randomly selected by the policy. Those parameters are not updated, thereby constraining the search at every time step.
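A minimal sketch of this masking step follows, assuming a lookup-table policy in which action probabilities are a Boltzmann (softmax) function of the parameters θ(o, a) for the current observation. The softmax form, the constant `MASK_VALUE`, the inverse temperature `beta`, the step size, and all function names are assumptions made for illustration; the gradient values would come from the GAPS update, which is not reproduced here.

```python
import math

MASK_VALUE = -1e9  # "large negative value"; the exact magnitude is an assumption


def mask_parameters(theta_row, legal):
    """Copy of theta(o, .) with parameters of illegal actions set to MASK_VALUE.

    theta_row: list of parameters, one per action index, for one observation.
    legal: set of action indices in the pre-selected action set.
    """
    return [v if a in legal else MASK_VALUE for a, v in enumerate(theta_row)]


def softmax(values, beta=1.0):
    """Boltzmann action probabilities (assumed policy form), computed stably."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def masked_update(theta_row, grad_row, legal, step=0.01):
    """Gradient-ascent step that leaves the masked parameters untouched."""
    return [
        v + step * g if a in legal else v
        for a, (v, g) in enumerate(zip(theta_row, grad_row))
    ]


# Usage: four actions, of which only actions 0 and 3 are legal here.
theta = mask_parameters([0.0, 0.5, -0.2, 1.0], legal={0, 3})
probs = softmax(theta)  # masked actions receive (numerically) zero probability
theta = masked_update(theta, [1.0, 1.0, 1.0, 1.0], legal={0, 3}, step=0.1)
```

Skipping the update for masked entries keeps them at the large negative value, so the restriction persists for the whole run rather than eroding as gradients accumulate.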

¹ NOP stands for ‘no operation’ and means the module’s action is to stay in its current location and not attempt any motion.

² This is a very conservative rule, which prevents disconnection of the robot locally.

We predict that the added structure and constraints on the search space introduced here will result in the modular robot finding good policies with less experience.
