
Discussion

10.2. FROM SUPERVISED TO REINFORCEMENT LEARNING

do not wish to lose this efficient reuse of samples. Gradient methods are one way to potentially avoid building trees. As a heuristic, these methods could be used with a measure μ in an attempt to alleviate their exploration-related problems (see chapter 4). However, it is not clear how they hang on to this idea of efficient reuse (see 6.4.3). Here, we have argued that the μ-PolicySearch and CPI algorithms are natural methods which both efficiently reuse samples and optimize with respect to a measure μ. Let us now discuss what in particular allowed these algorithms to achieve both of these goals.

10.2.2. Successful Policy Update Rules. Both the μ-PolicySearch and CPI algorithms use PolicyChooser subroutines which attempt to return decision rules π' ∈ Π which choose actions with “large” advantages with respect to the current policy π, and the notion of “large” is an average one based on the measure μ. These PolicyChooser algorithms efficiently reuse samples to find a “good” π' in a manner similar to the trajectory tree method (except now the measure μ is used in lieu of building the tree, see subsection 6.3.3). A central theme in part 2 was the construction of policy update rules which drive the advantages to be small with respect to a measure μ using the output decision rules of the PolicyChooser.
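As a rough illustration of what a PolicyChooser is asked to do, the following minimal sketch picks, from a small finite class of decision rules, the rule with the largest average advantage over states sampled from μ. It is not the procedure analyzed in the text: the function names, the toy advantage function, and the exhaustive search over a three-element class are all assumptions made for illustration, and the advantage is taken as given rather than estimated from samples.

```python
import random

def average_advantage(rule, states, advantage):
    """Empirical mean of advantage(s, rule(s)) over the sampled states."""
    return sum(advantage(s, rule(s)) for s in states) / len(states)

def policy_chooser(candidate_rules, states, advantage):
    """Return the candidate rule with the largest average advantage (hypothetical helper)."""
    return max(candidate_rules,
               key=lambda rule: average_advantage(rule, states, advantage))

# Toy setup (all hypothetical): states are integers 0..3, actions are 0 or 1,
# and the advantage over the current policy is +1 when the action matches the
# parity of the state and -1 otherwise.
def toy_advantage(s, a):
    return 1.0 if a == s % 2 else -1.0

def always_0(s):
    return 0

def always_1(s):
    return 1

def parity(s):
    return s % 2

rng = random.Random(0)
mu_samples = [rng.randrange(4) for _ in range(200)]  # stands in for draws from mu
chosen = policy_chooser([always_0, always_1, parity], mu_samples, toy_advantage)
```

Note that the same set of sampled states is reused to score every candidate rule, and that “large” here is an average under μ, not a maximum over states.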

The ingredients for successful updates for both μ-PolicySearch and CPI are twofold. First, both algorithms make “small” policy changes. Second, both algorithms are variants of policy iteration. This means that each subsequent decision rule attempts to choose better actions by taking into account the advantages of the current policy.

In μ-PolicySearch, the small policy change is implemented by only altering the policy at one decision epoch at a time, starting from time T − 1 and working down to time 0. The policy iteration nature of the algorithm forces the PolicyChooser to construct the decision rule π(·, t) by taking into account the remaining sequence of decision rules π(·, t+1), . . . , π(·, T − 1). This allows max norm error bounds to be avoided (such as in the regression version of non-stationary approximate policy iteration, see section 5.3). The final policy returned by μ-PolicySearch is both deterministic and non-stationary (assuming that Π is a class of deterministic decision rules).
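The backward-in-time structure described above can be sketched as follows. This is only an illustration under strong assumptions: a deterministic generative model `step(s, a)` (hypothetical), a tiny finite class of decision rules searched exhaustively, and a list `mu_samples[t]` standing in for states drawn from μ at time t. With a stochastic model, one rollout per state would give only a noisy estimate.

```python
def rollout_value(step, s, a, t, rules, T):
    """Total reward from taking a in s at time t, then following rules[t+1..T-1]."""
    total = 0.0
    for u in range(t, T):
        s, r = step(s, a)
        total += r
        if u + 1 < T:
            a = rules[u + 1](s)  # later rules were already fixed by earlier iterations
    return total

def mu_policy_search(step, candidate_rules, mu_samples, T):
    """Fix one decision rule per time step, from T - 1 down to 0."""
    rules = [None] * T
    for t in range(T - 1, -1, -1):
        def score(rule):
            states = mu_samples[t]
            return sum(rollout_value(step, s, rule(s), t, rules, T)
                       for s in states) / len(states)
        rules[t] = max(candidate_rules, key=score)
    return rules

# Toy model (hypothetical): two states, reward 1 when the action equals the
# state, and the state flips after every step.
def toy_step(s, a):
    return 1 - s, (1.0 if a == s else 0.0)

def always_0(s):
    return 0

def always_1(s):
    return 1

def identity(s):
    return s

T = 3
mu_samples = [[0, 1] for _ in range(T)]
rules = mu_policy_search(toy_step, [always_0, always_1, identity], mu_samples, T)
```

The returned policy is a list of T deterministic rules, one per time step, which matches the deterministic, non-stationary form described in the text.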

In contrast, CPI returns a good stationary policy. The natural update rule implemented by CPI just mixes the new policy with the old policy using some mixing parameter α (see equation 7.2.1). Unlike μ-PolicySearch, which halts after T updates, the behavior of this update rule was much harder to understand, and we had to think more carefully about when to halt CPI and how to set α.
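A minimal sketch of such a mixing update, assuming it takes the convex-combination form (1 − α)π + απ' at every state, is given below. The dictionary representation of policies and the function name are assumptions made for illustration.

```python
def mix_policies(pi, pi_new, alpha):
    """Return the stationary stochastic policy (1 - alpha) * pi + alpha * pi_new.

    Policies are dicts mapping each state to a dict of action probabilities
    (a hypothetical representation chosen for this sketch).
    """
    mixed = {}
    for s in pi:
        actions = set(pi[s]) | set(pi_new[s])
        mixed[s] = {
            a: (1 - alpha) * pi[s].get(a, 0.0) + alpha * pi_new[s].get(a, 0.0)
            for a in actions
        }
    return mixed

# Toy example: the old policy always moves left, the new rule always moves right.
pi = {"s0": {"left": 1.0}}
pi_new = {"s0": {"right": 1.0}}
mixed = mix_policies(pi, pi_new, alpha=0.25)
```

Because the mixture is formed state by state, the result is still a single stationary policy, although it is stochastic even when both inputs are deterministic.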

Note that both algorithms are using all their previous decision rules: μ-PolicySearch is executing all the decision rules in backward order of construction, while CPI is mixing between all its decision rules (in order to preserve stationarity).


Interestingly, we have only presented μ-based algorithms (with polynomial T dependence) which output either stochastic, stationary policies or deterministic, non-stationary policies. It is not clear how to present an algorithm which has a similar μ-based guarantee (with respect to the advantages) and that outputs a deterministic and stationary policy.

10.2.3. Query Learning. A fruitful direction to consider is query (or active) learning (as in Angluin [1987]). The typical setting is one in which the learner is permitted to actively query the instances over an input space in order to reduce generalization error with respect to a fixed distribution D. This setting has been shown to help reduce the generalization error with respect to D in a variety of problems. Ideally, we would like to consider an algorithm which is not tied to using a single measure D and which perhaps tries to efficiently and robustly reduce the error with respect to multiple measures. In the extreme, though, this leads back to dealing with the max norm error or the dependence. This direction for future work might provide a more general means to tackle the exploration problem rather than using a fixed distribution μ.

10.3. POMDPs

We have so far tended to avoid issues of planning and exploration in partially observable Markov decision processes (POMDPs).

10.3.1. Planning. Although the computational complexity of exact planning in MDPs and POMDPs is different (see Littman [1996]), there is a close connection between approximate planning in MDPs and POMDPs. Intuitively, the reason is that using only partial information for approximate planning in an MDP can often be viewed as working in a POMDP framework. For this reason, gradient methods have direct applicability to POMDPs.

The trajectory tree method was originally presented as a means for sample-based planning in POMDPs. Our summary only described this algorithm for MDPs, but it is clear that a single tree in a POMDP provides simultaneous estimates for the values of all policies (and so the same uniform convergence arguments can be applied in the POMDP setting). Of course, the policy class Π must be restricted to use only observable information.
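To make the claim about simultaneous estimates concrete, here is a minimal sketch of a single trajectory tree in the MDP case: one sampled successor per action at every node, so that any deterministic policy can be evaluated along the path it selects. The generative model `toy_step`, the dictionary layout of the tree, and the function names are assumptions made for illustration. In a POMDP, the policy would be given only the observation (or history) stored at each node rather than the state.

```python
import random

def build_tree(step, s, depth, actions, rng):
    """Sample one successor per action at every node, down to the given depth."""
    if depth == 0:
        return {"state": s, "children": {}}
    children = {}
    for a in actions:
        next_s, r = step(s, a, rng)
        children[a] = (r, build_tree(step, next_s, depth - 1, actions, rng))
    return {"state": s, "children": children}

def tree_value(tree, policy):
    """Total reward along the single path that `policy` selects in the tree."""
    total = 0.0
    node = tree
    while node["children"]:
        r, node = node["children"][policy(node["state"])]
        total += r
    return total

# Toy generative model (hypothetical): the next state is random, and the
# reward is 1 for action 1 and 0 for action 0.
def toy_step(s, a, rng):
    return rng.randrange(5), (1.0 if a == 1 else 0.0)

rng = random.Random(0)
tree = build_tree(toy_step, 0, 3, [0, 1], rng)
value_always_1 = tree_value(tree, lambda s: 1)
value_always_0 = tree_value(tree, lambda s: 0)
```

The same tree is reused for every policy, which is the source of the sample reuse. The cost is that a tree built this way has a number of leaves exponential in its depth.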

In our setting, it is not too difficult to see that our μ-based planning approaches can also be applied to POMDPs. However, now μ is a distribution over history vectors or belief states, and, of course, the policy class is restricted to use only observable information. Here the problem of choosing and representing a good μ becomes more interesting and challenging (and we could certainly consider using “memoryless” μ's).

In document On the Sample Complexity of Reinforcement Learning (Page 137-139)