PART III: SAMPLING, FEATURE SELECTION, AND SEARCH
9.1 Sampling In Value Function Approximation
Sampling, as an optimization with incomplete information, is an issue that has been widely addressed in the machine learning community for the problems of regression and classification. The challenge in these problems is to guarantee that a solution based on a small subset of the data also generalizes to the remainder of the data (Devroye, Gyorfi, & Lugosi, 1996; Vapnik, 1999; Bousquet, Boucheron, & Lugosi, 2004).
The goal of regression in machine learning is to estimate a function on a domain from a set of its values. We now formulate value function approximation as a regression problem in order to illustrate the differences. Our treatment of regression here is very informal; see, for example, Gyorfi, Kohler, Krzyzak, and Walk (2002) for a more formal description. Regression in value function approximation can be seen as the problem of estimating a function $f : \mathcal{S} \to \mathbb{R}$ that represents the Bellman residual:
$$f(s) = (v - Lv)(s)$$
for some fixed value function $v$. Regression methods typically assume that a subset of the function values is known. That is why $f(s)$ must represent the Bellman residual, which is known from the samples, and not the value function, which is unknown.
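The regression target above can be sketched on a toy problem. The following is only an illustration under stated assumptions: it takes $L$ to be the Bellman optimality operator, and the three-state MDP (`P`, `R`, `GAMMA`) and the helper names are hypothetical, not taken from the text. It also uses the full model for simplicity, whereas in the sampled setting only the backups at sampled states would be available.

```python
# Hypothetical 3-state, 2-action MDP; all numbers are made up for illustration.
GAMMA = 0.9
# P[a][s][t]: probability of moving from state s to state t under action a.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]],
]
# R[a][s]: expected reward for taking action a in state s.
R = [[0.0, 1.0, 2.0], [0.5, 0.0, 1.0]]


def bellman_backup(v, s):
    """(Lv)(s), assuming L is the Bellman optimality operator."""
    return max(
        R[a][s] + GAMMA * sum(P[a][s][t] * v[t] for t in range(len(v)))
        for a in range(len(P))
    )


def bellman_residual(v, s):
    """Regression target f(s) = (v - Lv)(s) for a fixed value function v."""
    return v[s] - bellman_backup(v, s)


v = [0.0, 0.0, 0.0]       # the fixed value function
sampled_states = [0, 2]   # only these states are treated as observed
targets = {s: bellman_residual(v, s) for s in sampled_states}
print(targets)
```

The dictionary `targets` plays the role of the known function values $f(s_i)$; state 1 is deliberately left out to mimic a state that was never sampled.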
Again, the goal of regression is to find a function $\tilde f \in \mathcal{F}$ based on sampled values $f(s_i)$ for $s_1, \ldots, s_n \in \mathcal{S}$ drawn i.i.d. according to a distribution $\mu$. Commonly, the function $\tilde f$ is chosen by minimizing the sample error $\sum_{i=1}^{n} (f(s_i) - \tilde f(s_i))^2$ in order to minimize the true error $\sum_{s \in \mathcal{S}} (f(s) - \tilde f(s))^2$. The sampling bounds are then on the difference between the sample error and the true error:
$$\frac{1}{n} \sum_{i=1}^{n} \bigl(f(s_i) - \tilde f(s_i)\bigr)^2 - \sum_{s \in \mathcal{S}} \mu(s) \bigl(f(s) - \tilde f(s)\bigr)^2 .$$

Most bounds on this error rely on the redundancy of the samples with respect to the set of functions in $\mathcal{F}$. That is, given a sufficient number of samples $s_1, \ldots, s_n$, additional samples have very little influence on which function $\tilde f$ is chosen from the set of possibilities.

The problem with value function approximation is that there is no fixed distribution $\mu$ over the states that limits the importance of function values. An important distribution over the states in bounding the policy loss is the state visitation frequency $(1-\gamma) u_\pi$ for a policy $\pi$. Here, $(1-\gamma)$ is just a normalization coefficient. Unfortunately, this distribution depends on the policy, which in turn depends on the value function. In the regression setting, this would mean that $\mu$ is not fixed but instead depends on the result $\tilde f$. As a result, at no point is it possible to simply assume that additional samples will have a small influence on the choice of the function $\tilde f$: the cumulative distribution may shift heavily in favor of the non-sampled states with every new sample.
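The dependence of the visitation frequency on the policy can be illustrated with a small sketch. It assumes the standard discounted definition $u_\pi = \sum_t \gamma^t \alpha^T P_\pi^t$, where the initial distribution `alpha`, the two transition matrices, and the truncation horizon are all hypothetical choices made for this example only.

```python
GAMMA = 0.9


def visitation_frequency(P_pi, alpha, horizon=2000):
    """Approximate (1 - gamma) * u_pi by truncating the discounted sum
    u_pi = sum_t gamma^t * (alpha^T P_pi^t)."""
    n = len(alpha)
    d = list(alpha)   # state distribution at time t
    u = [0.0] * n
    discount = 1.0
    for _ in range(horizon):
        for s in range(n):
            u[s] += discount * d[s]
        # Advance the state distribution by one step of the policy.
        d = [sum(d[s] * P_pi[s][t] for s in range(n)) for t in range(n)]
        discount *= GAMMA
    return [(1.0 - GAMMA) * x for x in u]


alpha = [1.0, 0.0]                  # start in state 0
P_stay = [[1.0, 0.0], [0.0, 1.0]]   # policy A: never leaves its state
P_move = [[0.0, 1.0], [0.0, 1.0]]   # policy B: moves to state 1 and stays
mu_stay = visitation_frequency(P_stay, alpha)
mu_move = visitation_frequency(P_move, alpha)
print(mu_stay, mu_move)
```

Both results sum to one, which is the role of the $(1-\gamma)$ factor, yet policy A places essentially all weight on state 0 while policy B places most of it on state 1. Samples concentrated on state 0 would therefore say little about the states that matter under policy B.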
Because of the difference between the regression and value function approximation settings, we need assumptions that differ from those standard in regression. In particular, the assumptions in regression do not concern the part of the space that is not sampled, since that part is most likely improbable under $\mu$. On the other hand, the assumptions for sampling bounds in value function approximation must be uniform over the state space, since the distribution $\mu$ may change arbitrarily. Therefore, the assumptions for value function approximation must be stronger. Yet, as we show later in the chapter, regression methods can be used to compute tight bounds on the sampling error.
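For contrast, the regression-side quantity, the gap between the sample error and the $\mu$-weighted true error, can be simulated when $\mu$ is fixed. This is a minimal sketch with made-up values for `mu`, `f`, and `f_tilde`; it also holds `f_tilde` fixed rather than fitting it to the sample, so it shows only the law-of-large-numbers part of the argument, not a uniform bound over a function class.

```python
import random

states = [0, 1, 2, 3, 4]
mu = [0.4, 0.3, 0.2, 0.05, 0.05]      # fixed sampling distribution
f = [0.0, 0.2, 0.4, 0.6, 0.8]         # hypothetical target values
f_tilde = [0.1, 0.1, 0.5, 0.5, 0.1]   # hypothetical fitted values


def sq_err(s):
    """Squared error of the fitted function at state s."""
    return (f[s] - f_tilde[s]) ** 2


# True error: squared error weighted by the fixed distribution mu.
true_error = sum(mu[s] * sq_err(s) for s in states)


def sampling_gap(n, rng):
    """Sample error over n i.i.d. draws from mu, minus the true error."""
    sample = rng.choices(states, weights=mu, k=n)
    sample_error = sum(sq_err(s) for s in sample) / n
    return sample_error - true_error


rng = random.Random(0)
gaps = {n: sampling_gap(n, rng) for n in (10, 1000, 100000)}
print(gaps)
```

The gap shrinks with $n$ here precisely because the rarely sampled states 3 and 4 also carry little weight under the fixed `mu`. If the weights were allowed to move toward those states after sampling, as happens when the distribution depends on the policy, the same sample would no longer control the error.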
To derive bounds on the sampling error, we assume that the representable value functions are regularized. Our focus is on regularization that uses the $L_1$ norm, but the extension to other polynomial norms is trivial using Hölder's inequality. The advantages of the $L_1$ norm are that 1) it can be easily represented using a set of linear constraints, and 2) it encourages sparsity of the solution coefficients. We use the following refinement of Assumption 2.21.
Assumption 9.1. The set of representable functions for the $L_1$ norm is defined as:
$$\mathcal{M}(\psi) = \{\Phi x : \|x\|_{1,e} \le \psi\},$$
such that $\phi_1 = \mathbf{1}$, $e(1) = 0$, and $e(i) > 0$ for all $i > 1$. Note that the norm is weighted by $e$.
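The claim that the $L_1$ constraint can be written with linear constraints can be made concrete as follows. This is a sketch that assumes the weighted norm has the form $\|x\|_{1,e} = \sum_i e(i)\,|x(i)|$, which is not restated in this section; the auxiliary vector $z$ is a symbol introduced only for this illustration.

$$\|x\|_{1,e} \le \psi \quad\Longleftrightarrow\quad \exists z : \; -z \le x \le z, \;\; e^T z \le \psi .$$

For the forward direction, take $z = |x|$ componentwise. For the reverse direction, $-z \le x \le z$ gives $|x(i)| \le z(i)$, and since $e \ge 0$, $\sum_i e(i)\,|x(i)| \le e^T z \le \psi$. Because $e(1) = 0$, the coefficient of the constant feature is left unconstrained.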
Assumption 9.1 implies Assumption 2.21. We also use the following weighted $L_\infty$ norm:
$$\|x\|_{\infty,e^{-1}} = \max_i \begin{cases} |x(i)| / e(i) & \text{when } |x(i)| > 0 \\ 0 & \text{otherwise.} \end{cases}$$
The following lemma relates the weighted $L_1$ and $L_\infty$ norms and will be useful in deriving bounds.
Lemma 9.2. Let $v \in \mathcal{M}(\psi)$ as defined in Assumption 9.1. Then for any $y$ of an appropriate size:
$$|y^T v| \le |y|^T |v| \le \psi \, \|y\|_{\infty,e^{-1}} .$$
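The inequality can be checked numerically on random vectors. The sketch below makes two assumptions that are not spelled out in this section: the weighted $L_1$ norm is taken to be $\sum_i e(i)\,|x(i)|$, and the inequality is applied directly to a vector `x` with $\psi$ set to its weighted norm. The weights `e` and the function names are hypothetical.

```python
import math
import random


def weighted_l1(x, e):
    """||x||_{1,e}, assumed here to be sum_i e(i) * |x(i)|."""
    return sum(ei * abs(xi) for xi, ei in zip(x, e))


def weighted_linf_inv(y, e):
    """||y||_{inf,e^-1}: max over nonzero entries of |y(i)| / e(i).
    Zero entries contribute 0; a nonzero entry with e(i) = 0 gives infinity."""
    worst = 0.0
    for yi, ei in zip(y, e):
        if yi != 0:
            worst = max(worst, abs(yi) / ei if ei > 0 else math.inf)
    return worst


rng = random.Random(1)
e = [0.0, 1.0, 2.0, 0.5]   # e(1) = 0 and e(i) > 0 otherwise
violations = 0
for _ in range(1000):
    x = [rng.uniform(-1, 1) for _ in e]
    # Keep y(1) = 0 so that the weighted L-infinity norm stays finite.
    y = [0.0] + [rng.uniform(-1, 1) for _ in e[1:]]
    lhs = abs(sum(yi * xi for yi, xi in zip(y, x)))
    rhs = weighted_l1(x, e) * weighted_linf_inv(y, e)
    if lhs > rhs + 1e-12:
        violations += 1
print(violations)
```

The first entry of `y` is forced to zero because the matching weight is zero: any nonzero value there makes the right-hand side infinite, and the bound, while still true, becomes uninformative.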
The assumptions that we introduce must capture the structure of the MDP. Because many structures have been studied in the context of metric spaces, it is convenient to map the state space to a metric space (in particular $\mathbb{R}^n$). This makes it possible to take advantage of the structures proposed for metric spaces.
Definition 9.3 (State Embedding Function). The state embedding function $k : \mathcal{S} \to \mathbb{R}^n$ maps states to a vector space with an arbitrary norm $\|\cdot\|$. The state–action embedding function $k : (\mathcal{S} \times \mathcal{A}) \to \mathbb{R}^n$ maps states and actions to a vector space with an arbitrary norm $\|\cdot\|$. It is also possible to define a state–action–feature embedding function similarly.
The state embedding function is a part of the assumed structure of the MDP. In many applications, the definition of the function follows naturally from the problem. The meaning of the function $k$ is specified when not apparent from the context.
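As an example of an embedding that follows naturally from the problem, consider a grid world. The sketch below is hypothetical throughout: the grid width, the row-by-row state numbering, the two actions, and the choice of the Euclidean norm are assumptions made for illustration and are not prescribed by the definition.

```python
import math

WIDTH = 4  # hypothetical 4-column grid world; states are numbered row by row


def k_state(s):
    """Embed state index s as its (row, column) coordinates in R^2."""
    return (float(s // WIDTH), float(s % WIDTH))


# Each action is represented by the direction in which it moves the agent.
ACTIONS = {"left": (0.0, -1.0), "right": (0.0, 1.0)}


def k_state_action(s, a):
    """Embed (s, a) in R^4: state coordinates followed by the action direction."""
    return k_state(s) + ACTIONS[a]


def distance(u, v):
    """Euclidean distance, one possible choice of the norm on the embedding."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


d = distance(k_state(0), k_state(5))
print(d, k_state_action(6, "right"))
```

With such an embedding, structural assumptions about the MDP can be stated in terms of distances between embedded states rather than in terms of arbitrary state indices.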