

PART I: FORMULATIONS

2.3 Value Functions and Policies

We start the section by defining the basic properties of value functions and then discuss how a value function can be used to construct a policy. These properties are important in defining the objectives in calculating a value function. The approximate value function ˜v in this section is an arbitrary estimate of the optimal value function v∗.

Value functions serve to simplify the solution of MDPs because they are easier to calculate and analyze than policies. Finding the optimal value function for an MDP corresponds to finding a fixed point of the nonlinear Bellman operator (Bellman, 1957).

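Definition 2.6 (Puterman, 2005). The Bellman operator $L : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ and the value function update $L_\pi : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ for a policy π are defined as:

\[ L_\pi v = \gamma P_\pi v + r_\pi \]

\[ L v = \max_{\pi \in \Pi} L_\pi v \]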

The operator L is well-defined because the maximization can be decomposed state-wise. That is, Lv ≥ Lπv for all policies π ∈ Π. Notice that L is a non-linear operator and Lπ is an affine operator (a linear operator offset by a constant).
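The two operators can be sketched on a toy problem. The block below is a minimal illustration, assuming a hypothetical two-state, two-action MDP with γ = 0.5 and deterministic policies; the names `P`, `R`, `q_value`, `policy_update` and `bellman` are illustrative and do not come from the thesis.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward


def q_value(v, s, a):
    """r(s, a) + gamma * sum over s' of P(s, a, s') * v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))


def policy_update(v, policy):
    """L_pi v for a deterministic policy given as one action index per state."""
    return [q_value(v, s, policy[s]) for s in range(len(v))]


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [max(q_value(v, s, a) for a in range(len(R[s]))) for s in range(len(v))]


v = [0.0, 4.0]
Lv = bellman(v)

# Check L v >= L_pi v component-wise for every deterministic policy.
policies = [[a0, a1] for a0 in range(2) for a1 in range(2)]
dominates = all(
    lv >= lpv
    for policy in policies
    for lv, lpv in zip(Lv, policy_update(v, policy))
)
print(Lv, dominates)
```

Computing the maximum separately in each state is what makes the maximization over policies tractable here: there are four deterministic policies but only two actions to compare per state.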

A value function is optimal when it is stationary with respect to the Bellman operator.

Theorem 2.7 (Bellman, 1957). A value function v∗ is optimal if and only if v∗ = Lv∗. Moreover, v∗ is unique and satisfies v∗ ≥ vπ.

The proof of this basic property can be found in Section C.2. It illustrates well the concepts used in other proofs in the thesis.
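The fixed-point characterization suggests a simple numerical check. The sketch below uses standard value iteration, which is a textbook technique and not a method described in this section, on a hypothetical two-state MDP with γ = 0.5; all names and numbers are assumptions made for illustration.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [
        max(
            R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))
            for a in range(len(R[s]))
        )
        for s in range(len(v))
    ]


def value_iteration(tol=1e-12):
    """Apply L repeatedly until successive iterates differ by less than tol."""
    v = [0.0] * len(R)
    while True:
        v_next = bellman(v)
        if max(abs(x - y) for x, y in zip(v_next, v)) < tol:
            return v_next
        v = v_next


v_star = value_iteration()
fixed_point_gap = max(abs(x - y) for x, y in zip(bellman(v_star), v_star))
print(v_star, fixed_point_gap)
```

For this toy problem the iterates approach the vector (1, 2), and applying L to the result changes it only at the level of the tolerance.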

Because the ultimate goal of solving an MDP is to compute a good policy, it is necessary to be able to compute a policy from a value function. The simplest method is to take the greedy policy with respect to the value function. A greedy policy takes in each state the action that maximizes the expected value when transiting to the following state.

Definition 2.8(Greedy Policy). A policy π is greedy with respect to a value function v when

\[ \pi(s) = \arg\max_{a \in A} \Big( r(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, v(s') \Big) = \arg\max_{a \in A} \mathbf{1}_s^T (r_a + \gamma P_a v), \]

and is greedy with respect to an action-value function q when

\[ \pi(s) = \arg\max_{a \in A} q(s, a). \]
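Both forms of the greedy policy can be sketched in a few lines. The example assumes a hypothetical two-state, two-action MDP with γ = 0.5 and breaks ties in favor of the lowest action index; the tie-breaking rule and the names `greedy_from_v` and `greedy_from_q` are assumptions, not part of the definition.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward


def q_from_v(v):
    """Action values q(s, a) = r(s, a) + gamma * sum_s' P(s, a, s') v(s')."""
    return [
        [R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))
         for a in range(len(R[s]))]
        for s in range(len(v))
    ]


def greedy_from_q(q):
    """Pick, in each state, the first action with the largest q(s, a)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def greedy_from_v(v):
    """Greedy policy with respect to a value function v."""
    return greedy_from_q(q_from_v(v))


v = [0.0, 4.0]
print(q_from_v(v), greedy_from_v(v))
```

Routing the value-function case through `q_from_v` reflects that the two definitions coincide once q is built from v by a one-step lookahead.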

The following propositions summarize the main properties of greedy policies.

Proposition 2.9. The policy π greedy for a value function v satisfies Lv = Lπv ≥ Lπ′v for all policies π′ ∈ Π. In addition, the greedy policy with respect to the optimal value function v∗ is an optimal policy.

The proof of the proposition can be found in Section C.3.

Most practical MDPs are too large for the optimal value function to be computed precisely. In these cases, we calculate an approximate value function ˜v and take the greedy policy π with respect to it. The quality of such a policy can be evaluated from its value function vπ in one of the following two main ways.

Definition 2.10 (Policy Loss). Let π be a policy computed from value function approximation. The average policy loss measures the expected loss of π and is defined as:

\[ \|v^* - v_\pi\|_{1,\alpha} = \alpha^T v^* - \alpha^T v_\pi \tag{2.1} \]

The robust policy loss measures the worst-case loss of π over all states and is defined as:

\[ \|v^* - v_\pi\|_\infty = \max_{s \in S} |v^*(s) - v_\pi(s)| \tag{2.2} \]

The average policy loss captures the total loss of discounted average reward when following the policy π instead of the optimal policy, assuming the initial distribution α. The robust policy loss ignores the initial distribution and measures the difference for the worst-case initial distribution.
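The two loss measures are easy to compare numerically. The sketch below uses hypothetical vectors `v_star`, `v_pi` and `alpha` chosen only for illustration; it relies on v∗ ≥ vπ, which is why the weighted difference in (2.1) needs no absolute values.

```python
def average_policy_loss(alpha, v_star, v_pi):
    """alpha^T v* - alpha^T v_pi, the expected loss under initial distribution alpha."""
    return sum(a * (x - y) for a, x, y in zip(alpha, v_star, v_pi))


def robust_policy_loss(v_star, v_pi):
    """max over states of |v*(s) - v_pi(s)|, the worst-case loss."""
    return max(abs(x - y) for x, y in zip(v_star, v_pi))


# Hypothetical values for a two-state problem (not taken from the thesis).
v_star = [1.0, 2.0]  # optimal value function
v_pi = [0.0, 2.0]    # value function of a suboptimal policy
alpha = [0.5, 0.5]   # initial state distribution

avg_loss = average_policy_loss(alpha, v_star, v_pi)
rob_loss = robust_policy_loss(v_star, v_pi)
print(avg_loss, rob_loss)
```

In this example the policy loses value only in the first state, so the average loss is half the robust loss under the uniform initial distribution.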

Taking the greedy policy with respect to a value function is the simplest method for choosing a policy. There are other, more computationally intensive methods that can often lead to much better performance but are harder to construct and analyze. We discuss these and other methods in more detail in Chapter 11.

The methods for constructing policies from value functions can be divided into two main classes based on the effect of the value function error, as Chapter 11 describes in more detail. In the first class of methods, the computational complexity increases with the value function error, but the solution quality is unaffected. A* and other classic search methods are included in this class (Russell & Norvig, 2003). In the second class of methods, the solution quality decreases with the value function error, but the computational complexity is unaffected. Greedy policies are an example of such a method. In the remainder of the thesis, we focus on greedy policies because they are easy to study, can be easily constructed, and often work well.

A crucial concept in evaluating the quality of a value function with respect to the greedy policy is the Bellman residual, which is defined as follows.

[Figure 2.1 not reproduced. Recoverable labels: the state space S, a value function v ∈ K, and a value function v ∈ K(ε).]

Figure 2.1. Transitive-feasible value functions in an MDP with a linear state-space.

Definition 2.11 (Bellman residual). The Bellman residual of a value function v is a vector defined as v−Lv.

The Bellman residual can be easily estimated from data, and is used in bounds on the policy loss of greedy policies. Most methods that approximate the value function are at least loosely based on minimization of a function of the Bellman residual.
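As a concrete illustration of the definition, the sketch below computes the residual v − Lv with a known model. It assumes a hypothetical two-state MDP with γ = 0.5 and does not cover estimation from sampled data; the names are illustrative.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [
        max(
            R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))
            for a in range(len(R[s]))
        )
        for s in range(len(v))
    ]


def bellman_residual(v):
    """The vector v - L v."""
    return [x - y for x, y in zip(v, bellman(v))]


residual_rough = bellman_residual([0.0, 4.0])  # an arbitrary estimate
residual_opt = bellman_residual([1.0, 2.0])    # the fixed point of L for this MDP
print(residual_rough, residual_opt)
```

The residual has mixed signs for the arbitrary estimate and vanishes at the fixed point of L.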

In many of the methods that we study, it is advantageous to restrict the value functions so that their Bellman residual must be non-negative, or at least bounded from below. We call such value functions transitive-feasible and define them as follows.

Definition 2.12. A value function is transitive-feasible when v ≥ Lv. The set of transitive-feasible value functions is:

\[ K = \{ v \in \mathbb{R}^{|S|} : v \ge L v \}. \]

Assume an arbitrary ε ≥ 0. The set of ε-transitive-feasible value functions is defined as follows:

\[ K(\varepsilon) = \{ v \in \mathbb{R}^{|S|} : v \ge L v - \varepsilon \mathbf{1} \}. \]

Notice that the optimal value function v∗ is transitive-feasible, which follows directly from Theorem 2.7. Transitive-feasible value functions are illustrated in Figure 2.1. The following lemma summarizes the main importance of transitive-feasible value functions:

Lemma 2.13. Transitive-feasible value functions are an upper bound on the optimal value function. Assume an ε-transitive-feasible value function v ∈ K(ε). Then:

\[ v \ge v^* - \frac{\varepsilon}{1 - \gamma} \mathbf{1}. \]

The proof of the lemma can be found in Section C.2.
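Membership in K(ε) and the bound of Lemma 2.13 can be checked directly on a small example. The sketch assumes a hypothetical two-state MDP with γ = 0.5 whose optimal value function is (1, 2); the function name `in_K` and the test vectors are illustrative assumptions.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward
V_STAR = [1.0, 2.0]           # satisfies V_STAR == L V_STAR for this MDP


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [
        max(
            R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))
            for a in range(len(R[s]))
        )
        for s in range(len(v))
    ]


def in_K(v, eps=0.0):
    """True when v >= L v - eps * 1 holds in every state."""
    return all(x >= y - eps for x, y in zip(v, bellman(v)))


def satisfies_lemma_bound(v, eps):
    """True when v >= v* - eps / (1 - gamma) * 1 holds in every state."""
    return all(x >= y - eps / (1.0 - GAMMA) for x, y in zip(v, V_STAR))


v_feasible = [3.0, 3.0]  # v - L v = (1.5, 0.5), so v is in K
v_almost = [0.75, 2.0]   # v - L v = (-0.25, 0), so v is in K(0.5) but not in K
print(in_K(v_feasible), in_K(v_almost), in_K(v_almost, eps=0.5))
```

The second vector dips below v∗ in the first state, which is allowed for ε = 0.5 because the lemma only guarantees v ≥ v∗ − ε/(1 − γ) 1.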

Another important property of transitive-feasible value functions follows.

Proposition 2.14. The set K of transitive-feasible value functions is convex. That is, for any v1, v2 ∈ K and any β ∈ [0, 1], also βv1 + (1−β)v2 ∈ K.

The proof of the proposition can be found in Section C.2.

The crucial property of approximate value functions is the quality of the corresponding greedy policy. The robust policy loss can be bounded as follows.

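Theorem 2.15. Let ˜v be the approximate value function, and vπ be a value function of an arbitrary policy π. Then:

\[ \|v^* - v_\pi\|_\infty \le \frac{1}{1-\gamma} \|\tilde v - L_\pi \tilde v\|_\infty + \|\tilde v - v^*\|_\infty \]

\[ \|v^* - v_\pi\|_\infty \le \frac{2}{1-\gamma} \|\tilde v - L_\pi \tilde v\|_\infty \]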

The proof of the theorem can be found in Section C.3. This theorem extends the classical bounds on policy loss (Williams & Baird, 1994). The following theorem states the bounds for the greedy policy in particular.

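Theorem 2.16 (Robust Policy Loss). Let π be the policy greedy with respect to ˜v. Then:

\[ \|v^* - v_\pi\|_\infty \le \frac{2}{1-\gamma} \|\tilde v - L \tilde v\|_\infty. \]

In addition, if ˜v ∈ K, the policy loss is minimized for the greedy policy and:

\[ \|v^* - v_\pi\|_\infty \le \frac{1}{1-\gamma} \|\tilde v - L \tilde v\|_\infty. \]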

The proof of the theorem can be found in Section C.3.
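The first bound of Theorem 2.16 can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-state MDP with γ = 0.5 and an arbitrary estimate `v_tilde`; it approximates v∗ and vπ by repeated application of L and Lπ, which is an implementation choice made here and not the thesis's procedure.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward
STATES = range(len(R))


def q_value(v, s, a):
    """r(s, a) + gamma * sum over s' of P(s, a, s') * v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [max(q_value(v, s, a) for a in range(len(R[s]))) for s in STATES]


def greedy(v):
    """Greedy policy; ties go to the lowest action index (an assumption)."""
    return [max(range(len(R[s])), key=lambda a: q_value(v, s, a)) for s in STATES]


def iterate(update, sweeps=200):
    """Apply a contraction repeatedly, starting from the zero vector."""
    v = [0.0] * len(R)
    for _ in range(sweeps):
        v = update(v)
    return v


def sup_norm(x, y):
    """Maximum absolute component-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


v_tilde = [2.0, 0.5]  # arbitrary estimate, chosen so the greedy policy is suboptimal
policy = greedy(v_tilde)
v_star = iterate(bellman)
v_pi = iterate(lambda v: [q_value(v, s, policy[s]) for s in STATES])

loss = sup_norm(v_star, v_pi)
bound = 2.0 / (1.0 - GAMMA) * sup_norm(v_tilde, bellman(v_tilde))
print(policy, loss, bound)
```

Here the greedy policy never leaves state 0, the robust loss is about 1, and the bound evaluates to about 4, so the inequality holds with room to spare.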

The bounds above ignore the initial distribution. When the initial distribution is known, bounds on the expected policy loss can be used.

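Theorem 2.17 (Expected Policy Loss). Let π be a greedy policy with respect to a value function ˜v and let the state-action visitation frequencies of π be bounded as $\underline{u} \le u_\pi \le \bar{u}$. Then:

\[ \|v^* - v_\pi\|_{1,\alpha} = \alpha^T v^* - \alpha^T \tilde v + u_\pi^T (\tilde v - L \tilde v) \]

\[ \le \alpha^T v^* - \alpha^T \tilde v + \underline{u}^T [\tilde v - L \tilde v]_- + \bar{u}^T [\tilde v - L \tilde v]_+ . \]

The state-visitation frequency uπ depends on the initial distribution α, unlike v∗. In addition, when ˜v ∈ K, the bound is:

\[ \|v^* - v_\pi\|_{1,\alpha} \le -\|v^* - \tilde v\|_{1,\alpha} + \|\tilde v - L \tilde v\|_{1,\bar{u}} \]

\[ \|v^* - v_\pi\|_{1,\alpha} \le -\|v^* - \tilde v\|_{1,\alpha} + \frac{1}{1-\gamma} \|\tilde v - L \tilde v\|_\infty \]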

The proof of the theorem can be found in Section C.3. The proof is based on the complementary slackness principle in linear programs (Mendelssohn, 1980; Shetty & Taylor, 1987; Zipkin, 1977). Notice that the bounds in Theorem 2.17 can be minimized even without knowing v∗. The optimal value function v∗ is independent of the approximate value function ˜v and the greedy policy π depends only on ˜v.
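The equality in Theorem 2.17 can be verified numerically. The sketch below assumes a hypothetical two-state MDP with γ = 0.5, a deterministic greedy policy, and state visitation frequencies computed as the fixed point of u = α + γ Pπᵀ u; that formula for uπ is the standard discounted definition and is an assumption here, since this section does not restate it.

```python
GAMMA = 0.5  # discount factor (assumed for this example)

# Hypothetical MDP: P[s][a][t] is the probability of moving from s to t under a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [[0.0, 0.0], [1.0, 0.0]]  # R[s][a]: only staying in state 1 pays a reward
STATES = range(len(R))


def q_value(v, s, a):
    """r(s, a) + gamma * sum over s' of P(s, a, s') * v(s')."""
    return R[s][a] + GAMMA * sum(p * x for p, x in zip(P[s][a], v))


def bellman(v):
    """L v, computed as a state-wise maximum over actions."""
    return [max(q_value(v, s, a) for a in range(len(R[s]))) for s in STATES]


def iterate(update, start, sweeps=200):
    """Apply a contraction repeatedly from the given starting vector."""
    v = list(start)
    for _ in range(sweeps):
        v = update(v)
    return v


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


alpha = [0.5, 0.5]    # initial distribution (assumed)
v_tilde = [2.0, 0.5]  # arbitrary estimate of v*
policy = [max(range(len(R[s])), key=lambda a: q_value(v_tilde, s, a)) for s in STATES]

zero = [0.0] * len(R)
v_star = iterate(bellman, zero)
v_pi = iterate(lambda v: [q_value(v, s, policy[s]) for s in STATES], zero)

# u_pi solves u = alpha + gamma * P_pi^T u (discounted state visitation frequencies).
u_pi = iterate(
    lambda u: [
        alpha[t] + GAMMA * sum(u[s] * P[s][policy[s]][t] for s in STATES)
        for t in STATES
    ],
    alpha,
)

residual = [x - y for x, y in zip(v_tilde, bellman(v_tilde))]
lhs = dot(alpha, v_star) - dot(alpha, v_pi)
rhs = dot(alpha, v_star) - dot(alpha, v_tilde) + dot(u_pi, residual)
print(lhs, rhs, u_pi)
```

Both sides evaluate to about 0.5 in this example, and the entries of `u_pi` sum to about 1/(1 − γ) = 2, as expected for discounted visitation frequencies.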

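Remark 2.18. The bounds in Theorem 2.17 generalize the bounds of Theorem 1.3 in (de Farias, 2002). Those bounds state that whenever ˜v ∈ K:

\[ \|v^* - v_\pi\|_{1,\alpha} \le \frac{1}{1-\gamma} \|v^* - \tilde v\|_{1,(1-\gamma)u}. \]

This bound is a special case of Theorem 2.17 because:

\[ \|\tilde v - L \tilde v\|_{1,u} \le \|v^* - \tilde v\|_{1,u} \le \frac{1}{1-\gamma} \|v^* - \tilde v\|_{1,(1-\gamma)u}, \]

from $v^* \le L \tilde v \le \tilde v$ and $\alpha^T v^* - \alpha^T \tilde v \le 0$. The proof of Theorem 2.17 also simplifies the proof of Theorem 1.3 in (de Farias, 2002).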

The bounds from the remark above can be further tightened and revised as the following theorem shows. We use this new bound later to improve the standard ALP formulation.

Theorem 2.19 (Expected Policy Loss). Let π be a greedy policy with respect to a value function ˜v and let the state-action visitation frequencies of π be bounded as $\underline{u} \le u_\pi \le \bar{u}$. Then:

\[ \|v^* - v_\pi\|_{1,\alpha} \le \big( \bar{u}^T (I - \gamma P^*) - \alpha^T \big) (\tilde v - v^*) + \bar{u}^T [L \tilde v - \tilde v]_+ , \]

where $P^* = P_{\pi^*}$. The state-visitation frequency uπ depends on the initial distribution α, unlike v∗.

In addition, when ˜v ∈ K, the bound can be simplified to:

\[ \|v^* - v_\pi\|_{1,\alpha} \le \bar{u}^T (I - \gamma P^*) (\tilde v - v^*). \]

This simplification is, however, looser.

The proof of the theorem can be found in Section C.3.

Notice the significant difference between Theorems 2.19 and 2.17. Theorem 2.19 involves the term $[L \tilde v - \tilde v]_+$, while in Theorem 2.17 it is reversed to $[\tilde v - L \tilde v]_+$.

The bounds above play an important role in the approaches that we propose. Chapter 4 […] loss in Theorems 2.17 and 2.19. However, it only minimizes loose upper bounds. Then, Chapter 5 shows that the tighter bounds in Theorems 2.17 and 2.16 can be optimized using approximate bilinear programming.