
In this section, we develop a general procedure for exploiting structure in two-cost CMDPs. We first develop sufficient optimality conditions that are easier to verify than those introduced in Theorem 2.3.1. Rewrite

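\[
C^V = \sup_{\gamma \ge 0} \min_{\sigma \in \Pi_S} \{ C_2(\sigma) + \gamma (C_1(\sigma) - V) \}
    = \sup_{\gamma \ge 0} \Big\{ \min_{\sigma \in \Pi_S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - \gamma V \Big\},
\]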

and define the function $g(\gamma) := \min_{\sigma \in \Pi_S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - \gamma V$. Note that the minimization in $g(\gamma)$,

\[
\min_{\sigma \in \Pi_S} \{ \gamma C_1(\sigma) + C_2(\sigma) \}, \tag{LR($\gamma$)}
\]

is an unconstrained MDP with cost function $c_\gamma(x, a) = \gamma c_1(x, a) + c_2(x, a)$. Denote this problem by LR($\gamma$) and let $O_\gamma$ denote the set of optimal stationary policies that achieve the minimum in LR($\gamma$). The following proposition provides properties of $g$ that help us find a stationary policy satisfying the equation in Statement 2 of Theorem 2.3.1.

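Proposition 2.4.1 The following hold for all $\gamma \in \mathbb{R}$:

1. $g(\gamma)$ is concave in $\gamma$.

2. For any $\sigma^\gamma \in O_\gamma$, $V - C_1(\sigma^\gamma) \in \partial(-g)(\gamma)$, where $\partial f$ is the subdifferential (set of all subgradients) of the function $f$.

3. If $\gamma < \hat\gamma$, $\sigma^\gamma \in O_\gamma$, and $\sigma^{\hat\gamma} \in O_{\hat\gamma}$, then $C_1(\sigma^\gamma) \ge C_1(\sigma^{\hat\gamma})$.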

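Proof. For each fixed $\sigma \in \Pi_S$, the map $\gamma \mapsto \gamma C_1(\sigma) + C_2(\sigma) - \gamma V$ is affine, and hence concave, in $\gamma$. Since $g$ is the pointwise minimum of these maps over $\sigma \in \Pi_S$, and the minimum of concave functions is concave, the first result holds. To show the second result, we need to show that for any $\gamma \in \mathbb{R}$ and $\sigma^\gamma \in O_\gamma$,

\[
-g(\gamma') \ge -g(\gamma) + (V - C_1(\sigma^\gamma))(\gamma' - \gamma) \quad \forall \gamma' \in \mathbb{R}.
\]

Fix $\gamma \in \mathbb{R}$. We have, for any $\gamma' \in \mathbb{R}$,

\begin{align*}
g(\gamma') - g(\gamma)
  &= \min_{\sigma \in \Pi_S} \{ \gamma' C_1(\sigma) + C_2(\sigma) \} - \min_{\sigma \in \Pi_S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - V(\gamma' - \gamma) \\
  &= \min_{\sigma \in \Pi_S} \{ \gamma' C_1(\sigma) + C_2(\sigma) \} - \gamma C_1(\sigma^\gamma) - C_2(\sigma^\gamma) - V(\gamma' - \gamma) \\
  &\le \gamma' C_1(\sigma^\gamma) + C_2(\sigma^\gamma) - \gamma C_1(\sigma^\gamma) - C_2(\sigma^\gamma) - V(\gamma' - \gamma) \\
  &= -(V - C_1(\sigma^\gamma))(\gamma' - \gamma).
\end{align*}

Hence

\[
-g(\gamma') \ge -g(\gamma) + (V - C_1(\sigma^\gamma))(\gamma' - \gamma),
\]

as desired.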

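For the remaining result, fix $\gamma \in \mathbb{R}$ and let $\delta > 0$. Let $\nu \in O_\gamma$ and $\hat\nu \in O_{\gamma+\delta}$. This implies

\begin{align*}
\gamma C_1(\nu) + C_2(\nu) &\le \gamma C_1(\hat\nu) + C_2(\hat\nu), \\
(\gamma + \delta) C_1(\hat\nu) + C_2(\hat\nu) &\le (\gamma + \delta) C_1(\nu) + C_2(\nu).
\end{align*}

Using the fact that $A \le B$ and $C \le D$ together imply $C - B \le D - A$ yields

\[
\delta (C_1(\hat\nu) - C_1(\nu)) \le 0,
\]

so that $C_1(\hat\nu) \le C_1(\nu)$.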
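The three statements of Proposition 2.4.1 can be checked numerically on a small example. The sketch below is illustrative only: it assumes a finite list of stationary policies summarized by hypothetical cost pairs `(C1, C2)` and a hypothetical constraint level `V`, and it replaces the MDP solver for LR(γ) with a brute-force minimum over that list.

```python
# Hypothetical toy data: each tuple is (C1, C2) for one stationary policy.
POLICY_COSTS = [(0.0, 10.0), (1.0, 6.0), (2.0, 4.0), (4.0, 3.0)]
V = 1.5  # hypothetical constraint level


def solve_lr(gamma, costs=POLICY_COSTS):
    """Index of a policy minimizing gamma*C1 + C2, i.e. a member of O_gamma."""
    return min(range(len(costs)), key=lambda i: gamma * costs[i][0] + costs[i][1])


def g(gamma, costs=POLICY_COSTS, v=V):
    """Dual function g(gamma) = min_sigma {gamma*C1 + C2} - gamma*V."""
    c1, c2 = costs[solve_lr(gamma, costs)]
    return gamma * c1 + c2 - gamma * v


gammas = [0.25 * k for k in range(41)]  # grid on [0, 10]

# Statement 1: midpoint concavity of g on the grid.
concave = all(
    g(0.5 * (a + b)) >= 0.5 * (g(a) + g(b)) - 1e-12
    for a in gammas
    for b in gammas
)


# Statement 2: V - C1(sigma_gamma) is a subgradient of -g at gamma,
# equivalently C1(sigma_gamma) - V is a supergradient of g at gamma.
def supergradient_ok(gamma):
    slope = POLICY_COSTS[solve_lr(gamma)][0] - V
    return all(g(gp) <= g(gamma) + slope * (gp - gamma) + 1e-12 for gp in gammas)


subgradient = all(supergradient_ok(gm) for gm in gammas)

# Statement 3: the class 1 cost of an LR(gamma)-optimal policy is
# nonincreasing as gamma increases.
c1_path = [POLICY_COSTS[solve_lr(gm)][0] for gm in gammas]
monotone = all(a >= b for a, b in zip(c1_path, c1_path[1:]))
```

On this toy data the selected class 1 cost steps down through 4, 2, 1, 0 as the multiplier grows, which is the monotonicity that the later search for the correct multiplier relies on.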

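Suppose we find $\gamma^* \ge 0$ such that there exists an optimal policy $\sigma^* \in O_{\gamma^*}$ for LR($\gamma^*$) satisfying the constraint at equality: $C_1(\sigma^*) = V$. This implies $0 \in \partial(-g)(\gamma^*)$ by way of the second statement of Proposition 2.4.1. Since $g$ is concave, this implies that $\gamma^*$ attains the supremum of $g(\gamma)$ over $\gamma \ge 0$. Observe that

\begin{align*}
C^V &= \sup_{\gamma \ge 0} \Big\{ \min_{\sigma \in \Pi_S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - \gamma V \Big\} \\
    &= \min_{\sigma \in \Pi_S} \{ \gamma^* C_1(\sigma) + C_2(\sigma) \} - \gamma^* V \\
    &= C_2(\sigma^*) + \gamma^* (C_1(\sigma^*) - V) \\
    &= \sup_{\gamma \ge 0} \{ C_2(\sigma^*) + \gamma (C_1(\sigma^*) - V) \},
\end{align*}

where the last equality follows since $C_1(\sigma^*) - V = 0$. From Statement 2 of Theorem 2.3.1, $\sigma^*$ is optimal for B(V). This implies sufficient optimality conditions, summarized in the following proposition.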

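Proposition 2.4.2 (Sufficient optimality conditions) Suppose that $(\sigma^*, \gamma^*) \in \Pi_S \times \mathbb{R}_+$ satisfies

\begin{align*}
\sigma^* &\in O_{\gamma^*}, \tag{2.4} \\
C_1(\sigma^*) &= V. \tag{2.5}
\end{align*}

Then the policy $\sigma^*$ is optimal for B(V).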

Given these optimality conditions, we have converted the problem of directly finding a constrained-optimal stationary policy into that of finding an optimal policy for the appropriate unconstrained MDP. In fact, the results in Propositions 2.4.1 and 2.4.2, combined with algorithms such as subgradient descent for convex optimization problems, yield algorithms capable of solving a CMDP by instead solving a sequence of unconstrained MDPs. Of course, such algorithms are of little practical use: one could just solve the constrained MDP directly via linear programming (possibly with some truncation methods if the state space is countably infinite), which is much faster. However, this perspective allows us to look at structural results for unconstrained MDPs (which are well understood) and see how they may be extended to the constrained problem. More precisely, it turns the problem of finding structured constrained-optimal policies into that of finding structured policies that are optimal for the unconstrained problem with the correct Lagrange multiplier. The rest of the section is dedicated to making this process more exact.
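As a rough illustration of solving a CMDP through a sequence of unconstrained problems, the sketch below searches for a multiplier whose relaxation admits an optimal policy meeting the constraint at equality. It uses bisection, which relies on the monotonicity in Statement 3 of Proposition 2.4.1, rather than the subgradient method mentioned above. The finite list of `(C1, C2)` pairs, the brute-force stand-in for an MDP solver, and the names `c1_range`, `find_multiplier`, `gamma_hi`, and `tol` are all assumptions made for the example.

```python
def c1_range(gamma, costs, tol=1e-9):
    """Smallest and largest C1 among listed policies optimal for LR(gamma)."""
    best = min(gamma * c1 + c2 for c1, c2 in costs)
    optimal_c1 = [c1 for c1, c2 in costs if gamma * c1 + c2 <= best + tol]
    return min(optimal_c1), max(optimal_c1)


def find_multiplier(costs, v, gamma_hi=1e3, iters=200):
    """Bisection on gamma in [0, gamma_hi].

    Returns a multiplier at which the class 1 costs of LR(gamma)-optimal
    policies bracket v (up to the tie tolerance used in c1_range).
    """
    lo, hi = 0.0, gamma_hi
    for gamma in (lo, hi):
        c1_min, c1_max = c1_range(gamma, costs)
        if c1_min <= v <= c1_max:
            return gamma
    if c1_range(lo, costs)[1] < v or c1_range(hi, costs)[0] > v:
        raise ValueError("v is not bracketed by the costs at 0 and gamma_hi")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        c1_min, c1_max = c1_range(mid, costs)
        if c1_min <= v <= c1_max:
            return mid
        if c1_min > v:
            lo = mid  # every optimal policy violates the constraint: raise gamma
        else:
            hi = mid  # every optimal policy is strictly feasible: lower gamma
    return 0.5 * (lo + hi)


# Hypothetical toy data: (C1, C2) per policy, constraint level 1.5.
TOY_COSTS = [(0.0, 10.0), (1.0, 6.0), (2.0, 4.0), (4.0, 3.0)]
gamma_star = find_multiplier(TOY_COSTS, v=1.5)
c1_lo, c1_hi = c1_range(gamma_star, TOY_COSTS)
```

For this toy data the search settles near γ = 2, where the policies with class 1 costs 1 and 2 are both optimal for the relaxation, so the constraint level 1.5 lies strictly between them. No single listed policy satisfies (2.5) here, and a binding policy would have to be obtained by randomization.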

Doing so involves answering some important questions:

1. Do we need the existence of a single optimal policy with the desired structural properties or do we need to show something stronger?

2. For what values of γ do these results need to hold? Clearly, the results must hold for γ attaining the supremum in the Lagrangian dual, but what if it is not obvious what γ is?

3. Assuming we are able to show structural results for the correct unconstrained MDP, how do the results extend? Is the same structure maintained or does it change slightly?

To answer these questions, we consider the range of class 1 costs obtainable by policies optimal for the Lagrangian relaxation LR(γ):

\[
\mathcal{C}_1(\gamma) := \{ C_1(\sigma) : \sigma \in O_\gamma \}.
\]

Note that, for every $\gamma \ge 0$, $\mathcal{C}_1(\gamma)$ is an interval of costs (where the singleton interval is a possibility). To see this, note that for any two policies $\sigma, \sigma' \in O_\gamma$ with corresponding occupation measures $\phi, \phi'$, we can create a randomized policy $\sigma^p$ for $p \in [0, 1]$ corresponding to the occupation measure $\phi^p = p\phi + (1 - p)\phi'$, which has the same objective value as $\sigma$ and $\sigma'$, but whose class 1 cost is a convex combination of $C_1(\sigma)$ and $C_1(\sigma')$:

\[
C_1(\sigma^p) = p\, C_1(\sigma) + (1 - p)\, C_1(\sigma').
\]

Hence, if $c_1, c_1' \in \mathcal{C}_1(\gamma)$ with $c_1 < c_1'$, then $[c_1, c_1'] \subseteq \mathcal{C}_1(\gamma)$.
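The claim that the randomized policy keeps the same objective value can be spelled out in one line. The derivation below is a sketch under the assumption, implicit in the convex-combination identity above, that both $C_1$ and $C_2$ are linear in the occupation measure; the dummy variable $\tau$ is introduced only for this display.

```latex
\begin{align*}
\gamma C_1(\sigma^p) + C_2(\sigma^p)
  &= p\,[\gamma C_1(\sigma) + C_2(\sigma)] + (1 - p)\,[\gamma C_1(\sigma') + C_2(\sigma')] \\
  &= p \min_{\tau \in \Pi_S} \{ \gamma C_1(\tau) + C_2(\tau) \}
     + (1 - p) \min_{\tau \in \Pi_S} \{ \gamma C_1(\tau) + C_2(\tau) \} \\
  &= \min_{\tau \in \Pi_S} \{ \gamma C_1(\tau) + C_2(\tau) \}.
\end{align*}
```

Under that assumption $\sigma^p \in O_\gamma$ for every $p \in [0, 1]$, so each intermediate cost $p\,C_1(\sigma) + (1 - p)\,C_1(\sigma')$ belongs to the set of class 1 costs attained on $O_\gamma$.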

Now suppose that we have found the optimal multiplier $\gamma^*$ described in Proposition 2.4.2, and thus the correct Lagrangian relaxation problem for which to prove structural results. We find ourselves in one of two cases. In the first case, we have that $|\mathcal{C}_1(\gamma^*)| = 1$. This means that every policy in the argmin, $O_{\gamma^*}$, has the same class 1 cost. Hence, if we are able to show the existence of an optimal policy with the desired structural properties for this Lagrangian relaxation, then we have shown that the exact same structural results hold for the constrained problem. We suspect that these types of problems are uncommon: neither of the applications we consider falls into this case, and indeed we have not found an example (other than the trivial example of a CMDP with a singleton action space in every state). However, this case is still considered for completeness.

On the other hand, if $|\mathcal{C}_1(\gamma^*)| \ne 1$, then $\mathcal{C}_1(\gamma^*)$ is uncountably infinite (since it is a nondegenerate interval), and thus not every policy in $O_{\gamma^*}$ need be constrained-optimal, since not all of these policies satisfy the constraint at equality. This is indeed the case in both of the applications we study. Thus, stronger structural properties must be shown in order to extend the results to the constrained case. We seek to develop a general procedure that allows us to extend structural properties for the unconstrained problem to the constrained problem in this case. In doing so, we define the notion of “extreme” policies in a given structured class, $\Pi_{\mathrm{Str}}$:

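\[
P^1_{\mathrm{Str}} := \operatorname*{argmin}_{\sigma \in \Pi_{\mathrm{Str}}} C_1(\sigma),
\qquad
P^2_{\mathrm{Str}} := \operatorname*{argmax}_{\sigma \in \Pi_{\mathrm{Str}}} C_1(\sigma).
\]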

It should be noted that the structured class of policies, $\Pi_{\mathrm{Str}}$, needs to be chosen carefully so that $\Pi_{\mathrm{Str}} \subseteq O_{\gamma^*}$. In this case, if $V \in [C_1(P^1_{\mathrm{Str}}), C_1(P^2_{\mathrm{Str}})]$, then one of these structured policies, say $\sigma \in \Pi_{\mathrm{Str}}$, satisfies the constraint at equality, and is constrained-optimal by way of Proposition 2.4.2.

One method for constructing a class of policies with these properties is to pick the extreme policies, $P^1_{\mathrm{Str}}$ and $P^2_{\mathrm{Str}}$, first, and then construct a sequence of policies $(\sigma^n)_{n=1}^{\infty} \subseteq O_{\gamma^*}$ conforming to some desired structural property (e.g., threshold policies) such that $\sigma^1 = P^1_{\mathrm{Str}}$ and $\sigma^n \to P^2_{\mathrm{Str}}$ as $n \to \infty$. Intuitively, since we start with a policy whose class 1 cost is below the constraint, and converge to a policy whose class 1 cost exceeds the constraint, we should find two policies along the sequence that “straddle” the constraint: one policy has class 1 cost below the constraint, the other above. We should then be able to find a binding policy by randomizing between these two policies, thus producing a constrained-optimal policy. Doing this requires cost continuity with respect to the mode of convergence in which $\sigma^n \to P^2_{\mathrm{Str}}$. We find that it is most intuitive to consider pointwise convergence of policies: for every $x \in \mathbb{X}$, $\lim_{n \to \infty} (\sigma^n)_x(A) = (P^2_{\mathrm{Str}})_x(A)$ for every $A \subseteq \mathbb{A}(x)$. In this context, cost continuity means that $\sigma^n \to P^2_{\mathrm{Str}}$ pointwise implies that $\lim_{n \to \infty} C_1(\sigma^n) = C_1(P^2_{\mathrm{Str}})$. This allows for the use of the intermediate value theorem to find an optimal policy for B(V).
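The straddling step can be sketched as follows, assuming the class 1 costs along a finite prefix of the sequence have already been computed. The function name `straddle`, the list `c1_seq`, and the sample numbers are hypothetical. The weight `p` is meant to be applied to the occupation measures of the two policies, as in the convex-combination argument earlier in this section.

```python
def straddle(c1_seq, v):
    """Find consecutive policies whose class 1 costs straddle v.

    Returns (n, p) such that p * c1_seq[n] + (1 - p) * c1_seq[n + 1] == v
    (up to rounding), or None if no consecutive pair straddles v.
    """
    for n in range(len(c1_seq) - 1):
        lo, hi = c1_seq[n], c1_seq[n + 1]
        if min(lo, hi) <= v <= max(lo, hi):
            if lo == hi:
                return n, 1.0  # both policies already bind; no mixing needed
            p = (hi - v) / (hi - lo)  # solves p*lo + (1 - p)*hi = v
            return n, p
    return None


# Hypothetical class 1 costs along a sequence of structured policies.
c1_seq = [0.2, 0.5, 0.9, 1.4, 1.6]
n, p = straddle(c1_seq, v=1.0)
mixed_c1 = p * c1_seq[n] + (1 - p) * c1_seq[n + 1]
```

Scanning consecutive pairs only finds a straddling pair within the computed prefix. The continuity argument in the text is what guarantees that such a pair exists once the prefix is long enough.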

In the sections that follow, we introduce specific problems in which $|\mathcal{C}_1(\gamma^*)| \ne 1$, and show how to find a constrained-optimal policy within a particular structured class.