General Framework - Structural Results for Constrained Markov Decision Processes

In this section, we develop a general procedure for exploiting structure in two-cost CMDPs. We first develop sufficient optimality conditions that are easier to verify than those introduced in Theorem 2.3.1. Rewrite

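$$
C_V = \sup_{\gamma \ge 0} \min_{\sigma \in \Pi^S} \{ C_2(\sigma) + \gamma (C_1(\sigma) - V) \}
    = \sup_{\gamma \ge 0} \Big\{ \min_{\sigma \in \Pi^S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - \gamma V \Big\},
$$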

and define the function g(γ) := min_{σ∈Π^S} {γC₁(σ) + C₂(σ)} − γV. Note that the minimization in g(γ),

$$
\min_{\sigma \in \Pi^S} \{ \gamma C_1(\sigma) + C_2(\sigma) \}, \tag{$\mathrm{LR}(\gamma)$}
$$

is an unconstrained MDP with cost function c_γ(x, a) = γc₁(x, a) + c₂(x, a). Denote this problem by LR(γ) and let O_γ denote the set of optimal stationary policies that achieve the minimum in LR(γ). The following proposition provides properties of g that help us find a stationary policy satisfying the equation in Statement 2 of Theorem 2.3.1.
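As a rough illustration of how LR(γ) reduces to an ordinary MDP, the following Python sketch runs value iteration on the combined cost c_γ = γc₁ + c₂. Everything in it is an assumption made for the example and not taken from the text: the two-state, two-action transition data, the cost tables, the discounted criterion with factor BETA (no cost criterion is fixed above), the initial state X0, and the names q_value, solve_lr and g.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
STATES = (0, 1)
ACTIONS = (0, 1)
# P[x][a] is the next-state distribution after taking action a in state x.
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
c1 = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}  # constrained cost c_1(x, a)
c2 = {0: {0: 2.0, 1: 1.0}, 1: {0: 4.0, 1: 0.5}}  # objective cost c_2(x, a)
BETA = 0.9  # discount factor (assumed criterion)
X0 = 0      # hypothetical initial state


def q_value(gamma, v, x, a):
    """One-step lookahead cost of action a in state x under cost gamma*c1 + c2."""
    future = sum(p * v[y] for y, p in enumerate(P[x][a]))
    return gamma * c1[x][a] + c2[x][a] + BETA * future


def solve_lr(gamma, iters=2000):
    """Value iteration for the unconstrained MDP LR(gamma)."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [min(q_value(gamma, v, x, a) for a in ACTIONS) for x in STATES]
    # Greedy policy with respect to the (numerically) converged values.
    policy = [min(ACTIONS, key=lambda a: q_value(gamma, v, x, a)) for x in STATES]
    return v, policy


def g(gamma, V):
    """Dual function: optimal LR(gamma) value from X0 minus gamma * V."""
    v, _ = solve_lr(gamma)
    return v[X0] - gamma * V
```

In this toy instance, a small multiplier lets the solver favor the action that is cheap in c₂, while a large multiplier pushes it toward the action with zero c₁ cost, which is the trade-off the multiplier γ is meant to control.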

Proposition 2.4.1 The following hold for all γ ∈ ℝ:

1. g(γ) is concave in γ.

2. For any σ_γ ∈ O_γ, V − C₁(σ_γ) ∈ ∂(−g)(γ), where ∂f is the subdifferential (set of all subgradients) of the function f .

3. If γ < γ̂, σ_γ ∈ O_γ, and σ_γ̂ ∈ O_γ̂, then C₁(σ_γ) ≥ C₁(σ_γ̂).

Proof. For each fixed σ ∈ Π^S, the map γ ↦ γC₁(σ) + C₂(σ) is affine in γ, and a pointwise minimum of affine functions is concave. Subtracting the linear term γV preserves concavity, so the first result holds. To show the second result, we need to show that for any γ ∈ ℝ and σ_γ ∈ O_γ,

$$
-g(\gamma_0) \ge -g(\gamma) + (V - C_1(\sigma_\gamma))(\gamma_0 - \gamma) \quad \forall \gamma_0 \in \mathbb{R}.
$$

Fix γ ∈ ℝ. We have, for any γ₀ ∈ ℝ,

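$$
\begin{aligned}
g(\gamma_0) - g(\gamma)
&= \min_{\sigma \in \Pi^S} \{ \gamma_0 C_1(\sigma) + C_2(\sigma) \} - \min_{\sigma \in \Pi^S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - V(\gamma_0 - \gamma) \\
&= \min_{\sigma \in \Pi^S} \{ \gamma_0 C_1(\sigma) + C_2(\sigma) \} - \gamma C_1(\sigma_\gamma) - C_2(\sigma_\gamma) - V(\gamma_0 - \gamma) \\
&\le \gamma_0 C_1(\sigma_\gamma) + C_2(\sigma_\gamma) - \gamma C_1(\sigma_\gamma) - C_2(\sigma_\gamma) - V(\gamma_0 - \gamma) \\
&= -(V - C_1(\sigma_\gamma))(\gamma_0 - \gamma).
\end{aligned}
$$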

Hence

$$
-g(\gamma_0) \ge -g(\gamma) + (V - C_1(\sigma_\gamma))(\gamma_0 - \gamma),
$$

as desired.

For the remaining result, fix γ ∈ ℝ and let δ > 0. Let ν ∈ O_γ and ν̂ ∈ O_{γ+δ}. This implies

$$
\gamma C_1(\nu) + C_2(\nu) \le \gamma C_1(\hat\nu) + C_2(\hat\nu),
\qquad
(\gamma + \delta) C_1(\hat\nu) + C_2(\hat\nu) \le (\gamma + \delta) C_1(\nu) + C_2(\nu).
$$

Using the fact that A ≤ B and C ≤ D implies C − B ≤ D − A yields

$$
\delta (C_1(\hat\nu) - C_1(\nu)) \le 0,
$$

so that, since δ > 0, C₁(ν̂) ≤ C₁(ν).
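The three statements of Proposition 2.4.1 can be checked numerically on a toy instance. The Python sketch below is only an illustration under stated assumptions: each stationary policy is represented by a made-up pair (C₁(σ), C₂(σ)), the constraint level V is hypothetical, and the helper names (lagrangian, g, optimal_set, check_proposition) are not from the text.

```python
# Hypothetical (C1, C2) pairs for four stationary policies; numbers are made up.
POLICIES = {"a": (0.0, 10.0), "b": (1.0, 6.0), "c": (3.0, 3.0), "d": (6.0, 2.0)}
V = 2.0  # hypothetical constraint level


def lagrangian(gamma, name):
    c1, c2 = POLICIES[name]
    return gamma * c1 + c2


def g(gamma):
    """Dual function: min over policies of gamma*C1 + C2, minus gamma*V."""
    return min(lagrangian(gamma, s) for s in POLICIES) - gamma * V


def optimal_set(gamma, tol=1e-9):
    """Names of all policies attaining the minimum in LR(gamma)."""
    best = min(lagrangian(gamma, s) for s in POLICIES)
    return [s for s in POLICIES if lagrangian(gamma, s) <= best + tol]


def check_proposition(grid, tol=1e-9):
    """Check Statements 1-3 on every pair of multipliers in the grid."""
    for gam in grid:
        for gam0 in grid:
            # Statement 1: midpoint concavity of g.
            if g(0.5 * (gam + gam0)) < 0.5 * (g(gam) + g(gam0)) - tol:
                return False
            for s in optimal_set(gam):
                c1 = POLICIES[s][0]
                # Statement 2: V - C1(sigma_gamma) is a subgradient of -g at gamma.
                if -g(gam0) < -g(gam) + (V - c1) * (gam0 - gam) - tol:
                    return False
                # Statement 3: C1 does not increase as the multiplier grows.
                if gam < gam0 and any(c1 < POLICIES[t][0] - tol
                                      for t in optimal_set(gam0)):
                    return False
    return True


GRID = [0.25 * k for k in range(-8, 33)]  # multipliers from -2 to 8
print(check_proposition(GRID))
```

In this toy table the multiplier γ = 1.5 has two optimal policies whose C₁ values (1 and 3) lie on either side of V = 2, so the subdifferential of −g at that point contains 0.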

Suppose we find γ^∗ ≥ 0 such that there exists an optimal policy σ^∗ ∈ O_{γ^∗} for LR(γ^∗) satisfying the constraint at equality: C₁(σ^∗) = V. This implies 0 ∈ ∂(−g)(γ^∗) by way of the second statement of Proposition 2.4.1. Since g is concave, this implies that γ^∗ attains the supremum of g(γ) over γ ≥ 0. Observe that

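$$
\begin{aligned}
C_V &= \sup_{\gamma \ge 0} \Big\{ \min_{\sigma \in \Pi^S} \{ \gamma C_1(\sigma) + C_2(\sigma) \} - \gamma V \Big\} \\
&= \min_{\sigma \in \Pi^S} \{ \gamma^* C_1(\sigma) + C_2(\sigma) \} - \gamma^* V \\
&= C_2(\sigma^*) + \gamma^* (C_1(\sigma^*) - V) \\
&= \sup_{\gamma \ge 0} \{ C_2(\sigma^*) + \gamma (C_1(\sigma^*) - V) \},
\end{aligned}
$$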

where the last equality follows since C₁(σ^∗) − V = 0. From Statement 2 of Theorem 2.3.1, σ^∗ is optimal for B(V). This implies sufficient optimality conditions, summarized in the following proposition.

Proposition 2.4.2 (Sufficient optimality conditions) Suppose that (σ^∗, γ^∗) ∈ Π^S × ℝ₊ satisfies

σ^∗ ∈ O_{γ^∗}, (2.4)

C₁(σ^∗) = V. (2.5)

Then the policy σ^∗ is optimal for B(V).

Given these optimality conditions, we have converted the problem of directly finding a constrained-optimal stationary policy to that of finding the optimal policy for the appropriate unconstrained MDP. In fact, the results in Propositions 2.4.1 and 2.4.2, combined with algorithms such as subgradient descent for convex optimization problems, yield algorithms capable of solving a CMDP by instead solving a sequence of unconstrained MDPs. Of course, such algorithms are of little practical use: one could just solve the constrained MDP directly via linear programming (possibly with some truncation methods if the state space is countably infinite), which is much faster. However, this perspective allows us to look at structural results for unconstrained MDPs (which are well understood) and see how they may be extended to the constrained problem. More precisely, it turns the problem of finding structured constrained-optimal policies into that of finding structured policies that are optimal for the unconstrained problem with the correct Lagrange multiplier. The rest of the section is dedicated to making this process more exact.
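To make the idea of solving a CMDP through a sequence of unconstrained problems concrete, here is a minimal Python sketch of projected subgradient ascent on g (equivalently, subgradient descent on −g). It is an illustration under assumptions, not a method prescribed by the text: solve_lr is a stand-in oracle that looks up made-up (C₁, C₂) pairs where a real implementation would solve LR(γ) with an MDP solver, V is a hypothetical constraint level, and the step size 1/k is one common choice among many.

```python
# Hypothetical (C1, C2) pairs for four stationary policies; numbers are made up.
POLICIES = {"a": (0.0, 10.0), "b": (1.0, 6.0), "c": (3.0, 3.0), "d": (6.0, 2.0)}
V = 2.0  # hypothetical constraint level


def solve_lr(gamma):
    """Stand-in for an unconstrained MDP solver: (C1, C2) of one minimiser."""
    return min(POLICIES.values(), key=lambda c: gamma * c[0] + c[1])


def dual_ascent(num_iters=5000, gamma=0.0):
    """Projected subgradient ascent on g over gamma >= 0."""
    for k in range(1, num_iters + 1):
        c1, _ = solve_lr(gamma)  # one unconstrained solve per iteration
        # C1(sigma_gamma) - V is a supergradient of g at gamma (Statement 2);
        # step along it and project back onto gamma >= 0.
        gamma = max(0.0, gamma + (c1 - V) / k)
    return gamma


print(dual_ascent())
```

On this toy table the iterates settle near γ = 1.5, where two optimal policies have class 1 costs on either side of V. The multiplier alone does not yield a binding policy there, which is why the sets of optimal policies are examined more closely in what follows.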

Doing so involves answering some important questions:

1. Do we need the existence of a single optimal policy with the desired structural properties or do we need to show something stronger?

2. For what values of γ do these results need to hold? Clearly, the results must hold for γ^∗ attaining the supremum in the Lagrangian dual, but what if it is not obvious what γ^∗ is?

3. Assuming we are able to show structural results for the correct unconstrained MDP, how do the results extend? Is the same structure maintained or does it change slightly?

To answer these questions, we consider the range of class 1 costs obtainable by policies optimal for the Lagrangian relaxation LR(γ):

C₁(γ) := {C₁(σ) : σ ∈ O_γ}.

Note that, for every γ ≥ 0, C₁(γ) is an interval of costs (where the singleton interval is a possibility). To see this, note that for any two policies σ, σ′ ∈ O_γ with corresponding occupation measures φ, φ′, we can create a randomized policy σ_p for p ∈ [0, 1] corresponding to the occupation measure φ_p = pφ + (1 − p)φ′ that has the same LR(γ) objective value as σ and σ′, but whose class 1 cost is a convex combination of C₁(σ) and C₁(σ′):

C₁(σ_p) = pC₁(σ) + (1 − p)C₁(σ′).

Hence, if c₁, c′₁ ∈ C₁(γ) for c₁ < c′₁, then [c₁, c′₁] ⊆ C₁(γ).
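The mixing argument rests on costs being linear in the occupation measure. The short Python sketch below illustrates this with made-up numbers; it assumes each cost can be written as C_i(σ) = Σ φ(x, a) c_i(x, a), and the measures phi and phi_prime, the cost tables, the multiplier GAMMA and the helper names are all hypothetical and not derived from a specific MDP.

```python
# Hypothetical occupation measures over (state, action) pairs; numbers are made up.
phi = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 0.0}
phi_prime = {(0, 0): 0.0, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.75}
c1 = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # class 1 cost
c2 = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 3.0, (1, 1): 1.0}  # class 2 cost
GAMMA = 1.0  # hypothetical multiplier at which both measures tie in LR(GAMMA)


def cost(occ, c):
    """Cost of a policy, assumed linear in its occupation measure."""
    return sum(occ[xa] * c[xa] for xa in occ)


def mix(occ_a, occ_b, p):
    """Occupation measure p*occ_a + (1 - p)*occ_b."""
    return {xa: p * occ_a[xa] + (1 - p) * occ_b[xa] for xa in occ_a}


def lagrangian(occ):
    return GAMMA * cost(occ, c1) + cost(occ, c2)


phi_p = mix(phi, phi_prime, 0.25)
# The Lagrangian value is unchanged by mixing; the class 1 cost interpolates.
print(lagrangian(phi), lagrangian(phi_prime), lagrangian(phi_p))
print(cost(phi, c1), cost(phi_prime, c1), cost(phi_p, c1))
```

Sweeping p from 0 to 1 traces out every class 1 cost between the two endpoints without changing the Lagrangian value, which is the interval property stated above.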

Now suppose that we have found the optimal multiplier γ^∗ described in Proposition 2.4.2, and thus, the correct Lagrangian relaxation problem for which to prove structural results. We find ourselves in one of two cases. In the first case, we have that |C₁(γ^∗)| = 1. This means that every policy in the argmin, O_{γ^∗}, has the same class 1 cost. Hence, if we are able to show the existence of an optimal policy with the desired structural properties for this Lagrangian relaxation, then we have shown the exact same structural results hold for the constrained problem.

We suspect that these types of problems are uncommon: neither of the applications we consider falls into this case, and indeed we have not found an example (other than the trivial example of a CMDP with a singleton action space in every state). However, this case is still considered for completeness.

On the other hand, if |C₁(γ^∗)| ≠ 1, then C₁(γ^∗) is uncountably infinite (since it is a nondegenerate interval), and thus not every policy in O_{γ^∗} need be constrained-optimal, since not all of these policies are binding. This is indeed the case in both of the applications we study. Thus, stronger structural properties must be shown in order to extend the results to the constrained case. We seek to develop a general procedure that allows us to extend structural properties for the unconstrained problem to the constrained problem in this case. In doing so, we define the notion of “extreme” policies in a given structured class, Π^Str:

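$$
P_1^{Str} := \operatorname*{argmin}_{\sigma \in \Pi^{Str}} C_1(\sigma), \qquad
P_2^{Str} := \operatorname*{argmax}_{\sigma \in \Pi^{Str}} C_1(\sigma).
$$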

It should be noted that the structured class of policies, Π^Str, needs to be chosen carefully so that Π^Str ⊆ O_{γ^∗}. In this case, if V ∈ [C₁(P₁^Str), C₁(P₂^Str)], then one of these structured policies, say σ^∗ ∈ Π^Str, satisfies the constraint at equality, and is constrained-optimal by way of Proposition 2.4.2.

One method for constructing a class of policies with these properties is to pick the extreme policies, P₁^Str and P₂^Str, first, and then construct a sequence of policies (σ_n)_{n=0}^∞ ⊆ O_{γ^∗} conforming to some desired structural property (e.g. threshold policies) such that σ₁ = P₁^Str and σ_n → P₂^Str as n → ∞. Intuitively, since we start with a policy whose class 1 cost is below the constraint, and converge to a policy whose class 1 cost exceeds the constraint, we should find two policies along the sequence that “straddle” the constraint: one policy has class 1 cost below the constraint, the other above. We should then be able to find a binding policy by randomizing between these two policies, thus producing a constrained-optimal policy. Doing this requires cost continuity with respect to the mode of convergence in which σ_n → P₂^Str. We find that it is most intuitive to consider pointwise convergence of policies: for every x ∈ X, lim_{n→∞} (σ_n)_x(A) = (P₂^Str)_x(A) for every A ⊆ A(x). In this context, cost continuity means that σ_n → P₂^Str pointwise implies that lim_{n→∞} C₁(σ_n) = C₁(P₂^Str). This allows for the use of the intermediate value theorem to find an optimal policy for B(V).
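The "straddle and randomize" step can be sketched in a few lines of Python. This is a simplified illustration under assumptions: it works on a finite prefix of the sequence, it takes the class 1 costs C₁(σ_n) as already computed (the list below is made up), the function name binding_mixture is hypothetical, and the weight p is understood as a mixture of occupation measures, as in the interval argument for C₁(γ).

```python
def binding_mixture(c1_values, V):
    """Find adjacent policies whose class 1 costs straddle V.

    Returns (i, j, p) such that p*c1_values[i] + (1 - p)*c1_values[j] equals V,
    or None if no adjacent pair in the list straddles V.
    """
    for i in range(len(c1_values) - 1):
        first, second = c1_values[i], c1_values[i + 1]
        if min(first, second) <= V <= max(first, second):
            if first == second:
                return i, i + 1, 1.0  # both policies are already binding
            # Solve p*first + (1 - p)*second = V for the weight p on policy i.
            p = (second - V) / (second - first)
            return i, i + 1, p
    return None


# Hypothetical class 1 costs along a sequence of structured policies.
costs = [1.0, 1.5, 2.5, 3.0]
print(binding_mixture(costs, 2.0))
```

If every policy in the list lies in O_{γ^∗}, the resulting mixture satisfies both (2.4) and (2.5) and is therefore constrained-optimal by Proposition 2.4.2.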

In the sections that follow, we introduce specific problems in which |C₁(γ^∗)| ≠ 1, and show how to find a constrained-optimal policy within a particular structured class.
