B Proof of Theorem 5 - Robust Markov Decision Processes

The proof of Theorem 5.3 relies on Theorems 2.1, 2.2 and 5.1 in [5], which establish asymptotic properties of maximum likelihood estimators of ordinary MCs. To keep the paper self-contained, we summarise these results in Theorem B.1.

Theorem B.1 Consider a finite MC with state set $\mathcal{X} = \{1, \ldots, X\}$ and transition probabilities $p_{xy}(\theta)$, $x, y \in \mathcal{X}$, that depend on an unknown parameter vector $\theta$ ranging over an open set $\Theta \subseteq \mathbb{R}^U$. Assume that the following conditions are satisfied:

(C1) Each function $p_{xy}$ has continuous partial derivatives of third order throughout $\Theta$.

(C2) The set-valued mapping $D(\theta) := \{(x, y) \in \mathcal{X} \times \mathcal{X} : p_{xy}(\theta) > 0\}$ is constant, that is, there is a set $D \subseteq \mathcal{X} \times \mathcal{X}$ such that $D(\theta) = D$ for all $\theta \in \Theta$.

(C3) The Jacobian matrix of the transition kernel $(p_{xy}(\theta))_{x,y}$ has rank $U$ throughout $\Theta$.

(C4) For each $\theta \in \Theta$, the MC is irreducible.

Let $(x_1, \ldots, x_m)$ denote an observation of the MC under its true transition kernel $p_{xy}(\theta^0)$, where $\theta^0 \in \Theta$, and let $m_{xy}$ denote the number of observations of transition $(x, y) \in \mathcal{X} \times \mathcal{X}$. For the sequence of functions $f_m(\theta) := \sum_{(x,y) \in D} m_{xy} \log [p_{xy}(\theta)]$, $\Theta$ contains a sequence of random vectors $\theta^m$ that satisfy

$$2 \left[ f_m(\theta^m) - f_m(\theta^0) \right] \xrightarrow[m \to \infty]{} \chi^2_U, \tag{44a}$$

$$m^{1/2} \left[ \theta^m - \theta^0 \right] \xrightarrow[m \to \infty]{} \mathcal{N}(0, \Gamma). \tag{44b}$$

Here, $\mathcal{N}(0, \Gamma)$ is a multivariate normal distribution with zero mean and finite covariance matrix $\Gamma \succ 0$. Moreover, $\theta^m$ is a strict local maximiser of $f_m$ with probability going to one as $m$ tends to infinity.
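The limit (44a) can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes a hypothetical two-state chain whose kernel is affine in a scalar parameter $\theta \in (0, 1)$ (so $U = 1$), for which conditions (C1)–(C4) hold and the maximiser of $f_m$ is available in closed form. The names `kernel`, `simulate_counts`, `lr_statistic` and `run_experiment` are illustrative only.

```python
import math
import random


def kernel(theta):
    """Hypothetical two-state kernel, affine in the scalar parameter theta in (0, 1)."""
    return [[theta, 1.0 - theta], [0.5, 0.5]]


def simulate_counts(theta0, m, rng):
    """Simulate m transitions starting in state 0 and return the counts m_xy."""
    p = kernel(theta0)
    counts = [[0, 0], [0, 0]]
    x = 0
    for _ in range(m):
        y = 0 if rng.random() < p[x][0] else 1
        counts[x][y] += 1
        x = y
    return counts


def f(theta, counts):
    """f_m(theta): sum of m_xy * log p_xy(theta) over the observed transitions."""
    p = kernel(theta)
    return sum(counts[x][y] * math.log(p[x][y])
               for x in range(2) for y in range(2) if counts[x][y] > 0)


def lr_statistic(theta0, counts):
    """2 * [f_m(theta_m) - f_m(theta0)], using the closed-form maximiser theta_m."""
    # Only row 0 depends on theta, so theta_m is the empirical frequency of 0 -> 0.
    theta_m = counts[0][0] / (counts[0][0] + counts[0][1])
    return 2.0 * (f(theta_m, counts) - f(theta0, counts))


def run_experiment(theta0=0.3, m=500, reps=2000, seed=0):
    """Return one likelihood-ratio statistic per simulated observation history."""
    rng = random.Random(seed)
    return [lr_statistic(theta0, simulate_counts(theta0, m, rng))
            for _ in range(reps)]
```

If (44a) applies, the sample mean of `run_experiment()` should be close to 1, the mean of a $\chi^2_1$ distribution. The statistic is non-negative by construction because $\theta^m$ maximises $f_m$.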

In order to apply Theorem B.1 to MDPs, we interpret the state-action sequence (30) as an observation history of an ordinary MC. Theorem 5.3 then follows from (44a). To simplify the exposition, we prove Theorem 5.3 first under assumption (A3’) on page 33. At the end of this section, we extend our proof to hold under the weaker assumption (A3).

We interpret the state-action sequence (30) as an observation of $n$ states of an MC with state set

$$\mathcal{X} := \left\{ (s, a) \in \mathcal{S} \times \mathcal{A} : \pi^0(a|s) > 0 \right\}. \tag{45a}$$

The MC is in state $(s, a) \in \mathcal{X}$ whenever the underlying MDP is in state $s$ and the decision maker chooses action $a$. Note that we omit state-action pairs $(s, a) \in \mathcal{S} \times \mathcal{A}$ with $\pi^0(a|s) = 0$. As we will see, this is a necessary (but not sufficient) condition for the MC to be irreducible, see condition (C4) of Theorem B.1. By construction, the MC starts in state $(s, a) \in \mathcal{X}$ with probability $p_0(s)\, \pi^0(a|s)$, and it moves from state $(s, a) \in \mathcal{X}$ to state $(s', a') \in \mathcal{X}$ with probability $p^{\xi^0}(s'|s, a)\, \pi^0(a'|s')$, where $\xi^0$ is the unknown true parameter of the underlying MDP. Since the historical policy $\pi^0$ is stationary, the MC indeed satisfies the Markov property.
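The construction of the MC on state-action pairs can be sketched in a few lines. The example below is illustrative only: the function `build_chain`, the nested-list data layout and the numerical values of `p` and `pi0` are assumptions made for this sketch, not part of the original text.

```python
def build_chain(p, pi0):
    """Return the state list X and the transition matrix P of the induced MC.

    p[s][a][s2] : MDP transition probability of moving to s2 from s under action a
    pi0[s][a]   : probability that the historical policy chooses a in state s
    """
    # Keep only the state-action pairs that the policy can actually generate.
    X = [(s, a) for s in range(len(pi0))
         for a in range(len(pi0[s])) if pi0[s][a] > 0]
    # Probability of moving from (s, a) to (s2, a2) is p(s2 | s, a) * pi0(a2 | s2).
    P = [[p[s][a][s2] * pi0[s2][a2] for (s2, a2) in X] for (s, a) in X]
    return X, P


# Hypothetical MDP with two states and two actions.
p = [[[0.2, 0.8], [0.5, 0.5]],
     [[0.6, 0.4], [1.0, 0.0]]]
pi0 = [[1.0, 0.0],   # action 1 is never chosen in state 0
       [0.5, 0.5]]
X, P = build_chain(p, pi0)
```

Each row of `P` sums to one because the policy probabilities of the retained actions sum to one in every state, so dropping the pairs with zero policy probability loses no probability mass.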

We can establish the following relationship between the MC and the MDP.

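$$\Theta := \operatorname{int} \Xi^0 \tag{45b}$$

and

$$p_{xy}(\theta) := p^\theta(s'|s, a)\, \pi^0(a'|s') \quad \text{for } \theta \in \Theta \text{ and } x = (s, a),\ y = (s', a') \in \mathcal{X}. \tag{45c}$$

By assumption (A1), we have $\xi^0 \in \operatorname{int} \Xi^0$. Hence, $\Theta$ indeed contains the unknown true parameter vector $\theta^0 := \xi^0$ of the MC as required by Theorem B.1.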

We now show that the MC defined through (45) satisfies the conditions (C1)–(C4) of Theorem B.1.

Lemma B.2 If the MDP satisfies assumptions (A2) and (A3’), then the MC defined through (45) satisfies the conditions (C1)–(C4) of Theorem B.1.

Proof Condition (C1) is satisfied since $p_{xy}$ is affine in $\theta$ for all $x, y \in \mathcal{X}$, see definitions (45c) and (3).

As for condition (C2), the definitions (45a) and (45c) imply that

$$D(\theta) = \left\{ (x, y) \in \mathcal{X} \times \mathcal{X} : p^\theta(s'|s, a) > 0 \text{ for } x = (s, a) \text{ and } y = (s', a') \right\}.$$

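We recall that $p^\theta(s'|s, a) = [k_{sa} + K_{sa}\theta]_{s'}$. We claim that for any $\theta \in \Theta$, the set $D(\theta)$ equals

$$D := \left\{ (x, y) \in \mathcal{X} \times \mathcal{X} : [k_{sa}\ K_{sa}]^\top_{s'\cdot} \neq 0 \text{ for } x = (s, a) \text{ and } y = (s', a') \right\}.$$

By construction, $D(\theta) \subseteq D$ for all $\theta \in \Theta$. It remains to show that $D \subseteq D(\theta)$ for all $\theta \in \Theta$. Assume to the contrary that $[k_{sa}\ K_{sa}]^\top_{s'\cdot} \neq 0$ but $p^\theta(s'|s, a) = 0$ for $x = (s, a)$, $y = (s', a') \in \mathcal{X}$ and $\theta \in \Theta$. Since $\Theta$ is an open set, there is a neighbourhood of $\theta$ that is contained in $\Theta$, and all points $\theta'$ in this neighbourhood have to satisfy $p^{\theta'}(s'|s, a) \geq 0$. Since $p^\theta(s'|s, a) = 0$, the affine function $\theta' \mapsto p^{\theta'}(s'|s, a)$ attains its minimum over this neighbourhood at $\theta$. This implies that $[K_{sa}]^\top_{s'\cdot} = 0$, and hence $[k_{sa}]_{s'} = 0$ as well. This contradicts our assumption that $[k_{sa}\ K_{sa}]^\top_{s'\cdot} \neq 0$. We therefore conclude that $p^\theta(s'|s, a) > 0$ for all $\theta \in \Theta$, that is, $D \subseteq D(\theta)$ for all $\theta \in \Theta$.

We now consider condition (C3). The Jacobian $J(\theta) \in \mathbb{R}^{X^2 \times U}$ of the MC’s transition kernel is defined through $J_{xy,u} := \partial p_{xy}(\theta) / \partial \theta_u$ for $x, y \in \mathcal{X}$ and $u = 1, \ldots, U$. For $x = (s, a)$, $y = (s', a') \in \mathcal{X}$, we have $\partial p_{xy}(\theta) / \partial \theta_u = \pi^0(a'|s')\, [K_{sa}]_{s'u}$. Thus, assumption (A3’) ensures that $J(\theta)$ has rank $U$.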

As for condition (C4), recall that irreducibility of an MC is a property of the set of transitions with strictly positive probability; the actual probabilities are irrelevant. The proof of condition (C2) implies that for all state pairs $(x, y) \in \mathcal{X} \times \mathcal{X}$, either $p_{xy}(\theta) > 0$ for all $\theta \in \Theta$ or $p_{xy}(\theta) = 0$ for all $\theta \in \Theta$. Hence, the set of transitions with strictly positive probability does not depend on $\theta$, and the MC defined through (45) is irreducible for all $\theta \in \Theta$ if and only if it is irreducible for some $\theta \in \Theta$. Condition (C4) therefore follows from assumption (A2).
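The observation that irreducibility depends only on the support of the transition kernel can be made concrete with a reachability check. The sketch below is illustrative: `is_irreducible` and the example matrices are assumptions introduced here, and the function expects a non-empty square matrix given as nested lists.

```python
def is_irreducible(P):
    """Check irreducibility of a finite MC using only the support of P.

    The chain is irreducible if and only if every state can be reached from
    state 0 and state 0 can be reached from every state.
    """
    n = len(P)

    def reachable(forward):
        seen = {0}
        stack = [0]
        while stack:
            x = stack.pop()
            for y in range(n):
                # Follow edges x -> y (forward) or y -> x (backward).
                positive = P[x][y] > 0 if forward else P[y][x] > 0
                if positive and y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    return len(reachable(True)) == n and len(reachable(False)) == n


# Two kernels with the same support but different probabilities.
P_a = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
P_b = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.3, 0.0, 0.7]]
# State 0 is absorbing, so this chain is reducible.
P_c = [[1.0, 0.0], [0.5, 0.5]]
```

Since the function only tests whether entries are positive, any two kernels with the same support, such as `P_a` and `P_b`, necessarily receive the same answer.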

We can now apply Theorem B.1 to the MC defined through (45). This allows us to prove Theorem 5.3 under the stronger assumption (A3’).

Proof of Theorem 5.3 Under assumption (A3’) the assumptions of Lemma B.2 are satisfied, and we can apply Theorem B.1 to the MC defined through (45). Hence, we know that $\Theta$ contains a sequence $\theta^n$ that satisfies (44a), and each $\theta^n$ constitutes a strict local maximiser of $f_n$ with probability going to one as $n$ tends to infinity. By definition (45c) of $p$, every function $f_n$ is concave, which implies that $\theta^n$ is indeed the unique global maximiser of $f_n$ with probability going to one as $n$ tends to infinity.
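The concavity claim can be spelled out as follows. This is a sketch that uses only definition (45c) and the affine form $p^\theta(s'|s, a) = [k_{sa} + K_{sa}\theta]_{s'}$; it additionally assumes that $\Theta$ is convex, so that line segments between points of $\Theta$ remain in $\Theta$.

```latex
\begin{aligned}
f_n(\theta) &= \sum_{(x,y) \in D} m_{xy}
  \log\left[ \pi^0(a'|s') \, [k_{sa} + K_{sa}\theta]_{s'} \right],
  \qquad m_{xy} \geq 0, \\
f_n\bigl(\lambda \theta + (1 - \lambda) \theta'\bigr)
  &\geq \lambda f_n(\theta) + (1 - \lambda) f_n(\theta')
  \qquad \text{for all } \theta, \theta' \in \Theta,\ \lambda \in [0, 1].
\end{aligned}
```

Each summand is the logarithm of a positive affine function of $\theta$ and hence concave, and a non-negative weighted sum of concave functions is concave. If some $\theta' \neq \theta^n$ satisfied $f_n(\theta') \geq f_n(\theta^n)$, the inequality above would give $f_n \geq f_n(\theta^n)$ at points of the segment arbitrarily close to $\theta^n$, which contradicts $\theta^n$ being a strict local maximiser.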

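Let $m_{xy}$ denote the number of observations of transition $(x, y) \in \mathcal{X} \times \mathcal{X}$ in (30). We additionally set $m_{xy} := 0$ for $(x, y) \in (\mathcal{S} \times \mathcal{A})^2 \setminus (\mathcal{X} \times \mathcal{X})$. For any $\theta \in \Theta$, we have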

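$$
\begin{aligned}
\ell_n(\theta) &= \sum_{(s,a,s') \in N} n_{sas'} \log\left[ p^\theta(s'|s, a) \right] + \zeta \\
&= \sum_{\substack{x = (s,a) \in \mathcal{X},\ y = (s',a') \in \mathcal{X}: \\ m_{xy} > 0}} m_{xy} \log\left[ p^\theta(s'|s, a) \right] + \zeta \\
&= \sum_{\substack{x, y \in \mathcal{X}: \\ m_{xy} > 0}} m_{xy} \log\left[ p_{xy}(\theta) \right] + \psi \\
&= \sum_{(x,y) \in D} m_{xy} \log\left[ p_{xy}(\theta) \right] + \psi \\
&= f_n(\theta) + \psi,
\end{aligned} \tag{46}
$$

where $\psi := \log [p_0(s_1)] + \log [\pi^0(a_1|s_1)]$. The first equality follows from the definition of $\ell_n$ in (33’).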

The second equality holds because $n_{sas'} = \sum_{a' \in \mathcal{A}} m_{(s,a),(s',a')}$ and $m_{(s,a),(s',a')} = 0$ if $\pi^0(a|s) = 0$ or $\pi^0(a'|s') = 0$. The third equality follows from the definition (45c) of $p$ and our choice of $\psi$. As for the fourth equality, note that all $x, y \in \mathcal{X}$ with $m_{xy} > 0$ satisfy $p_{xy}(\theta^0) > 0$ for $\theta^0 = \xi^0$. Lemma B.2 therefore ensures that $(x, y) \in D(\theta^0) = D$. The last equality follows from the definition of $f_n$ in Theorem B.1.
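The counting identity behind the second equality, and the way the policy terms separate from the parameter-dependent terms in the third, can be checked on a small example. The sketch below is illustrative only: the history format (a list of state-action pairs), the helper names and the numerical values are assumptions, and the constant $\zeta$ from (33’) is not reproduced here.

```python
import math
from collections import Counter


def transition_counts(history):
    """Return the MDP counts n[(s, a, s2)] and the MC counts m[((s, a), (s2, a2))]."""
    n = Counter()
    m = Counter()
    for (s, a), (s2, a2) in zip(history, history[1:]):
        n[(s, a, s2)] += 1
        m[((s, a), (s2, a2))] += 1
    return n, m


def mdp_part(n, p):
    """Sum of n_{sas'} * log p(s'|s, a): the parameter-dependent part."""
    return sum(c * math.log(p[s][a][s2]) for (s, a, s2), c in n.items())


def policy_part(m, pi0):
    """Sum of m_xy * log pi0(a'|s'): does not depend on the parameter."""
    return sum(c * math.log(pi0[s2][a2]) for (_, (s2, a2)), c in m.items())


def chain_part(m, p, pi0):
    """Sum of m_xy * log p_xy, with p_xy = p(s'|s, a) * pi0(a'|s')."""
    return sum(c * math.log(p[s][a][s2] * pi0[s2][a2])
               for ((s, a), (s2, a2)), c in m.items())


# Hypothetical MDP kernel, policy and observation history.
p = [[[0.2, 0.8], [0.5, 0.5]],
     [[0.6, 0.4], [1.0, 0.0]]]
pi0 = [[1.0, 0.0], [0.5, 0.5]]
history = [(0, 0), (1, 1), (0, 0), (0, 0), (1, 0), (1, 0), (0, 0)]
n, m = transition_counts(history)
```

In this example `chain_part(m, p, pi0)` equals `mdp_part(n, p) + policy_part(m, pi0)` up to rounding, so the two log-likelihoods differ only by a term that is constant in the parameter and therefore share their maximisers.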

From (46) and the fact that $\theta^0 = \xi^0$ we conclude that $\ell_n(\xi^0) = f_n(\theta^0) + \psi$. Moreover, (46) implies that $\theta^n$ defined in Theorem B.1 represents the unique global maximiser of $\ell_n$ with probability going to one as $n$ tends to infinity. The assertion of Theorem 5.3 now follows from (44a).

Remark B.3 Throughout this section, we replaced assumption (A3) with the stronger assumption (A3’) from page 33. Under assumption (A3), the Jacobian of the MC’s transition kernel may violate condition (C3) of Theorem B.1. We circumvent this problem by decomposing the affine mapping $p$ in (45c) into the composition of a linear surjection, followed by an affine injection. If we replace $\Theta$ with the image of $\operatorname{int} \Xi^0$ under the surjection and $p$ with the injection, all conditions of Theorem B.1 remain satisfied.
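One way to make this decomposition concrete is a rank factorisation. The following is only a sketch: the stacked matrix $K$ and vector $k$, the rank $r$, the factors $B$ and $C$, the row count $q$ and the reduced parameter $\eta$ are notation introduced here for illustration and do not appear in the original argument.

```latex
% K stacks the matrices K_{sa} and k stacks the vectors k_{sa} over (s,a) \in \mathcal{X}.
\begin{aligned}
K &= B\,C, \qquad B \in \mathbb{R}^{q \times r} \text{ with full column rank}, \quad
  C \in \mathbb{R}^{r \times U} \text{ with full row rank}, \quad r = \operatorname{rank} K, \\
\theta &\longmapsto \eta := C\theta \quad \text{(linear surjection onto } \mathbb{R}^r\text{)}, \qquad
  \eta \longmapsto k + B\eta \quad \text{(affine injection)}, \\
k + K\theta &= k + B\,(C\theta) \qquad \text{for all } \theta \in \operatorname{int} \Xi^0 .
\end{aligned}
```

Under these assumptions, a surjective linear map sends the open set $\operatorname{int} \Xi^0$ to an open subset of $\mathbb{R}^r$, which can serve as the new parameter set. The derivative of the injection with respect to $\eta$ is $B$, which has full column rank $r$, so the rank condition is met in the reduced parameter space of dimension $r$.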
