A model with hybrid state and action spaces and with observable continuous

Figure 8.2: Illustrative example

The goal of the next paragraph is to generalize the Bellman equation for MDPs to the case of hybrid state and action spaces with observable time. As the proofs of section 8.3 will show, introducing an observable time is the main difficulty on the way to proving that the adapted Bellman equation for this broader class of problems is still valid. The results presented in sections 8.2 and 8.3 were first introduced in [Rachelson et al., 2008a].

8.2 A model with hybrid state and action spaces and with observable

continuous time

8.2.1 Model definition

In order to illustrate the following definitions on a simple example, we propose the game presented in figure 8.2. In this game, the goal is to bring the ball from the start box to the finish box. Unfortunately, the problem depends on a continuous time variable because the boxes’ floors retract at known dates and because actions durations are uncertain and real-valued. At each decision epoch, the player has five possible actions: he can either push the ball in one of the four directions or he can wait for a certain duration in order to reach a better configuration. Finally the “push” actions are uncertain and the ball can end up in the wrong box. This problem has a hybrid state space composed of discrete variables - the ball’s position - and continuous ones - the current date1_{. It also has four non-parametric actions -}

the “push” actions - and one parametric action - the “wait” action. We are therefore trying to find a policy on a stochastic process with continuous and discrete variables and parametric actions (with real valued parameters). Keeping this example in mind, we introduce the notion of parametric MDP:

Definition(XMDP). A parametric action MDP, or XMDP, is a tuple hS, A(X), p, ri where: S is a Borel state space which can describe continuous or discrete state variables including

the process’ time.

A is an action space describing a finite set of actions ai(x) where x is a vector of param-

eters taking its values in X. Therefore, the action space of our problem is a hybrid action space, factored by the different actions an agent can undertake. In practice, each action only depends on a subset of variables fromX.

1_{This illustrative example can actually be written as a TMDP. We draw from the TMDP experience to}

Chapter 8. Generalization: the XMDP model

p is a probability density transition function p(s0_{|s, a(x)).}

r is a reward function r(s, a(x)).

8.2.2 Emphasizing the place of time

As in the MDP case, we consider the set T of timed decision epochs. In order to deal with the more general case, we will consider a real-valued time variable t and — as previously — will write the state (s, t) in order to emphasize the specificity of this variable in the discounted case.

So it is important to note that — since we consider t as a state variable — we can split the random variable describing the current state into a vector with at least two components. To clarify this, let us reuse a previous SMDP+ notation and call σ any state in the state space. The first component of the σ vector is the time corresponding to the current state tσ.

The second component is the classical state value sσ, aggregating any other state variable

values which might be defined for the problem at hand.

This leads us to correct our notations: we call S the set of values taken by the vector of state variables but time. Hence, our state space will be written as the Borel algebra on the S × R topological space and we will use pairs (s, t) to describe one state. To simplify the notations, we will refer to this state space as S × R.

Note that for discrete variables, the p() function of the XMDP is a discrete probability distribution function and that writing integrals over p() is equivalent to writing a sum over the discrete variables.

Concerning the time variable, if it is discrete and bounded, optimizing a policy for the XMDP corresponds to optimizing a finite horizon criterion. In the case of continuous observable time, section 8.4 will show that the TMDP and SMDP+ are subproblems of the XMDP framework. In order to summarize:

XMDPs are a generalization of MDPs where hybrid action spaces are represented through the abstraction of parametric actions. Moreover, XMDPs include the process’ time as a state variable, thus suppressing the distinction between stationary and non-stationary policies for MDPs.

Lastly, as previously, we will write δ the number of the current decision epoch, and, consequently, tδ the time at which decision epoch δ occurs.

8.2.3 Reward model

One can comment that writing the reward model as r((s, t), a(x)) might not span the full genericity of the models we wish to represent. For instance, we could wish to write some r((s, t), a(x), (s0_{, t}0_{)) reward model in order to specify the reward of each transition.}

In the following developments, we will nevertheless restrict ourselves to the r((s, t), a(x)) expected reward model for simplicity. Even though we did not write the proofs for the general case of r((s, t), a(x), (s0_{, t}0_{)), making a parallel with the standard MDP case — where}

8.2. A model with hybrid state and action spaces and with observable continuous time one can similarly establish equations for r(s, a) or r(s, a, s0_{) — seems possible.}

One interesting option for decomposing the reward model is the one inspired by SMDPs. A transition reward r(s, a) in an SMDP is given as a lump sum reward k(s, a) and a set of reward rates c(j, s, a) where the j states correspond to all the intermediate states encoun- tered during the transition and before the next decision epoch (for details on SMDPs reward models and underlying stochastic processes, one can refer to [Puterman, 1994], page 533). The reward model r(s, a) then corresponds to the lump sum reward plus the discounted integral over possible durations of the reward rates in each state j visited by the underlying process.

The interesting point in such a reward model is that it separates the lump sum reward from the reward rates. For an XMDP, such a model would imply an important property for instantaneous transitions: for such transitions, the reward acquired is only the lump sum reward.

The TMDP type of reward model is then the extension to (s, a, s0_{) transitions of such}

an SMDP reward model with lump sum reward in t, duration reward through a reward rate and lump sum reward in t0_.

Even though these modelling options might be appealing because of their practical im- plications, they remain a special case of an r((s, t), a(x)) (or r((s, t), a(x), (s0_{, t}0_{))) reward}

model and we won’t adopt them for further reasoning.

8.2.4 Policies and criterion

We define a deterministic Markovian decision rule at decision epoch δ as the mapping from states to actions:

dδ:

S × R → A(X)

s, t 7→ a(x) (8.1)

dδ(s, t) specifies the parametric action to undertake in state (s, t) at decision epoch δ. A

policy is defined as a set of decision rules (one for each δ) and we consider, as in [Puterman, 1994], the set D of stationary (with respect to δ) markovian deterministic policies.

Definition (Stationary, deterministic, Markovian policy). Such a control policy π is given by a single decision rule, applicable at each decision epoch. Hence we identify this decision rule and the policy:

π :

S × R → A(X)

s, t 7→ a(x) (8.2)

For simplicity and without justification as to the relevance of this approach, we choose to search for policies in D. Analysis of stochastic or non-Markovian policies for instance is beyond the scope of this chapter’s results.

In order to evaluate policies in D for our problem, we need to define a criterion. Given the strong similarity of XMDPs and SMDPs, the discounted criterion we define follows the example of [Howard, 1963] and integrates the expected reward over all possible transition durations. We introduce the discounted criterion for XMDPs as the expected sum of the

Chapter 8. Generalization: the XMDP model

successive discounted rewards, with respect to the application of policy π starting in state (s, t): V_γπ(s, t) = E_(sπ₀_=s,t₀_=t) ( _∞ X δ=0 γtδ−t0_r_π_(s_δ_{, t}_δ₎ ) (8.3) In order to make sure this series has a finite limit, our model introduces three more hypothesis:

• |r ((s, t), a(x)) | is bounded by M,

• ∀δ ∈ T, tδ+1− tδ≥ α > 0, where α is the smallest possible duration of an action ,

• γ < 1.

The discount factor γt _{insures the convergence of the series. Physically, it can be seen as a}

probability of still being functional after time t. With these hypothesis, one can write: Lemma 1. ∀(s, t) ∈ S × R:

|V_γπ(s, t)| < _{1 − γ}M _α (8.4) Proof. Since the transition durations are lower-bounded by α, one can write, for all π ∈ D:

tδ ≤ t0+ δα

γtδ _{≤ γ}t0+δα

γtδ−t0 _{≤ γ}δα

And since |r| is bounded by M we have: ∀π ∈ D, −γδα_{M ≤ γ}tδ−t0_r π(sδ, tδ) ≤ γδαM And finally: ∀π ∈ D, −X∞ δ=0 γδα_{M ≤}X∞ δ=0 γtδ−t_r π(sδ, tδ) ≤ ∞ X δ=0 γδα_M

Which finally provides the result of the lemma.

We will further restrict the first assumption by saying that we only consider positive reward models, thus we write: 0 ≤ r ((s, t), a(x)) ≤ M. In the following discussion, we will highlight why and where this assumption is necessary.

The rather strong assumption of a lower bound on the duration of transitions will be important for the proof in section 8.3.2. We will see in section 8.3.3 how we can try to lift this hypothesis (and why it is not trivial).

However, it seems important to avoid a — somehow intuitive — misconception: this assumption is a constraint on the transition model and not on the policy search space. The mentioned misconception would be to believe that the constraint t0_{− t ≥ α corresponds to}

forbidding any wait(τ) with τ < α action. While this seems intuitively the case, it is not what is implied by this assumption: any wait(τ) action is possible (with respect to this lower bound only — we will see a little further that we might have to restrict the A(X) space as well for other reasons) but their result will always yield a transition with t0_{− t ≥ α.}

8.2. A model with hybrid state and action spaces and with observable continuous time We will admit that the set V of value functions (functions from S × R to R) is a complete metrizable space for the supremum norm kV k∞= sup

(s,t)∈S×RV (s, t).

The optimal value function for Markovian, deterministic policies, given the discounted criterion is defined as V∗ _{= sup}

π∈DV π γ.

An optimal policy in D, if it exists, is then defined as a policy π∗ _{which verifies V}π∗

γ =

sup

π∈DV π

γ . Guaranteeing the existence of such a policy whose value actually reaches the sup

requires three hypothesis which will be developed in section 8.3.4:

• The r and p models are upper semi-continuous with respect to the parameters x. • The reward model is positive.

• A(X) is a compact action subset of a topological space describing a finite set of actions ai(x) where x is a vector of parameters taking its values in X. Since we consider a

finite set of ai high-level actions, this hypothesis can be rephrased for Ai(X): Ai(X)

is homeomorphic to a compact subset of a topological space describing the set of parameters admissible for the finite set of abstract actions ai.

These three hypothesis guarantee that the value function will be upper semi-continuous and that there exists a vector x of finite parameters that reaches the sup of the value function. Such a proof was immediate in the classical MDP model because the action space was countable, the upper semi-continuity and compacity arguments allow the extension to the parametric action case.

We avoided including these additional hypothesis directly in the previous model definition because one can prove the existence of an optimal value function without it. However, this hypothesis is a sufficient condition to guarantee the existence of a policy whose value is equal to the optimal value function. So, we will specifically mention when we make these assumptions in the following proofs.

From here on we will omit the γ index on V .

8.2.5 Summarizing the XMDP’s hypothesis

Based on the previous short discussion, we can summarize the assumptions upon which the XMDP framework is built. An XMDP problem is given by the S × R and A(X) sets, the p and r functions, the γ value and the set D of policies, following the assumptions:

Assumption 1 (State space structure). The state space is the Borel algebra defined on a topological space S × R. In order to simplify the notations, we will write (s, t) ∈ S × R the elements of the state space.

Assumption 2 (Action space structure). A(x) is the union of a finite collection of sets Ai(X), where each Ai(X) is homeomorphic to a compact subset of a topological space X.

Each element ofA(X) is noted a_i(x) where a_i refers to the set A_i(X) to which the element originally belongs andx is the corresponding element in A_i(X). It represents the action set. Note that since the compacity assumption is not always necessary for all of the following proofs, we will specifically mention when we use or discard it.

Chapter 8. Generalization: the XMDP model

Assumption 3 (Transition model vs. actions). p is a Markovian probability density function p((s0, t0)|(s, t), a(x)) which is upper semi-continuous with respect to x. It describes the transition model, from (s, t) to (s0, t0), given action a(x).

Assumption 4 (Reward model vs. actions). r is a real-valued function r((s, t), a(x)) which is upper semi-continuous with respect tox. It is the reward model of undertaking action a(x) in state(s, t).

Assumption 5 (Bounded rewards). The reward model is bounded and positive: ∃M ∈ R+_{/∀(s, t, a(x)) ∈ S × R × A(X), 0 ≤ r((s, t), a(x)) ≤ M}

Assumption 6 (Time advance). ∀ (a_n(x_n))_n∈N∈ A(X)N, ∀δ ∈ N, t

δ+1− tδ≥ α > 0

On this basis, we look for a way of characterizing the optimal policy and value function. Namely, we wish to characterize the value of policies (section 8.3.1), to prove the existence of an optimality equation for V∗ _{(section 8.3.2) and to relate this optimal value function to}

an optimal policy (section 8.3.4).

In document Temporal Markov Decision Problems : Formalization and Resolution (Page 135-140)