Figure 8.2: Illustrative example
The goal of the next paragraph is to generalize the Bellman equation for MDPs to the case of hybrid state and action spaces with observable time. As the proofs of section 8.3 will show, introducing an observable time is the main difficulty on the way to proving that the adapted Bellman equation for this broader class of problems is still valid. The results presented in sections 8.2 and 8.3 were first introduced in [Rachelson et al., 2008a].
8.2
A model with hybrid state and action spaces and with observable
continuous time
8.2.1 Model definition
In order to illustrate the following definitions on a simple example, we propose the game presented in figure 8.2. In this game, the goal is to bring the ball from the start box to the finish box. Unfortunately, the problem depends on a continuous time variable because the boxes’ floors retract at known dates and because actions durations are uncertain and real-valued. At each decision epoch, the player has five possible actions: he can either push the ball in one of the four directions or he can wait for a certain duration in order to reach a better configuration. Finally the “push” actions are uncertain and the ball can end up in the wrong box. This problem has a hybrid state space composed of discrete variables - the ball’s position - and continuous ones - the current date1. It also has four non-parametric actions -
the “push” actions - and one parametric action - the “wait” action. We are therefore trying to find a policy on a stochastic process with continuous and discrete variables and parametric actions (with real valued parameters). Keeping this example in mind, we introduce the notion of parametric MDP:
Definition(XMDP). A parametric action MDP, or XMDP, is a tuple hS, A(X), p, ri where: S is a Borel state space which can describe continuous or discrete state variables including
the process’ time.
A is an action space describing a finite set of actions ai(x) where x is a vector of param-
eters taking its values in X. Therefore, the action space of our problem is a hybrid action space, factored by the different actions an agent can undertake. In practice, each action only depends on a subset of variables fromX.
1This illustrative example can actually be written as a TMDP. We draw from the TMDP experience to
Chapter 8. Generalization: the XMDP model
p is a probability density transition function p(s0|s, a(x)).
r is a reward function r(s, a(x)).
8.2.2 Emphasizing the place of time
As in the MDP case, we consider the set T of timed decision epochs. In order to deal with the more general case, we will consider a real-valued time variable t and — as previously — will write the state (s, t) in order to emphasize the specificity of this variable in the discounted case.
So it is important to note that — since we consider t as a state variable — we can split the random variable describing the current state into a vector with at least two components. To clarify this, let us reuse a previous SMDP+ notation and call σ any state in the state space. The first component of the σ vector is the time corresponding to the current state tσ.
The second component is the classical state value sσ, aggregating any other state variable
values which might be defined for the problem at hand.
This leads us to correct our notations: we call S the set of values taken by the vector of state variables but time. Hence, our state space will be written as the Borel algebra on the S × R topological space and we will use pairs (s, t) to describe one state. To simplify the notations, we will refer to this state space as S × R.
Note that for discrete variables, the p() function of the XMDP is a discrete probability distribution function and that writing integrals over p() is equivalent to writing a sum over the discrete variables.
Concerning the time variable, if it is discrete and bounded, optimizing a policy for the XMDP corresponds to optimizing a finite horizon criterion. In the case of continuous ob- servable time, section 8.4 will show that the TMDP and SMDP+ are subproblems of the XMDP framework. In order to summarize:
XMDPs are a generalization of MDPs where hybrid action spaces are represented through the abstraction of parametric actions. Moreover, XMDPs include the pro- cess’ time as a state variable, thus suppressing the distinction between stationary and non-stationary policies for MDPs.
Lastly, as previously, we will write δ the number of the current decision epoch, and, consequently, tδ the time at which decision epoch δ occurs.
8.2.3 Reward model
One can comment that writing the reward model as r((s, t), a(x)) might not span the full genericity of the models we wish to represent. For instance, we could wish to write some r((s, t), a(x), (s0, t0)) reward model in order to specify the reward of each transition.
In the following developments, we will nevertheless restrict ourselves to the r((s, t), a(x)) expected reward model for simplicity. Even though we did not write the proofs for the gen- eral case of r((s, t), a(x), (s0, t0)), making a parallel with the standard MDP case — where
8.2. A model with hybrid state and action spaces and with observable continuous time one can similarly establish equations for r(s, a) or r(s, a, s0) — seems possible.
One interesting option for decomposing the reward model is the one inspired by SMDPs. A transition reward r(s, a) in an SMDP is given as a lump sum reward k(s, a) and a set of reward rates c(j, s, a) where the j states correspond to all the intermediate states encoun- tered during the transition and before the next decision epoch (for details on SMDPs reward models and underlying stochastic processes, one can refer to [Puterman, 1994], page 533). The reward model r(s, a) then corresponds to the lump sum reward plus the discounted integral over possible durations of the reward rates in each state j visited by the underlying process.
The interesting point in such a reward model is that it separates the lump sum reward from the reward rates. For an XMDP, such a model would imply an important property for instantaneous transitions: for such transitions, the reward acquired is only the lump sum reward.
The TMDP type of reward model is then the extension to (s, a, s0) transitions of such
an SMDP reward model with lump sum reward in t, duration reward through a reward rate and lump sum reward in t0.
Even though these modelling options might be appealing because of their practical im- plications, they remain a special case of an r((s, t), a(x)) (or r((s, t), a(x), (s0, t0))) reward
model and we won’t adopt them for further reasoning.
8.2.4 Policies and criterion
We define a deterministic Markovian decision rule at decision epoch δ as the mapping from states to actions:
dδ:
S × R → A(X)
s, t 7→ a(x) (8.1)
dδ(s, t) specifies the parametric action to undertake in state (s, t) at decision epoch δ. A
policy is defined as a set of decision rules (one for each δ) and we consider, as in [Puterman, 1994], the set D of stationary (with respect to δ) markovian deterministic policies.
Definition (Stationary, deterministic, Markovian policy). Such a control policy π is given by a single decision rule, applicable at each decision epoch. Hence we identify this decision rule and the policy:
π :
S × R → A(X)
s, t 7→ a(x) (8.2)
For simplicity and without justification as to the relevance of this approach, we choose to search for policies in D. Analysis of stochastic or non-Markovian policies for instance is beyond the scope of this chapter’s results.
In order to evaluate policies in D for our problem, we need to define a criterion. Given the strong similarity of XMDPs and SMDPs, the discounted criterion we define follows the example of [Howard, 1963] and integrates the expected reward over all possible transition durations. We introduce the discounted criterion for XMDPs as the expected sum of the
Chapter 8. Generalization: the XMDP model
successive discounted rewards, with respect to the application of policy π starting in state (s, t): Vγπ(s, t) = E(sπ0=s,t0=t) ( ∞ X δ=0 γtδ−t0rπ(sδ, tδ) ) (8.3) In order to make sure this series has a finite limit, our model introduces three more hypoth- esis:
• |r ((s, t), a(x)) | is bounded by M,
• ∀δ ∈ T, tδ+1− tδ≥ α > 0, where α is the smallest possible duration of an action ,
• γ < 1.
The discount factor γt insures the convergence of the series. Physically, it can be seen as a
probability of still being functional after time t. With these hypothesis, one can write: Lemma 1. ∀(s, t) ∈ S × R:
|Vγπ(s, t)| < 1 − γM α (8.4) Proof. Since the transition durations are lower-bounded by α, one can write, for all π ∈ D:
tδ ≤ t0+ δα
γtδ ≤ γt0+δα
γtδ−t0 ≤ γδα
And since |r| is bounded by M we have: ∀π ∈ D, −γδαM ≤ γtδ−t0r π(sδ, tδ) ≤ γδαM And finally: ∀π ∈ D, −X∞ δ=0 γδαM ≤X∞ δ=0 γtδ−tr π(sδ, tδ) ≤ ∞ X δ=0 γδαM
Which finally provides the result of the lemma.
We will further restrict the first assumption by saying that we only consider positive reward models, thus we write: 0 ≤ r ((s, t), a(x)) ≤ M. In the following discussion, we will highlight why and where this assumption is necessary.
The rather strong assumption of a lower bound on the duration of transitions will be important for the proof in section 8.3.2. We will see in section 8.3.3 how we can try to lift this hypothesis (and why it is not trivial).
However, it seems important to avoid a — somehow intuitive — misconception: this assumption is a constraint on the transition model and not on the policy search space. The mentioned misconception would be to believe that the constraint t0− t ≥ α corresponds to
forbidding any wait(τ) with τ < α action. While this seems intuitively the case, it is not what is implied by this assumption: any wait(τ) action is possible (with respect to this lower bound only — we will see a little further that we might have to restrict the A(X) space as well for other reasons) but their result will always yield a transition with t0− t ≥ α.
8.2. A model with hybrid state and action spaces and with observable continuous time We will admit that the set V of value functions (functions from S × R to R) is a complete metrizable space for the supremum norm kV k∞= sup
(s,t)∈S×RV (s, t).
The optimal value function for Markovian, deterministic policies, given the discounted criterion is defined as V∗ = sup
π∈DV π γ.
An optimal policy in D, if it exists, is then defined as a policy π∗ which verifies Vπ∗
γ =
sup
π∈DV π
γ . Guaranteeing the existence of such a policy whose value actually reaches the sup
requires three hypothesis which will be developed in section 8.3.4:
• The r and p models are upper semi-continuous with respect to the parameters x. • The reward model is positive.
• A(X) is a compact action subset of a topological space describing a finite set of actions ai(x) where x is a vector of parameters taking its values in X. Since we consider a
finite set of ai high-level actions, this hypothesis can be rephrased for Ai(X): Ai(X)
is homeomorphic to a compact subset of a topological space describing the set of parameters admissible for the finite set of abstract actions ai.
These three hypothesis guarantee that the value function will be upper semi-continuous and that there exists a vector x of finite parameters that reaches the sup of the value func- tion. Such a proof was immediate in the classical MDP model because the action space was countable, the upper semi-continuity and compacity arguments allow the extension to the parametric action case.
We avoided including these additional hypothesis directly in the previous model defini- tion because one can prove the existence of an optimal value function without it. However, this hypothesis is a sufficient condition to guarantee the existence of a policy whose value is equal to the optimal value function. So, we will specifically mention when we make these assumptions in the following proofs.
From here on we will omit the γ index on V .
8.2.5 Summarizing the XMDP’s hypothesis
Based on the previous short discussion, we can summarize the assumptions upon which the XMDP framework is built. An XMDP problem is given by the S × R and A(X) sets, the p and r functions, the γ value and the set D of policies, following the assumptions:
Assumption 1 (State space structure). The state space is the Borel algebra defined on a topological space S × R. In order to simplify the notations, we will write (s, t) ∈ S × R the elements of the state space.
Assumption 2 (Action space structure). A(x) is the union of a finite collection of sets Ai(X), where each Ai(X) is homeomorphic to a compact subset of a topological space X.
Each element ofA(X) is noted ai(x) where ai refers to the set Ai(X) to which the element originally belongs andx is the corresponding element in Ai(X). It represents the action set. Note that since the compacity assumption is not always necessary for all of the following proofs, we will specifically mention when we use or discard it.
Chapter 8. Generalization: the XMDP model
Assumption 3 (Transition model vs. actions). p is a Markovian probability density func- tion p((s0, t0)|(s, t), a(x)) which is upper semi-continuous with respect to x. It describes the transition model, from (s, t) to (s0, t0), given action a(x).
Assumption 4 (Reward model vs. actions). r is a real-valued function r((s, t), a(x)) which is upper semi-continuous with respect tox. It is the reward model of undertaking action a(x) in state(s, t).
Assumption 5 (Bounded rewards). The reward model is bounded and positive: ∃M ∈ R+/∀(s, t, a(x)) ∈ S × R × A(X), 0 ≤ r((s, t), a(x)) ≤ M
Assumption 6 (Time advance). ∀ (an(xn))n∈N∈ A(X)N, ∀δ ∈ N, t
δ+1− tδ≥ α > 0
On this basis, we look for a way of characterizing the optimal policy and value function. Namely, we wish to characterize the value of policies (section 8.3.1), to prove the existence of an optimality equation for V∗ (section 8.3.2) and to relate this optimal value function to
an optimal policy (section 8.3.4).