Markov chain theory is an extremely broad field in mathematics. In this chapter, we only discussed the preliminaries that are essential for the remainder of the thesis. More details about CTMCs and DTMCs can be found in the textbooks [KS76, Kul95]. More details about MDPs can be found in [Bel57, How71, Ber95] and in the textbook [Put94].
Compared to the other models presented in this chapter, CTMDPs have received less attention. As do the seminal papers of Miller [Mil68b, Mil68a], most of the results that are known for CTMDPs concentrate on optimizing reward-based measures such as the finite horizon expected state-based reward, the infinite horizon discounted state-based reward or the long run expected average reward. Details about the results that are known in mathematics can be found in [Put94] and in the survey paper [GHLPR06].
Lately, CTMDPs have also been considered in the field of game theory, where the model has become known as a continuous-time stochastic 1½-player game. However, the results mostly concentrate on time-abstract schedulers [BFK+09]. The same holds for the results in [BHKH05], which are closely related to those of this thesis:
In [BHKH05], the authors provide an algorithm to optimize time-bounded reachability probabilities for time-abstract schedulers on a subclass of CTMDPs. This thesis extends these approaches in several respects. Most notably, we lift the restriction to certain subclasses of CTMDPs and consider strictly better time-dependent schedulers.
These contributions are described in detail in the following chapters.
³ We may assume F_{Paths^ω} to be complete, see Def. 2.4.
Nothing is more difficult, and therefore more precious, than to be able to decide.
(Napoléon Bonaparte)
Schedulers in CTMDPs and other variants of randomly timed games can roughly be classified as to whether they use timing information or not. In the literature, the analysis of CTMDPs is mostly focused on determining optimal schedulers for criteria such as the expected total reward, the expected long-run average reward (cf. the survey [GHLPR06]) and unbounded reachability probabilities [Put94]. For such comparatively simple criteria, time-abstract schedulers suffice. Stated differently, providing the scheduler with information on the amount of time that has passed does not improve its decisions for such properties. When analyzing such criteria, it therefore suffices to either fully abstract from the timing information in the CTMDP or to abstract from it at least partly by transforming the CTMDP into an equivalent discrete-time MDP. The latter process is commonly referred to as uniformization [Put94, p. 562], [GHLPR06].
In comparison to the properties stated above, the focus of this thesis is mostly on time-bounded reachability objectives such as the maximum probability to hit a given set of goal states during a finite time interval. As we will see in this chapter, the maximum achievable probability of such events strongly depends on whether the underlying scheduler class uses timing information or not.
In the previous chapter, we have introduced the class of generic measurable schedulers. It is complete in a sense, as the corresponding GM-schedulers may use the complete information about the trajectory that led into the current state. For example, a GM-scheduler can access the state history and the sojourn time in each individual state of the history.
In this chapter, we investigate schedulers more closely and define a hierarchy of positional and history-dependent schedulers which refines the notion of measurable schedulers from Sec. 3.3.2. As it turns out, an important distinguishing criterion is the level of detail of timing information the schedulers may exploit, e.g. the delay in the last state, the total time that was spent during the trajectory that led into the current state, or all individual state residence times.
In general, the delay that has to pass in a state s before the CTMDP jumps to a successor state s′ is determined by the action that is selected by the scheduler when entering state s. In the second part of this chapter, we therefore investigate under which conditions this resolution of nondeterminism may be deferred. More precisely, we identify the subclass of locally uniform CTMDPs and show how their schedulers can delay their decision up to the point at which the current state s is left.
Rather than focusing on a specific objective, we consider this delayed nondeterminism for arbitrary measurable events. The core of our study is a transformation — called local uniformization — on CTMDPs which unifies the speed of outgoing transitions per state.
Whereas classical uniformization [Gra91, GM84, Jen53] adds self-loops to achieve this, local uniformization uses auxiliary copy-states. In this way, we enforce that schedulers in the original and uniformized CTMDP have (for important scheduler classes) the same power, whereas classical loop-based uniformization permits a scheduler to change its decision when re-entering a state through the added self-loop.
Therefore, locally uniform CTMDPs allow the resolution of nondeterminism to be deferred, i.e., they dissolve the intrinsic dependency between state residence times and schedulers, and can be viewed as MDPs with exponentially distributed state residence times. This characterization provides the basis for Chapter 5, where we develop an approximation algorithm which computes time-bounded reachability probabilities in locally uniform CTMDPs.
Organization of this chapter. Section 4.1 proposes a hierarchy of scheduler classes and refines the notion of generic measurable schedulers from Sec. 3.3.2. In Sec. 4.2, we define local uniformization and prove its correctness. Section 4.3 summarizes the main results and Sec. 4.4 proves that deferring nondeterministic choices induces strictly tighter bounds on quantitative properties.
4.1 A hierarchy of scheduler classes
In Sec. 3.3.2, we have defined the probability of measurable sets of paths with respect to GM-schedulers. However, this does not fully describe a CTMDP, as a single scheduler represents only one way to resolve the CTMDP’s nondeterministic choices. Therefore, instead of a single scheduler, we consider scheduler classes that group schedulers according to the information that they use for making a decision:
Given an event Π ∈ F_{Paths^ω}, a scheduler class induces a set of probabilities (one for each scheduler in the respective class) which reflects the CTMDP’s possible behaviors.
In this chapter, we propose a variety of scheduler classes (see the lattice depicted in Fig. 4.1) and investigate which of them preserve the minimum and maximum probabilities under local uniformization.
We start our discussion by recalling the notion of GM-schedulers: As proved in [WJ06], they are the most general class definable on arbitrary CTMDPs. More precisely, the authors prove that all probability measures that conform to a CTMDP’s set of valid paths are induced by some GM-scheduler. The intuition is as follows: If paths π1 and π2 end in state s, a GM-scheduler D ∶ Paths⋆ × F_Act → [0, 1] may yield different distributions D(π1, ⋅) and D(π2, ⋅) over the next action, depending on the entire histories π1 and π2.
[Figure 4.1: A hierarchy of scheduler classes. Lattice nodes: GM = THR, TTHR, TTPR, TPR, TAPR, TAHOPR, TAHR.]
Note that π1 and π2 contain the state sequence that was traversed, the sojourn time in each of those states and the action that was chosen to move from one state to another. Hence, we also refer to GM-schedulers as time- and history-dependent randomized schedulers.
In contrast, a scheduler D is time-abstract and positional (a TAPR-scheduler) if D(π1, ⋅) = D(π2, ⋅) for all π1, π2 ∈ Paths⋆ that end in the same state. As D(π, ⋅) only depends on the current state, it can be specified as a mapping D ∶ S → Distr(Act).
Example 4.1. For the TAPR-scheduler D with D(s0) = {α ↦ 1} and D(s1) = {β ↦ 1}, the induced stochastic process of the CTMDP in Fig. 4.2(a) is the CTMC depicted in Fig. 4.2(b).
Note, however, that in general, randomized schedulers do not yield CTMCs, as the induced sojourn times are hyper-exponentially distributed. Hence, a continuous-time Markov decision process with an associated randomized scheduler is a slight misnomer, as a hyper-exponentially distributed sojourn time does not obey the Markov property, in general. However, this can safely be ignored, as we will see in the next chapters that considering deterministic schedulers (which obviously induce exponentially distributed sojourn times) suffices to optimize time-bounded reachability properties. ♢
For TAHOPR-schedulers, the decision may depend on the current state s and the length of π1 and π2 (hop-counting schedulers); accordingly, they are isomorphic to mappings D ∶ S × N → Distr(Act). Moreover, D is a time-abstract history-dependent scheduler (TAHR) if D(π1, ⋅) = D(π2, ⋅) for all histories π1, π2 ∈ Paths⋆ with abs(π1) = abs(π2):
Given a history π, TAHR-schedulers may decide based on the sequence of states and actions in abs(π). In [BHKH05], the authors show that TAHOPR- and TAHR-schedulers induce the same probability bounds for timed reachability, which are tighter than the bounds induced by the class of TAPR-schedulers.
[Figure 4.2: An example of a CTMDP and its induced CTMC (under a TAPD-scheduler).]
Time-dependent scheduler classes generally induce probability bounds that exceed those of the corresponding time-abstract classes [BHKH05]. As they are the main focus of this thesis, we discuss them in greater detail here:
If we move from state s_{i−1} to state s_i, a timed positional scheduler (TPR) yields a distribution over Act(s_i) which depends on the current state s_i and the time it took to go from state s_{i−1} to state s_i; thus, the class of TPR-schedulers extends TAPR-schedulers with information on the delay of the last transition.
Similarly, total time history-dependent schedulers (TTHR) extend TAHR-schedulers with information on the time that passed up to the current state: If D ∈ TTHR and π1, π2 ∈ Paths⋆ are histories with abs(π1) = abs(π2) and ∆(π1) = ∆(π2), then D(π1, ⋅) = D(π2, ⋅); moreover, TTHR ⊆ GM. Intuitively, a TTHR-scheduler may depend on the accumulated time (that is, on ∆(π)), but not on sojourn times in individual states of the history. Hence, for general events, the probability bounds of TTHR-schedulers are less strict than those of GM-schedulers. However, this does not hold for time-bounded reachability probabilities. To optimize them, an even simpler class of time-dependent schedulers suffices:
For the properties that we investigate in this thesis, the class of total time positional schedulers (TTPR) is of great importance: A TTPR-scheduler is given as a mapping D ∶ S × R≥0 → Distr(Act). Intuitively, it expects the current state in its first argument; the second argument is the total amount of time that has passed before the current state was entered. Hence, TTPR-schedulers are similar to TTHR-schedulers but abstract from the state-history: For two histories π1 and π2, D(π1, ⋅) = D(π2, ⋅) if π1 and π2 end in the same state and if the total amount of time that was spent on π1 and π2 is the same, that is, if ∆(π1) = ∆(π2).
TTPR-schedulers are of particular interest, as they induce optimal probability bounds with respect to time- and interval-bounded reachability objectives. To see this, consider the probability to reach a set of goal states G ⊆ S within t time units. If state s is reached via π ∈ Paths⋆ (without visiting G), the maximal probability to enter G is given by a scheduler which maximizes the probability to reach G from state s within the remaining t − ∆(π) time units. Obviously, a TTPR-scheduler is sufficient in this case. In Chapter 5, we will come back to this issue (cf. Thm. 5.2 on page 124) and formally prove this claim for a slightly different class of schedulers. However, the proof carries over to TTPR-schedulers trivially.
A further remark is in order here: In [BHKH05] it is proved that TAHOPD-schedulers (i.e. deterministic TAHOPR-schedulers) suffice for optimizing time-bounded reachability objectives under all time-abstract schedulers. This is similar to the continuous-time case, where for time-dependent schedulers it is sufficient to measure the total amount of time that has passed. In particular, information about the state- or action-history (as it is provided by TAHR- and TTHR-schedulers) is proved to be unnecessary.
Example 4.2. Reconsider the CTMDP depicted in Fig. 4.2(a) and assume that we aim at maximizing the probability to move from state s0 to state s3 within a given time bound z ∈ R≥0. Obviously, an optimal TTPR-scheduler has to choose action α in state s0: If it chose β, the CTMDP would move to state s4 and stay there forever. Thus, we may assume that state s1 is entered via action α after a sojourn in state s0 of duration t0 ∈ R≥0.
Being in state s1, a nondeterministic choice between actions α and β occurs: If α is chosen, state s1 is left with exit rate E(s1, α) = R(s1, α, s3) + R(s1, α, s4) = 3. However, the probability P(s1, α, s3) = R(s1, α, s3) / E(s1, α) to enter state s3 (instead of state s4) is only 1/3. If action β is chosen, the situation is different: Although the rate for leaving state s1 under action β is the same (i.e. E(s1, β) = R(s1, β, s2) = 3), we do not enter the goal state s3 directly. Instead, the transition from state s2 to state s3 with rate R(s2, β, s3) = 1 induces an additional delay. However, note that if action β is chosen in state s1, we reach state s3 with probability 1.
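The relation between rates, exit rates and branching probabilities used above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary encoding and the helper names exit_rate and branching_probability are hypothetical, and the individual rates R(s1, α, s3) = 1 and R(s1, α, s4) = 2 are not given explicitly in the text but are inferred from E(s1, α) = 3 and P(s1, α, s3) = 1/3.

```python
from fractions import Fraction

# Rates R(s, a, s') for states s1 and s2 of the example.
# R(s1, α, s3) and R(s1, α, s4) are inferred from E(s1, α) = 3 and
# P(s1, α, s3) = 1/3; the rates of state s0 are not stated and are omitted.
R = {
    ("s1", "α", "s3"): Fraction(1),
    ("s1", "α", "s4"): Fraction(2),
    ("s1", "β", "s2"): Fraction(3),
    ("s2", "β", "s3"): Fraction(1),
}


def exit_rate(state, action):
    """E(s, a): sum of the rates of all transitions leaving s under a."""
    return sum(rate for (s, a, _), rate in R.items()
               if s == state and a == action)


def branching_probability(state, action, successor):
    """P(s, a, s') = R(s, a, s') / E(s, a)."""
    return R.get((state, action, successor), Fraction(0)) / exit_rate(state, action)


print(exit_rate("s1", "α"), branching_probability("s1", "α", "s3"))
print(exit_rate("s1", "β"), branching_probability("s1", "β", "s2"))
```

Exact fractions are used so that the value 1/3 is not blurred by floating-point rounding.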
Obviously, the optimal decision in state s1 depends on the time z − t0 that remains to reach s3 when t0 time units have already been spent in state s0. With this reasoning, we obtain an optimal TTPR-scheduler D as follows: Define D(s0, 0) = {α ↦ 1}, D(s1, t0) = {α ↦ 1} if t0 ≥ z − ln(5/8 + (1/8)√105), and D(s1, t0) = {β ↦ 1} otherwise.
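To make the shape of such a scheduler concrete, the following Python sketch encodes D as a function of the current state and the total elapsed time, i.e. as a mapping S × R≥0 → Distr(Act). The factory make_scheduler, the string encoding of states and actions, and the representation of distributions as dictionaries are assumptions for illustration only. The threshold is the value ln(5/8 + (1/8)√105) of this example, and only the states s0 and s1, for which the example specifies a decision, are covered.

```python
import math

# Switching threshold of Example 4.2: d = ln(5/8 + sqrt(105)/8).
THRESHOLD = math.log(5 / 8 + math.sqrt(105) / 8)


def make_scheduler(z):
    """Build a deterministic TTPR-scheduler for the time bound z.

    The returned function maps (state, elapsed) to a distribution over
    actions, where elapsed is the total time spent before entering state.
    """
    def scheduler(state, elapsed):
        if state == "s0":
            return {"α": 1.0}
        if state == "s1":
            remaining = z - elapsed
            # Little time left: take the direct but risky action α;
            # otherwise take the slower but certain route via β.
            return {"α": 1.0} if remaining <= THRESHOLD else {"β": 1.0}
        raise KeyError(f"no decision specified for state {state!r}")
    return scheduler


D = make_scheduler(z=2.0)
print(D("s0", 0.0))   # action α
print(D("s1", 0.5))   # 1.5 time units remain: action β
print(D("s1", 1.9))   # 0.1 time units remain: action α
```

Note that the scheduler never inspects the state history or individual sojourn times; the pair (state, elapsed) is all it needs.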
The derivation for D is as follows: The probability to move within the remaining x = z − t0 time units from state s1 to state s3 with action α is given by the function a(x) = (1/3)(1 − e^{−3x}). For action β, the corresponding function b(x) is given by the convolution to go to state s3 via state s2. Hence b(x) = ∫_0^x (3e^{−3t1} ∫_0^{x−t1} e^{−t2} dt2) dt1. Fig. 4.3 depicts the two cumulative distribution functions. Now, let d > 0 be the unique positive solution of the equation a(x) = b(x); then d = ln(5/8 + (1/8)√105). Obviously, if more than d time units remain, i.e. if z − t0 > d, the optimal decision in state s1 is action β. On the other hand, if z − t0 ≤ d, it is more profitable to choose action α.
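The threshold can be checked numerically. Evaluating the convolution gives the closed form b(x) = 1 − (3/2)e^{−x} + (1/2)e^{−3x}; substituting u = e^{−x} into a(x) = b(x) yields 5u³ − 9u + 4 = 0, whose root in (0, 1) is u = (√105 − 5)/10, so that d = −ln u = ln(5/8 + (1/8)√105). The Python sketch below confirms this with a Simpson rule and a bisection; the function names, the search bracket [0.1, 2] and the tolerances are choices made for this illustration, not part of the original derivation.

```python
import math


def a(x):
    """Probability to reach s3 within x time units under α: (1/3)(1 - e^{-3x})."""
    return (1.0 - math.exp(-3.0 * x)) / 3.0


def b_closed(x):
    """Closed form of the convolution: 1 - (3/2)e^{-x} + (1/2)e^{-3x}."""
    return 1.0 - 1.5 * math.exp(-x) + 0.5 * math.exp(-3.0 * x)


def b_numeric(x, n=2000):
    """Composite Simpson rule (n even) for the outer integral of b(x).

    The inner integral over t2 equals 1 - e^{-(x - t1)}.
    """
    if x == 0.0:
        return 0.0
    h = x / n
    f = lambda t: 3.0 * math.exp(-3.0 * t) * (1.0 - math.exp(-(x - t)))
    total = f(0.0) + f(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0


def crossing(lo=0.1, hi=2.0, tol=1e-12):
    """Bisection for the positive root of a(x) - b(x) inside [lo, hi]."""
    g = lambda x: a(x) - b_closed(x)
    assert g(lo) > 0 > g(hi)  # α is better at lo, β is better at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


d_closed = math.log(5 / 8 + math.sqrt(105) / 8)
print(d_closed, crossing())
print(b_closed(1.0), b_numeric(1.0))
```

Since x = 0 is also a (trivial) solution of a(x) = b(x), the bisection starts from a strictly positive lower bound.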
For now, we note that (a) time-abstract schedulers obviously do not suffice to obtain the maximum probability and (b) the scheduler D is deterministic, i.e. a TTPD-scheduler. ♢
With the preceding informal description of the scheduler classes that are mentioned in Fig. 4.1, we define them formally as follows:
[Figure 4.3: Reachability in z − t time units. Plot of the probability (axis "Prob", 0 to 0.8) against z − t0 (0 to 2) for actions α and β; the point ln(5/8 + (1/8)√105) is marked.]
Definition 4.1 (Scheduler classes). Let C be a CTMDP and D a GM-scheduler on C. If π and π′ range over Paths⋆(C), the scheduler classes are defined as follows:
D ∈ TAPR ⇐⇒ (π↓ = π′↓) ⇒ D(π) = D(π′)
D ∈ TAHOPR ⇐⇒ (π↓ = π′↓ ∧ ∣π∣ = ∣π′∣) ⇒ D(π) = D(π′)
D ∈ TAHR ⇐⇒ (abs(π) = abs(π′)) ⇒ D(π) = D(π′)
D ∈ TTHR ⇐⇒ (abs(π) = abs(π′) ∧ ∆(π) = ∆(π′)) ⇒ D(π) = D(π′)
D ∈ TTPR ⇐⇒ (π↓ = π′↓ ∧ ∆(π) = ∆(π′)) ⇒ D(π) = D(π′)
D ∈ TPR ⇐⇒ (π↓ = π′↓ ∧ δ(π, ∣π∣ − 1) = δ(π′, ∣π′∣ − 1)) ⇒ D(π) = D(π′).
Def. 4.1 justifies restricting the domain of the schedulers to the information that the respective class exploits. In this way, we obtain the characterization in Table 4.1.
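The restriction of the scheduler domain can be sketched as a set of projections on histories: two histories must receive the same decision from every scheduler of a class exactly when the class-specific projection maps them to the same value. The Python encoding below is a hypothetical illustration; the Path class, its field names and the convention that the empty history has last delay 0 are assumptions and not part of Def. 4.1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Path:
    """Finite timed history s0 -(a0, t0)-> s1 ... -> sn (hypothetical encoding)."""
    states: Tuple[str, ...]    # n + 1 visited states
    actions: Tuple[str, ...]   # n chosen actions
    delays: Tuple[float, ...]  # n sojourn times

    def last(self):            # corresponds to π↓
        return self.states[-1]

    def hops(self):            # corresponds to |π|
        return len(self.actions)

    def abstraction(self):     # corresponds to abs(π)
        return (self.states, self.actions)

    def total_time(self):      # corresponds to Δ(π)
        return sum(self.delays)

    def last_delay(self):      # corresponds to δ(π, |π| - 1); 0 if π is empty
        return self.delays[-1] if self.delays else 0.0


# Information that a scheduler of each class may base its decision on.
PROJECTION = {
    "TAPR":   lambda p: (p.last(),),
    "TAHOPR": lambda p: (p.last(), p.hops()),
    "TAHR":   lambda p: (p.abstraction(),),
    "TTHR":   lambda p: (p.abstraction(), p.total_time()),
    "TTPR":   lambda p: (p.last(), p.total_time()),
    "TPR":    lambda p: (p.last(), p.last_delay()),
    "GM":     lambda p: (p.states, p.actions, p.delays),
}


def indistinguishable(cls, p, q):
    """True if every scheduler of class cls must decide equally on p and q."""
    return PROJECTION[cls](p) == PROJECTION[cls](q)


# Same states and actions, same total time, different individual delays.
p = Path(("s0", "s1", "s2"), ("α", "β"), (0.25, 0.5))
q = Path(("s0", "s1", "s2"), ("α", "β"), (0.5, 0.25))
for name in PROJECTION:
    print(name, indistinguishable(name, p, q))
```

For the two sample histories, only the classes that see individual sojourn times (TPR via the last delay, and GM) can tell them apart.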
In the next section, we come to a transformation on CTMDPs that unifies the speed of outgoing transitions and thereby allows us to defer the resolution of nondeterministic choices: Intuitively, if the sojourn time in a state does not depend on the scheduler, the decision need not be taken when entering that state, but may be delayed up to the point when the state is left.