We now present methods for computing quantiles in which the accumulated reward is bounded from below. That is, we analyse quantile queries of the form
\[ \mathrm{qu}_s\bigl(Q\,\mathrm{P}_{\trianglerighteq p}(A \,\mathrm{U}^{\geq ?}\, B)\bigr) \tag{3.2} \]
with $A, B \subseteq S$, $\trianglerighteq \in \{>, \geq\}$, and $Q \in \{\exists, \forall\}$. Here, we calculate the maximal reward that can be accumulated before a certain event occurs with high probability $p$. This makes it possible to ask, e.g., for the maximal utility that an energy-critical system can provide when only a specific energy budget is available. For example, one could ask for the maximal number of videos that can be decoded on a mobile device using only a specific portion of the power provided by its battery.
Instead of calculating these values directly, however, we propose methods for the calculation of
\[ \mathrm{qu}_s\bigl(Q\,\mathrm{P}_{\trianglelefteq p}(A \,\mathrm{U}^{\geq ?}\, B)\bigr) \tag{3.3} \]
with $\trianglelefteq \in \{<, \leq\}$. Using the dualities presented in Section 3.2, the desired quantile value (3.2) can be derived from the value (3.3) by suitable transformations. So, instead of computing the maximal reward that is achievable with probability at least $p$, the idea is to compute the minimal accumulated reward at which the probability drops below $p$ for the first time. This allows the computation to rely on iteration-based methods similar to those already presented for upper-reward bounded quantiles. As was the case there, we need a mechanism guaranteeing that the demanded quantile really exists. Therefore, the computation again starts with a precomputation, followed by the actual calculation of the demanded quantile value.
For simplicity, only the treatment of reachability ($\Diamond^{\geq ?} B$) with a lower reward bound is sketched in the following considerations.
3.4.1 Precomputation
In the following, the set $C$ denotes all states $t \in S$ that are contained in some (maximal) end component $(T, A)$ with $\mathrm{rew}(t', \alpha) > 0$ for some state $t' \in T$ and some action $\alpha \in A(t')$. So, $C$ is the union of all end components possessing a positive reward.
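As a rough illustration of this definition, the following Python sketch computes $C$ with a standard refinement-based decomposition into maximal end components. The dictionary encoding `P[s][a] = {t: prob}`, the reward map `rew[(s, a)]`, and all function names are assumptions made for this example and do not stem from the text.

```python
def strongly_connected_components(nodes, succ):
    """Tarjan's algorithm (recursive); succ maps a node to its successor set."""
    index, low, on_stack, stack, out = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            out.append(comp)

    for v in nodes:
        if v not in index:
            visit(v)
    return out


def maximal_end_components(P):
    """Return a list of pairs (T, A), where A[t] is the set of actions kept in t.

    Assumption: P[s][a] lists only successors with positive probability.
    """
    mecs, work = [], [set(P)]
    while work:
        T = work.pop()
        changed = True
        while changed:
            changed = False
            # Keep only actions that cannot leave T; drop states without such actions.
            acts = {t: {a for a, d in P[t].items() if set(d) <= T} for t in T}
            dead = {t for t in T if not acts[t]}
            if dead:
                T = T - dead
                changed = True
        if not T:
            continue
        succ = {t: {u for a in acts[t] for u in P[t][a]} for t in T}
        comps = strongly_connected_components(T, succ)
        if len(comps) == 1:
            mecs.append((T, acts))   # closed and strongly connected
        else:
            work.extend(comps)       # refine each component separately
    return mecs


def positive_reward_mec_states(P, rew):
    """Union of all maximal end components containing a positive-reward pair."""
    C = set()
    for T, acts in maximal_end_components(P):
        if any(rew[(t, a)] > 0 for t in T for a in acts[t]):
            C |= T
    return C


# Hypothetical example: only state 1 has a positive-reward self-loop.
P_example = {
    0: {"a": {1: 1.0}},
    1: {"loop": {1: 1.0}, "exit": {2: 1.0}},
    2: {"stay": {2: 1.0}},
    3: {"x": {3: 0.5, 2: 0.5}},
}
rew_example = {(0, "a"): 0, (1, "loop"): 2, (1, "exit"): 0, (2, "stay"): 0, (3, "x"): 5}
C_example = positive_reward_mec_states(P_example, rew_example)
```

In this example, state 3 earns reward but is not part of any end component, because its only action leaves the state with positive probability, so it does not belong to $C$.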
First, we want to learn under which circumstances reachability quantiles with lower reward bounds are finite. For the computation of a universal quantile, the following lemma provides the required criterion.
3.4 Lower-reward bounded quantiles
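Lemma 3.4.1. For all states $s$ in $\mathcal{M}$, we have:
\[ \mathrm{qu}_s\bigl(\forall\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty \quad\text{iff}\quad \mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) \geq p \]
Proof. A first observation is as follows:
\begin{align*}
& \min\bigl\{ r \in \mathbb{N} : \mathrm{Pr}^{\max}_s(\Diamond^{\geq r} B) < p \bigr\} = \infty \\
\text{iff}\quad & \text{there is no } r \in \mathbb{N} \text{ such that } \mathrm{Pr}^{\max}_s(\Diamond^{\geq r} B) < p \\
\text{iff}\quad & \text{for all } r \in \mathbb{N} \text{ there exists a scheduler } S_r \text{ with } \mathrm{Pr}^{S_r}_s(\Diamond^{\geq r} B) \geq p
\end{align*}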
"$\Longleftarrow$": Suppose $\mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) \geq p$. Let $S_{\mathrm{opt}}$ be a (finite-memory) scheduler that maximises the probability for $\Diamond(C \wedge \Diamond B)$ for all states. Now, pick some (finite-memory) scheduler $S_C$ such that from each state $t \in C$, with probability 1, all states of the maximal end component $(T, A)$ with $t \in T$ are visited infinitely often and each of its actions is taken infinitely often. Then, the accumulated reward of almost all infinite $S_C$-paths starting in a $C$-state is $\infty$. Furthermore, let $S_{\Diamond B}$ be a (memoryless) scheduler that maximises the probability to reach $B$ for all states. Given $r \in \mathbb{N}$, we now regard the scheduler $S_r$ that operates in three phases:
• Phase 1: As long as $C$ has not been reached, $S_r$ behaves as $S_{\mathrm{opt}}$. As soon as $C$ has been reached, $S_r$ switches from phase 1 to phase 2.
• Phase 2: $S_r$ mimics $S_C$ as long as the total accumulated reward is less than $r$. If the current state belongs to $C$ and the total accumulated reward is greater than or equal to $r$, then $S_r$ switches from phase 2 to phase 3.
• Phase 3: $S_r$ behaves as $S_{\Diamond B}$.
If the total accumulated reward is already $\geq r$ when a $C$-state is entered in the first phase, then $S_r$ can move directly from phase 1 to phase 3.
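The phase switching of $S_r$ can be sketched as a small stateful object. This is only an illustration: the class name, the `choose` interface, and the modelling of the three component schedulers as plain functions from states to actions are assumptions of this example (the text allows $S_{\mathrm{opt}}$ and $S_C$ to use finite memory, which such callables could keep internally).

```python
class ThreePhaseScheduler:
    """Sketch of the scheduler S_r from the proof (hypothetical interface)."""

    def __init__(self, r, C, s_opt, s_c, s_reach_b):
        self.r = r                    # reward threshold
        self.C = C                    # states in positive-reward end components
        self.s_opt = s_opt            # stand-in for S_opt
        self.s_c = s_c                # stand-in for S_C
        self.s_reach_b = s_reach_b    # stand-in for the scheduler maximising reach-B
        self.phase = 1

    def choose(self, state, acc_reward):
        """Pick an action for the current state and accumulated reward."""
        if self.phase == 1 and state in self.C:
            # Entering C: skip phase 2 if the reward bound is already met.
            self.phase = 3 if acc_reward >= self.r else 2
        elif self.phase == 2 and state in self.C and acc_reward >= self.r:
            self.phase = 3
        if self.phase == 1:
            return self.s_opt(state)
        if self.phase == 2:
            return self.s_c(state)
        return self.s_reach_b(state)


# Usage with dummy component schedulers that only return a label.
demo = ThreePhaseScheduler(
    r=2, C={"c"},
    s_opt=lambda s: "opt", s_c=lambda s: "cycle", s_reach_b=lambda s: "to_B",
)
trace = [demo.choose("a", 0), demo.choose("c", 0), demo.choose("c", 1),
         demo.choose("c", 2), demo.choose("a", 2)]
```

Once phase 3 has been entered, the object never switches back, which mirrors the description above.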
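Now, we use the fact that all states belonging to the same maximal end component have the same maximal reachability probabilities [Cie+08]. This yields, for each $\rho \in \{0, 1, \ldots, r-1\}$ and all states $t$ of a maximal end component $(T, A)$ of $\mathcal{M}$ that contains at least one state-action pair with positive reward:
\begin{align*}
\mathrm{Pr}^{S_r}_{t \mid \mathrm{rew}=\rho}(\Diamond^{\geq r} B)
&= \sum_{t' \in T} \mathrm{Pr}^{S_C}_{t \mid \mathrm{rew}=\rho}(C \,\mathrm{U}^{\geq r}\, t') \cdot \mathrm{Pr}^{S_{\Diamond B}}_{t'}(\Diamond B) \\
&= \sum_{t' \in T} \mathrm{Pr}^{S_C}_{t \mid \mathrm{rew}=\rho}(C \,\mathrm{U}^{\geq r}\, t') \cdot \mathrm{Pr}^{\max}_{t}(\Diamond B) \\
&= \mathrm{Pr}^{\max}_{t}(\Diamond B)
\end{align*}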
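Here, the notation $\mathrm{Pr}^{S_r}_{t \mid \mathrm{rew}=\rho}$ indicates the probability under $S_r$ under the condition that the accumulated reward is $\rho$. We obtain:
\begin{align*}
\mathrm{Pr}^{S_r}_s(\Diamond^{\geq r} B)
&= \sum_{0 \leq \rho < r} \sum_{(T,A)} \sum_{t \in T} \mathrm{Pr}^{S_r}_s\bigl(\neg C \,\mathrm{U}\, ((\mathrm{rew} = \rho) \wedge t)\bigr) \cdot \mathrm{Pr}^{S_r}_{t \mid \mathrm{rew}=\rho}(\Diamond^{\geq r} B) \\
&\quad + \sum_{t \in C} \mathrm{Pr}^{S_r}_s\bigl(\neg C \,\mathrm{U}\, ((\mathrm{rew} \geq r) \wedge t)\bigr) \cdot \mathrm{Pr}^{S_{\Diamond B}}_{t}(\Diamond B) \\
&= \sum_{t \in C} \mathrm{Pr}^{S_{\mathrm{opt}}}_s(\neg C \,\mathrm{U}\, t) \cdot \mathrm{Pr}^{\max}_{t}(\Diamond B) \\
&= \mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr)
\end{align*}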
where (T, A) ranges over all maximal end components that contain a state-action pair with positive reward.
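"$\Longrightarrow$": We now suppose that the quantile for the state $s$ and the objective $\forall\mathrm{P}_{<p}(\Diamond^{\geq ?} B)$ is $\infty$. Let $(S_r)_{r \in \mathbb{N}}$ be a family of schedulers such that:
\[ \mathrm{Pr}^{S_r}_s(\Diamond^{\geq r} B) \geq p \]
The task is to show that $\mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) \geq p$. To this end, we show that for each $\varepsilon > 0$ there exists a scheduler $S$ such that $\mathrm{Pr}^{S}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) \geq p - \varepsilon$.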
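In what follows, some positive $\varepsilon$ is fixed. There exists some $r \in \mathbb{N}$ such that for each scheduler $S$:
\[ \mathrm{Pr}^{S}_s\bigl((\neg C) \,\mathrm{U}^{\geq r}\, B\bigr) < \varepsilon \]
This is due to the fact that the limit of almost all $S$-paths constitutes an end component and that the reward earned in end components not contained in $C$ is zero. For the scheduler $S = S_r$ we obtain:
\begin{align*}
p \;\leq\; \mathrm{Pr}^{S}_s(\Diamond^{\geq r} B)
&= \mathrm{Pr}^{S}_s\bigl((\neg C) \,\mathrm{U}^{\geq r}\, B\bigr) + \mathrm{Pr}^{S}_s\bigl(\Diamond(C \wedge \Diamond((\mathrm{rew} \geq r) \wedge B))\bigr) \\
&\leq \varepsilon + \mathrm{Pr}^{S}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr)
\end{align*}
Hence, $\mathrm{Pr}^{S}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr)$ is at least $p - \varepsilon$.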
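Let $\widetilde{\mathcal{M}}$ be the MDP that results from $\mathcal{M}$ by adding two new states $\mathit{goal}$ and $\mathit{fail}$ and a fresh action symbol $\tau$ with transition probabilities
\[ P(t, \tau, \mathit{goal}) = \mathrm{Pr}^{\max}_{\mathcal{M},t}(\Diamond B), \qquad P(t, \tau, \mathit{fail}) = 1 - \mathrm{Pr}^{\max}_{\mathcal{M},t}(\Diamond B) \]
if $t \in C$, and $P(s, \tau, s') = 0$ for all states $s \in S \setminus C$, $s' \in S$. The outgoing transitions of the new states $\mathit{goal}$ and $\mathit{fail}$ are irrelevant for our purposes. We then have:
\[ \mathrm{Pr}^{\max}_{\mathcal{M},s}\bigl(\Diamond(C \wedge \Diamond B)\bigr) = \mathrm{Pr}^{\max}_{\widetilde{\mathcal{M}},s}(\Diamond\, \mathit{goal}) \]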
The generation of $\widetilde{\mathcal{M}}$ mainly requires the computation of the values $\mathrm{Pr}^{\max}_{\mathcal{M},s}(\Diamond B)$ and the computation of the maximal end components of $\mathcal{M}$. The former can be done using graph algorithms and linear-programming techniques in time polynomial in the size of $\mathcal{M}$, while the latter is possible using standard algorithms in time quadratic in the size of the underlying graph of $\mathcal{M}$.
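The construction of $\widetilde{\mathcal{M}}$ can be sketched in a few lines of Python. The sketch assumes that the set $C$ has already been computed and is passed in, and it approximates the maximal reachability probabilities by value iteration from below instead of the linear-programming techniques mentioned above. The encoding `P[s][a] = {t: prob}`, the function names, and the identifiers `"goal"`, `"fail"` and `"tau"` (assumed not to clash with existing states or actions) are choices of this example.

```python
def max_reach(P, target, eps=1e-12):
    """Approximate Pr^max_s(<> target) for all states by value iteration."""
    x = {s: (1.0 if s in target else 0.0) for s in P}
    while True:
        delta = 0.0
        for s in P:
            if s in target:
                continue
            new = max(sum(pr * x[t] for t, pr in d.items()) for d in P[s].values())
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < eps:
            return x


def build_m_tilde(P, C, B, eps=1e-12):
    """Copy the MDP and add goal, fail and a tau-action for every state in C."""
    reach_b = max_reach(P, B, eps)
    Pt = {s: {a: dict(d) for a, d in acts.items()} for s, acts in P.items()}
    for t in C:
        Pt[t]["tau"] = {"goal": reach_b[t], "fail": 1.0 - reach_b[t]}
    # The behaviour of goal and fail is irrelevant; self-loops keep the MDP total.
    Pt["goal"] = {"stay": {"goal": 1.0}}
    Pt["fail"] = {"stay": {"fail": 1.0}}
    return Pt


def prob_reach_c_then_b(P, C, B, s, eps=1e-12):
    """Approximate Pr^max_s(<>(C and <>B)) as Pr^max_s(<> goal) in M-tilde."""
    return max_reach(build_m_tilde(P, C, B, eps), {"goal"}, eps)[s]


# Hypothetical example: from state 0, the positive-reward loop in state 1
# is reached with probability 0.4; B = {2} is reachable from state 1.
P_demo = {
    0: {"a": {1: 0.4, 2: 0.6}},
    1: {"loop": {1: 1.0}, "exit": {2: 1.0}},
    2: {"stay": {2: 1.0}},
}
value_demo = prob_reach_c_then_b(P_demo, C={1}, B={2}, s=0)
```

Since value iteration only approximates the probabilities from below, a comparison of the returned value with a threshold $p$ should be treated with care when the two are very close.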
Now that we have seen the criterion that guarantees a finite value for a universal lower-reward bounded quantile, we investigate the same question for the existential quantile. We start with the following equivalences:
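\begin{align*}
& \min\bigl\{ r \in \mathbb{N} : \mathrm{Pr}^{\min}_s(\Diamond^{\geq r} B) < p \bigr\} = \infty \\
\text{iff}\quad & \text{there is no } r \in \mathbb{N} \text{ such that } \mathrm{Pr}^{\min}_s(\Diamond^{\geq r} B) < p \\
\text{iff}\quad & \mathrm{Pr}^{\min}_s(\Diamond^{\geq r} B) \geq p \text{ for all } r \in \mathbb{N}
\end{align*}
Obviously, this is the case if under each scheduler, with probability at least $p$, the set $B$ is visited infinitely often and the accumulated reward tends to infinity. A direct consequence of these considerations is stated in the following lemma. To formulate it, let $\mathit{posRew} \subseteq S \times \mathit{Act}$ be the set of state-action pairs $(s, \alpha)$ with $\mathrm{rew}(s, \alpha) > 0$.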
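Lemma 3.4.2. For all states $s$ in $\mathcal{M}$, we have:
\[ \mathrm{qu}_s\bigl(\exists\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty \quad\text{iff}\quad \mathrm{Pr}^{\min}_s\bigl(\Box\Diamond B \wedge \Box\Diamond \mathit{posRew}\bigr) \geq p \]
In order to compute the minimal probability for the generalised Büchi condition $\Box\Diamond B \wedge \Box\Diamond \mathit{posRew}$ we can rely on standard techniques. We compute the set $D$ consisting of all states that are contained in some end component $(T, A)$ with $T \cap B = \emptyset$ or with $\mathrm{rew}(t', \alpha) = 0$ for all actions $\alpha \in A(t')$ and states $t' \in T$. We then have:
\[ \mathrm{Pr}^{\min}_s\bigl(\Box\Diamond B \wedge \Box\Diamond \mathit{posRew}\bigr) = 1 - \mathrm{Pr}^{\max}_s(\Diamond D) \]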
The previous statements yield the following corollary, which allows us to check whether a quantile is finite or infinite.
Corollary 3.4.3. The following two problems are in P:
(1) decide whether $\mathrm{qu}_s\bigl(\forall\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty$
(2) decide whether $\mathrm{qu}_s\bigl(\exists\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty$
3.4.2 Computation scheme
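The approach for computing upper-reward bounded quantiles as in Section 3.3.2 can be adapted to the computation of quantiles for reachability formulas with lower reward bounds, i.e., $\Diamond^{\geq ?} B$. We start with the universal quantile:
\[ \mathrm{qu}_s\bigl(\forall\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \min\bigl\{ r \in \mathbb{N} : \mathrm{Pr}^{\max}_s(\Diamond^{\geq r} B) < p \bigr\} \]
Clearly, if $\mathrm{Pr}^{\max}_s(\Diamond B) < p$ then the quantile for state $s$ is 0. Furthermore (see Lemma 3.4.1):
\[ \mathrm{qu}_s\bigl(\forall\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty \quad\text{iff}\quad \mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) \geq p, \]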
where $C$ consists of all states $t$ that are contained in a maximal end component $(T, A)$ with $\mathrm{rew}(t', \alpha) > 0$ for some state $t' \in T$ and some action $\alpha \in A(t')$. Intuitively, after entering $C$ one can stay in $C$ until the accumulated reward is at least $r$ before moving on to $B$. If neither of these two cases applies, we use the same idea as before and compute the values $p_{s,r} = \mathrm{Pr}^{\max}_s(\Diamond^{\geq r} B)$ for increasing $r$ until $p_{s,r} < p$. The values $p_{s,r}$ are obtained as the unique solution of the following LP with variables $x_{s,i}$ for $(s, i) \in S[r]$ and the following constraints for $s \in S$ and $1 \leq i \leq r$:
\begin{align*}
x_{s,0} &= \mathrm{Pr}^{\max}_s(\Diamond B) \\
x_{s,i} &\geq 0 \\
x_{s,i} &\geq \sum_{t \in S} P(s, \alpha, t) \cdot x_{t,\ell} \quad \text{if } \alpha \in \mathit{Act}(s) \text{ and } \ell = \max\{0, i - \mathrm{rew}(s, \alpha)\}
\end{align*}
The objective is to minimise $\sum_{(s,i) \in S[r]} x_{s,i}$. To speed up the computation, one can add the following constraints: $x_{s,i} = 1$ if $\mathrm{Pr}^{\max}_s\bigl(\Diamond(C \wedge \Diamond B)\bigr) = 1$, for $s \in S$.
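The iteration over increasing $r$ can be illustrated with a short Python sketch. Note that it does not solve the LP above: as a simplification it approximates each level $x_{\cdot,i}$ by value iteration from below, reusing the already computed levels $\ell < i$. The sketch further assumes non-negative integer rewards and that the precomputation has already excluded the infinite case; the encoding `P[s][a] = {t: prob}`, `rew[(s, a)]`, the cut-off `max_r` and the function names are assumptions of this example.

```python
def max_reach(P, target, eps=1e-12):
    """Approximate Pr^max_s(<> target) for all states by value iteration."""
    x = {s: (1.0 if s in target else 0.0) for s in P}
    while True:
        delta = 0.0
        for s in P:
            if s in target:
                continue
            new = max(sum(pr * x[t] for t, pr in d.items()) for d in P[s].values())
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < eps:
            return x


def universal_quantile(P, rew, B, s0, p, max_r=1000, eps=1e-12):
    """Least r with Pr^max_{s0}(<>^{>=r} B) < p, or None if max_r is exceeded."""
    levels = [max_reach(P, B, eps)]          # level 0: x_{s,0} = Pr^max_s(<>B)
    r = 0
    while levels[r][s0] >= p:
        r += 1
        if r > max_r:
            return None                      # give up; the quantile may be infinite
        x = {s: 0.0 for s in P}              # start from below (least fixed point)
        while True:
            delta = 0.0
            for s in P:
                best = 0.0
                for a, d in P[s].items():
                    l = max(0, r - rew[(s, a)])
                    src = x if l == r else levels[l]   # zero reward stays on level r
                    best = max(best, sum(pr * src[t] for t, pr in d.items()))
                delta = max(delta, abs(best - x[s]))
                x[s] = best
            if delta < eps:
                break
        levels.append(x)
    return r


# Hypothetical example: each 'work' step earns reward 1 and moves on to
# state 1 with probability 0.5; from state 1 the target B = {2} is reached.
P_chain = {
    0: {"work": {0: 0.5, 1: 0.5}},
    1: {"go": {2: 1.0}},
    2: {"stay": {2: 1.0}},
}
rew_chain = {(0, "work"): 1, (1, "go"): 0, (2, "stay"): 0}
q_chain = universal_quantile(P_chain, rew_chain, {2}, 0, 0.3)
```

In this example the values for state 0 are roughly 1, 1, 0.5, 0.25 for $r = 0, 1, 2, 3$, so the first level below $p = 0.3$ is $r = 3$. Because the values are floating-point approximations, thresholds that coincide with one of the exact values (such as $p = 0.5$ here) may be decided either way.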
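The existential quantile
\[ \mathrm{qu}_s\bigl(\exists\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \min\bigl\{ r \in \mathbb{N} : \mathrm{Pr}^{\min}_s(\Diamond^{\geq r} B) < p \bigr\} \]
can then be computed by an analogous approach, using the fact that the values $p_{s,r} = \mathrm{Pr}^{\min}_s(\Diamond^{\geq r} B)$ are the greatest solutions in $[0, 1]$ of the linear constraints
\begin{align*}
x_{s,0} &= \mathrm{Pr}^{\min}_s(\Diamond B) \\
x_{s,i} &= 0 && \text{if } i \geq 1, \text{ and } \mathrm{Pr}^{\min}_s(\Diamond B) = 0 \text{ or } \mathrm{Pr}^{\min}_s(\Diamond \mathit{posRew}) = 0 \\
x_{s,i} &\leq \sum_{t \in S} P(s, \alpha, t) \cdot x_{t,\ell} && \text{if } i \geq 1,\ \mathrm{Pr}^{\min}_s(\Diamond B) > 0 \text{ and } \mathrm{Pr}^{\min}_s(\Diamond \mathit{posRew}) > 0, \\
& && \alpha \in \mathit{Act}(s) \text{ and } \ell = \max\{0, i - \mathrm{rew}(s, \alpha)\}
\end{align*}
where $\mathit{posRew} \subseteq S \times \mathit{Act}$ is the set of state-action pairs $(s, \alpha)$ with $\mathrm{rew}(s, \alpha) > 0$. Then, $\mathrm{qu}_s\bigl(\exists\mathrm{P}_{<p}(\Diamond^{\geq ?} B)\bigr) = \infty$ iff $\mathrm{Pr}^{\min}_s\bigl(\Box\Diamond B \wedge \Box\Diamond \mathit{posRew}\bigr) \geq p$ (see Lemma 3.4.2). Again, one could add the following constraints: $x_{s,i} = 1$ if $\mathrm{Pr}^{\min}_s\bigl(\Box\Diamond B \wedge \Box\Diamond \mathit{posRew}\bigr) = 1$, for $s \in S$.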
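A possible way to approximate the greatest solution of these constraints is to iterate downwards from 1 on every level. The following Python sketch does this; it is a simplification under stated assumptions and not the linear-constraint solution itself. The two zero conditions are decided by a graph-based fixed point, rewards are assumed to be non-negative integers, the infinite case is assumed to be excluded by the precomputation, and the encoding `P[s][a] = {t: prob}`, `rew[(s, a)]`, the cut-off `max_r` and all function names are choices of this example.

```python
def avoid_set(P, allowed):
    """Greatest set Z in which every state has an allowed action staying in Z."""
    Z = set(P)
    changed = True
    while changed:
        changed = False
        for s in list(Z):
            if not any(allowed(s, a) and set(d) <= Z for a, d in P[s].items()):
                Z.discard(s)
                changed = True
    return Z


def min_reach(P, B, eps=1e-12):
    """Approximate Pr^min_s(<>B) for all states by value iteration from below."""
    x = {s: (1.0 if s in B else 0.0) for s in P}
    while True:
        delta = 0.0
        for s in P:
            if s in B:
                continue
            new = min(sum(pr * x[t] for t, pr in d.items()) for d in P[s].values())
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < eps:
            return x


def existential_quantile(P, rew, B, s0, p, max_r=1000, eps=1e-12):
    """Least r with Pr^min_{s0}(<>^{>=r} B) < p, or None if max_r is exceeded."""
    # States with Pr^min(<>B) = 0 or Pr^min(<>posRew) = 0.
    zero = avoid_set(P, lambda s, a: s not in B) | \
        avoid_set(P, lambda s, a: rew[(s, a)] == 0)
    levels = [min_reach(P, B, eps)]
    r = 0
    while levels[r][s0] >= p:
        r += 1
        if r > max_r:
            return None
        x = {s: (0.0 if s in zero else 1.0) for s in P}   # start from above
        while True:
            delta = 0.0
            for s in P:
                if s in zero:
                    continue
                vals = []
                for a, d in P[s].items():
                    l = max(0, r - rew[(s, a)])
                    src = x if l == r else levels[l]
                    vals.append(sum(pr * src[t] for t, pr in d.items()))
                new = min(vals)
                delta = max(delta, abs(new - x[s]))
                x[s] = new
            if delta < eps:
                break
        levels.append(x)
    return r


# Hypothetical example: a zero-reward 'quit' action lets a scheduler reach
# B = {2} without ever earning reward.
P_quit = {
    0: {"work": {0: 0.5, 1: 0.5}, "quit": {2: 1.0}},
    1: {"go": {2: 1.0}},
    2: {"stay": {2: 1.0}},
}
rew_quit = {(0, "work"): 1, (0, "quit"): 0, (1, "go"): 0, (2, "stay"): 0}
q_quit = existential_quantile(P_quit, rew_quit, {2}, 0, 0.3)
```

In this example the minimising scheduler quits immediately, so $\mathrm{Pr}^{\min}_0(\Diamond^{\geq 1} B) = 0$ and the quantile for $p = 0.3$ is 1.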