3.3 Upper-reward bounded quantiles
3.3.2 Computation scheme
[UB13] presents a linear-programming approach for computing quantiles over (constrained) reachability properties with upper reward bounds (briefly, $U^{\le ?}$-quantiles) in MDPs annotated with state rewards. The paper provides related approaches for both existential and universal quantiles. We recall the approach for existential $U^{\le ?}$-quantiles. Since [UB13] considers only state rewards, we simultaneously extend the approach to state-action rewards.
The main ingredient for the computation of quantiles is the calculation of (constrained) reachability probabilities for all states of the model when the accumulated reward is bounded from above. We therefore solve a linear program that incorporates the reachability probabilities of all states for a whole sequence of reward bounds. Figure 3.3 depicts the linear program of [UB13] for the computation of existential upper-reward bounded quantiles, adapted to state-action rewards (rather than state rewards).¹ Lemma 10 and Lemma 13 in [UB13] show that it suffices to analyse the reward-bounded reachability up to a specific (exponential) upper bound $r_{\max}$: if the requested quantile is finite, its value cannot exceed this bound. Consequently, the LP-based computation scheme runs in exponential time, and upper-reward bounded $U^{\le ?}$-quantiles in MDPs can be computed in exponential time (as also stated in Theorem 11 and Theorem 14 of [UB13]).
¹ The intuitive meaning of $x_{s,i}$ is the probability of reaching B through A-states from the state s
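\[
\begin{array}{ll}
\text{minimise } \displaystyle\sum_{(s,i) \in S[r]} x_{s,i} \text{ subject to} & \\[2ex]
x_{s,i} = 0 & \text{if } s \not\models \exists (A \, U \, B) \text{ and } 0 \le i \le r \\[1ex]
x_{s,i} = 1 & \text{if } s \in B \text{ and } 0 \le i \le r \\[1ex]
x_{s,i} \ge \displaystyle\sum_{t \in S} P(s,\alpha,t) \cdot x_{t,\, i - \mathit{rew}(s,\alpha)} & \text{if } s \notin B,\ s \models \exists (A \, U \, B) \text{ and } \alpha \in \mathit{Act}(s) \\
 & \text{such that } \mathit{rew}(s,\alpha) \le i \le r
\end{array}
\]
Figure 3.3: Linear program $\mathrm{LP}_r$ with the unique solution $p_{s,i} = \Pr^{\max}_s(A \, U^{\le i} \, B)$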
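To make the constraints of Figure 3.3 concrete, the following minimal Python sketch instantiates the linear program for a tiny MDP and checks whether a candidate assignment is feasible. The MDP encoding, the example model, the names `build_lp` and `is_feasible`, and the explicit range check for probabilities are assumptions of this sketch and are not taken from [UB13]. The set of states satisfying ∃(A U B) is assumed to be precomputed.

```python
from itertools import product

# Hypothetical encoding: mdp[s][alpha] = (rew(s, alpha), {t: P(s, alpha, t)})
EXAMPLE = {
    "s0": {"a": (1, {"goal": 0.5, "s0": 0.5}),
           "b": (3, {"goal": 0.9, "fail": 0.1})},
    "goal": {"stay": (0, {"goal": 1.0})},
    "fail": {"stay": (0, {"fail": 1.0})},
}

def build_lp(mdp, B, sat, r):
    """Enumerate the constraints of LP_r.

    B   -- target states
    sat -- states satisfying E(A U B), assumed precomputed (B is a subset)
    Returns (variables, equalities, inequalities), where an inequality
    (v, rhs) stands for x[v] >= sum(coef * x[w] for coef, w in rhs).
    """
    variables = list(product(mdp, range(r + 1)))    # the index set S[r]
    equalities, inequalities = {}, []
    for s, i in variables:
        if s not in sat:
            equalities[(s, i)] = 0.0
        elif s in B:
            equalities[(s, i)] = 1.0
        else:
            for rew, succ in mdp[s].values():
                if rew <= i:                        # action affordable at level i
                    rhs = [(pr, (t, i - rew)) for t, pr in succ.items()]
                    inequalities.append(((s, i), rhs))
    return variables, equalities, inequalities

def is_feasible(x, equalities, inequalities, eps=1e-9):
    """Check a candidate assignment x[(s, i)] against all constraints."""
    if any(not (-eps <= val <= 1 + eps) for val in x.values()):
        return False                                # assumed range of a probability
    if any(abs(x[v] - val) > eps for v, val in equalities.items()):
        return False
    return all(x[v] + eps >= sum(c * x[w] for c, w in rhs)
               for v, rhs in inequalities)

variables, eqs, ineqs = build_lp(EXAMPLE, B={"goal"}, sat={"s0", "goal"}, r=2)
candidate = {("s0", 0): 0.0, ("s0", 1): 0.5, ("s0", 2): 0.75,
             ("goal", 0): 1.0, ("goal", 1): 1.0, ("goal", 2): 1.0,
             ("fail", 0): 0.0, ("fail", 1): 0.0, ("fail", 2): 0.0}
print(len(variables), len(ineqs), is_feasible(candidate, eqs, ineqs))
```

For r = 2 the example has 9 variables and 2 inequality constraints. The number of variables is |S| · (r + 1), which illustrates why instantiating the program for an exponential bound quickly becomes impractical.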
A naïve approach could thus first compute $r_{\max}$, generate the linear program with variables $x_{s,i}$ for $(s,i) \in S[r_{\max}]$, and then solve it with general-purpose linear- or dynamic-programming techniques (e.g., the Simplex algorithm [Nas00], ellipsoid methods [GLS93], value iteration [Bel57] or policy iteration [How90]). However, since the upper bound $r_{\max}$ is exponential in the size of M and depends on the transition probabilities and rewards in M as well as on the specified probability bound p, this approach is intractable when M or the reward values are large. In practice, the demanded quantile values are normally much smaller than the theoretically established bound $r_{\max}$ (see the analysis results presented in Chapter 6). Even when the calculation does not have to consider all theoretically possible iterations, solving the linear program in one single self-contained step with a general-purpose method such as an LP solver is not feasible (see Table 6.3 in Section 6.1.1 (Self-Stabilising Protocol) or Table 6.8 in Section 6.1.2 (Asynchronous Leader-Election Protocol) for details). Instead, it is recommended to compute the reward-bounded reachability probabilities using the iterative back-propagation procedure described in Section 5.1.1.
In any case, since a precomputation step has already been passed, we can assume that the user-specified probability threshold p can be met. We therefore compute the maximal probabilities $p_{s,r} = \Pr^{\max}_s(A \, U^{\le r} \, B)$ for increasing reward bound r (starting with 0) until $p_{s,r} \trianglerighteq p$ holds for the first time. This r is then returned as the result of the computation; it is the demanded quantile value.
The computation of universal $U^{\le ?}$-quantiles can be done using an analogous approach where some details need to be adapted slightly. As already stated in Section 3.2, quantiles that refer to reward-bounded release formulas are dual and can be computed using the same techniques.
A simple consequence of the representation of $\Pr^{\max}_s(A \, U^{\le r} \, B)$ as the unique solution of the linear program in Figure 3.3 is the existence of a finite-memory scheduler S with the modes 0, 1, ..., r that maximises the probability for $A \, U^{\le r} \, B$. In particular, we obtain:
For existential until-quantiles there exists an optimal scheduler S such that for all finite paths $\rho_1$ and $\rho_2$:
if $\mathit{last}(\rho_1) = \mathit{last}(\rho_2)$ and $\mathit{rew}(\rho_1) = \mathit{rew}(\rho_2)$ then $S(\rho_1) = S(\rho_2)$ (3.1)
For existential quantiles $r = \mathrm{qu}_s(\exists \ldots)$ there are optimal finite-memory schedulers whose size is bounded by r + 1.
Similarly, universal until-quantiles have adversarial schedulers S satisfying statement (3.1). Thus, an analogous statement holds for universal until-quantiles and adversarial schedulers.
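Statement (3.1) says that such a scheduler only needs to remember the last state and the reward accumulated so far. A minimal Python sketch of a scheduler with this shape is given below. The path encoding, the class `RewardCountingScheduler`, the helper functions and the decision table are hypothetical and serve only as an illustration; the table is not claimed to be optimal for any particular model.

```python
# Hypothetical path encoding: (first_state, [(action, reward, next_state), ...])

def last(path):
    """Last state of a finite path."""
    start, steps = path
    return steps[-1][2] if steps else start

def rew(path):
    """Reward accumulated along a finite path."""
    _, steps = path
    return sum(reward for _, reward, _ in steps)

class RewardCountingScheduler:
    """Finite-memory scheduler whose mode is the accumulated reward, capped at r."""

    def __init__(self, r, table):
        self.r = r
        self.table = table              # table[(state, mode)] -> action

    def mode(self, path):
        # Modes 0, 1, ..., r. Paths that have already spent more than r share
        # mode r (an assumption of this sketch: the bound is violated anyway).
        return min(rew(path), self.r)

    def __call__(self, path):
        return self.table[(last(path), self.mode(path))]

# Illustrative decision table for a single state "s0" and r = 2.
scheduler = RewardCountingScheduler(
    r=2, table={("s0", 0): "a", ("s0", 1): "b", ("s0", 2): "a"})

rho1 = ("s0", [("a", 1, "s0")])         # s0 --a/1--> s0
rho2 = ("s1", [("c", 1, "s0")])         # different history, same last state and reward
print(scheduler(rho1), scheduler(rho2))
```

Both paths end in `s0` with accumulated reward 1, so the scheduler is in the same mode and returns the same action for them, as required by (3.1). The table has one entry per state and mode, that is, r + 1 modes in total.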
[HK15] presents an ExpTime-completeness result for cost problems over general cost processes. The authors show ExpTime-hardness by reducing the problem of determining the winner of a countdown game [JLS07] to a cost problem: they construct a cost process such that a player wins the countdown game if and only if there exists a scheduler solving the considered cost problem. Since cost problems are directly related to the computation of quantiles over MDPs, there is no hope of substantially improving the computation of quantiles by switching to another class of algorithms. Nevertheless, Section 5.1 presents several methods for computing quantiles and reward-bounded reachability probabilities that improve the practical performance significantly. In some cases, the computations are practically infeasible without these optimisations.