Markov Decision Processes 4
4.6 Statistical Model Checking
As mentioned at the beginning of this chapter, using statistical model checking to analyse probabilistic reachability properties on MDP models is problematic:
If we are to perform simulation as we did for DTMC in Algorithm 6, we not only have to conduct a probabilistic experiment in each state, but first also have to resolve the nondeterminism between the outgoing transitions. That is, we have to mimic a scheduler's decisions. These scheduling choices determine which probability from the interval between minimum and maximum we actually observe in SMC. Since the only values of interest in verification are the actual maximum and minimum probabilities, we would need to use the corresponding extremal schedulers to obtain useful results. These schedulers, however, are not known upfront.
In the remainder of this chapter, we present several approaches to tackle this problem. First, in this section, we adapt the DTMC simulation algorithm to the MDP setting and show that naïve scheduling choices cannot lead to trustworthy, useful results. In Section 4.7, we then give a detailed presentation of two approaches that exploit the fact that a particular nondeterministic choice may not make a difference for the property at hand. In that case, the nondeterminism is spurious, and any resolution leads to the same probability. Provided all nondeterministic choices in an MDP are spurious for some property, its maximum and minimum reachability probabilities coincide and simulation results can be relied upon. The first approach (Section 4.7.1) uses techniques adapted from partial order reduction to detect, on-the-fly during simulation, whether the nondeterministic choice just encountered in a state is spurious, in which case simulation can continue. However, the partial order-based check is conservative: It may fail to identify some spurious choices as such. It works on networks of VMDP and, in particular, can only prove interleavings, i.e. nondeterminism resulting from the interleaving semantics of parallel composition, to be spurious. This limitation can be overcome by using the notion of confluence instead of partial order reduction (Section 4.7.2). The confluence-based check applies directly to the concrete state space of a model, i.e. to MDP. It is not limited to spurious interleavings, but it is still conservative. In contrast to the notion of partial order reduction we used previously, confluence can only deal with choices between nonprobabilistic transitions. Thus, the two approaches turn out to be incomparable w.r.t. the classes of MDP they are applicable to.
The goal of the two techniques outlined above, both based on state space reduction methods, is to prove on-the-fly during simulation that all nondeterministic choices encountered are spurious. This guarantees that the chosen resolution yields both the maximum and the minimum probability. Consequently, the minimum and maximum probabilities of reaching the target states are the same, i.e. the interval of probabilities is a singleton. This is, of course, not the case in general for arbitrary MDP models. Three approaches to perform SMC for general nondeterministic MDP have been developed by others, and we briefly present them for completeness and comparison in Section 4.8.
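To make "spurious" concrete, the following Python sketch computes the minimum and maximum probability of reaching a goal state in a hypothetical toy MDP. It uses standard value iteration, a numerical model-checking technique rather than SMC, only to obtain reference values. In the initial state, the scheduler chooses which of two independent components moves first. Action a is a Dirac step and action b is a fair coin flip. Both orders reach the goal with probability 1/2, so the choice is spurious and the interval of probabilities collapses to the singleton {1/2}. The dictionary encoding of the MDP, the state names and the fixed iteration count are assumptions of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Dist = Dict[str, float]                    # successor -> probability
MDP = Dict[str, List[Tuple[str, Dist]]]    # state -> enabled transitions <a, mu>


def reach(mdp: MDP, targets: Set[str], opt: Callable[[Iterable[float]], float],
          iterations: int = 100) -> Dict[str, float]:
    """Approximate min (opt=min) or max (opt=max) probability of reaching
    `targets` from every state by value iteration (illustrative sketch)."""
    v = {s: 1.0 if s in targets else 0.0 for s in mdp}
    for _ in range(iterations):
        new_v = {}
        for s, transitions in mdp.items():
            if s in targets or not transitions:
                new_v[s] = v[s]            # targets keep 1, deadlocks keep 0
            else:
                # optimise over the nondeterministic choices, average over successors
                new_v[s] = opt(sum(p * v[t] for t, p in mu.items()) for _a, mu in transitions)
        v = new_v
    return v


# Hypothetical interleaving of a Dirac step "a" and a fair coin "b";
# the goal is "a has happened and the coin shows heads".
mdp: MDP = {
    "s0":   [("a", {"sa": 1.0}), ("b", {"sbH": 0.5, "sbT": 0.5})],
    "sa":   [("b", {"goal": 0.5, "fail": 0.5})],
    "sbH":  [("a", {"goal": 1.0})],
    "sbT":  [("a", {"fail": 1.0})],
    "goal": [],
    "fail": [],
}
print(reach(mdp, {"goal"}, min)["s0"], reach(mdp, {"goal"}, max)["s0"])  # 0.5 0.5
```

Replacing one successor of `sbT` by `goal` would make the order matter, and the minimum and maximum would then differ.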
Henriques et al. [HMZ+12] first proposed the use of reinforcement learning, a technique from artificial intelligence, to learn the resolutions of nondeterminism (by memoryless schedulers) that maximise probabilities for a given bounded LTL property. While this allows SMC for models with arbitrary nondeterministic choices (not only spurious ones), scheduling decisions need to be stored for every explored state. Memory usage can thus be as high as in traditional model checking, but is highly dependent on the structure of the model and the learning process. However, several problems in their algorithm w.r.t. convergence and correctness have recently been described [LST14]. Similar learning-based methods have been picked up again by Brázdil et al. [BCC+14]. They propose two techniques that require different amounts of information about the model, but provide clear error bounds. Memory usage can again be as high as in model checking, but depends on the model structure. We summarise these two learning-based approaches in Section 4.8.1. Our approaches based on confluence and POR have the same theoretical memory usage bound as the learning-based ones, but use comparatively little memory in practice. They do not introduce any additional overapproximation and thus do not affect the usual error bounds of SMC.
Legay and Sedwards recently developed a second technique [LST14]. It is based on randomly generating a (large) number of schedulers, and a standard SMC analysis is performed for each of them. To achieve the necessary memory efficiency, they propose an innovative O(1) encoding of a subset of all (memoryless or history-dependent) schedulers. However, their method cannot guarantee that the optimal schedulers are contained in the encodable subset, and it cannot provide an error bound. We give a more detailed overview of this technique in Section 4.8.2, before concluding this chapter with an overall summary in Section 4.9.

function simulate(M = ⟨S, A, T, s_init, AP, L⟩, R, φ, d)
  s := s_init, seen := ∅
  for i = 1 to d do
    if φ(L(s)) then return true
    else if s ∈ seen then return false
    μ := R(s)
    ⟨a, ν⟩ := choose a transition randomly according to μ
    if μ and ν are Dirac then seen := seen ∪ {s} else seen := ∅
    s := choose a successor state randomly according to ν
  end
  return unknown
Algorithm 15: Path generation for an MDP and a resolver, with cycle detection
4.6.1 Resolving Nondeterminism
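In order to simulate an MDP, i.e. to generate paths through it, the nondeterministic choices need to be resolved. Adapting the DTMC simulation algorithm to MDP thus results in Algorithm 15. It takes as an additional parameter a resolver R, i.e. a function in S → Dist(A × Dist(S)) such that, for all states s of the MDP at hand, we have ⟨a, μ⟩ ∈ support(R(s)) ⇒ ⟨a, μ⟩ ∈ T(s). A resolver is thus a memoryless but probabilistic scheduler. Algorithm 15 also performs cycle detection. A state is added to seen when it is left via a Dirac choice of both the transition and the successor, and seen is cleared whenever a random choice is made. Revisiting a state in seen therefore means that the path is trapped in a deterministic cycle that avoids the target states, so the path can be cut short with result false. If we burden the user with the task of specifying a resolver, SMC for MDP is easy: we can apply Algorithms 7 and 8 by merely replacing the simulate function they use with the new one of Algorithm 15.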
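Algorithm 15 can be transcribed into Python roughly as follows. This sketch rests on assumptions that are not part of the original:

- States are hashable values.
- Successor distributions are dictionaries. The resolver's distribution over T(s) is a list of (probability, transition) pairs.
- Both kinds of distribution list only their support, so "Dirac" means "exactly one entry".
- Every visited state has at least one transition.
- The result unknown is represented by `None`.

The model, the helper `deterministic` and all action and state names are hypothetical.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, List, Optional, Tuple

State = Hashable
Dist = Dict[State, float]                        # successor -> probability (support only)
Transition = Tuple[str, Dist]                    # <a, nu>: action label and successor distribution
Resolver = Callable[[State], List[Tuple[float, Transition]]]  # R(s) over T(s)


def sample(pairs, rng: random.Random):
    """Draw one item from a finite list of (probability, item) pairs."""
    r, acc = rng.random(), 0.0
    for p, item in pairs:
        acc += p
        if r < acc:
            return item
    return pairs[-1][1]                          # guard against floating-point round-off


def simulate(s_init: State, label: Callable[[State], FrozenSet[str]], resolver: Resolver,
             phi: Callable[[FrozenSet[str]], bool], d: int,
             rng: random.Random) -> Optional[bool]:
    """One path as in Algorithm 15; None plays the role of 'unknown'."""
    s, seen = s_init, set()
    for _ in range(d):
        if phi(label(s)):
            return True                          # target reached
        if s in seen:
            return False                         # returned without any random choice: cycle
        mu = resolver(s)                         # resolver's distribution over T(s)
        _action, nu = sample(mu, rng)
        if len(mu) == 1 and len(nu) == 1:        # both mu and nu are Dirac
            seen.add(s)
        else:
            seen = set()                         # a random choice was made: restart detection
        s = sample([(p, t) for t, p in nu.items()], rng)
    return None                                  # depth bound d exhausted


# Hypothetical model: in s0 we can "wait" (self-loop) or "go" to the target state.
T = {"s0": [("wait", {"s0": 1.0}), ("go", {"goal": 1.0})],
     "goal": [("tau", {"goal": 1.0})]}


def label(s: State) -> FrozenSet[str]:
    return frozenset({"goal"}) if s == "goal" else frozenset()


def phi(aps: FrozenSet[str]) -> bool:
    return "goal" in aps


def deterministic(choice: Dict[State, str]) -> Resolver:
    """Dirac resolver that always picks the transition labelled choice[s]."""
    return lambda s: [(1.0, next(t for t in T[s] if t[0] == choice[s]))]


rng = random.Random(0)
print(simulate("s0", label, deterministic({"s0": "wait", "goal": "tau"}), phi, 10, rng))  # False
print(simulate("s0", label, deterministic({"s0": "go", "goal": "tau"}), phi, 10, rng))    # True
print(simulate("s0", label, deterministic({"s0": "go", "goal": "tau"}), phi, 1, rng))     # None
```

With the "wait" resolver, the Dirac self-loop on s0 is recognised as a cycle in the second iteration. With d = 1, the depth bound is reached before the target state is checked, so the result is unknown.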
Many simulation tools, including e.g. the simulation engine that is part of PRISM, in fact implicitly use a specific built-in resolver, so users need not bother specifying one. On the other hand, this also means that users cannot specify a resolver even if they want to. The implicit resolver that is typically used simply makes a uniformly distributed choice between the available transitions:
$$R_{\mathrm{Uni}} \stackrel{\mathrm{def}}{=} \{\, s \mapsto \mathcal{U}(T(s)) \mid s \in S \,\}$$
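As a minimal sketch, R_Uni can be written as a Python function. It assumes a hypothetical encoding in which T maps each state to a list of (action, successor distribution) pairs and R(s) is returned as a list of (probability, transition) pairs. Each enabled transition receives probability 1/|T(s)|.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Transition = Tuple[str, Dict[State, float]]   # <a, mu>: action label and successor distribution


def uniform_resolver(T: Dict[State, List[Transition]]):
    """R_Uni: map every state s to the uniform distribution over its transitions T(s)."""
    def R(s: State) -> List[Tuple[float, Transition]]:
        enabled = T[s]
        return [(1.0 / len(enabled), t) for t in enabled]
    return R


# Hypothetical state with three enabled transitions.
T = {"s0": [("a", {"s1": 1.0}), ("b", {"s2": 1.0}), ("c", {"s1": 0.5, "s2": 0.5})]}
R_uni = uniform_resolver(T)
print([(round(p, 3), t[0]) for p, t in R_uni("s0")])  # [(0.333, 'a'), (0.333, 'b'), (0.333, 'c')]
```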
However, one can think of other generic resolvers. For example, the user can specify a total order on the actions (i.e. priorities), with the corresponding resolver making a uniform choice only among the available transitions with the highest-priority label. A special case of this arises when we consider MDP that model the passage of a unit of physical time with a dedicated tick action: If we assign the lowest priority to tick, we schedule the other transitions as soon as possible; if we assign the highest priority to tick, we schedule the other transitions as late as possible. We will revisit these ASAP and ALAP schedulers when we investigate real-time models in Chapter 5.
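A priority-based resolver could be sketched in Python as follows. It assumes that the user's total order on actions is given as a hypothetical dictionary of integer ranks, where a higher rank means a higher priority. The example state, with a hypothetical send action competing against tick, shows how the ASAP and ALAP variants differ only in the rank assigned to tick.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Transition = Tuple[str, Dict[State, float]]   # <a, mu>: action label and successor distribution


def priority_resolver(T: Dict[State, List[Transition]], prio: Dict[str, int]):
    """Uniform choice among the enabled transitions whose action has the highest priority."""
    def R(s: State) -> List[Tuple[float, Transition]]:
        best = max(prio[a] for a, _ in T[s])
        top = [t for t in T[s] if prio[t[0]] == best]
        return [(1.0 / len(top), t) for t in top]
    return R


# Hypothetical state in which a message can be sent or one time unit can pass.
T = {"s": [("send", {"sent": 1.0}), ("tick", {"s_later": 1.0})]}
asap = priority_resolver(T, {"send": 1, "tick": 0})   # tick lowest: act as soon as possible
alap = priority_resolver(T, {"send": 0, "tick": 1})   # tick highest: act as late as possible
print([t[0] for _, t in asap("s")])  # ['send']
print([t[0] for _, t in alap("s")])  # ['tick']
```

If all actions receive the same rank, the resolver degenerates to the uniform resolver R_Uni.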
Unfortunately, performing SMC with some implicit scheduler as described above is not sound: While a probabilistic reachability property asks for the minimum or maximum probability of reaching a set of target states, using an implicit scheduler merely results in some probability in the interval between minimum and maximum.
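The following Python sketch illustrates the problem on a hypothetical MDP. In its initial state, the scheduler chooses between "left", which surely reaches the goal, and "right", which surely reaches a failure sink. The maximum reachability probability is therefore 1 and the minimum is 0. Simulating under each of the two deterministic schedulers recovers these values. Under the uniform resolver, however, the estimate converges to 1/2, which lies strictly between them, so no number of samples makes it an estimate of either property. The model, function names and seed are assumptions of this sketch.

```python
import random

# Hypothetical MDP: "left" surely reaches goal, "right" surely reaches the sink "fail".
T = {
    "s0":   [("left", {"goal": 1.0}), ("right", {"fail": 1.0})],
    "goal": [("tau", {"goal": 1.0})],
    "fail": [("tau", {"fail": 1.0})],
}


def run(choose, rng: random.Random) -> bool:
    """One path from s0; choose(s, transitions, rng) resolves the nondeterminism."""
    s = "s0"
    while s not in ("goal", "fail"):
        _action, nu = choose(s, T[s], rng)
        succ, probs = zip(*nu.items())
        s = rng.choices(succ, weights=probs)[0]
    return s == "goal"


def estimate(choose, n: int = 10_000, seed: int = 42) -> float:
    """Monte Carlo estimate of the probability of reaching goal under `choose`."""
    rng = random.Random(seed)
    return sum(run(choose, rng) for _ in range(n)) / n


uniform = lambda s, ts, rng: rng.choice(ts)   # the implicit resolver R_Uni
always_left = lambda s, ts, rng: ts[0]        # a scheduler attaining the maximum
always_right = lambda s, ts, rng: ts[1]       # a scheduler attaining the minimum

print(estimate(always_left), estimate(always_right))  # 1.0 0.0
print(estimate(uniform))  # close to 0.5: neither the minimum nor the maximum
```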
Definition 53. An SMC procedure for MDP is sound if, given any MDP M